Lecture No. 31
Date :-  7 Nov. 2003
Scribed by Bharat Kumar Jain
                                                             WEB CACHING
-------------------------------------------------------------------------------------------------------------------------------------------------------
Web Caching:
           Web caching is the caching of documents at proxies so that frequently used documents are available from the
 cache (main memory), rather than being fetched from the server each time they are requested.
                   The advantages of Web caching are
                                 1. Reduces network traffic.
                                 2. Decreases response time.
                                 3. Reduces load on the server.
                                 4. Reduces cost, if the client has to pay for the channel (link) capacity used.

ICP Web caching Protocol:
           ICP stands for Internet Cache Protocol. In ICP, whenever a client requests a document, the proxy first
checks its own cache for the document; if it is found, it is given to the client (this is called a local cache hit).
Whenever a local cache miss occurs, the proxy multicasts the request to all other proxies in the network.  Proxies
having that document will send it to the requesting proxy.  This is called a cache hit.  If none of the proxies in the
network has the document, then the request is sent to the server.  This is called a cache miss.
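
The lookup order described above can be sketched in a few lines. This is only an illustration of the three outcomes,
not the real ICP message format: the Proxy class, lookup() and fetch_from_server() are hypothetical names, the multicast
is simulated by a loop over peers, and keeping a local copy after a fetch is an assumption.

```python
from typing import Dict, List, Tuple


class Proxy:
    """Hypothetical proxy holding a local cache of url -> document."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, str] = {}
        self.peers: List["Proxy"] = []


def fetch_from_server(url: str) -> str:
    # Stand-in for a real fetch from the origin server.
    return f"<document for {url}>"


def lookup(proxy: Proxy, url: str) -> Tuple[str, str, int]:
    """Return (document, outcome, number of peers queried)."""
    # 1. Local cache hit: serve straight from this proxy's own cache.
    if url in proxy.cache:
        return proxy.cache[url], "local hit", 0

    # 2. Local cache miss: the request goes to every other proxy
    #    (the multicast step), so all peers are counted as queried.
    queried = len(proxy.peers)
    for peer in proxy.peers:
        if url in peer.cache:
            doc = peer.cache[url]
            proxy.cache[url] = doc  # assumption: keep a local copy
            return doc, "cache hit", queried

    # 3. No proxy has the document: go to the server (cache miss).
    doc = fetch_from_server(url)
    proxy.cache[url] = doc  # assumption: keep a local copy
    return doc, "cache miss", queried
```

Note that the number of peers queried is the same for a cache hit and a cache miss; this is the message overhead
discussed under the disadvantages of ICP.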


The advantages of the ICP protocol are
     1. It reduces the cache miss ratio.
     2. It decreases response time if more local cache hits occur.
     3. It reduces the cost, if we need to pay for the channel between the proxies and the server.

The disadvantages of the ICP protocol are
     1. A huge number of messages is transmitted between proxies whenever a local cache miss occurs. Hence
          the protocol does not scale as the number of proxies grows.
     2. The response time is very high whenever the document is not present in any of the proxies.

-------------------------------------------------------------------------------------------------------------------------------------------------          
To overcome the above disadvantages, the following suggestions were given in the class.

        Query just the neighbouring proxies rather than all proxies.  If a neighbouring proxy does not have the document,
it queries its own neighbours, and so on.  Although this method reduces network overhead, it increases response time.

        The other suggestion, given by a student, was that when a queried proxy does not have the requested document,
instead of generating a NACK, it should fetch the document from the server and pass it to the requesting proxy.
One disadvantage of this approach is that if the requesting proxy wants the document again, it has to request it from
the other proxies again.  The second disadvantage is that if there are n proxies and none has the requested document,
then n-1 requests will be sent to the server, thus increasing the network traffic by a factor of n-1.
----------------------------------------------------------------------------------------------------------------------------------------------------

To overcome the above disadvantages, a new protocol called "Summary Cache" has been proposed.

    The main idea is, instead of querying all the proxies, to query only those proxies which have a greater chance of
having the document.  The question then arises: how can a proxy know which other proxies may have a required
document?  This is done by maintaining cache directory information (a summary) of the other proxies.

Before getting into the protocol, let us define two terms frequently used in the protocol.
False Hit    : A request sent to another proxy that does not result in a cache hit is called a false hit.  This may
                   happen because the cache directory may not contain accurate information.
False Miss : A request that would have resulted in a cache hit at another proxy, but was never sent because
                   there was no entry in the cache directory, is called a false miss.
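
The two definitions can be read as a small decision table over what the summary claims and what the other proxy really
holds.  The sketch below is only an illustration; the function name classify() and the labels "remote hit" and
"true miss" for the two remaining cases are assumptions, not terms from the lecture.

```python
def classify(summary_says_present: bool, actually_cached: bool) -> str:
    """Classify one remote lookup decision made from a (possibly stale) summary."""
    if summary_says_present and actually_cached:
        return "remote hit"   # query sent, document found
    if summary_says_present and not actually_cached:
        return "false hit"    # query sent, but the summary was wrong
    if not summary_says_present and actually_cached:
        return "false miss"   # query never sent, although it would have hit
    return "true miss"        # query correctly not sent
```

A false hit costs one wasted inter-proxy message, while a false miss costs an unnecessary trip to the server.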

The two issues to be resolved for the above approach are:
            
                    1. When to update the Summary information.
                    2. Representation of Summary.

The answer to the first question is, instead of updating after each change in the cache, to update only when the
changes amount to more than X% of the cache.  Trace-driven simulation has found that X can be between 1 and 10.
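
A minimal sketch of such a delayed update is given below.  It assumes that "X% of changes" means the number of
insertions and deletions since the last update, measured against the current number of cached documents; the class
SummaryPublisher and its fields are hypothetical names used only for illustration.

```python
from typing import Iterable, Set


class SummaryPublisher:
    """Publishes a new summary only after more than X% of the cache has changed."""

    def __init__(self, threshold_percent: float, initial_urls: Iterable[str] = ()) -> None:
        self.threshold_percent = threshold_percent
        self.cached: Set[str] = set(initial_urls)   # what the proxy really holds
        self.published: Set[str] = set(self.cached)  # what the other proxies believe
        self.changes = 0                             # changes since the last update

    def _record_change(self) -> bool:
        """Count one change; return True if it triggered a summary update."""
        self.changes += 1
        size = max(len(self.cached), 1)
        if 100.0 * self.changes / size > self.threshold_percent:
            self.published = set(self.cached)
            self.changes = 0
            return True
        return False

    def add(self, url: str) -> bool:
        if url in self.cached:
            return False
        self.cached.add(url)
        return self._record_change()

    def remove(self, url: str) -> bool:
        if url not in self.cached:
            return False
        self.cached.remove(url)
        return self._record_change()
```

Between two updates the published summary lags behind the real cache, which is exactly where false hits and false
misses come from.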

The two simple approaches for representation of the summary are:
           Exact Directory: The disadvantage of this approach is that too much main memory is required for storing
                                     the directories.
           Server Name    : Storing only the server names may result in many false hits.

The answer to the second question is to use a Bloom filter.  A Bloom filter is a computationally efficient, hash-based,
probabilistic scheme that can represent the set of URLs of cached documents with a small memory requirement while
answering queries with zero false negatives and a small rate of false positives.

 Bloom Filter:

        A Bloom filter is essentially a data structure for efficient membership queries.  It is a method for representing
 a set A={a1,a2,a3,a4.....an} of n elements to support membership queries.  The idea is to allocate a vector v of m bits,
initially all set to 0, then choose k independent hash functions h1...hk, each with range {1,...,m}.  For each element
'a' of A, the bits at positions h1(a),...,hk(a) in v are set to 1.  Note that a particular bit might be set to 1 multiple
times.

Now, given a query for b, we check the bits at positions h1(b), h2(b),...,hk(b).  If any one of them is zero, then
certainly b is not in the set A.  If all the bits are set, we can guess with a certain probability that b is present.
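
The insert and query steps can be sketched as follows.  This is a minimal illustration, not the implementation used by
Summary Cache: the class and method names are hypothetical, the k hash functions are simulated by salting SHA-256 with
the function index (an assumption), and positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Bit vector v of m bits with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits: List[int] = [0] * m  # all bits are initially 0

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: h_i(item) is derived by hashing "i:item" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bits at h1(a),...,hk(a); a bit may be set more than once.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any zero bit proves absence; all ones means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added is always reported as present, so there are no false negatives; an item that was never added
may still be reported as present if all of its k bits happen to have been set by other items.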

False Positive: If we guess that b is present but the guess is wrong, this is called a false positive.
It has been found that the probability of a false positive is 0.9% for 5 hash functions and m/n = 10, which
is close to zero.
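
The quoted figure can be checked against the standard approximation for the false positive probability of a Bloom
filter, (1 - e^(-k*n/m))^k.  The formula is not derived in these notes, so treat the sketch below as a sanity check
rather than as part of the lecture; the function name is hypothetical.

```python
import math


def false_positive_rate(bits_per_element: float, k: int) -> float:
    """Approximate false positive probability for m/n = bits_per_element and k hash functions."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


# m/n = 10 with 5 hash functions gives about 0.0094, i.e. roughly 0.9%.
rate = false_positive_rate(10, 5)
```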

Using Bloom Filter in Summary Representation:

    Each proxy maintains a local Bloom filter to represent its own cached documents.  To reflect changes in the
set A, an array of counters is maintained, which keeps track of how many times each bit has been set to 1.  All counts
are initially zero.  When a key 'a' is inserted or deleted, the counts c(h1(a)),...,c(hk(a)) are incremented or
decremented respectively.  A bit in the filter is 1 exactly when its count is greater than zero.
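
A sketch of the counter array is given below.  The names are hypothetical, the hash functions are simulated by salted
SHA-256 (an assumption), and the counters are plain Python integers, so the width of a counter in a real proxy is not
modelled.

```python
import hashlib
from typing import Iterator, List


class CountingBloomFilter:
    """Keeps one counter per bit so that deletions can be reflected in the bit array."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts: List[int] = [0] * m  # all counts are initially zero

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: h_i(item) is derived by hashing "i:item" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def delete(self, item: str) -> None:
        # Only items that were inserted earlier should be deleted.
        for pos in self._positions(item):
            if self.counts[pos] > 0:
                self.counts[pos] -= 1

    def bit_array(self) -> List[int]:
        # The bit array sent to other proxies: 1 where the count is nonzero.
        return [1 if c > 0 else 0 for c in self.counts]
```

Only the bit array needs to be sent to the other proxies; the counters stay local to the proxy that owns the cache.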

  A proxy builds a Bloom filter from the list of URLs of its cached documents and sends the bit array plus the specification
of the hash functions to the other proxies.  When updating the summary, the proxy can either specify which bits in the
bit array are flipped or send the whole array, whichever is smaller.  The number of filter bits per document, taken
over the average number of documents in the cache, is called the LOAD FACTOR.  The average number of documents is
calculated by dividing the cache size by 8 KB (the assumed average document size).  The advantage of the Bloom filter
is that it provides a tradeoff between the memory requirement and the false positive ratio just by changing the m/n value.
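
The choice between sending flipped bits and sending the whole array can be sketched as below.  The cost model is an
assumption made for illustration: each flipped position is charged the number of bits needed to write an index into
the array, and the whole array is charged m bits.  The function names and the ("delta", ...) / ("full", ...) tuples are
hypothetical.

```python
from typing import List, Tuple


def encode_update(old_bits: List[int], new_bits: List[int]) -> Tuple[str, List[int]]:
    """Return ("delta", flipped positions) or ("full", whole array), whichever is smaller."""
    m = len(new_bits)
    flipped = [i for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]
    index_bits = max(1, (m - 1).bit_length())  # bits needed to name one position
    if len(flipped) * index_bits < m:
        return ("delta", flipped)
    return ("full", list(new_bits))


def apply_update(old_bits: List[int], update: Tuple[str, List[int]]) -> List[int]:
    """Rebuild the sender's bit array at the receiving proxy."""
    kind, payload = update
    if kind == "full":
        return list(payload)
    bits = list(old_bits)
    for i in payload:
        bits[i] ^= 1  # flip the named bit
    return bits
```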

Example :
 Assume that 100 proxies, each with 8 GB of cache, would like to cooperate.  Each proxy stores on average about 1M
web pages.  The Bloom filter memory needed to represent 1M pages is 2 MB at load factor 16.  Each proxy needs about
200 MB to represent all the summaries, plus another 1 MB to represent its own counters.  The memory requirement
for the ICP protocol to represent 1M pages is 16 MB, therefore each proxy requires 1600 MB to represent all the summaries.
Clearly ICP is not scalable.
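
The Bloom filter figures in this example follow from simple arithmetic, reproduced below as a check.  The sketch
assumes binary units (1 KB = 1024 bytes), that the cache size is divided by 8 KB to get the number of documents, and
that the load factor is the number of filter bits per document; the function names are hypothetical.

```python
KB, MB, GB = 2**10, 2**20, 2**30


def average_documents(cache_bytes: int, avg_doc_bytes: int = 8 * KB) -> int:
    """Average number of documents: cache size divided by 8 KB."""
    return cache_bytes // avg_doc_bytes


def summary_bytes(num_documents: int, load_factor: int) -> int:
    """Size of one Bloom filter summary: load_factor bits per document."""
    return num_documents * load_factor // 8


docs = average_documents(8 * GB)      # 2**20 documents, i.e. about 1M pages
one_summary = summary_bytes(docs, 16)  # 2 MB at load factor 16
all_summaries = 100 * one_summary      # about 200 MB for 100 proxies
```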

Conclusion:
           
Thus it can be concluded that Summary Cache reduces the number of inter-proxy messages, implying a lower bandwidth
requirement, and that its memory requirement is low when compared to the currently used ICP protocol, as can be seen
in the above example.  These two advantages help in achieving scalability of the protocol.  All these advantages are
obtained without affecting the cache hit ratio.