SCRIBE NOTES FOR LECTURE 31:

TOPIC   :  WEB CACHE SHARING


WEB CACHING

DEFINITION :

The storage of Web files for later re-use at a point more quickly accessed by the end user.

Caching can happen at many places, including proxies (e.g. at the user's ISP) and the user's local machine. The objective is to make efficient use of resources and speed the delivery of content to the end user.


The purposes of web caching are :
CACHE SHARING :

        Cache sharing is the sharing of caches among Web proxies. It reduces Web traffic and alleviates network bottlenecks. There are two protocols for cache sharing:

1. ICP  - Internet Cache Protocol.
2. Summary-Cache.

CACHE SHARING VIA  ICP :


OVERHEAD OF ICP :

         Compared with no cache sharing, ICP

SUMMARY CACHE :

   The compressed directories, which consist of the URLs of cached documents, are called "summaries". In the summary cache, each proxy stores a summary of the URLs of the documents cached at every other proxy. When a user requests a document, its URL is first looked up in the local cache. If a miss occurs in the local cache, the proxy checks the stored summaries to see whether the requested document is present at other proxies. If it is, the proxy sends requests to the relevant proxies to fetch the document. If it is not present at any other proxy, the proxy sends the request directly to the Web server.
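
The lookup order described above can be sketched in Python as follows. This is only an illustration, not code from the lecture: the function and variable names are hypothetical, summaries are modelled as plain sets of URLs, and the candidate proxies are tried one after another, which is a simplification.

```python
def lookup(url, local_cache, peer_summaries, fetch_from_peer, fetch_from_origin):
    """Resolve a request in the order the summary-cache scheme describes."""
    # 1. Look in the local cache first.
    if url in local_cache:
        return local_cache[url], "local hit"
    # 2. On a local miss, consult the stored summary of every other proxy.
    for peer, summary in peer_summaries.items():
        if url in summary:
            doc = fetch_from_peer(peer, url)
            if doc is not None:
                return doc, "remote hit at " + peer
            # None means the summary was wrong; keep looking.
    # 3. No proxy could supply the document: go to the Web server.
    return fetch_from_origin(url), "origin server"


# Hypothetical demo data (assumed for illustration only).
peer_caches = {"proxyB": {"http://a.example/x": "X"}, "proxyC": {}}
peer_summaries = {
    "proxyB": {"http://a.example/x"},
    "proxyC": {"http://a.example/y"},  # out of date: proxyC no longer holds y
}
local_cache = {"http://a.example/z": "Z"}


def fetch_from_peer(peer, url):
    return peer_caches[peer].get(url)


def fetch_from_origin(url):
    return "origin:" + url
```

In the demo data the summary held for proxyC is deliberately out of date, so a request for http://a.example/y costs one wasted query before falling through to the Web server.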

 Overheads :

1. False Misses : The requested document is cached at some other proxy, but that proxy's summary does not reflect the fact. In this case a remote cache hit is lost, and the total hit ratio within the collection of caches is reduced.

2. False Hits : The requested document is not cached at some other proxy, but that proxy's summary indicates that it is. The proxy will send a query message to the other proxy, only to be notified that the document is not cached there. In this case a query message is wasted.

3. Stale Hits : The document is stored at some other proxy, but that copy is stale. The effect is a wasted query message.

Two issues to resolve :

    1. When to do summary updates ?
    2. How to summarize ?

If we update the summary database whenever there is a change in the cache, the network overhead increases.
There are two alternative approaches:

 i. Periodic  summary  Updates
ii. Delay summary Updates until X% of cached documents are 'new'.

    For the second option, trace-driven simulations indicate that a delay threshold of 1-10% works well in practice. This translates to an update frequency of about once every 5 minutes.
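
A minimal sketch of the delayed-update rule, assuming the proxy simply counts documents cached since the last update. The class name, the callback and the handling of the threshold are hypothetical, and evictions are ignored for brevity.

```python
class SummaryUpdater:
    """Send a summary update once X% of the cached documents are new."""

    def __init__(self, threshold_percent, send_update, cached=()):
        self.threshold_percent = threshold_percent
        self.send_update = send_update      # callback that ships the summary
        self.cached = set(cached)           # URLs currently cached
        self.new_since_update = set()       # URLs not yet reflected in the summary

    def on_document_cached(self, url):
        self.cached.add(url)
        self.new_since_update.add(url)
        # Compare the fraction of new documents with the threshold X%.
        if 100 * len(self.new_since_update) >= self.threshold_percent * len(self.cached):
            self.send_update(set(self.cached))
            self.new_since_update.clear()
```

With a threshold of 5 and 100 documents already cached, the callback fires when the sixth new document arrives, since 6 of 106 is the first fraction to reach 5%.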

The second issue to be resolved is how to summarize.

For performance reasons, the summaries are stored in main memory rather than on hard disk. The memory requirement is determined by the frequency of summary updates and by the number of cooperating proxies. Since the memory grows linearly with the number of proxies, it is important to keep the individual summaries small.

First consider the two summary representations :

 i.   Exact-directory  
ii.   Server-name.

    In the exact-directory approach, the summary is essentially the list of cached URLs, with each URL represented by its 16-byte MD5 signature.
    In the server-name approach, the summary is the collection of Web server names in the URLs of cached documents. Since, on average, the ratio of distinct URLs to distinct Web server names is about 10 to 1, the server-name approach can cut the memory requirement by a factor of 10.
    Neither of these two approaches is good. The exact-directory approach consumes too much memory.
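
To make the two representations concrete, here is a small sketch using Python's standard hashlib and urllib.parse modules. The function names and the sample URLs are made up for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def exact_directory_summary(urls):
    # One 16-byte MD5 signature per cached URL.
    return {hashlib.md5(u.encode("utf-8")).digest() for u in urls}


def server_name_summary(urls):
    # Only the Web server name of each cached URL is kept.
    return {urlsplit(u).hostname for u in urls}


# Hypothetical cache contents.
cached_urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://cs.example.edu/notes.txt",
]
exact = exact_directory_summary(cached_urls)
servers = server_name_summary(cached_urls)
```

Here the exact-directory summary has three entries while the server-name summary has two. The server-name summary would also report a hit for any other URL on www.example.com, which is where its false hits come from.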

Consider the following example.

    Let the proxy cache size be 8GB.
    The average file size is 8KB.
    There are 16 proxies.

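The exact-directory approach would consume (16-1)*16*(8GB/8KB) = 240MB of main memory per proxy: 16-1 remote summaries, each holding one 16-byte signature for each of the 8GB/8KB = 1M cached documents.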
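
The same arithmetic as a short Python check. The function name is hypothetical, and KB, MB and GB are taken as powers of two.

```python
KB = 2**10
MB = 2**20
GB = 2**30


def exact_directory_memory(num_proxies, cache_bytes, avg_doc_bytes, sig_bytes=16):
    """Bytes one proxy needs to hold the summaries of all the other proxies."""
    docs_per_proxy = cache_bytes // avg_doc_bytes
    return (num_proxies - 1) * sig_bytes * docs_per_proxy


total_bytes = exact_directory_memory(16, 8 * GB, 8 * KB)
total_mb = total_bytes // MB
```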

   
The server-name approach, though consuming less memory, generates too many false hits, which significantly increase the network traffic.


BLOOM-FILTERS :


    Bloom filters support membership tests for a set of keys. A Bloom filter is a method for representing a set A = {a1,a2,...,an} of n elements to support membership queries.
    The idea is to allocate a vector v of "m" bits, initially all set to 0, and then choose k independent hash functions h1,h2,...,hk, each with range {1,...,m}. For each element a in A, the bits at positions h1(a),h2(a),...,hk(a) in v are set to 1. (A particular bit might be set to 1 multiple times.) Given a query for b, we check the bits at positions h1(b),h2(b),...,hk(b). If any of them is 0, then certainly b is not in the set A. Otherwise we conjecture that b is in the set, although there is a certain probability that we are wrong.
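
A minimal Bloom filter sketch in Python. Deriving the k hash functions by salting MD5 with an index is an assumption made for illustration, not something prescribed in the lecture, and positions run from 0 to m-1 rather than 1 to m. For readability each bit is stored in a whole byte.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                    # number of bits in the vector v
        self.k = k                    # number of hash functions
        self.bits = bytearray(m)      # all zero initially; one byte per bit

    def _positions(self, key):
        # Assumed construction: h_i(key) = MD5("i:key") mod m.
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "certainly not in the set"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```

A proxy would call add() for every cached URL and ship the bit vector to its peers, which then call might_contain() on a local miss.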

    The parameters k and m should be chosen such that the probability of a false positive  is acceptable.

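    After all n elements have been inserted, the probability that a particular bit is still 0 is exactly p = (1-1/m)^(kn), which is approximately e^(-kn/m).

     The probability of a false positive is then (1-p)^k, which is approximately (1-e^(-kn/m))^k.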
   
    This value is minimised when k is ln2 * (m/n).

    The minimum value is (1/2)^k = (0.6185)^(m/n). Since the base is less than 1, the probability decreases exponentially with m/n. Here m/n is the number of bits per data item.
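
The formulas above can be checked numerically with a few lines of Python. The function names are hypothetical, and the approximate form of the false positive probability is used.

```python
import math


def optimal_k(m, n):
    # k = ln2 * (m/n); in practice k must be rounded to an integer.
    return math.log(2) * m / n


def false_positive_rate(m, n, k):
    # Approximate form: (1 - e^(-kn/m))^k
    return (1 - math.exp(-k * n / m)) ** k


bits_per_item = 8                                   # m/n
k_best = optimal_k(bits_per_item, 1)
rate_at_best_k = false_positive_rate(bits_per_item, 1, k_best)
```

With 8 bits per item, rate_at_best_k agrees with 0.6185^8, roughly 2%, and using any other k gives a higher rate.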

Each proxy builds a Bloom filter of its cache contents and sends it to the other proxies. The Bloom-filter mechanism scales well: it requires little memory even for a moderately large number of proxies.