Resilient Overlay Networks (RON)

 

Introduction

    A Resilient Overlay Network (RON) is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within seconds. It is an application-layer overlay on top of the existing Internet routing substrate. The RON nodes monitor the functioning and quality of the Internet paths among themselves and use this information to decide whether to route packets directly over the Internet or by way of other RON nodes, optimizing application-specific metrics.

For example, IP itself is an overlay network: it was originally deployed on top of the telephone network.

Purpose

    The purpose of an overlay is to add functionality to IP, e.g. multicast, resilience (fault tolerance), and service composition. The motivations are:

Multicast

Resilience

    The routing protocol BGP has many drawbacks. It is a path vector protocol and converges slowly compared with a link state protocol. The only metric BGP uses is the path length. BGP also hides the underlying topological details in the interests of scalability and policy enforcement. As a result, the recovery mechanisms take minutes to converge. Even policy decisions have to be expressed through the path length. For example, consider a topology with four nodes a, b, c, d and links c--a, c--d, a--d, d--b, a--b. Suppose the policy of "a" is not to give transit service to "c" unless the c--d--b path is also down. To achieve this, "a" advertises to "b" a path to "c" via "a--a--a--a--a" instead of via "a" only, so that "b" prefers the shorter path through "d". In this way "a" abuses the path length to implement policy.
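The path-padding trick above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: route selection is reduced to "pick the advertised path with the fewest entries", and the function name `best_route` and the variable names are hypothetical, not part of BGP or RON.

```python
# Hypothetical sketch: route selection by path length only.
# Node names follow the four-node example in the text.

def best_route(advertisements):
    """Return the advertised path with the fewest entries, or None if there are none."""
    if not advertisements:
        return None
    return min(advertisements, key=len)

# Routes to "c" as received by "b". Each list is the path "b" is told about.
via_d = ["d", "c"]                 # b--d--c
via_a_plain = ["a", "c"]           # b--a--c, advertised without padding
via_a_padded = ["a"] * 5 + ["c"]   # "a" repeats itself to make the path look long

# Without padding the two routes have equal length, so "b" might pick "a".
tie = len(via_a_plain) == len(via_d)

# With padding, "b" prefers the route through "d" ...
preferred = best_route([via_a_padded, via_d])

# ... and uses "a" only once the route through "d" has been withdrawn.
fallback = best_route([via_a_padded])

print(preferred, fallback)
```

The padded route is still usable, which is the point: "a" carries traffic for "c" only when no shorter alternative exists.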

The availability of BGP-routed paths is also very low. Only 10% of the paths are available more than 95% of the time, whereas a telephone line is available 99.999% of the time.

One thing to note is that multi homing is not a solution because

 

RON Model

    The RON model has a set of designated RON nodes. These nodes communicate among themselves over a full mesh, exchange performance and reachability information using link state based dissemination, and route based on that information. This gives the following advantages:

  1. fast recovery from failures
  2. application specific metrics
  3. packet classification policies

Besides that, a set of metrics has to be defined, and the architecture should store performance information for each of those metrics.

Usage Models

In the overlay ISP model, several ISPs together form a RON, and RON service can be bought from them.

Failure Detection in RON

     RON nodes detect failures by exchanging periodic heartbeat packets. Usually these heartbeats are small and do not cause much overhead, so, within some constraint, they can be sent at a higher rate. The failure detection itself is application specific. For example, the criterion for declaring a failure for an FTP application will differ from that for a video conferencing application.

   A high heartbeat rate is not always good, however, because of false alarms. If the heartbeats are too fast, the route can flip-flop between paths, leading to packet reordering (particularly bad for video conferencing, because some special frames carry a large amount of information).
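The trade-off between fast detection and false alarms can be sketched as follows. This is a minimal illustration, assuming a path is declared down after a fixed number of consecutive missed heartbeats; the class name `FailureDetector` and the thresholds 2 and 5 are hypothetical choices for the example, not values taken from RON.

```python
# Hypothetical sketch: declare an outage after k consecutive missed heartbeats.
# The thresholds used below are illustrative assumptions only.

class FailureDetector:
    def __init__(self, misses_to_fail):
        self.misses_to_fail = misses_to_fail
        self.consecutive_misses = 0

    def observe(self, heartbeat_received):
        """Record one heartbeat interval; return True if the path is considered down."""
        if heartbeat_received:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.misses_to_fail

video = FailureDetector(misses_to_fail=2)  # aggressive: reacts quickly
ftp = FailureDetector(misses_to_fail=5)    # conservative: tolerates short glitches

# True = heartbeat arrived, False = heartbeat missed.
trace = [True, False, False, True, False, False, False, False, False]
video_state = [video.observe(hb) for hb in trace]
ftp_state = [ftp.observe(hb) for hb in trace]

print(video_state)
print(ftp_state)
```

On this trace the aggressive detector reports a failure during the short two-heartbeat glitch and then recovers, which is exactly the kind of flip-flop described above, while the conservative detector reports only the longer outage.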

Metrics

   Latency is calculated from RTT measurements of the heartbeats. We need to know how stable the RTT values are. Studies have shown that RTTs are stable over periods of 15 minutes to 1 hour. Usually an EWMA (exponentially weighted moving average) of the RTT is taken, so random spikes have little effect and the estimate remains stable over that period. The latency of a multi-hop path is the sum over its individual links.
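A minimal sketch of the latency metric, assuming a standard EWMA update in which the newest sample gets weight `alpha`. The value `alpha=0.1`, the sample RTTs, and the function names are assumptions made for illustration.

```python
# Hypothetical sketch: smooth RTT samples with an EWMA and add up link latencies.

def ewma(samples, alpha=0.1):
    """Exponentially weighted moving average; alpha is the weight of the newest sample."""
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate

def path_latency(link_latencies):
    """Latency of a path is the sum of the latencies of its links."""
    return sum(link_latencies)

# Illustrative RTT samples in milliseconds, with one random spike of 400 ms.
rtts_ms = [50, 52, 49, 400, 51, 50]
smoothed = ewma(rtts_ms)

print(round(smoothed, 2))
print(path_latency([20, 35]))
```

The single spike moves the estimate only part of the way toward 400 ms, and its influence keeps decaying with each later sample. A smaller `alpha` damps spikes more strongly at the cost of reacting more slowly to real changes.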

   The packet loss rate is also calculated from the heartbeats. For a path of two links, the loss rate is 1 - (1 - p1)(1 - p2), where p1 and p2 are the packet loss rates on link 1 and link 2, assuming losses on the two links are independent.
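The two-link formula generalizes to any number of links: a packet is delivered only if it survives every link. A small sketch, assuming independent losses per link (the function name `path_loss_rate` is hypothetical):

```python
# Hypothetical sketch: combine per-link loss rates into a path loss rate.
import math

def path_loss_rate(link_loss_rates):
    """Loss rate of a path, assuming losses on each link are independent."""
    delivery_probability = math.prod(1 - p for p in link_loss_rates)
    return 1 - delivery_probability

# Two links with 10% and 20% loss: 1 - (0.9 * 0.8).
two_link_loss = path_loss_rate([0.1, 0.2])
print(two_link_loss)
```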

    The bandwidth can be measured by sending back-to-back packets. The receiving end will receive the packets at the rate of the bottleneck bandwidth. Even here there are two cases:

  1. If the intermediate routers implement FIFO queueing, the above does not hold
  2. If the intermediate routers implement WFQ, the above holds
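The back-to-back measurement can be sketched as below: if two packets leave the sender together, the gap between their arrivals reflects how long the bottleneck took to forward one packet. This is a simplified illustration that ignores cross traffic; the function name, the packet size, and the gap value are assumptions.

```python
# Hypothetical sketch: estimate bottleneck bandwidth from a pair of back-to-back packets.

def packet_pair_bandwidth(packet_size_bytes, arrival_gap_seconds):
    """Estimate bandwidth in bits per second from the spacing of two back-to-back packets."""
    if arrival_gap_seconds <= 0:
        raise ValueError("arrival gap must be positive")
    return packet_size_bytes * 8 / arrival_gap_seconds

# Illustrative numbers: 1500-byte packets arriving 1.2 ms apart.
estimate_bps = packet_pair_bandwidth(1500, 0.0012)
print(estimate_bps)
```

In practice several pairs would be sent and the estimates filtered, since a single gap can be stretched or compressed by other traffic in the queue.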

   The measured bandwidth is very unstable. Usually it varies by about a factor of 2. Therefore we switch to an alternate path if and only if the alternate bandwidth > 2 times the current bandwidth; otherwise there would be a lot of switching.
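The switching rule is a simple hysteresis test. A minimal sketch, with the factor of 2 taken from the text and the function name and sample values chosen for illustration:

```python
# Hypothetical sketch: switch paths only when the alternate is clearly better.

def should_switch(current_bw, alternate_bw, factor=2.0):
    """Switch only if the alternate bandwidth exceeds `factor` times the current one."""
    return alternate_bw > factor * current_bw

# Illustrative measurements (current, alternate) in Mbit/s.
samples = [(10, 15), (10, 20), (10, 25), (8, 12)]
decisions = [should_switch(cur, alt) for cur, alt in samples]
print(decisions)
```

Only the third sample triggers a switch; the others fall inside the factor-of-2 band that ordinary measurement noise can produce.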

Also, only two-hop alternate paths (paths through a single intermediate RON node) are considered. Consider the topology below.

 

        A-----------B
        | \       / |
        |   \   /   |
        |     X     |
        |   /   \   |
        | /       \ |
        C-----------D

There are links A-B, A-C, A-D, B-C, B-D, C-D.

If A-B goes down, then by the above rule we will use either A-D-B or A-C-B only; longer paths such as A-C-D-B are not considered.
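The path choice in this four-node example can be sketched as follows. The latency values are made up for illustration, and the function names are hypothetical. Links that are down are simply absent from the table.

```python
# Hypothetical sketch: choose among the direct path and one-intermediate-node paths.

def candidate_paths(nodes, src, dst):
    """Direct path plus every path through exactly one intermediate node."""
    paths = [(src, dst)]
    paths += [(src, mid, dst) for mid in nodes if mid not in (src, dst)]
    return paths

def best_path(latency, nodes, src, dst):
    """Lowest-latency candidate; links missing from `latency` are treated as down."""
    def cost(path):
        total = 0.0
        for u, v in zip(path, path[1:]):
            link = latency.get(frozenset((u, v)))
            if link is None:
                return float("inf")
            total += link
        return total
    return min(candidate_paths(nodes, src, dst), key=cost)

nodes = ["A", "B", "C", "D"]
# Illustrative link latencies in ms. A-B is missing because it is down.
latency = {
    frozenset("AC"): 10, frozenset("AD"): 30,
    frozenset("BC"): 15, frozenset("BD"): 5,
    frozenset("CD"): 20,
}

candidates = candidate_paths(nodes, "A", "B")
chosen = best_path(latency, nodes, "A", "B")
print(candidates)
print(chosen)
```

With these numbers A-C-B (25 ms) beats A-D-B (35 ms). Restricting the search to one intermediate node keeps the number of candidates linear in the number of RON nodes.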

As a consequence of all of the above, in the case of Mobile IP, triangle routing does not necessarily cause greater latency than the direct route.