The Classical Fish Problem in Routing

The Internet of the mid-80s and 90s was envisaged to work on the IP destination-based routing/forwarding paradigm. This meant that the routing protocols would establish the best paths based on some scalar metric, like the hop count or link costs, and all traffic would follow those paths. This would work since all IP routing and forwarding was based on the destination IP address carried in the IP packets. With all due respect and credit to the Internet’s forefathers, the vision and the design did work until a few years back, when network operators started realizing that though the IP architecture was indeed scalable, it lacked the finesse to optimally utilize network resources (particularly in the backbones).

The inadequate utilization of the network resources can be illustrated with the classic “fish problem”. It derives its name from the network (Fig 1) resembling a fish, with A being the head and G and H the tail.

Figure 1

All traffic emanating from (or passing through) the tail (G or H) towards the head (A) can take either of the two paths (F-D-C-B or F-E-B), depending on how the IP routing tables are programmed by the routing protocols. The latter decide on the best path by considering the link costs advertised by each router in the network. In this example the total cost of the path G-F-D-C-B-A is 35 (10+5+5+5+10), while the cost of the path G-F-E-B-A is 30. This means that all traffic from G to A (or G to B) would follow the path through router E (as shown by the red arrow), and the F-D-C-B path would remain unused, since the total cost associated with it is higher than the one through router E.

This leads to an extremely unbalanced traffic distribution, where the path F-E-B can get heavily overloaded and, at worst, congested, while F-D-C-B always remains idle. This problem arises because of the way the IP routing paradigm works. Let us see why:

o IP routing is destination based, so packets are routed only on the destination IP address in the packet. Routing protocols typically install one next-hop for each IP address or range of IP addresses (a prefix), except in the case of equal cost routes, which we can ignore for the time being. Thus all packets sharing the same destination address get routed to the same next-hop. This means that if F installs a route 100/8 with E as the next-hop, then all IP traffic falling under 100/8 that arrives at F gets routed to E. This can lead to unbalanced traffic distribution and create unnecessary congestion hot spots.

o Routers make a local decision, based on what they think is the optimal path from their perspective, when selecting a path. Since all routers run the same SPF algorithm on the same link state database, they all come up with the same shortest path, which very soon turns congested, while the non-shortest path remains idle and unused. This implies that to optimize network utilization the routers must factor in other things before choosing the path. One thing that comes instantly to mind is the bandwidth available on each link. If routers can somehow keep track of the available bandwidth on each link, then they can distribute the traffic in a manner that optimizes network resource utilization.
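The path selection in the fish network can be reproduced with a small Dijkstra sketch. The link costs on G-F, F-D, D-C, C-B and B-A follow the sums quoted above; the split of the F-E-B cost into 5 + 5 is an assumption, since only the path total of 30 is given. The names LINKS, build_graph and shortest_path are made up for this illustration.

```python
import heapq

# Link costs for the fish network. The F-E / E-B split (5 + 5) is an
# assumption; the text only gives the total path cost of 30.
LINKS = {
    ("G", "F"): 10, ("F", "D"): 5, ("D", "C"): 5, ("C", "B"): 5,
    ("F", "E"): 5, ("E", "B"): 5, ("B", "A"): 10,
}


def build_graph(links):
    """Turn the link table into a bidirectional adjacency map."""
    graph = {}
    for (u, v), cost in links.items():
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def shortest_path(graph, src, dst):
    """Plain Dijkstra; returns (total_cost, path) or (None, []) if unreachable."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None, []


cost, path = shortest_path(build_graph(LINKS), "G", "A")
print(cost, "-".join(path))  # 30 G-F-E-B-A
```

Every router running the same computation on the same database picks the path through E, so nothing ever lands on F-D-C-B.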

Coming back to our network, we find that all traffic from tail to head flows through the router E, leaving D and C idle. So, what can be done to fix this?

The operator can manipulate the link costs on path F-D-C-B, in a manner as shown below, to get the traffic to flow over it.

Figure 2

This clearly works, since the cost of the path F-D-C-B (9) is now lower than that of the path F-E-B (10). But hang on. All we’ve achieved is moving the entire traffic from path F-E-B to F-D-C-B! The traffic would soon start congesting the latter path, while leaving the former unused. We have really achieved nothing, but have only moved the problem elsewhere. Clearly, this wouldn’t work. So, what else can be done?

Well, not much. A clever network operator can play around with the link costs on paths F-D-C-B and F-E-B such that both become equal, as shown in the figure below.

Figure 3

This would surely alleviate the problem as the two paths would now be equally used. However, this scheme of manually adjusting the link costs is not scalable and only works for small networks. Imagine replacing C, D and E with hundreds of routers, with a subset of them being connected to each other. The precarious scheme of adjusting the link costs would become too complex and too fragile to work. A single link (or a router) failure or a cost change would bring down the entire scheme of distributing the traffic across two paths.

The only scalable solution for the fish problem is to go beyond the realms of traditional IP routing and provide mechanisms to explicitly manage the traffic inside the network. This new paradigm of routing is called constraint based routing. It strives to compute the best path that does not violate any of the constraints imposed on the network while still being optimal with respect to some scalar metric (hop count, link costs, etc). Once such a path is computed, it establishes and maintains forwarding state along that path.

This differs from the existing IP routing paradigm, which only tries to optimize (by minimizing) a particular scalar metric when computing the best path to a destination. Thus RIP optimizes the number of hops, and OSPF/IS-IS the total path cost, where the total path cost is the sum of the individual costs of all the links along the path.
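One common way to picture constraint based routing is as "prune, then run SPF": drop every link that violates the constraint and compute the shortest path over what is left. The following is only a minimal sketch of that idea, not a description of any particular implementation. The bandwidth figures, the F-E / E-B cost split and the function name are all hypothetical.

```python
import heapq

# (cost, available bandwidth in Mbps) per link. Costs follow the fish example;
# the F-E / E-B cost split and every bandwidth figure are hypothetical.
LINKS = {
    ("G", "F"): (10, 1000), ("F", "D"): (5, 600), ("D", "C"): (5, 600),
    ("C", "B"): (5, 600), ("F", "E"): (5, 200), ("E", "B"): (5, 200),
    ("B", "A"): (10, 1000),
}


def constrained_shortest_path(links, src, dst, min_bandwidth):
    """Return (cost, path) over links with enough bandwidth, else (None, [])."""
    # Step 1: prune links that violate the bandwidth constraint.
    graph = {}
    for (u, v), (cost, bandwidth) in links.items():
        if bandwidth >= min_bandwidth:
            graph.setdefault(u, {})[v] = cost
            graph.setdefault(v, {})[u] = cost
    # Step 2: ordinary Dijkstra on the pruned topology.
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None, []


print(constrained_shortest_path(LINKS, "G", "A", 100))  # small flow: via E, cost 30
print(constrained_shortest_path(LINKS, "G", "A", 400))  # large flow: via D-C, cost 35
```

With these made-up numbers a 400 Mbps demand no longer fits on F-E-B, so it is placed on the longer path even though its cost is higher.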

I will discuss constraint based routing further in my subsequent posts.


Shortest Path First (SPF) Calculation in OSPF and IS-IS

Both OSPF and IS-IS use the Shortest Path First (SPF) algorithm to calculate the best path to all known destinations based on the information in their link state database. It works by building the shortest path tree from a specific root node to all other nodes in the area/domain, thereby computing the best route to every known destination from that particular source node. The shortest path tree thus constructed consists of three main entities: the edges, the nodes and the leaves.

Each router in OSPF, or Intermediate System in the case of IS-IS, is a node in the SPF tree. The links connecting these routers are the edges. The IP network associated with an IP interface added into OSPF via the network command is a node, while the IP address associated with an interface that is added in IS-IS is a leaf. An IP prefix redistributed into OSPF or IS-IS from other routing protocols (say BGP) becomes a leaf in both protocols. Inter-area routes are, patently, leaves.

Network Diagram

If you consider the network as shown above, then OSPF would consider routers A, B, C and the network as nodes. This is assuming that the interface associated with has been added into OSPF. The only leaf in the graph would be the IP prefix redistributed into A from some other protocol. IS-IS, on the other hand, would consider routers A, B and C as nodes and networks and as leaves. This seemingly innocuous difference in the representation of the SPF tree leads to some subtle differences between the SPF run in OSPF and in IS-IS, which may interest a network engineer.

The nodes in the shortest path tree (or graph) form the backbone or skeleton of that tree. Any change there necessitates a recalculation of the SPF tree, while a change in a leaf of the SPF tree does not require a full recalculation. Adding and removing leaves without recalculating the entire SPF tree is known as Partial SPF and is a feature of almost every implementation of OSPF and IS-IS that I am aware of. This implies that if the link connecting router C to goes down, then a full SPF would be triggered in case of OSPF, and a partial SPF in case of IS-IS.

This shows that the general adage – “Avoid externals in OSPF” should be taken with a pinch of salt and it really depends upon your topology. I have seen networks where ISPs redistribute numerous routes that have a potential to change on a regular basis, as opposed to bringing them via the network command.
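The full versus partial SPF distinction described above can be sketched as follows. Nodes go through a Dijkstra run, while leaves are simply hung off the node distances that are already known. This is only an illustration of the idea and not how any particular implementation structures it. The topology costs and the prefix labels P1 and P2 are placeholders.

```python
import heapq


def full_spf(topology, root):
    """Dijkstra over the nodes only; returns the cost from root to every node."""
    dist = {root: 0}
    queue = [(0, root)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, link_cost in topology[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return dist


def partial_spf(dist, leaves):
    """Attach leaves to an already computed tree; no Dijkstra run is needed."""
    routes = {}
    for prefix, node, metric in leaves:
        if node in dist:
            cost = dist[node] + metric
            if cost < routes.get(prefix, float("inf")):
                routes[prefix] = cost
    return routes


# Hypothetical costs between routers A, B and C; P1 and P2 are placeholder prefixes.
topology = {"A": {"B": 10}, "B": {"A": 10, "C": 10}, "C": {"B": 10}}
leaves = [("P1", "C", 1), ("P2", "A", 5)]

dist = full_spf(topology, "A")            # needed only when nodes or edges change
routes = partial_spf(dist, leaves)        # {'P1': 21, 'P2': 5}
routes = partial_spf(dist, leaves[1:])    # P1 withdrawn: reuse dist, no full SPF
```

When a leaf comes or goes, only the cheap second step is repeated. A node or edge change invalidates dist and forces the first step again.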


o IP routing is integrated into IS-IS by adding some new TLVs which carry IP reachability information in the LSPs. All IP networks are considered externals, and they always end up as leaf nodes in the shortest path tree when IS-IS does an SPF run. All node information necessary for the SPF calculation is advertised in the IS Neighbors or IS Reachability TLVs. This unambiguously separates the prefix information from the topology information, which makes Partial Route Calculation (PRC) easily applicable. Thus IS-IS performs only the less CPU intensive PRC when network events do not affect the basic topology but only affect the IP prefixes.

o IS-IS originally used narrow (6 bits wide) metrics, which helped in some SPF optimization. However, such a small field proved insufficient for providing flexibility in designing IS-IS networks and for other applications using IS-IS routing (MPLS-TE). “IS-IS extensions for Traffic Engineering” introduced new TLVs which defined wider metrics for IS-IS, thus taking away this optimization. But then CPUs are fast these days and there aren’t many very big networks anyway!

o SPF for a given level is computed in a single phase by taking the TLVs of all IS-IS LSPs together.


o OSPF is built around links, and any IP prefix change in an area will trigger a full SPF. It advertises IP information in Router and Network LSAs. The routers thus advertise both the IP prefix information (or the connected subnet information) and the topology information in the same LSAs. This implies that if an IP address attached to an interface changes, OSPF routers have to originate a Router LSA or a Network LSA, which also carries the topology information. This triggers a full SPF on all routers in that area, since the same LSAs are flooded to convey topological changes. This can be an issue with an access router or one sitting at the edge, since many stub links can change regularly.

o Only changes in interarea, external and NSSA routes result in partial SPF calculation (since type 3, 4, 5 and 7 LSAs only advertise IP prefix information) and thus IS-IS’s PRC is more pervasive than OSPF’s partial SPF. This difference allows IS-IS to be more tolerant of larger single area domains whereas OSPF forces hierarchical designs for relatively smaller networks. However with the route leaking from L2 to L1 incorporated into IS-IS the apparent motivation for keeping large single area domains too goes away.

o SPF is calculated in three phases. The first is the calculation of intra-area routes by building the shortest path tree for each attached area. The second phase calculates the inter-area routes by examining the summary LSAs and the last one examines the AS-External-LSAs to calculate the routes to the external destinations.

o OSPFv3 has been made smarter. It removes the IP prefix advertisement function from the Router and the Network LSAs, and puts it in the new Intra-Area Prefix LSA. This means that Router and Network LSAs now truly represent only the router’s node information for SPF and would get flooded only if information pertinent to the SPF algorithm changes, i.e., there is a topological change event. If an IP prefix changes, or the state of a stub link changes, that information is flooded in an Intra-Area Prefix LSA, which does not trigger an SPF run. Thus, by separating the IP information from the topology information, PRC has been made more applicable in OSPFv3 than in OSPFv2.

I recently wrote a post that discusses this further.
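As a quick recap, the comparison above can be written down as a lookup table that says which calculation a change triggers. It only encodes what is stated in the text. The key strings are informal labels picked for this sketch, not protocol constants.

```python
# Which calculation a change triggers, as described in the text above.
# Keys are (protocol, what changed); the labels are informal names for this sketch.
SPF_TRIGGER = {
    ("IS-IS", "IS Reachability TLV"): "full SPF",
    ("IS-IS", "IP Reachability TLV"): "partial",
    ("OSPFv2", "Router LSA"): "full SPF",
    ("OSPFv2", "Network LSA"): "full SPF",
    ("OSPFv2", "Summary LSA (type 3/4)"): "partial",
    ("OSPFv2", "AS-External LSA (type 5)"): "partial",
    ("OSPFv2", "NSSA LSA (type 7)"): "partial",
    ("OSPFv3", "Router LSA"): "full SPF",
    ("OSPFv3", "Network LSA"): "full SPF",
    ("OSPFv3", "Intra-Area Prefix LSA"): "partial",
}


def calculation_for(protocol, changed_item):
    """Look up whether a change needs a full SPF or only a partial calculation."""
    return SPF_TRIGGER[(protocol, changed_item)]


# A stub prefix change on an OSPFv2 router is carried in its Router LSA:
print(calculation_for("OSPFv2", "Router LSA"))             # full SPF
# The same change in OSPFv3 is carried in an Intra-Area Prefix LSA:
print(calculation_for("OSPFv3", "Intra-Area Prefix LSA"))  # partial
```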

Metric Size in OSPF and IS-IS

Each interface in the link state protocols is given a metric or a cost, which is advertised with the link state information in the LSP or the LSA. The SPF algorithm uses this metric to calculate the cost and the next-hop to a destination. Metrics are generally the inverse of bandwidth, so a link with a larger bandwidth capacity has a lower metric.
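The "inverse of bandwidth" rule can be sketched as a reference bandwidth divided by the link bandwidth. The 100 Mbps reference used here is an assumption for illustration (it is a familiar default on some platforms). The function name is made up, and the upper clamp is the 2 byte OSPF Router LSA limit discussed further down.

```python
def interface_cost(bandwidth_bps, reference_bps=100_000_000, max_cost=0xFFFF):
    """Cost as reference bandwidth / link bandwidth, clamped to [1, max_cost].

    reference_bps is an assumed reference value, not something the protocols
    mandate; max_cost defaults to the 2 byte OSPF Router LSA limit.
    """
    cost = reference_bps // bandwidth_bps
    return max(1, min(cost, max_cost))


print(interface_cost(10_000_000))     # 10 Mbps  -> 10
print(interface_cost(100_000_000))    # 100 Mbps -> 1
print(interface_cost(1_000_000_000))  # 1 Gbps   -> 1 (cannot go below 1)
```

Note how every link at or above the reference bandwidth collapses to a cost of 1, which is one reason a wider metric range is attractive.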


o ISO 10589 specified the metric to be 6 bits in size. Therefore, the metric value could only range from 0 to 63. This metric information was carried in the Neighbor Reachability TLV and the IP Reachability TLV. Since it was only 6 bits wide, it was called the “narrow metric”. The maximum path metric (MaxPathMetric) supported is 1023. The metric size was apparently kept small to optimize the search while doing the SPF, which in theory brought down the complexity of the SPF algorithm from O(n log n) to O(n). This isn’t a significant motivation any more, since CPUs are really fast these days. The narrow metric also allowed two types of metrics: External and Internal.

o Soon, the “narrow” metric range was found to be too small for certain networks, and new TLVs (Extended IP and Extended Neighbor Reachability) were introduced to carry larger metrics as part of the Traffic Engineering document. These were called Wide Metrics. The MaxLinkMetric value is now 0xFFFFFF and the MaxPathMetric 0xFE000000.

The Extended IP Reachability TLV allows for a 4 byte metric, while the Extended Neighbor Reachability TLV allows for a 3 byte metric. This enables the metric summarized across levels/domains to be as large as 0xFFFFFFFF while the link metric itself is no larger than 0xFFFFFE. If a link metric of 0xFFFFFF is used, the link is not used in the SPF computation.

o The original specification defined 4 kinds of narrow metrics – delay, default, error and expense. These were never implemented and all implementations only support the default metric. In ISO 10589, the most significant bit of the 8 bit metrics was the field S (Supported) which defined if the metric was to be considered or not. Confusingly, as most ISO documents are, an implementation was supposed to set this bit to 1 if it wanted to indicate that the metric was unsupported. Since only the default metric is used, the implementations must always set this bit to 0 when using the narrow metrics. Later RFC 2966 used this bit in the default metric to mark L1 routes that have been leaked from L1 to L2 and back down into L1 again.

Current implementations must generate the default metric when using narrow metrics and should ignore the other three metrics when using narrow metrics.
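The narrow metric layout described above can be decoded with a few bit masks. This is a sketch under one assumption: the Internal/External flag is taken to sit in the bit just below the most significant one, which the text does not spell out. The function names are made up for the illustration.

```python
MAX_NARROW_LINK_METRIC = 63     # 6 bits
MAX_NARROW_PATH_METRIC = 1023   # MaxPathMetric for narrow metrics
MAX_WIDE_LINK_METRIC = 0xFFFFFF


def decode_narrow_default_metric(octet):
    """Split the 8 bit default metric field into its parts."""
    return {
        # Most significant bit: the S bit in ISO 10589, later reused by
        # RFC 2966 to mark leaked routes.
        "high_bit": bool(octet & 0x80),
        # Internal/External flag; its position (0x40) is an assumption here.
        "external": bool(octet & 0x40),
        # The 6 bit metric value, 0-63.
        "metric": octet & 0x3F,
    }


def path_within_narrow_limit(link_metrics):
    """True if the summed narrow link metrics stay within MaxPathMetric."""
    return sum(link_metrics) <= MAX_NARROW_PATH_METRIC


def wide_link_usable_in_spf(metric):
    """A wide link metric of 0xFFFFFF takes the link out of the SPF."""
    return 0 <= metric < MAX_WIDE_LINK_METRIC


print(decode_narrow_default_metric(0x8A))
# {'high_bit': True, 'external': False, 'metric': 10}
print(path_within_narrow_limit([63] * 17))  # False: 1071 exceeds 1023
```

The second call shows how tight the narrow range is: seventeen hops at the maximum link metric already overflow the path limit.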


o OSPFv2 allows a link to have a 2 byte metric field in the Router LSA, which implies a maximum metric of 0xFFFF.

o The Summary, Summary ASBR, AS-External and NSSA LSAs have a 3 byte metric value. A cost of 0xFFFFFF (LSInfinity) is used to indicate that the destination described in the LSA is unreachable.

o AS-External and NSSA LSAs allow two metric types, Type 1 and Type 2, which are equivalent to the IS-IS Internal and External metrics. Type 1 considers the cost to the ASBR in addition to the advertised cost of the route, while Type 2 uses just the advertised cost when calculating the routes during the SPF.

o The scheme thus allows for links to be configured with a metric no larger than 0xFFFF, while allowing cost of destinations injected across areas/levels to be as large as 0xFFFFFE.
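The difference between the two external metric types, together with the LSInfinity check, fits in a few lines. This is a simplified sketch of the comparison cost only; it leaves out tie-breaking and route preference rules, and the function name is made up.

```python
LS_INFINITY = 0xFFFFFF  # 3 byte "unreachable" cost


def external_route_cost(metric_type, advertised_cost, cost_to_asbr):
    """Cost used for an external route, or None if it is unreachable."""
    if advertised_cost >= LS_INFINITY:
        return None  # LSInfinity: destination is unreachable
    if metric_type == 1:
        # Type 1: internal cost to the ASBR plus the advertised cost.
        return cost_to_asbr + advertised_cost
    if metric_type == 2:
        # Type 2: only the advertised cost counts.
        return advertised_cost
    raise ValueError("metric_type must be 1 or 2")


print(external_route_cost(1, 20, 15))           # 35
print(external_route_cost(2, 20, 15))           # 20
print(external_route_cost(2, LS_INFINITY, 15))  # None
```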


o OSPFv3 allows the same metric size for the Router LSA as OSPFv2.

o It allows the same metric sizes for the Intra-Area Prefix LSA, Inter-Area Prefix LSA, AS-External LSA and NSSA LSA as OSPFv2 does for its equivalent LSAs. The value and significance of LSInfinity apply here too.