Internet, only a few weeks ago, had seen Pakistan Telecom Authority (PTA) hijacking the IP prefixes announced by Youtube, as protest against some videos that had been put up there, knocking millions off, all around the globe from accessing Youtube. I wrote about this here. This time its an ISP from USA and Europe, AboveNet (AS 6461) thats hijacked prefix announced and owned by Africa Online (AS 36915).

AboveNet inexplicably started announcing reachability to one of the prefixes (194.9.82.0/24) owned by Africa Online. It took AboveNet more than 22 hours since the problem was first reported, to fix it. Wonder what took them so long! As a result of this prefix hijack, potentially millions of users in Kenya or Africa, all behind 194.9.82.0/24, lost connectivity to the Internet. In isolation 194.9.82.0/24 is not a huge space, but add a couple of NATs and the number of users easily swells to millions. What this means for you and me, who are not being served by Africa Online is, that we lose connectivity to all the websites being hosted behind this IP address block. Imagine what it would do to the Internet if emergency services, banks, google were being hosted there!

Lets see why users in Kenya would lose total connectivity to the Internet:

A user accesses google.com with a (NATed or otherwise) source IP address 194.9.82.x. Google graciously responds, and the IP packet carries the destination IP 194.9.82.x. Because AboveNet has announced reachability to this IP address block, all traffic destined to 194.9.82.x comes to AboveNet where it gets royally dumped, while the user sitting in Kenya (or Africa) is still hopelessly waiting for the packet to arrive.

So, why are the service providers all over the world preferring the route announcement from AboveNet over the one originated from Africa Online?

Well, thats unfortunately how Internet, and my favorite routing protocol - BGP, works!

In BGP, the route advertisement from the provider which has a better vantage point on the Internet, usually wins.

In this particular case both AboveNet(AS 6461) and Africa Online (AS 36915) announced the route to 194.9.82.0/24 . AboveNet, operating from US, sits much closer to the core as compared to Africa Online, and is thus better connected to the other networks than the latter. The AS_PATH length thus seen by the other service providers for the route advertised by AboveNet is much shorter than the one advertised by Africa Online. As a result of this, other BGP speakers pick up the route advertised by AboveNet against the route advertised by Africa Online.

The figure below, constructed using BGPlay from RIPE NCC, shows a snapshot of the routing activity for 194.9.82.0/24 during the period when it was hijacked. The colored lines indicate the path different ASes would take to reach this prefix. Clearly most of the ASes believed AboveNet to be a better path for 194.9.82.0/24.

Vantage Point on the Internet

This is how BGP works and mind you, this isnt broken.

Whats broken is our disability to verify the claim of a service provider when it announces ownership of an address block in BGP. Restrictive route filtering can be applied where the providers only accept the specific prefixes allocated to the customers or where the upstream accepts only specific prefixes allocated to the ISPs, but this is too cumbersome and rarely works. As the matter stands today, there isn’t any clean way to know if the reachability announced by your friendly peer is genuine or whether the provider has a feasible path to the destinations advertised. This needs to be fixed and there is work going on towards this direction in the SIDR WG of IETF.

Can the service providers do something when they learn that an IP prefix has been hijacked by some AS? The answer, fortunately, to this is an unequivocal Yes.

A service provider can override BGP’s decision process in selecting the route advertisement with the shortest AS_PATH length by (i) manipulating the BGP path attribute like LOCAL_PREF since its checked before the AS_PATH length or (ii)  decreasing the weight of the offending peer you learn the hijacked route from (this would only work for routers connected  directly to AboveNet) (iii) Use regular expressions to filter all or the specific hijacked route advertisement from AS 6461 (AboveNet) so that the announcement from Africa Online wins. The legitimate route is now propagated to other parts of the world.

Each time an ISP inadvertently hijacks someone else’s address block it risks to lose some amount of credibility in the service provider world. Their names are splashed on the mails/PPTs in NANOG and IETF whenever there’s a discussion on interdomain security or on the blogs all over the world.

Fortunately for AboveNet, their hijacking didn’t throw millions off popular websites like Youtube, Google, Yahoo! etc. That would have attracted a LOT more attention than what this event did. When PTA had hijacked YouTube, it was all over the news and there were columns running in Wall Street Journal and New York Times about how tenuous the Internet architecture is. Also what unfortunately went in favor of AboveNet was that the affected users were not in US/Europe/Japan, but were in a relatively silent African subcontinent.

The Internet in mid 80s and 90s was envisaged to work on the IP destination based routing/forwarding paradigm. This meant that the routing protocols would establish the best paths based on some scalar metric like the hop count, or link costs, and all traffic would follow that path. This would work since all IP routing and forwarding was based on the IP address carried in the IP packets. With all due respect and credit to the Internet’s forefathers, the vision and the design did work, till a few years back, when the network operators started realizing that though the IP architecture was indeed scalable, it lacked the finesse to optimally utilize the network resources (particularly in the backbones).

The inadequate utilization of the network resources can be illustrated with the classic “fish problem“. It derives its name from the network (Fig 1) resembling a fish, with A being the head and G and H, the tail of a fish.

Figure 1

All traffic emanating (or passing through) from the tail, (G or H) towards the head (A) can take either of the two paths (F-D-C-B or F-E-B) based on how the IP routing tables are programmed by the routing protocols. The latter decide on the best path by considering the link costs advertised by each router in the network. In this example the total cost of path G-F-D-C-B-A is (10+5+5+5+10) 35, while the cost of the path G-F-E-B-A is 30. This means that all traffic from G to A (or G to B) would follow the path through router E (as shown by the red arrow), and the F-D-C-B path would remain unused, since the total cost associated with it is higher than the one through router E.

This leads to an extremely unbalanced traffic distribution, where the link F-E-B can get heavily overloaded, and at worst, congested, while F-D-C-B always remains idle. This problem arises because of the way IP routing paradigm works. Lets us see why:

o IP routing is destination based, so packets are only routed based on the destination IP address in the packet. Routing protocols typically install one next-hop for each IP address (except in case of equal cost routes, which we can ignore for the time being) or a range of IP addresses (subnet masks) thus all packets sharing the same destination address would all get routed to the same next-hop. This means that if F installs a route 100/8 with next-hop as E, then all IP traffic falling under 100/8 coming to F, would get routed to E. This can lead to unbalanced traffic distribution and create unnecessary congestion hot spots.

o Routers make a local decision, based on what they think is the most optimal path from their perspective, when selecting a path. Since all routers run the same SPF algorithm, with the same lin state database, they all come up with the same shortest path, which very soon turns congested, while the non-shortest path remains idle, and unused. This implies that to optimize the network utilization the routers must factor in some other things before chosing the path. One thing that comes instantly to mind is the total bandwidth available on each link when computing the path. If routers can somehow keep track of the available bandwidth available on each link, then it can distribute the traffic in a manner which can optimize the network resource utilization.

Coming back to our network, we find that all traffic from tail to head flows through the router E, leaving D and C idle. So, what can be done to fix this?

The operator can manipulate the link costs on path F-D-C-B, in a manner as shown below, to get the traffic to flow over it.

Figure 2

This clearly works, since cost of the path F-D-C-B (9) is now lower than path F-E-B (10). But hang on. What we’ve only achieved is moving the entire traffic from path F-E-B to F-D-C-B! The traffic would soon start congesting the latter link, while leaving the former unused. We have really achieved nothing, but have only moved the problem elsewhere. Clearly, this wouldn’t work. So, what else can be done?

Well, not much. A clever network operator can play around with the link costs on paths F-D-C-B and F-E-B such that both become equal, as shown in the figure below.

Figure 3

This would surely alleviate the problem as the two paths would now be equally used. However, this scheme of manually adjusting the link costs is not scalable and only works for small networks. Imagine replacing C, D and E with hundreds of routers, with a subset of them being connected to each other. The precarious scheme of adjusting the link costs would become too complex and too fragile to work. A single link (or a router) failure or a cost change would bring down the entire scheme of distributing the traffic across two paths.

The only scalable solution for the fish problem is by going beyond the realms of traditional IP routing and by providing mechanisms to explicitly manage the traffic inside the network. This new paradigm of routing is called constraint based routing, which essentially strives to compute the best path without violating any of the constraints imposed on the network, and at the same time, being optimal with respect to some scalar metric (hop count, links costs, etc). Once such a path is computed, it establishes and maintains forwarding state along such a path.

This differs from the existing IP routing paradigm which only tries to optimize (by minimizing), a particular scalar metric when computing the best path to a destination. Thus RIP optimizes the number of hops and OSPF/IS-IS, the total path cost, where total path cost is the sum total of individual cost of all the links along the path.

I would discuss more on constraint based routing in my subsequent posts.

In the previous post I explained how the SPF algorithm works and how its used in the link state routing protocols. Click here to know the difference between SPF run for OSPF and IS-IS.

After the SPF run, IS-IS would have an SPF tree with the shortest path to reach each Intermediate System in its level. Fair enough - but how do we determine the IP networks that we can reach?

We earlier saw that at each step of the algorithm, the TENT is examined, and the node with the least cost from the root node is moved into PATHS. When a node is placed in PATHS, all IP prefixes/networks advertised by it are installed in the IS-IS Routing Information Base (RIB) with the corresponding metric and next hop. The directly connected IS-IS neighbors of the node that just made it into PATHS are then added to TENT if they are not already there and their associated costs adjusted accordingly, for the next selection.

The SPF tree thus computed considers the Intermediate Systems as nodes of the graph and the IP addresses advertised by these as the leaves, hanging off the nodes. Thus, the entire shortest path tree in IS-IS does not need to be recomputed if the network changes involved are only related to IP prefixes. Instead, the router can run a partial computation to find an alternative IP prefix if one exists - this partial run is called the partial SPF. More details here and here.

The network topology is computed and determined by the adjacencies advertised in the IS-IS LSPs. We already know that a full SPF is only required when the network topology changes. This implies that only a loss of an IS-IS adjacency would trigger a full SPF. To cite an example, when a point-to-point link goes down, the router loses its adjacency with the neighbor at the other end. This signals a change in topology and, a full SPF is scheduled. OTOH, when a route redistributed from a different routing protocol or a level 2 route leaked into the level 1 route goes away, then it does not bring about any topology change. Because the IP prefixes are only the leaves of the SPF tree, and this does not flag a change in network topology, only the partial route computation (PRC) is run to find an alternative path, if one exists.

For the sake of brevity i would refer an Intermediate System (IS) in IS-IS parlance as a router in the following post - thus a router in this post could mean either an OSPF speaker or an IS-IS speaker or some other routing element from your favorite link state protocol. For the SPF algorithm to work, it would require *all* routers in the network to know about all the other routers in the network and the links connecting them. How a link state routing protocol encodes this information and ensures that its disseminated properly is left to that protocol. OSPF encodes this information in Link State Advertisements (LSAs) and floods it reliably, while IS-IS encodes this in a Link State Packet (LSP) that it originates.

Once each router knows about all the other routers and the links connecting them, it runs the Dijkstra Shortest Path First algorithm to determine the shortest path from itself to all the other routers in the network. Since each router has a similar copy of the link state database and each runs the same algorithm, they end up constructing the same view of the network and packets get routed consistently at each hop.

So, how does SPF algorithm work in OSPF and IS-IS.

Imagine a simple network as shown in the figure below.

Network Diagram

Once each router has flooded its link state information in the network, all routers know about all the other routers and the links connecting them. The link state database on each router looks like the following:

[A, B, 3], [A, C, 6], [B, A, 3], [B, D, 3], [B, E, 5], [C, A, 6], [C, D, 9], [D, C, 9], [D, B, 3], [D, E, 3] , [E, B, 5] and [E, D, 3]

Each triple should be read as {originating router, router its connected to, the cost of the link connecting the two routers}

So what does Router A do with this information and how is this used in SPF?

While running SPF, each router maintains two lists - the first is a list of nodes for which the shortest path has been determined and we are sure that no path shorter than the one we have computed can exist. This list is called the PATH (or PATHS) list. The second is the list of paths through the routers that may or may not be the shortest to a destination. This list is called the TENTative list, or simply the TENT list From now on, TENT would refer to the TENT list and PATH, to the PATH list.

Each element in the list is a triplet of the kind {endpoint router that we’re trying to reach, total distance from the calculating router, next-hop to reach the endpoint router}

Each router runs the following algorithm to compute the shortest path to each node:

Step I: Put “self” on the PATH with a distance of 0 and a next hop of self. The router running the SPF refers to itself as either “self” or the root node, because this node is the root of the shortest-path tree.

Step II: Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.

Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.

If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.

Step III: Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2. Stop only when TENT becomes empty.

Lets follow the sequence that Router A goes through for building its SPF tree.

1st Iteration of the SPF run

Step I: Put “self” on the PATH with a distance of 0 and a next hop of self. After this step the PATH and TENT look as follows:

PATH - {A, 0, A}

TENT - { }

Step II: Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Patently, A is the PATH node. Examine its list of neighbors ({A, B, 3}, {A, C, 6}). OSPF does this by looking at the LSAs advertised by Router A, while IS-IS does this by looking at the neighbors TLV found in Router A’s LSP. When an IS-IS node is placed in PATHS, all IP prefixes advertised by it are installed in the IS-IS Routing Information Base (RIB) with the corresponding metric and next hop.

The Step II further says - “Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.” A’s neighbors are B and C, and since neither of them is in the PATH or TENT, we add both of them to the TENT.

PATH - {A, 0, A}

TENT - {B, 3, A}, {C, 6, A}

Step II says - “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.”

Lets pick up the TENT node B. The cost to reach B would be the cost to reach from root node to PATH node + cost from PATH node to TENT node. In the first iteration of SPF, both the root node and PATH node is A. Thus total cost to reach B is cost to reach from A (PATH node) to B (TENT node) which is 3.

Step II says - “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.”

B isnt in the TENT, so skip this.

Step III says - “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat“. The lowest cost neighbor is B (with cost 3). Move this to PATH. We go back to Step II since TENT isnt yet empty.

2nd Iteration of the SPF run

PATH - {A, 0, A} {B, 3, A}

TENT - {C, 6, A}

We have thus added the neighbor B in PATH, since we know that there cannot be any other shorter path to reach it. And this is, as you will note, consistent with our definition of PATH wherein we had earlier stated that nodes can only be placed there once we are sure that there cannot be any shorter path to reach them from the root node.

We begin our 2nd iteration and go to Step II which says - “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors“. B is the PATH node and we examine its neighbors (A, E and D). Step II further says - “Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.”

Since A is already in PATH with ignore it and only add E and D to TENT.

PATH - {A, 0, A} {B, 3, A}

TENT - {C, 6, A} {D, 3, B}, {E, 5, B}

Step II says - “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.

Aah .. this means that cost against D and E would not be 3 and 5, as what i have shown, but would instead be 3 (cost from A to B) +3 (cost from B to D) = 6 and 3 (cost from A to B) +5 (B to E) = 8

Thus the PATH and TENT look as follows:

PATH - {A, 0, A} {B, 3, A}

TENT - {C, 6, A} {D, 6, B}, {E, 8, B}

The rest of the Step II does not apply here since D and E dont exist in the TENT.

Come to Step III which says “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

Lowest cost neighbor can either be C or D so pick on up randomly. It can mathematically be proven that we would end up with the same SPF tree irrespective of which equal cost neighbor is picked up from the TENT first. In our case, lets pick up C.

It is thus moved to the PATH

PATH - {A, 0, A} {B, 3, A} {C, 6, A}

TENT - {D, 6, B}, {E, 8, B}

3rd Iteration of the SPF run

Go back to Step II which says “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost

C’s neighbors are A and D. Since A is already in the PATH only D is added in the TENT.

PATH - {A, 0, A} {B, 3, A} {C, 6, A}

TENT - {D, 6, B}, {E, 8, B} {D, 9, C}

As per Step II we now need to fix the cost to reach D from the root node. This would be cost to reach from A to C (6) + cost from C to D (9) = 15

PATH and TENT now:

PATH - {A, 0, A} {B, 3, A} {C, 6, A}

TENT - {D, 6, B}, {E, 8, B} {D, 15, C}

Step II further says - “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.

node D already exists in the TENT (via B with cost 6) and since its with a lesser cost, we remove the node that we had just added from the TENT. This is because a lower cost path to reach node D already exists in the TENT.

PATH - {A, 0, A} {B, 3, A} {C, 6, A}

TENT - {D, 6, B}, {E, 8, B}

We come to Step III which says “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

Lowest cost neighbor is D - which means we move D now to the PATH and go to Step II, since the TENT isnt yet empty.

4th Iteration of the SPF run

PATH - {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT - {E, 8, B}

Step II says “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.”

To examine D’s neighbors we look at the link state information it advertised. It advertised the following information:

[D, C, 9], [D, B, 3], [D, E, 3]

This means that D is says that its connected to C, B and E. We ignore its connection to B and C, since they are already in PATH. We thus only add neighbor E in the TENT.

PATH - {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT - {E, 8, B} {E, 3, D}

Continuing with Step II which further says - “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.”

This means that we need to adjust the cost of the triple {E, 3, D} that we just added to the TENT. The cost to reach E via D would thus be the cost to reach D from A (which is the root node) + the cost to reach E from D. This comes out to be 6 + 3 = 9.

TENT thus looks like this - {E, 8, B} {E, 9, D}

Step II further says - “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration”

We just added node E in the TENT and a route to E already exists in the TENT, and its with a lower cost. This means that we remove the route that we had just added.

So PATH and TENT at this point look as follows:

PATH - {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT - {E, 8, B}

We go to Step III which says - “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

The lowest cost neighbor in TENT right now is E. We move this to PATH.

So PATH and TENT at this point look as follows:

PATH - {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B} {E, 8, B}

TENT - { }

Step III further states that we continue if and only if something remains in the TENT. TENT is now empty, which means that we have computed the shortest paths to all the nodes that A was aware of.

This is marks the end of the SPF algorithm run and the SPF tree that it has computed looks as follows

Network Topology after the SPF Run

Both OSPF and IS-IS use the Shortest Path First (SPF) algorithm to calculate the best path to all known destinations based on the information in their link state database. It works by building the shortest path tree from a specific root node to all other nodes in the area/domain and thereby computing the best route to every known destination from that particular source/node. The shortest path tree thus constructed, consists of three main entities - the edges, the nodes and the leaves.

Each router in OSPF or an Intermediate System in case of IS-IS, is a node in the SPF tree. The links connecting these routers, the edges. The IP network associated with an IP interface, added into OSPF via the network command is a node, while the IP address associated with an interface thats added in IS-IS is  a leaf. An IP prefix redistributed into OSPF or IS-IS from other routing protocols (say BGP)  becomes a leaf in both the protocols. Inter-area routes are patently, the leaves.

Network Diagram

If you consider the network as shown above, then OSPF would consider routers A, B, C and the network 10.1.1.0/24 as nodes. This is assuming that the interface associated with 10.1.1.0/24 has been added into OSPF. The only leaf in the graph would be the IP prefix 56.1.1.0/24 redistributed into A from some other protocol. IS-IS otoh, would consider routers A, B and C as nodes and networks 10.1.1.0/24 and 56.1.1.0/24 as leaves. This seemingly innocuous difference in representation of the SPF tree leads to some subtle differences between the SPF run in OSPF and IS-IS, which can interest a network engineer.

The nodes in the shortest path tree or the graph form the backbone or the skeleton of that tree. Any change there necessitates a recalculation of the SPF tree, while a change in a leaf of the SPF tree does not require a full recalculation. Removing and adding of leaves without recalculating the entire SPF tree is known as Partial SPF and is a feature of almost every implementation of OSPF and IS-IS that i am aware of. This implies that if the link connecting router C to 10.1.1.0/24 goes down, then a full SPF would be triggered in case of OSPF, and a partial SPF in case of IS-IS.

This shows that the general adage - “Avoid externals in OSPF” should be taken with a pinch of salt and it really depends upon your topology. I have seen networks where ISPs redistribute numerous routes that have a potential to change on a regular basis, as opposed to bringing them via the network command.

IS-IS

o IP routing is integrated into IS-IS by adding some new TLVs which carry IP reachability information in the LSPs. All IP networks are considered externals, and they always end up as leaf nodes in the shortest path tree when IS-IS does a SPF run. All node information, neccessary for SPF calculation is advertised in its IS Neighbors or IS Reachability TLVs. This unambiguously separates the prefix information from the topology information which makes Partial Route Calculation (PRC) easily applicable. Thus IS-IS performs only the less CPU intensive PRC when network events do not affect the basic topology but only affect the IP prefixes.

o Used narrow (6 bits wide) metrics which helped in some SPF optimization. However such small bits proved insufficient for providing flexibility in designing IS-IS networks and other applications using IS-IS routing (MPLS-TE). “IS-IS extensions for Traffic Engineering” introduced new TLVs which defined wider metrics to be used for IS-IS thus taking away this optimization. But then CPU are fast these days and there arent many very big networks anyways!

o SPF for a given level is computed in a single phase by taking all IS-IS LSP’s TLV’s together.

OSPFv2

o Is built around links, and any IP prefix change in an area will trigger a full SPF. It advertises IP information in Router and Network LSAs. The routers thus, advertise both the IP prefix information (or the connected subnet information) and topology information in the same LSAs. This implies that if an IP address attached to an interface changes, OSPF routers would have to originate a Router LSA or a Network LSA, which btw also carries the topology information. This would trigger a full SPF on all routers in that area, since the same LSAs are flooded to convey topological change information. This can be an issue with an access router or the one sitting at the edge, since many stub links can change regularly.

o Only changes in interarea, external and NSSA routes result in partial SPF calculation (since type 3, 4, 5 and 7 LSAs only advertise IP prefix information) and thus IS-IS’s PRC is more pervasive than OSPF’s partial SPF. This difference allows IS-IS to be more tolerant of larger single area domains whereas OSPF forces hierarchical designs for relatively smaller networks. However with the route leaking from L2 to L1 incorporated into IS-IS the apparent motivation for keeping large single area domains too goes away.

o SPF is calculated in three phases. The first is the calculation of intra-area routes by building the shortest path tree for each attached area. The second phase calculates the inter-area routes by examining the summary LSAs and the last one examines the AS-External-LSAs to calculate the routes to the external destinations.

o OSPFv3 has been made smarter. It removes the IP prefix advertisement function from the Router and the Network LSAs, and puts it in the new Intra-Area Prefix LSA. This means that Router and Network LSAs now truly represent only the router’s node information for SPF and woudl get flooded only if information pertinent to the SPF algorithm changes, i.e., there is atopological change event. If an IP prefix changes, or the state of a stub link changes, that information is flooded in an Intra-Area Prefix LSA which does not trigger an SPF run. Thus by separating the IP information from the topology information, we have made PRC more applicable in OSPFv3 as compared to OSPF2.

We’ve been reminded yet again, of how vulnerable the Internet architecture is, to malicious attacks and simple mis-configuration (oversight?) from the service provider side. A week ago, the Pakistan Telecommunication Authority released an order instructing the country’s 70 odd ISPs to block Youtube.com until further notice. The ISPs acting on this directive could have installed access-control lists (ACLs) on all their router interfaces dropping packets bound to this website, but they instead, chose a more convoluted way of blocking traffic bound to this site - they created a more specific route pointing to a NULL (or a discard) interface, thereby black-holing all traffic bound to this address. Fair enough, this is also a way to drop traffic since the former would entail augmenting all existing ACL filtering policies on all router interfaces. Whats intriguing is how this route “accidentally” leaked into BGP and got advertised to PCCW (AS 3491), the upstream provider providing services to Pakistan Telecom (AS 17557), from where it propagated further to other parts of the Internet.

Youtube advertises 208.65.152.0/22 and Pakistan Telecom advertised a more specific route, 208.65.153.0/24, to its provider PCCW . PCCW, like most ISPs, without validating the prefix announcements based on Regional Internet Registry (RIR) allocations or even Internet Routing Registry (IRR) objects, further propagated this BGP UPDATE to its peers. Within no time, the erroneous BGP announcement permeated across large parts of the internet, resulting in all traffic bound to Youtube, to go towards Pakistan, where it would end up getting blackholed. Youtube was thus successfully “hijacked” by Pakistan Telecom .

I dont intend to discuss, whether this was done maliciously or whether it was an error on part of the PT, but the whole saga raises uncomfortable questions on the security and frailty of the Internet as it works today. Currently, the larger ISPs tend to trust the network providers that they are connected to and work on the tenuous assumption that the smaller providers would not illegitimately “hijack” someone else’s IP address and will behave decorously and only announce the IPs that they own. Patently, this doesn’t seem to be working out.

An attacker can masquerade as a big financial website by “hijacking” all traffic bound towards the legitimate website, thereby wreaking havoc on the gullible users. It should be noted that this sort of attack is more potent and dangerous than the spam mails that we often receive in our mailboxes, asking us to update and enter our bank details. In the latter cases, an alert user can always detect, that the server on which the page is loaded does not belong to the bank or the financial institution that the mail purports to be. However when the IP address block is “hijacked”, there are no such defenses.

Internet connectivity is vital in todays age, with a lot of emergency services being routed over the Internet. Imagine a natural disaster or a terrorist attack followed by an IP address block “hijack”. The full repercussions of what all can be achieved with this kind of an attack makes your mind spin and wobble.

A timeline created by Renesys, which provides real-time monitoring services, says that it took about 15 seconds for large Pacific-rim providers to direct YouTube.com traffic to the Pakistan ISP, and about 45 seconds for the central routers on which most of the Internet traffic relies, to misroute the traffic.

So, what happened to Pakistan’s Internet Connectivity as it attracted zillions of bytes of data intended for Youtube.com? It probably went down as PT could not have handled that massive amount of data, rendering millions inside Pakistan without any Internet connectivity.

In the coming days i would discuss what BGP Prefix Hijacking is and what it entails to protect the Internet from this and the work being done in IETF wrt this.

Also read about AboveNet hijacking Africa Online here.

Most of us believe that using cryptographic authentication methods (MD5, etc) for the routing protocols running inside our networks really makes them very secure. Well, not really ..

We have posted an IETF draft that explains how each routing protocol can be exploited despite using the cryptographic authentication mechanisms endorsed by the IETF community.

To cite an example, a simple IP header attack on OSPF or RIP can result in the two adjacent routers bringing down the peering relationship between them. This can, in the worst case, blackhole a substantial amount of data traffic inside the network, something that will certainly not go well with the customers!

So how can OSPF adjacency be brought down?

OSPF neighbors on the broadcast, NBMA and point-to-multipoint networks are identified by the IP address in the IP header. Because the IP header is not covered by the MAC in the cryptographic authentication scheme as described in RFC 2328, an attack can be made exploiting this vulnerability.

R1 sends an authenticated HELLO to R2. This HELLO is captured and replayed back to R1, changing the source IP in the IP header to that of R2.

R1 not finding itself in HELLO would deduce that the connection is not bidirectional and would bring down the adjacency !!

We recently updated our IS-IS WG document “IS-IS Generic Cryptographic Authentication” based on discussions that we had on the IETF IS-IS WG mailing list and various other cryptographers and technologists offline

Currently the IS-IS specification allows only clear-text passwords and the HMAC-MD5 style of authentication. This is done via the authentication TLV 10. Our draft proposes a new authentication type to be carried within TLV 10, which we call the generic cryptographic authentication. It is generic because it can be used with any authentication algorithm for authenticating and verifying IS-IS PDUs.

This draft in particular explains how the HMAC-SHA authentication should be used within IS-IS. Since its a generic authentication scheme we needed a way to demultiplex the various authentication algorithms that could have been applied to the PDUs. Doing this by looking at the length of TLV 10 is incorrect and impossible since we can have multiple authentication algorithms returning the same hash length (E.g. Truncated Hashes!).

To solve this we have introduced the concept of a Key ID in IS-IS. Its basically an octet long unsigned number that would uniquely define an IS-IS security association, as manually configured by the network operator. The reciever would determine the active SA by looking at the Key ID carried in the incoming IS-IS PDU.

A Key ID is associated with an authentication algorithm (or the protocol) and a secret key (the password) at the very least. Most implementations (at least the ones that i am aware of!) would also associate a fixed lifetime with a key. A key can thus time out after some time, and the next one can take over. This ensures that the same key is not used in the network for a long time, thereby making it harder for the bad guy to guess the password.

Using Key IDs makes changing keys while maintaining protocol operation extremely convenient. One only needs to compute the authentication data with the parameters specified in the new key, and to change the value of the Key ID sent in the IS-IS PDU. The actual operation of these mechanisms is outside the scope of this post.

Read the draft for more details ..

The “OSPF HMAC-SHA Cryptographic Authentication” draft co-authored by me has been accepted as a WG document.

This IETF draft describes a mechanism for authenticating OSPF packets by making use of the HMAC algorithm in conjunction with the SHA family of cryptographic hash functions. Because of the way the hash functions are used in HMAC construction, the collision attacks currently known against SHA-1 do not apply. 
    
So why do we need this extension?

In the cryptographic authentication scheme described in RFC 2328, the OSPF routers on a common network/subnet share a secret key which is used to generate a keyed MD5 digest for each packet while a monotonically increasing sequence number scheme is used to prevent replay attacks. 
    
This isnt good enough as there have been recent reports about attacks on MD5 which raises concerns about the remaining useful lifetime of this scheme. Specifically, the researchers have been able to develop algorithms that can compute hash collisions for some applications of MD5. MD5CRK, was a distributed computing project to break the MD5 hash algorithm in a short period of time. The project closed down with the publication of the paper.
    
It was discovered that collisions can be found in MD5 algorithm in less than 24 hours, making MD5 insecure. Further research has verified this result and shown other ways to find collisions in MD5 hashes.  It should however be noted that these attacks may not necessarily result in direct vulnerabilities in Keyed-MD5 as used in OSPF authentication purposes, because the colliding message may not  necessarily be a syntactically correct protocol packet. However, there is a need felt to move away from MD5 towards more complex and difficult to break hash algorithms. 
    
This document therefore adds support for HMAC  construction to be used for authenticating OSPF packets. HMAC can be used, without modifying any hash function, for calculating and verifying the message authentication values. It verifies both the data integrity and the authenticity of a message. Because of the way the hash functions are used in HMAC construction, the collision attacks    currently known against MD5 and SHA-1 do not apply. 
    
By definition, HMAC requires a cryptographic hash function. We propose to use any one of SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512 [NIST] for this purpose to authenticate the OSPF packets. 
    
This document explains how HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384 and HMAC-SHA-512 shall work with OSPF. 

If the control and forwarding functions in a router can be separated independently, it is possible to maintain a router’s data forwarding capability intact while the router’s control software is restarted/reloaded. This functionality is termed as “graceful restart” or “nonstop forwarding”.  The router’s control software (the routing protocols and the signalling protocols) can stop and can restart for myriad reasons - SW error crashing the protocol task, a switchover to the redundant control card, or a planned shutdown as part of the operational maintenance - the list is endless. The idea behind graceful restart is to continue forwarding packets based on the snapshot of FIB just before the router restarted. The restarting router is assumed to be capable of preserving the FIB and some amount of control information (like cryptographic sequence number in case of OSPF).

OSPF   
        
 o Restarting OSPF router originates a Grace LSA (link local Opaque LSA)  specifying the ‘grace period’, thereby indicating to its neighbors the time, in seconds, that the neighbors (the helpers)  should continue to consider this router as fully adjacent. The helping neighbors enter into a state known as helping mode during this period. The onus falls on the helpers to detect a topological change during the grace period and acting accordingly.

o  In case of a planned restart, OSPF issues a Grace LSA to its neighbors on each restarting interface and sets the value 1, which is Software restart, in the Graceful Restart Reason TLV. In case of an unplanned outage, the router first issues a Grace LSA before sending out any HELLOs. Most implementations transmit the Grace LSAs multiple times, till an acknowledgement is heard from the neighboring routers.

o The helping router continues advertising the restarting router in its LSAs and other routers in the network never come to know of this event.

o Using standard OSPF procedures the helping routers establish adjacencies with the restarting router and synchronize their LSDBs. During the grace period, the restarting router receives its own self generated pre-restart LSAs. It accepts them as valid, and does not originate type 1 through 5 and type 7 LSAs, even after it transitions to a FULL state. The restarting router can run the SPF, but its not yet allowed to update the FIB.

o Once the restarting router and its helpers have synchronized their databases within the grace period, the former flushes its grace LSAs to signal successful completion of the gracweful restart procedure. The restarting router now reoriginates its router LSAs on all attached areas and the network LSAs on the segments, where its the DR. It now schedules a full SPF, calculates the routes, and updates the FIB.

o The restarting router had marked all the routes in FIB as stale before sending out the Grace LSAs. After graceful restart is over and it has recalculated the routes, it deletes all the routes marked as stale in the FIB. It can now reoriginate summary LSAs, type 7 LSAs and AS External LSAs as appropriate.

o When the helpers receive the flushes Grace LSAs, they exit the helper mode and revert back to normal OSPF procedures.

o OSPF automatically reverts back to standard OSPF restart from graceful restart if topological changes are detected or if one or more of the restarting router’s neighbors do not support graceful restart. 

o More details here.

IS-IS   
        
o Restarting router does not re-compute its own routes until it has achieved database synchronization with its neighbors.

o Uses restart TLV (type 211) in its IIH to obtain the graceful restart functionality. Grace period is decided as the minimum of the Remaining times of received IIHs containing a restart TLV with RA bit set.  Upon receiving this IIH, the helping routers would flood the complete sets of CSNPs onto the link and set the SRM flag on all its LSPs towards the restarting router.

o The restarting router can now clear its restart timer, since it has some helping routers that can provide a full set of link state information through the normal transmission and retransmission process.
          
o  During grace period, restarting router does not transmit self-originated LSPs and self-LSPs are not purged or modified. These restrictions are necessary to prevent premature removal of an own LSP and hence churn in other routers.   
          
 o Restart mechanism in IS-IS allows to establish adjacency without cycling through the normal operation of adjacency state machine.  

o If a timer on the restarting router expires before it recieves a full set of CSNPs from its helpers, the adjacency is reset, and a normal neighbor restart is attempted.
         
o  Usual database synchronization is achieved in situations where the neighboring routers of the restarting router do not support the restart TLV.   

o More details here.

Each interface in the link state protocols in given a metric or a cost, which is advertised with the link state information in the LSP or the LSA. The SPF algorithm uses this metric to calculate the cost and the nexthop to a destination. Metrics used are generally the inverse of bandwidth, thus a large bandwith capacity link would have a lower metric

IS-IS

o ISO10589 specified the metric to be 6 bits in size. Therefore, the metric value could only range from 0-63. This metric information was carried in Neighbor Reachbility TLV and the IP reachability TLV. Since it was only 6 bits wide, it was called the “narrow metric“. The maximum path metric MaxPathMetric supported is 1023. This in theory brought down the complexity of the SPF algorithm from O(nlog n) to O(n). But this isn’t a significant motivation any more since the CPUs are really fast these days. The metric size apparently was kept small to optimize search while doing the SPF. It also allowed two types of metrics - External and Internal.

o Soon , the “narrow” metric range was found to be too small for certain networks and new TLVs (Extended IP and Extended Neighbor Reachability) were introduced to carry larger metrics as part of the Traffic Engineering document. These were called Wide Metrics. The MaxLinkMetric value now is 0xFFFFFFand the MaxPathMetric, 0xFE00000.

The Extended IP reachability TLV allows for a 4 byte metric, while the Extended Neighbor reachability TLV allows for 3 bytes metric size. This is to enable the metric summarized across levels/domains to be as large as 0xFFFFFFFF while the link metric itself is no larger than 0xFFFFFE. If a metric value of 0xFFFFFF is used the prefix is not used in SPF computation.

o The original specification defined 4 kinds of narrow metrics - delay, default, error and expense. These were never implemented and all implementations only support the default metric. In ISO 10589, the most significant bit of the 8 bit metrics was the field S (Supported) which defined if the metric was to be considered or not. Confusingly, as most ISO documents are, an implementation was supposed to set this bit to 1 if it wanted to indicate that the metric was unsupported. Since only the default metric is used, the implementations must always set this bit to 0 when using the narrow metrics. Later RFC 2966 used this bit in the default metric to mark L1 routes that have been leaked from L1 to L2 and back down into L1 again.

Current implementations must generate the default metric when using narrow metrics and should ignore the other three metrics when using narrow metrics.

OSPFv2

o It allows a link to have a 2 byte metric field in the Router LSA which implies a maximum metric of 0xFFFF.

o The Summary, Summary ASBR, AS-External and NSSA LSAs have a 3 byte metric value. A cost of 0xFFFFFF (LSInfinity) is used to tell the destination described in the LSA is unreachable.

o AS-External and NSSA LSAs allow two metric types, Type 1 and Type 2 which are equivalent to IS-IS Internal and External metrics. The type 1 considers the cost to the ASBR in addition to the advertised cost of the route while the latter uses just the advertised cost while calculating the routes during the SPF.

o The scheme thus allows for links to be configured with a metric no larger than 0xFFFF, while allowing cost of destinations injected across areas/levels to be as large as 0xFFFFFE.

OSPFv3

o It allows similar metric size for the Router LSA as in OSPFv2.

o It allows similar metric sizes for Intra Area Prefix LSA, Inter Area Prefix LSA, AS-External LSA and NSSA LSA as in OSPFv2. The value and significance of LS Infinity is valid here too.

Both protocols have a field indicating the “type” of authentication. There are however differences in the two protocols. In IS-IS, the data associated with the authentication is of variable length. In OSPF it is fixed at 64 bits. 64 bits is sufficient for a password scheme, but would not suffice for a public key signature scheme, which would need a field several hundreds of bits long.  A way to circumvent this is by appending a ”message” digest at the end of the OSPF packet.
        
In OSPF there is a single password per link. A router is configured with a password for each link to which it is attached. It transmits that password when it transmits OSPF messages on that link. It expects all OSPF messages it receives on that link to have that password. In IS-IS, a router is configured with a transmit password on a link, which is the password it uses when it transmits IS-IS messages, as well as a set of acceptable receive passwords.   
             
On a P2P link a password scheme in which the receive and transmit passwords are different offers some security. If the passwords are the same, the intruder need only wait for the other router to transmit first, and the intruder will find out the password. Even with two passwords, an intruder can, with effort, discover the passwords.  
             
The reason IS-IS configures routers with a set of acceptable receive passwords, rather than a single receive password, is so that a link, such as a LAN, can be migrated from one password to another without disrupting the network. Since OSPF has single password per link, it is not possible to change the password in an operational network. The routers would all have to be brought down and locally reconfigured.   
             
One of the issues brought by the IS-IS proponents is apparently the big advantage that IS-IS has over OSPF from a security point of view as the former’s packets cannot be routed beyond the immediate next hop or can never be sourced by non-border routers. This can allegedly prevent a variety of potential DoS attacks as anyone can launch OSPF packet bombs in the others network. This apparent vulnerability to DoS attacks is because OSPF rides over IP rather than directly running over the link layer.    
               
Since all OSPF packets can be authenticated using MD5, all spurious OSPF packets can be dropped. But there can be times when MD5 can itself be a part of a problem because it takes significant CPU to check signatures and discard the packets. This is partly true but it is to be noted however that even if OSPF encapsulation is changed to L2, we would still have to support IP encapsulation for virtual links, so we would still have to do MD5.  Computing the MD5 used to be a problem in the past when CPUs were not that powerful. However, this is a non issue today, and most of the authentication can be done in the Hardware itself.
        
Moreover the system administrator can filter on the edges of the network to pry away all the OSPF messages coming from the edges. This will of course be done in addition to cryptography. 

There are issues in the HMAC-MD5 scheme as there have recently been reports about attacks on the collision resistance properties of MD5. MD5CRK, was a distributed computing project to break the MD5 hash algorithm in a short period of time. The project closed down with the publication of the paper. 
    
It was discovered that collisions can be found in MD5 algorithm in less than 24 hours, making MD5 very insecure. Further research has verified this result and shown other ways to find collisions in MD5 hashes. We thus need to move away from MD5 towards more complex and difficult to break hash algorithms. 

It is because of this that there are now proposals in both the IS-IS and OSPF WGs to support HMAC-SHA based authentication schemes for the respective protocols.

IS-IS

- IS-IS allows a Level-1 Area which is partitioned to be automatically repaired, by electing Partition Designated Level 2 routers and having a virtual link between them. The mechanism is not often implemented and requires an explicit tunnelling mechanism.
             
- Used in ISO IS-IS for connecting partitions of Level 1 Area over the Level 2 backbone.   
             
OSPF   
        
- Used for connecting physically separate area zeroes (0.0.0.0) to maintain contiguity of the backbone.

- Used for connecting remote areas to the backbone through other areas if direct physical connectivity is not possible. This enables an OSPF packet to be sent from one part of an remote isolated site to the main OSPF network.   
          
- For Virtual links to work, OSPF accepts packets which are have originated more than one hop away. This can lead to security concerns if the packets at the edge of the domain are not properly filtered.  

Initial Database Exchange  
        
For the SPF algorithm to work properly, all routers in the area should have the same database information on which the SPF algorithm works. The process of synchronization includes the “Initial Database Exchange” which is done when the adjacency is coming up and the asynchronous flooding when the Adjacencies are up.  

OSPF

- A master-slave relation is established to do the database exchange. Besides the MTU is exchanged in the database description packets before any database exchange starts.   
          
- The database exchange begins once the adjacency state reaches Exstart. On a broadcast links, the DR and BDR form adjacencies with all other routers on the network.  
          
- Only one DB Description packet can be unacknowledged at a time that is, the window size is 1. Each DB Description packet from the master is acknowledged by the slave. The slave sends its own DB Description packet with similar identifiers as the masters.  This is important because if you miss even one DD packet, then the routers have to start all over again.
          
- DB description packets containing the summary of LSA’s at each end are exchanged. Only when the entire summary is received by the neighbour can it tell which instance of the LSA is not there in the senders database.   
          
- An adjacency in OSPF is declared FULL/UP, when the entire database exchange is completed.   
          
- OSPF does not allow routers to resynchronize their link state database in the steady state. It is only done during the initial database synchronization or when network topology changes. However, there are techniques to do that. One such way is described in “OSPF Out-of-band LSDB resynchronization”  
          
IS-IS  
        
- The MTU check is done at the hello exchange time itself.   
        
- CSNP’s are sent by the DIS on a broadcast link. On a point-to-point link both the neighbours exchange CSNP’s with each other.    
          
- On point-to-point link all the LSP’s SRM flag is also set for the circuit, to indicate the LSP’s have to be sent over the circuit.   
      
- The CNSP’s are sent to reduce the actual flooding of all the LSP’s between the neighbours.   
          
- Multiple CSNP’s can be sent together. CSNP’s unlike DB Descriptions in OSPF are not acknowledged.   

- As the CSNP’s have a range of LSP-ID’s, and contain all the LSP’s in the database falling in that range. A neigbour on receiving a CSNP can know which LSP’s in the neighbour are newer, which older and which are absent. Based on this the neighbour can send newer LSP’s to the neighbour.   
          
- Link state database is continuously refreshed and synchronized because of the periodic CSNPs that are announced.  

Asynchronous Flooding  
        
Whenever any information in an the database changes, the information is to be exchanged with all other routers in the network. This is done by the flooding process:
             
OSPF  
        
- Uses reliable flooding mechanism for all link types.    
        
- Changed LSA’s are packed in LS Update packets and send over adjacencies to the neighbour, which unpacks the LSA’s. LS Acknowledgement packets are sent by the receiver, which informs the sender that the receiver has received the LSA.   
          
- The sender retransmits the LSA’s after the re-transmission interval if it does not get acknowledgements for them.   
          
- On a broadcast link LSUpdate packets are sent only to all-DR routers multicast address. The DR floods the LSUpdate packets to All-SPF-Routers.  
          
- Whenever a new DR/BDR is elected, it has to form adjacencies with all other routers in the network.   
          
- There is no difference in the asynchronous flooding procedures between OSPFv2 and OSPFv3.     
    
IS-IS   
        
- LSP’s are flooded as is across the area. They are not packed inside any other packet.   
          
- On broadcast links flooding is not done reliably. A changed LSP is flooded to all IS-IS routers, however no retransmissions occur.    
          
- The reliability in database exchange on a broadcast link is achieved by periodic database exchange. This is done as CSNP’s are sent periodically by the DIS, which initiates the entire database exchange process all over again.   

- As the DIS sends periodic CSNP, nothing different needs to be done when a new DIS is exchanged.   
          
- On a point-to-point link flooding is done reliably. LSP’s are flooded to the neighbour and if CSNP entry for the LSP is not received in a particular time interval, the LSP is re-flooded to that neighbour.

IS-IS

o Does not require any particular alignment of the PDU fields.

o Uses Type-Length-Value (TLV) encoded packets to advertise the routing information. The TLVs that are not supported/recognized are ignored by other IS-IS routers.

o LSPs are flooded intact with unrecognized TLV information making it very extensible. IPv6, GMPLS, etc. support is provided by simply adding a few more TLVs.

o TLVs can be nested as sub-TLVs providing even more flexibility for future extensions. Though the base spec does not use them but the newer drafts and RFCs have started using them (TE extensions, etc).

OSPFv2

o Uses fixed format packets with all fields aligned at 32-bit boundaries for faster processing of the OSPF packets (doesn’t really matter anymore as the CPUs are really fast these days!). This was primarily done because OSPF was meant to be an IPv4 only protocol.

o The downside of the above is that the packet formats are not at all extensible. We had to come out with OSPFv3 when we wanted to provide support for IPv6.

o It uses Link State Advertisements (LSAs) for advertising the routing information and the original specification called for dropping any unrecognized LSA type.

o LSAs of type 9, 10 and 11 have been introduced for advertising other application specific information and enough vendors now support this so that they are likely to get from one side of the network to the other.

o Since the unrecognized LSA types are not flooded to neighbors it makes it very difficult to extend OSPF. This in turn means that all the OSPF routers must be upgraded network-wide to make the new extensions work.

o The new RFCs (Traffic Engineering, GMPLS extensions, etc) written for OSPF now support TLV encoding.

OSPFv3

o Exhibits implicit opaque LSA behaviour i.e. unrecognized LSA types are flooded to the neighbors making it more extensible that OSPFv2

o Designed in a way which makes it easily extensible to any other layer 3 protocol suite.

The MTU of a sub-network is the largest size packet or frame, specified in octets that can be sent over it. Both OSPF and IS-IS require communicating routers to have matching MTU sizes in order to form adjacencies. This is needed so that routers will not advertise packets larger than a neighbor can receive and process. However, each protocol uses a different mechanism to check against MTU mismatch.  For this discussion the term MTU is used for a links Maximum Receive Unit (MRU) too.  
        
IS-IS   
        
- IS-IS works over the link layer, which does not provide for fragmentation and reassembly.   
          
- Hello’s are sent padded to MTU size till an adjacency comes up. If there is an MTU mismatch, the side having the lesser MTU would drop the bigger than MTU hello. This would not allow adjacencies to be formed between interfaces having different MTU’s.   
          
- The hello MTU match is an insufficient condition for IS-IS as LSP’s are flooded as is and not packed into any other packets. For the LSP’s to be successfully synchronized across the subdomain, all LSP’s need to be of a size lesser than the smallest link MTU in the subdomain, else the flooding of the LSP on the link will fail resulting in inconsistent routing tables.    
          
- Mis-configuration of the maximum packet size that a router sends out can cause problems across the subdomain as there is no way to check the value between routers that are not adjacent.            

OSPF  

- OSPF works over IP, so the fragmentation and reassembly of any OSPF packet is taken care by the IP layer. However for some link technologies where MTU is configurable but not negotiated, we can have packet black-holes whenever packets larger than the receiving sides MTU are sent.   
          
- The MTU is exchanged in the database description packets. If the value of MTU received in the first DB description packet is greater than that can be accepted on an interface, the packet is rejected and the adjacency is not formed. Retransmissions of DB description packets occur because the packets are never acknowledged. The adjacency therefore gets stuck in EXstart state.   

- As LS Update’s are assembled in each router, the MTU of another link does not affect the size of the LS Update packet.   
          
- As the MTU match is done at the database exchange state after the DR election has been completed, in case the DR itself cannot form adjacencies with the rest of the routers, it can cause the network to become a stub. 

Changing area-id for an area is useful for link state routing protocols in order to merge two areas into one or to split an area into several areas.  
             
IS-IS   
        
- An area address is a variable length quantity.  
        
- An area can have multiple area addresses. Neighboring IS’s will not form an adjacency unless they have a single area address in common.  This is quite useful for IP networks that are transitioning from one area address to another, merging two areas into one or even to split an area into several pieces.   
          
- Seamless transition of area addresses for an area is easier in IS-IS, e.g. initially an area can have area adress A, then the set {A, B} and when the new area address B is recognized by all the routers in the area, old area address A can be removed.  

OSPF

- In OSPF each area has a single ID, a 4-byte quantity.   
        
- OSPF does not have the ability to merge and split areas dynamically as IS-IS has, though partitioned backbone can be repaired by using virtual link. But it should be ensured that the area through which virtual link is configured is having full routing information, i.e. it should not be a stub area.   
          
- Area-id can not be changed dynamically in case of OSPF.   

The maximum size of an LSP is 1492 bytes.

Available space = 1492 - 27 (Header) = 1465 bytes for TLVs.

Thus an IS-IS router has theoretically up to 256*1465 of space to pack IP reachability TLVs.

The following calculation enables us to determine the number of IP prefixes that can be advertised in an LSP.

The following constraints are to be considered in the calculation:

The maximum size (maxLSPsize) of an LSP is 1492 bytes.
The LSP header (lspHeadersize) is 27 bytes.
The maximum length of a TLV (maxTLVlength) is 255 bytes.

Each TLV 128 consists of type (1 byte), length (1 byte), and IP prefixes (n x 12 bytes) up to total of 255 bytes. The maximum number of fragments of an LSP (maxLSPfragments) is 256.

The number of fragments is determined from the 1-byte LSP Number field in the LSP identifier.

The first fragment contains other TLVs, and the remaining 255 fragments are packed with only TLV 128.

The actual calculation is as follows:

The total space available for TLVs in an LSP is TLVSpace = maxLSPsize - lspHeadersize = 1492 - 27 = 1465 bytes

The number of TLVs that can fit into TLVSpace is 1465/255 = 5.7, approximately 6

Assuming a 1-byte Type field and 1-byte Length field, overhead for 6 TLVs is 6 x 2 = 12 bytes.

Actual space available for prefixes is 1465 - 12 bytes overhead = 1453 bytes

Number of prefixes, each 12 bytes (address + subnet mask + metric) that can fit into TLVSpace is 1453/12 = 121.08 (approximately 121 IP prefixes per LSP)

Considering that few other TLVs can be generated by the router, the number of IP prefixes that can be supported per IS-IS router is 256 fragments, each containing 121 prefixes, for a total of 30,976 prefixes.

This seems to be the favorite question that every newbie has! There is no unequivocal answer to this and it all depends upon the kind of network and the topology. Having said this, lets try to see how the two IGPs can be compared.

IS-IS

This protocol is limited by the maximum number of LSPs that each IS-IS router can issue. This is 256 as its LSP ID is 1 octet long. The total number of IP prefixes carried by IS-IS can be easily computed and it comes to O(31000). However, RFC 3786 describes mechanisms to relax the limit on the number of LSP fragments, thereby increasing the number of IP prefixes that can be carried within IS-IS.

I have however, never seen any network carrying more than O(30K) IP prefixes inside IS-IS. Do let me know if you’re aware of one that does!

I say this because this is a reasonable number for any sane IS-IS deployment and it will not run out of space unless someone actually injects the entire BGP feed into the IGP. In that case we will run out of space at about 20% of the way into redistribution and not be able to advertise the rest. It is for this reason that this practice has now been deprecated and the RFC 1745 which lays down the rules for BGP- OSPF interaction, has been moved to the HISTORICAL status.

There are 8 bits to define a pseudonode number in the LSPID which means that a router can be a Designated Intermediate System (DIS) for only 256 LANs. Additionally there is also a limitation on the number of routers that can be advertised in pseudonode LSP of the DIS. Dont worry - RFC 3786 fixes this!

RFC 3373 OTOH proposes a new TLV thats carried in the IIH PDUs that can increase the number of point-to-point adjacencies from 256 on a single router.

The “Remaining lifetime” field which gives the number of seconds before LSP is considered expired is 16 bits wide.

This gives the life time of the LSP as 2^16/60/60 Hrs = 18.7 Hrs

Thus the LSP issued by a router needs to be refreshed after every 18.7 Hrs.

OSPF

In theory, OSPF topology is limited by the number of links that can be advertised in the Router LSA as each router gets only one Router LSA and it cant be bigger than 64K which is the biggest an IP packet can be. The same constraint applies to the Network LSA also.

Each link in the router can take up at most 24 bytes. Thus, number of links which can be supported is given by (64 * 1024) / 24 = 2370

However, if we take the minimum link size per link (12 bytes) then the maximum is about 2 * 2370 = O(5000) links

To be more specific, we can have O(2300) p2p and p2mp links (not considering virtual links, etc) and O(5000) broadcast/NMBA links described in OSPF’s Router LSA and its Network LSA.

Thus each Router LSA can carry some 5000 links information in it. It is hard to imagine a router having 5000 neighbors but there are already routers with 400 neighbors in some ISPs, and it may not take long to reach the order of the magnitude limited by OSPF.

The Network LSAs are generated by the designated router (DR) for each broadcast network it is connected to. To have scaling problems it should have 2730 * 6 times neighbors on that interface. This is even less probable and hence there are no scalability problems with OSPF per se.

All other LSAs apart from Type 1 and Type 2 hold single prefixes. Because there is no limit to the number of such LSAs, a large number of inter-area and externals can be generated depending upon the memory resources of the router.

Each LSA has an LS Age field which is counted upwards starting from zero. Its life is an architectural constant which says one hour. When an LSA’s LS age field reaches MaxAge, it is reflooded in an attempt to flush the LSA from the routing domain. One hour seems like a long time but if one originates 50,000 LSAs then OSPF will be refreshing on an average of just 36ms!

Total number of LSAs to be refreshed = 50,000

Time by which all the LSAs must be refreshed = LSRefreshTime = 30mins = 1800 secs

Rate at which the LSAs need to be refreshed = 1800/50000 = 36ms

However, if the refreshes are perfectly spread out across time and perfectly batched, the actual update transmission rate may be on the order of one packet per second.

There is however a “do-not-age” LSA which in theory can be pressed into service and which never gets aged. However, such LSAs will be eventually purged from the LS database if they become stale after being held for at least 60 minutes and the originator not reachable for the same period. Moreover it is not backward compatible and if one deploys that in the network today with some routers not supporting this then the network can really get weird. So there isn’t really much that can be done using these unless the whole network is changed!

Both the routing protocols are scalable and there should not be any scalability issues with any one of them if implemented properly. Both have similar stability and convergence properties.

This post compares how the two link state protocols hold their routing information in their databases as this affects their behavior in how they flood/distribute the change of routing information and the internal implementation complexity.

OSPF

o Organization of Routing Information

OSPF encodes the routing information into small chunks, which it calls Link State Advertisement (LSA). Each LSA has its own 20-byte header in order to be identified uniquely. This header is called the LSA Header. There is no limitation on the size of a LSA, though the actual LSA size is limited by IP packet size limitation: 65,535 bytes minus the LSA Header size and IP packet header size. The database access in OSPF is per LSA basis.

In OSPF routing, the information within an area is described by type 1 and type 2 LSAs (known as Router-LSA and Network-LSA respectively). These LSAs can become big depending upon the number of adjacencies to be advertised and prefixes to be carried inside an area. In other words, the routing information with respect to a single node (either router or network node) is encoded inside a single LSA. On the other hand, each inter-area or external prefix is advertised in a separate LSA (AS-External LSA).

An OSPFv2 router may originate only one Router-LSA for itself, while in OSPFv3, a router is allowed to originate multiple Router-LSAs. A router may originate a Network-LSA for each IP subnet on which the router acts as a designated router (DR). A router may originate one LSA for each inter-area and external prefix, with no limitations on the number of LSAs that it may originate.

Consequences

Originating a new and a unique LSA for each inter-area route and an external prefix implies that there is a LSA Header overhead involved while the information is kept in the database or is flooded to the neighbors. There is thus some extra memory and bandwidth consumed in total.

o Carrying Routing Information

LSAs are carried in Link State Update packets (called LS Updates or LSUs). Each LS Update packet has its own header, consists of a 24 byte OSPF protocol header, and a 4-bytes field indicating the number of LSAs contained in the packet. Thus multiple LSAs can be packed into a single LS Update packet. Some implementations may not do this as its considered difficult achieving this during flooding.

Consequences

In the face of network changes, OSPF floods only the updated LSAs. Therefore, even if an implementation does not pack multiple LSAs into a single LS Update packet (and so bandwidth is consumed by LS Update header for each update of a single LSA), the bandwidth consumption for each network change can be considered adequately small.

IS-IS

o Organization of the Routing Information

In IS-IS, protocol packets are called Protocol Data Units or PDUs. IS-IS encodes the link state information into the set of TLVs and packs these TLV