Author Archives: Manav

Its time we retire Authentication Header (AH) from the IPsec Suite!

Folks who think Authentication Header (AH) is a manna from heavens need to read the Bible again. Thankfully you dont find too many such folks these days. But there are still some who thank Him everyday for blessing their lives with AH. I dread getting stuck with such people in the elevators — actually, i dont think i would like getting stuck with anybody in an elevator, but these are definitely the worst kind to get stuck with.

So lets start from the beginning.

IPsec, for reasons that nobody cares to remember now, decided to come out with two protocols – Encapsulating Security Payload (ESP) and AH, as part of the core architecture. ESP did pretty much what AH did, with the addition of providing encryption services. While both provided data integrity protection, AH went a step further and also secured a few fields from the IP header for you.

There are bigots, and i unfortunately met one a few days ago, who like to argue that AH provides greater security than ESP since AH covers the IP header as well. They parrot this since that’s what most textbooks and wannabe CCIE blogs and websites say. Lets see if securing the IP header really helps us.

When IPsec successfully authenticates the payload, we know that the packet came from someone who knew the authentication key. I would wager that that should be enough to accept the packet. The IP header is just required to route the packet to reach the recipient – its not meant to do anything else. Thats networking 101 really.

IPsec Security Associations are established based on the source and destination addresses and some L4 port information. The receiver matches the incoming packet’s against SPI and inbound selectors associated with the SA. Packet is only accepted if it came from the correct source and destination IP address. If an attacker somehow manages to change the IP header then there are high chances that it will get rejected by IPsec since it will fail the Security Policy Database (SPD) check.  So, what is protecting the header really giving us?

BTW ESP can also protect the IP header if its used in the tunnel mode. So, if someone is really keen on protecting the IP header then ESP in the tunnel mode can also be used. It should however be noted that ESP tunnel mode SA applied to an, say IPv6 flow,  results in at least 50 bytes of additional overhead per packet. This additional overhead may be undesirable for many bandwidth-constrained wireless and/or satellite communications networks, as these types of infrastructure are not over provisioned.

Packet overhead is particularly significant for traffic profiles characterized by small packet payloads (e.g., various voice codecs). If these small packets are afforded the security services of an IPsec tunnel mode SA, the amount of per-packet overhead is increased.

This issue will be alleviated by header compression schemes defined in the IETF.

I have recently published an IETF draft where i explicitly ask for AH to be retired since there is nothing useful that it does that cant be achieved with ESP with NULL encryption algorithm.

Please note that i have absolutely no complaints with AH and the claims that it makes. It does its job really well. Its just that its completely redundant and the world can certainly do with one less protocol to manage.

Retiring AH doesn’t mean that people have to stop using AH right now. It only means that in the opinion of the community there are now better alternatives. This will discourage new applications and protocols to mandate the use of AH. It however, does not preclude the possibility of new work to IETF that will require or enhance AH. It just means that the authors will have to do a real good job of convincing the community on why that solution is really needed and the reason why ESP with NULL encryption algorithm cannot be used instead.

The IETF draft that i have written aims to dispel several myths  surrounding AH and i show that in each case ESP with NULL encryption algorithm can be used instead, often with better results.


Life of Crypto Keys employed in Routing Protocols

Everyone knows that the cryptographic key used for securing your favorite protocol (OSPF, IS-IS, BGP TCP-AO, PIM-SM, BFD, etc)  must have a limited life time and the keys must be changed frequently. However, most people don’t understand the real reason for doing so. They argue that keys must be regularly changed since they are vulnerable to cryptanalysis attacks. Each time a crypto key is employed it generates a cipher text. In case of routing protocols the cipher text is the authentication data that is carried by the protocol packets. Its alleged that using the same key repetitively allows an attacker to build up a store of cipher texts which can prove sufficient for a successful cryptanalysis of the key value. It is also believed that if a routing protocol is transmitting packets at a high rate then the ”long life” may be in order of a few hours. Thus it’s the amount of traffic that has been put on the wire using a specific key for authentication and not necessarily the duration for which the key has been in use that determines how long the key should be employed.

This was true in the Jurassic ages but not any more. The number of times a key can be used is  dependent upon the properties of the cryptographic mode than the algorithms themselves. In a cipher block chaining mode, with a b-bit block, one can safely encrypt to around 2^(b/2) blocks. AES (Advanced Encryption Standard)  used worldwide has a fixed block size of 128, which means that it can be safely used for 2^(64+4) bytes of routing data. If we assume a protocol that sends 1 Gig (!!) worth of control traffic *every* second, even then it is safe enough to be used for around 8700 *years* without changing the key! Hopefully, the system admin will remember to change the crypto key after 8700 years! ;-)

So, if the data is secure then why do we really need to change the crypto keys ever?

As a general rule, where strong cryptography is employed, physical, procedural, and logical access protection considerations often have more impact on the key life than do algorithm and key size factors. People need to change the keys when an operator who had access to the keys leaves the company. Using a key chain, a set of keys derived from the same keying material and used one after the other, also does not help as one still has to change all the keys in the key chain when an operator having access to all those keys leaves the company. Additionally, key chains will not help if the routing transport subsystem does not support rolling over to the new keys without bouncing the routing sessions and adjacencies.

Another threat against a long-lived key is that one of the systems storing the key, or one of the users entrusted with the key, could be subverted. So, while there may not be cryptographic motivations of changing the keys, there could be system security motivations for rolling or changing the key.

What complicates this further is that more frequent manual key changes might actually increase the risk of exposure as it is during the time that the keys are being changed that they are likely to be disclosed! In these cases, especially when very strong cryptography is employed, it may be more prudent to have fewer, well controlled manual key distributions rather than more frequent, poorly controlled manual key distributions.

To summarize, operators need to change their crypto keys because of social and political, rather than scientific or engineering driven reasons.

You can read more about this in the IETF draft that i have co-authored here.


Issues with how BFD is currently implemented over LAGs

The BFD standards dont explicitly talk about how BFD should be implemented on Link Aggregation Groups (LAGs). This leaves a lot of room for imagination and vendors have implemented their own proprietary mechanisms to make BFD work on LAGs. Now, there is only this much room for innovation and most vendors have naturally arrived at similar techniques to implement interoperable BFD over LAGs. So, what makes BFD so sticky to implement on LAGs?

BFD being an L3 protocol, is oblivious to the physical link that the BFD packets go out on. Usually, there is only one link associated with an L3 interface, and there is thus no ambiguity on the link that packet needs to go out on. However, when an IP interface is configured over a LAG, there are multiple constituent links that the packet can go out on, and BFD has to decide the link it wants to use for sending the packets out.

A LAG binds together several physical ports between two adjacent nodes so they appear to higher-layer protocols and applications as a single, higher bandwidth “virtual” pipe.

The problem with running BFD over a LAG is that without internal knowledge of the LAG structure it is impossible for BFD to guarantee a detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module. LAG timers are typically multiple times slower than the BFD detection timers (multiple 100ms vs. multiple 10ms of BFD). There is thus a need to bring some sort of determinism in how BFD runs over a LAG. There is also a need to detect member link failures much faster than what Link Aggregation Control Protocol (LACP) allows.

Lets look at what implementations currently do to implement BFD on LAGs.

The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one “big, virtual pipe”.

Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here.

Some implementations send BFD packets only over the “primary” member link of the LAG. Others spray BFD packets over all member links of the LAG. There are issues with both these designs.

In the first design, BFD will remain Up as long as the primary link is alive. If the primary link goes down, and another link is not selected as the primary, before BFD times out (around 30-50ms), then the BFD session on the LAG comes crashing down. Problems arise as BFD, in this design, is oblivious to the presence of other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected as it can still send and receive BFD packets over the primary link. Since the BFD session is Up, other routers in the network continue sending traffic meant to egress out of this interface. As expected from the LAG, all traffic egressing out of this interface gets load distributed on all LAG member links. Now, there is one link thats down. All traffic sent over that failed link gets dropped, till the LAG manager detects this and removes it from the LAG.

In the second design, BFD packets are sprayed over all the member links of a LAG. This is done naively via round-robin, where each BFD packet is sent using the subsequent member link, in a round-robin fashion. It solves the problem of BFD going down because of the primary link going down, but it still does not solve the problem of traffic getting lost when one of the member link goes down. This is because, when a member link goes down, BFD remains up and traffic continues to go over the link that has failed till a higher layer protocol (usually LACP or the LAG Manager) detects this and removes the offending link from the LAG.

The above two designs defeat the core purpose of a BFD, which is to detect faults between the two forwarding engines. In each design traffic gets lost on a failed link till some protocol other than BFD detects this and removes that link from the LAG. The timers associated with the other protocol are an order of magnitude higher than BFD.

Operators have since long expressed a need to be able to detect the failed links fast so that their traffic doesnt get lost. The idea is to get BFD to take charge of the LAG and make it responsible for maintaining the list of active links in a LAG. This way we can use the BFD fast timers to quickly detect link failures.

One could argue that there are native Ethernet OAM mechanisms (.1ag, .3ah) that can be used to detect link failures in a LAG, and one need not rely on slow protocols like LACP or the LAG manager. The reality is that operators who have deployed BFD in their IP/MPLS networks want a common failure detection mechanism and dont want a mix of different technologies.

To solve the above mentioned issues I have co-authored an IETF document that proposes running BFD on each constituent link of the LAG. We call the BFD sessions running on each link a “micro BFD session”. We call this mode of BFD on LAGs as BLM – BFD on Lag Members.

BLM is an umbrella BFD session that contains information about the LAG (or the aggregated interface) that its running on. It consists of a set of micro BFD sessions that are running on each constituent link of the LAG. And it contains a state that we call the “Concluded State”, which describes the overall state of the LAG (Up, Down, AdminDown).

Each micro BFD session is a regular RFC 5880 and RFC 5881 compliant BFD session. Only Asynchronous mode is supported for the micro BFD sessions as the sole reason for running BFD on each member link is to verify the link connectivity. The Echo function for the micro BFD sessions is not recommended as it requires twice as many packets to achieve a particular Detection time as does the pure Asynchronous mode.

At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions. They use their own unique, local discriminator values, maintain their own set of state variables and have their own independent state machine. Typically each micro BFD session will have the same timer values, however, nothing precludes the possibility of having different timer values among the different micro BFD sessions belonging to the same LAG.

A session begins with the periodic, slow transmission of BFD Control packets. When bidirectional communication is achieved, the BFD session becomes Up. The LAG manager is informed at this point, and the member link becomes an active link of the LAG.

If the micro session goes Down, the transmission of Control packets goes back to the slow rate. The LAG Manager is informed which removes the member link from the LAG.

Once a session has been declared Down, it cannot come back up until the remote end first signals that it is down (by leaving the Upstate), thus implementing a three-way handshake.

A session MAY be kept administratively down by entering the AdminDown state and sending an explanatory diagnostic code in the Diagnostic field.

In short, its pretty much the same as a standard BFD session.

This solves the issues that i had described in the earlier two designs. The micro BFD sessions will quickly detect a failed link, and will instantly remove it from the LAG. Traffic that was earlier egressing out over the failed link, will now get hashed to a different link in the LAG. This results in zero traffic loss on the LAG.

You can read more about our proposal here.

Recognizing the need for running BFD on all member links, various vendors support their own proprietary, un-interoperable implementation of BFD over LAGs. We’re hoping that our IETF proposal to standardize this behavior will bring some order to the chaos thats out there and a relief to the providers who are stuck with proprietary solutions.


So what are inter-session Replay attacks?

Inter-session replay attacks are extremely hard to fix and most IETF routing and signaling protocols are vulnerable to them. Lets first understand what an inter-session replay attack is before we delve deeper into how we can fix them.

A reply attack is a type of an attack where the attacker captures the packets exchanged between two routers and later retransmits, or “replays”, this same packet back to the routers and thereby deceiving them into believing that this is a legitimate packet sent by their remote neighbor. Lets see how this will work:

Assume router A is sending an integrity protected (via some authentication mechanism) protocol packet to router B. The attacker can record the packet that A is sending. The attacker now waits for some time and retransmits this packet without any modification, back to B. B upon receiving this packet will as usual first try to verify the contents for any tampering. It will do this by verifying the authentication digest (usually Keyed-MD5 or HMAC-SHA) that the packet carries. Since the attacker has not modified the packet it will pass the integrity check as long as the key exchanged between the two routers remains unchanged. The integrity checks will pass on Router B and it will accept this packet as a legitimate packet sent by A.

This is a replay attack – So, how can it harm you?

Assume A was not advertising any route, or any neighbor reachability when the attacker had recorded this control packet.  In OSPF parlance, this could be a Hello without any neighbors or a RIPv2 packet without any routing information. Later when A learns some routes or neighbors it sends an updated protocol packet listing this information. B receives this packet and updates its protocol state and routing tables based on the information that A provides. Now the attacker replays the earlier recorded packet. B, upon receiving this “new” packet believes this to have come from A and updates its routing tables accordingly. This is incorrect as B will now update its forwarding tables based on stale information. If the replayed packet is an old OSPF Hello when A did not have any neighbors, B will, upon receiving this packet assume that A has now lost all its neighbors and will delete all routes via A. I had co-authored RFC 6039 some time back which describes many such replay attacks in great detail.

So, how do IETF protocols protect themselves from such attacks?

Most protocols packets carry a Cryptographic Sequence Number that increases as each packet is sent. The receivers only accept a packet if it carries a sequence number that is higher than what it had received earlier from the same neighbor.

This fixes the problem that i had described earlier as the replayed packet will carry a sequence number that will be lower than what B would have last heard from A. B, upon receiving this replayed packet will not accept it and would thus prevent itself from such replay attacks.  Its appears that we have a solution against all replay attacks – do we?

Well it turns out that the answer to this question is a big NO!

The cryptographic sequence number can protect us from what i call the intra-session replay attacks. However, it cannot protect us against inter-session replay attacks. Let see why?

Assume that the cryptographic sequence number currently being used by router A for some specific routing protocol is 1000. This means that B will not accept any protocol packet if it comes with a sequence number less than 1000. This is fine, and this will protect us against some attacks. Now assume that the attacker captures and records this packet with sequence number 1000. No one will know about this as the attacker has silently recorded this packet.

Now the attacker has to wait patiently till the current session between the Router A and Router B goes down and a new one is established. This can happen if one of the routers reboots (could be planned or unplanned). When this happens the routers reset their cryptographic sequence number to 0 and start all over again. If the password key  between the two routers has not changed, and it usually doesn’t, then the packet that the attacker has captured is carrying a valid cryptographic digest. The attacker can replay this packet any time and this will get accepted by B if the current sequence number that its seeing in the new session from A is less than 1000. This is an inter-session replay attack and is extremely difficult to fix with the current IETF security and authentication mechanisms. Note that a trivial way to protect against inter-session replay attacks is by changing the key each time a new session is established. However changing the key requires manual intervention and thus cannot be easily done all the time.

So, how do you fix this issue?

Sam Hartman (Huawei), Dacheng (Huawei) and I have submitted two proposals in the IETF to fix this inter-session replay attacks that i have described above.

The first is extremely simple.

We propose to change the current cryptographic sequence number space from 32 bits to 64.  The least significant 32 bits would be the usual cryptographic sequence number that will monotonically increase with each fresh packet transmitted. The most significant 32 bits would indicate the number of times this router has cold booted. Thus when the router initially comes up for the first time its value would be 0. Next time when it reboots and comes up, its value would be 1.

Consider a state when the router has cold booted “n” times and its current cryptographic sequence number is “m”. The aggregated cryptographic sequence number that will be used by the routing protocols would be:

m << 32 || n, where << is the left shift operator and || is the bit-wise OR operation.

Now this router reboots (again planned or unplanned).

Now its cryptographic sequence space starts from:

(m+1) << 32

Its trivial to see that the ((m+1) << 32) > ((m << 32) || n) for all values of m and n where each m and n > 0.

This mechanism will solve the inter-session replay attacks that have been described above. I will describe the second method in some other post. We have defined a generic mechanism that all protocols can use here in this KARP draft.


Why providers still prefer IS-IS over OSPF when designing large flat topologies!

I was recently interacting with our pre-sales team for a large MPLS deployment and was reading the network design that was proposed. I saw that they had suggested IS-IS over OSPF as the IGP to use at the core. One of the reasons cited was the inherent security that IS-IS provides by running natively over the Layer 2. Another was that IS-IS is more modular and thus easier to extend as compared to OSPF. OSPF, its alleged is very rigid and required a complete protocol rewrite to support something as basic as IPv6! :-) Then there was this overload feature that IS-IS provides which can signal memory overload that does not exist in OSPF and finally a point about IS-IS showing superior scalability (faster convergence). In case you’re intrigued about the last point, as i clearly was, then it was explained that IS-IS uses just one Link State Packet (LSP) per level for exchanging the routing information. This LSP contains many TLVs, each of which represents a piece of routing information. OSPF on the other hand, needs to originate multiple LSAs, one for each type and thus is a lot more chatty and thus not suitable for large flat networks.

I personally dont agree to any one of the reasons listed above and anyone who favors IS-IS over OSPF for the above reasons is patently mistaken. These are all extremely weak arguments and have mostly been overtaken by reality. Lets look at each one by one.

Security – While its true that one cant lob an IS-IS packet from a distance without a tunnel it was never really a compelling reason for some one to pick up IS-IS over OSPF. The same holds good for OSPF multicast packets as well which cannot be launched by some script kidde sitting miles away from his personal laptop. Both the protocols have been extended to support stronger algorithms  (RFC 5310 for IS-IS and RFC5709 for OSPF) and have similar authentication mechanisms. I can say this with some degree of confidence as i have co-authored both these standards.

Modularity – While its somewhat easier to extend IS-IS in a backward compatible way this sort of thing doesnt happen much any more. Both protocols have been extended to support multiple instances, traffic engineering, multi-topology, graceful restart, etc. This isnt imo a showstopper for someone picking up OSPF as the IGP to use.

Overload Mechanism – IS-IS has the ability to set the Overload (OL) bit in its LSAs. This results in other routers in that area treating this router as a leaf router in their shortest path trees, which means that its only used for reaching the directly connected interfaces and is never placed on the transit path to reach other routers. So does this happen any more? No, it doesnt. This feature was required in the jurassic age when routers came with severely constrained memory, CPU power and the original intention of the OL mechanism is now mostly irrelevant. Most core routers today have enough memory and CPU that they will not get inundated by the IS-IS routes in any sane network design.

These days OL bit is used to prevent unintentional blackholing of packets in BGP transit networks. Due to the nature of these protocols, IS-IS and OSPF converge must faster than BGP. Thus there is a possibility that while the IGP has converged, IBGP is still learning the routes. In that case if other IBGP routers start sending traffic towards this IBGP router that has not yet completely converged it will start dropping traffic. This is because it isnt yet aware of the complete BGP routes. OL bit comes handy in such situations. When a new IBGP neighbor is added or a router restarts, the IS-IS OL bit is set. Since directly connected (including loopbacks) addresses on an “overloaded” router are considered by other routers, IBGP can be bought up and can begin exchanging routes. Other routers will not use this router for transit traffic and will route the packets out through an alternate path. Once BGP has converged, the OL bit is cleared and this router can begin forwarding transit traffic.

So how can we do this in OSPF since there is no OL bit in its LSAs?

Simple. We can set the metric of all transit links on an “overloaded” router to 0xffff in its Router LSAs. This will result in the router not being included as a transit node in the SPF tree.  Stub links can still be advertised with their normal metrics so that they are reachable even when the router is “overloaded”.  Thus this point against OSPF is also not valid.

Finally we come to the scalability and the convergence part. This one is slightly tricky and is not so easy. I wrote a few posts around 4 years back discussing this here and here. You might want to read these.

IMO one of the big reasons why most big providers use (or have used) IS-IS is because way back in 90s Cisco OSPF implementation was a disaster. The first big ISPs (UUnet, MCI) came to them and said “we want to build big infrastructures, should we use OSPF?” and Cisco basically said “No, thats not a good idea, use IS-IS instead”. Dave Katz in Cisco had recently rewritten Cisco’s IS-IS implementation as a side effect of implementing NetWare Link Services Protocol – NLSP (basically IS-IS for Novel IPX) so Cisco was quite confident of its IS-IS implementation. The operators thus picked up IS-IS and continue using it even today as there is really no real difference between IS-IS and OSPF, so no motivation to move from one to the other.

IS-IS was also an advantage in the early days as a router vendor because it was an “open proprietary spec”.  It was out there, and published, but unless you had some background in OSI you didn’t know much about it and the spec was scary and weird.  This wasn’t on purpose, but it was handy.

It was also nice in the IETF because IS-IS was viewed, at least at the time, as the poor cousin of OSPF and so nobody really cared that much other than the handful of folks that were doing the work.  This made the extension of IS-IS a lot easier and a lot less political than OSPF.  In fact i have heard about a t-shirt which said “IS – IS = 0″ that was distributed in one of the IETF meetings long time ago! Things however have changed and IS-IS is considered at par with OSPF today and both the working groups are quite active in the IETF.

There was one real technical advantage to IS-IS in common deployment scenarios of that day as well.  Back then, it was popular to build full meshes of ATM or Frame Relay as the Layer 2 topology for large backbones, because of the perception that healing faults at L2 would happen faster and cleaner than letting the IP routing protocols take care of it (arguably true at the time).  Full mesh topologies are the worst possible topologies for standard flooding protocols (IS-IS and OSPF both) and the cost of topology changes was huge.  However, IS-IS lent itself to the “mesh group” hack by which you could manually prune the flooding topology to be a subset of the links.  OSPF doesn’t easily allow this because of details about the flooding model it uses.  Cisco apparently did implement a hack to get around this problem, but its probably more gross than the IS-IS “mesh groups” hack!

Another reason i believe people prefer IS-IS over OSPF is the belief that you can design large networks by building a single large Level 1 (L1) area without any hierarchies in IS-IS and still be able to manage – something that would be difficult with OSPF. There are issues with inter-area traffic engineering and such and most people would like to keep their network as a single area if the routing protocol can manage it.

I used to believe that operators can design big networks without hierarchies in IS-IS since all IP prefixes (i.e. network interfaces, routes aka reachabilities in ISO-speak) are considered as leaf nodes in the SPF for IS-IS. Thus a full SPF will not be triggered for an interface or a route flap in case of IS-IS. OSPF otoh, would go ballistic running SPF each time any IP information changes. The only time we dont run a full SPF in when a Type 5 LSA information changes, but thats hardly an optimization. Compared to this, the only time we run a full SPF in IS-IS is when an actual node goes down (which OSPF would also anyways do).

I was recently having a discussion with Dave Katz from Juniper on this and i realized that this really is an implementation choice. “The graph theory”, he very aptly pointed out, “is the same in both cases!”.  The IS-IS spec makes it easier to put an IS-IS reachability as leaf nodes as all routers are identified by a different set of TLVs. This information while its available in OSPF is slightly tricky as the node information is mixed with the link information. Thus while even a naive IS-IS implementation may be able to optimize SPF, it would require a good understanding of the spec to get it right in OSPF.

You could get the exact same optimization in OSPF as IS-IS if you realize that OSPF calculates the routes to the *router IDs* and not the addresses. The distinction between nodes and destinations is syntactically (and semantically) quite clear in OSPF as well. The spec considers the Router IDs which i concede look like IP addresses, something that most people miss.

Actual addresses and prefixes are quite distinct, even in OSPF.  So as long as you can keep track of what’s an address and what’s an ID, it’s not that hard, for what it’s worth.  The bigger problem is that only a handful of people really understand *why* things in the OSPF spec are done the way they are, and there are less and less of those folks because hardly anybody *needs* to understand it.

But having said all that, the cost of an SPF is so small on the scale of things that it’s not really the issue (which is also why I am not a big fan of partial-SPF optimizations:  ”See how great it works when you have around O(50K) nodes and there is this one little node that goes down!” is sort of silly because lots of other things would break before a network ever got that big.)

Part of the SPF fear was I believe because Cisco’s original SPF implementation in OSPF was horribly inefficient (and everyone was using slow processors back then) and IOS was a non-preemptive, single threaded environment, and so an SPF (or any slow process) would block other things (like sending and receiving Hellos and other important bits) and would affect *everything*. I am btw sure that its changed now since i am aware of a couple of large Cisco deployments that are running OSPF in the core!  Overall system state management is a *much* bigger problem these days than the algorithmic efficiency of these protocols, particularly as we build larger and more distributed environments that require message passing internally.

Also what could have pushed providers back then to IS-IS was the deployment guidelines that Cisco used to publish (including the number of routes in an area) back then which were absurdly small. I am sure, its changed now.

There’s no technical reason why very large flat topologies can’t be supported by a good implementation of either protocol, but ISPs need to be conservative and suspicious of their vendors in order to survive.  ;-) I guess that nobody wants to be the first to deploy a large flat OSPF topology;  best practices tend to be sticky. However, there is no reason why you cant do it with OSPF today.

I suspect that, at this point, ISPs choose based on culture and familiarity and comfort rather than real technical differences. The perception still exists that while IS-IS can support large flat networks, OSPF cant. However, as i said its just a perception and is not really true any more.


Catching Corrupted OSPF Packets!

I was having a discussion with Paul Jakma (a friend, co-author in a few IETF drafts, a routing protocols expert, the guy behind Quagga, the list just goes on ..) some time back on a weird problem that he came across at a customer network where the OSPF packets were being corrupted in between being read off the wire and having CRC and IP checksum verified and being delivered to OSPF stack. While the problem was repeatable within 30 minutes on that particular network, he could never reproduce it on his VM network (and neither could the folks who reported this problem).

Eventually, for some inexplicable reason, he asked them to turn on MD5 authentication (with a tweak to drop duplicate sequence number packets – duplicate packets as the trigger of the problems being a theory). With this, their problems changed from “weird” to “adjacencies just start dropping, with lots of log messages about MD5 failures”!

So it appears that the customer had some kind of corruption bug in custom parts of their network stack, on input, such that OSPF gets handed a good long sequence of corrupt packets – all of which  (we dont know how many) seem to pass the internet checksum and then cause very odd problems for OSPF.

So, is this a realistic scenario and can this actually happen? While i have personally never experienced this, there are chances of this happening because of any of the following reasons:

o PCI transmission error (PCI parallel had parity checks, but not always enabled, PCI express has a 32bit CRC though)

o memory bus error (though, all routers and hosts should use ECC RAM)

o bad memory (same)

o bad cache (CPUs don’t always ECC their caches – Sun its seems was badly bitten by this; While the last few generations of Intel and AMD CPU do this, what about all those embedded CPUs that we use in the routers?)

o logic errors, particularly in network hardware drivers

o finally, CRCs and the internet checksum are not very good and its not impossible for wire-corrupted packets to get past link, IP AND OSPF CRC/checksums.

The internet checksum, which is used for the OSPF packet checksum, is incredibly weak. There are various papers out there, particularly ones by Partridge (who helped author the internet checksum RFC!) which cover this, basically it offers very little protection:

- it can’t detect re-ordering of 2-byte aligned words
- it can’t detect various bit flips that keep the 1s complement sum the same (e.g. 0×0000 to 0xffff and vice versa)

Even the link-layer CRC also is not perfect, and Partridge has co-authored papers detailling how corrupted packets can even get past both CRC and internet checksum.

So what choice do the operators have for catching corrupted packets in the SW?

Well, they could either use the incredibly poor internet checksum that exists today or they could turn on cryptographic authentication (keyed MD5 with RFC 2328 or different HMAC-SHA variants with RFC 5709) and catch all such failures. The former would not always work as there are errors that can creep in with these algorithms. The latter would work but there are certain disadvantages  is using cryptographic HMACs purely for integrity checking. The algorithms require more computation, which may be noticable on less powerful and/or energy-sensitive platforms. Additionally, the need to configure and synchronize the keying material is an additional administrative burden. I had posted a survey on Nanog some time back where i had asked the operators if they had ever turned on crypto protection to detect such failures and i received a couple of responses offline where they alluded to doing this to prevent checksum failures.

Paul and I wrote a short IETF draft some time back where we propose to change the checksum algorithm used for verifying OSPFv2 and OSPFv3 packets. We would only like to upgrade the very weak packet checksum with something thats more stronger without having to go with the full crypto hash protection way. You can find all the gory details here!


Hub and Spoke Multi-Point LSPs for Scalable VPLS Architecture

Multiprotocol Label Switching (MPLS) and Generalized MPLS (GMPLS) provide a mechanism to set up point-to-multipoint (P2MP) LSPs which carry traffic from one ingress point (the root node) to several egress points (the leaf nodes), thus enabling multicast forwarding in an MPLS domain. However, there is no provision to provide a co-routed path back from the egress points (the leaves) to the ingress node (the root). The only way to do this is by configuring Unidirectional point-to-point LSPs from the leaves back to the root node. This entails configuring each leaf node with an LSP back to the root which could be a configuration and management nightmare if the number of leaves are large. Second, it can also not guarantee a co-routed path back from the leaf to the node, as the process of setting up the Unidirectional P2P path is independent from setting up the P2MP path.

This post introduces the concept of a hub-and-spoke multipoint (HSMP) LSP that allows traffic from the root to the leaves via a P2MP LSP and back from the leaves to the root via a unidirectional co-routed LSP. The proposed technique targets one-to-many applications that require reverse one-to-one traffic flow (thus many one-to-one in the reverse direction).

Consider the figure shown below.

PE1 is the ingress router for the HSMP LSP. The egress routers (leaves) are PE2, PE3 and PE4. As can be seen from the figure, PE1 creates a single copy of each packet arriving from the data source. This packet carries the MPLS label value L1. P1 is an ordinary Label Switch Router (LSR) that swaps the incoming label L1 with L2. Lets now focus on the packet forwarding process on the node P2.

For each packet belonging to the HSMP LSP, P2 makes three copies (just like how its done for a P2MP LSP), each of which is sent to PE2, PE3 and PE4 respectively. Packets arrive at P2 with MPLS label L2. As shown in the above figure, P2′s ILM contains an entry for label L2 saying that one copy of the packet should be sent out on interface if1 with label L3, another copy on interface if2 with label L4 and a last copy on interface if3 with label L5. Since the LSR P2 is replicating the MPLS packets its called the branch node.

An obvious advantage of this scheme is the bandwidth optimization. If we had used unidirectional P2P LSPs instead of an HSMP LSP, PE1 would have sent three copies of each packet that it had received from the data source and thus congesting the links PE1-P1 and P1-P2.

What sets an HSMP LSP apart from a regular P2MP LSP is the ability in the former to set up a path from the leaves back to the root. P2MP LSPs are unidirectional, so no traffic can flow from the leaf node routers to the ingress head end router along the P2MP LSP. In HSMP LSP, the leaf nodes can also send unidirectional traffic back to the root. This is shown in the figure below.

Because of the mechanisms defined in HSMP LSP, the branch node P2 advertises the same upstream label L1 for a given HSMP LSP to the nodes PE2, PE3 and PE4. It programs its ILM table as shown above, where it simply swaps L1 with L2 for all incoming MPLS packets and sends those towards P1. This way HSMP LSP is also able to provide a path back from the leaf nodes to the root node.

In the last post i had discussed issues that exist in VPLS. Lets see how HSMP LSP can solve them. I am using the same topology as was used there to demonstrate how HSMP LSP helps.

The figure 3 above shows the same VPLS service that we had discussed in the earlier post.

PE1 knows through some out-of-band mechanism (could be via BGP, Radius, manual configs, etc) that PE2, PE3 and PE4 are the egress nodes that belong to the same VPLS domain. PE1 now needs to establish an HSMP LSP (can be trivially extended to support a pseudowire) to PE2, PE3 and PE4. Figure shows 3 HSMP LSPs that will be required in this arrangement. The red HSMP LSP has PE1 as the root node and PE2, PE3 and PE4 as the leaf nodes. The green HSMP LSP has been initiated by PE2 and has PE2 as the root, and PE1, PE3 and PE4 as the leaf nodes. The blue HSMP LSP has been initiated by PE3 and has PE1, PE2 and PE4 as the leaf nodes. There is another HSMP LSP thats required – the one initiated by PE4 for the VPLS service to function. It has been omitted from the figure for the sake of clarity.

Thus all PE nodes in a VPLS service need to initiate an HSMP LSP (or a HSMP pseudowire) that terminates at the other PE routers.

In VPLS all BUM (broadcast, unknown unicast and multicast) traffic is flooded to all PE nodes. Its only the learnt traffic thats sent unicast by one PE to the other PE.

As explained earlier, a single copy sent by PE1 over an HSMP LSP will reach PE2, PE3 and PE4 (due to its P2MP component). Also the PE routers PE2, PE3 and PE4 can use this HSMP LSP (terminating at them) to send unicast traffic back to PE1.

Thus PE1 sends all BUM traffic on the HSMP LSP it initiates and all learnt unicast over the HSMP LSP that terminates on it.

Going back to figure 3, we can see that PE1 can use the red HSMP LSP to send all BUM traffic. This way it only sends one copy, and all the PE routers receive this packet. If PE1 wants to send learnt unicast traffic back to PE2, it uses the green HSMP LSP that terminates on it. PE1 can use this to send traffic back to PE2, which is the root node for this HSMP LSP. Similarly, PE1 uses the blue HSMP LSP whenever it wants to send learnt unicast traffic back to PE3.

Lets look at LSPs that PE2 uses. Whenever it wants to send BUM traffic, it uses the green HSMP LSP that it has originated. Any packet sent over that LSP is received by all the leaf nodes (which in this case happen to be PE1, PE3 and PE4). Like PE1, if it has to send learnt unicast traffic back to PE1, it uses the red HSMP LSP that was originated by PE1 and terminates at PE2.

Thus for a VPLS to fully function, all PE nodes must establish an HSMP LSP with all the other participating PE routers. It can use the optimized HSMP LSP that it originates for the BUM traffic and the HSMP LSP that other PEs originate for unicast communication.

The above table compares a VPLS service using HSMP LSPs with a regular VPLS service or Hierarchical VPLS (H-VPLS) service. Clearly, the former wins against the regular VPLS and H-VPLS on all counts. This may also perhaps be an answer to Juniper’s claim that H-VPLS is not scalable. Operators now need not implement H-VPLS; they can instead go in for VPLS services implemented using HSMP LSPs.

This post only briefly explains the idea behind HSMP LSPs. Its explained in detail here in this draft.


Does Hierarchical VPLS solve all Scaling issues found in VPLS?

Virtual Private Lan Service (VPLS) is a multipoint-to-multipoint ethernet bridging service over an IP/MPLS backbone and is used for connecting geographically separated customer sites by emulating a LAN segment. This post assumes that the reader is familiar with the basic concepts of VPLS and IP/MPLS and does not attempt explain them.

VPLS requires a logical full mesh of all the participating Provider Edge (PE) routers since its emulating a LAN. This means that every PE router is connected to every other PE router by a pseudowire resulting in a full mesh infrastructure. VPLS solves the loop problem by using a split-horizon rule which states that member PE routers (PE1, PE2, PE3 and PE4 in the below Figure) must forward VPLS traffic only to the local attachment circuits when they receive the traffic from the other PE routers. Exchanging traffic learnt from one remote PE router to the other is not allowed. This prevents loops and also eliminates the need to run STP in the VPLS core network.

Consider a Video broadcast service (implemented as a VPLS service)  that uses a video codec and offers 200 channels of standard-definition content. Assuming around 2 Mbps for each channel it would require 400 Mbps of total standard-definition traffic. Add another 50 HDTV channels, each consuming around 10Mbps, the total network bandwidth approaches 1Gbps.

Assume that the source of the Video transmission falls behind PE1. In that case, PE1 needs to send 1Gbps worth of Video traffic to PE2, PE3 and PE4 respectively.

Figure 2 shows the physical topology where PE1 is connected to the other PE routers via LSRs P1 and P2. Since PE1 is in a full logical mesh with PE2, PE3 and PE4, it means that PE1 needs to replicate all BUM (Broadcast, Unknown Unicast and Multicast) traffic three times, so that each PE can receive a copy.

Thus PE1 needs to make three copies of all the Video traffic that it receives which means that the link PE1 – P1 and P1-P2 will be now carrying 3Gbps. The amount of traffic that these links will carry will increase linearly as the number of PEs that PE1 is in full mesh with. Clearly this is NOT scalable!

So, whats the solution?

The solution apparently lies in using Hierarchical VPLS, also popularly known as H-VPLS.

The original VPLS architecture requires all PEs to be in a full mesh. This however may not always be practical if the number of PE routers is too high. Provisioning a full meshed network may also in some instances not be an efficient network design, as was illustrated in the topology in Figure 2.

To fix these issues H-VPLS architecture introduces the concept of spoke-pseudowires. Unlike mesh-pseudowires, that are used in regular VPLS, spoke-pseudowires can exchange traffic with other pseudowires (both mesh and spoke), so they can relay traffic between PE routers.  Let us see how we can re-design the above network using H-VPLS.

To illustrate this, the figure above shows two H-VPLS architectures that can be used to break the logical full mesh that is required in the regular VPLS.

In the first there is a logical connection between PE1-PE2, PE2 – PE4 and PE1 – PE3. There is thus a VPLS service defined on PE1 which has just two spoke pseudowires and a connection to the local attachment circuit which is the source of the Video traffic. The first pseudowire connects PE1 to PE2 and the other connects it to PE3.  There is no pseudowire connecting PE1 to PE4.

In the other design, PE1, PE2, PE3 and PE4 are all connected in a ring. In this case PE1 is connected to PE2 and PE3 using spoke pseudowires; PE2 to PE1 and PE4; PE4 to PE3 and PE2 and PE3 to PE1 and PE4.

While the two examples that i have taken dont have a mesh pseudowire, there is nothing that precludes that from happening. The true potential of HVPLS is only exploited when the network is designed using a combination of spoke and mesh pseudowires.

The split-horizon rule “Do not relay traffic among mesh-pseudowires” is used to prevent forwarding loops in H-VPLS networks. The mesh pseudowires dont exchange traffic as in the regular VPLS architecture. The spoke pseudowires otoh do not obey the split-horizon rule – thus traffic arriving on a spoke pseudowire is forwarded to the other spokes, meshes and local attachment circuits if any. This requires the provider to run Spanning Tree Protocol (STP) in the core to keep it loop-free, something that the providers dont feel to happy about given the high convergence values of STP. Since the traffic can loop providers need to be extremely careful when planning where the spokes and mesh pseudowires are placed.

While H-VPLS solves the problems seen in VPLS, it introduces a few of its own.

Lets look at each design and see how ..

In the first H-VPLS design (as shown above) the links PE1-P1 and P1-P2 are still carrying two copies of each BUM traffic, thus we have not gained significantly from the original VPLS design in terms of saving the bandwidth efficiency.

The second problem is that PE4 only gets the packets after they are relayed by PE2. If you go back to Figure 2, you will see that there is no physical connection between PE2 and PE4. Thus all packets go back to P2 before they reach P4, thus congesting this link. Its also introduces a single point of failure, where PE4 can get completely disconnected from the rest of the PEs if PE2 goes down. There is thus a huge amount of redundancy planned in H-VPLS, which can result in loops and thus using STP becomes extremely important here.

The third issue, and as per some critics, the biggest problem with H-VPLS is that PE2 now has to learn all the customer MACs that fall behind PE1 and P4. This is because all traffic from PE1 is terminated at PE2 and relayed to PE4. Similarly, all traffic coming from PE4 gets terminated at PE2 and is then forwarded to PE1. During this process PE2 has to learn all these MACs. This is primarily because H-VPLS implements the Hub-and-Spoke architecture in the data plane as against Route Reflectors in BGP that do it in the control plane.

Lets now look at the other H-VPLS design where all the PE routers are in a ring.

Clearly we need to run STP to break the forwarding loop. Assume that STP puts the spoke pseudowire connecting PE1 – PE3 in a blocked state. In this case PE1 only has one pseudowire (PE1 – PE2) to send traffic to. This solves the bandwidth wastage problem on the links PE1-P1 and P1-P2 as they only carry one copy of the BUM traffic.

This however, doubles the traffic on the links PE2-P2 and P2-PE4 as each packet is first relayed by PE2 to PE4 and then later by PE4 to PE3. So, while we had saved bandwidth on some links, we ended up wasting it somewhere else!

Like before, PE2 and PE4 have to unnecessarily learn all the customer MACs exchanged between learnt traffic between PE1 and PE3.

Another issue is that the learnt traffic between PE1 and PE3 cannot pick up the most optimal IGP path since it has to necessarily get routed via PE2 and PE4. This is a direct consequence of H-VPLS implementing the Hub-Spoke architecture in the data plane.

Its thus easy to sum up the issues that exist in VPLS and H-VPLS.

(1) Bandwidth is wasted since traffic for different pseudowires is replicated on a shared physical path. This is a big issue as more and more multicast video traffic is sent using VPLS services.

(2) A full mesh of PE routers can result in a significant amount of control traffic.

(3) If one uses H-VPLS then operators need to ensure loop prevention and detection. This entails running STP which is not the most attractive choice.

(4) Learnt traffic may not follow the best and the most optimal path since it has to get relayed via multiple PE routers.

(5) Unnecessary MAC learning happens on all PE routers participating in H-VPLS.

So, do we have a solution to the above problems? Thankfully we do!

I have written a draft with Lizhong Jin from ZTE and Frederic Jounay from France Telecom where we have introduced a concept of a Hub-and-Spoke Multipoint LSPs. This can be trivially extended to a Hub-and-Spoke Multipoint Pseudowires which can solve the issues described above that exist in the regular VPLS and the H-VPLS architectures. More detail about Hub-and-Spoke Multipoint LSPs and how they solve all the issues described above in the next post!


Can we solve the inter-session replay attacks?

Most routing (OSPF, BFD, RIP, OSPFv3-AT, etc) and signalling (LDP, RSVP, etc) protocols defined by IETF have a cryptographic sequence number within the authentication data that increases monotonically with each new packet that the router originates. This protects the protocol from replay attacks as the receivers now keep track of the sequence numbers and ignore all packets that arrive with a number thats lower than the currently active one.

At worst, the attacker can keep replaying the last packet that was originated since most protocols accept packets with sequence number greater than or equal to what they had last received. This in my opinion is a hole that can be trivially plugged by mandating that protocols must only accept protocol packets if they come with a sequence number thats greater than what they have received till now.

So does this solve all replay attacks problem?

No, not really.

Imagine an attacker who captures a protocol packet when the cryptographic sequence number is say, 1000.  Now the next time this router cold boots it will reinitialize its sequence space to 1 and start sending packets from this value. The attacker can now replay the earlier captured packet – the one with the sequence number 1000. The receivers will accept the replayed packet since it comes with a sequence number thats higher than what they were currently seeing from the router. This is a vulnerability that most IETF protocols are susceptible to. This is not an issue with protocols that use an automated key management protocol (like IKEv2) as all the security parameters are renegotiated when a session bounces. However, most routing and signalling protocols DONT use an automated key management protocol and are thus exposed to this risk.

I call this as an inter-session replay attack where packets from the previous/stale sessions can be replayed. So, do we have a solution to this problem?

Well, there are a couple of things that we could do here. The most obvious solution is to update the last cryptographic sequence number in the non volatile memory of the router. Thus we update the memory each time we increment the sequence number. This can be read when the router cold boots and it can start using sequence numbers from this value. The problem with this solution is that this will involve frequent writes to the non volatile memory on the routers which is not recommended because of the limited life of such media.

The other solution is to use the clock (number of seconds elapsed since midnight UTC  January 1 1970) as the sequence numbers. In theory this time will always advance and we will thus never have a router issuing sequence numbers that will ever go back. This would ideally also work when the router reboots as the time would only have advanced. The problem with this solution is that we end up relying on NTP or 1588 and an assumption that clocks on a router will NEVER go back. This is unrealistic and cannot be the basis of a security system defined for any protocol. Its fragile and can be broken.

So what are we left with?

Sam Hartman, Dacheng Zhang and I start looking at this problem for OSPF and have written an IETF draft that we think addresses this problem. It associates two scalars with a router – the Session ID and the Nonce, and uses these in combination with the cryptographic sequence numbers to protect OSPF routers against inter and intra replay attacks. The mechanism described in this draft can be easily generalized and extended to other routing and signalling protocols.

This is currently being discussed actively in the OSPF WG and the KARP WG mailing lists. I will in some other post explain how the concept of Nonce and Session ID helps in solving the inter-reply attacks which is the key problem that needs to be solved.


2010 in review

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads This blog is doing awesome!.

Crunchy numbers

Featured image

A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 19,000 times in 2010. That’s about 46 full 747s.

 

In 2010, there were 2 new posts, growing the total archive of this blog to 33 posts.

The busiest day of the year was September 8th with 183 views. The most popular post that day was Designated Router (or DIS) concepts .. .

Where did they come from?

Some visitors came searching, mostly for spf algorithm, ospf graceful restart, shortest path first, ospf vs isis, and ospf flooding.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1

Designated Router (or DIS) concepts .. March 2007
1 comment

2

Shortest Path First (SPF) Algorithm Demystified .. March 2008
11 comments and 1 Like on WordPress.com,

3

Shortest Path First (SPF) Calculation in OSPF and IS-IS March 2008
4 comments

4

Graceful Restart Mechanisms in OSPF and IS-IS .. March 2007
2 comments

5

Database Exchange and Flooding in OSPF and IS-IS .. March 2007
2 comments


Fixing OSPFv3 Authentication and Security Mechanism!

Sounds like a presumptuous claim for a blog title, eh? Well, we’ll soon find out that it isnt! OSPFv3, unlike OSPFv2, IS-IS and RIP uses IPsec for authenticating its protocol packets. It relies on the IP Authentication Header (AH)  and the IP Encapsulating Security Payload (ESP) to cryptographically sign routing information passed between routers. When using ESP, the null encryption algorithm is used, so the data carried in the OSPFv3 packets is signed, but not encrypted.   This provides data origin authentication for adjacent routers, and data integrity (which gives the assurance that the data transmitted by a router has not changed in transit).  However, it does not provide confidentiality of the information transmitted; this is acceptable because the privacy of the information being carried in the routing protocols need not be kept secret.

Authentication/Confidentiality for OSPFv3” mandates the use of ESP with null encryption for authentication and also does encourage the use of confidentiality to protect the privacy of the routing information transmitted, using ESP encryption. It discusses, at length, the reasoning behind using   manually configured keys, rather than some automated key management   protocol such as IKEv2.  The primary problem is the lack of a suitable key management mechanism, as OSPF adjacencies are formed on a one-to-many basis and most key management mechanisms are designed for a one-to-one communication model. This forces the system administrator to use manually configured security associations  (SAs) and cryptographic keys to provide the authentication and, if desired, confidentiality services. Regarding replay protection, [RFC4552] states the following:

Since it is not possible using the current standards to provide complete replay protection while using manual keying, the proposed solution will not provide protection against replay attacks.

Since there is no replay protection provided there are a number of attacks that are possible for OSPFv3. Some of the them are described here.

Since there is no deterministic way to differentiate between encrypted and unencrypted ESP packets by simply examining the packet, it becomes tricky for some implementations to prioritize certain OSPFv3 packets (Hellos for example) over the others. This is big issue in most service grade routers working in scaled setups.

Then there are some environments, e.g., Mobile Ad-hoc Networks (MANETs), where IPsec is difficult to configure and maintain, and cannot be used for authenticating OSPFv3. There is also an issue with IPsec not being      available on some platforms or it requiring some additional license which may be expensive.

I posted a survey on Nanog asking operators if they were using authentication with OSPFv3.  I only received a few responses, as operators, for a good reason are quite paranoid about their security policies and would not reveal their policies to someone posting a survey on a public mailing list. The results of that survey were interesting; more than half of them cited the issues that i have described above as reasons for not turning on authentication with OSPFv3. They are running OSPFv3 in their v6 networks without any security! This you would note is very different from operators running OSPFv2. It is well known that a majority of the big providers use MD5 with OSPFv2 and their networks are secure. Most vendors have now started work on implementing HMAC-SHA support for OSPFv2 as described in RFC5709. This led me to believe that if we got rid of IPsec for OSPFv3 and somehow got it to use the same infrastructure as OSPFv2, then we would see more people using security with OSPFv3.

The other big issue with using IPsec for OSPFv3 is that it becomes very difficult to prioritize some OSPFv3 control packets over others. This is because deep inspecting ESP-NULL packets is difficult as its not easy to know if the packet is encrypted or not. One alternative is to mandate WESP instead of ESP-NULL. An easier alternative is to write a new mechanism to authenticate OSPFv3 packets. I have written an RFC 6506 – “Supporting Authentication Trailer for OSPFv3” that defines an alternative, non IPsec mechanism, for authenticating and securing OSPFv3 protocol packets.

We propose to append a special block, called the Authentication Trailer to the end of the OSPFv3 packets. This is similar to the Authentication data thats carried in OSPFv2, where the length of this trailer is NOT included into the length of the OSPFv3 packet, but is accounted for in the IPv6 payload length. Unlike OSPFv2, the AT will also include the contents of the LLS block in its digest which means that the LLS block doesnt need to have its own separate authentication mechanism.

We also describe a new AT (Authentication Trailer) bit in the OSPFv3 options field that the routers must set in the HELLOs and DDs to indicate that this router will include Authentication trailer in all subsequent OSPFv3 packets on the link. In other words, the Authentication trailer is only examined if the AT bit is set. You can read the RFC for more details.


Using Checksum in determining the newer LSA

RFC 2328 Section 13.1 explains what a router must do when it encounters two similar instances of an LSA (i.e. the two LSAs are of the same LS type, have the same Link State ID, Advertising Router and the LS Sequence Number) and it needs to determine which of the two LSAs is more recent.

Since the two LSAs have the same sequence number RFC 2328 suggests examining the checksums. The instance having the larger LS checksum is considered more recent.

Now, the question is that how can we be sure that the more recent LSA will necessarily have a higher checksum? The answer is that we cant! Its just done this way to ensure that all routers in the network arrive at the same decision. OSPF would have converged even if the algorithm was to pick the LSA with a lower checksum!

So, how does it work?

The LS checksum is computed over the entire contents of the LSA, excepting the LS age field. This is left out so that the LSA’s age can be incremented without updating the LS checksum. Its used primarily to detect data corruption of an LSA. Its also used as explained above in determining which LSA is more recent.

Its obvious that checksums can only differ if the contents of the LSAs differ. This is possible if a router reboots and floods an LSA that has the same sequence number as the previous LSA it had issued before restarting that still exists in the network.

The purpose of the checksum selection really beyond correct routing and is critical to the detection and elimination of this case. If the LSA issued by the rebooted box has the higher checksum, that information is newer (as per the tie breaking algo specified in the RFC) and will eventually eliminate the earlier version of this LSA. However, if the LSA issued by the rebooted box has a lower checksum, the old version of the LSA is “newer”, and will eventually make it back to the issuing router, which will trump it by reissuing the LSA with a higher sequence number.

If it weren’t for the tie-breaking rule, both LSAs would live in the network until the next refresh (close to 30 mins for OSPF and even longer for ISIS), and long-lived routing loops could result.


Traffic Visibility inside ESP

Use of Encapsulating Security Payload (ESP) within IPsec specifies how ESP packet encapsulation is performed.  It also specifies that ESP can use NULL encryption while preserving data integrity and authenticity.  The exact encapsulation and algorithms employed are negotiated out-of-band using, for example, IKEv2 and based on policy.

Enterprise environments typically employ numerous security policies (and tools for enforcing them), as related to access control, content screening, firewalls, network monitoring functions, deep packet inspection, Intrusion Detection and Prevention Systems (IDS and IPS), scanning and detection of viruses and worms, etc.  In order to enforce these policies, network tools and intermediate devices require visibility into packets, ranging from simple packet header inspection to deeper payload examination.  Network security protocols which encrypt the data in transit prevent these network tools from performing the aforementioned functions.

When employing IPsec within an enterprise environment, it is desirable to employ ESP instead of AH [RFC4302], as AH does not work in NAT environments. Furthermore, in order to preserve the above network monitoring functions, it is desirable to use ESP-NULL. In a mixed mode environment some packets containing sensitive data employ a given encryption cipher suite, while other packets employ ESP-NULL. For an intermediate device to unambiguously distinguish which packets are leveraging ESP-NULL, they would require knowledge of all the policies being employed for each protected session. This is clearly not practical. Heuristic-based methods can be employed to parse the packets, but these can be very expensive, containing numerous rules based on each different protocol and payload.  Even then, the parsing may not be robust in cases where fields within a given encrypted packet happen to resemble the fields for a given protocol or heuristic rule.

This is even more problematic when different length Initialization Vectors (IVs), Integrity Check Values (ICVs) and padding are used for different security associations, making it difficult to determine the start and end of the payload data, let alone attempting any further parsing. Furthermore, storage, lookup and cross-checking a set of comprehensive rules against every packet adds cost to hardware implementations and degrades performance. In cases where the packets may be encrypted, it is also wasteful to check against heuristics-based rules, when a simple exception policy (e.g., allow, drop or redirect) can be employed to handle the encrypted packets. Because of the non-deterministic nature of heuristics-based rules for disambiguating between encrypted and non-  encrypted data, an alternative method for enabling intermediate devices to function in encrypted data environments needs to be defined. Additionally there are many types and classes of network devices employed within a given network and a deterministic approach would provide a simple solution for all these devices. Enterprise environments typically use both stateful and stateless packet inspection mechanisms. The previous considerations weigh particularly heavy on stateless mechanisms such as router ACLs and NetFlow exporters. Nevertheless, a deterministic approach provides a simple solution for the myriad types of devices employed within a network, regardless of their stateful or stateless nature.

We have published an IETF standard RFC5840 to provide additional information in relevant IPsec packets so intermediate devices can efficiently and deterministically disambiguate encrypted ESP packets from ESP packets with NULL encryption.


BFD Generic Cryptographic Authentication

Bidirectional Forwarding Detection (BFD) specification includes five different types of authentication schemes: Simple Password, Keyed Message Digest 5 (MD5), Meticulous Keyed MD5, Keyed Secure Hash Algorithm (SHA-1) and Meticulous SHA-1. In the simple password scheme of authentication, the passwords are exchanged in the clear text on the network and anyone with physical access to the network can learn the password and compromise the security of the BFD domain.

It was discovered that collisions can be found in MD5 algorithm in less than 24 hours, making MD5 insecure. Further research has verified this result and shown other ways to find collisions in MD5 hashes.

It should however be noted that these attacks may not necessarily result in direct vulnerabilities in Keyed-MD5 as used in BFD authentication purposes, because the colliding message may not necessarily be a syntactically correct protocol packet. However, there is a need felt to move away from MD5 towards more complex and difficult to break hash algorithms.

In Keyed SHA-1 and Meticulous SHA-1, the BFD routers share a secret key which is used to generate a keyed SHA-1 digest for each packet and a monotonically increasing sequence number scheme is used to prevent replay attacks.

Like MD5 there have been reports of attacks on SHA-1. Such attacks do not mean that all the protocols using SHA-1 for authentication are at risk. However, it does mean that SHA-1 should be replaced as soon as possible and should not be used for new applications.

However, if SHA-1 is used in the Hashed Message Authentication Code (HMAC) construction then collision attacks currently known against SHA-1 do not apply. The new attacks on SHA-1 have no impact on the security of HMAC-SHA-1.

I have written an IETF document that proposes two new authentication types – the cryptographic authentication and the meticulous cryptographic authentication . These can be used to specify any authentication algorithm for authenticating and verifying the BFD packets (aka key agility). In addition to this, this memo also explains how HMAC-SHA authentication can be used for BFD.

HMAC can be used, without modifying any hash function, for calculating and verifying the message authentication values. It verifies both the data integrity and the authenticity of a message.

By definition, HMAC requires a cryptographic hash function. We propose to use any one of SHA-1, SHA-256, SHA-384 and SHA-512 for this purpose to authenticate the BFD packets.

I recently co-authored an IETF draft that does BFD’s security and authentication mechanism’s gap analysis for the KARP WG – that draft can be found here.


BGP Path Hunting and Exploration ..

It has been shown through various studies that BGP convergence can be roughly given by the following formula:

Convergence = (Maximum AS_PATH – Minimum AS_PATH) x MRAI (MinimumRouteAdvertisementInterval)

MRAI as per the specification is the amount of time that a BGP speaker will wait before passing on successive route updates for the same prefix – it defaults to 30 seconds, and applies to all announcements, with no exemption for withdrawals. Since there can be some variance in the MRAI across different BGP implementations the convergence time is roughly given by the above formula.

Lets see how we arrive at this. Consider the topology as show in the fig 1 below:

Network Topology

Figure 1: Network Topology

A, B, C, D and E are all routers in ASes A, B, C, D and E respectively. Assume that A has announced its reachability to 10/8 and all routers have converged. This implies that E has the following paths for 10/8 in its adj-RIBs-In:

p(B) -> AS_PATH {B A} (path learnt from B)

p(C) -> AS_PATH {C B A} (path learnt from C)

p(D) -> AS_PATH {D C B A} (path learnt from D)

In steady state E would advertise 10/8 with AS_PATH {E B A}

Lets see what happens when A loses its connection to the network 10/8.

A announces a withdrawal for 10/8 to B as shown in figure 2.

Figure 2

Figure 2

B runs its decision process and finds that there is no other route to 10/8, and thus sends an explicit withdrawal to its peers C and E.

E upon receiving the withdrawl runs its decision process and finds that the best route, now available for 10/8 is the one that it learnt from C – the one with the AS_PATH {C B A}. It installs this route in the FIB and advertises this, as shown in figure 3.

Figure 3

Figure 3

E switches to the next best route (via C) and continues to forward traffic for 10/8. It should however be noted, that E at this point of time, has absolutely no idea that C too has received a withdrawal for 10/8.

C meanwhile runs its decision process and decides to send an explicit withdrawal to its peers D and E.

E upon receiving the withdrawal from C runs its decision process and finds that it still has a valid path to reach 10/8 – the one via D (refer to figure 4). It switches to this path, again unaware that D too has received an withdrawal for 10/8.

Figure 4

Figure 4

D now sends a withdrawal to E, and thereby removing all possible paths to 10/8 from E (fig 5). Its now that the network is converged.

Figure 5

Figure 5

What we just saw happening above on E is “BGP Path Hunting”. i.e., BGP “hunts” though all possible AS paths, starting from the shorter ones to the longer ones, until it finally converges.

It can be easily proven that the amount of path hunting will increase as the meshiness of the topology increases. In an acyclic topology (say a tree) there is only one possible path so there is no path hunting. An addition of a single link in this topology creates a cycle in the graph and thus at most two possible paths for BGP to “hunt”. Subsequent links can add many more alternate paths for BGP to hunt, depending on where they’re placed in the graph.

In the worst possible case path hunting in BGP can explore every possible path of each path length. More commonly it has been observed that path hunting in today’s Internet can add an additional 2 or 3 BGP updates to a prefix withdrawal.


AboveNet Hijacks Africa Online!

Internet, only a few weeks ago, had seen Pakistan Telecom Authority (PTA) hijacking the IP prefixes announced by Youtube, as protest against some videos that had been put up there, knocking millions off, all around the globe from accessing Youtube. I wrote about this here. This time its an ISP from USA and Europe, AboveNet (AS 6461) thats hijacked prefix announced and owned by Africa Online (AS 36915).

AboveNet inexplicably started announcing reachability to one of the prefixes (194.9.82.0/24) owned by Africa Online. It took AboveNet more than 22 hours since the problem was first reported, to fix it. Wonder what took them so long! As a result of this prefix hijack, potentially millions of users in Kenya or Africa, all behind 194.9.82.0/24, lost connectivity to the Internet. In isolation 194.9.82.0/24 is not a huge space, but add a couple of NATs and the number of users easily swells to millions. What this means for you and me, who are not being served by Africa Online is, that we lose connectivity to all the websites being hosted behind this IP address block. Imagine what it would do to the Internet if emergency services, banks, google were being hosted there!

Lets see why users in Kenya would lose total connectivity to the Internet:

A user accesses google.com with a (NATed or otherwise) source IP address 194.9.82.x. Google graciously responds, and the IP packet carries the destination IP 194.9.82.x. Because AboveNet has announced reachability to this IP address block, all traffic destined to 194.9.82.x comes to AboveNet where it gets royally dumped, while the user sitting in Kenya (or Africa) is still hopelessly waiting for the packet to arrive.

So, why are the service providers all over the world preferring the route announcement from AboveNet over the one originated from Africa Online?

Well, thats unfortunately how Internet, and my favorite routing protocol – BGP, works!

In BGP, the route advertisement from the provider which has a better vantage point on the Internet, usually wins.

In this particular case both AboveNet(AS 6461) and Africa Online (AS 36915) announced the route to 194.9.82.0/24 . AboveNet, operating from US, sits much closer to the core as compared to Africa Online, and is thus better connected to the other networks than the latter. The AS_PATH length thus seen by the other service providers for the route advertised by AboveNet is much shorter than the one advertised by Africa Online. As a result of this, other BGP speakers pick up the route advertised by AboveNet against the route advertised by Africa Online.

The figure below, constructed using BGPlay from RIPE NCC, shows a snapshot of the routing activity for 194.9.82.0/24 during the period when it was hijacked. The colored lines indicate the path different ASes would take to reach this prefix. Clearly most of the ASes believed AboveNet to be a better path for 194.9.82.0/24.

Vantage Point on the Internet

This is how BGP works and mind you, this isnt broken.

Whats broken is our disability to verify the claim of a service provider when it announces ownership of an address block in BGP. Restrictive route filtering can be applied where the providers only accept the specific prefixes allocated to the customers or where the upstream accepts only specific prefixes allocated to the ISPs, but this is too cumbersome and rarely works. As the matter stands today, there isn’t any clean way to know if the reachability announced by your friendly peer is genuine or whether the provider has a feasible path to the destinations advertised. This needs to be fixed and there is work going on towards this direction in the SIDR WG of IETF.

Can the service providers do something when they learn that an IP prefix has been hijacked by some AS? The answer, fortunately, to this is an unequivocal Yes.

A service provider can override BGP’s decision process in selecting the route advertisement with the shortest AS_PATH length by (i) manipulating the BGP path attribute like LOCAL_PREF since its checked before the AS_PATH length or (ii)  decreasing the weight of the offending peer you learn the hijacked route from (this would only work for routers connected  directly to AboveNet) (iii) Use regular expressions to filter all or the specific hijacked route advertisement from AS 6461 (AboveNet) so that the announcement from Africa Online wins. The legitimate route is now propagated to other parts of the world.

Each time an ISP inadvertently hijacks someone else’s address block it risks to lose some amount of credibility in the service provider world. Their names are splashed on the mails/PPTs in NANOG and IETF whenever there’s a discussion on interdomain security or on the blogs all over the world.

Fortunately for AboveNet, their hijacking didn’t throw millions off popular websites like Youtube, Google, Yahoo! etc. That would have attracted a LOT more attention than what this event did. When PTA had hijacked YouTube, it was all over the news and there were columns running in Wall Street Journal and New York Times about how tenuous the Internet architecture is. Also what unfortunately went in favor of AboveNet was that the affected users were not in US/Europe/Japan, but were in a relatively silent African subcontinent.


The Classical Fish Problem in Routing

The Internet in mid 80s and 90s was envisaged to work on the IP destination based routing/forwarding paradigm. This meant that the routing protocols would establish the best paths based on some scalar metric like the hop count, or link costs, and all traffic would follow that path. This would work since all IP routing and forwarding was based on the IP address carried in the IP packets. With all due respect and credit to the Internet’s forefathers, the vision and the design did work, till a few years back, when the network operators started realizing that though the IP architecture was indeed scalable, it lacked the finesse to optimally utilize the network resources (particularly in the backbones).

The inadequate utilization of the network resources can be illustrated with the classic “fish problem“. It derives its name from the network (Fig 1) resembling a fish, with A being the head and G and H, the tail of a fish.

Figure 1

All traffic emanating (or passing through) from the tail, (G or H) towards the head (A) can take either of the two paths (F-D-C-B or F-E-B) based on how the IP routing tables are programmed by the routing protocols. The latter decide on the best path by considering the link costs advertised by each router in the network. In this example the total cost of path G-F-D-C-B-A is (10+5+5+5+10) 35, while the cost of the path G-F-E-B-A is 30. This means that all traffic from G to A (or G to B) would follow the path through router E (as shown by the red arrow), and the F-D-C-B path would remain unused, since the total cost associated with it is higher than the one through router E.

This leads to an extremely unbalanced traffic distribution, where the link F-E-B can get heavily overloaded, and at worst, congested, while F-D-C-B always remains idle. This problem arises because of the way IP routing paradigm works. Lets us see why:

o IP routing is destination based, so packets are only routed based on the destination IP address in the packet. Routing protocols typically install one next-hop for each IP address (except in case of equal cost routes, which we can ignore for the time being) or a range of IP addresses (subnet masks) thus all packets sharing the same destination address would all get routed to the same next-hop. This means that if F installs a route 100/8 with next-hop as E, then all IP traffic falling under 100/8 coming to F, would get routed to E. This can lead to unbalanced traffic distribution and create unnecessary congestion hot spots.

o Routers make a local decision, based on what they think is the most optimal path from their perspective, when selecting a path. Since all routers run the same SPF algorithm, with the same lin state database, they all come up with the same shortest path, which very soon turns congested, while the non-shortest path remains idle, and unused. This implies that to optimize the network utilization the routers must factor in some other things before chosing the path. One thing that comes instantly to mind is the total bandwidth available on each link when computing the path. If routers can somehow keep track of the available bandwidth available on each link, then it can distribute the traffic in a manner which can optimize the network resource utilization.

Coming back to our network, we find that all traffic from tail to head flows through the router E, leaving D and C idle. So, what can be done to fix this?

The operator can manipulate the link costs on path F-D-C-B, in a manner as shown below, to get the traffic to flow over it.

Figure 2

This clearly works, since cost of the path F-D-C-B (9) is now lower than path F-E-B (10). But hang on. What we’ve only achieved is moving the entire traffic from path F-E-B to F-D-C-B! The traffic would soon start congesting the latter link, while leaving the former unused. We have really achieved nothing, but have only moved the problem elsewhere. Clearly, this wouldn’t work. So, what else can be done?

Well, not much. A clever network operator can play around with the link costs on paths F-D-C-B and F-E-B such that both become equal, as shown in the figure below.

Figure 3

This would surely alleviate the problem as the two paths would now be equally used. However, this scheme of manually adjusting the link costs is not scalable and only works for small networks. Imagine replacing C, D and E with hundreds of routers, with a subset of them being connected to each other. The precarious scheme of adjusting the link costs would become too complex and too fragile to work. A single link (or a router) failure or a cost change would bring down the entire scheme of distributing the traffic across two paths.

The only scalable solution for the fish problem is by going beyond the realms of traditional IP routing and by providing mechanisms to explicitly manage the traffic inside the network. This new paradigm of routing is called constraint based routing, which essentially strives to compute the best path without violating any of the constraints imposed on the network, and at the same time, being optimal with respect to some scalar metric (hop count, links costs, etc). Once such a path is computed, it establishes and maintains forwarding state along such a path.

This differs from the existing IP routing paradigm which only tries to optimize (by minimizing), a particular scalar metric when computing the best path to a destination. Thus RIP optimizes the number of hops and OSPF/IS-IS, the total path cost, where total path cost is the sum total of individual cost of all the links along the path.

I would discuss more on constraint based routing in my subsequent posts.


The Complete and Partial SPF in IS-IS

In the previous post I explained how the SPF algorithm works and how its used in the link state routing protocols. Click here to know the difference between SPF run for OSPF and IS-IS.

After the SPF run, IS-IS would have an SPF tree with the shortest path to reach each Intermediate System in its level. Fair enough - but how do we determine the IP networks that we can reach?

We earlier saw that at each step of the algorithm, the TENT is examined, and the node with the least cost from the root node is moved into PATHS. When a node is placed in PATHS, all IP prefixes/networks advertised by it are installed in the IS-IS Routing Information Base (RIB) with the corresponding metric and next hop. The directly connected IS-IS neighbors of the node that just made it into PATHS are then added to TENT if they are not already there and their associated costs adjusted accordingly, for the next selection.

The SPF tree thus computed considers the Intermediate Systems as nodes of the graph and the IP addresses advertised by these as the leaves, hanging off the nodes. Thus, the entire shortest path tree in IS-IS does not need to be recomputed if the network changes involved are only related to IP prefixes. Instead, the router can run a partial computation to find an alternative IP prefix if one exists – this partial run is called the partial SPF. More details here and here.

The network topology is computed and determined by the adjacencies advertised in the IS-IS LSPs. We already know that a full SPF is only required when the network topology changes. This implies that only a loss of an IS-IS adjacency would trigger a full SPF. To cite an example, when a point-to-point link goes down, the router loses its adjacency with the neighbor at the other end. This signals a change in topology and, a full SPF is scheduled. OTOH, when a route redistributed from a different routing protocol or a level 2 route leaked into the level 1 route goes away, then it does not bring about any topology change. Because the IP prefixes are only the leaves of the SPF tree, and this does not flag a change in network topology, only the partial route computation (PRC) is run to find an alternative path, if one exists.


Shortest Path First (SPF) Algorithm Demystified ..

For the sake of brevity i would refer an Intermediate System (IS) in IS-IS parlance as a router in the following post – thus a router in this post could mean either an OSPF speaker or an IS-IS speaker or some other routing element from your favorite link state protocol. For the SPF algorithm to work, it would require *all* routers in the network to know about all the other routers in the network and the links connecting them. How a link state routing protocol encodes this information and ensures that its disseminated properly is left to that protocol. OSPF encodes this information in Link State Advertisements (LSAs) and floods it reliably, while IS-IS encodes this in a Link State Packet (LSP) that it originates.

Once each router knows about all the other routers and the links connecting them, it runs the Dijkstra Shortest Path First algorithm to determine the shortest path from itself to all the other routers in the network. Since each router has a similar copy of the link state database and each runs the same algorithm, they end up constructing the same view of the network and packets get routed consistently at each hop.

So, how does SPF algorithm work in OSPF and IS-IS.

Imagine a simple network as shown in the figure below.

Network Diagram

Once each router has flooded its link state information in the network, all routers know about all the other routers and the links connecting them. The link state database on each router looks like the following:

[A, B, 3], [A, C, 6], [B, A, 3], [B, D, 3], [B, E, 5], [C, A, 6], [C, D, 9], [D, C, 9], [D, B, 3], [D, E, 3] , [E, B, 5] and [E, D, 3]

Each triple should be read as {originating router, router its connected to, the cost of the link connecting the two routers}

So what does Router A do with this information and how is this used in SPF?

While running SPF, each router maintains two lists – the first is a list of nodes for which the shortest path has been determined and we are sure that no path shorter than the one we have computed can exist. This list is called the PATH (or PATHS) list. The second is the list of paths through the routers that may or may not be the shortest to a destination. This list is called the TENTative list, or simply the TENT list From now on, TENT would refer to the TENT list and PATH, to the PATH list.

Each element in the list is a triplet of the kind {endpoint router that we’re trying to reach, total distance from the calculating router, next-hop to reach the endpoint router}

Each router runs the following algorithm to compute the shortest path to each node:

Step I: Put “self” on the PATH with a distance of 0 and a next hop of self. The router running the SPF refers to itself as either “self” or the root node, because this node is the root of the shortest-path tree.

Step II: Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.

Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.

If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.

Step III: Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2. Stop only when TENT becomes empty.

Lets follow the sequence that Router A goes through for building its SPF tree.

1st Iteration of the SPF run

Step I: Put “self” on the PATH with a distance of 0 and a next hop of self. After this step the PATH and TENT look as follows:

PATH – {A, 0, A}

TENT – { }

Step II: Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Patently, A is the PATH node. Examine its list of neighbors ({A, B, 3}, {A, C, 6}). OSPF does this by looking at the LSAs advertised by Router A, while IS-IS does this by looking at the neighbors TLV found in Router A’s LSP. When an IS-IS node is placed in PATHS, all IP prefixes advertised by it are installed in the IS-IS Routing Information Base (RIB) with the corresponding metric and next hop.

The Step II further says – “Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.” A’s neighbors are B and C, and since neither of them is in the PATH or TENT, we add both of them to the TENT.

PATH – {A, 0, A}

TENT – {B, 3, A}, {C, 6, A}

Step II says – “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.”

Lets pick up the TENT node B. The cost to reach B would be the cost to reach from root node to PATH node + cost from PATH node to TENT node. In the first iteration of SPF, both the root node and PATH node is A. Thus total cost to reach B is cost to reach from A (PATH node) to B (TENT node) which is 3.

Step II says – “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.”

B isnt in the TENT, so skip this.

Step III says – “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat“. The lowest cost neighbor is B (with cost 3). Move this to PATH. We go back to Step II since TENT isnt yet empty.

2nd Iteration of the SPF run

PATH – {A, 0, A} {B, 3, A}

TENT – {C, 6, A}

We have thus added the neighbor B in PATH, since we know that there cannot be any other shorter path to reach it. And this is, as you will note, consistent with our definition of PATH wherein we had earlier stated that nodes can only be placed there once we are sure that there cannot be any shorter path to reach them from the root node.

We begin our 2nd iteration and go to Step II which says – “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors“. B is the PATH node and we examine its neighbors (A, E and D). Step II further says – “Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.”

Since A is already in PATH with ignore it and only add E and D to TENT.

PATH – {A, 0, A} {B, 3, A}

TENT – {C, 6, A} {D, 3, B}, {E, 5, B}

Step II says - “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.

Aah .. this means that cost against D and E would not be 3 and 5, as what i have shown, but would instead be 3 (cost from A to B) +3 (cost from B to D) = 6 and 3 (cost from A to B) +5 (B to E) = 8

Thus the PATH and TENT look as follows:

PATH – {A, 0, A} {B, 3, A}

TENT – {C, 6, A} {D, 6, B}, {E, 8, B}

The rest of the Step II does not apply here since D and E dont exist in the TENT.

Come to Step III which says “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

Lowest cost neighbor can either be C or D so pick on up randomly. It can mathematically be proven that we would end up with the same SPF tree irrespective of which equal cost neighbor is picked up from the TENT first. In our case, lets pick up C.

It is thus moved to the PATH

PATH – {A, 0, A} {B, 3, A} {C, 6, A}

TENT – {D, 6, B}, {E, 8, B}

3rd Iteration of the SPF run

Go back to Step II which says “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost

C’s neighbors are A and D. Since A is already in the PATH only D is added in the TENT.

PATH – {A, 0, A} {B, 3, A} {C, 6, A}

TENT – {D, 6, B}, {E, 8, B} {D, 9, C}

As per Step II we now need to fix the cost to reach D from the root node. This would be cost to reach from A to C (6) + cost from C to D (9) = 15

PATH and TENT now:

PATH – {A, 0, A} {B, 3, A} {C, 6, A}

TENT – {D, 6, B}, {E, 8, B} {D, 15, C}

Step II further says – “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration.

node D already exists in the TENT (via B with cost 6) and since its with a lesser cost, we remove the node that we had just added from the TENT. This is because a lower cost path to reach node D already exists in the TENT.

PATH – {A, 0, A} {B, 3, A} {C, 6, A}

TENT – {D, 6, B}, {E, 8, B}

We come to Step III which says “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

Lowest cost neighbor is D – which means we move D now to the PATH and go to Step II, since the TENT isnt yet empty.

4th Iteration of the SPF run

PATH – {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT – {E, 8, B}

Step II says “Take the node (call it the PATH node) just placed on the PATH list and examine its list of neighbors. Add each neighbor to the TENT with a next hop of the PATH node, unless that neighbor is already in the TENT or PATH list with a lower cost.”

To examine D’s neighbors we look at the link state information it advertised. It advertised the following information:

[D, C, 9], [D, B, 3], [D, E, 3]

This means that D is says that its connected to C, B and E. We ignore its connection to B and C, since they are already in PATH. We thus only add neighbor E in the TENT.

PATH – {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT – {E, 8, B} {E, 3, D}

Continuing with Step II which further says – “Call the node just added to the TENT as the TENT node. Set the cost to reach the TENT node equal to the cost to get from the root node to the PATH node plus the cost to get from the PATH node to the TENT node.”

This means that we need to adjust the cost of the triple {E, 3, D} that we just added to the TENT. The cost to reach E via D would thus be the cost to reach D from A (which is the root node) + the cost to reach E from D. This comes out to be 6 + 3 = 9.

TENT thus looks like this – {E, 8, B} {E, 9, D}

Step II further says – “If the node just added to the TENT already exists in the TENT, but with a higher cost, replace the higher-cost node with the node currently under consideration”

We just added node E in the TENT and a route to E already exists in the TENT, and its with a lower cost. This means that we remove the route that we had just added.

So PATH and TENT at this point look as follows:

PATH – {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B}

TENT – {E, 8, B}

We go to Step III which says – “Find the lowest cost neighbor in the TENT and move that neighbor to the PATH, and repeat Step 2″.

The lowest cost neighbor in TENT right now is E. We move this to PATH.

So PATH and TENT at this point look as follows:

PATH – {A, 0, A} {B, 3, A} {C, 6, A} {D, 6, B} {E, 8, B}

TENT – { }

Step III further states that we continue if and only if something remains in the TENT. TENT is now empty, which means that we have computed the shortest paths to all the nodes that A was aware of.

This is marks the end of the SPF algorithm run and the SPF tree that it has computed looks as follows

Network Topology after the SPF Run

Shortest Path First (SPF) Calculation in OSPF and IS-IS

Both OSPF and IS-IS use the Shortest Path First (SPF) algorithm to calculate the best path to all known destinations based on the information in their link state database. It works by building the shortest path tree from a specific root node to all other nodes in the area/domain and thereby computing the best route to every known destination from that particular source/node. The shortest path tree thus constructed, consists of three main entities – the edges, the nodes and the leaves.

Each router in OSPF or an Intermediate System in case of IS-IS, is a node in the SPF tree. The links connecting these routers, the edges. The IP network associated with an IP interface, added into OSPF via the network command is a node, while the IP address associated with an interface thats added in IS-IS is  a leaf. An IP prefix redistributed into OSPF or IS-IS from other routing protocols (say BGP)  becomes a leaf in both the protocols. Inter-area routes are patently, the leaves.

Network Diagram

If you consider the network as shown above, then OSPF would consider routers A, B, C and the network 10.1.1.0/24 as nodes. This is assuming that the interface associated with 10.1.1.0/24 has been added into OSPF. The only leaf in the graph would be the IP prefix 56.1.1.0/24 redistributed into A from some other protocol. IS-IS otoh, would consider routers A, B and C as nodes and networks 10.1.1.0/24 and 56.1.1.0/24 as leaves. This seemingly innocuous difference in representation of the SPF tree leads to some subtle differences between the SPF run in OSPF and IS-IS, which can interest a network engineer.

The nodes in the shortest path tree or the graph form the backbone or the skeleton of that tree. Any change there necessitates a recalculation of the SPF tree, while a change in a leaf of the SPF tree does not require a full recalculation. Removing and adding of leaves without recalculating the entire SPF tree is known as Partial SPF and is a feature of almost every implementation of OSPF and IS-IS that i am aware of. This implies that if the link connecting router C to 10.1.1.0/24 goes down, then a full SPF would be triggered in case of OSPF, and a partial SPF in case of IS-IS.

This shows that the general adage – “Avoid externals in OSPF” should be taken with a pinch of salt and it really depends upon your topology. I have seen networks where ISPs redistribute numerous routes that have a potential to change on a regular basis, as opposed to bringing them via the network command.

IS-IS

o IP routing is integrated into IS-IS by adding some new TLVs which carry IP reachability information in the LSPs. All IP networks are considered externals, and they always end up as leaf nodes in the shortest path tree when IS-IS does a SPF run. All node information, neccessary for SPF calculation is advertised in its IS Neighbors or IS Reachability TLVs. This unambiguously separates the prefix information from the topology information which makes Partial Route Calculation (PRC) easily applicable. Thus IS-IS performs only the less CPU intensive PRC when network events do not affect the basic topology but only affect the IP prefixes.

o Used narrow (6 bits wide) metrics which helped in some SPF optimization. However such small bits proved insufficient for providing flexibility in designing IS-IS networks and other applications using IS-IS routing (MPLS-TE). “IS-IS extensions for Traffic Engineering” introduced new TLVs which defined wider metrics to be used for IS-IS thus taking away this optimization. But then CPU are fast these days and there arent many very big networks anyways!

o SPF for a given level is computed in a single phase by taking all IS-IS LSP’s TLV’s together.

OSPFv2

o Is built around links, and any IP prefix change in an area will trigger a full SPF. It advertises IP information in Router and Network LSAs. The routers thus, advertise both the IP prefix information (or the connected subnet information) and topology information in the same LSAs. This implies that if an IP address attached to an interface changes, OSPF routers would have to originate a Router LSA or a Network LSA, which btw also carries the topology information. This would trigger a full SPF on all routers in that area, since the same LSAs are flooded to convey topological change information. This can be an issue with an access router or the one sitting at the edge, since many stub links can change regularly.

o Only changes in interarea, external and NSSA routes result in partial SPF calculation (since type 3, 4, 5 and 7 LSAs only advertise IP prefix information) and thus IS-IS’s PRC is more pervasive than OSPF’s partial SPF. This difference allows IS-IS to be more tolerant of larger single area domains whereas OSPF forces hierarchical designs for relatively smaller networks. However with the route leaking from L2 to L1 incorporated into IS-IS the apparent motivation for keeping large single area domains too goes away.

o SPF is calculated in three phases. The first is the calculation of intra-area routes by building the shortest path tree for each attached area. The second phase calculates the inter-area routes by examining the summary LSAs and the last one examines the AS-External-LSAs to calculate the routes to the external destinations.

o OSPFv3 has been made smarter. It removes the IP prefix advertisement function from the Router and the Network LSAs, and puts it in the new Intra-Area Prefix LSA. This means that Router and Network LSAs now truly represent only the router’s node information for SPF and woudl get flooded only if information pertinent to the SPF algorithm changes, i.e., there is atopological change event. If an IP prefix changes, or the state of a stub link changes, that information is flooded in an Intra-Area Prefix LSA which does not trigger an SPF run. Thus by separating the IP information from the topology information, we have made PRC more applicable in OSPFv3 as compared to OSPF2.

I recently wrote a post that discusses this further.


Follow

Get every new post delivered to your Inbox.