Category Archives: IP Routing

Stepping onto IEEE toes with Micro BFD? Ouch!

Nay, we would never do that.

IETF shall never — ever, step onto another standardization bodies toes, feet or anything that looks even remotely inviting. Drawing inspiration from Quentin Tarantino, where the story line keeps moving back and forth, and an enormous amount of cerebral energy is spent in keeping track of whats currently happening, i will boldly take you where no man — or person if you will, has ever gone before (now imagine the Starship Enterprise whooshing past).

Service providers were complaining about how the dead link detection for LAGs was waaaaaaaaaay to slow (more here). It would take them 3 seconds at the least, to detect that a link has gone down, before they would stop sending traffic down that offending link.  3 secs of outage is HUGE and clearly unacceptable – especially with all routers, or at least the ones that claim to forward traffic, supporting 50ms of FRR times.

An easy and a viable option before folks was to request IEEE to rework their LACP standard to bring down the protocol timers from 1 second to a few milli seconds. However, what made this nonviable was the urgency with which the service operators needed the extension – a way to detect dead links in a LAG in the O(ms). I am no IEEE expert, and its my humble understanding that things move slower — way too slower, in IEEE than IETF. The work thus naturally gravitated towards IETF.

Within IETF — and you’ll have to squint your eyes really hard — we can use BFD for this purpose. Its after all a very light weight protocol that just does liveliness detection. BFD keeps exchanging packets back and forth over an IP interface and if it doesnt hear from the other end, it brings down all protocols riding on top of that interface with the enthusiasm that most people often find quite alarming (if not disconcerting).

So, BFD did look like a best fit within IETF, to take on this challenge of fixing the issue of removing dead links for the LAGs asap.

Take a giant leap in the time fabric and you’ll find yourself staring at an ongoing effort in the BFD WG where we establish micro BFD (uBFD) sessions over each link (more here) in an attempt to fix the LAG issue.

The simplest and arguably the most elegant way to get uBFD to work on LAGs is by tweaking the LACP state machine — wherein a member link of a LAG is NOT brought into the Distributing State till the uBFD session on that link goes Up. While this looks way too cool and would have resulted in a single page RFC (which i confess has a certain charm to it), it poses certain challenges. The biggest being that IETF has no control over LACP and any proposal that tweaks LACP’s state machine would have brought out daggers, sabers and other metallic implements that can hurt — really, really hurt.

BFD WG is filled with people who love peace, harmony and detest anything that engenders ill feeling between two biggest Internet related standardization bodies – IEEE and IETF, and it was thus decided that anything that meant stepping onto IEEE’s toes was not cool enough to be considered.

We went back to the drawing boards and really looked at how we could fix the problem within the realms of IETF.

The basic problem that needs to be fixed is as follows:

You have multiple links in a LAG. There is a link that goes dead. Detect that dead link and remove it from the LAG asap so that no traffic is sent over that link.

We decided to run BFD on each link – called it a micro BFD session. Once the uBFD session is established we keep using that link till the session is alive. As soon as the uBFD session goes down, we need to do something so that that link is NOT used for forwarding. Since we cant remove the link from the LAG (as that functionality is beyond our scope) we decided to do the next best thing. We decided that if the uBFD session goes down, we will remove that link from the “load balancer” thats responsible for distributing traffic over the LAG.

A “load balancer” above is a hypothetical construct that controls the member links over which the traffic needs to be load balanced in case of LAGs. Assume there is a LAG comprising of 3 member links. Traffic egressing out of that LAG must hit some HW entity that splits the traffic across these 3 member links. If a member link is removed from that entity the traffic will only be hashed across the remaining two member ports. Manipulating stuff on this HW entity doesnt violate our pious agreement with IEEE as this HW entity is not owned by IEEE.

We can now be more specific. We can say that uBFD should only affect HW entity that manages member links for IP traffic — since BFD is after all an IP protocol and is only verifying IP connectivity. If your HW is capable of separate HW entities for both IPv4 and IPv6, then you could theoretically run multiple uBFD sessions between the same set of end points.

If you have separate HW entities for load splitting for L2 and L3 traffic (nobody does that) then you could theoretically only remove/add member links based on uBFD sessions on the HW entity that manages traffic distribution for IP traffic!

So we now have a solution that fixes the problem cited by the service providers all within the realms of IETF! You run uBFD sessions over each member link. If the uBFD session goes down, you remove the link from the “load balancer” which would effectively mean that no traffic will now be sent over that offending link.

And btw, we arrived at the solution without stepping onto IEEE toes, coz doing that would have hurt. Ouch!


Why do we need specialized switches in Data Centers?

Whats the big deal about Data centers and why do they need special routers and switches anyway? Why cant they use the existing switches that folks use in their back offices or service providers in their networks. What’s so special, really, about a bunch of servers that need Internet connectivity, huh?

Working in the metro Ethernet space all my life I wasn’t sure if I really understood the hype and the reason why Data centers required specialized HW.

It’s only once I started reading about Data centers and how they work and what they’re supposed to do that I was able to appreciate their need for specialized HW – and why the existing products may not be cut for them.

In the world of Wall Street, milliseconds can mean billions of dollars. Really, am not kidding here. Packets carrying Wall Street transactions get delivered to the switch and are then forwarded to the server in the Data Center. There they ride up the protocol stack to the application that executes the trade. The commit message then has to go back down the stack and then be sent over the wire to the switch. The switch does a lookup in its forwarding tables and sends it out on an egress destination port.

One of the things that would differentiate one Data Center switch from the other would be the time it takes for the switch to process the incoming packet, the amount and the nature of queuing that happens (which directly affects the latency), the serialization delays at the ingress and egress, and other factors that can contribute to adding a few microseconds to each packet processing (or transaction in Wall Street speak).

So is adding a few microseconds really a big deal?

Oh Yes, you bet it is – especially in the big bad world of Wall Street.

You only need to google for “high-frequency trading” and you would understand why it’s the suddenly become one of the most talked about thing in the Wall Street.

Lets see how shaving off a few microseconds can help?

A Mutual Fund house places an order to purchase 100000 shares of a company ABC that’s currently going at $10. NASDAQ (or some other exchange) could offer a few selected high frequency traders a peek into the incoming orders for 30 milliseconds or so (this is illegal, but there are loopholes in the system). The high frequency trader, knows that a purchase order for 100000 shares of ABC is coming and immediately picks up all the available shares at $10. After a few seconds, the Mutual Fund house order hits the market place and the high frequency trader sells their shares at $10.50, pocketing $50000 from a single transaction. Now, multiply this by the average number of transactions that typically take place in an Exchange and you would you arrive at the staggering amount of $$$ that’s at stake here.

Its thus imperative that the high speed computers that are doing all the number crunching have supporting network infrastructure that can help them in making the kill. Lower network latency and increased throughput means faster and better profits for the trading companies.

Firms using high frequency trading earned over $21 billion in profits last year. The TABB Group estimates that a 5 millisecond delay in transmitting an automatic trade can cost a broker 1% of its flow; which could be worth $4 million in revenues per millisecond. According to Reuters, trading a stock is now far faster than a blink of an eye or the speed of a lightning strike.

In fact several high frequency traders house their systems as close as possible to the exchanges to minimize the latency in executing their orders.

There are also other environments where low latency is desirable. Environments such as computer animation studios that may spend 80 to 90 hours rendering a single frame for a 3D movie, or scientific compute server farms (for Computational Fluid Dynamics) that might involve tens of thousands of compute cores. If the network is the bottleneck within those massive computer arrays, the overall performance is affected.

The Data centers thus patently need switches that have extremely extremely low latency in forwarding packets.

So what are the other things that the Data center switches must support – and which may not be available in ordinary switches?

Micro-bursting often happens in Data Centers, wherein the buffers overrun and the switches drop packets. The problem is that these micro-bursts happen often at microsecond intervals, so the switches may not report them. A good Data Center switch will absorb the micro-burst and forward the packets without dropping ‘em.

Data centers as we just saw are designed for critical systems that require high availability. This means redundant power, efficient cooling, secure access, ideally no down time, and a whole lot of other things, but most of all, it means no single points of failure.

Every device in a data center should have dual power supplies, and each one of those power supplies should be fed from independent power feeds. The power supplies are sized such that the device operates with only power path. All devices in a data center should have front-to-back airflow, or ideally, airflow that can be configured front to back or back to front. Thermal guidelines for Data Centers is a science by itself and there is more than petabytes of information available on the net on how this needs to be effectively done.

All devices in a data center should support the means to upgrade, replace, or shut down any single chassis at any time without interruption to meet the hard Service Level Agreements (SLAs). In-Service Software Upgrades (ISSU) should ideally be available, but this can be circumvented by properly distributing load to allow meeting the prior requirement. Data center devices should offer robust hardware, even NEBS compliance where required, and robust software to match.

This isnt the most exhaustive list of the things required out of switches deployed in Data Centers, and only serves to give a hint of whats needed there.

Oh and btw, I must finish this post and rush to place an order on NASDAQ before some high frequency trader preempts me and books all the profits!


OpenFlow, Controllers – Whats missing in Routing Protocols today?

There is a lot of hype around OpenFlow as a technology and as a protocol these days. Few envision this to be the most exciting innovation in the networking industry after the vaccum tubes, diodes and transistors were miniaturized to form integrated circuits.  This is obviously an exaggeration, but you get the drift, right?

The idea in itself is quite radical. It changes the classical IP forwarding model from one where all decisions are distributed to one where there is a centralized beast – the controller – that takes the forwarding decisions and pushes that state to all the devices (could be routers, switches, WiFi access points, remote access devices such as CPEs) in the network.

Before we get into the details, let’s look at the main components – the Management, Control and the Forwarding (Data) plane – of a networking device. The Management plane is used to manage (CLI, loading firmware, etc) and monitor the device through its connection to the network and also coordinates functions between the Control and the Forwarding plane. Examples of protocols processed in the management plane are SNMP, Telnet, HTTP, Secure HTTP (HTTPS), and SSH.

The Forwarding plane is responsible for forwarding frames – it receives frames from an ingress port, processes them, and sends those out on an egress port based on what’s programmed in the forwarding tables. The Control plane gathers and maintains network topology information, and passes it to the forwarding plane so that it knows where to forward the received frames. It’s in here that we run OSPF, LDP, BGP, STP, TRILL, etc – basically, whatever it takes us to program the forwarding tables.

Routing Protocols gather information about all the devices and the routes in the network and populate the Routing Information Base (RIB) with that information. The RIB then selects the best route from all the routing protocols and populates the forwarding tables – and Routing thus becomes Forwarding.

So far, so good.

The question that keeps coming up is whether our routing protocols are good enough? Are ISIS, OSPF, BGP, STP, etc the only protocols that we can use today to map the paths in the network? Are there other, better options – Can we do better than what we have today?

Note that these protocols were designed more than 20 years ago (STP was invented in 1985 and the first version of OSPF in 1989) with the mathematics that goes in behind these protocols even further. The code that we have running in our networks is highly reliable, practical, proven to be scalable – and it works. So, the question before us is – Are there other, alternate, efficient ways to program the network?

Lets start with what’s good in the Routing Protocols today.

They are reliable – We’ve had them since last 20+ years. They have proven themselves to be workable. The code that we use to run them has proven itself to be reliable. There wouldn’t be an Internet if these protocols weren’t working.

They are deterministic in that we know and understand them and are highly predictable – we have experience with them. So we know that when we configure OSPF, what exactly will it end up doing and how exactly will it work – there are no surprises.

Also what’s important about today’s protocols are that they are self healing. In a network where there are multiple paths between the source and the destination, a loss of an interface or a device causes the network to self heal. It will autonomously discover alternate paths and will begin to forward frames along the secondary path. While this may not necessarily be the best path, the frames will get delivered.

We can also say that today’s protocols are scalable.  BGP certainly has proven itself to run at the Internet’s scale with extraordinarily large number of routes. ISIS has as per the local folklore proven to be more scalable than OSPF. Trust me when i say that the scalability aspect is not the limitation of the protocol, but is rather the limitation of perhaps the implementation. More on this here.

And like everything else in the world, there are certain things that are not so good.

Routing Protocols work under the idea that if you have a room full of people and you want them to agree on something then they must speak the same language. This means that if we’re running OSPFv3, then all the devices in the network must run the exact same version of OSPFv3 and must understand the same thing. This means that if you throw in a lot of different devices with varying capabilities in the network then they must all support OSPFv3 if they want to be heard.

Most of the protocols are change resistant, i.e., we find it very difficult to extend OSPFv2 to say introduce newer types of LSAs. We find it difficult to make enhancements to STP to make it better, faster – more scalable, to add more features. Nobody wants to radically change the design of these protocols.

Another argument that’s often discussed is that the metrics used by these protocols are really not good enough. BGP for example considers the entire AS as one hop. In OSPF and ISIS, the metrics are a function of the BW of the link. But is BW really the best way to calculate a metric of an interface to feed in to the computation to select the best path?

When OSPF and all the routing protocols that we use today were designed and built they were never designed to forward data packets while they were still re-converging. They were designed to drop data as that was the right thing to do at that time because the mathematical computation/algorithms took long enough and it was more important to avoid loops by dropping packets.  To cite an example, when OSPF comes up, it installs the routes only after it has exchanged the entire LSDB with its neighbors and has reached a FULL state. Given the volume of ancillary data that OSPF today exchanges via Opaque LSAs this design is an over-kill and folks at IETF are already working on addressing this.

We also have poor multipath ability with our current protocols today. We can load balance between multiple interfaces, but we have problems with the return path which does not necessarily come back the way you wanted. We work around that to some extent by network designs that adapt to that.

Current routing protocols forward data based on destination address only. We send traffic to 192.168.1.1 but we don’t care where it came from. In truth as networks get more complex and applications get more sophisticated, we need a way to route by source as well by destination. We need to be able to do more sophisticated forwarding. Is it just enough to send an envelope by writing somebody’s address on an envelope and putting it in a post box and letting it go in the hope that it gets there? Shouldn’t it say that Hey this message is from the electricity deptt. That can go at a lower priority than say a birthday card from grandma that goes at a higher priority. They all go to the same address but do we want to treat them with the same priority?

So the question is that are our current protocols good enough – The answer is of course Yes, but they do have some weaknesses and that’s the part which has been driving the next generation of networking and a part of which is where OpenFlow comes in ..

If we want to replace the Routing protocols (OSPF, STP, LDP, RSVP-TE, etc) then we need something to replace those with. We’ve seen that Routing protocols have only one purpose for their existence, and that’s to update the forwarding tables in the networking devices. The SW that runs the whole system today is reasonably complex, i.e., SW like OSPF, LDP, BGP, multicast is all sitting inside the SW in an attempt to load the data into the forwarding tables. So a reasonably complex layer of Control Plane is sitting inside each device in the network to load the correct data into the forwarding tables so that correct forwarding decisions are taken.

Now imagine for a moment that we can replace all this Control Plane with some central controller that can update the forwarding tables on all the devices in the network. This is essentially the OpenFlow idea, or the OpenFlow model.

In the OpenFlow model there is an OpenFlow controller that sends the Forwarding table data to the OpenFlow client in each device. The device firmware then loads that into the forwarding path. So now we’ve taken all that complexity around the Control Plane in the networking device and replaced it with a simple client that merely receives and processes data from the Controller. The OpenFlow controller loads data directly into the OpenFlow client which then loads it directly into the FIB. In this situation the only SW in the device is the chip firmware to load the data into the FIB or TCAM memories and to run the simple device management functions, the CLI, to run the flash and monitor the system environmentals. All the complexity around generating the forwarding table has been abstracted away into an external controller. Now its also possible that the device can still maintain the complex Control Plane and have OpenFlow support. OpenFlow in such cases would load data into the FIBs in addition to the RIB that’s maintained by the Control Plane.

The Networking OS would change a little to handle all device operations such as Boot, Flash, Memory Management, OpenFlow protocol handler, SNMP agent, etc. This device will have no OSPF, ISIS,RSVP or Multicast – none of the complex protocols running. Typically, routers spend close to 30+% of CPU cycles doing topology discovery. If this information is already available in some central server, then this frees up significant CPU cycles on all routers in the network. There will also be no code bloat – we will only keep what we need on the devices. Clearly, smaller the code running on the devices, lesser is the bugs, resources required to maintain it – all translating into lower cost.

If we have a controller that’s dumping data into the FIB of a network device then it’s a piece of SW – its an application. It’s a SW program that sits on a computer somewhere. It could be an appliance, a virtual machine (VM) or could reside somewhere on a router. The controller needs to have connectivity to all the networking devices so that it can write out, send the FIB updates to all devices. And it would need to receive data back from the devices. It is envisioned that the controller would build a topology of the network in memory and run some algorithm to decide how the forwarding tables should be programmed in each networking device. Once the algorithm has been executed across the network topology then it could dispatch topology updates to the forwarding tables using OpenFlow.

OpenFlow is an API and a protocol which decides how to map the FIB entries out of the controller and into the device. In this sense a controller is, if we look back at what we understand today, very similar to Stack Master in Cisco. So if one has 5 switches in a stack then one of them becomes the Stack Master. It takes all of the data about the forwarding table. It’s the one that runs the STP algo, decides what the FIB looks like and sends the FIB data on the stacking backplane to each of the devices so that each has a local FIB (that was decided by the Stack Master).

To better understand the Controllers we need to think of 5 elements as shown in the figure.

Controller

At the bottom we have the network with all the devices. The OpenFlow protocol communicates with these devices and the Controller. The Controller has its own model of the network (as shown on the right) and presents the User Interface out to the user so that the config data can come in. Via the User Interface the admin selects the rules, does some configuration, instructs on how it wants the network to look like. The Controller then looks at its model of the network that it has constructed by gathering information from the network and then proceeds with programming the forwarding tables in all the network devices to be able to achieve that successful outcome. OpenFlow is a protocol – its not a SW or a platform – it’s a defined information style that allows for dynamic configuration of the networking devices.

A controller could build a model of the network and have a database and then run SPF, RSVP-TE, etc algorithms across the network to produce the same results as OSPF, RSVP-TE running on live devices. We could build an SPF model inside the controller and run SPF over that model and load the forwarding tables in all devices in the network. This would free up each device in the network from running OSPF, etc.

The controller has real time visibility of the network in terms of the topology, preferences, faults, performance, capacity, etc. This data can be aggregated by the controller and made available to the network applications.  The modern network applications can be made adaptive, with the potential to become more network-efficient and achieve better application performance (e.g., accelerated download rates, higher resolution videos), by leveraging better network provided information.

Theoretically these concepts can be used for saving energy by identifying underused devices and shutting them down when they are not needed.

So for one last time, lets see what OpenFlow is.

OpenFlow is a protocol between networking devices and an external controller, or in other words a standard method to interface between the control and data planes. In today’s network switches, the data forwarding path and the control path execute in the same device. The OpenFlow specification defines a new operational model for these devices that separates these two functions with the packet processing path on the switch but with the control functions such as routing protocols, ACL definition moved from the switch to a separate controller. The OpenFlow specification defines the protocol and messages that are communicated between the controller and network elements to manage their forwarding operation.


Its time we retire Authentication Header (AH) from the IPsec Suite!

Folks who think Authentication Header (AH) is a manna from heavens need to read the Bible again. Thankfully you dont find too many such folks these days. But there are still some who thank Him everyday for blessing their lives with AH. I dread getting stuck with such people in the elevators — actually, i dont think i would like getting stuck with anybody in an elevator, but these are definitely the worst kind to get stuck with.

So lets start from the beginning.

IPsec, for reasons that nobody cares to remember now, decided to come out with two protocols – Encapsulating Security Payload (ESP) and AH, as part of the core architecture. ESP did pretty much what AH did, with the addition of providing encryption services. While both provided data integrity protection, AH went a step further and also secured a few fields from the IP header for you.

There are bigots, and i unfortunately met one a few days ago, who like to argue that AH provides greater security than ESP since AH covers the IP header as well. They parrot this since that’s what most textbooks and wannabe CCIE blogs and websites say. Lets see if securing the IP header really helps us.

When IPsec successfully authenticates the payload, we know that the packet came from someone who knew the authentication key. I would wager that that should be enough to accept the packet. The IP header is just required to route the packet to reach the recipient – its not meant to do anything else. Thats networking 101 really.

IPsec Security Associations are established based on the source and destination addresses and some L4 port information. The receiver matches the incoming packet’s against SPI and inbound selectors associated with the SA. Packet is only accepted if it came from the correct source and destination IP address. If an attacker somehow manages to change the IP header then there are high chances that it will get rejected by IPsec since it will fail the Security Policy Database (SPD) check.  So, what is protecting the header really giving us?

BTW ESP can also protect the IP header if its used in the tunnel mode. So, if someone is really keen on protecting the IP header then ESP in the tunnel mode can also be used. It should however be noted that ESP tunnel mode SA applied to an, say IPv6 flow,  results in at least 50 bytes of additional overhead per packet. This additional overhead may be undesirable for many bandwidth-constrained wireless and/or satellite communications networks, as these types of infrastructure are not over provisioned.

Packet overhead is particularly significant for traffic profiles characterized by small packet payloads (e.g., various voice codecs). If these small packets are afforded the security services of an IPsec tunnel mode SA, the amount of per-packet overhead is increased.

This issue will be alleviated by header compression schemes defined in the IETF.

I have recently published an IETF draft where i explicitly ask for AH to be retired since there is nothing useful that it does that cant be achieved with ESP with NULL encryption algorithm.

Please note that i have absolutely no complaints with AH and the claims that it makes. It does its job really well. Its just that its completely redundant and the world can certainly do with one less protocol to manage.

Retiring AH doesn’t mean that people have to stop using AH right now. It only means that in the opinion of the community there are now better alternatives. This will discourage new applications and protocols to mandate the use of AH. It however, does not preclude the possibility of new work to IETF that will require or enhance AH. It just means that the authors will have to do a real good job of convincing the community on why that solution is really needed and the reason why ESP with NULL encryption algorithm cannot be used instead.

The IETF draft that i have written aims to dispel several myths  surrounding AH and i show that in each case ESP with NULL encryption algorithm can be used instead, often with better results.


Life of Crypto Keys employed in Routing Protocols

Everyone knows that the cryptographic key used for securing your favorite protocol (OSPF, IS-IS, BGP TCP-AO, PIM-SM, BFD, etc)  must have a limited life time and the keys must be changed frequently. However, most people don’t understand the real reason for doing so. They argue that keys must be regularly changed since they are vulnerable to cryptanalysis attacks. Each time a crypto key is employed it generates a cipher text. In case of routing protocols the cipher text is the authentication data that is carried by the protocol packets. Its alleged that using the same key repetitively allows an attacker to build up a store of cipher texts which can prove sufficient for a successful cryptanalysis of the key value. It is also believed that if a routing protocol is transmitting packets at a high rate then the ”long life” may be in order of a few hours. Thus it’s the amount of traffic that has been put on the wire using a specific key for authentication and not necessarily the duration for which the key has been in use that determines how long the key should be employed.

This was true in the Jurassic ages but not any more. The number of times a key can be used is  dependent upon the properties of the cryptographic mode than the algorithms themselves. In a cipher block chaining mode, with a b-bit block, one can safely encrypt to around 2^(b/2) blocks. AES (Advanced Encryption Standard)  used worldwide has a fixed block size of 128, which means that it can be safely used for 2^(64+4) bytes of routing data. If we assume a protocol that sends 1 Gig (!!) worth of control traffic *every* second, even then it is safe enough to be used for around 8700 *years* without changing the key! Hopefully, the system admin will remember to change the crypto key after 8700 years! ;-)

So, if the data is secure then why do we really need to change the crypto keys ever?

As a general rule, where strong cryptography is employed, physical, procedural, and logical access protection considerations often have more impact on the key life than do algorithm and key size factors. People need to change the keys when an operator who had access to the keys leaves the company. Using a key chain, a set of keys derived from the same keying material and used one after the other, also does not help as one still has to change all the keys in the key chain when an operator having access to all those keys leaves the company. Additionally, key chains will not help if the routing transport subsystem does not support rolling over to the new keys without bouncing the routing sessions and adjacencies.

Another threat against a long-lived key is that one of the systems storing the key, or one of the users entrusted with the key, could be subverted. So, while there may not be cryptographic motivations of changing the keys, there could be system security motivations for rolling or changing the key.

What complicates this further is that more frequent manual key changes might actually increase the risk of exposure as it is during the time that the keys are being changed that they are likely to be disclosed! In these cases, especially when very strong cryptography is employed, it may be more prudent to have fewer, well controlled manual key distributions rather than more frequent, poorly controlled manual key distributions.

To summarize, operators need to change their crypto keys because of social and political, rather than scientific or engineering driven reasons.

You can read more about this in the IETF draft that i have co-authored here.


Issues with how BFD is currently implemented over LAGs

The BFD standards dont explicitly talk about how BFD should be implemented on Link Aggregation Groups (LAGs). This leaves a lot of room for imagination and vendors have implemented their own proprietary mechanisms to make BFD work on LAGs. Now, there is only this much room for innovation and most vendors have naturally arrived at similar techniques to implement interoperable BFD over LAGs. So, what makes BFD so sticky to implement on LAGs?

BFD being an L3 protocol, is oblivious to the physical link that the BFD packets go out on. Usually, there is only one link associated with an L3 interface, and there is thus no ambiguity on the link that packet needs to go out on. However, when an IP interface is configured over a LAG, there are multiple constituent links that the packet can go out on, and BFD has to decide the link it wants to use for sending the packets out.

A LAG binds together several physical ports between two adjacent nodes so they appear to higher-layer protocols and applications as a single, higher bandwidth “virtual” pipe.

The problem with running BFD over a LAG is that without internal knowledge of the LAG structure it is impossible for BFD to guarantee a detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module. LAG timers are typically multiple times slower than the BFD detection timers (multiple 100ms vs. multiple 10ms of BFD). There is thus a need to bring some sort of determinism in how BFD runs over a LAG. There is also a need to detect member link failures much faster than what Link Aggregation Control Protocol (LACP) allows.

Lets look at what implementations currently do to implement BFD on LAGs.

The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one “big, virtual pipe”.

Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here.

Some implementations send BFD packets only over the “primary” member link of the LAG. Others spray BFD packets over all member links of the LAG. There are issues with both these designs.

In the first design, BFD will remain Up as long as the primary link is alive. If the primary link goes down, and another link is not selected as the primary, before BFD times out (around 30-50ms), then the BFD session on the LAG comes crashing down. Problems arise as BFD, in this design, is oblivious to the presence of other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected as it can still send and receive BFD packets over the primary link. Since the BFD session is Up, other routers in the network continue sending traffic meant to egress out of this interface. As expected from the LAG, all traffic egressing out of this interface gets load distributed on all LAG member links. Now, there is one link thats down. All traffic sent over that failed link gets dropped, till the LAG manager detects this and removes it from the LAG.

In the second design, BFD packets are sprayed over all the member links of a LAG. This is done naively via round-robin, where each BFD packet is sent using the subsequent member link, in a round-robin fashion. It solves the problem of BFD going down because of the primary link going down, but it still does not solve the problem of traffic getting lost when one of the member link goes down. This is because, when a member link goes down, BFD remains up and traffic continues to go over the link that has failed till a higher layer protocol (usually LACP or the LAG Manager) detects this and removes the offending link from the LAG.

The above two designs defeat the core purpose of a BFD, which is to detect faults between the two forwarding engines. In each design traffic gets lost on a failed link till some protocol other than BFD detects this and removes that link from the LAG. The timers associated with the other protocol are an order of magnitude higher than BFD.

Operators have since long expressed a need to be able to detect the failed links fast so that their traffic doesnt get lost. The idea is to get BFD to take charge of the LAG and make it responsible for maintaining the list of active links in a LAG. This way we can use the BFD fast timers to quickly detect link failures.

One could argue that there are native Ethernet OAM mechanisms (.1ag, .3ah) that can be used to detect link failures in a LAG, and one need not rely on slow protocols like LACP or the LAG manager. The reality is that operators who have deployed BFD in their IP/MPLS networks want a common failure detection mechanism and dont want a mix of different technologies.

To solve the above mentioned issues I have co-authored an IETF document that proposes running BFD on each constituent link of the LAG. We call the BFD sessions running on each link a “micro BFD session”. We call this mode of BFD on LAGs as BLM – BFD on Lag Members.

BLM is an umbrella BFD session that contains information about the LAG (or the aggregated interface) that its running on. It consists of a set of micro BFD sessions that are running on each constituent link of the LAG. And it contains a state that we call the “Concluded State”, which describes the overall state of the LAG (Up, Down, AdminDown).

Each micro BFD session is a regular RFC 5880 and RFC 5881 compliant BFD session. Only Asynchronous mode is supported for the micro BFD sessions as the sole reason for running BFD on each member link is to verify the link connectivity. The Echo function for the micro BFD sessions is not recommended as it requires twice as many packets to achieve a particular Detection time as does the pure Asynchronous mode.

At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions. They use their own unique, local discriminator values, maintain their own set of state variables and have their own independent state machine. Typically each micro BFD session will have the same timer values, however, nothing precludes the possibility of having different timer values among the different micro BFD sessions belonging to the same LAG.

A session begins with the periodic, slow transmission of BFD Control packets. When bidirectional communication is achieved, the BFD session becomes Up. The LAG manager is informed at this point, and the member link becomes an active link of the LAG.

If the micro session goes Down, the transmission of Control packets goes back to the slow rate. The LAG Manager is informed which removes the member link from the LAG.

Once a session has been declared Down, it cannot come back up until the remote end first signals that it is down (by leaving the Upstate), thus implementing a three-way handshake.

A session MAY be kept administratively down by entering the AdminDown state and sending an explanatory diagnostic code in the Diagnostic field.

In short, its pretty much the same as a standard BFD session.

This solves the issues that i had described in the earlier two designs. The micro BFD sessions will quickly detect a failed link, and will instantly remove it from the LAG. Traffic that was earlier egressing out over the failed link, will now get hashed to a different link in the LAG. This results in zero traffic loss on the LAG.

You can read more about our proposal here (more on how it evolved within IETF here).

Recognizing the need for running BFD on all member links, various vendors support their own proprietary, un-interoperable implementation of BFD over LAGs. We’re hoping that our IETF proposal to standardize this behavior will bring some order to the chaos thats out there and a relief to the providers who are stuck with proprietary solutions.


So what are inter-session Replay attacks?

Inter-session replay attacks are extremely hard to fix and most IETF routing and signaling protocols are vulnerable to them. Lets first understand what an inter-session replay attack is before we delve deeper into how we can fix them.

A reply attack is a type of an attack where the attacker captures the packets exchanged between two routers and later retransmits, or “replays”, this same packet back to the routers and thereby deceiving them into believing that this is a legitimate packet sent by their remote neighbor. Lets see how this will work:

Assume router A is sending an integrity protected (via some authentication mechanism) protocol packet to router B. The attacker can record the packet that A is sending. The attacker now waits for some time and retransmits this packet without any modification, back to B. B upon receiving this packet will as usual first try to verify the contents for any tampering. It will do this by verifying the authentication digest (usually Keyed-MD5 or HMAC-SHA) that the packet carries. Since the attacker has not modified the packet it will pass the integrity check as long as the key exchanged between the two routers remains unchanged. The integrity checks will pass on Router B and it will accept this packet as a legitimate packet sent by A.

This is a replay attack – So, how can it harm you?

Assume A was not advertising any route, or any neighbor reachability when the attacker had recorded this control packet.  In OSPF parlance, this could be a Hello without any neighbors or a RIPv2 packet without any routing information. Later when A learns some routes or neighbors it sends an updated protocol packet listing this information. B receives this packet and updates its protocol state and routing tables based on the information that A provides. Now the attacker replays the earlier recorded packet. B, upon receiving this “new” packet believes this to have come from A and updates its routing tables accordingly. This is incorrect as B will now update its forwarding tables based on stale information. If the replayed packet is an old OSPF Hello when A did not have any neighbors, B will, upon receiving this packet assume that A has now lost all its neighbors and will delete all routes via A. I had co-authored RFC 6039 some time back which describes many such replay attacks in great detail.

So, how do IETF protocols protect themselves from such attacks?

Most protocols packets carry a Cryptographic Sequence Number that increases as each packet is sent. The receivers only accept a packet if it carries a sequence number that is higher than what it had received earlier from the same neighbor.

This fixes the problem that i had described earlier as the replayed packet will carry a sequence number that will be lower than what B would have last heard from A. B, upon receiving this replayed packet will not accept it and would thus prevent itself from such replay attacks.  Its appears that we have a solution against all replay attacks – do we?

Well it turns out that the answer to this question is a big NO!

The cryptographic sequence number can protect us from what i call the intra-session replay attacks. However, it cannot protect us against inter-session replay attacks. Let see why?

Assume that the cryptographic sequence number currently being used by router A for some specific routing protocol is 1000. This means that B will not accept any protocol packet if it comes with a sequence number less than 1000. This is fine, and this will protect us against some attacks. Now assume that the attacker captures and records this packet with sequence number 1000. No one will know about this as the attacker has silently recorded this packet.

Now the attacker has to wait patiently till the current session between the Router A and Router B goes down and a new one is established. This can happen if one of the routers reboots (could be planned or unplanned). When this happens the routers reset their cryptographic sequence number to 0 and start all over again. If the password key  between the two routers has not changed, and it usually doesn’t, then the packet that the attacker has captured is carrying a valid cryptographic digest. The attacker can replay this packet any time and this will get accepted by B if the current sequence number that its seeing in the new session from A is less than 1000. This is an inter-session replay attack and is extremely difficult to fix with the current IETF security and authentication mechanisms. Note that a trivial way to protect against inter-session replay attacks is by changing the key each time a new session is established. However changing the key requires manual intervention and thus cannot be easily done all the time.

So, how do you fix this issue?

Sam Hartman (Huawei), Dacheng (Huawei) and I have submitted two proposals in the IETF to fix this inter-session replay attacks that i have described above.

The first is extremely simple.

We propose to change the current cryptographic sequence number space from 32 bits to 64.  The least significant 32 bits would be the usual cryptographic sequence number that will monotonically increase with each fresh packet transmitted. The most significant 32 bits would indicate the number of times this router has cold booted. Thus when the router initially comes up for the first time its value would be 0. Next time when it reboots and comes up, its value would be 1.

Consider a state when the router has cold booted “n” times and its current cryptographic sequence number is “m”. The aggregated cryptographic sequence number that will be used by the routing protocols would be:

m << 32 || n, where << is the left shift operator and || is the bit-wise OR operation.

Now this router reboots (again planned or unplanned).

Now its cryptographic sequence space starts from:

(m+1) << 32

Its trivial to see that the ((m+1) << 32) > ((m << 32) || n) for all values of m and n where each m and n > 0.

This mechanism will solve the inter-session replay attacks that have been described above. I will describe the second method in some other post. We have defined a generic mechanism that all protocols can use here in this KARP draft.


Why providers still prefer IS-IS over OSPF when designing large flat topologies!

I was recently interacting with our pre-sales team for a large MPLS deployment and was reading the network design that was proposed. I saw that they had suggested IS-IS over OSPF as the IGP to use at the core. One of the reasons cited was the inherent security that IS-IS provides by running natively over the Layer 2. Another was that IS-IS is more modular and thus easier to extend as compared to OSPF. OSPF, its alleged is very rigid and required a complete protocol rewrite to support something as basic as IPv6! :-) Then there was this overload feature that IS-IS provides which can signal memory overload that does not exist in OSPF and finally a point about IS-IS showing superior scalability (faster convergence). In case you’re intrigued about the last point, as i clearly was, then it was explained that IS-IS uses just one Link State Packet (LSP) per level for exchanging the routing information. This LSP contains many TLVs, each of which represents a piece of routing information. OSPF on the other hand, needs to originate multiple LSAs, one for each type and thus is a lot more chatty and thus not suitable for large flat networks.

I personally dont agree to any one of the reasons listed above and anyone who favors IS-IS over OSPF for the above reasons is patently mistaken. These are all extremely weak arguments and have mostly been overtaken by reality. Lets look at each one by one.

Security – While its true that one cant lob an IS-IS packet from a distance without a tunnel it was never really a compelling reason for some one to pick up IS-IS over OSPF. The same holds good for OSPF multicast packets as well which cannot be launched by some script kidde sitting miles away from his personal laptop. Both the protocols have been extended to support stronger algorithms  (RFC 5310 for IS-IS and RFC5709 for OSPF) and have similar authentication mechanisms. I can say this with some degree of confidence as i have co-authored both these standards.

Modularity – While its somewhat easier to extend IS-IS in a backward compatible way this sort of thing doesnt happen much any more. Both protocols have been extended to support multiple instances, traffic engineering, multi-topology, graceful restart, etc. This isnt imo a showstopper for someone picking up OSPF as the IGP to use.

Overload Mechanism – IS-IS has the ability to set the Overload (OL) bit in its LSAs. This results in other routers in that area treating this router as a leaf router in their shortest path trees, which means that its only used for reaching the directly connected interfaces and is never placed on the transit path to reach other routers. So does this happen any more? No, it doesnt. This feature was required in the jurassic age when routers came with severely constrained memory, CPU power and the original intention of the OL mechanism is now mostly irrelevant. Most core routers today have enough memory and CPU that they will not get inundated by the IS-IS routes in any sane network design.

These days OL bit is used to prevent unintentional blackholing of packets in BGP transit networks. Due to the nature of these protocols, IS-IS and OSPF converge must faster than BGP. Thus there is a possibility that while the IGP has converged, IBGP is still learning the routes. In that case if other IBGP routers start sending traffic towards this IBGP router that has not yet completely converged it will start dropping traffic. This is because it isnt yet aware of the complete BGP routes. OL bit comes handy in such situations. When a new IBGP neighbor is added or a router restarts, the IS-IS OL bit is set. Since directly connected (including loopbacks) addresses on an “overloaded” router are considered by other routers, IBGP can be bought up and can begin exchanging routes. Other routers will not use this router for transit traffic and will route the packets out through an alternate path. Once BGP has converged, the OL bit is cleared and this router can begin forwarding transit traffic.

So how can we do this in OSPF since there is no OL bit in its LSAs?

Simple. We can set the metric of all transit links on an “overloaded” router to 0xffff in its Router LSAs. This will result in the router not being included as a transit node in the SPF tree.  Stub links can still be advertised with their normal metrics so that they are reachable even when the router is “overloaded”.  Thus this point against OSPF is also not valid.

Finally we come to the scalability and the convergence part. This one is slightly tricky and is not so easy. I wrote a few posts around 4 years back discussing this here and here. You might want to read these.

IMO one of the big reasons why most big providers use (or have used) IS-IS is because way back in 90s Cisco OSPF implementation was a disaster. The first big ISPs (UUnet, MCI) came to them and said “we want to build big infrastructures, should we use OSPF?” and Cisco basically said “No, thats not a good idea, use IS-IS instead”. Dave Katz in Cisco had recently rewritten Cisco’s IS-IS implementation as a side effect of implementing NetWare Link Services Protocol – NLSP (basically IS-IS for Novel IPX) so Cisco was quite confident of its IS-IS implementation. The operators thus picked up IS-IS and continue using it even today as there is really no real difference between IS-IS and OSPF, so no motivation to move from one to the other.

IS-IS was also an advantage in the early days as a router vendor because it was an “open proprietary spec”.  It was out there, and published, but unless you had some background in OSI you didn’t know much about it and the spec was scary and weird.  This wasn’t on purpose, but it was handy.

It was also nice in the IETF because IS-IS was viewed, at least at the time, as the poor cousin of OSPF and so nobody really cared that much other than the handful of folks that were doing the work.  This made the extension of IS-IS a lot easier and a lot less political than OSPF.  In fact i have heard about a t-shirt which said “IS – IS = 0″ that was distributed in one of the IETF meetings long time ago! Things however have changed and IS-IS is considered at par with OSPF today and both the working groups are quite active in the IETF.

There was one real technical advantage to IS-IS in common deployment scenarios of that day as well.  Back then, it was popular to build full meshes of ATM or Frame Relay as the Layer 2 topology for large backbones, because of the perception that healing faults at L2 would happen faster and cleaner than letting the IP routing protocols take care of it (arguably true at the time).  Full mesh topologies are the worst possible topologies for standard flooding protocols (IS-IS and OSPF both) and the cost of topology changes was huge.  However, IS-IS lent itself to the “mesh group” hack by which you could manually prune the flooding topology to be a subset of the links.  OSPF doesn’t easily allow this because of details about the flooding model it uses.  Cisco apparently did implement a hack to get around this problem, but its probably more gross than the IS-IS “mesh groups” hack!

Another reason i believe people prefer IS-IS over OSPF is the belief that you can design large networks by building a single large Level 1 (L1) area without any hierarchies in IS-IS and still be able to manage – something that would be difficult with OSPF. There are issues with inter-area traffic engineering and such and most people would like to keep their network as a single area if the routing protocol can manage it.

I used to believe that operators can design big networks without hierarchies in IS-IS since all IP prefixes (i.e. network interfaces, routes aka reachabilities in ISO-speak) are considered as leaf nodes in the SPF for IS-IS. Thus a full SPF will not be triggered for an interface or a route flap in case of IS-IS. OSPF otoh, would go ballistic running SPF each time any IP information changes. The only time we dont run a full SPF in when a Type 5 LSA information changes, but thats hardly an optimization. Compared to this, the only time we run a full SPF in IS-IS is when an actual node goes down (which OSPF would also anyways do).

I was recently having a discussion with Dave Katz from Juniper on this and i realized that this really is an implementation choice. “The graph theory”, he very aptly pointed out, “is the same in both cases!”.  The IS-IS spec makes it easier to put an IS-IS reachability as leaf nodes as all routers are identified by a different set of TLVs. This information while its available in OSPF is slightly tricky as the node information is mixed with the link information. Thus while even a naive IS-IS implementation may be able to optimize SPF, it would require a good understanding of the spec to get it right in OSPF.

You could get the exact same optimization in OSPF as IS-IS if you realize that OSPF calculates the routes to the *router IDs* and not the addresses. The distinction between nodes and destinations is syntactically (and semantically) quite clear in OSPF as well. The spec considers the Router IDs which i concede look like IP addresses, something that most people miss.

Actual addresses and prefixes are quite distinct, even in OSPF.  So as long as you can keep track of what’s an address and what’s an ID, it’s not that hard, for what it’s worth.  The bigger problem is that only a handful of people really understand *why* things in the OSPF spec are done the way they are, and there are less and less of those folks because hardly anybody *needs* to understand it.

But having said all that, the cost of an SPF is so small on the scale of things that it’s not really the issue (which is also why I am not a big fan of partial-SPF optimizations:  ”See how great it works when you have around O(50K) nodes and there is this one little node that goes down!” is sort of silly because lots of other things would break before a network ever got that big.)

Part of the SPF fear was I believe because Cisco’s original SPF implementation in OSPF was horribly inefficient (and everyone was using slow processors back then) and IOS was a non-preemptive, single threaded environment, and so an SPF (or any slow process) would block other things (like sending and receiving Hellos and other important bits) and would affect *everything*. I am btw sure that its changed now since i am aware of a couple of large Cisco deployments that are running OSPF in the core!  Overall system state management is a *much* bigger problem these days than the algorithmic efficiency of these protocols, particularly as we build larger and more distributed environments that require message passing internally.

Also what could have pushed providers back then to IS-IS was the deployment guidelines that Cisco used to publish (including the number of routes in an area) back then which were absurdly small. I am sure, its changed now.

There’s no technical reason why very large flat topologies can’t be supported by a good implementation of either protocol, but ISPs need to be conservative and suspicious of their vendors in order to survive.  ;-) I guess that nobody wants to be the first to deploy a large flat OSPF topology;  best practices tend to be sticky. However, there is no reason why you cant do it with OSPF today.

I suspect that, at this point, ISPs choose based on culture and familiarity and comfort rather than real technical differences. The perception still exists that while IS-IS can support large flat networks, OSPF cant. However, as i said its just a perception and is not really true any more.


Catching Corrupted OSPF Packets!

I was having a discussion with Paul Jakma (a friend, co-author in a few IETF drafts, a routing protocols expert, the guy behind Quagga, the list just goes on ..) some time back on a weird problem that he came across at a customer network where the OSPF packets were being corrupted in between being read off the wire and having CRC and IP checksum verified and being delivered to OSPF stack. While the problem was repeatable within 30 minutes on that particular network, he could never reproduce it on his VM network (and neither could the folks who reported this problem).

Eventually, for some inexplicable reason, he asked them to turn on MD5 authentication (with a tweak to drop duplicate sequence number packets – duplicate packets as the trigger of the problems being a theory). With this, their problems changed from “weird” to “adjacencies just start dropping, with lots of log messages about MD5 failures”!

So it appears that the customer had some kind of corruption bug in custom parts of their network stack, on input, such that OSPF gets handed a good long sequence of corrupt packets – all of which  (we dont know how many) seem to pass the internet checksum and then cause very odd problems for OSPF.

So, is this a realistic scenario and can this actually happen? While i have personally never experienced this, there are chances of this happening because of any of the following reasons:

o PCI transmission error (PCI parallel had parity checks, but not always enabled, PCI express has a 32bit CRC though)

o memory bus error (though, all routers and hosts should use ECC RAM)

o bad memory (same)

o bad cache (CPUs don’t always ECC their caches – Sun its seems was badly bitten by this; While the last few generations of Intel and AMD CPU do this, what about all those embedded CPUs that we use in the routers?)

o logic errors, particularly in network hardware drivers

o finally, CRCs and the internet checksum are not very good and its not impossible for wire-corrupted packets to get past link, IP AND OSPF CRC/checksums.

The internet checksum, which is used for the OSPF packet checksum, is incredibly weak. There are various papers out there, particularly ones by Partridge (who helped author the internet checksum RFC!) which cover this, basically it offers very little protection:

- it can’t detect re-ordering of 2-byte aligned words
- it can’t detect various bit flips that keep the 1s complement sum the same (e.g. 0×0000 to 0xffff and vice versa)

Even the link-layer CRC also is not perfect, and Partridge has co-authored papers detailling how corrupted packets can even get past both CRC and internet checksum.

So what choice do the operators have for catching corrupted packets in the SW?

Well, they could either use the incredibly poor internet checksum that exists today or they could turn on cryptographic authentication (keyed MD5 with RFC 2328 or different HMAC-SHA variants with RFC 5709) and catch all such failures. The former would not always work as there are errors that can creep in with these algorithms. The latter would work but there are certain disadvantages  is using cryptographic HMACs purely for integrity checking. The algorithms require more computation, which may be noticable on less powerful and/or energy-sensitive platforms. Additionally, the need to configure and synchronize the keying material is an additional administrative burden. I had posted a survey on Nanog some time back where i had asked the operators if they had ever turned on crypto protection to detect such failures and i received a couple of responses offline where they alluded to doing this to prevent checksum failures.

Paul and I wrote a short IETF draft some time back where we propose to change the checksum algorithm used for verifying OSPFv2 and OSPFv3 packets. We would only like to upgrade the very weak packet checksum with something thats more stronger without having to go with the full crypto hash protection way. You can find all the gory details here!


Can we solve the inter-session replay attacks?

Most routing (OSPF, BFD, RIP, OSPFv3-AT, etc) and signalling (LDP, RSVP, etc) protocols defined by IETF have a cryptographic sequence number within the authentication data that increases monotonically with each new packet that the router originates. This protects the protocol from replay attacks as the receivers now keep track of the sequence numbers and ignore all packets that arrive with a number thats lower than the currently active one.

At worst, the attacker can keep replaying the last packet that was originated since most protocols accept packets with sequence number greater than or equal to what they had last received. This in my opinion is a hole that can be trivially plugged by mandating that protocols must only accept protocol packets if they come with a sequence number thats greater than what they have received till now.

So does this solve all replay attacks problem?

No, not really.

Imagine an attacker who captures a protocol packet when the cryptographic sequence number is say, 1000.  Now the next time this router cold boots it will reinitialize its sequence space to 1 and start sending packets from this value. The attacker can now replay the earlier captured packet – the one with the sequence number 1000. The receivers will accept the replayed packet since it comes with a sequence number thats higher than what they were currently seeing from the router. This is a vulnerability that most IETF protocols are susceptible to. This is not an issue with protocols that use an automated key management protocol (like IKEv2) as all the security parameters are renegotiated when a session bounces. However, most routing and signalling protocols DONT use an automated key management protocol and are thus exposed to this risk.

I call this as an inter-session replay attack where packets from the previous/stale sessions can be replayed. So, do we have a solution to this problem?

Well, there are a couple of things that we could do here. The most obvious solution is to update the last cryptographic sequence number in the non volatile memory of the router. Thus we update the memory each time we increment the sequence number. This can be read when the router cold boots and it can start using sequence numbers from this value. The problem with this solution is that this will involve frequent writes to the non volatile memory on the routers which is not recommended because of the limited life of such media.

The other solution is to use the clock (number of seconds elapsed since midnight UTC  January 1 1970) as the sequence numbers. In theory this time will always advance and we will thus never have a router issuing sequence numbers that will ever go back. This would ideally also work when the router reboots as the time would only have advanced. The problem with this solution is that we end up relying on NTP or 1588 and an assumption that clocks on a router will NEVER go back. This is unrealistic and cannot be the basis of a security system defined for any protocol. Its fragile and can be broken.

So what are we left with?

Sam Hartman, Dacheng Zhang and I start looking at this problem for OSPF and have written an IETF draft that we think addresses this problem. It associates two scalars with a router – the Session ID and the Nonce, and uses these in combination with the cryptographic sequence numbers to protect OSPF routers against inter and intra replay attacks. The mechanism described in this draft can be easily generalized and extended to other routing and signalling protocols.

This is currently being discussed actively in the OSPF WG and the KARP WG mailing lists. I will in some other post explain how the concept of Nonce and Session ID helps in solving the inter-reply attacks which is the key problem that needs to be solved.


Fixing OSPFv3 Authentication and Security Mechanism!

Sounds like a presumptuous claim for a blog title, eh? Well, we’ll soon find out that it isnt! OSPFv3, unlike OSPFv2, IS-IS and RIP uses IPsec for authenticating its protocol packets. It relies on the IP Authentication Header (AH)  and the IP Encapsulating Security Payload (ESP) to cryptographically sign routing information passed between routers. When using ESP, the null encryption algorithm is used, so the data carried in the OSPFv3 packets is signed, but not encrypted.   This provides data origin authentication for adjacent routers, and data integrity (which gives the assurance that the data transmitted by a router has not changed in transit).  However, it does not provide confidentiality of the information transmitted; this is acceptable because the privacy of the information being carried in the routing protocols need not be kept secret.

Authentication/Confidentiality for OSPFv3” mandates the use of ESP with null encryption for authentication and also does encourage the use of confidentiality to protect the privacy of the routing information transmitted, using ESP encryption. It discusses, at length, the reasoning behind using   manually configured keys, rather than some automated key management   protocol such as IKEv2.  The primary problem is the lack of a suitable key management mechanism, as OSPF adjacencies are formed on a one-to-many basis and most key management mechanisms are designed for a one-to-one communication model. This forces the system administrator to use manually configured security associations  (SAs) and cryptographic keys to provide the authentication and, if desired, confidentiality services. Regarding replay protection, [RFC4552] states the following:

Since it is not possible using the current standards to provide complete replay protection while using manual keying, the proposed solution will not provide protection against replay attacks.

Since there is no replay protection provided there are a number of attacks that are possible for OSPFv3. Some of the them are described here.

Since there is no deterministic way to differentiate between encrypted and unencrypted ESP packets by simply examining the packet, it becomes tricky for some implementations to prioritize certain OSPFv3 packets (Hellos for example) over the others. This is big issue in most service grade routers working in scaled setups.

Then there are some environments, e.g., Mobile Ad-hoc Networks (MANETs), where IPsec is difficult to configure and maintain, and cannot be used for authenticating OSPFv3. There is also an issue with IPsec not being      available on some platforms or it requiring some additional license which may be expensive.

I posted a survey on Nanog asking operators if they were using authentication with OSPFv3.  I only received a few responses, as operators, for a good reason are quite paranoid about their security policies and would not reveal their policies to someone posting a survey on a public mailing list. The results of that survey were interesting; more than half of them cited the issues that i have described above as reasons for not turning on authentication with OSPFv3. They are running OSPFv3 in their v6 networks without any security! This you would note is very different from operators running OSPFv2. It is well known that a majority of the big providers use MD5 with OSPFv2 and their networks are secure. Most vendors have now started work on implementing HMAC-SHA support for OSPFv2 as described in RFC5709. This led me to believe that if we got rid of IPsec for OSPFv3 and somehow got it to use the same infrastructure as OSPFv2, then we would see more people using security with OSPFv3.

The other big issue with using IPsec for OSPFv3 is that it becomes very difficult to prioritize some OSPFv3 control packets over others. This is because deep inspecting ESP-NULL packets is difficult as its not easy to know if the packet is encrypted or not. One alternative is to mandate WESP instead of ESP-NULL. An easier alternative is to write a new mechanism to authenticate OSPFv3 packets. I have written an RFC 6506 – “Supporting Authentication Trailer for OSPFv3” that defines an alternative, non IPsec mechanism, for authenticating and securing OSPFv3 protocol packets.

We propose to append a special block, called the Authentication Trailer to the end of the OSPFv3 packets. This is similar to the Authentication data thats carried in OSPFv2, where the length of this trailer is NOT included into the length of the OSPFv3 packet, but is accounted for in the IPv6 payload length. Unlike OSPFv2, the AT will also include the contents of the LLS block in its digest which means that the LLS block doesnt need to have its own separate authentication mechanism.

We also describe a new AT (Authentication Trailer) bit in the OSPFv3 options field that the routers must set in the HELLOs and DDs to indicate that this router will include Authentication trailer in all subsequent OSPFv3 packets on the link. In other words, the Authentication trailer is only examined if the AT bit is set. You can read the RFC for more details.


BFD Generic Cryptographic Authentication

Bidirectional Forwarding Detection (BFD) specification includes five different types of authentication schemes: Simple Password, Keyed Message Digest 5 (MD5), Meticulous Keyed MD5, Keyed Secure Hash Algorithm (SHA-1) and Meticulous SHA-1. In the simple password scheme of authentication, the passwords are exchanged in the clear text on the network and anyone with physical access to the network can learn the password and compromise the security of the BFD domain.

It was discovered that collisions can be found in MD5 algorithm in less than 24 hours, making MD5 insecure. Further research has verified this result and shown other ways to find collisions in MD5 hashes.

It should however be noted that these attacks may not necessarily result in direct vulnerabilities in Keyed-MD5 as used in BFD authentication purposes, because the colliding message may not necessarily be a syntactically correct protocol packet. However, there is a need felt to move away from MD5 towards more complex and difficult to break hash algorithms.

In Keyed SHA-1 and Meticulous SHA-1, the BFD routers share a secret key which is used to generate a keyed SHA-1 digest for each packet and a monotonically increasing sequence number scheme is used to prevent replay attacks.

Like MD5 there have been reports of attacks on SHA-1. Such attacks do not mean that all the protocols using SHA-1 for authentication are at risk. However, it does mean that SHA-1 should be replaced as soon as possible and should not be used for new applications.

However, if SHA-1 is used in the Hashed Message Authentication Code (HMAC) construction then collision attacks currently known against SHA-1 do not apply. The new attacks on SHA-1 have no impact on the security of HMAC-SHA-1.

I have written an IETF document that proposes two new authentication types – the cryptographic authentication and the meticulous cryptographic authentication . These can be used to specify any authentication algorithm for authenticating and verifying the BFD packets (aka key agility). In addition to this, this memo also explains how HMAC-SHA authentication can be used for BFD.

HMAC can be used, without modifying any hash function, for calculating and verifying the message authentication values. It verifies both the data integrity and the authenticity of a message.

By definition, HMAC requires a cryptographic hash function. We propose to use any one of SHA-1, SHA-256, SHA-384 and SHA-512 for this purpose to authenticate the BFD packets.

I recently co-authored an IETF draft that does BFD’s security and authentication mechanism’s gap analysis for the KARP WG – that draft can be found here.


BGP Path Hunting and Exploration ..

It has been shown through various studies that BGP convergence can be roughly given by the following formula:

Convergence = (Maximum AS_PATH – Minimum AS_PATH) x MRAI (MinimumRouteAdvertisementInterval)

MRAI as per the specification is the amount of time that a BGP speaker will wait before passing on successive route updates for the same prefix – it defaults to 30 seconds, and applies to all announcements, with no exemption for withdrawals. Since there can be some variance in the MRAI across different BGP implementations the convergence time is roughly given by the above formula.

Lets see how we arrive at this. Consider the topology as show in the fig 1 below:

Network Topology

Figure 1: Network Topology

A, B, C, D and E are all routers in ASes A, B, C, D and E respectively. Assume that A has announced its reachability to 10/8 and all routers have converged. This implies that E has the following paths for 10/8 in its adj-RIBs-In:

p(B) -> AS_PATH {B A} (path learnt from B)

p(C) -> AS_PATH {C B A} (path learnt from C)

p(D) -> AS_PATH {D C B A} (path learnt from D)

In steady state E would advertise 10/8 with AS_PATH {E B A}

Lets see what happens when A loses its connection to the network 10/8.

A announces a withdrawal for 10/8 to B as shown in figure 2.

Figure 2

Figure 2

B runs its decision process and finds that there is no other route to 10/8, and thus sends an explicit withdrawal to its peers C and E.

E upon receiving the withdrawl runs its decision process and finds that the best route, now available for 10/8 is the one that it learnt from C – the one with the AS_PATH {C B A}. It installs this route in the FIB and advertises this, as shown in figure 3.

Figure 3

Figure 3

E switches to the next best route (via C) and continues to forward traffic for 10/8. It should however be noted, that E at this point of time, has absolutely no idea that C too has received a withdrawal for 10/8.

C meanwhile runs its decision process and decides to send an explicit withdrawal to its peers D and E.

E upon receiving the withdrawal from C runs its decision process and finds that it still has a valid path to reach 10/8 – the one via D (refer to figure 4). It switches to this path, again unaware that D too has received an withdrawal for 10/8.

Figure 4

Figure 4

D now sends a withdrawal to E, and thereby removing all possible paths to 10/8 from E (fig 5). Its now that the network is converged.

Figure 5

Figure 5

What we just saw happening above on E is “BGP Path Hunting”. i.e., BGP “hunts” though all possible AS paths, starting from the shorter ones to the longer ones, until it finally converges.

It can be easily proven that the amount of path hunting will increase as the meshiness of the topology increases. In an acyclic topology (say a tree) there is only one possible path so there is no path hunting. An addition of a single link in this topology creates a cycle in the graph and thus at most two possible paths for BGP to “hunt”. Subsequent links can add many more alternate paths for BGP to hunt, depending on where they’re placed in the graph.

In the worst possible case path hunting in BGP can explore every possible path of each path length. More commonly it has been observed that path hunting in today’s Internet can add an additional 2 or 3 BGP updates to a prefix withdrawal.


AboveNet Hijacks Africa Online!

Internet, only a few weeks ago, had seen Pakistan Telecom Authority (PTA) hijacking the IP prefixes announced by Youtube, as protest against some videos that had been put up there, knocking millions off, all around the globe from accessing Youtube. I wrote about this here. This time its an ISP from USA and Europe, AboveNet (AS 6461) thats hijacked prefix announced and owned by Africa Online (AS 36915).

AboveNet inexplicably started announcing reachability to one of the prefixes (194.9.82.0/24) owned by Africa Online. It took AboveNet more than 22 hours since the problem was first reported, to fix it. Wonder what took them so long! As a result of this prefix hijack, potentially millions of users in Kenya or Africa, all behind 194.9.82.0/24, lost connectivity to the Internet. In isolation 194.9.82.0/24 is not a huge space, but add a couple of NATs and the number of users easily swells to millions. What this means for you and me, who are not being served by Africa Online is, that we lose connectivity to all the websites being hosted behind this IP address block. Imagine what it would do to the Internet if emergency services, banks, google were being hosted there!

Lets see why users in Kenya would lose total connectivity to the Internet:

A user accesses google.com with a (NATed or otherwise) source IP address 194.9.82.x. Google graciously responds, and the IP packet carries the destination IP 194.9.82.x. Because AboveNet has announced reachability to this IP address block, all traffic destined to 194.9.82.x comes to AboveNet where it gets royally dumped, while the user sitting in Kenya (or Africa) is still hopelessly waiting for the packet to arrive.

So, why are the service providers all over the world preferring the route announcement from AboveNet over the one originated from Africa Online?

Well, thats unfortunately how Internet, and my favorite routing protocol – BGP, works!

In BGP, the route advertisement from the provider which has a better vantage point on the Internet, usually wins.

In this particular case both AboveNet(AS 6461) and Africa Online (AS 36915) announced the route to 194.9.82.0/24 . AboveNet, operating from US, sits much closer to the core as compared to Africa Online, and is thus better connected to the other networks than the latter. The AS_PATH length thus seen by the other service providers for the route advertised by AboveNet is much shorter than the one advertised by Africa Online. As a result of this, other BGP speakers pick up the route advertised by AboveNet against the route advertised by Africa Online.

The figure below, constructed using BGPlay from RIPE NCC, shows a snapshot of the routing activity for 194.9.82.0/24 during the period when it was hijacked. The colored lines indicate the path different ASes would take to reach this prefix. Clearly most of the ASes believed AboveNet to be a better path for 194.9.82.0/24.

Vantage Point on the Internet

This is how BGP works and mind you, this isnt broken.

Whats broken is our disability to verify the claim of a service provider when it announces ownership of an address block in BGP. Restrictive route filtering can be applied where the providers only accept the specific prefixes allocated to the customers or where the upstream accepts only specific prefixes allocated to the ISPs, but this is too cumbersome and rarely works. As the matter stands today, there isn’t any clean way to know if the reachability announced by your friendly peer is genuine or whether the provider has a feasible path to the destinations advertised. This needs to be fixed and there is work going on towards this direction in the SIDR WG of IETF.

Can the service providers do something when they learn that an IP prefix has been hijacked by some AS? The answer, fortunately, to this is an unequivocal Yes.

A service provider can override BGP’s decision process in selecting the route advertisement with the shortest AS_PATH length by (i) manipulating the BGP path attribute like LOCAL_PREF since its checked before the AS_PATH length or (ii)  decreasing the weight of the offending peer you learn the hijacked route from (this would only work for routers connected  directly to AboveNet) (iii) Use regular expressions to filter all or the specific hijacked route advertisement from AS 6461 (AboveNet) so that the announcement from Africa Online wins. The legitimate route is now propagated to other parts of the world.

Each time an ISP inadvertently hijacks someone else’s address block it risks to lose some amount of credibility in the service provider world. Their names are splashed on the mails/PPTs in NANOG and IETF whenever there’s a discussion on interdomain security or on the blogs all over the world.

Fortunately for AboveNet, their hijacking didn’t throw millions off popular websites like Youtube, Google, Yahoo! etc. That would have attracted a LOT more attention than what this event did. When PTA had hijacked YouTube, it was all over the news and there were columns running in Wall Street Journal and New York Times about how tenuous the Internet architecture is. Also what unfortunately went in favor of AboveNet was that the affected users were not in US/Europe/Japan, but were in a relatively silent African subcontinent.


The Classical Fish Problem in Routing

The Internet in mid 80s and 90s was envisaged to work on the IP destination based routing/forwarding paradigm. This meant that the routing protocols would establish the best paths based on some scalar metric like the hop count, or link costs, and all traffic would follow that path. This would work since all IP routing and forwarding was based on the IP address carried in the IP packets. With all due respect and credit to the Internet’s forefathers, the vision and the design did work, till a few years back, when the network operators started realizing that though the IP architecture was indeed scalable, it lacked the finesse to optimally utilize the network resources (particularly in the backbones).

The inadequate utilization of the network resources can be illustrated with the classic “fish problem“. It derives its name from the network (Fig 1) resembling a fish, with A being the head and G and H, the tail of a fish.

Figure 1

All traffic emanating (or passing through) from the tail, (G or H) towards the head (A) can take either of the two paths (F-D-C-B or F-E-B) based on how the IP routing tables are programmed by the routing protocols. The latter decide on the best path by considering the link costs advertised by each router in the network. In this example the total cost of path G-F-D-C-B-A is (10+5+5+5+10) 35, while the cost of the path G-F-E-B-A is 30. This means that all traffic from G to A (or G to B) would follow the path through router E (as shown by the red arrow), and the F-D-C-B path would remain unused, since the total cost associated with it is higher than the one through router E.

This leads to an extremely unbalanced traffic distribution, where the link F-E-B can get heavily overloaded, and at worst, congested, while F-D-C-B always remains idle. This problem arises because of the way IP routing paradigm works. Lets us see why:

o IP routing is destination based, so packets are only routed based on the destination IP address in the packet. Routing protocols typically install one next-hop for each IP address (except in case of equal cost routes, which we can ignore for the time being) or a range of IP addresses (subnet masks) thus all packets sharing the same destination address would all get routed to the same next-hop. This means that if F installs a route 100/8 with next-hop as E, then all IP traffic falling under 100/8 coming to F, would get routed to E. This can lead to unbalanced traffic distribution and create unnecessary congestion hot spots.

o Routers make a local decision, based on what they think is the most optimal path from their perspective, when selecting a path. Since all routers run the same SPF algorithm, with the same lin state database, they all come up with the same shortest path, which very soon turns congested, while the non-shortest path remains idle, and unused. This implies that to optimize the network utilization the routers must factor in some other things before chosing the path. One thing that comes instantly to mind is the total bandwidth available on each link when computing the path. If routers can somehow keep track of the available bandwidth available on each link, then it can distribute the traffic in a manner which can optimize the network resource utilization.

Coming back to our network, we find that all traffic from tail to head flows through the router E, leaving D and C idle. So, what can be done to fix this?

The operator can manipulate the link costs on path F-D-C-B, in a manner as shown below, to get the traffic to flow over it.

Figure 2

This clearly works, since cost of the path F-D-C-B (9) is now lower than path F-E-B (10). But hang on. What we’ve only achieved is moving the entire traffic from path F-E-B to F-D-C-B! The traffic would soon start congesting the latter link, while leaving the former unused. We have really achieved nothing, but have only moved the problem elsewhere. Clearly, this wouldn’t work. So, what else can be done?

Well, not much. A clever network operator can play around with the link costs on paths F-D-C-B and F-E-B such that both become equal, as shown in the figure below.

Figure 3

This would surely alleviate the problem as the two paths would now be equally used. However, this scheme of manually adjusting the link costs is not scalable and only works for small networks. Imagine replacing C, D and E with hundreds of routers, with a subset of them being connected to each other. The precarious scheme of adjusting the link costs would become too complex and too fragile to work. A single link (or a router) failure or a cost change would bring down the entire scheme of distributing the traffic across two paths.

The only scalable solution for the fish problem is by going beyond the realms of traditional IP routing and by providing mechanisms to explicitly manage the traffic inside the network. This new paradigm of routing is called constraint based routing, which essentially strives to compute the best path without violating any of the constraints imposed on the network, and at the same time, being optimal with respect to some scalar metric (hop count, links costs, etc). Once such a path is computed, it establishes and maintains forwarding state along such a path.

This differs from the existing IP routing paradigm which only tries to optimize (by minimizing), a particular scalar metric when computing the best path to a destination. Thus RIP optimizes the number of hops and OSPF/IS-IS, the total path cost, where total path cost is the sum total of individual cost of all the links along the path.

I would discuss more on constraint based routing in my subsequent posts.


Pakistan Hijacks Youtube!

We’ve been reminded yet again, of how vulnerable the Internet architecture is, to malicious attacks and simple mis-configuration (oversight?) from the service provider side. A week ago, the Pakistan Telecommunication Authority released an order instructing the country’s 70 odd ISPs to block Youtube.com until further notice. The ISPs acting on this directive could have installed access-control lists (ACLs) on all their router interfaces dropping packets bound to this website, but they instead, chose a more convoluted way of blocking traffic bound to this site – they created a more specific route pointing to a NULL (or a discard) interface, thereby black-holing all traffic bound to this address. Fair enough, this is also a way to drop traffic since the former would entail augmenting all existing ACL filtering policies on all router interfaces. Whats intriguing is how this route “accidentally” leaked into BGP and got advertised to PCCW (AS 3491), the upstream provider providing services to Pakistan Telecom (AS 17557), from where it propagated further to other parts of the Internet.

Youtube advertises 208.65.152.0/22 and Pakistan Telecom advertised a more specific route, 208.65.153.0/24, to its provider PCCW . PCCW, like most ISPs, without validating the prefix announcements based on Regional Internet Registry (RIR) allocations or even Internet Routing Registry (IRR) objects, further propagated this BGP UPDATE to its peers. Within no time, the erroneous BGP announcement permeated across large parts of the internet, resulting in all traffic bound to Youtube, to go towards Pakistan, where it would end up getting blackholed. Youtube was thus successfully “hijacked” by Pakistan Telecom .

I dont intend to discuss, whether this was done maliciously or whether it was an error on part of the PT, but the whole saga raises uncomfortable questions on the security and frailty of the Internet as it works today. Currently, the larger ISPs tend to trust the network providers that they are connected to and work on the tenuous assumption that the smaller providers would not illegitimately “hijack” someone else’s IP address and will behave decorously and only announce the IPs that they own. Patently, this doesn’t seem to be working out.

An attacker can masquerade as a big financial website by “hijacking” all traffic bound towards the legitimate website, thereby wreaking havoc on the gullible users. It should be noted that this sort of attack is more potent and dangerous than the spam mails that we often receive in our mailboxes, asking us to update and enter our bank details. In the latter cases, an alert user can always detect, that the server on which the page is loaded does not belong to the bank or the financial institution that the mail purports to be. However when the IP address block is “hijacked”, there are no such defenses.

Internet connectivity is vital in todays age, with a lot of emergency services being routed over the Internet. Imagine a natural disaster or a terrorist attack followed by an IP address block “hijack”. The full repercussions of what all can be achieved with this kind of an attack makes your mind spin and wobble.

A timeline created by Renesys, which provides real-time monitoring services, says that it took about 15 seconds for large Pacific-rim providers to direct YouTube.com traffic to the Pakistan ISP, and about 45 seconds for the central routers on which most of the Internet traffic relies, to misroute the traffic.

So, what happened to Pakistan’s Internet Connectivity as it attracted zillions of bytes of data intended for Youtube.com? It probably went down as PT could not have handled that massive amount of data, rendering millions inside Pakistan without any Internet connectivity.

In the coming days i would discuss what BGP Prefix Hijacking is and what it entails to protect the Internet from this and the work being done in IETF wrt this.

Also read about AboveNet hijacking Africa Online here.


Follow

Get every new post delivered to your Inbox.

Join 44 other followers