Tag Archives: Scalability

Why providers still prefer IS-IS over OSPF when designing large flat topologies!

I was recently interacting with our pre-sales team for a large MPLS deployment and was reading the network design that was proposed. I saw that they had suggested IS-IS over OSPF as the IGP to use at the core. One of the reasons cited was the inherent security that IS-IS provides by running natively over the Layer 2. Another was that IS-IS is more modular and thus easier to extend as compared to OSPF. OSPF, its alleged is very rigid and required a complete protocol rewrite to support something as basic as IPv6! :-) Then there was this overload feature that IS-IS provides which can signal memory overload that does not exist in OSPF and finally a point about IS-IS showing superior scalability (faster convergence). In case you’re intrigued about the last point, as i clearly was, then it was explained that IS-IS uses just one Link State Packet (LSP) per level for exchanging the routing information. This LSP contains many TLVs, each of which represents a piece of routing information. OSPF on the other hand, needs to originate multiple LSAs, one for each type and thus is a lot more chatty and thus not suitable for large flat networks.

I personally dont agree to any one of the reasons listed above and anyone who favors IS-IS over OSPF for the above reasons is patently mistaken. These are all extremely weak arguments and have mostly been overtaken by reality. Lets look at each one by one.

Security – While its true that one cant lob an IS-IS packet from a distance without a tunnel it was never really a compelling reason for some one to pick up IS-IS over OSPF. The same holds good for OSPF multicast packets as well which cannot be launched by some script kidde sitting miles away from his personal laptop. Both the protocols have been extended to support stronger algorithms  (RFC 5310 for IS-IS and RFC5709 for OSPF) and have similar authentication mechanisms. I can say this with some degree of confidence as i have co-authored both these standards.

Modularity – While its somewhat easier to extend IS-IS in a backward compatible way this sort of thing doesnt happen much any more. Both protocols have been extended to support multiple instances, traffic engineering, multi-topology, graceful restart, etc. This isnt imo a showstopper for someone picking up OSPF as the IGP to use.

Overload Mechanism – IS-IS has the ability to set the Overload (OL) bit in its LSAs. This results in other routers in that area treating this router as a leaf router in their shortest path trees, which means that its only used for reaching the directly connected interfaces and is never placed on the transit path to reach other routers. So does this happen any more? No, it doesnt. This feature was required in the jurassic age when routers came with severely constrained memory, CPU power and the original intention of the OL mechanism is now mostly irrelevant. Most core routers today have enough memory and CPU that they will not get inundated by the IS-IS routes in any sane network design.

These days OL bit is used to prevent unintentional blackholing of packets in BGP transit networks. Due to the nature of these protocols, IS-IS and OSPF converge must faster than BGP. Thus there is a possibility that while the IGP has converged, IBGP is still learning the routes. In that case if other IBGP routers start sending traffic towards this IBGP router that has not yet completely converged it will start dropping traffic. This is because it isnt yet aware of the complete BGP routes. OL bit comes handy in such situations. When a new IBGP neighbor is added or a router restarts, the IS-IS OL bit is set. Since directly connected (including loopbacks) addresses on an “overloaded” router are considered by other routers, IBGP can be bought up and can begin exchanging routes. Other routers will not use this router for transit traffic and will route the packets out through an alternate path. Once BGP has converged, the OL bit is cleared and this router can begin forwarding transit traffic.

So how can we do this in OSPF since there is no OL bit in its LSAs?

Simple. We can set the metric of all transit links on an “overloaded” router to 0xffff in its Router LSAs. This will result in the router not being included as a transit node in the SPF tree.  Stub links can still be advertised with their normal metrics so that they are reachable even when the router is “overloaded”.  Thus this point against OSPF is also not valid.

Finally we come to the scalability and the convergence part. This one is slightly tricky and is not so easy. I wrote a few posts around 4 years back discussing this here and here. You might want to read these.

IMO one of the big reasons why most big providers use (or have used) IS-IS is because way back in 90s Cisco OSPF implementation was a disaster. The first big ISPs (UUnet, MCI) came to them and said “we want to build big infrastructures, should we use OSPF?” and Cisco basically said “No, thats not a good idea, use IS-IS instead”. Dave Katz in Cisco had recently rewritten Cisco’s IS-IS implementation as a side effect of implementing NetWare Link Services Protocol – NLSP (basically IS-IS for Novel IPX) so Cisco was quite confident of its IS-IS implementation. The operators thus picked up IS-IS and continue using it even today as there is really no real difference between IS-IS and OSPF, so no motivation to move from one to the other.

IS-IS was also an advantage in the early days as a router vendor because it was an “open proprietary spec”.  It was out there, and published, but unless you had some background in OSI you didn’t know much about it and the spec was scary and weird.  This wasn’t on purpose, but it was handy.

It was also nice in the IETF because IS-IS was viewed, at least at the time, as the poor cousin of OSPF and so nobody really cared that much other than the handful of folks that were doing the work.  This made the extension of IS-IS a lot easier and a lot less political than OSPF.  In fact i have heard about a t-shirt which said “IS – IS = 0″ that was distributed in one of the IETF meetings long time ago! Things however have changed and IS-IS is considered at par with OSPF today and both the working groups are quite active in the IETF.

There was one real technical advantage to IS-IS in common deployment scenarios of that day as well.  Back then, it was popular to build full meshes of ATM or Frame Relay as the Layer 2 topology for large backbones, because of the perception that healing faults at L2 would happen faster and cleaner than letting the IP routing protocols take care of it (arguably true at the time).  Full mesh topologies are the worst possible topologies for standard flooding protocols (IS-IS and OSPF both) and the cost of topology changes was huge.  However, IS-IS lent itself to the “mesh group” hack by which you could manually prune the flooding topology to be a subset of the links.  OSPF doesn’t easily allow this because of details about the flooding model it uses.  Cisco apparently did implement a hack to get around this problem, but its probably more gross than the IS-IS “mesh groups” hack!

Another reason i believe people prefer IS-IS over OSPF is the belief that you can design large networks by building a single large Level 1 (L1) area without any hierarchies in IS-IS and still be able to manage – something that would be difficult with OSPF. There are issues with inter-area traffic engineering and such and most people would like to keep their network as a single area if the routing protocol can manage it.

I used to believe that operators can design big networks without hierarchies in IS-IS since all IP prefixes (i.e. network interfaces, routes aka reachabilities in ISO-speak) are considered as leaf nodes in the SPF for IS-IS. Thus a full SPF will not be triggered for an interface or a route flap in case of IS-IS. OSPF otoh, would go ballistic running SPF each time any IP information changes. The only time we dont run a full SPF in when a Type 5 LSA information changes, but thats hardly an optimization. Compared to this, the only time we run a full SPF in IS-IS is when an actual node goes down (which OSPF would also anyways do).

I was recently having a discussion with Dave Katz from Juniper on this and i realized that this really is an implementation choice. “The graph theory”, he very aptly pointed out, “is the same in both cases!”.  The IS-IS spec makes it easier to put an IS-IS reachability as leaf nodes as all routers are identified by a different set of TLVs. This information while its available in OSPF is slightly tricky as the node information is mixed with the link information. Thus while even a naive IS-IS implementation may be able to optimize SPF, it would require a good understanding of the spec to get it right in OSPF.

You could get the exact same optimization in OSPF as IS-IS if you realize that OSPF calculates the routes to the *router IDs* and not the addresses. The distinction between nodes and destinations is syntactically (and semantically) quite clear in OSPF as well. The spec considers the Router IDs which i concede look like IP addresses, something that most people miss.

Actual addresses and prefixes are quite distinct, even in OSPF.  So as long as you can keep track of what’s an address and what’s an ID, it’s not that hard, for what it’s worth.  The bigger problem is that only a handful of people really understand *why* things in the OSPF spec are done the way they are, and there are less and less of those folks because hardly anybody *needs* to understand it.

But having said all that, the cost of an SPF is so small on the scale of things that it’s not really the issue (which is also why I am not a big fan of partial-SPF optimizations:  “See how great it works when you have around O(50K) nodes and there is this one little node that goes down!” is sort of silly because lots of other things would break before a network ever got that big.)

Part of the SPF fear was I believe because Cisco’s original SPF implementation in OSPF was horribly inefficient (and everyone was using slow processors back then) and IOS was a non-preemptive, single threaded environment, and so an SPF (or any slow process) would block other things (like sending and receiving Hellos and other important bits) and would affect *everything*. I am btw sure that its changed now since i am aware of a couple of large Cisco deployments that are running OSPF in the core!  Overall system state management is a *much* bigger problem these days than the algorithmic efficiency of these protocols, particularly as we build larger and more distributed environments that require message passing internally.

Also what could have pushed providers back then to IS-IS was the deployment guidelines that Cisco used to publish (including the number of routes in an area) back then which were absurdly small. I am sure, its changed now.

There’s no technical reason why very large flat topologies can’t be supported by a good implementation of either protocol, but ISPs need to be conservative and suspicious of their vendors in order to survive.  ;-) I guess that nobody wants to be the first to deploy a large flat OSPF topology;  best practices tend to be sticky. However, there is no reason why you cant do it with OSPF today.

I suspect that, at this point, ISPs choose based on culture and familiarity and comfort rather than real technical differences. The perception still exists that while IS-IS can support large flat networks, OSPF cant. However, as i said its just a perception and is not really true any more.


Hub and Spoke Multi-Point LSPs for Scalable VPLS Architecture

Multiprotocol Label Switching (MPLS) and Generalized MPLS (GMPLS) provide a mechanism to set up point-to-multipoint (P2MP) LSPs which carry traffic from one ingress point (the root node) to several egress points (the leaf nodes), thus enabling multicast forwarding in an MPLS domain. However, there is no provision to provide a co-routed path back from the egress points (the leaves) to the ingress node (the root). The only way to do this is by configuring Unidirectional point-to-point LSPs from the leaves back to the root node. This entails configuring each leaf node with an LSP back to the root which could be a configuration and management nightmare if the number of leaves are large. Second, it can also not guarantee a co-routed path back from the leaf to the node, as the process of setting up the Unidirectional P2P path is independent from setting up the P2MP path.

This post introduces the concept of a hub-and-spoke multipoint (HSMP) LSP that allows traffic from the root to the leaves via a P2MP LSP and back from the leaves to the root via a unidirectional co-routed LSP. The proposed technique targets one-to-many applications that require reverse one-to-one traffic flow (thus many one-to-one in the reverse direction).

Consider the figure shown below.

PE1 is the ingress router for the HSMP LSP. The egress routers (leaves) are PE2, PE3 and PE4. As can be seen from the figure, PE1 creates a single copy of each packet arriving from the data source. This packet carries the MPLS label value L1. P1 is an ordinary Label Switch Router (LSR) that swaps the incoming label L1 with L2. Lets now focus on the packet forwarding process on the node P2.

For each packet belonging to the HSMP LSP, P2 makes three copies (just like how its done for a P2MP LSP), each of which is sent to PE2, PE3 and PE4 respectively. Packets arrive at P2 with MPLS label L2. As shown in the above figure, P2’s ILM contains an entry for label L2 saying that one copy of the packet should be sent out on interface if1 with label L3, another copy on interface if2 with label L4 and a last copy on interface if3 with label L5. Since the LSR P2 is replicating the MPLS packets its called the branch node.

An obvious advantage of this scheme is the bandwidth optimization. If we had used unidirectional P2P LSPs instead of an HSMP LSP, PE1 would have sent three copies of each packet that it had received from the data source and thus congesting the links PE1-P1 and P1-P2.

What sets an HSMP LSP apart from a regular P2MP LSP is the ability in the former to set up a path from the leaves back to the root. P2MP LSPs are unidirectional, so no traffic can flow from the leaf node routers to the ingress head end router along the P2MP LSP. In HSMP LSP, the leaf nodes can also send unidirectional traffic back to the root. This is shown in the figure below.

Because of the mechanisms defined in HSMP LSP, the branch node P2 advertises the same upstream label L1 for a given HSMP LSP to the nodes PE2, PE3 and PE4. It programs its ILM table as shown above, where it simply swaps L1 with L2 for all incoming MPLS packets and sends those towards P1. This way HSMP LSP is also able to provide a path back from the leaf nodes to the root node.

In the last post i had discussed issues that exist in VPLS. Lets see how HSMP LSP can solve them. I am using the same topology as was used there to demonstrate how HSMP LSP helps.

The figure 3 above shows the same VPLS service that we had discussed in the earlier post.

PE1 knows through some out-of-band mechanism (could be via BGP, Radius, manual configs, etc) that PE2, PE3 and PE4 are the egress nodes that belong to the same VPLS domain. PE1 now needs to establish an HSMP LSP (can be trivially extended to support a pseudowire) to PE2, PE3 and PE4. Figure shows 3 HSMP LSPs that will be required in this arrangement. The red HSMP LSP has PE1 as the root node and PE2, PE3 and PE4 as the leaf nodes. The green HSMP LSP has been initiated by PE2 and has PE2 as the root, and PE1, PE3 and PE4 as the leaf nodes. The blue HSMP LSP has been initiated by PE3 and has PE1, PE2 and PE4 as the leaf nodes. There is another HSMP LSP thats required – the one initiated by PE4 for the VPLS service to function. It has been omitted from the figure for the sake of clarity.

Thus all PE nodes in a VPLS service need to initiate an HSMP LSP (or a HSMP pseudowire) that terminates at the other PE routers.

In VPLS all BUM (broadcast, unknown unicast and multicast) traffic is flooded to all PE nodes. Its only the learnt traffic thats sent unicast by one PE to the other PE.

As explained earlier, a single copy sent by PE1 over an HSMP LSP will reach PE2, PE3 and PE4 (due to its P2MP component). Also the PE routers PE2, PE3 and PE4 can use this HSMP LSP (terminating at them) to send unicast traffic back to PE1.

Thus PE1 sends all BUM traffic on the HSMP LSP it initiates and all learnt unicast over the HSMP LSP that terminates on it.

Going back to figure 3, we can see that PE1 can use the red HSMP LSP to send all BUM traffic. This way it only sends one copy, and all the PE routers receive this packet. If PE1 wants to send learnt unicast traffic back to PE2, it uses the green HSMP LSP that terminates on it. PE1 can use this to send traffic back to PE2, which is the root node for this HSMP LSP. Similarly, PE1 uses the blue HSMP LSP whenever it wants to send learnt unicast traffic back to PE3.

Lets look at LSPs that PE2 uses. Whenever it wants to send BUM traffic, it uses the green HSMP LSP that it has originated. Any packet sent over that LSP is received by all the leaf nodes (which in this case happen to be PE1, PE3 and PE4). Like PE1, if it has to send learnt unicast traffic back to PE1, it uses the red HSMP LSP that was originated by PE1 and terminates at PE2.

Thus for a VPLS to fully function, all PE nodes must establish an HSMP LSP with all the other participating PE routers. It can use the optimized HSMP LSP that it originates for the BUM traffic and the HSMP LSP that other PEs originate for unicast communication.

The above table compares a VPLS service using HSMP LSPs with a regular VPLS service or Hierarchical VPLS (H-VPLS) service. Clearly, the former wins against the regular VPLS and H-VPLS on all counts. This may also perhaps be an answer to Juniper’s claim that H-VPLS is not scalable. Operators now need not implement H-VPLS; they can instead go in for VPLS services implemented using HSMP LSPs.

This post only briefly explains the idea behind HSMP LSPs. Its explained in detail here in this draft.


Convergence and Scalability Issues in OSPF and IS-IS

This seems to be the favorite question that every newbie has! There is no unequivocal answer to this and it all depends upon the kind of network and the topology. Having said this, lets try to see how the two IGPs can be compared.

IS-IS

This protocol is limited by the maximum number of LSPs that each IS-IS router can issue. This is 256 as its LSP ID is 1 octet long. The total number of IP prefixes carried by IS-IS can be easily computed and it comes to O(31000). However, RFC 3786 describes mechanisms to relax the limit on the number of LSP fragments, thereby increasing the number of IP prefixes that can be carried within IS-IS.

I have however, never seen any network carrying more than O(30K) IP prefixes inside IS-IS. Do let me know if you’re aware of networks where you see IS-IS carrying more than 30K routes.

I say this because this is a reasonable number for any sane IS-IS deployment and it will not run out of space unless someone actually injects the entire (or even partial) BGP feed into the IGP. In that case we will run out of space at about 20% of the way into redistribution and not be able to advertise the rest. It is for this reason that this practice has now been deprecated and the RFC 1745 which lays down the rules for BGP- OSPF interaction, has been moved to the HISTORICAL status.

There are 8 bits to define a pseudonode number in the LSPID which means that a router can be a Designated Intermediate System (DIS) for only 256 LANs. Additionally there is also a limitation on the number of routers that can be advertised in pseudonode LSP of the DIS. Dont worry – RFC 3786 fixes this!

RFC 3373 OTOH proposes a new TLV thats carried in the IIH PDUs that can increase the number of point-to-point adjacencies from 256 on a single router.

The “Remaining lifetime” field which gives the number of seconds before LSP is considered expired is 16 bits wide.

This gives the life time of the LSP as 2^16/60/60 Hrs = 18.7 Hrs

Thus the LSP issued by a router needs to be refreshed after every 18.7 Hrs. So youre not going to see a lot of IS-IS control packets being regenerated in a stable topology.

OSPF

In theory, OSPF topology is limited by the number of links that can be advertised in the Router LSA as each router gets only one Router LSA and it cant be bigger than 64K which is the biggest an IP packet can be. The same constraint applies to the Network LSA also.

Each link in the router can take up at most 24 bytes. Thus, number of links which can be supported is given by (64 * 1024) / 24 = 2370

However, if we take the minimum link size per link (12 bytes) then the maximum is about 2 * 2370 = O(5000) links

To be more specific, we can have O(2300) p2p and p2mp links (not considering virtual links, etc) and O(5000) broadcast/NMBA links described in OSPF’s Router LSA and its Network LSA.

Thus each Router LSA can carry some 5000 links information in it. It is hard to imagine a router having 5000 neighbors but there are already routers with 400 neighbors in some ISPs, and it may not take long to reach the order of the magnitude limited by OSPF.

The Network LSAs are generated by the designated router (DR) for each broadcast network it is connected to. To have scaling problems it should have 2730 * 6 times neighbors on that interface. This is even less probable and hence there are no scalability problems with OSPF per se.

All other LSAs apart from Type 1 and Type 2 hold single prefixes. Because there is no limit to the number of such LSAs, a large number of inter-area and externals can be generated depending upon the memory resources of the router.

Each LSA has an LS Age field which is counted upwards starting from zero. Its life is an architectural constant which says one hour. When an LSA’s LS age field reaches MaxAge, it is reflooded in an attempt to flush the LSA from the routing domain. One hour seems like a long time but if one originates 50,000 LSAs then OSPF will be refreshing on an average of just 36ms!

Total number of LSAs to be refreshed = 50,000

Time by which all the LSAs must be refreshed = LSRefreshTime = 30mins = 1800 secs

Rate at which the LSAs need to be refreshed = 1800/50000 = 36ms

However, if the refreshes are perfectly spread out across time and perfectly batched, the actual update transmission rate may be on the order of one packet per second.

There is however a “do-not-age” LSA which in theory can be pressed into service and which never gets aged. However, such LSAs will be eventually purged from the LS database if they become stale after being held for at least 60 minutes and the originator not reachable for the same period. Moreover it is not backward compatible and if one deploys that in the network today with some routers not supporting this then the network can really get weird. So there isn’t really much that can be done using these unless the whole network is changed!

Theoretically, both the routing protocols are scalable and there should not be any issues with either one of these if implemented properly. Both have similar stability and convergence properties. Practically, providers must go with what their vendors suggest since the vendors are best aware of how each protocol has been implemented on their platforms.

I discuss more of this here.


Follow

Get every new post delivered to your Inbox.

Join 85 other followers