Why providers still prefer IS-IS over OSPF when designing large flat topologies!


I was recently interacting with our pre-sales team for a large MPLS deployment and was reading the network design that was proposed. I saw that they had suggested IS-IS over OSPF as the IGP to use at the core. One of the reasons cited was the inherent security that IS-IS provides by running natively over Layer 2. Another was that IS-IS is more modular and thus easier to extend than OSPF. OSPF, it’s alleged, is very rigid and required a complete protocol rewrite to support something as basic as IPv6! 🙂 Then there was the overload feature in IS-IS, which can signal memory overload and has no equivalent in OSPF, and finally a point about IS-IS showing superior scalability (faster convergence). In case you’re intrigued about the last point, as I clearly was, it was explained that IS-IS uses just one Link State Packet (LSP) per level for exchanging the routing information. This LSP contains many TLVs, each of which represents a piece of routing information. OSPF, on the other hand, needs to originate multiple LSAs, one for each type, and as a consequence is a lot chattier and hence not suitable for large flat networks.

I personally don’t agree with any of the reasons listed above, and anyone who favors IS-IS over OSPF for these reasons is patently mistaken. These are all extremely weak arguments and have mostly been overtaken by reality. Let’s look at them one by one.

Security – While it’s true that one can’t lob an IS-IS packet from a distance without a tunnel, that was never really a compelling reason for someone to pick IS-IS over OSPF. The same holds for OSPF multicast packets, which cannot be launched by some script kiddie sitting miles away at his personal laptop. Both protocols have similar authentication mechanisms, and both have been extended to support stronger algorithms (RFC 5310 for IS-IS and RFC 5709 for OSPF). I can say this with some degree of confidence as I have co-authored both these standards.

Modularity – While it’s somewhat easier to extend IS-IS in a backward-compatible way, this sort of thing doesn’t happen much any more. Both protocols have been extended to support multiple instances, traffic engineering, multi-topology, graceful restart, etc. This isn’t, IMO, a showstopper for someone picking OSPF as the IGP to use.

Overload Mechanism – IS-IS has the ability to set the Overload (OL) bit in its LSPs. This results in other routers in that area treating this router as a leaf router in their shortest path trees, which means that it’s only used for reaching its directly connected interfaces and is never placed on the transit path to reach other routers. So does this happen any more? No, it doesn’t. This feature was required in the Jurassic age when routers came with severely constrained memory and CPU power, and the original intention of the OL mechanism is now mostly irrelevant. Most core routers today have enough memory and CPU that they will not get inundated by the IS-IS routes in any sane network design.

These days the OL bit is used to prevent unintentional blackholing of packets in BGP transit networks. Due to the nature of these protocols, IS-IS and OSPF converge much faster than BGP. Thus there is a possibility that while the IGP has converged, IBGP is still learning the routes. In that case, if other IBGP routers start sending traffic towards an IBGP router that has not yet completely converged, it will start dropping traffic, because it isn’t yet aware of the complete set of BGP routes. The OL bit comes in handy in such situations. When a new IBGP neighbor is added or a router restarts, the IS-IS OL bit is set. Since directly connected addresses (including loopbacks) on an “overloaded” router are still considered by other routers, IBGP can be brought up and can begin exchanging routes. Other routers will not use this router for transit traffic and will route the packets out through an alternate path. Once BGP has converged, the OL bit is cleared and this router can begin forwarding transit traffic.
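To make the leaf-versus-transit behaviour concrete, here is a minimal sketch of an SPF run that honours an overload flag. It is not how any particular router implements it; the `spf` function, the `overloaded` parameter and the four-node topology are all assumptions made up for illustration. An overloaded node still gets a distance (so its own addresses stay reachable), but its links are never expanded, so nothing is routed through it.

```python
import heapq


def spf(graph, source, overloaded=frozenset()):
    """Dijkstra over graph[node][neighbor] = cost.

    Nodes listed in `overloaded` are reachable as leaves but are never
    used as transit (hypothetical model of the IS-IS OL bit).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if node != source and node in overloaded:
            continue  # reachable, but do not expand its links
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


# Hypothetical topology: A reaches D cheaply via B, expensively via C.
topology = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 30},
    "D": {"B": 10, "C": 30},
}

normal = spf(topology, "A")
with_ol = spf(topology, "A", overloaded={"B"})
print(normal["D"], with_ol["D"], with_ol["B"])
```

With B marked as overloaded, A still reaches B directly, but traffic to D moves to the longer path through C until the flag is cleared.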

So how can we do this in OSPF since there is no OL bit in its LSAs?

Simple. We can set the metric of all transit links on an “overloaded” router to 0xffff in its Router LSAs. This makes the router the least attractive transit node in the SPF tree, so other routers will route around it whenever an alternate path exists. Stub links can still be advertised with their normal metrics so that they are reachable even when the router is “overloaded”. Thus this point against OSPF is also not valid.
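The metric trick above is commonly known as OSPF stub router advertisement (RFC 6987). A rough sketch of the transformation applied to a router’s own links might look like the following. The `RouterLink` class, the `kind` labels and the `as_stub_router` helper are hypothetical names for illustration, not a real OSPF library.

```python
from dataclasses import dataclass, replace

MAX_LINK_METRIC = 0xFFFF  # 65535, the largest value of the 16-bit link metric


@dataclass(frozen=True)
class RouterLink:
    link_id: str
    kind: str    # hypothetical labels: "p2p", "transit" or "stub"
    metric: int


def as_stub_router(links):
    """Return the links as they would be advertised while "overloaded".

    Every non-stub link gets the maximum metric; stub links keep their
    normal metric so the router's own prefixes stay reachable.
    """
    return [
        link if link.kind == "stub" else replace(link, metric=MAX_LINK_METRIC)
        for link in links
    ]


# Hypothetical set of links in one Router LSA.
links = [
    RouterLink("to-R2", "p2p", 10),
    RouterLink("lan-1", "transit", 5),
    RouterLink("loopback0", "stub", 1),
]

overloaded = as_stub_router(links)
for link in overloaded:
    print(link.link_id, link.kind, link.metric)
```

Once BGP has converged, the router simply re-originates its Router LSA with the original metrics.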

Finally we come to the scalability and the convergence part. This one is slightly tricky and is not so easy. I wrote a few posts around 4 years back discussing this here and here. You might want to read these.

IMO one of the big reasons why most big providers use (or have used) IS-IS is that way back in the 90s Cisco’s OSPF implementation was a disaster. The first big ISPs (UUnet, MCI) came to them and said “we want to build big infrastructures, should we use OSPF?” and Cisco basically said “No, that’s not a good idea, use IS-IS instead”. Dave Katz at Cisco had recently rewritten Cisco’s IS-IS implementation as a side effect of implementing NetWare Link Services Protocol – NLSP (basically IS-IS for Novell IPX), so Cisco was quite confident of its IS-IS implementation. The operators thus picked up IS-IS and continue using it even today, as there is no real difference between IS-IS and OSPF and hence no motivation to move from one to the other.

IS-IS was also an advantage for a router vendor in the early days because it was an “open proprietary spec”. It was out there, and published, but unless you had some background in OSI you didn’t know much about it and the spec was scary and weird. This wasn’t on purpose, but it was handy.

It was also nice in the IETF because IS-IS was viewed, at least at the time, as the poor cousin of OSPF and so nobody really cared that much other than the handful of folks that were doing the work. This made the extension of IS-IS a lot easier and a lot less political than OSPF. In fact I have heard about a t-shirt which said “IS – IS = 0” that was distributed at one of the IETF meetings a long time ago! Things have changed, however, and IS-IS is considered on par with OSPF today and both the working groups are quite active in the IETF.

There was one real technical advantage to IS-IS in common deployment scenarios of that day as well. Back then, it was popular to build full meshes of ATM or Frame Relay as the Layer 2 topology for large backbones, because of the perception that healing faults at L2 would happen faster and cleaner than letting the IP routing protocols take care of it (arguably true at the time). Full mesh topologies are the worst possible topologies for standard flooding protocols (IS-IS and OSPF both) and the cost of topology changes was huge. However, IS-IS lent itself to the “mesh group” hack by which you could manually prune the flooding topology to be a subset of the links. OSPF doesn’t easily allow this because of details about the flooding model it uses. Cisco apparently did implement a hack to get around this problem, but it’s probably more gross than the IS-IS “mesh groups” hack!

Another reason, I believe, why people prefer IS-IS over OSPF is the belief that you can design large networks by building a single large Level 1 (L1) area without any hierarchies in IS-IS and still be able to manage it – something that would be difficult with OSPF. There are issues with inter-area traffic engineering and such, and most people would like to keep their network as a single area if the routing protocol can manage it.

I used to believe that operators can design big networks without hierarchies in IS-IS since all IP prefixes (i.e. network interfaces, routes, aka reachabilities in ISO-speak) are considered leaf nodes in the SPF for IS-IS. Thus a full SPF will not be triggered for an interface or a route flap in IS-IS. OSPF, OTOH, would go ballistic running SPF each time any IP information changes. The only time we don’t run a full SPF is when Type 5 LSA information changes, but that’s hardly an optimization. Compared to this, the only time we run a full SPF in IS-IS is when an actual node goes down (which OSPF would also do anyway).

I was recently having a discussion with Dave Katz from Juniper on this and I realized that this really is an implementation choice. “The graph theory”, he very aptly pointed out, “is the same in both cases!”. The IS-IS spec makes it easier to treat IP reachabilities as leaf nodes, since routers are identified by a different set of TLVs than the prefixes. This information, while it’s available in OSPF, is slightly tricky to extract as the node information is mixed with the link information. Thus, while even a naive IS-IS implementation may be able to optimize SPF, it would require a good understanding of the spec to get it right in OSPF.

You could get the exact same optimization in OSPF as in IS-IS if you realize that OSPF calculates the routes to the *router IDs* and not the addresses. The distinction between nodes and destinations is syntactically (and semantically) quite clear in OSPF as well. The spec works in terms of Router IDs, which I concede look like IP addresses, and that distinction is something most people miss.

Actual addresses and prefixes are quite distinct, even in OSPF. So as long as you can keep track of what’s an address and what’s an ID, it’s not that hard, for what it’s worth. The bigger problem is that only a handful of people really understand *why* things in the OSPF spec are done the way they are, and there are fewer and fewer of those folks because hardly anybody *needs* to understand it.
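The node-versus-prefix separation discussed above can be sketched in a few lines. This is only an illustration under assumed data shapes: `node_dist` stands for the result of an SPF run keyed by router ID, `prefixes` maps each router ID to the prefixes it advertises, and `routes_from_spf` is a hypothetical helper. The point is that a prefix flap only requires redoing the cheap leaf step, whichever protocol carried the information.

```python
def routes_from_spf(node_dist, prefixes):
    """Attach prefixes as leaves to an already computed SPF result.

    node_dist: {router_id: cost from the local router}
    prefixes:  {router_id: {prefix: advertised metric}}
    Returns {prefix: (total cost, advertising router_id)}.
    """
    table = {}
    for router, advertised in prefixes.items():
        if router not in node_dist:
            continue  # router is unreachable, so its prefixes are too
        for prefix, metric in advertised.items():
            cost = node_dist[router] + metric
            if prefix not in table or cost < table[prefix][0]:
                table[prefix] = (cost, router)
    return table


# Hypothetical SPF result: distances to router IDs, not to addresses.
node_dist = {"1.1.1.1": 0, "2.2.2.2": 10, "3.3.3.3": 20}
prefixes = {
    "2.2.2.2": {"10.0.2.0/24": 1, "10.0.9.0/24": 5},
    "3.3.3.3": {"10.0.3.0/24": 1, "10.0.9.0/24": 1},
}

before = routes_from_spf(node_dist, prefixes)

# An interface flaps on 2.2.2.2: the graph of routers is unchanged,
# so node_dist is reused and only the leaf step is repeated.
del prefixes["2.2.2.2"]["10.0.9.0/24"]
after = routes_from_spf(node_dist, prefixes)

print(before["10.0.9.0/24"], after["10.0.9.0/24"])
```

Whether an implementation bothers with this split is a design choice; as noted below, the cost of a full SPF is rarely the real bottleneck.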

But having said all that, the cost of an SPF is so small on the scale of things that it’s not really the issue (which is also why I am not a big fan of partial-SPF optimizations:  “See how great it works when you have around O(50K) nodes and there is this one little node that goes down!” is sort of silly because lots of other things would break before a network ever got that big.)

Part of the SPF fear was, I believe, because Cisco’s original SPF implementation in OSPF was horribly inefficient (and everyone was using slow processors back then) and IOS was a non-preemptive, single-threaded environment, so an SPF (or any slow process) would block other things (like sending and receiving Hellos and other important bits) and would affect *everything*. I am, BTW, sure that it’s changed now since I am aware of a couple of large Cisco deployments that are running OSPF in the core! Overall system state management is a *much* bigger problem these days than the algorithmic efficiency of these protocols, particularly as we build larger and more distributed environments that require message passing internally.

What could also have pushed providers to IS-IS back then were the deployment guidelines that Cisco used to publish (including the number of routes in an area), which were absurdly small. I am sure it’s changed now.

There’s no technical reason why very large flat topologies can’t be supported by a good implementation of either protocol, but ISPs need to be conservative and suspicious of their vendors in order to survive. 😉 I guess that nobody wants to be the first to deploy a large flat OSPF topology; best practices tend to be sticky. However, there is no reason why you can’t do it with OSPF today.

I suspect that, at this point, ISPs choose based on culture and familiarity and comfort rather than real technical differences. The perception still exists that while IS-IS can support large flat networks, OSPF can’t. However, as I said, it’s just a perception and is not really true any more.

24 thoughts on “Why providers still prefer IS-IS over OSPF when designing large flat topologies!”

  1. Hi, this is a great post which clears up my doubt. I was trying to find evidence on why IS-IS is preferred to OSPF in ISP networks. People say a large-scale ISP network should use ISIS because it is more scalable, but I do not see any clear reason why OSPF cannot scale the same. This post provides good insight on why. Thanks.

  2. There are large providers, such as AT&T, that use OSPF in a large flat area 0. The reasons for using ISIS are mostly historical and, as you mentioned, Cisco’s ISIS implementation was ahead of the OSPF one when a lot of networks that would later become very large started running IP.

  3. “There’s no technical reason why very large flat topologies can’t be supported by a good implementation of either protocol”.

    So everything is fine, as long as you have a good implementation ?
    And who is gonna write that “good” implementation ? You ?

    I’ve worked on both OSPF and IS-IS implementations. Implementations that are deployed in real networks. And I can tell you, OSPF is a Pain to implement. You can hardly use any neat internal implementation tricks. Because everything is overspecified in the specs. Give me IS-IS any day. Dave Katz has worked on IS-IS (at cisco) and OSPF (at Juniper). And he prefers IS-IS. I wonder why.

    For the rest, this article is pretty good. I sometimes disagree with subtleties and the underlying bias. But it’s better than any comparison I’ve read before.

  4. Why I prefer IS-IS over OSPF
    1. Still, in Cisco IOS you have to run two versions of OSPF if you want to run ipv4/ipv6
    2. You can play with LSP lifetime individually on each router as this timer goes down from value set on the router itself (and it is advertised in LSP). In OSPF it goes from 0 to maximum which is set in stone. Based on this lifetime interval one can increase refresh interval.
    3. The maximum number of LSP fragments originated by one router is 256, and if you unintentionally redistribute BGP into ISIS it won’t melt down as it stops after reaching 256 fragments. In OSPF your whole network will crash, and you have to REBOOT your routers all at once (or one by one, disabling OSPF until all are rebooted) as they will continue to flood those Type 5 LSAs all over the place. They will be so busy that they won’t even respond on the console! We tried this in the lab; it was very impressive (and scary). And I know of one big real accident a couple of years ago which was due to BGP->OSPF redistr
    4. Advertise-passive only
    5. IS-IS database is much easier to read.
    6. OL bit is much easier to use. You enter just one command in router, not several, changing metrics on both sides of links. And if you use it with wait-for-bgp keyword it is unset automatically.

    1. Most of these reasons are implementation details, not design. Operational cost may be debatable, but Manav is right about all of them, except that I also wonder how you would enable IPv6 in a brownfield network that runs OSPFv2. For greenfield, OSPFv3 now also supports IPv4, so even if you want to deploy an IPv4 core greenfield network, it supports almost everything IS-IS does, and familiarity with OSPF could carry it one more step ahead?

  5. Agree with all of them, but as you also said, there is another reason: IPv6 extensibility. In a brownfield network running only OSPFv2, how could you support IPv6?

  6. ISIS also provides a lot of mechanisms that help in migrations between different implementations or when doing maintenance windows, which is useful for providers.

  7. I have a question to flip this on its head. Why not use ISIS on small networks 🙂 I am arguing with my University Professor (I have 15 years Network exp and I’m a Network Engineer) about why anyone would go from OSPF > EIGRP for better performance. If I wanted something better than OSPF I would go to ISIS. I run small specialized ISIS networks in our lab and it works great, and simply, for import/export. In the real world can anyone tell me why you wouldn’t use ISIS on a small flat high-speed network connected to two and MPLS provider ISPs over eBGP?

    1. There is no reason why you can’t use IS-IS in small networks! 🙂

      BTW, there isn’t anything in the protocol that makes IS-IS better than OSPF, so I would be highly circumspect before claiming that!

  8. I have seen networks where OSPF has been implemented as a flat topology. As per my understanding, OSPF is at its best when you have a stable network with no flaps, because it requires m log n calculations…

    1. More developments such as Segment Routing, TRILL and FabricPath are happening on ISIS. ISIS is more modular, which makes it easier to introduce new features.

  9. I have stumbled upon your blog just today, and here was my first blog post to read and it’s very informative and amazing the details you’re sharing with us.

    Thank you.

  10. Long, long ago AS6461 had a large flat OSPF topology. We ripped it out & replaced it with ISIS.

    But the real scaling issue at the time was “redistribute connected” into OSPF that caused frequent database meltdowns on original GSR RPs, not the protocol itself.

  11. Just to clarify, at least for Cisco’s implementation of OSPFv2, there’s no separation between node IDs (which are used to build the graph) and NLRIs (info that the nodes making up the graph exchange).

    This means that for OSPFv2, every time you change the cost or even a mask on an OSPF-enabled link, that event triggers a full SPF calculation for all routers in that area.

    This is something that does not happen with OSPFv3, thanks to the new prefix and link LSAs.

    IS-IS has always had a clear separation between nodes and NLRIs.
    It is a much better designed protocol in several aspects. In fact OSPF used a lot of ideas from IS-IS initially.

    Radia Perlman says that, if people had noticed that IS-IS could carry IP NLRI information, we wouldn’t have two LS protocols today (https://www.ietf.org/mail-archive/web/ospf/current/msg00620.html)

    Currently, both protocols (OSPFv3 and IS-IS) are very similar in features and performance; there’s no compelling reason to choose one over another.

    OSPF has a slight advantage for being more widely implemented though.

    1. Cisco’s implementation of OSPFv3 requires IPv6 addresses on the links between routers, even if we use OSPFv3 for the IPv4 address family.

  12. Yes everything you can do with IS-IS while building a large flat network you actually can do with OSPF but to me the answer is simple, it’s easier to do it with IS-IS. At least today. That’s it.

  13. One of the amazing things about ISIS is there is no error path, except for “overloaded routers,” and when the massive sequence space wraps. It is completely extensible with TLVs. Of course it runs over the link layer, since it provides the network connectivity.

    Look, you don’t know what you are talking about, your thoughts are shallow. ISIS has always been a superior protocol to OSPF. Some try to be neutral about this, but they in fact aren’t. The first original protocol running in AT&T was ISIS, MCI, Sprint. Why? Because it provably could scale whereas OSPF could not at the time.

    I was there. I know. You don’t.

  14. Hi Manav,

    Great insight as always.. There is a command called overload in OSPF and ISIS. So, does overload command also do same changes in protocol as told by you in your topic or is it some other mechanism?

    Thanks
    Shubham
