Catching Corrupted OSPF Packets!

I was having a discussion with Paul Jakma (a friend, co-author on a few IETF drafts, a routing protocols expert, the guy behind Quagga, the list just goes on ..) some time back about a weird problem that he came across at a customer network, where OSPF packets were being corrupted somewhere between being read off the wire (with the CRC and IP checksum verified) and being delivered to the OSPF stack. While the problem was repeatable within 30 minutes on that particular network, he could never reproduce it on his VM network (and neither could the folks who reported this problem).

Eventually, for some inexplicable reason, he asked them to turn on MD5 authentication (with a tweak to drop duplicate sequence number packets – duplicate packets as the trigger of the problems being a theory). With this, their problems changed from “weird” to “adjacencies just start dropping, with lots of log messages about MD5 failures”!

So it appears that the customer had some kind of corruption bug in custom parts of their network stack, on input, such that OSPF gets handed a good long sequence of corrupt packets, all of which (we don’t know how many) seem to pass the internet checksum and then cause very odd problems for OSPF.

So, is this a realistic scenario and can this actually happen? While I have personally never experienced this, it can happen for any of the following reasons:

o PCI transmission error (parallel PCI had parity checks, though they were not always enabled; PCI Express has a 32-bit CRC though)

o memory bus error (though, all routers and hosts should use ECC RAM)

o bad memory (same)

o bad cache (CPUs don’t always ECC their caches – Sun, it seems, was badly bitten by this. While the last few generations of Intel and AMD CPUs do this, what about all those embedded CPUs that we use in the routers?)

o logic errors, particularly in network hardware drivers

o finally, CRCs and the internet checksum are not very good, and it’s not impossible for wire-corrupted packets to get past the link, IP AND OSPF CRC/checksums.

The internet checksum, which is used for the OSPF packet checksum, is incredibly weak. There are various papers out there, particularly ones by Partridge (who helped author the internet checksum RFC!), which cover this. Basically, it offers very little protection:

– it can’t detect re-ordering of 2-byte aligned words
– it can’t detect various bit flips that keep the 1s complement sum the same (e.g. 0x0000 to 0xffff and vice versa)
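Both weaknesses are easy to reproduce. Below is a minimal Python sketch of the 16-bit ones' complement checksum; the function name and the 8-byte payload are made up for illustration and are not a real OSPF packet.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:                 # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload: the 16-bit words 0x0201, 0x002c, 0x0000, 0x1234.
original = bytes.fromhex("0201002c00001234")
reordered = bytes.fromhex("002c020112340000")  # same words, different order
flipped = bytes.fromhex("0201002cffff1234")    # the 0x0000 word became 0xffff
one_bit = bytes.fromhex("0201002d00001234")    # a single flipped bit

print(internet_checksum(original) == internet_checksum(reordered))  # True: missed
print(internet_checksum(original) == internet_checksum(flipped))    # True: missed
print(internet_checksum(original) == internet_checksum(one_bit))    # False: caught
```

The first two corruptions leave the ones' complement sum untouched, so the receiver has no way of telling them apart from the original.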

Even the link-layer CRC is not perfect, and Partridge has co-authored papers detailing how corrupted packets can get past both the CRC and the internet checksum.

So what choice do the operators have for catching corrupted packets in the SW?

Well, they could either rely on the incredibly poor internet checksum that exists today or they could turn on cryptographic authentication (keyed MD5 with RFC 2328 or different HMAC-SHA variants with RFC 5709) and catch all such failures. The former would not always work, as there are errors that can creep past the checksum. The latter would work, but there are certain disadvantages in using cryptographic HMACs purely for integrity checking. The algorithms require more computation, which may be noticeable on less powerful and/or energy-sensitive platforms. Additionally, the need to configure and synchronize the keying material is an additional administrative burden. I had posted a survey on Nanog some time back where I had asked the operators if they had ever turned on crypto protection to detect such failures, and I received a couple of responses offline where they alluded to doing this to guard against checksum failures.
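To see why a keyed digest catches what the checksum misses, here is a rough Python sketch using the standard `hmac` module. It is not the RFC 5709 on-wire procedure (which has its own packet layout and padding rules); the key, payload and helper names are assumptions made for illustration only.

```python
import hashlib
import hmac


def tag(key: bytes, packet: bytes) -> bytes:
    """Compute an HMAC-SHA-256 digest over the packet bytes."""
    return hmac.new(key, packet, hashlib.sha256).digest()


def verify(key: bytes, packet: bytes, received_tag: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(tag(key, packet), received_tag)


key = b"example-shared-key"                       # hypothetical keying material
original = bytes.fromhex("0201002c00001234")      # made-up payload
reordered = bytes.fromhex("002c020112340000")     # same 16-bit words, shuffled

sent_tag = tag(key, original)
print(verify(key, original, sent_tag))    # True: intact packet is accepted
print(verify(key, reordered, sent_tag))   # False: reordering is now detected
```

The price is exactly what is described above: both ends need the same key configured, and every packet costs a hash computation.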

Paul and I wrote a short IETF draft some time back where we propose to change the checksum algorithm used for verifying OSPFv2 and OSPFv3 packets. We would only like to replace the very weak packet checksum with something stronger, without having to go the full crypto hash protection way. You can find all the gory details here!

Metrics Size in OSPF and IS-IS ..

Each interface in the link state protocols is given a metric or a cost, which is advertised with the link state information in the LSP or the LSA. The SPF algorithm uses this metric to calculate the cost and the nexthop to a destination. Metrics used are generally the inverse of bandwidth, thus a link with a large bandwidth capacity would have a lower metric.
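As a small sketch of the inverse-of-bandwidth idea in Python: the 100 Mbps reference bandwidth and the function name are assumptions for illustration (many implementations make the reference configurable), and the result is clamped to a 16-bit OSPF-style link metric.

```python
def interface_cost(bandwidth_bps: int,
                   reference_bps: int = 100_000_000,  # assumed reference: 100 Mbps
                   max_cost: int = 0xFFFF) -> int:    # 16-bit link metric ceiling
    """Return a cost inversely proportional to bandwidth, clamped to 1..max_cost."""
    cost = reference_bps // bandwidth_bps
    return max(1, min(cost, max_cost))


print(interface_cost(64_000))          # 1562 (slow serial link)
print(interface_cost(10_000_000))      # 10
print(interface_cost(1_000_000_000))   # 1 (anything at or above the reference)
```

Note how every link at or above the reference bandwidth collapses to a cost of 1, which is one practical reason the range of the metric field matters.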

IS-IS

o ISO 10589 specified the metric to be 6 bits in size. Therefore, the metric value could only range from 0-63. This metric information was carried in the Neighbor Reachability TLV and the IP Reachability TLV. Since it was only 6 bits wide, it was called the “narrow metric”. The maximum path metric (MaxPathMetric) supported is 1023. The metric size was apparently kept small to optimize the search while doing the SPF: in theory this brought down the complexity of the SPF algorithm from O(n log n) to O(n). But this isn’t a significant motivation any more since CPUs are really fast these days. It also allowed two types of metrics – External and Internal.

o Soon, the “narrow” metric range was found to be too small for certain networks, and new TLVs (Extended IP Reachability and Extended Neighbor Reachability) were introduced to carry larger metrics as part of the Traffic Engineering document. These were called Wide Metrics. The MaxLinkMetric value now is 0xFFFFFF and the MaxPathMetric is 0xFE000000.

The Extended IP Reachability TLV allows for a 4 byte metric, while the Extended Neighbor Reachability TLV allows for a 3 byte metric. This is to enable the metric summarized across levels/domains to be as large as 0xFFFFFFFF while the link metric itself is no larger than 0xFFFFFE. If a link metric value of 0xFFFFFF is used, the link is not used in the SPF computation.

o The original specification defined 4 kinds of narrow metrics – delay, default, error and expense. The other three were never implemented and all implementations only support the default metric. In ISO 10589, the most significant bit of the 8-bit metric fields was the field S (Supported), which defined whether the metric was to be considered or not. Confusingly, as is the case with most ISO documents, an implementation was supposed to set this bit to 1 if it wanted to indicate that the metric was unsupported. Since only the default metric is used, implementations must always set this bit to 0 when using the narrow metrics. Later, RFC 2966 used this bit in the default metric to mark L1 routes that have been leaked from L1 to L2 and back down into L1 again.

Current implementations must generate the default metric when using narrow metrics and should ignore the other three metrics when using narrow metrics.
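The field layouts above can be sketched in a few lines of Python. The helper names are hypothetical, and the position of the Internal/External flag (the bit just below the top bit) is taken from the usual narrow-metric TLV layout rather than from the text above, so treat it as an assumption.

```python
NARROW_MAX_LINK_METRIC = 63        # 6-bit metric field
NARROW_MAX_PATH_METRIC = 1023      # MaxPathMetric for narrow metrics
WIDE_MAX_LINK_METRIC = 0xFFFFFF    # 3-byte metric in the Extended Neighbor TLV


def decode_narrow_default_metric(octet: int) -> dict:
    """Split one narrow default-metric octet into its three parts."""
    return {
        "top_bit": bool(octet & 0x80),   # sent as 0 originally; reused by RFC 2966
        "external": bool(octet & 0x40),  # assumed position of the I/E flag
        "metric": octet & 0x3F,          # the 6-bit metric, 0..63
    }


def wide_link_usable_in_spf(metric: int) -> bool:
    """A wide link metric of 0xFFFFFF means: do not use this link in SPF."""
    if not 0 <= metric <= WIDE_MAX_LINK_METRIC:
        raise ValueError("wide link metric must fit in 24 bits")
    return metric < WIDE_MAX_LINK_METRIC


print(decode_narrow_default_metric(0x4A))
print(wide_link_usable_in_spf(0xFFFFFE), wide_link_usable_in_spf(0xFFFFFF))
```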

OSPFv2

o It allows a link to have a 2 byte metric field in the Router LSA which implies a maximum metric of 0xFFFF.

o The Summary, Summary ASBR, AS-External and NSSA LSAs have a 3 byte metric value. A cost of 0xFFFFFF (LSInfinity) is used to indicate that the destination described in the LSA is unreachable.

o AS-External and NSSA LSAs allow two metric types, Type 1 and Type 2, which are equivalent to IS-IS Internal and External metrics. Type 1 considers the cost to the ASBR in addition to the advertised cost of the route, while Type 2 uses just the advertised cost when calculating the routes during the SPF.

o The scheme thus allows for links to be configured with a metric no larger than 0xFFFF, while allowing cost of destinations injected across areas/levels to be as large as 0xFFFFFE.
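The difference between the two external metric types is easiest to see as a comparison rule. The Python sketch below uses hypothetical names and follows my reading of RFC 2328: Type 1 paths are preferred over Type 2 paths, and among equal-cost Type 2 paths the distance to the ASBR acts as a tie-breaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalPath:
    metric_type: int      # 1 or 2
    advertised_cost: int  # metric carried in the AS-External/NSSA LSA
    cost_to_asbr: int     # SPF cost from this router to the advertising ASBR


def preference_key(path: ExternalPath) -> tuple:
    """Lower tuples are better; Type 1 always sorts ahead of Type 2."""
    if path.metric_type == 1:
        return (1, path.cost_to_asbr + path.advertised_cost, 0)
    return (2, path.advertised_cost, path.cost_to_asbr)


def best_path(paths):
    return min(paths, key=preference_key)


near = ExternalPath(metric_type=2, advertised_cost=20, cost_to_asbr=5)
far_but_cheap = ExternalPath(metric_type=2, advertised_cost=10, cost_to_asbr=500)
type1 = ExternalPath(metric_type=1, advertised_cost=1000, cost_to_asbr=100)

print(best_path([near, far_but_cheap]))   # far_but_cheap: only advertised cost counts
print(best_path([near, type1]))           # type1: Type 1 beats any Type 2
```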

OSPFv3

o It allows similar metric size for the Router LSA as in OSPFv2.

o It allows similar metric sizes for the Intra Area Prefix LSA, Inter Area Prefix LSA, AS-External LSA and NSSA LSA as in OSPFv2. The value and significance of LSInfinity apply here too.

Convergence and Scalability Issues in OSPF and IS-IS

This seems to be the favorite question that every newbie has! There is no unequivocal answer to this and it all depends upon the kind of network and the topology. Having said this, let’s try to see how the two IGPs can be compared.

IS-IS

This protocol is limited by the maximum number of LSPs that each IS-IS router can issue. This is 256, as the LSP Number (fragment) field in its LSP ID is 1 octet long. The total number of IP prefixes carried by IS-IS can be easily computed and it comes to O(31000). However, RFC 3786 describes mechanisms to relax the limit on the number of LSP fragments, thereby increasing the number of IP prefixes that can be carried within IS-IS.
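A rough sketch of where a figure of this order comes from, in Python. All the sizes are assumptions chosen for illustration: a 1492-byte LSP, a 27-byte LSP header, narrow-metric IP reachability entries of 12 bytes each, and TLVs with a 2-byte header and at most 255 bytes of value, with every fragment filled with nothing but prefixes.

```python
MAX_FRAGMENTS = 256  # 1-octet LSP Number


def prefixes_per_lsp(lsp_size: int = 1492, header: int = 27,
                     entry: int = 12, tlv_header: int = 2,
                     tlv_max_value: int = 255) -> int:
    """Count how many fixed-size prefix entries fit into one LSP fragment."""
    usable = lsp_size - header
    per_full_tlv = tlv_max_value // entry              # entries in a full TLV
    full_tlv_size = tlv_header + per_full_tlv * entry  # bytes used by a full TLV
    full_tlvs, leftover = divmod(usable, full_tlv_size)
    tail = max(leftover - tlv_header, 0) // entry      # entries in a partial TLV
    return full_tlvs * per_full_tlv + tail


print(prefixes_per_lsp())                   # 121
print(prefixes_per_lsp() * MAX_FRAGMENTS)   # 30976
```

In practice the fragments also carry neighbor and other TLVs, so the usable number of prefixes is lower than this upper bound.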

I have, however, never seen any network carrying more than O(30K) IP prefixes inside IS-IS. Do let me know if you’re aware of networks where you see IS-IS carrying more than 30K routes.

I say this because this is a reasonable number for any sane IS-IS deployment and it will not run out of space unless someone actually injects the entire (or even partial) BGP feed into the IGP. In that case we will run out of space about 20% of the way into the redistribution and not be able to advertise the rest. It is for this reason that this practice has now been deprecated and RFC 1745, which lays down the rules for BGP-OSPF interaction, has been moved to HISTORIC status.

There are 8 bits to define a pseudonode number in the LSP ID, which means that a router can be a Designated Intermediate System (DIS) for only 256 LANs. Additionally, there is also a limitation on the number of routers that can be advertised in the pseudonode LSP of the DIS. Don’t worry – RFC 3786 fixes this!

RFC 3373 OTOH proposes a new TLV that’s carried in the IIH PDUs and that can increase the number of point-to-point adjacencies on a single router beyond 256.

The “Remaining lifetime” field which gives the number of seconds before LSP is considered expired is 16 bits wide.

This gives the lifetime of the LSP as 2^16/60/60 Hrs = 18.2 Hrs

Thus the LSP issued by a router needs to be refreshed only once every 18.2 Hrs. So you’re not going to see a lot of IS-IS control packets being regenerated in a stable topology.

OSPF

In theory, the OSPF topology is limited by the number of links that can be advertised in the Router LSA, as each router gets only one Router LSA and it can’t be bigger than 64K, which is the biggest an IP packet can be. The same constraint applies to the Network LSA also.

Each link in the router can take up at most 24 bytes. Thus, the number of links which can be supported is given by (64 * 1024) / 24 = 2730

However, if we take the minimum size per link (12 bytes) then the maximum is about 2 * 2730 = O(5000) links

To be more specific, we can have O(2700) p2p and p2mp links (not considering virtual links, etc) and O(5000) broadcast/NBMA links described in OSPF’s Router LSA and its Network LSA.

Thus each Router LSA can carry information for some 5000 links. It is hard to imagine a router having 5000 neighbors, but there are already routers with 400 neighbors in some ISPs, and it may not take long to reach the order of magnitude that OSPF is limited to.

The Network LSAs are generated by the designated router (DR) for each broadcast network it is connected to. To run into scaling problems, the DR would need about 2730 * 6 neighbors on that interface. This is even less probable and hence there are no scalability problems with OSPF per se.
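The arithmetic above can be checked with a short Python sketch. It uses the same simplification as the text (a flat 64K budget, ignoring IP, OSPF and LSA headers); the 4 bytes per attached router in a Network LSA is an assumption that matches the factor of 6 used above.

```python
MAX_LSA_BYTES = 64 * 1024  # simplified budget: the largest possible IP packet


def max_entries(bytes_per_entry: int) -> int:
    """How many fixed-size entries fit in one LSA under the flat 64K budget."""
    return MAX_LSA_BYTES // bytes_per_entry


print(max_entries(24))  # 2730 links at the maximum size per link
print(max_entries(12))  # 5461 links at the minimum size per link
print(max_entries(4))   # 16384 attached routers in a Network LSA (about 2730 * 6)
```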

All other LSAs apart from Type 1 and Type 2 hold single prefixes. Because there is no limit to the number of such LSAs, a large number of inter-area and externals can be generated depending upon the memory resources of the router.

Each LSA has an LS Age field which is counted upwards starting from zero. Its maximum life is an architectural constant (MaxAge), which is one hour. When an LSA’s LS Age field reaches MaxAge, it is reflooded in an attempt to flush the LSA from the routing domain. One hour seems like a long time, but if one originates 50,000 LSAs then OSPF will be refreshing an LSA every 36ms on average!

Total number of LSAs to be refreshed = 50,000

Time by which all the LSAs must be refreshed = LSRefreshTime = 30mins = 1800 secs

Average interval between two LSA refreshes = 1800/50000 secs = 36ms

However, if the refreshes are perfectly spread out across time and perfectly batched, the actual update transmission rate may be on the order of one packet per second.
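Both numbers can be reproduced with a back-of-the-envelope Python sketch. The packing assumptions are mine, not from the calculation above: a 1500-byte MTU, 36-byte LSAs (the size of an AS-External LSA), and 48 bytes of IPv4, OSPF and Link State Update overhead per packet.

```python
def refresh_load(num_lsas: int, refresh_time_s: int = 1800,
                 mtu: int = 1500, lsa_bytes: int = 36):
    """Return (interval between refreshes, LSAs per packet, packets per second)."""
    interval_s = refresh_time_s / num_lsas
    # Assumed overhead: 20-byte IPv4 header + 24-byte OSPF header + 4-byte LSA count.
    lsas_per_packet = (mtu - 20 - 24 - 4) // lsa_bytes
    packets_per_s = (num_lsas / refresh_time_s) / lsas_per_packet
    return interval_s, lsas_per_packet, packets_per_s


interval_s, lsas_per_packet, packets_per_s = refresh_load(50_000)
print(f"one refresh every {interval_s * 1000:.0f} ms")
print(f"{lsas_per_packet} LSAs per update packet")
print(f"{packets_per_s:.2f} packets per second with perfect batching")
```

Under these assumptions roughly 28 LSAs fall due every second, and they fit comfortably into a single update packet.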

There is, however, a “do-not-age” LSA which in theory can be pressed into service and which never gets aged. However, such LSAs will eventually be purged from the LS database if they become stale, that is, after they have been held for at least 60 minutes and the originator has been unreachable for the same period. Moreover, it is not backward compatible, and if one deploys it in a network today where some routers do not support it, the network can really get weird. So there isn’t really much that can be done using these unless the whole network is changed!

Theoretically, both the routing protocols are scalable and there should not be any issues with either one of these if implemented properly. Both have similar stability and convergence properties. Practically, providers must go with what their vendors suggest since the vendors are best aware of how each protocol has been implemented on their platforms.

I discuss more of this here.

Checks on HELLOs for OSPF and IS-IS during Adjacency Formation ..

The HELLOs (or IIHs in IS-IS parlance) are responsible for bringing up the adjacencies between two (or more) routers. Forming adjacencies is an integral part of all link state routing protocols, as all protocol packets other than HELLOs are flooded only over the adjacencies. The rules for the formation of such adjacencies however differ between IS-IS, OSPFv2 and OSPFv3.

IS-IS

Besides the basic checks to verify the integrity of the packet, IS-IS does a few checks before forming any adjacency upon receiving the IIHs.

o It allows multiple area addresses to be configured on a router. During the IIH exchange the adjacency is formed only if at least one of the area addresses matches. The advantage of having multiple areas is explained in later posts. NOTE that Level 2 only adjacencies would be formed even if the area addresses do not match.

o To prevent the LSPs and CSNPs from being dropped due to different values for originatingLSPBufferSize and ReceiveLSPBufferSize, all IIHs are padded up to the maximum MTU when the adjacency comes up. This check verifies consistent settings between the adjacent routers. This is however not a sufficient check.

o Adjacencies are formed without regard to interface addressing or asymmetric HOLD timer values. Values of IIH interval are not sent in IIH packets. While the IS-IS protocol provides sufficient routing information for relaying packets between adjacent routers, many implementations nonetheless require ARP support to do this. These implementations typically refuse to form an adjacency unless the neighbor interface IP address is on the local interface’s IP subnet.

o IS-IS can carry addressing information of different protocols in its TLVs. However, the Protocols Supported field must be sent by Dual and IP-Only routers. RFC 1195 specifies no checks on the Protocols Supported field for adjacency formation. It places topology restrictions on multi-protocol networks. In networks that conform to these restrictions, neighboring routers will always have a protocol in common. Therefore, it does not state whether adjacency formation should take the protocols supported into account. However, many implementations do not form an adjacency with a neighbor unless they share at least one protocol in common.

o Not having to match Hold Timer values has its advantages, as the administrator can set different Hold times for different routers. This helps in cases where the failure of a DIS or some other router needs to be detected faster. For such routers the hold timer can be set to a lower value.
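The checks in this list can be summarized in a small Python sketch. The `Iih` type, the `accept_iih` function and the `strict` switch are hypothetical; `strict=True` models the implementations mentioned above that additionally require a shared protocol and a neighbor address on the local subnet.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface


@dataclass
class Iih:
    level: int            # 1 or 2
    area_addresses: set   # e.g. {"49.0001"}
    protocols: set        # e.g. {"ipv4"}
    source_ip: str        # neighbor interface address
    hold_time: int        # carried in the IIH, but never compared


def accept_iih(iih: Iih, local_areas: set, local_protocols: set,
               local_interface: str, strict: bool = True) -> bool:
    # Level 1 needs at least one common area address; Level 2 ignores areas.
    if iih.level == 1 and not (iih.area_addresses & local_areas):
        return False
    if strict:
        # Extra checks that many implementations add on their own.
        if not (iih.protocols & local_protocols):
            return False
        if ip_address(iih.source_ip) not in ip_interface(local_interface).network:
            return False
    return True  # the hold time is deliberately not checked


hello = Iih(1, {"49.0001"}, {"ipv4"}, "10.0.0.2", hold_time=9)
print(accept_iih(hello, {"49.0001", "49.0002"}, {"ipv4"}, "10.0.0.1/24"))  # True
print(accept_iih(hello, {"49.0003"}, {"ipv4"}, "10.0.0.1/24"))             # False
```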

OSPFv2

The checks for the formation of adjacencies are stricter in OSPFv2 as compared to those of IS-IS.

o The area-id of the received packet should always match that of the incoming interface (with the exception of virtual links). Area type is strictly checked by checking the E-bit (not set for stub areas) and the N-bit (not set for non-NSSA areas).

o The values of the Hello Interval, the Router Dead Interval and the network mask received in Hellos are matched with those configured on the interface. Any mismatch in the values causes the Hello packet to be dropped and hence prevents the formation of adjacencies. The disadvantage of this approach is that Hello Interval and Router Dead Interval changes need to be done on both ends within the Router Dead Interval, to prevent breaking adjacencies. The advantage is that we would not form an adjacency with a router that has been mis-configured with a large value and which could cause problems later. The network mask check however does not apply to point to point links. That allows the two ends of a Point-to-Point link to have different addresses.

o The MTU check is not done in the Hellos. It is done during the Database (DB) Exchange process.
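A minimal Python sketch of these Hello checks follows. The `HelloParams` type and `hello_mismatches` function are hypothetical names; virtual links are ignored, and the MTU is left out on purpose since it is only compared during the Database Exchange.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HelloParams:
    area_id: str
    hello_interval: int
    dead_interval: int
    netmask: str
    e_bit: bool
    n_bit: bool


def hello_mismatches(received: HelloParams, configured: HelloParams,
                     point_to_point: bool = False) -> list:
    """Return the names of the fields that would cause the Hello to be dropped."""
    fields = ["area_id", "hello_interval", "dead_interval", "e_bit", "n_bit"]
    if not point_to_point:
        fields.append("netmask")  # the mask is not checked on point-to-point links
    return [f for f in fields if getattr(received, f) != getattr(configured, f)]


local = HelloParams("0.0.0.1", 10, 40, "255.255.255.0", e_bit=True, n_bit=False)
slow_peer = replace(local, dead_interval=120)
other_mask = replace(local, netmask="255.255.255.252")

print(hello_mismatches(slow_peer, local))                        # ['dead_interval']
print(hello_mismatches(other_mask, local))                       # ['netmask']
print(hello_mismatches(other_mask, local, point_to_point=True))  # []
```

An empty list means the Hello passes these checks; any entry means it is dropped and no adjacency forms.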

OSPFv3

Most of the checks for OSPFv3 are similar to those of OSPFv2.

o OSPFv3 runs on a per link basis instead of a per subnet basis. The check for a network mask is thus not done.

o Instance ID field (non-existent in OSPFv2) on the link is matched with the incoming ID in Hellos. The adjacency is formed only if the Instance ID matches. This allows multiple instances of OSPFv3 to run on a single link.
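The Instance ID check effectively demultiplexes Hellos between the instances sharing a link. A tiny Python sketch, where the instance names and the `dispatch_hello` helper are made up for illustration:

```python
from typing import Optional


def dispatch_hello(instance_id: int, instances: dict) -> Optional[str]:
    """Return the local instance that should process the Hello, or None to drop it."""
    return instances.get(instance_id)


# Hypothetical link with two OSPFv3 instances configured on it.
instances_on_link = {0: "core", 1: "customer-a"}

print(dispatch_hello(1, instances_on_link))  # customer-a
print(dispatch_hello(7, instances_on_link))  # None: no matching instance, dropped
```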