Issues with how BFD is currently implemented over LAGs

The BFD standards don't explicitly say how BFD should be implemented on Link Aggregation Groups (LAGs). This leaves a lot of room for imagination, and vendors have implemented their own proprietary mechanisms to make BFD work on LAGs. There is only so much room for innovation, though, and most vendors have naturally arrived at similar techniques to implement interoperable BFD over LAGs. So, what makes BFD so tricky to implement on LAGs?

BFD, being an L3 protocol, is oblivious to the physical link that the BFD packets go out on. Usually, there is only one link associated with an L3 interface, so there is no ambiguity about the link that the packet needs to go out on. However, when an IP interface is configured over a LAG, there are multiple constituent links that the packet can go out on, and BFD has to decide which link it wants to use for sending the packets out.

A LAG binds together several physical ports between two adjacent nodes so they appear to higher-layer protocols and applications as a single, higher bandwidth “virtual” pipe.

The problem with running BFD over a LAG is that, without internal knowledge of the LAG structure, it is impossible for BFD to guarantee detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module. LAG timers are typically several times slower than the BFD detection timers (hundreds of milliseconds vs. tens of milliseconds for BFD). There is thus a need to bring some sort of determinism to how BFD runs over a LAG. There is also a need to detect member link failures much faster than the Link Aggregation Control Protocol (LACP) allows.
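To put rough numbers on that gap, here is a small sketch. It assumes the simplified rule that a failure is declared after a fixed number of consecutive missed packets, and the interval values are illustrative assumptions of mine, not defaults from any particular implementation.

```python
def detection_time_ms(detect_mult: int, interval_ms: float) -> float:
    """Time without a received packet after which the peer is declared down.

    Simplified model: detection time = multiplier x packet interval.
    """
    return detect_mult * interval_ms


# Illustrative values only (assumptions, not vendor defaults).
bfd_ms = detection_time_ms(detect_mult=3, interval_ms=10)    # "multiple 10ms"
lag_ms = detection_time_ms(detect_mult=3, interval_ms=300)   # "multiple 100ms"

print(f"BFD would detect the failure in {bfd_ms:.0f} ms")
print(f"The LAG module would detect it in {lag_ms:.0f} ms")
print(f"Window in which only the LAG module can react: {lag_ms - bfd_ms:.0f} ms")
```

With these assumed timers, a member link failure that BFD cannot see stays unnoticed for most of a second.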

Let's look at what implementations currently do to implement BFD on LAGs.

The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one “big, virtual pipe”.

Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here.

Some implementations send BFD packets only over the “primary” member link of the LAG. Others spray BFD packets over all member links of the LAG. There are issues with both these designs.

In the first design, BFD will remain Up as long as the primary link is alive. If the primary link goes down and another link is not selected as the primary before BFD times out (around 30-50ms), then the BFD session on the LAG comes crashing down. Problems also arise because BFD, in this design, is oblivious to the presence of other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected as it can still send and receive BFD packets over the primary link. Since the BFD session is Up, other routers in the network continue sending traffic meant to egress out of this interface. As expected from the LAG, all traffic egressing out of this interface gets load distributed across all LAG member links. Now, there is one link that's down. All traffic sent over that failed link gets dropped until the LAG manager detects the failure and removes the link from the LAG.
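A toy model makes the blind spot visible. Everything here is hypothetical: the port names are made up, and taking a flow number modulo the member count is just a stand-in for whatever hash a real LAG uses.

```python
members = ["port1", "port2", "port3", "port4"]
primary = "port1"      # the only link that carries BFD packets in this design
failed = {"port3"}     # a non-primary member link has failed


def pick_member(flow_id: int) -> str:
    """Stand-in for the LAG hash: map a flow onto one member link."""
    return members[flow_id % len(members)]


# BFD only ever exercises the primary link, so it sees nothing wrong.
bfd_up = primary not in failed

flows = range(200)
dropped = [f for f in flows if pick_member(f) in failed]

print(f"BFD session up: {bfd_up}")
print(f"{len(dropped)} of {len(flows)} flows are hashed onto the failed link")
```

In this sketch a quarter of the flows are silently dropped while the BFD session happily stays Up.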

In the second design, BFD packets are sprayed over all the member links of a LAG. This is done naively via round-robin, where each BFD packet is sent on the next member link in turn. It solves the problem of BFD going down because the primary link went down, but it still does not solve the problem of traffic getting lost when one of the member links goes down. This is because, when a member link goes down, BFD remains Up and traffic continues to go over the failed link until some other mechanism (usually LACP or the LAG manager) detects the failure and removes the offending link from the LAG.
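Why does BFD stay Up here? A quick simulation shows it, under the simplifying assumption that a session is declared down only after a given number of consecutive BFD packets go missing. The function name and parameters are mine, purely for illustration.

```python
def round_robin_session_survives(n_links: int, failed_index: int,
                                 detect_mult: int, packets: int = 100) -> bool:
    """Return True if the sprayed BFD session never times out.

    Packet number seq is sent on member link (seq % n_links). Packets sent
    on the failed link are lost; all others are received.
    """
    missed_in_a_row = 0
    for seq in range(packets):
        link = seq % n_links               # round-robin choice of member link
        if link == failed_index:
            missed_in_a_row += 1
            if missed_in_a_row >= detect_mult:
                return False               # detection time expired
        else:
            missed_in_a_row = 0            # a packet got through, timer resets
    return True


# Four member links, one of them dead, detect multiplier of 3.
print(round_robin_session_survives(n_links=4, failed_index=2, detect_mult=3))
# A single-link "LAG" with its only link dead does time out.
print(round_robin_session_survives(n_links=1, failed_index=0, detect_mult=3))
```

With more than one member link, only every n-th packet is lost, so the session never misses enough packets in a row to go down.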

The above two designs defeat the core purpose of BFD, which is to detect faults between the two forwarding engines. In each design, traffic gets lost on a failed link until some protocol other than BFD detects the failure and removes that link from the LAG. The timers associated with the other protocol are an order of magnitude higher than those of BFD.

Operators have long expressed a need to detect failed links fast so that their traffic doesn't get lost. The idea is to get BFD to take charge of the LAG and make it responsible for maintaining the list of active links in a LAG. This way we can use the BFD fast timers to quickly detect link failures.

One could argue that there are native Ethernet OAM mechanisms (802.1ag, 802.3ah) that can be used to detect link failures in a LAG, and that one need not rely on slow protocols like LACP or the LAG manager. The reality is that operators who have deployed BFD in their IP/MPLS networks want a common failure detection mechanism and don't want a mix of different technologies.

To solve the above-mentioned issues, I have co-authored an IETF document that proposes running BFD on each constituent link of the LAG. We call the BFD session running on each link a “micro BFD session”. We call this mode of BFD on LAGs BLM (BFD on LAG Members).

BLM is an umbrella BFD session that contains information about the LAG (or the aggregated interface) that it's running on. It consists of a set of micro BFD sessions, one running on each constituent link of the LAG. It also contains a state that we call the “Concluded State”, which describes the overall state of the LAG (Up, Down, AdminDown).
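As a rough picture of that structure, here is a sketch in Python. The class and field names are hypothetical, and the rule used to derive the Concluded State (Up if any micro session is Up, AdminDown if all of them are AdminDown, Down otherwise) is my assumption for illustration, not something spelled out above.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3


@dataclass
class MicroSession:
    """One BFD session bound to a single member link (hypothetical shape)."""
    port: str
    local_discriminator: int
    state: State = State.DOWN


@dataclass
class BlmSession:
    """Umbrella session for one LAG: its micro sessions plus a derived state."""
    lag_name: str
    micro: dict[str, MicroSession] = field(default_factory=dict)

    def concluded_state(self) -> State:
        # Assumed derivation rule, see the paragraph above this block.
        states = [m.state for m in self.micro.values()]
        if states and all(s is State.ADMIN_DOWN for s in states):
            return State.ADMIN_DOWN
        if any(s is State.UP for s in states):
            return State.UP
        return State.DOWN


blm = BlmSession("lag-1")
for i, port in enumerate(["port1", "port2"], start=1):
    blm.micro[port] = MicroSession(port=port, local_discriminator=i)

blm.micro["port1"].state = State.UP
print(blm.concluded_state())
```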

Each micro BFD session is a regular RFC 5880 and RFC 5881 compliant BFD session. Only Asynchronous mode is supported for the micro BFD sessions as the sole reason for running BFD on each member link is to verify the link connectivity. The Echo function for the micro BFD sessions is not recommended as it requires twice as many packets to achieve a particular Detection time as does the pure Asynchronous mode.

At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions. They use their own unique local discriminator values, maintain their own set of state variables and have their own independent state machines. Typically each micro BFD session will have the same timer values. However, nothing precludes the possibility of having different timer values among the different micro BFD sessions belonging to the same LAG.

A session begins with the periodic, slow transmission of BFD Control packets. When bidirectional communication is achieved, the BFD session becomes Up. The LAG manager is informed at this point, and the member link becomes an active link of the LAG.

If the micro session goes Down, the transmission of Control packets goes back to the slow rate. The LAG manager is informed, and it removes the member link from the LAG.
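The interaction between a micro session and the LAG manager could look roughly like this. The `LagManager` class, its methods and the modulo "hash" are all hypothetical, a sketch of the idea rather than how any real implementation is structured.

```python
class LagManager:
    """Toy LAG manager that tracks which member links may carry traffic."""

    def __init__(self, members):
        self.members = list(members)
        self.active = set()

    def link_up(self, port):
        self.active.add(port)

    def link_down(self, port):
        self.active.discard(port)

    def pick(self, flow_id):
        """Stand-in for the LAG hash, restricted to active member links."""
        usable = [m for m in self.members if m in self.active]
        if not usable:
            return None
        return usable[flow_id % len(usable)]


def on_micro_session_change(lag, port, is_up):
    """Called whenever the micro BFD session on `port` changes state."""
    if is_up:
        lag.link_up(port)      # member link becomes an active link of the LAG
    else:
        lag.link_down(port)    # member link is removed from the LAG


lag = LagManager(["port1", "port2", "port3", "port4"])
for port in lag.members:
    on_micro_session_change(lag, port, is_up=True)

on_micro_session_change(lag, "port3", is_up=False)   # micro session timed out
print(sorted(lag.active))
print({lag.pick(flow) for flow in range(200)})
```

Once the micro session on port3 goes Down, no flow is hashed onto that link any more.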

Once a session has been declared Down, it cannot come back up until the remote end first signals that it is down (by leaving the Up state), thus implementing a three-way handshake.
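The handshake falls out of the BFD state machine. The sketch below is my simplified reading of the reception rules in RFC 5880. It covers only the state carried in received Control packets and leaves out timers, authentication and diagnostics.

```python
from enum import Enum


class State(Enum):
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3


def next_state(local: State, received: State) -> State:
    """New local state after receiving a Control packet with state `received`."""
    if local is State.ADMIN_DOWN:
        return local                       # packets are discarded while AdminDown
    if received is State.ADMIN_DOWN:
        return State.DOWN
    if local is State.DOWN:
        if received is State.DOWN:
            return State.INIT
        if received is State.INIT:
            return State.UP
        return State.DOWN                  # peer still says Up: stay Down
    if local is State.INIT:
        if received in (State.INIT, State.UP):
            return State.UP
        return State.INIT
    # local is UP
    return State.DOWN if received is State.DOWN else State.UP


# We timed out and went Down, but the peer still believes the session is Up.
local = next_state(State.DOWN, State.UP)
print(local)
# Only after the peer has itself dropped to Down can the session restart.
local = next_state(local, State.DOWN)
print(local)
local = next_state(local, State.INIT)
print(local)
```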

A session MAY be kept administratively down by entering the AdminDown state and sending an explanatory diagnostic code in the Diagnostic field.

In short, it's pretty much the same as a standard BFD session.

This solves the issues that I described in the earlier two designs. The micro BFD sessions will quickly detect a failed link, and it will be removed from the LAG right away. Traffic that was earlier egressing over the failed link will now get hashed to a different link in the LAG. This keeps traffic loss on the LAG down to what is lost within the BFD detection time.

You can read more about our proposal here (more on how it evolved within IETF here).

Recognizing the need for running BFD on all member links, various vendors support their own proprietary, non-interoperable implementations of BFD over LAGs. We’re hoping that our IETF proposal to standardize this behavior will bring some order to the chaos that's out there and some relief to the providers who are stuck with proprietary solutions.