Issues with how BFD is currently implemented over LAGs


The BFD standards don't explicitly describe how BFD should be implemented on Link Aggregation Groups (LAGs). This leaves a lot of room for imagination, and vendors have implemented their own proprietary mechanisms to make BFD work on LAGs. There is only so much room for innovation here, though, and most vendors have naturally arrived at similar techniques for running BFD over LAGs. So, what makes BFD so tricky to implement on LAGs?

BFD, being an L3 protocol, is oblivious to the physical link that its packets go out on. Usually, there is only one link associated with an L3 interface, so there is no ambiguity about the link a packet needs to go out on. However, when an IP interface is configured over a LAG, there are multiple constituent links the packet can go out on, and BFD has to decide which link to use for sending its packets.

A LAG binds together several physical ports between two adjacent nodes so they appear to higher-layer protocols and applications as a single, higher bandwidth “virtual” pipe.

The problem with running BFD over a LAG is that, without internal knowledge of the LAG structure, it is impossible for BFD to guarantee detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module, and LAG timers are typically several times slower than the BFD detection timers (multiples of 100ms, versus multiples of 10ms for BFD). There is thus a need to bring some determinism to how BFD runs over a LAG. There is also a need to detect member link failures much faster than the Link Aggregation Control Protocol (LACP) allows.

Let's look at what implementations currently do to run BFD on LAGs.

The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one “big, virtual pipe”.

Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here.

Some implementations send BFD packets only over the “primary” member link of the LAG. Others spray BFD packets over all member links of the LAG. There are issues with both these designs.

In the first design, BFD will remain Up as long as the primary link is alive. If the primary link goes down, and another link is not selected as the primary before BFD times out (around 30-50ms), then the BFD session on the LAG comes crashing down. Problems arise because BFD, in this design, is oblivious to the presence of the other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected, as it can still send and receive BFD packets over the primary link. Since the BFD session is Up, other routers in the network continue sending traffic meant to egress out of this interface. As expected from the LAG, all traffic egressing this interface gets load-distributed over all LAG member links. Now there is one link that's down: all traffic sent over that failed link gets dropped, until the LAG manager detects this and removes it from the LAG.

In the second design, BFD packets are sprayed over all the member links of a LAG. This is done naively, with each BFD packet sent over the next member link in round-robin fashion. It solves the problem of BFD going down because the primary link went down, but it still does not solve the problem of traffic getting lost when one of the member links goes down. This is because, when a member link goes down, BFD remains Up and traffic continues to go over the failed link until a higher layer protocol (usually LACP or the LAG manager) detects this and removes the offending link from the LAG.
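To make the failure mode concrete, here is a toy Python sketch of the two designs (the class and interface names are hypothetical, not any vendor's actual code). In both designs the BFD session stays Up with one member link dead, while data traffic keeps getting hashed onto the failed link:

```python
import itertools

class Lag:
    def __init__(self, members):
        self.members = members            # member name -> True if the physical link is healthy
        self.primary = next(iter(members))

    def hash_flow(self, flow_id):
        # Data traffic is load-balanced over *all* members the LAG still
        # believes are up -- including a silently failed one.
        links = list(self.members)
        return links[hash(flow_id) % len(links)]

def bfd_primary_link_ok(lag):
    # Design 1: BFD packets go only over the "primary" member link.
    return lag.members[lag.primary]

def bfd_round_robin_ok(lag, probes=4):
    # Design 2: BFD packets are sprayed round-robin; one healthy link
    # is enough to keep the session Up.
    rr = itertools.cycle(lag.members)
    return any(lag.members[next(rr)] for _ in range(probes))

lag = Lag({"ge-0/0/0": True, "ge-0/0/1": False, "ge-0/0/2": True})  # one dead link
print(bfd_primary_link_ok(lag))   # True -> session stays Up despite the dead member
print(bfd_round_robin_ok(lag))    # True -> session stays Up despite the dead member
print(lag.hash_flow("flow-42"))   # whichever flows hash to ge-0/0/1 are blackholed
```

In both cases the session-level view ("the LAG is fine") diverges from the per-link reality, which is exactly the gap described above.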

The above two designs defeat the core purpose of BFD, which is to detect faults in the path between two forwarding engines. In each design, traffic gets lost on a failed link until some protocol other than BFD detects the failure and removes that link from the LAG, and the timers associated with those protocols are an order of magnitude higher than BFD's.

Operators have long expressed a need to detect failed links fast, so that their traffic doesn't get lost. The idea is to have BFD take charge of the LAG and make it responsible for maintaining the list of active links in the LAG. This way, BFD's fast timers can be used to quickly detect link failures.

One could argue that there are native Ethernet OAM mechanisms (802.1ag, 802.3ah) that can be used to detect link failures in a LAG, so one need not rely on slow protocols like LACP or the LAG manager. The reality is that operators who have deployed BFD in their IP/MPLS networks want a common failure detection mechanism and don't want a mix of different technologies.

To solve the above-mentioned issues, I have co-authored an IETF document that proposes running BFD on each constituent link of the LAG. We call the BFD session running on each link a "micro BFD session", and we call this mode of BFD on LAGs BLM (BFD on LAG Members).

BLM is an umbrella BFD session that contains information about the LAG (or aggregated interface) that it's running on. It consists of a set of micro BFD sessions, one running on each constituent link of the LAG. And it contains a state that we call the "Concluded State", which describes the overall state of the LAG (Up, Down, AdminDown).
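Here is a minimal sketch of that structure in Python. The field names and the exact policy for deriving the Concluded State are my own illustration, not taken from the draft:

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3

@dataclass
class MicroBfdSession:
    member_link: str              # the constituent link this session runs on
    local_discriminator: int      # unique per micro session
    state: State = State.DOWN

@dataclass
class BlmSession:
    lag_interface: str                              # the LAG the umbrella session covers
    micro_sessions: list = field(default_factory=list)

    def concluded_state(self) -> State:
        # Assumption for illustration: the LAG is concluded Up if at least
        # one member's micro session is Up.
        if all(s.state is State.ADMIN_DOWN for s in self.micro_sessions):
            return State.ADMIN_DOWN
        if any(s.state is State.UP for s in self.micro_sessions):
            return State.UP
        return State.DOWN
```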

Each micro BFD session is a regular RFC 5880 and RFC 5881 compliant BFD session. Only Asynchronous mode is supported for the micro BFD sessions, since the sole reason for running BFD on each member link is to verify link connectivity. The Echo function is not recommended for micro BFD sessions, as it requires twice as many packets as pure Asynchronous mode to achieve a given Detection time.
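On the wire, a micro BFD session therefore sends ordinary BFD Control packets; per RFC 7130 they are carried over UDP destination port 6784 on each member link. Here is a small sketch (my own helper function, not code from the RFCs) that encodes the 24-byte RFC 5880 Control packet, without the optional authentication section:

```python
import struct

def bfd_control_packet(my_disc, your_disc, state, detect_mult=3,
                       min_tx_us=50000, min_rx_us=50000):
    """Encode a 24-byte RFC 5880 BFD Control packet (no authentication)."""
    version, diag = 1, 0
    byte0 = (version << 5) | diag          # Vers (3 bits) | Diag (5 bits)
    byte1 = (state << 6)                   # Sta (2 bits) | P F C A D M flags, all clear
    length = 24                            # fixed length without the auth section
    return struct.pack("!BBBBIIIII",
                       byte0, byte1, detect_mult, length,
                       my_disc, your_disc,
                       min_tx_us, min_rx_us,
                       0)                  # Required Min Echo RX = 0: Echo not used

pkt = bfd_control_packet(my_disc=0x11223344, your_disc=0, state=1)  # state 1 = Down
assert len(pkt) == 24
```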

At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions: they use their own unique, local discriminator values, maintain their own sets of state variables, and run their own independent state machines. Typically, each micro BFD session will have the same timer values; however, nothing precludes different timer values among the micro BFD sessions belonging to the same LAG.

A session begins with the periodic, slow transmission of BFD Control packets. When bidirectional communication is achieved, the BFD session becomes Up. The LAG manager is informed at this point, and the member link becomes an active link of the LAG.

If a micro session goes Down, the transmission of Control packets goes back to the slow rate, and the LAG manager is informed, which removes the member link from the LAG.

Once a session has been declared Down, it cannot come back up until the remote end first signals that it is down (by leaving the Up state), thus implementing a three-way handshake.

A session MAY be kept administratively down by entering the AdminDown state and sending an explanatory diagnostic code in the Diagnostic field.
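To see how the three-way handshake plays out, here is a simplified sketch of the RFC 5880 receive-side state transitions that each micro BFD session follows (timers, diagnostics, and the Poll sequence are omitted):

```python
ADMIN_DOWN, DOWN, INIT, UP = 0, 1, 2, 3

def next_state(local, remote):
    """New local state after receiving a Control packet with State = remote."""
    if local == ADMIN_DOWN:
        return ADMIN_DOWN              # only operator action leaves AdminDown
    if remote == ADMIN_DOWN:
        return DOWN                    # peer is administratively down
    if local == DOWN:
        if remote == DOWN:
            return INIT                # peer is Down too: start coming up
        if remote == INIT:
            return UP                  # peer has seen us: handshake completes
        return DOWN                    # a remote Up is ignored while we are Down
    if local == INIT:
        return UP if remote in (INIT, UP) else INIT
    # local == UP
    return DOWN if remote == DOWN else UP

# Down -> Init -> Up, never Down -> Up directly:
s = DOWN
for remote in (DOWN, INIT):
    s = next_state(s, remote)
assert s == UP
```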

In short, it's pretty much the same as a standard BFD session.

This solves the issues I described with the earlier two designs. The micro BFD sessions quickly detect a failed link and immediately remove it from the LAG. Traffic that was earlier egressing over the failed link now gets hashed to a different link in the LAG, so traffic loss on the LAG is limited to the short BFD detection time.
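As a tiny illustration of that last step (reusing the hypothetical interface names from the earlier sketch), once the failed member is removed from the hash set, flows can only land on healthy links:

```python
members = ["ge-0/0/0", "ge-0/0/1", "ge-0/0/2"]

def hash_flow(flow_id, links):
    # Stand-in for the LAG's load-balancing hash.
    return links[hash(flow_id) % len(links)]

before = hash_flow("flow-42", members)
members.remove("ge-0/0/1")            # micro BFD detected the failure on this link
after = hash_flow("flow-42", members)
# `before` may have been the failed link; `after` is guaranteed to be healthy.
```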

You can read more about our proposal here (more on how it evolved within IETF here).

Recognizing the need for running BFD on all member links, various vendors support their own proprietary, non-interoperable implementations of BFD over LAGs. We're hoping that our IETF proposal to standardize this behavior will bring some order to the chaos that's out there, and some relief to the providers who are stuck with proprietary solutions.

12 thoughts on "Issues with how BFD is currently implemented over LAGs"

  1. Good posting,

    BFDoLAG related issues are something that I have discussed several times in the past with my colleagues. The solution presented in the draft seems to be a novel way to solve this problem.

    I still feel a little bit concerned about whether Layer 2 protocol state (LAG) should be derived from Layer 3 protocol state (micro BFD session). However, as you noted, operators might be reluctant to implement other protocols, such as EthOAM, to track LAG status.

    I wonder if your concept could be generalized, so that the link liveness tracking protocol could be anything: for example, a "micro" BFD session, an 802.1ag "micro" CFM session (802.1ag is still service-level OAM), or 802.3ah OAM.


  2. Manav,

    Out of interest, how long do these proposals take to make it into production? I know it's hard to put a number on anything, but what's the average?

    p.s. I see you worked for Riverstone. Can you believe I still have some RS3000s running in some basements… The kit was good, the CLI was awful.


  3. How about sending BFD packets on all the member links, where receiving from at least one member is good enough to keep the session up, right?


    1. Hi Kuttuva,

      No, it isn't. Assume one of the member links has failed. The BFD session would still remain alive, as you're receiving BFD packets over the other links. Since you think everything is OK, you will continue forwarding traffic over that failed link, and it will get lost. This will continue till LACP discovers that the link is down and decides to remove it from the LAG. In case LACP is not used, you will always end up losing all traffic sent over that offending link.

      Cheers, Manav


  4. Hi Manav,
    Is there any significance to the new UDP destination port (6784) and MAC address for micro BFD sessions mentioned in RFC 7130, apart from distinguishing a normal BFD session from a micro BFD session (which can be done in other ways)?

