
Software defined WAN (SD-WAN) is really about Intelligence ..

Let’s admit that most of us in the networking domain know as much about SD-WAN as an average 6th grader knows about sex, which is to say pretty much nothing. We take it as something much grander and more exotic than what it really is, and we are obviously surrounded by friends and well-wishers who wink conspiratorially that they “know it all” and consider themselves on an intellectual high ground to educate us on matters of this rich and riveting biological social interaction. Like most others at that tender and impressionable age, I did get swayed by what I heard, and it’s only later that I was able to sort things out in my head, till it all became somewhat clear.

The proverbial clock’s wound backwards and I experience that feeling of déjà vu each time I read an article on SD-WAN that either extols its virtues or vilifies it as something that has always existed and is being speciously served on a platter dressed up as something that it is not. And like the big boys then, there are men who-know-it-all, who have already written SD-WAN off as something that has always existed and presents nothing new. Clearly, I disagree with that view.

I presume, perhaps a trifle rashly, that you are already aware of the basic concepts of SDN and NFV, and hence I won’t waste any more oxygen explaining those.

So what really is the SD-WAN technology, and what is the precise problem that it’s trying to solve?

SD-WAN is a way of architecting, designing and deploying enterprise WANs using commodity Internet connections in a manner that makes those “magically” appear as a private “MPLS-like” connection. It’s the claim that it can appear “MPLS-like” that really peeves the regular-big-mpls-vendors-and-consultants. I will delve into the “MPLS-like” aspect a little later, so please hold on to your sabers till then. What makes the “magic” work is the control plane that implements and enforces the network access policies (VoIP is high priority/low latency/low jitter, big data sync is medium priority and all else is low priority, no VoIP via Afghanistan, etc.) and the data plane that weaves an L2/L3 overlay on top of the existing consumer-grade Internet links (broadband links and in a few cases the LTE/4G connections).
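To make the policy idea a little more concrete, here is a minimal sketch of how such a policy table might be modelled. Everything in it (the `Policy` class, the application names, the use of country codes for the regions a path crosses) is an assumption made for illustration and not any vendor’s actual API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Policy:
    priority: str                             # "high", "medium" or "low"
    banned_regions: frozenset = frozenset()   # regions this app must never transit


# Hypothetical policy table mirroring the examples in the text.
POLICIES = {
    "voip": Policy("high", frozenset({"AF"})),   # no VoIP via Afghanistan
    "bigdata-sync": Policy("medium"),
}
DEFAULT_POLICY = Policy("low")                   # "all else low priority"


def policy_for(app: str) -> Policy:
    """Look up the policy for an application, falling back to the default."""
    return POLICIES.get(app, DEFAULT_POLICY)


def path_allowed(app: str, path_regions: Sequence[str]) -> bool:
    """True if the overlay path crosses no region that is banned for this app."""
    banned = policy_for(app).banned_regions
    return not any(region in banned for region in path_regions)
```

A controller built along these lines would call `path_allowed("voip", ...)` for each candidate overlay path and discard the ones that fail before it even starts comparing latency or jitter.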

The SD-WAN evangelists want to wean enterprises off their dedicated, prohibitively priced private WAN connections (read MPLS circuits) and onto commodity enterprise broadband links. Philosophically, adding a new branch should just mean shipping a CPE device (perhaps in a virtualized form-factor) that auto-magically dials into a central controller when brought to life. Once that’s done and the credentials are verified, the branch should just come online (voilà!) and should be visible to all the geo-separated branches. Contrast this with the provisioning time (which can go as high as a year in some remote locations) and the complexity involved in getting a remote branch online today with MPLS, and you will understand why most IT folks have ulcers and are perennially on anti-anxiety/depressant medicines. And btw we’ve not even begun talking about the expenses and long-term contracts that come with MPLS connections!

Typically SD-WAN solutions have a central SDN controller which is really a cluster of x86 devices (servers, VMs, containers, take your pick) and hence has far more computing and analytical horsepower than a dedicated hardware network device. The controller has complete visibility from the source all the way to the destination, can constantly analyze traffic, and can carve out optimal network paths for applications and individual flows based on the user and application policies. In the first mile the Internet links are either coalesced to form a fatter pipe or are used separately as dictated by the customer policies. The customer traffic is continuously fingerprinted and is routed dynamically based on the real-time network conditions.

Where most people go wrong is when they believe that SD-WAN solutions lose control over the traffic once it leaves the customer premises or the SD-WAN edge node. Bear in mind that there is nothing in the SD-WAN technology that prevents further control over how the traffic is routed and this could perhaps be one aspect differentiating one SD-WAN offering from the other. Since SD-WAN is an overlay technology you will not have control over each physical hop, but you can surely do something more nuanced given the application and end-to-end network visibility that exists with the controller.


It’s “MPLS-like” in the sense that you can, in most cases, guarantee the available bandwidth and network uptime. The central controller can monitor each overlay circuit for loss/jitter/delay and can take corrective actions when routing traffic. Patently, enterprise broadband connections in certain geographies don’t come with the same level of reliability as MPLS, and it behooves us to ask ourselves if we need that level of reliability (given the cost that we pay for such connections). An enterprise can always hedge its risks by commissioning a few backup enterprise broadband connections for those rainy days when the primary is out cold. Alternatively, enterprises can go in for a hybrid approach where they maintain a low-bandwidth MPLS connection for their mission-critical traffic and use the SD-WAN solution for everything else, OR they can implement a policy to revert to the MPLS connection when the Internet connections are not working satisfactorily. This can also provide a plausible transition strategy for enterprises that may not be comfortable switching to SD-WAN given that the technology is still relatively new.
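The “monitor and take corrective action” loop, along with the revert-to-MPLS policy, can be sketched roughly as below. This is only an illustration under my own assumptions: the `PathStats` and `Sla` types, the thresholds and the “lowest delay wins” tie-breaker are all hypothetical, and a real controller would weigh far more than three metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathStats:
    name: str
    loss_pct: float
    jitter_ms: float
    delay_ms: float
    is_mpls: bool = False


@dataclass
class Sla:
    max_loss_pct: float
    max_jitter_ms: float
    max_delay_ms: float


def meets(path: PathStats, sla: Sla) -> bool:
    """True if the measured path statistics are within the SLA thresholds."""
    return (path.loss_pct <= sla.max_loss_pct
            and path.jitter_ms <= sla.max_jitter_ms
            and path.delay_ms <= sla.max_delay_ms)


def pick_path(paths: List[PathStats], sla: Sla) -> Optional[PathStats]:
    """Prefer a healthy Internet overlay; revert to MPLS only if none qualifies."""
    internet_ok = [p for p in paths if not p.is_mpls and meets(p, sla)]
    if internet_ok:
        return min(internet_ok, key=lambda p: p.delay_ms)
    mpls_ok = [p for p in paths if p.is_mpls and meets(p, sla)]
    if mpls_ok:
        return min(mpls_ok, key=lambda p: p.delay_ms)
    return None  # nothing meets the SLA; the caller decides what to do next
```

Re-running `pick_path` every time fresh measurements arrive gives you the dynamic behaviour described above: traffic sits on broadband while it behaves and falls back to the small MPLS circuit when it doesn’t.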

And do note that even MPLS connections go down, so it’s really not fair to say that SD-WAN solutions stand on tenuous ground with regard to reliability. Yes, I concede that there are SLAs given with MPLS that just don’t exist with regular Internet pipes. However, one could argue that you can get some extra reliability by throwing in an additional Internet link (with a different provider?) that’s only there as a standby. Also note that with service providers now offering fiber connections, the size and the quality of Internet links is only going to improve with time. A large site, for instance, can aggregate a 1Gbps Google Fiber and a 1Gbps Verizon FIOS connection and can retain a small MPLS connection as the standby. If the enterprise discovers that its MPLS connection is underutilized, it can negotiate on pricing or can go with a smaller MPLS pipe and thereby save on its costs.
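The “throw in another link” argument is easy to put into numbers. The sketch below assumes the links fail independently (which is roughly what you are hoping for when you pick a different provider) and uses made-up availability figures, since real ones vary by provider and geography.

```python
def combined_availability(*link_availabilities: float) -> float:
    """Availability of a site that stays up as long as at least one link is up.

    Assumes link failures are independent, which is an idealisation.
    """
    p_all_down = 1.0
    for availability in link_availabilities:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down


# Two hypothetical 99% links together behave like a single 99.99% link.
two_links = combined_availability(0.99, 0.99)
```

If the two links share a duct or an upstream provider, the failures are correlated and the real figure will be lower, which is the whole argument for provider diversity.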

I recently read a blog which argued that an enterprise broadband connection promising 350Mbps would mostly give only around 320Mbps on average. Sure, this might be true in a few geographies, but seriously, who cares? Given the cost difference between a broadband connection and an MPLS circuit, I will gladly assume that I only had a 300Mbps connection and derive utmost pleasure any time it gives me anything more than that!

The central controller in the SD-WAN technologies amongst other things (analyzing traffic, links) can also continually learn about the customer network conditions and can predict when link qualities will deteriorate and can preemptively reroute traffic before the links start acting up. Given that the controller is monitoring paths end-to-end and is also monitoring and analyzing the traffic emanating from the branch sites there are insights that enterprises can draw that they could have never imagined when using traditional WAN architectures since in that world all connections are really only “dumb pipes”. SD-WAN changes all that — it changes how the enterprise connections and the applications running there are viewed. The WAN architecture is aligned to the application service requirements and its management is greatly simplified. You can implement complex network policies and let the SD-WAN infrastructure sweat on your behalf (HINT: intent driven networking).

So watch out before you disdainfully write off SD-WAN as a technology that’s merely replacing your dumb MPLS pipes with regular Internet connections, since, I argue, it can really do a lot more than that. Perhaps a topic worth discussing some other day.

BFD in the new Avatar


We all love Bidirectional Forwarding Detection (BFD) and can’t possibly imagine our lives without it. We love it so much that we were ready with sabers and daggers drawn when we approached IEEE to let BFD control the individual links inside a LAG, something that’s traditionally done by LACP.

Having done that, you would imagine that people would have settled down for a while (after their small victory dance of course) — but no, not the folks in the BFD WG. We are now working on a new enhancement that really takes BFD to the next level.

There isn’t anything egregiously wrong or missing per se in BFD today. It’s just not very optimal in certain scenarios and we’re trying to plug those holes (and doing our bit to ensure that folks in the data comm industry have ample work and remain perennially employed).

Ok, let’s not be modest: there are some scenarios where it doesn’t work (as we shall see).

So what are we fixing here?

Slow Start

Well for one, BFD takes awfully looooong to bring up the session. Remember BFD starts with sedate timers and then slowly picks up (each side needs to come to an agreement on the rate at which they will send packets). So it takes a while before BFD can really be used for path/end node liveness detection. If BFD is being used to validate an MPLS path/LSP then it will take a few additional seconds for BFD to come up because of the LSP ping bootstrapping procedures (RFC 5884).
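For the curious, the “sedate timers” boil down to the interval rules in RFC 5880. The sketch below is my simplified reading of those rules for asynchronous mode: it ignores transmit jitter, Demand mode and a remote Required Min RX of zero, and the function names are mine, not the RFC’s.

```python
# RFC 5880: while the session is not Up, Desired Min TX must be at least one second.
SLOW_TX_US = 1_000_000


def tx_interval_us(local_desired_min_tx_us: int,
                   remote_required_min_rx_us: int,
                   session_up: bool) -> int:
    """Interval (in microseconds) at which the local system sends control packets."""
    if session_up:
        desired = local_desired_min_tx_us
    else:
        desired = max(local_desired_min_tx_us, SLOW_TX_US)
    # Never send faster than the peer says it is willing to receive.
    return max(desired, remote_required_min_rx_us)


def detection_time_us(remote_detect_mult: int,
                      local_required_min_rx_us: int,
                      remote_desired_min_tx_us: int) -> int:
    """Time without received packets after which the local system declares the session down."""
    return remote_detect_mult * max(local_required_min_rx_us, remote_desired_min_tx_us)
```

So two routers configured for 50 ms intervals still exchange packets only once a second until the three-way handshake completes, and it is only after the session is Up that they get down to the sub-second detection times everybody actually wants.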

In certain deployments, this delay is bad and we want to eliminate it. It is expected that some MPLS deployments would require traffic engineered LSPs to be created dynamically, driven by external applications as in Software Defined Networks (SDN). It is operationally critical to ensure that the forwarding paths are up (via BFD) before the applications start utilizing the newly created tunnels. We hence can’t wait for BFD to take its time in coming up, since the applications are ready to push data down the tunnels. So, something needs to be done to get BFD to come up FAST!

This is an issue in SDN domains where a centralized controller is managing and maintaining the dynamic network. Since the tunnels are being engineered by this centralized entity, we want to be really sure that the new tunnel is up before sending traffic down that path. In the absence of additional control protocols (e.g., RSVP) we might want to use BFD to ensure that the path is up before using it. Current BFD, with its long setup times, can become a bottleneck. If the centralized controller can quickly verify the forwarding path, it can steer the traffic to the traffic engineered tunnel very quickly without adversely affecting the service.

The problem gets worse as the scale of the network and the number of traffic engineered tunnels increase.

Unidirectional Forwarding Path Validation

The “B” in BFD, stands for “Bi-directional” (in case you missed that). The protocol was originally defined to verify bidirectional connectivity between two nodes. This means that when you run BFD between routers A and B, then both A and B come to know when either router goes down (or when something nasty happens to the link). However, there are many scenarios where only one of the routers is interested in verifying the data plane continuity between the two nodes (e.g., static route using BFD to validate reachability to the next-hop router OR a Unidirectional tunnel using BFD to validate reachability to the egress node). In such cases, validating the reverse direction is not required.

However, traditional BFD requires the other side to maintain the entire BFD state even if it’s not interested in the liveness of the remote end. So if you have “n” routers using a particular gateway, then the gateway has to maintain “n” BFD sessions, one with each of its clients. This is not required and can easily be done away with.

Anycast Addresses

Anycast addressing is used for high availability, fast recovery, load balancing and dispersed deployments where the IGPs direct the traffic to the nearest server(s) within a group of potential servers, all sharing the same Anycast address. BFD as defined today is stateful, and hence cannot work with Anycast addresses.

With the growing need to use Anycast addresses for higher reliability (DNS, multicast, 6to4, etc) there is a need for a BFD variant that can work with Anycast addresses.

BFD Fault Isolation

BFD works in a binary state: it either tells you that the session is UP or that it’s DOWN. In case of failures it doesn’t help you identify and localize the fault. Using other tools to isolate the fault may not necessarily work as the OAM packets may not follow the exact same path as the BFD packets (e.g., when ECMP is employed).

There is hence a need for a BFD variant that has some capabilities that can help in fault isolation.

So, where does this lead to?

We have attempted to fix all the issues that I have described above in a new BFD variant that we call “Seamless BFD” (S-BFD). It’s stateless, and the receiver (or the reflector) responds with an S-BFD response packet whenever it receives an S-BFD packet from the source. You can imagine this as a ping-pong game between the source and the destination routers. The source (or the client in S-BFD speak) wants to check whether the path to the destination (or the Reflector in S-BFD speak) is UP, or whether the Reflector itself is UP, and sends an S-BFD “ping” packet. The Reflector, upon receiving this, responds with an S-BFD “Pong” packet. The client, upon receiving the “Pong”, knows that the Reflector is alive and starts using the path.

Each Reflector selects a well-known “Discriminator” that all the other devices in the network know about. This can be statically configured, or a routing protocol can be used to flood/distribute this information. We could use OSPF/IS-IS within an AS and BGP across ASes. Any client that wants to send an S-BFD packet to this Reflector (or a server if it helps) sends the S-BFD packet with the peer’s Discriminator value.

A reflector receiving an S-BFD packet with its own Discriminator value responds with an S-BFD packet. It must NOT transmit any BFD packet based on a local timer expiration.
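The reflector’s side of this ping-pong is small enough to sketch. What follows is a toy model under my own assumptions, not the wire format from the draft: `SbfdPacket` carries only the two discriminator fields, and the reflector keeps no per-client state and no timers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SbfdPacket:
    my_discr: int     # discriminator of the sender
    your_discr: int   # discriminator of the intended receiver


class Reflector:
    """Stateless responder: it answers probes and never sends on a timer."""

    def __init__(self, local_discriminators: Iterable[int]):
        # The well-known discriminator value(s) this node has advertised.
        self.local = frozenset(local_discriminators)

    def handle(self, pkt: SbfdPacket) -> Optional[SbfdPacket]:
        """Return the response for a probe, or None if it is not addressed to us."""
        if pkt.your_discr not in self.local:
            return None
        # Swap the discriminators so that the client can match the response.
        return SbfdPacket(my_discr=pkt.your_discr, your_discr=pkt.my_discr)
```

Notice that `handle` reads nothing except the packet it was given and the advertised discriminators, which is exactly why a gateway with “n” clients no longer needs “n” sessions’ worth of state.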

A router can also advertise more than one Discriminator value for others to use. In such cases it should accept all S-BFD packets addressed to any of those Discriminator values. Why would somebody do that?

You could, if you want to implement some sort of redundancy. A node could choose to terminate S-BFD packets with different Discriminator values on different line cards for load distribution (this works for architectures where a BFD controller in HW resides on a line card). Two nodes can now have multiple S-BFD sessions between them (similar to the micro-BFD sessions that we have defined for the LAG in RFC 7130), where each terminates on a different line card (demuxed using different Discriminator values). The aggregate BFD session will only go down when all the component S-BFD sessions go down. Hence the aggregate BFD session between the two nodes will remain alive as long as there is at least one component S-BFD session alive. This is another use case that can be added to S-BFD btw!

This helps in the SDN environments where you want to verify the forwarding path before actually using it. With S-BFD you no longer need to wait for the session to come up. The centralized controller can quickly use S-BFD to determine if the path is up. If the originating node receives an S-BFD response from the destination then it knows that the end point is alive and this information can be passed to the controller.

Similarly, applications in SDN environments can quickly send an S-BFD packet to the destination. If they receive an S-BFD response then they know that the path can be used.

This also alleviates the issue of maintaining redundant BFD session state on the servers since they only need to respond with S-BFD packets.

Authentication becomes a slight challenge since the reflector is not keeping track of the crypto sequence numbers (remember the point was to make it stateless!). However, this isn’t an insurmountable problem and can be fixed.

For more sordid details refer to the IETF draft in the BFD WG which explains the Seamless BFD protocol and another one with the use-cases. I have not covered all use cases for Seamless BFD (S-BFD) and we have a few more described there in the use-case document.

Hub and Spoke Multi-Point LSPs for Scalable VPLS Architecture

Multiprotocol Label Switching (MPLS) and Generalized MPLS (GMPLS) provide a mechanism to set up point-to-multipoint (P2MP) LSPs which carry traffic from one ingress point (the root node) to several egress points (the leaf nodes), thus enabling multicast forwarding in an MPLS domain. However, there is no provision to provide a co-routed path back from the egress points (the leaves) to the ingress node (the root). The only way to do this is by configuring unidirectional point-to-point LSPs from the leaves back to the root node. First, this entails configuring each leaf node with an LSP back to the root, which could be a configuration and management nightmare if the number of leaves is large. Second, it also cannot guarantee a co-routed path back from the leaf to the root, as the process of setting up the unidirectional P2P path is independent of setting up the P2MP path.

This post introduces the concept of a hub-and-spoke multipoint (HSMP) LSP that allows traffic from the root to the leaves via a P2MP LSP and back from the leaves to the root via a unidirectional co-routed LSP. The proposed technique targets one-to-many applications that require reverse one-to-one traffic flow (thus many one-to-one in the reverse direction).

Consider the figure shown below.

PE1 is the ingress router for the HSMP LSP. The egress routers (leaves) are PE2, PE3 and PE4. As can be seen from the figure, PE1 creates a single copy of each packet arriving from the data source. This packet carries the MPLS label value L1. P1 is an ordinary Label Switch Router (LSR) that swaps the incoming label L1 with L2. Let’s now focus on the packet forwarding process on the node P2.

For each packet belonging to the HSMP LSP, P2 makes three copies (just like it’s done for a P2MP LSP), which are sent to PE2, PE3 and PE4 respectively. Packets arrive at P2 with MPLS label L2. As shown in the above figure, P2’s ILM contains an entry for label L2 saying that one copy of the packet should be sent out on interface if1 with label L3, another copy on interface if2 with label L4 and a last copy on interface if3 with label L5. Since the LSR P2 is replicating the MPLS packets, it’s called the branch node.
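The replication step at P2 can be pictured as a lookup in a small table. The sketch below models the ILM entry described above as a plain dictionary; representing labels and interfaces as strings is purely for illustration.

```python
from typing import Dict, List, Tuple

# P2's ILM entry for the downstream (root-to-leaves) direction of the HSMP LSP:
# incoming label -> list of (outgoing interface, outgoing label).
ILM_P2: Dict[str, List[Tuple[str, str]]] = {
    "L2": [("if1", "L3"), ("if2", "L4"), ("if3", "L5")],
}


def forward(ilm: Dict[str, List[Tuple[str, str]]],
            in_label: str,
            payload: bytes) -> List[Tuple[str, str, bytes]]:
    """Return one (interface, label, payload) copy per ILM branch; empty if no entry."""
    return [(out_if, out_label, payload) for out_if, out_label in ilm.get(in_label, [])]
```

A single packet arriving at P2 with label L2 thus leaves as three packets, while an ordinary LSR such as P1 would have exactly one branch in its entry.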

An obvious advantage of this scheme is the bandwidth optimization. If we had used unidirectional P2P LSPs instead of an HSMP LSP, PE1 would have sent three copies of each packet that it had received from the data source, thus congesting the links PE1-P1 and P1-P2.

What sets an HSMP LSP apart from a regular P2MP LSP is the ability in the former to set up a path from the leaves back to the root. P2MP LSPs are unidirectional, so no traffic can flow from the leaf node routers to the ingress head end router along the P2MP LSP. In HSMP LSP, the leaf nodes can also send unidirectional traffic back to the root. This is shown in the figure below.

Because of the mechanisms defined in HSMP LSP, the branch node P2 advertises the same upstream label L1 for a given HSMP LSP to the nodes PE2, PE3 and PE4. It programs its ILM table as shown above, where it simply swaps L1 with L2 for all incoming MPLS packets and sends those towards P1. This way HSMP LSP is also able to provide a path back from the leaf nodes to the root node.

In the last post I had discussed issues that exist in VPLS. Let’s see how HSMP LSP can solve them. I am using the same topology as was used there to demonstrate how HSMP LSP helps.

Figure 3 above shows the same VPLS service that we had discussed in the earlier post.

PE1 knows through some out-of-band mechanism (could be via BGP, RADIUS, manual configs, etc.) that PE2, PE3 and PE4 are the egress nodes that belong to the same VPLS domain. PE1 now needs to establish an HSMP LSP (this can be trivially extended to support a pseudowire) to PE2, PE3 and PE4. The figure shows three of the HSMP LSPs that will be required in this arrangement. The red HSMP LSP has PE1 as the root node and PE2, PE3 and PE4 as the leaf nodes. The green HSMP LSP has been initiated by PE2 and has PE2 as the root, and PE1, PE3 and PE4 as the leaf nodes. The blue HSMP LSP has been initiated by PE3 and has PE3 as the root, and PE1, PE2 and PE4 as the leaf nodes. There is another HSMP LSP that’s required for the VPLS service to function: the one initiated by PE4. It has been omitted from the figure for the sake of clarity.

Thus all PE nodes in a VPLS service need to initiate an HSMP LSP (or a HSMP pseudowire) that terminates at the other PE routers.

In VPLS all BUM (broadcast, unknown unicast and multicast) traffic is flooded to all PE nodes. It’s only the learnt traffic that’s sent unicast by one PE to the other PE.

As explained earlier, a single copy sent by PE1 over an HSMP LSP will reach PE2, PE3 and PE4 (due to its P2MP component). Also the PE routers PE2, PE3 and PE4 can use this HSMP LSP (terminating at them) to send unicast traffic back to PE1.

Thus PE1 sends all BUM traffic on the HSMP LSP it initiates and all learnt unicast over the HSMP LSP that terminates on it.

Going back to figure 3, we can see that PE1 can use the red HSMP LSP to send all BUM traffic. This way it only sends one copy, and all the PE routers receive this packet. If PE1 wants to send learnt unicast traffic back to PE2, it uses the green HSMP LSP that terminates on it. PE1 can use this to send traffic back to PE2, which is the root node for this HSMP LSP. Similarly, PE1 uses the blue HSMP LSP whenever it wants to send learnt unicast traffic back to PE3.

Let’s look at the LSPs that PE2 uses. Whenever it wants to send BUM traffic, it uses the green HSMP LSP that it has originated. Any packet sent over that LSP is received by all the leaf nodes (which in this case happen to be PE1, PE3 and PE4). Like PE1, if it has to send learnt unicast traffic back to PE1, it uses the red HSMP LSP that was originated by PE1 and terminates at PE2.

Thus for a VPLS to fully function, all PE nodes must establish an HSMP LSP with all the other participating PE routers. Each PE can use the optimized HSMP LSP that it originates for the BUM traffic and the HSMP LSPs that the other PEs originate for unicast communication.
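The forwarding decision at each PE therefore reduces to “which PE is the root of the HSMP LSP I should use?”. Here is a minimal sketch of that decision; the MAC table layout and the function name are hypothetical, and real VPLS forwarding naturally involves more (learning, aging, split horizon and so on).

```python
from typing import Dict


def lsp_root_for_frame(local_pe: str, dst_mac: str, mac_table: Dict[str, str]) -> str:
    """Return the root PE of the HSMP LSP on which this frame should be sent.

    mac_table maps learnt destination MACs to the remote PE behind which they live.
    """
    remote_pe = mac_table.get(dst_mac)
    if remote_pe is None:
        # Broadcast, unknown unicast or multicast: send one copy on the
        # HSMP LSP that this PE originates (its P2MP direction).
        return local_pe
    # Learnt unicast: use the HSMP LSP rooted at the remote PE and send
    # the frame upstream, from this leaf towards that root.
    return remote_pe
```

With PE1 as the local node, a frame for a MAC learnt behind PE2 goes up the LSP rooted at PE2 (green in the figure), while a broadcast goes down PE1’s own LSP (red).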

The above table compares a VPLS service using HSMP LSPs with a regular VPLS service or Hierarchical VPLS (H-VPLS) service. Clearly, the former wins against the regular VPLS and H-VPLS on all counts. This may also perhaps be an answer to Juniper’s claim that H-VPLS is not scalable. Operators now need not implement H-VPLS; they can instead go in for VPLS services implemented using HSMP LSPs.

This post only briefly explains the idea behind HSMP LSPs. It’s explained in detail in this draft.