Does Hierarchical VPLS solve all Scaling issues found in VPLS?

Virtual Private Lan Service (VPLS) is a multipoint-to-multipoint ethernet bridging service over an IP/MPLS backbone and is used for connecting geographically separated customer sites by emulating a LAN segment. This post assumes that the reader is familiar with the basic concepts of VPLS and IP/MPLS and does not attempt explain them.

VPLS requires a logical full mesh of all the participating Provider Edge (PE) routers since its emulating a LAN. This means that every PE router is connected to every other PE router by a pseudowire resulting in a full mesh infrastructure. VPLS solves the loop problem by using a split-horizon rule which states that member PE routers (PE1, PE2, PE3 and PE4 in the below Figure) must forward VPLS traffic only to the local attachment circuits when they receive the traffic from the other PE routers. Exchanging traffic learnt from one remote PE router to the other is not allowed. This prevents loops and also eliminates the need to run STP in the VPLS core network.

Consider a Video broadcast service (implemented as a VPLS service)  that uses a video codec and offers 200 channels of standard-definition content. Assuming around 2 Mbps for each channel it would require 400 Mbps of total standard-definition traffic. Add another 50 HDTV channels, each consuming around 10Mbps, the total network bandwidth approaches 1Gbps.

Assume that the source of the Video transmission falls behind PE1. In that case, PE1 needs to send 1Gbps worth of Video traffic to PE2, PE3 and PE4 respectively.

Figure 2 shows the physical topology where PE1 is connected to the other PE routers via LSRs P1 and P2. Since PE1 is in a full logical mesh with PE2, PE3 and PE4, it means that PE1 needs to replicate all BUM (Broadcast, Unknown Unicast and Multicast) traffic three times, so that each PE can receive a copy.

Thus PE1 needs to make three copies of all the Video traffic that it receives which means that the link PE1 – P1 and P1-P2 will be now carrying 3Gbps. The amount of traffic that these links will carry will increase linearly as the number of PEs that PE1 is in full mesh with. Clearly this is NOT scalable!

So, whats the solution?

The solution apparently lies in using Hierarchical VPLS, also popularly known as H-VPLS.

The original VPLS architecture requires all PEs to be in a full mesh. This however may not always be practical if the number of PE routers is too high. Provisioning a full meshed network may also in some instances not be an efficient network design, as was illustrated in the topology in Figure 2.

To fix these issues H-VPLS architecture introduces the concept of spoke-pseudowires. Unlike mesh-pseudowires, that are used in regular VPLS, spoke-pseudowires can exchange traffic with other pseudowires (both mesh and spoke), so they can relay traffic between PE routers.  Let us see how we can re-design the above network using H-VPLS.

To illustrate this, the figure above shows two H-VPLS architectures that can be used to break the logical full mesh that is required in the regular VPLS.

In the first there is a logical connection between PE1-PE2, PE2 – PE4 and PE1 – PE3. There is thus a VPLS service defined on PE1 which has just two spoke pseudowires and a connection to the local attachment circuit which is the source of the Video traffic. The first pseudowire connects PE1 to PE2 and the other connects it to PE3.  There is no pseudowire connecting PE1 to PE4.

In the other design, PE1, PE2, PE3 and PE4 are all connected in a ring. In this case PE1 is connected to PE2 and PE3 using spoke pseudowires; PE2 to PE1 and PE4; PE4 to PE3 and PE2 and PE3 to PE1 and PE4.

While the two examples that i have taken dont have a mesh pseudowire, there is nothing that precludes that from happening. The true potential of HVPLS is only exploited when the network is designed using a combination of spoke and mesh pseudowires.

The split-horizon rule “Do not relay traffic among mesh-pseudowires” is used to prevent forwarding loops in H-VPLS networks. The mesh pseudowires dont exchange traffic as in the regular VPLS architecture. The spoke pseudowires otoh do not obey the split-horizon rule – thus traffic arriving on a spoke pseudowire is forwarded to the other spokes, meshes and local attachment circuits if any. This requires the provider to run Spanning Tree Protocol (STP) in the core to keep it loop-free, something that the providers dont feel to happy about given the high convergence values of STP. Since the traffic can loop providers need to be extremely careful when planning where the spokes and mesh pseudowires are placed.

While H-VPLS solves the problems seen in VPLS, it introduces a few of its own.

Lets look at each design and see how ..

In the first H-VPLS design (as shown above) the links PE1-P1 and P1-P2 are still carrying two copies of each BUM traffic, thus we have not gained significantly from the original VPLS design in terms of saving the bandwidth efficiency.

The second problem is that PE4 only gets the packets after they are relayed by PE2. If you go back to Figure 2, you will see that there is no physical connection between PE2 and PE4. Thus all packets go back to P2 before they reach P4, thus congesting this link. Its also introduces a single point of failure, where PE4 can get completely disconnected from the rest of the PEs if PE2 goes down. There is thus a huge amount of redundancy planned in H-VPLS, which can result in loops and thus using STP becomes extremely important here.

The third issue, and as per some critics, the biggest problem with H-VPLS is that PE2 now has to learn all the customer MACs that fall behind PE1 and P4. This is because all traffic from PE1 is terminated at PE2 and relayed to PE4. Similarly, all traffic coming from PE4 gets terminated at PE2 and is then forwarded to PE1. During this process PE2 has to learn all these MACs. This is primarily because H-VPLS implements the Hub-and-Spoke architecture in the data plane as against Route Reflectors in BGP that do it in the control plane.

Lets now look at the other H-VPLS design where all the PE routers are in a ring.

Clearly we need to run STP to break the forwarding loop. Assume that STP puts the spoke pseudowire connecting PE1 – PE3 in a blocked state. In this case PE1 only has one pseudowire (PE1 – PE2) to send traffic to. This solves the bandwidth wastage problem on the links PE1-P1 and P1-P2 as they only carry one copy of the BUM traffic.

This however, doubles the traffic on the links PE2-P2 and P2-PE4 as each packet is first relayed by PE2 to PE4 and then later by PE4 to PE3. So, while we had saved bandwidth on some links, we ended up wasting it somewhere else!

Like before, PE2 and PE4 have to unnecessarily learn all the customer MACs exchanged between learnt traffic between PE1 and PE3.

Another issue is that the learnt traffic between PE1 and PE3 cannot pick up the most optimal IGP path since it has to necessarily get routed via PE2 and PE4. This is a direct consequence of H-VPLS implementing the Hub-Spoke architecture in the data plane.

Its thus easy to sum up the issues that exist in VPLS and H-VPLS.

(1) Bandwidth is wasted since traffic for different pseudowires is replicated on a shared physical path. This is a big issue as more and more multicast video traffic is sent using VPLS services.

(2) A full mesh of PE routers can result in a significant amount of control traffic.

(3) If one uses H-VPLS then operators need to ensure loop prevention and detection. This entails running STP which is not the most attractive choice.

(4) Learnt traffic may not follow the best and the most optimal path since it has to get relayed via multiple PE routers.

(5) Unnecessary MAC learning happens on all PE routers participating in H-VPLS.

So, do we have a solution to the above problems? Thankfully we do!

I have written a draft with Lizhong Jin from ZTE and Frederic Jounay from France Telecom where we have introduced a concept of a Hub-and-Spoke Multipoint LSPs. This can be trivially extended to a Hub-and-Spoke Multipoint Pseudowires which can solve the issues described above that exist in the regular VPLS and the H-VPLS architectures. More detail about Hub-and-Spoke Multipoint LSPs and how they solve all the issues described above in the next post!

Database Granularity in OSPF vs IS-IS ..

This post compares how the two link state protocols hold their routing information in their databases as this affects their behavior in how they flood/distribute the change of routing information and the internal implementation complexity.

OSPF

o Organization of Routing Information

OSPF encodes the routing information into small chunks, which it calls Link State Advertisement (LSA). Each LSA has its own 20-byte header in order to be identified uniquely. This header is called the LSA Header. There is no limitation on the size of a LSA, though the actual LSA size is limited by IP packet size limitation: 65,535 bytes minus the LSA Header size and IP packet header size. The database access in OSPF is per LSA basis.

In OSPF routing, the information within an area is described by type 1 and type 2 LSAs (known as Router-LSA and Network-LSA respectively). These LSAs can become big depending upon the number of adjacencies to be advertised and prefixes to be carried inside an area. In other words, the routing information with respect to a single node (either router or network node) is encoded inside a single LSA. On the other hand, each inter-area or external prefix is advertised in a separate LSA (AS-External LSA).

An OSPFv2 router may originate only one Router-LSA for itself, while in OSPFv3, a router is allowed to originate multiple Router-LSAs. A router may originate a Network-LSA for each IP subnet on which the router acts as a designated router (DR). A router may originate one LSA for each inter-area and external prefix, with no limitations on the number of LSAs that it may originate.

Consequences

Originating a new and a unique LSA for each inter-area route and an external prefix implies that there is a LSA Header overhead involved while the information is kept in the database or is flooded to the neighbors. There is thus some extra memory and bandwidth consumed in total.

o Carrying Routing Information

LSAs are carried in Link State Update packets (called LS Updates or LSUs). Each LS Update packet has its own header, consists of a 24 byte OSPF protocol header, and a 4-bytes field indicating the number of LSAs contained in the packet. Thus multiple LSAs can be packed into a single LS Update packet. Some implementations may not do this as its considered difficult achieving this during flooding.

Consequences

In the face of network changes, OSPF floods only the updated LSAs. Therefore, even if an implementation does not pack multiple LSAs into a single LS Update packet (and so bandwidth is consumed by LS Update header for each update of a single LSA), the bandwidth consumption for each network change can be considered adequately small.

IS-IS

o Organization of the Routing Information

In IS-IS, protocol packets are called Protocol Data Units or PDUs. IS-IS encodes the link state information into the set of TLVs and packs these TLVs into one or more Link State PDUs (LSPs). The size limit of a LSP is configurable. The Routing database consists of these PDUs and the access to the database is per PDU basis. The original IS-IS specification places an upper bound on the number of LSPs a router can originate to 255. There are however techniques which enable a router to originate more than 255 LSPs, by using multiple system-id’s for itself.

Consequences

Since routing information in IS-IS for each router is packed in fewer LSPs, the memory consumed for bookkeeping of the routing data within the database is less and is more efficient.

o Carrying Routing Information

Each LSP is flooded independently, without being modified all the way from the originator through the routers till the very end. This results in all the routers having the same LSPs as that originated by the first router.

Consequences

Since LSPs are not modified in any way and are not allowed to be fragmented, in order to be flooded successfully over all links existing in the IS-IS network, great care must be ensured when configuring the size limit of LSP that routers can originate and receive.

If the size limit of the LSP is set without taking into account the minimum value of the MTUs throughout the network, or if the size limit of LSPs conflict among some the routers in the network, the database synchronization may not be achieved, and this can result in routing loops and/or blackholes.

When a change occurs to a LSP, the whole LSP needs to be flooded, and therefore the bandwidth usage can be non-optimal. There is however a solution which exists in theory. If an implementation finds some of the entities to be flapping, then they may be packed into smaller LSPs or may be isolated from the other stable entities. This way one needs to only advertise the unstable LSP/LSPs. I have not btw come across any implementation that does that. Leave a comment if you know one that does this!

Database granularity also affects when two routers need to synchronize their databases. In OSPF, because of its high database granularity there are a lot of items which it needs to synchronize and that process is somewhat complicated with a lot of DBD packets being exchanged back and forth. This gets worse if the router trying to sync is being inundated with a lot of other data traffic also. This is not much of an issue these days as any router worth its salt would prioritize the OSPF control packets.

This is however much simpler in case of IS-IS and there isn’t any finite state machine that the neighbors need to go through to synchronize their databases. It just uses it regular flooding mechanism (a couple of CSNPs describe their entire topology information) to exchange its entire database. You plug in the new IS-IS router and before you realize the router is already sync’ed up with all the other IS-IS routers in the network!