BGP MPLS based Ethernet VPN

Introduction

BGP MPLS based Ethernet VPN (E-VPN) is described in this draft document. The technology introduces the concept of routing MAC addresses using BGP (MP-BGP, to be precise) over an MPLS core. Conceptually, it is very similar to Layer 3 MPLS VPN. Done right, it can solve the problem of stretching Layer 2 networks without VPLS. With some proprietary enhancements, Juniper is already using this technology in their revolutionary QFabric product. Cisco's focus is on the enhanced version of E-VPN, called PBB-EVPN (more on that in another article).

EVPN provides benefits that VPLS either cannot provide or can provide only with difficulty. These include dual-active multi-homed edge devices, load balancing across dual-active links, MAC address mobility, and multi-tenancy. These will be covered later, but first some terminology.

Terminology

The draft introduces new terms. The definitions from the draft are -

CE - Customer Edge device e.g., host or router or switch.

EVPN Instance (EVI) - An EVPN routing and forwarding instance on a PE.

Ethernet Segment Identifier (ESI) - If a CE is multi-homed to two or more PEs, the set of Ethernet links that attaches the CE to the PEs is an ’Ethernet segment’. Ethernet segments MUST have a unique non-zero identifier, the ’Ethernet Segment Identifier’.

Ethernet Tag - An Ethernet Tag identifies a particular broadcast domain, e.g., a VLAN. An E-VPN instance consists of one or more broadcast domains. Ethernet tag(s) are assigned to the broadcast domains of a given E-VPN instance by the provider of that E-VPN, and each PE in that E-VPN instance performs a mapping between broadcast domain identifier(s) understood by each of its attached CEs and the corresponding Ethernet tag.

Overview

Figure 1 is a very simple diagram of EVPN components. An EVI is configured on the PE routers, for example one per tenant or customer, and the CE devices attach to it. Each EVI has a unique RD and one or more RTs. The CE device can be a host, a switch or a router. If a CE device is multi-homed to two or more PEs, the set of Ethernet links constitutes an Ethernet Segment. Each Ethernet Segment is identified by a unique Ethernet Segment Identifier (ESI), which is a 10-byte value. A CE device attached to a single PE is said to be attached to an Ethernet Segment with ESI=0. The draft provides multiple ways to assign the ESI: manual configuration, LLDP, LACP, or Spanning Tree Protocol (STP).
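As a minimal sketch of the identifier described above, the ESI can be modelled as a 10-byte value with the all-zero value meaning a single-homed attachment. The class name and the colon-separated string form are my own choices for illustration, not something the draft defines.

```python
from dataclasses import dataclass

ESI_LENGTH = 10  # the ESI is a 10-byte value


@dataclass(frozen=True)
class EthernetSegmentId:
    """Illustrative wrapper around a 10-byte ESI (hypothetical name)."""
    value: bytes

    def __post_init__(self):
        if len(self.value) != ESI_LENGTH:
            raise ValueError(f"ESI must be {ESI_LENGTH} bytes, got {len(self.value)}")

    @property
    def is_single_homed(self) -> bool:
        # ESI=0 denotes a CE attached to a single PE.
        return self.value == bytes(ESI_LENGTH)

    def __str__(self) -> str:
        # Display format is an assumption made for readability only.
        return ":".join(f"{b:02x}" for b in self.value)


single = EthernetSegmentId(bytes(ESI_LENGTH))
multi = EthernetSegmentId(bytes([0, 0, 0, 0, 0, 0, 0, 0, 0, 10]))
print(single.is_single_homed, multi.is_single_homed, multi)
```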

The draft provides the flexibility to connect PEs using either IP/GRE encapsulation or MPLS. One of the enhancements that EVPN brings over VPLS is that MAC learning between PEs is performed in the control plane using MP-BGP, rather than in the data plane. This provides great control, as PEs can choose which MAC addresses to learn based on the RTs configured. The PEs advertise the MAC addresses, each with an MPLS label, to remote PEs using MP-BGP. MAC address learning from the CE is still done in the data plane by the PE routers.

Ethernet Tag

It is important to understand what an Ethernet Tag is and whether it is advertised to remote PEs in MP-BGP. An Ethernet Tag identifies a broadcast domain, for example a VLAN, in an EVI.

For a given EVI, each PE router performs a mapping between an Ethernet Tag and the corresponding broadcast domain identifier (i.e. VLAN ID). The PE and CE interfaces can be configured in several ways -

1. CE interface is simple Ethernet or 802.1q and PE interface is configured as simple Ethernet interface - in both cases, the PE routers set Ethernet Tag=0 in BGP routes and the MPLS traffic carries the CE device traffic as is (without Preamble and FCS).

2. CE and PE interfaces are configured as 802.1q (or 802.1ad) interfaces in a single bridge domain - in this case also, the PE routers set Ethernet Tag=0 in BGP routes and the MPLS traffic carries the CE device traffic as is (without Preamble and FCS). VLAN ID translation is not allowed as the PE router will not be able to identify the outgoing interface.

3. CE and PE interfaces are configured as 802.1q (or 802.1ad) interfaces in multiple bridge domains - in this case, the PE routers set the derived Ethernet Tag and advertise in BGP, and the MPLS traffic carries the VLAN ID.
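The three cases above can be sketched as a small function that returns the Ethernet Tag a PE would place in its BGP routes. The enum names and the VLAN-to-tag mapping table are hypothetical, and the example tag value is arbitrary.

```python
from enum import Enum
from typing import Dict, Optional


class AttachmentMode(Enum):
    # Hypothetical names for the three cases listed above.
    PLAIN_ETHERNET = 1        # case 1: PE interface is simple Ethernet
    SINGLE_BRIDGE_DOMAIN = 2  # case 2: 802.1q/802.1ad, one bridge domain
    MULTI_BRIDGE_DOMAIN = 3   # case 3: 802.1q/802.1ad, several bridge domains


def advertised_ethernet_tag(mode: AttachmentMode,
                            vlan_to_tag: Dict[int, int],
                            vlan_id: Optional[int] = None) -> int:
    """Return the Ethernet Tag advertised in BGP for a frame's broadcast domain."""
    if mode in (AttachmentMode.PLAIN_ETHERNET, AttachmentMode.SINGLE_BRIDGE_DOMAIN):
        # Cases 1 and 2: the tag is always 0 and the frame is carried as is.
        return 0
    if vlan_id is None:
        raise ValueError("a VLAN ID is needed when there are multiple bridge domains")
    # Case 3: the tag is derived from the PE's per-EVI mapping.
    return vlan_to_tag[vlan_id]


mapping = {100: 5000}  # CE VLAN 100 -> Ethernet Tag 5000 (arbitrary example)
print(advertised_ethernet_tag(AttachmentMode.PLAIN_ETHERNET, mapping))
print(advertised_ethernet_tag(AttachmentMode.MULTI_BRIDGE_DOMAIN, mapping, 100))
```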

Next, I will cover the features that EVPN provides and their procedures.

Multi-Homing

As shown in figure 1, both CE devices are multi-homed to two PE routers. As per the draft, little or no configuration is required on the PE routers for them to discover that they are connected to the same Ethernet Segment. Discovery is done by exchanging BGP Ethernet Segment routes. Every EVPN-related BGP message is carried as a new EVPN NLRI in TLV format. Figure 2 shows the new EVPN Ethernet Segment route type and the new ES-Import extended community.

Each PE router connected to an Ethernet Segment advertises a BGP Ethernet Segment (ES) route that consists of an ESI (for single-homed connections, ESI=0) and an ES-Import extended community. For example, in figure 1, PE1 and PE2 will each advertise an ES route with the ES-Import extended community (along with other extended communities such as Route-Target). Both routers also construct an import filter based on the ES-Import extended community. As a result, only these two PE routers import the ES route, and each identifies that the other is connected to the same Ethernet Segment.
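Here is a rough sketch of that import filter. I treat the ES-Import extended community as an opaque string and assume that PEs attached to the same segment end up with the same value. How the value is actually derived is not covered here, and the class names, addresses and "es-import:10" string are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EsRoute:
    originator: str  # IP address of the advertising PE
    esi: int         # simplified: ESI as a small integer
    es_import: str   # ES-Import extended community, treated as opaque


class Pe:
    def __init__(self, ip, local_segments):
        self.ip = ip
        self.local_segments = dict(local_segments)  # ESI -> ES-Import value
        self.segment_peers = {}                     # ESI -> set of peer PE addresses

    def originate_es_routes(self):
        return [EsRoute(self.ip, esi, es_import)
                for esi, es_import in self.local_segments.items()]

    def receive(self, route):
        """Import the ES route only if its ES-Import value matches a local segment."""
        if route.es_import not in self.local_segments.values():
            return False
        self.segment_peers.setdefault(route.esi, set()).add(route.originator)
        return True


pe1 = Pe("192.0.2.1", {10: "es-import:10"})
pe2 = Pe("192.0.2.2", {10: "es-import:10"})
pe3 = Pe("192.0.2.3", {})  # not attached to ESI=10

for route in pe2.originate_es_routes():
    print("PE1 imports:", pe1.receive(route))
    print("PE3 imports:", pe3.receive(route))
```

PE1 ends up knowing that PE2 shares ESI=10 with it, while PE3 filters the route out and learns nothing.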

Fast Convergence

When an Ethernet Segment fails or becomes unavailable, EVPN provides a mechanism, using new route types, to signal to remote PEs that they need to update their forwarding tables. The draft introduces a new Ethernet Auto-Discovery (A-D) route type that each PE router advertises per segment to all remote PE routers. Upon failure, the PE router withdraws the corresponding Ethernet A-D route. This triggers all remote PE routers to update their forwarding tables for all MAC addresses associated with that Ethernet Segment. If a backup PE is available for the same Ethernet Segment, the remote PEs update the next-hop to point to it.
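To make the withdraw behaviour concrete, here is a minimal sketch of a remote PE's tables. The class and method names are hypothetical, and picking the backup with min() is an arbitrary but deterministic choice of mine, not something the draft specifies.

```python
class RemotePeTable:
    """Forwarding state kept by a remote PE (illustrative only)."""

    def __init__(self):
        self.ad_routes = {}  # ESI -> set of PEs advertising an Ethernet A-D route
        self.macs = {}       # MAC -> (ESI, next-hop PE)

    def add_ad_route(self, esi, pe):
        self.ad_routes.setdefault(esi, set()).add(pe)

    def learn_mac(self, mac, esi, pe):
        self.macs[mac] = (esi, pe)

    def withdraw_ad_route(self, esi, pe):
        """One withdraw updates every MAC on the segment, with no per-MAC messages."""
        remaining = self.ad_routes.get(esi, set())
        remaining.discard(pe)
        backup = min(remaining) if remaining else None
        for mac, (mac_esi, next_hop) in list(self.macs.items()):
            if mac_esi != esi or next_hop != pe:
                continue
            if backup is None:
                del self.macs[mac]             # no other PE reaches this segment
            else:
                self.macs[mac] = (esi, backup)  # repoint to the backup PE


table = RemotePeTable()
table.add_ad_route(10, "PE1")
table.add_ad_route(10, "PE2")
table.learn_mac("aa:aa:aa:aa:aa:aa", 10, "PE1")
table.withdraw_ad_route(10, "PE1")
print(table.macs)
```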

Note: An ESI can span across one or more EVIs.

Figure 3 shows the new Ethernet A-D route type and the new ESI MPLS Label extended community. Each PE router advertises the Ethernet A-D route for a particular ESI with Ethernet Tag=0 and MPLS Label=0. The ESI MPLS Label extended community is also included with the route. The last 2 bits of the flags are -

R-L for Root/Leaf

A-S for Active/Standby. If dual-active multi-homing is desired, this flag is set to 0 and a valid MPLS label value is set.

Note: Ethernet A-D route is not advertised for ESI=0.

Also, all Route-Targets associated with the ESI must be included with the route.

Split Horizon

Referring to figure 1, if a CE device sends Broadcast, Unknown Unicast or Multicast traffic (aka BUM traffic), the PE devices must not forward it back to the same CE device. This is referred to as Split Horizon. To achieve Split Horizon, every BUM packet must be encapsulated with an MPLS label that identifies the Ethernet Segment it came from. This MPLS label is advertised by the PE routers in the ESI MPLS Label extended community of the per-Ethernet-Segment Ethernet A-D route. The egress PE relies on the ESI MPLS label to determine whether or not to forward the BUM packet over a specific Ethernet Segment.

The assignment of ESI MPLS label is dependent on the type of tunnel that will be used to deliver BUM packet (data-plane forwarding).

Split Horizon procedure is followed based on ESI, not EVI. This is the key difference between Split Horizon and "Handling of Multi-destination traffic" procedure which will be discussed later.

1. P2P or MP2P LSPs

A PE router receiving multi-destination traffic must forward it to all (or a subset of) remote PEs. To achieve this, the PEs advertise a new Inclusive Multicast Ethernet Tag route type. The frame format of Inclusive Multicast Ethernet Tag route is shown in figure 5.

Consider figure 4 - each PE router distributes an Inclusive Multicast Ethernet Tag route in the associated EVI to the other PE routers. Now, if PE1 receives BUM traffic from the CE1 device, it encapsulates this traffic with the MPLS labels shown in figure 6 before sending it to PE2, for example (PE3 and PE4 are single-homed to their CE devices and will not advertise Ethernet A-D routes). PE1 first pushes the ESI MPLS label received from PE2. On top of that, it pushes the MPLS label from the Inclusive Multicast Ethernet Tag route received from PE2. Finally, the top label is the transport label for the P2P LSP.

After PHP, when PE2 receives the MPLS-encapsulated traffic, it identifies the ESI(s) from the top label. If the bottom label is the ESI MPLS label that PE2 advertised for ESI=10, then PE2 does not forward the packet onto that Ethernet Segment.
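The label stack and the egress check for the P2P/MP2P case can be sketched as follows. The label values are made up and the function names are hypothetical. I also assume that a frame from a single-homed source simply carries no ESI label.

```python
def build_bum_label_stack(transport_label, inclusive_mcast_label, esi_label):
    """Label stack pushed by the ingress PE, top of stack first."""
    return [transport_label, inclusive_mcast_label, esi_label]


def flood_segments(stack_after_php, advertised_esi_labels, local_segments):
    """Return the local Ethernet Segments the egress PE floods the frame onto.

    advertised_esi_labels maps ESI MPLS label -> ESI, as advertised by this PE.
    """
    # After PHP the top label is the Inclusive Multicast label; the ESI label,
    # if present, sits at the bottom of the stack.
    esi_label = stack_after_php[1] if len(stack_after_php) > 1 else None
    source_esi = advertised_esi_labels.get(esi_label)
    # Split Horizon: skip the segment the frame originally came from.
    return [esi for esi in local_segments if esi != source_esi]


# PE1 builds the stack using labels it learned from PE2 (values are arbitrary).
stack = build_bum_label_stack(transport_label=1000,
                              inclusive_mcast_label=2000,
                              esi_label=3000)
after_php = stack[1:]  # the penultimate hop pops the transport label

# PE2 advertised label 3000 for ESI=10 and also has a local segment ESI=30.
print(flood_segments(after_php, {3000: 10}, [10, 30]))
```

In this example PE2 floods the frame onto ESI=30 only, because the bottom label tells it the frame entered the network on ESI=10.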

2. P2MP LSPs

With P2MP LSPs, the Leaf nodes initiate the LSPs towards the Root node. The Root node assigns the labels. Hence, the label is upstream assigned.

Consider figure 7 - PE1 is the Root node and the other PE routers are Leaf nodes. PE1 advertises the ESI MPLS label to all remote PE routers in the Ethernet A-D route. When PE1 receives BUM traffic from the CE1 device, it pushes the ESI MPLS label onto the label stack, and the top label is the transport label for the P2MP LSP. For this case, PHP must be disabled. When PE2 receives this MPLS-encapsulated traffic, it sees the ESI MPLS label that PE1 assigned for ESI=10 and does not forward the packet onto that ESI. PE3 and PE4 also see the ESI MPLS label for ESI=10, but since they have no interfaces connected to ESI=10, they forward the packet to all ESIs for the EVI.

MAC Advertisement

A PE router performs MAC learning in the data plane for packets coming from CE network for a particular EVI. The PE router snoops for DHCP and ARP(IPv4)/ND(IPv6) packets. For CE MAC addresses that are behind other PE routers, the MAC addresses are advertised in BGP NLRI using a new MAC Advertisement route type. This is shown in figure 8 below.

This route type is used to advertise locally learned MAC addresses in BGP to remote PEs. As per the draft, the MAC Addresses can be aggregated and a MAC Prefix can be advertised rather than advertising every single MAC Address. If a MAC Prefix is advertised, the IP Address length field is set to 0 and no IP address is advertised. If an individual MAC address is advertised, the IP address field corresponds to that MAC address. If the PE router sees an ARP Request for an IP address from a CE, and if the PE has the MAC address binding for that IP address, the PE performs ARP Proxy and responds to the ARP Request.
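The ARP Proxy behaviour described above boils down to a lookup in the IP-to-MAC bindings that the PE has learned. A minimal sketch, with hypothetical names and documentation-range addresses:

```python
class ArpProxy:
    """IP-to-MAC bindings a PE has learned locally or via BGP (illustrative)."""

    def __init__(self):
        self.bindings = {}  # IP address -> MAC address

    def learn(self, ip, mac):
        self.bindings[ip] = mac

    def handle_arp_request(self, target_ip):
        """Return the MAC to answer with, or None if the PE has no binding.

        With no binding the PE cannot proxy, so the request has to be
        handled like any other broadcast frame.
        """
        return self.bindings.get(target_ip)


proxy = ArpProxy()
proxy.learn("198.51.100.10", "00:11:22:33:44:55")
print(proxy.handle_arp_request("198.51.100.10"))  # answered locally by the PE
print(proxy.handle_arp_request("198.51.100.99"))  # None: no binding known
```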

The MPLS label field depends on the type of label allocation. The PE router can advertise a single MPLS label for all MAC addresses per EVI. This requires the fewest MPLS labels and saves memory on the PE router, but when forwarding to the CE network, the PE router must perform a MAC lookup, which costs delay and CPU cycles.

Alternatively, a PE router can advertise a unique label per <ESI, Ethernet Tag> combination.

As a third option, a PE router can advertise a unique label for each MAC address.
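The trade-off between the three allocation schemes is easier to see in code. This sketch uses hypothetical mode names and hands out labels from 16 upwards, since MPLS label values 0-15 are reserved.

```python
from itertools import count


class LabelAllocator:
    """Hands out MPLS labels under one of the three schemes described above."""

    MODES = ("per-evi", "per-esi-tag", "per-mac")  # hypothetical mode names

    def __init__(self, mode, first_label=16):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self._next = count(first_label)
        self._labels = {}

    def _key(self, evi, esi, ethernet_tag, mac):
        if self.mode == "per-evi":
            return (evi,)                    # one label for every MAC in the EVI
        if self.mode == "per-esi-tag":
            return (evi, esi, ethernet_tag)  # one label per <ESI, Ethernet Tag>
        return (evi, mac)                    # one label per MAC address

    def label_for(self, evi, esi, ethernet_tag, mac):
        key = self._key(evi, esi, ethernet_tag, mac)
        if key not in self._labels:
            self._labels[key] = next(self._next)
        return self._labels[key]


for mode in LabelAllocator.MODES:
    alloc = LabelAllocator(mode)
    labels = [alloc.label_for("evi-1", 10, 0, mac) for mac in ("mac-a", "mac-b")]
    print(mode, labels)
```

With per-EVI or per-<ESI, Ethernet Tag> allocation both MAC addresses share a label, so the egress PE still needs a MAC lookup. With per-MAC allocation the label alone identifies the destination, at the cost of one label per MAC address.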

Aliasing

Samer Salam clearly describes the requirement of introducing Aliasing in EVPN and why VPLS is not good enough for load balancing across provider edge nodes in a multi-homed setup. Here's the excerpt from his blog post -

"One of these requirements is to provide support for enhanced redundancy with fine-grained (i.e. per flow) load balancing across provider edge nodes in a multi-homed setup. This is of particular relevance in data center interconnect and cloud services, where it is critical to increase the aggregate bisectional bandwidth between the customer network and the provider’s edge for all VLANs that are being extended over the MPLS network. With the existing data-plane MAC learning model, it is not possible to support this requirement because a given MAC address can only be associated with a single pseudowire, and consequently with a single provider edge node. As a result, in the best case, VPLS can support per-VPN load balancing among provider edge nodes."

Aliasing allows a PE router to advertise reachability of a particular ESI to remote PEs even if it has not learnt any MAC addresses over that Ethernet Segment. This is achieved using Ethernet A-D Route per EVI (not per Ethernet Segment).

Note: Ethernet A-D Route type per ES and per EVI are differentiated by the RD and ESI. If RD is of the form IP_ADD:0 and ESI=0, then the route-type is per ES. If the RD is of the form IP_ADD:Unique_Number and ESI!=0, then the route-type is per EVI.

In the case of figure 1, both PE1 and PE2 will advertise reachability of the ESI to remote PE3 and PE4 using the Ethernet A-D route type, for all EVIs that the ESI spans (multiple Route Targets). Also, PE1 will advertise reachability of the MAC address associated with the CE1 device to PE3 and PE4 using the MAC Advertisement route type. If a remote PE needs to forward traffic to CE1, it can use -

- <ESI, Ethernet Tag> information from PE1 and PE2 (Fast Convergence procedure)

- <MAC address, ESI> information from PE1 (MAC Advertisement procedure)

If PE1 and PE2 do not set the A-S flag in the ESI MPLS Label extended community of the Ethernet A-D route, then remote PEs must treat the Ethernet Segment as operating in all-active redundancy mode. Hence, remote PEs can now load balance the traffic that they receive from their local CE devices and that is destined for CE1 across PE1 and PE2.
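Here is a sketch of how a remote PE might combine the two pieces of information to load balance. The function names are hypothetical, and hashing the source and destination MAC with CRC32 is only my stand-in for whatever flow hash a real implementation uses.

```python
import zlib


def candidate_next_hops(mac_route_pe, esi, ad_routes):
    """Return the PEs that traffic for a MAC on this ESI can be sent to.

    ad_routes maps ESI -> {PE name: A-S flag set?}, built from Ethernet A-D routes.
    """
    advertisers = ad_routes.get(esi, {})
    if not advertisers or any(advertisers.values()):
        # No aliasing information, or Active/Standby was signalled:
        # use only the PE that advertised the MAC.
        return [mac_route_pe]
    # All-active: every PE attached to the segment is a valid next hop,
    # even those that never advertised this MAC.
    return sorted(advertisers)


def pick_next_hop(src_mac, dst_mac, next_hops):
    """Pick one next hop per flow so that a given flow always takes the same path."""
    key = f"{src_mac}|{dst_mac}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


ad_routes = {10: {"PE1": False, "PE2": False}}
hops = candidate_next_hops("PE1", 10, ad_routes)
print(hops)
print(pick_next_hop("00:00:00:00:00:01", "00:00:00:00:00:0a", hops))
```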

When PE1 or PE2 receives traffic from remote PEs destined for CE1, a unicast packet is forwarded to CE1. If the packet is a BUM packet, only one of PE1 and PE2 must forward it to CE1. Which PE forwards the packet is decided by DF Election, which is discussed next.

Designated Forwarder (DF) Election

If a CE device is multi-homed to two or more PE routers, one of the PEs is elected a Designated Forwarder (DF). Only a DF forwards a multi-destination (BUM) packet to the Ethernet Segment towards CE; other PEs drop the BUM packet.

DF election is based on the information received in Ethernet Segment routes. Each PE router builds an ordered list of Ethernet Segment routes based on the originating PE's IP address. The PE router with the highest IP address is elected the DF, and the PE router with the next highest IP address is elected the Backup DF (BDF). The BDF takes over in case of DF failure. A DF/BDF is elected for each Ethernet Segment.
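The election rule as described above fits in a few lines. This sketch compares the addresses numerically rather than as strings, and assumes all originators use the same address family. The function name is hypothetical.

```python
import ipaddress


def elect_df(originator_ips):
    """Return (DF, BDF) for one Ethernet Segment from the ES route originators."""
    ordered = sorted({ipaddress.ip_address(ip) for ip in originator_ips},
                     reverse=True)
    if not ordered:
        return None, None
    df = str(ordered[0])                               # highest IP address wins
    bdf = str(ordered[1]) if len(ordered) > 1 else None  # next highest is backup
    return df, bdf


print(elect_df(["192.0.2.9", "192.0.2.10"]))
```

Note that a plain string sort would rank 192.0.2.9 above 192.0.2.10, which is why the addresses are parsed first.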

MAC Mobility

Consider figure 9 where a Virtual Machine with MAC A is present on a server connected to PE1 and PE2 routers via ESI=10. When PE1 and PE2 learn MAC A, one of them advertises this MAC address in BGP to remote PEs. A server administrator then performs vMotion and relocates the VM to a server connected to PE3 and PE4 routers via ESI=20. When PE3 or PE4 learns MAC A in the data plane over ESI=20, one of them will advertise the MAC in BGP to PE1 and PE2 using the MAC Advertisement route type and a new MAC Mobility extended community. The MAC Mobility extended community contains a Sequence Number, which indicates how recent the MAC advertisement is.

Since PE1 and PE2 cannot detect whether MAC A has moved to another Ethernet Segment, the reception of new MAC Advertisement with MAC Mobility extended community acts as a trigger for these PEs to perform MAC Withdraw.
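The withdraw trigger can be sketched from the point of view of the PE that originally advertised the MAC. The names are hypothetical, and I assume here that an advertisement sent without the MAC Mobility extended community counts as sequence number 0.

```python
class OriginatingPe:
    """Tracks the MACs a PE has advertised itself (illustrative only)."""

    def __init__(self):
        self.local_macs = {}  # MAC -> sequence number of our own advertisement
        self.withdrawn = []   # MACs for which we sent a MAC Withdraw

    def advertise(self, mac, seq=0):
        self.local_macs[mac] = seq

    def on_remote_advertisement(self, mac, seq):
        """Withdraw our route if a remote PE advertises the same MAC with a newer sequence."""
        local_seq = self.local_macs.get(mac)
        if local_seq is None or seq <= local_seq:
            return False  # not ours, or the remote route is not newer
        del self.local_macs[mac]
        self.withdrawn.append(mac)
        return True


pe1 = OriginatingPe()
pe1.advertise("MAC-A")                         # VM first seen behind ESI=10
print(pe1.on_remote_advertisement("MAC-A", 1))  # PE3 advertises after the move
print(pe1.withdrawn)
```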

Conclusion

EVPN seems like a great new technology that provides more benefits than traditional VPLS. Data Center Interconnect (DCI) is definitely an application area for it, but weaving a Layer 3 fabric within the data center is also a very interesting concept.