--- 1/draft-ietf-bess-evpn-unequal-lb-01.txt 2019-07-22 10:13:05.767566565 -0700
+++ 2/draft-ietf-bess-evpn-unequal-lb-02.txt 2019-07-22 10:13:05.807567575 -0700
@@ -8,24 +8,24 @@
 J. Rabadan
 Nokia
 J. Drake
 Juniper
 A. Lingala
 AT&T
-Expires: Sept 26, 2019                                  March 25, 2019
+Expires: Jan 23, 2020                                    July 22, 2019

    Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
-   draft-ietf-bess-evpn-unequal-lb-01
+   draft-ietf-bess-evpn-unequal-lb-02

 Abstract

    In an EVPN-IRB based network overlay, EVPN all-active multi-homing
    enables multi-homing for a CE device connected to two or more PEs via
    a LAG bundle, such that bridged and routed traffic from remote PEs
    can be equally load balanced (ECMPed) across the multi-homing PEs.
    This document defines extensions to EVPN procedures to optimally
    handle unequal access bandwidth distribution across a set of multi-
    homing PEs in order to:

@@ -86,29 +86,32 @@
      3.1 Link Bandwidth Extended Community . . . . . . . . . . . . . 8
      3.2 REMOTE PE Behavior . . . . . . . . . . . . . . . . . . . . . 9
    4. Weighted BUM Traffic Load-Sharing . . . . . . . . . . . . . . 10
      4.1 The BW Capability in the DF Election Extended Community . . 10
      4.2 BW Capability and Default DF Election algorithm . . . . . . 11
      4.3 BW Capability and HRW DF Election algorithm (Type 1 and
          4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
        4.3.1 BW Increment . . . . . . . . . . . . . . . . . . . . . . 11
        4.3.2 HRW Hash Computations with BW Increment . . . . . . . . 12
        4.3.3 Cost-Benefit Tradeoff on Link Failures . . . . . . . . . 13
-     4.4 BW Capability and Preference DF Election algorithm . . . . 14
-   5. Real-time Available Bandwidth . . . . . . . . . . . . . . . . . 15
-   6. Routed EVPN Overlay . . . . . . . . . . . . . . . . . . . . . . 15
-   7. EVPN-IRB Multi-homing with non-EVPN routing . . . . . . . . . . 16
-   7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 17
-     7.1 Normative References . . . . . . . . . . . . . . . . . . . 17
-     7.2 Informative References . . . . . . . . . . . . . . . . . . 17
-   8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 18
-   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18
+     4.4 BW Capability and Weighted HRW DF Election algorithm
+         (Type TBD) . . . . . . . . . . . . . . . . . . . . . . . . 14
+     4.5 BW Capability and Preference DF Election algorithm . . . . 15
+   5. Real-time Available Bandwidth . . . . . . . . . . . . . . . . . 16
+   6. Routed EVPN Overlay . . . . . . . . . . . . . . . . . . . . . . 16
+   7. EVPN-IRB Multi-homing with non-EVPN routing . . . . . . . . . . 17
+   7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
+     7.1 Normative References . . . . . . . . . . . . . . . . . . . 18
+     7.2 Informative References . . . . . . . . . . . . . . . . . . 18
+   8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19
+   9. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 19
+   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19

 1 Introduction

    In an EVPN-IRB based network overlay, with a CE multi-homed via a
    EVPN all-active multi-homing, bridged and routed traffic from remote
    PEs can be equally load balanced (ECMPed) across the multi-homing
    PEs:

    o ECMP Load-balancing for bridged unicast traffic is enabled via
      aliasing and mass-withdraw procedures detailed in RFC 7432.

@@ -364,46 +367,46 @@
    Type 1. In the event that link bandwidth attribute is not received
    from one or more PEs, forwarding path-list would be computed using
    regular ECMP semantics.

 4. Weighted BUM Traffic Load-Sharing

    Optionally, load sharing of per-service DF role, weighted by
    individual PE's link-bandwidth share within a multi-homed ES may also
    be achieved.

-   In order to do that, a new DF Election Capability [EVPN-DF-ELECT-
-   FRAMEWORK] called "BW" (Bandwidth Weighted DF Election) is defined.
-   BW may be used along with some DF Election Types, as described in the
-   following sections.
+   In order to do that, a new DF Election Capability [RFC8584] called
+   "BW" (Bandwidth Weighted DF Election) is defined. BW may be used
+   along with some DF Election Types, as described in the following
+   sections.

 4.1 The BW Capability in the DF Election Extended Community

-   [EVPN-DF-ELECT-FRAMEWORK] defines a new extended community for PEs
-   within a redundancy group to signal and agree on uniform DF Election
-   Type and Capabilities for each ES. This document requests a bit in
-   the DF Election extended community Bitmap:
+   [RFC8584] defines a new extended community for PEs within a
+   redundancy group to signal and agree on uniform DF Election Type and
+   Capabilities for each ES. This document requests a bit in the DF
+   Election extended community Bitmap:

    Bit 28: BW (Bandwidth Weighted DF Election)

    ES routes advertised with the BW bit set will indicate the desire of
    the advertising PE to consider the link-bandwidth in the DF Election
    algorithm defined by the value in the "DF Type".

-   As per [EVPN-DF-ELECT-FRAMEWORK], all the PEs in the ES MUST
-   advertise the same Capabilities and DF Type, otherwise the PEs will
-   fall back to Default [RFC7432] DF Election procedure.
+   As per [RFC8584], all the PEs in the ES MUST advertise the same
+   Capabilities and DF Type, otherwise the PEs will fall back to Default
+   [RFC7432] DF Election procedure.

    The BW Capability MAY be advertised with the following DF Types:

    o Type 0: Default DF Election algorithm, as in [RFC7432]
-   o Type 1: HRW algorithm, as in [EVPN-DF-ELECT-FRAMEWORK]
+   o Type 1: HRW algorithm, as in [RFC8584]
    o Type 2: Preference algorithm, as in [EVPN-DF-PREF]
    o Type 4: HRW per-multicast flow DF Election, as in
      [EVPN-PER-MCAST-FLOW-DF]

    The following sections describe how the DF Election procedures are
    modified for the above DF Types when the BW Capability is used.
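
Editorial illustration (not part of either draft revision): the agreement rule in section 4.1 can be sketched as below. This is a minimal sketch under stated assumptions, not the draft's procedure: the EsRoute type, the negotiate function, and the representation of Capabilities as a set of bit positions are all hypothetical. The only values taken from the text are bit 28 for BW, the fallback to the Default DF Election procedure on any mismatch, and the list of DF Types the BW Capability MAY be advertised with.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

BW_BIT = 28                 # bit requested in section 4.1 for the BW capability
DEFAULT_DF_TYPE = 0         # Type 0: Default DF Election algorithm [RFC7432]
BW_DF_TYPES = {0, 1, 2, 4}  # DF Types the BW capability MAY be advertised with


class EsRoute(NamedTuple):
    """Hypothetical view of one PE's ES route for a given ES."""
    pe: str
    df_type: int
    capabilities: FrozenSet[int]  # bit positions set in the Bitmap (assumption)


def negotiate(routes: List[EsRoute]) -> Tuple[int, bool]:
    """Return (df_type, bw_enabled) for a non-empty list of ES routes.

    If any PE advertises a different DF Type or different Capabilities,
    fall back to the Default DF Election procedure without BW.
    """
    first = routes[0]
    uniform = all(
        r.df_type == first.df_type and r.capabilities == first.capabilities
        for r in routes
    )
    if not uniform:
        return DEFAULT_DF_TYPE, False
    # Assumption: BW is ignored for DF Types outside the listed set.
    bw_enabled = BW_BIT in first.capabilities and first.df_type in BW_DF_TYPES
    return first.df_type, bw_enabled
```

For example, two PEs that both advertise DF Type 1 with bit 28 set yield (1, True), while adding a third PE that omits the bit yields (0, False).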
 4.2 BW Capability and Default DF Election algorithm

    When all the PEs in the Ethernet Segment (ES) agree to use the BW
@@ -424,28 +427,28 @@
    for DF election is: [PE-1, PE-1, PE-2, PE-3].

    The DF for a given VLAN-a on ES-10 is now computed as (VLAN-a % 4).
    This would result in the DF role being distributed across PE1, PE2,
    and PE3 in portion to each PE's normalized weight for ES-10.

 4.3 BW Capability and HRW DF Election algorithm (Type 1 and 4)

-   [EVPN-DF-ELECT-FRAMEWORK] introduces Highest Random Weight (HRW)
-   algorithm (DF Type 1) for DF election in order to solve potential DF
-   election skew depending on Ethernet tag space distribution. [EVPN-
-   PER-MCAST-FLOW-DF] further extends HRW algorithm for per-multicast
-   flow based hash computations (DF Type 4). This section describes
-   extensions to HRW Algorithm for EVPN DF Election specified in [EVPN-
-   DF-ELECT-FRAMEWORK] and in [EVPN-PER-MCAST-FLOW-DF] in order to
-   achieve DF election distribution that is weighted by link bandwidth.
+   [RFC8584] introduces Highest Random Weight (HRW) algorithm (DF Type
+   1) for DF election in order to solve potential DF election skew
+   depending on Ethernet tag space distribution. [EVPN-PER-MCAST-FLOW-
+   DF] further extends HRW algorithm for per-multicast flow based hash
+   computations (DF Type 4). This section describes extensions to HRW
+   Algorithm for EVPN DF Election specified in [RFC8584] and in [EVPN-
+   PER-MCAST-FLOW-DF] in order to achieve DF election distribution that
+   is weighted by link bandwidth.

 4.3.1 BW Increment

    A new variable called "bandwidth increment" is computed for each [PE,
    ES] advertising the ES link bandwidth attribute as follows:

    In the context of an ES,

    L(i) = Link bandwidth advertised by PE(i) for this ES

@@ -471,24 +474,24 @@
    b(1) = 1, b(2) = 1, b(3) = 1

    Note that the bandwidth increment must always be an integer,
    including, in an unlikely scenario of a PE's link bandwidth not
    being an exact multiple of L(min).
    If it computes to a non-integer value (including as a result of
    link failure), it MUST be rounded down to an integer.

 4.3.2 HRW Hash Computations with BW Increment

-   HRW algorithm as described in [EVPN-DF-ELECT-FRAMEWORK] and in [EVPN-
-   PER-MCAST-FLOW-DF] compute a random hash value (referred to as
-   affinity here) for each PE(i), where, (0 < i <= N), PE(i) is the PE
-   at ordinal i, and Address(i) is the IP address of PE at ordinal i.
+   HRW algorithm as described in [RFC8584] and in [EVPN-PER-MCAST-FLOW-
+   DF] compute a random hash value (referred to as affinity here) for
+   each PE(i), where, (0 < i <= N), PE(i) is the PE at ordinal i, and
+   Address(i) is the IP address of PE at ordinal i.

    For 'N' PEs sharing an Ethernet segment, this results in 'N'
    candidate hash computations. PE that has the highest hash value is
    selected as the DF.

    Affinity computation for each PE(i) is extended to be computed one
    per-bandwidth increment associated with PE(i) instead of a single
    affinity computation per PE(i).
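
Editorial illustration (not part of either draft revision): the per-increment computation can be sketched as follows. This is a hedged sketch, not a reference implementation. It assumes that L(min) is the lowest link bandwidth advertised for the ES (the hunk elides its definition), that b(i) is L(i) / L(min) rounded down, and that the increment index j runs from 1 to b(i). The affinity expression follows the form this draft gives for the [RFC8584] function extended with increment j. The digest helper is a hypothetical stand-in for D(v, Es), using a CRC-32 truncated to 31 bits. Ties are broken arbitrarily here, which is also an assumption.

```python
import ipaddress
import zlib
from typing import Dict


def bw_increments(link_bw: Dict[str, int]) -> Dict[str, int]:
    """b(i) = L(i) / L(min), rounded down to an integer."""
    l_min = min(link_bw.values())  # assumed meaning of L(min)
    return {pe: int(bw // l_min) for pe, bw in link_bw.items()}


def digest(vlan: int, esi: bytes) -> int:
    """Hypothetical stand-in for D(v, Es): 31-bit CRC over tag plus ESI."""
    return zlib.crc32(vlan.to_bytes(4, "big") + esi) & 0x7FFFFFFF


def affinity(vlan: int, esi: bytes, address: str, j: int) -> int:
    """Affinity of one [PE, bandwidth increment j] pair."""
    addr = int(ipaddress.ip_address(address))
    inner = (1103515245 * addr * j + 12345) ^ digest(vlan, esi)
    return (1103515245 * inner + 12345) % 2**31


def elect_df(vlan: int, esi: bytes, link_bw: Dict[str, int]) -> str:
    """Pick the PE owning the highest affinity across all its increments."""
    candidates = (
        (affinity(vlan, esi, pe, j), pe)
        for pe, b in bw_increments(link_bw).items()
        for j in range(1, b + 1)  # one computation per increment
    )
    return max(candidates)[1]
```

With advertised bandwidths of 20, 10 and 10, bw_increments returns increments of 2, 1 and 1, so the first PE contributes two candidate affinities and each of the others contributes one.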
    PE(i) with b(i) = j, results in j affinity computations:

@@ -526,22 +529,22 @@
    For e.g., affinity function specified in [EVPN-PER-MCAST-FLOW-DF] MAY
    be extended as follows to incorporate bandwidth increment j:

    affinity(S,G,V, ESI, Address(i,j)) =
    (1103515245.((1103515245.Address(i).j + 12345) XOR
    D(S,G,V,ESI))+12345) (mod 2^31)

-   affinity or random function specified in [EVPN-DF-ELECT-FRAMEWORK]
-   MAY be extended as follows to incorporate bandwidth increment j:
+   affinity or random function specified in [RFC8584] MAY be extended as
+   follows to incorporate bandwidth increment j:

    affinity(v, Es, Address(i,j)) =
    (1103515245((1103515245.Address(i).j + 12345) XOR
    D(v,Es))+12345)(mod 2^31)

 4.3.3 Cost-Benefit Tradeoff on Link Failures

    While incorporating link bandwidth into the DF election process
    provides optimal BUM traffic distribution across the ES links, it
    also implies that affinity values for a given PE are re-computed, and
    DF elections are re-adjusted on changes to that PE's bandwidth
@@ -549,21 +552,59 @@
    the operator does not wish to have this level of churn in their DF
    election, then they should not advertise the BW capability. Not
    advertising BW capability may result in less than optimal BUM traffic
    distribution while still retaining the ability to allow a remote
    ingress PE to do weighted ECMP for its unicast traffic to a set of
    multi-homed PEs, as described in section 3.2.

    Same also applies to use of BW capability with service carving (DF
    Type 0), as specified in section 4.2.

-4.4 BW Capability and Preference DF Election algorithm
+4.4 BW Capability and Weighted HRW DF Election algorithm (Type TBD)
+
+   Use of BW capability together with HRW DF election algorithm
+   described in the previous section has a few limitations:
+
+   o While in most scenarios a change in BW for a given PE results in
+     re-assigment of DF roles from or to that PE, in certain
+     scenarios, a change in PE BW can result in complete re-assignment
+     of DF roles.
+   o If BW advertised from a set of PEs does not have a good least
+     common multiple, the BW set may result in a high BW increment for
+     each PE, and hence, may result in higher order of complexity.
+
+   [WEIGHTED-HRW] document describes an alternate DF election algorithm
+   that uses a weighted score function that is minimally disruptive such
+   that it minimizes the probability of complete re-assignment of DF
+   roles in a BW change scenario. It also does not require multiple BW
+   increment based computations.
+
+   Instead of computing BW increment and an HRW hash for each [PE, BW
+   increment], a single weighted score is computed for each PE using the
+   proposed score function with absolute BW advertised by each PE as its
+   weight value.
+
+   As described in section 4 of [WEIGHTED-HRW], a HRW hash computation
+   for each PE is converted to a weighted score as follows:
+
+   Score(Oi, Sj) = -wi/log(Hash(Oi, Sj)/Hmax); where Hmax is the maximum
+   hash value.
+
+   Oi is object being assigned, for e.g., a vlan-id in this case;
+
+   Sj is the server, for e.g., a PE IP address in this case;
+
+   wi is the weight, for e.g., BW capability in this case;
+
+   Object Oi is assigned to server Si with the highest score.
+
+4.5 BW Capability and Preference DF Election algorithm

    This section applies to ES'es where all the PEs in the ES agree use
    the BW Capability with DF Type 2. The BW Capability modifies the
    Preference DF Election procedure [EVPN-DF-PREF], by adding the LBW
    value as a tie-breaker as follows:

    o Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW
      value:

    f) In case of equal Preference in two or more PEs in the ES, the
@@ -629,70 +670,79 @@
    use cases from EVPN IRB use cases discussed earlier is that EVPN
    control plane is used only to enable LAG interface based multi-homing
    and NOT as an overlay VPN control plane.
    EVPN control plane in this case enables:

    o DF election via EVPN RT-4 based procedures described in [RFC7432]
    o LOCAL MAC sync across multi-homing PEs via EVPN RT-2
    o LOCAL ARP and ND sync across multi-homing PEs via EVPN RT-2

    Applicability of weighted ECMP procedures proposed in this document
-   to these set of use cases will be addressed in subsequent revisions.
+   to these set of use cases is an area of further consideration.

 7. References

 7.1 Normative References

    [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
    Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
    Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, .

    [BGP-LINK-BW] Mohapatra, P., Fernando, R., "BGP Link Bandwidth
-   Extended Community", January 2013,
+   Extended Community", March 2018,
    .
+   bandwidth-07>.

    [EVPN-IP-ALIASING] Sajassi, A., Badoni, G., "L3 Aliasing and Mass
    Withdrawal Support for EVPN", July 2017, .

    [EVPN-DF-PREF] Rabadan, J., Sathappan, S., Przygienda, T., Lin, W.,
    Drake, J., Sajassi, A., and S. Mohanty, "Preference-based
    EVPN DF Election", internet-draft ietf-bess-evpn-pref-df-
    01.txt, April 2018.

    [EVPN-PER-MCAST-FLOW-DF] Sajassi, et al., "Per multicast flow
    Designated Forwarder Election for EVPN", March 2018, .

-   [EVPN-DF-ELECT-FRAMEWORK] Rabadan, Mohanty, et al., "Framework for
-   EVPN Designated Forwarder Election Extensibility", March
-   2018, .
+   [RFC8584] Rabadan, Mohanty, et al., "Framework for Ethernet VPN
+   Designated Forwarder Election Extensibility", April 2019,
+   .
+
+   [WEIGHTED-HRW] Mohanty, et al., "Weighted HRW and its applications",
+   Sept. 2019, .

    [RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
    Requirement Levels", March 1997, .

    [RFC8174] B. Leiba, "Ambiguity of Uppercase vs Lowercase in RFC 2119
    Key Words", May 2017, .

 7.2 Informative References

 8. Acknowledgements

    Authors would like to thank Satya Mohanty for valuable review and
-   inputs with respect to HRW algorithm refinements proposed in this
-   document.
+   inputs with respect to HRW and weighted HRW algorithm refinements
+   proposed in this document.
+
+9. Contributors
+
+   Satya Ranjan Mohanty
+   Cisco
+   Email: satyamoh@cisco.com

 Authors' Addresses

    Neeraj Malhotra, Editor.
    Arrcus
    Email: neeraj.ietf@gmail.com

    Ali Sajassi
    Cisco
    Email: sajassi@cisco.com