--- 1/draft-ietf-bess-evpn-unequal-lb-05.txt 2020-07-27 19:13:10.346060454 -0700 +++ 2/draft-ietf-bess-evpn-unequal-lb-06.txt 2020-07-27 19:13:10.386061476 -0700 @@ -1,26 +1,26 @@ BESS WorkGroup N. Malhotra, Ed. Internet-Draft A. Sajassi Intended status: Standards Track Cisco Systems -Expires: January 11, 2021 J. Rabadan +Expires: January 28, 2021 J. Rabadan Nokia J. Drake Juniper A. Lingala ATT S. Thoria Cisco Systems - July 10, 2020 + July 27, 2020 Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing - draft-ietf-bess-evpn-unequal-lb-05 + draft-ietf-bess-evpn-unequal-lb-06 Abstract In an EVPN-IRB based network overlay, EVPN all-active multi-homing enables multi-homing for a CE device connected to two or more PEs via a LAG, such that bridged and routed traffic from remote PEs can be equally load balanced (ECMPed) across the multi-homing PEs. This document defines extensions to EVPN procedures to optimally handle unequal access bandwidth distribution across a set of multi-homing PEs in order to: @@ -38,21 +38,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on January 11, 2021. + This Internet-Draft will expire on January 28, 2021. Copyright Notice Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. 
Please review these documents @@ -65,41 +65,39 @@ Table of Contents 1. Requirements Language and Terminology . . . . . . . . . . . . 3 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1. PE-CE Link Provisioning . . . . . . . . . . . . . . . . . 4 2.2. PE-CE Link Failures . . . . . . . . . . . . . . . . . . . 5 2.3. Design Requirement . . . . . . . . . . . . . . . . . . . 6 3. Solution Overview . . . . . . . . . . . . . . . . . . . . . . 6 4. Weighted Unicast Traffic Load-balancing . . . . . . . . . . . 7 4.1. Local PE Behavior . . . . . . . . . . . . . . . . . . . . 7 - 4.2. Link Bandwidth Extended Community . . . . . . . . . . . . 7 - 4.3. Remote PE Behavior . . . . . . . . . . . . . . . . . . . 7 + 4.2. EVPN Link Bandwidth Extended Community . . . . . . . . . 7 + 4.3. Remote PE Behavior . . . . . . . . . . . . . . . . . . . 8 5. Weighted BUM Traffic Load-Sharing . . . . . . . . . . . . . . 9 5.1. The BW Capability in the DF Election Extended Community . 9 5.2. BW Capability and Default DF Election algorithm . . . . . 10 5.3. BW Capability and HRW DF Election algorithm (Type 1 and 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 - 5.3.1. BW Increment . . . . . . . . . . . . . . . . . . . . 10 + 5.3.1. BW Increment . . . . . . . . . . . . . . . . . . . . 11 5.3.2. HRW Hash Computations with BW Increment . . . . . . . 11 - 5.4. BW Capability and Weighted HRW DF Election algorithm - (Type TBD) . . . . . . . . . . . . . . . . . . . . . . . 13 - 5.5. BW Capability and Preference DF Election algorithm . . . 13 - 6. Cost-Benefit Tradeoff on Link Failures . . . . . . . . . . . 14 - 7. Real-time Available Bandwidth . . . . . . . . . . . . . . . . 14 + 5.4. BW Capability and Preference DF Election algorithm . . . 13 + 6. Cost-Benefit Tradeoff on Link Failures . . . . . . . . . . . 13 + 7. Real-time Available Bandwidth . . . . . . . . . . . . . . . . 13 8. Routed EVPN Overlay . . . . . . . . . . . . . . . . . . . . . 14 - 9. 
EVPN-IRB Multi-homing With Non-EVPN routing . . . . . . . . . 15 + 9. EVPN-IRB Multi-homing With Non-EVPN routing . . . . . . . . . 14 10. Operational Considerations . . . . . . . . . . . . . . . . . 15 - 11. Security Considerations . . . . . . . . . . . . . . . . . . . 16 - 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16 - 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 16 - 14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 16 + 11. Security Considerations . . . . . . . . . . . . . . . . . . . 15 + 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 + 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15 + 14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 15 15. Normative References . . . . . . . . . . . . . . . . . . . . 16 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 17 1. Requirements Language and Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. "Local PE" in the context of an ESI refers to a provider edge switch @@ -308,30 +306,40 @@ 4.1. Local PE Behavior A PE that is part of an Ethernet Segment's redundancy group would advertise an additional "link bandwidth" extended community attribute with Ethernet A-D per-ES route (EVPN Route Type 1), that represents total bandwidth of PE's physical links in an Ethernet Segment. BGP link bandwidth extended community defined in [BGP-LINK-BW] is re-used for this purpose. -4.2. Link Bandwidth Extended Community +4.2. EVPN Link Bandwidth Extended Community + + A new EVPN Link Bandwidth extended community is defined to signal + local ES link bandwidth to remote PEs. This extended-community is + defined of type 0x06 (EVPN). IANA is requested to assign a sub-type + value of 0x10 for the EVPN Link bandwidth extended community, of type + 0x06 (EVPN). 
EVPN Link Bandwidth extended community is defined as + conditional transitive with the following behavior: + + o Pass it across eBGP session when next-hop is not rewritten. + + o Drop it across eBGP session when next-hop is rewritten. Link bandwidth extended community described in [BGP-LINK-BW] for - layer 3 VPNs is re-used here to signal local ES link bandwidth to - remote PEs. link bandwidth extended community is however defined in - [BGP-LINK-BW] as optional non-transitive. In inter-AS scenarios, - link-bandwidth may need to be signaled to an eBGP neighbor along with - next-hop unchanged. It is work in progress with authors of [BGP- - LINK-BW] to allow for this attribute to be used as transitive in - inter-AS scenarios. + layer 3 VPNs was considered for re-use here. This Link bandwidth + extended community is however defined in [BGP-LINK-BW] as optional + non-transitive. In inter-AS scenarios, link-bandwidth needs to be + signaled to an eBGP neighbor when the next-hop is unchanged. + Since it is not possible to change deployed behavior of this + extended-community, it was decided to define a new one. 4.3. Remote PE Behavior A receiving PE SHOULD use per-ES link bandwidth attribute received from each PE to compute a relative weight for each remote PE, per-ES, and then use this relative weight to compute a weighted path-list to be used for load balancing, as opposed to using an ECMP path-list for load balancing across the PE paths. PE Weight and resulting weighted path-list computation at remote PEs is a local matter. An example computation algorithm is shown below to illustrate the idea: @@ -555,67 +564,27 @@ extended to include bandwidth increment in the computation.
For e.g., affinity function specified in [EVPN-PER-MCAST-FLOW-DF] MAY be extended as follows to incorporate bandwidth increment j: affinity(S,G,V, ESI, Address(i,j)) = (1103515245.((1103515245.Address(i).j + 12345) XOR D(S,G,V,ESI))+12345) (mod 2^31) - affinity or random function specified in [RFC8584] MAY be extended as follows to incorporate bandwidth increment j: affinity(v, Es, Address(i,j)) = (1103515245((1103515245.Address(i).j + 12345) XOR D(v,Es))+12345)(mod 2^31) -5.4. BW Capability and Weighted HRW DF Election algorithm (Type TBD) - - Use of BW capability together with HRW DF election algorithm - described in the previous section has a few limitations: - - o While in most scenarios a change in BW for a given PE results in - re-assigment of DF roles from or to that PE, in certain scenarios, - a change in PE BW can result in complete re-assignment of DF - roles. - - o If BW advertised from a set of PEs does not have a good least - common multiple, the BW set may result in a high BW increment for - each PE, and hence, may result in higher order of complexity. - - [RFC8584] describes an alternate DF election algorithm that uses a - weighted score function that is minimally disruptive such that it - minimizes the probability of complete re-assignment of DF roles in a - BW change scenario. It also does not require multiple BW increment - based computations. - - Instead of computing BW increment and an HRW hash for each [PE, BW - increment], a single weighted score is computed for each PE using the - proposed score function with absolute BW advertised by each PE as its - weight value. - - As described in section 4 of [RFC8584], a HRW hash computation for - each PE is converted to a weighted score as follows: - - Score(Oi, Sj) = -wi/log(Hash(Oi, Sj)/Hmax); where Hmax is the maximum - hash value. 
- - Oi is object being assigned, for e.g., a vlan-id in this case; - - Sj is the server, for e.g., a PE IP address in this case; - - wi is the weight, for e.g., BW capability in this case; - - Object Oi is assigned to server Sj with the highest score. - -5.5. BW Capability and Preference DF Election algorithm +5.4. BW Capability and Preference DF Election algorithm This section applies to ES'es where all the PEs in the ES agree to use the BW Capability with DF Type 2. The BW Capability modifies the Preference DF Election procedure [EVPN-DF-PREF], by adding the LBW value as a tie-breaker as follows: Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW value: f) In case of equal Preference in two or more PEs in the ES, the tie- @@ -718,28 +687,38 @@ 12. IANA Considerations [RFC8584] defines a new extended community for PEs within a redundancy group to signal and agree on uniform DF Election Type and Capabilities for each ES. This document requests IANA for a bit in the DF Election extended community Bitmap: Bit 28: BW (Bandwidth Weighted DF Election) + A new EVPN Link Bandwidth extended community is defined to signal + local ES link bandwidth to remote PEs. This extended-community is + defined of type 0x06 (EVPN). IANA is requested to assign a sub-type + value of 0x10 for the EVPN Link bandwidth extended community, of type + 0x06 (EVPN). EVPN Link Bandwidth extended community is defined as + conditional transitive with the following behavior: + + o Pass it across eBGP session when next-hop is not rewritten. + + o Drop it across eBGP session when next-hop is rewritten. + 13. Acknowledgements Authors would like to thank Satya Mohanty for valuable review and inputs with respect to HRW and weighted HRW algorithm refinements proposed in this document. 14. Contributors - Satya Ranjan Mohanty Cisco Systems US Email: satyamoh@cisco.com 15. Normative References [BGP-LINK-BW] Mohapatra, P. and R.
Fernando, "BGP Link Bandwidth Extended Community", draft-ietf-idr-link-bandwidth-07
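
Reviewer note on the Section 5.3.2 hunk: the extended [RFC8584] affinity function that stays in -06 can be read as the following minimal Python sketch. It is illustrative only and is not text from the draft. Several points are assumptions: "Address(i).j" is read as the PE address multiplied by the increment index j, with j starting at 1; D(v,Es) is taken as a CRC-32 over the 4-byte Ethernet Tag followed by the 10-byte ESI with the most significant bit dropped; ties go to the numerically lowest address. The names digest, affinity and elect_df are hypothetical.

```python
import zlib
from typing import Mapping

MULTIPLIER = 1103515245   # constant from the affinity formula
INCREMENT = 12345         # constant from the affinity formula
MODULUS = 2 ** 31


def digest(vlan: int, esi: bytes) -> int:
    """31-bit digest D(v, Es). Assumption: CRC-32 of tag + ESI, MSB discarded."""
    if len(esi) != 10:
        raise ValueError("ESI must be 10 bytes")
    stream = vlan.to_bytes(4, "big") + esi
    return zlib.crc32(stream) & 0x7FFFFFFF


def affinity(vlan: int, esi: bytes, address: int, j: int) -> int:
    """Affinity for one [PE address, BW increment j] pair.

    Assumption: 'Address(i).j' means address * j.
    """
    inner = (MULTIPLIER * address * j + INCREMENT) ^ digest(vlan, esi)
    return (MULTIPLIER * inner + INCREMENT) % MODULUS


def elect_df(vlan: int, esi: bytes, increments: Mapping[int, int]) -> int:
    """Return the PE address that owns the highest-affinity increment.

    increments maps a PE address (as an integer) to its BW increment count.
    Assumption: ties are broken in favour of the lowest address.
    """
    candidates = (
        (affinity(vlan, esi, addr, j), -addr)
        for addr, count in increments.items()
        for j in range(1, count + 1)
    )
    _, negated_addr = max(candidates)
    return -negated_addr
```

In this reading, a PE advertising twice the bandwidth contributes two candidates per VLAN instead of one, which is how the BW increment biases DF election toward higher-bandwidth PEs.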