draft-ietf-v6ops-pmtud-ecmp-problem-03.txt   draft-ietf-v6ops-pmtud-ecmp-problem-04.txt 
v6ops M. Byerly v6ops M. Byerly
Internet-Draft Fastly Internet-Draft Fastly
Intended status: Informational M. Hite Intended status: Informational M. Hite
Expires: December 30, 2015 Evernote Expires: March 1, 2016 Evernote
J. Jaeggli J. Jaeggli
Fastly Fastly
June 28, 2015 August 29, 2015
Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
draft-ietf-v6ops-pmtud-ecmp-problem-03 draft-ietf-v6ops-pmtud-ecmp-problem-04
Abstract Abstract
This document calls attention to the problem of delivering ICMPv6 This document calls attention to the problem of delivering ICMPv6
type 2 "Packet Too Big" (PTB) messages to the intended destination in type 2 "Packet Too Big" (PTB) messages to the intended destination
ECMP load balanced or anycast network architectures. It discusses (typically the server) in ECMP load balanced or anycast network
operational mitigations that can be employed to address this class of architectures. It discusses operational mitigations that can be
failures. employed to address this class of failures.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 30, 2015. This Internet-Draft will expire on March 1, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 13 skipping to change at page 2, line 13
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1. Alternatives . . . . . . . . . . . . . . . . . . . . . . 5 3.1. Alternative Mitigations . . . . . . . . . . . . . . . . . 5
3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5 3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5
3.2.1. Alternatives . . . . . . . . . . . . . . . . . . . . 6 3.2.1. Alternative Implementation . . . . . . . . . . . . . 6
4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 7 4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 7
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7
7. Security Considerations . . . . . . . . . . . . . . . . . . . 7 7. Security Considerations . . . . . . . . . . . . . . . . . . . 7
8. Informative References . . . . . . . . . . . . . . . . . . . 8 8. Informative References . . . . . . . . . . . . . . . . . . . 8
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8
1. Introduction 1. Introduction
Operators of popular Internet services face complex challenges Operators of popular Internet services face complex challenges
associated with scaling their infrastructure. One approach is to associated with scaling their infrastructure. One scaling approach
utilize equal-cost multi-path (ECMP) routing to perform stateless is to utilize equal-cost multi-path (ECMP) routing to perform
distribution of incoming TCP or UDP sessions to multiple servers or stateless distribution of incoming TCP or UDP sessions to multiple
to middle boxes such as load balancers. Distribution of traffic in servers or to middle boxes such as load balancers. Distribution of
this manner presents a problem when dealing with ICMP signaling. traffic in this manner presents a problem when dealing with ICMP
Specifically, an ICMP error is not guaranteed to hash via ECMP to the signaling. Specifically, an ICMP error is not guaranteed to hash via
same destination as its corresponding TCP or UDP session. A case ECMP to the same destination as its corresponding TCP or UDP session.
where this is particularly problematic operationally is path MTU A case where this is particularly problematic operationally is path
discovery (PMTUD). MTU discovery (PMTUD).
2. Problem 2. Problem
A common application for stateless load balancing of TCP or UDP flows A common application for stateless load balancing of TCP or UDP flows
is to perform an initial subdivision of flows in front of a stateful is to perform an initial subdivision of flows in front of a stateful
load balancer tier or multiple servers so that the workload becomes load balancer tier or multiple servers so that the workload becomes
divided into manageable fractions of the total number of flows. The divided into manageable fractions of the total number of flows. The
flow division is performed using ECMP forwarding and a stateless but flow division is performed using ECMP forwarding and a stateless but
sticky algorithm for hashing across the available paths. This sticky algorithm for hashing across the available paths. This
nexthop selection for the purposes of flow distribution is a nexthop selection for the purposes of flow distribution is a
skipping to change at page 3, line 21 skipping to change at page 3, line 21
balanced anycast address as the destination and hence will be balanced anycast address as the destination and hence will be
statelessly load balanced to one of the servers. While the ICMPv6 statelessly load balanced to one of the servers. While the ICMPv6
PTB message contains as much of the packet that could not be PTB message contains as much of the packet that could not be
forwarded as possible, the payload headers are not considered in the forwarded as possible, the payload headers are not considered in the
forwarding decision and are ignored. Because the PTB message is not forwarding decision and are ignored. Because the PTB message is not
identifiable as part of the original flow by the IP or upper layer identifiable as part of the original flow by the IP or upper layer
packet headers, the ICMPv6 ECMP hash calculation is packet headers, the ICMPv6 ECMP hash calculation is
unlikely to select the same nexthop as packets matching the TCP unlikely to select the same nexthop as packets matching the TCP
or UDP ECMP hash of the flow. or UDP ECMP hash of the flow.
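The mismatch described above can be illustrated with a minimal sketch. It assumes a hypothetical router that picks a nexthop by hashing only outer-header fields (source, destination, next header, and ports when present); real routers use vendor-specific hash functions and inputs. The addresses, ports, and the ecmp_nexthop helper are illustrative and are not taken from the draft.

```python
import hashlib

# Hypothetical ECMP nexthop set; names are illustrative only.
NEXTHOPS = ["load balancer 1", "load balancer 2",
            "load balancer 3", "load balancer 4"]


def ecmp_nexthop(src, dst, next_header, sport=0, dport=0, nexthops=NEXTHOPS):
    """Pick a nexthop from a stateless hash of outer-header fields only."""
    key = f"{src}|{dst}|{next_header}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return nexthops[int.from_bytes(digest[:4], "big") % len(nexthops)]


ANYCAST = "2001:db8:a::1"       # load-balanced service address (assumed)
CLIENT = "2001:db8:c::10"       # client address (assumed)
PTB_SENDER = "2001:db8:ffff::1"  # router that emitted the PTB (assumed)

# Inbound TCP segment of the flow: hashed on client address and ports.
flow_hop = ecmp_nexthop(CLIENT, ANYCAST, 6, 51512, 443)

# PTB for the same flow: the outer source is the emitting router, the
# next header is 58 (ICMPv6), and there are no ports to hash.
ptb_hop = ecmp_nexthop(PTB_SENDER, ANYCAST, 58)

print("flow ->", flow_hop)
print("PTB  ->", ptb_hop)
```

Because none of the hash inputs are shared between the two packets apart from the destination address, the two results agree only by chance, roughly once in N for N nexthops.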
An example packet flow and topology follow. An example packet flow and topology follow. The packet for which the
PTB message was generated was intended for the client.
ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination
router --> load balancer 1 ---> router --> load balancer 1 --->
\\--> load balancer 2 ---> load-balanced service \\--> load balancer 2 ---> load-balanced service
\--> load balancer N ---> \--> load balancer N --->
Figure 1 Figure 1
The router ECMP decision is used because it is part of the forwarding The router ECMP decision is used because it is part of the forwarding
skipping to change at page 5, line 7 skipping to change at page 5, line 7
potentially significant implications for resource consumption and for potentially significant implications for resource consumption and for
self-inflicted denial-of-service if not carefully employed. self-inflicted denial-of-service if not carefully employed.
Fortunately, in real-world deployments we have observed that the Fortunately, in real-world deployments we have observed that the
number of flows for which this problem occurs is relatively small number of flows for which this problem occurs is relatively small
(for example, 10 or fewer pps on 1Gb/s or more worth of https traffic in (for example, 10 or fewer pps on 1Gb/s or more worth of https traffic in
a real world deployment); sensible ingress rate limiters which will a real world deployment); sensible ingress rate limiters which will
discard excessive message volume can be applied to protect even very discard excessive message volume can be applied to protect even very
large anycast server tiers with the potential for fallout limited to large anycast server tiers with the potential for fallout limited to
circumstances of deliberate duress. circumstances of deliberate duress.
3.1. Alternatives 3.1. Alternative Mitigations
As an alternative, it may be appropriate to lower the TCP MSS to 1220 As an alternative, it may be appropriate to lower the TCP MSS to 1220
in order to accommodate 1280 byte MTU. We consider this undesirable in order to accommodate 1280 byte MTU. We consider this undesirable
as hosts may not be able to independently set TCP MSS by address- as hosts may not be able to independently set TCP MSS by address-
family thereby impacting IPv4, or alternatively that middle-boxes family thereby impacting IPv4, or alternatively that middle-boxes
need to be employed to clamp the MSS independently from the end- need to be employed to clamp the MSS independently from the end-
systems. Potentially, extension headers might further alter the systems. Potentially, extension headers might further alter the
lower bound that the MSS would have to be set to, making clamping lower bound that the MSS would have to be set to, making clamping
still more undesirable. still more undesirable.
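The figure of 1220 follows from subtracting the fixed 40-byte IPv6 header and the 20-byte TCP header (without options) from the 1280-byte minimum IPv6 MTU. The sketch below works through that arithmetic; the clamped_mss helper and the 8-byte extension header are illustrative assumptions, not part of the draft.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU that IPv6 requires, in bytes
IPV6_HEADER = 40     # fixed IPv6 header, in bytes
TCP_HEADER = 20      # TCP header without options, in bytes


def clamped_mss(mtu=IPV6_MIN_MTU, extension_header_bytes=0):
    """Largest TCP MSS that fits in one packet of the given MTU."""
    return mtu - IPV6_HEADER - extension_header_bytes - TCP_HEADER


print(clamped_mss())                          # 1220
print(clamped_mss(extension_header_bytes=8))  # 1212
```

The second call shows why extension headers lower the bound further: every byte of extension header on the path comes directly out of the usable MSS.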
3.2. Implementation 3.2. Implementation
1. Filter-based-forwarding matches next-header ICMPv6 type-2 and 1. Filter-based-forwarding matches next-header ICMPv6 type-2 and
matches a next-hop on a particular subnet directly attached to matches a next-hop on a particular subnet directly attached to 1
both border routers. The filter is policed to reasonable limits or more routers. The filter is policed to reasonable limits (we
(we chose 1000pps, more conservative rates might be required in chose 1000pps, more conservative rates might be required in other
other implementations). implementations).
2. Filter is applied on input side of all external interfaces 2. Filter is applied on input side of all external (internet or
customer facing) interfaces.
3. A proxy located at the next-hop forwards ICMPv6 type-2 packets 3. A proxy located at the next-hop forwards ICMPv6 type-2 packets
received at the next-hop to an Ethernet broadcast address received at the next-hop to an Ethernet broadcast address
(example ff:ff:ff:ff:ff:ff) on all specified subnets. This was (example ff:ff:ff:ff:ff:ff) on all specified subnets. This was
necessitated by router inability (in IPv6) to forward the same necessitated by router inability (in IPv6) to forward the same
packet to multiple unicast next-hops. packet to multiple unicast next-hops.
4. Anycast servers receive the PTB error and process packet as 4. Anycasted servers receive the PTB error and process packet as
needed. needed.
A simple Python scapy script that can perform the ICMPv6 proxy A simple Python scapy script that can perform the ICMPv6 proxy
reflection is included. reflection is included.
#!/usr/bin/python #!/usr/bin/python
from scapy.all import * from scapy.all import *
IFACE_OUT = ["p2p1", "p2p2"] IFACE_OUT = ["p2p1", "p2p2"]
skipping to change at page 6, line 34 skipping to change at page 6, line 34
if __name__ == '__main__': if __name__ == '__main__':
main() main()
This example script listens on all interfaces for IPv6 PTB errors This example script listens on all interfaces for IPv6 PTB errors
being forwarded using filter-based-forwarding. It removes the being forwarded using filter-based-forwarding. It removes the
existing Ethernet source and rewrites a new Ethernet destination of existing Ethernet source and rewrites a new Ethernet destination of
the Ethernet broadcast address. It then sends the resulting frame the Ethernet broadcast address. It then sends the resulting frame
out the p2p1 and p2p2 interfaces which are attached to VLANs where our out the p2p1 and p2p2 interfaces which are attached to VLANs where our
anycast servers reside. anycast servers reside.
3.2.1. Alternatives 3.2.1. Alternative Implementation
Alternatively, network designs in which a common layer 2 network Alternatively, network designs in which a common layer 2 network
exists on the ECMP hop could distribute the proxy onto the end exists on the ECMP hop could distribute the proxy onto the end
systems, eliminating the need for policy routing. They could then systems, eliminating the need for policy routing. They could then
rewrite the destination -- for example, using iptables before rewrite the destination -- for example, using iptables before
forwarding the packet back to the network containing all of the forwarding the packet back to the network containing all of the
server or load balancer interfaces. This implementation can be done server or load balancer interfaces. This implementation can be done
entirely within the Linux iptables firewall. Because of the entirely within the Linux iptables firewall. Because of the
distributed nature of the filter, more conservative rate limits are distributed nature of the filter, more conservative rate limits are
required than when a global rate limit can be employed. required than when a global rate limit can be employed.
skipping to change at page 7, line 12 skipping to change at page 7, line 12
-m pkttype ! --pkt-type broadcast -m limit --limit 10/second \ -m pkttype ! --pkt-type broadcast -m limit --limit 10/second \
-j TEE 2001:DB8::1 -j TEE 2001:DB8::1
As with the scapy example, once the destination has been rewritten As with the scapy example, once the destination has been rewritten
from a hardcoded ND entry to an Ethernet broadcast address -- in this from a hardcoded ND entry to an Ethernet broadcast address -- in this
case to an IPv6 documentation address -- the traffic will be case to an IPv6 documentation address -- the traffic will be
reflected to all the hosts on the subnet. reflected to all the hosts on the subnet.
4. Improvements 4. Improvements
There are several ways that improvements could be made to the There are several ways that improvements could be made to the problem
situation with respect to ECMP load balancing of ICMPv6 PTB. of ECMP load balancing of ICMPv6 PTB messages. Little in the way of
Internet protocol specification change is required; rather, we foresee
practical implementation changes which, insofar as we are aware, do not
exist in current routers, switches, or layer 3/4 load balancers.
Alternatively, improved behavior on the part of client/server
detection of path MTU in band could render the behavior of devices in
the path irrelevant.
1. Routers with sufficient capacity within the lookup process could 1. Routers with sufficient capacity within the lookup process could
parse all the way through the L3 or L4 header in the ICMPv6 parse all the way through the L3 or L4 header in the ICMPv6
payload beginning at bit offset 32 of the ICMP header. By payload beginning at bit offset 32 of the ICMP header. By
reordering the elements of the hash to match the inward direction reordering the elements of the hash to match the inward direction
of the flow, the PTB error could be directed to the same next-hop of the flow, the PTB error could be directed to the same next-hop
as the incoming packets in the flow. as the incoming packets in the flow.
2. The FIB (Forwarding Information Base) on the router could be 2. The FIB (Forwarding Information Base) on the router could be
programmed with a multicast distribution tree that included all programmed with a multicast distribution tree that included all
of the necessary next-hops, and ICMPv6 packets could be policy of the necessary next-hops, and unicast ICMPv6 packets could be
routed to this destination. policy routed to these destinations.
3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization 3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
Layer Path MTU Discovery would probably go a long way towards Layer Path MTU Discovery would probably go a long way towards
reducing dependence on ICMPv6 PTB by end systems. reducing dependence on ICMPv6 PTB by end systems.
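The first improvement in the list above can be sketched as follows. This is a minimal illustration, assuming the PTB carries an invoking packet with TCP or UDP directly after the fixed IPv6 header (extension headers are not walked). The inbound_flow_key and build_sample_ptb helpers and the sample addresses are hypothetical. In a PTB message the ICMPv6 message body starts after the 4-byte type/code/checksum header with the 4-byte MTU field, and the invoking packet follows it.

```python
import ipaddress
import struct

ICMPV6_PTB = 2
TCP, UDP = 6, 17


def inbound_flow_key(icmpv6_message):
    """Return (src, dst, proto, sport, dport) as inbound packets of the
    flow would carry them, or None if the message cannot be parsed."""
    if len(icmpv6_message) < 8 + 40 + 4:
        return None
    icmp_type, _code, _cksum, _mtu = struct.unpack("!BBHI", icmpv6_message[:8])
    if icmp_type != ICMPV6_PTB:
        return None
    inner = icmpv6_message[8:]   # invoking packet, starting at its IPv6 header
    next_header = inner[6]
    if next_header not in (TCP, UDP):
        return None              # extension headers are not handled here
    inner_src = ipaddress.IPv6Address(inner[8:24])
    inner_dst = ipaddress.IPv6Address(inner[24:40])
    inner_sport, inner_dport = struct.unpack("!HH", inner[40:44])
    # The invoking packet travelled from server to client, so swap the
    # addresses and ports to match the inward direction of the flow.
    return (str(inner_dst), str(inner_src), next_header, inner_dport, inner_sport)


def build_sample_ptb():
    """Build a PTB carrying the start of a server-to-client TCP segment."""
    server = ipaddress.IPv6Address("2001:db8:a::1").packed
    client = ipaddress.IPv6Address("2001:db8:c::10").packed
    # version/class/flow label, payload length, next header, hop limit
    ipv6_header = struct.pack("!IHBB", 6 << 28, 20, TCP, 64) + server + client
    tcp_ports = struct.pack("!HH", 443, 51512)
    icmp_header = struct.pack("!BBHI", ICMPV6_PTB, 0, 0, 1280)  # checksum left 0
    return icmp_header + ipv6_header + tcp_ports


print(inbound_flow_key(build_sample_ptb()))
```

A router that fed this key into the same hash it uses for inbound TCP or UDP packets would send the PTB to the nexthop that already holds the flow.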
5. Acknowledgements 5. Acknowledgements
The authors would like to thank Marek Majkowski for contributing The authors would like to thank Marek Majkowski for contributing
text, examples, and a very close review. The authors would like to text, examples, and a very close review. The authors would like to
thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter, thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter,
skipping to change at page 8, line 10 skipping to change at page 8, line 16
The proxy replication results in devices not associated with the flow The proxy replication results in devices not associated with the flow
that generated the PTB being recipients of an ICMPv6 message which that generated the PTB being recipients of an ICMPv6 message which
contains a fragment of a packet. This could arguably result in contains a fragment of a packet. This could arguably result in
information disclosure. Recipient machines should be in a common information disclosure. Recipient machines should be in a common
administrative domain. administrative domain.
8. Informative References 8. Informative References
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007. Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
<http://www.rfc-editor.org/info/rfc4821>.
Authors' Addresses Authors' Addresses
Matt Byerly Matt Byerly
Fastly Fastly
Kapolei, HI Kapolei, HI
US US
Email: suckawha@gmail.com Email: suckawha@gmail.com
 End of changes. 17 change blocks. 
33 lines changed or deleted 42 lines changed or added
