draft-ietf-v6ops-pmtud-ecmp-problem-00.txt   draft-ietf-v6ops-pmtud-ecmp-problem-01.txt 
v6ops M. Byerly v6ops M. Byerly
Internet-Draft Fastly Internet-Draft Fastly
Intended status: Informational M. Hite Intended status: Informational M. Hite
Expires: September 2, 2015 Evernote Expires: November 20, 2015 Evernote
J. Jaeggli J. Jaeggli
Fastly Fastly
March 1, 2015 May 19, 2015
Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
draft-ietf-v6ops-pmtud-ecmp-problem-00 draft-ietf-v6ops-pmtud-ecmp-problem-01
Abstract Abstract
This document calls attention to the problem of delivering ICMPv6 This document calls attention to the problem of delivering ICMPv6
type 2 "Packet Too Big" (PTB) messages to the intended destination in type 2 "Packet Too Big" (PTB) messages to the intended destination in
ECMP load balanced, or anycast network architectures. It discusses ECMP load balanced or anycast network architectures. It discusses
operational mitigations that can be employed to address this class of operational mitigations that can be employed to address this class of
failure. failure.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 2, 2015. This Internet-Draft will expire on November 20, 2015.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 15 skipping to change at page 2, line 15
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1. Alternatives . . . . . . . . . . . . . . . . . . . . . . 5 3.1. Alternatives . . . . . . . . . . . . . . . . . . . . . . 5
3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5 3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5
4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2.1. Alternatives . . . . . . . . . . . . . . . . . . . . 6
4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 7
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7
7. Security Considerations . . . . . . . . . . . . . . . . . . . 7 7. Security Considerations . . . . . . . . . . . . . . . . . . . 7
8. Informative References . . . . . . . . . . . . . . . . . . . 7 8. Informative References . . . . . . . . . . . . . . . . . . . 8
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 7 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8
1. Introduction 1. Introduction
Operators of popular Internet services face complex challenges Operators of popular Internet services face complex challenges
associated with scaling their infrastructure. One approach is to associated with scaling their infrastructure. One approach is to
utilize equal-cost multi-path (ECMP) routing to perform stateless utilize equal-cost multi-path (ECMP) routing to perform stateless
distribution of incoming TCP or UDP sessions to multiple servers or distribution of incoming TCP or UDP sessions to multiple servers or
to middle boxes such as load balancers. Distribution of traffic in to middle boxes such as load balancers. Distribution of traffic in
this manner presents a problem when dealing with ICMP signaling. this manner presents a problem when dealing with ICMP signaling.
Specifically, an ICMP error is not guaranteed to hash via ECMP to the Specifically, an ICMP error is not guaranteed to hash via ECMP to the
same destination as its corresponding TCP or UDP session. A case same destination as its corresponding TCP or UDP session. A case
where this is particularly problematic operationally is path MTU where this is particularly problematic operationally is path MTU
discovery (PMTUD). discovery (PMTUD).
2. Problem 2. Problem
A common application for stateless load balancing of TCP or UDP flows A common application for stateless load balancing of TCP or UDP flows
is to perform an initial subdivision of flows in front of a stateful is to perform an initial subdivision of flows in front of a stateful
load balancer tier or multiple servers, so that the workload becomes load balancer tier or multiple servers so that the workload becomes
divided into manageable fractions of the total number of flows. The divided into manageable fractions of the total number of flows. The
flow division is performed using ECMP forwarding and a stateless but flow division is performed using ECMP forwarding and a stateless but
sticky algorithm for hashing across the available paths. This sticky algorithm for hashing across the available paths. This
nexthop selection for the purposes of flow distribution is a nexthop selection for the purposes of flow distribution is a
constrained form of anycast topology, where all anycast destinations constrained form of anycast topology where all anycast destinations
are equidistant from the upstream router responsible for making the are equidistant from the upstream router responsible for making the
last next-hop forwarding decision before the flow arrives on the last next-hop forwarding decision before the flow arrives on the
destination device. In this approach, the hash is performed across destination device. In this approach, the hash is performed across
some set of available protocol headers. Typically, these headers may some set of available protocol headers. Typically, these headers may
include all or a subset of (IPv6)Flow-Label, IP-source, IP- include all or a subset of (IPv6) Flow-Label, IP-source, IP-
destination, protocol, source-port, destination-port and potentially destination, protocol, source-port, destination-port and potentially
others such as ingress interface. others such as ingress interface.
A problem common to this approach of distribution through hashing is A problem common to this approach of distribution through hashing is
impact on path MTU discovery. An ICMPv6 type 2 PTB message generated impact on path MTU discovery. An ICMPv6 type 2 PTB message generated
on an intermediate device for a packet sent from an a server that is on an intermediate device for a packet sent from a server that is
part of an ECMP load balanced service to a client, will have the part of an ECMP load balanced service to a client will have the load
load-balanced anycast address as the destination and would be balanced anycast address as the destination and hence will be
statelessly load balanced to one of the servers. While the ICMPv6 statelessly load balanced to one of the servers. While the ICMPv6
PTB message contains as much of the packet that could not be PTB message contains as much of the packet that could not be
forwarded as possible, the payload headers are not considered into forwarded as possible, the payload headers are not considered in the
the forwarding decision and are ignored. Because the PTB message is forwarding decision and are ignored. Because the PTB message is not
not identifiable as part of the original flow by the IP or upper identifiable as part of the original flow by the IP or upper layer
layer packet headers the results of the ICMPv6 ECMP hash are unlikely packet headers, the results of the ICMPv6 ECMP hash are unlikely to
to be hashed to the same nexthop as packets matching TCP or UDP ECMP be hashed to the same nexthop as packets matching TCP or UDP ECMP
hash. hash.
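The mismatch described above can be illustrated with a minimal sketch. It assumes a router that hashes the outer source address, destination address, protocol and ports; the SHA-256 based hash, the addresses, the port numbers and the path count are all hypothetical stand-ins for illustration, not the behaviour of any particular router.

```python
import hashlib

def ecmp_next_hop(src, dst, proto, sport, dport, n_paths):
    """Pick a next-hop index from a hash of the outer header fields.

    Stand-in for a router's ECMP hash; real devices use vendor-specific
    functions and field selections.
    """
    key = "|".join(str(f) for f in (src, dst, proto, sport, dport)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Hypothetical addresses taken from the IPv6 documentation prefix.
ANYCAST = "2001:db8::1"            # load-balanced service address
CLIENT = "2001:db8:c::10"          # client behind a tunnel
TUNNEL_ROUTER = "2001:db8:ffff::1" # intermediate device that emits the PTB
N_PATHS = 8

# Client-to-service TCP segment: hashed on the client's 5-tuple.
tcp_hop = ecmp_next_hop(CLIENT, ANYCAST, 6, 51515, 443, N_PATHS)

# PTB sent to the service address: the outer source is the tunnel router,
# the next header is ICMPv6 (58), and there are no ports to hash.
ptb_hop = ecmp_next_hop(TUNNEL_ROUTER, ANYCAST, 58, 0, 0, N_PATHS)

print("TCP flow ->", tcp_hop, " PTB ->", ptb_hop)
```

Because the two hash inputs share only the destination address, the two results are unrelated; with a uniform hash over N paths they would coincide only about one time in N.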
An example packet flow and topology follow. An example packet flow and topology follow.
ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination
router --> load balancer 1 ---> router --> load balancer 1 --->
\\--> load balancer 2 ---> load-balanced service \\--> load balancer 2 ---> load-balanced service
\--> load balancer N ---> \--> load balancer N --->
Figure 1 Figure 1
The router ECMP decision is used because it is part of the forwarding The router ECMP decision is used because it is part of the forwarding
architecture, can be performed at line rate, and does not depend on architecture, can be performed at line rate, and does not depend on
shared state or coordination across a distributed forwarding system shared state or coordination across a distributed forwarding system
which may include multiple linecards or routers. The ECMP routing which may include multiple linecards or routers. The ECMP routing
decision is deterministic with respect to packets having the same decision is deterministic with respect to packets having the same
computed hash. computed hash.
Atypical case where ICMPv6 PTB messages are received at the load A typical case where ICMPv6 PTB messages are received at the load
balancer is a case where the path MTU from the client to the load balancer is a case where the path MTU from the client to the load
balancer is limited by a tunnel of which the client itself is not balancer is limited by a tunnel of which the client itself is not
aware. In the common case of a TCP flow where TLS is employed, aware.
the first packet sent from the server that is likely to exceed a
tunnel MTU lower than that specified by the MSS on the client and the
load balancer/server is the TLS ServerHello and certificate.
Direct experience says that the frequency of PTB messages is small Direct experience says that the frequency of PTB messages is small
compared to total flows. One possible conclusion is that tunneled compared to total flows. One possible conclusion is that tunneled
IPv6 deployments that cannot carry 1500 mtu packets are relatively IPv6 deployments that cannot carry 1500 MTU packets are relatively
rare. Techniques employed by clients such as happy-eyeballs may rare. Techniques employed by clients such as happy-eyeballs may
actually contribute some amelioration to the IPv6 client experience actually contribute some amelioration to the IPv6 client experience
by preferring IPv4 in cases that might be identified as failures. by preferring IPv4 in cases that might be identified as failures.
Still, the expectation of operators is that PMTUD should work and Still, the expectation of operators is that PMTUD should work and
that unnecessary breakage of client traffic should be avoided. that unnecessary breakage of client traffic should be avoided.
A final observation regarding server tuning is that it is not always A final observation regarding server tuning is that it is not always
possible even if it is potentially desirable to be able to possible even if it is potentially desirable to be able to
independently set the TCP MSS for different address families on end- independently set the TCP MSS for different address families on some
systems. end-systems. On Linux platforms, advmss may be set on a per route
basis for selected destinations in cases where discrimination by
route is possible.
The problem as described does also impact IPv4; however, the ability The problem as described does also impact IPv4; however
to fragment on wire at tunnel ingress points and the relative rarity implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to
of sub-1500 byte MTUs that are not coupled to changes in client fragment on wire at tunnel ingress points and the relative rarity of
behavior (for example, endpoint VPN clients set the tunnel interface sub-1500 byte MTUs that are not coupled to changes in client behavior
MTU accordingly for performance reasons) makes the problem (for example, endpoint VPN clients set the tunnel interface MTU
sufficiently rare that some existing deployments simply choose to accordingly for performance reasons) makes the problem sufficiently
ignore it. rare that some existing deployments have chosen to ignore it.
3. Mitigation 3. Mitigation
Mitigation of the potential for PTB messages to be mis-delivered Mitigation of the potential for PTB messages to be mis-delivered
involves ensuring that an ICMPv6 error message is distributed to the involves ensuring that an ICMPv6 error message is distributed to the
same anycast server responsible for the flow for which the error is same anycast server responsible for the flow for which the error is
generated. Ideally Mitigation could be done by the mechanism hosts generated. Ideally, mitigation could be done by the mechanism hosts
use to identify the flow, by looking into the payload of the ICMPv6 use to identify the flow, by looking into the payload of the ICMPv6
message (to determine which TCP flow it was associated with) before message (to determine which TCP flow it was associated with) before
making a forwarding decision. Because the encapsulated IP header making a forwarding decision. Because the encapsulated IP header
occurs at a fixed offset in the icmp message it is not outside the occurs at a fixed offset in the ICMP message it is not outside the
realm of possibility that routers with sufficient header processing realm of possibility that routers with sufficient header processing
capability could parse that far into the payload. Employing a capability could parse that far into the payload. Employing a
mediation device that handles the parsing and distribution of PTB mediation device that handles the parsing and distribution of PTB
messages after policy routing or on each load-balancer/server is a messages after policy routing or on each load-balancer/server is a
possibility. possibility.
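As a rough sketch of the payload inspection described above, the following shows how a mediation device might recover the original flow from a PTB message. It relies on the ICMPv6 PTB layout (4 bytes of type, code and checksum, a 4-byte MTU field, then the invoking packet) and assumes the invoking packet carries TCP or UDP directly after the fixed IPv6 header, with no extension headers. The function names are hypothetical.

```python
import ipaddress
import struct

ICMP6_PTB = 2          # ICMPv6 type 2, Packet Too Big
IPV6_HEADER_LEN = 40   # fixed IPv6 header, no extension headers assumed

def inner_flow_from_ptb(icmp6):
    """Return (mtu, src, dst, next_header, sport, dport) of the invoking packet."""
    if len(icmp6) < 8 + IPV6_HEADER_LEN + 4:
        raise ValueError("message too short to contain the inner ports")
    icmp_type, _code, _cksum, mtu = struct.unpack("!BBHI", icmp6[:8])
    if icmp_type != ICMP6_PTB:
        raise ValueError("not an ICMPv6 Packet Too Big message")
    inner = icmp6[8:]
    next_header = inner[6]
    if next_header not in (6, 17):  # TCP or UDP only in this sketch
        raise ValueError("unsupported next header %d" % next_header)
    src = ipaddress.IPv6Address(inner[8:24])
    dst = ipaddress.IPv6Address(inner[24:40])
    sport, dport = struct.unpack("!HH", inner[40:44])
    return mtu, src, dst, next_header, sport, dport

def inbound_flow_key(icmp6):
    """Swap source and destination so the key matches client-to-server packets."""
    _mtu, src, dst, proto, sport, dport = inner_flow_from_ptb(icmp6)
    return (dst, src, proto, dport, sport)
```

The invoking packet travelled from server to client, so the addresses and ports are swapped before the key is compared with, or hashed like, the client-to-server direction of the flow.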
Another mitigation approach is predicated upon distributing the PTB Another mitigation approach is predicated upon distributing the PTB
message to all anycast servers under the assumption that the one for message to all anycast servers under the assumption that the one for
which the message was intended will be able to match it to the flow which the message was intended will be able to match it to the flow
and update the route cache with the new MTU, devices not able to and update the route cache with the new MTU and that devices not able
match the flow will discard these packets. Such distribution has to match the flow will discard these packets. Such distribution has
potentially significant implications for resource consumption and the potentially significant implications for resource consumption and the
potential for self-inflicted denial-of-service if not carefully potential for self-inflicted denial-of-service if not carefully
employed. Fortunately, in real-world-deployment we have observed employed. Fortunately, in real-world deployments we have observed
that, the number of flows for which this problem occurs is relatively that the number of flows for which this problem occurs is relatively
small (for example, 10 or fewer pps on 1 Gb/s or more worth of HTTPS small (for example, 10 or fewer pps on 1 Gb/s or more worth of HTTPS
traffic) and sensible ingress rate limiters which will discard traffic) and sensible ingress rate limiters which will discard
excessive message volume can be applied to protect even very large excessive message volume can be applied to protect even very large
anycast server tiers with the potential for fallout only under anycast server tiers with the potential for fallout only under
circumstances of deliberate duress. circumstances of deliberate duress.
3.1. Alternatives 3.1. Alternatives
As an alternative, it may be appropriate to lower the TCP MSS to 1220 As an alternative, it may be appropriate to lower the TCP MSS to 1220
in order to accommodate 1280 byte MTU. We consider this undesirable in order to accommodate 1280 byte MTU. We consider this undesirable
as hosts may not be able to independently set TCP MSS by address- as hosts may not be able to independently set TCP MSS by address-
family thereby impacting IPv4, or alternatively that it relies on a family thereby impacting IPv4, or alternatively that it relies on a
middle-box to clamp the MSS independently from the end-systems. middle-box to clamp the MSS independently from the end-systems.
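The 1220 figure follows from subtracting the fixed header sizes from the IPv6 minimum MTU. A short worked example, assuming a 40-byte IPv6 header, 20-byte IPv4 and TCP headers, and no extension headers or IP options (the helper name is hypothetical):

```python
IPV6_MIN_MTU = 1280  # minimum link MTU every IPv6 path must support
IPV6_HEADER = 40
IPV4_HEADER = 20
TCP_HEADER = 20      # without options

def mss_for_mtu(mtu, ip_header):
    """Largest TCP MSS whose full-sized segment fits in the given MTU."""
    return mtu - ip_header - TCP_HEADER

mss = mss_for_mtu(IPV6_MIN_MTU, IPV6_HEADER)

# If the same MSS also applies to IPv4, full-sized IPv4 packets shrink too.
ipv4_packet = mss + IPV4_HEADER + TCP_HEADER

print(mss, ipv4_packet)  # 1220 1260
```

The second number shows the cost on hosts that cannot set the MSS per address family: IPv4 packets are held to 1260 bytes even on paths that carry 1500.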
3.2. Implementation 3.2. Implementation
1. Filter-based-forwarding matches next-header ICMPv6 type-2 and 1. Filter-based-forwarding matches next-header ICMPv6 type-2 and
matches a next-hop on a particular subnet directly attached to matches a next-hop on a particular subnet directly attached to
both border routers. The filter is policed to reasonable limits both border routers. The filter is policed to reasonable limits
(we chose 1000pps). (we chose 1000pps; more conservative rates might be required in
other implementations).
2. Filter is applied on input side of all external interfaces 2. Filter is applied on input side of all external interfaces
3. A proxy located at the next-hop forwards ICMPv6 type-2 packets 3. A proxy located at the next-hop forwards ICMPv6 type-2 packets
received at the next-hop to an Ethernet broadcast address received at the next-hop to an Ethernet broadcast address
(example ff:ff:ff:ff:ff:ff) on all specified subnets. This was (example ff:ff:ff:ff:ff:ff) on all specified subnets. This was
necessitated by router inability (in IPv6) to forward the same necessitated by router inability (in IPv6) to forward the same
packet to multiple unicast next-hops. packet to multiple unicast next-hops.
4. Anycast servers receive the PTB error and process packet as 4. Anycast servers receive the PTB error and process packet as
skipping to change at page 6, line 33 skipping to change at page 6, line 33
if __name__ == '__main__': if __name__ == '__main__':
main() main()
This example script listens on all interfaces for IPv6 PTB errors This example script listens on all interfaces for IPv6 PTB errors
being forwarded using filter-based-forwarding. It removes the being forwarded using filter-based-forwarding. It removes the
existing Ethernet source and rewrites a new Ethernet destination of existing Ethernet source and rewrites a new Ethernet destination of
the Ethernet broadcast address. It then sends the resulting frame the Ethernet broadcast address. It then sends the resulting frame
out the p2p1 and p2p2 interfaces where our anycast servers reside. out the p2p1 and p2p2 interfaces where our anycast servers reside.
Alternatively, network designs in which a common layer2 network 3.2.1. Alternatives
exists could rewrite the destination on the end system, for example
in using iptables before forwarding the packet back to the network Alternatively, network designs in which a common layer 2 network
containing all of the server or load balancer interfaces. exists on the ECMP hop could distribute the proxy onto the end
systems, eliminating the need for policy routing. They could then
rewrite the destination -- for example, using iptables before
forwarding the packet back to the network containing all of the
server or load balancer interfaces. This implementation can be done
entirely within the Linux iptables firewall. Because of the
distributed nature of the filter, more conservative rate limits are
required than when a global rate limit can be employed.
An example ip6tables / nftables rule to match icmp6 traffic, not
match broadcast traffic, impose a rate limit of 10 pps, and pass to a
target destination would resemble:
ip6tables -I INPUT -i lo -p icmpv6 -m icmpv6 --icmpv6-type 2/0 \
-m pkttype ! --pkt-type broadcast -m limit --limit 10/second \
-j TEE --gateway 2001:DB8::1
As with the scapy example, once the destination has been rewritten
from a hardcoded ND entry to an Ethernet broadcast address -- in this
case to an IPv6 documentation address -- the traffic will be
reflected to all the hosts on the subnet.
4. Improvements 4. Improvements
There are several ways that improvements could be made to the There are several ways that improvements could be made to the
situation with respect to ECMP load balancing of ICMPv6 PTB. situation with respect to ECMP load balancing of ICMPv6 PTB.
1. Routers with sufficient capacity within the lookup process could 1. Routers with sufficient capacity within the lookup process could
parse all the way through the L3 or L4 header in the ICMPv6 parse all the way through the L3 or L4 header in the ICMPv6
payload beginning at bit offset 32 of the ICMP header. By payload beginning at bit offset 32 of the ICMP header. By
reordering the elements of the hash to match the inward direction reordering the elements of the hash to match the inward direction
of the flow, the PTB error could be directed to the same next-hop of the flow, the PTB error could be directed to the same next-hop
as the incoming packets in the flow. as the incoming packets in the flow.
2. The FIB could be programmed with a multicast distribution tree 2. The FIB (Forwarding Information Base) on the router could be
that included all of the necessary next-hops. programmed with a multicast distribution tree that included all
of the necessary next-hops.
3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization 3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
Layer Path MTU Discovery would probably go a long way towards Layer Path MTU Discovery would probably go a long way towards
reducing dependence on ICMPv6 PTB. reducing dependence on ICMPv6 PTB.
5. Acknowledgements 5. Acknowledgements
The authors would like to thank Mark Andrews, Brian Carpenter, Nick The authors would like to thank Marek Majkowski for contributing
Hilliard and Ray Hunter, for review. text, examples, and a very close review. The authors would like to
thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter,
for review.
6. IANA Considerations 6. IANA Considerations
This memo includes no request to IANA. This memo includes no request to IANA.
7. Security Considerations 7. Security Considerations
The employed mitigation has the potential to greatly amplify the The employed mitigation has the potential to greatly amplify the
impact of a deliberately malicious sending of ICMPv6 PTB messages. impact of a deliberately malicious sending of ICMPv6 PTB messages.
Sensible ingress rate limiting can reduce the potential for impact; Sensible ingress rate limiting can reduce the potential for impact;
skipping to change at page 7, line 44 skipping to change at page 8, line 19
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007. Discovery", RFC 4821, March 2007.
Authors' Addresses Authors' Addresses
Matt Byerly Matt Byerly
Fastly Fastly
Kapolei, HI Kapolei, HI
US US
Email: mbyerly@zynga.com Email: suckawha@gmail.com
Matt Hite Matt Hite
Evernote Evernote
Redwood City, CA Redwood City, CA
US US
Email: mhite@hotmail.com Email: mhite@hotmail.com
Joel Jaeggli Joel Jaeggli
Fastly Fastly
Mountain View, CA Mountain View, CA
US US
Email: joelja@gmail.com Email: joelja@gmail.com
 End of changes. 27 change blocks. 
50 lines changed or deleted 75 lines changed or added
