draft-ietf-tsvwg-tcp-eifel-response-03.txt   draft-ietf-tsvwg-tcp-eifel-response-04.txt 
Network Working Group Reiner Ludwig Network Working Group Reiner Ludwig
INTERNET-DRAFT Ericsson Research INTERNET-DRAFT Ericsson Research
Expires: September 2003 Andrei Gurtov Expires: April 2004 Andrei Gurtov
Sonera Corporation TeliaSonera
March, 2003 October, 2003
The Eifel Response Algorithm for TCP The Eifel Response Algorithm for TCP
<draft-ietf-tsvwg-tcp-eifel-response-03.txt> <draft-ietf-tsvwg-tcp-eifel-response-04.txt>
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts. groups may also distribute working documents as Internet-Drafts.
skipping to change at page 1, line 34 skipping to change at page 1, line 34
material or cite them other than as "work in progress". material or cite them other than as "work in progress".
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
Abstract Abstract
The Eifel response algorithm requires a detection algorithm to detect Based on an appropriate detection algorithm, the Eifel response
a posteriori whether the TCP sender has entered loss recovery algorithm provides a way for a TCP sender to respond to a detected
unnecessarily. In response to a spurious timeout it adapts the spurious timeout. It adapts the retransmission timer to avoid further
retransmission timer to avoid further spurious timeouts, and can spurious timeouts, and can avoid - depending on the detection
avoid - depending on the detection algorithm - the often unnecessary algorithm - the often unnecessary go-back-N retransmits that would
go-back-N retransmits that would otherwise be sent. Likewise, it otherwise be sent. In addition, the Eifel response algorithm restores
adapts the duplicate acknowledgement threshold in response to a the congestion control state in such a way that packet bursts are
spurious fast retransmit. In both cases, the Eifel response algorithm avoided.
restores the congestion control state in such a way that packet
bursts are avoided.
Terminology Terminology
The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this
document, are to be interpreted as described in [RFC2119]. document, are to be interpreted as described in [RFC2119].
We refer to the first-time transmission of an octet as the 'original We refer to the first-time transmission of an octet as the 'original
transmit'. A subsequent transmission of the same octet is referred to transmit'. A subsequent transmission of the same octet is referred to
as a 'retransmit'. In most cases this terminology can likewise be as a 'retransmit'. In most cases this terminology can likewise be
applied to data segments as opposed to octets. However, when applied to data segments as opposed to octets. However, when
repacketization occurs, a segment can contain both first-time repacketization occurs, a segment can contain both first-time
transmissions and retransmissions of octets. In that case this transmissions and retransmissions of octets. In that case, this
terminology is only consistent when applied to octets. For the Eifel terminology is only consistent when applied to octets. For the Eifel
detection and response algorithms this makes no difference as they detection and response algorithms this makes no difference as they
also operate correctly when repacketization occurs. also operate correctly when repacketization occurs.
We use the term 'acceptable ACK' as defined in [RFC793]. That is, an We use the term 'acceptable ACK' as defined in [RFC793]. That is, an
ACK that acknowledges previously unacknowledged data. We use the term ACK that acknowledges previously unacknowledged data. We use the TCP
'duplicate ACK', and the variable 'dupacks' as defined in [WS95]. The sender state variables 'SND.UNA' and 'SND.NXT' as defined in
variable 'dupacks' is a counter of duplicate ACKs that have already [RFC793]. SND.UNA holds the segment sequence number of the oldest
been received by the TCP sender before the fast retransmit is sent. outstanding segment. SND.NXT holds the segment sequence number of the
We use the variable 'DupThresh' to refer to the so-called duplicate next segment the TCP sender will (re-)transmit. In addition, we
acknowledgement threshold, i.e., the number of duplicate ACKs that define as 'SND.MAX' the segment sequence number of the next original
need to arrive at the TCP sender to trigger a fast retransmit. transmit to be sent. The definition of SND.MAX is equivalent to the
Currently, DupThresh is specified as a fixed value of three definition of 'snd_max' in [WS95].
[RFC2581].
Furthermore, we use the TCP sender state variables 'SND.UNA' and
'SND.NXT' as defined in [RFC793]. SND.UNA holds the segment sequence
number of the oldest outstanding segment. SND.NXT holds the segment
sequence number of the next segment the TCP sender will
(re-)transmit. In addition, we define as 'SND.MAX' the segment
sequence number of the next original transmit to be sent. The
definition of SND.MAX is equivalent to the definition of snd_max in
[WS95].
We use the TCP sender state variables 'cwnd' (congestion window), and We use the TCP sender state variables 'cwnd' (congestion window), and
'ssthresh' (slow start threshold), and the terms 'SMSS', 'ssthresh' (slow-start threshold), and the terms 'FlightSize', and
'FlightSize', and 'Initial Window (IW)' as defined in [RFC2581]. 'Initial Window (IW)' as defined in [RFC2581]. FlightSize is the
FlightSize is the amount of outstanding data in the network, or amount of outstanding data at a given point in time. The IW is the
alternatively, the difference between SND.MAX and SND.UNA at a given size of the sender's congestion window after the three-way handshake
point in time. The IW is the size of the sender's congestion window is completed. We use the TCP sender state variables 'SRTT' and
after the three-way handshake is completed. We use the TCP sender 'RTTVAR', and the term 'RTO' as defined in [RFC2988]. In addition, we
state variables 'SRTT' and 'RTTVAR', and the term 'RTO' as defined in assume that the TCP sender maintains in the variable 'RTT-SAMPLE' the
[RFC2988]. In addition, we assume that the TCP sender maintains in value of the latest round-trip time (RTT) measurement.
the variable 'RTT-SAMPLE' the value of the latest round-trip time
(RTT) measurement.
1. Introduction 1. Introduction
The Eifel response algorithm relies on a detection algorithm such as The Eifel response algorithm relies on a detection algorithm such as
the Eifel detection algorithm defined in [RFC***B]. That document the Eifel detection algorithm defined in [RFC3522]. That document
discusses the relevant background and motivation that also applies to discusses the relevant background and motivation that also applies to
this document. Hence, the reader is expected to be familiar with this document. Hence, the reader is expected to be familiar with
[RFC***B]. Note that alternative response algorithms have been [RFC3522]. Note that alternative response algorithms have been
proposed [BDA03] that could also rely on the Eifel detection proposed [BA02] that could also rely on the Eifel detection
algorithm, and vice versa alternative detection algorithms have been algorithm, and vice versa alternative detection algorithms have been
proposed [BA02b], [SK03] that could work together with the Eifel proposed [BA03], [SK03] that could work together with the Eifel
response algorithm. response algorithm.
The Eifel response algorithm requires a detection algorithm to detect Based on an appropriate detection algorithm, the Eifel response
a posteriori whether the TCP sender has entered loss recovery algorithm provides a way for a TCP sender to respond to a detected
unnecessarily. In response to a spurious timeout it adapts the spurious timeout. It adapts the retransmission timer to avoid further
retransmission timer to avoid further spurious timeouts, and can spurious timeouts, and can avoid - depending on the detection
avoid - depending on the detection algorithm - the often unnecessary algorithm - the often unnecessary go-back-N retransmits that would
go-back-N retransmits that would otherwise be sent. Likewise, it otherwise be sent. In addition, the Eifel response algorithm restores
adapts the duplicate acknowledgement threshold in response to a the congestion control state in such a way that packet bursts are
spurious fast retransmit. In both cases, the Eifel response algorithm avoided.
restores the congestion control state in such a way that packet
bursts are avoided. Note: A previous version of the Eifel Response algorithm also
included a response to a detected spurious fast retransmit.
However, since a consensus was not reached about how to adapt the
duplicate acknowledgement threshold in that case, that part of the
algorithm was removed for the time being.
2. Interworking with Detection Algorithms 2. Interworking with Detection Algorithms
If the Eifel response algorithm is implemented at the TCP sender, it If the Eifel response algorithm is implemented at the TCP sender, it
MUST be implemented together with a detection algorithm that is MUST be implemented together with a detection algorithm that is
specified in an RFC. specified in an RFC.
Designers of detection algorithms who want to offer the possibility Designers of detection algorithms who want their algorithms to work
that their detection algorithms can work together with the Eifel together with the Eifel response algorithm should reuse the variable
response algorithm MUST reuse the variable SpuriousRecovery with the SpuriousRecovery with the semantics and defined values as specified
semantics and defined values as specified in [RFC***B]. In addition, in [RFC3522]. In addition, we define LATE_SPUR_TO (equal -1) as
we define LATE_SPUR_TO (equal -1) as another possible value of the another possible value of the variable SpuriousRecovery. Detection
variable SpuriousRecovery. Detection algorithms must set the value of algorithms should set the value of SpuriousRecovery to LATE_SPUR_TO
SpuriousRecovery to LATE_SPUR_TO if the detection is based upon if the detection of a spurious retransmit is based upon receiving the
receiving the ACK for the retransmit. For example, this applies to ACK for the retransmit (as opposed to the ACK for the original
detection algorithms that are based on the DSACK option. transmit). For example, this applies to detection algorithms that are
based on the DSACK option [BA03].
3. The Eifel Response Algorithm 3. The Eifel Response Algorithm
The complete algorithm is specified in section 2.1. In sections 2.2 The complete algorithm is specified in section 3.1. In sections 3.2
to 2.4, we motivate the different steps of the algorithm. to 3.5, we motivate the different steps of the algorithm.
3.1. The Algorithm 3.1. The Algorithm
Given that a TCP sender has enabled a detection algorithm that Given that a TCP sender has enabled a detection algorithm that
complies with the requirements set in Section 2, a TCP sender MAY use complies with the requirements set in Section 2, a TCP sender MAY use
the Eifel response algorithm as defined in this subsection. the Eifel response algorithm as defined in this subsection.
If the Eifel response algorithm is used, the following steps MUST be If the Eifel response algorithm is used, the following steps MUST be
taken by the TCP sender, but only upon initiation of loss recovery, taken by the TCP sender, but only upon initiation of loss recovery,
i.e., when either the timeout-based retransmit or the fast retransmit i.e., when the timeout-based retransmit is sent. Note: The algorithm
is sent. Note: The algorithm MUST NOT be reinitiated after loss MUST NOT be reinitiated after loss recovery has already started. In
recovery has already started. In particular, it may not be particular, it may not be reinitiated upon subsequent timeouts for
reinitiated upon subsequent timeouts for the same segment, and not the same segment, and not upon retransmitting segments other than the
upon retransmitting segments other than the oldest outstanding oldest outstanding segment.
segment.
(0) Before the variables cwnd and ssthresh get updated when (INIT) Before the variables cwnd and ssthresh get updated when
loss recovery is initiated, set a "pipe_prev" variable as loss recovery is initiated, set a "pipe_prev" variable as
follows: follows:
pipe_prev <- max (FlightSize, ssthresh) pipe_prev <- max (FlightSize, ssthresh)
(DTCT) This is a placeholder for a detection algorithm that must (DET) This is a placeholder for a detection algorithm that must
be executed at this point. In case [RFC***B] is used as be executed at this point. In case [RFC3522] is used as
the detection algorithm, steps (1) - (6) of that algorithm the detection algorithm, steps (1) - (6) of that algorithm
go here. go here.
(RESP) If SpuriousRecovery equals FALSE, then proceed to step (RESP) If SpuriousRecovery equals SPUR_TO, then
(DONE), proceed to step (STO.1),
else if SpuriousRecovery equals SPUR_TO, then proceed to
step (STO.1),
else if SpuriousRecovery equals LATE_SPUR_TO, then proceed else if SpuriousRecovery equals LATE_SPUR_TO, then
to step (STO.2), proceed to step (STO.2),
else (spurious fast retransmit) proceed to step (SFR). else
proceed to step (DONE).
(STO.1) Resume transmission off the top: (STO.1) Resume transmission off the top:
Set Set
SND.NXT <- SND.MAX SND.NXT <- SND.MAX
(STO.2) Adapt the Conservativeness of the Retransmission Timer: (STO.2) Adapt the Conservativeness of the Retransmission Timer:
If the retransmission timer is implemented according to If the retransmission timer is implemented according to
[RFC2988], then change the calculation of SRTT to [RFC2988], then
SRTT <- SRTT + 1/FlightSize * (RTT-SAMPLE - SRTT) if the TCP Timestamps option [RFC1323] is enabled for
and set this connection, then set
SRTT <- RTT-SAMPLE SRTT <- RTT-SAMPLE
RTTVAR <- RTT-SAMPLE/2, RTTVAR <- RTT-SAMPLE/2
recalculate the RTO, and restart the retransmission timer,
Note: Even after changing the calculation of SRTT, the
retransmission timer is considered as being
implemented according to [RFC2988].
else adapt the conservativeness of the retransmission
timer.
Proceed to step (ReCC).
(SFR) Adapt the duplicate acknowledgement threshold: else set
RTTVAR <- max (2 * RTTVAR, SRTT)
SRTT <- 2 * SRTT
Set Set
DupThresh <- max (DupThresh, SpuriousRecovery) RTO <- SRTT + max (G, 4*RTTVAR)
Restart the retransmission timer
else
appropriately adapt the conservativeness of the
retransmission timer that is implemented.
Proceed to step (ReCC). Proceed to step (ReCC).
(ReCC) Revert the congestion control state: (ReCC) Reversing the congestion control state:
If the acceptable ACK has the ECN-Echo flag [RFC3168] set If the acceptable ACK has the ECN-Echo flag [RFC3168] set,
OR the TCP sender has already taken more than three then
timeouts for the oldest outstanding segment, then proceed proceed to step (DONE),
to step (DONE),
else set else set
cwnd <- min (pipe_prev, (FlightSize + IW)) cwnd <- FlightSize + min (bytes_acked, IW)
ssthresh <- pipe_prev ssthresh <- pipe_prev
Proceed to step (DONE). Proceed to step (DONE).
(CWV) Interworking with Congestion Window Validation (the
variables 'T_last' and 'tcpnow' are defined in [RFC2861]):
If congestion window validation is implemented according
to [RFC2861], then set
T_last <- tcpnow
(DONE) No further processing. (DONE) No further processing.
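The dispatch in step (RESP) of the -04 revision, together with the state change in step (STO.1), can be sketched as follows. This is a minimal illustration, not a reference implementation. The SenderState class and the resp_steps function are hypothetical names introduced here. The values FALSE (0) and SPUR_TO (1) are assumed to follow [RFC3522]; only LATE_SPUR_TO (-1) is defined in this document. Steps (STO.2) and (ReCC) are only named, not executed.

```python
from dataclasses import dataclass

# SpuriousRecovery values. LATE_SPUR_TO is defined in this draft;
# FALSE and SPUR_TO are assumed to match [RFC3522].
FALSE = 0
SPUR_TO = 1
LATE_SPUR_TO = -1


@dataclass
class SenderState:
    """Hypothetical container for the variables touched by step (STO.1)."""
    snd_una: int
    snd_nxt: int
    snd_max: int


def resp_steps(spurious_recovery: int, state: SenderState) -> list:
    """Return the step names visited for one acceptable ACK; apply (STO.1)."""
    steps = ["RESP"]
    if spurious_recovery == SPUR_TO:
        # (STO.1) Resume transmission off the top, then fall through.
        state.snd_nxt = state.snd_max
        steps += ["STO.1", "STO.2", "ReCC"]
    elif spurious_recovery == LATE_SPUR_TO:
        # Detection was based on the ACK for the retransmit: skip (STO.1).
        steps += ["STO.2", "ReCC"]
    # Any other value: no response, go straight to (DONE).
    steps.append("DONE")
    return steps


state = SenderState(snd_una=1000, snd_nxt=1000, snd_max=5000)
print(resp_steps(SPUR_TO, state), state.snd_nxt)
```

With LATE_SPUR_TO the go-back-N retransmits have typically already been sent by the time the spurious timeout is detected, which is why only the timer and the congestion control state are adjusted in that branch.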
3.2 Responding to Spurious Timeouts 3.2 Storing the Current Congestion Control State (step INIT)
3.2.1 Suppressing the Unnecessary go-back-N Retransmits (step STO.1) The TCP sender stores in pipe_prev what is considered a "safe" slow-
start threshold (ssthresh) before loss recovery is initiated, i.e.,
before the loss indication is taken into account. This is either the
current FlightSize if the TCP sender is in congestion avoidance or
the current ssthresh if the TCP sender is in slow-start. If the TCP
sender later detects that it has entered loss recovery unnecessarily,
then pipe_prev is used in step (ReCC) to reverse the congestion
control state. Thus, until the loss recovery phase is terminated,
pipe_prev maintains a memory of the congestion control state of the
time right before the loss recovery phase was initiated. A similar
approach is proposed in [RFC2861], where this state is stored in
ssthresh directly after a TCP sender has become application-limited.
There had been debates about whether the value of pipe_prev should be
decayed over time, e.g., upon subsequent timeouts for the same
outstanding segment. We do not require the decaying of pipe_prev for
the Eifel response algorithm, and do not believe that such a
conservative approach is warranted. Instead, we follow the idea
of revalidating the congestion window through slow-start as suggested
in [RFC2861]. That is, in step (ReCC), the cwnd is reset to a value
that avoids large packet bursts, while ssthresh is reset to the value
of pipe_prev. Note that [RFC2581] and [RFC2861] also do not require a
decaying of ssthresh after it has been reset in response to a loss
indication, or after a TCP sender has become application-limited.
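As a small illustration of step (INIT), the following sketch computes pipe_prev for one sender in congestion avoidance and one in slow-start. The function name store_pipe_prev and the byte values are hypothetical and chosen only for the example.

```python
def store_pipe_prev(flight_size: int, ssthresh: int) -> int:
    """Step (INIT): pipe_prev <- max (FlightSize, ssthresh), in bytes."""
    return max(flight_size, ssthresh)


# Congestion avoidance: FlightSize has grown beyond ssthresh,
# so the current FlightSize is remembered.
ca = store_pipe_prev(flight_size=20000, ssthresh=12000)

# Slow-start: FlightSize is still below ssthresh,
# so the current ssthresh is remembered.
ss = store_pipe_prev(flight_size=6000, ssthresh=64000)

print(ca, ss)
```

The value must be taken before cwnd and ssthresh are reduced for the timeout; otherwise the reduced ssthresh would be stored.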
3.3 Responding to Spurious Timeouts
3.3.1 Suppressing the Unnecessary go-back-N Retransmits (step STO.1)
Without the use of the TCP timestamps option, the TCP sender suffers Without the use of the TCP timestamps option, the TCP sender suffers
from the retransmission ambiguity problem [Zh86], [KP87]. This means from the retransmission ambiguity problem [Zh86], [KP87]. Hence, when
that when the first acceptable ACK arrives after a spurious timeout, the first acceptable ACK arrives after a spurious timeout, the TCP
the TCP sender must believe that that ACK was sent in response to the sender must assume that this ACK was sent in response to the
retransmit when in fact it was sent in response to the original retransmit when in fact it was sent in response to the original
transmit. Furthermore, the TCP sender must also believe that all transmit. Furthermore, the TCP sender must also assume that all
other segments outstanding at that point were lost. other segments outstanding at that point were lost.
Note: Except for certain cases where original ACKs were lost, that Note: Except for certain cases where original ACKs were lost, the
first acceptable ACK cannot carry any DSACK option [RFC2883]. first acceptable ACK cannot carry any DSACK option [RFC2883].
Consequently, once the TCP sender's state has been updated after the Consequently, once the TCP sender's state has been updated after the
first acceptable ACK has arrived, SND.NXT equals SND.UNA. This is first acceptable ACK has arrived, SND.NXT equals SND.UNA. This is
what causes the often unnecessary go-back-N retransmits. Now every what causes the often unnecessary go-back-N retransmits. From that
arriving acceptable ACK that was sent in response to an original point on every arriving acceptable ACK that was sent in response to
transmit will advance SND.NXT. But as long as SND.NXT is smaller than an original transmit will advance SND.NXT. But as long as SND.NXT is
the value that SND.MAX had when the timeout occurred, those ACKs will smaller than the value that SND.MAX had when the timeout occurred,
clock out retransmits; whether those segments were lost or not. those ACKs will clock out retransmits, whether those segments were
lost or not.
In fact, during this phase the TCP sender breaks 'packet In fact, during this phase the TCP sender breaks 'packet
conservation' [Jac88]. This is because the go-back-N retransmits are conservation' [Jac88]. This is because the go-back-N retransmits are
sent during slow start. I.e., for each original transmit leaving the sent during slow-start. I.e., for each original transmit leaving the
network, two retransmits are sent into the network as long as SND.NXT network, two retransmits are sent into the network as long as SND.NXT
does not equal SND.MAX (see [LK00] for more detail). does not equal SND.MAX (see [LK00] for more detail).
The use of the TCP timestamps option reliably eliminates the The use of the TCP timestamps option reliably eliminates the
retransmission ambiguity problem. Thus, once the Eifel detection retransmission ambiguity problem. Once the Eifel detection algorithm
algorithm detected that a timeout was spurious, it is therefore safe has detected that a timeout was spurious, it is therefore safe to let
to let the TCP sender resume the transmission with new data. Thus, the TCP sender resume the transmission with new data. Thus, the Eifel
the Eifel response algorithm changes the TCP sender's state by response algorithm changes the TCP sender's state by setting SND.NXT
setting SND.NXT to SND.MAX in that case. to SND.MAX in that case.
3.2.2 Adapting the Retransmission Timer (step STO.2) 3.3.2 Adapting the Retransmission Timer (step STO.2)
There is currently only one retransmission timer standardized for TCP There is currently only one retransmission timer standardized for TCP
[RFC2988]. We therefore only address that timer explicitly. Future [RFC2988]. We therefore only address that timer explicitly. Future
standards that might define alternatives to [RFC2988] should propose standards that might define alternatives to [RFC2988] should propose
similar measures to adapt the conservativeness of the retransmission similar measures to adapt the conservativeness of the retransmission
timer. timer.
Since the timeout was spurious, the TCP sender's RTT estimators are Since the timeout was spurious, the TCP sender's RTT estimators are
likely to be off. However, since timestamps are being used, a new and likely to be off. If timestamps are enabled for this connection, a
valid RTT measurement (RTT-SAMPLE) can be derived from the acceptable new and valid RTT measurement (RTT-SAMPLE) can be derived from the
ACK. It is therefore suggested to reinitialize the RTT estimators acceptable ACK. It is therefore suggested to reinitialize the RTT
from RTT-SAMPLE. Note that this RTT-SAMPLE will be relatively large estimators from RTT-SAMPLE according to rule (2.2) of RFC2988. Note
since it will include the delay spike that caused the spurious that this RTT-SAMPLE will be relatively large since it will include
timeout in the first place. To have the new RTO become effective, the the delay spike that caused the spurious timeout in the first place.
retransmission timer needs to be restarted. This is consistent with If timestamps are not enabled for this connection, the TCP sender
[RFC2988] which recommends restarting the retransmission timer with should instead double SRTT and also make RTTVAR more conservative.
the arrival of an acceptable ACK.
When the path's RTT varies widely, it is recommended to take RTT
samples more frequently than only once per RTT. This allows the TCP
sender to track changes in the RTT more closely. In particular, a TCP
sender can react more quickly to sudden increases of the RTT by
sooner updating the RTO to a more conservative value. The TCP
Timestamps option [RFC1323] provides this capability, allowing the
TCP sender to sample the RTT from every segment that is acknowledged.
Using timestamps across such paths leads to a more conservative TCP
retransmission timer and reduces the risk of triggering spurious
timeouts [IMLGK02].
On the other hand, it is known that executing the RTO calculation
defined in [RFC2988] more often than once per RTT leads to an RTO
that decays too quickly, i.e., that converges to the RTT too quickly.
This is because of the fixed gains (1/8 and 1/4) of RFC2988's RTT
estimators. When timing every segment these gains are increasingly
too large with an increasing FlightSize. This leads to the effect
that the RTT estimators "lose" their memory too soon. This is a known
conflict between [RFC2988] and [RFC1323]. Especially, a large RTO
resulting from an RTT spike will decay within one or two RTTs (e.g.,
see [LS00]). Hence, simply reinitializing RFC2988's RTT estimators
from RTT-SAMPLE is probably not enough to make the retransmission
timer sufficiently conservative for at least the next couple of RTTs.
A solution for the case when every segment is timed according to
[RFC1323] is to make the gains adaptive to the FlightSize [LS00]. We
suggest adopting this solution for at least the SRTT.
3.3 Responding to Spurious Fast Retransmits (step SFR)
The assumption behind the fast retransmit algorithm [RFC2581] is that
a segment was lost if as many duplicate ACKs have arrived at the TCP
sender as indicated by DupThresh. Currently, DupThresh is specified
as a fixed value of three [RFC2581]. That value is assumed to be
sufficiently conservative so that packet reordering and/or packet
duplication does not falsely trigger the fast retransmit algorithm.
Clearly, this assumption does not hold for a particular TCP
connection once the TCP sender detects that the last fast retransmit
was spurious. It is therefore suggested to dynamically adapt
DupThresh to the reordering characteristics observed over the course
of a particular connection.
At the beginning of a connection DupThresh is initialized with three.
Then for each spurious fast retransmit that is detected, DupThresh is
set to the maximum of the previous DupThresh, and the lowest value
that would have avoided that last spurious fast retransmit. Note that
the Eifel detection algorithm records the latter value in
SpuriousRecovery. This strategy ensures that the TCP sender is able
to cope with the longest reordering length seen on a particular
connection so far. However, the strategy may lead to fast timeouts
[RFC***B], i.e., an event where the retransmission timer expires
before the TCP sender receives the duplicate ACK that would trigger a
fast retransmit of the oldest outstanding segment.
Also, we believe that this strategy should be implemented together
with an advanced version of the Limited Transmit algorithm [RFC3042].
That is for each duplicate ACK that arrives until DupThresh is
reached, the TCP sender should send a new data segment if allowed by
the TCP receiver's advertised window, and if new data is available.
Although the current Limited Transmit algorithm only allows this for
the first two duplicate ACKs, we believe that such an advanced
limited transmit strategy is safe. It is already implemented in
widely deployed TCPs [SK02].
Other alternatives for responding to spurious fast retransmits are To have the new RTO become effective, the retransmission timer needs
discussed in [BA02a]. to be restarted. This is consistent with [RFC2988] which recommends
restarting the retransmission timer with the arrival of an acceptable
ACK.
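The two branches of step (STO.2) in the -04 revision for an [RFC2988] timer can be sketched as below. This is an illustration under stated assumptions: the function name adapt_rto is hypothetical, times are in milliseconds, g stands for the clock granularity G of [RFC2988], and any minimum or maximum bounds that [RFC2988] places on the RTO are not applied. Restarting the timer is left to the caller.

```python
def adapt_rto(srtt, rttvar, rtt_sample, timestamps_enabled, g):
    """Return (SRTT, RTTVAR, RTO) after step (STO.2); all values in ms."""
    if timestamps_enabled:
        # Reinitialize the estimators from the latest valid RTT sample.
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        # No valid sample is available: make both estimators more
        # conservative. RTTVAR is updated first, using the old SRTT.
        rttvar = max(2 * rttvar, srtt)
        srtt = 2 * srtt
    rto = srtt + max(g, 4 * rttvar)
    return srtt, rttvar, rto


# A 1500 ms sample that includes the delay spike, timestamps enabled.
print(adapt_rto(srtt=200, rttvar=50, rtt_sample=1500,
                timestamps_enabled=True, g=10))

# Same starting state, timestamps not enabled.
print(adapt_rto(srtt=200, rttvar=50, rtt_sample=1500,
                timestamps_enabled=False, g=10))
```

In the branch without timestamps the order of the two assignments matters, since RTTVAR is derived from the SRTT value that was in effect before doubling.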
3.4 Reverting Congestion Control State (step ReCC) 3.4 Reversing the Congestion Control State (step ReCC)
When a TCP sender enters loss recovery, it also assumes that it has When a TCP sender enters loss recovery, it also assumes that it has
received a congestion indication. In response to that it reduces received a congestion indication. In response to that it reduces
cwnd, and ssthresh. However, once the TCP sender detects that the cwnd, and ssthresh. However, once the TCP sender detects that the
loss recovery has been falsely triggered, this reduction was loss recovery has been falsely triggered, this reduction was
unnecessary. In fact, no congestion signal has been received. We unnecessary. In fact, no congestion indication has been received. We
therefore believe that it is safe to revert to the previous therefore believe that it is safe to revert to the previous
congestion control state. congestion control state following the approach of revalidating the
congestion window as outlined below. This is unless the acceptable
ACK signals congestion through the ECN-Echo flag [RFC3168]. In that
case, the TCP sender MUST refrain from reversing congestion control
state.
We suggest to restore cwnd to the minimum of the previous FlightSize, If the ECN-Echo flag is not set, cwnd is reset to the sum of the
and the current FlightSize plus IW. The latter avoids large packet current FlightSize and the minimum of IW and the number of bytes that
bursts that may occur with less careful variants for restoring have been acknowledged by the acceptable ACK. Note that the value of
congestion control state. For example, the original proposal [LK00] cwnd must not be changed any further for that ACK, and that the value
typically causes large bursts after packet reordering. The current of FlightSize at this point in time may be different from the value
proposal limits a potential packet burst to IW, which is considered of FlightSize in step (INIT). The value of IW puts a limit on the
an acceptable burst size. It is the amount of data that a TCP sender size of the packet burst that the TCP sender may send into the
may send into a yet "unprobed" network at the beginning of a network after the Eifel response algorithm has terminated. The value
connection. of IW is considered an acceptable burst size. It is the amount of
data that a TCP sender may send into a yet "unprobed" network at the
beginning of a connection.
In addition, we suggest to restore ssthresh to pipe_prev, i.e., the The TCP sender is then forced into slow-start by resetting ssthresh
maximum of the previous value of ssthresh and the value that to the value of pipe_prev. As a result, the TCP sender either
FlightSize had when loss recovery was unnecessarily entered. As a immediately resumes probing the network for more bandwidth in
result, the TCP sender either immediately resumes probing the network congestion avoidance, or it first slow-starts to what is considered a
for more bandwidth in congestion avoidance, or it first slow starts "safe" operating point for the congestion window. In some cases, this
until it has reached its previous share of the available bandwidth. can mean that the first few acceptable ACKs that arrive will not
clock out any data segments.
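A worked instance of step (ReCC) as specified in the -04 revision may help. The sketch below is illustrative only: the function name reverse_cc_state is hypothetical, and the segment size of 1460 bytes and the IW of 4380 bytes are example values, not requirements of this document.

```python
def reverse_cc_state(cwnd, ssthresh, flight_size, bytes_acked, iw,
                     pipe_prev, ecn_echo):
    """Return (cwnd, ssthresh) after step (ReCC); all values in bytes."""
    if ecn_echo:
        # The acceptable ACK signals congestion: leave the state reduced.
        return cwnd, ssthresh
    # Allow a burst of at most IW beyond what is currently outstanding,
    # and slow-start back up to the stored operating point.
    return flight_size + min(bytes_acked, iw), pipe_prev


# After the timeout: cwnd was cut to one segment, ssthresh was reduced.
# Ten segments are outstanding and the acceptable ACK covers one of them.
cwnd, ssthresh = reverse_cc_state(
    cwnd=1460, ssthresh=7300,
    flight_size=14600, bytes_acked=1460, iw=4380,
    pipe_prev=29200, ecn_echo=False)
print(cwnd, ssthresh)
```

In this example the resulting cwnd is below ssthresh, so the sender slow-starts toward pipe_prev rather than resuming at its previous rate in one step.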
Clearly, when the acceptable ACK signals congestion through the 3.5 Interworking with the Congestion Window Validation Algorithm
ECN-Echo flag [RFC3168], the TCP sender MUST refrain from reverting
congestion control state. The same is true if the TCP sender has An implementation of the Congestion Window Validation (CWV) algorithm
already taken more than three timeouts for the oldest outstanding [RFC2861] could potentially misinterpret a delay spike that caused a
segment. Allowing three timeouts while still reverting congestion spurious timeout as a phase where the TCP sender had been
control state goes beyond [RFC2581]. That standard recommends setting application-limited. To prevent the triggering of the CWV algorithm in
cwnd to no more than the restart window (one SMSS) if the TCP sender this case, the variable 'T_last' defined in [RFC2861] is reset.
has not sent data in an interval exceeding the current RTO. That is
done to restart the ACK clock which is believed to be lost. The case
in step (ReCC) of the Eifel response algorithm is different. Since
an acceptable ACK corresponding to an original transmit has finally
returned, the TCP sender has reason to believe that the ACK clock was merely
interrupted but has now resumed "ticking" again.
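The ssthresh handling described above can be illustrated with a minimal sketch. It is not taken from either draft version: the names CongestionState, compute_pipe_prev, may_revert, and restore_ssthresh are hypothetical, and window sizes are assumed to be in bytes. The guard against ECN-Echo and against more than three timeouts follows the -03 wording only. When cwnd equals ssthresh, [RFC2581] permits either phase; the sketch arbitrarily reports congestion avoidance.

```python
from dataclasses import dataclass

# Assumption: the -03 text refuses to revert congestion control state once
# "more than three timeouts" were taken for the oldest outstanding segment.
MAX_TIMEOUTS = 3


@dataclass
class CongestionState:
    cwnd: int      # congestion window, in bytes (hypothetical unit)
    ssthresh: int  # slow-start threshold, in bytes


def compute_pipe_prev(ssthresh_prev: int, flight_size_at_entry: int) -> int:
    """Larger of the previous ssthresh and the FlightSize observed when
    loss recovery was unnecessarily entered."""
    return max(ssthresh_prev, flight_size_at_entry)


def may_revert(ece_set: bool, timeouts_taken: int) -> bool:
    """Reverting is refused if the acceptable ACK carries ECN-Echo or if
    too many timeouts were already taken."""
    return not ece_set and timeouts_taken <= MAX_TIMEOUTS


def restore_ssthresh(state: CongestionState, pipe_prev: int,
                     ece_set: bool, timeouts_taken: int) -> str:
    """Restore ssthresh to pipe_prev and report which phase follows."""
    if not may_revert(ece_set, timeouts_taken):
        return "unchanged"
    state.ssthresh = pipe_prev
    # Below ssthresh the sender slow-starts; otherwise it resumes probing
    # for bandwidth in congestion avoidance.
    if state.cwnd < state.ssthresh:
        return "slow-start"
    return "congestion-avoidance"
```

For instance, with a previous ssthresh of 8000 and a FlightSize of 12000 at the spurious timeout, pipe_prev is 12000; a sender whose cwnd is 4000 would then slow-start up to that point before re-entering congestion avoidance.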
4. Non-Conservative Advanced Loss Recovery after Spurious Timeouts 4. Non-Conservative Advanced Loss Recovery after Spurious Timeouts
A TCP sender MAY implement an optimistic form of advanced loss A TCP sender MAY implement an optimistic form of advanced loss
recovery after a spurious timeout has been detected as motivated in recovery after a spurious timeout has been detected as motivated in
this section. Such a scheme MUST be terminated after the highest this section. Such a scheme MUST be terminated after the highest
sequence number outstanding when the spurious timeout was detected sequence number outstanding when the spurious timeout was detected
has been acknowledged. has been acknowledged.
We have studied environments where spurious timeouts and multiple We have studied environments where spurious timeouts and multiple
losses from the same flight of packets often coincide [GL02]. Note losses from the same flight of packets often coincide [GL02]. In such
that we refer to the case were the oldest outstanding segment does a case the oldest outstanding segment does arrive at the TCP
arrive at the TCP receiver but one or more packets from the remaining receiver, but one or more packets from the remaining outstanding
outstanding flight are lost. We found that in such a case TCP-Reno's flight are lost. In those environments, TCP-Reno's performance
performance can even suffer if the Eifel response algorithm is suffers if the Eifel response algorithm is operated without an
operated without an advanced loss recovery scheme such as NewReno advanced loss recovery scheme such as NewReno [RFC2582], or SACK-
[RFC2582], or SACK-based schemes [2018], [RFC***A]. The reason is based schemes [RFC2018], [RFC3517]. The reason is TCP-Reno's
TCP-Reno's aggressiveness after a spurious timeout. Even though it aggressiveness after a spurious timeout. Even though it breaks
breaks 'packet conservation' (see Section 2.2.1) when blindly 'packet conservation' (see Section 2.2.1) when blindly retransmitting
retransmitting all outstanding segments, it usually recovers the all outstanding segments, it usually recovers all packets lost from
back-to-back packet losses within a single round-trip time. On the that flight within a single round-trip time. On the contrary, the
contrary, the more conservative TCP-Reno/Eifel was forced into more conservative TCP-Reno/Eifel is often forced into another
another (backed-off) timeout in that case. (backed-off) timeout.
However, in a more recent study [GL03], we found that the mentioned However, in a more recent study [GL03], we found that the mentioned
advanced loss recovery schemes are often too conservative to compete advanced loss recovery schemes are often too conservative to compete
against TCP-Reno's blind go-back-N in terms of quickly recovering against TCP-Reno's blind go-back-N in terms of quickly recovering
multiple losses after a spurious timeout. The problem with the multiple losses after a spurious timeout. The problem with the
NewReno scheme is that it does not exploit knowledge (e.g., provided NewReno scheme is that it does not exploit knowledge (e.g., provided
through SACK options) about which segments were lost. The problem through SACK options) about which segments were lost. The problem
with the conservative SACK-based scheme [RFC***A] is that it waits with the conservative SACK-based scheme [RFC3517] is that it waits
for three SACKs before it retransmits a lost segment. This may often for three SACKs before it retransmits a lost segment. This may often
lead to a second - and in this case genuine - (potentially backed- lead to a second - and in this case genuine - (potentially backed-
off) timeout. In those cases TCP-Reno's loss recovery is often off) timeout. In those cases TCP-Reno's loss recovery is often
quicker due to the blind go-back-N. This could be viewed as a quicker due to the blind go-back-N. This could be viewed as a
disincentive to the deployment of the Eifel response algorithm. disincentive to the deployment of the Eifel response algorithm.
[Making TCP (even) more conservative by fixing a misbehavior in
the name of 'packet conservation' would probably at most result in
credits in the academic world.]
We therefore suggest that a TCP sender MAY implement an optimistic We therefore suggest that a TCP sender MAY implement an optimistic
(non-conservative) form of advanced loss recovery after a spurious (non-conservative) form of advanced loss recovery after a spurious
timeout has been detected, if the following guidelines are met: timeout has been detected, if the following guidelines are met:
- Packet Conservation: The TCP sender may not have more segments - Packet Conservation: The TCP sender may not have more segments
(counting both original transmits and retransmits) in flight (counting both original transmits and retransmits) in flight
than indicated by the congestion window. than indicated by the congestion window.
- A retransmit may only be sent when a potential loss has been - A retransmit may only be sent when a potential loss has been
indicated. For example, a single duplicate ACK is such an indicated. For example, a single duplicate ACK is such an
skipping to change at page 10, line 26 skipping to change at page 9, line 31
There is a risk that a detection algorithm is fooled by spoofed ACKs There is a risk that a detection algorithm is fooled by spoofed ACKs
that make genuine retransmits appear to the TCP sender as spurious that make genuine retransmits appear to the TCP sender as spurious
retransmits. When such a detection algorithm is run together with the retransmits. When such a detection algorithm is run together with the
Eifel response algorithm, this could effectively disable congestion Eifel response algorithm, this could effectively disable congestion
control at the TCP sender. Should this become a concern, the Eifel control at the TCP sender. Should this become a concern, the Eifel
response algorithm SHOULD only be run together with detection response algorithm SHOULD only be run together with detection
algorithms that are known to be safe against such "ACK spoofing algorithms that are known to be safe against such "ACK spoofing
attacks". attacks".
For example, the safe variant of the Eifel detection algorithm For example, the safe variant of the Eifel detection algorithm
[RFC***B] is a reliable method to protect against this risk. [RFC3522] is a reliable method to protect against this risk.
Acknowledgments Acknowledgments
Many thanks to Keith Sklower, Randy Katz, Michael Meyer, Stephan Many thanks to Keith Sklower, Randy Katz, Michael Meyer, Stephan
Baucke, Sally Floyd, Vern Paxson, Mark Allman, Ethan Blanton, Pasi Baucke, Sally Floyd, Vern Paxson, Mark Allman, Ethan Blanton, Pasi
Sarolahti, and Alexey Kuznetsov for very useful discussions that Sarolahti, Alexey Kuznetsov, and Yogesh Swami for many discussions
contributed to this work. that contributed to this work.
Normative References Normative References
[RFC2581] M. Allman, V. Paxson, W. Stevens, TCP Congestion Control, [RFC2581] Allman, M., Paxson, V. and W. Stevens, TCP Congestion
RFC 2581, April 1999. Control, RFC 2581, April 1999.
[RFC3042] M. Allman, H. Balakrishnan, S. Floyd, Enhancing TCP's Loss
Recovery Using Limited Transmit, RFC 3042, January 2001.
[RFC2119] S. Bradner, Key words for use in RFCs to Indicate [RFC2119] Bradner, S., Key words for use in RFCs to Indicate
Requirement Levels, RFC 2119, March 1997. Requirement Levels, RFC 2119, March 1997.
[RFC2582] S. Floyd, T. Henderson, The NewReno Modification to TCP's [RFC2582] Floyd, S. and T. Henderson, The NewReno Modification to
Fast Recovery Algorithm, RFC 2582, April 1999. TCP's Fast Recovery Algorithm, RFC 2582, April 1999.
[RFC2883] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, A. Romanow, [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M. and A.
An Extension to the Selective Acknowledgement (SACK) Option Romanow, An Extension to the Selective Acknowledgement
for TCP, RFC 2883, July 2000. (SACK) Option for TCP, RFC 2883, July 2000.
[RFC1323] V. Jacobson, R. Braden, D. Borman, TCP Extensions for High [RFC2861] Handley, M., Padhye, J. and S. Floyd, TCP Congestion Window
Performance, RFC 1323, May 1992. Validation, RFC 2861, June 2000.
[RFC***B] R. Ludwig, M. Meyer, The Eifel Detection Algorithm for TCP, [RFC1323] Jacobson, V., Braden, R. and D. Borman, TCP Extensions for
RFC***B, March 2003. High Performance, RFC 1323, May 1992.
[RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, TCP Selective [RFC3522] Ludwig, R. and M. Meyer, The Eifel Detection Algorithm for
Acknowledgement Options, RFC 2018, October 1996. TCP, RFC3522, April 2003.
[RFC2988] V. Paxson, M. Allman, Computing TCP's Retransmission Timer, [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, TCP
RFC 2988, November 2000. Selective Acknowledgement Options, RFC 2018, October 1996.
[RFC793] J. Postel, Transmission Control Protocol, RFC793, September [RFC2988] Paxson, V. and M. Allman, Computing TCP's Retransmission
1981. Timer, RFC 2988, November 2000.
[RFC3168] K. Ramakrishnan, S. Floyd, D. Black, The Addition of [RFC793] Postel, J., Transmission Control Protocol, RFC793,
September 1981.
[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, The Addition of
Explicit Congestion Notification (ECN) to IP, RFC 3168, Explicit Congestion Notification (ECN) to IP, RFC 3168,
September 2001. September 2001.
Informative References Informative References
[BA02a] E. Blanton, M. Allman, On Making TCP More Robust to Packet [BA02] Blanton, E. and M. Allman, On Making TCP More Robust to
Reordering, ACM Computer Communication Review, Vol. 32, Packet Reordering, ACM Computer Communication Review,
No. 1, January 2002. Vol. 32, No. 1, January 2002.
[BA02b] E. Blanton, M. Allman, Using TCP DSACKs and SCTP Duplicate
TSNs to Detect Spurious Retransmissions, draft-blanton-
dsack-use-02.txt (work in progress), October 2002.
[BDA03] E. Blanton, R. Dimond, M. Allman. Practices for TCP Senders
in the Face of Segment Reordering, draft-blanton-tcp-
reordering-00.txt (work in progress), February 2003.
[RFC***A] E. Blanton, M. Allman, K. Fall, L. Wang, A Conservative [BA03] Blanton, E. and M. Allman, Using TCP DSACKs and SCTP
SACK-based Loss Recovery Algorithm for TCP, RFC***A, Duplicate TSNs to Detect Spurious Retransmissions, draft-
March 2003. ietf-tsvwg-dsack-use-02.txt (work in progress),
October 2003.
[Gu01] A. Gurtov, Effect of Delays on TCP Performance, In [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang,
Proceedings of IFIP Personal Wireless Conference, A Conservative SACK-based Loss Recovery Algorithm for TCP,
August 2001. RFC3517, April 2003.
[GL02] A. Gurtov, R. Ludwig, Evaluating the Eifel Algorithm for [GL02] Gurtov, A. and R. Ludwig, Evaluating the Eifel Algorithm
TCP in a GPRS Network, In Proceedings of the European for TCP in a GPRS Network, In Proceedings of the European
Wireless Conference, February 2002. Wireless Conference, February 2002.
[GL03] A. Gurtov, R. Ludwig, Responding to Spurious Timeouts in [GL03] Gurtov, A. and R. Ludwig, Responding to Spurious Timeouts
TCP, In Proceedings of IEEE INFOCOM 03, . in TCP, In Proceedings of IEEE INFOCOM 03, .
[RFC3481] H. Inamura, G. Montenegro, R. Ludwig, A. Gurtov, [Jac88] Jacobson, V., Congestion Avoidance and Control, In
F. Khafizov, TCP over Second (2.5G) and Third (3G) Proceedings of ACM SIGCOMM 88.
Generation Wireless Networks, RFC3481, February 2003.
[KP87] P. Karn, C. Partridge, Improving Round-Trip Time Estimates [KP87] Karn, P. and C. Partridge, Improving Round-Trip Time
in Reliable Transport Protocols, In Proceedings of ACM Estimates in Reliable Transport Protocols, In Proceedings
SIGCOMM 87. of ACM SIGCOMM 87.
[LK00] R. Ludwig, R. H. Katz, The Eifel Algorithm: Making TCP [LK00] Ludwig, R. and R. H. Katz, The Eifel Algorithm: Making TCP
Robust Against Spurious Retransmissions, ACM Computer Robust Against Spurious Retransmissions, ACM Computer
Communication Review, Vol. 30, No. 1, January 2000. Communication Review, Vol. 30, No. 1, January 2000.
[LS00] R. Ludwig, K. Sklower, The Eifel Retransmission Timer, ACM [SK03] Sarolahti, P. and M. Kojo, F-RTO: An Algorithm for
Computer Communication Review, Vol. 30, No. 3, July 2000. Detecting Spurious Retransmission Timeouts with TCP and
SCTP, draft-ietf-tsvwg-tcp-frto-00.txt (work in progress),
[SK02] P. Sarolahti, A. Kuznetsov, Congestion Control in Linux October 2003.
TCP, In Proceedings of USENIX, June 2002.
[SK03] P. Sarolahti, M. Kojo, F-RTO: A TCP RTO Recovery Algorithm
for Avoiding Unnecessary Retransmissions, draft-sarolahti-
tsvwg-tcp-frto-03.txt (work in progress), January 2003.
[WS95] G. R. Wright, W. R. Stevens, TCP/IP Illustrated, Volume 2 [WS95] Wright, G. R. and W. R. Stevens, TCP/IP Illustrated,
(The Implementation), Addison Wesley, January 1995. Volume 2 (The Implementation), Addison Wesley,
January 1995.
[Zh86] L. Zhang, Why TCP Timers Don't Work Well, In Proceedings of [Zh86] Zhang, L., Why TCP Timers Don't Work Well, In Proceedings
ACM SIGCOMM 86. of ACM SIGCOMM 86.
Author's Address Author's Address
Reiner Ludwig Reiner Ludwig
Ericsson Research (EED) Ericsson Research (EED)
Ericsson Allee 1 Ericsson Allee 1
52134 Herzogenrath, Germany 52134 Herzogenrath, Germany
Email: Reiner.Ludwig@ericsson.com Email: Reiner.Ludwig@ericsson.com
Andrei Gurtov Andrei Gurtov
Cellular Systems Development TeliaSonera Finland
P.O. Box 970, FIN-00051 Sonera P.O. Box 970, FIN-00051 Sonera
Helsinki, Finland Helsinki, Finland
Phone: +358(0)20401 Email: andrei.gurtov@teliasonera.com
Fax: +358(0)204064365
Email: andrei.gurtov@sonera.com
Homepage: http://www.cs.helsinki.fi/u/gurtov Homepage: http://www.cs.helsinki.fi/u/gurtov
This Internet-Draft expires in September 2003. This Internet-Draft expires in April 2004.
 End of changes. 

This html diff was produced by rfcdiff 1.23, available from http://www.levkowetz.com/ietf/tools/rfcdiff/