draft-ietf-rddp-mpa-00.txt   draft-ietf-rddp-mpa-01.txt 
Remote Direct Data Placement Work Group P. Culley
INTERNET-DRAFT P. Culley INTERNET-DRAFT Hewlett-Packard Company
draft-ietf-rddp-mpa-00.txt Hewlett-Packard Company draft-ietf-rddp-mpa-01.txt U. Elzur
U. Elzur
Broadcom Corporation Broadcom Corporation
R. Recio R. Recio
IBM Corporation IBM Corporation
S. Bailey S. Bailey
Sandburst Corporation Sandburst Corporation
J. Carrier J. Carrier
Adaptec Adaptec
Expires: March 2004 Expires: January 2005 July 13, 2004
Marker PDU Aligned Framing for TCP Specification Marker PDU Aligned Framing for TCP Specification
1 Status of this Memo Status of this Memo
This document is an Internet-Draft and is subject to all provisions By submitting this Internet-Draft, I certify that any applicable
of Section 10 of RFC2026. patent or other IPR claims of which I am aware have been disclosed,
or will be disclosed, and any of which I become aware will be
disclosed, in accordance with RFC 3668.
By submitting this Internet-Draft, I accept the provisions of Section
4 of RFC 3667.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
Drafts. Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft
Shadow Directories can be accessed at http://www.ietf.org/shadow.html Shadow Directories can be accessed at http://www.ietf.org/shadow.html
2 Abstract Abstract
A framing protocol is defined for TCP that is fully compliant with A framing protocol is defined for TCP that is fully compliant with
applicable TCP RFCs and fully interoperable with existing TCP applicable TCP RFCs and fully interoperable with existing TCP
implementations. The framing mechanism is designed to work as an implementations. The framing mechanism is designed to work as an
"adaptation layer" between TCP and the Direct Data Placement [DDP] "adaptation layer" between TCP and the Direct Data Placement [DDP]
protocol, preserving the reliable, in-order delivery of TCP, while protocol, preserving the reliable, in-order delivery of TCP, while
adding the preservation of higher-level protocol record boundaries adding the preservation of higher-level protocol record boundaries
that DDP requires. that DDP requires.
Table of Contents Table of Contents
1 Status of this Memo..........................................1 Status of this Memo.................................................1
2 Abstract.....................................................1 Abstract............................................................1
3 Introduction.................................................5 1 Introduction.................................................5
3.1 Motivation...................................................5 1.1 Motivation...................................................5
3.2 Protocol Overview............................................5 1.2 Protocol Overview............................................5
4 Glossary.....................................................9 2 Glossary.....................................................9
5 LLP and DDP requirements....................................11 3 LLP and DDP requirements....................................11
5.1 TCP implementation Requirements to support MPA..............11 3.1 TCP implementation Requirements to support MPA..............11
5.1.1 TCP Transmit side...........................................11 3.1.1 TCP Transmit side...........................................11
5.1.2 TCP Receive side............................................11 3.1.2 TCP Receive side............................................11
5.2 MPA's interactions with DDP.................................12 3.2 MPA's interactions with DDP.................................12
6 FPDU Formats................................................14 4 FPDU Formats................................................14
6.1 Marker Format...............................................15 4.1 Marker Format...............................................15
7 Data Transfer Semantics.....................................16 5 Data Transfer Semantics.....................................16
7.1 MPA Markers.................................................16 5.1 MPA Markers.................................................16
7.2 CRC Calculation.............................................18 5.2 CRC Calculation.............................................18
7.3 MPA on TCP Sender Segmentation..............................21 5.3 MPA on TCP Sender Segmentation..............................21
7.3.1 Effects of MPA on TCP Segmentation..........................21 5.3.1 Effects of MPA on TCP Segmentation..........................21
7.3.2 FPDU Size Considerations....................................23 5.3.2 FPDU Size Considerations....................................23
7.4 MPA Receiver FPDU Identification............................24 5.4 MPA Receiver FPDU Identification............................24
7.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....25 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....25
8 Connection Semantics........................................26 6 Connection Semantics........................................26
8.1 Connection setup............................................26 6.1 Connection setup............................................26
8.1.1 MPA Request Frame Format....................................29 6.1.1 MPA Request Frame Format....................................30
8.1.2 Example Delayed Startup sequence............................30 6.1.2 Example Delayed Startup sequence............................31
8.1.3 Use of "Private Data".......................................33 6.1.3 Use of "Private Data".......................................34
8.1.4 "Dual Stack" implementations................................36 6.1.4 "Dual Stack" implementations................................37
8.2 Normal Connection Teardown..................................37 6.2 Normal Connection Teardown..................................38
9 Error Semantics.............................................38 7 Error Semantics.............................................39
10 Security Considerations.....................................39 8 Security Considerations.....................................40
10.1 Protocol-specific Security Considerations...................39 8.1 Protocol-specific Security Considerations...................40
10.2 Using IPsec With MPA........................................39 8.2 Using IPsec With MPA........................................40
11 IANA Considerations.........................................40 9 IANA Considerations.........................................41
12 References..................................................41 10 References..................................................42
12.1 Normative References........................................41 10.1 Normative References........................................42
12.2 Informative References......................................41 10.2 Informative References......................................42
13 Appendix....................................................43 11 Appendix....................................................44
13.1 Analysis of MPA over TCP Operations.........................43 11.1 Analysis of MPA over TCP Operations.........................44
13.1.1 Assumptions...............................................43 11.1.1 Assumptions...............................................44
13.1.2 The Value of Header Alignment.............................44 11.1.2 The Value of Header Alignment.............................45
13.2 Receiver implementation.....................................52 11.2 Receiver implementation.....................................53
13.2.1 Network Layer Reassembly Buffers..........................52 11.2.1 Network Layer Reassembly Buffers..........................53
13.2.2 TCP Reassembly buffers....................................53 11.2.2 TCP Reassembly buffers....................................54
14 Author's Addresses..........................................54 12 Author's Addresses..........................................55
15 Acknowledgments.............................................55 13 Acknowledgments.............................................56
16 Full Copyright Statement....................................58 14 Full Copyright Statement....................................59
Table of Figures Table of Figures
Figure 1 ULP MPA TCP Layering.......................................7 Figure 1 ULP MPA TCP Layering.......................................7
Figure 2 FPDU Format...............................................14 Figure 2 FPDU Format...............................................14
Figure 3 Marker Format.............................................15 Figure 3 Marker Format.............................................15
Figure 4 Example FPDU Format with Marker...........................17 Figure 4 Example FPDU Format with Marker...........................17
Figure 5 Annotated Hex Dump of an FPDU.............................20 Figure 5 Annotated Hex Dump of an FPDU.............................20
Figure 6 Annotated Hex Dump of an FPDU with Marker.................20 Figure 6 Annotated Hex Dump of an FPDU with Marker.................20
Figure 7 "MPA Request/Reply Frame".................................29 Figure 7 "MPA Request/Reply Frame".................................30
Figure 8: Example Delayed Startup negotiation......................31 Figure 8: Example Delayed Startup negotiation......................32
Figure 8: Example Immediate Startup negotiation....................34 Figure 9: Example Immediate Startup negotiation....................35
Figure 9: Non-aligned FPDU freely placed in TCP octet stream.......46 Figure 10: Non-aligned FPDU freely placed in TCP octet stream......47
Figure 10: Aligned FPDU placed immediately after TCP header........48 Figure 11: Aligned FPDU placed immediately after TCP header........49
Revision history Revision history
[draft-ietf-rddp-mpa-02] workgroup draft with following changes:
Added the "R" bit (Rejected) to the "MPA Reply Frame" and
described its semantics.
Added some comments on recent decisions regarding startup.
Updated RFC3667 boilerplate.
[draft-ietf-rddp-mpa-01] Alias of draft-ietf-rddp-mpa-00.
[draft-ietf-rddp-mpa-00] workgroup draft with following changes: [draft-ietf-rddp-mpa-00] workgroup draft with following changes:
Changed "Start Key" to two separate startup frames to facilitate Changed "Start Key" to two separate startup frames to facilitate
identification of incorrect Active/Active startup. identification of incorrect Active/Active startup.
Changed Active/Passive nomenclature to Initiator/Responder to Changed Active/Passive nomenclature to Initiator/Responder to
reduce confusion with TCP startup and verbs doc (which used reduce confusion with TCP startup and verbs doc (which used
opposite sense). opposite sense).
Added "Private Data" to the startup key sequences. This also Added "Private Data" to the startup key sequences. This also
skipping to change at page 5, line 5 skipping to change at page 5, line 5
Added clarifications of the MPA/TCP interaction for optimized Added clarifications of the MPA/TCP interaction for optimized
implementations and that any such optimizations are to be used implementations and that any such optimizations are to be used
only when requested by MPA. only when requested by MPA.
Note: a discussion of reasons for these changes can be found in Note: a discussion of reasons for these changes can be found in
[ELZER-MPA]. [ELZER-MPA].
[draft-culley-iwarp-mpa-01] initial draft. [draft-culley-iwarp-mpa-01] initial draft.
3 Introduction 1 Introduction
This section discusses the reason for creating MPA on TCP and a This section discusses the reason for creating MPA on TCP and a
general overview of the protocol. Later sections show the MPA general overview of the protocol. Later sections show the MPA
headers (see section 6 on page 14), and detailed protocol headers (see section 4 on page 14), and detailed protocol
requirements and characteristics (see section 7 on page 16), as well requirements and characteristics (see section 5 on page 16), as well
as Connection Semantics (section 8 on page 26), Error Semantics as Connection Semantics (section 6 on page 26), Error Semantics
(section 9 on page 38), and Security Considerations (section 10 on (section 7 on page 39), and Security Considerations (section 8 on
page 39). page 40).
3.1 Motivation 1.1 Motivation
The Direct Data Placement protocol [DDP], when used with TCP [RFC793] The Direct Data Placement protocol [DDP], when used with TCP [RFC793]
requires a mechanism to detect record boundaries. The DDP records requires a mechanism to detect record boundaries. The DDP records
are referred to as Upper Layer Protocol Data Units by this document. are referred to as Upper Layer Protocol Data Units by this document.
The ability to locate the Upper Layer Protocol Data Unit (ULPDU) The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
boundary is useful to a hardware network adapter that uses DDP to boundary is useful to a hardware network adapter that uses DDP to
directly place the data in the application buffer based on the directly place the data in the application buffer based on the
control information carried in the ULPDU header. This may be done control information carried in the ULPDU header. This may be done
without requiring that the packets arrive in order. Potential without requiring that the packets arrive in order. Potential
benefits of this capability are the avoidance of the memory copy benefits of this capability are the avoidance of the memory copy
skipping to change at page 5, line 48 skipping to change at page 5, line 48
examine the data stream at locations that are known to contain the examine the data stream at locations that are known to contain the
embedded control, the protocol can never misinterpret application embedded control, the protocol can never misinterpret application
data as being embedded control data. For unambiguous handling of an data as being embedded control data. For unambiguous handling of an
out of order packet, the deterministic approach is preferred. out of order packet, the deterministic approach is preferred.
The MPA protocol provides a framing mechanism for DDP running over The MPA protocol provides a framing mechanism for DDP running over
TCP using the deterministic approach. It allows the location of the TCP using the deterministic approach. It allows the location of the
ULPDU to be determined in the TCP stream even if the TCP segments ULPDU to be determined in the TCP stream even if the TCP segments
arrive out of order. arrive out of order.
3.2 Protocol Overview 1.2 Protocol Overview
MPA is described as an extra layer above TCP and below DDP. The MPA is described as an extra layer above TCP and below DDP. The
operation sequence is: operation sequence is:
1. A TCP connection is established by ULP action. This is done 1. A TCP connection is established by ULP action. This is done
using methods not described by this specification. The ULP may using methods not described by this specification. The ULP may
exchange some amount of data in streaming mode prior to starting exchange some amount of data in streaming mode prior to starting
MPA, but is not required to do so. MPA, but is not required to do so.
2. The Consumer negotiates the use of DDP and MPA at both ends of a 2. The Consumer negotiates the use of DDP and MPA at both ends of a
connection. The mechanisms to do this are not described in this connection. The mechanisms to do this are not described in this
specification. The negotiation may be done in streaming mode, or specification. The negotiation may be done in streaming mode, or
by some other mechanism (such as a pre-arranged port number). by some other mechanism (such as a pre-arranged port number).
3. The ULP activates MPA on each end in the "Startup Phase", either 3. The ULP activates MPA on each end in the "Startup Phase", either
as an "Initiator" or a "Responder", as determined by the ULP. as an "Initiator" or a "Responder", as determined by the ULP.
This mode verifies the usage of MPA, specifies the use of CRC and This mode verifies the usage of MPA, specifies the use of CRC and
Markers, and allows the ULP to communicate some additional data Markers, and allows the ULP to communicate some additional data
via a "private data" exchange. See section 8.1 Connection setup via a "private data" exchange. See section 6.1 Connection setup
for more details on the startup process. for more details on the startup process.
4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into
full operation and begins sending DDP data as further described full operation and begins sending DDP data as further described
below. In this document, DDP data chunks are called ULPDUs. For below. In this document, DDP data chunks are called ULPDUs. For
a description of the DDP data, see [DDP]. a description of the DDP data, see [DDP].
Following is a description of data transfer when MPA is in full Following is a description of data transfer when MPA is in full
operation. operation.
skipping to change at page 8, line 24 skipping to change at page 8, line 24
check can reduce the chance of data errors being missed. check can reduce the chance of data errors being missed.
MPA includes a CRC check to increase the ULPDU data integrity to the MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC2960]. level provided by other modern protocols, such as SCTP [RFC2960].
This check may be disabled with agreement by providers and This check may be disabled with agreement by providers and
administrators at both ends of a connection. This disabling of CRCs administrators at both ends of a connection. This disabling of CRCs
should only be done when it is clear that the connection through the should only be done when it is clear that the connection through the
network has data integrity at least as good as a CRC (for example network has data integrity at least as good as a CRC (for example
when IPSEC is implemented end to end). DDP's ULP expects this level when IPSEC is implemented end to end). DDP's ULP expects this level
of data integrity and therefore the ULP SHOULD NOT have to provide of data integrity and therefore the ULP SHOULD NOT have to provide
its own duplicate data integrity and error recovery for lost data its own duplicate data integrity and error recovery for lost data.
4 Glossary 2 Glossary
Consumer - the ULPs or applications that lie above MPA and DDP. The Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations. and DDP connections, and generally controlling operations.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for the process of informing DDP that a particular PDU is ordered for
use. This is specifically different from "passing the PDU to use. This is specifically different from "passing the PDU to
DDP", which may generally occur in any order, while the order of DDP", which may generally occur in any order, while the order of
"Delivery" is strictly defined. "Delivery" is strictly defined.
skipping to change at page 11, line 5 skipping to change at page 11, line 5
describing protocol exchanges or other interactions between two describing protocol exchanges or other interactions between two
Nodes. Nodes.
ULP - Upper Layer Protocol. The protocol layer above the protocol ULP - Upper Layer Protocol. The protocol layer above the protocol
layer currently being referenced. The ULP for MPA is DDP [DDP]. layer currently being referenced. The ULP for MPA is DDP [DDP].
ULPDU - Upper Layer Protocol Data Unit. The data record defined by ULPDU - Upper Layer Protocol Data Unit. The data record defined by
the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP
Segment". Segment".
5 LLP and DDP requirements 3 LLP and DDP requirements
5.1 TCP implementation Requirements to support MPA 3.1 TCP implementation Requirements to support MPA
The TCP implementation MUST inform MPA when the TCP connection is The TCP implementation MUST inform MPA when the TCP connection is
closed or has begun closing the connection (e.g. received a FIN). closed or has begun closing the connection (e.g. received a FIN).
5.1.1 TCP Transmit side 3.1.1 TCP Transmit side
To provide optimum performance, an MPA-aware transmit side TCP To provide optimum performance, an MPA-aware transmit side TCP
implementation SHOULD be enabled to: implementation SHOULD be enabled to:
* With an EMSS large enough to contain the FPDU(s), segment the * With an EMSS large enough to contain the FPDU(s), segment the
outgoing TCP stream such that the first octet of every TCP outgoing TCP stream such that the first octet of every TCP
Segment begins with an FPDU. Multiple FPDUs MAY be packed into a Segment begins with an FPDU. Multiple FPDUs MAY be packed into a
single TCP segment as long as they are entirely contained in the single TCP segment as long as they are entirely contained in the
TCP segment. TCP segment.
skipping to change at page 11, line 44 skipping to change at page 11, line 44
achieve that result. For example, using the TCP_NODELAY socket achieve that result. For example, using the TCP_NODELAY socket
option to disable the Nagle algorithm will usually result in many of option to disable the Nagle algorithm will usually result in many of
the segments starting with an FPDU. the segments starting with an FPDU.
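
As an illustration of the TCP_NODELAY approach mentioned above, the following is a minimal sketch using the standard Python socket module. The helper names make_nodelay_socket and nodelay_enabled are hypothetical and are not part of this specification. Disabling the Nagle algorithm only makes it more likely that segments start with an FPDU; it does not guarantee it.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create an unconnected IPv4 TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the stack not to coalesce small writes, so each
    # write of a complete FPDU is more likely to begin its own segment.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

A sender using this sketch would still have to write each FPDU with a single send call and keep the FPDU no larger than the EMSS for the alignment to hold in practice.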
If the transmit side TCP implementation is not able to report the If the transmit side TCP implementation is not able to report the
EMSS, MPA may assume that TCP will use 1460 octet segments in EMSS, MPA may assume that TCP will use 1460 octet segments in
creating FPDUs. If the implementation has reason to believe that the creating FPDUs. If the implementation has reason to believe that the
TCP segment size is actually smaller than 1460, it may instead use a TCP segment size is actually smaller than 1460, it may instead use a
536 octet FPDU. 536 octet FPDU.
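
The fallback rule above can be sketched as a small selection function. This is only one reading of the paragraph: the function name, its parameters, and the idea of a boolean "suspect small segments" hint are assumptions made for illustration, not something this specification defines.

```python
from typing import Optional

DEFAULT_ASSUMED_EMSS = 1460   # assumed segment size when TCP cannot report the EMSS
CONSERVATIVE_FPDU_SIZE = 536  # fallback when segments are believed to be smaller


def assumed_fpdu_budget(reported_emss: Optional[int] = None,
                        suspect_small_segments: bool = False) -> int:
    """Return the octet budget MPA might use when sizing FPDUs.

    reported_emss: the EMSS from TCP, or None if TCP cannot report it.
    suspect_small_segments: hypothetical hint that the real segment size
    is below 1460 octets.
    """
    if reported_emss is not None:
        # TCP reported an EMSS, so no assumption is needed.
        return reported_emss
    if suspect_small_segments:
        return CONSERVATIVE_FPDU_SIZE
    return DEFAULT_ASSUMED_EMSS
```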
5.1.2 TCP Receive side 3.1.2 TCP Receive side
When an MPA receive implementation and the MPA-aware receive side TCP When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out of order ULPDUs, the TCP receive implementation support handling out of order ULPDUs, the TCP receive
implementation SHOULD be enabled to: implementation SHOULD be enabled to:
* Pass incoming TCP segments to MPA as soon as they have been * Pass incoming TCP segments to MPA as soon as they have been
received and validated, even if not received in order. The TCP received and validated, even if not received in order. The TCP
layer MUST have committed to keeping each segment before it can layer MUST have committed to keeping each segment before it can
be passed to the MPA. This means that the segment must have be passed to the MPA. This means that the segment must have
passed the TCP, IP, and lower layer data integrity validation passed the TCP, IP, and lower layer data integrity validation
(i.e., checksum), must be in the receive window, must not be a (i.e., checksum), must be in the receive window, must not be a
duplicate, must be part of the same epoch (if timestamps are used duplicate, must be part of the same epoch (if timestamps are used
to verify this) and any other checks required by TCP RFCs. The to verify this) and any other checks required by TCP RFCs. The
segment MUST NOT be passed to MPA more than once unless segment MUST NOT be passed to MPA more than once unless
explicitly requested (see Section 9). explicitly requested (see Section 7).
This is not to imply that the data must be completely ordered This is not to imply that the data must be completely ordered
before use. An implementation may accept out of order segments, before use. An implementation may accept out of order segments,
SACK them [RFC2018], and pass them to DDP when the segments SACK them [RFC2018], and pass them to DDP when the segments
needed to fill in the gaps arrive. Such an needed to fill in the gaps arrive. Such an
implementation can "commit" to the data early on, and will not implementation can "commit" to the data early on, and will not
overwrite it even if (or when) duplicate data arrives. MPA overwrite it even if (or when) duplicate data arrives. MPA
expects to utilize this "commit" to allow the passing of ULPDUs expects to utilize this "commit" to allow the passing of ULPDUs
to DDP when they arrive, independent of ordering. to DDP when they arrive, independent of ordering.
skipping to change at page 12, line 43 skipping to change at page 12, line 43
TCP segments until completely ordered and then deliver them as TCP segments until completely ordered and then deliver them as
expected by non-MPA applications (and described in TCP RFCs) when MPA expected by non-MPA applications (and described in TCP RFCs) when MPA
is not enabled on the connection. When MPA is enabled above an MPA- is not enabled on the connection. When MPA is enabled above an MPA-
aware TCP, TCP SHOULD enable the in and out of order passing of data, aware TCP, TCP SHOULD enable the in and out of order passing of data,
and the separate ordering information as described above. and the separate ordering information as described above.
When an MPA receive implementation is coupled with a TCP receive When an MPA receive implementation is coupled with a TCP receive
implementation that does not support the preceding mechanisms, TCP implementation that does not support the preceding mechanisms, TCP
passes and Delivers incoming stream data to MPA in order. passes and Delivers incoming stream data to MPA in order.
5.2 MPA's interactions with DDP 3.2 MPA's interactions with DDP
DDP requires MPA to maintain DDP record boundaries from the sender to DDP requires MPA to maintain DDP record boundaries from the sender to
the receiver. When using MPA on TCP to send data, DDP provides the receiver. When using MPA on TCP to send data, DDP provides
records (ULPDUs) to MPA. MPA will use the reliable transmission records (ULPDUs) to MPA. MPA will use the reliable transmission
abilities of TCP to transmit the data, and will insert appropriate abilities of TCP to transmit the data, and will insert appropriate
additional information into the TCP stream to allow the MPA receiver additional information into the TCP stream to allow the MPA receiver
to locate the record boundary information. to locate the record boundary information.
As such, MPA accepts complete records (ULPDUs) from DDP at the sender As such, MPA accepts complete records (ULPDUs) from DDP at the sender
and returns them to DDP at the receiver. and returns them to DDP at the receiver.
skipping to change at page 14, line 5 skipping to change at page 14, line 5
sender transmitted them. One possible mechanism might be sender transmitted them. One possible mechanism might be
providing the TCP sequence number for each ULPDU. providing the TCP sequence number for each ULPDU.
* Provide a mechanism to indicate when a given ULPDU (and prior * Provide a mechanism to indicate when a given ULPDU (and prior
ULPDUs) are complete. One possible mechanism might be to allow ULPDUs) are complete. One possible mechanism might be to allow
DDP to see the current outgoing TCP Ack sequence number. DDP to see the current outgoing TCP Ack sequence number.
* Provide an indication to DDP that the TCP has closed or has begun * Provide an indication to DDP that the TCP has closed or has begun
to close the connection (e.g. received a FIN). to close the connection (e.g. received a FIN).
6 FPDU Formats 4 FPDU Formats
MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown
below MUST be used for all MPA FPDUs. For purposes of clarity, below MUST be used for all MPA FPDUs. For purposes of clarity,
markers are not shown in Figure 2. markers are not shown in Figure 2.
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ULPDU_Length | | | ULPDU_Length | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
skipping to change at page 14, line 41 skipping to change at page 14, line 41
support the largest IP datagrams for IPv4 or IPv6. support the largest IP datagrams for IPv4 or IPv6.
PAD: The PAD field trails the ULPDU and contains between zero and PAD: The PAD field trails the ULPDU and contains between zero and
three octets of data. The pad data MUST be set to zero by the sender three octets of data. The pad data MUST be set to zero by the sender
and ignored by the receiver (except for CRC checking). The length of and ignored by the receiver (except for CRC checking). The length of
the pad is set so as to make the size of the FPDU an integral the pad is set so as to make the size of the FPDU an integral
multiple of four. multiple of four.
CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C
check value, which is used to verify the entire contents of the FPDU, check value, which is used to verify the entire contents of the FPDU,
using CRC32C. See section 7.2 CRC Calculation on page 18. When CRCs using CRC32C. See section 5.2 CRC Calculation on page 18. When CRCs
are not enabled, this field is still present, may contain any value, are not enabled, this field is still present, may contain any value,
and MUST NOT be checked. and MUST NOT be checked.
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
addition, the total length of the FPDU will include the length of any addition, the total length of the FPDU will include the length of any
markers and from 0 to 3 pad octets added to round-up the ULPDU size. markers and from 0 to 3 pad octets added to round-up the ULPDU size.
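
The size arithmetic described above can be worked through with a short sketch. It assumes that ULPDU_Length is a 16-bit field counting only the octets of the ULPDU, and it ignores markers entirely. The function and constant names are hypothetical.

```python
LENGTH_FIELD_OCTETS = 2  # ULPDU_Length, 16 bits
CRC_OCTETS = 4           # CRC field, 32 bits, present even when CRCs are disabled


def pad_length(ulpdu_len: int) -> int:
    """Number of zero pad octets (0 to 3) that follow the ULPDU."""
    if not 0 <= ulpdu_len <= 0xFFFF:
        raise ValueError("ULPDU length must fit in the 16-bit length field")
    # The CRC is already four octets, so the pad only has to round the
    # length field plus the ULPDU up to a multiple of four.
    return (-(LENGTH_FIELD_OCTETS + ulpdu_len)) % 4


def fpdu_length(ulpdu_len: int) -> int:
    """Total FPDU size in octets, excluding any markers."""
    return LENGTH_FIELD_OCTETS + ulpdu_len + pad_length(ulpdu_len) + CRC_OCTETS
```

For example, a 2 octet ULPDU needs no pad and yields an 8 octet FPDU, which is the 6 octet minimum overhead. A 3 octet ULPDU needs 3 pad octets and yields a 12 octet FPDU.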
6.1 Marker Format 4.1 Marker Format
The format of a marker MUST be as specified in Figure 3: The format of a marker MUST be as specified in Figure 3:
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RESERVED | FPDUPTR | | RESERVED | FPDUPTR |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3 Marker Format Figure 3 Marker Format
RESERVED: The Reserved field MUST be set to zero on transmit and RESERVED: The Reserved field MUST be set to zero on transmit and
ignored on receive (except for CRC calculation). ignored on receive (except for CRC calculation).
FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long,
interpreted as an unsigned integer, that indicates the number of interpreted as an unsigned integer, that indicates the number of
octets in the TCP stream from the beginning of the FPDU to the first octets in the TCP stream from the beginning of the FPDU to the first
octet of the entire marker. octet of the entire marker.
7 Data Transfer Semantics 5 Data Transfer Semantics
This section discusses some characteristics and behavior of the MPA This section discusses some characteristics and behavior of the MPA
protocol as well as implications of that protocol. protocol as well as implications of that protocol.
7.1 MPA Markers 5.1 MPA Markers
MPA markers are used to identify the start of FPDUs when packets are MPA markers are used to identify the start of FPDUs when packets are
received out of order. This is done by locating the markers at fixed received out of order. This is done by locating the markers at fixed
intervals in the data stream (which is correlated to the TCP sequence intervals in the data stream (which is correlated to the TCP sequence
number) and using the marker value to locate the preceding FPDU number) and using the marker value to locate the preceding FPDU
start. start.
The MPA receiver's ability to locate out of order FPDUs and pass the The MPA receiver's ability to locate out of order FPDUs and pass the
ULPDUs to DDP is implementation dependent. MPA/DDP allows those ULPDUs to DDP is implementation dependent. MPA/DDP allows those
receivers that are able to deal with out of order FPDUs in this way receivers that are able to deal with out of order FPDUs in this way
to require the insertion of markers in the data stream. When the to require the insertion of markers in the data stream. When the
receiver cannot deal with out of order FPDUs in this way, it may receiver cannot deal with out of order FPDUs in this way, it may
disable the insertion of markers at the sender. All MPA senders MUST disable the insertion of markers at the sender. All MPA senders MUST
be able to generate markers when their use is declared by the be able to generate markers when their use is declared by the
opposing receiver (see section 8.1 Connection setup on page 26). opposing receiver (see section 6.1 Connection setup on page 26).
When Markers are enabled, MPA senders MUST insert a marker into the When Markers are enabled, MPA senders MUST insert a marker into the
data stream at a 512 octet periodic interval in the TCP Sequence data stream at a 512 octet periodic interval in the TCP Sequence
Number Space. The marker contains a 16 bit unsigned integer referred Number Space. The marker contains a 16 bit unsigned integer referred
to as the FPDUPTR (FPDU Pointer). to as the FPDUPTR (FPDU Pointer).
If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
relative back-pointer. FPDUPTR MUST contain the number of octets in relative back-pointer. FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the current FPDU to the first the TCP stream from the beginning of the current FPDU to the first
octet of the marker, unless the marker falls between FPDUs. Thus the octet of the marker, unless the marker falls between FPDUs. Thus the
skipping to change at page 16, line 54 skipping to change at page 16, line 54
marker falls exactly between FPDUs. In this case, the marker MUST be marker falls exactly between FPDUs. In this case, the marker MUST be
placed in the following FPDU and viewed as being part of that FPDU placed in the following FPDU and viewed as being part of that FPDU
(e.g. for CRC calculation). Thus an FPDUPTR value of 0x0000 means (e.g. for CRC calculation). Thus an FPDUPTR value of 0x0000 means
that immediately following the marker is an FPDU header. that immediately following the marker is an FPDU header.
Since all FPDUs are integral multiples of 4 octets, the bottom two Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero. MPA bits of the FPDUPTR as calculated by the sender are zero. MPA
reserves these bits so they MUST be treated as zero for computation reserves these bits so they MUST be treated as zero for computation
at the receiver. at the receiver.
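As a rough illustration of how a receiver might use a marker, the sketch below decodes the 4-octet marker, treats the two low-order FPDUPTR bits as zero, and walks back to the start of the FPDU. Network (big-endian) byte order and a 32-bit TCP sequence space are assumptions here, and the function names are hypothetical.

```python
import struct


def parse_marker(marker: bytes) -> int:
    """Return the FPDUPTR from a 4-octet marker, low two bits forced to zero."""
    if len(marker) != 4:
        raise ValueError("a marker is exactly 4 octets")
    # RESERVED (16 bits, ignored on receive) followed by FPDUPTR (16 bits).
    _reserved, fpduptr = struct.unpack("!HH", marker)
    return fpduptr & 0xFFFC


def fpdu_start(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the FPDU start, given the marker's first octet."""
    return (marker_seq - fpduptr) & 0xFFFFFFFF
```

An FPDUPTR of zero leaves the result equal to the marker position, matching the case where the FPDU header immediately follows the marker.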
When Markers are enabled (see section 8.1 Connection setup on page When Markers are enabled (see section 6.1 Connection setup on page
26), the MPA markers MUST be inserted immediately following MPA 26), the MPA markers MUST be inserted immediately following MPA
connection establishment, and at every 512th octet of the TCP octet connection establishment, and at every 512th octet of the TCP octet
stream thereafter. As a result, the first marker has an FPDUPTR stream thereafter. As a result, the first marker has an FPDUPTR
value of 0x0000. If the first marker begins at octet sequence number value of 0x0000. If the first marker begins at octet sequence number
SeqStart, then markers are inserted such that the first octet of the SeqStart, then markers are inserted such that the first octet of the
marker is at octet sequence number SeqNum if the remainder of (SeqNum marker is at octet sequence number SeqNum if the remainder of (SeqNum
- SeqStart) mod 512 is zero. Note that SeqNum can wrap. - SeqStart) mod 512 is zero. Note that SeqNum can wrap.
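The placement rule can be sketched as follows, assuming a 32-bit sequence number space. Because 2^32 is a multiple of 512, reducing the difference modulo 2^32 first keeps the test correct across a wrap. The function names are hypothetical.

```python
MARKER_INTERVAL = 512
SEQ_MODULUS = 1 << 32  # assumed 32-bit TCP sequence number space


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if the octet at seq_num is the first octet of a marker."""
    return ((seq_num - seq_start) % SEQ_MODULUS) % MARKER_INTERVAL == 0


def next_marker_position(seq_num: int, seq_start: int) -> int:
    """Sequence number of the first marker at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MODULUS
    gap = (-offset) % MARKER_INTERVAL
    return (seq_num + gap) % SEQ_MODULUS
```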
For example, if the TCP sequence number were used to calculate the For example, if the TCP sequence number were used to calculate the
insertion point of the marker, the starting TCP sequence number is insertion point of the marker, the starting TCP sequence number is
skipping to change at page 18, line 5 skipping to change at page 18, line 5
| | PAD (2 octets:0,0) | | | PAD (2 octets:0,0) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CRC | | CRC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4 Example FPDU Format with Marker Figure 4 Example FPDU Format with Marker
MPA Receivers MUST preserve ULPDU boundaries when passing data to MPA Receivers MUST preserve ULPDU boundaries when passing data to
DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
DDP and not the markers, headers, and CRC. DDP and not the markers, headers, and CRC.
7.2 CRC Calculation 5.2 CRC Calculation
An MPA implementation MUST implement CRC support and MUST either: An MPA implementation MUST implement CRC support and MUST either:
(1) always use CRCs (1) always use CRCs
or or
(2) only negotiate the non-use of CRC on the explicit request of the (2) only negotiate the non-use of CRC on the explicit request of the
system administrator, via an interface not defined in this spec. system administrator, via an interface not defined in this spec.
The default configuration for a connection MUST be to use CRCs. The default configuration for a connection MUST be to use CRCs.
skipping to change at page 18, line 32 skipping to change at page 18, line 32
protection from undetected errors as an end-to-end CRC32c. protection from undetected errors as an end-to-end CRC32c.
The process MUST be invisible to the ULP. The process MUST be invisible to the ULP.
After receipt of an MPA startup declaration indicating that its peer After receipt of an MPA startup declaration indicating that its peer
requires CRCs, an MPA instance MUST continue generating and checking requires CRCs, an MPA instance MUST continue generating and checking
CRCs until the connection terminates. If an MPA instance has CRCs until the connection terminates. If an MPA instance has
declared that it does not require CRCs, it MUST turn off CRC checking declared that it does not require CRCs, it MUST turn off CRC checking
immediately after receipt of an MPA mode declaration indicating that immediately after receipt of an MPA mode declaration indicating that
its peer also does not require CRCs. It MAY continue generating its peer also does not require CRCs. It MAY continue generating
CRCs. See section 8.1 Connection setup on page 26 for details on the CRCs. See section 6.1 Connection setup on page 26 for details on the
MPA startup. MPA startup.
When sending an FPDU, the sender MUST include a CRC field. When CRCs When sending an FPDU, the sender MUST include a CRC field. When CRCs
are enabled, the CRC field in the MPA FPDU MUST be computed using the are enabled, the CRC field in the MPA FPDU MUST be computed using the
CRC32C polynomial in the manner described in the iSCSI Protocol CRC32C polynomial in the manner described in the iSCSI Protocol
[iSCSI] document for Header and Data Digests. [iSCSI] document for Header and Data Digests.
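For reference, a minimal table-driven CRC32C in pure Python is sketched below. It uses the reflected form of the Castagnoli polynomial (0x82F63B78) with an all-ones initial value and a final inversion, which is the usual software formulation of CRC32C. Which FPDU octets are fed to it, and the byte order in which the result is placed in the CRC field, are not covered by this sketch.

```python
def _make_table():
    """Build the 256-entry lookup table for the reflected CRC32C polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Return the CRC32C of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc = _TABLE[(crc ^ octet) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The standard check input, the ASCII string "123456789", yields 0xE3069283 with this function.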
The fields which MUST be included in the CRC calculation when sending The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows: an FPDU are as follows:
skipping to change at page 19, line 36 skipping to change at page 19, line 36
MUST first perform the following: MUST first perform the following:
1) Calculate the CRC of the incoming FPDU in the same fashion as 1) Calculate the CRC of the incoming FPDU in the same fashion as
defined above. defined above.
2) Verify that the calculated CRC-32c value is the same as the 2) Verify that the calculated CRC-32c value is the same as the
received CRC-32c value found in the FPDU CRC field. If not, the received CRC-32c value found in the FPDU CRC field. If not, the
receiver MUST treat the FPDU as an invalid FPDU. receiver MUST treat the FPDU as an invalid FPDU.
The procedure for handling invalid FPDUs is covered in the Error The procedure for handling invalid FPDUs is covered in the Error
Section (see section 9 on page 38). Section (see section 7 on page 39).
The following is an annotated hex dump of an example FPDU sent as the The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream. As such, it starts with a marker. The FPDU first FPDU on the stream. As such, it starts with a marker. The FPDU
contains 24 octets of the contained ULPDU, which are all zeros. The contains 24 octets of the contained ULPDU, which are all zeros. The
CRC32c has been correctly calculated and can be used as a reference. CRC32c has been correctly calculated and can be used as a reference.
See the [DDP] and [RDMA] specifications for definitions of the DDP See the [DDP] and [RDMA] specifications for definitions of the DDP
Control field, Queue, MSN, MO, and Send Data. Control field, Queue, MSN, MO, and Send Data.
Octet Contents Annotation Octet Contents Annotation
Count Count
skipping to change at page 21, line 5 skipping to change at page 21, line 5
01fe 00 00 01fe 00 00
0200 00 00 Marker: Reserved 0200 00 00 Marker: Reserved
0202 00 14 FPDUPTR 0202 00 14 FPDUPTR
0204 00 00 0204 00 00
Send Data (24 octets of zeros) Send Data (24 octets of zeros)
021a 00 00 021a 00 00
021c A1 9C CRC32c 021c A1 9C CRC32c
021e D1 03 021e D1 03
Figure 6 Annotated Hex Dump of an FPDU with Marker Figure 6 Annotated Hex Dump of an FPDU with Marker
7.3 MPA on TCP Sender Segmentation 5.3 MPA on TCP Sender Segmentation
The various TCP RFCs allow considerable choice in segmenting a TCP The various TCP RFCs allow considerable choice in segmenting a TCP
stream. In order to optimize FPDU recovery at the MPA receiver, MPA stream. In order to optimize FPDU recovery at the MPA receiver, MPA
specifies additional segmentation rules. specifies additional segmentation rules.
MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
contained in one FPDU. contained in one FPDU.
An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
implementations that support this, and with an EMSS large enough to implementations that support this, and with an EMSS large enough to
skipping to change at page 21, line 43 skipping to change at page 21, line 43
The sender MUST still format the FPDU according to FPDU format as The sender MUST still format the FPDU according to FPDU format as
shown in Figure 2. shown in Figure 2.
On a retransmission, TCP does not necessarily preserve original TCP On a retransmission, TCP does not necessarily preserve original TCP
segmentation boundaries. This can lead to the loss of FPDU alignment segmentation boundaries. This can lead to the loss of FPDU alignment
and containment within a TCP segment during TCP retransmissions. An and containment within a TCP segment during TCP retransmissions. An
MPA-aware TCP sender SHOULD try to preserve original TCP segmentation MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
boundaries on a retransmission. boundaries on a retransmission.
7.3.1 Effects of MPA on TCP Segmentation 5.3.1 Effects of MPA on TCP Segmentation
Applications expected to see strong advantages from Direct Data Applications expected to see strong advantages from Direct Data
Placement include transaction-based applications and throughput Placement include transaction-based applications and throughput
applications. Request/response protocols typically send one FPDU per applications. Request/response protocols typically send one FPDU per
TCP segment and then wait for a response. Therefore, the application TCP segment and then wait for a response. Therefore, the application
is expected to set TCP parameters such that it can trade off latency is expected to set TCP parameters such that it can trade off latency
and wire efficiency. This is accomplished by setting the TCP_NODELAY and wire efficiency. This is accomplished by setting the TCP_NODELAY
socket option. socket option.
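On a sockets API, disabling Nagle in this way might look like the following sketch. The helper name is hypothetical; the socket calls are from the Python standard library.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks TCP to send small segments without waiting to coalesce.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```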
When latency is not critical, and the application provides data in When latency is not critical, and the application provides data in
skipping to change at page 23, line 5 skipping to change at page 23, line 5
The Nagle algorithm is not mandatory to use [RFC1122]. The Nagle algorithm is not mandatory to use [RFC1122].
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
that many of the applications expected to take advantage of MPA/DDP that many of the applications expected to take advantage of MPA/DDP
prefer to avoid the extra delays caused by Nagle. In such scenarios prefer to avoid the extra delays caused by Nagle. In such scenarios
it is anticipated there will be minimal opportunity for packing at it is anticipated there will be minimal opportunity for packing at
the transmitter and receivers may choose to optimize their the transmitter and receivers may choose to optimize their
performance for this anticipated behavior. performance for this anticipated behavior.
7.3.2 FPDU Size Considerations 5.3.2 FPDU Size Considerations
MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
the size of the largest ULPDU fitting in an FPDU. For an empty TCP the size of the largest ULPDU fitting in an FPDU. For an empty TCP
Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
space for markers and pad octets. space for markers and pad octets.
The maximum ULPDU Length for a single ULPDU when markers are The maximum ULPDU Length for a single ULPDU when markers are
present MUST be computed as: present MUST be computed as:
MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4) MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)
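The formula above translates directly into code. The sketch below uses integer arithmetic for the ceiling; the function name is hypothetical.

```python
def mulpdu_with_markers(emss: int) -> int:
    """MULPDU for an empty TCP segment of size emss when markers are present."""
    marker_octets = 4 * -(-emss // 512)  # 4 * Ceiling(EMSS / 512)
    return emss - (6 + marker_octets + emss % 4)
```

For an EMSS of 1460 octets, three markers are allowed for (12 octets), no alignment slack is needed, and the result is 1460 - 18 = 1442 octets.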
skipping to change at page 23, line 39 skipping to change at page 23, line 39
DDP SHOULD provide ULPDUs that are as large as possible, but less DDP SHOULD provide ULPDUs that are as large as possible, but less
than or equal to MULPDU. than or equal to MULPDU.
If the TCP implementation needs to adjust EMSS to support MTU If the TCP implementation needs to adjust EMSS to support MTU
changes, the MULPDU value is changed accordingly. changes, the MULPDU value is changed accordingly.
In certain rare situations, the EMSS may shrink to very small sizes. In certain rare situations, the EMSS may shrink to very small sizes.
If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU
below 128 octets and is not required to follow the segmentation rules below 128 octets and is not required to follow the segmentation rules
in Section 7.3 MPA on TCP Sender Segmentation on page 21. in Section 5.3 MPA on TCP Sender Segmentation on page 21.
If one or more FPDUs are already packed into a TCP segment, such that If one or more FPDUs are already packed into a TCP segment, such that
the remaining room is less than 128 octets, MPA MUST NOT provide a the remaining room is less than 128 octets, MPA MUST NOT provide a
MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU smaller than 128. In this case, MPA would typically provide a
MULPDU for the next full sized segment, but may still pack the next MULPDU for the next full sized segment, but may still pack the next
FPDU into the small remaining room, provided that the next FPDU is FPDU into the small remaining room, provided that the next FPDU is
small enough to fit. small enough to fit.
The value 128 is chosen so as to allow DDP designers room for the DDP The value 128 is chosen so as to allow DDP designers room for the DDP
Header and some user data. Header and some user data.
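One way to read the 128-octet rule is sketched below. Both parameters are hypothetical inputs that an implementation would derive for itself: the MULPDU of a full-sized empty segment, and the largest ULPDU that would still fit in the room left in a partly packed segment. This is an interpretation of the text above, not a normative procedure.

```python
MIN_MULPDU = 128  # octets; floor required by the text above


def advertised_mulpdu(full_segment_mulpdu: int, remaining_room_mulpdu: int) -> int:
    """MULPDU value MPA might report to DDP for the next ULPDU."""
    if remaining_room_mulpdu >= MIN_MULPDU:
        # Enough room is left in the current segment to be worth offering.
        return remaining_room_mulpdu
    # Otherwise offer the next full-sized segment, never less than the floor.
    return max(full_segment_mulpdu, MIN_MULPDU)
```

A small FPDU may still be packed into the remaining room even when the advertised MULPDU refers to the next full-sized segment.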
7.4 MPA Receiver FPDU Identification 5.4 MPA Receiver FPDU Identification
An MPA receiver MUST first verify the FPDU before passing the ULPDU An MPA receiver MUST first verify the FPDU before passing the ULPDU
to DDP. To do this, the receiver MUST: to DDP. To do this, the receiver MUST:
* locate the start of the FPDU unambiguously, * locate the start of the FPDU unambiguously,
* verify its CRC (if CRC checking is enabled). * verify its CRC (if CRC checking is enabled).
If the above conditions are true, the MPA receiver passes the ULPDU If the above conditions are true, the MPA receiver passes the ULPDU
to DDP. to DDP.
To detect the start of the FPDU unambiguously one of the following To detect the start of the FPDU unambiguously one of the following
MUST be used: MUST be used:
1: In an ordered TCP stream, the ULPDU Length field in the current 1: In an ordered TCP stream, the ULPDU Length field in the current
FPDU when FPDU has a valid CRC, can be used to identify the FPDU when FPDU has a valid CRC, can be used to identify the
beginning of the next FPDU. beginning of the next FPDU.
2: For receivers that support out of order reception of FPDUs (see 2: For receivers that support out of order reception of FPDUs (see
section 7.1 MPA Markers on page 16) a Marker can always be used section 5.1 MPA Markers on page 16) a Marker can always be used
to locate the beginning of an FPDU (in FPDUs with valid CRCs). to locate the beginning of an FPDU (in FPDUs with valid CRCs).
Since the location of the marker is known in the octet stream Since the location of the marker is known in the octet stream
(sequence number space), the marker can always be found. (sequence number space), the marker can always be found.
3: Having found an FPDU by means of a Marker, following contiguous 3: Having found an FPDU by means of a Marker, following contiguous
FPDUs can be found by using the ULPDU Lengths (from FPDUs with FPDUs can be found by using the ULPDU Lengths (from FPDUs with
valid CRCs) to establish the next FPDU boundary. valid CRCs) to establish the next FPDU boundary.
The ULPDU Length field (see section 6) MUST be used to determine if The ULPDU Length field (see section 4) MUST be used to determine if
the entire FPDU is present before forwarding the ULPDU to DDP. the entire FPDU is present before forwarding the ULPDU to DDP.
CRC calculation is discussed in section 7.2 on page 18 above. CRC calculation is discussed in section 5.2 on page 18 above.
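The completeness check on an in-order stream can be sketched as follows. The sketch assumes markers are disabled, a 2-octet big-endian ULPDU Length field at the start of the FPDU, and that CRC verification happens separately; the function name is hypothetical.

```python
import struct


def fpdu_extent(buf: bytes, offset: int = 0):
    """Return the FPDU length if the whole FPDU at offset is buffered, else None."""
    available = len(buf) - offset
    if available < 2:
        return None  # the ULPDU Length field itself has not fully arrived
    (ulpdu_len,) = struct.unpack_from("!H", buf, offset)
    pad = (-(2 + ulpdu_len)) % 4
    total = 2 + ulpdu_len + pad + 4  # length field + ULPDU + PAD + CRC
    return total if available >= total else None
```

When the function returns a length, adding it to the offset gives the start of the next contiguous FPDU, provided the current FPDU's CRC is valid.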
7.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders
Since MPA on MPA-aware TCP senders start FPDUs on TCP segment Since MPA on MPA-aware TCP senders start FPDUs on TCP segment
boundaries, a receiving DDP on MPA on TCP implementation may be able boundaries, a receiving DDP on MPA on TCP implementation may be able
to optimize the reception of data in various ways. to optimize the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries. segment boundaries.
Some MPA senders may be unable to conform to the sender requirements Some MPA senders may be unable to conform to the sender requirements
because their implementation of TCP is not designed with MPA in mind. because their implementation of TCP is not designed with MPA in mind.
skipping to change at page 26, line 5 skipping to change at page 26, line 5
DDP as soon as validated, and Delivered when ordering is DDP as soon as validated, and Delivered when ordering is
established. If the whole FPDU is not available, the receiver established. If the whole FPDU is not available, the receiver
should buffer until the remainder of the FPDU arrives. should buffer until the remainder of the FPDU arrives.
* Combinations of Unaligned or incomplete FPDUs (and potentially * Combinations of Unaligned or incomplete FPDUs (and potentially
other complete FPDUs) in the same TCP segment: If any FPDU is other complete FPDUs) in the same TCP segment: If any FPDU is
present in its entirety, or can be completed with portions present in its entirety, or can be completed with portions
already available, it can be passed to DDP as soon as validated, already available, it can be passed to DDP as soon as validated,
and Delivered when ordering is established. and Delivered when ordering is established.
8 Connection Semantics 6 Connection Semantics
8.1 Connection setup 6.1 Connection setup
MPA requires that the consumer MUST activate MPA, and any TCP MPA requires that the consumer MUST activate MPA, and any TCP
enhancements for MPA, on a TCP half connection at the same location enhancements for MPA, on a TCP half connection at the same location
in the octet stream at both the sender and the receiver. This is in the octet stream at both the sender and the receiver. This is
required in order for the marker scheme to correctly locate the required in order for the marker scheme to correctly locate the
markers (if enabled) and to correctly locate the first FPDU. markers (if enabled) and to correctly locate the first FPDU.
MPA, and any TCP enhancements for MPA are enabled by the ULP in both MPA, and any TCP enhancements for MPA are enabled by the ULP in both
directions at once at an endpoint. directions at once at an endpoint.
This can be accomplished several ways, and is left up to DDP's ULP: This can be accomplished several ways, and is left up to DDP's ULP:
* DDP's ULP MAY require DDP on MPA startup immediately after TCP * DDP's ULP MAY require DDP on MPA startup immediately after TCP
connection setup. This has the advantage that no streaming mode connection setup. This has the advantage that no streaming mode
negotiation is needed. An example of such a protocol is shown in negotiation is needed. An example of such a protocol is shown in
Figure 9: Example Immediate Startup negotiation on page 34. Figure 9: Example Immediate Startup negotiation on page 35.
This may be accomplished by using a well-known port, or a service This may be accomplished by using a well-known port, or a service
locator protocol to locate an appropriate port on which DDP on locator protocol to locate an appropriate port on which DDP on
MPA is expected to operate. MPA is expected to operate.
* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
normal TCP startup, using TCP streaming data exchanges on the normal TCP startup, using TCP streaming data exchanges on the
same connection. The exchange establishes that DDP on MPA (as same connection. The exchange establishes that DDP on MPA (as
well as other ULPs) will be used, and exactly locates the point well as other ULPs) will be used, and exactly locates the point
in the octet stream where MPA is to begin operation. Note that in the octet stream where MPA is to begin operation. Note that
such a negotiation protocol is outside the scope of this such a negotiation protocol is outside the scope of this
specification. A simplified example of such a protocol is shown specification. A simplified example of such a protocol is shown
in Figure 8: Example Delayed Startup negotiation on page 31. in Figure 8: Example Delayed Startup negotiation on page 32.
An MPA endpoint operates in two distinct phases. An MPA endpoint operates in two distinct phases.
The "Startup Phase" is used to verify correct MPA setup, exchange CRC The "Startup Phase" is used to verify correct MPA setup, exchange CRC
and Marker configuration, and optionally pass "private data" between and Marker configuration, and optionally pass "private data" between
endpoints prior to completing a DDP connection. During this phase, endpoints prior to completing a DDP connection. During this phase,
specifically formatted frames are exchanged as TCP byte streams specifically formatted frames are exchanged as TCP byte streams
without using CRCs or Markers. During this phase a DDP endpoint need without using CRCs or Markers. During this phase a DDP endpoint need
not be "bound" to the MPA connection. In fact, the choice of DDP not be "bound" to the MPA connection. In fact, the choice of DDP
endpoint and its operating parameters may not be known until the endpoint and its operating parameters may not be known until the
skipping to change at page 27, line 13 skipping to change at page 27, line 13
connection at entry to this phase. connection at entry to this phase.
When "private data" is passed between ULPs in the "Startup Phase", When "private data" is passed between ULPs in the "Startup Phase",
the ULP is responsible for interpreting that data, and then placing the ULP is responsible for interpreting that data, and then placing
MPA into "Full operation". MPA into "Full operation".
Note: The following text differentiates the two endpoints by calling Note: The following text differentiates the two endpoints by calling
them "Initiator" and "Responder". This is quite arbitrary and is them "Initiator" and "Responder". This is quite arbitrary and is
NOT related to the TCP startup (SYN, SYN/ACK sequence). The NOT related to the TCP startup (SYN, SYN/ACK sequence). The
Initiator is the side that sends first in the MPA startup Initiator is the side that sends first in the MPA startup
sequence (the MPA Request Frame). sequence (the "MPA Request Frame").
Note: The possibility that both endpoints would be allowed to make a
connection at the same time, sometimes called an "Active/Active"
connection, was considered by the work group and rejected. There
were several motivations for this decision. One was that
applications needing this facility were few (none other than
theoretical at the time of this draft). Another was that the
facility created some implementation difficulties, particularly
with the "Dual Stack" designs described later on. A last issue
was that dealing with rejected connections at startup would have
required at least an additional frame type, and more recovery
actions, complicating the protocol. While none of these issues
was overwhelming, the group and implementers were not motivated
to do the work to resolve these issues.
The ULP is responsible for determining which side is "Initiator" or The ULP is responsible for determining which side is "Initiator" or
"Responder". For "Client/Server" type ULPs this is easy. For peer- "Responder". For "Client/Server" type ULPs this is easy. For peer-
peer ULPs (which might utilize a TCP style "active/active" startup), peer ULPs (which might utilize a TCP style "active/active" startup),
some mechanism (not defined by this specification) must be some mechanism (not defined by this specification) must be
established, or some streaming mode data exchanged prior to MPA established, or some streaming mode data exchanged prior to MPA
startup to determine the side which starts in "Initiator" and which startup to determine the side which starts in "Initiator" and which
starts in "Responder" MPA mode. starts in "Responder" MPA mode.
The following rules apply to MPA connection startup phase: The following rules apply to MPA connection startup phase:
1. When MPA is started in the "Initiator" mode, the MPA 1. When MPA is started in the "Initiator" mode, the MPA
implementation MUST send a valid "MPA Request Frame". implementation MUST send a valid "MPA Request Frame". The "MPA
Request Frame" MAY include ULP supplied "Private Data".
2. When MPA is started in the "Responder" mode, the MPA 2. When MPA is started in the "Responder" mode, the MPA
implementation MUST wait until a "MPA Request Frame" is received implementation MUST wait until a "MPA Request Frame" is received
and validated before sending any MPA data and before starting to and validated before entering full MPA/DDP operation.
interpret any data received as FPDUs and passing any received
ULPDUs to DDP. After the received "MPA Request Frame" is If the "MPA Request Frame" is improperly formatted, the
validated, the MPA implementation MUST either send a valid "MPA implementation MUST close the TCP connection and exit MPA.
Reply Frame", or close the connection.
If the "MPA Request Frame" is properly formatted but the "Private
Data" is not acceptable, the implementation SHOULD return an "MPA
Reply Frame" with the "Rejected Connection" bit set to '1'; the
"MPA Reply Frame" MAY include ULP supplied "Private Data"; the
implementation MUST exit MPA, leaving the TCP connection open.
The ULP may close TCP or use the connection for other purposes.
If the "MPA Request Frame" is properly formatted and the "Private
Data" is acceptable, the implementation SHOULD return an "MPA
Reply Frame" with the "Rejected Connection" bit set to '0'; the
"MPA Reply Frame" MAY include ULP supplied "Private Data"; and
the responder SHOULD prepare to interpret any data received as
FPDUs and pass any received ULPDUs to DDP.
Note: Since the receiver's ability to deal with markers is Note: Since the receiver's ability to deal with markers is
unknown until the Request and Reply frames have been unknown until the Request and Reply frames have been
received, sending FPDUs before this occurs is not possible. received, sending FPDUs before this occurs is not possible.
Note: The requirement to wait on a Request Frame before sending a Note: The requirement to wait on a Request Frame before sending a
Reply frame is a design choice; it makes for a well ordered Reply frame is a design choice; it makes for a well ordered
sequence of events at each end, and avoids having to specify sequence of events at each end, and avoids having to specify
how to deal with situations where both ends start at the same how to deal with situations where both ends start at the same
time. time.
3. MPA "Initiator" mode implementations MUST receive and validate a 3. MPA "Initiator" mode implementations MUST receive and validate a
"MPA Reply Frame" before sending any FPDUs, and before starting "MPA Reply Frame".
to interpret any data received as FPDUs and passing any received
ULPDUs to DDP. If the "MPA Reply Frame" is improperly formatted, the
implementation MUST close the TCP connection and exit MPA.
If the "MPA Reply Frame" is properly formatted but the
"Private Data" is not acceptable, or if the "Rejected Connection"
bit is set to '1', the implementation MUST exit MPA, leaving the TCP
connection open. The ULP may close TCP or use the connection for
other purposes.
If the "MPA Reply Frame" is properly formatted and the "Private
Data" is acceptable, and the "Rejected Connection" bit is set to
'0', the implementation SHOULD enter full MPA/DDP operation mode;
interpreting any received data as FPDUs and sending DDP ULPDUs as
FPDUs.
4. MPA "Responder" mode implementations MUST receive and validate at 4. MPA "Responder" mode implementations MUST receive and validate at
least one FPDU before sending any FPDUs or markers. least one FPDU before sending any FPDUs or markers.
Note: this requirement is present to allow the Initiator time to Note: this requirement is present to allow the Initiator time to
get its receiver into full operation before an FPDU arrives, get its receiver into full operation before an FPDU arrives,
avoiding potentially difficult requirements on the receiver. avoiding potential race conditions at the initiator. This
was also subject to some debate in the work group before
rough consensus was reached. Eliminating this requirement
would allow faster startup in some types of applications.
However, that would also make certain implementations
(particularly "Dual Stack") much harder.
5. If a received "Key" does not match the expected value, (See 8.1.1 5. If a received "Key" does not match the expected value, (See 6.1.1
MPA Request Frame Format below) the TCP/DDP connection MUST be MPA Request and Reply Frame Format below) the TCP/DDP connection
closed, and an error returned to the ULP. MUST be closed, and an error returned to the ULP.
6. The received "Private Data" fields may be used by consumers at 6. The received "Private Data" fields may be used by consumers at
either end to further validate the connection, and set up DDP or either end to further validate the connection, and set up DDP or
other ULP parameters. The ULP MAY close the TCP/MPA/DDP other ULP parameters. The Initiator ULP MAY close the
connection as a result of validating the "Private Data" fields. TCP/MPA/DDP connection as a result of validating the "Private
Data" fields. The Responder SHOULD return an "MPA Reply Frame"
with the "Rejected Connection" bit set to '1' if the "Private
Data" is not acceptable to the ULP.
7. When the first FPDU is to be sent, then if markers are enabled, 7. When the first FPDU is to be sent, then if markers are enabled,
the first octets sent are the special marker 0x00000000, followed the first octets sent are the special marker 0x00000000, followed
by the start of the FPDU (the FPDU's "ULPDU Length" field). If by the start of the FPDU (the FPDU's "ULPDU Length" field). If
markers are not enabled, the first octets sent are the start of markers are not enabled, the first octets sent are the start of
the FPDU (the FPDU's "ULPDU Length" field). the FPDU (the FPDU's "ULPDU Length" field).
8. MPA implementations MUST use the difference between the MPA 8. MPA implementations MUST use the difference between the "MPA
Request Frame and the MPA Reply Frame to check for incorrect Request Frame" and the "MPA Reply Frame" to check for incorrect
"Initiator/Initiator" startups. Implementations SHOULD put a "Initiator/Initiator" startups. Implementations SHOULD put a
timeout on waiting for the MPA Reply Frame when started in timeout on waiting for the "MPA Request Frame" when started in
"Responder" mode, to detect incorrect "Responder/Responder" "Responder" mode, to detect incorrect "Responder/Responder"
startups. startups.
8.1.1 MPA Request Frame Format 9. MPA implementations MUST validate the PD_Length field. The
buffer that receives the "Private Data" field MUST be large
enough to receive that data; the amount of "Private Data" MUST
NOT exceed the PD_Length, or the application buffer. If any of
the above fails, the startup frame MUST be considered improperly
formatted.
10. MPA implementations SHOULD implement a reasonable timeout while
waiting for entire startup frames; this prevents certain
denial of service attacks. ULPs SHOULD implement a reasonable
timeout while waiting for FPDUs, ULPDUs and application level
messages to guard against application failures and certain denial
of service attacks.
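The startup rules in items 2 and 3 of the newer revision can be summarized as two small decision functions. The following is only an illustrative sketch: the names `Action`, `responder_action`, and `initiator_action` are hypothetical and do not appear in the draft, and the three boolean inputs are assumed to have been computed elsewhere (frame validation by MPA, "Private Data" acceptance by the ULP).

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of processing a startup frame."""
    CLOSE_TCP = "close the TCP connection and exit MPA"
    REJECT_AND_EXIT = "send Reply with Rejected Connection = 1, exit MPA, leave TCP open"
    ACCEPT = "send Reply with Rejected Connection = 0, prepare to receive FPDUs"
    EXIT_MPA = "exit MPA, leave TCP open"
    FULL_OPERATION = "enter full MPA/DDP operation"


def responder_action(well_formed: bool, private_data_ok: bool) -> Action:
    """Responder handling of a received "MPA Request Frame" (item 2)."""
    if not well_formed:
        return Action.CLOSE_TCP
    if not private_data_ok:
        return Action.REJECT_AND_EXIT
    return Action.ACCEPT


def initiator_action(well_formed: bool, private_data_ok: bool,
                     rejected: bool) -> Action:
    """Initiator handling of a received "MPA Reply Frame" (item 3)."""
    if not well_formed:
        return Action.CLOSE_TCP
    if rejected or not private_data_ok:
        return Action.EXIT_MPA
    return Action.FULL_OPERATION
```

Note that only an improperly formatted frame forces the TCP connection closed; in the rejection cases the connection is left open and the ULP decides what to do with it.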
6.1.1 MPA Request and Reply Frame Format
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 | | 0 | |
+ Key (16 bytes containing "MPA ID Req Frame") + + Key (16 bytes containing "MPA ID Req Frame") +
4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) | 4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) |
+ Or (16 bytes containing "MPA ID Rep Frame") + + Or (16 bytes containing "MPA ID Rep Frame") +
8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) | 8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) |
+ + + +
12 | | 12 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C| Res | Rev | PD_Length | 16 |M|C|R| Res | Rev | PD_Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
~ ~ ~ ~
~ Private Data ~ ~ Private Data ~
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7 "MPA Request/Reply Frame" Figure 7 "MPA Request/Reply Frame"
skipping to change at page 29, line 47 skipping to change at page 30, line 47
50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).
Initiator mode receivers MUST check this field for the same Initiator mode receivers MUST check this field for the same
value, and close the connection and report an error locally if value, and close the connection and report an error locally if
any other value is detected. any other value is detected.
M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply
Frame", declares a receiver's requirement for Markers. When in a Frame", declares a receiver's requirement for Markers. When in a
received "MPA Request Frame" or "MPA Reply Frame" and the value received "MPA Request Frame" or "MPA Reply Frame" and the value
is '0', markers MUST NOT be added to the data stream by the is '0', markers MUST NOT be added to the data stream by the
sender. When '1' markers MUST be added as described in section sender. When '1' markers MUST be added as described in section
7.1 MPA Markers on page 16. 5.1 MPA Markers on page 16.
C: This bit declares an endpoint's preferred CRC usage. When this C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the "MPA Request Frame" and the "MPA Reply field is '0' in the "MPA Request Frame" and the "MPA Reply
Frame", CRCs MUST NOT be checked and need not be generated by Frame", CRCs MUST NOT be checked and need not be generated by
either endpoint. When this bit is '1' in either the "MPA Request either endpoint. When this bit is '1' in either the "MPA Request
Frame" or "MPA Reply Frame", CRCs MUST be generated and checked Frame" or "MPA Reply Frame", CRCs MUST be generated and checked
by both endpoints. by both endpoints.
R: This bit is set to zero, and not checked on reception in the "MPA
Request Frame". In the "MPA Reply Frame", this bit is the
"Rejected Connection" bit, set by the responder's ULP to indicate
acceptance '0', or rejection '1', of the connection parameters
provided in the "Private Data".
Res: This field is reserved for future use. It must be set to zero Res: This field is reserved for future use. It must be set to zero
when sending, and not checked on reception. when sending, and not checked on reception.
Rev: This field contains the Revision of MPA. For this version of Rev: This field contains the Revision of MPA. For this version of
the specification senders MUST set this field to zero. MPA the specification senders MUST set this field to zero. MPA
receivers compliant with this version of the specification MUST receivers compliant with this version of the specification MUST
check this field for zero, and close the connection and report an check this field for zero, and close the connection and report an
error locally if any other value is detected. error locally if any other value is detected.
PD_Length: This field MUST contain the length in Octets of the PD_Length: This field MUST contain the length in Octets of the
Private Data field. A value of zero indicates that there is no Private Data field. A value of zero indicates that there is no
private data field present at all. The private data field may be private data field present at all. The private data field may be
as long as 65535 Octets. as long as 65535 Octets.
Private Data: This field may contain any value defined by ULPs or may Private Data: This field may contain any value defined by ULPs or may
not be present. ULPs define how to set and validate this field. not be present. ULPs define how to set and validate this field.
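As a rough illustration of the frame layout in Figure 7 (newer revision, with the R bit), the sketch below builds and parses a startup frame. It is not taken from the draft. The bit positions are an assumption read off the figure: M, C, and R are taken as the three most significant bits of octet 16, followed by five reserved bits, an 8-bit Rev, and a 16-bit PD_Length in network byte order. The names `build_frame`, `parse_frame`, `MpaFrameError`, and the `max_private_data` limit are hypothetical.

```python
import struct

REQ_KEY = b"MPA ID Req Frame"   # 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65
REP_KEY = b"MPA ID Rep Frame"   # 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65

# Key (16 octets), flags octet (M|C|R|Res), Rev octet, PD_Length (16 bits).
HEADER = struct.Struct("!16sBBH")

M_BIT, C_BIT, R_BIT = 0x80, 0x40, 0x20  # assumed positions, see lead-in


class MpaFrameError(ValueError):
    """Raised when a startup frame is improperly formatted."""


def build_frame(key, markers, crc, rejected=False, private_data=b""):
    if len(private_data) > 0xFFFF:
        raise MpaFrameError("Private Data longer than 65535 octets")
    flags = (M_BIT if markers else 0) | (C_BIT if crc else 0) \
        | (R_BIT if rejected else 0)          # reserved bits sent as zero
    return HEADER.pack(key, flags, 0, len(private_data)) + private_data


def parse_frame(data, expected_key, max_private_data=512):
    if len(data) < HEADER.size:
        raise MpaFrameError("frame shorter than fixed header")
    key, flags, rev, pd_length = HEADER.unpack_from(data)
    if key != expected_key:
        raise MpaFrameError("unexpected Key")
    if rev != 0:
        raise MpaFrameError("unsupported Rev")
    # Validate PD_Length against the receive buffer and the data present.
    if pd_length > max_private_data:
        raise MpaFrameError("Private Data exceeds receive buffer")
    if len(data) - HEADER.size < pd_length:
        raise MpaFrameError("Private Data truncated")
    return {
        "markers": bool(flags & M_BIT),
        "crc": bool(flags & C_BIT),
        "rejected": bool(flags & R_BIT),    # only meaningful in a Reply
        "private_data": data[HEADER.size:HEADER.size + pd_length],
    }
```

The reserved bits are deliberately ignored on reception, and any `MpaFrameError` corresponds to the "improperly formatted" case, for which the text requires closing the TCP connection.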
8.1.2 Example Delayed Startup sequence 6.1.2 Example Delayed Startup sequence
A variety of startup sequences are possible when using MPA on TCP. A variety of startup sequences are possible when using MPA on TCP.
Following is an example of an MPA/DDP startup that occurs after TCP Following is an example of an MPA/DDP startup that occurs after TCP
has been running for a while and has exchanged some amount of has been running for a while and has exchanged some amount of
streaming data. This example does not use any private data (an streaming data. This example does not use any private data (an
example that does is shown later in 8.1.3.2 Example Immediate Startup example that does is shown later in 6.1.3.2 Example Immediate Startup
using Private Data on page 34), although it is perfectly legal to using Private Data on page 35), although it is perfectly legal to
include the private data. Note that since the example does not use include the private data. Note that since the example does not use
any Private Data, there are no ULP interactions shown between any Private Data, there are no ULP interactions shown between
receiving "Startup frames" and putting MPA into "Full operation". receiving "Startup frames" and putting MPA into "Full operation".
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|ULP streaming mode | |ULP streaming mode |
| <Hello> request to | | <Hello> request to |
| transition to DDP/MPA | +--------------------------+ | transition to DDP/MPA | +--------------------------+
skipping to change at page 32, line 7 skipping to change at page 33, line 7
|MPA sends first FPDU (as | +--------------------------+ |MPA sends first FPDU (as | +--------------------------+
|DDP ULPDUs become | ========> |MPA Receives first FPDU. | |DDP ULPDUs become | ========> |MPA Receives first FPDU. |
|available). | |MPA sends first FPDU (as | |available). | |MPA sends first FPDU (as |
+---------------------------+ |DDP ULPDUs become | +---------------------------+ |DDP ULPDUs become |
<====== |available. | <====== |available. |
+--------------------------+ +--------------------------+
Figure 8: Example Delayed Startup negotiation Figure 8: Example Delayed Startup negotiation
An example Delayed Startup sequence is described below: An example Delayed Startup sequence is described below:
* Active and passive sides start up a TCP connection in the * Active and passive sides start up a TCP connection in the
ususal fashion, probably using sockets APIs. They exchange usual fashion, probably using sockets APIs. They exchange
some amount of streaming mode data. At some point one side some amount of streaming mode data. At some point one side
(the MPA Initiator) sends streaming mode data that (the MPA Initiator) sends streaming mode data that
effectively says "Hello, Let's go into MPA/DDP mode." effectively says "Hello, Let's go into MPA/DDP mode."
* When the remote side (the MPA Responder) gets this streaming mode * When the remote side (the MPA Responder) gets this streaming mode
message, the consumer would send a last streaming mode message message, the consumer would send a last streaming mode message
that effectively says "I Acknowledge your Hello, and am now in that effectively says "I Acknowledge your Hello, and am now in
MPA Responder Mode". The exchange of these messages establishes MPA Responder Mode". The exchange of these messages establishes
the exact point in the TCP stream where MPA is enabled. The the exact point in the TCP stream where MPA is enabled. The
Responding Consumer enables MPA in the Responder mode and waits Responding Consumer enables MPA in the Responder mode and waits
for the initial MPA startup message. for the initial MPA startup message.
* The Initiating Consumer would enable MPA startup in the * The Initiating Consumer would enable MPA startup in the
Initiator mode which then sends the MPA Request Frame. It is Initiator mode which then sends the "MPA Request Frame". It
assumed that no "Private Data" messages are needed for this is assumed that no "Private Data" messages are needed for
example, although it is possible to do so. The Initiating this example, although it is possible to do so. The
MPA (and Consumer) would also wait for the MPA connection to Initiating MPA (and Consumer) would also wait for the MPA
be accepted. connection to be accepted.
* The Responding MPA would receive the initial "MPA Request Frame" * The Responding MPA would receive the initial "MPA Request Frame"
and would inform the consumer that this message arrived. The and would inform the consumer that this message arrived. The
Consumer can then accept the MPA/DDP connection or close the TCP Consumer can then accept the MPA/DDP connection or close the TCP
connection. connection.
* To accept the connection request, the Responding Consumer would * To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into full operation. In the endpoint, thus enabling MPA/DDP into full operation. In the
process of going to full operation, MPA sends the "MPA Reply process of going to full operation, MPA sends the "MPA Reply
Frame". MPA/DDP waits for the first incoming FPDU before sending Frame". MPA/DDP waits for the first incoming FPDU before sending
any FPDUs. any FPDUs.
* If the initial TCP data was not a properly formatted MPA Request * If the initial TCP data was not a properly formatted "MPA Request
Frame the Consumer can close or reset the TCP connection Frame" MPA will close or reset the TCP connection immediately.
immediately.
* The Initiating MPA would receive the MPA Reply Frame and * The Initiating MPA would receive the "MPA Reply Frame" and
would report this message to the Consumer. The Consumer can would report this message to the Consumer. The Consumer can
then accept the MPA/DDP connection, or close or reset the TCP then accept the MPA/DDP connection, or close or reset the TCP
connection to abort the process. connection to abort the process.
* On determining that the Connection is acceptable, the * On determining that the Connection is acceptable, the
Initiating Consumer would use an appropriate API to bind the Initiating Consumer would use an appropriate API to bind the
TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
into full operation. MPA/DDP would begin sending DDP into full operation. MPA/DDP would begin sending DDP
messages as MPA FPDUs. messages as MPA FPDUs.
8.1.3 Use of "Private Data" 6.1.3 Use of "Private Data"
This section is advisory in nature, in that it suggests a method that This section is advisory in nature, in that it suggests a method that
a ULP can deal with pre-DDP connection information exchange. a ULP can deal with pre-DDP connection information exchange.
8.1.3.1 Motivation 6.1.3.1 Motivation
Prior RDMA protocols have been developed that provide "private data" Prior RDMA protocols have been developed that provide "private data"
via out of band mechanisms. As a result, many applications now via out of band mechanisms. As a result, many applications now
expect some form of "private data" to be available for application expect some form of "private data" to be available for application
use prior to setting up the DDP/RDMA connection. For example, use prior to setting up the DDP/RDMA connection. For example,
An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
and the [Verbs]) must be associated with a Protection Domain. No and the [Verbs]) must be associated with a Protection Domain. No
receive operations may be posted to the endpoint before it is receive operations may be posted to the endpoint before it is
associated with a Protection Domain. Indeed under both the associated with a Protection Domain. Indeed under both the
skipping to change at page 33, line 52 skipping to change at page 34, line 52
ULP reasons. ULP reasons.
There are several potential ways to exchange this "Private Data". There are several potential ways to exchange this "Private Data".
For Example, the InfiniBand specification includes a connection For Example, the InfiniBand specification includes a connection
management protocol that allows a small amount of "private data" to management protocol that allows a small amount of "private data" to
be exchanged using datagrams before actually starting the RDMA be exchanged using datagrams before actually starting the RDMA
connection. connection.
This draft allows for small amounts of "Private Data" to be exchanged This draft allows for small amounts of "Private Data" to be exchanged
as part of the MPA startup sequence. The actual Private Data fields as part of the MPA startup sequence. The actual Private Data fields
are carried in the MPA Request Frame, and the MPA Reply Frame. are carried in the "MPA Request Frame", and the "MPA Reply Frame".
If larger amounts of private data or more negotiation is necessary, If larger amounts of private data or more negotiation is necessary,
TCP streaming mode messages may be exchanged prior to enabling MPA. TCP streaming mode messages may be exchanged prior to enabling MPA.
8.1.3.2 Example Immediate Startup using Private Data 6.1.3.2 Example Immediate Startup using Private Data
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|TCP SYN sent | +--------------------------+ |TCP SYN sent | +--------------------------+
+---------------------------+ --------> |TCP gets SYN packet; | +---------------------------+ --------> |TCP gets SYN packet; |
+---------------------------+ | Sends SYN-Ack | +---------------------------+ | Sends SYN-Ack |
|TCP gets SYN-Ack | <-------- +--------------------------+ |TCP gets SYN-Ack | <-------- +--------------------------+
| Sends Ack | | Sends Ack |
+---------------------------+ --------> +--------------------------+ +---------------------------+ --------> +--------------------------+
skipping to change at page 34, line 49 skipping to change at page 35, line 49
|available). | |MPA sends first FPDU (as | |available). | |MPA sends first FPDU (as |
+---------------------------+ |DDP ULPDUs become | +---------------------------+ |DDP ULPDUs become |
<====== |available. | <====== |available. |
+--------------------------+ +--------------------------+
Figure 9: Example Immediate Startup negotiation Figure 9: Example Immediate Startup negotiation
Note: the exact order of when MPA is started in the TCP connection Note: the exact order of when MPA is started in the TCP connection
sequence is implementation dependent; the above diagram shows one sequence is implementation dependent; the above diagram shows one
possible sequence. Also, the Initiator "Ack" to the Responder's possible sequence. Also, the Initiator "Ack" to the Responder's
"SYN-Ack" may be combined into the same TCP segment containing "SYN-Ack" may be combined into the same TCP segment containing
the MPA Request Frame (as is allowed by TCP RFCs). the "MPA Request Frame" (as is allowed by TCP RFCs).
The example immediate startup sequence is described below: The example immediate startup sequence is described below:
* The passive side (Responding Consumer) would listen on the TCP * The passive side (Responding Consumer) would listen on the TCP
destination port, to indicate its readiness to accept a destination port, to indicate its readiness to accept a
connection. connection.
* The active side (Initiating Consumer) would request a * The active side (Initiating Consumer) would request a
connection from a TCP endpoint (that expected to upgrade to connection from a TCP endpoint (that expected to upgrade to
MPA/DDP/RDMA and expected the private data) to a destination MPA/DDP/RDMA and expected the private data) to a destination
skipping to change at page 35, line 47 skipping to change at page 36, line 47
connection with a return message. connection with a return message.
* To accept the connection request, the Responding Consumer would * To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into full operation. In the endpoint, thus enabling MPA/DDP into full operation. In the
process of going to full operation, MPA sends the "MPA Reply process of going to full operation, MPA sends the "MPA Reply
Frame" which includes the Consumer supplied "Private Data" Frame" which includes the Consumer supplied "Private Data"
containing any appropriate consumer response. MPA/DDP waits for containing any appropriate consumer response. MPA/DDP waits for
the first incoming FPDU before sending any FPDUs. the first incoming FPDU before sending any FPDUs.
* If the initial TCP data was not a properly formatted MPA Request * If the initial TCP data was not a properly formatted "MPA Request
Frame, or if the Consumer Private Data was not acceptable, the Frame", MPA will close or reset the TCP connection immediately.
Consumer can close or reset the TCP connection immediately.
* To reject the MPA connection request, the Responding Consumer * To reject the MPA connection request, the Responding Consumer
would send an MPA Reply Frame with any ULP supplied "Private Data would send an "MPA Reply Frame" with any ULP supplied "Private
Response" (with reason for rejection) and close the TCP Data" (with reason for rejection), with the "Rejected Connection"
connection. bit set to '1', and may close the TCP connection.
* The Initiating MPA would receive the MPA Reply Frame with the * The Initiating MPA would receive the "MPA Reply Frame" with
"Private Data Response" message and would report this message the "Private Data" message and would report this message to
to the Consumer, including the supplied Private Data. The the Consumer, including the supplied Private Data.
Consumer can then accept and finalize the MPA/DDP connection,
or close or reset the TCP connection to abort the process.
* On determining from the "Private Data Response" that the If the "Rejected Connection" bit is set to '1', MPA will
Connection is acceptable, the Initiating Consumer would use close the TCP connection and exit.
an appropriate API to bind the TCP/MPA connections to a DDP
endpoint thus enabling MPA/DDP into full operation. MPA/DDP
would begin sending DDP messages as MPA FPDUs.
8.1.4 "Dual Stack" implementations If the "Rejected Connection" bit is set to a '0', and on
determining from the "MPA Reply Frame" "Private Data" that
the Connection is acceptable, the Initiating Consumer would
use an appropriate API to bind the TCP/MPA connections to a
DDP endpoint thus enabling MPA/DDP into full operation.
MPA/DDP would begin sending DDP messages as MPA FPDUs.
6.1.4 "Dual Stack" implementations
MPA/DDP implementations are commonly expected to be implemented as MPA/DDP implementations are commonly expected to be implemented as
part of a "Dual stack" architecture. One "stack" is the traditional part of a "Dual stack" architecture. One "stack" is the traditional
TCP stack, usually with a sockets interface API. The second stack is TCP stack, usually with a sockets interface API. The second stack is
the MPA/DDP "stack" with its own API, and potentially separate code the MPA/DDP "stack" with its own API, and potentially separate code
or hardware to deal with the MPA/DDP data. Of course, or hardware to deal with the MPA/DDP data. Of course,
implementations may vary, so the following comments are of an implementations may vary, so the following comments are of an
advisory nature only. advisory nature only.
The use of the two "stacks" offers advantages: The use of the two "stacks" offers advantages:
skipping to change at page 37, line 26 skipping to change at page 38, line 30
message" as part of its "Responder" DDP/MPA enable function. message" as part of its "Responder" DDP/MPA enable function.
This allows the DDP/MPA stack to more easily manage the This allows the DDP/MPA stack to more easily manage the
conversion to DDP/MPA mode (and avoid problems with a very fast conversion to DDP/MPA mode (and avoid problems with a very fast
return of the "MPA Request Frame" from the Initiator side). return of the "MPA Request Frame" from the Initiator side).
Note: Regardless of the "stack" architecture used, TCP's rules must Note: Regardless of the "stack" architecture used, TCP's rules must
be followed. For example, if network data is lost, re-segmented be followed. For example, if network data is lost, re-segmented
or re-ordered, TCP must recover appropriately even when this or re-ordered, TCP must recover appropriately even when this
occurs while switching stacks. occurs while switching stacks.
8.2 Normal Connection Teardown 6.2 Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the Each half connection of MPA terminates when DDP closes the
corresponding TCP half connection. corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
that a graceful close of the LLP connection has been received by the that a graceful close of the LLP connection has been received by the
LLP (e.g. FIN is received). LLP (e.g. FIN is received).
9 Error Semantics 7 Error Semantics
The following errors MUST be detected by MPA and the codes SHOULD be The following errors MUST be detected by MPA and the codes SHOULD be
provided to DDP or other consumer: provided to DDP or other consumer:
Code Error Code Error
1 TCP connection closed, terminated or lost. This includes 1 TCP connection closed, terminated or lost. This includes
lost by timeout, too many retries, RST received or FIN lost by timeout, too many retries, RST received or FIN
received. received.
skipping to change at page 39, line 5 skipping to change at page 40, line 5
following a reported error. Closing the connection is the following a reported error. Closing the connection is the
responsibility of DDP's ULP. responsibility of DDP's ULP.
Note that since MPA will not deliver any FPDUs on a half Note that since MPA will not deliver any FPDUs on a half
connection following an error detected on the receive side of connection following an error detected on the receive side of
that connection, DDP's ULP is expected to tear down the that connection, DDP's ULP is expected to tear down the
connection. This may not occur until after one or more last connection. This may not occur until after one or more last
messages are transmitted on the opposite half connection. This messages are transmitted on the opposite half connection. This
allows a diagnostic error message to be sent. allows a diagnostic error message to be sent.
10 Security Considerations 8 Security Considerations
This section discusses the security considerations for MPA. This section discusses the security considerations for MPA.
10.1 Protocol-specific Security Considerations 8.1 Protocol-specific Security Considerations
The vulnerabilities of MPA to third-party attacks are no greater than The vulnerabilities of MPA to third-party attacks are no greater than
any other protocol running over TCP. A third party, by sending any other protocol running over TCP. A third party, by sending
packets into the network that are delivered to an MPA receiver, could packets into the network that are delivered to an MPA receiver, could
launch a variety of attacks that take advantage of how MPA operates. launch a variety of attacks that take advantage of how MPA operates.
For example, a third party could send random packets that are valid For example, a third party could send random packets that are valid
for TCP, but contain no FPDU headers. An MPA receiver reports an for TCP, but contain no FPDU headers. An MPA receiver reports an
error to DDP when any packet arrives that cannot be validated as an error to DDP when any packet arrives that cannot be validated as an
FPDU when properly located on an FPDU boundary. This would have a FPDU when properly located on an FPDU boundary. This would have a
severe impact on performance. Communication security mechanisms such severe impact on performance. Communication security mechanisms such
as IPsec [RFC2401] may be used to prevent such attacks. Independent as IPsec [RFC2401] may be used to prevent such attacks. Independent
of how MPA operates, a third party could use ICMP messages to reduce of how MPA operates, a third party could use ICMP messages to reduce
the path MTU to such a small size that performance would likewise be the path MTU to such a small size that performance would likewise be
severely impacted. Range checking on path MTU sizes in ICMP packets severely impacted. Range checking on path MTU sizes in ICMP packets
may be used to prevent such attacks. may be used to prevent such attacks.
10.2 Using IPsec With MPA 8.2 Using IPsec With MPA
IPsec can be used to protect against the packet injection attacks IPsec can be used to protect against the packet injection attacks
outlined above. Because IPsec is designed to secure individual IP outlined above. Because IPsec is designed to secure individual IP
packets, MPA can run above IPsec without change. IPsec packets are packets, MPA can run above IPsec without change. IPsec packets are
processed (e.g., integrity checked and decrypted) in the order they processed (e.g., integrity checked and decrypted) in the order they
are received, and an MPA receiver will process the decrypted FPDUs are received, and an MPA receiver will process the decrypted FPDUs
contained in these packets in the same manner as FPDUs contained in contained in these packets in the same manner as FPDUs contained in
unsecured IP packets. unsecured IP packets.
11 IANA Considerations 9 IANA Considerations
If a well-known port is chosen as the mechanism to identify a DDP on If a well-known port is chosen as the mechanism to identify a DDP on
MPA on TCP, the well-known port must be registered with IANA. MPA on TCP, the well-known port must be registered with IANA.
Because the use of the port is DDP specific, registration of the port Because the use of the port is DDP specific, registration of the port
with IANA is left to DDP. with IANA is left to DDP.
12 References 10 References
12.1 Normative References 10.1 Normative References
[iSCSI] Satran, J., "iSCSI", draft-ietf-ips-iscsi-20.txt (work in [iSCSI] Satran, J., "iSCSI", draft-ietf-ips-iscsi-20.txt (work in
progress), January 2003. progress), January 2003.
[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191, [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
November 1990. November 1990.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", RFC 2018, October 1996. Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
3", BCP 9, RFC 2026, October 1996. 3", BCP 9, RFC 2026, October 1996.
[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981. Program Protocol Specification", RFC 793, September 1981.
12.2 Informative References 10.2 Informative References
[CRCTCP] Stone, J., Partridge, C., "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000.
[DDP] H. Shah et al., "Direct Data Placement over Reliable [DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", draft-ietf-rddp-ddp-00.txt (Work in progress), Transports", draft-ietf-rddp-ddp-02.txt (Work in progress),
February 2003 February 2004
[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
Internet Protocol", RFC 2401, November 1998. Internet Protocol", RFC 2401, November 1998.
[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
896, January 1984. 896, January 1984.
[NagleDAck] Minshall, G., Mogul, J., Saito, Y., Verghese, B.,
"Application performance pitfalls and TCP's Nagle algorithm",
Workshop on Internet Server Performance, May 1999.
[RDMA] R. Recio et al., "RDMA Protocol Specification", [RDMA] R. Recio et al., "RDMA Protocol Specification",
draft-ietf-rddp-rdmap-00.txt, February 2003 draft-ietf-rddp-rdmap-02.txt, May 2004
[RFC2960] R. Stewart et al., "Stream Control Transmission Protocol", [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000. RFC 2960, October 2000.
[RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
September 1981.
[RFC1122] Braden, R.T., "Requirements for Internet hosts -
communication layers", RFC 1122, October 1989.
[ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations" draft- [ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations" draft-
elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003. elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.
[Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft- [Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft-
hilland-rddp-verbs-00.txt, April 2003. hilland-rddp-verbs-00.txt, April 2003.
13 Appendix 11 Appendix
This appendix is for information only and is NOT part of the This appendix is for information only and is NOT part of the
standard. standard.
13.1 Analysis of MPA over TCP Operations 11.1 Analysis of MPA over TCP Operations
This appendix analyzes the impact of MPA (Marker PDU Aligned Framing This appendix analyzes the impact of MPA (Marker PDU Aligned Framing
for TCP [MPA]) on the TCP sender, receiver, and wire protocol. for TCP [MPA]) on the TCP sender, receiver, and wire protocol.
One of MPA's high level goals is to provide enough information, when One of MPA's high level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) buffer. Note that DDP separates the act of placing Protocol (ULP) buffer. Note that DDP separates the act of placing
data into a ULP buffer from that of notifying the ULP that the ULP data into a ULP buffer from that of notifying the ULP that the ULP
buffer is available for use. In DDP terminology, the former is buffer is available for use. In DDP terminology, the former is
skipping to change at page 43, line 44 skipping to change at page 44, line 44
Unit (FPDU) (if there is payload present). Unit (FPDU) (if there is payload present).
2) that there be an integral number of FPDUs in a TCP segment (under 2) that there be an integral number of FPDUs in a TCP segment (under
conditions where the Path MTU is not changing). conditions where the Path MTU is not changing).
This Appendix concludes that the scaling advantages of Header
Alignment are strong, based primarily on a substantial reduction in
TCP receive buffer requirements and simplified receive handling. The
analysis also shows that there is little effect on TCP wire behavior.
13.1.1 Assumptions 11.1.1 Assumptions
13.1.1.1 MPA is layered beneath DDP [DDP] 11.1.1.1 MPA is layered beneath DDP [DDP]
MPA is an adaptation layer between DDP and TCP. DDP requires MPA is an adaptation layer between DDP and TCP. DDP requires
preservation of DDP segment boundaries and a CRC32C digest covering preservation of DDP segment boundaries and a CRC32C digest covering
the DDP header and data. MPA adds these features to the TCP stream the DDP header and data. MPA adds these features to the TCP stream
so that DDP over TCP has the same basic properties as DDP over SCTP. so that DDP over TCP has the same basic properties as DDP over SCTP.
13.1.1.2 MPA preserves DDP message framing 11.1.1.2 MPA preserves DDP message framing
MPA was designed as a framing layer specifically for DDP and was not MPA was designed as a framing layer specifically for DDP and was not
intended as a general-purpose framing layer for any other ULP using intended as a general-purpose framing layer for any other ULP using
TCP. TCP.
A framing layer allows ULPs using it to receive indications from the A framing layer allows ULPs using it to receive indications from the
transport layer only when complete ULPDUs are present. As a framing transport layer only when complete ULPDUs are present. As a framing
layer, MPA is not aware of the content of the DDP PDU, only that it layer, MPA is not aware of the content of the DDP PDU, only that it
has received and, if necessary, reassembled a complete PDU for has received and, if necessary, reassembled a complete PDU for
delivery to the DDP. delivery to the DDP.
13.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under 11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
normal conditions normal conditions
To make reception of a complete DDP PDU on every received segment To make reception of a complete DDP PDU on every received segment
possible, DDP passes to MPA a PDU that is no larger than the EMSS of possible, DDP passes to MPA a PDU that is no larger than the EMSS of
the underlying fabric. Each FPDU that MPA creates contains sufficient the underlying fabric. Each FPDU that MPA creates contains sufficient
information for the receiver to directly place the ULP payload in the information for the receiver to directly place the ULP payload in the
correct location in the correct receive buffer. correct location in the correct receive buffer.
Edge cases in which this condition does not hold are handled, but
they do not need to be on the fast path.
13.1.1.4 Out-of-order placement but NO out-of-order delivery 11.1.1.4 Out-of-order placement but NO out-of-order delivery
DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
information necessary to place its ULP payload directly in the information necessary to place its ULP payload directly in the
correct location in host memory. correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer. immediately in the ULP receive buffer.
Data delivery to the ULP is guaranteed to be in the order the data Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream. acknowledged the complete byte stream.
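The distinction between placement and delivery can be illustrated with a small sketch. It is an illustration under stated assumptions only: the class name PlacementBuffer is hypothetical, each arriving segment is assumed to carry its own buffer offset, and the "contiguous prefix received" point is used as a stand-in for TCP having acknowledged the byte stream up to that point.

```python
class PlacementBuffer:
    """Places payload on arrival; reports delivery only for in-order data."""

    def __init__(self, size):
        self.buf = bytearray(size)   # stands in for the ULP receive buffer
        self.received = []           # (start, end) ranges already placed
        self.delivered_upto = 0      # end of the contiguous in-order prefix

    def place(self, offset, payload):
        end = offset + len(payload)
        if offset < 0 or end > len(self.buf):
            raise ValueError("payload does not fit in the ULP buffer")
        # Placement: copy straight to the final location, in any order.
        self.buf[offset:end] = payload
        self.received.append((offset, end))
        # Delivery: advance only across ranges that are now contiguous.
        advanced = True
        while advanced:
            advanced = False
            for start, stop in self.received:
                if start <= self.delivered_upto < stop:
                    self.delivered_upto = stop
                    advanced = True
        return self.delivered_upto
```

A segment that arrives early is copied into place immediately, but the value returned by place() does not move past the first gap until the missing data arrives.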
13.1.2 The Value of Header Alignment 11.1.2 The Value of Header Alignment
Significant receiver optimizations can be achieved when Header Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case. The optimizations Alignment and complete FPDUs are the common case. The optimizations
allow utilizing significantly fewer buffers on the receiver and less allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a "Flow- computation per FPDU. The net effect is the ability to build a "Flow-
Through" receiver that enables TCP-based solutions to scale to 10G Through" receiver that enables TCP-based solutions to scale to 10G
and beyond in an economical way. The optimizations are especially and beyond in an economical way. The optimizations are especially
relevant to hardware implementations of receivers that process relevant to hardware implementations of receivers that process
multiple protocol layers - Data Link Layer (e.g., Ethernet), Network multiple protocol layers - Data Link Layer (e.g., Ethernet), Network
and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP
skipping to change at page 46, line 25 skipping to change at page 47, line 25
continue - while Ethernet speeds have scaled by 1000 (from 10 continue - while Ethernet speeds have scaled by 1000 (from 10
megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
PCI-X DDR). Under these conditions, the Header Aligned FPDU approach PCI-X DDR). Under these conditions, the Header Aligned FPDU approach
allows BufferSizeAF to be indifferent to network speed. It is allows BufferSizeAF to be indifferent to network speed. It is
primarily a function of the local processing time for a given frame. primarily a function of the local processing time for a given frame.
Thus when the Header Aligned FPDU approach is used, receive buffering Thus when the Header Aligned FPDU approach is used, receive buffering
is expected to scale gracefully (i.e. less than linear scaling) as is expected to scale gracefully (i.e. less than linear scaling) as
network speed is increased. network speed is increased.
13.1.2.1 Impact of lack of Header Alignment on the receiver 11.1.2.1 Impact of lack of Header Alignment on the receiver
computational load and complexity computational load and complexity
The receiver must perform IP and TCP processing, and then perform The receiver must perform IP and TCP processing, and then perform
FPDU CRC checks, before it can trust the FPDU header placement FPDU CRC checks, before it can trust the FPDU header placement
information. For simplicity of the description, the assumption is information. For simplicity of the description, the assumption is
that a FPDU is carried in no more than 2 TCP segments. In reality, that a FPDU is carried in no more than 2 TCP segments. In reality,
with no Header Alignment, an FPDU can be carried by more than 2 TCP with no Header Alignment, an FPDU can be carried by more than 2 TCP
segments (e.g., if the PMTU was reduced). segments (e.g., if the PMTU was reduced).
----++-----------------------------++-----------------------++----- ----++-----------------------------++-----------------------++-----
skipping to change at page 50, line 24 skipping to change at page 51, line 24
along with the high probability that at least one complete FPDU is along with the high probability that at least one complete FPDU is
found with every TCP segment, allows the receiver to perform data found with every TCP segment, allows the receiver to perform data
placement for out-of-order TCP segments with no need for intermediate placement for out-of-order TCP segments with no need for intermediate
buffering. Essentially the TCP receive buffer has been eliminated and buffering. Essentially the TCP receive buffer has been eliminated and
TCP reassembly is done in place within the ULP buffer. TCP reassembly is done in place within the ULP buffer.
If Header Alignment is not found, the receiver should follow the
algorithm for non-aligned FPDU reception, which may be slower and
less efficient.
13.1.2.2 Header Alignment effects on TCP wire protocol 11.1.2.2 Header Alignment effects on TCP wire protocol
An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to
calculate its MULPDU, which it then exposes to DDP, its ULP. DDP calculate its MULPDU, which it then exposes to DDP, its ULP. DDP
uses the MULPDU to segment its payload so that each FPDU sent by uses the MULPDU to segment its payload so that each FPDU sent by
MPA fits completely into one TCP segment. This has no impact on MPA fits completely into one TCP segment. This has no impact on
wire protocol and exposing this information is already supported wire protocol and exposing this information is already supported
on many TCP implementations, including all modern flavors of BSD on many TCP implementations, including all modern flavors of BSD
networking, through the TCP_MAXSEG socket option. networking, through the TCP_MAXSEG socket option.
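The EMSS-to-MULPDU step can be sketched as below. The helper names effective_mss and mulpdu_from_emss are hypothetical. The per-FPDU overhead figures (a 2-octet length field, a 4-octet CRC, a 4-octet marker every 512 octets, and up to 3 octets of padding) are assumptions about the FPDU format described elsewhere in the specification, not values stated in this appendix, and the result is a conservative estimate rather than a normative formula. socket.TCP_MAXSEG is only defined on platforms that support the option.

```python
import socket


def effective_mss(sock):
    """Read the segment size the local TCP reports for this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def mulpdu_from_emss(emss, marker_interval=512):
    """Estimate the largest ULPDU that still fits in one TCP segment.

    Assumed overhead: 2 (length) + 4 (CRC) + 4 per marker, plus an
    allowance of emss % 4 octets for padding.
    """
    markers = -(-emss // marker_interval)      # ceiling division
    overhead = 2 + 4 + 4 * markers + emss % 4
    return emss - overhead
```

For an EMSS of 1460 octets this estimate gives a MULPDU of 1442 octets, so DDP would segment its payload into pieces no larger than that.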
In the common case, the ULP (i.e. DDP over MPA) messages provided to In the common case, the ULP (i.e. DDP over MPA) messages provided to
skipping to change at page 51, line 22 skipping to change at page 52, line 22
the EMSS. Another class of applications with many small outstanding the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction oriented applications are also optimal. applicable. Transaction oriented applications are also optimal.
TCP retransmission is another area that can affect sender behavior. TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted TCP supports retransmission of the exact, originally transmitted
segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing
the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event
that part of the original segment has been received and acknowledged that part of the original segment has been received and acknowledged
by the remote peer (e.g., a re-segmenting middle box, as documented by the remote peer (e.g., a re-segmenting middle box, as documented
in 7.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on in 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on
page 25), a better available bandwidth utilization may be possible by page 25), a better available bandwidth utilization may be possible by
re-transmitting only the missing octets. If an MPA-aware TCP re-transmitting only the missing octets. If an MPA-aware TCP
retransmits complete FPDUs, there may be some marginal bandwidth retransmits complete FPDUs, there may be some marginal bandwidth
loss. loss.
Another area where a change in the TCP segment number may have impact Another area where a change in the TCP segment number may have impact
is that of Slow Start and Congestion Avoidance. Slow-start is that of Slow Start and Congestion Avoidance. Slow-start
exponential increase is measured in segments per second, as the exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start congestion that eventually results in dropped segments. Slow-start
skipping to change at page 52, line 5 skipping to change at page 53, line 5
algorithms. algorithms.
In summary, the ULP messages generated at the sender (e.g., the
number of messages grouped for every transmission request) and the
message size distribution have the most significant impact on the
number of TCP segments emitted. The worst-case effect for certain
ULPs (with an average message size of EMSS/2+1 to EMSS) is bounded by
an increase of up to 2x in the number of TCP segments and
acknowledgments. In reality the effect is expected to be marginal.
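The 2x bound can be checked with a short calculation. The sketch below compares a sender that never splits a message across segments with a plain byte-stream sender. The function names are hypothetical, and FPDU header, marker, and CRC overhead are ignored to keep the arithmetic visible.

```python
import math


def segments_aligned(message_sizes, emss):
    """Count segments when whole messages are packed and never split."""
    segments, room = 0, 0
    for size in message_sizes:
        if size > emss:
            raise ValueError("message larger than EMSS")
        if size > room:          # does not fit in the current segment
            segments += 1
            room = emss
        room -= size
    return segments


def segments_stream(message_sizes, emss):
    """Count segments when the messages are sent as one byte stream."""
    return math.ceil(sum(message_sizes) / emss)
```

With an EMSS of 1460 and one hundred messages of 731 octets (EMSS/2+1), no two messages share a segment, so the aligned sender emits 100 segments where the byte-stream sender emits 51. Smaller messages pack several to a segment and the gap narrows.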
13.2 Receiver implementation 11.2 Receiver implementation
Transport & Network Layer Reassembly Buffers: Transport & Network Layer Reassembly Buffers:
The use of reassembly buffers (either TCP reassembly buffers or IP The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out of order packets MPA is enabled, reassembly buffers are needed if out of order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU arrive and Markers are not enabled. Buffers are also needed if FPDU
Alignment is lost or if IP fragmentation occurs. This is because the Alignment is lost or if IP fragmentation occurs. This is because the
incoming out of order segment may not contain enough information for incoming out of order segment may not contain enough information for
MPA to process all of the FPDU. For cases where a re-segmenting MPA to process all of the FPDU. For cases where a re-segmenting
middle box is present, or where the TCP sender is not MPA-aware, the middle box is present, or where the TCP sender is not MPA-aware, the
presence of markers significantly reduces the amount of buffering presence of markers significantly reduces the amount of buffering
needed. needed.
Recovery from IP Fragmentation must be transparent to the MPA Recovery from IP Fragmentation must be transparent to the MPA
Consumers. Consumers.
13.2.1 Network Layer Reassembly Buffers 11.2.1 Network Layer Reassembly Buffers
Most IP implementations set the IP Don't Fragment bit. Thus upon a
path MTU change, intermediate devices drop the IP datagram if it is
too large and reply with an ICMP message which tells the source TCP
that the path MTU has changed. This causes TCP to emit segments
conformant with the new path MTU size. Thus IP fragments should
rarely reach the receiver under most conditions, but it remains
possible.
There are several options for implementation of network layer There are several options for implementation of network layer
reassembly buffers: reassembly buffers:
skipping to change at page 53, line 20 skipping to change at page 54, line 20
multiple IP datagrams that have all been fragmented). multiple IP datagrams that have all been fragmented).
Note that if the Remote Peer does not implement re-segmentation of Note that if the Remote Peer does not implement re-segmentation of
the data stream upon receiving the ICMP reply updating the path MTU, the data stream upon receiving the ICMP reply updating the path MTU,
it is possible to halt forward progress because the opposite peer it is possible to halt forward progress because the opposite peer
would continue to retransmit using a transport segment size that is would continue to retransmit using a transport segment size that is
too large. This deadlock scenario is no different than if the fabric too large. This deadlock scenario is no different than if the fabric
MTU (not last hop MTU) was reduced after connection setup, and the MTU (not last hop MTU) was reduced after connection setup, and the
remote Node's behavior is not compliant with [RFC1122]. remote Node's behavior is not compliant with [RFC1122].
13.2.2 TCP Reassembly buffers 11.2.2 TCP Reassembly buffers
A TCP reassembly buffer is also needed. It is required if FPDU
Alignment is lost when using TCP with MPA, or when an MPA FPDU spans
multiple TCP segments. Buffers are also needed if Markers are
disabled and out-of-order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible advertised locally attached link (this should be the largest possible advertised
skipping to change at page 54, line 5 skipping to change at page 55, line 5
deadlock the MPA algorithm. If the path MTU is reduced, FPDU deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks. can never be successfully transmitted and the protocol deadlocks.
When a complete FPDU is received, processing continues normally. When a complete FPDU is received, processing continues normally.
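The role of the TCP reassembly buffer when alignment is lost can be sketched as follows. This is a simplified illustration, not the specified receive algorithm: the class name FpduReassembler is hypothetical, in-order arrival is assumed, Markers are assumed to be disabled, the CRC is skipped rather than verified, and the framing (a 2-octet big-endian length, the ULPDU, padding to a 4-octet boundary, then a 4-octet CRC) is an assumption about the FPDU format defined elsewhere in the specification.

```python
import struct


class FpduReassembler:
    """Accumulates in-order TCP payload until whole FPDUs are available."""

    def __init__(self):
        self.pending = bytearray()   # the reassembly buffer

    def feed(self, segment_payload):
        """Add one segment's payload; return the ULPDUs completed by it."""
        self.pending += segment_payload
        complete = []
        while len(self.pending) >= 2:
            (ulpdu_len,) = struct.unpack_from("!H", self.pending, 0)
            pad = -(2 + ulpdu_len) % 4           # pad to a 4-octet boundary
            fpdu_len = 2 + ulpdu_len + pad + 4   # + 4 for the CRC (unchecked)
            if len(self.pending) < fpdu_len:
                break                            # wait for more segments
            complete.append(bytes(self.pending[2:2 + ulpdu_len]))
            del self.pending[:fpdu_len]
        return complete
```

A receiver without such a buffer has nowhere to hold the first part of a split FPDU, which is the deadlock the text above describes.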
14 Author's Addresses 12 Author's Addresses
Stephen Bailey Stephen Bailey
Sandburst Corporation Sandburst Corporation
600 Federal Street 600 Federal Street
Andover, MA 01810 USA Andover, MA 01810 USA
Phone: +1 978 689 1614 Phone: +1 978 689 1614
Email: steph@sandburst.com Email: steph@sandburst.com
Paul R. Culley Paul R. Culley
Hewlett-Packard Company Hewlett-Packard Company
skipping to change at page 55, line 5 skipping to change at page 56, line 5
Phone: 512-838-3685 Phone: 512-838-3685
Email: recio@us.ibm.com Email: recio@us.ibm.com
John Carrier John Carrier
Adaptec Inc. Adaptec Inc.
691 South Milpitas Blvd. 691 South Milpitas Blvd.
Milpitas, CA 95035 Milpitas, CA 95035
Phone: 360-378-8526 Phone: 360-378-8526
Email: John_Carrier@adaptec.com Email: John_Carrier@adaptec.com
15 Acknowledgments 13 Acknowledgments
Dwight Barron Dwight Barron
Hewlett-Packard Company Hewlett-Packard Company
20555 SH 249 20555 SH 249
Houston, Tx. USA 77070-2698 Houston, Tx. USA 77070-2698
Phone: 281-514-2769 Phone: 281-514-2769
Email: dwight.barron@hp.com Email: dwight.barron@hp.com
Jeff Chase Jeff Chase
Department of Computer Science Department of Computer Science
skipping to change at page 58, line 5 skipping to change at page 59, line 5
Phone: +1 916 785 5198 Phone: +1 916 785 5198
Email: jim_wendt@hp.com Email: jim_wendt@hp.com
Jim Williams Jim Williams
Emulex Corporation Emulex Corporation
580 Main Street 580 Main Street
Bolton, MA 01740 USA Bolton, MA 01740 USA
Phone: +1 978 779 7224 Phone: +1 978 779 7224
Email: jim.williams@emulex.com Email: jim.williams@emulex.com
16 Full Copyright Statement 14 Full Copyright Statement
This document and the information contained herein is provided on an This document and the information contained herein is provided on an
"AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION, CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION, MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY, NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE. PURPOSE.
Copyright (c) 2003 ADAPTEC INC., BROADCOM CORPORATION, CISCO SYSTEMS Copyright (C) The Internet Society (2004). This document is subject
INC., EMC CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL to the rights, licenses and restrictions contained in BCP 78, and
BUSINESS MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT except as set forth therein, the authors retain all their rights.
CORPORATION, NETWORK APPLIANCE INC., All Rights Reserved