draft-ietf-rddp-mpa-03.txt   draft-ietf-rddp-mpa-04.txt 
Remote Direct Data Placement Work Group P. Culley Remote Direct Data Placement Work Group P. Culley
INTERNET-DRAFT Hewlett-Packard Company INTERNET-DRAFT Hewlett-Packard Company
draft-ietf-rddp-mpa-03.txt U. Elzur draft-ietf-rddp-mpa-04.txt U. Elzur
Broadcom Corporation Broadcom Corporation
R. Recio R. Recio
IBM Corporation IBM Corporation
S. Bailey S. Bailey
Sandburst Corporation Sandburst Corporation
J. Carrier J. Carrier
Cray Inc. Cray Inc.
Expires: April 2006 September 27, 2005 Expires: November 2006 May 30, 2006
Marker PDU Aligned Framing for TCP Specification Marker PDU Aligned Framing for TCP Specification
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
skipping to change at page 2, line 10 skipping to change at page 2, line 10
TCP RFCs and can be utilized with existing TCP implementations. MPA TCP RFCs and can be utilized with existing TCP implementations. MPA
also supports integrated implementations that combine TCP, MPA and also supports integrated implementations that combine TCP, MPA and
DDP to reduce buffering requirements in the implementation and DDP to reduce buffering requirements in the implementation and
improve performance at the system level. improve performance at the system level.
Table of Contents Table of Contents
Status of this Memo 1 Status of this Memo 1
Abstract 1 Abstract 1
1 Glossary 7 1 Glossary 7
2 Introduction 9 2 Introduction 10
2.1 Motivation 9 2.1 Motivation 10
2.2 Protocol Overview 9 2.2 Protocol Overview 10
3 LLP and DDP requirements 13 3 LLP and DDP requirements 14
3.1 TCP implementation Requirements to support MPA 13 3.1 TCP implementation Requirements to support MPA 14
3.1.1 TCP Transmit side 13 3.1.1 TCP Transmit side 14
3.1.2 TCP Receive side 14 3.1.2 TCP Receive side 14
3.2 MPA's interactions with DDP 15 3.2 MPA's interactions with DDP 16
4 FPDU Formats 17 4 FPDU Formats 18
4.1 Marker Format 18 4.1 Marker Format 19
5 Data Transfer Semantics 19 5 Data Transfer Semantics 20
5.1 MPA Markers 19 5.1 MPA Markers 20
5.2 CRC Calculation 22 5.2 CRC Calculation 23
5.3 MPA on TCP Sender Segmentation 25 5.3 MPA on TCP Sender Segmentation 26
5.3.1 Effects of MPA on TCP Segmentation 26 5.3.1 Effects of MPA on TCP Segmentation 27
5.3.2 FPDU Size Considerations 28 5.3.2 FPDU Size Considerations 29
5.4 MPA Receiver FPDU Identification 29 5.4 MPA Receiver FPDU Identification 30
5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 30 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 31
6 Connection Semantics 31 6 Connection Semantics 32
6.1 Connection setup 31 6.1 Connection setup 32
6.1.1 MPA Request and Reply Frame Format 33 6.1.1 MPA Request and Reply Frame Format 34
6.1.2 Connection Startup Rules 34 6.1.2 Connection Startup Rules 35
6.1.3 Example Delayed Startup sequence 38 6.1.3 Example Delayed Startup sequence 38
6.1.4 Use of "Private Data" 41 6.1.4 Use of Private Data 41
6.1.5 "Dual Stack" implementations 44 6.1.5 "Dual stack" implementations 44
6.2 Normal Connection Teardown 45 6.2 Normal Connection Teardown 45
7 Error Semantics 46 7 Error Semantics 46
8 Security Considerations 47 8 Security Considerations 47
8.1 Protocol-specific Security Considerations 47 8.1 Protocol-specific Security Considerations 47
8.1.1 Spoofing 47 8.1.1 Spoofing 47
8.1.2 Eavesdropping 48 8.1.2 Eavesdropping 48
8.2 Introduction to Security Options 49 8.2 Introduction to Security Options 49
8.3 Using IPsec With MPA 49 8.3 Using IPsec With MPA 49
8.4 Requirements for IPsec Encapsulation of MPA/DDP 50 8.4 Requirements for IPsec Encapsulation of MPA/DDP 50
9 IANA Considerations 51 9 IANA Considerations 51
skipping to change at page 3, line 17 skipping to change at page 3, line 17
11.3.2 RDMAC RNIC and Non-permissive IETF RNIC 66 11.3.2 RDMAC RNIC and Non-permissive IETF RNIC 66
11.3.3 RDMAC RNIC and Permissive IETF RNIC 68 11.3.3 RDMAC RNIC and Permissive IETF RNIC 68
11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 69 11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 69
12 Author's Addresses 70 12 Author's Addresses 70
13 Acknowledgments 71 13 Acknowledgments 71
Full Copyright Statement 74 Full Copyright Statement 74
Intellectual Property 74 Intellectual Property 74
Table of Figures Table of Figures
Figure 1 ULP MPA TCP Layering 10 Figure 1 ULP MPA TCP Layering 11
Figure 2 FPDU Format 17 Figure 2 FPDU Format 18
Figure 3 Marker Format 18 Figure 3 Marker Format 19
Figure 4 Example FPDU Format with Marker 20 Figure 4 Example FPDU Format with Marker 21
Figure 5 Annotated Hex Dump of an FPDU 24 Figure 5 Annotated Hex Dump of an FPDU 25
Figure 6 Annotated Hex Dump of an FPDU with Marker 25 Figure 6 Annotated Hex Dump of an FPDU with Marker 26
Figure 7 "MPA Request/Reply Frame" 33 Figure 7 MPA Request/Reply Frame 34
Figure 8: Example Delayed Startup negotiation 39 Figure 8: Example Delayed Startup negotiation 39
Figure 9: Example Immediate Startup negotiation 42 Figure 9: Example Immediate Startup negotiation 42
Figure 10: Non-aligned FPDU freely placed in TCP octet stream 58 Figure 10: Non-aligned FPDU freely placed in TCP octet stream 58
Figure 11: Aligned FPDU placed immediately after TCP header 59 Figure 11: Aligned FPDU placed immediately after TCP header 59
Figure 12. Connection Parameters for the RNIC Types. 66 Figure 12. Connection Parameters for the RNIC Types. 66
Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC. 67 IETF RNIC. 67
Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC. 68 IETF RNIC. 68
Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC. 69 Permissive IETF RNIC. 69
Revision history [To be deleted prior to RFC publication] Revision history [To be deleted prior to RFC publication]
[draft-ietf-rddp-mpa-04] workgroup draft with following changes:
Numerous capitalization and "" adjustments, tried to make more
consistent.
Added some missing capitalized terms to glossary
Removed company specific "use as is" boilerplate paragraph
Fixed up some contact information and cross references.
Removed reference to expired draft-elzur-iwarp-mpa-tcp-analysis-
00.txt
Suggested MTU to be used to determine EMSS, when otherwise not
available; removed technology specific lengths per AD suggestion
Tweaked text around disabling Nagle so that it is no longer
implied that that is all that is necessary to achieve proper
segmentation behavior
Revamped section 5.3.1 for improved clarity
[draft-ietf-rddp-mpa-03] workgroup draft with following changes: [draft-ietf-rddp-mpa-03] workgroup draft with following changes:
Tweaked abstract to give a bit more information. Tweaked abstract to give a bit more information.
Tightened definition and usage of "deliver" Tightened definition and usage of "deliver"
Cleaned up usage of terms "FPDU Alignment" and "Header Cleaned up usage of terms "FPDU Alignment" and "Header
Alignment" Alignment"
Rearranged overview sections with stack and glossary earlier Rearranged overview sections with stack and glossary earlier
Mentioned how a non-MPA-Aware TCP MPA receiver deals with out Mentioned how a non-MPA-Aware TCP MPA receiver deals with out
of order segments (it doesn't have to...) of order segments (it doesn't have to...)
Fixed description of out of order segment handling in section Fixed description of out of order segment handling in section
3.1.1 3.1.1
Added text saying that ordering and completion indications are Added text saying that ordering and completion indications are
used to deliver to DDP used to deliver to DDP
Added redundant text indicating low two bits of FPDUPTR must Added redundant text indicating low two bits of FPDUPTR must
always be zero and treated as such in Section 4.1 always be zero and treated as such in Section 4.1
Added redundant text indicating markers are always included in a Added redundant text indicating Markers are always included in a
CRC calculation CRC calculation
Removed indication saying that an implementation can "ignore" an Removed indication saying that an implementation can "ignore" an
administrative input to not use CRCs; clarified that both ends administrative input to not use CRCs; clarified that both ends
have to agree to not use CRC (as originally intended). have to agree to not use CRC (as originally intended).
Changed example FPDU hex dump format for greater clarity Changed example FPDU hex dump format for greater clarity
Clarified that EMSS shrinking below 128 bytes is the condition Clarified that EMSS shrinking below 128 bytes is the condition
(rather than "very small sizes") (rather than "very small sizes")
Put connection startup rules after the start frame formats Put connection startup rules after the start frame formats
Added Initiator "private data" to figure 9 Added Initiator Private Data to figure 9
Removed or Clarified use of RNIC term Removed or Clarified use of RNIC term
Added intro to IETF/RDMAC interoperability appendix and gave a Added intro to IETF/RDMAC interoperability appendix and gave a
web reference for docs; also recommended use of "permissive IETF web reference for docs; also recommended use of "permissive IETF
RNIC" RNIC"
Numerous minor clarifications Numerous minor clarifications
Updated Boilerplates per current requirements Updated Boilerplates per current requirements
[draft-ietf-rddp-mpa-02] workgroup draft with following changes: [draft-ietf-rddp-mpa-02] workgroup draft with following changes:
Made IPsec must implement, optional to use. Made IPsec must implement, optional to use.
Updated Marker language to clarify that it points to ULPDU Updated Marker language to clarify that it points to ULPDU
Length even when marker precedes FPDU. Length even when Marker precedes FPDU.
Clarified when to start markers use (in full operation mode). Clarified when to start Markers use (in Full Operation mode).
Added informative text on interoperability with RDMAC RNICs. Added informative text on interoperability with RDMAC RNICs.
Reduced "Private Data" to 512 octets max. Reduced Private Data to 512 octets max.
Clarified CRC use description, must be used unless data is at Clarified CRC use description, must be used unless data is at
least as well protected by another means. least as well protected by another means.
Clarified CRC disabled mode; CRC field is always valid. Clarified CRC disabled mode; CRC field is always valid.
Added Security text. Added Security text.
Changed DDP and RDMAP version numbers in hex dumps (Fig 5,6) and Changed DDP and RDMAP version numbers in hex dumps (Fig 5, 6)
adjusted CRC accordingly. and adjusted CRC accordingly.
[draft-ietf-rddp-mpa-01] workgroup draft with following changes: [draft-ietf-rddp-mpa-01] workgroup draft with following changes:
Added the "R" bit (Rejected) to the "MPA Reply Frame" and Added the "R" bit (Rejected) to the MPA Reply Frame and
described its semantics. described its semantics.
Added some comments on recent decisions regarding startup. Added some comments on recent decisions regarding startup.
Updated RFC3667 boilerplate. Updated RFC3667 boilerplate.
[draft-ietf-rddp-mpa-00] workgroup draft with following changes: [draft-ietf-rddp-mpa-00] workgroup draft with following changes:
Changed "Start Key" to two separate startup frames to facilitate Changed "Start Key" to two separate startup frames to facilitate
identification of incorrect Active/Active startup. identification of incorrect active/active startup.
Changed Active/Passive nomenclature to Initiator/Responder to Changed Active/Passive nomenclature to Initiator/Responder to
reduce confusion with TCP startup and verbs doc (which used reduce confusion with TCP startup and verbs doc (which used
opposite sense). opposite sense).
Added "Private Data" to the startup key sequences. This also Added Private Data to the startup key sequences. This also
required describing the motivation and expected usage models required describing the motivation and expected usage models
along with some interface hints. Removed the "Private data" along with some interface hints. Removed the Private Data stuff
stuff from appendix. from appendix.
Added example "Immediate" startup with TCP and explanation. Added example "Immediate" startup with TCP and explanation.
[draft-culley-iwarp-mpa-03] [draft-culley-iwarp-mpa-03]
Add option to allow receivers to specify Marker use. Add option to allow receivers to specify Marker use.
Add option that allows both sides to agree not to use CRC. Add option that allows both sides to agree not to use CRC.
Added startup declaration "Start Key" with options and larger Added startup declaration "Start Key" with options and larger
MPA mode recognition "key". MPA mode recognition "key".
Updated MPA/DDP connection startup rules and sequence to deal Updated MPA/DDP connection startup rules and sequence to deal
with "Start Key". with "Start Key".
Added Appendix that provides a more detailed analysis of the Added Appendix that provides a more detailed analysis of the
effects of MPA on TCP data streams. effects of MPA on TCP data streams.
Added appendix that describes a mechanism to deal with "private Added appendix that describes a mechanism to deal with "Private
data" prior to full MPA/DDP operation. Data" prior to full MPA/DDP operation.
[draft-culley-iwarp-mpa-02] [draft-culley-iwarp-mpa-02]
Enhanced descriptions of how MPA is used over an unmodified TCP. Enhanced descriptions of how MPA is used over an unmodified TCP.
Removed "No Packing" text. Removed "No Packing" text.
Made MPA an adaptation layer for DDP, instead of a generalized Made MPA an adaptation layer for DDP, instead of a generalized
framing solution. framing solution.
Added clarifications of the MPA/TCP interaction for optimized Added clarifications of the MPA/TCP interaction for optimized
implementations and that any such optimizations are to be used implementations and that any such optimizations are to be used
only when requested by MPA. only when requested by MPA.
Note: a discussion of reasons for these changes can be found in
[ELZER-MPA].
[draft-culley-iwarp-mpa-01] initial draft. [draft-culley-iwarp-mpa-01] initial draft.
1 Glossary 1 Glossary
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC 2119.
Consumer - the ULPs or applications that lie above MPA and DDP. The Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations. and DDP connections, and generally controlling operations.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for the process of informing DDP that a particular PDU is ordered for
use. A PDU is Delivered in the exact order that it was sent by use. A PDU is Delivered in the exact order that it was sent by
the original sender; MPA uses TCP's byte stream ordering to the original sender; MPA uses TCP's byte stream ordering to
determine when Delivery is possible. This is specifically determine when Delivery is possible. This is specifically
different from "passing the PDU to DDP", which may generally different from "passing the PDU to DDP", which may generally
occur in any order, while the order of "Delivery" is strictly occur in any order, while the order of Delivery is strictly
defined. defined.
EMSS - Effective Maximum Segment Size. EMSS is the smaller of the EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
and the current path Maximum Transfer Unit (MTU) [RFC1191]. and the current path Maximum Transfer Unit (MTU) [RFC1191].
FPDU - Framed Protocol Data Unit. The unit of data created by an MPA FPDU - Framed Protocol Data Unit. The unit of data created by an MPA
sender. sender.
FPDU Alignment - the property that an FPDU is Header Aligned with the FPDU Alignment - the property that an FPDU is Header Aligned with the
TCP segment, and the TCP segment includes an integer number of TCP segment, and the TCP segment includes an integer number of
FPDUs. A TCP segment with FPDU Alignment allows immediate FPDUs. A TCP segment with FPDU Alignment allows immediate
processing of the contained FPDUs without waiting on other TCP processing of the contained FPDUs without waiting on other TCP
segments to arrive or combining with prior segments. segments to arrive or combining with prior segments.
FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate
the beginning of an FPDU.
Full Operation (Full Operation Phase) - After the completion of the
Startup Phase MPA begins exchanging FPDUs.
Header Alignment - the property that a TCP segment begins with an Header Alignment - the property that a TCP segment begins with an
FPDU. The FPDU is "Header Aligned" when the FPDU header is FPDU. The FPDU is Header Aligned when the FPDU header is exactly
exactly at the start of the TCP segment (right behind the TCP at the start of the TCP segment (right behind the TCP headers on
headers on the wire). the wire).
Initiator - The endpoint of a connection that sends the MPA Request
Frame, i.e. the first to actually send data (which may not be the
one which sends the TCP SYN).
Marker - A four octet field that is placed in the MPA data stream at
fixed octet intervals (every 512 octets).
MPA-aware TCP - a TCP implementation that is aware of the receiver MPA-aware TCP - a TCP implementation that is aware of the receiver
efficiencies of MPA FPDU Alignment and is capable of sending TCP efficiencies of MPA FPDU Alignment and is capable of sending TCP
segments that begin with an FPDU. segments that begin with an FPDU.
MPA-enabled - MPA is enabled if the MPA protocol is visible on the MPA-enabled - MPA is enabled if the MPA protocol is visible on the
wire. When the sender is MPA-enabled, it is inserting framing wire. When the sender is MPA-enabled, it is inserting framing
and markers. When the receiver is MPA-enabled, it is and Markers. When the receiver is MPA-enabled, it is
interpreting framing and markers. interpreting framing and Markers.
MPA Request Frame - Data sent from the MPA Initiator to the MPA
Responder during the Startup Phase.
MPA Reply Frame - Data sent from the MPA Responder to the MPA
Initiator during the Startup Phase.
MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This
document defines the MPA protocol. document defines the MPA protocol.
MULPDU - Maximum ULPDU. The current maximum size of the record that MULPDU - Maximum ULPDU. The current maximum size of the record that
is acceptable for DDP to pass to MPA for transmission. is acceptable for DDP to pass to MPA for transmission.
Node - A computing device attached to one or more links of a Network. Node - A computing device attached to one or more links of a Network.
A Node in this context does not refer to a specific application A Node in this context does not refer to a specific application
or protocol instantiation running on the computer. A Node may or protocol instantiation running on the computer. A Node may
consist of one or more MPA on TCP devices installed in a host consist of one or more MPA on TCP devices installed in a host
computer. computer.
PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact
modulo 4 size.
PDU - protocol data unit PDU - protocol data unit
Private Data - A block of data exchanged between MPA endpoints during
initial connection setup.
Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that
ties the use of various endpoint resources (memory access etc.) to the
specific RDMA/DDP/MPA connection.
RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
to enable applications to transfer data directly from memory
buffers. See [RDMAP].
Remote Peer - The MPA protocol implementation on the opposite end of Remote Peer - The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two describing protocol exchanges or other interactions between two
Nodes. Nodes.
Responder - The connection endpoint which responds to an incoming MPA
connection request (the MPA Request Frame). This may not be the
endpoint which awaited the TCP SYN.
Startup Phase - The initial exchanges of an MPA connection which
serve to more fully identify MPA endpoints to each other and
pass connection specific setup information to each other.
ULP - Upper Layer Protocol. The protocol layer above the protocol ULP - Upper Layer Protocol. The protocol layer above the protocol
layer currently being referenced. The ULP for MPA is DDP [DDP]. layer currently being referenced. The ULP for MPA is DDP [DDP].
ULPDU - Upper Layer Protocol Data Unit. The data record defined by ULPDU - Upper Layer Protocol Data Unit. The data record defined by
the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP the layer above MPA (DDP). ULPDU corresponds to DDP's DDP
Segment". segment.
ULPDU_Length - a field in the FPDU describing the length of the
included ULPDU.
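The PAD definition above can be illustrated with a short calculation. This is a minimal sketch, not text from either draft: it assumes a 2-octet ULPDU_Length field at the front of the FPDU (the FPDU format itself is given in section 4, which this excerpt does not reproduce), and the names ULPDU_LENGTH_FIELD and pad_length are hypothetical. Note that when the header plus ULPDU is already a multiple of 4, no PAD is needed at all.

```python
# Assumed size of the ULPDU_Length header field, in octets (see section 4).
ULPDU_LENGTH_FIELD = 2


def pad_length(ulpdu_len: int) -> int:
    """Return the number of zero octets needed so that
    ULPDU_Length field + ULPDU + PAD is an exact multiple of 4."""
    return -(ULPDU_LENGTH_FIELD + ulpdu_len) % 4


if __name__ == "__main__":
    for n in range(5):
        print(f"ULPDU of {n} octets -> PAD of {pad_length(n)} octets")
```

Under this assumption a 2-octet ULPDU needs no PAD, while a 3-octet ULPDU needs the maximum of 3 octets.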
2 Introduction 2 Introduction
This section discusses the reason for creating MPA on TCP and a This section discusses the reason for creating MPA on TCP and a
general overview of the protocol. Later sections show the MPA general overview of the protocol. Later sections show the MPA
headers (see section 4 on page 17), and detailed protocol headers (see section 4 on page 18), and detailed protocol
requirements and characteristics (see section 5 on page 19), as well requirements and characteristics (see section 5 on page 20), as well
as Connection Semantics (section 6 on page 30), Error Semantics as Connection Semantics (section 6 on page 31), Error Semantics
(section 7 on page 46), and Security Considerations (section 8 on (section 7 on page 46), and Security Considerations (section 8 on
page 47). page 47).
2.1 Motivation 2.1 Motivation
The Direct Data Placement protocol [DDP], when used with TCP [RFC793] The Direct Data Placement protocol [DDP], when used with TCP [RFC793]
requires a mechanism to detect record boundaries. The DDP records requires a mechanism to detect record boundaries. The DDP records
are referred to as Upper Layer Protocol Data Units by this document. are referred to as Upper Layer Protocol Data Units by this document.
The ability to locate the Upper Layer Protocol Data Unit (ULPDU) The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
boundary is useful to a hardware network adapter that uses DDP to boundary is useful to a hardware network adapter that uses DDP to
skipping to change at page 10, line 33 skipping to change at page 11, line 33
1. A TCP connection is established by ULP action. This is done 1. A TCP connection is established by ULP action. This is done
using methods not described by this specification. The ULP may using methods not described by this specification. The ULP may
exchange some amount of data in streaming mode prior to starting exchange some amount of data in streaming mode prior to starting
MPA, but is not required to do so. MPA, but is not required to do so.
2. The Consumer negotiates the use of DDP and MPA at both ends of a 2. The Consumer negotiates the use of DDP and MPA at both ends of a
connection. The mechanisms to do this are not described in this connection. The mechanisms to do this are not described in this
specification. The negotiation may be done in streaming mode, or specification. The negotiation may be done in streaming mode, or
by some other mechanism (such as a pre-arranged port number). by some other mechanism (such as a pre-arranged port number).
3. The ULP activates MPA on each end in the "Startup Phase", either 3. The ULP activates MPA on each end in the Startup Phase, either as
as an "Initiator" or a "Responder", as determined by the ULP. an Initiator or a Responder, as determined by the ULP. This mode
This mode verifies the usage of MPA, specifies the use of CRC and verifies the usage of MPA, specifies the use of CRC and Markers,
Markers, and allows the ULP to communicate some additional data and allows the ULP to communicate some additional data via a
via a "private data" exchange. See section 6.1 Connection setup Private Data exchange. See section 6.1 Connection setup for more
for more details on the startup process. details on the startup process.
4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into
full operation and begins sending DDP data as further described Full Operation and begins sending DDP data as further described
below. In this document, DDP data chunks are called ULPDUs. For below. In this document, DDP data chunks are called ULPDUs. For
a description of the DDP data, see [DDP]. a description of the DDP data, see [DDP].
Following is a description of data transfer when MPA is in full Following is a description of data transfer when MPA is in Full
operation. Operation.
1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA 1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
for this value. MPA derives this information from TCP or IP, for this value. MPA derives this information from TCP or IP,
when it is available, or chooses a reasonable value. when it is available, or chooses a reasonable value.
2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to 2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to
MPA at the sender. MPA at the sender.
3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a 3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
header, optionally inserting markers, and appending a CRC field header, optionally inserting Markers, and appending a CRC field
after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP. after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.
4. The TCP sender puts the FPDUs into the TCP stream. If the TCP 4. The TCP sender puts the FPDUs into the TCP stream. If the TCP
Sender is MPA-aware, it segments the TCP stream in such a way Sender is MPA-aware, it segments the TCP stream in such a way
that a TCP Segment boundary is also the boundary of an FPDU. TCP that a TCP Segment boundary is also the boundary of an FPDU. TCP
then passes each segment to the IP layer for transmission. then passes each segment to the IP layer for transmission.
5. The TCP receiver may be MPA-aware or may not be MPA-aware. If it 5. The TCP receiver may be MPA-aware or may not be MPA-aware. If it
is MPA-aware, it may separate passing the TCP payload to MPA from is MPA-aware, it may separate passing the TCP payload to MPA from
passing the TCP payload ordering information to MPA. In either passing the TCP payload ordering information to MPA. In either
case, RFC compliant TCP wire behavior is observed at both the case, RFC compliant TCP wire behavior is observed at both the
sender and receiver. sender and receiver.
6. The MPA receiver locates and assembles complete FPDUs within the 6. The MPA receiver locates and assembles complete FPDUs within the
stream, verifies their integrity, and removes MPA markers (when stream, verifies their integrity, and removes MPA Markers (when
present), ULPDU_Length, PAD and the CRC field. present), ULPDU_Length, PAD and the CRC field.
7. MPA then provides the complete ULPDUs to DDP. MPA may also 7. MPA then provides the complete ULPDUs to DDP. MPA may also
separate passing MPA payload to DDP from passing the MPA payload separate passing MPA payload to DDP from passing the MPA payload
ordering information. ordering information.
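Steps 3 and 6 above can be sketched as a sender that frames a ULPDU and a receiver that recovers it. This is an illustrative sketch under stated assumptions, not the drafts' definition: it assumes a 2-octet ULPDU_Length field, a 4-octet CRC field computed with CRC-32C over everything that precedes it, network byte order for both fields, and no Markers. The function names are hypothetical, and the receiver assumes the buffer holds only complete FPDUs.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Sender side: prepend the length header, add PAD, append the CRC."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)           # PAD to a multiple of 4
    return body + struct.pack("!I", crc32c(body))


def parse_fpdus(stream: bytes) -> list[bytes]:
    """Receiver side: walk a buffer of complete FPDUs, verify each CRC,
    and return the ULPDUs with header, PAD and CRC removed."""
    ulpdus = []
    pos = 0
    while pos < len(stream):
        (length,) = struct.unpack_from("!H", stream, pos)
        body_len = 2 + length + (-(2 + length) % 4)
        body = stream[pos:pos + body_len]
        (crc,) = struct.unpack_from("!I", stream, pos + body_len)
        if crc != crc32c(body):
            raise ValueError(f"CRC mismatch in FPDU at offset {pos}")
        ulpdus.append(body[2:2 + length])
        pos += body_len + 4                      # skip past the CRC field
    return ulpdus


if __name__ == "__main__":
    stream = build_fpdu(b"abc") + build_fpdu(b"hello, DDP")
    print(parse_fpdus(stream))
```

Because each FPDU carries its own length, the receiver can find the next FPDU boundary without any help from TCP segmentation, which is the property the steps above rely on when the TCP sender is not MPA-aware.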
MPA-aware TCP is a TCP layer which potentially contains some MPA-aware TCP is a TCP layer which potentially contains some
additional semantics as defined in this document. MPA is implemented additional semantics as defined in this document. MPA is implemented
as a data stream ULP for TCP and is therefore RFC compliant. MPA- as a data stream ULP for TCP and is therefore RFC compliant. MPA-
aware TCP is RFC compliant. aware TCP is RFC compliant.
skipping to change at page 12, line 17 skipping to change at page 13, line 17
passes each ULPDU to DDP when the last bytes arrive from TCP, along passes each ULPDU to DDP when the last bytes arrive from TCP, along
with the indication that they are in order. with the indication that they are in order.
MPA implementations that support recovery of out of order ULPDUs MUST MPA implementations that support recovery of out of order ULPDUs MUST
support a mechanism to indicate the ordering of ULPDUs as the sender support a mechanism to indicate the ordering of ULPDUs as the sender
transmitted them and indicate when missing intermediate segments transmitted them and indicate when missing intermediate segments
arrive. These mechanisms allow DDP to reestablish record ordering arrive. These mechanisms allow DDP to reestablish record ordering
and report Delivery of complete messages (groups of records). and report Delivery of complete messages (groups of records).
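The distinction between passing a ULPDU to DDP and Delivering it can be sketched as follows. This is a hypothetical illustration, not a mechanism defined by the drafts: the DeliveryTracker class, its method names, and the use of plain byte offsets in place of TCP sequence numbers are all assumptions made for the example.

```python
class DeliveryTracker:
    """ULPDUs may be passed up in any order, but are reported as
    Delivered only in the order the sender transmitted them."""

    def __init__(self) -> None:
        self.next_offset = 0   # stream offset of the next FPDU to Deliver
        self.pending = {}      # FPDU start offset -> (FPDU length, ULPDU)

    def passed(self, offset: int, fpdu_len: int, ulpdu: bytes) -> list[bytes]:
        """Record an FPDU found at `offset` in the stream and return the
        ULPDUs that have become Deliverable as a result, in order."""
        self.pending[offset] = (fpdu_len, ulpdu)
        delivered = []
        # Advance the in-order frontier across every contiguous FPDU.
        while self.next_offset in self.pending:
            length, data = self.pending.pop(self.next_offset)
            delivered.append(data)
            self.next_offset += length
        return delivered


if __name__ == "__main__":
    tracker = DeliveryTracker()
    print(tracker.passed(12, 8, b"second"))   # arrives early, held back
    print(tracker.passed(0, 12, b"first"))    # fills the gap
```

In this sketch the early FPDU is available for placement as soon as it arrives, but it is only reported as Delivered once the missing intermediate FPDU fills the gap in the stream.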
MPA also addresses enhanced data integrity. Some users of TCP have MPA also addresses enhanced data integrity. Some users of TCP have
noted that the TCP checksum is not as strong as could be desired noted that the TCP checksum is not as strong as could be desired (see
(see[CRCTCP]). Studies such as [CRCTCP] have shown that the TCP [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum
checksum indicates segments in error at a much higher rate than the indicates segments in error at a much higher rate than the underlying
underlying link characteristics would indicate. With these higher link characteristics would indicate. With these higher error rates,
error rates, the chance that an error will escape detection, when the chance that an error will escape detection, when using only the
using only the TCP checksum for data integrity, becomes a concern. A TCP checksum for data integrity, becomes a concern. A stronger
stronger integrity check can reduce the chance of data errors being integrity check can reduce the chance of data errors being missed.
missed.
MPA includes a CRC check to increase the ULPDU data integrity to the MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC2960]. It level provided by other modern protocols, such as SCTP [RFC2960]. It
is possible to disable this CRC check, however CRCs MUST be enabled is possible to disable this CRC check, however CRCs MUST be enabled
unless it is clear that the end to end connection through the network unless it is clear that the end to end connection through the network
has data integrity at least as good as a MPA with CRC enabled (for has data integrity at least as good as a MPA with CRC enabled (for
example when IPsec is implemented end to end). DDP's ULP expects example when IPsec is implemented end to end). DDP's ULP expects
this level of data integrity and therefore the ULP does not have to this level of data integrity and therefore the ULP does not have to
provide its own duplicate data integrity and error recovery for lost provide its own duplicate data integrity and error recovery for lost
data. data.
3 LLP and DDP requirements 3 LLP and DDP requirements
The following sections describe requirements on TCP and DDP to The following sections describe requirements on TCP and DDP to
utilize MPA. The DDP requirements enable the correct operation over utilize MPA. The DDP requirements enable the correct operation over
MPA and TCP (as opposed to DDP over SCTP or other LLPs). MPA and TCP (as opposed to DDP over SCTP or other LLPs).
The TCP requirements are mostly intended to support the "MPA-aware The TCP requirements are mostly intended to support the MPA-aware TCP
TCP" variation, which allows implementations that require less buffer variation, which allows implementations that require less buffer
memory and may provide better overall system performance. memory and may provide better overall system performance.
3.1 TCP implementation Requirements to support MPA 3.1 TCP implementation Requirements to support MPA
The TCP implementation MUST inform MPA when the TCP connection is The TCP implementation MUST inform MPA when the TCP connection is
closed or has begun closing the connection (e.g. received a FIN). closed or has begun closing the connection (e.g. received a FIN).
3.1.1 TCP Transmit side 3.1.1 TCP Transmit side
To provide optimum performance, an MPA-aware transmit side TCP To provide optimum performance, an MPA-aware transmit side TCP
skipping to change at page 13, line 47 skipping to change at page 14, line 47
enable the segmentation rules described above for the DDP segments enable the segmentation rules described above for the DDP segments
(FPDUs) posted for transmission. (FPDUs) posted for transmission.
If the transmit side TCP implementation is not able to segment the If the transmit side TCP implementation is not able to segment the
TCP stream as indicated above, MPA SHOULD make a best effort to TCP stream as indicated above, MPA SHOULD make a best effort to
achieve that result. For example, using the TCP_NODELAY socket achieve that result. For example, using the TCP_NODELAY socket
option to disable the Nagle algorithm will usually result in many of option to disable the Nagle algorithm will usually result in many of
the segments starting with an FPDU. the segments starting with an FPDU.
If the transmit side TCP implementation is not able to report the If the transmit side TCP implementation is not able to report the
EMSS, MPA may assume that TCP will use 1460 octet segments in EMSS, MPA SHOULD use the current MTU value to establish a likely FPDU
creating FPDUs. If the implementation has reason to believe that the size, taking into account the various expected header sizes.
TCP segment size is actually smaller than 1460, it may instead use a
536 octet FPDU.
3.1.2 TCP Receive side 3.1.2 TCP Receive side
When an MPA receive implementation and the MPA-aware receive side TCP When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out of order ULPDUs, the TCP receive implementation support handling out of order ULPDUs, the TCP receive
implementation SHOULD be enabled to: implementation SHOULD be enabled to:
* Pass incoming TCP segments to MPA as soon as they have been * Pass incoming TCP segments to MPA as soon as they have been
received and validated, even if not received in order. The TCP received and validated, even if not received in order. The TCP
layer MUST have committed to keeping each segment before it can layer MUST have committed to keeping each segment before it can
skipping to change at page 14, line 42 skipping to change at page 15, line 36
* Provide a mechanism to indicate the ordering of TCP segments as * Provide a mechanism to indicate the ordering of TCP segments as
the sender transmitted them. One possible mechanism might be the sender transmitted them. One possible mechanism might be
attaching the TCP sequence number to each segment. attaching the TCP sequence number to each segment.
* Provide a mechanism to indicate when a given TCP segment (and the * Provide a mechanism to indicate when a given TCP segment (and the
prior TCP stream) is complete. One possible mechanism might be prior TCP stream) is complete. One possible mechanism might be
to utilize the leading (left) edge of the TCP Receive Window. to utilize the leading (left) edge of the TCP Receive Window.
MPA uses the ordering and completion indications to inform DDP MPA uses the ordering and completion indications to inform DDP
when a ULPDU is complete; MPA "delivers" the FPDU to DDP. DDP when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses
uses the indications to "deliver" its messages to the DDP the indications to "deliver" its messages to the DDP consumer
consumer (see [DDP] for more details). (see [DDP] for more details).
DDP on MPA MUST utilize these two mechanisms to establish the DDP on MPA MUST utilize these two mechanisms to establish the
Delivery semantics that DDP's consumers agree to. These Delivery semantics that DDP's consumers agree to. These
semantics are described fully in [DDP]. These include semantics are described fully in [DDP]. These include
requirements on DDP's consumer to respect ownership of buffers requirements on DDP's consumer to respect ownership of buffers
prior to the time that DDP delivers them to the consumer. prior to the time that DDP delivers them to the Consumer.
An MPA-aware TCP receive side implementation MUST continue to buffer An MPA-aware TCP receive side implementation MUST continue to buffer
TCP segments until completely ordered and then deliver them as TCP segments until completely ordered and then deliver them as
expected by non-MPA applications (and described in TCP RFCs) when MPA expected by non-MPA applications (and described in TCP RFCs) when MPA
is not enabled on the connection. When MPA is enabled above an MPA- is not enabled on the connection. When MPA is enabled above an MPA-
aware TCP, TCP SHOULD enable the in and out of order passing of data, aware TCP, TCP SHOULD enable the in and out of order passing of data,
and the separate ordering information as described above. and the separate ordering information as described above.
When an MPA receive implementation is coupled with a TCP receive When an MPA receive implementation is coupled with a TCP receive
implementation that does not support the preceding mechanisms, TCP implementation that does not support the preceding mechanisms, TCP
skipping to change at page 15, line 25 skipping to change at page 16, line 19
records (ULPDUs) to MPA. MPA will use the reliable transmission records (ULPDUs) to MPA. MPA will use the reliable transmission
abilities of TCP to transmit the data, and will insert appropriate abilities of TCP to transmit the data, and will insert appropriate
additional information into the TCP stream to allow the MPA receiver additional information into the TCP stream to allow the MPA receiver
to locate the record boundary information. to locate the record boundary information.
As such, MPA accepts complete records (ULPDUs) from DDP at the sender As such, MPA accepts complete records (ULPDUs) from DDP at the sender
and returns them to DDP at the receiver. and returns them to DDP at the receiver.
MPA combined with an MPA-aware TCP can only ensure FPDU Alignment MPA combined with an MPA-aware TCP can only ensure FPDU Alignment
with the TCP Header if the FPDU is less than or equal to TCP's EMSS. with the TCP Header if the FPDU is less than or equal to TCP's EMSS.
Since FPDU alignment is generally desired by the receiver, DDP must Since FPDU Alignment is generally desired by the receiver, DDP must
cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS
under normal conditions. This is done with the MULPDU mechanism. under normal conditions. This is done with the MULPDU mechanism.
MPA provides information to DDP on the current maximum size of the MPA provides information to DDP on the current maximum size of the
record that is acceptable to send (MULPDU). DDP SHOULD limit each record that is acceptable to send (MULPDU). DDP SHOULD limit each
record size to MULPDU. The range of MULPDU values MUST be between record size to MULPDU. The range of MULPDU values MUST be between
128 octets and 64768 octets, inclusive. 128 octets and 64768 octets, inclusive.
The sending DDP MUST NOT post a ULPDU larger than 64768 octets to The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
MPA. DDP MAY post a ULPDU of any size between one and 64768 octets, MPA. DDP MAY post a ULPDU of any size between one and 64768 octets,
however MPA is NOT REQUIRED to support a "ULPDU Length" that is however MPA is not REQUIRED to support a ULPDU Length that is greater
greater than the current MULPDU. than the current MULPDU.
While the maximum theoretical length supported by the MPA header While the maximum theoretical length supported by the MPA header
ULPDU_Length field is 65535, TCP over IP requires the IP datagram ULPDU_Length field is 65535, TCP over IP requires the IP datagram
maximum length to be 65535 octets. To enable MPA to support FPDU maximum length to be 65535 octets. To enable MPA to support FPDU
Alignment, the maximum size of the FPDU must fit within an IP Alignment, the maximum size of the FPDU must fit within an IP
datagram. Thus the ULPDU limit of 64768 octets was derived by taking datagram. Thus the ULPDU limit of 64768 octets was derived by taking
the maximum IP datagram length, subtracting from it the maximum total the maximum IP datagram length, subtracting from it the maximum total
length of the sum of the IPv4 header, TCP header, IPv4 options, TCP length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
options, and the worst case MPA overhead, and then rounding the options, and the worst case MPA overhead, and then rounding the
result down to a 128 octet boundary. result down to a 128 octet boundary.
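The derivation of the 64768 octet limit can be checked with a short calculation. The sketch below is illustrative only: the draft names the terms that are subtracted but does not itemize them here, so the individual figures (60 octets each for the IPv4 and TCP headers with maximum options, and a worst case MPA overhead of header, pad, CRC, and one 4 octet Marker per 512 octets) are assumptions chosen for the example.

```python
# Illustrative arithmetic only. The per-item overhead figures below are
# assumptions for this sketch, not values itemized in the text above.
MAX_IP_DATAGRAM = 65535
IPV4_HEADER_WITH_OPTIONS = 60   # assumed: 20 octet header + 40 octets of options
TCP_HEADER_WITH_OPTIONS = 60    # assumed: 20 octet header + 40 octets of options
MPA_HEADER = 2                  # ULPDU_Length field
MPA_MAX_PAD = 3
MPA_CRC = 4
MARKER_SIZE = 4
MARKER_INTERVAL = 512


def max_ulpdu_limit():
    """Return the largest ULPDU size, rounded down to a 128 octet boundary."""
    tcp_payload = (MAX_IP_DATAGRAM
                   - IPV4_HEADER_WITH_OPTIONS
                   - TCP_HEADER_WITH_OPTIONS)
    # Ceiling division: a pessimistic count of Markers in the payload.
    worst_markers = -(-tcp_payload // MARKER_INTERVAL)
    mpa_overhead = (MPA_HEADER + MPA_MAX_PAD + MPA_CRC
                    + worst_markers * MARKER_SIZE)
    return (tcp_payload - mpa_overhead) // 128 * 128


print(max_ulpdu_limit())
```

Under these assumptions the result agrees with the limit stated above. The rounding step absorbs small differences in how the worst case Marker count is estimated.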
skipping to change at page 16, line 13 skipping to change at page 17, line 6
the MPA implementation SHOULD: the MPA implementation SHOULD:
* Pass each ULPDU with its length to DDP as soon as it has been * Pass each ULPDU with its length to DDP as soon as it has been
fully received and validated. fully received and validated.
* Provide a mechanism to indicate the ordering of ULPDUs as the * Provide a mechanism to indicate the ordering of ULPDUs as the
sender transmitted them. One possible mechanism might be sender transmitted them. One possible mechanism might be
providing the TCP sequence number for each ULPDU. providing the TCP sequence number for each ULPDU.
* Provide a mechanism to indicate when a given ULPDU (and prior * Provide a mechanism to indicate when a given ULPDU (and prior
ULPDUs) are complete (delivered to DDP). One possible mechanism ULPDUs) are complete (Delivered to DDP). One possible mechanism
might be to allow DDP to see the current outgoing TCP Ack might be to allow DDP to see the current outgoing TCP Ack
sequence number. sequence number.
* Provide an indication to DDP that the TCP has closed or has begun * Provide an indication to DDP that the TCP has closed or has begun
to close the connection (e.g. received a FIN). to close the connection (e.g. received a FIN).
MPA MUST provide the protocol version negotiated with its peer to MPA MUST provide the protocol version negotiated with its peer to
DDP. DDP will use this version to set the version in its header and DDP. DDP will use this version to set the version in its header and
to report the version to RDMAP to report the version to [RDMAP].
4 FPDU Formats 4 FPDU Formats
MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown
below MUST be used for all MPA FPDUs. For purposes of clarity, below MUST be used for all MPA FPDUs. For purposes of clarity,
markers are not shown in Figure 2. Markers are not shown in Figure 2.
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ULPDU_Length | | | ULPDU_Length | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| | | |
~ ~ ~ ~
~ ULPDU ~ ~ ULPDU ~
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | PAD (0-3 octets) | | | PAD (0-3 octets) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CRC | | CRC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2 FPDU Format Figure 2 FPDU Format
ULPDU_Length: 16 bits (unsigned integer). This is the number of ULPDU_Length: 16 bits (unsigned integer). This is the number of
octets of the contained ULPDU. It does not include the length of the octets of the contained ULPDU. It does not include the length of the
FPDU header itself, the pad, the CRC, or of any markers that fall FPDU header itself, the pad, the CRC, or of any Markers that fall
within the ULPDU. The 16-bit "ULPDU Length" field is large enough to within the ULPDU. The 16-bit ULPDU Length field is large enough to
support the largest IP datagrams for IPv4 or IPv6. support the largest IP datagrams for IPv4 or IPv6.
PAD: The PAD field trails the ULPDU and contains between zero and PAD: The PAD field trails the ULPDU and contains between zero and
three octets of data. The pad data MUST be set to zero by the sender three octets of data. The pad data MUST be set to zero by the sender
and ignored by the receiver (except for CRC checking). The length of and ignored by the receiver (except for CRC checking). The length of
the pad is set so as to make the size of the FPDU an integral the pad is set so as to make the size of the FPDU an integral
multiple of four. multiple of four.
CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C
check value, which is used to verify the entire contents of the FPDU, check value, which is used to verify the entire contents of the FPDU,
using CRC32C. See section 5.2 CRC Calculation on page 22. When CRCs using CRC32C. See section 5.2 CRC Calculation on page 23. When CRCs
are not enabled, this field is still present, may contain any value, are not enabled, this field is still present, may contain any value,
and MUST NOT be checked. and MUST NOT be checked.
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
addition, the total length of the FPDU will include the length of any addition, the total length of the FPDU will include the length of any
markers and from 0 to 3 pad octets added to round-up the ULPDU size. Markers and from 0 to 3 pad octets added to round-up the ULPDU size.
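The length rules above can be expressed as a small helper. This is a minimal sketch, not part of either draft revision; the function name and the marker_count parameter are hypothetical, and the caller is assumed to know how many Markers fall within the FPDU.

```python
def fpdu_length(ulpdu_len, marker_count=0):
    """Return (total FPDU octets, pad octets) for a given ULPDU length.

    Hypothetical helper: marker_count is the number of 4 octet Markers
    that happen to fall inside this FPDU.
    """
    if not 0 <= ulpdu_len <= 0xFFFF:
        raise ValueError("ULPDU_Length is a 16 bit unsigned field")
    unpadded = 2 + ulpdu_len           # ULPDU_Length field + ULPDU
    pad = -unpadded % 4                # 0 to 3 octets, rounds up to a multiple of 4
    total = unpadded + pad + 4 + 4 * marker_count   # + CRC + Markers
    return total, pad


# A 16 octet ULPDU with one Marker inside it, as in the Figure 4 layout.
print(fpdu_length(16, marker_count=1))
```

With no pad and no Markers the helper adds exactly 6 octets to the ULPDU length, which is the minimum stated above.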
4.1 Marker Format 4.1 Marker Format
The format of a marker MUST be as specified in Figure 3: The format of a Marker MUST be as specified in Figure 3:
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RESERVED | FPDUPTR | | RESERVED | FPDUPTR |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3 Marker Format Figure 3 Marker Format
RESERVED: The Reserved field MUST be set to zero on transmit and RESERVED: The Reserved field MUST be set to zero on transmit and
ignored on receive (except for CRC calculation). ignored on receive (except for CRC calculation).
FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long, FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long,
interpreted as an unsigned integer, that indicates the number of interpreted as an unsigned integer that indicates the number of
octets in the TCP stream from the beginning of the "ULPDU Length" octets in the TCP stream from the beginning of the ULPDU Length field
field to the first octet of the entire marker. The least significant to the first octet of the entire Marker. The least significant two
two bits MUST always be set to zero at the transmitter, and the bits MUST always be set to zero at the transmitter, and the receivers
receivers MUST always treat these as zero for calculations. MUST always treat these as zero for calculations.
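As an illustration of the Marker layout, the sketch below packs and unpacks the 4 octet Marker. It assumes network byte order (most significant octet first), which is the usual convention for fields drawn this way but is not stated explicitly in this section; the function names are hypothetical.

```python
import struct


def pack_marker(fpduptr):
    """Build a Marker: 16 bits RESERVED (zero) followed by 16 bits FPDUPTR."""
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR is a 16 bit unsigned field")
    if fpduptr & 0x3:
        raise ValueError("the two least significant bits must be zero")
    return struct.pack("!HH", 0, fpduptr)   # "!" = network byte order (assumed)


def unpack_fpduptr(marker):
    """Return FPDUPTR, ignoring RESERVED and treating the low two bits as zero."""
    _reserved, fpduptr = struct.unpack("!HH", marker)
    return fpduptr & 0xFFFC


print(pack_marker(0x000C).hex())
```

The receiver side masks the low two bits rather than rejecting the Marker, matching the requirement that receivers treat those bits as zero for calculations.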
5 Data Transfer Semantics 5 Data Transfer Semantics
This section discusses some characteristics and behavior of the MPA This section discusses some characteristics and behavior of the MPA
protocol as well as implications of that protocol. protocol as well as implications of that protocol.
5.1 MPA Markers 5.1 MPA Markers
MPA markers are used to identify the start of FPDUs when packets are MPA Markers are used to identify the start of FPDUs when packets are
received out of order. This is done by locating the markers at fixed received out of order. This is done by locating the Markers at fixed
intervals in the data stream (which is correlated to the TCP sequence intervals in the data stream (which is correlated to the TCP sequence
number) and using the marker value to locate the preceding FPDU number) and using the Marker value to locate the preceding FPDU
start. start.
All MPA markers are included in the containing FPDU CRC calculation All MPA Markers are included in the containing FPDU CRC calculation
(when both CRCs and markers are in use). (when both CRCs and Markers are in use).
The MPA receiver's ability to locate out of order FPDUs and pass the The MPA receiver's ability to locate out of order FPDUs and pass the
ULPDUs to DDP is implementation dependent. MPA/DDP allows those ULPDUs to DDP is implementation dependent. MPA/DDP allows those
receivers that are able to deal with out of order FPDUs in this way receivers that are able to deal with out of order FPDUs in this way
to require the insertion of markers in the data stream. When the to require the insertion of Markers in the data stream. When the
receiver cannot deal with out of order FPDUs in this way, it may receiver cannot deal with out of order FPDUs in this way, it may
disable the insertion of markers at the sender. All MPA senders MUST disable the insertion of Markers at the sender. All MPA senders MUST
be able to generate markers when their use is declared by the be able to generate Markers when their use is declared by the
opposing receiver (see section 6.1 Connection setup on page 31). opposing receiver (see section 6.1 Connection setup on page 32).
When Markers are enabled, MPA senders MUST insert a marker into the When Markers are enabled, MPA senders MUST insert a Marker into the
data stream at a 512 octet periodic interval in the TCP Sequence data stream at a 512 octet periodic interval in the TCP Sequence
Number Space. The marker contains a 16 bit unsigned integer referred Number Space. The Marker contains a 16 bit unsigned integer referred
to as the FPDUPTR (FPDU Pointer). to as the FPDUPTR (FPDU Pointer).
If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
relative back-pointer. FPDUPTR MUST contain the number of octets in relative back-pointer. FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the "ULPDU Length" field to the the TCP stream from the beginning of the ULPDU Length field to the
first octet of the marker, unless the marker falls between FPDUs. first octet of the Marker, unless the Marker falls between FPDUs.
Thus the location of the first octet of the previous FPDU header can Thus the location of the first octet of the previous FPDU header can
be determined by subtracting the value of the given marker from the be determined by subtracting the value of the given Marker from the
current octet-stream sequence number (i.e. TCP sequence number) of current octet-stream sequence number (i.e. TCP sequence number) of
the first octet of the marker. Note that this computation MUST take the first octet of the Marker. Note that this computation MUST take
into account that the TCP sequence number could have wrapped between into account that the TCP sequence number could have wrapped between
the marker and the header. the Marker and the header.
An FPDUPTR value of 0x0000 is a special case - it is used when the An FPDUPTR value of 0x0000 is a special case - it is used when the
marker falls exactly between FPDUs (between the preceding FPDU CRC Marker falls exactly between FPDUs (between the preceding FPDU CRC
field, and the next FPDU's "ULPDU Length" field). In this case, the field, and the next FPDU's ULPDU Length field). In this case, the
marker is considered to be contained in the following FPDU; the Marker is considered to be contained in the following FPDU; the
marker MUST be included in the CRC calculation of the FPDU following Marker MUST be included in the CRC calculation of the FPDU following
the marker (if CRCs are being generated or checked). Thus an FPDUPTR the Marker (if CRCs are being generated or checked). Thus an FPDUPTR
value of 0x0000 means that immediately following the marker is an value of 0x0000 means that immediately following the Marker is an
FPDU header (the "ULPDU Length" field). FPDU header (the ULPDU Length field).
Since all FPDUs are integral multiples of 4 octets, the bottom two Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero. MPA bits of the FPDUPTR as calculated by the sender are zero. MPA
reserves these bits so they MUST be treated as zero for computation reserves these bits so they MUST be treated as zero for computation
at the receiver. at the receiver.
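The back-pointer computation, including sequence number wrap and the 0x0000 special case, might look like the following. This is a sketch under the assumption of a 32 bit TCP sequence space; the function name is hypothetical.

```python
SEQ_MODULUS = 1 << 32      # TCP sequence numbers wrap modulo 2**32
MARKER_SIZE = 4


def ulpdu_length_field_seq(marker_seq, fpduptr):
    """Return the sequence number of the first octet of the ULPDU Length field.

    marker_seq is the sequence number of the first octet of the Marker.
    """
    fpduptr &= 0xFFFC      # the low two bits are treated as zero
    if fpduptr == 0:
        # Marker fell between FPDUs: the FPDU header immediately follows it.
        return (marker_seq + MARKER_SIZE) % SEQ_MODULUS
    # Otherwise step back from the Marker, allowing for wrap.
    return (marker_seq - fpduptr) % SEQ_MODULUS


# A Marker at sequence number 4 pointing back 16 octets crosses the wrap point.
print(ulpdu_length_field_seq(4, 0x0010))
```

The modulo operation is what handles the case where the sequence number wrapped between the header and the Marker.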
When Markers are enabled (see section 6.1 Connection setup on page When Markers are enabled (see section 6.1 Connection setup on page
31), the MPA markers MUST be inserted immediately preceding the first 32), the MPA Markers MUST be inserted immediately preceding the first
FPDU of full operation phase, and at every 512th octet of the TCP FPDU of Full Operation phase, and at every 512th octet of the TCP
octet stream thereafter. As a result, the first marker has an octet stream thereafter. As a result, the first Marker has an
FPDUPTR value of 0x0000. If the first marker begins at octet FPDUPTR value of 0x0000. If the first Marker begins at octet
sequence number SeqStart, then markers are inserted such that the sequence number SeqStart, then Markers are inserted such that the
first octet of the marker is at octet sequence number SeqNum if the first octet of the Marker is at octet sequence number SeqNum if the
remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum
can wrap. can wrap.
For example, if the TCP sequence number were used to calculate the For example, if the TCP sequence number were used to calculate the
insertion point of the marker, the starting TCP sequence number is insertion point of the Marker, the starting TCP sequence number is
unlikely to be zero, and 512 octet multiples are unlikely to fall on unlikely to be zero, and 512 octet multiples are unlikely to fall on
a modulo 512 of zero. If the MPA connection is started at TCP a modulo 512 of zero. If the MPA connection is started at TCP
sequence number 11, then the 1st marker will begin at 11, and sequence number 11, then the 1st Marker will begin at 11, and
subsequent markers will begin at 523, 1035, etc. subsequent Markers will begin at 523, 1035, etc.
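The placement rule can be sketched as follows, assuming a 32 bit TCP sequence space. The function names are hypothetical. Because 2**32 is a multiple of 512, the test keeps working after SeqNum wraps.

```python
SEQ_MODULUS = 1 << 32
MARKER_INTERVAL = 512


def is_marker_position(seq_num, seq_start):
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return (seq_num - seq_start) % SEQ_MODULUS % MARKER_INTERVAL == 0


def marker_positions(seq_start, count):
    """Return the sequence numbers of the first `count` Markers."""
    return [(seq_start + i * MARKER_INTERVAL) % SEQ_MODULUS
            for i in range(count)]


print(marker_positions(11, 3))
```

For a connection whose first Marker is at sequence number 11, this reproduces the positions given in the example above.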
If an FPDU is large enough to contain multiple markers, they MUST all If an FPDU is large enough to contain multiple Markers, they MUST all
point to the same point in the TCP stream: the first octet of the point to the same point in the TCP stream: the first octet of the
"ULPDU Length" field for the FPDU. ULPDU Length field for the FPDU.
If a marker interval contains multiple FPDUs (the FPDUs are small), If a Marker interval contains multiple FPDUs (the FPDUs are small),
the marker MUST point to the start of the "ULPDU Length" field for the Marker MUST point to the start of the ULPDU Length field for the
the FPDU containing the marker unless the marker falls between FPDUs, FPDU containing the Marker unless the Marker falls between FPDUs, in
in which case the marker MUST be zero. which case the Marker MUST be zero.
The following example shows an FPDU containing a marker. The following example shows an FPDU containing a Marker.
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ULPDU Length (0x0010) | | | ULPDU Length (0x0010) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| | | |
+ + + +
| ULPDU (octets 0-9) | | ULPDU (octets 0-9) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| (0x0000) | FPDU ptr (0x000C) | | (0x0000) | FPDU ptr (0x000C) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ULPDU (octets 10-15) | | ULPDU (octets 10-15) |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | PAD (2 octets:0,0) | | | PAD (2 octets:0,0) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CRC | | CRC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4 Example FPDU Format with Marker Figure 4 Example FPDU Format with Marker
MPA Receivers MUST preserve ULPDU boundaries when passing data to MPA Receivers MUST preserve ULPDU boundaries when passing data to
DDP. MPA Receivers MUST pass the ULPDU data and the "ULPDU Length" to DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
DDP and not the markers, headers, and CRC. DDP and not the Markers, headers, and CRC.
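A sender-side layout consistent with the rules in this section might be sketched as below. This is not taken from either draft revision: the function name, the until_marker parameter (octets remaining before the next Marker position, counted from the start of this FPDU), and the crc_fn callback are all hypothetical, network byte order is assumed, and the default crc_fn is a zero placeholder rather than a valid CRC.

```python
import struct

MARKER_INTERVAL = 512
MARKER_SIZE = 4


def build_fpdu(ulpdu, until_marker, crc_fn=lambda covered: b"\x00" * 4):
    """Lay out one FPDU, inserting Markers where they fall.

    until_marker: stream octets before the next Marker position, measured
    from where this FPDU starts (0 means a Marker falls between FPDUs and
    is carried at the front of this one).
    crc_fn: hypothetical callback given every octet covered by the CRC;
    the default returns zeros and is only a placeholder.
    """
    if until_marker % 4 or not 0 <= until_marker < MARKER_INTERVAL:
        raise ValueError("until_marker must be a multiple of 4 below 512")
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)          # PAD to a multiple of four
    out = bytearray()
    header_pos = None                           # offset of ULPDU Length field
    # Walk the body one 4 octet word at a time, plus a final slot for the CRC.
    for i in range(0, len(body) + 4, 4):
        if until_marker == 0:
            fpduptr = 0 if header_pos is None else len(out) - header_pos
            out += struct.pack("!HH", 0, fpduptr)
            until_marker = MARKER_INTERVAL - MARKER_SIZE
        if header_pos is None:
            header_pos = len(out)
        out += body[i:i + 4] if i < len(body) else crc_fn(bytes(out))
        until_marker -= 4
    return bytes(out)


# Figure 4 layout: 16 octet ULPDU, Marker due 12 octets into the FPDU.
fpdu = build_fpdu(bytes(range(16)), until_marker=12)
print(fpdu.hex())
```

Passing the accumulated octets to crc_fn means that a leading Marker, or a Marker that falls immediately after the PAD, is covered by this FPDU's CRC.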
5.2 CRC Calculation 5.2 CRC Calculation
An MPA implementation MUST implement CRC support and MUST either: An MPA implementation MUST implement CRC support and MUST either:
(1) always use CRCs; The MPA provider is NOT REQUIRED to support (1) always use CRCs; The MPA provider is not REQUIRED to support
an administrator's request that CRCs not be used. an administrator's request that CRCs not be used.
or or
(2a) only indicate a preference to not use CRCs on the explicit (2a) only indicate a preference to not use CRCs on the explicit
request of the system administrator, via an interface not defined request of the system administrator, via an interface not defined
in this spec. The default configuration for a connection MUST be in this spec. The default configuration for a connection MUST be
to use CRCs. to use CRCs.
(2b) disable CRC checking (and possibly generation) if both the local (2b) disable CRC checking (and possibly generation) if both the local
skipping to change at page 22, line 34 skipping to change at page 23, line 34
from undetected errors as an end-to-end CRC32c. from undetected errors as an end-to-end CRC32c.
The process MUST be invisible to the ULP. The process MUST be invisible to the ULP.
After receipt of an MPA startup declaration indicating that its peer After receipt of an MPA startup declaration indicating that its peer
requires CRCs, an MPA instance MUST continue generating and checking requires CRCs, an MPA instance MUST continue generating and checking
CRCs until the connection terminates. If an MPA instance has CRCs until the connection terminates. If an MPA instance has
declared that it does not require CRCs, it MUST turn off CRC checking declared that it does not require CRCs, it MUST turn off CRC checking
immediately after receipt of an MPA mode declaration indicating that immediately after receipt of an MPA mode declaration indicating that
its peer also does not require CRCs. It MAY continue generating its peer also does not require CRCs. It MAY continue generating
CRCs. See section 6.1 Connection setup on page 31 for details on the CRCs. See section 6.1 Connection setup on page 32 for details on the
MPA startup. MPA startup.
When sending an FPDU, the sender MUST include a CRC field. When CRCs When sending an FPDU, the sender MUST include a CRC field. When CRCs
are enabled, the CRC field in the MPA FPDU MUST be computed using the are enabled, the CRC field in the MPA FPDU MUST be computed using the
CRC32C polynomial in the manner described in the iSCSI Protocol CRC32C polynomial in the manner described in the iSCSI Protocol
[iSCSI] document for Header and Data Digests. [iSCSI] document for Header and Data Digests.
The fields which MUST be included in the CRC calculation when sending The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows: an FPDU are as follows:
1) If a marker does not immediately precede the "ULPDU Length" 1) If a Marker does not immediately precede the ULPDU Length field,
field, the CRC-32c is calculated from the first octet of the the CRC-32c is calculated from the first octet of the ULPDU
"ULPDU Length" field, through all the ULPDU and markers (if Length field, through all the ULPDU and Markers (if present), to
present), to the last octet of the PAD (if present), inclusive. the last octet of the PAD (if present), inclusive. If there is a
If there is a marker immediately following the PAD, the marker is Marker immediately following the PAD, the Marker is included in
included in the CRC calculation for this FPDU. the CRC calculation for this FPDU.
2) If a marker immediately precedes the first octet of the "ULPDU 2) If a Marker immediately precedes the first octet of the ULPDU
Length" field of the FPDU, (i.e. the marker fell between FPDUs, Length field of the FPDU, (i.e. the Marker fell between FPDUs,
and thus is required to be included in the second FPDU), the CRC- and thus is required to be included in the second FPDU), the CRC-
32c is calculated from the first octet of the marker, through the 32c is calculated from the first octet of the Marker, through the
"ULPDU Length" header, through all the ULPDU and markers (if ULPDU Length header, through all the ULPDU and Markers (if
present), to the last octet of the PAD (if present), inclusive. present), to the last octet of the PAD (if present), inclusive.
3) After calculating the CRC-32c, the resultant value is placed into 3) After calculating the CRC-32c, the resultant value is placed into
the CRC field at the end of the FPDU. the CRC field at the end of the FPDU.
When an FPDU is received, and CRC checking is enabled, the receiver When an FPDU is received, and CRC checking is enabled, the receiver
MUST first perform the following: MUST first perform the following:
1) Calculate the CRC of the incoming FPDU in the same fashion as 1) Calculate the CRC of the incoming FPDU in the same fashion as
defined above. defined above.
2) Verify that the calculated CRC-32c value is the same as the 2) Verify that the calculated CRC-32c value is the same as the
received CRC-32c value found in the FPDU CRC field. If not, the received CRC-32c value found in the FPDU CRC field. If not, the
receiver MUST treat the FPDU as an invalid FPDU. receiver MUST treat the FPDU as an invalid FPDU.
The procedure for handling invalid FPDUs is covered in the Error The procedure for handling invalid FPDUs is covered in the Error
Section (see section 7 on page 46) Section (see section 7 on page 46)
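The sender and receiver CRC steps above can be illustrated with a small Python sketch. It computes CRC-32c bit by bit (Castagnoli polynomial in its reflected form, 0x82F63B78) and compares the result with a received value. The names crc32c and fpdu_crc_ok are hypothetical. The sketch assumes the caller has already selected the covered octets (a preceding Marker if any, ULPDU Length, ULPDU, embedded Markers, PAD) and has decoded the CRC field into an integer; the on-the-wire octet order of the CRC field is not addressed here.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32c; slow, but easy to check against test vectors."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def fpdu_crc_ok(covered: bytes, received_crc: int,
                crc_enabled: bool = True) -> bool:
    """Receiver-side check; a False result means an invalid FPDU.

    When CRC checking is not enabled, the CRC field is treated as
    valid regardless of its contents.
    """
    if not crc_enabled:
        return True
    return crc32c(covered) == received_crc
```

A common sanity check for any CRC-32c routine is the ASCII string "123456789", whose CRC-32c is 0xE3069283.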
The following is an annotated hex dump of an example FPDU sent as the The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream. As such, it starts with a marker. The FPDU first FPDU on the stream. As such, it starts with a Marker. The
contains a 42 octet ULPDU (an example DDP segment) which in turn FPDU contains a 42 octet ULPDU (an example DDP segment) which in turn
carries 24 octets of Send Data, a data load that is all zeros. The carries 24 octets of Send Data, a data load that is all zeros. The
CRC32c has been correctly calculated and can be used as a CRC32c has been correctly calculated and can be used as a
reference. See the [DDP] and [RDMA] specifications for reference. See the [DDP] and [RDMAP] specifications for
definitions of the DDP Control field, Queue, MSN, MO, and Send Data. definitions of the DDP Control field, Queue, MSN, MO, and Send Data.
Octet Contents Annotation Octet Contents Annotation
Count Count
0000 00 Marker: Reserved 0000 00 Marker: Reserved
0001 00 0001 00
0002 00 Marker: FPDUPTR 0002 00 Marker: FPDUPTR
0003 00 0003 00
0004 00 ULPDU Length 0004 00 ULPDU Length
0005 2a 0005 2a
0006 41 DDP Control Field, Send with Last flag set 0006 41 DDP Control Field, Send with Last flag set
0007 43 0007 43
0008 00 Reserved (STag position with no STag) 0008 00 Reserved (DDP STag position with no STag)
0009 00 0009 00
000a 00 000a 00
000b 00 000b 00
000c 00 Queue = 0 000c 00 DDP Queue = 0
000d 00 000d 00
000e 00 000e 00
000f 00 000f 00
0010 00 MSN = 1 0010 00 DDP MSN = 1
0011 00 0011 00
0012 00 0012 00
0013 01 0013 01
0014 00 MO = 0 0014 00 DDP MO = 0
0015 00 0015 00
0016 00 0016 00
0017 00 0017 00
0018 00 Send Data (24 octets of zeros) 0018 00 DDP Send Data (24 octets of zeros)
... ...
002f 00 002f 00
0030 52 CRC32c 0030 52 CRC32c
0031 23 0031 23
0032 99 0032 99
0033 83 0033 83
Figure 5 Annotated Hex Dump of an FPDU Figure 5 Annotated Hex Dump of an FPDU
The following is an example sent as the second FPDU of the stream The following is an example sent as the second FPDU of the stream
where the first FPDU (which is not shown here) had a length of 492 where the first FPDU (which is not shown here) had a length of 492
octets and was also a Send to Queue 0 with Last Flag set. This octets and was also a Send to Queue 0 with Last Flag set. This
example contains a marker. example contains a Marker.
Octet Contents Annotation Octet Contents Annotation
Count Count
01ec 00 ULPDU Length 01ec 00 ULPDU Length
01ed 2a 01ed 2a
01ee 41 DDP Control Field: Send with Last Flag set 01ee 41 DDP Control Field: Send with Last Flag set
01ef 43 01ef 43
01f0 00 Reserved (STag position with no STag) 01f0 00 Reserved (DDP STag position with no STag)
01f1 00 01f1 00
01f2 00 01f2 00
01f3 00 01f3 00
01f4 00 Queue = 0 01f4 00 DDP Queue = 0
01f5 00 01f5 00
01f6 00 01f6 00
01f7 00 01f7 00
01f8 00 MSN = 2 01f8 00 DDP MSN = 2
01f9 00 01f9 00
01fa 00 01fa 00
01fb 02 01fb 02
01fc 00 MO = 0 01fc 00 DDP MO = 0
01fd 00 01fd 00
01fe 00 01fe 00
01ff 00 01ff 00
0200 00 Marker: Reserved 0200 00 Marker: Reserved
0201 00 0201 00
0202 00 Marker: FPDUPTR 0202 00 Marker: FPDUPTR
0203 14 0203 14
0204 00 Send Data (24 octets of zeros) 0204 00 DDP Send Data (24 octets of zeros)
... ...
021b 00 021b 00
021c 84 CRC32c 021c 84 CRC32c
021d 92 021d 92
021e 58 021e 58
021f 98 021f 98
Figure 6 Annotated Hex Dump of an FPDU with Marker Figure 6 Annotated Hex Dump of an FPDU with Marker
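The FPDUPTR values in Figures 5 and 6 can be reproduced with a short calculation. This Python sketch assumes that Markers sit at every 512th octet of the stream, counted from the first FPDU (the spacing implied by the MULPDU formula in section 5.3.2), and that FPDUPTR is the distance in octets from the first octet of the FPDU to the Marker. The name markers_in_fpdu is hypothetical.

```python
MARKER_SPACING = 512  # octets between Markers (assumed, see section 5.1)


def markers_in_fpdu(fpdu_start: int, fpdu_length: int):
    """Return (stream_offset, FPDUPTR) for each Marker inside one FPDU.

    fpdu_start is the stream offset of the FPDU's first octet and
    fpdu_length its total on-the-wire length, Markers included.
    """
    # Round fpdu_start up to the next multiple of the Marker spacing.
    first = -(-fpdu_start // MARKER_SPACING) * MARKER_SPACING
    return [(pos, pos - fpdu_start)
            for pos in range(first, fpdu_start + fpdu_length, MARKER_SPACING)]
```

For the FPDU of Figure 6, which starts at octet 0x01ec and is 52 octets long, markers_in_fpdu(0x1ec, 52) yields a single Marker at 0x0200 with FPDUPTR 0x14, matching the dump.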
5.3 MPA on TCP Sender Segmentation 5.3 MPA on TCP Sender Segmentation
skipping to change at page 26, line 12 skipping to change at page 27, line 12
MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
contained in one FPDU. contained in one FPDU.
An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
implementations that support this, and with an EMSS large enough to implementations that support this, and with an EMSS large enough to
contain at least one FPDU, segment the outbound TCP stream such that contain at least one FPDU, segment the outbound TCP stream such that
each TCP segment begins with an FPDU, and fully contains all included each TCP segment begins with an FPDU, and fully contains all included
FPDUs. FPDUs.
Implementation note: To achieve the previous segmentation rule, Implementation note: To achieve the previous segmentation rule,
TCP's Nagle [RFC0896] algorithm SHOULD be disabled. an MPA-aware TCP sender implementation SHOULD disable TCP's
Nagle [RFC0896] algorithm, communicate the FPDU boundaries to
TCP, and make other minor changes such as the reporting of EMSS
to MPA.
There are exceptions to the above rule. Once an ULPDU is provided to There are exceptions to the above rule. Once an ULPDU is provided to
MPA, the MPA on TCP sender MUST transmit it or fail the connection; MPA, the MPA on TCP sender MUST transmit it or fail the connection;
it cannot be repudiated. As a result, during changes in MTU and it cannot be repudiated. As a result, during changes in MTU and
EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it
may be necessary to send FPDUs that do not conform to the may be necessary to send FPDUs that do not conform to the
segmentation rule above. segmentation rule above.
A possible, but less desirable, alternative is to use IP A possible, but less desirable, alternative is to use IP
fragmentation on accepted FPDUs to deal with MTU reductions or fragmentation on accepted FPDUs to deal with MTU reductions or
extremely small EMSS. extremely small EMSS.
The sender MUST still format the FPDU according to FPDU format as The sender MUST still format the FPDU according to FPDU format as
shown in Figure 2. shown in Figure 2.
On a retransmission, TCP does not necessarily preserve original TCP On a retransmission, TCP does not necessarily preserve original TCP
segmentation boundaries. This can lead to the loss of FPDU alignment segmentation boundaries. This can lead to the loss of FPDU Alignment
and containment within a TCP segment during TCP retransmissions. An and containment within a TCP segment during TCP retransmissions. An
MPA-aware TCP sender SHOULD try to preserve original TCP segmentation MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
boundaries on a retransmission. boundaries on a retransmission.
5.3.1 Effects of MPA on TCP Segmentation 5.3.1 Effects of MPA on TCP Segmentation
DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU
when a DDP message is large enough. Since the DDP message may not
exactly fit into TCP segments, a "message tail" often occurs that
results in an FPDU that is smaller than a single TCP segment.
Additionally some DDP messages may be considerably shorter than the
EMSS. If a small FPDU is sent in a single TCP segment the result is
a "short" TCP segment.
Applications expected to see strong advantages from Direct Data Applications expected to see strong advantages from Direct Data
Placement include transaction-based applications and throughput Placement include transaction-based applications and throughput
applications. Request/response protocols typically send one FPDU per applications. Request/response protocols typically send one FPDU per
TCP segment and then wait for a response. Therefore, the application TCP segment and then wait for a response. Under these conditions,
is expected to set TCP parameters such that it can trade off latency these "short" TCP segments are an appropriate and expected effect of
and wire efficiency. This is accomplished by setting the TCP_NODELAY the segmentation.
socket option.
When latency is not critical, and the application provides data in
chunks larger than EMSS at one time, the TCP implementation may
"pack" any available stream data into TCP segments so that the
segments are filled to the EMSS. If the amount of data available is
not enough to fill the TCP segment when it is prepared for
transmission, TCP can send the segment partly filled, or use the
Nagle algorithm to wait for the ULP to post more data (discussed
below).
DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU Another possibility is that the application might be sending multiple
when a DDP message is large enough. Since the DDP message may not messages (FPDUs) to the same endpoint before waiting for a response.
exactly fit into TCP segments, a "message tail" often occurs that
results in an FPDU that is smaller than a single TCP segment. If a
"message tail", small DDP messages, or the start of a larger DDP
message are available, MPA MAY "pack" the resulting FPDUs into TCP
segments. When this is done, the TCP segments can be more fully
utilized, but, due to the size constraints of FPDUs, segments may not
be filled to the EMSS.
Note that MPA receivers must do more processing of a TCP segment In this case, the segmentation policy would tend to reduce the
that contains multiple FPDUs; this may affect the performance of available connection bandwidth by under-filling the TCP segments.
some receiver implementations.
TCP implementations often utilize the "Nagle" [RFC0896] algorithm to TCP implementations often utilize the Nagle [RFC0896] algorithm to
ensure that segments are filled to the EMSS whenever the round trip ensure that segments are filled to the EMSS whenever the round trip
latency is large enough that the source stream can fully fill latency is large enough that the source stream can fully fill
segments before Acks arrive. The algorithm does this by delaying the segments before Acks arrive. The algorithm does this by delaying the
transmission of TCP segments until a ULP can fill a segment, or until transmission of TCP segments until a ULP can fill a segment, or until
an ACK arrives from the far side. The algorithm thus allows for an ACK arrives from the far side. The algorithm thus allows for
smaller segments when latencies are shorter to keep the ULP's end to smaller segments when latencies are shorter to keep the ULP's end to
end latency to reasonable levels. end latency to reasonable levels.
The Nagle algorithm is not mandatory to use [RFC1122]. The Nagle algorithm is not mandatory to use [RFC1122].
If Nagle or other algorithms for detecting the availability of
multiple FPDUs for transmission are used, "packing" of multiple FPDUs
into TCP segments can occur.
If a "message tail", small DDP messages, or the start of a larger DDP
message are available, MPA MAY pack multiple FPDUs into TCP segments.
When this is done, the TCP segments can be more fully utilized, but,
due to the size constraints of FPDUs, segments may not be filled to
the EMSS.
Note that MPA receivers must do more processing of a TCP segment
that contains multiple FPDUs; this may affect the performance of
some receiver implementations.
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
that many of the applications expected to take advantage of MPA/DDP that many of the applications expected to take advantage of MPA/DDP
prefer to avoid the extra delays caused by Nagle. In such scenarios prefer to avoid the extra delays caused by Nagle. In such scenarios
it is anticipated there will be minimal opportunity for packing at it is anticipated there will be minimal opportunity for packing at
the transmitter and receivers may choose to optimize their the transmitter and receivers may choose to optimize their
performance for this anticipated behavior. performance for this anticipated behavior.
Therefore, the application is expected to set TCP parameters such
that it can trade off latency and wire efficiency. This is
accomplished by setting the TCP_NODELAY socket option (which disables
Nagle).
When latency is not critical, the application is expected to leave Nagle
enabled. In this case the TCP implementation may pack any available
stream data into TCP segments so that the segments are filled to the
EMSS. If the amount of data available is not enough to fill the TCP
segment when it is prepared for transmission, TCP can send the
segment partly filled, or use the Nagle algorithm to wait for the ULP
to post more data (discussed below).
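As a minimal illustration of the TCP_NODELAY setting mentioned above, the following Python sketch disables Nagle on an ordinary stream socket. It uses only the standard socket module; the function name make_low_latency_socket is hypothetical, and an MPA-aware TCP implementation may expose this control through a different interface.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks TCP to send small segments immediately
    # instead of waiting to coalesce them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

An application that prefers wire efficiency over latency would simply skip the setsockopt call and leave Nagle enabled.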
5.3.2 FPDU Size Considerations 5.3.2 FPDU Size Considerations
MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
the size of the largest ULPDU fitting in an FPDU. For an empty TCP the size of the largest ULPDU fitting in an FPDU. For an empty TCP
Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
space for markers and pad octets. space for Markers and pad octets.
The maximum ULPDU Length for a single ULPDU when markers are The maximum ULPDU Length for a single ULPDU when Markers are
present MUST be computed as: present MUST be computed as:
MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4) MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)
The formula above accounts for the worst-case number of markers. The formula above accounts for the worst-case number of Markers.
The maximum ULPDU Length for a single ULPDU when markers are NOT The maximum ULPDU Length for a single ULPDU when Markers are NOT
present MUST be computed as: present MUST be computed as:
MULPDU = EMSS - (6 + EMSS mod 4) MULPDU = EMSS - (6 + EMSS mod 4)
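The two MULPDU formulas above translate directly into code. This Python sketch is only a transcription of those formulas; the function name mulpdu and its markers parameter are hypothetical.

```python
def mulpdu(emss: int, markers: bool = True) -> int:
    """Largest ULPDU that fits in an empty TCP segment of size emss."""
    # 4 octets per Marker, worst case Ceiling(EMSS / 512) Markers.
    marker_octets = 4 * -(-emss // 512) if markers else 0
    # 6 octets of FPDU overhead plus room for pad octets.
    return emss - (6 + marker_octets + emss % 4)
```

For example, with an EMSS of 1460 octets the result is 1442 with Markers and 1454 without.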
As a further optimization of the wire efficiency an MPA As a further optimization of the wire efficiency an MPA
implementation MAY dynamically adjust the MULPDU (see section 7.3.1. implementation MAY dynamically adjust the MULPDU (see section 5.3.1
for latency and wire efficiency trade-offs). When one or more FPDUs for latency and wire efficiency trade-offs). When one or more FPDUs
are already packed into a TCP Segment, MULPDU MAY be reduced are already packed into a TCP Segment, MULPDU MAY be reduced
accordingly. accordingly.
DDP SHOULD provide ULPDUs that are as large as possible, but less DDP SHOULD provide ULPDUs that are as large as possible, but less
than or equal to MULPDU. than or equal to MULPDU.
If the TCP implementation needs to adjust EMSS to support MTU If the TCP implementation needs to adjust EMSS to support MTU
changes, the MULPDU value is changed accordingly. changes, the MULPDU value is changed accordingly.
In certain rare situations, the EMSS may shrink below 128 octets in In certain rare situations, the EMSS may shrink below 128 octets in
size. If this occurs, the MPA on TCP sender MUST NOT shrink the size. If this occurs, the MPA on TCP sender MUST NOT shrink the
MULPDU below 128 octets and is NOT REQUIRED to follow the MULPDU below 128 octets and is not REQUIRED to follow the
segmentation rules in Section 5.3 MPA on TCP Sender Segmentation on segmentation rules in Section 5.3 MPA on TCP Sender Segmentation on
page 25. page 26.
If one or more FPDUs are already packed into a TCP segment, such that If one or more FPDUs are already packed into a TCP segment, such that
the remaining room is less than 128 octets, MPA MUST NOT provide a the remaining room is less than 128 octets, MPA MUST NOT provide a
MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU smaller than 128. In this case, MPA would typically provide a
MULPDU for the next full sized segment, but may still pack the next MULPDU for the next full sized segment, but may still pack the next
FPDU into the small remaining room, provided that the next FPDU is FPDU into the small remaining room, provided that the next FPDU is
small enough to fit. small enough to fit.
The value 128 is chosen to allow DDP designers room for the DDP The value 128 is chosen to allow DDP designers room for the DDP
Header and some user data. Header and some user data.
skipping to change at page 29, line 20 skipping to change at page 30, line 20
* locate the start of the FPDU unambiguously, * locate the start of the FPDU unambiguously,
* verify its CRC (if CRC checking is enabled). * verify its CRC (if CRC checking is enabled).
If the above conditions are true, the MPA receiver passes the ULPDU If the above conditions are true, the MPA receiver passes the ULPDU
to DDP. to DDP.
To detect the start of the FPDU unambiguously one of the following To detect the start of the FPDU unambiguously one of the following
MUST be used: MUST be used:
1: In an ordered TCP stream, the "ULPDU Length" field in the current 1: In an ordered TCP stream, the ULPDU Length field in the current
FPDU when FPDU has a valid CRC, can be used to identify the FPDU when FPDU has a valid CRC, can be used to identify the
beginning of the next FPDU. beginning of the next FPDU.
2: For receivers that support out of order reception of FPDUs (see 2: For receivers that support out of order reception of FPDUs (see
section 5.1 MPA Markers on page 19) a Marker can always be used section 5.1 MPA Markers on page 20) a Marker can always be used
to locate the beginning of an FPDU (in FPDUs with valid CRCs). to locate the beginning of an FPDU (in FPDUs with valid CRCs).
Since the location of the marker is known in the octet stream Since the location of the Marker is known in the octet stream
(sequence number space), the marker can always be found. (sequence number space), the Marker can always be found.
3: Having found an FPDU by means of a Marker, following contiguous 3: Having found an FPDU by means of a Marker, following contiguous
FPDUs can be found by using the "ULPDU Length" fields (from FPDUs FPDUs can be found by using the ULPDU Length fields (from FPDUs
with valid CRCs) to establish the next FPDU boundary. with valid CRCs) to establish the next FPDU boundary.
The "ULPDU Length" field (see section 4) MUST be used to determine if The ULPDU Length field (see section 4) MUST be used to determine if
the entire FPDU is present before forwarding the ULPDU to DDP. the entire FPDU is present before forwarding the ULPDU to DDP.
CRC calculation is discussed in section 5.2 on page 22 above. CRC calculation is discussed in section 5.2 on page 23 above.
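Locating the next FPDU from the ULPDU Length field amounts to computing the on-the-wire length of the current FPDU. The Python sketch below does this under stated assumptions: a 2 octet ULPDU Length field, padding to a 4 octet boundary, a 4 octet CRC, and 4 octet Markers at every 512th octet of the stream. A Marker that falls exactly between two FPDUs is counted with the second one, as in the CRC rules of section 5.2. The name fpdu_wire_length is hypothetical.

```python
def fpdu_wire_length(ulpdu_length: int, stream_offset: int,
                     markers_enabled: bool = True) -> int:
    """Octets from the start of this FPDU to the start of the next."""
    pad = -(2 + ulpdu_length) % 4
    remaining = 2 + ulpdu_length + pad + 4  # everything except Markers
    if not markers_enabled:
        return remaining
    pos = stream_offset
    while remaining > 0:
        if pos % 512 == 0:
            pos += 4  # a Marker occupies this position
        # Consume FPDU octets up to the next Marker position.
        step = min(remaining, 512 - pos % 512)
        pos += step
        remaining -= step
    return pos - stream_offset
```

Both example FPDUs in section 5.2 carry a 42 octet ULPDU and one Marker, so the function returns 52 for stream offsets 0 and 0x01ec alike.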
5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders
Since MPA on MPA-aware TCP senders start FPDUs on TCP segment Since MPA on MPA-aware TCP senders start FPDUs on TCP segment
boundaries, a receiving DDP on MPA on TCP implementation may be able boundaries, a receiving DDP on MPA on TCP implementation may be able
to optimize the reception of data in various ways. to optimize the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries. segment boundaries.
Some MPA senders may be unable to conform to the sender requirements Some MPA senders may be unable to conform to the sender requirements
because their implementation of TCP is not designed with MPA in mind. because their implementation of TCP is not designed with MPA in mind.
Even if the sender is MPA-aware, the network may contain "middle Even if the sender is MPA-aware, the network may contain "middle
boxes" which modify the TCP stream by changing the segmentation. boxes" which modify the TCP stream by changing the segmentation.
This is generally interoperable with TCP and its users and MPA must This is generally interoperable with TCP and its users and MPA must
be no exception. be no exception.
The presence of markers in MPA (when enabled) allows an MPA receiver The presence of Markers in MPA (when enabled) allows an MPA receiver
to recover the FPDUs despite these obstacles, although it may be to recover the FPDUs despite these obstacles, although it may be
necessary to utilize additional buffering at the receiver to do so. necessary to utilize additional buffering at the receiver to do so.
Some of the cases that a receiver may have to contend with are listed Some of the cases that a receiver may have to contend with are listed
below as a reminder to the implementer: below as a reminder to the implementer:
* A single Aligned and complete FPDU, either in order, or out of * A single Aligned and complete FPDU, either in order, or out of
order: This can be passed to DDP as soon as validated, and order: This can be passed to DDP as soon as validated, and
Delivered when ordering is established. Delivered when ordering is established.
skipping to change at page 31, line 9 skipping to change at page 32, line 9
* Combinations of Unaligned or incomplete FPDUs (and potentially * Combinations of Unaligned or incomplete FPDUs (and potentially
other complete FPDUs) in the same TCP segment: If any FPDU is other complete FPDUs) in the same TCP segment: If any FPDU is
present in its entirety, or can be completed with portions present in its entirety, or can be completed with portions
already available, it can be passed to DDP as soon as validated, already available, it can be passed to DDP as soon as validated,
and Delivered when ordering is established. and Delivered when ordering is established.
6 Connection Semantics 6 Connection Semantics
6.1 Connection setup 6.1 Connection setup
MPA requires that the consumer MUST activate MPA, and any TCP MPA requires that the Consumer MUST activate MPA, and any TCP
enhancements for MPA, on a TCP half connection at the same location enhancements for MPA, on a TCP half connection at the same location
in the octet stream at both the sender and the receiver. This is in the octet stream at both the sender and the receiver. This is
required in order for the marker scheme to correctly locate the required in order for the Marker scheme to correctly locate the
markers (if enabled) and to correctly locate the first FPDU. Markers (if enabled) and to correctly locate the first FPDU.
MPA, and any TCP enhancements for MPA are enabled by the ULP in both MPA, and any TCP enhancements for MPA are enabled by the ULP in both
directions at once at an endpoint. directions at once at an endpoint.
This can be accomplished in several ways, and is left up to DDP's ULP: This can be accomplished in several ways, and is left up to DDP's ULP:
* DDP's ULP MAY require DDP on MPA startup immediately after TCP * DDP's ULP MAY require DDP on MPA startup immediately after TCP
connection setup. This has the advantage that no streaming mode connection setup. This has the advantage that no streaming mode
negotiation is needed. An example of such a protocol is shown in negotiation is needed. An example of such a protocol is shown in
Figure 9: Example Immediate Startup negotiation on page 42. Figure 9: Example Immediate Startup negotiation on page 42.
skipping to change at page 31, line 40 skipping to change at page 32, line 40
normal TCP startup, using TCP streaming data exchanges on the normal TCP startup, using TCP streaming data exchanges on the
same connection. The exchange establishes that DDP on MPA (as same connection. The exchange establishes that DDP on MPA (as
well as other ULPs) will be used, and exactly locates the point well as other ULPs) will be used, and exactly locates the point
in the octet stream where MPA is to begin operation. Note that in the octet stream where MPA is to begin operation. Note that
such a negotiation protocol is outside the scope of this such a negotiation protocol is outside the scope of this
specification. A simplified example of such a protocol is shown specification. A simplified example of such a protocol is shown
in Figure 8: Example Delayed Startup negotiation on page 39. in Figure 8: Example Delayed Startup negotiation on page 39.
An MPA endpoint operates in two distinct phases. An MPA endpoint operates in two distinct phases.
The "Startup Phase" is used to verify correct MPA setup, exchange CRC The Startup Phase is used to verify correct MPA setup, exchange CRC
and Marker configuration, and optionally pass "private data" between and Marker configuration, and optionally pass Private Data between
endpoints prior to completing a DDP connection. During this phase, endpoints prior to completing a DDP connection. During this phase,
specifically formatted frames are exchanged as TCP byte streams specifically formatted frames are exchanged as TCP byte streams
without using CRCs or Markers. During this phase a DDP endpoint need without using CRCs or Markers. During this phase a DDP endpoint need
not be "bound" to the MPA connection. In fact, the choice of DDP not be "bound" to the MPA connection. In fact, the choice of DDP
endpoint and its operating parameters may not be known until the endpoint and its operating parameters may not be known until the
consumer supplied "private data" (if any) has been examined by the Consumer supplied Private Data (if any) has been examined by the
consumer. Consumer.
The second distinct phase is "Full operation" during which FPDUs are The second distinct phase is Full Operation during which FPDUs are
sent using all the rules that pertain (CRCs, Markers, MULPDU sent using all the rules that pertain (CRCs, Markers, MULPDU
restrictions etc.). A DDP endpoint MUST be "bound" to the MPA restrictions etc.). A DDP endpoint MUST be "bound" to the MPA
connection at entry to this phase. connection at entry to this phase.
When "private data" is passed between ULPs in the "Startup Phase", When Private Data is passed between ULPs in the Startup Phase, the
the ULP is responsible for interpreting that data, and then placing ULP is responsible for interpreting that data, and then placing MPA
MPA into "Full operation". into Full Operation.
Note: The following text differentiates the two endpoints by calling Note: The following text differentiates the two endpoints by calling
them "Initiator" and "Responder". This is quite arbitrary and is them Initiator and Responder. This is quite arbitrary and is NOT
NOT related to the TCP startup (SYN, SYN/ACK sequence). The related to the TCP startup (SYN, SYN/ACK sequence). The
Initiator is the side that sends first in the MPA startup Initiator is the side that sends first in the MPA startup
sequence (the "MPA Request Frame"). sequence (the MPA Request Frame).
Note: The possibility that both endpoints would be allowed to make a Note: The possibility that both endpoints would be allowed to make a
connection at the same time, sometimes called an "Active/Active" connection at the same time, sometimes called an active/active
connection, was considered by the work group and rejected. There connection, was considered by the work group and rejected. There
were several motivations for this decision. One was that were several motivations for this decision. One was that
applications needing this facility were few (none other than applications needing this facility were few (none other than
theoretical at the time of this draft). Another was that the theoretical at the time of this draft). Another was that the
facility created some implementation difficulties, particularly facility created some implementation difficulties, particularly
with the "Dual Stack" designs described later on. A last issue with the "dual stack" designs described later on. A last issue
was that dealing with rejected connections at startup would have was that dealing with rejected connections at startup would have
required at least an additional frame type, and more recovery required at least an additional frame type, and more recovery
actions, complicating the protocol. While none of these issues actions, complicating the protocol. While none of these issues
was overwhelming, the group and implementers were not motivated was overwhelming, the group and implementers were not motivated
to do the work to resolve these issues. The protocol includes a to do the work to resolve these issues. The protocol includes a
method of detecting these "Active/Active" startup attempts so method of detecting these active/active startup attempts so that
that they can be rejected and an error reported. they can be rejected and an error reported.
The ULP is responsible for determining which side is "Initiator" or The ULP is responsible for determining which side is Initiator or
"Responder". For "Client/Server" type ULPs this is easy. For peer- Responder. For client/server type ULPs this is easy. For peer-peer
peer ULPs (which might utilize a TCP style "active/active" startup), ULPs (which might utilize a TCP style active/active startup), some
some mechanism (not defined by this specification) must be mechanism (not defined by this specification) must be established, or
established, or some streaming mode data exchanged prior to MPA some streaming mode data exchanged prior to MPA startup to determine
startup to determine the side which starts in "Initiator" and which the side which starts in Initiator and which starts in Responder MPA
starts in "Responder" MPA mode. mode.
6.1.1 MPA Request and Reply Frame Format 6.1.1 MPA Request and Reply Frame Format
0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 | | 0 | |
+ Key (16 bytes containing "MPA ID Req Frame") + + Key (16 bytes containing "MPA ID Req Frame") +
4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) | 4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) |
+ Or (16 bytes containing "MPA ID Rep Frame") + + Or (16 bytes containing "MPA ID Rep Frame") +
skipping to change at page 33, line 27 skipping to change at page 34, line 27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res | Rev | PD_Length | 16 |M|C|R| Res | Rev | PD_Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
~ ~ ~ ~
~ Private Data ~ ~ Private Data ~
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7 "MPA Request/Reply Frame" Figure 7 MPA Request/Reply Frame
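A receiver's handling of the frame in Figure 7 can be sketched in Python as follows. The field widths are read off the figure and are assumptions of this sketch: a 16 octet Key, then one octet holding M, C and R in its three most significant bits, a one octet Rev, and a two octet PD_Length in network byte order, followed by the Private Data. The names MpaStartupFrame and parse_startup_frame are hypothetical.

```python
import struct
from typing import NamedTuple

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"


class MpaStartupFrame(NamedTuple):
    is_request: bool
    markers: bool       # M bit
    crc: bool           # C bit
    r_bit: bool         # R bit (its meaning is not covered here)
    revision: int       # Rev
    private_data: bytes


def parse_startup_frame(frame: bytes) -> MpaStartupFrame:
    if len(frame) < 20:
        raise ValueError("shorter than the fixed 20 octet header")
    key, flags, rev, pd_length = struct.unpack("!16sBBH", frame[:20])
    if key not in (REQ_KEY, REP_KEY):
        # The caller is expected to close the connection and report
        # the error locally.
        raise ValueError("Key does not identify an MPA startup frame")
    private_data = frame[20:20 + pd_length]
    if len(private_data) != pd_length:
        raise ValueError("Private Data is truncated")
    return MpaStartupFrame(key == REQ_KEY, bool(flags & 0x80),
                           bool(flags & 0x40), bool(flags & 0x20),
                           rev, private_data)
```

A Responder would additionally check that is_request is true, and an Initiator that it is false, before acting on the other fields.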
Key: This field contains the "key" used to validate that the sender Key: This field contains the "key" used to validate that the sender
is an MPA sender. Initiator mode senders MUST set this field to is an MPA sender. Initiator mode senders MUST set this field to
the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20 the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. Responder mode senders MUST set this field to value is detected. Responder mode senders MUST set this field to
the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20 the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. value is detected.
M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame,
Frame", declares a receiver's requirement for Markers. When in a declares a receiver's requirement for Markers. When in a
received "MPA Request Frame" or "MPA Reply Frame" and the value received MPA Request Frame or MPA Reply Frame and the value is
is '0', markers MUST NOT be added to the data stream by the '0', Markers MUST NOT be added to the data stream by the sender.
sender. When '1' markers MUST be added as described in section When '1' Markers MUST be added as described in section 5.1 MPA
5.1 MPA Markers on page 19. Markers on page 20.
C: This bit declares an endpoint's preferred CRC usage. When this C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the "MPA Request Frame" and the "MPA Reply field is '0' in the MPA Request Frame and the MPA Reply Frame,
Frame", CRCs MUST NOT be checked and need not be generated by CRCs MUST NOT be checked and need not be generated by either
either endpoint. When this bit is '1' in either the "MPA Request endpoint. When this bit is '1' in either the MPA Request Frame
Frame" or "MPA Reply Frame", CRCs MUST be generated and checked or MPA Reply Frame, CRCs MUST be generated and checked by both
by both endpoints. Note that even when not in use, the CRC field endpoints. Note that even when not in use, the CRC field remains
remains present in the FPDU. When CRCs are not in use, the CRC present in the FPDU. When CRCs are not in use, the CRC field
field MUST be considered valid for FPDU checking regardless of MUST be considered valid for FPDU checking regardless of its
its contents. contents.
R: This bit is set to zero, and not checked on reception in the "MPA R: This bit is set to zero, and not checked on reception in the MPA
Request Frame". In the "MPA Reply Frame", this bit is the Request Frame. In the MPA Reply Frame, this bit is the Rejected
"Rejected Connection" bit, set by the responders ULP to indicate Connection bit, set by the Responders ULP to indicate acceptance
acceptance '0', or rejection '1', of the connection parameters '0', or rejection '1', of the connection parameters provided in
provided in the "Private Data". the Private Data.
Res: This field is reserved for future use. It MUST be set to zero Res: This field is reserved for future use. It MUST be set to zero
when sending, and not checked on reception. when sending, and not checked on reception.
Rev: This field contains the Revision of MPA. For this version of Rev: This field contains the Revision of MPA. For this version of
the specification senders MUST set this field to one. MPA the specification senders MUST set this field to one. MPA
receivers compliant with this version of the specification MUST receivers compliant with this version of the specification MUST
check this field. If the MPA receiver cannot interoperate with check this field. If the MPA receiver cannot interoperate with
the received version, then it MUST close the connection and the received version, then it MUST close the connection and
report an error locally. Otherwise, the MPA receiver should report an error locally. Otherwise, the MPA receiver should
report the received version to the ULP. report the received version to the ULP.
PD_Length: This field MUST contain the length in Octets of the PD_Length: This field MUST contain the length in Octets of the
Private Data field. A value of zero indicates that there is no Private Data field. A value of zero indicates that there is no
private data field present at all. If the receiver detects that Private Data field present at all. If the receiver detects that
the PD_Length field does not match the length of the "Private the PD_Length field does not match the length of the Private Data
Data" field, or if the length of the "Private Data" field exceeds field, or if the length of the Private Data field exceeds 512
512 octets, the receiver MUST close the connection and report an octets, the receiver MUST close the connection and report an
error locally. Otherwise, the MPA receiver should pass the error locally. Otherwise, the MPA receiver should pass the
PD_Length value and "Private Data" to the ULP. PD_Length value and Private Data to the ULP.
Private Data: This field may contain any value defined by ULPs or may Private Data: This field may contain any value defined by ULPs or may
not be present. The "Private Data" field MUST be between 0 and 512 not be present. The Private Data field MUST be between 0 and 512
octets in length. ULPs define how to size, set, and validate octets in length. ULPs define how to size, set, and validate
this field within these limits. this field within these limits.
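The field rules above can be sketched in code. The following Python is a minimal illustration, not a reference implementation. It assumes that M, C and R are the three most significant bits of the octet at offset 16, followed by a 5-bit Res field, an 8-bit Rev field and a 16-bit PD_Length in network byte order, and that only Revision 1 is supported. The names build_frame, parse_frame and MpaFrameError are hypothetical and do not appear in the specification.

```python
import struct

# Key values and limits taken from the field descriptions above.
MPA_REQ_KEY = b"MPA ID Req Frame"
MPA_REP_KEY = b"MPA ID Rep Frame"
MPA_REVISION = 1
MAX_PRIVATE_DATA = 512

# Assumed bit positions in the octet at offset 16 (M is the first bit sent).
FLAG_M = 0x80
FLAG_C = 0x40
FLAG_R = 0x20


class MpaFrameError(Exception):
    """Hypothetical error: caller closes the connection and reports locally."""


def build_frame(key, markers, crc, rejected=False, private_data=b""):
    """Build an MPA Request or Reply Frame; Res bits are left at zero."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = ((FLAG_M if markers else 0)
             | (FLAG_C if crc else 0)
             | (FLAG_R if rejected else 0))
    header = struct.pack("!16sBBH", key, flags, MPA_REVISION,
                         len(private_data))
    return header + private_data


def parse_frame(data, expected_key):
    """Validate one complete startup frame and return its fields."""
    if len(data) < 20:
        raise MpaFrameError("frame shorter than the 20-octet fixed part")
    key, flags, rev, pd_length = struct.unpack("!16sBBH", data[:20])
    if key != expected_key:
        raise MpaFrameError("Key mismatch")
    if rev != MPA_REVISION:
        raise MpaFrameError("unsupported Revision")
    private_data = data[20:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise MpaFrameError("PD_Length invalid")
    # Res bits are not checked on reception.
    return {
        "markers": bool(flags & FLAG_M),
        "crc": bool(flags & FLAG_C),
        "rejected": bool(flags & FLAG_R),
        "revision": rev,
        "private_data": private_data,
    }
```

A Responder in this sketch would call parse_frame(data, MPA_REQ_KEY) and an Initiator would call parse_frame(data, MPA_REP_KEY). A real receiver reading from a TCP stream would first read the 20-octet fixed part and then PD_Length further octets.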
6.1.2 Connection Startup Rules 6.1.2 Connection Startup Rules
The following rules apply to MPA connection startup phase: The following rules apply to MPA connection Startup Phase:
1. When MPA is started in the "Initiator" mode, the MPA 1. When MPA is started in the Initiator mode, the MPA implementation
implementation MUST send a valid "MPA Request Frame". The "MPA MUST send a valid MPA Request Frame. The MPA Request Frame MAY
Request Frame" MAY include ULP supplied "Private Data". include ULP supplied Private Data.
2. When MPA is started in the "Responder" mode, the MPA 2. When MPA is started in the Responder mode, the MPA implementation
implementation MUST wait until an "MPA Request Frame" is received MUST wait until an MPA Request Frame is received and validated
and validated before entering full MPA/DDP operation. before entering full MPA/DDP operation.
If the "MPA Request Frame" is improperly formatted, the If the MPA Request Frame is improperly formatted, the
implementation MUST close the TCP connection and exit MPA. implementation MUST close the TCP connection and exit MPA.
If the "MPA Request Frame" is properly formatted but the "Private If the MPA Request Frame is properly formatted but the Private
Data" is not acceptable, the implementation SHOULD return an "MPA Data is not acceptable, the implementation SHOULD return an MPA
Reply Frame" with the "Rejected Connection" bit set to '1'; the Reply Frame with the Rejected Connection bit set to '1'; the MPA
"MPA Reply Frame" MAY include ULP supplied "Private Data"; the Reply Frame MAY include ULP supplied Private Data; the
implementation MUST exit MPA, leaving the TCP connection open. implementation MUST exit MPA, leaving the TCP connection open.
The ULP may close TCP or use the connection for other purposes. The ULP may close TCP or use the connection for other purposes.
If the "MPA Request Frame" is properly formatted and the "Private If the MPA Request Frame is properly formatted and the Private
Data" is acceptable, the implementation SHOULD return an "MPA Data is acceptable, the implementation SHOULD return an MPA Reply
Reply Frame" with the "Rejected Connection" bit set to '0'; the Frame with the Rejected Connection bit set to '0'; the MPA Reply
"MPA Reply Frame" MAY include ULP supplied "Private Data"; and Frame MAY include ULP supplied Private Data; and the Responder
the responder SHOULD prepare to interpret any data received as SHOULD prepare to interpret any data received as FPDUs and pass
FPDUs and pass any received ULPDUs to DDP. any received ULPDUs to DDP.
Note: Since the receiver's ability to deal with markers is Note: Since the receiver's ability to deal with Markers is
unknown until the Request and Reply frames have been unknown until the Request and Reply frames have been
received, sending FPDUs before this occurs is not possible. received, sending FPDUs before this occurs is not possible.
Note: The requirement to wait on a Request Frame before sending a Note: The requirement to wait on a Request Frame before sending a
Reply frame is a design choice; it makes for a well ordered Reply frame is a design choice; it makes for a well ordered
sequence of events at each end, and avoids having to specify sequence of events at each end, and avoids having to specify
how to deal with situations where both ends start at the same how to deal with situations where both ends start at the same
time. time.
3. MPA "Initiator" mode implementations MUST receive and validate a 3. MPA Initiator mode implementations MUST receive and validate a
"MPA Reply Frame". MPA Reply Frame.
If the "MPA Reply Frame" is improperly formatted, the If the MPA Reply Frame is improperly formatted, the
implementation MUST close the TCP connection and exit MPA. implementation MUST close the TCP connection and exit MPA.
If the "MPA Reply Frame" is properly formatted but the If the MPA Reply Frame is properly formatted but the Private
"Private Data" is not acceptable, or if the "Rejected Connection" Data is not acceptable, or if the Rejected Connection bit is set to
bit is set to '1', the implementation MUST exit MPA, leaving the TCP '1', the implementation MUST exit MPA, leaving the TCP connection
connection open. The ULP may close TCP or use the connection for open. The ULP may close TCP or use the connection for other
other purposes. purposes.
If the "MPA Reply Frame" is properly formatted and the "Private If the MPA Reply Frame is properly formatted and the Private Data
Data" is acceptable, and the "Rejected Connection" bit is set to is acceptable, and the Rejected Connection bit is set to '0', the
'0', the implementation SHOULD enter full MPA/DDP operation mode; implementation SHOULD enter full MPA/DDP operation mode;
interpreting any received data as FPDUs and sending DDP ULPDUs as interpreting any received data as FPDUs and sending DDP ULPDUs as
FPDUs. FPDUs.
4. MPA "Responder" mode implementations MUST receive and validate at 4. MPA Responder mode implementations MUST receive and validate at
least one FPDU before sending any FPDUs or markers. least one FPDU before sending any FPDUs or Markers.
Note: this requirement is present to allow the Initiator time to Note: this requirement is present to allow the Initiator time to
get its receiver into full operation before an FPDU arrives, get its receiver into Full Operation before an FPDU arrives,
avoiding potential race conditions at the initiator. This avoiding potential race conditions at the Initiator. This
was also subject to some debate in the work group before was also subject to some debate in the work group before
rough consensus was reached. Eliminating this requirement rough consensus was reached. Eliminating this requirement
would allow faster startup in some types of applications. would allow faster startup in some types of applications.
However, that would also make certain implementations However, that would also make certain implementations
(particularly "Dual Stack") much harder. (particularly "dual stack") much harder.
5. If a received "Key" does not match the expected value, (See 6.1.1 5. If a received "Key" does not match the expected value, (See 6.1.1
MPA Request and Reply Frame Format below) the TCP/DDP connection MPA Request and Reply Frame Format above) the TCP/DDP connection
MUST be closed, and an error returned to the ULP. MUST be closed, and an error returned to the ULP.
6. The received "Private Data" fields may be used by consumers at 6. The received Private Data fields may be used by Consumers at
either end to further validate the connection, and set up DDP or either end to further validate the connection, and set up DDP or
other ULP parameters. The Initiator ULP MAY close the other ULP parameters. The Initiator ULP MAY close the
TCP/MPA/DDP connection as a result of validating the "Private TCP/MPA/DDP connection as a result of validating the Private Data
Data" fields. The Responder SHOULD return an "MPA Reply Frame" fields. The Responder SHOULD return an MPA Reply Frame with the
with the "Rejected Connection" bit set to '1' if the validation of "Rejected Connection" bit set to '1' if the validation of the
the "Private Data" is not acceptable to the ULP. Private Data is not acceptable to the ULP.
7. When the first FPDU is to be sent, then if markers are enabled, 7. When the first FPDU is to be sent, then if Markers are enabled,
the first octets sent are the special marker 0x00000000, followed the first octets sent are the special Marker 0x00000000, followed
by the start of the FPDU (the FPDU's "ULPDU Length" field). If by the start of the FPDU (the FPDU's ULPDU Length field). If
markers are not enabled, the first octets sent are the start of Markers are not enabled, the first octets sent are the start of
the FPDU (the FPDU's "ULPDU Length" field). the FPDU (the FPDU's ULPDU Length field).
8. MPA implementations MUST use the difference between the "MPA 8. MPA implementations MUST use the difference between the MPA
Request Frame" and the "MPA Reply Frame" to check for incorrect Request Frame and the MPA Reply Frame to check for incorrect
"Initiator/Initiator" startups. Implementations SHOULD put a "Initiator/Initiator" startups. Implementations SHOULD put a
timeout on waiting for the "MPA Request Frame" when started in timeout on waiting for the MPA Request Frame when started in
"Responder" mode, to detect incorrect "Responder/Responder" Responder mode, to detect incorrect "Responder/Responder"
startups. startups.
9. MPA implementations MUST validate the PD_Length field. The 9. MPA implementations MUST validate the PD_Length field. The
buffer that receives the "Private Data" field MUST be large buffer that receives the Private Data field MUST be large enough
enough to receive that data; the amount of "Private Data" MUST to receive that data; the amount of Private Data MUST NOT exceed
NOT exceed the PD_Length, or the application buffer. If any of the PD_Length, or the application buffer. If any of the above
the above fails, the startup frame MUST be considered improperly fails, the startup frame MUST be considered improperly formatted.
formatted.
10. MPA implementations SHOULD implement a reasonable timeout while 10. MPA implementations SHOULD implement a reasonable timeout while
waiting for the entire set of startup frames; this prevents certain waiting for the entire set of startup frames; this prevents certain
denial of service attacks. ULPs SHOULD implement a reasonable denial of service attacks. ULPs SHOULD implement a reasonable
timeout while waiting for FPDUs, ULPDUs and application level timeout while waiting for FPDUs, ULPDUs and application level
messages to guard against application failures and certain denial messages to guard against application failures and certain denial
of service attacks. of service attacks.
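Rules 2, 3 and 7 above can be summarised as small decision functions. The Python below is a sketch under stated assumptions. The names Action, responder_on_request, initiator_on_reply and first_octets are hypothetical, and the two boolean inputs stand for the outcome of frame validation and of the ULP's Private Data check, neither of which is modelled here.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes named in rules 2 and 3."""
    CLOSE_TCP = "close the TCP connection and exit MPA"
    EXIT_MPA_KEEP_TCP = "exit MPA, leaving the TCP connection open"
    ACCEPT = "proceed toward full MPA/DDP operation"


def responder_on_request(well_formed, private_data_ok):
    """Rule 2: return (R bit for the MPA Reply Frame, or None, and Action)."""
    if not well_formed:
        return None, Action.CLOSE_TCP          # no reply is sent
    if not private_data_ok:
        return 1, Action.EXIT_MPA_KEEP_TCP     # Rejected Connection bit set
    # Accepted; under rule 4 the Responder still waits for a first FPDU
    # before it sends any FPDUs or Markers.
    return 0, Action.ACCEPT


def initiator_on_reply(well_formed, private_data_ok, rejected_bit):
    """Rule 3: decide what the Initiator does with an MPA Reply Frame."""
    if not well_formed:
        return Action.CLOSE_TCP
    if rejected_bit or not private_data_ok:
        return Action.EXIT_MPA_KEEP_TCP
    return Action.ACCEPT


# Rule 7: the special Marker 0x00000000 precedes the first FPDU
# when Markers are enabled.
SPECIAL_MARKER = bytes(4)


def first_octets(fpdu, markers_enabled):
    """Return the first octets placed on the wire for the first FPDU."""
    return SPECIAL_MARKER + fpdu if markers_enabled else fpdu
```

Keeping the decisions separate from socket handling makes it easy to check each branch of the rules on its own.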
6.1.3 Example Delayed Startup sequence 6.1.3 Example Delayed Startup sequence
A variety of startup sequences are possible when using MPA on TCP. A variety of startup sequences are possible when using MPA on TCP.
Following is an example of an MPA/DDP startup that occurs after TCP Following is an example of an MPA/DDP startup that occurs after TCP
has been running for a while and has exchanged some amount of has been running for a while and has exchanged some amount of
streaming data. This example does not use any private data (an streaming data. This example does not use any Private Data (an
example that does is shown later in 6.1.4.2 Example Immediate Startup example that does is shown later in 6.1.4.2 Example Immediate Startup
using Private Data on page 42), although it is perfectly legal to using Private Data on page 42), although it is perfectly legal to
include the private data. Note that since the example does not use include the Private Data. Note that since the example does not use
any Private Data, there are no ULP interactions shown between any Private Data, there are no ULP interactions shown between
receiving "Startup frames" and putting MPA into "Full operation". receiving "Startup frames" and putting MPA into Full Operation.
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|ULP streaming mode | |ULP streaming mode |
| <Hello> request to | | <Hello> request to |
| transition to DDP/MPA | +--------------------------+ | transition to DDP/MPA | +--------------------------+
| mode (optional) | --------> |ULP gets request; | | mode (optional) | --------> |ULP gets request; |
+---------------------------+ |enables MPA Responder mode| +---------------------------+ |enables MPA Responder mode|
|with last (optional) | |with last (optional) |
skipping to change at page 40, line 13 skipping to change at page 40, line 13
Figure 8: Example Delayed Startup negotiation Figure 8: Example Delayed Startup negotiation
An example Delayed Startup sequence is described below: An example Delayed Startup sequence is described below:
* Active and passive sides start up a TCP connection in the * Active and passive sides start up a TCP connection in the
usual fashion, probably using sockets APIs. They exchange usual fashion, probably using sockets APIs. They exchange
some amount of streaming mode data. At some point one side some amount of streaming mode data. At some point one side
(the MPA Initiator) sends streaming mode data that (the MPA Initiator) sends streaming mode data that
effectively says "Hello, Let's go into MPA/DDP mode." effectively says "Hello, Let's go into MPA/DDP mode."
* When the remote side (the MPA Responder) gets this streaming mode * When the remote side (the MPA Responder) gets this streaming mode
message, the consumer would send a last streaming mode message message, the Consumer would send a last streaming mode message
that effectively says "I Acknowledge your Hello, and am now in that effectively says "I Acknowledge your Hello, and am now in
MPA Responder Mode". The exchange of these messages establishes MPA Responder Mode". The exchange of these messages establishes
the exact point in the TCP stream where MPA is enabled. The the exact point in the TCP stream where MPA is enabled. The
Responding Consumer enables MPA in the Responder mode and waits Responding Consumer enables MPA in the Responder mode and waits
for the initial MPA startup message. for the initial MPA startup message.
* The Initiating Consumer would enable MPA startup in the * The Initiating Consumer would enable MPA startup in the
Initiator mode which then sends the "MPA Request Frame". It Initiator mode which then sends the MPA Request Frame. It is
is assumed that no "Private Data" messages are needed for assumed that no Private Data messages are needed for this
this example, although it is possible to do so. The example, although it is possible to do so. The Initiating
Initiating MPA (and Consumer) would also wait for the MPA MPA (and Consumer) would also wait for the MPA connection to
connection to be accepted. be accepted.
* The Responding MPA would receive the initial "MPA Request Frame" * The Responding MPA would receive the initial MPA Request Frame
and would inform the consumer that this message arrived. The and would inform the Consumer that this message arrived. The
Consumer can then accept the MPA/DDP connection or close the TCP Consumer can then accept the MPA/DDP connection or close the TCP
connection. connection.
* To accept the connection request, the Responding Consumer would * To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into full operation. In the endpoint, thus enabling MPA/DDP into Full Operation. In the
process of going to full operation, MPA sends the "MPA Reply process of going to Full Operation, MPA sends the MPA Reply
Frame". MPA/DDP waits for the first incoming FPDU before sending Frame. MPA/DDP waits for the first incoming FPDU before sending
any FPDUs. any FPDUs.
* If the initial TCP data was not a properly formatted "MPA Request * If the initial TCP data was not a properly formatted MPA Request
Frame", MPA will close or reset the TCP connection immediately. Frame, MPA will close or reset the TCP connection immediately.
* The Initiating MPA would receive the "MPA Reply Frame" and * The Initiating MPA would receive the MPA Reply Frame and
would report this message to the Consumer. The Consumer can would report this message to the Consumer. The Consumer can
then accept the MPA/DDP connection, or close or reset the TCP then accept the MPA/DDP connection, or close or reset the TCP
connection to abort the process. connection to abort the process.
* On determining that the Connection is acceptable, the * On determining that the Connection is acceptable, the
Initiating Consumer would use an appropriate API to bind the Initiating Consumer would use an appropriate API to bind the
TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
into full operation. MPA/DDP would begin sending DDP into Full Operation. MPA/DDP would begin sending DDP
messages as MPA FPDUs. messages as MPA FPDUs.
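The Delayed Startup sequence described above can be modelled as a small state machine for each endpoint. This is an illustrative Python sketch only. The state and event names are hypothetical and are not defined by this specification, and timeouts and Private Data handling are left out.

```python
from enum import Enum, auto


class State(Enum):
    STREAMING = auto()        # TCP streaming mode, MPA not yet enabled
    WAIT_REQUEST = auto()     # Responder: waiting for the MPA Request Frame
    WAIT_REPLY = auto()       # Initiator: Request sent, waiting for Reply
    WAIT_FIRST_FPDU = auto()  # Responder: Reply sent, receive-only
    FULL_OPERATION = auto()   # FPDUs may be sent and received
    CLOSED = auto()           # TCP connection closed or reset


# (current state, event) -> next state
TRANSITIONS = {
    (State.STREAMING, "enable_responder"): State.WAIT_REQUEST,
    (State.STREAMING, "enable_initiator"): State.WAIT_REPLY,
    (State.WAIT_REQUEST, "request_accepted"): State.WAIT_FIRST_FPDU,
    (State.WAIT_REQUEST, "bad_frame"): State.CLOSED,
    (State.WAIT_REPLY, "reply_accepted"): State.FULL_OPERATION,
    (State.WAIT_REPLY, "bad_frame"): State.CLOSED,
    (State.WAIT_FIRST_FPDU, "fpdu_received"): State.FULL_OPERATION,
}


def step(state, event):
    """Apply one event; events not allowed in this state raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in {state.name}")


def run(events, start=State.STREAMING):
    """Apply a list of events in order and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


def may_send_fpdu(state):
    """Only an endpoint in full operation may send FPDUs."""
    return state is State.FULL_OPERATION
```

In this model the Responder passes through WAIT_FIRST_FPDU, which reflects the requirement that it receive at least one FPDU before sending any, while the Initiator moves directly from WAIT_REPLY to FULL_OPERATION.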
6.1.4 Use of "Private Data" 6.1.4 Use of Private Data
This section is advisory in nature, in that it suggests a method by which This section is advisory in nature, in that it suggests a method by which
a ULP can deal with pre-DDP connection information exchange. a ULP can deal with pre-DDP connection information exchange.
6.1.4.1 Motivation 6.1.4.1 Motivation
Prior RDMA protocols have been developed that provide "private data" Prior RDMA protocols have been developed that provide Private Data
via out of band mechanisms. As a result, many applications now via out of band mechanisms. As a result, many applications now
expect some form of "private data" to be available for application expect some form of Private Data to be available for application use
use prior to setting up the DDP/RDMA connection. For example, prior to setting up the DDP/RDMA connection. Following are some
examples of the use of Private Data.
An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
and the [Verbs]) must be associated with a Protection Domain. No and the [VERBS]) must be associated with a Protection Domain. No
receive operations may be posted to the endpoint before it is receive operations may be posted to the endpoint before it is
associated with a Protection Domain. Indeed under both the associated with a Protection Domain. Indeed under both the
InfiniBand and proposed iWARP verbs [Verbs] an endpoint/QP is created InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is
within a Protection Domain. created within a Protection Domain.
There are some applications where the choice of Protection Domain is There are some applications where the choice of Protection Domain is
dependent upon the identity of the remote ULP client. For example, if dependent upon the identity of the remote ULP client. For example,
a user session requires multiple connections, it is highly desirable if a user session requires multiple connections, it is highly
for all of those connections to use a single Protection Domain. desirable for all of those connections to use a single Protection
Domain. Note: use of Protection Domains is further discussed in
[RDMASEC].
InfiniBand, the DAT APIs and the IT-API all provide for the active InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for
side ULP to provide "Private Data" when requesting a connection. This the active side ULP to provide Private Data when requesting a
data is passed to the ULP to allow it to determine whether to accept connection. This data is passed to the ULP to allow it to determine
the connection, and if so with which endpoint (and implicitly which whether to accept the connection, and if so with which endpoint (and
Protection Domain). implicitly which Protection Domain).
The Private Data can also be used to ensure that both ends of the The Private Data can also be used to ensure that both ends of the
connection have configured their RDMA endpoints compatibly on such connection have configured their RDMA endpoints compatibly on such
matters as the RDMA Read capacity. Further ULP-specific uses are also matters as the RDMA Read capacity (see [RDMAP]). Further ULP-
presumed, such as establishing the identity of the client. specific uses are also presumed, such as establishing the identity of
the client.
Private Data is also allowed for when accepting the connection, to Private Data is also allowed for when accepting the connection, to
allow completion of any negotiation on RDMA resources and for other allow completion of any negotiation on RDMA resources and for other
ULP reasons. ULP reasons.
There are several potential ways to exchange this "Private Data". There are several potential ways to exchange this Private Data. For
For Example, the InfiniBand specification includes a connection example, the InfiniBand specification includes a connection
management protocol that allows a small amount of "private data" to management protocol that allows a small amount of Private Data to be
be exchanged using datagrams before actually starting the RDMA exchanged using datagrams before actually starting the RDMA
connection. connection.
This draft allows for small amounts of "Private Data" to be exchanged This draft allows for small amounts of Private Data to be exchanged
as part of the MPA startup sequence. The actual Private Data fields as part of the MPA startup sequence. The actual Private Data fields
are carried in the "MPA Request Frame", and the "MPA Reply Frame". are carried in the MPA Request Frame, and the MPA Reply Frame.
If larger amounts of private data or more negotiation is necessary, If larger amounts of Private Data or more negotiation is necessary,
TCP streaming mode messages may be exchanged prior to enabling MPA. TCP streaming mode messages may be exchanged prior to enabling MPA.
6.1.4.2 Example Immediate Startup using Private Data 6.1.4.2 Example Immediate Startup using Private Data
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|TCP SYN sent | +--------------------------+ |TCP SYN sent | +--------------------------+
+---------------------------+ --------> |TCP gets SYN packet; | +---------------------------+ --------> |TCP gets SYN packet; |
+---------------------------+ | Sends SYN-Ack | +---------------------------+ | Sends SYN-Ack |
|TCP gets SYN-Ack | <-------- +--------------------------+ |TCP gets SYN-Ack | <-------- +--------------------------+
| Sends Ack | | Sends Ack |
+---------------------------+ --------> +--------------------------+ +---------------------------+ --------> +--------------------------+
+---------------------------+ |Consumer enables MPA | +---------------------------+ |Consumer enables MPA |
|Consumer enables MPA | |Responder Mode, waits for | |Consumer enables MPA | |Responder Mode, waits for |
|Initiator mode with | | <MPA Request frame> | |Initiator mode with | | <MPA Request frame> |
|"Private Data"; MPA sends | +--------------------------+ |Private Data; MPA sends | +--------------------------+
| <MPA Request Frame>; | | <MPA Request Frame>; |
|MPA waits for incoming | +--------------------------+ |MPA waits for incoming | +--------------------------+
| <MPA Reply Frame> | - - - - > |MPA receives | | <MPA Reply Frame> | - - - - > |MPA receives |
+---------------------------+ | <MPA Request Frame> | +---------------------------+ | <MPA Request Frame> |
|Consumer examines "Private| |Consumer examines Private |
|Data", provides MPA with | |Data, provides MPA with |
|return "Private Data", | |return Private Data, |
|binds DDP to MPA, and | |binds DDP to MPA, and |
|enables MPA to send an | |enables MPA to send an |
| <MPA Reply Frame>. | | <MPA Reply Frame>. |
|DDP/MPA enables FPDU | |DDP/MPA enables FPDU |
+---------------------------+ |decoding, but does not | +---------------------------+ |decoding, but does not |
|MPA receives the | < - - - - |send any FPDUs. | |MPA receives the | < - - - - |send any FPDUs. |
| <MPA Reply Frame> | +--------------------------+ | <MPA Reply Frame> | +--------------------------+
|Consumer examines "Private | |Consumer examines Private |
|Data", binds DDP to MPA, | |Data, binds DDP to MPA, |
|and enables DDP/MPA to | |and enables DDP/MPA to |
|begin full operation. | |begin Full Operation. |
|MPA sends first FPDU (as | +--------------------------+ |MPA sends first FPDU (as | +--------------------------+
|DDP ULPDUs become | ========> |MPA Receives first FPDU. | |DDP ULPDUs become | ========> |MPA Receives first FPDU. |
|available). | |MPA sends first FPDU (as | |available). | |MPA sends first FPDU (as |
+---------------------------+ |DDP ULPDUs become | +---------------------------+ |DDP ULPDUs become |
<====== |available). | <====== |available). |
+--------------------------+ +--------------------------+
Figure 9: Example Immediate Startup negotiation Figure 9: Example Immediate Startup negotiation
Note: the exact order of when MPA is started in the TCP connection Note: the exact order of when MPA is started in the TCP connection
sequence is implementation dependent; the above diagram shows one sequence is implementation dependent; the above diagram shows one
possible sequence. Also, the Initiator "Ack" to the Responder's possible sequence. Also, the Initiator "Ack" to the Responder's
"SYN-Ack" may be combined into the same TCP segment containing "SYN-Ack" may be combined into the same TCP segment containing
the "MPA Request Frame" (as is allowed by TCP RFCs). the MPA Request Frame (as is allowed by TCP RFCs).
The example immediate startup sequence is described below: The example immediate startup sequence is described below:
* The passive side (Responding Consumer) would listen on the TCP * The passive side (Responding Consumer) would listen on the TCP
destination port, to indicate its readiness to accept a destination port, to indicate its readiness to accept a
connection. connection.
* The active side (Initiating Consumer) would request a * The active side (Initiating Consumer) would request a
connection from a TCP endpoint (that expected to upgrade to connection from a TCP endpoint (that expected to upgrade to
MPA/DDP/RDMA and expected the private data) to a destination MPA/DDP/RDMA and expected the Private Data) to a destination
address and port. address and port.
* The Initiating Consumer would initiate a TCP connection to * The Initiating Consumer would initiate a TCP connection to
the destination port. Acceptance/rejection of the connection the destination port. Acceptance/rejection of the connection
would proceed as per normal TCP connection establishment. would proceed as per normal TCP connection establishment.
* The passive side (Responding Consumer) would receive the TCP * The passive side (Responding Consumer) would receive the TCP
connection request as usual allowing normal TCP gatekeepers, such connection request as usual allowing normal TCP gatekeepers, such
as INETD and TCPserver, to exercise their normal as INETD and TCPserver, to exercise their normal
safeguard/logging functions. On acceptance of the TCP safeguard/logging functions. On acceptance of the TCP
connection, the Responding consumer would enable MPA in the connection, the Responding Consumer would enable MPA in the
Responder mode and wait for the initial MPA startup message. Responder mode and wait for the initial MPA startup message.
* The Initiating Consumer would enable MPA startup in the * The Initiating Consumer would enable MPA startup in the
Initiator mode to send an initial "MPA Request Frame" with Initiator mode to send an initial MPA Request Frame with its
its included "Private Data" message. The Initiating included Private Data message. The Initiating MPA
MPA (and Consumer) would also wait for the MPA connection to (and Consumer) would also wait for the MPA connection to be
be accepted, and any returned private data. accepted, and any returned Private Data.
* The Responding MPA would receive the initial "MPA Request Frame" * The Responding MPA would receive the initial MPA Request Frame
with the "Private Data" message and would pass the Private Data with the Private Data message and would pass the Private Data
through to the consumer. The Consumer can then accept the through to the Consumer. The Consumer can then accept the
MPA/DDP connection, close the TCP connection, or reject the MPA MPA/DDP connection, close the TCP connection, or reject the MPA
connection with a return message. connection with a return message.
* To accept the connection request, the Responding Consumer would * To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into full operation. In the endpoint, thus enabling MPA/DDP into Full Operation. In the
process of going to full operation, MPA sends the "MPA Reply process of going to Full Operation, MPA sends the MPA Reply Frame
Frame" which includes the Consumer supplied "Private Data" which includes the Consumer supplied Private Data containing any
containing any appropriate consumer response. MPA/DDP waits for appropriate Consumer response. MPA/DDP waits for the first
the first incoming FPDU before sending any FPDUs. incoming FPDU before sending any FPDUs.
* If the initial TCP data was not a properly formatted "MPA Request * If the initial TCP data was not a properly formatted MPA Request
Frame", MPA will close or reset the TCP connection immediately. Frame, MPA will close or reset the TCP connection immediately.
* To reject the MPA connection request, the Responding Consumer * To reject the MPA connection request, the Responding Consumer
would send an "MPA Reply Frame" with any ULP supplied "Private would send an MPA Reply Frame with any ULP supplied Private Data
Data" (with reason for rejection), with the "Rejected Connection" (with reason for rejection), with the "Rejected Connection" bit
bit set to '1', and may close the TCP connection. set to '1', and may close the TCP connection.
* The Initiating MPA would receive the "MPA Reply Frame" with * The Initiating MPA would receive the MPA Reply Frame with the
the "Private Data" message and would report this message to Private Data message and would report this message to the
the Consumer, including the supplied Private Data. Consumer, including the supplied Private Data.
If the "Rejected Connection" bit is set to a '1', MPA will If the "Rejected Connection" bit is set to a '1', MPA will
close the TCP connection and exit. close the TCP connection and exit.
If the "Rejected Connection" bit is set to a '0', and on If the "Rejected Connection" bit is set to a '0', and on
determining from the "MPA Reply Frame" "Private Data" that determining from the MPA Reply Frame Private Data that the
the Connection is acceptable, the Initiating Consumer would Connection is acceptable, the Initiating Consumer would use
use an appropriate API to bind the TCP/MPA connections to a an appropriate API to bind the TCP/MPA connections to a DDP
DDP endpoint thus enabling MPA/DDP into full operation. endpoint thus enabling MPA/DDP into Full Operation. MPA/DDP
MPA/DDP would begin sending DDP messages as MPA FPDUs. would begin sending DDP messages as MPA FPDUs.
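The Initiator's handling of the MPA Reply Frame described in the last step above can be sketched as follows. This is only an illustrative model, not part of the specification: the MpaReply class, the Outcome enum, and the consumer_accepts callback are hypothetical names, the frame is assumed to be already parsed (no wire layout is modeled), and the behavior when the Consumer finds the Private Data unacceptable is an assumption, since the sequence above does not state it.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FULL_OPERATION = "full operation"
    CLOSED = "tcp closed"


@dataclass
class MpaReply:
    # Hypothetical, already-parsed view of an MPA Reply Frame.
    rejected: bool        # the "Rejected Connection" bit
    private_data: bytes   # Consumer-supplied Private Data


def initiator_handle_reply(reply, consumer_accepts):
    """Return (outcome, private_data_reported_to_consumer)."""
    # The Private Data is reported to the Consumer in every case.
    reported = reply.private_data
    if reply.rejected:
        # Bit set to '1': MPA closes the TCP connection and exits.
        return Outcome.CLOSED, reported
    if consumer_accepts(reported):
        # Bit set to '0' and the Consumer is satisfied: bind the
        # TCP/MPA connection to a DDP endpoint (Full Operation).
        return Outcome.FULL_OPERATION, reported
    # Assumption: an unsatisfied Consumer closes the connection.
    return Outcome.CLOSED, reported
```

A caller would pass a function that inspects the returned Private Data, for example initiator_handle_reply(reply, lambda pd: pd.startswith(b"OK")).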
6.1.5 "Dual Stack" implementations 6.1.5 "Dual stack" implementations
MPA/DDP implementations are commonly expected to be implemented as MPA/DDP implementations are commonly expected to be implemented as
part of a "Dual stack" architecture. One "stack" is the traditional part of a "dual stack" architecture. One "stack" is the traditional
TCP stack, usually with a sockets interface API. The second stack is TCP stack, usually with a sockets interface API (Application
the MPA/DDP "stack" with its own API, and potentially separate code Programming Interface). The second stack is the MPA/DDP "stack" with
or hardware to deal with the MPA/DDP data. Of course, its own API, and potentially separate code or hardware to deal with
implementations may vary, so the following comments are of an the MPA/DDP data. Of course, implementations may vary, so the
advisory nature only. following comments are of an advisory nature only.
The use of the two "stacks" offers advantages: The use of the two "stacks" offers advantages:
TCP connection setup is usually done with the TCP stack. This TCP connection setup is usually done with the TCP stack. This
allows use of the usual naming and addressing mechanisms. It allows use of the usual naming and addressing mechanisms. It
also means that any mechanisms used to "harden" the connection also means that any mechanisms used to "harden" the connection
setup against security threats are also used when starting setup against security threats are also used when starting
MPA/DDP. MPA/DDP.
Some applications may have been originally designed for TCP, but Some applications may have been originally designed for TCP, but
are "enhanced" to utilize MPA/DDP after a negotiation reveals are "enhanced" to utilize MPA/DDP after a negotiation reveals
the capability to do so. The negotiation process takes place in the capability to do so. The negotiation process takes place in
TCP's streaming mode, using the usual TCP APIs. TCP's streaming mode, using the usual TCP APIs.
Some new applications, designed for RDMA or DDP, still need to Some new applications, designed for RDMA or DDP, still need to
exchange some data prior to starting MPA/DDP. This exchange can exchange some data prior to starting MPA/DDP. This exchange can
be of arbitrary length or complexity, but often consists of only be of arbitrary length or complexity, but often consists of only
a small amount of "private data", perhaps only a single message. a small amount of Private Data, perhaps only a single message.
Using the TCP streaming mode for this exchange allows this to be Using the TCP streaming mode for this exchange allows this to be
done using well understood methods. done using well understood methods.
The main disadvantage of using two stacks is the conversion of an The main disadvantage of using two stacks is the conversion of an
active TCP connection between them. This process must be done with active TCP connection between them. This process must be done with
care to prevent loss of data. care to prevent loss of data.
To avoid some of the problems when using a "dual stack" architecture To avoid some of the problems when using a "dual stack" architecture
the following additional restrictions may be required by the the following additional restrictions may be required by the
implementation: implementation:
1. Enabling the DDP/MPA stack SHOULD be done only when no incoming 1. Enabling the DDP/MPA stack SHOULD be done only when no incoming
stream data is expected. This is typically managed by the ULP stream data is expected. This is typically managed by the ULP
protocol. When following the recommended startup sequence, the protocol. When following the recommended startup sequence, the
"Responder" side enters DDP/MPA mode, sends the last streaming Responder side enters DDP/MPA mode, sends the last streaming mode
mode data, and then waits for the "MPA Request frame". No data, and then waits for the MPA Request Frame. No additional
additional streaming mode data is expected. The "Initiator" side streaming mode data is expected. The Initiator side ULP receives
ULP receives the last streaming mode data, and then enters the last streaming mode data, and then enters DDP/MPA mode.
DDP/MPA mode. Again, no additional streaming mode data is Again, no additional streaming mode data is expected.
expected.
2. The DDP/MPA MAY provide the ability to send a "Last streaming 2. The DDP/MPA MAY provide the ability to send a "last streaming
message" as part of its "Responder" DDP/MPA enable function. message" as part of its Responder DDP/MPA enable function. This
This allows the DDP/MPA stack to more easily manage the allows the DDP/MPA stack to more easily manage the conversion to
conversion to DDP/MPA mode (and avoid problems with a very fast DDP/MPA mode (and avoid problems with a very fast return of the
return of the "MPA Request Frame" from the Initiator side). MPA Request Frame from the Initiator side).
Note: Regardless of the "stack" architecture used, TCP's rules MUST Note: Regardless of the "stack" architecture used, TCP's rules MUST
be followed. For example, if network data is lost, re-segmented be followed. For example, if network data is lost, re-segmented
or re-ordered, TCP MUST recover appropriately even when this or re-ordered, TCP MUST recover appropriately even when this
occurs while switching stacks. occurs while switching stacks.
6.2 Normal Connection Teardown 6.2 Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the Each half connection of MPA terminates when DDP closes the
corresponding TCP half connection. corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
that a graceful close of the LLP connection has been received by the that a graceful close of the LLP connection has been received by the
LLP (e.g. FIN is received). LLP (e.g. FIN is received).
7 Error Semantics 7 Error Semantics
The following errors MUST be detected by MPA and the codes SHOULD be The following errors MUST be detected by MPA and the codes SHOULD be
provided to DDP or other consumer: provided to DDP or other Consumer:
Code Error Code Error
1 TCP connection closed, terminated or lost. This includes lost 1 TCP connection closed, terminated or lost. This includes lost
by timeout, too many retries, RST received or FIN received. by timeout, too many retries, RST received or FIN received.
2 Received MPA CRC does not match the calculated value for the 2 Received MPA CRC does not match the calculated value for the
FPDU. FPDU.
3 In the event that the CRC is valid, received MPA marker (if 3 In the event that the CRC is valid, received MPA Marker (if
enabled) and "ULPDU Length" fields do not agree on the start enabled) and ULPDU Length fields do not agree on the start of
of a FPDU. If the FPDU start determined from previous "ULPDU a FPDU. If the FPDU start determined from previous ULPDU
Length" fields does not match with the MPA marker position, Length fields does not match with the MPA Marker position, MPA
MPA SHOULD deliver an error to DDP. It may not be possible to SHOULD deliver an error to DDP. It may not be possible to
make this check as a segment arrives, but the check SHOULD be make this check as a segment arrives, but the check SHOULD be
made when a gap creating an out of order sequence is closed made when a gap creating an out of order sequence is closed
and any time a marker points to an already identified FPDU. and any time a Marker points to an already identified FPDU.
It is OPTIONAL for a receiver to check each marker, if It is OPTIONAL for a receiver to check each Marker, if
multiple markers are present in an FPDU, or if the segment is multiple Markers are present in an FPDU, or if the segment is
received in order. received in order.
4 Invalid MPA Request Frame or MPA Response Frame received. In 4 Invalid MPA Request Frame or MPA Response Frame received. In
this case, the TCP connection MUST be immediately closed. DDP this case, the TCP connection MUST be immediately closed. DDP
and other ULPs should treat this similar to code 1, above. and other ULPs should treat this similar to code 1, above.
When conditions 2 or 3 above are detected, an MPA-aware TCP When conditions 2 or 3 above are detected, an MPA-aware TCP
implementation MAY choose to silently drop the TCP segment rather implementation MAY choose to silently drop the TCP segment rather
than reporting the error to DDP. In this case, the sending TCP will than reporting the error to DDP. In this case, the sending TCP will
retry the segment, usually correcting the error, unless the problem retry the segment, usually correcting the error, unless the problem
was at the source. In that case, the source will usually exceed the was at the source. In that case, the source will usually exceed the
number of retries and terminate the connection. number of retries and terminate the connection.
Once MPA delivers an error of any type, it MUST NOT pass or deliver Once MPA delivers an error of any type, it MUST NOT pass or deliver
any additional FPDUs on that half connection. any additional FPDUs on that half connection.
For Error codes 2 and 3, MPA MUST NOT close the TCP connection For Error codes 2 and 3, MPA MUST NOT close the TCP connection
following a reported error. Closing the connection is the following a reported error. Closing the connection is the
responsibility of DDP's ULP. responsibility of DDP's ULP.
Note that since MPA will not deliver any FPDUs on a half Note that since MPA will not Deliver any FPDUs on a half
connection following an error detected on the receive side of connection following an error detected on the receive side of
that connection, DDP's ULP is expected to tear down the that connection, DDP's ULP is expected to tear down the
connection. This may not occur until after one or more last connection. This may not occur until after one or more last
messages are transmitted on the opposite half connection. This messages are transmitted on the opposite half connection. This
allows a diagnostic error message to be sent. allows a diagnostic error message to be sent.
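The rules in this section can be summarized in a small receive-side model. It is a sketch under stated assumptions rather than a definitive implementation: the class and member names (MpaError, HalfConnection, report_error, deliver) are hypothetical, only the first delivered error is recorded, and the optional silent drop of a segment by an MPA-aware TCP is not modeled.

```python
from enum import IntEnum


class MpaError(IntEnum):
    TCP_LOST = 1         # closed, terminated or lost (timeout, RST, FIN)
    CRC_MISMATCH = 2     # received CRC differs from the calculated CRC
    MARKER_MISMATCH = 3  # Marker and ULPDU Length disagree on FPDU start
    INVALID_STARTUP = 4  # invalid MPA Request or Response Frame


class HalfConnection:
    """Hypothetical receive-side state for one MPA half connection."""

    def __init__(self):
        self.error = None      # first error delivered, if any
        self.tcp_open = True
        self.delivered = []    # FPDUs passed up to DDP

    def report_error(self, code):
        if self.error is None:
            self.error = MpaError(code)
        if code in (MpaError.TCP_LOST, MpaError.INVALID_STARTUP):
            # Code 1: TCP is already gone. Code 4: MUST close at once.
            self.tcp_open = False
        # Codes 2 and 3: MPA MUST NOT close TCP; DDP's ULP decides.

    def deliver(self, fpdu):
        # After any error, no further FPDUs are delivered on this half.
        if self.error is not None:
            return False
        self.delivered.append(fpdu)
        return True
```

Keeping the TCP connection open after codes 2 and 3 is what allows the ULP to send a diagnostic message on the opposite half connection before tearing the connection down.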
8 Security Considerations 8 Security Considerations
This section discusses the security considerations for MPA. This section discusses the security considerations for MPA.
skipping to change at page 47, line 30 skipping to change at page 47, line 30
target valid buffers. These types of attacks ultimately result in target valid buffers. These types of attacks ultimately result in
loss of connection and thus become a type of DOS (Denial Of Service) loss of connection and thus become a type of DOS (Denial Of Service)
attack. Communication security mechanisms such as IPsec [RFC2401] attack. Communication security mechanisms such as IPsec [RFC2401]
may be used to prevent such attacks. may be used to prevent such attacks.
Independent of how MPA operates, a third party could use ICMP Independent of how MPA operates, a third party could use ICMP
messages to reduce the path MTU to such a small size that performance messages to reduce the path MTU to such a small size that performance
would likewise be severely impacted. Range checking on path MTU would likewise be severely impacted. Range checking on path MTU
sizes in ICMP packets may be used to prevent such attacks. sizes in ICMP packets may be used to prevent such attacks.
[RDMA] and [DDP] are used to control, read and write data buffers [RDMAP] and [DDP] are used to control, read and write data buffers
over IP networks. Therefore, the control and the data packets of over IP networks. Therefore, the control and the data packets of
these protocols are vulnerable to the spoofing, tampering and these protocols are vulnerable to the spoofing, tampering and
information disclosure attacks listed below. In addition, Connection information disclosure attacks listed below. In addition, Connection
to/from an unauthorized or unauthenticated endpoint is a potential to/from an unauthorized or unauthenticated endpoint is a potential
problem with most applications using RDMA, DDP, and MPA. problem with most applications using RDMA, DDP, and MPA.
8.1.1 Spoofing 8.1.1 Spoofing
Spoofing attacks can be launched by the Remote Peer, or by a network Spoofing attacks can be launched by the Remote Peer, or by a network
based attacker. A network based spoofing attack applies to all Remote based attacker. A network based spoofing attack applies to all
Peers. Because the MPA Stream requires a TCP Stream in the Remote Peers. Because the MPA Stream requires a TCP Stream in the
ESTABLISHED state, certain types of traditional forms of wire attacks ESTABLISHED state, certain types of traditional forms of wire attacks
do not apply -- an end-to-end handshake must have occurred to do not apply -- an end-to-end handshake must have occurred to
establish the MPA Stream. So, the only form of spoofing that applies establish the MPA Stream. So, the only form of spoofing that applies
is one when a remote node can both send and receive packets. Yet even is one when a remote node can both send and receive packets. Yet
with this limitation the Stream is still exposed to the following even with this limitation the Stream is still exposed to the
spoofing attacks. following spoofing attacks.
8.1.1.1 Impersonation 8.1.1.1 Impersonation
A network based attacker can impersonate a legal MPA/DDP/RDMAP peer A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
(by spoofing a legal IP address), and establish an MPA/DDP/RDMAP (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
Stream with the victim. End to end authentication (i.e. IPsec or ULP Stream with the victim. End to end authentication (i.e. IPsec or ULP
authentication) provides protection against this attack. authentication) provides protection against this attack.
8.1.1.2 Stream Hijacking 8.1.1.2 Stream Hijacking
Stream hijacking happens when a network based attacker follows the Stream hijacking happens when a network based attacker follows the
Stream establishment phase, and waits until the authentication phase Stream establishment phase, and waits until the authentication phase
(if such a phase exists) is completed successfully. He can then spoof (if such a phase exists) is completed successfully. He can then
the IP address and re-direct the Stream from the victim to its own spoof the IP address and re-direct the Stream from the victim to its
machine. For example, an attacker can wait until an iSCSI own machine. For example, an attacker can wait until an iSCSI
authentication is completed successfully, and hijack the iSCSI authentication is completed successfully, and hijack the iSCSI
Stream. Stream.
The best protection against this form of attack is end-to-end The best protection against this form of attack is end-to-end
integrity protection and authentication, such as IPsec to prevent integrity protection and authentication, such as IPsec to prevent
spoofing. Another option is to provide physical security. Discussion spoofing. Another option is to provide physical security.
of physical security is out of scope for this document. Discussion of physical security is out of scope for this document.
8.1.1.3 Man in the Middle Attack 8.1.1.3 Man in the Middle Attack
If a network based attacker has the ability to delete, inject, replay, If a network based attacker has the ability to delete, inject, replay,
or modify packets which will still be accepted by MPA (e.g., TCP or modify packets which will still be accepted by MPA (e.g., TCP
sequence number is correct, FPDU is valid etc.) then the Stream can sequence number is correct, FPDU is valid etc.) then the Stream can
be exposed to a man in the middle attack. The attacker could be exposed to a man in the middle attack. The attacker could
potentially use the services of [DDP] and [RDMAP] to read the potentially use the services of [DDP] and [RDMAP] to read the
contents of the associated data buffer, modify the contents of the contents of the associated data buffer, modify the contents of the
associated data buffer, or to disable further access to the buffer. associated data buffer, or to disable further access to the buffer.
skipping to change at page 52, line 18 skipping to change at page 52, line 18
[iSCSI] Satran, J., Internet Small Computer Systems Interface [iSCSI] Satran, J., Internet Small Computer Systems Interface
(iSCSI), RFC 3720, April 2004. (iSCSI), RFC 3720, April 2004.
[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191, [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
November 1990. November 1990.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", RFC 2018, October 1996. Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision
3", BCP 9, RFC 2026, October 1996.
[RFC3667] Bradner, S., "IETF Rights in Contributions", BCP 78, RFC
3667, February 2004.
[RFC3668] Bradner, S., Ed., "Intellectual Property Rights in IETF
Technology", BCP 79, RFC 3668, February 2004.
[RFC3723] Aboba B., et al, "Securing Block Storage Protocols over [RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
IP", RFC3723, April 2004. IP", RFC3723, April 2004.
[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981. Program Protocol Specification", RFC 793, September 1981.
[RDMASEC] Pinkerton J., Deleganes E., Romanow A., Bitan S., [RDMASEC] Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP
"DDP/RDMAP Security", draft-ietf-rddp-security-06.txt (work in Security", draft-ietf-rddp-security-09.txt (work in progress),
progress), December 2004. MAY 2006.
10.2 Informative References 10.2 Informative References
[CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000. disagree", ACM Sigcomm, Sept. 2000.
[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
Library) and uDAPL (User Direct Access Programming Library)",
http://www.datcollaborative.org.
[DDP] H. Shah et al., "Direct Data Placement over Reliable [DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", draft-ietf-rddp-ddp-04.txt (Work in progress), Transports", draft-ietf-rddp-ddp-06.txt (Work in progress), May
February 2005 2006.
[IT-API] The Open Group, "Interconnect Transport API (IT-API)"
Version 2.1, http://www.opengroup.org.
[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
Internet Protocol", RFC 2401, November 1998. Internet Protocol", RFC 2401, November 1998.
[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
896, January 1984. 896, January 1984.
[NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B., [NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B.,
"Application performance pitfalls and TCP's Nagle algorithm", "Application performance pitfalls and TCP's Nagle algorithm",
Workshop on Internet Server Performance, May 1999. Workshop on Internet Server Performance, May 1999.
[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
Secure Channels", Internet-Draft draft-ietf-nfsv4-channel- Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
bindings-02.txt, July 2004. bindings-02.txt, July 2004.
[RDMA] R. Recio et al., "RDMA Protocol Specification", [RDMAP] R. Recio et al., "RDMA Protocol Specification",
draft-ietf-rddp-rdmap-03.txt, February 2005 draft-ietf-rddp-rdmap-06.txt, May 2006.
[RFC2960] R. Stewart et al., "Stream Control Transmission Protocol", [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000. RFC 2960, October 2000.
[RFC792] Postel, J., "Internet Control Message Protocol". September [RFC792] Postel, J., "Internet Control Message Protocol", September
1981. 1981.
[RFC1122] Braden, R.T., "Requirements for Internet hosts - [RFC1122] Braden, R.T., "Requirements for Internet hosts -
communication layers". October 1989. communication layers", October 1989.
[ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations" draft-
elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.
[Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft- [VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
hilland-rddp-verbs-00.txt, April 2003. draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf April 2003,
http://www.rdmaconsortium.org.
11 Appendix 11 Appendix
This appendix is for information only and is NOT part of the This appendix is for information only and is NOT part of the
standard. standard.
The appendix covers three topics: The appendix covers three topics:
Section 11.1 is an analysis of MPA on TCP and why it is useful to Section 11.1 is an analysis of MPA on TCP and why it is useful to
integrate MPA with TCP (with modifications to typical TCP integrate MPA with TCP (with modifications to typical TCP
implementations) to reduce overall system buffering and overhead. implementations) to reduce overall system buffering and overhead.
Section 11.2 covers some MPA receiver implementation notes. Section 11.2 covers some MPA receiver implementation notes.
Section 11.3 covers methods of making MPA implementations Section 11.3 covers methods of making MPA implementations
interoperate with both IETF and RDMA Consortium versions of the interoperate with both IETF and RDMA Consortium versions of the
protocols. protocols.
11.1 Analysis of MPA over TCP Operations 11.1 Analysis of MPA over TCP Operations
This appendix analyzes the impact of MPA (Marker PDU Aligned Framing This appendix analyzes the impact of MPA on the TCP sender, receiver,
for TCP [MPA]) on the TCP sender, receiver, and wire protocol. and wire protocol.
One of MPA's high level goals is to provide enough information, when One of MPA's high level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) buffer. Note that DDP separates the act of placing Protocol (ULP) buffer. Note that DDP separates the act of placing
data into a ULP buffer from that of notifying the ULP that the ULP data into a ULP buffer from that of notifying the ULP that the ULP
buffer is available for use. In DDP terminology, the former is buffer is available for use. In DDP terminology, the former is
defined as "Placement", and the latter is defined as "Delivery". MPA defined as "Placement", and the latter is defined as "Delivery". MPA
supports in-order delivery of the data to the ULP, including support supports in-order Delivery of the data to the ULP, including support
for Direct Data Placement in the final ULP buffer location when TCP for Direct Data Placement in the final ULP buffer location when TCP
segments arrive out-of-order. Effectively, the goal is to use the segments arrive out-of-order. Effectively, the goal is to use the
pre-posted ULP buffers as the TCP receive buffer, where the pre-posted ULP buffers as the TCP receive buffer, where the
reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
DDP) is done in place, in the ULP buffer, with no data copies. DDP) is done in place, in the ULP buffer, with no data copies.
This Appendix walks through the advantages and disadvantages of the This Appendix walks through the advantages and disadvantages of the
TCP sender modifications proposed by MPA: TCP sender modifications proposed by MPA:
1) that MPA prefers that the TCP sender do "Header Alignment", 1) that MPA prefers that the TCP sender do Header Alignment, where
where a TCP segment should begin with an MPA Framing Protocol Data a TCP segment should begin with an MPA Framing Protocol Data Unit
Unit (FPDU) (if there is payload present). (FPDU) (if there is payload present).
2) that there be an integral number of FPDUs in a TCP segment (under 2) that there be an integral number of FPDUs in a TCP segment (under
conditions where the Path MTU is not changing). conditions where the Path MTU is not changing).
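The two sender properties above (every segment begins with an FPDU, and every segment holds an integral number of FPDUs) can be illustrated with a simple packing routine. This is a minimal sketch, assuming each FPDU is already no larger than the EMSS and that the Path MTU is stable; the function name pack_segments and its parameters are hypothetical, and FPDUs are represented only by their lengths in octets.

```python
def pack_segments(fpdu_lengths, emss):
    """Group FPDU lengths into segments of whole FPDUs, each <= emss."""
    segments, current, used = [], [], 0
    for length in fpdu_lengths:
        if length > emss:
            # Edge case outside this sketch: an FPDU that cannot fit.
            raise ValueError("FPDU larger than EMSS; not handled here")
        if used + length > emss:
            # Close the segment so the next one starts with an FPDU.
            segments.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        segments.append(current)
    return segments
```

For example, with an EMSS of 1460 octets, FPDUs of 600, 600, 400 and 1000 octets would be grouped as two segments: one carrying the first two FPDUs and one carrying the last two.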
This Appendix concludes that the scaling advantages of FPDU Alignment This Appendix concludes that the scaling advantages of FPDU Alignment
are strong, based primarily on fairly drastic TCP receive buffer are strong, based primarily on fairly drastic TCP receive buffer
reduction requirements and simplified receive handling. The analysis reduction requirements and simplified receive handling. The analysis
also shows that there is little effect to TCP wire behavior. also shows that there is little effect to TCP wire behavior.
11.1.1 Assumptions 11.1.1 Assumptions
skipping to change at page 55, line 24 skipping to change at page 55, line 24
11.1.1.2 MPA preserves DDP message framing 11.1.1.2 MPA preserves DDP message framing
MPA was designed as a framing layer specifically for DDP and was not MPA was designed as a framing layer specifically for DDP and was not
intended as a general-purpose framing layer for any other ULP using intended as a general-purpose framing layer for any other ULP using
TCP. TCP.
A framing layer allows ULPs using it to receive indications from the A framing layer allows ULPs using it to receive indications from the
transport layer only when complete ULPDUs are present. As a framing transport layer only when complete ULPDUs are present. As a framing
layer, MPA is not aware of the content of the DDP PDU, only that it layer, MPA is not aware of the content of the DDP PDU, only that it
has received and, if necessary, reassembled a complete PDU for has received and, if necessary, reassembled a complete PDU for
delivery to the DDP. Delivery to the DDP.
11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under 11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
normal conditions normal conditions
To make reception of a complete DDP PDU on every received segment To make reception of a complete DDP PDU on every received segment
possible, DDP passes to MPA a PDU that is no larger than the EMSS of possible, DDP passes to MPA a PDU that is no larger than the EMSS of
the underlying fabric. Each FPDU that MPA creates contains sufficient the underlying fabric. Each FPDU that MPA creates contains
information for the receiver to directly place the ULP payload in the sufficient information for the receiver to directly place the ULP
correct location in the correct receive buffer. payload in the correct location in the correct receive buffer.
Edge cases when this condition does not occur are dealt with, but do Edge cases when this condition does not occur are dealt with, but do
not need to be on the fast path. not need to be on the fast path.
11.1.1.4 Out-of-order placement but NO out-of-order delivery 11.1.1.4 Out-of-order placement but NO out-of-order Delivery
DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
information necessary to place its ULP payload directly in the information necessary to place its ULP payload directly in the
correct location in host memory. correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer. immediately in the ULP receive buffer.
Data delivery to the ULP is guaranteed to be in the order the data Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream. acknowledged the complete byte stream.
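The distinction between Placement and Delivery can be sketched with a small buffer model. It is illustrative only: the PlacementBuffer class and its methods are hypothetical, and the "contiguous prefix" test below is a simplifying stand-in for the condition that TCP has acknowledged the byte stream up to that point.

```python
class PlacementBuffer:
    """Hypothetical ULP buffer: out-of-order placement, in-order delivery."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.placed = []   # sorted, disjoint [start, end) ranges

    def place(self, offset, payload):
        # Placement: copy straight to the final location, in any order.
        end = offset + len(payload)
        if offset < 0 or end > len(self.data):
            raise ValueError("placement outside the buffer")
        self.data[offset:end] = payload
        merged = []
        for s, e in sorted(self.placed + [(offset, end)]):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.placed = merged

    def deliverable(self):
        # Delivery: only the contiguous prefix may be indicated to the ULP.
        if self.placed and self.placed[0][0] == 0:
            return self.placed[0][1]
        return 0
```

Placing the second half of a message first leaves deliverable() at 0; once the first half arrives, the whole message becomes deliverable without any data copy out of a separate reassembly buffer.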
11.1.2 The Value of FPDU Alignment 11.1.2 The Value of FPDU Alignment
Significant receiver optimizations can be achieved when Header Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case. The optimizations Alignment and complete FPDUs are the common case. The optimizations
allow utilizing significantly fewer buffers on the receiver and less allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a "Flow- computation per FPDU. The net effect is the ability to build a
Through" receiver that enables TCP-based solutions to scale to 10G "flow-through" receiver that enables TCP-based solutions to scale to
and beyond in an economical way. The optimizations are especially 10G and beyond in an economical way. The optimizations are
relevant to hardware implementations of receivers that process especially relevant to hardware implementations of receivers that
multiple protocol layers - Data Link Layer (e.g., Ethernet), Network process multiple protocol layers - Data Link Layer (e.g., Ethernet),
and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
(e.g., MPA/DDP). As network speed increases, there is an increasing of TCP (e.g., MPA/DDP). As network speed increases, there is an
desire to use a hardware-based receiver in order to achieve an increasing desire to use a hardware-based receiver in order to
efficient high-performance solution. achieve an efficient high-performance solution.
A TCP receiver, under worst case conditions, has to allocate buffers A TCP receiver, under worst case conditions, has to allocate buffers
(BufferSizeTCP) whose capacities are a function of the bandwidth- (BufferSizeTCP) whose capacities are a function of the bandwidth-
delay product. Thus: delay product. Thus:
BufferSizeTCP = K * bandwidth [octets/S] * Delay [S]. BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].
Where bandwidth is the end-to-end bandwidth of the connection, delay Where bandwidth is the end-to-end bandwidth of the connection, delay
is the round trip delay of the connection, and K is an implementation is the round trip delay of the connection, and K is an implementation
dependent constant. dependent constant.
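As a rough illustration of the formula above, the following Python sketch evaluates BufferSizeTCP for a given link. The function name, the default K = 1, and the 1 ms round trip delay are assumptions chosen for illustration only; the draft leaves K implementation dependent.

```python
def buffer_size_tcp(bandwidth_bits_per_s, rtt_s, k=1.0):
    """Worst-case TCP receive buffering in octets: K * bandwidth * delay."""
    bandwidth_octets_per_s = bandwidth_bits_per_s / 8  # bits/s -> octets/s
    return k * bandwidth_octets_per_s * rtt_s


if __name__ == "__main__":
    # Hypothetical links with a 1 ms round trip delay.
    one_gig = buffer_size_tcp(1e9, 0.001)
    ten_gig = buffer_size_tcp(10e9, 0.001)
    print("1G:", one_gig, "octets; 10G:", ten_gig, "octets")
    print("scale factor:", ten_gig / one_gig)
```

The ratio of the two results shows the linear scaling with bandwidth that the next paragraph describes.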
Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
buffers for a 10x increase in end-to-end bandwidth). As this buffers for a 10x increase in end-to-end bandwidth). As this
buffering approach may scale poorly for hardware or software buffering approach may scale poorly for hardware or software
implementations alike, several approaches allow reduction in the implementations alike, several approaches allow reduction in the
amount of buffering required for high-speed TCP communication. amount of buffering required for high-speed TCP communication.
skipping to change at page 56, line 47 skipping to change at page 56, line 47
TCP receive buffer. If the application pre-posts a sufficient amount TCP receive buffer. If the application pre-posts a sufficient amount
of buffering, and each TCP segment has sufficient information to of buffering, and each TCP segment has sufficient information to
place the payload into the right application buffer, when an out-of- place the payload into the right application buffer, when an out-of-
order TCP segment arrives it could potentially be placed directly in order TCP segment arrives it could potentially be placed directly in
the ULP buffer. However, placement can only be done when a complete the ULP buffer. However, placement can only be done when a complete
FPDU with the placement information is available to the receiver, and FPDU with the placement information is available to the receiver, and
the FPDU contents contain enough information to place the data into the FPDU contents contain enough information to place the data into
the correct ULP buffer (e.g., there is a DDP header available). the correct ULP buffer (e.g., there is a DDP header available).
For the case when the FPDU is not aligned with the TCP segment, it For the case when the FPDU is not aligned with the TCP segment, it
may take, on average, 2 TCP segments to assemble one FPDU. Therefore, may take, on average, 2 TCP segments to assemble one FPDU.
the receiver has to allocate BufferSizeNAF (Buffer Size, Non-Aligned Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
FPDU) octets: Non-Aligned FPDU) octets:
BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS
Where K1 and K2 are implementation dependent constants and EMSS is Where K1 and K2 are implementation dependent constants and EMSS is
the effective maximum segment size. the effective maximum segment size.
For example, a 1 Gbps link with 10,000 connections and an EMSS of For example, a 1 Gbps link with 10,000 connections and an EMSS of
1500B would require 15 MB of memory. Often the number of connections 1500B would require 15 MB of memory. Often the number of connections
used scales with the network speed, aggravating the situation for used scales with the network speed, aggravating the situation for
higher speeds. higher speeds.
FPDU Alignment would allow the receiver to allocate BufferSizeAF FPDU Alignment would allow the receiver to allocate BufferSizeAF
(Buffer Size, Aligned FPDU) octets: (Buffer Size, Aligned FPDU) octets:
BufferSizeAF = K2 * EMSS BufferSizeAF = K2 * EMSS
for the same conditions. A FPDU Aligned receiver may require memory for the same conditions. A FPDU Aligned receiver may require memory
in the range of ~100s of KB - which is feasible for an on-chip memory in the range of ~100s of KB - which is feasible for an on-chip memory
and enables a "Flow-Through" design, in which the data flows through and enables a "flow-through" design, in which the data flows through
the NIC and is placed directly in the destination buffer. Assuming the NIC and is placed directly in the destination buffer. Assuming
most of the connections support FPDU Alignment, the receiver buffers most of the connections support FPDU Alignment, the receiver buffers
no longer scale with number of connections. no longer scale with number of connections.
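The two buffer formulas can be compared with a small Python sketch. The function names and the values chosen for K1 and K2 are assumptions for illustration; the draft only states that they are implementation dependent constants. With K1 = 1 the first term reproduces the 15 MB figure quoted above for 10,000 connections and an EMSS of 1500 octets.

```python
def buffer_size_naf(emss, connections, k1=1, k2=1):
    """Octets needed without FPDU Alignment: grows with the connection count."""
    return k1 * emss * connections + k2 * emss


def buffer_size_af(emss, k2=1):
    """Octets needed with FPDU Alignment: independent of the connection count."""
    return k2 * emss


if __name__ == "__main__":
    emss = 1500
    # K2 = 64 is a hypothetical value that lands in the "100s of KB" range.
    print("non-aligned:", buffer_size_naf(emss, 10_000, k1=1, k2=64), "octets")
    print("aligned:    ", buffer_size_af(emss, k2=64), "octets")
```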
Additional optimizations can be achieved in a balanced I/O sub-system Additional optimizations can be achieved in a balanced I/O sub-system
-- where the system interface of the network controller provides -- where the system interface of the network controller provides
ample bandwidth as compared with the network bandwidth. For almost ample bandwidth as compared with the network bandwidth. For almost
twenty years this has been the case and the trend is expected to twenty years this has been the case and the trend is expected to
continue - while Ethernet speeds have scaled by 1000 (from 10 continue - while Ethernet speeds have scaled by 1000 (from 10
megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
skipping to change at page 58, line 23 skipping to change at page 58, line 23
The receiver algorithm for processing TCP segments (e.g., TCP segment The receiver algorithm for processing TCP segments (e.g., TCP segment
#X in Figure 10: Non-aligned FPDU freely placed in TCP octet stream) #X in Figure 10: Non-aligned FPDU freely placed in TCP octet stream)
carrying non-aligned FPDUs (in-order or out-of-order) includes: carrying non-aligned FPDUs (in-order or out-of-order) includes:
Data Link Layer processing (whole frame) - typically including a Data Link Layer processing (whole frame) - typically including a
CRC calculation. CRC calculation.
1. Network Layer processing (assuming not an IP fragment, the 1. Network Layer processing (assuming not an IP fragment, the
whole Data Link Layer frame contains one IP datagram. IP whole Data Link Layer frame contains one IP datagram. IP
fragments should be reassembled in a local buffer. This is not fragments should be reassembled in a local buffer. This is
a performance optimization goal) not a performance optimization goal)
2. Transport Layer processing -- TCP protocol processing, header 2. Transport Layer processing -- TCP protocol processing, header
and checksum checks. and checksum checks.
a. Classify incoming TCP segment using the 5 tuple (IP SRC, a. Classify incoming TCP segment using the 5 tuple (IP SRC,
IP DST, TCP SRC Port, TCP DST Port, protocol) IP DST, TCP SRC Port, TCP DST Port, protocol)
3. Find FPDU message boundaries. 3. Find FPDU message boundaries.
a. Get MPA state information for the connection a. Get MPA state information for the connection
skipping to change at page 60, line 21 skipping to change at page 60, line 21
whole Data Link Layer frame contains one IP datagram. IP whole Data Link Layer frame contains one IP datagram. IP
fragments should be reassembled in a local buffer. This is fragments should be reassembled in a local buffer. This is
not a performance optimization goal) not a performance optimization goal)
3) Transport Layer processing -- TCP protocol processing, header 3) Transport Layer processing -- TCP protocol processing, header
and checksum checks. and checksum checks.
a. Classify incoming TCP segment using the 5 tuple (IP SRC, a. Classify incoming TCP segment using the 5 tuple (IP SRC,
IP DST, TCP SRC Port, TCP DST Port, protocol) IP DST, TCP SRC Port, TCP DST Port, protocol)
4) Check for Header Alignment. (Described in detail in [MPA] 4) Check for Header Alignment. (Described in detail in Section
section 7.4). Assuming Header Alignment for the rest of the 5.4). Assuming Header Alignment for the rest of the
algorithm below. algorithm below.
a. If the header is not aligned, see the algorithm defined a. If the header is not aligned, see the algorithm defined
in the prior section. in the prior section.
5) Whether TCP is in-order or out-of-order, the MPA header is at the 5) Whether TCP is in-order or out-of-order, the MPA header is at the
beginning of the current TCP payload. Get the FPDU length beginning of the current TCP payload. Get the FPDU length
from the FPDU header. from the FPDU header.
6) Calculate CRC over FPDU 6) Calculate CRC over FPDU
skipping to change at page 60, line 46 skipping to change at page 60, line 46
8) If no FPDU CRC errors, placement is allowed 8) If no FPDU CRC errors, placement is allowed
9) CopyData(TCP segment #X, host buffer address, length) 9) CopyData(TCP segment #X, host buffer address, length)
10) Loop to #5 until all the FPDUs in the TCP segment are 10) Loop to #5 until all the FPDUs in the TCP segment are
consumed in order to handle FPDU packing. consumed in order to handle FPDU packing.
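The loop over packed FPDUs in the aligned case can be sketched in Python as follows. This is a simplified illustration, not the specified algorithm. It assumes an FPDU laid out as a 2-octet ULPDU_Length in network byte order, the ULPDU, padding to a 4-octet multiple, and a 4-octet CRC; it ignores Markers entirely; and it uses zlib.crc32 purely as a stand-in for the MPA CRC. The names fpdu_length, build_fpdu, place_aligned_segment, and the place callback are hypothetical; the callback stands in for the DDP header check and CopyData.

```python
import struct
import zlib


def fpdu_length(ulpdu_len):
    """Total FPDU octets: length field + ULPDU + pad to 4 octets + CRC."""
    unpadded = 2 + ulpdu_len
    pad = (-unpadded) % 4
    return unpadded + pad + 4


def build_fpdu(ulpdu):
    """Build a Marker-less FPDU with a stand-in CRC, for exercising the parser."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * ((-len(body)) % 4)
    return body + struct.pack("!I", zlib.crc32(body))


def place_aligned_segment(payload, place):
    """Walk every complete FPDU in a TCP payload that starts on an FPDU header.

    Calls place(ulpdu) for each FPDU whose CRC verifies and returns the number
    of FPDUs placed. Raises ValueError when the payload does not hold an
    integer number of complete FPDUs or a CRC check fails.
    """
    offset = 0
    placed = 0
    while offset < len(payload):
        if len(payload) - offset < 2:
            raise ValueError("truncated FPDU header")
        (ulpdu_len,) = struct.unpack_from("!H", payload, offset)
        total = fpdu_length(ulpdu_len)
        if offset + total > len(payload):
            raise ValueError("incomplete FPDU; use the non-aligned algorithm")
        body = payload[offset:offset + total - 4]
        (crc,) = struct.unpack_from("!I", payload, offset + total - 4)
        if zlib.crc32(body) != crc:
            raise ValueError("FPDU CRC error; placement is not allowed")
        place(payload[offset + 2:offset + 2 + ulpdu_len])
        placed += 1
        offset += total  # loop again to handle FPDU packing
    return placed
```

A segment built as build_fpdu(b"abc") + build_fpdu(b"hello world") is consumed in two iterations without any per-connection partial-FPDU state, which is the property the aligned algorithm relies on.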
Implementation note: In both cases the receiver has to classify the Implementation note: In both cases the receiver has to classify the
incoming TCP segment and associate it with one of the flows it incoming TCP segment and associate it with one of the flows it
maintains. In the case of no FPDU Alignment, the receiver is forced maintains. In the case of no FPDU Alignment, the receiver is forced
to classify incoming traffic before it can calculate the FPDU CRC. In to classify incoming traffic before it can calculate the FPDU CRC.
the case of FPDU Alignment the operations order is left to the In the case of FPDU Alignment the operations order is left to the
implementer. implementer.
The FPDU Aligned receiver algorithm is significantly simpler. There The FPDU Aligned receiver algorithm is significantly simpler. There
is no need to locally buffer portions of FPDUs. Accessing state is no need to locally buffer portions of FPDUs. Accessing state
information is also substantially simplified - the normal case does information is also substantially simplified - the normal case does
not require retrieving information to find out where a FPDU starts not require retrieving information to find out where a FPDU starts
and ends or retrieval of a partial CRC before the CRC calculation can and ends or retrieval of a partial CRC before the CRC calculation can
commence. This avoids adding internal latencies, having multiple data commence. This avoids adding internal latencies, having multiple
passes through the CRC machine, or scheduling multiple commands for data passes through the CRC machine, or scheduling multiple commands
moving the data to the host buffer. for moving the data to the host buffer.
The aligned FPDU approach is useful for in-order and out-of-order The aligned FPDU approach is useful for in-order and out-of-order
reception. The receiver can use the same mechanisms for data storage reception. The receiver can use the same mechanisms for data storage
in both cases, and only needs to account for when all the TCP in both cases, and only needs to account for when all the TCP
segments have arrived to enable delivery. The Header Alignment, along segments have arrived to enable Delivery. The Header Alignment,
with the high probability that at least one complete FPDU is found along with the high probability that at least one complete FPDU is
with every TCP segment, allows the receiver to perform data placement found with every TCP segment, allows the receiver to perform data
for out-of-order TCP segments with no need for intermediate placement for out-of-order TCP segments with no need for intermediate
buffering. Essentially the TCP receive buffer has been eliminated and buffering. Essentially the TCP receive buffer has been eliminated
TCP reassembly is done in place within the ULP buffer. and TCP reassembly is done in place within the ULP buffer.
In case FPDU Alignment is not found, the receiver should follow the In case FPDU Alignment is not found, the receiver should follow the
algorithm for non-aligned FPDU reception, which may be slower and less algorithm for non-aligned FPDU reception, which may be slower and less
efficient. efficient.
11.1.2.2 FPDU Alignment effects on TCP wire protocol 11.1.2.2 FPDU Alignment effects on TCP wire protocol
An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to
calculate its MULPDU, which it then exposes to DDP, its ULP. DDP calculate its MULPDU, which it then exposes to DDP, its ULP. DDP
uses the MULPDU to segment its payload so that each FPDU sent by uses the MULPDU to segment its payload so that each FPDU sent by
MPA fits completely into one TCP segment. This has no impact on MPA fits completely into one TCP segment. This has no impact on
wire protocol and exposing this information is already supported wire protocol and exposing this information is already supported
on many TCP implementations, including all modern flavors of BSD on many TCP implementations, including all modern flavors of BSD
networking, through the TCP_MAXSEG socket option. networking, through the TCP_MAXSEG socket option.
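As an illustration of reading this option from user space, the Python sketch below queries TCP_MAXSEG on a socket. The helper name query_mss is hypothetical. The value the kernel reports is a maximum segment size and is not necessarily identical to the EMSS as the draft defines it (for example, TCP options may reduce the usable payload), and on an unconnected socket it is typically only a default.

```python
import socket


def query_mss(sock):
    """Return the MSS reported for a TCP socket, or None if unavailable."""
    option = getattr(socket, "TCP_MAXSEG", None)  # not defined on every platform
    if option is None:
        return None
    try:
        return sock.getsockopt(socket.IPPROTO_TCP, option)
    except OSError:
        return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("reported MSS:", query_mss(s))
```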
In the common case, the ULP (i.e. DDP over MPA) messages provided to In the common case, the ULP (i.e. DDP over MPA) messages provided to
the TCP layer are segmented to MULPDU size. It is assumed that the the TCP layer are segmented to MULPDU size. It is assumed that the
ULP message size is bounded by MULPDU, such that a single ULP message ULP message size is bounded by MULPDU, such that a single ULP message
can be encapsulated in a single TCP segment. Therefore, in the common can be encapsulated in a single TCP segment. Therefore, in the
case, there is no increase in the number of TCP segments emitted. For common case, there is no increase in the number of TCP segments
smaller ULP messages, the sender can also apply packing, i.e. the emitted. For smaller ULP messages, the sender can also apply
sender packs as many complete FPDUs as possible into one TCP segment. packing, i.e. the sender packs as many complete FPDUs as possible
The requirement to always have a complete FPDU may increase the into one TCP segment. The requirement to always have a complete FPDU
number of TCP segments emitted. Typically, a ULP message size varies may increase the number of TCP segments emitted. Typically, a ULP
from a few bytes to multiple EMSS (e.g., 64 Kbytes). In some cases the message size varies from a few bytes to multiple EMSS (e.g., 64
ULP may post more than one message at a time for transmission, giving Kbytes). In some cases the ULP may post more than one message at a
the sender an opportunity for packing. In the case where more than time for transmission, giving the sender an opportunity for packing.
one FPDU is available for transmission and the FPDUs are encapsulated In the case where more than one FPDU is available for transmission
into a TCP segment and there is no room in the TCP segment to include and the FPDUs are encapsulated into a TCP segment and there is no
the next complete FPDU, another TCP segment is sent. In this corner room in the TCP segment to include the next complete FPDU, another
case some of the TCP segments are not full size. In the worst case TCP segment is sent. In this corner case some of the TCP segments
scenario, the ULP may choose a FPDU size that is EMSS/2 +1 and has are not full size. In the worst case scenario, the ULP may choose a
multiple messages available for transmission. For this poor choice of FPDU size that is EMSS/2 +1 and has multiple messages available for
FPDU size, the average TCP segment size is therefore about 1/2 of the transmission. For this poor choice of FPDU size, the average TCP
EMSS and the number of TCP segments emitted is approaching 2x of what segment size is therefore about 1/2 of the EMSS and the number of TCP
is possible without the requirement to encapsulate an integer number segments emitted is approaching 2x of what is possible without the
of complete FPDUs in every TCP segment. This is a dynamic situation requirement to encapsulate an integer number of complete FPDUs in
that only lasts for the duration where the sender ULP has multiple every TCP segment. This is a dynamic situation that only lasts for
non-optimal messages for transmission and this causes a minor impact the duration where the sender ULP has multiple non-optimal messages
on the wire utilization. for transmission and this causes a minor impact on the wire
utilization.
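The worst case described above can be checked with simple arithmetic. The Python sketch below counts TCP segments with and without the requirement that every segment hold an integer number of complete FPDUs. The function names and the EMSS of 1460 octets are assumptions for illustration, and headers and Markers are ignored.

```python
def segments_with_alignment(emss, fpdu_size, fpdu_count):
    """Segments needed when each segment carries only complete FPDUs."""
    per_segment = emss // fpdu_size
    if per_segment == 0:
        raise ValueError("FPDU does not fit in one segment")
    return -(-fpdu_count // per_segment)  # ceiling division


def segments_without_alignment(emss, fpdu_size, fpdu_count):
    """Segments needed when FPDUs may straddle segment boundaries."""
    return -(-(fpdu_count * fpdu_size) // emss)  # ceiling division


if __name__ == "__main__":
    emss = 1460
    worst = emss // 2 + 1  # the poor FPDU size choice: EMSS/2 + 1
    print(segments_with_alignment(emss, worst, 1000),
          segments_without_alignment(emss, worst, 1000))
```

For 1000 FPDUs of EMSS/2 + 1 octets the aligned sender emits 1000 segments against 501 for the unaligned one, which is the "approaching 2x" figure. At exactly EMSS/2 octets both senders emit 500 segments.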
However, it is not expected that requiring FPDU Alignment will have a However, it is not expected that requiring FPDU Alignment will have a
measurable impact on wire behavior of most applications. Throughput measurable impact on wire behavior of most applications. Throughput
applications with large I/Os are expected to take full advantage of applications with large I/Os are expected to take full advantage of
the EMSS. Another class of applications with many small outstanding the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction oriented applications are also optimal. applicable. Transaction oriented applications are also optimal.
TCP retransmission is another area that can affect sender behavior. TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted TCP supports retransmission of the exact, originally transmitted
segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event window" and [RFC1122] section 4.2.2.15). In the unlikely event that
that part of the original segment has been received and acknowledged part of the original segment has been received and acknowledged by
by the remote peer (e.g., a re-segmenting middle box, as documented the remote peer (e.g., a re-segmenting middle box, as documented in
in 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on
page 30), a better available bandwidth utilization may be possible by page 31), a better available bandwidth utilization may be possible by
re-transmitting only the missing octets. If an MPA-aware TCP re-transmitting only the missing octets. If an MPA-aware TCP
retransmits complete FPDUs, there may be some marginal bandwidth retransmits complete FPDUs, there may be some marginal bandwidth
loss. loss.
Another area where a change in the TCP segment number may have impact Another area where a change in the TCP segment number may have impact
is that of Slow Start and Congestion Avoidance. Slow-start is that of Slow Start and Congestion Avoidance. Slow-start
exponential increase is measured in segments per second, as the exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start congestion that eventually results in dropped segments. Slow-start
exponential bandwidth growth for MPA-aware TCP is similar to any TCP exponential bandwidth growth for MPA-aware TCP is similar to any TCP
skipping to change at page 63, line 17 skipping to change at page 63, line 17
Transport & Network Layer Reassembly Buffers: Transport & Network Layer Reassembly Buffers:
The use of reassembly buffers (either TCP reassembly buffers or IP The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out of order packets MPA is enabled, reassembly buffers are needed if out of order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU arrive and Markers are not enabled. Buffers are also needed if FPDU
Alignment is lost or if IP fragmentation occurs. This is because the Alignment is lost or if IP fragmentation occurs. This is because the
incoming out of order segment may not contain enough information for incoming out of order segment may not contain enough information for
MPA to process all of the FPDU. For cases where a re-segmenting MPA to process all of the FPDU. For cases where a re-segmenting
middle box is present, or where the TCP sender is not MPA-aware, the middle box is present, or where the TCP sender is not MPA-aware, the
presence of markers significantly reduces the amount of buffering presence of Markers significantly reduces the amount of buffering
needed. needed.
Recovery from IP Fragmentation must be transparent to the MPA Recovery from IP Fragmentation must be transparent to the MPA
Consumers. Consumers.
11.2.1 Network Layer Reassembly Buffers 11.2.1 Network Layer Reassembly Buffers
Most IP implementations set the IP Don't Fragment bit. Thus upon a Most IP implementations set the IP Don't Fragment bit. Thus upon a
path MTU change, intermediate devices drop the IP datagram if it is path MTU change, intermediate devices drop the IP datagram if it is
too large and reply with an ICMP message which tells the source TCP too large and reply with an ICMP message which tells the source TCP
skipping to change at page 64, line 31 skipping to change at page 64, line 31
A TCP reassembly buffer is also needed. TCP reassembly buffers are A TCP reassembly buffer is also needed. TCP reassembly buffers are
needed if FPDU Alignment is lost when using TCP with MPA or when the needed if FPDU Alignment is lost when using TCP with MPA or when the
MPA FPDU spans multiple TCP segments. Buffers are also needed if MPA FPDU spans multiple TCP segments. Buffers are also needed if
Markers are disabled and out of order packets arrive. Markers are disabled and out of order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible advertised locally attached link (this should be the largest possible advertised
TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST TCP path MTU). If the MTU is smaller than 140 octets, the buffer
be at least 140 octets long to support the minimum FPDU size. The MUST be at least 140 octets long to support the minimum FPDU size.
140 octets allows for the minimum MULPDU of 128, 2 octets of pad, 2 The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad,
of ULPDU_Length, 4 of CRC, and space for a possible marker. As usual, 2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As
additional buffering may provide better performance. usual, additional buffering may provide better performance.
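The 140-octet floor follows directly from the components listed above. The Python sketch below adds them up and applies the floor to a link MTU. The constant and function names are hypothetical, and the 4-octet Marker size is inferred from the arithmetic in the paragraph above (140 - 128 - 2 - 2 - 4).

```python
MIN_MULPDU = 128          # minimum MULPDU, in octets
ULPDU_LENGTH_OCTETS = 2   # ULPDU_Length field
CRC_OCTETS = 4            # MPA CRC
MARKER_OCTETS = 4         # assumption: room for one Marker


def min_fpdu_buffer():
    """Smallest reassembly buffer that holds a minimum-size FPDU."""
    pad = (-(ULPDU_LENGTH_OCTETS + MIN_MULPDU)) % 4  # pad to a 4-octet multiple
    return ULPDU_LENGTH_OCTETS + MIN_MULPDU + pad + CRC_OCTETS + MARKER_OCTETS


def reassembly_buffer_size(link_mtu):
    """Buffer sized to the local link MTU, but never below the FPDU floor."""
    return max(link_mtu, min_fpdu_buffer())


if __name__ == "__main__":
    print(min_fpdu_buffer(), reassembly_buffer_size(100), reassembly_buffer_size(1500))
```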
Note that if the TCP segment were not stored, it is possible to Note that if the TCP segment were not stored, it is possible to
deadlock the MPA algorithm. If the path MTU is reduced, FPDU deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks. can never be successfully transmitted and the protocol deadlocks.
skipping to change at page 65, line 16 skipping to change at page 65, line 16
The RDMA Consortium created early specifications of the MPA/DDP/RDMA The RDMA Consortium created early specifications of the MPA/DDP/RDMA
protocols and some manufacturers created implementations of those protocols and some manufacturers created implementations of those
protocols before the IETF versions were finalized. These protocols protocols before the IETF versions were finalized. These protocols
are very similar to the IETF versions, making it possible for are very similar to the IETF versions, making it possible for
implementations to be created or modified to support either set of implementations to be created or modified to support either set of
specifications. For those interested, the RDMA Consortium protocol specifications. For those interested, the RDMA Consortium protocol
documents can be obtained at http://www.rdmaconsortium.org. documents can be obtained at http://www.rdmaconsortium.org.
In this section, implementations of MPA/DDP/RDMA that conform to the In this section, implementations of MPA/DDP/RDMA that conform to the
RDMAC specifications are called "RDMAC RNICs". Implementations of RDMAC specifications are called RDMAC RNICs. Implementations of
MPA/DDP/RDMA that conform to the IETF RFCs are called "IETF RNICs". MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.
Without the exchange of MPA Request/Reply Frames, there is no Without the exchange of MPA Request/Reply Frames, there is no
standard mechanism for enabling RDMAC RNICs to interoperate with IETF standard mechanism for enabling RDMAC RNICs to interoperate with IETF
RNICs. Even if a ULP uses a well-known port to start an IETF RNIC RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
immediately in RDMA mode (i.e., without exchanging the MPA immediately in RDMA mode (i.e., without exchanging the MPA
Request/Reply messages), there is no reason to believe an IETF RNIC Request/Reply messages), there is no reason to believe an IETF RNIC
will interoperate with an RDMAC RNIC because of the differences in will interoperate with an RDMAC RNIC because of the differences in
the version number in the DDP and RDMAP headers on the wire. the version number in the DDP and RDMAP headers on the wire.
Therefore, the ULP or other supporting entity at the RDMAC RNIC must Therefore, the ULP or other supporting entity at the RDMAC RNIC must
skipping to change at page 66, line 6 skipping to change at page 66, line 6
an RNIC can only interoperate with other IETF RNICs. an RNIC can only interoperate with other IETF RNICs.
Permissive IETF RNIC - an RNIC implementing the IETF protocols which Permissive IETF RNIC - an RNIC implementing the IETF protocols which
is capable of implementing the RDMAC protocols on a per is capable of implementing the RDMAC protocols on a per
connection basis. connection basis.
The Permissive IETF RNIC is recommended for those implementers that The Permissive IETF RNIC is recommended for those implementers that
want maximum interoperability with other RNIC implementations. want maximum interoperability with other RNIC implementations.
The values used by these three RNIC types for the MPA, DDP, and RDMAP The values used by these three RNIC types for the MPA, DDP, and RDMAP
versions as well as MPA markers and CRC are summarized in Figure 12. versions as well as MPA Markers and CRC are summarized in Figure 12.
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| RNIC TYPE || DDP/RDMAP | MPA | MPA | MPA | | RNIC TYPE || DDP/RDMAP | MPA | MPA | MPA |
| || Version | Revision | Markers | CRC | | || Version | Revision | Markers | CRC |
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| RDMAC || 0 | 0 | 1 | 1 | | RDMAC || 0 | 0 | 1 | 1 |
| || | | | | | || | | | |
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| IETF || 1 | 1 | 0 or 1 | 0 or 1 | | IETF || 1 | 1 | 0 or 1 | 0 or 1 |
| Non-permissive || | | | | | Non-permissive || | | | |
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| IETF || 1 or 0 | 1 or 0 | 0 or 1 | 0 or 1 | | IETF || 1 or 0 | 1 or 0 | 0 or 1 | 0 or 1 |
| permissive || | | | | | permissive || | | | |
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
Figure 12. Connection Parameters for the RNIC Types. Figure 12. Connection Parameters for the RNIC Types.
For MPA markers and MPA CRC, enabled=1, disabled=0. For MPA Markers and MPA CRC, enabled=1, disabled=0.
It is assumed there is no mixing of versions allowed between MPA, DDP It is assumed there is no mixing of versions allowed between MPA, DDP
and RDMAP. The RNIC either generates the RDMAC protocols on the wire and RDMAP. The RNIC either generates the RDMAC protocols on the wire
(version is zero) or the IETF protocols (version is one). (version is zero) or the IETF protocols (version is one).
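The version columns of Figure 12 can be expressed as sets, which makes the interoperability rule easy to check. The Python sketch below is only an illustration: the dictionary, the function name, and in particular the choice to prefer the highest shared version when two peers support more than one are assumptions, not something this section specifies.

```python
# Protocol versions each RNIC type can generate, taken from Figure 12.
SUPPORTED_VERSIONS = {
    "RDMAC": {0},
    "IETF Non-permissive": {1},
    "IETF permissive": {0, 1},
}


def common_version(rnic_a, rnic_b):
    """Return a version both RNIC types support, or None if they cannot interoperate.

    Assumption: when more than one version is shared, the highest is chosen.
    """
    shared = SUPPORTED_VERSIONS[rnic_a] & SUPPORTED_VERSIONS[rnic_b]
    return max(shared) if shared else None


if __name__ == "__main__":
    for a in SUPPORTED_VERSIONS:
        for b in SUPPORTED_VERSIONS:
            print(f"{a:20} <-> {b:20} version: {common_version(a, b)}")
```

Because MPA, DDP and RDMAP versions are not mixed, a single version number per connection is enough for this check.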
During the exchange of the MPA Request/Reply Frames, each peer During the exchange of the MPA Request/Reply Frames, each peer
provides its MPA Revision, Marker preference (M: 0=disabled, provides its MPA Revision, Marker preference (M: 0=disabled,
1=enabled), and CRC preference. The MPA Revision provided in the MPA 1=enabled), and CRC preference. The MPA Revision provided in the MPA
Request Frame and the MPA Reply Frame may differ. Request Frame and the MPA Reply Frame may differ.
skipping to change at page 67, line 8 skipping to change at page 67, line 8
Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate
with an RDMAC RNIC, despite the fact that both peers exchange MPA with an RDMAC RNIC, despite the fact that both peers exchange MPA
Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
negotiation has no effect on the DDP/RDMAP version and it is unable negotiation has no effect on the DDP/RDMAP version and it is unable
to interoperate with the RDMAC RNIC. to interoperate with the RDMAC RNIC.
The rows in the figure show the state of the Marker field in the MPA The rows in the figure show the state of the Marker field in the MPA
Request Frame sent by the MPA Initiator. The columns show the state Request Frame sent by the MPA Initiator. The columns show the state
of the Marker field in the MPA Reply Frame sent by the MPA Responder. of the Marker field in the MPA Reply Frame sent by the MPA Responder.
Each type of RNIC is shown as an initiator and a responder. The Each type of RNIC is shown as an Initiator and a Responder. The
connection results are shown in the lower right corner, at the connection results are shown in the lower right corner, at the
intersection of the different RNIC types, where V=0 is the RDMAC intersection of the different RNIC types, where V=0 is the RDMAC
DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
markers are disabled and M=1 means MPA markers are enabled. The Markers are disabled and M=1 means MPA Markers are enabled. The
negotiated marker state is shown as X/Y, for the receive direction of negotiated Marker state is shown as X/Y, for the receive direction of
the initiator/responder. the Initiator/Responder.
+---------------------------++-----------------------+ +---------------------------++-----------------------+
| MPA || MPA | | MPA || MPA |
| CONNECT || Responder | | CONNECT || Responder |
| MODE +-----------------++-------+---------------+ | MODE +-----------------++-------+---------------+
| | RNIC || RDMAC | IETF | | | RNIC || RDMAC | IETF |
| | TYPE || | Non-permissive| | | TYPE || | Non-permissive|
| | +------++-------+-------+-------+ | | +------++-------+-------+-------+
| | |MARKER|| M=1 | M=0 | M=1 | | | |MARKER|| M=1 | M=0 | M=1 |
+---------+----------+------++-------+-------+-------+ +---------+----------+------++-------+-------+-------+
skipping to change at page 69, line 11 skipping to change at page 69, line 11
If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
Request Frame setting the Rev field to one. Regardless of the value Request Frame setting the Rev field to one. Regardless of the value
of the M bit in the MPA Request Frame, the ULP or other supporting of the M bit in the MPA Request Frame, the ULP or other supporting
entity for the RDMAC RNIC will create an MPA Reply Frame with Rev entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
equal to zero and the M bit set to one. equal to zero and the M bit set to one.
When the Initiator reads the Rev field of the MPA Reply Frame and When the Initiator reads the Rev field of the MPA Reply Frame and
finds that its peer is an RDMAC RNIC, it must inform its ULP that it finds that its peer is an RDMAC RNIC, it must inform its ULP that it
should generate version zero DDP/RDMAP messages and enable MPA should generate version zero DDP/RDMAP messages and enable MPA
markers and CRC. Markers and CRC.
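The Initiator-side behavior just described might be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation from the draft: MpaFrame, UlpSettings and both function names are hypothetical, the frame is reduced to the Rev field and M bit only, and the case of an IETF peer (Rev equal to one) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MpaFrame:
    """Hypothetical reduced view of an MPA Request/Reply Frame."""
    rev: int  # Rev field: 0 = RDMAC, 1 = IETF
    m: int    # M bit: 0 = Markers disabled, 1 = Markers enabled


@dataclass
class UlpSettings:
    """Hypothetical record of what the Initiator tells its ULP."""
    ddp_rdmap_version: int
    markers_enabled: bool
    crc_enabled: bool


def permissive_initiator_request(m: int) -> MpaFrame:
    # A Permissive IETF RNIC acting as Initiator sets Rev to one;
    # the M bit carries its own Marker preference.
    return MpaFrame(rev=1, m=m)


def permissive_initiator_on_reply(reply: MpaFrame) -> Optional[UlpSettings]:
    # Rev equal to zero in the reply identifies the peer as an RDMAC
    # RNIC: generate version zero DDP/RDMAP and enable Markers and CRC.
    if reply.rev == 0:
        return UlpSettings(ddp_rdmap_version=0,
                           markers_enabled=True,
                           crc_enabled=True)
    # An IETF peer is outside the scope of this sketch.
    return None
```

Note that the result for an RDMAC peer does not depend on the M bit the Initiator sent in its request, which matches the "regardless of the value of the M bit" wording above.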
11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC
For completeness, Figure 15 shows the results of MPA negotiation For completeness, Figure 15 shows the results of MPA negotiation
between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The
important point from this figure is that an IETF RNIC cannot detect important point from this figure is that an IETF RNIC cannot detect
whether its peer is a Permissive or Non-permissive RNIC. whether its peer is a Permissive or Non-permissive RNIC.
+---------------------------++-------------------------------+ +---------------------------++-------------------------------+
| MPA || MPA | | MPA || MPA |
skipping to change at page 72, line 40 skipping to change at page 72, line 40
Phone: 503-712-4106 Phone: 503-712-4106
Email: dave.b.minturn@intel.com Email: dave.b.minturn@intel.com
Jim Pinkerton Jim Pinkerton
Microsoft, Inc. Microsoft, Inc.
One Microsoft Way One Microsoft Way
Redmond, WA, USA 98052 Redmond, WA, USA 98052
Email: jpink@microsoft.com Email: jpink@microsoft.com
Hemal Shah Hemal Shah
Intel Corporation 16215 Alton Parkway
MS PTL1 Irvine, California 92619-7013 USA
1501 South Mopac Expressway, #400 Phone: +1 949 926-6941
Austin, Texas 78746 Email: hemal@broadcom.com
Phone: 512-732-3963
Email: hemal.shah@intel.com
Allyn Romanow Allyn Romanow
Cisco Systems Cisco Systems
170 W Tasman Drive 170 W Tasman Drive
San Jose, CA 95134 USA San Jose, CA 95134 USA
Phone: +1 408 525 8836 Phone: +1 408 525 8836
Email: allyn@cisco.com Email: allyn@cisco.com
Tom Talpey Tom Talpey
Network Appliance Network Appliance
375 Totten Pond Road 375 Totten Pond Road
Waltham, MA 02451 USA Waltham, MA 02451 USA
Phone: +1 (781) 768-5329 Phone: +1 (781) 768-5329
EMail: thomas.talpey@netapp.com EMail: thomas.talpey@netapp.com
Patricia Thaler Patricia Thaler
Agilent Technologies, Inc. Broadcom
1101 Creekside Ridge Drive, #100 16215 Alton Parkway
M/S-RG10 Irvine, CA 92618
Roseville, CA 95678 Phone: 916 570 2707
Phone: +1-916-788-5662 pthaler@broadcom.com
email: pat_thaler@agilent.com
Jim Wendt Jim Wendt
Hewlett Packard Corporation Hewlett Packard Corporation
8000 Foothills Boulevard MS 5668 8000 Foothills Boulevard MS 5668
Roseville, CA 95747-5668 USA Roseville, CA 95747-5668 USA
Phone: +1 916 785 5198 Phone: +1 916 785 5198
Email: jim_wendt@hp.com Email: jim_wendt@hp.com
Jim Williams Jim Williams
Emulex Corporation Emulex Corporation
580 Main Street 580 Main Street
Bolton, MA 01740 USA Bolton, MA 01740 USA
Phone: +1 978 779 7224 Phone: +1 978 779 7224
Email: jim.williams@emulex.com Email: jim.williams@emulex.com
Full Copyright Statement Full Copyright Statement
This document and the information contained herein is provided on an
"AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
This document and the information contained herein are provided on an This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Copyright (C) The Internet Society (2005). This document is subject Copyright (C) The Internet Society (2006). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights. except as set forth therein, the authors retain all their rights.
Intellectual Property Intellectual Property
The IETF takes no position regarding the validity or scope of any The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has might or might not be available; nor does it represent that it has
 End of changes. 239 change blocks. 
570 lines changed or deleted 638 lines changed or added

This html diff was produced by rfcdiff 1.31. The latest version is available from http://www.levkowetz.com/ietf/tools/rfcdiff/