draft-ietf-rddp-mpa-02.txt   draft-ietf-rddp-mpa-03.txt 
Remote Direct Data Placement Work Group P. Culley Remote Direct Data Placement Work Group P. Culley
INTERNET-DRAFT Hewlett-Packard Company INTERNET-DRAFT Hewlett-Packard Company
draft-ietf-rddp-mpa-02.txt U. Elzur draft-ietf-rddp-mpa-03.txt U. Elzur
Broadcom Corporation Broadcom Corporation
R. Recio R. Recio
IBM Corporation IBM Corporation
S. Bailey S. Bailey
Sandburst Corporation Sandburst Corporation
J. Carrier J. Carrier
Adaptec Cray Inc.
Expires: August 2005 February 2, 2004 Expires: April 2006 September 27, 2005
Marker PDU Aligned Framing for TCP Specification Marker PDU Aligned Framing for TCP Specification
Status of this Memo Status of this Memo
By submitting this Internet-Draft, I certify that any applicable By submitting this Internet-Draft, each author represents that any
patent or other IPR claims of which I am aware have been disclosed, applicable patent or other IPR claims of which he or she is aware
or will be disclosed, and any of which I become aware will be have been or will be disclosed, and any of which he or she becomes
disclosed, in accordance with RFC 3668. aware will be disclosed, in accordance with Section 6 of BCP 79.
By submitting this Internet-Draft, I accept the provisions of Section
4 of RFC 3667.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
Drafts. Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft
Shadow Directories can be accessed at http://www.ietf.org/shadow.html Shadow Directories can be accessed at http://www.ietf.org/shadow.html
Abstract Abstract
A framing protocol is defined for TCP that is fully compliant with MPA (Marker Protocol data unit Aligned framing) is designed to work
applicable TCP RFCs and fully interoperable with existing TCP as an "adaptation layer" between TCP and the Direct Data Placement
implementations. The framing mechanism is designed to work as an [DDP] protocol, preserving the reliable, in-order delivery of TCP,
"adaptation layer" between TCP and the Direct Data Placement [DDP] while adding the preservation of higher-level protocol record
protocol, preserving the reliable, in-order delivery of TCP, while boundaries that DDP requires. MPA is fully compliant with applicable
adding the preservation of higher-level protocol record boundaries TCP RFCs and can be utilized with existing TCP implementations. MPA
that DDP requires. also supports integrated implementations that combine TCP, MPA and
DDP to reduce buffering requirements in the implementation and
improve performance at the system level.
Table of Contents Table of Contents
Status of this Memo.................................................1 Status of this Memo 1
Abstract............................................................1 Abstract 1
1 Introduction.................................................6 1 Glossary 7
1.1 Motivation...................................................6 2 Introduction 9
1.2 Protocol Overview............................................6 2.1 Motivation 9
2 Glossary....................................................10 2.2 Protocol Overview 9
3 LLP and DDP requirements....................................12 3 LLP and DDP requirements 13
3.1 TCP implementation Requirements to support MPA..............12 3.1 TCP implementation Requirements to support MPA 13
3.1.1 TCP Transmit side...........................................12 3.1.1 TCP Transmit side 13
3.1.2 TCP Receive side............................................12 3.1.2 TCP Receive side 14
3.2 MPA's interactions with DDP.................................13 3.2 MPA's interactions with DDP 15
4 FPDU Formats................................................15 4 FPDU Formats 17
4.1 Marker Format...............................................16 4.1 Marker Format 18
5 Data Transfer Semantics.....................................17 5 Data Transfer Semantics 19
5.1 MPA Markers.................................................17 5.1 MPA Markers 19
5.2 CRC Calculation.............................................19 5.2 CRC Calculation 22
5.3 MPA on TCP Sender Segmentation..............................22 5.3 MPA on TCP Sender Segmentation 25
5.3.1 Effects of MPA on TCP Segmentation..........................22 5.3.1 Effects of MPA on TCP Segmentation 26
5.3.2 FPDU Size Considerations....................................24 5.3.2 FPDU Size Considerations 28
5.4 MPA Receiver FPDU Identification............................25 5.4 MPA Receiver FPDU Identification 29
5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....26 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 30
6 Connection Semantics........................................27 6 Connection Semantics 31
6.1 Connection setup............................................27 6.1 Connection setup 31
6.1.1 MPA Request and Reply Frame Format..........................31 6.1.1 MPA Request and Reply Frame Format 33
6.1.2 Example Delayed Startup sequence............................32 6.1.2 Connection Startup Rules 34
6.1.3 Use of "Private Data".......................................35 6.1.3 Example Delayed Startup sequence 38
6.1.4 "Dual Stack" implementations................................38 6.1.4 Use of "Private Data" 41
6.2 Normal Connection Teardown..................................39 6.1.5 "Dual Stack" implementations 44
7 Error Semantics.............................................40 6.2 Normal Connection Teardown 45
8 Security Considerations.....................................41 7 Error Semantics 46
8.1 Protocol-specific Security Considerations...................41 8 Security Considerations 47
8.1.1 Spoofing....................................................41 8.1 Protocol-specific Security Considerations 47
8.1.2 Eavesdropping...............................................42 8.1.1 Spoofing 47
8.2 Introduction to Security Options............................43 8.1.2 Eavesdropping 48
8.3 Using IPsec With MPA........................................43 8.2 Introduction to Security Options 49
8.4 Requirements for IPsec Encapsulation of DDP.................44 8.3 Using IPsec With MPA 49
9 IANA Considerations.........................................45 8.4 Requirements for IPsec Encapsulation of MPA/DDP 50
10 References..................................................46 9 IANA Considerations 51
10.1 Normative References........................................46 10 References 52
10.2 Informative References......................................46 10.1 Normative References 52
11 Appendix....................................................48 10.2 Informative References 52
11.1 Analysis of MPA over TCP Operations.........................48 11 Appendix 54
11.1.1 Assumptions...............................................48 11.1 Analysis of MPA over TCP Operations 54
11.1.2 The Value of Header Alignment.............................49 11.1.1 Assumptions 55
11.2 Receiver implementation.....................................57 11.1.2 The Value of FPDU Alignment 56
11.2.1 Network Layer Reassembly Buffers..........................57 11.2 Receiver implementation 63
11.2.2 TCP Reassembly buffers....................................58 11.2.1 Network Layer Reassembly Buffers 63
11.3 IETF RNIC Interoperability with RDMA Consortium Protocols...59 11.2.2 TCP Reassembly buffers 64
11.3.1 Negotiated Parameters.....................................59 11.3 IETF Implementation Interoperability with RDMA Consortium
11.3.2 RDMAC RNIC and Non-permissive IETF RNIC...................60 Protocols 65
11.3.3 RDMAC RNIC and Permissive IETF RNIC.......................62 11.3.1 Negotiated Parameters 65
11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC.........63 11.3.2 RDMAC RNIC and Non-permissive IETF RNIC 66
12 Author's Addresses..........................................64 11.3.3 RDMAC RNIC and Permissive IETF RNIC 68
13 Acknowledgments.............................................65 11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 69
14 Full Copyright Statement....................................68 12 Author's Addresses 70
13 Acknowledgments 71
Full Copyright Statement 74
Intellectual Property 74
Table of Figures Table of Figures
Figure 1 ULP MPA TCP Layering.......................................8 Figure 1 ULP MPA TCP Layering 10
Figure 2 FPDU Format...............................................15 Figure 2 FPDU Format 17
Figure 3 Marker Format.............................................16 Figure 3 Marker Format 18
Figure 4 Example FPDU Format with Marker...........................18 Figure 4 Example FPDU Format with Marker 20
Figure 5 Annotated Hex Dump of an FPDU.............................21 Figure 5 Annotated Hex Dump of an FPDU 24
Figure 6 Annotated Hex Dump of an FPDU with Marker.................21 Figure 6 Annotated Hex Dump of an FPDU with Marker 25
Figure 7 "MPA Request/Reply Frame".................................31 Figure 7 "MPA Request/Reply Frame" 33
Figure 8: Example Delayed Startup negotiation......................33 Figure 8: Example Delayed Startup negotiation 39
Figure 9: Example Immediate Startup negotiation....................36 Figure 9: Example Immediate Startup negotiation 42
Figure 10: Non-aligned FPDU freely placed in TCP octet stream......51 Figure 10: Non-aligned FPDU freely placed in TCP octet stream 58
Figure 11: Aligned FPDU placed immediately after TCP header........53 Figure 11: Aligned FPDU placed immediately after TCP header 59
Figure 12. Connection Parameters for the RNIC Types................60 Figure 12. Connection Parameters for the RNIC Types. 66
Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC..........................................................61 IETF RNIC. 67
Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC..........................................................62 IETF RNIC. 68
Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC...............................................63 Permissive IETF RNIC. 69
Revision history Revision history [To be deleted prior to RFC publication]
[draft-ietf-rddp-mpa-03] workgroup draft with following changes:
Tweaked abstract to give a bit more information.
Tightened definition and usage of "deliver"
Cleaned up usage of terms "FPDU Alignment" and "Header
Alignment"
Rearranged overview sections with stack and glossary earlier
Mentioned how a non-MPA-Aware TCP MPA receiver deals with out
of order segments (it doesn't have to...)
Fixed description of out of order segment handling in section
3.1.1
Added text saying that ordering and completion indications are
used to deliver to DDP
Added redundant text indicating low two bits of FPDUPTR must
always be zero and treated as such in Section 4.1
Added redundant text indicating markers are always included in a
CRC calculation
Removed indication saying that an implementation can "ignore" an
administrative input to not use CRCs; clarified that both ends
have to agree to not use CRC (as originally intended).
Changed example FPDU hex dump format for greater clarity
Clarified that EMSS shrinking below 128 bytes is the condition
(rather than "very small sizes")
Put connection startup rules after the start frame formats
Added Initiator "private data" to figure 9
Removed or Clarified use of RNIC term
Added intro to IETF/RDMAC interoperability appendix and gave a
web reference for docs; also recommended use of "permissive IETF
RNIC"
Numerous minor clarifications
Updated Boilerplates per current requirements
[draft-ietf-rddp-mpa-02] workgroup draft with following changes: [draft-ietf-rddp-mpa-02] workgroup draft with following changes:
Made IPSEC must implement, optional to use. Made IPsec must implement, optional to use.
Updated Marker language to clarify that it points to ULPDU Updated Marker language to clarify that it points to ULPDU
Length even when marker precedes FPDU. Length even when marker precedes FPDU.
Clarified when to start using markers (in full operation mode). Clarified when to start using markers (in full operation mode).
Added informative text on interoperability with RDMAC RNICs. Added informative text on interoperability with RDMAC RNICs.
Reduced "Private Data" to 512 octets max. Reduced "Private Data" to 512 octets max.
skipping to change at page 6, line 5 skipping to change at page 7, line 5
Added clarifications of the MPA/TCP interaction for optimized Added clarifications of the MPA/TCP interaction for optimized
implementations and that any such optimizations are to be used implementations and that any such optimizations are to be used
only when requested by MPA. only when requested by MPA.
Note: a discussion of reasons for these changes can be found in Note: a discussion of reasons for these changes can be found in
[ELZER-MPA]. [ELZER-MPA].
[draft-culley-iwarp-mpa-01] initial draft. [draft-culley-iwarp-mpa-01] initial draft.
1 Introduction 1 Glossary
Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for
use. A PDU is Delivered in the exact order that it was sent by
the original sender; MPA uses TCP's byte stream ordering to
determine when Delivery is possible. This is specifically
different from "passing the PDU to DDP", which may generally
occur in any order, while the order of "Delivery" is strictly
defined.
EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
and the current path Maximum Transfer Unit (MTU) [RFC1191].
FPDU - Framed Protocol Data Unit. The unit of data created by an MPA
sender.
FPDU Alignment - the property that an FPDU is Header Aligned with the
TCP segment, and the TCP segment includes an integer number of
FPDUs. A TCP segment with FPDU Alignment allows immediate
processing of the contained FPDUs without waiting on other TCP
segments to arrive or combining with prior segments.
Header Alignment - the property that a TCP segment begins with an
FPDU. The FPDU is "Header Aligned" when the FPDU header is
exactly at the start of the TCP segment (right behind the TCP
headers on the wire).
MPA-aware TCP - a TCP implementation that is aware of the receiver
efficiencies of MPA FPDU Alignment and is capable of sending TCP
segments that begin with an FPDU.
MPA-enabled - MPA is enabled if the MPA protocol is visible on the
wire. When the sender is MPA-enabled, it is inserting framing
and markers. When the receiver is MPA-enabled, it is
interpreting framing and markers.
MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This
document defines the MPA protocol.
MULPDU - Maximum ULPDU. The current maximum size of the record that
is acceptable for DDP to pass to MPA for transmission.
Node - A computing device attached to one or more links of a Network.
A Node in this context does not refer to a specific application
or protocol instantiation running on the computer. A Node may
consist of one or more MPA on TCP devices installed in a host
computer.
PDU - protocol data unit
Remote Peer - The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two
Nodes.
ULP - Upper Layer Protocol. The protocol layer above the protocol
layer currently being referenced. The ULP for MPA is DDP [DDP].
ULPDU - Upper Layer Protocol Data Unit. The data record defined by
the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP
Segment".
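As an illustration of the Header Alignment and FPDU Alignment definitions above, the following Python sketch walks a TCP segment payload and reports whether it begins with an FPDU and holds an integer number of FPDUs. It assumes the marker-free FPDU layout of section 4 (a 16-bit ULPDU_Length in network byte order, the ULPDU, 0-3 octets of PAD up to a 4-octet boundary, and a 32-bit CRC field). The function names are hypothetical and not part of this specification, and the CRC is not verified here; a real receiver would validate the CRC before trusting ULPDU_Length.

```python
import struct

HDR_LEN = 2   # ULPDU_Length field, in octets (assumed layout, see lead-in)
CRC_LEN = 4   # CRC field, in octets

def fpdu_length(ulpdu_len: int) -> int:
    """Total size of a marker-free FPDU carrying ulpdu_len octets of ULPDU."""
    pad = -(HDR_LEN + ulpdu_len) % 4          # 0-3 octets of PAD
    return HDR_LEN + ulpdu_len + pad + CRC_LEN

def has_fpdu_alignment(payload: bytes) -> bool:
    """True if payload starts with an FPDU and contains only whole FPDUs."""
    offset = 0
    while offset < len(payload):
        if len(payload) - offset < HDR_LEN:
            return False                      # not even a full length field left
        (ulpdu_len,) = struct.unpack_from("!H", payload, offset)
        offset += fpdu_length(ulpdu_len)      # hop to the next FPDU header
    # Aligned only if the hops land exactly on the end of a non-empty segment.
    return offset == len(payload) and offset > 0
```

A segment for which has_fpdu_alignment() is false is not an error; it only means the receiver may have to wait for, or combine with, other TCP segments.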
2 Introduction
This section discusses the reason for creating MPA on TCP and a This section discusses the reason for creating MPA on TCP and a
general overview of the protocol. Later sections show the MPA general overview of the protocol. Later sections show the MPA
headers (see section 4 on page 15), and detailed protocol headers (see section 4 on page 17), and detailed protocol
requirements and characteristics (see section 5 on page 17), as well requirements and characteristics (see section 5 on page 19), as well
as Connection Semantics (section 6 on page 26), Error Semantics as Connection Semantics (section 6 on page 30), Error Semantics
(section 7 on page 40), and Security Considerations (section 8 on (section 7 on page 46), and Security Considerations (section 8 on
page 41). page 47).
1.1 Motivation 2.1 Motivation
The Direct Data Placement protocol [DDP], when used with TCP [RFC793] The Direct Data Placement protocol [DDP], when used with TCP [RFC793]
requires a mechanism to detect record boundaries. The DDP records requires a mechanism to detect record boundaries. The DDP records
are referred to as Upper Layer Protocol Data Units by this document. are referred to as Upper Layer Protocol Data Units by this document.
The ability to locate the Upper Layer Protocol Data Unit (ULPDU) The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
boundary is useful to a hardware network adapter that uses DDP to boundary is useful to a hardware network adapter that uses DDP to
directly place the data in the application buffer based on the directly place the data in the application buffer based on the
control information carried in the ULPDU header. This may be done control information carried in the ULPDU header. This may be done
without requiring that the packets arrive in order. Potential without requiring that the packets arrive in order. Potential
benefits of this capability are the avoidance of the memory copy benefits of this capability are the avoidance of the memory copy
skipping to change at page 6, line 48 skipping to change at page 9, line 48
examine the data stream at locations that are known to contain the examine the data stream at locations that are known to contain the
embedded control, the protocol can never misinterpret application embedded control, the protocol can never misinterpret application
data as being embedded control data. For unambiguous handling of an data as being embedded control data. For unambiguous handling of an
out of order packet, the deterministic approach is preferred. out of order packet, the deterministic approach is preferred.
The MPA protocol provides a framing mechanism for DDP running over The MPA protocol provides a framing mechanism for DDP running over
TCP using the deterministic approach. It allows the location of the TCP using the deterministic approach. It allows the location of the
ULPDU to be determined in the TCP stream even if the TCP segments ULPDU to be determined in the TCP stream even if the TCP segments
arrive out of order. arrive out of order.
1.2 Protocol Overview 2.2 Protocol Overview
The layering of PDUs with MPA is shown in Figure 1, below.
+------------------+
| ULP client |
+------------------+ <- Consumer messages
| DDP |
+------------------+ <- ULPDUs
| MPA |
+------------------+ <- FPDUs (containing ULPDUs)
| TCP* |
+------------------+ <- TCP Segments (containing FPDUs)
| IP etc. |
+------------------+
* TCP or MPA-aware TCP.
Figure 1 ULP MPA TCP Layering
MPA is described as an extra layer above TCP and below DDP. The MPA is described as an extra layer above TCP and below DDP. The
operation sequence is: operation sequence is:
1. A TCP connection is established by ULP action. This is done 1. A TCP connection is established by ULP action. This is done
using methods not described by this specification. The ULP may using methods not described by this specification. The ULP may
exchange some amount of data in streaming mode prior to starting exchange some amount of data in streaming mode prior to starting
MPA, but is not required to do so. MPA, but is not required to do so.
2. The Consumer negotiates the use of DDP and MPA at both ends of a 2. The Consumer negotiates the use of DDP and MPA at both ends of a
skipping to change at page 7, line 28 skipping to change at page 10, line 49
4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into
full operation and begins sending DDP data as further described full operation and begins sending DDP data as further described
below. In this document, DDP data chunks are called ULPDUs. For below. In this document, DDP data chunks are called ULPDUs. For
a description of the DDP data, see [DDP]. a description of the DDP data, see [DDP].
Following is a description of data transfer when MPA is in full Following is a description of data transfer when MPA is in full
operation. operation.
1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA 1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
for this value. MPA derives this information from TCP, when it for this value. MPA derives this information from TCP or IP,
is available, or chooses a reasonable value. This information is when it is available, or chooses a reasonable value.
already supported on many TCP implementations, including all
modern flavors of BSD networking, through the TCP_MAXSEG socket
option.
2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to 2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to
MPA at the sender. MPA at the sender.
3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a 3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
header, optionally inserting markers, and appending a CRC field header, optionally inserting markers, and appending a CRC field
after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP. after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.
4. The TCP sender puts the FPDUs into the TCP stream. If the TCP 4. The TCP sender puts the FPDUs into the TCP stream. If the TCP
Sender is MPA-aware, it segments the TCP stream in such a way Sender is MPA-aware, it segments the TCP stream in such a way
skipping to change at page 8, line 9 skipping to change at page 11, line 28
sender and receiver. sender and receiver.
6. The MPA receiver locates and assembles complete FPDUs within the 6. The MPA receiver locates and assembles complete FPDUs within the
stream, verifies their integrity, and removes MPA markers (when stream, verifies their integrity, and removes MPA markers (when
present), ULPDU_Length, PAD and the CRC field. present), ULPDU_Length, PAD and the CRC field.
7. MPA then provides the complete ULPDUs to DDP. MPA may also 7. MPA then provides the complete ULPDUs to DDP. MPA may also
separate passing MPA payload to DDP from passing the MPA payload separate passing MPA payload to DDP from passing the MPA payload
ordering information. ordering information.
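Steps 2 and 3 of the data transfer description above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it omits markers, assumes the integrity check is CRC-32C (the polynomial used by SCTP), and assumes PAD octets are zero. The names crc32c and frame_ulpdu are hypothetical. The sketch returns the CRC as an integer rather than as wire octets, because the octet order of the CRC field is defined in section 5.2 and is deliberately not reproduced here.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def frame_ulpdu(ulpdu: bytes, mulpdu: int):
    """Return (covered_octets, crc_value) for one marker-free FPDU.

    covered_octets is ULPDU_Length + ULPDU + PAD, i.e. everything the CRC
    is computed over in this simplified, marker-free case.
    """
    if len(ulpdu) > mulpdu:
        raise ValueError("ULPDU exceeds MULPDU")
    header = struct.pack("!H", len(ulpdu))          # 16-bit ULPDU_Length
    pad = bytes(-(len(header) + len(ulpdu)) % 4)    # 0-3 zero octets of PAD
    covered = header + ulpdu + pad
    return covered, crc32c(covered)
```

For example, frame_ulpdu(b"hello", 1024) yields 8 covered octets (2 of length, 5 of ULPDU, 1 of PAD), to which the sender appends the 4-octet CRC field before handing the FPDU to TCP.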
The layering of PDUs with MPA is shown in Figure 1, below.
MPA-aware TCP is a TCP layer which potentially contains some MPA-aware TCP is a TCP layer which potentially contains some
additional semantics as defined in this document. MPA is implemented additional semantics as defined in this document. MPA is implemented
as a data stream ULP for TCP and is therefore RFC compliant. MPA- as a data stream ULP for TCP and is therefore RFC compliant. MPA-
aware TCP is RFC compliant. aware TCP is RFC compliant.
+------------------+
| ULP client |
+------------------+ <- Consumer messages
| DDP |
+------------------+ <- ULPDUs
| MPA |
+------------------+ <- FPDUs (containing ULPDUs)
| TCP* |
+------------------+ <- TCP Segments (containing FPDUs)
| IP etc. |
+------------------+
* TCP or MPA-aware TCP.
Figure 1 ULP MPA TCP Layering
An MPA-aware TCP sender is able to segment the data stream such that An MPA-aware TCP sender is able to segment the data stream such that
TCP segments begin with FPDUs (FPDU Alignment). This has significant TCP segments begin with FPDUs (FPDU Alignment). This has significant
advantages for receivers. When segments arrive with aligned FPDUs advantages for receivers. When segments arrive with aligned FPDUs
the receiver usually need not buffer any portion of the segment, the receiver usually need not buffer any portion of the segment,
allowing DDP to place it in its destination memory immediately, thus allowing DDP to place it in its destination memory immediately, thus
avoiding copies from intermediate buffers (DDP's reason for avoiding copies from intermediate buffers (DDP's reason for
existence). existence).
MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation
to recover ULPDUs that may be received out of order. This enables a to locate the start of ULPDUs that may be received out of order. It
DDP on MPA implementation to save a significant amount of also allows the implementation to determine if the entire ULPDU has
intermediate storage by placing the ULPDUs in the right locations in been received. As a result, MPA can pass out of order ULPDUs to DDP
the application buffers when they arrive, rather than waiting until for immediate use. This enables a DDP on MPA implementation to save
full ordering can be restored. a significant amount of intermediate storage by placing the ULPDUs in
the right locations in the application buffers when they arrive,
rather than waiting until full ordering can be restored.
The ability of a receiver to recover out of order ULPDUs is optional The ability of a receiver to recover out of order ULPDUs is optional
and declared to the transmitter during startup. When the receiver and declared to the transmitter during startup. When the receiver
declares that it does not support out of order recovery, the declares that it does not support out of order recovery, the
transmitter does not add the control information to the data stream transmitter does not add the control information to the data stream
needed for out of order recovery. needed for out of order recovery.
If TCP is not MPA-aware, then MPA receives a strictly ordered stream
of data and does not deal with out of order ULPDUs. In this case MPA
passes each ULPDU to DDP when the last bytes arrive from TCP, along
with the indication that they are in order.
MPA implementations that support recovery of out of order ULPDUs MUST MPA implementations that support recovery of out of order ULPDUs MUST
support a mechanism to indicate the ordering of ULPDUs as the sender support a mechanism to indicate the ordering of ULPDUs as the sender
transmitted them and indicate when missing intermediate segments transmitted them and indicate when missing intermediate segments
arrive. These mechanisms allow DDP to reestablish record ordering arrive. These mechanisms allow DDP to reestablish record ordering
and report Delivery of complete messages (groups of records). and report Delivery of complete messages (groups of records).
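The distinction between passing a ULPDU to DDP and Delivery can be sketched as below. This Python fragment is only one possible bookkeeping scheme, under stated assumptions: FPDUs are identified by their offset in the TCP octet stream, offsets are plain integers (32-bit sequence number wraparound is ignored), and the class and method names are hypothetical.

```python
class DeliveryTracker:
    """Records FPDUs passed to DDP in any order; reports Delivery in order."""

    def __init__(self, start_offset: int = 0):
        self.next_offset = start_offset  # first stream octet not yet Delivered
        self.pending = {}                # offset -> (length, tag), not yet Delivered

    def placed(self, offset: int, length: int, tag):
        """Note an FPDU passed to DDP; return the tags that become Delivered."""
        self.pending[offset] = (length, tag)
        delivered = []
        # Delivery advances only while the stream is contiguous from next_offset.
        while self.next_offset in self.pending:
            length, tag = self.pending.pop(self.next_offset)
            delivered.append(tag)
            self.next_offset += length
        return delivered
```

If an FPDU at offset 12 arrives first, placed() returns an empty list; when the missing FPDU at offset 0 arrives, both are reported as Delivered, in the order the sender transmitted them.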
MPA also addresses enhanced data integrity. Many users of TCP have MPA also addresses enhanced data integrity. Some users of TCP have
noted that the TCP checksum is not as strong as could be desired noted that the TCP checksum is not as strong as could be desired
[CRCTCP]. Studies have shown that the TCP checksum indicates (see [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP
segments in error at a much higher rate than the underlying link checksum indicates segments in error at a much higher rate than the
characteristics would indicate. With these higher error rates, the underlying link characteristics would indicate. With these higher
chance that an error will escape detection, when using only the TCP error rates, the chance that an error will escape detection, when
checksum for data integrity, becomes a concern. A stronger integrity using only the TCP checksum for data integrity, becomes a concern. A
check can reduce the chance of data errors being missed. stronger integrity check can reduce the chance of data errors being
missed.
MPA includes a CRC check to increase the ULPDU data integrity to the MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC2960]. It level provided by other modern protocols, such as SCTP [RFC2960]. It
is possible to disable this CRC check, however CRCs MUST be enabled is possible to disable this CRC check, however CRCs MUST be enabled
unless it is clear that the end to end connection through the network unless it is clear that the end to end connection through the network
has data integrity at least as good as an MPA with CRC enabled (for has data integrity at least as good as an MPA with CRC enabled (for
example when IPSEC is implemented end to end). DDP's ULP expects example when IPsec is implemented end to end). DDP's ULP expects
this level of data integrity and therefore the ULP does not have to this level of data integrity and therefore the ULP does not have to
provide its own duplicate data integrity and error recovery for lost provide its own duplicate data integrity and error recovery for lost
data. data.
2 Glossary 3 LLP and DDP requirements
Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for
use. This is specifically different from "passing the PDU to
DDP", which may generally occur in any order, while the order of
"Delivery" is strictly defined.
EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
and the current path Maximum Transfer Unit (MTU) [RFC1191].
FPDU - Framing Protocol Data Unit. The unit of data created by an
MPA sender.
FPDU Alignment - the property that a TCP segment begins with an FPDU.
Header Alignment - the property that a TCP segment begins with an
FPDU and the TCP segment includes an integer number of FPDUs.
PDU - protocol data unit
MPA-aware TCP - a TCP implementation that is aware of the receiver
efficiencies of MPA Header Alignment and is capable of sending
TCP segments that begin with an FPDU.
MPA-enabled - MPA is enabled if the MPA protocol is visible on the
wire. When the sender is MPA-enabled, it is inserting framing
and markers. When the receiver is MPA-enabled, it is
interpreting framing and markers.
MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This
document defines the MPA protocol.
MULPDU - Maximum ULPDU. The current maximum size of the record that
is acceptable for DDP to pass to MPA for transmission.
Node - A computing device attached to one or more links of a Network.
A Node in this context does not refer to a specific application
or protocol instantiation running on the computer. A Node may
consist of one or more MPA on TCP devices installed in a host
computer.
Remote Peer - The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two
Nodes.
ULP - Upper Layer Protocol. The protocol layer above the protocol
layer currently being referenced. The ULP for MPA is DDP [DDP].
ULPDU - Upper Layer Protocol Data Unit. The data record defined by The following sections describe requirements on TCP and DDP to
the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP utilize MPA. The DDP requirements enable the correct operation over
Segment". MPA and TCP (as opposed to DDP over SCTP or other LLPs).
3 LLP and DDP requirements The TCP requirements are mostly intended to support the "MPA-aware
TCP" variation, which allows implementations that require less buffer
memory and may provide better overall system performance.
3.1 TCP implementation Requirements to support MPA 3.1 TCP implementation Requirements to support MPA
The TCP implementation MUST inform MPA when the TCP connection is The TCP implementation MUST inform MPA when the TCP connection is
closed or has begun closing the connection (e.g. received a FIN). closed or has begun closing the connection (e.g. received a FIN).
3.1.1 TCP Transmit side 3.1.1 TCP Transmit side
To provide optimum performance, an MPA-aware transmit side TCP To provide optimum performance, an MPA-aware transmit side TCP
implementation SHOULD be enabled to: implementation SHOULD be enabled to:
skipping to change at page 13, line 10 skipping to change at page 14, line 23
layer MUST have committed to keeping each segment before it can layer MUST have committed to keeping each segment before it can
be passed to the MPA. This means that the segment must have be passed to the MPA. This means that the segment must have
passed the TCP, IP, and lower layer data integrity validation passed the TCP, IP, and lower layer data integrity validation
(i.e., checksum), must be in the receive window, must not be a (i.e., checksum), must be in the receive window, must not be a
duplicate, must be part of the same epoch (if timestamps are used duplicate, must be part of the same epoch (if timestamps are used
to verify this) and any other checks required by TCP RFCs. The to verify this) and any other checks required by TCP RFCs. The
segment MUST NOT be passed to MPA more than once unless segment MUST NOT be passed to MPA more than once unless
explicitly requested (see Section 7). explicitly requested (see Section 7).
This is not to imply that the data must be completely ordered This is not to imply that the data must be completely ordered
before use. An implementation may accept out of order segments, before use. An implementation MAY accept out of order segments,
SACK them [RFC2018], and pass them to DDP when the reception of SACK them [RFC2018], and pass them to DDP immediately, before the
the segments needed to fill in the gaps arrive. Such an reception of the segments needed to fill in the gaps arrive.
implementation can "commit" to the data early on, and will not Such an implementation MUST "commit" to the data early on, and
overwrite it even if (or when) duplicate data arrives. MPA MUST NOT overwrite it even if (or when) duplicate data arrives.
expects to utilize this "commit" to allow the passing of ULPDUs MPA expects to utilize this "commit" to allow the passing of
to DDP when they arrive, independent of ordering. ULPDUs to DDP when they arrive, independent of ordering. DDP
uses the passed ULPDU to "place" the DDP segments (see [DDP] for
more details).
* Provide a mechanism to indicate the ordering of TCP segments as * Provide a mechanism to indicate the ordering of TCP segments as
the sender transmitted them. One possible mechanism might be the sender transmitted them. One possible mechanism might be
attaching the TCP sequence number to each segment. attaching the TCP sequence number to each segment.
* Provide a mechanism to indicate when a given TCP segment (and the * Provide a mechanism to indicate when a given TCP segment (and the
prior TCP stream) is complete. One possible mechanism might be prior TCP stream) is complete. One possible mechanism might be
to utilize the leading (left) edge of the TCP Receive Window. to utilize the leading (left) edge of the TCP Receive Window.
MPA uses the ordering and completion indications to inform DDP
when a ULPDU is complete; MPA "delivers" the FPDU to DDP. DDP
uses the indications to "deliver" its messages to the DDP
consumer (see [DDP] for more details).
DDP on MPA MUST utilize these two mechanisms to establish the DDP on MPA MUST utilize these two mechanisms to establish the
Delivery semantics that DDP's consumers agree to. These Delivery semantics that DDP's consumers agree to. These
semantics are described fully in [DDP]. These include semantics are described fully in [DDP]. These include
requirements on DDP's consumer to respect ownership of buffers requirements on DDP's consumer to respect ownership of buffers
prior to the time that DDP delivers them to the consumer. prior to the time that DDP delivers them to the consumer.
An MPA-aware TCP receive side implementation MUST continue to buffer An MPA-aware TCP receive side implementation MUST continue to buffer
TCP segments until completely ordered and then deliver them as TCP segments until completely ordered and then deliver them as
expected by non-MPA applications (and described in TCP RFCs) when MPA expected by non-MPA applications (and described in TCP RFCs) when MPA
is not enabled on the connection. When MPA is enabled above an MPA- is not enabled on the connection. When MPA is enabled above an MPA-
skipping to change at page 14, line 45 skipping to change at page 16, line 13
the MPA implementation SHOULD: the MPA implementation SHOULD:
* Pass each ULPDU with its length to DDP as soon as it has been * Pass each ULPDU with its length to DDP as soon as it has been
fully received and validated. fully received and validated.
* Provide a mechanism to indicate the ordering of ULPDUs as the * Provide a mechanism to indicate the ordering of ULPDUs as the
sender transmitted them. One possible mechanism might be sender transmitted them. One possible mechanism might be
providing the TCP sequence number for each ULPDU. providing the TCP sequence number for each ULPDU.
* Provide a mechanism to indicate when a given ULPDU (and prior * Provide a mechanism to indicate when a given ULPDU (and prior
ULPDUs) are complete. One possible mechanism might be to allow ULPDUs) are complete (delivered to DDP). One possible mechanism
DDP to see the current outgoing TCP Ack sequence number. might be to allow DDP to see the current outgoing TCP Ack
sequence number.
* Provide an indication to DDP that the TCP has closed or has begun * Provide an indication to DDP that the TCP has closed or has begun
to close the connection (e.g. received a FIN). to close the connection (e.g. received a FIN).
MPA MUST provide the protocol version negotiated with its peer to MPA MUST provide the protocol version negotiated with its peer to
DDP. DDP will use this version to set the version in its header and DDP. DDP will use this version to set the version in its header and
to report the version to RDMAP. to report the version to RDMAP.
4 FPDU Formats 4 FPDU Formats
skipping to change at page 15, line 41 skipping to change at page 17, line 41
support the largest IP datagrams for IPv4 or IPv6. support the largest IP datagrams for IPv4 or IPv6.
PAD: The PAD field trails the ULPDU and contains between zero and PAD: The PAD field trails the ULPDU and contains between zero and
three octets of data. The pad data MUST be set to zero by the sender three octets of data. The pad data MUST be set to zero by the sender
and ignored by the receiver (except for CRC checking). The length of and ignored by the receiver (except for CRC checking). The length of
the pad is set so as to make the size of the FPDU an integral the pad is set so as to make the size of the FPDU an integral
multiple of four. multiple of four.
CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C
check value, which is used to verify the entire contents of the FPDU, check value, which is used to verify the entire contents of the FPDU,
using CRC32C. See section 5.2 CRC Calculation on page 19. When CRCs using CRC32C. See section 5.2 CRC Calculation on page 22. When CRCs
are not enabled, this field is still present, may contain any value, are not enabled, this field is still present, may contain any value,
and MUST NOT be checked. and MUST NOT be checked.
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
addition, the total length of the FPDU will include the length of any addition, the total length of the FPDU will include the length of any
markers and from 0 to 3 pad octets added to round-up the ULPDU size. markers and from 0 to 3 pad octets added to round-up the ULPDU size.
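The size arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the specification; the function and constant names are hypothetical, and markers are deliberately left out of the total.

```python
# Hypothetical helper names; sizes follow the FPDU field descriptions above.
ULPDU_LENGTH_FIELD = 2  # octets in the "ULPDU Length" field
CRC_FIELD = 4           # octets in the CRC field


def pad_length(ulpdu_len: int) -> int:
    """PAD octets (0..3) that round length field + ULPDU up to a multiple of 4."""
    return -(ULPDU_LENGTH_FIELD + ulpdu_len) % 4


def fpdu_length(ulpdu_len: int) -> int:
    """Total FPDU size in octets, excluding any markers."""
    return ULPDU_LENGTH_FIELD + ulpdu_len + pad_length(ulpdu_len) + CRC_FIELD
```

For a 42 octet ULPDU no pad is needed and the FPDU is 48 octets, which is the minimum overhead of 6 octets. For a 16 octet ULPDU, two pad octets are added, giving 24 octets.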
4.1 Marker Format 4.1 Marker Format
The format of a marker MUST be as specified in Figure 3: The format of a marker MUST be as specified in Figure 3:
skipping to change at page 16, line 22 skipping to change at page 18, line 22
| RESERVED | FPDUPTR | | RESERVED | FPDUPTR |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3 Marker Format Figure 3 Marker Format
RESERVED: The Reserved field MUST be set to zero on transmit and RESERVED: The Reserved field MUST be set to zero on transmit and
ignored on receive (except for CRC calculation). ignored on receive (except for CRC calculation).
FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long,
interpreted as an unsigned integer, that indicates the number of interpreted as an unsigned integer, that indicates the number of
octets in the TCP stream from the beginning of the "ULPDU Length" octets in the TCP stream from the beginning of the "ULPDU Length"
field to the first octet of the entire marker. field to the first octet of the entire marker. The least significant
two bits MUST always be set to zero at the transmitter, and the
receivers MUST always treat these as zero for calculations.
5 Data Transfer Semantics 5 Data Transfer Semantics
This section discusses some characteristics and behavior of the MPA This section discusses some characteristics and behavior of the MPA
protocol as well as implications of that protocol. protocol as well as implications of that protocol.
5.1 MPA Markers 5.1 MPA Markers
MPA markers are used to identify the start of FPDUs when packets are MPA markers are used to identify the start of FPDUs when packets are
received out of order. This is done by locating the markers at fixed received out of order. This is done by locating the markers at fixed
intervals in the data stream (which is correlated to the TCP sequence intervals in the data stream (which is correlated to the TCP sequence
number) and using the marker value to locate the preceding FPDU number) and using the marker value to locate the preceding FPDU
start. start.
All MPA markers are included in the containing FPDU CRC calculation
(when both CRCs and markers are in use).
The MPA receiver's ability to locate out of order FPDUs and pass the The MPA receiver's ability to locate out of order FPDUs and pass the
ULPDUs to DDP is implementation dependent. MPA/DDP allows those ULPDUs to DDP is implementation dependent. MPA/DDP allows those
receivers that are able to deal with out of order FPDUs in this way receivers that are able to deal with out of order FPDUs in this way
to require the insertion of markers in the data stream. When the to require the insertion of markers in the data stream. When the
receiver cannot deal with out of order FPDUs in this way, it may receiver cannot deal with out of order FPDUs in this way, it may
disable the insertion of markers at the sender. All MPA senders MUST disable the insertion of markers at the sender. All MPA senders MUST
be able to generate markers when their use is declared by the be able to generate markers when their use is declared by the
opposing receiver (see section 6.1 Connection setup on page 27). opposing receiver (see section 6.1 Connection setup on page 31).
When Markers are enabled, MPA senders MUST insert a marker into the When Markers are enabled, MPA senders MUST insert a marker into the
data stream at a 512 octet periodic interval in the TCP Sequence data stream at a 512 octet periodic interval in the TCP Sequence
Number Space. The marker contains a 16 bit unsigned integer referred Number Space. The marker contains a 16 bit unsigned integer referred
to as the FPDUPTR (FPDU Pointer). to as the FPDUPTR (FPDU Pointer).
If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
relative back-pointer. FPDUPTR MUST contain the number of octets in relative back-pointer. FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the "ULPDU Length" field to the the TCP stream from the beginning of the "ULPDU Length" field to the
first octet of the marker, unless the marker falls between FPDUs. first octet of the marker, unless the marker falls between FPDUs.
Thus the location of the first octet of the previous FPDU header can Thus the location of the first octet of the previous FPDU header can
be determined by subtracting the value of the given marker from the be determined by subtracting the value of the given marker from the
current octet-stream sequence number (i.e. TCP sequence number) of current octet-stream sequence number (i.e. TCP sequence number) of
the first octet of the marker. Note that this computation must take the first octet of the marker. Note that this computation MUST take
into account that the TCP sequence number could have wrapped between into account that the TCP sequence number could have wrapped between
the marker and the header. the marker and the header.
An FPDUPTR value of 0x0000 is a special case - it is used when the An FPDUPTR value of 0x0000 is a special case - it is used when the
marker falls exactly between FPDUs (between the preceding FPDU CRC marker falls exactly between FPDUs (between the preceding FPDU CRC
field, and the next FPDU's "ULPDU Length" field). In this case, the field, and the next FPDU's "ULPDU Length" field). In this case, the
marker is considered to be contained in the following FPDU; the
marker MUST be included in the CRC calculation of the FPDU following marker MUST be included in the CRC calculation of the FPDU following
the marker (if CRCs are being generated or checked). Thus an FPDUPTR the marker (if CRCs are being generated or checked). Thus an FPDUPTR
value of 0x0000 means that immediately following the marker is an value of 0x0000 means that immediately following the marker is an
FPDU header (the "ULPDU Length" field). FPDU header (the "ULPDU Length" field).
Since all FPDUs are integral multiples of 4 octets, the bottom two Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero. MPA bits of the FPDUPTR as calculated by the sender are zero. MPA
reserves these bits so they MUST be treated as zero for computation reserves these bits so they MUST be treated as zero for computation
at the receiver. at the receiver.
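The back-pointer rules above (subtraction in sequence space, the 0x0000 special case, and the reserved low bits) might be combined at a receiver roughly as follows. This is a sketch under the assumption that the receiver already knows the sequence number of the first octet of the marker; the function name is hypothetical.

```python
SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide
MARKER_SIZE = 4        # 16 reserved bits plus the 16 bit FPDUPTR


def fpdu_start_seq(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the "ULPDU Length" field the marker refers to.

    marker_seq is the sequence number of the first octet of the marker.
    """
    offset = fpduptr & 0xFFFC  # low two bits are reserved, treat as zero
    if offset == 0:
        # Marker falls between FPDUs: the header immediately follows it.
        return (marker_seq + MARKER_SIZE) % SEQ_MODULUS
    # Modular subtraction covers a sequence number wrap between
    # the FPDU header and the marker.
    return (marker_seq - offset) % SEQ_MODULUS
```

Using stream offsets as stand-in sequence numbers, a marker at 0x0200 carrying FPDUPTR 0x0014 points back to an FPDU header at 0x01ec.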
When Markers are enabled (see section 6.1 Connection setup on page When Markers are enabled (see section 6.1 Connection setup on page
27), the MPA markers MUST be inserted immediately preceding the first 31), the MPA markers MUST be inserted immediately preceding the first
FPDU of full operation phase, and at every 512th octet of the TCP FPDU of full operation phase, and at every 512th octet of the TCP
octet stream thereafter. As a result, the first marker has an octet stream thereafter. As a result, the first marker has an
FPDUPTR value of 0x0000. If the first marker begins at octet FPDUPTR value of 0x0000. If the first marker begins at octet
sequence number SeqStart, then markers are inserted such that the sequence number SeqStart, then markers are inserted such that the
first octet of the marker is at octet sequence number SeqNum if the first octet of the marker is at octet sequence number SeqNum if the
remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum
can wrap. can wrap.
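The placement rule can be sketched as shown here. The names are hypothetical, and the sketch assumes 32 bit TCP sequence numbers; because 2^32 is a multiple of 512, reducing the difference modulo 2^32 first keeps the test correct across a wrap.

```python
MARKER_INTERVAL = 512
SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a marker begins at seq_num, given the first marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_MODULUS) % MARKER_INTERVAL == 0


def next_marker_seq(seq_num: int, seq_start: int) -> int:
    """Sequence number of the first marker at or after seq_num."""
    distance = (seq_num - seq_start) % SEQ_MODULUS
    gap = -distance % MARKER_INTERVAL
    return (seq_num + gap) % SEQ_MODULUS
```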
For example, if the TCP sequence number were used to calculate the For example, if the TCP sequence number were used to calculate the
insertion point of the marker, the starting TCP sequence number is insertion point of the marker, the starting TCP sequence number is
skipping to change at page 18, line 51 skipping to change at page 21, line 4
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| (0x0000) | FPDU ptr (0x000C) | | (0x0000) | FPDU ptr (0x000C) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ULPDU (octets 10-15) | | ULPDU (octets 10-15) |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | PAD (2 octets:0,0) | | | PAD (2 octets:0,0) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CRC | | CRC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4 Example FPDU Format with Marker Figure 4 Example FPDU Format with Marker
MPA Receivers MUST preserve ULPDU boundaries when passing data to MPA Receivers MUST preserve ULPDU boundaries when passing data to
DDP. MPA Receivers MUST pass the ULPDU data and the "ULPDU Length" to DDP. MPA Receivers MUST pass the ULPDU data and the "ULPDU Length" to
DDP and not the markers, headers, and CRC. DDP and not the markers, headers, and CRC.
5.2 CRC Calculation 5.2 CRC Calculation
An MPA implementation MUST implement CRC support and MUST either: An MPA implementation MUST implement CRC support and MUST either:
(1) always use CRCs (1) always use CRCs; the MPA provider is NOT REQUIRED to support
an administrator's request that CRCs not be used.
or or
(2) only negotiate the non-use of CRC on the explicit request of the (2a) only indicate a preference to not use CRCs on the explicit
system administrator, via an interface not defined in this spec. request of the system administrator, via an interface not defined
The default configuration for a connection MUST be to use CRCs. in this spec. The default configuration for a connection MUST be
to use CRCs.
(3) The MPA provider at either peer MAY ignore its administrator's (2b) disable CRC checking (and possibly generation) if both the local
request that CRCs not be used. and remote endpoints indicate preference to not use CRCs.
The decision for one host to request CRC suppression MAY be made on The decision for hosts to request CRC suppression MAY be made on an
an administrative basis for any path that provides equivalent administrative basis for any path that provides equivalent protection
protection from undetected errors as an end-to-end CRC32c. from undetected errors as an end-to-end CRC32c.
The process MUST be invisible to the ULP. The process MUST be invisible to the ULP.
After receipt of an MPA startup declaration indicating that its peer After receipt of an MPA startup declaration indicating that its peer
requires CRCs, an MPA instance MUST continue generating and checking requires CRCs, an MPA instance MUST continue generating and checking
CRCs until the connection terminates. If an MPA instance has CRCs until the connection terminates. If an MPA instance has
declared that it does not require CRCs, it MUST turn off CRC checking declared that it does not require CRCs, it MUST turn off CRC checking
immediately after receipt of an MPA mode declaration indicating that immediately after receipt of an MPA mode declaration indicating that
its peer also does not require CRCs. It MAY continue generating its peer also does not require CRCs. It MAY continue generating
CRCs. See section 6.1 Connection setup on page 27 for details on the CRCs. See section 6.1 Connection setup on page 31 for details on the
MPA startup. MPA startup.
When sending an FPDU, the sender MUST include a CRC field. When CRCs When sending an FPDU, the sender MUST include a CRC field. When CRCs
are enabled, the CRC field in the MPA FPDU MUST be computed using the are enabled, the CRC field in the MPA FPDU MUST be computed using the
CRC32C polynomial in the manner described in the iSCSI Protocol CRC32C polynomial in the manner described in the iSCSI Protocol
[iSCSI] document for Header and Data Digests. [iSCSI] document for Header and Data Digests.
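For orientation, a bit-at-a-time CRC32C can be written as below. This is an unoptimized sketch using the commonly published CRC-32C parameters (reflected Castagnoli polynomial, all-ones initial value and final inversion); it does not model which FPDU fields are covered or how the result is laid out in the CRC field, both of which are governed by this specification and [iSCSI].

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data as an unsigned 32 bit integer."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF
```

The widely used check input, the ASCII string "123456789", yields 0xE3069283 with these parameters. Production implementations normally use a table-driven or hardware-assisted variant.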
The fields which MUST be included in the CRC calculation when sending The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows: an FPDU are as follows:
skipping to change at page 20, line 36 skipping to change at page 23, line 36
MUST first perform the following: MUST first perform the following:
1) Calculate the CRC of the incoming FPDU in the same fashion as 1) Calculate the CRC of the incoming FPDU in the same fashion as
defined above. defined above.
2) Verify that the calculated CRC-32c value is the same as the 2) Verify that the calculated CRC-32c value is the same as the
received CRC-32c value found in the FPDU CRC field. If not, the received CRC-32c value found in the FPDU CRC field. If not, the
receiver MUST treat the FPDU as an invalid FPDU. receiver MUST treat the FPDU as an invalid FPDU.
The procedure for handling invalid FPDUs is covered in the Error The procedure for handling invalid FPDUs is covered in the Error
Section (see section 7 on page 40) Section (see section 7 on page 46)
The following is an annotated hex dump of an example FPDU sent as the The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream. As such, it starts with a marker. The FPDU first FPDU on the stream. As such, it starts with a marker. The FPDU
contains 24 octets of the contained ULPDU, which are all zeros. The contains a 42 octet ULPDU (an example DDP segment) which in turn
CRC32c has been correctly calculated and can be used as a reference. carries 24 octets of Send Data that are all zeros. The CRC32c has
See the [DDP] and [RDMA] specification for definitions of the DDP been correctly calculated and can be used as a reference. See the
Control field, Queue, MSN, MO, and Send Data. [DDP] and [RDMA] specification for definitions of the DDP Control
field, Queue, MSN, MO, and Send Data.
Octet Contents Annotation Octet Contents Annotation
Count Count
0000 00 00 Marker: Reserved 0000 00 Marker: Reserved
0002 00 00 FPDUPTR 0001 00
0004 00 2a Length 0002 00 Marker: FPDUPTR
0006 41 43 DDP Control Field, Send with Last flag set 0003 00
0008 00 00 Reserved (STag position with no STag) 0004 00 ULPDU Length
000a 00 00 0005 2a
000c 00 00 Queue = 0 0006 41 DDP Control Field, Send with Last flag set
000e 00 00 0007 43
0010 00 00 MSN = 1 0008 00 Reserved (STag position with no STag)
0012 00 01 0009 00
0014 00 00 MO = 0 000a 00
0016 00 00 000b 00
0018 00 00 000c 00 Queue = 0
Send Data (24 octets of zeros) 000d 00
002e 00 00 000e 00
0030 52 23 CRC32c 000f 00
0032 99 83 0010 00 MSN = 1
0011 00
0012 00
0013 01
0014 00 MO = 0
0015 00
0016 00
0017 00
0018 00 Send Data (24 octets of zeros)
...
002f 00
0030 52 CRC32c
0031 23
0032 99
0033 83
Figure 5 Annotated Hex Dump of an FPDU Figure 5 Annotated Hex Dump of an FPDU
The following is an example sent as the second FPDU of the stream The following is an example sent as the second FPDU of the stream
where the first FPDU (which is not shown here) had a length of 492 where the first FPDU (which is not shown here) had a length of 492
octets and was also a Send to Queue 0 with Last Flag set. This octets and was also a Send to Queue 0 with Last Flag set. This
example contains a marker. example contains a marker.
Octet Contents Annotation Octet Contents Annotation
Count Count
01ec 00 2a Length 01ec 00 Length
01ee 41 43 DDP Control Field: Send with Last Flag set 01ed 2a
01f0 00 00 Reserved (STag position with no STag) 01ee 41 DDP Control Field: Send with Last Flag set
01f2 00 00 01ef 43
01f4 00 00 Queue = 0 01f0 00 Reserved (STag position with no STag)
01f6 00 00 01f1 00
01f8 00 00 MSN = 2 01f2 00
01fa 00 02 01f3 00
01fc 00 00 MO = 0 01f4 00 Queue = 0
01fe 00 00 01f5 00
0200 00 00 Marker: Reserved 01f6 00
0202 00 14 FPDUPTR 01f7 00
0204 00 00 01f8 00 MSN = 2
Send Data (24 octets of zeros) 01f9 00
021a 00 00 01fa 00
021c 84 92 CRC32c 01fb 02
021e 58 98 01fc 00 MO = 0
01fd 00
01fe 00
01ff 00
0200 00 Marker: Reserved
0201 00
0202 00 Marker: FPDUPTR
0203 14
0204 00 Send Data (24 octets of zeros)
...
021b 00
021c 84 CRC32c
021d 92
021e 58
021f 98
Figure 6 Annotated Hex Dump of an FPDU with Marker Figure 6 Annotated Hex Dump of an FPDU with Marker
5.3 MPA on TCP Sender Segmentation 5.3 MPA on TCP Sender Segmentation
The various TCP RFCs allow considerable choice in segmenting a TCP The various TCP RFCs allow considerable choice in segmenting a TCP
stream. In order to optimize FPDU recovery at the MPA receiver, MPA stream. In order to optimize FPDU recovery at the MPA receiver, MPA
specifies additional segmentation rules. specifies additional segmentation rules.
MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
contained in one FPDU. contained in one FPDU.
skipping to change at page 24, line 36 skipping to change at page 28, line 36
for latency and wire efficiency trade-offs). When one or more FPDUs for latency and wire efficiency trade-offs). When one or more FPDUs
are already packed into a TCP Segment, MULPDU MAY be reduced are already packed into a TCP Segment, MULPDU MAY be reduced
accordingly. accordingly.
DDP SHOULD provide ULPDUs that are as large as possible, but less DDP SHOULD provide ULPDUs that are as large as possible, but less
than or equal to MULPDU. than or equal to MULPDU.
If the TCP implementation needs to adjust EMSS to support MTU If the TCP implementation needs to adjust EMSS to support MTU
changes, the MULPDU value is changed accordingly. changes, the MULPDU value is changed accordingly.
In certain rare situations, the EMSS may shrink to very small sizes. In certain rare situations, the EMSS may shrink below 128 octets in
If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU size. If this occurs, the MPA on TCP sender MUST NOT shrink the
below 128 octets and is not required to follow the segmentation rules MULPDU below 128 octets and is NOT REQUIRED to follow the
in Section 5.3 MPA on TCP Sender Segmentation on page 22. segmentation rules in Section 5.3 MPA on TCP Sender Segmentation on
page 25.
If one or more FPDUs are already packed into a TCP segment, such that If one or more FPDUs are already packed into a TCP segment, such that
the remaining room is less than 128 octets, MPA MUST NOT provide a the remaining room is less than 128 octets, MPA MUST NOT provide a
MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU smaller than 128. In this case, MPA would typically provide a
MULPDU for the next full sized segment, but may still pack the next MULPDU for the next full sized segment, but may still pack the next
FPDU into the small remaining room, provided that the next FPDU is FPDU into the small remaining room, provided that the next FPDU is
small enough to fit. small enough to fit.
The value 128 is chosen so as to allow DDP designers room for the DDP The value 128 is chosen so as to allow DDP designers room for the DDP
Header and some user data. Header and some user data.
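The 128 octet floor can be illustrated with a small sketch. The overhead estimate used here (6 octets of framing, one 4 octet marker per 512 octets of EMSS, and EMSS mod 4 for padding) is an assumption made for illustration and is not taken from the text above; only the lower bound of 128 comes from this section. The function name is hypothetical.

```python
import math

MIN_MULPDU = 128  # floor required by the text above


def mulpdu_for_emss(emss: int, markers_enabled: bool = True) -> int:
    """Illustrative MULPDU for a given EMSS, never below 128 octets."""
    # Assumed overhead model: length field + CRC, markers, worst-case pad.
    marker_octets = 4 * math.ceil(emss / 512) if markers_enabled else 0
    overhead = 6 + marker_octets + emss % 4
    return max(MIN_MULPDU, emss - overhead)
```

With an EMSS of 100 octets the computed value would fall below the floor, so 128 is returned and the resulting FPDU no longer fits in a single segment.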
skipping to change at page 25, line 25 skipping to change at page 29, line 25
to DDP. to DDP.
To detect the start of the FPDU unambiguously one of the following To detect the start of the FPDU unambiguously one of the following
MUST be used: MUST be used:
1: In an ordered TCP stream, the "ULPDU Length" field in the current 1: In an ordered TCP stream, the "ULPDU Length" field in the current
FPDU, when the FPDU has a valid CRC, can be used to identify the FPDU, when the FPDU has a valid CRC, can be used to identify the
beginning of the next FPDU. beginning of the next FPDU.
2: For receivers that support out of order reception of FPDUs (see 2: For receivers that support out of order reception of FPDUs (see
section 5.1 MPA Markers on page 17) a Marker can always be used section 5.1 MPA Markers on page 19) a Marker can always be used
to locate the beginning of an FPDU (in FPDUs with valid CRCs). to locate the beginning of an FPDU (in FPDUs with valid CRCs).
Since the location of the marker is known in the octet stream Since the location of the marker is known in the octet stream
(sequence number space), the marker can always be found. (sequence number space), the marker can always be found.
3: Having found an FPDU by means of a Marker, following contiguous 3: Having found an FPDU by means of a Marker, following contiguous
FPDUs can be found by using the "ULPDU Length" fields (from FPDUs FPDUs can be found by using the "ULPDU Length" fields (from FPDUs
with valid CRCs) to establish the next FPDU boundary. with valid CRCs) to establish the next FPDU boundary.
The "ULPDU Length" field (see section 4) MUST be used to determine if The "ULPDU Length" field (see section 4) MUST be used to determine if
the entire FPDU is present before forwarding the ULPDU to DDP. the entire FPDU is present before forwarding the ULPDU to DDP.
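Walking an ordered stream by means of the "ULPDU Length" field might look like the following sketch. It assumes markers are disabled and that the buffer starts on an FPDU boundary, and it omits the CRC validation that the rules above require before a length value is trusted. The function name is hypothetical.

```python
import struct


def split_fpdus(stream: bytes):
    """Yield (offset, ulpdu) for each FPDU that is entirely present.

    Assumes no markers and an FPDU boundary at offset 0. CRC checking
    is intentionally left out of this sketch.
    """
    offset = 0
    while offset + 2 <= len(stream):
        # "ULPDU Length" is a 16 bit unsigned integer in network byte order.
        (ulpdu_len,) = struct.unpack_from("!H", stream, offset)
        pad = -(2 + ulpdu_len) % 4
        end = offset + 2 + ulpdu_len + pad + 4  # length + ULPDU + PAD + CRC
        if end > len(stream):
            break  # the entire FPDU is not yet present
        yield offset, stream[offset + 2 : offset + 2 + ulpdu_len]
        offset = end
```

A trailing partial FPDU is left in place, so the caller can retry once more octets of the stream have arrived.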
CRC calculation is discussed in section 5.2 on page 19 above. CRC calculation is discussed in section 5.2 on page 22 above.
5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders
Since MPA on MPA-aware TCP senders start FPDUs on TCP segment Since MPA on MPA-aware TCP senders start FPDUs on TCP segment
boundaries, a receiving DDP on MPA on TCP implementation may be able boundaries, a receiving DDP on MPA on TCP implementation may be able
to optimize the reception of data in various ways. to optimize the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries. segment boundaries.
skipping to change at page 27, line 23 skipping to change at page 31, line 23
markers (if enabled) and to correctly locate the first FPDU. markers (if enabled) and to correctly locate the first FPDU.
MPA, and any TCP enhancements for MPA are enabled by the ULP in both MPA, and any TCP enhancements for MPA are enabled by the ULP in both
directions at once at an endpoint. directions at once at an endpoint.
This can be accomplished several ways, and is left up to DDP's ULP: This can be accomplished several ways, and is left up to DDP's ULP:
* DDP's ULP MAY require DDP on MPA startup immediately after TCP * DDP's ULP MAY require DDP on MPA startup immediately after TCP
connection setup. This has the advantage that no streaming mode connection setup. This has the advantage that no streaming mode
negotiation is needed. An example of such a protocol is shown in negotiation is needed. An example of such a protocol is shown in
Figure 9: Example Immediate Startup negotiation on page 36. Figure 9: Example Immediate Startup negotiation on page 42.
This may be accomplished by using a well-known port, or a service This may be accomplished by using a well-known port, or a service
locator protocol to locate an appropriate port on which DDP on locator protocol to locate an appropriate port on which DDP on
MPA is expected to operate. MPA is expected to operate.
* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
normal TCP startup, using TCP streaming data exchanges on the normal TCP startup, using TCP streaming data exchanges on the
same connection. The exchange establishes that DDP on MPA (as same connection. The exchange establishes that DDP on MPA (as
well as other ULPs) will be used, and exactly locates the point well as other ULPs) will be used, and exactly locates the point
in the octet stream where MPA is to begin operation. Note that in the octet stream where MPA is to begin operation. Note that
such a negotiation protocol is outside the scope of this such a negotiation protocol is outside the scope of this
specification. A simplified example of such a protocol is shown specification. A simplified example of such a protocol is shown
in Figure 8: Example Delayed Startup negotiation on page 33. in Figure 8: Example Delayed Startup negotiation on page 39.
An MPA endpoint operates in two distinct phases. An MPA endpoint operates in two distinct phases.
The "Startup Phase" is used to verify correct MPA setup, exchange CRC The "Startup Phase" is used to verify correct MPA setup, exchange CRC
and Marker configuration, and optionally pass "private data" between and Marker configuration, and optionally pass "private data" between
endpoints prior to completing a DDP connection. During this phase, endpoints prior to completing a DDP connection. During this phase,
specifically formatted frames are exchanged as TCP byte streams specifically formatted frames are exchanged as TCP byte streams
without using CRCs or Markers. During this phase a DDP endpoint need without using CRCs or Markers. During this phase a DDP endpoint need
not be "bound" to the MPA connection. In fact, the choice of DDP not be "bound" to the MPA connection. In fact, the choice of DDP
endpoint and its operating parameters may not be known until the endpoint and its operating parameters may not be known until the
skipping to change at page 28, line 25 skipping to change at page 32, line 25
Note: The possibility that both endpoints would be allowed to make a Note: The possibility that both endpoints would be allowed to make a
connection at the same time, sometimes called an "Active/Active" connection at the same time, sometimes called an "Active/Active"
connection, was considered by the work group and rejected. There connection, was considered by the work group and rejected. There
were several motivations for this decision. One was that were several motivations for this decision. One was that
applications needing this facility were few (none other than applications needing this facility were few (none other than
theoretical at the time of this draft). Another was that the theoretical at the time of this draft). Another was that the
facility created some implementation difficulties, particularly facility created some implementation difficulties, particularly
with the "Dual Stack" designs described later on. A last issue with the "Dual Stack" designs described later on. A last issue
was that dealing with rejected connections at startup would have was that dealing with rejected connections at startup would have
required at least an additional frame type, and more recovery required at least an additional frame type, and more recovery
actinos, complicating the protocol. While none of these issues actions, complicating the protocol. While none of these issues
was overwhelming, the group and implementers were not motivated was overwhelming, the group and implementers were not motivated
to do the work to resolve these issues. to do the work to resolve these issues. The protocol includes a
method of detecting these "Active/Active" startup attempts so
that they can be rejected and an error reported.
The ULP is responsible for determining which side is "Initiator" or The ULP is responsible for determining which side is "Initiator" or
"Responder". For "Client/Server" type ULPs this is easy. For peer- "Responder". For "Client/Server" type ULPs this is easy. For peer-
peer ULPs (which might utilize a TCP style "active/active" startup), peer ULPs (which might utilize a TCP style "active/active" startup),
some mechanism (not defined by this specification) must be some mechanism (not defined by this specification) must be
established, or some streaming mode data exchanged prior to MPA established, or some streaming mode data exchanged prior to MPA
startup to determine the side which starts in "Initiator" and which startup to determine the side which starts in "Initiator" and which
starts in "Responder" MPA mode. starts in "Responder" MPA mode.
6.1.1 MPA Request and Reply Frame Format
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 | |
+ Key (16 bytes containing "MPA ID Req Frame") +
4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) |
+ Or (16 bytes containing "MPA ID Rep Frame") +
8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) |
+ +
12 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res | Rev | PD_Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ ~
~ Private Data ~
| |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7 "MPA Request/Reply Frame"
Key: This field contains the "key" used to validate that the sender
is an MPA sender. Initiator mode senders MUST set this field to
the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other
value is detected. Responder mode senders MUST set this field to
the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other
value is detected.
M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply
Frame", declares a receiver's requirement for Markers. When in a
received "MPA Request Frame" or "MPA Reply Frame" and the value
is '0', markers MUST NOT be added to the data stream by the
sender. When '1' markers MUST be added as described in section
5.1 MPA Markers on page 19.
C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the "MPA Request Frame" and the "MPA Reply
Frame", CRCs MUST NOT be checked and need not be generated by
either endpoint. When this bit is '1' in either the "MPA Request
Frame" or "MPA Reply Frame", CRCs MUST be generated and checked
by both endpoints. Note that even when not in use, the CRC field
remains present in the FPDU. When CRCs are not in use, the CRC
field MUST be considered valid for FPDU checking regardless of
its contents.
R: This bit is set to zero, and not checked on reception in the "MPA
Request Frame". In the "MPA Reply Frame", this bit is the
"Rejected Connection" bit, set by the responder's ULP to indicate
acceptance '0', or rejection '1', of the connection parameters
provided in the "Private Data".
Res: This field is reserved for future use. It MUST be set to zero
when sending, and not checked on reception.
Rev: This field contains the Revision of MPA. For this version of
the specification senders MUST set this field to one. MPA
receivers compliant with this version of the specification MUST
check this field. If the MPA receiver cannot interoperate with
the received version, then it MUST close the connection and
report an error locally. Otherwise, the MPA receiver should
report the received version to the ULP.
PD_Length: This field MUST contain the length in Octets of the
Private Data field. A value of zero indicates that there is no
private data field present at all. If the receiver detects that
the PD_Length field does not match the length of the "Private
Data" field, or if the length of the "Private Data" field exceeds
512 octets, the receiver MUST close the connection and report an
error locally. Otherwise, the MPA receiver should pass the
PD_Length value and "Private Data" to the ULP.
Private Data: This field may contain any value defined by ULPs or may
not be present. The "Private Data" field MUST be between 0 and 512
octets in length. ULPs define how to size, set, and validate
this field within these limits.
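The field definitions above can be illustrated with a minimal sketch.
It is not part of the specification. It assumes the layout of Figure 7
with fields in network byte order, assumes the caller hands the parser
exactly one complete startup frame, and treats only Revision 1 as
interoperable. All names (MpaStartupFrame, MpaStartupError,
build_frame, parse_frame, crc_in_use, sender_must_add_markers) are
hypothetical.

```python
import struct
from dataclasses import dataclass

REQ_KEY = b"MPA ID Req Frame"  # 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65
REP_KEY = b"MPA ID Rep Frame"  # 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65
MPA_REVISION = 1               # value required by this version of the draft
MAX_PRIVATE_DATA = 512         # upper bound on the Private Data field, in octets
HEADER_LEN = 20                # 16-octet Key + flags octet + Rev octet + PD_Length


class MpaStartupError(Exception):
    """Hypothetical error: the caller closes the connection and reports locally."""


@dataclass
class MpaStartupFrame:
    markers: bool       # M bit: the sender of this frame requires Markers
    crc: bool           # C bit: the sender of this frame prefers CRC use
    rejected: bool      # R bit: meaningful only in an MPA Reply Frame
    revision: int
    private_data: bytes


def build_frame(key, markers, crc, rejected=False, private_data=b""):
    """Serialize a startup frame; the Res bits are always sent as zero."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (int(markers) << 7) | (int(crc) << 6) | (int(rejected) << 5)
    header = struct.pack("!BBH", flags, MPA_REVISION, len(private_data))
    return key + header + private_data


def parse_frame(data, expected_key):
    """Validate Key, Rev and PD_Length; Res bits are not checked on reception."""
    if len(data) < HEADER_LEN:
        raise MpaStartupError("frame shorter than the fixed header")
    if data[:16] != expected_key:
        raise MpaStartupError("unexpected Key")
    flags, rev, pd_length = struct.unpack("!BBH", data[16:HEADER_LEN])
    if rev != MPA_REVISION:
        raise MpaStartupError("cannot interoperate with received Rev")
    private_data = data[HEADER_LEN:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise MpaStartupError("PD_Length does not match Private Data")
    return MpaStartupFrame(bool(flags & 0x80), bool(flags & 0x40),
                           bool(flags & 0x20), rev, private_data)


def crc_in_use(request, reply):
    """CRCs are generated and checked if either frame carried C = '1'."""
    return request.crc or reply.crc


def sender_must_add_markers(received):
    """M in a received frame states what the peer, as receiver, requires."""
    return received.markers
```

A Responder would call parse_frame(data, REQ_KEY) and an Initiator
parse_frame(data, REP_KEY); in both cases an MpaStartupError stands in
for "close the connection and report an error locally".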
6.1.2 Connection Startup Rules
The following rules apply to MPA connection startup phase: The following rules apply to MPA connection startup phase:
1. When MPA is started in the "Initiator" mode, the MPA 1. When MPA is started in the "Initiator" mode, the MPA
implementation MUST send a valid "MPA Request Frame". The "MPA implementation MUST send a valid "MPA Request Frame". The "MPA
Request Frame" MAY include ULP supplied "Private Data". Request Frame" MAY include ULP supplied "Private Data".
2. When MPA is started in the "Responder" mode, the MPA 2. When MPA is started in the "Responder" mode, the MPA
implementation MUST wait until a "MPA Request Frame" is received implementation MUST wait until a "MPA Request Frame" is received
and validated before entering full MPA/DDP operation. and validated before entering full MPA/DDP operation.
skipping to change at page 31, line 5 skipping to change at page 38, line 5
the above fails, the startup frame MUST be considered improperly the above fails, the startup frame MUST be considered improperly
formatted. formatted.
10. MPA implementations SHOULD implement a reasonable timeout while 10. MPA implementations SHOULD implement a reasonable timeout while
waiting for the entire startup frames; this prevents certain waiting for the entire startup frames; this prevents certain
denial of service attacks. ULPs SHOULD implement a reasonable denial of service attacks. ULPs SHOULD implement a reasonable
timeout while waiting for FPDUs, ULPDUs and application level timeout while waiting for FPDUs, ULPDUs and application level
messages to guard against application failures and certain denial messages to guard against application failures and certain denial
of service attacks. of service attacks.
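Rule 10 can be sketched as a single deadline covering the whole startup
frame. This is only an illustration and not part of the specification:
it assumes a blocking TCP socket, the 20-octet fixed header of Figure 7
with PD_Length in network byte order, and a caller-chosen timeout. The
names recv_exact and read_startup_frame are hypothetical.

```python
import socket
import struct
import time

HEADER_LEN = 20          # 16-octet Key + flags octet + Rev octet + PD_Length
MAX_PRIVATE_DATA = 512   # upper bound on the Private Data field, in octets


def recv_exact(sock, n, deadline):
    """Read exactly n octets, or raise TimeoutError once the deadline passes."""
    buf = bytearray()
    while len(buf) < n:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("startup frame not received in time")
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(n - len(buf))
        except socket.timeout:
            raise TimeoutError("startup frame not received in time")
        if not chunk:
            raise ConnectionError("peer closed during startup")
        buf += chunk
    return bytes(buf)


def read_startup_frame(sock, timeout_seconds):
    """Read one complete startup frame under one overall deadline."""
    deadline = time.monotonic() + timeout_seconds
    header = recv_exact(sock, HEADER_LEN, deadline)
    (pd_length,) = struct.unpack("!H", header[18:20])
    if pd_length > MAX_PRIVATE_DATA:
        raise ValueError("PD_Length exceeds 512 octets")
    return header + recv_exact(sock, pd_length, deadline)
```

Using one deadline for the entire frame, rather than a fresh timeout
per read, keeps a peer that trickles one octet at a time from holding
the startup phase open indefinitely.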
6.1.1 MPA Request and Reply Frame Format 6.1.3 Example Delayed Startup sequence
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 | |
+ Key (16 bytes containing "MPA ID Req Frame") +
4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) |
+ Or (16 bytes containing "MPA ID Rep Frame") +
8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) |
+ +
12 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res | Rev | PD_Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ ~
~ Private Data ~
| |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7 "MPA Request/Reply Frame"
Key: This field contains the "key" used to authenticate that the
sender is an MPA sender. Initiator mode senders must set this
field to the fixed value "MPA ID Req Frame" or (in byte order) 4D
50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).
Responder mode receivers MUST check this field for the same
value, and close the connection and report an error locally if
any other value is detected. Responder mode senders must set this
field to the fixed value "MPA ID Rep Frame" or (in byte order) 4D
50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).
Initiator mode receivers MUST check this field for the same
value, and close the connection and report an error locally if
any other value is detected.
M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply
Frame", declares a receiver's requirement for Markers. When in a
received "MPA Request Frame" or "MPA Reply Frame" and the value
is '0', markers MUST NOT be added to the data stream by the
sender. When '1' markers MUST be added as described in section
5.1 MPA Markers on page 17.
C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the "MPA Request Frame" and the "MPA Reply
Frame", CRCs MUST NOT be checked and need not be generated by
either endpoint. When this bit is '1' in either the "MPA Request
Frame" or "MPA Reply Frame", CRCs MUST be generated and checked
by both endpoints. Note that even when not in use, the CRC field
remains present in the FPDU. When CRCs are not in use, the CRC
field MUST be considered valid for FPDU checking regardless of
its contents.
R: This bit is set to zero, and not checked on reception in the "MPA
Request Frame". In the "MPA Reply Frame", this bit is the
"Rejected Connection" bit, set by the responder's ULP to indicate
acceptance '0', or rejection '1', of the connection parameters
provided in the "Private Data".
Res: This field is reserved for future use. It must be set to zero
when sending, and not checked on reception.
Rev: This field contains the Revision of MPA. For this version of
the specification senders MUST set this field to one. MPA
receivers compliant with this version of the specification MUST
check this field. If the MPA receiver cannot interoperate with
the received version, then it MUST close the connection and
report an error locally. Otherwise, the MPA receiver should
report the received version to the ULP.
PD_Length: This field MUST contain the length in Octets of the
Private Data field. A value of zero indicates that there is no
private data field present at all. If the receiver detects that
the PD_Length field does not match the length of the "Private
Data" field, or if the length of the "Private Data" field exceeds
512 octets, the receiver MUST close the connection and report an
error locally. Otherwise, the MPA receiver should pass the
PD_Length value and "Private Data" to the ULP.
Private Data: This field may contain any value defined by ULPs or may
not be present. The "Private Data" field MUST be between 0 and 512
octets in length. ULPs define how to size, set, and validate
this field within these limits.
6.1.2 Example Delayed Startup sequence
A variety of startup sequences are possible when using MPA on TCP. A variety of startup sequences are possible when using MPA on TCP.
Following is an example of an MPA/DDP startup that occurs after TCP Following is an example of an MPA/DDP startup that occurs after TCP
has been running for a while and has exchanged some amount of has been running for a while and has exchanged some amount of
streaming data. This example does not use any private data (an streaming data. This example does not use any private data (an
example that does is shown later in 6.1.3.2 Example Immediate Startup example that does is shown later in 6.1.4.2 Example Immediate Startup
using Private Data on page 36), although it is perfectly legal to using Private Data on page 42), although it is perfectly legal to
include the private data. Note that since the example does not use include the private data. Note that since the example does not use
any Private Data, there are no ULP interactions shown between any Private Data, there are no ULP interactions shown between
receiving "Startup frames" and putting MPA into "Full operation". receiving "Startup frames" and putting MPA into "Full operation".
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|ULP streaming mode | |ULP streaming mode |
| <Hello> request to | | <Hello> request to |
| transition to DDP/MPA | +--------------------------+ | transition to DDP/MPA | +--------------------------+
skipping to change at page 35, line 5 skipping to change at page 41, line 5
would report this message to the Consumer. The Consumer can would report this message to the Consumer. The Consumer can
then accept the MPA/DDP connection, or close or reset the TCP then accept the MPA/DDP connection, or close or reset the TCP
connection to abort the process. connection to abort the process.
* On determining that the Connection is acceptable, the * On determining that the Connection is acceptable, the
Initiating Consumer would use an appropriate API to bind the Initiating Consumer would use an appropriate API to bind the
TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
into full operation. MPA/DDP would begin sending DDP into full operation. MPA/DDP would begin sending DDP
messages as MPA FPDUs. messages as MPA FPDUs.
6.1.3 Use of "Private Data" 6.1.4 Use of "Private Data"
This section is advisory in nature, in that it suggests a method by This section is advisory in nature, in that it suggests a method by
which a ULP can deal with pre-DDP connection information exchange. which a ULP can deal with pre-DDP connection information exchange.
6.1.3.1 Motivation 6.1.4.1 Motivation
Prior RDMA protocols have been developed that provide "private data" Prior RDMA protocols have been developed that provide "private data"
via out of band mechanisms. As a result, many applications now via out of band mechanisms. As a result, many applications now
expect some form of "private data" to be available for application expect some form of "private data" to be available for application
use prior to setting up the DDP/RDMA connection. For example, use prior to setting up the DDP/RDMA connection. For example,
An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
and the [Verbs]) must be associated with a Protection Domain. No and the [Verbs]) must be associated with a Protection Domain. No
receive operations may be posted to the endpoint before it is receive operations may be posted to the endpoint before it is
associated with a Protection Domain. Indeed under both the associated with a Protection Domain. Indeed under both the
skipping to change at page 36, line 5 skipping to change at page 42, line 5
be exchanged using datagrams before actually starting the RDMA be exchanged using datagrams before actually starting the RDMA
connection. connection.
This draft allows for small amounts of "Private Data" to be exchanged This draft allows for small amounts of "Private Data" to be exchanged
as part of the MPA startup sequence. The actual Private Data fields as part of the MPA startup sequence. The actual Private Data fields
are carried in the "MPA Request Frame", and the "MPA Reply Frame". are carried in the "MPA Request Frame", and the "MPA Reply Frame".
If larger amounts of private data or more negotiation is necessary, If larger amounts of private data or more negotiation is necessary,
TCP streaming mode messages may be exchanged prior to enabling MPA. TCP streaming mode messages may be exchanged prior to enabling MPA.
6.1.3.2 Example Immediate Startup using Private Data 6.1.4.2 Example Immediate Startup using Private Data
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|TCP SYN sent | +--------------------------+ |TCP SYN sent | +--------------------------+
+---------------------------+ --------> |TCP gets SYN packet; | +---------------------------+ --------> |TCP gets SYN packet; |
+---------------------------+ | Sends SYN-Ack | +---------------------------+ | Sends SYN-Ack |
|TCP gets SYN-Ack | <-------- +--------------------------+ |TCP gets SYN-Ack | <-------- +--------------------------+
| Sends Ack | | Sends Ack |
+---------------------------+ --------> +--------------------------+ +---------------------------+ --------> +--------------------------+
+---------------------------+ |Consumer enables MPA | +---------------------------+ |Consumer enables MPA |
|Enters MPA Initiator mode; | |Responder Mode, waits for | |Consumer enables MPA | |Responder Mode, waits for |
|MPA sends | | <MPA Request frame> | |Initiator mode with | | <MPA Request frame> |
| <MPA Request Frame>; | +--------------------------+ |"Private Data"; MPA sends | +--------------------------+
| <MPA Request Frame>; |
|MPA waits for incoming | +--------------------------+ |MPA waits for incoming | +--------------------------+
| <MPA Reply Frame> | - - - - > |MPA receives | | <MPA Reply Frame> | - - - - > |MPA receives |
+---------------------------+ | <MPA Request Frame> | +---------------------------+ | <MPA Request Frame> |
|Consumer examines "Private| |Consumer examines "Private|
|Data", provides MPA with | |Data", provides MPA with |
|return "Private Data", | |return "Private Data", |
|binds DDP to MPA, and | |binds DDP to MPA, and |
|enables MPA to send an | |enables MPA to send an |
| <MPA Reply Frame>. | | <MPA Reply Frame>. |
|DDP/MPA enables FPDU | |DDP/MPA enables FPDU |
skipping to change at page 38, line 19 skipping to change at page 44, line 19
If the "Rejected Connection" bit is set to a '1', MPA will If the "Rejected Connection" bit is set to a '1', MPA will
close the TCP connection and exit. close the TCP connection and exit.
If the "Rejected Connection" bit is set to a '0', and on If the "Rejected Connection" bit is set to a '0', and on
determining from the "MPA Reply Frame" "Private Data" that determining from the "MPA Reply Frame" "Private Data" that
the Connection is acceptable, the Initiating Consumer would the Connection is acceptable, the Initiating Consumer would
use an appropriate API to bind the TCP/MPA connections to a use an appropriate API to bind the TCP/MPA connections to a
DDP endpoint thus enabling MPA/DDP into full operation. DDP endpoint thus enabling MPA/DDP into full operation.
MPA/DDP would begin sending DDP messages as MPA FPDUs. MPA/DDP would begin sending DDP messages as MPA FPDUs.
6.1.4 "Dual Stack" implementations 6.1.5 "Dual Stack" implementations
MPA/DDP implementations are commonly expected to be implemented as MPA/DDP implementations are commonly expected to be implemented as
part of a "Dual stack" architecture. One "stack" is the traditional part of a "Dual stack" architecture. One "stack" is the traditional
TCP stack, usually with a sockets interface API. The second stack is TCP stack, usually with a sockets interface API. The second stack is
the MPA/DDP "stack" with its own API, and potentially separate code the MPA/DDP "stack" with its own API, and potentially separate code
or hardware to deal with the MPA/DDP data. Of course, or hardware to deal with the MPA/DDP data. Of course,
implementations may vary, so the following comments are of an implementations may vary, so the following comments are of an
advisory nature only. advisory nature only.
The use of the two "stacks" offers advantages: The use of the two "stacks" offers advantages:
skipping to change at page 39, line 25 skipping to change at page 45, line 25
ULP receives the last streaming mode data, and then enters ULP receives the last streaming mode data, and then enters
DDP/MPA mode. Again, no additional streaming mode data is DDP/MPA mode. Again, no additional streaming mode data is
expected. expected.
2. The DDP/MPA MAY provide the ability to send a "Last streaming 2. The DDP/MPA MAY provide the ability to send a "Last streaming
message" as part of its "Responder" DDP/MPA enable function. message" as part of its "Responder" DDP/MPA enable function.
This allows the DDP/MPA stack to more easily manage the This allows the DDP/MPA stack to more easily manage the
conversion to DDP/MPA mode (and avoid problems with a very fast conversion to DDP/MPA mode (and avoid problems with a very fast
return of the "MPA Request Frame" from the Initiator side). return of the "MPA Request Frame" from the Initiator side).
Note: Regardless of the "stack" architecture used, TCP's rules must Note: Regardless of the "stack" architecture used, TCP's rules MUST
be followed. For example, if network data is lost, re-segmented be followed. For example, if network data is lost, re-segmented
or re-ordered, TCP must recover appropriately even when this or re-ordered, TCP MUST recover appropriately even when this
occurs while switching stacks. occurs while switching stacks.
6.2 Normal Connection Teardown 6.2 Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the Each half connection of MPA terminates when DDP closes the
corresponding TCP half connection. corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
that a graceful close of the LLP connection has been received by the that a graceful close of the LLP connection has been received by the
LLP (e.g. FIN is received). LLP (e.g. FIN is received).
skipping to change at page 41, line 41 skipping to change at page 47, line 41
over IP networks. Therefore, the control and the data packets of over IP networks. Therefore, the control and the data packets of
these protocols are vulnerable to the spoofing, tampering and these protocols are vulnerable to the spoofing, tampering and
information disclosure attacks listed below. In addition, Connection information disclosure attacks listed below. In addition, Connection
to/from an unauthorized or unauthenticated endpoint is a potential to/from an unauthorized or unauthenticated endpoint is a potential
problem with most applications using RDMA, DDP, and MPA. problem with most applications using RDMA, DDP, and MPA.
8.1.1 Spoofing 8.1.1 Spoofing
Spoofing attacks can be launched by the Remote Peer, or by a network Spoofing attacks can be launched by the Remote Peer, or by a network
based attacker. A network based spoofing attack applies to all Remote based attacker. A network based spoofing attack applies to all Remote
Peers. Because the MPA Stream requires an TCP Stream in the Peers. Because the MPA Stream requires a TCP Stream in the
ESTABLISHED state, certain types of traditional forms of wire attacks ESTABLISHED state, certain types of traditional forms of wire attacks
do not apply -- an end-to-end handshake must have occurred to do not apply -- an end-to-end handshake must have occurred to
establish the MPA Stream. So, the only form of spoofing that applies establish the MPA Stream. So, the only form of spoofing that applies
is one when a remote node can both send and receive packets. Yet even is one when a remote node can both send and receive packets. Yet even
with this limitation the Stream is still exposed to the following with this limitation the Stream is still exposed to the following
spoofing attacks. spoofing attacks.
8.1.1.1 Impersonation 8.1.1.1 Impersonation
A network based attacker can impersonate a legal MPA/DDP/RDMAP peer A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
skipping to change at page 44, line 5 skipping to change at page 50, line 5
8.3 Using IPsec With MPA 8.3 Using IPsec With MPA
IPsec can be used to protect against the packet injection attacks IPsec can be used to protect against the packet injection attacks
outlined above. Because IPsec is designed to secure individual IP outlined above. Because IPsec is designed to secure individual IP
packets, MPA can run above IPsec without change. IPsec packets are packets, MPA can run above IPsec without change. IPsec packets are
processed (e.g., integrity checked and decrypted) in the order they processed (e.g., integrity checked and decrypted) in the order they
are received, and an MPA receiver will process the decrypted FPDUs are received, and an MPA receiver will process the decrypted FPDUs
contained in these packets in the same manner as FPDUs contained in contained in these packets in the same manner as FPDUs contained in
unsecured IP packets. unsecured IP packets.
MPA Implementations MUST implement IPSEC. The use of IPSEC is up to MPA Implementations MUST implement IPsec as described in Section 8.4
ULPs and administrators. below. The use of IPsec is up to ULPs and administrators.
8.4 Requirements for IPsec Encapsulation of DDP 8.4 Requirements for IPsec Encapsulation of MPA/DDP
The IP Storage working group has spent significant time and effort to The IP Storage working group has spent significant time and effort to
define the normative IPsec requirements for IP Storage [RFC3723]. define the normative IPsec requirements for IP Storage [RFC3723].
Portions of that specification are applicable to a wide variety of Portions of that specification are applicable to a wide variety of
protocols, including the RDDP protocol suite. In order to not protocols, including the RDDP protocol suite. In order to not
replicate this effort, an RNIC implementation MUST follow the replicate this effort, an MPA on TCP implementation MUST follow the
requirements defined in RFC3723 Section 2.3 and Section 5, including requirements defined in RFC3723 Section 2.3 and Section 5, including
the associated normative references for those sections. the associated normative references for those sections.
Additionally, since IPsec acceleration hardware may only be able to Additionally, since IPsec acceleration hardware may only be able to
handle a limited number of active IKE Phase 2 SAs, Phase 2 delete handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
messages may be sent for idle SAs, as a means of keeping the number messages MAY be sent for idle SAs, as a means of keeping the number
of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
delete message MUST NOT be interpreted as a reason for tearing down delete message MUST NOT be interpreted as a reason for tearing down
a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up, a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
and if additional traffic is sent on it, to bring up another IKE and if additional traffic is sent on it, to bring up another IKE
Phase 2 SA to protect it. This avoids the potential for continually Phase 2 SA to protect it. This avoids the potential for continually
bringing Streams up and down. bringing Streams up and down.
Note that there are serious security issues if IPsec is not Note that there are serious security issues if IPsec is not
implemented end-to-end. For example, if IPsec is implemented as a implemented end-to-end. For example, if IPsec is implemented as a
tunnel in the middle of the network, any hosts between the peer and tunnel in the middle of the network, any hosts between the peer and
the IPsec tunneling device can freely attack the unprotected Stream. the IPsec tunneling device can freely attack the unprotected Stream.
9 IANA Considerations 9 IANA Considerations
No IANA actions are required by this document.
If a well-known port is chosen as the mechanism to identify a DDP on If a well-known port is chosen as the mechanism to identify a DDP on
MPA on TCP, the well-known port must be registered with IANA. MPA on TCP, the well-known port must be registered with IANA.
Because the use of the port is DDP specific, registration of the port Because the use of the port is DDP specific, registration of the port
with IANA is left to DDP. with IANA is left to DDP.
10 References 10 References
10.1 Normative References 10.1 Normative References
[iSCSI] Satran, J., Internet Small Computer Systems Interface [iSCSI] Satran, J., Internet Small Computer Systems Interface
skipping to change at page 48, line 10 skipping to change at page 54, line 10
elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003. elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.
[Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft- [Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft-
hilland-rddp-verbs-00.txt, April 2003. hilland-rddp-verbs-00.txt, April 2003.
11 Appendix 11 Appendix
This appendix is for information only and is NOT part of the This appendix is for information only and is NOT part of the
standard. standard.
The appendix covers three topics:
Section 11.1 is an analysis of MPA on TCP and why it is useful to
integrate MPA with TCP (with modifications to typical TCP
implementations) to reduce overall system buffering and overhead.
Section 11.2 covers some MPA receiver implementation notes.
Section 11.3 covers methods of making MPA implementations
interoperate with both IETF and RDMA Consortium versions of the
protocols.
11.1 Analysis of MPA over TCP Operations 11.1 Analysis of MPA over TCP Operations
This appendix analyzes the impact of MPA (Marker PDU Aligned Framing This appendix analyzes the impact of MPA (Marker PDU Aligned Framing
for TCP [MPA]) on the TCP sender, receiver, and wire protocol. for TCP [MPA]) on the TCP sender, receiver, and wire protocol.
One of MPA's high level goals is to provide enough information, when One of MPA's high level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) buffer. Note that DDP separates the act of placing Protocol (ULP) buffer. Note that DDP separates the act of placing
data into a ULP buffer from that of notifying the ULP that the ULP data into a ULP buffer from that of notifying the ULP that the ULP
skipping to change at page 48, line 32 skipping to change at page 54, line 44
supports in-order delivery of the data to the ULP, including support supports in-order delivery of the data to the ULP, including support
for Direct Data Placement in the final ULP buffer location when TCP for Direct Data Placement in the final ULP buffer location when TCP
segments arrive out-of-order. Effectively, the goal is to use the segments arrive out-of-order. Effectively, the goal is to use the
pre-posted ULP buffers as the TCP receive buffer, where the pre-posted ULP buffers as the TCP receive buffer, where the
reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
DDP) is done in place, in the ULP buffer, with no data copies. DDP) is done in place, in the ULP buffer, with no data copies.
This Appendix walks through the advantages and disadvantages of the This Appendix walks through the advantages and disadvantages of the
TCP sender modifications proposed by MPA: TCP sender modifications proposed by MPA:
1) that MPA require the TCP sender to do "Header Alignment", where a 1) that MPA prefers that the TCP sender do "Header Alignment",
TCP segment is required to begin with an MPA Framing Protocol Data where a TCP segment should begin with an MPA Framing Protocol Data
Unit (FPDU) (if there is payload present). Unit (FPDU) (if there is payload present).
2) that there be an integral number of FPDUs in a TCP segment (under 2) that there be an integral number of FPDUs in a TCP segment (under
conditions where the Path MTU is not changing). conditions where the Path MTU is not changing).
This Appendix concludes that the scaling advantages of Header This Appendix concludes that the scaling advantages of FPDU Alignment
Alignment are strong, based primarily on fairly drastic TCP receive are strong, based primarily on fairly drastic TCP receive buffer
buffer reduction requirements and simplified receive handling. The reduction requirements and simplified receive handling. The analysis
analysis also shows that there is little effect to TCP wire behavior. also shows that there is little effect to TCP wire behavior.
11.1.1 Assumptions 11.1.1 Assumptions
11.1.1.1 MPA is layered beneath DDP [DDP] 11.1.1.1 MPA is layered beneath DDP [DDP]
MPA is an adaptation layer between DDP and TCP. DDP requires MPA is an adaptation layer between DDP and TCP. DDP requires
preservation of DDP segment boundaries and a CRC32C digest covering preservation of DDP segment boundaries and a CRC32C digest covering
the DDP header and data. MPA adds these features to the TCP stream the DDP header and data. MPA adds these features to the TCP stream
so that DDP over TCP has the same basic properties as DDP over SCTP. so that DDP over TCP has the same basic properties as DDP over SCTP.
skipping to change at page 49, line 43 skipping to change at page 56, line 5
correct location in host memory. correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer. immediately in the ULP receive buffer.
Data delivery to the ULP is guaranteed to be in the order the data Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream. acknowledged the complete byte stream.
11.1.2 The Value of Header Alignment 11.1.2 The Value of FPDU Alignment
Significant receiver optimizations can be achieved when Header Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case. The optimizations Alignment and complete FPDUs are the common case. The optimizations
allow utilizing significantly fewer buffers on the receiver and less allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a "Flow- computation per FPDU. The net effect is the ability to build a "Flow-
Through" receiver that enables TCP-based solutions to scale to 10G Through" receiver that enables TCP-based solutions to scale to 10G
and beyond in an economical way. The optimizations are especially and beyond in an economical way. The optimizations are especially
relevant to hardware implementations of receivers that process relevant to hardware implementations of receivers that process
multiple protocol layers - Data Link Layer (e.g., Ethernet), Network multiple protocol layers - Data Link Layer (e.g., Ethernet), Network
and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP
skipping to change at page 50, line 49 skipping to change at page 57, line 10
BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS
Where K1 and K2 are implementation dependent constants and EMSS is Where K1 and K2 are implementation dependent constants and EMSS is
the effective maximum segment size. the effective maximum segment size.
For example, a 1 Gbps link with 10,000 connections and an EMSS of For example, a 1 Gbps link with 10,000 connections and an EMSS of
1500B would require 15 MB of memory. Often the number of connections 1500B would require 15 MB of memory. Often the number of connections
used scales with the network speed, aggravating the situation for used scales with the network speed, aggravating the situation for
higher speeds. higher speeds.
A Header Aligned FPDU would allow the receiver to allocate FPDU Alignment would allow the receiver to allocate BufferSizeAF
BufferSizeAF (Buffer Size, Aligned FPDU) octets: (Buffer Size, Aligned FPDU) octets:
BufferSizeAF = K2 * EMSS BufferSizeAF = K2 * EMSS
for the same conditions. A Header Aligned receiver may require memory
for the same conditions. An FPDU Aligned receiver may require memory
in the range of ~100s of KB - which is feasible for an on-chip memory in the range of ~100s of KB - which is feasible for an on-chip memory
and enables a "Flow-Through" design, in which the data flows through and enables a "Flow-Through" design, in which the data flows through
the NIC and is placed directly in the destination buffer. Assuming the NIC and is placed directly in the destination buffer. Assuming
most of the connections support Header Alignment, the receiver most of the connections support FPDU Alignment, the receiver buffers
buffers no longer scale with number of connections. no longer scale with number of connections.
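
As a rough illustration of the two buffer formulas above, the following Python sketch evaluates BufferSizeNAF and BufferSizeAF. The function names and the values chosen for K1 and K2 are assumptions made for illustration only; the text says only that they are implementation dependent. With K1 = 1 and a K2 term small enough to ignore, the non-aligned case reproduces the 15 MB figure quoted above.

```python
def buffer_size_naf(k1, k2, emss, connections):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS (octets)."""
    return k1 * emss * connections + k2 * emss


def buffer_size_af(k2, emss):
    """BufferSizeAF = K2 * EMSS (octets); no dependence on connection count."""
    return k2 * emss


# Hypothetical inputs: K1 and K2 are implementation dependent constants.
EMSS = 1500
CONNECTIONS = 10_000

naf = buffer_size_naf(k1=1, k2=0, emss=EMSS, connections=CONNECTIONS)
af = buffer_size_af(k2=64, emss=EMSS)
print(naf, af)
```

Under these assumed constants the aligned case needs on the order of 100 KB regardless of how many connections are open, which is the scaling property the paragraph above describes.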
Additional optimizations can be achieved in a balanced I/O sub-system Additional optimizations can be achieved in a balanced I/O sub-system
-- where the system interface of the network controller provides -- where the system interface of the network controller provides
ample bandwidth as compared with the network bandwidth. For almost ample bandwidth as compared with the network bandwidth. For almost
twenty years this has been the case and the trend is expected to twenty years this has been the case and the trend is expected to
continue - while Ethernet speeds have scaled by 1000 (from 10 continue - while Ethernet speeds have scaled by 1000 (from 10
megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
PCI-X DDR). Under these conditions, the Header Aligned FPDU approach PCI-X DDR). Under these conditions, the FPDU Alignment approach
allows BufferSizeAF to be indifferent to network speed. It is allows BufferSizeAF to be indifferent to network speed. It is
primarily a function of the local processing time for a given frame. primarily a function of the local processing time for a given frame.
Thus when the Header Aligned FPDU approach is used, receive buffering Thus when the FPDU Alignment approach is used, receive buffering is
is expected to scale gracefully (i.e. less than linear scaling) as expected to scale gracefully (i.e. less than linear scaling) as
network speed is increased. network speed is increased.
11.1.2.1 Impact of lack of Header Alignment on the receiver 11.1.2.1 Impact of lack of FPDU Alignment on the receiver computational
computational load and complexity load and complexity
The receiver must perform IP and TCP processing, and then perform The receiver must perform IP and TCP processing, and then perform
FPDU CRC checks, before it can trust the FPDU header placement FPDU CRC checks, before it can trust the FPDU header placement
information. For simplicity of the description, the assumption is information. For simplicity of the description, the assumption is
that a FPDU is carried in no more than 2 TCP segments. In reality, that a FPDU is carried in no more than 2 TCP segments. In reality,
with no Header Alignment, an FPDU can be carried by more than 2 TCP with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
segments (e.g., if the PMTU was reduced). segments (e.g., if the PMTU was reduced).
----++-----------------------------++-----------------------++----- ----++-----------------------------++-----------------------++-----
+---||---------------+ +--------||--------+ +----------||----+ +---||---------------+ +--------||--------+ +----------||----+
| TCP Seg X-1 | | TCP Seg X | | TCP Seg X+1 | | TCP Seg X-1 | | TCP Seg X | | TCP Seg X+1 |
+---||---------------+ +--------||--------+ +----------||----+ +---||---------------+ +--------||--------+ +----------||----+
----++-----------------------------++-----------------------++----- ----++-----------------------------++-----------------------++-----
FPDU #N-1 FPDU #N FPDU #N-1 FPDU #N
Figure 10: Non-aligned FPDU freely placed in TCP octet stream Figure 10: Non-aligned FPDU freely placed in TCP octet stream
skipping to change at page 53, line 24 skipping to change at page 59, line 40
is required. Use the MPA based Markers to calculate is required. Use the MPA based Markers to calculate
where FPDU boundaries are. where FPDU boundaries are.
When a complete FPDU is available, a similar procedure When a complete FPDU is available, a similar procedure
to the in-order algorithm above is used. There is to the in-order algorithm above is used. There is
additional complexity, though, because when the additional complexity, though, because when the
missing segment arrives, this TCP segment must be missing segment arrives, this TCP segment must be
run through the CRC engine after the CRC is run through the CRC engine after the CRC is
calculated for the missing segment. calculated for the missing segment.
If we assume Header Alignment, the following diagram and the If we assume FPDU Alignment, the following diagram and the algorithm
algorithm below apply. Note that when using MPA, the receiver is below apply. Note that when using MPA, the receiver is assumed to
assumed to actively detect presence or loss of Header Alignment for actively detect presence or loss of FPDU Alignment for every TCP
every TCP segment received. segment received.
+--------------------------+ +--------------------------+ +--------------------------+ +--------------------------+
+--|--------------------------+ +--|--------------------------+ +--|--------------------------+ +--|--------------------------+
| | TCP Seg X | | | TCP Seg X+1 | | | TCP Seg X | | | TCP Seg X+1 |
+--|--------------------------+ +--|--------------------------+ +--|--------------------------+ +--|--------------------------+
+--------------------------+ +--------------------------+ +--------------------------+ +--------------------------+
FPDU #N FPDU #N+1 FPDU #N FPDU #N+1
Figure 11: Aligned FPDU placed immediately after TCP header Figure 11: Aligned FPDU placed immediately after TCP header
The receiver algorithm for Header Aligned frames (in-order or out-of- The receiver algorithm for FPDU Aligned frames (in-order or out-of-
order) includes: order) includes:
1) Data Link Layer processing (whole frame) - typically 1) Data Link Layer processing (whole frame) - typically
including a CRC calculation. including a CRC calculation.
2) Network Layer processing (assuming not an IP fragment, the 2) Network Layer processing (assuming not an IP fragment, the
whole Data Link Layer frame contains one IP datagram. IP whole Data Link Layer frame contains one IP datagram. IP
fragments should be reassembled in a local buffer. This is fragments should be reassembled in a local buffer. This is
not a performance optimization goal) not a performance optimization goal)
skipping to change at page 54, line 45 skipping to change at page 60, line 45
8) If no FPDU CRC errors, placement is allowed 8) If no FPDU CRC errors, placement is allowed
9) CopyData(TCP segment #X, host buffer address, length) 9) CopyData(TCP segment #X, host buffer address, length)
10) Loop to #5 until all the FPDUs in the TCP segment are 10) Loop to #5 until all the FPDUs in the TCP segment are
consumed in order to handle FPDU packing. consumed in order to handle FPDU packing.
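
The per-FPDU loop in the final steps above (CRC check, placement, loop for packing) might be sketched as follows in Python. This is a simplified model and not the normative receiver. It assumes the segment payload starts on an FPDU boundary, ignores MPA Markers entirely, and assumes an FPDU layout of a 2-octet length field, the ULPDU, padding to a 4-octet boundary, and a 4-octet CRC32C stored big-endian. The helper names (crc32c, build_fpdu, process_aligned_segment) and the place() callback, which stands in for CopyData, are invented for this example.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Frame one ULPDU: length, payload, pad to 4 octets, CRC (assumed layout)."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", crc32c(body))


def process_aligned_segment(payload: bytes, place) -> int:
    """Walk every complete FPDU in an aligned TCP segment payload.

    Calls place(ulpdu) for each FPDU whose CRC checks out and returns the
    number of FPDUs placed. Raises ValueError on a truncated FPDU or a
    CRC mismatch.
    """
    offset = 0
    placed = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("segment does not hold a complete FPDU")
        (length,) = struct.unpack_from("!H", payload, offset)
        body_len = 2 + length + (-(2 + length) % 4)
        end = offset + body_len + 4
        if end > len(payload):
            raise ValueError("segment does not hold a complete FPDU")
        body = payload[offset:offset + body_len]
        (wire_crc,) = struct.unpack_from("!I", payload, offset + body_len)
        if crc32c(body) != wire_crc:
            raise ValueError("FPDU CRC error; placement not allowed")
        place(body[2:2 + length])  # stands in for CopyData(...)
        placed += 1
        offset = end  # continue with the next packed FPDU, if any
    return placed
```

Because each FPDU is validated and placed on its own, the loop keeps no state between segments, which is the simplification the aligned case is meant to provide.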
Implementation note: In both cases the receiver has to classify the Implementation note: In both cases the receiver has to classify the
incoming TCP segment and associate it with one of the flows it incoming TCP segment and associate it with one of the flows it
maintains. In the case of no Header Alignment, the receiver is forced maintains. In the case of no FPDU Alignment, the receiver is forced
to classify incoming traffic before it can calculate the FPDU CRC. In to classify incoming traffic before it can calculate the FPDU CRC. In
the case of Header Alignment the operations order is left to the the case of FPDU Alignment the operations order is left to the
implementer. implementer.
The Header Aligned receiver algorithm is significantly simpler. There The FPDU Aligned receiver algorithm is significantly simpler. There
is no need to locally buffer portions of FPDUs. Accessing state is no need to locally buffer portions of FPDUs. Accessing state
information is also substantially simplified - the normal case does information is also substantially simplified - the normal case does
not require retrieving information to find out where a FPDU starts not require retrieving information to find out where a FPDU starts
and ends or retrieval of a partial CRC before the CRC calculation can and ends or retrieval of a partial CRC before the CRC calculation can
commence. This avoids adding internal latencies, having multiple data commence. This avoids adding internal latencies, having multiple data
passes through the CRC machine, or scheduling multiple commands for passes through the CRC machine, or scheduling multiple commands for
moving the data to the host buffer. moving the data to the host buffer.
The aligned FPDU approach is useful for in-order and out-of-order The aligned FPDU approach is useful for in-order and out-of-order
reception. The receiver can use the same mechanisms for data storage reception. The receiver can use the same mechanisms for data storage
in both cases, and only needs to account for when all the TCP in both cases, and only needs to account for when all the TCP
segments have arrived to enable delivery. . The Header Alignment, segments have arrived to enable delivery. The Header Alignment, along
along with the high probability that at least one complete FPDU is with the high probability that at least one complete FPDU is found
found with every TCP segment, allows the receiver to perform data with every TCP segment, allows the receiver to perform data placement
placement for out-of-order TCP segments with no need for intermediate for out-of-order TCP segments with no need for intermediate
buffering. Essentially the TCP receive buffer has been eliminated and buffering. Essentially the TCP receive buffer has been eliminated and
TCP reassembly is done in place within the ULP buffer. TCP reassembly is done in place within the ULP buffer.
In case Header Alignment is not found, the receiver should follow the In case FPDU Alignment is not found, the receiver should follow the
algorithm for non aligned FPDU reception which may be slower and less algorithm for non aligned FPDU reception which may be slower and less
efficient. efficient.
11.1.2.2 Header Alignment effects on TCP wire protocol 11.1.2.2 FPDU Alignment effects on TCP wire protocol
An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to
calculate its MULPDU, which it then exposes to DDP, its ULP. DDP calculate its MULPDU, which it then exposes to DDP, its ULP. DDP
uses the MULPDU to segment its payload so that each FPDU sent by uses the MULPDU to segment its payload so that each FPDU sent by
MPA fits completely into one TCP segment. This has no impact on MPA fits completely into one TCP segment. This has no impact on
wire protocol and exposing this information is already supported wire protocol and exposing this information is already supported
on many TCP implementations, including all modern flavors of BSD on many TCP implementations, including all modern flavors of BSD
networking, through the TCP_MAXSEG socket option. networking, through the TCP_MAXSEG socket option.
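
A minimal Python sketch of how an MPA layer might obtain the EMSS through TCP_MAXSEG and derive a MULPDU from it. The function names are hypothetical, and the overhead model is an assumption that this section does not state: 2 octets of length, 4 octets of CRC, one 4-octet Marker per 512 octets of segment, and EMSS mod 4 octets lost to keeping FPDUs 4-octet aligned. The value TCP_MAXSEG reports, and whether it already accounts for TCP options, varies by platform.

```python
import math
import socket


def read_emss(sock: socket.socket) -> int:
    """Ask the TCP stack for the maximum segment size of this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def mulpdu_from_emss(emss: int, markers: bool = True) -> int:
    """Largest ULPDU that still lets one FPDU fit in one TCP segment.

    Assumed overhead: 2-octet length plus 4-octet CRC, one 4-octet Marker
    per 512 octets when Markers are in use, and EMSS mod 4 octets of slack.
    """
    marker_octets = 4 * math.ceil(emss / 512) if markers else 0
    return emss - (6 + marker_octets + emss % 4)
```

For example, under these assumptions an EMSS of 1460 octets would give DDP a MULPDU of 1442 octets with Markers enabled.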
In the common case, the ULP (i.e. DDP over MPA) messages provided to In the common case, the ULP (i.e. DDP over MPA) messages provided to
skipping to change at page 56, line 9 skipping to change at page 62, line 9
scenario, the ULP may choose a FPDU size that is EMSS/2 +1 and has scenario, the ULP may choose a FPDU size that is EMSS/2 +1 and has
multiple messages available for transmission. For this poor choice of multiple messages available for transmission. For this poor choice of
FPDU size, the average TCP segment size is therefore about 1/2 of the FPDU size, the average TCP segment size is therefore about 1/2 of the
EMSS and the number of TCP segments emitted is approaching 2x of what EMSS and the number of TCP segments emitted is approaching 2x of what
is possible without the requirement to encapsulate an integer number is possible without the requirement to encapsulate an integer number
of complete FPDUs in every TCP segment. This is a dynamic situation of complete FPDUs in every TCP segment. This is a dynamic situation
that only lasts for the duration where the sender ULP has multiple that only lasts for the duration where the sender ULP has multiple
non-optimal messages for transmission and this causes a minor impact non-optimal messages for transmission and this causes a minor impact
on the wire utilization. on the wire utilization.
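
The worst case described above can be checked with a little arithmetic. The sketch below (Python; the function names and the EMSS of 1460 octets are illustrative assumptions) compares the number of TCP segments needed when every segment must carry a whole number of FPDUs against the number needed if the same octets were simply cut into full segments.

```python
import math


def segments_aligned(n_fpdus, fpdu_size, emss):
    """Segments needed when each segment carries an integer number of FPDUs."""
    per_segment = emss // fpdu_size
    if per_segment == 0:
        raise ValueError("FPDU larger than EMSS")
    return math.ceil(n_fpdus / per_segment)


def segments_unconstrained(n_fpdus, fpdu_size, emss):
    """Segments needed if the octet stream is cut into full segments."""
    return math.ceil(n_fpdus * fpdu_size / emss)


EMSS = 1460
FPDU = EMSS // 2 + 1  # the poor choice of FPDU size: just over half the EMSS
N = 1000

aligned = segments_aligned(N, FPDU, EMSS)
packed = segments_unconstrained(N, FPDU, EMSS)
print(aligned, packed, aligned / packed)
```

With an FPDU of EMSS/2 + 1 octets only one FPDU fits per segment, so the segment count approaches twice the unconstrained count, matching the "approaching 2x" estimate in the text.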
However, it is not expected that requiring Header Alignment will have However, it is not expected that requiring FPDU Alignment will have a
a measurable impact on wire behavior of most applications. Throughput measurable impact on wire behavior of most applications. Throughput
applications with large I/Os are expected to take full advantage of applications with large I/Os are expected to take full advantage of
the EMSS. Another class of applications with many small outstanding the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction oriented applications are also optimal. applicable. Transaction oriented applications are also optimal.
TCP retransmission is another area that can affect sender behavior. TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted TCP supports retransmission of the exact, originally transmitted
segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing
the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event
that part of the original segment has been received and acknowledged that part of the original segment has been received and acknowledged
by the remote peer (e.g., a re-segmenting middle box, as documented by the remote peer (e.g., a re-segmenting middle box, as documented
in 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on in 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on
page 26), a better available bandwidth utilization may be possible by page 30), a better available bandwidth utilization may be possible by
re-transmitting only the missing octets. If an MPA-aware TCP re-transmitting only the missing octets. If an MPA-aware TCP
retransmits complete FPDUs, there may be some marginal bandwidth retransmits complete FPDUs, there may be some marginal bandwidth
loss. loss.
Another area where a change in the TCP segment number may have impact Another area where a change in the TCP segment number may have impact
is that of Slow Start and Congestion Avoidance. Slow-start is that of Slow Start and Congestion Avoidance. Slow-start
exponential increase is measured in segments per second, as the exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start congestion that eventually results in dropped segments. Slow-start
exponential bandwidth growth for MPA-aware TCP is similar to any TCP exponential bandwidth growth for MPA-aware TCP is similar to any TCP
skipping to change at page 59, line 5 skipping to change at page 65, line 5
deadlock the MPA algorithm. If the path MTU is reduced, FPDU deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks. can never be successfully transmitted and the protocol deadlocks.
When a complete FPDU is received, processing continues normally. When a complete FPDU is received, processing continues normally.
11.3 IETF RNIC Interoperability with RDMA Consortium Protocols 11.3 IETF Implementation Interoperability with RDMA Consortium Protocols
The RDMA Consortium created early specifications of the MPA/DDP/RDMA
protocols and some manufacturers created implementations of those
protocols before the IETF versions were finalized. These protocols
are very similar to the IETF versions, making it possible for
implementations to be created or modified to support either set of
specifications. For those interested, the RDMA Consortium protocol
documents can be obtained at http://www.rdmaconsortium.org.
In this section, implementations of MPA/DDP/RDMA that conform to the
RDMAC specifications are called "RDMAC RNICs". Implementations of
MPA/DDP/RDMA that conform to the IETF RFCs are called "IETF RNICs".
Without the exchange of MPA Request/Reply Frames, there is no Without the exchange of MPA Request/Reply Frames, there is no
standard mechanism for enabling RDMAC RNICs to interoperate with IETF standard mechanism for enabling RDMAC RNICs to interoperate with IETF
RNICs. Even if a ULP uses a well-known port to start an IETF RNIC RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
immediately in RDMA mode (i.e., without exchanging the MPA immediately in RDMA mode (i.e., without exchanging the MPA
Request/Reply messages), there is no reason to believe an IETF RNIC Request/Reply messages), there is no reason to believe an IETF RNIC
will interoperate with an RDMAC RNIC because of the differences in will interoperate with an RDMAC RNIC because of the differences in
the version number in the DDP and RDMAP headers on the wire. the version number in the DDP and RDMAP headers on the wire.
Therefore, the ULP or other supporting entity at the RDMAC RNIC must Therefore, the ULP or other supporting entity at the RDMAC RNIC must
skipping to change at page 59, line 38 skipping to change at page 65, line 50
RDMA mode. RDMA mode.
Non-permissive IETF RNIC - an RNIC implementing the IETF protocols Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
which is not capable of implementing the RDMAC protocols. Such which is not capable of implementing the RDMAC protocols. Such
an RNIC can only interoperate with other IETF RNICs. an RNIC can only interoperate with other IETF RNICs.
Permissive IETF RNIC - an RNIC implementing the IETF protocols which Permissive IETF RNIC - an RNIC implementing the IETF protocols which
is capable of implementing the RDMAC protocols on a per is capable of implementing the RDMAC protocols on a per
connection basis. connection basis.
The Permissive IETF RNIC is recommended for those implementers that
want maximum interoperability with other RNIC implementations.
The values used by these three RNIC types for the MPA, DDP, and RDMAP The values used by these three RNIC types for the MPA, DDP, and RDMAP
versions as well as MPA markers and CRC are summarized in Figure 12. versions as well as MPA markers and CRC are summarized in Figure 12.
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| RNIC TYPE || DDP/RDMAP | MPA | MPA | MPA | | RNIC TYPE || DDP/RDMAP | MPA | MPA | MPA |
| || Version | Revision | Markers | CRC | | || Version | Revision | Markers | CRC |
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
+----------------++-----------+-----------+-----------+-----------+ +----------------++-----------+-----------+-----------+-----------+
| RDMAC || 0 | 0 | 1 | 1 | | RDMAC || 0 | 0 | 1 | 1 |
| || | | | | | || | | | |
skipping to change at page 64, line 37 skipping to change at page 70, line 37
Renato J Recio Renato J Recio
IBM IBM
Internal Zip 9043 Internal Zip 9043
11400 Burnett Road 11400 Burnett Road
Austin, Texas 78759 Austin, Texas 78759
Phone: 512-838-3685 Phone: 512-838-3685
Email: recio@us.ibm.com Email: recio@us.ibm.com
John Carrier John Carrier
Adaptec Inc. Cray Inc.
691 South Milpitas Blvd. 411 First Avenue S, Suite 600
Milpitas, CA 95035 Seattle, WA 98104-2860
Phone: 360-378-8526 Phone: 206-701-2090
Email: John_Carrier@adaptec.com Email: carrier@cray.com
13 Acknowledgments 13 Acknowledgments
Dwight Barron Dwight Barron
Hewlett-Packard Company Hewlett-Packard Company
20555 SH 249 20555 SH 249
Houston, Tx. USA 77070-2698 Houston, Tx. USA 77070-2698
Phone: 281-514-2769 Phone: 281-514-2769
Email: dwight.barron@hp.com Email: dwight.barron@hp.com
skipping to change at page 68, line 5 skipping to change at page 74, line 5
Phone: +1 916 785 5198 Phone: +1 916 785 5198
Email: jim_wendt@hp.com Email: jim_wendt@hp.com
Jim Williams Jim Williams
Emulex Corporation Emulex Corporation
580 Main Street 580 Main Street
Bolton, MA 01740 USA Bolton, MA 01740 USA
Phone: +1 978 779 7224 Phone: +1 978 779 7224
Email: jim.williams@emulex.com Email: jim.williams@emulex.com
14 Full Copyright Statement Full Copyright Statement
This document and the information contained herein is provided on an This document and the information contained herein is provided on an
"AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION, CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION, MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY, NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
skipping to change at line 2942 skipping to change at page 74, line 30
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Copyright (C) The Internet Society (2005). This document is subject Copyright (C) The Internet Society (2005). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights. except as set forth therein, the authors retain all their rights.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
 End of changes. 97 change blocks. 
410 lines changed or deleted 561 lines changed or added
