draft-ietf-rddp-mpa-05.txt   draft-ietf-rddp-mpa-06.txt 
Remote Direct Data Placement Work Group P. Culley Remote Direct Data Placement Work Group P. Culley
INTERNET-DRAFT Hewlett-Packard Company INTERNET-DRAFT Hewlett-Packard Company
draft-ietf-rddp-mpa-05.txt U. Elzur draft-ietf-rddp-mpa-06.txt U. Elzur
Broadcom Corporation Broadcom Corporation
R. Recio R. Recio
IBM Corporation IBM Corporation
S. Bailey S. Bailey
Sandburst Corporation Sandburst Corporation
J. Carrier J. Carrier
Cray Inc. Cray Inc.
Expires: December 2006 June 23, 2006 Expires: February 2007 September 5, 2006
Marker PDU Aligned Framing for TCP Specification Marker PDU Aligned Framing for TCP Specification
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
skipping to change at page 2, line 9 skipping to change at page 2, line 9
boundaries that DDP requires. MPA is fully compliant with applicable boundaries that DDP requires. MPA is fully compliant with applicable
TCP RFCs and can be utilized with existing TCP implementations. MPA TCP RFCs and can be utilized with existing TCP implementations. MPA
also supports integrated implementations that combine TCP, MPA and also supports integrated implementations that combine TCP, MPA and
DDP to reduce buffering requirements in the implementation and DDP to reduce buffering requirements in the implementation and
improve performance at the system level. improve performance at the system level.
Table of Contents Table of Contents
Status of this Memo 1 Status of this Memo 1
Abstract 1 Abstract 1
1 Glossary 4 1 Glossary 5
2 Introduction 7 2 Introduction 8
2.1 Motivation 7 2.1 Motivation 8
2.2 Protocol Overview 7 2.2 Protocol Overview 8
3 MPA's interactions with DDP 11 3 MPA's interactions with DDP 12
4 MPA Full Operation Mode 13 4 MPA Full Operation Mode 14
4.1 FPDU Format 13 4.1 FPDU Format 14
4.2 Marker Format 14 4.2 Marker Format 15
4.3 MPA Markers 14 4.3 MPA Markers 15
4.4 CRC Calculation 17 4.4 CRC Calculation 18
4.5 FPDU Size Considerations 20 4.5 FPDU Size Considerations 21
5 MPA's interactions with TCP 22 5 MPA's interactions with TCP 23
5.1 MPA transmitters with a standard layered TCP 23 5.1 MPA transmitters with a standard layered TCP 23
5.2 MPA receivers with a standard layered TCP 24 5.2 MPA receivers with a standard layered TCP 24
5.3 Optimized MPA/TCP transmitters 24 6 MPA Receiver FPDU Identification 24
5.3.1 Effects of Optimized MPA/TCP Segmentation 25 7 Connection Semantics 26
5.4 Optimized MPA/TCP receivers 27 7.1 Connection setup 26
6 MPA Receiver FPDU Identification 28 7.1.1 MPA Request and Reply Frame Format 28
6.1 Re-segmenting Middle boxes and non optimized MPA/TCP senders 29 7.1.2 Connection Startup Rules 29
7 Connection Semantics 30 7.1.3 Example Delayed Startup sequence 32
7.1 Connection setup 30 7.1.4 Use of Private Data 35
7.1.1 MPA Request and Reply Frame Format 32 7.1.4.1 Motivation 35
7.1.2 Connection Startup Rules 33 7.1.4.2 Example Immediate Startup using Private Data 36
7.1.3 Example Delayed Startup sequence 36 7.1.5 "Dual stack" implementations 38
7.1.4 Use of Private Data 39 7.2 Normal Connection Teardown 39
7.1.5 "Dual stack" implementations 42 8 Error Semantics 40
7.2 Normal Connection Teardown 43 9 Security Considerations 41
8 Error Semantics 44 9.1 Protocol-specific Security Considerations 41
9 Security Considerations 45 9.1.1 Spoofing 41
9.1 Protocol-specific Security Considerations 45 9.1.1.1 Impersonation 41
9.1.1 Spoofing 45 9.1.1.2 Stream Hijacking 42
9.1.2 Eavesdropping 46 9.1.1.3 Man in the Middle Attack 42
9.2 Introduction to Security Options 47 9.1.2 Eavesdropping 42
9.3 Using IPsec With MPA 47 9.2 Introduction to Security Options 43
9.4 Requirements for IPsec Encapsulation of MPA/DDP 48 9.3 Using IPsec With MPA 43
10 IANA Considerations 49 9.4 Requirements for IPsec Encapsulation of MPA/DDP 44
11 References 50 10 IANA Considerations 45
11.1 Normative References 50 A Appendix. Optimized MPA-aware TCP implementations 46
11.2 Informative References 50 A.1 Optimized MPA/TCP transmitters 46
12 Appendix 52 A.2 Effects of Optimized MPA/TCP Segmentation 47
12.1 Analysis of MPA over TCP Operations 52 A.3 Optimized MPA/TCP receivers 49
12.1.1 Assumptions 53 A.4 Re-segmenting Middle boxes and non optimized MPA/TCP senders 50
12.1.2 The Value of FPDU Alignment 54 A.5 Receiver implementation 51
12.2 Receiver implementation 61 A.5.1 Network Layer Reassembly Buffers 52
12.2.1 Network Layer Reassembly Buffers 61 A.5.2 TCP Reassembly buffers 53
12.2.2 TCP Reassembly buffers 62 B Appendix. Analysis of MPA over TCP Operations 54
12.3 IETF Implementation Interoperability with RDMA Consortium B.1 Assumptions 54
B.1.1 MPA is layered beneath DDP [DDP] 54
B.1.2 MPA preserves DDP message framing 55
B.1.3 The size of the ULPDU passed to MPA is less than EMSS under
normal conditions 55
B.1.4 Out-of-order placement but NO out-of-order Delivery 55
B.2 The Value of FPDU Alignment 55
B.2.1 Impact of lack of FPDU Alignment on the receiver computational
load and complexity 57
B.2.2 FPDU Alignment effects on TCP wire protocol 61
C Appendix. IETF Implementation Interoperability with RDMA Consortium
Protocols 63 Protocols 63
12.3.1 Negotiated Parameters 63 C.1 Negotiated Parameters 63
12.3.2 RDMAC RNIC and Non-permissive IETF RNIC 64 C.2 RDMAC RNIC and Non-permissive IETF RNIC 65
12.3.3 RDMAC RNIC and Permissive IETF RNIC 66 C.2.1 RDMAC RNIC Initiator 65
12.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 67 C.2.2 Non-Permissive IETF RNIC Initiator 66
13 Author's Addresses 68 C.2.3 RDMAC RNIC and Permissive IETF RNIC 66
14 Acknowledgments 69 C.2.4 RDMAC RNIC Initiator 67
Full Copyright Statement 72 C.2.5 Permissive IETF RNIC Initiator 67
Intellectual Property 72 C.3 Non-Permissive IETF RNIC and Permissive IETF RNIC 67
Normative References 69
Informative References 69
Author's Addresses 71
Acknowledgments 72
Full Copyright Statement 75
Intellectual Property 75
Table of Figures Table of Figures
Figure 1 ULP MPA TCP Layering 8 Figure 1 ULP MPA TCP Layering 9
Figure 2 FPDU Format 13 Figure 2 FPDU Format 14
Figure 3 Marker Format 14 Figure 3 Marker Format 15
Figure 4 Example FPDU Format with Marker 16 Figure 4 Example FPDU Format with Marker 17
Figure 5 Annotated Hex Dump of an FPDU 19 Figure 5 Annotated Hex Dump of an FPDU 20
Figure 6 Annotated Hex Dump of an FPDU with Marker 20 Figure 6 Annotated Hex Dump of an FPDU with Marker 21
Figure 7 Fully layered implementation 22 Figure 7 Fully layered implementation 23
Figure 8 Optimized MPA/TCP implementation 22 Figure 8 MPA Request/Reply Frame 28
Figure 9 MPA Request/Reply Frame 32 Figure 9: Example Delayed Startup negotiation 33
Figure 10: Example Delayed Startup negotiation 37 Figure 10: Example Immediate Startup negotiation 36
Figure 11: Example Immediate Startup negotiation 40 Figure 11 Optimized MPA/TCP implementation 46
Figure 12: Non-aligned FPDU freely placed in TCP octet stream 56 Figure 12: Non-aligned FPDU freely placed in TCP octet stream 57
Figure 13: Aligned FPDU placed immediately after TCP header 57 Figure 13: Aligned FPDU placed immediately after TCP header 59
Figure 14. Connection Parameters for the RNIC Types. 64 Figure 14. Connection Parameters for the RNIC Types. 64
Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC. 65 IETF RNIC. 65
Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC. 66 IETF RNIC. 66
Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC. 67 Permissive IETF RNIC. 68
Revision history [To be deleted prior to RFC publication] Revision history [To be deleted prior to RFC publication]
[draft-ietf-rddp-mpa-06] workgroup draft with following changes:
Document restructuring to move descriptive information on
implementing optimized MPA/TCP implementations to an appendix.
All normative text was removed from the appendix. Paragraph
added to security section explaining IPSEC version. Added
informative references to architecture, applicability, and
problem statement documents.
[draft-ietf-rddp-mpa-05] workgroup draft with following changes: [draft-ietf-rddp-mpa-05] workgroup draft with following changes:
Document restructuring to differentiate between fully layered Document restructuring to differentiate between fully layered
MPA on TCP implementations and optimized MPA/TCP MPA on TCP implementations and optimized MPA/TCP
implementations. This involved somewhat blurring the artificial implementations. This involved somewhat blurring the artificial
layer between MPA and an MPA-aware TCP. This involved a bit of layer between MPA and an MPA-aware TCP. This involved a bit of
terminology change. terminology change.
Re-wrote the requirement to avoid duplicate segments during TCP Re-wrote the requirement to avoid duplicate segments during TCP
out of order passing to MPA; this is now a co-responsibility out of order passing to MPA; this is now a co-responsibility
skipping to change at page 5, line 44 skipping to change at page 6, line 44
PDU - protocol data unit PDU - protocol data unit
Private Data - A block of data exchanged between MPA endpoints during Private Data - A block of data exchanged between MPA endpoints during
initial connection setup. initial connection setup.
Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that
ties use of various endpoint resources (memory access etc.) to the ties use of various endpoint resources (memory access etc.) to the
specific RDMA/DDP/MPA connection. specific RDMA/DDP/MPA connection.
RDDP - a suite of protocols including MPA, [DDP], [RDMAP], an overall
security document [RDMASEC], a problem statement [RFC4297], an
architecture document [RFC4296], and an applicability document
[APPL].
RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
to enable applications to transfer data directly from memory to enable applications to transfer data directly from memory
buffers. See [RDMAP]. buffers. See [RDMAP].
Remote Peer - The MPA protocol implementation on the opposite end of Remote Peer - The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two describing protocol exchanges or other interactions between two
Nodes. Nodes.
Responder - The connection endpoint which responds to an incoming MPA Responder - The connection endpoint which responds to an incoming MPA
skipping to change at page 9, line 31 skipping to change at page 10, line 31
stream, verifies their integrity, and removes MPA Markers (when stream, verifies their integrity, and removes MPA Markers (when
present), ULPDU_Length, PAD and the CRC field. present), ULPDU_Length, PAD and the CRC field.
7. MPA then provides the complete ULPDUs to DDP. MPA may also 7. MPA then provides the complete ULPDUs to DDP. MPA may also
separate passing MPA payload to DDP from passing the MPA payload separate passing MPA payload to DDP from passing the MPA payload
ordering information. ordering information.
A fully layered MPA on TCP is implemented as a data stream ULP for A fully layered MPA on TCP is implemented as a data stream ULP for
TCP and is therefore RFC compliant. TCP and is therefore RFC compliant.
An optimized MPA/TCP uses a TCP layer which potentially contains some An optimized DDP/MPA/TCP uses a TCP layer which potentially contains
additional semantics as defined in this document. It is completely some additional behaviors as suggested in this document. When
interoperable with a fully layered MPA on TCP implementation and is DDP/MPA/TCP are cross-layer optimized, the behavior of TCP (esp.
also RFC compliant. sender segmentation) may change from that of the un-optimized
implementation, but the changes are within the bounds permitted by
the TCP RFC specifications, and will interoperate with an un-
optimized TCP. The additional behaviors are described in Appendix A
and are not normative; they are described at a TCP interface layer as
a convenience. Implementations may achieve the described
functionality using any method, including cross layer optimizations
between TCP, MPA and DDP.
An optimized MPA/TCP sender is able to segment the data stream such An optimized DDP/MPA/TCP sender is able to segment the data stream
that TCP segments begin with FPDUs (FPDU Alignment). This has such that TCP segments begin with FPDUs (FPDU Alignment). This has
significant advantages for receivers. When segments arrive with significant advantages for receivers. When segments arrive with
aligned FPDUs the receiver usually need not buffer any portion of the aligned FPDUs the receiver usually need not buffer any portion of the
segment, allowing DDP to place it in its destination memory segment, allowing DDP to place it in its destination memory
immediately, thus avoiding copies from intermediate buffers (DDP's immediately, thus avoiding copies from intermediate buffers (DDP's
reason for existence). reason for existence).
An optimized MPA/TCP receiver allows a DDP on MPA implementation to An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation
locate the start of ULPDUs that may be received out of order. It to locate the start of ULPDUs that may be received out of order. It
also allows the implementation to determine if the entire ULPDU has also allows the implementation to determine if the entire ULPDU has
been received. As a result, MPA can pass out of order ULPDUs to DDP been received. As a result, MPA can pass out of order ULPDUs to DDP
for immediate use. This enables a DDP on MPA implementation to save for immediate use. This enables a DDP on MPA implementation to save
a significant amount of intermediate storage by placing the ULPDUs in a significant amount of intermediate storage by placing the ULPDUs in
the right locations in the application buffers when they arrive, the right locations in the application buffers when they arrive,
rather than waiting until full ordering can be restored. rather than waiting until full ordering can be restored.
The ability of a receiver to recover out of order ULPDUs is optional The ability of a receiver to recover out of order ULPDUs is optional
and declared to the transmitter during startup. When the receiver and declared to the transmitter during startup. When the receiver
declares that it does not support out of order recovery, the declares that it does not support out of order recovery, the
skipping to change at page 10, line 31 skipping to change at page 11, line 38
indicates segments in error at a much higher rate than the underlying indicates segments in error at a much higher rate than the underlying
link characteristics would indicate. With these higher error rates, link characteristics would indicate. With these higher error rates,
the chance that an error will escape detection, when using only the the chance that an error will escape detection, when using only the
TCP checksum for data integrity, becomes a concern. A stronger TCP checksum for data integrity, becomes a concern. A stronger
integrity check can reduce the chance of data errors being missed. integrity check can reduce the chance of data errors being missed.
MPA includes a CRC check to increase the ULPDU data integrity to the MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC2960]. It level provided by other modern protocols, such as SCTP [RFC2960]. It
is possible to disable this CRC check, however CRCs MUST be enabled is possible to disable this CRC check, however CRCs MUST be enabled
unless it is clear that the end to end connection through the network unless it is clear that the end to end connection through the network
has data integrity at least as good as a MPA with CRC enabled (for has data integrity at least as good as an MPA with CRC enabled (for
example when IPsec is implemented end to end). DDP's ULP expects example when IPsec is implemented end to end). DDP's ULP expects
this level of data integrity and therefore the ULP does not have to this level of data integrity and therefore the ULP does not have to
provide its own duplicate data integrity and error recovery for lost provide its own duplicate data integrity and error recovery for lost
data. data.
3 MPA's interactions with DDP 3 MPA's interactions with DDP
DDP requires MPA to maintain DDP record boundaries from the sender to DDP requires MPA to maintain DDP record boundaries from the sender to
the receiver. When using MPA on TCP to send data, DDP provides the receiver. When using MPA on TCP to send data, DDP provides
records (ULPDUs) to MPA. MPA will use the reliable transmission records (ULPDUs) to MPA. MPA will use the reliable transmission
skipping to change at page 13, line 46 skipping to change at page 14, line 46
support the largest IP datagrams for IPv4 or IPv6. support the largest IP datagrams for IPv4 or IPv6.
PAD: The PAD field trails the ULPDU and contains between zero and PAD: The PAD field trails the ULPDU and contains between zero and
three octets of data. The pad data MUST be set to zero by the sender three octets of data. The pad data MUST be set to zero by the sender
and ignored by the receiver (except for CRC checking). The length of and ignored by the receiver (except for CRC checking). The length of
the pad is set so as to make the size of the FPDU an integral the pad is set so as to make the size of the FPDU an integral
multiple of four. multiple of four.
CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C
check value, which is used to verify the entire contents of the FPDU, check value, which is used to verify the entire contents of the FPDU,
using CRC32C. See section 4.4 CRC Calculation on page 17. When CRCs using CRC32C. See section 4.4 CRC Calculation on page 18. When CRCs
are not enabled, this field is still present, may contain any value, are not enabled, this field is still present, may contain any value,
and MUST NOT be checked. and MUST NOT be checked.
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
addition, the total length of the FPDU will include the length of any addition, the total length of the FPDU will include the length of any
Markers and from 0 to 3 pad octets added to round-up the ULPDU size. Markers and from 0 to 3 pad octets added to round-up the ULPDU size.
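The size arithmetic above can be illustrated with a minimal sketch. It assumes a 2-octet ULPDU Length field (inferred from the 6-octet minimum less the 32-bit CRC) and ignores Markers; the names pad_length and fpdu_size are hypothetical and do not appear in the specification.

```python
ULPDU_LENGTH_FIELD = 2  # octets; assumed: the 6-octet minimum minus the 4-octet CRC
CRC_FIELD = 4           # octets (32 bits)


def pad_length(ulpdu_len: int) -> int:
    """Return the number of pad octets (0..3) that make the FPDU a multiple of four."""
    return -(ULPDU_LENGTH_FIELD + ulpdu_len) % 4


def fpdu_size(ulpdu_len: int) -> int:
    """Return the FPDU size in octets, excluding any Markers."""
    return ULPDU_LENGTH_FIELD + ulpdu_len + pad_length(ulpdu_len) + CRC_FIELD
```

Under these assumptions a 42 octet ULPDU needs no padding and yields a 48 octet FPDU, while a 43 octet ULPDU needs three pad octets.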
4.2 Marker Format 4.2 Marker Format
The format of a Marker MUST be as specified in Figure 3: The format of a Marker MUST be as specified in Figure 3:
skipping to change at page 14, line 44 skipping to change at page 15, line 44
All MPA Markers are included in the containing FPDU CRC calculation All MPA Markers are included in the containing FPDU CRC calculation
(when both CRCs and Markers are in use). (when both CRCs and Markers are in use).
The MPA receiver's ability to locate out of order FPDUs and pass the The MPA receiver's ability to locate out of order FPDUs and pass the
ULPDUs to DDP is implementation dependent. MPA/DDP allows those ULPDUs to DDP is implementation dependent. MPA/DDP allows those
receivers that are able to deal with out of order FPDUs in this way receivers that are able to deal with out of order FPDUs in this way
to require the insertion of Markers in the data stream. When the to require the insertion of Markers in the data stream. When the
receiver cannot deal with out of order FPDUs in this way, it may receiver cannot deal with out of order FPDUs in this way, it may
disable the insertion of Markers at the sender. All MPA senders MUST disable the insertion of Markers at the sender. All MPA senders MUST
be able to generate Markers when their use is declared by the be able to generate Markers when their use is declared by the
opposing receiver (see section 7.1 Connection setup on page 30). opposing receiver (see section 7.1 Connection setup on page 26).
When Markers are enabled, MPA senders MUST insert a Marker into the When Markers are enabled, MPA senders MUST insert a Marker into the
data stream at a 512 octet periodic interval in the TCP Sequence data stream at a 512 octet periodic interval in the TCP Sequence
Number Space. The Marker contains a 16 bit unsigned integer referred Number Space. The Marker contains a 16 bit unsigned integer referred
to as the FPDUPTR (FPDU Pointer). to as the FPDUPTR (FPDU Pointer).
If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
relative back-pointer. FPDUPTR MUST contain the number of octets in relative back-pointer. FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the ULPDU Length field to the the TCP stream from the beginning of the ULPDU Length field to the
first octet of the Marker, unless the Marker falls between FPDUs. first octet of the Marker, unless the Marker falls between FPDUs.
skipping to change at page 15, line 25 skipping to change at page 16, line 25
the Marker (if CRCs are being generated or checked). Thus an FPDUPTR the Marker (if CRCs are being generated or checked). Thus an FPDUPTR
value of 0x0000 means that immediately following the Marker is an value of 0x0000 means that immediately following the Marker is an
FPDU header (the ULPDU Length field). FPDU header (the ULPDU Length field).
Since all FPDUs are integral multiples of 4 octets, the bottom two Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero. MPA bits of the FPDUPTR as calculated by the sender are zero. MPA
reserves these bits so they MUST be treated as zero for computation reserves these bits so they MUST be treated as zero for computation
at the receiver. at the receiver.
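A receiver's use of the FPDUPTR might be sketched as follows. This is an illustration only: it assumes a 4-octet Marker (per the Marker format figure, which is not reproduced in this excerpt) and 32-bit TCP sequence arithmetic, and the function name fpdu_header_seq is hypothetical.

```python
SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide
MARKER_LEN = 4         # octets; assumed size of a Marker


def fpdu_header_seq(marker_seq: int, raw_fpduptr: int) -> int:
    """Return the sequence number of the ULPDU Length field a Marker points to.

    The bottom two bits of the received FPDUPTR are reserved and are
    treated as zero before the value is used.
    """
    ptr = raw_fpduptr & 0xFFFC
    if ptr == 0:
        # The FPDU header immediately follows the Marker.
        return (marker_seq + MARKER_LEN) % SEQ_MODULUS
    # Otherwise the pointer is a relative back-pointer from the Marker.
    return (marker_seq - ptr) % SEQ_MODULUS
```

Masking with 0xFFFC rather than rejecting non-zero reserved bits follows the "treated as zero" wording; an implementation could reasonably structure this differently.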
When Markers are enabled (see section 7.1 Connection setup on page When Markers are enabled (see section 7.1 Connection setup on page
30), the MPA Markers MUST be inserted immediately preceding the first 26), the MPA Markers MUST be inserted immediately preceding the first
FPDU of Full Operation phase, and at every 512th octet of the TCP FPDU of Full Operation phase, and at every 512th octet of the TCP
octet stream thereafter. As a result, the first Marker has an octet stream thereafter. As a result, the first Marker has an
FPDUPTR value of 0x0000. If the first Marker begins at octet FPDUPTR value of 0x0000. If the first Marker begins at octet
sequence number SeqStart, then Markers are inserted such that the sequence number SeqStart, then Markers are inserted such that the
first octet of the Marker is at octet sequence number SeqNum if the first octet of the Marker is at octet sequence number SeqNum if the
remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum
can wrap. can wrap.
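The placement rule, including sequence number wrap, can be sketched as below. The function names are hypothetical. Because 2^32 is an integral multiple of 512, reducing the difference modulo 2^32 first keeps the 512 octet spacing intact across the wrap.

```python
MARKER_INTERVAL = 512  # octets between Marker starts in sequence space
SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide


def is_marker_start(seq_num: int, seq_start: int) -> bool:
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_MODULUS) % MARKER_INTERVAL == 0


def next_marker_seq(seq_num: int, seq_start: int) -> int:
    """Return the first Marker position at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MODULUS
    gap = -offset % MARKER_INTERVAL
    return (seq_num + gap) % SEQ_MODULUS
```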
For example, if the TCP sequence number were used to calculate the For example, if the TCP sequence number were used to calculate the
insertion point of the Marker, the starting TCP sequence number is insertion point of the Marker, the starting TCP sequence number is
skipping to change at page 17, line 34 skipping to change at page 18, line 34
from undetected errors as an end-to-end CRC32c. from undetected errors as an end-to-end CRC32c.
The process MUST be invisible to the ULP. The process MUST be invisible to the ULP.
After receipt of an MPA startup declaration indicating that its peer After receipt of an MPA startup declaration indicating that its peer
requires CRCs, an MPA instance MUST continue generating and checking requires CRCs, an MPA instance MUST continue generating and checking
CRCs until the connection terminates. If an MPA instance has CRCs until the connection terminates. If an MPA instance has
declared that it does not require CRCs, it MUST turn off CRC checking declared that it does not require CRCs, it MUST turn off CRC checking
immediately after receipt of an MPA mode declaration indicating that immediately after receipt of an MPA mode declaration indicating that
its peer also does not require CRCs. It MAY continue generating its peer also does not require CRCs. It MAY continue generating
CRCs. See section 7.1 Connection setup on page 30 for details on the CRCs. See section 7.1 Connection setup on page 26 for details on the
MPA startup. MPA startup.
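One reading of the rule above is that CRC checking stays on whenever either endpoint has declared that it requires CRCs. The sketch below encodes that reading; the function and parameter names are hypothetical, and it does not model the option of continuing to generate CRCs when checking is off.

```python
def crc_checking_enabled(local_requires_crc: bool, peer_requires_crc: bool) -> bool:
    """Return True if this MPA instance checks CRCs on received FPDUs.

    Checking is turned off only when both the local declaration and the
    peer's declaration state that CRCs are not required.
    """
    return local_requires_crc or peer_requires_crc
```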
When sending an FPDU, the sender MUST include a CRC field. When CRCs When sending an FPDU, the sender MUST include a CRC field. When CRCs
are enabled, the CRC field in the MPA FPDU MUST be computed using the are enabled, the CRC field in the MPA FPDU MUST be computed using the
CRC32C polynomial in the manner described in the iSCSI Protocol CRC32C polynomial in the manner described in the iSCSI Protocol
[iSCSI] document for Header and Data Digests. [iSCSI] document for Header and Data Digests.
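For reference, a bit-at-a-time CRC32C can be sketched as follows. It uses the commonly published CRC32C parameters (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF) and is an illustration rather than the normative procedure; the [iSCSI] document remains the authority on bit ordering and on how the result is placed in the CRC field.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Compute the CRC32C of data one bit at a time (slow but easy to audit)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

The standard check string "123456789" gives 0xE3069283 with these parameters, which is a convenient sanity test before comparing against the annotated hex dumps.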
The fields which MUST be included in the CRC calculation when sending The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows: an FPDU are as follows:
skipping to change at page 18, line 36 skipping to change at page 19, line 36
MUST first perform the following: MUST first perform the following:
1) Calculate the CRC of the incoming FPDU in the same fashion as 1) Calculate the CRC of the incoming FPDU in the same fashion as
defined above. defined above.
2) Verify that the calculated CRC-32c value is the same as the 2) Verify that the calculated CRC-32c value is the same as the
received CRC-32c value found in the FPDU CRC field. If not, the received CRC-32c value found in the FPDU CRC field. If not, the
receiver MUST treat the FPDU as an invalid FPDU. receiver MUST treat the FPDU as an invalid FPDU.
The procedure for handling invalid FPDUs is covered in the Error The procedure for handling invalid FPDUs is covered in the Error
Section (see section 8 on page 44) Section (see section 8 on page 40).
The following is an annotated hex dump of an example FPDU sent as the The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream. As such, it starts with a Marker. The first FPDU on the stream. As such, it starts with a Marker. The
FPDU contains a 42 octet ULPDU (an example DDP segment) which in turn FPDU contains a 42 octet ULPDU (an example DDP segment) which in turn
contains 24 octets of the contained ULPDU, which is a data load that contains 24 octets of the contained ULPDU, which is a data load that
is all zeros. The CRC32c has been correctly calculated and can be is all zeros. The CRC32c has been correctly calculated and can be
used as a reference. See the [DDP] and [RDMAP] specification for used as a reference. See the [DDP] and [RDMAP] specification for
definitions of the DDP Control field, Queue, MSN, MO, and Send Data. definitions of the DDP Control field, Queue, MSN, MO, and Send Data.
Octet Contents Annotation Octet Contents Annotation
skipping to change at page 21, line 27 skipping to change at page 22, line 27
already packed into a TCP Segment, MULPDU MAY be reduced accordingly. already packed into a TCP Segment, MULPDU MAY be reduced accordingly.
DDP SHOULD provide ULPDUs that are as large as possible, but less DDP SHOULD provide ULPDUs that are as large as possible, but less
than or equal to MULPDU. than or equal to MULPDU.
If the TCP implementation needs to adjust EMSS to support MTU changes If the TCP implementation needs to adjust EMSS to support MTU changes
or changing TCP options, the MULPDU value is changed accordingly. or changing TCP options, the MULPDU value is changed accordingly.
In certain rare situations, the EMSS may shrink below 128 octets in In certain rare situations, the EMSS may shrink below 128 octets in
size. If this occurs, the MPA on TCP sender MUST NOT shrink the size. If this occurs, the MPA on TCP sender MUST NOT shrink the
MULPDU below 128 octets and is not REQUIRED to follow the MULPDU below 128 octets and is not required to follow the
segmentation rules in Sections 5.1 and 5.3. segmentation rules in Sections 5.1 and Appendix A.
If one or more FPDUs are already packed into a TCP segment, such that If one or more FPDUs are already packed into a TCP segment, such that
the remaining room is less than 128 octets, MPA MUST NOT provide a the remaining room is less than 128 octets, MPA MUST NOT provide a
MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU smaller than 128. In this case, MPA would typically provide a
MULPDU for the next full sized segment, but may still pack the next MULPDU for the next full sized segment, but may still pack the next
FPDU into the small remaining room, provided that the next FPDU is FPDU into the small remaining room, provided that the next FPDU is
small enough to fit. small enough to fit.
The value 128 is chosen so as to allow DDP designers room for the DDP The value 128 is chosen so as to allow DDP designers room for the DDP
Header and some user data. Header and some user data.
5 MPA's interactions with TCP 5 MPA's interactions with TCP
The following sections describe MPA's interactions with TCP. We will The following sections describe MPA's interactions with TCP. This
discuss two significant cases; using a standard layered TCP stack section discusses using a standard layered TCP stack with MPA
with MPA attached above a TCP socket, and using an optimized MPA- attached above a TCP socket. Discussion of using an optimized MPA-
aware TCP with an MPA implementation that takes advantage of the aware TCP with an MPA implementation that takes advantage of the
extra optimizations. Other implementations are possible. extra optimizations is done in Appendix A.
+-----------------------------------+ +-----------------------------------+
| +-----+ +-----------------+ | | +-----+ +-----------------+ |
| | MPA | | Other Protocols | | | | MPA | | Other Protocols | |
| +-----+ +-----------------+ | | +-----+ +-----------------+ |
| || || | | || || |
| ----- socket API -------------- | | ----- socket API -------------- |
| || | | || |
| +-----+ | | +-----+ |
| | TCP | | | | TCP | |
skipping to change at page 22, line 39 skipping to change at page 23, line 39
Figure 7 Fully layered implementation Figure 7 Fully layered implementation
The Fully layered implementation is described for completeness; The Fully layered implementation is described for completeness;
however, the user is cautioned that the reduced probability of FPDU however, the user is cautioned that the reduced probability of FPDU
alignment when transmitting with this implementation will tend to alignment when transmitting with this implementation will tend to
introduce a higher overhead at optimized receivers. In addition, the introduce a higher overhead at optimized receivers. In addition, the
lack of out-of-order receive processing will significantly reduce the lack of out-of-order receive processing will significantly reduce the
value of DDP/MPA by imposing higher buffering and copying overhead in value of DDP/MPA by imposing higher buffering and copying overhead in
the local receiver. the local receiver.
+-----------------------------------+
| +-----------+ +-----------------+ |
| | Optimized | | Other Protocols | |
| | MPA/TCP | +-----------------+ |
| +-----------+ || |
| \\ --- socket API --- |
| \\ || |
| \\ +-----+ |
| \\ | TCP | |
| \\ +-----+ |
| \\ // |
| +-------+ |
| | IP | |
| +-------+ |
+-----------------------------------+
Figure 8 Optimized MPA/TCP implementation
The optimized MPA/TCP implementations described below are only
applicable to MPA; all other TCP applications continue to use the
standard TCP stacks and interfaces.
5.1 MPA transmitters with a standard layered TCP 5.1 MPA transmitters with a standard layered TCP
MPA transmitters SHOULD calculate a MULPDU as described in section MPA transmitters SHOULD calculate a MULPDU as described in section
4.5. If the TCP implementation allows EMSS to be determined by MPA, 4.5. If the TCP implementation allows EMSS to be determined by MPA,
that value should be used. If the transmit side TCP implementation that value should be used. If the transmit side TCP implementation
is not able to report the EMSS, MPA SHOULD use the current MTU value is not able to report the EMSS, MPA SHOULD use the current MTU value
to establish a likely FPDU size, taking into account the various to establish a likely FPDU size, taking into account the various
expected header sizes. expected header sizes.
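The MTU fallback described above can be sketched as follows. This is
an illustrative estimate only, not part of the specification. The
header sizes assume IPv4 and TCP without options unless stated, and
the MULPDU formula is recalled from section 4.5, which is not
reproduced in this excerpt, so it should be checked against that
section before use. The function names are hypothetical.

```python
IPV4_HEADER = 20  # assumption: IPv4 header without options
TCP_HEADER = 20   # assumption: TCP header without options


def estimate_emss(mtu, tcp_option_bytes=0):
    """Estimate the EMSS from the MTU when TCP cannot report it."""
    return mtu - IPV4_HEADER - TCP_HEADER - tcp_option_bytes


def estimate_mulpdu(emss):
    """Estimate the largest ULPDU that fits in one TCP segment.

    Assumed overheads: 6 octets for the ULPDU Length and CRC fields,
    one 4-octet Marker per 512 octets in the worst case, and
    EMSS mod 4 octets lost to 4-octet alignment.
    """
    worst_case_markers = -(-emss // 512)  # integer ceiling division
    return emss - (6 + 4 * worst_case_markers + emss % 4)


if __name__ == "__main__":
    emss = estimate_emss(1500)
    print(emss, estimate_mulpdu(emss))
```

A stack that uses TCP timestamps would pass tcp_option_bytes=12, which
lowers both estimates.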
MPA transmitters SHOULD also use whatever facilities the TCP stack MPA transmitters SHOULD also use whatever facilities the TCP stack
skipping to change at page 24, line 13 skipping to change at page 24, line 43
alignment is lost (see section 6). alignment is lost (see section 6).
5.2 MPA receivers with a standard layered TCP 5.2 MPA receivers with a standard layered TCP
MPA receivers will get TCP data in the usual ordered stream. The MPA receivers will get TCP data in the usual ordered stream. The
receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH
field, as described in section 6. Receivers MAY utilize markers to field, as described in section 6. Receivers MAY utilize markers to
check for FPDU boundary consistency, but they are NOT required to check for FPDU boundary consistency, but they are NOT required to
examine the markers to determine the FPDU boundaries. examine the markers to determine the FPDU boundaries.
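A minimal sketch of length-based boundary detection in an ordered
stream follows. It assumes the FPDU layout of Figure 2 (not shown in
this excerpt): a 2-octet ULPDU_LENGTH, the ULPDU, padding to a 4-octet
boundary, and a 4-octet CRC. It also assumes Markers are disabled and
does not verify the CRC. The function names are hypothetical.

```python
import struct

LENGTH_FIELD = 2  # octets in the ULPDU_LENGTH field (assumed layout)
CRC_FIELD = 4     # octets in the CRC field (assumed layout)


def fpdu_size(ulpdu_length):
    """Total FPDU size for a given ULPDU length, Markers disabled."""
    unpadded = LENGTH_FIELD + ulpdu_length
    pad = (-unpadded) % 4  # pad so the CRC starts on a 4-octet boundary
    return unpadded + pad + CRC_FIELD


def split_fpdus(stream):
    """Split an ordered octet stream into ULPDUs.

    Returns (ulpdus, leftover); leftover holds the octets of an FPDU
    that has not fully arrived yet.
    """
    ulpdus = []
    offset = 0
    while len(stream) - offset >= LENGTH_FIELD:
        (ulpdu_length,) = struct.unpack_from("!H", stream, offset)
        total = fpdu_size(ulpdu_length)
        if len(stream) - offset < total:
            break  # incomplete FPDU: wait for more stream data
        start = offset + LENGTH_FIELD
        ulpdus.append(bytes(stream[start:start + ulpdu_length]))
        offset += total
    return ulpdus, bytes(stream[offset:])
```

A receiver would call split_fpdus on the leftover octets plus newly
received data each time the socket returns more of the stream.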
5.3 Optimized MPA/TCP transmitters
The various TCP RFCs allow considerable choice in segmenting a TCP
stream. In order to optimize FPDU recovery at the MPA receiver, an
optimized MPA/TCP implementation uses additional segmentation rules.
To provide optimum performance, an optimized MPA/TCP transmit side
implementation SHOULD be enabled to:
* With an EMSS large enough to contain the FPDU(s), segment the
outgoing TCP stream such that the first octet of every TCP
Segment begins with an FPDU. Multiple FPDUs MAY be packed into a
single TCP segment as long as they are entirely contained in the
TCP segment.
* Report the current EMSS from the TCP to the MPA transmit layer.
There are exceptions to the above rule. Once an ULPDU is provided to
MPA, the MPA/TCP sender MUST transmit it or fail the connection; it
cannot be repudiated. As a result, during changes in MTU and EMSS,
or when TCP's Receive Window size (RWIN) becomes too small, it may be
necessary to send FPDUs that do not conform to the segmentation rule
above.
A possible, but less desirable, alternative is to use IP
fragmentation on accepted FPDUs to deal with MTU reductions or
extremely small EMSS.
The sender MUST still format the FPDU according to FPDU format as
shown in Figure 2.
On a retransmission, TCP does not necessarily preserve original TCP
segmentation boundaries. This can lead to the loss of FPDU Alignment
and containment within a TCP segment during TCP retransmissions. An
optimized MPA/TCP sender SHOULD try to preserve original TCP
segmentation boundaries on a retransmission.
5.3.1 Effects of Optimized MPA/TCP Segmentation
Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
single FPDU when a DDP message is large enough. Since the DDP
message may not exactly fit into TCP segments, a "message tail" often
occurs that results in an FPDU that is smaller than a single TCP
segment. Additionally some DDP messages may be considerably shorter
than the EMSS. If a small FPDU is sent in a single TCP segment the
result is a "short" TCP segment.
Applications expected to see strong advantages from Direct Data
Placement include transaction-based applications and throughput
applications. Request/response protocols typically send one FPDU per
TCP segment and then wait for a response. Under these conditions,
these "short" TCP segments are an appropriate and expected effect of
the segmentation.
Another possibility is that the application might be sending multiple
messages (FPDUs) to the same endpoint before waiting for a response.
In this case, the segmentation policy would tend to reduce the
available connection bandwidth by under-filling the TCP segments.
Standard TCP implementations often utilize the Nagle [RFC0896]
algorithm to ensure that segments are filled to the EMSS whenever the
round trip latency is large enough that the source stream can fully
fill segments before Acks arrive. The algorithm does this by
delaying the transmission of TCP segments until a ULP can fill a
segment, or until an ACK arrives from the far side. The algorithm
thus allows for smaller segments when latencies are shorter to keep
the ULP's end to end latency to reasonable levels.
The Nagle algorithm is not mandatory to use [RFC1122].
When used with optimized MPA/TCP stacks, Nagle and similar algorithms
can result in the "packing" of multiple FPDUs into TCP segments.
If a "message tail", small DDP messages, or the start of a larger DDP
message are available, MPA MAY pack multiple FPDUs into TCP segments.
When this is done, the TCP segments can be more fully utilized, but,
due to the size constraints of FPDUs, segments may not be filled to
the EMSS. A dynamic MULPDU that informs DDP of the size of the
remaining TCP segment space makes filling the TCP segment more
effective.
Note that MPA receivers must do more processing of a TCP segment
that contains multiple FPDUs; this may affect the performance of
some receiver implementations.
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
that many of the applications expected to take advantage of MPA/DDP
prefer to avoid the extra delays caused by Nagle. In such scenarios
it is anticipated there will be minimal opportunity for packing at
the transmitter and receivers may choose to optimize their
performance for this anticipated behavior.
Therefore, the application is expected to set TCP parameters such
that it can trade off latency and wire efficiency. This is
accomplished by setting the TCP_NODELAY socket option (which disables
Nagle).
When latency is not critical, the application is expected to leave Nagle
enabled. In this case the TCP implementation may pack any available
FPDUs into TCP segments so that the segments are filled to the EMSS.
If the amount of data available is not enough to fill the TCP segment
when it is prepared for transmission, TCP can send the segment partly
filled, or use the Nagle algorithm to wait for the ULP to post more
data.
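The latency versus wire-efficiency choice described above can be
sketched with the standard sockets API. This is an illustration under
the assumption that MPA runs above an ordinary TCP socket; the helper
names are hypothetical, while TCP_NODELAY is the standard socket
option named in the text.

```python
import socket


def set_nagle(sock, enabled):
    """Enable or disable the Nagle algorithm on a TCP socket.

    Setting TCP_NODELAY to 1 disables Nagle (lower latency, possibly
    short segments); setting it to 0 leaves Nagle enabled (fuller
    segments, possibly higher latency).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                    0 if enabled else 1)


def nagle_enabled(sock):
    """Report whether Nagle is currently enabled on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0
```

A latency-sensitive request/response ULP would call
set_nagle(sock, False) before MPA enters Full Operation; a
throughput-oriented ULP would leave the default in place.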
5.4 Optimized MPA/TCP receivers
When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out of order ULPDUs, the TCP receive
implementation SHOULD be enabled to perform the following functions:
1) The implementation SHOULD pass incoming TCP segments to MPA as
soon as they have been received and validated, even if not
received in order. The TCP layer MUST have committed to keeping
each segment before it can be passed to the MPA. This means that
the segment must have passed the TCP, IP, and lower layer data
integrity validation (i.e., checksum), must be in the receive
window, must be part of the same epoch (if timestamps are used to
verify this) and any other checks required by TCP RFCs.
This is not to imply that the data must be completely ordered
before use. An implementation MAY accept out of order segments,
SACK them [RFC2018], and pass them to MPA immediately, before the
segments needed to fill in the gaps arrive.
MPA expects to utilize these segments when they are complete
FPDUs or can be combined into complete FPDUs to allow the passing
of ULPDUs to DDP when they arrive, independent of ordering. DDP
uses the passed ULPDU to "place" the DDP segments (see [DDP] for
more details).
Since MPA performs a CRC calculation and other checks on received
FPDUs, the MPA/TCP implementation MUST ensure that any TCP
segments that duplicate data already received and processed (as
can happen during TCP retries) do not overwrite already received
and processed FPDUs. This avoids the possibility that duplicate
data may corrupt already validated FPDUs.
2) The implementation MUST provide a mechanism to indicate the
ordering of TCP segments as the sender transmitted them. One
possible mechanism might be attaching the TCP sequence number to
each segment.
3) The implementation MUST provide a mechanism to indicate when a
given TCP segment (and the prior TCP stream) is complete. One
possible mechanism might be to utilize the leading (left) edge of
the TCP Receive Window.
MPA uses the ordering and completion indications to inform DDP
when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses
the indications to "deliver" its messages to the DDP consumer
(see [DDP] for more details).
DDP on MPA MUST utilize these two mechanisms to establish the
Delivery semantics that DDP's consumers agree to. These
semantics are described fully in [DDP]. These include
requirements on DDP's consumer to respect ownership of buffers
prior to the time that DDP delivers them to the Consumer.
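The ordering and completion mechanisms above can be sketched as a
small tracker that places FPDUs as they arrive and delivers them only
once the left edge of the receive window has passed them. This is an
illustration, not the specified interface: the class and method names
are hypothetical, sequence number wraparound is ignored, and CRC
validation is assumed to have happened before place() is called.

```python
class DeliveryTracker:
    """Track out-of-order Placement and in-order Delivery of FPDUs."""

    def __init__(self, initial_left_edge):
        self.left_edge = initial_left_edge
        self._placed = {}  # start sequence number -> (end, ulpdu)

    def place(self, start_seq, fpdu_length, ulpdu):
        """Record a validated FPDU found at start_seq in the stream.

        Duplicates of already processed data are ignored so that a
        retransmission cannot overwrite a validated FPDU.
        """
        end = start_seq + fpdu_length
        if end <= self.left_edge or start_seq in self._placed:
            return False
        self._placed[start_seq] = (end, ulpdu)
        return True

    def advance_left_edge(self, new_left_edge):
        """Move the completion point; return ULPDUs now deliverable."""
        self.left_edge = max(self.left_edge, new_left_edge)
        ready = sorted(start for start, (end, _) in self._placed.items()
                       if end <= self.left_edge)
        return [self._placed.pop(start)[1] for start in ready]
```

The sequence number attached to each segment supplies start_seq, and
the TCP layer calls advance_left_edge as gaps in the stream are
filled.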
6 MPA Receiver FPDU Identification 6 MPA Receiver FPDU Identification
An MPA receiver MUST first verify the FPDU before passing the ULPDU An MPA receiver MUST first verify the FPDU before passing the ULPDU
to DDP. To do this, the receiver MUST: to DDP. To do this, the receiver MUST:
* locate the start of the FPDU unambiguously, * locate the start of the FPDU unambiguously,
* verify its CRC (if CRC checking is enabled). * verify its CRC (if CRC checking is enabled).
If the above conditions are true, the MPA receiver passes the ULPDU If the above conditions are true, the MPA receiver passes the ULPDU
to DDP. to DDP.
To detect the start of the FPDU unambiguously one of the following To detect the start of the FPDU unambiguously one of the following
MUST be used: MUST be used:
1: In an ordered TCP stream, the ULPDU Length field in the current 1: In an ordered TCP stream, the ULPDU Length field in the current
FPDU when FPDU has a valid CRC, can be used to identify the FPDU when FPDU has a valid CRC, can be used to identify the
beginning of the next FPDU. beginning of the next FPDU.
2: For optimized MPA/TCP receivers that support out of order 2: For optimized MPA/TCP receivers that support out of order
reception of FPDUs (see section 4.3 MPA Markers on page 14) a reception of FPDUs (see section 4.3 MPA Markers on page 15) a
Marker can always be used to locate the beginning of an FPDU (in Marker can always be used to locate the beginning of an FPDU (in
FPDUs with valid CRCs). Since the location of the Marker is FPDUs with valid CRCs). Since the location of the Marker is
known in the octet stream (sequence number space), the Marker can known in the octet stream (sequence number space), the Marker can
always be found. always be found.
3: Having found an FPDU by means of a Marker, an optimized MPA/TCP 3: Having found an FPDU by means of a Marker, an optimized MPA/TCP
receiver can find following contiguous FPDUs by using the ULPDU receiver can find following contiguous FPDUs by using the ULPDU
Length fields (from FPDUs with valid CRCs) to establish the next Length fields (from FPDUs with valid CRCs) to establish the next
FPDU boundary. FPDU boundary.
The ULPDU Length field (see section 4) MUST be used to determine if The ULPDU Length field (see section 4 on page 14) MUST be used to
the entire FPDU is present before forwarding the ULPDU to DDP. determine if the entire FPDU is present before forwarding the ULPDU
to DDP.
CRC calculation is discussed in section 4.4 on page 17 above.
6.1 Re-segmenting Middle boxes and non optimized MPA/TCP senders
Since MPA senders often start FPDUs on TCP segment boundaries, a
receiving optimized MPA/TCP implementation may be able to optimize
the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries.
Some MPA senders may be unable to conform to the sender requirements
because their implementation of TCP is not designed with MPA in mind.
Even for optimized MPA/TCP senders, the network may contain "middle
boxes" which modify the TCP stream by changing the segmentation.
This is generally interoperable with TCP and its users and MPA must
be no exception.
The presence of Markers in MPA (when enabled) allows an optimized
MPA/TCP receiver to recover the FPDUs despite these obstacles,
although it may be necessary to utilize additional buffering at the
receiver to do so.
Some of the cases that a receiver may have to contend with are listed
below as a reminder to the implementer:
* A single Aligned and complete FPDU, either in order, or out of
order: This can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Multiple FPDUs in a TCP segment, aligned and fully contained,
either in order, or out of order: These can be passed to DDP as
soon as validated, and Delivered when ordering is established.
* Incomplete FPDU: The receiver should buffer until the remainder
of the FPDU arrives. If the remainder of the FPDU is already
available, this can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Unaligned FPDU start: The partial FPDU must be combined with its
preceding portion(s). If the preceding parts are already
available, and the whole FPDU is present, this can be passed to
DDP as soon as validated, and Delivered when ordering is
established. If the whole FPDU is not available, the receiver
should buffer until the remainder of the FPDU arrives.
* Combinations of Unaligned or incomplete FPDUs (and potentially CRC calculation is discussed in section 4.4 on page 18 above.
other complete FPDUs) in the same TCP segment: If any FPDU is
present in its entirety, or can be completed with portions
already available, it can be passed to DDP as soon as validated,
and Delivered when ordering is established.
7 Connection Semantics 7 Connection Semantics
7.1 Connection setup 7.1 Connection setup
MPA requires that the Consumer MUST activate MPA, and any TCP MPA requires that the Consumer MUST activate MPA, and any TCP
enhancements for MPA, on a TCP half connection at the same location enhancements for MPA, on a TCP half connection at the same location
in the octet stream at both the sender and the receiver. This is in the octet stream at both the sender and the receiver. This is
required in order for the Marker scheme to correctly locate the required in order for the Marker scheme to correctly locate the
Markers (if enabled) and to correctly locate the first FPDU. Markers (if enabled) and to correctly locate the first FPDU.
MPA, and any TCP enhancements for MPA are enabled by the ULP in both MPA, and any TCP enhancements for MPA are enabled by the ULP in both
directions at once at an endpoint. directions at once at an endpoint.
This can be accomplished several ways, and is left up to DDP's ULP: This can be accomplished several ways, and is left up to DDP's ULP:
* DDP's ULP MAY require DDP on MPA startup immediately after TCP * DDP's ULP MAY require DDP on MPA startup immediately after TCP
connection setup. This has the advantage that no streaming mode connection setup. This has the advantage that no streaming mode
negotiation is needed. An example of such a protocol is shown in negotiation is needed. An example of such a protocol is shown in
Figure 11: Example Immediate Startup negotiation on page 40. Figure 10: Example Immediate Startup negotiation on page 36.
This may be accomplished by using a well-known port, or a service This may be accomplished by using a well-known port, or a service
locator protocol to locate an appropriate port on which DDP on locator protocol to locate an appropriate port on which DDP on
MPA is expected to operate. MPA is expected to operate.
* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
normal TCP startup, using TCP streaming data exchanges on the normal TCP startup, using TCP streaming data exchanges on the
same connection. The exchange establishes that DDP on MPA (as same connection. The exchange establishes that DDP on MPA (as
well as other ULPs) will be used, and exactly locates the point well as other ULPs) will be used, and exactly locates the point
in the octet stream where MPA is to begin operation. Note that in the octet stream where MPA is to begin operation. Note that
such a negotiation protocol is outside the scope of this such a negotiation protocol is outside the scope of this
specification. A simplified example of such a protocol is shown specification. A simplified example of such a protocol is shown
in Figure 10: Example Delayed Startup negotiation on page 37. in Figure 9: Example Delayed Startup negotiation on page 33.
An MPA endpoint operates in two distinct phases. An MPA endpoint operates in two distinct phases.
The Startup Phase is used to verify correct MPA setup, exchange CRC The Startup Phase is used to verify correct MPA setup, exchange CRC
and Marker configuration, and optionally pass Private Data between and Marker configuration, and optionally pass Private Data between
endpoints prior to completing a DDP connection. During this phase, endpoints prior to completing a DDP connection. During this phase,
specifically formatted frames are exchanged as TCP byte streams specifically formatted frames are exchanged as TCP byte streams
without using CRCs or Markers. During this phase a DDP endpoint need without using CRCs or Markers. During this phase a DDP endpoint need
not be "bound" to the MPA connection. In fact, the choice of DDP not be "bound" to the MPA connection. In fact, the choice of DDP
endpoint and its operating parameters may not be known until the endpoint and its operating parameters may not be known until the
skipping to change at page 32, line 27 skipping to change at page 28, line 27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res | Rev | PD_Length | 16 |M|C|R| Res | Rev | PD_Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
~ ~ ~ ~
~ Private Data ~ ~ Private Data ~
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 9 MPA Request/Reply Frame Figure 8 MPA Request/Reply Frame
Key: This field contains the "key" used to validate that the sender Key: This field contains the "key" used to validate that the sender
is an MPA sender. Initiator mode senders MUST set this field to is an MPA sender. Initiator mode senders MUST set this field to
the fixed value "MPA ID Req frame" or (in byte order) 4D 50 41 20 the fixed value "MPA ID Req frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. Responder mode senders MUST set this field to value is detected. Responder mode senders MUST set this field to
the fixed value "MPA ID Rep frame" or (in byte order) 4D 50 41 20 the fixed value "MPA ID Rep frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. value is detected.
M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame, M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame,
declares a receiver's requirement for Markers. When in a declares a receiver's requirement for Markers. When in a
received MPA Request Frame or MPA Reply Frame and the value is received MPA Request Frame or MPA Reply Frame and the value is
'0', Markers MUST NOT be added to the data stream by the sender. '0', Markers MUST NOT be added to the data stream by the sender.
When '1' Markers MUST be added as described in section 4.3 MPA When '1' Markers MUST be added as described in section 4.3 MPA
Markers on page 14. Markers on page 15.
C: This bit declares an endpoint's preferred CRC usage. When this C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the MPA Request Frame and the MPA Reply Frame, field is '0' in the MPA Request Frame and the MPA Reply Frame,
CRCs MUST NOT be checked and need not be generated by either CRCs MUST NOT be checked and need not be generated by either
endpoint. When this bit is '1' in either the MPA Request Frame endpoint. When this bit is '1' in either the MPA Request Frame
or MPA Reply Frame, CRCs MUST be generated and checked by both or MPA Reply Frame, CRCs MUST be generated and checked by both
endpoints. Note that even when not in use, the CRC field remains endpoints. Note that even when not in use, the CRC field remains
present in the FPDU. When CRCs are not in use, the CRC field present in the FPDU. When CRCs are not in use, the CRC field
MUST be considered valid for FPDU checking regardless of its MUST be considered valid for FPDU checking regardless of its
contents. contents.
skipping to change at page 36, line 12 skipping to change at page 32, line 12
messages to guard against application failures and certain denial messages to guard against application failures and certain denial
of service attacks. of service attacks.
7.1.3 Example Delayed Startup sequence 7.1.3 Example Delayed Startup sequence
A variety of startup sequences are possible when using MPA on TCP. A variety of startup sequences are possible when using MPA on TCP.
Following is an example of an MPA/DDP startup that occurs after TCP Following is an example of an MPA/DDP startup that occurs after TCP
has been running for a while and has exchanged some amount of has been running for a while and has exchanged some amount of
streaming data. This example does not use any Private Data (an streaming data. This example does not use any Private Data (an
example that does is shown later in 7.1.4.2 Example Immediate Startup example that does is shown later in 7.1.4.2 Example Immediate Startup
using Private Data on page 40), although it is perfectly legal to using Private Data on page 36), although it is perfectly legal to
include the Private Data. Note that since the example does not use include the Private Data. Note that since the example does not use
any Private Data, there are no ULP interactions shown between any Private Data, there are no ULP interactions shown between
receiving "Startup frames" and putting MPA into Full Operation. receiving "Startup frames" and putting MPA into Full Operation.
Initiator Responder Initiator Responder
+---------------------------+ +---------------------------+
|ULP streaming mode | |ULP streaming mode |
| <Hello> request to | | <Hello> request to |
| transition to DDP/MPA | +--------------------------+ | transition to DDP/MPA | +--------------------------+
skipping to change at page 37, line 41 skipping to change at page 33, line 41
| <MPA Reply Frame> | +--------------------------+ | <MPA Reply Frame> | +--------------------------+
|Consumer binds DDP to MPA, | |Consumer binds DDP to MPA, |
|DDP/MPA begins full | |DDP/MPA begins full |
|operation. | |operation. |
|MPA sends first FPDU (as | +--------------------------+ |MPA sends first FPDU (as | +--------------------------+
|DDP ULPDUs become | ========> |MPA Receives first FPDU. | |DDP ULPDUs become | ========> |MPA Receives first FPDU. |
|available). | |MPA sends first FPDU (as | |available). | |MPA sends first FPDU (as |
+---------------------------+ |DDP ULPDUs become | +---------------------------+ |DDP ULPDUs become |
<====== |available. | <====== |available. |
+--------------------------+ +--------------------------+
Figure 10: Example Delayed Startup negotiation Figure 9: Example Delayed Startup negotiation
An example Delayed Startup sequence is described below: An example Delayed Startup sequence is described below:
* Active and passive sides start up a TCP connection in the * Active and passive sides start up a TCP connection in the
usual fashion, probably using sockets APIs. They exchange usual fashion, probably using sockets APIs. They exchange
some amount of streaming mode data. At some point one side some amount of streaming mode data. At some point one side
(the MPA Initiator) sends streaming mode data that (the MPA Initiator) sends streaming mode data that
effectively says "Hello, Let's go into MPA/DDP mode." effectively says "Hello, Let's go into MPA/DDP mode."
* When the remote side (the MPA Responder) gets this streaming mode * When the remote side (the MPA Responder) gets this streaming mode
message, the Consumer would send a last streaming mode message message, the Consumer would send a last streaming mode message
skipping to change at page 40, line 51 skipping to change at page 36, line 51
|Consumer examines Private | |Consumer examines Private |
|Data, binds DDP to MPA, | |Data, binds DDP to MPA, |
|and enables DDP/MPA to | |and enables DDP/MPA to |
|begin Full Operation. | |begin Full Operation. |
|MPA sends first FPDU (as | +--------------------------+ |MPA sends first FPDU (as | +--------------------------+
|DDP ULPDUs become | ========> |MPA Receives first FPDU. | |DDP ULPDUs become | ========> |MPA Receives first FPDU. |
|available). | |MPA sends first FPDU (as | |available). | |MPA sends first FPDU (as |
+---------------------------+ |DDP ULPDUs become | +---------------------------+ |DDP ULPDUs become |
<====== |available. | <====== |available. |
+--------------------------+ +--------------------------+
Figure 11: Example Immediate Startup negotiation Figure 10: Example Immediate Startup negotiation
Note: the exact order of when MPA is started in the TCP connection Note: the exact order of when MPA is started in the TCP connection
sequence is implementation dependent; the above diagram shows one sequence is implementation dependent; the above diagram shows one
possible sequence. Also, the Initiator "Ack" to the Responder's possible sequence. Also, the Initiator "Ack" to the Responder's
"SYN-Ack" may be combined into the same TCP segment containing "SYN-Ack" may be combined into the same TCP segment containing
the MPA Request Frame (as is allowed by TCP RFCs). the MPA Request Frame (as is allowed by TCP RFCs).
The example immediate startup sequence is described below: The example immediate startup sequence is described below:
* The passive side (Responding Consumer) would listen on the TCP * The passive side (Responding Consumer) would listen on the TCP
skipping to change at page 48, line 28 skipping to change at page 44, line 28
Additionally, since IPsec acceleration hardware may only be able to Additionally, since IPsec acceleration hardware may only be able to
handle a limited number of active IKE Phase 2 SAs, Phase 2 delete handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
messages MAY be sent for idle SAs, as a means of keeping the number messages MAY be sent for idle SAs, as a means of keeping the number
of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
delete message MUST NOT be interpreted as a reason for tearing down delete message MUST NOT be interpreted as a reason for tearing down
a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up, a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
and if additional traffic is sent on it, to bring up another IKE and if additional traffic is sent on it, to bring up another IKE
Phase 2 SA to protect it. This avoids the potential for continually Phase 2 SA to protect it. This avoids the potential for continually
bringing Streams up and down. bringing Streams up and down.
The IPsec requirements for RDDP are based on the version of IPsec
specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC
3723 [RFC3723], despite the existence of a newer version of IPsec
specified in RFC 4301 [RFC4301] and related RFCs. One of the
important early applications of the RDDP protocols is their use with
iSCSI [iSER]; RDDP's IPsec requirements follow those of IPsec in
order to facilitate that usage by allowing a common profile of IPsec
to be used with iSCSI and the RDDP protocols. In the future, RFC
3723 may be updated to the newer version of IPsec; the IPsec security
requirements of any such update should apply uniformly to iSCSI and
the RDDP protocols.
Note that there are serious security issues if IPsec is not Note that there are serious security issues if IPsec is not
implemented end-to-end. For example, if IPsec is implemented as a implemented end-to-end. For example, if IPsec is implemented as a
tunnel in the middle of the network, any hosts between the peer and tunnel in the middle of the network, any hosts between the peer and
the IPsec tunneling device can freely attack the unprotected Stream. the IPsec tunneling device can freely attack the unprotected Stream.
10 IANA Considerations 10 IANA Considerations
No IANA actions are required by this document. No IANA actions are required by this document.
If a well-known port is chosen as the mechanism to identify a DDP on If a well-known port is chosen as the mechanism to identify a DDP on
MPA on TCP, the well-known port must be registered with IANA. MPA on TCP, the well-known port must be registered with IANA.
Because the use of the port is DDP specific, registration of the port Because the use of the port is DDP specific, registration of the port
with IANA is left to DDP. with IANA is left to DDP.
11 References
11.1 Normative References
[iSCSI] Satran, J., "Internet Small Computer Systems Interface
(iSCSI)", RFC 3720, April 2004.
[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
November 1990.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
IP", RFC 3723, April 2004.
[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981.
[RDMASEC] Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP
Security", draft-ietf-rddp-security-09.txt (work in progress),
May 2006.
11.2 Informative References
[CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000.
[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
Library) and uDAPL (User Direct Access Programming Library)",
http://www.datcollaborative.org.
[DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", draft-ietf-rddp-ddp-06.txt (work in progress), May
2006.
[IT-API] The Open Group, "Interconnect Transport API (IT-API)"
Version 2.1, http://www.opengroup.org.
[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
896, January 1984.
[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
Secure Channels", Internet-Draft
draft-ietf-nfsv4-channel-bindings-02.txt, July 2004.
[RDMAP] R. Recio et al., "RDMA Protocol Specification",
draft-ietf-rddp-rdmap-06.txt, May 2006.
[RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000.
[RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
September 1981.
[RFC1122] Braden, R.T., "Requirements for Internet hosts -
communication layers", RFC 1122, October 1989.
[VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf April 2003,
http://www.rdmaconsortium.org.
A Appendix.
Optimized MPA-aware TCP implementations
This appendix is for information only and is NOT part of the
standard.
This appendix covers some Optimized MPA-aware TCP implementation
guidance to implementers. It is intended for those implementations
that want to send/receive as much traffic as possible in an aligned
and zero-copy fashion.
+-----------------------------------+
| +-----------+ +-----------------+ |
| | Optimized | | Other Protocols | |
| |  MPA/TCP  | +-----------------+ |
| +-----------+         ||          |
|         \\     --- socket API --- |
|          \\            ||          |
|           \\        +-----+        |
|            \\       | TCP |        |
|             \\      +-----+        |
|              \\    //              |
|              +-------+             |
|              |  IP   |             |
|              +-------+             |
+-----------------------------------+
Figure 11 Optimized MPA/TCP implementation
The diagram above shows a block diagram of a potential
implementation. The network sub-system in the diagram can support
traditional sockets based connections using the normal API as shown
on the right side of the diagram. Connections for DDP/MPA/TCP are
run using the facilities shown on the left side of the diagram.
The DDP/MPA/TCP connections can be started using the facilities shown
on the left side using some suitable API, or they can be initiated
using the facilities shown on the right side and transitioned to the
left side at the point in the connection setup where MPA goes to
"full MPA/DDP operation mode" as described in section 7.1.2 on page
29.
The optimized MPA/TCP implementations (left side of diagram and
described below) are only applicable to MPA; all other TCP
applications continue to use the standard TCP stacks and interfaces
shown in the right side of the diagram.
A.1 Optimized MPA/TCP transmitters
The various TCP RFCs allow considerable choice in segmenting a TCP
stream. In order to optimize FPDU recovery at the MPA receiver, an
optimized MPA/TCP implementation uses additional segmentation rules.
To provide optimum performance, an optimized MPA/TCP transmit side
implementation should be enabled to:
* With an EMSS large enough to contain the FPDU(s), segment the
outgoing TCP stream such that the first octet of every TCP
Segment begins with an FPDU. Multiple FPDUs may be packed into a
single TCP segment as long as they are entirely contained in the
TCP segment.
* Report the current EMSS from the TCP to the MPA transmit layer.
There are exceptions to the above rule. Once a ULPDU is provided to
MPA, the MPA/TCP sender transmits it or fails the connection; it
cannot be repudiated. As a result, during changes in MTU and EMSS,
or when TCP's Receive Window size (RWIN) becomes too small, it may be
necessary to send FPDUs that do not conform to the segmentation rule
above.
A possible, but less desirable, alternative is to use IP
fragmentation on accepted FPDUs to deal with MTU reductions or
extremely small EMSS.
Even when alignment with TCP segments is lost, the sender still
formats the FPDU according to the FPDU format as shown in Figure 2.
On a retransmission, TCP does not necessarily preserve original TCP
segmentation boundaries. This can lead to the loss of FPDU Alignment
and containment within a TCP segment during TCP retransmissions. An
optimized MPA/TCP sender should try to preserve original TCP
segmentation boundaries on a retransmission.
A.2 Effects of Optimized MPA/TCP Segmentation
Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
single FPDU when a DDP message is large enough. Since the DDP
message may not exactly fit into TCP segments, a "message tail" often
occurs that results in an FPDU that is smaller than a single TCP
segment. Additionally some DDP messages may be considerably shorter
than the EMSS. If a small FPDU is sent in a single TCP segment the
result is a "short" TCP segment.
Applications expected to see strong advantages from Direct Data
Placement include transaction-based applications and throughput
applications. Request/response protocols typically send one FPDU per
TCP segment and then wait for a response. Under these conditions,
these "short" TCP segments are an appropriate and expected effect of
the segmentation.
Another possibility is that the application might be sending multiple
messages (FPDUs) to the same endpoint before waiting for a response.
In this case, the segmentation policy would tend to reduce the
available connection bandwidth by under-filling the TCP segments.
Standard TCP implementations often utilize the Nagle [RFC0896]
algorithm to ensure that segments are filled to the EMSS whenever the
round trip latency is large enough that the source stream can fully
fill segments before Acks arrive. The algorithm does this by
delaying the transmission of TCP segments until a ULP can fill a
segment, or until an ACK arrives from the far side. The algorithm
thus allows for smaller segments when latencies are shorter to keep
the ULP's end to end latency to reasonable levels.
The Nagle algorithm is not mandatory to use [RFC1122].
When used with optimized MPA/TCP stacks, Nagle and similar algorithms
can result in the "packing" of multiple FPDUs into TCP segments.
If a "message tail", small DDP messages, or the start of a larger DDP
message are available, MPA may pack multiple FPDUs into TCP segments.
When this is done, the TCP segments can be more fully utilized, but,
due to the size constraints of FPDUs, segments may not be filled to
the EMSS. A dynamic MULPDU that informs DDP of the size of the
remaining TCP segment space makes filling the TCP segment more
effective.
Note that MPA receivers do more processing of a TCP segment that
contains multiple FPDUs; this may affect the performance of some
receiver implementations.
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
that many of the applications expected to take advantage of MPA/DDP
prefer to avoid the extra delays caused by Nagle. In such scenarios
it is anticipated there will be minimal opportunity for packing at
the transmitter and receivers may choose to optimize their
performance for this anticipated behavior.
Therefore, the application is expected to set TCP parameters such
that it can trade off latency and wire efficiency. Implementations
should provide a connection option which disables Nagle for MPA/TCP
similar to the way the TCP_NODELAY socket option is provided for a
traditional sockets interface.
When latency is not critical, the application is expected to leave
Nagle enabled. In this case the TCP implementation may pack any
available FPDUs into TCP segments so that the segments are filled to
the EMSS. If the amount of data available is not enough to fill the
TCP segment when it is prepared for transmission, TCP can send the
segment partly filled, or use the Nagle algorithm to wait for the ULP
to post more data.
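As an informal illustration of the segmentation rule in A.1 and the
packing behavior described above, the following sketch groups queued
FPDUs into TCP segments so that every segment starts with an FPDU and
contains only whole FPDUs. The function name pack_fpdus and its
inputs (a list of FPDU sizes in octets and the current EMSS) are
hypothetical and are not defined by this specification. The
exception cases (MTU/EMSS changes, small RWIN) are not modeled.

```python
from typing import List


def pack_fpdus(fpdu_sizes: List[int], emss: int) -> List[List[int]]:
    """Group FPDU sizes into TCP segments (hypothetical helper).

    Each returned segment starts on an FPDU boundary and holds only
    complete FPDUs whose total size does not exceed the EMSS.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    used = 0
    for size in fpdu_sizes:
        if size > emss:
            # Exception case from A.1: alignment cannot be kept.
            raise ValueError("FPDU larger than EMSS cannot be aligned")
        if current and used + size > emss:
            # Next FPDU does not fit: close the segment (possibly short).
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments


# One full-size FPDU, then a message tail packed with small messages.
print(pack_fpdus([1460, 300, 200, 900, 600], emss=1460))
```

With Nagle disabled, a sender would typically emit each FPDU as soon
as it is posted rather than waiting to pack, so the grouping above
corresponds to the latency-insensitive case.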
A.3 Optimized MPA/TCP receivers
When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out of order ULPDUs, the TCP receive
implementation performs the following functions:
1) The implementation passes incoming TCP segments to MPA as soon as
they have been received and validated, even if not received in
order. The TCP layer commits to keeping each segment before it
can be passed to the MPA. This means that the segment must have
passed the TCP, IP, and lower layer data integrity validation
(i.e., checksum), must be in the receive window, must be part of
the same epoch (if timestamps are used to verify this) and any
other checks required by TCP RFCs.
This is not to imply that the data must be completely ordered
before use. An implementation can accept out of order segments,
SACK them [RFC2018], and pass them to MPA immediately, before the
segments needed to fill in the gaps arrive.
MPA expects to utilize these segments when they are complete
FPDUs or can be combined into complete FPDUs to allow the passing
of ULPDUs to DDP when they arrive, independent of ordering. DDP
uses the passed ULPDU to "place" the DDP segments (see [DDP] for
more details).
Since MPA performs a CRC calculation and other checks on received
FPDUs, the MPA/TCP implementation ensures that any TCP segments
that duplicate data already received and processed (as can happen
during TCP retries) do not overwrite already received and
processed FPDUs. This avoids the possibility that duplicate data
may corrupt already validated FPDUs.
2) The implementation provides a mechanism to indicate the ordering
of TCP segments as the sender transmitted them. One possible
mechanism might be attaching the TCP sequence number to each
segment.
3) The implementation also provides a mechanism to indicate when a
given TCP segment (and the prior TCP stream) is complete. One
possible mechanism might be to utilize the leading (left) edge of
the TCP Receive Window.
MPA uses the ordering and completion indications to inform DDP
when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses
the indications to "deliver" its messages to the DDP consumer
(see [DDP] for more details).
DDP on MPA utilizes the above two mechanisms to establish the
Delivery semantics that DDP's consumers agree to. These
semantics are described fully in [DDP]. These include
requirements on DDP's consumer to respect ownership of buffers
prior to the time that DDP delivers them to the Consumer.
The use of SACK [RFC2018] significantly improves network utilization
and performance and is therefore recommended. When combined with the
out-of-order passing of segments to MPA and DDP, significant
buffering and copying of received data can be avoided.
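The interaction of the ordering and completion indications can be
sketched as follows. This is a minimal model, not an implementation:
the class PlacementTracker and its method names are hypothetical, the
TCP sequence number is assumed to be the ordering indication, the
left edge of the receive window is assumed to be the completion
indication, and sequence number wraparound is ignored for brevity.

```python
from typing import Dict, List


class PlacementTracker:
    """Tracks FPDUs placed out of order and delivers them in order."""

    def __init__(self, initial_seq: int) -> None:
        self.left_edge = initial_seq        # completion indication
        self.pending: Dict[int, int] = {}   # FPDU start seq -> end seq

    def on_fpdu(self, seq: int, length: int) -> bool:
        """Return True if the validated FPDU should be placed now.

        Duplicates (for example from TCP retransmission) return False
        so that already validated data is never overwritten.
        """
        if seq + length <= self.left_edge or seq in self.pending:
            return False
        self.pending[seq] = seq + length
        return True

    def on_left_edge(self, new_edge: int) -> List[int]:
        """Advance the left edge; return FPDUs now deliverable, in order."""
        self.left_edge = max(self.left_edge, new_edge)
        ready = sorted(s for s, e in self.pending.items()
                       if e <= self.left_edge)
        for s in ready:
            del self.pending[s]
        return ready


tracker = PlacementTracker(initial_seq=1000)
tracker.on_fpdu(1500, 500)          # arrives early: placed, not delivered
tracker.on_fpdu(1000, 500)          # fills the gap
print(tracker.on_left_edge(2000))   # both FPDUs become deliverable
```

Placement happens as soon as on_fpdu accepts an FPDU, while Delivery
is deferred until the completion indication covers it, which mirrors
the separation of Placement and Delivery described in [DDP].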
A.4 Re-segmenting Middle boxes and non optimized MPA/TCP senders
Since MPA senders often start FPDUs on TCP segment boundaries, a
receiving optimized MPA/TCP implementation may be able to optimize
the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries.
Some MPA senders may be unable to conform to the sender requirements
because their implementation of TCP is not designed with MPA in mind.
Even for optimized MPA/TCP senders, the network may contain "middle
boxes" which modify the TCP stream by changing the segmentation.
This is generally interoperable with TCP and its users and MPA must
be no exception.
The presence of Markers in MPA (when enabled) allows an optimized
MPA/TCP receiver to recover the FPDUs despite these obstacles,
although it may be necessary to utilize additional buffering at the
receiver to do so.
Some of the cases that a receiver may have to contend with are listed
below as a reminder to the implementer:
* A single Aligned and complete FPDU, either in order, or out of
order: This can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Multiple FPDUs in a TCP segment, aligned and fully contained,
either in order, or out of order: These can be passed to DDP as
soon as validated, and Delivered when ordering is established.
* Incomplete FPDU: The receiver should buffer until the remainder
of the FPDU arrives. If the remainder of the FPDU is already
available, this can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Unaligned FPDU start: The partial FPDU must be combined with its
preceding portion(s). If the preceding parts are already
available, and the whole FPDU is present, this can be passed to
DDP as soon as validated, and Delivered when ordering is
established. If the whole FPDU is not available, the receiver
should buffer until the remainder of the FPDU arrives.
* Combinations of Unaligned or incomplete FPDUs (and potentially
other complete FPDUs) in the same TCP segment: If any FPDU is
present in its entirety, or can be completed with portions
already available, it can be passed to DDP as soon as validated,
and Delivered when ordering is established.
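The cases listed above can be illustrated with a small classifier.
The sketch assumes that the receiver already knows the FPDU
boundaries in TCP sequence space (in practice these would be
recovered from Markers or from in-order parsing); the function
classify_segment and its labels are hypothetical names used only for
illustration.

```python
from typing import List, Tuple


def classify_segment(seg_start: int, seg_len: int,
                     fpdus: List[Tuple[int, int]]) -> List[Tuple[int, str]]:
    """Classify each FPDU that overlaps a received TCP segment.

    fpdus holds (start_seq, length) pairs, assumed known to the
    receiver. Labels: "complete" (wholly inside the segment),
    "incomplete" (starts here, continues in a later segment),
    "unaligned" (started in an earlier segment), and
    "unaligned, incomplete" (spans beyond both ends).
    """
    seg_end = seg_start + seg_len
    result: List[Tuple[int, str]] = []
    for start, length in fpdus:
        end = start + length
        if end <= seg_start or start >= seg_end:
            continue  # FPDU does not overlap this segment
        if start >= seg_start and end <= seg_end:
            label = "complete"
        elif start >= seg_start:
            label = "incomplete"
        elif end <= seg_end:
            label = "unaligned"
        else:
            label = "unaligned, incomplete"
        result.append((start, label))
    return result


# A re-segmented stream: the segment begins in the middle of one FPDU,
# fully contains a second, and ends in the middle of a third.
stream = [(0, 1000), (1000, 400), (1400, 1000)]
print(classify_segment(500, 1200, stream))
```

Only the "complete" entries can be passed to DDP immediately; the
others require buffering until the rest of the FPDU is available.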
A.5 Receiver implementation
Transport & Network Layer Reassembly Buffers:
The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out of order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU
Alignment is lost or if IP fragmentation occurs. This is because the
incoming out of order segment may not contain enough information for
MPA to process all of the FPDU. For cases where a re-segmenting
middle box is present, or where the TCP sender is not optimized, the
presence of Markers significantly reduces the amount of buffering
needed.
Recovery from IP Fragmentation is transparent to the MPA Consumers.
A.5.1 Network Layer Reassembly Buffers
The MPA/TCP implementation should set the IP Don't Fragment bit at
the IP layer. Thus upon a path MTU change, intermediate devices drop
the IP datagram if it is too large and reply with an ICMP message
which tells the source TCP that the path MTU has changed. This
causes TCP to emit segments conformant with the new path MTU size.
Thus IP fragments should not occur at the receiver under most
conditions, although it remains possible.
There are several options for implementation of network layer
reassembly buffers:
1. drop any IP fragments, and reply with an ICMP message according
to [RFC792] (fragmentation needed and DF set) to tell the Remote
Peer to resize its TCP segment
2. support an IP reassembly buffer, but have it of limited size
(possibly the same size as the local link's MTU). The end Node
would normally never advertise a path MTU larger than the local
link MTU. It is recommended that a dropped IP fragment cause an
ICMP message to be generated according to [RFC792].
3. multiple IP reassembly buffers, of effectively unlimited size.
4. support an IP reassembly buffer for the largest IP datagram (64
KB).
5. support for a large IP reassembly buffer which could span
multiple IP datagrams.
An implementation should support at least 2 or 3 above, to avoid
dropping packets that have traversed the entire fabric.
There is no end-to-end ACK for IP reassembly buffers, so there is no
flow control on the buffer. The only end-to-end ACK is a TCP ACK,
which can only occur when a complete IP datagram is delivered to TCP.
Because of this, under worst case, pathological scenarios, the
largest IP reassembly buffer is the TCP receive window (to buffer
multiple IP datagrams that have all been fragmented).
Note that if the Remote Peer does not implement re-segmentation of
the data stream upon receiving the ICMP reply updating the path MTU,
it is possible to halt forward progress because the opposite peer
would continue to retransmit using a transport segment size that is
too large. This deadlock scenario is no different than if the fabric
MTU (not last hop MTU) was reduced after connection setup, and the
remote Node's behavior is not compliant with [RFC1122].
A.5.2 TCP Reassembly buffers
A TCP reassembly buffer is also needed. TCP reassembly buffers are
needed if FPDU Alignment is lost when using TCP with MPA or when the
MPA FPDU spans multiple TCP segments. Buffers are also needed if
Markers are disabled and out of order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible advertised
TCP path MTU). If the MTU is smaller than 140 octets, a buffer at
least 140 octets long is needed to support the minimum FPDU size.
The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad,
2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As
usual, additional buffering is likely to provide better performance.
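The 140 octet figure can be checked with a short calculation. The
constant and function names below are hypothetical; the 4 octet
Marker size is inferred from the arithmetic in the paragraph above
(140 - 128 - 2 - 2 - 4).

```python
# Components of the minimum FPDU buffer, in octets, as listed above.
MIN_MULPDU = 128
PAD_OCTETS = 2
ULPDU_LENGTH_OCTETS = 2
CRC_OCTETS = 4
MARKER_OCTETS = 4  # inferred: space for one possible Marker

MIN_FPDU_BUFFER = (MIN_MULPDU + PAD_OCTETS + ULPDU_LENGTH_OCTETS
                   + CRC_OCTETS + MARKER_OCTETS)


def min_reassembly_buffer(link_mtu: int) -> int:
    """Smallest TCP reassembly buffer suggested by the text above."""
    return max(link_mtu, MIN_FPDU_BUFFER)


print(MIN_FPDU_BUFFER)               # 140
print(min_reassembly_buffer(1500))   # 1500
print(min_reassembly_buffer(100))    # 140
```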
Note that if the TCP segment were not stored, it is possible to
deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks.
When a complete FPDU is received, processing continues normally.
B Appendix.
Analysis of MPA over TCP Operations
This appendix is for information only and is NOT part of the
standard.
This appendix is an analysis of MPA on TCP and why it is useful to
integrate MPA with TCP (with modifications to typical TCP
implementations) to reduce overall system buffering and overhead.
One of MPA's high level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) buffer. Note that DDP separates the act of placing
data into a ULP buffer from that of notifying the ULP that the ULP
buffer is available for use. In DDP terminology, the former is
defined as "Placement", and the latter is defined as "Delivery". MPA
supports in-order Delivery of the data to the ULP, including support
for Direct Data Placement in the final ULP buffer location when TCP
skipping to change at page 53, line 5 skipping to change at page 54, line 44
(FPDU) (if there is payload present).
2) that there be an integral number of FPDUs in a TCP segment (under
conditions where the Path MTU is not changing).
This Appendix concludes that the scaling advantages of FPDU Alignment
are strong, based primarily on fairly drastic TCP receive buffer
reduction requirements and simplified receive handling. The analysis
also shows that there is little effect on TCP wire behavior.
B.1 Assumptions
B.1.1 MPA is layered beneath DDP [DDP]
MPA is an adaptation layer between DDP and TCP. DDP requires
preservation of DDP segment boundaries and a CRC32C digest covering
the DDP header and data. MPA adds these features to the TCP stream
so that DDP over TCP has the same basic properties as DDP over SCTP.
B.1.2 MPA preserves DDP message framing
MPA was designed as a framing layer specifically for DDP and was not
intended as a general-purpose framing layer for any other ULP using
TCP.
A framing layer allows ULPs using it to receive indications from the
transport layer only when complete ULPDUs are present. As a framing
layer, MPA is not aware of the content of the DDP PDU, only that it
has received and, if necessary, reassembled a complete PDU for
Delivery to the DDP.
B.1.3 The size of the ULPDU passed to MPA is less than EMSS under
normal conditions
To make reception of a complete DDP PDU on every received segment
possible, DDP passes to MPA a PDU that is no larger than the EMSS of
the underlying fabric. Each FPDU that MPA creates contains
sufficient information for the receiver to directly place the ULP
payload in the correct location in the correct receive buffer.
Edge cases when this condition does not occur are dealt with, but do
not need to be on the fast path.
B.1.4 Out-of-order placement but NO out-of-order Delivery
DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
information necessary to place its ULP payload directly in the
correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer.
Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream.
B.2 The Value of FPDU Alignment
Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case. The optimizations
allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a
"flow-through" receiver that enables TCP-based solutions to scale to
10G and beyond in an economical way. The optimizations are
especially relevant to hardware implementations of receivers that
process multiple protocol layers - Data Link Layer (e.g., Ethernet),
Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
skipping to change at page 55, line 36 skipping to change at page 57, line 25
continue - while Ethernet speeds have scaled by 1000 (from 10
megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
PCI-X DDR). Under these conditions, the FPDU Alignment approach
allows BufferSizeAF to be indifferent to network speed. It is
primarily a function of the local processing time for a given frame.
Thus when the FPDU Alignment approach is used, receive buffering is
expected to scale gracefully (i.e. less than linear scaling) as
network speed is increased.
B.2.1 Impact of lack of FPDU Alignment on the receiver computational
load and complexity
The receiver must perform IP and TCP processing, and then perform
FPDU CRC checks, before it can trust the FPDU header placement
information. For simplicity of the description, the assumption is
that an FPDU is carried in no more than 2 TCP segments. In reality,
with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
segments (e.g., if the PMTU was reduced).
----++-----------------------------++-----------------------++-----
skipping to change at page 59, line 24 skipping to change at page 61, line 24
along with the high probability that at least one complete FPDU is
found with every TCP segment, allows the receiver to perform data
placement for out-of-order TCP segments with no need for intermediate
buffering. Essentially the TCP receive buffer has been eliminated
and TCP reassembly is done in place within the ULP buffer.
In case FPDU Alignment is not found, the receiver should follow the
algorithm for non aligned FPDU reception which may be slower and less
efficient.
B.2.2 FPDU Alignment effects on TCP wire protocol
In an optimized MPA/TCP implementation, TCP exposes its EMSS to
MPA. MPA uses the EMSS to calculate its MULPDU, which it then
exposes to DDP, its ULP. DDP uses the MULPDU to segment its
payload so that each FPDU sent by MPA fits completely into one
TCP segment. This has no impact on wire protocol and exposing
this information is already supported on many TCP
implementations, including all modern flavors of BSD networking,
through the TCP_MAXSEG socket option.
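As a rough illustration of the EMSS to MULPDU relationship, the
sketch below reads TCP_MAXSEG from a socket and derives a
conservative MULPDU estimate. The helper names are hypothetical.
The overhead accounting is an assumption made for this example, not
text from this appendix: 2 octets of ULPDU_Length, 4 octets of CRC,
one 4-octet Marker per 512 octets of segment space, and rounding the
usable space down to a multiple of 4 to allow for padding. Note also
that the value returned by TCP_MAXSEG may differ from the EMSS when
TCP options consume segment space.

```python
import socket


def read_maxseg(sock: socket.socket) -> int:
    """Return the segment size reported by the TCP_MAXSEG option."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def estimate_mulpdu(emss: int, marker_interval: int = 512,
                    marker_octets: int = 4) -> int:
    """Conservative MULPDU estimate for a given EMSS (assumed formula)."""
    usable = emss - (emss % 4)                  # FPDUs are padded to 4 octets
    markers = -(-emss // marker_interval)       # ceiling division
    overhead = 2 + 4 + marker_octets * markers  # length + CRC + Markers
    return usable - overhead


print(estimate_mulpdu(1460))   # 1442 under the stated assumptions
```

On a connected socket, estimate_mulpdu(read_maxseg(sock)) would give
DDP a segment size that lets each FPDU fit in one TCP segment under
these assumptions.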
skipping to change at page 60, line 24 skipping to change at page 62, line 24
the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction oriented applications are also optimal.
TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted
segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
window" and [RFC1122] section 4.2.2.15). In the unlikely event that
part of the original segment has been received and acknowledged by
the remote peer (e.g., a re-segmenting middle box, as documented in
Appendix A.4, Re-segmenting Middle boxes and non optimized MPA/TCP
senders on page 50), a better available bandwidth utilization may be
possible by re-transmitting only the missing octets. If an optimized
MPA/TCP retransmits complete FPDUs, there may be some marginal
bandwidth loss.
Another area where a change in the number of TCP segments may have
impact is that of Slow Start and Congestion Avoidance. Slow-start
exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start
exponential bandwidth growth for optimized MPA/TCP is similar to any
skipping to change at page 61, line 5 skipping to change at page 63, line 5
the algorithms. the algorithms.
In summary, the ULP messages generated at the sender (e.g., the In summary, the ULP messages generated at the sender (e.g., the
amount of messages grouped for every transmission request) and amount of messages grouped for every transmission request) and
message size distribution have the most significant impact on the message size distribution have the most significant impact on the
number of TCP segments emitted. The worst case effect for certain number of TCP segments emitted. The worst case effect for certain
ULPs (with average message size of EMSS/2+1 to EMSS), is bounded by ULPs (with average message size of EMSS/2+1 to EMSS), is bounded by
an increase of up to 2x in the number of TCP segments and an increase of up to 2x in the number of TCP segments and
acknowledgments. In reality the effect is expected to be marginal. acknowledgments. In reality the effect is expected to be marginal.
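The worst case described above can be illustrated with a small sketch. It is not part of either draft. It assumes a simplified model in which framing overhead (ULPDU_Length, pad, CRC, Markers) is ignored, every ULP message fits in one EMSS, and an optimized MPA/TCP sender places only whole messages in each segment. The function names and the EMSS value of 1460 are hypothetical.

```python
import math


def stream_segments(sizes, emss):
    """Segments needed when TCP packs the byte stream and ignores message boundaries."""
    return math.ceil(sum(sizes) / emss)


def aligned_segments(sizes, emss):
    """Segments needed when each segment carries only whole messages (greedy, in order)."""
    segments = 0
    room = 0  # octets still free in the segment currently being filled
    for size in sizes:
        if size > emss:
            raise ValueError("this sketch assumes every message fits in one EMSS")
        if size > room:
            # The message does not fit in what is left, so start a new segment.
            segments += 1
            room = emss
        room -= size
    return segments


emss = 1460  # assumed value, for illustration only
sizes = [emss // 2 + 1] * 100  # 100 messages of EMSS/2+1 octets each
print(stream_segments(sizes, emss), aligned_segments(sizes, emss))
```

With these inputs the byte stream needs 51 segments and the aligned sender needs 100, just under the 2x bound. Messages small enough to share a segment, or close to a full EMSS, narrow the gap.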
12.2 Receiver implementation C Appendix. IETF Implementation Interoperability with RDMA Consortium Protocols
Transport & Network Layer Reassembly Buffers:
The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out of order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU
Alignment is lost or if IP fragmentation occurs. This is because the
incoming out of order segment may not contain enough information for
MPA to process all of the FPDU. For cases where a re-segmenting
middle box is present, or where the TCP sender is not optimized, the
presence of Markers significantly reduces the amount of buffering
needed.
Recovery from IP Fragmentation must be transparent to the MPA
Consumers.
12.2.1 Network Layer Reassembly Buffers
Most IP implementations set the IP Don't Fragment bit. Thus upon a
path MTU change, intermediate devices drop the IP datagram if it is
too large and reply with an ICMP message which tells the source TCP
that the path MTU has changed. This causes TCP to emit segments
conformant with the new path MTU size. Thus IP fragments under most
conditions should never occur at the receiver. But it is possible.
There are several options for implementation of network layer
reassembly buffers:
1. drop any IP fragments, and reply with an ICMP message according
to [RFC792] (fragmentation needed and DF set) to tell the Remote
Peer to resize its TCP segment
2. support an IP reassembly buffer, but have it of limited size
(possibly the same size as the local link's MTU). The end Node
would normally never advertise a path MTU larger than the local
link MTU. It is recommended that a dropped IP fragment cause an
ICMP message to be generated according to [RFC792].
3. multiple IP reassembly buffers, of effectively unlimited size.
4. support an IP reassembly buffer for the largest IP datagram (64
KB).
5. support for a large IP reassembly buffer which could span
multiple IP datagrams.
An implementation should support at least 2 or 3 above, to avoid
dropping packets that have traversed the entire fabric.
There is no end-to-end ACK for IP reassembly buffers, so there is no
flow control on the buffer. The only end-to-end ACK is a TCP ACK,
which can only occur when a complete IP datagram is delivered to TCP.
Because of this, under worst case, pathological scenarios, the
largest IP reassembly buffer is the TCP receive window (to buffer
multiple IP datagrams that have all been fragmented).
Note that if the Remote Peer does not implement re-segmentation of
the data stream upon receiving the ICMP reply updating the path MTU,
it is possible to halt forward progress because the opposite peer
would continue to retransmit using a transport segment size that is
too large. This deadlock scenario is no different than if the fabric
MTU (not last hop MTU) was reduced after connection setup, and the
remote Node's behavior is not compliant with [RFC1122].
12.2.2 TCP Reassembly buffers
A TCP reassembly buffer is also needed. TCP reassembly buffers are
needed if FPDU Alignment is lost when using TCP with MPA or when the
MPA FPDU spans multiple TCP segments. Buffers are also needed if
Markers are disabled and out of order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible advertised
TCP path MTU). If the MTU is smaller than 140 octets, the buffer
MUST be at least 140 octets long to support the minimum FPDU size.
The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad,
2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As
usual, additional buffering may provide better performance.
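The 140-octet figure can be checked with simple arithmetic. The sketch below only restates the numbers from the paragraph above. The 4-octet Marker size is an assumption taken from the MPA Marker format rather than from this paragraph, and the constant and function names are hypothetical.

```python
MIN_MULPDU = 128    # minimum MULPDU, in octets
ULPDU_LENGTH = 2    # ULPDU_Length field
PAD = 2             # pad octets
CRC = 4             # CRC field
MARKER = 4          # one possible Marker (assumed to be 4 octets)

MIN_FPDU_BUFFER = MIN_MULPDU + ULPDU_LENGTH + PAD + CRC + MARKER


def min_reassembly_buffer(link_mtu):
    """Smallest TCP reassembly buffer the paragraph allows for a local link MTU."""
    # The buffer covers an FPDU up to the link MTU, but never less than
    # the minimum FPDU size.
    return max(link_mtu, MIN_FPDU_BUFFER)


print(MIN_FPDU_BUFFER)
```

The sum comes to 140 octets. A link MTU below that value still needs a 140-octet buffer, while a larger MTU sets the minimum directly.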
Note that if the TCP segment were not stored, it is possible to
deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks.
When a complete FPDU is received, processing continues normally.
12.3 IETF Implementation Interoperability with RDMA Consortium Protocols
This appendix is for information only and is NOT part of the
standard.
This appendix covers methods of making MPA implementations
interoperate with both IETF and RDMA Consortium versions of the
protocols.
The RDMA Consortium created early specifications of the MPA/DDP/RDMA The RDMA Consortium created early specifications of the MPA/DDP/RDMA
protocols and some manufacturers created implementations of those protocols and some manufacturers created implementations of those
protocols before the IETF versions were finalized. These protocols protocols before the IETF versions were finalized. These protocols
are very similar to the IETF versions, making it possible for are very similar to the IETF versions, making it possible for
implementations to be created or modified to support either set of implementations to be created or modified to support either set of
specifications. For those interested, the RDMA Consortium protocol specifications.
documents can be obtained at http://www.rdmaconsortium.org.
For those interested, the RDMA Consortium protocol documents
(draft-culley-iwarp-mpa-v1.0.pdf, draft-shah-iwarp-ddp-v1.0.pdf, and
draft-recio-iwarp-rdmac-v1.0.pdf) can be obtained at
http://www.rdmaconsortium.org.
In this section, implementations of MPA/DDP/RDMA that conform to the In this section, implementations of MPA/DDP/RDMA that conform to the
RDMAC specifications are called RDMAC RNICs. Implementations of RDMAC specifications are called RDMAC RNICs. Implementations of
MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs. MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.
Without the exchange of MPA Request/Reply Frames, there is no Without the exchange of MPA Request/Reply Frames, there is no
standard mechanism for enabling RDMAC RNICs to interoperate with IETF standard mechanism for enabling RDMAC RNICs to interoperate with IETF
RNICs. Even if a ULP uses a well-known port to start an IETF RNIC RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
immediately in RDMA mode (i.e., without exchanging the MPA immediately in RDMA mode (i.e., without exchanging the MPA
Request/Reply messages), there is no reason to believe an IETF RNIC Request/Reply messages), there is no reason to believe an IETF RNIC
will interoperate with an RDMAC RNIC because of the differences in will interoperate with an RDMAC RNIC because of the differences in
the version number in the DDP and RDMAP headers on the wire. the version number in the DDP and RDMAP headers on the wire.
Therefore, the ULP or other supporting entity at the RDMAC RNIC must Therefore, the ULP or other supporting entity at the RDMAC RNIC must
implement MPA Request/Reply Frames on behalf of the RNIC in order to implement MPA Request/Reply Frames on behalf of the RNIC in order to
negotiate the connection parameters. The following section describes negotiate the connection parameters. The following section describes
the results following the exchange of the MPA Request/Reply Frames the results following the exchange of the MPA Request/Reply Frames
before the conversion from streaming to RDMA mode. before the conversion from streaming to RDMA mode.
12.3.1 Negotiated Parameters C.1 Negotiated Parameters
Three types of RNICs are considered: Three types of RNICs are considered:
Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
has a ULP or other supporting entity that exchanges the MPA has a ULP or other supporting entity that exchanges the MPA
Request/Reply Frames in streaming mode before the conversion to Request/Reply Frames in streaming mode before the conversion to
RDMA mode. RDMA mode.
Non-permissive IETF RNIC - an RNIC implementing the IETF protocols Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
which is not capable of implementing the RDMAC protocols. Such which is not capable of implementing the RDMAC protocols. Such
skipping to change at page 64, line 47 skipping to change at page 65, line 7
DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP
and RDMAP version MUST be identical in the two directions. The RNIC and RDMAP version MUST be identical in the two directions. The RNIC
either generates the RDMAC protocols on the wire (version is zero) or either generates the RDMAC protocols on the wire (version is zero) or
the IETF protocols (version is one). the IETF protocols (version is one).
In the following sections, the figures do not discuss CRC negotiation In the following sections, the figures do not discuss CRC negotiation
because there is no interoperability issue for CRCs. Since the RDMAC because there is no interoperability issue for CRCs. Since the RDMAC
RNIC will always request CRC use, then, according to the IETF MPA RNIC will always request CRC use, then, according to the IETF MPA
specification, both peers MUST generate and check CRCs. specification, both peers MUST generate and check CRCs.
12.3.2 RDMAC RNIC and Non-permissive IETF RNIC C.2 RDMAC RNIC and Non-permissive IETF RNIC
Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
with an RDMAC RNIC, despite the fact that both peers exchange MPA with an RDMAC RNIC, despite the fact that both peers exchange MPA
Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
negotiation has no effect on the DDP/RDMAP version and it is unable negotiation has no effect on the DDP/RDMAP version and it is unable
to interoperate with the RDMAC RNIC. to interoperate with the RDMAC RNIC.
The rows in the figure show the state of the Marker field in the MPA The rows in the figure show the state of the Marker field in the MPA
Request Frame sent by the MPA Initiator. The columns show the state Request Frame sent by the MPA Initiator. The columns show the state
of the Marker field in the MPA Reply Frame sent by the MPA Responder. of the Marker field in the MPA Reply Frame sent by the MPA Responder.
skipping to change at page 65, line 38 skipping to change at page 65, line 48
| +----------+------++-------+-------+-------+ | +----------+------++-------+-------+-------+
| MPA | | M=0 || close | V=1 | V=1 | | MPA | | M=0 || close | V=1 | V=1 |
|Initiator| IETF | || | M=0/0 | M=0/1 | |Initiator| IETF | || | M=0/0 | M=0/1 |
| |Non-perms.+------++-------+-------+-------+ | |Non-perms.+------++-------+-------+-------+
| | | M=1 || close | V=1 | V=1 | | | | M=1 || close | V=1 | V=1 |
| | | || | M=1/0 | M=1/1 | | | | || | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+ +---------+----------+------++-------+-------+-------+
Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC. IETF RNIC.
12.3.2.1 RDMAC RNIC Initiator C.2.1 RDMAC RNIC Initiator
If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
Frame with Rev field set to zero and the M and C bits set to one. Frame with Rev field set to zero and the M and C bits set to one.
Because the Non-permissive IETF RNIC cannot dynamically downgrade the Because the Non-permissive IETF RNIC cannot dynamically downgrade the
version number it uses for DDP and RDMAP, it would send an MPA Reply version number it uses for DDP and RDMAP, it would send an MPA Reply
Frame with the Rev field equal to one and then gracefully close the Frame with the Rev field equal to one and then gracefully close the
connection. connection.
12.3.2.2 Non-Permissive IETF RNIC Initiator C.2.2 Non-Permissive IETF RNIC Initiator
If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
Request Frame with Rev field equal to one. The ULP or supporting Request Frame with Rev field equal to one. The ULP or supporting
entity for the RDMAC RNIC responds with an MPA Reply Frame that has entity for the RDMAC RNIC responds with an MPA Reply Frame that has
the Rev field equal to zero and the M bit set to one. The Non- the Rev field equal to zero and the M bit set to one. The Non-
permissive IETF RNIC will gracefully close the connection after it permissive IETF RNIC will gracefully close the connection after it
reads the incompatible Rev field in the MPA Reply Frame. reads the incompatible Rev field in the MPA Reply Frame.
12.3.3 RDMAC RNIC and Permissive IETF RNIC C.2.3 RDMAC RNIC and Permissive IETF RNIC
Figure 16 shows that a Permissive IETF RNIC can interoperate with an Figure 16 shows that a Permissive IETF RNIC can interoperate with an
RDMAC RNIC regardless of its Marker preference. The figure uses the RDMAC RNIC regardless of its Marker preference. The figure uses the
same format as shown with the Non-permissive IETF RNIC. same format as shown with the Non-permissive IETF RNIC.
+---------------------------++-----------------------+ +---------------------------++-----------------------+
| MPA || MPA | | MPA || MPA |
| CONNECT || Responder | | CONNECT || Responder |
| MODE +-----------------++-------+---------------+ | MODE +-----------------++-------+---------------+
| | RNIC || RDMAC | IETF | | | RNIC || RDMAC | IETF |
skipping to change at page 66, line 40 skipping to change at page 67, line 5
Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC. IETF RNIC.
A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
Rev field of the MPA Req/Rep Frames and then adjust its receive Rev field of the MPA Req/Rep Frames and then adjust its receive
Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As
a result, as an MPA Responder, the Permissive IETF RNIC will never a result, as an MPA Responder, the Permissive IETF RNIC will never
return an MPA Reply Frame with the M bit set to zero. This case is return an MPA Reply Frame with the M bit set to zero. This case is
shown as a not applicable (N/A) in Figure 16. shown as a not applicable (N/A) in Figure 16.
12.3.3.1 RDMAC RNIC Initiator C.2.4 RDMAC RNIC Initiator
When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
entity prepares an MPA Request message and sets the revision to zero entity prepares an MPA Request message and sets the revision to zero
and the M bit and C bit to one. and the M bit and C bit to one.
The Permissive IETF Responder receives the MPA Request message and The Permissive IETF Responder receives the MPA Request message and
checks the revision field. Since it is capable of generating RDMAC checks the revision field. Since it is capable of generating RDMAC
DDP/RDMAP headers, it sends an MPA Reply message with revision set to DDP/RDMAP headers, it sends an MPA Reply message with revision set to
zero and the M and C bits set to one. The Responder must inform its zero and the M and C bits set to one. The Responder must inform its
ULP that it is generating version zero DDP/RDMAP messages. ULP that it is generating version zero DDP/RDMAP messages.
12.3.3.2 Permissive IETF RNIC Initiator C.2.5 Permissive IETF RNIC Initiator
If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
Request Frame setting the Rev field to one. Regardless of the value Request Frame setting the Rev field to one. Regardless of the value
of the M bit in the MPA Request Frame, the ULP or other supporting of the M bit in the MPA Request Frame, the ULP or other supporting
entity for the RDMAC RNIC will create an MPA Reply Frame with Rev entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
equal to zero and the M bit set to one. equal to zero and the M bit set to one.
When the Initiator reads the Rev field of the MPA Reply Frame and When the Initiator reads the Rev field of the MPA Reply Frame and
finds that its peer is an RDMAC RNIC, it must inform its ULP that it finds that its peer is an RDMAC RNIC, it must inform its ULP that it
should generate version zero DDP/RDMAP messages and enable MPA should generate version zero DDP/RDMAP messages and enable MPA
Markers and CRC. Markers and CRC.
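The Rev-field handling described above can be summarized in a short sketch. It is a hypothetical illustration, not text from either draft. The names RnicKind, Outcome and outcome_for_ietf_rnic are invented here, and only the behavior stated above is modeled. Marker negotiation between two IETF RNICs is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

REV_RDMAC = 0  # Rev field sent on behalf of an RDMAC RNIC
REV_IETF = 1   # Rev field sent by an IETF RNIC


class RnicKind(Enum):
    IETF_NON_PERMISSIVE = "ietf-non-permissive"
    IETF_PERMISSIVE = "ietf-permissive"


@dataclass(frozen=True)
class Outcome:
    close: bool                       # gracefully close the connection
    ddp_rdmap_version: Optional[int]  # version used on the wire, if any
    markers_and_crc_required: bool    # an RDMAC peer always uses M=1 and C=1


def outcome_for_ietf_rnic(kind: RnicKind, peer_rev: int) -> Outcome:
    """Outcome for an IETF RNIC after it reads the peer's Rev field."""
    if peer_rev == REV_IETF:
        # Both peers speak the IETF protocols; Markers follow the M bits.
        return Outcome(False, 1, False)
    if peer_rev == REV_RDMAC:
        if kind is RnicKind.IETF_PERMISSIVE:
            # Downgrade to version zero DDP/RDMAP, with Markers and CRC enabled.
            return Outcome(False, 0, True)
        # A Non-permissive IETF RNIC cannot downgrade, so it closes.
        return Outcome(True, None, False)
    raise ValueError("unknown Rev value")


print(outcome_for_ietf_rnic(RnicKind.IETF_PERMISSIVE, REV_RDMAC))
```

The same function covers the Initiator and Responder roles because, in the text above, the decision depends only on the peer's Rev field and on whether the local RNIC can generate version zero headers.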
12.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC C.3 Non-Permissive IETF RNIC and Permissive IETF RNIC
For completeness, Figure 17 shows the results of MPA negotiation For completeness, Figure 17 below shows the results of MPA
between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The negotiation between a Non-permissive IETF RNIC and a Permissive IETF
important point from this figure is that an IETF RNIC cannot detect RNIC. The important point from this figure is that an IETF RNIC
whether its peer is a Permissive or Non-permissive RNIC. cannot detect whether its peer is a Permissive or Non-permissive
RNIC.
+---------------------------++-------------------------------+ +---------------------------++-------------------------------+
| MPA || MPA | | MPA || MPA |
| CONNECT || Responder | | CONNECT || Responder |
| MODE +-----------------++---------------+---------------+ | MODE +-----------------++---------------+---------------+
| | RNIC || IETF | IETF | | | RNIC || IETF | IETF |
| | TYPE || Non-permissive| Permissive | | | TYPE || Non-permissive| Permissive |
| | +------++-------+-------+-------+-------+ | | +------++-------+-------+-------+-------+
| | |MARKER|| M=0 | M=1 | M=0 | M=1 | | | |MARKER|| M=0 | M=1 | M=0 | M=1 |
+---------+----------+------++-------+-------+-------+-------+ +---------+----------+------++-------+-------+-------+-------+
skipping to change at page 68, line 5 skipping to change at page 69, line 5
| MPA +----------+------++-------+-------+-------+-------+ | MPA +----------+------++-------+-------+-------+-------+
|Initiator| | M=0 || V=1 | V=1 | V=1 | V=1 | |Initiator| | M=0 || V=1 | V=1 | V=1 | V=1 |
| | IETF | || M=0/0 | M=0/1 | M=0/0 | M=0/1 | | | IETF | || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
| |Permissive+------++-------+-------+-------+-------+ | |Permissive+------++-------+-------+-------+-------+
| | | M=1 || V=1 | V=1 | V=1 | V=1 | | | | M=1 || V=1 | V=1 | V=1 | V=1 |
| | | || M=1/0 | M=1/1 | M=1/0 | M=1/1 | | | | || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+-------+ +---------+----------+------++-------+-------+-------+-------+
Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC. Permissive IETF RNIC.
13 Author's Addresses Normative References
[iSCSI] Satran, J., "Internet Small Computer Systems Interface
(iSCSI)", RFC 3720, April 2004.
[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
November 1990.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
[RFC3723] Aboba, B., et al., "Securing Block Storage Protocols over
IP", RFC 3723, April 2004.
[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, September 1981.
[RDMASEC] Pinkerton, J., Deleganes, E., Bitan, S., "DDP/RDMAP
Security", draft-ietf-rddp-security-09.txt (Work in progress),
May 2006.
Informative References
[APPL] Bestler, C., "Applicability of Remote Direct Memory Access
Protocol (RDMA) and Direct Data Placement (DDP)", draft-ietf-
rddp-applicability-08.txt (Work in progress), June 2006.
[CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000.
[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
Library) and uDAPL (User Direct Access Programming Library)",
http://www.datcollaborative.org.
[DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", draft-ietf-rddp-ddp-06.txt (Work in progress), May
2006.
[iSER] Mike Ko et al., "iSCSI Extensions for RDMA Specification",
draft-ietf-ips-iser-05.txt (Work in progress), October 2005.
[IT-API] The Open Group, "Interconnect Transport API (IT-API)"
Version 2.1, http://www.opengroup.org.
[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
bindings-02.txt, July 2004.
[RDMAP] R. Recio et al., "RDMA Protocol Specification",
draft-ietf-rddp-rdmap-06.txt, May 2006.
[RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
September 1981.
[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
896, January 1984.
[RFC1122] Braden, R.T., "Requirements for Internet hosts -
communication layers", RFC 1122, October 1989.
[RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000.
[RFC4296] Bailey, S., Talpey, T., "The Architecture of Direct Data
Placement (DDP) and Remote Direct Memory Access (RDMA) on
Internet Protocols", RFC 4296, December 2005.
[RFC4297] Romanow, A., et al., "Remote Direct Memory Access (RDMA)
over IP Problem Statement", RFC 4297, December 2005.
[RFC4301] Kent, S., Seo, K., "Security Architecture for the Internet
Protocol", RFC 4301, December 2005.
[VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf April 2003,
http://www.rdmaconsortium.org.
Author's Addresses
Stephen Bailey Stephen Bailey
Sandburst Corporation Sandburst Corporation
600 Federal Street 600 Federal Street
Andover, MA 01810 USA Andover, MA 01810 USA
Phone: +1 978 689 1614 Phone: +1 978 689 1614
Email: steph@sandburst.com Email: steph@sandburst.com
Paul R. Culley Paul R. Culley
Hewlett-Packard Company Hewlett-Packard Company
skipping to change at page 69, line 5 skipping to change at page 72, line 5
Phone: 512-838-3685 Phone: 512-838-3685
Email: recio@us.ibm.com Email: recio@us.ibm.com
John Carrier John Carrier
Cray Inc. Cray Inc.
411 First Avenue S, Suite 600 411 First Avenue S, Suite 600
Seattle, WA 98104-2860 Seattle, WA 98104-2860
Phone: 206-701-2090 Phone: 206-701-2090
Email: carrier@cray.com Email: carrier@cray.com
14 Acknowledgments Acknowledgments
Dwight Barron Dwight Barron
Hewlett-Packard Company Hewlett-Packard Company
20555 SH 249 20555 SH 249
Houston, Tx. USA 77070-2698 Houston, Tx. USA 77070-2698
Phone: 281-514-2769 Phone: 281-514-2769
Email: dwight.barron@hp.com Email: dwight.barron@hp.com
Jeff Chase Jeff Chase
Department of Computer Science Department of Computer Science
 End of changes. 88 change blocks. 
509 lines changed or deleted 608 lines changed or added

This html diff was produced by rfcdiff 1.32. The latest version is available from http://www.levkowetz.com/ietf/tools/rfcdiff/