draft-ietf-rddp-mpa-06.txt   draft-ietf-rddp-mpa-07.txt 
Remote Direct Data Placement Work Group P. Culley Remote Direct Data Placement Work Group P. Culley
INTERNET-DRAFT Hewlett-Packard Company INTERNET-DRAFT Hewlett-Packard Company
draft-ietf-rddp-mpa-06.txt U. Elzur draft-ietf-rddp-mpa-07.txt U. Elzur
Broadcom Corporation Broadcom Corporation
R. Recio R. Recio
IBM Corporation IBM Corporation
S. Bailey S. Bailey
Sandburst Corporation Sandburst Corporation
J. Carrier J. Carrier
Cray Inc. Cray Inc.
Expires: February 2007 September 5, 2006 Expires: April 2007 October 5, 2006
Marker PDU Aligned Framing for TCP Specification Marker PDU Aligned Framing for TCP Specification
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
skipping to change at page 4, line 6 skipping to change at page 4, line 6
Figure 13: Aligned FPDU placed immediately after TCP header 59 Figure 13: Aligned FPDU placed immediately after TCP header 59
Figure 14. Connection Parameters for the RNIC Types. 64 Figure 14. Connection Parameters for the RNIC Types. 64
Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC. 65 IETF RNIC. 65
Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC. 66 IETF RNIC. 66
Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC. 68 Permissive IETF RNIC. 68
Revision history [To be deleted prior to RFC publication] Revision history [To be deleted prior to RFC publication]
[draft-ietf-rddp-mpa-07] workgroup draft with following changes:
Minor clarifications; added CRC to glossary, made 2.1 discussion
on probabilistic/deterministic a little less global. Added note
that MULPDU is likely smaller than 64768, clarified 'M' bit
description, added xref to private data discussion in field
definition, removed LLP acronym, added sentence on DOS attack to
"Man in Middle" in security.
[draft-ietf-rddp-mpa-06] workgroup draft with following changes: [draft-ietf-rddp-mpa-06] workgroup draft with following changes:
Document restructuring to move descriptive information on Document restructuring to move descriptive information on
implementing optimized MPA/TCP implementations to an appendix. implementing optimized MPA/TCP implementations to an appendix.
All normative text was removed from the appendix. Paragraph All normative text was removed from the appendix. Paragraph
added to security section explaining IPSEC version. Added added to security section explaining IPSEC version. Added
informative references to architecture, applicability, and informative references to architecture, applicability, and
problem statement documents. problem statement documents.
[draft-ietf-rddp-mpa-05] workgroup draft with following changes:
Document restructuring to differentiate between fully layered
MPA on TCP implementations and optimized MPA/TCP
implementations. This involved somewhat blurring the artificial
layer between MPA and an MPA-aware TCP. This involved a bit of
terminology change.
Re-wrote the requirement to avoid duplicate segments during TCP
out of order passing to MPA; this is now a co-responsibility
between MPA/TCP; also explained that the requirement was to
avoid data corruption through bypassing MPA CRCs and other
checks.
1 Glossary 1 Glossary
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119]. this document are to be interpreted as described in [RFC2119].
Consumer - the ULPs or applications that lie above MPA and DDP. The Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations. and DDP connections, and generally controlling operations.
CRC - Cyclic Redundancy Check.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for the process of informing DDP that a particular PDU is ordered for
use. A PDU is Delivered in the exact order that it was sent by use. A PDU is Delivered in the exact order that it was sent by
the original sender; MPA uses TCP's byte stream ordering to the original sender; MPA uses TCP's byte stream ordering to
determine when Delivery is possible. This is specifically determine when Delivery is possible. This is specifically
different from "passing the PDU to DDP", which may generally different from "passing the PDU to DDP", which may generally
occur in any order, while the order of Delivery is strictly occur in any order, while the order of Delivery is strictly
defined. defined.
EMSS - Effective Maximum Segment Size. EMSS is the smaller of the EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
skipping to change at page 8, line 26 skipping to change at page 8, line 26
boundary is useful to a hardware network adapter that uses DDP to boundary is useful to a hardware network adapter that uses DDP to
directly place the data in the application buffer based on the directly place the data in the application buffer based on the
control information carried in the ULPDU header. This may be done control information carried in the ULPDU header. This may be done
without requiring that the packets arrive in order. Potential without requiring that the packets arrive in order. Potential
benefits of this capability are the avoidance of the memory copy benefits of this capability are the avoidance of the memory copy
overhead and a smaller memory requirement for handling out of order overhead and a smaller memory requirement for handling out of order
or dropped packets. or dropped packets.
Many approaches have been proposed for a generalized framing Many approaches have been proposed for a generalized framing
mechanism. Some are probabilistic in nature and others are mechanism. Some are probabilistic in nature and others are
deterministic. A probabilistic approach is characterized by a deterministic. An example probabilistic approach is characterized by
detectable value embedded in the octet stream. It is probabilistic a detectable value embedded in the octet stream, with no method of
because under some conditions the receiver may incorrectly interpret preventing that value elsewhere within user data. It is
application data as the detectable value. Under these conditions, probabilistic because under some conditions the receiver may
the protocol may fail with unacceptable frequency. A deterministic incorrectly interpret application data as the detectable value.
approach is characterized by embedded controls at known locations in Under these conditions, the protocol may fail with unacceptable
the octet stream. Because the receiver can guarantee it will only frequency. One deterministic approach is characterized by embedded
examine the data stream at locations that are known to contain the controls at known locations in the octet stream. Because the
embedded control, the protocol can never misinterpret application receiver can guarantee it will only examine the data stream at
data as being embedded control data. For unambiguous handling of an locations that are known to contain the embedded control, the
out of order packet, the deterministic approach is preferred. protocol can never misinterpret application data as being embedded
control data. For unambiguous handling of an out of order packet,
a deterministic approach is preferred.
The MPA protocol provides a framing mechanism for DDP running over The MPA protocol provides a framing mechanism for DDP running over
TCP using the deterministic approach. It allows the location of the TCP using the deterministic approach. It allows the location of the
ULPDU to be determined in the TCP stream even if the TCP segments ULPDU to be determined in the TCP stream even if the TCP segments
arrive out of order. arrive out of order.
2.2 Protocol Overview 2.2 Protocol Overview
The layering of PDUs with MPA is shown in Figure 1, below. The layering of PDUs with MPA is shown in Figure 1, below.
skipping to change at page 12, line 24 skipping to change at page 12, line 24
As such, MPA accepts complete records (ULPDUs) from DDP at the sender As such, MPA accepts complete records (ULPDUs) from DDP at the sender
and returns them to DDP at the receiver. and returns them to DDP at the receiver.
MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
contained in one FPDU. contained in one FPDU.
MPA over a standard TCP stack can usually provide FPDU Alignment with MPA over a standard TCP stack can usually provide FPDU Alignment with
the TCP Header if the FPDU is equal to TCP's EMSS. An optimized the TCP Header if the FPDU is equal to TCP's EMSS. An optimized
MPA/TCP stack can also maintain alignment as long as the FPDU is less MPA/TCP stack can also maintain alignment as long as the FPDU is less
than or equal to TCP's EMSS. Since FPDU Alignment is generally than or equal to TCP's EMSS. Since FPDU Alignment is generally
desired by the receiver, DDP must cooperate with MPA to ensure FPDUs' desired by the receiver, DDP cooperates with MPA to ensure
lengths do not exceed the EMSS under normal conditions. This is done FPDUs' lengths do not exceed the EMSS under normal conditions. This
with the MULPDU mechanism. is done with the MULPDU mechanism.
MPA provides information to DDP on the current maximum size of the MPA provides information to DDP on the current maximum size of the
record that is acceptable to send (MULPDU). DDP SHOULD limit each record that is acceptable to send (MULPDU). DDP SHOULD limit each
record size to MULPDU. The range of MULPDU values MUST be between record size to MULPDU. The range of MULPDU values MUST be between
128 octets and 64768 octets, inclusive. 128 octets and 64768 octets, inclusive.
The sending DDP MUST NOT post a ULPDU larger than 64768 octets to The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
MPA. DDP MAY post a ULPDU of any size between one and 64768 octets, MPA. DDP MAY post a ULPDU of any size between one and 64768 octets,
however MPA is not REQUIRED to support a ULPDU Length that is greater however MPA is not REQUIRED to support a ULPDU Length that is greater
than the current MULPDU. than the current MULPDU.
skipping to change at page 12, line 48 skipping to change at page 12, line 48
While the maximum theoretical length supported by the MPA header While the maximum theoretical length supported by the MPA header
ULPDU_Length field is 65535, TCP over IP requires the IP datagram ULPDU_Length field is 65535, TCP over IP requires the IP datagram
maximum length to be 65535 octets. To enable MPA to support FPDU maximum length to be 65535 octets. To enable MPA to support FPDU
Alignment, the maximum size of the FPDU must fit within an IP Alignment, the maximum size of the FPDU must fit within an IP
datagram. Thus the ULPDU limit of 64768 octets was derived by taking datagram. Thus the ULPDU limit of 64768 octets was derived by taking
the maximum IP datagram length, subtracting from it the maximum total the maximum IP datagram length, subtracting from it the maximum total
length of the sum of the IPv4 header, TCP header, IPv4 options, TCP length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
options, and the worst case MPA overhead, and then rounding the options, and the worst case MPA overhead, and then rounding the
result down to a 128 octet boundary. result down to a 128 octet boundary.
Note that MULPDU will be significantly smaller than the theoretical
maximum in most implementations for most circumstances, due to link
MTUs, use of extra headers such as required for IPSEC etc.
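The derivation of the 64768-octet limit described above can be reproduced as a short calculation. This is only a sketch: the component sizes below (60-octet maximum IPv4 and TCP headers including options, a 2-octet ULPDU_Length field, up to 3 pad octets, a 4-octet CRC, and one 4-octet Marker per 512 octets of stream) are assumptions taken from general IPv4/TCP/MPA knowledge and are not stated in the text shown in this diff. The function and constant names are illustrative.

```python
# Assumed sizes; none of these numbers appear in the excerpt above.
IP_DATAGRAM_MAX = 65535    # maximum IP datagram length, in octets
IPV4_HEADER_MAX = 60       # IPv4 header including options (assumed)
TCP_HEADER_MAX = 60        # TCP header including options (assumed)
MARKER_SPACING = 512       # one Marker per 512 octets of stream (assumed)
MARKER_SIZE = 4            # octets per Marker (assumed)
ULPDU_LENGTH_FIELD = 2     # MPA header ULPDU_Length field (assumed)
MAX_PAD = 3                # worst-case pad octets (assumed)
CRC_SIZE = 4               # CRC field (assumed)
ROUNDING = 128             # result is rounded down to a 128 octet boundary


def worst_case_mpa_overhead(tcp_payload: int) -> int:
    """Worst-case MPA octets added to one FPDU filling tcp_payload octets."""
    markers = -(-tcp_payload // MARKER_SPACING)  # ceiling division
    return markers * MARKER_SIZE + ULPDU_LENGTH_FIELD + MAX_PAD + CRC_SIZE


def ulpdu_limit() -> int:
    """Largest ULPDU that still fits in one IP datagram, rounded down."""
    tcp_payload = IP_DATAGRAM_MAX - IPV4_HEADER_MAX - TCP_HEADER_MAX
    usable = tcp_payload - worst_case_mpa_overhead(tcp_payload)
    return (usable // ROUNDING) * ROUNDING
```

Under these assumptions the TCP payload is 65415 octets, the worst-case MPA overhead is 521 octets, and ulpdu_limit() returns 64768, matching the limit quoted in the text.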
On receive, MPA MUST pass each ULPDU with its length to DDP when it On receive, MPA MUST pass each ULPDU with its length to DDP when it
has been validated. has been validated.
If an MPA implementation supports passing out of order ULPDUs to DDP, If an MPA implementation supports passing out of order ULPDUs to DDP,
the MPA implementation SHOULD: the MPA implementation SHOULD:
* Pass each ULPDU with its length to DDP as soon as it has been * Pass each ULPDU with its length to DDP as soon as it has been
fully received and validated. fully received and validated.
* Provide a mechanism to indicate the ordering of ULPDUs as the * Provide a mechanism to indicate the ordering of ULPDUs as the
skipping to change at page 28, line 42 skipping to change at page 28, line 42
49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. Responder mode senders MUST set this field to value is detected. Responder mode senders MUST set this field to
the fixed value "MPA ID Rep frame" or (in byte order) 4D 50 41 20 the fixed value "MPA ID Rep frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
mode receivers MUST check this field for the same value, and mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other close the connection and report an error locally if any other
value is detected. value is detected.
M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame, M: This bit declares an endpoint's REQUIRED Marker usage. When this
declares a receiver's requirement for Markers. When in a bit is '1' in an MPA Request Frame, the Initiator declares that
received MPA Request Frame or MPA Reply Frame and the value is Markers are REQUIRED in FPDUs sent
'0', Markers MUST NOT be added to the data stream by the sender. from the Responder. When set to '1' in an MPA Reply Frame, this
When '1' Markers MUST be added as described in section 4.3 MPA bit declares that Markers are REQUIRED in FPDUs sent from the
Markers on page 15. Initiator. When in a received MPA Request Frame or MPA Reply
Frame and the value is '0', Markers MUST NOT be added to the data
stream by that endpoint. When '1' Markers MUST be added as
described in section 4.3 MPA Markers on page 15.
C: This bit declares an endpoint's preferred CRC usage. When this C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the MPA Request Frame and the MPA Reply Frame, field is '0' in the MPA Request Frame and the MPA Reply Frame,
CRCs MUST not be checked and need not be generated by either CRCs MUST not be checked and need not be generated by either
endpoint. When this bit is '1' in either the MPA Request Frame endpoint. When this bit is '1' in either the MPA Request Frame
or MPA Reply Frame, CRCs MUST be generated and checked by both or MPA Reply Frame, CRCs MUST be generated and checked by both
endpoints. Note that even when not in use, the CRC field remains endpoints. Note that even when not in use, the CRC field remains
present in the FPDU. When CRCs are not in use, the CRC field present in the FPDU. When CRCs are not in use, the CRC field
MUST be considered valid for FPDU checking regardless of its MUST be considered valid for FPDU checking regardless of its
contents. contents.
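The Key, M and C rules in the field definitions above can be sketched as follows. This is a minimal illustration, not an implementation of the draft: the function names and the exception type are hypothetical, frame parsing is omitted, and only the Reply Frame key is checked because its full byte sequence appears in the text above.

```python
# Key value for the MPA Reply Frame, copied from the hex bytes in the text.
MPA_REPLY_KEY = bytes.fromhex(
    "4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65"
)


class MpaStartupError(Exception):
    """Hypothetical error: close the connection and report locally."""


def check_reply_key(key: bytes) -> None:
    """Initiator-side check of the Key field in a received Reply Frame."""
    if key != MPA_REPLY_KEY:
        raise MpaStartupError("unexpected Key in MPA Reply Frame")


def must_send_markers(peer_m_bit: int) -> bool:
    """M bit from the received frame: '1' means this endpoint MUST add
    Markers to the FPDUs it sends, '0' means it MUST NOT."""
    return peer_m_bit == 1


def crc_in_use(request_c_bit: int, reply_c_bit: int) -> bool:
    """CRCs are generated and checked by both endpoints when the C bit is
    '1' in either frame; they are not checked only when both are '0'."""
    return request_c_bit == 1 or reply_c_bit == 1
```

Note that the M bit is directional (each side states what it needs to receive), while the C bit outcome is symmetric: a single '1' from either endpoint enables CRCs for both directions.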
skipping to change at page 29, line 36 skipping to change at page 29, line 39
Private Data field present at all. If the receiver detects that Private Data field present at all. If the receiver detects that
the PD_Length field does not match the length of the Private Data the PD_Length field does not match the length of the Private Data
field, or if the length of the Private Data field exceeds 512 field, or if the length of the Private Data field exceeds 512
octets, the receiver MUST close the connection and report an octets, the receiver MUST close the connection and report an
error locally. Otherwise, the MPA receiver should pass the error locally. Otherwise, the MPA receiver should pass the
PD_Length value and Private Data to the ULP. PD_Length value and Private Data to the ULP.
Private Data: This field may contain any value defined by ULPs or may Private Data: This field may contain any value defined by ULPs or may
not be present. The Private Data field MUST be between 0 and 512 not be present. The Private Data field MUST be between 0 and 512
octets in length. ULPs define how to size, set, and validate octets in length. ULPs define how to size, set, and validate
this field within these limits. this field within these limits. Private Data usage is further
discussed in section 7.1.4 on page 35.
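The receiver-side PD_Length check described above can be sketched as a small validation routine. The names here (check_private_data, PrivateDataError) are hypothetical, and the sketch assumes the caller has already separated the PD_Length value from the Private Data octets that actually arrived.

```python
MAX_PRIVATE_DATA = 512  # upper bound on Private Data length, in octets


class PrivateDataError(Exception):
    """Hypothetical error: close the connection and report locally."""


def check_private_data(pd_length: int, private_data: bytes) -> bytes:
    """Validate Private Data against PD_Length before passing it to the ULP.

    Raises PrivateDataError when the Private Data is longer than 512 octets
    or when PD_Length does not match its actual length. An empty Private
    Data field with PD_Length 0 is valid.
    """
    if len(private_data) > MAX_PRIVATE_DATA:
        raise PrivateDataError("Private Data exceeds 512 octets")
    if pd_length != len(private_data):
        raise PrivateDataError("PD_Length does not match Private Data")
    return private_data
```

On success the returned octets, together with pd_length, are what would be handed to the ULP; how the ULP interprets them is outside MPA.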
7.1.2 Connection Startup Rules 7.1.2 Connection Startup Rules
The following rules apply to MPA connection Startup Phase: The following rules apply to MPA connection Startup Phase:
1. When MPA is started in the Initiator mode, the MPA implementation 1. When MPA is started in the Initiator mode, the MPA implementation
MUST send a valid MPA Request Frame. The MPA Request Frame MAY MUST send a valid MPA Request Frame. The MPA Request Frame MAY
include ULP supplied Private Data. include ULP supplied Private Data.
2. When MPA is started in the Responder mode, the MPA implementation 2. When MPA is started in the Responder mode, the MPA implementation
skipping to change at page 39, line 39 skipping to change at page 39, line 39
be followed. For example, if network data is lost, re-segmented be followed. For example, if network data is lost, re-segmented
or re-ordered, TCP MUST recover appropriately even when this or re-ordered, TCP MUST recover appropriately even when this
occurs while switching stacks. occurs while switching stacks.
7.2 Normal Connection Teardown 7.2 Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the Each half connection of MPA terminates when DDP closes the
corresponding TCP half connection. corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
that a graceful close of the LLP connection has been received by the that a graceful close of the TCP connection has been received by
LLP (e.g. FIN is received). the TCP (e.g. FIN is received).
8 Error Semantics 8 Error Semantics
The following errors MUST be detected by MPA and the codes SHOULD be The following errors MUST be detected by MPA and the codes SHOULD be
provided to DDP or other Consumer: provided to DDP or other Consumer:
Code Error Code Error
1 TCP connection closed, terminated or lost. This includes lost 1 TCP connection closed, terminated or lost. This includes lost
by timeout, too many retries, RST received or FIN received. by timeout, too many retries, RST received or FIN received.
skipping to change at page 42, line 22 skipping to change at page 42, line 22
authentication is completed successfully, and hijack the iSCSI authentication is completed successfully, and hijack the iSCSI
Stream. Stream.
The best protection against this form of attack is end-to-end The best protection against this form of attack is end-to-end
integrity protection and authentication, such as IPsec to prevent integrity protection and authentication, such as IPsec to prevent
spoofing. Another option is to provide physical security. spoofing. Another option is to provide physical security.
Discussion of physical security is out of scope for this document. Discussion of physical security is out of scope for this document.
9.1.1.3 Man in the Middle Attack 9.1.1.3 Man in the Middle Attack
If a network based attacker has the ability to delete, inject replay, If a network based attacker has the ability to delete, inject,
or modify packets which will still be accepted by MPA (e.g., TCP replay, or modify packets which will still be accepted by MPA (e.g.,
sequence number is correct, FPDU is valid etc.) then the Stream can TCP sequence number is correct, FPDU is valid etc.) then the Stream
be exposed to a man in the middle attack. The attacker could can be exposed to a man in the middle attack. The attacker could
potentially use the services of [DDP] and [RDMAP] to read the potentially use the services of [DDP] and [RDMAP] to read the
contents of the associated data buffer, modify the contents of the contents of the associated data buffer, modify the contents of the
associated data buffer, or to disable further access to the buffer. associated data buffer, or to disable further access to the buffer.
The only countermeasure for this form of attack is to either secure Other attacks on the connection setup sequence and even on TCP can be
the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to used to cause denial of service. The only countermeasure for this
provide physical security to prevent man-in-the-middle type attacks. form of attack is to either secure the MPA/DDP/RDMAP Stream (i.e.
integrity protect) or attempt to provide physical security to prevent
man-in-the-middle type attacks.
The best protection against this form of attack is end-to-end The best protection against this form of attack is end-to-end
integrity protection and authentication, such as IPsec, to prevent integrity protection and authentication, such as IPsec, to prevent
spoofing or tampering. If Stream or session level authentication and spoofing or tampering. If Stream or session level authentication and
integrity protection are not used, then a man-in-the-middle attack integrity protection are not used, then a man-in-the-middle attack
can occur, enabling spoofing and tampering. can occur, enabling spoofing and tampering.
Another approach is to restrict access to only the local subnet/link, Another approach is to restrict access to only the local subnet/link,
and provide some mechanism to limit access, such as physical security and provide some mechanism to limit access, such as physical security
or 802.1.x. This model is an extremely limited deployment scenario, or 802.1.x. This model is an extremely limited deployment scenario,
skipping to change at page 69, line 46 skipping to change at page 69, line 46
rddp-applicability-08.txt (Work in progress), June 2006. rddp-applicability-08.txt (Work in progress), June 2006.
[CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000. disagree", ACM Sigcomm, Sept. 2000.
[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
Library) and uDAPL (User Direct Access Programming Library)", Library) and uDAPL (User Direct Access Programming Library)",
http://www.datcollaborative.org. http://www.datcollaborative.org.
[DDP] H. Shah et al., "Direct Data Placement over Reliable [DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", draft-ietf-rddp-ddp-06.txt (Work in progress), May Transports", draft-ietf-rddp-ddp-07.txt (Work in progress),
2006. September 2006.
[iSER] Mike Ko et al., "iSCSI Extensions for RDMA Specification", [iSER] Mike Ko et al., "iSCSI Extensions for RDMA Specification",
draft-ietf-ips-iser-05.txt (Work in progress), October 2005. draft-ietf-ips-iser-05.txt (Work in progress), October 2005.
[IT-API] The Open Group, "Interconnect Transport API (IT-API)" [IT-API] The Open Group, "Interconnect Transport API (IT-API)"
Version 2.1, http://www.opengroup.org. Version 2.1, http://www.opengroup.org.
[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
Secure Channels", Internet-Draft draft-ietf-nfsv4-channel- Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
bindings-02.txt, July 2004. bindings-02.txt, July 2004.
[RDMAP] R. Recio et al., "RDMA Protocol Specification", [RDMAP] R. Recio et al., "RDMA Protocol Specification",
draft-ietf-rddp-rdmap-06.txt, May 2006. draft-ietf-rddp-rdmap-07.txt, September 2006.
[RFC792] Postel, J., "Internet Control Message Protocol", September [RFC792] Postel, J., "Internet Control Message Protocol", September
1981 1981
[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
896, January 1984. 896, January 1984.
[RFC1122] Braden, R.T., "Requirements for Internet hosts - [RFC1122] Braden, R.T., "Requirements for Internet hosts -
communication layers", October 1989. communication layers", October 1989.
 End of changes. 15 change blocks. 
49 lines changed or deleted 58 lines changed or added
