draft-ietf-nfsv4-rpcrdma-06.txt | draft-ietf-nfsv4-rpcrdma-07.txt | |||
---|---|---|---|---|
NFSv4 Working Group Tom Talpey | NFSv4 Working Group Tom Talpey | |||
Internet-Draft Network Appliance, Inc. | Internet-Draft NetApp | |||
Intended status: Standards Track Brent Callaghan | Intended status: Standards Track Brent Callaghan | |||
Expires: January 1, 2008 Apple Computer, Inc. | Expires: August 23, 2008 Apple | |||
July 1, 2007 | February 22, 2008 | |||
RDMA Transport for ONC RPC | Remote Direct Memory Access Transport for Remote Procedure Call | |||
draft-ietf-nfsv4-rpcrdma-06 | draft-ietf-nfsv4-rpcrdma-07 | |||
Status of this Memo | Status of this Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 38 | skipping to change at page 1, line 38 | |||
progress." | progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
Copyright Notice | Copyright Notice | |||
Copyright (C) The IETF Trust (2007). | Copyright (C) The IETF Trust (2008). | |||
Abstract | Abstract | |||
A protocol is described providing RDMA as a new transport for ONC | A protocol is described providing Remote Direct Memory Access | |||
RPC. The RDMA transport binding conveys the benefits of efficient, | (RDMA) as a new transport for Computing Remote Procedure Call | |||
bulk data transport over high speed networks, while providing for | (RPC). The RDMA transport binding conveys the benefits of | |||
minimal change to RPC applications and with no required revision of | efficient, bulk data transport over high speed networks, while | |||
the application RPC protocol, or the RPC protocol itself. | providing for minimal change to RPC applications and with no | |||
required revision of the application RPC protocol, or the RPC | ||||
protocol itself. | ||||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 | 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 | |||
3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 | 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 | |||
3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 | 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 | |||
3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 | 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 | 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 | |||
3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 11 | 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10 | |||
3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 | 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 | |||
3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 | 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 | |||
3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 | 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 | |||
3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 | 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 | 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 | |||
4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 | 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 | |||
4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 | 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 | |||
4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 | 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 | |||
5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 | 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 | |||
5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 | 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 | |||
5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 | 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 | |||
6. Connection Configuration Protocol . . . . . . . . . . . . 25 | 6. Connection Configuration Protocol . . . . . . . . . . . . 25 | |||
6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 | 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 | |||
6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 | 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 | |||
7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 | 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 | |||
8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 | 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 | |||
9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 | 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 | |||
10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 | 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 | |||
11. Security . . . . . . . . . . . . . . . . . . . . . . . . 30 | 11. Security Considerations . . . . . . . . . . . . . . . . . 30 | |||
12. IANA Considerations . . . . . . . . . . . . . . . . . . . 30 | 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 31 | |||
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 31 | 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 32 | |||
14. Normative References . . . . . . . . . . . . . . . . . . 31 | 14. Normative References . . . . . . . . . . . . . . . . . . 32 | |||
15. Informative References . . . . . . . . . . . . . . . . . 32 | 15. Informative References . . . . . . . . . . . . . . . . . 33 | |||
16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 33 | 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 34 | |||
17. Intellectual Property and Copyright Statements . . . . . 33 | 17. Intellectual Property and Copyright Statements . . . . . 35 | |||
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 34 | Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
Requirements Language | Requirements Language | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in | |||
this document are to be interpreted as described in [RFC2119]. | this document are to be interpreted as described in [RFC2119]. | |||
1. Introduction | 1. Introduction | |||
RDMA is a technique for efficient movement of data between end | Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a | |||
nodes, which becomes increasingly compelling over high speed | technique for efficient movement of data between end nodes, which | |||
transports. By directing data into destination buffers as it is | becomes increasingly compelling over high speed transports. By | |||
sent on a network, and placing it via direct memory access by | directing data into destination buffers as it is sent on a network, | |||
hardware, the double benefit of faster transfers and reduced host | and placing it via direct memory access by hardware, the double | |||
overhead is obtained. | benefit of faster transfers and reduced host overhead is obtained. | |||
ONC RPC [RFC1831bis] is a remote procedure call protocol that has | Open Network Computing Remote Procedure Call (ONC RPC, or simply, | |||
been run over a variety of transports. Most RPC implementations | RPC) [RFC1831bis] is a remote procedure call protocol that has been | |||
today use UDP or TCP. RPC messages are defined in terms of an | run over a variety of transports. Most RPC implementations today | |||
eXternal Data Representation (XDR) [RFC4506] which provides a | use UDP or TCP. RPC messages are defined in terms of an eXternal | |||
canonical data representation across a variety of host | Data Representation (XDR) [RFC4506] which provides a canonical data | |||
architectures. An XDR data stream is conveyed differently on each | representation across a variety of host architectures. An XDR data | |||
type of transport. On UDP, RPC messages are encapsulated inside | stream is conveyed differently on each type of transport. On UDP, | |||
datagrams, while on a TCP byte stream, RPC messages are delineated | RPC messages are encapsulated inside datagrams, while on a TCP byte | |||
by a record marking protocol. An RDMA transport also conveys RPC | stream, RPC messages are delineated by a record marking protocol. | |||
messages in a unique fashion that must be fully described if client | An RDMA transport also conveys RPC messages in a unique fashion | |||
and server implementations are to interoperate. | that must be fully described if client and server implementations | |||
are to interoperate. | ||||
RDMA transports present new semantics unlike the behaviors of | RDMA transports present new semantics unlike the behaviors of | |||
either UDP and TCP alone. They retain message delineations like | either UDP or TCP alone. They retain message delineations like UDP | |||
UDP while also providing a reliable, sequenced data transfer like | while also providing a reliable, sequenced data transfer like TCP. | |||
TCP. And, they provide the new efficient, bulk transfer service of | And, they provide the new efficient, bulk transfer service of RDMA. | |||
RDMA. RDMA transports are therefore naturally viewed as a new | RDMA transports are therefore naturally viewed as a new transport | |||
transport type by ONC RPC. | type by RPC. | |||
RDMA as a transport will benefit the performance of RPC protocols | RDMA as a transport will benefit the performance of RPC protocols | |||
that move large "chunks" of data, since RDMA hardware excels at | that move large "chunks" of data, since RDMA hardware excels at | |||
moving data efficiently between host memory and a high speed | moving data efficiently between host memory and a high speed | |||
network with little or no host CPU involvement. In this context, | network with little or no host CPU involvement. In this context, | |||
the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] | the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] | |||
[NFSv4.1], is an obvious beneficiary of RDMA. A complete problem | [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem | |||
statement is discussed in [NFSRDMAPS], and related NFSv4 issues are | statement is discussed in [NFSRDMAPS], and related NFSv4 issues are | |||
discussed in [NFSv4.1]. Many other RPC-based protocols will also | discussed in [NFSv4.1]. Many other RPC-based protocols will also | |||
benefit. | benefit. | |||
skipping to change at page 4, line 12 | skipping to change at page 4, line 13 | |||
header contains a transaction ID (XID) followed by the program and | header contains a transaction ID (XID) followed by the program and | |||
procedure number as well as a security credential. An RPC reply | procedure number as well as a security credential. An RPC reply | |||
header begins with an XID that matches that of the RPC call | header begins with an XID that matches that of the RPC call | |||
message, followed by a security verifier and results. All data in | message, followed by a security verifier and results. All data in | |||
an RPC message is XDR encoded. For a complete description of the | an RPC message is XDR encoded. For a complete description of the | |||
RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. | RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. | |||
This protocol assumes the following abstract model for RDMA | This protocol assumes the following abstract model for RDMA | |||
transports. These terms, common in the RDMA lexicon, are used in | transports. These terms, common in the RDMA lexicon, are used in | |||
this document. A more complete glossary of RDMA terms can be found | this document. A more complete glossary of RDMA terms can be found | |||
in [RDMAP]. | in [RFC5040]. | |||
o Registered Memory | o Registered Memory | |||
All data moved via tagged RDMA operations is resident in | All data moved via tagged RDMA operations is resident in | |||
registered memory at its destination. This protocol assumes | registered memory at its destination. This protocol assumes | |||
that each segment of registered memory MUST be identified with | that each segment of registered memory MUST be identified with | |||
a steering tag of no more than 32 bits and memory addresses of | a steering tag of no more than 32 bits and memory addresses of | |||
up to 64 bits in length. | up to 64 bits in length. | |||
o RDMA Send | o RDMA Send | |||
The RDMA provider supports an RDMA Send operation with | The RDMA provider supports an RDMA Send operation with | |||
skipping to change at page 5, line 10 | skipping to change at page 5, line 10 | |||
receives no notification of RDMA Read completion, there is an | receives no notification of RDMA Read completion, there is an | |||
assumption that on receiving the data the receiver will signal | assumption that on receiving the data the receiver will signal | |||
completion with an RDMA Send message, so that the peer can | completion with an RDMA Send message, so that the peer can | |||
free the source buffers and the associated steering tags. | free the source buffers and the associated steering tags. | |||
This protocol is designed to be carried over all RDMA transports | This protocol is designed to be carried over all RDMA transports | |||
meeting the stated requirements. This protocol conveys to the RPC | meeting the stated requirements. This protocol conveys to the RPC | |||
peer, information sufficient for that RPC peer to direct an RDMA | peer, information sufficient for that RPC peer to direct an RDMA | |||
layer to perform transfers containing RPC data, and to communicate | layer to perform transfers containing RPC data, and to communicate | |||
their result(s). For example, it is readily carried over RDMA | their result(s). For example, it is readily carried over RDMA | |||
transports such as iWARP [RDDP] or Infiniband [IB]. | transports such as iWARP [RFC5040, RFC5041] or Infiniband [IB]. | |||
3. Protocol Outline | 3. Protocol Outline | |||
An RPC message can be conveyed in identical fashion, whether it is | An RPC message can be conveyed in identical fashion, whether it is | |||
a call or reply message. In each case, the transmission of the | a call or reply message. In each case, the transmission of the | |||
message proper is preceded by transmission of a transport-specific | message proper is preceded by transmission of a transport-specific | |||
header for use by RPC over RDMA transports. This header is | header for use by RPC over RDMA transports. This header is | |||
analogous to the record marking used for RPC over TCP, but is more | analogous to the record marking used for RPC over TCP, but is more | |||
extensive, since RDMA transports support several modes of data | extensive, since RDMA transports support several modes of data | |||
transfer and it is important to allow the client and server to use | transfer and it is important to allow the client and server to use | |||
skipping to change at page 5, line 49 | skipping to change at page 5, line 49 | |||
however define an exchange to dynamically enable RPC/RDMA on an | however define an exchange to dynamically enable RPC/RDMA on an | |||
existing RPC association. Any such exchange must be carefully | existing RPC association. Any such exchange must be carefully | |||
architected so as to prevent any ambiguity as to the framing in use | architected so as to prevent any ambiguity as to the framing in use | |||
for each side of the connection. Because RPC/RDMA framing delimits | for each side of the connection. Because RPC/RDMA framing delimits | |||
an entire RPC request or reply, any such shift must occur between | an entire RPC request or reply, any such shift must occur between | |||
distinct RPC messages. | distinct RPC messages. | |||
3.1. Short Messages | 3.1. Short Messages | |||
Many RPC messages are quite short. For example, the NFS version 3 | Many RPC messages are quite short. For example, the NFS version 3 | |||
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32 | GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a | |||
byte filehandle argument and 4 bytes of length. The reply to this | 32 byte file handle argument and 4 bytes of length. The reply to | |||
common request is about 100 bytes. | this common request is about 100 bytes. | |||
There is no benefit in transferring such small messages with an | There is no benefit in transferring such small messages with an | |||
RDMA Read or Write operation. The overhead in transferring | RDMA Read or Write operation. The overhead in transferring | |||
steering tags and memory addresses is justified only by large | steering tags and memory addresses is justified only by large | |||
transfers. The critical message size that justifies RDMA transfer | transfers. The critical message size that justifies RDMA transfer | |||
will vary depending on the RDMA implementation and network, but is | will vary depending on the RDMA implementation and network, but is | |||
typically of the order of a few kilobytes. It is appropriate to | typically of the order of a few kilobytes. It is appropriate to | |||
transfer a short message with an RDMA Send to a pre-posted buffer. | transfer a short message with an RDMA Send to a pre-posted buffer. | |||
The RPC over RDMA header with the short message (call or reply) | The RPC over RDMA header with the short message (call or reply) | |||
immediately following is transferred using a single RDMA Send | immediately following is transferred using a single RDMA Send | |||
skipping to change at page 7, line 30 | skipping to change at page 7, line 30 | |||
up or down at each opportunity to match the server's needs or | up or down at each opportunity to match the server's needs or | |||
policies. | policies. | |||
The RPC client MUST NOT send unacknowledged requests in excess of | The RPC client MUST NOT send unacknowledged requests in excess of | |||
this granted RPC server credit limit. If the limit is exceeded, | this granted RPC server credit limit. If the limit is exceeded, | |||
the RDMA layer may signal an error, possibly terminating the | the RDMA layer may signal an error, possibly terminating the | |||
connection. Even if an error does not occur, it is OPTIONAL that | connection. Even if an error does not occur, it is OPTIONAL that | |||
the server handle the excess request(s), and it MAY return an RPC | the server handle the excess request(s), and it MAY return an RPC | |||
error to the client. Also note that the never-zero requirement | error to the client. Also note that the never-zero requirement | |||
implies that an RPC server MUST always provide at least one credit | implies that an RPC server MUST always provide at least one credit | |||
to each connected RPC client. It is however OPTIONAL that the | to each connected RPC client from which no requests are | |||
server always be prepared to receive a request from each client, | outstanding. The client would deadlock otherwise, unable to send | |||
for example when the server is busy processing all granted client | another request. | |||
requests. | ||||
While RPC calls complete in any order, the current flow control | While RPC calls complete in any order, the current flow control | |||
limit at the RPC server is known to the RPC client from the Send | limit at the RPC server is known to the RPC client from the Send | |||
ordering properties. It is always the most recent server-granted | ordering properties. It is always the most recent server-granted | |||
credit value minus the number of requests in flight. | credit value minus the number of requests in flight. | |||
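
The credit arithmetic described above can be sketched as follows. This is an illustrative sketch only, not part of either draft: the class name CreditTracker and its methods are hypothetical, and it assumes that each reply carries the server's latest credit grant and that the grant is never below one.

```python
class CreditTracker:
    """Client-side view of RPC/RDMA flow control credits (hypothetical helper)."""

    def __init__(self, granted: int) -> None:
        if granted < 1:
            raise ValueError("a server grant must be at least one credit")
        self.granted = granted      # most recent server-granted credit value
        self.in_flight = 0          # requests sent but not yet replied to

    def available(self) -> int:
        # Current limit: most recent grant minus the requests in flight.
        # A grant may shrink below the in-flight count, so clamp at zero.
        return max(0, self.granted - self.in_flight)

    def on_send(self) -> None:
        # The client MUST NOT exceed the granted credit limit.
        if self.available() == 0:
            raise RuntimeError("no credits available; request must wait")
        self.in_flight += 1

    def on_reply(self, new_grant: int) -> None:
        # Assumption: every reply carries an updated grant from the server.
        if new_grant < 1:
            raise ValueError("a server grant must be at least one credit")
        self.in_flight -= 1
        self.granted = new_grant
```

A client would call on_send() before posting each request and on_reply() as each reply arrives, queueing requests locally whenever available() returns zero.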
Certain RDMA implementations may impose additional flow control | Certain RDMA implementations may impose additional flow control | |||
restrictions, such as limits on RDMA Read operations in progress at | restrictions, such as limits on RDMA Read operations in progress at | |||
the responder. Because these operations are outside the scope of | the responder. Because these operations are outside the scope of | |||
this protocol, they are not addressed and SHOULD be provided for by | this protocol, they are not addressed and SHOULD be provided for by | |||
skipping to change at page 8, line 24 | skipping to change at page 8, line 24 | |||
encoded as a contiguous sequence of bytes for network transmission | encoded as a contiguous sequence of bytes for network transmission | |||
over UDP or TCP. However, in the case of an RDMA transport, local | over UDP or TCP. However, in the case of an RDMA transport, local | |||
routines such as XDR encode can determine that (for instance) an | routines such as XDR encode can determine that (for instance) an | |||
opaque byte array is large enough to be more efficiently moved via | opaque byte array is large enough to be more efficiently moved via | |||
an RDMA data transfer operation like RDMA Read or RDMA Write. | an RDMA data transfer operation like RDMA Read or RDMA Write. | |||
Semantically speaking, the protocol has no restriction regarding | Semantically speaking, the protocol has no restriction regarding | |||
data types which may or may not be represented by a read or write | data types which may or may not be represented by a read or write | |||
chunk. In practice however, efficiency considerations lead to the | chunk. In practice however, efficiency considerations lead to the | |||
conclusion that certain data types are not generally "chunkable". | conclusion that certain data types are not generally "chunkable". | |||
Typically, only opaque and aggregate data types which may attain | Typically, only those opaque and aggregate data types that may | |||
substantial size are considered to be eligible. With today's | attain substantial size are considered to be eligible. With | |||
hardware this size may be a kilobyte or more. However any object | today's hardware this size may be a kilobyte or more. However any | |||
MAY be chosen for chunking in any given message. | object MAY be chosen for chunking in any given message. | |||
The eligibility of XDR data items to be candidates for being moved | The eligibility of XDR data items to be candidates for being moved | |||
as data chunks (as opposed to being marshaled inline) is not | as data chunks (as opposed to being marshaled inline) is not | |||
specified by the RPC over RDMA protocol. Chunk eligibility | specified by the RPC over RDMA protocol. Chunk eligibility | |||
criteria MUST be determined by each upper layer in order to provide | criteria MUST be determined by each upper layer in order to provide | |||
for an interoperable specification. One such example with | for an interoperable specification. One such example with | |||
rationale, for the NFS protocol family, is provided in [NFSDDP]. | rationale, for the NFS protocol family, is provided in [NFSDDP]. | |||
The interface by which an upper layer implementation communicates | The interface by which an upper layer implementation communicates | |||
the eligibility of a data item locally to RPC for chunking is out | the eligibility of a data item locally to RPC for chunking is out | |||
skipping to change at page 13, line 21 | skipping to change at page 13, line 21 | |||
On the other hand, RPC/RDMA Read chunks carry the XDR position of | On the other hand, RPC/RDMA Read chunks carry the XDR position of | |||
each chunked element and length of the Chunk segment, and can be | each chunked element and length of the Chunk segment, and can be | |||
placed by the receiver exactly where they belong in the receiver's | placed by the receiver exactly where they belong in the receiver's | |||
memory without regard to the alignment of their position in the XDR | memory without regard to the alignment of their position in the XDR | |||
stream. Since any rounded-up data is not actually part of the | stream. Since any rounded-up data is not actually part of the | |||
upper layer's message, the receiver will not reference it, and | upper layer's message, the receiver will not reference it, and | |||
there is no reason to set it to any particular value in the | there is no reason to set it to any particular value in the | |||
receiver's memory. | receiver's memory. | |||
When roundup is present at the end of a sequence of chunks, the | When roundup is present at the end of a sequence of chunks, the | |||
length of the sequence will terminate it at an non-4-byte XDR | length of the sequence will terminate it at a non-4-byte XDR | |||
position. When the receiver proceeds to decode the remaining part | position. When the receiver proceeds to decode the remaining part | |||
of the XDR stream, it inspects the XDR position indicated by the | of the XDR stream, it inspects the XDR position indicated by the | |||
next chunk. Because this position will not match (else roundup | next chunk. Because this position will not match (else roundup | |||
would not have occurred), the receiver decoding will fall back to | would not have occurred), the receiver decoding will fall back to | |||
inspecting the remaining inline portion. If in turn, no data | inspecting the remaining inline portion. If in turn, no data | |||
remains to be decoded from the inline portion, then the receiver | remains to be decoded from the inline portion, then the receiver | |||
MUST conclude that roundup is present, and therefore advances the | MUST conclude that roundup is present, and therefore advances the | |||
XDR decode position to that indicated by the next chunk (if any). | XDR decode position to that indicated by the next chunk (if any). | |||
In this way, roundup is passed without ever actually transferring | In this way, roundup is passed without ever actually transferring | |||
additional XDR bytes. | additional XDR bytes. | |||
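
The receiver's decision at each decode step, as described in the paragraph above, might be sketched as follows. This is a hedged illustration rather than text from either draft: the function name, the action labels, and the treatment of the "no next chunk" case (decoding simply ends) are assumptions made for the example.

```python
from typing import Optional, Tuple


def next_decode_step(xdr_pos: int,
                     inline_remaining: int,
                     next_chunk_pos: Optional[int]) -> Tuple[str, int]:
    """Return (action, position) for the next decode step (hypothetical helper).

    xdr_pos          -- current XDR decode position
    inline_remaining -- bytes still undecoded in the inline portion
    next_chunk_pos   -- XDR position carried by the next chunk, or None
    """
    # 1. The next chunk belongs exactly here: decode from the chunk.
    if next_chunk_pos is not None and next_chunk_pos == xdr_pos:
        return ("chunk", xdr_pos)
    # 2. Position does not match: fall back to the inline portion.
    if inline_remaining > 0:
        return ("inline", xdr_pos)
    # 3. Nothing inline either: roundup must be present, so advance the
    #    decode position to the next chunk without transferring any bytes.
    if next_chunk_pos is not None:
        return ("skip_roundup", next_chunk_pos)
    # 4. Assumption: with no further chunk, decoding is complete.
    return ("end", xdr_pos)
```

For instance, a chunk sequence that ends at position 1030 with the next chunk at 1032 and no inline data left yields ("skip_roundup", 1032): the two roundup bytes are passed over rather than read.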
skipping to change at page 17, line 38 | skipping to change at page 17, line 38 | |||
RDMA on behalf of RPC requests will be placed into appropriately | RDMA on behalf of RPC requests will be placed into appropriately | |||
aligned buffers on the system that receives the transfer. In this | aligned buffers on the system that receives the transfer. In this | |||
way, the need for servers to perform RDMA Read to satisfy all but | way, the need for servers to perform RDMA Read to satisfy all but | |||
the largest client writes is obviated. | the largest client writes is obviated. | |||
The effect of padding is demonstrated below showing prior bytes on | The effect of padding is demonstrated below showing prior bytes on | |||
an XDR stream (XXX) followed by an opaque field consisting of four | an XDR stream (XXX) followed by an opaque field consisting of four | |||
length bytes (LLLL) followed by data bytes (DDDD). The receiver of | length bytes (LLLL) followed by data bytes (DDDD). The receiver of | |||
the RDMA Send has posted two chained receive buffers. Without | the RDMA Send has posted two chained receive buffers. Without | |||
padding, the opaque data is split across the two buffers. With the | padding, the opaque data is split across the two buffers. With the | |||
addition of padding bytes (ppp) prior to the first data byte, the | addition of padding bytes ("ppp" in the figure below) prior to the | |||
data can be forced to align correctly in the second buffer. | first data byte, the data can be forced to align correctly in the | |||
second buffer. | ||||
Buffer 1 Buffer 2 | Buffer 1 Buffer 2 | |||
Unpadded -------------- -------------- | Unpadded -------------- -------------- | |||
XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD | XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD | |||
Padded | Padded | |||
XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD | XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD | |||
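
The figure's arithmetic can be reproduced with a short sketch. It is illustrative only: the helper names are hypothetical, and the 14-byte buffer size is read off the figure rather than taken from the protocol.

```python
from typing import List


def padding_needed(prefix_len: int, buffer_size: int) -> int:
    """Pad bytes required so the data starts at the next buffer boundary."""
    return (-prefix_len) % buffer_size


def split_into_buffers(stream: str, buffer_size: int) -> List[str]:
    """Model how a Send fills chained receive buffers of a fixed size."""
    return [stream[i:i + buffer_size]
            for i in range(0, len(stream), buffer_size)]


BUFFER_SIZE = 14                 # assumption: width of each buffer in the figure
prefix = "X" * 7 + "L" * 4       # prior XDR bytes plus the four length bytes
data = "D" * 14                  # opaque data bytes

unpadded = split_into_buffers(prefix + data, BUFFER_SIZE)
pad = padding_needed(len(prefix), BUFFER_SIZE)
padded = split_into_buffers(prefix + "p" * pad + data, BUFFER_SIZE)
```

With these values pad is 3, the unpadded stream splits the data across both buffers, and the padded stream places all 14 data bytes in the second buffer, matching the figure.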
skipping to change at page 26, line 29 | skipping to change at page 26, line 29 | |||
| RPC Call with rdma_reply | | | RPC Call with rdma_reply | | |||
Send | ------------------------------> | | Send | ------------------------------> | | |||
| | | | | | |||
| Long RPC Reply Msg | | | Long RPC Reply Msg | | |||
| <------------------------------ | Write | | <------------------------------ | Write | |||
| | | | | | |||
| RPC over RDMA Header | | | RPC over RDMA Header | |
| <------------------------------ | Send | | <------------------------------ | Send | |||
The use of RDMA Write to return long replies requires that the | The use of RDMA Write to return long replies requires that the | |||
client application anticipate a long reply and have some knowledge | client applications anticipate a long reply and have some knowledge | |||
of its size so that an adequately sized buffer can be allocated. | of its size so that an adequately sized buffer can be allocated. | |||
This is certainly true of NFS READDIR replies, where the client | This is certainly true of NFS READDIR replies, where the client | |||
already provides an upper bound on the size of the encoded | already provides an upper bound on the size of the encoded | |||
directory fragment to be returned by the server. | directory fragment to be returned by the server. | |||
The use of these "reply chunks" is highly efficient and convenient | The use of these "reply chunks" is highly efficient and convenient | |||
for both RPC client and server. Their use is encouraged for | for both RPC client and server. Their use is encouraged for | |||
eligible RPC operations such as NFS READDIR, which would otherwise | eligible RPC operations such as NFS READDIR, which would otherwise | |||
require extensive chunk management within the results or use of | require extensive chunk management within the results or use of | |||
RDMA Read and a Done message. [NFSDDP] | RDMA Read and a Done message. [NFSDDP] | |||
skipping to change at page 30, line 15 | skipping to change at page 30, line 15 | |||
10. RPC Binding | 10. RPC Binding | |||
RPC services normally register with a portmap or rpcbind [RFC1833] | RPC services normally register with a portmap or rpcbind [RFC1833] | |||
service, which associates an RPC program number with a service | service, which associates an RPC program number with a service | |||
address. (In the case of UDP or TCP, the service address for NFS | address. (In the case of UDP or TCP, the service address for NFS | |||
is normally port 2049.) This policy is no different with RDMA | is normally port 2049.) This policy is no different with RDMA | |||
interconnects, although it may require the allocation of port | interconnects, although it may require the allocation of port | |||
numbers appropriate to each upper layer binding which uses the RPC | numbers appropriate to each upper layer binding which uses the RPC | |||
framing defined here. | framing defined here. | |||
When mapped atop the iWARP [RDDP] transport, which uses IP port | When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses | |||
addressing due to its layering on TCP and/or SCTP, port mapping is | IP port addressing due to its layering on TCP and/or SCTP, port | |||
trivial and consists merely of issuing the port in the connection | mapping is trivial and consists merely of issuing the port in the | |||
process. | connection process. | |||
When mapped atop Infiniband [IB], which uses a GID-based service | When mapped atop Infiniband [IB], which uses a GID-based service | |||
endpoint naming scheme, a translation MUST be employed. One such | endpoint naming scheme, a translation MUST be employed. One such | |||
translation is defined in the Infiniband Port Addressing Annex | translation is defined in the Infiniband Port Addressing Annex | |||
[IBPORT], which is appropriate for translating IP port addressing | [IBPORT], which is appropriate for translating IP port addressing | |||
to the Infiniband network. Therefore, in this case, IP port | to the Infiniband network. Therefore, in this case, IP port | |||
addressing may be readily employed by the upper layer. | addressing may be readily employed by the upper layer. | |||
When a mapping standard or convention exists for IP ports on an | When a mapping standard or convention exists for IP ports on an | |||
RDMA interconnect, there are several possibilities for each upper | RDMA interconnect, there are several possibilities for each upper | |||
skipping to change at page 31, line 8 | skipping to change at page 31, line 8 | |||
Alternatively, the client could simply connect to the mapped | Alternatively, the client could simply connect to the mapped | |||
well-known port for the service itself, if it is appropriately | well-known port for the service itself, if it is appropriately | |||
defined. | defined. | |||
Historically, different RPC protocols have taken different | Historically, different RPC protocols have taken different | |||
approaches to their port assignment; therefore, the specific method | approaches to their port assignment; therefore, the specific method | |||
is left to each RPC/RDMA-enabled upper layer binding, and not | is left to each RPC/RDMA-enabled upper layer binding, and not | |||
addressed here. | addressed here. | |||
This specification defines a new "netid", to be used for | This specification defines a new "netid", to be used for | |||
registration of upper layers atop iWARP [RDDP] and (when a suitable | registration of upper layers atop iWARP [RFC5040, RFC5041] and | |||
port translation service is available) Infiniband [IB] in section | (when a suitable port translation service is available) Infiniband | |||
12, "IANA Considerations." Additional RDMA-capable networks MAY | [IB] in section 12, "IANA Considerations." Additional RDMA-capable | |||
define their own netids, or if they provide a port translation, MAY | networks MAY define their own netids, or if they provide a port | |||
share the one defined here. | translation, MAY share the one defined here. | |||
11. Security | 11. Security Considerations | |||
ONC RPC provides its own security via the RPCSEC_GSS framework | RPC provides its own security via the RPCSEC_GSS framework | |||
[RFC2203]. RPCSEC_GSS can provide message authentication, | [RFC2203]. RPCSEC_GSS can provide message authentication, | |||
integrity checking, and privacy. This security mechanism will be | integrity checking, and privacy. This security mechanism will be | |||
unaffected by the RDMA transport. The data integrity and privacy | unaffected by the RDMA transport. The data integrity and privacy | |||
features alter the body of the message, presenting it as a single | features alter the body of the message, presenting it as a single | |||
chunk. For large messages the chunk may be large enough to qualify | chunk. For large messages the chunk may be large enough to qualify | |||
for RDMA Read transfer. However, there is much data movement | for RDMA Read transfer. However, there is much data movement | |||
associated with computation and verification of integrity, or | associated with computation and verification of integrity, or | |||
encryption/decryption, so certain performance advantages may be | encryption/decryption, so certain performance advantages may be | |||
lost. | lost. | |||
For efficiency, more appropriate security mechanism for RDMA links | For efficiency, a more appropriate security mechanism for RDMA | |||
may be link-level protection, such as certain configurations of | links may be link-level protection, such as certain configurations | |||
IPsec, which may be co-located in the RDMA hardware. The use of | of IPsec, which may be co-located in the RDMA hardware. The use of | |||
link-level protection MAY be negotiated through the use of a new | link-level protection MAY be negotiated through the use of the new | |||
RPCSEC_GSS mechanism like the Credential Cache GSS Mechanism [CCM]. | RPCSEC_GSS mechanism defined in [RPCSECGSSV2] in conjunction with | |||
Use of such mechanisms is RECOMMENDED where end-to-end integrity | the Channel Binding mechanism [RFC5056] and IPsec Channel | |||
and/or privacy is desired, and where efficiency is required. | Connection Latching [BTNSLATCH]. Use of such mechanisms is | |||
REQUIRED where integrity and/or privacy is desired, and where | ||||
efficiency is required. | ||||
There are no new issues here with exposed addresses. The only | An additional consideration is the protection of the integrity and | |||
exposed addresses here are in the chunk list and in the transport | privacy of local memory by the RDMA transport itself. The use of | |||
packets transferred via RDMA. The data contained in these | RDMA by RPC MUST NOT introduce any vulnerabilities to system memory | |||
addresses continues to be protected by RPCSEC_GSS integrity and | contents, or to memory owned by user processes. These protections | |||
privacy. | are provided by the RDMA layer specifications, and specifically | |||
their security models. It is REQUIRED that any RDMA provider used | ||||
for RPC transport be conformant to the requirements of [RFC5042] in | ||||
order to satisfy these protections. | ||||
Once delivered securely by the RDMA provider, any RDMA-exposed | ||||
addresses will contain only RPC payloads in the chunk lists, | ||||
transferred under the protection of RPCSEC_GSS integrity and | ||||
privacy. By these means, the data will be protected end-to-end, as | ||||
required by the RPC layer security model. | ||||
Where results are supplied to the requester via Read chunks, a | ||||
server resource deficit can arise if the client does not promptly | ||||
acknowledge their status via the RDMA_DONE message. This can | ||||
potentially lead to a denial of service situation, with a single | ||||
client unfairly (and unnecessarily) consuming server RDMA | ||||
resources. Servers MUST protect against this situation, | ||||
originating from one or many clients. For example, a time-based | ||||
window of buffer availability may be offered; if the client fails | ||||
to obtain the data within the window, it will simply retry using | ||||
ordinary RPC retry semantics. Or, a more severe method would be | ||||
for the server to simply close the client's RDMA connection, | ||||
freeing the RDMA resources and allowing the server to reclaim them. | ||||
A fairer and more useful method is provided by the protocol itself. | ||||
The server MAY use the rdma_credit value to limit the number of | ||||
outstanding requests for each client. By including the number of | ||||
outstanding RDMA_DONE completions in the computation of available | ||||
client credits, the server can limit its exposure to each client, | ||||
and therefore provide uninterrupted service as its resources | ||||
permit. | ||||
However, the server must ensure that it does not decrease the | ||||
credit count to zero with this method, since the RDMA_DONE message | ||||
is not acknowledged. If the credit count were to drop to zero | ||||
solely due to outstanding RDMA_DONE messages, the client would | ||||
deadlock since it would never obtain a new credit with which to | ||||
continue. Therefore, if the server adjusts credits to zero for | ||||
outstanding RDMA_DONE, it MUST withhold its reply to at least one | ||||
message in order to provide the next credit. The time-based window | ||||
(or any other appropriate method) SHOULD be used by the server to | ||||
recover resources in the event that the client never returns. | ||||
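The credit accounting described in the -07 text above can be sketched as follows. This is a minimal illustration, not part of either draft: the class name, field names and the per-client ceiling are hypothetical, and a real server would combine this with the time-based window mentioned above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClientCredits:
    """Hypothetical per-client accounting; all names are illustrative only."""
    max_credits: int            # assumed ceiling the server will grant
    outstanding_done: int = 0   # Read-chunk replies still awaiting RDMA_DONE


def next_credit_grant(state: ClientCredits) -> Tuple[int, bool]:
    """Return (rdma_credit value to advertise, must_withhold_a_reply)."""
    # Resources held for un-acknowledged Read chunks count against the client.
    credits = max(state.max_credits - state.outstanding_done, 0)
    # RDMA_DONE is not acknowledged, so at zero credits the server has to
    # keep at least one reply pending; that reply later carries the next credit.
    return credits, credits == 0
```

With a ceiling of 8 and 3 unacknowledged Read-chunk replies, the sketch advertises 5 credits. Once the outstanding count reaches the ceiling, it advertises zero and signals that one reply must be held back so the client can still obtain a credit.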
The "Connection Configuration Protocol", when used, MUST be | ||||
protected by an appropriate RPC security flavor, to ensure it is | ||||
not attacked in the process of initiating an RPC/RDMA connection. | ||||
12. IANA Considerations | 12. IANA Considerations | |||
The new RPC transport is to be assigned a new RPC "netid", which is | The new RPC transport is to be assigned a new RPC "netid", which is | |||
an rpcbind [RFC1833] string used to describe the underlying | an rpcbind [RFC1833] string used to describe the underlying | |||
protocol in order for RPC to select the appropriate transport | protocol in order for RPC to select the appropriate transport | |||
framing, as well as the format of the service ports. | framing, as well as the format of the service ports. | |||
The following "nc_proto" registry string is hereby defined for this | The following "nc_proto" registry string is hereby defined for this | |||
purpose: | purpose: | |||
NC_RDMA "rdma" | NC_RDMA "rdma" | |||
The mechanism of adding this value to the RPC netid registry is | ||||
outside the scope of this document and is an IANA consideration. | ||||
This netid MAY be used for any RDMA network satisfying the | This netid MAY be used for any RDMA network satisfying the | |||
requirements of section 2, and able to identify service endpoints | requirements of section 2, and able to identify service endpoints | |||
using IP port addressing, possibly through use of a translation | using IP port addressing, possibly through use of a translation | |||
service as described above in section 10, RPC Binding. | service as described above in section 10, RPC Binding. | |||
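As an illustration of identifying a service endpoint by IP port addressing under the new netid, the sketch below builds an rpcbind-style registration pair. It assumes that the "rdma" netid reuses the IPv4 universal-address form ("h1.h2.h3.h4.p1.p2") that rpcbind uses for the "tcp" netid; neither draft revision shown here states this, and the function name, the example address and the use of port 2049 over RDMA are hypothetical.

```python
NETID_RDMA = "rdma"  # the nc_proto string defined in this section


def to_universal_address(ipv4: str, port: int) -> str:
    """Format an IPv4 address and port as an rpcbind universal address.

    Assumption: the "rdma" netid uses the same h1.h2.h3.h4.p1.p2 layout
    as the "tcp" netid, with the port split into high and low octets.
    """
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return f"{ipv4}.{port >> 8}.{port & 0xFF}"


# Hypothetical registration: documentation address, illustrative port.
registration = (NETID_RDMA, to_universal_address("192.0.2.1", 2049))
```

An upper layer binding that registers a different port would change only the port argument; the netid string stays the same for any RDMA network that offers IP port addressing or a translation to it.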
As a new RPC transport, this protocol has no effect on RPC program | As a new RPC transport, this protocol has no effect on RPC program | |||
numbers or existing registered port numbers. However, new port | numbers or existing registered port numbers. However, new port | |||
numbers MAY be registered for use by RPC/RDMA-enabled services, as | numbers MAY be registered for use by RPC/RDMA-enabled services, as | |||
appropriate to the new networks over which the services will | appropriate to the new networks over which the services will | |||
operate. | operate. | |||
The OPTIONAL Connection Configuration protocol described herein | The OPTIONAL Connection Configuration protocol described herein | |||
requires an RPC program number assignment. The value "100400" is | requires an RPC program number assignment. The value "100400" is | |||
hereby assigned: | hereby assigned: | |||
rdmaconfig 100400 rpc.rdmaconfig | rdmaconfig 100400 rpc.rdmaconfig | |||
Currently, these numbers are not assigned by IANA, they are merely | Currently, neither the nc_proto netids nor the RPC program numbers | |||
republished [IANA-RPC]. The mechanism of this republishing is | are assigned by IANA. The list in [RFC1833] has served as the | |||
outside the scope of this document and is an IANA consideration. | netid registry, and the republication declared in [IANA-RPC] has | |||
served as the program number registry. Ideally, IANA will create | ||||
explicit registries for these objects. However, in the absence of | ||||
new registries, this document would serve as the repository for the | ||||
RPC program number assignment, and the protocol netid. | ||||
13. Acknowledgements | 13. Acknowledgements | |||
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, | The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, | |||
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve | Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve | |||
Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David | Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David | |||
Robinson and Mallikarjun Chadalapaka for their contributions to | Robinson and Mallikarjun Chadalapaka for their contributions to | |||
this document. | this document. | |||
14. Normative References | 14. Normative References | |||
skipping to change at page 33, line 27 | skipping to change at page 34, line 27 | |||
[RFC3530] | [RFC3530] | |||
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, | S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, | |||
M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards | M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards | |||
Track RFC, http://www.ietf.org/rfc/rfc3530.txt | Track RFC, http://www.ietf.org/rfc/rfc3530.txt | |||
[RFC2203] | [RFC2203] | |||
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol | M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol | |||
Specification", Standards Track RFC, | Specification", Standards Track RFC, | |||
http://www.ietf.org/rfc/rfc2203.txt | http://www.ietf.org/rfc/rfc2203.txt | |||
15. Informative References | [RPCSECGSSV2] | |||
M. Eisler, "RPCSEC_GSS Version 2", Internet Draft Work in | ||||
Progress draft-ietf-nfsv4-rpcsec-gss-v2 | ||||
[RDMAP] | [RFC5056] | |||
R. Recio et al., "A Remote Direct Memory Access Protocol | N. Williams, "On the Use of Channel Bindings to Secure | |||
Specification", Standards Track RFC, draft-ietf-rddp-rdmap | Channels", Standards Track RFC | |||
[CCM] | [BTNSLATCH] | |||
M. Eisler, N. Williams, "CCM: The Credential Cache GSS | N. Williams, "IPsec Channels: Connection Latching", Internet | |||
Mechanism", Internet Draft Work in Progress, draft-ietf- | Draft Work in Progress draft-ietf-btns-connection-latching | |||
nfsv4-ccm | ||||
[RFC5042] | ||||
J. Pinkerton, E. Deleganes, "Direct Data Placement Protocol | ||||
(DDP) / Remote Direct Memory Access Protocol (RDMAP) Security" | ||||
Standards Track RFC | ||||
15. Informative References | ||||
[NFSDDP] | [NFSDDP] | |||
B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet | B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet | |||
Draft Work in Progress, draft-ietf-nfsv4-nfsdirect | Draft Work in Progress, draft-ietf-nfsv4-nfsdirect | |||
[RFC5040] | ||||
R. Recio et al., "A Remote Direct Memory Access Protocol | ||||
Specification", Standards Track RFC | ||||
[RDDP] | [RFC5041] | |||
H. Shah et al., "Direct Data Placement over Reliable | H. Shah et al., "Direct Data Placement over Reliable | |||
Transports", Standards Track RFC, draft-ietf-rddp-ddp | Transports", Standards Track RFC | |||
[NFSRDMAPS] | [NFSRDMAPS] | |||
T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet | T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet | |||
Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- | Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- | |||
statement | statement | |||
[NFSv4.1] | [NFSv4.1] | |||
S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft | S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft | |||
Work in Progress, draft-ietf-nfsv4-minorversion1 | Work in Progress, draft-ietf-nfsv4-minorversion1 | |||
[IB] | [IB] | |||
Infiniband Architecture Specification, available from | Infiniband Architecture Specification, available from | |||
http://www.infinibandta.org | http://www.infinibandta.org | |||
[IBPORT] | [IBPORT] | |||
Infiniband Trade Association, "IP Addressing Annex", available | Infiniband Trade Association, "IP Addressing Annex", available | |||
skipping to change at page 34, line 24 | skipping to change at page 35, line 37 | |||
from http://www.infinibandta.org | from http://www.infinibandta.org | |||
[IANA-RPC] | [IANA-RPC] | |||
IANA Sun RPC number statement, | IANA Sun RPC number statement, | |||
http://www.iana.org/assignments/sun-rpc-numbers | http://www.iana.org/assignments/sun-rpc-numbers | |||
16. Authors' Addresses | 16. Authors' Addresses | |||
Tom Talpey | Tom Talpey | |||
Network Appliance, Inc. | Network Appliance, Inc. | |||
375 Totten Pond Road | 1601 Trapelo Road, #16 | |||
Waltham, MA 02451 USA | Waltham, MA 02451 USA | |||
Phone: +1 781 768 5329 | Phone: +1 781 768 5329 | |||
EMail: thomas.talpey@netapp.com | EMail: thomas.talpey@netapp.com | |||
Brent Callaghan | Brent Callaghan | |||
Apple Computer, Inc. | Apple Computer, Inc. | |||
MS: 302-4K | MS: 302-4K | |||
2 Infinite Loop | 2 Infinite Loop | |||
Cupertino, CA 95014 USA | Cupertino, CA 95014 USA | |||
EMail: brentc@apple.com | EMail: brentc@apple.com | |||
17. Intellectual Property and Copyright Statements | 17. Intellectual Property and Copyright Statements | |||
skipping to change at page 34, line 42 | skipping to change at page 36, line 16 | |||
Full Copyright Statement | Full Copyright Statement | |||
Copyright (C) The IETF Trust (2007). | Copyright (C) The IETF Trust (2008). | |||
This document is subject to the rights, licenses and restrictions | This document is subject to the rights, licenses and restrictions | |||
contained in BCP 78, and except as set forth therein, the authors | contained in BCP 78, and except as set forth therein, the authors | |||
retain all their rights. | retain all their rights. | |||
This document and the information contained herein are provided on | This document and the information contained herein are provided on | |||
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE | an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE | |||
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE | REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE | |||
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL | IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL | |||
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY | WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY | |||
End of changes. 36 change blocks. | ||||
101 lines changed or deleted | 163 lines changed or added | |||