--- 1/draft-ietf-nfsv4-rpcrdma-00.txt 2006-02-05 00:48:57.000000000 +0100 +++ 2/draft-ietf-nfsv4-rpcrdma-01.txt 2006-02-05 00:48:57.000000000 +0100 @@ -1,397 +1,470 @@ Internet-Draft Brent Callaghan -Expires: January 2005 Sun Microsystems, Inc. - Tom Talpey - Network Appliance, Inc. +Expires: August 2005 Tom Talpey -Document: draft-ietf-nfsv4-rpcrdma-00.txt July, 2004 +Document: draft-ietf-nfsv4-rpcrdma-01 February, 2005 RDMA Transport for ONC RPC Status of this Memo By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. - Internet-Drafts are draft documents valid for a maximum of six months - and may be updated, replaced, or obsoleted by other documents at any - time. It is inappropriate to use Internet-Drafts as reference - material or to cite them other than as "work in progress." + Internet-Drafts are draft documents valid for a maximum of six + months and may be updated, replaced, or obsoleted by other + documents at any time. It is inappropriate to use Internet-Drafts + as reference material or to cite them other than as "work in + progress." The list of current Internet-Drafts can be accessed at - http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet- - Draft Shadow Directories can be accessed at + http://www.ietf.org/ietf/1id-abstracts.txt The list of + Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice - Copyright (C) The Internet Society (2004). All Rights Reserved. + Copyright (C) The Internet Society (2005). All Rights Reserved. Abstract A protocol is described providing RDMA as a new transport for ONC RPC. The RDMA transport binding conveys the benefits of efficient, bulk data transport over high speed networks, while providing for minimal change to RPC applications and with no required revision of the application RPC protocol, or the RPC protocol itself. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Abstract RDMA Model . . . . . . . . . . . . . . . . . . . . 3 - 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 5 - 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 - 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 6 - 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 - 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 + 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 + 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 4 + 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 + 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 5 + 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 6 3.5. Padding . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.6. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10 3.7. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 - 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 11 + 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 12 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 14 - 4.1. RPC RDMA Transport Header . . . . . . . . . . . . . . . 14 + 4.1. 
RPC over RDMA Header . . . . . . . . . . . . . . . . . . 14 4.2. XDR Language Description . . . . . . . . . . . . . . . . 16 5. Large Chunkless Messages . . . . . . . . . . . . . . . . . 18 - 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 19 - 5.2. RDMA Write of Long Replies . . . . . . . . . . . . . . . 20 - 5.3. RPC RDMA header errors . . . . . . . . . . . . . . . . . 21 - 6. Connection Configuration Protocol . . . . . . . . . . . . 22 + 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 18 + 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 20 + 5.3. RPC over RDMA header errors . . . . . . . . . . . . . . 21 + 6. Connection Configuration Protocol . . . . . . . . . . . . 21 6.1. Initial Connection State . . . . . . . . . . . . . . . . 22 - 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 23 + 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 22 7. Memory Registration Overhead . . . . . . . . . . . . . . . 24 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 24 - 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 25 + 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 24 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 25 11. Security . . . . . . . . . . . . . . . . . . . . . . . . 25 - 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 26 + 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 25 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 26 14. Normative References . . . . . . . . . . . . . . . . . . 26 - 15. Informative References . . . . . . . . . . . . . . . . . 27 - 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 28 - 17. Full Copyright Statement . . . . . . . . . . . . . . . . 28 - Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 29 + 15. Informative References . . . . . . . . . . . . . . . . . 26 + 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 27 + 17. Full Copyright Statement . . . . . . . . . . . . . . . . 27 + Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 28 1. Introduction - RDMA is a technique for efficient movement of data over high speed - transports. It facilitates data movement via direct memory access by - hardware, yielding faster transfers of data over a network while - reducing host CPU overhead. + RDMA is a technique for efficient movement of data between end + nodes, which becomes increasingly compelling over high speed + transports. By directing data into destination buffers as it sent + on a network, and placing it via direct memory access by hardware, + the double benefit of faster transfers and reduced host overhead is + obtained. ONC RPC [RFC1831] is a remote procedure call protocol that has been - run over a variety of transports. Most implementations today use UDP - or TCP. RPC messages are defined in terms of an eXternal Data - Representation (XDR) [RFC1832] which provides a canonical data + run over a variety of transports. Most RPC implementations today + use UDP or TCP. RPC messages are defined in terms of an eXternal + Data Representation (XDR) [RFC1832] which provides a canonical data representation across a variety of host architectures. An XDR data stream is conveyed differently on each type of transport. On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte - stream, RPC messages are delineated by a record marking protocol. 
An - RDMA transport also conveys RPC messages in a unique fashion that - must be fully described if client and server implementations are to - interoperate. + stream, RPC messages are delineated by a record marking protocol. + An RDMA transport also conveys RPC messages in a unique fashion + that must be fully described if client and server implementations + are to interoperate. - RDMA transports present new semantics unlike the behaviors of either - UDP and TCP. They retain message delineations like UDP while also - providing a reliable, sequenced data transfer like TCP. All provide - the new efficient, bulk transfer service of RDMA. RDMA transports - are therefore naturally viewed as a new transport type by ONC RPC. + RDMA transports present new semantics unlike the behaviors of + either UDP and TCP alone. They retain message delineations like + UDP while also providing a reliable, sequenced data transfer like + TCP. And, they provide the new efficient, bulk transfer service of + RDMA. RDMA transports are therefore naturally viewed as a new + transport type by ONC RPC. RDMA as a transport will benefit the performance of RPC protocols that move large "chunks" of data, since RDMA hardware excels at - moving data efficiently between host memory and a high speed network - with little or no host CPU involvement. In this context, the NFS - protocol, in all its versions, is an obvious beneficiary of RDMA. - Many other RPC-based protocols will also benefit. + moving data efficiently between host memory and a high speed + network with little or no host CPU involvement. In this context, + the NFS protocol, in all its versions, is an obvious beneficiary of + RDMA. A complete problem statement is discussed in [NFSRDMAPS], + and related NFSv4 issues are discussed in [NFSSESS]. Many other + RPC-based protocols will also benefit. Although the RDMA transport described here provides relatively transparent support for any RPC application, the proposal goes further in describing mechanisms that can optimize the use of RDMA with more active participation by the RPC application. 2. Abstract RDMA Model An RPC transport is responsible for conveying an RPC message from a sender to a receiver. An RPC message is either an RPC call from a client to a server, or an RPC reply from the server back to the client. An RPC message contains an RPC call header followed by arguments if the message is an RPC call, or an RPC reply header - followed by results if the message is an RPC reply. The call header - contains a transaction ID (XID) followed by the program and procedure - number as well as a security credential. An RPC reply header begins - with an XID that matches that of the RPC call message, followed by a - security verifier and results. All data in an RPC message is XDR - encoded. For a complete description of the RPC protocol and XDR - encoding, see [RFC1831] and [RFC1832]. + followed by results if the message is an RPC reply. The call + header contains a transaction ID (XID) followed by the program and + procedure number as well as a security credential. An RPC reply + header begins with an XID that matches that of the RPC call + message, followed by a security verifier and results. All data in + an RPC message is XDR encoded. For a complete description of the + RPC protocol and XDR encoding, see [RFC1831] and [RFC1832]. This protocol assumes an abstract model for RDMA transports. The following terms, common in the RDMA lexicon, are used in this document. 
A more complete glossary of RDMA terms can be found in [RDMA]. o Registered Memory - - All data moved via RDMA must be resident in registered - memory at its source and destination. Each segment of - registered memory must be identified with a Steering Tag - (STag) of no more than 32 bits and memory addresses of up - to 64 bits in length. + All data moved via tagged RDMA operations must be resident in + registered memory at its destination. This protocol assumes + that each segment of registered memory is identified with a + steering tag of no more than 32 bits and memory addresses of + up to 64 bits in length. o RDMA Send - The RDMA provider supports an RDMA Send operation with - completion signalled at the receiver when data is placed - in a pre-posted buffer. The amount of transferred data - is limited only by the size of the receiver's buffer. - Sends complete at the receiver in the order they were - issued at the sender. + completion signalled at the receiver when data is placed in a + pre-posted buffer. The amount of transferred data is limited + only by the size of the receiver's buffer. Sends complete at + the receiver in the order they were issued at the sender. o RDMA Write - - The RDMA provider supports an RDMA Write operation to - directly place data in the receiver's buffer. An RDMA - Write is initiated by the sender and completion is - signalled at the sender. No completion is signalled at - the receiver. The sender uses a Steering Tag (STag), - memory address and length of the remote destination - buffer. A subsequent completion, provided by RDMA Send, - must be obtained at the receiver to guarantee that RDMA - Write data has been successfully placed in the receiver's - memory. + The RDMA provider supports an RDMA Write operation to directly + place data in the receiver's buffer. An RDMA Write is + initiated by the sender and completion is signalled at the + sender. No completion is signalled at the receiver. The + sender uses a steering tag, memory address and length of the + remote destination buffer. RDMA Writes are not necessarily + ordered with respect to one another, but are ordered with + respect to RDMA Sends; a subsequent RDMA Send completion must + be obtained at the receiver to notify that prior RDMA Write + data has been successfully placed in the receiver's memory. o RDMA Read + The RDMA provider supports an RDMA Read operation to directly + place peer source data in the requester's buffer. An RDMA + Read is initiated by the receiver and completion is signalled + at the receiver. The receiver provides steering tags, memory + addresses and a length for the remote source and local + destination buffers. Since the peer at the data source + receives no notification of RDMA Read completion, there is an + assumption that on receiving the data the receiver will signal + completion with an RDMA Send message, so that the peer can + free the source buffers and the associated steering tags. - The RDMA provider supports an RDMA Read operation to - directly place peer source data in the requester's buffer. - An RDMA Read is initiated by the receiver and completion is - signalled at the receiver. The receiver provides - Steering Tags, memory addresses and a length for the - remote source and local destination buffers. - Since the peer at the data source receives no notification - of RDMA Read completion, there is an assumption that on - receiving the data the receiver will signal completion - with an RDMA Send message, so that the peer can free the - source buffers. 
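+
+   The abstract model above maps onto a small set of data movement
+   primitives.  The following C sketch is purely illustrative; the
+   type and function names are invented for this example and are not
+   defined by this protocol.  It summarizes only the information each
+   operation requires and the completion behavior described above.
+
+      #include <stddef.h>
+      #include <stdint.h>
+
+      /* A segment of registered memory: a steering tag of at most
+       * 32 bits, an address or offset of up to 64 bits, and a
+       * length.  This mirrors xdr_rdma_segment in Section 4.2. */
+      struct rdma_segment {
+          uint32_t handle;   /* steering tag */
+          uint32_t length;   /* length of the segment in bytes */
+          uint64_t offset;   /* memory address or offset */
+      };
+
+      /* RDMA Send: moves "len" bytes into a buffer the peer has
+       * pre-posted; completion is signalled at the receiver, and
+       * Sends complete there in the order they were issued. */
+      int rdma_send(void *conn, const void *buf, size_t len);
+
+      /* RDMA Write: places local data directly into the peer's
+       * registered memory named by "remote"; no completion is
+       * signalled at the peer.  A later Send must notify the peer
+       * that the data has been placed. */
+      int rdma_write(void *conn, const struct rdma_segment *remote,
+                     const void *local, size_t len);
+
+      /* RDMA Read: pulls data from the peer's registered memory
+       * named by "remote" into the local buffer; completion is
+       * signalled at the initiator, which later Sends a message so
+       * that the peer can free the source buffer. */
+      int rdma_read(void *conn, void *local,
+                    const struct rdma_segment *remote);
+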
- - In its abstract form, this protocol is not an interoperable stan- - dard. It becomes a useful, implementable standard only when mapped - onto a specific RDMA transport, like iWARP [RDDP] or Infiniband - [IB]. + This protocol is designed to function with equivalent semantics + over all appropriate RDMA transports. In its abstract form, this + protocol does not implement RDMA directly. Instead, it conveys to + the RPC peer, information sufficient to direct an RDMA + implementation to perform transfers containing RPC data, and to + communicate their result(s). It therefore becomes a useful, + implementable standard when mapped onto a specific RDMA transport, + such as iWARP [RDDP] or Infiniband [IB]. 3. Protocol Outline - An RPC message can be conveyed in identical fashion, whether it is a - CALL or REPLY message. In each case, the transmission of the message - proper is preceded by transmission of a transport header for use by - RPC over RDMA transports. This header is analogous to the record - marking used for RPC over TCP, but is more extensive, since RDMA - transports support several modes of data transfer and it is important - to allow the client and server to use the most efficient mode for any - given transfer. Multiple segments of a message may be transferred in - different ways to different remote memory destinations. + An RPC message can be conveyed in identical fashion, whether it is + a call or reply message. In each case, the transmission of the + message proper is preceded by transmission of a transport-specific + header for use by RPC over RDMA transports. This header is + analogous to the record marking used for RPC over TCP, but is more + extensive, since RDMA transports support several modes of data + transfer and it is important to allow the client and server to use + the most efficient mode for any given transfer. Multiple segments + of a message may be transferred in different ways to different + remote memory destinations. - All transfers of a CALL or REPLY begin with an RDMA send which - transfers at least the transport header, usually with the CALL or - REPLY message appended, or at least some part thereof. Because the - size of what may be transmitted via RDMA send is limited by the size - of the receiver's pre-posted buffer, the RPC over RDMA transport - provides a number of methods to reduce the amount transferred by - means of the RDMA send, when necessary, by transferring various parts - of the message using RDMA read and RDMA write. + All transfers of a call or reply begin with an RDMA Send which + transfers at least the RPC over RDMA header, usually with the call + or reply message appended, or at least some part thereof. Because + the size of what may be transmitted via RDMA Send is limited by the + size of the receiver's pre-posted buffer, the RPC over RDMA + transport provides a number of methods to reduce the amount + transferred by means of the RDMA Send, when necessary, by + transferring various parts of the message using RDMA Read and RDMA + Write. 3.1. Short Messages Many RPC messages are quite short. For example, the NFS version 3 GETATTR request, is only 56 bytes: 20 bytes of RPC header plus a 32 byte filehandle argument and 4 bytes of length. The reply to this common request is about 100 bytes. - There is no benefit in transferring such small messages with an RDMA - Read or Write operation. The overhead in transferring STags and - memory addresses is justified only by large transfers. 
The critical - message size that justifies RDMA transfer will vary depending on the - RDMA implementation and network, but is typically of the order of a - few kilobytes. It is appropriate to transfer a short message with an - RDMA Send to a pre-posted buffer. The transport header with the - short message (CALL or REPLY) immediately following is transferred - using a single RDMA send operation. + There is no benefit in transferring such small messages with an + RDMA Read or Write operation. The overhead in transferring + steering tags and memory addresses is justified only by large + transfers. The critical message size that justifies RDMA transfer + will vary depending on the RDMA implementation and network, but is + typically of the order of a few kilobytes. It is appropriate to + transfer a short message with an RDMA Send to a pre-posted buffer. + The RPC over RDMA header with the short message (call or reply) + immediately following is transferred using a single RDMA Send + operation. Short RPC messages over an RDMA transport will look like this: Client Server | RPC Call | Send | ------------------------------> | | | | RPC Reply | | <------------------------------ | Send 3.2. Data Chunks - Some protocols, like NFS, have RPC procedures that can transfer very - large "chunks" of data in the RPC call or reply and would cause the - maximum send size to be exceeded if one tried to transfer them as - part of the RDMA send. These large chunks typically range from a - kilobyte to a megabyte or more. An RDMA transport can transfer large - chunks of data more efficiently via the direct placement of an RDMA - Read or RDMA Write operation. Using direct placement instead of in- - line transfer not only avoids expensive data copies, but provides - correct data alignment at the destination. + Some protocols, like NFS, have RPC procedures that can transfer + very large "chunks" of data in the RPC call or reply and would + cause the maximum send size to be exceeded if one tried to transfer + them as part of the RDMA Send. These large chunks typically range + from a kilobyte to a megabyte or more. An RDMA transport can + transfer large chunks of data more efficiently via the direct + placement of an RDMA Read or RDMA Write operation. Using direct + placement instead of in-line transfer not only avoids expensive + data copies, but provides correct data alignment at the + destination. 3.3. Flow Control - It is critical to provide flow control for an RDMA connection. RDMA - receive operations will fail if a pre-posted receive buffer is not - available to accept an incoming RDMA Send. Such errors are fatal to - the connection. This is a departure from conventional TCP/IP + It is critical to provide RDMA Send flow control for an RDMA + connection. RDMA receive operations will fail if a pre-posted + receive buffer is not available to accept an incoming RDMA Send, + and repeated occurrences of such errors can be fatal to the + connection. This is a departure from conventional TCP/IP networking where buffers are allocated dynamically on an as-needed basis, and pre-posting is not required. It is not practical to provide for fixed credit limits at the RPC server. Fixed limits scale poorly, since posted buffers are dedicated to the associated connection until consumed by receive operations. Additionally for protocol correctness, the server must - be able to reply whether or not a new buffer can be posted to accept - future receives. 
+ always be able to reply to client requests, whether or not new + buffers have been posted to accept future receives. - Flow control is implemented as a simple request/grant protocol in the - transport header associated with each RPC message. The transport - header for RPC CALL messages contains a requested credit value for - the server, which may be dynamically adjusted by the caller to match - its expected needs. The transport header for the RPC REPLY messages - provide the granted result, which may have any value except it may - not be zero when no in-progress operations are present at the server, - since such a value would result in deadlock. The value may be - adjusted up or down at each opportunity to match the server's needs - or policies. + Flow control for RDMA Send operations is implemented as a simple + request/grant protocol in the RPC over RDMA header associated with + each RPC message. The RPC over RDMA header for RPC call messages + contains a requested credit value for the server, which may be + dynamically adjusted by the caller to match its expected needs. + The RPC over RDMA header for the RPC reply messages provides the + granted result, which may have any value except that it may not be zero + when no in-progress operations are present at the server, since + such a value would result in deadlock. The value may be adjusted + up or down at each opportunity to match the server's needs or + policies. - While RPC CALLs may complete in any order, the current flow control + While RPC calls may complete in any order, the current flow control limit at the RPC server is known to the RPC client from the Send ordering properties. It is always the most recent server granted credits minus the number of requests in flight. + Certain RDMA implementations may impose additional flow control + restrictions, such as limits on RDMA Read operations in progress at + the responder. Because these operations are outside the scope of + this protocol, they are not addressed and must be provided for by + other layers. For example, a simple upper layer RPC consumer might + perform single-issue RDMA Read requests, while a more + sophisticated, multithreaded RPC consumer may implement its own + FIFO queue of such operations. + 3.4. XDR Encoding with Chunks The data comprising an RPC call or reply message is marshaled or serialized into a contiguous stream by an XDR routine. XDR data - types such as integers, strings, arrays and linked lists are commonly - implemented over two very simple functions that encode either an XDR - data unit (32 bits) or an array of bytes. + types such as integers, strings, arrays and linked lists are + commonly implemented over two very simple functions that encode + either an XDR data unit (32 bits) or an array of bytes. - Normally, the separate data items in an XDR call or reply are encoded - as a contiguous sequence of bytes for network transmission over UDP - or TCP. However, in the case of an RDMA transport, local routines - such as XDR encode can determine that an opaque byte array is large - enough to be more efficiently moved via an RDMA data transfer - operation like RDMA Read or RDMA Write. + Normally, the separate data items in an RPC call or reply are + encoded as a contiguous sequence of bytes for network transmission + over UDP or TCP.
However, in the case of an RDMA transport, local + routines such as XDR encode can determine that (for instance) an + opaque byte array is large enough to be more efficiently moved via + an RDMA data transfer operation like RDMA Read or RDMA Write. - When sending any message (request or reply) that contains a candidate - large data chunk, the XDR encoding routine avoids moving the data - into the XDR stream. Instead, it does not encode the data portion, - but records the address and size of each chunk in a separate "read - chunk list" encoded within RPC RDMA transport-specific headers. Such - chunks will be transferred via RDMA Read operations initiated by the - receiver. + Semantically speaking, the protocol has no restriction regarding + data types which may or may not be chunked. In practice however, + efficiency considerations lead to the conclusion that certain data + types are not generally "chunkable". Typically, only opaque and + aggregate data types which may attain substantial size are + considered to be eligible. With today's hardware this size may be + a kilobyte or more. However any object may be chosen for chunking + in any given message. - Since the chunks are to be moved via RDMA, the memory for each chunk - must be registered. This registration may take place within XDR - itself, providing for full transparency to upper layers, or it may be - performed by any other specific local implementation. + The eligibility of XDR data items to be candidates for being moved + as data chunks (as opposed to being marshalled inline) is not + specified by the RPC over RDMA protocol. Chunk eligibility + criteria must be determined by each upper layer in order to provide + for an interoperable specification. One such example with + rationale, for the NFS protocol family, is provided in [NFSDDP]. + + The interface by which an upper layer implementation communicates + the eligibility of a data item locally to RPC for chunking is out + of scope for this specification. In many implementations, it is + possible to implement a transparent RPC chunking facility. + However, such implementations may lead to inefficiencies, either + because they require the RPC layer to perform expensive + registration and deregistration of memory "on the fly", or they may + require using RDMA chunks in reply messages, along with the + resulting additional handshaking with the RPC over RDMA peer. + However, these issues are purely local and implementations are free + to innovate. + + When sending any message (request or reply) that contains an + eligible large data chunk, the XDR encoding routine avoids moving + the data into the XDR stream. Instead, it does not encode the data + portion, but records the address and size of each chunk in a + separate "read chunk list" encoded within RPC RDMA transport- + specific headers. Such chunks will be transferred via RDMA Read + operations initiated by the receiver. + + When the read chunks are to be moved via RDMA, the memory for each + chunk must be registered. This registration may take place within + XDR itself, providing for full transparency to upper layers, or it + may be performed by any other specific local implementation. Additionally, when making an RPC call that can result in bulk data - transferred in the reply, it is desirable to provide chunks to accept - the data directly via RDMA Write. These chunks will therefore be - pre-filled by the server prior to responding, and XDR decode at the - client will not be required. 
These "write chunk lists" undergo a - similar registration and advertisement to chunks built as a part of - XDR encoding. Just as with an encoded read chunk list, the memory - referenced in an encoded write chunk list must be pre-registered. If - the client chooses not to make a write chunk list available, then the - server must return data inline in the reply, or via a read chunk - list. + transferred in the reply, it is desirable to provide chunks to + accept the data directly via RDMA Write. These write chunks will + therefore be pre-filled by the server prior to responding, and XDR + decode at the client will not be required. These chunks undergo a + similar registration and advertisement via "write chunk lists" + built as a part of XDR encoding. + + Some RPC client implementations are not able to determine where an + RPC call's results reside during the "encode" phase. This makes it + difficult or impossible for the RPC client layer to encode the + write chunk list at the time of building the request. In this + case, it is difficult for the RPC implementation to provide + transparency to the RPC consumer, which may require recoding to + provide result information at this earlier stage. + + Therefore if the RPC client does not make a write chunk list + available to receive the result, then the RPC server must return + data inline in the reply, or if it so chooses, via a read chunk + list. RPC clients are discouraged from omitting write chunk lists + for eligible replies, due to the lower performance of the + additional handshaking to perform data transfer, and the + requirement that the server must expose (and preserve) the reply + data for a period of time. In the absence of a server-provided + read chunk list in the reply, if the encoded reply overflows the + inline buffer, the RPC will fail. When any data within a message is provided via either read or write chunks, the chunk itself refers only to the data portion of the XDR stream element. In particular, for counted fields (e.g. a "<>" encoding) the byte count which is encoded as part of the field remains in the XDR stream, as well as being encoded in the chunk list. Only the data portion is elided. This is important to maintain upper layer implementation compatibility - both the count and the data must be transferred as part of the XDR stream. In - addition, any byte count in the XDR stream must match the sum of the - byte counts present in the corresponding read or write chunk list. - If they do not agree, an RPC protocol encoding error results. + addition, any byte count in the XDR stream must match the sum of + the byte counts present in the corresponding read or write chunk + list. If they do not agree, an RPC protocol encoding error + results. The following items are contained in a chunk list entry. - STag - Steering tag or handle obtained when the chunk - memory is registered for RDMA. + Handle + Steering tag or handle obtained when the chunk memory is + registered for RDMA. + Length The length of the chunk in bytes. + Offset - The offset or memory address of the chunk. + The offset or memory address of the chunk. In order to + support the widest array of RDMA implementations, as well as + the most general steering tag scheme, this field is + unconditionally included in each chunk list entry. + Position - For data which is to be encoded, the position in - the XDR stream where the chunk would normally - reside. It is possible that a contiguous sequence - of chunks might all have the same position. 
For - data which is to be decoded, no "position" is - used. + For data which is to be encoded, the position in the XDR + stream where the chunk would normally reside. Note that the + chunk therefore inserts its data into the XDR stream at this + position, but its transfer is no longer "inline". Also note + it is possible that a contiguous sequence of chunks might all + have the same position. For data which is to be decoded, no + "position" is used. When XDR marshaling is complete, the chunk list is XDR encoded, then sent to the receiver prepended to the RPC message. Any source data for a read chunk, or the destination of a write chunk, remain - behind in the sender's registered memory. + behind in the sender's registered memory and their actual payload + is not marshalled into the request or reply. +----------------+----------------+------------- - | | | - | RDMA header w/ | RPC Header | Non-chunk args/results + | RPC over RDMA | | + | header w/ | RPC Header | Non-chunk args/results | chunks | | +----------------+----------------+------------- - Read chunk lists are structured differently from write chunk lists. - This is due to the different usage - read chunks are decoded and - indexed by their position in the XDR data stream, and may be used - for both arguments and results. Write chunks on the other hand are - used only for results, and have no preassigned offset in the XDR - stream until the results are produced. The mapping of Write chunks - onto designated NFS procedures and results is described in [NFS- - DDP]. + Read chunk lists and write chunk lists are structured somewhat + differently. This is due to the different usage - read chunks are + decoded and indexed by their position in the XDR data stream, their + size is always known, and may be used for both arguments and + results. Write chunks on the other hand are used only for results, + and have neither a preassigned offset in the XDR stream nor a size + until the results are produced. The mapping of Write chunks onto + designated NFS procedures and their results is described in + [NFSDDP]. - Therefore, read chunks are encoded as a single array, with each - entry tagged by its position in the XDR stream. Write chunks are - encoded as a list of arrays of RDMA buffers, with each list element - providing buffers for a separate result. + Therefore, read chunks are encoded into a read chunk list as a + single array, with each entry tagged by its position in the XDR + stream. Write chunks are encoded as a list of arrays of RDMA + buffers, with each list element (an array) providing buffers for a + separate result. Individual write chunk list elements may thereby + result in being partially or fully filled, or in fact not being + filled at all. 3.5. Padding Alignment of specific opaque data enables certain scatter/gather optimizations. Padding leverages the useful property that RDMA - transfers preserve alignment of data, even when they are placed into - pre-posted receive buffers by Sends. + transfers preserve alignment of data, even when they are placed + into pre-posted receive buffers by Sends. Many servers can make good use of such padding. Padding allows the chaining of RDMA receive buffers such that any data transferred by RDMA on behalf of RPC requests will be placed into appropriately aligned buffers on the system that receives the transfer. In this - way, the need for servers to perform RDMA Read to satisfy all but the - largest client writes is obviated. 
+ way, the need for servers to perform RDMA Read to satisfy all but + the largest client writes is obviated. - The effect of padding is demonstrated below showing prior bytes on an - XDR stream (XXX) followed by an opaque field consisting of four + The effect of padding is demonstrated below showing prior bytes on + an XDR stream (XXX) followed by an opaque field consisting of four length bytes (LLLL) followed by data bytes (DDDD). The receiver of the RDMA Send has posted two chained receive buffers. Without padding, the opaque data is split across the two buffers. With the addition of padding bytes (ppp) prior to the first data byte, the data can be forced to align correctly in the second buffer. Buffer 1 Buffer 2 Unpadded -------------- -------------- XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD @@ -390,101 +463,111 @@ padding, the opaque data is split across the two buffers. With the addition of padding bytes (ppp) prior to the first data byte, the data can be forced to align correctly in the second buffer. Buffer 1 Buffer 2 Unpadded -------------- -------------- XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD Padded + XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD - Padding is implemented completely within the RDMA transport encoding, - flagged with a specific message type. Where padding is applied, two - values are passed to the peer: an "rdma_align" which is the padding - value used, and "rdma_thresh", which is the opaque data size at or - above which padding is applied. For instance, if the server is using - chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes - could be used to achieve alignment of the data. If padding is to - apply only to chunks at least 1 KB in size, then the threshold should - be set to 1 KB. The XDR routine at the peer will consult these - values when decoding opaque values. Where the decoded length exceeds - the rdma_thresh, the XDR decode will skip over the appropriate - padding as indicated by rdma_align and the current XDR stream - position. + Padding is implemented completely within the RDMA transport + encoding, flagged with a specific message type. Where padding is + applied, two values are passed to the peer: an "rdma_align" which + is the padding value used, and "rdma_thresh", which is the opaque + data size at or above which padding is applied. For instance, if + the server is using chained 4 KB receive buffers, then up to (4 KB + - 1) padding bytes could be used to achieve alignment of the data. + If padding is to apply only to chunks at least 1 KB in size, then + the threshold should be set to 1 KB. The XDR routine at the peer + will consult these values when decoding opaque values. Where the + decoded length exceeds the rdma_thresh, the XDR decode will skip + over the appropriate padding as indicated by rdma_align and the + current XDR stream position. 3.6. XDR Decoding with Read Chunks The XDR decode process moves data from an XDR stream into a data structure provided by the client or server application. Where elements of the destination data structure are buffers or strings, the RPC application can either pre-allocate storage to receive the data, or leave the string or buffer fields null and allow the XDR - decode to automatically allocate storage of sufficient size. + decode stage of RPC processing to automatically allocate storage of + sufficient size. 
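+
+   As a non-normative illustration of the padding rule above, the
+   following C sketch shows how a decoder might compute the pad bytes
+   to skip.  The function name is invented for this example, and the
+   exact computation is an assumption of the sketch rather than a
+   requirement of this protocol; only rdma_align and rdma_thresh are
+   defined by this document.
+
+      #include <stddef.h>
+      #include <stdint.h>
+
+      /* Number of pad bytes to skip while decoding an opaque field
+       * from an RDMA_MSGP message.  Padding applies only when the
+       * decoded length reaches rdma_thresh; the pad advances the
+       * current XDR stream position (just past the four length
+       * bytes) to the next rdma_align boundary. */
+      static size_t rpcrdma_pad_to_skip(size_t xdr_position,
+                                        uint32_t opaque_length,
+                                        uint32_t rdma_align,
+                                        uint32_t rdma_thresh)
+      {
+          if (rdma_align == 0 || opaque_length < rdma_thresh)
+              return 0;               /* the field was not padded */
+
+          return (rdma_align - (xdr_position % rdma_align))
+                     % rdma_align;
+      }
+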
When decoding a message from an RDMA transport, the receiver first - XDR decodes the chunk lists from the RDMA transport header, then + XDR decodes the chunk lists from the RPC over RDMA header, then proceeds to decode the body of the RPC message (arguments or - results). Whenever the XDR offset in the decode stream matches that - of a chunk in the read chunk list, the XDR routine initiates an RDMA - Read to bring over the chunk data into locally registered memory for - the destination buffer. After completing such a transfer, the RPC - receiver must issue an RDMA_DONE message (described in Section 3.8) - to notify the peer that the source buffers can be freed. + results). Whenever the XDR offset in the decode stream matches + that of a chunk in the read chunk list, the XDR routine initiates + an RDMA Read to bring over the chunk data into locally registered + memory for the destination buffer. + + When processing an RPC request, the RPC receiver (server) + acknowledges its completion of use of the source buffers by simply + replying to the RPC sender (client), and the peer may free all + source buffers advertised by the request. + + When processing an RPC reply, after completing such a transfer the + RPC receiver (client) must issue an RDMA_DONE message (described in + Section 3.8) to notify the peer (server) that the source buffers + can be freed. The read chunk list is constructed and used entirely within the RPC/XDR layer. Other than specifying the minimum chunk size, the - management of the read chunk list is automatic and transparent to an - RPC application. + management of the read chunk list is automatic and transparent to + an RPC application. 3.7. XDR Decoding with Write Chunks When a "write chunk list" is provided for the results of the RPC - CALL, the server must provide any corresponding data via RDMA Write - to the memory referenced in the chunk list entries. The RPC REPLY - conveys this by returning the write chunk list to the client with the - lengths rewritten to match the actual transfer. The XDR "decode" of - the reply therefore performs no local data transfer but merely - returns the length obtained from the reply. + call, the server must provide any corresponding data via RDMA Write + to the memory referenced in the chunk list entries. The RPC reply + conveys this by returning the write chunk list to the client with + the lengths rewritten to match the actual transfer. The XDR + "decode" of the reply therefore performs no local data transfer but + merely returns the length obtained from the reply. - Each decoded result consumes one entry in the write chunk list, which - in turn consists of an array of RDMA segments. The length is - therefore the sum of all returned lengths in all segments comprising - the corresponding list entry. As each list entry is "decoded", the - entire entry is consumed. + Each decoded result consumes one entry in the write chunk list, + which in turn consists of an array of RDMA segments. The length is + therefore the sum of all returned lengths in all segments + comprising the corresponding list entry. As each list entry is + "decoded", the entire entry is consumed. - The write chunk list is constructed and used by the RPC application. - The RPC/XDR layer simply conveys the list between client and server - and initiates the RDMA Writes back to the client. The mapping of - write chunk list entries to procedure arguments must be determined - for each protocol. An example of a mapping is described in [NFSDDP]. 
+ The write chunk list is constructed and used by the RPC + application. The RPC/XDR layer simply conveys the list between + client and server and initiates the RDMA Writes back to the client. + The mapping of write chunk list entries to procedure arguments must + be determined for each protocol. An example of a mapping is + described in [NFSDDP]. 3.8. RPC Call and Reply The RDMA transport for RPC provides three methods of moving data between client and server: In-line - Data are moved between client and server - within an RDMA Send. + Data are moved between client and server within an RDMA Send. RDMA Read - Data are moved between client and server - via an RDMA Read operation via STag, address - and offset obtained from a read chunk list. + Data are moved between client and server via an RDMA Read + operation via steering tag, address and offset obtained from a + read chunk list. RDMA Write - Result data is moved from server to client - via an RDMA Write operation via STag, address - and offset obtained from a write chunk list - or reply chunk in the client's RPC call message. + Result data is moved from server to client via an RDMA Write + operation via steering tag, address and offset obtained from a + write chunk list or reply chunk in the client's RPC call + message. These methods of data movement may occur in combinations within a single RPC. For instance, an RPC call may contain some in-line data along with some large chunks transferred via RDMA Read by the server. The reply to that call may have some result chunks that the server RDMA Writes back to the client. The following protocol interactions illustrate RPC calls that use these methods to move RPC message data: An RPC with write chunks in the call message looks like this: @@ -495,20 +578,24 @@ | | | Chunk 1 | | <------------------------------ | Write | : | | Chunk n | | <------------------------------ | Write | | | RPC Reply | | <------------------------------ | Send + In the presence of write chunks, RDMA ordering provides the + guarantee that any RDMA Write operations from the server have + completed prior to the client's RPC reply processing. + An RPC with read chunks in the call message looks like this: Client Server | RPC Call + Read Chunk list | Send | ------------------------------> | | | | Chunk 1 | | +------------------------------ | Read | v-----------------------------> | | : | @@ -528,136 +616,146 @@ | <------------------------------ | Send | | | Chunk 1 | Read | ------------------------------+ | | <-----------------------------v | | : | | Chunk n | Read | ------------------------------+ | | <-----------------------------v | | | - | RPC Done | + | Done | Send | ------------------------------> | - The final RPC Done message allows the client to signal the server - that it has received the chunks, so the server can de-register and - free the memory holding the chunks. An RPC Done completion is not - necessary for an RPC call, since the RPC reply Send is itself a - receive completion notification. + The final Done message allows the client to signal the server that + it has received the chunks, so the server can de-register and free + the memory holding the chunks. A Done completion is not necessary + for an RPC call, since the RPC reply Send is itself a receive + completion notification. - The RPC Done message has no effect on protocol latency since the - client has no expectation of a reply from the server. 
Nor does it + The Done message has no effect on protocol latency since the client + has no expectation of a reply from the server. Nor does it adversely affect bandwidth since it is only 16 bytes in length. In - the event that the client fails to return the Done message, the - server can proceed with a de-register and free chunk buffers after - a time-out. + the event that the client fails to return the Done message within + some timeout period, the server may conclude that a protocol + violation has occurred and close the RPC connection, or it may + proceed with a de-register and free its chunk buffers. This may + result in a fatal RDMA error if the client later attempts to + perform an RDMA Read operation, which amounts to the same thing. - It is important to note that the RPC Done message consumes a credit - at the server. The client must account for this in its accounting - of available credits, and the server should replenish the credit - consumed by RPC Done at its earliest oportunity. + It is important to note that the Done message consumes a credit at + the server. The client must account for this in its accounting of + available credits, and the server should replenish the credit + consumed by Done at its earliest opportunity. Finally, it is possible to conceive of RPC exchanges that involve - any or all combinations of write chunks in the RPC CALL, read - chunks in the RPC CALL, and read chunks in the RPC REPLY. Support + any or all combinations of write chunks in the RPC call, read + chunks in the RPC call, and read chunks in the RPC reply. Support for such exchanges is straightforward from a protocol perspective, but in practice such exchanges would be quite rare, limited to upper layer protocol exchanges which transferred bulk data in both the call and corresponding reply. 4. RPC RDMA Message Layout RPC call and reply messages are conveyed across an RDMA transport - with a prepended RDMA transport header. The transport header + with a prepended RPC over RDMA header. The RPC over RDMA header includes data for RDMA flow control credits, padding parameters and lists of addresses that provide direct data placement via RDMA Read and Write operations. The layout of the RPC message itself is unchanged from that described in [RFC1831] except for the possible exclusion of large data chunks that will be moved by RDMA Read or - Write operations. If the RPC message (along with the transport + Write operations. If the RPC message (along with the RPC over RDMA header) is too long for the posted receive buffer (even after any large chunks are removed), then the entire RPC message can be moved - separately as a chunk, leaving just the transport header in the RDMA - Send. + separately as a chunk, leaving just the RPC over RDMA header in the + RDMA Send. -4.1. RPC RDMA Transport Header +4.1. RPC over RDMA Header - The RPC RDMA transport header begins with four 32-bit fields that are - always present and which control the RDMA interaction including RDMA- - specific flow control. These are then followed by a number of items - such as chunk lists and padding which may or may not be present - depending on the type of transmission. The four fields which are - always present are: + The RPC over RDMA header begins with four 32-bit fields that are + always present and which control the RDMA interaction including + RDMA-specific flow control. These are then followed by a number of + items such as chunk lists and padding which may or may not be + present depending on the type of transmission. 
The four fields + which are always present are: 1. Transaction ID (XID). - The XID generated for the RPC call and reply. Having - the XID at the beginning of the message makes it easy to - establish the message context. This XID mirrors the XID - in the RPC call header, and takes precedence. + The XID generated for the RPC call and reply. Having the XID + at the beginning of the message makes it easy to establish the + message context. This XID mirrors the XID in the RPC header, + and takes precedence. The receiver may ignore the XID in the + RPC header, if it so chooses. 2. Version number. - This version of the RPC RDMA message protocol is 1. - The version number must be increased by one whenever the - format of the RPC RDMA messages is changed. + This version of the RPC RDMA message protocol is 1. The + version number must be increased by one whenever the format of + the RPC RDMA messages is changed. 3. Flow control credit value. - When sent in an RPC CALL message, the requested value is - provided. When sent in an RPC REPLY message, the - granted value is returned. RPC CALLs must not be sent - in excess of the currently granted limit. + When sent in an RPC call message, the requested value is + provided. When sent in an RPC reply message, the granted + value is returned. RPC calls must not be sent in excess of + the currently granted limit. 4. Message type. - RDMA_MSG = 0 indicates that chunk lists and RPC message - follow. RDMA_NOMSG = 1 indicates that after the chunk - lists there is no RPC message. In this case, the chunk - lists provide information to allow the message proper to - be transferred using RDMA read or write and thus is not - appended to the RPC RDMA transport header. RDMA_MSGP = - 2 indicates that a chunk list and RPC message with some - padding follow. RDMA_DONE = 3 indicates that the - message signals the completion of a chunk transfer via - RDMA Read. RDMA_ERROR = 4 is used to signal any detected - error(s) in the RPC RDMA chunk encoding. + + o RDMA_MSG = 0 indicates that chunk lists and RPC message + follow. + + o RDMA_NOMSG = 1 indicates that after the chunk lists there + is no RPC message. In this case, the chunk lists provide + information to allow the message proper to be transferred + using RDMA Read or Write and thus is not appended to the + RPC over RDMA header. + + o RDMA_MSGP = 2 indicates that a chunk list and RPC message + with some padding follow. + + o RDMA_DONE = 3 indicates that the message signals the + completion of a chunk transfer via RDMA Read. + + o RDMA_ERROR = 4 is used to signal any detected error(s) in + the RPC RDMA chunk encoding. Because the version number is encoded as part of this header, and the RDMA_ERROR message type is used to indicate errors, these first four fields and the start of the following message body must always remain aligned at these fixed offsets for all versions of the RPC - RDMA transport header. + over RDMA header. For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write chunk lists follow. If the Read chunk list is null (a 32 bit word of zeros), then there are no chunks to be transferred separately and the RPC message follows in its entirety. If non-null, then it's the beginning of an XDR encoded sequence of Read chunk list entries. If the Write chunk list is non-null, then an XDR encoded sequence of Write chunk entries follows. If the message type is RDMA_MSGP, then two additional fields that specify the padding alignment and threshold are inserted prior to the Read and Write chunk lists.
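+
+   The following non-normative C sketch illustrates how a receiver
+   might act on these four fields.  The structure and helper function
+   names are invented for this example; only the version and message
+   type values themselves come from this document.
+
+      #include <stdint.h>
+
+      /* The four fixed 32-bit fields that begin every RPC over RDMA
+       * header, shown here in a hypothetical host-order form. */
+      struct rpcrdma_fixed_hdr {
+          uint32_t xid;       /* mirrors the XID in the RPC header */
+          uint32_t vers;      /* RPC over RDMA version, currently 1 */
+          uint32_t credits;   /* requested (call) or granted (reply) */
+          uint32_t msg_type;  /* one of the values below */
+      };
+
+      enum { RPCRDMA_VERSION = 1 };
+      enum { RDMA_MSG = 0, RDMA_NOMSG = 1, RDMA_MSGP = 2,
+             RDMA_DONE = 3, RDMA_ERROR = 4 };
+
+      /* Hypothetical helpers, declared but not defined here. */
+      int decode_chunks_and_message(const struct rpcrdma_fixed_hdr *h);
+      int fetch_message_via_chunks(const struct rpcrdma_fixed_hdr *h);
+      int release_advertised_chunks(uint32_t xid);
+      int handle_rdma_error(uint32_t xid);
+
+      static int rpcrdma_dispatch(const struct rpcrdma_fixed_hdr *h)
+      {
+          if (h->vers != RPCRDMA_VERSION)
+              return -1;  /* reply RDMA_ERROR / ERR_VERS, Section 5.3 */
+
+          /* h->credits updates flow control state for every type;
+           * even a Done message consumes a credit (Section 3.8). */
+          switch (h->msg_type) {
+          case RDMA_MSG:
+          case RDMA_MSGP:  /* align and threshold precede the chunks */
+              /* chunk lists, then the in-line RPC message, follow */
+              return decode_chunks_and_message(h);
+          case RDMA_NOMSG:
+              /* chunk lists only; the RPC message is moved by RDMA */
+              return fetch_message_via_chunks(h);
+          case RDMA_DONE:
+              /* peer finished its RDMA Reads; free source chunks */
+              return release_advertised_chunks(h->xid);
+          case RDMA_ERROR:
+              /* ERR_VERS or ERR_CHUNK reported by the peer */
+              return handle_rdma_error(h->xid);
+          default:
+              return -1;  /* treat as a header decoding error */
+          }
+      }
+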
- A transport header of message type RDMA_MSG or RDMA_MSGP will be - followed by the RPC call or reply message, beginning with the XID. - This XID should match the one at the beginning of the RPC message - header. + A header of message type RDMA_MSG or RDMA_MSGP will be followed by + the RPC call or RPC reply message body, beginning with the XID. + The XID in the RDMA_MSG or RDMA_MSGP header must match this. +--------+---------+---------+-----------+-------------+---------- | | | | Message | NULLs | RPC Call | XID | Version | Credits | Type | or | or | | | | | Chunk Lists | Reply Msg +--------+---------+---------+-----------+-------------+---------- Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or RPC message follows. As an implementation hint: a gather operation - on the Send of the RDMA RPC message can be used to marshal the ini- - tial header, the chunk list, and the RPC message itself. + on the Send of the RDMA RPC message can be used to marshal the + initial header, the chunk list, and the RPC message itself. 4.2. XDR Language Description Here is the message layout in XDR language. struct xdr_rdma_segment { uint32 handle; /* Registered memory handle */ uint32 length; /* Length of the chunk in bytes */ uint64 offset; /* Chunk virtual address or offset */ }; @@ -745,27 +843,27 @@ }; 5. Large Chunkless Messages The receiver of RDMA Send messages is required to have previously posted one or more correctly sized buffers. The client can inform the server of the maximum size of its RDMA Send messages via the Connection Configuration Protocol described later in this document. Since RPC messages are frequently small, memory savings can be - achieved by posting small buffers. Even large messages like NFS READ - or WRITE will be quite small once the chunks are removed from the - message. However, there may be large, chunkless messages that would - demand a very large buffer be posted. A good example is an NFS - READDIR reply which may contain a large number of small filename - strings. Also, the NFS version 4 protocol [RFC3530] features - COMPOUND request and reply messages of unbounded length. + achieved by posting small buffers. Even large messages like NFS + READ or WRITE will be quite small once the chunks are removed from + the message. However, there may be large, chunkless messages that + would demand a very large buffer be posted. A good example is an + NFS READDIR reply which may contain a large number of small + filename strings. Also, the NFS version 4 protocol [RFC3530] + features COMPOUND request and reply messages of unbounded length. Ideally, each upper layer will negotiate these limits. However, it is frequently necessary to provide a transparent solution. 5.1. Message as an RDMA Read Chunk One relatively simple method is to have the client identify any RPC message that exceeds the server's posted buffer size and move it separately as a chunk, i.e. reference it as the first entry in the read chunk list with an XDR position of zero. @@ -784,188 +883,203 @@ | XID | Version | Credits | RDMA_NOMSG | Chunk Lists | | | | | | | +--------+---------+---------+------------+-------------+ | | +---------- | | Long RPC Call +->| or | Reply Message +---------- - If the receiver gets a transport header with a message type of - RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR - position, it allocates a registered buffer and issues an RDMA Read of - the long RPC message into it. 
The receiver then proceeds to XDR - decode the RPC message as if it had received it in-line with the Send - data. Further decoding may issue additional RDMA Reads to bring over - additional chunks. + If the receiver gets an RPC over RDMA header with a message type of + RDMA_NOMSG and finds an initial read chunk list entry with a zero + XDR position, it allocates a registered buffer and issues an RDMA + Read of the long RPC message into it. The receiver then proceeds + to XDR decode the RPC message as if it had received it in-line with + the Send data. Further decoding may issue additional RDMA Reads to + bring over additional chunks. Although the handling of long messages requires one extra network turnaround, in practice these messages should be rare if the posted - receive buffers are correctly sized, and of course they will be non- - existent for RDMA-aware upper layers. + receive buffers are correctly sized, and of course they will be + non-existent for RDMA-aware upper layers. An RPC with long reply returned via RDMA Read looks like this: Client Server | RPC Call | Send | ------------------------------> | | | - | RPC Transport Header | + | RPC over RDMA Header | | <------------------------------ | Send | | | Long RPC Reply Msg | Read | ------------------------------+ | | <-----------------------------v | | | - | RPC Done | + | Done | Send | ------------------------------> | -5.2. RDMA Write of Long Replies +5.2. RDMA Write of Long Replies (Reply Chunks) - An alternative method of handling long, chunkless RPC replies is to - have the client post a large buffer into which the server can write a - large RPC reply. This has the advantage that an RDMA Write may be - slightly faster in network latency than an RDMA Read. Additionally, - for a reply it removes the need for an RDMA_DONE message if the large - reply is returned as a Read chunk. + A superior method of handling long, chunkless RPC replies is to + have the client post a large buffer into which the server can write + a large RPC reply. This has the advantage that an RDMA Write may + be slightly faster in network latency than an RDMA Read. + Additionally, for a reply it removes the need for the RDMA_DONE + message that would be required if the large reply were returned as a Read chunk. This protocol supports direct return of a large reply via the - inclusion of an optional rdma_reply write chunk after the read chunk - list and the write chunk list. The client allocates a buffer sized - to receive a large reply and enters its STag, address and length in - the rdma_reply write chunk. If the reply message is too long to - return in-line with an RDMA Send (exceeds the size of the client's - posted receive buffer), even with read chunks removed, then the - server RDMA writes the RPC reply message into the buffer indicated by - the rdma_reply chunk. If the client doesn't provide an rdma_reply - chunk, or if it's too small, then the message must be returned as a - Read chunk. + inclusion of an optional rdma_reply write chunk after the read + chunk list and the write chunk list. The client allocates a buffer + sized to receive a large reply and enters its steering tag, address + and length in the rdma_reply write chunk. If the reply message is + too long to return in-line with an RDMA Send (exceeds the size of + the client's posted receive buffer), even with read chunks removed, + then the server RDMA Writes the RPC reply message into the buffer + indicated by the rdma_reply chunk.
If the client doesn't provide + an rdma_reply chunk, or if it's too small, then the message must be + returned as a Read chunk. An RPC with long reply returned via RDMA Write looks like this: Client Server | RPC Call with rdma_reply | Send | ------------------------------> | | | | Long RPC Reply Msg | | <------------------------------ | Write | | - | RPC Transport Header | + | RDMA over RPC Header | | <------------------------------ | Send The use of RDMA Write to return long replies requires that the client application anticipate a long reply and have some knowledge of its size so that a correctly sized buffer can be allocated. This is certainly true of NFS READDIR replies; where the client - already provides an upper bound on the size of the encoded direc- - tory fragment to be returned by the server. + already provides an upper bound on the size of the encoded + directory fragment to be returned by the server. -5.3. RPC RDMA header errors + The use of these "reply chunks" is highly efficient and convenient + for both client and server. Their use is encouraged for eligible + RPC operations such as NFS READDIR, which would otherwise require + extensive chunk management within the results or use of RDMA Read + and a Done message. + +5.3. RPC over RDMA header errors When a peer receives an RPC RDMA message, it must perform certain basic validity checks on the header and chunk contents. If errors are detected in an RPC request, an RDMA_ERROR reply should be generated. Two types of errors are defined, version mismatch and invalid chunk - format. When the peer detects an RPC RDMA header version which it - does not support (currently this draft defines only version 1), it - replies with an error code of ERR_VERS, and provides the low and high - inclusive version numbers it does, in fact, support. The version - number in this reply can be any value otherwise valid at the - receiver. When other decoding errors are detected in the header or - chunks, either an RPC decode error may be returned, or the error code - ERR_CHUNK. + format. When the peer detects an RPC over RDMA header version + which it does not support (currently this draft defines only + version 1), it replies with an error code of ERR_VERS, and provides + the low and high inclusive version numbers it does, in fact, + support. The version number in this reply can be any value + otherwise valid at the receiver. When other decoding errors are + detected in the header or chunks, either an RPC decode error may be + returned, or the error code ERR_CHUNK. 6. Connection Configuration Protocol - RDMA Send operations require the receiver to post one or more buffers - at the RDMA connection endpoint, each large enough to receive the - largest Send message. Buffers are consumed as Send messages are - received. If a buffer is too small, or if there are no buffers - posted, the RDMA transport will return an error and break the RDMA - connection. The receiver must post sufficient, correctly sized - buffers to avoid buffer overrun or capacity errors. + RDMA Send operations require the receiver to post one or more + buffers at the RDMA connection endpoint, each large enough to + receive the largest Send message. Buffers are consumed as Send + messages are received. If a buffer is too small, or if there are + no buffers posted, the RDMA transport may return an error and break + the RDMA connection. The receiver must post sufficient, correctly + sized buffers to avoid buffer overrun or capacity errors. 
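   As an illustration only, a receiver might provision its receive
   buffers along the following lines.  This is a minimal C sketch, not
   part of the protocol: rdma_register_memory() and rdma_post_receive()
   stand in for whatever registration and receive-posting primitives
   the underlying RDMA provider offers, and the buffer size and count
   shown are assumed local configuration values.

      #include <stdlib.h>
      #include <stdint.h>

      #define RPCRDMA_BUFSZ    1024  /* assumed in-line receive size */
      #define RPCRDMA_CREDITS  32    /* assumed advertised credits   */

      struct recv_buf {
              void     *addr;        /* registered receive memory    */
              uint32_t  handle;      /* handle from registration     */
      };

      /* placeholders for provider-specific verbs */
      extern uint32_t rdma_register_memory(void *addr, size_t len);
      extern int      rdma_post_receive(void *ep, struct recv_buf *rb,
                                        size_t len);

      static struct recv_buf ring[RPCRDMA_CREDITS];

      int
      rpcrdma_setup_receives(void *ep)
      {
              int i;

              for (i = 0; i < RPCRDMA_CREDITS; i++) {
                      ring[i].addr = malloc(RPCRDMA_BUFSZ);
                      if (ring[i].addr == NULL)
                              return -1;
                      ring[i].handle =
                          rdma_register_memory(ring[i].addr,
                                               RPCRDMA_BUFSZ);
                      if (rdma_post_receive(ep, &ring[i],
                                            RPCRDMA_BUFSZ))
                              return -1;
              }
              return 0;
      }

   Each Send from the peer consumes one posted buffer; a completion
   handler would re-post the buffer before the corresponding credit is
   granted back, so that the posted count never falls below the
   advertised credit value.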
The protocol described above includes only a mechanism for managing - the number of such receive buffers, and no explicit features to allow - the client and server to provision or control buffer sizing, nor any - other session parameters. + the number of such receive buffers, and no explicit features to + allow the client and server to provision or control buffer sizing, + nor any other session parameters. In the past, this type of connection management has not been necessary for RPC. RPC over UDP or TCP does not have a protocol to negotiate the link. The server can get a rough idea of the maximum - size of messages from the server protocol code. However, a protocol - to negotiate transport features on a more dynamic basis is desirable. + size of messages from the server protocol code. However, a + protocol to negotiate transport features on a more dynamic basis is + desirable. The Connection Configuration Protocol allows the client to pass its connection requirements to the server, and allows the server to inform the client of its connection limits. 6.1. Initial Connection State This protocol will be used for connection setup prior to the use of another RPC protocol that uses the RDMA transport. It operates in- - band, i.e. it uses the connection itself to negotiate the connection - parameters. To provide a basis for connection negotiation, the - connection is assumed to provide a basic level of interoperability: - the ability to exchange at least one RPC message at a time that is at - least 1 KB in size. The server may exceed this basic level of - configuration, but the client must not assume it. + band, i.e. it uses the connection itself to negotiate the + connection parameters. To provide a basis for connection + negotiation, the connection is assumed to provide a basic level of + interoperability: the ability to exchange at least one RPC message + at a time that is at least 1 KB in size. The server may exceed + this basic level of configuration, but the client must not assume + it. 6.2. Protocol Description - Version 1 of the protocol consists of a single procedure that allows - the client to inform the server of its connection requirements and - the server to return connection information to the client. + Version 1 of the protocol consists of a single procedure that + allows the client to inform the server of its connection + requirements and the server to return connection information to the + client. - The maxcallsize argument is the maximum size of an RPC call message - that the client will send in-line in an RDMA Send message to the - server. The server may return a maxcallsize value that is smaller or - larger than the client's request. The client must not send an in- - line call message larger than what the server will accept. The - maxcallsize limits only the size of in-line RPC calls. It does not - limit the size of long RPC messages transferred as an initial chunk - in the Read chunk list. + The maxcall_sendsize argument is the maximum size of an RPC call + message that the client will send in-line in an RDMA Send message + to the server. The server may return a maxcall_sendsize value that + is smaller or larger than the client's request. The client must + not send an in-line call message larger than what the server will + accept. The maxcall_sendsize limits only the size of in-line RPC + calls. It does not limit the size of long RPC messages transferred + as an initial chunk in the Read chunk list. 
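   As an illustration of how a client might apply the negotiated limit,
   the following C sketch (the rpcrdma_send_inline() and
   rpcrdma_send_as_read_chunk() helpers are assumed for the example and
   are not defined by this protocol) sends calls that fit within the
   server's returned maxcall_sendsize in-line following an RDMA_MSG
   header, and conveys larger calls as a read chunk with XDR position
   zero as described in section 5.1.

      #include <stdint.h>
      #include <stddef.h>

      /* assumed helpers, not part of the protocol */
      extern int rpcrdma_send_inline(void *ep, const void *rpcmsg,
                                     size_t len);
      extern int rpcrdma_send_as_read_chunk(void *ep,
                                            const void *rpcmsg,
                                            size_t len);

      int
      rpcrdma_send_call(void *ep, uint32_t srv_maxcall_sendsize,
                        const void *rpcmsg, size_t len)
      {
              /* Fits the server's posted receive buffers: send the
                 call in-line following an RDMA_MSG header. */
              if (len <= srv_maxcall_sendsize)
                      return rpcrdma_send_inline(ep, rpcmsg, len);

              /* Too large: register the message and reference it as
                 the first read chunk at XDR position zero, following
                 an RDMA_NOMSG header. */
              return rpcrdma_send_as_read_chunk(ep, rpcmsg, len);
      }

   The value of srv_maxcall_sendsize would be taken from the
   maxcall_sendsize field of the config_rdma_reply returned by the
   CONF_RDMA procedure shown below.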
- The maxreplysize is the maximum size of an in-line RPC message that - the client will accept from the server. + The maxreply_sendsize is the maximum size of an in-line RPC message + that the client will accept from the server. The maxrdmaread is the maximum number of RDMA Reads which may be - active at the peer. This number correlates to the RDMA incoming RDMA - Read count ("IRD") configured into each originating endpoint by the - client or server. If more than this number of RDMA Read operations - by the connected peer are issued simultaneously, connection loss or - suboptimal flow control may result, therefore the value should be - observed at all times. The peers' values need not be equal. If - zero, the peer must not issue requests which require RDMA Read to - satisfy, as no transfer will be possible. + active at the peer. This number correlates to the RDMA incoming + RDMA Read count ("IRD") configured into each originating endpoint + by the client or server. If more than this number of RDMA Read + operations by the connected peer are issued simultaneously, + connection loss or suboptimal flow control may result, therefore + the value should be observed at all times. The peers' values need + not be equal. If zero, the peer must not issue requests which + require RDMA Read to satisfy, as no transfer will be possible. The align value is the value recommended by the server for opaque - data values such as strings and counted byte arrays. The client can - use this value to compute the number of prepended pad bytes when XDR - encoding opaque values in the RPC call message. + data values such as strings and counted byte arrays. The client + can use this value to compute the number of prepended pad bytes + when XDR encoding opaque values in the RPC call message. typedef unsigned int uint32; struct config_rdma_req { - uint32 maxcallsize; /* max size of in-line RPC call */ - uint32 maxreplysize; /* max size of in-line RPC reply */ - uint32 maxrdmaread; /* max active RDMA Reads at client */ + uint32 maxcall_sendsize; + /* max size of in-line RPC call */ + uint32 maxreply_sendsize; + /* max size of in-line RPC reply */ + uint32 maxrdmaread; + /* max active RDMA Reads at client */ }; + struct config_rdma_reply { - uint32 maxcallsize; /* max call size accepted by server */ - uint32 align; /* server's receive buffer alignment */ - uint32 maxrdmaread; /* max active RDMA Reads at server */ + uint32 maxcall_sendsize; + /* max call size accepted by server */ + uint32 align; + /* server's receive buffer alignment */ + uint32 maxrdmaread; + /* max active RDMA Reads at server */ }; - program CONFIG_RDMA_PROG { version VERS1 { /* * Config call/reply */ config_rdma_reply CONF_RDMA(config_rdma_req) = 1; } = 1; } = nnnnnn; <-- Need program number assigned 7. Memory Registration Overhead @@ -963,227 +1077,221 @@ version VERS1 { /* * Config call/reply */ config_rdma_reply CONF_RDMA(config_rdma_req) = 1; } = 1; } = nnnnnn; <-- Need program number assigned 7. Memory Registration Overhead - RDMA requires that all data be transferred between registered memory - regions at the source and destination. All protocol headers as well - as separately transferred data chunks must use registered memory. - Since the cost of registering and de-registering memory can be a - large proportion of the RDMA transaction cost, it is important to - minimize registration activity. This is easily achieved within RPC - controlled memory by allocating chunk list data and RPC headers in a - reusable way from pre-registered pools. 
+ RDMA requires that all data be transferred between registered + memory regions at the source and destination. All protocol headers + as well as separately transferred data chunks must use registered + memory. Since the cost of registering and de-registering memory + can be a large proportion of the RDMA transaction cost, it is + important to minimize registration activity. This is easily + achieved within RPC controlled memory by allocating chunk list data + and RPC headers in a reusable way from pre-registered pools. - The data chunks transferred via RDMA may occupy memory that persists - outside the bounds of the RPC transaction. Hence, the default - behavior of an RDMA transport is to register and de-register these - chunks on every transaction. However, this is not a limitation of - the protocol - only of the existing local RPC API. The API is easily - extended through such functions as rpc_control(3) to change the - default behavior so that the application can assume responsibility - for controlling memory registration through an RPC-provided - registered memory allocator. + The data chunks transferred via RDMA may occupy memory that + persists outside the bounds of the RPC transaction. Hence, the + default behavior of an RPC over RDMA transport is to register and + de-register these chunks on every transaction. However, this is + not a limitation of the protocol - only of the existing local RPC + API. The API is easily extended through such functions as + rpc_control(3) to change the default behavior so that the + application can assume responsibility for controlling memory + registration through an RPC-provided registered memory allocator. 8. Errors and Error Recovery Error reporting and recovery is outside the scope of this protocol. - It is assumed that the link itself will provide some degree of error - detection and retransmission. Additionally, the RPC layer itself can - accept errors from the link level and recover via retransmission. - RPC recovery can handle complete loss and re-establishment of the - link. + It is assumed that the link itself will provide some degree of + error detection and retransmission. Additionally, the RPC layer + itself can accept errors from the link level and recover via + retransmission. RPC recovery can handle complete loss and re- + establishment of the link. 9. Node Addressing In setting up a new RDMA connection, the first action by an RPC client will be to obtain a transport address for the server. The - mechanism used to obtain this address, and to open an RDMA connection - is dependent on the type of RDMA transport, and outside the scope of - this protocol. + mechanism used to obtain this address, and to open an RDMA + connection is dependent on the type of RDMA transport, and outside + the scope of this protocol. 10. RPC Binding RPC services normally register with a portmap or rpcbind service, which associates an RPC program number with a service address. In - the case of UDP or TCP, the service address for NFS is normally port - 2049. This policy should be no different with RDMA interconnects. + the case of UDP or TCP, the service address for NFS is normally + port 2049. This policy should be no different with RDMA + interconnects. - One possibility is to have the server's portmapper register itself on - the RDMA interconnect at a "well known" service address. On UDP or - TCP, this corresponds to port 111. 
A client could connect to this - service address and use the portmap protocol to obtain a service - address in response to a program number, e.g. a VI discriminator or - an Infiniband GID. + One possibility is to have the server's portmapper register itself + on the RDMA interconnect at a "well known" service address. On UDP + or TCP, this corresponds to port 111. A client could connect to + this service address and use the portmap protocol to obtain a + service address in response to a program number, e.g. a VI + discriminator or an Infiniband GID. 11. Security ONC RPC provides its own security via the RPCSEC_GSS framework [RFC 2203]. RPCSEC_GSS can provide message authentication, integrity - checking, and privacy. This security mechanism will be unaffected by - the RDMA transport. The data integrity and privacy features alter - the body of the message, presenting it as a single chunk. For large - messages the chunk may be large enough to qualify for RDMA Read - transfer. However, there is much data movement associated with - computation and verification of integrity, or encryption/decryption, - so any performance advantage will be lost. + checking, and privacy. This security mechanism will be unaffected + by the RDMA transport. The data integrity and privacy features + alter the body of the message, presenting it as a single chunk. + For large messages the chunk may be large enough to qualify for + RDMA Read transfer. However, there is much data movement + associated with computation and verification of integrity, or + encryption/decryption, so any performance advantage will be lost. - There should be no new issues here with exposed addresses. The only - exposed addresses here are in the chunk list and in the transport - packets generated by an RDMA. The data contained in these addresses - is adequately protected by RPCSEC_GSS integrity and privacy. - RPCSEC_GSS security mechanisms are typically implemented by the host - CPU. This additional data movement and CPU use may cancel out much - of the RDMA direct placement and offload benefit. + There should be no new issues here with exposed addresses. The + only exposed addresses here are in the chunk list and in the + transport packets generated by an RDMA. The data contained in + these addresses is adequately protected by RPCSEC_GSS integrity and + privacy. RPCSEC_GSS security mechanisms are typically implemented + by the host CPU. This additional data movement and CPU use may + cancel out much of the RDMA direct placement and offload benefit. A more appropriate security mechanism for RDMA links may be link- level protection, like IPSec, which may be co-located in the RDMA link hardware. The use of link-level protection may be negotiated through the use of a new RPCSEC_GSS mechanism like the Credential - Cache GSS Mechanism (CCM) [CCM]. + Cache GSS Mechanism [CCM]. 12. IANA Considerations As a new RPC transport, this protocol should have no effect on RPC program numbers or registered port numbers. The new RPC transport should be assigned a new RPC "netid". If adopted, the Connection Configuration protocol described herein will require an RPC program number assignment. 13. Acknowledgements The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their contributions to this document. 14. Normative References [RFC1831] R. 
Srinivasan, "RPC: Remote Procedure Call Protocol Specification - Version 2", - Standards Track RFC, + Version 2", Standards Track RFC, http://www.ietf.org/rfc/rfc1831.txt [RFC1832] R. Srinivasan, "XDR: External Data Representation Standard", - Standards Track RFC, - http://www.ietf.org/rfc/rfc1832.txt + Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt + [RFC1813] B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol - Specification", - Informational RFC, + Specification", Informational RFC, http://www.ietf.org/rfc/rfc1813.txt [RFC3530] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. - Eisler, D. Noveck, "NFS version 4 Protocol", - Standards Track RFC, + Eisler, D. Noveck, "NFS version 4 Protocol", Standards Track RFC, http://www.ietf.org/rfc/rfc3530.txt [RFC2203] M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification", - Standards Track RFC, - http://www.ietf.org/rfc/rfc2203.txt + Standards Track RFC, http://www.ietf.org/rfc/rfc2203.txt 15. Informative References - [RDMA] R. Recio et al, "An RDMA Protocol Specification", - Internet Draft Work in Progress, - http://www.ietf.org/internet-drafts/ - draft-ietf-rddp-rdmap-01.txt - - [CCM] M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism", - Internet Draft Work in Progress, - http://www.ietf.org/internet-drafts/ - draft-ietf-nfsv4-ccm-03.txt +[RDMA] + R. Recio et al, "An RDMA Protocol Specification", Internet Draft + Work in Progress, http://www.ietf.org/internet-drafts/draft-ietf- + rddp-rdmap-03.txt - [NFSRDMA] - T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions" - Internet Draft Work in Progress, - http://www.ietf.org/internet-drafts/ - draft-ietf-nfsv4-session-00.txt +[CCM] + M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism", + Internet Draft Work in Progress, http://www.ietf.org/internet- + drafts/draft-ietf-nfsv4-ccm-03.txt [NFSDDP] - B. Callaghan, T. Talpey, "NFS Direct Data Placement" - Internet Draft Work in Progress, - http://www.ietf.org/internet-drafts/ - draft-ietf-nfsv4-nfsdirect-00.txt + B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet Draft + Work in Progress, http://www.ietf.org/internet-drafts/draft-ietf- + nfsv4-nfsdirect-01.txt + [RDDP] Remote Direct Data Placement Working Group Charter, http://www.ietf.org/html.charters/rddp-charter.html - [RDDPPS] - Remote Direct Data Placement Working Group Problem Statement, - Internet Draft Work in Progress, - A. Romanow, J. Mogul, T. Talpey, S. Bailey, - http://www.ietf.org/internet-drafts/ - draft-ietf-rddp-problem-statement-04.txt +[NFSRDMAPS] + T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet + Draft Work in Progress, http://www.ietf.org/internet-drafts/draft- + ietf-nfsv4-nfs-rdma-problem-statement-02.txt + +[NFSSESS] + T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions", + Internet Draft Work in Progress, http://www.ietf.org/internet- + drafts/draft-ietf-nfsv4-nfs-sess-01.txt [IB] - Infiniband Architecture Specification, - http://www.infinibandta.org + Infiniband Architecture Specification, http://www.infinibandta.org 16. Authors' Addresses Brent Callaghan - Sun Microsystems, Inc. - 17 Network Circle - Menlo Park, California 94025 USA + 1614 Montalto Dr. + Mountain View, California 94040 USA - Phone: +1 650 786 5067 - EMail: brent.callaghan@sun.com + Phone: +1 650 968 2333 + EMail: brent.callaghan@gmail.com Tom Talpey Network Appliance, Inc. 375 Totten Pond Road Waltham, MA 02451 USA Phone: +1 781 768 5329 EMail: thomas.talpey@netapp.com 17. 
Full Copyright Statement - Copyright (C) The Internet Society (2004). This document is sub- - ject to the rights, licenses and restrictions contained in BCP 78 - and except as set forth therein, the authors retain all their + + Copyright (C) The Internet Society (2005). This document is + subject to the rights, licenses and restrictions contained in BCP + 78 and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on - an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REP- - RESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE - INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR - IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF - THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED - WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE + REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND + THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT + THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR + ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A + PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. - Information on the procedures with respect to rights in RFC docu- - ments can be found in BCP 78 and BCP 79. + Information on the procedures with respect to rights in RFC + documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use - of such proprietary rights by implementers or users of this speci- - fication can be obtained from the IETF on-line IPR repository at - http://www.ietf.org/ipr. + of such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository + at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- ipr@ietf.org. Acknowledgement Funding for the RFC Editor function is currently provided by the