Network File System Version 4                                  C. Lever
Internet-Draft                                                    Oracle
Intended status: Informational                         February 23, 2016
Expires: August 26, 2016

          RPC-over-RDMA Version One Implementation Experience
          draft-ietf-nfsv4-rfc5666-implementation-experience-01

Abstract

This document details experiences and challenges implementing the RPC-over-RDMA Version One protocol. Specification changes are recommended to address avoidable interoperability failures.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 26, 2016.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Purpose Of This Document
     1.3.  Updating RFC 5666
   2.  RPC-Over-RDMA Essentials
     2.1.  Arguments And Results
     2.2.  Remote Direct Memory Access
     2.3.  Transfer Models
     2.4.  Upper Layer Binding Specifications
     2.5.  On-The-Wire Protocol
   3.  Specification Issues
     3.1.  Extensibility Considerations
     3.2.  XDR Clarifications
     3.3.  The Position Zero Read Chunk
     3.4.  RDMA_NOMSG Call Messages
     3.5.  RDMA_MSG Call with Position Zero Read Chunk
     3.6.  Padding Inline Content After A Chunk
     3.7.  Write Chunk XDR Roundup
     3.8.  Write List Error Cases
   4.  Operational Considerations
     4.1.  Computing Request Buffer Requirements
     4.2.  Default Inline Buffer Size
     4.3.  When To Use Reply Chunks
     4.4.  Computing Credit Values
     4.5.  Race Windows
   5.  Pre-requisites For NFSv4
     5.1.  Bi-directional Operation
   6.  Considerations For Upper Layer Binding Specifications
     6.1.  Organization Of Binding Specification Requirements
     6.2.  RDMA-Eligibility
     6.3.  Inline Threshold Requirements
     6.4.  Violations Of Binding Rules
     6.5.  Binding Specification Completion Assessment
   7.  Unimplemented Protocol Features
     7.1.  Unimplemented Features To Be Removed
     7.2.  Unimplemented Features To Be Retained
   8.  Security Considerations
   9.  IANA Considerations
   10. Appendix A: XDR Language Description
   11. Appendix B: Binding Requirement Summary
   12. Acknowledgements
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Author's Address
1. Introduction

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.2. Purpose Of This Document

This document summarizes implementation experience with the RPC-over-RDMA Version One protocol [RFC5666], and proposes improvements to the protocol specification based on implementer experience, frequently-asked questions, and interviews with a co-author of RFC 5666.

A key contribution of this document is to highlight areas of RFC 5666 where independent good faith readings could result in distinct implementations that do not interoperate with each other. Correcting these specification issues is critical: fresh implementations of RPC-over-RDMA Version One continue to arise.

Recommendations are limited to the following areas:

o  Repairing specification ambiguities

o  Codifying successful implementation practices and conventions

o  Clarifying the role of Upper Layer Binding specifications

o  Exploring protocol enhancements that might be added while allowing extant implementations to interoperate with enhanced implementations

1.3. Updating RFC 5666

During IETF 92, several alternatives for updating RFC 5666 were discussed with the RFC Editor and with the assembled members of the nfsv4 Working Group. Among them were:

o  Filing individual errata for each issue

o  Introducing a new RFC that updates but does not obsolete RFC 5666, but makes no change to the protocol

o  Introducing an RFC 5666bis that replaces and thus obsoletes RFC 5666, but makes no change to the protocol

o  Introducing a new RFC that specifies RPC-over-RDMA Version Two

An additional possibility, sometimes chosen by other Working Groups, would be to update RFC 5666 as it transitions from Proposed Standard to Draft Standard.

There was general agreement during the meeting regarding the need to update and obsolete RFC 5666 while retaining a high degree of interoperability with current RPC-over-RDMA Version One implementations. This approach would avoid changes to on-the-wire behavior while not burdening implementers, who could continue to reference a single specification of the protocol.
In addition, this alternative extends the life of current interoperable RPC-over-RDMA Version One implementations in the field. Subsequent discussion within the nfsv4 Working Group has focused on resolving specification ambiguities that make the construction of interoperable implementations unduly difficult. A Version Two of RPC-over-RDMA, where deeper changes can be made and new functionality introduced, remains a possibility.

2. RPC-Over-RDMA Essentials

The following sections summarize the state of affairs defined in RFC 5666. This is a distillation of text from RFC 5666, dialog with a co-author of RFC 5666, and implementer experience. The XDR definitions are copied from RFC 5666 Section 4.3.

2.1. Arguments And Results

Like a local function call, every Remote Procedure Call (RPC) operation has a set of one or more "arguments" and a set of one or more "results." The calling context is not allowed to proceed until the function's results are available. Unlike a local function call, the called function is executed remotely rather than in the local application's context.

A client endpoint, or "requester", serializes an RPC call's arguments into a byte stream using XDR [RFC4506]. This "XDR stream" is conveyed to a server endpoint via an RPC call message (sometimes referred to as an "RPC request").

The server endpoint, or "responder", deserializes the arguments and processes the requested operation. It then serializes the operation's results into another XDR stream. This stream is conveyed back to the client endpoint via an RPC reply message. The client deserializes the results and allows the original caller to proceed.

The remainder of this document assumes a working knowledge of the RPC protocol [RFC5531] and especially XDR [RFC4506].

2.2. Remote Direct Memory Access

RPC messages may be very large. For example, NFS READ and WRITE operations are often 100KB or larger. An RPC client system can be made more efficient if RPC messages are transferred by a third party such as intelligent network interface hardware.

Remote Direct Memory Access (RDMA) and Direct Data Placement (DDP) enable offloading data movement to avoid the negative performance effects of using traditional host CPU-based network operations to move bulk data. RFC 5666 describes how to use only the Send, Receive, RDMA Read, and RDMA Write operations described in [RFC5040] and [RFC5041] to move RPC calls and replies between requesters and responders.

2.2.1. Direct Data Placement

RFC 5666 makes an important distinction between RDMA and Direct Data Placement (DDP). Very often, RPC implementations copy the contents of RPC messages into a buffer before being sent. A good RPC implementation may be able to send bulk data without having to copy it into a separate send buffer first. However, socket-based RPC implementations are often unable to receive data directly into its final place in memory. Receivers often need to copy incoming data to finish an RPC operation.

In RFC 5666, "RDMA" refers to the physical mechanism an RDMA transport utilizes when moving data. Though it may not be optimal, before an RDMA transfer, the sender may still copy data into place. After an RDMA transfer, the receiver may copy that data again to its final destination. RFC 5666 uses the term "direct data placement" to refer to an optimization that makes it unnecessary for a host CPU to copy data to be transferred. RPC-over-RDMA Version One utilizes RDMA Read and Write operations to enable DDP.
Not every RDMA-based transfer in RPC-over-RDMA Version One is DDP, however.

2.2.2. Channel Operation

A Send operation initiates the transfer of a message from a local endpoint to a remote endpoint, similar to a datagram send operation. The remote endpoint pre-posts Receive operations to catch incoming messages. Send operations are flow-controlled to prevent overrunning receive resources. To reduce the amount of memory that must remain pinned awaiting incoming messages, receive buffers are limited in size and number. This transfer mode is utilized to convey size-limited RPC operations, and advertisements of buffer coordinates for explicit RDMA data transfer. Buffers involved in Send and Receive operations are usually left unexposed.

2.2.3. Explicit RDMA Operation

A local endpoint tags memory areas to be involved in RDMA, exposes the areas, then advertises the coordinates of those areas to a remote endpoint via a Send operation. The remote endpoint transfers data into or out of those areas using RDMA Read and Write operations. The remote registers large sink buffers as needed, and invalidates them when data transfer is complete. Finally, the remote endpoint signals that its work is done, and the local endpoint ensures remote access to the memory areas is no longer allowed.

This transfer mode can be utilized to convey large whole RPC messages, although typically only one data item within a message is large. Explicit RDMA is most often used to move large argument or result data items directly into place. The remaining portions of the message are conveyed via a channel operation.

2.3. Transfer Models

A "transfer model" describes which endpoint is responsible for performing RDMA Read and Write operations. The opposite endpoint must expose part or all of its memory, and advertise the coordinates of that memory.

2.3.1. Read-Read

Requesters expose their memory to the responder, and the responder exposes its memory to requesters. The responder employs RDMA Read operations to convey RPC arguments or whole RPC calls. Requesters employ RDMA Read operations to convey RPC results or whole RPC replies.

Although this model is specified in RFC 5666, no current RPC-over-RDMA Version One implementation uses the Read-Read transfer model.

2.3.2. Write-Write

Requesters expose their memory to the responder, and the responder exposes its memory to requesters. Requesters employ RDMA Write operations to convey RPC arguments or whole RPC calls. The responder employs RDMA Write operations to convey RPC results or whole RPC replies. The Write-Write transfer model is not considered in RFC 5666.

2.3.3. Read-Write

Requesters expose their memory to the responder, but the responder does not expose its memory. The responder employs RDMA Read operations to convey RPC arguments or whole RPC calls. The responder employs RDMA Write operations to convey RPC results or whole RPC replies.

This model is specified in RFC 5666. All known RPC-over-RDMA Version One implementations employ this model. For clarity, the remainder of this document considers only the Read-Write transfer model.
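To make the Read-Write model's division of labor concrete, the following is a minimal sketch of one RPC from the requester's point of view. The helper names (register_region, post_send_with_chunks, and so on) are hypothetical placeholders, not any particular verbs or RPC runtime API; the sketch only illustrates the ordering described above.

<CODE BEGINS>
/* Hypothetical sketch of the Read-Write transfer model from the
 * requester's side.  The helper functions declared here stand in
 * for a real RDMA verbs provider and RPC runtime.
 */
#include <stdint.h>
#include <stddef.h>

struct region {
	uint32_t handle;   /* steering tag from registration */
	uint64_t offset;   /* start of the exposed region */
	uint32_t length;   /* size of the exposed region */
};

/* Placeholders -- not a real API. */
void register_region(void *addr, size_t len, struct region *r);
void post_send_with_chunks(const struct region *read_chunk,
			   const struct region *write_chunk);
void wait_for_reply(void);
void invalidate_region(struct region *r);

void requester_send_call(void *arg_buf, size_t arg_len,
			 void *result_buf, size_t result_len)
{
	struct region read_chunk, write_chunk;

	/* 1. Expose argument and result memory to the responder. */
	register_region(arg_buf, arg_len, &read_chunk);
	register_region(result_buf, result_len, &write_chunk);

	/* 2. Advertise those coordinates in the RPC-over-RDMA header
	 *    and convey the header with a Send (channel) operation. */
	post_send_with_chunks(&read_chunk, &write_chunk);

	/* 3. The responder pulls arguments with RDMA Read, pushes
	 *    results with RDMA Write, then Sends the RPC reply. */
	wait_for_reply();

	/* 4. Revoke remote access before the buffers are reused. */
	invalidate_region(&read_chunk);
	invalidate_region(&write_chunk);
}
<CODE ENDS>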
2.4. Upper Layer Binding Specifications

RFC 5666 provides a framework for conveying RPC requests and replies on RDMA transports. By itself this is insufficient to enable an RPC program, referred to as an "Upper Layer Protocol" or ULP, to operate over an RDMA transport. Arguments and results come in different sizes and have different serialization requirements, all depending on the Upper Layer Protocol. Some arguments and results are appropriate for Direct Data Placement, while others are not.

Thus RFC 5666 requires additional separate specifications that describe how each ULP may use explicit RDMA operations to enable Direct Data Placement. The set of requirements for a ULP to use an RDMA transport is known as an "Upper Layer Binding" specification, or ULB.

An Upper Layer Binding states which specific individual RPC arguments and results are permitted to be transferred via RDMA Read and Write for the purpose of Direct Data Placement. RFC 5666 refers to such arguments and results as "RDMA-eligible." RDMA-eligibility restrictions do not apply when a whole RPC call or reply is transferred via an RDMA Read or Write (long messages).

A ULB is required for each RPC program and version tuple that may operate on an RDMA transport. A ULB may be part of another specification, or it may be a stand-alone document, similar to [RFC5667].

2.5. On-The-Wire Protocol

2.5.1. Inline Operation

Each RPC call or reply message conveyed on an RDMA transport starts with an RPC-over-RDMA header, which is encoded into an XDR stream. A requester uses a Send operation to convey the RPC-over-RDMA header to a responder. A responder does likewise to convey RPC replies back to a requester. All message contents sent via Send, including the RPC-over-RDMA header and possibly an RPC message proper, are referred to as "inline content."

The RPC-over-RDMA header starts with three uint32 fields:

<CODE BEGINS>
   struct rdma_msg {
      uint32    rdma_xid;     /* Mirrors the RPC header xid */
      uint32    rdma_vers;    /* Version of this protocol */
      uint32    rdma_credit;  /* Buffers requested/granted */
      rdma_body rdma_body;
   };
<CODE ENDS>

Following these three fields is a union:

<CODE BEGINS>
   enum rdma_proc {
      RDMA_MSG=0,    /* An RPC call or reply msg */
      RDMA_NOMSG=1,  /* An RPC call or reply msg - separate body */
      . . .
      RDMA_ERROR=4   /* An RPC RDMA encoding error */
   };

   union rdma_body switch (rdma_proc proc) {
      case RDMA_MSG:
         rpc_rdma_header rdma_msg;
      case RDMA_NOMSG:
         rpc_rdma_header_nomsg rdma_nomsg;
      . . .
      case RDMA_ERROR:
         rpc_rdma_error rdma_error;
   };

   struct rpc_rdma_header {
      struct xdr_read_list   *rdma_reads;
      struct xdr_write_list  *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };

   struct rpc_rdma_header_nomsg {
      struct xdr_read_list   *rdma_reads;
      struct xdr_write_list  *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
   };
<CODE ENDS>

In either the RDMA_MSG or RDMA_NOMSG case, the RPC-over-RDMA header may advertise memory coordinates to be used for RDMA data transfers associated with this RPC. The difference between these two cases is whether the traditional RPC header itself is included in this Send operation (RDMA_MSG) or not (RDMA_NOMSG). In the former case, the RPC header follows immediately after the rdma_reply field. In the latter case, the RPC header is transferred via another mechanism (typically a separate RDMA Read operation).

A requester may use either type of message to send an RPC call message, depending on the requirements of the RPC call message being conveyed. A responder may use RDMA_NOMSG only when the requester provides a Reply chunk (see Section 4.3). A responder is free to use RDMA_MSG instead in that case, depending on the requirements of the RPC reply message.
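The fixed portion of the header is just four XDR unsigned integers followed by the three chunk lists. As a hedged illustration (not taken from any existing implementation), the following C fragment encodes the header for a simple RDMA_MSG call that carries no chunks; each empty list is represented by a single 32-bit zero, per the XDR encoding of optional data.

<CODE BEGINS>
/* Illustrative sketch: encode an RPC-over-RDMA header for an
 * RDMA_MSG call that advertises no chunks.  Field layout follows
 * the rdma_msg XDR definition above; all values are big-endian.
 */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define RDMA_MSG 0	/* rdma_proc discriminator */

static size_t encode_rdmamsg_header(uint8_t *buf, uint32_t xid,
				    uint32_t credits)
{
	uint32_t word;
	size_t off = 0;

	word = htonl(xid);      memcpy(buf + off, &word, 4); off += 4;
	word = htonl(1);        memcpy(buf + off, &word, 4); off += 4; /* rdma_vers */
	word = htonl(credits);  memcpy(buf + off, &word, 4); off += 4; /* rdma_credit */
	word = htonl(RDMA_MSG); memcpy(buf + off, &word, 4); off += 4; /* body discriminator */

	/* Empty Read list, empty Write list, no Reply chunk: each is
	 * an XDR optional encoded as a single FALSE (zero) word. */
	word = htonl(0); memcpy(buf + off, &word, 4); off += 4; /* rdma_reads */
	word = htonl(0); memcpy(buf + off, &word, 4); off += 4; /* rdma_writes */
	word = htonl(0); memcpy(buf + off, &word, 4); off += 4; /* rdma_reply */

	return off;	/* 28 bytes of RPC-over-RDMA header */
}
<CODE ENDS>

The RPC call message proper, beginning with its XID, would then be appended at the returned offset.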
2.5.2. RDMA Segment

An "RDMA segment", or just "segment", is a part of the RPC-over-RDMA header that contains the coordinates of a contiguous memory region that is to be conveyed via an RDMA Read or RDMA Write operation. The region defined by a segment is advertised in an RPC-over-RDMA header to enable the receiving endpoint to drive subsequent RDMA access of the data in that memory region.

The RPC-over-RDMA Version One XDR represents an RDMA segment with the xdr_rdma_segment struct:

<CODE BEGINS>
   struct xdr_rdma_segment {
      uint32 handle;
      uint32 length;
      uint64 offset;
   };
<CODE ENDS>

See [RFC5040] for a discussion of what the content of these fields means.

2.5.3. Chunk

A "chunk" refers to XDR stream data that is moved via an RDMA Read or Write operation. Chunk data is removed from the sender's XDR stream, is transferred by a separate RDMA operation, and is re-inserted into the XDR stream by the receiver. Each chunk is made up of one or more segments. Each segment represents a single contiguous piece of that chunk.

If a chunk is to move a whole counted array, the count of array elements is left in the XDR stream, while the array elements appear in the chunk. Individual array elements appear in the chunk in their entirety.

2.5.4. Read Chunk

One or more "read chunks" are used to advertise the coordinates of XDR stream data to be transferred via RDMA Read operations. Each read chunk is represented by the xdr_read_chunk struct:

<CODE BEGINS>
   struct xdr_read_chunk {
      uint32 position;
      struct xdr_rdma_segment target;
   };
<CODE ENDS>

RFC 5666 defines a read chunk as one RDMA segment with a Position field. The Position field indicates the location in the XDR stream where the transferred object's data would appear if it were not being moved in a chunk.

The transferred data might be contained in one contiguous memory region. That data can be represented by a single read chunk (segment). Alternatively, the transferred data might reside in multiple discontiguous memory regions. The data is then represented by a list of read chunks (segments). The Position field in each segment in this list contains the same value.

The receiver reconstructs the transferred data by concatenating the contents of each segment in list order into the receiver's XDR stream. The first segment begins at the XDR position in the Position field, and subsequent segments are concatenated afterwards until there are no more segments left at that XDR Position. This enables gathering data from multiple buffers on the sender.
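The gather behavior just described can be shown with a small sketch. The code below is hypothetical: it assumes the segment payloads have already been fetched with RDMA Read into local staging buffers, and only illustrates how segments sharing a Position are concatenated, in list order, back into the receiver's XDR stream.

<CODE BEGINS>
/* Hypothetical sketch: re-insert a read chunk into the XDR stream.
 * 'seg[i].data' holds the payload already fetched (via RDMA Read)
 * for segment i; all segments in this array share the same Position.
 */
#include <stdint.h>
#include <string.h>

struct fetched_segment {
	uint32_t position;   /* same value for every segment of the chunk */
	uint32_t length;     /* bytes fetched for this segment */
	const uint8_t *data; /* local staging buffer */
};

/* Returns the XDR stream offset just past the reassembled chunk. */
static uint32_t gather_read_chunk(uint8_t *xdr_stream,
				  const struct fetched_segment *seg,
				  unsigned int nsegs)
{
	uint32_t offset = seg[0].position;
	unsigned int i;

	for (i = 0; i < nsegs; i++) {
		/* Segments are concatenated in list order. */
		memcpy(xdr_stream + offset, seg[i].data, seg[i].length);
		offset += seg[i].length;
	}
	return offset;
}
<CODE ENDS>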
2.5.5. Write Chunk

A "Write chunk" advertises the coordinates of XDR stream data to be transferred via RDMA Write operations. A write chunk is represented by the xdr_write_chunk struct:

<CODE BEGINS>
   struct xdr_write_chunk {
      struct xdr_rdma_segment target<>;
   };
<CODE ENDS>

The sender fills each segment on the receiver, in array order, with the transferred data. This enables scattering data into multiple buffers on the receiver.

Typically the exact size of the data cannot be predicted before the responder has formed its reply. Thus the requester must provide enough space in the write chunk for the largest result the responder might generate for this RPC operation.

The responder updates the size field of each segment in the Write chunk when it returns the Write list to the requester via a matching RPC reply message. If a segment is not used, the responder sets the segment size field to zero.

Because the requester must pre-allocate the area in which the responder writes the result before the responder has formed the reply, the requester cannot know in advance the XDR stream position or exact size of the result data. Thus write chunks do not have a Position field.

2.5.6. Read List

Each RPC-over-RDMA Version One call has one "Read list," provided by the requester. A requester provides the locations of RDMA-eligible argument data via read chunks. Via a Position Zero read chunk, a requester may provide an entire RPC request message as a chunk in this list.

A Read list is represented by the xdr_read_list struct:

<CODE BEGINS>
   struct xdr_read_list {
      struct xdr_read_chunk entry;
      struct xdr_read_list  *next;
   };
<CODE ENDS>

RFC 5666 does not restrict the order of read chunks in the Read list, but read chunks with the same value in their Position fields are ordered (see above). The Read list may be empty if the RPC call has no argument data that is RDMA-eligible and the Position Zero Read chunk is not being used.

2.5.7. Write List

Each RPC-over-RDMA Version One call has one "Write list," provided by the requester. A requester provides write chunks in this list as receptacles for RDMA-eligible result data. A Write list is represented by the xdr_write_list struct:

<CODE BEGINS>
   struct xdr_write_list {
      struct xdr_write_chunk entry;
      struct xdr_write_list  *next;
   };
<CODE ENDS>

Note that this looks similar to a Read list, but because an xdr_write_chunk is an array and not an RDMA segment, the two data structures are not the same.

The Write list may be empty if there is no RDMA-eligible result data to return. The requester provides as many Write chunks as the Upper Layer Binding allows for the particular operation. The responder fills in each Write chunk with an RDMA-eligible result until the Write list is exhausted or there are no more RDMA-eligible results.

2.5.8. Position Zero Read Chunk

A requester may use a "Position Zero read chunk" to convey part or all of an RPC call, rather than including the RPC call message inline. A Position Zero read chunk is necessary if the RPC call message is too large to fit inline. RFC 5666 Section 5.1 defines the operation of a "Position Zero read chunk."

To support gathering a large RPC call message from multiple locations on the requester, a Position Zero read chunk may comprise more than one xdr_read_chunk. Each read chunk that belongs to the Position Zero read chunk has the value zero in its Position field.

2.5.9. Reply Chunk

Each RPC-over-RDMA Version One call may have one "Reply chunk," provided by the requester. A responder may convey an entire RPC reply message in this chunk. A Reply chunk is a write chunk, thus it is an array of one or more RDMA segments. This enables a requester to control where the responder scatters the parts of an RPC reply message. In current implementations, there is only one RDMA segment in a Reply chunk.

A requester provides the Reply chunk whenever it predicts the responder's reply might not fit inline. It may choose to provide the Reply chunk even when the responder can return only a small reply. A responder may use a "Reply chunk" to convey most or all of an RPC reply, rather than including the RPC reply message inline.
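How a responder chooses among these mechanisms follows from the rules above. As a hedged sketch only, the decision might look like the following; the names are illustrative and the inline threshold is implementation- and configuration-specific.

<CODE BEGINS>
/* Hypothetical sketch of a responder choosing how to send a reply.
 * reply_len is the size of the whole RPC reply's XDR stream after
 * any RDMA-eligible results have been moved into Write chunks.
 */
#include <stdbool.h>
#include <stddef.h>

enum reply_disposition {
	SEND_RDMA_MSG,     /* reply body fits inline */
	SEND_RDMA_NOMSG,   /* reply body goes in the Reply chunk */
	SEND_RDMA_ERROR    /* no way to convey this reply */
};

static enum reply_disposition
choose_reply_type(size_t reply_len, size_t inline_threshold,
		  bool reply_chunk_provided)
{
	if (reply_len <= inline_threshold)
		return SEND_RDMA_MSG;

	/* RDMA_NOMSG is permitted only when the requester supplied
	 * a Reply chunk (see Sections 2.5.1 and 2.5.9). */
	if (reply_chunk_provided)
		return SEND_RDMA_NOMSG;

	return SEND_RDMA_ERROR;
}
<CODE ENDS>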
3. Specification Issues

3.1. Extensibility Considerations

RPC-based protocols are defined solely by their XDR definitions. They are independent of the transport mechanism used to convey base RPC messages. Protocols defined this way often have significant extensibility restrictions placed on them. Not all restrictions on RPC-based Upper Layer Protocols may be appropriate for an RPC transport protocol, however. TCP [RFC0793], for example, is an RPC transport protocol that has been extended many times independently of the RPC and XDR standards.

RPC-over-RDMA is partially specified by XDR, and it provides a version field in its headers. However, it is distinct from other RPC- and XDR-based protocols in some key ways:

o  Although it uses XDR encoding, RPC-over-RDMA is not an RPC program, nor is it an Upper Layer Protocol

o  XDR objects in RPC-over-RDMA headers exist near to but outside the embrace of an RPC message

o  RPC-over-RDMA relies on a more sophisticated set of base transport operations than traditional socket-based transports

o  The RDMA operations generated by verbs are not part of any XDR definition; however, interoperability depends on RPC-over-RDMA implementations using these verbs in a particular way

There are still reasonable restrictions, of course, that are necessary to maintain interoperability within a single Version of RPC-over-RDMA. But they are left largely unstated in RFC 5666.

3.1.1. Recommendations

RFC 5666bis should not alter the basic physical operations that are in use by current implementations. It should not alter the on-the-wire appearance of RPC-over-RDMA Version One headers, which are never conveyed by explicit RDMA operations.

Although it is implied, RFC 5666bis should state explicitly that all items in an RPC-over-RDMA Version One header must be conveyed via Send and Receive operations (i.e., none of these items is ever RDMA-eligible).

RFC 5666bis should discuss when a Version bump is required. Any significant changes to the way RDMA operations are used should require a Version bump, for instance. Certain limited XDR changes might be allowed, as long as the standards-specified set of wire behaviors remains intact.

3.2. XDR Clarifications

Even seasoned NFS/RDMA implementers have had difficulty agreeing on precisely what a "chunk" is, and have had challenges distinguishing the structure of the Read list from the structure of the Write list.

On occasion, the text of RFC 5666 uses the term "chunk" to represent either read chunks or write chunks, even though these are different data types and have different semantics. For example, RFC 5666 Section 3.4 uses the term "chunk list entry" even though the discussion is referring to an array element (a segment). It implies all chunk types have a Position field, even though only read chunks have this field. Near the end of Section 3.4, it says:

   Therefore, read chunks are encoded into a read chunk list as a
   single array, with each entry tagged by its (known) size and its
   argument's or result's position in the XDR stream.

The Read list is not an XDR array, it is always an XDR list. A Write chunk is an XDR array.

RFC 5666 Section 3.7, third paragraph, uses the terms "chunked element" and "chunk segment." Neither term is defined or used anywhere else. The fourth paragraph refers to a "sequence of chunks" but likely means a sequence of RDMA segments.

The XDR definition for a read chunk is an RDMA segment with a position field. RFC 5666 Section 3.4 states that multiple xdr_read_chunk structs can make up a single RPC argument if they share the same Position in the XDR stream. Some implementations depend on using multiple RDMA segments in the same XDR Position, particularly for sending Position Zero read chunks efficiently by gathering an RPC call message from multiple discontiguous memory locations. Other implementations do not support sending or receiving multiple Read chunks with the same Position.

An Upper Layer Binding may limit the number of Read list entries allowed for a particular operation.
In that case, the Upper Layer Binding is not restricting the total number of read chunks in the list, but rather the number of distinct Positions that appear in the list.

RFC 5666 does not restrict the boundaries of a chunk other than to imply that a chunk's starting position and its length are each a multiple of an XDR data unit. However, implementations have observed a practical restriction to facilitate straightforward integration of RDMA support into existing XDR infrastructure: a chunk containing RDMA-eligible data must be encoded or decoded as a single XDR object. In addition, Upper Layer Bindings make RDMA-eligibility statements about specific arguments and results (or portions thereof which still are whole XDR objects). The implication is that chunks contain only whole XDR objects, even though RFC 5666 is not explicit about this.

A Position Zero read chunk typically contains an entire RPC request message, and a Reply chunk contains an entire RPC reply message. These are exceptions to the above restriction.

The Write list is especially confusing because it is a list of arrays of RDMA segments, rather than a simple list of xdr_read_chunk objects. What is referred to as a Read list entry often means one xdr_read_chunk, or one segment. That segment can be either a portion of or a whole XDR object. A Write list entry is an array, and always represents a single XDR object in its entirety. An Upper Layer Binding may limit the number of chunks in a Write list allowed for a particular operation. That strictly limits the number of Write list entries.

Not having a firm one-to-one correspondence between read chunks and XDR objects is sometimes awkward. The two chunk types should be more symmetrical to avoid confusion, although that might be difficult to pull off without altering the RPC-over-RDMA Version One XDR definition. As we will see later, the XDR roundup rules also appear to apply asymmetrically to read chunks and write chunks.

Implementers have been aided by the ASCII art block comments in the Linux kernel in net/sunrpc/xprtrdma/rpcrdma.c, excerpted here. This diagram shows exactly how the Read list and Write list are constructed in an XDR stream.

<CODE BEGINS>
   /*
    * Encoding key for single-list chunks
    * (HLOO = Handle32 Length32 Offset64):
    *
    *  Read chunklist (a linked list):
    *   N elements, position P (same P for all chunks of same arg!):
    *    1 - PHLOO - 1 - PHLOO - ... - 1 - PHLOO - 0
    *
    *  Write chunklist (a list of (one) counted array):
    *   N elements:
    *    1 - N - HLOO - HLOO - ... - HLOO - 0
    *
    *  Reply chunk (a counted array):
    *   N elements:
    *    1 - N - HLOO - HLOO - ... - HLOO
    */
<CODE ENDS>

3.2.1. Recommendations

To aid in reader understanding, RFC 5666bis should expand the glossary that explains and distinguishes the various elements in the protocol. Upper Layer Binding specifications refer to these terms. RFC 5666bis should utilize and capitalize these glossary terms consistently.

RFC 5666bis should introduce additional diagrams that supplement the XDR definition in RFC 5666 Section 4.3. RFC 5666bis should explain the structure of the XDR and how it is used. RFC 5666bis should contain an explicit but brief rationale for the structural differences between the Read list and the Write list.

RFC 5666bis should explicitly restrict chunks containing RDMA-eligible data so that a chunk represents exactly a single XDR object in its entirety.

RFC 5666bis should use a consistent naming convention for all XDR definitions. For example, all structures and union names should use an "rpcrdma1_" prefix.
To address conflation of a read chunk that is a single xdr_read_chunk and a read chunk that is a list of xdr_read_chunk elements with identical Position field values, the following specification changes should be made: o The XDR definition should rename the xdr_read_chunk struct as rpcrdma1_read_segment. o RFC 5666bis should redefine a "read chunk" as an ordered list of one or more rpcrdma1_read_segment structs that have identical Position values. o RFC 5666bis should redefine the "Read list" as a list of zero or more read chunks, expressed as an ordered list of rpcrdma1_read_segment structs whose Position value may vary. Segment positions in the list are non-descending. With these changes, there would no longer be a simple XDR object that explicitly represents a read chunk, but a read chunk and a write chunk are now equivalent objects that both map to a whole XDR object. All discussion should take care to use the terms "segment" and "read segment" instead of the term "read chunk" where appropriate. As a clean up, RFC 5666bis should remove the rpc_rdma_header_nomsg struct, and use the rpc_rdma_header struct in its place. Since rpc_rdma_header does not comprise the entire RPC-over-RDMA header, it should be renamed rpcrdma1_chunks to avoid confusion. XDR definitions should be enclosed in CODE BEGINS and CODE ENDS delimiters. An appropriate copyright block should accompany the XDR definitions in RFC 5666bis. An XDR extraction shell script should be provided in the text. See Section 10 for a full listing of the proposed XDR definitions. 3.3. The Position Zero Read Chunk RFC 5666 Section 5.1 defines the operation of the Position Zero read chunk. A requester uses the Position Zero read chunk in place of inline content. A requester is required to use the Position Zero read chunk when the total size of an RPC call message exceeds the size of the responder's receive buffers, and RDMA-eligible data has already been removed from the message. RFC 5666 Section 3.4 says: Semantically speaking, the protocol has no restriction regarding data types that may or may not be represented by a read or write chunk. In practice however, efficiency considerations lead to the conclusion that certain data types are not generally "chunkable". Typically, only those opaque and aggregate data types that may attain substantial size are considered to be eligible. With today's hardware, this size may be a kilobyte or more. However, any object MAY be chosen for chunking in any given message. The eligibility of XDR data items to be candidates for being moved as data chunks (as opposed to being marshaled inline) is not specified by the RPC-over-RDMA protocol. Chunk eligibility criteria MUST be determined by each upper-layer in order to provide for an interoperable specification. The intention of this text is to spell out that RDMA-eligibility applies only to individual XDR data objects in the Upper Layer Protocol. RDMA-eligibility criteria are specified within a separate specification, rather than in RFC 5666. The Position Zero read chunk is an exception to both of these guidelines. The Position Zero read chunk, by virtue of the fact that it typically conveys an entire RPC call message, may contain multiple arguments, independent of whether any particular argument in the RPC call is RDMA-eligible. Unlike the read chunks described in the RFC 5666 excerpt above, the content of a Position Zero read chunk is typically marshaled and copied on both ends of the transport, so it cannot benefit from Direct Data Placement. 
In particular, the Position Zero read chunk is not for conveying performance-critical Upper Layer operations. Thus the requirements for what may or may not appear in the Position Zero read chunk are indeed specified by RFC 5666, in contradiction to the second paragraph quoted above. Upper Layer Binding specifications may have something to say about what may appear in the Position Zero read chunk, but the basic definition of Position Zero should be made clear in RFC 5666bis as distinct from a read chunk whose Position field is non-zero.

Because a read chunk is defined as one RDMA segment with a Position field, at least one implementation allows only a single chunk segment in Position Zero read chunks. This is a problem for two reasons:

o  Some RPCs are constructed in multiple non-contiguous buffers. Allowing only one read segment in Position Zero would mean a single large contiguous buffer would have to be allocated and registered, and then the components of the XDR stream would have to be copied into that buffer.

o  Some requesters might not be able to register memory regions larger than the platform's physical page size. Allowing only one read segment in Position Zero would limit the maximum size of RPC-over-RDMA messages to a single page.

Allowing multiple read segments means the message size can be as large as the maximum number of read chunks that can be sent in an RPC-over-RDMA header. RFC 5666 does not limit the number of read segments in a read chunk, nor does it limit the number of chunks that can appear in the Read list. The Position Zero read chunk, despite its name, is not limited to a single xdr_read_chunk.

3.3.1. Recommendations

RFC 5666bis should state that the guidelines in RFC 5666 Section 3.4 apply only to RDMA_MSG type calls. When the Position Zero read chunk is introduced in RFC 5666 Section 5.1, enumerate the differences between it and the read chunks previously described in RFC 5666 Section 3.4. RFC 5666bis should describe what restrictions an Upper Layer Binding may make on Position Zero read chunks.
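To illustrate why multiple read segments matter here, the following sketch builds a Position Zero read chunk from an RPC call message that lives in several discontiguous buffers. The registration helper is hypothetical; the point is only that every segment carries Position zero and that the segments together cover the whole call message in order.

<CODE BEGINS>
/* Hypothetical sketch: describe a Position Zero read chunk that
 * gathers an RPC call message from several discontiguous buffers.
 * register_buffer() stands in for whatever memory registration the
 * local RDMA provider requires.
 */
#include <stdint.h>
#include <stddef.h>

struct buf_slice {
	const void *base;
	size_t      len;
};

struct read_segment {
	uint32_t position;  /* zero for every segment of this chunk */
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

int register_buffer(const void *addr, size_t len,
		    uint32_t *handle, uint64_t *offset);

static int build_position_zero_chunk(const struct buf_slice *bufs,
				     unsigned int nbufs,
				     struct read_segment *segs)
{
	unsigned int i;

	for (i = 0; i < nbufs; i++) {
		segs[i].position = 0;   /* all segments share Position Zero */
		segs[i].length = (uint32_t)bufs[i].len;
		if (register_buffer(bufs[i].base, bufs[i].len,
				    &segs[i].handle, &segs[i].offset))
			return -1;
	}
	return 0;
}
<CODE ENDS>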
3.4. RDMA_NOMSG Call Messages

The second paragraph of RFC 5667 Section 4 says, in reference to NFSv2 and NFSv3 WRITE and SYMLINK operations:

   . . . a single RDMA Read list entry MAY be posted by the client to
   supply the opaque file data for a WRITE request or the pathname for
   a SYMLINK request.  The server MUST ignore any Read list for other
   NFS procedures, as well as additional Read list entries beyond the
   first in the list.

However, large non-write NFS operations are conveyed via a Read list containing at least a Position Zero read chunk. Strictly speaking, the above requirement means large non-write NFS operations may never be conveyed, because the responder MUST ignore the read chunk in such requests.

It is likely the authors of RFC 5667 intended this limit to apply only to RDMA_MSG type calls. If that is true, however, an NFS implementation could legally skirt the stated restriction simply by using an RDMA_NOMSG type call that conveys both a Position Zero and a non-zero position read chunk to send a non-write NFS operation. Unless either RFC 5666 or the protocol's Upper Layer Binding explicitly prohibits it, allowing a read chunk in a non-zero Position in an RDMA_NOMSG type call means an Upper Layer Protocol may ignore Binding requirements like the above.

Typically there is no benefit to allowing multiple read chunks for RDMA_NOMSG type calls. Any non-zero Position read segments can always be conveyed as part of the Position Zero read chunk. However, there is a class of RPC operations where RDMA_NOMSG with multiple read chunks is useful: when the body of an RPC call message is larger than the inline buffer size, even after RDMA-eligible argument data has been moved to read chunks.

A similar discussion applies to RDMA_NOMSG replies with large reply bodies and RDMA-eligible result data. Such replies would use both the Write list and the Reply chunk simultaneously. However, write chunks do not have Position fields.

3.4.1. Recommendations

RFC 5666bis should continue to allow RDMA_NOMSG type calls with additional read chunks. The rules about RDMA-eligibility in RFC 5666bis should discuss when the use of this construction is beneficial, and when it should be avoided. Authors of Upper Layer Bindings should be warned about ignoring these cases. RFC 5666bis should provide a default behavior that applies when Upper Layer Bindings omit this discussion.

3.5. RDMA_MSG Call with Position Zero Read Chunk

The first item in the header of both RPC calls and RPC replies is the XID field [RFC5531]. RFC 5666 Section 4.1 says:

   A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
   the RPC call or RPC reply message body, beginning with the XID.

This is a strong implication that the RPC header in an RDMA_MSG type message starts at XDR position zero. Assume for a moment that, by definition, the RPC header in an RPC-over-RDMA XDR stream starts at XDR position zero.

An RDMA_MSG type call message includes the RPC header and zero or more read chunks. Recall the definition of a read chunk as a list of read segments whose Position field contains the same value. The value of the Position field determines where the read chunk appears in the XDR stream that comprises an RPC call message. A Position Zero read chunk, therefore, starts at XDR position zero, just like the RPC header does. In an RDMA_NOMSG type call message, which does not include an RPC header, a Position Zero read chunk conveys the RPC header.

There is no prohibition in RFC 5666 against an RDMA_MSG type call message with a Position Zero read chunk. However, it is not clear how a responder should interpret such a message. RFC 5666 requires the RPC header to start at XDR position zero, but there is a Position Zero read chunk, which also starts at XDR position zero.

3.5.1. Recommendations

RFC 5666bis should clearly define what is meant by an XDR stream. RFC 5666bis should state that the value in the xdr_read_chunk "position" field is measured relative to the start of the RPC header, which is the first byte of the header's XID field.

RFC 5666bis should prohibit requesters from providing a Position Zero read chunk in RDMA_MSG type calls. Likewise, RFC 5666bis should prohibit responders from utilizing a Reply chunk in RDMA_MSG type replies.

The diagrams in RFC 5666 Section 3.8 which number chunks starting with 1 should be revised. Readers confuse this number with an XDR position.
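If the prohibition recommended above were adopted, a responder could enforce it with a simple sanity check while parsing the Read list of an incoming call. A minimal sketch, using illustrative type names rather than any particular implementation's:

<CODE BEGINS>
/* Minimal sketch: reject an RDMA_MSG call that carries a Position
 * Zero read chunk, per the recommendation above.  Types are
 * illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>

#define RDMA_MSG 0

struct parsed_read_segment {
	uint32_t position;
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

static bool call_header_is_valid(uint32_t proc,
				 const struct parsed_read_segment *segs,
				 unsigned int nsegs)
{
	unsigned int i;

	if (proc != RDMA_MSG)
		return true;	/* rule applies only to RDMA_MSG calls */

	for (i = 0; i < nsegs; i++)
		if (segs[i].position == 0)
			return false;	/* would collide with the inline RPC header */

	return true;
}
<CODE ENDS>

A responder detecting this condition might respond with an RDMA_ERROR type message rather than guessing at the requester's intent.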
3.6. Padding Inline Content After A Chunk

To help clarify the discussion in this section, the term "read chunk" here always means the new definition, where one or more read segments that have identical values in their Position fields represent exactly one RDMA-eligible XDR object.

A read chunk conveys a large argument payload via one or more RDMA transfers. For instance, the data payload of an NFS WRITE operation may be transferred using a read chunk [RFC5667].

NFSv3 WRITE operations place the data payload at the end of an RPC call message [RFC1813]. The RPC call's XDR stream starts in an inline buffer, continues in a read chunk, then ends there. An NFSv4 WRITE operation may occur as a middle operation in an NFSv4 COMPOUND [RFC5661]. The read chunk containing the data payload argument of the WRITE operation might finish before the RPC call's XDR stream does. In this case, the RPC call's XDR stream starts in an inline buffer, continues in the Read list, then finishes back in the inline buffer.

The length of a chunk is the sum of the lengths of the segments that make up that chunk. The data payload in a chunk may have a length that is not evenly divisible by four. One or more of the segments may have an unaligned length.

RFC 5666 Section 3.7 describes how to manage XDR roundup in a read chunk when its length is not XDR-aligned. The sender is not required to send the extra pad bytes at the end of a chunk because a) the receiver never references their content, therefore it is wasteful to transmit them, and b) each read chunk has a Position field and length that determine exactly where that chunk starts and ends in the XDR stream.

A question arises, however, when considering where the next XDR object after a read chunk should appear. XDR requires each object to begin on 4-byte alignment [RFC4506]. But a read chunk's XDR padding is optional (see above) and thus may not appear in the chunk as actual zero bytes.

The next read chunk's Position field determines where it is placed in the XDR stream, so in that case there is no ambiguity. Inline content following a read chunk does not have a Position field to guide the receiver in the reassembly of the XDR stream, however. Paragraph 4 of RFC 5666 Section 3.7 says:

   When roundup is present at the end of a sequence of chunks, the
   length of the sequence will terminate it at a non-4-byte XDR
   position.  When the receiver proceeds to decode the remaining part
   of the XDR stream, it inspects the XDR position indicated by the
   next chunk.  Because this position will not match (else roundup
   would not have occurred), the receiver decoding will fall back to
   inspecting the remaining inline portion.  If in turn, no data
   remains to be decoded from the inline portion, then the receiver
   MUST conclude that roundup is present, and therefore it advances
   the XDR decode position to that indicated by the next chunk (if
   any).  In this way, roundup is passed without ever actually
   transferring additional XDR bytes.

This paragraph adequately describes XDR padding requirements when a read chunk is followed by another read chunk. But it leaves unspoken any requirements for XDR padding and alignment when a read chunk is followed in the XDR stream by more inline content.

Applying the rules of XDR, the XDR pad for the read chunk must not appear in the inline content, even if it was also not included in the chunk itself. This is because the inline content that preceded the read chunk will have been padded to 4-byte alignment. The next position in the inline buffer is already on a 4-byte boundary, thus no padding is necessary.
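The roundup arithmetic itself is simple. A small worked sketch, not tied to any implementation:

<CODE BEGINS>
/* XDR roundup arithmetic for a chunk of arbitrary length.  For
 * example, a 517-byte chunk starting at XDR position 100 need not
 * carry its 3 pad bytes on the wire, but the next XDR object in
 * the stream begins at the rounded-up position 620.
 */
#include <stdint.h>
#include <assert.h>

static uint32_t xdr_pad_bytes(uint32_t chunk_len)
{
	return (4 - (chunk_len & 3)) & 3;
}

static uint32_t next_xdr_position(uint32_t chunk_position, uint32_t chunk_len)
{
	return chunk_position + chunk_len + xdr_pad_bytes(chunk_len);
}

int main(void)
{
	assert(xdr_pad_bytes(517) == 3);
	assert(next_xdr_position(100, 517) == 620);
	assert(xdr_pad_bytes(520) == 0);
	return 0;
}
<CODE ENDS>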
3.6.1. Recommendations

State the above requirement in RFC 5666bis in its equivalent of RFC 5666 Section 3.7. When a responder forms a reply, the same restriction applies to inline content interleaved with write chunks.

Because all XDR objects must start on an XDR alignment boundary, all read and write chunks and all inline XDR objects in any XDR stream must start on an XDR alignment boundary. This has implications for the values allowed in read chunk Position fields, for how XDR roundup works for chunks, and for how XDR objects are placed in inline buffers. XDR alignment in inline buffers is always relative to Position Zero (or, where the RPC header starts).
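One practical consequence, were this stated normatively, is that a receiver could sanity-check incoming Read list entries. A sketch of such a check (not drawn from any existing implementation):

<CODE BEGINS>
/* Sketch: verify that every read segment's Position is on an XDR
 * (4-byte) alignment boundary, as the recommendation above implies.
 */
#include <stdbool.h>
#include <stdint.h>

static bool read_positions_are_aligned(const uint32_t *positions,
					unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		if (positions[i] & 3)
			return false;
	return true;
}
<CODE ENDS>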
3.7. Write Chunk XDR Roundup

The final paragraph of RFC 5666 Section 3.7 says:

   For RDMA Write Chunks, a simpler encoding method applies.  Again,
   roundup bytes are not transferred, instead the chunk length sent to
   the receiver in the reply is simply increased to include any
   roundup.

A responder should avoid writing XDR pad bytes, as the requester's upper layer does not reference them, though the language does not fully prohibit writing these bytes. A requester always provides the extra space for XDR padding anyway.

A problem arises if the data item written into a Write chunk is shorter than the chunk and requires an XDR pad. A responder may write the XDR pad past the end of the XDR data content. For a short directly-placed write, the pad bytes are then exposed in the RPC consumer's data buffer. In addition, for the chunk length to be rounded up as described, the requester must provide adequate extra space in the chunk for the XDR pad.

A requester can provide space for the XDR pad using one of two approaches:

1.  It can extend the last segment in the chunk.

2.  It can provide another segment after the segments that receive RDMA Write payloads.

Case 1 is adequate when there is no danger that the responder's RDMA Write operations will overwrite existing data on the requester in memory following the advertised receive buffers. In Direct Data Placement scenarios, an extra segment must be provided separately to avoid overwriting existing data that follows the sink buffer (case 2). Thus, an extra registration is needed for just a handful of bytes that may not be written by the responder, and are ignored by the requester. Even so, this does not force the responder to direct the XDR pad bytes into this extra segment, should the data item in that chunk be shorter than the chunk itself. Registering the extra buffer is a needless cost.

It would be more efficient if the XDR pad at the end of a write chunk were treated the same as it is for Read chunks. Because RPC result data must begin on an XDR alignment boundary, the result following the write chunk in the reply's XDR stream must begin on an XDR alignment boundary. There is no need for an XDR pad to be present for the receiver to re-assemble the RPC reply's XDR stream properly.

One responder implementation requires the requester to provide the extra buffer space in the Write chunk, but does not write to it. This follows the letter of the last paragraph of Section 3.7 of [RFC5666]. Another responder implementation does not rely on having the extra space (operation proceeds if it is missing), but when the extra space is present, this responder does write zeroes to it. While the intention of Section 3.7 is that the responder does not write the pad, it is not strictly forbidden.

Client implementations all appear to provide the extra buffer space needed to accommodate the XDR pad. However, one implementation does not register this extra buffer, since the responder is not expected to write into it, while another implementation does. These implementations may not be 100% interoperable.

The language of Section 3.7 of [RFC5666] appears to allow all of this behavior (in particular, it does not prohibit a responder from writing the XDR pad using RFC2119-style keywords, and does not require that requesters register the extra space to accommodate the XDR pad).

Note that because the Reply chunk is a write chunk, these roundup rules also apply to it.
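The difference between the two approaches above amounts to how the requester sizes and registers the sink buffers. A hedged sketch, assuming for simplicity a single-segment Write chunk and illustrative names:

<CODE BEGINS>
/* Sketch of the two ways a requester can leave room for a Write
 * chunk's XDR pad.  XDR_PAD() and the segment layout are
 * illustrative; expected_len is the largest result anticipated.
 */
#include <stdint.h>

#define XDR_PAD(len)	((4 - ((len) & 3)) & 3)

struct wc_segment {
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

/* Case 1: extend the (single) data segment so it covers the pad as
 * well.  Safe only when the bytes just past the sink buffer may be
 * scribbled on. */
static void extend_last_segment(struct wc_segment *last, uint32_t expected_len)
{
	last->length = expected_len + XDR_PAD(expected_len);
}

/* Case 2: append a separate, tiny segment that exists only to
 * receive the pad, leaving the data sink buffer untouched. */
static void add_pad_segment(struct wc_segment *pad_seg, uint32_t expected_len,
			    uint32_t pad_handle, uint64_t pad_offset)
{
	pad_seg->handle = pad_handle;
	pad_seg->offset = pad_offset;
	pad_seg->length = XDR_PAD(expected_len);
}
<CODE ENDS>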
3.7.1. Recommendations

The current specification allows XDR pad bytes to leak into user buffers, and none of the current implementations prevent this leak. There may be room to adjust the protocol specification independently of current implementation behavior.

RFC 5666bis should explicitly discuss the requirements around write chunk roundup separately from the discussion of read chunk roundup. Explicit RFC2119-style interoperability requirements should be provided for write chunks. Responders MUST NOT write XDR pad bytes at the end of a Write chunk.

Allocating and registering extra space for XDR pad bytes that are never written is wasteful. RFC 5666bis should forbid it. Responders should not expect requesters to provide space for XDR pad bytes.

3.8. Write List Error Cases

RFC 5666 Section 3.6 says:

   When a write chunk list is provided for the results of the RPC
   call, the RPC server MUST provide any corresponding data via RDMA
   Write to the memory referenced in the chunk list entries.

This requires the responder to use the Write list when it is provided. Another way to say it is that a responder is not permitted to return bulk data inline or in the Reply chunk when the requester has provided a Write list.

This requirement is less clear when it comes to situations where a particular RPC reply is allowed to use a provided Write list, but does not have a bulk data payload to return. For example, RFC 5667 Section 4 permits a requester to provide a Write list for NFS READ operations. However, NFSv3 READ operations have a union reply [RFC1813]:

<CODE BEGINS>
   struct READ3resok {
      post_op_attr   file_attributes;
      count3         count;
      bool           eof;
      opaque         data<>;
   };

   struct READ3resfail {
      post_op_attr   file_attributes;
   };

   union READ3res switch (nfsstat3 status) {
   case NFS3_OK:
      READ3resok   resok;
   default:
      READ3resfail resfail;
   };
<CODE ENDS>

When an NFS READ operation fails, no data is returned. The arm of the READ3res union which is used when a read error occurs does not have a bulk data argument. RFC 5666 does not prescribe how a responder should behave when RDMA-eligible result data for which the Write list is provided does not appear in the reply.
RFC 5666 Section 3.4 says:

   Individual write chunk list elements MAY thereby result in being
   partially or fully filled, or in fact not being filled at all.
   Unused write chunks, or unused bytes in write chunk buffer lists,
   are not returned as results, and their memory is returned to the
   upper layer as part of RPC completion.

It also says:

   The RPC reply conveys this by returning the write chunk list to
   the client with the lengths rewritten to match the actual
   transfer.

The disposition of the advertised write buffers is therefore clear.  The requirements for how the Write list must appear in an RPC reply are somewhat less than clear.  Here we are concerned with two cases:

o  When a result consumes fewer RDMA segments than the requester provided in the Write chunk for that result, what values are provided for the chunk's segment count and the lengths of the unused segments

o  When a result is not used (say, the reply uses the arm of an XDR union that does not contain the result corresponding to a Write chunk provided for that result), what values are provided for the chunk's segment count and the lengths of the unused segments

The language above suggests the proper value for the Write chunk's segment count is always the same value that the requester sent, even when the chunk is not used in the reply.  The proper value for the length of an unused segment in a Write chunk is always zero.
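The length-rewriting rule suggested above can be sketched as follows.  This is an illustration only: the structures loosely mirror the rpcrdma1 types in Appendix A, and the function name and field names are assumptions, not text from any specification.

<CODE BEGINS>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory form of one Write chunk. */
struct write_segment {
	uint32_t handle;
	uint32_t length;	/* rewritten in the reply */
	uint64_t offset;
};

struct write_chunk {
	uint32_t nsegs;			/* segment count is never altered */
	struct write_segment *segs;
};

/* Rewrite segment lengths to match the number of result bytes the
 * responder actually RDMA Wrote.  Unused trailing segments are left
 * with a length of zero; the segment count itself is unchanged. */
void
write_chunk_rewrite_lengths(struct write_chunk *wc, size_t bytes_written)
{
	size_t remaining = bytes_written;
	uint32_t i;

	for (i = 0; i < wc->nsegs; i++) {
		uint32_t consumed = wc->segs[i].length;

		if (consumed > remaining)
			consumed = (uint32_t)remaining;
		wc->segs[i].length = consumed;	/* zero when unused */
		remaining -= consumed;
	}
}
<CODE ENDS>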
Inspection of one existing server implementation shows that when an NFS READ operation fails, the returned Write list contains one entry: a chunk array containing zero elements.  Another server implementation returns the original Write list chunk in this case.  In either case, requesters appear to ignore the Write list when no bulk data payload is expected.  Thus it appears that, currently, responders may put whatever they like in the Write list.

Current NFSv4 client implementations behave like legacy NFS implementations in the sense that each READ COMPOUND requests only one contiguous data payload that is never larger than the rsize setting of the mount.  However, it is legal for an NFSv4 COMPOUND to contain more than one READ operation.  Each READ request in a COMPOUND may have an RDMA-eligible result in the COMPOUND reply.  In general, a complex Upper Layer Binding may wish to return more than one RDMA-eligible result in a single RPC reply.  Depending on the RPC program, there may be nested or sequential switched unions in the reply.

There is no Position field in the segments making up a Write chunk, so both sender and receiver must be careful about how the reply message is re-assembled.  It should always be unambiguous which Write chunk matches with which result.  To ensure interoperability, the responder associates the first RDMA-eligible result with the first chunk in the Write list, and so on, until either results or Write chunks are exhausted.  The receiver makes the same associations while parsing the XDR stream of the reply.  It should be the responsibility of the Upper Layer Binding to avoid ambiguous situations by appropriately restricting RDMA-eligible data items.

Remember that a responder MUST use the Write list if the requester provided it and the responder has RDMA-eligible result data.  If the requester has not provided enough Write chunks in the Write list, the responder may have to use a long message as well, depending on the remaining size of the RPC reply.
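On the requester side, the association rule described above might be realized along the lines of the following sketch; the cursor structure and function are hypothetical illustrations, not normative text.

<CODE BEGINS>
#include <stddef.h>

/* Minimal stand-ins for the decoded Write list; names are
 * hypothetical. */
struct decoded_write_chunk {
	size_t	length;		/* total bytes the responder wrote */
	void	*sink;		/* registered sink buffer */
};

struct reply_cursor {
	struct decoded_write_chunk	*chunks;	/* Write list, in order */
	size_t				nchunks;
	size_t				next;		/* next unconsumed chunk */
};

/* Called each time XDR decoding of the reply reaches an RDMA-eligible
 * result: the first such result is taken from the first Write chunk,
 * the second from the second, and so on.  NULL means the Write list
 * is exhausted and the data must appear inline (or the reply is
 * malformed). */
void *
next_ddp_result(struct reply_cursor *rc, size_t *length)
{
	struct decoded_write_chunk *wc;

	if (rc->next >= rc->nchunks)
		return NULL;
	wc = &rc->chunks[rc->next++];
	*length = wc->length;
	return wc->sink;
}
<CODE ENDS>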
3.8.1. Recommendations

RFC 5666bis should explicitly discuss responder behavior when an RPC reply does not need to use a Write list entry provided by a requester.  This is generic behavior, independent of any Upper Layer Binding.  The explanation can be partially or wholly copied from RFC 5667 Section 5's discussion of NFSv4 COMPOUND.

A number of places in RFC 5666 Section 3.6 hint at how a responder behaves when it is to return data that does not use every byte of every provided Write chunk segment.  RFC 5666bis should state specific requirements about how a responder should form the Write list in RPC replies, and/or it should explicitly require requesters to ignore the Write list in these cases.  RFC 5666bis should require that the responder not alter the count of segments in the Write chunk.  One or more explicit examples should be provided in RFC 5666bis.

RFC 5666bis should provide clear instructions on how Upper Layer Bindings are to be written to take care of switched unions.

4. Operational Considerations

4.1. Computing Request Buffer Requirements

The maximum size of a single Send operation includes both the RPC-over-RDMA header and the RPC header.  Combined, those two headers must not exceed the size of one receive buffer.  Senders often construct the RPC-over-RDMA header and the RPC call or reply message in separate buffers, then combine them via an iovec into a single Send.  This does not mean each element of that iovec can be as large as the inline threshold.

An HCA or RNIC may have a small limit on the size of a registered memory region.  In that case, RDMA-eligible data may be comprised of many chunk segments.  This has implications for the size of the Read and Write lists, which take up a variable amount of space in the RPC-over-RDMA header.  The sum of the size of the RPC-over-RDMA header, including the Read and Write lists, and the size of the RPC header must not exceed the inline threshold.  This limits the maximum Upper Layer payload size.

4.1.1. Recommendations

RFC 5666bis should provide implementation guidance on how the inline threshold (the maximum send size) is computed.
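A hedged sketch of the fit test described above is shown below.  The header and per-segment sizes are illustrative constants chosen for this example; they are not values mandated by RFC 5666, and the function name is hypothetical.

<CODE BEGINS>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes: four XDR words of fixed RPC-over-RDMA header
 * (xid, vers, credit, proc) plus three chunk-list discriminators. */
#define RPCRDMA_FIXED_HDR_SIZE	28

/* Illustrative per-segment costs in the transport header: a Read
 * list entry carries a position, handle, length, offset, and a list
 * discriminator; a Write list segment carries handle, length, and
 * offset. */
#define READ_SEGMENT_SIZE	24
#define WRITE_SEGMENT_SIZE	16

/* Returns true when the transport header plus the marshaled RPC
 * header fits within the connection's inline threshold, i.e. within
 * a single receive buffer at the peer. */
bool
fits_inline(size_t inline_threshold, size_t rpc_header_size,
	    size_t read_segments, size_t write_segments)
{
	size_t transport_header = RPCRDMA_FIXED_HDR_SIZE +
				  read_segments * READ_SEGMENT_SIZE +
				  write_segments * WRITE_SEGMENT_SIZE;

	return transport_header + rpc_header_size <= inline_threshold;
}
<CODE ENDS>

As the text notes, a large number of chunk segments shrinks the room left for the RPC header, which in turn limits the maximum Upper Layer payload that can be described in a single request.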
4.2. Default Inline Buffer Size

Section 6 of RFC 5666 specifies an out-of-band protocol that allows an endpoint to discover a peer endpoint's receive buffer size, to avoid overrunning the receiving buffer and causing a connection loss.  Not all RPC-over-RDMA Version One implementations also implement CCP, as it is optional.  Given the importance of knowing the receiving end's receive buffer size, there should be some way that a sender can choose a size that is guaranteed to work with no CCP interaction.

RFC 5666 Section 6.1 describes a 1KB receive buffer limit for the first operation on a connection with an unfamiliar responder.  In the absence of CCP, the client cannot discover that responder's true limit without risking the loss of the transport connection.

4.2.1. Recommendations

RFC 5666bis should specify a fixed send/receive buffer size as part of the RPC-over-RDMA Version One protocol, to use when CCP is not available.  For example, the following could be added to the RFC 5666bis equivalent of RFC 5666 Section 6.1: "In the absence of CCP, requesters and responders MUST assume 1KB receive buffers for all Send operations."

It should be safe for Upper Layer Binding specifications to provide a different default inline threshold.  Care must be taken when an endpoint is associated with multiple RPC programs that have different default inline thresholds.

4.3. When To Use Reply Chunks

RFC 5666 Section 3.6 says:

   When a write chunk list is provided for the results of the RPC
   call, the RPC server MUST provide any corresponding data via RDMA
   Write to the memory referenced in the chunk list entries.

It is not clear whether the authors of RFC 5666 intended the above requirement to apply only to the Write list, or to both the Write list and to the Reply chunk, which is not a list.  Implementation experience has shown that setting up an explicit RDMA operation to move a few hundred bytes of data is inefficient, especially if there is no DDP opportunity.  Channel operations are nearly always the best choice when handling a small RPC reply.

o  To reduce memory registration and invalidation costs, a requester might prefer to provide a Reply chunk only when a reply could be larger than the inline threshold.  To make that judgement, however, a requester must know the size of the responder's send buffers, which might be smaller than its own receive buffers.

o  Even when a requester has provided a Reply chunk, to reduce round trip costs, a responder might prefer to RDMA Write a Reply chunk only when a reply is actually larger than the inline threshold.  To make that judgement, however, the responder must know the size of the requester's receive buffers, which might be smaller than its send buffers.
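The two decisions in the list above can be sketched as follows, assuming the fixed 1KB default proposed in Section 4.2.1 when no CCP exchange has occurred; the constant and function names are hypothetical.

<CODE BEGINS>
#include <stdbool.h>
#include <stddef.h>

/* The fixed receive buffer size assumed when CCP is unavailable. */
#define DEFAULT_INLINE_THRESHOLD	1024

/* Requester side: register and offer a Reply chunk only when the
 * reply could be too large to arrive in a single Send. */
bool
requester_needs_reply_chunk(size_t max_reply_size)
{
	return max_reply_size > DEFAULT_INLINE_THRESHOLD;
}

/* Responder side: use an offered Reply chunk only when the actual
 * reply does not fit inline. */
bool
responder_uses_reply_chunk(bool reply_chunk_offered, size_t reply_size)
{
	return reply_chunk_offered &&
	       reply_size > DEFAULT_INLINE_THRESHOLD;
}
<CODE ENDS>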
If a requester does not provide a Reply chunk when one is needed, the responder must reply with ERR_CHUNK (see RFC 5666, Section 4.2).  The requester simply has to send the request again, this time with a Reply chunk.  However, ERR_CHUNK is a generic failure mode.  The requester may have some difficulty identifying the problem as a missing Reply chunk.

To maintain 100% interoperability, a requester should always provide a Reply chunk, and the responder should always use it.  However, as noted, this is likely to be inefficient.

4.3.1. Recommendations

To provide a stronger guarantee of interoperation while ensuring efficient operation, RFC 5666bis should explicitly specify when a requester must offer a Reply chunk, and when the responder must use an offered Reply chunk.  Mandating a default buffer size would allow both sides to choose correctly without an in-advance CCP exchange.

4.4. Computing Credit Values

The third paragraph of Section 3.3 of RFC 5666 leaves open the exact mechanism of how often the requested and granted credit limits are supposed to be adjusted.  A reader might believe that these values are adjusted whenever an RPC call or reply is received, to reflect the number of posted receive buffers on each side.  Although adjustments are allowed by RFC 5666 due to changing availability of resources on either endpoint, current implementations use a fixed value.  Advertised credit values are always the sum of the in-process receive buffers and the ready-to-use receive buffers.

4.4.1. Recommendations

RFC 5666bis should clarify the method used to calculate these values.  RFC 5666bis might also discuss how flow control is impacted when a server endpoint utilizes a shared receive queue.
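The fixed-value calculation used by current implementations, as described above, can be sketched as follows; the accounting structure and its field names are a hypothetical illustration.

<CODE BEGINS>
#include <stdint.h>

/* Hypothetical per-connection receive accounting.  The advertised
 * credit value is the sum of receives that are posted and ready plus
 * those consumed but not yet reposted, so the peer never sends more
 * RPCs than this endpoint can receive. */
struct recv_accounting {
	uint32_t posted;	/* receives currently posted */
	uint32_t in_process;	/* receives consumed, not yet reposted */
};

uint32_t
advertised_credits(const struct recv_accounting *ra)
{
	return ra->posted + ra->in_process;
}
<CODE ENDS>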
4.5. Race Windows

The second paragraph of RFC 5666 Section 3.3 says:

   Additionally, for protocol correctness, the RPC server must always
   be able to reply to client requests, whether or not new buffers
   have been posted to accept future receives.

It is true that the RPC server must always be able to reply, and that therefore the client must provide an adequate number of receive buffers.  The dependent clause "whether or not new buffers have been posted to accept future receives" is problematic, however.  It is not clear whether this clause refers to a server leaving even a small window where the sum of posted and in-process receive buffers is less than the credit limit; or refers to a client leaving a window where the sum of posted and in-process receive buffers is less than its advertised credit limit.  In either case, such a window could result in lost messages or be catastrophic for the transport connection.

4.5.1. Recommendations

Clarify or remove the dependent clause in the section in RFC 5666bis that is equivalent to RFC 5666 Section 3.3.

5. Pre-requisites For NFSv4

5.1. Bi-directional Operation

NFSv4.1 moves the backchannel onto the same transport as forward requests [RFC5661].  Typically RPC client endpoints do not expect to receive RPC call messages.  To support NFSv4.1 callback operations, client and server implementations must be updated to support bi-directional operation.  Because of RDMA's requirement to pre-post unadvertised receive buffers, special considerations are needed for bi-directional operation.  Conventions have been provided to allow bi-direction, with a limit on backchannel message size, such that no changes to the RPC-over-RDMA Version One protocol are needed [I-D.ietf-nfsv4-rpcrdma-bidirection].

5.1.1. Recommendations

RFC 5666bis should cite or include the bulk of [I-D.ietf-nfsv4-rpcrdma-bidirection].

6. Considerations For Upper Layer Binding Specifications

RFC 5666 requires a Binding specification for any RPC program wanting to use RPC-over-RDMA.  The requirement appears in two separate places: the fourth paragraph of Section 3.4, and the final paragraph of Section 3.6.  As critical as it is to have a Binding specification, RFC 5666's text regarding these specifications is sparse and not easy to find.
6.1. Organization Of Binding Specification Requirements

Throughout RFC 5666, various Binding requirements appear, such as:

   The mapping of write chunk list entries to procedure arguments
   MUST be determined for each protocol.

A similar specific requirement for read list entries is missing.  Usually these statements are followed by a reference to the NFS Binding specification [RFC5667].  There is no summary of these requirements, however.

Additional advice appears in the middle of Section 3.4:

   It is NOT RECOMMENDED that upper-layer RPC client protocol
   specifications omit write chunk lists for eligible replies,

This requirement, being in the middle of a dense paragraph about how write lists are formed, is easy for an author of Upper Layer Binding specifications to miss.

6.1.1. Recommendations

RFC 5666bis should summarize explicit generic requirements for the contents of an Upper Layer Binding specification in one separate section, perhaps in an Appendix.  In particular, move the third, fourth and fifth paragraph of RFC 5666 Section 3.4 to this new section discussing Binding specification requirements.

6.2. RDMA-Eligibility

Any RPC message that fits in an inline buffer is conveyed via a Send operation.  Any RPC message that is too large to fit in an inline buffer is conveyed by transferring the whole RPC message via an RDMA Read (i.e., a Position Zero Read chunk) or an RDMA Write (i.e., a Reply chunk).  RPC-over-RDMA also allows a mixture of these two mechanisms, where argument or result data is removed from the XDR stream and conveyed via a separate RDMA transfer.  The receiving end assembles the disparate buffers into a single XDR stream that represents the whole RPC message.
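The three ways of conveying an RPC call described above suggest a simple send-path decision, sketched below under the assumption that the requester already knows the total call size and the size of its DDP-eligible arguments; the enum and function names are illustrative only.

<CODE BEGINS>
#include <stddef.h>

enum xfer_method {
	XFER_INLINE,		/* RDMA_MSG: whole call in the Send */
	XFER_DDP_CHUNKS,	/* RDMA_MSG plus Read chunks for the
				 * DDP-eligible arguments */
	XFER_LONG_MESSAGE	/* RDMA_NOMSG: whole call moved via a
				 * Position Zero Read chunk */
};

/* Send it inline when it fits; otherwise remove DDP-eligible
 * arguments into chunks and retry the fit test; otherwise fall back
 * to a long message. */
enum xfer_method
choose_xfer_method(size_t call_size, size_t ddp_eligible_bytes,
		   size_t inline_threshold)
{
	if (call_size <= inline_threshold)
		return XFER_INLINE;
	if (call_size - ddp_eligible_bytes <= inline_threshold)
		return XFER_DDP_CHUNKS;
	return XFER_LONG_MESSAGE;
}
<CODE ENDS>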
RFC 5666 uses the term "RDMA eligibility" to mean that a particular argument or result object is allowed to be moved as a separate chunk for the purpose of Direct Data Placement.  The RPC program's Upper Layer Binding makes eligibility statements permitting particular RPC argument or result objects to be directly placed.  The third paragraph of Section 3.4 states that any XDR object MAY be RDMA-eligible in any given message, but that:

   Typically, only those opaque and aggregate data types that may
   attain substantial size are considered to be eligible.

Any large XDR object that can benefit from Direct Data Placement is a good candidate for being moved in a chunk.  When data alignment matters, or when the NFS stack on either end of the connection does not need to manipulate the transferred data, the Upper Layer Binding should make that object eligible for Direct Data Placement.

Section 3.4 is specifically not discussing long messages, where a whole RPC message is moved via RDMA.  When an RPC message is too large to fit inline, even after RDMA-eligible arguments or results are removed, the message is always moved via a long message.  All arguments or results in the message are moved via RDMA in this case.

For instance, an NFSv3 READDIR result can be large.  However, an NFS server assembles this result in place, encoding each section individually.  The NFS client must perform the converse actions.  Though there is potentially a large amount of data, the benefit of direct data placement is lost because of the need for both host CPUs to be involved in marshaling and decoding.  Thus the NFSv3 Upper Layer Binding [RFC5667] does not make any part of an NFSv3 READDIR reply RDMA-eligible.
However, any NFS READDIR reply that is larger than an inline buffer is still moved via RDMA (a Reply chunk, in this case).

6.2.1. Recommendations

RFC 5666bis should define the term "Upper Layer Binding", and explain what it specifies.  RFC 5666bis should explicitly require an Upper Layer Binding for every RPC program that may operate on RDMA transports.  Separate bindings may be required for different versions of that program.

The term "RDMA eligibility" should be retired.  It is easy to confuse the use of RDMA for Direct Data Placement with the use of RDMA in long messages.  Instead, RFC 5666bis should use a more precise term such as DDP-eligibility, which should be clearly defined before it is used.

RFC 5666bis should provide generic guidance about what makes an XDR object or data type eligible for Direct Data Placement.  RFC 5666bis should state that the DDP-eligibility of any XDR object not mentioned explicitly in an Upper Layer Binding is "not eligible."  RFC 5666bis should note that Position Zero read chunks and Reply chunks may contain any and all arguments and results regardless of their DDP-eligibility.  RFC 5666bis should remind authors of Upper Layer Bindings that the Reply chunk and Position Zero read chunks are expressly not for performance-critical Upper Layer operations.

It is the responsibility of the Upper Layer Binding to specify RDMA-eligibility rules so that if an RDMA-eligible XDR object is embedded within another, only one of these two objects is to be represented by a chunk.  This ensures that the mapping from XDR position to the XDR object represented is unambiguous.
6.3. Inline Threshold Requirements

An RPC-over-RDMA connection has two connection parameters that affect the operation of Upper Layer Protocols: the credit limit, which is how many outstanding RPCs are allowed on that connection; and the inline threshold, which is the maximum payload size of an RDMA Send on that connection.  All ULPs sharing a connection also share the same credits and inline threshold values.

The inline threshold is set when a connection is established.  The base RPC-over-RDMA protocol does not provide a mechanism for altering the inline threshold of a connection once it has been established.

[RFC5667] places normative requirements on the inline threshold value for a connection.  There is no guidance provided on how implementations should behave when two ULPs that have different inline threshold requirements share the same connection.  Further, current NFS implementations ignore the inline threshold requirements stated in [RFC5667].  It is unlikely that they would interoperate successfully with any new implementation that followed the letter of [RFC5667].

6.3.1. Recommendations

Upper Layer Protocols should be able to operate no matter what inline threshold is in use.  An Upper Layer Binding might provide informative guidance about optimal values of an inline threshold, but normative requirements are difficult to enforce unless connection sharing is explicitly not permitted.

6.4. Violations Of Binding Rules

Section 3.4 of RFC 5666 introduces the idea of an Upper Layer Binding specification to state which Upper Layer operations are allowed to use explicit RDMA to transfer a bulk payload item.  The fifth paragraph of this section states:

   The interface by which an upper-layer implementation communicates
   the eligibility of a data item locally to RPC for chunking is out
   of scope for this specification.  In many implementations, it is
   possible to implement a transparent RPC chunking facility.
If the Upper Layer on a receiver is not aware of the presence and operation of an RPC-over-RDMA transport under it, it could be challenging to discover when a sender has violated an Upper Layer Binding rule.  If a violation does occur, RFC 5666 does not define an unambiguous mechanism for reporting the violation.  The violation of Binding rules is an Upper Layer Protocol issue, but it is likely that there is nothing the Upper Layer can do but reply with the equivalent of BAD XDR.  When an erroneously-constructed reply reaches a requester, there is no recourse but to drop the reply, and perhaps the transport connection as well.

6.4.1. Recommendations

Policing DDP-eligibility must be done in co-operation with the Upper Layer Protocol by its receive endpoint implementation.  It is the Upper Layer Binding's responsibility to specify how a responder must reply if a requester violates a DDP-eligibility rule.  The Binding specification should provide similar guidance for requesters about handling invalid RPC-over-RDMA replies.

6.5. Binding Specification Completion Assessment

RFC 5666 Section 3.4 states:

   Typically, only those opaque and aggregate data types that may
   attain substantial size are considered to be eligible.  However,
   any object MAY be chosen for chunking in any given message.  Chunk
   eligibility criteria MUST be determined by each upper-layer in
   order to provide for an interoperable specification.

Authors of Upper Layer Binding specifications should consider each data type in the Upper Layer's XDR definition, in particular compound types such as arrays and lists, when restricting what XDR objects are eligible for Direct Data Placement.  In addition, there are requirements related to using NFS with RPC-over-RDMA in [RFC5667], and there are some in [RFC5661].
It could be helpful to have guidance about what kind of requirements belong in an Upper Layer Binding specification versus what belongs in the Upper Layer Protocol specification.

6.5.1. Recommendations

RFC 5666bis should describe what makes a Binding specification complete (i.e., ready for publication).

7. Unimplemented Protocol Features

There are features of RPC-over-RDMA Version One that remain unimplemented in current implementations.  Some are candidates to be removed from the protocol because they have proven unnecessary or were not properly specified.  Other features are unimplemented, unspecified, or have only one implementation (thus interoperability remains unproven).  These are candidates to be retained and properly specified.

7.1. Unimplemented Features To Be Removed

7.1.1. Connection Configuration Protocol

No implementation has seen fit to support the Connection Configuration Protocol.  While a need to exchange pertinent connection information remains, the preference is to exchange that information as part of the set up of each connection, rather than as settings that apply to all connections (and thus all ULPs) between two peers.

7.1.1.1. Recommendations

CCP should be removed from RFC 5666bis.

7.1.2. Read-Read Transfer Model

All existing RPC-over-RDMA Version One implementations use a Read-Write data transfer model.  The server endpoint is responsible for initiating all RDMA data transfers.  The Read-Read transfer model has been deprecated, but because it appears in RFC 5666, implementations are still responsible for supporting it.  By removing the specification and discussion of Read-Read, the protocol and specification can be made simpler and more clear.

7.1.2.1. Recommendations

Remove Read-Read from RFC 5666bis, in particular from its equivalent of RFC 5666 Section 3.8.  RFC 5666bis should require implementations not to send RDMA_DONE; an implementation receiving it should ignore it.  The XDR definition should reserve RDMA_DONE.
7.1.3. RDMA_MSGP

It has been observed that the current specification of RDMA_MSGP is not clear enough to result in interoperable implementations.  Possibly as a result, current receive endpoints do recognize and process RDMA_MSGP messages, though they do not take advantage of the passed alignment parameters.  Receivers treat RDMA_MSGP messages like RDMA_MSG messages.  Currently senders do not use RDMA_MSGP messages.

RDMA_MSGP depends on bulk payload occurring at the end of RPC messages, which is often not true of NFSv4 COMPOUND requests.  Most NFSv3 requests are small enough not to need RDMA_MSGP.  To be effective, RDMA_MSGP depends on getting alignment preferences in advance via CCP.  There are no CCP implementations to date.  Without CCP, there is no way for peers to discover a receiver endpoint's preferred alignment parameters, unless the implementation provides an administrative interface for specifying a remote's alignment parameters.  RDMA_MSGP is useless without that knowledge.

7.1.3.1. Recommendations

To maintain backward-compatibility, RDMA_MSGP must remain in the protocol.  RFC 5666bis should require implementations to not send RDMA_MSGP messages.  If an RDMA_MSGP message is seen by a receiver, it should ignore the alignment parameters and treat RDMA_MSGP messages as RDMA_MSG messages.  The XDR definition should reserve RDMA_MSGP.
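The receive-path behavior recommended in Sections 7.1.2.1 and 7.1.3.1 might be implemented as sketched below; the dispatch function and disposition values are hypothetical, while the enum values come from the XDR definition in Appendix A.

<CODE BEGINS>
/* Message types from the XDR definition in Appendix A. */
enum rpcrdma1_proc {
	RDMA_MSG   = 0,
	RDMA_NOMSG = 1,
	RDMA_MSGP  = 2,		/* reserved */
	RDMA_DONE  = 3,		/* reserved */
	RDMA_ERROR = 4
};

enum recv_disposition { PROCESS_MSG, PROCESS_NOMSG, PROCESS_ERROR,
			DISCARD, REJECT };

/* RDMA_MSGP is handled exactly like RDMA_MSG (alignment hints are
 * ignored), RDMA_DONE is silently discarded, and anything
 * unrecognized is rejected. */
enum recv_disposition
dispatch_proc(enum rpcrdma1_proc proc)
{
	switch (proc) {
	case RDMA_MSG:
	case RDMA_MSGP:
		return PROCESS_MSG;
	case RDMA_NOMSG:
		return PROCESS_NOMSG;
	case RDMA_ERROR:
		return PROCESS_ERROR;
	case RDMA_DONE:
		return DISCARD;
	default:
		return REJECT;
	}
}
<CODE ENDS>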
7.2. Unimplemented Features To Be Retained

7.2.1. RDMA_ERROR Type Messages

Server implementations the author is familiar with can send RDMA_ERROR type messages, but only when an RPC-over-RDMA version mismatch occurs.  There is no facility to return the ERR_CHUNK error.  These implementations treat unrecognized message types and other parsing errors as an RDMA_MSG type message.  Obviously this behavior does not comply with RFC 5666, but it is also recognized that this behavior is not an improvement over the specification.

7.2.1.1. Recommendations

RFC 5666bis should provide stronger guidance for error checking, and in particular, when a connection must be broken.  Implementations that do not adequately check incoming RPC-over-RDMA headers must be updated.

7.2.2. RPCSEC_GSS On RPC-over-RDMA

The second paragraph of RFC 5666 Section 11 says:

   For efficiency, a more appropriate security mechanism for RDMA
   links may be link-level protection, such as certain configurations
   of IPsec, which may be co-located in the RDMA hardware.  The use
   of link-level protection MAY be negotiated through the use of the
   new RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with
   the Channel Binding mechanism [RFC5056] and IPsec Channel
   Connection Latching [RFC5660].  Use of such mechanisms is REQUIRED
   where integrity and/or privacy is desired, and where efficiency is
   required.

However, consider:

o  As of this writing, no implementation of RPCSEC_GSS v2 Channel Binding or Connection Latching exists.  Thus, though it is sensible, this part of RFC 5666 has never been implemented.

o  Not all fabrics and RNICs support a link-layer protection mechanism that includes a privacy service.

o  When multiple users access a storage service from the same client, it is appropriate to deploy a message authentication service concurrently with link-layer protection.

Therefore, despite its performance impact, RPCSEC_GSS can add important function to RPC-over-RDMA deployments.

Currently there is an InfiniBand-only client and server implementation of RPCSEC_GSS on RPC-over-RDMA that supports the authentication, integrity, and privacy services.  This pair of implementations was created without the benefit of normative guidance from RFC 5666.  This client and server pair interoperates with each other, but there are no independent implementations to test with.

RPC-over-RDMA requesters are responsible for providing adequate reply resources to responders.  These resources require special treatment when an integrity or privacy service is in use.  Direct data placement cannot be used with software integrity checking or encryption.  Thus standards guidance is imperative to ensure that independent RPCSEC_GSS implementations can interoperate on RPC-over-RDMA transports.

7.2.2.1. Recommendations

RFC 5666bis should continue to require the use of link layer protection when facilities are available to support it.  At the least, RPCSEC_GSS per-message authentication is valuable, even if link layer protection is in use.  Integrity and privacy should also be made available even if they do not perform well, because there is no link layer protection for some fabrics.  Therefore, RFC 5666bis should provide a specification for RPCSEC_GSS on RPC-over-RDMA, codifying the one existing implementation so that others may interoperate with it.

8. Security Considerations

To enable RDMA Read and Write operations, an RPC-over-RDMA Version One requester exposes some or all of its memory to other hosts.  RFC 5666bis should suggest best implementation practices to minimize exposure to careless or potentially malicious implementations that share the same fabric.
Important considerations include:

o  The use of Protection Domains to limit the exposure of memory regions to a single connection is critical.  Any attempt by a host not participating in that connection to re-use R_keys will result in a connection failure.  Because ULP security relies on this behavior of Reliable Connections, strong authentication of the remote is recommended.

o  Unpredictable R_keys should be used for any operation requiring advertised memory regions.  Advertising a continuously registered memory region allows a remote host to read or write its contents even when an RPC involving that memory is not under way.  Therefore this practice should be avoided.

o  Advertised memory regions should be invalidated as soon as related RPC operations are complete.  Invalidation and DMA unmapping of regions should be complete before an RPC application is allowed to continue execution and use the contents of a memory region.

9. IANA Considerations

This document does not require actions by IANA.

10. Appendix A: XDR Language Description

Revised XDR definition of RPC-over-RDMA Version One.  The original definition is in Section 4.3 of RFC 5666.  The XDR stream position of the fields and their use are not altered by this revision.  The significant changes are:

1. Copyright boilerplate has been provided

2. The structure, field, and enum names have been made consistent with other standard XDR definitions

3. The xdr_read_chunk structure is now called an rpcrdma1_read_segment because that structure functions the same way that an rpcrdma1_segment element in a Write chunk array does

4. Duplicate definitions of the chunk list fields have been removed

5. As the Read-Read transfer model is deprecated, RDMA_DONE is now a reserved value

6. As RDMA_MSGP messages are deprecated, RDMA_MSGP is now a reserved value

Code components extracted from this document must include the following license:

<CODE BEGINS>
/*
 * Copyright (c) 2010, 2015 IETF Trust and the persons
 * identified as authors of the code.  All rights reserved.
 *
 * The authors of the code are:
 * B. Callaghan, T. Talpey, and C. Lever.
 *
 * Redistribution and use in source and binary forms, with
 * or without modification, are permitted provided that the
 * following conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the
 *   following disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the
 *   following disclaimer in the documentation and/or other
 *   materials provided with the distribution.
 *
 * - Neither the name of Internet Society, IETF or IETF
 *   Trust, nor the names of specific contributors, may be
 *   used to endorse or promote products derived from this
 *   software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
 * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
 * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

struct rpcrdma1_segment {
	uint32	rdma_handle;
	uint32	rdma_length;
	uint64	rdma_offset;
};

struct rpcrdma1_read_segment {
	uint32			rdma_position;
	struct rpcrdma1_segment	rdma_target;
};

struct rpcrdma1_read_list {
	struct rpcrdma1_read_segment	rdma_entry;
	struct rpcrdma1_read_list	*rdma_next;
};

struct rpcrdma1_write_chunk {
	struct rpcrdma1_segment	rdma_target<>;
};

struct rpcrdma1_write_list {
	struct rpcrdma1_write_chunk	rdma_entry;
	struct rpcrdma1_write_list	*rdma_next;
};

struct rpcrdma1_msg {
	uint32		rdma_xid;
	uint32		rdma_vers;
	uint32		rdma_credit;
	rpcrdma1_body	rdma_body;
};

enum rpcrdma1_proc {
	RDMA_MSG = 0,
	RDMA_NOMSG = 1,
	RDMA_MSGP = 2,	/* Reserved */
	RDMA_DONE = 3,	/* Reserved */
	RDMA_ERROR = 4
};

struct rpcrdma1_chunks {
	struct rpcrdma1_read_list	*rdma_reads;
	struct rpcrdma1_write_list	*rdma_writes;
	struct rpcrdma1_write_chunk	*rdma_reply;
};

enum rpcrdma1_errcode {
	RDMA_ERR_VERS = 1,
	RDMA_ERR_CHUNK = 2
};

union rpcrdma1_error switch (rpcrdma1_errcode err) {
	case RDMA_ERR_VERS:
		uint32 rdma_vers_low;
		uint32 rdma_vers_high;
	case RDMA_ERR_CHUNK:
		void;
};

union rdma_body switch (rpcrdma1_proc proc) {
	case RDMA_MSG:
	case RDMA_NOMSG:
		rpcrdma1_chunks rdma_chunks;
	case RDMA_MSGP:
		uint32 rdma_align;
		uint32 rdma_thresh;
		rpcrdma1_chunks rdma_achunks;
	case RDMA_DONE:
		void;
	case RDMA_ERROR:
		rpcrdma1_error rdma_error;
};
<CODE ENDS>

11. Appendix B: Binding Requirement Summary

This appendix collects the known generic Binding Requirements from RFC 5666 and this document.  This might not be an exhaustive list.  Note that RFC 5666 uses RFC 2119-style terms to specify binding requirements, even though the requirement statements apply to protocol specifications rather than to a particular protocol.

1. "Chunk eligibility criteria MUST be determined by each upper-layer in order to provide for an interoperable specification." (RFC 5666 Section 3.4)

2. More specifically, an Upper Layer Binding is required for every RPC program interested in using RPC-over-RDMA.  Separate bindings may be required for different versions of that program.

3. Upper Layer Bindings make DDP-eligibility statements about specific arguments and results (or portions thereof which still are whole XDR objects).  A chunk must contain only one whole XDR object.

4. DDP-eligibility of any XDR object not mentioned explicitly in an Upper Layer Binding is "not eligible."

5. Any XDR object may appear in a Position Zero read chunk or a Reply chunk regardless of its DDP-eligibility.

6. An Upper Layer Binding may limit the number of unique read chunk Positions allowed for a particular operation.  An Upper Layer Binding may limit the number of chunks in a Write list allowed for a particular operation.

7. An Upper Layer Binding must take care not to allow abuses of the Position Zero read chunk to avoid DDP-eligibility restrictions.
"It is NOT RECOMMENDED that upper-layer RPC client protocol specifications omit write chunk lists for eligible replies, due to the lower performance of the additional handshaking to perform data transfer, and the requirement that the RPC server must expose (and preserve) the reply data for a period of time." (RFC 5666 Section 3.4) 9. "The mapping of write chunk list entries to procedure arguments MUST be determined for each protocol." (RFC 5666 Section 3.6) 10. More specifically: by default, the requester provides as many Write chunks as the Upper Layer Binding allows for the particular operation. The responder fills in each Write chunk with an RDMA-eligible result until the Write list is exhausted or there are no more RDMA-eligible results. If this default behavior leads to ambiguity when the requester re-assembles the XDR stream, the Binding must explain how to resolve the ambiguity, or restrict DDP-eligibility to ensure confusion cannot occur. 11. It is the responsibility of the Upper Layer Binding to specify DDP-eligibity rules so that if an DDP-eligible XDR object is embedded within another, only one of these two objects is to be represented by a chunk. 12. The Upper Layer Binding must specify how a responder should reply if a requester violates a DDP-eligibilty rule. The Binding specification should provide guidance for requesters about handling invalid RPC-over-RDMA replies. 12. Acknowledgements The author gratefully acknowledges the contributions of Dai Ngo, Karen Deitke, Chunli Zhang, Mahesh Siddheshwar, Dominique Martinet, and William Simpson. The author also wishes to thank Dave Noveck and Bill Baker for theirunwaveringsupport of this work. Special thanks go to nfsv4 Working Group Chair Spencer Shepler and nfsv4 Working Group Secretary Tom Haynes for their support. 13. References 13.1. Normative References [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981, <http://www.rfc-editor.org/info/rfc793>. [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, DOI 10.17487/ RFC1813, June 1995, <http://www.rfc-editor.org/info/rfc1813>. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ RFC2119, March 1997, <http://www.rfc-editor.org/info/rfc2119>. [RFC4506] Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006, <http://www.rfc-editor.org/info/rfc4506>. [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007, <http://www.rfc-editor.org/info/rfc5040>. [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, October 2007, <http://www.rfc-editor.org/info/rfc5041>. [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007, <http://www.rfc-editor.org/info/rfc5056>. [RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI 10.17487/RFC5403, February 2009, <http://www.rfc-editor.org/info/rfc5403>. [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, May 2009, <http://www.rfc-editor.org/info/rfc5531>. 
[RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC 5660, DOI 10.17487/RFC5660, October 2009, <http://www.rfc-editor.org/info/rfc5660>.

[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, <http://www.rfc-editor.org/info/rfc5661>.

[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access Transport for Remote Procedure Call", RFC 5666, DOI 10.17487/RFC5666, January 2010, <http://www.rfc-editor.org/info/rfc5666>.

[RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, January 2010, <http://www.rfc-editor.org/info/rfc5667>.

13.2. Informative References

[I-D.ietf-nfsv4-rpcrdma-bidirection] Lever, C., "Size-Limited Bi-directional Remote Procedure Call On Remote Direct Memory Access Transports", draft-ietf-nfsv4-rpcrdma-bidirection-01 (work in progress), September 2015.

Author's Address

Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI  48104
US

Phone: +1 734 274 2396
Email: chuck.lever@oracle.com