draft-ietf-nfsv4-rfc5666bis-05.txt   draft-ietf-nfsv4-rfc5666bis-06.txt 
Network File System Version 4 C. Lever, Ed. Network File System Version 4 C. Lever, Ed.
Internet-Draft Oracle Internet-Draft Oracle
Obsoletes: 5666 (if approved) W. Simpson Obsoletes: 5666 (if approved) W. Simpson
Intended status: Standards Track DayDreamer Intended status: Standards Track DayDreamer
Expires: October 10, 2016 T. Talpey Expires: November 13, 2016 T. Talpey
Microsoft Microsoft
April 8, 2016 May 12, 2016
Remote Direct Memory Access Transport for Remote Procedure Call, Version Remote Direct Memory Access Transport for Remote Procedure Call, Version
One One
draft-ietf-nfsv4-rfc5666bis-05 draft-ietf-nfsv4-rfc5666bis-06
Abstract Abstract
This document specifies a protocol for conveying Remote Procedure This document specifies a protocol for conveying Remote Procedure
Call (RPC) messages on physical transports capable of Remote Direct Call (RPC) messages on physical transports capable of Remote Direct
Memory Access (RDMA). It requires no revision to application RPC Memory Access (RDMA). It requires no revision to application RPC
protocols or the RPC protocol itself. This document obsoletes RFC protocols or the RPC protocol itself. This document obsoletes RFC
5666. 5666.
Status of This Memo Status of This Memo
skipping to change at page 1, line 38 skipping to change at page 1, line 38
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 10, 2016. This Internet-Draft will expire on November 13, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 23 skipping to change at page 2, line 23
1.2. Remote Procedure Calls On RDMA Transports . . . . . . . . 3 1.2. Remote Procedure Calls On RDMA Transports . . . . . . . . 3
2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4 2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4
2.1. Changes To The Specification . . . . . . . . . . . . . . 4 2.1. Changes To The Specification . . . . . . . . . . . . . . 4
2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 4 2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 4
3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5 3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5
3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8 3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8
4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10 4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10
4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10 4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10
4.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11 4.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11
4.3. Managing Receiver Resources . . . . . . . . . . . . . . . 11 4.3. Managing Receiver Resources . . . . . . . . . . . . . . . 12
4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 13 4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 14
4.5. Message Size . . . . . . . . . . . . . . . . . . . . . . 20 4.5. Message Size . . . . . . . . . . . . . . . . . . . . . . 20
5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 23 5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 23
5.1. XDR Protocol Definition . . . . . . . . . . . . . . . . . 24 5.1. XDR Protocol Definition . . . . . . . . . . . . . . . . . 24
5.2. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 28 5.2. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 28
5.3. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 30 5.3. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 30
5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 32 5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 32
5.5. Error Handling . . . . . . . . . . . . . . . . . . . . . 34 5.5. Error Handling . . . . . . . . . . . . . . . . . . . . . 34
5.6. Protocol Elements No Longer Supported . . . . . . . . . . 36 5.6. Protocol Elements No Longer Supported . . . . . . . . . . 36
5.7. XDR Examples . . . . . . . . . . . . . . . . . . . . . . 37 5.7. XDR Examples . . . . . . . . . . . . . . . . . . . . . . 37
6. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 39 6. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 39
7. Upper Layer Binding Specifications . . . . . . . . . . . . . 40 7. Upper Layer Binding Specifications . . . . . . . . . . . . . 40
7.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 41 7.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 40
7.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 42 7.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 42
7.3. Additional Considerations . . . . . . . . . . . . . . . . 42 7.3. Additional Considerations . . . . . . . . . . . . . . . . 42
7.4. Upper Layer Protocol Extensions . . . . . . . . . . . . . 43 7.4. Upper Layer Protocol Extensions . . . . . . . . . . . . . 43
8. Protocol Extensibility . . . . . . . . . . . . . . . . . . . 43 8. Protocol Extensibility . . . . . . . . . . . . . . . . . . . 43
8.1. Conventional Extensions . . . . . . . . . . . . . . . . . 44 8.1. Conventional Extensions . . . . . . . . . . . . . . . . . 43
9. Security Considerations . . . . . . . . . . . . . . . . . . . 44 9. Security Considerations . . . . . . . . . . . . . . . . . . . 44
9.1. Memory Protection . . . . . . . . . . . . . . . . . . . . 44 9.1. Memory Protection . . . . . . . . . . . . . . . . . . . . 44
9.2. RPC Message Security . . . . . . . . . . . . . . . . . . 45 9.2. RPC Message Security . . . . . . . . . . . . . . . . . . 45
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 48 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 48
11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 49 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 49
12. References . . . . . . . . . . . . . . . . . . . . . . . . . 49 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 49
12.1. Normative References . . . . . . . . . . . . . . . . . . 49 12.1. Normative References . . . . . . . . . . . . . . . . . . 49
12.2. Informative References . . . . . . . . . . . . . . . . . 51 12.2. Informative References . . . . . . . . . . . . . . . . . 50
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 52 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 52
1. Introduction 1. Introduction
This document obsoletes RFC 5666. However, the protocol specified by This document obsoletes RFC 5666. However, the protocol specified by
this document is based on existing interoperating implementations of this document is based on existing interoperating implementations of
the RPC-over-RDMA Version One protocol. the RPC-over-RDMA Version One protocol.
The new specification clarifies text that is subject to multiple The new specification clarifies text that is subject to multiple
interpretations, and removes support for unimplemented RPC-over-RDMA interpretations, and removes support for unimplemented RPC-over-RDMA
Version One protocol elements. It makes the role of Upper Layer Version One protocol elements. It clarifies the role of Upper Layer
Bindings an explicit part of the protocol specification. Bindings and describes what they are to contain.
In addition, this document describes current practice using In addition, this document describes current practice using
RPCSEC_GSS [I-D.ietf-nfsv4-rpcsec-gssv3] on RDMA transports. RPCSEC_GSS [I-D.ietf-nfsv4-rpcsec-gssv3] on RDMA transports.
1.1. Requirements Language 1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
skipping to change at page 3, line 43 skipping to change at page 3, line 43
Open Network Computing Remote Procedure Call (ONC RPC, or simply, Open Network Computing Remote Procedure Call (ONC RPC, or simply,
RPC) [RFC5531] is a remote procedure call protocol that runs over a RPC) [RFC5531] is a remote procedure call protocol that runs over a
variety of transports. Most RPC implementations today use UDP variety of transports. Most RPC implementations today use UDP
[RFC0768] or TCP [RFC0793]. On UDP, RPC messages are encapsulated [RFC0768] or TCP [RFC0793]. On UDP, RPC messages are encapsulated
inside datagrams, while on a TCP byte stream, RPC messages are inside datagrams, while on a TCP byte stream, RPC messages are
delineated by a record marking protocol. An RDMA transport also delineated by a record marking protocol. An RDMA transport also
conveys RPC messages in a specific fashion that must be fully conveys RPC messages in a specific fashion that must be fully
described if RPC implementations are to interoperate. described if RPC implementations are to interoperate.
RDMA transports present semantics different from either UDP or TCP. RDMA transports present semantics different from either UDP or TCP.
They retain message delineations like UDP, but provide a reliable and They retain message delineations like UDP, but provide reliable and
sequenced data transfer like TCP. They also provide an offloaded sequenced data transfer like TCP. They also provide an offloaded
bulk transfer service not provided by UDP or TCP. RDMA transports bulk transfer service not provided by UDP or TCP. RDMA transports
are therefore appropriately viewed as a new transport type by RPC. are therefore appropriately viewed as a new transport type by RPC.
In this context, the Network File System (NFS) protocols as described In this context, the Network File System (NFS) protocols as described
in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and future NFSv4 minor in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and future NFSv4 minor
versions are obvious beneficiaries of RDMA transports. A complete versions are all obvious beneficiaries of RDMA transports. A complete
problem statement is discussed in [RFC5532], and NFSv4-related issues problem statement is presented in [RFC5532]. Many other RPC-based
are discussed in [RFC5661]. Many other RPC-based protocols can also protocols can also benefit.
benefit.
Although the RDMA transport described here can provide relatively Although the RDMA transport described herein can provide relatively
transparent support for any RPC application, this document also transparent support for any RPC application, this document also
describes mechanisms that can optimize data transfer further, given describes mechanisms that can optimize data transfer even further,
more active participation by RPC applications. given more active participation by RPC applications.
2. Changes Since RFC 5666 2. Changes Since RFC 5666
2.1. Changes To The Specification 2.1. Changes To The Specification
The following alterations have been made to the RPC-over-RDMA Version The following alterations have been made to the RPC-over-RDMA Version
One specification. The section numbers below refer to [RFC5666]. One specification. The section numbers below refer to [RFC5666].
o Section 2 has been expanded to introduce and explain key RPC, XDR, o Section 2 has been expanded to introduce and explain key RPC, XDR,
and RDMA terminology. These terms are now used consistently and RDMA terminology. These terms are now used consistently
skipping to change at page 5, line 6 skipping to change at page 4, line 50
o A subsection describing the use of RPCSEC_GSS with RPC-over-RDMA o A subsection describing the use of RPCSEC_GSS with RPC-over-RDMA
Version One has been added. Version One has been added.
2.2. Changes To The Protocol 2.2. Changes To The Protocol
Although the protocol described herein interoperates with existing Although the protocol described herein interoperates with existing
implementations of [RFC5666], the following changes have been made implementations of [RFC5666], the following changes have been made
relative to the protocol described in that document: relative to the protocol described in that document:
o Support for the Read-Read transfer model has been removed. Read- o Support for the Read-Read transfer model has been removed. Read-
Read is a slower transfer model than Read-Write, thus implementers Read is a slower transfer model than Read-Write. As a result,
have chosen not to support it. Removal simplifies explanatory implementers have chosen not to support it. Removal simplifies
text, and support for the RDMA_DONE procedure is no longer explanatory text, and support for the RDMA_DONE procedure is no
necessary. longer necessary.
o The specification of RDMA_MSGP in [RFC5666] and current o The specification of RDMA_MSGP in [RFC5666] is not adequate,
implementations of it are incomplete. Even if completed, benefit although some incomplete implementations exist. Even if an
for protocols such as NFSv4.0 [RFC7530] is doubtful. Therefore adequate specification were provided and an implementation was
the RDMA_MSGP message type is no longer supported. produced, benefit for protocols such as NFSv4.0 [RFC7530] is
doubtful. Therefore the RDMA_MSGP message type is no longer
supported.
o Technical errors with regard to handling RPC-over-RDMA header o Technical errors with regard to handling RPC-over-RDMA header
errors have been corrected. errors have been corrected.
o Specific requirements related to handling XDR round-up and complex o Specific requirements related to handling XDR round-up and complex
XDR data types have been added. XDR data types have been added.
o Explicit guidance is provided for sizing Write chunks, managing o Explicit guidance is provided for sizing Write chunks, managing
multiple chunks in the Write list, and handling unused Write multiple chunks in the Write list, and handling unused Write
chunks. chunks.
skipping to change at page 6, line 30 skipping to change at page 6, line 30
An arbitrary unique value is placed in the message's xid field in An arbitrary unique value is placed in the message's xid field in
order to match this CALL message to a corresponding REPLY message. order to match this CALL message to a corresponding REPLY message.
REPLY Message REPLY Message
A REPLY message, or "Reply", reports the results of work requested A REPLY message, or "Reply", reports the results of work requested
by a Call. A Reply is designated by the value one (1) in the by a Call. A Reply is designated by the value one (1) in the
message's msg_type field. The value contained in the message's message's msg_type field. The value contained in the message's
xid field is copied from the Call whose results are being xid field is copied from the Call whose results are being
reported. reported.
The RPC client endpoint, or "requester", serializes an RPC Call's The RPC client endpoint acts as a "requester". It serializes an RPC
arguments and conveys them to a server endpoint via an RPC Call Call's arguments and conveys them to a server endpoint via an RPC
message. This message contains an RPC protocol header, a header Call message. This message contains an RPC protocol header, a header
describing the requested upper layer operation, and all arguments. describing the requested upper layer operation, and all arguments.
The RPC server endpoint, or "responder", deserializes the arguments The RPC server endpoint acts as a "responder". It deserializes Call
and processes the requested operation. It then serializes the arguments, and processes the requested operation. It then serializes
operation's results into another byte stream. This byte stream is the operation's results into another byte stream. This byte stream
conveyed back to the requester via an RPC Reply message. This is conveyed back to the requester via an RPC Reply message. This
message contains an RPC protocol header, a header describing the message contains an RPC protocol header, a header describing the
upper layer reply, and all results. upper layer reply, and all results.
The requester deserializes the results and allows the original caller The requester deserializes the results and allows the original caller
to proceed. At this point the RPC transaction designated by the xid to proceed. At this point the RPC transaction designated by the xid
in the Call message is complete, and the xid is retired. in the Call message is complete, and the xid is retired.
In summary, CALL messages are sent by requesters to responders to
initiate an RPC transaction. REPLY messages are sent by responders
to requesters to complete the processing on an RPC transaction.
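The Call/Reply pairing described above can be illustrated with a minimal sketch. The Requester class, the make_reply helper, the dictionary message layout, and the use of a simple counter to produce unique xid values are all assumptions made for illustration; they are not part of the protocol. Only the msg_type values (zero for a Call, one for a Reply) and the rule that a Reply carries the xid copied from its Call come from the RPC specification.

```python
import itertools

CALL = 0   # msg_type value designating a Call
REPLY = 1  # msg_type value designating a Reply


class Requester:
    """Hypothetical requester-side bookkeeping for outstanding transactions."""

    def __init__(self):
        # Assumption: a counter is one easy way to get unique xid values.
        self._next_xid = itertools.count(1)
        self.pending = {}  # xid -> procedure name awaiting a Reply

    def make_call(self, procedure):
        """Build a Call and remember its xid until the Reply arrives."""
        xid = next(self._next_xid)
        self.pending[xid] = procedure
        return {"xid": xid, "msg_type": CALL, "procedure": procedure}

    def handle_reply(self, message):
        """Match a Reply to its Call by xid, then retire the xid."""
        if message["msg_type"] != REPLY:
            raise ValueError("not a Reply message")
        # Raises KeyError if the xid matches no outstanding Call.
        return self.pending.pop(message["xid"])


def make_reply(call, results):
    """Responder side: the xid is copied from the Call being answered."""
    return {"xid": call["xid"], "msg_type": REPLY, "results": results}
```

Because the xid alone identifies the transaction, Replies can be matched correctly even when they arrive in a different order than the Calls were sent.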
3.1.3. RPC Transports 3.1.3. RPC Transports
The role of an "RPC transport" is to mediate the exchange of RPC The role of an "RPC transport" is to mediate the exchange of RPC
messages between requesters and responders. An RPC transport bridges messages between requesters and responders. An RPC transport bridges
the gap between the RPC message abstraction and the native operations the gap between the RPC message abstraction and the native operations
of a particular network transport. of a particular network transport.
RPC-over-RDMA is a connection-oriented RPC transport. When a RPC-over-RDMA is a connection-oriented RPC transport. When a
connection-oriented transport is used, requesters initiate transport connection-oriented transport is used, clients initiate transport
connections, while responders wait passively for incoming connection connections, while servers wait passively for incoming connection
requests. requests.
3.1.4. External Data Representation 3.1.4. External Data Representation
One cannot assume that all requesters and responders internally One cannot assume that all requesters and responders internally
represent data objects the same way. RPC uses eXternal Data represent data objects the same way. RPC uses eXternal Data
Representation, or XDR, to translate data types and serialize Representation, or XDR, to translate data types and serialize
arguments and results [RFC4506]. arguments and results [RFC4506].
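As a small illustration of XDR serialization, the sketch below encodes an unsigned integer and a variable-length opaque item the way [RFC4506] describes them: integers occupy four octets in big-endian order, and variable-length data is preceded by its length and padded with zero octets to a multiple of four. The function names xdr_uint and xdr_opaque are hypothetical helpers chosen for this example.

```python
import struct


def xdr_uint(value):
    """Encode an XDR unsigned integer: four octets, big-endian."""
    return struct.pack(">I", value)


def xdr_opaque(data):
    """Encode XDR variable-length opaque data.

    The length comes first, then the data, then zero padding so the
    item ends on a four-octet boundary (the XDR round-up).
    """
    pad = (4 - len(data) % 4) % 4
    return xdr_uint(len(data)) + data + b"\x00" * pad
```

The encoded form is the same regardless of the host's native byte order, which is what lets requesters and responders with different internal representations interoperate.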
The XDR protocol encodes data independent of the endianness or size The XDR protocol encodes data independent of the endianness or size
skipping to change at page 11, line 5 skipping to change at page 11, line 14
The responder exposes its memory to requesters, but requesters do The responder exposes its memory to requesters, but requesters do
not expose their memory. Requesters employ RDMA Write operations not expose their memory. Requesters employ RDMA Write operations
to push RPC arguments or whole RPC calls to the responder. to push RPC arguments or whole RPC calls to the responder.
Requesters employ RDMA Read operations to pull RPC results or Requesters employ RDMA Read operations to pull RPC results or
whole RPC replies from the responder. whole RPC replies from the responder.
[RFC5666] specifies the use of both the Read-Read and the Read-Write [RFC5666] specifies the use of both the Read-Read and the Read-Write
Transfer Model. All current RPC-over-RDMA Version One Transfer Model. All current RPC-over-RDMA Version One
implementations use only the Read-Write Transfer Model. Therefore implementations use only the Read-Write Transfer Model. Therefore
the use of the Read-Read Transfer Model within RPC-over-RDMA Version the use of the Read-Read Transfer Model within RPC-over-RDMA Version
One implementations is no longer supported. Other Transfer Models One implementations is no longer supported. Transfer Models other
may be used in future versions of RPC-over-RDMA. than the Read-Write model may be used in future versions of RPC-over-
RDMA.
4.2. Message Framing 4.2. Message Framing
On an RPC-over-RDMA transport, each RPC message is encapsulated by an On an RPC-over-RDMA transport, each RPC message is encapsulated by an
RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR
streams. streams.
RPC Payload Stream RPC Payload Stream
The "Payload stream" contains the encapsulated RPC message being The "Payload stream" contains the encapsulated RPC message being
transferred by this RPC-over-RDMA message. This stream always transferred by this RPC-over-RDMA message. This stream always
skipping to change at page 11, line 46 skipping to change at page 12, line 8
It is however possible for RPC-over-RDMA to be dynamically enabled in It is however possible for RPC-over-RDMA to be dynamically enabled in
the course of negotiating the use of RDMA via an Upper Layer Protocol the course of negotiating the use of RDMA via an Upper Layer Protocol
exchange. Because RPC framing delimits an entire RPC request or exchange. Because RPC framing delimits an entire RPC request or
reply, the resulting shift in framing must occur between distinct RPC reply, the resulting shift in framing must occur between distinct RPC
messages, and in concert with the underlying transport. messages, and in concert with the underlying transport.
4.3. Managing Receiver Resources 4.3. Managing Receiver Resources
It is critical to provide RDMA Send flow control for an RDMA It is critical to provide RDMA Send flow control for an RDMA
connection. If no pre-posted receive buffer is large enough to connection. If any pre-posted receive buffer on the connection is
accept an incoming RDMA Send, the RDMA Send operation fails. If a not large enough to accept an incoming RDMA Send, the RDMA Send
pre-posted receive buffer is not available to accept an incoming RDMA operation can fail. If a pre-posted receive buffer is not available
Send, the RDMA Send operation can fail. Repeated occurrences of such to accept an incoming RDMA Send, the RDMA Send operation can fail.
errors can be fatal to the connection. This is a departure from Repeated occurrences of such errors can be fatal to the connection.
conventional TCP/IP networking where buffers are allocated This is different from conventional TCP/IP networking, in which
dynamically as part of receiving messages. buffers are allocated dynamically as messages are received.
The longevity of an RDMA connection requires that sending endpoints The longevity of an RDMA connection requires that sending endpoints
respect the resource limits of peer receivers. To ensure messages respect the resource limits of peer receivers. To ensure messages
can be sent and received reliably, there are two operational can be sent and received reliably, there are two operational
parameters for each connection. parameters for each connection.
4.3.1. Credit Limit 4.3.1. RPC-over-RDMA Credits
The number of pre-posted RDMA Receive operations is sometimes Flow control for RDMA Send operations directed to the responder is
referred to as a peer's "credit limit." Flow control for RDMA Send implemented as a simple request/grant protocol in the RPC-over-RDMA
operations directed to the responder is implemented as a simple header associated with each RPC message.
request/grant protocol in the RPC-over-RDMA header associated with
each RPC message. Section 5.2.3 has further detail.
o The RPC-over-RDMA header for RPC Call messages contains a An RPC-over-RDMA Version One credit is the capability to handle one
requested credit value for the responder. This is the maximum RPC-over-RDMA transaction. Each RPC-over-RDMA message sent from
number of RPC replies the requester can handle at once, requester to responder requests a number of credits from the
independent of how many RPCs are in flight at that moment. The responder. Each RPC-over-RDMA message sent from responder to
requester MAY dynamically adjust the requested credit value to requester informs the requester how many credits the responder has
match its expected needs. granted. The requested and granted values are carried in each RPC-
over-RDMA message's rdma_credit field (see Section 5.2.3).
o The RPC-over-RDMA header for RPC Reply messages provides the Practically speaking, the critical value is the granted value. A
granted result. This is the maximum number of RPC calls the requester MUST NOT send unacknowledged requests in excess of the
responder can handle at once, without regard to how many RPCs are responder's granted credit limit. If the granted value is exceeded,
in flight at that moment. The granted value MUST NOT be zero, the RDMA layer may signal an error, possibly terminating the
since such a value would result in deadlock. The responder MAY connection. The granted value MUST NOT be zero, since such a value
dynamically adjust the granted credit value to match its needs or would result in deadlock.
policies (e.g. to accommodate the available resources in a shared
receive queue).
The requester MUST NOT send unacknowledged requests in excess of this RPC calls complete in any order, but the current granted credit limit
granted responder credit limit. If the limit is exceeded, the RDMA at the responder is known to the requester from RDMA Send ordering
layer may signal an error, possibly terminating the connection. If properties. The number of allowed new requests the requester may
an RDMA layer error does not occur, the responder MAY handle excess send is then the lower of the current requested and granted credit
requests or return an RPC layer error to the requester. values, minus the number of requests in flight. Advertised credit
values are not altered when individual RPCs are started or completed.
While RPC calls complete in any order, the current flow control limit The requested and granted credit values MAY be adjusted to match the
at the responder is known to the requester from the Send ordering needs or policies in effect on either peer. For instance, a
properties. It is always the lower of the requested and granted responder may reduce the granted credit value to accommodate the
credit values, minus the number of requests in flight. Advertised available resources in a Shared Receive Queue. The responder MUST
credit values are not altered when individual RPCs are started or ensure that an increase in receive resources is effected before the
completed. next reply message is sent.
On occasion a requester or responder may need to adjust the amount of A requester MUST maintain enough receive resources to accommodate
resources available to a connection. When this happens, the expected replies. Responders have to be prepared for there to be no
responder needs to ensure that a credit increase is effected (i.e. receive resources available on requesters with no pending RPC
RDMA Receives are posted) before the next reply is sent. transactions.
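The credit arithmetic described above can be sketched as follows. This is only an illustration of the stated rule (the lower of the requested and granted values, minus the number of requests in flight); the function name sendable_requests and its parameter names are hypothetical, and clamping the result at zero is an assumption made so that a reduced grant never yields a negative count.

```python
def sendable_requests(requested, granted, in_flight):
    """Return how many new requests a requester may send right now.

    requested: credit value most recently requested by the requester
    granted:   credit value most recently granted by the responder
    in_flight: requests sent but not yet answered
    """
    if granted <= 0:
        # A granted value of zero is not permitted: it would deadlock.
        raise ValueError("granted credit value must not be zero")
    limit = min(requested, granted)
    # Assumption: clamp at zero in case the grant shrank below in_flight.
    return max(0, limit - in_flight)
```

For example, with 32 credits requested, 16 granted, and 10 requests in flight, a requester following this rule could send 6 more requests. Note that the advertised values themselves do not change as individual RPCs start or complete; only the in-flight count does.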
Certain RDMA implementations may impose additional flow control Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at restrictions, such as limits on RDMA Read operations in progress at
the responder. Accommodation of such restrictions is considered the the responder. Accommodation of such restrictions is considered the
responsibility of each RPC-over-RDMA Version One implementation. responsibility of each RPC-over-RDMA Version One implementation.
4.3.2. Inline Threshold 4.3.2. Inline Threshold
A receiver's "inline threshold" value is the largest message size (in A receiver's "inline threshold" value is the largest message size (in
octets) that the receiver can accept via an RDMA Receive operation. octets) that the receiver can accept via an RDMA Receive operation.
skipping to change at page 13, line 27 skipping to change at page 13, line 32
receiver. receiver.
Unlike credit limits, inline threshold values are not advertised to Unlike credit limits, inline threshold values are not advertised to
peers via the RPC-over-RDMA Version One protocol, and there is no peers via the RPC-over-RDMA Version One protocol, and there is no
provision for the inline threshold value to change during the provision for the inline threshold value to change during the
lifetime of an RPC-over-RDMA Version One connection. lifetime of an RPC-over-RDMA Version One connection.
4.3.3. Initial Connection State 4.3.3. Initial Connection State
When a connection is first established, peers might not know how many When a connection is first established, peers might not know how many
receive buffers the other has, nor how large these buffers are. receive resources the other has, nor how large these buffers are.
As a basis for an initial exchange of RPC requests, each RPC-over- As a basis for an initial exchange of RPC requests, each RPC-over-
RDMA Version One connection provides the ability to exchange at least RDMA Version One connection provides the ability to exchange at least
one RPC message at a time that is 1024 bytes in size. A responder one RPC message at a time that is 1024 bytes in size. A responder
MAY exceed this basic level of configuration, but a requester MUST MAY exceed this basic level of configuration, but a requester MUST
NOT assume more than one credit is available, and MUST receive a NOT assume more than one credit is available, and MUST receive a
valid reply from the responder carrying the actual number of valid reply from the responder carrying the actual number of
available credits, prior to sending its next request. available credits, prior to sending its next request.
Receiver implementations MUST support an inline threshold of 1024 Receiver implementations MUST support an inline threshold of 1024
bytes, but MAY support larger inline threshold values. A mechanism bytes, but MAY support larger inline threshold values. A mechanism
for discovering a peer's inline threshold value before a connection for discovering a peer's inline threshold value before a connection
is established may be used to optimize the use of RDMA Send is established may be used to optimize the use of RDMA Send
operations. In the absence of such a mechanism, senders MUST assume operations. In the absence of such a mechanism, senders MUST assume
a receiver's inline threshold is 1024 bytes. a receiver's inline threshold is 1024 bytes.
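The initial connection state can be modeled with a short sketch. The ConnectionState class and its method names are hypothetical. The sketch assumes that any discovery mechanism runs outside the protocol and hands its result to the constructor; when no value is supplied, it falls back to the 1024-byte default and a single credit, as the text above requires.

```python
DEFAULT_INLINE_THRESHOLD = 1024  # octets; assumed when nothing else is known
INITIAL_CREDITS = 1              # a requester must not assume more than one


class ConnectionState:
    """Hypothetical requester-side view of a new connection."""

    def __init__(self, peer_inline_threshold=None):
        if peer_inline_threshold is None:
            peer_inline_threshold = DEFAULT_INLINE_THRESHOLD
        if peer_inline_threshold < DEFAULT_INLINE_THRESHOLD:
            # Receivers are required to support at least 1024 bytes.
            raise ValueError("inline threshold below 1024 bytes")
        self.inline_threshold = peer_inline_threshold
        self.granted = INITIAL_CREDITS
        self.in_flight = 0

    def fits_inline(self, message_len):
        """True if a message of this size fits in one RDMA Send."""
        return message_len <= self.inline_threshold

    def may_send(self):
        """True if a credit is available for another request."""
        return self.in_flight < self.granted

    def on_send(self):
        self.in_flight += 1

    def on_reply(self, granted):
        """A valid reply carries the actual number of granted credits."""
        self.in_flight -= 1
        self.granted = granted
```

Under this model the first request goes out alone, and the requester learns the responder's actual credit grant only from the first reply.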
4.4. XDR Encoding With Chunks 4.4. XDR Encoding With Chunks
When a direct data placement capability is available, during XDR When a direct data placement capability is available, it can be
encoding it can be determined that the transport can efficiently determined during XDR encoding that the transport can efficiently
place the content of one or more data items directly in the place the contents of one or more XDR data items directly into the
receiver's memory, separately from the transfer of other parts of the receiver's memory, separately from the transfer of other parts of the
containing XDR stream. containing XDR stream.
4.4.1. Reducing An XDR Stream 4.4.1. Reducing An XDR Stream
RPC-over-RDMA Version One provides a mechanism for moving part of an RPC-over-RDMA Version One provides a mechanism for moving part of an
RPC message via a data transfer separate from an RDMA Send/Receive. RPC message via a data transfer separate from an RDMA Send/Receive.
The sender removes one or more XDR data items from the Payload The sender removes one or more XDR data items from the Payload
stream. They are conveyed via one or more RDMA Read or Write stream. They are conveyed via one or more RDMA Read or Write
operations. The receiver inserts the data items into the Payload operations. As the receiver decodes an incoming message, it skips
stream before passing it to the Upper Layer. over directly placed data items.
A contiguous piece of a Payload stream can be split out and moved via The piece of memory containing the portion of the data stream that is
separate RDMA operations. The piece of memory containing that split out and placed directly is referred to as a "chunk". In some
portion of the data stream and metadata in an RPC-over-RDMA header contexts, data in the RPC-over-RDMA header that describes such pieces
together comprise what is referred to as a "chunk." A Payload stream of memory is also referred to as a "chunk".
after chunks have been removed is referred to as a "reduced" Payload
stream. Likewise, a data item that has been removed from a Payload A Payload stream after chunks have been removed is referred to as a
stream to be transferred separately is referred to as a "reduced" "reduced" Payload stream. Likewise, a data item that has been
data item. removed from a Payload stream to be transferred separately is
referred to as a "reduced" data item.
4.4.2. DDP-Eligibility 4.4.2. DDP-Eligibility
Only an XDR data item that might benefit from Direct Data Placement Only an XDR data item that might benefit from Direct Data Placement
may be reduced. The eligibility of particular XDR data items to be may be reduced. The eligibility of particular XDR data items to be
reduced is independent of RPC-over-RDMA, and thus is not specified by reduced is independent of RPC-over-RDMA, and thus is not specified by
this document. this document.
To maintain interoperability on an RPC-over-RDMA transport, a To maintain interoperability on an RPC-over-RDMA transport, a
determination must be made of which XDR data items in each Upper determination must be made of which XDR data items in each Upper
skipping to change at page 15, line 8 skipping to change at page 15, line 13
data items MUST NOT be reduced. RPC-over-RDMA Version One uses RDMA data items MUST NOT be reduced. RPC-over-RDMA Version One uses RDMA
Read and Write operations to transfer DDP-eligible data that has been Read and Write operations to transfer DDP-eligible data that has been
reduced. reduced.
Detailed requirements for Upper Layer Bindings are discussed in full Detailed requirements for Upper Layer Bindings are discussed in full
in Section 7. in Section 7.
4.4.3. RDMA Segments 4.4.3. RDMA Segments
When encoding a Payload stream that contains a DDP-eligible data When encoding a Payload stream that contains a DDP-eligible data
item, a sender may choose to reduce that data item. It does not item, a sender may choose to reduce that data item. When it chooses
place the item into the Payload stream. Instead, the sender records to do so, the sender does not place the item into the Payload stream.
in the RPC-over-RDMA header the actual address and size of the memory Instead, the sender records in the RPC-over-RDMA header the actual
region containing that data item. address and size of the memory region containing that data item.
The requester provides location information for DDP-eligible data The requester provides location information for DDP-eligible data
items in both RPC Calls and Replies. The responder uses this items in both RPC Calls and Replies. The responder uses this
information to initiate RDMA Read and Write operations to retrieve or information to initiate RDMA Read and Write operations to retrieve or
update the specified region of the requester's memory. update the specified region of the requester's memory.
An "RDMA segment", or a "plain segment", is an RPC-over-RDMA header An "RDMA segment", or a "plain segment", is an RPC-over-RDMA header
data object that contains the precise co-ordinates of a contiguous data object that contains the precise co-ordinates of a contiguous
memory region that is to be conveyed via one or more RDMA Read or memory region that is to be conveyed via one or more RDMA Read or
RDMA Write operations. The following fields are contained in each RDMA Write operations.
segment.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Handle |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Offset +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Handle Handle
Steering tag (STag) or handle obtained when the segment's memory Steering tag (STag) or handle obtained when the segment's memory
is registered for RDMA. Also known as an R_key, this value is is registered for RDMA. Also known as an R_key, this value is
generated by registering this memory with the RDMA provider. generated by registering this memory with the RDMA provider.
Length Length
The length of the memory segment, in octets. The length of the memory segment, in octets.
Offset Offset
skipping to change at page 16, line 14 skipping to change at page 15, line 50
4.4.4. Chunks 4.4.4. Chunks
In RPC-over-RDMA Version One, a "chunk" refers to a portion of the In RPC-over-RDMA Version One, a "chunk" refers to a portion of the
Payload stream that is moved via RDMA Read or Write operations. Payload stream that is moved via RDMA Read or Write operations.
Chunk data is removed from the sender's Payload stream, transferred Chunk data is removed from the sender's Payload stream, transferred
by separate RDMA operations, and then re-inserted into the receiver's by separate RDMA operations, and then re-inserted into the receiver's
Payload stream. Payload stream.
Each chunk consists of one or more RDMA segments. Each segment Each chunk consists of one or more RDMA segments. Each segment
represents a single contiguous piece of that chunk. Segments MAY represents a single contiguous piece of that chunk. A requester MAY
divide a chunk on any boundary that is convenient to the requester. divide a chunk into segments using any boundaries that are
convenient.
Except in special cases, a chunk contains exactly one XDR data item. Except in special cases, a chunk contains exactly one XDR data item.
This makes it straightforward to remove chunks from an XDR stream This makes it straightforward to remove chunks from an XDR stream
without affecting XDR alignment. Not every RPC-over-RDMA message has without affecting XDR alignment.
chunks associated with it.
Many RPC-over-RDMA messages have no associated chunks. In this case,
all three chunk lists are marked empty.
4.4.4.1. Counted Arrays 4.4.4.1. Counted Arrays
If a chunk contains a counted array data type, the count of array If a chunk contains a counted array data type, the count of array
elements MUST remain in the Payload stream, while the array elements elements MUST remain in the Payload stream, while the array elements
MUST be moved to the chunk. For example, when encoding an opaque MUST be moved to the chunk. For example, when encoding an opaque
byte array as a chunk, the count of bytes stays in the Payload byte array as a chunk, the count of bytes stays in the Payload
stream, while the bytes in the array are removed from the Payload stream, while the bytes in the array are removed from the Payload
stream and transferred within the chunk. stream and transferred within the chunk.
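As a rough illustration of the opaque byte array case above, the sketch below splits an XDR variable-length opaque into the part that stays in the Payload stream and the part that moves to the chunk. The function name reduce_opaque is hypothetical. The four-byte big-endian count follows the XDR encoding rules in [RFC4506].

```python
import struct


def reduce_opaque(data):
    """Split an opaque byte array into (payload_part, chunk_part).

    payload_part: the 4-byte big-endian element count, which remains
                  in the Payload stream.
    chunk_part:   the array bytes themselves, which are removed from
                  the Payload stream and conveyed in the chunk.
    """
    payload_part = struct.pack(">I", len(data))
    chunk_part = bytes(data)
    return payload_part, chunk_part
```

The sketch leaves XDR padding out of both parts; round-up is treated separately for Read and Write chunks.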
skipping to change at page 17, line 10 skipping to change at page 16, line 49
4.4.4.3. XDR Unions 4.4.4.3. XDR Unions
A union data type should never be made DDP-eligible, but one or more A union data type should never be made DDP-eligible, but one or more
of its arms may be DDP-eligible. of its arms may be DDP-eligible.
4.4.5. Read Chunks 4.4.5. Read Chunks
A "Read chunk" represents an XDR data item that is to be pulled from A "Read chunk" represents an XDR data item that is to be pulled from
the requester to the responder using RDMA Read operations. the requester to the responder using RDMA Read operations.
A Read chunk is a list of one or more RDMA segments. Each RDMA A Read chunk is a list of one or more RDMA read segments. Each RDMA
segment in a Read chunk is a plain segment which has an additional read segment consists of a Position field followed by a plain
Position field. segment. See Section 5.1.2 for details.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Position |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Handle |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Offset +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Position Position
The byte offset in the Payload stream where the receiver re- The byte offset in the unreduced Payload stream where the receiver
inserts the data item conveyed in a chunk. The Position value re-inserts the data item conveyed in a chunk. The Position value
MUST be computed from the beginning of the Payload stream, which MUST be computed from the beginning of the unreduced Payload
begins at Position zero. All RDMA segments belonging to the same stream, which begins at Position zero. All RDMA read segments
Read chunk have the same value in their Position field. belonging to the same Read chunk have the same value in their
Position field.
While constructing an RPC-over-RDMA Call message, a requester While constructing an RPC-over-RDMA Call message, a requester
registers memory segments containing data in Read chunks. It registers memory segments that contain data to be transferred via
advertises these chunks in the RPC-over-RDMA header of the RPC Call. RDMA Read operations. It advertises the co-ordinates of these
segments in the RPC-over-RDMA header of the RPC Call.
After receiving an RPC Call sent via an RDMA Send operation, a After receiving an RPC Call sent via an RDMA Send operation, a
responder transfers the chunk data from the requester using RDMA Read responder transfers the chunk data from the requester using RDMA Read
operations. The responder reconstructs the transferred chunk data by operations. The responder reconstructs the transferred chunk data by
concatenating the contents of each segment, in list order, into the concatenating the contents of each segment, in list order, into the
received Payload stream at the Position value recorded in the received Payload stream at the Position value recorded in the
segment. segment.
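The reconstruction step above can be sketched for a single Read chunk as follows. This is an illustration under simplifying assumptions: the segment contents have already been fetched via RDMA Read, there is only one Read chunk in the message, and the chunk length is already a multiple of four so no round-up is needed. The function name reinsert_read_chunk is hypothetical.

```python
def reinsert_read_chunk(reduced_payload, position, segment_contents):
    """Rebuild a Payload stream from one Read chunk.

    reduced_payload:  bytes of the received (reduced) Payload stream
    position:         the Position value shared by the chunk's segments
    segment_contents: list of bytes objects, one per segment, in list order
    """
    # Segments are concatenated in list order to form the data item.
    chunk_data = b"".join(segment_contents)
    # The data item is spliced in at the byte offset given by Position.
    return reduced_payload[:position] + chunk_data + reduced_payload[position:]
```

For a message carrying several Read chunks, one plausible approach is to apply this step in ascending Position order, so that each Position is interpreted against a stream whose earlier chunks have already been restored.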
Put another way, the responder inserts the first segment in a Read Put another way, the responder inserts the first segment in a Read
chunk into the Payload stream at the byte offset indicated by its chunk into the Payload stream at the byte offset indicated by its
skipping to change at page 18, line 25 skipping to change at page 18, line 4
requirements. Thus when an XDR data item is moved from the Payload requirements. Thus when an XDR data item is moved from the Payload
stream into a Read chunk, the requester MUST remove XDR padding for stream into a Read chunk, the requester MUST remove XDR padding for
that data item from the Payload stream as well. that data item from the Payload stream as well.
The length of a Read chunk is the sum of the lengths of the read The length of a Read chunk is the sum of the lengths of the read
segments that comprise it. If this sum is not a multiple of four, segments that comprise it. If this sum is not a multiple of four,
the requester MAY choose to send a Read chunk without any XDR the requester MAY choose to send a Read chunk without any XDR
padding. If the requester provides no actual round-up in a Read padding. If the requester provides no actual round-up in a Read
chunk, the responder MUST be prepared to provide appropriate round-up chunk, the responder MUST be prepared to provide appropriate round-up
in the reconstructed call XDR stream. in the reconstructed call XDR stream.
The Position field in a read segment indicates where the containing The Position field in a read segment indicates where the containing
Read chunk starts in the Payload stream. The value in this field Read chunk starts in the Payload stream. The value in this field
MUST be a multiple of four. Moreover, all segments in the same Read MUST be a multiple of four. Moreover, all segments in the same Read
chunk share the same Position value, even if one or more of the chunk share the same Position value, even if one or more of the
segments have a non-four-byte aligned length. segments have a non-four-byte aligned length.
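The arithmetic behind these rules is small enough to sketch. The helper names below are hypothetical. The sketch computes a Read chunk's length from its segment lengths, the number of pad bytes a responder would need to supply when the requester sent no round-up, and a check that a Position value is a multiple of four.

```python
def read_chunk_length(segment_lengths):
    """Length of a Read chunk: the sum of its read segment lengths."""
    return sum(segment_lengths)


def xdr_pad_len(length):
    """Number of pad bytes needed to round length up to a multiple of four."""
    return (4 - length % 4) % 4


def check_position(position):
    """Raise ValueError if a Position value is not a multiple of four."""
    if position % 4 != 0:
        raise ValueError("Position must be a multiple of four")
    return position
```

For instance, a Read chunk made of segments of 3 and 2 octets has length 5, so a responder reconstructing the call XDR stream would supply 3 pad bytes if the requester provided none.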
4.4.5.2. Decoding Read Chunks 4.4.5.2. Decoding Read Chunks
While decoding a received Payload stream, whenever the XDR offset in While decoding a received Payload stream, whenever the XDR offset in
the Payload stream matches that of a Read chunk, the transport the Payload stream matches that of a Read chunk, the responder
initiates an RDMA Read to pull the chunk's data content into initiates an RDMA Read to pull the chunk's data content into
registered memory on the responder. registered local memory.
The responder acknowledges its completion of use of Read chunk source The responder acknowledges its completion of use of Read chunk source
buffers when it sends an RPC Reply to the requester. The requester buffers when it sends an RPC Reply to the requester. The requester
may then release Read chunks advertised in the request. may then release Read chunks advertised in the request.
4.4.6. Write Chunks 4.4.6. Write Chunks
A "Write chunk" represents an XDR data item that is to be pushed from A "Write chunk" represents an XDR data item that is to be pushed from
a responder to a requester using RDMA Write operations. a responder to a requester using RDMA Write operations.
A Write chunk is an array of one or more plain RDMA segments. Write A Write chunk is an array of one or more plain RDMA segments. Write
chunks are provided by a requester long before the responder has chunks are provided by a requester long before the responder has
prepared the reply Payload stream. Therefore RDMA segments in a prepared the reply Payload stream. In most cases, the byte offset of
Write chunk do not have a Position field. a particular XDR data item in the reply is not predictable at the
time a request is issued. Therefore RDMA segments in a Write chunk
do not have a Position field.
While constructing an RPC Call message, a requester also prepares While constructing an RPC Call message, a requester also prepares
memory regions to catch DDP-eligible reply data items. A requester memory regions to catch DDP-eligible reply data items. A requester
does not know the actual length of the result data item to be does not know the actual length of the result data item to be
returned, thus it MUST register a Write chunk long enough to returned, thus it MUST register a Write chunk long enough to
accommodate the maximum possible size of the returned data item. accommodate the maximum possible size of the returned data item.
A responder copies the requester-provided Write chunk segments into A responder copies the requester-provided Write chunk segments into
the RPC-over-RDMA header that it returns with the reply. The the RPC-over-RDMA header that it returns with the reply. The
responder MUST NOT change the number of segments in the Write chunk. responder MUST NOT change the number of segments in the Write chunk.
The responder fills the segments in array order until the data item The responder fills the segments in array order until the data item
has been completely written. The responder updates the segment has been completely written. The responder updates the segment
length fields to reflect the actual amount of data that is being length fields to reflect the actual amount of data that is being
returned in each segment. If a Write chunk segment is not filled by returned in each segment. If a Write chunk segment receives no data
the responder, the updated length of the segment SHOULD be zero. from the responder, the updated length of the segment MUST be zero.
The responder then sends the RPC Reply via an RDMA Send operation. The responder then sends the RPC Reply via an RDMA Send operation.
After receiving the RPC Reply, the requester reconstructs the After receiving the RPC Reply, the requester reconstructs the
transferred data by concatenating the contents of each segment, in transferred data by concatenating the contents of each segment, in
array order, into the RPC Reply XDR stream. array order, into the RPC Reply XDR stream.
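The way a responder updates segment lengths can be sketched as below. This models only the length bookkeeping, not the RDMA Write operations themselves. The name fill_write_chunk is hypothetical, and raising an error when the data item does not fit is an assumption of this sketch rather than behavior specified here.

```python
def fill_write_chunk(segment_capacities, data_len):
    """Return the updated segment lengths for a Write chunk.

    segment_capacities: lengths of the requester-provided segments,
                        in array order
    data_len:           number of octets in the returned data item
    """
    if data_len > sum(segment_capacities):
        raise ValueError("data item does not fit in the Write chunk")
    updated = []
    remaining = data_len
    for capacity in segment_capacities:
        # Segments are filled in array order; a segment that receives
        # no data ends up with an updated length of zero.
        used = min(capacity, remaining)
        updated.append(used)
        remaining -= used
    return updated
```

The returned list always has the same number of entries as the input, matching the rule that the responder does not change the number of segments in the Write chunk.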
4.4.6.1. Write Chunk Round-up 4.4.6.1. Write Chunk Round-up
XDR requires each encoded data item to start on four-byte alignment. XDR requires each encoded data item to start on four-byte alignment.
When an odd-length data item is encoded, its length is encoded When an odd-length data item is encoded, its length is encoded
skipping to change at page 19, line 51 skipping to change at page 19, line 32
accommodate XDR pad bytes. A responder MUST NOT write XDR pad bytes accommodate XDR pad bytes. A responder MUST NOT write XDR pad bytes
for a Write chunk. for a Write chunk.
4.4.6.2. Unused Write Chunks 4.4.6.2. Unused Write Chunks
There are occasions when a requester provides a Write chunk but the There are occasions when a requester provides a Write chunk but the
responder does not use it. responder does not use it.
For example, an Upper Layer Protocol may define a union result where For example, an Upper Layer Protocol may define a union result where
some arms of the union contain a DDP-eligible data item while other some arms of the union contain a DDP-eligible data item while other
arms do not. The requester is required to provide a Write chunk in arms do not. The responder is REQUIRED to use requester-provided
this case, but if the responder returns a result that uses an arm of Write chunks in this case, but if the responder returns a result that
the union that has no DDP-eligible data item, the Write chunk remains uses an arm of the union that has no DDP-eligible data item, the
unused. Write chunk remains unconsumed.
When forming an RPC-over-RDMA Reply message with an unused Write If there is a subsequent DDP-eligible data item, it MUST be placed in
chunk, the responder MUST set the length of all segments in the chunk that Write chunk. The requester MUST provision each Write chunk so
to zero. it can be filled with the largest DDP-eligible data item that can be
placed in it.
However, if this is the last or only Write chunk available and it
remains unconsumed, the responder MUST set the length of all segments
in the chunk to zero.
Unused write chunks, or unused bytes in write chunk segments, are not Unused write chunks, or unused bytes in write chunk segments, are not
returned as results. Their memory is returned to the Upper Layer as returned as results. Their memory is returned to the Upper Layer as
part of RPC completion. However, the RPC layer MUST NOT assume that part of RPC completion. However, the RPC layer MUST NOT assume that
the buffers have not been modified. the buffers have not been modified.
In other words, even if a responder indicates that a Write chunk is
not consumed (by setting all of the segment lengths in the chunk to
zero), the responder may have written some data into the segments
before deciding not to return that data item. For example, a problem
reading local storage might occur while an NFS server is filling
Write chunks. This would interrupt the stream of RDMA Write
operations that sends data back to the NFS client, but at that point
the NFS server needs to return an NFS error that reflects that the
Upper Layer NFS request has failed.
4.5. Message Size 4.5. Message Size
A receiver of RDMA Send operations is required by RDMA to have A receiver of RDMA Send operations is required by RDMA to have
previously posted one or more adequately sized buffers. Memory previously posted one or more adequately sized buffers. Memory
savings can be achieved on both requesters and responders by leaving savings are achieved on both requesters and responders by posting
the inline threshold small. However, not all RPC messages are small. small Receive buffers. However, not all RPC messages are small.
4.5.1. Short Messages 4.5.1. Short Messages
RPC messages are frequently smaller than typical inline thresholds. RPC messages are frequently smaller than typical inline thresholds.
For example, the NFS version 3 GETATTR request is only 56 bytes: 20 For example, the NFS version 3 GETATTR request is only 56 bytes: 20
bytes of RPC header, plus a 32-byte file handle argument and 4 bytes bytes of RPC header, plus a 32-byte file handle argument and 4 bytes
for its length. The reply to this common request is about 100 bytes. for its length. The reply to this common request is about 100 bytes.
Since all RPC messages conveyed via RPC-over-RDMA require an RDMA Since all RPC messages conveyed via RPC-over-RDMA require an RDMA
Send operation, the most efficient way to send an RPC message that is Send operation, the most efficient way to send an RPC message that is
skipping to change at page 23, line 44 skipping to change at page 23, line 44
copy of the message's transaction ID, data for managing RDMA flow copy of the message's transaction ID, data for managing RDMA flow
control credits, and lists of RDMA segments used for RDMA Read and control credits, and lists of RDMA segments used for RDMA Read and
Write operations. All RPC-over-RDMA header content is contained in Write operations. All RPC-over-RDMA header content is contained in
the Transport stream, and thus MUST be XDR encoded. the Transport stream, and thus MUST be XDR encoded.
RPC message layout is unchanged from that described in [RFC5531] RPC message layout is unchanged from that described in [RFC5531]
except for the possible reduction of data items that are moved by except for the possible reduction of data items that are moved by
RDMA Read or Write operations. RDMA Read or Write operations.
The RPC-over-RDMA protocol passes RPC messages without regard to The RPC-over-RDMA protocol passes RPC messages without regard to
their type (CALL or REPLY) or direction (forwards or backwards). their type (CALL or REPLY). Apart from restrictions imposed by
Each endpoint of a connection MAY send any RPC-over-RDMA message upper-layer bindings, each endpoint of a connection MAY send any RPC-
header type at any time (subject to credit limits). over-RDMA message header type at any time (subject to credit limits).
5.1. XDR Protocol Definition 5.1. XDR Protocol Definition
This section contains a description of the core features of the RPC- This section contains a description of the core features of the RPC-
over-RDMA Version One protocol, expressed in the XDR language over-RDMA Version One protocol, expressed in the XDR language
[RFC4506]. [RFC4506].
This description is provided in a way that makes it simple to extract This description is provided in a way that makes it simple to extract
into ready-to-compile form. The reader can apply the following shell into ready-to-compile form. The reader can apply the following shell
script to this document to produce a machine-readable XDR description script to this document to produce a machine-readable XDR description
of the RPC-over-RDMA Version One protocol without any OPTIONAL of the RPC-over-RDMA Version One protocol.
extensions.
<CODE BEGINS> <CODE BEGINS>
#!/bin/sh #!/bin/sh
grep '^ *///' | sed 's?^ /// ??' | sed 's?^ *///$??' grep '^ *///' | sed 's?^ /// ??' | sed 's?^ *///$??'
<CODE ENDS> <CODE ENDS>
That is, if the above script is stored in a file called "extract.sh" That is, if the above script is stored in a file called "extract.sh"
and this document is in a file called "spec.txt" then the reader can and this document is in a file called "spec.txt" then the reader can
skipping to change at page 27, line 21 skipping to change at page 27, line 21
/// struct xdr_write_chunk *rdma_reply; /// struct xdr_write_chunk *rdma_reply;
/// /* rpc body follows */ /// /* rpc body follows */
/// }; /// };
/// ///
/// struct rpc_rdma_header_nomsg { /// struct rpc_rdma_header_nomsg {
/// struct xdr_read_list *rdma_reads; /// struct xdr_read_list *rdma_reads;
/// struct xdr_write_list *rdma_writes; /// struct xdr_write_list *rdma_writes;
/// struct xdr_write_chunk *rdma_reply; /// struct xdr_write_chunk *rdma_reply;
/// }; /// };
/// ///
/// /* Not to be used */
/// struct rpc_rdma_header_padded { /// struct rpc_rdma_header_padded {
/// uint32 rdma_align; /* Padding alignment */ /// uint32 rdma_align;
/// uint32 rdma_thresh; /* Padding threshold */ /// uint32 rdma_thresh;
/// struct xdr_read_list *rdma_reads; /// struct xdr_read_list *rdma_reads;
/// struct xdr_write_list *rdma_writes; /// struct xdr_write_list *rdma_writes;
/// struct xdr_write_chunk *rdma_reply; /// struct xdr_write_chunk *rdma_reply;
/// /* rpc body follows */ /// /* rpc body follows */
/// }; /// };
/// ///
/// /* /// /*
/// * Error handling (Section 5.5) /// * Error handling (Section 5.5)
/// */ /// */
/// enum rpc_rdma_errcode { /// enum rpc_rdma_errcode {
/// ERR_VERS = 1, /* Fixed for all versions */ /// ERR_VERS = 1, /* Value fixed for all versions */
/// ERR_CHUNK = 2 /// ERR_CHUNK = 2
/// }; /// };
/// ///
/// /* Structure fixed for all versions */
/// struct rpc_rdma_errvers { /// struct rpc_rdma_errvers {
/// uint32 rdma_vers_low; /// uint32 rdma_vers_low;
/// uint32 rdma_vers_high; /// uint32 rdma_vers_high;
/// }; /// };
/// ///
/// union rpc_rdma_error switch (rpc_rdma_errcode err) { /// union rpc_rdma_error switch (rpc_rdma_errcode err) {
/// case ERR_VERS: /// case ERR_VERS:
/// rpc_rdma_errvers range; /// rpc_rdma_errvers range;
/// case ERR_CHUNK: /// case ERR_CHUNK:
/// void; /// void;
/// }; /// };
/// ///
/// /* /// /*
/// * Procedures (Section 5.2.4) /// * Procedures (Section 5.2.4)
/// */ /// */
/// enum rdma_proc { /// enum rdma_proc {
/// RDMA_MSG = 0, /* Fixed for all versions */ /// RDMA_MSG = 0, /* Value fixed for all versions */
/// RDMA_NOMSG = 1, /* Fixed for all versions */ /// RDMA_NOMSG = 1, /* Value fixed for all versions */
/// RDMA_MSGP = 2, /* Reserved */ /// RDMA_MSGP = 2, /* Not to be used */
/// RDMA_DONE = 3, /* Reserved */ /// RDMA_DONE = 3, /* Not to be used */
/// RDMA_ERROR = 4 /* Fixed for all versions */ /// RDMA_ERROR = 4 /* Value fixed for all versions */
/// }; /// };
/// ///
/// /* The position of the proc discriminator field is
/// * fixed for all versions */
/// union rdma_body switch (rdma_proc proc) { /// union rdma_body switch (rdma_proc proc) {
/// case RDMA_MSG: /// case RDMA_MSG:
/// rpc_rdma_header rdma_msg; /// rpc_rdma_header rdma_msg;
/// case RDMA_NOMSG: /// case RDMA_NOMSG:
/// rpc_rdma_header_nomsg rdma_nomsg; /// rpc_rdma_header_nomsg rdma_nomsg;
/// case RDMA_MSGP: /// case RDMA_MSGP: /* Not to be used */
/// rpc_rdma_header_padded rdma_msgp; /// rpc_rdma_header_padded rdma_msgp;
/// case RDMA_DONE: /// case RDMA_DONE: /* Not to be used */
/// void; /// void;
/// case RDMA_ERROR: /// case RDMA_ERROR:
/// rpc_rdma_error rdma_error; /// rpc_rdma_error rdma_error;
/// }; /// };
/// ///
/// /* /// /*
/// * Fixed header fields (Section 5.2) /// * Fixed header fields (Section 5.2)
/// */ /// */
/// struct rdma_msg { /// struct rdma_msg {
/// uint32 rdma_xid; /// uint32 rdma_xid; /* Position fixed for all versions */
/// uint32 rdma_vers; /// uint32 rdma_vers; /* Position fixed for all versions */
/// uint32 rdma_credit; /// uint32 rdma_credit; /* Position fixed for all versions */
/// rdma_body rdma_body; /// rdma_body rdma_body;
/// }; /// };
<CODE ENDS> <CODE ENDS>
5.2. Fixed Header Fields 5.2. Fixed Header Fields
The RPC-over-RDMA header begins with four fixed 32-bit fields that The RPC-over-RDMA header begins with four fixed 32-bit fields that
control the RDMA interaction. These four fields, which must remain control the RDMA interaction.
with the same meanings and in the same positions in all subsequent
versions of the RPC-over-RDMA protocol, are described below.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The first three words are individual fields in the rdma_msg
| XID | structure. The fourth word is the first word of the rdma_body union
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ which acts as the discriminator for the switched union. The contents
| Version Number | of this field are described in Section 5.2.4.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Credit Value | These four fields must remain with the same meanings and in the same
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ positions in all subsequent versions of the RPC-over-RDMA protocol.
| Procedure Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
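As an illustration of the layout of these four words, the following sketch packs and unpacks them as big-endian 32-bit unsigned integers, which is how XDR encodes uint32 and enum values [RFC4506]. The function names are hypothetical. The procedure constants mirror the rdma_proc enum in Section 5.1.

```python
import struct

# Values from the rdma_proc enum in the XDR definition.
RDMA_MSG = 0
RDMA_NOMSG = 1
RDMA_MSGP = 2
RDMA_DONE = 3
RDMA_ERROR = 4

FIXED_HEADER_FORMAT = ">IIII"  # XID, version, credit value, procedure


def pack_fixed_header(xid, credit, proc, vers=1):
    """Encode the four fixed 32-bit header words."""
    return struct.pack(FIXED_HEADER_FORMAT, xid, vers, credit, proc)


def unpack_fixed_header(buf):
    """Decode (xid, vers, credit, proc) from the start of buf."""
    return struct.unpack_from(FIXED_HEADER_FORMAT, buf, 0)
```

Because the positions of these words are fixed, a receiver can decode them before it knows anything else about the message, for example to look up context by XID or to check the version number.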
5.2.1. Transaction ID (XID) 5.2.1. Transaction ID (XID)
The XID generated for the RPC Call and Reply. Having the XID at a The XID generated for the RPC Call and Reply. Having the XID at a
fixed location in the header makes it easy for the receiver to fixed location in the header makes it easy for the receiver to
establish context as soon as each RPC-over-RDMA message arrives. establish context as soon as each RPC-over-RDMA message arrives.
This XID MUST be the same as the XID in the RPC message. The This XID MUST be the same as the XID in the RPC message. The
receiver MAY perform its processing based solely on the XID in the receiver MAY perform its processing based solely on the XID in the
RPC-over-RDMA header, and thereby ignore the XID in the RPC message, RPC-over-RDMA header, and thereby ignore the XID in the RPC message,
if it so chooses. if it so chooses.
5.2.2. Version Number 5.2.2. Version Number
For RPC-over-RDMA Version One, this field MUST contain the value one For RPC-over-RDMA Version One, this field MUST contain the value one
(1). Rules regarding changes to this transport protocol version (1). Rules regarding changes to this transport protocol version
number can be found in Section 8. number can be found in Section 8.
5.2.3. Credit Value 5.2.3. Credit Value
When sent in an RPC Call message, the requested credit value is When sent with an RPC Call message, the requested credit value is
provided. When sent in an RPC Reply message, the granted credit provided. When sent with an RPC Reply message, the granted credit
value is returned. RPC Calls SHOULD NOT be sent in excess of the value is returned. Further discussion of how the credit value is
currently granted limit. Further discussion of how the credit value determined can be found in Section 4.3.
is determined can be found in Section 4.3.
5.2.4. Procedure number 5.2.4. Procedure Number
o RDMA_MSG = 0 indicates that chunk lists and a Payload stream o RDMA_MSG = 0 indicates that chunk lists and a Payload stream
follow. The format of the chunk lists is discussed below. follow. The format of the chunk lists is discussed below.
o RDMA_NOMSG = 1 indicates that after the chunk lists there is no o RDMA_NOMSG = 1 indicates that after the chunk lists there is no
Payload stream. In this case, the chunk lists provide information Payload stream. In this case, the chunk lists provide information
to allow the responder to transfer the Payload stream using RDMA to allow the responder to transfer the Payload stream using RDMA
Read or Write operations. Read or Write operations.
o RDMA_MSGP = 2 is reserved. o RDMA_MSGP = 2 is reserved.
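The procedure values listed above can be sketched as an enumeration with a small dispatch helper. Only the three values visible in this excerpt are modeled; any other value is rejected as outside the scope of the sketch, which is a choice made here for illustration and not a rule stated by the draft. The names RdmaProc and payload_location are hypothetical.

```python
from enum import IntEnum


class RdmaProc(IntEnum):
    RDMA_MSG = 0    # chunk lists and a Payload stream follow
    RDMA_NOMSG = 1  # chunk lists only; no Payload stream follows
    RDMA_MSGP = 2   # reserved in the newer revision


def payload_location(proc: int) -> str:
    """Report where the Payload stream is found for a procedure value."""
    try:
        p = RdmaProc(proc)
    except ValueError:
        raise ValueError(f"procedure {proc} is not covered by this sketch")
    if p is RdmaProc.RDMA_MSG:
        return "inline"  # Payload stream follows the chunk lists
    if p is RdmaProc.RDMA_NOMSG:
        return "chunks"  # Payload stream is moved by RDMA Read or Write
    raise ValueError("RDMA_MSGP is reserved; handling is not sketched here")
```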
skipping to change at page 33, line 43 skipping to change at page 33, line 34
Depending on the implementation and constraints imposed by Upper Depending on the implementation and constraints imposed by Upper
Layer Bindings, it is possible to implement reduction transparently Layer Bindings, it is possible to implement reduction transparently
to upper layers. Such implementations may lead to inefficiencies, to upper layers. Such implementations may lead to inefficiencies,
either because they require the RPC layer to perform expensive either because they require the RPC layer to perform expensive
registration and de-registration of memory "on the fly", or they may registration and de-registration of memory "on the fly", or they may
require using RDMA chunks in reply messages, along with the resulting require using RDMA chunks in reply messages, along with the resulting
additional handshaking with the RPC-over-RDMA peer. additional handshaking with the RPC-over-RDMA peer.
However, these issues are internal and generally confined to the However, these issues are internal and generally confined to the
local interface between RPC and its upper layers, one in which local interface between RPC and its upper layers, one in which
implementations are free to innovate. The only requirement is that implementations are free to innovate. The only requirement, beyond
the resulting RPC-over-RDMA protocol sent to the peer is valid for constraints imposed by the Upper Layer Binding, is that the resulting
the upper layer. RPC-over-RDMA protocol sent to the peer is valid for the upper layer.
5.4.3. Registration Strategies 5.4.3. Registration Strategies
The choice of which memory registration strategies to employ is left The choice of which memory registration strategies to employ is left
to requester and responder implementers. To support the widest array to requester and responder implementers. To support the widest array
of RDMA implementations, as well as the most general steering tag of RDMA implementations, as well as the most general steering tag
scheme, an Offset field is included in each segment. scheme, an Offset field is included in each segment.
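To make the role of the Offset field concrete, the following sketch encodes and decodes a single segment. The layout is an assumption for illustration: a 32-bit handle (the steering tag), a 32-bit length, and a 64-bit offset, all in XDR big-endian order. The names Segment, encode_segment, and decode_segment are hypothetical.

```python
import struct
from typing import NamedTuple

SEGMENT_FORMAT = ">IIQ"  # assumed: uint32 handle, uint32 length, uint64 offset


class Segment(NamedTuple):
    handle: int  # steering tag that names the registered memory region
    length: int  # number of bytes in the region
    offset: int  # starting position within the region named by the handle


def encode_segment(seg: Segment) -> bytes:
    """Serialize one segment as 16 bytes."""
    return struct.pack(SEGMENT_FORMAT, seg.handle, seg.length, seg.offset)


def decode_segment(buf: bytes, pos: int = 0) -> Segment:
    """Deserialize one segment starting at byte position pos."""
    return Segment(*struct.unpack_from(SEGMENT_FORMAT, buf, pos))
```

Under this layout, a zero-based registration scheme always carries an offset of 0, while a more general scheme can carry a nonzero offset into a larger registered region.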
While zero-based offset schemes are available in many RDMA While zero-based offset schemes are available in many RDMA
implementations, their use by RPC requires individual registration of implementations, their use by RPC requires individual registration of
skipping to change at page 44, line 16 skipping to change at page 44, line 7
Introducing new capabilities to RPC-over-RDMA Version One is limited Introducing new capabilities to RPC-over-RDMA Version One is limited
to the adoption of conventions that make use of existing XDR (defined to the adoption of conventions that make use of existing XDR (defined
in this document) and allowed abstract RDMA operations. Because no in this document) and allowed abstract RDMA operations. Because no
mechanism for detecting optional features exists in RPC-over-RDMA mechanism for detecting optional features exists in RPC-over-RDMA
Version One, implementations must rely on Upper Layer Protocols to Version One, implementations must rely on Upper Layer Protocols to
communicate the existence of such extensions. communicate the existence of such extensions.
Such extensions must be specified in a Standards Track document with Such extensions must be specified in a Standards Track document with
appropriate review by the nfsv4 Working Group and the IESG. An appropriate review by the nfsv4 Working Group and the IESG. An
example of a conventional extension to RPC-over-RDMA Version One can example of a conventional extension to RPC-over-RDMA Version One is
be found in [I-D.ietf-nfsv4-rpcrdma-bidirection]. the specification of backward direction message support to enable
NFSv4.1 callback operations, described in
[I-D.ietf-nfsv4-rpcrdma-bidirection].
9. Security Considerations 9. Security Considerations
9.1. Memory Protection 9.1. Memory Protection
A primary consideration is the protection of the integrity and A primary consideration is the protection of the integrity and
privacy of local memory by an RPC-over-RDMA transport. The use of privacy of local memory by an RPC-over-RDMA transport. The use of
RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory
contents, nor to memory owned by user processes. contents, nor to memory owned by user processes.
skipping to change at page 50, line 5 skipping to change at page 49, line 35
The extract.sh shell script and formatting conventions were first The extract.sh shell script and formatting conventions were first
described by the authors of the NFSv4.1 XDR specification [RFC5662]. described by the authors of the NFSv4.1 XDR specification [RFC5662].
Special thanks go to nfsv4 Working Group Chair Spencer Shepler and Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
nfsv4 Working Group Secretary Thomas Haynes for their support. nfsv4 Working Group Secretary Thomas Haynes for their support.
12. References 12. References
12.1. Normative References 12.1. Normative References
[I-D.ietf-nfsv4-rpcrdma-bidirection]
Lever, C., "Size-Limited Bi-directional Remote Procedure
Call On Remote Direct Memory Access Transports", draft-
ietf-nfsv4-rpcrdma-bidirection-01 (work in progress),
September 2015.
[I-D.ietf-nfsv4-rpcsec-gssv3] [I-D.ietf-nfsv4-rpcsec-gssv3]
Adamson, A. and N. Williams, "Remote Procedure Call (RPC) Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", draft-ietf-nfsv4-rpcsec-gssv3-17 Security Version 3", draft-ietf-nfsv4-rpcsec-gssv3-17
(work in progress), January 2016. (work in progress), January 2016.
[RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, DOI 10.17487/RFC1833, August 1995, RFC 1833, DOI 10.17487/RFC1833, August 1995,
<http://www.rfc-editor.org/info/rfc1833>. <http://www.rfc-editor.org/info/rfc1833>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
skipping to change at page 51, line 7 skipping to change at page 50, line 29
5660, DOI 10.17487/RFC5660, October 2009, 5660, DOI 10.17487/RFC5660, October 2009,
<http://www.rfc-editor.org/info/rfc5660>. <http://www.rfc-editor.org/info/rfc5660>.
[RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call
(RPC) Network Identifiers and Universal Address Formats", (RPC) Network Identifiers and Universal Address Formats",
RFC 5665, DOI 10.17487/RFC5665, January 2010, RFC 5665, DOI 10.17487/RFC5665, January 2010,
<http://www.rfc-editor.org/info/rfc5665>. <http://www.rfc-editor.org/info/rfc5665>.
12.2. Informative References 12.2. Informative References
[I-D.ietf-nfsv4-rpcrdma-bidirection]
Lever, C., "Size-Limited Bi-directional Remote Procedure
Call On Remote Direct Memory Access Transports", draft-
ietf-nfsv4-rpcrdma-bidirection-01 (work in progress),
September 2015.
[IB] InfiniBand Trade Association, "InfiniBand Architecture [IB] InfiniBand Trade Association, "InfiniBand Architecture
Specifications", <http://www.infinibandta.org>. Specifications", <http://www.infinibandta.org>.
[IBPORT] InfiniBand Trade Association, "IP Addressing Annex", [IBPORT] InfiniBand Trade Association, "IP Addressing Annex",
<http://www.infinibandta.org>. <http://www.infinibandta.org>.
[RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI [RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI
10.17487/RFC0768, August 1980, 10.17487/RFC0768, August 1980,
<http://www.rfc-editor.org/info/rfc768>. <http://www.rfc-editor.org/info/rfc768>.
 End of changes. 67 change blocks. 
199 lines changed or deleted 202 lines changed or added
