draft-ietf-nfsv4-rfc5666bis-04.txt   draft-ietf-nfsv4-rfc5666bis-05.txt 
Network File System Version 4 C. Lever, Ed. Network File System Version 4 C. Lever, Ed.
Internet-Draft Oracle Internet-Draft Oracle
Obsoletes: 5666 (if approved) W. Simpson Obsoletes: 5666 (if approved) W. Simpson
Intended status: Standards Track DayDreamer Intended status: Standards Track DayDreamer
Expires: September 5, 2016 T. Talpey Expires: October 10, 2016 T. Talpey
Microsoft Microsoft
March 4, 2016 April 8, 2016
Remote Direct Memory Access Transport for Remote Procedure Call Remote Direct Memory Access Transport for Remote Procedure Call, Version
draft-ietf-nfsv4-rfc5666bis-04 One
draft-ietf-nfsv4-rfc5666bis-05
Abstract Abstract
This document specifies a protocol for conveying Remote Procedure This document specifies a protocol for conveying Remote Procedure
Call (RPC) messages on physical transports capable of Remote Direct Call (RPC) messages on physical transports capable of Remote Direct
Memory Access (RDMA). It requires no revision to application RPC Memory Access (RDMA). It requires no revision to application RPC
protocols or the RPC protocol itself. This document obsoletes RFC protocols or the RPC protocol itself. This document obsoletes RFC
5666. 5666.
Status of This Memo Status of This Memo
skipping to change at page 1, line 37 skipping to change at page 1, line 38
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 5, 2016. This Internet-Draft will expire on October 10, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 15 skipping to change at page 2, line 16
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
1.2. Remote Procedure Calls On RDMA Transports . . . . . . . . 3 1.2. Remote Procedure Calls On RDMA Transports . . . . . . . . 3
2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4 2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4
2.1. Changes To The Specification . . . . . . . . . . . . . . 4 2.1. Changes To The Specification . . . . . . . . . . . . . . 4
2.2. Changes To The XDR Definition . . . . . . . . . . . . . . 5 2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 4
2.3. Changes To The Protocol . . . . . . . . . . . . . . . . . 5 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5
3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 6
3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8 3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8
4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10 4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10
4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 11 4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10
4.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11 4.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11
4.3. Managing Receiver Resources . . . . . . . . . . . . . . . 12 4.3. Managing Receiver Resources . . . . . . . . . . . . . . . 11
4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 14 4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 13
4.5. Message Size . . . . . . . . . . . . . . . . . . . . . . 20 4.5. Message Size . . . . . . . . . . . . . . . . . . . . . . 20
5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 22 5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 23
5.1. XDR Protocol Definition . . . . . . . . . . . . . . . . . 22 5.1. XDR Protocol Definition . . . . . . . . . . . . . . . . . 24
5.2. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 28 5.2. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 28
5.3. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 30 5.3. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 30
5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 32 5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 32
5.5. Error Handling . . . . . . . . . . . . . . . . . . . . . 33 5.5. Error Handling . . . . . . . . . . . . . . . . . . . . . 34
5.6. Protocol Elements No Longer Supported . . . . . . . . . . 36 5.6. Protocol Elements No Longer Supported . . . . . . . . . . 36
5.7. XDR Examples . . . . . . . . . . . . . . . . . . . . . . 37 5.7. XDR Examples . . . . . . . . . . . . . . . . . . . . . . 37
6. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 39 6. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 39
7. Bi-Directional RPC-Over-RDMA . . . . . . . . . . . . . . . . 40 7. Upper Layer Binding Specifications . . . . . . . . . . . . . 40
7.1. RPC Direction . . . . . . . . . . . . . . . . . . . . . . 40 7.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 41
7.2. Backward Direction Flow Control . . . . . . . . . . . . . 41 7.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 42
7.3. Conventions For Backward Operation . . . . . . . . . . . 43 7.3. Additional Considerations . . . . . . . . . . . . . . . . 42
7.4. Backward Direction Upper Layer Binding . . . . . . . . . 45 7.4. Upper Layer Protocol Extensions . . . . . . . . . . . . . 43
8. Upper Layer Binding Specifications . . . . . . . . . . . . . 45 8. Protocol Extensibility . . . . . . . . . . . . . . . . . . . 43
8.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 46 8.1. Conventional Extensions . . . . . . . . . . . . . . . . . 44
8.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 47 9. Security Considerations . . . . . . . . . . . . . . . . . . . 44
8.3. Additional Considerations . . . . . . . . . . . . . . . . 47 9.1. Memory Protection . . . . . . . . . . . . . . . . . . . . 44
8.4. Upper Layer Protocol Extensions . . . . . . . . . . . . . 48 9.2. RPC Message Security . . . . . . . . . . . . . . . . . . 45
9. Protocol Extensibility . . . . . . . . . . . . . . . . . . . 48 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 48
9.1. Changes To RPC-Over-RDMA Header XDR . . . . . . . . . . . 49 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 49
9.2. Feature Statuses With RPC-Over-RDMA Versions . . . . . . 50 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.3. RPC-Over-RDMA Version Numbering . . . . . . . . . . . . . 51 12.1. Normative References . . . . . . . . . . . . . . . . . . 49
9.4. RPC-Over-RDMA Version One Extension Practices . . . . . . 52 12.2. Informative References . . . . . . . . . . . . . . . . . 51
10. Security Considerations . . . . . . . . . . . . . . . . . . . 53 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 52
10.1. Memory Protection . . . . . . . . . . . . . . . . . . . 53
10.2. RPC Message Security . . . . . . . . . . . . . . . . . . 54
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 57
12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 58
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 58
13.1. Normative References . . . . . . . . . . . . . . . . . . 58
13.2. Informative References . . . . . . . . . . . . . . . . . 59
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 61
1. Introduction 1. Introduction
This document obsoletes RFC 5666. However, the protocol specified by This document obsoletes RFC 5666. However, the protocol specified by
this document is based on existing interoperating implementations of this document is based on existing interoperating implementations of
the RPC-over-RDMA Version One protocol. the RPC-over-RDMA Version One protocol.
The new specification clarifies text that is subject to multiple The new specification clarifies text that is subject to multiple
interpretations, and removes support for unimplemented RPC-over-RDMA interpretations, and removes support for unimplemented RPC-over-RDMA
Version One protocol elements. It makes the role of Upper Layer Version One protocol elements. It makes the role of Upper Layer
Bindings an explicit part of the protocol specification. Bindings an explicit part of the protocol specification.
In addition, this document introduces conventions that enable bi- In addition, this document describes current practice using
directional RPC-over-RDMA operation, enabling operation of NFSv4.1 RPCSEC_GSS [I-D.ietf-nfsv4-rpcsec-gssv3] on RDMA transports.
[RFC5661] on RDMA transports, and that enable the use of RPCSEC_GSS
[RFC5403] on RDMA transports.
1.1. Requirements Language 1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
1.2. Remote Procedure Calls On RDMA Transports 1.2. Remote Procedure Calls On RDMA Transports
Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a
skipping to change at page 4, line 46 skipping to change at page 4, line 35
o Sections 4 and 5 have been combined to improve the organization of o Sections 4 and 5 have been combined to improve the organization of
this information. this information.
o The specification of the optional Connection Configuration o The specification of the optional Connection Configuration
Protocol has been removed from the specification. Protocol has been removed from the specification.
o A section consolidating requirements for Upper Layer Bindings has o A section consolidating requirements for Upper Layer Bindings has
been added. been added.
o A section discussing RPC-over-RDMA protocol extensibility has been o An XDR extraction mechanism is provided, along with full
added. copyright, matching the approach used in [RFC5662].
o A section specifying conventions for bi-directional RPC operation
on RPC-over-RDMA Version One has been added.
o The "Security Considerations" section has been expanded to include o The "Security Considerations" section has been expanded to include
a discussion of how RPC-over-RDMA security depends on features of a discussion of how RPC-over-RDMA security depends on features of
the underlying RDMA transport. A subsection specifying the underlying RDMA transport.
conventions for using RPCSEC_GSS with RPC-over-RDMA Version One
has been added.
2.2. Changes To The XDR Definition
The XDR changes described in this section do not alter the over-the-
wire message format described in [RFC5666]. Changes made to the XDR
which do alter the over-the-wire message format (i.e., to make it
match actual interoperating implementations) are discussed in
Section 2.3.
These alterations make it easier to extend the RPC-over-RDMA
protocol. They also better organize the definition, making the
protocol elements more consonant with actual protocol function. The
specific changes are:
o The XDR description has been given an extraction script using a
sentinel sequence, matching the approach used in [RFC5662].
o XDR data types which need to be the same in all RPC-over-RDMA
versions have been moved to a separate section and given names
that are not version-specific.
o To allow extensions without modification to the existing XDR, the o A subsection describing the use of RPCSEC_GSS with RPC-over-RDMA
header types previously defined as members of the enum Version One has been added.
rpcrdma1_proc have been defined as constants, the union
rpcrdma1_body was deleted, and RDMA_ERR_CHUNK has been renamed as
RDMA_ERR_BADHEADER.
2.3. Changes To The Protocol 2.2. Changes To The Protocol
Although the protocol described herein interoperates with existing Although the protocol described herein interoperates with existing
implementations of [RFC5666], the following changes have been made implementations of [RFC5666], the following changes have been made
relative to the protocol described in that document: relative to the protocol described in that document:
o Support for the Read-Read transfer model has been removed. Read- o Support for the Read-Read transfer model has been removed. Read-
Read is a slower transfer model than Read-Write, thus implementers Read is a slower transfer model than Read-Write, thus implementers
have chosen not to support it. Removal simplifies explanatory have chosen not to support it. Removal simplifies explanatory
text, and support for the RDMA_DONE procedure is no longer text, and support for the RDMA_DONE procedure is no longer
necessary. necessary.
skipping to change at page 6, line 27 skipping to change at page 5, line 38
Reply chunk. Reply chunk.
The protocol version number has not been changed because the protocol The protocol version number has not been changed because the protocol
specified in this document fully interoperates with implementations specified in this document fully interoperates with implementations
of the RPC-over-RDMA Version One protocol specified in [RFC5666]. of the RPC-over-RDMA Version One protocol specified in [RFC5666].
3. Terminology 3. Terminology
3.1. Remote Procedure Calls 3.1. Remote Procedure Calls
This section introduces key elements of the Remote Procedure Call This section highlights key elements of the Remote Procedure Call
[RFC5531] and External Data Representation [RFC4506] protocols, upon [RFC5531] and External Data Representation [RFC4506] protocols, upon
which RPC-over-RDMA Version One is constructed. which RPC-over-RDMA Version One is constructed. Strong grounding
in these protocols is recommended before reading this document.
3.1.1. Upper Layer Protocols 3.1.1. Upper Layer Protocols
Remote Procedure Calls are an abstraction used to implement the Remote Procedure Calls are an abstraction used to implement the
operations of an "Upper Layer Protocol," or ULP. The term Upper operations of an "Upper Layer Protocol," or ULP. The term Upper
Layer Protocol refers to an RPC Program and Version tuple, which is a Layer Protocol refers to an RPC Program and Version tuple, which is a
versioned set of procedure calls that comprise a single well-defined versioned set of procedure calls that comprise a single well-defined
API. One example of an Upper Layer Protocol is the Network File API. One example of an Upper Layer Protocol is the Network File
System Version 4.0 [RFC7530]. System Version 4.0 [RFC7530].
skipping to change at page 8, line 26 skipping to change at page 7, line 38
A serialized stream of bytes that is the result of XDR encoding is A serialized stream of bytes that is the result of XDR encoding is
referred to as an "XDR stream." A sending endpoint encodes native referred to as an "XDR stream." A sending endpoint encodes native
data into an XDR stream and then transmits that stream to a receiver. data into an XDR stream and then transmits that stream to a receiver.
A receiving endpoint decodes incoming XDR byte streams into its A receiving endpoint decodes incoming XDR byte streams into its
native data representation format. native data representation format.
3.1.4.1. XDR Opaque Data 3.1.4.1. XDR Opaque Data
Sometimes a data item must be transferred as-is, without encoding or Sometimes a data item must be transferred as-is, without encoding or
decoding. Such a data item is referred to as "opaque data." XDR decoding. The contents of such a data item are referred to as
encoding places opaque data items directly into an XDR stream without "opaque data." XDR encoding places the content of opaque data items
altering their content in any way. Upper Layer Protocols or directly into an XDR stream without altering it in any way. Upper
applications perform any needed data translation in this case. Layer Protocols or applications perform any needed data translation
Examples of opaque data items include the content of files, or in this case. Examples of opaque data items include the content of
generic byte strings. files, or generic byte strings.
3.1.4.2. XDR Round-up 3.1.4.2. XDR Round-up
The number of octets in a variable-size opaque data item precedes The number of octets in a variable-size opaque data item precedes
that item in an XDR stream. If the size of an encoded data item is that item in an XDR stream. If the size of an encoded data item is
not a multiple of four octets, octets containing zero are added to not a multiple of four octets, octets containing zero are added to
the end of the item as it is encoded so that the next encoded data the end of the item as it is encoded so that the next encoded data
item starts on a four-octet boundary. The encoded size of the item item starts on a four-octet boundary. The encoded size of the item
is not changed by the addition of the extra octets, and the zero is not changed by the addition of the extra octets, and the zero
bytes are not exposed to the Upper Layer. bytes are not exposed to the Upper Layer.
skipping to change at page 9, line 5 skipping to change at page 8, line 17
This technique is referred to as "XDR round-up," and the extra octets This technique is referred to as "XDR round-up," and the extra octets
are referred to as "XDR padding". are referred to as "XDR padding".
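The round-up rule described above can be illustrated with a minimal Python sketch. The helper names (xdr_pad_len, encode_opaque, decode_opaque) are hypothetical and are not part of any specification; the sketch assumes the four-octet big-endian length word that XDR uses for variable-size opaque data.

```python
import struct
from typing import Tuple


def xdr_pad_len(size: int) -> int:
    """Number of zero octets needed to reach the next four-octet boundary."""
    return (4 - size % 4) % 4


def encode_opaque(data: bytes) -> bytes:
    """Encode a variable-size opaque item: length word, content, zero padding.

    The length word records the unpadded size, so the padding does not
    change the encoded size of the item.
    """
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad_len(len(data))


def decode_opaque(stream: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Decode one opaque item; return its content and the next item's offset.

    The pad octets are skipped and never returned to the caller.
    """
    (size,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    data = stream[start:start + size]
    return data, start + size + xdr_pad_len(size)
```

For a five-octet item, three pad octets are appended, and the decoder hands back only the original five octets.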
3.2. Remote Direct Memory Access 3.2. Remote Direct Memory Access
RPC requesters and responders can be made more efficient if large RPC RPC requesters and responders can be made more efficient if large RPC
messages are transferred by a third party such as intelligent network messages are transferred by a third party such as intelligent network
interface hardware (data movement offload), and placed in the interface hardware (data movement offload), and placed in the
receiver's memory so that no additional adjustment of data alignment receiver's memory so that no additional adjustment of data alignment
has to be made (direct data placement). Remote Direct Memory Access has to be made (direct data placement). Remote Direct Memory Access
enables both optimizations. transports enable both optimizations.
3.2.1. Direct Data Placement 3.2.1. Direct Data Placement
Typically, RPC implementations copy the contents of RPC messages into Typically, RPC implementations copy the contents of RPC messages into
a buffer before sending them. An efficient RPC implementation sends a buffer before sending them. An efficient RPC implementation sends
bulk data without copying it into a separate send buffer first. bulk data without copying it into a separate send buffer first.
However, socket-based RPC implementations are often unable to receive However, socket-based RPC implementations are often unable to receive
data directly into its final place in memory. Receivers often need data directly into its final place in memory. Receivers often need
to copy incoming data to finish an RPC operation; sometimes, only to to copy incoming data to finish an RPC operation; sometimes, only to
skipping to change at page 11, line 42 skipping to change at page 11, line 4
Write-Read Write-Read
The responder exposes its memory to requesters, but requesters do The responder exposes its memory to requesters, but requesters do
not expose their memory. Requesters employ RDMA Write operations not expose their memory. Requesters employ RDMA Write operations
to push RPC arguments or whole RPC calls to the responder. to push RPC arguments or whole RPC calls to the responder.
Requesters employ RDMA Read operations to pull RPC results or Requesters employ RDMA Read operations to pull RPC results or
whole RPC replies from the responder. whole RPC replies from the responder.
[RFC5666] specifies the use of both the Read-Read and the Read-Write [RFC5666] specifies the use of both the Read-Read and the Read-Write
Transfer Model. All current RPC-over-RDMA Version One Transfer Model. All current RPC-over-RDMA Version One
implementations use only the Read-Write Transfer Model. Therefore implementations use only the Read-Write Transfer Model. Therefore
the use of the Read-Read Transfer Model by RPC-over-RDMA Version One the use of the Read-Read Transfer Model within RPC-over-RDMA Version
implementations is no longer supported. Other Transfer Models may be One implementations is no longer supported. Other Transfer Models
used by a future version of RPC-over-RDMA. may be used in future versions of RPC-over-RDMA.
4.2. Message Framing 4.2. Message Framing
On an RPC-over-RDMA transport, each RPC message is encapsulated by an On an RPC-over-RDMA transport, each RPC message is encapsulated by an
RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR
streams. streams.
RPC Payload Stream RPC Payload Stream
The "Payload stream" contains the encapsulated RPC message being The "Payload stream" contains the encapsulated RPC message being
transferred by this RPC-over-RDMA message. This stream always transferred by this RPC-over-RDMA message. This stream always
begins with the XID field of the encapsulated RPC message. begins with the XID field of the encapsulated RPC message.
Transport Header Stream Transport Stream
The "Transport stream" contains a header that describes and The "Transport stream" contains a header that describes and
controls the transfer of the Payload stream in this RPC-over-RDMA controls the transfer of the Payload stream in this RPC-over-RDMA
message. This header is analogous to the record marking used for message. This header is analogous to the record marking used for
RPC over TCP but is more extensive, since RDMA transports support RPC over TCP but is more extensive, since RDMA transports support
several modes of data transfer. several modes of data transfer.
In its simplest form, an RPC-over-RDMA message consists of a In its simplest form, an RPC-over-RDMA message consists of a
Transport stream followed immediately by a Payload stream conveyed Transport stream followed immediately by a Payload stream conveyed
together in a single RDMA Send. To transmit large RPC messages, a together in a single RDMA Send. To transmit large RPC messages, a
combination of one RDMA Send operation and one or more RDMA Read or combination of one RDMA Send operation and one or more RDMA Read or
skipping to change at page 14, line 40 skipping to change at page 13, line 47
Receiver implementations MUST support an inline threshold of 1024 Receiver implementations MUST support an inline threshold of 1024
bytes, but MAY support larger inline threshold values. A mechanism bytes, but MAY support larger inline threshold values. A mechanism
for discovering a peer's inline threshold value before a connection for discovering a peer's inline threshold value before a connection
is established may be used to optimize the use of RDMA Send is established may be used to optimize the use of RDMA Send
operations. In the absence of such a mechanism, senders MUST assume operations. In the absence of such a mechanism, senders MUST assume
a receiver's inline threshold is 1024 bytes. a receiver's inline threshold is 1024 bytes.
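As an illustration only, a sender-side check against the inline threshold might look like the following Python sketch. The names DEFAULT_INLINE_THRESHOLD and fits_inline are hypothetical. The sketch also assumes that the threshold applies to the combined size of the Transport stream and the Payload stream carried in one RDMA Send; that reading is an assumption of the example, not a statement from the text above.

```python
from typing import Optional

# Minimum inline threshold every receiver MUST support, in bytes.
DEFAULT_INLINE_THRESHOLD = 1024


def fits_inline(transport_header: bytes, payload: bytes,
                peer_threshold: Optional[int] = None) -> bool:
    """Return True if the whole message fits in a single RDMA Send.

    When no discovery mechanism has supplied the peer's threshold
    (peer_threshold is None), the sender falls back to 1024 bytes.
    Assumption: header and payload sizes are counted together.
    """
    threshold = DEFAULT_INLINE_THRESHOLD if peer_threshold is None else peer_threshold
    return len(transport_header) + len(payload) <= threshold
```

A sender that finds a message does not fit would need to convey part or all of it by other means, such as the chunk mechanisms this document describes.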
4.4. XDR Encoding With Chunks 4.4. XDR Encoding With Chunks
When a direct data placement capability is available, during XDR When a direct data placement capability is available, during XDR
encoding it can be determined that an XDR data item is large enough encoding it can be determined that the transport can efficiently
that it might be more efficient if the transport placed the content place the content of one or more data items directly in the
of the data item directly in the receiver's memory. receiver's memory, separately from the transfer of other parts of the
containing XDR stream.
4.4.1. Reducing An XDR Stream 4.4.1. Reducing An XDR Stream
RPC-over-RDMA Version One provides a mechanism for moving part of an RPC-over-RDMA Version One provides a mechanism for moving part of an
RPC message via a data transfer separate from an RDMA Send/Receive. RPC message via a data transfer separate from an RDMA Send/Receive.
The sender removes one or more XDR data items from the Payload The sender removes one or more XDR data items from the Payload
stream. They are conveyed via one or more RDMA Read or Write stream. They are conveyed via one or more RDMA Read or Write
operations. The receiver inserts the data items into the Payload operations. The receiver inserts the data items into the Payload
stream before passing it to the Upper Layer. stream before passing it to the Upper Layer.
A contiguous piece of a Payload stream that is split out and moved A contiguous piece of a Payload stream can be split out and moved via
via separate RDMA operations is known as a "chunk." A Payload stream separate RDMA operations. The piece of memory containing that
portion of the data stream and metadata in an RPC-over-RDMA header
together comprise what is referred to as a "chunk." A Payload stream
after chunks have been removed is referred to as a "reduced" Payload after chunks have been removed is referred to as a "reduced" Payload
stream. stream. Likewise, a data item that has been removed from a Payload
stream to be transferred separately is referred to as a "reduced"
data item.
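The reduction step can be sketched in Python as follows. This is a simplified model, not the wire procedure: reduce_payload is a hypothetical helper, the caller is assumed to already know the byte offset and length of a DDP-eligible data item, and XDR padding is ignored.

```python
from typing import Tuple


def reduce_payload(payload: bytes, offset: int,
                   length: int) -> Tuple[bytes, Tuple[int, bytes]]:
    """Split one contiguous data item out of a Payload stream.

    Returns the reduced Payload stream and a (position, data) pair
    describing the removed item. In a real transport the data would be
    moved by RDMA Read or Write rather than returned to the caller.
    """
    if offset < 0 or length < 0 or offset + length > len(payload):
        raise ValueError("data item lies outside the Payload stream")
    item = payload[offset:offset + length]
    reduced = payload[:offset] + payload[offset + length:]
    return reduced, (offset, item)
```

The returned position is the offset at which the receiver would need to reinsert the item to recover the original stream.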
4.4.2. DDP-Eligibility 4.4.2. DDP-Eligibility
Only an XDR data item that might benefit from Direct Data Placement Only an XDR data item that might benefit from Direct Data Placement
may be reduced. The eligibility of particular XDR data items to be may be reduced. The eligibility of particular XDR data items to be
reduced is not specified by this document. reduced is independent of RPC-over-RDMA, and thus is not specified by
this document.
To maintain interoperability on an RPC-over-RDMA transport, a To maintain interoperability on an RPC-over-RDMA transport, a
determination must be made of which XDR data items in each Upper determination must be made of which XDR data items in each Upper
Layer Protocol are allowed to use Direct Data Placement. Therefore Layer Protocol are allowed to use Direct Data Placement. Therefore
an additional specification is needed that describes how an Upper an additional specification is needed that describes how an Upper
Layer Protocol enables Direct Data Placement. The set of Layer Protocol enables Direct Data Placement. The set of
requirements for an Upper Layer Protocol to use an RPC-over-RDMA requirements for an Upper Layer Protocol to use an RPC-over-RDMA
transport is known as an "Upper Layer Binding specification," or ULB. transport is known as an "Upper Layer Binding specification," or ULB.
An Upper Layer Binding specification states which specific individual An Upper Layer Binding specification states which specific individual
XDR data items in an Upper Layer Protocol MAY be transferred via XDR data items in an Upper Layer Protocol MAY be transferred via
Direct Data Placement. This document will refer to XDR data items Direct Data Placement. This document will refer to XDR data items
that are permitted to be reduced as "DDP-eligible". All other XDR that are permitted to be reduced as "DDP-eligible". All other XDR
data items MUST NOT be reduced. RPC-over-RDMA Version One uses RDMA data items MUST NOT be reduced. RPC-over-RDMA Version One uses RDMA
Read and Write operations to transfer DDP-eligible data that has been Read and Write operations to transfer DDP-eligible data that has been
reduced. reduced.
Detailed requirements for Upper Layer Bindings are discussed in full Detailed requirements for Upper Layer Bindings are discussed in full
in Section 8. in Section 7.
4.4.3. RDMA Segments 4.4.3. RDMA Segments
When encoding a Payload stream that contains a DDP-eligible data When encoding a Payload stream that contains a DDP-eligible data
item, a sender may choose to reduce that data item. It does not item, a sender may choose to reduce that data item. It does not
place the item into the Payload stream. Instead, the sender records place the item into the Payload stream. Instead, the sender records
in the RPC-over-RDMA header the actual address and size of the memory in the RPC-over-RDMA header the actual address and size of the memory
region containing that data item. region containing that data item.
The requester provides location information for DDP-eligible data The requester provides location information for DDP-eligible data
skipping to change at page 18, line 21 skipping to change at page 17, line 44
registers memory segments containing data in Read chunks. It registers memory segments containing data in Read chunks. It
advertises these chunks in the RPC-over-RDMA header of the RPC Call. advertises these chunks in the RPC-over-RDMA header of the RPC Call.
After receiving an RPC Call sent via an RDMA Send operation, a After receiving an RPC Call sent via an RDMA Send operation, a
responder transfers the chunk data from the requester using RDMA Read responder transfers the chunk data from the requester using RDMA Read
operations. The responder reconstructs the transferred chunk data by operations. The responder reconstructs the transferred chunk data by
concatenating the contents of each segment, in list order, into the concatenating the contents of each segment, in list order, into the
received Payload stream at the Position value recorded in the received Payload stream at the Position value recorded in the
segment. segment.
Put another way, a receiver inserts the first segment in a Read chunk Put another way, the responder inserts the first segment in a Read
into the Payload stream at the byte offset indicated by its Position chunk into the Payload stream at the byte offset indicated by its
field. Segments whose Position field value matches this offset are Position field. Segments whose Position field value matches this
concatenated afterwards, until there are no more segments at that offset are concatenated afterwards, until there are no more segments
Position value. The next XDR data item in the Payload stream at that Position value. The next XDR data item in the Payload stream
follows. follows.
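The reassembly rule above can be sketched in Python. This is an illustration only, not part of the protocol definition. The function name `reassemble_call` and the `(position, data)` tuples are hypothetical. The sketch assumes the segment data has already been pulled with RDMA Read, and that each Position value is a byte offset into the Payload stream as it stood before reduction.

```python
def reassemble_call(reduced_payload, read_segments):
    """Rebuild a Payload stream from its reduced form plus Read segments.

    reduced_payload: bytes that arrived inline via RDMA Send.
    read_segments:   list of (position, data) tuples in list order
                     (hypothetical representation of fetched segments).
    """
    out = bytearray()
    consumed = 0  # bytes of reduced_payload copied so far
    i = 0
    while i < len(read_segments):
        position = read_segments[i][0]
        # Copy the inline bytes that precede this chunk's Position.
        inline_needed = position - len(out)
        if inline_needed < 0 or consumed + inline_needed > len(reduced_payload):
            raise ValueError("Position value does not fit the Payload stream")
        out += reduced_payload[consumed:consumed + inline_needed]
        consumed += inline_needed
        # Concatenate every consecutive segment that shares this Position.
        while i < len(read_segments) and read_segments[i][0] == position:
            out += read_segments[i][1]
            i += 1
    # The next XDR data item follows the chunk data.
    out += reduced_payload[consumed:]
    return bytes(out)
```

Two segments carrying the same Position value form one chunk, so their contents land back to back before the remaining inline data resumes.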
4.4.5.1. Read Chunk Round-up 4.4.5.1. Read Chunk Round-up
XDR requires each encoded data item to start on four-byte alignment. XDR requires each encoded data item to start on four-byte alignment.
When an odd-length data item is encoded, its length is encoded When an odd-length data item is encoded, its length is encoded
literally, while the data is padded so the next data item in the XDR literally, while the data is padded so the next data item in the XDR
stream can start on a four-byte boundary. Receivers ignore the stream can start on a four-byte boundary. Receivers ignore the
content of the pad bytes. content of the pad bytes.
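As a small illustration of the XDR padding rule just described, the following Python sketch encodes a variable-length opaque item. The helper names `xdr_pad_len` and `xdr_encode_opaque` are hypothetical; the use of zero bytes for padding follows [RFC4506].

```python
import struct

def xdr_pad_len(length):
    """Number of pad bytes needed to reach the next four-byte boundary."""
    return (4 - length % 4) % 4

def xdr_encode_opaque(data):
    """Encode variable-length opaque data: literal length, data, then pad."""
    pad = b"\x00" * xdr_pad_len(len(data))
    # The length word records the unpadded size of the data item.
    return struct.pack(">I", len(data)) + data + pad
```

A five-byte item is encoded with a length word of 5 followed by eight bytes of data and pad, so the next item starts on a four-byte boundary.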
skipping to change at page 21, line 20 skipping to change at page 20, line 38
for its length. The reply to this common request is about 100 bytes. for its length. The reply to this common request is about 100 bytes.
Since all RPC messages conveyed via RPC-over-RDMA require an RDMA Since all RPC messages conveyed via RPC-over-RDMA require an RDMA
Send operation, the most efficient way to send an RPC message that is Send operation, the most efficient way to send an RPC message that is
smaller than the receiver's inline threshold is to append the Payload smaller than the receiver's inline threshold is to append the Payload
stream directly to the Transport stream. An RPC-over-RDMA header stream directly to the Transport stream. An RPC-over-RDMA header
with a small RPC Call or Reply message immediately following is with a small RPC Call or Reply message immediately following is
transferred using a single RDMA Send operation. No RDMA Read or transferred using a single RDMA Send operation. No RDMA Read or
Write operations are needed. Write operations are needed.
An RPC-over-RDMA transaction using Short Messages:
        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     | Processing
            |                                     |
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
4.5.2. Chunked Messages 4.5.2. Chunked Messages
If DDP-eligible data items are present in a Payload stream, a sender If DDP-eligible data items are present in a Payload stream, a sender
MAY reduce the Payload stream to enable the use of RDMA Read or Write MAY reduce some or all of these items by removing them from the
operations to move the reduced data items. The Transport stream with Payload stream. The sender uses RDMA Read or Write operations to
the reduced Payload stream immediately following is transferred using transfer the reduced data items. The Transport stream with the
a single RDMA Send operation. reduced Payload stream immediately following is then transferred
using a single RDMA Send operation.
After receiving the Transport and Payload streams of a Chunked RPC- After receiving the Transport and Payload streams of a Chunked RPC-
over-RDMA Call message, the responder uses RDMA Read operations to over-RDMA Call message, the responder uses RDMA Read operations to
move reduced data items in Read chunks. Before sending the Transport move reduced data items in Read chunks. Before sending the Transport
and Payload streams of a Chunked RPC-over-RDMA Reply message, the and Payload streams of a Chunked RPC-over-RDMA Reply message, the
responder uses RDMA Write operations to move reduced data items in responder uses RDMA Write operations to move reduced data items in
Write and Reply chunks. Write and Reply chunks.
An RPC-over-RDMA transaction with a Read chunk:
        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |              RDMA Read              |
            |   <------------------------------   |
            |      RDMA Response (arg data)       |
            |   ------------------------------>   |
            |                                     | Processing
            |                                     |
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
An RPC-over-RDMA transaction with a Write chunk:
        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     | Processing
            |                                     |
            |                                     |
            |      RDMA Write (result data)       |
            |   <------------------------------   |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
4.5.3. Long Messages 4.5.3. Long Messages
When a Payload stream is larger than the receiver's inline threshold, When a Payload stream is larger than the receiver's inline threshold,
the Payload stream is reduced by removing DDP-eligible data items and the Payload stream is reduced by removing DDP-eligible data items and
placing them in chunks to be moved separately. If there are no DDP- placing them in chunks to be moved separately. If there are no DDP-
eligible data items in the Payload stream, or the Payload stream is eligible data items in the Payload stream, or the Payload stream is
still too large after it has been reduced, the RDMA transport MUST still too large after it has been reduced, the RDMA transport MUST
use RDMA Read or Write operations to convey the Payload stream use RDMA Read or Write operations to convey the Payload stream
itself. This mechanism is referred to as a "Long Message." itself. This mechanism is referred to as a "Long Message."
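The choice among Short, Chunked, and Long Messages can be sketched as a simple size check. This is one possible sender policy, not the only one the text permits (a sender MAY reduce even when the message would fit inline). The function name and parameters are hypothetical, a single `header_len` value stands in for the size of the RPC-over-RDMA header even though that size varies with the chunk lists, and treating a message exactly equal to the threshold as fitting is an assumption.

```python
def choose_message_form(payload_len, ddp_eligible_len, header_len,
                        inline_threshold):
    """Pick a message form for one RPC message (illustrative policy only).

    payload_len:      size of the unreduced Payload stream in bytes
    ddp_eligible_len: bytes of DDP-eligible data items that could be reduced
    header_len:       assumed size of the RPC-over-RDMA header
    inline_threshold: the receiver's inline threshold
    """
    # Short Message: header and Payload stream fit in one RDMA Send.
    if header_len + payload_len <= inline_threshold:
        return "short"
    # Chunked Message: reduction makes the remainder fit inline.
    reduced_len = payload_len - ddp_eligible_len
    if ddp_eligible_len > 0 and header_len + reduced_len <= inline_threshold:
        return "chunked"
    # Long Message: the Payload stream itself moves via RDMA Read or Write.
    return "long"
```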
skipping to change at page 22, line 22 skipping to change at page 22, line 40
requester sizes the Reply chunk to accommodate the maximum requester sizes the Reply chunk to accommodate the maximum
expected reply size for that Upper Layer operation. expected reply size for that Upper Layer operation.
Though the purpose of a Long Message is to handle large RPC messages, Though the purpose of a Long Message is to handle large RPC messages,
requesters MAY use a Long Message at any time to convey an RPC Call. requesters MAY use a Long Message at any time to convey an RPC Call.
A responder chooses which form of reply to use based on the chunks A responder chooses which form of reply to use based on the chunks
provided by the requester. If Write chunks were provided and the provided by the requester. If Write chunks were provided and the
responder has a DDP-eligible result, it first reduces the reply responder has a DDP-eligible result, it first reduces the reply
Payload stream. If a Reply chunk was provided and the reduced Payload stream. If a Reply chunk was provided and the reduced
Payload is larger than the requester's inline threshold, the Payload stream is larger than the requester's inline threshold, the
responder MUST use the provided Reply chunk for the reply. responder MUST use the provided Reply chunk for the reply.
Because these special chunks contain a whole RPC message, any XDR Because these special chunks contain a whole RPC message, XDR data
data item MAY appear in one of these special chunks without regard to items appear in these special chunks without regard to their DDP-
its DDP-eligibility. DDP-eligible data items MAY be removed from eligibility.
these special chunks and conveyed via normal chunks, but non-eligible
data items MUST NOT appear in normal chunks. An RPC-over-RDMA transaction using a Long Call:
        Requester                             Responder
            |       RDMA Send (RDMA_NOMSG)        |
       Call |   ------------------------------>   |
            |              RDMA Read              |
            |   <------------------------------   |
            |      RDMA Response (RPC call)       |
            |   ------------------------------>   |
            |                                     | Processing
            |                                     |
            |                                     |
            |        RDMA Send (RDMA_MSG)         |
            |   <------------------------------   | Reply
An RPC-over-RDMA transaction using a Long Reply:
        Requester                             Responder
            |        RDMA Send (RDMA_MSG)         |
       Call |   ------------------------------>   |
            |                                     | Processing
            |                                     |
            |                                     |
            |       RDMA Write (RPC reply)        |
            |   <------------------------------   |
            |       RDMA Send (RDMA_NOMSG)        |
            |   <------------------------------   | Reply
5. RPC-Over-RDMA In Operation 5. RPC-Over-RDMA In Operation
Every RPC-over-RDMA Version One message has a header that includes a Every RPC-over-RDMA Version One message has a header that includes a
copy of the message's transaction ID, data for managing RDMA flow copy of the message's transaction ID, data for managing RDMA flow
control credits, and lists of RDMA segments used for RDMA Read and control credits, and lists of RDMA segments used for RDMA Read and
Write operations. All RPC-over-RDMA header content is contained in Write operations. All RPC-over-RDMA header content is contained in
the Transport stream, and thus MUST be XDR encoded. the Transport stream, and thus MUST be XDR encoded.
RPC message layout is unchanged from that described in [RFC5531] RPC message layout is unchanged from that described in [RFC5531]
except for the possible reduction of data items that are moved by except for the possible reduction of data items that are moved by
RDMA Read or Write operations. RDMA Read or Write operations.
The RPC-over-RDMA protocol passes RPC messages without regard to The RPC-over-RDMA protocol passes RPC messages without regard to
their type (CALL or REPLY) or direction (forwards or backwards). their type (CALL or REPLY) or direction (forwards or backwards).
Both endpoints of a connection MAY send any RPC-over-RDMA message Each endpoint of a connection MAY send any RPC-over-RDMA message
header type at any time (subject to credit limits). header type at any time (subject to credit limits).
5.1. XDR Protocol Definition 5.1. XDR Protocol Definition
This section contains a description of the core features of the RPC- This section contains a description of the core features of the RPC-
over-RDMA Version One protocol, expressed in the XDR language over-RDMA Version One protocol, expressed in the XDR language
[RFC4506]. [RFC4506].
This description is provided in a way that makes it simple to extract This description is provided in a way that makes it simple to extract
into ready-to-compile form. The reader can apply the following shell into ready-to-compile form. The reader can apply the following shell
skipping to change at page 23, line 28 skipping to change at page 24, line 34
That is, if the above script is stored in a file called "extract.sh" That is, if the above script is stored in a file called "extract.sh"
and this document is in a file called "spec.txt" then the reader can and this document is in a file called "spec.txt" then the reader can
do the following to extract an XDR description file: do the following to extract an XDR description file:
<CODE BEGINS> <CODE BEGINS>
sh extract.sh < spec.txt > rpcrdma_corev1.x sh extract.sh < spec.txt > rpcrdma_corev1.x
<CODE ENDS> <CODE ENDS>
As described in Section 9.4, extensions to RPC-over-RDMA Version One,
published as Proposed Standards, will have similar means of providing
an XDR description appropriate to those extensions. Once XDR for
extensions is also extracted, it can be appended to the XDR
description file extracted from this document to produce a
consolidated XDR description file reflecting all extensions selected
for an RPC-over-RDMA implementation.
RPC-over-RDMA is not a stand-alone RPC Program. To enable protocol
extension, there is no single XDR entity which describes the format
of RPC-over-RDMA headers. Instead, implementers need to follow the
instructions in Section 5.1.4 to appropriately encode and decode
protocol messages.
5.1.1. Code Component License 5.1.1. Code Component License
Code components extracted from this document must include the Code components extracted from this document must include the
following license text. When the extracted XDR code is combined with following license text. When the extracted XDR code is combined with
other complementary XDR code which itself has an identical license, other complementary XDR code which itself has an identical license,
only a single copy of the license text need be preserved. only a single copy of the license text need be preserved.
<CODE BEGINS> <CODE BEGINS>
/// /* /// /*
/// * Copyright (c) 2010, 2015 IETF Trust and the persons /// * Copyright (c) 2010, 2016 IETF Trust and the persons
/// * identified as authors of the code. All rights reserved. /// * identified as authors of the code. All rights reserved.
/// * /// *
/// * The authors of the code are: /// * The authors of the code are:
/// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck. /// * B. Callaghan, T. Talpey, and C. Lever
/// * /// *
/// * Redistribution and use in source and binary forms, with /// * Redistribution and use in source and binary forms, with
/// * or without modification, are permitted provided that the /// * or without modification, are permitted provided that the
/// * following conditions are met: /// * following conditions are met:
/// * /// *
/// * - Redistributions of source code must retain the above /// * - Redistributions of source code must retain the above
/// * copyright notice, this list of conditions and the /// * copyright notice, this list of conditions and the
/// * following disclaimer. /// * following disclaimer.
/// * /// *
/// * - Redistributions in binary form must reproduce the above /// * - Redistributions in binary form must reproduce the above
skipping to change at page 24, line 48 skipping to change at page 25, line 48
/// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
/// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
/// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
/// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
/// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
/// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
/// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
/// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
/// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
/// */ /// */
///
<CODE ENDS> <CODE ENDS>
5.1.2. XDR Applying To All Versions Of RPC-Over-RDMA 5.1.2. RPC-Over-RDMA Version One XDR
XDR data items defined in this section describe elements of the RPC- XDR data items defined in this section encode the Transport Header
over-RDMA protocol that are not subject to change in subsequent Stream in each RPC-over-RDMA Version One message. Comments identify
versions. A full discussion of the extensibility model is in items that cannot be changed in subsequent versions.
Section 9.
<CODE BEGINS> <CODE BEGINS>
/// typedef uint32 rpcrdma_htype; /// /*
/// * Plain RDMA segment (Section 4.4.3)
/// */
/// struct xdr_rdma_segment {
/// uint32 handle; /* Registered memory handle */
/// uint32 length; /* Length of the chunk in bytes */
/// uint64 offset; /* Chunk virtual address or offset */
/// };
/// ///
/// struct rpcrdma_prefix { /// /*
/// uint32 rdma_xid; /// * Read segment (Section 4.4.5)
/// uint32 rdma_version; /// */
/// uint32 rdma_credits; /// struct xdr_read_chunk {
/// rpcrdma_htype rdma_htype; /// uint32 position; /* Position in XDR stream */
/// struct xdr_rdma_segment target;
/// }; /// };
/// ///
/// /* /// /*
/// * Mandatory RPC-over-RDMA message header types /// * Read list (Section 5.3.1)
/// */ /// */
/// const RDMA_MSG = 0; /// struct xdr_read_list {
/// const RDMA_NOMSG = 1; /// struct xdr_read_chunk entry;
/// const RDMA_ERROR = 4; /// struct xdr_read_list *next;
/// };
/// ///
/// struct rpcrdma_err_vers { /// /*
/// uint32 rdma_vers_low; /// * Write chunk (Section 4.4.6)
/// uint32 rdma_vers_high; /// */
/// struct xdr_write_chunk {
/// struct xdr_rdma_segment target<>;
/// }; /// };
///
<CODE ENDS>
5.1.3. XDR Applying To Version One Of RPC-Over-RDMA
XDR data items defined in this section are subject to change in
subsequent RPC-over-RDMA versions.
Even though the names of structures and unions begin "rpcrdma1_"
these are not restricted to use in RPC-over-RDMA Version One.
Structure definitions may be carried over unchanged to subsequent
versions, but unions are subject to extension according to the rules
for compatible XDR extension as discussed in Section 9. Comments
identify items that cannot be changed in subsequent versions.
<CODE BEGINS>
/// /* /// /*
/// * Version One reserved message types /// * Write list (Section 5.3.2)
/// */ /// */
/// const RDMA_MSGP = 2; /// struct xdr_write_list {
/// const RDMA_DONE = 3; /// struct xdr_write_chunk entry;
/// struct xdr_write_list *next;
/// };
/// ///
/// struct rpcrdma1_segment { /// /*
/// uint32 rdma_handle; /// * Chunk lists (Section 5.3)
/// uint32 rdma_length; /// */
/// uint64 rdma_offset; /// struct rpc_rdma_header {
/// struct xdr_read_list *rdma_reads;
/// struct xdr_write_list *rdma_writes;
/// struct xdr_write_chunk *rdma_reply;
/// /* rpc body follows */
/// }; /// };
/// ///
/// struct rpcrdma1_read_segment { /// struct rpc_rdma_header_nomsg {
/// uint32 rdma_position; /// struct xdr_read_list *rdma_reads;
/// struct rpcrdma1_segment rdma_target; /// struct xdr_write_list *rdma_writes;
/// struct xdr_write_chunk *rdma_reply;
/// }; /// };
/// ///
/// struct rpcrdma1_read_list { /// struct rpc_rdma_header_padded {
/// struct rpcrdma1_read_segment rdma_entry; /// uint32 rdma_align; /* Padding alignment */
/// struct rpcrdma1_read_list *rdma_next; /// uint32 rdma_thresh; /* Padding threshold */
/// struct xdr_read_list *rdma_reads;
/// struct xdr_write_list *rdma_writes;
/// struct xdr_write_chunk *rdma_reply;
/// /* rpc body follows */
/// }; /// };
/// ///
/// struct rpcrdma1_write_chunk { /// /*
/// struct rpcrdma1_segment rdma_target<>; /// * Error handling (Section 5.5)
/// */
/// enum rpc_rdma_errcode {
/// ERR_VERS = 1, /* Fixed for all versions */
/// ERR_CHUNK = 2
/// }; /// };
/// ///
/// struct rpcrdma1_write_list { /// struct rpc_rdma_errvers {
/// struct rpcrdma1_write_chunk rdma_entry; /// uint32 rdma_vers_low;
/// struct rpcrdma1_write_list *rdma_next; /// uint32 rdma_vers_high;
/// }; /// };
/// ///
/// struct rpcrdma1_chunks { /// union rpc_rdma_error switch (rpc_rdma_errcode err) {
/// struct rpcrdma1_read_list *rdma_reads; /// case ERR_VERS:
/// struct rpcrdma1_write_list *rdma_writes; /// rpc_rdma_errvers range;
/// struct rpcrdma1_write_chunk *rdma_reply; /// case ERR_CHUNK:
/// void;
/// }; /// };
/// ///
/// struct rpcrdma1_padded { /// /*
/// uint32 rdma_align; /// * Procedures (Section 5.2.4)
/// uint32 rdma_thresh; /// */
/// rpcrdma1_chunks rdma_chunks; /// enum rdma_proc {
/// RDMA_MSG = 0, /* Fixed for all versions */
/// RDMA_NOMSG = 1, /* Fixed for all versions */
/// RDMA_MSGP = 2, /* Reserved */
/// RDMA_DONE = 3, /* Reserved */
/// RDMA_ERROR = 4 /* Fixed for all versions */
/// }; /// };
/// ///
/// enum rpcrdma1_errcode { /// union rdma_body switch (rdma_proc proc) {
/// RDMA_ERR_VERS = 1, /// case RDMA_MSG:
/// RDMA_ERR_BADHEADER = 2 /// rpc_rdma_header rdma_msg;
/// case RDMA_NOMSG:
/// rpc_rdma_header_nomsg rdma_nomsg;
/// case RDMA_MSGP:
/// rpc_rdma_header_padded rdma_msgp;
/// case RDMA_DONE:
/// void;
/// case RDMA_ERROR:
/// rpc_rdma_error rdma_error;
/// }; /// };
/// ///
/// union rpcrdma1_error switch (rpcrdma1_errcode rdma_err) { /// /*
/// case RDMA_ERR_VERS: /// * Fixed header fields (Section 5.2)
/// rpcrdma_err_vers rdma_vrange; /* Immutable */ /// */
/// case RDMA_ERR_BADHEADER: /// struct rdma_msg {
/// void; /// uint32 rdma_xid;
/// uint32 rdma_vers;
/// uint32 rdma_credit;
/// rdma_body rdma_body;
/// }; /// };
<CODE ENDS> <CODE ENDS>
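To show how the XDR above maps to bytes on the wire, here is a minimal Python sketch that encodes an RDMA_MSG header and a Read list by hand. The function names are hypothetical and no XDR compiler output is implied. The sketch relies only on [RFC4506] rules: integers are big-endian, and each optional-data pointer is a four-byte boolean followed by the item when present.

```python
import struct

RDMA_MSG = 0  # value from enum rdma_proc

def encode_read_list(segments):
    """Encode a Read list from (position, handle, length, offset) tuples."""
    out = b""
    for position, handle, length, offset in segments:
        # 1 = "entry present", then xdr_read_chunk: position + plain segment.
        out += struct.pack(">IIIIQ", 1, position, handle, length, offset)
    return out + struct.pack(">I", 0)  # 0 terminates the list

def encode_rdma_msg_header(xid, credits, read_segments=()):
    """Encode the four fixed fields plus chunk lists for an RDMA_MSG."""
    fixed = struct.pack(">IIII", xid, 1, credits, RDMA_MSG)
    reads = encode_read_list(read_segments)
    writes = struct.pack(">I", 0)  # empty Write list
    reply = struct.pack(">I", 0)   # no Reply chunk
    return fixed + reads + writes + reply  # the RPC body follows this header
```

With all three chunk lists empty the header occupies 28 bytes: four fixed words and three "not present" words.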
5.1.4. Use Of XDR Descriptions
Though it is described by XDR, RPC-over-RDMA is not an RPC Program.
Certain functions normally provided by RPC need to be addressed by
the RPC-over-RDMA definition itself. In particular, the following
functions normally provided by RPC need to be provided for as part of
the RPC-over-RDMA XDR description:
o Negotiation of RPC-over-RDMA protocol version
o Identifying RPC-over-RDMA header types that are followed by a
Payload stream
In [RFC5666] the XDR description did not take account of the natural
layering between the part of RPC-over-RDMA functionality that
performed RPC-layer like functions described above and that which
implemented individual transport functions. As a result:
o The four 32-bit words which must be the same in all versions of
RPC-over-RDMA are split, with three of those words in struct
rpcrdma1_header and the remaining word part of union
rpcrdma1_body, together with each of the message bodies.
o It is impossible, within the resulting structure, to add a new
message type without modifying the existing XDR description.
The XDR description introduced in this document reorganizes the XDR
in line with this natural layering, while maintaining over-the-wire
equivalence. As a result, the 32-bit big-endian field starting
twelve bytes into the header is no longer the discriminator field of
union rpcrdma1_body. Instead it is the last 32-bit word within
struct rpcrdma_prefix, which defines the common (i.e., for all RPC-
over-RDMA versions) header prefix. It retains its role of indicating
the message type and deciding which particular header body is to
follow.
As a result there is no longer a single XDR item that encompasses the
entire RPC-over-RDMA header. Instead, each RPC-over-RDMA message
consists of up to three items and those using XDR encode and decode
must be aware that they proceed in sequence as follows:
1. A struct rpcrdma_prefix
2. Depending on the rdma_htype field of the prefix, the appropriate
header body for that message type as given by Table 1. In cases
in which there is an undefined header type, this is to be treated
as an XDR encode/decode error.
3. If allowed for that header type as defined in Table 1, an XDR
stream for the RPC message being transported
+--------------+------------------------+-------------------+
| Message Type | Body | Payload stream? |
+--------------+------------------------+-------------------+
| RDMA_MSG | struct rpcrdma1_chunks | Yes |
+--------------+------------------------+-------------------+
| RDMA_NOMSG | struct rpcrdma1_chunks | No |
+--------------+------------------------+-------------------+
| RDMA_ERROR | union rpcrdma1_error | No |
+--------------+------------------------+-------------------+
Table 1. Header Type Characteristics
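The decode order listed above can be sketched in Python as follows. The function name `decode_prefix` and the dictionary keys are hypothetical. The sketch covers only the first step and the dispatch decision; decoding the header body selected by the header type is left out.

```python
import struct

# Header type -> whether a Payload stream may follow (from Table 1).
PAYLOAD_FOLLOWS = {
    0: True,   # RDMA_MSG
    1: False,  # RDMA_NOMSG
    4: False,  # RDMA_ERROR
}

def decode_prefix(buf):
    """Decode the four-word prefix; return (prefix, remaining bytes)."""
    if len(buf) < 16:
        raise ValueError("buffer too short for header prefix")
    xid, version, credits, htype = struct.unpack_from(">IIII", buf, 0)
    # An undefined header type is treated as an XDR decode error.
    if htype not in PAYLOAD_FOLLOWS:
        raise ValueError("undefined header type %d" % htype)
    prefix = {
        "xid": xid,
        "version": version,
        "credits": credits,
        "htype": htype,
        "payload_follows": PAYLOAD_FOLLOWS[htype],
    }
    return prefix, buf[16:]
```

The caller would next decode the header body chosen by `htype` from the remaining bytes and, when `payload_follows` is true, hand whatever is left to the RPC layer.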
5.2. Fixed Header Fields 5.2. Fixed Header Fields
The RPC-over-RDMA header begins with four fixed 32-bit fields that The RPC-over-RDMA header begins with four fixed 32-bit fields that
control the RDMA interaction. These four fields, which must remain control the RDMA interaction. These four fields, which must remain
with the same meanings and in the same positions in all subsequent with the same meanings and in the same positions in all subsequent
versions of the RPC-over-RDMA protocol, are described below. versions of the RPC-over-RDMA protocol, are described below.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| XID | | XID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
skipping to change at page 29, line 19 skipping to change at page 29, line 29
establish context as soon as each RPC-over-RDMA message arrives. establish context as soon as each RPC-over-RDMA message arrives.
This XID MUST be the same as the XID in the RPC message. The This XID MUST be the same as the XID in the RPC message. The
receiver MAY perform its processing based solely on the XID in the receiver MAY perform its processing based solely on the XID in the
RPC-over-RDMA header, and thereby ignore the XID in the RPC message, RPC-over-RDMA header, and thereby ignore the XID in the RPC message,
if it so chooses. if it so chooses.
5.2.2. Version Number 5.2.2. Version Number
For RPC-over-RDMA Version One, this field MUST contain the value one For RPC-over-RDMA Version One, this field MUST contain the value one
(1). Rules regarding changes to this transport protocol version (1). Rules regarding changes to this transport protocol version
number can be found in Section 9.3. number can be found in Section 8.
5.2.3. Credit Value 5.2.3. Credit Value
When sent in an RPC Call message, the requested credit value is When sent in an RPC Call message, the requested credit value is
provided. When sent in an RPC Reply message, the granted credit provided. When sent in an RPC Reply message, the granted credit
value is returned. RPC Calls SHOULD NOT be sent in excess of the value is returned. RPC Calls SHOULD NOT be sent in excess of the
currently granted limit. Further discussion of how the credit value currently granted limit. Further discussion of how the credit value
is determined can be found in Section 4.3. is determined can be found in Section 4.3.
5.2.4. Procedure number 5.2.4. Procedure number
skipping to change at page 31, line 5 skipping to change at page 31, line 12
responder is to retrieve via RDMA Read operations. The requester has responder is to retrieve via RDMA Read operations. The requester has
removed the data in these chunks from the call's Payload stream. removed the data in these chunks from the call's Payload stream.
Via a Position Zero Read Chunk, a requester may provide an RPC Call Via a Position Zero Read Chunk, a requester may provide an RPC Call
message as a chunk in the Read list. message as a chunk in the Read list.
If the RPC Call has no argument data that is DDP-eligible and the If the RPC Call has no argument data that is DDP-eligible and the
Position Zero Read Chunk is not being used, the requester leaves the Position Zero Read Chunk is not being used, the requester leaves the
Read list empty. Read list empty.
Responders MUST leave the Read list empty in all replies.
5.3.2. Write List 5.3.2. Write List
Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list." The Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list." The
Write list is a list of zero or more Write chunks, provided by the Write list is a list of zero or more Write chunks, provided by the
requester. Each Write chunk is an array of RDMA segments, thus the requester. Each Write chunk is an array of RDMA segments, thus the
Write list is a list of counted arrays. Each Write chunk advertises Write list is a list of counted arrays. Each Write chunk advertises
receptacles for DDP-eligible data to be pushed by the responder via receptacles for DDP-eligible data to be pushed by the responder via
RDMA Write operations. If the RPC Reply has no possible DDP-eligible RDMA Write operations. If the RPC Reply has no possible DDP-eligible
result data items, the requester leaves the Write list empty. result data items, the requester leaves the Write list empty.
*** This section needs to specify when a requester must provide Write
chunks, and how many chunks must be provided. ***
When a Write list is provided for the results of an RPC Call, the When a Write list is provided for the results of an RPC Call, the
responder MUST provide data corresponding to DDP-eligible XDR data responder MUST provide data corresponding to DDP-eligible XDR data
items via RDMA Write operations to the memory referenced in the Write items via RDMA Write operations to the memory referenced in the Write
list. The responder removes the data in these chunks from the list. The responder removes the data in these chunks from the
reply's Payload stream. reply's Payload stream.
When multiple Write chunks are present, the responder fills in each When multiple Write chunks are present, the responder fills in each
Write chunk with a DDP-eligible result until either there are no more Write chunk with a DDP-eligible result until either there are no more
results or no more Write chunks. The requester may not be able to results or no more Write chunks. The requester may not be able to
predict which DDP-eligible data item goes in which chunk. Thus the predict which DDP-eligible data item goes in which chunk. Thus the
skipping to change at page 32, line 8 skipping to change at page 32, line 14
chunk in the Write list. If the responder populates that chunk with chunk in the Write list. If the responder populates that chunk with
data, the requester knows with certainty which result data item is data, the requester knows with certainty which result data item is
contained in it. contained in it.
However, Upper Layer Protocol procedures may allow replies where more However, Upper Layer Protocol procedures may allow replies where more
than one result data item is DDP-eligible. For example, an NFSv4 than one result data item is DDP-eligible. For example, an NFSv4
COMPOUND procedure is composed of individual NFSv4 operations, more COMPOUND procedure is composed of individual NFSv4 operations, more
than one of which may have a reply containing a DDP-eligible result. than one of which may have a reply containing a DDP-eligible result.
As stated above, when multiple Write chunks are present, the As stated above, when multiple Write chunks are present, the
responder reduces DDP-eligible result until either there are no more responder reduces DDP-eligible results until either there are no more
results or no more Write chunks. Then, as the requester decodes the results or no more Write chunks. Then, as the requester decodes the
reply Payload stream, it is clear from the contents of the reply reply Payload stream, it is clear from the contents of the reply
which Write chunk contains which data item. which Write chunk contains which data item.
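The in-order matching of DDP-eligible results to Write chunks can be sketched as below. The function name and its list-based arguments are hypothetical. Raising an error when a result exceeds the chunk offered for it is an assumption of this sketch; the text above does not say how that case is handled.

```python
def fill_write_chunks(chunk_capacities, results):
    """Pair DDP-eligible results with Write chunks in list order.

    chunk_capacities: byte capacity of each Write chunk the requester offered
    results:          DDP-eligible result data items (bytes), in XDR order
    Returns (used, inline): bytes written into each chunk, and the results
    left in the reply Payload stream because no chunk remained for them.
    """
    used = [0] * len(chunk_capacities)
    # zip() stops when either the chunks or the results run out.
    for i, (capacity, data) in enumerate(zip(chunk_capacities, results)):
        if len(data) > capacity:
            raise ValueError("result %d does not fit its Write chunk" % i)
        used[i] = len(data)
    inline = list(results[len(chunk_capacities):])
    return used, inline
```

A chunk with a used length of zero is one the responder did not populate, which is how the requester can tell which chunks carry result data as it decodes the reply.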
When a requester has provided a Write list in a Call message, the
responder MUST copy that list into the associated Reply. The copied
Write list in the Reply is modified as above to reflect the actual
amount of data that is being returned in the Write list.
5.3.3. Reply Chunk 5.3.3. Reply Chunk
Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk." The Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk." The
Reply chunk is a Write chunk, provided by the requester. The Reply Reply chunk is a Write chunk, provided by the requester. The Reply
chunk is a single counted array of RDMA segments. chunk is a single counted array of RDMA segments.
A requester MUST provide a Reply chunk whenever the maximum possible A requester MUST provide a Reply chunk whenever the maximum possible
size of the reply is larger than its own inline threshold. The Reply size of the reply is larger than its own inline threshold. The Reply
chunk MUST be large enough to contain a Payload stream (RPC message) chunk MUST be large enough to contain a Payload stream (RPC message)
of this maximum size. If the actual reply Payload stream is smaller of this maximum size. If the actual reply Payload stream is smaller
than the requester's inline threshold, the responder MAY return it as than the requester's inline threshold, the responder MAY return it as
a Short message rather than using the Reply chunk. a Short message rather than using the Reply chunk.
When a requester has provided a Reply chunk in a Call message, the
responder MUST copy that chunk into the associated Reply. The copied
Reply chunk in the Reply is modified to reflect the actual amount of
data that is being returned in the Reply chunk.
5.4. Memory Registration 5.4. Memory Registration
RDMA requires that data is transferred between only registered memory RDMA requires that data is transferred between only registered memory
segments at the source and destination. All protocol headers as well segments at the source and destination. All protocol headers as well
as separately transferred data chunks must reside in registered as separately transferred data chunks must reside in registered
memory. memory.
Since the cost of registering and de-registering memory can be a Since the cost of registering and de-registering memory can be a
significant proportion of the RDMA transaction cost, it is important significant proportion of the RDMA transaction cost, it is important
to minimize registration activity. For memory that is targeted by to minimize registration activity. For memory that is targeted by
skipping to change at page 34, line 19 skipping to change at page 34, line 35
To form an RDMA_ERROR procedure: The rdma_xid field MUST contain the To form an RDMA_ERROR procedure: The rdma_xid field MUST contain the
same XID that was in the rdma_xid field in the failing request; The same XID that was in the rdma_xid field in the failing request; The
rdma_vers field MUST contain the same version that was in the rdma_vers field MUST contain the same version that was in the
rdma_vers field in the failing request; The rdma_proc field MUST rdma_vers field in the failing request; The rdma_proc field MUST
contain the value RDMA_ERROR; The rdma_err field contains a value contain the value RDMA_ERROR; The rdma_err field contains a value
that reflects the type of error that occurred, as described below. that reflects the type of error that occurred, as described below.
An RDMA_ERROR procedure indicates a permanent error. Receipt of this An RDMA_ERROR procedure indicates a permanent error. Receipt of this
procedure completes the RPC transaction associated with XID in the procedure completes the RPC transaction associated with XID in the
rdma_xid field. A receiver MUST silently discard an RDMA_ERROR rdma_xid field. A receiver MUST silently discard an RDMA_ERROR
procedure that cannot be decoded. procedure that it cannot decode.
5.5.1. Header Version Mismatch 5.5.1. Header Version Mismatch
When a receiver detects an RPC-over-RDMA header version that it does When a receiver detects an RPC-over-RDMA header version that it does
not support (currently this document defines only Version One), it not support (currently this document defines only Version One), it
MUST reply with an RDMA_ERROR procedure and set the rdma_err value to MUST reply with an RDMA_ERROR procedure and set the rdma_err value to
RDMA_ERR_VERS, also providing the low and high inclusive version ERR_VERS, also providing the low and high inclusive version numbers
numbers it does, in fact, support. it does, in fact, support.
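As an illustration, a version mismatch error might be encoded as shown below. This is a sketch under stated assumptions. The numeric values of RDMA_ERROR and ERR_VERS, the field order, and the encoding of each field as a big-endian 32-bit word are taken to follow the XDR definition, which does not appear in this section. The function name and the credits parameter are hypothetical.

```python
import struct

# Assumed enum values; the XDR definition is not part of this section.
RDMA_ERROR = 4  # rdma_proc value
ERR_VERS = 1    # rdma_err value (RDMA_ERR_VERS in the older text)


def encode_rdma_error(xid, vers, credits, err, low=None, high=None):
    """Encode an RDMA_ERROR header as big-endian 32-bit words.

    xid and vers are copied from the failing request.
    """
    header = struct.pack(">5I", xid, vers, credits, RDMA_ERROR, err)
    if err == ERR_VERS:
        # Report the inclusive range of supported header versions.
        header += struct.pack(">2I", low, high)
    return header
```

A receiver that supports only Version One would pass low=1 and high=1.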
5.5.2. XDR Errors 5.5.2. XDR Errors
A receiver might encounter an XDR parsing error that prevents it from A receiver might encounter an XDR parsing error that prevents it from
processing the incoming Transport stream. Examples of such errors processing the incoming Transport stream. Examples of such errors
include an invalid value in the rdma_proc field, an RDMA_NOMSG include an invalid value in the rdma_proc field, an RDMA_NOMSG
message that has no chunk lists, or the contents of the rdma_xid message that has no chunk lists, or the contents of the rdma_xid
field might not match the contents of the XID field in the field might not match the contents of the XID field in the
accompanying RPC message. If the rdma_vers field contains a accompanying RPC message. If the rdma_vers field contains a
recognized value, but an XDR parsing error occurs, the responder MUST recognized value, but an XDR parsing error occurs, the responder MUST
reply with an RDMA_ERROR procedure and set the rdma_err value to reply with an RDMA_ERROR procedure and set the rdma_err value to
RDMA_ERR_BADHEADER. ERR_CHUNK.
When a responder receives a valid RPC-over-RDMA header but the When a responder receives a valid RPC-over-RDMA header but the
responder's Upper Layer Protocol implementation cannot parse the RPC responder's Upper Layer Protocol implementation cannot parse the RPC
arguments in the RPC Call message, the responder SHOULD return a arguments in the RPC Call message, the responder SHOULD return a
RPC_GARBAGEARGS reply, using an RDMA_MSG procedure. This type of RPC_GARBAGEARGS reply, using an RDMA_MSG procedure. This type of
parsing failure might be due to mismatches between chunk sizes or parsing failure might be due to mismatches between chunk sizes or
offsets and the contents of the Payload stream, for example. A offsets and the contents of the Payload stream, for example. A
responder MAY also report the presence of a non-DDP-eligible data responder MAY also report the presence of a non-DDP-eligible data
item in a Read or Write chunk using RPC_GARBAGEARGS. item in a Read or Write chunk using RPC_GARBAGEARGS.
skipping to change at page 35, line 29 skipping to change at page 35, line 42
o If the requester-provided Reply chunk is too small to accommodate o If the requester-provided Reply chunk is too small to accommodate
a large RPC Reply, a Remote Access error occurs. A responder can a large RPC Reply, a Remote Access error occurs. A responder can
detect this problem before attempting to write past the end of the detect this problem before attempting to write past the end of the
Reply chunk. Reply chunk.
RDMA operational errors are typically fatal to the connection. To RDMA operational errors are typically fatal to the connection. To
avoid a retransmission loop and repeated connection loss that avoid a retransmission loop and repeated connection loss that
deadlocks the connection, once the requester has re-established a deadlocks the connection, once the requester has re-established a
connection, the responder should send an RDMA_ERROR reply with an connection, the responder should send an RDMA_ERROR reply with an
rdma_err value of RDMA_ERR_BADHEADER to indicate that no RPC-level rdma_err value of ERR_CHUNK to indicate that no RPC-level reply is
reply is possible for that XID. possible for that XID.
5.5.4. Other Operational Errors 5.5.4. Other Operational Errors
While a requester is constructing a Call message, an unrecoverable While a requester is constructing a Call message, an unrecoverable
problem might occur that prevents the requester from posting further problem might occur that prevents the requester from posting further
RDMA Work Requests on behalf of that message. As with other RDMA Work Requests on behalf of that message. As with other
transports, if a requester is unable to construct and transmit a Call transports, if a requester is unable to construct and transmit a Call
message, the associated RPC transaction fails immediately. message, the associated RPC transaction fails immediately.
After a requester has received a reply, if it is unable to invalidate After a requester has received a reply, if it is unable to invalidate
skipping to change at page 36, line 14 skipping to change at page 36, line 26
5.5.5. RDMA Transport Errors 5.5.5. RDMA Transport Errors
The RDMA connection and physical link provide some degree of error The RDMA connection and physical link provide some degree of error
detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
(when used over TCP), Stream Control Transmission Protocol (SCTP), as (when used over TCP), Stream Control Transmission Protocol (SCTP), as
well as the InfiniBand link layer all provide Cyclic Redundancy Check well as the InfiniBand link layer all provide Cyclic Redundancy Check
(CRC) protection of the RDMA payload, and CRC-class protection is a (CRC) protection of the RDMA payload, and CRC-class protection is a
general attribute of such transports. general attribute of such transports.
Additionally, the RPC layer itself can accept errors from the link Additionally, the RPC layer itself can accept errors from the
level and recover via retransmission. RPC recovery can handle transport, and recover via retransmission. RPC recovery can handle
complete loss and re-establishment of the link. complete loss and re-establishment of a transport connection.
The details of reporting and recovery from RDMA link layer errors are The details of reporting and recovery from RDMA link layer errors are
outside the scope of this protocol specification. See Section 10 for outside the scope of this protocol specification. See Section 9 for
further discussion of the use of RPC-level integrity schemes to further discussion of the use of RPC-level integrity schemes to
detect errors. detect errors.
5.6. Protocol Elements No Longer Supported 5.6. Protocol Elements No Longer Supported
The following protocol elements are no longer supported in RPC-over- The following protocol elements are no longer supported in RPC-over-
RDMA Version One. Related enum values and structure definitions RDMA Version One. Related enum values and structure definitions
remain in the RPC-over-RDMA Version One protocol for backwards remain in the RPC-over-RDMA Version One protocol for backwards
compatibility. compatibility.
skipping to change at page 37, line 12 skipping to change at page 37, line 24
GETATTR operation as the final element of the compound operation GETATTR operation as the final element of the compound operation
array. array.
Without a full specification of RDMA_MSGP, there has been no fully Without a full specification of RDMA_MSGP, there has been no fully
implemented prototype of it. Without a complete prototype of implemented prototype of it. Without a complete prototype of
RDMA_MSGP support, it is difficult to assess whether this protocol RDMA_MSGP support, it is difficult to assess whether this protocol
element has benefit, or can even be made to work interoperably. element has benefit, or can even be made to work interoperably.
Therefore, senders MUST NOT send RDMA_MSGP procedures. When Therefore, senders MUST NOT send RDMA_MSGP procedures. When
receiving an RDMA_MSGP procedure, receivers SHOULD reply with an receiving an RDMA_MSGP procedure, receivers SHOULD reply with an
RDMA_ERROR procedure, setting the rdma_err field to RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK.
RDMA_ERR_BADHEADER.
5.6.2. RDMA_DONE 5.6.2. RDMA_DONE
Because no implementation of RPC-over-RDMA Version One uses the Read- Because no implementation of RPC-over-RDMA Version One uses the Read-
Read transfer model, there is never a need to send an RDMA_DONE Read transfer model, there is never a need to send an RDMA_DONE
procedure. procedure.
Therefore, senders MUST NOT send RDMA_DONE messages. When receiving Therefore, senders MUST NOT send RDMA_DONE messages. When receiving
an RDMA_DONE procedure, receivers SHOULD reply with an RDMA_ERROR an RDMA_DONE procedure, receivers SHOULD reply with an RDMA_ERROR
procedure, setting the rdma_err field to RDMA_ERR_BADHEADER. procedure, setting the rdma_err field to ERR_CHUNK.
5.7. XDR Examples 5.7. XDR Examples
RPC-over-RDMA chunk lists are complex data types. In this section, RPC-over-RDMA chunk lists are complex data types. In this section,
illustrations are provided to help readers grasp how chunk lists are illustrations are provided to help readers grasp how chunk lists are
represented inside an RPC-over-RDMA header. represented inside an RPC-over-RDMA header.
An RDMA segment is the simplest component, being made up of a 32-bit An RDMA segment is the simplest component, being made up of a 32-bit
handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once
flattened into an XDR stream, RDMA segments appear as flattened into an XDR stream, RDMA segments appear as
skipping to change at page 40, line 16 skipping to change at page 40, line 18
well-known port for the service itself, if it is appropriately well-known port for the service itself, if it is appropriately
defined. By convention, the NFS/RDMA service, when operating atop defined. By convention, the NFS/RDMA service, when operating atop
such an InfiniBand fabric, will use the same 20049 assignment as such an InfiniBand fabric, will use the same 20049 assignment as
for iWARP. for iWARP.
Historically, different RPC protocols have taken different approaches Historically, different RPC protocols have taken different approaches
to their port assignment; therefore, the specific method is left to to their port assignment; therefore, the specific method is left to
each RPC-over-RDMA-enabled Upper Layer binding, and not addressed each RPC-over-RDMA-enabled Upper Layer binding, and not addressed
here. here.
In Section 11, this specification defines two new "netid" values, to In Section 10, this specification defines two new "netid" values, to
be used for registration of upper layers atop iWARP [RFC5040] be used for registration of upper layers atop iWARP [RFC5040]
[RFC5041] and (when a suitable port translation service is available) [RFC5041] and (when a suitable port translation service is available)
InfiniBand [IB]. Additional RDMA-capable networks MAY define their InfiniBand [IB]. Additional RDMA-capable networks MAY define their
own netids, or if they provide a port translation, MAY share the one own netids, or if they provide a port translation, MAY share the one
defined here. defined here.
7. Bi-Directional RPC-Over-RDMA 7. Upper Layer Binding Specifications
7.1. RPC Direction
7.1.1. Forward Direction
A traditional ONC RPC client is always a requester. A traditional
ONC RPC service is always a responder. This traditional form of ONC
RPC message passing is referred to as operation in the "forward
direction."
During forward direction operation, the ONC RPC client is responsible
for establishing transport connections.
7.1.2. Backward Direction
The ONC RPC standard does not forbid passing messages in the other
direction. An ONC RPC service endpoint can act as a requester, in
which case an ONC RPC client endpoint acts as a responder. This form
of message passing is referred to as operation in the "backward
direction."
During backward direction operation, the ONC RPC client is
responsible for establishing transport connections, even though ONC
RPC Calls come from the ONC RPC server.
7.1.3. Bi-direction
A pair of endpoints may choose to use only forward or only backward
direction operations on a particular transport. Or, the endpoints
may send operations in both directions concurrently on the same
transport.
Bi-directional operation occurs when both transport endpoints act as
a requester and a responder at the same time. As above, the ONC RPC
client is responsible for establishing transport connections.
7.1.4. XIDs with Bi-direction
During bi-directional operation, the forward and backward directions
use independent xid spaces.
In other words, a forward direction requester MAY use the same xid
value at the same time as a backward direction requester on the same
transport connection, but such concurrent requests represent distinct
ONC RPC transactions.
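One way an implementation might keep the two XID spaces apart is to key its table of pending transactions on both direction and XID. The sketch below is illustrative only; the class and constant names are hypothetical.

```python
FORWARD, BACKWARD = "forward", "backward"


class TransactionTable:
    """Pending ONC RPC transactions on one transport connection."""

    def __init__(self):
        self._pending = {}

    def add(self, direction, xid, context):
        key = (direction, xid)
        if key in self._pending:
            raise KeyError("XID already in use in this direction")
        self._pending[key] = context

    def complete(self, direction, xid):
        # Remove and return the context of the matching transaction.
        return self._pending.pop((direction, xid))
```

With this layout, a forward direction Call and a backward direction Call that happen to carry the same XID value are tracked as two separate entries.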
7.2. Backward Direction Flow Control
7.2.1. Backward RPC-over-RDMA Credits
Credits work the same way in the backward direction as they do in the
forward direction. However, forward direction credits and backward
direction credits are accounted separately.
In other words, the forward direction credit value is the same
whether or not there are backward direction resources associated with
an RPC-over-RDMA transport connection. The backward direction credit
value MAY be different than the forward direction credit value. The
rdma_credit field in a backward direction RPC-over-RDMA message MUST
NOT contain the value zero.
A backward direction requester (an RPC-over-RDMA service endpoint)
requests credits from the responder (an RPC-over-RDMA client
endpoint). The responder reports how many credits it can grant.
This is the number of backward direction Calls the responder is
prepared to handle at once.
When an RPC-over-RDMA server endpoint is operating correctly, it
sends no more outstanding requests at a time than the client
endpoint's advertised backward direction credit value.
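The credit accounting described above might be sketched as follows for a backward direction requester. This is an illustration, not a required design. The class name is hypothetical, and the initial grant of one credit before any reply has been received is an assumption that this section does not state.

```python
class BackwardCreditGate:
    """Limits outstanding backward direction Calls to the granted credits."""

    def __init__(self, initial_grant=1):
        # Assumption: one credit is usable until the first grant arrives.
        self.granted = initial_grant
        self.outstanding = 0

    def try_send_call(self):
        """Return True if another backward direction Call may be sent."""
        if self.outstanding >= self.granted:
            return False
        self.outstanding += 1
        return True

    def on_reply(self, rdma_credit):
        """Record a backward direction Reply and its credit grant."""
        if rdma_credit == 0:
            raise ValueError("rdma_credit MUST NOT be zero")
        self.outstanding -= 1
        self.granted = rdma_credit
```

Forward direction credits would be tracked by a separate instance, since the two directions are accounted separately.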
7.2.2. Receive Buffer Management
An RPC-over-RDMA transport endpoint must pre-post receive buffers
before it can receive and process incoming RPC-over-RDMA messages.
If a sender transmits a message to a receiver that has no posted
receive buffer, the RDMA provider MAY drop the RDMA connection.
7.2.2.1. Client Receive Buffers
Typically an RPC-over-RDMA caller posts only as many receive buffers
as there are outstanding RPC Calls. A client endpoint without
backward direction support might therefore at times have no pre-
posted receive buffers.
To receive incoming backward direction Calls, an RPC-over-RDMA client
endpoint must pre-post enough additional receive buffers to match its
advertised backward direction credit value. Each outstanding forward
direction RPC requires an additional receive buffer above this
minimum.
When an RDMA transport connection is lost, all active receive buffers
are flushed and are no longer available to receive incoming messages.
When a fresh transport connection is established, a client endpoint
must re-post a receive buffer to handle the Reply for each
retransmitted forward direction Call, and a full set of receive
buffers to handle backward direction Calls.
7.2.2.2. Server Receive Buffers
A forward direction RPC-over-RDMA service endpoint posts as many
receive buffers as it expects incoming forward direction Calls. That
is, it posts no fewer buffers than the number of RPC-over-RDMA
credits it advertises in the rdma_credit field of forward direction
RPC replies.
To receive incoming backward direction replies, an RPC-over-RDMA
server endpoint must pre-post a receive buffer for each backward
direction Call it sends.
When the existing transport connection is lost, all active receive
buffers are flushed and are no longer available to receive incoming
messages. When a fresh transport connection is established, a server
endpoint must re-post a receive buffer to handle the Reply for each
retransmitted backward direction Call, and a full set of receive
buffers for receiving forward direction Calls.
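The receive buffer requirements for both endpoint types reduce to simple arithmetic, sketched below. The function names are hypothetical, and the results are minimum counts; an implementation may post more.

```python
def client_receive_buffers(backward_credits, outstanding_forward_calls):
    """Minimum buffers a client endpoint posts: one per advertised
    backward direction credit plus one per outstanding forward Call."""
    return backward_credits + outstanding_forward_calls


def server_receive_buffers(forward_credits, outstanding_backward_calls):
    """Minimum buffers a server endpoint posts: one per advertised
    forward direction credit plus one per outstanding backward Call."""
    return forward_credits + outstanding_backward_calls
```

After a reconnect, the same counts apply, with retransmitted Calls counted as outstanding.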
7.3. Conventions For Backward Operation
7.3.1. In the Absence of Backward Direction Support
An RPC-over-RDMA transport endpoint might not support backward
direction operation. There might be no mechanism in the transport
implementation to do so, or the Upper Layer Protocol consumer might
not yet have configured the transport to handle backward direction
traffic.
A loss of the RDMA connection may result if the receiver is not
prepared to receive an incoming message. Thus a denial-of-service
could result if a sender continues to send backchannel messages after
every transport reconnect to an endpoint that is not prepared to
receive them.
For RPC-over-RDMA Version One transports, the Upper Layer Protocol is
responsible for informing its peer when it has established a backward
direction capability. Otherwise even a simple backward direction
NULL probe from a peer would result in a lost connection.
An Upper Layer Protocol consumer MUST NOT perform backward direction
ONC RPC operations unless the peer consumer has indicated it is
prepared to handle them. A description of Upper Layer Protocol
mechanisms used for this indication is outside the scope of this
document.
7.3.2. Backward Direction Retransmission
In rare cases, an ONC RPC transaction cannot be completed within a
certain time. This can be because the transport connection was lost,
the Call or Reply message was dropped, or because the Upper Layer
consumer delayed or dropped the ONC RPC request. Typically, the
requester sends the transaction again, reusing the same RPC XID.
This is known as an "RPC retransmission".
In the forward direction, the Caller is the ONC RPC client. The
client is always responsible for establishing a transport connection
before sending again.
In the backward direction, the Caller is the ONC RPC server. Because
an ONC RPC server does not establish transport connections with
clients, it cannot send a retransmission if there is no transport
connection. It must wait for the ONC RPC client to re-establish the
transport connection before it can retransmit ONC RPC transactions in
the backward direction.
If an ONC RPC client has no work to do, it may be some time before it
re-establishes a transport connection. Backward direction Callers
must be prepared to wait indefinitely for a connection to be
established before a pending backward direction ONC RPC Call can be
retransmitted.
7.3.3. Backward Direction Message Size
RPC-over-RDMA backward direction messages are transmitted and
received using the same buffers as messages in the forward direction.
Therefore they are constrained to be no larger than receive buffers
posted for forward messages.
It is expected that the Upper Layer Protocol consumer establishes an
appropriate payload size limit for backward direction operations,
either by advertising that size limit to its peers, or by convention.
If that is done, backward direction messages do not exceed the size
of receive buffers at either endpoint.
If a sender transmits a backward direction message that is larger
than the receiver is prepared for, the RDMA provider drops the
message and the RDMA connection.
7.3.4. Sending A Backward Direction Call
To form a backward direction RPC-over-RDMA Call message on an RPC-
over-RDMA Version One transport, an ONC RPC service endpoint
constructs an RPC-over-RDMA header containing a fresh RPC XID in the
rdma_xid field.
The rdma_vers field MUST contain the value one. The number of
requested credits is placed in the rdma_credit field.
The rdma_proc field in the RPC-over-RDMA header MUST contain the
value RDMA_MSG. All three chunk lists MUST be empty.
The ONC RPC Call header MUST follow immediately, starting with the
same XID value that is present in the RPC-over-RDMA header. The Call
header's msg_type field MUST contain the value CALL.
7.3.5. Sending A Backward Direction Reply
To form a backward direction RPC-over-RDMA Reply message on an RPC-
over-RDMA Version One transport, an ONC RPC client endpoint
constructs an RPC-over-RDMA header containing a copy of the matching
ONC RPC Call's RPC XID in the rdma_xid field.
The rdma_vers field MUST contain the value one. The number of
granted credits is placed in the rdma_credit field.
The rdma_proc field in the RPC-over-RDMA header MUST contain the
value RDMA_MSG. All three chunk lists MUST be empty.
The ONC RPC Reply header MUST follow immediately, starting with the
same XID value that is present in the RPC-over-RDMA header. The
Reply header's msg_type field MUST contain the value REPLY.
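The header construction rules for backward direction Calls and Replies can be sketched together, since they differ only in the msg_type value and in whether the XID is fresh or copied. This is an illustration under stated assumptions: the values RDMA_MSG = 0, CALL = 0, and REPLY = 1, and the encoding of each empty chunk list as a single zero word, are taken to follow the XDR definitions, which do not appear in this section. The function name is hypothetical.

```python
import struct

# Assumed enum values; the XDR definitions are not part of this section.
RDMA_MSG = 0
CALL, REPLY = 0, 1


def backward_message(xid, credits, msg_type, rpc_body=b""):
    """Build a backward direction RPC-over-RDMA Version One message.

    rpc_body is the remainder of the RPC header and payload that
    follows the XID and msg_type fields.
    """
    if credits == 0:
        raise ValueError("rdma_credit MUST NOT be zero")
    # rdma_xid, rdma_vers (one), rdma_credit, rdma_proc
    header = struct.pack(">4I", xid, 1, credits, RDMA_MSG)
    # Read list, Write list, and Reply chunk are all empty.
    header += struct.pack(">3I", 0, 0, 0)
    # The RPC header starts with the same XID, then msg_type.
    return header + struct.pack(">2I", xid, msg_type) + rpc_body
```

A service endpoint would call this with a fresh XID and CALL; a client endpoint would call it with the XID copied from the matching Call and REPLY.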
7.4. Backward Direction Upper Layer Binding
RPC programs that operate on RPC-over-RDMA Version One only in the
backward direction do not require an Upper Layer Binding
specification. Because RPC-over-RDMA Version One operation in the
backward direction does not allow reduction, there can be no DDP-
eligible data items in such a program. Backward direction operation
occurs on an already-established connection, thus there is no need to
specify RPC bind parameters.
8. Upper Layer Binding Specifications
An Upper Layer Protocol is typically defined independently of any An Upper Layer Protocol is typically defined independently of any
particular RPC transport. An Upper Layer Binding specification (ULB) particular RPC transport. An Upper Layer Binding specification (ULB)
provides guidance that helps the Upper Layer Protocol interoperate provides guidance that helps the Upper Layer Protocol interoperate
correctly and efficiently over a particular transport. For RPC-over- correctly and efficiently over a particular transport. For RPC-over-
RDMA Version One, a ULB provides: RDMA Version One, an Upper Layer Binding may provide:
o A taxonomy of XDR data items that are eligible for Direct Data o A taxonomy of XDR data items that are eligible for Direct Data
Placement Placement
o Constraints on which Upper Layer procedures may be reduced, and on
how many chunks may appear in a single RPC request
o A method for determining the maximum size of the reply Payload o A method for determining the maximum size of the reply Payload
stream for all procedures in the Upper Layer Protocol stream for all procedures in the Upper Layer Protocol
o An rpcbind port assignment for operation of the RPC Program and o An rpcbind port assignment for operation of the RPC Program and
Version on an RPC-over-RDMA transport Version on an RPC-over-RDMA transport
Each RPC Program and Version tuple that utilizes RPC-over-RDMA Each RPC Program and Version tuple that utilizes RPC-over-RDMA
Version One needs to have an Upper Layer Binding specification. Version One needs to have an Upper Layer Binding specification.
Requesters MUST NOT send RPC-over-RDMA messages for Upper Layer
Protocols that do not have an Upper Layer Binding. Responders MUST
NOT reply to RPC-over-RDMA messages for Upper Layer Protocols that do
not have an Upper Layer Binding.
8.1. DDP-Eligibility 7.1. DDP-Eligibility
An Upper Layer Binding designates some XDR data items as eligible for An Upper Layer Binding designates some XDR data items as eligible for
Direct Data Placement. As an RPC-over-RDMA message is formed, DDP- Direct Data Placement. As an RPC-over-RDMA message is formed, DDP-
eligible data items can be removed from the Payload stream and placed eligible data items can be removed from the Payload stream and placed
directly in the receiver's memory (reduced). directly in the receiver's memory.
An XDR data item should be considered for DDP-eligibility if there is An XDR data item should be considered for DDP-eligibility if there is
a clear benefit to moving the contents of the item directly from the a clear benefit to moving the contents of the item directly from the
sender's memory to the receiver's memory. Criteria for DDP- sender's memory to the receiver's memory. Criteria for DDP-
eligibility include: eligibility include:
o The XDR data item is frequently sent or received, and its size is o The XDR data item is frequently sent or received, and its size is
often much larger than typical inline thresholds. often much larger than typical inline thresholds.
o Transport-level processing of the XDR data item is not needed. o Transport-level processing of the XDR data item is not needed.
For example, the data item is an opaque byte array, which requires For example, the data item is an opaque byte array, which requires
no XDR encoding and decoding of its content. no XDR encoding and decoding of its content.
o The content of the XDR data item is sensitive to address o The content of the XDR data item is sensitive to address
alignment. For example, pullup would be required on the receiver alignment. For example, pullup would be required on the receiver
before the content of the item can be used. before the content of the item can be used.
o The XDR data item does not contain DDP-eligible data items. o The XDR data item does not contain DDP-eligible data items.
In addition to defining the set of data items that are DDP-eligible,
an Upper Layer Binding may also limit the use of chunks to particular
Upper Layer procedures. If more than one data item in a procedure is
DDP-eligible, the Upper Layer Binding may also limit the number of
chunks that a requester can provide for a particular Upper Layer
procedure.
Senders MUST NOT reduce data items that are not DDP-eligible. Such Senders MUST NOT reduce data items that are not DDP-eligible. Such
data items MAY, however, be moved as part of a Position Zero Read data items MAY, however, be moved as part of a Position Zero Read
Chunk or a Reply chunk. Chunk or a Reply chunk.
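A sender might enforce this rule with a lookup against the taxonomy supplied by the Upper Layer Binding, as in the sketch below. The procedure and data item names are invented for illustration and do not come from any actual Upper Layer Binding; the function name is likewise hypothetical.

```python
# Hypothetical taxonomy: (procedure, data item) pairs that an
# Upper Layer Binding has designated as DDP-eligible.
DDP_ELIGIBLE = frozenset({
    ("EXAMPLE_READ", "data"),
    ("EXAMPLE_WRITE", "data"),
})


def check_reduction(procedure, reduced_items):
    """Raise ValueError if any item chosen for reduction is not
    DDP-eligible for the given procedure."""
    for item in reduced_items:
        if (procedure, item) not in DDP_ELIGIBLE:
            raise ValueError(
                "%s.%s is not DDP-eligible" % (procedure, item))
```

Items that fail this check can still be conveyed inline, or as part of a Position Zero Read Chunk or a Reply chunk.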
The interface by which an Upper Layer implementation indicates the The programming interface by which an Upper Layer implementation
DDP-eligibility of a data item to the RPC transport is not described indicates the DDP-eligibility of a data item to the RPC transport is
by this specification. The only requirements are that the receiver not described by this specification. The only requirements are that
can re-assemble the transmitted RPC-over-RDMA message into a valid the receiver can re-assemble the transmitted RPC-over-RDMA message
XDR stream, and that DDP-eligibility rules specified by the Upper into a valid XDR stream, and that DDP-eligibility rules specified by
Layer Binding are respected. the Upper Layer Binding are respected.
There is no provision to express DDP-eligibility within the XDR There is no provision to express DDP-eligibility within the XDR
language. The only definitive specification of DDP-eligibility is language. The only definitive specification of DDP-eligibility is an
the Upper Layer Binding itself. Upper Layer Binding.
8.1.1. DDP-Eligibility Violation 7.1.1. DDP-Eligibility Violation
A DDP-eligibility violation occurs when a requester forms a Call A DDP-eligibility violation occurs when a requester forms a Call
message with a non-DDP-eligible data item in a Read chunk. A message with a non-DDP-eligible data item in a Read chunk. A
violation occurs when a responder forms a Reply message without violation occurs when a responder forms a Reply message without
reducing a DDP-eligible data item when there is a Write list provided reducing a DDP-eligible data item when there is a Write list provided
by the requester. by the requester.
In the first case, a responder MUST NOT process the Call message. In the first case, a responder MUST NOT process the Call message.
In the second case, as a requester parses a Reply message, it must In the second case, as a requester parses a Reply message, it must
assume that the responder has correctly reduced a DDP-eligible result assume that the responder has correctly reduced a DDP-eligible result
data item. If the responder has not done so, it is likely that the data item. If the responder has not done so, it is likely that the
requester cannot finish parsing the Payload stream and that an XDR requester cannot finish parsing the Payload stream and that an XDR
error would result. error would result.
Both types of violations MUST be reported as described in Both types of violations MUST be reported as described in
Section 5.5.2. Section 5.5.2.
8.2. Maximum Reply Size 7.2. Maximum Reply Size
A requester provides resources for both a Call message and its A requester provides resources for both a Call message and its
matching Reply message. A requester forms the Call message itself, matching Reply message. A requester forms the Call message itself,
thus can compute the exact resources needed for it. thus can compute the exact resources needed for it.
A requester must allocate resources for the Reply message (an RPC- A requester must allocate resources for the Reply message (an RPC-
over-RDMA credit, a Receive buffer, and possibly a Write list and over-RDMA credit, a Receive buffer, and possibly a Write list and
Reply chunk) before the responder has formed the actual reply. To Reply chunk) before the responder has formed the actual reply. To
accommodate all possible replies for the procedure in the Call accommodate all possible replies for the procedure in the Call
message, a requester must allocate reply resources based on the message, a requester must allocate reply resources based on the
maximum possible size of the expected Reply message. maximum possible size of the expected Reply message.
If there are procedures in the Upper Layer Protocol for which there If there are procedures in the Upper Layer Protocol for which there
is no clear reply size maximum, the Upper Layer Binding needs to is no clear reply size maximum, the Upper Layer Binding needs to
specify a dependable means for determining the maximum. specify a dependable means for determining the maximum.
8.3. Additional Considerations 7.3. Additional Considerations
There may be other details provided in an Upper Layer Binding. There may be other details provided in an Upper Layer Binding.
o An Upper Layer Binding may recommend an inline threshold value or o An Upper Layer Binding may recommend an inline threshold value or
other transport-related parameters for RPC-over-RDMA Version One other transport-related parameters for RPC-over-RDMA Version One
connections bearing that Upper Layer Protocol. connections bearing that Upper Layer Protocol.
o An Upper Layer Protocol may provide a means to communicate these o An Upper Layer Protocol may provide a means to communicate these
transport-related parameters between peers. Note that RPC-over- transport-related parameters between peers. Note that RPC-over-
RDMA Version One does not specify any mechanism for changing any RDMA Version One does not specify any mechanism for changing any
transport-related parameter after a connection has been transport-related parameter after a connection has been
established. established.
o Multiple Upper Layer Protocols may share a single RPC-over-RDMA o Multiple Upper Layer Protocols may share a single RPC-over-RDMA
Version One connection when their Upper Layer Bindings allow the Version One connection when their Upper Layer Bindings allow the
use of RPC-over-RDMA Version One and the rpcbind port assignments use of RPC-over-RDMA Version One and the rpcbind port assignments
for the Protocols allow connection sharing. In this case, the for the Protocols allow connection sharing. In this case, the
same transport parameters (such as inline threshold) apply to all same transport parameters (such as inline threshold) apply to all
Protocols using that connection. Protocols using that connection.
Given the above, Upper Layer Bindings and Upper Layer Protocols must Each Upper Layer Binding needs to be designed to allow correct
be designed to interoperate correctly no matter what connection interoperation without regard to the transport parameters actually in
parameters are in effect on a connection. use. Furthermore, implementations of Upper Layer Protocols must be
designed to interoperate correctly regardless of the connection
parameters in effect on a connection.
8.4. Upper Layer Protocol Extensions 7.4. Upper Layer Protocol Extensions
An RPC Program and Version tuple may be extensible. For instance, An RPC Program and Version tuple may be extensible. For instance,
there may be a minor versioning scheme that is not reflected in the there may be a minor versioning scheme that is not reflected in the
RPC version number. Or, the Upper Layer Protocol may allow RPC version number. Or, the Upper Layer Protocol may allow
additional features to be specified after the original RPC program additional features to be specified after the original RPC program
specification was ratified. specification was ratified.
Upper Layer Bindings are provided for interoperable RPC Programs and Upper Layer Bindings are provided for interoperable RPC Programs and
Versions by extending existing Upper Layer Bindings to reflect the Versions by extending existing Upper Layer Bindings to reflect the
changes made necessary by each addition to the existing XDR. changes made necessary by each addition to the existing XDR.
9. Protocol Extensibility 8. Protocol Extensibility
The RPC-over-RDMA header format is specified using XDR, unlike the The RPC-over-RDMA header format is specified using XDR, unlike the
message header format of RPC on TCP. Defining the header using XDR message header used with RPC over TCP. To maintain a high degree of
allows minor issues with the transport protocol to be addressed and interoperability among implementations of RPC-over-RDMA, any change
optional features to be introduced by making extensions to the RPC- to this XDR requires a protocol version number change. New versions
over-RDMA header XDR. Such changes can be made without a change to of RPC-over-RDMA may be published as separate protocol specifications
the protocol version number. without updating this document.
When more invasive changes to the protocol are to be made, a protocol
version number change is required. In either case, any changes to
the RPC-over-RDMA protocol can only be effected by publication of a
Standards Track document with appropriate review by the nfsv4 Working
Group and the IESG.
Although it is possible to make XDR changes which are not limited to
the use of compatible extensions in new RPC-over-RDMA versions, such
changes should only be done when absolutely necessary, as they limit
interoperability with existing implementations. It is appropriate
for the nfsv4 Working Group to consider alternatives carefully before
using this approach.
Unlike the rest of this document, which defines the base of RPC-over-
RDMA Version One, Section 9 (except for Section 9.4) applies to all
versions of RPC-over-RDMA. New versions of RPC-over-RDMA may be
published as separate protocols without updating this document, but
any change to the extensibility model defined here requires
publication of a Standards Track document updating this document.
9.1. Changes To RPC-Over-RDMA Header XDR The first four fields in every RPC-over-RDMA header must remain
aligned at the same fixed offsets for all versions of the RPC-over-
RDMA protocol. The version number must be in a fixed place to enable
implementations to detect protocol version mismatches.
The first four fields in the RPC-over-RDMA header (now in struct For version mismatches to be reported in a fashion that all future
rpcrdma_prefix) must remain aligned at the same fixed offsets for all version implementations can reliably decode, the rdma_proc field must
versions of the RPC-over-RDMA protocol. The version number must be remain in a fixed place, the value of ERR_VERS must always remain the
in a fixed place in order to enable version mismatch detection. For same, and the field placement in struct rpc_rdma_errvers must always
version mismatches to be reported in a fashion that all future
version implementations can reliably decode, the rdma_which field
must be in a fixed place, the value of RDMA_ERR_VERS must always
remain the same, and the field placement of the RDMA_ERR_VERS arm of
the rpcrdma1_error union (now in struct rpcrdma_err_vers) must always
remain the same. remain the same.
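A receiver can rely on these fixed offsets to detect a version mismatch without understanding the rest of the header. The following is a minimal sketch, not a definitive decoder. It assumes the four leading fields are 32-bit XDR words in the order XID, version, credit, procedure, that the error code and the low and high version words follow immediately in an error message, and that RDMA_ERROR and ERR_VERS have the numeric values shown. All of these are assumptions for illustration; the XDR definition in the specification is authoritative.

```python
import struct

# Assumed numeric values, for illustration only.
RDMA_ERROR = 4
ERR_VERS = 1

def parse_fixed_fields(header: bytes) -> dict:
    """Decode the four leading 32-bit big-endian XDR words of a header."""
    if len(header) < 16:
        raise ValueError("header shorter than the four fixed fields")
    xid, vers, credit, proc = struct.unpack_from(">IIII", header, 0)
    return {"xid": xid, "vers": vers, "credit": credit, "proc": proc}

def version_mismatch_range(header: bytes):
    """Return (low, high) if the header reports a version mismatch, else None."""
    fields = parse_fixed_fields(header)
    if fields["proc"] != RDMA_ERROR or len(header) < 28:
        return None
    # Assumed layout: error code, then lowest and highest supported versions.
    err, low, high = struct.unpack_from(">III", header, 16)
    return (low, high) if err == ERR_VERS else None
```

Because only fixed offsets and fixed values are consulted, an implementation of any version can decode this report from a peer of any other version.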
Given these constraints, one way to extend RPC-over-RDMA is to add 8.1. Conventional Extensions
new values to the rdma_proc enumerated type and new components (arms)
to the rpcrdma1_body union. New argument and result types may be
introduced for each new procedure defined this way. These extensions
would be specified by new Internet Drafts with appropriate Working
Group and IESG review to ensure continued interoperation with
existing implementations.
XDR extensions may introduce only optional features to an existing
RPC-over-RDMA protocol version. To detect when an optional rdma_proc
value is supported by a receiver, it is desirable to have a specific
value of the rdma_err field, say, RDMA_ERR_PROC, that indicates when
the receiver does not recognize an rdma_proc value.
In RPC-over-RDMA Version One, a receiver can indicate that it does
not recognize an rdma_proc enum value only by returning an RDMA_ERROR
procedure with the rdma_err field set to RDMA_ERR_BADHEADER (see
Section 5.5.2). This is indistinguishable from a situation where the
receiver does indeed support the procedure, but the XDR is malformed.
To resolve this problem, an RPC-over-RDMA Version One sender uses the
following convention. If the first time the sender uses an optional
rdma_proc value the receiver returns an RDMA_ERROR procedure with
RDMA_ERR_BADHEADER in the rdma_err field, the sender simply marks
that feature as unsupported and does not send it again on the current
connection instance. Subsequent to an initial successful result,
receiving RDMA_ERR_BADHEADER retains its more relaxed meaning of
"generic XDR parsing error."
To ensure backwards compatibility when such an extension mechanism is
in place, the value of RDMA_ERR_BADHEADER must remain the same for
all versions of the RPC-over-RDMA protocol.
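The sender-side convention described above can be sketched as a small per-connection state table. This is an illustrative sketch only: the class, its method names, and the string used to stand for RDMA_ERR_BADHEADER are hypothetical and do not appear in the protocol.

```python
RDMA_ERR_BADHEADER = "RDMA_ERR_BADHEADER"  # symbolic placeholder, not a wire value

class OptionalProcTracker:
    """Per-connection record of which optional rdma_proc values a peer accepts."""

    def __init__(self):
        self._status = {}  # proc value -> "supported" or "unsupported"

    def may_send(self, proc) -> bool:
        """A proc marked unsupported is not sent again on this connection."""
        return self._status.get(proc) != "unsupported"

    def on_result(self, proc, error=None) -> str:
        first_use = proc not in self._status
        if error is None:
            self._status[proc] = "supported"
            return "ok"
        if error == RDMA_ERR_BADHEADER and first_use:
            # First use failed: treat the feature as unsupported by the peer.
            self._status[proc] = "unsupported"
            return "unsupported"
        # After an initial success, BADHEADER keeps its generic meaning.
        return "generic-error"
```

A new tracker would be created for each connection instance, since the convention applies only to the current connection.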
Most changes to the RPC-over-RDMA XDR will take the form of a
compatible extension to the existing XDR. Changes which do not
update the version number (see Section 9.3) must take this form.
For an XDR description B to be a compatible extension of an XDR
description A, the following must be the case:
o All input recognized as valid by description A must be recognized
as valid by description B
o Any input recognized as valid by both descriptions must be
interpreted as having the same structure according to both
descriptions
o Any input recognized as valid by description B but not by
description A can be recognizable as part of a supported/unknown
extension using description A
The following changes can be made compatibly:
o Addition of a new message header type and associated header body
o Addition of new enum values and associated arms to unions that do
not include a default case
o Addition of previously undefined flag bits to flag words that are
included in existing header bodies
Each such addition is referred to as a "protocol element." A set of
protocol elements defined together such that all must be supported or
not supported by a receiver is called a "feature."
Because of the simplicity of the existing protocol and deficiencies
in the existing error reporting structure, some of the above
techniques are not realizable within RPC-over-RDMA Version One. For a
discussion of protocol extension practices within RPC-over-RDMA
Version One, including XDR extension, see Section 9.4.
9.2. Feature Statuses With RPC-Over-RDMA Versions
Within a given RPC-over-RDMA version, every known feature is either
OPTIONAL, REQUIRED, or "not allowed".
o REQUIRED features MUST be supported by all receivers. Senders can
depend on them being supported.
o OPTIONAL features MAY be supported by particular receivers.
Senders need to be prepared for the absence of support.
o "Not allowed" features are typically those that were formerly
OPTIONAL or REQUIRED, but for which support has been removed.
All features defined in this document are REQUIRED in RPC-over-RDMA
Version One. OPTIONAL features may be added to Version One as
specified in Section 9.4.
The terms "OPTIONAL" and "REQUIRED" are used as specified in
[RFC2119] as indicated in Section 1.1. These status values are
assigned by those writing additional specifications (e.g., new RPC-
over-RDMA versions or extensions to existing RPC-over-RDMA versions).
Their choice in this regard is their guidance to implementers. As
used in this document, these terms are only directed to implementers
of RPC-over-RDMA Version One.
The status of features may change between RPC-over-RDMA protocol
versions.
9.3. RPC-Over-RDMA Version Numbering
RPC-over-RDMA version numbering enables both endpoints to agree on a
set of interoperable behaviors and determine which OPTIONAL features
are available.
An expected pattern of protocol development is to introduce OPTIONAL
features within a given version using XDR extension. Such features
often need a significant period of optional general use to ensure
they are capable of being implemented broadly. This is especially
true for infrastructural features that others will build upon. When
it is appropriate for OPTIONAL features to become REQUIRED, that
would be an occasion to create a new RPC-over-RDMA protocol version.
The value of the RPC-over-RDMA header's version field has to be
updated when the protocol is altered in a way that prevents
interoperability with current implementations. A version change is
needed whenever:
o The RPC-over-RDMA header XDR definition is changed to add a
REQUIRED protocol element, or an existing OPTIONAL feature is made
REQUIRED
o A REQUIRED feature is made OPTIONAL
o A REQUIRED or OPTIONAL feature is converted to be "not allowed"
o An XDR change is made that is not a compatible extension as
defined in Section 9.1
o The use of a previously not used abstract RDMA operation is
specified as REQUIRED
o The use of an existing REQUIRED abstract RDMA operation is removed
When a version number change is to be made, the nfsv4 Working Group
creates a Standards Track document that does one of the following:
1. Documents the whole protocol as amended
2. Documents changes relative to the previous minor version
3. Documents extensions made since the previous minor version by
normatively referencing the documents defining those extensions
4. Documents all REQUIRED functionality, and includes OPTIONAL
features by normatively referencing the documents defining those
extensions
The Working Group retains all these options, but the last is
typically preferred. When an XDR change that is not a compatible
extension is made, the first is most desirable. In any case, if
there are features whose status has been changed to "not allowed",
the document needs to explain that change and how it is intended that
existing implementations address the feature removal.
9.4. RPC-Over-RDMA Version One Extension Practices
This subsection applies primarily to RPC-over-RDMA Version One but
remains in effect unless modified by documents defining future RPC-
over-RDMA versions. Such documents need not update this document.
9.4.1. Documentation Requirements
RPC-over-RDMA Version One may be extended by defining a new message
header type and XDR description of the corresponding header body.
A set of such new protocol elements may be introduced by a Standards
Track document and are together considered an OPTIONAL feature.
nfsv4 Working Group and IESG review, together with appropriate
testing of prototype implementations, should ensure continued
interoperation with existing implementations.
Documents describing extensions to RPC-over-RDMA Version One should
contain:
o An explanation of the purpose and use of each new protocol element
o An XDR description and a script to extract it
o A receiver response that a sender can use to determine that
support is in fact present
o A description of interactions with existing features (e.g., any
requirement that another OPTIONAL or REQUIRED feature needs to be
present and supported for the new feature to work)
Implementers concatenate the XDR description of the new feature with
the XDR description of the base protocol, extracted from this
document, to produce a combined XDR description for the RPC-over-RDMA
Version One protocol with the specified extension.
9.4.2. Detecting Support For Message Header Types
A sender determines whether a receiver supports an OPTIONAL message
header type by issuing a simple test request using that message
header type. The receiver sends an affirmative response that
indicates the message header type is supported. The response message
header type may itself be an extension. The sender ties together the
message and response using the rdma_xid field.
The receiver indicates that it does not recognize a particular
rdma_which value by returning an RDMA_ERROR message type with the
rdma_err field set to RDMA_ERR_BADHEADER and with the rdma_xid field
set to a value that matches the test message.
This is indistinguishable from a situation where the receiver does Introducing new capabilities to RPC-over-RDMA Version One is limited
support the procedure but the test message is malformed. However, if to the adoption of conventions that make use of existing XDR (defined
the sender always tests for receiver support using a simple instance in this document) and allowed abstract RDMA operations. Because no
of the message header type to be tested, such an error at this point mechanism for detecting optional features exists in RPC-over-RDMA
indicates the sender and receiver have no prospect of using the new Version One, implementations must rely on Upper Layer Protocols to
protocol element interoperably. A lack of support for this feature communicate the existence of such extensions.
can be reasonably assumed.
A sender should issue OPTIONAL message header types one-at-a-time Such extensions must be specified in a Standards Track document with
until it receives indication of the receiver's support status of that appropriate review by the nfsv4 Working Group and the IESG. An
message header type. example of a conventional extension to RPC-over-RDMA Version One can
be found in [I-D.ietf-nfsv4-rpcrdma-bidirection].
10. Security Considerations 9. Security Considerations
10.1. Memory Protection 9.1. Memory Protection
A primary consideration is the protection of the integrity and A primary consideration is the protection of the integrity and
privacy of local memory by an RPC-over-RDMA transport. The use of privacy of local memory by an RPC-over-RDMA transport. The use of
RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory
contents, nor to memory owned by user processes. contents, nor to memory owned by user processes.
It is REQUIRED that any RDMA provider used for RPC transport be It is REQUIRED that any RDMA provider used for RPC transport be
conformant to the requirements of [RFC5042] in order to satisfy these conformant to the requirements of [RFC5042] in order to satisfy these
protections. These protections are provided by the RDMA layer protections. These protections are provided by the RDMA layer
specifications, and in particular, their security models. specifications, and in particular, their security models.
10.1.1. Protection Domains 9.1.1. Protection Domains
The use of Protection Domains to limit the exposure of memory The use of Protection Domains to limit the exposure of memory
segments to a single connection is critical. Any attempt by an segments to a single connection is critical. Any attempt by an
endpoint not participating in that connection to re-use memory endpoint not participating in that connection to re-use memory
handles needs to result in immediate failure of that connection. handles needs to result in immediate failure of that connection.
Because Upper Layer Protocol security mechanisms rely on this aspect Because Upper Layer Protocol security mechanisms rely on this aspect
of Reliable Connection behavior, strong authentication of remote of Reliable Connection behavior, strong authentication of remote
endpoints is recommended. endpoints is recommended.
10.1.2. Handle Predictability 9.1.2. Handle Predictability
Unpredictable memory handles should be used for any operation Unpredictable memory handles should be used for any operation
requiring advertised memory segments. Advertising a continuously requiring advertised memory segments. Advertising a continuously
registered memory region allows a remote host to read or write to registered memory region allows a remote host to read or write to
that region even when an RPC involving that memory is not under way. that region even when an RPC involving that memory is not under way.
Therefore implementations should avoid advertising persistently Therefore implementations should avoid advertising persistently
registered memory. registered memory.
10.1.3. Memory Fencing 9.1.3. Memory Fencing
Advertised memory segments should be invalidated as soon as related Requesters should register memory segments for remote access only
RPC operations are complete. Invalidation and DMA unmapping of when they are about to be the target of an RPC operation that
segments should be complete before the Upper Layer is allowed to involves an RDMA Read or Write.
continue execution and use or alter the contents of a memory region.
10.2. RPC Message Security Registered memory segments should be invalidated as soon as related
RPC operations are complete. Invalidation and DMA unmapping of RDMA
segments should be complete before message integrity checking is
done, and before the RPC consumer is allowed to continue execution
and use or alter the contents of a memory region.
An RPC transaction on a requester might be terminated before a reply
arrives if the RPC consumer exits unexpectedly (for example, it is
signaled or a segmentation fault occurs). When an RPC terminates
abnormally, memory segments associated with that RPC should be
invalidated appropriately before the segments are released to be
reused for other purposes on the requester.
9.2. RPC Message Security
ONC RPC provides cryptographic security via the RPCSEC_GSS framework ONC RPC provides cryptographic security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS implements message authentication, per-message [I-D.ietf-nfsv4-rpcsec-gssv3]. RPCSEC_GSS implements message
integrity checking, and per-message confidentiality. However, authentication, per-message integrity checking, and per-message
integrity and privacy services require significant movement of data confidentiality. However, integrity and privacy services require
on each endpoint host. Some performance benefits enabled by RDMA significant movement of data on each endpoint host. Some performance
transports can be lost. Note that some performance loss is expected benefits enabled by RDMA transports can be lost.
when RPCSEC_GSS integrity or privacy is in use on any RPC transport.
10.2.1. RPC-Over-RDMA Link-Level Protection 9.2.1. RPC-Over-RDMA Protection At Lower Layers
Link-level protection is a more appropriate security mechanism for Note that performance loss is expected when RPCSEC_GSS integrity or
RDMA transports. Certain configurations of IPsec can be co-located privacy is in use on any RPC transport. Protection below the RDMA
in RDMA hardware, for example, without any change to RDMA consumers layer is a more appropriate security mechanism for RDMA transports in
or loss of data movement efficiency. performance-sensitive deployments. Certain configurations of IPsec
can be co-located in RDMA hardware, for example, without any change
to RDMA consumers or loss of data movement efficiency.
The use of link-level protection MAY be negotiated through the use of The use of protection in a lower layer MAY be negotiated through the
the RPCSEC_GSS security flavor defined in [RFC5403] in conjunction use of an RPCSEC_GSS security flavor defined in
with the Channel Binding mechanism [RFC5056] and IPsec Channel [I-D.ietf-nfsv4-rpcsec-gssv3] in conjunction with the Channel Binding
Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED mechanism [RFC5056] and IPsec Channel Connection Latching [RFC5660].
where integrity and/or privacy is desired and where efficiency is Use of such mechanisms is REQUIRED where integrity and/or privacy is
required. desired and where efficiency is required.
10.2.2. RPCSEC_GSS On RPC-Over-RDMA Transports 9.2.2. RPCSEC_GSS On RPC-Over-RDMA Transports
RPCSEC_GSS [RFC5403] extends the ONC RPC protocol [RFC5531] without Not all RDMA devices and fabrics support the above protection
changing the format of RPC messages. By observing the conventions mechanisms. Also, per-message authentication is still required on
described in this section, an RPC-over-RDMA implementation can NFS clients where multiple users access NFS files. In these cases,
support RPCSEC_GSS in a way that interoperates successfully with RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
other implementations. connections.
RPCSEC_GSS extends the ONC RPC protocol [RFC5531] without changing
the format of RPC messages. By observing the conventions described
in this section, an RPC-over-RDMA transport can convey RPCSEC_GSS-
protected RPC messages interoperably.
As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that
appear in the Payload stream of an RPC-over-RDMA message (such as appear in the Payload stream of an RPC-over-RDMA message (such as
control messages exchanged as part of establishing or destroying a control messages exchanged as part of establishing or destroying a
security context, or data items that are part of RPCSEC_GSS security context, or data items that are part of RPCSEC_GSS
authentication material) MUST NOT be reduced. authentication material) MUST NOT be reduced.
10.2.2.1. RPCSEC_GSS Context Negotiation 9.2.2.1. RPCSEC_GSS Context Negotiation
Some NFS client implementations use a separate connection to Some NFS client implementations use a separate connection to
establish a GSS context for NFS operation. These clients use TCP and establish a GSS context for NFS operation. These clients use TCP and
the standard NFS port (2049) for context establishment, but there is the standard NFS port (2049) for context establishment. However
no guarantee that an NFS/RDMA server provides a TCP-based NFS server there is no guarantee that an NFS/RDMA server provides a TCP-based
on port 2049. NFS server on port 2049.
10.2.2.2. RPC-Over-RDMA With RPCSEC_GSS Authentication 9.2.2.2. RPC-Over-RDMA With RPCSEC_GSS Authentication
The RPCSEC_GSS authentication service has no impact on the DDP- The RPCSEC_GSS authentication service has no impact on the DDP-
eligibility of data items in an Upper Layer Protocol. eligibility of data items in an Upper Layer Protocol.
However, RPCSEC_GSS authentication material appearing in an RPC However, RPCSEC_GSS authentication material appearing in an RPC
message header is often larger than material associated with, say, message header can be larger than, say, an AUTH_SYS authenticator.
the AUTH_SYS security flavor. In particular, when an RPCSEC_GSS In particular, when an RPCSEC_GSS pseudoflavor is in use, a requester
pseudoflavor is in use, a requester needs to accommodate a larger RPC needs to accommodate a larger RPC credential when marshaling Call
credential when marshaling Call messages, and to provide for a messages, and to provide for a maximum size RPCSEC_GSS verifier when
maximum size RPCSEC_GSS verifier when allocating reply buffers and allocating reply buffers and Reply chunks.
Reply chunks.
RPC messages, and thus Payload streams, are made larger as a result. RPC messages, and thus Payload streams, are made larger as a result.
Upper Layer Protocol operations that fit in a Short Message when a Upper Layer Protocol operations that fit in a Short Message when a
simpler form of authentication is in use might need to be reduced or simpler form of authentication is in use might need to be reduced, or
conveyed via a Long Message when RPCSEC_GSS authentication is in use. conveyed via a Long Message, when RPCSEC_GSS authentication is in
This can impact efficiency when RPCSEC_GSS authentication is in use. use. It is more likely that a requester provides both a Read list
and a Reply chunk in the same RPC-over-RDMA header to convey a Long
Because average RPC message size is larger when RPCSEC_GSS call and provision a receptacle for a Long reply. More frequent use
authentication is in use, it is more likely that a requester will of Long messages can impact transport efficiency.
provide both a Read list and a Reply chunk in the same RPC-over-RDMA
header to convey a Long call and provision a receptacle for a Long
reply.
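The effect of a larger credential on message form can be illustrated with a short calculation. This is a sketch under assumptions: the function, its parameters, and the octet counts in the usage note are hypothetical, and the sketch does not model reduction of DDP-eligible data items.

```python
def choose_call_form(rpc_header_octets: int, credential_octets: int,
                     argument_octets: int, inline_threshold: int) -> str:
    """Return "short" if the whole Call fits inline, otherwise "long".

    Hypothetical helper. All sizes are XDR-encoded octet counts.
    """
    total = rpc_header_octets + credential_octets + argument_octets
    return "short" if total <= inline_threshold else "long"
```

For instance, with an assumed 1024-octet inline threshold, a Call with 900 octets of header and arguments fits in a Short Message with a 40-octet credential, but needs a Long Message with a 400-octet credential.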
10.2.2.3. RPC-Over-RDMA With RPCSEC_GSS Integrity Or Privacy 9.2.2.3. RPC-Over-RDMA With RPCSEC_GSS Integrity Or Privacy
The RPCSEC_GSS integrity service enables endpoints to detect The RPCSEC_GSS integrity service enables endpoints to detect
modification of RPC messages in flight. The RPCSEC_GSS privacy modification of RPC messages in flight. The RPCSEC_GSS privacy
service prevents all but the intended recipient from viewing the service prevents all but the intended recipient from viewing the
cleartext content of RPC messages. RPCSEC_GSS integrity and privacy cleartext content of RPC arguments and results. RPCSEC_GSS integrity
are end-to-end; that is, they protect RPC arguments and results from and privacy are end-to-end. They protect RPC arguments and results
application to server endpoint, and back. from application to server endpoint, and back.
The RPCSEC_GSS integrity and encryption services operate on whole RPC The RPCSEC_GSS integrity and encryption services operate on whole RPC
messages after they have been XDR encoded for transmit, and before messages after they have been XDR encoded for transmit, and before
they have been XDR decoded after receipt. Both the sender and the they have been XDR decoded after receipt. Both sender and receiver
receiver endpoints use intermediate buffers to prevent exposure of endpoints use intermediate buffers to prevent exposure of encrypted
encrypted data or unverified cleartext data to RPC consumers. After data or unverified cleartext data to RPC consumers. After
verification, encryption, and message wrapping have been performed, verification, encryption, and message wrapping have been performed,
the transport layer can use RDMA data transfer between these the transport layer MAY use RDMA data transfer between these
intermediate buffers. intermediate buffers.
The process of reducing a DDP-eligible data item removes the data The process of reducing a DDP-eligible data item removes the data
item and its XDR padding from the encoded XDR stream. XDR padding of item and its XDR padding from the encoded XDR stream. XDR padding of
a reduced data item is not transferred in an RPC-over-RDMA message. a reduced data item is not transferred in an RPC-over-RDMA message.
After reduction, the Payload stream contains fewer octets than the After reduction, the Payload stream contains fewer octets than the
whole XDR stream did beforehand. XDR padding octets are often zero whole XDR stream did beforehand. XDR padding octets are often zero
bytes, but they don't have to be. Thus reducing DDP-eligible items bytes, but they don't have to be. Thus reducing DDP-eligible items
affects the result of message integrity verification or encryption. affects the result of message integrity verification or encryption.
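The octet arithmetic behind this can be sketched as follows. This is an illustration only: the helper names are hypothetical, SHA-256 in the usage note stands in for whatever checksum the security mechanism computes, and the sketch assumes the data item is a variable-length opaque whose length word stays in the Payload stream while its data and padding are removed.

```python
def xdr_pad(length: int) -> int:
    """Number of padding octets that round length up to a multiple of four."""
    return (4 - length % 4) % 4

def encode_opaque(data: bytes, pad_byte: bytes = b"\x00") -> bytes:
    """XDR variable-length opaque: 4-octet length, data, then padding."""
    return len(data).to_bytes(4, "big") + data + pad_byte * xdr_pad(len(data))

def reduce_stream(prefix: bytes, data: bytes, suffix: bytes):
    """Return (whole XDR stream, Payload stream after reducing the data item)."""
    whole = prefix + encode_opaque(data) + suffix
    # Assumption: only the length word of the reduced item remains inline.
    payload = prefix + len(data).to_bytes(4, "big") + suffix
    return whole, payload
```

With a 5-octet data item, the Payload stream is 8 octets shorter than the whole XDR stream (5 data octets plus 3 padding octets), so a checksum such as `hashlib.sha256` computed over one stream does not match a checksum computed over the other.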
skipping to change at page 56, line 48 skipping to change at page 47, line 42
integrity or encryption services are in use. Effectively, no data integrity or encryption services are in use. Effectively, no data
item is DDP-eligible in this situation, and Chunked Messages cannot item is DDP-eligible in this situation, and Chunked Messages cannot
be used. In this mode, an RPC-over-RDMA transport operates in the be used. In this mode, an RPC-over-RDMA transport operates in the
same manner as a transport that does not support direct data same manner as a transport that does not support direct data
placement. placement.
When RPCSEC_GSS integrity or privacy is in use, a requester provides When RPCSEC_GSS integrity or privacy is in use, a requester provides
both a Read list and a Reply chunk in the same RPC-over-RDMA header both a Read list and a Reply chunk in the same RPC-over-RDMA header
to convey a Long call and provision a receptacle for a Long reply. to convey a Long call and provision a receptacle for a Long reply.
10.2.2.4. RPC-Over-RDMA Header Exposure 9.2.2.4. Protecting RPC-Over-RDMA Transport Headers
Like the base fields in an ONC RPC message (XID, call direction, and Like the base fields in an ONC RPC message (XID, call direction, and
so on), the contents of an RPC-over-RDMA message's Transport stream so on), the contents of an RPC-over-RDMA message's Transport stream
are not protected by RPCSEC_GSS. This exposes XIDs, connection are not protected by RPCSEC_GSS. This exposes XIDs, connection
credit limits, and chunk lists (but not the content of the data items credit limits, and chunk lists (but not the content of the data items
they refer to) to malicious behavior, which could redirect data that they refer to) to malicious behavior, which could redirect data that
is transferred by the RPC-over-RDMA message, result in spurious is transferred by the RPC-over-RDMA message, result in spurious
retransmits, or trigger connection loss. retransmits, or trigger connection loss.
Encryption at the link layer, as described in Section 10.2.1, In particular, if an attacker alters the information contained in the
protects the content of the Transport stream. chunk lists of an RPC-over-RDMA header, data contained in those
chunks can be redirected to other registered memory segments on
requesters. An attacker might alter the arguments of RDMA Read and
RDMA Write operations on the wire to similar effect. The use of
RPCSEC_GSS integrity or privacy services enables the requester to
detect if such tampering has been done and reject the RPC message.
11. IANA Considerations Encryption at lower layers, as described in Section 9.2.1, protects
the content of the Transport stream. To address attacks on RDMA
protocols themselves, RDMA transport implementations should conform
to [RFC5042].
10. IANA Considerations
Three assignments are specified by this document. These are Three assignments are specified by this document. These are
unchanged from [RFC5666]: unchanged from [RFC5666]:
o A set of RPC "netids" for resolving RPC-over-RDMA services o A set of RPC "netids" for resolving RPC-over-RDMA services
o Optional service port assignments for Upper Layer Bindings o Optional service port assignments for Upper Layer Bindings
o An RPC program number assignment for the configuration protocol o An RPC program number assignment for the configuration protocol
skipping to change at page 58, line 15 skipping to change at page 49, line 21
For example, the NFS/RDMA service defined in [RFC5667] has been For example, the NFS/RDMA service defined in [RFC5667] has been
assigned the port 20049, in the IANA registry: assigned the port 20049, in the IANA registry:
nfsrdma 20049/tcp Network File System (NFS) over RDMA nfsrdma 20049/tcp Network File System (NFS) over RDMA
nfsrdma 20049/udp Network File System (NFS) over RDMA nfsrdma 20049/udp Network File System (NFS) over RDMA
nfsrdma 20049/sctp Network File System (NFS) over RDMA nfsrdma 20049/sctp Network File System (NFS) over RDMA
The RPC program number assignment policy and registry are defined in The RPC program number assignment policy and registry are defined in
[RFC5531]. [RFC5531].
12. Acknowledgments 11. Acknowledgments
The editor gratefully acknowledges the work of Brent Callaghan and The editor gratefully acknowledges the work of Brent Callaghan and
Tom Talpey on the original RPC-over-RDMA Version One specification Tom Talpey on the original RPC-over-RDMA Version One specification
[RFC5666]. [RFC5666].
Dave Noveck provided excellent review, constructive suggestions, and Dave Noveck provided excellent review, constructive suggestions, and
consistent navigational guidance throughout the process of drafting consistent navigational guidance throughout the process of drafting
this document. Dave also contributed much of the organization and this document. Dave also contributed much of the organization and
content of Section 9 and helped the authors understand the content of Section 8 and helped the authors understand the
complexities of XDR extensibility. complexities of XDR extensibility.
The comments and contributions of Karen Deitke, Dai Ngo, Chunli The comments and contributions of Karen Deitke, Dai Ngo, Chunli
Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with
great thanks. The editor also wishes to thank Bill Baker for his great thanks. The editor also wishes to thank Bill Baker, Greg
support of this work. Marsden, and Matt Benjamin for their support of this work.
The extract.sh shell script and formatting conventions were first The extract.sh shell script and formatting conventions were first
described by the authors of the NFSv4.1 XDR specification [RFC5662]. described by the authors of the NFSv4.1 XDR specification [RFC5662].
Special thanks go to nfsv4 Working Group Chair Spencer Shepler and Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
nfsv4 Working Group Secretary Thomas Haynes for their support. nfsv4 Working Group Secretary Thomas Haynes for their support.
13. References 12. References
13.1. Normative References 12.1. Normative References
[I-D.ietf-nfsv4-rpcrdma-bidirection]
Lever, C., "Size-Limited Bi-directional Remote Procedure
Call On Remote Direct Memory Access Transports", draft-
ietf-nfsv4-rpcrdma-bidirection-01 (work in progress),
September 2015.
[I-D.ietf-nfsv4-rpcsec-gssv3]
Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", draft-ietf-nfsv4-rpcsec-gssv3-17
(work in progress), January 2016.
[RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, DOI 10.17487/RFC1833, August 1995, RFC 1833, DOI 10.17487/RFC1833, August 1995,
<http://www.rfc-editor.org/info/rfc1833>. <http://www.rfc-editor.org/info/rfc1833>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/
RFC2119, March 1997, RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>. <http://www.rfc-editor.org/info/rfc2119>.
[RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, DOI 10.17487/RFC2203, September
1997, <http://www.rfc-editor.org/info/rfc2203>.
[RFC4506] Eisler, M., Ed., "XDR: External Data Representation [RFC4506] Eisler, M., Ed., "XDR: External Data Representation
Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
2006, <http://www.rfc-editor.org/info/rfc4506>. 2006, <http://www.rfc-editor.org/info/rfc4506>.
[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
Protocol (DDP) / Remote Direct Memory Access Protocol Protocol (DDP) / Remote Direct Memory Access Protocol
(RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
2007, <http://www.rfc-editor.org/info/rfc5042>. 2007, <http://www.rfc-editor.org/info/rfc5042>.
[RFC5056] Williams, N., "On the Use of Channel Bindings to Secure [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007, Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007,
<http://www.rfc-editor.org/info/rfc5056>. <http://www.rfc-editor.org/info/rfc5056>.
[RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI
10.17487/RFC5403, February 2009,
<http://www.rfc-editor.org/info/rfc5403>.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
May 2009, <http://www.rfc-editor.org/info/rfc5531>. May 2009, <http://www.rfc-editor.org/info/rfc5531>.
[RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC [RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC
5660, DOI 10.17487/RFC5660, October 2009, 5660, DOI 10.17487/RFC5660, October 2009,
<http://www.rfc-editor.org/info/rfc5660>. <http://www.rfc-editor.org/info/rfc5660>.
[RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call
(RPC) Network Identifiers and Universal Address Formats", (RPC) Network Identifiers and Universal Address Formats",
RFC 5665, DOI 10.17487/RFC5665, January 2010, RFC 5665, DOI 10.17487/RFC5665, January 2010,
<http://www.rfc-editor.org/info/rfc5665>. <http://www.rfc-editor.org/info/rfc5665>.
13.2. Informative References 12.2. Informative References
[IB] InfiniBand Trade Association, "InfiniBand Architecture [IB] InfiniBand Trade Association, "InfiniBand Architecture
Specifications", <http://www.infinibandta.org>. Specifications", <http://www.infinibandta.org>.
[IBPORT] InfiniBand Trade Association, "IP Addressing Annex", [IBPORT] InfiniBand Trade Association, "IP Addressing Annex",
<http://www.infinibandta.org>. <http://www.infinibandta.org>.
[RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI [RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI
10.17487/RFC0768, August 1980, 10.17487/RFC0768, August 1980,
<http://www.rfc-editor.org/info/rfc768>. <http://www.rfc-editor.org/info/rfc768>.
