draft-ietf-nfsv4-rfc5666bis-00.txt   draft-ietf-nfsv4-rfc5666bis-01.txt 
Network File System Version 4 C. Lever, Ed. Network File System Version 4 C. Lever, Ed.
Internet-Draft Oracle Internet-Draft Oracle
Obsoletes: 5666 (if approved) T. Talpey Obsoletes: 5666 (if approved) W. Simpson
Intended status: Standards Track Microsoft Intended status: Standards Track DayDreamer
Expires: June 3, 2016 December 1, 2015 Expires: June 16, 2016 T. Talpey
Microsoft
December 14, 2015
Remote Direct Memory Access Transport for Remote Procedure Call Remote Direct Memory Access Transport for Remote Procedure Call
draft-ietf-nfsv4-rfc5666bis-00 draft-ietf-nfsv4-rfc5666bis-01
Abstract Abstract
This document describes a protocol providing Remote Direct Memory This document specifies a protocol for conveying Remote Procedure
Access (RDMA) as a new transport for Remote Procedure Call (RPC). Call (RPC) messages on physical transports capable of Remote Direct
The RDMA transport binding conveys the benefits of efficient, bulk- Memory Access (RDMA). The RDMA transport binding enables efficient
data transport over high-speed networks, while providing for minimal bulk-data transport over high-speed networks with minimal change to
change to RPC applications and with no required revision of the RPC applications. It requires no revision to application RPC
application RPC protocol, or the RPC protocol itself. This document protocols or the RPC protocol itself. This document obsoletes RFC
obsoletes RFC 5666. 5666.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 3, 2016. This Internet-Draft will expire on June 16, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 15 skipping to change at page 2, line 17
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
1.2. RPC Over RDMA Transports . . . . . . . . . . . . . . . . 3 1.2. RPC Over RDMA Transports . . . . . . . . . . . . . . . . 3
2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4 2. Changes Since RFC 5666 . . . . . . . . . . . . . . . . . . . 4
2.1. Changes To The Specification . . . . . . . . . . . . . . 4 2.1. Changes To The Specification . . . . . . . . . . . . . . 4
2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 4 2.2. Changes To The Protocol . . . . . . . . . . . . . . . . . 5
3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5 3.1. Remote Procedure Calls . . . . . . . . . . . . . . . . . 5
3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8 3.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 8
4. Protocol Framework . . . . . . . . . . . . . . . . . . . . . 10 4. RPC-Over-RDMA Protocol Framework . . . . . . . . . . . . . . 10
4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10 4.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10
4.2. RPC-over-RDMA Framing . . . . . . . . . . . . . . . . . . 10 4.2. RPC Message Framing . . . . . . . . . . . . . . . . . . . 11
4.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . 11 4.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . 11
4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 13 4.4. XDR Encoding With Chunks . . . . . . . . . . . . . . . . 13
4.5. Data Exchange . . . . . . . . . . . . . . . . . . . . . . 18 4.5. Data Exchange . . . . . . . . . . . . . . . . . . . . . . 19
4.6. Message Size . . . . . . . . . . . . . . . . . . . . . . 21 4.6. Message Size . . . . . . . . . . . . . . . . . . . . . . 21
5. RPC-over-RDMA In Operation . . . . . . . . . . . . . . . . . 22 5. RPC-Over-RDMA In Operation . . . . . . . . . . . . . . . . . 23
5.1. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 22 5.1. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 23
5.2. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 24 5.2. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 24
5.3. Forming Messages . . . . . . . . . . . . . . . . . . . . 25 5.3. Forming Messages . . . . . . . . . . . . . . . . . . . . 26
5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 28 5.4. Memory Registration . . . . . . . . . . . . . . . . . . . 29
5.5. Handling Errors . . . . . . . . . . . . . . . . . . . . . 29 5.5. Handling Errors . . . . . . . . . . . . . . . . . . . . . 30
5.6. XDR Language Description . . . . . . . . . . . . . . . . 30 5.6. XDR Language Description . . . . . . . . . . . . . . . . 31
5.7. Deprecated Protocol Elements . . . . . . . . . . . . . . 33 5.7. Deprecated Protocol Elements . . . . . . . . . . . . . . 34
6. Upper Layer Binding Specifications . . . . . . . . . . . . . 33 6. Upper Layer Binding Specifications . . . . . . . . . . . . . 34
6.1. Determining DDP-Eligibility . . . . . . . . . . . . . . . 34 6.1. Determining DDP-Eligibility . . . . . . . . . . . . . . . 35
6.2. Write List Ordering . . . . . . . . . . . . . . . . . . . 35 6.2. Write List Ordering . . . . . . . . . . . . . . . . . . . 36
6.3. DDP-Eligibility Violation . . . . . . . . . . . . . . . . 35 6.3. DDP-Eligibility Violation . . . . . . . . . . . . . . . . 36
6.4. Other Binding Information . . . . . . . . . . . . . . . . 36 6.4. Other Binding Information . . . . . . . . . . . . . . . . 37
7. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 36 7. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 37
8. Bi-directional RPC-over-RDMA . . . . . . . . . . . . . . . . 37 8. Bi-Directional RPC-Over-RDMA . . . . . . . . . . . . . . . . 38
8.1. RPC Direction . . . . . . . . . . . . . . . . . . . . . . 37 8.1. RPC Direction . . . . . . . . . . . . . . . . . . . . . . 39
8.2. Backward Direction Flow Control . . . . . . . . . . . . . 38 8.2. Backward Direction Flow Control . . . . . . . . . . . . . 40
8.3. Conventions For Backward Operation . . . . . . . . . . . 40 8.3. Conventions For Backward Operation . . . . . . . . . . . 41
8.4. Backward Direction Upper Layer Binding . . . . . . . . . 42 8.4. Backward Direction Upper Layer Binding . . . . . . . . . 43
9. Transport Protocol Extensibility . . . . . . . . . . . . . . 42 9. Transport Protocol Extensibility . . . . . . . . . . . . . . 44
9.1. Bumping The RPC-over-RDMA Version . . . . . . . . . . . . 43 9.1. Bumping The RPC-over-RDMA Version . . . . . . . . . . . . 44
10. Security Considerations . . . . . . . . . . . . . . . . . . . 43 10. Security Considerations . . . . . . . . . . . . . . . . . . . 45
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 45 10.1. Memory Protection . . . . . . . . . . . . . . . . . . . 45
12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 46 10.2. Using GSS With RPC-Over-RDMA . . . . . . . . . . . . . . 45
13. Appendices . . . . . . . . . . . . . . . . . . . . . . . . . 46
13.1. Appendix 1: XDR Examples . . . . . . . . . . . . . . . . 46
14. References . . . . . . . . . . . . . . . . . . . . . . . . . 47 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46
14.1. Normative References . . . . . . . . . . . . . . . . . . 47 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 47
14.2. Informative References . . . . . . . . . . . . . . . . . 49 13. Appendices . . . . . . . . . . . . . . . . . . . . . . . . . 47
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 50 13.1. Appendix 1: XDR Examples . . . . . . . . . . . . . . . . 47
14. References . . . . . . . . . . . . . . . . . . . . . . . . . 49
14.1. Normative References . . . . . . . . . . . . . . . . . . 49
14.2. Informative References . . . . . . . . . . . . . . . . . 50
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 51
1. Introduction 1. Introduction
This document obsoletes RFC 5666, but makes no operational changes to This document obsoletes RFC 5666; however, the protocol specified by
RPC-over-RDMA Version One protocol on the wire. It is published to this document is based on existing interoperating implementations of
clarify ambiguous text that is subject to multiple interpretations, the RPC-over-RDMA Version One protocol. The new specification
deprecate unimplemented RPC-over-RDMA Version One protocol elements, clarifies text that is subject to multiple interpretations and
and introduce conventions to allow bi-directional RPC-over-RDMA eliminates support for unimplemented RPC-over-RDMA Version One
operation. protocol elements. In addition, it introduces conventions that
enable bi-directional RPC-over-RDMA operation.
1.1. Requirements Language 1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
1.2. RPC Over RDMA Transports 1.2. RPC Over RDMA Transports
Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IB] is a
technique for efficient movement of data between end nodes, which technique for moving data efficiently between end nodes. By
becomes increasingly compelling over high-speed transports. By
directing data into destination buffers as it is sent on a network, directing data into destination buffers as it is sent on a network,
and placing it via direct memory access by hardware, the double and placing it via direct memory access by hardware, the benefits of
benefit of faster transfers and reduced host overhead is obtained. faster transfers and reduced host overhead are obtained.
Open Network Computing Remote Procedure Call (ONC RPC, or simply, Open Network Computing Remote Procedure Call (ONC RPC, or simply,
RPC) [RFC5531] is a remote procedure call protocol that has been run RPC) [RFC5531] is a remote procedure call protocol that runs over a
over a variety of transports. Most RPC implementations today use UDP variety of transports. Most RPC implementations today use UDP or
or TCP. RPC messages are defined in terms of an eXternal Data TCP. On UDP, RPC messages are encapsulated inside datagrams, while
Representation (XDR) [RFC4506], which provides a canonical data on a TCP byte stream, RPC messages are delineated by a record marking
representation across a variety of host architectures. An XDR data protocol. An RDMA transport also conveys RPC messages in a specific
stream is conveyed differently on each type of transport. On UDP, fashion that must be fully described if RPC implementations are to
RPC messages are encapsulated inside datagrams, while on a TCP byte interoperate.
stream, RPC messages are delineated by a record marking protocol. An
RDMA transport also conveys RPC messages in a unique fashion that
must be fully described if RPC implementations are to interoperate.
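For comparison with the RDMA framing this document goes on to specify, the record marking used on a TCP byte stream can be sketched as follows. This is an illustrative sketch only, assuming the record marking standard of [RFC5531] (a four-octet big-endian header per fragment, with the high-order bit flagging the last fragment and the low-order 31 bits giving the fragment length). The names frame_record and unframe_record are hypothetical and do not appear in either draft.

```python
import struct
from typing import Tuple

LAST_FRAGMENT = 0x80000000   # high-order bit: this is the final fragment
LENGTH_MASK = 0x7FFFFFFF     # low-order 31 bits: fragment length in octets


def frame_record(rpc_message: bytes, max_fragment: int = LENGTH_MASK) -> bytes:
    """Wrap one RPC message in one or more record-marking fragments."""
    if not 1 <= max_fragment <= LENGTH_MASK:
        raise ValueError("max_fragment must fit in 31 bits and be non-zero")
    out = bytearray()
    offset = 0
    while True:
        fragment = rpc_message[offset:offset + max_fragment]
        offset += len(fragment)
        last = offset >= len(rpc_message)
        header = len(fragment) | (LAST_FRAGMENT if last else 0)
        out += struct.pack(">I", header) + fragment
        if last:
            return bytes(out)


def unframe_record(stream: bytes) -> Tuple[bytes, bytes]:
    """Reassemble the first RPC message; return (message, unread octets)."""
    message = bytearray()
    pos = 0
    while True:
        if pos + 4 > len(stream):
            raise ValueError("incomplete fragment header")
        (header,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        length = header & LENGTH_MASK
        if pos + length > len(stream):
            raise ValueError("incomplete fragment body")
        message += stream[pos:pos + length]
        pos += length
        if header & LAST_FRAGMENT:
            return bytes(message), stream[pos:]
```

A receiver on a byte stream has to perform this reassembly itself. An RDMA transport preserves message boundaries, so no equivalent step is needed there.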
RDMA transports present new semantics unlike the behaviors of either RDMA transports present semantics different from either UDP or TCP.
UDP or TCP alone. They retain message delineations like UDP while They retain message delineations like UDP, but provide a reliable and
also providing a reliable, sequenced data transfer like TCP. Also, sequenced data transfer like TCP. They also provide an efficient
they provide the new efficient, bulk-transfer service enabled by bulk-transfer service not provided by UDP or TCP. RDMA transports
Remote Direct Memory Access. RDMA transports are therefore naturally are therefore appropriately viewed as a new transport type by RPC.
viewed as a new transport type by RPC.
RDMA as a transport will benefit the performance of RPC protocols RDMA as a transport can enhance the performance of RPC protocols that
that move large "chunks" of data, since RDMA hardware excels at move large quantities of data, since RDMA hardware excels at moving
moving data efficiently between host memory and a high-speed network data efficiently between host memory and a high-speed network with
with little or no host CPU involvement. In this context, the Network little host CPU involvement. In this context, the Network File
File System (NFS) protocol, in all its versions [RFC1094] [RFC1813] System (NFS) protocols as described in [RFC1094], [RFC1813],
[RFC7530] [RFC5661], is an obvious beneficiary of RDMA. A complete [RFC7530], [RFC5661], and future NFSv4 minor versions are obvious
problem statement is discussed in [RFC5532], and related NFSv4 issues beneficiaries of RDMA transports. A complete problem statement is
are discussed in [RFC5661]. Many other RPC-based protocols can also discussed in [RFC5532], and NFSv4-related issues are discussed in
benefit. [RFC5661]. Many other RPC-based protocols can also benefit.
Although the RDMA transport described here provides relatively Although the RDMA transport described here can provide relatively
transparent support for any RPC application, this document goes transparent support for any RPC application, this document also
further in describing mechanisms that can optimize the use of RDMA describes mechanisms that can optimize data transfer further, given
with more active participation by the RPC application. more active participation by RPC applications.
2. Changes Since RFC 5666 2. Changes Since RFC 5666
2.1. Changes To The Specification 2.1. Changes To The Specification
The following alterations have been made to the RPC-over-RDMA Version The following alterations have been made to the RPC-over-RDMA Version
One specification: One specification:
o Often implementers familiar with RDMA are not familiar with the o Section 2 has been expanded to introduce and explain key RPC, XDR,
mechanics of RPC, and vice versa. Section 2 has been expanded to and RDMA terminology. These terms are now used consistently
introduce and explain key RPC, XDR, and RDMA terminology. These throughout the specification. This change was necessary because
terms are now used consistently throughout the specification. implementers familiar with RDMA are often not familiar with the
mechanics of RPC, and vice versa.
o Section 3 has been re-organized and split into sub-sections to o Section 3 has been re-organized and split into sub-sections to
facilitate locating specific requirements and definitions. facilitate locating specific requirements and definitions.
o Sections 4 and 5 have been combined for clarity and to improve the o Sections 4 and 5 have been combined for clarity and to improve the
organization of this information. organization of this information.
o The XDR definition of RPC-over-RDMA Version One has been updated
(without on-the-wire changes) to align with the terms and concepts
introduced in this specification.
o The specification of the optional Connection Configuration o The specification of the optional Connection Configuration
Protocol has been removed from the specification, as there are no Protocol has been removed from the specification, as there are no
known implementations of the protocol. known implementations of the protocol.
o Sections discussing requirements for Upper Layer Bindings have o Sections discussing requirements for Upper Layer Bindings have
been added. been added.
o A section discussing RPC-over-RDMA protocol extensibility has been o A section discussing RPC-over-RDMA protocol extensibility has been
added. added.
2.2. Changes To The Protocol 2.2. Changes To The Protocol
The specific changes to the protocol are: While the protocol described herein interoperates with existing
implementations of [RFC5666], the following changes have been made
relative to the protocol described in that document:
o Support for the Read-Read transfer model has been deprecated. o Support for the Read-Read transfer model has been removed. Read-
Read-Read is a slower transfer model than Read-Write, thus Read is a slower transfer model than Read-Write, thus implementers
implementers have chosen not to support it. have chosen not to support it.
o Support for the RDMA_MSGP message type has been deprecated. It o Support for sending the RDMA_MSGP message type has been
has no benefit for RPC programs that place bulk payload items in deprecated. This document instructs senders not to use it, but
the middle of their argument or result lists, as is typical with receivers must continue to recognize it.
NFSv4 COMPOUND RPCs [RFC7530]. It is also not beneficial when the
inline threshold is significantly smaller than the system page
size.
o The XDR definition of RPC-over-RDMA Version One has been updated RDMA_MSGP has no benefit for RPC programs that place bulk payload
(without on-the-wire changes) to align with the terms and concepts items at positions other than at the end of their argument or
introduced in this specification. result lists, as is common with NFSv4 COMPOUND RPCs [RFC7530].
Similarly, it is not beneficial when a connection's inline
threshold is significantly smaller than the system page size, as
is typical for RPC-over-RDMA Version One implementations.
o Specific requirements related to handling XDR round-up and o Specific requirements related to handling XDR round-up and
abstract data types have been added. abstract data types have been added.
o Clear guidance about Send and Receive buffer size has been added. o Clear guidance about Send and Receive buffer size has been added.
This enables better decisions about when to provide and use the This enables better decisions about when to provide and use the
Reply chunk. Reply chunk.
o A section specifying bi-directional RPC operation on RPC-over-RDMA o A section specifying bi-directional RPC operation on RPC-over-RDMA
has been added. This enables the NFSv4.1 backchannel [RFC5661] on has been added. This enables the NFSv4.1 backchannel [RFC5661] on
RPC-over-RDMA Version One transports. RPC-over-RDMA Version One transports when both endpoints support
the new functionality.
The protocol version number is not changed because the protocol The protocol version number has not been changed because the protocol
specified in this document fully interoperates with implementations specified in this document fully interoperates with implementations
of the RPC-over-RDMA Version One protocol specified in [RFC5666]. of the RPC-over-RDMA Version One protocol specified in [RFC5666].
3. Terminology 3. Terminology
3.1. Remote Procedure Calls 3.1. Remote Procedure Calls
This section introduces key elements of the Remote Procedure Call This section introduces key elements of the Remote Procedure Call
[RFC5531] and External Data Representation [RFC4506] protocols upon [RFC5531] and External Data Representation [RFC4506] protocols upon
which RPC-over-RDMA Version One is constructed. which RPC-over-RDMA Version One is constructed.
skipping to change at page 6, line 18 skipping to change at page 6, line 25
"arguments" and a set of "results". A calling context is not allowed "arguments" and a set of "results". A calling context is not allowed
to proceed until the procedure's results are available to it. Unlike to proceed until the procedure's results are available to it. Unlike
a local procedure call, the called procedure is executed remotely a local procedure call, the called procedure is executed remotely
rather than in the local application's context. rather than in the local application's context.
The RPC protocol as described in [RFC5531] is fundamentally a The RPC protocol as described in [RFC5531] is fundamentally a
message-passing protocol between one server and one or more clients. message-passing protocol between one server and one or more clients.
ONC RPC transactions are made up of two types of messages: ONC RPC transactions are made up of two types of messages:
CALL Message CALL Message
A CALL message, or "Call", requests work. A Call is designated by A CALL message, or "Call", requests that work be done. A Call is
the value CALL in the message's msg_type field. An arbitrary designated by the value CALL in the message's msg_type field. An
unique value is placed in the message's xid field. arbitrary unique value is placed in the message's xid field.
REPLY Message REPLY Message
A REPLY message, or "Reply", reports the results of work requested A REPLY message, or "Reply", reports the results of work requested
by a Call. A Reply is designated by the value REPLY in the by a Call. A Reply is designated by the value REPLY in the
message's msg_type field. The value contained in the message's message's msg_type field. The value contained in the message's
xid field is copied from the Call whose results are being xid field is copied from the Call whose results are being
reported. reported.
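The pairing of a Reply with its Call through the xid field can be sketched as follows. This is a minimal illustration, assuming the [RFC5531] message layout in which the xid is the first 32-bit word and msg_type (CALL = 0, REPLY = 1) is the second. The PendingCalls class, its methods, and the sequential xid allocation are hypothetical choices made for the example, not part of either draft.

```python
import struct

CALL, REPLY = 0, 1   # msg_type values defined in RFC 5531


def encode_prefix(xid: int, msg_type: int) -> bytes:
    """Encode the first two XDR words of an RPC message."""
    return struct.pack(">II", xid, msg_type)


def decode_prefix(message: bytes):
    """Return (xid, msg_type) from the start of an RPC message."""
    return struct.unpack_from(">II", message, 0)


class PendingCalls:
    """Requester-side table that matches each Reply to its Call by xid."""

    def __init__(self):
        self._next_xid = 1
        self._pending = {}

    def start_call(self, context) -> bytes:
        # Any value unique among outstanding calls would do; a counter
        # is used here only for simplicity.
        xid = self._next_xid
        self._next_xid = (self._next_xid + 1) & 0xFFFFFFFF
        self._pending[xid] = context
        return encode_prefix(xid, CALL)

    def complete(self, reply: bytes):
        xid, msg_type = decode_prefix(reply)
        if msg_type != REPLY:
            raise ValueError("not a REPLY message")
        # Raises KeyError when no outstanding Call carries this xid.
        return self._pending.pop(xid)
```

Because the responder copies the xid unchanged, Replies can arrive in any order and still be matched to the calling context that is waiting for them.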
An RPC client endpoint, or "requester", serializes an RPC call's An RPC client endpoint, or "requester", serializes an RPC call's
arguments and conveys them to a server endpoint via an RPC call arguments and conveys them to a server endpoint via an RPC call
skipping to change at page 7, line 30 skipping to change at page 7, line 37
native data representation format. native data representation format.
The function of an RPC transport is to convey RPC messages, each The function of an RPC transport is to convey RPC messages, each
encoded as a separate XDR stream, from one endpoint to another. encoded as a separate XDR stream, from one endpoint to another.
3.1.3.1. XDR Opaque Data 3.1.3.1. XDR Opaque Data
Sometimes a data item must be transferred as-is, without encoding or Sometimes a data item must be transferred as-is, without encoding or
decoding. Such a data item is referred to as "opaque data." XDR decoding. Such a data item is referred to as "opaque data." XDR
encoding places opaque data items directly into an XDR stream without encoding places opaque data items directly into an XDR stream without
altering its content in any way. altering its content in any way. Upper Layer Protocols or
applications perform any needed data translation in this case.
Typically Upper Layer Protocols or applications manage any needed Examples of opaque data items include the contents of files, and
data translation in this case. Examples of opaque data items include generic byte strings.
the contents of files, and generic byte strings.
3.1.3.2. XDR Round-up 3.1.3.2. XDR Round-up
The number of octets in a variable-size data item precedes that item The number of octets in a variable-size data item precedes that item
in the encoding stream. If the size of an encoded data item is not a in the encoding stream. If the size of an encoded data item is not a
multiple of four octets, octets containing zero are added to the end multiple of four octets, octets containing zero are added to the end
of the item so that the next encoded data item starts on a four-octet of the item so that the next encoded data item starts on a four-octet
boundary. The encoded size of the item is not changed by the boundary. The encoded size of the item is not changed by the
addition of the extra octets. addition of the extra octets.
This technique is referred to as "XDR round-up," and the extra octets This technique is referred to as "XDR round-up," and the extra octets
are referred to as "XDR padding". The content of XDR pad octets is are referred to as "XDR padding". The content of XDR pad octets is
ignored by receivers. ignored by receivers.
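XDR round-up for a variable-length opaque item can be sketched as follows. This is an illustrative sketch, assuming the [RFC4506] encoding of a 32-bit big-endian length followed by the data and then zero to three pad octets. The function names are hypothetical.

```python
import struct


def xdr_pad_length(size: int) -> int:
    """Number of zero octets needed to reach the next four-octet boundary."""
    return (4 - size % 4) % 4


def encode_opaque(data: bytes) -> bytes:
    """Encode variable-length opaque data: length, data, then XDR padding."""
    # The length word counts only the data octets, never the padding.
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad_length(len(data))


def decode_opaque(stream: bytes, pos: int = 0):
    """Return (data, position of the next encoded item)."""
    (size,) = struct.unpack_from(">I", stream, pos)
    start = pos + 4
    end = start + size
    next_pos = end + xdr_pad_length(size)
    if next_pos > len(stream):
        raise ValueError("truncated opaque item")
    # Pad octets are skipped without being inspected.
    return stream[start:end], next_pos
```

A five-octet item therefore occupies twelve octets in the stream: four for the length, five for the data, and three for padding, while the encoded length field still reads 5.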
3.2. Remote Direct Memory Access 3.2. Remote Direct Memory Access
An RPC requester can be made more efficient if large RPC messages are RPC requesters and responders can be made more efficient if large RPC
transferred by a third party such as intelligent network interface messages are transferred by a third party such as intelligent network
hardware (data movement offload), and placed in the receiver's memory interface hardware (data movement offload), and placed in the
so that no additional adjustment of data alignment has to be made receiver's memory so that no additional adjustment of data alignment
(direct data placement). Remote Direct Memory Access, or "RDMA" is a has to be made (direct data placement). Remote Direct Memory Access
network transport technology that enables both optimizations. enables both optimizations.
3.2.1. Direct Data Placement 3.2.1. Direct Data Placement
Very often, RPC implementations copy the contents of RPC messages Very often, RPC implementations copy the contents of RPC messages
into a buffer before being sent. An efficient RPC implementation can into a buffer before being sent. An efficient RPC implementation
send bulk data without copying it into a separate send buffer first. sends bulk data without copying it into a separate send buffer first.
However, socket-based RPC implementations are often unable to receive However, socket-based RPC implementations are often unable to receive
data directly into its final place in memory. Receivers often need data directly into its final place in memory. Receivers often need
to copy incoming data to finish an RPC operation; sometimes, only to to copy incoming data to finish an RPC operation; sometimes, only to
adjust data alignment. adjust data alignment.
In this document, "RDMA" refers to the physical mechanism an RDMA In this document, "RDMA" refers to the physical mechanism an RDMA
transport utilizes when moving data. Though it may not be optimal, transport utilizes when moving data. Although it may not be
before an RDMA transfer, the sender may still copy data into place. efficient, a sender may copy data into an
After an RDMA transfer, the receiver may copy that data again to its intermediate buffer before an RDMA transfer. After an RDMA transfer,
final destination. a receiver may copy that data again to its final destination.
This document uses the term "direct data placement" (or DDP) to refer This document uses the term "direct data placement" (or DDP) to refer
specifically to an optimized data transfer where it is unnecessary specifically to an optimized data transfer where it is unnecessary
for a receiving host's CPU to copy transferred data again after it for a receiving host's CPU to copy transferred data to another
has been received. Not all RDMA-based data transfer qualifies as location after it has been received. Not all RDMA-based data
Direct Data Placement, and DDP can be achieved using non-RDMA transfer qualifies as Direct Data Placement, and DDP can be achieved
mechanisms. using non-RDMA mechanisms.
3.2.2. RDMA Transport Requirements 3.2.2. RDMA Transport Requirements
The RPC-over-RDMA Version One protocol assumes the physical transport The RPC-over-RDMA Version One protocol assumes the physical transport
provides the following abstract operations. A more complete provides the following abstract operations. A more complete
discussion of these operations is found in [RFC5040]. discussion of these operations is found in [RFC5040].
Registered Memory
Registered memory is a segment of memory that is assigned a
steering tag that temporarily permits access by the RDMA provider
to perform data transfer operations. The RPC-over-RDMA Version
One protocol assumes that each segment of registered memory MUST
be identified with a steering tag of no more than 32 bits and
memory addresses of up to 64 bits in length.
RDMA Send RDMA Send
The RDMA provider supports an RDMA Send operation with completion The RDMA provider supports an RDMA Send operation, with completion
signaled at the receiver when data is placed in a pre-posted signaled on the receiving peer after data has been placed in a
buffer. The amount of transferred data is limited only by the pre-posted memory segment. Sends complete at the receiver in the
size of the receiver's buffer. Sends complete at the receiver in order they were issued at the sender. The amount of data
the order they were issued at the sender. transferred by an RDMA Send operation is limited by the size of
the remote pre-posted memory segment.
RDMA Receive RDMA Receive
Receive endpoints pre-post enough RDMA Receive operations to catch The RDMA provider supports an RDMA Receive operation to receive
incoming RDMA Send operations. To reduce the amount of memory data conveyed by incoming RDMA Send operations. To reduce the
that must remain pinned awaiting incoming Sends, receive buffers amount of memory that must remain pinned awaiting incoming Sends,
are limited in size and number. Flow-control to prevent the amount of pre-posted memory is limited. Flow-control to
overrunning receiver resources is provided by the upper layer prevent overrunning receiver resources is provided by the RDMA
protocol. consumer (in this case, the RPC-over-RDMA Version One protocol).
Registered Memory
All data moved via tagged RDMA operations is resident in
registered memory at its destination. This protocol assumes that
each segment of registered memory MUST be identified with a
steering tag of no more than 32 bits and memory addresses of up to
64 bits in length.
RDMA Write RDMA Write
The RDMA provider supports an RDMA Write operation to directly The RDMA provider supports an RDMA Write operation to directly
place data in the receiver's buffer. An RDMA Write is initiated place data in remote memory. The local host initiates an RDMA
by the sender and completion is signaled at the sender. No Write, and completion is signaled there; no completion is signaled
completion is signaled at the receiver. The sender uses a on the remote. The local host provides a steering tag, memory
steering tag, memory address, and length of the remote destination address, and length of the remote's memory segment.
buffer.
RDMA Writes are not necessarily ordered with respect to one RDMA Writes are not necessarily ordered with respect to one
another, but are ordered with respect to RDMA Sends. A subsequent another, but are ordered with respect to RDMA Sends. A subsequent
RDMA Send completion obtained at the receiver guarantees that RDMA Send completion obtained at the write initiator guarantees
prior RDMA Write data has been successfully placed in the that prior RDMA Write data has been successfully placed in the
receiver's memory. remote peer's memory.
RDMA Read RDMA Read
The RDMA provider supports an RDMA Read operation to directly The RDMA provider supports an RDMA Read operation to directly
place peer source data in the requester's buffer. An RDMA Read is place peer source data in the read initiator's memory. The local
initiated by the receiver and completion is signaled at the host initiates an RDMA Read, and completion is signaled there; no
receiver. The receiver provides steering tags, memory addresses, completion is signaled on the remote. The local host provides
and a length for the remote source and local destination buffers. steering tags, memory addresses, and a length for the remote
Since the peer at the data source receives no notification of RDMA source and local destination memory segments.
Read completion, there is an assumption that on receiving the
data, the receiver will signal completion with an RDMA Send The remote peer receives no notification of RDMA Read completion.
message, so that the peer can free the source buffers and the The local host signals completion as part of an RDMA Send message
associated steering tags. so that the remote peer can release steering tags and subsequently
free associated source memory segments.
The RPC-over-RDMA Version One protocol is designed to be carried over The RPC-over-RDMA Version One protocol is designed to be carried over
RDMA transports that support the above abstract operations. This RDMA transports that support the above abstract operations. This
protocol conveys to the RPC peer information sufficient for that RPC protocol conveys to the RPC peer information sufficient for that RPC
peer to direct an RDMA layer to perform transfers containing RPC data peer to direct an RDMA layer to perform transfers containing RPC data
and to communicate their result(s). For example, it is readily and to communicate their result(s). For example, it is readily
carried over RDMA transports such as Internet Wide Area RDMA Protocol carried over RDMA transports such as Internet Wide Area RDMA Protocol
(iWARP) [RFC5040] [RFC5041], or InfiniBand [IB]. (iWARP) [RFC5040] [RFC5041].
4. Protocol Framework 4. RPC-Over-RDMA Protocol Framework
4.1. Transfer Models 4.1. Transfer Models
A "transfer model" designates which endpoint is responsible for A "transfer model" designates which endpoint is responsible for
performing RDMA Read and Write operations. To enable these performing RDMA Read and Write operations. To enable these
operations, the peer endpoint first exposes segments of its memory to operations, the peer endpoint first exposes segments of its memory to
the endpoint performing the RDMA Read and Write operations. the endpoint performing the RDMA Read and Write operations.
Read-Read Read-Read
Requesters expose their memory to the responder, and the responder Requesters expose their memory to the responder, and the responder
skipping to change at page 10, line 45 skipping to change at page 10, line 49
Write-Read Write-Read
The responder exposes its memory to requesters, but requesters do The responder exposes its memory to requesters, but requesters do
not expose their memory. Requesters employ RDMA Write operations not expose their memory. Requesters employ RDMA Write operations
to convey RPC arguments or whole RPC calls. Requesters employ to convey RPC arguments or whole RPC calls. Requesters employ
RDMA Read operations to convey RPC results or whole RPC replies. RDMA Read operations to convey RPC results or whole RPC replies.
[RFC5666] specifies the use of both the Read-Read and the Read-Write [RFC5666] specifies the use of both the Read-Read and the Read-Write
Transfer Models. All current RPC-over-RDMA Version One Transfer Models. All current RPC-over-RDMA Version One
implementations use the Read-Write Transfer Model. Use of the Read- implementations use the Read-Write Transfer Model. Use of the Read-
Read Transfer Model by RPC-over-RDMA Version One implementations is Read Transfer Model by RPC-over-RDMA Version One implementations is
therefore deprecated. Other Transfer Models may be used by a future no longer supported. Other Transfer Models may be used by a future
version of RPC-over-RDMA. version of RPC-over-RDMA.
4.2. RPC-over-RDMA Framing 4.2. RPC Message Framing
During transmission, the XDR stream containing an RPC message is During transmission, the XDR stream containing an RPC message is
preceded by an RPC-over-RDMA header. This header is analogous to the preceded by an RPC-over-RDMA header. This header is analogous to the
record marking used for RPC over TCP but is more extensive, since record marking used for RPC over TCP but is more extensive, since
RDMA transports support several modes of data transfer. RDMA transports support several modes of data transfer.
All transfers of an RPC message begin with an RDMA Send that All transfers of an RPC message begin with an RDMA Send that
transfers an RPC-over-RDMA header and part or all of the accompanying transfers an RPC-over-RDMA header and part or all of the accompanying
RPC message. Because the size of what may be transmitted via RDMA RPC message. Because the size of what may be transmitted via RDMA
Send is limited by the size of the receiver's pre-posted buffers, the Send is limited by the size of the receiver's pre-posted buffers, the
skipping to change at page 11, line 25 skipping to change at page 11, line 30
RPC-over-RDMA framing replaces all other RPC framing (such as TCP RPC-over-RDMA framing replaces all other RPC framing (such as TCP
record marking) when used atop an RPC-over-RDMA association, even record marking) when used atop an RPC-over-RDMA association, even
when the underlying RDMA protocol may itself be layered atop a when the underlying RDMA protocol may itself be layered atop a
transport with a defined RPC framing (such as TCP). transport with a defined RPC framing (such as TCP).
It is however possible for RPC-over-RDMA to be dynamically enabled in It is however possible for RPC-over-RDMA to be dynamically enabled in
the course of negotiating the use of RDMA via an Upper Layer Protocol the course of negotiating the use of RDMA via an Upper Layer Protocol
exchange. Because RPC framing delimits an entire RPC request or exchange. Because RPC framing delimits an entire RPC request or
reply, the resulting shift in framing must occur between distinct RPC reply, the resulting shift in framing must occur between distinct RPC
messages, and in concert with the transport. messages, and in concert with the underlying transport.
4.3. Flow Control 4.3. Flow Control
It is critical to provide RDMA Send flow control for an RDMA It is critical to provide RDMA Send flow control for an RDMA
connection. RDMA receive operations can fail if a pre-posted receive connection. RDMA receive operations can fail if a pre-posted receive
buffer is not available to accept an incoming RDMA Send, and repeated buffer is not available to accept an incoming RDMA Send, and repeated
occurrences of such errors can be fatal to the connection. This is a occurrences of such errors can be fatal to the connection. This is a
departure from conventional TCP/IP networking where buffers are departure from conventional TCP/IP networking where buffers are
allocated dynamically as part of receiving messages. allocated dynamically as part of receiving messages.
It is not practical to provide for fixed credit limits at the Flow control for RDMA Send operations directed to the responder is
responder. Fixed limits scale poorly, since posted buffers are implemented as a simple request/grant protocol in the RPC-over-RDMA
dedicated to the associated connection until consumed by receive header associated with each RPC message (Section 5.1.3 has details).
operations. In addition, for protocol correctness, a responder must
always be able to reply to requesters, whether or not the responder
has posted buffers to accept more requests.
Therefore, flow control for RDMA Send operations is implemented as a o The RPC-over-RDMA header for RPC call messages contains a
simple request/grant protocol in the RPC-over-RDMA header associated requested credit value for the responder. This is the maximum
with each RPC message. The RPC-over-RDMA header for RPC call number of RPC replies the requester can handle at once,
messages contains a requested credit value for the responder, which independent of how many RPCs are in flight at that moment. The
MAY be dynamically adjusted by the caller to match its expected requester MAY dynamically adjust the requested credit value to
needs. match its expected needs.
The RPC-over-RDMA header for RPC reply messages provides the granted o The RPC-over-RDMA header for RPC reply messages provides the
result, which MAY have any value except it MUST NOT be zero when no granted result. This is the maximum number of RPC calls the
in-progress operations are present at the responder, since such a responder can handle at once, without regard to how many RPCs are
value would result in deadlock. The value MAY be adjusted up or down in flight at that moment. The granted value MUST NOT be zero,
at each opportunity to match the responder's needs or policies. since such a value would result in deadlock. The responder MAY
dynamically adjust the granted credit value to match its needs or
policies (e.g. to accommodate the available resources in a shared
receive queue).
The requester MUST NOT send unacknowledged requests in excess of this The requester MUST NOT send unacknowledged requests in excess of this
granted responder credit limit. If the limit is exceeded, the RDMA granted responder credit limit. If the limit is exceeded, the RDMA
layer may signal an error, possibly terminating the connection. Even layer may signal an error, possibly terminating the connection. Even
if an error does not occur, it is OPTIONAL that the responder handle if an RDMA layer error does not occur, the responder MAY handle
the excess request(s). It MAY return an RPC error to the requester excess requests or return an RPC layer error to the requester.
Note that the never-zero requirement implies that a responder MUST
always provide at least one credit to each connected requester from
which no requests are outstanding. The requester would deadlock
otherwise, unable to send another request.
While RPC calls complete in any order, the current flow control limit While RPC calls complete in any order, the current flow control limit
at the responder is known to the requester from the Send ordering at the responder is known to the requester from the Send ordering
properties. It is always the most recent responder-granted credit properties. It is always the lower of the requested and granted
value minus the number of requests in flight. credit values, minus the number of requests in flight. Advertised
credit values are not altered because individual RPCs are started or
completed.
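The credit accounting described above can be sketched as follows. This is a minimal illustration that follows the -01 wording (the lower of the requested and granted values, minus requests in flight). The class name CreditState and its method names are hypothetical and do not appear in either draft.

```python
from dataclasses import dataclass


@dataclass
class CreditState:
    """Requester-side view of RDMA Send credits (illustrative only)."""
    requested: int = 1  # credit value the requester advertises in its calls
    granted: int = 1    # a new connection may assume only one credit
    in_flight: int = 0  # calls sent whose replies have not yet arrived

    def available(self) -> int:
        # Lower of the requested and granted values, minus requests in flight.
        return max(0, min(self.requested, self.granted) - self.in_flight)

    def on_call_sent(self) -> None:
        if self.available() == 0:
            raise RuntimeError("no credits available; the call must wait")
        self.in_flight += 1

    def on_reply(self, granted: int) -> None:
        # Each reply carries the responder's current granted value.
        if granted < 1:
            raise ValueError("a granted credit value of zero is not allowed")
        if self.in_flight == 0:
            raise RuntimeError("reply received with no call in flight")
        self.in_flight -= 1
        self.granted = granted
```

A requester would consult available() before posting each RDMA Send. Starting with granted set to 1 mirrors the rule that a requester assumes only one credit until the first reply arrives.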
On occasion a requester or responder may need to adjust the amount of
resources available to a connection. When this happens, the
responder needs to ensure that a credit increase is effected (i.e.
receives are posted) before the next reply is sent.
Certain RDMA implementations may impose additional flow control Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at restrictions, such as limits on RDMA Read operations in progress at
the responder. Because these operations are outside the scope of the responder. Accommodation of such restrictions is considered the
this protocol, they are not addressed and SHOULD be provided for by responsibility of each RPC-over-RDMA Version One implementation.
other layers.
4.3.1. Initial Connection State 4.3.1. Initial Connection State
There are two operational parameters for each connection: There are two operational parameters for each connection:
Credit Limit Credit Limit
The number of available receive buffers is a connection's credit As described above, the total number of responder receive buffers
limit. The credit limit is advertised in the RPC-over-RDMA header is a connection's credit limit. The credit limit is advertised in
in each RPC message, and can change during the lifetime of a the RPC-over-RDMA header in each RPC message, and can change
connection. during the lifetime of a connection.
Inline Threshold Inline Threshold
The maximum RDMA message size that can be received is a The size of the receiver's smallest posted receive buffer is the
connection's "inline threshold." This is the size of the smallest largest size message that a sender can convey with an RDMA Send
posted receive buffer, though usually all of a connection's operation, and is known as a connection's "inline threshold."
receive buffers are the same size. Unlike the connection's credit Unlike the connection's credit limit, the inline threshold value
limit, the inline threshold value is not advertised to peers via is not advertised to peers via the RPC-over-RDMA Version One
the RPC-over-RDMA Version One protocol, and there is no provision protocol, and there is no provision for the inline threshold value
for the inline threshold value to change during the lifetime of an to change during the lifetime of an RPC-over-RDMA Version One
RPC-over-RDMA Version One connection. connection. Connection peers MAY have different inline
thresholds.
The longevity of a transport connection requires that sending The longevity of a transport connection requires that sending
endpoints respect the resource limits of peer receivers. However, endpoints respect the resource limits of peer receivers. However,
when a connection is first established, peers cannot know how many when a connection is first established, peers cannot know how many
receive buffers the other has, nor how large the buffers are. receive buffers the other has, nor how large the buffers are.
To provide a basis for an initial exchange of RPC requests, each RPC- As a basis for an initial exchange of RPC requests, each RPC-over-
over-RDMA connection is assumed to provide a basic level of RDMA Version One connection provides the ability to exchange at least
interoperability: the ability to exchange at least one RPC message at one RPC message at a time that is 1024 bytes in size. A responder
a time that is 1024 bytes in size. A responder MAY exceed this basic MAY exceed this basic level of configuration, but a requester MUST
level of configuration, but a requester MUST NOT assume more than one NOT assume more than one credit is available, and MUST receive a
credit is available, and MUST receive a valid reply from the valid reply from the responder carrying the actual number of
responder carrying the actual number of available credits, prior to available credits, prior to sending its next request.
sending its next request.
In the absence of an exchange of buffer size information (such as the
Connection Configuration Protocol described in [RFC5666]), senders
MUST assume the receiver's inline threshold is 1024 bytes.
Implementations MUST support an inline threshold of 1024 bytes, but Implementations MUST support an inline threshold of 1024 bytes, but
MAY support larger inline thresholds. MAY support larger inline thresholds. In the absence of a mechanism
for discovering a peer's inline threshold, senders MUST assume a
receiver's inline threshold is 1024 bytes.
4.4. XDR Encoding With Chunks 4.4. XDR Encoding With Chunks
On traditional RPC transports, XDR data items in an RPC message are On traditional RPC transports, XDR data items in an RPC message are
encoded as a contiguous sequence of bytes for network transmission. encoded as a contiguous sequence of bytes for network transmission.
However, in the case of an RDMA transport, during XDR encoding it can However, in the case of an RDMA transport, during XDR encoding it can
be determined that (for instance) an opaque byte array is large be determined that (for instance) an opaque byte array is large
enough to be moved via an RDMA Read or RDMA Write operation. enough to be moved via an RDMA Read or RDMA Write operation.
RPC-over-RDMA Version One provides a mechanism for moving part of an RPC RPC-over-RDMA Version One provides a mechanism for moving part of an RPC
skipping to change at page 13, line 42 skipping to change at page 13, line 49
XDR stream that is split out and moved via a separate RDMA operation XDR stream that is split out and moved via a separate RDMA operation
is known as a "chunk." The sender removes the chunk data from is known as a "chunk." The sender removes the chunk data from
the XDR stream conveyed via RDMA Send, and the receiver inserts it the XDR stream conveyed via RDMA Send, and the receiver inserts it
before handing the reconstructed stream to the Upper Layer. before handing the reconstructed stream to the Upper Layer.
4.4.1. DDP-Eligibility 4.4.1. DDP-Eligibility
Only an XDR data item that might benefit from Direct Data Placement Only an XDR data item that might benefit from Direct Data Placement
should be moved to a chunk. The eligibility of specific XDR data should be moved to a chunk. The eligibility of specific XDR data
items to be moved as a chunk, as opposed to being left in the XDR items to be moved as a chunk, as opposed to being left in the XDR
stream, is not specified by this document. The Upper Layer Protocol stream, is not specified by this document. A determination must be
MUST determine which items in its XDR definition are allowed to use made for each Upper Layer Protocol which items in its XDR definition
Direct Data Placement. Therefore an additional specification is are allowed to use Direct Data Placement. Therefore an additional
needed that describes how an Upper Layer Protocol enables Direct Data specification is needed that describes how an Upper Layer Protocol
Placement. The set of requirements for a ULP to use an RDMA enables Direct Data Placement. The set of requirements for a ULP to
transport is known as an "Upper Layer Binding" specification, or ULB. use an RDMA transport is known as an "Upper Layer Binding"
specification, or ULB.
An Upper Layer Binding states which specific individual XDR data An Upper Layer Binding states which specific individual XDR data
items in an Upper Layer Protocol MAY be transferred via Direct Data items in an Upper Layer Protocol MAY be transferred via Direct Data
Placement. This document will refer to such XDR data items as "DDP- Placement. This document will refer to such XDR data items as "DDP-
eligible". All other XDR data items MUST NOT be placed in a chunk. eligible". All other XDR data items MUST NOT be placed in a chunk.
RPC-over-RDMA Version One uses RDMA Read and Write operations to RPC-over-RDMA Version One uses RDMA Read and Write operations to
transfer DDP-eligible data that has been placed in chunks. transfer DDP-eligible data that has been placed in chunks.
The details and requirements for Upper Layer Bindings are discussed The details and requirements for Upper Layer Bindings are discussed
in full in Section 6. in full in Section 6.
skipping to change at page 16, line 17 skipping to change at page 16, line 31
Position Position
For data that is to be encoded, the byte offset in the RPC message For data that is to be encoded, the byte offset in the RPC message
XDR stream where the receiver re-inserts the chunk data. The byte XDR stream where the receiver re-inserts the chunk data. The byte
offset MUST be computed from the beginning of the RPC message, not offset MUST be computed from the beginning of the RPC message, not
the beginning of the RPC-over-RDMA header. All segments belonging the beginning of the RPC-over-RDMA header. All segments belonging
to the same Read chunk have the same value in their Position to the same Read chunk have the same value in their Position
field. field.
While constructing the RPC call, the requester registers memory While constructing the RPC call, the requester registers memory
regions containing data in Read chunks. It advertises these chunks segments containing data in Read chunks. It advertises these chunks
in the RPC-over-RDMA header of the RPC call. in the RPC-over-RDMA header of the RPC call.
After receiving the RPC call via an RDMA Send operation, the After receiving the RPC call via an RDMA Send operation, the
responder transfers the chunk data from the requester using RDMA Read responder transfers the chunk data from the requester using RDMA Read
operations. The responder reconstructs the transferred chunk data by operations. The responder reconstructs the transferred chunk data by
concatenating the contents of each segment, in list order, into the concatenating the contents of each segment, in list order, into the
RPC call XDR stream. The first segment begins at the XDR position in RPC call XDR stream. The first segment begins at the XDR position in
the Position field, and subsequent segments are concatenated the Position field, and subsequent segments are concatenated
afterwards until there are no more segments left at that XDR afterwards until there are no more segments left at that XDR
Position. Position.
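The reconstruction step for a single Read chunk can be sketched as follows. This is a simplified illustration, not the drafts' procedure. It assumes the segment contents have already been fetched with RDMA Read, it handles one chunk only, and it ignores XDR padding. The function name reinsert_read_chunk and the (position, data) tuple layout are hypothetical.

```python
def reinsert_read_chunk(inline_stream: bytes, segments) -> bytes:
    """Re-insert one Read chunk into the XDR stream received via RDMA Send.

    `segments` is a list of (position, data) pairs in list order, where
    `data` holds the bytes already pulled from the requester.
    """
    positions = {position for position, _ in segments}
    if len(positions) != 1:
        raise ValueError("segments of one Read chunk share a single Position")
    position = positions.pop()
    if position > len(inline_stream):
        raise ValueError("Position lies beyond the end of the XDR stream")
    # Concatenate segment contents in list order, then splice them in
    # at the byte offset named by the Position field.
    chunk = b"".join(data for _, data in segments)
    return inline_stream[:position] + chunk + inline_stream[position:]
```

For example, reinsert_read_chunk(b"HEADTAIL", [(4, b"ab"), (4, b"cd")]) yields b"HEADabcdTAIL".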
skipping to change at page 17, line 39 skipping to change at page 18, line 6
A "Write chunk" represents an XDR data item that is to be pushed from A "Write chunk" represents an XDR data item that is to be pushed from
the responder to the requester using RDMA Write operations. the responder to the requester using RDMA Write operations.
A Write chunk is an array of one or more RDMA segments. Segments in A Write chunk is an array of one or more RDMA segments. Segments in
a Write chunk do not have a Position field because Write chunks are a Write chunk do not have a Position field because Write chunks are
provided by a requester long before the responder prepares the reply provided by a requester long before the responder prepares the reply
XDR stream. XDR stream.
While constructing the RPC call, the requester also sets up memory While constructing the RPC call, the requester also sets up memory
regions to catch DDP-eligible reply data. The requester provides as segments to catch DDP-eligible reply data. The requester provides as
many segments as needed to accommodate the largest possible size of many segments as needed to accommodate the largest possible size of
the data item in each Write chunk. the data item in each Write chunk.
The responder transfers the chunk data to the requester using RDMA The responder transfers the chunk data to the requester using RDMA
Write operations. The responder copies the requester's Write chunk Write operations. The responder copies the requester's Write chunk
segments into the RPC-over-RDMA header to be sent with the reply. segments into the RPC-over-RDMA header to be sent with the reply.
The responder updates the segment length fields to reflect the actual The responder updates the segment length fields to reflect the actual
amount of data that is being returned in the chunk. The updated amount of data that is being returned in the chunk. The updated
length of a Write chunk segment MAY be zero if the segment was not length of a Write chunk segment MAY be zero if the segment was not
filled by the responder. However the responder MUST NOT change the filled by the responder. However the responder MUST NOT change the
skipping to change at page 21, line 41 skipping to change at page 22, line 17
If DDP-eligible data items are present in an RPC message, a sender If DDP-eligible data items are present in an RPC message, a sender
MAY remove them from the RPC message, and use RDMA Read or Write MAY remove them from the RPC message, and use RDMA Read or Write
operations to move that data. The RPC-over-RDMA header with the operations to move that data. The RPC-over-RDMA header with the
shortened RPC call or reply message immediately following is shortened RPC call or reply message immediately following is
transferred using a single RDMA Send operation. Removed DDP-eligible transferred using a single RDMA Send operation. Removed DDP-eligible
data items are conveyed using RDMA Read or Write operations using data items are conveyed using RDMA Read or Write operations using
additional information provided in the RPC-over-RDMA header. additional information provided in the RPC-over-RDMA header.
4.6.3. Long Messages 4.6.3. Long Messages
When an RPC message is larger than the connection's inline threshold When an RPC message is larger than the connection's inline threshold,
and the Upper Layer Binding does not identify any DDP-eligible data DDP-eligible data items are removed from the message and placed in
items in the requested operation that may be moved separately, the chunks and moved separately. If there are no DDP-eligible data items
RDMA transport MUST use RDMA Read and Write operations to convey the in the message, or the message is still too large after DDP-eligible
whole RPC message. This mechanism is referred to as a "Long items have been removed, the RDMA transport MUST use RDMA Read or
Message." Write operations to convey the RPC message body itself. This
mechanism is referred to as a "Long Message."
To send an RPC message as a Long Message, the sender conveys only the To send an RPC message as a Long Message, the sender conveys only the
RPC-over-RDMA header with an RDMA Send operation. The RPC message RPC-over-RDMA header with an RDMA Send operation. The RPC message
itself is not included in the Send buffer. Instead, the requester itself is not included in the Send buffer. Instead, the requester
provides chunks that the responder uses to move the whole RPC provides chunks that the responder uses to move the whole RPC
message. message.
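The choice between an inline Send, chunked DDP-eligible items, and a Long Message can be sketched as follows, using the -01 wording. This is a rough illustration under several assumptions: the caller supplies the RPC-over-RDMA header size, growth of the header as chunk lists are added is ignored, and DDP-eligible items are removed only when the message would not otherwise fit. The function and constant names are hypothetical.

```python
DEFAULT_INLINE_THRESHOLD = 1024  # bytes; assumed when no larger value is known


def choose_transfer(header_len, message_len, ddp_eligible_lens,
                    inline_threshold=DEFAULT_INLINE_THRESHOLD):
    """Return "inline", "chunked", or "long" for one RPC message."""
    # Everything fits in a single RDMA Send.
    if header_len + message_len <= inline_threshold:
        return "inline"
    # Move DDP-eligible items into chunks, then re-check the remainder.
    reduced_len = message_len - sum(ddp_eligible_lens)
    if ddp_eligible_lens and header_len + reduced_len <= inline_threshold:
        return "chunked"
    # No DDP-eligible items, or still too large: send a Long Message.
    return "long"
```

A real implementation would recompute the header size after adding chunk lists, since each segment enlarges the header conveyed by the Send.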
Long RPC call Long RPC call
To handle an RPC request using a Long Message, the requester To handle an RPC request using a Long Message, the requester
provides a special Read chunk that contains the RPC call's XDR provides a special Read chunk that contains the RPC call's XDR
skipping to change at page 22, line 31 skipping to change at page 23, line 8
Responders SHOULD use a Long Message whenever a Reply chunk has been Responders SHOULD use a Long Message whenever a Reply chunk has been
provided by a requester. Both types of special chunk MAY be present provided by a requester. Both types of special chunk MAY be present
in the same RPC message. in the same RPC message.
Because these special chunks contain a whole RPC message, any XDR Because these special chunks contain a whole RPC message, any XDR
data item MAY appear in one of these special chunks without regard to data item MAY appear in one of these special chunks without regard to
its DDP-eligibility. DDP-eligible data items MAY be removed from its DDP-eligibility. DDP-eligible data items MAY be removed from
these special chunks and conveyed via normal chunks, but non-eligible these special chunks and conveyed via normal chunks, but non-eligible
data items MUST NOT appear in normal chunks. data items MUST NOT appear in normal chunks.
5. RPC-over-RDMA In Operation 5. RPC-Over-RDMA In Operation
An RPC-over-RDMA Version One header precedes all RPC messages An RPC-over-RDMA Version One header precedes all RPC messages
conveyed across an RDMA transport. This header includes a copy of conveyed across an RDMA transport. This header includes a copy of
the message's transaction ID, data for RDMA flow control credits, and the message's transaction ID, data for RDMA flow control credits, and
lists of memory addresses used for RDMA Read and Write operations. lists of memory addresses used for RDMA Read and Write operations.
All RPC-over-RDMA header content MUST be XDR encoded. All RPC-over-RDMA header content MUST be XDR encoded.
RPC message layout is unchanged from that described in [RFC5531] RPC message layout is unchanged from that described in [RFC5531]
except for the possible removal of data items that are moved by RDMA except for the possible removal of data items that are moved by RDMA
Read or Write operations. If an RPC message (along with its RPC- Read or Write operations. If an RPC message (along with its RPC-
skipping to change at page 28, line 42 skipping to change at page 29, line 27
requester anticipates a long reply and has some knowledge of its size requester anticipates a long reply and has some knowledge of its size
so that an adequately sized buffer can be allocated. Typically the so that an adequately sized buffer can be allocated. Typically the
Upper Layer Protocol can limit the size of RPC replies appropriately. Upper Layer Protocol can limit the size of RPC replies appropriately.
It is possible for a single RPC procedure to employ both a long call It is possible for a single RPC procedure to employ both a long call
for its arguments and a long reply for its results. However, such an for its arguments and a long reply for its results. However, such an
operation is atypical, as few upper layers define such exchanges. operation is atypical, as few upper layers define such exchanges.
5.4. Memory Registration 5.4. Memory Registration
RDMA requires that all data be transferred between registered memory RDMA requires that data is transferred only between registered memory
regions at the source and destination. All protocol headers as well segments at the source and destination. All protocol headers as well
as separately transferred data chunks use registered memory. Since as separately transferred data chunks use registered memory.
the cost of registering and de-registering memory can be a large
proportion of the RDMA transaction cost, it is important to minimize
registration activity. This is easily achieved within RPC-controlled
memory by allocating chunk list data and RPC headers in a reusable
way from pre-registered pools.
Data chunks transferred via RDMA Read and Write MAY occupy memory Since the cost of registering and de-registering memory can be a
significant proportion of the RDMA transaction cost, it is important
to minimize registration activity. This is easily achieved within
RPC-controlled memory by allocating chunk list data and RPC headers
in a reusable way from pre-registered pools.
5.4.1. Registration Longevity
Data chunks transferred via RDMA Read and Write MAY reside in memory
that persists outside the bounds of the RPC transaction. Hence, the that persists outside the bounds of the RPC transaction. Hence, the
default behavior of an RPC-over-RDMA transport is to register and default behavior of an RPC-over-RDMA transport is to register and
invalidate these chunks on every RPC transaction. The requester invalidate these chunks on every RPC transaction.
transport implementation must ensure that these memory regions are
The requester endpoint must ensure that these memory segments are
properly fenced from the responder before allowing Upper Layer access properly fenced from the responder before allowing Upper Layer access
to the data contained in them. to the data contained in them. The data in such segments must be at
rest while a responder has access to that memory.
The interface by which an upper-layer implementation communicates the This includes segments that are associated with canceled RPCs. A
eligibility of a data item locally to RPC for chunking is out of responder cannot know that the requester is no longer waiting for a
scope for this specification. Depending on the implementation and reply, and might proceed to read or even update memory that the
constraints imposed by Upper Layer Bindings, it is possible to requester has released for other use.
implement an RPC chunking facility that is transparent to upper
layers. However, such implementations may lead to inefficiencies, 5.4.2. Communicating DDP-Eligibility
either because they require the RPC layer to perform expensive
registration and de-registration of memory "on the fly", or they may The interface by which an Upper Layer Protocol implementation
require using RDMA chunks in reply messages, along with the resulting communicates the eligibility of a data item locally to its local RPC-
additional handshaking with the RPC-over-RDMA peer. However, these over-RDMA endpoint is out of scope for this specification.
issues are internal and generally confined to the local interface
between RPC and its upper layers, one in which implementations are Depending on the implementation and constraints imposed by Upper
free to innovate. The only requirement is that the resulting RPC- Layer Bindings, it is possible to implement an RPC chunking facility
over-RDMA protocol sent to the peer is valid for the upper layer. that is transparent to upper layers. Such implementations may lead
to inefficiencies, either because they require the RPC layer to
perform expensive registration and de-registration of memory "on the
fly", or they may require using RDMA chunks in reply messages, along
with the resulting additional handshaking with the RPC-over-RDMA
peer.
However, these issues are internal and generally confined to the
local interface between RPC and its upper layers, one in which
implementations are free to innovate. The only requirement is that
the resulting RPC-over-RDMA protocol sent to the peer is valid for
the upper layer.
5.4.3. Registration Strategies
The choice of which memory registration strategies to employ is left The choice of which memory registration strategies to employ is left
to requester and responder implementers. To support the widest array to requester and responder implementers. To support the widest array
of RDMA implementations, as well as the most general steering tag of RDMA implementations, as well as the most general steering tag
scheme, an Offset field is included in each segment. scheme, an Offset field is included in each segment.
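As an illustration of what the Offset field allows, the sketch below describes several segments that share one registered region and contrasts that with a zero-based scheme. The handle, length, and offset fields mirror the contents of an RPC-over-RDMA segment. The function names, the `register` callback, and the numeric values are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdmaSegment:
    handle: int  # steering tag of the registered region
    length: int  # segment length in bytes
    offset: int  # where the segment starts, as advertised to the peer


def carve_segments(handle, base_offset, pieces):
    """Describe several pieces of ONE registered region.

    pieces is a list of (start, length) pairs relative to the region.
    No additional registration is needed per piece.
    """
    return [RdmaSegment(handle, length, base_offset + start)
            for start, length in pieces]


def zero_based_segments(register, pieces):
    """Zero-based scheme: every piece needs its own registration.

    register is a hypothetical callback returning a new handle.
    """
    return [RdmaSegment(register(start, length), length, 0)
            for start, length in pieces]
```

With `carve_segments`, two pieces of a pool registered once share a single handle and differ only in offset. With `zero_based_segments`, the same two pieces cost two calls to `register`.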
While zero-based offset schemes are available in many RDMA While zero-based offset schemes are available in many RDMA
implementations, their use by RPC requires individual registration of implementations, their use by RPC requires individual registration of
each segment. For such implementations, this can be a significant each segment. For such implementations, this can be a significant
overhead. By providing an offset in each chunk, many pre- overhead. By providing an offset in each chunk, many pre-
skipping to change at page 35, line 14 skipping to change at page 36, line 14
The XDR language definition of DDP-eligible data items is not The XDR language definition of DDP-eligible data items is not
decorated in any way. decorated in any way.
It is the responsibility of the protocol's Upper Layer Binding to It is the responsibility of the protocol's Upper Layer Binding to
specify DDP-eligibility rules so that if a DDP-eligible XDR data item specify DDP-eligibility rules so that if a DDP-eligible XDR data item
is embedded within another, only one of these two objects is to be is embedded within another, only one of these two objects is to be
represented by a chunk. This ensures that the mapping from XDR represented by a chunk. This ensures that the mapping from XDR
position to the XDR object represented is unambiguous. position to the XDR object represented is unambiguous.
An Upper Layer Binding is considered ready to publish when: 6.2. Write List Ordering
o Every XDR data type in the protocol has been considered for DDP- A requester constructs the Write list for an RPC transaction before
eligibility the responder has formulated the transaction's reply.
o Long Messages When there is only one result data item that is DDP-eligible, the
requester appends only a single Write chunk to that Write list. If
the responder populates that chunk with data, there is no ambiguity
about which result is contained in it.
6.2. Write List Ordering However, an Upper Layer Binding MAY permit a reply where more than
one result data item is DDP-eligible. For example, an NFSv4 COMPOUND
reply is composed of individual NFSv4 operations, more than one of
which can contain a DDP-eligible result.
Place holder A requester provides multiple Write chunks when it expects the RPC
reply to contain more than one data item that is DDP-eligible.
Ambiguities can arise when replies contain XDR unions or arrays of
complex data types which allow a responder options about whether a
DDP-eligible data item is included or not.
An Upper Layer Binding MUST determine how Write list entries are Requester and responder must agree beforehand which data items appear
mapped to procedure arguments for each Upper Layer procedure. in which Write chunk. Therefore an Upper Layer Binding MUST
determine how Write list entries are mapped to procedure arguments
for each Upper Layer procedure.
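One possible agreement between requester and responder is sketched below. The rule is an assumption for illustration and is not taken from any particular Upper Layer Binding: Write chunks are consumed in order by the DDP-eligible results that actually appear in the reply, an absent result consumes no chunk, and a result left without a chunk is returned inline. The function and the operation names are hypothetical.

```python
def place_results(results, write_chunks):
    """Map DDP-eligible results to Write chunks in list order.

    results: list of (name, payload, ddp_eligible) tuples, in reply order;
             payload is None when the result is absent from the reply.
    write_chunks: chunks provided by the requester, in Write list order.
    Returns a dict from result name to its chunk, or "inline".
    """
    chunks = iter(write_chunks)
    placement = {}
    for name, payload, eligible in results:
        if not eligible or payload is None:
            continue  # not eligible, or not present: consumes no chunk
        chunk = next(chunks, None)
        placement[name] = chunk if chunk is not None else "inline"
    return placement
```

Because both peers walk the reply and the Write list in the same order, each can determine which chunk holds which result without further negotiation.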
6.3. DDP-Eligibility Violation 6.3. DDP-Eligibility Violation
If the Upper Layer on a receiver is not aware of the presence and If the Upper Layer on a receiver is not aware of the presence and
operation of an RPC-over-RDMA transport under it, it could be operation of an RPC-over-RDMA transport under it, it could be
challenging to discover when a sender has violated an Upper Layer challenging to discover when a sender has violated an Upper Layer
Binding rule. Binding rule.
If a violation does occur, RFC 5666 does not define an unambiguous If a violation does occur, RFC 5666 does not define an unambiguous
mechanism for reporting the violation. The violation of Binding mechanism for reporting the violation. The violation of Binding
skipping to change at page 36, line 10 skipping to change at page 37, line 22
It is the Upper Layer Binding's responsibility to specify how a It is the Upper Layer Binding's responsibility to specify how a
responder must reply if a requester violates a DDP-eligibility rule. responder must reply if a requester violates a DDP-eligibility rule.
The Binding specification should provide similar guidance for The Binding specification should provide similar guidance for
requesters about handling invalid RPC-over-RDMA replies. requesters about handling invalid RPC-over-RDMA replies.
6.4. Other Binding Information 6.4. Other Binding Information
An Upper Layer Binding may recommend inline threshold values for RPC- An Upper Layer Binding may recommend inline threshold values for RPC-
over-RDMA Version One connections bearing that Upper Layer Protocol. over-RDMA Version One connections bearing that Upper Layer Protocol.
However, note that RPC-over-RDMA connections can be shared by more However, note that RPC-over-RDMA connections can be shared by more
than one Upper Layer Protocol, and that mechanisms such as CCP often than one Upper Layer Protocol, and that an implementation may use the
apply to all connections and Protocols that flow between two peers. same inline threshold for all connections and Protocols that flow
between two peers.
If an Upper Layer Protocol specifies a method for exchanging inline If an Upper Layer Protocol specifies a method for exchanging inline
threshold information, the sender can find out the receiver's threshold information, the sender can find out the receiver's
threshold value only subsequent to establishing an RPC-over-RDMA threshold value only subsequent to establishing an RPC-over-RDMA
connection. The new threshold value can take effect only when a new connection. The new threshold value can take effect only when a new
connection is established. connection is established.
7. RPC Bind Parameters 7. RPC Bind Parameters
In setting up a new RDMA connection, the first action by a requester In setting up a new RDMA connection, the first action by a requester
skipping to change at page 37, line 37 skipping to change at page 38, line 48
each RPC-over-RDMA-enabled Upper Layer binding, and not addressed each RPC-over-RDMA-enabled Upper Layer binding, and not addressed
here. here.
In Section 11, "IANA Considerations", this specification defines two In Section 11, "IANA Considerations", this specification defines two
new "netid" values, to be used for registration of upper layers atop new "netid" values, to be used for registration of upper layers atop
iWARP [RFC5040] [RFC5041] and (when a suitable port translation iWARP [RFC5040] [RFC5041] and (when a suitable port translation
service is available) InfiniBand [IB]. Additional RDMA-capable service is available) InfiniBand [IB]. Additional RDMA-capable
networks MAY define their own netids, or if they provide a port networks MAY define their own netids, or if they provide a port
translation, MAY share the one defined here. translation, MAY share the one defined here.
8. Bi-directional RPC-over-RDMA 8. Bi-Directional RPC-Over-RDMA
8.1. RPC Direction 8.1. RPC Direction
8.1.1. Forward Direction 8.1.1. Forward Direction
A traditional ONC RPC client is always a requester. A traditional A traditional ONC RPC client is always a requester. A traditional
ONC RPC service is always a responder. This traditional form of ONC ONC RPC service is always a responder. This traditional form of ONC
RPC message passing is referred to as operation in the "forward RPC message passing is referred to as operation in the "forward
direction." direction."
During forward direction operation, the ONC RPC client is responsible During forward direction operation, the ONC RPC client is responsible
skipping to change at page 43, line 47 skipping to change at page 45, line 15
changed in a way that prevents interoperability with current changed in a way that prevents interoperability with current
implementations implementations
o Whenever the set of abstract RDMA operations that may be used is o Whenever the set of abstract RDMA operations that may be used is
changed changed
o Whenever the set of allowable transfer models is altered o Whenever the set of allowable transfer models is altered
10. Security Considerations 10. Security Considerations
10.1. Memory Protection
A primary consideration is the protection of the integrity and A primary consideration is the protection of the integrity and
privacy of local memory by the RDMA transport itself. The use of privacy of local memory by an RPC-over-RDMA transport. The use of
RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory RPC-over-RDMA MUST NOT introduce any vulnerabilities to system memory
contents, or to memory owned by user processes. These protections contents, nor to memory owned by user processes.
are provided by the RDMA layer specifications, and specifically their
security models.
It is REQUIRED that any RDMA provider used for RPC transport be It is REQUIRED that any RDMA provider used for RPC transport be
conformant to the requirements of [RFC5042] in order to satisfy these conformant to the requirements of [RFC5042] in order to satisfy these
protections. Best practices to ensure memory contents are completely protections. These protections are provided by the RDMA layer
protected during an RPC transaction include the following. specifications, and specifically their security models.
o The use of Protection Domains to limit the exposure of memory o The use of Protection Domains to limit the exposure of memory
regions to a single connection is critical. Any attempt by a host segments to a single connection is critical. Any attempt by a
not participating in that connection to re-use handles will result host not participating in that connection to re-use handles will
in a connection failure. Because Upper Layer Protocol security result in a connection failure. Because Upper Layer Protocol
mechanisms rely on this aspect of Reliable Connection behavior, security mechanisms rely on this aspect of Reliable Connection
strong authentication of the remote is recommended. behavior, strong authentication of the remote is recommended.
o Unpredictable memory handles should be used for any operation o Unpredictable memory handles should be used for any operation
requiring advertised memory regions. Advertising a continuously requiring advertised memory segments. Advertising a continuously
registered memory region allows a remote host to read or write to registered memory region allows a remote host to read or write to
that region even when an RPC involving that memory is not under that region even when an RPC involving that memory is not under
way. Therefore advertising persistently registered memory should way. Therefore advertising persistently registered memory should
be avoided. be avoided.
o Advertised memory regions should be invalidated as soon as related o Advertised memory segments should be invalidated as soon as
RPC operations are complete. Invalidation and DMA unmapping of related RPC operations are complete. Invalidation and DMA
regions should be complete before an RPC application is allowed to unmapping of segments should be complete before an RPC application
continue execution and use the contents of a memory region. is allowed to continue execution and use the contents of a memory
region.
Once delivered securely by the RDMA provider, any RDMA-exposed 10.2. Using GSS With RPC-Over-RDMA
addresses will contain only RPC payloads in the chunk lists,
transferred under the protection of RPCSEC_GSS integrity and privacy.
By these means, the data will be protected end-to-end, as required by
the RPC layer security model.
RPC provides its own security via the RPCSEC_GSS framework [RFC2203]. ONC RPC provides its own security via the RPCSEC_GSS framework
RPCSEC_GSS can provide message authentication, integrity checking, [RFC2203]. RPCSEC_GSS can provide message authentication, integrity
and privacy. This security mechanism is unaffected by the RDMA checking, and privacy. This security mechanism is unaffected by the
transport. However, there is much data movement associated with RDMA transport. However, there is much data movement associated with
computation and verification of integrity, or encryption/decryption, computation and verification of integrity, or encryption/decryption,
so certain performance advantages may be lost. so certain performance advantages may be lost.
For efficiency, a more appropriate security mechanism for RDMA links For efficiency, a more appropriate security mechanism for RDMA links
may be link-level protection, such as certain configurations of may be link-level protection, such as certain configurations of
IPsec, which may be co-located in the RDMA hardware. The use of IPsec, which may be co-located in the RDMA hardware. The use of
link-level protection MAY be negotiated through the use of the link-level protection MAY be negotiated through the use of the
RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the
Channel Binding mechanism [RFC5056] and IPsec Channel Connection Channel Binding mechanism [RFC5056] and IPsec Channel Connection
Latching [RFC5660]. Use of such mechanisms is REQUIRED where Latching [RFC5660]. Use of such mechanisms is REQUIRED where
integrity and/or privacy is desired, and where efficiency is integrity and/or privacy is desired, and where efficiency is
required. required.
Once delivered securely by the RDMA provider, any RDMA-exposed memory
will contain only RPC payloads in the chunk lists, transferred under
the protection of RPCSEC_GSS integrity and privacy. By these means,
the data will be protected end-to-end, as required by the RPC layer
security model.
11. IANA Considerations 11. IANA Considerations
Three new assignments are specified by this document: Three new assignments are specified by this document:
- A new set of RPC "netids" for resolving RPC-over-RDMA services - A new set of RPC "netids" for resolving RPC-over-RDMA services
- Optional service port assignments for Upper Layer Bindings - Optional service port assignments for Upper Layer Bindings
- An RPC program number assignment for the configuration protocol - An RPC program number assignment for the configuration protocol
skipping to change at page 46, line 14 skipping to change at page 47, line 31
The RPC program number assignment policy and registry are defined in The RPC program number assignment policy and registry are defined in
[RFC5531]. [RFC5531].
12. Acknowledgments 12. Acknowledgments
The editor gratefully acknowledges the work of Brent Callaghan and The editor gratefully acknowledges the work of Brent Callaghan and
Tom Talpey on the original RPC-over-RDMA Version One specification Tom Talpey on the original RPC-over-RDMA Version One specification
[RFC5666]. [RFC5666].
The comments and contributions of Karen Deitke, William Simpson, Dai The comments and contributions of Karen Deitke, Dai Ngo, Chunli
Ngo, Chunli Zhang, Dominique Martinet, and Mahesh Siddheshwar are Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with
accepted with many and great thanks. The editor also wishes to thank many and great thanks. The editor also wishes to thank Dave Noveck
Dave Noveck and Bill Baker for their unwavering support of this work. and Bill Baker for their unwavering support of this work.
Special thanks go to nfsv4 Working Group Chair Spencer Shepler and Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
nfsv4 Working Group Secretary Thomas Haynes for their support. nfsv4 Working Group Secretary Thomas Haynes for their support.
13. Appendices 13. Appendices
13.1. Appendix 1: XDR Examples 13.1. Appendix 1: XDR Examples
RPC-over-RDMA chunk lists are complex data types. In this appendix, RPC-over-RDMA chunk lists are complex data types. In this appendix,
illustrations are provided to help readers grasp how chunk lists are illustrations are provided to help readers grasp how chunk lists are
skipping to change at page 50, line 16 skipping to change at page 51, line 34
Charles Lever (editor) Charles Lever (editor)
Oracle Corporation Oracle Corporation
1015 Granger Avenue 1015 Granger Avenue
Ann Arbor, MI 48104 Ann Arbor, MI 48104
USA USA
Phone: +1 734 274 2396 Phone: +1 734 274 2396
Email: chuck.lever@oracle.com Email: chuck.lever@oracle.com
William Allen Simpson
DayDreamer
1384 Fontaine
Madison Heights, MI 48071
USA
Email: william.allen.simpson@gmail.com
Tom Talpey Tom Talpey
Microsoft Corp. Microsoft Corp.
One Microsoft Way One Microsoft Way
Redmond, WA 98052 Redmond, WA 98052
USA USA
Phone: +1 425 704-9945 Phone: +1 425 704-9945
Email: ttalpey@microsoft.com Email: ttalpey@microsoft.com
 End of changes. 84 change blocks. 
302 lines changed or deleted 354 lines changed or added
