draft-ietf-nfsv4-rfc5667bis-00.txt   draft-ietf-nfsv4-rfc5667bis-01.txt 
Network File System Version 4 C. Lever, Ed. Network File System Version 4 C. Lever, Ed.
Internet-Draft Oracle Internet-Draft Oracle
Obsoletes: 5667 (if approved) June 13, 2016 Obsoletes: 5667 (if approved) June 30, 2016
Intended status: Standards Track Intended status: Standards Track
Expires: December 15, 2016 Expires: January 1, 2017
Network File System (NFS) Direct Data Placement Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA
draft-ietf-nfsv4-rfc5667bis-00 draft-ietf-nfsv4-rfc5667bis-01
Abstract Abstract
This document defines the bindings of the various Network File System This document specifies the Upper Layer Bindings of Network File
(NFS) versions to the Remote Direct Memory Access (RDMA) operations System (NFS) protocol versions to RPC-over-RDMA transports. Such
supported by the RPC-over-RDMA transport protocol. It describes the Upper Layer Bindings are required to enable RPC-based protocols to
use of direct data placement by means of server-initiated RDMA use direct data placement when conveying large data payloads on RPC-
operations into client-supplied buffers for implementations of NFS over-RDMA transports. This document obsoletes RFC 5667.
versions 2, 3, 4, and 4.1 over such an RDMA transport. This document
obsoletes RFC 5667.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 15, 2016. This Internet-Draft will expire on January 1, 2017.
Copyright Notice Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 2 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
1.2. Planned Changes To This Document . . . . . . . . . . . . 2 1.2. Changes Since RFC 5667 . . . . . . . . . . . . . . . . . 3
2. Transfers from NFS Client to NFS Server . . . . . . . . . . . 3 1.3. Planned Changes To This Document . . . . . . . . . . . . 4
3. Transfers from NFS Server to NFS Client . . . . . . . . . . . 3 2. Conveying NFS Operations On RPC-Over-RDMA Transports . . . . 4
4. NFS Versions 2 and 3 Mapping . . . . . . . . . . . . . . . . 5 2.1. Use Of The Read List . . . . . . . . . . . . . . . . . . 4
5. NFS Version 4 Mapping . . . . . . . . . . . . . . . . . . . . 6 2.2. Use Of The Write List . . . . . . . . . . . . . . . . . . 5
5.1. NFS Version 4 Callbacks . . . . . . . . . . . . . . . . . 8 2.3. Construction Of Individual Chunks . . . . . . . . . . . . 5
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 2.4. Use Of Long Calls And Replies . . . . . . . . . . . . . . 5
7. Security Considerations . . . . . . . . . . . . . . . . . . . 9 3. NFS Versions 2 And 3 Upper Layer Binding . . . . . . . . . . 5
8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 9 4. NFS Version 4 Upper Layer Binding . . . . . . . . . . . . . . 6
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.1. NFS Version 4 COMPOUND Considerations . . . . . . . . . . 7
9.1. Normative References . . . . . . . . . . . . . . . . . . 10 4.2. NFS Version 4 Callbacks . . . . . . . . . . . . . . . . . 8
9.2. Informative References . . . . . . . . . . . . . . . . . 10 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8
6. Security Considerations . . . . . . . . . . . . . . . . . . . 9
7. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 9
8. References . . . . . . . . . . . . . . . . . . . . . . . . . 9
8.1. Normative References . . . . . . . . . . . . . . . . . . 9
8.2. Informative References . . . . . . . . . . . . . . . . . 10
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 11 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 11
1. Introduction 1. Introduction
The Remote Direct Memory Access (RDMA) Transport for Remote Procedure Remote Direct Memory Access Transport for Remote Procedure Call,
Call (RPC) [I-D.ietf-nfsv4-rfc5666bis] allows an RPC client Version One [I-D.ietf-nfsv4-rfc5666bis] (RPC-over-RDMA) enables the
application to post buffers in a Chunk list for specific arguments use of direct data placement to accelerate the transmission of large
and results from an RPC call. The RDMA transport header conveys this data payloads associated with RPC transactions.
list of client buffer addresses to the server where the application
can associate them with client data and use RDMA operations to Each RPC-over-RDMA transport header can convey lists of memory
transfer the results directly to and from the posted buffers on the locations involved in direct transfers of data payloads. These
client. The client and server must agree on a consistent mapping of memory locations correspond to XDR data items defined in an Upper
posted buffers to RPC. This document details the mapping for each Layer Protocol (such as NFS).
version of the NFS protocol [RFC1094] [RFC1813] [RFC7530] [RFC5661].
To facilitate interoperation, RPC client and server implementations
must agree on what XDR data items in which RPC procedures are
eligible for direct data placement (DDP).
This document specifies the set of XDR data items in each of the
following NFS protocol versions that are eligible for DDP. It also
contains additional material required of Upper Layer Bindings as
specified in [I-D.ietf-nfsv4-rfc5666bis].
o NFS Version 2 [RFC1094]
o NFS Version 3 [RFC1813]
o NFS Version 4.0 [RFC7530]
o NFS Version 4.1 [RFC5661]
o NFS Version 4.2 [I-D.ietf-nfsv4-minorversion2]
The Upper Layer Binding specified in this document can be extended to
cover the addition of new DDP-eligible XDR data items defined by
versions of the NFS version 4 protocol specified after this document
has been ratified.
1.1. Requirements Language 1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
1.2. Planned Changes To This Document 1.2. Changes Since RFC 5667
The following changes will be made, relative to [RFC5667]: Corrections and updates made necessary by new language in
[I-D.ietf-nfsv4-rfc5666bis] have been introduced. For example,
references to deprecated features of RPC-over-RDMA Version One, such
as RDMA_MSGP, and the use of the Read list for handling RPC replies,
have been removed. The term "mapping" has been replaced with the term
"binding" or "Upper Layer Binding" throughout the document. Material
that duplicates what is in [I-D.ietf-nfsv4-rfc5666bis] has been
deleted.
o References to [RFC5666] will be replaced with references to Material required by [I-D.ietf-nfsv4-rfc5666bis] for Upper Layer
[I-D.ietf-nfsv4-rfc5666bis]. Corrections and updates relative to Bindings that was not present in [RFC5667] has been added, including
new language in [I-D.ietf-nfsv4-rfc5666bis] will be introduced. discussion of how each NFS version properly estimates the maximum
size of RPC replies.
o References to obsolete RFCs will be replaced. The following changes have been made, relative to [RFC5667]:
o The reference to a non-existent NFSv4 SYMLINK operation will be o Ambiguous or erroneous uses of RFC2119 terms have been corrected.
replaced with NFSv4 CREATE(NF4LNK).
o The discussion of 12KB and 36KB inline threshold will be removed. o References to specific data movement mechanisms have been made
generic or removed.
o The discussion of NFSv4 COMPOUND handling will be completed. o References to obsolete RFCs have been replaced.
o Technical corrections have been made. For example, the mention of
12KB and 36KB inline thresholds has been removed. The reference
to a non-existent NFS version 4 SYMLINK operation has been
replaced with NFS version 4 CREATE(NF4LNK).
o An IANA Considerations Section has replaced the "Port Usage
Considerations" Section.
o Code excerpts have been removed, and figures have been modernized.
o Language inconsistent with or contradictory to
[I-D.ietf-nfsv4-rfc5666bis] has been removed from Sections 2 and
3, and both Sections have been combined into Section 2 in the
present document.
o An explicit discussion of NFSv4.0 and NFSv4.1 backchannel o An explicit discussion of NFSv4.0 and NFSv4.1 backchannel
operation will be introduced. operation will replace the previous treatment of callback
operations. No NFSv4.x callback operation is DDP-eligible.
o An IANA Considerations section is required by IDNITS. o The binding for NFSv4.1 has been completed. No additional DDP-
eligible operations exist in NFSv4.1.
o Code excerpts will be modernized. o A binding for NFSv4.2 has been added that includes discussion of
new data-bearing operations like READ_PLUS.
Other minor changes and editorial corrections may also be made. 1.3. Planned Changes To This Document
2. Transfers from NFS Client to NFS Server The following changes are planned, relative to [RFC5667]:
The RDMA Read list, in the RDMA transport header, allows an RPC o The discussion of NFS version 4 COMPOUND handling will be
client to marshal RPC call data selectively. Large chunks of data, completed.
such as the file data of an NFS WRITE request, MAY be referenced by
an RDMA Read list and be moved efficiently and directly placed by an
RDMA Read operation initiated by the server.
The process of identifying these chunks for the RDMA Read list can be o Remarks about handling DDP-eligibility violations will be
implemented entirely within the RPC layer. It is transparent to the introduced.
upper-level protocol, such as NFS. For instance, the file data
portion of an NFS WRITE request can be selected as an RDMA "chunk"
within the eXternal Data Representation (XDR) marshaling code of RPC
based on a size criterion, independently of the NFS protocol layer.
The XDR unmarshaling on the receiving system can identify the
correspondence between Read chunks and protocol elements via the XDR
position value encoded in the Read chunk entry.
RPC RDMA Read chunks are employed by this NFS mapping to convey o A discussion of how the NFS binding to RPC-over-RDMA is extended
specific NFS data to the server in a manner that may be directly by standards action will be added.
placed. The following sections describe this mapping for versions of
the NFS protocol.
3. Transfers from NFS Server to NFS Client 2. Conveying NFS Operations On RPC-Over-RDMA Transports
The RDMA Write list, in the RDMA transport header, allows the client Definitions of terminology and a general discussion of how RPC-over-
to post one or more buffers into which the server will RDMA Write RDMA is used to convey RPC transactions can be found in
designated result chunks directly. If the client sends a null Write [I-D.ietf-nfsv4-rfc5666bis]. In this section, these general
list, then results from the RPC call will be returned either as an principles are applied to the specifics of the NFS protocol.
inline reply, as chunks in an RDMA Read list of server-posted
buffers, or in a client-posted reply buffer.
Each posted buffer in a Write list is represented as an array of 2.1. Use Of The Read List
memory segments. This allows the client some flexibility in
submitting discontiguous memory segments into which the server will
scatter the result. Each segment is described by a triplet
consisting of the segment handle or steering tag (STag), segment
length, and memory address or offset.
<CODE BEGINS> The Read list in each RPC-over-RDMA transport header represents a set
of memory regions containing DDP-eligible NFS argument data. Large
data items, such as the file data payload of an NFS WRITE request,
are referenced by the Read list and placed directly into server
memory.
struct xdr_rdma_segment { XDR unmarshaling code on the NFS server identifies the correspondence
uint32 handle; /* Registered memory handle */ between Read chunks and particular NFS arguments via the chunk
uint32 length; /* Length of the chunk in bytes */ Position value encoded in each Read chunk.
uint64 offset; /* Chunk virtual address or offset */
};
struct xdr_write_chunk { 2.2. Use Of The Write List
struct xdr_rdma_segment target<>;
};
struct xdr_write_list { The Write list in each RPC-over-RDMA transport header represents a
struct xdr_write_chunk entry; set of memory regions that can receive DDP-eligible NFS result data.
struct xdr_write_list *next; Large data items such as the payload of an NFS READ request are
}; referenced by the Write list and placed directly into client memory.
<CODE ENDS> Each Write chunk corresponds to a specific XDR data item in an NFS
reply. This document specifies how NFS client and server
implementations identify the correspondence between Write chunks and
each XDR result.
The sum of the segment lengths yields the total size of the buffer, 2.3. Construction Of Individual Chunks
which MUST be large enough to accept the result. If the buffer is
too small, the server MUST return an XDR encode error. The server
MUST return the result data for a posted buffer by progressively
filling its segments, perhaps leaving some trailing segments unfilled
or partially full if the size of the result is less than the total
size of the buffer segments.
The server returns the RDMA Write list to the client with the segment Each Read chunk is represented as a list of segments at the same XDR
length fields overwritten to indicate the amount of data RDMA written Position, and each Write chunk is represented as an array of
to each segment. Results returned by direct placement MUST NOT be segments. An NFS client thus has the flexibility to advertise a set
returned by other methods, e.g., by Read chunk list or inline. If no of discontiguous memory regions in which to send or receive a single
result data at all is returned for the element, the server places no DDP-eligible data item.
data in the buffer(s), but does return zeros in the segment length
fields corresponding to the result.
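The progressive filling of segments and the rewriting of segment lengths described above can be sketched as follows. This is a minimal illustration, not an implementation from either draft: the names Segment and fill_write_chunk are hypothetical, and the required XDR encode error is modeled as a Python exception purely for demonstration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    handle: int   # registered memory handle (STag)
    length: int   # capacity in bytes as posted; bytes written as returned
    offset: int   # virtual address or offset


def fill_write_chunk(segments: List[Segment], result_len: int) -> List[Segment]:
    """Return the segment array as a server might send it back.

    Hypothetical helper: fills segments in order and overwrites each
    length with the number of bytes placed in that segment.
    """
    capacity = sum(seg.length for seg in segments)
    if result_len > capacity:
        # Stand-in for the XDR encode error the text requires.
        raise ValueError("posted buffer is too small for the result")

    remaining = result_len
    returned = []
    for seg in segments:
        used = min(seg.length, remaining)
        returned.append(Segment(seg.handle, used, seg.offset))
        remaining -= used
    return returned
```

With three 4096-byte segments and a 5000-byte result, the first segment is returned full, the second partially full, and the third with a length of zero. A result of zero bytes returns zeros in every segment length field.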
The RDMA Write list allows the client to provide multiple result 2.4. Use Of Long Calls And Replies
buffers -- each buffer maps to a specific result in the reply. The
NFS client and server implementations agree by specifying the mapping
of results to buffers for each RPC procedure. The following sections
describe this mapping for versions of the NFS protocol.
Through the use of RDMA Write lists in NFS requests, it is not Small RPC messages are conveyed using RDMA Send operations which are
necessary to employ the RDMA Read lists in the NFS replies, as of limited size. If an NFS request is too large to be conveyed via
described in the RPC-over-RDMA protocol. This enables more efficient an RDMA Send, and there are no DDP-eligible data items that can be
operation, by avoiding the need for the server to expose buffers for removed, an NFS client must send the request using a Long Call. The
RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY entire NFS request is sent in a special Read chunk.
additionally employ RDMA Reply chunks to receive entire messages, as
described in [I-D.ietf-nfsv4-rfc5666bis].
4. NFS Versions 2 and 3 Mapping If a client expects that an NFS reply will be too large to be
conveyed via an RDMA Send, it provides a Reply chunk in the RPC-over-
RDMA transport header conveying the NFS request. The server can
place the entire NFS reply in the Reply chunk.
A single RDMA Write list entry MAY be posted by the client to receive These are described in more detail in [I-D.ietf-nfsv4-rfc5666bis].
either the opaque file data from a READ request or the pathname from
a READLINK request. The server MUST ignore a Write list for any
other NFS procedure, as well as any Write list entries beyond the
first in the list.
Similarly, a single RDMA Read list entry MAY be posted by the client 3. NFS Versions 2 And 3 Upper Layer Binding
to supply the opaque file data for a WRITE request or the pathname
for a SYMLINK request. The server MUST ignore any Read list for
other NFS procedures, as well as additional Read list entries beyond
the first in the list.
Because there are no NFS version 2 or 3 requests that transfer bulk An NFS client MAY send a single Read chunk to supply opaque file data
data in both directions, it is not necessary to post requests for an NFS WRITE procedure, or the pathname for an NFS SYMLINK
containing both Write and Read lists. Any unneeded Read or Write procedure. For all other NFS procedures, the server MUST ignore Read
lists are ignored by the server. chunks that have a non-zero value in their Position fields, and Read
chunks beyond the first in the Read list.
In the case where the outgoing request or expected incoming reply is Similarly, an NFS client MAY provide a single Write chunk to receive
larger than the maximum size supported on the connection, it is either opaque file data from an NFS READ procedure, or the pathname
possible for the RPC layer to post the entire message or result in a from an NFS READLINK procedure. The server MUST ignore the Write
special "RDMA_NOMSG" message type that is transferred entirely by list for any other NFS procedure, and any Write chunks beyond the
RDMA. This is implemented in RPC, below NFS, and therefore has no first in the Write list.
effect on the message contents.
Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the There are no NFS version 2 or 3 procedures that have DDP-eligible
"RDMA_MSGP" padding method described in the RPC-over-RDMA protocol, data items in both their Call and Reply. However, if an NFS client
if the appropriate value for the server is known to the client. is sending a Long Call or Reply, it MAY provide a combination of Read
Padding allows the opaque file data to arrive at the server in an list, Write list, and/or a Reply chunk in the same transaction.
aligned fashion, which may improve server performance.
The NFS version 2 and 3 protocols are frequently limited in practice NFS clients already successfully estimate the maximum reply size of
to requests containing less than or equal to 8 kilobytes and 32 each operation in order to provide an adequate set of buffers to
kilobytes of data, respectively. In these cases, it is often receive each NFS reply. An NFS client provides a Reply chunk when
practical to support basic operation without employing a the maximum possible reply size is larger than the client's responder
configuration exchange as discussed in [I-D.ietf-nfsv4-rfc5666bis]. inline threshold.
The server MUST post buffers large enough to receive the largest
possible incoming message (approximately 12 KB for NFS version 2, or
36 KB for NFS version 3, would be vastly sufficient), and the client
can post buffers large enough to receive replies based on the "rsize"
it is using to the server, plus a fixed overhead for the RPC and NFS
headers. Because the server MUST NOT return data in excess of this
size, the client can be assured of the adequacy of its posted buffer
sizes.
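The reply size reasoning above can be sketched as a small calculation. This is an illustration under stated assumptions: the function names, the 512-byte header overhead, and the 4096-byte inline threshold are all hypothetical values chosen for the example, not figures taken from either draft.

```python
def max_inline_reply_size(fixed_overhead: int, payload_limit: int,
                          payload_in_write_chunk: bool) -> int:
    """Upper bound on the inline portion of a reply (hypothetical helper).

    A payload returned by direct placement is not also returned inline,
    so only the RPC and NFS header overhead remains in that case.
    """
    if payload_in_write_chunk:
        return fixed_overhead
    return fixed_overhead + payload_limit


def needs_reply_chunk(max_reply_size: int, inline_threshold: int) -> bool:
    """A Reply chunk is provided when the largest possible reply
    exceeds the inline threshold."""
    return max_reply_size > inline_threshold


# Illustrative values only: 32 KB rsize, 512 bytes of headers.
worst_case = max_inline_reply_size(512, 32 * 1024, payload_in_write_chunk=False)
print(worst_case, needs_reply_chunk(worst_case, 4096))
```

When the same READ payload is directed to a Write chunk instead, the inline bound drops to the header overhead and no Reply chunk is needed under these example values.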
Flow control is handled dynamically by the RPC RDMA protocol, and How does the server respond if the client has not provided enough
write padding is OPTIONAL and therefore MAY remain unused. Write list resources to handle an NFS READ or READLINK reply? How
does the server respond if the client has not provided enough Reply
chunk resources to handle an NFS reply?
Alternatively, if the server is administratively configured to values 4. NFS Version 4 Upper Layer Binding
appropriate for all its clients, the same assurance of
interoperability within the domain can be made.
The use of a configuration protocol with NFS v2 and v3 is therefore This specification applies to NFS Version 4.0 [RFC7530], NFS Version
OPTIONAL. Employing a configuration exchange may allow some 4.1 [RFC5661], and NFS Version 4.2 [I-D.ietf-nfsv4-minorversion2].
advantage to server resource management through accurately sizing It also applies to the callback protocols associated with each of
buffers, enabling the server to know exactly how many RDMA Reads may these minor versions.
be in progress at once on the client connection, and enabling client
write padding, which may be desirable for certain servers when RDMA
Read is impractical.
5. NFS Version 4 Mapping An NFS client MAY send a Read chunk to supply opaque file data for a
WRITE operation or the pathname for a CREATE(NF4LNK) operation in an
NFS version 4 COMPOUND procedure. An NFS client MUST NOT send a Read
chunk that corresponds with any other XDR data item in any other NFS
version 4 operation.
This specification applies to the first minor version of NFS version Similarly, an NFS client MAY provide a Write chunk to receive either
4 (NFSv4.0) and any subsequent minor versions that do not override opaque file data from a READ operation, NFS4_CONTENT_DATA from a
this mapping. READ_PLUS operation, or the pathname from a READLINK operation in an
NFS version 4 COMPOUND procedure. An NFS client MUST NOT provide a
Write chunk that corresponds with any other XDR data item in any
other NFS version 4 operation.
The Write list MUST be considered only for the COMPOUND procedure. There is no prohibition against an NFS version 4 COMPOUND procedure
This procedure returns results from a sequence of operations. Only constructed with both a READ and WRITE operation, say. Thus it is
the opaque file data from an NFS READ operation and the pathname from possible for NFS version 4 COMPOUND procedures to use both the Read
a READLINK operation MUST utilize entries from the Write list. list and Write list simultaneously. An NFS client MAY provide a Read
list and a Write list in the same transaction if it is sending a Long
Call or Reply.
If there is no Write list, i.e., the list is null, then any READ or Some remarks need to be made about how NFS version 4 clients estimate
READLINK operations in the COMPOUND MUST return their data inline. reply size, and how DDP-eligibility violations are reported.
The NFSv4.0 client MUST ensure in this case that any result of its
READ and READLINK requests will fit within its receive buffers, in
order to avoid a resulting RDMA transport error upon transfer. The
server is not required to detect this.
The first entry in the Write list MUST be used by the first READ or 4.1. NFS Version 4 COMPOUND Considerations
READLINK in the COMPOUND request. The next Write list entry is used
by the next READ or READLINK, and so on. If there are more READ or
READLINK operations than Write list entries, then any remaining
operations MUST return their results inline.
If a Write list entry is presented, then the corresponding READ or An NFS version 4 COMPOUND procedure supplies arguments for a sequence
READLINK MUST return its data via an RDMA Write to the buffer of operations, and returns results from that sequence. A client MAY
indicated by the Write list entry. If the Write list entry has zero construct an NFS version 4 COMPOUND procedure that uses more than one
RDMA segments, or if the total size of the segments is zero, then the chunk in either the Read list or Write list. The NFS client provides
corresponding READ or READLINK operation MUST return its result XDR Position values in each Read chunk to disambiguate which chunk is
inline. associated with which XDR data item.
The following example shows an RDMA Write list with three posted However, NFS server and client implementations must agree in advance
buffers A, B, and C. The designated operations in the compound on how to pair Write chunks with returned result data items. The
request, READ and READLINK, consume the posted buffers by writing mechanism specified in [I-D.ietf-nfsv4-rfc5666bis] is applied here:
their results back to each buffer.
RDMA Write list: o The first chunk in the Write list MUST be used by the first READ
or READLINK operation in an NFS version 4 COMPOUND procedure. The
next Write chunk is used by the next READ or READLINK, and so on.
o If there are more READ or READLINK operations than Write chunks,
then any remaining operations MUST return their results inline.
o If an NFS client presents a Write chunk, then the corresponding
READ or READLINK operation MUST return its data by placing data
into that chunk.
o If the Write chunk has zero RDMA segments, or if the total size of
the segments is zero, then the corresponding READ or READLINK
operation MUST return its result inline.
The following example shows a Write list with three Write chunks, A,
B, and C. The server consumes the provided Write chunks by writing
the results of the designated operations in the compound request,
READ and READLINK, back to each chunk.
Write list:
A --> B --> C A --> B --> C
Compound request: NFS version 4 COMPOUND request:
PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
| | | | | |
v v v v v v
A B C A B C
If the client does not want to have the READLINK result returned If the client does not want to have the READLINK result returned
directly, then it provides a zero-length array of segment triplets directly, it provides a zero-length array of segment triplets for
for buffer B or sets the values in the segment triplet for buffer B buffer B or sets the values in the segment triplet for buffer B to
to zeros so that the READLINK result MUST be returned inline. zeros to indicate that the READLINK result must be returned inline.
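The pairing of Write list entries with READ and READLINK operations can be sketched as below. This is a minimal illustration, assuming a COMPOUND is modeled as a list of operation names and each Write list entry as a list of segment lengths; the function name pair_write_chunks and this representation are hypothetical.

```python
from typing import List, Optional, Tuple


def pair_write_chunks(ops: List[str],
                      write_list: List[List[int]]) -> List[Tuple[int, Optional[int]]]:
    """Pair each READ or READLINK with the next Write list entry.

    Returns (operation index, Write list index) pairs. The second
    element is None when the result is to be returned inline: either
    the Write list is exhausted, or the matching entry has no segments
    or a total segment size of zero.
    """
    pairing = []
    next_entry = 0
    for op_index, op in enumerate(ops):
        if op not in ("READ", "READLINK"):
            continue
        if next_entry >= len(write_list):
            pairing.append((op_index, None))   # more operations than entries
            continue
        segments = write_list[next_entry]
        if sum(segments) == 0:                 # empty or all-zero entry
            pairing.append((op_index, None))
        else:
            pairing.append((op_index, next_entry))
        next_entry += 1                        # the entry is consumed either way
    return pairing


compound = ["PUTFH", "LOOKUP", "READ",
            "PUTFH", "LOOKUP", "READLINK",
            "PUTFH", "LOOKUP", "READ"]
# Entries A, B, C; B is empty so the READLINK result comes back inline.
print(pair_write_chunks(compound, [[4096], [], [8192]]))
```

Note that an empty entry B still occupies its position in the list, so the final READ is paired with entry C rather than with B.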
The situation is similar for RDMA Read lists sent by the client and
applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
Additionally, inline segments too large to fit in posted buffers MAY
be transferred in special "RDMA_NOMSG" messages.
Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the
"RDMA_MSGP" padding method described in the RPC-over-RDMA protocol,
if the appropriate value for the server is known to the client.
Padding allows the opaque file data to arrive at the server in an
aligned fashion, which may improve server performance. In order to
ensure accurate alignment for all data, it is likely that the client
will restrict its use of OPTIONAL padding to COMPOUND requests
containing only a single WRITE operation.
Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
COMPOUND is not bounded, even when RDMA chunks are in use. While it COMPOUND is not bounded. However, typical NFS version 4 clients
might appear that a configuration protocol exchange (such as the one rarely issue such problematic requests. In practice, NFS version 4
described in [I-D.ietf-nfsv4-rfc5666bis]) would help, in fact the clients behave in much more predictable ways. Rsize and wsize apply
layering issues involved in building COMPOUNDs by NFS make such a to COMPOUND operations by capping the total amount of data payload
mechanism unworkable. allowed in each COMPOUND. An extension to NFS version 4 supporting a
comprehensive exchange of upper-layer message size parameters is part
However, typical NFS version 4 clients rarely issue such problematic of [RFC5661].
requests. In practice, they behave in much more predictable ways, in
fact most still support the traditional rsize/wsize mount parameters.
Therefore, most NFS version 4 clients function over RPC-over-RDMA in
the same way as NFS versions 2 and 3, operationally.
There are however advantages to allowing both client and server to
operate with prearranged size constraints, for example, use of the
sizes to better manage the server's response cache. An extension to
NFS version 4 supporting a more comprehensive exchange of upper-layer
parameters is part of [RFC5661].
5.1. NFS Version 4 Callbacks 4.2. NFS Version 4 Callbacks
The NFS version 4 protocols support server-initiated callbacks to The NFS version 4 protocols support server-initiated callbacks to
selected clients, in order to notify them of events such as recalled notify clients of events such as recalled delegations. There are no
delegations, etc. These callbacks present no particular issue to DDP-eligible data items in callback protocols associated with
being framed over RPC-over-RDMA since such callbacks do not carry NFSv4.0, NFSv4.1, or NFSv4.2.
bulk data such as NFS READ or NFS WRITE. They MAY be transmitted
inline via RDMA_MSG, or if the callback message or its reply overflow
the negotiated buffer sizes for a callback connection, they MAY be
transferred via the RDMA_NOMSG method as described above for other
exchanges.
One special case is noteworthy: in NFS version 4.1, the callback
channel is optionally negotiated to be on the same connection as one
used for client requests. In this case, and because the transaction
ID (XID) is present in the RPC-over-RDMA header, the client MUST
ascertain whether the message is in fact an RPC REPLY, and therefore
a reply to a prior request and carrying its XID, before processing it
as such. By the same token, the server MUST ascertain whether an
incoming message on such a callback-eligible connection is an RPC
CALL, before optionally processing the XID.
In the callback case, the XID present in the RPC-over-RDMA header In NFS version 4.1 and 4.2, callback operations may appear on the
will potentially have any value, which may (or may not) collide with same connection as one used for NFS version 4 client requests. To
an XID used by the client for a previous or future request. The operate on RPC-over-RDMA transports, NFS version 4 clients and
client and server MUST inspect the RPC component of the message to servers MUST use the mechanism described in
determine its potential disposition as either an RPC CALL or RPC [I-D.ietf-nfsv4-rpcrdma-bidirection].
REPLY, prior to processing this XID, and MUST NOT reject or accept it
without also determining the proper context.
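The CALL/REPLY check described above can be sketched as follows. This is a minimal illustration, not the mechanism defined in [I-D.ietf-nfsv4-rpcrdma-bidirection]. It assumes the receiver already has the leading bytes of the XDR-encoded RPC message itself (not the RPC-over-RDMA transport header), and it relies on the RFC 5531 layout in which the XID is followed by a msg_type field (CALL = 0, REPLY = 1). The names classify_rpc_message and dispatch, and the returned strings, are hypothetical.

```python
import struct
from typing import Set, Tuple

RPC_CALL = 0   # msg_type values defined by RFC 5531
RPC_REPLY = 1


def classify_rpc_message(rpc_message: bytes) -> Tuple[int, str]:
    """Return (xid, direction) from the first 8 bytes of an RPC message."""
    if len(rpc_message) < 8:
        raise ValueError("need at least 8 bytes: xid + msg_type")
    # XDR encodes both fields as big-endian unsigned 32-bit integers.
    xid, msg_type = struct.unpack(">II", rpc_message[:8])
    if msg_type == RPC_CALL:
        return xid, "call"
    if msg_type == RPC_REPLY:
        return xid, "reply"
    raise ValueError(f"unknown msg_type {msg_type}")


def dispatch(rpc_message: bytes, pending_xids: Set[int]) -> str:
    """Decide how to handle a message on a connection shared with callbacks."""
    xid, direction = classify_rpc_message(rpc_message)
    if direction == "reply":
        # Only a REPLY may be matched against outstanding requests.
        return "complete-request" if xid in pending_xids else "drop-unmatched-reply"
    # A CALL is a callback request, even if its XID collides with a pending one.
    return "handle-callback"
```

With a pending request whose XID is 0x1234, a callback CALL that happens to carry the same XID is routed to the callback handler rather than being mistaken for the reply, because the direction is determined before the XID is consulted.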
6. IANA Considerations 5. IANA Considerations
NFS use of direct data placement introduces a need for an additional NFS use of direct data placement introduces a need for an additional
NFS port number assignment for networks that share traditional UDP NFS port number assignment for networks that share traditional UDP
and TCP port spaces with RDMA services. The iWARP [RFC5041] and TCP port spaces with RDMA services. The iWARP [RFC5041]
[RFC5040] protocol is such an example (InfiniBand is not). [RFC5040] protocol is such an example (InfiniBand is not).
NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
listen for clients on UDP and TCP port 2049, and additionally, they listen for clients on UDP and TCP port 2049, and additionally, they
register these with the portmapper and/or rpcbind [RFC1833] service. register these with the portmapper and/or rpcbind [RFC1833] service.
However, [RFC7530] requires NFS servers for version 4 to listen on However, [RFC7530] requires NFS servers for version 4 to listen on
skipping to change at page 9, line 33 skipping to change at page 9, line 13
in [I-D.ietf-nfsv4-rfc5666bis]. in [I-D.ietf-nfsv4-rfc5666bis].
An NFS version 4 server supporting RPC-over-RDMA on such a network An NFS version 4 server supporting RPC-over-RDMA on such a network
MUST use the alternative well-known port number for its RPC-over-RDMA MUST use the alternative well-known port number for its RPC-over-RDMA
service. Clients SHOULD connect to this well-known port without service. Clients SHOULD connect to this well-known port without
consulting the RPC portmapper (as for NFSv4/TCP). consulting the RPC portmapper (as for NFSv4/TCP).
The port number assigned to an NFS service over an RPC-over-RDMA The port number assigned to an NFS service over an RPC-over-RDMA
transport is available from the IANA port registry [RFC3232]. transport is available from the IANA port registry [RFC3232].
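As a rough sketch of the port selection rule above for NFS version 4 clients: the helper below returns the endpoint to connect to without an rpcbind query. The value 20049 is an assumption taken from the "nfsrdma" entry in the IANA port registry and is not stated in this section; the function name, parameters, and transport labels are hypothetical. Cases that this section does not spell out are rejected rather than guessed.

```python
from typing import Tuple

NFS_PORT = 2049        # port named in the text above
NFS_RDMA_PORT = 20049  # assumption: the "nfsrdma" entry in the IANA registry


def nfsv4_connect_target(host: str, transport: str,
                         shares_ip_port_space: bool) -> Tuple[str, int, bool]:
    """Return (host, port, use_rpcbind) for an NFS version 4 client."""
    if transport == "tcp":
        # NFSv4 over TCP uses the well-known port directly.
        return (host, NFS_PORT, False)
    if transport == "rdma" and shares_ip_port_space:
        # For example iWARP: RDMA shares the TCP port space, so the
        # alternative well-known port is used, again without rpcbind.
        return (host, NFS_RDMA_PORT, False)
    raise ValueError("case not covered by the text above")
```

For instance, nfsv4_connect_target("server", "rdma", True) yields ("server", 20049, False) under the stated assumption about the registry value.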
7. Security Considerations 6. Security Considerations
The RDMA transport for RPC [I-D.ietf-nfsv4-rfc5666bis] supports all The RDMA transport for RPC [I-D.ietf-nfsv4-rfc5666bis] supports all
RPC [RFC5531] security models, including RPCSEC_GSS [RFC2203] RPC [RFC5531] security models, including RPCSEC_GSS [RFC2203]
security and link-level security. The choice of RDMA Read and RDMA security and transport-level security. The choice of RDMA Read and
Write to return RPC argument and results, respectively, does not RDMA Write to convey RPC argument and results does not affect this,
affect this, since it only changes the method of data transfer. since it only changes the method of data transfer. Specifically, the
Specifically, the requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure requirements of [I-D.ietf-nfsv4-rfc5666bis] ensure that this choice
that this choice does not introduce new vulnerabilities. does not introduce new vulnerabilities.
Because this document defines only the binding of the NFS protocols Because this document defines only the binding of the NFS protocols
atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security atop [I-D.ietf-nfsv4-rfc5666bis], all relevant security
considerations are therefore to be described at that layer. considerations are therefore to be described at that layer.
8. Acknowledgments 7. Acknowledgments
The author gratefully acknowledges the work of Brent Callaghan and The author gratefully acknowledges the work of Brent Callaghan and
Tom Talpey on the original NFS Direct Data Placement specification Tom Talpey on the original NFS Direct Data Placement specification
[RFC5667]. The author also wishes to thank Bill Baker and Greg [RFC5667]. The author also wishes to thank Bill Baker and Greg
Marsden for their support of this work. Marsden for their support of this work.
9. References Dave Noveck provided excellent review, constructive suggestions, and
consistent navigational guidance throughout the process of drafting
this document.
9.1. Normative References Special thanks go to nfsv4 Working Group Chair Spencer Shepler and
nfsv4 Working Group Secretary Thomas Haynes for their support.
8. References
8.1. Normative References
[I-D.ietf-nfsv4-minorversion2]
Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf-
nfsv4-minorversion2-41 (work in progress), January 2016.
[I-D.ietf-nfsv4-rfc5666bis]
Lever, C., Simpson, W., and T. Talpey, "Remote Direct
Memory Access Transport for Remote Procedure Call, Version
One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress),
May 2016.
[I-D.ietf-nfsv4-rpcrdma-bidirection]
Lever, C., "Bi-directional Remote Procedure Call On RPC-
over-RDMA Transports", draft-ietf-nfsv4-rpcrdma-
bidirection-05 (work in progress), June 2016.
[RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, DOI 10.17487/RFC1833, August 1995, RFC 1833, DOI 10.17487/RFC1833, August 1995,
<http://www.rfc-editor.org/info/rfc1833>. <http://www.rfc-editor.org/info/rfc1833>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ Requirement Levels", BCP 14, RFC 2119,
RFC2119, March 1997, DOI 10.17487/RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>. <http://www.rfc-editor.org/info/rfc2119>.
[RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, DOI 10.17487/RFC2203, September Specification", RFC 2203, DOI 10.17487/RFC2203, September
1997, <http://www.rfc-editor.org/info/rfc2203>. 1997, <http://www.rfc-editor.org/info/rfc2203>.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
May 2009, <http://www.rfc-editor.org/info/rfc5531>. May 2009, <http://www.rfc-editor.org/info/rfc5531>.
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1 "Network File System (NFS) Version 4 Minor Version 1
Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
<http://www.rfc-editor.org/info/rfc5661>. <http://www.rfc-editor.org/info/rfc5661>.
[RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System
(NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
March 2015, <http://www.rfc-editor.org/info/rfc7530>. March 2015, <http://www.rfc-editor.org/info/rfc7530>.
9.2. Informative References 8.2. Informative References
[I-D.ietf-nfsv4-rfc5666bis]
Lever, C., Simpson, W., and T. Talpey, "Remote Direct
Memory Access Transport for Remote Procedure Call, Version
One", draft-ietf-nfsv4-rfc5666bis-07 (work in progress),
May 2016.
[RFC1094] Nowicki, B., "NFS: Network File System Protocol [RFC1094] Nowicki, B., "NFS: Network File System Protocol
specification", RFC 1094, DOI 10.17487/RFC1094, March specification", RFC 1094, DOI 10.17487/RFC1094, March
1989, <http://www.rfc-editor.org/info/rfc1094>. 1989, <http://www.rfc-editor.org/info/rfc1094>.
[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
Version 3 Protocol Specification", RFC 1813, DOI 10.17487/ Version 3 Protocol Specification", RFC 1813,
RFC1813, June 1995, DOI 10.17487/RFC1813, June 1995,
<http://www.rfc-editor.org/info/rfc1813>. <http://www.rfc-editor.org/info/rfc1813>.
[RFC3232] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced [RFC3232] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced
by an On-line Database", RFC 3232, DOI 10.17487/RFC3232, by an On-line Database", RFC 3232, DOI 10.17487/RFC3232,
January 2002, <http://www.rfc-editor.org/info/rfc3232>. January 2002, <http://www.rfc-editor.org/info/rfc3232>.
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <http://www.rfc-editor.org/info/rfc5040>. 2007, <http://www.rfc-editor.org/info/rfc5040>.
[RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
Data Placement over Reliable Transports", RFC 5041, DOI Data Placement over Reliable Transports", RFC 5041,
10.17487/RFC5041, October 2007, DOI 10.17487/RFC5041, October 2007,
<http://www.rfc-editor.org/info/rfc5041>. <http://www.rfc-editor.org/info/rfc5041>.
[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
Transport for Remote Procedure Call", RFC 5666, DOI
10.17487/RFC5666, January 2010,
<http://www.rfc-editor.org/info/rfc5666>.
[RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS)
Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
January 2010, <http://www.rfc-editor.org/info/rfc5667>. January 2010, <http://www.rfc-editor.org/info/rfc5667>.
Author's Address Author's Address
Charles Lever (editor) Charles Lever (editor)
Oracle Corporation Oracle Corporation
1015 Granger Avenue 1015 Granger Avenue
Ann Arbor, MI 48104 Ann Arbor, MI 48104
