draft-ietf-nfsv4-rfc5666bis-06.txt | draft-ietf-nfsv4-rfc5666bis-07.txt | |||
---|---|---|---|---|
Network File System Version 4 C. Lever, Ed. | Network File System Version 4 C. Lever, Ed. | |||
Internet-Draft Oracle | Internet-Draft Oracle | |||
Obsoletes: 5666 (if approved) W. Simpson | Obsoletes: 5666 (if approved) W. Simpson | |||
Intended status: Standards Track DayDreamer | Intended status: Standards Track DayDreamer | |||
Expires: November 13, 2016 T. Talpey | Expires: November 28, 2016 T. Talpey | |||
Microsoft | Microsoft | |||
May 12, 2016 | May 27, 2016 | |||
Remote Direct Memory Access Transport for Remote Procedure Call, Version | Remote Direct Memory Access Transport for Remote Procedure Call, Version | |||
One | One | |||
draft-ietf-nfsv4-rfc5666bis-06 | draft-ietf-nfsv4-rfc5666bis-07 | |||
Abstract | Abstract | |||
This document specifies a protocol for conveying Remote Procedure | This document specifies a protocol for conveying Remote Procedure | |||
Call (RPC) messages on physical transports capable of Remote Direct | Call (RPC) messages on physical transports capable of Remote Direct | |||
Memory Access (RDMA). It requires no revision to application RPC | Memory Access (RDMA). It requires no revision to application RPC | |||
protocols or the RPC protocol itself. This document obsoletes RFC | protocols or the RPC protocol itself. This document obsoletes RFC | |||
5666. | 5666. | |||
Status of This Memo | Status of This Memo | |||
skipping to change at page 1, line 38 ¶ | skipping to change at page 1, line 38 ¶ | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on November 13, 2016. | This Internet-Draft will expire on November 28, 2016. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 5, line 14 ¶ | skipping to change at page 5, line 14 ¶ | |||
explanatory text, and support for the RDMA_DONE procedure is no | explanatory text, and support for the RDMA_DONE procedure is no | |||
longer necessary. | longer necessary. | |||
o The specification of RDMA_MSGP in [RFC5666] is not adequate, | o The specification of RDMA_MSGP in [RFC5666] is not adequate, | |||
although some incomplete implementations exist. Even if an | although some incomplete implementations exist. Even if an | |||
adequate specification were provided and an implementation was | adequate specification were provided and an implementation was | |||
produced, benefit for protocols such as NFSv4.0 [RFC7530] is | produced, benefit for protocols such as NFSv4.0 [RFC7530] is | |||
doubtful. Therefore the RDMA_MSGP message type is no longer | doubtful. Therefore the RDMA_MSGP message type is no longer | |||
supported. | supported. | |||
o Technical errors with regard to handling RPC-over-RDMA header | o Technical issues with regard to handling RPC-over-RDMA header | |||
errors have been corrected. | errors have been corrected. | |||
o Specific requirements related to handling XDR round-up and complex | o Specific requirements related to implicit XDR round-up and complex | |||
XDR data types have been added. | XDR data types have been added. | |||
o Explicit guidance is provided for sizing Write chunks, managing | o Explicit guidance is provided related to sizing Write chunks, | |||
multiple chunks in the Write list, and handling unused Write | managing multiple chunks in the Write list, and handling unused | |||
chunks. | Write chunks. | |||
o Clear guidance about Send and Receive buffer size has been added. | o Clear guidance about Send and Receive buffer sizes has been | |||
This enables better decisions about when to provide and use the | introduced. This enables better decisions about when a Reply | |||
Reply chunk. | chunk must be provided. | |||
The protocol version number has not been changed because the protocol | The protocol version number has not been changed because the protocol | |||
specified in this document fully interoperates with implementations | specified in this document fully interoperates with implementations | |||
of the RPC-over-RDMA Version One protocol specified in [RFC5666]. | of the RPC-over-RDMA Version One protocol specified in [RFC5666]. | |||
3. Terminology | 3. Terminology | |||
3.1. Remote Procedure Calls | 3.1. Remote Procedure Calls | |||
This section highlights key elements of the Remote Procedure Call | This section highlights key elements of the Remote Procedure Call | |||
skipping to change at page 13, line 19 ¶ | skipping to change at page 13, line 19 ¶ | |||
receive resources available on requesters with no pending RPC | receive resources available on requesters with no pending RPC | |||
transactions. | transactions. | |||
Certain RDMA implementations may impose additional flow control | Certain RDMA implementations may impose additional flow control | |||
restrictions, such as limits on RDMA Read operations in progress at | restrictions, such as limits on RDMA Read operations in progress at | |||
the responder. Accommodation of such restrictions is considered the | the responder. Accommodation of such restrictions is considered the | |||
responsibility of each RPC-over-RDMA Version One implementation. | responsibility of each RPC-over-RDMA Version One implementation. | |||
4.3.2. Inline Threshold | 4.3.2. Inline Threshold | |||
A receiver's "inline threshold" value is the largest message size (in | An "inline threshold" value is the largest message size (in octets) | |||
octets) that the receiver can accept via an RDMA Receive operation. | that can be conveyed in one direction between peer implementations | |||
Each connection has two inline threshold values, one for each peer | using RDMA Send and Receive. The inline threshold value is the | |||
receiver. | minimum of how large a message the sender can post via an RDMA Send | |||
operation, and how large a message the receiver can accept via an | ||||
RDMA Receive operation. Each connection has two inline threshold | ||||
values: one for messages flowing from requester-to-responder | ||||
(referred to as the "call inline threshold"), and one for messages | ||||
flowing from responder-to-requester (referred to as the "reply inline | ||||
threshold"). | ||||
Unlike credit limits, inline threshold values are not advertised to | Unlike credit limits, inline threshold values are not advertised to | |||
peers via the RPC-over-RDMA Version One protocol, and there is no | peers via the RPC-over-RDMA Version One protocol, and there is no | |||
provision for the inline threshold value to change during the | provision for inline threshold values to change during the lifetime | |||
lifetime of an RPC-over-RDMA Version One connection. | of an RPC-over-RDMA Version One connection. | |||
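The -07 wording defines each inline threshold as a minimum over what the sender can post and what the receiver can accept. A minimal sketch of that rule follows; the `Peer` class and its field names are hypothetical and are not part of the protocol, which does not advertise these values on the wire.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Peer:
    # Hypothetical local limits, in octets; not protocol fields.
    max_send: int     # largest message this peer can post via RDMA Send
    max_receive: int  # largest message this peer can accept via RDMA Receive


def inline_thresholds(requester: Peer, responder: Peer) -> Tuple[int, int]:
    """Return (call inline threshold, reply inline threshold)."""
    # Calls flow requester -> responder.
    call = min(requester.max_send, responder.max_receive)
    # Replies flow responder -> requester.
    reply = min(responder.max_send, requester.max_receive)
    return call, reply
```

Because the thresholds are not exchanged by RPC-over-RDMA Version One itself, both peers would need to learn the other side's limits by some out-of-band means before a computation like this is possible.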
4.3.3. Initial Connection State | 4.3.3. Initial Connection State | |||
When a connection is first established, peers might not know how many | When a connection is first established, peers might not know how many | |||
receive resources the other has, nor how large these buffers are. | receive resources the other has, nor how large the other peer's | |||
inline thresholds are. | ||||
As a basis for an initial exchange of RPC requests, each RPC-over- | As a basis for an initial exchange of RPC requests, each RPC-over- | |||
RDMA Version One connection provides the ability to exchange at least | RDMA Version One connection provides the ability to exchange at least | |||
one RPC message at a time that is 1024 bytes in size. A responder | one RPC message at a time, whose Call and Reply messages are no more | |||
MAY exceed this basic level of configuration, but a requester MUST | than 1024 bytes in size. A responder MAY exceed this basic level of | |||
NOT assume more than one credit is available, and MUST receive a | configuration, but a requester MUST NOT assume more than one credit | |||
valid reply from the responder carrying the actual number of | is available, and MUST receive a valid reply from the responder | |||
available credits, prior to sending its next request. | carrying the actual number of available credits, prior to sending its | |||
next request. | ||||
Receiver implementations MUST support an inline threshold of 1024 | Receiver implementations MUST support inline thresholds of 1024 | |||
bytes, but MAY support larger inline threshold values. A mechanism | bytes, but MAY support larger inline threshold values. A mechanism | |||
for discovering a peer's inline threshold value before a connection | for discovering a peer's inline thresholds before a connection is | |||
is established may be used to optimize the use of RDMA Send | established may be used to optimize the use of RDMA Send and Receive | |||
operations. In the absence of such a mechanism, senders MUST assume | operations. In the absence of such a mechanism, senders and receivers | |||
a receiver's inline threshold is 1024 bytes. | MUST assume the inline thresholds are 1024 bytes. | |||
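The initial-state rules can be sketched as requester-side bookkeeping: start with one credit and a 1024-byte inline threshold, and adopt the responder's credit value only after the first valid reply arrives. The class and method names below are hypothetical, and the sketch assumes the granted credit count has already been parsed from the reply header.

```python
DEFAULT_INLINE_THRESHOLD = 1024  # bytes, assumed when no discovery mechanism exists


class RequesterCredits:
    """Hypothetical requester-side credit tracker for one connection."""

    def __init__(self) -> None:
        self.granted = 1    # a requester MUST NOT assume more than one credit
        self.in_flight = 0  # requests sent but not yet replied to

    def can_send(self) -> bool:
        return self.in_flight < self.granted

    def on_send(self) -> None:
        if not self.can_send():
            raise RuntimeError("no credits available")
        self.in_flight += 1

    def on_reply(self, credits_granted: int) -> None:
        # Each valid reply carries the responder's current credit grant.
        self.in_flight -= 1
        self.granted = credits_granted
```

With this bookkeeping, a second request cannot be posted on a new connection until the first reply has been processed, which matches the one-credit starting assumption.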
4.4. XDR Encoding With Chunks | 4.4. XDR Encoding With Chunks | |||
When a direct data placement capability is available, it can be | When a direct data placement capability is available, it can be | |||
determined during XDR encoding that the transport can efficiently | determined during XDR encoding that the transport can efficiently | |||
place the contents of one or more XDR data items directly into the | place the contents of one or more XDR data items directly into the | |||
receiver's memory, separately from the transfer of other parts of the | receiver's memory, separately from the transfer of other parts of the | |||
containing XDR stream. | containing XDR stream. | |||
4.4.1. Reducing An XDR Stream | 4.4.1. Reducing An XDR Stream | |||
skipping to change at page 15, line 15 ¶ | skipping to change at page 15, line 21 ¶ | |||
reduced. | reduced. | |||
Detailed requirements for Upper Layer Bindings are discussed in full | Detailed requirements for Upper Layer Bindings are discussed in full | |||
in Section 7. | in Section 7. | |||
4.4.3. RDMA Segments | 4.4.3. RDMA Segments | |||
When encoding a Payload stream that contains a DDP-eligible data | When encoding a Payload stream that contains a DDP-eligible data | |||
item, a sender may choose to reduce that data item. When it chooses | item, a sender may choose to reduce that data item. When it chooses | |||
to do so, the sender does not place the item into the Payload stream. | to do so, the sender does not place the item into the Payload stream. | |||
Instead, the sender records in the RPC-over-RDMA header the actual | Instead, the sender records in the RPC-over-RDMA header the location | |||
address and size of the memory region containing that data item. | and size of the memory region containing that data item. | |||
The requester provides location information for DDP-eligible data | The requester provides location information for DDP-eligible data | |||
items in both RPC Calls and Replies. The responder uses this | items in both RPC Calls and Replies. The responder uses this | |||
information to initiate RDMA Read and Write operations to retrieve or | information to initiate RDMA Read and Write operations to retrieve or | |||
update the specified region of the requester's memory. | update the specified region of the requester's memory. | |||
An "RDMA segment", or a "plain segment", is an RPC-over-RDMA header | An "RDMA segment", or a "plain segment", is an RPC-over-RDMA header | |||
data object that contains the precise co-ordinates of a contiguous | data object that contains the precise co-ordinates of a contiguous | |||
memory region that is to be conveyed via one or more RDMA Read or | memory region that is to be conveyed via one or more RDMA Read or | |||
RDMA Write operations. | RDMA Write operations. | |||
skipping to change at page 18, line 39 ¶ | skipping to change at page 18, line 49 ¶ | |||
a particular XDR data item in the reply is not predictable at the | a particular XDR data item in the reply is not predictable at the | |||
time a request is issued. Therefore RDMA segments in a Write chunk | time a request is issued. Therefore RDMA segments in a Write chunk | |||
do not have a Position field. | do not have a Position field. | |||
While constructing an RPC Call message, a requester also prepares | While constructing an RPC Call message, a requester also prepares | |||
memory regions to catch DDP-eligible reply data items. A requester | memory regions to catch DDP-eligible reply data items. A requester | |||
does not know the actual length of the result data item to be | does not know the actual length of the result data item to be | |||
returned, thus it MUST register a Write chunk long enough to | returned, thus it MUST register a Write chunk long enough to | |||
accommodate the maximum possible size of the returned data item. | accommodate the maximum possible size of the returned data item. | |||
A responder copies the requester-provided Write chunk segments into | The responder fills the segments contiguously in array order until | |||
the RPC-over-RDMA header that it returns with the reply. The | the result data item has been completely written into the Write | |||
responder MUST NOT change the number of segments in the Write chunk. | chunk. The responder copies the consumed Write chunk segments into | |||
the Reply's RPC-over-RDMA header. As it does so, the responder | ||||
The responder fills the segments in array order until the data item | updates the segment length fields to reflect the actual amount of | |||
has been completely written. The responder updates the segment | data that is being returned in each segment, and updates the Write | |||
length fields to reflect the actual amount of data that is being | chunk's segment count to reflect how many segments were consumed. | |||
returned in each segment. If a Write chunk segment receives no data | Unconsumed segments are omitted in the returned Write chunk. | |||
from the responder, the updated length of the segment MUST be zero. | ||||
The responder then sends the RPC Reply via an RDMA Send operation. | The responder then sends the RPC Reply via an RDMA Send operation. | |||
After receiving the RPC Reply, the requester reconstructs the | After receiving the RPC Reply, the requester reconstructs the | |||
transferred data by concatenating the contents of each segment, in | transferred data by concatenating the contents of each segment, in | |||
array order, into the RPC Reply XDR stream. | array order, into the RPC Reply XDR stream. | |||
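The responder behavior described in the -07 column (fill segments contiguously in array order, update each length, omit unconsumed segments) can be sketched as follows. The function name and the use of plain integer lists for segment lengths are illustrative assumptions; a real segment also carries a handle and an offset.

```python
from typing import List


def fill_write_chunk(segment_lengths: List[int], data_len: int) -> List[int]:
    """Return the segment lengths to place in the Reply's Write chunk.

    segment_lengths: capacities of the requester-provided segments, in array order.
    data_len: actual size of the result data item, in bytes.
    """
    if data_len > sum(segment_lengths):
        raise ValueError("Write chunk too small for result data item")
    returned = []
    remaining = data_len
    for capacity in segment_lengths:
        if remaining == 0:
            break  # unconsumed segments are omitted from the returned chunk
        used = min(capacity, remaining)
        returned.append(used)  # updated length reflects the data actually written
        remaining -= used
    return returned
```

Under the -06 wording the returned list would instead keep its original segment count, with zero lengths for segments that received no data.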
4.4.6.1. Write Chunk Round-up | 4.4.6.1. Write Chunk Round-up | |||
XDR requires each encoded data item to start on four-byte alignment. | XDR requires each encoded data item to start on four-byte alignment. | |||
When an odd-length data item is encoded, its length is encoded | When an odd-length data item is encoded, its length is encoded | |||
skipping to change at page 19, line 28 ¶ | skipping to change at page 19, line 36 ¶ | |||
Write chunk, the responder MUST remove XDR padding for that data item | Write chunk, the responder MUST remove XDR padding for that data item | |||
from the reply Payload stream as well. | from the reply Payload stream as well. | |||
A requester SHOULD NOT provide extra length in a Write chunk to | A requester SHOULD NOT provide extra length in a Write chunk to | |||
accommodate XDR pad bytes. A responder MUST NOT write XDR pad bytes | accommodate XDR pad bytes. A responder MUST NOT write XDR pad bytes | |||
for a Write chunk. | for a Write chunk. | |||
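As a small illustration of the round-up rule, the pad needed to reach four-byte alignment can be computed directly, and the number of bytes moved through a Write chunk is simply the unpadded length. The function names are hypothetical, and the lengths count only data bytes, excluding any XDR length word.

```python
def xdr_pad_len(item_len: int) -> int:
    """Number of pad bytes XDR adds after item_len data bytes."""
    return (4 - item_len % 4) % 4


def inline_encoded_len(item_len: int) -> int:
    """Data plus padding when the item stays in the Payload stream."""
    return item_len + xdr_pad_len(item_len)


def write_chunk_len(item_len: int) -> int:
    """Bytes a responder writes into a Write chunk: no XDR pad bytes."""
    return item_len
```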
4.4.6.2. Unused Write Chunks | 4.4.6.2. Unused Write Chunks | |||
There are occasions when a requester provides a Write chunk but the | There are occasions when a requester provides a Write chunk but the | |||
responder does not use it. | responder is not able to use it. | |||
For example, an Upper Layer Protocol may define a union result where | For example, an Upper Layer Protocol may define a union result where | |||
some arms of the union contain a DDP-eligible data item while other | some arms of the union contain a DDP-eligible data item while other | |||
arms do not. The responder is REQUIRED to use requester-provided | arms do not. The responder is REQUIRED to use requester-provided | |||
Write chunks in this case, but if the responder returns a result that | Write chunks in this case, but if the responder returns a result that | |||
uses an arm of the union that has no DDP-eligible data item, the | uses an arm of the union that has no DDP-eligible data item, the | |||
Write chunk remains unconsumed. | Write chunk remains unconsumed. | |||
If there is a subsequent DDP-eligible data item, it MUST be placed in | If there is a subsequent DDP-eligible data item, it MUST be placed in | |||
that Write chunk. The requester MUST provision each Write chunk so | that unconsumed Write chunk. The requester MUST provision each Write | |||
it can be filled with the largest DDP-eligible data item that can be | chunk so it can be filled with the largest DDP-eligible data item | |||
placed in it. | that can be placed in it. | |||
However, if this is the last or only Write chunk available and it | However, if this is the last or only Write chunk available and it | |||
remains unconsumed, the responder MUST set the length of all segments | remains unconsumed, the responder MUST set the Write chunk segment | |||
in the chunk to zero. | count to zero, returning no segments in the Write chunk. | |||
Unused write chunks, or unused bytes in write chunk segments, are not | Unused write chunks, or unused bytes in write chunk segments, are not | |||
returned as results. Their memory is returned to the Upper Layer as | returned as results. Their memory is returned to the Upper Layer as | |||
part of RPC completion. However, the RPC layer MUST NOT assume that | part of RPC completion. However, the RPC layer MUST NOT assume that | |||
the buffers have not been modified. | the buffers have not been modified. | |||
In other words, even if a responder indicates that a Write chunk is | In other words, even if a responder indicates that a Write chunk is | |||
not consumed (by setting all of the segment lengths in the chunk to | not consumed (by setting all of the segment lengths in the chunk to | |||
zero), the responder may have written some data into the segments | zero), the responder may have written some data into the segments | |||
before deciding not to return that data item. For example, a problem | before deciding not to return that data item. For example, a problem | |||
skipping to change at page 20, line 25 ¶ | skipping to change at page 20, line 34 ¶ | |||
4.5. Message Size | 4.5. Message Size | |||
A receiver of RDMA Send operations is required by RDMA to have | A receiver of RDMA Send operations is required by RDMA to have | |||
previously posted one or more adequately sized buffers. Memory | previously posted one or more adequately sized buffers. Memory | |||
savings are achieved on both requesters and responders by posting | savings are achieved on both requesters and responders by posting | |||
small Receive buffers. However, not all RPC messages are small. | small Receive buffers. However, not all RPC messages are small. | |||
4.5.1. Short Messages | 4.5.1. Short Messages | |||
RPC messages are frequently smaller than typical inline thresholds. | RPC messages are frequently smaller than typical inline thresholds. | |||
For example, the NFS version 3 GETATTR request is only 56 bytes: 20 | For example, the NFS version 3 GETATTR operation is only 56 bytes: 20 | |||
bytes of RPC header, plus a 32-byte file handle argument and 4 bytes | bytes of RPC header, plus a 32-byte file handle argument and 4 bytes | |||
for its length. The reply to this common request is about 100 bytes. | for its length. The reply to this common request is about 100 bytes. | |||
Since all RPC messages conveyed via RPC-over-RDMA require an RDMA | Since all RPC messages conveyed via RPC-over-RDMA require an RDMA | |||
Send operation, the most efficient way to send an RPC message that is | Send operation, the most efficient way to send an RPC message that is | |||
smaller than the receiver's inline threshold is to append the Payload | smaller than the inline threshold is to append the Payload stream | |||
stream directly to the Transport stream. An RPC-over-RDMA header | directly to the Transport stream. An RPC-over-RDMA header with a | |||
with a small RPC Call or Reply message immediately following is | small RPC Call or Reply message immediately following is transferred | |||
transferred using a single RDMA Send operation. No RDMA Read or | using a single RDMA Send operation. No RDMA Read or Write operations | |||
Write operations are needed. | are needed. | |||
An RPC-over-RDMA transaction using Short Messages: | An RPC-over-RDMA transaction using Short Messages: | |||
Requester Responder | Requester Responder | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
Call | ------------------------------> | | Call | ------------------------------> | | |||
| | Processing | ||||
| | | | | | |||
| | Processing | ||||
| | | | | | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
| <------------------------------ | Reply | | <------------------------------ | Reply | |||
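A sender's check for whether a message qualifies as a Short Message can be sketched as a size comparison. The 28-byte minimum header used here is an assumption derived from four 4-byte fixed fields plus three empty optional-data chunk lists of 4 bytes each; the function name and default values are illustrative.

```python
RDMA_HDR_MIN = 28  # assumed: 4 fixed XDR words + 3 empty chunk lists


def fits_inline(payload_len: int,
                inline_threshold: int = 1024,
                header_len: int = RDMA_HDR_MIN) -> bool:
    """True if Transport and Payload streams fit in one RDMA Send."""
    # The total length of the gathered send buffers must not exceed
    # the inline threshold for this direction.
    return header_len + payload_len <= inline_threshold
```

For the 56-byte NFS version 3 GETATTR example above, `fits_inline(56)` holds comfortably even at the 1024-byte default.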
4.5.2. Chunked Messages | 4.5.2. Chunked Messages | |||
If DDP-eligible data items are present in a Payload stream, a sender | If DDP-eligible data items are present in a Payload stream, a sender | |||
MAY reduce some or all of these items by removing them from the | MAY reduce some or all of these items by removing them from the | |||
Payload stream. The sender uses RDMA Read or Write operations to | Payload stream. The sender uses RDMA Read or Write operations to | |||
transfer the reduced data items. The Transport stream with the | transfer the reduced data items. The Transport stream with the | |||
skipping to change at page 21, line 30 ¶ | skipping to change at page 21, line 39 ¶ | |||
An RPC-over-RDMA transaction with a Read chunk: | An RPC-over-RDMA transaction with a Read chunk: | |||
Requester Responder | Requester Responder | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
Call | ------------------------------> | | Call | ------------------------------> | | |||
| RDMA Read | | | RDMA Read | | |||
| <------------------------------ | | | <------------------------------ | | |||
| RDMA Response (arg data) | | | RDMA Response (arg data) | | |||
| ------------------------------> | | | ------------------------------> | | |||
| | Processing | ||||
| | | | | | |||
| | Processing | ||||
| | | | | | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
| <------------------------------ | Reply | | <------------------------------ | Reply | |||
An RPC-over-RDMA transaction with a Write chunk: | An RPC-over-RDMA transaction with a Write chunk: | |||
Requester Responder | Requester Responder | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
Call | ------------------------------> | | Call | ------------------------------> | | |||
| | Processing | ||||
| | | | | | |||
| | Processing | ||||
| | | | | | |||
| RDMA Write (result data) | | | RDMA Write (result data) | | |||
| <------------------------------ | | | <------------------------------ | | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
| <------------------------------ | Reply | | <------------------------------ | Reply | |||
4.5.3. Long Messages | 4.5.3. Long Messages | |||
When a Payload stream is larger than the receiver's inline threshold, | When a Payload stream is larger than the receiver's inline threshold, | |||
the Payload stream is reduced by removing DDP-eligible data items and | the Payload stream is reduced by removing DDP-eligible data items and | |||
skipping to change at page 22, line 40 ¶ | skipping to change at page 23, line 4 ¶ | |||
requester sizes the Reply chunk to accommodate the maximum | requester sizes the Reply chunk to accommodate the maximum | |||
expected reply size for that Upper Layer operation. | expected reply size for that Upper Layer operation. | |||
Though the purpose of a Long Message is to handle large RPC messages, | Though the purpose of a Long Message is to handle large RPC messages, | |||
requesters MAY use a Long Message at any time to convey an RPC Call. | requesters MAY use a Long Message at any time to convey an RPC Call. | |||
A responder chooses which form of reply to use based on the chunks | A responder chooses which form of reply to use based on the chunks | |||
provided by the requester. If Write chunks were provided and the | provided by the requester. If Write chunks were provided and the | |||
responder has a DDP-eligible result, it first reduces the reply | responder has a DDP-eligible result, it first reduces the reply | |||
Payload stream. If a Reply chunk was provided and the reduced | Payload stream. If a Reply chunk was provided and the reduced | |||
Payload stream is larger than the requester's inline threshold, the | Payload stream is larger than the reply inline threshold, the | |||
responder MUST use the provided Reply chunk for the reply. | responder MUST use the requester-provided Reply chunk for the reply. | |||
Because these special chunks contain a whole RPC message, XDR data | Because these special chunks contain a whole RPC message, XDR data | |||
items appear in these special chunks without regard to their DDP- | items appear in these special chunks without regard to their DDP- | |||
eligibility. | eligibility. | |||
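The responder's choice of reply form, applied after any Write chunk reduction, can be sketched as below. All names are hypothetical. The sketch compares header plus payload against the reply inline threshold, and it raises an error when the reply is too large and no adequate Reply chunk exists; that last case is an assumption, since this excerpt does not say what the responder does then.

```python
from enum import Enum
from typing import Optional


class ReplyForm(Enum):
    SHORT = "RDMA_MSG"    # payload sent inline after the header
    LONG = "RDMA_NOMSG"   # payload written into the Reply chunk


def choose_reply_form(header_len: int,
                      reduced_payload_len: int,
                      reply_inline_threshold: int,
                      reply_chunk_len: Optional[int]) -> ReplyForm:
    """Pick the reply form once DDP-eligible results have been reduced."""
    if header_len + reduced_payload_len <= reply_inline_threshold:
        return ReplyForm.SHORT
    if reply_chunk_len is not None and reduced_payload_len <= reply_chunk_len:
        # Too large for inline: the requester-provided Reply chunk is used.
        return ReplyForm.LONG
    # Assumption: treated as an error here; not specified in this excerpt.
    raise ValueError("reply too large and no adequate Reply chunk provided")
```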
An RPC-over-RDMA transaction using a Long Call: | An RPC-over-RDMA transaction using a Long Call: | |||
Requester Responder | Requester Responder | |||
| RDMA Send (RDMA_NOMSG) | | | RDMA Send (RDMA_NOMSG) | | |||
Call | ------------------------------> | | Call | ------------------------------> | | |||
| RDMA Read | | | RDMA Read | | |||
| <------------------------------ | | | <------------------------------ | | |||
| RDMA Response (RPC call) | | | RDMA Response (RPC call) | | |||
| ------------------------------> | | | ------------------------------> | | |||
| | Processing | ||||
| | | | | | |||
| | Processing | ||||
| | | | | | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
| <------------------------------ | Reply | | <------------------------------ | Reply | |||
An RPC-over-RDMA transaction using a Long Reply: | An RPC-over-RDMA transaction using a Long Reply: | |||
Requester Responder | Requester Responder | |||
| RDMA Send (RDMA_MSG) | | | RDMA Send (RDMA_MSG) | | |||
Call | ------------------------------> | | Call | ------------------------------> | | |||
| | Processing | ||||
| | | | | | |||
| | Processing | ||||
| | | | | | |||
| RDMA Write (RPC reply) | | | RDMA Write (RPC reply) | | |||
| <------------------------------ | | | <------------------------------ | | |||
| RDMA Send (RDMA_NOMSG) | | | RDMA Send (RDMA_NOMSG) | | |||
| <------------------------------ | Reply | | <------------------------------ | Reply | |||
5. RPC-Over-RDMA In Operation | 5. RPC-Over-RDMA In Operation | |||
Every RPC-over-RDMA Version One message has a header that includes a | Every RPC-over-RDMA Version One message has a header that includes a | |||
copy of the message's transaction ID, data for managing RDMA flow | copy of the message's transaction ID, data for managing RDMA flow | |||
control credits, and lists of RDMA segments used for RDMA Read and | control credits, and lists of RDMA segments used for RDMA Read and | |||
Write operations. All RPC-over-RDMA header content is contained in | Write operations. All RPC-over-RDMA header content is contained in | |||
the Transport stream, and thus MUST be XDR encoded. | the Transport stream, and thus MUST be XDR encoded. | |||
RPC message layout is unchanged from that described in [RFC5531] | RPC message layout is unchanged from that described in [RFC5531] | |||
except for the possible reduction of data items that are moved by | except for the possible reduction of data items that are moved by | |||
RDMA Read or Write operations. | RDMA Read or Write operations. | |||
The RPC-over-RDMA protocol passes RPC messages without regard to | The RPC-over-RDMA protocol passes RPC messages without regard to | |||
their type (CALL or REPLY). Apart from restrictions imposed by | their type (CALL or REPLY). Apart from restrictions imposed by | |||
upper-layer bindings, each endpoint of a connection MAY send any RPC- | upper-layer bindings, each endpoint of a connection MAY send RDMA_MSG | |||
over-RDMA message header type at any time (subject to credit limits). | or RDMA_NOMSG message header types at any time (subject to credit | |||
limits). | ||||
5.1. XDR Protocol Definition | 5.1. XDR Protocol Definition | |||
This section contains a description of the core features of the RPC- | This section contains a description of the core features of the RPC- | |||
over-RDMA Version One protocol, expressed in the XDR language | over-RDMA Version One protocol, expressed in the XDR language | |||
[RFC4506]. | [RFC4506]. | |||
This description is provided in a way that makes it simple to extract | This description is provided in a way that makes it simple to extract | |||
into ready-to-compile form. The reader can apply the following shell | into ready-to-compile form. The reader can apply the following shell | |||
script to this document to produce a machine-readable XDR description | script to this document to produce a machine-readable XDR description | |||
skipping to change at page 30, line 22 ¶ | skipping to change at page 30, line 22 ¶ | |||
and MUST hold the Payload stream for this RPC-over-RDMA message. If | and MUST hold the Payload stream for this RPC-over-RDMA message. If | |||
a Read or Write chunk list is present, a portion of the Payload | a Read or Write chunk list is present, a portion of the Payload | |||
stream has been excised and is conveyed separately via RDMA Read or | stream has been excised and is conveyed separately via RDMA Read or | |||
Write operations. | Write operations. | |||
An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send | An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send | |||
operation. The Transport stream contains the four fixed fields, | operation. The Transport stream contains the four fixed fields, | |||
followed by formatted error information. No Payload stream is | followed by formatted error information. No Payload stream is | |||
conveyed in this type of RPC-over-RDMA message. | conveyed in this type of RPC-over-RDMA message. | |||
A requester MUST NOT send an RPC-over-RDMA header with the RDMA_ERROR | ||||
procedure. A responder MUST silently discard RDMA_ERROR procedures. | ||||
A gather operation on each RDMA Send operation can be used to combine | A gather operation on each RDMA Send operation can be used to combine | |||
the Transport and Payload streams, which might have been constructed | the Transport and Payload streams, which might have been constructed | |||
in separate buffers. However, the total length of the gathered send | in separate buffers. However, the total length of the gathered send | |||
buffers MUST NOT exceed the peer receiver's inline threshold. | buffers MUST NOT exceed the inline threshold. | |||
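The length rule above can be sketched as follows. This is only an illustration: the function name and the example threshold of 1024 bytes are assumptions, not values taken from this document.

```python
def fits_inline(transport_stream: bytes, payload_stream: bytes,
                inline_threshold: int) -> bool:
    """Return True if the gathered Send stays within the inline threshold.

    The Transport and Payload streams may live in separate buffers; a
    gather operation combines them, so their total length is what counts.
    """
    return len(transport_stream) + len(payload_stream) <= inline_threshold


# Hypothetical sizes: a 28-byte Transport stream and a 1024-byte threshold.
print(fits_inline(b"\x00" * 28, b"\x00" * 996, 1024))
print(fits_inline(b"\x00" * 28, b"\x00" * 997, 1024))
```

A sender that gets False here cannot use a single RDMA Send for both streams and has to convey part of the message by other means, such as chunks.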
5.3. Chunk Lists | 5.3. Chunk Lists | |||
The chunk lists in an RPC-over-RDMA Version One header are three XDR | The chunk lists in an RPC-over-RDMA Version One header are three XDR | |||
optional-data fields that follow the fixed header fields in RDMA_MSG | optional-data fields that follow the fixed header fields in RDMA_MSG | |||
and RDMA_NOMSG procedures. Read Section 4.19 of [RFC4506] carefully | and RDMA_NOMSG procedures. Read Section 4.19 of [RFC4506] carefully | |||
to understand how optional-data fields work. Examples of XDR encoded | to understand how optional-data fields work. Examples of XDR encoded | |||
chunk lists are provided in Section 5.7 as an aid to understanding. | chunk lists are provided in Section 5.7 as an aid to understanding. | |||
5.3.1. Read List | 5.3.1. Read List | |||
skipping to change at page 32, line 23 ¶ | skipping to change at page 32, line 28 ¶ | |||
Write list in the Reply is modified as above to reflect the actual | Write list in the Reply is modified as above to reflect the actual | |||
amount of data that is being returned in the Write list. | amount of data that is being returned in the Write list. | |||
5.3.3. Reply Chunk | 5.3.3. Reply Chunk | |||
Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk." The | Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk." The | |||
Reply chunk is a Write chunk, provided by the requester. The Reply | Reply chunk is a Write chunk, provided by the requester. The Reply | |||
chunk is a single counted array of RDMA segments. | chunk is a single counted array of RDMA segments. | |||
A requester MUST provide a Reply chunk whenever the maximum possible | A requester MUST provide a Reply chunk whenever the maximum possible | |||
size of the reply is larger than its own inline threshold. The Reply | size of the reply message is larger than the inline threshold for | |||
chunk MUST be large enough to contain a Payload stream (RPC message) | messages from responder to requester. The Reply chunk MUST be large | |||
of this maximum size. If the actual reply Payload stream is smaller | enough to contain a Payload stream (RPC message) of this maximum | |||
than the requester's inline threshold, the responder MAY return it as | size. If the Transport stream and reply Payload stream together are | |||
a Short message rather than using the Reply chunk. | smaller than the reply inline threshold, the responder MAY return it | |||
as a Short message rather than using the requester-provided Reply | ||||
chunk. | ||||
When a requester has provided a Reply chunk in a Call message, the | When a requester has provided a Reply chunk in a Call message, the | |||
responder MUST copy that chunk into the associated Reply. The copied | responder MUST copy that chunk into the associated Reply. The copied | |||
Reply chunk in the Reply is modified to reflect the actual amount of | Reply chunk in the Reply is modified to reflect the actual amount of | |||
data that is being returned in the Reply chunk. | data that is being returned in the Reply chunk. | |||
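A minimal sketch of the two size decisions described above, following the wording of the -07 (right-hand) text. The function and parameter names are hypothetical; "reply_inline_threshold" stands for the inline threshold for messages from responder to requester.

```python
def requester_must_provide_reply_chunk(max_reply_size: int,
                                       reply_inline_threshold: int) -> bool:
    """Requester side: a Reply chunk is required whenever the largest
    possible reply message exceeds the reply inline threshold."""
    return max_reply_size > reply_inline_threshold


def responder_may_send_short(transport_len: int, payload_len: int,
                             reply_inline_threshold: int) -> bool:
    """Responder side: the reply MAY go out as a Short message when the
    Transport and Payload streams together are smaller than the reply
    inline threshold, even if a Reply chunk was provided."""
    return transport_len + payload_len < reply_inline_threshold


# Hypothetical numbers: a 1024-byte threshold and an 8 KB worst-case reply.
print(requester_must_provide_reply_chunk(8192, 1024))
print(responder_may_send_short(28, 200, 1024))
```

The strict comparison in the second function mirrors the phrase "smaller than" in the text; whether a reply exactly equal to the threshold may also be sent inline is not settled by this passage.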
5.4. Memory Registration | 5.4. Memory Registration | |||
RDMA requires that data is transferred between only registered memory | RDMA requires that data is transferred between only registered memory | |||
segments at the source and destination. All protocol headers as well | segments at the source and destination. All protocol headers as well | |||
skipping to change at page 34, line 9 ¶ | skipping to change at page 34, line 17 ¶ | |||
each segment. For such implementations, this can be a significant | each segment. For such implementations, this can be a significant | |||
overhead. By providing an offset in each chunk, many pre- | overhead. By providing an offset in each chunk, many pre- | |||
registration or region-based registrations can be readily supported. | registration or region-based registrations can be readily supported. | |||
By using a single, universal chunk representation, the RPC-over-RDMA | By using a single, universal chunk representation, the RPC-over-RDMA | |||
protocol implementation is simplified to its most general form. | protocol implementation is simplified to its most general form. | |||
5.5. Error Handling | 5.5. Error Handling | |||
A receiver performs basic validity checks on the RPC-over-RDMA header | A receiver performs basic validity checks on the RPC-over-RDMA header | |||
and chunk contents before it passes the RPC message to the RPC | and chunk contents before it passes the RPC message to the RPC | |||
consumer. If errors are detected in an RPC-over-RDMA header, an | consumer. If errors are detected in the RPC-over-RDMA header of a | |||
RDMA_ERROR procedure MUST be generated. Because the transport layer | Call message, a responder MUST send an RDMA_ERROR message back to the | |||
may not be aware of the direction of a problematic RPC message, an | requester. If errors are detected in the RPC-over-RDMA header of a | |||
RDMA_ERROR procedure MAY be generated by either a requester or a | Reply message, a requester MUST silently discard the message. | |||
responder. | ||||
To form an RDMA_ERROR procedure: The rdma_xid field MUST contain the | To form an RDMA_ERROR procedure: The rdma_xid field MUST contain the | |||
same XID that was in the rdma_xid field in the failing request; The | same XID that was in the rdma_xid field in the failing request; The | |||
rdma_vers field MUST contain the same version that was in the | rdma_vers field MUST contain the same version that was in the | |||
rdma_vers field in the failing request; The rdma_proc field MUST | rdma_vers field in the failing request; The rdma_proc field MUST | |||
contain the value RDMA_ERROR; The rdma_err field contains a value | contain the value RDMA_ERROR; The rdma_err field contains a value | |||
that reflects the type of error that occurred, as described below. | that reflects the type of error that occurred, as described below. | |||
An RDMA_ERROR procedure indicates a permanent error. Receipt of this | An RDMA_ERROR procedure indicates a permanent error. Receipt of this | |||
procedure completes the RPC transaction associated with XID in the | procedure completes the RPC transaction associated with XID in the | |||
rdma_xid field. A receiver MUST silently discard an RDMA_ERROR | rdma_xid field. A receiver MUST silently discard an RDMA_ERROR | |||
procedure that it cannot decode. | procedure that it cannot decode. | |||
5.5.1. Header Version Mismatch | 5.5.1. Header Version Mismatch | |||
When a receiver detects an RPC-over-RDMA header version that it does | When a responder detects an RPC-over-RDMA header version that it does | |||
not support (currently this document defines only Version One), it | not support (currently this document defines only Version One), it | |||
MUST reply with an RDMA_ERROR procedure and set the rdma_err value to | MUST reply with an RDMA_ERROR procedure and set the rdma_err value to | |||
ERR_VERS, also providing the low and high inclusive version numbers | ERR_VERS, also providing the low and high inclusive version numbers | |||
it does, in fact, support. | it does, in fact, support. | |||
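The steps for forming an RDMA_ERROR procedure, including the ERR_VERS case, might be sketched as below. Several things here are assumptions: the numeric values of RDMA_ERROR, ERR_VERS, and ERR_CHUNK are taken to match the XDR definition in Section 5.1 (not shown in this excerpt) and should be checked against it; the credit value is simply passed in by the caller; and the helper names are hypothetical.

```python
import struct

# Assumed to match the XDR enums in Section 5.1; verify before relying on them.
RDMA_ERROR = 4
ERR_VERS = 1
ERR_CHUNK = 2

# This document defines only Version One.
SUPPORTED_LOW = 1
SUPPORTED_HIGH = 1


def build_rdma_error(xid: int, vers: int, credits: int, err: int) -> bytes:
    """Encode an RDMA_ERROR Transport stream as big-endian XDR words.

    xid and vers are copied from the failing request, as the text requires.
    """
    msg = struct.pack(">IIII", xid, vers, credits, RDMA_ERROR)
    msg += struct.pack(">I", err)
    if err == ERR_VERS:
        # Report the inclusive range of versions that is supported.
        msg += struct.pack(">II", SUPPORTED_LOW, SUPPORTED_HIGH)
    return msg


def check_version(xid: int, vers: int, credits: int):
    """Return None if the version is supported, else an ERR_VERS message."""
    if SUPPORTED_LOW <= vers <= SUPPORTED_HIGH:
        return None
    return build_rdma_error(xid, vers, credits, ERR_VERS)


reply = check_version(xid=0x1234, vers=2, credits=1)
print(reply.hex())
```

No Payload stream follows the encoded bytes, since an RDMA_ERROR procedure carries only the Transport stream.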
5.5.2. XDR Errors | 5.5.2. XDR Errors | |||
A receiver might encounter an XDR parsing error that prevents it from | A receiver might encounter an XDR parsing error that prevents it from | |||
processing the incoming Transport stream. Examples of such errors | processing the incoming Transport stream. Examples of such errors | |||
include an invalid value in the rdma_proc field, an RDMA_NOMSG | include an invalid value in the rdma_proc field, an RDMA_NOMSG | |||
skipping to change at page 37, line 13 ¶ | skipping to change at page 37, line 17 ¶ | |||
This is not typical for NFSv4 COMPOUND RPCs, which often include a | This is not typical for NFSv4 COMPOUND RPCs, which often include a | |||
GETATTR operation as the final element of the compound operation | GETATTR operation as the final element of the compound operation | |||
array. | array. | |||
Without a full specification of RDMA_MSGP, there has been no fully | Without a full specification of RDMA_MSGP, there has been no fully | |||
implemented prototype of it. Without a complete prototype of | implemented prototype of it. Without a complete prototype of | |||
RDMA_MSGP support, it is difficult to assess whether this protocol | RDMA_MSGP support, it is difficult to assess whether this protocol | |||
element has benefit, or can even be made to work interoperably. | element has benefit, or can even be made to work interoperably. | |||
Therefore, senders MUST NOT send RDMA_MSGP procedures. When | Therefore, senders MUST NOT send RDMA_MSGP procedures. When | |||
receiving an RDMA_MSGP procedure, receivers SHOULD reply with an | receiving an RDMA_MSGP procedure, responders SHOULD reply with an | |||
RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK. | RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK; | |||
requesters MUST silently discard the message. | ||||
5.6.2. RDMA_DONE | 5.6.2. RDMA_DONE | |||
Because no implementation of RPC-over-RDMA Version One uses the Read- | Because no implementation of RPC-over-RDMA Version One uses the Read- | |||
Read transfer model, there is never a need to send an RDMA_DONE | Read transfer model, there is never a need to send an RDMA_DONE | |||
procedure. | procedure. | |||
Therefore, senders MUST NOT send RDMA_DONE messages. When receiving | Therefore, senders MUST NOT send RDMA_DONE messages. Receivers MUST | |||
an RDMA_DONE procedure, receivers SHOULD reply with an RDMA_ERROR | silently discard RDMA_DONE messages. | |||
procedure, setting the rdma_err field to ERR_CHUNK. | ||||
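The receive-side treatment of the two unused procedures, as worded in the -07 (right-hand) text, can be summarized in a short sketch. The numeric values of RDMA_MSGP, RDMA_DONE, and ERR_CHUNK are assumed to match the XDR definition in Section 5.1, and the function name and returned action strings are hypothetical.

```python
# Assumed to match the XDR enums in Section 5.1; verify before relying on them.
RDMA_MSGP = 2
RDMA_DONE = 3
ERR_CHUNK = 2


def handle_unused_proc(rdma_proc: int, is_responder: bool):
    """Return (action, rdma_err) for a received RDMA_MSGP or RDMA_DONE."""
    if rdma_proc == RDMA_DONE:
        # Any receiver silently discards RDMA_DONE.
        return ("discard", None)
    if rdma_proc == RDMA_MSGP:
        if is_responder:
            # A responder SHOULD answer with RDMA_ERROR / ERR_CHUNK.
            return ("send_rdma_error", ERR_CHUNK)
        # A requester silently discards the message.
        return ("discard", None)
    raise ValueError("not RDMA_MSGP or RDMA_DONE")


print(handle_unused_proc(RDMA_MSGP, is_responder=True))
print(handle_unused_proc(RDMA_DONE, is_responder=True))
```

Under the -06 (left-hand) text, a receiver of RDMA_DONE would instead reply with RDMA_ERROR and ERR_CHUNK; the sketch deliberately follows the newer wording.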
5.7. XDR Examples | 5.7. XDR Examples | |||
RPC-over-RDMA chunk lists are complex data types. In this section, | RPC-over-RDMA chunk lists are complex data types. In this section, | |||
illustrations are provided to help readers grasp how chunk lists are | illustrations are provided to help readers grasp how chunk lists are | |||
represented inside an RPC-over-RDMA header. | represented inside an RPC-over-RDMA header. | |||
An RDMA segment is the simplest component, being made up of a 32-bit | An RDMA segment is the simplest component, being made up of a 32-bit | |||
handle (H), a 32-bit length (L), and 64-bits of offset (OO). Once | handle (H), a 32-bit length (L), and 64-bits of offset (OO). Once | |||
flattened into an XDR stream, RDMA segments appear as | flattened into an XDR stream, RDMA segments appear as | |||
skipping to change at page 42, line 37 ¶ | skipping to change at page 42, line 37 ¶ | |||
maximum possible size of the expected Reply message. | maximum possible size of the expected Reply message. | |||
If there are procedures in the Upper Layer Protocol for which there | If there are procedures in the Upper Layer Protocol for which there | |||
is no clear reply size maximum, the Upper Layer Binding needs to | is no clear reply size maximum, the Upper Layer Binding needs to | |||
specify a dependable means for determining the maximum. | specify a dependable means for determining the maximum. | |||
7.3. Additional Considerations | 7.3. Additional Considerations | |||
There may be other details provided in an Upper Layer Binding. | There may be other details provided in an Upper Layer Binding. | |||
o An Upper Layer Binding may recommend an inline threshold value or | o An Upper Layer Binding may recommend inline threshold values or | |||
other transport-related parameters for RPC-over-RDMA Version One | other transport-related parameters for RPC-over-RDMA Version One | |||
connections bearing that Upper Layer Protocol. | connections bearing that Upper Layer Protocol. | |||
o An Upper Layer Protocol may provide a means to communicate these | o An Upper Layer Protocol may provide a means to communicate these | |||
transport-related parameters between peers. Note that RPC-over- | transport-related parameters between peers. Note that RPC-over- | |||
RDMA Version One does not specify any mechanism for changing any | RDMA Version One does not specify any mechanism for changing any | |||
transport-related parameter after a connection has been | transport-related parameter after a connection has been | |||
established. | established. | |||
o Multiple Upper Layer Protocols may share a single RPC-over-RDMA | o Multiple Upper Layer Protocols may share a single RPC-over-RDMA | |||
End of changes. 41 change blocks. | ||||
79 lines changed or deleted | 91 lines changed or added | |||