draft-ietf-nfsv4-rfc5666-implementation-experience-01.txt | draft-ietf-nfsv4-rfc5666-implementation-experience-02.txt | |||
---|---|---|---|---|
Network File System Version 4 C. Lever | Network File System Version 4 C. Lever | |||
Internet-Draft Oracle | Internet-Draft Oracle | |||
Intended status: Informational February 23, 2016 | Intended status: Informational April 8, 2016 | |||
Expires: August 26, 2016 | Expires: October 10, 2016 | |||
RPC-over-RDMA Version One Implementation Experience | RPC-over-RDMA Version One Implementation Experience | |||
draft-ietf-nfsv4-rfc5666-implementation-experience-01 | draft-ietf-nfsv4-rfc5666-implementation-experience-02 | |||
Abstract | Abstract | |||
This document details experiences and challenges implementing the | This document details experiences and challenges implementing the | |||
RPC-over-RDMA Version One protocol. Specification changes are | RPC-over-RDMA Version One protocol. Specification changes are | |||
recommended to address avoidable interoperability failures. | recommended to address avoidable interoperability failures. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
skipping to change at page 1, line 32 ¶ | skipping to change at page 1, line 32 ¶ | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on August 26, 2016. | This Internet-Draft will expire on October 10, 2016. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 | 1.1. Purpose Of This Document . . . . . . . . . . . . . . . . 3 | |||
1.2. Purpose Of This Document . . . . . . . . . . . . . . . . 3 | 1.2. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 3 | |||
1.3. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 3 | 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 4 | |||
2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 4 | 2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 4 | |||
2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 4 | 2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 4 | |||
2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5 | 2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5 | |||
2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 6 | 2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.4. Upper Layer Binding Specifications . . . . . . . . . . . 7 | 2.4. Upper Layer Binding Specifications . . . . . . . . . . . 7 | |||
2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8 | 2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8 | |||
3. Specification Issues . . . . . . . . . . . . . . . . . . . . 14 | 3. Specification Issues . . . . . . . . . . . . . . . . . . . . 14 | |||
3.1. Extensibility Considerations . . . . . . . . . . . . . . 14 | 3.1. Extensibility Considerations . . . . . . . . . . . . . . 14 | |||
3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 15 | 3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 15 | |||
3.3. The Position Zero Read Chunk . . . . . . . . . . . . . . 18 | 3.3. Additional XDR Issues . . . . . . . . . . . . . . . . . . 18 | |||
3.4. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 20 | 3.4. The Position Zero Read Chunk . . . . . . . . . . . . . . 19 | |||
3.5. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 21 | 3.5. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 21 | |||
3.6. Padding Inline Content After A Chunk . . . . . . . . . . 22 | 3.6. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 22 | |||
3.7. Write Chunk XDR Roundup . . . . . . . . . . . . . . . . . 24 | 3.7. Padding Inline Content After A Chunk . . . . . . . . . . 23 | |||
3.8. Write List Error Cases . . . . . . . . . . . . . . . . . 26 | 3.8. Write Chunk XDR Roundup . . . . . . . . . . . . . . . . . 25 | |||
4. Operational Considerations . . . . . . . . . . . . . . . . . 29 | 3.9. Write List Error Cases . . . . . . . . . . . . . . . . . 27 | |||
4.1. Computing Request Buffer Requirements . . . . . . . . . . 29 | 4. Operational Considerations . . . . . . . . . . . . . . . . . 30 | |||
4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 30 | 4.1. Computing Request Buffer Requirements . . . . . . . . . . 30 | |||
4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 30 | 4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 31 | |||
4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 31 | 4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 31 | |||
4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 32 | 4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 32 | |||
5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 32 | 4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 33 | |||
5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 32 | 4.6. Detection Of Unsupported Protocol Versions . . . . . . . 33 | |||
6. Considerations For Upper Layer Binding Specifications . . . . 33 | 5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 34 | |||
6.1. Organization Of Binding Specification Requirements . . . 33 | 5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 34 | |||
6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 34 | 6. Considerations For Upper Layer Binding Specifications . . . . 35 | |||
6.3. Inline Threshold Requirements . . . . . . . . . . . . . . 35 | 6.1. Organization Of Binding Specification Requirements . . . 35 | |||
6.4. Violations Of Binding Rules . . . . . . . . . . . . . . . 36 | 6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 35 | |||
6.5. Binding Specification Completion Assessment . . . . . . . 37 | 6.3. Inline Threshold Requirements . . . . . . . . . . . . . . 37 | |||
7. Unimplemented Protocol Features . . . . . . . . . . . . . . . 38 | 6.4. Violations Of Binding Rules . . . . . . . . . . . . . . . 38 | |||
7.1. Unimplemented Features To Be Removed . . . . . . . . . . 38 | 6.5. Binding Specification Completion Assessment . . . . . . . 39 | |||
7.2. Unimplemented Features To Be Retained . . . . . . . . . . 39 | 7. Unimplemented Protocol Features . . . . . . . . . . . . . . . 39 | |||
8. Security Considerations . . . . . . . . . . . . . . . . . . . 41 | 7.1. Unimplemented Features To Be Removed . . . . . . . . . . 39 | |||
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 41 | 7.2. Unimplemented Features To Be Retained . . . . . . . . . . 41 | |||
10. Appendix A: XDR Language Description . . . . . . . . . . . . 42 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 43 | |||
11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 45 | 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 43 | |||
12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 46 | 10. Appendix A: XDR Language Description . . . . . . . . . . . . 43 | |||
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 46 | 11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 46 | |||
13.1. Normative References . . . . . . . . . . . . . . . . . . 46 | 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 48 | |||
13.2. Informative References . . . . . . . . . . . . . . . . . 48 | 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 48 | |||
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 48 | 13.1. Normative References . . . . . . . . . . . . . . . . . . 48 | |||
13.2. Informative References . . . . . . . . . . . . . . . . . 49 | ||||
1. Introduction | ||||
1.1. Requirements Language | ||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 49 | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | ||||
"OPTIONAL" in this document are to be interpreted as described in | ||||
[RFC2119]. | ||||
1.2. Purpose Of This Document | 1. Introduction | |||
This document summarizes implementation experience with the RPC-over- | This document summarizes implementation experience with the RPC-over- | |||
RDMA Version One protocol [RFC5666], and proposes improvements to the | RDMA Version One protocol [RFC5666], and proposes improvements to the | |||
protocol specification based on implementer experience, frequently- | protocol specification based on implementer experience, frequently- | |||
asked questions, and interviews with a co-author of RFC 5666. | asked questions, and interviews with a co-author of RFC 5666. | |||
1.1. Purpose Of This Document | ||||
A key contribution of this document is to highlight areas of RFC 5666 | A key contribution of this document is to highlight areas of RFC 5666 | |||
where independent good faith readings could result in distinct | where independent good faith readings could result in distinct | |||
implementations that do not interoperate with each other. Correcting | implementations that do not interoperate with each other. Correcting | |||
these specification issues is critical: fresh implementations of RPC- | these specification issues is critical: fresh implementations of RPC- | |||
over-RDMA Version One continue to arise. | over-RDMA Version One continue to arise. | |||
Recommendations are limited to the following areas: | Recommendations are limited to the following areas: | |||
o Repairing specification ambiguities | o Repairing specification ambiguities | |||
o Codifying successful implementation practices and conventions | o Codifying successful implementation practices and conventions | |||
o Clarifying the role of Upper Layer Binding specifications | o Clarifying the role of Upper Layer Binding specifications | |||
o Exploring protocol enhancements that might be added while allowing | o Exploring protocol enhancements that might be added while allowing | |||
extant implementations to interoperate with enhanced | extant implementations to interoperate with enhanced | |||
implementations | implementations | |||
1.3. Updating RFC 5666 | 1.2. Updating RFC 5666 | |||
During IETF 92, several alternatives for updating RFC 5666 were | During IETF 92, several alternatives for updating RFC 5666 were | |||
discussed with the RFC Editor and with the assembled members of the | discussed with the RFC Editor and with the assembled members of the | |||
nfsv4 Working Group. Among them were: | nfsv4 Working Group. Among them were: | |||
o Filing individual errata for each issue | o Filing individual errata for each issue | |||
o Introducing a new RFC that updates but does not obsolete RFC 5666, | o Introducing a new RFC that updates but does not obsolete RFC 5666, | |||
but makes no change to the protocol | but makes no change to the protocol | |||
skipping to change at page 4, line 22 ¶ | skipping to change at page 4, line 16 ¶ | |||
update and obsolete RFC 5666 while retaining a high degree of | update and obsolete RFC 5666 while retaining a high degree of | |||
interoperability with current RPC-over-RDMA Version One | interoperability with current RPC-over-RDMA Version One | |||
implementations. This approach would avoid changes to on-the-wire | implementations. This approach would avoid changes to on-the-wire | |||
behavior without burdening implementers, who could continue to | behavior without burdening implementers, who could continue to | |||
reference a single specification of the protocol. In addition, this | reference a single specification of the protocol. In addition, this | |||
alternative extends the life of current interoperable RPC-over-RDMA | alternative extends the life of current interoperable RPC-over-RDMA | |||
Version One implementations in the field. | Version One implementations in the field. | |||
Subsequent discussion within the nfsv4 Working Group has focused on | Subsequent discussion within the nfsv4 Working Group has focused on | |||
resolving specification ambiguities that make the construction of | resolving specification ambiguities that make the construction of | |||
interoperable implementations unduly difficult. A Version Two of | interoperable implementations unduly difficult. Subsequent Versions | |||
RPC-over-RDMA, where deeper changes can be made and new functionality | of RPC-over-RDMA, where deeper changes can be made and new | |||
introduced, remains a possibility. | functionality introduced, remain a possibility. | |||
1.3. Requirements Language | ||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | ||||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | ||||
"OPTIONAL" in this document are to be interpreted as described in | ||||
[RFC2119]. | ||||
2. RPC-Over-RDMA Essentials | 2. RPC-Over-RDMA Essentials | |||
The following sections summarize the state of affairs defined in RFC | The following sections summarize the state of affairs defined in RFC | |||
5666. This is a distillation of text from RFC 5666, dialog with a | 5666. This is a distillation of text from RFC 5666, dialog with a | |||
co-author of RFC 5666, and implementer experience. The XDR | co-author of RFC 5666, and implementer experience. The XDR | |||
definitions are copied from RFC 5666 Section 4.3. | definitions are copied from RFC 5666 Section 4.3. | |||
2.1. Arguments And Results | 2.1. Arguments And Results | |||
skipping to change at page 18, line 44 ¶ | skipping to change at page 18, line 44 ¶ | |||
rpc_rdma_header does not comprise the entire RPC-over-RDMA header, it | rpc_rdma_header does not comprise the entire RPC-over-RDMA header, it | |||
should be renamed rpcrdma1_chunks to avoid confusion. | should be renamed rpcrdma1_chunks to avoid confusion. | |||
XDR definitions should be enclosed in CODE BEGINS and CODE ENDS | XDR definitions should be enclosed in CODE BEGINS and CODE ENDS | |||
delimiters. An appropriate copyright block should accompany the XDR | delimiters. An appropriate copyright block should accompany the XDR | |||
definitions in RFC 5666bis. An XDR extraction shell script should be | definitions in RFC 5666bis. An XDR extraction shell script should be | |||
provided in the text. | provided in the text. | |||
See Section 10 for a full listing of the proposed XDR definitions. | See Section 10 for a full listing of the proposed XDR definitions. | |||
3.3. The Position Zero Read Chunk | 3.3. Additional XDR Issues | |||
3.3.1. Mechanical Issues | ||||
There are some mechanical problems with the XDR language definition | ||||
of RPC-over-RDMA Version One provided in Section 4.3 of [RFC5666]: | ||||
o No copyright boilerplate is provided | ||||
o An extraction script is not provided, and there is no escape | ||||
sequence around the code | ||||
o There is at least one XDR definition error that prevents the | ||||
extracted XDR from compiling | ||||
3.3.2. XDR Definition Recursiveness | ||||
The usual practice when defining an XDR-based protocol is that there | ||||
is one encompassing data type that represents one message in the | ||||
protocol. | ||||
This is not true for RPC-over-RDMA. The header is defined by one | ||||
data type (struct rdma_msg) but the RPC message payload is not | ||||
formally represented in the XDR definition in Section 4.3. The | ||||
presence or absence of the RPC message payload is indicated by the | ||||
message type, and the body of that payload is noted only with a code | ||||
comment. | ||||
3.3.3. Recommendations | ||||
The XDR presented in RFC 5666bis should correct the deficiencies | ||||
described above. | ||||
To correct the lack of formal recursiveness without forcing an | ||||
on-the-wire behavior change, RFC 5666bis should place the RPC-over- | ||||
RDMA header and the RPC message payload in separate XDR streams. | ||||
3.4. The Position Zero Read Chunk | ||||
RFC 5666 Section 5.1 defines the operation of the Position Zero read | RFC 5666 Section 5.1 defines the operation of the Position Zero read | |||
chunk. A requester uses the Position Zero read chunk in place of | chunk. A requester uses the Position Zero read chunk in place of | |||
inline content. A requester is required to use the Position Zero | inline content. A requester is required to use the Position Zero | |||
read chunk when the total size of an RPC call message exceeds the | read chunk when the total size of an RPC call message exceeds the | |||
size of the responder's receive buffers, and RDMA-eligible data has | size of the responder's receive buffers, and RDMA-eligible data has | |||
already been removed from the message. | already been removed from the message. | |||
RFC 5666 Section 3.4 says: | RFC 5666 Section 3.4 says: | |||
skipping to change at page 20, line 20 ¶ | skipping to change at page 21, line 10 ¶ | |||
read segment in Position Zero would limit the maximum size of RPC- | read segment in Position Zero would limit the maximum size of RPC- | |||
over-RDMA messages to a single page. Allowing multiple read | over-RDMA messages to a single page. Allowing multiple read | |||
segments means the message size can be as large as the maximum | segments means the message size can be as large as the maximum | |||
number of read chunks that can be sent in an RPC-over-RDMA header. | number of read chunks that can be sent in an RPC-over-RDMA header. | |||
RFC 5666 does not limit the number of read segments in a read chunk, | RFC 5666 does not limit the number of read segments in a read chunk, | |||
nor does it limit the number of chunks that can appear in the Read | nor does it limit the number of chunks that can appear in the Read | |||
list. The Position Zero read chunk, despite its name, is not limited | list. The Position Zero read chunk, despite its name, is not limited | |||
to a single xdr_read_chunk. | to a single xdr_read_chunk. | |||
3.3.1. Recommendations | 3.4.1. Recommendations | |||
RFC 5666bis should state that the guidelines in RFC 5666 Section 3.4 | RFC 5666bis should state that the guidelines in RFC 5666 Section 3.4 | |||
apply only to RDMA_MSG type calls. When the Position Zero read chunk | apply only to RDMA_MSG type calls. When the Position Zero read chunk | |||
is introduced in RFC 5666 Section 5.1, enumerate the differences | is introduced in RFC 5666 Section 5.1, enumerate the differences | |||
between it and the read chunks previously described in RFC 5666 | between it and the read chunks previously described in RFC 5666 | |||
Section 3.4. | Section 3.4. | |||
RFC 5666bis should describe what restrictions an Upper Layer Binding | RFC 5666bis should describe what restrictions an Upper Layer Binding | |||
may make on Position Zero read chunks. | may make on Position Zero read chunks. | |||
3.4. RDMA_NOMSG Call Messages | 3.5. RDMA_NOMSG Call Messages | |||
The second paragraph of RFC 5667 Section 4 says, in reference to | The second paragraph of RFC 5667 Section 4 says, in reference to | |||
NFSv2 and NFSv3 WRITE and SYMLINK operations: | NFSv2 and NFSv3 WRITE and SYMLINK operations: | |||
. . . a single RDMA Read list entry MAY be posted by the client to | . . . a single RDMA Read list entry MAY be posted by the client to | |||
supply the opaque file data for a WRITE request or the pathname | supply the opaque file data for a WRITE request or the pathname | |||
for a SYMLINK request. The server MUST ignore any Read list for | for a SYMLINK request. The server MUST ignore any Read list for | |||
other NFS procedures, as well as additional Read list entries | other NFS procedures, as well as additional Read list entries | |||
beyond the first in the list. | beyond the first in the list. | |||
skipping to change at page 21, line 26 ¶ | skipping to change at page 22, line 15 ¶ | |||
However, there is a class of RPC operations where RDMA_NOMSG with | However, there is a class of RPC operations where RDMA_NOMSG with | |||
multiple read chunks is useful: when the body of an RPC call message | multiple read chunks is useful: when the body of an RPC call message | |||
is larger than the inline buffer size, even after RDMA-eligible | is larger than the inline buffer size, even after RDMA-eligible | |||
argument data has been moved to read chunks. | argument data has been moved to read chunks. | |||
A similar discussion applies to RDMA_NOMSG replies with large reply | A similar discussion applies to RDMA_NOMSG replies with large reply | |||
bodies and RDMA-eligible result data. Such replies would use both | bodies and RDMA-eligible result data. Such replies would use both | |||
the Write list and the Reply chunk simultaneously. However, write | the Write list and the Reply chunk simultaneously. However, write | |||
chunks do not have Position fields. | chunks do not have Position fields. | |||
3.4.1. Recommendations | 3.5.1. Recommendations | |||
RFC 5666bis should continue to allow RDMA_NOMSG type calls with | RFC 5666bis should continue to allow RDMA_NOMSG type calls with | |||
additional read chunks. The rules about RDMA-eligibility in RFC | additional read chunks. The rules about RDMA-eligibility in RFC | |||
5666bis should discuss when the use of this construction is | 5666bis should discuss when the use of this construction is | |||
beneficial, and when it should be avoided. | beneficial, and when it should be avoided. | |||
Authors of Upper Layer Bindings should be warned about ignoring these | Authors of Upper Layer Bindings should be warned about ignoring these | |||
cases. RFC 5666bis should provide a default behavior that applies | cases. RFC 5666bis should provide a default behavior that applies | |||
when Upper Layer Bindings omit this discussion. | when Upper Layer Bindings omit this discussion. | |||
3.5. RDMA_MSG Call with Position Zero Read Chunk | 3.6. RDMA_MSG Call with Position Zero Read Chunk | |||
The first item in the header of both RPC calls and RPC replies is the | The first item in the header of both RPC calls and RPC replies is the | |||
XID field [RFC5531]. RFC 5666 Section 4.1 says: | XID field [RFC5531]. RFC 5666 Section 4.1 says: | |||
A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by | A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by | |||
the RPC call or RPC reply message body, beginning with the XID. | the RPC call or RPC reply message body, beginning with the XID. | |||
This is a strong implication that the RPC header in an RDMA_MSG type | This is a strong implication that the RPC header in an RDMA_MSG type | |||
message starts at XDR position zero. Assume for a moment that, by | message starts at XDR position zero. Assume for a moment that, by | |||
definition, the RPC header in an RPC-over-RDMA XDR stream starts at | definition, the RPC header in an RPC-over-RDMA XDR stream starts at | |||
skipping to change at page 22, line 19 ¶ | skipping to change at page 23, line 7 ¶ | |||
just like the RPC header does. In an RDMA_NOMSG type call message, which | just like the RPC header does. In an RDMA_NOMSG type call message, which | |||
does not include an RPC header, a Position Zero read chunk conveys | does not include an RPC header, a Position Zero read chunk conveys | |||
the RPC header. | the RPC header. | |||
There is no prohibition in RFC 5666 against an RDMA_MSG type call | There is no prohibition in RFC 5666 against an RDMA_MSG type call | |||
message with a Position Zero read chunk. However, it's not clear | message with a Position Zero read chunk. However, it's not clear | |||
how a responder should interpret such a message. RFC 5666 requires | how a responder should interpret such a message. RFC 5666 requires | |||
the RPC header to start at XDR position zero, but there is a Position | the RPC header to start at XDR position zero, but there is a Position | |||
Zero read chunk, which also starts at XDR position zero. | Zero read chunk, which also starts at XDR position zero. | |||
3.5.1. Recommendations | 3.6.1. Recommendations | |||
RFC 5666bis should clearly define what is meant by an XDR stream. | RFC 5666bis should clearly define what is meant by an XDR stream. | |||
RFC 5666bis should state that the value in the xdr_read_chunk | RFC 5666bis should state that the value in the xdr_read_chunk | |||
"position" field is measured relative to the start of the RPC header, | "position" field is measured relative to the start of the RPC header, | |||
which is the first byte of the header's XID field. | which is the first byte of the header's XID field. | |||
RFC 5666bis should prohibit requesters from providing a Position Zero | RFC 5666bis should prohibit requesters from providing a Position Zero | |||
read chunk in RDMA_MSG type calls. Likewise, RFC 5666bis should | read chunk in RDMA_MSG type calls. Likewise, RFC 5666bis should | |||
prohibit responders from utilizing a Reply chunk in RDMA_MSG type | prohibit responders from utilizing a Reply chunk in RDMA_MSG type | |||
replies. | replies. | |||
The diagrams in RFC 5666 Section 3.8 which number chunks starting | The diagrams in RFC 5666 Section 3.8 which number chunks starting | |||
with 1 should be revised. Readers confuse this number with an XDR | with 1 should be revised. Readers confuse this number with an XDR | |||
position. | position. | |||
3.6. Padding Inline Content After A Chunk | 3.7. Padding Inline Content After A Chunk | |||
To help clarify the discussion in this section, the term "read chunk" | To help clarify the discussion in this section, the term "read chunk" | |||
here always means the new definition where one or more read segments | here always means the new definition where one or more read segments | |||
that have identical values in their Position fields represent | that have identical values in their Position fields represent | |||
exactly one RDMA-eligible XDR object. | exactly one RDMA-eligible XDR object. | |||
A read chunk conveys a large argument payload via one or more RDMA | A read chunk conveys a large argument payload via one or more RDMA | |||
transfers. For instance, the data payload of an NFS WRITE operation | transfers. For instance, the data payload of an NFS WRITE operation | |||
may be transferred using a read chunk [RFC5667]. | may be transferred using a read chunk [RFC5667]. | |||
skipping to change at page 24, line 12 ¶ | skipping to change at page 25, line 5 ¶ | |||
any requirements for XDR padding and alignment when a read chunk is | any requirements for XDR padding and alignment when a read chunk is | |||
followed in the XDR stream by more inline content. | followed in the XDR stream by more inline content. | |||
Applying the rules of XDR, the XDR pad for the read chunk must not | Applying the rules of XDR, the XDR pad for the read chunk must not | |||
appear in the inline content, even if it was also not included in the | appear in the inline content, even if it was also not included in the | |||
chunk itself. This is because the inline content that preceded the | chunk itself. This is because the inline content that preceded the | |||
read chunk will have been padded to 4-byte alignment. The next | read chunk will have been padded to 4-byte alignment. The next | |||
position in the inline buffer is already on a 4-byte boundary, thus | position in the inline buffer is already on a 4-byte boundary, thus | |||
no padding is necessary. | no padding is necessary. | |||
3.6.1. Recommendations | 3.7.1. Recommendations | |||
State the above requirement in RFC 5666bis in its equivalent of RFC | State the above requirement in RFC 5666bis in its equivalent of RFC | |||
5666 Section 3.7. When a responder forms a reply, the same | 5666 Section 3.7. When a responder forms a reply, the same | |||
restriction applies to inline content interleaved with write chunks. | restriction applies to inline content interleaved with write chunks. | |||
Because all XDR objects must start on an XDR alignment boundary, all | Because all XDR objects must start on an XDR alignment boundary, all | |||
read and write chunks and all inline XDR objects in any XDR stream | read and write chunks and all inline XDR objects in any XDR stream | |||
must start on an XDR alignment boundary. This has implications for | must start on an XDR alignment boundary. This has implications for | |||
the values allowed in read chunk Position fields, for how XDR roundup | the values allowed in read chunk Position fields, for how XDR roundup | |||
works for chunks, and for how XDR objects are placed in inline | works for chunks, and for how XDR objects are placed in inline | |||
buffers. XDR alignment in inline buffers is always relative to | buffers. XDR alignment in inline buffers is always relative to | |||
Position Zero (that is, where the RPC header starts). | Position Zero (that is, where the RPC header starts). | |||
3.7. Write Chunk XDR Roundup | 3.8. Write Chunk XDR Roundup | |||
The final paragraph of RFC 5666 Section 3.7 says: | The final paragraph of RFC 5666 Section 3.7 says: | |||
For RDMA Write Chunks, a simpler encoding method applies. Again, | For RDMA Write Chunks, a simpler encoding method applies. Again, | |||
roundup bytes are not transferred, instead the chunk length sent | roundup bytes are not transferred, instead the chunk length sent | |||
to the receiver in the reply is simply increased to include any | to the receiver in the reply is simply increased to include any | |||
roundup. | roundup. | |||
A responder should avoid writing XDR pad bytes, as the requester's | A responder should avoid writing XDR pad bytes, as the requester's | |||
upper layer does not reference them, though the language does not | upper layer does not reference them, though the language does not | |||
skipping to change at page 26, line 5 ¶ | skipping to change at page 26, line 46 ¶ | |||
These implementations may not be 100% interoperable. The language of | These implementations may not be 100% interoperable. The language of | |||
Section 3.7 of [RFC5666] appears to allow all of this behavior (in | Section 3.7 of [RFC5666] appears to allow all of this behavior (in | |||
particular, it does not prohibit a responder from writing the XDR pad | particular, it does not prohibit a responder from writing the XDR pad | |||
using RFC2119-style keywords, and does not require that requesters | using RFC2119-style keywords, and does not require that requesters | |||
register the extra space to accommodate the XDR pad). | register the extra space to accommodate the XDR pad). | |||
Note that because the Reply chunk is a write chunk, these roundup | Note that because the Reply chunk is a write chunk, these roundup | |||
rules also apply to it. | rules also apply to it. | |||
3.7.1. Recommendations | 3.8.1. Recommendations | |||
The current specification allows XDR pad bytes to leak into user | The current specification allows XDR pad bytes to leak into user | |||
buffers, and none of the current implementations prevent this leak. | buffers, and none of the current implementations prevent this leak. | |||
There may be room to adjust the protocol specification independently | There may be room to adjust the protocol specification independently | |||
of current implementation behavior. | of current implementation behavior. | |||
RFC 5666bis should explicitly discuss the requirements around write | RFC 5666bis should explicitly discuss the requirements around write | |||
chunk roundup separately from the discussion of read chunk roundup. | chunk roundup separately from the discussion of read chunk roundup. | |||
Explicit RFC2119-style interoperability requirements should be | Explicit RFC2119-style interoperability requirements should be | |||
provided for write chunks. Responders MUST NOT write XDR pad bytes | provided for write chunks. Responders MUST NOT write XDR pad bytes | |||
at the end of a Write chunk. | at the end of a Write chunk. | |||
Allocating and registering extra space for XDR pad bytes that are | Allocating and registering extra space for XDR pad bytes that are | |||
never written is wasteful. RFC 5666bis should forbid it. Responders | never written is wasteful. RFC 5666bis should forbid it. Responders | |||
should not expect requesters to provide space for XDR pad bytes. | should not expect requesters to provide space for XDR pad bytes. | |||
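As a rough illustration of the recommended responder behavior, the sketch below plans the RDMA Writes for one Write chunk. It writes only the data bytes, never the XDR pad, and does not assume the requester registered room for a pad. The function name plan_write_chunk and its arguments are hypothetical.

```python
def plan_write_chunk(segment_lengths, data_len):
    """Return the number of bytes to RDMA Write into each segment.

    segment_lengths: lengths of the segments the requester provided
                     for this Write chunk, in order.
    data_len:        length of the result data item, excluding any
                     XDR pad.

    Only data_len bytes are planned. The XDR pad is never written, so
    the requester needs no extra registered space for it.
    """
    if data_len > sum(segment_lengths):
        raise ValueError("Write chunk too small for result data")

    plan = []
    remaining = data_len
    for seg_len in segment_lengths:
        used = min(seg_len, remaining)
        plan.append(used)  # 0 means this segment is left unused
        remaining -= used
    return plan
```

For a 4099-byte result and two 4096-byte segments, this plan writes 4096 bytes and then 3 bytes. The single pad byte that XDR would require inline is not written.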
3.8. Write List Error Cases | 3.9. Write List Error Cases | |||
RFC 5666 Section 3.6 says: | RFC 5666 Section 3.6 says: | |||
When a write chunk list is provided for the results of the RPC | When a write chunk list is provided for the results of the RPC | |||
call, the RPC server MUST provide any corresponding data via RDMA | call, the RPC server MUST provide any corresponding data via RDMA | |||
Write to the memory referenced in the chunk list entries. | Write to the memory referenced in the chunk list entries. | |||
This requires the responder to use the Write list when it is | This requires the responder to use the Write list when it is | |||
provided. In other words, a responder is not permitted to | provided. In other words, a responder is not permitted to | |||
return bulk data inline or in the Reply chunk when the requester has | return bulk data inline or in the Reply chunk when the requester has | |||
skipping to change at page 29, line 11 ¶ | skipping to change at page 30, line 11 ¶ | |||
the reply. It should be the responsibility of the Upper Layer | the reply. It should be the responsibility of the Upper Layer | |||
Binding to avoid ambiguous situations by appropriately restricting | Binding to avoid ambiguous situations by appropriately restricting | |||
RDMA-eligible data items. | RDMA-eligible data items. | |||
Remember that a responder MUST use the Write list if the requester | Remember that a responder MUST use the Write list if the requester | |||
provided it and the responder has RDMA-eligible result data. If the | provided it and the responder has RDMA-eligible result data. If the | |||
requester has not provided enough Write chunks in the Write list, the | requester has not provided enough Write chunks in the Write list, the | |||
responder may have to use a long message as well, depending on the | responder may have to use a long message as well, depending on the | |||
remaining size of the RPC reply. | remaining size of the RPC reply. | |||
3.8.1. Recommendations | 3.9.1. Recommendations | |||
RFC 5666bis should explicitly discuss responder behavior when an RPC | RFC 5666bis should explicitly discuss responder behavior when an RPC | |||
reply does not need to use a Write list entry provided by a | reply does not need to use a Write list entry provided by a | |||
requester. This is generic behavior, independent of any Upper Layer | requester. This is generic behavior, independent of any Upper Layer | |||
Binding. The explanation can be partially or wholly copied from RFC | Binding. The explanation can be partially or wholly copied from RFC | |||
5667 Section 5's discussion of NFSv4 COMPOUND. | 5667 Section 5's discussion of NFSv4 COMPOUND. | |||
A number of places in RFC 5666 Section 3.6 hint at how a responder | A number of places in RFC 5666 Section 3.6 hint at how a responder | |||
behaves when it is to return data that does not use every byte of | behaves when it is to return data that does not use every byte of | |||
every provided Write chunk segment. RFC 5666bis should state | every provided Write chunk segment. RFC 5666bis should state | |||
skipping to change at page 32, line 44 ¶ | skipping to change at page 33, line 44 ¶ | |||
where the sum of posted and in-process receive buffers is less than | where the sum of posted and in-process receive buffers is less than | |||
its advertised credit limit. In either case, such a window could | its advertised credit limit. In either case, such a window could | |||
result in lost messages or be catastrophic for the transport | result in lost messages or be catastrophic for the transport | |||
connection. | connection. | |||
4.5.1. Recommendations | 4.5.1. Recommendations | |||
Clarify or remove the dependent clause in the section in RFC 5666bis | Clarify or remove the dependent clause in the section in RFC 5666bis | |||
that is equivalent to RFC 5666 Section 3.3. | that is equivalent to RFC 5666 Section 3.3. | |||
4.6. Detection Of Unsupported Protocol Versions | ||||
Section 4.2 of [RFC5666] is explicit about how a responder must | ||||
handle RPC-over-RDMA messages that carry an unrecognized RPC-over- | ||||
RDMA protocol version: | ||||
When a peer receives an RPC RDMA message, it MUST perform the | ||||
following basic validity checks on the header and chunk contents. | ||||
If such errors are detected in the request, an RDMA_ERROR reply | ||||
MUST be generated. | ||||
When the peer detects an RPC-over-RDMA header version that it does | ||||
not support (currently this document defines only version 1), it | ||||
replies with an error code of ERR_VERS, and provides the low and | ||||
high inclusive version numbers it does, in fact, support. The | ||||
version number in this reply MUST be any value otherwise valid at | ||||
the receiver. | ||||
However, one widely deployed RPC-over-RDMA Version One server | ||||
implementation is known to discard requests that do not contain the | ||||
value one (1) in their rdma_vers field. This server implementation | ||||
does not reply with RDMA_ERROR / RDMA_ERR_VERS in this case. | ||||
Without a proper protocol version detection mechanism, it is not | ||||
possible for RPC-over-RDMA Version One implementations to | ||||
interoperate with implementations that support newer protocol | ||||
versions. | ||||
4.6.1. Recommendations | ||||
RPC-over-RDMA Version One implementations that discard non-Version | ||||
One requests without an error response are considered non-compliant | ||||
with [RFC5666]. No changes to the specification are needed. | ||||
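The compliant behavior quoted above might look like the following sketch. It assumes the fixed header layout from the RFC 5666 XDR (rdma_xid, rdma_vers, rdma_credit, rdma_proc as four big-endian 32-bit words) and the values RDMA_ERROR = 4 and ERR_VERS = 1. The function name check_version, the credits parameter, and the choice to put the highest supported version in the reply's rdma_vers field are assumptions made for illustration.

```python
import struct

RDMA_ERROR = 4      # rdma_proc value for an error reply
ERR_VERS = 1        # error code: unsupported protocol version
SUPPORTED_LOW = 1   # this sketch supports Version One only
SUPPORTED_HIGH = 1


def check_version(header: bytes, credits: int = 1):
    """Validate rdma_vers in a received RPC-over-RDMA header.

    Returns None when the version is supported. Otherwise returns an
    encoded RDMA_ERROR / ERR_VERS reply carrying the low and high
    version numbers this endpoint supports, rather than silently
    discarding the request.
    """
    if len(header) < 16:
        raise ValueError("short RPC-over-RDMA header")
    xid, vers, _credit, _proc = struct.unpack(">IIII", header[:16])
    if SUPPORTED_LOW <= vers <= SUPPORTED_HIGH:
        return None
    return struct.pack(
        ">IIIIIII",
        xid,             # rdma_xid copied from the request
        SUPPORTED_HIGH,  # rdma_vers: a version this endpoint accepts
        credits,         # rdma_credit
        RDMA_ERROR,      # rdma_proc
        ERR_VERS,        # rdma_err
        SUPPORTED_LOW,   # rdma_vers_low
        SUPPORTED_HIGH,  # rdma_vers_high
    )
```

A requester that probes with a newer version could read the returned range and fall back to Version One, which is not possible when the request is simply dropped.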
5. Pre-requisites For NFSv4 | 5. Pre-requisites For NFSv4 | |||
5.1. Bi-directional Operation | 5.1. Bi-directional Operation | |||
NFSv4.1 moves the backchannel onto the same transport as forward | NFSv4.1 moves the backchannel onto the same transport as forward | |||
requests [RFC5661]. Typically RPC client endpoints do not expect to | requests [RFC5661]. Typically RPC client endpoints do not expect to | |||
receive RPC call messages. To support NFSv4.1 callback operations, | receive RPC call messages. To support NFSv4.1 callback operations, | |||
client and server implementations must be updated to support bi- | client and server implementations must be updated to support bi- | |||
directional operation. | directional operation. | |||
skipping to change at page 38, line 41 ¶ | skipping to change at page 40, line 19 ¶ | |||
7.1.2. Read-Read Transfer Model | 7.1.2. Read-Read Transfer Model | |||
All existing RPC-over-RDMA Version One implementations use a Read- | All existing RPC-over-RDMA Version One implementations use a Read- | |||
Write data transfer model. The server endpoint is responsible for | Write data transfer model. The server endpoint is responsible for | |||
initiating all RDMA data transfers. The Read-Read transfer model has | initiating all RDMA data transfers. The Read-Read transfer model has | |||
been deprecated, but because it appears in RFC 5666, implementations | been deprecated, but because it appears in RFC 5666, implementations | |||
are still responsible for supporting it. By removing the | are still responsible for supporting it. By removing the | |||
specification and discussion of Read-Read, the protocol and | specification and discussion of Read-Read, the protocol and | |||
specification can be made simpler and clearer. | specification can be made simpler and clearer. | |||
Once the Read-Read transfer model is no longer supported, a responder | ||||
would no longer be allowed to send a Read list to a requester. | ||||
Sending a Read list would be needed if a requester has not provided | ||||
enough memory space in the form of a Reply chunk or Write list to | ||||
receive a large RPC Reply. | ||||
There is currently no mechanism in the RPC-over-RDMA Version One | ||||
protocol for a responder to indicate that inadequate reply buffer | ||||
resources were provided by a requester. Therefore, requesters should | ||||
be fully responsible for providing all necessary memory resources to | ||||
receive each RPC reply, including a properly populated Write list | ||||
and/or a Reply chunk. | ||||
7.1.2.1. Recommendations | 7.1.2.1. Recommendations | |||
Remove Read-Read from RFC 5666bis, in particular from its equivalent | Remove Read-Read from RFC 5666bis, in particular from its equivalent | |||
of RFC 5666 Section 3.8. RFC 5666bis should require implementations | of RFC 5666 Section 3.8. RFC 5666bis should require implementations | |||
not to send RDMA_DONE; an implementation receiving it should ignore | not to send RDMA_DONE; an implementation receiving it should ignore | |||
it. The XDR definition should reserve RDMA_DONE. | it. The XDR definition should reserve RDMA_DONE. RFC 5666bis should | |||
explicitly state requirements for requesters to allocate and prepare | ||||
reply buffer resources for each RPC-over-RDMA message. | ||||
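One way a requester could check that it has provided adequate reply resources is sketched below. This is a simplified model, not a statement of what RFC 5666bis will require. It ignores XDR padding, assumes Write chunks are matched to RDMA-eligible result items in order, and treats the inline threshold as a parameter. All names are hypothetical.

```python
def reply_resources_sufficient(max_reply_len, eligible_lens,
                               write_chunk_lens, reply_chunk_len,
                               inline_threshold):
    """Estimate whether the largest possible reply can be received.

    max_reply_len:    largest possible XDR-encoded RPC reply, in bytes
    eligible_lens:    maximum sizes of the RDMA-eligible result items
    write_chunk_lens: sizes of the Write chunks provided, in order
    reply_chunk_len:  size of the Reply chunk (0 if none provided)
    inline_threshold: largest reply the requester can receive inline
    """
    body_len = max_reply_len
    for item_len, chunk_len in zip(eligible_lens, write_chunk_lens):
        if chunk_len < item_len:
            return False      # Write chunk cannot hold its data item
        body_len -= item_len  # item moves out of the reply body
    # The remainder must fit inline or in the Reply chunk.
    return body_len <= inline_threshold or body_len <= reply_chunk_len
```

Because the responder has no way to report inadequate reply resources, a requester would run a check of this kind before sending the call, not after.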
7.1.3. RDMA_MSGP | 7.1.3. RDMA_MSGP | |||
It has been observed that the current specification of RDMA_MSGP is | It has been observed that the current specification of RDMA_MSGP is | |||
not clear enough to result in interoperable implementations. | not clear enough to result in interoperable implementations. | |||
Possibly as a result, current receive endpoints do recognize and | Possibly as a result, current receive endpoints do recognize and | |||
process RDMA_MSGP messages, though they do not take advantage of the | process RDMA_MSGP messages, though they do not take advantage of the | |||
passed alignment parameters. Receivers treat RDMA_MSGP messages like | passed alignment parameters. Receivers treat RDMA_MSGP messages like | |||
RDMA_MSG messages. | RDMA_MSG messages. | |||
End of changes. 25 change blocks. | ||||
65 lines changed or deleted | 154 lines changed or added | |||