draft-ietf-nfsv4-rfc5666-implementation-experience-00.txt | draft-ietf-nfsv4-rfc5666-implementation-experience-01.txt | |||
---|---|---|---|---|
NFSv4 C. Lever | Network File System Version 4 C. Lever | |||
Internet-Draft Oracle | Internet-Draft Oracle | |||
Intended status: Informational November 2, 2015 | Intended status: Informational February 23, 2016 | |||
Expires: May 5, 2016 | Expires: August 26, 2016 | |||
RPC-over-RDMA Version One Implementation Experience | RPC-over-RDMA Version One Implementation Experience | |||
draft-ietf-nfsv4-rfc5666-implementation-experience-00 | draft-ietf-nfsv4-rfc5666-implementation-experience-01 | |||
Abstract | Abstract | |||
This document details experiences and challenges implementing the | This document details experiences and challenges implementing the | |||
RPC-over-RDMA Version One protocol. Specification changes are | RPC-over-RDMA Version One protocol. Specification changes are | |||
recommended to address avoidable interoperability failures. | recommended to address avoidable interoperability failures. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
skipping to change at page 1, line 32 | skipping to change at page 1, line 32 | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on May 5, 2016. | This Internet-Draft will expire on August 26, 2016. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 | 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 | |||
1.2. Purpose Of This Document . . . . . . . . . . . . . . . . 3 | 1.2. Purpose Of This Document . . . . . . . . . . . . . . . . 3 | |||
1.3. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 4 | 1.3. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 3 | |||
2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 5 | 2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 4 | |||
2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 5 | 2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 4 | |||
2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5 | 2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5 | |||
2.2.1. Direct Data Placement . . . . . . . . . . . . . . . . 6 | 2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.2.2. Channel Operation . . . . . . . . . . . . . . . . . . 6 | 2.4. Upper Layer Binding Specifications . . . . . . . . . . . 7 | |||
2.2.3. Explicit RDMA Operation . . . . . . . . . . . . . . . 7 | ||||
2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 7 | ||||
2.3.1. Read-Read . . . . . . . . . . . . . . . . . . . . . . 7 | ||||
2.3.2. Write-Write . . . . . . . . . . . . . . . . . . . . . 7 | ||||
2.3.3. Read-Write . . . . . . . . . . . . . . . . . . . . . 8 | ||||
2.4. Upper Layer Binding Specifications . . . . . . . . . . . 8 | ||||
2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8 | 2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8 | |||
2.5.1. Inline Operation . . . . . . . . . . . . . . . . . . 8 | 3. Specification Issues . . . . . . . . . . . . . . . . . . . . 14 | |||
2.5.2. RDMA Segment . . . . . . . . . . . . . . . . . . . . 11 | 3.1. Extensibility Considerations . . . . . . . . . . . . . . 14 | |||
2.5.3. Chunk . . . . . . . . . . . . . . . . . . . . . . . . 11 | 3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 15 | |||
2.5.4. Read Chunk . . . . . . . . . . . . . . . . . . . . . 12 | 3.3. The Position Zero Read Chunk . . . . . . . . . . . . . . 18 | |||
2.5.5. Write Chunk . . . . . . . . . . . . . . . . . . . . . 12 | 3.4. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 20 | |||
2.5.6. Read List . . . . . . . . . . . . . . . . . . . . . . 13 | 3.5. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 21 | |||
2.5.7. Write List . . . . . . . . . . . . . . . . . . . . . 14 | 3.6. Padding Inline Content After A Chunk . . . . . . . . . . 22 | |||
2.5.8. Position Zero Read Chunk . . . . . . . . . . . . . . 14 | 3.7. Write Chunk XDR Roundup . . . . . . . . . . . . . . . . . 24 | |||
2.5.9. Reply Chunk . . . . . . . . . . . . . . . . . . . . . 15 | ||||
3. Specification Issues . . . . . . . . . . . . . . . . . . . . 15 | ||||
3.1. Extensibility Considerations . . . . . . . . . . . . . . 15 | ||||
3.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 16 | ||||
3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 16 | ||||
3.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 18 | ||||
3.3. The Position Zero Read Chunk . . . . . . . . . . . . . . 19 | ||||
3.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 21 | ||||
3.4. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 21 | ||||
3.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 22 | ||||
3.5. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 22 | ||||
3.5.1. Recommendations . . . . . . . . . . . . . . . . . . . 23 | ||||
3.6. Padding Inline Content After A Chunk . . . . . . . . . . 23 | ||||
3.6.1. Recommendations . . . . . . . . . . . . . . . . . . . 25 | ||||
3.7. Write List XDR Roundup . . . . . . . . . . . . . . . . . 25 | ||||
3.7.1. Recommendations . . . . . . . . . . . . . . . . . . . 26 | ||||
3.8. Write List Error Cases . . . . . . . . . . . . . . . . . 26 | 3.8. Write List Error Cases . . . . . . . . . . . . . . . . . 26 | |||
3.8.1. Recommendations . . . . . . . . . . . . . . . . . . . 29 | ||||
4. Operational Considerations . . . . . . . . . . . . . . . . . 29 | 4. Operational Considerations . . . . . . . . . . . . . . . . . 29 | |||
4.1. Computing Request Buffer Requirements . . . . . . . . . . 29 | 4.1. Computing Request Buffer Requirements . . . . . . . . . . 29 | |||
4.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 30 | ||||
4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 30 | 4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 30 | |||
4.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 30 | ||||
4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 30 | 4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 30 | |||
4.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 31 | ||||
4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 31 | 4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 31 | |||
4.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 32 | ||||
4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 32 | 4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 32 | |||
4.5.1. Recommendations . . . . . . . . . . . . . . . . . . . 32 | ||||
5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 32 | 5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 32 | |||
5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 32 | 5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 32 | |||
5.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 33 | ||||
6. Considerations For Upper Layer Binding Specifications . . . . 33 | 6. Considerations For Upper Layer Binding Specifications . . . . 33 | |||
6.1. Organization Of Binding Specification Requirements . . . 33 | 6.1. Organization Of Binding Specification Requirements . . . 33 | |||
6.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 34 | ||||
6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 34 | 6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 34 | |||
6.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 35 | 6.3. Inline Threshold Requirements . . . . . . . . . . . . . . 35 | |||
6.3. Violations Of Binding Rules . . . . . . . . . . . . . . . 35 | 6.4. Violations Of Binding Rules . . . . . . . . . . . . . . . 36 | |||
6.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 36 | 6.5. Binding Specification Completion Assessment . . . . . . . 37 | |||
6.4. Binding Specification Completion Assessment . . . . . . . 36 | 7. Unimplemented Protocol Features . . . . . . . . . . . . . . . 38 | |||
6.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 37 | 7.1. Unimplemented Features To Be Removed . . . . . . . . . . 38 | |||
7. Removal of Unimplemented Protocol Features . . . . . . . . . 37 | 7.2. Unimplemented Features To Be Retained . . . . . . . . . . 39 | |||
7.1. Read-Read Transfer Model . . . . . . . . . . . . . . . . 37 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 41 | |||
7.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 37 | 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 41 | |||
7.2. RDMA_MSGP . . . . . . . . . . . . . . . . . . . . . . . . 37 | 10. Appendix A: XDR Language Description . . . . . . . . . . . . 42 | |||
7.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 38 | 11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 45 | |||
8. Security Considerations . . . . . . . . . . . . . . . . . . . 38 | 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 46 | |||
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38 | 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 46 | |||
10. Appendix A: XDR Language Description . . . . . . . . . . . . 38 | 13.1. Normative References . . . . . . . . . . . . . . . . . . 46 | |||
11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 41 | 13.2. Informative References . . . . . . . . . . . . . . . . . 48 | |||
12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 43 | Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 48 | |||
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 43 | ||||
13.1. Normative References . . . . . . . . . . . . . . . . . . 43 | ||||
13.2. Informative References . . . . . . . . . . . . . . . . . 44 | ||||
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 44 | ||||
1. Introduction | 1. Introduction | |||
1.1. Requirements Language | 1.1. Requirements Language | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
"OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in | |||
[RFC2119]. | [RFC2119]. | |||
skipping to change at page 25, line 26 | skipping to change at page 24, line 26 | |||
restriction applies to inline content interleaved with write chunks. | restriction applies to inline content interleaved with write chunks. | |||
Because all XDR objects must start on an XDR alignment boundary, all | Because all XDR objects must start on an XDR alignment boundary, all | |||
read and write chunks and all inline XDR objects in any XDR stream | read and write chunks and all inline XDR objects in any XDR stream | |||
must start on an XDR alignment boundary. This has implications for | must start on an XDR alignment boundary. This has implications for | |||
the values allowed in read chunk Position fields, for how XDR roundup | the values allowed in read chunk Position fields, for how XDR roundup | |||
works for chunks, and for how XDR objects are placed in inline | works for chunks, and for how XDR objects are placed in inline | |||
buffers. XDR alignment in inline buffers is always relative to | buffers. XDR alignment in inline buffers is always relative to | |||
Position Zero (or, where the RPC header starts). | Position Zero (or, where the RPC header starts). | |||
3.7. Write List XDR Roundup | 3.7. Write Chunk XDR Roundup | |||
The final paragraph of RFC 5666 Section 3.7 says this: | The final paragraph of RFC 5666 Section 3.7 says: | |||
For RDMA Write Chunks, a simpler encoding method applies. Again, | For RDMA Write Chunks, a simpler encoding method applies. Again, | |||
roundup bytes are not transferred, instead the chunk length sent | roundup bytes are not transferred, instead the chunk length sent | |||
to the receiver in the reply is simply increased to include any | to the receiver in the reply is simply increased to include any | |||
roundup. | roundup. | |||
A responder should never write XDR pad bytes, as the requester's | A responder should avoid writing XDR pad bytes, as the requester's | |||
upper layers does not reference them. However, for the chunk length | upper layer does not reference them, though the language does not | |||
to be rounded up as described, the requester must provide adequate | fully prohibit writing these bytes. A requester always provides the | |||
extra space in the chunk for the XDR pad. A requester can provide | extra space for XDR padding anyway. | |||
space for the XDR pad using one of two approaches: | ||||
A problem arises if the data item written into a Write chunk is | ||||
shorter than the chunk and requires an XDR pad. A responder may | ||||
write the XDR pad past the end of the data content. For a short | ||||
directly-placed write, the pad bytes are then exposed in the RPC | ||||
consumer's data buffer. | ||||
In addition, for the chunk length to be rounded up as described, the | ||||
requester must provide adequate extra space in the chunk for the XDR | ||||
pad. A requester can provide space for the XDR pad using one of two | ||||
approaches: | ||||
1. It can extend the last segment in the chunk. | 1. It can extend the last segment in the chunk. | |||
2. It can provide another segment after the segments that receive | 2. It can provide another segment after the segments that receive | |||
RDMA Write payloads. | RDMA Write payloads. | |||
Case 1 is adequate when there is no danger that the responder's RDMA | Case 1 is adequate when there is no danger that the responder's RDMA | |||
Write operations will overwrite existing data on the requester in | Write operations will overwrite existing data on the requester in | |||
buffers following the advertised receive buffers. | memory following the advertised receive buffers. | |||
In Direct Data Placement scenarios, an extra segment must be provided | In Direct Data Placement scenarios, an extra segment must be provided | |||
separately to avoid overwriting existing data that follows the sink | separately to avoid overwriting existing data that follows the sink | |||
buffer (case 2). Thus, an extra registration is needed for just a | buffer (case 2). Thus, an extra registration is needed for just a | |||
handful of bytes that are not written by the responder. | handful of bytes that may not be written by the responder, and are | |||
ignored by the requester. Even so, this does not force the responder | ||||
to direct the XDR pad bytes into this extra segment, should the data | ||||
item in that chunk be shorter than the chunk itself. | ||||
Registering the extra buffer is a needless cost. It would be more | Registering the extra buffer is a needless cost. It would be more | |||
efficient if the XDR pad at the end of a write chunk were treated the | efficient if the XDR pad at the end of a write chunk were treated the | |||
same as it is for read chunks. Because RPC result data must begin on | same as it is for Read chunks. Because RPC result data must begin on | |||
an XDR alignment boundary, the result following the write chunk in | an XDR alignment boundary, the result following the write chunk in | |||
the reply's XDR stream must begin on an XDR alignment boundary. | the reply's XDR stream must begin on an XDR alignment boundary. | |||
There should be no need for an XDR pad to be present for the receiver | There is no need for an XDR pad to be present for the receiver to re- | |||
to re-assemble the RPC reply's XDR stream correctly. | assemble the RPC reply's XDR stream properly. | |||
Unfortunately at least one server implementation relies on the | One responder implementation requires the requester to provide the | |||
existence of that extra buffer, even though it does not write to it. | extra buffer space in the Write chunk, but does not write to it. | |||
Another server implementation does not rely on it (operation proceeds | This follows the letter of the last paragraph of Section 3.7 of | |||
if it is missing) but when it is present, this server does write | [RFC5666]. | |||
zeroes to it. | ||||
Therefore the extra buffer for a write chunk's XDR pad, either as a | Another responder implementation does not rely on having the extra | |||
separate segment, or as an extension of the segment that represents | space (operation proceeds if it is missing) but when the extra space | |||
the data payload buffer, must remain for now. | is present, this responder does write zeroes to it. While the | |||
intention of Section 3.7 is that the responder does not write the | ||||
pad, it is not strictly forbidden. | ||||
Client implementations all appear to provide the extra buffer space | ||||
needed to accommodate the XDR pad. However, one implementation does | ||||
not register this extra buffer, since the responder is not expected | ||||
to write into it, while another implementation does. | ||||
These implementations may not be 100% interoperable. The language of | ||||
Section 3.7 of [RFC5666] appears to allow all of this behavior (in | ||||
particular, it does not prohibit a responder from writing the XDR pad | ||||
using RFC2119-style keywords, and does not require that requesters | ||||
register the extra space to accommodate the XDR pad). | ||||
Note that because the Reply chunk is a write chunk, these roundup | Note that because the Reply chunk is a write chunk, these roundup | |||
rules apply to it as well. | rules also apply to it. | |||
3.7.1. Recommendations | 3.7.1. Recommendations | |||
RFC 5666bis should provide a discussion of the requirements around | The current specification allows XDR pad bytes to leak into user | |||
write chunk roundup, with examples. The discussion should be | buffers, and none of the current implementations prevent this leak. | |||
separate from the discussion of read chunk roundup. | There may be room to adjust the protocol specification independently | |||
of current implementation behavior. | ||||
RFC 5666bis should explicitly discuss the requirements around write | ||||
chunk roundup separately from the discussion of read chunk roundup. | ||||
Explicit RFC2119-style interoperability requirements should be | Explicit RFC2119-style interoperability requirements should be | |||
provided in the text. For example, the requester MUST provide buffer | provided for write chunks. Responders MUST NOT write XDR pad bytes | |||
space for XDR roundup of write chunks, and the responder SHOULD NOT | at the end of a Write chunk. | |||
write into that buffer. | ||||
Allocating and registering extra space for XDR pad bytes that are | ||||
never written is wasteful. RFC 5666bis should forbid it. Responders | ||||
should not expect requesters to provide space for XDR pad bytes. | ||||
3.8. Write List Error Cases | 3.8. Write List Error Cases | |||
RFC 5666 Section 3.6 says: | RFC 5666 Section 3.6 says: | |||
When a write chunk list is provided for the results of the RPC | When a write chunk list is provided for the results of the RPC | |||
call, the RPC server MUST provide any corresponding data via RDMA | call, the RPC server MUST provide any corresponding data via RDMA | |||
Write to the memory referenced in the chunk list entries. | Write to the memory referenced in the chunk list entries. | |||
This requires the responder to use the Write list when it is | This requires the responder to use the Write list when it is | |||
skipping to change at page 35, line 45 | skipping to change at page 35, line 45 | |||
their DDP-eligibility. RFC 5666bis should remind authors of Upper | their DDP-eligibility. RFC 5666bis should remind authors of Upper | |||
Layer Bindings that the Reply chunk and Position Zero read chunks are | Layer Bindings that the Reply chunk and Position Zero read chunks are | |||
expressly not for performance-critical Upper Layer operations. | expressly not for performance-critical Upper Layer operations. | |||
It is the responsibility of the Upper Layer Binding to specify RDMA- | It is the responsibility of the Upper Layer Binding to specify RDMA- | |||
eligibility rules so that if an RDMA-eligible XDR object is embedded | eligibility rules so that if an RDMA-eligible XDR object is embedded | |||
within another, only one of these two objects is to be represented by | within another, only one of these two objects is to be represented by | |||
a chunk. This ensures that the mapping from XDR position to the XDR | a chunk. This ensures that the mapping from XDR position to the XDR | |||
object represented is unambiguous. | object represented is unambiguous. | |||
6.3. Violations Of Binding Rules | 6.3. Inline Threshold Requirements | |||
An RPC-over-RDMA connection has two connection parameters that affect | ||||
the operation of Upper Layer Protocols: The credit limit, which is | ||||
how many outstanding RPCs are allowed on that connection; and the | ||||
inline threshold, which is the maximum payload size of an RDMA Send | ||||
on that connection. All ULPs sharing a connection also share the | ||||
same credits and inline threshold values. | ||||
The inline threshold is set when a connection is established. The | ||||
base RPC-over-RDMA protocol does not provide a mechanism for altering | ||||
the inline threshold of a connection once it has been established. | ||||
[RFC5667] places normative requirements on the inline threshold value | ||||
for a connection. There is no guidance provided on how | ||||
implementations should behave when two ULPs that have different | ||||
inline threshold requirements share the same connection. | ||||
Further, current NFS implementations ignore the inline threshold | ||||
requirements stated in [RFC5667]. It is unlikely that they would | ||||
interoperate successfully with any new implementation that followed | ||||
the letter of [RFC5667]. | ||||
6.3.1. Recommendations | ||||
Upper Layer Protocols should be able to operate no matter what inline | ||||
threshold is in use. | ||||
An Upper Layer Binding might provide informative guidance about | ||||
optimal values of an inline threshold, but normative requirements are | ||||
difficult to enforce unless connection sharing is explicitly not | ||||
permitted. | ||||
6.4. Violations Of Binding Rules | ||||
Section 3.4 of RFC 5666 introduces the idea of an Upper Layer Binding | Section 3.4 of RFC 5666 introduces the idea of an Upper Layer Binding | |||
specification to state which Upper Layer operations are allowed to | specification to state which Upper Layer operations are allowed to | |||
use explicit RDMA to transfer a bulk payload item. | use explicit RDMA to transfer a bulk payload item. | |||
The fifth paragraph of this section states: | The fifth paragraph of this section states: | |||
The interface by which an upper-layer implementation communicates | The interface by which an upper-layer implementation communicates | |||
the eligibility of a data item locally to RPC for chunking is out | the eligibility of a data item locally to RPC for chunking is out | |||
of scope for this specification. In many implementations, it is | of scope for this specification. In many implementations, it is | |||
skipping to change at page 36, line 25 | skipping to change at page 37, line 11 | |||
If a violation does occur, RFC 5666 does not define an unambiguous | If a violation does occur, RFC 5666 does not define an unambiguous | |||
mechanism for reporting the violation. The violation of Binding | mechanism for reporting the violation. The violation of Binding | |||
rules is an Upper Layer Protocol issue, but it is likely that there | rules is an Upper Layer Protocol issue, but it is likely that there | |||
is nothing the Upper Layer can do but reply with the equivalent of | is nothing the Upper Layer can do but reply with the equivalent of | |||
BAD XDR. | BAD XDR. | |||
When an erroneously-constructed reply reaches a requester, there is | When an erroneously-constructed reply reaches a requester, there is | |||
no recourse but to drop the reply, and perhaps the transport | no recourse but to drop the reply, and perhaps the transport | |||
connection as well. | connection as well. | |||
6.3.1. Recommendations | 6.4.1. Recommendations | |||
Policing DDP-eligibility must be done in co-operation with the Upper | Policing DDP-eligibility must be done in co-operation with the Upper | |||
Layer Protocol by its receive endpoint implementation. | Layer Protocol by its receive endpoint implementation. | |||
It is the Upper Layer Binding's responsibility to specify how a | It is the Upper Layer Binding's responsibility to specify how a | |||
responder must reply if a requester violates a DDP-eligibility rule. | responder must reply if a requester violates a DDP-eligibility rule. | |||
The Binding specification should provide similar guidance for | The Binding specification should provide similar guidance for | |||
requesters about handling invalid RPC-over-RDMA replies. | requesters about handling invalid RPC-over-RDMA replies. | |||
6.4. Binding Specification Completion Assessment | 6.5. Binding Specification Completion Assessment | |||
RFC 5666 Section 3.4 states: | RFC 5666 Section 3.4 states: | |||
Typically, only those opaque and aggregate data types that may | Typically, only those opaque and aggregate data types that may | |||
attain substantial size are considered to be eligible. However, | attain substantial size are considered to be eligible. However, | |||
any object MAY be chosen for chunking in any given message. | any object MAY be chosen for chunking in any given message. | |||
Chunk eligibility criteria MUST be determined by each upper-layer | Chunk eligibility criteria MUST be determined by each upper-layer | |||
in order to provide for an interoperable specification. | in order to provide for an interoperable specification. | |||
skipping to change at page 37, line 8 | skipping to change at page 37, line 43 | |||
data type in the Upper Layer's XDR definition, in particular compound | data type in the Upper Layer's XDR definition, in particular compound | |||
types such as arrays and lists, when restricting what XDR objects are | types such as arrays and lists, when restricting what XDR objects are | |||
eligible for Direct Data Placement. | eligible for Direct Data Placement. | |||
In addition, there are requirements related to using NFS with RPC- | In addition, there are requirements related to using NFS with RPC- | |||
over-RDMA in [RFC5667], and there are some in [RFC5661]. It could be | over-RDMA in [RFC5667], and there are some in [RFC5661]. It could be | |||
helpful to have guidance about what kind of requirements belong in an | helpful to have guidance about what kind of requirements belong in an | |||
Upper Layer Binding specification versus what belong in the Upper | Upper Layer Binding specification versus what belong in the Upper | |||
Layer Protocol specification. | Layer Protocol specification. | |||
6.4.1. Recommendations | 6.5.1. Recommendations | |||
RFC 5666bis should describe what makes a Binding specification | RFC 5666bis should describe what makes a Binding specification | |||
complete (i.e. ready for publication). | complete (i.e. ready for publication). | |||
7. Removal of Unimplemented Protocol Features | 7. Unimplemented Protocol Features | |||
7.1. Read-Read Transfer Model | There are features of RPC-over-RDMA Version One that remain | |||
unimplemented in current implementations. Some are candidates to be | ||||
removed from the protocol because they have proven unnecessary or | ||||
were not properly specified. | ||||
Other features are unimplemented, unspecified, or have only one | ||||
implementation (thus interoperability remains unproven). These are | ||||
candidates to be retained and properly specified. | ||||
7.1. Unimplemented Features To Be Removed | ||||
7.1.1. Connection Configuration Protocol | ||||
No implementation has seen fit to support the Connection | ||||
Configuration Protocol. While a need to exchange pertinent | ||||
connection information remains, the preference is to exchange that | ||||
information as part of the set up of each connection, rather than as | ||||
settings that apply to all connections (and thus all ULPs) between | ||||
two peers. | ||||
7.1.1.1. Recommendations | ||||
CCP should be removed from RFC 5666bis. | ||||
7.1.2. Read-Read Transfer Model | ||||
All existing RPC-over-RDMA Version One implementations use a Read- | All existing RPC-over-RDMA Version One implementations use a Read- | |||
Write data transfer model. The server endpoint is responsible for | Write data transfer model. The server endpoint is responsible for | |||
initiating all RDMA data transfers. The Read-Read transfer model has | initiating all RDMA data transfers. The Read-Read transfer model has | |||
been deprecated, but because it appears in RFC 5666, implementations | been deprecated, but because it appears in RFC 5666, implementations | |||
are still responsible for supporting it. By removing the | are still responsible for supporting it. By removing the | |||
specification and discussion of Read-Read, the protocol and | specification and discussion of Read-Read, the protocol and | |||
specification can be made simpler and more clear. | specification can be made simpler and more clear. | |||
7.1.1. Recommendations | 7.1.2.1. Recommendations | |||
Remove Read-Read from RFC 5666bis, in particular from its equivalent | Remove Read-Read from RFC 5666bis, in particular from its equivalent | |||
of RFC 5666 Section 3.8. RFC 5666bis should require implementations | of RFC 5666 Section 3.8. RFC 5666bis should require implementations | |||
not to send RDMA_DONE; an implementation receiving it should ignore | not to send RDMA_DONE; an implementation receiving it should ignore | |||
it. The XDR definition should reserve RDMA_DONE. | it. The XDR definition should reserve RDMA_DONE. | |||
7.2. RDMA_MSGP | 7.1.3. RDMA_MSGP | |||
It has been observed that the current specification of RDMA_MSGP is | It has been observed that the current specification of RDMA_MSGP is | |||
not clear enough to result in interoperable implementations. | not clear enough to result in interoperable implementations. | |||
Possibly as a result, current receive endpoints do recognize and | Possibly as a result, current receive endpoints do recognize and | |||
process RDMA_MSGP messages, though they do not take advantage of the | process RDMA_MSGP messages, though they do not take advantage of the | |||
passed alignment parameters. Receivers treat RDMA_MSGP messages like | passed alignment parameters. Receivers treat RDMA_MSGP messages like | |||
RDMA_MSG messages. | RDMA_MSG messages. | |||
Currently senders do not use RDMA_MSGP messages. RDMA_MSGP depends | Currently senders do not use RDMA_MSGP messages. RDMA_MSGP depends | |||
on bulk payload occurring at the end of RPC messages, which is often | on bulk payload occurring at the end of RPC messages, which is often | |||
not true of NFSv4 COMPOUND requests. Most NFSv3 requests are small | not true of NFSv4 COMPOUND requests. Most NFSv3 requests are small | |||
enough not to need RDMA_MSGP. | enough not to need RDMA_MSGP. | |||
To be effective, RDMA_MSGP depends on getting alignment preferences | To be effective, RDMA_MSGP depends on getting alignment preferences | |||
in advance via CCP. There are no CCP implementations to date. | in advance via CCP. There are no CCP implementations to date. | |||
Without CCP, there is no way for peers to discover a receiver | Without CCP, there is no way for peers to discover a receiver | |||
endpoint's preferred alignment parameters, unless the implementation | endpoint's preferred alignment parameters, unless the implementation | |||
provides an administrative interface for specifying a remote's | provides an administrative interface for specifying a remote's | |||
alignment parameters. RDMA_MSGP is useless without that knowledge. | alignment parameters. RDMA_MSGP is useless without that knowledge. | |||
7.2.1. Recommendations | 7.1.3.1. Recommendations | |||
To maintain backward-compatibility, RDMA_MSGP must remain in the | To maintain backward-compatibility, RDMA_MSGP must remain in the | |||
protocol. RFC 5666bis should require implementations to not send | protocol. RFC 5666bis should require implementations to not send | |||
RDMA_MSGP messages. If an RDMA_MSGP message is seen by a receiver, | RDMA_MSGP messages. If an RDMA_MSGP message is seen by a receiver, | |||
it should ignore the alignment parameters and treat RDMA_MSGP | it should ignore the alignment parameters and treat RDMA_MSGP | |||
messages as RDMA_MSG messages. The XDR definition should reserve | messages as RDMA_MSG messages. The XDR definition should reserve | |||
RDMA_MSGP. | RDMA_MSGP. | |||
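The recommendations for RDMA_DONE (Section 7.1.2.1) and RDMA_MSGP (Section 7.1.3.1) can be illustrated with a minimal sketch. The numeric message type values are taken from the RFC 5666 XDR definition; the names RdmaProc, RESERVED, may_send, and receive_action are hypothetical and do not come from any existing implementation.

```python
from enum import IntEnum


class RdmaProc(IntEnum):
    """RPC-over-RDMA Version One message types (values per the RFC 5666 XDR)."""
    RDMA_MSG = 0
    RDMA_NOMSG = 1
    RDMA_MSGP = 2    # stays in the XDR, but would be reserved
    RDMA_DONE = 3    # stays in the XDR, but would be reserved
    RDMA_ERROR = 4


# Assumption: "reserved" means a sender never emits these types.
RESERVED = frozenset({RdmaProc.RDMA_MSGP, RdmaProc.RDMA_DONE})


def may_send(proc):
    """Return True if a sender following the recommendations may emit proc."""
    return RdmaProc(proc) not in RESERVED


def receive_action(proc):
    """Map a received message type to the recommended receiver behavior.

    The returned strings are illustrative labels, not protocol elements.
    An unknown value raises ValueError from the RdmaProc constructor.
    """
    proc = RdmaProc(proc)
    if proc is RdmaProc.RDMA_DONE:
        return "ignore"
    if proc is RdmaProc.RDMA_MSGP:
        # The alignment and threshold fields are still decoded so the
        # chunk lists can be located, but their values are not used.
        return "handle_as_rdma_msg"
    return "handle"
```

For example, receive_action(RdmaProc.RDMA_MSGP) yields "handle_as_rdma_msg", while may_send(RdmaProc.RDMA_DONE) is False.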
7.2. Unimplemented Features To Be Retained | ||||
7.2.1. RDMA_ERROR Type Messages | ||||
Server implementations the author is familiar with can send | ||||
RDMA_ERROR type messages, but only when an RPC-over-RDMA version | ||||
mismatch occurs. There is no facility to return the ERR_CHUNK error. | ||||
These implementations treat unrecognized message types and other | ||||
parsing errors as an RDMA_MSG type message. This behavior does not | ||||
comply with RFC 5666, nor is it an improvement over what the | ||||
specification requires. | ||||
7.2.1.1. Recommendations | ||||
RFC 5666bis should provide stronger guidance for error checking, and | ||||
in particular, when a connection must be broken. | ||||
Implementations that do not adequately check incoming RPC-over-RDMA | ||||
headers must be updated. | ||||
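As a rough illustration of the header checking called for above, the following sketch validates the four fixed 32-bit fields of an RPC-over-RDMA Version One header and builds an RDMA_ERROR reply for a version mismatch. Field order and numeric values follow the RFC 5666 XDR. The names check_header, build_err_vers, HeaderError, and VersionMismatch are hypothetical. What a receiver should do after a HeaderError (send an error or break the connection) is deliberately left open here, since that is the guidance RFC 5666bis is being asked to provide.

```python
import struct

RPCRDMA_VERSION = 1
RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR = range(5)
ERR_VERS = 1

# rdma_xid, rdma_vers, rdma_credit, rdma_proc: four big-endian uint32 fields.
FIXED_HDR = struct.Struct(">IIII")


class HeaderError(Exception):
    """The incoming header failed a sanity check."""


class VersionMismatch(HeaderError):
    """The peer used an RPC-over-RDMA version this endpoint does not support."""

    def __init__(self, xid):
        super().__init__("unsupported RPC-over-RDMA version")
        self.xid = xid


def check_header(buf):
    """Validate the fixed header fields; return (xid, credits, proc)."""
    if len(buf) < FIXED_HDR.size:
        raise HeaderError("short header")
    xid, vers, credits, proc = FIXED_HDR.unpack_from(buf)
    if vers != RPCRDMA_VERSION:
        raise VersionMismatch(xid)
    if proc > RDMA_ERROR:
        # Do not fall through and treat this as RDMA_MSG.
        raise HeaderError("unrecognized message type")
    return xid, credits, proc


def build_err_vers(xid, credits):
    """Build an RDMA_ERROR message carrying ERR_VERS and the version range."""
    return struct.pack(">IIIIIII", xid, RPCRDMA_VERSION, credits,
                       RDMA_ERROR, ERR_VERS,
                       RPCRDMA_VERSION, RPCRDMA_VERSION)
```

A caller would typically catch VersionMismatch and reply with build_err_vers(e.xid, credits), where the credit value is whatever the endpoint currently grants.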
7.2.2. RPCSEC_GSS On RPC-over-RDMA | ||||
The second paragraph of RFC 5666 Section 11 says: | ||||
For efficiency, a more appropriate security mechanism for RDMA | ||||
links may be link-level protection, such as certain configurations | ||||
of IPsec, which may be co-located in the RDMA hardware. The use | ||||
of link-level protection MAY be negotiated through the use of the | ||||
new RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with | ||||
the Channel Binding mechanism [RFC5056] and IPsec Channel | ||||
Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED | ||||
where integrity and/or privacy is desired, and where efficiency is | ||||
required. | ||||
However, consider: | ||||
o As of this writing, no implementation of RPCSEC_GSS v2 Channel | ||||
Binding or Connection Latching exists. Thus, though it is | ||||
sensible, this part of RFC 5666 has never been implemented. | ||||
o Not all fabrics and RNICs support a link-layer protection | ||||
mechanism that includes a privacy service. | ||||
o When multiple users access a storage service from the same client, | ||||
it is appropriate to deploy a message authentication service | ||||
concurrently with link-layer protection. | ||||
Therefore, despite its performance impact, RPCSEC_GSS can add | ||||
important function to RPC-over-RDMA deployments. | ||||
Currently there is an InfiniBand-only client and server | ||||
implementation of RPCSEC_GSS on RPC-over-RDMA that supports the | ||||
authentication, integrity, and privacy services. This pair of | ||||
implementations was created without the benefit of normative guidance | ||||
from RFC 5666. This client and server pair interoperates with each | ||||
other, but there are no independent implementations to test with. | ||||
RPC-over-RDMA requesters are responsible for providing adequate reply | ||||
resources to responders. These resources require special treatment | ||||
when an integrity or privacy service is in use. Direct data | ||||
placement cannot be used with software integrity checking or | ||||
encryption. Thus standards guidance is imperative to ensure that | ||||
independent RPCSEC_GSS implementations can interoperate on RPC-over- | ||||
RDMA transports. | ||||
7.2.2.1. Recommendations | ||||
RFC 5666bis should continue to require the use of link layer | ||||
protection when facilities are available to support it. | ||||
At the least, RPCSEC_GSS per-message authentication is valuable, even | ||||
if link layer protection is in use. Integrity and privacy should | ||||
also be made available even if they do not perform well, because | ||||
there is no link layer protection for some fabrics. | ||||
Therefore, RFC 5666bis should provide a specification for RPCSEC_GSS | ||||
on RPC-over-RDMA, codifying the one existing implementation so that | ||||
others may interoperate with it. | ||||
8. Security Considerations | 8. Security Considerations | |||
To enable RDMA Read and Write operations, an RPC-over-RDMA Version | To enable RDMA Read and Write operations, an RPC-over-RDMA Version | |||
One requester exposes some or all of its memory to other hosts. RFC | One requester exposes some or all of its memory to other hosts. RFC | |||
5666bis should suggest best implementation practices to minimize | 5666bis should suggest best implementation practices to minimize | |||
exposure to careless or potentially malicious implementations that | exposure to careless or potentially malicious implementations that | |||
share the same fabric. Important considerations include: | share the same fabric. Important considerations include: | |||
o The use of Protection Domains to limit the exposure of memory | o The use of Protection Domains to limit the exposure of memory | |||
regions to a single connection is critical. Any attempt by a host | regions to a single connection is critical. Any attempt by a host | |||
skipping to change at page 43, line 22 | skipping to change at page 46, line 32 | |||
Binding specification should provide guidance for requesters | Binding specification should provide guidance for requesters | |||
about handling invalid RPC-over-RDMA replies. | about handling invalid RPC-over-RDMA replies. | |||
12. Acknowledgements | 12. Acknowledgements | |||
The author gratefully acknowledges the contributions of Dai Ngo, | The author gratefully acknowledges the contributions of Dai Ngo, | |||
Karen Deitke, Chunli Zhang, Mahesh Siddheshwar, Dominique Martinet, | Karen Deitke, Chunli Zhang, Mahesh Siddheshwar, Dominique Martinet, | |||
and William Simpson. | and William Simpson. | |||
The author also wishes to thank Dave Noveck and Bill Baker for their | The author also wishes to thank Dave Noveck and Bill Baker for their | |||
unwavering support of this work. Special thanks go to nfsv4 Working | support of this work. Special thanks go to nfsv4 Working Group Chair | |||
Group Chair Spencer Shepler and nfsv4 Working Group Secretary Tom | Spencer Shepler and nfsv4 Working Group Secretary Tom Haynes for | |||
Haynes for their support. | their support. | |||
13. References | 13. References | |||
13.1. Normative References | 13.1. Normative References | |||
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC | [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC | |||
793, DOI 10.17487/RFC0793, September 1981, | 793, DOI 10.17487/RFC0793, September 1981, | |||
<http://www.rfc-editor.org/info/rfc793>. | <http://www.rfc-editor.org/info/rfc793>. | |||
[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS | [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS | |||
skipping to change at page 44, line 10 | skipping to change at page 47, line 19 | |||
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. | |||
Garcia, "A Remote Direct Memory Access Protocol | Garcia, "A Remote Direct Memory Access Protocol | |||
Specification", RFC 5040, DOI 10.17487/RFC5040, October | Specification", RFC 5040, DOI 10.17487/RFC5040, October | |||
2007, <http://www.rfc-editor.org/info/rfc5040>. | 2007, <http://www.rfc-editor.org/info/rfc5040>. | |||
[RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct | |||
Data Placement over Reliable Transports", RFC 5041, DOI | Data Placement over Reliable Transports", RFC 5041, DOI | |||
10.17487/RFC5041, October 2007, | 10.17487/RFC5041, October 2007, | |||
<http://www.rfc-editor.org/info/rfc5041>. | <http://www.rfc-editor.org/info/rfc5041>. | |||
[RFC5056] Williams, N., "On the Use of Channel Bindings to Secure | ||||
Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007, | ||||
<http://www.rfc-editor.org/info/rfc5056>. | ||||
[RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI | ||||
10.17487/RFC5403, February 2009, | ||||
<http://www.rfc-editor.org/info/rfc5403>. | ||||
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | |||
Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, | Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, | |||
May 2009, <http://www.rfc-editor.org/info/rfc5531>. | May 2009, <http://www.rfc-editor.org/info/rfc5531>. | |||
[RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC | ||||
5660, DOI 10.17487/RFC5660, October 2009, | ||||
<http://www.rfc-editor.org/info/rfc5660>. | ||||
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
"Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, | |||
<http://www.rfc-editor.org/info/rfc5661>. | <http://www.rfc-editor.org/info/rfc5661>. | |||
[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access | [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access | |||
Transport for Remote Procedure Call", RFC 5666, DOI | Transport for Remote Procedure Call", RFC 5666, DOI | |||
10.17487/RFC5666, January 2010, | 10.17487/RFC5666, January 2010, | |||
<http://www.rfc-editor.org/info/rfc5666>. | <http://www.rfc-editor.org/info/rfc5666>. | |||
End of changes. 42 change blocks. | ||||
107 lines changed or deleted | 253 lines changed or added | |||