--- 1/draft-ietf-nfsv4-sess-00.txt 2006-02-05 00:49:00.000000000 +0100 +++ 2/draft-ietf-nfsv4-sess-01.txt 2006-02-05 00:49:00.000000000 +0100 @@ -1,23 +1,24 @@ + INTERNET-DRAFT Tom Talpey -Expires: January 2005 Network Appliance, Inc. +Expires: August 2005 Network Appliance, Inc. Spencer Shepler Sun Microsystems, Inc. Jon Bauman University of Michigan - July, 2004 + February, 2005 NFSv4 Session Extensions - draft-ietf-nfsv4-sess-00 + draft-ietf-nfsv4-sess-01 Status of this Memo By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -30,21 +31,21 @@ as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice - Copyright (C) The Internet Society (2004). All Rights Reserved. + Copyright (C) The Internet Society (2005). All Rights Reserved. Abstract Extensions are proposed to NFS version 4 which enable it to support long-lived sessions, endpoint management, and operation atop a variety of RPC transports, including TCP and RDMA. These extensions enable support for reliably implemented client response caching by NFSv4 servers, enhanced security, multipathing and trunking of transport connections. These extensions provide identical benefits over both TCP and RDMA connection types. @@ -55,85 +56,87 @@ 1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Problem Statement . . . . . . . . . . . . . . . . . . . 5 1.3. NFSv4 Session Extension Characteristics . . . . . . . . 7 2. Transport Issues . . . . . . . . . . . . . . . . . . . 
. . 7 2.1. Session Model . . . . . . . . . . . . . . . . . . . . . 7 2.1.1. Connection State . . . . . . . . . . . . . . . . . . . 9 2.1.2. NFSv4 Channels, Sessions and Connections . . . . . . . 9 2.1.3. Reconnection, Trunking and Failover . . . . . . . . . 11 2.1.4. Server Duplicate Request Cache . . . . . . . . . . . . 12 2.2. Session Initialization and Transfer Models . . . . . . . 13 - 2.2.1. RDMA Requirements . . . . . . . . . . . . . . . . . . 13 - 2.2.2. Session Negotiation . . . . . . . . . . . . . . . . . 14 - 2.2.3. Connection Resources . . . . . . . . . . . . . . . . . 15 - 2.2.4. Inline Transfer Model . . . . . . . . . . . . . . . . 16 - 2.2.5. Direct Transfer Model . . . . . . . . . . . . . . . . 19 - 2.3. Connection Models . . . . . . . . . . . . . . . . . . . 21 + 2.2.1. Session Negotiation . . . . . . . . . . . . . . . . . 13 + 2.2.2. RDMA Requirements . . . . . . . . . . . . . . . . . . 15 + 2.2.3. RDMA Connection Resources . . . . . . . . . . . . . . 15 + 2.2.4. TCP and RDMA Inline Transfer Model . . . . . . . . . . 16 + 2.2.5. RDMA Direct Transfer Model . . . . . . . . . . . . . . 19 + 2.3. Connection Models . . . . . . . . . . . . . . . . . . . 22 2.3.1. TCP Connection Model . . . . . . . . . . . . . . . . . 23 - 2.3.2. Negotiated RDMA Connection Model . . . . . . . . . . . 23 + 2.3.2. Negotiated RDMA Connection Model . . . . . . . . . . . 24 2.3.3. Automatic RDMA Connection Model . . . . . . . . . . . 24 2.4. Buffer Management, Transfer, Flow Control . . . . . . . 25 2.5. Retry and Replay . . . . . . . . . . . . . . . . . . . . 28 2.6. The Back Channel . . . . . . . . . . . . . . . . . . . . 28 2.7. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 30 2.8. Data Alignment . . . . . . . . . . . . . . . . . . . . . 30 3. NFSv4 Integration . . . . . . . . . . . . . . . . . . . . 31 3.1. Minor Versioning . . . . . . . . . . . . . . . . . . . . 32 3.2. Slot Identifiers and Server Duplicate Request Cache . . 32 3.3. COMPOUND and CB_COMPOUND . . 
. . . . . . . . . . . . . . 35 3.4. eXternal Data Representation Efficiency . . . . . . . . 36 3.5. Effect of Sessions on Existing Operations . . . . . . . 36 3.6. Authentication Efficiencies . . . . . . . . . . . . . . 37 4. Security Considerations . . . . . . . . . . . . . . . . . 38 - 5. IANA Considerations . . . . . . . . . . . . . . . . . . . 39 - 6. NFSv4 Protocol Extensions . . . . . . . . . . . . . . . . 40 - 6.1. Operation: CREATECLIENTID . . . . . . . . . . . . . . . 40 - 6.2. Operation: CREATE_SESSION . . . . . . . . . . . . . . . 45 - 6.3. Operation: BIND_BACKCHANNEL . . . . . . . . . . . . . . 50 - 6.4. Operation: DESTROYSESSION . . . . . . . . . . . . . . . 52 - 6.5. Operation: SEQUENCE . . . . . . . . . . . . . . . . . . 53 - 6.6. Callback operation: CB_RECALLCREDIT . . . . . . . . . . 55 - 7. NFSv4 Session Protocol Description . . . . . . . . . . . . 55 - 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 62 - 9. References . . . . . . . . . . . . . . . . . . . . . . . . 62 - 9.1. Normative References . . . . . . . . . . . . . . . . . . 62 - 9.2. Informative References . . . . . . . . . . . . . . . . . 62 - 10. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 64 - 11. Full Copyright Statement . . . . . . . . . . . . . . . . . 65 + 4.1. Authentication . . . . . . . . . . . . . . . . . . . . . 40 + 5. IANA Considerations . . . . . . . . . . . . . . . . . . . 41 + 6. NFSv4 Protocol Extensions . . . . . . . . . . . . . . . . 41 + 6.1. Operation: CREATECLIENTID . . . . . . . . . . . . . . . 41 + 6.2. Operation: CREATESESSION . . . . . . . . . . . . . . . . 46 + 6.3. Operation: BIND_BACKCHANNEL . . . . . . . . . . . . . . 51 + 6.4. Operation: DESTROYSESSION . . . . . . . . . . . . . . . 53 + 6.5. Operation: SEQUENCE . . . . . . . . . . . . . . . . . . 54 + 6.6. Callback operation: CB_RECALLCREDIT . . . . . . . . . . 56 + 6.7. Callback operation: CB_SEQUENCE . . . . . . . . . . . . 56 + 7. NFSv4 Session Protocol Description . . . . . . 
. . . . . . 58 + 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 64 + 9. References . . . . . . . . . . . . . . . . . . . . . . . . 64 + 9.1. Normative References . . . . . . . . . . . . . . . . . . 64 + 9.2. Informative References . . . . . . . . . . . . . . . . . 65 + 10. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 67 + 11. Full Copyright Statement . . . . . . . . . . . . . . . . . 67 1. Introduction - This draft proposes extensions to NFS version 4 enabling it to - support sessions and endpoint management, and to support operation - atop RDMA-capable RPC over transports such as iWARP. [RDMAP, DDP] - These extensions enable support for exactly-once semantics by NFSv4 - servers, multipathing and trunking of transport connections, and - enhanced security. The ability to operate over RDMA enables - greatly enhanced performance. Operation over existing TCP is - enhanced as well. + This draft proposes extensions to NFS version 4 [RFC3530] enabling + it to support sessions and endpoint management, and to support + operation atop RDMA-capable RPC over transports such as iWARP. + [RDMAP, DDP] These extensions enable support for exactly-once + semantics by NFSv4 servers, multipathing and trunking of transport + connections, and enhanced security. The ability to operate over + RDMA enables greatly enhanced performance. Operation over existing + TCP is enhanced as well. While discussed here with respect to IETF-chartered transports, the proposed protocol is intended to function over other standards, such as Infiniband. [IB] The following are the major aspects of this proposal: o Changes are proposed within the framework of NFSv4 minor versioning. RPC, XDR, and the NFSv4 procedures and operations are preserved. The proposed extension functions equally well over existing transports and RDMA, and interoperates transparently with existing implementations, both at the local programmatic interface and over the wire. 
- o An explicit session is introduced to NFSv4, and six new - operations are added to support it. The session allows for - enhanced trunking, failover and recovery, and authentication + o An explicit session is introduced to NFSv4, and new operations + are added to support it. The session allows for enhanced + trunking, failover and recovery, and authentication efficiency, along with necessary support for RDMA. The session is implemented as operations within NFSv4 COMPOUND and does not impact layering or interoperability with existing NFSv4 implementations. The NFSv4 callback channel is dynamically associated and is connected by the client and not the server, enhancing security and operation through firewalls. In fact, the callback channel will be enabled to share the same connection as the operations channel. o An enhanced RPC layer enables NFSv4 operation atop RDMA. The @@ -383,20 +386,23 @@ In RFC3530, the combination of a connected transport endpoint and a clientid forms the basis of connection state. While this has been made to be workable with certain limitations, there are difficulties in correct and robust implementation. The NFSv4.0 protocol must provide a server-initiated connection for the callback channel, and must carefully specify the persistence of client state at the server in the face of transport interruptions. The server has only the client's transport address binding (the IP 4-tuple) to identify the client RPC transaction stream and to use as a lookup tag on the duplicate request cache. (A useful overview of this is in [RW96].) + If the server listens on multiple addresses, and the client + connects to more than one, it must employ different clientids on + each, negating its ability to aggregate bandwidth and redundancy. In effect, each transport connection is used as the server's representation of client state. But, transport connections are potentially fragile and transitory. 
In this proposal, a session identifier is assigned by the server upon initial session negotiation on each connection. This identifier is used to associate additional connections, to renegotiate after a reconnect, to provide an abstraction for the various session properties, and to address the duplicate request cache. No transport-specific information is used in the duplicate @@ -450,44 +456,39 @@ example, reads and writes may be assigned to specific, optimized connections, or sorted and separated by any or all of size, idempotency, etc. To address the problems described above, this proposal allows multiple sessions to share a clientid, as well as for multiple connections to share a session. Single Connection model: - NFSv4.1 client instance - | - Session + NFSv4.1 Session / \ Operations_Channel [Back_Channel] \ / Connection | Multi-connection trunked model (2 operations channels shown): - NFSv4.1 client instance - | - Session + NFSv4.1 Session / \ Operations_Channels [Back_Channel] | | | Connection Connection [Connection] | | | Multi-connection split-use model (2 mounts shown): - NFSv4.1 client instance + NFSv4.1 Session / \ - Session Session (/home) (/usr/local - readonly) / \ | Operations_Channel [Back_Channel] | | | Operations_Channel Connection [Connection] | | | Connection | In this way, implementation as well as resource management may be optimized. Each session will have its own response caching and @@ -591,53 +592,30 @@ enabled for a given session, the session reply must inform the client if the mode is in fact enabled. In this way the client can confidently proceed with operations without having to implement consistency facilities of its own. 2.2. Session Initialization and Transfer Models Session initialization issues, and data transfer models relevant to both TCP and RDMA are discussed in this section. -2.2.1. 
RDMA Requirements - - A complete discussion of the operation of RPC-based protocols atop - RDMA transports is in [RPCRDMA], and a general discussion of NFS - RDMA requirements is in [RDMAREQ]. Where RDMA is considered, this - proposal assumes the use of such a layering; it addresses only the - upper layer issues relevant to making best use of RPC/RDMA. - - A connection oriented (reliable sequenced) RDMA transport will be - required. There are several reasons for this. First, this model - most closely reflects the general NFSv4 requirement of long-lived - and congestion-controlled transports. Second, to operate correctly - over either an unreliable or unsequenced RDMA transport, or both, - would require significant complexity in the implementation and - protocol not appropriate for a strict minor version. For example, - retransmission on connected endpoints is explicitly disallowed in - the current NFSv4 draft; it would again be required with these - alternate transport characteristics. Third, the proposal assumes a - specific RDMA ordering semantic, which presents the same set of - ordering and reliability issues to the RDMA layer over such - transports. - - The RDMA implementation provides for making connections to other - RDMA-capable peers. In the case of the current proposals before - the RDDP working group, these RDMA connections are preceded by a - "streaming" phase, where ordinary TCP (or NFS) traffic might flow. - However, this is not assumed here and sizes and other parameters - are explicitly exchanged upon a session entering RDMA mode. - -2.2.2. Session Negotiation +2.2.1. Session Negotiation - Some of the parameters to be exchanged at session creation time are - as follows. + The following parameters are exchanged between client and server at + session creation time. 
Their values allow the server to properly + size resources allocated in order to service the client's requests, + and to provide the server with a way to communicate limits to the + client for proper and optimal operation. They are exchanged prior + to all session-related activity, over any transport type. + Discussion of their use is found in their descriptions as well as + throughout this section. Maximum Requests The client's desired maximum number of concurrent requests is passed, in order to allow the server to size its reply cache storage. The server may modify the client's requested limit downward (or upward) to match its local policy and/or resources. Over RDMA-capable RPC transports, the per-request management of low-level transport message credits is handled within the RPC layer. [RPCRDMA] @@ -658,46 +636,75 @@ Inline Padding/Alignment The server can inform the client of any padding which can be used to deliver NFSv4 inline WRITE payloads into aligned buffers. Such alignment can be used to avoid data copy operations at the server for both TCP and inline RDMA transfers. For RDMA, the client informs the server in each operation when padding has been applied. [RPCRDMA] Transport Attributes A placeholder for transport-specific attributes is provided, - with a format to be determined. Examples of information to be - passed in this parameter include transport security attributes - to be used on the connection, RDMA-specific attributes, legacy - "private data" as used on existing RDMA fabrics, transport - Quality of Service attributes, etc. This information is to be - passed to the peer's transport layer by local means which is - currently outside the scope of this draft, however one - attribute is provided in the RDMA case: + with a format to be determined. 
Possible examples of + information to be passed in this parameter include transport + security attributes to be used on the connection, RDMA- + specific attributes, legacy "private data" as used on existing + RDMA fabrics, transport Quality of Service attributes, etc. + This information is to be passed to the peer's transport layer + by local means which is currently outside the scope of this + draft, however one attribute is provided in the RDMA case: RDMA Read Resources - RDMA implementations must explicitly provision resources to - support RDMA Read requests from connected peers. These values - must be explicitly specified, to provide adequate resources - for matching the peer's expected needs and the connection's - delay-bandwidth parameters. The client provides its chosen - value to the server in the initial session creation, the value - must be provided in each client RDMA endpoint. The values are - asymmetric and should be set to zero at the server in order to - conserve RDMA resources, since clients do not issue RDMA Read - operations in this proposal. The result is communicated in - the session response, to permit matching of values across the - connection. The value may not be changed in the duration of - the session, although a new value may be requested as part of - a new session. + RDMA implementations must explicitly provision resources + to support RDMA Read requests from connected peers. + These values must be explicitly specified, to provide + adequate resources for matching the peer's expected needs + and the connection's delay-bandwidth parameters. The + client provides its chosen value to the server in the + initial session creation, the value must be provided in + each client RDMA endpoint. The values are asymmetric and + should be set to zero at the server in order to conserve + RDMA resources, since clients do not issue RDMA Read + operations in this proposal. 
The result is communicated + in the session response, to permit matching of values + across the connection. The value may not be changed in + the duration of the session, although a new value may be + requested as part of a new session. -2.2.3. Connection Resources +2.2.2. RDMA Requirements + + A complete discussion of the operation of RPC-based protocols atop + RDMA transports is in [RPCRDMA]. Where RDMA is considered, this + proposal assumes the use of such a layering; it addresses only the + upper layer issues relevant to making best use of RPC/RDMA. + + A connection oriented (reliable sequenced) RDMA transport will be + required. There are several reasons for this. First, this model + most closely reflects the general NFSv4 requirement of long-lived + and congestion-controlled transports. Second, to operate correctly + over either an unreliable or unsequenced RDMA transport, or both, + would require significant complexity in the implementation and + protocol not appropriate for a strict minor version. For example, + retransmission on connected endpoints is explicitly disallowed in + the current NFSv4 draft; it would again be required with these + alternate transport characteristics. Third, the proposal assumes a + specific RDMA ordering semantic, which presents the same set of + ordering and reliability issues to the RDMA layer over such + transports. + + The RDMA implementation provides for making connections to other + RDMA-capable peers. In the case of the current proposals before + the RDDP working group, these RDMA connections are preceded by a + "streaming" phase, where ordinary TCP (or NFS) traffic might flow. + However, this is not assumed here and sizes and other parameters + are explicitly exchanged upon a session entering RDMA mode. + +2.2.3. 
RDMA Connection Resources On transport endpoints which support automatic RDMA mode, that is, endpoints which are created in the RDMA-enabled state, a single, preposted buffer must initially be provided by both peers, and the client session negotiation must be the first exchange. On transport endpoints supporting dynamic negotiation, a more sophisticated negotiation is possible, but is not discussed in the current draft. @@ -717,21 +724,21 @@ the RPC layer to handle receives. These buffers remain in use by the RPC/NFSv4 implementation; the size and number of them must be known to the remote peer in order to avoid RDMA errors which would cause a fatal error on the RDMA connection. The session provides a natural way for the server to manage resource allocation to each client rather than to each transport connection itself. This enables considerable flexibility in the administration of transport endpoints. -2.2.4. Inline Transfer Model +2.2.4. TCP and RDMA Inline Transfer Model The basic transfer model for both TCP and RDMA is referred to as "inline". For TCP, this is the only transfer model supported, since TCP carries both the RPC header and data together in the data stream. For RDMA, the RDMA Send transfer model is used for all NFS requests and replies, but data is optionally carried by RDMA Writes or RDMA Reads. Use of Sends is required to ensure consistency of data and to deliver completion notifications. The pure-Send method is @@ -757,72 +764,76 @@ buffer : : Client Server : Write request with data : Send : ------------------------------> : untagged : : buffer : Write response : untagged : <------------------------------ : Send buffer : : - Responses must be sent to the client on the same RDMA connection - that the request was sent. This is important to preserve ordering - of operations, and especially RMDA consistency. 
Additionally, it - ensures that the RPC RDMA layer makes no requirement of the RDMA - provider to open its memory registration handles (Steering Tags) - beyond the scope of a single RDMA connection. This is an important - security consideration. + Responses must be sent to the client on the same connection on + which the request was sent. It is important that the server does not + assume any specific client implementation, in particular whether + connections within a session share any state at the client. This + is also important to preserve ordering of RDMA operations, and + especially RDMA consistency. Additionally, it ensures that the RPC + RDMA layer makes no requirement of the RDMA provider to open its + memory registration handles (Steering Tags) beyond the scope of a + single RDMA connection. This is an important security + consideration. Two values must be known to each peer prior to issuing Sends: the maximum number of sends which may be posted, and their maximum size. These values are referred to, respectively, as the message credits and the maximum message size. While the message credits might vary dynamically over the duration of the session, the - maximum message size does not. The server must commit to posting a - number of receive buffers equal to or greater than its currently - advertised credit value, each of the advertised size. If fewer - credits or smaller buffers are provided, the connection may fail - with an RDMA transport error. + maximum message size does not. The server must commit to + preserving this number of duplicate request cache entries, and + preparing a number of receive buffers equal to or greater than its + currently advertised credit value, each of the advertised size. + These commitments ensure that sufficient transport resources are + allocated to receive the full advertised limits. Note that the server must post the maximum number of session - requests to each client operations channel. 
It is not possible for - the client to spread its requests in any particular fashion across - connections within a session. Instead, the client may create + requests to each client operations channel. The client is not + required to spread its requests in any particular fashion across + connections within a session. If the client wishes, it may create multiple sessions, each with a single or small number of operations channels to provide the server with this resource advantage. Or, - the server may employ a "shared receive queue". The server can in - any case protect its resources by restricting the client's request - credits. + over RDMA the server may employ a "shared receive queue". The + server can in any case protect its resources by restricting the + client's request credits. While tempting to consider, it is not possible to use the TCP window as an RDMA operation flow control mechanism. First, to do so would violate layering, requiring both senders to be aware of the existing TCP outbound window at all times. Second, since requests are of variable size, the TCP window can hold a widely variable number of them, and since it cannot be reduced without actually receiving data, the receiver cannot limit the sender. Third, any middlebox interposing on the connection would wreck any possible scheme. [MIDTAX] In this proposal, maximum request count limits are exchanged at the session level to allow correct provisioning of receive buffers by transports. - When not operating over RDMA, request limits and sizes are still - employed in NFSv4.1, but instead of being required for correctness, - they provide the basis for efficient server implementation of the - duplicate request cache. The limits are chosen based upon the - expected needs and capabilities of the client and server, and are - in fact arbitrary. 
Sizes may be specified by the client as zero - (requesting the server's preferred or optimal value), and request - limits may be chosen in proportion to the client's capabilities. - For example, a limit of 1000 allows 1000 requests to be in - progress, which may generally be far more than adequate to keep - local networks and servers fully utilized. + When operating over TCP or other similar transport, request limits + and sizes are still employed in NFSv4.1, but instead of being + required for correctness, they provide the basis for efficient + server implementation of the duplicate request cache. The limits + are chosen based upon the expected needs and capabilities of the + client and server, and are in fact arbitrary. Sizes may be + specified by the client as zero (requesting the server's preferred + or optimal value), and request limits may be chosen in proportion + to the client's capabilities. For example, a limit of 1000 allows + 1000 requests to be in progress, which may generally be far more + than adequate to keep local networks and servers fully utilized. Both client and server have independent sizes and buffering, but over RDMA fabrics client credits are easily managed by posting a receive buffer prior to sending each request. Each such buffer may not be completed with the corresponding reply, since responses from NFSv4 servers arrive in arbitrary order. When an operations channel is also used for callbacks, the client must account for callback requests by posting additional buffers. Note that implementation-specific facilities such as a shared receive queue may also allow optimization of these allocations. @@ -844,21 +855,21 @@ procedures. Since an arbitrary number (total size) of operations can be specified in a single COMPOUND procedure, its size is effectively unbounded. This cannot be supported by RDMA Sends, and therefore this size negotiation places a restriction on the construction and maximum size of both COMPOUND requests and responses. 
If a COMPOUND results in a reply at the server that is larger than can be sent in an RDMA Send to the client, then the COMPOUND must terminate and the operation which causes the overflow will provide a TOOSMALL error status result. -2.2.5. Direct Transfer Model +2.2.5. RDMA Direct Transfer Model Placement of data by explicitly tagged RDMA operations is referred to as "direct" transfer. This method is typically used where the data payload is relatively large, that is, when RDMA setup has been performed prior to the operation, or when any overhead for setting up and performing the transfer is regained by avoiding the overhead of processing an ordinary receive. The client advertises RDMA buffers in this proposed model, and not the server. This means the "XDR Decoding with Read Chunks" @@ -1447,24 +1459,34 @@ that can be proposed when considering extensions. To support the duplicate request cache integrated with sessions and request control, it is desirable to tag each request with an identifier to be called a Slotid. This identifier must be passed by NFSv4 when running atop any transport, including traditional TCP. Therefore it is not desirable to add the Slotid to a new RPC transport, even though such a transport is indicated for support of RDMA. This draft and [RPCRDMA] do not propose such an approach. - Instead, this proposal confirms to the requirements of NFSv4 minor + Instead, this proposal conforms to the requirements of NFSv4 minor versioning, through the use of a new operation within NFSv4 COMPOUND procedures as detailed below. + If sessions are in use for a given clientid, this same clientid + cannot be used for non-session NFSv4 operation, including NFSv4.0. + Because the server will have allocated session-specific state to + the active clientid, it would be an unnecessary burden on the + server implementor to support and account for additional, non- + session traffic, in addition to being of no benefit. 
Therefore + this proposal prohibits a single clientid from doing this. + Nevertheless, employing a new clientid for such traffic is + supported. + 3.2. Slot Identifiers and Server Duplicate Request Cache The presence of deterministic maximum request limits on a session enables in-progress requests to be assigned unique values with useful properties. The RPC layer provides a transaction ID (xid), which, while required to be unique, is not especially convenient for tracking requests. The transaction ID is only meaningful to the issuer (client), it cannot be interpreted at the server except to test for @@ -1559,31 +1581,31 @@ granted maximum request count to the client, it may not be able to use receipt of the slotid to retire cache entries. The slotid used in an incoming request may not reflect the server's current idea of the client's session limit, because the request may have been sent from the client before the update was received. Therefore, in the downward adjustment case, the server may have to retain a number of duplicate request cache entries at least as large as the old value, until operation sequencing rules allow it to infer that the client has seen its reply. - The SEQUENCE operation also carries a "maxslot" value which carries - additional client slot usage information. The client must always - provide its highest-numbered outstanding slot value in the maxslot - argument, and the server may reply with a new recognized value. - The client should in all cases provide the most conservative value - possible, although it can be increased somewhat above the actual - instantaneous usage to maintain some minimum or optimal level. - This provides a way for the client to yield unused request slots - back to the server, which in turn can use the information to - reallocate resources. Obviously, maxslot can never be zero, or the - session would deadlock. 
+ The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot" + value which carries additional client slot usage information. The + client must always provide its highest-numbered outstanding slot + value in the maxslot argument, and the server may reply with a new + recognized value. The client should in all cases provide the most + conservative value possible, although it can be increased somewhat + above the actual instantaneous usage to maintain some minimum or + optimal level. This provides a way for the client to yield unused + request slots back to the server, which in turn can use the + information to reallocate resources. Obviously, maxslot can never + be zero, or the session would deadlock. The server also provides a target maxslot value to the client, which is an indication to the client of the maxslot the server wishes the client to be using. This permits the server to withdraw (or add) resources from a client that has been found to not be using them, in order to more fairly share resources among a varying level of demand from other clients. The client must always comply with the server's value updates, since they indicate newly established hard limits on the client's access to session resources. However, because of request pipelining, the client may @@ -1646,46 +1668,70 @@ recursive calls, etc). Often, such conversions are carried out even when no size or byte order conversion is necessary. It is recommended that implementations pay close attention to the details of memory referencing in such code. It is far more efficient to inspect data in place, using native facilities to deal with word size and byte order conversion into registers or local variables, rather than formally (and blindly) performing the operation via fetch, reallocate and store. - Of particular concern is the result of the READDIR_DIRECT - operation, in which such encoding abounds. + Of particular concern is the result of the READDIR operation, in + which such encoding abounds. 
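As a rough illustration of the in-place inspection recommended in section 3.4, the sketch below reads XDR items directly out of a receive buffer instead of copying each field into freshly allocated storage. It assumes standard XDR encoding (4-byte big-endian integers, variable-length opaque data padded to a 4-byte boundary). The helper names xdr_uint32 and xdr_opaque_view are hypothetical and do not appear in the draft, and Python is used only for brevity; a real implementation would more likely be written in C.

```python
import struct


def xdr_uint32(buf, offset):
    """Decode one XDR unsigned integer in place; return (value, next offset)."""
    # unpack_from reads straight from the buffer, with no intermediate copy.
    value = struct.unpack_from(">I", buf, offset)[0]
    return value, offset + 4


def xdr_opaque_view(buf, offset):
    """Return a zero-copy view of a variable-length opaque item.

    Hypothetical helper: the payload is exposed as a memoryview slice,
    so the bytes stay where the transport placed them.
    """
    length, offset = xdr_uint32(buf, offset)
    padded = (length + 3) & ~3  # XDR pads opaque data to a multiple of 4
    if offset + padded > len(buf):
        raise ValueError("opaque item runs past end of buffer")
    view = memoryview(buf)[offset:offset + length]
    return view, offset + padded
```

A caller would walk a reply by threading the returned offset from one helper into the next, touching each field only once.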
3.5. Effect of Sessions on Existing Operations The use of a session replaces the use of the SETCLIENTID and SETCLIENTID_CONFIRM operations, and allows certain simplification of the RENEW and callback addressing mechanisms in the base protocol. The cb_program and cb_location which are obtained by the server in SETCLIENTID_CONFIRM must not be used by the server, because the NFSv4.1 client performs callback channel designation with BIND_BACKCHANNEL. Therefore the SETCLIENTID and SETCLIENTID_CONFIRM operations become obsolete when sessions are in use, and a server should return an error to NFSv4.1 clients which might issue either operation. - Since the session carries the client indication with it implicitly, - any request on a session associated with a given client will renew - that client's leases. Therefore the RENEW operation is made - unnecessary when a session is present, as any request (e.g. a - SEQUENCE operation with or without and additional NFSv4 operations) - performs its function. It is possible (though this proposal does - not make any recommendation) that the RENEW operation could be made - obsolete. + Another favorable result of the session is that the server is able + to avoid requiring the client to perform OPEN_CONFIRM operations. + The existence of a reliable and effective DRC means that the server + will be able to determine whether an OPEN request carrying a + previously known open_owner from a client is or is not a + retransmission. Because of this, the server no longer requires + OPEN_CONFIRM to verify whether the client is retransmitting an open + request. This in turn eliminates the server's reason for + requesting OPEN_CONFIRM - the server can simply replace any + previous information on this open_owner. Client OPEN operations + are therefore streamlined, reducing overhead and latency through + avoiding the additional OPEN_CONFIRM exchange.
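The retransmission test described above can be sketched as a slot-indexed reply cache. This is a minimal illustration in Python under stated assumptions, not the proposal's specified algorithm: the class and method names are hypothetical, the rule that a new request on a slot carries the previous sequenceid plus one is an assumption of this example, and the "SEQ_MISORDERED" result is a placeholder rather than a defined error. Only NFS4ERR_BADSLOT is taken from the error list in this document.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Slot:
    sequenceid: int = 0           # last sequenceid executed on this slot
    reply: Optional[str] = None   # cached reply for that sequenceid


class SlotReplyCache:
    """Hypothetical duplicate request cache with one entry per session slot."""

    def __init__(self, maxrequests: int) -> None:
        # The table is bounded by the session's negotiated request limit.
        self.slots: List[Slot] = [Slot() for _ in range(maxrequests)]

    def handle(self, slotid: int, sequenceid: int,
               execute: Callable[[], str]) -> Tuple[str, bool]:
        """Return (reply, was_retransmission)."""
        if not 0 <= slotid < len(self.slots):
            return "NFS4ERR_BADSLOT", False
        slot = self.slots[slotid]
        if slot.reply is not None and sequenceid == slot.sequenceid:
            # Same slot and sequenceid: replay the cached reply, do not re-execute.
            return slot.reply, True
        if sequenceid == slot.sequenceid + 1:
            # Assumed rule: the next sequenceid marks a new request on the slot.
            slot.reply = execute()
            slot.sequenceid = sequenceid
            return slot.reply, False
        return "SEQ_MISORDERED", False  # placeholder, not a defined error


executed = []


def do_open() -> str:
    # Stand-in for executing an OPEN; records that it actually ran.
    executed.append("OPEN")
    return "OPEN ok"


cache = SlotReplyCache(maxrequests=2)
first_try = cache.handle(0, 1, do_open)
second_try = cache.handle(0, 1, do_open)
```

Because the second call is recognized as a retransmission from the slot state alone, the OPEN is executed once and no confirming exchange is needed to tell the two cases apart.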
+ + Since the session carries the client liveness indication with it + implicitly, any request on a session associated with a given client + will renew that client's leases. Therefore the RENEW operation is + made unnecessary when a session is present, as any request + (including a SEQUENCE operation with or without additional NFSv4 + operations) performs its function. It is possible (though this + proposal does not make any recommendation) that the RENEW operation + could be made obsolete. + + An interesting issue arises however if an error occurs on such a + SEQUENCE operation. If the SEQUENCE operation fails, perhaps due + to an invalid slotid or other non-renewal-based issue, the server + may or may not have performed the RENEW. In this case, the state + of any renewal is undefined, and the client should make no + assumption that it has been performed. In practice, this should + not occur but even if it did, it is expected the client would + perform some sort of recovery which would result in a new, + successful, SEQUENCE operation being run and the client assured + that the renewal took place. 3.6. Authentication Efficiencies NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor [RFC2203] to provide authentication, integrity, and privacy via cryptography. The server dictates to the client the use of RPCSEC_GSS, the service (authentication, integrity, or privacy), and the specific GSS-API security mechanism that each remote procedure call and result will use. @@ -1769,40 +1815,75 @@ If the NFS client wishes to maintain full control over RPCSEC_GSS protection, it may still perform its transfer operations using either the inline or RDMA transfer model, or of course employ traditional TCP stream operation. In the RDMA inline case, header padding is recommended to optimize behavior at the server. At the client, close attention should be paid to the implementation of RPCSEC_GSS processing to minimize memory referencing and especially copying. 
These are well-advised in any case! - Proper authentication of the session and clientid creation - operation of the proposed NFSv4.1 exactly follows the similar - requirement on client identifiers in NFSv4.0. It must not be - possible for a client to bind a callback channel to an existing - session by guessing its session identifier. To protect against - this, NFSv4.0 requires appropriate authentication and matching of - the principal used. This is discussed in Section 16, Security - Considerations of [RFC3530]. The same requirement before binding - to a session identifier applies here. - The proposed session callback channel binding improves security over that provided by NFSv4 for the callback channel. The connection is client-initiated, and subject to the same firewall and routing checks as the operations channel. The connection cannot be hijacked by an attacker who connects to the client port prior to the intended server. The connection is set up by the client with its desired attributes, such as optionally securing with IPsec or similar. The binding is fully authenticated before being activated. +4.1. Authentication + + Proper authentication of the principal which issues any session and + clientid in the proposed NFSv4.1 operations exactly follows the + similar requirement on client identifiers in NFSv4.0. It must not + be possible for a client to impersonate another by guessing its + session identifiers for NFSv4.1 operations, nor to bind a callback + channel to an existing session. To protect against this, NFSv4.0 + requires appropriate authentication and matching of the principal + used. This is discussed in Section 16, Security Considerations of + [RFC3530]. The same requirement when using a session identifier + applies to NFSv4.1 here. + + Going beyond NFSv4.0, the presence of a session associated with any + clientid may also be used to enhance NFSv4.1 security with respect + to client impersonation. 
In NFSv4.0, there are many operations + which carry no clientid, including in particular those which employ + a stateid argument. A rogue client which wished to carry out a + denial of service attack on another client could perform CLOSE, + DELEGRETURN, etc operations with that client's current filehandle, + sequenceid and stateid, after having obtained them from + eavesdropping or other approach. Locking and open downgrade + operations could be similarly attacked. + + When an NFSv4.1 session is in place for any clientid, + countermeasures are easily applied through use of authentication by + the server. Because the clientid and sessionid must be present in + each request within a session, the server may verify that the + clientid is in fact originating from a principal with the + appropriate authenticated credentials, that the sessionid belongs + to the clientid, and that the stateid is valid in these contexts. + This is in general not possible with the affected operations in + NFSv4.0 due to the fact that the clientid is not present in the + requests. + + In the event that authentication information is not available in + the incoming request, for example after a reconnection when the + security was previously downgraded using CCM, the server must + require the client re-establish the authentication in order that + the server may validate the other client-provided context, prior to + executing any operation. The sessionid, present in the newly + retransmitted request, combined with the retransmission detection + enabled by the NFSv4.1 duplicate request cache, are a convenient + and reliable context for the server to use for this contingency. + The server should take care to protect itself against denial of service attacks in the creation of sessions and clientids. Clients who connect and create sessions, only to disconnect and never use them may leave significant state behind. 
(The same issue applies to NFSv4.0 with clients who may perform SETCLIENTID, then never perform SETCLIENTID_CONFIRM.) Careful authentication coupled with resource checks is highly recommended. 5. IANA Considerations @@ -2057,26 +2138,25 @@ { id_arg, verifier_arg, *, clientid_ret, FALSE } ERRORS NFS4ERR_BADXDR NFS4ERR_CLID_INUSE NFS4ERR_INVAL NFS4ERR_RESOURCE NFS4ERR_SERVERFAULT -6.2. Operation: CREATE_SESSION - Create New Session and Confirm -Clientid +6.2. Operation: CREATESESSION - Create New Session and Confirm Clientid SYNOPSIS - clientid, sessionid, session_args -> session_args + clientid, session_args -> sessionid, session_args ARGUMENT struct CREATESESSION4args { clientid4 clientid; bool persist; count4 maxrequestsize; count4 maxresponsesize; count4 maxrequests; count4 headerpadsize; switch (bool clientid_confirm) { @@ -2089,42 +2169,42 @@ case DEFAULT: void; case STREAM: streamchannelattrs4 streamchanattrs; case RDMA: rdmachannelattrs4 rdmachanattrs; }; }; RESULT - typedef uint32_t sessionid4; + typedef opaque sessionid4[16]; struct CREATESESSION4resok { sessionid4 sessionid; bool persist; count4 maxrequestsize; count4 maxresponsesize; count4 maxrequests; count4 headerpadsize; switch (channelmode4 mode) { case DEFAULT: void; case STREAM: streamchannelattrs4 streamchanattrs; case RDMA: rdmachannelattrs4 rdmachanattrs; }; }; union CREATESESSION4res switch (nfsstat4 status) { case NFS4_OK: - CREATE_SESSION4resok resok4; + CREATESESSION4resok resok4; default: void; }; DESCRIPTION This operation is used by the client to create new session objects on the server. Additionally the first session created with a new shorthand client identifier serves to confirm the creation of that client's state on the server. 
The server returns the parameter @@ -2405,36 +2485,35 @@ clientid4 clientid; sessionid4 sessionid; sequenceid4 sequenceid; slotid4 slotid; slotid4 maxslot; slotid4 target_maxslot; }; union SEQUENCE4res switch (nfsstat4 status) { case NFS4_OK: - struct SEQUENCE4resok resok4; + SEQUENCE4resok resok4; default: void; }; DESCRIPTION The SEQUENCE operation is used to manage operational accounting for the session on which the operation is sent. The contents include the client and session to which this request belongs, slotid and sequenceid, used by the server to implement session request control and the duplicate reply cache semantics, and exchanged slot counts which are used to adjust these values. This operation must appear - once as the first operation in each COMPOUND and CB_COMPOUND sent - after the channel is successfully bound, or a protocol error must - result. + once as the first operation in each COMPOUND sent after the channel + is successfully bound, or a protocol error must result. ... ERRORS NFS4ERR_BADSESSION NFS4ERR_BADSLOT 6.6. Callback operation: CB_RECALLCREDIT - change flow control limits @@ -2460,20 +2539,77 @@ The CB_RECALLCREDIT operation requests the client to return session and transport credits to the server, by zero-length RDMA Sends or NULL NFSv4 operations. ... ERRORS +6.7. 
Callback operation: CB_SEQUENCE - Supply callback channel +sequencing and control + + SYNOPSIS + + control -> control + + ARGUMENT + + typedef uint32_t sequenceid4; + typedef uint32_t slotid4; + + struct CB_SEQUENCE4args { + clientid4 clientid; + sessionid4 sessionid; + sequenceid4 sequenceid; + slotid4 slotid; + slotid4 maxslot; + }; + + RESULT + + struct CB_SEQUENCE4resok { + clientid4 clientid; + sessionid4 sessionid; + sequenceid4 sequenceid; + slotid4 slotid; + slotid4 maxslot; + slotid4 target_maxslot; + }; + + union CB_SEQUENCE4res switch (nfsstat4 status) { + case NFS4_OK: + CB_SEQUENCE4resok resok4; + default: + void; + }; + + DESCRIPTION + + The CB_SEQUENCE operation is used to manage operational accounting + for the callback channel of the session on which the operation is + sent. The contents include the client and session to which this + request belongs, slotid and sequenceid, used by the server to + implement session request control and the duplicate reply cache + semantics, and exchanged slot counts which are used to adjust these + values. This operation must appear once as the first operation in + each CB_COMPOUND sent after the callback channel is successfully + bound, or a protocol error must result. + + ... + + ERRORS + + NFS4ERR_BADSESSION + NFS4ERR_BADSLOT + 7. NFSv4 Session Protocol Description This section contains the proposed protocol changes in RPC description language. The constants named in this section are illustrative. When the working group decides on the full content of the NFSv4.1 minor revision, they may change in order to avoid conflict. NFS4ERR_BADSESSION = 10049,/* invalid session */ NFS4ERR_BADSLOT = 10050 /* invalid slotid */ @@ -2496,84 +2632,82 @@ CREATECLIENTID4resok resok4; default: void; }; /* * Channel attributes - TBD. 
     */

    enum channelmode4 {
-           DEFAULT         = 0,    // don't change
-           STREAM          = 1,    // TCP stream
-           RDMA            = 2     // upshift to RDMA
+           DEFAULT         = 0,    /* don't change */
+           STREAM          = 1,    /* TCP stream */
+           RDMA            = 2     /* upshift to RDMA */
    };

    struct streamchannelattrs4 {
-           /* TBD */
+           opaque          nothing[0];     /* TBD */
    };

    struct rdmachannelattrs4 {
            count4          maxrdmareads;
            /* plus TBD */
    };

    /*
     * CREATESESSION: v4.1 session creation and optional
     * clientid confirm
     */

    typedef opaque sessionid4[16];

-   struct CREATESESSION4args {
-           clientid4       clientid;
-           bool            persist;
-           count4          maxrequestsize;
-           count4          maxresponsesize;
-           count4          maxrequests;
-           count4          headerpadsize;
-           switch (bool clientid_confirm) {
+   union optverifier4 switch (bool clientid_confirm) {
            case TRUE:
                    verifier4       setclientid_confirm;
            case FALSE:
                    void;
-           }
-           switch (channelmode4 mode) {
+   };
+
+   union transportattrs4 switch (channelmode4 mode) {
            case DEFAULT:
                    void;
            case STREAM:
                    streamchannelattrs4     streamchanattrs;
            case RDMA:
                    rdmachannelattrs4       rdmachanattrs;
            };
+
+   struct CREATESESSION4args {
+           clientid4       clientid;
+           bool            persist;
+           count4          maxrequestsize;
+           count4          maxresponsesize;
+           count4          maxrequests;
+           count4          headerpadsize;
+           optverifier4    verifier;
+           transportattrs4 transportattrs;
    };

    struct CREATESESSION4resok {
            sessionid4      sessionid;
            bool            persist;
            count4          maxrequestsize;
            count4          maxresponsesize;
            count4          maxrequests;
            count4          headerpadsize;
-           switch (channelmode4 mode) {
-           case DEFAULT:
-                   void;
-           case STREAM:
-                   streamchannelattrs4     streamchanattrs;
-           case RDMA:
-                   rdmachannelattrs4       rdmachanattrs;
-           };
+           transportattrs4 transportattrs;
    };

    union CREATESESSION4res switch (nfsstat4 status) {
    case NFS4_OK:
-           CREATE_SESSION4resok    resok4;
+           CREATESESSION4resok     resok4;
    default:
+           void;
    };

    /*
     * BIND_BACKCHANNEL: v4.1 callback binding
     */

    struct BIND_BACKCHANNEL4args {
            clientid4       clientid;
            uint32_t        callback_program;
@@ -2574,42 +2708,28 @@
     * BIND_BACKCHANNEL: v4.1 callback binding
     */

    struct BIND_BACKCHANNEL4args {
            clientid4       clientid;
            uint32_t        callback_program;
            uint32_t
callback_ident; count4 maxrequestsize; count4 maxresponsesize; count4 maxrequests; - switch (channelmode4 mode) { - case DEFAULT: - void; - case STREAM: - streamchannelattrs4 streamchanattrs; - case RDMA: - rdmachannelattrs4 rdmachanattrs; - }; + transportattrs4 transportattrs; }; struct BIND_BACKCHANNEL4resok { count4 maxrequestsize; count4 maxresponsesize; count4 maxrequests; - switch (channelmode4 mode) { - case DEFAULT: - void; - case STREAM: - streamchannelattrs4 streamchanattrs; - case RDMA: - rdmachannelattrs4 rdmachanattrs; - }; + transportattrs4 transportattrs; }; union BIND_BACKCHANNEL4res switch (nfsstat4 status) { case NFS4_OK: BIND_BACKCHANNEL4resok resok4; default: void; }; /* @@ -2694,30 +2814,63 @@ struct CB_RECALLCREDIT4args { sessionid4 sessionid; uint32_t target; }; struct CB_RECALLCREDIT4res { nfsstat4 status; }; + /* + * CB_SEQUENCE: v4.1 operation sequence control + */ + + struct CB_SEQUENCE4args { + clientid4 clientid; + sessionid4 sessionid; + sequenceid4 sequenceid; + slotid4 slotid; + slotid4 maxslot; + }; + + struct CB_SEQUENCE4resok { + clientid4 clientid; + sessionid4 sessionid; + sequenceid4 sequenceid; + slotid4 slotid; + slotid4 maxslot; + slotid4 target_maxslot; + }; + + union CB_SEQUENCE4res switch (nfsstat4 status) { + case NFS4_OK: + struct CB_SEQUENCE4resok resok4; + default: + void; + }; + /* Operation values */ OP_CB_RECALL_CREDIT = 5, + OP_CB_SEQUENCE = 6 /* Operation arguments */ case OP_CB_RECALLCREDIT: CB_RECALLCREDIT4args opcbrecallcredit; + case OP_CB_SEQUENCE: + CB_SEQUENCE4args opcbsequence; /* Operation results */ case OP_CB_RECALLCREDIT: CB_RECALLCREDIT4res opcbrecallcredit; + case OP_CB_SEQUENCE: + CB_SEQUENCE4res opcbsequence; 8. Acknowledgements The authors wish to acknowledge the valuable contributions and review of Charles Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk and Mark Wittle. 9. 
References @@ -2750,21 +2904,21 @@ [DCK+03] M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T. Talpey, M. Wittle, "The Direct Access File System", in Proceedings of 2nd USENIX Conference on File and Storage Technologies (FAST '03), San Francisco, CA, March 31 - April 2, 2003 [DDP] H. Shah, J. Pinkerton, R. Recio, P. Culley, "Direct Data Placement over Reliable Transports", - http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-01 + http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-03 [FJDAFS] Fujitsu Prime Software Technologies, "Meet the DAFS Performance with DAFS/VI Kernel Implementation using cLAN", http://www.pst.fujitsu.com/english/dafsdemo/index.html [FJNFS] Fujitsu Prime Software Technologies, "An Adaptation of VIA to NFS on Linux", http://www.pst.fujitsu.com/english/nfs/index.html @@ -2785,51 +2939,45 @@ Proceedings of 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002. [MIDTAX] B. Carpenter, S. Brim, "Middleboxes: Taxonomy and Issues", Informational RFC, http://www.ietf.org/rfc/rfc3234 [NFSDDP] B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet-Draft Work in Progress, http://www.ietf.org/internet- - drafts/draft-ietf-nfsv4-nfsdirect-00 + drafts/draft-ietf-nfsv4-nfsdirect-01 [NFSPS] T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet-Draft Work in Progress, http://www.ietf.org/internet- - drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement-00 - - [RDMAREQ] - B. Callaghan, M. Wittle, "NFS RDMA requirements", Internet- - Draft Work in Progress, http://www.ietf.org/internet- - drafts/draft-callaghan-nfs-rdmareq-00 + drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement-02 [RDDP] Remote Direct Data Placement Working Group charter, http://www.ietf.org/html.charters/rddp-charter.html [RDDPPS] A. Romanow, J. Mogul, T. Talpey, S. 
Bailey, Remote Direct Data Placement Working Group Problem Statement, Internet-Draft Work in Progress, http://www.ietf.org/internet-drafts/draft-ietf- - rddp-problem-statement-04 + rddp-problem-statement-05 [RDMAP] R. Recio, P. Culley, D. Garcia, J. Hilland, "An RDMA Protocol Specification", Internet-Draft Work in Progress, - http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-01 - + http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-03 [RPCRDMA] B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC" Internet-Draft Work in Progress, http://www.ietf.org/internet- - drafts/draft-ietf-nfsv4-rpcrdma-00 + drafts/draft-ietf-nfsv4-rpcrdma-01 [RFC2203] M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification", Standards Track RFC, http://www.ietf.org/rfc/rfc2203 [RW96] R. Werme, "RPC XID Issues", Connectathon 1996, San Jose, CA, http://www.cthon.org/talks96/werme1.pdf @@ -2858,21 +3006,21 @@ University of Michigan Center for Information Technology Integration 535 W. William St. Suite 3100 Ann Arbor, MI 48103 USA Phone: +1 734 615-4782 Email: baumanj@umich.edu 11. Full Copyright Statement - Copyright (C) The Internet Society (2004). This document is + Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78 and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR