NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: February 26, 2007                                       Editors
                                                         August 25, 2006

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-06.txt
Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on February 26, 2007.
Copyright Notice

Copyright (C) The Internet Society (2006).
Abstract

This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major
Table of Contents

1.  Introduction
   1.1.  The NFSv4.1 Protocol
   1.2.  NFS Version 4 Goals
   1.3.  Minor Version 1 Goals
   1.4.  Inconsistencies of this Document with Section XX
   1.5.  Overview of NFS version 4.1 Features
      1.5.1.  RPC and Security
      1.5.2.  Protocol Structure
      1.5.3.  File System Model
      1.5.4.  Locking Facilities
   1.6.  General Definitions
   1.7.  Differences from NFSv4.0
2.  Core Infrastructure
   2.1.  Introduction
   2.2.  RPC and XDR
      2.2.1.  RPC-based Security
   2.3.  Non-RPC-based Security Services
      2.3.1.  Authorization
      2.3.2.  Auditing
      2.3.3.  Intrusion Detection
   2.4.  Transport Layers
      2.4.1.  Ports
      2.4.2.  Stream Transports
      2.4.3.  RDMA Transports
   2.5.  Session
      2.5.1.  Motivation and Overview
      2.5.2.  NFSv4 Integration
      2.5.3.  Channels
      2.5.4.  Exactly Once Semantics
   2.6.  Channel Management
      2.6.1.  Buffer Management
      2.6.2.  Data Transfer
      2.6.3.  Flow Control
      2.6.4.  COMPOUND Sizing Issues
      2.6.5.  Data Alignment
   2.7.  Sessions Security
      2.7.1.  Denial of Service via Unauthorized State Changes
   2.8.  Session Mechanics - Steady State
      2.8.1.  Obligations of the Server
      2.8.2.  Obligations of the Client
      2.8.3.  Steps the Client Takes To Establish a Session
      2.8.4.  Session Mechanics - Recovery
3.  RPC and Security Flavor
   3.1.  Ports and Transports
      3.1.1.  Client Retransmission Behavior
   3.2.  Security Flavors
      3.2.1.  Security mechanisms for NFS version 4
   3.3.  Security Negotiation
      3.3.1.  SECINFO and SECINFO_NO_NAME
      3.3.2.  Security Error
      3.3.3.  Callback RPC Authentication
      3.3.4.  GSS Server Principal
4.  Security Negotiation
5.  Clarification of Security Negotiation in NFSv4.1
   5.1.  PUTFH + LOOKUP
   5.2.  PUTFH + LOOKUPP
   5.3.  PUTFH + SECINFO
   5.4.  PUTFH + Anything Else
6.  NFSv4.1 Sessions
   6.1.  Sessions Background
      6.1.1.  Introduction to Sessions
      6.1.2.  Session Model
      6.1.3.  Connection State
      6.1.4.  NFSv4 Channels, Sessions and Connections
      6.1.5.  Reconnection, Trunking and Failover
      6.1.6.  Server Duplicate Request Cache
   6.2.  Session Initialization and Transfer Models
      6.2.1.  Session Negotiation
      6.2.2.  RDMA Requirements
      6.2.3.  RDMA Connection Resources
      6.2.4.  TCP and RDMA Inline Transfer Model
      6.2.5.  RDMA Direct Transfer Model
   6.3.  Connection Models
      6.3.1.  TCP Connection Model
      6.3.2.  Negotiated RDMA Connection Model
      6.3.3.  Automatic RDMA Connection Model
   6.4.  Buffer Management, Transfer, Flow Control
   6.5.  Retry and Replay
   6.6.  The Back Channel
   6.7.  COMPOUND Sizing Issues
   6.8.  Data Alignment
   6.9.  NFSv4 Integration
      6.9.1.  Minor Versioning
      6.9.2.  Slot Identifiers and Server Duplicate Request Cache
      6.9.3.  Resolving server callback races with sessions
      6.9.4.  COMPOUND and CB_COMPOUND
   6.10. Sessions Security Considerations
      6.10.1. Denial of Service via Unauthorized State Changes
   6.11. Session Mechanics - Steady State
      6.11.1. Obligations of the Server
      6.11.2. Obligations of the Client
      6.11.3. Steps the Client Takes To Establish a Session
   6.12. Session Mechanics - Recovery
      6.12.1. Events Requiring Client Action
      6.12.2. Events Requiring Server Action
7.  Minor Versioning
8.  Protocol Data Types
   8.1.  Basic Data Types
   8.2.  Structured Data Types
9.  Filehandles
   9.1.  Obtaining the First Filehandle
      9.1.1.  Root Filehandle
      9.1.2.  Public Filehandle
   9.2.  Filehandle Types
      9.2.1.  General Properties of a Filehandle
      9.2.2.  Persistent Filehandle
      9.2.3.  Volatile Filehandle
   9.3.  One Method of Constructing a Volatile Filehandle
   9.4.  Client Recovery from Filehandle Expiration
10. File Attributes
   10.1.  Mandatory Attributes
   10.2.  Recommended Attributes
   10.3.  Named Attributes
   10.4.  Classification of Attributes
   10.5.  Mandatory Attributes - Definitions
   10.6.  Recommended Attributes - Definitions
   10.7.  Time Access
   10.8.  Interpreting owner and owner_group
   10.9.  Character Case Attributes
   10.10. Quota Attributes
   10.11. mounted_on_fileid
   10.12. send_impl_id and recv_impl_id
   10.13. fs_layout_type
   10.14. layout_type
   10.15. layout_hint
   10.16. mdsthreshold
11. Access Control Lists
   11.1.  Goals
   11.2.  File Attributes Discussion
      11.2.1.  ACL Attribute
      11.2.2.  mode Attribute
   11.3.  Common Methods
      11.3.1.  Interpreting an ACL
      11.3.2.  Computing a Mode Attribute from an ACL
   11.4.  Requirements
      11.4.1.  Setting the mode and/or ACL Attributes
      11.4.2.  Retrieving the mode and/or ACL Attributes
      11.4.3.  Creating New Objects
12. Single-server Name Space
   12.1.  Server Exports
   12.2.  Browsing Exports
   12.3.  Server Pseudo File System
   12.4.  Multiple Roots
   12.5.  Filehandle Volatility
   12.6.  Exported Root
   12.7.  Mount Point Crossing
   12.8.  Security Policy and Name Space Presentation
13. File Locking and Share Reservations
   13.1.  Locking
      13.1.1.  Client ID
      13.1.2.  Server Release of Clientid
      13.1.3.  State-owner and Stateid Definition
      13.1.4.  Use of the Stateid and Locking
   13.2.  Lock Ranges
   13.3.  Upgrading and Downgrading Locks
   13.4.  Blocking Locks
   13.5.  Lease Renewal
   13.6.  Crash Recovery
      13.6.1.  Client Failure and Recovery
      13.6.2.  Server Failure and Recovery
      13.6.3.  Network Partitions and Recovery
   13.7.  Server Revocation of Locks
   13.8.  Share Reservations
   13.9.  OPEN/CLOSE Operations
   13.10. Open Upgrade and Downgrade
   13.11. Short and Long Leases
   13.12. Clocks, Propagation Delay, and Calculating Lease Expiration
   13.13. Vestigial Locking Infrastructure From V4.0
14. Client-Side Caching
   14.1.  Performance Challenges for Client-Side Caching
   14.2.  Delegation and Callbacks
      14.2.1.  Delegation Recovery
   14.3.  Data Caching
      14.3.1.  Data Caching and OPENs
      14.3.2.  Data Caching and File Locking
      14.3.3.  Data Caching and Mandatory File Locking
      14.3.4.  Data Caching and File Identity
   14.4.  Open Delegation
      14.4.1.  Open Delegation and Data Caching
      14.4.2.  Open Delegation and File Locks
      14.4.3.  Handling of CB_GETATTR
      14.4.4.  Recall of Open Delegation
      14.4.5.  Clients that Fail to Honor Delegation Recalls
      14.4.6.  Delegation Revocation
   14.5.  Data Caching and Revocation
      14.5.1.  Revocation Recovery for Write Open Delegation
   14.6.  Attribute Caching
   14.7.  Data and Metadata Caching and Memory Mapped Files
   14.8.  Name Caching
   14.9.  Directory Caching
15. Multi-server Name Space
   15.1.  Location attributes
   15.2.  File System Presence or Absence
   15.3.  Getting Attributes for an Absent File System
      15.3.1.  GETATTR Within an Absent File System
      15.3.2.  READDIR and Absent File Systems
   15.4.  Uses of Location Information
      15.4.1.  File System Replication
      15.4.2.  File System Migration
      15.4.3.  Referrals
   15.5.  Additional Client-side Considerations
   15.6.  Effecting File System Transitions
      15.6.1.  Transparent File System Transitions
      15.6.2.  Filehandles and File System Transitions
      15.6.3.  Fileid's and File System Transitions
      15.6.4.  Fsid's and File System Transitions
      15.6.5.  The Change Attribute and File System Transitions
      15.6.6.  Lock State and File System Transitions
      15.6.7.  Write Verifiers and File System Transitions
   15.7.  Effecting File System Referrals
      15.7.1.  Referral Example (LOOKUP)
      15.7.2.  Referral Example (READDIR)
   15.8.  The Attribute fs_absent
   15.9.  The Attribute fs_locations
   15.10. The Attribute fs_locations_info
   15.11. The Attribute fs_status
16. Directory Delegations
   16.1.  Introduction to Directory Delegations
   16.2.  Directory Delegation Design (in brief)
   16.3.  Recommended Attributes in support of Directory Delegations
   16.4.  Delegation Recall
   16.5.  Directory Delegation Recovery
17. Parallel NFS (pNFS)
   17.1.  Introduction
   17.2.  General Definitions
      17.2.1.  Metadata Server
      17.2.2.  Client
      17.2.3.  Storage Device
      17.2.4.  Storage Protocol
      17.2.5.  Control Protocol
      17.2.6.  Metadata
      17.2.7.  Layout
   17.3.  pNFS protocol semantics
      17.3.1.  Definitions
      17.3.2.  Guarantees Provided by Layouts
      17.3.3.  Getting a Layout
      17.3.4.  Committing a Layout
      17.3.5.  Recalling a Layout
      17.3.6.  Metadata Server Write Propagation
      17.3.7.  Crash Recovery
      17.3.8.  Security Considerations
   17.4.  The NFSv4 File Layout Type
      17.4.1.  File Striping and Data Access
      17.4.2.  Global Stateid Requirements
      17.4.3.  The Layout Iomode
      17.4.4.  Storage Device State Propagation
      17.4.5.  Storage Device Component File Size
      17.4.6.  Crash Recovery Considerations
      17.4.7.  Security Considerations for the File Layout Type
      17.4.8.  Alternate Approaches
18. Internationalization
   18.1.  Stringprep profile for the utf8str_cs type
   18.2.  Stringprep profile for the utf8str_cis type
   18.3.  Stringprep profile for the utf8str_mixed type
   18.4.  UTF-8 Related Errors
19. Error Values
   19.1.  Error Definitions
   19.2.  Operations and their valid errors
   19.3.  Callback operations and their valid errors
   19.4.  Errors and the operations that use them
20. NFS version 4.1 Procedures
   20.1.  Procedure 0: NULL - No Operation
   20.2.  Procedure 1: COMPOUND - Compound Operations
21. NFS version 4.1 Operations
   21.1.  Operation 3: ACCESS - Check Access Rights
   21.2.  Operation 4: CLOSE - Close File
   21.3.  Operation 5: COMMIT - Commit Cached Data
   21.4.  Operation 6: CREATE - Create a Non-Regular File Object
   21.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery
   21.6.  Operation 8: DELEGRETURN - Return Delegation
   21.7.  Operation 9: GETATTR - Get Attributes
   21.8.  Operation 10: GETFH - Get Current Filehandle
   21.9.  Operation 11: LINK - Create Link to a File
   21.10. Operation 12: LOCK - Create Lock
   21.11. Operation 13: LOCKT - Test For Lock
   21.12. Operation 14: LOCKU - Unlock File
   21.13. Operation 15: LOOKUP - Lookup Filename
   21.14. Operation 16: LOOKUPP - Lookup Parent Directory
   21.15. Operation 17: NVERIFY - Verify Difference in Attributes
   21.16. Operation 18: OPEN - Open a Regular File
   21.17. Operation 19: OPENATTR - Open Named Attribute Directory
   21.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
   21.19. Operation 22: PUTFH - Set Current Filehandle
   21.20. Operation 23: PUTPUBFH - Set Public Filehandle
   21.21. Operation 24: PUTROOTFH - Set Root Filehandle
   21.22. Operation 25: READ - Read from File
   21.23. Operation 26: READDIR - Read Directory
   21.24. Operation 27: READLINK - Read Symbolic Link
   21.25. Operation 28: REMOVE - Remove File System Object
   21.26. Operation 29: RENAME - Rename Directory Entry
   21.27. Operation 31: RESTOREFH - Restore Saved Filehandle
   21.28. Operation 32: SAVEFH - Save Current Filehandle
   21.29. Operation 33: SECINFO - Obtain Available Security
   21.30. Operation 34: SETATTR - Set Attributes
   21.31. Operation 37: VERIFY - Verify Same Attributes
   21.32. Operation 38: WRITE - Write to File
   21.33. Operation 40: BACKCHANNEL_CTL - Backchannel control
   21.34. Operation 41: BIND_CONN_TO_SESSION
   21.35. Operation 42: CREATE_CLIENTID - Instantiate Clientid
   21.36. Operation 43: CREATE_SESSION - Create New Session and
          Confirm Clientid
   21.37. Operation 44: DESTROY_SESSION - Destroy existing session
   21.38. Operation 45: FREE_STATEID - Free stateid with no locks
   21.39. Operation 46: GET_DIR_DELEGATION - Get a directory delegation
   21.40. Operation 47: GETDEVICEINFO - Get Device Information
   21.41. Operation 48: GETDEVICELIST
   21.42. Operation 49: LAYOUTCOMMIT - Commit writes made using a
          layout
   21.43. Operation 50: LAYOUTGET - Get Layout Information
   21.44. Operation 51: LAYOUTRETURN - Release Layout Information
   21.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
          Object
   21.46. Operation 53: SEQUENCE - Supply per-procedure sequencing and
          control
   21.47. Operation 54: SET_SSV
   21.48. Operation 55: TEST_STATEID - Test stateids for validity
   21.49. Operation 56: WANT_DELEGATION
   21.50. Operation 10044: ILLEGAL - Illegal operation
22. NFS version 4.1 Callback Procedures
   22.1.  Procedure 0: CB_NULL - No Operation
   22.2.  Procedure 1: CB_COMPOUND - Compound Operations
23. NFS version 4.1 Callback Operations
   23.1.  Operation 3: CB_GETATTR - Get Attributes
   23.2.  Operation 4: CB_RECALL - Recall an Open Delegation
   23.3.  Operation 5: CB_LAYOUTRECALL
   23.4.  Operation 6: CB_NOTIFY - Notify directory changes
   23.5.  Operation 7: CB_PUSH_DELEG
   23.6.  Operation 8: CB_RECALL_ANY - Keep any N delegations
   23.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL
   23.8.  Operation 10: CB_RECALL_CREDIT - change flow control limits
   23.9.  Operation 11: CB_SEQUENCE - Supply callback channel
          sequencing and control
   23.10. Operation 12: CB_WANTS_CANCELLED
   23.11. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
24. Security Considerations
25. IANA Considerations
   25.1.  Defining new layout types
26. References
   26.1.  Normative References
   26.2.  Informative References
Appendix A.  ACL Algorithm Examples
   A.1.  Recomputing mode upon SETATTR of ACL
   A.2.  Computing the Inherited ACL
      A.2.1.  Discussion
   A.3.  Applying a Mode to an Existing ACL
Appendix B.  Acknowledgments
Authors' Addresses
Intellectual Property and Copyright Statements
1. Introduction

1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for the minor
versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1") and 12 ("no
features may be introduced as mandatory in a minor version"). These
divergences are due to the introduction of the sessions model for
managing non-idempotent operations and the RECLAIM_COMPLETE
operation. These two new features are infrastructural in nature and
simplify implementation of existing and other new features. Making
them optional would add undue complexity to protocol definition and
implementation. NFSv4.1 accordingly updates the Minor Versioning
guidelines (Section 7).
NFSv4.1, as a minor version, is consistent with the overall goals for
NFS Version 4, but extends the protocol so as to better meet those
goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
adopted some additional goals, which motivate some of the major
extensions in minor version 1.
1.2. NFS Version 4 Goals

The NFS version 4 protocol is a further revision of the NFS protocol
o Addition of the RECLAIM_COMPLETE operation to better structure
the lock reclamation process.

o Support for directory delegation.

o Operations to re-obtain a delegation.

o Support for client and server implementation id's.
2. Core Infrastructure
2.1. Introduction
2.2. RPC and XDR
2.2.1. RPC-based Security
2.2.1.1. RPC Security Flavors
2.2.1.1.1. RPCSEC_GSS and Security Services
2.2.1.1.1.1. Authentication, Integrity, Privacy
2.2.1.1.1.2. GSS Server Principal
2.2.1.2. NFSv4 Security Tuples
2.2.1.2.1. Security Service Negotiation
2.2.1.2.1.1. SECINFO and SECINFO_NO_NAME
2.2.1.2.1.2. Security Error
2.2.1.2.1.3. PUTFH + LOOKUP
2.2.1.2.1.4. PUTFH + LOOKUPP
2.2.1.2.1.5. PUTFH + SECINFO
2.2.1.2.1.6. PUTFH + Anything Else
2.3. Non-RPC-based Security Services
2.3.1. Authorization
2.3.2. Auditing
2.3.3. Intrusion Detection
2.4. Transport Layers
2.4.1. Ports
2.4.2. Stream Transports
2.4.3. RDMA Transports
2.4.3.1. RDMA Requirements
2.4.3.2. RDMA Connection Resources
2.5. Session
2.5.1. Motivation and Overview
2.5.2. NFSv4 Integration
2.5.2.1. COMPOUND and CB_COMPOUND
2.5.2.2. SEQUENCE and CB_SEQUENCE
2.5.2.3. Clientid and Session Association
2.5.3. Channels
2.5.3.1. Operation Channel
2.5.3.2. Back Channel
2.5.3.2.1. Back Channel RPC Security
2.5.3.3. Session and Channel Association
2.5.3.4. Connection and Channel Association
2.5.3.4.1. Trunking
2.5.4. Exactly Once Semantics
2.5.4.1. Slot Identifiers and Server Duplicate Request Cache
2.5.4.2. Retry and Replay
2.5.4.3. Resolving server callback races with sessions
2.6. Channel Management
2.6.1. Buffer Management
2.6.2. Data Transfer
2.6.2.1. Inline Data Transfer (Stream and RDMA)
2.6.2.2. Direct Data Transfer (RDMA)
2.6.3. Flow Control
2.6.4. COMPOUND Sizing Issues
2.6.5. Data Alignment
2.7. Sessions Security
2.7.1. Denial of Service via Unauthorized State Changes
2.8. Session Mechanics - Steady State
2.8.1. Obligations of the Server
2.8.2. Obligations of the Client
2.8.3. Steps the Client Takes To Establish a Session
2.8.4. Session Mechanics - Recovery
2.8.4.1. Reconnection
2.8.4.2. Failover
2.8.4.3. Events Requiring Client Action
2.8.4.4. Events Requiring Server Action
3. RPC and Security Flavor
The NFS version 4.1 protocol is a Remote Procedure Call (RPC)
application that uses RPC version 2 and the corresponding eXternal
Data Representation (XDR) as defined in RFC1831 [4] and RFC4506 [3].
The RPCSEC_GSS security flavor as defined in RFC2203 [5] MUST be used
as the mechanism to deliver stronger security for the NFS version 4
protocol.
3.1. Ports and Transports
Historically, NFS version 2 and version 3 servers have resided on
port 2049. The registered port 2049 (RFC3232 [19]) for the NFS
protocol should be the default configuration. NFSv4 clients SHOULD
NOT use the RPC binding protocols as described in RFC1833 [20].
Where an NFS version 4 implementation supports operation over the IP
network protocol, the supported transports between NFS and IP MUST
have the following two attributes:
1. The transport must support reliable delivery of data in the order
it was sent.
2. The transport must be among the IETF-approved congestion control
transport protocols.
At the time this document was written, the only two transports that
had the above attributes were TCP and SCTP. To enhance the
possibilities for interoperability, an NFS version 4 implementation
MUST support operation over the TCP transport protocol.
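The following fragment (an illustrative sketch in Python, not part of
the protocol definition; the function names are ours) shows what this
implies on the wire: an RPC version 2 NULL call, encoded per RFC1831
[4], sent directly to the registered port 2049 over a TCP connection
with no intervening RPC binding step.

   import socket
   import struct

   NFS_PROGRAM = 100003   # registered ONC RPC program number for NFS
   NFS_VERSION = 4
   NULLPROC = 0

   def rpc_null_call(xid):
       """Encode an RPC v2 CALL of the NULL procedure (AUTH_NONE)."""
       body = struct.pack(">IIIIIIIIII",
                          xid,
                          0,            # msg_type = CALL
                          2,            # RPC version 2
                          NFS_PROGRAM,
                          NFS_VERSION,
                          NULLPROC,
                          0, 0,         # credential: AUTH_NONE, length 0
                          0, 0)         # verifier:   AUTH_NONE, length 0
       # TCP record mark: high bit = last fragment, low 31 bits = length
       return struct.pack(">I", 0x80000000 | len(body)) + body

   def null_ping(host, timeout=5.0):
       """True if the server answers the NULL procedure on port 2049."""
       with socket.create_connection((host, 2049), timeout=timeout) as s:
           s.sendall(rpc_null_call(xid=1))
           reply = s.recv(4096)
           # A well-formed reply echoes our xid after its record mark.
           return len(reply) >= 8 and \
               struct.unpack(">I", reply[4:8])[0] == 1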
If TCP is used as the transport, the client and server SHOULD use
persistent connections for at least two reasons:
1. This will prevent the weakening of TCP's congestion control via
short lived connections and will improve performance for the WAN
environment by eliminating the need for SYN handshakes.
2. The NFSv4.1 callback model has changed from NFSv4.0, and requires
the client and server to maintain a client-created channel for
the server to use.
As noted in the Security Considerations section, the authentication
model for NFS version 4 has moved from machine-based to principal-
based. However, this modification of the authentication model does
not imply a technical requirement to move the transport connection
management model from whole machine-based to one based on a per user
model. In particular, NFS over TCP client implementations have
traditionally multiplexed traffic for multiple users over a common
TCP connection between an NFS client and server. This has been true,
regardless whether the NFS client is using AUTH_SYS, AUTH_DH,
RPCSEC_GSS or any other flavor. Similarly, NFS over TCP server
implementations have assumed such a model and thus scale the
implementation of TCP connection management in proportion to the
number of expected client machines. NFS version 4.1 will not modify
this connection management model. NFS version 4.1 clients that
violate this assumption can expect scaling issues on the server and
hence reduced service.
Note that the client and server should avoid inadvertent
synchronization of their various timers. For further discussion of
the general issue, refer to [Floyd].
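One common way to avoid such synchronization (shown here only as an
illustrative sketch; the function name is ours) is to add random
jitter, typically combined with exponential backoff, to each retry
interval rather than using a fixed period:

   import random

   def next_retry_delay(base, attempt, cap=120.0):
       """Exponential backoff with full jitter, in seconds."""
       return random.uniform(0, min(cap, base * (2 ** attempt)))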
3.1.1. Client Retransmission Behavior
When processing a request received over a reliable transport such as
TCP, the NFS version 4.1 server MUST NOT silently drop the request,
except if the transport connection has been broken. Given such a
contract between NFS version 4.1 clients and servers, clients MUST
NOT retry a request unless one or both of the following are true:
o The transport connection has been broken
o The procedure being retried is the NULL procedure
Since reliable transports, such as TCP, do not always synchronously
inform a peer when the other peer has broken the connection (for
example, when an NFS server reboots), the NFS version 4.1 client may
want to actively "probe" the connection to see if it has been broken.
Use of the NULL procedure is one recommended way to do so. So, when
a client experiences a remote procedure call timeout (of some
arbitrary implementation specific amount), rather than retrying the
remote procedure call, it could instead issue a NULL procedure call
to the server. If the server has died, the transport connection
break will eventually be indicated to the NFS version 4.1 client.
The client can then reconnect, and then retry the original request.
If the NULL procedure call gets a response, the connection has not
broken. The client can decide to wait longer for the original
request's response, or it can break the transport connection and
reconnect before re-sending the original request.
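The probing behavior described above can be sketched in C. This is a
minimal illustration only; rpc_call(), rpc_null(), rpc_wait_reply(),
and rpc_reconnect() are hypothetical helpers standing in for an
implementation's RPC layer, not part of any specified API.

   /* Hypothetical RPC-layer helpers; all names are illustrative. */
   struct rpc_conn;
   struct rpc_req;
   enum rpc_status { RPC_OK, RPC_TIMEOUT, RPC_CONNLOST };

   enum rpc_status rpc_call(struct rpc_conn *, struct rpc_req *);
   enum rpc_status rpc_null(struct rpc_conn *);
   enum rpc_status rpc_wait_reply(struct rpc_conn *, struct rpc_req *);
   void rpc_reconnect(struct rpc_conn *);

   enum rpc_status call_with_probe(struct rpc_conn *c,
                                   struct rpc_req *req)
   {
       enum rpc_status st = rpc_call(c, req);

       while (st == RPC_TIMEOUT) {
           /* Do not retry the request itself; probe with NULL. */
           if (rpc_null(c) == RPC_CONNLOST) {
               /* Connection broken (e.g. server reboot): reconnect,
                * after which retrying the original request is safe. */
               rpc_reconnect(c);
               st = rpc_call(c, req);
           } else {
               /* Connection alive: wait longer for the original
                * reply (a client could instead choose to break the
                * connection and reconnect). */
               st = rpc_wait_reply(c, req);
           }
       }
       return st;
   }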
For callbacks from the server to the client, the same rules apply,
but the server doing the callback becomes the client, and the client
receiving the callback becomes the server.
3.2. Security Flavors
Traditional RPC implementations have included AUTH_NONE, AUTH_SYS,
AUTH_DH, and AUTH_KRB4 as security flavors. With RFC2203 [5] an
additional security flavor of RPCSEC_GSS has been introduced which
uses the functionality of GSS-API RFC2743 [8]. This allows for the
use of various security mechanisms by the RPC layer without the
additional implementation overhead of adding RPC security flavors.
For NFS version 4, the RPCSEC_GSS security flavor MUST be implemented
to enable the mandatory security mechanism. Other flavors, such as
AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well.
3.2.1. Security mechanisms for NFS version 4
The use of RPCSEC_GSS requires selection of: mechanism, quality of
protection, and service (authentication, integrity, privacy). The
remainder of this document will refer to these three parameters of
RPCSEC_GSS security as the security triple.
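As an illustration only, a security triple can be modeled as the
structure below; the field names are invented here, and the actual
on-the-wire encoding is defined by RPCSEC_GSS (RFC2203 [5]), not by
this sketch.

   /* Illustrative model of an RPCSEC_GSS security triple; names
    * are invented for exposition, not taken from protocol XDR. */
   enum gss_service { SVC_NONE, SVC_INTEGRITY, SVC_PRIVACY };

   struct sec_triple {
       const char      *mech_oid; /* e.g. "1.2.840.113554.1.2.2"  */
       unsigned int     qop;      /* quality of protection        */
       enum gss_service service;  /* rpc_gss_svc_none/integrity/
                                   * privacy                      */
   };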
3.2.1.1. Kerberos V5
The Kerberos V5 GSS-API mechanism as described in RFC1964 [6] MUST be
implemented.
column descriptions:
1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == RPCSEC_GSS service
1 2 3 4
--------------------------------------------------------------------
390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none
390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity
390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy
Note that the pseudo flavor is presented here as a mapping aid to the
implementor. Because this NFS protocol includes a method to
negotiate security and it understands the GSS-API mechanism, the
pseudo flavor is not needed. The pseudo flavor is needed for NFS
version 3 since the security negotiation is done via the MOUNT
protocol.
For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please
see RFC2623 [21].
3.2.1.2. LIPKEY as a security triple
The LIPKEY GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The
definition of the columns matches that of the previous subsection
("Kerberos V5").
1 2 3 4
--------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy
3.2.1.3. SPKM-3 as a security triple
The SPKM-3 GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The
definition of the columns matches that of the previous subsection
("Kerberos V5").
1 2 3 4
--------------------------------------------------------------------
390009 spkm3 1.3.6.1.5.5.1.3 rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy
3.3. Security Negotiation
With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its file system name space
that are available for use by NFS clients. In turn the NFS server
may be configured such that each of these entry points may have
different or multiple security mechanisms in use.
The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See the section "Security Considerations" for further discussion.
3.3.1. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security triple is to be
used for server access. In general, the client will not have to use
either operation except during initial communication with the server
or when the client crosses policy boundaries at the server. It is
possible that the server's policies will change during the client's
interaction, thereby forcing the client to negotiate a new security
triple.
3.3.2. Security Error
Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its
communication with the server with one of the minimal security
triples. During communication with the server, the client may
receive an NFS error of NFS4ERR_WRONGSEC. This error allows the
server to notify the client that the security triple currently being
used is not appropriate for access to the server's file system
resources. The client is then responsible for determining what
security triples are available at the server and choosing one which
is appropriate for the client. See the section describing the SECINFO
operation for further discussion of how the client will respond to
the NFS4ERR_WRONGSEC error and use SECINFO.
3.3.3. Callback RPC Authentication
Callback authentication has changed in NFSv4.1 from NFSv4.0.
NFSv4.0 required the NFS server to create a security context for
RPCSEC_GSS, AUTH_DH, and AUTH_KERB4, and any other security flavor
that had a security context. It also required that the principal issuing
the callback be the same as the principal that accepted the callback
parameters (via SETCLIENTID), and that the client principal accepting
the callback be the same as that which issued the SETCLIENTID. This
required the NFS client to have an assigned machine credential.
NFSv4.1 does not require a machine credential. Instead, NFSv4.1
allows an RPCSEC_GSS security context initiated by the client and
established on both the client and server to be used on callback
RPCs sent by the server to the client. The BIND_BACKCHANNEL
operation is used to establish RPCSEC_GSS contexts (if the client so
desires) on the server. No support for AUTH_DH, or AUTH_KERB4 is
specified.
3.3.4. GSS Server Principal
Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE
names are of the form:
service@hostname
For NFS, the "service" element is
nfs
Implementations of security mechanisms will convert nfs@hostname to
various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the
following form is RECOMMENDED:
nfs/hostname
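The following fragment illustrates the two name forms side by side.
It is a sketch only: real implementations import such names through
GSS-API routines (e.g. GSS_Import_name as defined in RFC2743 [8])
rather than by string formatting, and the hostname shown is an
assumed example.

   #include <stdio.h>

   /* Print the GSS_C_NT_HOSTBASED_SERVICE name for an NFS server,
    * and the form Kerberos V5, LIPKEY, and SPKM-3 implementations
    * typically derive from it. */
   int main(void)
   {
       const char *hostname = "server.example.com"; /* assumed */
       char hostbased[260], mech_form[260];

       snprintf(hostbased, sizeof(hostbased), "nfs@%s", hostname);
       snprintf(mech_form, sizeof(mech_form), "nfs/%s", hostname);

       printf("%s -> %s\n", hostbased, mech_form);
       return 0;
   }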
4. Security Negotiation
The NFSv4.0 specification contains three oversights and ambiguities
with respect to the SECINFO operation.
First, it is impossible for the client to use the SECINFO operation
to determine the correct security triple for accessing a parent
directory. This is because SECINFO takes as arguments the current
file handle and a component name. However, NFSv4.0 uses the LOOKUPP
operation to get the parent directory of the current filehandle. If
the client uses the wrong security when issuing the LOOKUPP, and gets
back an NFS4ERR_WRONGSEC error, SECINFO is useless to the client.
The client is left with guessing which security the server will
accept. This defeats the purpose of SECINFO, which was to provide an
efficient method of negotiating security.
Second, there is ambiguity as to what the server should do when it is
passed a LOOKUP operation such that the server restricts access to
the current file handle with one security triple, and access to the
component with a different triple, and the remote procedure call uses one
of the two security triples. Should the server allow the LOOKUP?
Third, there is a problem as to what the client must do (or can do)
whenever the server returns NFS4ERR_WRONGSEC in response to a PUTFH
operation. The NFSv4.0 specification says that the client should issue a
SECINFO using the parent filehandle and the component name of the
filehandle that PUTFH was issued with. This may not be convenient
for the client.
This document resolves the above three issues in the context of
NFSv4.1.
5. Clarification of Security Negotiation in NFSv4.1
This section attempts to clarify NFSv4.1 security negotiation issues.
Unless noted otherwise, for any mention of PUTFH in this section, the
reader should interpret it as applying to PUTROOTFH and PUTPUBFH in
addition to PUTFH.
5.1. PUTFH + LOOKUP
The server implementation may decide whether to impose any
restrictions on export security administration. There are at least
three approaches (Sc is the flavor set of the child export, Sp that
of the parent):
a) Sc <= Sp (<= for subset)
b) Sc ^ Sp != {} (^ for intersection, {} for the empty set)
c) free form
To support b (when the client chooses a flavor that is not a member of
Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of security
mismatch. Instead, it should be returned from the LOOKUP that
follows.
Since the above guideline does not contradict a, it should be
followed in general.
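The three policies can be compared concretely by modeling flavor
sets as bitmasks. This is purely illustrative; servers are free to
represent export security policy however they choose.

   #include <stdbool.h>
   #include <stdint.h>

   /* Flavor sets modeled as bitmasks, one bit per security triple;
    * an illustrative simplification.  Sc is the child export's set,
    * Sp the parent's. */
   typedef uint32_t flavor_set;

   static bool policy_a(flavor_set sc, flavor_set sp)
   {
       return (sc & ~sp) == 0;    /* Sc <= Sp: Sc a subset of Sp */
   }

   static bool policy_b(flavor_set sc, flavor_set sp)
   {
       return (sc & sp) != 0;     /* Sc ^ Sp != {}: sets overlap */
   }

   /* Under policy b) or c), a flavor valid for the parent may be
    * invalid for the child, so the mismatch can only be detected
    * at the LOOKUP, not at the PUTFH. */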
5.2. PUTFH + LOOKUPP
Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without the server implementing
SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue because, via the
style "parent", it works in the opposite direction from SECINFO (the
component name is implicit in this case).
5.3. PUTFH + SECINFO
This case should be treated specially.
A security-sensitive client should be allowed to choose a strong
flavor when querying a server to determine a file object's permitted
security flavors. The security flavor chosen by the client does not
have to be included in the flavor list of the export. Of course the
server has to be configured for whatever flavor the client selects,
otherwise the request will fail at RPC authentication.
In theory, there is no connection between the security flavor used by
SECINFO and those supported by the export. But in practice, the
client may start looking for strong flavors from those supported by
the export, followed by those in the mandatory set.
5.4. PUTFH + Anything Else
PUTFH must return NFS4ERR_WRONGSEC in case of security mismatch.
This is the most straightforward approach without having to add
NFS4ERR_WRONGSEC to every other operation.
PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the client
to recover from NFS4ERR_WRONGSEC.
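A sketch of the client-side recovery path follows. The helpers
putfh(), secinfo_no_name_current(), and set_triple() are
hypothetical stand-ins for a client's COMPOUND machinery; only the
NFS4ERR_WRONGSEC value is taken from RFC3530 [2].

   #define NFS4ERR_WRONGSEC 10016     /* value from RFC3530 [2] */

   struct nfs_client;
   struct nfs_fh;
   struct sec_triple {                /* illustrative layout */
       const char *mech_oid;
       unsigned    qop;
       int         service;
   };

   /* Hypothetical helpers wrapping COMPOUND construction. */
   int putfh(struct nfs_client *, const struct nfs_fh *);
   int secinfo_no_name_current(struct nfs_client *,
                               const struct nfs_fh *,
                               struct sec_triple *, int max);
   void set_triple(struct nfs_client *, const struct sec_triple *);

   static int putfh_with_recovery(struct nfs_client *clnt,
                                  const struct nfs_fh *fh)
   {
       int st = putfh(clnt, fh);
       if (st != NFS4ERR_WRONGSEC)
           return st;

       /* PUTFH + SECINFO_NO_NAME (style "current_fh") yields the
        * triples the server permits for this filehandle. */
       struct sec_triple list[8];
       int n = secinfo_no_name_current(clnt, fh, list, 8);
       if (n <= 0)
           return st;

       set_triple(clnt, &list[0]);    /* choose by local policy */
       return putfh(clnt, fh);        /* retry under new triple */
   }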
6. NFSv4.1 Sessions
6.1. Sessions Background
6.1.1. Introduction to Sessions
[[Comment.1: Noveck: Anyway, I think that trying to hack at the
existing text is basically hopeless. I think you have to figure out
what a new chapter (on sessions or basic protocol structure) should
say and then write it, pulling in text from the existing chapter when
appropriate. Apart from the issues you have found, that document was
written with a whole different purpose in mind. It discusses the
sessions "feature" and justifies it and talks about integrating it
into v4.0, etc. Instead, it is not a feature but is a basic
underpinning of v4.1 and we just explain what client and server need
to do, and some why but it is why this works not why we have made
these design choices vs. others we might have made. It's a totally
different story and I don't think you can get there incrementally.]]
NFSv4.1 adds extensions which allow NFSv4 to support sessions and
endpoint management, and to support operation atop RDMA-capable RPC
over transports such as iWARP. [RDMAP, DDP] These extensions enable
support for exactly-once semantics by NFSv4 servers, multipathing and
trunking of transport connections, and enhanced security. The
ability to operate over RDMA enables greatly enhanced performance.
Operation over existing TCP is enhanced as well.
While discussed here with respect to IETF-chartered transports, the
intent is that NFSv4.1 will function over other standards, such as
Infiniband. [IB]
The following are the major aspects of the session feature:
o An explicit session is introduced to NFSv4, and new operations are
added to support it. The session allows for enhanced trunking,
failover and recovery, and support for RDMA. The session is
implemented as operations within NFSv4 COMPOUND and does not
impact layering or interoperability with existing NFSv4
implementations. The NFSv4 callback channel is dynamically
associated and is connected by the client and not the server,
enhancing security and operation through firewalls. [[Comment.2:
XXX is the following true:]] In fact, the callback channel will be
enabled to share the same connection as the operations channel.
o An enhanced RPC layer enables NFSv4 operation atop RDMA. The
session assists RDMA-mode connection, and additional facilities
are provided for managing RDMA resources at both NFSv4 server and
client. Existing NFSv4 operations continue to function as before,
though certain size limits are negotiated. A companion draft to
this specification, "RDMA Transport for ONC RPC" [RPCRDMA], is to
be referenced for details of RPC RDMA support.
o Support for exactly-once semantics ("EOS") is enabled by the new
session facilities, by providing to the server a way to bound the
size of the duplicate request cache for a single client, and to
manage its persistent storage.
Block Diagram
   +-----------------+-------------------------------------+
   |      NFSv4      |     NFSv4 + session extensions      |
   +-----------------+------+----------------+-------------+
   |       Operations       |    Session     |             |
   +------------------------+----------------+             |
   |                 RPC/XDR                 |             |
   +-------------------------------+---------+             |
   |        Stream Transport       |     RDMA Transport    |
   +-------------------------------+-----------------------+
6.1.2. Session Model
A session is a dynamically created, long-lived server object
instantiated by a client and used over time from one or more
transport connections.
Its function is to maintain the server's state relative to the
connection(s) belonging to a client instance. This state is entirely
independent of the connection itself. The session in effect becomes
the object representing an active client on a connection or set of
connections.
Clients may create multiple sessions for a single clientid, and may
wish to do so for optimization of transport resources, buffers, or
server behavior. A session could be created by the client to
represent a single mount point, for separate read and write
"channels", or for any number of other client-selected parameters.
The session enables several things immediately. Clients may
disconnect and reconnect (voluntarily or not) without loss of context
at the server. (Of course, locks, delegations and related
associations require special handling, and generally expire in the
extended absence of an open connection.) Clients may connect
multiple transport endpoints to this common state. The endpoints may
have all the same attributes, for instance when trunked on multiple
physical network links for bandwidth aggregation or path failover.
Or, the endpoints can have specific, special purpose attributes such
as callback channels.
The NFSv4.0 specification does not provide for any form of flow
control; instead it relies on the windowing provided by TCP to
throttle requests. This unfortunately does not work with RDMA, which
in general provides no operation flow control and will terminate a
connection in error when limits are exceeded. Limits are therefore
exchanged when a session is created; these limits then provide maxima
within which each session's connections must operate, and connections
are managed within these limits as described in [RPCRDMA]. The limits
may also be modified dynamically at the server's choosing by
manipulating certain parameters present in each NFSv4.1 request.
The presence of a maximum request limit on the session bounds the
requirements of the duplicate request cache. This can be used by a
server to accurately determine its storage needs, enabling it to
maintain duplicate request cache persistence and to provide reliable
exactly-once semantics.
6.1.3. Connection State
In NFSv4.0, the combination of a connected transport endpoint and a
clientid forms the basis of connection state. While this has been
made to be workable with certain limitations, there are difficulties
in correct and robust implementation. The NFSv4.0 protocol must
provide a server-initiated connection for the callback channel, and
must carefully specify the persistence of client state at the server
in the face of transport interruptions. The server has only the
client's transport address binding (the IP 4-tuple) to identify the
client RPC transaction stream and to use as a lookup tag on the
duplicate request cache. (A useful overview of this is in [RW96].)
If the server listens on multiple addresses, and the client connects
to more than one, it must employ different clientids on each,
negating its ability to aggregate bandwidth and redundancy. In
effect, each transport connection is used as the server's
representation of client state. But, transport connections are
potentially fragile and transitory.
In this specification, a session identifier is assigned by the server
upon initial session negotiation on each connection. This identifier
is used to associate additional connections, to renegotiate after a
reconnect, to provide an abstraction for the various session
properties, and to address the duplicate request cache. No
transport-specific information is used in the duplicate request cache
implementation of an NFSv4.1 server, nor in fact the RPC XID itself.
The session identifier is unique within the server's scope and may be
subject to certain server policies such as being bounded in time.
6.1.4. NFSv4 Channels, Sessions and Connections
There are two types of NFSv4 channels: the "operations" or "fore"
channel used for ordinary requests from client to server, and the
"back" channel, used for callback requests from server to client.
Different NFSv4 operations on these channels can lead to different
resource needs. For example, server callback operations (CB_RECALL)
are specific, small messages which flow from server to client at
arbitrary times, while data transfers such as read and write have
very different sizes and asymmetric behaviors. It is sometimes
impractical for the RDMA peers (NFSv4 client and NFSv4 server) to
post buffers for these various operations on a single connection.
Commingling of requests with responses at the client receive queue is
particularly troublesome, due both to the need to manage solicited
and unsolicited completions, and to the need to provision buffers for
both purposes. Due to the lack of any ordering of callback requests
versus response arrivals, without any other mechanisms, the client
would be forced to allocate all buffers sized to the worst case.
The callback requests are likely to be handled by a different task
context from that handling the responses. Significant demultiplexing
and thread management may be required if both are received on the
same connection. The client and server have full control as to
whether a connection will service one channel or both channels.
[[Comment.3: I think trunking remains an open issue as there is no
way yet for clients to determine whether two different server network
addresses refer to the same server]]. Also, the client may wish to
perform trunking of operations channel requests for performance
reasons, or multipathing for availability. This specification
permits both, as well as many other session and connection
possibilities, by permitting each operation to carry session
membership information and to share session (and clientid) state in
order to draw upon the appropriate resources. For example, reads and
writes may be assigned to specific, optimized connections, or sorted
and separated by any or all of size, idempotency, etc.
To address the problems described above, this specification allows
multiple sessions to share a clientid, as well as for multiple
connections to share a session.
Single Connection model:

                  NFSv4.1 Session
                 /                \
     Operations_Channel      [Back_Channel]
                 \                /
                    Connection
                        |

Multi-connection trunked model (2 operations channels shown):

                  NFSv4.1 Session
                 /                \
       Operations_Channels    [Back_Channel]
          |           |            |
      Connection  Connection  [Connection]
          |           |            |

Multi-connection split-use model (2 mounts shown):

                  NFSv4.1 Session
                 /                \
            (/home)       (/usr/local - readonly)
            /      \                  |
  Operations_Channel  [Back_Channel]  |
          |                |     Operations_Channel
      Connection     [Connection]     |
          |                |      Connection
                                      |
In this way, implementation as well as resource management may be
optimized. Each session will have its own response caching and
buffering, and each connection or channel will have its own transport
resources, as appropriate. Clients which do not require certain
behaviors may optimize such resources away completely, by using
specific sessions and not even creating the additional channels and
connections.
6.1.5. Reconnection, Trunking and Failover
Reconnection after failure references stored state on the server
associated with lease recovery during the grace period. The session
provides a convenient handle for storing and managing information
regarding the client's previous state on a per-connection basis,
e.g. to be used upon reconnection. Reconnection to a previously
existing session, and its stored resources, are covered in
Section 6.3.
One important aspect of reconnection is that of RPC library support.
Traditionally, an Upper Layer RPC-based Protocol such as NFS leaves
all transport knowledge to the RPC layer implementation below it.
This allows NFS to operate over a wide variety of transports and has
proven to be a highly successful approach. The session, however,
introduces an abstraction which is, in a way, "between" RPC and
NFSv4.1. It is important that the session abstraction not have
ramifications within the RPC layer.
One such issue arises within the reconnection logic of RPC.
Previously, an explicit session binding operation, which established
session context for each new connection, was explored. This however
required that the session binding also be performed during reconnect,
which in turn required an RPC request. This additional request
requires new RPC semantics, both in implementation and the fact that
a new request is inserted into the RPC stream. Also, the binding of
a connection to a session required the upper layer to become "aware"
of connections, something the RPC layer architecturally abstracts
away. Therefore the session binding is not handled in
connection scope but instead explicitly carried in each request.
For Reliability, Availability, and Serviceability (RAS) issues such as
bandwidth aggregation and multipathing, clients frequently seek to
make multiple connections through multiple logical or physical
channels. The session is a convenient point to aggregate and manage
these resources.
6.1.6. Server Duplicate Request Cache
RPC-based server duplicate request caches, while not a part of an NFS
protocol, have become a de-facto requirement of any NFS
implementation. First described in [CJ89], the duplicate request
cache was initially found to reduce work at the server by avoiding
duplicate processing for retransmitted requests. A second, and in
the long run more important benefit, was improved correctness, as the
cache prevented certain destructive non-idempotent requests from
being reinvoked.
However, RPC-based caches do not provide correctness guarantees; they
cannot be managed in a reliable, persistent fashion. The reason is
understandable - their storage requirement is unbounded due to the
lack of any such bound in the NFS protocol, and they are dependent on
transport addresses for request matching.
The session model, with its maximum request count limits and
negotiated maximum sizes, allows the size and duration of the cache to
be bounded, and coupled with a long-lived session identifier, enables
its persistent storage on a per-session basis.
This provides a single unified mechanism which provides the following
guarantees required in the NFSv4 specification, while extending them
to all requests, rather than limiting them only to a subset of state-
related requests:
"It is critical the server maintain the last response sent to the
client to provide a more reliable cache of duplicate non-idempotent
requests than that of the traditional cache described in [CJ89]..."
RFC3530 [2]
The maximum request count limit is the count of active operations,
which bounds the number of entries in the cache. Constraining the
size of operations additionally serves to limit the required storage
to the product of the current maximum request count and the maximum
response size. This storage requirement enables server-side
efficiencies.
Session negotiation allows the server to maintain other state. An
NFSv4.1 client invoking the session destroy operation will cause the
server to close the session, allowing the server to deallocate cache
entries. Clients can potentially specify that such caches not be
kept for appropriate types of sessions (for example, read-only
sessions). This can enable more efficient server operation resulting
in improved response times, and more efficient sizing of buffers and
response caches.
Similarly, it is important for the client to explicitly learn whether
the server is able to implement reliable semantics. Knowledge of
whether these semantics are in force is critical for a highly
reliable client, one which must provide transactional integrity
guarantees. When clients request that the semantics be enabled for a
given session, the session reply must inform the client if the mode
is in fact enabled. In this way the client can confidently proceed
with operations without having to implement consistency facilities of
its own.
6.2. Session Initialization and Transfer Models
Session initialization issues, and data transfer models relevant to
both TCP and RDMA are discussed in this section.
6.2.1. Session Negotiation
The following parameters are exchanged between client and server at
session creation time. Their values allow the server to properly
size resources allocated in order to service the client's requests,
and to provide the server with a way to communicate limits to the
client for proper and optimal operation. They are exchanged prior to
all session-related activity, over any transport type. Discussion of
their use is found in their descriptions as well as throughout this
section.
Maximum Requests
The client's desired maximum number of concurrent requests is
passed, in order to allow the server to size its reply cache
storage. The server may modify the client's requested limit
downward (or upward) to match its local policy and/or resources.
Over RDMA-capable RPC transports, the per-request management of
low-level transport message credits is handled within the RPC
layer. [RPCRDMA]
Maximum Request/Response Sizes
The maximum request and response sizes are exchanged in order to
permit allocation of appropriately sized buffers and request cache
entries. The size must allow for certain protocol minima,
allowing the receipt of maximally sized operations (e.g. RENAME
requests, which contain two name strings). Note the maximum
request/response sizes cover the entire request/response message
and not simply the data payload, as do the traditional NFS maximum
read and write sizes. Also note the server implementation may not,
and in fact probably does not, require the reply cache entries to be
sized as large as the maximum response. The server may reduce the
client's
requested sizes.
Inline Padding/Alignment
The server can inform the client of any padding which can be used
to deliver NFSv4 inline WRITE payloads into aligned buffers. Such
alignment can be used to avoid data copy operations at the server
for both TCP and inline RDMA transfers. For RDMA, the client
informs the server in each operation when padding has been
applied. [RPCRDMA]
Transport Attributes
A placeholder for transport-specific attributes is provided, with
a format to be determined. Possible examples of information to be
passed in this parameter include transport security attributes to
be used on the connection, RDMA- specific attributes, legacy
"private data" as used on existing RDMA fabrics, transport Quality
of Service attributes, etc. This information is to be passed to
the peer's transport layer by local means which are currently
outside the scope of this draft; however, one attribute is provided
in the RDMA case:
RDMA Read Resources
RDMA implementations must explicitly provision resources to
support RDMA Read requests from connected peers. These values
must be explicitly specified, to provide adequate resources for
matching the peer's expected needs and the connection's delay-
bandwidth parameters. The client provides its chosen value to the
server in the initial session creation; the value must be provided
in each client RDMA endpoint. The values are asymmetric and
should be set to zero at the server in order to conserve RDMA
resources, since clients do not issue RDMA Read operations in this
specification. The result is communicated in the session
response, to permit matching of values across the connection. The
value may not be changed in the duration of the session, although
a new value may be requested as part of a new session.
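Viewed together, the parameters above can be pictured as a single
negotiation structure. The sketch below is illustrative only; the
field names are invented and do not correspond to this
specification's XDR definitions.

   #include <stdint.h>

   /* Illustrative bundle of session-creation parameters; names are
    * invented for exposition only. */
   struct session_negotiation {
       uint32_t max_requests;    /* concurrent request (credit)
                                  * limit                         */
       uint32_t max_request_sz;  /* whole-message request maximum */
       uint32_t max_response_sz; /* whole-message response maximum*/
       uint32_t write_padding;   /* inline WRITE alignment;
                                  * 0 = none                      */
       uint32_t rdma_read_depth; /* RDMA Read resources, set by the
                                  * client; zero at the server    */
   };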
6.2.2. RDMA Requirements
A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. Where RDMA is considered, this
specification assumes the use of such a layering; it addresses only
the upper layer issues relevant to making best use of RPC/RDMA.
A connection oriented (reliable sequenced) RDMA transport will be
required. There are several reasons for this. First, this model
most closely reflects the general NFSv4 requirement of long-lived and
congestion-controlled transports. Second, to operate correctly over
either an unreliable or unsequenced RDMA transport, or both, would
require significant complexity in the implementation and protocol not
appropriate for a strict minor version. For example, retransmission
on connected endpoints is explicitly disallowed in the current NFSv4
draft; it would again be required with these alternate transport
characteristics. Third, this specification assumes a specific RDMA
ordering semantic, which presents the same set of ordering and
reliability issues to the RDMA layer over such transports.
The RDMA implementation provides for making connections to other
RDMA-capable peers. In the case of the current proposals before the
RDDP working group, these RDMA connections are preceded by a
"streaming" phase, where ordinary TCP (or NFS) traffic might flow.
However, this is not assumed here and sizes and other parameters are
explicitly exchanged upon a session entering RDMA mode.
6.2.3. RDMA Connection Resources
On transport endpoints which support automatic RDMA mode, that is,
endpoints which are created in the RDMA-enabled state, a single,
preposted buffer must initially be provided by both peers, and the
client session negotiation must be the first exchange.
On transport endpoints supporting dynamic negotiation, a more
sophisticated negotiation is possible, but is not discussed in the
current draft.
RDMA imposes several requirements on upper layer consumers.
Registration of memory and the need to post buffers of a specific
size and number for receive operations are a primary consideration.
Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes (e.g.
readable/writable), and initialization of hardware translation.
Preregistration is desirable to reduce overhead. These registrations
are specific to hardware interfaces and even to RDMA connection
endpoints, therefore negotiation of their limits is desirable to
manage resources effectively.
Following the basic registration, these buffers must be posted by the
RPC layer to handle receives. These buffers remain in use by the
RPC/NFSv4 implementation; the size and number of them must be known
to the remote peer in order to avoid RDMA errors which would cause a
fatal error on the RDMA connection.
The session provides a natural way for the server to manage resource
allocation to each client rather than to each transport connection
itself. This enables considerable flexibility in the administration
of transport endpoints.
6.2.4. TCP and RDMA Inline Transfer Model
The basic transfer model for both TCP and RDMA is referred to as
"inline". For TCP, this is the only transfer model supported, since
TCP carries both the RPC header and data together in the data stream.
For RDMA, the RDMA Send transfer model is used for all NFS requests
and replies, but data is optionally carried by RDMA Writes or RDMA
Reads. Use of Sends is required to ensure consistency of data and to
deliver completion notifications. The pure-Send method is typically
used where the data payload is small, or where for whatever reason
target memory for RDMA is not available.
Inline message exchange
Client Server
: Request :
Send : ------------------------------> : untagged
: : buffer
: Response :
untagged : <------------------------------ : Send
buffer : :
Client Server
: Read request :
Send : ------------------------------> : untagged
: : buffer
: Read response with data :
untagged : <------------------------------ : Send
buffer : :
Client Server
: Write request with data :
Send : ------------------------------> : untagged
: : buffer
: Write response :
untagged : <------------------------------ : Send
buffer : :
Responses must be sent to the client on the same connection that the
request was sent. It is important that the server does not assume
any specific client implementation, in particular whether connections
within a session share any state at the client. This is also
important to preserve ordering of RDMA operations, and especially
RDMA consistency. Additionally, it ensures that the RPC RDMA layer
makes no requirement of the RDMA provider to open its memory
registration handles (Steering Tags) beyond the scope of a single
RDMA connection. This is an important security consideration.
Two values must be known to each peer prior to issuing Sends: the
maximum number of sends which may be posted, and their maximum size.
These values are referred to, respectively, as the message credits
and the maximum message size. While the message credits might vary
dynamically over the duration of the session, the maximum message
size does not. The server must commit to preserving this number of
duplicate request cache entries, and preparing a number of receive
buffers equal to or greater than its currently advertised credit
value, each of the advertised size. These commitments ensure that
sufficient transport resources are allocated to receive the full
advertised limits.
Note that the server must post the maximum number of session requests
to each client operations channel. The client is not required to
spread its requests in any particular fashion across connections
within a session. If the client wishes, it may create multiple
sessions, each with a single or small number of operations channels
to provide the server with this resource advantage. Or, over RDMA
the server may employ a "shared receive queue". The server can in
any case protect its resources by restricting the client's request
credits.
While tempting to consider, it is not possible to use the TCP window
as an RDMA operation flow control mechanism. First, to do so would
violate layering, requiring both senders to be aware of the existing
TCP outbound window at all times. Second, since requests are of
variable size, the TCP window can hold a widely variable number of
them, and since it cannot be reduced without actually receiving data,
the receiver cannot limit the sender. Third, any middlebox
interposing on the connection would wreck any possible scheme.
[MIDTAX] In this specification, maximum request count limits are
exchanged at the session level to allow correct provisioning of
receive buffers by transports.
When operating over TCP or other similar transport, request limits
and sizes are still employed in NFSv4.1, but instead of being
required for correctness, they provide the basis for efficient server
implementation of the duplicate request cache. The limits are chosen
based upon the expected needs and capabilities of the client and
server, and are in fact arbitrary. Sizes may be specified by the
client as zero (requesting the server's preferred or optimal value),
and request limits may be chosen in proportion to the client's
capabilities. For example, a limit of 1000 allows 1000 requests to
be in progress, which may generally be far more than adequate to keep
local networks and servers fully utilized.
Both client and server have independent sizes and buffering, but over
RDMA fabrics client credits are easily managed by posting a receive
buffer prior to sending each request. A given buffer may not be
completed by the corresponding reply, since responses from NFSv4
servers arrive in arbitrary order. When an operations channel is
also used for callbacks, the client must account for callback
requests by posting additional buffers. Note that implementation-
specific facilities such as a shared receive queue may also allow
optimization of these allocations.
When a session is created, the client requests a preferred buffer
size, and the server provides its answer. The server posts all
buffers of at least this size. The client must comply by not sending
requests greater than this size. It is recommended that server
implementations do all they can to accommodate a useful range of
possible client requests. There is a provision in [RPCRDMA] to allow
the sending of client requests which exceed the server's receive
buffer size, but it requires the server to "pull" the client's
request as a "read chunk" via RDMA Read. This introduces at least
one additional network roundtrip, plus other overhead such as
registering memory for RDMA Read at the client and additional RDMA
operations at the server, and is to be avoided.
An issue therefore arises when considering the NFSv4 COMPOUND
procedures. Since an arbitrary number (total size) of operations can
be specified in a single COMPOUND procedure, its size is effectively
unbounded. This cannot be supported by RDMA Sends, and therefore
this size negotiation places a restriction on the construction and
maximum size of both COMPOUND requests and responses. If a COMPOUND
results in a reply at the server that is larger than can be sent in
an RDMA Send to the client, then the COMPOUND must terminate and the
operation which causes the overflow will provide a TOOSMALL error
status result.
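A server-side sketch of this bound follows; the encoding helpers are
hypothetical, and only the NFS4ERR_TOOSMALL value is taken from
RFC3530 [2].

   #include <stddef.h>

   #define NFS4ERR_TOOSMALL 10005     /* value from RFC3530 [2] */

   struct op_result;
   struct reply_buf { size_t used; size_t max; };

   /* Hypothetical helpers for encoding one operation's reply. */
   size_t reply_size(const struct op_result *);
   size_t encode_op_reply(struct reply_buf *,
                          const struct op_result *);
   int    encode_op_error(struct reply_buf *,
                          const struct op_result *, int error);

   /* Encode COMPOUND results until the negotiated maximum response
    * size would be exceeded; the overflowing operation terminates
    * the COMPOUND with the TOOSMALL error status. */
   static int encode_compound_reply(struct reply_buf *rb,
                                    const struct op_result *ops,
                                    int n)
   {
       for (int i = 0; i < n; i++) {
           if (rb->used + reply_size(&ops[i]) > rb->max)
               return encode_op_error(rb, &ops[i],
                                      NFS4ERR_TOOSMALL);
           rb->used += encode_op_reply(rb, &ops[i]);
       }
       return 0;
   }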
6.2.5. RDMA Direct Transfer Model
Placement of data by explicitly tagged RDMA operations is referred to
as "direct" transfer. This method is typically used where the data
payload is relatively large, that is, when RDMA setup has been
performed prior to the operation, or when any overhead for setting up
and performing the transfer is regained by avoiding the overhead of
processing an ordinary receive.
The client, not the server, advertises RDMA buffers. This means
the "XDR Decoding with Read Chunks" described in [RPCRDMA] is not
employed by NFSv4.1 replies, and instead all results transferred via
RDMA to the client employ "XDR Decoding with Write Chunks". There
are several reasons for this.
First, it allows for a correct and secure mode of transfer. The
client may advertise specific memory buffers only during specific
times, and may revoke access when it pleases. The server is not
required to expose copies of local file buffers for individual
clients, or to lock or copy them for each client access.
Second, client credits based on fixed-size request buffers are easily
managed on the server, but for the server additional management of
buffers for client RDMA Reads is not well-bounded. For example, the
client may not perform these RDMA Read operations in a timely
fashion, therefore the server would have to protect itself against
denial-of-service on these resources.
Third, it reduces network traffic, since buffer exposure outside the
scope and duration of a single request/response exchange necessitates
additional memory management exchanges.
There are costs associated with this decision. Primary among them is
the need for the server to employ RDMA Read for operations such as
large WRITE. The RDMA Read operation is a two-way exchange at the
RDMA layer, which incurs additional overhead relative to RDMA Write.
Additionally, RDMA Read requires resources at the data source (the
client in this specification) to maintain state and to generate
replies. These costs are overcome through use of pipelining with
credits, with sufficient RDMA Read resources negotiated at session
initiation, and appropriate use of RDMA for writes by the client -
for example only for transfers above a certain size.
A description of which NFSv4 operation results are eligible for data
transfer via RDMA Write is in [NFSDDP]. There are only two such
operations: READ and READLINK. When XDR encoding these requests on
an RDMA transport, the NFSv4.1 client must insert the appropriate
xdr_write_list entries to indicate to the server whether the results
should be transferred via RDMA or inline with a Send. As described
in [NFSDDP], a zero-length write chunk is used to indicate an inline
result. In this way, it is unnecessary to create new operations for
RDMA-mode versions of READ and READLINK.
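The choice between direct and inline results can be sketched as
follows. The chunk layout and the add_write_list() helper are
illustrative stand-ins; the authoritative encodings are in [RPCRDMA]
and [NFSDDP].

   #include <stdint.h>

   /* Illustrative write-chunk entry; see [RPCRDMA] for the XDR. */
   struct xdr_write_chunk {
       uint32_t handle;  /* Steering Tag of the advertised buffer */
       uint32_t length;  /* zero length means "reply inline"      */
       uint64_t offset;  /* remote address within the buffer      */
   };

   struct rpc_rdma_req;
   struct rdma_buf { uint32_t stag; uint64_t addr; };

   void add_write_list(struct rpc_rdma_req *,
                       const struct xdr_write_chunk *);

   /* For READ (and READLINK), the client decides whether results
    * come back via RDMA Write or inline with the Send. */
   static void encode_read_result_target(struct rpc_rdma_req *req,
                                         const struct rdma_buf *buf,
                                         uint32_t count,
                                         uint32_t inline_threshold)
   {
       struct xdr_write_chunk c = { 0, 0, 0 }; /* inline default */

       if (count >= inline_threshold) {
           c.handle = buf->stag;     /* advertise client memory */
           c.length = count;
           c.offset = buf->addr;
       }
       add_write_list(req, &c);
   }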
Another tool to avoid creation of new, RDMA-mode operations is the
Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return
large replies via RDMA as if they were inline. Reply chunks are used
for operations such as READDIR, which returns large amounts of
information, but in many small XDR segments. Reply chunks are
offered by the client and the server can use them in preference to
inline. Reply chunks are transparent to upper layers such as NFSv4.
In any very rare cases where another NFSv4.1 operation requires
larger buffers than were negotiated when the session was created (for
example extraordinarily large RENAMEs), the underlying RPC layer may
support the use of "Message as an RDMA Read Chunk" and "RDMA Write of
Long Replies" as described in [RPCRDMA]. No additional support is
required in the NFSv4.1 client for this. The client should be
certain that its requested buffer sizes are not so small as to make
this a frequent occurrence, however.
All operations are initiated by a Send, and are completed with a
Send. This is exactly as in conventional NFSv4, but under RDMA has a
significant purpose: RDMA operations are not complete, that is,
guaranteed consistent, at the data sink until followed by a
successful Send completion (i.e. a receive). These events provide a
natural opportunity for the initiator (client) to enable and later
disable RDMA access to the memory which is the target of each
operation, in order to provide for consistent and secure operation.
The RDMAP Send with Invalidate operation may be worth employing in
this respect, as it relieves the client of certain overhead in this
case.
A "onetime" boolean advisory attached to each RDMA region might
serve as a hint to the server that the client will use the
three-tuple for only one NFSv4 operation. For a transport such as
iWARP, the server can
assist the client in invalidating the three-tuple by performing a
Send with Solicited Event and Invalidate. The server may ignore this
hint, in which case the client must perform a local invalidate after
receiving the indication from the server that the NFSv4 operation is
complete. This may be considered in a future version of this draft
and [NFSDDP].
In a trusted environment, it may be desirable for the client to
persistently enable RDMA access by the server. Such a model is
desirable for the highest level of efficiency and lowest overhead.
RDMA message exchanges
Client Server
: Direct Read Request :
Send : ------------------------------> : untagged
: : buffer
: Segment :
tagged : <------------------------------ : RDMA Write
buffer : : :
: [Segment] :
tagged : <------------------------------ : [RDMA Write]
buffer : :
: Direct Read Response :
untagged : <------------------------------ : Send (w/Inv.)
buffer : :
Client Server
: Direct Write Request :
Send : ------------------------------> : untagged
: : buffer
: Segment :
tagged : v------------------------------ : RDMA Read
buffer : +-----------------------------> :
: : :
: [Segment] :
tagged : v------------------------------ : [RDMA Read]
buffer : +-----------------------------> :
: :
: Direct Write Response :
untagged : <------------------------------ : Send (w/Inv.)
buffer : :
6.3. Connection Models
There are three scenarios in which to discuss the connection model.
Each will be discussed individually, after describing the common case
encountered at initial connection establishment.
After a successful connection, the first request proceeds, in the
case of a new client association, to initial session creation, and
then optionally to session callback channel binding, prior to regular
operation.
Commonly, each new client "mount" will be the action which drives
creation of a new session. However there are any number of other
approaches. Clients may choose to share a single connection and
session among all their mount points. Or, clients may support
trunking, where additional connections are created but all within a
single session. Alternatively, the client may choose to create
multiple sessions, each tuned to the buffering and reliability needs
of the mount point. For example, a readonly mount can sharply reduce
its write buffering and also make no requirement for the server to
support reliable duplicate request caching.
Similarly, the client can choose among several strategies for
clientid usage. Sessions can share a single clientid, or create new
clientids as the client deems appropriate. For kernel-based clients
which service multiple authenticated users, a single clientid shared
across all mount points is generally the most appropriate and
flexible approach. For example, all the client's file operations may
wish to share locking state and the local client kernel takes the
responsibility for arbitrating access locally. For clients choosing
to support other authentication models, for example userspace
implementations, a new clientid is indicated. Through use of session
create options, both models are supported at the client's choice.
Since the session is explicitly created and destroyed by the client,
and each client is uniquely identified, the server may be
specifically instructed to discard unneeded persistent state. For
this reason, it is possible that a server will retain any previous
state indefinitely, and place its destruction under administrative
control. Or, a server may choose to retain state for some
configurable period, provided that the period meets other NFSv4
requirements such as lease reclamation time, etc. However, since
discarding this state at the server may affect the correctness of the
server as seen by the client across network partitioning, such
discarding of state should be done only in a conservative manner.
Each client request to the server carries a new SEQUENCE operation
within each COMPOUND, which provides the session context. This
session context then governs the request control, duplicate request
caching, and other persistent parameters managed by the server for a
session.
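In outline, every NFSv4.1 request is assembled as sketched below.
The helper names and the slot parameter's exact shape are invented
for illustration; the authoritative definition of SEQUENCE appears
later in this specification.

   #include <stdint.h>

   struct compound;
   typedef unsigned char sessionid_t[16];

   /* Hypothetical COMPOUND-construction helpers. */
   void compound_init(struct compound *);
   void add_op_sequence(struct compound *, const sessionid_t,
                        uint32_t seqid, uint32_t slot);

   /* Every request begins with SEQUENCE, carrying the session
    * context that governs request control and duplicate request
    * caching; the remaining operations follow it. */
   static void build_request(struct compound *c,
                             const sessionid_t sid,
                             uint32_t seqid, uint32_t slot)
   {
       compound_init(c);
       add_op_sequence(c, sid, seqid, slot);
       /* ... PUTFH, READ, etc. appended here ... */
   }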
6.3.1. TCP Connection Model
The following is a schematic diagram of the NFSv4.1 protocol
exchanges leading up to normal operation on a TCP stream.
Client Server
TCPmode : Create Clientid(nfs_client_id4) : TCPmode
: ------------------------------> :
: :
: Clientid reply(clientid, ...) :
: <------------------------------ :
: :
: Create Session(clientid, size S, :
: maxreq N, STREAM, ...) :
: ------------------------------> :
: :
: Session reply(sessionid, size S', :
: maxreq N') :
: <------------------------------ :
: :
: <normal operation> :
: ------------------------------> :
: <------------------------------ :
: : :
No net additional exchange is added to the initial negotiation. In
the NFSv4.1 exchange, the CREATE_CLIENTID replaces SETCLIENTID
(eliding the callback "clientaddr4" addressing) and CREATE_SESSION
subsumes the function of SETCLIENTID_CONFIRM, as described elsewhere
in this specification. Callback channel binding is optional, as in
NFSv4.0. Note that the STREAM transport type is shown above, but
since the transport mode remains unchanged and transport attributes
are not necessarily exchanged, DEFAULT could also be passed.
6.3.2. Negotiated RDMA Connection Model
One possible design which has been considered is to have a
"negotiated" RDMA connection model, supported via use of a session
bind operation as a required first step. However due to issues
mentioned earlier, this proved problematic. This section remains as
a reminder of that fact, and it is possible that such a mode could
still be supported.
It is not considered critical that this be supported for two reasons.
One, the session persistence provides a way for the server to
remember important session parameters, such as sizes and maximum
request counts. These values can be used to restore the endpoint
prior to making the first reply. Two, there are currently no
critical RDMA parameters to set in the endpoint at the server side of
the connection. RDMA Read resources, which are in general not
settable after entering RDMA mode, are set only at the client - the
originator of the connection. Therefore as long as the RDMA provider
supports an automatic RDMA connection mode, no further support is
required from the NFSv4.1 protocol for reconnection.
Note, the client must provide at least as many RDMA Read resources to
its local queue for the benefit of the server when reconnecting, as
it used when negotiating the session. If this value is no longer
appropriate, the client should resynchronize its session state,
destroy the existing session, and start over with the more
appropriate values.
6.3.3. Automatic RDMA Connection Model
The following is a schematic diagram of the NFSv4.1 protocol
exchanges performed on an RDMA connection.
Client Server
RDMAmode : : : RDMAmode
: : :
Prepost : : : Prepost
receive : : : receive
: :
: Create Clientid(nfs_client_id4) :
: ------------------------------> :
: : Prepost
: Clientid reply(clientid, ...) : receive
: <------------------------------ :
Prepost : :
receive : Create Session(clientid, size S, :
: maxreq N, RDMA ...) :
: ------------------------------> :
: : Prepost <=N'
: Session reply(sessionid, size S', : receives of
: maxreq N') : size S'
: <------------------------------ :
: :
: <normal operation> :
: ------------------------------> :
: <------------------------------ :
: : :
6.4. Buffer Management, Transfer, Flow Control
Inline operations in NFSv4.1 behave effectively the same as TCP
sends. Procedure results are passed in a single message, and its
completion at the client signals the receiving process to inspect the
message.
RDMA operations are performed solely by the server in NFSv4.1, as
described in Section 6.2.5 RDMA Direct Transfer Model. Since server
RDMA operations do not result in a completion at the client, and due
to ordering rules in RDMA transports, after all required RDMA
operations are complete, a Send (Send with Solicited Event for iWARP)
containing the procedure results is performed from server to client.
This Send operation will result in a completion which will signal the
client to inspect the message.
In the case of client read-type NFSv4 operations, the server will
have issued RDMA Writes to transfer the resulting data into client-
advertised buffers. The subsequent Send operation performs two
necessary functions: finalizing any active or pending DMA at the
client, and signaling the client to inspect the message.
In the case of client write-type NFSv4 operations, the server will
have issued RDMA Reads to fetch the data from the client-advertised
buffers. No data consistency issues arise at the client, but the
completion of the transfer must be acknowledged, again by a Send from
server to client.
In either case, the client advertises buffers for direct (RDMA style)
operations. The client may desire certain advertisement limits, and
may wish the server to perform remote invalidation on its behalf when
the server has completed its RDMA. This may be considered in a
future version of this draft.
In the absence of remote invalidation, the client may perform its
own, local invalidation after the operation completes. This
invalidation should occur prior to any RPCSEC_GSS integrity checking,
since a buffer that is still validly remotely accessible can possibly
be modified by the peer. However, once the buffer is invalidated and
its contents integrity checked, the contents are locally secure.
Credit updates over RDMA transports are supported at the RPC layer as
described in [RPCRDMA]. In each request, the client requests a
desired number of credits to be made available to the connection on
which it sends the request. The client must not send more requests
than the number which the server has previously advertised, or, in
the case of the first request, more than one.
credit limit, the connection may close with a fatal RDMA error.
The server then executes the request, and replies with an updated
credit count accompanying its results. Since replies are sequenced
by their RDMA Send order, the most recent results always reflect the
server's limit. In this way the client will always know the maximum
number of requests it may safely post.
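The accounting the client must perform can be sketched as follows.
This is a minimal illustration of the invariant described above; the
structure and helper names are invented.

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative client-side credit accounting.  The invariant:
    * never exceed the server's most recent credit advertisement. */
   struct credit_state {
       uint32_t granted;      /* last advertisement from server   */
       uint32_t outstanding;  /* sent, but reply not yet received */
   };

   static bool can_send(const struct credit_state *cs)
   {
       return cs->outstanding < cs->granted;
   }

   static void on_send(struct credit_state *cs)
   {
       cs->outstanding++;     /* a receive buffer was posted first */
   }

   static void on_reply(struct credit_state *cs, uint32_t new_grant)
   {
       cs->outstanding--;
       /* Replies are ordered by RDMA Send, so the newest grant
        * always reflects the server's current limit. */
       cs->granted = new_grant;
   }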
Because the client requests an arbitrary credit count in each
request, it is relatively easy for the client to request more, or
fewer, credits to match its expected need. A client that discovered
itself frequently queuing outgoing requests due to lack of server
credits might increase its requested credits proportionately in
response. Or, a client might have a simple, configurable number.
The protocol also provides a per-operation "maxslot" exchange to
assist in dynamic adjustment at the session level, described in a
later section.
Occasionally, a server may wish to reduce the total number of credits
it offers a certain client on a connection. This could be
encountered if a client were found to be consuming its credits
slowly, or not at all. A client might notice this itself, and reduce
its requested credits in advance, for instance requesting only the
count of operations it currently has queued, plus a few as a base for
starting up again. Such mechanisms can, however, be potentially
complicated and are implementation-defined. The protocol does not
require them.
Because of the way in which RDMA fabrics function, it is not possible
for the server (or client back channel) to cancel outstanding receive
operations. Therefore, effectively only one credit can be withdrawn
per receive completion. The server (or client back channel) would
simply not replenish a receive operation when replying. The server
can still reduce the available credit advertisement in its replies to
the target value it desires, as a hint to the client that its credit
target is lower and it should expect it to be reduced accordingly.
Of course, even if the server could cancel outstanding receives, it
could not safely do so, since the client may have already sent
requests in expectation of the previous limit.
This brings out an interesting scenario similar to that of client
reconnect discussed in Section 6.3. How does the server reduce the
credits of an inactive client?
One approach is for the server to simply close such a connection and
require the client to reconnect at a new credit limit. This is
acceptable, if inefficient, when the connection setup time is short
and where the server supports persistent session semantics.
A better approach is to provide a back channel request to return the
operations channel credits. The server may request the client to
return some number of credits; the client must comply by performing
operations on the operations channel, provided of course that the
request does not drop the client's credit count to zero (in which
case the connection would deadlock). If the client finds that it has
no requests with which to consume the credits it was previously
granted, it must send zero-length Send RDMA operations, or NULL NFSv4
operations in order to return the resources to the server. If the
client fails to comply in a timely fashion, the server can recover
the resources by breaking the connection.
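A sketch of the client's side of such a credit recall follows;
client_credits() and send_null_op() are hypothetical helpers.

   #include <stdint.h>

   struct nfs_client;

   /* Hypothetical helpers. */
   uint32_t client_credits(const struct nfs_client *);
   void send_null_op(struct nfs_client *);  /* consumes one credit */

   /* Return up to n credits by issuing NULL operations, while
    * never letting our own credit count reach zero (which would
    * deadlock the connection). */
   static void return_credits(struct nfs_client *clnt, uint32_t n)
   {
       while (n-- > 0 && client_credits(clnt) > 1)
           send_null_op(clnt);
   }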
While in principle, the back channel credits could be subject to a
similar resource adjustment, in practice this is not an issue, since
the back channel is used purely for control and is expected to be
statically provisioned.
It is important to note that in addition to maximum request counts,
the sizes of buffers are negotiated per-session. This permits the
most efficient allocation of resources on both peers. There is an
important requirement on reconnection: the sizes posted by the server
at reconnect must be at least as large as previously used, to allow
recovery. Any replies that are replayed from the server's duplicate
request cache must be receivable into client buffers. In
the case where a client has received replies to all its retried
requests (and therefore received all its expected responses), then
the client may disconnect and reconnect with different buffers at
will, since no cache replay will be required.
6.5. Retry and Replay
NFSv4.0 forbids retransmission on active connections over reliable
transports; this includes connected-mode RDMA. This restriction must
be maintained in NFSv4.1.
If one peer were to retransmit a request (or reply), it would consume
an additional credit on the other. If the server retransmitted a
reply, it would certainly result in an RDMA connection loss, since
the client would typically only post a single receive buffer for each
request. If the client retransmitted a request, the additional
credit consumed on the server might lead to RDMA connection failure
unless the client accounted for it and decreased its available
credit, leading to wasted resources.
RDMA credits present a new issue to the duplicate request cache in
NFSv4.1. The request cache may be needed after a connection within a
session is lost, for example when the client reconnects. Credit
information is a dynamic property of the connection, and stale values
must not be replayed from the cache. This implies that the request
cache contents must not be blindly used when replies are issued from
it, and credit information appropriate to the channel must be
refreshed by the RPC layer.
Finally, RDMA fabrics do not guarantee that the memory handles
(Steering Tags) within each RDMA three-tuple are valid outside the
scope of a single connection. Therefore, handles used by the
direct operations become invalid after connection loss. The server
must ensure that any RDMA operations which must be replayed from the
request cache use the newly provided handle(s) from the most recent
request.
6.6. The Back Channel
The NFSv4 callback operations present a significant resource problem
for the RDMA enabled client. Clearly, callbacks must be negotiated
in the way credits are for the ordinary operations channel for
requests flowing from client to server. But, for callbacks to arrive
on the same RDMA endpoint as operation replies would require
dedicating additional resources, and specialized demultiplexing and
event handling. Or, callbacks may not require RDMA service at all
(they do not normally carry substantial data payloads). It is highly
desirable to streamline this critical path via a second
communications channel.
The session callback channel binding facility is designed for exactly
such a situation, by dynamically associating a new connected endpoint
with the session, and separately negotiating sizes and counts for
active callback channel operations. The binding operation is
firewall-friendly since it does not require the server to initiate
the connection.
This same method serves equally well for ordinary TCP connections.
It is expected that NFSv4.1 clients will make use of the session
facility to streamline their design.
The back channel functions exactly like the operations channel,
except that no RDMA operations are required to perform transfers;
instead, the sizes are required to be sufficiently large to carry all
data inline, and of course the client and server reverse their roles
with respect to which is in control of credit management. The same
rules apply for all transfers, with the server being required to flow
control its callback requests.
The back channel is optional. If not bound on a given session, the
server must not issue callback operations to the client. This in
turn implies that such a client must never put itself in the
situation where the server will need to do so, lest the client lose
its connection by force, or its operation be incorrect. For the same
reason, if a back channel is bound, the client is subject to
revocation of its delegations if the back channel is lost. Any
connection loss should be corrected by the client as soon as
possible.
This can be convenient for the NFSv4.1 client; if the client expects
to make no use of back channel facilities such as delegations, then
there is no need to create one. This may save significant resources
and complexity at the client.
For these reasons, if the client wishes to use the back channel, that
channel must be bound first, before using the operations channel. In
this way, the server will not find itself in a position where it will
send callbacks on the operations channel when the client is not
prepared for them.
[[Comment.4: [XXX - do we want to support this?]]] There is one
special case, that where the back channel is bound in fact to the
operations channel's connection. This configuration would be used
normally over a TCP stream connection to exactly implement the
NFSv4.0 behavior, but over RDMA would require complex resource and
event management at both sides of the connection. The server is not
required to accept such a bind request on an RDMA connection for this
reason, though it is recommended.
6.7. COMPOUND Sizing Issues
Very large responses may pose duplicate request cache issues. Since
servers will want to bound the storage required for such a cache, the
unlimited size of response data in COMPOUND may be troublesome. If
COMPOUND is used in all its generality, then the inclusion of certain
non-idempotent operations within a single COMPOUND request may render
the entire request non-idempotent. (For example, a single COMPOUND
request which read a file or symbolic link, then removed it, would be
obliged to cache the data in order to allow identical replay).
More generally, many requests might include operations that return
an unbounded amount of data.
It is not satisfactory for the server to reject COMPOUNDs at will
with NFS4ERR_RESOURCE when they pose such difficulties for the
server, as this results in serious interoperability problems.
Instead, any such limits must be explicitly exposed as attributes of
the session, ensuring that the server can explicitly support any
duplicate request cache needs at all times.
6.8. Data Alignment
A negotiated data alignment enables certain scatter/gather
optimizations. A facility for this is supported by [RPCRDMA]. Where
NFS file data is the payload, specific optimizations become highly
attractive.
Header padding is requested by each peer at session initiation, and
may be zero (no padding). Padding leverages the useful property that
RDMA receives preserve alignment of data, even when they are placed
into anonymous (untagged) buffers. If requested, client inline
writes will insert appropriate pad bytes within the request header to
align the data payload on the specified boundary. The client is
encouraged to be optimistic and simply pad all WRITEs within the RPC
layer to the negotiated size, in the expectation that the server can
use them efficiently.
It is highly recommended that clients offer to pad headers to an
appropriate size. Most servers can make good use of such padding,
which allows them to chain receive buffers in such a way that any
data carried by client requests will be placed into appropriate
buffers at the server, ready for file system processing. The
receiver's RPC layer encounters no overhead from skipping over pad
bytes, and the RDMA layer's high performance makes the insertion and
transmission of padding on the sender a significant optimization. In
this way, the need for servers to perform RDMA Read to satisfy all
but the largest client writes is obviated. An added benefit is the
reduction of message roundtrips on the network - a potentially good
trade, where latency is present.
The value to choose for padding is subject to a number of criteria.
A primary source of variable-length data in the RPC header is the
authentication information, the form of which is client-determined,
possibly in response to server specification. The contents of
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
go into the determination of a maximal NFSv4 request size and
therefore minimal buffer size. The client must select its offered
value carefully, so as not to overburden the server, and vice versa.
The payoff of an appropriate padding value is higher performance.
Sender gather:
    |RPC Request|Pad bytes|Length| -> |User data...|
    \------+----------------------/     \
            \                            \
             \    Receiver scatter:       \-----------+- ...
          /-----+----------------\          \           \
          |RPC Request|Pad|Length|   ->   |FS buffer|->|FS buffer|->...
In the above case, the server may recycle buffers left unused by the
actual received request to the next posted receive, or may pass the
now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for data
copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management. This includes most kernel-based
and integrated server designs, among many others. The client may
perform similar optimizations, if desired.
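For illustration, the number of pad bytes the sender must insert for
a given negotiated alignment can be computed as in the following
sketch; the function name and parameters are hypothetical.

   #include <stdint.h>

   /* Bytes of padding needed so the data payload begins on the
      negotiated "align" boundary, given the RPC header length.
      Returns zero when the header already ends on the boundary. */
   static uint32_t
   pad_bytes(uint32_t header_len, uint32_t align)
   {
       return (align - (header_len % align)) % align;
   }

For example, a 97-byte header with a negotiated 512-byte alignment
would call for 415 pad bytes.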
Padding is negotiated by the session creation operation, and
subsequently used by the RPC RDMA layer, as described in [RPCRDMA].
6.9. NFSv4 Integration
This section discusses the integration of the session infrastructure
into NFSv4.1.
6.9.1. Minor Versioning
Minor versioning of NFSv4 is relatively restrictive, and allows for
tightly limited changes only. In particular, it does not permit
adding new "procedures" (it permits adding only new "operations").
Interoperability concerns make it impossible to consider additional
layering to be a minor revision. This somewhat limits the changes
that can be introduced when considering extensions.
To support the duplicate request cache integrated with sessions and
request control, it is desirable to tag each request with an
identifier to be called a Slotid. This identifier must be passed by
NFSv4.1 when running atop any transport, including traditional TCP.
Therefore it is not desirable to add the Slotid to a new RPC
transport, even though such a transport is indicated for support of
RDMA. This specification and [RPCRDMA] do not specify such an
approach.
Instead, this specification conforms to the requirements of NFSv4
minor versioning, through the use of a new operation within NFSv4
COMPOUND procedures as detailed below.
If sessions are in use for a given clientid, this same clientid
cannot be used for non-session NFSv4 operation, including NFSv4.0.
Because the server will have allocated session-specific state to the
active clientid, it would be an unnecessary burden on the server
implementor to support and account for additional, non-session
traffic, in addition to being of no benefit. Therefore this
specification prohibits a single clientid from doing this.
Nevertheless, employing a new clientid for such traffic is supported.
6.9.2. Slot Identifiers and Server Duplicate Request Cache
The presence of deterministic maximum request limits on a session
enables in-progress requests to be assigned unique values with useful
properties.
The RPC layer provides a transaction ID (xid), which, while required
to be unique, is not especially convenient for tracking requests.
The transaction ID is only meaningful to the issuer (client); it
cannot be interpreted at the server except to test for equality with
previously issued requests. Because RPC operations may be completed
by the server in any order, many transaction IDs may be outstanding
at any time. The client may therefore perform a computationally
expensive lookup operation in the process of demultiplexing each
reply.
In this specification, there is a limit to the number of active
requests. This immediately enables a convenient, computationally
efficient index for each request, designated as a Slot Identifier,
or slotid.
When the client issues a new request, it selects a slotid in the
range 0..N-1, where N is the server's current "totalrequests" limit
granted the client on the session over which the request is to be
issued. The slotid must be unused by any of the requests which the
client has already active on the session. "Unused" here means the
client has no outstanding request for that slotid. Because the
slotid is always an integer in the range 0..N-1, client
implementations can use the slotid from a server response to
efficiently match responses with outstanding requests, for example
by using the slotid to index into an outstanding request array.
This can be
used to avoid expensive hashing and lookup functions in the
performance-critical receive path.
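A minimal sketch of such a client-side slot table follows; the names
and the fixed table size are illustrative only. It allocates the
lowest unused slotid (anticipating the slot retirement rules
discussed below) and bumps the slot's sequenceid for each new
transmit.

   #include <stdint.h>

   #define MAX_SLOTS 128            /* illustrative upper bound */

   struct slot {
       uint32_t sequenceid;         /* last sequenceid sent on slot */
       int      in_use;             /* request outstanding? */
   };

   static struct slot slot_table[MAX_SLOTS];

   /* Return the lowest unused slotid within the server's grant, or
      -1 if all granted slots are busy and the request must queue.
      A reply is later matched by indexing slot_table[slotid]. */
   static int
   alloc_slotid(uint32_t n_granted)
   {
       for (uint32_t i = 0; i < n_granted && i < MAX_SLOTS; i++) {
           if (!slot_table[i].in_use) {
               slot_table[i].in_use = 1;
               slot_table[i].sequenceid++;   /* new transmit */
               return (int)i;
           }
       }
       return -1;
   }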
The sequenceid, which accompanies the slotid in each request, is
essential for a second check at the server: the server must be able
to determine efficiently whether a request using a certain slotid is
a retransmit or a new, never-before-seen request. It is
not feasible for the client to assert that it is retransmitting to
implement this, because for any given request the client cannot know
the server has seen it unless the server actually replies. Of
course, if the client has seen the server's reply, the client would
not retransmit!
The sequenceid must increase monotonically for each new transmit of a
given slotid, and must remain unchanged for any retransmission. The
server must in turn compare each newly received request's sequenceid
with the last one previously received for that slotid, to see if the
new request is one of the following (a sketch of this check follows
the list):
o A new request, in which the sequenceid is one greater than that
previously seen in the slot (accounting for sequence wraparound).
The server proceeds to execute the new request.
o A retransmitted request, in which the sequenceid is equal to that
last seen in the slot. Note that this request may be either
complete, or in progress. The server performs replay processing
in these cases.
o A misordered duplicate, in which the sequenceid is less than
(accounting for sequence wraparound) that previously seen in the
slot. The server MUST return NFS4ERR_SEQ_MISORDERED.
o A misordered new request, in which the sequenceid is two or more
greater than (accounting for sequence wraparound) that previously
seen in the slot. Note that because the sequenceid must wrap
around once it reaches 0xFFFFFFFF, a misordered new request and a
misordered duplicate cannot be distinguished. Thus, the server
MUST return NFS4ERR_SEQ_MISORDERED.
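A sketch of the server's check follows, under the assumption of
32-bit unsigned sequenceid arithmetic, which makes the wraparound
cases above fall out naturally; the names are illustrative.

   #include <stdint.h>

   enum seq_disposition { SEQ_NEW, SEQ_REPLAY, SEQ_MISORDERED };

   /* Classify a received sequenceid against the one cached for the
      slot.  Unsigned 32-bit arithmetic handles wraparound: a request
      is new iff received == cached + 1 (mod 2^32). */
   static enum seq_disposition
   check_sequenceid(uint32_t cached, uint32_t received)
   {
       if (received == (uint32_t)(cached + 1))
           return SEQ_NEW;          /* execute the request */
       if (received == cached)
           return SEQ_REPLAY;       /* replay processing */
       return SEQ_MISORDERED;       /* NFS4ERR_SEQ_MISORDERED */
   }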
Unlike the XID, the slotid is always within a specific range; this
has two implications. The first implication is that for a given
session, the server need only cache the results of a limited number
of COMPOUND requests. The second implication, which derives from the
first, is that unlike XID-indexed DRCs, the slotid-indexed DRC by its
nature cannot be overflowed. Also, because the sequenceid identifies
retransmitted requests, the server does not need to actually cache
the request itself, further reducing the storage requirements of the
DRC. These new facilities make it practical to maintain all the
required entries for an effective DRC.
The slotid and sequenceid therefore take over the traditional role of
the XID and port number in the server DRC implementation, and the
session replaces the IP address. This approach is considerably more
portable and completely robust - it is not subject to the frequent
reassignment of ports as clients reconnect over IP networks. In
addition, the RPC XID is not used in the reply cache, enhancing
robustness of the cache in the face of any rapid reuse of XIDs by the
client. [[Comment.5: We need to discuss the requirements of the
client for changing the XID.]].
It is required to encode the slotid information into each request in
a way that does not violate the minor versioning rules of the NFSv4.0
specification. This is accomplished here by encoding it in a control
operation (SEQUENCE) within each NFSv4.1 COMPOUND and CB_COMPOUND
procedure. The operation easily piggybacks within existing messages.
In general, the receipt of a new sequenced request arriving on any
valid slot is an indication that the previous DRC contents of that
slot may be discarded. In order to further assist the server in slot
management, the client is required to use the lowest available slot
when issuing a new request. In this way, the server may be able to
retire additional entries.
However, in the case where the server is actively adjusting its
granted maximum request count to the client, it may not be able to
use receipt of the slotid to retire cache entries. The slotid used
in an incoming request may not reflect the server's current idea of
the client's session limit, because the request may have been sent
from the client before the update was received. Therefore, in the
downward adjustment case, the server may have to retain a number of
duplicate request cache entries at least as large as the old value,
until operation sequencing rules allow it to infer that the client
has seen its reply.
The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot"
value which conveys additional client slot usage information. The
client must always provide its highest-numbered outstanding slot
value in the maxslot argument, and the server may reply with a new
recognized value. The client should in all cases provide the most
conservative value possible, although it can be increased somewhat
above the actual instantaneous usage to maintain some minimum or
optimal level. This provides a way for the client to yield unused
request slots back to the server, which in turn can use the
information to reallocate resources. Obviously, maxslot can never be
zero, or the session would deadlock.
The server also provides a target maxslot value to the client, which
is an indication to the client of the maxslot the server wishes the
client to be using. This permits the server to withdraw (or add)
resources from a client that has been found to not be using them, in
order to more fairly share resources among a varying level of demand
from other clients. The client must always comply with the server's
value updates, since they indicate newly established hard limits on
the client's access to session resources. However, because of
request pipelining, the client may have active requests in flight
reflecting prior values, therefore the server must not immediately
require the client to comply.
It is worthwhile to note that Sprite RPC [BW87] defined a "channel"
which in some ways is similar to the slotid defined here. Sprite RPC
used channels to implement parallel request processing and request/
response cache retirement.
6.9.3. Resolving Server Callback Races with Sessions
It is possible for server callbacks to arrive at the client before
the reply from related forward channel operations. For example, a
client may have been granted a delegation to a file it has opened,
but the reply to the OPEN (informing the client of the granting of
the delegation) may be delayed in the network. If a conflicting
operation arrives at the server, it will recall the delegation using
the callback channel, which may be on a different transport
connection, perhaps even a different network. In NFSv4.0, if the
callback request arrives before the related reply, the client may
reply to the server with an error.
The presence of a session between client and server alleviates this
issue. When a session is in place, each client request is uniquely
identified by its { slotid, sequenceid } pair. By the rules under
which slot entries (duplicate request cache entries) are retired, the
server knows whether the client has "seen" each of the
server's replies. The server can therefore provide sufficient
information to the client to allow it to disambiguate between an
erroneous or conflicting callback and a race condition.
For each client operation which might result in some sort of server
callback, the server should "remember" the { slotid, sequenceid }
pair of the client request until the slotid retirement rules allow
the server to determine that the client has, in fact, seen the
server's reply. Until the time the { slotid, sequenceid } request
pair can be retired, any recalls of the associated object MUST carry
an array of these referring identifiers (in the CB_SEQUENCE
operation's arguments), for the benefit of the client. After this
time, it is not necessary for the server to provide this information
in related callbacks, since it is certain that a race condition can
no longer occur.
The CB_SEQUENCE operation which begins each server callback carries a
list of "referring" { slotid, sequenceid } tuples. If the client
finds the request corresponding to the referring slotid and sequenced
id be currently outstanding (i.e. the server's reply has not been
seen by the client), it can determine that the callback has raced the
reply, and act accordingly.
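A hypothetical client-side check, assuming the client records each
outstanding request's sequenceid per slot (the data structure names
are illustrative):

   #include <stdint.h>

   struct referring_call { uint32_t slotid; uint32_t sequenceid; };

   /* Client's per-slot record of its most recent request. */
   struct pending { uint32_t sequenceid; int outstanding; };

   /* Nonzero if the referring request is still outstanding, i.e.
      the callback has raced ahead of the reply it refers to. */
   static int
   callback_raced_reply(const struct pending *slots,
                        const struct referring_call *rc)
   {
       const struct pending *s = &slots[rc->slotid];
       return s->outstanding && s->sequenceid == rc->sequenceid;
   }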
The client must not simply wait forever for the expected server reply
to arrive on any of the session's operations channels, because it is
possible that they will be delayed indefinitely. However, it should
wait for a period of time, and if the time expires it can provide a
more meaningful error such as NFS4ERR_DELAY.
[[Comment.6: XXX ... We need to consider the clients' options here,
and describe them... NFS4ERR_DELAY has been discussed as a legal
reply to CB_RECALL?]]
There are other scenarios under which callbacks may race replies,
among them pNFS layout recalls, described in Section 17.3.5.3.
[[Comment.7: XXX fill in the blanks w/others, etc...]]
6.9.4. COMPOUND and CB_COMPOUND
[[Comment.8: Noveck: This is about the twelfth time we say that this
is minor version. The diagram makes sense if you are explaining
which should be done somewhere, but this is supposedly explaining
sessions.]]
Support for per-operation control is added to NFSv4 COMPOUNDs by
placing such facilities into their own, new operation, and placing
this operation first in each COMPOUND under the new NFSv4 minor
protocol revision. The contents of the operation would then apply to
the entire COMPOUND.
Recall that the NFSv4 minor version number is contained within the
COMPOUND header, encoded prior to the COMPOUNDed operations. By
simply requiring that the new operation always be contained in NFSv4
minor COMPOUNDs, the control protocol can piggyback perfectly with
each request and response.
In this way, the NFSv4 Session Extensions may stay in compliance with
the minor versioning requirements specified in section 10 of RFC3530
[2].
Referring to section 13.1 of RFC3530 [2], the specified session-
enabled COMPOUND and CB_COMPOUND have the form:
+-----+--------------+-----------+------------+-----------+----
| tag | minorversion |  numops   | control op | op + args | ...
|     |    (== 1)    | (limited) |   + args   |           |
+-----+--------------+-----------+------------+-----------+----
and the reply's structure is:
+------------+-----+--------+-------------------------------+--//
|last status | tag | numres | status + control op + results | //
+------------+-----+--------+-------------------------------+--//
//-----------------------+----
// status + op + results | ...
//-----------------------+----
[[Comment.9: The artwork above doesn't mention callback_ident that is
used for CB_COMPOUND. We need to mention that for NFSv4.1,
callback_ident is superfluous]] The single control operation,
SEQUENCE, within each NFSv4.1 COMPOUND defines the context and
operational session parameters which govern that COMPOUND request and
reply. Placing it first in the COMPOUND encoding is required in
order to allow its processing before other operations in the
COMPOUND.
6.10. Sessions Security Considerations
NFSv4 minor version 1 retains all existing NFSv4 security; all
security considerations present in NFSv4.0 apply to it equally.
Security considerations of any underlying RDMA transport are
additionally important, all the more so due to the emerging nature of
such transports. Examining these issues is outside the scope of this
specification.
When protecting a connection with RPCSEC_GSS, all data in each
request and response (whether transferred inline or via RDMA)
continues to receive this protection over RDMA fabrics [RPCRDMA].
However when performing data transfers via RDMA, RPCSEC_GSS
protection of the data transfer portion works against the efficiency
which RDMA is typically employed to achieve. This is because such
data is normally managed solely by the RDMA fabric, and intentionally
is not touched by software. The means by which the local RPCSEC_GSS
implementation is integrated with the RDMA data protection facilities
are outside the scope of this specification.
If the NFS client wishes to maintain full control over RPCSEC_GSS
protection, it may still perform its transfer operations using either
the inline or RDMA transfer model, or of course employ traditional
TCP stream operation. In the RDMA inline case, header padding is
recommended to optimize behavior at the server. At the client, close
attention should be paid to the implementation of RPCSEC_GSS
processing to minimize memory referencing and especially copying.
The session callback channel binding improves security over that
provided by NFSv4 for the callback channel. The connection is
client-initiated, and subject to the same firewall and routing checks
as the operations channel. The connection cannot be hijacked by an
attacker who connects to the client port prior to the intended
server. The connection is set up by the client with its desired
attributes, such as optionally securing with IPsec or similar. The
binding is fully authenticated before being activated.
6.10.1. Denial of Service via Unauthorized State Changes
Under some conditions, NFSv4.0 is vulnerable to a denial of service
issue with respect to its state management.
The attack works via an unauthorized client faking an open_owner4, an
open_owner/lock_owner pair, or stateid, combined with a seqid. The
operation is sent to the NFSv4 server. The NFSv4 server accepts the
state information, and as long as any status code from the result of
this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, or NFS4ERR_NOFILEHANDLE, the sequence number is
incremented. When the authorized client issues an operation, it gets
back NFS4ERR_BAD_SEQID, because its idea of the current sequence
number is off by one. The authorized client's recovery options are
quite limited: SETCLIENTID, followed by a complete reclaim of state,
which may or may not succeed completely. This qualifies as a denial
of service attack.
If the client uses RPCSEC_GSS authentication and integrity, and every
client maps each open_owner and lock_owner to one and only one
principal, and the server enforces this binding, then the conditions
leading to vulnerability to the denial of service do not exist. One
should keep in mind that if AUTH_SYS is being used, far simpler
denial of service and other attacks are possible.
With NFSv4.1 sessions, the per-operation sequence number is ignored
(see Section 13.13); therefore the NFSv4.0 denial of service
vulnerability described above does not apply. However as described
to this point in the specification, an attacker could forge the
sessionid and issue a SEQUENCE with a slotid that he expects the
legitimate client to use next. The legitimate client could then use
the slotid with the same sequence number, and the server returns the
attacker's result from the replay cache, thereby disrupting the
legitimate client.
If we give each NFSv4.1 user their own session, and each user uses
RPCSEC_GSS authentication and integrity, then the denial of service
issue is solved, at the cost of additional per-session state. The
alternative that NFSv4.1 specifies is described as follows.
Transport connections MUST be bound to a session by the client.
The server MUST return an error to an operation (other than the
operation that binds the connection to the session) that uses an
unbound connection. As a simplification, the transport connection
used by CREATE_SESSION is automatically bound to the session.
Additional connections are bound to a session via a new operation,
BIND_CONN_TO_SESSION.
To prevent attackers from issuing BIND_CONN_TO_SESSION operations,
the arguments to BIND_CONN_TO_SESSION include a digest of a shared
secret called the secret session verifier (SSV) that only the client
and server know. The digest is created via a one-way, collision-
resistant hash function, making it intractable for the attacker to
forge.
The SSV is sent to the server via SET_SSV. To prevent eavesdropping,
a SET_SSV for the SSV can be protected via RPCSEC_GSS with the
privacy service. The SSV can be changed by the client at any time,
by any principal. However, several aspects of SSV changing prevent an
attacker from engaging in a successful denial of service attack:
1. A SET_SSV on the SSV does not replace the SSV with the argument
to SET_SSV. Instead, the current SSV on the server is logically
exclusive ORed (XORed) with the argument to SET_SSV (see the
sketch after this list). SET_SSV MUST NOT be called with an SSV
value that is zero.
2. The arguments to and results of SET_SSV include digests of the
old and new SSV, respectively.
3. Because the initial value of the SSV is zero, therefore known,
the client MUST issue at least one SET_SSV operation before the
first BIND_CONN_TO_SESSION operation. A client SHOULD issue
SET_SSV as soon as a session is created.
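The XOR update of rule 1 might look as follows; the SSV length is
negotiated in practice, so the fixed size here is purely
illustrative.

   #include <stddef.h>
   #include <stdint.h>

   #define SSV_LEN 32               /* illustrative; negotiated */

   /* Rule 1: SET_SSV does not replace the SSV; the argument is
      XORed into the server's current SSV. */
   static void
   apply_set_ssv(uint8_t ssv[SSV_LEN], const uint8_t arg[SSV_LEN])
   {
       for (size_t i = 0; i < SSV_LEN; i++)
           ssv[i] ^= arg[i];
   }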
If a connection is disconnected, BIND_CONN_TO_SESSION is required to
bind a replacement connection to the session, even if the connection
that was disconnected was the one on which CREATE_SESSION was issued.
If a client is assigned a machine principal then the client SHOULD
use the machine principal's RPCSEC_GSS context to privacy protect the
SSV from eavesdropping during the SET_SSV operation. If a machine
principal is not being used, then the client MAY use the non-machine
principal's RPCSEC_GSS context to privacy protect the SSV. The
server MUST accept either type of principal. A client SHOULD change
the SSV each time a new principal uses the session.
Here are the types of attacks that can be attempted by an attacker
named Eve, and how the connection to session binding approach
addresses each attack:
o If Eve creates a connection after the legitimate client
establishes an SSV via privacy protection from a machine
principal's RPCSEC_GSS session, she does not know the SSV and so
cannot compute a digest that BIND_CONN_TO_SESSION will accept.
Users on the legitimate client cannot be disrupted by Eve.
o If Eve first logs into the legitimate client, and the client does
not use machine principals, then Eve can cause an SSV to be
created via the legitimate client's NFSv4.1 implementation,
protected by the RPCSEC_GSS context created by the legitimate
client (which uses Eve's GSS principal and credentials). Eve can
eavesdrop on the network, and because she knows her credentials,
she can decrypt the SSV. Eve can compute a digest
BIND_CONN_TO_SESSION will accept, and so bind a new connection to
the session. Eve can change the slotid, sequence state, and/or
the SSV state in such a way that when Bob accesses the server via
the legitimate client, the legitimate client will be unable to use
the session. The client's only recourse is to create a new
session, which will cause any state Eve created on the legitimate
client over the old (but hijacked) session to be lost. This
disrupts Eve, but because she is the attacker, this is acceptable.
Once the legitimate client establishes an SSV over the new session
using Bob's RPCSEC_GSS context, Eve can use the new session via
the legitimate client, but she cannot disrupt Bob. Moreover,
because the client SHOULD have modified the SSV due to Eve using
the new session, Bob cannot get revenge on Eve by binding a rogue
connection to the session. The question is: how does the
legitimate client detect that Eve has hijacked the old session?
When the client detects that a new principal, Bob, wants to use
the session, it SHOULD issue a SET_SSV, and the outcome of that
SET_SSV reveals the hijacking:
* Let us suppose that from the rogue connection, Eve issued a
SET_SSV with the same slotid and sequence that the legitimate
client later uses. The server will assume this is a replay,
and return to the legitimate client the reply it sent Eve.
However, unless Eve can correctly guess the SSV the legitimate
client will use, the digest verification checks in the SET_SSV
response will fail. That is the clue to the client that the
session has been hijacked.
* Alternatively, Eve issued a SET_SSV with a different slotid
than the legitimate client uses for its SET_SSV. Then the
digest verification on the server fails, and the client is
again clued that the session has been hijacked.
* Alternatively, Eve issued an operation other than SET_SSV, but
with the same slotid and sequence that the legitimate client
uses for its SET_SSV. The server returns to the legitimate
client the response it sent Eve. The client sees that the
response is not at all what it expects. The client assumes
either session hijacking or server bug, and either way destroys
the old session.
o Eve binds a rogue connection to the session as above, and then
destroys the session. Again, Bob goes to use the server from the
legitimate client. The client has a very clear indication that
its session was hijacked, and does not even have to destroy the
old session before creating a new session, which Eve will be
unable to hijack because it will be protected with an SSV created
via Bob's RPCSEC_GSS protection.
o If Eve creates a connection before the legitimate client
establishes an SSV, because the initial value of the SSV is zero
and therefore known, Eve can issue a SET_SSV that will pass the
digest verification check. However because the new connection has
not been bound to the session, the SET_SSV is rejected for that
reason.
o The connection to session binding model does not prevent
connection hijacking. However, if an attacker can perform
connection hijacking, it can issue denial of service attacks that
are less difficult than attacks based on forging sessions.
6.11. Session Mechanics - Steady State
6.11.1. Obligations of the Server
[[Comment.10: XXX - TBD]]
6.11.2. Obligations of the Client
The client has the following obligations in order to utilize the
session:
o Keep a necessary session from going idle on the server. A client
that requires a session, but nonetheless is not sending operations,
risks having the session destroyed by the server. This is
because sessions consume resources, and resource limitations may
force the server to cull the least recently used session.
o Destroy the session when idle. When a session has no state other
than the session, and no outstanding requests, the client should
consider destroying the session.
o Maintain GSS contexts for callback. If the client requires the
server to use the RPCSEC_GSS security flavor for callbacks,
then it needs to be sure the contexts handed to the server via
BACKCHANNEL_CTL are unexpired. A good practice is to keep at
least two contexts outstanding, where the expiration time of the
newest context, at the time it was created, is N times that of the
oldest context, where N is the number of contexts available for
callbacks.
o Maintain an active connection. The server requires a callback
path in order to gracefully recall recallable state, or notify the
client of certain events.
6.11.3. Steps the Client Takes To Establish a Session
The client issues CREATE_CLIENTID to establish a clientid.
The client uses the clientid to issue a CREATE_SESSION on a
connection to the server. The results of CREATE_SESSION indicate
whether the server will persist the session replay cache through a
server reboot or not, and the client notes this for future reference.
The client SHOULD issue SET_SSV in the first COMPOUND after the
session
is created. If it is not using machine credentials, then each time a
new principal goes to use the session, it SHOULD issue a SET_SSV
again.
If the client wants to use delegations, layouts, directory
notifications, or any other state that requires a back channel,
then it must add a connection to the backchannel if CREATE_SESSION
did
not already do so. The client creates a connection, and calls
BIND_CONN_TO_SESSION to bind the connection to the session and the
session's backchannel. If CREATE_SESSION did not already do so, the
client MUST tell the server what security is required in order for
the client to accept callbacks. The client does this via
BACKCHANNEL_CTL.
If the client wants to use additional connections for the operations
and back channels, then it MUST call BIND_CONN_TO_SESSION on each
connection it wants to use with the session.
At this point the client has reached a steady state as far as session
use is concerned.
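The steps above can be summarized by the following hypothetical
client-side sequence; each nfs_* stub stands in for issuing the
like-named operation and is not a protocol-defined interface.

   struct client;                        /* opaque client state */
   int nfs_create_clientid(struct client *);     /* illustrative */
   int nfs_create_session(struct client *);
   int nfs_set_ssv(struct client *);
   int nfs_bind_conn_to_session(struct client *);
   int nfs_backchannel_ctl(struct client *);

   static int
   establish_session(struct client *clp, int want_backchannel)
   {
       int status;

       if ((status = nfs_create_clientid(clp)) != 0)
           return status;
       if ((status = nfs_create_session(clp)) != 0)
           return status;
       /* SHOULD: set the SSV in the first COMPOUND after creation. */
       if ((status = nfs_set_ssv(clp)) != 0)
           return status;
       if (want_backchannel) {
           /* Bind a connection to the session's back channel and
              declare acceptable callback security. */
           if ((status = nfs_bind_conn_to_session(clp)) != 0)
               return status;
           status = nfs_backchannel_ctl(clp);
       }
       return status;
   }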
6.12. Session Mechanics - Recovery
This section discusses session-related events that require
recovery.
6.12.1. Events Requiring Client Action
The following events require client action to recover.
6.12.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context
via BIND_CONN_TO_SESSION. The sr_status field of SEQUENCE results
indicates when callback contexts are nearly expired, or fully expired
(see Section 21.46.4).
6.12.1.2. Connection Disconnect
If the client loses the last connection of the session, then it MUST
create a new connection, and bind it to the session via
BIND_CONN_TO_SESSION.
6.12.1.3. Loss of Session
The server may lose a record of the session. Causes include:
o Server crash and reboot
o A catastrophe that causes the cache to be corrupted or lost on the
media it was stored on. This applies even if the server indicated
in the CREATE_SESSION results that it would persist the cache.
o The server purges the session of a client that has been inactive
for a very extended period of time. [[Comment.11: XXX - Should we
add a value to the CREATE_SESSION results that tells a client how
long he can let a session stay idle before losing it?]].
Loss of replay cache is equivalent to loss of session. The server
indicates loss of session to the client by returning
NFS4ERR_BADSESSION on the next operation that uses the sessionid
associated with the lost session.
After an event like a server reboot, the client may have lost its
connections. The client assumes for the moment that the session has
not been lost. It reconnects, and invokes BIND_CONN_TO_SESSION using
the sessionid. If BIND_CONN_TO_SESSION returns NFS4ERR_BADSESSION,
the client knows the session was lost. If the connection survives
session loss, then the next SEQUENCE operation the client issues over
the connection will get back NFS4ERR_BADSESSION. The client again
knows the session was lost.
When the client detects session loss, it must call CREATE_SESSION to
recover. Any non-idempotent operations that were in progress may
have been performed on the server at the time of session loss. The
client has no general way to recover from this.
Note that loss of session does not imply loss of lock, open,
delegation, or layout state. Nor does loss of lock, open,
delegation, or layout state imply loss of session state.[[Comment.12:
Add reference to lock recovery section]]. A session can survive a
server reboot, but lock recovery may still be needed. The converse
is also true.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server reboots and does not preserve clientid
state). If so, the client needs to call CREATE_CLIENTID, followed by
CREATE_SESSION.
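Recovery might be structured as in the following sketch; the stubs
are illustrative, and NFS4ERR_STALE_CLIENTID carries its NFSv4.0
value.

   struct client;                        /* opaque client state */
   int nfs_create_clientid(struct client *);     /* illustrative */
   int nfs_create_session(struct client *);

   #define NFS4ERR_STALE_CLIENTID 10022  /* from NFSv4.0 */

   /* Recreate a lost session; fall back to re-establishing the
      clientid if the server no longer recognizes it. */
   static int
   recover_session(struct client *clp)
   {
       int status = nfs_create_session(clp);
       if (status == NFS4ERR_STALE_CLIENTID) {
           if ((status = nfs_create_clientid(clp)) == 0)
               status = nfs_create_session(clp);
       }
       return status;
   }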
6.12.2. Events Requiring Server Action
The following events require server action to recover.
6.12.2.1. Client Crash and Reboot
As described in Section 21.35, a rebooted client causes the server to
delete any sessions it had.
6.12.2.2. Client Crash with No Reboot
If a client crashes and never comes back, it will never issue
CREATE_CLIENTID with its old clientid. Thus the server has session
state that will never be used again. After an extended period of
time and if the server has resource constraints, it MAY destroy the
old session.
6.12.2.2.1. Extended Network Partition
To the server, the extended network partition may be no different
than a client crash with no reboot (see Section 6.12.2.2 Client Crash
with No Reboot). Unless the server can discern that there is a
network partition, it is free to treat the situation as if the client
has crashed for good.
7. Minor Versioning
To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning.
The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be
documented in a standards track RFC. Therefore, each minor version
number will correspond to an RFC. Minor version zero of the NFS
version 4 protocol is represented by RFC3530 [2]. The COMPOUND
procedure will support the encoding of the minor version being
requested by the client.
The following items represent the basic rules for the development of
minor versions. Note that a future minor version may decide to
modify or add to the following rules as part of the minor version
definition.
1. Procedures are not added or deleted
To maintain the general RPC model, NFS version 4 minor versions
will not add to or delete procedures from the NFS program.
2. Minor versions may add operations to the COMPOUND and
CB_COMPOUND procedures.
The addition of operations to the COMPOUND and CB_COMPOUND
procedures does not affect the RPC model.
* Minor versions may append attributes to GETATTR4args,
bitmap4, and GETATTR4res.
This allows for the expansion of the attribute model to allow
for future growth or adaptation.
* Minor version X must append any new attributes after the last
documented attribute.
Since attribute results are specified as an opaque array of
per-attribute XDR encoded results, the complexity of adding
new attributes in the midst of the current definitions will
be too burdensome.
3. Minor versions must not modify the structure of an existing
operation's arguments or results.
Again the complexity of handling multiple structure definitions
for a single operation is too burdensome. New operations should
be added instead of modifying existing structures for a minor
version.
This rule does not preclude the following adaptations in a minor
version.
* adding bits to flag fields such as new attributes to
GETATTR's bitmap4 data type
* adding bits to existing attributes like ACLs that have flag
words
* extending enumerated types (including NFS4ERR_*) with new
values
4. Minor versions may not modify the structure of existing
attributes.
5. Minor versions may not delete operations.
This prevents the potential reuse of a particular operation
"slot" in a future minor version.
6. Minor versions may not delete attributes.
7. Minor versions may not delete flag bits or enumeration values.
8. Minor versions may declare an operation as mandatory to NOT
implement.
Specifying an operation as "mandatory to not implement" is
equivalent to obsoleting an operation. For the client, it means
that the operation should not be sent to the server. For the
server, an NFS error can be returned as opposed to "dropping"
the request as an XDR decode error. This approach allows for
the obsolescence of an operation while maintaining its structure
so that a future minor version can reintroduce the operation.
1. Minor versions may declare attributes mandatory to NOT
implement.
2. Minor versions may declare flag bits or enumeration values
as mandatory to NOT implement.
9. Minor versions may downgrade features from mandatory to
recommended, or recommended to optional.
10. Minor versions may upgrade features from optional to recommended
or recommended to mandatory.
11. A client and server that support minor version X must support
minor versions 0 (zero) through X-1 as well.
12. No new features may be introduced as mandatory in a minor
version.
This rule allows for the introduction of new functionality and
forces the use of implementation experience before designating a
feature as mandatory.
13. A client MUST NOT attempt to use a stateid, filehandle, or
similar returned object from the COMPOUND procedure with minor
version X for another COMPOUND procedure with minor version Y,
where X != Y.
8. Protocol Data Types
The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[4] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol.
8.1. Basic Data Types
These are the base NFSv4 data types.
+---------------+---------------------------------------------------+
| Data Type     | Definition                                        |
+---------------+---------------------------------------------------+
| int32_t       | typedef int int32_t;                              |
| uint32_t      | typedef unsigned int uint32_t;                    |
| int64_t       | typedef hyper int64_t;                            |
| uint64_t      | typedef unsigned hyper uint64_t;                  |
| ...           | ...                                               |
| verifier4     | typedef opaque verifier4[NFS4_VERIFIER_SIZE];     |
|               | Verifier used for various operations (COMMIT,     |
|               | CREATE, OPEN, READDIR, SETCLIENTID,               |
|               | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is |
|               | defined as 8.                                     |
+---------------+---------------------------------------------------+
End of Base Data Types
Table 1
8.2. Structured Data Types
8.2.1. nfstime4
struct nfstime4 {
    int64_t  seconds;
    uint32_t nseconds;
};
The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC). Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970. Values less than zero for the
seconds field denote dates before the 0 hour January 1, 1970. In
both cases, the nseconds field is to be added to the seconds field
for the final time representation. For example, if the time to be
represented is one-half second before 0 hour January 1, 1970, the
seconds field would have a value of negative one (-1) and the
nseconds field would have a value of one-half second (500000000).
Values greater than 999,999,999 for nseconds are considered invalid.
This data type is used to pass time and date information. A server
converts to and from its local representation of time when processing
time values, preserving as much accuracy as possible. If the
precision of timestamps stored for a file system object is less than
defined, loss of precision can occur. An adjunct time maintenance
protocol is recommended to reduce client and server time skew.
8.2.2. time_how4
enum time_how4 {
    SET_TO_SERVER_TIME4 = 0,
    SET_TO_CLIENT_TIME4 = 1
};
8.2.3. settime4
union settime4 switch (time_how4 set_it) {
case SET_TO_CLIENT_TIME4:
    nfstime4 time;
default:
    void;
};
The above definitions are used as the attribute definitions to set
time values. If set_it is SET_TO_SERVER_TIME4, then the server uses
its local representation of time for the time value.
8.2.4. specdata4
struct specdata4 {
    uint32_t specdata1; /* major device number */
    uint32_t specdata2; /* minor device number */
};
This data type represents additional information for the device file
types NF4CHR and NF4BLK.
8.2.5. fsid4
struct fsid4 {
    uint64_t major;
    uint64_t minor;
};
8.2.6. fs_location4
struct fs_location4 {
    utf8str_cis server<>;
    pathname4   rootpath;
};
8.2.7. fs_locations4
struct fs_locations4 {
    pathname4    fs_root;
    fs_location4 locations<>;
};
The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and
replication support.
8.2.8. fattr4
struct fattr4 {
    bitmap4   attrmask;
    attrlist4 attr_vals;
};
The fattr4 structure is used to represent file and directory
attributes.
The bitmap is a counted array of 32 bit integers used to contain bit
values. The position of the integer in the array that contains bit n
can be computed from the expression (n / 32) and its bit within that
integer is (n mod 32).
             0            1
+-----------+-----------+-----------+--
|  count    | 31  ..  0 | 63  .. 32 |
+-----------+-----------+-----------+--
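A sketch of the corresponding bit manipulation (the helper names are
illustrative):

   #include <stdint.h>

   /* Bit n of the bitmap lives in word (n / 32) at position
      (n mod 32), per the layout above. */
   static void
   bitmap_set(uint32_t *words, unsigned n)
   {
       words[n / 32] |= (uint32_t)1 << (n % 32);
   }

   static int
   bitmap_isset(const uint32_t *words, unsigned n)
   {
       return (words[n / 32] >> (n % 32)) & 1;
   }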
8.2.9. change_info4
struct change_info4 {
    bool      atomic;
    changeid4 before;
    changeid4 after;
};
This structure is used with the CREATE, LINK, REMOVE, RENAME
operations to let the client know the value of the change attribute
for the directory in which the target file system object resides.
8.2.10. netaddr4
struct netaddr4 {
    /* see struct rpcb in RFC1833 */
    string r_netid<>; /* network id */
    string r_addr<>;  /* universal address */
};
The netaddr4 structure is used to identify TCP/IP based endpoints.
The r_netid and r_addr fields are specified in RFC1833 [20], but they
are underspecified in RFC1833 [20] as far as what they should look
like for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:
h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
the first through fourth octets each converted to ASCII-decimal.
Assuming big-endian ordering, p1 and p2 are, respectively, the first
and second octets each converted to ASCII-decimal. For example, if a
host, in big-endian order, has an address of 0x0A010307 and there is
a service listening on, in big-endian order, port 0x020F (decimal
527), then the complete universal address is "10.1.3.7.2.15".
For TCP over IPv4 the value of r_netid is the string "tcp". For UDP
over IPv4 the value of r_netid is the string "udp".
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix, as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884 representing an IPv6 address as defined in Section 2.2 of RFC1884
[9]. Additionally, the two alternative forms specified in Section [9]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [9] are also acceptable. 2.2 of RFC1884 [9] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6". over IPv6 the value of r_netid is the string "udp6".
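As an illustration, an IPv4 universal address could be produced from
a host address and port (both in host byte order) as follows; the
function is hypothetical.

   #include <stdint.h>
   #include <stdio.h>

   /* Format "h1.h2.h3.h4.p1.p2": the four address octets followed
      by the two port octets, each in ASCII-decimal. */
   static void
   format_uaddr4(char *buf, size_t len, uint32_t addr, uint16_t port)
   {
       snprintf(buf, len, "%u.%u.%u.%u.%u.%u",
                (addr >> 24) & 0xff, (addr >> 16) & 0xff,
                (addr >> 8) & 0xff, addr & 0xff,
                (port >> 8) & 0xff, port & 0xff);
   }

For example, address 0x0A010307 with port 527 (0x020F) yields
"10.1.3.7.2.15", matching the example above.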
8.2.11. clientaddr4
typedef netaddr4 clientaddr4;
The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a
clientid or as part of the callback registration.
8.2.12. cb_client4
struct cb_client4 {
    unsigned int cb_program;
    netaddr4     cb_location;
};
This structure is used by the client to inform the server of its call
back address; it includes the program number and client address.
8.2.13. nfs_client_id4
struct nfs_client_id4 {
    verifier4 verifier;
    opaque    id<NFS4_OPAQUE_LIMIT>;
};
This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024.
8.2.14. open_owner4
struct open_owner4 {
    clientid4 clientid;
    opaque    owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024.
8.2.15. lock_owner4
struct lock_owner4 {
    clientid4 clientid;
    opaque    owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024.
8.2.16. open_to_lock_owner4
struct open_to_lock_owner4 {
    seqid4      open_seqid;
    stateid4    open_stateid;
    seqid4      lock_seqid;
    lock_owner4 lock_owner;
};
This structure is used for the first LOCK operation done for an
open_owner4. It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence. Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid.
8.2.17.  stateid4

   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

This structure is used for the various state sharing mechanisms
between the client and server.  For the client, this data structure
is read-only.  The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at
each transition of the stateid.  This is important since the client
will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server.
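The following non-normative C sketch illustrates one way a client
might compare two seqid values from stateids for the same open; the
helper name is hypothetical, and the serial-number-style comparison
is an assumption made so that eventual wraparound of the 32-bit
counter is tolerated:

   #include <stdint.h>

   /*
    * Hypothetical client-side helper: returns non-zero when stateid
    * seqid "b" reflects a later server transition than "a".  The
    * signed difference tolerates wraparound of the 32-bit counter
    * (an assumption; this document does not specify wrap behavior).
    */
   static int
   seqid_is_newer(uint32_t a, uint32_t b)
   {
           return (int32_t)(b - a) > 0;
   }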
8.2.18.  layouttype4

   enum layouttype4 {
           LAYOUT_NFSV4_FILES   = 1,
           LAYOUT_OSD2_OBJECTS  = 2,
           LAYOUT_BLOCK_VOLUME  = 3
   };

A layout type specifies the layout being used.  The implication is
that clients have "layout drivers" that support one or more layout
types.  The file server advertises the layout types it supports
through the LAYOUT_TYPES file system attribute.  A client asks for
layouts of a particular type in LAYOUTGET, and passes those layouts
to its layout driver.
The layouttype4 structure is 32 bits in length.  The range
represented by the layout type is split into two parts.  Types within
the range 0x00000000-0x7FFFFFFF are globally unique and are assigned
according to the description in Section 25.1; they are maintained by
IANA.  Types within the range 0x80000000-0xFFFFFFFF are site specific
and for "private use" only.
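As a small illustration (an assumption about usage, not normative
text), an implementation could classify a layout type value by
testing its high-order bit:

   #include <stdint.h>

   /* Returns non-zero for values in the site-specific range. */
   static int
   layouttype_is_private_use(uint32_t lt)
   {
           return (lt & 0x80000000u) != 0;
   }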
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
layout type is to be used.  The LAYOUT_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [22], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [23], is to be used.
8.2.19.  deviceid4

   typedef uint32_t deviceid4;    /* 32-bit device ID */

Layout information includes device IDs that specify a storage device
through a compact handle.  Addressing and type information is
obtained with the GETDEVICEINFO operation.  A client must not assume
that device IDs are valid across metadata server reboots.  The device
ID is qualified by the layout type and is unique per file system
(FSID).  This allows different layout drivers to generate device IDs
without the need for co-ordination.  See Section 17.3.1.4 for more
details.
8.2.20.  devlist_item4

   struct devlist_item4 {
           deviceid4       dli_id;
           opaque          dli_device_addr<>;
   };

An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
The device address is used to set up a communication channel with the
storage device.  Different layout types will require different types
of structures to define how they communicate with storage devices.
The opaque dli_device_addr field must be interpreted based on the
specified layout type.
This document defines the device address for the NFSv4 file layout
(struct netaddr4 (Section 8.2.10)), which identifies a storage device
by network IP address and port number.  This is sufficient for the
clients to communicate with the NFSv4 storage devices, and may be
sufficient for other layout types as well.  Device types for object
storage devices and block storage devices (e.g., SCSI volume labels)
will be defined by their respective layout specifications.
8.2.21.  layout4

   struct layout4 {
           offset4         lo_offset;
           length4         lo_length;
           layoutiomode4   lo_iomode;
           layouttype4     lo_type;
           opaque          lo_layout<>;
   };

The layout4 structure defines a layout for a file.  The layout type
specific data is opaque within this structure and must be
interpreted based on the layout type.  Currently, only the NFSv4
file layout type is defined; see Section 17.4.1 for its definition.
Since layouts are sub-dividable, the offset and length, together with
the file's filehandle, the clientid, iomode, and layout type,
identify the layout.
8.2.22.  layoutupdate4

   struct layoutupdate4 {
           layouttype4     lou_type;
           opaque          lou_data<>;
   };

The layoutupdate4 structure is used by the client to return 'updated'
layout information to the metadata server at LAYOUTCOMMIT time.  This
structure provides a channel to pass layout type specific information
back to the metadata server.  For example, for block/volume layout
types this could include the list of reserved blocks that were
written.  The contents of the opaque lou_data argument are determined
by the layout type and are defined in their context.  The NFSv4
file-based layout does not use this structure, thus the lou_data
field should have a zero length.
8.2.23.  layouthint4

   struct layouthint4 {
           layouttype4     loh_type;
           opaque          loh_data<>;
   };

The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute
described below.  The metadata server may ignore the hint, or may
selectively ignore fields within the hint.  This hint should be
provided at create time as part of the initial attributes within
OPEN.  The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 17.4.1.
8.2.24.  layoutiomode4

   enum layoutiomode4 {
           LAYOUTIOMODE_READ    = 1,
           LAYOUTIOMODE_RW      = 2,
           LAYOUTIOMODE_ANY     = 3
   };

The iomode specifies whether the client intends to read or write
(with the possibility of reading) the data represented by the layout.
The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be
used for LAYOUTRETURN and LAYOUTRECALL.  The ANY iomode specifies
that layouts pertaining to both READ and RW iomodes are being
returned or recalled, respectively.  The metadata server's use of the
iomode may depend on the layout type being used.  The storage devices
may validate I/O accesses against the iomode and reject invalid
accesses.
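The check a storage device might apply is sketched below; the
function and its callers are hypothetical, but the logic follows the
definitions above: a write requires the RW iomode, while a read is
permitted under either READ or RW:

   enum layoutiomode4 {
           LAYOUTIOMODE_READ    = 1,
           LAYOUTIOMODE_RW      = 2,
           LAYOUTIOMODE_ANY     = 3
   };

   /* Hypothetical storage-device access check against an iomode. */
   static int
   iomode_permits(enum layoutiomode4 iomode, int is_write)
   {
           if (is_write)
                   return iomode == LAYOUTIOMODE_RW;
           return iomode == LAYOUTIOMODE_READ ||
                  iomode == LAYOUTIOMODE_RW;
   }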
8.2.25.  nfs_impl_id4

   struct nfs_impl_id4 {
           utf8str_cis     nii_domain;
           utf8str_cs      nii_name;
           nfstime4        nii_date;
   };

This structure is used to identify client and server implementation
details.  The nii_domain field is the DNS domain name that the
implementer is associated with.  The nii_name field is the product
name of the implementation and is completely free form.  It is
encouraged that the nii_name be used to distinguish machine
architecture, machine platforms, revisions, versions, and patch
levels.  The nii_date field is the timestamp of when the software
instance was published or built.
8.2.26.  impl_ident4

   struct impl_ident4 {
           clientid4               ii_clientid;
           struct nfs_impl_id4     ii_impl_id;
   };

This is used for exchanging implementation identification between
client and server.
8.2.27.  threshold_item4

   struct threshold_item4 {
           layouttype4     thi_layout_type;
           bitmap4         thi_hintset;
           opaque          thi_hintlist<>;
   };

This structure contains a list of hints specific to a layout type for
helping the client determine when it should issue I/O directly
through the metadata server vs. the data servers.  The hint structure
...
| threshold4_read_iosize  | 2 | length4 | For read I/O sizes below  |
|                         |   |         | this threshold it is      |
|                         |   |         | recommended to read data  |
|                         |   |         | through the MDS           |
| threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
|                         |   |         | this threshold it is      |
|                         |   |         | recommended to write data |
|                         |   |         | through the MDS           |
+-------------------------+---+---------+---------------------------+
8.2.28.  mdsthreshold4

   struct mdsthreshold4 {
           threshold_item4 mth_hints<>;
   };

This structure holds an array of threshold_item4 structures each of
which is valid for a particular layout type.  An array is necessary
since a server can support multiple layout types for a single file.
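To make the intended use concrete, the sketch below shows how a
client might consult a decoded read threshold when routing a read.
The structure and field names are illustrative only; the wire format
is the opaque thi_hintlist, interpreted per layout type:

   #include <stdint.h>

   /* Illustrative decoded hints; names are not part of the protocol. */
   struct decoded_thresholds {
           int      has_read_iosize;
           uint64_t read_iosize;      /* threshold4_read_iosize */
   };

   /*
    * Route a read through the metadata server when it is smaller
    * than the advertised threshold; otherwise use the data servers.
    */
   static int
   read_via_mds(const struct decoded_thresholds *t, uint64_t io_size)
   {
           return t->has_read_iosize && io_size < t->read_iosize;
   }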
3.  RPC and Security Flavor
The NFS version 4.1 protocol is a Remote Procedure Call (RPC)
application that uses RPC version 2 and the corresponding eXternal
Data Representation (XDR) as defined in RFC1831 [4] and RFC4506 [3].
The RPCSEC_GSS security flavor as defined in RFC2203 [5] MUST be used
as the mechanism to deliver stronger security for the NFS version 4
protocol.
3.1. Ports and Transports
Historically, NFS version 2 and version 3 servers have resided on
port 2049.  Port 2049, the registered port for the NFS protocol
(RFC3232 [22]), should be the default configuration.  NFSv4 clients
SHOULD NOT use the RPC binding protocols as described in RFC1833
[19].
Where an NFS version 4 implementation supports operation over the IP
network protocol, the supported transports between NFS and IP MUST
have the following two attributes:
1. The transport must support reliable delivery of data in the order
it was sent.
2. The transport must be among the IETF-approved congestion control
transport protocols.
At the time this document was written, the only two transports that
had the above attributes were TCP and SCTP. To enhance the
possibilities for interoperability, an NFS version 4 implementation
MUST support operation over the TCP transport protocol.
If TCP is used as the transport, the client and server SHOULD use
persistent connections for at least two reasons:
1. This will prevent the weakening of TCP's congestion control via
short lived connections and will improve performance for the WAN
environment by eliminating the need for SYN handshakes.
2. The NFSv4.1 callback model has changed from NFSv4.0, and requires
the client and server to maintain a client-created channel for
the server to use.
As noted in the Security Considerations section, the authentication
model for NFS version 4 has moved from machine-based to principal-
based. However, this modification of the authentication model does
not imply a technical requirement to move the transport connection
management model from whole machine-based to one based on a per user
model. In particular, NFS over TCP client implementations have
traditionally multiplexed traffic for multiple users over a common
TCP connection between an NFS client and server. This has been true,
regardless of whether the NFS client is using AUTH_SYS, AUTH_DH,
RPCSEC_GSS or any other flavor. Similarly, NFS over TCP server
implementations have assumed such a model and thus scale the
implementation of TCP connection management in proportion to the
number of expected client machines. NFS version 4.1 will not modify
this connection management model. NFS version 4.1 clients that
violate this assumption can expect scaling issues on the server and
hence reduced service.
Note that the client and server should avoid inadvertent
synchronization of the various timers they maintain.  For further
discussion of the general issue refer to [Floyd].
3.1.1. Client Retransmission Behavior
When processing a request received over a reliable transport such as
TCP, the NFS version 4.1 server MUST NOT silently drop the request,
except if the transport connection has been broken. Given such a
contract between NFS version 4.1 clients and servers, clients MUST
NOT retry a request unless one or both of the following are true:
o The transport connection has been broken
o The procedure being retried is the NULL procedure
Since reliable transports, such as TCP, do not always synchronously
inform a peer when the other peer has broken the connection (for
example, when an NFS server reboots), the NFS version 4.1 client may
want to actively "probe" the connection to see if it has been broken.
Use of the NULL procedure is one recommended way to do so. So, when
a client experiences a remote procedure call timeout (of some
arbitrary, implementation-specific duration), rather than retrying the
remote procedure call, it could instead issue a NULL procedure call
to the server. If the server has died, the transport connection
break will eventually be indicated to the NFS version 4.1 client.
The client can then reconnect and retry the original request.
If the NULL procedure call gets a response, the connection has not
broken. The client can decide to wait longer for the original
request's response, or it can break the transport connection and
reconnect before re-sending the original request.
For callbacks from the server to the client, the same rules apply,
but the server doing the callback becomes the client, and the client
receiving the callback becomes the server.
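A minimal sketch of the probing strategy, assuming a Sun/ONC RPC
client handle already connected to the peer over TCP (the probe
timeout is arbitrary):

   #include <rpc/rpc.h>

   /*
    * Ping the peer with the NULL procedure instead of retransmitting
    * a timed-out request.  A failure here suggests the connection is
    * broken; the caller can then reconnect and retry the original
    * request.
    */
   static int
   connection_alive(CLIENT *clnt)
   {
           struct timeval tv = { 25, 0 };  /* arbitrary probe timeout */

           return clnt_call(clnt, NULLPROC,
                            (xdrproc_t)xdr_void, NULL,
                            (xdrproc_t)xdr_void, NULL, tv) == RPC_SUCCESS;
   }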
3.2. Security Flavors
Traditional RPC implementations have included AUTH_NONE, AUTH_SYS,
AUTH_DH, and AUTH_KRB4 as security flavors. With RFC2203 [5] an
additional security flavor of RPCSEC_GSS has been introduced which
uses the functionality of GSS-API RFC2743 [8]. This allows for the
use of various security mechanisms by the RPC layer without the
additional implementation overhead of adding RPC security flavors.
For NFS version 4, the RPCSEC_GSS security flavor MUST be implemented
to enable the mandatory security mechanism.  Other flavors, such as
AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well.
3.2.1. Security mechanisms for NFS version 4
The use of RPCSEC_GSS requires selection of mechanism, quality of
protection, and service (authentication, integrity, privacy).  The
remainder of this document will refer to these three parameters of
RPCSEC_GSS security as the security triple.
3.2.1.1. Kerberos V5
The Kerberos V5 GSS-API mechanism as described in RFC1964 [6] MUST be
implemented.
column descriptions:
1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == RPCSEC_GSS service
1 2 3 4
--------------------------------------------------------------------
390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none
390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity
390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy
Note that the pseudo flavor is presented here as a mapping aid to the
implementor. Because this NFS protocol includes a method to
negotiate security and it understands the GSS-API mechanism, the
pseudo flavor is not needed. The pseudo flavor is needed for NFS
version 3 since the security negotiation is done via the MOUNT
protocol.
For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please
see RFC2623 [23].
3.2.1.2. LIPKEY as a security triple
The LIPKEY GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The
definition of the columns matches that given in the subsection
"Kerberos V5" above.
1 2 3 4
--------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy
3.2.1.3. SPKM-3 as a security triple
The SPKM-3 GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The
definition of the columns matches that given in the subsection
"Kerberos V5" above.
1 2 3 4
--------------------------------------------------------------------
390009 spkm3 1.3.6.1.5.5.1.3 rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy
3.3. Security Negotiation
With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its file system name space
that are available for use by NFS clients. In turn the NFS server
may be configured such that each of these entry points may have
different or multiple security mechanisms in use.
The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See the section "Security Considerations" for further discussion.
3.3.1. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security triple is to be
used for server access. In general, the client will not have to use
either operation except during initial communication with the server
or when the client crosses policy boundaries at the server.  It is
possible that the server's policies will change during the client's
interaction, thereby forcing the client to negotiate a new security
triple.
3.3.2. Security Error
Based on the assumption that each NFS version 4 client and server
must support a minimum set of security mechanisms (i.e., LIPKEY,
SPKM-3, and Kerberos-V5, all under RPCSEC_GSS), the NFS client will
start its communication with the server with one of the minimal
security triples.  During communication with the server, the client
may receive an NFS error of NFS4ERR_WRONGSEC.  This error allows the
server to notify the client that the security triple currently being
used is not appropriate for access to the server's file system
resources.  The client is then responsible for determining what
security triples are available at the server and choosing one which
is appropriate for the client.  See the section on the "SECINFO"
operation for further discussion of how the client will respond to
the NFS4ERR_WRONGSEC error and use SECINFO.
3.3.3. Callback RPC Authentication
Callback authentication has changed in NFSv4.1 from NFSv4.0.
NFSv4.0 required the NFS server to create a security context for
RPCSEC_GSS, AUTH_DH, AUTH_KERB4, and any other security flavor that
had a security context.  It also required that the principal issuing
the callback be the same as the principal that accepted the callback
parameters (via SETCLIENTID), and that the client principal accepting
the callback be the same as that which issued the SETCLIENTID.  This
required the NFS client to have an assigned machine credential.

NFSv4.1 does not require a machine credential.  Instead, NFSv4.1
allows an RPCSEC_GSS security context initiated by the client and
established on both the client and server to be used on callback
RPCs sent by the server to the client.  The BIND_BACKCHANNEL
operation is used to establish RPCSEC_GSS contexts (if the client so
desires) on the server.  No support for AUTH_DH or AUTH_KERB4 is
specified.
3.3.4. GSS Server Principal
Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type.  GSS_C_NT_HOSTBASED_SERVICE
names are of the form:
service@hostname
For NFS, the "service" element is
nfs
Implementations of security mechanisms will convert nfs@hostname to
various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the
following form is RECOMMENDED:
nfs/hostname
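As a non-normative example of how a client might construct and import
this name with the GSS-API C bindings (error handling is abbreviated;
a real implementation would report failures via gss_display_status()):

   #include <stdio.h>
   #include <string.h>
   #include <gssapi/gssapi.h>

   /* Import "nfs@hostname" as a GSS_C_NT_HOSTBASED_SERVICE name. */
   static gss_name_t
   import_nfs_service_name(const char *hostname)
   {
           OM_uint32 maj, min;
           char buf[512];
           gss_buffer_desc name_buf;
           gss_name_t name = GSS_C_NO_NAME;

           snprintf(buf, sizeof(buf), "nfs@%s", hostname);
           name_buf.value = buf;
           name_buf.length = strlen(buf);

           maj = gss_import_name(&min, &name_buf,
                                 GSS_C_NT_HOSTBASED_SERVICE, &name);
           return (maj == GSS_S_COMPLETE) ? name : GSS_C_NO_NAME;
   }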
9.  Filehandles
The filehandle in the NFS protocol is a per server unique identifier
for a file system object.  The contents of the filehandle are opaque
to the client.  Therefore, the server is responsible for translating
the filehandle to an internal representation of the file system
object.
9.1.  Obtaining the First Filehandle

The operations of the NFS protocol are defined in terms of one or
more filehandles.  Therefore, the client needs a filehandle to
initiate communication with the server.  With the NFS version 2
protocol RFC1094 [17] and the NFS version 3 protocol RFC1813 [18],
there exists an ancillary protocol to obtain this first filehandle.
The MOUNT protocol, RPC program number 100005, provides the mechanism
of translating a string based file system path name to a filehandle
which can then be used by the NFS protocols.
...
use of the public filehandle in combination with the LOOKUP operation
in the NFS version 2 and 3 protocols, it has been demonstrated that
the MOUNT protocol is unnecessary for viable interaction between NFS
client and server.

Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a
filehandle.  Two special filehandles will be used as starting points
for the NFS client.
9.1.1.  Root Filehandle

The first of the special filehandles is the ROOT filehandle.  The
ROOT filehandle is the "conceptual" root of the file system name
space at the NFS server.  The client uses or starts with the ROOT
filehandle by employing the PUTROOTFH operation.  The PUTROOTFH
operation instructs the server to set the "current" filehandle to the
ROOT of the server's file tree.  Once this PUTROOTFH operation is
used, the client can then traverse the entirety of the server's file
tree with the LOOKUP operation.  A complete discussion of the server
name space is in the section "NFS Server Name Space".
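For example (a non-normative sketch), a client could obtain the
filehandle for the path /a/b with a single COMPOUND request:

     PUTROOTFH
     LOOKUP "a"
     LOOKUP "b"
     GETFH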
9.1.2.  Public Filehandle

The second special filehandle is the PUBLIC filehandle.  Unlike the
ROOT filehandle, the PUBLIC filehandle may be bound to or represent
an arbitrary file system object at the server.  The server is
responsible for this binding.  It may be that the PUBLIC filehandle
and the ROOT filehandle refer to the same file system object.
However, it is up to the administrative software at the server and
the policies of the server administrator to define the binding of the
PUBLIC filehandle and server file system object.  The client may not
make any assumptions about this binding.  The client uses the PUBLIC
filehandle via the PUTPUBFH operation.
9.2.  Filehandle Types

In the NFS version 2 and 3 protocols, there was one type of
filehandle with a single set of semantics.  This type of filehandle
is termed "persistent" in NFS Version 4.  The semantics of a
persistent filehandle remain the same as before.  A new type of
filehandle introduced in NFS Version 4 is the "volatile" filehandle,
which attempts to accommodate certain server environments.

The volatile filehandle type was introduced to address server
functionality or implementation issues which make correct
...
invariant.  Volatile filehandles may ease the implementation of
server functionality such as hierarchical storage management or file
system reorganization or migration.  However, the volatile filehandle
increases the implementation burden for the client.

Since the client will need to handle persistent and volatile
filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned
by the server.
9.2.1.  General Properties of a Filehandle

The filehandle contains all the information the server needs to
distinguish an individual file.  To the client, the filehandle is
opaque.  The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison.  However, the client MUST NOT
otherwise interpret the contents of filehandles.  If two filehandles
from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required.  Clients MUST use
...

"Data Caching and File Identity".

As an example, in the case that two different path names when
traversed at the server terminate at the same file system object, the
server SHOULD return the same filehandle for each path.  This can
occur if a hard link is used to create two file names which refer to
the same underlying file object and associated data.  For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals.
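The only comparison a client may perform is sketched below; the
structure is illustrative (on the wire, the filehandle is a
variable-length opaque array), and the size bound is an assumption:

   #include <string.h>

   struct nfs_fh {
           unsigned int  len;
           unsigned char data[128];   /* assumed upper bound */
   };

   /* Byte-by-byte equality test for same-server filehandles. */
   static int
   fh_equal(const struct nfs_fh *a, const struct nfs_fh *b)
   {
           return a->len == b->len &&
                  memcmp(a->data, b->data, a->len) == 0;
   }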
9.2.2.  Persistent Filehandle

A persistent filehandle is defined as having a fixed value for the
lifetime of the file system object to which it refers.  Once the
server creates the filehandle for a file system object, the server
MUST accept the same filehandle for the object for the lifetime of
the object.  If the server restarts or reboots, the NFS server must
honor the same filehandle value as it did in the server's previous
instantiation.  Similarly, if the file system is migrated, the new
NFS server must honor the same filehandle as the old NFS server.

The persistent filehandle will become stale or invalid when the
file system object is removed.  When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE.  A filehandle may become stale when the
file system containing the object is no longer available.  The file
system may become unavailable if it exists on removable media and the
media is no longer available at the server, or the file system in
whole has been destroyed, or the file system has simply been removed
from the server's name space (i.e., unmounted in a UNIX environment).
9.2.3.  Volatile Filehandle

A volatile filehandle does not share the same longevity
characteristics of a persistent filehandle.  The server may determine
that a volatile filehandle is no longer valid at many different
points in time.  If the server can definitively determine that a
volatile filehandle refers to an object that has been removed, the
server should return NFS4ERR_STALE to the client (as is the case for
persistent filehandles).  In all other cases where the server
determines that a volatile filehandle can no longer be used, it
should return an error of NFS4ERR_FHEXPIRED.
...
This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not
set, or if a non-readonly file system has a transition target in a
different handle class.  In these cases, the server should deny a
RENAME or REMOVE that would affect an OPEN file of any of the
components leading to the OPEN file.  In addition, the server should
deny all RENAME or REMOVE requests during the grace period, in order
to make sure that reclaims of files where filehandles may have
expired do not do a reclaim for the wrong file.
9.3.  One Method of Constructing a Volatile Filehandle

A volatile filehandle, while opaque to the client, could contain:

   [volatile bit = 1 | server boot time | slot | generation number]

o  slot is an index in the server volatile filehandle table

o  generation number is the generation number for the table entry/
   slot
...
has passed.  If the server boot time is less than the current server
boot time, return NFS4ERR_FHEXPIRED.  If slot is out of range, return
NFS4ERR_BADHANDLE.  If the generation number does not match, return
NFS4ERR_FHEXPIRED.

When the server reboots, the table is gone (it is volatile).

If the volatile bit is 0, then it is a persistent filehandle with a
different structure following it.
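A non-normative sketch of the validation just described follows; the
field widths, the table layout, and the numeric error values are
assumptions for illustration:

   #include <stdint.h>

   #define NFS4_OK                0
   #define NFS4ERR_BADHANDLE  10001   /* assumed value */
   #define NFS4ERR_FHEXPIRED  10014   /* assumed value */

   struct volatile_fh {
           uint64_t boot_time;     /* server boot time when issued */
           uint32_t slot;          /* index into the volatile table */
           uint32_t generation;    /* generation of that table slot */
   };

   struct fh_table {
           uint64_t  boot_time;    /* current server boot time */
           uint32_t  nslots;
           uint32_t *generation;   /* per-slot generation numbers */
   };

   static int
   check_volatile_fh(const struct fh_table *t,
                     const struct volatile_fh *fh)
   {
           if (fh->boot_time < t->boot_time)
                   return NFS4ERR_FHEXPIRED;  /* issued before reboot */
           if (fh->slot >= t->nslots)
                   return NFS4ERR_BADHANDLE;  /* slot out of range */
           if (fh->generation != t->generation[fh->slot])
                   return NFS4ERR_FHEXPIRED;  /* slot has been reused */
           return NFS4_OK;
   }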
9.4.  Client Recovery from Filehandle Expiration

If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error.  The client must take on additional
responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle.  If the server returns
persistent filehandles, the client does not need these additional
steps.

For volatile filehandles, most commonly the client will need to store
the component names leading up to and including the file system
...
like:

     RENAME A B
     LOOKUP B
     GETFH

Note that the COMPOUND procedure does not provide atomicity.  This
example only reduces the overhead of recovering from an expired
filehandle.
10.  File Attributes

To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes must be handled
in a flexible manner.  The NFS version 3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able to
support or care about.  The fattr3 structure cannot be extended as
new needs arise and it provides no way to indicate non-support.  With
the NFS version 4 protocol, the client is able to query what
attributes the server supports and construct requests with only those
supported attributes (or a subset thereof).
...
reasonably computable by the client when support is not provided on
the server.

Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing.  The client should not make any assumptions
about the server's implementation of named attributes and whether the
underlying file system at the server has a named attribute directory
or not.  Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined.
10.1.  Mandatory Attributes

These MUST be supported by every NFS version 4 client and server in
order to ensure a minimum level of interoperability.  The server must
store and return these attributes and the client must be able to
function with an attribute set limited to these attributes.  With
just the mandatory attributes some client functionality may be
impaired or limited in some ways.  A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request and
the server must return their value.
10.2.  Recommended Attributes

These attributes are understood well enough to warrant support in the
NFS version 4 protocol.  However, they may not be supported on all
clients and servers.  A client may ask for any of these attributes to
be returned by setting a bit in the GETATTR request but must handle
the case where the server does not return them.  A client may ask for
the set of attributes the server supports and should not request
attributes the server does not support.  A server should be tolerant
of requests for unsupported attributes and simply not return them
rather than considering the request an error.  It is expected that
servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their
operating environments.  A server should provide attributes whenever
it does not have to "tell lies" to the client.  For example, a file
modification time should be either an accurate time or should not be
supported by the server.  This will not always be comfortable to
clients, but the client is better positioned to decide whether and
how to fabricate or construct an attribute or whether to do without
the attribute.
10.3.  Named Attributes

These attributes are not supported by direct encoding in the NFS
Version 4 protocol but are accessed by string names rather than
numbers and correspond to an uninterpreted stream of bytes which are
stored with the file system object.  The name space for these
attributes may be accessed by using the OPENATTR operation.  The
OPENATTR operation returns a filehandle for a virtual "attribute
directory" and further perusal of the name space may be done using
READDIR and LOOKUP operations on this filehandle.  Named attributes
may then be examined or changed by normal READ and WRITE and CREATE
...
attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as
well.

Names of attributes will not be controlled by this document or other
IETF standards track documents.  See the section "IANA
Considerations" for further discussion.
10.4.  Classification of Attributes

Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per file system, or per file
system object.  Note that it is possible that some per file system
attributes may vary within the file system.  See the "homogeneous"
attribute for its definition.  Note that the attributes
time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR.
...
type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode,
numlinks, owner, owner_group, rawdev, space_used, system,
time_access, time_backup, time_create, time_metadata,
time_modify, mounted_on_fileid, layout_type, layout_hint,
layout_blksize, layout_alignment

For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification.
10.5.  Mandatory Attributes - Definitions

+-----------------+----+------------+--------+----------------------+
| name            | #  | Data Type  | Access | Description          |
+-----------------+----+------------+--------+----------------------+
| supp_attr       | 0  | bitmap     | READ   | The bit vector which |
|                 |    |            |        | would retrieve all   |
|                 |    |            |        | mandatory and        |
|                 |    |            |        | recommended          |
|                 |    |            |        | attributes that are  |
|                 |    |            |        | supported for this   |
...
|                 |    |            |        | seconds.             |
| rdattr_error    | 11 | enum       | READ   | Error returned from  |
|                 |    |            |        | getattr during       |
|                 |    |            |        | readdir.             |
| filehandle      | 19 | nfs_fh4    | READ   | The filehandle of    |
|                 |    |            |        | this object          |
|                 |    |            |        | (primarily for       |
|                 |    |            |        | readdir requests).   |
+-----------------+----+------------+--------+----------------------+
10.6.  Recommended Attributes - Definitions

+--------------------+----+---------------+--------+----------------+
| name               | #  | Data Type     | Access | Description    |
+--------------------+----+---------------+--------+----------------+
| ACL                | 12 | nfsace4<>     | R/W    | The access     |
|                    |    |               |        | control list   |
|                    |    |               |        | for the        |
|                    |    |               |        | object.        |
| aclsupport         | 13 | uint32        | READ   | Indicates what |
|                    |    |               |        | types of ACLs  |
|                    |    |               |        | are supported  |
...
|                    |    |               |        | modification   |
|                    |    |               |        | to the object. |
| time_modify_set    | 54 | settime4      | WRITE  | Set the time   |
|                    |    |               |        | of last        |
|                    |    |               |        | modification   |
|                    |    |               |        | to the object. |
|                    |    |               |        | SETATTR use    |
|                    |    |               |        | only.          |
+--------------------+----+---------------+--------+----------------+
10.7.  Time Access

As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating
environment and/or the server's file system semantics.  For example,
for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object.  Of course, setting
the corresponding time_access_set attribute is another way to modify
the time_access attribute.
Whenever the file object resides on a writable file system, the
server should make best efforts to record time_access into stable
storage.  However, to mitigate the performance effects of doing so,
and most especially whenever the server is satisfying the read of the
object's content from its cache, the server MAY cache access time
updates and lazily write them to stable storage.  It is also
acceptable to give administrators of the server the option to disable
time_access updates.
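
As an illustration of the lazy-update strategy described above, here
is a minimal C sketch; vnode_t, the field names, and the periodic
sync pass are assumptions of the sketch, not part of the protocol:

   /* Sketch: cache access-time updates in memory and flush them
    * lazily, rather than forcing a synchronous write per READ. */
   #include <stdbool.h>
   #include <time.h>

   typedef struct {
       time_t cached_atime;   /* last access time, in memory      */
       bool   atime_dirty;    /* not yet on stable storage        */
   } vnode_t;

   void note_read_access(vnode_t *vp)
   {
       vp->cached_atime = time(NULL);   /* update the cache only  */
       vp->atime_dirty  = true;         /* a later sync pass will
                                           write it out           */
   }

   void periodic_sync(vnode_t *vp)
   {
       if (vp->atime_dirty) {
           /* a real server would flush to stable storage here */
           vp->atime_dirty = false;
       }
   }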
10.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented as UTF-8
strings.  The UTF-8 representation was chosen to avoid tying these
attributes to a particular underlying implementation at the client or
server.  Note that section 6.1 of RFC2624 [26] provides additional
rationale.  It is expected that the client and server will have their
own local representation of owner and owner_group that is used for
local storage or presentation to the end user.  Therefore, it is
expected that when these attributes are
skipping to change at page 101, line 21
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or owner
designated in this way.  In that case, the client must use the
appropriate name@domain string and not the special numeric form
provided for compatibility.
The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute.
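
To illustrate, the following minimal C sketch shows how a server
might render the owner attribute from a local uid; the domain value
and the name-service lookup are assumptions of the sketch:

   #include <stdio.h>

   /* Assumption: the server's configured domain. */
   static const char *local_domain = "example.com";

   /* Stub: a real server would consult its name service here and
    * return NULL only when the uid has no known name. */
   static const char *lookup_user_name(unsigned uid)
   {
       (void)uid;
       return NULL;
   }

   /* Render the owner attribute string for a given local uid. */
   static void owner_string(unsigned uid, char *buf, size_t len)
   {
       const char *name = lookup_user_name(uid);

       if (name != NULL)
           snprintf(buf, len, "%s@%s", name, local_domain);
       else
           snprintf(buf, len, "%u", uid);  /* numeric form only when
                                              no translation exists */
   }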
10.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" RFC1345 [27] which may or may not include the word "CAPITAL"
or "SMALL".  The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table-driven mappings for case-
insensitive comparisons and non-case-preserving storage.  For
general character handling and internationalization issues, see the
section "Internationalization".
10.10. Quota Attributes
For the attributes related to file system quotas, the following
definitions apply:

quota_avail_soft  The value in bytes which represents the amount of
   additional disk space that can be allocated to this file or
   directory before the user may reasonably be warned.  It is
   understood that this space may be consumed by allocations to other
   files or directories, though there is a rule as to which other
   files or directories.
skipping to change at page 102, line 21
   Note that there may be a number of distinct but overlapping sets
   of files or directories for which a quota_used value is
   maintained, e.g., "all files with a given owner", "all files with
   a given group owner", etc.

   The server is at liberty to choose any of those sets but should do
   so in a repeatable way.  The rule may be configured per file
   system or may be "choose the set with the smallest quota".
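
A minimal C sketch of how a client might interpret these attributes;
the struct is illustrative, not the protocol's XDR, and the values
would come from a GETATTR reply:

   #include <stdint.h>
   #include <stdio.h>

   struct quota_attrs {
       uint64_t quota_avail_hard;  /* bytes before allocation fails */
       uint64_t quota_avail_soft;  /* bytes before a warning is due */
       uint64_t quota_used;        /* bytes already charged         */
   };

   static void check_allocation(const struct quota_attrs *q,
                                uint64_t want)
   {
       if (want > q->quota_avail_hard)
           printf("allocation is likely to be refused\n");
       else if (want > q->quota_avail_soft)
           printf("allocation should succeed; warn the user\n");
       else
           printf("allocation fits under the soft limit\n");
   }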
10.11. mounted_on_fileid
UNIX-based operating environments connect a file system into the
namespace by connecting (mounting) the file system onto a file
object (the mount point, usually a directory) of an existing file
system.  When the mount point's parent directory is read via an
API like readdir(), the return results are directory entries, each
with a component name and a fileid.  The fileid of the mount point's
directory entry will be different from the fileid that the stat()
system call returns.  The stat() system call is returning the fileid
of the root of the mounted file system, whereas readdir() is
skipping to change at page 103, line 25
fileid of a directory entry returned by readdir().  If
mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned.  Some operating environments
allow a series of two or more file systems to be mounted onto a
single mount point.  In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points.
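
A C sketch of the invariant described above; the covering
relationships and fileids are abstracted behind the hypothetical
server-internal helpers mounted_over() and fileid_of():

   #include <stdint.h>

   typedef struct vnode vnode_t;

   /* Hypothetical server internals (declarations only): the
    * directory that vp is mounted over, or NULL if vp is not the
    * root of a mounted file system, and vp's own fileid. */
   vnode_t  *mounted_over(vnode_t *vp);
   uint64_t  fileid_of(vnode_t *vp);

   /* Return the fileid that readdir() in the parent directory
    * would have shown for this object. */
   uint64_t mounted_on_fileid(vnode_t *vp)
   {
       vnode_t *under;

       /* With stacked mounts, keep descending until we reach the
        * base mount point in the underlying file system. */
       while ((under = mounted_over(vp)) != NULL)
           vp = under;
       return fileid_of(vp);
   }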
10.12. send_impl_id and recv_impl_id
These recommended attributes are used to identify the client and
server.  In the case of the send_impl_id attribute, the client sends
its clientid4 value along with the nfs_impl_id4.  The use of the
clientid4 value allows the server to identify and match specific
client interaction.  In the case of the recv_impl_id attribute, the
client receives the nfs_impl_id4 value.

Access to this identification information can be most useful at both
client and server.  Being able to identify specific implementations

skipping to change at page 104, line 8

the client and server might refuse to interoperate.

Because it is likely some implementations will violate the protocol
specification and interpret the identity information, implementations
MUST allow the users of the NFSv4 client and server to set the
contents of the sent nfs_impl_id structure to any value.

Even though these attributes are recommended, if the server supports
one of them, it MUST support the other.
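
For illustration only, a C sketch of the intended diagnostic use;
the struct below is a stand-in for the nfs_impl_id4 data, and its
field names are assumptions of the sketch, not the XDR definition:

   #include <stdio.h>

   struct impl_id {
       const char *domain;    /* e.g. "example.com"         */
       const char *name;      /* e.g. "Acme NFS 4.1 client" */
   };

   /* Log the peer's identity for diagnostics only; switching
    * behavior on it would violate the specification, as noted
    * above. */
   static void log_peer_impl(const struct impl_id *id)
   {
       fprintf(stderr, "peer implementation: %s (%s)\n",
               id->name, id->domain);
   }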
10.13. fs_layout_type
This attribute applies to a file system and indicates what layout
types are supported by the file system.  We expect this attribute to
be queried when a client encounters a new fsid.  This attribute is
used by the client to determine if it has applicable layout drivers.
10.14. layout_type
This attribute indicates the particular layout type(s) used for a
file.  This is for informational purposes only.  The client needs to
use the LAYOUTGET operation to get enough information (e.g., specific
device information) to perform I/O.
10.15. layout_hint
This attribute may be set on newly created files to influence the
metadata server's choice for the file's layout.  It is suggested that
this attribute be set as one of the initial attributes within the
OPEN call.  The metadata server may ignore this attribute.  This
attribute is a subset of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file.  It is up to the server
implementation to determine which fields within the layout it uses.
10.16. mdsthreshold
This attribute acts as a hint to the client to help it determine when
it is more efficient to issue read and write requests to the metadata
server vs. the data server.  Two types of thresholds are described:
file size thresholds and I/O size thresholds.  If a file's size is
smaller than the file size threshold, data accesses should be issued
to the metadata server.  If an I/O is below the I/O size threshold,
the I/O should be issued to the metadata server.  Each threshold can
be specified independently for read and write requests.  For either
threshold type, a value of 0 indicates no read or write should be
skipping to change at page 105, line 7
The attribute is available on a per-filehandle basis.  If the current
filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the
filehandle's file system.  It is suggested that this attribute be
queried as part of the OPEN operation.  Due to dynamic system
changes, the client should not assume that the attribute will remain
constant for any specific time period; thus, it should be
periodically refreshed.
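
A C sketch of one plausible client-side reading of the read
thresholds (the write side is symmetric); the struct is a stand-in
for the attribute's representation, and a threshold of 0 is treated
as directing that class of I/O away from the metadata server:

   #include <stdbool.h>
   #include <stdint.h>

   struct mds_threshold {
       uint64_t read_file_size;   /* file-size threshold for reads */
       uint64_t read_io_size;     /* I/O-size threshold for reads  */
   };

   /* Decide where to send a read.  With a threshold of 0, the
    * comparison is never true, so that class of I/O always goes
    * to the data server. */
   static bool read_via_mds(const struct mds_threshold *t,
                            uint64_t file_size, uint64_t io_size)
   {
       if (file_size < t->read_file_size)
           return true;          /* small file: use metadata server */
       if (io_size < t->read_io_size)
           return true;          /* small I/O: use metadata server  */
       return false;             /* otherwise use the data server   */
   }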
11. Access Control Lists
Access Control Lists (ACLs) are file attributes that specify fine-
grained access control.  This chapter covers the "acl", "aclsupport",
and "mode" file attributes, and their interactions.
11.1. Goals
ACLs and modes represent two well-established but different models
for specifying permissions.  This chapter specifies requirements that
attempt to meet the following goals:
o If a server supports the mode attribute, it should provide
reasonable semantics to clients that only set and retrieve the
mode attribute.
o If a server supports the ACL attribute, it should provide
reasonable semantics to clients that only set and retrieve the ACL
attribute.
o On servers that support the mode attribute, if the ACL attribute
has never been set on an object, via inheritance or explicitly,
the behavior should be traditional UNIX-like behavior.
o On servers that support the mode attribute, if the ACL attribute
has been previously set on an object, either explicitly or via
inheritance:
* Setting only the mode attribute should effectively control the
traditional UNIX-like permissions of read, write, and execute
on owner, owner_group, and other.
   *  Setting only the mode attribute should provide reasonable
      security.  For example, setting a mode of 000 should be enough
      to ensure that future opens for read or write by any principal
      fail, regardless of a previously existing or inherited ACL.
o It must be possible to implement a server such that its clients
can have POSIX compliant semantics.
o This minor version of NFSv4 should not introduce significantly
different semantics relating to the mode and ACL attributes, nor
should it render invalid any existing conformant implementations.
Rather, this chapter provides clarifications based on previous
implementations and discussions around them.
o If a server supports the ACL attribute, then at any time, the
server can provide an ACL attribute when requested. The ACL
attribute will describe all permissions on the file object, except
for the three high-order bits of the mode attribute (described in
Section 11.2.2). The ACL attribute will not conflict with the
mode attribute, on servers that support the mode attribute.
o If a server supports the mode attribute, then at any time, the
server can provide a mode attribute when requested. The mode
attribute will not conflict with the ACL attribute, on servers
that support the ACL attribute.
o When a mode attribute is set on an object, the ACL attribute may
need to be modified so as to not conflict with the new mode. In
such cases, it is desirable that the ACL keep as much information
as possible. This includes information about inheritance, AUDIT
and ALARM ACEs, and permissions granted and denied that do not
conflict with the new mode.
11.2. File Attributes Discussion
11.2.1. ACL Attribute
The NFS version 4 ACL attribute is an array of access control entries
(ACEs).  Although the client can read and write the ACL attribute,
the server is responsible for using the ACL to perform access
control.  The client can use the OPEN or ACCESS operations to check
access without modifying or reading data or metadata.
The NFS ACE attribute is defined as follows:

   typedef uint32_t acetype4;
   typedef uint32_t aceflag4;
   typedef uint32_t acemask4;
skipping to change at page 107, line 9
   };
To determine if a request succeeds, the server processes each nfsace4
entry in order.  Only ACEs which have a "who" that matches the
requester are considered.  Each ACE is processed until all of the
bits of the requester's access have been ALLOWED.  Once a bit (see
below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
considered in the processing of later ACEs.  If an ACCESS_DENIED_ACE
is encountered where the requester's access still has unALLOWED bits
in common with the "access_mask" of the ACE, the request is denied.
When the ACL is fully processed, if there are bits in the requester's
mask that have not been ALLOWED or DENIED, access is denied.

Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do
not affect a requester's access, and instead are for triggering
events as a result of a requester's access attempt.  Therefore, all
AUDIT and ALARM ACEs are processed until the end of the ACL.
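
The evaluation order just described lends itself to a single pass
over the ACL.  The following C sketch is illustrative only: the
struct is a stand-in for the XDR nfsace4, and who_matches() abstracts
matching of the "who" field against the requester:

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define ACE4_ACCESS_ALLOWED_ACE_TYPE 0x00000000
   #define ACE4_ACCESS_DENIED_ACE_TYPE  0x00000001

   struct ace {
       uint32_t type;
       uint32_t access_mask;
       /* who and flag omitted for brevity */
   };

   /* Hypothetical: does this ACE's "who" match the requester? */
   bool who_matches(const struct ace *a);

   bool access_allowed(const struct ace *acl, size_t nace,
                       uint32_t requested)
   {
       uint32_t allowed = 0;     /* bits ALLOWED so far */

       for (size_t i = 0; i < nace; i++) {
           const struct ace *a = &acl[i];

           if (!who_matches(a))
               continue;         /* ACE does not apply */
           if (a->type == ACE4_ACCESS_ALLOWED_ACE_TYPE) {
               allowed |= a->access_mask & requested;
               if ((requested & ~allowed) == 0)
                   return true;  /* every requested bit ALLOWED */
           } else if (a->type == ACE4_ACCESS_DENIED_ACE_TYPE) {
               /* a still-unALLOWED requested bit is DENIED */
               if (a->access_mask & requested & ~allowed)
                   return false;
           }
           /* AUDIT and ALARM ACEs would trigger their events
            * here without affecting the outcome. */
       }
       /* bits neither ALLOWED nor DENIED: access is denied */
       return false;
   }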
The NFS version 4 ACL model is quite rich.  Some server platforms may
provide access control functionality that goes beyond the UNIX-style
mode attribute, but which is not as rich as the NFS ACL model.  So
that users can take advantage of this more limited functionality, the
server may indicate that it supports ACLs as long as it follows the
guidelines for mapping between its ACL model and the NFS version 4
ACL model.

The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs.  For example, the enforcement for
NFS version 4 access may be different from the enforcement for local
access, and both may be different from the enforcement for access
through other protocols such as SMB.  So it may be useful for a
server to accept an ACL even if not all of its modules are able to
support it.

The guiding principle in all cases is that the server must not accept
ACLs that appear to make the file more secure than it really is.
11.2.1.1. ACE Type
The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE  = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE   = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE   = 0x00000003;

+------------------------------+--------------+---------------------+
| Value                        | Abbreviation | Description         |
+------------------------------+--------------+---------------------+
| ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW        | Explicitly grants   |
|                              |              | the access defined  |
|                              |              | in acemask4 to the  |
|                              |              | file or directory.  |
| ACE4_ACCESS_DENIED_ACE_TYPE  | DENY         | Explicitly denies   |
|                              |              | the access defined  |
|                              |              | in acemask4 to the  |
|                              |              | file or directory.  |
| ACE4_SYSTEM_AUDIT_ACE_TYPE   | AUDIT        | LOG (system         |
|                              |              | dependent) any      |
|                              |              | access attempt to a |
|                              |              | file or directory   |
|                              |              | which uses any of   |
|                              |              | the access methods  |
|                              |              | specified in        |
|                              |              | acemask4.           |
| ACE4_SYSTEM_ALARM_ACE_TYPE   | ALARM        | Generate a system   |
|                              |              | ALARM (system       |
|                              |              | dependent) when any |
|                              |              | access attempt is   |
|                              |              | made to a file or   |
|                              |              | directory for the   |
|                              |              | access methods      |
|                              |              | specified in        |
|                              |              | acemask4.           |
+------------------------------+--------------+---------------------+

The "Abbreviation" column denotes how the types will be referred to
throughout the rest of this document.

11.2.1.2. The aclsupport Attribute
A server need not support all of the above ACE types.  The bitmask
constants used to represent the above definitions within the
aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL  = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL = 0x00000008;
Clients should not attempt to set an ACE unless the server claims
support for that ACE type.  If the server receives a request to set
an ACE that it cannot store, it MUST reject the request with
NFS4ERR_ATTRNOTSUPP.  If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP.

Example: suppose a server can enforce NFS ACLs for NFS access but
cannot enforce ACLs for local access.  If arbitrary processes can run
on the server, then the server SHOULD NOT indicate ACL support.  On
the other hand, if only trusted administrative programs run locally,
then the server may indicate ACL support.
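
Mapping the acetype4 values (0 through 3) onto the aclsupport bits
(0x1 through 0x8) makes this check a simple shift, as the following
illustrative C sketch shows:

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* acetype4 values 0..3 line up with the aclsupport bits
    * 0x1..0x8 defined above, so the mapping is a shift. */
   static uint32_t support_bit(uint32_t acetype)
   {
       return (uint32_t)1 << acetype;
   }

   /* Accept the ACL only if every ACE type in it is one the
    * server can both store and enforce; otherwise the SETATTR is
    * rejected with NFS4ERR_ATTRNOTSUPP. */
   static bool acl_acceptable(const uint32_t *acetypes, size_t n,
                              uint32_t aclsupport)
   {
       for (size_t i = 0; i < n; i++) {
           if (acetypes[i] > 3)
               return false;     /* unknown ACE type */
           if ((support_bit(acetypes[i]) & aclsupport) == 0)
               return false;
       }
       return true;
   }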
11.2.1.3. ACE Access Mask
The bitmask constants used for the access mask field are as follows:
const ACE4_READ_DATA = 0x00000001;
const ACE4_LIST_DIRECTORY = 0x00000001;
const ACE4_WRITE_DATA = 0x00000002;
const ACE4_ADD_FILE = 0x00000002;
const ACE4_APPEND_DATA = 0x00000004;
const ACE4_ADD_SUBDIRECTORY = 0x00000004;
const ACE4_READ_NAMED_ATTRS = 0x00000008;
const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
const ACE4_EXECUTE = 0x00000020;
const ACE4_DELETE_CHILD = 0x00000040;
const ACE4_READ_ATTRIBUTES = 0x00000080;
const ACE4_WRITE_ATTRIBUTES = 0x00000100;
const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000;
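
Note that several constants share a bit value and differ only in how
they read for files versus directories; for example, 0x00000001 is
ACE4_READ_DATA for a file and ACE4_LIST_DIRECTORY for a directory.
Mask bits combine by bitwise OR, so a hypothetical mask granting read
and execute on a file would be built as:

   uint32_t mask = ACE4_READ_DATA | ACE4_EXECUTE;  /* 0x00000021 */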
11.2.1.3.1. Discussion of Mask Attributes
ACE4_READ_DATA
   Operation(s) affected:
      READ
      OPEN
   Discussion:
      Permission to read the data of the file.

      Servers SHOULD allow a user the ability to read the data of
      the file when only the ACE4_EXECUTE access mask bit is
      allowed.
ACE4_LIST_DIRECTORY
   Operation(s) affected:
      READDIR
   Discussion:
      Permission to list the contents of a directory.
ACE4_WRITE_DATA
   Operation(s) affected:
      WRITE
      OPEN
      SETATTR of size
   Discussion:
      Permission to modify a file's data anywhere in the file's
      offset range.  This includes the ability to write to any
      arbitrary offset and as a result to grow the file.