draft-ietf-nfsv4-minorversion1-03.txt   draft-ietf-nfsv4-minorversion1-04.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft Editor Internet-Draft M. Eisler
Intended status: Standards Track June 20, 2006 Intended status: Standards Track D. Noveck
Expires: December 22, 2006 Expires: January 22, 2007 Editors
July 21, 2006
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-03.txt draft-ietf-nfsv4-minorversion1-04.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 34 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 22, 2006. This Internet-Draft will expire on January 22, 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2006).
Abstract Abstract
This Internet-Draft describes the NFSv4 minor version 1 protocol This Internet-Draft describes NFSv4 minor version one, including
extensions. The most significant of these extensions are commonly features retained from the base protocol and protocol extensions made
called: Sessions, Directory Delegations, and parallel NFS or pNFS subsequently. The current draft includes description of the major
extensions, Sessions, Directory Delegations, and parallel NFS (pNFS).
This Internet-Draft is an active work item of the NFSv4 working
group. Active and resolved issues may be found in the issue tracker
at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4. New issues
related to this document should be raised with the NFSv4 Working
Group nfsv4@ietf.org and logged in the issue tracker.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 9 1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 10
1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 9 1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 10
1.2. Structured Data Types . . . . . . . . . . . . . . . . . 10 1.2. Structured Data Types . . . . . . . . . . . . . . . . . 11
2. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 19 2. RPC and Security Flavor . . . . . . . . . . . . . . . . . . . 20
2.1. Obtaining the First Filehandle . . . . . . . . . . . . . 19 2.1. Ports and Transports . . . . . . . . . . . . . . . . . . 20
2.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . . 20 2.1.1. Client Retransmission Behavior . . . . . . . . . . . 21
2.1.2. Public Filehandle . . . . . . . . . . . . . . . . . . 20 2.2. Security Flavors . . . . . . . . . . . . . . . . . . . . 22
2.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 20 2.2.1. Security mechanisms for NFS version 4 . . . . . . . 22
2.2.1. General Properties of a Filehandle . . . . . . . . . 21 2.3. Security Negotiation . . . . . . . . . . . . . . . . . . 24
2.2.2. Persistent Filehandle . . . . . . . . . . . . . . . . 21 2.3.1. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 24
2.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . . 22 2.3.2. Security Error . . . . . . . . . . . . . . . . . . . 24
2.3. One Method of Constructing a Volatile Filehandle . . . . 23 2.3.3. Callback RPC Authentication . . . . . . . . . . . . 25
2.4. Client Recovery from Filehandle Expiration . . . . . . . 24 2.3.4. GSS Server Principal . . . . . . . . . . . . . . . . 25
3. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 25 3. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 26 3.1. Obtaining the First Filehandle . . . . . . . . . . . . . 26
3.2. Recommended Attributes . . . . . . . . . . . . . . . . . 26 3.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 26
3.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 27 3.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 26
3.4. Classification of Attributes . . . . . . . . . . . . . . 27 3.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 27
3.5. Mandatory Attributes - Definitions . . . . . . . . . . . 28 3.2.1. General Properties of a Filehandle . . . . . . . . . 27
3.6. Recommended Attributes - Definitions . . . . . . . . . . 30 3.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 28
3.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 38 3.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 28
3.8. Interpreting owner and owner_group . . . . . . . . . . . 38 3.3. One Method of Constructing a Volatile Filehandle . . . . 29
3.9. Character Case Attributes . . . . . . . . . . . . . . . 40 3.4. Client Recovery from Filehandle Expiration . . . . . . . 30
3.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 41 4. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 31
3.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 41 4.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 32
3.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 42 4.2. Recommended Attributes . . . . . . . . . . . . . . . . . 32
3.13. fs_layouttype . . . . . . . . . . . . . . . . . . . . . 43 4.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 33
3.14. layouttype . . . . . . . . . . . . . . . . . . . . . . . 43 4.4. Classification of Attributes . . . . . . . . . . . . . . 33
3.15. layouthint . . . . . . . . . . . . . . . . . . . . . . . 43 4.5. Mandatory Attributes - Definitions . . . . . . . . . . . 34
3.16. Access Control Lists . . . . . . . . . . . . . . . . . . 44 4.6. Recommended Attributes - Definitions . . . . . . . . . . 36
3.16.1. ACE type . . . . . . . . . . . . . . . . . . . . . . 46 4.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 43
3.16.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . 47 4.8. Interpreting owner and owner_group . . . . . . . . . . . 44
3.16.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . 52 4.9. Character Case Attributes . . . . . . . . . . . . . . . 46
3.16.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . 54 4.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 46
3.16.5. Mode Attribute . . . . . . . . . . . . . . . . . . . 55 4.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 47
3.16.6. Interaction Between Mode and ACL Attributes . . . . . 56 4.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 48
4. Single-server Name Space . . . . . . . . . . . . . . . . . . 69 4.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 48
4.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 69 4.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 48
4.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 69 4.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 49
4.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 70 5. Access Control Lists . . . . . . . . . . . . . . . . . . . . 49
4.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 71 5.1. ACE type . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 71 5.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . . 52
4.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 71 5.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD . . . . . . . . . 57
4.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 71 5.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8. Security Policy and Name Space Presentation . . . . . . 72 5.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . . 59
5. File Locking and Share Reservations . . . . . . . . . . . . . 73 5.4.1. Discussion of EVERYONE@ . . . . . . . . . . . . . . 60
5.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.2. Discussion of OWNER@ and GROUP@ . . . . . . . . . . 60
5.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . . 74 5.5. Mode Attribute . . . . . . . . . . . . . . . . . . . . . 60
5.1.2. Server Release of Clientid . . . . . . . . . . . . . 76 5.6. Interaction Between Mode and ACL Attributes . . . . . . 61
5.1.3. lock_owner and stateid Definition . . . . . . . . . . 77 5.6.1. Recomputing mode upon SETATTR of ACL . . . . . . . . 62
5.1.4. Use of the stateid and Locking . . . . . . . . . . . 79 5.6.2. Applying the mode given to CREATE or OPEN to an
5.1.5. Sequencing of Lock Requests . . . . . . . . . . . . . 81 inherited ACL . . . . . . . . . . . . . . . . . . . 65
5.1.6. Recovery from Replayed Requests . . . . . . . . . . . 82 5.6.3. Applying a Mode to an Existing ACL . . . . . . . . . 67
5.1.7. Releasing lock_owner State . . . . . . . . . . . . . 82 5.6.4. ACL and mode in the same SETATTR . . . . . . . . . . 71
5.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 82 5.6.5. Inheritance and turning it off . . . . . . . . . . . 72
5.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 84 5.6.6. Deficiencies in a Mode Representation of an ACL . . 73
5.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 84 6. Single-server Name Space . . . . . . . . . . . . . . . . . . 74
5.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 84 6.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 74
5.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 85 6.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 74
5.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 86 6.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 75
5.6.1. Client Failure and Recovery . . . . . . . . . . . . . 86 6.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 76
5.6.2. Server Failure and Recovery . . . . . . . . . . . . . 87 6.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 76
5.6.3. Network Partitions and Recovery . . . . . . . . . . . 89 6.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 76
5.7. Recovery from a Lock Request Timeout or Abort . . . . . 92 6.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 76
5.8. Server Revocation of Locks . . . . . . . . . . . . . . . 93 6.8. Security Policy and Name Space Presentation . . . . . . 77
5.9. Share Reservations . . . . . . . . . . . . . . . . . . . 94 7. File Locking and Share Reservations . . . . . . . . . . . . . 78
5.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 94 7.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 78
5.10.1. Close and Retention of State Information . . . . . . 95 7.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . 79
5.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 96 7.1.2. Server Release of Clientid . . . . . . . . . . . . . 81
5.12. Short and Long Leases . . . . . . . . . . . . . . . . . 96 7.1.3. lock_owner and stateid Definition . . . . . . . . . 82
5.13. Clocks, Propagation Delay, and Calculating Lease 7.1.4. Use of the stateid and Locking . . . . . . . . . . . 84
Expiration . . . . . . . . . . . . . . . . . . . . . . . 97 7.1.5. Sequencing of Lock Requests . . . . . . . . . . . . 86
6. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 97 7.1.6. Recovery from Replayed Requests . . . . . . . . . . 87
6.1. Performance Challenges for Client-Side Caching . . . . . 98 7.1.7. Releasing lock_owner State . . . . . . . . . . . . . 87
6.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 99 7.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 87
6.2.1. Delegation Recovery . . . . . . . . . . . . . . . . . 100 7.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 89
6.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 102 7.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 89
6.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 102 7.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 89
6.3.2. Data Caching and File Locking . . . . . . . . . . . . 103 7.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 90
6.3.3. Data Caching and Mandatory File Locking . . . . . . . 105 7.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 91
6.3.4. Data Caching and File Identity . . . . . . . . . . . 105 7.6.1. Client Failure and Recovery . . . . . . . . . . . . 91
6.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 106 7.6.2. Server Failure and Recovery . . . . . . . . . . . . 92
6.4.1. Open Delegation and Data Caching . . . . . . . . . . 109 7.6.3. Network Partitions and Recovery . . . . . . . . . . 94
6.4.2. Open Delegation and File Locks . . . . . . . . . . . 110 7.7. Recovery from a Lock Request Timeout or Abort . . . . . 98
6.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 110 7.8. Server Revocation of Locks . . . . . . . . . . . . . . . 98
6.4.4. Recall of Open Delegation . . . . . . . . . . . . . . 113 7.9. Share Reservations . . . . . . . . . . . . . . . . . . . 99
6.4.5. Clients that Fail to Honor Delegation Recalls . . . . 115 7.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 100
6.4.6. Delegation Revocation . . . . . . . . . . . . . . . . 116 7.10.1. Close and Retention of State Information . . . . . . 101
6.5. Data Caching and Revocation . . . . . . . . . . . . . . 116 7.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 101
6.5.1. Revocation Recovery for Write Open Delegation . . . . 117 7.12. Short and Long Leases . . . . . . . . . . . . . . . . . 102
6.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 118 7.13. Clocks, Propagation Delay, and Calculating Lease
6.7. Data and Metadata Caching and Memory Mapped Files . . . 120 Expiration . . . . . . . . . . . . . . . . . . . . . . . 102
6.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 122 8. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 103
6.9. Directory Caching . . . . . . . . . . . . . . . . . . . 123 8.1. Performance Challenges for Client-Side Caching . . . . . 104
7. Security Negotiation . . . . . . . . . . . . . . . . . . . . 124 8.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 104
8. Clarification of Security Negotiation in NFSv4.1 . . . . . . 124 8.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 106
8.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 125 8.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 108
8.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 125 8.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 108
8.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 125 8.3.2. Data Caching and File Locking . . . . . . . . . . . 109
8.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 126 8.3.3. Data Caching and Mandatory File Locking . . . . . . 111
9. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 126 8.3.4. Data Caching and File Identity . . . . . . . . . . . 111
9.1. Sessions Background . . . . . . . . . . . . . . . . . . 126 8.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 112
9.1.1. Introduction to Sessions . . . . . . . . . . . . . . 126 8.4.1. Open Delegation and Data Caching . . . . . . . . . . 115
9.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 127 8.4.2. Open Delegation and File Locks . . . . . . . . . . . 116
9.1.3. Problem Statement . . . . . . . . . . . . . . . . . . 128 8.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 116
9.1.4. NFSv4 Session Extension Characteristics . . . . . . . 130 8.4.4. Recall of Open Delegation . . . . . . . . . . . . . 119
9.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 130 8.4.5. Clients that Fail to Honor Delegation Recalls . . . 121
9.2.1. Session Model . . . . . . . . . . . . . . . . . . . . 130 8.4.6. Delegation Revocation . . . . . . . . . . . . . . . 122
9.2.2. Connection State . . . . . . . . . . . . . . . . . . 132 8.5. Data Caching and Revocation . . . . . . . . . . . . . . 122
9.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 132 8.5.1. Revocation Recovery for Write Open Delegation . . . 123
9.2.4. Reconnection, Trunking and Failover . . . . . . . . . 134 8.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 124
9.2.5. Server Duplicate Request Cache . . . . . . . . . . . 135 8.7. Data and Metadata Caching and Memory Mapped Files . . . 126
9.3. Session Initialization and Transfer Models . . . . . . . 136 8.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 128
9.3.1. Session Negotiation . . . . . . . . . . . . . . . . . 136 8.9. Directory Caching . . . . . . . . . . . . . . . . . . . 129
9.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . . 138 9. Security Negotiation . . . . . . . . . . . . . . . . . . . . 130
9.3.3. RDMA Connection Resources . . . . . . . . . . . . . . 138 10. Clarification of Security Negotiation in NFSv4.1 . . . . . . 130
9.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 139 10.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 130
9.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 142 10.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 131
9.4. Connection Models . . . . . . . . . . . . . . . . . . . 145 10.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 131
9.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 146 10.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 131
9.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 147 11. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 132
9.4.3. Automatic RDMA Connection Model . . . . . . . . . . . 148 11.1. Sessions Background . . . . . . . . . . . . . . . . . . 132
9.5. Buffer Management, Transfer, Flow Control . . . . . . . 148 11.1.1. Introduction to Sessions . . . . . . . . . . . . . . 132
9.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 151 11.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 133
9.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 152 11.1.3. Problem Statement . . . . . . . . . . . . . . . . . 134
9.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 153 11.1.4. NFSv4 Session Extension Characteristics . . . . . . 136
9.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 153 11.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 136
9.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 155 11.2.1. Session Model . . . . . . . . . . . . . . . . . . . 136
9.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 155 11.2.2. Connection State . . . . . . . . . . . . . . . . . . 137
9.10.2. Slot Identifiers and Server Duplicate Request Cache . 155 11.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 138
9.10.3. Resolving server callback races with sessions . . . . 159 11.2.4. Reconnection, Trunking and Failover . . . . . . . . 140
9.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 160 11.2.5. Server Duplicate Request Cache . . . . . . . . . . . 141
9.10.5. eXternal Data Representation Efficiency . . . . . . . 161 11.3. Session Initialization and Transfer Models . . . . . . . 142
9.10.6. Effect of Sessions on Existing Operations . . . . . . 161 11.3.1. Session Negotiation . . . . . . . . . . . . . . . . 142
9.10.7. Authentication Efficiencies . . . . . . . . . . . . . 162 11.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . 144
9.11. Sessions Security Considerations . . . . . . . . . . . . 163 11.3.3. RDMA Connection Resources . . . . . . . . . . . . . 144
9.11.1. Authentication . . . . . . . . . . . . . . . . . . . 165 11.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 145
10. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 166 11.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 148
10.1. Location attributes . . . . . . . . . . . . . . . . . . 166 11.4. Connection Models . . . . . . . . . . . . . . . . . . . 151
10.2. File System Presence or Absence . . . . . . . . . . . . 166 11.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 152
10.3. Getting Attributes for an Absent File System . . . . . . 168 11.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 153
10.3.1. GETATTR Within an Absent File System . . . . . . . . 168 11.4.3. Automatic RDMA Connection Model . . . . . . . . . . 154
10.3.2. READDIR and Absent File Systems . . . . . . . . . . . 169 11.5. Buffer Management, Transfer, Flow Control . . . . . . . 154
10.4. Uses of Location Information . . . . . . . . . . . . . . 170 11.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 157
10.4.1. File System Replication . . . . . . . . . . . . . . . 170 11.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 158
10.4.2. File System Migration . . . . . . . . . . . . . . . . 171 11.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 159
10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . . 172 11.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 159
10.5. Additional Client-side Considerations . . . . . . . . . 172 11.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 161
10.6. Effecting File System Transitions . . . . . . . . . . . 173 11.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 161
10.6.1. Transparent File System Transitions . . . . . . . . . 174 11.10.2. Slot Identifiers and Server Duplicate Request
10.6.2. Filehandles and File System Transitions . . . . . . . 176 Cache . . . . . . . . . . . . . . . . . . . . . . . 161
10.6.3. Fileid's and File System Transitions . . . . . . . . 176 11.10.3. Resolving server callback races with sessions . . . 165
10.6.4. Fsid's and File System Transitions . . . . . . . . . 177 11.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 166
10.6.5. The Change Attribute and File System Transitions . . 177 11.10.5. eXternal Data Representation Efficiency . . . . . . 167
10.6.6. Lock State and File System Transitions . . . . . . . 178 11.10.6. Effect of Sessions on Existing Operations . . . . . 167
10.6.7. Write Verifiers and File System Transitions . . . . . 181 11.10.7. Authentication Efficiencies . . . . . . . . . . . . 168
10.7. Effecting File System Referrals . . . . . . . . . . . . 181 11.11. Sessions Security Considerations . . . . . . . . . . . . 169
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . . 182 11.11.1. Denial of Service via Unauthorized State Changes . . 170
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 186 11.11.2. Authentication . . . . . . . . . . . . . . . . . . . 173
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 188 12. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 174
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 188 12.1. Location attributes . . . . . . . . . . . . . . . . . . 174
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 190 12.2. File System Presence or Absence . . . . . . . . . . . . 175
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 199 12.3. Getting Attributes for an Absent File System . . . . . . 176
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 202 12.3.1. GETATTR Within an Absent File System . . . . . . . . 176
11.1. Introduction to Directory Delegations . . . . . . . . . 203 12.3.2. READDIR and Absent File Systems . . . . . . . . . . 177
11.2. Directory Delegation Design (in brief) . . . . . . . . . 204 12.4. Uses of Location Information . . . . . . . . . . . . . . 178
11.3. Recommended Attributes in support of Directory 12.4.1. File System Replication . . . . . . . . . . . . . . 178
Delegations . . . . . . . . . . . . . . . . . . . . . . 205 12.4.2. File System Migration . . . . . . . . . . . . . . . 179
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 206 12.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 180
11.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 206 12.5. Additional Client-side Considerations . . . . . . . . . 180
12. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 206 12.6. Effecting File System Transitions . . . . . . . . . . . 181
13. General Definitions . . . . . . . . . . . . . . . . . . . . . 209 12.6.1. Transparent File System Transitions . . . . . . . . 182
13.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 209 12.6.2. Filehandles and File System Transitions . . . . . . 184
13.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 209 12.6.3. Fileid's and File System Transitions . . . . . . . . 184
13.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 209 12.6.4. Fsid's and File System Transitions . . . . . . . . . 185
13.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 209 12.6.5. The Change Attribute and File System Transitions . . 185
13.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 210 12.6.6. Lock State and File System Transitions . . . . . . . 186
13.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 210 12.6.7. Write Verifiers and File System Transitions . . . . 189
13.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 210 12.7. Effecting File System Referrals . . . . . . . . . . . . 190
14. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 211 12.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 190
14.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 211 12.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 194
14.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 211 12.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 196
14.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . . 211 12.9. The Attribute fs_locations . . . . . . . . . . . . . . . 196
14.1.3. Layout Segments . . . . . . . . . . . . . . . . . . . 212 12.10. The Attribute fs_locations_info . . . . . . . . . . . . 198
14.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 213 12.11. The Attribute fs_status . . . . . . . . . . . . . . . . 207
14.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . . 213 13. Directory Delegations . . . . . . . . . . . . . . . . . . . . 210
14.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 214 13.1. Introduction to Directory Delegations . . . . . . . . . 211
14.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 215 13.2. Directory Delegation Design (in brief) . . . . . . . . . 212
14.4. Committing a Layout . . . . . . . . . . . . . . . . . . 216 13.3. Recommended Attributes in support of Directory
14.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . . 216 Delegations . . . . . . . . . . . . . . . . . . . . . . 213
14.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . . 217 13.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 214
14.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . . 218 13.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 214
14.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 218 14. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 214
14.5.1. Basic Operation . . . . . . . . . . . . . . . . . . . 218 15. General Definitions . . . . . . . . . . . . . . . . . . . . . 217
14.5.2. Recall Callback Robustness . . . . . . . . . . . . . 220 15.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 217
14.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 221 15.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 217
14.6. Metadata Server Write Propagation . . . . . . . . . . . 223 15.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 217
14.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 223 15.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 217
14.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 224 15.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 218
14.7.2. Client Recovery . . . . . . . . . . . . . . . . . . . 225 15.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 218
14.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 226 15.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 218
14.7.4. Storage Device Recovery . . . . . . . . . . . . . . . 228 16. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 219
15. Security Considerations . . . . . . . . . . . . . . . . . . . 229 16.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 219
15.1. File Layout Security . . . . . . . . . . . . . . . . . . 230 16.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 219
15.2. Object Layout Security . . . . . . . . . . . . . . . . . 230 16.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . 219
15.3. Block/Volume Layout Security . . . . . . . . . . . . . . 232 16.1.3. Layout Segments . . . . . . . . . . . . . . . . . . 220
16. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 232 16.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 221
16.1. File Striping and Data Access . . . . . . . . . . . . . 232 16.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . 221
16.1.1. Sparse and Dense Storage Device Data Layouts . . . . 234 16.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 222
16.1.2. Metadata and Storage Device Roles . . . . . . . . . . 236 16.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 223
16.1.3. Device Multipathing . . . . . . . . . . . . . . . . . 237 16.4. Committing a Layout . . . . . . . . . . . . . . . . . . 224
16.1.4. Operations Issued to Storage Devices . . . . . . . . 237 16.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . 224
16.1.5. COMMIT through metadata server . . . . . . . . . . . 238 16.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . 225
16.2. Global Stateid Requirements . . . . . . . . . . . . . . 238 16.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . 226
16.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 239 16.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 226
16.4. Storage Device State Propagation . . . . . . . . . . . . 239 16.5.1. Basic Operation . . . . . . . . . . . . . . . . . . 226
16.4.1. Lock State Propagation . . . . . . . . . . . . . . . 240 16.5.2. Recall Callback Robustness . . . . . . . . . . . . . 228
16.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 240 16.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 229
16.4.3. File Attributes . . . . . . . . . . . . . . . . . . . 240 16.6. Metadata Server Write Propagation . . . . . . . . . . . 231
16.5. Storage Device Component File Size . . . . . . . . . . . 241 16.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 231
16.6. Crash Recovery Considerations . . . . . . . . . . . . . 242 16.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 232
16.7. Security Considerations . . . . . . . . . . . . . . . . 243 16.7.2. Client Recovery . . . . . . . . . . . . . . . . . . 233
16.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 243 16.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 233
17. Layouts and Aggregation . . . . . . . . . . . . . . . . . . . 244 16.7.4. Storage Device Recovery . . . . . . . . . . . . . . 236
17.1. Simple Map . . . . . . . . . . . . . . . . . . . . . . . 244 16.8. Security Considerations . . . . . . . . . . . . . . . . 237
17.2. Block Extent Map . . . . . . . . . . . . . . . . . . . . 245 17. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 238
17.3. Striped Map (RAID 0) . . . . . . . . . . . . . . . . . . 245 17.1. File Striping and Data Access . . . . . . . . . . . . . 238
17.4. Replicated Map . . . . . . . . . . . . . . . . . . . . . 245 17.1.1. Sparse and Dense Storage Device Data Layouts . . . . 240
17.5. Concatenated Map . . . . . . . . . . . . . . . . . . . . 245 17.1.2. Metadata and Storage Device Roles . . . . . . . . . 242
17.6. Nested Map . . . . . . . . . . . . . . . . . . . . . . . 246 17.1.3. Device Multipathing . . . . . . . . . . . . . . . . 243
18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 246 17.1.4. Operations Issued to Storage Devices . . . . . . . . 243
19. Internationalization . . . . . . . . . . . . . . . . . . . . 248 17.1.5. COMMIT through metadata server . . . . . . . . . . . 244
19.1. Stringprep profile for the utf8str_cs type . . . . . . . 249 17.2. Global Stateid Requirements . . . . . . . . . . . . . . 244
19.2. Stringprep profile for the utf8str_cis type . . . . . . 251 17.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 245
19.3. Stringprep profile for the utf8str_mixed type . . . . . 252 17.4. Storage Device State Propagation . . . . . . . . . . . . 245
19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 254 17.4.1. Lock State Propagation . . . . . . . . . . . . . . . 246
20. Error Definitions . . . . . . . . . . . . . . . . . . . . . . 254 17.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 246
21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 263 17.4.3. File Attributes . . . . . . . . . . . . . . . . . . 246
21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 263 17.5. Storage Device Component File Size . . . . . . . . . . . 247
21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 264 17.6. Crash Recovery Considerations . . . . . . . . . . . . . 248
22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 266 17.7. Security Considerations . . . . . . . . . . . . . . . . 249
22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 267 17.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 249
22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 269 18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 250
22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 270 19. Internationalization . . . . . . . . . . . . . . . . . . . . 252
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 273 19.1. Stringprep profile for the utf8str_cs type . . . . . . . 253
19.2. Stringprep profile for the utf8str_cis type . . . . . . 255
19.3. Stringprep profile for the utf8str_mixed type . . . . . 256
19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 258
20. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 258
20.1. Error Definitions . . . . . . . . . . . . . . . . . . . 258
20.2. Operations and their valid errors . . . . . . . . . . . 270
20.3. Callback operations and their valid errors . . . . . . . 279
20.4. Errors and the operations that use them . . . . . . . . 279
21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 284
21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 284
21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 285
22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 287
22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 288
22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 290
22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 291
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 294
22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 276 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 296
22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 277 22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 297
22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 277 22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 298
22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 279 22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 299
22.9. Operation 11: LINK - Create Link to a File . . . . . . . 280 22.9. Operation 11: LINK - Create Link to a File . . . . . . . 300
22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 281 22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 301
22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 285 22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 305
22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 287 22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 306
22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 288 22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 307
22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 290 22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 309
22.15. Operation 17: NVERIFY - Verify Difference in 22.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 291 Attributes . . . . . . . . . . . . . . . . . . . . . . . 310
22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 292 22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 311
22.17. Operation 19: OPENATTR - Open Named Attribute 22.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 306 Directory . . . . . . . . . . . . . . . . . . . . . . . 325
22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 307 22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 326
22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 309 22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 328
22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 310 22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 329
22.21. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 311 22.21. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 330
22.22. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 313 22.22. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 331
22.23. Operation 25: READ - Read from File . . . . . . . . . . 314 22.23. Operation 25: READ - Read from File . . . . . . . . . . 332
22.24. Operation 26: READDIR - Read Directory . . . . . . . . . 316 22.24. Operation 26: READDIR - Read Directory . . . . . . . . . 334
22.25. Operation 27: READLINK - Read Symbolic Link . . . . . . 320 22.25. Operation 27: READLINK - Read Symbolic Link . . . . . . 338
22.26. Operation 28: REMOVE - Remove Filesystem Object . . . . 321 22.26. Operation 28: REMOVE - Remove Filesystem Object . . . . 339
22.27. Operation 29: RENAME - Rename Directory Entry . . . . . 323 22.27. Operation 29: RENAME - Rename Directory Entry . . . . . 340
22.28. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 325 22.28. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 342
22.29. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 326 22.29. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 343
22.30. Operation 32: SAVEFH - Save Current Filehandle . . . . . 327 22.30. Operation 32: SAVEFH - Save Current Filehandle . . . . . 344
22.31. Operation 33: SECINFO - Obtain Available Security . . . 328 22.31. Operation 33: SECINFO - Obtain Available Security . . . 345
22.32. Operation 34: SETATTR - Set Attributes . . . . . . . . . 331 22.32. Operation 34: SETATTR - Set Attributes . . . . . . . . . 348
22.33. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 334 22.33. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 350
22.34. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 338 22.34. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 354
22.35. Operation 37: VERIFY - Verify Same Attributes . . . . . 341 22.35. Operation 37: VERIFY - Verify Same Attributes . . . . . 357
22.36. Operation 38: WRITE - Write to File . . . . . . . . . . 342 22.36. Operation 38: WRITE - Write to File . . . . . . . . . . 359
22.37. Operation 39: RELEASE_LOCKOWNER - Release Lockowner 22.37. Operation 39: RELEASE_LOCKOWNER - Release Lockowner
State . . . . . . . . . . . . . . . . . . . . . . . . . 347 State . . . . . . . . . . . . . . . . . . . . . . . . . 363
22.38. Operation 10044: ILLEGAL - Illegal operation . . . . . . 348 22.38. Operation 40: BIND_BACKCHANNEL - Create a callback
22.39. SECINFO_NO_NAME - Get Security on Unnamed Object . . . . 348 channel binding . . . . . . . . . . . . . . . . . . . . 363
22.40. CREATECLIENTID - Instantiate Clientid . . . . . . . . . 350 22.39. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 369
22.41. CREATESESSION - Create New Session and Confirm 22.40. Operation 42: CREATECLIENTID - Instantiate Clientid . . 372
Clientid . . . . . . . . . . . . . . . . . . . . . . . . 355 22.41. Operation 43: CREATESESSION - Create New Session and
22.42. BIND_BACKCHANNEL - Create a callback channel binding . . 360 Confirm Clientid . . . . . . . . . . . . . . . . . . . . 377
22.43. DESTROYSESSION - Destroy existing session . . . . . . . 362 22.42. Operation 44: DESTROYSESSION - Destroy existing
22.44. SEQUENCE - Supply per-procedure sequencing and control . 363 session . . . . . . . . . . . . . . . . . . . . . . . . 383
22.45. GET_DIR_DELEGATION - Get a directory delegation . . . . 364 22.43. Operation 45: GET_DIR_DELEGATION - Get a directory
22.46. LAYOUTGET - Get Layout Information . . . . . . . . . . . 368 delegation . . . . . . . . . . . . . . . . . . . . . . . 384
22.47. LAYOUTCOMMIT - Commit writes made using a layout . . . . 371 22.44. Operation 46: GETDEVICEINFO - Get Device Information . . 388
22.48. LAYOUTRETURN - Release Layout Information . . . . . . . 375 22.45. Operation 47: GETDEVICELIST . . . . . . . . . . . . . . 389
22.49. GETDEVICEINFO - Get Device Information . . . . . . . . . 376 22.46. Operation 48: LAYOUTCOMMIT - Commit writes made using
22.50. GETDEVICELIST . . . . . . . . . . . . . . . . . . . . . 377 a layout . . . . . . . . . . . . . . . . . . . . . . . . 390
22.51. WANT_DELEGATION . . . . . . . . . . . . . . . . . . . . 379 22.47. Operation 49: LAYOUTGET - Get Layout Information . . . . 394
23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 382 22.48. Operation 50: LAYOUTRETURN - Release Layout
23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 382 Information . . . . . . . . . . . . . . . . . . . . . . 396
23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 383 22.49. Operation 51: SECINFO_NO_NAME - Get Security on
24. CB_RECALLCREDIT - change flow control limits . . . . . . . . 385 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 398
25. CB_SEQUENCE - Supply callback channel sequencing and 22.50. Operation 52: SEQUENCE - Supply per-procedure
control . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 sequencing and control . . . . . . . . . . . . . . . . . 399
26. CB_NOTIFY - Notify directory changes . . . . . . . . . . . . 387 22.51. Operation 53: SET_SSV . . . . . . . . . . . . . . . . . 400
27. CB_RECALL_ANY - Keep any N delegations . . . . . . . . . . . 390 22.52. Operation 54: WANT_DELEGATION . . . . . . . . . . . . . 402
28. CB_SIZECHANGED . . . . . . . . . . . . . . . . . . . . . . . 393 22.53. Operation 10044: ILLEGAL - Illegal operation . . . . . . 405
29. CB_LAYOUTRECALL . . . . . . . . . . . . . . . . . . . . . . . 394 23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 406
30. CB_PUSH_DELEG . . . . . . . . . . . . . . . . . . . . . . . . 397 23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 406
31. CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . . . . . . . . . . 398 23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 406
32. References . . . . . . . . . . . . . . . . . . . . . . . . . 398 24. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 408
32.1. Normative References . . . . . . . . . . . . . . . . . . 398 24.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 408
32.2. Informative References . . . . . . . . . . . . . . . . . 399 24.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 409
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 399 24.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 410
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 400 24.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 412
Intellectual Property and Copyright Statements . . . . . . . . . 401 24.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 416
24.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 417
24.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 419
24.8. Operation 10: CB_RECALLCREDIT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 420
24.9. Operation 11: CB_SEQUENCE - Supply callback channel
sequencing and control . . . . . . . . . . . . . . . . . 421
24.10. Operation 12: CB_SIZECHANGED . . . . . . . . . . . . . . 422
24.11. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 423
25. References . . . . . . . . . . . . . . . . . . . . . . . . . 424
25.1. Normative References . . . . . . . . . . . . . . . . . . 424
25.2. Informative References . . . . . . . . . . . . . . . . . 425
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 425
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 426
Intellectual Property and Copyright Statements . . . . . . . . . 427
1. Protocol Data Types 1. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC1832 [2] and RPC RFC1831 version 4 protocol are defined in the XDR RFC4506 [2] and RPC RFC1831
[3] documents. The next sections build upon the XDR data types to [3] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
1.1. Basic Data Types 1.1. Basic Data Types
These are the base NFSv4 data types. These are the base NFSv4 data types.
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
| Data Type | Definition | | Data Type | Definition |
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
skipping to change at page 13, line 18 skipping to change at page 14, line 18
struct clientaddr4 { struct clientaddr4 {
/* see struct rpcb in RFC1833 */ /* see struct rpcb in RFC1833 */
string r_netid<>; /* network id */ string r_netid<>; /* network id */
string r_addr<>; /* universal address */ string r_addr<>; /* universal address */
}; };
The clientaddr4 structure is used as part of the SETCLIENTID The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a operation to either specify the address of the client that is using a
clientid or as part of the callback registration. The r_netid and clientid or as part of the callback registration. The r_netid and
r_addr fields are specified in RFC1833 [9], but they are r_addr fields are specified in RFC1833 [10], but they are
underspecified in RFC1833 [9] as far as what they should look like underspecified in RFC1833 [10] as far as what they should look like
for specific protocols. for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string: US-ASCII string:
h1.h2.h3.h4.p1.p2 h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long. representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
skipping to change at page 16, line 12 skipping to change at page 17, line 12
by this document. This would allow custom installations to introduce by this document. This would allow custom installations to introduce
new layout types. new layout types.
[[Comment.1: Determine private range of layout types]] [[Comment.1: Determine private range of layout types]]
New layout types must be specified in RFCs approved by the IESG New layout types must be specified in RFCs approved by the IESG
before becoming part of the pNFS specification. before becoming part of the pNFS specification.
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [10], is to be used. specifies that the object layout, as defined in [11], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume
layout, as defined in [11], is to be used. layout, as defined in [12], is to be used.
1.2.18. pnfs_deviceid4 1.2.18. pnfs_deviceid4
typedef uint32_t pnfs_deviceid4; /* 32-bit device ID */ typedef uint32_t pnfs_deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system ID is qualified by the layout type and is unique per file system
(FSID). This allows different layout drivers to generate device IDs (FSID). This allows different layout drivers to generate device IDs
without the need for co-ordination. See Section 14.1.4 for more without the need for co-ordination. See Section 16.1.4 for more
details. details.
1.2.19. pnfs_netaddr4 1.2.19. pnfs_netaddr4
struct pnfs_netaddr4 { struct pnfs_netaddr4 {
string r_netid<>; /* network ID */ string r_netid<>; /* network ID */
string r_addr<>; /* universal address */ string r_addr<>; /* universal address */
}; };
For a description of the r_netid and r_addr fields see the For a description of the r_netid and r_addr fields see the
skipping to change at page 17, line 30 skipping to change at page 18, line 30
offset4 offset; offset4 offset;
length4 length; length4 length;
pnfs_layoutiomode4 iomode; pnfs_layoutiomode4 iomode;
pnfs_layouttype4 type; pnfs_layouttype4 type;
opaque layout<>; opaque layout<>;
}; };
The pnfs_layout4 structure defines a layout for a file. The layout The pnfs_layout4 structure defines a layout for a file. The layout
type specific data is opaque within this structure and must be type specific data is opaque within this structure and must be
interpreted based on the layout type. Currently, only the NFSv4 interpreted based on the layout type. Currently, only the NFSv4
file layout type is defined; see Section 16.1 for its definition. file layout type is defined; see Section 17.1 for its definition.
Since layouts are sub-dividable, the offset and length together with Since layouts are sub-dividable, the offset and length together with
the file's filehandle, the clientid, iomode, and layout type, the file's filehandle, the clientid, iomode, and layout type,
identifies the layout. identifies the layout.
[[Comment.2: there is a discussion of moving the striping [[Comment.2: there is a discussion of moving the striping
information, or more generally the "aggregation scheme", up to the information, or more generally the "aggregation scheme", up to the
generic layout level. This creates a two-layer system where the top generic layout level. This creates a two-layer system where the top
level is a switch on different data placement layouts, and the next level is a switch on different data placement layouts, and the next
level down is a switch on different data storage types. This lets level down is a switch on different data storage types. This lets
different layouts (e.g., striping or mirroring or redundant servers) different layouts (e.g., striping or mirroring or redundant servers)
skipping to change at page 18, line 28 skipping to change at page 19, line 28
opaque layouthint_data<>; opaque layouthint_data<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute It is the structure specified by the FILE_LAYOUT_HINT attribute
described below. The metadata server may ignore the hint, or may described below. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 16.1. structure as defined in Section 17.1.
1.2.24. pnfs_layoutiomode4 1.2.24. pnfs_layoutiomode4
enum pnfs_layoutiomode4 { enum pnfs_layoutiomode4 {
LAYOUTIOMODE_READ = 1, LAYOUTIOMODE_READ = 1,
LAYOUTIOMODE_RW = 2, LAYOUTIOMODE_RW = 2,
LAYOUTIOMODE_ANY = 3 LAYOUTIOMODE_ANY = 3
}; };
The iomode specifies whether the client intends to read or write The iomode specifies whether the client intends to read or write
skipping to change at page 19, line 32 skipping to change at page 20, line 32
1.2.26. impl_ident4 1.2.26. impl_ident4
struct impl_ident4 { struct impl_ident4 {
clientid4 ii_clientid; clientid4 ii_clientid;
struct nfs_impl_id4 ii_impl_id; struct nfs_impl_id4 ii_impl_id;
}; };
This is used for exchanging implementation identification between This is used for exchanging implementation identification between
client and server. client and server.
2. Filehandles 2. RPC and Security Flavor
The NFS version 4.1 protocol is a Remote Procedure Call (RPC)
application that uses RPC version 2 and the corresponding eXternal
Data Representation (XDR) as defined in [RFC1831] and [RFC4506]. The
RPCSEC_GSS security flavor as defined in [RFC2203] MUST be used as
the mechanism to deliver stronger security for the NFS version 4
protocol.
2.1. Ports and Transports
Historically, NFS version 2 and version 3 servers have resided on
port 2049. The registered port 2049 [RFC3232] for the NFS protocol
should be the default configuration. NFSv4 clients SHOULD NOT use
the RPC binding protocols as described in [RFC1833].
Where an NFS version 4 implementation supports operation over the IP
network protocol, the supported transports between NFS and IP MUST
have the following two attributes:
1. The transport must support reliable delivery of data in the order
it was sent.
2. The transport must be among the IETF-approved congestion control
transport protocols.
At the time this document was written, the only two transports that
had the above attributes were TCP and SCTP. To enhance the
possibilities for interoperability, an NFS version 4 implementation
MUST support operation over the TCP transport protocol.
If TCP is used as the transport, the client and server SHOULD use
persistent connections for at least two reasons:
1. This will prevent the weakening of TCP's congestion control via
short lived connections and will improve performance for the WAN
environment by eliminating the need for SYN handshakes.
2. The NFSv4.1 callback model has changed from NFSv4.0, and requires
the client and server to maintain a client-created channel for
the server to use.
As noted in the Security Considerations section, the authentication
model for NFS version 4 has moved from machine-based to principal-
based. However, this modification of the authentication model does
not imply a technical requirement to move the transport connection
management model from whole machine-based to one based on a per user
model. In particular, NFS over TCP client implementations have
traditionally multiplexed traffic for multiple users over a common
TCP connection between an NFS client and server. This has been true,
regardless of whether the NFS client is using AUTH_SYS, AUTH_DH,
RPCSEC_GSS or any other flavor. Similarly, NFS over TCP server
implementations have assumed such a model and thus scale the
implementation of TCP connection management in proportion to the
number of expected client machines. NFS version 4.1 will not modify
this connection management model. NFS version 4.1 clients that
violate this assumption can expect scaling issues on the server and
hence reduced service.
Note that for various timers, the client and server should avoid
inadvertent synchronization of those timers. For further discussion
of the general issue refer to [Floyd].
2.1.1. Client Retransmission Behavior
When processing a request received over a reliable transport such as
TCP, the NFS version 4.1 server MUST NOT silently drop the request,
except if the transport connection has been broken. Given such a
contract between NFS version 4.1 clients and servers, clients MUST
NOT retry a request unless one or both of the following are true:
o The transport connection has been broken
o The procedure being retried is the NULL procedure
Since reliable transports, such as TCP, do not always synchronously
inform a peer when the other peer has broken the connection (for
example, when an NFS server reboots), the NFS version 4.1 client may
want to actively "probe" the connection to see if it has been broken.
Use of the NULL procedure is one recommended way to do so. So, when
a client experiences a remote procedure call timeout (of some
arbitrary implementation specific amount), rather than retrying the
remote procedure call, it could instead issue a NULL procedure call
to the server. If the server has died, the transport connection
break will eventually be indicated to the NFS version 4.1 client.
The client can then reconnect, and then retry the original request.
If the NULL procedure call gets a response, the connection has not
broken. The client can decide to wait longer for the original
request's response, or it can break the transport connection and
reconnect before re-sending the original request.
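The retransmission rules above can be summarized in a short sketch. This is only an illustration in Python, not part of the protocol definition; the names Action, may_retransmit, and on_rpc_timeout are hypothetical, as is the idea of tracking whether a NULL probe is already outstanding.

```python
from enum import Enum, auto


class Action(Enum):
    """Hypothetical client reactions to an RPC timeout."""
    WAIT = auto()                 # keep waiting for the original reply
    PROBE_WITH_NULL = auto()      # issue a NULL procedure call as a probe
    RECONNECT_AND_RETRY = auto()  # reconnect, then re-send the request


def may_retransmit(connection_broken: bool, is_null_procedure: bool) -> bool:
    """A request may be retried only if the connection broke or it is NULL."""
    return connection_broken or is_null_procedure


def on_rpc_timeout(connection_broken: bool, probe_outstanding: bool) -> Action:
    """Pick a reaction to a timeout without violating the retry rule."""
    if connection_broken:
        # The only case in which the original request may be re-sent.
        return Action.RECONNECT_AND_RETRY
    if not probe_outstanding:
        # NULL may always be sent, so it is a safe way to test the connection.
        return Action.PROBE_WITH_NULL
    # A probe is already in flight; wait for it or for a connection break.
    return Action.WAIT
```

A client that prefers not to wait may instead break the connection itself once the probe is answered, which moves it into the RECONNECT_AND_RETRY case.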
For callbacks from the server to the client, the same rules apply,
but the server doing the callback becomes the client, and the client
receiving the callback becomes the server.
2.2. Security Flavors
Traditional RPC implementations have included AUTH_NONE, AUTH_SYS,
AUTH_DH, and AUTH_KRB4 as security flavors. With [RFC2203] an
additional security flavor of RPCSEC_GSS has been introduced which
uses the functionality of GSS-API [RFC2743]. This allows for the use
of various security mechanisms by the RPC layer without the
additional implementation overhead of adding RPC security flavors.
For NFS version 4, the RPCSEC_GSS security flavor MUST be implemented
to enable the mandatory security mechanism. Other flavors, such as
AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well.
2.2.1. Security mechanisms for NFS version 4
The use of RPCSEC_GSS requires selection of: mechanism, quality of
protection, and service (authentication, integrity, privacy). The
remainder of this document will refer to these three parameters of
the RPCSEC_GSS security as the security triple.
2.2.1.1. Kerberos V5
The Kerberos V5 GSS-API mechanism as described in [RFC1964] MUST be
implemented.
column descriptions:
1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == RPCSEC_GSS service
1 2 3 4
--------------------------------------------------------------------
390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none
390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity
390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy
Note that the pseudo flavor is presented here as a mapping aid to the
implementor. Because this NFS protocol includes a method to
negotiate security and it understands the GSS-API mechanism, the
pseudo flavor is not needed. The pseudo flavor is needed for NFS
version 3 since the security negotiation is done via the MOUNT
protocol.
For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please
see [RFC2623].
2.2.1.2. LIPKEY as a security triple
The LIPKEY GSS-API mechanism as described in [RFC2847] MUST be
implemented and provide the following security triples. The
definition of the columns matches that of the subsection "Kerberos
V5".
1 2 3 4
--------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy
2.2.1.3. SPKM-3 as a security triple
The SPKM-3 GSS-API mechanism as described in [RFC2847] MUST be
implemented and provide the following security triples. The
definition of the columns matches that of the subsection "Kerberos
V5".
1 2 3 4
--------------------------------------------------------------------
390009 spkm3 1.3.6.1.5.5.1.3 rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy
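As a mapping aid, the three tables above can be collected into a single lookup structure. The following Python sketch is illustrative only; the names PseudoFlavor, PSEUDO_FLAVORS, and flavors_for_mechanism are hypothetical, and the quality-of-protection element of the security triple is omitted because the tables above do not list it.

```python
from typing import List, NamedTuple


class PseudoFlavor(NamedTuple):
    number: int         # column 1: number of pseudo flavor
    name: str           # column 2: name of pseudo flavor
    mechanism_oid: str  # column 3: mechanism's OID
    service: str        # column 4: RPCSEC_GSS service


# Values copied from the Kerberos V5, LIPKEY, and SPKM-3 tables.
PSEUDO_FLAVORS = {
    f.number: f
    for f in [
        PseudoFlavor(390003, "krb5", "1.2.840.113554.1.2.2", "rpc_gss_svc_none"),
        PseudoFlavor(390004, "krb5i", "1.2.840.113554.1.2.2", "rpc_gss_svc_integrity"),
        PseudoFlavor(390005, "krb5p", "1.2.840.113554.1.2.2", "rpc_gss_svc_privacy"),
        PseudoFlavor(390006, "lipkey", "1.3.6.1.5.5.9", "rpc_gss_svc_none"),
        PseudoFlavor(390007, "lipkey-i", "1.3.6.1.5.5.9", "rpc_gss_svc_integrity"),
        PseudoFlavor(390008, "lipkey-p", "1.3.6.1.5.5.9", "rpc_gss_svc_privacy"),
        PseudoFlavor(390009, "spkm3", "1.3.6.1.5.5.1.3", "rpc_gss_svc_none"),
        PseudoFlavor(390010, "spkm3i", "1.3.6.1.5.5.1.3", "rpc_gss_svc_integrity"),
        PseudoFlavor(390011, "spkm3p", "1.3.6.1.5.5.1.3", "rpc_gss_svc_privacy"),
    ]
}


def flavors_for_mechanism(oid: str) -> List[PseudoFlavor]:
    """Return every pseudo flavor that uses the given mechanism OID."""
    return [f for f in PSEUDO_FLAVORS.values() if f.mechanism_oid == oid]
```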
2.3. Security Negotiation
With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its filesystem name space
that are available for use by NFS clients. In turn the NFS server
may be configured such that each of these entry points may have
different or multiple security mechanisms in use.
The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See the section "Security Considerations" for further discussion.
2.3.1. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security triple is to be
used for server access. In general, the client will not have to use
either operation except during initial communication with the server
or when the client crosses policy boundaries at the server. It is
possible that the server's policies change during the client's
interaction, therefore forcing the client to negotiate a new security
triple.
2.3.2. Security Error
Based on the assumption that each NFS version 4 client and server
must support a minimum set of security mechanisms (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its
communication with the server with one of the minimal security
triples. During communication with the server, the client may
receive an NFS error of NFS4ERR_WRONGSEC. This error allows the
server to notify the client that the security triple currently being
used is not appropriate for access to the server's filesystem
resources. The client is then responsible for determining what
security triples are available at the server and choosing one that is
appropriate for the client. See the section on the "SECINFO"
operation for further discussion of how the client will respond to
the NFS4ERR_WRONGSEC error and use SECINFO.
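The recovery sequence described above can be sketched roughly as follows. This is an illustration only, not protocol code: the FakeServer class, its access and secinfo methods, and the use of plain strings for security triples are hypothetical stand-ins for the real RPC exchange.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_WRONGSEC = "NFS4ERR_WRONGSEC"


class FakeServer:
    """Hypothetical stand-in for a server with per-object security policy."""

    def __init__(self, policy):
        # policy maps a path to the list of triples accepted for that path
        self.policy = policy

    def access(self, path, triple):
        return NFS4_OK if triple in self.policy[path] else NFS4ERR_WRONGSEC

    def secinfo(self, path):
        # Stand-in for SECINFO: report the triples acceptable for this object
        return list(self.policy[path])


def access_with_negotiation(server, path, current, supported):
    """Try `current`; on NFS4ERR_WRONGSEC, ask SECINFO and retry once.

    `supported` is the set of triples the client itself implements.
    Returns (status, triple_used).
    """
    status = server.access(path, current)
    if status != NFS4ERR_WRONGSEC:
        return status, current
    for triple in server.secinfo(path):
        if triple in supported:
            # Retry with the first server-listed triple the client supports
            return server.access(path, triple), triple
    return NFS4ERR_WRONGSEC, current  # no triple in common
```

In this sketch a client that starts with "lipkey" and reaches an object requiring "lipkey-p" ends up retrying with "lipkey-p", provided it supports that triple.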
2.3.3. Callback RPC Authentication
Callback authentication has changed in NFSv4.1 from NFSv4.0.
NFSv4.0 required the NFS server to create a security context for
RPCSEC_GSS, AUTH_DH, and AUTH_KERB4, and any other security flavor
that had a security context. It also required that the principal issuing
the callback be the same as the principal that accepted the callback
parameters (via SETCLIENTID), and that the client principal accepting
the callback be the same as that which issued the SETCLIENTID. This
required the NFS client to have an assigned machine credential.
NFSv4.1 does not require a machine credential. Instead, NFSv4.1
allows an RPCSEC_GSS security context initiated by the client and
established on both the client and server to be used on callback
RPCs sent by the server to the client. The BIND_BACKCHANNEL
operation is used to establish RPCSEC_GSS contexts (if the client so
desires) on the server. No support for AUTH_DH or AUTH_KERB4 is
specified.
2.3.4. GSS Server Principal
Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE
names are of the form:
service@hostname
For NFS, the "service" element is
nfs
Implementations of security mechanisms will convert nfs@hostname to
various forms. For Kerberos V5, LIPKEY, and SPKM-3, the
following form is RECOMMENDED:
nfs/hostname
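As a rough illustration of the name mapping above, the following sketch turns a host-based service name into the recommended form. The function name is hypothetical, and in practice this conversion happens inside the security mechanism implementation; a Kerberos V5 principal would also carry a realm, which is left out here.

```python
def hostbased_to_principal(name):
    """Convert 'service@hostname' to the 'service/hostname' form."""
    service, sep, hostname = name.partition("@")
    if not sep or not service or not hostname:
        raise ValueError("expected service@hostname, got %r" % (name,))
    return service + "/" + hostname
```

For example, passing "nfs@server.example.com" (a made-up hostname) yields "nfs/server.example.com".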
3. Filehandles
The filehandle in the NFS protocol is a per server unique identifier The filehandle in the NFS protocol is a per server unique identifier
for a filesystem object. The contents of the filehandle are opaque for a filesystem object. The contents of the filehandle are opaque
to the client. Therefore, the server is responsible for translating to the client. Therefore, the server is responsible for translating
the filehandle to an internal representation of the filesystem the filehandle to an internal representation of the filesystem
object. object.
2.1. Obtaining the First Filehandle 3.1. Obtaining the First Filehandle
The operations of the NFS protocol are defined in terms of one or The operations of the NFS protocol are defined in terms of one or
more filehandles. Therefore, the client needs a filehandle to more filehandles. Therefore, the client needs a filehandle to
initiate communication with the server. With the NFS version 2 initiate communication with the server. With the NFS version 2
protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there
exists an ancillary protocol to obtain this first filehandle. The exists an ancillary protocol to obtain this first filehandle. The
MOUNT protocol, RPC program number 100005, provides the mechanism of MOUNT protocol, RPC program number 100005, provides the mechanism of
translating a string based filesystem path name to a filehandle which translating a string based filesystem path name to a filehandle which
can then be used by the NFS protocols. can then be used by the NFS protocols.
skipping to change at page 20, line 16 skipping to change at page 26, line 29
of the public filehandle in combination with the LOOKUP operation in of the public filehandle in combination with the LOOKUP operation in
the NFS version 2 and 3 protocols, it has been demonstrated that the the NFS version 2 and 3 protocols, it has been demonstrated that the
MOUNT protocol is unnecessary for viable interaction between NFS MOUNT protocol is unnecessary for viable interaction between NFS
client and server. client and server.
Therefore, the NFS version 4 protocol will not use an ancillary Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a protocol for translation from string based path names to a
filehandle. Two special filehandles will be used as starting points filehandle. Two special filehandles will be used as starting points
for the NFS client. for the NFS client.
2.1.1. Root Filehandle 3.1.1. Root Filehandle
The first of the special filehandles is the ROOT filehandle. The The first of the special filehandles is the ROOT filehandle. The
ROOT filehandle is the "conceptual" root of the filesystem name space ROOT filehandle is the "conceptual" root of the filesystem name space
at the NFS server. The client uses or starts with the ROOT at the NFS server. The client uses or starts with the ROOT
filehandle by employing the PUTROOTFH operation. The PUTROOTFH filehandle by employing the PUTROOTFH operation. The PUTROOTFH
operation instructs the server to set the "current" filehandle to the operation instructs the server to set the "current" filehandle to the
ROOT of the server's file tree. Once this PUTROOTFH operation is ROOT of the server's file tree. Once this PUTROOTFH operation is
used, the client can then traverse the entirety of the server's file used, the client can then traverse the entirety of the server's file
tree with the LOOKUP operation. A complete discussion of the server tree with the LOOKUP operation. A complete discussion of the server
name space is in the section "NFS Server Name Space". name space is in the section "NFS Server Name Space".
2.1.2. Public Filehandle 3.1.2. Public Filehandle
The second special filehandle is the PUBLIC filehandle. Unlike the The second special filehandle is the PUBLIC filehandle. Unlike the
ROOT filehandle, the PUBLIC filehandle may be bound or represent an ROOT filehandle, the PUBLIC filehandle may be bound or represent an
arbitrary filesystem object at the server. The server is responsible arbitrary filesystem object at the server. The server is responsible
for this binding. It may be that the PUBLIC filehandle and the ROOT for this binding. It may be that the PUBLIC filehandle and the ROOT
filehandle refer to the same filesystem object. However, it is up to filehandle refer to the same filesystem object. However, it is up to
the administrative software at the server and the policies of the the administrative software at the server and the policies of the
server administrator to define the binding of the PUBLIC filehandle server administrator to define the binding of the PUBLIC filehandle
and server filesystem object. The client may not make any and server filesystem object. The client may not make any
assumptions about this binding. The client uses the PUBLIC assumptions about this binding. The client uses the PUBLIC
filehandle via the PUTPUBFH operation. filehandle via the PUTPUBFH operation.
2.2. Filehandle Types 3.2. Filehandle Types
In the NFS version 2 and 3 protocols, there was one type of In the NFS version 2 and 3 protocols, there was one type of
filehandle with a single set of semantics. This type of filehandle filehandle with a single set of semantics. This type of filehandle
is termed "persistent" in NFS Version 4. The semantics of a is termed "persistent" in NFS Version 4. The semantics of a
persistent filehandle remain the same as before. A new type of persistent filehandle remain the same as before. A new type of
filehandle introduced in NFS Version 4 is the "volatile" filehandle, filehandle introduced in NFS Version 4 is the "volatile" filehandle,
which attempts to accommodate certain server environments. which attempts to accommodate certain server environments.
The volatile filehandle type was introduced to address server The volatile filehandle type was introduced to address server
functionality or implementation issues which make correct functionality or implementation issues which make correct
skipping to change at page 21, line 19 skipping to change at page 27, line 31
invariant. Volatile filehandles may ease the implementation of invariant. Volatile filehandles may ease the implementation of
server functionality such as hierarchical storage management or server functionality such as hierarchical storage management or
filesystem reorganization or migration. However, the volatile filesystem reorganization or migration. However, the volatile
filehandle increases the implementation burden for the client. filehandle increases the implementation burden for the client.
Since the client will need to handle persistent and volatile Since the client will need to handle persistent and volatile
filehandles differently, a file attribute is defined which may be filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned used by the client to determine the filehandle types being returned
by the server. by the server.
2.2.1. General Properties of a Filehandle 3.2.1. General Properties of a Filehandle
The filehandle contains all the information the server needs to The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT doing a byte-by-byte comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file. from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use filehandles and files but this is not required. Clients MUST use
skipping to change at page 21, line 46 skipping to change at page 28, line 9
"Data Caching and File Identity". "Data Caching and File Identity".
As an example, in the case that two different path names when As an example, in the case that two different path names when
traversed at the server terminate at the same filesystem object, the traversed at the server terminate at the same filesystem object, the
server SHOULD return the same filehandle for each path. This can server SHOULD return the same filehandle for each path. This can
occur if a hard link is used to create two file names which refer to occur if a hard link is used to create two file names which refer to
the same underlying file object and associated data. For example, if the same underlying file object and associated data. For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals. return the same filehandle for both path name traversals.
2.2.2. Persistent Filehandle 3.2.2. Persistent Filehandle
A persistent filehandle is defined as having a fixed value for the A persistent filehandle is defined as having a fixed value for the
lifetime of the filesystem object to which it refers. Once the lifetime of the filesystem object to which it refers. Once the
server creates the filehandle for a filesystem object, the server server creates the filehandle for a filesystem object, the server
MUST accept the same filehandle for the object for the lifetime of MUST accept the same filehandle for the object for the lifetime of
the object. If the server restarts or reboots, the NFS server must the object. If the server restarts or reboots, the NFS server must
honor the same filehandle value as it did in the server's previous honor the same filehandle value as it did in the server's previous
instantiation. Similarly, if the filesystem is migrated, the new NFS instantiation. Similarly, if the filesystem is migrated, the new NFS
server must honor the same filehandle as the old NFS server. server must honor the same filehandle as the old NFS server.
The persistent filehandle will become stale or invalid when the The persistent filehandle will become stale or invalid when the
filesystem object is removed. When the server is presented with a filesystem object is removed. When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE. A filehandle may become stale when the an error of NFS4ERR_STALE. A filehandle may become stale when the
filesystem containing the object is no longer available. The file filesystem containing the object is no longer available. The file
system may become unavailable if it exists on removable media and the system may become unavailable if it exists on removable media and the
media is no longer available at the server or the filesystem in whole media is no longer available at the server or the filesystem in whole
has been destroyed or the filesystem has simply been removed from the has been destroyed or the filesystem has simply been removed from the
server's name space (i.e. unmounted in a UNIX environment). server's name space (i.e. unmounted in a UNIX environment).
2.2.3. Volatile Filehandle 3.2.3. Volatile Filehandle
A volatile filehandle does not share the same longevity A volatile filehandle does not share the same longevity
characteristics as a persistent filehandle. The server may determine characteristics as a persistent filehandle. The server may determine
that a volatile filehandle is no longer valid at many different that a volatile filehandle is no longer valid at many different
points in time. If the server can definitively determine that a points in time. If the server can definitively determine that a
volatile filehandle refers to an object that has been removed, the volatile filehandle refers to an object that has been removed, the
server should return NFS4ERR_STALE to the client (as is the case for server should return NFS4ERR_STALE to the client (as is the case for
persistent filehandles). In all other cases where the server persistent filehandles). In all other cases where the server
determines that a volatile filehandle can no longer be used, it determines that a volatile filehandle can no longer be used, it
should return an error of NFS4ERR_FHEXPIRED. should return an error of NFS4ERR_FHEXPIRED.
skipping to change at page 23, line 37 skipping to change at page 29, line 45
This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set, set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set,
or if a non-readonly file system has a transition target in a or if a non-readonly file system has a transition target in a
different _handle_ class. In these cases, the server should deny a different _handle_ class. In these cases, the server should deny a
RENAME or REMOVE that would affect an OPEN file or any of the RENAME or REMOVE that would affect an OPEN file or any of the
components leading to the OPEN file. In addition, the server should components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period, in order deny all RENAME or REMOVE requests during the grace period, in order
to make sure that reclaims of files where filehandles may have to make sure that reclaims of files where filehandles may have
expired do not do a reclaim for the wrong file. expired do not do a reclaim for the wrong file.
2.3. One Method of Constructing a Volatile Filehandle 3.3. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client, could contain: A volatile filehandle, while opaque to the client, could contain:
[volatile bit = 1 | server boot time | slot | generation number] [volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/ o generation number is the generation number for the table entry/
slot slot
When the client presents a volatile filehandle, the server makes the When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit following checks, which assume that the check for the volatile bit
has passed. If the server boot time is less than the current server has passed. If the server boot time is less than the current server
boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_BADHANDLE. If the generation number does not match, return
NFS4ERR_FHEXPIRED. NFS4ERR_FHEXPIRED.
When the server reboots, the table is gone (it is volatile). When the server reboots, the table is gone (it is volatile).
If the volatile bit is 0, then it is a persistent filehandle with a If the volatile bit is 0, then it is a persistent filehandle with a
different structure following it. different structure following it.
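The checks described in this section can be sketched as below. This is one possible reading under stated assumptions, not a definitive implementation: the byte layout in FH_FORMAT (a one-byte volatile flag followed by three 32-bit fields), the VolatileTable class, and its method names are all hypothetical, since the text leaves field widths and the table design to the server.

```python
import struct

NFS4_OK = "NFS4_OK"
NFS4ERR_FHEXPIRED = "NFS4ERR_FHEXPIRED"
NFS4ERR_BADHANDLE = "NFS4ERR_BADHANDLE"

# Assumed layout (hypothetical): 1-byte volatile flag, then three unsigned
# 32-bit big-endian fields: server boot time, slot, generation number.
FH_FORMAT = "!BIII"


class VolatileTable:
    """Hypothetical in-memory table of volatile filehandle slots."""

    def __init__(self, boot_time, size):
        self.boot_time = boot_time
        self.generations = [0] * size  # current generation of each slot

    def make_handle(self, slot):
        return struct.pack(FH_FORMAT, 1, self.boot_time, slot,
                           self.generations[slot])

    def recycle(self, slot):
        # Reusing a slot bumps its generation, which expires older handles
        self.generations[slot] += 1

    def check(self, fh):
        volatile, boot_time, slot, generation = struct.unpack(FH_FORMAT, fh)
        if volatile != 1:
            raise ValueError("not a volatile filehandle")
        if boot_time < self.boot_time:
            return NFS4ERR_FHEXPIRED   # issued before the last reboot
        if slot >= len(self.generations):
            return NFS4ERR_BADHANDLE   # slot out of range
        if generation != self.generations[slot]:
            return NFS4ERR_FHEXPIRED   # slot has been reused since issue
        return NFS4_OK
```

Because the table lives only in memory in this sketch, a restarted server starts with a new boot time and every handle issued earlier fails the first check.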
2.4. Client Recovery from Filehandle Expiration 3.4. Client Recovery from Filehandle Expiration
If possible, the client SHOULD recover from the receipt of an If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error. The client must take on additional NFS4ERR_FHEXPIRED error. The client must take on additional
responsibility so that it may prepare itself to recover from the responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle. If the server returns expiration of a volatile filehandle. If the server returns
persistent filehandles, the client does not need these additional persistent filehandles, the client does not need these additional
steps. steps.
For volatile filehandles, most commonly the client will need to store For volatile filehandles, most commonly the client will need to store
the component names leading up to and including the filesystem object the component names leading up to and including the filesystem object
skipping to change at page 25, line 5 skipping to change at page 31, line 13
like: like:
RENAME A B RENAME A B
LOOKUP B LOOKUP B
GETFH GETFH
Note that the COMPOUND procedure does not provide atomicity. This Note that the COMPOUND procedure does not provide atomicity. This
example only reduces the overhead of recovering from an expired example only reduces the overhead of recovering from an expired
filehandle. filehandle.
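Assuming the client has stored the component names leading to an object, one way to picture the recovery request is as a list of operations that re-walks the path from the ROOT filehandle. The function name and the tuple representation below are hypothetical; a real client would encode these as COMPOUND arguments, and, as noted above, the sequence is not atomic.

```python
def relookup_ops(components):
    """Build an operation list that re-walks a stored path from the ROOT."""
    ops = [("PUTROOTFH",)]
    for name in components:
        ops.append(("LOOKUP", name))
    ops.append(("GETFH",))  # fetch the replacement filehandle
    return ops
```

For a stored path with components "a", "b", "c", this produces PUTROOTFH, three LOOKUP operations, and a final GETFH.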
3. File Attributes 4. File Attributes
To meet the requirements of extensibility and increased To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes must be handled interoperability with non-UNIX platforms, attributes must be handled
in a flexible manner. The NFS version 3 fattr3 structure contains a in a flexible manner. The NFS version 3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able to fixed list of attributes that not all clients and servers are able to
support or care about. The fattr3 structure cannot be extended as support or care about. The fattr3 structure cannot be extended as
new needs arise and it provides no way to indicate non-support. With new needs arise and it provides no way to indicate non-support. With
the NFS version 4 protocol, the client is able to query what attributes the NFS version 4 protocol, the client is able to query what attributes
the server supports and construct requests with only those supported the server supports and construct requests with only those supported
attributes (or a subset thereof). attributes (or a subset thereof).
skipping to change at page 26, line 18 skipping to change at page 32, line 34
reasonably computable by the client when support is not provided on reasonably computable by the client when support is not provided on
the server. the server.
Note that the hidden directory returned by OPENATTR is a convenience Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing. The client should not make any assumptions for protocol processing. The client should not make any assumptions
about the server's implementation of named attributes and whether the about the server's implementation of named attributes and whether the
underlying filesystem at the server has a named attribute directory underlying filesystem at the server has a named attribute directory
or not. Therefore, operations such as SETATTR and GETATTR on the or not. Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined. named attribute directory are undefined.
3.1. Mandatory Attributes 4.1. Mandatory Attributes
These MUST be supported by every NFS version 4 client and server in These MUST be supported by every NFS version 4 client and server in
order to ensure a minimum level of interoperability. The server must order to ensure a minimum level of interoperability. The server must
store and return these attributes and the client must be able to store and return these attributes and the client must be able to
function with an attribute set limited to these attributes. With function with an attribute set limited to these attributes. With
just the mandatory attributes some client functionality may be just the mandatory attributes some client functionality may be
impaired or limited in some ways. A client may ask for any of these impaired or limited in some ways. A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request and attributes to be returned by setting a bit in the GETATTR request and
the server must return their value. the server must return their value.
3.2. Recommended Attributes 4.2. Recommended Attributes
These attributes are understood well enough to warrant support in the These attributes are understood well enough to warrant support in the
NFS version 4 protocol. However, they may not be supported on all NFS version 4 protocol. However, they may not be supported on all
clients and servers. A client may ask for any of these attributes to clients and servers. A client may ask for any of these attributes to
be returned by setting a bit in the GETATTR request but must handle be returned by setting a bit in the GETATTR request but must handle
the case where the server does not return them. A client may ask for the case where the server does not return them. A client may ask for
the set of attributes the server supports and should not request the set of attributes the server supports and should not request
attributes the server does not support. A server should be tolerant attributes the server does not support. A server should be tolerant
of requests for unsupported attributes and simply not return them of requests for unsupported attributes and simply not return them
rather than considering the request an error. It is expected that rather than considering the request an error. It is expected that
servers will support all attributes they comfortably can and only servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their fail to support attributes which are difficult to support in their
operating environments. A server should provide attributes whenever operating environments. A server should provide attributes whenever
they don't have to "tell lies" to the client. For example, a file they don't have to "tell lies" to the client. For example, a file
modification time should be either an accurate time or should not be modification time should be either an accurate time or should not be
supported by the server. This will not always be comfortable to supported by the server. This will not always be comfortable to
clients but the client is better positioned to decide whether and how to clients but the client is better positioned to decide whether and how to
fabricate or construct an attribute or whether to do without the fabricate or construct an attribute or whether to do without the
attribute. attribute.
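The idea of requesting attributes by setting bits, and of not asking for attributes the server does not support, can be sketched as follows. The attribute numbers come from the definition tables in this document; the encoding as a list of 32-bit words (bit n stored in word n // 32) is an assumption made for this sketch, and the helper names are hypothetical.

```python
# Attribute numbers taken from the definition tables in this document.
ATTR_SUPP_ATTR = 0
ATTR_RDATTR_ERROR = 11
ATTR_ACL = 12
ATTR_ACLSUPPORT = 13
ATTR_ARCHIVE = 14
ATTR_FILEHANDLE = 19


def make_bitmap(attr_numbers):
    """Encode attribute numbers as 32-bit words (bit n lives in word n // 32)."""
    words = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def restrict_to_supported(requested, supported):
    """Drop requested bits that are absent from the server's supp_attr bitmap."""
    # zip stops at the shorter list, so words the server never sent are dropped
    return [r & s for r, s in zip(requested, supported)]
```

A client would first fetch supp_attr, then pass its wish list through restrict_to_supported before building the GETATTR request.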
3.3. Named Attributes 4.3. Named Attributes
These attributes are not supported by direct encoding in the NFS These attributes are not supported by direct encoding in the NFS
Version 4 protocol but are accessed by string names rather than Version 4 protocol but are accessed by string names rather than
numbers and correspond to an uninterpreted stream of bytes which are numbers and correspond to an uninterpreted stream of bytes which are
stored with the filesystem object. The name space for these stored with the filesystem object. The name space for these
attributes may be accessed by using the OPENATTR operation. The attributes may be accessed by using the OPENATTR operation. The
OPENATTR operation returns a filehandle for a virtual "attribute OPENATTR operation returns a filehandle for a virtual "attribute
directory" and further perusal of the name space may be done using directory" and further perusal of the name space may be done using
READDIR and LOOKUP operations on this filehandle. Named attributes READDIR and LOOKUP operations on this filehandle. Named attributes
may then be examined or changed by normal READ and WRITE and CREATE may then be examined or changed by normal READ and WRITE and CREATE
skipping to change at page 27, line 32 skipping to change at page 33, line 44
attributes, a client which is also able to handle them should be able attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as regular directory entries are valid for named attribute names as
well. well.
Names of attributes will not be controlled by this document or other Names of attributes will not be controlled by this document or other
IETF standards track documents. See the section "IANA IETF standards track documents. See the section "IANA
Considerations" for further discussion. Considerations" for further discussion.
3.4. Classification of Attributes 4.4. Classification of Attributes
Each of the Mandatory and Recommended attributes can be classified in Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per filesystem, or per one of three categories: per server, per filesystem, or per
filesystem object. Note that it is possible that some per filesystem filesystem object. Note that it is possible that some per filesystem
attributes may vary within the filesystem. See the "homogeneous" attributes may vary within the filesystem. See the "homogeneous"
attribute for its definition. Note that the attributes attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR. and time_modify, and are used in a special instance of SETATTR.
skipping to change at page 28, line 5 skipping to change at page 34, line 18
lease_time lease_time
o The per filesystem attributes are: o The per filesystem attributes are:
supp_attr, fh_expire_type, link_support, symlink_support, supp_attr, fh_expire_type, link_support, symlink_support,
unique_handles, aclsupport, cansettime, case_insensitive, unique_handles, aclsupport, cansettime, case_insensitive,
case_preserving, chown_restricted, files_avail, files_free, case_preserving, chown_restricted, files_avail, files_free,
files_total, fs_locations, homogeneous, maxfilesize, maxname, files_total, fs_locations, homogeneous, maxfilesize, maxname,
maxread, maxwrite, no_trunc, space_avail, space_free, maxread, maxwrite, no_trunc, space_avail, space_free,
space_total, time_delta, fs_layouttype, send_impl_id, space_total, time_delta, fs_layout_type, send_impl_id,
recv_impl_id recv_impl_id
o The per filesystem object attributes are: o The per filesystem object attributes are:
type, change, size, named_attr, fsid, rdattr_error, filehandle, type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode, ACL, archive, fileid, hidden, maxlink, mimetype, mode,
numlinks, owner, owner_group, rawdev, space_used, system, numlinks, owner, owner_group, rawdev, space_used, system,
time_access, time_backup, time_create, time_metadata, time_access, time_backup, time_create, time_metadata,
time_modify, mounted_on_fileid, layouttype, layouthint, time_modify, mounted_on_fileid, layout_type, layout_hint,
layout_blksize, layout_alignment layout_blksize, layout_alignment
For quota_avail_hard, quota_avail_soft, and quota_used see their For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification. definitions below for the appropriate classification.
3.5. Mandatory Attributes - Definitions 4.5. Mandatory Attributes - Definitions
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| supp_attr | 0 | bitmap | READ | The bit vector which | | supp_attr | 0 | bitmap | READ | The bit vector which |
| | | | | would retrieve all | | | | | | would retrieve all |
| | | | | mandatory and | | | | | | mandatory and |
| | | | | recommended | | | | | | recommended |
| | | | | attributes that are | | | | | | attributes that are |
| | | | | supported for this | | | | | | supported for this |
skipping to change at page 30, line 16 skipping to change at page 36, line 29
| | | | | seconds. | | | | | | seconds. |
| rdattr_error | 11 | enum | READ | Error returned from | | rdattr_error | 11 | enum | READ | Error returned from |
| | | | | getattr during | | | | | | getattr during |
| | | | | readdir. | | | | | | readdir. |
| filehandle | 19 | nfs_fh4 | READ | The filehandle of | | filehandle | 19 | nfs_fh4 | READ | The filehandle of |
| | | | | this object | | | | | | this object |
| | | | | (primarily for | | | | | | (primarily for |
| | | | | readdir requests). | | | | | | readdir requests). |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
3.6. Recommended Attributes - Definitions 4.6. Recommended Attributes - Definitions
+--------------------+-----+--------------+--------+----------------+ +--------------------+----+--------------+--------+-----------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+--------------------+-----+--------------+--------+----------------+ +--------------------+----+--------------+--------+-----------------+
| ACL | 12 | nfsace4<> | R/W | The access | | ACL | 12 | nfsace4<> | R/W | The access |
| | | | | control list | | | | | | control list |
| | | | | for the | | | | | | for the object. |
| | | | | object. |
| aclsupport | 13 | uint32 | READ | Indicates what | | aclsupport | 13 | uint32 | READ | Indicates what |
| | | | | types of ACLs | | | | | | types of ACLs |
| | | | | are supported | | | | | | are supported |
| | | | | on the current | | | | | | on the current |
| | | | | filesystem. | | | | | | filesystem. |
| archive | 14 | bool | R/W | True, if this | | archive | 14 | bool | R/W | True, if this |
| | | | | file has been | | | | | | file has been |
| | | | | archived since | | | | | | archived since |
| | | | | the time of | | | | | | the time of |
| | | | | last | | | | | | last |
skipping to change at page 31, line 7 skipping to change at page 37, line 16
| | | | | change the | | | | | | change the |
| | | | | times for a | | | | | | times for a |
| | | | | filesystem | | | | | | filesystem |
| | | | | object as | | | | | | object as |
| | | | | specified in a | | | | | | specified in a |
| | | | | SETATTR | | | | | | SETATTR |
| | | | | operation. | | | | | | operation. |
| case_insensitive | 16 | bool | READ | True, if | | case_insensitive | 16 | bool | READ | True, if |
| | | | | filename | | | | | | filename |
| | | | | comparisons on | | | | | | comparisons on |
| | | | | this | | | | | | this filesystem |
| | | | | filesystem are | | | | | | are case |
| | | | | case |
| | | | | insensitive. | | | | | | insensitive. |
| case_preserving | 17 | bool | READ | True, if | | case_preserving | 17 | bool | READ | True, if |
| | | | | filename case | | | | | | filename case |
| | | | | on this | | | | | | on this |
| | | | | filesystem are | | | | | | filesystem are |
| | | | | preserved. | | | | | | preserved. |
| chown_restricted | 18 | bool | READ | If TRUE, the | | chown_restricted | 18 | bool | READ | If TRUE, the |
| | | | | server will | | | | | | server will |
| | | | | reject any | | | | | | reject any |
| | | | | request to | | | | | | request to |
| | | | | change either | | | | | | change either |
| | | | | the owner or | | | | | | the owner or |
| | | | | the group | | | | | | the group |
| | | | | associated | | | | | | associated with |
| | | | | with a file if | | | | | | a file if the |
| | | | | the caller is | | | | | | caller is not a |
| | | | | not a | | | | | | privileged user |
| | | | | privileged | | | | | | (for example, |
| | | | | user (for |
| | | | | example, |
| | | | | "root" in UNIX | | | | | | "root" in UNIX |
| | | | | operating | | | | | | operating |
| | | | | environments | | | | | | environments or |
| | | | | or in Windows | | | | | | in Windows 2000 |
| | | | | 2000 the "Take | | | | | | the "Take |
| | | | | Ownership" | | | | | | Ownership" |
| | | | | privilege). | | | | | | privilege). |
| dir_notif_delay | 56 | nfstime4 | READ | notification |
| | | | | delays on |
| | | | | directory |
| | | | | attributes |
| dirent_notif_delay | 57 | nfstime4 | READ | notification |
| | | | | delays on child |
| | | | | attributes |
| fileid | 20 | uint64 | READ | A number | | fileid | 20 | uint64 | READ | A number |
| | | | | uniquely | | | | | | uniquely |
| | | | | identifying | | | | | | identifying the |
| | | | | the file | | | | | | file within the |
| | | | | within the |
| | | | | filesystem. | | | | | | filesystem. |
| files_avail | 21 | uint64 | READ | File slots | | files_avail | 21 | uint64 | READ | File slots |
| | | | | available to | | | | | | available to |
| | | | | this user on | | | | | | this user on |
| | | | | the filesystem | | | | | | the filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object - | | | | | | object - this |
| | | | | this should be | | | | | | should be the |
| | | | | the smallest | | | | | | smallest |
| | | | | relevant | | | | | | relevant limit. |
| | | | | limit. | | files_free | 22 | uint64 | READ | Free file slots |
| files_free | 22 | uint64 | READ | Free file | | | | | | on the |
| | | | | slots on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object - | | | | | | object - this |
| | | | | this should be | | | | | | should be the |
| | | | | the smallest | | | | | | smallest |
| | | | | relevant | | | | | | relevant limit. |
| | | | | limit. |
| files_total | 23 | uint64 | READ | Total file | | files_total | 23 | uint64 | READ | Total file |
| | | | | slots on the | | | | | | slots on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object. | | | | | | object. |
| fs_locations | 24 | fs_locations | READ | Locations | | fs_absent | 60 | bool | READ | Is current |
| | | | | where this | | | | | | filesystem |
| | | | | filesystem may | | | | | | present or |
| | | | | be found. If | | | | | | absent. |
| | | | | the server | | fs_layout_type | 62 | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the filesystem. |
| fs_locations | 24 | fs_locations | READ | Locations where |
| | | | | this filesystem |
| | | | | may be found. |
| | | | | If the server |
| | | | | returns | | | | | | returns |
| | | | | NFS4ERR_MOVED | | | | | | NFS4ERR_MOVED |
| | | | | as an error, | | | | | | as an error, |
| | | | | this attribute | | | | | | this attribute |
| | | | | MUST be | | | | | | MUST be |
| | | | | supported. | | | | | | supported. |
| fs_locations_info | 67 | | READ | Full function |
| | | | | filesystem |
| | | | | location. |
| fs_status | 61 | fs4_status | READ | Generic |
| | | | | filesystem type |
| | | | | information. |
| hidden | 25 | bool | R/W | True, if the | | hidden | 25 | bool | R/W | True, if the |
| | | | | file is | | | | | | file is |
| | | | | considered | | | | | | considered |
| | | | | hidden with | | | | | | hidden with |
| | | | | respect to the | | | | | | respect to the |
| | | | | Windows API? | | | | | | Windows API? |
| homogeneous | 26 | bool | READ | True, if this | | homogeneous | 26 | bool | READ | True, if this |
| | | | | object's | | | | | | object's |
| | | | | filesystem is | | | | | | filesystem is |
| | | | | homogeneous, | | | | | | homogeneous, |
| | | | | i.e. are per | | | | | | i.e. are per |
| | | | | filesystem | | | | | | filesystem |
| | | | | attributes the | | | | | | attributes the |
| | | | | same for all | | | | | | same for all |
| | | | | filesystem's | | | | | | filesystem's |
| | | | | objects. | | | | | | objects. |
| layout_alignment | 66 | uint32_t | READ | Preferred |
| | | | | alignment for |
| | | | | layout related |
| | | | | I/O. |
| layout_blksize | 65 | uint32_t | READ | Preferred block |
| | | | | size for layout |
| | | | | related I/O. |
| layout_hint | 63 | layouthint4 | WRITE | Client |
| | | | | specified hint |
| | | | | for file |
| | | | | layout. |
| layout_type | 64 | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the file. |
| maxfilesize | 27 | uint64 | READ | Maximum | | maxfilesize | 27 | uint64 | READ | Maximum |
| | | | | supported file | | | | | | supported file |
| | | | | size for the | | | | | | size for the |
| | | | | filesystem of | | | | | | filesystem of |
| | | | | this object. | | | | | | this object. |
| maxlink | 28 | uint32 | READ | Maximum number | | maxlink | 28 | uint32 | READ | Maximum number |
| | | | | of links for | | | | | | of links for |
| | | | | this object. | | | | | | this object. |
| maxname | 29 | uint32 | READ | Maximum | | maxname | 29 | uint32 | READ | Maximum |
| | | | | filename size | | | | | | filename size |
skipping to change at page 33, line 49 skipping to change at page 40, line 27
| | | | | of this | | | | | | of this |
| | | | | attribute can | | | | | | attribute can |
| | | | | lead to the | | | | | | lead to the |
| | | | | client either | | | | | | client either |
| | | | | wasting | | | | | | wasting |
| | | | | bandwidth or | | | | | | bandwidth or |
| | | | | not receiving | | | | | | not receiving |
| | | | | the best | | | | | | the best |
| | | | | performance. | | | | | | performance. |
| mimetype | 32 | utf8<> | R/W | MIME body | | mimetype | 32 | utf8<> | R/W | MIME body |
| | | | | type/subtype | | | | | | type/subtype of |
| | | | | of this | | | | | | this object. |
| | | | | object. | | mode | 33 | mode4 | R/W | UNIX-style mode |
| mode | 33 | mode4 | R/W | UNIX-style | | | | | | and permission |
| | | | | mode and |
| | | | | permission |
| | | | | bits for this | | | | | | bits for this |
| | | | | object. | | | | | | object. |
| no_trunc | 34 | bool | READ | True, if a | | mounted_on_fileid | 55 | uint64 | READ | Like fileid, |
| | | | | name longer | | | | | | but if the |
| | | | | than name_max | | | | | | target |
| | | | | is used, an | | | | | | filehandle is |
| | | | | error is | | | | | | the root of a |
| | | | | returned and | | | | | | filesystem |
| | | | | return the |
| | | | | fileid of the |
| | | | | underlying |
| | | | | directory. |
| no_trunc | 34 | bool | READ | True, if a name |
| | | | | longer than |
| | | | | name_max is |
| | | | | used, an error |
| | | | | is returned and |
| | | | | name is not | | | | | | name is not |
| | | | | truncated. | | | | | | truncated. |
| numlinks | 35 | uint32 | READ | Number of hard | | numlinks | 35 | uint32 | READ | Number of hard |
| | | | | links to this | | | | | | links to this |
| | | | | object. | | | | | | object. |
| owner | 36 | utf8<> | R/W | The string | | owner | 36 | utf8<> | R/W | The string name |
| | | | | name of the | | | | | | of the owner of |
| | | | | owner of this | | | | | | this object. |
| | | | | object. | | owner_group | 37 | utf8<> | R/W | The string name |
| owner_group | 37 | utf8<> | R/W | The string | | | | | | of the group |
| | | | | name of the |
| | | | | group |
| | | | | ownership of | | | | | | ownership of |
| | | | | this object. | | | | | | this object. |
| quota_avail_hard | 38 | uint64 | READ | For definition | | quota_avail_hard | 38 | uint64 | READ | For definition |
| | | | | see "Quota | | | | | | see "Quota |
| | | | | Attributes" | | | | | | Attributes" |
| | | | | section below. | | | | | | section below. |
| quota_avail_soft | 39 | uint64 | READ | For definition | | quota_avail_soft | 39 | uint64 | READ | For definition |
| | | | | see "Quota | | | | | | see "Quota |
| | | | | Attributes" | | | | | | Attributes" |
| | | | | section below. | | | | | | section below. |
| quota_used | 40 | uint64 | READ | For definition | | quota_used | 40 | uint64 | READ | For definition |
| | | | | see "Quota | | | | | | see "Quota |
| | | | | Attributes" | | | | | | Attributes" |
| | | | | section below. | | | | | | section below. |
| rawdev | 41 | specdata4 | READ | Raw device | | rawdev | 41 | specdata4 | READ | Raw device |
| | | | | identifier. | | | | | | identifier. |
| | | | | UNIX device | | | | | | UNIX device |
| | | | | major/minor | | | | | | major/minor |
| | | | | node | | | | | | node |
| | | | | information. | | | | | | information. If |
| | | | | If the value | | | | | | the value of |
| | | | | of type is not | | | | | | type is not |
| | | | | NF4BLK or | | | | | | NF4BLK or |
| | | | | NF4CHR, the | | | | | | NF4CHR, the |
| | | | | value returned | | | | | | value returned |
| | | | | SHOULD NOT be | | | | | | SHOULD NOT be |
| | | | | considered | | | | | | considered |
| | | | | useful. | | | | | | useful. |
| recv_impl_id | 59 | nfs_impl_id4 | READ | Client obtains |
| | | | | server |
| | | | | implementation |
| | | | | via GETATTR. |
| send_impl_id | 58 | impl_ident4 | WRITE | Client provides |
| | | | | server with |
| | | | | implementation |
| | | | | identity via |
| | | | | SETATTR. |
| space_avail | 42 | uint64 | READ | Disk space in | | space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes | | | | | | bytes available |
| | | | | available to | | | | | | to this user on |
| | | | | this user on |
| | | | | the filesystem | | | | | | the filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object - | | | | | | object - this |
| | | | | this should be | | | | | | should be the |
| | | | | the smallest | | | | | | smallest |
| | | | | relevant | | | | | | relevant limit. |
| | | | | limit. | | space_free | 43 | uint64 | READ | Free disk space |
| space_free | 43 | uint64 | READ | Free disk | | | | | | in bytes on the |
| | | | | space in bytes |
| | | | | on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object - | | | | | | object - this |
| | | | | this should be | | | | | | should be the |
| | | | | the smallest | | | | | | smallest |
| | | | | relevant | | | | | | relevant limit. |
| | | | | limit. |
| space_total | 44 | uint64 | READ | Total disk | | space_total | 44 | uint64 | READ | Total disk |
| | | | | space in bytes | | | | | | space in bytes |
| | | | | on the | | | | | | on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing | | | | | | containing this |
| | | | | this object. | | | | | | object. |
| space_used | 45 | uint64 | READ | Number of | | space_used | 45 | uint64 | READ | Number of |
| | | | | filesystem | | | | | | filesystem |
| | | | | bytes | | | | | | bytes allocated |
| | | | | allocated to | | | | | | to this object. |
| | | | | this object. |
| system | 46 | bool | R/W | True, if this | | system | 46 | bool | R/W | True, if this |
| | | | | file is a | | | | | | file is a |
| | | | | "system" file | | | | | | "system" file |
| | | | | with respect | | | | | | with respect to |
| | | | | to the Windows | | | | | | the Windows |
| | | | | API? | | | | | | API? |
| time_access | 47 | nfstime4 | READ | The time of | | time_access | 47 | nfstime4 | READ | The time of |
| | | | | last access to | | | | | | last access to |
| | | | | the object by | | | | | | the object by a |
| | | | | a read that | | | | | | read that was |
| | | | | was satisfied | | | | | | satisfied by |
| | | | | by the server. | | | | | | the server. |
| time_access_set | 48 | settime4 | WRITE | Set the time | | time_access_set | 48 | settime4 | WRITE | Set the time of |
| | | | | of last access | | | | | | last access to |
| | | | | to the object. | | | | | | the object. |
| | | | | SETATTR use | | | | | | SETATTR use |
| | | | | only. | | | | | | only. |
| time_backup | 49 | nfstime4 | R/W | The time of | | time_backup | 49 | nfstime4 | R/W | The time of |
| | | | | last backup of | | | | | | last backup of |
| | | | | the object. | | | | | | the object. |
| time_create | 50 | nfstime4 | R/W | The time of | | time_create | 50 | nfstime4 | R/W | The time of |
| | | | | creation of | | | | | | creation of the |
| | | | | the object. | | | | | | object. This |
| | | | | This attribute | | | | | | attribute does |
| | | | | does not have | | | | | | not have any |
| | | | | any relation | | | | | | relation to the |
| | | | | to the |
| | | | | traditional | | | | | | traditional |
| | | | | UNIX file | | | | | | UNIX file |
| | | | | attribute | | | | | | attribute |
| | | | | "ctime" or | | | | | | "ctime" or |
| | | | | "change time". | | | | | | "change time". |
| time_delta | 51 | nfstime4 | READ | Smallest | | time_delta | 51 | nfstime4 | READ | Smallest useful |
| | | | | useful server | | | | | | server time |
| | | | | time |
| | | | | granularity. | | | | | | granularity. |
| time_metadata | 52 | nfstime4 | READ | The time of | | time_metadata | 52 | nfstime4 | READ | The time of |
| | | | | last meta-data | | | | | | last meta-data |
| | | | | modification | | | | | | modification of |
| | | | | of the object. | | | | | | the object. |
| time_modify | 53 | nfstime4 | READ | The time of | | time_modify | 53 | nfstime4 | READ | The time of |
| | | | | last | | | | | | last |
| | | | | modification | | | | | | modification to |
| | | | | to the object. | | | | | | the object. |
| time_modify_set | 54 | settime4 | WRITE | Set the time | | time_modify_set | 54 | settime4 | WRITE | Set the time of |
| | | | | of last | | | | | | last |
| | | | | modification | | | | | | modification to |
| | | | | to the object. | | | | | | the object. |
| | | | | SETATTR use | | | | | | SETATTR use |
| | | | | only. | | | | | | only. |
| mounted_on_fileid | 55 | uint64 | READ | Like fileid, | +--------------------+----+--------------+--------+-----------------+
| | | | | but if the |
| | | | | target |
| | | | | filehandle is |
| | | | | the root of a |
| | | | | filesystem |
| | | | | return the |
| | | | | fileid of the |
| | | | | underlying |
| | | | | directory. |
| send_impl_id | TBD | impl_ident4 | WRITE | Client |
| | | | | provides |
| | | | | server with |
| | | | | implementation |
| | | | | identity via |
| | | | | SETATTR. |
| recv_impl_id | TBD | nfs_impl_id4 | READ | Client obtains |
| | | | | server |
| | | | | implementation |
| | | | | via GETATTR. |
| dir_notif_delay | TBD | R/W | READ | notification |
| | | | | delays on |
| | | | | directory |
| | | | | attributes |
| dirent_notif_delay | TBD | R/W | READ | notification |
| | | | | delays on |
| | | | | child |
| | | | | attributes |
| fs_layouttype | TBD | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the |
| | | | | filesystem. |
| layouttype | TBD | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the file. |
| layouthint | TBD | layouthint4 | WRITE | Client |
| | | | | specified hint |
| | | | | for file |
| | | | | layout. |
| layout_blksize | TBD | uint32_t | READ | Preferred |
| | | | | block size for |
| | | | | layout related |
| | | | | I/O. |
| layout_alignment | TBD | uint32_t | READ | Preferred |
| | | | | alignment for |
| | | | | layout related |
| | | | | I/O. |
| fs_absent | TBD | bool | READ | Is current |
| | | | | filesystem |
| | | | | present or |
| | | | | absent. |
| fs_locations_info | TBD | | READ | Full function |
| | | | | filesystem |
| | | | | location. |
| fs_status | TBD | fs4_status | READ | Generic |
| | | | | filesystem |
| | | | | type |
| | | | | information. |
| | TBD | | READ | desc |
| | TBD | | READ | desc |
+--------------------+-----+--------------+--------+----------------+
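The attribute numbers in the "#" column select bits in the attribute bitmap carried by GETATTR and SETATTR requests. The following Python sketch is illustrative only. It assumes the usual NFSv4 bitmap layout, in which attribute n occupies bit (n mod 32) of 32-bit word (n / 32); the function name attr_bitmap is hypothetical and is not part of the protocol.

```python
def attr_bitmap(attr_numbers):
    """Build a list of 32-bit words with one bit set per attribute number.

    Assumption: attribute n maps to bit (n % 32) of word (n // 32).
    """
    words = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        # Grow the bitmap with zero words until the target word exists.
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


# Numbers taken from the new-draft column of the table:
# fileid = 20, mounted_on_fileid = 55, fs_layout_type = 62.
request = attr_bitmap([20, 55, 62])
```

Attributes numbered 64 and above, such as layout_type (64), need a third word, so a client that only ever sends two words cannot request them.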
3.7. Time Access 4.7. Time Access
As defined above, the time_access attribute represents the time of As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server. last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating The notion of what is an "access" depends on the server's operating
environment and/or the server's filesystem semantics. For example, environment and/or the server's filesystem semantics. For example,
for servers obeying POSIX semantics, time_access would be updated for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object. Of course, setting operations that modify the content of the object. Of course, setting
the corresponding time_access_set attribute is another way to modify the corresponding time_access_set attribute is another way to modify
the time_access attribute. the time_access attribute.
Whenever the file object resides on a writable filesystem, the server Whenever the file object resides on a writable filesystem, the server
should make best efforts to record time_access into stable storage. should make best efforts to record time_access into stable storage.
However, to mitigate the performance effects of doing so, and most However, to mitigate the performance effects of doing so, and most
especially whenever the server is satisfying the read of the object's especially whenever the server is satisfying the read of the object's
content from its cache, the server MAY cache access time updates and content from its cache, the server MAY cache access time updates and
lazily write them to stable storage. It is also acceptable to give lazily write them to stable storage. It is also acceptable to give
administrators of the server the option to disable time_access administrators of the server the option to disable time_access
updates. updates.
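The caching behavior permitted above can be sketched as follows. This is a minimal illustration, not a prescribed design: the class AccessTimeCache, its method names, and the use of a dictionary as a stand-in for stable storage are all assumptions made for the example.

```python
class AccessTimeCache:
    """Hold time_access updates in memory and write them out lazily."""

    def __init__(self, stable_store, enabled=True):
        self.stable_store = stable_store  # dict standing in for stable storage
        self.enabled = enabled            # administrator switch for updates
        self.pending = {}                 # fileid -> newest unflushed time

    def record_read(self, fileid, now):
        # Called when a read is satisfied; nothing is written to storage here.
        if self.enabled:
            self.pending[fileid] = now

    def time_access(self, fileid):
        # Prefer the cached value so clients see the most recent access time.
        return self.pending.get(fileid, self.stable_store.get(fileid))

    def flush(self):
        # Lazily commit all pending updates; returns how many were written.
        count = len(self.pending)
        self.stable_store.update(self.pending)
        self.pending.clear()
        return count
```

Repeated reads of the same object between flushes collapse into a single write, which is where the performance benefit comes from.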
3.8. Interpreting owner and owner_group 4.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a and groups within the "acl" attribute) are represented in terms of a
UTF-8 string. To avoid a representation that is tied to a particular UTF-8 string. To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the underlying implementation at the client or server, the use of the
UTF-8 string has been chosen. Note that section 6.1 of [RFC2624] UTF-8 string has been chosen. Note that section 6.1 of [RFC2624]
provides additional rationale. It is expected that the client and provides additional rationale. It is expected that the client and
server will have their own local representation of owner and server will have their own local representation of owner and
owner_group that is used for local storage or presentation to the end owner_group that is used for local storage or presentation to the end
user. Therefore, it is expected that when these attributes are user. Therefore, it is expected that when these attributes are
skipping to change at page 40, line 44 skipping to change at page 46, line 5
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or owner error when there is a valid translation for the user or owner
designated in this way. In that case, the client must use the designated in this way. In that case, the client must use the
appropriate name@domain string and not the special form for appropriate name@domain string and not the special form for
compatibility. compatibility.
The owner string "nobody" may be used to designate an anonymous user, The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute. that cannot be mapped through normal means to the owner attribute.
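A server-side translation of an owner string might look like the sketch below. It is illustrative only and rests on several assumptions that are not taken from this text: the helper name map_owner, the BadOwner exception standing in for the NFS4ERR_BADOWNER error, a single local domain, a dictionary as the name-to-id table, and the choice to accept a purely numeric string when no name translation exists for it.

```python
class BadOwner(Exception):
    """Stands in for the NFS4ERR_BADOWNER error."""


def map_owner(owner, local_domain, users):
    """Translate an owner string to a local numeric id.

    users: hypothetical table mapping user name -> numeric id.
    """
    if "@" not in owner:
        # Special numeric form without a domain.
        if owner.isascii() and owner.isdigit():
            uid = int(owner)
            if uid in users.values():
                # A valid name translation exists, so the client must use
                # the name@domain form instead of the numeric form.
                raise BadOwner(owner)
            return uid  # assumption: accepted for compatibility
        raise BadOwner(owner)
    name, _, domain = owner.rpartition("@")
    if domain != local_domain or name not in users:
        raise BadOwner(owner)
    return users[name]
```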
3.9. Character Case Attributes 4.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes, With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" [RFC1345] which may or may not include the word "CAPITAL" or name" [RFC1345] which may or may not include the word "CAPITAL" or
"SMALL". The presence of SMALL or CAPITAL allows an NFS server to "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see the general character handling and internationalization issues, see the
section "Internationalization". section "Internationalization".
3.10. Quota Attributes 4.10. Quota Attributes
For the attributes related to filesystem quotas, the following For the attributes related to filesystem quotas, the following
definitions apply: definitions apply:
quota_avail_soft The value in bytes which represents the amount of quota_avail_soft The value in bytes which represents the amount of
additional disk space that can be allocated to this file or additional disk space that can be allocated to this file or
directory before the user may reasonably be warned. It is directory before the user may reasonably be warned. It is
understood that this space may be consumed by allocations to other understood that this space may be consumed by allocations to other
files or directories though there is a rule as to which other files or directories though there is a rule as to which other
files or directories. files or directories.
skipping to change at page 41, line 40 skipping to change at page 47, line 5
Note that there may be a number of distinct but overlapping sets Note that there may be a number of distinct but overlapping sets
of files or directories for which a quota_used value is of files or directories for which a quota_used value is
maintained. E.g. "all files with a given owner", "all files with maintained. E.g. "all files with a given owner", "all files with
a given group owner", etc. a given group owner", etc.
The server is at liberty to choose any of those sets but should do The server is at liberty to choose any of those sets but should do
so in a repeatable way. The rule may be configured per-filesystem so in a repeatable way. The rule may be configured per-filesystem
or may be "choose the set with the smallest quota". or may be "choose the set with the smallest quota".
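One repeatable selection rule can be sketched as follows. The sketch is an assumption-laden illustration: it reads "smallest quota" as the smallest hard limit, breaks ties by set name so the result does not depend on input order, and treats each available value as the limit minus the bytes used, floored at zero. The function names and the tuple layout are hypothetical.

```python
def choose_quota_set(quota_sets):
    """Pick one set from (name, soft_limit, hard_limit, used) tuples.

    Smallest hard limit wins; the name breaks ties so the choice is
    repeatable regardless of the order in which sets are listed.
    """
    return min(quota_sets, key=lambda q: (q[2], q[0]))


def quota_attrs(quota_sets):
    """Derive the three quota attribute values from the chosen set."""
    _name, soft, hard, used = choose_quota_set(quota_sets)
    return {
        "quota_avail_soft": max(0, soft - used),
        "quota_avail_hard": max(0, hard - used),
        "quota_used": used,
    }


sets = [
    ("owner:alice", 800, 1000, 300),
    ("group:staff", 400, 500, 450),
]
attrs = quota_attrs(sets)
```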
3.11. mounted_on_fileid 4.11. mounted_on_fileid
UNIX-based operating environments connect a filesystem into the UNIX-based operating environments connect a filesystem into the
namespace by connecting (mounting) the filesystem onto the existing namespace by connecting (mounting) the filesystem onto the existing
file object (the mount point, usually a directory) of an existing file object (the mount point, usually a directory) of an existing
filesystem. When the mount point's parent directory is read via an filesystem. When the mount point's parent directory is read via an
API like readdir(), the returned results are directory entries, each API like readdir(), the returned results are directory entries, each
with a component name and a fileid. The fileid of the mount point's with a component name and a fileid. The fileid of the mount point's
directory entry will be different from the fileid that the stat() directory entry will be different from the fileid that the stat()
system call returns. The stat() system call is returning the fileid system call returns. The stat() system call is returning the fileid
of the root of the mounted filesystem, whereas readdir() is returning of the root of the mounted filesystem, whereas readdir() is returning
skipping to change at page 42, line 45 skipping to change at page 48, line 9
fileid of a directory entry returned by readdir(). If fileid of a directory entry returned by readdir(). If
mounted_on_fileid is requested in a GETATTR operation, the server mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e. to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned. Some operating environments what readdir() would have returned. Some operating environments
allow a series of two or more filesystems to be mounted onto a single allow a series of two or more filesystems to be mounted onto a single
mount point. In this case, for the server to obey the aforementioned mount point. In this case, for the server to obey the aforementioned
invariant, it will need to find the base mount point, and not the invariant, it will need to find the base mount point, and not the
intermediate mount points. intermediate mount points.
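The search for the base mount point can be sketched as a walk through a table of covered objects. This is a minimal illustration under stated assumptions: objects are identified by a (fsid, fileid) pair, the table named covered maps the root of each mounted filesystem to the object it is mounted on, and the function name is hypothetical.

```python
def mounted_on_fileid(obj, covered):
    """Return the fileid that readdir() on the parent would report.

    obj:     (fsid, fileid) of the target object.
    covered: maps the root of a mounted filesystem to the object it
             is mounted on (hypothetical server-side table).
    """
    seen = set()
    # Follow stacked mounts down to the base mount point; the seen set
    # guards against a malformed table that contains a cycle.
    while obj in covered and obj not in seen:
        seen.add(obj)
        obj = covered[obj]
    return obj[1]


# fsB is mounted on fileid 100 of fsA, then fsC is mounted over fsB's root.
covered = {
    ("fsB", 2): ("fsA", 100),
    ("fsC", 2): ("fsB", 2),
}
```

For an object that is not the root of a mounted filesystem the loop body never runs, so the result is simply the object's own fileid.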
3.12. send_impl_id and recv_impl_id 4.12. send_impl_id and recv_impl_id
These recommended attributes are used to identify the client and These recommended attributes are used to identify the client and
server. In the case of the send_impl_id attribute, the client sends server. In the case of the send_impl_id attribute, the client sends
its clientid4 value along with the nfs_impl_id4. The use of the its clientid4 value along with the nfs_impl_id4. The use of the
clientid4 value allows the server to identify and match specific clientid4 value allows the server to identify and match specific
client interaction. In the case of the recv_impl_id attribute, the client interaction. In the case of the recv_impl_id attribute, the
client receives the nfs_impl_id4 value. client receives the nfs_impl_id4 value.
Access to this identification information can be most useful at both Access to this identification information can be most useful at both
client and server. Being able to identify specific implementations client and server. Being able to identify specific implementations
skipping to change at page 43, line 27 skipping to change at page 48, line 39
the client and server might refuse to interoperate. the client and server might refuse to interoperate.
Because it is likely some implementations will violate the protocol Because it is likely some implementations will violate the protocol
specification and interpret the identity information, implementations specification and interpret the identity information, implementations
MUST allow the users of the NFSv4 client and server to set the MUST allow the users of the NFSv4 client and server to set the
contents of the sent nfs_impl_id structure to any value. contents of the sent nfs_impl_id structure to any value.
Even though these attributes are recommended, if the server supports Even though these attributes are recommended, if the server supports
one of them it MUST support the other. one of them it MUST support the other.
3.13. fs_layouttype 4.13. fs_layout_type
This attribute applies to a file system and indicates what layout This attribute applies to a file system and indicates what layout
types are supported by the file system. We expect this attribute to types are supported by the file system. We expect this attribute to
be queried when a client encounters a new fsid. This attribute is be queried when a client encounters a new fsid. This attribute is
used by the client to determine if it has applicable layout drivers. used by the client to determine if it has applicable layout drivers.
3.14. layouttype 4.14. layout_type
This attribute indicates the particular layout type(s) used for a This attribute indicates the particular layout type(s) used for a
file. This is for informational purposes only. The client needs to file. This is for informational purposes only. The client needs to
use the LAYOUTGET operation in order to get enough information (e.g., use the LAYOUTGET operation in order to get enough information (e.g.,
specific device information) in order to perform I/O. specific device information) in order to perform I/O.
3.15. layouthint 4.15. layout_hint
This attribute may be set on newly created files to influence the This attribute may be set on newly created files to influence the
metadata server's choice for the file's layout. It is suggested that metadata server's choice for the file's layout. It is suggested that
this attribute be set as one of the initial attributes within the this attribute be set as one of the initial attributes within the
OPEN call. The metadata server may ignore this attribute. This OPEN call. The metadata server may ignore this attribute. This
attribute is a subset of the layout structure returned by LAYOUTGET. attribute is a subset of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file. It is up to the server used to suggest the stripe width of a file. It is up to the server
implementation to determine which fields within the layout it uses. implementation to determine which fields within the layout it uses.
[[Comment.3: it has been suggested that the HINT is a well defined 5. Access Control Lists
type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]]
3.16. Access Control Lists
The NFS version 4 ACL attribute is an array of access control entries The NFS version 4 ACL attribute is an array of access control entries
(ACE). Although the client can read and write the ACL attribute, (ACEs). Although the client can read and write the ACL attribute,
the NFSv4 model is the server does all access control based on the the server is responsible for using the ACL to perform access
server's interpretation of the ACL. If at any point the client wants control. The client can use the OPEN or ACCESS operations to check
to check access without issuing an operation that modifies or reads access without modifying or reading data or metadata.
data or metadata, the client can use the OPEN and ACCESS operations
to do so. There are various access control entry types, as defined
in Section 3.16.1. The server is able to communicate which ACE types
are supported by returning the appropriate value within the
aclsupport attribute. Each ACE covers one or more operations on a
file or directory as described in Section 3.16.2. It may also
contain one or more flags that modify the semantics of the ACE as
defined in Section 3.16.3.
The NFS ACE attribute is defined as follows: The NFS ACE attribute is defined as follows:
typedef uint32_t acetype4; typedef uint32_t acetype4;
typedef uint32_t aceflag4; typedef uint32_t aceflag4;
typedef uint32_t acemask4; typedef uint32_t acemask4;
struct nfsace4 { struct nfsace4 {
acetype4 type; acetype4 type;
aceflag4 flag; aceflag4 flag;
acemask4 access_mask; acemask4 access_mask;
utf8str_mixed who; utf8str_mixed who;
}; };
To determine if a request succeeds, each nfsace4 entry is processed To determine if a request succeeds, the server processes each nfsace4
in order by the server. Only ACEs which have a "who" that matches entry in order. Only ACEs which have a "who" that matches the
the requester are considered. Each ACE is processed until all of the requester are considered. Each ACE is processed until all of the
bits of the requester's access have been ALLOWED. Once a bit (see bits of the requester's access have been ALLOWED. Once a bit (see
below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
considered in the processing of later ACEs. If an ACCESS_DENIED_ACE considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
is encountered where the requester's access still has unALLOWED bits is encountered where the requester's access still has unALLOWED bits
in common with the "access_mask" of the ACE, the request is denied. in common with the "access_mask" of the ACE, the request is denied.
However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT
ACE types do not affect a requester's access, and instead are for ACE types do not affect a requester's access, and instead are for
triggering events as a result of a requester's access attempt. triggering events as a result of a requester's access attempt.
Therefore, all AUDIT and ALARM ACEs are processed until the end of the Therefore, all AUDIT and ALARM ACEs are processed until the end of the
ACL. When the ACL is fully processed, if there are bits in ACL. When the ACL is fully processed, if there are bits in the
requester's mask that have not been considered whether the server requester's mask that have not been ALLOWED or DENIED, access is
allows or denies, the access is denied. Even though a request is denied.
denied, servers may choose to have other restrictions or
implementation defined security policies in place. In those cases,
access may be decided outside of what is in the ACL. Examples of
such security policies or restrictions are:
o The owner of the file will always be granted ACE4_WRITE_ACL This is not intended to limit the ability of server implementations
and ACE4_READ_ACL permissions. This would prevent the user from to implement alternative access policies. For example:
o A server implementation might always grant ACE4_WRITE_ACL and
ACE4_READ_ACL permissions. This would prevent the user from
getting into the situation where they can't ever modify the ACL. getting into the situation where they can't ever modify the ACL.
o The ACL may say that an entity is to be granted ACE4_WRITE_DATA o If a file system is mounted read only, then the server may deny
permission, but the file system is mounted read only, therefore ACE4_WRITE_DATA even though the ACL grants it.
write access is denied.
As mentioned before, this is one of the reasons that client As mentioned before, this is one of the reasons that client
implementations are not recommended to do their own access checking. implementations are not recommended to do their own access checks
based on their interpretation of the ACL, but rather use the OPEN and
ACCESS operations to do access checks. This allows the client to act on the
results of having the server determine whether or not access should
be granted based on its interpretation of the ACL.
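The ACE processing order described earlier in this section (ALLOW and DENY entries examined in order, AUDIT and ALARM entries leaving the outcome unchanged) can be sketched roughly as follows. This is only an illustration of the server-side evaluation, not a normative algorithm: the Ace class, the acl_permits name, the requester_ids set, and the two mask bit names are hypothetical helpers introduced for the example.

```python
from dataclasses import dataclass

# ACE type codes, used here only to tell the four types apart.
ALLOW, DENY, AUDIT, ALARM = 0, 1, 2, 3

# Two access mask bits for demonstration purposes (hypothetical names).
READ_DATA = 0x1
WRITE_DATA = 0x2


@dataclass
class Ace:
    type: int
    access_mask: int
    who: str


def acl_permits(acl, requester_ids, requested):
    """Return True if every requested bit is ALLOWED before a DENY matches.

    requester_ids is the set of "who" strings that match the requester.
    """
    remaining = requested  # bits not yet ALLOWED
    for ace in acl:
        if remaining == 0:
            break  # everything has been ALLOWED; later ACEs cannot deny it
        if ace.who not in requester_ids:
            continue  # only ACEs whose "who" matches are considered
        if ace.type == ALLOW:
            remaining &= ~ace.access_mask
        elif ace.type == DENY and remaining & ace.access_mask:
            return False  # a still-unALLOWED bit is denied
        # AUDIT and ALARM ACEs do not affect the requester's access
    return remaining == 0  # bits neither ALLOWED nor DENIED mean denial
```

Because evaluation is ordered, placing a DENY entry after an ALLOW entry for the same bits has no effect in this sketch. A server remains free to apply additional policy outside the ACL, which is why clients are steered toward OPEN and ACCESS rather than local evaluation.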
Clients must be aware of situations in which an object's ACL will
define a certain access even though the server will not enforce it.
In general, but especially in these situations, the client needs to
do its part in the enforcement of access as defined by the ACL. To
do this, the client may issue the appropriate ACCESS operation prior
to servicing the request of the user or application in order to
determine whether the user or application should be granted the
access requested.
Some situations in which the ACL may define accesses that the server
doesn't enforce:
o All servers will allow a user the ability to read the data of the
file when only the execute permission is granted (i.e., if the ACL
denies the user the ACE4_READ_DATA access and allows the user
ACE4_EXECUTE, the server will allow the user to read the data of the
file).
o Many servers have the notion of owner-override in which the owner
of the object is allowed to override accesses that are denied by the
ACL.
The NFS version 4 ACL model is quite rich. Some server platforms may The NFS version 4 ACL model is quite rich. Some server platforms may
provide access control functionality that goes beyond the UNIX-style provide access control functionality that goes beyond the UNIX-style
mode attribute, but which is not as rich as the NFS ACL model. So mode attribute, but which is not as rich as the NFS ACL model. So
that users can take advantage of this more limited functionality, the that users can take advantage of this more limited functionality, the
server may indicate that it supports ACLs as long as it follows the server may indicate that it supports ACLs as long as it follows the
guidelines for mapping between its ACL model and the NFS version 4 guidelines for mapping between its ACL model and the NFS version 4
ACL model. ACL model.
The situation is complicated by the fact that a server may have The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs. For example, the enforcement for multiple modules that enforce ACLs. For example, the enforcement for
NFS version 4 access may be different from the enforcement for local NFS version 4 access may be different from the enforcement for local
access, and both may be different from the enforcement for access access, and both may be different from the enforcement for access
through other protocols such as SMB. So it may be useful for a through other protocols such as SMB. So it may be useful for a
server to accept an ACL even if not all of its modules are able to server to accept an ACL even if not all of its modules are able to
support it. support it.
The guiding principle in all cases is that the server must not accept The guiding principle in all cases is that the server must not accept
ACLs that appear to make the file more secure than it really is. ACLs that appear to make the file more secure than it really is.
3.16.1. ACE type 5.1. ACE type
Type Description Type Description
_____________________________________________________ _____________________________________________________
ALLOW Explicitly grants the access defined in ALLOW Explicitly grants the access defined in
acemask4 to the file or directory. acemask4 to the file or directory.
DENY Explicitly denies the access defined in DENY Explicitly denies the access defined in
acemask4 to the file or directory. acemask4 to the file or directory.
AUDIT LOG (system dependent) any access AUDIT LOG (system dependent) any access
skipping to change at page 47, line 9 skipping to change at page 52, line 23
NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP. request with NFS4ERR_ATTRNOTSUPP.
Example: suppose a server can enforce NFS ACLs for NFS access but Example: suppose a server can enforce NFS ACLs for NFS access but
cannot enforce ACLs for local access. If arbitrary processes can run cannot enforce ACLs for local access. If arbitrary processes can run
on the server, then the server SHOULD NOT indicate ACL support. On on the server, then the server SHOULD NOT indicate ACL support. On
the other hand, if only trusted administrative programs run locally, the other hand, if only trusted administrative programs run locally,
then the server may indicate ACL support. then the server may indicate ACL support.
3.16.2. ACE Access Mask 5.2. ACE Access Mask
The access_mask field contains values based on the following: The access_mask field contains values based on the following:
ACE4_READ_DATA ACE4_READ_DATA
Operation(s) affected: Operation(s) affected:
READ READ
OPEN OPEN
Discussion: Discussion:
Permission to read the data of the file. Permission to read the data of the file.
Servers SHOULD allow a user the ability to read the data
of the file when only the ACE4_EXECUTE access mask bit is
allowed.
ACE4_LIST_DIRECTORY ACE4_LIST_DIRECTORY
Operation(s) affected: Operation(s) affected:
READDIR READDIR
Discussion: Discussion:
Permission to list the contents of a directory. Permission to list the contents of a directory.
ACE4_WRITE_DATA ACE4_WRITE_DATA
Operation(s) affected: Operation(s) affected:
WRITE WRITE
OPEN OPEN
skipping to change at page 48, line 46 skipping to change at page 54, line 15
directory. This is when createdir is TRUE and no named directory. This is when createdir is TRUE and no named
attribute directory exists. The ability to check whether attribute directory exists. The ability to check whether
or not a named attribute directory exists depends on the or not a named attribute directory exists depends on the
ability to look it up, therefore, users also need the ability to look it up, therefore, users also need the
ACE4_READ_NAMED_ATTRS permission in order to create a ACE4_READ_NAMED_ATTRS permission in order to create a
named attribute directory. named attribute directory.
ACE4_EXECUTE ACE4_EXECUTE
Operation(s) affected: Operation(s) affected:
LOOKUP LOOKUP
READ
OPEN
Discussion: Discussion:
Permission to execute a file or traverse/search a Permission to execute a file or traverse/search a
directory. directory.
Servers SHOULD allow a user the ability to read the data
of the file when only the ACE4_EXECUTE access mask bit is
allowed. This is because there is no way to execute a
file without reading the contents. Though a server may
treat ACE4_EXECUTE and ACE4_READ_DATA bits identically
when deciding to permit a READ operation, it SHOULD still
allow the two bits to be set independently in ACLs, and
MUST distinguish between them when replying to ACCESS
operations. In particular, servers SHOULD NOT silently
turn on one of the two bits when the other is set, as
that would make it impossible for the client to correctly
enforce the distinction between read and execute
permissions.
As an example, following a SETATTR of the following ACL:
nfsuser:ACE4_EXECUTE:ALLOW
A subsequent GETATTR of ACL for that file SHOULD return:
nfsuser:ACE4_EXECUTE:ALLOW
Rather than:
nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW
ACE4_DELETE_CHILD ACE4_DELETE_CHILD
Operation(s) affected: Operation(s) affected:
REMOVE REMOVE
Discussion: Discussion:
Permission to delete a file or directory within a Permission to delete a file or directory within a
directory. See section "ACE4_DELETE vs. directory. See section "ACE4_DELETE vs.
ACE4_DELETE_CHILD" for information on how these two access ACE4_DELETE_CHILD" for information on how these two access
mask bits interact. mask bits interact.
ACE4_READ_ATTRIBUTES ACE4_READ_ATTRIBUTES
Operation(s) affected: Operation(s) affected:
GETATTR of file system object attributes GETATTR of file system object attributes
skipping to change at page 51, line 9 skipping to change at page 57, line 5
If a server receives a SETATTR request that it cannot accurately If a server receives a SETATTR request that it cannot accurately
implement, it should err in the direction of more restricted implement, it should err in the direction of more restricted
access. For example, suppose a server cannot distinguish overwriting access. For example, suppose a server cannot distinguish overwriting
data from appending new data, as described in the previous paragraph. data from appending new data, as described in the previous paragraph.
If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
not (or vice versa), the server should reject the request with not (or vice versa), the server should reject the request with
NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the
server may silently turn on the other bit, so that both APPEND_DATA server may silently turn on the other bit, so that both APPEND_DATA
and WRITE_DATA are denied. and WRITE_DATA are denied.
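As a rough sketch of the rule above, a server that cannot distinguish appending from overwriting might screen incoming ACEs as shown below. The function name and the exception class are hypothetical. The numeric values for the two mask bits and the ACE types are assumed from the protocol's constant definitions, which are not part of this excerpt.

```python
# Assumed constant values; they are not shown in this excerpt.
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0
ACE4_ACCESS_DENIED_ACE_TYPE = 1


class AttrNotSupp(Exception):
    """Stands in for replying with NFS4ERR_ATTRNOTSUPP (hypothetical)."""


def normalize_write_bits(ace_type, access_mask):
    """Return the mask to store, or raise if the ACE cannot be honored."""
    both = ACE4_WRITE_DATA | ACE4_APPEND_DATA
    present = access_mask & both
    if present in (0, both):
        return access_mask  # nothing to reconcile
    if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
        # Erring toward more restricted access: deny both operations.
        return access_mask | both
    # Widening any other ACE type would misrepresent what was asked for.
    raise AttrNotSupp("APPEND_DATA and WRITE_DATA must be set together")
```

Turning on the extra bit is only safe for DENY because the result is strictly less access than was requested; doing the same for an ALLOW ACE would quietly grant more.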
3.16.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD 5.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD
There are two separate access mask bits that govern the ability to Two access mask bits govern the ability to delete a file or directory
delete a file: ACE4_DELETE and ACE4_DELETE_CHILD. ACE4_DELETE is object: ACE4_DELETE on the object itself, and ACE4_DELETE_CHILD on
intended to be specified by the ACL for the object to be deleted, and the object's parent directory.
ACE4_DELETE_CHILD is intended to be specified by the ACL of the
parent directory.
In addition to ACE4_DELETE and ACE4_DELETE_CHILD, many systems also Many systems also consult the "sticky bit" (MODE4_SVTX) and write
consider the "sticky bit" (MODE4_SVTX) and the appropriate "write" mode bit on the parent directory when determining whether to allow a
mode bit when determining whether to allow a file to be deleted. The file to be deleted. The mode bit for write corresponds to
mode bit for write corresponds to ACE4_WRITE_DATA, which is the same ACE4_WRITE_DATA, which is the same physical bit as ACE4_ADD_FILE.
physical bit as ACE4_ADD_FILE. Therefore, ACE4_ADD_FILE can come Therefore, ACE4_ADD_FILE can come into play when determining
into play when determining permission to delete. permission to delete.
In the algorithm below, the strategy is that ACE4_DELETE and In the algorithm below, the strategy is that ACE4_DELETE and
ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky
bit takes precedence over the "write" mode bits (reflected in bit takes precedence over the "write" mode bits (reflected in
ACE4_ADD_FILE). ACE4_ADD_FILE).
Server implementations SHOULD grant or deny permission to delete Server implementations SHOULD grant or deny permission to delete
based on the following algorithm. based on the following algorithm.
if ACE4_EXECUTE is denied by the parent directory ACL: if ACE4_EXECUTE is denied by the parent directory ACL:
deny delete deny delete
else if ACE4_EXECUTE is unspecified by the parent
directory ACL:
deny delete
else if ACE4_DELETE is allowed by the target object ACL: else if ACE4_DELETE is allowed by the target object ACL:
allow delete allow delete
else if ACE4_DELETE_CHILD is allowed by the parent else if ACE4_DELETE_CHILD is allowed by the parent
directory ACL: directory ACL:
allow delete allow delete
else if ACE4_DELETE_CHILD is denied by the else if ACE4_DELETE_CHILD is denied by the
parent directory ACL: parent directory ACL:
deny delete deny delete
else if ACE4_ADD_FILE is allowed by the parent directory ACL: else if ACE4_ADD_FILE is allowed by the parent directory ACL:
if MODE4_SVTX is set for the parent directory: if MODE4_SVTX is set for the parent directory:
skipping to change at page 52, line 32 skipping to change at page 58, line 5
ACE4_WRITE_DATA is allowed by the target ACE4_WRITE_DATA is allowed by the target
object ACL: object ACL:
allow delete allow delete
else: else:
deny delete deny delete
else: else:
allow delete allow delete
else: else:
deny delete deny delete
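The branches of the delete algorithm that are visible above can be expressed as a small function. This is a sketch under stated assumptions: the Verdict enum and the parameter names are hypothetical, unspecified ACE4_EXECUTE on the parent is treated as a denial (as in the second branch above), and the sticky-bit condition, part of which is elided in this excerpt, is passed in as a precomputed boolean rather than reconstructed.

```python
from enum import Enum


class Verdict(Enum):
    """Outcome of checking one mask bit against one ACL (hypothetical)."""
    ALLOWED = "allowed"
    DENIED = "denied"
    UNSPECIFIED = "unspecified"


def may_delete(parent_execute, target_delete, parent_delete_child,
               parent_add_file, parent_sticky, sticky_condition_met):
    """Return True if the delete should be allowed.

    sticky_condition_met stands in for the sticky-bit test, part of which
    is elided in this excerpt; the caller must evaluate it.
    """
    if parent_execute is not Verdict.ALLOWED:
        return False  # ACE4_EXECUTE denied or unspecified on the parent
    if target_delete is Verdict.ALLOWED:
        return True   # ACE4_DELETE on the target object
    if parent_delete_child is Verdict.ALLOWED:
        return True   # ACE4_DELETE_CHILD on the parent directory
    if parent_delete_child is Verdict.DENIED:
        return False
    if parent_add_file is Verdict.ALLOWED:
        if parent_sticky:  # MODE4_SVTX is set on the parent directory
            return sticky_condition_met
        return True
    return False
```

The ordering of the checks mirrors the stated precedence: the delete bits are consulted before the sticky bit, and the sticky bit before the write permission reflected in ACE4_ADD_FILE.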
3.16.3. ACE flag 5.3. ACE flag
The "flag" field contains values based on the following descriptions. The "flag" field contains values based on the following descriptions.
ACE4_FILE_INHERIT_ACE ACE4_FILE_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be Can be placed on a directory and indicates that this ACE should be
added to each new non-directory file created. added to each new non-directory file created.
ACE4_DIRECTORY_INHERIT_ACE ACE4_DIRECTORY_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be Can be placed on a directory and indicates that this ACE should be
added to each new directory created. added to each new directory created.
skipping to change at page 53, line 39 skipping to change at page 59, line 15
The previously described processing applies to the ACCESS The previously described processing applies to the ACCESS
operation as well. The difference is that "success" or operation as well. The difference is that "success" or
"failure" does not refer to whether ACCESS returns NFS4_OK. "failure" does not refer to whether ACCESS returns NFS4_OK.
Success means that ACCESS returns all requested and supported Success means that ACCESS returns all requested and supported
bits. Failure means that ACCESS failed to return a bit that bits. Failure means that ACCESS failed to return a bit that
was requested and supported. was requested and supported.
ACE4_IDENTIFIER_GROUP ACE4_IDENTIFIER_GROUP
Indicates that the "who" refers to a GROUP as defined under UNIX Indicates that the "who" refers to a GROUP as defined under UNIX
or a GROUP ACCOUNT as defined under Windows. Clients and servers or a GROUP ACCOUNT as defined under Windows. Clients and servers
may ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who value must ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who
equal to one of the special identifiers outlined in section "ACE value equal to one of the special identifiers outlined in section
who". "ACE who".
The bitmask constants used for the flag field are as follows: The bitmask constants used for the flag field are as follows:
const ACE4_FILE_INHERIT_ACE = 0x00000001; const ACE4_FILE_INHERIT_ACE = 0x00000001;
const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002;
const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004;
const ACE4_INHERIT_ONLY_ACE = 0x00000008; const ACE4_INHERIT_ONLY_ACE = 0x00000008;
const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020;
const ACE4_IDENTIFIER_GROUP = 0x00000040; const ACE4_IDENTIFIER_GROUP = 0x00000040;
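To illustrate how the two access-outcome flags relate to the ACCESS operation as described earlier, the following sketch decides whether an AUDIT or ALARM ACE would be triggered by a given ACCESS result. The function names and the requested/supported/returned parameters are hypothetical; only the two flag values are taken from the constants above.

```python
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010
ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020


def access_succeeded(requested, supported, returned):
    """Success: ACCESS returned every bit both requested and supported."""
    wanted = requested & supported
    return (returned & wanted) == wanted


def ace_triggers(flag, requested, supported, returned):
    """Return True if an AUDIT/ALARM ACE with this flag word should fire."""
    if access_succeeded(requested, supported, returned):
        return bool(flag & ACE4_SUCCESSFUL_ACCESS_ACE_FLAG)
    return bool(flag & ACE4_FAILED_ACCESS_ACE_FLAG)
```

Note that a requested bit the server does not support is ignored in both directions: its absence from the reply does not count as a failure.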
skipping to change at page 54, line 21 skipping to change at page 59, line 46
For example, suppose a client tries to set an ACE with For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
server does not support any form of ACL inheritance, the server server does not support any form of ACL inheritance, the server
should reject the request with NFS4ERR_ATTRNOTSUPP. If the server should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
supports a single "inherit ACE" flag that applies to both files and supports a single "inherit ACE" flag that applies to both files and
directories, the server may reject the request (i.e., requiring the directories, the server may reject the request (i.e., requiring the
client to set both the file and directory inheritance flags). The client to set both the file and directory inheritance flags). The
server may also accept the request and silently turn on the server may also accept the request and silently turn on the
ACE4_DIRECTORY_INHERIT_ACE flag. ACE4_DIRECTORY_INHERIT_ACE flag.
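The three server behaviors in this example might be sketched as below. The inheritance parameter ("none", "single", or "separate"), the strict switch, and the exception class are hypothetical names for the example. The text above only covers ACE4_FILE_INHERIT_ACE set without ACE4_DIRECTORY_INHERIT_ACE; handling the mirror case the same way is an assumption of this sketch.

```python
ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002


class AttrNotSupp(Exception):
    """Stands in for replying with NFS4ERR_ATTRNOTSUPP (hypothetical)."""


def settle_inherit_flags(flag, inheritance, strict=False):
    """Return the flag word the server would store, or raise to reject.

    inheritance: "none"     -> no ACL inheritance at all
                 "single"   -> one inherit flag covering files and dirs
                 "separate" -> independent file and directory flags
    """
    both = ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE
    requested = flag & both
    if requested == 0:
        return flag  # no inheritance requested
    if inheritance == "none":
        raise AttrNotSupp("ACL inheritance is not supported")
    if inheritance == "single" and requested != both:
        if strict:
            raise AttrNotSupp("file and directory flags must be set together")
        return flag | both  # silently turn on the missing flag
    return flag
```

Whether to reject or to widen the flag set is a server policy choice; a client that reads the ACL back can detect the silent change by comparing flag words.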
3.16.4. ACE who 5.4. ACE who
There are several special identifiers ("who") which need to be There are several special identifiers ("who") which need to be
understood universally, rather than in the context of a particular understood universally, rather than in the context of a particular
DNS domain. Some of these identifiers cannot be understood when an DNS domain. Some of these identifiers cannot be understood when an
NFS client accesses the server, but have meaning when a local process NFS client accesses the server, but have meaning when a local process
accesses the file. The ability to display and modify these accesses the file. The ability to display and modify these
permissions is permitted over NFS, even if none of the access methods permissions is permitted over NFS, even if none of the access methods
on the server understands the identifiers. on the server understands the identifiers.
Who Description Who Description
skipping to change at page 55, line 5 skipping to change at page 60, line 26
"BATCH" Accessed from a batch job. "BATCH" Accessed from a batch job.
"ANONYMOUS" Accessed without any authentication. "ANONYMOUS" Accessed without any authentication.
"AUTHENTICATED" Any authenticated user (opposite of "AUTHENTICATED" Any authenticated user (opposite of
ANONYMOUS) ANONYMOUS)
"SERVICE" Access from a system service. "SERVICE" Access from a system service.
To avoid conflict, these special identifiers are distinguished by an To avoid conflict, these special identifiers are distinguished by an
appended "@" and should appear in the form "xxxx@" (note: no domain appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@. name after the "@"). For example: ANONYMOUS@.
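A receiver can recognize the special form with a simple check, sketched below. The function name is hypothetical, and the set contains only the identifiers named in this excerpt (the table above is partly elided, so the set should not be read as complete).

```python
# Only the identifiers visible in this excerpt; the full table is elided.
SPECIAL_WHO = {
    "OWNER", "GROUP", "EVERYONE",
    "BATCH", "ANONYMOUS", "AUTHENTICATED", "SERVICE",
}


def is_special_who(who):
    """True if who has the form "xxxx@" with a known name and no domain."""
    if not who.endswith("@") or who.count("@") != 1:
        return False
    return who[:-1] in SPECIAL_WHO
```

Under this check "ANONYMOUS@" is special, while "ANONYMOUS@example.com" is an ordinary name in a DNS domain.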
3.16.4.1. Discussion on EVERYONE@ 5.4.1. Discussion of EVERYONE@
It is important to note that "EVERYONE@" is not equivalent to the It is important to note that "EVERYONE@" is not equivalent to the
UNIX "other" entity. This is because, by definition, UNIX "other" UNIX "other" entity. This is because, by definition, UNIX "other"
does not include the owner or owning group of a file. "EVERYONE@" does not include the owner or owning group of a file. "EVERYONE@"
means literally everyone, including the owner or owning group. means literally everyone, including the owner or owning group.
3.16.4.2. Discussion on OWNER@ and GROUP@ 5.4.2. Discussion of OWNER@ and GROUP@
Due to the use of the special identifiers "OWNER@" and "GROUP@" to The ACL itself cannot be used to determine the owner and owning group
indicate that an ACE applies to the owner and owning group, of a file. This information should be indicated by the values of the
respectively, associated with a file, the ACL cannot be used to owner and owner_group file attributes returned by the server.
determine the owner and owning group of a file. This information
should be indicated by the values of the owner and owner_group file
attributes returned by the server.
3.16.5. Mode Attribute 5.5. Mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined: following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */ const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */ const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */ const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */ const MODE4_RUSR = 0x100; /* read permission: owner */
const MODE4_WUSR = 0x080; /* write permission: owner */ const MODE4_WUSR = 0x080; /* write permission: owner */
const MODE4_XUSR = 0x040; /* execute permission: owner */ const MODE4_XUSR = 0x040; /* execute permission: owner */
skipping to change at page 55, line 50 skipping to change at page 61, line 29
identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and
MODE4_XGRP apply to the principals identified in the owner_group MODE4_XGRP apply to the principals identified in the owner_group
attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any
principal that does not match that in the owner attribute, and does not principal that does not match that in the owner attribute, and does not
have a group matching that of the owner_group attribute. have a group matching that of the owner_group attribute.
The remaining bits are not defined by this protocol and MUST NOT be The remaining bits are not defined by this protocol and MUST NOT be
used. The minor version mechanism must be used to define further bit used. The minor version mechanism must be used to define further bit
usage. usage.
Note that in UNIX, if a file has the MODE4_SGID bit set and no 5.6. Interaction Between Mode and ACL Attributes
MODE4_XGRP bit set, then READ and WRITE must use mandatory file
locking.
3.16.6. Interaction Between Mode and ACL Attributes
As defined, there is a certain amount of overlap between ACL and mode As defined, there is a certain amount of overlap between ACL and mode
file attributes. Even though there is overlap, ACLs don't contain file attributes. Even though there is overlap, ACLs don't contain
all the information specified by a mode and modes can't possibly all the information specified by a mode and modes can't possibly
contain all the information specified by an ACL. contain all the information specified by an ACL.
For servers that support both mode and ACL, the mode's MODE4_R*, For servers that support both mode and ACL, the mode's MODE4_R*,
MODE4_W* and MODE4_X* values should be computed from the ACL and MODE4_W* and MODE4_X* values should be computed from the ACL and
should be recomputed upon each SETATTR of ACL. Similarly, upon should be recomputed upon each SETATTR of ACL. Similarly, upon
SETATTR of mode, the ACL should be modified in order to allow the SETATTR of mode, the ACL should be modified in order to allow the
mode computed from the ACL to be the same as the mode given to mode computed from the ACL to be the same as the mode given to
SETATTR. The mode computed from any given ACL should be SETATTR. The mode computed from any given ACL should be
deterministic. This means that given an ACL, the same mode will deterministic. This means that given an ACL, the same mode will
always be computed. always be computed.
For servers that support ACL and not mode, clients may handle For servers that support ACL and not mode, clients may handle
applications which set and get the mode by creating the correct ACL applications which set and get the mode by creating the correct ACL
to send to the server and by computing the mode from the ACL, to send to the server and by computing the mode from the ACL,
respectively. In this case, the methods used by the server to keep respectively. In this case, the methods used by the server to keep
the mode in sync with the ACL can also be used by the client. These the mode in sync with the ACL can also be used by the client. These
methods are explained in Section 3.16.6.3, Section 3.16.6.1, methods are explained in Section 5.6.3, Section 5.6.1, and
and Section 3.16.6.2. Section 5.6.2.
Since the mode can't possibly represent all of the information that Since the mode can't possibly represent all of the information that
is defined by an ACL, there are some discrepancies to be aware of. is defined by an ACL, there are some discrepancies to be aware of.
As explained in the section "Deficiencies in a Mode Representation of As explained in the section "Deficiencies in a Mode Representation of
an ACL", the mode bits computed from the ACL could potentially convey an ACL", the mode bits computed from the ACL could potentially convey
more restrictive permissions than what would be granted via the ACL. more restrictive permissions than what would be granted via the ACL.
Because of this, clients are not recommended to do their own access Because of this, clients are not recommended to do their own access
checks based on the mode of a file. checks based on the mode of a file.
Because the mode attribute includes bits (i.e. MODE4_SUID, Because the mode attribute includes bits (i.e. MODE4_SUID,
MODE4_SGID, MODE4_SVTX) that have nothing to do with ACL semantics, MODE4_SGID, MODE4_SVTX) that have nothing to do with ACL semantics,
it is permitted for clients to specify both the ACL attribute and it is permitted for clients to specify both the ACL attribute and
mode in the same SETATTR operation. However, because there is no mode in the same SETATTR operation. However, because there is no
prescribed order for processing the attributes in a SETATTR, clients prescribed order for processing the attributes in a SETATTR, clients
may see differing results. For recommendations on how to achieve may see differing results. For recommendations on how to achieve
consistent behavior, see Section 3.16.6.4. consistent behavior, see Section 5.6.4.
3.16.6.1. Recomputing mode upon SETATTR of ACL 5.6.1. Recomputing mode upon SETATTR of ACL
Keeping the mode and ACL attributes synchronized is important, but as Keeping the mode and ACL attributes synchronized is important, but as
mentioned previously, the mode cannot possibly represent all of the mentioned previously, the mode cannot possibly represent all of the
information in the ACL. Still, the mode should be modified to information in the ACL. Still, the mode should be modified to
represent the access as accurately as possible. represent the access as accurately as possible.
The general algorithm to assign a new mode attribute to an object The general algorithm to assign a new mode attribute to an object
based on a new ACL being set is: based on a new ACL being set is:
1. Walk through the ACEs in order, looking for ACEs with a "who" 1. Walk through the ACEs in order, looking for ACEs with a "who"
skipping to change at page 60, line 10 skipping to change at page 65, line 34
if a.type is ALLOW { if a.type is ALLOW {
mode |= XOTH; mode |= XOTH;
} }
} }
} }
} }
} }
} }
return mode | (old_mode & (SUID | SGID | SVTX)) return mode | (old_mode & (SUID | SGID | SVTX))
3.16.6.2. Applying the mode given to CREATE or OPEN to an inherited ACL 5.6.2. Applying the mode given to CREATE or OPEN to an inherited ACL
The goal of implementing ACL inheritance is for newly created objects The goal of implementing ACL inheritance is for newly created objects
to inherit the ACLs they were intended to inherit, but without to inherit the ACLs they were intended to inherit, but without
disregarding the mode that is given with the arguments to the CREATE disregarding the mode that is given with the arguments to the CREATE
or OPEN operations. The general algorithm is as follows: or OPEN operations. The general algorithm is as follows:
1. Form an ACL on the newly created object that is the concatenation 1. Form an ACL on the newly created object that is the concatenation
of all inheritable ACEs from its parent directory. Note that of all inheritable ACEs from its parent directory. Note that
there may be zero inheritable ACEs; thus, an object may start there may be zero inheritable ACEs; thus, an object may start
with an empty ACL. with an empty ACL.
skipping to change at page 61, line 28 skipping to change at page 66, line 50
G. On the second ACE, if the type field is ALLOW, an G. On the second ACE, if the type field is ALLOW, an
implementation MAY clear the following mask bits: implementation MAY clear the following mask bits:
ACE4_WRITE_ACL ACE4_WRITE_ACL
ACE4_WRITE_OWNER ACE4_WRITE_OWNER
3. To ensure that the mode is honored, apply the algorithm for 3. To ensure that the mode is honored, apply the algorithm for
applying a mode to a file/directory with an existing ACL on the applying a mode to a file/directory with an existing ACL on the
new object as described in Section 3.16.6.3, using the mode that new object as described in Section 5.6.3, using the mode that is
is to be used for file creation. to be used for file creation.
3.16.6.3. Applying a Mode to an Existing ACL 5.6.3. Applying a Mode to an Existing ACL
An existing ACL can mean two things in this context. One, that a An existing ACL can mean two things in this context. One, that a
file/directory already exists and it has an ACL. Two, that a file/directory already exists and it has an ACL. Two, that a
directory has inheritable ACEs that will make up the ACL for any new directory has inheritable ACEs that will make up the ACL for any new
files or directories created therein. files or directories created therein.
The high-level goal of the behavior when a mode is set on a file with The high-level goal of the behavior when a mode is set on a file with
an existing ACL is to take the new mode into account, without needing an existing ACL is to take the new mode into account, without needing
to delete a pre-existing ACL. to delete a pre-existing ACL.
skipping to change at page 66, line 36 skipping to change at page 71, line 36
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A3 else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A3
If XGRP is set: set ACE4_EXECUTE in A4 If XGRP is set: set ACE4_EXECUTE in A4
else: set ACE4_EXECUTE in A3 else: set ACE4_EXECUTE in A3
If ROTH is set: set ACE4_READ_DATA in A6 If ROTH is set: set ACE4_READ_DATA in A6
else: set ACE4_READ_DATA in A5 else: set ACE4_READ_DATA in A5
If WOTH is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A6 If WOTH is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A6
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A5 else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A5
If XOTH is set: set ACE4_EXECUTE in A6 If XOTH is set: set ACE4_EXECUTE in A6
else: set ACE4_EXECUTE in A5 else: set ACE4_EXECUTE in A5
3.16.6.4. ACL and mode in the same SETATTR 5.6.4. ACL and mode in the same SETATTR
The only reason that a mode and ACL should be set in the same SETATTR The only reason that a mode and ACL should be set in the same SETATTR
is if the user wants to set the SUID, SGID and SVTX bits along with is if the user wants to set the SUID, SGID and SVTX bits along with
setting the permissions by means of an ACL. There is still no way to setting the permissions by means of an ACL. There is still no way to
enforce which order the attributes will be set in, and it is likely enforce which order the attributes will be set in, and it is likely
that different orders of operations will produce different results. that different orders of operations will produce different results.
3.16.6.4.1. Client Side Recommendations 5.6.4.1. Client Side Recommendations
If an application needs to enforce a certain behavior, it is If an application needs to enforce a certain behavior, it is
recommended that the client implementations set mode and ACL in recommended that the client implementations set mode and ACL in
separate SETATTR requests. This will produce consistent and expected separate SETATTR requests. This will produce consistent and expected
results. results.
If an application wants to set SUID, SGID and SVTX bits and an ACL: If an application wants to set SUID, SGID and SVTX bits and an ACL:
In the first SETATTR, set the mode with SUID, SGID and SVTX bits In the first SETATTR, set the mode with SUID, SGID and SVTX bits
as desired and all other bits with a value of 0. as desired and all other bits with a value of 0.
In a following SETATTR (preferably in the same COMPOUND) set the In a following SETATTR (preferably in the same COMPOUND) set the
ACL. ACL.
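The two-step client sequence above can be sketched as follows. This is only an illustration: the dictionary layout used to describe each SETATTR and the function name split_setattr are hypothetical and are not part of the protocol. The bit values are those of MODE4_SUID, MODE4_SGID and MODE4_SVTX.

```python
# Special mode bits (octal); these have no counterpart in ACL semantics.
MODE4_SUID = 0o4000
MODE4_SGID = 0o2000
MODE4_SVTX = 0o1000
SPECIAL_BITS = MODE4_SUID | MODE4_SGID | MODE4_SVTX


def split_setattr(mode, acl):
    """Describe two SETATTRs: special mode bits first, then the ACL.

    The returned dictionaries are a hypothetical in-memory description
    of the operations, not a wire format.
    """
    # First SETATTR: keep only SUID/SGID/SVTX, all other mode bits are 0.
    first = {"op": "SETATTR", "attrs": {"mode": mode & SPECIAL_BITS}}
    # Second SETATTR (ideally in the same COMPOUND): set the ACL alone.
    second = {"op": "SETATTR", "attrs": {"acl": list(acl)}}
    return [first, second]
```

Because the ACL is set last, the permissions it defines are not disturbed by the mode-only request that precedes it.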
3.16.6.4.2. Server Side Recommendations 5.6.4.2. Server Side Recommendations
If both mode and ACL are given to SETATTR, server implementations If both mode and ACL are given to SETATTR, server implementations
should verify that the mode and ACL don't conflict, i.e. the mode should verify that the mode and ACL don't conflict, i.e. the mode
computed from the given ACL must be the same as the given mode, computed from the given ACL must be the same as the given mode,
excluding the SUID, SGID and SVTX bits. The algorithm for assigning excluding the SUID, SGID and SVTX bits. The algorithm for assigning
a new mode based on the ACL can be used. This is described in a new mode based on the ACL can be used. (This is described in
Section 3.16.6.1. If a server receives a request to set both Section 5.6.1.) If a server receives a request to set both
mode and ACL, but the two conflict, the server should return mode and ACL, but the two conflict, the server should return
NFS4ERR_INVAL. NFS4ERR_INVAL.
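A minimal sketch of the server-side check follows. It assumes the mode derived from the ACL has already been computed by the algorithm for recomputing the mode upon SETATTR of ACL, and simply takes it as an argument. The function name and the use of status names as strings are illustrative choices only.

```python
# Special bits are excluded from the comparison, as described above.
SPECIAL_BITS = 0o7000  # SUID | SGID | SVTX


def check_mode_against_acl(given_mode, mode_from_acl):
    """Return a status name for a SETATTR carrying both mode and ACL.

    mode_from_acl is assumed to be the result of running the
    mode-recomputation algorithm over the ACL in the same request.
    """
    if (given_mode & ~SPECIAL_BITS) != (mode_from_acl & ~SPECIAL_BITS):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

For example, a request with mode 04750 and an ACL that computes to 0750 passes, since only the SUID bit differs.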
3.16.6.5. Inheritance and turning it off 5.6.5. Inheritance and turning it off
The inheritance of access permissions may be problematic if a user The inheritance of access permissions may be problematic if a user
cannot prevent their file from inheriting unwanted permissions. For cannot prevent their file from inheriting unwanted permissions. For
example, a user, "samf", sets up a shared project directory to be example, a user, "bob", sets up a shared project directory to be used
used by everyone working on Project Foo. "lisagab" is a part of by everyone working on Project Foo. "alice" is a part of Project Foo,
Project Foo, but is working on something that should not be seen by but is working on something that should not be seen by anyone else.
anyone else. How can "lisagab" make sure that any new files that she How can "alice" make sure that any new files that she creates in this
creates in this shared project directory do not inherit anything that shared project directory do not inherit anything that could
could compromise the security of her work? compromise the security of her work?
More relevant to the implementors of NFS version 4 clients and More relevant to the implementors of NFS version 4 clients and
servers is the question of how to communicate the fact that user, servers is the question of how to communicate the fact that user
"lisagab", doesn't want any permissions to be inherited to her newly "alice" doesn't want any permissions to be inherited to her newly
created file or directory. created file or directory.
To do this, implementors should standardize on what the behavior of To do this, implementors should standardize on what the behavior of
CREATE and OPEN must be if: CREATE and OPEN must be if:
1. just mode is given 1. just mode is given
In this case, inheritance will take place, but the mode will be In this case, inheritance will take place, but the mode will be
applied to the inherited ACL as described in Section 3.16.6.1, applied to the inherited ACL as described in Section 5.6.1,
thereby modifying the ACL. thereby modifying the ACL.
2. just ACL is given 2. just ACL is given
In this case, inheritance will not take place, and the ACL as In this case, inheritance will not take place, and the ACL as
defined in the CREATE or OPEN will be set without modification. defined in the CREATE or OPEN will be set without modification.
3. both mode and ACL are given 3. both mode and ACL are given
In this case, implementors should verify that the mode and ACL In this case, implementors should verify that the mode and ACL
don't conflict, i.e. the mode computed from the given ACL must be don't conflict, i.e. the mode computed from the given ACL must be
the same as the given mode. The algorithm for assigning a new the same as the given mode. The algorithm for assigning a new
mode based on the ACL can be used. (This is described in mode based on the ACL can be used. (This is described in
Section 3.16.6.1.) If a server receives a request to set both mode Section 5.6.1.) If a server receives a request to set both mode
and ACL, but the two conflict, the server should return and ACL, but the two conflict, the server should return
NFS4ERR_INVAL. If the mode and ACL don't conflict, inheritance NFS4ERR_INVAL. If the mode and ACL don't conflict, inheritance
will not take place and both the mode and ACL will be set will not take place and both the mode and ACL will be set
without modification. without modification.
4. neither mode nor ACL are given 4. neither mode nor ACL are given
In this case, inheritance will take place and no modifications to In this case, inheritance will take place and no modifications to
the ACL will happen. It is worth noting that if no inheritable the ACL will happen. It is worth noting that if no inheritable
ACEs exist on the parent directory, the file will be created with ACEs exist on the parent directory, the file will be created with
an empty ACL, thus granting no accesses. an empty ACL, thus granting no accesses.
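The four cases above can be summarized as a small decision function. This is a sketch only. The function name, the tuple layout of the result, and the conflict parameter (standing in for the outcome of the mode/ACL consistency check) are all assumptions made for illustration.

```python
def create_open_behavior(mode_given, acl_given, conflict=False):
    """Return (status, inherit, mode_modifies_acl) for CREATE or OPEN.

    conflict is only meaningful when both mode and ACL are given; it is
    assumed to be the result of comparing the given mode with the mode
    computed from the given ACL.
    """
    if mode_given and acl_given:
        if conflict:
            # Case 3, conflicting attributes: reject the request.
            return ("NFS4ERR_INVAL", None, None)
        # Case 3, consistent attributes: set both as given, no inheritance.
        return ("NFS4_OK", False, False)
    if mode_given:
        # Case 1: inherit, then apply the mode to the inherited ACL.
        return ("NFS4_OK", True, True)
    if acl_given:
        # Case 2: set the given ACL unmodified, no inheritance.
        return ("NFS4_OK", False, False)
    # Case 4: inherit with no modification; the ACL may end up empty.
    return ("NFS4_OK", True, False)
```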
3.16.6.6. Deficiencies in a Mode Representation of an ACL 5.6.6. Deficiencies in a Mode Representation of an ACL
In the presence of an ACL, there are certain cases when the In the presence of an ACL, there are certain cases when the
representation of the mode is not guaranteed to be accurate. An representation of the mode is not guaranteed to be accurate. An
example of a situation is detailed below. example of a situation is detailed below.
As mentioned in Section 3.16.6, the representation of the mode is As mentioned in Section 5.6, the representation of the mode is
deterministic, but not guaranteed to be accurate. The mode bits deterministic, but not guaranteed to be accurate. The mode bits
potentially convey a more restrictive permission than what will potentially convey a more restrictive permission than what will
actually be granted via the ACL. actually be granted via the ACL.
Given the following ACL of two ACEs: Given the following ACL of two ACEs:
GROUP@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE: GROUP@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE:
ACE4_IDENTIFIER_GROUP:ALLOW ACE4_IDENTIFIER_GROUP:ALLOW
EVERYONE@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE::DENY EVERYONE@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE::DENY
skipping to change at page 69, line 25 skipping to change at page 74, line 25
longer in group "staff". User "bob" logs in to the system again, and longer in group "staff". User "bob" logs in to the system again, and
thus more processes are created, this time owned by "bob" but NOT in thus more processes are created, this time owned by "bob" but NOT in
group "staff". group "staff".
A mode of 0770 is inaccurate for processes not belonging to group A mode of 0770 is inaccurate for processes not belonging to group
"staff". But even if the mode of the file were proactively changed "staff". But even if the mode of the file were proactively changed
to 0070 at the time the group database was edited, mode 0070 would be to 0070 at the time the group database was edited, mode 0070 would be
inaccurate for the pre-existing processes owned by user "bob" and inaccurate for the pre-existing processes owned by user "bob" and
having membership in group "staff". having membership in group "staff".
4. Single-server Name Space 6. Single-server Name Space
This chapter describes the NFSv4 single-server name space. Single- This chapter describes the NFSv4 single-server name space. Single-
server namespaces may be presented directly to clients, or they may server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site- be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described wide or organization-wide) to be presented to clients, as described
in Section 10. in Section 12.
4.1. Server Exports 6.1. Server Exports
On a UNIX server, the name space describes all the files reachable by On a UNIX server, the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped the name space constitutes all the files on disks named by mapped
disk letters. NFS server administrators rarely make the entire disk letters. NFS server administrators rarely make the entire
server's filesystem name space available to NFS clients. More often server's filesystem name space available to NFS clients. More often
portions of the name space are made available via an "export" portions of the name space are made available via an "export"
feature. In previous versions of the NFS protocol, the root feature. In previous versions of the NFS protocol, the root
filehandle for each export is obtained through the MOUNT protocol; filehandle for each export is obtained through the MOUNT protocol;
the client sends a string that identifies the export of name space the client sends a string that identifies the export of name space
and the server returns the root filehandle for it. The MOUNT and the server returns the root filehandle for it. The MOUNT
protocol supports an EXPORTS procedure that will enumerate the protocol supports an EXPORTS procedure that will enumerate the
server's exports. server's exports.
4.2. Browsing Exports 6.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for the exports of a particular server, can use to obtain filehandles for the exports of a particular server,
via a series of LOOKUP operations within a COMPOUND, to traverse a via a series of LOOKUP operations within a COMPOUND, to traverse a
path. A common user experience is to use a graphical user interface path. A common user experience is to use a graphical user interface
(perhaps a file "Open" dialog window) to find a file via progressive (perhaps a file "Open" dialog window) to find a file via progressive
browsing through a directory tree. The client must be able to move browsing through a directory tree. The client must be able to move
from one export to another export via single-component, progressive from one export to another export via single-component, progressive
LOOKUP operations. LOOKUP operations.
skipping to change at page 70, line 27 skipping to change at page 75, line 27
An automounter on the client can obtain a snapshot of the server's An automounter on the client can obtain a snapshot of the server's
name space using the EXPORTS procedure of the MOUNT protocol. If it name space using the EXPORTS procedure of the MOUNT protocol. If it
understands the server's pathname syntax, it can create an image of understands the server's pathname syntax, it can create an image of
the server's name space on the client. The parts of the name space the server's name space on the client. The parts of the name space
that are not exported by the server are filled in with a "pseudo that are not exported by the server are filled in with a "pseudo
filesystem" that allows the user to browse from one mounted filesystem" that allows the user to browse from one mounted
filesystem to another. There is a drawback to this representation of filesystem to another. There is a drawback to this representation of
the server's name space on the client: it is static. If the server the server's name space on the client: it is static. If the server
administrator adds a new export the client will be unaware of it. administrator adds a new export the client will be unaware of it.
4.3. Server Pseudo Filesystem 6.3. Server Pseudo Filesystem
NFS version 4 servers avoid this name space inconsistency by NFS version 4 servers avoid this name space inconsistency by
presenting all the exports for a given server within the framework of presenting all the exports for a given server within the framework of
a single namespace, for that server. An NFS version 4 client uses a single namespace, for that server. An NFS version 4 client uses
LOOKUP and READDIR operations to browse seamlessly from one export to LOOKUP and READDIR operations to browse seamlessly from one export to
another. Portions of the server name space that are not exported are another. Portions of the server name space that are not exported are
bridged via a "pseudo filesystem" that provides a view of exported bridged via a "pseudo filesystem" that provides a view of exported
directories only. A pseudo filesystem has a unique fsid and behaves directories only. A pseudo filesystem has a unique fsid and behaves
like a normal, read only filesystem. like a normal, read only filesystem.
skipping to change at page 71, line 5 skipping to change at page 76, line 5
that multiple pseudo filesystems may exist. For example, that multiple pseudo filesystems may exist. For example,
/a pseudo filesystem /a pseudo filesystem
/a/b real filesystem /a/b real filesystem
/a/b/c pseudo filesystem /a/b/c pseudo filesystem
/a/b/c/d real filesystem /a/b/c/d real filesystem
Each of the pseudo filesystems is considered a separate entity and Each of the pseudo filesystems is considered a separate entity and
therefore will have its own unique fsid. therefore will have its own unique fsid.
4.4. Multiple Roots 6.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as The DOS and Windows operating environments are sometimes described as
having "multiple roots". Filesystems are commonly represented as having "multiple roots". Filesystems are commonly represented as
disk letters. MacOS represents filesystems as top level names. NFS disk letters. MacOS represents filesystems as top level names. NFS
version 4 servers for these platforms can construct a pseudo file version 4 servers for these platforms can construct a pseudo file
system above these root names so that disk letters or volume names system above these root names so that disk letters or volume names
are simply directory names in the pseudo root. are simply directory names in the pseudo root.
4.5. Filehandle Volatility 6.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical The nature of the server's pseudo filesystem is that it is a logical
representation of filesystem(s) available from the server. representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed Therefore, the pseudo filesystem is most likely constructed
dynamically when the server is first instantiated. It is expected dynamically when the server is first instantiated. It is expected
that the pseudo filesystem may not have an on disk counterpart from that the pseudo filesystem may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the preferable that the server provide persistent filehandles for the
pseudo filesystem, the NFS client should expect that pseudo file pseudo filesystem, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g. with a series of LOOKUP prepared to recover a filehandle value (e.g. with a series of LOOKUP
operations) when receiving an error of NFS4ERR_FHEXPIRED. operations) when receiving an error of NFS4ERR_FHEXPIRED.
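The recovery behavior described above might look like the following on the client. This is a sketch under stated assumptions: FHExpired stands in for an NFS4ERR_FHEXPIRED reply, and op and lookup_path are hypothetical callables supplied by the client implementation, where lookup_path walks the path again with a series of LOOKUP operations.

```python
class FHExpired(Exception):
    """Hypothetical exception raised when the server returns NFS4ERR_FHEXPIRED."""


def with_fh_recovery(op, fh, path, lookup_path, retries=1):
    """Run op(fh); on an expired filehandle, look the path up again and retry."""
    for attempt in range(retries + 1):
        try:
            return op(fh)
        except FHExpired:
            if attempt == retries:
                raise  # give up after the configured number of retries
            # Recover a fresh filehandle by walking the path again.
            fh = lookup_path(path)
```

The retry count is kept small on purpose: a filehandle that expires again immediately after recovery usually points to a problem that another LOOKUP will not fix.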
4.6. Exported Root 6.6. Exported Root
If the server's root filesystem is exported, one might conclude that If the server's root filesystem is exported, one might conclude that
a pseudo-filesystem is unneeded. This is not necessarily so. Assume a pseudo-filesystem is unneeded. This is not necessarily so. Assume
the following filesystems on a server: the following filesystems on a server:
/ disk1 (exported) / disk1 (exported)
/a disk2 (not exported) /a disk2 (not exported)
/a/b disk3 (exported) /a/b disk3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple Because disk2 is not exported, disk3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-filesystem. LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
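As an illustration of the example above, the following sketch computes which directories a server would have to bridge with a pseudo-filesystem. The mounts table (mount point mapped to an "exported" flag) and the function names are hypothetical, and the table is assumed to include the root "/".

```python
import posixpath


def covering_mount(path, mounts):
    """Return the mount point of the filesystem that contains path."""
    while path not in mounts:
        path = posixpath.dirname(path)
    return path


def pseudo_dirs(mounts):
    """Return the ancestor directories of exports that lie on unexported filesystems."""
    needed = set()
    for mount_point, exported in mounts.items():
        if not exported:
            continue
        path = posixpath.dirname(mount_point)
        while True:
            # An ancestor on an unexported filesystem must be bridged.
            if not mounts[covering_mount(path, mounts)]:
                needed.add(path)
            if path == "/":
                break
            path = posixpath.dirname(path)
    return needed
```

For the three filesystems listed above, only "/a" needs to be bridged: it lies on disk2, which is not exported, yet it is on the path to the exported disk3.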
4.7. Mount Point Crossing 6.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way The server filesystem environment may be constructed in such a way
that one filesystem contains a directory which is 'covered' or that one filesystem contains a directory which is 'covered' or
mounted upon by a second filesystem. For example: mounted upon by a second filesystem. For example:
/a/b (filesystem 1) /a/b (filesystem 1)
/a/b/c/d (filesystem 2) /a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look The pseudo filesystem for this server may be constructed to look
like: like:
skipping to change at page 72, line 20 skipping to change at page 77, line 20
It is the server's responsibility to present a pseudo filesystem It is the server's responsibility to present a pseudo filesystem
that is complete to the client. If the client sends a lookup request that is complete to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of for the path "/a/b/c/d", the server's response is the filehandle of
the filesystem "/a/b/c/d". In previous versions of the NFS protocol, the filesystem "/a/b/c/d". In previous versions of the NFS protocol,
the server would respond with the filehandle of directory "/a/b/c/d" the server would respond with the filehandle of directory "/a/b/c/d"
within the filesystem "/a/b". within the filesystem "/a/b".
The NFS client will be able to determine if it crosses a server mount The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute. point by a change in the value of the "fsid" attribute.
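A client-side check for mount point crossings could be sketched as below. The input is assumed to be the sequence of (component name, fsid) pairs observed while walking a path, with each fsid represented as a (major, minor) tuple. The function name and data layout are illustrative only.

```python
def mount_crossings(walk):
    """Return the components at which the fsid differs from the previous one.

    walk is a list of (name, fsid) pairs in path order, starting at the root.
    """
    crossed = []
    previous = None
    for name, fsid in walk:
        if previous is not None and fsid != previous:
            crossed.append(name)
        previous = fsid
    return crossed
```

With the layout from the example above, walking "/a/b/c/d" would report a crossing at "b" (pseudo filesystem into filesystem 1) and at "d" (filesystem 1 into filesystem 2).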
4.8. Security Policy and Name Space Presentation 6.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully The application of the server's security policy needs to be carefully
considered by the implementor. One may choose to limit the considered by the implementor. One may choose to limit the
viewability of portions of the pseudo filesystem based on the viewability of portions of the pseudo filesystem based on the
server's perception of the client's ability to authenticate itself server's perception of the client's ability to authenticate itself
properly. However, with the support of multiple security mechanisms properly. However, with the support of multiple security mechanisms
and the ability to negotiate the appropriate use of these mechanisms, and the ability to negotiate the appropriate use of these mechanisms,
the server is unable to properly determine if a client will be able the server is unable to properly determine if a client will be able
to authenticate itself. If, based on its policies, the server to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo filesystem, the server chooses to limit the contents of the pseudo filesystem, the server
skipping to change at page 73, line 5 skipping to change at page 78, line 5
The security policy for /a/b/c is Kerberos with integrity. The The security policy for /a/b/c is Kerberos with integrity. The
server should apply the same security policy to /, /a, and /a/b. server should apply the same security policy to /, /a, and /a/b.
This allows for the extension of the protection of the server's This allows for the extension of the protection of the server's
namespace to the ancestors of the real shared resource. namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of server's namespace should be the union of all security mechanisms of
all direct descendants. all direct descendants.
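The union rule above can be sketched with a small recursive function. The children and policy tables and the flavor labels used in the usage example are hypothetical. Objects with an explicit policy (real exports) keep it, and every other object takes the union over its direct descendants.

```python
def security_flavors(path, children, policy):
    """Return the set of security mechanisms to apply at path.

    children maps a path to its direct descendants; policy maps exported
    paths to their configured mechanisms.
    """
    if path in policy:
        return set(policy[path])
    flavors = set()
    for child in children.get(path, []):
        flavors |= security_flavors(child, children, policy)
    return flavors
```

For instance, if "/a/b" requires one mechanism and "/a/c" another, both "/a" and "/" would accept either.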
5. File Locking and Share Reservations 7. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of share reservations the protocol stateful. With the inclusion of share reservations the protocol
becomes substantially more dependent on state than the traditional becomes substantially more dependent on state than the traditional
combination of NFS and NLM [XNFS]. There are three components to combination of NFS and NLM [XNFS]. There are three components to
making this state manageable: making this state manageable:
o Clear division between client and server o Clear division between client and server
o Ability to reliably detect inconsistency in state between client o Ability to reliably detect inconsistency in state between client
skipping to change at page 73, line 39 skipping to change at page 78, line 39
protocol mechanisms used when a file is opened or created (LOOKUP, protocol mechanisms used when a file is opened or created (LOOKUP,
CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has
an OPEN operation that subsumes the NFS version 3 methodology of an OPEN operation that subsumes the NFS version 3 methodology of
LOOKUP, CREATE, and ACCESS. However, because many operations require LOOKUP, CREATE, and ACCESS. However, because many operations require
a filehandle, the traditional LOOKUP is preserved to map a file name a filehandle, the traditional LOOKUP is preserved to map a file name
to filehandle without establishing state on the server. The policy to filehandle without establishing state on the server. The policy
of granting access or modifying files is managed by the server based of granting access or modifying files is managed by the server based
on the client's state. These mechanisms can implement policy ranging on the client's state. These mechanisms can implement policy ranging
from advisory only locking to full mandatory locking. from advisory only locking to full mandatory locking.
5.1. Locking 7.1. Locking
It is assumed that manipulating a lock is rare when compared to READ It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock information required to establish a lock and uniquely define the lock
owner. owner.
The following sections describe the transition from the heavy weight The following sections describe the transition from the heavy weight
information to the eventual stateid used for most client and server information to the eventual stateid used for most client and server
locking and lease interactions. locking and lease interactions.
5.1.1. Client ID 7.1.1. Client ID
For each LOCK request, the client must identify itself to the server. For each LOCK request, the client must identify itself to the server.
This is done in such a way as to allow for correct lock This is done in such a way as to allow for correct lock
identification and crash recovery. A sequence of a SETCLIENTID identification and crash recovery. A sequence of a SETCLIENTID
operation followed by a SETCLIENTID_CONFIRM operation is required to operation followed by a SETCLIENTID_CONFIRM operation is required to
establish the identification onto the server. Establishment of establish the identification onto the server. Establishment of
identification by a new incarnation of the client also has the effect identification by a new incarnation of the client also has the effect
of immediately breaking any leased state that a previous incarnation of immediately breaking any leased state that a previous incarnation
of the client might have had on the server, as opposed to forcing the of the client might have had on the server, as opposed to forcing the
new client incarnation to wait for the leases to expire. Breaking new client incarnation to wait for the leases to expire. Breaking
skipping to change at page 74, line 31 skipping to change at page 79, line 31
Client identification is encapsulated in the following structure: Client identification is encapsulated in the following structure:
struct nfs_client_id4 { struct nfs_client_id4 {
verifier4 verifier; verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>; opaque id<NFS4_OPAQUE_LIMIT>;
}; };
The first field, verifier, is a client incarnation verifier that is The first field, verifier, is a client incarnation verifier that is
used to detect client reboots. Only if the verifier is different used to detect client reboots. Only if the verifier is different
from that the server has previously recorded the client (as from that the server has previously recorded for the client (as
identified by the second field of the structure, id) does the server identified by the second field of the structure, id) does the server
start the process of canceling the client's leased state. start the process of canceling the client's leased state.
The second field, id, is a variable length string that uniquely The second field, id, is a variable length string that uniquely
defines the client. defines the client.
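The verifier/id comparison described above can be sketched as follows. This is a minimal illustration, not the protocol's actual state machine: the names ClientTable, set_client, add_state and state_of are hypothetical, and the two-step SETCLIENTID / SETCLIENTID_CONFIRM exchange is collapsed into a single call for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ClientRecord:
    verifier: bytes                                    # last recorded incarnation verifier
    leased_state: list = field(default_factory=list)   # stand-in for locks and opens


class ClientTable:
    """Hypothetical server-side table keyed by the opaque id string."""

    def __init__(self):
        self._by_id = {}

    def set_client(self, client_id: bytes, verifier: bytes) -> bool:
        """Record (id, verifier); return True if a client reboot was detected."""
        rec = self._by_id.get(client_id)
        if rec is None:
            # First time this id is seen: nothing to cancel.
            self._by_id[client_id] = ClientRecord(verifier)
            return False
        if rec.verifier != verifier:
            # Same id, different verifier: a new incarnation of the client.
            # Cancel the state leased to the previous incarnation.
            rec.verifier = verifier
            rec.leased_state.clear()
            return True
        # Same id and same verifier: no reboot, state is left alone.
        return False

    def add_state(self, client_id: bytes, item) -> None:
        self._by_id[client_id].leased_state.append(item)

    def state_of(self, client_id: bytes) -> list:
        return list(self._by_id[client_id].leased_state)
```

Only a changed verifier for an already known id triggers cancellation; a repeated request with the same verifier leaves existing state untouched.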
There are several considerations for how the client generates the id There are several considerations for how the client generates the id
string: string:
o The string should be unique so that multiple clients do not o The string should be unique so that multiple clients do not
skipping to change at page 76, line 44 skipping to change at page 81, line 44
The client must also employ the SETCLIENTID operation when it The client must also employ the SETCLIENTID operation when it
receives a NFS4ERR_STALE_STATEID error using a stateid derived from receives a NFS4ERR_STALE_STATEID error using a stateid derived from
its current clientid, since this also indicates a server reboot which its current clientid, since this also indicates a server reboot which
has invalidated the existing clientid (see the next section has invalidated the existing clientid (see the next section
"lock_owner and stateid Definition" for details). "lock_owner and stateid Definition" for details).
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
for a complete specification of the operations. for a complete specification of the operations.
5.1.2. Server Release of Clientid 7.1.2. Server Release of Clientid
If the server determines that the client holds no associated state If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the the client receives the appropriate error so that it will use the
SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity.
It should be clear that the server must be very hesitant to release a It should be clear that the server must be very hesitant to release a
skipping to change at page 77, line 29 skipping to change at page 82, line 29
that changes security flavors, and under the new flavor, there is no that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE. NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID for a client id In that event, when the server gets a SETCLIENTID for a client id
that currently has no state, or it has state, but the lease has that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the SETCLIENTID, and confirm the new clientid if followed by allow the SETCLIENTID, and confirm the new clientid if followed by
the appropriate SETCLIENTID_CONFIRM. the appropriate SETCLIENTID_CONFIRM.
5.1.3. lock_owner and stateid Definition 7.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the When requesting a lock, the client must present to the server the
clientid and an identifier for the owner of the requested lock. clientid and an identifier for the owner of the requested lock.
These two fields are referred to as the lock_owner and the definitions These two fields are referred to as the lock_owner and the definitions
of those fields are: of those fields are:
o A clientid returned by the server as part of the client's use of o A clientid returned by the server as part of the client's use of
the SETCLIENTID operation. the SETCLIENTID operation.
o A variable length opaque array used to uniquely define the owner o A variable length opaque array used to uniquely define the owner
skipping to change at page 79, line 5 skipping to change at page 84, line 5
o utilize the "seqid" field of each stateid, such that seqid is o utilize the "seqid" field of each stateid, such that seqid is
monotonically incremented for each stateid that is associated with monotonically incremented for each stateid that is associated with
the same index into the locking-state table. the same index into the locking-state table.
By matching the incoming stateid and its field values with the state By matching the incoming stateid and its field values with the state
held at the server, the server is able to easily determine if a held at the server, the server is able to easily determine if a
stateid is valid for its current instantiation and state. If the stateid is valid for its current instantiation and state. If the
stateid is not valid, the appropriate error can be supplied to the stateid is not valid, the appropriate error can be supplied to the
client. client.
5.1.4. Use of the stateid and Locking 7.1.4. Use of the stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. explicitly mentioned in the text.
If the lock_owner performs a READ or WRITE in a situation in which it If the lock_owner performs a READ or WRITE in a situation in which it
has established a lock or share reservation on the server (any OPEN has established a lock or share reservation on the server (any OPEN
skipping to change at page 81, line 14 skipping to change at page 86, line 14
A lock may not be granted while a READ or WRITE operation using one A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above. WRITE as discussed above.
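The conflict rule in the preceding paragraph can be expressed as a small predicate. This is a sketch under stated assumptions: LockType, IoKind and both function names are hypothetical, byte ranges are given as finite (offset, length) pairs, and a SETATTR that sets size is expected to be passed in as a WRITE by the caller.

```python
from enum import Enum


class LockType(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class IoKind(Enum):
    READ = "read"
    WRITE = "write"   # also used for a SETATTR that changes the file size


def ranges_overlap(off_a: int, len_a: int, off_b: int, len_b: int) -> bool:
    """True if the half-open ranges [off, off + len) share at least one byte."""
    return off_a < off_b + len_b and off_b < off_a + len_a


def lock_conflicts_with_io(lock_type: LockType, lock_off: int, lock_len: int,
                           io_kind: IoKind, io_off: int, io_len: int) -> bool:
    """Should a lock request wait on an in-progress special-stateid READ/WRITE?"""
    if not ranges_overlap(lock_off, lock_len, io_off, io_len):
        return False
    if lock_type is LockType.SHARED:
        # A shared lock conflicts only with a WRITE in progress.
        return io_kind is IoKind.WRITE
    # An exclusive lock conflicts with either a READ or a WRITE in progress.
    return True
```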
5.1.5. Sequencing of Lock Requests 7.1.5. Sequencing of Lock Requests
Locking is different than most NFS operations as it requires "at- Locking is different from most NFS operations as it requires "at-
most-one" semantics that are not provided by ONCRPC. ONCRPC over a most-one" semantics that are not provided by ONCRPC. ONCRPC over a
reliable transport is not sufficient because a sequence of locking reliable transport is not sufficient because a sequence of locking
requests may span multiple TCP connections. In the face of requests may span multiple TCP connections. In the face of
retransmission or reordering, lock or unlock requests must have a retransmission or reordering, lock or unlock requests must have a
well defined and consistent behavior. To accomplish this, each lock well defined and consistent behavior. To accomplish this, each lock
request contains a sequence number that is a consecutively increasing request contains a sequence number that is a consecutively increasing
integer. Different lock_owners have different sequences. The server integer. Different lock_owners have different sequences. The server
maintains the last sequence number (L) received and the response that maintains the last sequence number (L) received and the response that
was returned. The first request issued for any given lock_owner is was returned. The first request issued for any given lock_owner is
issued with a sequence number of zero. issued with a sequence number of zero.
skipping to change at page 82, line 14 skipping to change at page 87, line 14
The client MUST monotonically increment the sequence number for the The client MUST monotonically increment the sequence number for the
CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
operations. This is true even in the event that the previous operations. This is true even in the event that the previous
operation that used the sequence number received an error. The only operation that used the sequence number received an error. The only
exception to this rule is if the previous operation received one of exception to this rule is if the previous operation received one of
the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE. NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
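The increment rule above can be sketched on the client side as follows. The function name next_seqid and the use of status names as plain strings are illustrative assumptions; wraparound of the sequence number is not modeled.

```python
# Errors after which the client does NOT advance the sequence number,
# taken from the list in the paragraph above.
NO_INCREMENT_ERRORS = frozenset({
    "NFS4ERR_STALE_CLIENTID",
    "NFS4ERR_STALE_STATEID",
    "NFS4ERR_BAD_STATEID",
    "NFS4ERR_BAD_SEQID",
    "NFS4ERR_BADXDR",
    "NFS4ERR_RESOURCE",
    "NFS4ERR_NOFILEHANDLE",
})


def next_seqid(current_seqid: int, status: str) -> int:
    """Return the seqid to use for the lock_owner's next sequenced operation."""
    if status in NO_INCREMENT_ERRORS:
        # The exceptions: the same seqid is reused.
        return current_seqid
    # Success and every other error both advance the sequence.
    return current_seqid + 1
```

Note that an ordinary failure such as NFS4ERR_DENIED still advances the sequence; only the listed errors leave it unchanged.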
5.1.6. Recovery from Replayed Requests 7.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long As described above, the sequence number is per lock_owner. As long
as the server maintains the last sequence number received and follows as the server maintains the last sequence number received and follows
the methods described above, there are no risks of a Byzantine router the methods described above, there are no risks of a Byzantine router
re-sending old requests. The server need only maintain the re-sending old requests. The server need only maintain the
(lock_owner, sequence number) state as long as there are open files (lock_owner, sequence number) state as long as there are open files
or closed files with locks outstanding. or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
number and therefore the risk of the replay of these operations number and therefore the risk of the replay of these operations
resulting in undesired effects is non-existent while the server resulting in undesired effects is non-existent while the server
maintains the lock_owner state. maintains the lock_owner state.
5.1.7. Releasing lock_owner State 7.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking When a particular lock_owner no longer holds open or file locking
state at the server, the server may choose to release the sequence state at the server, the server may choose to release the sequence
number state associated with the lock_owner. The server may make number state associated with the lock_owner. The server may make
this choice based on lease expiration, for the reclamation of server this choice based on lease expiration, for the reclamation of server
memory, or other implementation specific details. In any event, the memory, or other implementation specific details. In any event, the
server is able to do this safely only when the lock_owner no longer server is able to do this safely only when the lock_owner no longer
is being utilized by the client. The server may choose to hold the is being utilized by the client. The server may choose to hold the
lock_owner state in the event that retransmitted requests are lock_owner state in the event that retransmitted requests are
received. However, the period to hold this state is implementation received. However, the period to hold this state is implementation
specific. specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner retransmitted after the server has previously released the lock_owner
state, the server will find that the lock_owner has no files open and state, the server will find that the lock_owner has no files open and
an error will be returned to the client. If the lock_owner does have an error will be returned to the client. If the lock_owner does have
a file open, the stateid will not match and again an error is a file open, the stateid will not match and again an error is
returned to the client. returned to the client.
5.1.8. Use of Open Confirmation 7.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being In the case that an OPEN is retransmitted and the lock_owner is being
used for the first time or the lock_owner state has been previously used for the first time or the lock_owner state has been previously
released by the server, the use of the OPEN_CONFIRM operation will released by the server, the use of the OPEN_CONFIRM operation will
prevent incorrect behavior. When the server observes the use of the prevent incorrect behavior. When the server observes the use of the
lock_owner for the first time, it will direct the client to perform lock_owner for the first time, it will direct the client to perform
the OPEN_CONFIRM for the corresponding OPEN. This sequence the OPEN_CONFIRM for the corresponding OPEN. This sequence
establishes the use of a lock_owner and associated sequence number. establishes the use of a lock_owner and associated sequence number.
Since the OPEN_CONFIRM sequence connects a new open_owner on the Since the OPEN_CONFIRM sequence connects a new open_owner on the
server with an existing open_owner on a client, the sequence number server with an existing open_owner on a client, the sequence number
skipping to change at page 84, line 5 skipping to change at page 89, line 5
Requiring open confirmation on reclaim-type opens is avoidable Requiring open confirmation on reclaim-type opens is avoidable
because of the nature of the environments in which such opens are because of the nature of the environments in which such opens are
done. For CLAIM_PREVIOUS opens, this is immediately after server done. For CLAIM_PREVIOUS opens, this is immediately after server
reboot, so there should be no time for lockowners to be created, reboot, so there should be no time for lockowners to be created,
found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we
are dealing with a client reboot situation. A server which supports are dealing with a client reboot situation. A server which supports
delegation can be sure that no lockowners for that client have been delegation can be sure that no lockowners for that client have been
recycled since client initialization and thus can ensure that recycled since client initialization and thus can ensure that
confirmation will not be required. confirmation will not be required.
5.2. Lock Ranges 7.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock. and then either upgrade or unlock a sub-range of the initial lock.
It is expected that this will be an uncommon type of request. In any It is expected that this will be an uncommon type of request. In any
case, servers or server filesystems may not be able to support sub- case, servers or server filesystems may not be able to support sub-
range lock semantics. In the event that a server receives a locking range lock semantics. In the event that a server receives a locking
request that represents a sub-range of current locking state for the request that represents a sub-range of current locking state for the
lock owner, the server is allowed to return the error lock owner, the server is allowed to return the error
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock
operations. Therefore, the client should be prepared to receive this operations. Therefore, the client should be prepared to receive this
skipping to change at page 84, line 28 skipping to change at page 89, line 28
The client is discouraged from combining multiple independent locking The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure. the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure. similar to the client's locking behavior prior to server failure.
5.3. Upgrading and Downgrading Locks 7.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application. appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the receive such errors and if appropriate, report the error to the
requesting application. requesting application.
5.4. Blocking Locks 7.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version Some clients require the support of blocking locks. The NFS version
4 protocol must not rely on a callback mechanism and therefore is 4 protocol must not rely on a callback mechanism and therefore is
unable to notify a client when a previously denied lock has been unable to notify a client when a previously denied lock has been
granted. Clients have no choice but to continually poll for the granted. Clients have no choice but to continually poll for the
lock. This presents a fairness problem. Two new lock types are lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first is released, the server may wait the lease period for the first
skipping to change at page 85, line 28 skipping to change at page 90, line 28
storage would be required to guarantee ordered granting of blocking storage would be required to guarantee ordered granting of blocking
locks. locks.
Servers may also note the lock types and delay returning denial of Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks. avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the The server should take care in the length of delay in the event the
client retransmits the request. client retransmits the request.
5.5. Lease Renewal 7.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired. renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for The following events cause implicit renewal of all of the leases for
a given client (i.e. all those sharing a given clientid). Each of a given client (i.e. all those sharing a given clientid). Each of
these is a positive indication that the client is still active and these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is that the associated state held at the server, for the client, is
skipping to change at page 86, line 24 skipping to change at page 91, line 24
renewal and in the worst case one RPC is required every lease period renewal and in the worst case one RPC is required every lease period
(i.e. a RENEW operation). The number of locks held by the client is (i.e. a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the not a factor since all state for the client is involved with the
lease renewal action. lease renewal action.
Since all operations that create a new lease also renew existing Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions. easily updated upon implicit lease renewal actions.
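A single expiration time per clientid, as described above, might be kept like this. It is a minimal sketch: LeaseTable and its methods are hypothetical names, and time is passed in explicitly as a number of seconds so the behavior is easy to follow.

```python
class LeaseTable:
    """One common lease expiration time per clientid (illustrative only)."""

    def __init__(self, lease_period: float):
        self.lease_period = lease_period
        self._expiry = {}   # clientid -> absolute expiration time

    def renew(self, clientid: int, now: float) -> None:
        # Called for an explicit RENEW and for any operation that implicitly
        # renews. Every lease held under the clientid shares this one
        # timestamp, so the cost does not depend on the number of locks.
        self._expiry[clientid] = now + self.lease_period

    def is_expired(self, clientid: int, now: float) -> bool:
        expiry = self._expiry.get(clientid)
        return expiry is None or now >= expiry
```

Because the timestamp is per client rather than per lock, a renewal is a single dictionary update regardless of how much state the client holds.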
5.6. Crash Recovery 7.6. Crash Recovery
The important requirement in crash recovery is that both the client The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is and the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and client has successfully recovered the locks protecting the READ and
WRITE operations. WRITE operations.
5.6.1. Client Failure and Recovery 7.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's In the event that a client fails, the server may recover the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease period the client may be forced to wait the remainder of the lease
period before obtaining new locks. period before obtaining new locks.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client supplied verifier. This
skipping to change at page 87, line 16 skipping to change at page 92, line 16
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was all locks held which are associated with the old clientid which was
derived from the old verifier. derived from the old verifier.
Note that the verifier must have the same uniqueness properties as Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
5.6.2. Server Failure and Recovery 7.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re- or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re- establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not client. Likewise, if there is the possibility that clients have not
yet re-established their locking state for a file, the server must yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. The duration of disallow READ and WRITE operations for that file.
this recovery period is equal to the duration of the lease period.
A client can determine that server failure (and thus loss of locking A client can determine that server failure (and thus loss of locking
state) has occurred, when it receives one of two errors. The state) has occurred, when it receives one of two errors. The
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these is clientid invalidated by reboot or restart. When either of these is
received, the client must establish a new clientid (See the section received, the client must establish a new clientid (See the section
"Client ID") and re-establish the locking state as discussed below. "Client ID") and re-establish its locking state.
The period of special handling of locking and READs and WRITEs, equal Once a session is established using the new clientid, the client will
in duration to the lease period, is referred to as the "grace use reclaim-type locking requests (i.e. LOCK requests with reclaim
period". During the grace period, clients recover locks and the set to true and OPEN operations with a claim type of CLAIM_PREVIOUS)
associated state by reclaim-type locking requests (i.e. LOCK to re-establish its locking state. Once this is done, or if there is
requests with reclaim set to true and OPEN operations with a claim no such locking state to reclaim, the client does a RECLAIM_COMPLETE
type of CLAIM_PREVIOUS). During the grace period, the server must operation to indicate that it has reclaimed all of the locking state
reject READ and WRITE operations and non-reclaim locking requests that it will reclaim. Once a client does a RECLAIM_COMPLETE
(i.e. other LOCK and OPEN operations) with an error of NFS4ERR_GRACE. operation, it may attempt non-reclaim locking operations, although it
may get NFS4ERR_GRACE errors on these until the period of special
handling is over.
The period of special handling of locking and READs and WRITEs is
referred to as the "grace period". During the grace period, clients
recover locks and the associated state using reclaim-type locking
requests. During this period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e. other LOCK and OPEN
operations) with an error of NFS4ERR_GRACE, unless it is able to
guarantee that these may be done safely, as described below.
The grace period may last until all clients that may have locks have done a
RECLAIM_COMPLETE operation, indicating that they have finished
reclaiming the locks they held before the server reboot. The server
is assumed to maintain in stable storage a list of clients who may
have such locks. The server may also terminate the grace period
before all clients have done RECLAIM_COMPLETE. The server SHOULD NOT
terminate the grace period before a time equal to the lease period in
order to give clients an opportunity to find out about the server
reboot. Some additional time may be added to allow clients to establish a
new clientid and session and to effect lock reclaims.
If the server can reliably determine that granting a non-reclaim If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients, request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned and the non- the NFS4ERR_GRACE error does not have to be returned even within the
reclaim client request can be serviced. For the server to be able to grace period, although NFS4ERR_GRACE must always be returned to
service READ and WRITE operations during the grace period, it must clients attempting a non-reclaim lock request before doing their own
again be able to guarantee that no possible conflict could arise RECLAIM_COMPLETE. For the server to be able to service READ and
between an impending reclaim locking request and the READ or WRITE WRITE operations during the grace period, it must again be able to
operation. If the server is unable to offer that guarantee, the guarantee that no possible conflict could arise between an impending
NFS4ERR_GRACE error must be returned to the client. reclaim locking request and the READ or WRITE operation. If the
server is unable to offer that guarantee, the NFS4ERR_GRACE error
must be returned to the client.
For a server to provide simple, valid handling during the grace For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be could determine if a regular lock or READ or WRITE operation can be
safely processed. safely processed.
For example, if a count of locks on a given file is available in For example, if the server maintained on stable storage summary
stable storage, the server can track reclaimed locks for the file and information on whether mandatory locks exist, either mandatory byte-
when all reclaims have been processed, non-reclaim locking requests range locks, or share reservations specifying deny modes, many
may be processed. This way the server can ensure that non-reclaim requests could be allowed during the grace period. If it is known
locking requests will not conflict with potential reclaim requests. that no such share reservations exist, OPEN requests that do not
With respect to I/O requests, if the server is able to determine that specify deny modes may be safely granted. If, in addition, it is
there are no outstanding reclaim requests for a file by information known that no mandatory byte-range locks exist, either through
from stable storage or another similar mechanism, the processing of information stored on stable storage or simply because the server
I/O requests could proceed normally for the file. does not support such locks, READ and WRITE requests may be safely
processed during the grace period.
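The decision described in the new text above can be sketched roughly as follows. This is only an illustrative sketch, not part of either draft: the GraceSummary record, its field names, and the admit_during_grace function are hypothetical, and results are returned as strings rather than protocol error values. It assumes the server keeps two summary flags on stable storage and knows whether the requesting client has already done its own RECLAIM_COMPLETE.

```python
from dataclasses import dataclass


@dataclass
class GraceSummary:
    """Hypothetical summary information kept on stable storage."""
    deny_share_reservations_may_exist: bool
    mandatory_byte_range_locks_may_exist: bool


def admit_during_grace(op, summary, *, is_reclaim=False, deny_mode=0,
                       client_reclaim_complete=False):
    """Return "NFS4_OK" if the request may go on to normal processing
    during the grace period, otherwise "NFS4ERR_GRACE"."""
    if op in ("OPEN", "LOCK"):
        if is_reclaim:
            # Reclaim-type requests are what the grace period is for.
            return "NFS4_OK"
        if not client_reclaim_complete:
            # Non-reclaim locking before the client's own RECLAIM_COMPLETE
            # is always refused.
            return "NFS4ERR_GRACE"
        if (op == "OPEN" and deny_mode == 0
                and not summary.deny_share_reservations_may_exist):
            # No deny-mode share reservation can be reclaimed later, so an
            # OPEN without a deny mode cannot conflict.
            return "NFS4_OK"
        # Anything else (non-reclaim LOCK, OPEN with a deny mode) is
        # refused in this conservative sketch.
        return "NFS4ERR_GRACE"
    if op in ("READ", "WRITE"):
        if (not summary.deny_share_reservations_may_exist
                and not summary.mandatory_byte_range_locks_may_exist):
            return "NFS4_OK"
        return "NFS4ERR_GRACE"
    raise ValueError("operation not covered by this sketch: " + op)
```

A server with no such summary information would simply return NFS4ERR_GRACE for every non-reclaim case, which is the simple method described earlier in this section.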
To reiterate, for a server that allows non-reclaim lock and I/O To reiterate, for a server that allows non-reclaim lock and I/O
requests to be processed during the grace period, it MUST determine requests to be processed during the grace period, it MUST determine
that no lock subsequently reclaimed will be rejected and that no lock that no lock subsequently reclaimed will be rejected and that no lock
subsequently reclaimed would have prevented any I/O operation subsequently reclaimed would have prevented any I/O operation
processed during the grace period. processed during the grace period.
Clients should be prepared for the return of NFS4ERR_GRACE errors for Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests. In this case the client should non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of employ a retry mechanism for the request. A delay (on the order of
skipping to change at page 89, line 5 skipping to change at page 94, line 27
A server may, upon restart, establish a new value for the lease A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established. previous server instance to be reliably re-established.
5.6.3. Network Partitions and Recovery 7.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a period provided by the server, the server will not have received a
lease renewal from the client. If this occurs, the server may free lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the all locks held for the client. As a result, all stateids held by the
client will become invalid or stale. Once the client is able to client will become invalid or stale. Once the client is able to
reach the server after such a network partition, all I/O submitted by reach the server after such a network partition, all I/O submitted by
the client with the now invalid stateids will fail with the server the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received, returning the error NFS4ERR_EXPIRED. Once this error is received,
the client will suitably notify the application that held the lock. the client will suitably notify the application that held the lock.
skipping to change at page 90, line 39 skipping to change at page 96, line 15
9. Client A issues a RENEW operation, and gets back a 9. Client A issues a RENEW operation, and gets back a
NFS4ERR_STALE_CLIENTID. NFS4ERR_STALE_CLIENTID.
10. Client A reclaims its lock within the server's grace period. 10. Client A reclaims its lock within the server's grace period.
As with the first edge condition, the final step of the scenario of As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client the second edge condition has the server erroneously granting client
A's lock reclaim. A's lock reclaim.
Solving the first and second edge conditions requires that the server Solving the first and second edge conditions requires that the server
either assume after it reboots that edge condition occurs, and thus either always assumes after it reboots that some edge condition
return NFS4ERR_NO_GRACE for all reclaim attempts, or that the server occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or
record some information stable storage. The amount of information that the server record some information in stable storage. The
the server records in stable storage is in inverse proportion to how amount of information the server records in stable storage is in
harsh the server wants to be whenever the edge conditions occur. The inverse proportion to how harsh the server intends to be whenever the
server that is completely tolerant of all edge conditions will record edge conditions arise. The server that is completely tolerant of all
in stable storage every lock that is acquired, removing the lock edge conditions will record in stable storage every lock that is
record from stable storage only when the lock is unlocked by the acquired, removing the lock record from stable storage only when the
client and the lock's lockowner advances the sequence number such lock is released. For the two aforementioned edge conditions, the
that the lock release is not the last stateful event for the harshest a server can be, and still support a grace period for
lockowner's sequence. For the two aforementioned edge conditions,
the harshest a server can be, and still support a grace period for
reclaims, requires that the server record in stable storage reclaims, requires that the server record in stable storage
some minimal information. For example, a server some minimal information. For example, a server
implementation could, for each client, save in stable storage a implementation could, for each client, save in stable storage a
record containing: record containing:
o the client's id string o the client's id string
o a boolean that indicates if the client's lease expired or if there o a boolean that indicates if the client's lease expired or if there
was administrative intervention (see the section, Server was administrative intervention (see the section, Server
Revocation of Locks) to revoke a record lock, share reservation, Revocation of Locks) to revoke a record lock, share reservation,
or delegation or delegation
o a timestamp that is updated the first time after a server boot or o a boolean that indicates whether the client may have locks that it
reboot the client acquires record locking, share reservation, or believes to be reclaimable in situations in which the grace period
delegation state on the server. The timestamp need not be updated was terminated, making the server's view of lock reclaimability
on subsequent lock requests until the server reboots. suspect. The server will set this for any client record in stable
storage where the client has not done a RECLAIM_COMPLETE, before
The server implementation would also record in the stable storage the it grants any new (i.e. not reclaimed) lock to any client.
timestamps from the two most recent server reboots.
Assuming the above record keeping, for the first edge condition, Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired after the server reboots, the record that client A's lease expired
means that another client could have acquired a conflicting record means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE. a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second For the second edge condition, after the server reboots for a second
time, the record that the client had an unexpired record lock, share time, the indication that the client had not completed its reclaims
reservation, or delegation established before the server's previous at the time at which the grace period ended means that the server
incarnation means that the server must reject a reclaim from client A must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.
with the error NFS4ERR_NO_GRACE.
When either edge condition occurs, the client's attempt to reclaim
locks will result in the error NFS4ERR_NO_GRACE. When this is
received, or after the client reboots with no lock state, the client
will issue a RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is
received, the server and client are again in sync regarding
reclaimable locks and both booleans in persistent storage can be
reset, to be set again only when there is a subsequent event that
causes lock reclaim operations to be questionable.
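The record keeping described in the new text (a client id string plus two booleans per client) might be sketched as below. This is an illustrative sketch only: the class, field, and function names are hypothetical, the reclaim_complete flag is assumed to be volatile per-boot state rather than part of the stable-storage record, and results are shown as strings rather than protocol error values.

```python
from dataclasses import dataclass


@dataclass
class ClientRecord:
    """Hypothetical per-client record; the first three fields are the
    ones assumed to live on stable storage."""
    client_id_string: str
    lease_expired_or_revoked: bool = False  # first boolean in the text
    reclaim_suspect: bool = False           # second boolean in the text
    reclaim_complete: bool = False          # volatile, cleared at each boot


def before_first_new_lock(records):
    """Run before granting any new (non-reclaimed) lock: mark every
    client that has not yet done RECLAIM_COMPLETE as suspect."""
    for rec in records:
        if not rec.reclaim_complete:
            rec.reclaim_suspect = True


def on_server_reboot(records):
    """Volatile state is lost; the stable-storage booleans survive."""
    for rec in records:
        rec.reclaim_complete = False


def check_reclaim(rec):
    """Decide whether a reclaim from this client can be trusted."""
    if rec.lease_expired_or_revoked or rec.reclaim_suspect:
        return "NFS4ERR_NO_GRACE"
    return "NFS4_OK"


def on_reclaim_complete(rec):
    """Client and server are in sync again; reset both booleans."""
    rec.lease_expired_or_revoked = False
    rec.reclaim_suspect = False
    rec.reclaim_complete = True
```

In the second edge condition, client A never reaches on_reclaim_complete before the server grants a new lock to another client, so A is marked suspect and its reclaim after the next reboot is refused.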
Regardless of the level and approach to record keeping, the server Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations): reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is superharsh, 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely
but necessary if the server does not want to record lock state in unforgiving, but necessary if the server does not record lock
stable storage. state in stable storage.
2. Record sufficient state in stable storage such that all known 2. Record sufficient state in stable storage such that all known
edge conditions involving server reboot, including the two noted edge conditions involving server reboot, including the two noted
in this section, are detected. False positives are acceptable. in this section, are detected. False positives are acceptable.
Note that at this time, it is not known if there are other edge Note that at this time, it is not known if there are other edge
conditions. conditions.
In the event, after a server reboot, the server determines that In the event, after a server reboot, the server determines that
there is unrecoverable damage or corruption to the the stable there is unrecoverable damage or corruption to the information in
storage, then for all clients and/or locks affected, the server stable storage, then for all clients and/or locks which may be
MUST return NFS4ERR_NO_GRACE. affected, the server MUST return NFS4ERR_NO_GRACE.
A mandate for the client's handling of the NFS4ERR_NO_GRACE error is A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating such handling are very dependent on the client's operating
environment. However, one potential approach is described below. environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the normal OPEN or LOCK requests. This is acceptable provided the
skipping to change at page 92, line 27 skipping to change at page 98, line 8
client could also inform the application that its record lock or client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See the lost, such as via a UNIX signal, a GUI pop-up window, etc. See the
section, "Data Caching and Revocation" for a discussion of what the section, "Data Caching and Revocation" for a discussion of what the
client should do for dealing with unreclaimed delegations on client client should do for dealing with unreclaimed delegations on client
state. state.
For further discussion of revocation of locks see the section "Server For further discussion of revocation of locks see the section "Server
Revocation of Locks". Revocation of Locks".
5.7. Recovery from a Lock Request Timeout or Abort 7.7. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not In the event a lock request times out, a client may decide to not
retry the request. The client may also abort the request when the retry the request. The client may also abort the request when the
process for which it was issued is terminated (e.g. in UNIX due to a process for which it was issued is terminated (e.g. in UNIX due to a
signal). It is possible though that the server received the request signal). It is possible though that the server received the request
and acted upon it. This would change the state on the server without and acted upon it. This would change the state on the server without
the client being aware of the change. It is paramount that the the client being aware of the change. It is paramount that the
client re-synchronize state with the server before it attempts any other client re-synchronize state with the server before it attempts any other
operation that takes a seqid and/or a stateid with the same operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special re- lock_owner. This is straightforward to do without a special re-
skipping to change at page 93, line 5 skipping to change at page 98, line 34
not receive a response. From this, the next time the client does a not receive a response. From this, the next time the client does a
lock operation for the lock_owner, it can send the cached request, if lock operation for the lock_owner, it can send the cached request, if
there is one, and if the request was one that established state (e.g. there is one, and if the request was one that established state (e.g.
a LOCK or OPEN operation), the server will return the cached result a LOCK or OPEN operation), the server will return the cached result
or, if it never saw the request, perform it. The client can follow up or, if it never saw the request, perform it. The client can follow up
with a request to remove the state (e.g. a LOCKU or CLOSE operation). with a request to remove the state (e.g. a LOCKU or CLOSE operation).
With this approach, the sequencing and stateid information on the With this approach, the sequencing and stateid information on the
client and server for the given lock_owner will re-synchronize and in client and server for the given lock_owner will re-synchronize and in
turn the lock state will re-synchronize. turn the lock state will re-synchronize.
5.8. Server Revocation of Locks 7.8. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and responsible for validating the state information between itself and
the server. Validating locking state for the client means that it the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held. must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re- The first instance of lock revocation is upon server reboot or re-
initialization. In this instance the client will receive an error initialization. In this instance the client will receive an error
skipping to change at page 94, line 12 skipping to change at page 99, line 41
ensure that a conflicting lock has not been granted. The client may ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the or a zero-length read, specifying the stateid associated with the
lock in question. If the response to the request is success, the lock in question. If the response to the request is success, the
client has validated all of the locks governed by that stateid and client has validated all of the locks governed by that stateid and
re-established the appropriate state between itself and the server. re-established the appropriate state between itself and the server.
If the I/O request is not successful, then one or more of the locks If the I/O request is not successful, then one or more of the locks
associated with the stateid was revoked by the server and the client associated with the stateid was revoked by the server and the client
must notify the owner. must notify the owner.
5.9. Share Reservations 7.9. Share Reservations
A share reservation is a mechanism to control access to a file. It A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request. the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics: Pseudo-code definition of the semantics:
skipping to change at page 94, line 45 skipping to change at page 100, line 27
const OPEN4_SHARE_ACCESS_READ = 0x00000001; const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000; const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001; const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002; const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003; const OPEN4_SHARE_DENY_BOTH = 0x00000003;
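As a rough illustration of how the access and deny bit values above interact, the following sketch tests a new OPEN against the share state already held on a file. It is not taken from either draft: the function name and the boolean result are hypothetical, the error the server would return on a conflict is left out, and rejecting a request that asks for no access at all is an assumption based on the statement that an OPEN specifies READ, WRITE, or BOTH.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003

OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_READ = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH = 0x00000003


def share_conflicts(request_access, request_deny, file_access, file_deny):
    """Return True if the requested share reservation conflicts with the
    access and deny bits already established on the file."""
    if request_access == 0:
        # Assumption: an OPEN must ask for at least one kind of access.
        raise ValueError("an OPEN must request READ, WRITE, or BOTH access")
    wants_denied_access = (request_access & file_deny) != 0
    denies_existing_access = (request_deny & file_access) != 0
    return wants_denied_access or denies_existing_access
```

Because the access and deny values share the same bit positions, a single bitwise AND in each direction is enough to detect a conflict.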
5.10. OPEN/CLOSE Operations 7.10. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired operation to obtain the initial filehandle and indicate the desired
access and what if any access to deny. Even if the client intends to access and what if any access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny have a deny mode built into their open programming interfaces, deny
equal to NONE should be used. equal to NONE should be used.
skipping to change at page 95, line 27 skipping to change at page 101, line 8
failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the
CLOSE. CLOSE.
The LOOKUP operation will return a filehandle without establishing The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file will assume the client has the least access. For example, a file
opened with deny READ/WRITE cannot be accessed using a filehandle opened with deny READ/WRITE cannot be accessed using a filehandle
obtained through LOOKUP because it would not have a valid stateid obtained through LOOKUP because it would not have a valid stateid
(i.e. using a stateid of all bits 0 or all bits 1). (i.e. using a stateid of all bits 0 or all bits 1).
5.10.1. Close and Retention of State Information 7.10.1. Close and Retention of State Information
Since a CLOSE operation requests deallocation of a stateid, dealing Since a CLOSE operation requests deallocation of a stateid, dealing
with retransmission of the CLOSE may pose special difficulties, with retransmission of the CLOSE may pose special difficulties,
since the state information, which normally would be used to since the state information, which normally would be used to
determine the state of the open file being designated, might be determine the state of the open file being designated, might be
deallocated, resulting in an NFS4ERR_BAD_STATEID error. deallocated, resulting in an NFS4ERR_BAD_STATEID error.
Servers may deal with this problem in a number of ways. To provide Servers may deal with this problem in a number of ways. To provide
the greatest degree of assurance that the protocol is being used the greatest degree of assurance that the protocol is being used
properly, a server should, rather than deallocate the stateid, mark properly, a server should, rather than deallocate the stateid, mark
skipping to change at page 96, line 16 skipping to change at page 101, line 44
Servers may avoid this complexity, at the cost of less complete Servers may avoid this complexity, at the cost of less complete
protocol error checking, by simply responding NFS4_OK in the event of protocol error checking, by simply responding NFS4_OK in the event of
a CLOSE for a deallocated stateid, on the assumption that this case a CLOSE for a deallocated stateid, on the assumption that this case
must be caused by a retransmitted close. When adopting this must be caused by a retransmitted close. When adopting this
approach, it is desirable to at least log an error when returning a approach, it is desirable to at least log an error when returning a
no-error indication in this situation. If the server maintains a no-error indication in this situation. If the server maintains a
reply-cache mechanism, it can verify the CLOSE is indeed a reply-cache mechanism, it can verify the CLOSE is indeed a
retransmission and avoid error logging in most cases. retransmission and avoid error logging in most cases.
5.11. Open Upgrade and Downgrade 7.11. Open Upgrade and Downgrade
When an OPEN is done for a file and the lockowner for which the open When an OPEN is done for a file and the lockowner for which the open
is being done already has the file open, the result is to upgrade the is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed. Only a single deny bits for all of the OPEN requests completed. Only a single
CLOSE will be done to reset the effects of both OPENs. Note that the CLOSE will be done to reset the effects of both OPENs. Note that the
client, when issuing the OPEN, may not know that the same file is in client, when issuing the OPEN, may not know that the same file is in
skipping to change at page 96, line 47 skipping to change at page 102, line 27
When multiple open files on the client are merged into a single open When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and open file on the server. This is because the union of the access and
deny bits for the remaining opens may be smaller (i.e. a proper deny bits for the remaining opens may be smaller (i.e. a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to subset) than previously. The OPEN_DOWNGRADE operation is used to
make the necessary change and the client should use it to update the make the necessary change and the client should use it to update the
server so that share reservation requests by other clients are server so that share reservation requests by other clients are
handled properly. handled properly.
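The bookkeeping a client might do to decide when OPEN_DOWNGRADE is needed can be sketched as follows. This is an illustrative sketch under assumptions: the client is assumed to track each local open as an (access, deny) pair, and the function names are hypothetical.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003
OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_WRITE = 0x00000002


def merged_share(opens):
    """Union of the access and deny bits over all local opens, which is
    what the single open file object on the server reflects."""
    access = 0
    deny = 0
    for open_access, open_deny in opens:
        access |= open_access
        deny |= open_deny
    return access, deny


def downgrade_target(opens, closing_index):
    """Return the (access, deny) pair to send in OPEN_DOWNGRADE when the
    local open at closing_index is closed, or None if no downgrade is
    needed (nothing changes, or it was the last open and a CLOSE is
    sent instead)."""
    remaining = opens[:closing_index] + opens[closing_index + 1:]
    if not remaining:
        return None
    before = merged_share(opens)
    after = merged_share(remaining)
    return after if after != before else None
```

For example, with one local open for READ with deny NONE and another for WRITE with deny WRITE, the server holds BOTH access with deny WRITE; closing the second leaves only READ with deny NONE, so a downgrade is called for.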
5.12. Short and Long Leases 7.12. Short and Long Leases
When determining the time period for the server lease, the usual When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased RENEW or READ (with zero length) recovery at a cost of increased RENEW or READ (with zero length)
requests. Longer leases are certainly kinder and gentler to servers requests. Longer leases are certainly kinder and gentler to servers
trying to handle very large numbers of clients. The number of RENEW trying to handle very large numbers of clients. The number of RENEW
requests drops in proportion to the lease time. The disadvantages of requests drops in proportion to the lease time. The disadvantages of
long leases are slower recovery after server failure (the server must long leases are slower recovery after server failure (the server must
wait for the leases to expire and the grace period to elapse before wait for the leases to expire and the grace period to elapse before
granting new lock requests) and increased file contention (if a client granting new lock requests) and increased file contention (if a client
fails to transmit an unlock request then the server must wait for lease fails to transmit an unlock request then the server must wait for lease
expiration before granting new locks). expiration before granting new locks).
Long leases are usable if the server is able to store lease state in Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue. its clients and therefore long leases would not be an issue.
5.13. Clocks, Propagation Delay, and Calculating Lease Expiration 7.13. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be the possibility that requests will be lost and need to be
retransmitted. retransmitted.
To take propagation delay into account, the client should subtract it To take propagation delay into account, the client should subtract it
skipping to change at page 97, line 43 skipping to change at page 103, line 24
before the lease would expire. before the lease would expire.
The server's lease period configuration should take into account the The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into resources. It is expected that the lease period will take into
account the network propagation delays and other network delay account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period. server's administrator may have to tune the lease period.
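A client-side view of the lease calculation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the client measures time with its own monotonic clock, starts the lease interval at the moment it sent the renewing request (a conservative choice), and has some estimate of propagation delay. The names `lease_deadline`, `must_renew`, and the sample values are hypothetical.

```python
import time


def lease_deadline(request_sent_at, lease_time_s, est_propagation_s):
    """Local time by which the client should have renewed its lease.

    The lease is a time delta granted by the server, so only the client's
    own clock is used. Subtracting the propagation estimate leaves room
    for the renewal to reach the server before the lease expires there.
    """
    if est_propagation_s >= lease_time_s:
        raise ValueError("propagation estimate exceeds the lease period")
    return request_sent_at + lease_time_s - est_propagation_s


def must_renew(now, deadline):
    """True once the conservative local deadline has been reached."""
    return now >= deadline


# Example values are assumptions: a 90 second lease, 0.5 second delay estimate.
sent = time.monotonic()
deadline = lease_deadline(sent, lease_time_s=90.0, est_propagation_s=0.5)
```

A real client would likely renew well before this deadline to allow for lost requests and retransmission.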
6. Client-Side Caching 8. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is Client-side caching of data, of file attributes, and of file names is
essential to providing good performance with the NFS protocol. essential to providing good performance with the NFS protocol.
Providing distributed cache coherence is a difficult problem and Providing distributed cache coherence is a difficult problem and
previous versions of the NFS protocol have not attempted it. previous versions of the NFS protocol have not attempted it.
Instead, several NFS client implementation techniques have been used Instead, several NFS client implementation techniques have been used
to reduce the problems that a lack of coherence poses for users. to reduce the problems that a lack of coherence poses for users.
These techniques have not been clearly defined by earlier protocol These techniques have not been clearly defined by earlier protocol
specifications and it is often unclear what is valid or invalid specifications and it is often unclear what is valid or invalid
client behavior. client behavior.
The NFS version 4 protocol uses many techniques similar to those that The NFS version 4 protocol uses many techniques similar to those that
have been used in previous protocol versions. The NFS version 4 have been used in previous protocol versions. The NFS version 4
protocol does not provide distributed cache coherence. However, it protocol does not provide distributed cache coherence. However, it
defines a more limited set of caching guarantees to allow locks and defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from share reservations to be used without destructive interference from
client side caching. client side caching.
In addition, the NFS version 4 protocol introduces a delegation In addition, the NFS version 4 protocol introduces a delegation
mechanism which allows many decisions normally made by the server to mechanism which allows many decisions normally made by the server to
be made locally by clients. This mechanism provides efficient be made locally by clients. This mechanism provides efficient
support of the common cases where sharing is infrequent or where support of the common cases where sharing is infrequent or where
sharing is read-only. sharing is read-only.
6.1. Performance Challenges for Client-Side Caching 8.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have Caching techniques used in previous versions of the NFS protocol have
been successful in providing good performance. However, several been successful in providing good performance. However, several
scalability challenges can arise when those techniques are used with scalability challenges can arise when those techniques are used with
very large numbers of clients. This is particularly true when very large numbers of clients. This is particularly true when
clients are geographically distributed, which typically increases clients are geographically distributed, which typically increases
the latency for cache revalidation requests. the latency for cache revalidation requests.
The previous versions of the NFS protocol repeat their file data The previous versions of the NFS protocol repeat their file data
cache validation requests at the time the file is opened. This cache validation requests at the time the file is opened. This
skipping to change at page 99, line 14 skipping to change at page 104, line 46
o Compatibility with a large range of server semantics. o Compatibility with a large range of server semantics.
o Provide the same caching benefits as previous versions of the NFS o Provide the same caching benefits as previous versions of the NFS
protocol when unable to provide the more aggressive model. protocol when unable to provide the more aggressive model.
o Requirements for aggressive caching are organized so that a large o Requirements for aggressive caching are organized so that a large
portion of the benefit can be obtained even when not all of the portion of the benefit can be obtained even when not all of the
requirements can be met. requirements can be met.
The appropriate requirements for the server are discussed in later The appropriate requirements for the server are discussed in later
sections in which specific forms of caching are covered (see the sections in which specific forms of caching are covered (see the
section "Open Delegation"). section "Open Delegation").
6.2. Delegation and Callbacks 8.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a Recallable delegation of server responsibilities for a file to a
client improves performance by avoiding repeated requests to the client improves performance by avoiding repeated requests to the
server in the absence of inter-client conflict. With the use of a server in the absence of inter-client conflict. With the use of a
"callback" RPC from server to client, a server recalls delegated "callback" RPC from server to client, a server recalls delegated
responsibilities when another client engages in sharing of a responsibilities when another client engages in sharing of a
delegated file. delegated file.
A delegation is passed from the server to the client, specifying the A delegation is passed from the server to the client, specifying the
object of the delegation and the type of delegation. There are object of the delegation and the type of delegation. There are
skipping to change at page 100, line 35 skipping to change at page 106, line 19
The server will not know what opens are in effect on the client. The server will not know what opens are in effect on the client.
Without this knowledge the server will be unable to determine if the Without this knowledge the server will be unable to determine if the
access and deny state for the file allows any particular open until access and deny state for the file allows any particular open until
the delegation for the file has been returned. the delegation for the file has been returned.
A client failure or a network partition can result in failure to A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state the delegation which in turn will render useless any modified state
still on the client. still on the client.
6.2.1. Delegation Recovery 8.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with: There are three situations that delegation recovery must deal with:
o Client reboot or restart o Client reboot or restart
o Server reboot or restart o Server reboot or restart
o Network partition (full or callback-only) o Network partition (full or callback-only)
In the event the client reboots or restarts, the failure to renew In the event the client reboots or restarts, the failure to renew
skipping to change at page 102, line 31 skipping to change at page 108, line 15
by the client whose delegation is revoked and separately by other by the client whose delegation is revoked and separately by other
clients. See the section "Revocation Recovery for Write Open clients. See the section "Revocation Recovery for Write Open
Delegation" for a discussion of such issues. Note also that when Delegation" for a discussion of such issues. Note also that when
delegations are revoked, information about the revoked delegation delegations are revoked, information about the revoked delegation
will be written by the server to stable storage (as described in the will be written by the server to stable storage (as described in the
section "Crash Recovery"). This is done to deal with the case in section "Crash Recovery"). This is done to deal with the case in
which a server reboots after revoking a delegation but before the which a server reboots after revoking a delegation but before the
client holding the revoked delegation is notified about the client holding the revoked delegation is notified about the
revocation. revocation.
6.3. Data Caching 8.3. Data Caching
When applications share access to a set of files, they need to be When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications access by another application. This is true whether the applications
in question execute on different clients or reside on the same in question execute on different clients or reside on the same
client. client.
Share reservations and record locks are the facilities the NFS Share reservations and record locks are the facilities the NFS
version 4 protocol provides to allow applications to coordinate version 4 protocol provides to allow applications to coordinate
access by providing mutual exclusion facilities. The NFS version 4 access by providing mutual exclusion facilities. The NFS version 4
protocol's data caching must be implemented such that it does not protocol's data caching must be implemented such that it does not
invalidate the assumptions that those using these facilities depend invalidate the assumptions that those using these facilities depend
upon. upon.
6.3.1. Data Caching and OPENs 8.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that In order to avoid invalidating the sharing assumptions that
applications rely on, NFS version 4 clients should not provide cached applications rely on, NFS version 4 clients should not provide cached
data to applications or modify it on behalf of an application when it data to applications or modify it on behalf of an application when it
would not be valid to obtain or modify that same data via a READ or would not be valid to obtain or modify that same data via a READ or
WRITE operation. WRITE operation.
Furthermore, in the absence of open delegation (see the section "Open Furthermore, in the absence of open delegation (see the section "Open
Delegation") two additional rules apply. Note that these rules are Delegation") two additional rules apply. Note that these rules are
obeyed in practice by many NFS version 2 and version 3 clients. obeyed in practice by many NFS version 2 and version 3 clients.
skipping to change at page 103, line 45 skipping to change at page 109, line 30
a file OPENed for write. This is complementary to the first rule. a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after If the data is not flushed at CLOSE, the revalidation done after
client OPENs a file is unable to achieve its purpose. The other client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to reboot or restart and a CLOSEd file, it may not be possible to
retransmit the data to be written to the file. Hence, this retransmit the data to be written to the file. Hence, this
requirement. requirement.
6.3.2. Data Caching and File Locking 8.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching. analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way These rules are effective only if the file locking is used in a way
that corresponds to the actual READ and WRITE that corresponds to the actual READ and WRITE
operations executed. This is as opposed to file locking that is operations executed. This is as opposed to file locking that is
based on pure convention. For example, it is possible to manipulate based on pure convention. For example, it is possible to manipulate
a two-megabyte file by dividing the file into two one-megabyte a two-megabyte file by dividing the file into two one-megabyte
regions and protecting access to the two regions by file locks on regions and protecting access to the two regions by file locks on
skipping to change at page 105, line 26 skipping to change at page 111, line 13
unrelated unlock. However, it would not be valid to write the entire unrelated unlock. However, it would not be valid to write the entire
block in which that single written byte was located since it includes block in which that single written byte was located since it includes
an area that is not locked and might be locked by another client. an area that is not locked and might be locked by another client.
Client implementations can avoid this problem by dividing files with Client implementations can avoid this problem by dividing files with
modified data into those for which all modifications are done to modified data into those for which all modifications are done to
areas covered by an appropriate record lock and those for which there areas covered by an appropriate record lock and those for which there
are modifications not covered by a record lock. Any writes done for are modifications not covered by a record lock. Any writes done for
the former class of files must not include areas not locked and thus the former class of files must not include areas not locked and thus
not modified on the client. not modified on the client.
6.3.3. Data Caching and Mandatory File Locking 8.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when Client side data caching needs to respect mandatory file locking when
it is in effect. The presence of mandatory file locking for a given it is in effect. The presence of mandatory file locking for a given
file is indicated when the client gets back NFS4ERR_LOCKED from a file is indicated when the client gets back NFS4ERR_LOCKED from a
READ or WRITE on a file it has an appropriate share reservation for. READ or WRITE on a file it has an appropriate share reservation for.
When mandatory locking is in effect for a file, the client must check When mandatory locking is in effect for a file, the client must check
for an appropriate file lock for data being read or written. If a for an appropriate file lock for data being read or written. If a
lock exists for the range being read or written, the client may lock exists for the range being read or written, the client may
satisfy the request using the client's validated cache. If an satisfy the request using the client's validated cache. If an
appropriate file lock is not held for the range of the read or write, appropriate file lock is not held for the range of the read or write,
the read or write request must not be satisfied by the client's cache the read or write request must not be satisfied by the client's cache
and the request must be sent to the server for processing. When a and the request must be sent to the server for processing. When a
read or write request partially overlaps a locked region, the request read or write request partially overlaps a locked region, the request
should be subdivided into multiple pieces with each region (locked or should be subdivided into multiple pieces with each region (locked or
not) treated appropriately. not) treated appropriately.
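The subdivision of a request that partially overlaps a locked region can be sketched as follows. This is an illustration only: it assumes the client tracks its held locks as half-open byte ranges `(start, end)`, and it ignores the distinction between read and write lock types that determines whether a lock is "appropriate" for a given request. The function name `split_by_locks` is hypothetical.

```python
def split_by_locks(offset, length, locks):
    """Split the request [offset, offset + length) into pieces.

    Returns a list of (start, end, locked) tuples in ascending order.
    Pieces with locked=True may be satisfied from the validated cache;
    pieces with locked=False must be sent to the server.
    """
    end = offset + length
    pieces = []
    pos = offset
    for lock_start, lock_end in sorted(locks):
        # Clip the lock to the part of the request not yet covered.
        lock_start = max(lock_start, pos)
        lock_end = min(lock_end, end)
        if lock_start >= lock_end:
            continue  # lock lies outside the remaining request range
        if pos < lock_start:
            pieces.append((pos, lock_start, False))
        pieces.append((lock_start, lock_end, True))
        pos = lock_end
    if pos < end:
        pieces.append((pos, end, False))
    return pieces


# A 100-byte request at offset 0 with bytes 20..39 locked by this client.
example = split_by_locks(0, 100, [(20, 40)])
```

Each resulting piece can then be handled on its own, either from the cache or by a request to the server.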
6.3.4. Data Caching and File Identity 8.3.4. Data Caching and File Identity
When clients cache data, the file data needs to be organized When clients cache data, the file data needs to be organized
according to the filesystem object to which the data belongs. For according to the filesystem object to which the data belongs. For
NFS version 3 clients, the typical practice has been to assume for NFS version 3 clients, the typical practice has been to assume for
the purpose of caching that distinct filehandles represent distinct the purpose of caching that distinct filehandles represent distinct
filesystem objects. The client then has the choice to organize and filesystem objects. The client then has the choice to organize and
maintain the data cache on this basis. maintain the data cache on this basis.
In the NFS version 4 protocol, there is now the possibility to have In the NFS version 4 protocol, there is now the possibility to have
significant deviations from a "one filehandle per object" model significant deviations from a "one filehandle per object" model
skipping to change at page 106, line 44 skipping to change at page 112, line 32
fileid attribute for both of the handles, then it cannot be fileid attribute for both of the handles, then it cannot be
determined whether the two objects are the same. Therefore, determined whether the two objects are the same. Therefore,
operations which depend on that knowledge (e.g. client side data operations which depend on that knowledge (e.g. client side data
caching) cannot be done reliably. caching) cannot be done reliably.
o If GETATTR directed to the two filehandles returns different o If GETATTR directed to the two filehandles returns different
values for the fileid attribute, then they are distinct objects. values for the fileid attribute, then they are distinct objects.
o Otherwise they are the same object. o Otherwise they are the same object.
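The last three rules in this list can be sketched as a small decision function. This covers only the fileid comparison visible here; any earlier checks in the list are outside the sketch. It assumes that a fileid that could not be obtained for a handle is represented as `None`, and the names `Identity` and `compare_by_fileid` are illustrative.

```python
from enum import Enum


class Identity(Enum):
    SAME = "same"
    DISTINCT = "distinct"
    UNKNOWN = "unknown"


def compare_by_fileid(fileid_a, fileid_b):
    """Apply the fileid rules to the GETATTR results for two filehandles.

    Assumption: None means the fileid attribute was not available for
    that handle, in which case identity cannot be determined.
    """
    if fileid_a is None or fileid_b is None:
        return Identity.UNKNOWN
    if fileid_a != fileid_b:
        return Identity.DISTINCT
    return Identity.SAME
```

A client that receives `Identity.UNKNOWN` cannot reliably share cached data between the two handles.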
6.4. Open Delegation 8.4. Open Delegation
When a file is being OPENed, the server may delegate further handling When a file is being OPENed, the server may delegate further handling
of opens and closes for that file to the opening client. Any such of opens and closes for that file to the opening client. Any such
delegation is recallable, since the circumstances that allowed for delegation is recallable, since the circumstances that allowed for
the delegation are subject to change. In particular, if the server the delegation are subject to change. In particular, if the server
receives a conflicting OPEN from another client, the server must receives a conflicting OPEN from another client, the server must
recall the delegation before deciding whether the OPEN from the other recall the delegation before deciding whether the OPEN from the other
client may be granted. Making a delegation is up to the server and client may be granted. Making a delegation is up to the server and
clients should not assume that any particular OPEN either will or clients should not assume that any particular OPEN either will or
will not result in an open delegation. The following is a typical will not result in an open delegation. The following is a typical
skipping to change at page 109, line 13 skipping to change at page 115, line 5
The use of delegation together with various other forms of caching The use of delegation together with various other forms of caching
creates the possibility that no server authentication will ever be creates the possibility that no server authentication will ever be
performed for a given user since all of the user's requests might be performed for a given user since all of the user's requests might be
satisfied locally. Where the client is depending on the server for satisfied locally. Where the client is depending on the server for
authentication, the client should be sure authentication occurs for authentication, the client should be sure authentication occurs for
each user by use of the ACCESS operation. This should be the case each user by use of the ACCESS operation. This should be the case
even if an ACCESS operation would not be required otherwise. As even if an ACCESS operation would not be required otherwise. As
mentioned before, the server may enforce frequent authentication by mentioned before, the server may enforce frequent authentication by
returning an nfsace4 denying all access with every open delegation. returning an nfsace4 denying all access with every open delegation.
6.4.1. Open Delegation and Data Caching 8.4.1. Open Delegation and Data Caching
OPEN delegation allows much of the message overhead associated with OPEN delegation allows much of the message overhead associated with
the opening and closing of files to be eliminated. An open when an open the opening and closing of files to be eliminated. An open when an open
delegation is in effect does not require that a validation message be delegation is in effect does not require that a validation message be
sent to the server. The continued endurance of the "read open sent to the server. The continued endurance of the "read open
delegation" provides a guarantee that no OPEN for write and thus no delegation" provides a guarantee that no OPEN for write and thus no
write has occurred. Similarly, when closing a file opened for write write has occurred. Similarly, when closing a file opened for write
and if write open delegation is in effect, the data written does not and if write open delegation is in effect, the data written does not
have to be flushed to the server until the open delegation is have to be flushed to the server until the open delegation is
recalled. The continued endurance of the open delegation provides a recalled. The continued endurance of the open delegation provides a
skipping to change at page 110, line 28 skipping to change at page 116, line 20
With respect to authentication, flushing modified data to the server With respect to authentication, flushing modified data to the server
after a CLOSE has occurred may be problematic. For example, the user after a CLOSE has occurred may be problematic. For example, the user
of the application may have logged off the client and unexpired of the application may have logged off the client and unexpired
authentication credentials may not be present. In this case, the authentication credentials may not be present. In this case, the
client may need to take special care to ensure that local unexpired client may need to take special care to ensure that local unexpired
credentials will in fact be available. This may be accomplished by credentials will in fact be available. This may be accomplished by
tracking the expiration time of credentials and flushing data well in tracking the expiration time of credentials and flushing data well in
advance of their expiration or by making private copies of advance of their expiration or by making private copies of
credentials to assure their availability when needed. credentials to assure their availability when needed.
6.4.2. Open Delegation and File Locks 8.4.2. Open Delegation and File Locks
When a client holds a write open delegation, lock operations are When a client holds a write open delegation, lock operations are
performed locally. This includes those required for mandatory file performed locally. This includes those required for mandatory file
locking. This can be done since the delegation implies that there locking. This can be done since the delegation implies that there
can be no conflicting locks. Similarly, all of the revalidations can be no conflicting locks. Similarly, all of the revalidations
that would normally be associated with obtaining locks and the that would normally be associated with obtaining locks and the
flushing of data associated with the releasing of locks need not be flushing of data associated with the releasing of locks need not be
done. done.
When a client holds a read open delegation, lock operations are not When a client holds a read open delegation, lock operations are not
performed locally. All lock operations, including those requesting performed locally. All lock operations, including those requesting
non-exclusive locks, are sent to the server for resolution. non-exclusive locks, are sent to the server for resolution.
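The difference between the two delegation types for lock handling can be summarized in a short sketch. It assumes a client that records the delegation it holds for a file as one of three states; the names `Delegation` and `lock_destination` are hypothetical, and the no-delegation case is included only for completeness.

```python
from enum import Enum


class Delegation(Enum):
    NONE = 0
    READ = 1
    WRITE = 2


def lock_destination(delegation):
    """Return where a lock operation for the file is resolved.

    Only a write open delegation implies that no conflicting locks can
    exist, so only then may the client resolve locks locally.
    """
    if delegation is Delegation.WRITE:
        return "local"
    return "server"
```

Note that a read open delegation does not change lock handling at all, even for non-exclusive locks.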
6.4.3. Handling of CB_GETATTR 8.4.3. Handling of CB_GETATTR
The server needs to employ special handling for a GETATTR where the The server needs to employ special handling for a GETATTR where the
target is a file that has a write open delegation in effect. The target is a file that has a write open delegation in effect. The
reason for this is that the client holding the write delegation may reason for this is that the client holding the write delegation may
have modified the data and the server needs to reflect this change to have modified the data and the server needs to reflect this change to
the second client that submitted the GETATTR. Therefore, the client the second client that submitted the GETATTR. Therefore, the client
holding the write delegation needs to be interrogated. The server holding the write delegation needs to be interrogated. The server
will use the CB_GETATTR operation. The only attributes that the will use the CB_GETATTR operation. The only attributes that the
server can reliably query via CB_GETATTR are size and change. server can reliably query via CB_GETATTR are size and change.
skipping to change at page 113, line 40 skipping to change at page 119, line 38
CB_GETATTR and responds to the second client as in the last step. CB_GETATTR and responds to the second client as in the last step.
This methodology resolves issues of clock differences between client This methodology resolves issues of clock differences between client
and server and other scenarios where the use of CB_GETATTR breaks and server and other scenarios where the use of CB_GETATTR breaks
down. down.
It should be noted that the server is under no obligation to use It should be noted that the server is under no obligation to use
CB_GETATTR and therefore the server MAY simply recall the delegation CB_GETATTR and therefore the server MAY simply recall the delegation
to avoid its use. to avoid its use.
6.4.4. Recall of Open Delegation 8.4.4. Recall of Open Delegation
The following events necessitate recall of an open delegation: The following events necessitate recall of an open delegation:
o Potentially conflicting OPEN request (or READ/WRITE done with o Potentially conflicting OPEN request (or READ/WRITE done with
"special" stateid) "special" stateid)
o SETATTR issued by another client o SETATTR issued by another client
o REMOVE request for the file o REMOVE request for the file
o RENAME request for the file as either source or target of the o RENAME request for the file as either source or target of the
RENAME RENAME
Whether a RENAME of a directory in the path leading to the file Whether a RENAME of a directory in the path leading to the file
results in recall of an open delegation depends on the semantics of results in recall of an open delegation depends on the semantics of
the server filesystem. If that filesystem denies such RENAMEs when a the server filesystem. If that filesystem denies such RENAMEs when a
file is open, the recall must be performed to determine whether the file is open, the recall must be performed to determine whether the
file in question is, in fact, open. file in question is, in fact, open.
In addition to the situations above, the server may choose to recall In addition to the situations above, the server may choose to recall
skipping to change at page 115, line 35 skipping to change at page 121, line 31
except as part of delegation return. Only in the case of closing the except as part of delegation return. Only in the case of closing the
open that resulted in obtaining the delegation would clients be open that resulted in obtaining the delegation would clients be
likely to do this early, since, in that case, the close once done likely to do this early, since, in that case, the close once done
will not be undone. Regardless of the client's choices on scheduling will not be undone. Regardless of the client's choices on scheduling
these actions, all must be performed before the delegation is these actions, all must be performed before the delegation is
returned, including (when applicable) the close that corresponds to returned, including (when applicable) the close that corresponds to
the open that resulted in the delegation. These actions can be the open that resulted in the delegation. These actions can be
performed either in previous requests or in previous operations in performed either in previous requests or in previous operations in
the same COMPOUND request. the same COMPOUND request.
6.4.5. Clients that Fail to Honor Delegation Recalls 8.4.5. Clients that Fail to Honor Delegation Recalls
A client may fail to respond to a recall for various reasons, such as A client may fail to respond to a recall for various reasons, such as
a failure of the callback path from server to the client. The client a failure of the callback path from server to the client. The client
may be unaware of a failure in the callback path. This lack of may be unaware of a failure in the callback path. This lack of
awareness could result in the client finding out long after the awareness could result in the client finding out long after the
failure that its delegation has been revoked, and another client has failure that its delegation has been revoked, and another client has
modified the data for which the client had a delegation. This is modified the data for which the client had a delegation. This is
especially a problem for the client that held a write delegation. especially a problem for the client that held a write delegation.
The server also has a dilemma in that the client that fails to The server also has a dilemma in that the client that fails to
skipping to change at page 116, line 31 skipping to change at page 122, line 26
time after the server attempted to recall the delegation. This time after the server attempted to recall the delegation. This
period of time MUST NOT be less than the value of the period of time MUST NOT be less than the value of the
lease_time attribute. lease_time attribute.
o When the client holds a delegation, it cannot rely on operations, o When the client holds a delegation, it cannot rely on operations,
except for RENEW, that take a stateid, to renew delegation leases except for RENEW, that take a stateid, to renew delegation leases
across callback path failures. The client that wants to keep across callback path failures. The client that wants to keep
delegations in force across callback path failures must use RENEW delegations in force across callback path failures must use RENEW
to do so. to do so.
6.4.6. Delegation Revocation 8.4.6. Delegation Revocation
At the point a delegation is revoked, if there are associated opens At the point a delegation is revoked, if there are associated opens
on the client, the applications holding these opens need to be on the client, the applications holding these opens need to be
notified. This notification usually occurs by returning errors for notified. This notification usually occurs by returning errors for
READ/WRITE operations or when a close is attempted for the open file. READ/WRITE operations or when a close is attempted for the open file.
If no opens exist for the file at the point the delegation is If no opens exist for the file at the point the delegation is
revoked, then notification of the revocation is unnecessary. revoked, then notification of the revocation is unnecessary.
However, if there is modified data present at the client for the However, if there is modified data present at the client for the
file, the user of the application should be notified. Unfortunately, file, the user of the application should be notified. Unfortunately,
it may not be possible to notify the user since active applications it may not be possible to notify the user since active applications
may not be present at the client. See the section "Revocation may not be present at the client. See the section "Revocation
Recovery for Write Open Delegation" for additional details. Recovery for Write Open Delegation" for additional details.
6.5. Data Caching and Revocation 8.5. Data Caching and Revocation
When locks and delegations are revoked, the assumptions upon which When locks and delegations are revoked, the assumptions upon which
successful caching depend are no longer guaranteed. For any locks or successful caching depend are no longer guaranteed. For any locks or
share reservations that have been revoked, the corresponding owner share reservations that have been revoked, the corresponding owner
needs to be notified. This notification includes applications with a needs to be notified. This notification includes applications with a
file open that has a corresponding delegation which has been revoked. file open that has a corresponding delegation which has been revoked.
Cached data associated with the revocation must be removed from the Cached data associated with the revocation must be removed from the
client. In the case of modified data existing in the client's cache, client. In the case of modified data existing in the client's cache,
that data must be removed from the client without it being written to that data must be removed from the client without it being written to
the server. As mentioned, the assumptions made by the client are no the server. As mentioned, the assumptions made by the client are no
longer valid at the point when a lock or delegation has been revoked. longer valid at the point when a lock or delegation has been revoked.
For example, another client may have been granted a conflicting lock For example, another client may have been granted a conflicting lock
after the revocation of the lock at the first client. Therefore, the after the revocation of the lock at the first client. Therefore, the
data within the lock range may have been modified by the other data within the lock range may have been modified by the other
client. Obviously, the first client is unable to guarantee to the client. Obviously, the first client is unable to guarantee to the
application what has occurred to the file in the case of revocation. application what has occurred to the file in the case of revocation.
Notification to a lock owner will in many cases consist of simply Notification to a lock owner will in many cases consist of simply
returning an error on the next and all subsequent READs/WRITEs to the returning an error on the next and all subsequent READs/WRITEs to the
open file or on the close. Where the methods available to a client open file or on the close. Where the methods available to a client
make such notification impossible because errors for certain make such notification impossible because errors for certain
operations may not be returned, more drastic action such as signals operations may not be returned, more drastic action such as signals
or process termination may be appropriate. The justification for or process termination may be appropriate. The justification for
this is that an invariant on which an application depends may be this is that an invariant on which an application depends may be
violated. Depending on how errors are typically treated for the violated. Depending on how errors are typically treated for the
client operating environment, further levels of notification client operating environment, further levels of notification
including logging, console messages, and GUI pop-ups may be including logging, console messages, and GUI pop-ups may be
appropriate. appropriate.
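
The revocation handling described in this section can be sketched as follows. This is an illustrative Python sketch only, not part of the protocol: the names CachedFile, on_revocation and read, the notification labels, and the choice of EIO as the error are all assumptions made for the example.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class CachedFile:
    """Hypothetical client-side cache state for one file."""
    clean_blocks: dict = field(default_factory=dict)  # offset -> cached bytes
    dirty_blocks: dict = field(default_factory=dict)  # offset -> modified bytes
    open_count: int = 0
    revoked: bool = False


def on_revocation(f: CachedFile) -> list:
    """Discard cached data after a lock or delegation is revoked.

    Modified data is dropped without being written to the server.
    Returns hypothetical labels for the notifications to raise.
    """
    lost_bytes = sum(len(b) for b in f.dirty_blocks.values())
    f.clean_blocks.clear()
    f.dirty_blocks.clear()
    f.revoked = True
    notices = []
    if f.open_count > 0:
        # Applications with the file open learn of the revocation
        # through errors on later READs, WRITEs, or the close.
        notices.append("fail-subsequent-io")
    if lost_bytes > 0:
        # Modified data was lost; log it for the user or administrator.
        notices.append("log-lost-data")
    return notices


def read(f: CachedFile, offset: int) -> bytes:
    """Serve a read from cache, failing once the state is revoked."""
    if f.revoked:
        raise OSError(errno.EIO, "lock or delegation revoked")
    return f.clean_blocks.get(offset, b"")
```

The sketch deliberately has no write-back path in on_revocation: once the lock or delegation is gone, another client may already have modified the same range, so the cached modifications cannot safely be sent to the server.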
6.5.1. Revocation Recovery for Write Open Delegation 8.5.1. Revocation Recovery for Write Open Delegation
Revocation recovery for a write open delegation poses the special Revocation recovery for a write open delegation poses the special
issue of modified data in the client cache while the file is not issue of modified data in the client cache while the file is not
open. In this situation, any client which does not flush modified open. In this situation, any client which does not flush modified
data to the server on each close must ensure that the user receives data to the server on each close must ensure that the user receives
appropriate notification of the failure as a result of the appropriate notification of the failure as a result of the
revocation. Since such situations may require human action to revocation. Since such situations may require human action to
correct problems, notification schemes in which the appropriate user correct problems, notification schemes in which the appropriate user
or administrator is notified may be necessary. Logging and console or administrator is notified may be necessary. Logging and console
messages are typical examples. messages are typical examples.
skipping to change at page 118, line 12 skipping to change at page 124, line 7
contents in these situations or mark the results specially to warn contents in these situations or mark the results specially to warn
users of possible problems. users of possible problems.
Saving of such modified data in delegation revocation situations may Saving of such modified data in delegation revocation situations may
be limited to files of a certain size or might be used only when be limited to files of a certain size or might be used only when
sufficient disk space is available within the target filesystem. sufficient disk space is available within the target filesystem.
Such saving may also be restricted to situations when the client has Such saving may also be restricted to situations when the client has
sufficient buffering resources to keep the cached copy available sufficient buffering resources to keep the cached copy available
until it is properly stored to the target filesystem. until it is properly stored to the target filesystem.
6.6. Attribute Caching 8.6. Attribute Caching
The attributes discussed in this section do not include named The attributes discussed in this section do not include named
attributes. Individual named attributes are analogous to files and attributes. Individual named attributes are analogous to files and
caching of the data for these needs to be handled just as data caching of the data for these needs to be handled just as data
caching is for ordinary files. Similarly, LOOKUP results from an caching is for ordinary files. Similarly, LOOKUP results from an
OPENATTR directory are to be cached on the same basis as any other OPENATTR directory are to be cached on the same basis as any other
pathnames and similarly for directory contents. pathnames and similarly for directory contents.
Clients may cache file attributes obtained from the server and use Clients may cache file attributes obtained from the server and use
them to avoid subsequent GETATTR requests. Such caching is write them to avoid subsequent GETATTR requests. Such caching is write
skipping to change at page 120, line 8 skipping to change at page 126, line 5
client will either eventually have to write the access time to the client will either eventually have to write the access time to the
server with bad performance effects, or it would never update the server with bad performance effects, or it would never update the
server's time_access, thereby resulting in a situation where an server's time_access, thereby resulting in a situation where an
application that caches access time between a close and open of the application that caches access time between a close and open of the
same file observes the access time oscillating between the past and same file observes the access time oscillating between the past and
present. The time_access attribute always means the time of last present. The time_access attribute always means the time of last
access to a file by a read that was satisfied by the server. This access to a file by a read that was satisfied by the server. This
way clients will tend to see only time_access changes that go forward way clients will tend to see only time_access changes that go forward
in time. in time.
6.7. Data and Metadata Caching and Memory Mapped Files 8.7. Data and Metadata Caching and Memory Mapped Files
Some operating environments include the capability for an application Some operating environments include the capability for an application
to map a file's content into the application's address space. Each to map a file's content into the application's address space. Each
time the application accesses a memory location that corresponds to a time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the file, the block is allocated and then instantiated in the
application's address space). application's address space).
As long as each memory mapped access to the file requires a page As long as each memory mapped access to the file requires a page
skipping to change at page 122, line 16 skipping to change at page 128, line 13
are record locks for. are record locks for.
o Clients and servers MAY deny a record lock on a file they know is o Clients and servers MAY deny a record lock on a file they know is
memory mapped. memory mapped.
o A client MAY deny memory mapping a file that it knows requires o A client MAY deny memory mapping a file that it knows requires
mandatory locking for I/O. If mandatory locking is enabled after mandatory locking for I/O. If mandatory locking is enabled after
the file is opened and mapped, the client MAY deny the application the file is opened and mapped, the client MAY deny the application
further access to its mapped file. further access to its mapped file.
6.8. Name Caching 8.8. Name Caching
The results of LOOKUP and READDIR operations may be cached to avoid The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations. Just as in the case of the cost of subsequent LOOKUP operations. Just as in the case of
attribute caching, inconsistencies may arise among the various client attribute caching, inconsistencies may arise among the various client
caches. To mitigate the effects of these inconsistencies and given caches. To mitigate the effects of these inconsistencies and given
the context of typical filesystem APIs, an upper time boundary is the context of typical filesystem APIs, an upper time boundary is
maintained on how long a client name cache entry can be kept without maintained on how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory verifying that the entry has not been made invalid by a directory
change operation performed by another client. When a client is change operation performed by another client. When a client is
not making changes to a directory for which there exist name cache not making changes to a directory for which there exist name cache
skipping to change at page 123, line 16 skipping to change at page 129, line 12
directories when the contents of the corresponding directory are directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not reported, the client should not assume that other clients have not
changed the directory. changed the directory.
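
One way a client might apply the change_info4 rule above to its name cache is sketched below. The three fields (atomic, before, after) mirror change_info4; the DirCache class and its method names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ChangeInfo:
    """Mirrors the three fields of change_info4."""
    atomic: bool
    before: int
    after: int


class DirCache:
    """Hypothetical per-directory name cache kept by a client."""

    def __init__(self, change: int, names: set):
        self.change = change      # change attribute the cache was built at
        self.names = set(names)
        self.valid = True

    def apply_create(self, name: str, cinfo: ChangeInfo) -> None:
        """Update the cache after this client's own CREATE of `name`."""
        if self.valid and cinfo.atomic and cinfo.before == self.change:
            # The pre-operation value matches what was cached and the
            # values were reported atomically, so no other client
            # changed the directory in between.
            self.names.add(name)
            self.change = cinfo.after
        else:
            # Non-atomic report or mismatched pre-operation value:
            # another client may have changed the directory.
            self.names.clear()
            self.valid = False
```

A client that finds the cache invalidated would refetch the directory contents and change attribute before trusting cached names again.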
6.9. Directory Caching 8.9. Directory Caching
The results of READDIR operations may be used to avoid subsequent The results of READDIR operations may be used to avoid subsequent
READDIR operations. Just as in the cases of attribute and name READDIR operations. Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches. caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the To mitigate the effects of these inconsistencies, and given the
context of typical filesystem APIs, the following rules should be context of typical filesystem APIs, the following rules should be
followed: followed:
o Cached READDIR information for a directory which is not obtained o Cached READDIR information for a directory which is not obtained
in a single READDIR operation must always be a consistent snapshot in a single READDIR operation must always be a consistent snapshot
skipping to change at page 124, line 10 skipping to change at page 130, line 7
directories when the contents of the corresponding directory are directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not reported, the client should not assume that other clients have not
changed the directory. changed the directory.
7. Security Negotiation 9. Security Negotiation
The NFSv4.0 specification contains three oversights and ambiguities The NFSv4.0 specification contains three oversights and ambiguities
with respect to the SECINFO operation. with respect to the SECINFO operation.
First, it is impossible for the client to use the SECINFO operation First, it is impossible for the client to use the SECINFO operation
to determine the correct security triple for accessing a parent to determine the correct security triple for accessing a parent
directory. This is because SECINFO takes as arguments the current directory. This is because SECINFO takes as arguments the current
file handle and a component name. However, NFSv4.0 uses the LOOKUPP file handle and a component name. However, NFSv4.0 uses the LOOKUPP
operation to get the parent directory of the current file handle. If operation to get the parent directory of the current file handle. If
the client uses the wrong security when issuing the LOOKUPP, and gets the client uses the wrong security when issuing the LOOKUPP, and gets
skipping to change at page 124, line 42 skipping to change at page 130, line 39
Third, there is a problem as to what the client must do (or can do), Third, there is a problem as to what the client must do (or can do),
whenever the server returns NFS4ERR_WRONGSEC in response to a PUTFH whenever the server returns NFS4ERR_WRONGSEC in response to a PUTFH
operation. The NFSv4.0 specification says that the client should issue a operation. The NFSv4.0 specification says that the client should issue a
SECINFO using the parent filehandle and the component name of the SECINFO using the parent filehandle and the component name of the
filehandle that PUTFH was issued with. This may not be convenient filehandle that PUTFH was issued with. This may not be convenient
for the client. for the client.
This document resolves the above three issues in the context of This document resolves the above three issues in the context of
NFSv4.1. NFSv4.1.
8. Clarification of Security Negotiation in NFSv4.1 10. Clarification of Security Negotiation in NFSv4.1
This section attempts to clarify NFSv4.1 security negotiation issues. This section attempts to clarify NFSv4.1 security negotiation issues.
Unless noted otherwise, for any mention of PUTFH in this section, the Unless noted otherwise, for any mention of PUTFH in this section, the
reader should interpret it as applying to PUTROOTFH and PUTPUBFH in reader should interpret it as applying to PUTROOTFH and PUTPUBFH in
addition to PUTFH. addition to PUTFH.
8.1. PUTFH + LOOKUP 10.1. PUTFH + LOOKUP
The server implementation may decide whether to impose any The server implementation may decide whether to impose any
restrictions on export security administration. There are at least restrictions on export security administration. There are at least
three approaches (Sc is the flavor set of the child export, Sp that three approaches (Sc is the flavor set of the child export, Sp that
of the parent), of the parent),
a) Sc <= Sp (<= for subset) a) Sc <= Sp (<= for subset)
b) Sc ^ Sp != {} (^ for intersection, {} for the empty set) b) Sc ^ Sp != {} (^ for intersection, {} for the empty set)
c) free form c) free form
To support b (when the client chooses a flavor that is not a member of To support b (when the client chooses a flavor that is not a member of
Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of security Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of security
mismatch. Instead, it should be returned from the LOOKUP that mismatch. Instead, it should be returned from the LOOKUP that
follows. follows.
Since the above guideline does not contradict a, it should be Since the above guideline does not contradict a, it should be
followed in general. followed in general.
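
The three administrative approaches can be expressed as set relations. The Python sketch below is illustrative only: the function names are hypothetical, the flavor strings are used purely as labels, and the assumption that LOOKUP checks the flavor against the child export's set is a reading of the guideline above rather than normative text.

```python
def classify_export_policy(sc: frozenset, sp: frozenset) -> str:
    """Return the most restrictive approach (a, b, or c) that the
    child flavor set Sc and parent flavor set Sp satisfy."""
    if sc <= sp:
        return "a"   # Sc is a subset of Sp
    if sc & sp:
        return "b"   # Sc and Sp share at least one flavor
    return "c"       # free form: no relationship required


def lookup_status(client_flavor: str, sc: frozenset) -> str:
    """Status of the LOOKUP that crosses into the child export.

    PUTFH on the parent has already succeeded; the mismatch, if any,
    is reported here instead of by PUTFH.
    """
    return "NFS4_OK" if client_flavor in sc else "NFS4ERR_WRONGSEC"
```

For example, with Sp = {sys, krb5} and Sc = {krb5, krb5p}, the pair falls under approach b, and a client using sys sees NFS4ERR_WRONGSEC from LOOKUP rather than from PUTFH.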
8.2. PUTFH + LOOKUPP 10.2. PUTFH + LOOKUPP
Since SECINFO only works its way down, there is no way LOOKUPP can Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without the server implementing return NFS4ERR_WRONGSEC without the server implementing
SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue because, via style SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue because, via style
"parent", it works in the opposite direction from SECINFO (component "parent", it works in the opposite direction from SECINFO (component
name is implicit in this case). name is implicit in this case).
8.3. PUTFH + SECINFO 10.3. PUTFH + SECINFO
This case should be treated specially. This case should be treated specially.
A security sensitive client should be allowed to choose a strong A security sensitive client should be allowed to choose a strong
flavor when querying a server to determine a file object's permitted flavor when querying a server to determine a file object's permitted
security flavors. The security flavor chosen by the client does not security flavors. The security flavor chosen by the client does not
have to be included in the flavor list of the export. Of course the have to be included in the flavor list of the export. Of course the
server has to be configured for whatever flavor the client selects, server has to be configured for whatever flavor the client selects,
otherwise the request will fail at RPC authentication. otherwise the request will fail at RPC authentication.
In theory, there is no connection between the security flavor used by In theory, there is no connection between the security flavor used by
SECINFO and those supported by the export. But in practice, the SECINFO and those supported by the export. But in practice, the
client may start looking for strong flavors from those supported by client may start looking for strong flavors from those supported by
the export, followed by those in the mandatory set. the export, followed by those in the mandatory set.
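
The search order suggested above might be sketched as follows. This is an assumption-laden illustration: the function name, the strength table, and the idea that the client already knows the export's flavor list are all hypothetical, and the text does not define how "strong" is measured.

```python
def secinfo_candidates(export_flavors, mandatory_flavors, strength):
    """Order the flavors a client might try for PUTFH + SECINFO.

    Export flavors come first, strongest first, followed by any
    mandatory flavors not already listed, again strongest first.
    `strength` maps a flavor label to a client-chosen rank.
    """
    def rank(flavor):
        return strength.get(flavor, 0)

    ordered = sorted(export_flavors, key=rank, reverse=True)
    for flavor in sorted(mandatory_flavors, key=rank, reverse=True):
        if flavor not in ordered:
            ordered.append(flavor)
    return ordered
```

The client would then issue the request with each candidate in turn until one passes RPC authentication at the server.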
8.4. PUTFH + Anything Else 10.4. PUTFH + Anything Else
PUTFH must return NFS4ERR_WRONGSEC in case of security mismatch. PUTFH must return NFS4ERR_WRONGSEC in case of security mismatch.
This is the most straightforward approach without having to add This is the most straightforward approach without having to add
NFS4ERR_WRONGSEC to every other operation. NFS4ERR_WRONGSEC to every other operation.
PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the client PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the client
to recover from NFS4ERR_WRONGSEC. to recover from NFS4ERR_WRONGSEC.
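
Taken together, the four PUTFH cases in this section amount to a small decision table keyed on the operation that follows PUTFH. The Python sketch below is one reading of that table, not normative text. The function name is hypothetical, the LOOKUPP case assumes a server that implements SECINFO_NO_NAME, and exempting SECINFO_NO_NAME alongside SECINFO is inferred from its role in recovering from NFS4ERR_WRONGSEC.

```python
from typing import Optional


def wrongsec_reporter(next_op: str) -> Optional[str]:
    """Name the operation that reports NFS4ERR_WRONGSEC on a security
    mismatch, given the operation that follows PUTFH in the COMPOUND.

    Returns None when the mismatch is not reported at all.
    """
    if next_op == "LOOKUP":
        return "LOOKUP"      # PUTFH succeeds; LOOKUP reports the error
    if next_op == "LOOKUPP":
        return "LOOKUPP"     # assumes the server has SECINFO_NO_NAME
    if next_op in ("SECINFO", "SECINFO_NO_NAME"):
        return None          # the client may query with any flavor
    return "PUTFH"           # every other operation: PUTFH reports it
```

Under this reading, a client that receives NFS4ERR_WRONGSEC from PUTFH followed by, say, READ can retry with PUTFH + SECINFO_NO_NAME (style "current_fh") to learn the acceptable flavors.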
9. NFSv4.1 Sessions 11. NFSv4.1 Sessions
9.1. Sessions Background 11.1. Sessions Background
9.1.1. Introduction to Sessions 11.1.1. Introduction to Sessions
This draft proposes extensions to NFS version 4 [RFC3530] enabling it This draft proposes extensions to NFS version 4 [RFC3530] enabling it
to support sessions and endpoint management, and to support operation to support sessions and endpoint management, and to support operation
atop RDMA-capable RPC over transports such as iWARP. [RDMAP, DDP] atop RDMA-capable RPC over transports such as iWARP. [RDMAP, DDP]
These extensions enable support for exactly-once semantics by NFSv4 These extensions enable support for exactly-once semantics by NFSv4
servers, multipathing and trunking of transport connections, and servers, multipathing and trunking of transport connections, and
enhanced security. The ability to operate over RDMA enables greatly enhanced security. The ability to operate over RDMA enables greatly
enhanced performance. Operation over existing TCP is enhanced as enhanced performance. Operation over existing TCP is enhanced as
well. well.
skipping to change at page 127, line 30 skipping to change at page 133, line 23
+-----------------+-------------------------------------+ +-----------------+-------------------------------------+
| NFSv4 | NFSv4 + session extensions | | NFSv4 | NFSv4 + session extensions |
+-----------------+------+----------------+-------------+ +-----------------+------+----------------+-------------+
| Operations | Session | | | Operations | Session | |
+------------------------+----------------+ | +------------------------+----------------+ |
| RPC/XDR | | | RPC/XDR | |
+-------------------------------+---------+ | +-------------------------------+---------+ |
| Stream Transport | RDMA Transport | | Stream Transport | RDMA Transport |
+-------------------------------+-----------------------+ +-------------------------------+-----------------------+
9.1.2. Motivation 11.1.2. Motivation
NFS version 4 [RFC3530] has been granted "Proposed Standard" status. NFS version 4 [RFC3530] has been granted "Proposed Standard" status.
The NFSv4 protocol was developed along several design points, The NFSv4 protocol was developed along several design points,
important among them: effective operation over wide-area networks, important among them: effective operation over wide-area networks,
including the Internet itself; strong security integrated into the including the Internet itself; strong security integrated into the
protocol; extensive cross-platform interoperability including protocol; extensive cross-platform interoperability including
integrated locking semantics compatible with multiple operating integrated locking semantics compatible with multiple operating
systems; and protocol extensibility. systems; and protocol extensibility.
The NFS version 4 protocol, however, does not provide support for The NFS version 4 protocol, however, does not provide support for
skipping to change at page 128, line 40 skipping to change at page 134, line 35
FJDAFS] and Harvard University [KM02] are all relevant. FJDAFS] and Harvard University [KM02] are all relevant.
By layering a session binding for NFS version 4 directly atop a By layering a session binding for NFS version 4 directly atop a
standard RDMA transport, a greatly enhanced level of performance and standard RDMA transport, a greatly enhanced level of performance and
transparency can be supported on a wide variety of operating system transparency can be supported on a wide variety of operating system
platforms. These combined capabilities alter the landscape between platforms. These combined capabilities alter the landscape between
local filesystems and network attached storage, enable a new level of local filesystems and network attached storage, enable a new level of
performance, and lead new classes of application to take advantage of performance, and lead new classes of application to take advantage of
NFS. NFS.
9.1.3. Problem Statement 11.1.3. Problem Statement
Two issues drive the current proposal: correctness, and performance. Two issues drive the current proposal: correctness, and performance.
Both are instances of "raising the bar" for NFS, whereby the desire Both are instances of "raising the bar" for NFS, whereby the desire
to use NFS in new classes of applications can be accommodated by to use NFS in new classes of applications can be accommodated by
providing the basic features to make such use feasible. Such providing the basic features to make such use feasible. Such
applications include tightly coupled sharing environments such as applications include tightly coupled sharing environments such as
cluster computing, high performance computing (HPC) and information cluster computing, high performance computing (HPC) and information
processing such as databases. These trends are explored in depth in processing such as databases. These trends are explored in depth in
[NFSPS]. [NFSPS].
skipping to change at page 130, line 8 skipping to change at page 136, line 5
systems, NFSv4 over RDMA will enable applications running on a set of systems, NFSv4 over RDMA will enable applications running on a set of
client machines to interact through an NFSv4 file system, just as client machines to interact through an NFSv4 file system, just as
applications running on a single machine might interact through a applications running on a single machine might interact through a
local file system. local file system.
This raises the issue of whether additional protocol enhancements to This raises the issue of whether additional protocol enhancements to
enable such interaction would be desirable and what such enhancements enable such interaction would be desirable and what such enhancements
would be. This is a complicated issue which the working group needs would be. This is a complicated issue which the working group needs
to address and will not be further discussed in this document. to address and will not be further discussed in this document.
9.1.4. NFSv4 Session Extension Characteristics 11.1.4. NFSv4 Session Extension Characteristics
This draft will present a solution based upon minor versioning of This draft will present a solution based upon minor versioning of
NFSv4. It will introduce a session to collect transport endpoints NFSv4. It will introduce a session to collect transport endpoints
and resources such as reply caching, which in turn enables and resources such as reply caching, which in turn enables
enhancements such as trunking, failover and recovery. It will enhancements such as trunking, failover and recovery. It will
describe use of RDMA by employing support within an underlying RPC describe use of RDMA by employing support within an underlying RPC
layer [RPCRDMA]. Most importantly, it will focus on making the best layer [RPCRDMA]. Most importantly, it will focus on making the best
possible use of an RDMA transport. possible use of an RDMA transport.
These extensions are proposed as elements of a new minor revision of These extensions are proposed as elements of a new minor revision of
skipping to change at page 130, line 30 skipping to change at page 136, line 27
generically as "NFSv4", when describing properties common to all generically as "NFSv4", when describing properties common to all
minor versions. When referring specifically to properties of the minor versions. When referring specifically to properties of the
original, minor version 0 protocol, "NFSv4.0" will be used, and original, minor version 0 protocol, "NFSv4.0" will be used, and
changes proposed here for minor version 1 will be referred to as changes proposed here for minor version 1 will be referred to as
"NFSv4.1". "NFSv4.1".
This draft proposes only changes which are strictly upward- This draft proposes only changes which are strictly upward-
compatible with existing RPC and NFS Application Programming compatible with existing RPC and NFS Application Programming
Interfaces (APIs). Interfaces (APIs).
9.2. Transport Issues 11.2. Transport Issues
The Transport Issues section of the document explores the details of The Transport Issues section of the document explores the details of
utilizing the various supported transports. utilizing the various supported transports.
9.2.1. Session Model 11.2.1. Session Model
The first and most evident issue in supporting diverse transports is The first and most evident issue in supporting diverse transports is
how to provide for their differences. This draft proposes how to provide for their differences. This draft proposes
introducing an explicit session. introducing an explicit session.
A session introduces minimal protocol requirements, and provides for A session introduces minimal protocol requirements, and provides for
a highly useful and convenient way to manage numerous endpoint- a highly useful and convenient way to manage numerous endpoint-
related issues. The session is a local construct; it represents a related issues. The session is a local construct; it represents a
named, higher-layer object to which connections can refer, and named, higher-layer object to which connections can refer, and
encapsulates properties important to each associated client. encapsulates properties important to each associated client.
skipping to change at page 132, line 5 skipping to change at page 137, line 46
Finally, given adequate connection-oriented transport security Finally, given adequate connection-oriented transport security
semantics, authentication and authorization may be cached on a per- semantics, authentication and authorization may be cached on a per-
session basis, enabling greater efficiency in the issuing and session basis, enabling greater efficiency in the issuing and
processing of requests on both client and server. A proposal for processing of requests on both client and server. A proposal for
transparent, server-driven implementation of this in NFSv4 has been transparent, server-driven implementation of this in NFSv4 has been
made [CCM]. The existence of the session greatly facilitates the made [CCM]. The existence of the session greatly facilitates the
implementation of this approach. This is discussed in detail in the implementation of this approach. This is discussed in detail in the
Authentication Efficiencies section later in this draft. Authentication Efficiencies section later in this draft.
9.2.2. Connection State 11.2.2. Connection State
In RFC3530, the combination of a connected transport endpoint and a In RFC3530, the combination of a connected transport endpoint and a
clientid forms the basis of connection state. While this has been made to clientid forms the basis of connection state. While this has been made to
be workable with certain limitations, there are difficulties in be workable with certain limitations, there are difficulties in
correct and robust implementation. The NFSv4.0 protocol must provide correct and robust implementation. The NFSv4.0 protocol must provide
a server-initiated connection for the callback channel, and must a server-initiated connection for the callback channel, and must
carefully specify the persistence of client state at the server in carefully specify the persistence of client state at the server in
the face of transport interruptions. The server has only the the face of transport interruptions. The server has only the
client's transport address binding (the IP 4-tuple) to identify the client's transport address binding (the IP 4-tuple) to identify the
client RPC transaction stream and to use as a lookup tag on the client RPC transaction stream and to use as a lookup tag on the
skipping to change at page 132, line 41 skipping to change at page 138, line 34
The session identifier is unique within the server's scope and may be The session identifier is unique within the server's scope and may be
subject to certain server policies such as being bounded in time. subject to certain server policies such as being bounded in time.
It is envisioned that the primary transport model will be connection It is envisioned that the primary transport model will be connection
oriented. Connection orientation brings with it certain potential oriented. Connection orientation brings with it certain potential
optimizations, such as caching of per-connection properties, which optimizations, such as caching of per-connection properties, which
are easily leveraged through the generality of the session. However, are easily leveraged through the generality of the session. However,
it is possible that in future, other transport models could be it is possible that in future, other transport models could be
accommodated below the session abstraction. accommodated below the session abstraction.
9.2.3. NFSv4 Channels, Sessions and Connections 11.2.3. NFSv4 Channels, Sessions and Connections
There are at least two types of NFSv4 channels: the "operations" There are at least two types of NFSv4 channels: the "operations"
channel used for ordinary requests from client to server, and the channel used for ordinary requests from client to server, and the
"back" channel, used for callback requests from server to client. "back" channel, used for callback requests from server to client.
As mentioned above, different NFSv4 operations on these channels can As mentioned above, different NFSv4 operations on these channels can
lead to different resource needs. For example, server callback lead to different resource needs. For example, server callback
operations (CB_RECALL) are specific, small messages which flow from operations (CB_RECALL) are specific, small messages which flow from
server to client at arbitrary times, while data transfers such as server to client at arbitrary times, while data transfers such as
read and write have very different sizes and asymmetric behaviors. read and write have very different sizes and asymmetric behaviors.
skipping to change at page 134, line 41 skipping to change at page 140, line 41
| |
In this way, implementation as well as resource management may be In this way, implementation as well as resource management may be
optimized. Each session will have its own response caching and optimized. Each session will have its own response caching and
buffering, and each connection or channel will have its own transport buffering, and each connection or channel will have its own transport
resources, as appropriate. Clients which do not require certain resources, as appropriate. Clients which do not require certain
behaviors may optimize such resources away completely, by using behaviors may optimize such resources away completely, by using
specific sessions and not even creating the additional channels and specific sessions and not even creating the additional channels and
connections. connections.
9.2.4. Reconnection, Trunking and Failover 11.2.4. Reconnection, Trunking and Failover
Reconnection after failure references stored state on the server Reconnection after failure references stored state on the server
associated with lease recovery during the grace period. The session associated with lease recovery during the grace period. The session
provides a convenient handle for storing and managing information provides a convenient handle for storing and managing information
regarding the client's previous state on a per-connection basis, regarding the client's previous state on a per-connection basis,
e.g. to be used upon reconnection. Reconnection to a previously e.g. to be used upon reconnection. Reconnection to a previously
existing session, and its stored resources, is covered in the existing session, and its stored resources, is covered in the
"Connection Models" section below. "Connection Models" section below.
One important aspect of reconnection is that of RPC library support. One important aspect of reconnection is that of RPC library support.
skipping to change at page 135, line 32 skipping to change at page 141, line 32
of connections, something the RPC layer abstraction architecturally of connections, something the RPC layer abstraction architecturally
abstracts away. Therefore the session binding is not handled in abstracts away. Therefore the session binding is not handled in
connection scope but instead explicitly carried in each request. connection scope but instead explicitly carried in each request.
For Reliability, Availability and Serviceability (RAS) issues such as For Reliability, Availability and Serviceability (RAS) issues such as
bandwidth aggregation and multipathing, clients frequently seek to bandwidth aggregation and multipathing, clients frequently seek to
make multiple connections through multiple logical or physical make multiple connections through multiple logical or physical
channels. The session is a convenient point to aggregate and manage channels. The session is a convenient point to aggregate and manage
these resources. these resources.
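As an informal illustration of the two points above (the session binding carried explicitly in each request, and the session as the aggregation point for multiple connections), the following Python sketch may help. It is not part of the draft: the names Session, Server, handle_request and the "BAD_SESSION" status are hypothetical, and the draft does not prescribe any such implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical server-side session record (illustrative only)."""
    session_id: bytes
    # Connections that have carried requests for this session (trunking).
    connections: set = field(default_factory=set)


class Server:
    """Looks up state by the session id in the request, not by connection."""

    def __init__(self):
        self.sessions = {}

    def create_session(self, session_id):
        session = Session(session_id)
        self.sessions[session_id] = session
        return session

    def handle_request(self, conn, session_id, op):
        # The binding comes from the request itself, so the same session
        # can be reached over any connection, including a new one opened
        # after a transport failure.
        session = self.sessions.get(session_id)
        if session is None:
            return ("BAD_SESSION", None)  # hypothetical status name
        session.connections.add(conn)
        return ("OK", op)


server = Server()
session = server.create_session(b"s1")
server.handle_request("connA", b"s1", "READ")
server.handle_request("connB", b"s1", "WRITE")  # second path, same session
```

Because the lookup key is the session identifier rather than the transport address, adding or replacing a connection in this sketch requires no change to the stored session state.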
9.2.5. Server Duplicate Request Cache 11.2.5. Server Duplicate Request Cache
Server duplicate request caches, while not a part of an NFS protocol, Server duplicate request caches, while not a part of an NFS protocol,
have become a standard, even required, part of any NFS have become a standard, even required, part of any NFS
implementation. First described in [CJ89], the duplicate request implementation. First described in [CJ89], the duplicate request
cache was initially found to reduce work at the server by avoiding cache was initially found to reduce work at the server by avoiding
duplicate processing for retransmitted requests. A second, and in duplicate processing for retransmitted requests. A second, and in
the long run more important benefit, was improved correctness, as the the long run more important benefit, was improved correctness, as the
cache prevented certain destructive non-idempotent requests from being cache prevented certain destructive non-idempotent requests from being
reinvoked. reinvoked.
skipping to change at page 136, line 41 skipping to change at page 142, line 41
Similarly, it is important for the client to explicitly learn whether Similarly, it is important for the client to explicitly learn whether
the server is able to implement reliable semantics. Knowledge of the server is able to implement reliable semantics. Knowledge of
whether these semantics are in force is critical for a highly whether these semantics are in force is critical for a highly
reliable client, one which must provide transactional integrity reliable client, one which must provide transactional integrity
guarantees. When clients request that the semantics be enabled for a guarantees. When clients request that the semantics be enabled for a
given session, the session reply must inform the client if the mode given session, the session reply must inform the client if the mode
is in fact enabled. In this way the client can confidently proceed is in fact enabled. In this way the client can confidently proceed
with operations without having to implement consistency facilities of with operations without having to implement consistency facilities of
its own. its own.
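The correctness benefit of a duplicate request cache described in this section can be sketched informally as follows. This is a minimal illustration under stated assumptions, not the draft's mechanism: the class DuplicateRequestCache, the integer request_id key, and the "OK"/"NOENT" result strings are all hypothetical, and a real cache would be bounded in size and, per this draft, maintained per session.

```python
class DuplicateRequestCache:
    """Hypothetical reply cache keyed by a request identifier."""

    def __init__(self):
        self._replies = {}

    def execute(self, request_id, operation):
        # A retransmission is answered from the cache; the operation
        # itself is not run a second time.
        if request_id in self._replies:
            return self._replies[request_id]
        reply = operation()
        self._replies[request_id] = reply
        return reply


# A non-idempotent operation: removing a file succeeds only once.
files = {"a"}


def remove_a():
    if "a" in files:
        files.remove("a")
        return "OK"
    return "NOENT"


drc = DuplicateRequestCache()
first = drc.execute(1, remove_a)   # original request: performs the remove
retry = drc.execute(1, remove_a)   # retransmission: cached reply returned
```

Without the cache, the retransmitted remove would be reinvoked against the already-changed state and would report a spurious failure to the client.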
9.3. Session Initialization and Transfer Models 11.3. Session Initialization and Transfer Models
Session initialization issues, and data transfer models relevant to Session initialization issues, and data transfer models relevant to
both TCP and RDMA are discussed in this section. both TCP and RDMA are discussed in this section.
9.3.1. Session Negotiation 11.3.1. Session Negotiation
The following parameters are exchanged between client and server at The following parameters are exchanged between client and server at
session creation time. Their values allow the server to properly session creation time. Their values allow the server to properly
size resources allocated in order to service the client's requests, size resources allocated in order to service the client's requests,
and to provide the server with a way to communicate limits to the and to provide the server with a way to communicate limits to the
client for proper and optimal operation. They are exchanged prior to client for proper and optimal operation. They are exchanged prior to
all session-related activity, over any transport type. Discussion of all session-related activity, over any transport type. Discussion of
their use is found in their descriptions as well as throughout this their use is found in their descriptions as well as throughout this
section. section.
skipping to change at page 138, line 21 skipping to change at page 144, line 21
bandwidth parameters. The client provides its chosen value to the bandwidth parameters. The client provides its chosen value to the
server in the initial session creation; the value must be provided server in the initial session creation; the value must be provided
in each client RDMA endpoint. The values are asymmetric and in each client RDMA endpoint. The values are asymmetric and
should be set to zero at the server in order to conserve RDMA should be set to zero at the server in order to conserve RDMA
resources, since clients do not issue RDMA Read operations in this resources, since clients do not issue RDMA Read operations in this
proposal. The result is communicated in the session response, to proposal. The result is communicated in the session response, to
permit matching of values across the connection. The value may permit matching of values across the connection. The value may
not be changed in the duration of the session, although a new not be changed in the duration of the session, although a new
value may be requested as part of a new session. value may be requested as part of a new session.
9.3.2. RDMA Requirements 11.3.2. RDMA Requirements
A complete discussion of the operation of RPC-based protocols atop A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. Where RDMA is considered, this RDMA transports is in [RPCRDMA]. Where RDMA is considered, this
proposal assumes the use of such a layering; it addresses only the proposal assumes the use of such a layering; it addresses only the
upper layer issues relevant to making best use of RPC/RDMA. upper layer issues relevant to making best use of RPC/RDMA.
A connection oriented (reliable sequenced) RDMA transport will be A connection oriented (reliable sequenced) RDMA transport will be