draft-ietf-nfsv4-minorversion1-10.txt draft-ietf-nfsv4-minorversion1-11.txt
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: September 5, 2007 Editors Expires: December 13, 2007 Editors
March 4, 2007 June 11, 2007
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-10.txt draft-ietf-nfsv4-minorversion1-11.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 5, 2007. This Internet-Draft will expire on December 13, 2007.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2007).
Abstract Abstract
This Internet-Draft describes NFSv4 minor version one, including This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major subsequently. The current draft includes description of the major
skipping to change at page 2, line 33 skipping to change at page 2, line 33
1.4.4. Locking Facilities . . . . . . . . . . . . . . . . . 14 1.4.4. Locking Facilities . . . . . . . . . . . . . . . . . 14
1.5. General Definitions . . . . . . . . . . . . . . . . . . 15 1.5. General Definitions . . . . . . . . . . . . . . . . . . 15
1.6. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 17 1.6. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 17
2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 17 2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 17
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 18 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 18
2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 18 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 18 2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 18
2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 21 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 21
2.4. Client Identifiers and Client Owners . . . . . . . . . . 22 2.4. Client Identifiers and Client Owners . . . . . . . . . . 22
2.4.1. Server Release of Client ID . . . . . . . . . . . . 26 2.4.1. Server Release of Client ID . . . . . . . . . . . . 26
2.4.2. Handling Client Owner Conflicts . . . . . . . . . . 26 2.4.2. Resolving Client Owner Conflicts . . . . . . . . . . 26
2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . 27 2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . 27
2.6. Security Service Negotiation . . . . . . . . . . . . . . 27 2.6. Security Service Negotiation . . . . . . . . . . . . . . 28
2.6.1. NFSv4 Security Tuples . . . . . . . . . . . . . . . 28 2.6.1. NFSv4.1 Security Tuples . . . . . . . . . . . . . . 28
2.6.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 28 2.6.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 28
2.6.3. Security Error . . . . . . . . . . . . . . . . . . . 28 2.6.3. Security Error . . . . . . . . . . . . . . . . . . . 29
2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 32 2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 32
2.8. Non-RPC-based Security Services . . . . . . . . . . . . 34 2.8. Non-RPC-based Security Services . . . . . . . . . . . . 34
2.8.1. Authorization . . . . . . . . . . . . . . . . . . . 34 2.8.1. Authorization . . . . . . . . . . . . . . . . . . . 34
2.8.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 34 2.8.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 35
2.8.3. Intrusion Detection . . . . . . . . . . . . . . . . 35 2.8.3. Intrusion Detection . . . . . . . . . . . . . . . . 35
2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 35 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 35
2.9.1. Required and Recommended Properties of Transports . 35 2.9.1. Required and Recommended Properties of Transports . 35
2.9.2. Client and Server Transport Behavior . . . . . . . . 35 2.9.2. Client and Server Transport Behavior . . . . . . . . 36
2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 37 2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 37
2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 37 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10.1. Motivation and Overview . . . . . . . . . . . . . . 37 2.10.1. Motivation and Overview . . . . . . . . . . . . . . 37
2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 38 2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 38
2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 39 2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 40
2.10.4. Exactly Once Semantics . . . . . . . . . . . . . . . 42 2.10.4. Trunking . . . . . . . . . . . . . . . . . . . . . . 41
2.10.5. RDMA Considerations . . . . . . . . . . . . . . . . 51 2.10.5. Exactly Once Semantics . . . . . . . . . . . . . . . 44
2.10.6. Sessions Security . . . . . . . . . . . . . . . . . 53 2.10.6. RDMA Considerations . . . . . . . . . . . . . . . . 56
2.10.7. Session Mechanics - Steady State . . . . . . . . . . 57 2.10.7. Sessions Security . . . . . . . . . . . . . . . . . 59
2.10.8. Session Mechanics - Recovery . . . . . . . . . . . . 59 2.10.8. Session Mechanics - Steady State . . . . . . . . . . 67
2.10.9. Parallel NFS and Sessions . . . . . . . . . . . . . 62 2.10.9. Session Mechanics - Recovery . . . . . . . . . . . . 68
3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 62 2.10.10. Parallel NFS and Sessions . . . . . . . . . . . . . 72
3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 62 3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 72
3.2. Structured Data Types . . . . . . . . . . . . . . . . . 64 3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 72
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.2. Structured Data Types . . . . . . . . . . . . . . . . . 74
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 74 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 74 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 84
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 74 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 84
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 75 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 84
4.2.1. General Properties of a Filehandle . . . . . . . . . 75 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 85
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 76 4.2.1. General Properties of a Filehandle . . . . . . . . . 85
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 76 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 86
4.3. One Method of Constructing a Volatile Filehandle . . . . 77 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 86
4.4. Client Recovery from Filehandle Expiration . . . . . . . 78 4.3. One Method of Constructing a Volatile Filehandle . . . . 87
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 79 4.4. Client Recovery from Filehandle Expiration . . . . . . . 88
5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 80 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 89
5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 80 5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 90
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 81 5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 90
5.4. Classification of Attributes . . . . . . . . . . . . . . 81 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 91
5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 83 5.4. Classification of Attributes . . . . . . . . . . . . . . 91
5.6. Recommended Attributes - Definitions . . . . . . . . . . 84 5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 93
5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 94 5.6. Recommended Attributes - Definitions . . . . . . . . . . 94
5.8. Interpreting owner and owner_group . . . . . . . . . . . 95 5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 104
5.9. Character Case Attributes . . . . . . . . . . . . . . . 97 5.8. Interpreting owner and owner_group . . . . . . . . . . . 105
5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 97 5.9. Character Case Attributes . . . . . . . . . . . . . . . 107
5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 98 5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 107
5.12. Directory Notification Attributes . . . . . . . . . . . 99 5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 108
5.12.1. dir_notif_delay . . . . . . . . . . . . . . . . . . 99 5.12. Directory Notification Attributes . . . . . . . . . . . 109
5.12.2. dirent_notif_delay . . . . . . . . . . . . . . . . . 99 5.12.1. dir_notif_delay . . . . . . . . . . . . . . . . . . 109
5.13. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 99 5.12.2. dirent_notif_delay . . . . . . . . . . . . . . . . . 109
5.13.1. fs_layout_type . . . . . . . . . . . . . . . . . . . 99 5.13. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 109
5.13.2. layout_alignment . . . . . . . . . . . . . . . . . . 99 5.13.1. fs_layout_type . . . . . . . . . . . . . . . . . . . 109
5.13.3. layout_blksize . . . . . . . . . . . . . . . . . . . 100 5.13.2. layout_alignment . . . . . . . . . . . . . . . . . . 109
5.13.4. layout_hint . . . . . . . . . . . . . . . . . . . . 100 5.13.3. layout_blksize . . . . . . . . . . . . . . . . . . . 110
5.13.5. layout_type . . . . . . . . . . . . . . . . . . . . 100 5.13.4. layout_hint . . . . . . . . . . . . . . . . . . . . 110
5.13.6. mdsthreshold . . . . . . . . . . . . . . . . . . . . 100 5.13.5. layout_type . . . . . . . . . . . . . . . . . . . . 110
5.14. Retention Attributes . . . . . . . . . . . . . . . . . . 101 5.13.6. mdsthreshold . . . . . . . . . . . . . . . . . . . . 110
6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 103 5.14. Retention Attributes . . . . . . . . . . . . . . . . . . 111
6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 103 6. Security Related Attributes . . . . . . . . . . . . . . . . . 113
6.2. File Attributes Discussion . . . . . . . . . . . . . . . 104 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.1. ACL Attribute . . . . . . . . . . . . . . . . . . . 104 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 114
6.2.2. dacl and sacl Attributes . . . . . . . . . . . . . . 115 6.2.1. ACL Attributes . . . . . . . . . . . . . . . . . . . 114
6.2.3. mode Attribute . . . . . . . . . . . . . . . . . . . 116 6.2.2. dacl and sacl Attributes . . . . . . . . . . . . . . 126
6.2.4. mode_set_masked Attribute . . . . . . . . . . . . . 116 6.2.3. mode Attribute . . . . . . . . . . . . . . . . . . . 127
6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 117 6.2.4. mode_set_masked Attribute . . . . . . . . . . . . . 127
6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 117 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 128
6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 118 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 128
6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 119 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 129
6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 120 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 131
6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 121 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 131
6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 122 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 132
7. Single-server Name Space . . . . . . . . . . . . . . . . . . 125 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 133
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 126 7. Single-server Name Space . . . . . . . . . . . . . . . . . . 137
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 126 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 137
7.3. Server Pseudo File System . . . . . . . . . . . . . . . 126 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 137
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 127 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 138
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 127 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 138
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 127 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 139
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 128 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 139
7.8. Security Policy and Name Space Presentation . . . . . . 128 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 139
8. File Locking and Share Reservations . . . . . . . . . . . . . 129 7.8. Security Policy and Name Space Presentation . . . . . . 140
8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 130 8. File Locking and Share Reservations . . . . . . . . . . . . . 140
8.1.1. Client and Session ID . . . . . . . . . . . . . . . 130 8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 141
8.1.2. State-owner Definition . . . . . . . . . . . . . . . 130 8.1.1. Client and Session ID . . . . . . . . . . . . . . . 141
8.1.3. Stateid Definition . . . . . . . . . . . . . . . . . 131 8.1.2. State-owner Definition . . . . . . . . . . . . . . . 142
8.1.4. Use of the Stateid and Locking . . . . . . . . . . . 134 8.1.3. Stateid Definition . . . . . . . . . . . . . . . . . 142
8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 137 8.1.4. Use of the Stateid and Locking . . . . . . . . . . . 146
8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 137 8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 148
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 138 8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 149
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 138 8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 149
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 139 8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 150
8.6.1. Client Failure and Recovery . . . . . . . . . . . . 139 8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 150
8.6.2. Server Failure and Recovery . . . . . . . . . . . . 140 8.6.1. Client Failure and Recovery . . . . . . . . . . . . 151
8.6.3. Network Partitions and Recovery . . . . . . . . . . 143 8.6.2. Server Failure and Recovery . . . . . . . . . . . . 151
8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 147 8.6.3. Network Partitions and Recovery . . . . . . . . . . 155
8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 148 8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 159
8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 149 8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 160
8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 149 8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 161
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 150 8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 161
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 162
8.12. Clocks, Propagation Delay, and Calculating Lease 8.12. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 151 Expiration . . . . . . . . . . . . . . . . . . . . . . . 162
8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 151 8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 163
9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 152 9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 164
9.1. Performance Challenges for Client-Side Caching . . . . . 153 9.1. Performance Challenges for Client-Side Caching . . . . . 164
9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 153 9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 165
9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 155 9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 167
9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 157 9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 169
9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 157 9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 169
9.3.2. Data Caching and File Locking . . . . . . . . . . . 158 9.3.2. Data Caching and File Locking . . . . . . . . . . . 170
9.3.3. Data Caching and Mandatory File Locking . . . . . . 160 9.3.3. Data Caching and Mandatory File Locking . . . . . . 172
9.3.4. Data Caching and File Identity . . . . . . . . . . . 160 9.3.4. Data Caching and File Identity . . . . . . . . . . . 172
9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 161 9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 173
9.4.1. Open Delegation and Data Caching . . . . . . . . . . 164 9.4.1. Open Delegation and Data Caching . . . . . . . . . . 175
9.4.2. Open Delegation and File Locks . . . . . . . . . . . 165 9.4.2. Open Delegation and File Locks . . . . . . . . . . . 177
9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 165 9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 177
9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 168 9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 180
9.4.5. Clients that Fail to Honor Delegation Recalls . . . 170 9.4.5. Clients that Fail to Honor Delegation Recalls . . . 182
9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 171 9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 183
9.5. Data Caching and Revocation . . . . . . . . . . . . . . 171 9.5. Data Caching and Revocation . . . . . . . . . . . . . . 183
9.5.1. Revocation Recovery for Write Open Delegation . . . 172 9.5.1. Revocation Recovery for Write Open Delegation . . . 184
9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 173 9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 185
9.7. Data and Metadata Caching and Memory Mapped Files . . . 175 9.7. Data and Metadata Caching and Memory Mapped Files . . . 187
9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 177 9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 189
9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 178 9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 190
10. Multi-Server Name Space . . . . . . . . . . . . . . . . . . . 179 10. Multi-Server Name Space . . . . . . . . . . . . . . . . . . . 191
10.1. Location attributes . . . . . . . . . . . . . . . . . . 179 10.1. Location attributes . . . . . . . . . . . . . . . . . . 191
10.2. File System Presence or Absence . . . . . . . . . . . . 179 10.2. File System Presence or Absence . . . . . . . . . . . . 191
10.3. Getting Attributes for an Absent File System . . . . . . 181 10.3. Getting Attributes for an Absent File System . . . . . . 193
10.3.1. GETATTR Within an Absent File System . . . . . . . . 181 10.3.1. GETATTR Within an Absent File System . . . . . . . . 193
10.3.2. READDIR and Absent File Systems . . . . . . . . . . 182 10.3.2. READDIR and Absent File Systems . . . . . . . . . . 194
10.4. Uses of Location Information . . . . . . . . . . . . . . 183 10.4. Uses of Location Information . . . . . . . . . . . . . . 195
10.4.1. File System Replication . . . . . . . . . . . . . . 183 10.4.1. File System Replication . . . . . . . . . . . . . . 195
10.4.2. File System Migration . . . . . . . . . . . . . . . 185 10.4.2. File System Migration . . . . . . . . . . . . . . . 197
10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 186 10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 198
10.5. Additional Client-side Considerations . . . . . . . . . 187 10.5. Additional Client-side Considerations . . . . . . . . . 199
10.6. Effecting File System Transitions . . . . . . . . . . . 188 10.6. Effecting File System Transitions . . . . . . . . . . . 200
10.6.1. File System Transitions and Simultaneous Access . . 189 10.6.1. File System Transitions and Simultaneous Access . . 201
10.6.2. Simultaneous Use and Transparent Transitions . . . . 190 10.6.2. Simultaneous Use and Transparent Transitions . . . . 202
10.6.3. Filehandles and File System Transitions . . . . . . 192 10.6.3. Filehandles and File System Transitions . . . . . . 204
10.6.4. Fileid's and File System Transitions . . . . . . . . 192 10.6.4. Fileid's and File System Transitions . . . . . . . . 204
10.6.5. Fsids and File System Transitions . . . . . . . . . 193 10.6.5. Fsids and File System Transitions . . . . . . . . . 205
10.6.6. The Change Attribute and File System Transitions . . 193 10.6.6. The Change Attribute and File System Transitions . . 205
10.6.7. Lock State and File System Transitions . . . . . . . 194 10.6.7. Lock State and File System Transitions . . . . . . . 206
10.6.8. Write Verifiers and File System Transitions . . . . 197 10.6.8. Write Verifiers and File System Transitions . . . . 210
10.7. Effecting File System Referrals . . . . . . . . . . . . 197 10.7. Effecting File System Referrals . . . . . . . . . . . . 210
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 198 10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 210
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 202 10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 214
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 204 10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 216
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 204 10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 217
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 206 10.10. The Attribute fs_locations_info . . . . . . . . . . . . 219
10.10.1. The fs_locations_server4 Structure . . . . . . . . . 209 10.10.1. The fs_locations_server4 Structure . . . . . . . . . 221
10.10.2. The fs_locations_info4 Structure . . . . . . . . . . 214 10.10.2. The fs_locations_info4 Structure . . . . . . . . . . 226
10.10.3. The fs_locations_item4 Structure . . . . . . . . . . 215 10.10.3. The fs_locations_item4 Structure . . . . . . . . . . 227
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 216 10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 228
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 220 11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 232
11.1. Introduction to Directory Delegations . . . . . . . . . 220 11.1. Introduction to Directory Delegations . . . . . . . . . 232
11.2. Directory Delegation Design . . . . . . . . . . . . . . 221 11.2. Directory Delegation Design . . . . . . . . . . . . . . 233
11.3. Attributes in Support of Directory Notifications . . . . 222 11.3. Attributes in Support of Directory Notifications . . . . 234
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 222 11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 234
11.5. Directory Delegation Recovery . . . . . . . . . . . . . 222 11.5. Directory Delegation Recovery . . . . . . . . . . . . . 234
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 222 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 234
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 222 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 234
12.2. PNFS Definitions . . . . . . . . . . . . . . . . . . . . 224 12.2. PNFS Definitions . . . . . . . . . . . . . . . . . . . . 236
12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 224 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 236
12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 224 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 236
12.2.3. Client . . . . . . . . . . . . . . . . . . . . . . . 225 12.2.3. Client . . . . . . . . . . . . . . . . . . . . . . . 237
12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 225 12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 237
12.2.5. Data Server . . . . . . . . . . . . . . . . . . . . 225 12.2.5. Data Server . . . . . . . . . . . . . . . . . . . . 237
12.2.6. Storage Protocol or Data Protocol . . . . . . . . . 225 12.2.6. Storage Protocol or Data Protocol . . . . . . . . . 237
12.2.7. Control Protocol . . . . . . . . . . . . . . . . . . 225 12.2.7. Control Protocol . . . . . . . . . . . . . . . . . . 237
12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 226 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 238
12.2.9. Layout Types . . . . . . . . . . . . . . . . . . . . 226 12.2.9. Layout Types . . . . . . . . . . . . . . . . . . . . 238
12.2.10. Layout Iomode . . . . . . . . . . . . . . . . . . . 226 12.2.10. Layout Iomode . . . . . . . . . . . . . . . . . . . 238
12.2.11. Layout Segment . . . . . . . . . . . . . . . . . . . 227 12.2.11. Layout Segment . . . . . . . . . . . . . . . . . . . 239
12.2.12. Device IDs . . . . . . . . . . . . . . . . . . . . . 228 12.2.12. Device IDs . . . . . . . . . . . . . . . . . . . . . 240
12.3. PNFS Operations . . . . . . . . . . . . . . . . . . . . 228 12.3. PNFS Operations . . . . . . . . . . . . . . . . . . . . 240
12.4. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 229 12.4. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 241
12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 229 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 241
12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 229 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 241
12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 230 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 242
12.5.3. Committing a Layout . . . . . . . . . . . . . . . . 231 12.5.3. Committing a Layout . . . . . . . . . . . . . . . . 243
12.5.4. Recalling a Layout . . . . . . . . . . . . . . . . . 234 12.5.4. Recalling a Layout . . . . . . . . . . . . . . . . . 246
12.5.5. Metadata Server Write Propagation . . . . . . . . . 240 12.5.5. Metadata Server Write Propagation . . . . . . . . . 252
12.6. PNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 240 12.6. PNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 252
12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 241 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 253
12.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 241 12.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 253
12.7.2. Dealing with Lease Expiration on the Client . . . . 242 12.7.2. Dealing with Lease Expiration on the Client . . . . 254
12.7.3. Dealing with Loss of Layout State on the Metadata 12.7.3. Dealing with Loss of Layout State on the Metadata
Server . . . . . . . . . . . . . . . . . . . . . . . 243 Server . . . . . . . . . . . . . . . . . . . . . . . 255
12.7.4. Recovery from Metadata Server Restart . . . . . . . 244 12.7.4. Recovery from Metadata Server Restart . . . . . . . 256
12.7.5. Operations During Metadata Server Grace Period . . . 246 12.7.5. Operations During Metadata Server Grace Period . . . 258
12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 246 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 258
12.8. Metadata and Storage Device Roles . . . . . . . . . . . 247 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 259
12.9. Security Considerations . . . . . . . . . . . . . . . . 248 12.9. Security Considerations . . . . . . . . . . . . . . . . 260
13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 249 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 261
13.1. Session Considerations . . . . . . . . . . . . . . . . . 249 13.1. Session Considerations . . . . . . . . . . . . . . . . . 261
13.2. File Layout Definitions . . . . . . . . . . . . . . . . 251 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 263
13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 251 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 263
13.4. Interpreting the File Layout . . . . . . . . . . . . . . 255 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 267
13.5. Sparse and Dense Stripe Unit Packing . . . . . . . . . . 257 13.5. Sparse and Dense Stripe Unit Packing . . . . . . . . . . 269
13.6. Data Server Multipathing . . . . . . . . . . . . . . . . 259 13.6. Data Server Multipathing . . . . . . . . . . . . . . . . 271
13.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 259 13.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 271
13.8. COMMIT Through Metadata Server . . . . . . . . . . . . . 260 13.8. COMMIT Through Metadata Server . . . . . . . . . . . . . 272
13.9. Global Stateid Requirements . . . . . . . . . . . . . . 261 13.9. Global Stateid Requirements . . . . . . . . . . . . . . 273
13.10. The Layout Iomode . . . . . . . . . . . . . . . . . . . 261 13.10. The Layout Iomode . . . . . . . . . . . . . . . . . . . 273
13.11. Data Server State Propagation . . . . . . . . . . . . . 261 13.11. Data Server State Propagation . . . . . . . . . . . . . 273
13.11.1. Lock State Propagation . . . . . . . . . . . . . . . 262 13.11.1. Lock State Propagation . . . . . . . . . . . . . . . 274
13.11.2. Open-mode Validation . . . . . . . . . . . . . . . . 262 13.11.2. Open-mode Validation . . . . . . . . . . . . . . . . 274
13.11.3. File Attributes . . . . . . . . . . . . . . . . . . 263 13.11.3. File Attributes . . . . . . . . . . . . . . . . . . 275
13.12. Data Server Component File Size . . . . . . . . . . . . 263 13.12. Data Server Component File Size . . . . . . . . . . . . 275
13.13. Recovery Considerations . . . . . . . . . . . . . . . . 264 13.13. Recovery Considerations . . . . . . . . . . . . . . . . 276
13.14. Security Considerations for the File Layout Type . . . . 265 13.14. Security Considerations for the File Layout Type . . . . 277
14. Internationalization . . . . . . . . . . . . . . . . . . . . 265 14. Internationalization . . . . . . . . . . . . . . . . . . . . 277
14.1. Stringprep profile for the utf8str_cs type . . . . . . . 266 14.1. Stringprep profile for the utf8str_cs type . . . . . . . 278
14.2. Stringprep profile for the utf8str_cis type . . . . . . 268 14.2. Stringprep profile for the utf8str_cis type . . . . . . 280
14.3. Stringprep profile for the utf8str_mixed type . . . . . 269 14.3. Stringprep profile for the utf8str_mixed type . . . . . 281
14.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 271 14.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 283
15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 271 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 283
15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 271 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 283
15.2. Operations and their valid errors . . . . . . . . . . . 285 15.2. Operations and their valid errors . . . . . . . . . . . 298
15.3. Callback operations and their valid errors . . . . . . . 299 15.3. Callback operations and their valid errors . . . . . . . 312
15.4. Errors and the operations that use them . . . . . . . . 300 15.4. Errors and the operations that use them . . . . . . . . 313
16. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 307 16. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 320
16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 307 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 320
16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 308 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 321
17. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 313 17. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 326
17.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 313 17.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 326
17.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 315 17.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 328
17.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 317 17.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 330
17.4. Operation 6: CREATE - Create a Non-Regular File Object . 319 17.4. Operation 6: CREATE - Create a Non-Regular File Object . 332
17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 322 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 335
17.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 323 17.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 336
17.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 323 17.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 336
17.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 325 17.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 338
17.9. Operation 11: LINK - Create Link to a File . . . . . . . 326 17.9. Operation 11: LINK - Create Link to a File . . . . . . . 339
17.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 327 17.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 340
17.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 331 17.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 344
17.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 332 17.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 345
17.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 334 17.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 347
17.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 335 17.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 349
17.15. Operation 17: NVERIFY - Verify Difference in 17.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 337 Attributes . . . . . . . . . . . . . . . . . . . . . . . 350
17.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 338 17.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 351
17.17. Operation 19: OPENATTR - Open Named Attribute 17.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 352 Directory . . . . . . . . . . . . . . . . . . . . . . . 366
17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 354 17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 367
17.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 355 17.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 368
17.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 356 17.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 369
17.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 357 17.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 371
17.22. Operation 25: READ - Read from File . . . . . . . . . . 358 17.22. Operation 25: READ - Read from File . . . . . . . . . . 372
17.23. Operation 26: READDIR - Read Directory . . . . . . . . . 360 17.23. Operation 26: READDIR - Read Directory . . . . . . . . . 374
17.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 364 17.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 378
17.25. Operation 28: REMOVE - Remove File System Object . . . . 365 17.25. Operation 28: REMOVE - Remove File System Object . . . . 379
17.26. Operation 29: RENAME - Rename Directory Entry . . . . . 367 17.26. Operation 29: RENAME - Rename Directory Entry . . . . . 381
17.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 369 17.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 383
17.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 370 17.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 384
17.29. Operation 33: SECINFO - Obtain Available Security . . . 370 17.29. Operation 33: SECINFO - Obtain Available Security . . . 384
17.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 374 17.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 388
17.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 376 17.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 390
17.32. Operation 38: WRITE - Write to File . . . . . . . . . . 377 17.32. Operation 38: WRITE - Write to File . . . . . . . . . . 391
17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 382 17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 396
17.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 383 17.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 397
17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 387 17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 399
17.36. Operation 43: CREATE_SESSION - Create New Session and 17.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Client ID . . . . . . . . . . . . . . . . . . . 395 Confirm Client ID . . . . . . . . . . . . . . . . . . . 416
17.37. Operation 44: DESTROY_SESSION - Destroy existing 17.37. Operation 44: DESTROY_SESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 405 session . . . . . . . . . . . . . . . . . . . . . . . . 426
17.38. Operation 45: FREE_STATEID - Free stateid with no 17.38. Operation 45: FREE_STATEID - Free stateid with no
locks . . . . . . . . . . . . . . . . . . . . . . . . . 406 locks . . . . . . . . . . . . . . . . . . . . . . . . . 427
17.39. Operation 46: GET_DIR_DELEGATION - Get a directory 17.39. Operation 46: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 407 delegation . . . . . . . . . . . . . . . . . . . . . . . 428
17.40. Operation 47: GETDEVICEINFO - Get Device Information . . 412 17.40. Operation 47: GETDEVICEINFO - Get Device Information . . 433
17.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 413 17.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 434
17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 414 a layout . . . . . . . . . . . . . . . . . . . . . . . . 435
17.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 417 17.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 438
17.44. Operation 51: LAYOUTRETURN - Release Layout 17.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 420 Information . . . . . . . . . . . . . . . . . . . . . . 441
17.45. Operation 52: SECINFO_NO_NAME - Get Security on 17.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 423 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 444
17.46. Operation 53: SEQUENCE - Supply per-procedure 17.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 424 sequencing and control . . . . . . . . . . . . . . . . . 445
17.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 429 17.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 452
17.48. Operation 55: TEST_STATEID - Test stateids for 17.48. Operation 55: TEST_STATEID - Test stateids for
validity . . . . . . . . . . . . . . . . . . . . . . . . 431 validity . . . . . . . . . . . . . . . . . . . . . . . . 454
17.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 432 17.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 455
17.50. Operation 57: DESTROY_CLIENTID - Destroy existing 17.50. Operation 57: DESTROY_CLIENTID - Destroy existing
client ID . . . . . . . . . . . . . . . . . . . . . . . 435 client ID . . . . . . . . . . . . . . . . . . . . . . . 458
17.51. Operation 10044: ILLEGAL - Illegal operation . . . . . . 436 17.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
18. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 437 Finished . . . . . . . . . . . . . . . . . . . . . . . . 459
18.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 437 17.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 460
18.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 437 18. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 461
19. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 439 18.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 461
19.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 439 18.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 461
19.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 441 19. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 463
19.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 442 19.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 463
19.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 444 19.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 465
19.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 447 19.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 466
19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 448 19.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 468
19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 451 19.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 471
19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 472
19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 475
19.8. Operation 10: CB_RECALL_SLOT - change flow control 19.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 452 limits . . . . . . . . . . . . . . . . . . . . . . . . . 476
19.9. Operation 11: CB_SEQUENCE - Supply callback channel 19.9. Operation 11: CB_SEQUENCE - Supply backchannel
sequencing and control . . . . . . . . . . . . . . . . . 453 sequencing and control . . . . . . . . . . . . . . . . . 477
19.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 455 19.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 480
19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible 19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
lock availability . . . . . . . . . . . . . . . . . . . 456 lock availability . . . . . . . . . . . . . . . . . . . 481
19.12. Operation 10044: CB_ILLEGAL - Illegal Callback 19.12. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 457 Operation . . . . . . . . . . . . . . . . . . . . . . . 482
20. Security Considerations . . . . . . . . . . . . . . . . . . . 458 20. Security Considerations . . . . . . . . . . . . . . . . . . . 483
21. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 458 21. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 483
21.1. Defining new layout types . . . . . . . . . . . . . . . 458 21.1. Defining new layout types . . . . . . . . . . . . . . . 483
22. References . . . . . . . . . . . . . . . . . . . . . . . . . 459 22. References . . . . . . . . . . . . . . . . . . . . . . . . . 484
22.1. Normative References . . . . . . . . . . . . . . . . . . 459 22.1. Normative References . . . . . . . . . . . . . . . . . . 484
22.2. Informative References . . . . . . . . . . . . . . . . . 460 22.2. Informative References . . . . . . . . . . . . . . . . . 485
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 461 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 487
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 462 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 487
Intellectual Property and Copyright Statements . . . . . . . . . 464 Intellectual Property and Copyright Statements . . . . . . . . . 489
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for the minor described in [2]. It generally follows the guidelines for the minor
versioning model laid out in Section 10 of RFC 3530. However, it versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
skipping to change at page 10, line 32 skipping to change at page 10, line 32
NFSv4.1, as a minor version, is consistent with the overall goals for NFSv4.1, as a minor version, is consistent with the overall goals for
NFS Version 4, but extends the protocol so as to better meet those NFS Version 4, but extends the protocol so as to better meet those
goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
adopted some additional goals, which motivate some of the major adopted some additional goals, which motivate some of the major
extensions in minor version 1. extensions in minor version 1.
1.2. NFS Version 4 Goals 1.2. NFS Version 4 Goals
The NFS version 4 protocol is a further revision of the NFS protocol The NFS version 4 protocol is a further revision of the NFS protocol
defined already by versions 2 [17]] and 3 [18]. It retains the defined already by versions 2 [21] and 3 [22]. It retains the
essential characteristics of previous versions: design for easy essential characteristics of previous versions: design for easy
recovery, independent of transport protocols, operating systems and recovery, independent of transport protocols, operating systems and
file systems, simplicity, and good performance. The NFS version 4 file systems, simplicity, and good performance. The NFS version 4
revision has the following goals: revision has the following goals:
o Improved access and good performance on the Internet. o Improved access and good performance on the Internet.
The protocol is designed to transit firewalls easily, perform well The protocol is designed to transit firewalls easily, perform well
where latency is high and bandwidth is low, and scale to very where latency is high and bandwidth is low, and scale to very
large numbers of clients per server. large numbers of clients per server.
skipping to change at page 13, line 44 skipping to change at page 13, line 44
to provide filehandles with more limited validity guarantees, called to provide filehandles with more limited validity guarantees, called
volatile filehandles. volatile filehandles.
1.4.3.2. File Attributes 1.4.3.2. File Attributes
The NFS version 4.1 protocol has a rich and extensible attribute The NFS version 4.1 protocol has a rich and extensible attribute
structure. Only a small set of the defined attributes are mandatory structure. Only a small set of the defined attributes are mandatory
and must be provided by all server implementations. The other and must be provided by all server implementations. The other
attributes are known as "recommended" attributes. attributes are known as "recommended" attributes.
One significant recommended file attribute is the Access Control List The acl, sacl, and dacl attributes are a significant set of file
(ACL) attribute. This attribute provides for directory and file attributes that make up the Access Control List (ACL) of a file.
access control beyond the model used in NFS Versions 2 and 3. The These attributes provide for directory and file access control beyond
ACL definition allows for specification of specific sets of the model used in NFS Versions 2 and 3. The ACL definition allows
permissions for individual users and groups. In addition, ACL for specification of specific sets of permissions for individual
inheritance allows propagation of access permissions and restriction users and groups. In addition, ACL inheritance allows propagation of
down a directory tree as file system objects are created. access permissions and restriction down a directory tree as file
system objects are created.
One other type of attribute is the named attribute. A named One other type of attribute is the named attribute. A named
attribute is an opaque octet stream that is associated with a attribute is an opaque octet stream that is associated with a
directory or file and referred to by a string name. Named attributes directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate are meant to be used by client applications as a method to associate
application-specific data with a regular file or directory. application-specific data with a regular file or directory.
1.4.3.3. Multi-server Namespace 1.4.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation NFS Version 4.1 contains a number of features to allow implementation
skipping to change at page 18, line 15 skipping to change at page 18, line 15
2.1. Introduction 2.1. Introduction
NFS version 4.1 (NFSv4.1) relies on core infrastructure common to NFS version 4.1 (NFSv4.1) relies on core infrastructure common to
nearly every operation. This core infrastructure is described in the nearly every operation. This core infrastructure is described in the
remainder of this section. remainder of this section.
2.2. RPC and XDR 2.2. RPC and XDR
The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call
(RPC) application that uses RPC version 2 and the corresponding (RPC) application that uses RPC version 2 and the corresponding
eXternal Data Representation (XDR) as defined in RFC1831 [4] and eXternal Data Representation (XDR) as defined in [4] and [3].
RFC4506 [3].
2.2.1. RPC-based Security 2.2.1. RPC-based Security
Previous NFS versions have been thought of as having a host-based Previous NFS versions have been thought of as having a host-based
authentication model, where the NFS server authenticates the NFS authentication model, where the NFS server authenticates the NFS
client, and trusts the client to authenticate all users. Actually, client, and trusts the client to authenticate all users. Actually,
NFS has always depended on RPC for authentication. The first form of NFS has always depended on RPC for authentication. The first form of
RPC authentication required a host-based authentication RPC authentication required a host-based authentication
approach. NFSv4 also depends on RPC for basic security services, and approach. NFSv4.1 also depends on RPC for basic security services,
mandates RPC support for a user-based authentication model. The and mandates RPC support for a user-based authentication model. The
user-based authentication model has user principals authenticated by user-based authentication model has user principals authenticated by
a server, and in turn the server authenticated by user principals. a server, and in turn the server authenticated by user principals.
RPC provides some basic security services which are used by NFSv4. RPC provides some basic security services which are used by NFSv4.
2.2.1.1. RPC Security Flavors 2.2.1.1. RPC Security Flavors
As described in section 7.2 "Authentication" of [4], RPC security is As described in section 7.2 "Authentication" of [4], RPC security is
encapsulated in the RPC header, via a security or authentication encapsulated in the RPC header, via a security or authentication
flavor, and information specific to the specification of the security flavor, and information specific to the specification of the security
flavor. Every RPC header conveys information used to identify and flavor. Every RPC header conveys information used to identify and
authenticate a client and server. As discussed in Section 2.2.1.1.1, authenticate a client and server. As discussed in Section 2.2.1.1.1,
some security flavors provide additional security services. some security flavors provide additional security services.
NFSv4 clients and servers MUST implement RPCSEC_GSS. (This NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This
requirement to implement is not a requirement to use.) Other requirement to implement is not a requirement to use.) Other
flavors, such as AUTH_NONE, and AUTH_SYS, MAY be implemented as well. flavors, such as AUTH_NONE, and AUTH_SYS, MAY be implemented as well.
2.2.1.1.1. RPCSEC_GSS and Security Services 2.2.1.1.1. RPCSEC_GSS and Security Services
RPCSEC_GSS ([5]) uses the functionality of GSS-API RFC2743 [8]. This RPCSEC_GSS ([5]) uses the functionality of GSS-API [8]. This allows
allows for the use of various security mechanisms by the RPC layer for the use of various security mechanisms by the RPC layer without
without the additional implementation overhead of adding RPC security the additional implementation overhead of adding RPC security
flavors. flavors.
2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy 2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy
Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate
users on clients to servers, and servers to users. It can also users on clients to servers, and servers to users. It can also
perform integrity checking on the entire RPC message, including the perform integrity checking on the entire RPC message, including the
RPC header, and the arguments or results. Finally, privacy, usually RPC header, and the arguments or results. Finally, privacy, usually
via encryption, is a service available with RPCSEC_GSS. Privacy is via encryption, is a service available with RPCSEC_GSS. Privacy is
performed on the arguments and results. Note that if privacy is performed on the arguments and results. Note that if privacy is
skipping to change at page 19, line 26 skipping to change at page 19, line 26
selected, but authentication is enabled, identification is enabled. selected, but authentication is enabled, identification is enabled.
RPCSEC_GSS does not provide identification as a separate service. RPCSEC_GSS does not provide identification as a separate service.
Although GSS-API has an authentication service distinct from its Although GSS-API has an authentication service distinct from its
privacy and integrity services, GSS-API's authentication service is privacy and integrity services, GSS-API's authentication service is
not used for RPCSEC_GSS's authentication service. Instead, each RPC not used for RPCSEC_GSS's authentication service. Instead, each RPC
request and response header is integrity protected with the GSS-API request and response header is integrity protected with the GSS-API
integrity service, and this allows RPCSEC_GSS to offer per-RPC integrity service, and this allows RPCSEC_GSS to offer per-RPC
authentication and identity. See [5] for more information. authentication and identity. See [5] for more information.
NFSv4 clients and servers MUST support RPCSEC_GSS's integrity and NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and
authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's
privacy service. privacy service.
2.2.1.1.1.2. Security mechanisms for NFS version 4 2.2.1.1.1.2. Security mechanisms for NFS version 4
RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
security services. Therefore NFSv4 clients and servers MUST support security services. Therefore NFSv4.1 clients and servers MUST
three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY. support three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.
The use of RPCSEC_GSS requires selection of: mechanism, quality of The use of RPCSEC_GSS requires selection of: mechanism, quality of
protection (QOP), and service (authentication, integrity, privacy). protection (QOP), and service (authentication, integrity, privacy).
For the mandated security mechanisms, NFSv4 specifies that a QOP of For the mandated security mechanisms, NFSv4.1 specifies that a QOP of
zero (0) is used, leaving it up to the mechanism or the mechanism's zero (0) is used, leaving it up to the mechanism or the mechanism's
configuration to use an appropriate level of protection that QOP zero configuration to use an appropriate level of protection that QOP zero
maps to. Each mandated mechanism specifies a minimum set of maps to. Each mandated mechanism specifies a minimum set of
cryptographic algorithms for implementing integrity and privacy. cryptographic algorithms for implementing integrity and privacy.
NFSv4 clients and servers MUST be implemented on operating NFSv4.1 clients and servers MUST be implemented on operating
environments that comply with the mandatory cryptographic algorithms environments that comply with the mandatory cryptographic algorithms
of each mandated mechanism. of each mandated mechanism.
2.2.1.1.1.2.1. Kerberos V5 2.2.1.1.1.2.1. Kerberos V5
The Kerberos V5 GSS-API mechanism as described in RFC1964 [6] ( The Kerberos V5 GSS-API mechanism as described in [6] ( [[Comment.1:
[[Comment.1: need new Kerberos RFC]] ) MUST be implemented with the need new Kerberos RFC]] ) MUST be implemented with the RPCSEC_GSS
RPCSEC_GSS services as specified in the following table: services as specified in the following table:
column descriptions: column descriptions:
1 == number of pseudo flavor 1 == number of pseudo flavor
2 == name of pseudo flavor 2 == name of pseudo flavor
3 == mechanism's OID 3 == mechanism's OID
4 == RPCSEC_GSS service 4 == RPCSEC_GSS service
5 == NFSv4.1 clients MUST support 5 == NFSv4.1 clients MUST support
6 == NFSv4.1 servers MUST support 6 == NFSv4.1 servers MUST support
1 2 3 4 5 6 1 2 3 4 5 6
------------------------------------------------------------------ ------------------------------------------------------------------
390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none yes yes 390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none yes yes
390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes 390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes
390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy no yes 390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy no yes
Note that the number and name of the pseudo flavor is presented here Note that the number and name of the pseudo flavor is presented here
as a mapping aid to the implementor. Because the NFSv4 protocol as a mapping aid to the implementor. Because the NFSv4.1 protocol
includes a method to negotiate security and it understands the GSS- includes a method to negotiate security and it understands the GSS-
API mechanism, the pseudo flavor is not needed. The pseudo flavor is API mechanism, the pseudo flavor is not needed. The pseudo flavor is
needed for the NFS version 3 since the security negotiation is done needed for the NFS version 3 since the security negotiation is done
via the MOUNT protocol as described in [19]. via the MOUNT protocol as described in [23].
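The Kerberos V5 table above can be read as a lookup from pseudo flavor number to the RPCSEC_GSS selection it stands for. The following is a minimal sketch of that reading, not part of either draft: the names GssSelection, KRB5_PSEUDO_FLAVORS, and lookup_pseudo_flavor are hypothetical, and the QOP value of zero is taken from the statement above that the mandated mechanisms use a QOP of zero.

```python
from typing import NamedTuple, Optional

KRB5_OID = "1.2.840.113554.1.2.2"  # mechanism OID, column 3 of the table


class GssSelection(NamedTuple):
    """Hypothetical record for one row of the Kerberos V5 table."""
    name: str            # column 2: name of pseudo flavor
    oid: str             # column 3: mechanism's OID
    qop: int             # quality of protection; zero for mandated mechanisms
    service: str         # column 4: RPCSEC_GSS service
    client_must: bool    # column 5: NFSv4.1 clients MUST support
    server_must: bool    # column 6: NFSv4.1 servers MUST support


# Rows copied from the table; the dictionary itself is an illustration only.
KRB5_PSEUDO_FLAVORS = {
    390003: GssSelection("krb5", KRB5_OID, 0, "rpc_gss_svc_none", True, True),
    390004: GssSelection("krb5i", KRB5_OID, 0, "rpc_gss_svc_integrity", True, True),
    390005: GssSelection("krb5p", KRB5_OID, 0, "rpc_gss_svc_privacy", False, True),
}


def lookup_pseudo_flavor(number: int) -> Optional[GssSelection]:
    """Return the table row for a pseudo flavor number, or None if unknown."""
    return KRB5_PSEUDO_FLAVORS.get(number)
```

An implementation that negotiates security within the protocol would work with the mechanism OID and service directly; a table like this only helps when relating those values to the pseudo flavor numbers used with NFS version 3.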
2.2.1.1.1.2.2. LIPKEY 2.2.1.1.1.2.2. LIPKEY
The LIPKEY V5 GSS-API mechanism as described in [7] MUST be The LIPKEY V5 GSS-API mechanism as described in [7] MUST be
implemented with the RPCSEC_GSS services as specified in the implemented with the RPCSEC_GSS services as specified in the
following table: following table:
1 2 3 4 5 6 1 2 3 4 5 6
------------------------------------------------------------------ ------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none yes yes 390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none yes yes
skipping to change at page 21, line 48 skipping to change at page 21, line 48
With the use of the COMPOUND procedure, the client is able to build With the use of the COMPOUND procedure, the client is able to build
simple or complex requests. These COMPOUND requests allow for a simple or complex requests. These COMPOUND requests allow for a
reduction in the number of RPCs needed for logical file system reduction in the number of RPCs needed for logical file system
operations. For example, multi-component lookup requests can be operations. For example, multi-component lookup requests can be
constructed by combining multiple LOOKUP operations. Those can be constructed by combining multiple LOOKUP operations. Those can be
further combined with operations such as GETATTR, READDIR, or OPEN further combined with operations such as GETATTR, READDIR, or OPEN
plus READ to do more complicated sets of operation without incurring plus READ to do more complicated sets of operation without incurring
additional latency. additional latency.
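The multi-component lookup described above can be sketched as a list of operations assembled for a single COMPOUND request. This is an illustration only, under stated assumptions: the function name build_lookup_compound and the tuple representation of an operation are hypothetical, no XDR encoding or RPC transport is shown, and the choice of PUTROOTFH as the starting point and of "size" as the requested attribute is an example rather than a requirement.

```python
from typing import List, Sequence, Tuple

Operation = Tuple  # hypothetical: (operation name, argument, ...)


def build_lookup_compound(path: str,
                          attrs: Sequence[str] = ("size",)) -> List[Operation]:
    """Assemble the operation list for one COMPOUND that walks a path.

    One LOOKUP is emitted per path component, so the whole walk costs a
    single round trip instead of one RPC per component.
    """
    ops: List[Operation] = [("PUTROOTFH",)]      # start at the server's root
    for component in path.strip("/").split("/"):
        if component:                            # skip empty components
            ops.append(("LOOKUP", component))
    ops.append(("GETFH",))                       # return the resulting filehandle
    ops.append(("GETATTR", tuple(attrs)))        # and its attributes
    return ops


request = build_lookup_compound("/export/home/file.txt")
```

Here the three-component path yields six operations carried in one request, where issuing each step as a separate RPC would take six round trips.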
NFSv4 also contains a considerable set of callback operations in NFSv4.1 also contains a considerable set of callback operations in
which the server makes an RPC directed at the client. Callback RPCs which the server makes an RPC directed at the client. Callback RPCs
have a similar structure to that of the normal server requests. For have a similar structure to that of the normal server requests. For
the NFS version 4 protocol callbacks in all minor versions, there are the NFS version 4 protocol callbacks in all minor versions, there are
two RPC procedures, NULL and CB_COMPOUND. The CB_COMPOUND procedure two RPC procedures, NULL and CB_COMPOUND. The CB_COMPOUND procedure
is defined in an analogous fashion to that of COMPOUND with its own is defined in an analogous fashion to that of COMPOUND with its own
set of callback operations. set of callback operations.
The addition of new server and callback operations within the COMPOUND and The addition of new server and callback operations within the COMPOUND and
CB_COMPOUND request framework provides a means of extending the protocol CB_COMPOUND request framework provides a means of extending the protocol
in subsequent minor versions. in subsequent minor versions.
Except for a small number of operations needed for session creation, Except for a small number of operations needed for session creation,
server requests and callback requests are performed within the server requests and callback requests are performed within the
context of a session. Sessions provide a client context for every context of a session. Sessions provide a client context for every
request and support robust replay protection for non-idempotent request and support robust reply protection for non-idempotent
requests. requests.
2.4. Client Identifiers and Client Owners 2.4. Client Identifiers and Client Owners
For each operation that obtains or depends on locking state, the For each operation that obtains or depends on locking state, the
specific client must be determinable by the server. In NFSv4, each specific client must be determinable by the server. In NFSv4, each
distinct client instance is represented by a client ID, which is a distinct client instance is represented by a client ID, which is a
64-bit identifier that identifies a specific client at a given time 64-bit identifier that identifies a specific client at a given time
and which is changed whenever the client or the server re- and which is changed whenever the client or the server re-
initializes. Client IDs are used to support lock identification and initializes. Client IDs are used to support lock identification and
skipping to change at page 22, line 43 skipping to change at page 22, line 43
before a client ID is established, are those directly connected with before a client ID is established, are those directly connected with
establishing the client ID. establishing the client ID.
A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
operation using that client ID (eir_clientid as returned from operation using that client ID (eir_clientid as returned from
EXCHANGE_ID) is required to establish the identification on the EXCHANGE_ID) is required to establish the identification on the
server. Establishment of identification by a new incarnation of the server. Establishment of identification by a new incarnation of the
client also has the effect of immediately releasing any locking state client also has the effect of immediately releasing any locking state
that a previous incarnation of that same client might have had on the that a previous incarnation of that same client might have had on the
server. Such released state would include all lock, share server. Such released state would include all lock, share
reservation, and, where the server is not supporting the reservation, layout state, and where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with CLAIM_DELEGATE_PREV claim type, all delegation state associated with
the same client with the same identity. For discussion of delegation the same client with the same identity. For discussion of delegation
state recovery, see Section 9.2.1. state recovery, see Section 9.2.1. For discussion of layout state
recovery see Section 12.7.1.
Releasing such state requires that the server be able to determine Releasing such state requires that the server be able to determine
that one client instance is the successor of another. Where this that one client instance is the successor of another. Where this
cannot be done, for any of a number of reasons, the locking state cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5) will remain for a time subject to lease expiration (see Section 8.5)
and the new client will need to wait for such state to be removed, if and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests. it makes conflicting lock requests.
Client identification is encapsulated in the following Client Owner Client identification is encapsulated in the following Client Owner
structure: structure:
skipping to change at page 23, line 43 skipping to change at page 23, line 44
o The string should be selected so the subsequent incarnations (e.g. o The string should be selected so the subsequent incarnations (e.g.
reboots) of the same client cause the client to present the same reboots) of the same client cause the client to present the same
string. The implementor is cautioned against an approach that string. The implementor is cautioned against an approach that
requires the string to be recorded in a local file because this requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version there is no local disk and all file access is from an NFS version
4 server. 4 server.
o The string should be the same for each server network address that o The string should be the same for each server network address that
the client accesses, rather than common to all server network the client accesses, (note: the precise opposite was advised in
addresses (note: the precise opposite was advised in RFC3530). the NFSv4.0 specification [2]). This way, if a server has
This way, if a server has multiple interfaces, the client can multiple interfaces, the client can trunk traffic over multiple
trunk traffic over multiple network paths as described in network paths as described in Section 2.10.4.
Section 2.10.3.4.1.
o The algorithm for generating the string should not assume that the o The algorithm for generating the string should not assume that the
client's network address will not change, unless the client client's network address will not change, unless the client
implementation knows it is using statically assigned network implementation knows it is using statically assigned network
addresses. This includes changes between client incarnations and addresses. This includes changes between client incarnations and
even changes while the client is still running in its current even changes while the client is still running in its current
incarnation. This means that if the client includes just the incarnation. This means that if the client includes just the
client's network address in the co_ownerid string, there is a real client's network address in the co_ownerid string, there is a real
risk, with dynamic address assignment, that after the client gives risk, with dynamic address assignment, that after the client gives
up the network address, another client, using a similar algorithm up the network address, another client, using a similar algorithm
skipping to change at page 24, line 47 skipping to change at page 24, line 47
o For a user level NFS version 4 client, it should contain o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user additional information to distinguish the client from other user
level clients running on the same host, such as a process level clients running on the same host, such as a process
identifier or other unique sequence. identifier or other unique sequence.
As a security measure, the server MUST NOT cancel a client's leased As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given co_ownerid state if the principal that established the state for a given co_ownerid
string is not the same as the principal issuing the EXCHANGE_ID. string is not the same as the principal issuing the EXCHANGE_ID.
A server may compare an client_owner4 in a EXCHANGE_ID with an A server may compare a client_owner4 in an EXCHANGE_ID with an
nfs_client_id4 established using SETCLIENTID using NFSv4 minor nfs_client_id4 established using SETCLIENTID using NFSv4 minor
version 0, so that an NFSv4.1 client is not forced to delay until version 0, so that an NFSv4.1 client is not forced to delay until
lease expiration for locking state established by the earlier client lease expiration for locking state established by the earlier client
using minor version 0. This requires the client_owner4 be using minor version 0. This requires the client_owner4 be
constructed the same way as the nfs_client_id4. If the latter's constructed the same way as the nfs_client_id4. If the latter's
contents included the server's network address, and the NFSv4.1 contents included the server's network address, and the NFSv4.1
client does not wish to use a client ID that prevents trunking, it client does not wish to use a client ID that prevents trunking, it
should issue two EXCHANGE_ID operations. The first EXCHANGE_ID will should issue two EXCHANGE_ID operations. The first EXCHANGE_ID will
have a client_owner4 equal to the nfs_client_id4. This will clear have a client_owner4 equal to the nfs_client_id4. This will clear
the state created by the NFSv4.0 client. The second EXCHANGE_ID will the state created by the NFSv4.0 client. The second EXCHANGE_ID will
skipping to change at page 25, line 28 skipping to change at page 25, line 28
The shorthand client identifier (a client ID) is assigned by the The shorthand client identifier (a client ID) is assigned by the
server (the eir_clientid result from EXCHANGE_ID) and should be server (the eir_clientid result from EXCHANGE_ID) and should be
chosen so that it will not conflict with a client ID previously chosen so that it will not conflict with a client ID previously
assigned by the server. This applies across server restarts or assigned by the server. This applies across server restarts or
reboots. reboots.
In the event of a server restart, a client may find out that its In the event of a server restart, a client may find out that its
current client ID is no longer valid when it receives an current client ID is no longer valid when it receives an
NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
the characteristics of the sessions involved, specifically whether the characteristics of the sessions involved, specifically whether
the session is persistent (see Section 2.10.4.5). the session is persistent (see Section 2.10.5.5).
When a session is not persistent, the client will need to create a When a session is not persistent, the client will need to create a
new session. When the existing client ID is presented to a server as new session. When the existing client ID is presented to a server as
part of creating a session and that client ID is not recognized, as part of creating a session and that client ID is not recognized, as
would happen after a server reboot, the server will reject the would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID. When this happens, request with the error NFS4ERR_STALE_CLIENTID. When this happens,
the client must obtain a new client ID by use of the EXCHANGE_ID the client must obtain a new client ID by use of the EXCHANGE_ID
operation and then use that client ID as the basis of a operation and then use that client ID as the basis of a
new session and then proceed to any other necessary recovery for the new session and then proceed to any other necessary recovery for the
server reboot case (see Section 8.6.2). server reboot case (see Section 8.6.2).
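The recovery sequence described above can be sketched as follows. This is a minimal illustration only: ToyServer, establish_session, and the string status codes are hypothetical stand-ins invented for the example and are not part of the protocol or of any NFS implementation. Lock reclaim and other post-reboot recovery are omitted.

```python
import itertools

NFS4_OK = "NFS4_OK"
NFS4ERR_STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"


class ToyServer:
    """Hypothetical in-memory stand-in for a server (an assumption, not a real API)."""

    def __init__(self):
        # The counter is never reset, so a client ID handed out before a
        # restart is not reused afterwards.
        self._next_id = itertools.count(1)
        self.known_clientids = set()

    def exchange_id(self, co_ownerid):
        clientid = next(self._next_id)
        self.known_clientids.add(clientid)
        return clientid  # plays the role of eir_clientid

    def create_session(self, clientid):
        if clientid not in self.known_clientids:
            return NFS4ERR_STALE_CLIENTID, None
        return NFS4_OK, f"session-for-{clientid}"

    def reboot(self):
        # After a restart the server no longer recognizes old client IDs.
        self.known_clientids.clear()


def establish_session(server, co_ownerid, clientid=None):
    """Try the existing client ID first; fall back to EXCHANGE_ID if it is stale."""
    if clientid is not None:
        status, session = server.create_session(clientid)
        if status == NFS4_OK:
            return clientid, session
    # Either no client ID yet or the old one was rejected: obtain a new one
    # and use it as the basis of a new session.
    clientid = server.exchange_id(co_ownerid)
    status, session = server.create_session(clientid)
    if status != NFS4_OK:
        raise RuntimeError(status)
    return clientid, session
```

In this sketch a client that kept its old client ID across a simulated server restart ends up with a different client ID and a session built on it.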
skipping to change at page 26, line 32 skipping to change at page 26, line 32
active clients. If the client contacts the server after this active clients. If the client contacts the server after this
release, the server must ensure the client receives the appropriate release, the server must ensure the client receives the appropriate
error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to
establish a new identity. It should be clear that the server must be establish a new identity. It should be clear that the server must be
very hesitant to release a client ID since the resulting work on the very hesitant to release a client ID since the resulting work on the
client to recover from such an event will be the same burden as if client to recover from such an event will be the same burden as if
the server had failed and restarted. Typically a server would not the server had failed and restarted. Typically a server would not
release a client ID unless there had been no activity from that release a client ID unless there had been no activity from that
client for many minutes. As long as there are sessions, opens, client for many minutes. As long as there are sessions, opens,
locks, delegations, layouts, or wants, the server MUST NOT release locks, delegations, layouts, or wants, the server MUST NOT release
the client ID. See Section 2.10.8.1.4 for discussion on releasing the client ID. See Section 2.10.9.1.4 for discussion on releasing
inactive sessions. inactive sessions.
2.4.2. Handling Client Owner Conflicts 2.4.2. Resolving Client Owner Conflicts
If the co_ownerid string in an EXCHANGE_ID request is properly
constructed, and if the client takes care to use the same principal
for each successive use of EXCHANGE_ID, then, barring an active
denial of service attack, conflicts are not possible.
However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the co_ownerid string (such as the case of a
client that changes security flavors, and under the new flavor, there
is no mapping to the previous owner) will in rare cases result in a
conflict.
When the server gets a EXCHANGE_ID for a client owner that currently When the server gets an EXCHANGE_ID for a client owner that currently
has no state, or if it has state, but the lease has expired, the server has no state, or if it has state, but the lease has expired, the server
MUST allow the EXCHANGE_ID, and confirm the new client ID if followed MUST allow the EXCHANGE_ID, and confirm the new client ID if followed
by the appropriate CREATE_SESSION. by the appropriate CREATE_SESSION.
When the server gets a EXCHANGE_ID for a client owner that currently When the server gets an EXCHANGE_ID for a client owner that currently
has state, or an unexpired lease, and the principal that issues the has state and an unexpired lease, the server MUST NOT destroy any
EXCHANGE_ID is different than principal the previously established state that currently exists for the client owner unless one of the
the client owner, the server MUST not destroy the any state that following are true:
currently exists for client owner. Regardless, the server has two
choices. First, it can return NFS4ERR_CLID_INUSE. Second, it can o The principal that created the client ID for the client owner is
allow the EXCHANGE_ID, and simply treat the client owner as the same as the principal that is issuing the EXCHANGE_ID. Note
consisting of both the co_ownerid and the principal that issued the that if the client ID was created with SP4_MACH_CRED protection
EXCHANGE_ID. (Section 17.35), the principal MUST be based on RPCSEC_GSS
authentication, the RPCSEC_GSS service used MUST be integrity or
privacy, and the same GSS mechanism and principal must be used as
that used when the client ID was created.
o The client ID was established with SP4_SSV protection
(Section 17.35), and the client sends the EXCHANGE_ID with the
security flavor set to RPCSEC_GSS using the GSS SSV mechanism
(Section 2.10.7.4). Note that this is possible only if the server
and client persist the SSV.
o The client ID was established with SP4_SSV protection. Because
the SSV might not be persisted across client and server restart,
and because the first time a client issues EXCHANGE_ID to a server
it does not have an SSV, the client MAY issue the subsequent
EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with
SP4_MACH_CRED protection, the principal MUST be based on
RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be
integrity or privacy, and the same GSS mechanism and principal
must be used as that used when the client ID was created.
If none of the above situations apply, the server MUST return
NFS4ERR_CLID_INUSE.
Even if the server accepts the principal and co_ownerid as matching that
which created the client ID, it MUST NOT delete any state unless the
co_verifier in the EXCHANGE_ID does not match the co_verifier used
when the client ID was created. If the co_verifier matches, then the
client is either updating properties of the client ID, or possibly
attempting to use a trunking opportunity (Section 2.10.4).
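The rules of the revised Section 2.4.2 can be read as a small decision procedure. The sketch below is one interpretation under stated assumptions, not the specification's algorithm: the ClientRecord and ExchangeIdRequest types, their field names, and the returned labels (other than NFS4ERR_CLID_INUSE) are hypothetical. It also assumes that a client ID created with SP4_SSV protection is held to the same RPCSEC_GSS requirements as SP4_MACH_CRED when the SSV mechanism is not used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientRecord:
    """What the server remembers about a client owner (hypothetical layout)."""
    principal: str
    co_verifier: bytes
    protection: str                 # "SP4_NONE", "SP4_MACH_CRED" or "SP4_SSV"
    gss_mechanism: Optional[str]    # None if RPCSEC_GSS was not used
    has_state: bool = True
    lease_expired: bool = False


@dataclass
class ExchangeIdRequest:
    """The parts of an incoming EXCHANGE_ID that matter here (hypothetical layout)."""
    principal: str
    co_verifier: bytes
    gss_mechanism: Optional[str] = None
    gss_service: str = "none"       # "none", "integrity" or "privacy"
    uses_ssv_mechanism: bool = False


def principal_accepted(rec, req):
    # SP4_SSV client ID, request sent using the GSS SSV mechanism.
    if rec.protection == "SP4_SSV" and req.uses_ssv_mechanism:
        return True
    if req.principal != rec.principal:
        return False
    # SP4_MACH_CRED (and, by assumption, SP4_SSV without an SSV handle):
    # same GSS mechanism, with integrity or privacy service.
    if rec.protection in ("SP4_MACH_CRED", "SP4_SSV"):
        return (req.gss_mechanism is not None
                and req.gss_mechanism == rec.gss_mechanism
                and req.gss_service in ("integrity", "privacy"))
    return True


def decide_exchange_id(rec, req):
    # No state, or state whose lease has expired: the EXCHANGE_ID is allowed.
    if rec is None or not rec.has_state or rec.lease_expired:
        return "ALLOW"
    if not principal_accepted(rec, req):
        return "NFS4ERR_CLID_INUSE"
    # A matching verifier means an update or a trunking attempt: keep state.
    if req.co_verifier == rec.co_verifier:
        return "KEEP_STATE"
    return "MAY_DISCARD_STATE"
```

For example, a request from a different principal against a live record yields NFS4ERR_CLID_INUSE, while the original principal presenting a new co_verifier yields MAY_DISCARD_STATE.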
2.5. Server Owners 2.5. Server Owners
The Server Owner is somewhat similar to a Client Owner (Section 2.4), The Server Owner is somewhat similar to a Client Owner (Section 2.4),
but unlike the Client Owner, there is no shorthand serverid. The but unlike the Client Owner, there is no shorthand serverid. The
Server Owner is defined in the following structure: Server Owner is defined in the following structure:
struct server_owner4 { struct server_owner4 {
uint64_t so_minor_id; uint64_t so_minor_id;
opaque so_major_id<NFS4_OPAQUE_LIMIT>; opaque so_major_id<NFS4_OPAQUE_LIMIT>;
skipping to change at page 27, line 35 skipping to change at page 28, line 6
The Server Owner is returned in the results of EXCHANGE_ID. When the The Server Owner is returned in the results of EXCHANGE_ID. When the
so_major_id fields are the same in two EXCHANGE_ID results, the so_major_id fields are the same in two EXCHANGE_ID results, the
connections each EXCHANGE_ID are sent over can be assumed to address connections each EXCHANGE_ID are sent over can be assumed to address
the same Server (as defined in Section 1.5). If the so_minor_id the same Server (as defined in Section 1.5). If the so_minor_id
fields are also the same, then not only do both connections connect fields are also the same, then not only do both connections connect
to the same server, but the session and other state can be shared to the same server, but the session and other state can be shared
across both connections. The reader is cautioned that multiple across both connections. The reader is cautioned that multiple
servers may deliberately or accidentally claim to have the same servers may deliberately or accidentally claim to have the same
so_major_id or so_major_id/so_minor_id; the reader should examine so_major_id or so_major_id/so_minor_id; the reader should examine
Section 2.10.3.4.1 and Section 17.35. Section 2.10.4 and Section 17.35.
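The comparison of two Server Owners described above might be sketched as follows. The ServerOwner class and the returned labels are hypothetical and only mirror the two fields shown in the structure. As the text cautions, matching values are only a claim made by the servers, so a real client would verify that claim (see the referenced sections) before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOwner:
    """Mirrors the so_minor_id and so_major_id fields of server_owner4."""
    so_minor_id: int
    so_major_id: bytes


def compare_server_owners(a, b):
    """Classify two EXCHANGE_ID results by the server owners they returned."""
    if a.so_major_id != b.so_major_id:
        return "different servers"
    if a.so_minor_id != b.so_minor_id:
        return "same server"
    # Both fields match: session and other state can be shared across
    # the two connections (subject to verification of the claim).
    return "same server, state can be shared"
```

The function only reports what the two results claim; it performs no authentication of either server.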
The considerations for generating an so_major_id are similar to that The considerations for generating a so_major_id are similar to that
for generating a co_ownerid string (see Section 2.4). The for generating a co_ownerid string (see Section 2.4). The
consequences of two servers generating conflict so_major_id values consequences of two servers generating conflicting so_major_id values
are less dire than they are for co_ownerid conflicts because the are less dire than they are for co_ownerid conflicts because the
client can use RPCSEC_GSS to compare the authenticity of each server client can use RPCSEC_GSS to compare the authenticity of each server
(see Section 2.10.3.4.1). (see Section 2.10.4).
2.6. Security Service Negotiation 2.6. Security Service Negotiation
With the NFS version 4 server potentially offering multiple security With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its file system namespace NFS server may have multiple points within its file system namespace
that are available for use by NFS clients. These points can be that are available for use by NFS clients. These points can be
considered security policy boundaries, and in some NFS considered security policy boundaries, and in some NFS
implementations are tied to NFS export points. In turn the NFS implementations are tied to NFS export points. In turn the NFS
server may be configured such that each of these security policy server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use. boundaries may have different or multiple security mechanisms in use.
The security negotiation between client and server must be done with The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired. server to choose a lower level of security than required or desired.
See Section 20 for further discussion. See Section 20 for further discussion.
2.6.1. NFSv4 Security Tuples 2.6.1. NFSv4.1 Security Tuples
An NFS server can assign one or more "security tuples" to each An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace. Each security tuple security policy boundary in its namespace. Each security tuple
consists of a security flavor (see Section 2.2.1.1), and if the consists of a security flavor (see Section 2.2.1.1), and if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service. protection, and an RPCSEC_GSS service.
2.6.2. SECINFO and SECINFO_NO_NAME 2.6.2. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to The SECINFO and SECINFO_NO_NAME operations allow the client to
skipping to change at page 28, line 51 skipping to change at page 29, line 23
principals should be listed first. principals should be listed first.
2.6.3. Security Error 2.6.3. Security Error
Based on the assumption that each NFS version 4 client and server Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file
access to the server with one of the minimal security tuples. During access to the server with one of the minimal security tuples. During
communication with the server, the client may receive an NFS error of communication with the server, the client may receive an NFS error of
NFS4ERR_WRONGSEC. This error allows the server to notify the client NFS4ERR_WRONGSEC. This error allows the server to notify the client
that the security tuple currently being used is contravenes the that the security tuple currently being used contravenes the server's
server's security policy. The client is then responsible for security policy. The client is then responsible for determining (see
determining (see Section 2.6.3.1) what security tuples are available Section 2.6.3.1) what security tuples are available at the server and
at the server and choosing one which is appropriate for the client. choosing one which is appropriate for the client.
2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME 2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME
This section explains the mechanics of NFSv4.1 security This section explains the mechanics of NFSv4.1 security
negotiation. The term "put filehandle operation" refers to negotiation. The term "put filehandle operation" refers to
PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH. PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH.
2.6.3.1.1. Put Filehandle Operation + SAVEFH 2.6.3.1.1. Put Filehandle Operation + SAVEFH
The client is saving a filehandle for a future RESTOREFH. The server The client is saving a filehandle for a future RESTOREFH. The server
skipping to change at page 34, line 31 skipping to change at page 34, line 42
such features to not be mandatory complicates implementation of such features to not be mandatory complicates implementation of
the minor version. the minor version.
13. A client MUST NOT attempt to use a stateid, filehandle, or 13. A client MUST NOT attempt to use a stateid, filehandle, or
similar returned object from the COMPOUND procedure with minor similar returned object from the COMPOUND procedure with minor
version X for another COMPOUND procedure with minor version Y, version X for another COMPOUND procedure with minor version Y,
where X != Y. where X != Y.
2.8. Non-RPC-based Security Services 2.8. Non-RPC-based Security Services
As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for
identification, authentication, integrity, and privacy. NFSv4 itself identification, authentication, integrity, and privacy. NFSv4.1
provides additional security services as described in the next itself provides additional security services as described in the next
several subsections. several subsections.
2.8.1. Authorization 2.8.1. Authorization
Authorization to access a file object via an NFSv4 operation is Authorization to access a file object via an NFSv4.1 operation is
ultimately determined by the NFSv4 server. A client can predetermine ultimately determined by the NFSv4.1 server. A client can
its access to a file object via the OPEN (Section 17.16) and the predetermine its access to a file object via the OPEN (Section 17.16)
ACCESS (Section 17.1) operations. and the ACCESS (Section 17.1) operations.
Principals with appropriate access rights can modify the Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 17.30) authorization on a file object via the SETATTR (Section 17.30)
operation. Four attributes that affect access rights are: mode, operation. Four attributes that affect access rights are: mode,
owner, owner_group, and acl. See Section 5. owner, owner_group, and acl. See Section 5.
2.8.2. Auditing 2.8.2. Auditing
NFSv4 provides auditing on a per file object basis, via the ACL NFSv4.1 provides auditing on a per file object basis, via the ACL
attribute as described in Section 6. It is outside the scope of this attribute as described in Section 6. It is outside the scope of this
specification to specify audit log formats or management policies. specification to specify audit log formats or management policies.
2.8.3. Intrusion Detection 2.8.3. Intrusion Detection
NFSv4 provides alarm control on a per file object basis, via the ACL NFSv4.1 provides alarm control on a per file object basis, via the
attribute as described in Section 6. Alarms may serve as the basis ACL attribute as described in Section 6. Alarms may serve as the
for intrusion detection. It is outside the scope of this basis for intrusion detection. It is outside the scope of this
specification to specify heuristics for detecting intrusion via specification to specify heuristics for detecting intrusion via
alarms. alarms.
2.9. Transport Layers 2.9. Transport Layers
2.9.1. Required and Recommended Properties of Transports 2.9.1. Required and Recommended Properties of Transports
NFSv4 works over RDMA and non-RDMA-based transports with the NFSv4.1 works over RDMA and non-RDMA-based transports with the
following attributes: following attributes:
o The transport supports reliable delivery of data, which NFSv4 o The transport supports reliable delivery of data, which NFSv4.1
requires but neither NFSv4 nor RPC has facilities for ensuring. requires but neither NFSv4.1 nor RPC has facilities for ensuring.
[20] [24]
o The transport delivers data in the order it was sent. Ordered o The transport delivers data in the order it was sent. Ordered
delivery simplifies detection of transmit errors, and simplifies delivery simplifies detection of transmit errors, and simplifies
the sending of arbitrary sized requests and responses, via the the sending of arbitrary sized requests and responses, via the
record marking protocol [4]. record marking protocol [4].
Where an NFS version 4 implementation supports operation over the IP Where an NFS version 4 implementation supports operation over the IP
network protocol, any transport used between NFS and IP MUST be among network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols. At the the IETF-approved congestion control transport protocols. At the
time this document was written, the only two transports that had the time this document was written, the only two transports that had the
above attributes were TCP and SCTP. To enhance the possibilities for above attributes were TCP and SCTP. To enhance the possibilities for
interoperability, an NFS version 4 implementation MUST support interoperability, an NFS version 4 implementation MUST support
operation over the TCP transport protocol. operation over the TCP transport protocol.
Even if NFS version 4 is used over a non-IP network protocol, it is Even if NFS version 4 is used over a non-IP network protocol, it is
RECOMMENDED that the transport support congestion control. RECOMMENDED that the transport support congestion control.
It is permissible for a connectionless transport to be used under It is permissible for a connectionless transport to be used under
NFSv4.1, however reliable and in-order delivery of data by the NFSv4.1, however reliable and in-order delivery of data by the
connectionless transport is still required. NFSv4.1 assumes that a connectionless transport are still required. NFSv4.1 assumes that a
client transport address and server transport address used to send client transport address and server transport address used to send
data over a transport together constitute a connection, even if the data over a transport together constitute a connection, even if the
underlying transport eschews the concept of a connection. underlying transport eschews the concept of a connection.
2.9.2. Client and Server Transport Behavior 2.9.2. Client and Server Transport Behavior
If a connection-oriented transport (e.g. TCP) is used, the client and If a connection-oriented transport (e.g. TCP) is used, the client and
server SHOULD use long lived connections for at least three reasons: server SHOULD use long lived connections for at least three reasons:
1. This will prevent the weakening of the transport's congestion 1. This will prevent the weakening of the transport's congestion
control mechanisms via short lived connections. control mechanisms via short lived connections.
2. This will improve performance for the WAN environment by 2. This will improve performance for the WAN environment by
eliminating the need for connection setup handshakes. eliminating the need for connection setup handshakes.
3. The NFSv4.1 callback model differs from NFSv4.0, and requires the 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the
client and server to maintain a client-created channel (see client and server to maintain a client-created backchannel (see
Section 2.10.3.4for the server to use. Section 2.10.3.1) for the server to use.
In order to reduce congestion, if a connection-oriented transport is In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure, used, and the request is not the NULL procedure,
o A requester MUST NOT retry a request unless the connection the o A requester MUST NOT retry a request unless the connection the
request was issued over was disconnected before the reply was request was issued over was lost before the reply was received.
received.
o A replier MUST NOT silently drop a request, even if the request is o A replier MUST NOT silently drop a request, even if the request is
a retry. (The silent drop behavior of RPCSEC_GSS [5] does not a retry. (The silent drop behavior of RPCSEC_GSS [5] does not
apply because this behavior happens at the RPCSEC_GSS layer, a apply because this behavior happens at the RPCSEC_GSS layer, a
lower layer in the request processing). Instead, the replier lower layer in the request processing). Instead, the replier
SHOULD return an appropriate error (see Section 2.10.4.1) or it SHOULD return an appropriate error (see Section 2.10.5.1) or it
MAY disconnect the connection. MAY disconnect the connection.
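The requester-side retry rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function and parameter names are hypothetical and do not come from the draft, and a result of True means only that this particular rule does not prohibit the retry.

```python
def retry_permitted(connection_oriented: bool,
                    is_null_procedure: bool,
                    connection_lost_before_reply: bool) -> bool:
    """Return False when the retry rule quoted above forbids a retry.

    All names are hypothetical. The rule is scoped to connection-oriented
    transports and to requests other than the NULL procedure.
    """
    if not connection_oriented or is_null_procedure:
        # Outside the scope of the rule; the rule itself imposes no limit.
        return True
    # Within scope: a retry is allowed only if the connection the request
    # was issued over was lost before the reply was received.
    return connection_lost_before_reply
```

For example, a request still outstanding on a healthy TCP connection gives retry_permitted(True, False, False) == False, so the requester keeps waiting rather than retransmitting.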
When using RDMA transports there are other reasons for not tolerating When using RDMA transports there are other reasons for not tolerating
retries over the same connection: retries over the same connection:
o RDMA transports use "credits" to enforce flow control, where a o RDMA transports use "credits" to enforce flow control, where a
credit is a right to a peer to transmit a message. If one peer credit is a right to a peer to transmit a message. If one peer
were to retransmit a request (or reply), it would consume an were to retransmit a request (or reply), it would consume an
additional credit. If the replier retransmitted a reply, it would additional credit. If the replier retransmitted a reply, it would
certainly result in an RDMA connection loss, since the requester certainly result in an RDMA connection loss, since the requester
skipping to change at page 37, line 4 skipping to change at page 37, line 15
o RDMA credits present a new issue to the reply cache in NFSv4.1. o RDMA credits present a new issue to the reply cache in NFSv4.1.
The reply cache may be used when a connection within a session is The reply cache may be used when a connection within a session is
lost, such as after the client reconnects. Credit information is lost, such as after the client reconnects. Credit information is
a dynamic property of the RDMA connection, and stale values must a dynamic property of the RDMA connection, and stale values must
not be replayed from the cache. This implies that the reply cache not be replayed from the cache. This implies that the reply cache
contents must not be blindly used when replies are issued from it, contents must not be blindly used when replies are issued from it,
and credit information appropriate to the channel must be and credit information appropriate to the channel must be
refreshed by the RPC layer. refreshed by the RPC layer.
In addition, the NFSv4.1 requester is not allowed to stop waiting for In addition, the NFSv4.1 requester is not allowed to stop waiting for
a reply, as described in Section 2.10.4.2. a reply, as described in Section 2.10.5.2.
2.9.3. Ports 2.9.3. Ports
Historically, NFS version 2 and version 3 servers have resided on Historically, NFS version 2 and version 3 servers have listened over
port 2049. The registered port 2049 RFC3232 [21] for the NFS TCP port 2049. The registered port 2049 [25] for the NFS protocol
protocol should be the default configuration. NFSv4 clients SHOULD should be the default configuration. NFSv4.1 clients SHOULD NOT use
NOT use the RPC binding protocols as described in RFC1833 [22]. the RPC binding protocols as described in [26].
2.10. Session 2.10. Session
2.10.1. Motivation and Overview 2.10.1. Motivation and Overview
Previous versions and minor versions of NFS have suffered from the Previous versions and minor versions of NFS have suffered from the
following: following:
o Lack of support for exactly once semantics (EOS). This includes o Lack of support for exactly once semantics (EOS). This includes
lack of support for EOS through server failure and recovery. lack of support for EOS through server failure and recovery.
skipping to change at page 37, line 35 skipping to change at page 37, line 46
normal requests, and callbacks. normal requests, and callbacks.
o Limited trunking over multiple network paths. o Limited trunking over multiple network paths.
o Requiring machine credentials for fully secure operation. o Requiring machine credentials for fully secure operation.
Through the introduction of a session, NFSv4.1 addresses the above Through the introduction of a session, NFSv4.1 addresses the above
shortfalls with practical solutions: shortfalls with practical solutions:
o EOS is enabled by a reply cache with a bounded size, making it o EOS is enabled by a reply cache with a bounded size, making it
feasible to keep on persistent storage and enable EOS through feasible to keep the cache in persistent storage and enable EOS
server failure and recovery. One reason that previous revisions through server failure and recovery. One reason that previous
of NFS did not support EOS was because some EOS approaches often revisions of NFS did not support EOS was because some EOS
limited parallelism. As will be explained in Section 2.10.4), approaches often limited parallelism. As will be explained in
NFSv4.1 supports both EOS and unlimited parallelism. Section 2.10.5, NFSv4.1 supports both EOS and unlimited
parallelism.
o The NFSv4.1 client provides creates transport connections and o The NFSv4.1 client (defined in Section 1.5, Paragraph 1) creates
gives them to the server for sending callbacks, thus solving the transport connections and provides them to the server to use for
firewall issue (Section 17.34). Races between responses from sending callback requests, thus solving the firewall issue
client requests, and callbacks caused by the requests are detected (Section 17.34). Races between responses from client requests,
via the session's sequencing properties which are a byproduct of and callbacks caused by the requests are detected via the
EOS (Section 2.10.4.3). session's sequencing properties which are a consequence of EOS
(Section 2.10.5.3).
o The NFSv4.1 client can add an arbitrary number of connections to o The NFSv4.1 client can add an arbitrary number of connections to
the session, and thus provide trunking (Section 2.10.3.4.1). the session, and thus provide trunking (Section 2.10.4).
o The NFSv4.1 session produces a session key independent of client o The NFSv4.1 client and server produce a session key independent
and server machine credentials which can be used to compute a of client and server machine credentials which can be used to
digest for protecting key session management operations compute a digest for protecting critical session management
Section 2.10.6.3). operations (Section 2.10.7.3).
o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
use by the session's callback channel that do not require the use by the session's backchannel that do not require the server to
server to authenticate to a client machine principal authenticate to a client machine principal (Section 2.10.7.2).
(Section 2.10.6.2).
A session is a dynamically created, long-lived server object created A session is a dynamically created, long-lived server object created
by a client, used over time from one or more transport connections. by a client, used over time from one or more transport connections.
Its function is to maintain the server's state relative to the Its function is to maintain the server's state relative to the
connection(s) belonging to a client instance. This state is entirely connection(s) belonging to a client instance. This state is entirely
independent of the connection itself, and indeed the state exists independent of the connection itself, and indeed the state exists
whether the connection exists or not (though locks, delegations, etc. whether the connection exists or not. A client may have one or more
and generally expire in the extended absence of an open connection). sessions associated with it so that client-associated state may be
The session in effect becomes the object representing an active accessed using any of the sessions associated with that client's
client on a set of zero or more connections. client ID, when connections are associated with those sessions. When
no connections are associated for any of the sessions associated with
the client ID for an extended time such objects as locks, opens,
delegations, layouts, etc. are subject to expiration. The session
serves as an object representing a means of access by a client to the
associated client state on the server, independent of the physical
means of access to that state.
A single client may create multiple sessions. A single session MUST
NOT serve multiple clients.
2.10.2. NFSv4 Integration 2.10.2. NFSv4 Integration
Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
infrastructure change like sessions would require a new major version infrastructure change such as sessions would require a new major
number to an RPC program like NFS. However, because NFSv4 version number to an ONC RPC program like NFS. However, because
encapsulates its functionality in a single procedure, COMPOUND, and NFSv4 encapsulates its functionality in a single procedure, COMPOUND,
because COMPOUND can support an arbitrary number of operations, and because COMPOUND can support an arbitrary number of operations,
sessions are almost trivially added. COMPOUND includes a minor sessions have been added to NFSv4.1 with little difficulty. COMPOUND
version number field, and for NFSv4.1 this minor version is set to 1. includes a minor version number field, and for NFSv4.1 this minor
When the NFSv4 server processes a COMPOUND with the minor version set version is set to 1. When the NFSv4 server processes a COMPOUND with
to 1, it expects a different set of operations than it does for the minor version set to 1, it expects a different set of operations
NFSv4.0. One operation it expects is the SEQUENCE operation, which than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation,
is required for every COMPOUND that operates over an established which is required for every COMPOUND that operates over an
session. established session, with the exception of some session
administration operations, such as DESTROY_SESSION (Section 17.37).
2.10.2.1. SEQUENCE and CB_SEQUENCE 2.10.2.1. SEQUENCE and CB_SEQUENCE
In NFSv4.1, when the SEQUENCE operation is present, it is always the In NFSv4.1, when the SEQUENCE operation is present, it MUST be the
first operation in the COMPOUND procedure. The primary purpose of first operation in the COMPOUND procedure. The primary purpose of
SEQUENCE is to carry the session identifier. The session identifier SEQUENCE is to carry the session identifier. The session identifier
associates all other operations in the COMPOUND procedure with a associates all other operations in the COMPOUND procedure with a
particular session. SEQUENCE also contains required information for particular session. SEQUENCE also contains required information for
maintaining EOS (see Section 2.10.4). Session-enabled NFSv4.1 maintaining EOS (see Section 2.10.5). Session-enabled NFSv4.1
COMPOUND requests thus have the form: COMPOUND requests thus have the form:
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
| tag | minorversion | numops |SEQUENCE op | op + args | ... | tag | minorversion | numops |SEQUENCE op | op + args | ...
| | (== 1) | (limited) | + args | | | | (== 1) | (limited) | + args | |
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
and the reply's structure is: and the reply's structure is:
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
|last status | tag | numres |status + SEQUENCE op + results | // |last status | tag | numres |status + SEQUENCE op + results | //
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
//-----------------------+---- //-----------------------+----
// status + op + results | ... // status + op + results | ...
//-----------------------+---- //-----------------------+----
A CB_COMPOUND procedure request and reply has a similar form, but A CB_COMPOUND procedure request and reply has a similar form to
instead of a SEQUENCE operation, there is a CB_SEQUENCE operation, COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE
and there is an additional field called "callback_ident", which is operation. CB_COMPOUND also has an additional field called
superfluous in NFSv4.1. CB_SEQUENCE has the same information as "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored
SEQUENCE, but includes other information needed to solve callback by the client. CB_SEQUENCE has the same information as SEQUENCE, and
races (Section 2.10.4.3). also includes other information needed to resolve callback races
(Section 2.10.5.3).
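The placement rule for SEQUENCE in the newer text ("when the SEQUENCE operation is present, it MUST be the first operation") can be sketched as a check over the operation names of a COMPOUND. This is a sketch, not a definitive parser: the list-of-strings representation and the function name are assumptions made for illustration, and treating a second SEQUENCE later in the same COMPOUND as a violation is one reading of the rule.

```python
from typing import Sequence


def sequence_placement_ok(op_names: Sequence[str], seq_op: str = "SEQUENCE") -> bool:
    """Check that seq_op, if present, appears only as the first operation.

    op_names is a hypothetical flat list of operation names in request order.
    Pass seq_op="CB_SEQUENCE" to apply the same check to a CB_COMPOUND.
    """
    if seq_op not in op_names:
        # The rule only constrains COMPOUNDs that contain the operation.
        return True
    return op_names[0] == seq_op and seq_op not in op_names[1:]
```

A COMPOUND of ["SEQUENCE", "PUTFH", "READ"] passes, while ["PUTFH", "SEQUENCE", "READ"] does not.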
2.10.2.2. Client ID and Session Association 2.10.2.2. Client ID and Session Association
Sessions are subordinate to the client ID (Section 2.4). Each client Each client ID (Section 2.4) can have zero or more active sessions.
ID can have zero or more active sessions. A client ID, and a session A client ID, and a session associated with it are required to perform
bound to it are required to do anything useful in NFSv4.1. Each time file access in NFSv4.1. Each time a session is used (whether by a
a session is used, the state leased to its associated client ID is client sending a request to the server, or the client replying to a
automatically renewed. callback request from the server), the state leased to its associated
client ID is automatically renewed.
State such as share reservations, locks, delegations, and layouts State such as share reservations, locks, delegations, and layouts
(Section 1.4.4) is tied to the client ID, not the sessions of the (Section 1.4.4) is tied to the client ID. Client state is not tied
client ID. Successive state changing operations from a given state to the sessions of the client ID. Successive state changing
owner can go over different sessions, as long each session is operations from a given state owner MAY go over different sessions,
associated with the same client ID. Callbacks can arrive over a provided the session is associated with the same client ID. A
different session than the session that sent the operation the callback MAY arrive over a different session than from the session
acquired the state that the callback is for. For example, if session that originally acquired the state pertaining to the callback. For
A is used to acquire a delegation, a request to recall the delegation example, if session A is used to acquire a delegation, a request to
can arrive over session B. recall the delegation MAY arrive over session B if both sessions are
associated with the same client ID. Section 2.10.7.1 and
Section 2.10.7.2 discuss the security considerations around
callbacks.
2.10.3. Channels 2.10.3. Channels
Each session has one or two channels: the "operation" or "fore" A channel is not a connection. A channel represents the direction
channel used for ordinary requests from client to server, and the ONC RPC requests are sent to.
"back" channel, used for callback requests from server to client.
The session allocates resources for each channel, including separate
reply caches (see Section 2.10.4.1). These resources are for the
most part specified at time the session is created.
2.10.3.1. Operation Channel
The operation channel carries COMPOUND requests and responses. A
session always has an operation channel.
2.10.3.2. Backchannel Each session has one or two channels: the fore channel and the
backchannel. Because there are at most two channels per session, and
because each channel has a distinct purpose, channels are not
assigned identifiers.
The backchannel carries CB_COMPOUND requests and responses. Whether The fore channel is used for ordinary requests from the client to the
there is a backchannel or not is a decision of the client; NFSv4.1 server, and carries COMPOUND requests and responses. A session
servers MUST support backchannels. always has a fore channel.
2.10.3.3. Session and Channel Association The backchannel is used for callback requests from server to client, and
carries CB_COMPOUND requests and responses. Whether there is a
backchannel or not is a decision by the client, however many features
of NFSv4.1 require a backchannel. NFSv4.1 servers MUST support
backchannels.
Because there are at most two channels per session, and because each Each session has resources for each channel, including separate reply
channel has a distinct purpose, channels are not assigned caches (see Section 2.10.5.1). Note that even the backchannel
identifiers. The operation and backchannel are implicitly created requires a reply cache because some callback operations are
and associated when the session is created. nonidempotent.
2.10.3.4. Connection and Channel Association 2.10.3.1. Association of Connections, Channels, and Sessions
Each channel is associated with zero or more transport connections. Each channel is associated with zero or more transport connections.
A connection can be bound to one channel or both channels of a A connection can be associated with one channel or both channels of a
session; the client and server negotiate whether a connection will session; the client and server negotiate whether a connection will
carry traffic for one channel or both channels via the CREATE_SESSION carry traffic for one channel or both channels via the CREATE_SESSION
(Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34) (Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34)
operations. When a session is created via CREATE_SESSION, it is operations. When a session is created via CREATE_SESSION, the
automatically bound to the operation channel, and optionally the connection that transported the CREATE_SESSION request is
backchannel. If the client does not specify connecting binding automatically associated with the fore channel, and optionally the
enforcement when the session is created, then additional connections backchannel. If the client specifies no state protection
are automatically bound to the operation channel when the are used (Section 17.35) when the session is created, then when SEQUENCE is
with a SEQUENCE operation that has the session's sessionid. transmitted on a different connection, the connection is
automatically associated with the fore channel of the session
specified in the SEQUENCE operation.
A connection MAY be bound to the channels of other sessions. The A connection's association with a session is not exclusive. A
client decides, and the NFSv4.1 server MUST allow it. A connection connection associated with the channel(s) of one session may be
MAY be bound to the channels of other sessions of other clientids. simultaneously associated with the channel(s) of other sessions
Again, the client decides, and the server MUST allow it. including sessions associated with other client IDs.
It is permissible for connections of multiple types to be bound to It is permissible for connections of multiple transport types to be
the same channel. For example a TCP and RDMA connection can be bound associated with the same channel. For example both a TCP and RDMA
to the operation channel. In the event an RDMA and non-RDMA connection can be associated with the fore channel. In the event an
connection are bound to the same channel, the maximum number of slots RDMA and non-RDMA connection are associated with the same channel,
must be at least one more than the total number of credits. This way the maximum number of slots SHOULD be at least one more than the
if all RDMA credits are use, the non-RDMA connection can have at total number of credits (Section 2.10.5.1). This way if all RDMA
least one outstanding request. credits are used, the non-RDMA connection can have at least one
outstanding request. If a server supports multiple transport types,
it MUST allow a client to associate connections from each transport
to a channel.
It is permissible for a connection of one type to be bound to the It is permissible for a connection of one type of transport to be
operation channel, and another type bound to the backchannel. associated with the fore channel, and a connection of a different
type to be associated with the backchannel.
2.10.3.4.1. Trunking 2.10.4. Trunking
A client is allowed to issue EXCHANGE_ID multiple times to the same Trunking is the use of multiple connections between a client and
server. The client may be unaware that two different server network server in order to increase the speed of data transfer. NFSv4.1
addresses refer to the same server. The use of EXCHANGE_ID allows a supports two types of trunking: session trunking and client ID
client to become aware that an additional network address refers to a trunking. NFSv4.1 servers MUST support trunking.
server the client already has an established client ID and session
for. The eir_server_owner and eir_server_scope results from Session trunking is essentially the association of multiple
EXCHANGE_ID give a client a hint that the server it is connected to connections, each with a potentially different target network
may be the same as the server it is connected to via another address, to the same session.
connection. When EXCHANGE_ID is issued over two different
connections, and each return the same eir_server_owner.so_major_id Client ID trunking is the association of multiple sessions to the
and eir_server_scope, the client treats the connections as connected same client ID, major server owner ID (Section 2.5), and server scope
to the same server (subject to verification, as described later in (Section 10.6.7). When two servers return the same major server
this section (Paragraph 2), even if the destination network addresses owner and server scope it means the two servers are cooperating on
are different). As long two unrelated servers have not selected and locking state management which is a prerequisite for client ID
returned a conflicting pair of eir_major_id and eir_server_scope, or trunking.
unless the client has used different co_ownerid values in each
EXCHANGE_ID request, or the server has lost client ID state (e.g. the Understanding and distinguishing session and client ID trunking
server has rebooted) the server MUST return the same eir_clientid requires understanding how the results of the EXCHANGE_ID
result. Otherwise, the client and server use the common eir_clientid (Section 17.35) operation identify a server. Suppose a client issues
to identify the client. The eir_server_owner.so_minor_id field EXCHANGE_ID over two different connections each with a possibly
allows the server to control binding of connections to sessions. different target network address but each EXCHANGE_ID with the same
When two connections have a matching eir_server_scope, so_major_id value in the eia_clientowner field. If the same NFSv4.1 server is
and so_minor_id, the client may bind both connections to a common listening over each connection, then each EXCHANGE_ID result MUST
session; this is session trunking. When two connections have a return the same values of eir_clientid, eir_server_owner.so_major_id
matching so_major_id and eir_server_scope, but different so_minor_id, and eir_server_scope. The client can then treat each connection as
the client will need to create a new session for the client ID in referring to the same server (subject to verification, see
order to use the connection; this is client ID trunking. In either Paragraph 5 later in this section), and it can use each connection to
session or client ID trunking, the bandwidth capacity can scale with trunk requests and replies. The question is whether session trunking
the number of connections. and/or client ID trunking applies.
Session Trunking If the eia_clientowner argument is the same in two
different EXCHANGE_ID requests, and the eir_clientid,
eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and
eir_server_scope results match in both EXCHANGE_ID results, then
the client is permitted to perform session trunking. If the
client has no session mapping to the tuple of eir_clientid,
eir_server_owner.so_major_id, eir_server_scope,
eir_server_owner.so_minor_id, then it creates the session via a
CREATE_SESSION operation over one of the connections, which
associates the connection to the session. If there is a session
for the tuple, the client can issue BIND_CONN_TO_SESSION to
associate the connection to the session. The client can invoke
CREATE_SESSION regardless of whether there is a session for the tuple.
The second connection is associated with the same session as the
first connection via the BIND_CONN_TO_SESSION operation.
Client ID Trunking If the eia_clientowner argument is the same in
two different EXCHANGE_ID requests, and the eir_clientid,
eir_server_owner.so_major_id, and eir_server_scope results match
in both EXCHANGE_ID results, but the eir_server_owner.so_minor_id
results do not match then the client is permitted to perform
client ID trunking. The client can associate each connection with
different sessions, where each session is associated with the same
server. Of course, even if the eir_server_owner.so_minor_id
fields do match, the client is free to employ client ID trunking
instead of session trunking. The client completes the act of
client ID trunking by invoking CREATE_SESSION on each connection,
using the same client ID that was returned in eir_clientid. These
invocations create two sessions and also associate each connection
with each session.
When doing client ID trunking, locking state is shared across
sessions associated with the same client ID. This requires the
server to coordinate state across sessions.
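The two definitions above reduce to a comparison of EXCHANGE_ID arguments and results. The following is a minimal sketch of that comparison, assuming a hypothetical ExchangeIdResult record whose fields mirror the eir_* names in the text; the Python types chosen for the fields are illustrative only. It reports the strongest form of trunking the results permit and does not perform the verification step the draft describes separately.

```python
from dataclasses import dataclass
from enum import Enum


class Trunking(Enum):
    NONE = "none"
    CLIENT_ID = "client ID trunking"
    SESSION = "session trunking"


@dataclass(frozen=True)
class ExchangeIdResult:
    # Hypothetical container; field names follow the eir_* results in the text.
    eir_clientid: int
    so_major_id: bytes
    so_minor_id: int
    eir_server_scope: bytes


def permitted_trunking(owner_a: bytes, owner_b: bytes,
                       a: ExchangeIdResult, b: ExchangeIdResult) -> Trunking:
    """Classify two EXCHANGE_ID exchanges by the strongest trunking permitted.

    owner_a and owner_b stand for the eia_clientowner arguments of the two
    requests. The result is unverified: the server's claims may still need
    to be checked before traffic is trunked.
    """
    if owner_a != owner_b:
        return Trunking.NONE
    same_server = (a.eir_clientid == b.eir_clientid
                   and a.so_major_id == b.so_major_id
                   and a.eir_server_scope == b.eir_server_scope)
    if not same_server:
        return Trunking.NONE
    if a.so_minor_id == b.so_minor_id:
        # Session trunking is permitted; per the text, the client remains
        # free to use client ID trunking instead.
        return Trunking.SESSION
    return Trunking.CLIENT_ID
```

Returning Trunking.SESSION does not oblige the client to share a session; it only means both forms are available.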
When two servers over two connections claim matching or partially When two servers over two connections claim matching or partially
matching eir_server_owner, eir_server_scope, and eir_clientid values matching eir_server_owner, eir_server_scope, and eir_clientid values,
the client does not have to trust the servers' claims. The client the client does not have to trust the servers' claims. The client
may verify these claims before trunking traffic in the following may verify these claims before trunking traffic in the following
ways: ways:
o For session trunking, clients and servers can reliably verify if o For session trunking, clients SHOULD reliably verify if
connections between different network paths are in fact bound to connections between different network paths are in fact associated
the same NFSv4.1 server and usable on the same session. The with the same NFSv4.1 server and usable on the same session, and
SET_SSV (Section 17.47) operation allows a client and server to servers MUST allow clients to perform reliable verification. When
establish a unique, shared key value (the SSV). When a new a client ID is created, the client SHOULD specify that
connection is bound to the session (via the BIND_CONN_TO_SESSION BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or
operation, see Section 17.34), the client offers a digest that is SP4_MACH_CRED (Section 17.35) state protection options. For
based on the SSV. If the client mistakenly tries to bind a SP4_SSV, reliable verification depends on a shared secret (the
connection to a session of a wrong server, the server will either SSV) that is established via the SET_SSV (Section 17.47)
reject the attempt because it is not aware of the session operation.
identifier of the BIND_CONN_TO_SESSION arguments, or it will
reject the attempt because the digest for the SSV does not match
what the server expects. Even if the server mistakenly or
maliciously accepts the connection bind attempt, the digest it
computes in the response will not be verified by the client, the
client will know it cannot use the connection for trunking the
specified channel.
o In the case of client ID trunking, the client can use RPCSEC_GSS When a new connection is associated with the session (via the
to verify that each connection is aimed at the same server. When BIND_CONN_TO_SESSION operation, see Section 17.34), if the client
the client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each specified SP4_SSV state protection for the BIND_CONN_TO_SESSION
RPCSEC_GSS context over each connection has the same server operation, the client MUST issue the BIND_CONN_TO_SESSION with
principal, then -- barring a compromise of the server's GSS RPCSEC_GSS protection, using integrity or privacy, and a
credentials -- the servers at the end of each connection are the RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.7.4). If the
same. client mistakenly tries to associate a connection to a session of
a wrong server, the server will either reject the attempt because
it is not aware of the session identifier of the
BIND_CONN_TO_SESSION arguments, or it will reject the attempt
because the RPCSEC_GSS authentication fails. Even if the server
mistakenly or maliciously accepts the connection association
attempt, the RPCSEC_GSS verifier it computes in the response will
not be verified by the client, the client will know it cannot use
the connection for trunking the specified session.
2.10.4. Exactly Once Semantics If the client specified SP4_MACH_CRED state protection, the
BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or
privacy, using the same credential that was used when the client
ID was created. Mutual authentication via RPCSEC_GSS assures the
client that the connection is associated with the correct session
of the correct server.
o For client ID trunking, the client has at least two options for
verifying that the same client ID obtained from two different
EXCHANGE_ID operations came from the same server. The first
option is to use RPCSEC_GSS authentication when issuing each
EXCHANGE_ID. Each time an EXCHANGE_ID is issued with RPCSEC_GSS
authentication, the client notes the principal name of GSS target.
If the EXCHANGE_ID results indicate client ID trunking is
possible, and the GSS targets' principal names are the same, the
servers are the same and client ID trunking is allowed.
The second option for verification is to use SP4_SSV protection.
When the client issues EXCHANGE_ID it specifies SP4_SSV
protection. The first EXCHANGE_ID the client issues always has to
be confirmed by a CREATE_SESSION call. The client then issues
SET_SSV on the sessions. Later the client issues EXCHANGE_ID to a
second destination network address than the first EXCHANGE_ID was
issued with. The client checks that each EXCHANGE_ID reply has
the same eir_clientid, eir_server_owner.so_major_id, and
eir_server_scope. If so, the client verifies the claim by issuing
a CREATE_SESSION to the second destination address, protected with
RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the
second EXCHANGE_ID. If the server accepts the CREATE_SESSION
request, and if the client verifies the RPCSEC_GSS verifier and
integrity codes, then the client has proof the second server knows
the SSV, and thus the two servers are the same for the purposes of
client ID trunking.
2.10.5. Exactly Once Semantics
Via the session, NFSv4.1 offers exactly once semantics (EOS) for Via the session, NFSv4.1 offers exactly once semantics (EOS) for
requests sent over a channel. EOS is supported on both the operation requests sent over a channel. EOS is supported on both the fore and
and back channels. back channels.
Each COMPOUND or CB_COMPOUND request that is issued with a leading Each COMPOUND or CB_COMPOUND request that is issued with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement holds regardless of whether the request is exactly once. This requirement holds regardless of whether the request is
issued with reply caching specified (see Section 2.10.4.1.2). The issued with reply caching specified (see Section 2.10.5.1.2). The
requirement holds even if the requester is issuing the request over a requirement holds even if the requester is issuing the request over a
session created between a pNFS data client and pNFS data server. The session created between a pNFS data client and pNFS data server. The
rationale for this requirement is understood by categorizing requests rationale for this requirement is understood by categorizing requests
into three classifications: into three classifications:
o Nonidempotent requests. o Nonidempotent requests.
o Idempotent modifying requests. o Idempotent modifying requests.
o Idempotent non-modifying requests. o Idempotent non-modifying requests.
skipping to change at page 43, line 5 skipping to change at page 45, line 8
execution succeeds, the re-execution will fail. If the replier execution succeeds, the re-execution will fail. If the replier
returns the result from the re-execution, this result is incorrect. returns the result from the re-execution, this result is incorrect.
Therefore, EOS is required for nonidempotent requests. Therefore, EOS is required for nonidempotent requests.
An example of an idempotent modifying request is a COMPOUND request An example of an idempotent modifying request is a COMPOUND request
containing a WRITE operation. Repeated execution of the same WRITE containing a WRITE operation. Repeated execution of the same WRITE
has the same effect as execution of that write once. Nevertheless, has the same effect as execution of that write once. Nevertheless,
enforcing EOS for WRITEs and other idempotent modifying enforcing EOS for WRITEs and other idempotent modifying
requests is necessary to avoid data corruption. requests is necessary to avoid data corruption.
Suppose a client issues WRITEs A, B, C to a noncompliant server that Suppose a client issues WRITEs A, and B to a noncompliant server that
does not enforce EOS, and receives no response, perhaps due to a does not enforce EOS, and receives no response, perhaps due to a
network partition. The client reconnects to the server and re-issues network partition. The client reconnects to the server and re-issues
all three WRITEs. Now, the server has outstanding two instances of both WRITEs. Now, the server has outstanding two instances of each
each of A, B, and C. The server can be in a situation in which it of A and B. The server can be in a situation in which it executes and
executes and replies to the retries of A, B, and C while the first A, replies to the retries of A and B, while the first A and B are still
B, and C are still waiting around in the server's I/O system for some waiting in the server's I/O system for some resource. Upon receiving
resource. Upon receiving the replies to the second attempts of the replies to the second attempts of WRITEs A and B, the client
WRITEs A, B, and C, the client believes its writes are done so it is believes its writes are done so it is free to issue WRITE D which
free to do issue WRITE D which overlaps the range of one or more of overlaps the range of one or both of A and B. If A or B are
A, B, C. If any of A, B, or C are subsequently are executed for the subsequently executed for the second time, then what has been written
second time, then what has been written by D can be overwritten and by D can be overwritten and thus corrupted.
thus corrupted.
Note that it is not required the server cache the reply to the
modifying operation to avoid data corruption (but if the client
specified the reply to be cached, the server must cache it).
An example of an idempotent non-modifying request is a COMPOUND An example of an idempotent non-modifying request is a COMPOUND
containing SEQUENCE, PUTFH, READLINK and nothing else. The re- containing SEQUENCE, PUTFH, READLINK and nothing else. The re-
execution of such a request will not cause data corruption, or execution of such a request will not cause data corruption, or
produce an incorrect result. Nonetheless, for simplicity, the produce an incorrect result. Nonetheless, to keep the implementation
replier MUST enforce EOS for such requests. simple, the replier MUST enforce EOS for all requests whether
idempotent and non-modifying or not.
2.10.4.1. Slot Identifiers and Reply Cache Note that true and complete EOS is not possible unless the server
persists the reply cache in stable storage, unless the server is
somehow implemented to never require a restart (indeed if such a
server exists, the distinction between a reply cache kept in stable
storage versus one that is not is one without meaning). See
Section 2.10.5.5 for a discussion of persistence in the reply cache.
Regardless, even if the server does not persist the reply cache, EOS
improves robustness and correctness over previous versions of NFS
because the legacy duplicate request/reply caches were based on the
ONC RPC transaction identifier (XID). Section 2.10.5.1 explains the
shortcomings of the XID as a basis for a reply cache and describes
how NFSv4.1 sessions improve upon the XID.
The RPC layer provides a transaction ID (xid), which, while required 2.10.5.1. Slot Identifiers and Reply Cache
to be unique, is not especially convenient for tracking requests.
The xid is only meaningful to the requester it cannot be interpreted
at the replier except to test for equality with previously issued
requests. Because RPC operations may be completed by the replier in
any order, many transaction IDs may be outstanding at any time. The
requester may therefore perform a computationally expensive lookup
operation in the process of demultiplexing each reply.
In the NFSv4.1, there is a limit to the number of active requests. The RPC layer provides a transaction ID (XID), which, while required
This immediately enables a computationally efficient index for each to be unique, is not convenient for tracking requests for two
request which is designated as a Slot Identifier, or slotid. reasons. First, the XID is only meaningful to the requester; it
cannot be interpreted by the replier except to test for equality with
previously issued requests. When consulting an RPC-based duplicate
request cache, the opaqueness of the XID requires a computationally
expensive lookup (often via a hash that includes XID and source
address). NFSv4.1 requests use a non-opaque slot id which is an
index into a slot table, which is far more efficient. Second,
because RPC requests can be executed by the replier in any order,
there is no bound on the number of requests that may be outstanding
at any time. To achieve perfect EOS using ONC RPC would require
storing all replies in the reply cache. XIDs are 32 bits; storing
over four billion (2^32) replies in the reply cache is not practical.
In practice, previous versions of NFS have chosen to store a fixed
number of replies in the cache, and use a least recently used (LRU)
approach to replacing cache entries with new entries when the cache
is full. In NFSv4.1, the number of outstanding requests is bounded
by the size of the slot table, and a sequence id per slot is used to
tell the replier when it is safe to delete a cached reply.
When the requester issues a new request, it selects a slotid in the In the NFSv4.1 reply cache, when the requester issues a new request,
range 0..N-1, where N is the replier's current "outstanding requests" it selects a slot id in the range 0..N, where N is the replier's
limit granted to the requester on the session over which the request current maximum slot id granted to the requester on the session over
is to be issued. The value of N outstanding requests starts out as which the request is to be issued. The value of N starts out as
the value of ca_maxrequests (Section 17.36), but can be adjusted by equal to ca_maxrequests - 1 (Section 17.36), but can be adjusted by
the response to SEQUENCE or CB_SEQUENCE as described later in this the response to SEQUENCE or CB_SEQUENCE as described later in this
section. The slotid must be unused by any of the requests which the section. The slotid must be unused by any of the requests which the
requester has already active on the session. "Unused" here means the requester has already active on the session. "Unused" here means the
requester has no outstanding request for that slotid. Because the requester has no outstanding request for that slot id.
slot id is always an integer in the range 0..N-1, requester
implementations can use the slotid from a replier response to
efficiently match responses with outstanding requests, such as, for
example, by using the slotid to index into an outstanding request
array. This can be used to avoid expensive hashing and lookup
functions in the performance-critical receive path.
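A requester-side slot table that uses the slot id as a direct array index might look like the sketch below. The class and method names are hypothetical, and the table is sized from an assumed highest usable slot id rather than from any particular field of the protocol. It also picks the lowest unused slot for each new request.

```python
class RequesterSlotTable:
    """Hypothetical requester-side table of outstanding requests, indexed by slot id."""

    def __init__(self, highest_slotid: int):
        # One entry per usable slot id; None means the slot is unused.
        self.outstanding = [None] * (highest_slotid + 1)

    def acquire(self, request):
        """Assign the lowest unused slot to the request, or return None if all are busy."""
        for slotid, entry in enumerate(self.outstanding):
            if entry is None:
                self.outstanding[slotid] = request
                return slotid
        return None

    def complete(self, slotid: int):
        """Match a reply to its request by indexing with the slot id, then free the slot."""
        request = self.outstanding[slotid]
        self.outstanding[slotid] = None
        return request
```

Matching a reply to its request is a single array access, with no hashing on the receive path.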
The sequenceid, which accompanies the slotid in each request, is for A slot contains a sequence id and the cached reply corresponding to
an important check at the server: it must be able to be determined the request sent with that sequence id. The sequence id is a 32 bit
efficiently whether a request using a certain slotid is a retransmit unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 -
or a new, never-before-seen request. It is not feasible for the 1). The first time a slot is used, the requester must specify a
client to assert that it is retransmitting to implement this, because sequence id of one (1) (Section 17.36). Each time a slot is re-used,
for any given request the client cannot know the server has seen it the request MUST specify a sequence id that is one greater than that
of the previous request on the slot. If the previous sequence id was
0xFFFFFFFF, then the next request for the slot MUST have the sequence
id set to zero (i.e. (2^32 - 1) + 1 mod 2^32).
The sequence id accompanies the slot id in each request. It is for
the critical check at the server: it is used to efficiently determine
whether a request using a certain slot id is a retransmit or a new,
never-before-seen request. It is not feasible for the client to
assert that it is retransmitting to implement this, because for any
given request the client cannot know whether the server has seen it
unless the server actually replies. Of course, if the client has unless the server actually replies. Of course, if the client has
seen the server's reply, the client would not retransmit. seen the server's reply, the client would not retransmit.
The sequenceid MUST increase monotonically for each new transmit of a The replier compares each received request's sequence id with the
given slotid, and MUST remain unchanged for any retransmission. The last one previously received for that slot id, to see if the new
server must in turn compare each newly received request's sequenceid request is:
with the last one previously received for that slotid, to see if the
new request is:
o A new request, in which the sequenceid is one greater than that o A new request, in which the sequenceid is one greater than that
previously seen in the slot (accounting for sequence wraparound). previously seen in the slot (accounting for sequence wraparound).
The replier proceeds to execute the new request. The replier proceeds to execute the new request, and the replier
MUST increase the slot's sequence id by one.
o A retransmitted request, in which the sequenceid is equal to that o A retransmitted request, in which the sequenceid is equal to that
last seen in the slot. Note that this request may be either currently recorded in the slot. If the original request has
complete, or in progress. The replier performs replay processing executed to completion, the replier returns the cached reply. See
in these cases. Section 2.10.5.2 for direction on how the replier deals with
retries of requests that are still in progress.
o A misordered replay, in which the sequenceid is less than o A misordered retry, in which the sequence id is less than
(accounting for sequence wraparound) than that previously seen in (accounting for sequence wraparound) that previously seen in the
the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
result from SEQUENCE or CB_SEQUENCE). result from SEQUENCE or CB_SEQUENCE).
o A misordered new request, in which the sequenceid is two or more o A misordered new request, in which the sequenceid is two or more
greater than (accounting for sequence wraparound) that previously greater than (accounting for sequence wraparound) that previously
seen in the slot. Note that because the sequenceid must seen in the slot. Note that because the sequenceid must
wraparound one it reaches 0xFFFFFFFF, a misordered new request and wraparound to zero (0) once it reaches 0xFFFFFFFF, a misordered
a misordered replay cannot be distinguished. Thus, the replier new request and a misordered retry cannot be distinguished. Thus,
MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from
CB_SEQUENCE). SEQUENCE or CB_SEQUENCE).
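The classification in the list above reduces to two comparisons in 32-bit modular arithmetic. The following is a minimal sketch; the function name and the string labels are hypothetical. Because the sequence id wraps, a misordered retry and a misordered new request fall into the same case.

```python
SEQ_MASK = 0xFFFFFFFF  # sequence ids are 32-bit unsigned values


def classify(slot_seqid: int, request_seqid: int) -> str:
    """Classify a request against the sequence id currently recorded in its slot."""
    if request_seqid == slot_seqid:
        return "retry"
    if request_seqid == (slot_seqid + 1) & SEQ_MASK:
        # One greater, accounting for wraparound from 0xFFFFFFFF to 0.
        return "new"
    # Anything else cannot be told apart after wraparound, so both
    # misordered cases yield NFS4ERR_SEQ_MISORDERED.
    return "misordered"
```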
Unlike the XID, the slotid is always within a specific range; this Unlike the XID, the slotid is always within a specific range; this
has two implications. The first implication is that for a given has two implications. The first implication is that for a given
session, the replier need only cache the results of a limited number session, the replier need only cache the results of a limited number
of COMPOUND requests. The second implication derives from the first, of COMPOUND requests. The second implication derives from the
which is unlike XID-indexed reply caches (also know as duplicate first, which is unlike XID-indexed reply caches (also known as
request caches - DRCs), the slotid-based reply cache cannot be duplicate request caches - DRCs), the slot id-based reply cache
overflowed. Through use of the sequenceid to identify retransmitted cannot be overflowed. Through use of the sequence id to identify
requests, the replier does not need to actually cache the request retransmitted requests, the replier does not need to actually cache
itself, reducing the storage requirements of the reply cache further. the request itself, reducing the storage requirements of the reply
These new facilities makes it practical to maintain all the required cache further. These facilities make it practical to maintain all
entries for an effective reply cache. the required entries for an effective reply cache.
The slotid and sequenceid therefore take over the traditional role of The slot id and sequence id therefore take over the traditional role
the XID and port number in the replier reply cache implementation, of the XID and source network address in the replier's reply cache
and the session replaces the IP address. This approach is implementation. This approach is considerably more portable and
considerably more portable and completely robust - it is not subject completely robust - it is not subject to the reassignment of ports as
to the frequent reassignment of ports as clients reconnect over IP clients reconnect over IP networks. In addition, the RPC XID is not
networks. In addition, the RPC XID is not used in the reply cache, used in the reply cache, enhancing robustness of the cache in the
enhancing robustness of the cache in the face of any rapid reuse of face of any rapid reuse of XIDs by the requester. While the replier
XIDs by the client. [[Comment.3: We need to discuss the requirements does not care about the XID for the purposes of reply cache
of the client for changing the XID.]] management (but the replier MUST return the same XID that was in the
request), nonetheless there are considerations for the XID in NFSv4.1
that are the same as all other previous versions of NFS. The RPC XID
remains in each message and must be formulated in NFSv4.1 requests as
it is in any other ONC RPC request. The reasons include:
The slotid information is included in each request, without violating o The RPC layer retains its existing semantics and implementation.
the minor versioning rules of the NFSv4.0 specification, by encoding
it in the SEQUENCE operation within each NFSv4.1 COMPOUND and
CB_COMPOUND procedure. The operation easily piggybacks within
existing messages. [[Comment.4: Need a better term than piggyback]]
The receipt of a new sequenced request arriving on any valid slot is o The requester and replier must be able to interoperate at the RPC
an indication that the previous reply cache contents of that slot may layer, prior to the NFSv4.1 decoding of the SEQUENCE or
be discarded. CB_SEQUENCE operation.
o If an operation is being used that does not start with SEQUENCE or
CB_SEQUENCE (e.g. BIND_CONN_TO_SESSION), then the RPC XID is
needed for correct operation to match the reply to the request.
o The SEQUENCE or CB_SEQUENCE operation may generate an error. If
so, the embedded slot id, sequence id, and sessionid (if present)
in the request will not be in the reply, and the requester has
only the XID to match the reply to the request.
Given that well formulated XIDs continue to be required, this raises
the question of why SEQUENCE and CB_SEQUENCE replies have a sessionid,
slot id and sequence id. Having the sessionid in the reply means the
requester does not have to use the XID to look up the sessionid, which
would be necessary if the connection were associated with multiple
sessions. Having the slot id and sequence id in the reply means the
requester does not have to use the XID to look up the slot id and
sequence id. Furthermore, since the XID is only 32 bits, it is too
small to guarantee the re-association of a reply with its request
([27]); having sessionid, slot id, and sequence id in the reply
allows the client to validate that the reply in fact belongs to the
matched request.
The SEQUENCE (and CB_SEQUENCE) operation also carries a The SEQUENCE (and CB_SEQUENCE) operation also carries a
"highest_slotid" value which carries additional client slot usage "highest_slotid" value which carries additional requester slot usage
information. The requester must always provide a slotid representing information. The requester must always provide a slot id
the outstanding request with the highest-numbered slot value. The representing the outstanding request with the highest-numbered slot
requester should in all cases provide the most conservative value value. The requester should in all cases provide the most
possible, although it can be increased somewhat above the actual conservative value possible, although it can be increased somewhat
instantaneous usage to maintain some minimum or optimal level. This above the actual instantaneous usage to maintain some minimum or
provides a way for the requester to yield unused request slots back optimal level. This provides a way for the requester to yield unused
to the replier, which in turn can use the information to reallocate request slots back to the replier, which in turn can use the
resources. information to reallocate resources.
The replier responds with both a new target highest_slotid, and an The replier responds with both a new target highest_slotid, and an
enforced highest_slotid, described as follows: enforced highest_slotid, described as follows:
o The target highest_slotid is an indication to the requester of the o The target highest_slotid is an indication to the requester of the
highest_slotid the replier wishes the requester to be using. This highest_slotid the replier wishes the requester to be using. This
permits the replier to withdraw (or add) resources from a permits the replier to withdraw (or add) resources from a
requester that has been found to not be using them, in order to requester that has been found to not be using them, in order to
more fairly share resources among a varying level of demand from more fairly share resources among a varying level of demand from
other requesters. The requester must always comply with the other requesters. The requester must always comply with the
replier's value updates, since they indicate newly established replier's value updates, since they indicate newly established
hard limits on the requester's access to session resources. hard limits on the requester's access to session resources.
However, because of request pipelining, the requester may have However, because of request pipelining, the requester may have
active requests in flight reflecting prior values, therefore the active requests in flight reflecting prior values, therefore the
replier must not immediately require the requester to comply. replier must not immediately require the requester to comply.
o The enforced highest_slotid indicates the highest slotid the o The enforced highest_slotid indicates the highest slotid the
requester is permitted to use on a subsequent SEQUENCE or requester is permitted to use on a subsequent SEQUENCE or
CB_SEQUENCE operation. CB_SEQUENCE operation. The replier's enforced highest_slotid
SHOULD be no less than the highest_slotid the requester indicated
in the SEQUENCE or CB_SEQUENCE arguments.
The requester is required to use the lowest available slot when If a replier detects the client is being intransigent, i.e. it
issuing a new request. This way, the replier may be able to retire fails in a series of requests to honor the target highest_slotid
slot entries faster. However, where the replier is actively even though the replier knows there are no outstanding requests at
adjusting its granted maximum request count (i.e. the highest_slotid) higher slot ids, it MAY take more forceful action. When faced
to the requester, it will not be able to use just the receipt of with intransigence, the replier MAY reply with a new enforced
the slotid and highest_slotid in the request. Neither the slotid nor highest_slotid that is less than its previous enforced
the highest_slotid used in a request may reflect the replier's highest_slotid. Thereafter, if the requester continues to send
current idea of the requester's session limit, because the request requests with a highest_slotid that is greater than the replier's
may have been sent from the requester before the update was received. new enforced highest_slotid the server MAY return
Therefore, in the downward adjustment case, the replier may have to NFS4ERR_BAD_HIGHSLOT, unless the slot id in the request is greater
retain a number of reply cache entries at least as large as the old than the new enforced highest_slotid, and the request is a retry.
value of maximum requests outstanding, until operation sequencing
rules allow it to infer that the requester has seen its reply.
2.10.4.1.1. Errors from SEQUENCE and CB_SEQUENCE The replier SHOULD keep slots it wants to retire around until the
requester sends a request with a highest_slotid less than or equal
to the replier's new enforced highest_slotid. Also a request with
a slot that is higher than the new enforced highest_slotid can be
retired if the requester specifies a sequence id that is not equal
to what is in the slot's reply cache. In other words, once the
replier has forcibly lowered the enforced highest_slotid, the
requester is only allowed to send retries to the to-be-retired
slots.
o The requester SHOULD use the lowest available slot when issuing a
new request. This way, the replier may be able to retire slot
entries faster. However, where the replier is actively adjusting
its granted highest_slotid, it will not be able to use only
the receipt of the slot id and highest_slotid in the request.
Neither the slot id nor the highest_slotid used in a request may
reflect the replier's current idea of the requester's session
limit, because the request may have been sent from the requester
before the update was received. Therefore, in the downward
adjustment case, the replier may have to retain a number of reply
cache entries at least as large as the old value of maximum
requests outstanding, until operation sequencing rules allow it to
infer that the requester has seen its reply. [[Comment.3: What
are the rules?]]
2.10.5.1.1. Errors from SEQUENCE and CB_SEQUENCE
Any time SEQUENCE or CB_SEQUENCE return an error, the sequenceid of Any time SEQUENCE or CB_SEQUENCE return an error, the sequenceid of
the slot MUST NOT change. The replier MUST NOT modify the reply the slot MUST NOT change. The replier MUST NOT modify the reply
cache entry for the slot whenever an error is returned from SEQUENCE cache entry for the slot whenever an error is returned from SEQUENCE
or CB_SEQUENCE. or CB_SEQUENCE.
2.10.4.1.2. Optional Reply Caching 2.10.5.1.2. Optional Reply Caching
On a per-request basis the requester can choose to direct the replier On a per-request basis the requester can choose to direct the replier
to cache the reply to all operations after the first operation to cache the reply to all operations after the first operation
(SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis
fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it
would not direct the replier to cache the entire reply is that the would not direct the replier to cache the entire reply is that the
request is composed of all idempotent operations [20]. Caching the request is composed of all idempotent operations [24]. Caching the
reply may offer little benefit, and if the reply is too large (see reply may offer little benefit. If the reply is too large (see
Section 2.10.4.4), it may not be cacheable anyway. Section 2.10.5.4), it may not be cacheable anyway. Even if the reply
to an idempotent request is small enough to cache, unnecessarily caching
the reply slows down the server and increases RPC latency.
Whether the requester requests the reply to be cached or not has no Whether the requester requests the reply to be cached or not has no
effect on the slot processing. If the results of SEQUENCE or effect on the slot processing. If the results of SEQUENCE or
CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be
incremented by one. If a requester does not direct the replier to incremented by one. If a requester does not direct the replier to
cache, the reply, the replier MUST do one of the following: cache the reply, the replier MUST do one of the following:
o The replier can cache the entire original reply. Even though o The replier can cache the entire original reply. Even though
sa_cachethis or csa_cachethis are FALSE, the replier is always sa_cachethis or csa_cachethis are FALSE, the replier is always
free to cache. It may choose this approach in order to simplify free to cache. It may choose this approach in order to simplify
implementation. implementation.
o The replier enters into its reply cache a reply consisting of the o The replier enters into its reply cache a reply consisting of the
original results to the SEQUENCE or CB_SEQUENCE operation, original results to the SEQUENCE or CB_SEQUENCE operation, and
followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus if the with the next operation in COMPOUND or CB_COMPOUND having the
requester later retries the request, it will get error NFS4ERR_RETRY_UNCACHED_REP. Thus if the requester later
NFS4ERR_RETRY_UNCACHE_REP. retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP.
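The two choices open to the replier can be sketched as below. Results are modelled as plain strings in a list, which is an assumption made for illustration; the function name and the always_cache flag are hypothetical as well.

```python
def reply_to_cache(sequence_result, remaining_results, cachethis, always_cache=False):
    """Return the reply the replier stores in the slot for a completed request.

    sequence_result   -- result of the leading SEQUENCE or CB_SEQUENCE
    remaining_results -- results of the operations that followed it
    cachethis         -- value of sa_cachethis / csa_cachethis in the request
    always_cache      -- hypothetical replier policy: cache even when not asked to
    """
    if cachethis or always_cache:
        # First choice: keep the entire original reply.
        return [sequence_result] + list(remaining_results)
    # Second choice: keep only the SEQUENCE result, followed by an error
    # standing in for the next operation.
    return [sequence_result, "NFS4ERR_RETRY_UNCACHED_REP"]
```

A retry of an uncached request then receives the original SEQUENCE or CB_SEQUENCE results followed by NFS4ERR_RETRY_UNCACHED_REP.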
2.10.4.1.3. Multiple Connections and Sharing the Reply Cache 2.10.5.2. Retry and Replay of Reply
Multiple connections can be bound to a session's channel, hence the A requester MUST NOT retry a request, unless the connection it used
connections share the same table of slotids. For connections over to send the request disconnects. The requester can then reconnect
non-RDMA transports like TCP, there are no particular considerations. and resend the request, or it can resend the request over a different
Considerations for multiple RDMA connections sharing a slot table are connection that is associated with the same session.
discussed in Section 2.10.5.1. [[Comment.5: Also need to discuss
when RDMA and non-RDMA share a slot table.]]
2.10.4.2. Retry and Replay If the requester is a server wanting to resend a callback operation
over the backchannel of a session, the requester of course cannot
reconnect because only the client can associate connections with the
backchannel. The server can resend the request over another
connection that is bound to the same session's backchannel. If there
is no such connection, the server MUST indicate that the session has
no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag
bit in the response to the next SEQUENCE operation from the client.
The client MUST then associate a connection with the session (or
destroy the session).
A client MUST NOT retry a request, unless the connection it used to Note that it is not fatal for a client to retry without a disconnect
send the request disconnects. The client can then reconnect and between the request and retry. However the retry does consume
resend the request, or it can resend the request over a different resources, especially with RDMA, where each request, retry or not,
connection. In the case of the server resending over the consumes a credit. Retries for no reason, especially retries issued
backchannel, it cannot reconnect, and either resends the request over shortly after the previous attempt, are a poor use of network
another connection that the client has bound to the backchannel, or bandwidth and defeat the purpose of a transport's inherent congestion
if there is no other backchannel connection, waits for the client to control system.
bind a connection to the backchannel.
A client MUST wait for a reply to a request before using the slot for A client MUST wait for a reply to a request before using the slot for
another request. If it does not wait for a reply, then the client another request. If it does not wait for a reply, then the client
does not know what sequenceid to use for the slot on its next does not know what sequenceid to use for the slot on its next
request. For example, suppose a client sends a request with request. For example, suppose a client sends a request with sequence
sequenceid 1, and does not wait for the response. The next time it id 1, and does not wait for the response. The next time it uses the
uses the slot, it sends the new request with sequenceid 2. If the slot, it sends the new request with sequence id 2. If the server has
server has not seen the request with sequenceid 1, then the server is not seen the request with sequence id 1, then the server is not
expecting sequenceid 2, and rejects the client's new request with expecting sequenceid 2, and rejects the client's new request with
NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
RDMA fabrics do not guarantee that the memory handles (Steering Tags) RDMA fabrics do not guarantee that the memory handles (Steering Tags)
within each RDMA three-tuple are valid on a scope [[Comment.6: What within each RPC/RDMA "chunk" ([9]) are valid on a scope outside that
is a three-tuple?]] outside that of a single connection. Therefore, of a single connection. Therefore, handles used by the direct
handles used by the direct operations become invalid after connection operations become invalid after connection loss. The server must
loss. The server must ensure that any RDMA operations which must be ensure that any RDMA operations which must be replayed from the reply
replayed from the reply cache use the newly provided handle(s) from cache use the newly provided handle(s) from the most recent request.
the most recent request.
2.10.4.3. Resolving server callback races with sessions A retry might be issued while the original request is still in
progress on the replier. The replier SHOULD deal with the issue by
returning NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE
operation, but implementations MAY return NFS4ERR_SEQ_MISORDERED. Since
errors from SEQUENCE and CB_SEQUENCE are never recorded in the reply
cache, this approach allows the results of the execution of the
original request to be properly recorded in the reply cache (assuming
the requester specified the reply to be cached).
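The slot checks described above can be sketched roughly as follows. This is an illustrative model only, not text from either draft: the Slot class, the check_sequence function, and the "REPLAY" and "NEW" labels are hypothetical names, and sequence id wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Hypothetical per-slot reply cache state kept by the replier."""
    seqid: int = 0                      # sequence id of the last accepted request
    in_progress: bool = False           # True while that request is still executing
    cached_reply: Optional[str] = None  # reply stored for retries, if any


def check_sequence(slot: Slot, seqid: int) -> str:
    """Classify an incoming SEQUENCE/CB_SEQUENCE against the slot state."""
    if seqid == slot.seqid:
        # A retry of the last request on this slot.
        if slot.in_progress:
            # Original still executing: errors from SEQUENCE are never
            # cached, so the original result can still be recorded later.
            return "NFS4ERR_DELAY"
        return "REPLAY"                 # hypothetical label: answer from the cache
    if seqid == slot.seqid + 1:
        # The next request in order: accept it and advance the slot.
        slot.seqid = seqid
        slot.in_progress = True
        slot.cached_reply = None
        return "NEW"                    # hypothetical label: execute the request
    # Anything else is out of order for this slot.
    return "NFS4ERR_SEQ_MISORDERED"
```

In this sketch, a client that skipped sequence id 1 and sent sequence id 2 to a fresh slot would fall through to the final branch, matching the misordered case described earlier.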
2.10.5.3. Resolving Server Callback Races
It is possible for server callbacks to arrive at the client before It is possible for server callbacks to arrive at the client before
the reply from related forward channel operations. For example, a the reply from related fore channel operations. For example, a
client may have been granted a delegation to a file it has opened, client may have been granted a delegation to a file it has opened,
but the reply to the OPEN (informing the client of the granting of but the reply to the OPEN (informing the client of the granting of
the delegation) may be delayed in the network. If a conflicting the delegation) may be delayed in the network. If a conflicting
operation arrives at the server, it will recall the delegation using operation arrives at the server, it will recall the delegation using
the callback channel, which may be on a different transport the backchannel, which may be on a different transport connection,
connection, perhaps even a different network. In NFSv4.0, if the perhaps even a different network, or even a different session
callback request arrives before the related reply, the client may associated with the same client ID
reply to the server with an error.
The presence of a session between client and server alleviates this The presence of a session between client and server alleviates this
issue. When a session is in place, each client request is uniquely issue. When a session is in place, each client request is uniquely
identified by its { slotid, sequenceid } pair. By the rules under identified by its { sessionid, slot id, sequence id } triple. By the
which slot entries (reply cache entries) are retired, the server has rules under which slot entries (reply cache entries) are retired, the
knowledge whether the client has "seen" each of the server's replies. server has knowledge whether the client has "seen" each of the
The server can therefore provide sufficient information to the client server's replies. The server can therefore provide sufficient
to allow it to disambiguate between an erroneous or conflicting information to the client to allow it to disambiguate between an
callback and a race condition. erroneous or conflicting callback and a race condition.
For each client operation which might result in some sort of server For each client operation which might result in some sort of server
callback, the server should "remember" the { slotid, sequenceid } callback, the server SHOULD "remember" the { sessionid, slot id,
pair of the client request until the slotid retirement rules allow sequence id } triple of the client request until the slot id
the server to determine that the client has, in fact, seen the retirement rules allow the server to determine that the client has,
server's reply. Until the time the { slotid, sequenceid } request in fact, seen the server's reply. Until the time the { sessionid,
pair can be retired, any recalls of the associated object MUST carry slot id, sequence id } request triple can be retired, any recalls of
an array of these referring identifiers (in the CB_SEQUENCE the associated object MUST carry an array of these referring
operation's arguments), for the benefit of the client. After this identifiers (in the CB_SEQUENCE operation's arguments), for the
time, it is not necessary for the server to provide this information benefit of the client. After this time, it is not necessary for the
in related callbacks, since it is certain that a race condition can server to provide this information in related callbacks, since it is
no longer occur. certain that a race condition can no longer occur.
The CB_SEQUENCE operation which begins each server callback carries a The CB_SEQUENCE operation which begins each server callback carries a
list of "referring" { slotid, sequenceid } tuples. If the client list of "referring" { sessionid, slot id, sequence id } triples. If
finds the request corresponding to the referring slotid and sequenced the client finds the request corresponding to the referring
id be currently outstanding (i.e. the server's reply has not been sessionid, slot id and sequence id to be currently outstanding (i.e.
seen by the client), it can determine that the callback has raced the the server's reply has not been seen by the client), it can determine
reply, and act accordingly. that the callback has raced the reply, and act accordingly. If the
client does not find the request corresponding to the referring triple
to be outstanding (including the case of a sessionid referring to a
destroyed session), then there is no race with respect to this
triple. The server SHOULD limit the referring triples to those of
requests that apply to the objects referred to in the CB_COMPOUND
procedure.
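A minimal sketch of the client-side race check described above, assuming the client tracks its outstanding requests as a set of { sessionid, slot id, sequence id } triples. The Triple alias and the callback_raced_reply function are hypothetical names introduced for illustration.

```python
from typing import Iterable, Set, Tuple

# (sessionid, slot id, sequence id); the concrete types are an assumption.
Triple = Tuple[bytes, int, int]


def callback_raced_reply(referring: Iterable[Triple],
                         outstanding: Set[Triple]) -> bool:
    """Return True if any referring triple names a request whose reply
    the client has not yet seen, i.e. the callback raced the reply."""
    return any(triple in outstanding for triple in referring)
```

A triple whose sessionid names a destroyed session will simply not appear in the outstanding set, so it yields no race, consistent with the text.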
The client must not simply wait forever for the expected server reply The client must not simply wait forever for the expected server reply
to arrive on any of the session's operations channels, because it is to arrive before responding to the CB_COMPOUND that won the race,
possible that they will be delayed indefinitely. However, it should because it is possible that it will be delayed indefinitely. The
wait for a period of time, and if the time expires it can provide a client should assume the likely case that the reply will arrive
more meaningful error such as NFS4ERR_DELAY. within the average round trip time for COMPOUND requests to the
server, and wait that period of time. If that period of time expires, it
[[Comment.7: We need to consider the clients' options here, and can respond to the CB_COMPOUND with NFS4ERR_DELAY.
describe them... NFS4ERR_DELAY has been discussed as a legal reply
to CB_RECALL?]]
There are other scenarios under which callbacks may race replies, There are other scenarios under which callbacks may race replies,
among them pNFS layout recalls, described in Section 12.5.4.2 among them pNFS layout recalls, described in Section 12.5.4.2.
[[Comment.8: fill in the blanks w/others, etc...]]
2.10.4.4. COMPOUND and CB_COMPOUND Construction Issues 2.10.5.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues. When the issues (especially with RDMA) and reply cache issues. When the
session is created, (Section 17.36) the client and server negotiate session is created, (Section 17.36), for each channel (fore and
the maximum sized request they will send or process back), the client and server negotiate the maximum sized request they
(ca_maxrequestsize), the maximum sized reply they will return or will send or process (ca_maxrequestsize), the maximum sized reply
process (ca_maxresponsesize), and the maximum sized reply they will they will return or process (ca_maxresponsesize), and the maximum
store in the reply cache (ca_maxresponsesize_cached). sized reply they will store in the reply cache
(ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the If a request exceeds ca_maxrequestsize, the reply will have the
status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG
as the status for first operation (SEQUENCE or CB_SEQUENCE) in the as the status for first operation (SEQUENCE or CB_SEQUENCE) in the
request, or it may choose to return it on a subsequent operation. request (which means no operations in the request executed, and the
state of the slot in the reply cache is unchanged), or it MAY choose
to return it on a subsequent operation in the same COMPOUND or
CB_COMPOUND request (which means at least one operation did execute
and the state of the slot in the reply cache does change). The replier
SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds
ca_maxrequestsize.
If a reply exceeds ca_maxresponsesize, the reply will have the status If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the
status for first operation (SEQUENCE or CB_SEQUENCE) in the request, status for first operation (SEQUENCE or CB_SEQUENCE) in the request,
or it may choose to return it on a subsequent operation. or it MAY choose to return it on a subsequent operation (in the same
COMPOUND or CB_COMPOUND reply). A replier MAY return
NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if
the response would still exceed ca_maxresponsesize.
If sa_cachethis or csa_cachethis are TRUE, then the replier MUST If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
cache a reply except if an error is returned by the SEQUENCE or cache a reply except if an error is returned by the SEQUENCE or
CB_SEQUENCE operation (see Section 2.10.4.1.1). If the reply exceeds CB_SEQUENCE operation (see Section 2.10.5.1.1). If the reply exceeds
ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are
TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even
if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter)
is returned on an operation other than the first operation (SEQUENCE or is returned on an operation other than the first operation (SEQUENCE or
CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or
csa_cachethis are TRUE. For example, if a COMPOUND has eleven csa_cachethis are TRUE. For example, if a COMPOUND has eleven
operations, including SEQUENCE, the fifth operation is a RENAME, and operations, including SEQUENCE, the fifth operation is a RENAME, and
the tenth operation is a READ for one million bytes, server may the tenth operation is a READ for one million octets, the server may
return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since
the server executed several operations, especially the non-idempotent the server executed several operations, especially the non-idempotent
RENAME, the client's request to cache the reply needs to be honored RENAME, the client's request to cache the reply needs to be honored
in order for correct operation of exactly once semantics. If the in order for correct operation of exactly once semantics. If the
client retries the request, the server will have cached a reply that client retries the request, the server will have cached a reply that
contains results for ten of the eleven requested operations, with the contains results for ten of the eleven requested operations, with the
tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.
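The three negotiated limits can be illustrated with a small sketch. It is not a definitive implementation: the size_status function and the order in which the limits are checked are assumptions, and a real replier evaluates these limits per operation while building the reply rather than once per request.

```python
def size_status(request_size: int, reply_size: int, cache_this: bool,
                ca_maxrequestsize: int, ca_maxresponsesize: int,
                ca_maxresponsesize_cached: int) -> str:
    """Pick the status implied by the negotiated channel limits."""
    if request_size > ca_maxrequestsize:
        return "NFS4ERR_REQ_TOO_BIG"
    if reply_size > ca_maxresponsesize:
        return "NFS4ERR_REP_TOO_BIG"
    # The cached limit only matters when the requester asked for caching
    # (sa_cachethis or csa_cachethis is TRUE).
    if cache_this and reply_size > ca_maxresponsesize_cached:
        return "NFS4ERR_REP_TOO_BIG_TO_CACHE"
    return "NFS4_OK"
```

For the one-million-octet READ in the example above, a call with a reply size that fits ca_maxresponsesize but exceeds ca_maxresponsesize_cached, and cache_this set, would yield NFS4ERR_REP_TOO_BIG_TO_CACHE.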
A client needs to take care that when sending operations that change A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH) the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)
that it not exceed the maximum reply buffer before the GETFH that it not exceed the maximum reply buffer before the GETFH
operation. Otherwise the client will have to retry the operation operation. Otherwise the client will have to retry the operation
that changed the current filehandle, in order obtain the desired that changed the current filehandle, in order to obtain the desired
filehandle. For the OPEN operation (see Section 17.16), retry is not filehandle. For the OPEN operation (see Section 17.16), retry is not
always available as an option. The following guidelines for the always available as an option. The following guidelines for the
handling of filehandle changing operations are advised: handling of filehandle changing operations are advised:
o A client SHOULD issue GETFH immediately after a current filehandle o Within the same COMPOUND procedure, a client SHOULD issue GETFH
changing operation. This is especially important after any immediately after a current filehandle changing operation. A
current filehandle changing non-idempotent operation. It is client MUST issue GETFH after a current filehandle change
critical to issue GETFH immediately after OPEN. operation that is also non-idempotent (for example, the OPEN
operation).
o A server MAY return NFS4ERR_REP_TOO_BIG or o A server MAY return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
filehandle changing operation if the reply would be too large on filehandle changing operation if the reply would be too large on
the next operation. the next operation.
o A server SHOULD return NFS4ERR_REP_TOO_BIG or o A server SHOULD return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
filehandle changing non-idempotent operation if the reply would be filehandle changing non-idempotent operation if the reply would be
too large on the next operation, especially if the operation is too large on the next operation, especially if the operation is
OPEN. OPEN.
o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the o A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent
next operation after a non-idempotent current filehandle changing current filehandle change operation, if it looks at the next
operation, and finds it is not GETFH. The server would do this if operation (in the same COMPOUND procedure) and finds it is not
it is unable to determine in advance whether the total response GETFH. The server SHOULD do this if it is unable to determine in
size would exceed ca_maxresponsesize_cached or ca_maxresponsesize. advance whether the total response size would exceed
ca_maxresponsesize_cached or ca_maxresponsesize.
2.10.4.5. Persistence 2.10.5.5. Persistence
Since the reply cache is bounded, it is practical for the server Since the reply cache is bounded, it is practical for the reply cache
reply cache to persist across server reboots, and to be kept in to persist across server restarts. The replier MUST persist the
stable storage (a client's reply cache for callbacks need not persist following information if it agreed to persist the session (when the
across client reboots unless the client intends for its session and session was created; see Section 17.36):
other state to persist across reboots).
o The sessionid.
o The slot table including the sequenceid and cached reply for each o The slot table including the sequenceid and cached reply for each
slot. slot.
o The sessionid. The above are sufficient for a replier to provide EOS semantics for
any requests that were sent and executed before the server restarted.
If the replier is a client then there is no need for it to persist
any more information, unless the client will be persisting all other
state across client restart, in which case the server will never
see any NFSv4.1-level protocol manifestation of a client restart. If
the replier is a server, with just the slot table and sessionid
persisting, any requests the client retries after the server restart
will return the results that are cached in the reply cache, and any
new requests (i.e. the sequence id is one (1) greater than the slot's
sequence id) MUST be rejected with NFS4ERR_DEADSESSION (returned by
SEQUENCE). Such a session is considered dead. A server MAY re-
animate a session after a server restart so that the session will
accept new requests as well as retries. To re-animate a session, the
server needs to persist additional information through server
restart:
o The client ID. o The client ID. This is a prerequisite to let the client create
more sessions associated with the same client ID as the
o The SSV (see Section 2.10.6.3). o The client ID's sequenceid that is used for creating sessions (see
Section 17.35 and Section 17.36). This is a prerequisite to let
the client create more sessions.
The CREATE_SESSION (see Section 17.36 operation determines the o The principal that created the client ID. This allows the server
persistence of the reply cache. to authenticate the client when it issues EXCHANGE_ID.
2.10.5. RDMA Considerations o The SSV, if SP4_SSV state protection was specified when the client
ID was created (see Section 17.35). This lets the client create
new sessions, and associate connections with the new and existing
sessions.
A complete discussion of the operation of RPC-based protocols atop o The properties of the client ID as defined in Section 17.35.
RDMA transports is in [RPCRDMA]. A discussion of the operation of
NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is
considered, this specification assumes the use of such a layering; it
addresses only the upper layer issues relevant to making best use of
RPC/RDMA.
2.10.5.1. RDMA Connection Resources A persistent reply cache places certain demands on the server. The
execution of the sequence of operations (starting with SEQUENCE) and
placement of its results in the persistent cache MUST be atomic. If
a client retries a sequence of operations that was previously
executed on the server, the only acceptable outcomes are either the
original cached reply or an indication that the client ID or session has
been lost (indicating a catastrophic loss of the reply cache or a
session that has been deleted because the client failed to use the
session for an extended period of time).
A server could fail and restart in the middle of a COMPOUND procedure
that contains one or more non-idempotent or idempotent-but-modifying
operations. This creates an even greater challenge for atomic
execution and placement of results in the reply cache. One way to
view the problem is as a single transaction consisting of each
operation in the COMPOUND followed by storing the result in
persistent storage, then finally a transaction commit. If there is a
failure before the transaction is committed, then the server rolls
back the transaction. If the server itself fails, then when it restarts,
its recovery logic could roll back the transaction before starting
the NFSv4.1 server.
While the description of the implementation for atomic execution of
the request and caching of the reply is beyond the scope of this
document, an example implementation for NFS version 2 is described in
[28].
2.10.6. RDMA Considerations
A complete discussion of the operation of RPC-based protocols over
RDMA transports is in [9]. A discussion of the operation of NFSv4,
including NFSv4.1, over RDMA is in [10]. Where RDMA is considered,
this specification assumes the use of such a layering; it addresses
only the upper layer issues relevant to making best use of RPC/RDMA.
2.10.6.1. RDMA Connection Resources
RDMA requires its consumers to register memory and post buffers of a RDMA requires its consumers to register memory and post buffers of a
specific size and number for receive operations. specific size and number for receive operations.
Registration of memory can be a relatively high-overhead operation, Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes (e.g. since it requires pinning of buffers, assignment of attributes (e.g.
readable/writable), and initialization of hardware translation. readable/writable), and initialization of hardware translation.
Preregistration is desirable to reduce overhead. These registrations Preregistration is desirable to reduce overhead. These registrations
are specific to hardware interfaces and even to RDMA connection are specific to hardware interfaces and even to RDMA connection
endpoints, therefore negotiation of their limits is desirable to endpoints, therefore negotiation of their limits is desirable to
manage resources effectively. manage resources effectively.
Following the basic registration, these buffers must be posted by the Following basic registration, these buffers must be posted by the RPC
RPC layer to handle receives. These buffers remain in use by the layer to handle receives. These buffers remain in use by the RPC/
RPC/NFSv4 implementation; the size and number of them must be known NFSv4.1 implementation; the size and number of them must be known to
to the remote peer in order to avoid RDMA errors which would cause a the remote peer in order to avoid RDMA errors which would cause a
fatal error on the RDMA connection. fatal error on the RDMA connection.
NFSv4.1 manages slots as resources on a per session basis (see NFSv4.1 manages slots as resources on a per session basis (see
Section 2.10), while RDMA connections manage credits on a per Section 2.10), while RDMA connections manage credits on a per
connection basis. This means that in order for a peer to send data connection basis. This means that in order for a peer to send data
over RDMA to a remote buffer, it has to have both an NFSv4.1 slot, over RDMA to a remote buffer, it has to have both an NFSv4.1 slot,
and an RDMA credit. and an RDMA credit. If multiple RDMA connections are associated with
a session, then if the total number of credits across all RDMA
connections associated with the session is X, and the number of slots in
the session is Y, then the maximum number of outstanding requests is
the lesser of X and Y.
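The bound stated above is simple arithmetic, sketched here for concreteness. The function name max_outstanding_requests and its parameters are hypothetical, introduced only for illustration.

```python
def max_outstanding_requests(credits_per_connection, slots):
    """X is the total RDMA credits over all connections associated with
    the session, Y is the number of slots; the bound is min(X, Y)."""
    total_credits = sum(credits_per_connection)
    return min(total_credits, slots)
```

For instance, two connections with 8 credits each and a 12-slot session are limited by the slots, while two connections with 4 credits each are limited by the credits.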
2.10.5.2. Flow Control 2.10.6.2. Flow Control
NFSv4.0 and all previous versions do not provide for any form of flow Previous versions of NFS do not provide flow control; instead they
control; instead they rely on the windowing provided by transports rely on the windowing provided by transports like TCP to throttle
like TCP to throttle requests. This does not work with RDMA, which requests. This does not work with RDMA, which provides no operation
provides no operation flow control and will terminate a connection in flow control and will terminate a connection in error when limits are
error when limits are exceeded. Limits such as maximum number of exceeded. Limits such as maximum number of requests outstanding are
requests outstanding are therefore negotiated when a session is therefore negotiated when a session is created (see the
created (see the ca_maxrequests field in Section 17.36). These ca_maxrequests field in Section 17.36). These limits then provide
limits then provide the maxima each session's channels' connections the maxima which each connection associated with the session's
must operate within. RDMA connections are managed within these channel(s) must remain within. RDMA connections are managed within
limits as described in section 3.3 of [RPCRDMA]; if there are these limits as described in section 3.3 ("Flow Control"[[Comment.4:
multiple RDMA connections, then the maximum requests for a channel RFC Editor: please verify section and title of the RPCRDMA
will be divided among the RDMA connections. The limits may also be document]]) of [9]; if there are multiple RDMA connections, then the
modified dynamically at the server's choosing by manipulating certain maximum number of requests for a channel will be divided among the
parameters present in each NFSv4.1 request. In addition, the RDMA connections. Put a different way, the onus is on the replier to
CB_RECALL_SLOT callback operation (see Section 19.8 can be issued by ensure that the total number of RDMA credits across all connections
a server to a client to return RDMA credits to the server, thereby associated with the replier's channel does not exceed the channel's
lowering the maximum number of requests a client can have outstanding maximum number of outstanding requests.
to the server.
2.10.5.3. Padding The limits may also be modified dynamically at the replier's choosing
by manipulating certain parameters present in each NFSv4.1 reply. In
addition, the CB_RECALL_SLOT callback operation (see Section 19.8)
can be issued by a server to a client to return RDMA credits to the
server, thereby lowering the maximum number of requests a client can
have outstanding to the server.
2.10.6.3. Padding
Header padding is requested by each peer at session initiation (see Header padding is requested by each peer at session initiation (see
the csa_headerpadsize argument to CREATE_SESSION in Section 17.36), the ca_headerpadsize argument to CREATE_SESSION in Section 17.36),
and subsequently used by the RPC RDMA layer, as described in and subsequently used by the RPC RDMA layer, as described in [9].
[RPCRDMA]. Zero padding is permitted. Zero padding is permitted.
Padding leverages the useful property that RDMA receives preserve Padding leverages the useful property that RDMA preserve alignment of
alignment of data, even when they are placed into anonymous data, even when they are placed into anonymous (untagged) buffers.
(untagged) buffers. If requested, client inline writes will insert If requested, client inline writes will insert appropriate pad bytes
appropriate pad bytes within the request header to align the data within the request header to align the data payload on the specified
payload on the specified boundary. The client is encouraged to add boundary. The client is encouraged to add sufficient padding (up to
sufficient padding (up to the negotiated size) so that the "data" the negotiated size) so that the "data" field of the NFSv4.1 WRITE
field of the NFSv4.1 WRITE operation is aligned. Most servers can operation is aligned. Most servers can make good use of such
make good use of such padding, which allows them to chain receive padding, which allows them to chain receive buffers in such a way
buffers in such a way that any data carried by client requests will that any data carried by client requests will be placed into
be placed into appropriate buffers at the server, ready for file appropriate buffers at the server, ready for file system processing.
system processing. The receiver's RPC layer encounters no overhead The receiver's RPC layer encounters no overhead from skipping over
from skipping over pad bytes, and the RDMA layer's high performance pad bytes, and the RDMA layer's high performance makes the insertion
makes the insertion and transmission of padding on the sender a and transmission of padding on the sender a significant optimization.
significant optimization. In this way, the need for servers to In this way, the need for servers to perform RDMA Read to satisfy all
perform RDMA Read to satisfy all but the largest client writes is but the largest client writes is obviated. An added benefit is the
obviated. An added benefit is the reduction of message round trips reduction of message round trips on the network - a potentially good
on the network - a potentially good trade, where latency is present. trade, where latency is present.
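One way a sender might compute the pad length is sketched below. This is an assumption-laden illustration, not behavior defined by either draft: header_pad_bytes, header_len (the number of bytes that precede the WRITE data), and alignment are hypothetical names, and falling back to zero padding when the needed pad exceeds the negotiated size is an assumed policy.

```python
def header_pad_bytes(header_len: int, alignment: int, max_pad: int) -> int:
    """Bytes of padding needed so that the data following a header of
    header_len bytes starts on an alignment boundary, or 0 if that would
    exceed the negotiated header pad size (max_pad)."""
    pad = (-header_len) % alignment   # distance to the next boundary
    return pad if pad <= max_pad else 0
```

Zero padding remains permitted, so the fallback in this sketch simply gives up the alignment benefit rather than violating the negotiated limit.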
The value to choose for padding is subject to a number of criteria. The value to choose for padding is subject to a number of criteria.
A primary source of variable-length data in the RPC header is the A primary source of variable-length data in the RPC header is the
authentication information, the form of which is client-determined, authentication information, the form of which is client-determined,
possibly in response to server specification. The contents of possibly in response to server specification. The contents of
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
go into the determination of a maximal NFSv4 request size and go into the determination of a maximal NFSv4.1 request size and
therefore minimal buffer size. The client must select its offered therefore minimal buffer size. The client must select its offered
value carefully, so as not to overburden the server, and vice versa. value carefully, so as not to overburden the server, and vice versa.
The payoff of an appropriate padding value is higher performance. The payoff of an appropriate padding value is higher performance.
[[Comment.5: RFC editor please keep this diagram on one page.]]
Sender gather: Sender gather:
|RPC Request|Pad bytes|Length| -> |User data...| |RPC Request|Pad bytes|Length| -> |User data...|
\------+---------------------/ \ \------+---------------------/ \
\ \ \ \
\ Receiver scatter: \-----------+- ... \ Receiver scatter: \-----------+- ...
/-----+----------------\ \ \ /-----+----------------\ \ \
|RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...
In the above case, the server may recycle unused buffers to the next In the above case, the server may recycle unused buffers to the next
posted receive if unused by the actual received request, or may pass posted receive if unused by the actual received request, or may pass
the now-complete buffers by reference for normal write processing. the now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for data For a server which can make use of it, this removes any need for data
copies of incoming data, without resorting to complicated end-to-end copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management. This includes most kernel-based buffer advertisement and management. This includes most kernel-based
and integrated server designs, among many others. The client may and integrated server designs, among many others. The client may
perform similar optimizations, if desired. perform similar optimizations, if desired.
2.10.5.4. Dual RDMA and Non-RDMA Transports 2.10.6.4. Dual RDMA and Non-RDMA Transports
Some RDMA transports (for example see [RDDP]), [[Comment.9: need Some RDMA transports (for example [11]), permit a "streaming" (non-
xref]] require a "streaming" (non-RDMA) phase, where ordinary traffic RDMA) phase, where ordinary traffic might flow before "stepping up"
might flow before "stepping" up to RDMA mode, commencing RDMA to RDMA mode, commencing RDMA traffic. Some RDMA transports start
traffic. Some RDMA transports start connections always in RDMA mode. connections always in RDMA mode. NFSv4.1 allows, but does not
NFSv4.1 allows, but does not assume, a streaming phase before RDMA assume, a streaming phase before RDMA mode. When a connection is
mode. When a connection is bound to a session, the client and server associated with a session, the client and server negotiate whether
negotiate whether the connection is used in RDMA or non-RDMA mode the connection is used in RDMA or non-RDMA mode (see Section 17.36
(see Section 17.36 and Section 17.34). and Section 17.34).
2.10.6. Sessions Security 2.10.7. Sessions Security
2.10.6.1. Session Callback Security 2.10.7.1. Session Callback Security
Via session connection binding, NFSv4.1 improves security over that Via session / connection association, NFSv4.1 improves security over
provided by NFSv4.0 for the callback channel. The connection is that provided by NFSv4.0 for the backchannel. The connection is
client-initiated (see Section 17.34), and subject to the same client-initiated (see Section 17.34), and subject to the same
firewall and routing checks as the operations channel. The firewall and routing checks as the fore channel. The connection
connection cannot be hijacked by an attacker who connects to the cannot be hijacked by an attacker who connects to the client port
client port prior to the intended server. At the client's option prior to the intended server as is possible with NFSv4.0. At the
(see Section 17.36 binding is fully authenticated before being client's option (see Section 17.35), connection association is fully
activated (see Section 17.34). Traffic from the server over the authenticated before being activated (see Section 17.34). Traffic
callback channel is authenticated exactly as the client specifies from the server over the backchannel is authenticated exactly as the
(see Section 2.10.6.2). client specifies (see Section 2.10.7.2).
2.10.6.2. Backchannel RPC Security 2.10.7.2. Backchannel RPC Security
When the NFSv4.1 client establishes the backchannel, it informs the When the NFSv4.1 client establishes the backchannel, it informs the
server what security flavors and principals it must use when sending server of the security flavors and principals to use when sending
requests over the backchannel. If the security flavor is RPCSEC_GSS, requests. If the security flavor is RPCSEC_GSS, the client expresses
the client expresses the principal in the form of an established the principal in the form of an established RPCSEC_GSS context. The
RPCSEC_GSS context. The server is free to use any flavor/principal server is free to use any of the flavor/principal combinations the
combination the server offers, but MUST NOT use unoffered client offers, but it MUST NOT use unoffered combinations. This way,
combinations. the client need not provide a target GSS principal for the
backchannel as it did with NFSv4.0, nor does the server have to implement
This way, the client does not have to provide a target GSS principal an RPCSEC_GSS initiator as it did with NFSv4.0 [2].
as it did with NFSv4.0, and the server does not have to implement an
RPCSEC_GSS initiator as it did with NFSv4.0. [[Comment.10: xrefs]]
The CREATE_SESSION (Section 17.36) and BACKCHANNEL_CTL The CREATE_SESSION (Section 17.36) and BACKCHANNEL_CTL
(Section 17.33) operations allow the client to specify flavor/ (Section 17.33) operations allow the client to specify flavor/
principal combinations. principal combinations.
2.10.6.3. Protection from Unauthorized State Changes Also note that the SP4_SSV state protection mode (see Section 17.35
and Section 2.10.7.3) has the side benefit of providing SSV-derived
RPCSEC_GSS contexts (Section 2.10.7.4).
Under some conditions, NFSv4.0 is vulnerable to a denial of service 2.10.7.3. Protection from Unauthorized State Changes
issue with respect to its state management.
The attack works via an unauthorized client faking an open_owner4, an As described to this point in the specification, the state model of
open_owner/lock_owner pair, or stateid, combined with a seqid. The NFSv4.1 is vulnerable to an attacker that issues a SEQUENCE operation
operation is sent to the NFSv4 server. The NFSv4 server accepts the with a forged sessionid and with a slot id that it expects the
state information, and as long as any status code from the result of legitimate client to use next. When the legitimate client uses the
this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, slot id with the same sequence number, the server returns the
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, attacker's result from the reply cache which disrupts the legitimate
NFS4ERR_RESOURCE, or NFS4ERR_NOFILEHANDLE, the sequence number is client and thus denies service to it. Similarly an attacker could
incremented. When the authorized client issues an operation, it gets issue a CREATE_SESSION with a forged client ID to create a new
back NFS4ERR_BAD_SEQID, because its idea of the current sequence session associated with the client ID. The attacker could issue
number is off by one. The authorized client's recovery options are requests using the new session that change locking state, such as
pretty limited, with SETCLIENTID, followed by complete reclaim of LOCKU operations to release locks the legitimate client has acquired.
state, which may or may not succeed completely. That qualifies as a
denial of service attack.
If the client uses RPCSEC_GSS authentication and integrity, and every Setting a security policy on the file which requires RPCSEC_GSS
client maps each open_owner and lock_owner one and only one credentials when manipulating the file's state is one potential work
principal, and the server enforces this binding, then the conditions around, but has the disadvantage of preventing a legitimate client
leading to vulnerability to the denial of service do not exist. One from releasing state when RPCSEC_GSS is required to do so, but a GSS
should keep in mind that if AUTH_SYS is being used, far simpler context cannot be obtained (possibly because the user has logged off
easier denial of service and other attacks are possible. the client).
With NFSv4.1 sessions, the per-operation sequence number is ignored NFSv4.1 provides three options to a client for state protection which
(see Section 8.13) therefore the NFSv4.0 denial of service are specified when a client creates a client ID via EXCHANGE_ID
vulnerability described above does not apply. However as described (Section 17.35).
to this point in the specification, an attacker could forge the
sessionid and issue a SEQUENCE with a slot id that he expects the
legitimate client to use next. The legitimate client could then use
the slotid with the same sequence number, and the server returns the
attacker's result from the replay cache, thereby disrupting the
legitimate client.
If we give each NFSv4.1 user their own session, and each user uses The first (SP4_NONE) is to simply waive state protection.
RPCSEC_GSS authentication and integrity, then the denial of service
issue is solved, at the cost of additional per session state. The
alternative NFSv4.1 specifies is described as follows.
Transport connections MUST be bound to a session by the client. The The other two options (SP4_MACH_CRED and SP4_SSV) share several
server MUST return an error to an operation (other than the operation traits:
that binds the connection to the session) that uses an unbound
connection. As a simplification, the transport connection used by
CREATE_SESSION (see Section 17.36) is automatically bound to the
session. Additional connections are bound to a session via
BIND_CONN_TO_SESSION (see Section 17.34).
To prevent attackers from issuing BIND_CONN_TO_SESSION operations, o An RPCSEC_GSS-based credential is used to authenticate client ID
the arguments to BIND_CONN_TO_SESSION include a digest of a shared and session maintenance operations, including creating and
secret called the secret session verifier (SSV) that only the client destroying a session, associating a connection with the session,
and server know. The digest is created via a one way, collision and destroying the client ID.
resistant hash function, making it intractable for the attacker to
forge.
The SSV is sent to the server via SET_SSV (see Section 17.47). To o Because RPCSEC_GSS is used to authenticate client ID and session
prevent eavesdropping, a SET_SSV for the SSV SHOULD be protected via maintenance, the attacker cannot associate a rogue connection with
RPCSEC_GSS with the privacy service. The SSV can be changed by the a legitimate session, or associate a rogue session with a
client at any time, by any principal. However several aspects of SSV legitimate client ID in order to maliciously alter the client ID's
changing prevent an attacker from engaging in a successful denial of lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.
service attack:
o A SET_SSV on the SSV does not replace the SSV with the argument to o In cases where the server's security policies on a portion of its
SET_SSV. Instead, the current SSV on the server is logically namespace require RPCSEC_GSS authentication, a client may have to
exclusive ORed (XORed) with the argument to SET_SSV. SET_SSV MUST use an RPCSEC_GSS credential to remove per-file state (for example
NOT be called with an SSV value that is zero. LOCKU, CLOSE, etc.). The server may require that the principal
that removes the state match certain criteria (for example, the
principal might have to be the same as the one that acquired the
state). However, the client might not have an RPCSEC_GSS
context for such a principal, and might not be able to create such
a context (perhaps because the user has logged off). When the
client establishes SP4_MACH_CRED or SP4_SSV protection, it can
specify a list of operations that the server MUST allow using the
machine credential (if SP4_MACH_CRED is used) or the SSV
credential (if SP4_SSV is used).
The SP4_MACH_CRED state protection option uses a machine credential:
the principal that creates the client ID must also be the principal
that performs client ID and session maintenance operations. The
security of the machine credential state protection approach depends
entirely on safeguarding the per-machine credential. Assuming the
credential is properly safeguarded, using the per-machine credential
for operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from
associating a rogue connection with a session, or associating a rogue
session with a client ID.
There are at least three scenarios for the SP4_MACH_CRED option:
1. The system administrator configures a unique, permanent
per-machine credential for one of the mandated GSS mechanisms (for
example, if Kerberos V5 is used, a "keytab" for a principal named
after the client host name could be used).
2. The client is used by a single user, and so the client ID and its
sessions are used by just that user. If the user's credential
expires, then session and client ID maintenance cannot occur, but
since the client has a single user, only that user is
inconvenienced.
3. The physical client has multiple users, but the client
implementation has a unique client ID for each user. This is
effectively the same as the second scenario, but a disadvantage
is that each user must be allocated at least one session, so the
approach suffers from lack of economy.
The SP4_SSV protection option uses a Secret State Verifier (SSV)
which is shared between a client and server. The SSV serves as the
secret key for an internal (that is, internal to NFSv4.1) GSS
mechanism that uses the secret key for Message Integrity Code (MIC)
and Wrap tokens (Section 2.10.7.4). The SP4_SSV protection option is
intended for a client that has multiple users, where the system
administrator does not wish to configure a permanent machine
credential for each client. The SSV is established on the server via
SET_SSV (see Section 17.47). To prevent eavesdropping, a client
SHOULD issue SET_SSV via RPCSEC_GSS with the privacy service.
Several aspects of the SSV make it intractable for an attacker to
guess the SSV, and thus associate rogue connections with a session,
and rogue sessions with a client ID:
o The arguments to and results of SET_SSV include digests of the old o The arguments to and results of SET_SSV include digests of the old
and new SSV, respectively. and new SSV, respectively.
o Because the initial value of the SSV is zero, therefore known, the o Because the initial value of the SSV is zero, therefore known, the
client that opts for connecting binding enforcement, MUST issue at client that opts for SP4_SSV protection and opts to apply SP4_SSV
least one SET_SSV operation before the first BIND_CONN_TO_SESSION protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST issue
operation. A client SHOULD issue SET_SSV as soon as a session is at least one SET_SSV operation before the first
created. BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION
operation on a client ID. If it does not, the SSV mechanism will
If a connection is disconnected, BIND_CONN_TO_SESSION is required to not generate tokens (Section 2.10.7.4). A client SHOULD issue
bind a connection to the session, even if the connection that was SET_SSV as soon as a session is created.
disconnected was the one CREATE_SESSION was created with.
If a client is assigned a machine principal then the client SHOULD o A SET_SSV does not replace the SSV with the argument to SET_SSV.
use the machine principal's RPCSEC_GSS context to privacy protect the Instead, the current SSV on the server is logically exclusive ORed
SSV from eavesdropping during the SET_SSV operation. If a machine (XORed) with the argument to SET_SSV. SET_SSV MUST NOT be called
principal is not being used, then the client MAY use the non-machine with an SSV value that is zero. For this reason, each time a new
principal's RPCSEC_GSS context to privacy protect the SSV. The principal uses a client ID for the first time, the client SHOULD
server MUST accept either type of principal. A client SHOULD change issue a SET_SSV with that principal's RPCSEC_GSS credentials, with
the SSV each time a new principal uses the session. RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.
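The XOR update rule described in the last bullet of the new text can be sketched as follows. This is a minimal illustration in Python, not part of the specification: the helper name apply_set_ssv, its error handling, and the 16-octet SSV length are assumptions made for the example.

```python
def apply_set_ssv(current_ssv: bytes, arg: bytes) -> bytes:
    """Return the new SSV: the current SSV XORed with the SET_SSV argument.

    Hypothetical helper; the name and the exceptions are illustrative only.
    """
    if len(arg) != len(current_ssv):
        raise ValueError("argument length must match the SSV length")
    if not any(arg):
        # SET_SSV MUST NOT be called with an SSV value that is zero.
        raise ValueError("SET_SSV argument must not be zero")
    return bytes(a ^ b for a, b in zip(current_ssv, arg))


SSV_LEN = 16                 # assumed length, for illustration only
ssv = bytes(SSV_LEN)         # the initial SSV is zero, and therefore known
ssv = apply_set_ssv(ssv, bytes.fromhex("00112233445566778899aabbccddeeff"))
ssv = apply_set_ssv(ssv, bytes.fromhex("ffffffffffffffffffffffffffffffff"))
```

Because each argument is folded into the existing value, observing a single SET_SSV argument reveals the resulting SSV only to a party that also knew the previous SSV.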
Here are the types of attacks that can be attempted by an attacker Here are the types of attacks that can be attempted by an attacker
named Eve, and how the connection to session binding approach named Eve on a victim named Bob, and how SP4_SSV protection foils
addresses each attack: each attack:
o If the Eve creates a connection after the legitimate client
establishes an SSV via privacy protection from a machine
principal's RPCSEC_GSS session, she does not know the SSV and so
cannot compute a digest that BIND_CONN_TO_SESSION will accept.
Users on the legitimate client cannot be disrupted by Eve.
o If Eve is the first one log into the legitimate client, and the o Suppose Eve is the first user to log into a legitimate client.
client does not use machine principals, then Eve can cause an SSV Eve's use of an NFSv4.1 filesystem will cause an SSV to be created
to be created via the legitimate client's NFSv4.1 implementation, via the legitimate client's NFSv4.1 implementation. The SET_SSV
protected by the RPCSEC_GSS context created by the legitimate that creates the SSV will be protected by the RPCSEC_GSS context
client (which uses Eve's GSS principal and credentials). Eve can created by the legitimate client which uses Eve's GSS principal
then eavesdrop on the network, and because she knows her and credentials. Eve can eavesdrop on the network while her
credentials, she can decrypt the SSV. Eve can compute a digest RPCSEC_GSS context is created, and the SET_SSV using her context
BIND_CONN_TO_SESSION will accept, and so bind a new connection to is issued. Even if the legitimate client issues the SET_SSV with
the session. Eve can change the slotid, sequence state, and/or RPC_GSS_SVC_PRIVACY, because Eve knows her own credentials, she
the SSV state in such a way that when Bob accesses the server via can decrypt the SSV. Eve can compute an RPCSEC_GSS credential
the legitimate client, the legitimate client will be unable to use that BIND_CONN_TO_SESSION will accept, and so associate a new
the session. connection with the legitimate session. Eve can change the slot
id and sequence state of a legitimate session, and/or the SSV
state, in such a way that when Bob accesses the server via the
same legitimate client, the legitimate client will be unable to
use the session.
The client's only recourse is to create a new session, which will The client's only recourse is to create a new client ID for Bob to
cause any state Eve created on the legitimate client over the old use, and establish a new SSV for the client ID. The client will
(but hijacked) session to be lost. This disrupts Eve, but because be unable to delete the old client ID, and will let the lease on
she is the attacker, this is acceptable. old client ID expire.
Once the legitimate client establishes an SSV over the new session Once the legitimate client establishes an SSV over the new session
using Bob's RPCSEC_GSS context, Eve can use the new session via using Bob's RPCSEC_GSS context, Eve can use the new session via
the legitimate client, but she cannot disrupt Bob. Moreover, the legitimate client, but she cannot disrupt Bob. Moreover,
because the client SHOULD have modified the SSV due to Eve using because the client SHOULD have modified the SSV due to Eve using
the new session, Bob cannot get revenge on Eve by binding a rogue the new session, Bob cannot get revenge on Eve by associating a
connection to the session. rogue connection with the session.
The question is how does the legitimate client detect that Eve has The question is how did the legitimate client detect that Eve has
hijacked the old session? When the client detects that a new hijacked the old session? When the client detects that a new
principal, Bob, wants to use the session, it SHOULD have issued a principal, Bob, wants to use the session, it SHOULD have issued a
SET_SSV. SET_SSV, which leads to the following sub-scenarios:
* Let us suppose that from the rogue connection, Eve issued a * Let us suppose that from the rogue connection, Eve issued a
SET_SSV with the same slotid and sequence that the legitimate SET_SSV with the same slotid and sequence that the legitimate
client later uses. The server will assume this is a replay, client later uses. The server will assume this is a retry, and
and return to the legitimate client the reply it sent Eve. return to the legitimate client the reply it sent Eve. However,
However, unless Eve can correctly guess the SSV the legitimate unless Eve can correctly guess the SSV the legitimate client
client will use, the digest verification checks in the SET_SSV will use, the digest verification checks in the SET_SSV
response will fail. That is the clue to the client that the response will fail. That is an indication to the client that
session has been hijacked. the session has apparently been hijacked.
* Alternatively, Eve issued a SET_SSV with a different slotid * Alternatively, Eve issued a SET_SSV with a different slotid
than the legitimate client uses for its SET_SSV. Then the than the legitimate client uses for its SET_SSV. Then the
digest verification on the server fails, and the client is digest verification on the server fails, and it is again
again clued that the session has been hijacked. apparent to the client that the session has been hijacked.
* Alternatively, Eve issued an operation other than SET_SSV, but * Alternatively, Eve issued an operation other than SET_SSV, but
with the same slotid and sequence that the legitimate client with the same slotid and sequence that the legitimate client
uses for its SET_SSV. The server returns to the legitimate uses for its SET_SSV. The server returns to the legitimate
client the response it sent Eve. The client sees that the client the response it sent Eve. The client sees that the
response is not at all what it expects. The client assumes response is not at all what it expects. The client assumes
either session hijacking or server bug, and either way destroys either session hijacking or a server bug, and either way
the old session. destroys the old session.
o Eve binds a rogue connection to the session as above, and then o Eve associates a rogue connection with the session as above, and
destroys the session. Again, Bob goes to use the server from the then destroys the session. Again, Bob goes to use the server from
legitimate client. The client has a very clear indication that the legitimate client by issuing a SET_SSV. The client receives
its session was hijacked, and does not even have to destroy the an error that indicates the session does not exist. When the
old session before creating a new session, which Eve will be client tries to create a new session, this will fail because the
unable to hijack because it will be protected with an SSV created SSV it has does not match the one the server has, and now the client knows
via Bob's RPCSEC_GSS protection. the session was hijacked. The legitimate client establishes a new
client ID as before.
o If Eve creates a connection before the legitimate client o If Eve creates a connection before the legitimate client
establishes an SSV, because the initial value of the SSV is zero establishes an SSV, because the initial value of the SSV is zero
and therefore known, Eve can issue a SET_SSV that will pass the and therefore known, Eve can issue a SET_SSV that will pass the
digest verification check. However because the new connection has digest verification check. However because the new connection has
not been bound to the session, the SET_SSV is rejected for that not been associated with the session, the SET_SSV is rejected for
reason. that reason.
o The connection to session binding model does not prevent In summary an attacker's disruption of state when SP4_SSV protection
connection hijacking. However, if an attacker can perform is in use is limited to the formative period of a client ID, its
connection hijacking, it can issue denial of service attacks that first session, and the establishment of the SSV. Once a non-
are less difficult than attacks based on forging sessions. malicious user uses the client ID, the client quickly detects any
hijack and rectifies the situation. Once a non-malicious user
successfully modifies the SSV, the attacker cannot use NFSv4.1
operations to disrupt the non-malicious user.
2.10.7. Session Mechanics - Steady State Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches
prevent hijacking of a transport connection that has previously been
associated with a session. If the goal of a counter threat strategy
is to prevent connection hijacking, the use of IPsec is RECOMMENDED.
If the goal of a counter threat strategy is to prevent a connection
hijacker from making unauthorized state changes, then the
SP4_MACH_CRED protection approach can be used with a client ID per
user (i.e., the aforementioned third scenario for machine credential
state protection). Each EXCHANGE_ID can specify that all operations
MUST be protected with the machine credential. The server will then
reject any subsequent operations on the client ID that do not use
RPCSEC_GSS with privacy or integrity and do not use the same
credential that created the client ID.
2.10.7.1. Obligations of the Server 2.10.7.4. The SSV GSS Mechanism
The SSV provides the secret key for a mechanism that NFSv4.1 uses for
state protection. Contexts for this mechanism are not established
via the RPCSEC_GSS protocol. Instead, the contexts are automatically
created when EXCHANGE_ID specifies SP4_SSV protection. The only
tokens defined are the PerMsgToken (emitted by GSS_GetMIC) and the
SealedMessage (emitted by GSS_Wrap).
The mechanism OID for the SSV mechanism is:
iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
(1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define
any initial context tokens, the OID can be used to let servers
indicate that the SSV mechanism is acceptable whenever the client
issues a SECINFO or SECINFO_NO_NAME operation (see Section 2.6).
The PerMsgToken description is based on an XDR definition:
   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
           uint32_t        smpt_ssv_seq;
           opaque          smpt_orig_plain<>;
   };

   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
           uint32_t        smt_ssv_seq;
           opaque          smt_hmac<>;
   };
The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type
ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence
number which is equal to 1 after SET_SSV is called the first time on
a client ID. Thereafter, it is incremented on each SET_SSV. Thus
smt_ssv_seq represents the version of the SSV at the time
GSS_GetMIC() was called. This allows the SSV to be changed without
serializing all RPC calls that use the SSV mechanism with SET_SSV
operations.
The field smt_hmac is an HMAC ([12]), calculated by using the current
SSV as the key, the one way hash algorithm as negotiated by
EXCHANGE_ID, and the input text as represented by data of type
ssv_mic_plain_tkn4. The field smpt_ssv_seq is the same as
smt_ssv_seq. The field smpt_orig_plain is the input text as passed
into GSS_GetMIC().
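The smt_hmac computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and SHA-256 merely stands in for whichever one way hash algorithm EXCHANGE_ID negotiated.

```python
import hashlib
import hmac
import struct


def xdr_uint32(value: int) -> bytes:
    """XDR encoding of a 32-bit unsigned integer (big-endian)."""
    return struct.pack(">I", value)


def xdr_opaque(data: bytes) -> bytes:
    """XDR encoding of variable-length opaque data: a 4-octet length,
    the data, then zero fill up to a multiple of 4 octets."""
    fill = -len(data) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * fill


def ssv_mic_hmac(ssv: bytes, ssv_seq: int, orig_plain: bytes,
                 digestmod=hashlib.sha256) -> bytes:
    """Compute smt_hmac over the XDR-encoded ssv_mic_plain_tkn4.

    Hypothetical helper: the SSV is the HMAC key, and digestmod is an
    assumed stand-in for the negotiated hash algorithm.
    """
    plain = xdr_uint32(ssv_seq) + xdr_opaque(orig_plain)
    return hmac.new(ssv, plain, digestmod).digest()


mac = ssv_mic_hmac(b"\x01" * 16, 1, b"hello")
```

Since smpt_ssv_seq is part of the HMAC input, a MIC computed under one version of the SSV does not verify under another sequence number.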
The SealedMessage description is based on an XDR definition:
   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
           opaque          sspt_confounder<>;
           uint32_t        sspt_ssv_seq;
           opaque          sspt_orig_plain<>;
           opaque          sspt_pad<>;
   };

   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
           uint32_t        ssct_ssv_seq;
           opaque          ssct_encr_data<>;
           opaque          ssct_hmac<>;
   };
The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
ssv_seal_cipher_tkn4. The field ssct_ssv_seq has the same meaning as
smt_ssv_seq. The ssct_encr_data field is the result of encrypting a
value of the XDR encoded data type ssv_seal_plain_tkn4. The
encryption key is the SSV, and the encryption algorithm is that
negotiated by EXCHANGE_ID.
The ssct_hmac field is the result of computing an HMAC using a value of
the XDR encoded data type ssv_seal_plain_tkn4 as the input text. The
key is the SSV, and the one way hash algorithm is that negotiated by
EXCHANGE_ID.
The sspt_confounder field is a random value.
The sspt_ssv_seq field is the same as ssct_ssv_seq.
The sspt_orig_plain field is the original plaintext as passed to
GSS_Wrap().
The sspt_pad field is present to support encryption algorithms that
require inputs to be in fixed sized blocks. The content of sspt_pad
is zero filled except for the length. Beware that the XDR encoding
of ssv_seal_plain_tkn4 contains three variable length arrays, and so
each array consumes 4 octets for an array length, and each array that
follows the length is always padded to a multiple of 4 octets per the
XDR standard.
For example suppose the encryption algorithm uses 16 octet blocks,
and the sspt_confounder is 3 octets long, and the sspt_orig_plain
field is 15 octets long. The XDR encoding of sspt_confounder uses 8
octets (4 + 3 + 1 octet pad), the XDR encoding of sspt_ssv_seq uses 4
octets, the XDR encoding of sspt_orig_plain uses 20 octets (4 + 15 +
1 octet pad), and the smallest XDR encoding of the sspt_pad field is
4 octets. This totals 36 octets. The next multiple of 16 is 48,
thus the length field of sspt_pad needs to be set to 12 octets, or a
total encoding of 16 octets. The total number of XDR encoded octets
is thus 8 + 4 + 20 + 16 = 48.
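The arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes, as the example does, that the chosen sspt_pad length is itself a multiple of 4 octets so that it needs no XDR fill.

```python
def xdr_opaque_len(n: int) -> int:
    """Octets consumed by the XDR encoding of an n-octet variable-length
    opaque: 4 for the length, n for the data, plus fill to a multiple of 4."""
    return 4 + n + (-n % 4)


def sspt_pad_len(confounder_len: int, orig_plain_len: int, block: int) -> int:
    """Length to give sspt_pad so that the XDR-encoded ssv_seal_plain_tkn4
    is a multiple of the cipher block size (hypothetical helper)."""
    if block % 4 != 0:
        raise ValueError("block size is assumed to be a multiple of 4 octets")
    # Confounder, 4-octet sspt_ssv_seq, and original plaintext.
    fixed = xdr_opaque_len(confounder_len) + 4 + xdr_opaque_len(orig_plain_len)
    # The pad array always costs 4 octets for its own length field.
    return -(fixed + 4) % block


pad = sspt_pad_len(confounder_len=3, orig_plain_len=15, block=16)
total = xdr_opaque_len(3) + 4 + xdr_opaque_len(15) + xdr_opaque_len(pad)
```

With the values from the example, pad is 12 and total is 48, matching the figures in the text.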
GSS_Wrap() emits a token that is an XDR encoding of a value of data
type ssv_seal_cipher_tkn4. Note that regardless of whether the caller
of GSS_Wrap() requests confidentiality or not, the token always has
confidentiality. This is because the SSV mechanism is for
RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without
confidentiality.
Effectively there is a single GSS context for all RPCSEC_GSS handles
that have been created on a session. And all sessions associated
with a client ID share the same SSV. SSV GSS contexts do not
expire except when the SSV is destroyed (causes would include the
client ID being destroyed or a server restart). Since one purpose of
context expiration is to replace keys that have been in use for "too
long" and are hence vulnerable to compromise by brute force or
accident, the client can issue periodic SET_SSV operations, cycling
through different users' RPCSEC_GSS credentials. This way the SSV is
replaced without destroying the SSV's GSS contexts. If for some
reason SSV RPCSEC_GSS handles expire, the EXCHANGE_ID operation can
be used to create more SSV RPCSEC_GSS handles.
The client MUST establish an SSV via SET_SSV before the GSS context
can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). If
SET_SSV has not been successfully called, attempts to emit tokens
MUST fail.
The SSV mechanism does not support replay detection and sequencing in
its tokens because RPCSEC_GSS does not use those features (Section
5.2.2 "Context Creation Requests" in [5]).
2.10.8. Session Mechanics - Steady State
2.10.8.1. Obligations of the Server
The server has the primary obligation to monitor the state of The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and back channel connections). When these (RPCSEC_GSS contexts and backchannel connections). If these
resources go away, the server takes action as specified in resources vanish, the server takes action as specified in
Section 2.10.8.2. Section 2.10.9.2.
2.10.7.2. Obligations of the Client 2.10.8.2. Obligations of the Client
The client has the following obligations in order to utilize the The client SHOULD honor the following obligations in order to utilize
session: the session:
o Keep a necessary session from going idle on the server. A client o Keep a necessary session from going idle on the server. A client
that requires a session, but nonetheless is not sending operations that requires a session, but nonetheless is not sending operations
risks having the session be destroyed by the server. This is risks having the session be destroyed by the server. This is
because sessions consume resources, and resource limitations may because sessions consume resources, and resource limitations may
force the server to cull the least recently used session. force the server to cull a session that has not been used for a long
time. [[Comment.6: Tom Talpey disagrees and thinks a server can
never cull a session. Mike Eisler doesn't know what the server is
supposed to do when it accumulates a zillion reply caches that no
client has touched in a century. :-)]]
o Destroy the session when idle. When a session has no state other o Destroy the session when not needed. If a client has multiple
than the session, and no outstanding requests, the client should sessions and one of them has no requests waiting for replies, and
consider destroying the session. has been idle for some period of time, it SHOULD destroy the
session.
o Maintain GSS contexts for callback. If the client requires the o Maintain GSS contexts for the callback channel. If the client
server to use the RPCSEC_GSS security flavor for callbacks, then requires the server to use the RPCSEC_GSS security flavor for
it needs to be sure the contexts handed to the server via callbacks, then it needs to be sure the contexts handed to the
BACKCHANNEL_CTL are unexpired. A good practice is to keep at server via BACKCHANNEL_CTL are unexpired.
least two contexts outstanding, where the expiration time of the
newest context at the time it was created, is N times that of the
oldest context, where N is the number of contexts available for
callbacks.
o Maintain an active connection. The server requires a callback o Preserve a connection for a backchannel. The server requires a
path in order to gracefully recall recallable state, or notify the backchannel in order to gracefully recall recallable state, or
client of certain events. notify the client of certain events. Note that if the connection
is not being used for the fore channel, there is no way for the client
to tell if the connection is still alive (e.g., the server rebooted
without sending a disconnect). The onus is on the server, not the
client, to determine if the backchannel's connection is alive, and
to indicate in the response to a SEQUENCE operation when the last
connection associated with a session's backchannel has
disconnected.
2.10.7.3. Steps the Client Takes To Establish a Session 2.10.8.3. Steps the Client Takes To Establish a Session
The client issues EXCHANGE_ID to establish a client ID. If the client does not have a client ID, the client issues
EXCHANGE_ID to establish a client ID. If it opts for SP4_MACH_CRED
or SP4_SSV protection, in the spo_must_enforce list of operations, it
SHOULD at minimum specify: CREATE_SESSION, DESTROY_SESSION,
BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it
opts for SP4_SSV protection, the client needs to ask for SSV-based
RPCSEC_GSS handles.
The client uses the client ID to issue a CREATE_SESSION on a The client uses the client ID to issue a CREATE_SESSION on a
connection to the server. The results of CREATE_SESSION indicate connection to the server. The results of CREATE_SESSION indicate
whether the server will persist the session replay cache through a whether the server will persist the session reply cache through a
server reboot or not, and the client notes this for future reference. server reboot or not, and the client notes this for future reference.
The client SHOULD have specified connecting binding enforcement when If the client specified SP4_SSV state protection when the client ID
the session was created. If so, the client SHOULD issue SET_SSV in was created, then it SHOULD issue SET_SSV in the first COMPOUND after
the first COMPOUND after the session is created. If it is not using the session is created. Each time a new principal goes to use the
machine credentials, then each time a new principal goes to use the client ID, it SHOULD issue a SET_SSV again.
session, it SHOULD issue a SET_SSV again.
If the client wants to use delegations, layouts, directory If the client wants to use delegations, layouts, directory
notifications, or any other state that requires a callback channel, notifications, or any other state that requires a backchannel, then
then it MUST add a connection to the backchannel if CREATE_SESSION it must add a connection to the backchannel if CREATE_SESSION did not
did not already do so. The client creates a connection, and calls already do so. The client creates a connection, and calls
BIND_CONN_TO_SESSION to bind the connection to the session and the BIND_CONN_TO_SESSION to associate the connection with the session and
session's backchannel. If CREATE_SESSION did not already do so, the the session's backchannel. If CREATE_SESSION did not already do so,
client MUST tell the server what security is required in order for the client MUST tell the server what security is required in order
the client to accept callbacks. The client does this via for the client to accept callbacks. The client does this via
BACKCHANNEL_CTL. BACKCHANNEL_CTL. If the client selected SP4_MACH_CRED or SP4_SSV
protection when it called EXCHANGE_ID, then the client SHOULD specify
that the backchannel use RPCSEC_GSS contexts for security.
If the client wants to use additional connections for the If the client wants to use additional connections for the
backchannel, then it MUST call BIND_CONN_TO_SESSION on each backchannel, then it must call BIND_CONN_TO_SESSION on each
connection it wants to use with the session. If the client wants to connection it wants to use with the session. If the client wants to
use additional connections for the operation channel, then it MUST use additional connections for the fore channel, then it must call
call BIND_CONN_TO_SESSION if it specified connection binding BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state
enforcement before using the connection. protection when the client ID was created.
At this point the client has reached a steady state as far as session At this point the session has reached steady state.
use.
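The establishment steps above, as worded in the newer column, can be sketched as an ordered list of operations. This is only an illustrative sketch: the function name, its parameters, and the use of plain strings for operation names are assumptions made for the example, not part of the protocol.

```python
def session_setup_ops(has_client_id, state_protect, wants_callbacks,
                      backchannel_bound_by_create, cb_security_set_by_create):
    """Return the operations a client would issue, in order (hypothetical helper).

    state_protect is one of "SP4_NONE", "SP4_MACH_CRED", "SP4_SSV".
    The two *_by_create flags model what CREATE_SESSION already did.
    """
    ops = []
    if not has_client_id:
        ops.append("EXCHANGE_ID")           # establish a client ID first
    ops.append("CREATE_SESSION")            # always needed for a new session
    if state_protect == "SP4_SSV":
        ops.append("SET_SSV")               # SHOULD be in the first COMPOUND
    if wants_callbacks:
        if not backchannel_bound_by_create:
            ops.append("BIND_CONN_TO_SESSION")  # associate a connection
        if not cb_security_set_by_create:
            ops.append("BACKCHANNEL_CTL")       # state callback security
    return ops
```

For a client with no client ID that chooses SP4_SSV and wants delegations, with CREATE_SESSION having done neither backchannel step, the sketch yields EXCHANGE_ID, CREATE_SESSION, SET_SSV, BIND_CONN_TO_SESSION, BACKCHANNEL_CTL.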
2.10.8. Session Mechanics - Recovery 2.10.9. Session Mechanics - Recovery
2.10.8.1. Events Requiring Client Action 2.10.9.1. Events Requiring Client Action
The following events require client action to recover. The following events require client action to recover.
2.10.8.1.1. RPCSEC_GSS Context Loss by Callback Path 2.10.9.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS contexts granted to by the client to the server for If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE
results indicates when callback contexts are nearly expired, or fully results indicates when callback contexts are nearly expired, or fully
expired (see Section 17.46.4). expired (see Section 17.46.4).
2.10.8.1.2. Connection Disconnect 2.10.9.1.2. Connection Loss
If the client loses the last connection of the session, then it MUST If the client loses the last connection of the session, and if it wants
create a new connection, and if connecting binding enforcement was to retain the session, then it must create a new connection, and if,
specified when the session was created, bind it to the session via when the client ID was created, BIND_CONN_TO_SESSION was specified in
BIND_CONN_TO_SESSION. the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION
to associate the connection with the session.
If there were requests outstanding at the time of connection If there was a request outstanding at the time of connection
disconnect, then the client MUST retry the request, as described in loss, then if the client wants to continue to use the session it MUST
Section 2.10.4.2. Note that it is not necessary to retry requests retry the request, as described in Section 2.10.5.2. Note that it is
over a connection with the same source network address or the same not necessary to retry requests over a connection with the same
destination network address as the disconnected connection. As long source network address or the same destination network address as the
as the sessionid, slotid, and sequenceid in the retry match that of lost connection. As long as the sessionid, slot id, and sequence id
the original request, the server will recognize the request as a in the retry match that of the original request, the server will
retry if it did see the request prior to disconnect. recognize the request as a retry if it executed the request prior to
disconnect.
If the connection that was bound to the backchannel is lost, the If the connection that was lost was the last one associated with the
client may need to reconnect, and use BIND_CONN_TO_SESSION, to give backchannel, and the client wants to retain the backchannel and/or
the connection to the backchannel. If the connection that was lost not put recallable state subject to revocation, the client must
was the last one bound to the backchannel, the client MUST reconnect, reconnect, and if it does, it MUST associate the connection to the
and bind the connection to the session and backchannel. The server session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD
should indicate when it has no callback connection via the indicate when it has no callback connection via the sr_status_flags
sr_status_flags result from SEQUENCE. result from SEQUENCE.
2.10.8.1.3. Backchannel GSS Context Loss 2.10.9.1.3. Backchannel GSS Context Loss
Via the sr_status_flags result of the SEQUENCE operation or other Via the sr_status_flags result of the SEQUENCE operation or other
means, the client will learn if some or all of the RPCSEC_GSS means, the client will learn if some or all of the RPCSEC_GSS
contexts it assigned to the backchannel have been lost. The client contexts it assigned to the backchannel have been lost. If the
may need to use BACKCHANNEL_CTL to assign new contexts. It MUST client wants to retain the backchannel and/or not put recallable
assign new contexts if there are no more contexts. state subject to revocation, the client must use BACKCHANNEL_CTL
to assign new contexts.
2.10.8.1.4. Loss of Session 2.10.9.1.4. Loss of Session
The server may lose a record of the session. Causes include: The replier might lose a record of the session. Causes include:
o Server crash and reboot o Replier crash and reboot
o A catastrophe that causes the cache to be corrupted or lost on the o A catastrophe that causes the reply cache to be corrupted or lost
media it was stored on. This applies even if the server indicated on the media it was stored on. This applies even if the replier
in the CREATE_SESSION results that it would persist the cache. indicated in the CREATE_SESSION results that it would persist the
cache.
o The server purges the session of a client that has been inactive o The server purges the session of a client that has been inactive
for a very extended period of time. [[Comment.11: XXX - Should we for a very extended period of time.
add a value to the CREATE_SESSION results that tells a client how
long he can let a session stay idle before losing it?]]
Loss of replay cache is equivalent to loss of session. The server Loss of reply cache is equivalent to loss of session. The replier
indicates loss of session to the client by returning indicates loss of session to the requester by returning
NFS4ERR_BADSESSION on the next operation that uses the sessionid NFS4ERR_BADSESSION on the next operation that uses the sessionid that
associated with the lost session. refers to the lost session.
After an event like a server reboot, the client may have lost its After an event like a server reboot, the client may have lost its
connections. The client assumes for the moment that the session has connections. The client assumes for the moment that the session has
not been lost. It reconnects, and if it specified connecting binding not been lost. It reconnects, and if it specified connection
enforcement when the session was created, it invokes association enforcement when the session was created, it invokes
BIND_CONN_TO_SESSION using the sessionid. Otherwise, it invokes BIND_CONN_TO_SESSION using the sessionid. Otherwise, it invokes
SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns
NFS4ERR_BADSESSION, the client knows the session was lost. If the NFS4ERR_BADSESSION, the client knows the session was lost. If the
connection survives session loss, then the next SEQUENCE operation connection survives session loss, then the next SEQUENCE operation
the client issues over the connection will get back the client issues over the connection will get back
NFS4ERR_BADSESSION. The client again knows the session was lost. NFS4ERR_BADSESSION. The client again knows the session was lost.
When the client detects session loss, it must call CREATE_SESSION to When the client detects session loss, it must call CREATE_SESSION to
recover. Any non-idempotent operations that were in progress may recover. Any non-idempotent operations that were in progress may
have been performed on the server at the time of session loss. The have been performed on the server at the time of session loss. The
client has no general way to recover from this. client has no general way to recover from this.
Note that loss of session does not imply loss of lock, open, Note that loss of session does not imply loss of lock, open,
delegation, or layout state. Nor does loss of lock, open, delegation, or layout state because locks, opens, delegations, and
delegation, or layout state imply loss of session state. layouts are tied to the client ID and depend on the client ID, not
[[Comment.12: Add reference to lock recovery section]] . A session the session. Nor does loss of lock, open, delegation, or layout
can survive a server reboot, but lock recovery may still be needed. state imply loss of session state, because the session depends on the
The converse is also true. client ID; loss of client ID however does imply loss of session,
lock, open, delegation, and layout state. See Section 8.6.2. A
session can survive a server reboot, but lock recovery may still be
needed.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server reboots and does not preserve client ID (for example the server reboots and does not preserve client ID
state). If so, the client needs to call EXCHANGE_ID, followed by state). If so, the client needs to call EXCHANGE_ID, followed by
CREATE_SESSION. CREATE_SESSION.
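The recovery sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: `call` is a hypothetical callable that sends one named operation and returns its status as a string, and `enforce_binding` stands for whether the client asked for connection association enforcement. Neither is defined by this document.

```python
def recover_session(call, enforce_binding):
    """Probe the session after reconnecting; rebuild it if it was lost.

    Returns the list of operations issued (hypothetical helper).
    """
    # Assume for the moment that the session survived and probe it.
    probe = "BIND_CONN_TO_SESSION" if enforce_binding else "SEQUENCE"
    issued = [probe]
    if call(probe) != "NFS4ERR_BADSESSION":
        return issued                        # session is still known

    # Session loss detected: CREATE_SESSION is the recovery step.
    issued.append("CREATE_SESSION")
    if call("CREATE_SESSION") == "NFS4ERR_STALE_CLIENTID":
        # The server no longer knows the client ID; start from EXCHANGE_ID.
        for op in ("EXCHANGE_ID", "CREATE_SESSION"):
            issued.append(op)
            call(op)
    return issued
```

The sketch deliberately does nothing about non-idempotent operations that were in progress, since the text notes the client has no general way to recover from those.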
2.10.8.1.5. Failover 2.10.9.2. Events Requiring Server Action
[[Comment.13: Dave Noveck requested this section; not sure what is
needed here if this refers to failover to a replica. What are the
session ramifications?]]
2.10.8.2. Events Requiring Server Action
The following events require server action to recover. The following events require server action to recover.
2.10.8.2.1. Client Crash and Reboot 2.10.9.2.1. Client Crash and Reboot
As described in Section 17.35, a rebooted client causes the server to As described in Section 17.35, a rebooted client issues EXCHANGE_ID
delete any sessions it had. in such a way that it causes the server to delete any sessions it had.
2.10.8.2.2. Client Crash with No Reboot 2.10.9.2.2. Client Crash with No Reboot
If a client crashes and never comes back, it will never issue If a client crashes and never comes back, it will never issue
EXCHANGE_ID with its old client owner. Thus the server has session EXCHANGE_ID with its old client owner. Thus the server has session
state that will never be used again. After an extended period of state that will never be used again. After an extended period of
time and if the server has resource constraints, it MAY destroy the time and if the server has resource constraints, it MAY destroy the
old session. old session as well as locking state.
2.10.8.2.3. Extended Network Partition 2.10.9.2.3. Extended Network Partition
To the server, the extended network partition may be no different To the server, the extended network partition may be no different
than a client crash with no reboot (see Section 2.10.8.2.2). Unless from a client crash with no reboot (see Section 2.10.9.2.2). Unless
the server can discern that there is a network partition, it is free the server can discern that there is a network partition, it is free
to treat the situation as if the client has crashed for good. to treat the situation as if the client has crashed permanently.
2.10.8.2.4. Backchannel Connection Loss 2.10.9.2.4. Backchannel Connection Loss
If there were callback requests outstanding at the time the of a If there were callback requests outstanding at the time of a
connection disconnect, then the server MUST retry the request, as connection loss, then the server MUST retry the request, as described
described in Section 2.10.4.2. Note that it is not necessary to in Section 2.10.5.2. Note that it is not necessary to retry requests
retry requests over a connection with the same source network address over a connection with the same source network address or the same
or the same destination network address as the disconnected destination network address as the lost connection. As long as the
connection. As long as the sessionid, slotid, and sequenceid in the sessionid, slot id, and sequence id in the retry match that of the
retry match that of the original request, the callback target will original request, the callback target will recognize the request as a
recognize the request as a retry if it did see the request prior to retry even if it did see the request prior to disconnect.
disconnect.
If the connection lost is the last one bound to the backchannel, then If the connection lost is the last one associated with the
the server MUST indicate that in the sr_status_flags field of the backchannel, then the server MUST indicate that in the
next SEQUENCE reply. sr_status_flags field of every SEQUENCE reply until the backchannel
is reestablished. There are two situations, each of which uses
different status flags: no connectivity for the session's
backchannel, and no connectivity for any session backchannel of the
client. See Section 17.46 for a description of the appropriate flags
in sr_status_flags.
2.10.8.2.5. GSS Context Loss 2.10.9.2.5. GSS Context Loss
The server SHOULD monitor when the last RPCSEC_GSS context assigned The server SHOULD monitor when the number of RPCSEC_GSS contexts
to the backchannel is near expiry (i.e. between one and two periods assigned to the backchannel reaches one, and that one context is near
of lease time), and indicate so in the sr_status_flags field of the expiry (i.e. between one and two periods of lease time), and indicate
next SEQUENCE reply. The server MUST indicate when the backchannel's so in the sr_status_flags field of all SEQUENCE replies. The server
last RPCSEC_GSS context has expired in the sr_status_flags field of MUST indicate when all of the backchannel's assigned RPCSEC_GSS
the next SEQUENCE reply. contexts have expired in the sr_status_flags field of all SEQUENCE
replies.
2.10.9. Parallel NFS and Sessions 2.10.10. Parallel NFS and Sessions
A client and server can potentially be a non-pNFS implementation, a A client and server can potentially be a non-pNFS implementation, a
metadata server implementation, a data server implementation, or two metadata server implementation, a data server implementation, or two
or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
mutually exclusive) are passed in the EXCHANGE_ID arguments and mutually exclusive) are passed in the EXCHANGE_ID arguments and
results to allow the client to indicate how it wants to use sessions results to allow the client to indicate how it wants to use sessions
created under the client ID, and to allow the server to indicate how created under the client ID, and to allow the server to indicate how
it will allow the sessions to be used. See Section 13.1 for pNFS it will allow the sessions to be used. See Section 13.1 for pNFS
sessions considerations. sessions considerations.
skipping to change at page 67, line 14 skipping to change at page 77, line 14
3.2.10. netaddr4 3.2.10. netaddr4
struct netaddr4 { struct netaddr4 {
/* see struct rpcb in RFC1833 */ /* see struct rpcb in RFC1833 */
string r_netid<>; /* network id */ string r_netid<>; /* network id */
string r_addr<>; /* universal address */ string r_addr<>; /* universal address */
}; };
The netaddr4 structure is used to identify TCP/IP based endpoints. The netaddr4 structure is used to identify TCP/IP based endpoints.
The r_netid and r_addr fields are specified in RFC1833 [22], but they The r_netid and r_addr fields are specified in RFC1833 [26], but they
are underspecified in RFC1833 [22] as far as what they should look are underspecified in RFC1833 [26] as far as what they should look
like for specific protocols. like for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string: US-ASCII string:
h1.h2.h3.h4.p1.p2 h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long. representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
skipping to change at page 67, line 48 skipping to change at page 77, line 48
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string: US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2 x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix, as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884 representing an IPv6 address as defined in Section 2.2 of RFC1884
[9]. Additionally, the two alternative forms specified in Section [13]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [9] are also acceptable. 2.2 of RFC1884 [13] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6". That this over IPv6 the value of r_netid is the string "udp6". That this
document specifies the universal address and netid for UDP/IPv6 does document specifies the universal address and netid for UDP/IPv6 does
not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see
Section 2.9). Section 2.9).
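As an illustration of the r_addr format, the sketch below builds and splits a universal address. It assumes the usual universal address convention that p1 and p2 are the decimal values of the high-order and low-order octets of the port (the passage stating this is elided from this excerpt). The function names are hypothetical, and the host part is treated as an opaque string rather than validated.

```python
def format_uaddr(host, port):
    """Build an r_addr string from a textual IPv4/IPv6 host and a port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    # p1 is the high-order octet of the port, p2 the low-order octet.
    return f"{host}.{port >> 8}.{port & 0xFF}"


def parse_uaddr(uaddr):
    """Split an r_addr string into (host, port)."""
    # The last two dot-separated fields are always p1 and p2, for both
    # the IPv4 form and the colon-separated IPv6 form.
    host, p1, p2 = uaddr.rsplit(".", 2)
    p1, p2 = int(p1), int(p2)
    if not (0 <= p1 <= 255 and 0 <= p2 <= 255):
        raise ValueError("port octet out of range")
    return host, p1 * 256 + p2
```

For example, port 2049 splits into octets 8 and 1, so format_uaddr("192.0.2.1", 2049) gives "192.0.2.1.8.1", and parse_uaddr("2001:db8::1.8.1") gives the host "2001:db8::1" with port 2049.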
3.2.11. open_owner4 3.2.11. open_owner4
struct open_owner4 { struct open_owner4 {
skipping to change at page 69, line 34 skipping to change at page 79, line 34
The layouttype4 structure is 32 bits in length. The range The layouttype4 structure is 32 bits in length. The range
represented by the layout type is split into three parts. Type 0x0 represented by the layout type is split into three parts. Type 0x0
is reserved. Types within the range 0x00000001-0x7FFFFFFF are is reserved. Types within the range 0x00000001-0x7FFFFFFF are
globally unique and are assigned according to the description in globally unique and are assigned according to the description in
Section 21.1; they are maintained by IANA. Types within the range Section 21.1; they are maintained by IANA. Types within the range
0x80000000-0xFFFFFFFF are site specific and for "private use" only. 0x80000000-0xFFFFFFFF are site specific and for "private use" only.
The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [23], is to be used. specifies that the object layout, as defined in [29], is to be used.
Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume
layout, as defined in [24], is to be used. layout, as defined in [30], is to be used.
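The three-way split of the layouttype4 range can be expressed directly. The sketch below is illustrative only; the function name and the returned labels are assumptions chosen for the example.

```python
def classify_layouttype(value):
    """Classify a 32-bit layout type value by the range it falls in."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is a 32-bit quantity")
    if value == 0x0:
        return "reserved"        # type 0x0 is reserved
    if value <= 0x7FFFFFFF:
        return "iana"            # globally unique, maintained by IANA
    return "private"             # site specific, private use only
```

A client could use such a check to reject a reserved value, or to avoid treating a private-use type as if it had a globally agreed meaning.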
3.2.16. deviceid4 3.2.16. deviceid4
typedef uint32_t deviceid4; /* 32-bit device ID */ typedef uint32_t deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system
skipping to change at page 70, line 19 skipping to change at page 80, line 19
opaque da_addr_body<>; opaque da_addr_body<>;
}; };
The device address is used to set up a communication channel with the The device address is used to set up a communication channel with the
storage device. Different layout types will require different types storage device. Different layout types will require different types
of structures to define how they communicate with storage devices. of structures to define how they communicate with storage devices.
The opaque da_addr_body field must be interpreted based on the The opaque da_addr_body field must be interpreted based on the
specified da_layout_type field. specified da_layout_type field.
This document defines the device address for the NFSv4.1 file layout This document defines the device address for the NFSv4.1 file layout
([[Comment.14: need xref]]), which identifies a storage device by ([[Comment.7: need xref]]), which identifies a storage device by
network IP address and port number. This is sufficient for the network IP address and port number. This is sufficient for the
clients to communicate with the NFSv4.1 storage devices, and may be clients to communicate with the NFSv4.1 storage devices, and may be
sufficient for other layout types as well. Device types for object sufficient for other layout types as well. Device types for object
storage devices and block storage devices (e.g., SCSI volume labels) storage devices and block storage devices (e.g., SCSI volume labels)
will be defined by their respective layout specifications. will be defined by their respective layout specifications.
3.2.18. devlist_item4 3.2.18. devlist_item4
struct devlist_item4 { struct devlist_item4 {
deviceid4 dli_id; deviceid4 dli_id;
skipping to change at page 74, line 10 skipping to change at page 84, line 10
for a file system object. The contents of the filehandle are opaque for a file system object. The contents of the filehandle are opaque
to the client. Therefore, the server is responsible for translating to the client. Therefore, the server is responsible for translating
the filehandle to an internal representation of the file system the filehandle to an internal representation of the file system
object. object.
4.1. Obtaining the First Filehandle 4.1. Obtaining the First Filehandle
The operations of the NFS protocol are defined in terms of one or The operations of the NFS protocol are defined in terms of one or
more filehandles. Therefore, the client needs a filehandle to more filehandles. Therefore, the client needs a filehandle to
initiate communication with the server. With the NFS version 2 initiate communication with the server. With the NFS version 2
protocol RFC1094 [17] and the NFS version 3 protocol RFC1813 [18], protocol RFC1094 [21] and the NFS version 3 protocol RFC1813 [22],
there exists an ancillary protocol to obtain this first filehandle. there exists an ancillary protocol to obtain this first filehandle.
The MOUNT protocol, RPC program number 100005, provides the mechanism The MOUNT protocol, RPC program number 100005, provides the mechanism
of translating a string based file system path name to a filehandle of translating a string based file system path name to a filehandle
which can then be used by the NFS protocols. which can then be used by the NFS protocols.
The MOUNT protocol has deficiencies in the area of security and use The MOUNT protocol has deficiencies in the area of security and use
via firewalls. This is one reason that the use of the public via firewalls. This is one reason that the use of the public
filehandle was introduced in RFC2054 [25] and RFC2055 [26]. With the filehandle was introduced in RFC2054 [31] and RFC2055 [32]. With the
use of the public filehandle in combination with the LOOKUP operation use of the public filehandle in combination with the LOOKUP operation
in the NFS version 2 and 3 protocols, it has been demonstrated that in the NFS version 2 and 3 protocols, it has been demonstrated that
the MOUNT protocol is unnecessary for viable interaction between NFS the MOUNT protocol is unnecessary for viable interaction between NFS
client and server. client and server.
Therefore, the NFS version 4 protocol will not use an ancillary Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a protocol for translation from string based path names to a
filehandle. Two special filehandles will be used as starting points filehandle. Two special filehandles will be used as starting points
for the NFS client. for the NFS client.
skipping to change at page 86, line 25 skipping to change at page 96, line 25
| | | | | privileged | | | | | | privileged |
| | | | | user (for | | | | | | user (for |
| | | | | example, | | | | | | example, |
| | | | | "root" in UNIX | | | | | | "root" in UNIX |
| | | | | operating | | | | | | operating |
| | | | | environments | | | | | | environments |
| | | | | or in Windows | | | | | | or in Windows |
| | | | | 2000 the "Take | | | | | | 2000 the "Take |
| | | | | Ownership" | | | | | | Ownership" |
| | | | | privilege). | | | | | | privilege). |
| dacl | 58 | nfsacl41 | R/W | Automatically | | dacl | 58 | nfsacl41 | R/W | Access Control |
| | | | | inheritable | | | | | | List used for |
| | | | | access control |
| | | | | list used for |
| | | | | determining | | | | | | determining |
| | | | | access to file | | | | | | access to file |
| | | | | system | | | | | | system |
| | | | | objects. | | | | | | objects. |
| dir_notif_delay | 56 | nfstime4 | READ | notification | | dir_notif_delay | 56 | nfstime4 | READ | notification |
| | | | | delays on | | | | | | delays on |
| | | | | directory | | | | | | directory |
| | | | | attributes | | | | | | attributes |
| dirent_ | 57 | nfstime4 | READ | notification | | dirent_ | 57 | nfstime4 | READ | notification |
| notif_delay | | | | delays on | | notif_delay | | | | delays on |
skipping to change at page 92, line 30 skipping to change at page 102, line 30
| retention_set | 70 | retention_set4 | WRITE | Set the | | retention_set | 70 | retention_set4 | WRITE | Set the |
| | | | | retention | | | | | | retention |
| | | | | duration, and | | | | | | duration, and |
| | | | | optionally | | | | | | optionally |
| | | | | enable | | | | | | enable |
| | | | | retention on | | | | | | retention on |
| | | | | the file | | | | | | the file |
| | | | | object. | | | | | | object. |
| | | | | SETATTR use | | | | | | SETATTR use |
| | | | | only. | | | | | | only. |
| sacl | 59 | nfsacl41 | R/W | Automatically | | sacl | 59 | nfsacl41 | R/W | Access Control |
| | | | | inheritable | | | | | | List used for |
| | | | | access control |
| | | | | list used for |
| | | | | auditing | | | | | | auditing |
| | | | | access to | | | | | | access to file |
| | | | | files. | | | | | | system |
| | | | | objects. |
| space_avail | 42 | uint64 | READ | Disk space in | | space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes | | | | | | bytes |
| | | | | available to | | | | | | available to |
| | | | | this user on | | | | | | this user on |
| | | | | the file | | | | | | the file |
| | | | | system | | | | | | system |
| | | | | containing | | | | | | containing |
| | | | | this object - | | | | | | this object - |
| | | | | this should be | | | | | | this should be |
| | | | | the smallest | | | | | | the smallest |
skipping to change at page 95, line 15 skipping to change at page 105, line 15
updates and lazily write them to stable storage. It is also updates and lazily write them to stable storage. It is also
acceptable to give administrators of the server the option to disable acceptable to give administrators of the server the option to disable
time_access updates. time_access updates.
5.8. Interpreting owner and owner_group 5.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a and groups within the "acl" attribute) are represented in terms of a
UTF-8 string. To avoid a representation that is tied to a particular UTF-8 string. To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the underlying implementation at the client or server, the use of the
UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [27] UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [33]
provides additional rationale. It is expected that the client and provides additional rationale. It is expected that the client and
server will have their own local representation of owner and server will have their own local representation of owner and
owner_group that is used for local storage or presentation to the end owner_group that is used for local storage or presentation to the end
user. Therefore, it is expected that when these attributes are user. Therefore, it is expected that when these attributes are
transferred between the client and server that the local transferred between the client and server that the local
representation is translated to a syntax of the form "user@ representation is translated to a syntax of the form "user@
dns_domain". This will allow for a client and server that do not use dns_domain". This will allow for a client and server that do not use
the same local representation the ability to translate to a common the same local representation the ability to translate to a common
syntax that can be interpreted by both. syntax that can be interpreted by both.
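The translation between a local representation and the common "user@dns_domain" syntax can be sketched as follows. This is only an illustration: the account table, the domain name, and the function names are hypothetical, and falling back to "nobody" for an unmappable local identity is an assumption based on the use of that string for anonymous users.

```python
# Hypothetical local account table: numeric uid -> local user name.
LOCAL_USERS = {1001: "alice", 1002: "bob"}
# Assumed DNS domain shared by this client and server.
DNS_DOMAIN = "example.org"


def uid_to_owner(uid):
    """Translate a local uid to the on-the-wire owner string."""
    name = LOCAL_USERS.get(uid)
    if name is None:
        # No mapping available: fall back to the anonymous user string.
        return "nobody"
    return f"{name}@{DNS_DOMAIN}"


def owner_to_uid(owner):
    """Translate an owner string back to a local uid, or None if unmappable."""
    user, sep, domain = owner.rpartition("@")
    if not sep or domain != DNS_DOMAIN:
        return None
    for uid, name in LOCAL_USERS.items():
        if name == user:
            return uid
    return None
```

Neither side needs to know the other's local identifiers; both only need to agree on the string form.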
skipping to change at page 97, line 9 skipping to change at page 107, line 9
compatibility. compatibility.
The owner string "nobody" may be used to designate an anonymous user, The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute. that cannot be mapped through normal means to the owner attribute.
5.9. Character Case Attributes 5.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes, With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" RFC1345 [28] which may or may not include the word "CAPITAL" name" RFC1345 [34] which may or may not include the word "CAPITAL"
or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see the general character handling and internationalization issues, see the
section "Internationalization". section "Internationalization".
5.10. Quota Attributes 5.10. Quota Attributes
For the attributes related to file system quotas, the following For the attributes related to file system quotas, the following
definitions apply: definitions apply:
skipping to change at page 103, line 5 skipping to change at page 113, line 5
o retention_hold. This attribute allows one to 64 administrative o retention_hold. This attribute allows one to 64 administrative
holds, one hold per bit on the attribute. If retention_hold is holds, one hold per bit on the attribute. If retention_hold is
not zero, then the file MUST NOT be deleted, renamed, or modified, not zero, then the file MUST NOT be deleted, renamed, or modified,
even if the duration on enabled event or non-event-based retention even if the duration on enabled event or non-event-based retention
has been reached. The server MAY restrict the modification of has been reached. The server MAY restrict the modification of
retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL
permission. The enabling of administration retention holds does permission. The enabling of administration retention holds does
not prevent the enabling of event-based or non-event-based not prevent the enabling of event-based or non-event-based
retention. retention.
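Treating retention_hold as a 64-bit mask with one hold per bit can be sketched as below. The helper names are hypothetical, and the ACE4_WRITE_RETENTION_HOLD permission check that a server MAY apply is deliberately left out.

```python
RETENTION_HOLD_BITS = 64  # one administrative hold per bit


def set_hold(retention_hold, index):
    """Return the attribute value with hold number `index` enabled."""
    if not 0 <= index < RETENTION_HOLD_BITS:
        raise ValueError("hold index out of range")
    return retention_hold | (1 << index)


def clear_hold(retention_hold, index):
    """Return the attribute value with hold number `index` released."""
    if not 0 <= index < RETENTION_HOLD_BITS:
        raise ValueError("hold index out of range")
    return retention_hold & ~(1 << index)


def may_change_file(retention_hold):
    """A file may be deleted, renamed, or modified only if no hold is set."""
    return retention_hold == 0
```

A file with two holds stays protected until both are released, regardless of whether any retention duration has expired.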
6. Access Control Lists 6. Security Related Attributes
Access Control Lists (ACLs) are a file attribute that specify fine Access Control Lists (ACLs) are file attributes that specify fine
grained access control. This chapter covers the "acl", "dacl", grained access control. This chapter covers the "acl", "dacl",
"sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and "sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and
their interactions. their interactions. Note that file attributes may apply to any file
system objects.
6.1. Goals 6.1. Goals
ACLs and modes represent two well established but different models ACLs and modes represent two well established but different models
for specifying permissions. This chapter specifies requirements that for specifying permissions. This chapter specifies requirements that
attempt to meet the following goals: attempt to meet the following goals:
o If a server supports the mode attribute, it should provide o If a server supports the mode attribute, it should provide
reasonable semantics to clients that only set and retrieve the reasonable semantics to clients that only set and retrieve the
mode attribute. mode attribute.
o If a server supports the ACL attribute, it should provide o If a server supports ACL attributes, it should provide reasonable
reasonable semantics to clients that only set and retrieve the ACL semantics to clients that only set and retrieve those attributes.
attribute.
o On servers that support the mode attribute, if the ACL attribute o On servers that support the mode attribute, if ACL attributes have
has never been set on an object, via inheritance or explicitly, never been set on an object, via inheritance or explicitly, the
the behavior should be traditional UNIX-like behavior. behavior should be traditional UNIX-like behavior.
o On servers that support the mode attribute, if the ACL attribute o On servers that support the mode attribute, if the ACL attributes
has been previously set on an object, either explicitly or via have been previously set on an object, either explicitly or via
inheritance: inheritance:
* Setting only the mode attribute should effectively control the * Setting only the mode attribute should effectively control the
traditional UNIX-like permissions of read, write, and execute traditional UNIX-like permissions of read, write, and execute
on owner, owner_group, and other. on owner, owner_group, and other.
* Setting only the mode attribute should provide reasonable * Setting only the mode attribute should provide reasonable
security. For example, setting a mode of 000 should be enough security. For example, setting a mode of 000 should be enough
to ensure that future opens for read or write by any principal to ensure that future opens for read or write by any principal
should fail, regardless of a previously existing or inherited should fail, regardless of a previously existing or inherited
ACL. ACL.
o This minor version of NFSv4 should not introduce significantly o This minor version of NFSv4 may introduce different semantics
different semantics relating to the mode and ACL attributes, nor relating to the mode and ACL attributes, but it does not render
should it render invalid any existing implementations. Rather, invalid any previously existing implementations. Additionally,
this chapter provides clarifications based on previous this chapter provides clarifications based on previous
implementations and discussions around them. implementations and discussions around them.
o If a server supports the ACL attribute, then at any time, the o If a server supports ACL attributes (any of "acl", "dacl" and
server can provide an ACL attribute when requested. The ACL "sacl"), then at any time, the server can provide the supported
attribute will describe all permissions on the file object, except ACL attributes when requested. The ACL attributes will describe
for the three high-order bits of the mode attribute (described in all permissions on the file object, except for the three high-
Section 6.2.3). The ACL attribute will not conflict with the mode order bits of the mode attribute (described in Section 6.2.3).
attribute, on servers that support the mode attribute. The ACL attributes will not conflict with the mode attribute, on
servers that support the mode attribute. Briefly, "will not
conflict" means that applying the algorithm in Section 6.3.2 to
the ACL yields the nine low-order bits of the mode. See
Section 6.4.1 for exact requirements.
o If a server supports the mode attribute, then at any time, the o If a server supports the mode attribute, then at any time, the
server can provide a mode attribute when requested. The mode server can provide a mode attribute when requested. The mode
attribute will not conflict with the ACL attribute, on servers attribute will not conflict with the ACL attributes, on servers
that support the ACL attribute. that support the ACL attributes.
o When a mode attribute is set on an object, the ACL attribute may o When a mode attribute is set on an object, the ACL attributes may
need to be modified so as to not conflict with the new mode. In need to be modified so as to not conflict with the new mode. In
such cases, it is desirable that the ACL keep as much information such cases, it is desirable that the ACL keep as much information
as possible. This includes information about inheritance, AUDIT as possible. This includes information about inheritance, AUDIT
and ALARM ACEs, and permissions granted and denied that do not and ALARM ACEs, and permissions granted and denied that do not
conflict with the new mode. conflict with the new mode.
6.2. File Attributes Discussion 6.2. File Attributes Discussion
6.2.1. ACL Attribute 6.2.1. ACL Attributes
The NFS version 4 ACL attribute is an array of access control entries The NFS version 4 ACL attributes contain an array of access control
(ACEs). Although the client can read and write the ACL attribute, entries (ACEs). Although the client can read and write the acl
the server is responsible for using the ACL to perform access attribute, the server is responsible for using the ACL to perform
control. The client can use the OPEN or ACCESS operations to check access control. The client can use the OPEN or ACCESS operations to
access without modifying or reading data or metadata. check access without modifying or reading data or metadata.
The NFS ACE attribute is defined as follows: The NFS ACE structure is defined as follows:
typedef uint32_t acetype4; typedef uint32_t acetype4;
typedef uint32_t aceflag4; typedef uint32_t aceflag4;
typedef uint32_t acemask4; typedef uint32_t acemask4;
struct nfsace4 { struct nfsace4 {
acetype4 type; acetype4 type;
aceflag4 flag; aceflag4 flag;
acemask4 access_mask; acemask4 access_mask;
utf8str_mixed who; utf8str_mixed who;
skipping to change at page 105, line 8 skipping to change at page 115, line 12
bits of the requester's access have been ALLOWED. Once a bit (see bits of the requester's access have been ALLOWED. Once a bit (see
below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
considered in the processing of later ACEs. If an ACCESS_DENIED_ACE considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
is encountered where the requester's access still has unALLOWED bits is encountered where the requester's access still has unALLOWED bits
in common with the "access_mask" of the ACE, the request is denied. in common with the "access_mask" of the ACE, the request is denied.
When the ACL is fully processed, if there are bits in the requester's When the ACL is fully processed, if there are bits in the requester's
mask that have not been ALLOWED or DENIED, access is denied. mask that have not been ALLOWED or DENIED, access is denied.
Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do
not affect a requester's access, and instead are for triggering not affect a requester's access, and instead are for triggering
events as a result of a requester's access attempt. Therefore, all events as a result of a requester's access attempt. Therefore, AUDIT
AUDIT and ALARM ACEs are processed until end of the ACL. and ALARM ACEs are processed only after processing ALLOW and DENY
ACEs.
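The ALLOW/DENY processing described above can be sketched as follows. This is a simplified illustration, not the normative algorithm: an ACE is modelled as a hypothetical (type, who, access_mask) tuple, matching "who" against the requester is reduced to string equality, and AUDIT and ALARM ACEs are skipped because they do not affect the access decision.

```python
ALLOW, DENY, AUDIT, ALARM = 0, 1, 2, 3  # values of the acetype4 constants


def check_access(acl, requester, requested_mask):
    """Return True if every requested bit is ALLOWED before being DENIED."""
    remaining = requested_mask  # bits not yet ALLOWED
    for ace_type, who, access_mask in acl:
        if who != requester:
            continue  # assumption: simple equality stands in for "who" matching
        if ace_type == ALLOW:
            # Bits that have been ALLOWED are no longer considered.
            remaining &= ~access_mask
            if remaining == 0:
                return True
        elif ace_type == DENY:
            # A DENY ACE that shares an unALLOWED bit denies the request.
            if remaining & access_mask:
                return False
        # AUDIT and ALARM ACEs never change the decision.
    # Bits that were neither ALLOWED nor DENIED result in denial.
    return remaining == 0
```

Because ACEs are processed in order, placing an ALLOW ACE before a DENY ACE for the same bits yields a different result than the reverse order.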
The NFS version 4 ACL model is quite rich. Some server platforms may The NFS version 4 ACL model is quite rich. Some server platforms may
provide access control functionality that goes beyond the UNIX-style provide access control functionality that goes beyond the UNIX-style
mode attribute, but which is not as rich as the NFS ACL model. So mode attribute, but which is not as rich as the NFS ACL model. So
that users can take advantage of this more limited functionality, the that users can take advantage of this more limited functionality, the
server may indicate that it supports ACLs as long as it follows the server may indicate that it supports ACLs as long as it follows the
guidelines for mapping between its ACL model and the NFS version 4 guidelines for mapping between its ACL model and the NFS version 4
ACL model. ACL model.
The situation is complicated by the fact that a server may have The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs. For example, the enforcement for multiple modules that enforce ACLs. For example, the enforcement for
NFS version 4 access may be different from the enforcement for local NFS version 4 access may be different from, but not weaker than, the
access, and both may be different from the enforcement for access enforcement for local access, and both may be different from the
through other protocols such as SMB. So it may be useful for a enforcement for access through other protocols such as SMB. So it
server to accept an ACL even if not all of its modules are able to may be useful for a server to accept an ACL even if not all of its
support it. modules are able to support it.
The guiding principle in all cases is that the server must not accept The guiding principle with regard to NFSv4 access is that the server
ACLs that appear to make the file more secure than it really is. must not accept ACLs that appear to make the file more secure than it
really is.
6.2.1.1. ACE Type 6.2.1.1. ACE Type
The constants used for the type field (acetype4) are as follows: The constants used for the type field (acetype4) are as follows:
const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001;
const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002;
const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003;
Only the ALLOWED and DENIED types may be used in the dacl
attribute, and only the AUDIT and ALARM types may be used in the sacl
attribute. All four are permitted in the acl attribute.
+------------------------------+--------------+---------------------+ +------------------------------+--------------+---------------------+
| Value | Abbreviation | Description | | Value | Abbreviation | Description |
+------------------------------+--------------+---------------------+ +------------------------------+--------------+---------------------+
| ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants | | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants |
| | | the access defined | | | | the access defined |
| | | in acemask4 to the | | | | in acemask4 to the |
| | | file or directory. | | | | file or directory. |
| ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies | | ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies |
| | | the access defined | | | | the access defined |
| | | in acemask4 to the | | | | in acemask4 to the |
skipping to change at page 106, line 37 skipping to change at page 116, line 49
A server need not support all of the above ACE types. The bitmask A server need not support all of the above ACE types. The bitmask
constants used to represent the above definitions within the constants used to represent the above definitions within the
aclsupport attribute are as follows: aclsupport attribute are as follows:
const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; const ACL4_SUPPORT_ALLOW_ACL = 0x00000001;
const ACL4_SUPPORT_DENY_ACL = 0x00000002; const ACL4_SUPPORT_DENY_ACL = 0x00000002;
const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; const ACL4_SUPPORT_AUDIT_ACL = 0x00000004;
const ACL4_SUPPORT_ALARM_ACL = 0x00000008; const ACL4_SUPPORT_ALARM_ACL = 0x00000008;
Servers which support either the ALLOW or DENY ACE type SHOULD
support both ALLOW and DENY ACE types.
Clients should not attempt to set an ACE unless the server claims Clients should not attempt to set an ACE unless the server claims
support for that ACE type. If the server receives a request to set support for that ACE type. If the server receives a request to set
an ACE that it cannot store, it MUST reject the request with an ACE that it cannot store, it MUST reject the request with
NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP. request with NFS4ERR_ATTRNOTSUPP.
Example: suppose a server can enforce NFS ACLs for NFS access but Support for any of the ACL attributes is optional. However, a server
cannot enforce ACLs for local access. If arbitrary processes can run that supports either of the new ACL attributes (dacl or sacl) must
on the server, then the server SHOULD NOT indicate ACL support. On allow use of the new ACL attributes to access all of the ACE types
the other hand, if only trusted administrative programs run locally, which it supports. In more detail: if such a server supports ALLOW
then the server may indicate ACL support. or DENY ACEs, then it must support the dacl attribute, and if it
supports AUDIT or ALARM ACEs, then it must support the sacl
attribute.
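A client-side check against the aclsupport bitmask, and the dacl/sacl requirement from the new text, might look like the sketch below. The function names are hypothetical, and required_new_attrs() is only meaningful for a server that supports at least one of the new attributes (dacl or sacl).

```python
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001
ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002
ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003

ACL4_SUPPORT_ALLOW_ACL = 0x00000001
ACL4_SUPPORT_DENY_ACL = 0x00000002
ACL4_SUPPORT_AUDIT_ACL = 0x00000004
ACL4_SUPPORT_ALARM_ACL = 0x00000008

# ACE type -> bit the server sets in aclsupport when it supports that type.
SUPPORT_BIT = {
    ACE4_ACCESS_ALLOWED_ACE_TYPE: ACL4_SUPPORT_ALLOW_ACL,
    ACE4_ACCESS_DENIED_ACE_TYPE: ACL4_SUPPORT_DENY_ACL,
    ACE4_SYSTEM_AUDIT_ACE_TYPE: ACL4_SUPPORT_AUDIT_ACL,
    ACE4_SYSTEM_ALARM_ACE_TYPE: ACL4_SUPPORT_ALARM_ACL,
}


def unsupported_ace_types(aclsupport, ace_types):
    """List the ACE types a client should not attempt to set."""
    return [t for t in ace_types if not aclsupport & SUPPORT_BIT[t]]


def required_new_attrs(aclsupport):
    """New ACL attributes a dacl/sacl-capable server must support."""
    required = set()
    if aclsupport & (ACL4_SUPPORT_ALLOW_ACL | ACL4_SUPPORT_DENY_ACL):
        required.add("dacl")
    if aclsupport & (ACL4_SUPPORT_AUDIT_ACL | ACL4_SUPPORT_ALARM_ACL):
        required.add("sacl")
    return required
```

A client would typically fetch aclsupport once and filter its ACEs before issuing a SETATTR, rather than relying on NFS4ERR_ATTRNOTSUPP.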
6.2.1.3. ACE Access Mask 6.2.1.3. ACE Access Mask
The bitmask constants used for the access mask field are as follows: The bitmask constants used for the access mask field are as follows:
const ACE4_READ_DATA = 0x00000001; const ACE4_READ_DATA = 0x00000001;
const ACE4_LIST_DIRECTORY = 0x00000001; const ACE4_LIST_DIRECTORY = 0x00000001;
const ACE4_WRITE_DATA = 0x00000002; const ACE4_WRITE_DATA = 0x00000002;
const ACE4_ADD_FILE = 0x00000002; const ACE4_ADD_FILE = 0x00000002;
const ACE4_APPEND_DATA = 0x00000004; const ACE4_APPEND_DATA = 0x00000004;
const ACE4_ADD_SUBDIRECTORY = 0x00000004; const ACE4_ADD_SUBDIRECTORY = 0x00000004;
const ACE4_READ_NAMED_ATTRS = 0x00000008; const ACE4_READ_NAMED_ATTRS = 0x00000008;
const ACE4_WRITE_NAMED_ATTRS = 0x00000010; const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
const ACE4_EXECUTE = 0x00000020; const ACE4_EXECUTE = 0x00000020;
const ACE4_TRAVERSE = 0x00000020;
const ACE4_DELETE_CHILD = 0x00000040; const ACE4_DELETE_CHILD = 0x00000040;
const ACE4_READ_ATTRIBUTES = 0x00000080; const ACE4_READ_ATTRIBUTES = 0x00000080;
const ACE4_WRITE_ATTRIBUTES = 0x00000100; const ACE4_WRITE_ATTRIBUTES = 0x00000100;
const ACE4_WRITE_RETENTION = 0x00000200; const ACE4_WRITE_RETENTION = 0x00000200;
const ACE4_WRITE_RETENTION_HOLD = 0x00000400; const ACE4_WRITE_RETENTION_HOLD = 0x00000400;
const ACE4_DELETE = 0x00010000; const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000; const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000; const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000; const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000; const ACE4_SYNCHRONIZE = 0x00100000;
6.2.1.3.1. Discussion of Mask Attributes Note that some masks have coincident values, for example,
ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries
ACE4_LIST_DIRECTORY, ACE4_ADD_SUBDIRECTORY, and ACE4_TRAVERSE are
intended to be used with directory objects, while ACE4_READ_DATA,
ACE4_WRITE_DATA, and ACE4_EXECUTE are intended to be used with non-
directory objects.
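Because several masks share a value, the name of a bit depends on the type of object it is applied to. A minimal sketch, with a hypothetical helper and only the coincident bits included:

```python
ACE4_READ_DATA = ACE4_LIST_DIRECTORY = 0x00000001
ACE4_WRITE_DATA = ACE4_ADD_FILE = 0x00000002
ACE4_APPEND_DATA = ACE4_ADD_SUBDIRECTORY = 0x00000004
ACE4_EXECUTE = ACE4_TRAVERSE = 0x00000020

# Names to use for non-directory objects.
FILE_NAMES = {
    ACE4_READ_DATA: "ACE4_READ_DATA",
    ACE4_WRITE_DATA: "ACE4_WRITE_DATA",
    ACE4_APPEND_DATA: "ACE4_APPEND_DATA",
    ACE4_EXECUTE: "ACE4_EXECUTE",
}
# Names to use for directory objects (same bit values).
DIR_NAMES = {
    ACE4_LIST_DIRECTORY: "ACE4_LIST_DIRECTORY",
    ACE4_ADD_FILE: "ACE4_ADD_FILE",
    ACE4_ADD_SUBDIRECTORY: "ACE4_ADD_SUBDIRECTORY",
    ACE4_TRAVERSE: "ACE4_TRAVERSE",
}


def mask_names(access_mask, is_directory):
    """Name the coincident bits set in access_mask for the given object type."""
    names = DIR_NAMES if is_directory else FILE_NAMES
    return [name for bit, name in names.items() if access_mask & bit]
```

The same access_mask value 0x21 therefore reads as read-and-execute on a file and as list-and-traverse on a directory.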
6.2.1.3.1. Discussion of Mask Attributes
ACE4_READ_DATA ACE4_READ_DATA
Operation(s) affected: Operation(s) affected:
READ READ
OPEN OPEN
Discussion: Discussion:
Permission to read the data of the file. Permission to read the data of the file.
Servers SHOULD allow a user the ability to read the data Servers SHOULD allow a user the ability to read the data
of the file when only the ACE4_EXECUTE access mask bit is of the file when only the ACE4_EXECUTE access mask bit is
allowed. allowed.
skipping to change at page 108, line 5 skipping to change at page 118, line 27
READDIR READDIR
Discussion: Discussion:
Permission to list the contents of a directory. Permission to list the contents of a directory.
ACE4_WRITE_DATA ACE4_WRITE_DATA
Operation(s) affected: Operation(s) affected:
WRITE WRITE
OPEN OPEN
SETATTR of size SETATTR of size
Discussion: Discussion:
Permission to modify a file's data anywhere in the file's Permission to modify a file's data.
offset range. This includes the ability to write to any
arbitrary offset and as a result to grow the file.
ACE4_ADD_FILE ACE4_ADD_FILE
Operation(s) affected: Operation(s) affected:
CREATE CREATE
LINK
OPEN OPEN
RENAME
Discussion: Discussion:
Permission to add a new file in a directory. The CREATE Permission to add a new file in a directory. The CREATE
operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, operation is affected when nfs_ftype4 is NF4LNK, NF4BLK,
NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because
it is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected it is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected
when used to create a regular file. when used to create a regular file. LINK and RENAME are
always affected.
ACE4_APPEND_DATA ACE4_APPEND_DATA
Operation(s) affected: Operation(s) affected:
WRITE WRITE
OPEN OPEN
SETATTR of size SETATTR of size
Discussion: Discussion:
The ability to modify a file's data, but only starting at The ability to modify a file's data, but only starting at
EOF. This allows for the notion of append-only files, by EOF. This allows for the notion of append-only files, by
allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to
the same user or group. If a file has an ACL such as the the same user or group. If a file has an ACL such as the
one described above and a WRITE request is made for one described above and a WRITE request is made for
somewhere other than EOF, the server SHOULD return somewhere other than EOF, the server SHOULD return
NFS4ERR_ACCESS. NFS4ERR_ACCESS.
ACE4_ADD_SUBDIRECTORY ACE4_ADD_SUBDIRECTORY
Operation(s) affected: Operation(s) affected:
CREATE CREATE
RENAME
Discussion: Discussion:
Permission to create a subdirectory in a directory. The Permission to create a subdirectory in a directory. The
CREATE operation is affected when nfs_ftype4 is NF4DIR. CREATE operation is affected when nfs_ftype4 is NF4DIR.
The RENAME operation is always affected.
ACE4_READ_NAMED_ATTRS ACE4_READ_NAMED_ATTRS
Operation(s) affected: Operation(s) affected:
OPENATTR OPENATTR
Discussion: Discussion:
Permission to read the named attributes of a file or to Permission to read the named attributes of a file or to
lookup the named attributes directory. OPENATTR is lookup the named attributes directory. OPENATTR is
affected when it is not used to create a named attribute affected when it is not used to create a named attribute
directory. This is when 1.) createdir is TRUE, but a directory. This is when 1.) createdir is TRUE, but a
named attribute directory already exists, or 2.) createdir named attribute directory already exists, or 2.) createdir
skipping to change at page 109, line 21 skipping to change at page 119, line 45
affected when it is used to create a named attribute affected when it is used to create a named attribute
directory. This is when createdir is TRUE and no named directory. This is when createdir is TRUE and no named
attribute directory exists. The ability to check whether attribute directory exists. The ability to check whether
or not a named attribute directory exists depends on the or not a named attribute directory exists depends on the
ability to look it up, therefore, users also need the ability to look it up, therefore, users also need the
ACE4_READ_NAMED_ATTRS permission in order to create a ACE4_READ_NAMED_ATTRS permission in order to create a
named attribute directory. named attribute directory.
ACE4_EXECUTE ACE4_EXECUTE
Operation(s) affected: Operation(s) affected:
LOOKUP
READ READ
OPEN OPEN
Discussion: Discussion:
Permission to execute a file or traverse/search a Permission to execute a file.
directory.
Servers SHOULD allow a user the ability to read the data Servers SHOULD allow a user the ability to read the data
of the file when only the ACE4_EXECUTE access mask bit is of the file when only the ACE4_EXECUTE access mask bit is
allowed. This is because there is no way to execute a allowed. This is because there is no way to execute a
file without reading the contents. Though a server may file without reading the contents. Though a server may
treat ACE4_EXECUTE and ACE4_READ_DATA bits identically treat ACE4_EXECUTE and ACE4_READ_DATA bits identically
when deciding to permit a READ operation, it SHOULD still when deciding to permit a READ operation, it SHOULD still
allow the two bits to be set independently in ACLs, and allow the two bits to be set independently in ACLs, and
MUST distinguish between them when replying to ACCESS MUST distinguish between them when replying to ACCESS
operations. In particular, servers SHOULD NOT silently operations. In particular, servers SHOULD NOT silently
skipping to change at page 109, line 50 skipping to change at page 120, line 24
permissions. permissions.
As an example, following a SETATTR of the following ACL: As an example, following a SETATTR of the following ACL:
nfsuser:ACE4_EXECUTE:ALLOW nfsuser:ACE4_EXECUTE:ALLOW
A subsequent GETATTR of ACL for that file SHOULD return: A subsequent GETATTR of ACL for that file SHOULD return:
nfsuser:ACE4_EXECUTE:ALLOW nfsuser:ACE4_EXECUTE:ALLOW
Rather than: Rather than:
nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW
ACE4_TRAVERSE
Operation(s) affected:
LOOKUP
Discussion:
Permission to traverse/search a directory.
ACE4_DELETE_CHILD ACE4_DELETE_CHILD
Operation(s) affected: Operation(s) affected:
REMOVE REMOVE
RENAME
Discussion: Discussion:
Permission to delete a file or directory within a Permission to delete a file or directory within a
directory. See section "ACE4_DELETE vs. ACE4_DELETE_CHILD" directory. See section "ACE4_DELETE vs. ACE4_DELETE_CHILD"
for information on how these two access mask bits interact. for information on how these two access mask bits interact.
ACE4_READ_ATTRIBUTES ACE4_READ_ATTRIBUTES
Operation(s) affected: Operation(s) affected:
GETATTR of file system object attributes GETATTR of file system object attributes
READDIR
Discussion: Discussion:
The ability to read basic attributes (non-ACLs) of a file. The ability to read basic attributes (non-ACLs) of a file.
On a UNIX system, basic attributes can be thought of as On a UNIX system, basic attributes can be thought of as
the stat level attributes. Allowing this access mask bit the stat level attributes. Allowing this access mask bit
would mean the entity can execute "ls -l" and stat. would mean the entity can execute "ls -l" and stat. If
a READDIR operation requests attributes, this mask must
be allowed for the READDIR to succeed.
ACE4_WRITE_ATTRIBUTES ACE4_WRITE_ATTRIBUTES
Operation(s) affected: Operation(s) affected:
SETATTR of time_access_set, time_backup, SETATTR of time_access_set, time_backup,
time_create, time_modify_set, mimetype, hidden, system time_create, time_modify_set, mimetype, hidden, system
Discussion: Discussion:
Permission to change the times associated with a file Permission to change the times associated with a file or
or directory to an arbitrary value. Also permission directory to an arbitrary value. Also permission to change
to change the mimetype, hidden and system attributes. the mimetype, hidden and system attributes. A user having
A user having ACE4_WRITE_DATA permission, but lacking ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to
ACE4_WRITE_ATTRIBUTES must be allowed to implicitly set set the times associated with a file to the current server
the times associated with a file. time.
ACE4_WRITE_RETENTION ACE4_WRITE_RETENTION
Operation(s) affected: Operation(s) affected:
SETATTR of retention_set, retentevt_set. SETATTR of retention_set, retentevt_set.
Discussion: Discussion:
Permission to modify the durations of event and non-event-based Permission to modify the durations of event and
retention. Also permission to enable event and non-event-based non-event-based retention. Also permission to enable event and
retention. A server MAY map ACE4_WRITE_ATTRIBUTES to non-event-based retention. A server MAY behave such that
ACE_WRITE_RETENTION. setting ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION.
ACE4_WRITE_RETENTION_HOLD ACE4_WRITE_RETENTION_HOLD
Operation(s) affected: Operation(s) affected:
SETATTR of retention_hold. SETATTR of retention_hold.
Discussion: Discussion:
Permission to modify the administration retention holds. Permission to modify the administration retention holds.
A server MAY map ACE4_WRITE_ATTRIBUTES to A server MAY map ACE4_WRITE_ATTRIBUTES to
ACE4_WRITE_RETENTION_HOLD. ACE4_WRITE_RETENTION_HOLD.
ACE4_DELETE ACE4_DELETE
Operation(s) affected: Operation(s) affected:
REMOVE REMOVE
Discussion: Discussion:
Permission to delete the file or directory. See section Permission to delete the file or directory. See section
"ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how "ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how
these two access mask bits interact. these two access mask bits interact.
ACE4_READ_ACL ACE4_READ_ACL
Operation(s) affected: Operation(s) affected:
GETATTR of acl GETATTR of acl, dacl, or sacl
NVERIFY
VERIFY
Discussion: Discussion:
Permission to read the ACL. Permission to read the ACL.
ACE4_WRITE_ACL ACE4_WRITE_ACL
Operation(s) affected: Operation(s) affected:
SETATTR of acl and mode SETATTR of acl and mode
Discussion: Discussion:
Permission to write the acl and mode attributes. Permission to write the acl and mode attributes.
ACE4_WRITE_OWNER ACE4_WRITE_OWNER
Operation(s) affected: Operation(s) affected:
SETATTR of owner and owner_group SETATTR of owner and owner_group
Discussion: Discussion:
Permission to write the owner and owner_group attributes. Permission to write the owner and owner_group attributes.
On UNIX systems, this is the ability to execute chown(). On UNIX systems, this is the ability to execute chown() and
chgrp().
ACE4_SYNCHRONIZE ACE4_SYNCHRONIZE
Operation(s) affected: Operation(s) affected:
NONE NONE
Discussion: Discussion:
Permission to access file locally at the server with Permission to access file locally at the server with
synchronized reads and writes. synchronized reads and writes.
Server implementations need not provide the granularity of control Server implementations need not provide the granularity of control
that is implied by this list of masks. For example, POSIX-based that is implied by this list of masks. For example, POSIX-based
systems might not distinguish ACE4_APPEND_DATA (the ability to append systems might not distinguish ACE4_APPEND_DATA (the ability to append
to a file) from ACE4_WRITE_DATA (the ability to modify existing to a file) from ACE4_WRITE_DATA (the ability to modify existing
contents); both masks would be tied to a single "write" permission. contents); both masks would be tied to a single "write" permission.
When such a server returns attributes to the client, it would show When such a server returns attributes to the client, it would show
both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the write both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the write
permission is enabled. permission is enabled.
If a server receives a SETATTR request that it cannot accurately If a server receives a SETATTR request that it cannot accurately
implement, it should error in the direction of more restricted implement, it should err in the direction of more restricted access,
access. For example, suppose a server cannot distinguish overwriting except in the previously discussed cases of execute and read. For
data from appending new data, as described in the previous paragraph. example, suppose a server cannot distinguish overwriting data from
If a client submits an ACE where ACE4_APPEND_DATA is set but appending new data, as described in the previous paragraph. If a
client submits an ACE where ACE4_APPEND_DATA is set but
ACE4_WRITE_DATA is not (or vice versa), the server should reject the ACE4_WRITE_DATA is not (or vice versa), the server should reject the
request with NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type request with NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type
DENY, the server may silently turn on the other bit, so that both DENY, the server may silently turn on the other bit, so that both
ACE4_APPEND_DATA and ACE4_WRITE_DATA are denied. ACE4_APPEND_DATA and ACE4_WRITE_DATA are denied.
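The SETATTR handling described above for a server that cannot tell appending from overwriting can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name coarsen_write_bits and the Nfs4Error exception are hypothetical, and the numeric values of the mask bits, ACE types, and error code do not appear in this excerpt and are assumed from the protocol's constant definitions.

```python
# Assumed constant values (not shown in this excerpt).
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0
ACE4_ACCESS_DENIED_ACE_TYPE = 1
NFS4ERR_ATTRNOTSUPP = 10032


class Nfs4Error(Exception):
    """Hypothetical exception carrying an NFSv4 status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def coarsen_write_bits(ace_type, mask):
    """Return the access mask a single-"write"-permission server would store.

    If exactly one of WRITE_DATA/APPEND_DATA is set, a DENY ACE may be
    widened to deny both; any other ACE type is rejected, which errs
    toward more restricted access.
    """
    both = ACE4_WRITE_DATA | ACE4_APPEND_DATA
    present = mask & both
    if present in (0, both):
        return mask  # nothing to reconcile
    if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
        return mask | both  # silently turn on the other bit
    raise Nfs4Error(NFS4ERR_ATTRNOTSUPP)
```

Widening a DENY ACE only removes access, so it cannot make the file appear more permissive than the client asked for; widening an ALLOW ACE would, which is why that case is rejected in this sketch.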
6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD 6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD
Two access mask bits govern the ability to delete a file or directory Two access mask bits govern the ability to delete a file or directory
object: ACE4_DELETE on the object itself, and ACE4_DELETE_CHILD on object: ACE4_DELETE on the object itself, and ACE4_DELETE_CHILD on
the object's parent directory. the object's parent directory.
Many systems also consult the "sticky bit" (MODE4_SVTX) and write Many systems also consult the "sticky bit" (MODE4_SVTX) and write
mode bit on the parent directory when determining whether to allow a mode bit on the parent directory when determining whether to allow a
file to be deleted. The mode bit for write corresponds to file to be deleted. The mode bit for write corresponds to
ACE4_WRITE_DATA, which is the same physical bit as ACE4_ADD_FILE. ACE4_WRITE_DATA, which is the same physical bit as ACE4_ADD_FILE.
Therefore, ACE4_ADD_FILE can come into play when determining Therefore, ACE4_WRITE_DATA can come into play when determining
permission to delete. permission to delete.
In the algorithm below, the strategy is that ACE4_DELETE and In the algorithm below, the strategy is that ACE4_DELETE and
ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky
bit takes precedence over the "write" mode bits (reflected in bit takes precedence over the "write" mode bits (reflected in
ACE4_ADD_FILE). ACE4_ADD_FILE).
Server implementations SHOULD grant or deny permission to delete Server implementations SHOULD grant or deny permission to delete
based on the following algorithm. based on the following algorithm.
if ACE4_EXECUTE is denied by the parent directory ACL: if ACE4_TRAVERSE is denied by the parent directory ACL {
deny delete deny delete
else if ACE4_DELETE is allowed by the target object ACL: } else if ACE4_DELETE is allowed by the target object ACL {
allow delete allow delete
else if ACE4_DELETE_CHILD is allowed by the parent } else if ACE4_DELETE_CHILD is allowed by the parent
directory ACL: directory ACL {
allow delete allow delete
else if ACE4_DELETE_CHILD is denied by the } else if ACE4_DELETE_CHILD is denied by the
parent directory ACL: parent directory ACL {
deny delete deny delete
else if ACE4_ADD_FILE is allowed by the parent directory ACL: } else if ACE4_ADD_FILE is allowed by the parent directory ACL {
if MODE4_SVTX is set for the parent directory: if MODE4_SVTX is set for the parent directory {
if the principal owns the parent directory OR if the principal owns the parent directory OR
the principal owns the target object OR the principal owns the target object OR
ACE4_WRITE_DATA is allowed by the target ACE4_WRITE_DATA is allowed by the target
object ACL: object ACL {
allow delete allow delete
else: } else {
deny delete deny delete
else: }
} else {
allow delete allow delete
else: }
} else {
deny delete deny delete
}
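The algorithm above can be restated as a small function. This is a sketch only: it assumes the ACL evaluation for each permission has already been done and reduced to booleans, and the DeleteContext type and its field names are hypothetical. It follows the newer text in testing ACE4_TRAVERSE on the parent directory first.

```python
from dataclasses import dataclass


@dataclass
class DeleteContext:
    """Hypothetical pre-evaluated inputs to the delete decision."""
    parent_denies_traverse: bool = False
    target_allows_delete: bool = False
    parent_allows_delete_child: bool = False
    parent_denies_delete_child: bool = False
    parent_allows_add_file: bool = False
    parent_sticky: bool = False            # MODE4_SVTX set on the parent
    principal_owns_parent: bool = False
    principal_owns_target: bool = False
    target_allows_write_data: bool = False


def may_delete(c):
    """Return True if the delete should be allowed, False if denied."""
    if c.parent_denies_traverse:
        return False
    if c.target_allows_delete:
        return True
    if c.parent_allows_delete_child:
        return True
    if c.parent_denies_delete_child:
        return False
    if c.parent_allows_add_file:
        if c.parent_sticky:
            # The sticky bit restricts deletion to owners or writers.
            return (c.principal_owns_parent
                    or c.principal_owns_target
                    or c.target_allows_write_data)
        return True
    return False
```

The order of the tests encodes the stated precedence: the explicit delete bits are consulted before the sticky bit, and the sticky bit is consulted only once ACE4_ADD_FILE has been allowed on the parent.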
6.2.1.4. ACE flag 6.2.1.4. ACE flag
The bitmask constants used for the flag field are as follows: The bitmask constants used for the flag field are as follows:
const ACE4_FILE_INHERIT_ACE = 0x00000001; const ACE4_FILE_INHERIT_ACE = 0x00000001;
const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002;
const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004;
const ACE4_INHERIT_ONLY_ACE = 0x00000008; const ACE4_INHERIT_ONLY_ACE = 0x00000008;
const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020;
const ACE4_IDENTIFIER_GROUP = 0x00000040; const ACE4_IDENTIFIER_GROUP = 0x00000040;
const ACE4_INHERITED_ACE = 0x00000080; const ACE4_INHERITED_ACE = 0x00000080;
A server need not support any of these flags. If the server supports A server need not support any of these flags. If the server supports
flags that are similar to, but not exactly the same as, these flags, flags that are similar to, but not exactly the same as, these flags,
the implementation may define a mapping between the protocol-defined the implementation may define a mapping between the protocol-defined
flags and the implementation-defined flags. Again, the guiding flags and the implementation-defined flags.
principle is that the file not appear to be more secure than it
really is.
For example, suppose a client tries to set an ACE with For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
server does not support any form of ACL inheritance, the server server does not support any form of ACL inheritance, the server
should reject the request with NFS4ERR_ATTRNOTSUPP. If the server should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
supports a single "inherit ACE" flag that applies to both files and supports a single "inherit ACE" flag that applies to both files and
directories, the server may reject the request (i.e., requiring the directories, the server may reject the request (i.e., requiring the
client to set both the file and directory inheritance flags). The client to set both the file and directory inheritance flags). The
server may also accept the request and silently turn on the server may also accept the request and silently turn on the
ACE4_DIRECTORY_INHERIT_ACE flag. ACE4_DIRECTORY_INHERIT_ACE flag.
6.2.1.4.1. Discussion of Flag Bits 6.2.1.4.1. Discussion of Flag Bits
ACE4_FILE_INHERIT_ACE ACE4_FILE_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be Any non-directory file in any sub-directory will get this ACE
added to each new non-directory file created. inherited.
ACE4_DIRECTORY_INHERIT_ACE ACE4_DIRECTORY_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be Can be placed on a directory and indicates that this ACE should be
added to each new directory created. added to each new directory created.
If this flag is set in an ACE in an ACL attribute to be set on a
non-directory file system object, the operation attempting to set
the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP.
ACE4_INHERIT_ONLY_ACE ACE4_INHERIT_ONLY_ACE
Can be placed on a directory but does not apply to the directory; Can be placed on a directory but does not apply to the directory;
ALLOW and DENY ACEs with this bit set do not affect access to the ALLOW and DENY ACEs with this bit set do not affect access to the
directory, and AUDIT and ALARM ACEs with this bit set do not directory, and AUDIT and ALARM ACEs with this bit set do not
trigger log or alarm events. Such ACEs only take effect once they trigger log or alarm events. Such ACEs only take effect once they
are applied (with this bit cleared) to newly created files and are applied (with this bit cleared) to newly created files and
directories as specified by the above two flags. directories as specified by the above two flags.
If this flag is present on an ACE, but neither
ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present,
then an operation attempting to set such an attribute SHOULD fail
with NFS4ERR_ATTRNOTSUPP.
ACE4_NO_PROPAGATE_INHERIT_ACE ACE4_NO_PROPAGATE_INHERIT_ACE
Can be placed on a directory. This flag tells the server that Can be placed on a directory. This flag tells the server that
inheritance of this ACE should stop at newly created child inheritance of this ACE should stop at newly created child
directories. directories.
ACE4_INHERITED_ACE ACE4_INHERITED_ACE
Indicates that this ACE is inherited from a parent directory. A Indicates that this ACE is inherited from a parent directory. A
server that supports automatic inheritance will place this flag on server that supports automatic inheritance will place this flag on
any ACEs inherited from the parent directory when creating a new any ACEs inherited from the parent directory when creating a new
skipping to change at page 114, line 42 skipping to change at page 125, line 37
event occurs. If the operation failed, and if the FAILED flag was event occurs. If the operation failed, and if the FAILED flag was
set for the matching AUDIT or ALARM ACE, then the appropriate set for the matching AUDIT or ALARM ACE, then the appropriate
AUDIT or ALARM event occurs. Either or both of the SUCCESS or AUDIT or ALARM event occurs. Either or both of the SUCCESS or
FAILED can be set, but if neither is set, the AUDIT or ALARM ACE FAILED can be set, but if neither is set, the AUDIT or ALARM ACE
is not useful. is not useful.
The previously described processing applies to that of the ACCESS The previously described processing applies to that of the ACCESS
operation as well, the difference being that "success" or operation as well, the difference being that "success" or
"failure" does not mean whether ACCESS returns NFS4_OK or not. "failure" does not mean whether ACCESS returns NFS4_OK or not.
Success means whether ACCESS returns all requested and supported Success means whether ACCESS returns all requested and supported
bits. Failure means whether ACCESS failed to return a bit that bits. Failure means whether ACCESS failed to return at least one
was requested and supported. bit that was requested and supported.
ACE4_IDENTIFIER_GROUP ACE4_IDENTIFIER_GROUP
Indicates that the "who" refers to a GROUP as defined under UNIX Indicates that the "who" refers to a GROUP as defined under UNIX
or a GROUP ACCOUNT as defined under Windows. Clients and servers or a GROUP ACCOUNT as defined under Windows. Clients and servers
must ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who
value equal to one of the special identifiers outlined in value equal to one of the special identifiers outlined in
Section 6.2.1.5. Section 6.2.1.5.
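The flag checks added in the newer text can be sketched as a single validation step applied to each ACE being set. This is an illustration under assumptions: check_ace_flags and Nfs4Error are hypothetical names, the error code value is assumed from the protocol's constant definitions, and SPECIAL_WHO lists only the special identifiers that are visible in this excerpt, so it is not a complete set.

```python
ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002
ACE4_INHERIT_ONLY_ACE = 0x00000008
ACE4_IDENTIFIER_GROUP = 0x00000040
NFS4ERR_ATTRNOTSUPP = 10032  # assumed value, not shown in this excerpt

# Only the special identifiers visible in this excerpt (incomplete).
SPECIAL_WHO = {"OWNER@", "GROUP@", "EVERYONE@",
               "ANONYMOUS@", "AUTHENTICATED@", "SERVICE@"}


class Nfs4Error(Exception):
    """Hypothetical exception carrying an NFSv4 status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def check_ace_flags(flag, who, is_directory):
    """Validate the flag word of one ACE and return the effective flags."""
    # Directory inheritance makes no sense on a non-directory object.
    if flag & ACE4_DIRECTORY_INHERIT_ACE and not is_directory:
        raise Nfs4Error(NFS4ERR_ATTRNOTSUPP)
    # INHERIT_ONLY requires at least one of the two inherit flags.
    inherit = ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE
    if flag & ACE4_INHERIT_ONLY_ACE and not flag & inherit:
        raise Nfs4Error(NFS4ERR_ATTRNOTSUPP)
    # IDENTIFIER_GROUP is ignored for special identifiers.
    if who in SPECIAL_WHO:
        flag &= ~ACE4_IDENTIFIER_GROUP
    return flag
```

Both rejections are SHOULD-level in the text, so a server may reasonably behave differently; the sketch shows the recommended behavior only.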
6.2.1.5. ACE Who 6.2.1.5. ACE Who
The "who" field of an ACE is an identifier that specifies the The "who" field of an ACE is an identifier that specifies the
principal or principals to whom the ACE applies. It may refer to a principal or principals to whom the ACE applies. It may refer to a
user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying
which. which.
skipping to change at page 115, line 41 skipping to change at page 126, line 34
| AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS) | | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS) |
| SERVICE | Access from a system service. | | SERVICE | Access from a system service. |
+---------------+--------------------------------------------------+ +---------------+--------------------------------------------------+
Table 7 Table 7
To avoid conflict, these special identifiers are distinguished by an To avoid conflict, these special identifiers are distinguished by an
appended "@" and should appear in the form "xxxx@" (note: no domain appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@. name after the "@"). For example: ANONYMOUS@.
The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these
special identifiers. When encoding entries with these special
identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero.
6.2.1.5.1. Discussion of EVERYONE@ 6.2.1.5.1. Discussion of EVERYONE@
It is important to note that "EVERYONE@" is not equivalent to the It is important to note that "EVERYONE@" is not equivalent to the
UNIX "other" entity. This is because, by definition, UNIX "other" UNIX "other" entity. This is because, by definition, UNIX "other"
does not include the owner or owning group of a file. "EVERYONE@" does not include the owner or owning group of a file. "EVERYONE@"
means literally everyone, including the owner or owning group. means literally everyone, including the owner or owning group.
6.2.2. dacl and sacl Attributes 6.2.2. dacl and sacl Attributes
The dacl and sacl attributes are like the acl attribute, but dacl and The dacl and sacl attributes are like the acl attribute, but dacl and
sacl each allow only certain types of ACEs. The dacl attribute sacl each allow only certain types of ACEs. The dacl attribute
allows just ALLOW and DENY ACEs. The sacl attribute allows just allows just ALLOW and DENY ACEs. The sacl attribute allows just
AUDIT and ALARM ACEs. The dacl and sacl attributes also have AUDIT and ALARM ACEs. The dacl and sacl attributes also support
improved support for automatic inheritance (see Section 6.4.3.2). automatic inheritance (see Section 6.4.3.2).
The separation of ACE types and inheritance support make dacl and
sacl a better choice (over acl) for clients when setting ACEs on a
file.
6.2.3. mode Attribute 6.2.3. mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined: following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */ const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */ const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */ const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */ const MODE4_RUSR = 0x100; /* read permission: owner */
skipping to change at page 116, line 40 skipping to change at page 127, line 35
MODE4_XGRP apply to principals identified in the owner_group MODE4_XGRP apply to principals identified in the owner_group
attribute but who are not identified in the owner attribute. Bits attribute but who are not identified in the owner attribute. Bits
MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does
not match that in the owner attribute, and does not have a group not match that in the owner attribute, and does not have a group
matching that of the owner_group attribute. matching that of the owner_group attribute.
Bits within the mode other than those specified above are not defined Bits within the mode other than those specified above are not defined
by this protocol. A server MUST NOT return bits other than those by this protocol. A server MUST NOT return bits other than those
defined above in a GETATTR or READDIR operation, and it MUST return defined above in a GETATTR or READDIR operation, and it MUST return
NFS4ERR_INVAL if bits other than those defined above are set in a NFS4ERR_INVAL if bits other than those defined above are set in a
SETATTR, CREATE, or OPEN operation. SETATTR, CREATE, OPEN, VERIFY or NVERIFY operation.
6.2.4. mode_set_masked Attribute 6.2.4. mode_set_masked Attribute
The mode_set_masked attribute is a write-only attribute that allows The mode_set_masked attribute is a write-only attribute that allows
individual bits in the mode attribute to be set or reset, without individual bits in the mode attribute to be set or reset, without
changing others. It allows, for example, the bits MODE4_SUID, changing others. It allows, for example, the bits MODE4_SUID,
MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified
any of the nine low-order mode bits devoted to permissions. any of the nine low-order mode bits devoted to permissions.
In such instances that the nine low-order bits are left unmodified,
then neither the acl nor the dacl attribute should be automatically
modified as discussed in Section 6.4.1.
The mode_set_masked attribute consists of two words each in the form The mode_set_masked attribute consists of two words each in the form
of a mode4. The first consists of the value to be applied to the of a mode4. The first consists of the value to be applied to the
current mode value and the second is a mask. Only bits set to one in current mode value and the second is a mask. Only bits set to one in
the mask word are changed (set or reset) in the file's mode. All the mask word are changed (set or reset) in the file's mode. All
other bits in the mode remain unchanged. Bits in the first word that other bits in the mode remain unchanged. Bits in the first word that
correspond to bits which are zero in the mask are ignored, except correspond to bits which are zero in the mask are ignored, except
that undefined bits are checked for validity and can result in that undefined bits are checked for validity and can result in
NFSERR_INVAL as described below. NFS4ERR_INVAL as described below.
The mode_set_masked attribute is only valid in a SETATTR operation. The mode_set_masked attribute is only valid in a SETATTR operation.
If it is used in a CREATE or OPEN operation, the server MUST return If it is used in a CREATE or OPEN operation, the server MUST return
NFS4ERR_INVAL. NFS4ERR_INVAL.
Bits not defined as valid in the mode attribute are not valid in Bits not defined as valid in the mode attribute are not valid in
either word of the mode_set_masked attribute. The server MUST return either word of the mode_set_masked attribute. The server MUST return
NFS4ERR_INVAL if any of those are on in a SETATTR. If the mode and NFS4ERR_INVAL if any of those are on in a SETATTR. If the mode and
mode_set_masked attributes are both specified in the same SETATTR, mode_set_masked attributes are both specified in the same SETATTR,
the server MUST also return NFS4ERR_INVAL. the server MUST also return NFS4ERR_INVAL.
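The two-word update described above amounts to a masked bit replacement with a validity check on both words. A minimal sketch follows; apply_mode_set_masked and Nfs4Error are hypothetical names, and the value of NFS4ERR_INVAL and the assumption that the defined mode bits are exactly the low-order twelve (0xFFF) come from the protocol's constant definitions, which are only partly visible in this excerpt.

```python
MODE4_DEFINED_BITS = 0xFFF  # assumed: SUID/SGID/SVTX plus nine permission bits
NFS4ERR_INVAL = 22          # assumed value, not shown in this excerpt


class Nfs4Error(Exception):
    """Hypothetical exception carrying an NFSv4 status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def apply_mode_set_masked(current_mode, value, mask):
    """Return the new mode after a mode_set_masked SETATTR.

    Undefined bits in either word are an error, even where the mask
    would otherwise cause the value bit to be ignored.
    """
    if (value | mask) & ~MODE4_DEFINED_BITS:
        raise Nfs4Error(NFS4ERR_INVAL)
    # Keep bits outside the mask; take bits inside the mask from value.
    return (current_mode & ~mask) | (value & mask)
```

For example, setting MODE4_SUID on a 0644 file uses value 0x800 with a mask covering only the three high-order bits, which leaves the nine permission bits untouched.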
skipping to change at page 117, line 38 skipping to change at page 128, line 36
6.3.1.1. Server Considerations 6.3.1.1. Server Considerations
The server uses the algorithm described in Section 6.2.1 to determine The server uses the algorithm described in Section 6.2.1 to determine
whether an ACL allows access to an object. However, the ACL may not whether an ACL allows access to an object. However, the ACL may not
be the sole determiner of access. For example: be the sole determiner of access. For example:
o In the case of a file system exported as read-only, the server may o In the case of a file system exported as read-only, the server may
deny write permissions even though an object's ACL grants it. deny write permissions even though an object's ACL grants it.
o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL
permissions in order to prevent the owner from getting into the permissions to prevent a situation from arising in which there is
situation where they can't ever modify the ACL. no valid way to ever modify the ACL.
o All servers will allow a user the ability to read the data of the o All servers will allow a user the ability to read the data of the
file when only the execute permission is granted (i.e., if the ACL file when only the execute permission is granted (i.e., if the ACL
denies the user the ACE4_READ_DATA access and allows the user denies the user the ACE4_READ_DATA access and allows the user
ACE4_EXECUTE, the server will allow the user to read the data of ACE4_EXECUTE, the server will allow the user to read the data of
the file). the file).
o Many servers have the notion of owner-override in which the owner o Many servers have the notion of owner-override in which the owner
of the object is allowed to override accesses that are denied by of the object is allowed to override accesses that are denied by
the ACL. This may be helpful, for example, to allow users the ACL. This may be helpful, for example, to allow users
continued access to open files on which the permissions have continued access to open files on which the permissions have
changed. changed.
o Many servers have the notion of a "superuser" that has privileges
beyond an ordinary user. The superuser may be able to read or
write data or metadata in ways that would not be permitted by the
ACL.
6.3.1.2. Client Considerations 6.3.1.2. Client Considerations
Clients SHOULD NOT do their own access checks based on their Clients SHOULD NOT do their own access checks based on their
interpretation of the ACL, but rather use the OPEN and ACCESS operations interpretation of the ACL, but rather use the OPEN and ACCESS operations
to do access checks. This allows the client to act on the results of to do access checks. This allows the client to act on the results of
having the server determine whether or not access should be granted having the server determine whether or not access should be granted
based on its interpretation of the ACL. based on its interpretation of the ACL.
Clients must be aware of situations in which an object's ACL will Clients must be aware of situations in which an object's ACL will
define a certain access even though the server will not enforce it. define a certain access even though the server will not enforce it.
skipping to change at page 118, line 30 skipping to change at page 129, line 35
access requested. For examples in which the ACL may define accesses access requested. For examples in which the ACL may define accesses
that the server doesn't enforce see Section 6.3.1.1. that the server doesn't enforce see Section 6.3.1.1.
6.3.2. Computing a Mode Attribute from an ACL 6.3.2. Computing a Mode Attribute from an ACL
The following method can be used to calculate the MODE4_R*, MODE4_W* The following method can be used to calculate the MODE4_R*, MODE4_W*
and MODE4_X* bits of a mode attribute, based upon an ACL. and MODE4_X* bits of a mode attribute, based upon an ACL.
1. To determine MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH: 1. To determine MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH:
1. If the special identifier EVERYONE@ is granted A. If the special identifier EVERYONE@ is granted
ACE4_READ_DATA, then the bit MODE4_ROTH SHOULD be set. ACE4_READ_DATA, then the bit MODE4_ROTH SHOULD be set.
Otherwise, MODE4_ROTH SHOULD NOT be set. Otherwise, MODE4_ROTH SHOULD NOT be set.
2. If the special identifier EVERYONE@ is granted B. If the special identifier EVERYONE@ is granted
ACE4_WRITE_DATA or ACE4_APPEND_DATA, then the bit MODE4_WOTH ACE4_WRITE_DATA or ACE4_APPEND_DATA, then the bit MODE4_WOTH
SHOULD be set. Otherwise, MODE4_WOTH SHOULD NOT be set. SHOULD be set. Otherwise, MODE4_WOTH SHOULD NOT be set.
3. If the special identifier EVERYONE@ is granted ACE4_EXECUTE, C. If the special identifier EVERYONE@ is granted ACE4_EXECUTE,
then the bit MODE4_XOTH SHOULD be set. Otherwise, MODE4_XOTH then the bit MODE4_XOTH SHOULD be set. Otherwise, MODE4_XOTH
SHOULD NOT be set. SHOULD NOT be set.
2. To determine MODE4_RGRP, MODE4_WGRP, and MODE4_XGRP, note that 2. To determine MODE4_RGRP, MODE4_WGRP, and MODE4_XGRP, note that
the EVERYONE@ special identifier SHOULD be taken into account. the EVERYONE@ special identifier SHOULD be taken into account.
In other words, when determining if the GROUP@ special identifier In other words, when determining if the GROUP@ special identifier
is granted a permission, ACEs with the identifier EVERYONE@ is granted a permission, ACEs with the identifier EVERYONE@
should take effect just as ACEs with the special identifier should take effect just as ACEs with the special identifier
GROUP@ would. GROUP@ would.
1. If the special identifier GROUP@ is granted ACE4_READ_DATA, A. If the special identifier GROUP@ is granted ACE4_READ_DATA,
then the bit MODE4_RGRP SHOULD be set. Otherwise, MODE4_RGRP then the bit MODE4_RGRP SHOULD be set. Otherwise, MODE4_RGRP
SHOULD NOT be set. SHOULD NOT be set.
2. If the special identifier GROUP@ is granted ACE4_WRITE_DATA B. If the special identifier GROUP@ is granted ACE4_WRITE_DATA
or ACE4_APPEND_DATA, then the bit MODE4_WGRP SHOULD be set. or ACE4_APPEND_DATA, then the bit MODE4_WGRP SHOULD be set.
Otherwise, MODE4_WGRP SHOULD NOT be set. Otherwise, MODE4_WGRP SHOULD NOT be set.
3. If the special identifier GROUP@ is granted ACE4_EXECUTE, C. If the special identifier GROUP@ is granted ACE4_EXECUTE,
then the bit MODE4_XGRP SHOULD be set. Otherwise, MODE4_XGRP then the bit MODE4_XGRP SHOULD be set. Otherwise, MODE4_XGRP
SHOULD NOT be set. SHOULD NOT be set.
3. To determine MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR, note that 3. To determine MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR, note that
the EVERYONE@ special identifier SHOULD be taken into account. the EVERYONE@ special identifier SHOULD be taken into account.
In other words, when determining if the OWNER@ special identifier In other words, when determining if the OWNER@ special identifier
is granted a permission, ACEs with the identifier EVERYONE@ is granted a permission, ACEs with the identifier EVERYONE@
should take effect just as ACEs with the special identifier OWNER@ should take effect just as ACEs with the special identifier OWNER@
would. would.
1. If the special identifier OWNER@ is granted ACE4_READ_DATA, A. If the special identifier OWNER@ is granted ACE4_READ_DATA,
then the bit MODE4_RUSR SHOULD be set. Otherwise, MODE4_RUSR then the bit MODE4_RUSR SHOULD be set. Otherwise, MODE4_RUSR
SHOULD NOT be set. SHOULD NOT be set.
2. If the special identifier OWNER@ is granted ACE4_WRITE_DATA B. If the special identifier OWNER@ is granted ACE4_WRITE_DATA
or ACE4_APPEND_DATA, then the bit MODE4_WUSR SHOULD be set. or ACE4_APPEND_DATA, then the bit MODE4_WUSR SHOULD be set.
Otherwise, MODE4_WUSR SHOULD NOT be set. Otherwise, MODE4_WUSR SHOULD NOT be set.
3. If the special identifier OWNER@ is granted ACE4_EXECUTE, C. If the special identifier OWNER@ is granted ACE4_EXECUTE,
then the bit MODE4_XUSR SHOULD be set. Otherwise, MODE4_XUSR then the bit MODE4_XUSR SHOULD be set. Otherwise, MODE4_XUSR
SHOULD NOT be set. SHOULD NOT be set.
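The three steps above can be sketched as one loop over OWNER@, GROUP@, and EVERYONE@. This is an illustration under stated assumptions: the Ace type and the helper names are hypothetical, the mask and type values are assumed from the protocol's constant definitions, and "is granted" is read as the usual in-order evaluation in which the first applicable ALLOW or DENY ACE that mentions a bit decides it. ACEs with ACE4_INHERIT_ONLY_ACE set, and AUDIT and ALARM ACEs, are skipped because they do not affect access to the object.

```python
from dataclasses import dataclass

# Assumed constant values (not all shown in this excerpt).
ACE4_READ_DATA = 0x00000001
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_EXECUTE = 0x00000020
ACE4_INHERIT_ONLY_ACE = 0x00000008
ALLOW, DENY = 0, 1


@dataclass
class Ace:
    """Hypothetical in-memory form of one ACE."""
    type: int
    flag: int
    mask: int
    who: str


def granted(acl, who, bit):
    """True if `bit` is allowed for special identifier `who`.

    EVERYONE@ ACEs are treated as applying to OWNER@ and GROUP@ too.
    """
    applicable = {who, "EVERYONE@"}
    for ace in acl:
        if ace.type not in (ALLOW, DENY):
            continue  # AUDIT and ALARM never grant or deny
        if ace.flag & ACE4_INHERIT_ONLY_ACE:
            continue
        if ace.who in applicable and ace.mask & bit:
            return ace.type == ALLOW
    return False


def mode_from_acl(acl):
    """Compute the nine low-order mode bits from an ACL."""
    mode = 0
    for who, shift in (("OWNER@", 6), ("GROUP@", 3), ("EVERYONE@", 0)):
        if granted(acl, who, ACE4_READ_DATA):
            mode |= 0o4 << shift
        if (granted(acl, who, ACE4_WRITE_DATA)
                or granted(acl, who, ACE4_APPEND_DATA)):
            mode |= 0o2 << shift
        if granted(acl, who, ACE4_EXECUTE):
            mode |= 0o1 << shift
    return mode
```

Because every requirement in the method is a SHOULD, an implementation that derives the mode differently (for example, following the withdrawn POSIX ACL draft) is not made invalid by this sketch.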
6.3.2.1. Discussion 6.3.2.1. Discussion
The nine low-order mode bits (MODE4_R*, MODE4_W*, MODE4_X*) The nine low-order mode bits (MODE4_R*, MODE4_W*, MODE4_X*)
correspond to ACE4_READ_DATA, ACE4_WRITE_DATA/ACE4_APPEND_DATA, and correspond to ACE4_READ_DATA, ACE4_WRITE_DATA/ACE4_APPEND_DATA, and
ACE4_EXECUTE for OWNER@, GROUP@, and EVERYONE@. On some ACE4_EXECUTE for OWNER@, GROUP@, and EVERYONE@. On some
implementations, mode bits may represent a superset of these implementations, mode bits may represent a superset of these
permissions, e.g. if a specific user is granted ACE4_WRITE_DATA, then permissions, e.g. if a specific user is granted ACE4_WRITE_DATA, then
skipping to change at page 120, line 17 skipping to change at page 131, line 23
In this section, much is made of the methods in Section 6.3.2. Many In this section, much is made of the methods in Section 6.3.2. Many
requirements refer to this section. But note that the methods have requirements refer to this section. But note that the methods have
behaviors specified with "SHOULD". This is intentional, to avoid behaviors specified with "SHOULD". This is intentional, to avoid
invalidating existing implementations that compute the mode according invalidating existing implementations that compute the mode according
to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by
actual permissions on owner, group, and other. actual permissions on owner, group, and other.
6.4.1. Setting the mode and/or ACL Attributes 6.4.1. Setting the mode and/or ACL Attributes
In the case where a server supports the sacl or dacl attribute, in
addition to the acl attribute, the server MUST fail a request to set
the acl attribute simultaneously with a dacl or sacl attribute. The
error to be given is NFS4ERR_ATTRNOTSUPP.
6.4.1.1. Setting mode and not ACL 6.4.1.1. Setting mode and not ACL
When any mode permission bits are subject to change, either because When any of the nine low-order mode permission bits are subject to
the mode attribute was set or because the mode_set_masked attribute change, either because the mode attribute was set or because the
was set and the mask included one or more bits from the low-order mode_set_masked attribute was set and the mask included one or more
nine mode bits that control permissions, and the ACL attribute is not bits from the low-order nine mode bits that control permissions, and
explicitly set, the ACL attribute must be modified in accordance with no ACL attribute is explicitly set, the acl and dacl attributes must
the updated value of the permissions bits within the mode. This must be modified in accordance with the updated value of the permissions
happen even if the value of the permission bits within the mode is bits within the mode. This must happen even if the value of the
the same after the mode is set as before. permission bits within the mode is the same after the mode is set as
before.
In cases in which the permissions bits are subject to change, the ACL Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl
attribute MUST be modified such that the mode computed via the method attribute) are unaffected by changes to the mode.
in Section 6.3.2 yields the low-order nine bits (MODE4_R*, MODE4_W*,
MODE4_X*) of the mode attribute as modified by the attribute change. In cases in which the permissions bits are subject to change, the acl
The ACL SHOULD also be modified such that: and dacl attributes MUST be modified such that the mode computed via
the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*,
MODE4_W*, MODE4_X*) of the mode attribute as modified by the
attribute change. The ACL attributes SHOULD also be modified such
that:
1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL
other than OWNER@ and EVERYONE@ SHOULD NOT be granted other than OWNER@ and EVERYONE@ SHOULD NOT be granted
ACE4_READ_DATA. ACE4_READ_DATA.
2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL
other than OWNER@ and EVERYONE@ SHOULD NOT be granted other than OWNER@ and EVERYONE@ SHOULD NOT be granted
ACE4_WRITE_DATA or ACE4_APPEND_DATA. ACE4_WRITE_DATA or ACE4_APPEND_DATA.
3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL
other than OWNER@ and EVERYONE@ SHOULD NOT be granted other than OWNER@ and EVERYONE@ SHOULD NOT be granted
ACE4_EXECUTE. ACE4_EXECUTE.
Access mask bits other than those listed above, appearing in ALLOW ACEs, Access mask bits other than those listed above, appearing in ALLOW ACEs,
MAY also be disabled. MAY also be disabled.
Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect
the permissions of the ACL itself, nor do ACEs of the type AUDIT and the permissions of the ACL itself, nor do ACEs of the type AUDIT and
ALARM. As such, it is desirable to leave these ACEs unmodified when ALARM. As such, it is desirable to leave these ACEs unmodified when
modifying the ACL attribute. modifying the ACL attributes.
Also note that the requirement may be met by discarding the ACL, in Also note that the requirement may be met by discarding the acl and
favor of an ACL that represents the mode and only the mode. This is dacl, in favor of an ACL that represents the mode and only the mode.
permitted, but it is preferable for a server to preserve as much of This is permitted, but it is preferable for a server to preserve as
the ACL as possible without violating the above requirements. much of the ACL as possible without violating the above requirements.
Discarding the ACL makes it effectively impossible for a file created Discarding the ACL makes it effectively impossible for a file created
with a mode attribute to inherit an ACL (see Section 6.4.3). with a mode attribute to inherit an ACL (see Section 6.4.3).
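
The three numbered rules above, together with the note on inherit-only, AUDIT and ALARM ACEs, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name restrict_acl_to_mode and the dict-based ACE representation are hypothetical, and the sketch covers only the SHOULD NOT rules. It does not attempt the further requirement that the mode computed from the resulting ACL match the new mode.

```python
# Bit values as assigned in the NFSv4 ACE and mode definitions.
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x0
ACE4_INHERIT_ONLY_ACE = 0x08
ACE4_READ_DATA = 0x01
ACE4_WRITE_DATA = 0x02
ACE4_APPEND_DATA = 0x04
ACE4_EXECUTE = 0x20
MODE4_RGRP = 0o040
MODE4_WGRP = 0o020
MODE4_XGRP = 0o010

# Special identifiers exempt from the three rules.
EXEMPT_WHO = ("OWNER@", "EVERYONE@")


def restrict_acl_to_mode(acl, mode):
    """Return a copy of acl with group-class grants removed per the mode.

    acl is a list of dicts with keys "type", "flag", "mask" and "who"
    (a hypothetical in-memory form of an ACE, not a wire format).
    """
    strip = 0
    if not mode & MODE4_RGRP:
        strip |= ACE4_READ_DATA
    if not mode & MODE4_WGRP:
        strip |= ACE4_WRITE_DATA | ACE4_APPEND_DATA
    if not mode & MODE4_XGRP:
        strip |= ACE4_EXECUTE

    result = []
    for ace in acl:
        new_ace = dict(ace)
        affects_access = (
            ace["type"] == ACE4_ACCESS_ALLOWED_ACE_TYPE
            and not ace["flag"] & ACE4_INHERIT_ONLY_ACE
        )
        # Only effective ALLOW ACEs can grant access; DENY, AUDIT, ALARM
        # and inherit-only ACEs are left unmodified.
        if affects_access and ace["who"] not in EXEMPT_WHO:
            new_ace["mask"] = ace["mask"] & ~strip
        result.append(new_ace)
    return result
```

With a mode of 0640, for example, an ALLOW ACE for a named user keeps ACE4_READ_DATA but loses ACE4_WRITE_DATA, ACE4_APPEND_DATA and ACE4_EXECUTE, while the OWNER@ ACE is untouched.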
6.4.1.2. Setting ACL and not mode 6.4.1.2. Setting ACL and not mode
When setting an ACL attribute and not setting the mode or When setting the acl or dacl and not setting the mode or
mode_set_masked attributes, the permission bits of the mode need to mode_set_masked attributes, the permission bits of the mode need to
be derived from the ACL. In this case, the ACL attribute SHOULD be be derived from the ACL. In this case, the ACL attribute SHOULD be
set as given. The nine low-order bits of the mode attribute set as given. The nine low-order bits of the mode attribute
(MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result
of the method in Section 6.3.2. The three high-order bits of the mode of the method in Section 6.3.2. The three high-order bits of the mode
(MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged.
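
The mode update described in this case can be sketched in Python as follows. The function name merge_mode and the parameter acl_derived_mode are hypothetical. The latter stands in for the result of the method in Section 6.3.2, which is not reproduced here.

```python
# Mode bit values as assigned in the NFSv4 mode attribute definition.
MODE4_SUID = 0o4000
MODE4_SGID = 0o2000
MODE4_SVTX = 0o1000
HIGH_BITS = MODE4_SUID | MODE4_SGID | MODE4_SVTX
LOW_NINE_BITS = 0o777  # MODE4_R*, MODE4_W*, MODE4_X*


def merge_mode(old_mode, acl_derived_mode):
    """Keep the three high-order bits; take the low nine from the ACL."""
    return (old_mode & HIGH_BITS) | (acl_derived_mode & LOW_NINE_BITS)
```

For instance, if the object's mode was 04755 and the newly set ACL maps to 0640, the resulting mode would be 04640.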
6.4.1.3. Setting both ACL and mode 6.4.1.3. Setting both ACL and mode
When setting both the mode (includes use of either the mode attribute When setting both the mode (includes use of either the mode attribute
or the mode_set_masked attribute) and the ACL attribute in the same or the mode_set_masked attribute) and the acl or dacl attributes in
operation, the attributes MUST be applied in this order: mode (or the same operation, the attributes MUST be applied in this order:
mode_set_masked), then ACL. The mode-related attribute is set as mode (or mode_set_masked), then ACL. The mode-related attribute is
given, then the ACL attribute is set as given, possibly changing the set as given, then the ACL attribute is set as given, possibly
final mode, as described above in Section 6.4.1.2. changing the final mode, as described above in Section 6.4.1.2.
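
The ordering requirement can be illustrated with a short Python sketch. It assumes a hypothetical dict-based object with "mode" and "acl" keys and a caller-supplied function mode_from_acl standing in for the method in Section 6.3.2. Neither is defined by this document.

```python
HIGH_BITS = 0o7000      # MODE4_SUID | MODE4_SGID | MODE4_SVTX
LOW_NINE_BITS = 0o777   # MODE4_R*, MODE4_W*, MODE4_X*


def apply_mode_then_acl(obj, new_mode, new_acl, mode_from_acl):
    """Apply mode first, then the ACL, recomputing the low nine mode bits.

    mode_from_acl is a hypothetical callable returning the mode bits
    that the given ACL maps to.
    """
    # Step 1: the mode-related attribute is set as given.
    obj["mode"] = new_mode
    # Step 2: the ACL is set as given ...
    obj["acl"] = list(new_acl)
    # ... which may change the final low-order nine bits of the mode,
    # while the three high-order bits from step 1 are retained.
    derived = mode_from_acl(obj["acl"]) & LOW_NINE_BITS
    obj["mode"] = (obj["mode"] & HIGH_BITS) | derived
    return obj
```

Because the ACL is applied last, the low-order nine bits supplied in the mode survive only if they agree with what the ACL maps to.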
6.4.2. Retrieving the mode and/or ACL Attributes 6.4.2. Retrieving the mode and/or ACL Attributes
This section applies only to servers that support both the mode and This section applies only to servers that support both the mode and
the ACL attribute. ACL attributes.
Some server implementations may have a concept of "objects without Some server implementations may have a concept of "objects without
ACLs", meaning that all permissions are granted and denied according ACLs", meaning that all permissions are granted and denied according
to the mode attribute, and that no ACL attribute is stored for that to the mode attribute, and that no ACL attribute is stored for that
object. If an ACL attribute is requested of such a server, the object. If an ACL attribute is requested of such a server, the
server SHOULD return an ACL that does not conflict with the mode; server SHOULD return an ACL that does not conflict with the mode;
that is to say, the ACL returned SHOULD represent the nine low-order that is to say, the ACL returned SHOULD represent the nine low-order
bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as
described in Section 6.3.2. described in Section 6.3.2.
For other server implementations, the ACL attribute is always present For other server implementations, the ACL attribute is always present
for every object. Such servers SHOULD store at least the three high- for every object. Such servers SHOULD store at least the three high-
order bits of the mode attribute (MODE4_SUID, MODE4_SGID, order bits of the mode attribute (MODE4_SUID, MODE4_SGID,
MODE4_SVTX). The server SHOULD return a mode attribute if one is MODE4_SVTX). The server SHOULD return a mode attribute if one is
requested, and the low-order nine bits of the mode (MODE4_R*, requested, and the low-order nine bits of the mode (MODE4_R*,
MODE4_W*, MODE4_X*) MUST match the result of applying the method in MODE4_W*, MODE4_X*) MUST match the result of applying the method in
Section 6.3.2 to the ACL attribute. Section 6.3.2 to the ACL attribute.
6.4.3. Creating New Objects 6.4.3. Creating New Objects
If a server supports the ACL attribute, it may use the ACL attribute If a server supports any ACL attributes, it may use the ACL
on the parent directory to compute an initial ACL attribute for a attributes on the parent directory to compute an initial ACL
newly created object. This will be referred to as the inherited ACL attribute for a newly created object. This will be referred to as
within this section. The act of adding one or more ACEs to the the inherited ACL within this section. The act of adding one or more
inherited ACL that are based upon ACEs in the parent directory's ACL ACEs to the inherited ACL that are based upon ACEs in the parent
will be referred to as inheriting an ACE within this section. directory's ACL will be referred to as inheriting an ACE within this
section.
Implementors should standardize on what the behavior of CREATE and Implementors should standardize on what the behavior of CREATE and
OPEN must be depending on the presence or absence of the mode and ACL OPEN must be depending on the presence or absence of the mode and ACL
attributes. attributes.
1. If just mode is given: 1. If just the mode is given in the call:
In this case, inheritance SHOULD take place, but the mode MUST be In this case, inheritance SHOULD take place, but the mode MUST be
applied to the inherited ACL as described in Section 6.4.1.1, applied to the inherited ACL as described in Section 6.4.1.1,
thereby modifying the ACL. thereby modifying the ACL.
2. If just ACL is given: 2. If just the ACL is given in the call:
In this case, inheritance SHOULD NOT take place, and the ACL as In this case, inheritance SHOULD NOT take place, and the ACL as
defined in the CREATE or OPEN will be set without modification, defined in the CREATE or OPEN will be set without modification,
and the mode modified as in Section 6.4.1.2. and the mode modified as in Section 6.4.1.2.
3. If both mode and ACL are given: 3. If both mode and ACL are given in the call:
In this case, inheritance SHOULD NOT take place, and both In this case, inheritance SHOULD NOT take place, and both
attributes will be set as described in Section 6.4.1.3. attributes will be set as described in Section 6.4.1.3.
4. If neither mode nor ACL is given: 4. If neither mode nor ACL is given in the call:
In the case where an object is being created without any initial In the case where an object is being created without any initial
attributes at all, e.g. an OPEN operation with an opentype4 of attributes at all, e.g. an OPEN operation with an opentype4 of
OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD
NOT take place. Instead, the server SHOULD set permissions to NOT take place. Instead, the server SHOULD set permissions to
deny all access to the newly created object. It is expected that deny all access to the newly created object. It is expected that
the appropriate client will set the desired attributes in a the appropriate client will set the desired attributes in a
subsequent SETATTR operation, and the server SHOULD allow that subsequent SETATTR operation, and the server SHOULD allow that
operation to succeed, regardless of what permissions the object operation to succeed, regardless of what permissions the object
is created with. For example, an empty ACL denies all is created with. For example, an empty ACL denies all
permissions, but the server should allow the owner's SETATTR to permissions, but the server should allow the owner's SETATTR to
succeed even though WRITE_ACL is implicitly denied. succeed even though WRITE_ACL is implicitly denied.
In other cases, inheritance SHOULD take place, and no In other cases, inheritance SHOULD take place, and no
modifications to the ACL will happen. The mode attribute, if modifications to the ACL will happen. The mode attribute, if
supported, MUST be as computed in Section 6.3.2, with the supported, MUST be as computed in Section 6.3.2, with the
MODE4_SUID, MODE4_SGID and MODE4_SVTX bits clear. It is worth MODE4_SUID, MODE4_SGID and MODE4_SVTX bits clear. If no
noting that if no inheritable ACEs exist on the parent directory, inheritable ACEs exist on the parent directory, the rules for
the file will be created with an empty ACL, thus granting no creating acl, dacl or sacl attributes are implementation defined.