draft-ietf-nfsv4-minorversion1-15.txt   draft-ietf-nfsv4-minorversion1-16.txt 
NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: May 15, 2008                                            Editors
                                                       November 12, 2007

                         NFSv4 Minor Version 1
                draft-ietf-nfsv4-minorversion1-16.txt
Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 15, 2008.
Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract
   This Internet-Draft describes NFSv4 minor version one, including
   features retained from the base protocol and protocol extensions
   made subsequently.  The current draft includes description of the
   major

skipping to change at page 2, line 41
     2.4.  Client Identifiers and Client Owners  . . . . . . . . .  22
       2.4.1.  Upgrade from NFSv4.0 to NFSv4.1 . . . . . . . . . .  25
       2.4.2.  Server Release of Client ID . . . . . . . . . . . .  26
       2.4.3.  Resolving Client Owner Conflicts  . . . . . . . . .  26
     2.5.  Server Owners . . . . . . . . . . . . . . . . . . . . .  27
     2.6.  Security Service Negotiation  . . . . . . . . . . . . .  28
       2.6.1.  NFSv4.1 Security Tuples . . . . . . . . . . . . . .  28
       2.6.2.  SECINFO and SECINFO_NO_NAME . . . . . . . . . . . .  28
       2.6.3.  Security Error  . . . . . . . . . . . . . . . . . .  29
     2.7.  Minor Versioning  . . . . . . . . . . . . . . . . . . .  32
     2.8.  Non-RPC-based Security Services . . . . . . . . . . . .  35
       2.8.1.  Authorization . . . . . . . . . . . . . . . . . . .  35
       2.8.2.  Auditing  . . . . . . . . . . . . . . . . . . . . .  35
       2.8.3.  Intrusion Detection . . . . . . . . . . . . . . . .  35
     2.9.  Transport Layers  . . . . . . . . . . . . . . . . . . .  36
       2.9.1.  Required and Recommended Properties of Transports .  36
       2.9.2.  Client and Server Transport Behavior  . . . . . . .  36
       2.9.3.  Ports . . . . . . . . . . . . . . . . . . . . . . .  38
     2.10. Session . . . . . . . . . . . . . . . . . . . . . . . .  38
       2.10.1.  Motivation and Overview  . . . . . . . . . . . . .  38
       2.10.2.  NFSv4 Integration  . . . . . . . . . . . . . . . .  39
       2.10.3.  Channels . . . . . . . . . . . . . . . . . . . . .  41
       2.10.4.  Trunking . . . . . . . . . . . . . . . . . . . . .  42
       2.10.5.  Exactly Once Semantics . . . . . . . . . . . . . .  45
       2.10.6.  RDMA Considerations  . . . . . . . . . . . . . . .  57
       2.10.7.  Sessions Security  . . . . . . . . . . . . . . . .  60
       2.10.8.  The SSV GSS Mechanism  . . . . . . . . . . . . . .  65
       2.10.9.  Session Mechanics - Steady State . . . . . . . . .  69
       2.10.10. Session Mechanics - Recovery . . . . . . . . . . .  70
       2.10.11. Parallel NFS and Sessions  . . . . . . . . . . . .  74
   3.  Protocol Data Types . . . . . . . . . . . . . . . . . . . .  74
     3.1.  Basic Data Types  . . . . . . . . . . . . . . . . . . .  74
     3.2.  Structured Data Types . . . . . . . . . . . . . . . . .  76
   4.  Filehandles . . . . . . . . . . . . . . . . . . . . . . . .  86
     4.1.  Obtaining the First Filehandle  . . . . . . . . . . . .  86
       4.1.1.  Root Filehandle . . . . . . . . . . . . . . . . . .  86
       4.1.2.  Public Filehandle . . . . . . . . . . . . . . . . .  87
     4.2.  Filehandle Types  . . . . . . . . . . . . . . . . . . .  87
       4.2.1.  General Properties of a Filehandle  . . . . . . . .  87
       4.2.2.  Persistent Filehandle . . . . . . . . . . . . . . .  88
       4.2.3.  Volatile Filehandle . . . . . . . . . . . . . . . .  88
     4.3.  One Method of Constructing a Volatile Filehandle  . . .  90
     4.4.  Client Recovery from Filehandle Expiration  . . . . . .  90
   5.  File Attributes . . . . . . . . . . . . . . . . . . . . . .  91
     5.1.  Mandatory Attributes  . . . . . . . . . . . . . . . . .  92
     5.2.  Recommended Attributes  . . . . . . . . . . . . . . . .  93
     5.3.  Named Attributes  . . . . . . . . . . . . . . . . . . .  93
     5.4.  Classification of Attributes  . . . . . . . . . . . . .  94
     5.5.  Mandatory Attributes - List and Definition References .  95
     5.6.  Recommended Attributes - List and Definition
           References  . . . . . . . . . . . . . . . . . . . . . .  95
     5.7.  Attribute Definitions . . . . . . . . . . . . . . . . .  96
     5.8.  Interpreting owner and owner_group  . . . . . . . . . . 104
     5.9.  Character Case Attributes . . . . . . . . . . . . . . . 106
     5.10. Directory Notification Attributes . . . . . . . . . . . 106
     5.11. pNFS Attribute Definitions  . . . . . . . . . . . . . . 107
     5.12. Retention Attributes  . . . . . . . . . . . . . . . . . 108
   6.  Security Related Attributes . . . . . . . . . . . . . . . . 111
     6.1.  Goals . . . . . . . . . . . . . . . . . . . . . . . . . 111
     6.2.  File Attributes Discussion  . . . . . . . . . . . . . . 112
       6.2.1.  Attribute 12: acl . . . . . . . . . . . . . . . . . 112
       6.2.2.  Attribute 58: dacl  . . . . . . . . . . . . . . . . 127
       6.2.3.  Attribute 59: sacl  . . . . . . . . . . . . . . . . 127
       6.2.4.  Attribute 33: mode  . . . . . . . . . . . . . . . . 127
       6.2.5.  Attribute 74: mode_set_masked . . . . . . . . . . . 128
     6.3.  Common Methods  . . . . . . . . . . . . . . . . . . . . 128
       6.3.1.  Interpreting an ACL . . . . . . . . . . . . . . . . 128
       6.3.2.  Computing a Mode Attribute from an ACL  . . . . . . 129
     6.4.  Requirements  . . . . . . . . . . . . . . . . . . . . . 130
       6.4.1.  Setting the mode and/or ACL Attributes  . . . . . . 131
       6.4.2.  Retrieving the mode and/or ACL Attributes . . . . . 132
       6.4.3.  Creating New Objects  . . . . . . . . . . . . . . . 133
   7.  Single-server Namespace . . . . . . . . . . . . . . . . . . 136
     7.1.  Server Exports  . . . . . . . . . . . . . . . . . . . . 137
     7.2.  Browsing Exports  . . . . . . . . . . . . . . . . . . . 137
     7.3.  Server Pseudo File System . . . . . . . . . . . . . . . 137
     7.4.  Multiple Roots  . . . . . . . . . . . . . . . . . . . . 138
     7.5.  Filehandle Volatility . . . . . . . . . . . . . . . . . 138
     7.6.  Exported Root . . . . . . . . . . . . . . . . . . . . . 139
     7.7.  Mount Point Crossing  . . . . . . . . . . . . . . . . . 139
     7.8.  Security Policy and Namespace Presentation  . . . . . . 139
   8.  State Management  . . . . . . . . . . . . . . . . . . . . . 141
     8.1.  Client and Session ID . . . . . . . . . . . . . . . . . 141
     8.2.  Stateid Definition  . . . . . . . . . . . . . . . . . . 142
       8.2.1.  Stateid Types . . . . . . . . . . . . . . . . . . . 142
       8.2.2.  Stateid Structure . . . . . . . . . . . . . . . . . 143
       8.2.3.  Special Stateids  . . . . . . . . . . . . . . . . . 144
       8.2.4.  Stateid Lifetime and Validation . . . . . . . . . . 145
       8.2.5.  Stateid Use for IO Operations . . . . . . . . . . . 148
     8.3.  Lease Renewal . . . . . . . . . . . . . . . . . . . . . 149
     8.4.  Crash Recovery  . . . . . . . . . . . . . . . . . . . . 150
       8.4.1.  Client Failure and Recovery . . . . . . . . . . . . 150
       8.4.2.  Server Failure and Recovery . . . . . . . . . . . . 151
       8.4.3.  Network Partitions and Recovery . . . . . . . . . . 155
     8.5.  Server Revocation of Locks  . . . . . . . . . . . . . . 159
     8.6.  Short and Long Leases . . . . . . . . . . . . . . . . . 160
     8.7.  Clocks, Propagation Delay, and Calculating Lease
           Expiration  . . . . . . . . . . . . . . . . . . . . . . 161
     8.8.  Vestigial Locking Infrastructure From V4.0  . . . . . . 161
   9.  File Locking and Share Reservations . . . . . . . . . . . . 162
     9.1.  Opens and Byte-range Locks  . . . . . . . . . . . . . . 163
       9.1.1.  State-owner Definition  . . . . . . . . . . . . . . 163
       9.1.2.  Use of the Stateid and Locking  . . . . . . . . . . 163
     9.2.  Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 166
     9.3.  Upgrading and Downgrading Locks . . . . . . . . . . . . 167
     9.4.  Blocking Locks  . . . . . . . . . . . . . . . . . . . . 167
     9.5.  Share Reservations  . . . . . . . . . . . . . . . . . . 168
     9.6.  OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 169
     9.7.  Open Upgrade and Downgrade  . . . . . . . . . . . . . . 170
     9.8.  Parallel OPENs  . . . . . . . . . . . . . . . . . . . . 170
     9.9.  Reclaim of Open and Byte-range Locks  . . . . . . . . . 171
   10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . 171
     10.1. Performance Challenges for Client-Side Caching  . . . . 172
     10.2. Delegation and Callbacks  . . . . . . . . . . . . . . . 173
       10.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 174
     10.3. Data Caching  . . . . . . . . . . . . . . . . . . . . . 176
       10.3.1. Data Caching and OPENs  . . . . . . . . . . . . . . 177
       10.3.2. Data Caching and File Locking . . . . . . . . . . . 178
       10.3.3. Data Caching and Mandatory File Locking . . . . . . 179
       10.3.4. Data Caching and File Identity  . . . . . . . . . . 180
     10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 181
       10.4.1. Open Delegation and Data Caching  . . . . . . . . . 183
       10.4.2. Open Delegation and File Locks  . . . . . . . . . . 184
       10.4.3. Handling of CB_GETATTR  . . . . . . . . . . . . . . 185
       10.4.4. Recall of Open Delegation . . . . . . . . . . . . . 188
       10.4.5. Clients that Fail to Honor Delegation Recalls . . . 190
       10.4.6. Delegation Revocation . . . . . . . . . . . . . . . 190
       10.4.7. Delegations via WANT_DELEGATION . . . . . . . . . . 191
     10.5. Data Caching and Revocation . . . . . . . . . . . . . . 192
       10.5.1. Revocation Recovery for Write Open Delegation . . . 192
     10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 193
     10.7. Data and Metadata Caching and Memory Mapped Files . . . 195
     10.8. Name and Directory Caching without Directory
           Delegations . . . . . . . . . . . . . . . . . . . . . . 197
       10.8.1. Name Caching  . . . . . . . . . . . . . . . . . . . 197
       10.8.2. Directory Caching . . . . . . . . . . . . . . . . . 199
     10.9. Directory Delegations . . . . . . . . . . . . . . . . . 200
       10.9.1. Introduction to Directory Delegations . . . . . . . 200
       10.9.2. Directory Delegation Design . . . . . . . . . . . . 201
       10.9.3. Attributes in Support of Directory Notifications  . 202
       10.9.4. Directory Delegation Recall . . . . . . . . . . . . 202
       10.9.5. Directory Delegation Recovery . . . . . . . . . . . 202
   11. Multi-Server Namespace  . . . . . . . . . . . . . . . . . . 202
     11.1. Location Attributes . . . . . . . . . . . . . . . . . . 203
     11.2. File System Presence or Absence . . . . . . . . . . . . 203
     11.3. Getting Attributes for an Absent File System  . . . . . 204
       11.3.1. GETATTR Within an Absent File System  . . . . . . . 205
       11.3.2. READDIR and Absent File Systems . . . . . . . . . . 206
     11.4. Uses of Location Information  . . . . . . . . . . . . . 206
       11.4.1. File System Replication . . . . . . . . . . . . . . 207
       11.4.2. File System Migration . . . . . . . . . . . . . . . 208
       11.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 209
     11.5. Additional Client-side Considerations . . . . . . . . . 211
     11.6. Effecting File System Transitions . . . . . . . . . . . 211
       11.6.1. File System Transitions and Simultaneous Access . . 213
       11.6.2. Simultaneous Use and Transparent Transitions  . . . 213
       11.6.3. Filehandles and File System Transitions . . . . . . 216
       11.6.4. Fileids and File System Transitions . . . . . . . . 216
       11.6.5. Fsids and File System Transitions . . . . . . . . . 217
       11.6.6. The Change Attribute and File System Transitions  . 218
       11.6.7. Lock State and File System Transitions  . . . . . . 219
       11.6.8. Write Verifiers and File System Transitions . . . . 222
       11.6.9. Readdir Cookies and Verifiers and File System
               Transitions . . . . . . . . . . . . . . . . . . . . 223
       11.6.10. File System Data and File System Transitions . . . 223
     11.7. Effecting File System Referrals . . . . . . . . . . . . 225
       11.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 225
       11.7.2. Referral Example (READDIR)  . . . . . . . . . . . . 229
     11.8. The Attribute fs_locations  . . . . . . . . . . . . . . 231
     11.9. The Attribute fs_locations_info . . . . . . . . . . . . 233
       11.9.1. The fs_locations_server4 Structure  . . . . . . . . 237
       11.9.2. The fs_locations_info4 Structure  . . . . . . . . . 242
       11.9.3. The fs_locations_item4 Structure  . . . . . . . . . 243
     11.10. The Attribute fs_status  . . . . . . . . . . . . . . . 245
   12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . 249
     12.1. Introduction  . . . . . . . . . . . . . . . . . . . . . 249
     12.2. pNFS Definitions  . . . . . . . . . . . . . . . . . . . 251
       12.2.1. Metadata  . . . . . . . . . . . . . . . . . . . . . 251
       12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 251
       12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 252
       12.2.4. Storage Device  . . . . . . . . . . . . . . . . . . 252
       12.2.5. Storage Protocol  . . . . . . . . . . . . . . . . . 252
       12.2.6. Control Protocol  . . . . . . . . . . . . . . . . . 252
       12.2.7. Layout Types  . . . . . . . . . . . . . . . . . . . 252
       12.2.8. Layout  . . . . . . . . . . . . . . . . . . . . . . 253
       12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 253
       12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . 254
     12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 255
     12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 256
     12.5. Layout Semantics  . . . . . . . . . . . . . . . . . . . 256
       12.5.1. Guarantees Provided by Layouts  . . . . . . . . . . 256
       12.5.2. Getting a Layout  . . . . . . . . . . . . . . . . . 257
       12.5.3. Committing a Layout . . . . . . . . . . . . . . . . 258
       12.5.4. Recalling a Layout  . . . . . . . . . . . . . . . . 261
       12.5.5. Metadata Server Write Propagation . . . . . . . . . 267
     12.6. pNFS Mechanics  . . . . . . . . . . . . . . . . . . . . 267
     12.7. Recovery  . . . . . . . . . . . . . . . . . . . . . . . 269
       12.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 269
       12.7.2. Dealing with Lease Expiration on the Client . . . . 269
       12.7.3. Dealing with Loss of Layout State on the Metadata
               Server  . . . . . . . . . . . . . . . . . . . . . . 271
       12.7.4. Recovery from Metadata Server Restart . . . . . . . 271
       12.7.5. Operations During Metadata Server Grace Period  . . 273
       12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 274
     12.8. Metadata and Storage Device Roles . . . . . . . . . . . 274
     12.9. Security Considerations . . . . . . . . . . . . . . . . 276
   13. PNFS: NFSv4.1 File Layout Type  . . . . . . . . . . . . . . 277
     13.1. Client ID and Session Considerations  . . . . . . . . . 277
     13.2. File Layout Definitions . . . . . . . . . . . . . . . . 278
     13.3. File Layout Data Types  . . . . . . . . . . . . . . . . 279
     13.4. Interpreting the File Layout  . . . . . . . . . . . . . 282
       13.4.1. Interpreting the File Layout Using Sparse Packing . 282
       13.4.2. Interpreting the File Layout Using Dense Packing  . 285
     13.5. Sparse and Dense Stripe Unit Packing  . . . . . . . . . 287
     13.6. Data Server Multipathing  . . . . . . . . . . . . . . . 289
     13.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 290
     13.8. COMMIT Through Metadata Server  . . . . . . . . . . . . 292
     13.9. The Layout Iomode . . . . . . . . . . . . . . . . . . . 293
     13.10. Metadata and Data Server State Coordination  . . . . . 293
       13.10.1. Global Stateid Requirements  . . . . . . . . . . . 293
       13.10.2. Data Server State Propagation  . . . . . . . . . . 295
     13.11. Data Server Component File Size  . . . . . . . . . . . 297
     13.12. Recovery from Loss of Layout . . . . . . . . . . . . . 297
     13.13. Security Considerations for the File Layout Type . . . 298
   14. Internationalization  . . . . . . . . . . . . . . . . . . . 298
     14.1. Stringprep profile for the utf8str_cs type  . . . . . . 300
     14.2. Stringprep profile for the utf8str_cis type . . . . . . 301
     14.3. Stringprep profile for the utf8str_mixed type . . . . . 303
     14.4. UTF-8 Related Errors  . . . . . . . . . . . . . . . . . 304
   15. Error Values  . . . . . . . . . . . . . . . . . . . . . . . 304
     15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 305
     15.2. Operations and their valid errors . . . . . . . . . . . 319
     15.3. Callback operations and their valid errors  . . . . . . 333
     15.4. Errors and the operations that use them . . . . . . . . 334
   16. NFS version 4.1 Procedures  . . . . . . . . . . . . . . . . 341
     16.1. Procedure 0: NULL - No Operation  . . . . . . . . . . . 342
     16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 342
   17. Operations: mandatory or optional . . . . . . . . . . . . . 353
   18. NFS version 4.1 Operations  . . . . . . . . . . . . . . . . 356
     18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 356
     18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 359
     18.3. Operation 5: COMMIT - Commit Cached Data  . . . . . . . 360
     18.4. Operation 6: CREATE - Create a Non-Regular File Object  362
18.4. Operation 6: CREATE - Create a Non-Regular File Object . 349
18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 352 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 365
18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 353 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 366
18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 353 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 366
18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 355 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 368
18.9. Operation 11: LINK - Create Link to a File . . . . . . . 356 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 368
18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 357 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 370
18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 361 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 374
18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 362 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 375
18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 364 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 377
18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 366 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 378
18.15. Operation 17: NVERIFY - Verify Difference in 18.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 367 Attributes . . . . . . . . . . . . . . . . . . . . . . . 379
18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 368 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 380
18.17. Operation 19: OPENATTR - Open Named Attribute 18.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 383 Directory . . . . . . . . . . . . . . . . . . . . . . . 396
18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 384 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 397
18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 386 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 398
18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 386 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 399
18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 388 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 400
18.22. Operation 25: READ - Read from File . . . . . . . . . . 389 18.22. Operation 25: READ - Read from File . . . . . . . . . . 400
18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 391 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 402
18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 395 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 406
18.25. Operation 28: REMOVE - Remove File System Object . . . . 396 18.25. Operation 28: REMOVE - Remove File System Object . . . . 407
18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 398 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 409
18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 400 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 410
18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 401 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 411
18.29. Operation 33: SECINFO - Obtain Available Security . . . 401 18.29. Operation 33: SECINFO - Obtain Available Security . . . 412
18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 405 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 415
18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 407 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 418
18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 408 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 419
18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 413 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 423
18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 414 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 424
18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 416 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 427
18.36. Operation 43: CREATE_SESSION - Create New Session and 18.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Client ID . . . . . . . . . . . . . . . . . . . 433 Confirm Client ID . . . . . . . . . . . . . . . . . . . 443
18.37. Operation 44: DESTROY_SESSION - Destroy existing 18.37. Operation 44: DESTROY_SESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 443 session . . . . . . . . . . . . . . . . . . . . . . . . 453
18.38. Operation 45: FREE_STATEID - Free stateid with no 18.38. Operation 45: FREE_STATEID - Free stateid with no
locks . . . . . . . . . . . . . . . . . . . . . . . . . 445 locks . . . . . . . . . . . . . . . . . . . . . . . . . 455
18.39. Operation 46: GET_DIR_DELEGATION - Get a directory 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 446 delegation . . . . . . . . . . . . . . . . . . . . . . . 455
18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 450 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 460
18.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 451 18.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 461
18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 453 a layout . . . . . . . . . . . . . . . . . . . . . . . . 463
18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 456 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 467
18.44. Operation 51: LAYOUTRETURN - Release Layout 18.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 460 Information . . . . . . . . . . . . . . . . . . . . . . 470
18.45. Operation 52: SECINFO_NO_NAME - Get Security on 18.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 463 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 474
18.46. Operation 53: SEQUENCE - Supply per-procedure 18.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 465 sequencing and control . . . . . . . . . . . . . . . . . 475
18.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 472 18.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 481
18.48. Operation 55: TEST_STATEID - Test stateids for 18.48. Operation 55: TEST_STATEID - Test stateids for
validity . . . . . . . . . . . . . . . . . . . . . . . . 474 validity . . . . . . . . . . . . . . . . . . . . . . . . 483
18.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 476 18.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 485
18.50. Operation 57: DESTROY_CLIENTID - Destroy existing 18.50. Operation 57: DESTROY_CLIENTID - Destroy existing
client ID . . . . . . . . . . . . . . . . . . . . . . . 479 client ID . . . . . . . . . . . . . . . . . . . . . . . 488
18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
Finished . . . . . . . . . . . . . . . . . . . . . . . . 480 Finished . . . . . . . . . . . . . . . . . . . . . . . . 489
18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 482 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 491
19. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 483 19. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 492
19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 483 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 492
19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 484 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 493
20. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 486 20. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 496
20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 487 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 496
20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 488 20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 497
20.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 489 20.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 498
20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 492 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 502
20.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 496 20.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 505
20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 497 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 506
20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 499 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 509
20.8. Operation 10: CB_RECALL_SLOT - change flow control 20.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 500 limits . . . . . . . . . . . . . . . . . . . . . . . . . 510
20.9. Operation 11: CB_SEQUENCE - Supply backchannel 20.9. Operation 11: CB_SEQUENCE - Supply backchannel
sequencing and control . . . . . . . . . . . . . . . . . 501 sequencing and control . . . . . . . . . . . . . . . . . 511
20.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 504 20.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 513
20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible 20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
lock availability . . . . . . . . . . . . . . . . . . . 505 lock availability . . . . . . . . . . . . . . . . . . . 514
20.12. Operation 10044: CB_ILLEGAL - Illegal Callback 20.12. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 506 Operation . . . . . . . . . . . . . . . . . . . . . . . 516
21. Security Considerations . . . . . . . . . . . . . . . . . . . 507 21. Security Considerations . . . . . . . . . . . . . . . . . . . 516
22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 507 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 517
22.1. Defining new layout types . . . . . . . . . . . . . . . 507 22.1. Defining New Notifications . . . . . . . . . . . . . . . 517
22.2. Named Attribute Definitions . . . . . . . . . . . . . . 508 22.2. Defining new layout types . . . . . . . . . . . . . . . 517
23. References . . . . . . . . . . . . . . . . . . . . . . . . . 509 22.3. Named Attribute Definitions . . . . . . . . . . . . . . 518
23.1. Normative References . . . . . . . . . . . . . . . . . . 509 22.4. Path Variable Definitions . . . . . . . . . . . . . . . 518
23.2. Informative References . . . . . . . . . . . . . . . . . 510 22.4.1. Path Variable Values . . . . . . . . . . . . . . . . 518
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 511 22.4.2. Path Variable Names . . . . . . . . . . . . . . . . 519
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 512 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 519
Intellectual Property and Copyright Statements . . . . . . . . . 514 23.1. Normative References . . . . . . . . . . . . . . . . . . 519
23.2. Informative References . . . . . . . . . . . . . . . . . 520
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 522
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 524
Intellectual Property and Copyright Statements . . . . . . . . . 525
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines of the described in [20]. It generally follows the guidelines of the
minor versioning model laid out in Section 10 of RFC 3530. However, it minor versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
features may be introduced as mandatory in a minor version"). These features may be introduced as mandatory in a minor version"). These
divergences are due to the introduction of the sessions model for divergences are due to the introduction of the sessions model for
managing non-idempotent operations and the RECLAIM_COMPLETE managing non-idempotent operations and the RECLAIM_COMPLETE
operation. These two new features are infrastructural in nature and operation. These two new features are infrastructural in nature and
simplify implementation of existing and other new features. Making simplify implementation of existing and other new features. Making
them optional would add undue complexity to protocol definition and them optional would add undue complexity to protocol definition and
implementation. NFSv4.1 accordingly updates the Minor Versioning implementation. NFSv4.1 accordingly updates the Minor Versioning
skipping to change at page 11, line 43 skipping to change at page 11, line 43
1.4. Overview of NFS version 4.1 Features 1.4. Overview of NFS version 4.1 Features
To provide a reasonable context for the reader, the major features of To provide a reasonable context for the reader, the major features of
the NFS version 4.1 protocol will be reviewed in brief. This will be the NFS version 4.1 protocol will be reviewed in brief. This will be
done to provide an appropriate context for both the reader who is done to provide an appropriate context for both the reader who is
familiar with the previous versions of the NFS protocol and the familiar with the previous versions of the NFS protocol and the
reader who is new to the NFS protocols. For the reader new to the reader who is new to the NFS protocols. For the reader new to the
NFS protocols, there is still a set of fundamental knowledge that is NFS protocols, there is still a set of fundamental knowledge that is
expected. The reader should be familiar with the XDR and RPC expected. The reader should be familiar with the XDR and RPC
protocols as described in [3] and [4]. A basic knowledge of file protocols as described in [2] and [3]. A basic knowledge of file
systems and distributed file systems is expected as well. systems and distributed file systems is expected as well.
This description of version 4.1 features will not distinguish those This description of version 4.1 features will not distinguish those
added in minor version one from those present in the base protocol added in minor version one from those present in the base protocol
but will treat minor version 1 as a unified whole. See Section 1.6 but will treat minor version 1 as a unified whole. See Section 1.6
for a description of the differences between the two minor versions. for a description of the differences between the two minor versions.
1.4.1. RPC and Security 1.4.1. RPC and Security
As with previous versions of NFS, the External Data Representation As with previous versions of NFS, the External Data Representation
(XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
version 4.1 protocol are those defined in [3] and [4]. To meet end- version 4.1 protocol are those defined in [2] and [3]. To meet end-
to-end security requirements, the RPCSEC_GSS framework [5] will be to-end security requirements, the RPCSEC_GSS framework [4] will be
used to extend the basic RPC security. With the use of RPCSEC_GSS, used to extend the basic RPC security. With the use of RPCSEC_GSS,
various mechanisms can be provided to offer authentication, various mechanisms can be provided to offer authentication,
integrity, and privacy to the NFS version 4 protocol. Kerberos V5 integrity, and privacy to the NFS version 4 protocol. Kerberos V5
will be used as described in [6] to provide one security framework. will be used as described in [5] to provide one security framework.
The LIPKEY and SPKM-3 GSS-API mechanisms described in [7] will be The LIPKEY and SPKM-3 GSS-API mechanisms described in [6] will be
used to provide for the use of user password and client/server public used to provide for the use of user password and client/server public
key certificates by the NFS version 4 protocol. With the use of key certificates by the NFS version 4 protocol. With the use of
RPCSEC_GSS, other mechanisms may also be specified and used for NFS RPCSEC_GSS, other mechanisms may also be specified and used for NFS
version 4.1 security. version 4.1 security.
To enable in-band security negotiation, the NFS version 4.1 protocol To enable in-band security negotiation, the NFS version 4.1 protocol
has operations which provide the client a method of querying the has operations which provide the client a method of querying the
server about its policies regarding which security mechanisms must be server about its policies regarding which security mechanisms must be
used for access to the server's file system resources. With this, used for access to the server's file system resources. With this,
the client can securely match the security mechanism that meets the the client can securely match the security mechanism that meets the
skipping to change at page 13, line 10 skipping to change at page 13, line 10
as "layouts", which are integrated into the protocol locking model. as "layouts", which are integrated into the protocol locking model.
Clients direct requests for data access to a set of data servers Clients direct requests for data access to a set of data servers
specified by the layout via a data storage protocol which may be specified by the layout via a data storage protocol which may be
NFSv4.1 or may be another protocol. NFSv4.1 or may be another protocol.
1.4.3. File System Model 1.4.3. File System Model
The general file system model used for the NFS version 4.1 protocol The general file system model used for the NFS version 4.1 protocol
is the same as previous versions. The server file system is is the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as hierarchical with the regular files contained within being treated as
opaque octet streams. In a slight departure, file and directory opaque byte streams. In a slight departure, file and directory names
names are encoded with UTF-8 to deal with the basics of are encoded with UTF-8 to deal with the basics of
internationalization. internationalization.
The NFS version 4.1 protocol does not require a separate protocol to The NFS version 4.1 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle. provide for the initial mapping between path name and filehandle.
All file systems exported by a server are presented as a tree so that All file systems exported by a server are presented as a tree so that
all file systems are reachable from a special per-server global root all file systems are reachable from a special per-server global root
filehandle. This allows LOOKUP operations to be used to perform filehandle. This allows LOOKUP operations to be used to perform
functions previously provided by the MOUNT protocol. The server functions previously provided by the MOUNT protocol. The server
provides any necessary pseudo file systems to bridge any unexported provides any necessary pseudo file systems to bridge any unexported
gaps between exported file systems. gaps between exported file systems.
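The single-root namespace described above can be sketched as follows. This is an illustrative model only, not protocol code: the class names and the export path are hypothetical, and real servers resolve names one LOOKUP operation at a time against filehandles rather than whole paths.

```python
# Sketch: a per-server namespace in which pseudo file system nodes
# bridge the gaps between exported file systems, so that every export
# is reachable from one global root filehandle via LOOKUP alone
# (replacing the separate MOUNT protocol of NFSv2/v3).

class Node:
    def __init__(self, name, exported=False):
        self.name = name
        self.exported = exported   # True for a real exported file system
        self.children = {}         # name -> Node

def add_export(root, path):
    """Create pseudo file system directories so 'path' hangs off the root."""
    node = root
    parts = path.strip("/").split("/")
    for part in parts[:-1]:
        # intermediate components become pseudo-fs bridge nodes
        node = node.children.setdefault(part, Node(part))
    node.children[parts[-1]] = Node(parts[-1], exported=True)

def lookup(root, path):
    """Walk the tree with LOOKUP-style component-by-component resolution."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]  # KeyError if the name is unreachable
    return node

root = Node("/")
add_export(root, "/vol/home")  # hypothetical export path
assert lookup(root, "/vol").exported is False      # pseudo-fs bridge
assert lookup(root, "/vol/home").exported is True  # the real export
```

The point of the sketch is that /vol exists only to make /vol/home reachable from the root; the server synthesizes it because nothing at /vol itself is exported.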
skipping to change at page 14, line 6 skipping to change at page 14, line 6
The acl, sacl, and dacl attributes are a significant set of file The acl, sacl, and dacl attributes are a significant set of file
attributes that make up the Access Control List (ACL) of a file. attributes that make up the Access Control List (ACL) of a file.
These attributes provide for directory and file access control beyond These attributes provide for directory and file access control beyond
the model used in NFS Versions 2 and 3. The ACL definition allows the model used in NFS Versions 2 and 3. The ACL definition allows
for specification of specific sets of permissions for individual for specification of specific sets of permissions for individual
users and groups. In addition, ACL inheritance allows propagation of users and groups. In addition, ACL inheritance allows propagation of
access permissions and restriction down a directory tree as file access permissions and restriction down a directory tree as file
system objects are created. system objects are created.
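The ACL inheritance behavior described above can be illustrated with a minimal sketch. The flag names used here are placeholders, not the protocol's actual ACE inheritance flags, and real inheritance also involves propagation flags this sketch omits:

```python
# Sketch of ACL inheritance: entries on a parent directory that are
# marked inheritable propagate to objects created beneath it, with
# separate control over whether files or subdirectories inherit.

def create_child_acl(parent_acl, is_directory):
    """Compute the ACL a newly created object inherits from its parent."""
    inherited = []
    for ace in parent_acl:
        flags = ace["flags"]
        if is_directory and "dir_inherit" in flags:
            inherited.append(ace)
        elif not is_directory and "file_inherit" in flags:
            inherited.append(ace)
    return inherited

parent = [
    {"who": "alice", "perms": "rw", "flags": {"file_inherit", "dir_inherit"}},
    {"who": "bob",   "perms": "r",  "flags": set()},  # not inheritable
]
# A new file under this directory inherits only alice's entry.
assert [a["who"] for a in create_child_acl(parent, False)] == ["alice"]
```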
One other type of attribute is the named attribute. A named One other type of attribute is the named attribute. A named
attribute is an opaque octet stream that is associated with a attribute is an opaque byte stream that is associated with a
directory or file and referred to by a string name. Named attributes directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate are meant to be used by client applications as a method to associate
application-specific data with a regular file or directory. application-specific data with a regular file or directory.
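A named attribute is thus naturally modeled as a per-object map from a string name to an opaque byte stream. The following sketch uses hypothetical attribute names; in the protocol these attributes live in a hidden directory reached via OPENATTR:

```python
# Sketch: named attributes as application-defined, name-keyed opaque
# byte streams attached to a regular file or directory, separate from
# the object's own data.

class FileObject:
    def __init__(self):
        self.data = b""          # the regular file's own byte stream
        self.named_attrs = {}    # attribute name (str) -> opaque bytes

f = FileObject()
# A backup application might tag the file with its own metadata; the
# server stores the value opaquely and never interprets it.
f.named_attrs["backup.timestamp"] = b"2007-11-12T00:00:00Z"
assert f.named_attrs["backup.timestamp"] == b"2007-11-12T00:00:00Z"
```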
1.4.3.3. Multi-server Namespace 1.4.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation NFS Version 4.1 contains a number of features to allow implementation
of namespaces that cross server boundaries and that allow and of namespaces that cross server boundaries and that allow and
facilitate a non-disruptive transfer of support for individual file facilitate a non-disruptive transfer of support for individual file
systems between servers. They are all based upon attributes that systems between servers. They are all based upon attributes that
skipping to change at page 15, line 34 skipping to change at page 15, line 34
renew that lease. When leases are not promptly renewed locks are renew that lease. When leases are not promptly renewed locks are
subject to revocation. In the event of server reboot, clients have subject to revocation. In the event of server reboot, clients have
the opportunity to safely reclaim their locks within a special grace the opportunity to safely reclaim their locks within a special grace
period. period.
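The lease model sketched below illustrates the renewal and expiry behavior just described. The timing constants are illustrative; the server advertises its actual lease time as an attribute, and the details of revocation and the grace period are specified elsewhere in this document:

```python
# Sketch of lease-based lock timing: a client holds a single lease
# covering all of its locks; any operation by the client renews it,
# and locks under an unrenewed lease become subject to revocation.

LEASE_TIME = 90.0  # seconds; illustrative value only

class ClientLease:
    def __init__(self, now):
        self.last_renewal = now

    def renew(self, now):
        # Any operation issued by the client implicitly renews the lease.
        self.last_renewal = now

    def expired(self, now):
        return now - self.last_renewal > LEASE_TIME

lease = ClientLease(now=0.0)
lease.renew(now=60.0)
assert not lease.expired(now=100.0)  # 40s since renewal, within the lease
assert lease.expired(now=151.0)      # 91s since renewal, lease has lapsed
```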
1.5. General Definitions 1.5. General Definitions
The following definitions are provided for the purpose of providing The following definitions are provided for the purpose of providing
an appropriate context for the reader. an appropriate context for the reader.
Byte This document defines a byte as an octet, i.e. a datum exactly
8 bits in length.

Client The "client" is the entity that accesses the NFS server's Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application which contains the resources. The client may be an application which contains the
logic to access the NFS server directly. The client may also be logic to access the NFS server directly. The client may also be
the traditional operating system client that provides remote file the traditional operating system client that provides remote file
system services for a set of applications. file system services system services for a set of applications.
for a set of applications.
A client is uniquely identified by a Client Owner. A client is uniquely identified by a Client Owner.
With reference to file locking, the client is also the entity that With reference to file locking, the client is also the entity that
maintains a set of locks on behalf of one or more applications. maintains a set of locks on behalf of one or more applications.
This client is responsible for crash or failure recovery for those This client is responsible for crash or failure recovery for those
locks it manages. locks it manages.
Note that multiple clients may share the same transport and Note that multiple clients may share the same transport and
connection and multiple clients may exist on the same network connection and multiple clients may exist on the same network
skipping to change at page 18, line 9 skipping to change at page 18, line 9
2.1. Introduction 2.1. Introduction
NFS version 4.1 (NFSv4.1) relies on core infrastructure common to NFS version 4.1 (NFSv4.1) relies on core infrastructure common to
nearly every operation. This core infrastructure is described in the nearly every operation. This core infrastructure is described in the
remainder of this section. remainder of this section.
2.2. RPC and XDR 2.2. RPC and XDR
The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call
(RPC) application that uses RPC version 2 and the corresponding (RPC) application that uses RPC version 2 and the corresponding
eXternal Data Representation (XDR) as defined in [4] and [3]. eXternal Data Representation (XDR) as defined in [3] and [2].
2.2.1. RPC-based Security 2.2.1. RPC-based Security
Previous NFS versions have been thought of as having a host-based Previous NFS versions have been thought of as having a host-based
authentication model, where the NFS server authenticates the NFS authentication model, where the NFS server authenticates the NFS
client, and trust the client to authenticate all users. Actually, client, and trust the client to authenticate all users. Actually,
NFS has always depended on RPC for authentication. The first form of NFS has always depended on RPC for authentication. The first form of
RPC authentication which required a host-based authentication RPC authentication which required a host-based authentication
approach. NFSv4.1 also depends on RPC for basic security services, approach. NFSv4.1 also depends on RPC for basic security services,
and mandates RPC support for a user-based authentication model. The and mandates RPC support for a user-based authentication model. The
user-based authentication model has user principals authenticated by user-based authentication model has user principals authenticated by
a server, and in turn the server authenticated by user principals. a server, and in turn the server authenticated by user principals.
RPC provides some basic security services which are used by NFSv4.1. RPC provides some basic security services which are used by NFSv4.1.
2.2.1.1. RPC Security Flavors 2.2.1.1. RPC Security Flavors
As described in section 7.2 "Authentication" of [4], RPC security is As described in section 7.2 "Authentication" of [3], RPC security is
encapsulated in the RPC header, via a security or authentication encapsulated in the RPC header, via a security or authentication
flavor, and information specific to the specification of the security flavor, and information specific to the specification of the security
flavor. Every RPC header conveys information used to identify and flavor. Every RPC header conveys information used to identify and
authenticate a client and server. As discussed in Section 2.2.1.1.1, authenticate a client and server. As discussed in Section 2.2.1.1.1,
some security flavors provide additional security services. some security flavors provide additional security services.
NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This
requirement to implement is not a requirement to use.) Other requirement to implement is not a requirement to use.) Other
flavors, such as AUTH_NONE, and AUTH_SYS, MAY be implemented as well. flavors, such as AUTH_NONE, and AUTH_SYS, MAY be implemented as well.
2.2.1.1.1. RPCSEC_GSS and Security Services 2.2.1.1.1. RPCSEC_GSS and Security Services
RPCSEC_GSS ([5]) uses the functionality of GSS-API [8]. This allows RPCSEC_GSS ([4]) uses the functionality of GSS-API [7]. This allows
for the use of various security mechanisms by the RPC layer without for the use of various security mechanisms by the RPC layer without
the additional implementation overhead of adding RPC security the additional implementation overhead of adding RPC security
flavors. flavors.
2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy 2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy
Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate
users on clients to servers, and servers to users. It can also users on clients to servers, and servers to users. It can also
perform integrity checking on the entire RPC message, including the perform integrity checking on the entire RPC message, including the
RPC header, and the arguments or results. Finally, privacy, usually RPC header, and the arguments or results. Finally, privacy, usually
skipping to change at page 19, line 15 skipping to change at page 19, line 15
If privacy is not selected, but integrity is selected, authentication
and identification are enabled.  If integrity and privacy are not
selected, but authentication is enabled, identification is enabled.
RPCSEC_GSS does not provide identification as a separate service.

Although GSS-API has an authentication service distinct from its
privacy and integrity services, GSS-API's authentication service is
not used for RPCSEC_GSS's authentication service.  Instead, each RPC
request and response header is integrity protected with the GSS-API
integrity service, and this allows RPCSEC_GSS to offer per-RPC
authentication and identity.  See [4] for more information.

NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and
authentication service.  NFSv4.1 servers MUST support RPCSEC_GSS's
privacy service.
2.2.1.1.1.2.  Security mechanisms for NFS version 4

RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
security services.  Therefore NFSv4.1 clients and servers MUST
support three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.

skipping to change at page 19, line 40

zero (0) is used, leaving it up to the mechanism or the mechanism's
configuration to use an appropriate level of protection that QOP zero
maps to.  Each mandated mechanism specifies a minimum set of
cryptographic algorithms for implementing integrity and privacy.
NFSv4.1 clients and servers MUST be implemented on operating
environments that comply with the mandatory cryptographic algorithms
of each mandated mechanism.
2.2.1.1.1.2.1.  Kerberos V5

The Kerberos V5 GSS-API mechanism as described in [5] MUST be
implemented with the RPCSEC_GSS services as specified in the
following table:

column descriptions:

1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == RPCSEC_GSS service
5 == NFSv4.1 clients MUST support
6 == NFSv4.1 servers MUST support

1      2         3                 4                      5    6

skipping to change at page 20, line 28
Note that the number and name of the pseudo flavor is presented here
as a mapping aid to the implementor.  Because the NFSv4.1 protocol
includes a method to negotiate security and it understands the GSS-
API mechanism, the pseudo flavor is not needed.  The pseudo flavor is
needed for NFS version 3 since the security negotiation is done via
the MOUNT protocol as described in [23].

2.2.1.1.1.2.2.  LIPKEY

The LIPKEY V5 GSS-API mechanism as described in [6] MUST be
implemented with the RPCSEC_GSS services as specified in the
following table:
1      2         3                 4                      5    6
------------------------------------------------------------------
390006 lipkey    1.3.6.1.5.5.9     rpc_gss_svc_none       yes  yes
390007 lipkey-i  1.3.6.1.5.5.9     rpc_gss_svc_integrity  yes  yes
390008 lipkey-p  1.3.6.1.5.5.9     rpc_gss_svc_privacy    no   yes

2.2.1.1.1.2.3.  SPKM-3 as a security triple

The SPKM-3 GSS-API mechanism as described in [6] MUST be implemented
with the RPCSEC_GSS services as specified in the following table:

1      2         3                 4                      5    6
------------------------------------------------------------------
390009 spkm3     1.3.6.1.5.5.1.3   rpc_gss_svc_none       yes  yes
390010 spkm3i    1.3.6.1.5.5.1.3   rpc_gss_svc_integrity  yes  yes
390011 spkm3p    1.3.6.1.5.5.1.3   rpc_gss_svc_privacy    no   yes
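The LIPKEY and SPKM-3 rows above lend themselves to a simple lookup
table.  The following sketch is illustrative only (the table contents
are taken from this section; the dictionary and function names are
not part of the protocol):

```python
# Hypothetical pseudo-flavor registry built from the LIPKEY and
# SPKM-3 tables above: number -> (name, mechanism OID,
# RPCSEC_GSS service, client MUST support, server MUST support).
PSEUDO_FLAVORS = {
    390006: ("lipkey",   "1.3.6.1.5.5.9",   "rpc_gss_svc_none",      True,  True),
    390007: ("lipkey-i", "1.3.6.1.5.5.9",   "rpc_gss_svc_integrity", True,  True),
    390008: ("lipkey-p", "1.3.6.1.5.5.9",   "rpc_gss_svc_privacy",   False, True),
    390009: ("spkm3",    "1.3.6.1.5.5.1.3", "rpc_gss_svc_none",      True,  True),
    390010: ("spkm3i",   "1.3.6.1.5.5.1.3", "rpc_gss_svc_integrity", True,  True),
    390011: ("spkm3p",   "1.3.6.1.5.5.1.3", "rpc_gss_svc_privacy",   False, True),
}

def server_must_support(flavor):
    # Servers MUST support every listed service, including privacy;
    # clients are not required to support the privacy variants.
    return PSEUDO_FLAVORS[flavor][4]
```

Note the asymmetry the tables encode: column 5 is "no" only for the
privacy variants, since privacy support is mandatory for servers but
not for clients.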
2.2.1.1.1.3.  GSS Server Principal

skipping to change at page 22, line 47

operation using that client ID (eir_clientid as returned from
EXCHANGE_ID) is required to establish and confirm the client ID on
the server.  Establishment of identification by a new incarnation of
the client also has the effect of immediately releasing any locking
state that a previous incarnation of that same client might have had
on the server.  Such released state would include all lock, share
reservation, and layout state, and, where the server does not support
the CLAIM_DELEGATE_PREV claim type, all delegation state associated
with the same client with the same identity.  For discussion of
delegation state recovery, see Section 10.2.1.  For discussion of
layout state recovery see Section 12.7.1.

Releasing such state requires that the server be able to determine
that one client instance is the successor of another.  Where this
cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.3)
and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests.
Client identification is encapsulated in the following Client Owner
structure:

skipping to change at page 23, line 45

o  The string should be selected so that subsequent incarnations
   (e.g. restarts) of the same client cause the client to present the
   same string.  The implementor is cautioned against an approach
   that requires the string to be recorded in a local file because
   this precludes the use of the implementation in an environment
   where there is no local disk and all file access is from an NFS
   version 4 server.

o  The string should be the same for each server network address that
   the client accesses (note: the precise opposite was advised in the
   NFSv4.0 specification [20]).  This way, if a server has multiple
   interfaces, the client can trunk traffic over multiple network
   paths as described in Section 2.10.4.

o  The algorithm for generating the string should not assume that the
   client's network address will not change, unless the client
   implementation knows it is using statically assigned network
   addresses.  This includes changes between client incarnations and
   even changes while the client is still running in its current
   incarnation.  This means that if the client includes just the
   client's network address in the co_ownerid string, there is a real

skipping to change at page 25, line 4
The client ID is assigned by the server (the eir_clientid result from
EXCHANGE_ID) and should be chosen so that it will not conflict with a
client ID previously assigned by the server.  This applies across
server restarts.
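One common way to satisfy the no-conflict-across-restarts requirement
is to embed a per-restart value (for example, the server's boot time)
in the client ID.  The following is a sketch under that assumption;
the class and field names are illustrative and do not appear in the
protocol:

```python
import itertools
import time

class ClientIdAllocator:
    """Illustrative only: generate eir_clientid values that cannot
    collide across server restarts, assuming boot times differ."""

    def __init__(self, boot_time=None):
        # High 32 bits: restart epoch; low 32 bits: per-boot counter.
        self.epoch = (boot_time if boot_time is not None
                      else int(time.time())) & 0xFFFFFFFF
        self.counter = itertools.count(1)

    def next_client_id(self):
        return (self.epoch << 32) | (next(self.counter) & 0xFFFFFFFF)
```

Because the epoch half changes on every restart, a client ID minted
before a reboot can never equal one minted after it, which is exactly
the property the paragraph above asks for.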
In the event of a server restart, a client may find out that its
current client ID is no longer valid when it receives a
NFS4ERR_STALE_CLIENTID error.  The precise circumstances depend on
the characteristics of the sessions involved, specifically whether
the session is persistent (see Section 2.10.5.5), but in each case
the client will receive this error when it attempts to establish a
new session with the existing client ID; the error indicates that a
new client ID must be obtained via EXCHANGE_ID and the new session
established with that client ID.
When a session is not persistent, the client will find out that it
needs to create a new session as a result of getting an
NFS4ERR_BADSESSION error, since the session in question was lost as
part of a server reboot.  When the existing client ID is presented to
a server as part of creating a session and that client ID is not
recognized, as would happen after a server restart, the server will
reject the request with the error NFS4ERR_STALE_CLIENTID.

In the case of the session being persistent, the client will re-
establish communication using the existing session after the restart.
This session will be associated with the existing client ID, but no
new operations can be performed on it.  Operations that were
previously issued but for which no reply had been received may be
reissued to determine whether they had been performed before the
server reboot.  The session in this situation is referred to as
"dead"; when an operation that has not been performed previously
(i.e. one that is not satisfied from the replay cache) is sent, the
error NFS4ERR_DEADSESSION is returned.  In this situation, in order
to perform new operations, the client must establish a new session.
If an attempt is made to establish this new session with the existing
client ID, the server will reject the request with
NFS4ERR_STALE_CLIENTID.
When NFS4ERR_STALE_CLIENTID is received in either of these
situations, the client must obtain a new client ID by use of the
EXCHANGE_ID operation, then use that client ID as the basis of a new
session, and then proceed to any other necessary recovery for the
server restart case (See Section 8.4.2).
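The recovery sequence described in the preceding paragraphs can be
summarized as client-side pseudologic.  This sketch assumes
hypothetical exchange_id/create_session helpers and a StaleClientId
exception standing in for NFS4ERR_STALE_CLIENTID; it is not a
normative state machine:

```python
class StaleClientId(Exception):
    """Stands in for NFS4ERR_STALE_CLIENTID from CREATE_SESSION."""

def recover_after_server_restart(server, client_id, error):
    """Illustrative sketch of the text above: on NFS4ERR_BADSESSION
    or NFS4ERR_DEADSESSION the client first tries a new session with
    its existing client ID; if that client ID is stale, a new one is
    obtained via EXCHANGE_ID."""
    if error in ("NFS4ERR_BADSESSION", "NFS4ERR_DEADSESSION"):
        try:
            return client_id, server.create_session(client_id)
        except StaleClientId:
            pass  # server restarted; fall through to EXCHANGE_ID
    # NFS4ERR_STALE_CLIENTID, received directly or from the
    # CREATE_SESSION attempt above.
    client_id = server.exchange_id()          # obtain new client ID
    session = server.create_session(client_id)
    # ...then proceed with any further server-restart recovery,
    # such as lock reclaim (Section 8.4.2).
    return client_id, session
```

The key point the sketch captures is the ordering: the session errors
merely tell the client to try CREATE_SESSION again, and only
NFS4ERR_STALE_CLIENTID forces a round of EXCHANGE_ID first.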
See the detailed descriptions of EXCHANGE_ID (Section 18.35) and
CREATE_SESSION (Section 18.36) for a complete specification of these
operations.
2.4.1.  Upgrade from NFSv4.0 to NFSv4.1

To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a
client_owner4 in an EXCHANGE_ID with an nfs_client_id4 established
using SETCLIENTID using NFSv4.0, so that an NFSv4.1 client is not

skipping to change at page 30, line 38
non-empty intersection with that of the parent.

c) sec_policy_child ^ sec_policy_parent == {}.  This means that
the set of tuples specified on the security policy of a child
directory may not intersect with that of the parent.  In other
words, there are no restrictions on how the system administrator
may set up these tuples.

For a server to support approach (b) (when the client chooses a
flavor that is not a member of sec_policy_parent) and (c), the put
filehandle operation must NOT return NFS4ERR_WRONGSEC when there is a
security tuple mismatch.  Instead, it should be returned from the
LOOKUP (or OPEN by component name) that follows.
Since the above guideline does not contradict approach (a), it should
be followed in general.  Even if approach (a) is implemented, it is
possible for the security tuple used to be acceptable for the target
of LOOKUP but not for the filehandles used in the put filehandle
operation.  The put filehandle operation could be a PUTROOTFH or
PUTPUBFH, where the client cannot know the security tuples for the
root or public filehandle.  Or the security policy for the filehandle
used by the put filehandle operation could have changed since the
time the filehandle was obtained.
skipping to change at page 32, line 10

2.6.3.1.6.  Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.7.  Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified
in the put filehandle operation.  Therefore the put filehandle
operation must return NFS4ERR_WRONGSEC when there is a security tuple
mismatch.  This avoids the complexity of adding NFS4ERR_WRONGSEC as
an allowable error to every other operation.

A COMPOUND containing the series put filehandle operation +
SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way
for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by
component name).
2.6.3.1.8.  Operations after SECINFO and SECINFO_NO_NAME

Placing an operation that uses the current filehandle after SECINFO
or SECINFO_NO_NAME seemingly introduces an issue with what error to
return when the security tuple of the request is not allowed for the
operation that uses the current filehandle.  For example, suppose a
client sends a COMPOUND procedure containing the series of
operations SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the
security tuple used does not match that required for the target file.
By rule (see Section 2.6.3.1.5), neither PUTFH nor SECINFO_NO_NAME
can return NFS4ERR_WRONGSEC.  By rule (see Section 2.6.3.1.7), READ
cannot return NFS4ERR_WRONGSEC.  The issue is resolved by the fact
that SECINFO and SECINFO_NO_NAME consume the current filehandle.
This leaves no current filehandle for READ to use, and READ returns
NFS4ERR_NOFILEHANDLE.
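The resolution above, where SECINFO and SECINFO_NO_NAME consume the
current filehandle so that a following READ fails with
NFS4ERR_NOFILEHANDLE rather than requiring NFS4ERR_WRONGSEC, can be
illustrated with a toy evaluator.  This tracks only the current
filehandle and is in no way the full COMPOUND semantics:

```python
def run_compound(ops):
    """Toy model of COMPOUND evaluation: track only the current
    filehandle.  Returns the error produced by the first failing
    operation, or None if all operations succeed."""
    current_fh = None
    for op, *args in ops:
        if op == "PUTFH":
            current_fh = args[0]
        elif op in ("SECINFO", "SECINFO_NO_NAME"):
            if current_fh is None:
                return "NFS4ERR_NOFILEHANDLE"
            current_fh = None      # these operations consume the FH
        elif op == "READ":
            if current_fh is None:
                return "NFS4ERR_NOFILEHANDLE"
        # SEQUENCE and all other operations are left out of this
        # sketch.
    return None

# The example from the text: PUTFH, SECINFO_NO_NAME, READ.
err = run_compound([("PUTFH", "fh1"), ("SECINFO_NO_NAME",), ("READ",)])
# err == "NFS4ERR_NOFILEHANDLE": READ never needs NFS4ERR_WRONGSEC.
```

Since SECINFO_NO_NAME cleared the current filehandle, the READ that
follows has nothing to operate on, which is precisely the behavior
the section describes.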
2.7.  Minor Versioning

To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be
documented in a standards track RFC.  Therefore, each minor version
number will correspond to an RFC.  Minor version zero of the NFS
version 4 protocol is represented by [20], and minor version one is
represented by this document [[Comment.1: change "document" to "RFC"
when we publish]].  The COMPOUND and CB_COMPOUND procedures support
the encoding of the minor version being requested by the client.
The following items represent the basic rules for the development of
minor versions.  Note that a future minor version may decide to
modify or add to the following rules as part of the minor version
definition.

1.  Procedures are not added or deleted

skipping to change at page 36, line 19
NFSv4.1 works over RDMA and non-RDMA-based transports with the
following attributes:

o  The transport supports reliable delivery of data, which NFSv4.1
   requires but neither NFSv4.1 nor RPC has facilities for ensuring
   [24].

o  The transport delivers data in the order it was sent.  Ordered
   delivery simplifies detection of transmit errors, and simplifies
   the sending of arbitrary sized requests and responses, via the
   record marking protocol [3].
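The record marking protocol referred to above delimits RPC messages
on a byte-stream transport: each record fragment is preceded by a
four-byte big-endian header whose high bit marks the final fragment
of a record and whose low 31 bits carry the fragment length.  A
minimal encoder sketch (the function name is illustrative):

```python
import struct

def frame_record(data, last=True):
    """Frame one record-marking fragment: a 4-byte big-endian header
    (high bit = last fragment of the record, low 31 bits = fragment
    length) followed by the fragment data itself."""
    header = (0x80000000 if last else 0) | (len(data) & 0x7FFFFFFF)
    return struct.pack(">I", header) + data
```

Because each fragment announces its own length, a receiver on an
ordered, reliable stream can recover message boundaries without any
out-of-band signaling, which is why ordered delivery matters here.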
Where an NFS version 4 implementation supports operation over the IP
network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols.  At the
time this document was written, the only two transports that had the
above attributes were TCP and SCTP.  To enhance the possibilities for
interoperability, an NFS version 4 implementation MUST support
operation over the TCP transport protocol.

Even if NFS version 4 is used over a non-IP network protocol, it is

skipping to change at page 37, line 12
client and server to maintain a client-created backchannel (see
Section 2.10.3.1) for the server to use.

In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure:

o  A requester MUST NOT retry a request unless the connection the
   request was issued over was lost before the reply was received.

o  A replier MUST NOT silently drop a request, even if the request is
   a retry.  (The silent drop behavior of RPCSEC_GSS [4] does not
   apply because this behavior happens at the RPCSEC_GSS layer, a
   lower layer in the request processing.)  Instead, the replier
   SHOULD return an appropriate error (see Section 2.10.5.1) or it
   MAY disconnect the connection.
When sending a reply, the replier MUST send the reply to the same
full network address (e.g. if using an IP-based transport, the source
port of the requester is part of the full network address) that the
requester issued the request from.  If using a connection-oriented
transport, replies MUST be sent on the same connection the request
was received from.

If a connection is dropped after the replier receives the request but
before the replier sends the reply, the replier might have a pending
reply.  If a connection is established with the same source and
destination full network address as the dropped connection, then the
replier MUST NOT send the reply until the client retries the request.
The reason for this prohibition is that the client MAY retry a
request over a different connection that is associated with the
session.
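The hold-the-reply rule in the paragraph above amounts to a small
piece of replier-side bookkeeping: a reply orphaned by a dropped
connection is parked, keyed by the request it answers, and released
only when the requester retries.  A sketch (the class, method names,
and use of an XID as the key are all illustrative assumptions):

```python
class Replier:
    """Sketch of the rule above: if the connection died between
    receiving a request and sending its reply, the reply is parked
    and released only when the requester retries that request."""

    def __init__(self):
        self.pending = {}           # request key -> reply held back

    def connection_lost_before_reply(self, xid, reply):
        # MUST NOT push this reply onto a new connection unprompted.
        self.pending[xid] = reply

    def on_request(self, xid):
        # A retry of a parked request releases the held reply; a real
        # server would consult the session replay cache here.
        return self.pending.pop(xid, None)
```

Waiting for the retry ensures the reply goes out on whichever
connection the client actually chose for the retry, rather than a
connection the client may no longer be reading.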
When using RDMA transports there are other reasons for not tolerating
retries over the same connection:

o  RDMA transports use "credits" to enforce flow control, where a
   credit is a right to a peer to transmit a message.  If one peer
   were to retransmit a request (or reply), it would consume an
   additional credit.  If the replier retransmitted a reply, it would
   certainly result in an RDMA connection loss, since the requester
   would typically only post a single receive buffer for each
   request.  If the requester retransmitted a request, the additional

skipping to change at page 38, line 47
shortfalls with practical solutions:

o  EOS is enabled by a reply cache with a bounded size, making it
   feasible to keep the cache in persistent storage and enable EOS
   through server failure and recovery.  One reason that previous
   revisions of NFS did not support EOS was because some EOS
   approaches often limited parallelism.  As will be explained in
   Section 2.10.5, NFSv4.1 supports both EOS and unlimited
   parallelism.

o  The NFSv4.1 client (defined in Section 1.5, Paragraph 2) creates
   transport connections and provides them to the server to use for
   sending callback requests, thus solving the firewall issue
   (Section 18.34).  Races between responses from client requests,
   and callbacks caused by the requests, are detected via the
   session's sequencing properties which are a consequence of EOS
   (Section 2.10.5.3).

o  The NFSv4.1 client can add an arbitrary number of connections to
   the session, and thus provide trunking (Section 2.10.4).
skipping to change at page 39, line 46 skipping to change at page 40, line 41
COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE
operation. CB_COMPOUND also has an additional field called operation. CB_COMPOUND also has an additional field called
"callback_ident", which is superfluous in NFSv4.1 and MUST be ignored "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored
by the client. CB_SEQUENCE has the same information as SEQUENCE, and by the client. CB_SEQUENCE has the same information as SEQUENCE, and
also includes other information needed to resolve callback races also includes other information needed to resolve callback races
(Section 2.10.5.3). (Section 2.10.5.3).
2.10.2.2. Client ID and Session Association 2.10.2.2. Client ID and Session Association
Each client ID (Section 2.4) can have zero or more active sessions. Each client ID (Section 2.4) can have zero or more active sessions.
A client ID, and associated session are required to perform file A client ID and associated session are required to perform file
access in NFSv4.1. Each time a session is used (whether by a client access in NFSv4.1. Each time a session is used (whether by a client
sending a request to the server, or the client replying to a callback sending a request to the server, or the client replying to a callback
request from the server), the state leased to its associated client request from the server), the state leased to its associated client
ID is automatically renewed. ID is automatically renewed.
State such as share reservations, locks, delegations, and layouts
(Section 1.4.4) is tied to the client ID.  Client state is not tied
to any individual session.  Successive state changing operations from
a given state owner MAY go over different sessions, provided the
session is associated with the same client ID.  A callback MAY arrive

new request.  This way, the replier may be able to retire slot
entries faster.  However, where the replier is actively adjusting its
granted highest_slotid, it will not be able to use only the receipt
of the slot id and highest_slotid in the request.  Neither the slot
id nor the highest_slotid used in a request may reflect the replier's
current idea of the requester's session limit, because the request
may have been sent from the requester before the update was received.
Therefore, in the downward adjustment case, the replier may have to
retain a number of reply cache entries at least as large as the old
value of maximum requests outstanding, until it can infer that the
requester has seen a reply containing the new granted highest_slotid.
The replier can infer that the requester has seen such a reply when
it receives a new request with the same slotid as the request replied
to and the next higher sequenceid.

2.10.5.1.1.  Errors from SEQUENCE and CB_SEQUENCE

Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence id of
the slot MUST NOT change.  The replier MUST NOT modify the reply
cache entry for the slot whenever an error is returned from SEQUENCE
or CB_SEQUENCE.

2.10.5.1.2.  Optional Reply Caching

another request.  If it does not wait for a reply, then the client
does not know what sequence id to use for the slot on its next
request.  For example, suppose a client sends a request with sequence
id 1, and does not wait for the response.  The next time it uses the
slot, it sends the new request with sequence id 2.  If the server has
not seen the request with sequence id 1, then the server is not
expecting sequence id 2, and rejects the client's new request with
NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

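The per-slot sequencing rule illustrated by this example can be
sketched as follows.  The class, function, and returned labels are
invented for illustration; only the comparison logic follows the text
above.

```python
# Illustrative sketch of a replier's per-slot sequence id check.

class Slot:
    def __init__(self):
        self.seqid = 0          # sequence id of the last reply cached here
        self.cached_reply = None

def check_sequence(slot, req_seqid):
    if req_seqid == slot.seqid:
        # Retransmission of the request already executed for this slot:
        # replay the cached reply rather than re-executing operations.
        return ("replay", slot.cached_reply)
    if req_seqid == slot.seqid + 1:
        # In-order new request for this slot.
        return ("new", None)
    # Anything else (e.g. seqid 2 arriving when seqid 1 was never
    # seen) is rejected with NFS4ERR_SEQ_MISORDERED.
    return ("error", "NFS4ERR_SEQ_MISORDERED")
```
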
RDMA fabrics do not guarantee that the memory handles (Steering Tags)
within each RPC/RDMA "chunk" ([8]) are valid on a scope outside that
of a single connection.  Therefore, handles used by the direct
operations become invalid after connection loss.  The server must
ensure that any RDMA operations which must be replayed from the reply
cache use the newly provided handle(s) from the most recent request.

A retry might be issued while the original request is still in
progress on the replier.  The replier SHOULD deal with this issue by
returning NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE
operation, but implementations MAY return NFS4ERR_SEQ_MISORDERED.
Since errors from SEQUENCE and CB_SEQUENCE are never recorded in the
reply

The client must not simply wait forever for the expected server reply
to arrive before responding to the CB_COMPOUND that won the race,
because it is possible that it will be delayed indefinitely.  The
client should assume the likely case that the reply will arrive
within the average round trip time for COMPOUND requests to the
server, and wait that period of time.  If that period of time
expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY.

There are other scenarios under which callbacks may race replies,
among them pNFS layout recalls, described in Section 12.5.4.2.

2.10.5.4.  COMPOUND and CB_COMPOUND Construction Issues

Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues.  When the
session is created (Section 18.36), for each channel (fore and
back), the client and server negotiate the maximum sized request they
will send or process (ca_maxrequestsize), the maximum sized reply
they will return or process (ca_maxresponsesize), and the maximum
sized reply they will store in the reply cache

If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
cache a reply except if an error is returned by the SEQUENCE or
CB_SEQUENCE operation (see Section 2.10.5.1.1).  If the reply exceeds
ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis are
TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE.
Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that
matter) is returned on an operation other than the first operation
(SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if
sa_cachethis or csa_cachethis are TRUE.  For example, if a COMPOUND
has eleven operations, including SEQUENCE, the fifth operation is a
RENAME, and the tenth operation is a READ for one million bytes, the
server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth
operation.  Since the server executed several operations, especially
the non-idempotent RENAME, the client's request to cache the reply
needs to be honored in order for correct operation of exactly once
semantics.  If the client retries the request, the server will have
cached a reply that contains results for ten of the eleven requested
operations, with the tenth operation having a status of
NFS4ERR_REP_TOO_BIG_TO_CACHE.

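The caching rules described above can be summarized in a small
decision sketch.  The function name and its tuple return convention
are hypothetical; the branch conditions mirror the rules in the text.

```python
def maybe_cache_reply(cachethis, reply_size, max_cached, seq_error=None):
    # Hypothetical sketch of the replier's reply-caching decision.
    # Returns (cache_reply, status): whether to cache the reply, and
    # any error status to report.
    if seq_error is not None:
        # SEQUENCE/CB_SEQUENCE itself failed: the reply cache entry
        # for the slot is left untouched (Section 2.10.5.1.1).
        return (False, seq_error)
    if not cachethis:
        # Requester did not ask for caching; nothing is cached.
        return (False, None)
    if reply_size > max_cached:
        # Reply too large to cache in full: it is still cached, with
        # the offending operation carrying NFS4ERR_REP_TOO_BIG_TO_CACHE.
        return (True, "NFS4ERR_REP_TOO_BIG_TO_CACHE")
    return (True, None)
```
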
A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)

the NFSv4.1 server.

While the description of the implementation for atomic execution of
the request and caching of the reply is beyond the scope of this
document, an example implementation for NFS version 2 is described in
[28].

2.10.6.  RDMA Considerations

A complete discussion of the operation of RPC-based protocols over
RDMA transports is in [8].  A discussion of the operation of NFSv4,
including NFSv4.1, over RDMA is in [9].  Where RDMA is considered,
this specification assumes the use of such a layering; it addresses
only the upper layer issues relevant to making best use of RPC/RDMA.

2.10.6.1.  RDMA Connection Resources

RDMA requires its consumers to register memory and post buffers of a
specific size and number for receive operations.

Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes (e.g.

Previous versions of NFS do not provide flow control; instead they
rely on the windowing provided by transports like TCP to throttle
requests.  This does not work with RDMA, which provides no operation
flow control and will terminate a connection in error when limits are
exceeded.  Limits such as maximum number of requests outstanding are
therefore negotiated when a session is created (see the
ca_maxrequests field in Section 18.36).  These limits then provide
the maxima which each connection associated with the session's
channel(s) must remain within.  RDMA connections are managed within
these limits as described in section 3.3 ("Flow Control"[[Comment.2:
RFC Editor: please verify section and title of the RPCRDMA
document]]) of [8]; if there are multiple RDMA connections, then the
maximum number of requests for a channel will be divided among the
RDMA connections.  Put a different way, the onus is on the replier to
ensure that the total number of RDMA credits across all connections
associated with the replier's channel does not exceed the channel's
maximum number of outstanding requests.

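As a sketch of the division described above (the function name is
invented, and an even split is only one possible policy), a replier
might apportion a channel's ca_maxrequests among its RDMA connections
like this:

```python
def credits_per_connection(ca_maxrequests, num_connections):
    # Split the channel's maximum outstanding requests across its RDMA
    # connections so that the credits granted never sum to more than
    # ca_maxrequests.
    base, extra = divmod(ca_maxrequests, num_connections)
    return [base + 1 if i < extra else base for i in range(num_connections)]
```
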
The limits may also be modified dynamically at the replier's choosing
by manipulating certain parameters present in each NFSv4.1 reply.  In
addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
can be issued by a server to a client to return RDMA credits to the
server, thereby lowering the maximum number of requests a client can
have outstanding to the server.

2.10.6.3.  Padding

Header padding is requested by each peer at session initiation (see
the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
and subsequently used by the RPC RDMA layer, as described in [8].
Zero padding is permitted.

Padding leverages the useful property that RDMA transfers preserve
alignment of data, even when they are placed into anonymous
(untagged) buffers.  If requested, client inline writes will insert
appropriate pad bytes within the request header to align the data
payload on the specified boundary.  The client is encouraged to add
sufficient padding (up to the negotiated size) so that the "data"
field of the NFSv4.1 WRITE operation is aligned.  Most servers can
make good use of such padding, which allows them to chain receive
buffers in such a way that any data carried by client requests will
be placed into appropriate buffers at the server, ready for file
system processing.  The receiver's RPC layer encounters no overhead
from skipping over pad bytes, and the RDMA layer's high performance
makes the insertion and transmission of padding on the sender a
significant optimization.

In this way, the need for servers to perform RDMA Read to satisfy all
but the largest client writes is obviated.  An added benefit is the
reduction of message round trips on the network - a potentially good
trade, where latency is present.

The value to choose for padding is subject to a number of criteria.
A primary source of variable-length data in the RPC header is the
authentication information, the form of which is client-determined,
possibly in response to server specification.  The contents of
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
go into the determination of a maximal NFSv4.1 request size and
therefore minimal buffer size.  The client must select its offered
value carefully, so as not to overburden the server, and vice versa.
The payoff of an appropriate padding value is higher performance.

[[Comment.3: RFC editor please keep this diagram on one page.]]

   Sender gather:
       |RPC Request|Pad bytes|Length| -> |User data...|
       \------+----------------------/ \
               \                        \
                \    Receiver scatter:   \-----------+- ...
            /-----+----------------\      \           \
            |RPC Request|Pad|Length|  ->  |FS buffer|->|FS buffer|->...

In the above case, the server may recycle unused buffers to the next
posted receive if unused by the actual received request, or may pass
the now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for data
copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management.  This includes most kernel-based
and integrated server designs, among many others.  The client may
perform similar optimizations, if desired.

2.10.6.4.  Dual RDMA and Non-RDMA Transports

Some RDMA transports (for example [10]) permit a "streaming"
(non-RDMA) phase, where ordinary traffic might flow before "stepping
up" to RDMA mode, commencing RDMA traffic.  Other RDMA transports
always start connections in RDMA mode.  NFSv4.1 allows, but does not
assume, a streaming phase before RDMA mode.  When a connection is
associated with a session, the client and server negotiate whether
the connection is used in RDMA or non-RDMA mode (see Section 18.36
and Section 18.34).

2.10.7.  Sessions Security

2.10.7.2.  Backchannel RPC Security

When the NFSv4.1 client establishes the backchannel, it informs the
server of the security flavors and principals to use when sending
requests.  If the security flavor is RPCSEC_GSS, the client expresses
the principal in the form of an established RPCSEC_GSS context.  The
server is free to use any of the flavor/principal combinations the
client offers, but it MUST NOT use unoffered combinations.  This way,
the client need not provide a target GSS principal for the
backchannel as it did with NFSv4.0, nor does the server have to
implement an RPCSEC_GSS initiator as it did with NFSv4.0 [20].

The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL
(Section 18.33) operations allow the client to specify flavor/
principal combinations.

Also note that the SP4_SSV state protection mode (see Section 18.35
and Section 2.10.7.3) has the side benefit of providing SSV-derived
RPCSEC_GSS contexts (Section 2.10.8).

2.10.7.3.  Protection from Unauthorized State Changes

iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
(1.3.6.1.4.1.28882.1.1).  While the SSV mechanism does not define
any initial context tokens, the OID can be used to let servers
indicate that the SSV mechanism is acceptable whenever the client
issues a SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

The SSV mechanism defines four subkeys derived from the SSV value.
Each time SET_SSV is invoked the subkeys are recalculated by the
client and server.  The four subkeys are calculated from each of the
valid ssv_subkey4 enumerated values.  The calculation uses the HMAC
algorithm ([11]), using the current SSV as the key, the one way hash
algorithm as negotiated by EXCHANGE_ID, and the input text as
represented by the XDR encoded enumeration of type ssv_subkey4.

/* Input for computing subkeys */
enum ssv_subkey4 {
        SSV4_SUBKEY_MIC_I2T     = 1,
        SSV4_SUBKEY_MIC_T2I     = 2,
        SSV4_SUBKEY_SEAL_I2T    = 3,
        SSV4_SUBKEY_SEAL_T2I    = 4
};

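Under the assumption of a generic HMAC implementation, the subkey
computation might be sketched as follows.  SHA-1 here is purely a
placeholder for whichever one-way hash was negotiated via
EXCHANGE_ID, and the function name is invented.

```python
import hashlib
import hmac
import struct

SSV4_SUBKEY_MIC_I2T = 1
SSV4_SUBKEY_MIC_T2I = 2
SSV4_SUBKEY_SEAL_I2T = 3
SSV4_SUBKEY_SEAL_T2I = 4

def compute_subkey(ssv, subkey_id, hash_name="sha1"):
    # HMAC keyed by the current SSV; the input text is the XDR
    # encoding of the ssv_subkey4 enum value (a 4-byte big-endian
    # integer).  hash_name stands in for the negotiated hash.
    text = struct.pack(">i", subkey_id)
    return hmac.new(ssv, text, getattr(hashlib, hash_name)).digest()
```

Each of the four enum values thus yields a distinct subkey from the
same SSV, and the subkeys change whenever SET_SSV changes the SSV.
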
The ssct_encr_data field is the result of encrypting a value of the
XDR encoded data type ssv_seal_plain_tkn4.  The encryption key is the
subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and
the encryption algorithm is that negotiated by EXCHANGE_ID.

The ssct_iv field is the initialization vector (IV) for the
encryption algorithm (if applicable) and is sent in clear text.  The
content and size of the IV MUST comply with the specification of the
encryption algorithm.  For example, the id-aes256-CBC algorithm MUST
use a 16 byte initialization vector (IV) which MUST be unpredictable
for each instance of a value of type ssv_seal_plain_tkn4 that is
encrypted with a particular SSV key.

The ssct_hmac field is the result of computing an HMAC using the
value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
text.  The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
SSV4_SUBKEY_MIC_T2I, and the one way hash algorithm is that
negotiated by EXCHANGE_ID.

The sspt_confounder field is a random value.

The sspt_ssv_seq field is the same as ssvt_ssv_seq.

The sspt_orig_plain field is the original plaintext as passed to
GSS_Wrap().

The sspt_pad field is present to support encryption algorithms that
require inputs to be in fixed sized blocks.  The content of sspt_pad
is zero filled except for the length.  Beware that the XDR encoding
of ssv_seal_plain_tkn4 contains three variable length arrays, and so
each array consumes four bytes for an array length, and each array
that follows the length is always padded to a multiple of four bytes
per the XDR standard.

For example, suppose the encryption algorithm uses 16 byte blocks,
the sspt_confounder is three bytes long, and the sspt_orig_plain
field is 15 bytes long.  The XDR encoding of sspt_confounder uses
eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq
uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes
(4 + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad
field is four bytes.  This totals 36 bytes.  The next multiple of 16
is 48, thus the length field of sspt_pad needs to be set to 12 bytes,
for a total encoding of 16 bytes.  The total number of XDR encoded
bytes is thus 8 + 4 + 20 + 16 = 48.

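The arithmetic in this example can be checked with a short sketch.
The helper names are invented; the sizing rules are XDR's standard
four-byte alignment for variable-length opaque data.

```python
def xdr_opaque_len(n):
    # XDR variable-length opaque: 4-byte length field plus the data,
    # rounded up to a multiple of four bytes.
    return 4 + ((n + 3) // 4) * 4

def sspt_pad_len(confounder_len, plain_len, block_size):
    # Bytes of everything except the sspt_pad data itself: the
    # confounder, sspt_ssv_seq (a uint32), sspt_orig_plain, and the
    # 4-byte length field of sspt_pad.
    fixed = xdr_opaque_len(confounder_len) + 4 + xdr_opaque_len(plain_len) + 4
    # Round the encoding up to the cipher block size; the difference
    # is carried as sspt_pad data.  (Assumes block_size is a multiple
    # of four, so the pad data needs no further XDR padding.)
    total = -(-fixed // block_size) * block_size
    return total - fixed
```

With the example's values (three-byte confounder, 15-byte plaintext,
16-byte blocks) this yields a pad length of 12, matching the
8 + 4 + 20 + 16 = 48 total above.
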
GSS_Wrap() emits a token that is an XDR encoding of a value of data
type ssv_seal_cipher_tkn4.  Note that regardless of whether the
caller of GSS_Wrap() requests confidentiality or not, the token
always has confidentiality.  This is because the SSV mechanism is for
RPCSEC_GSS, and RPCSEC_GSS never produces GSS_wrap() tokens without
confidentiality.

Effectively there is a single GSS context for a single client ID.
All RPCSEC_GSS handles share the same GSS context.  SSV GSS contexts

time and the EXCHANGE_ID operation can be used to create more SSV
RPCSEC_GSS handles.

The client MUST establish an SSV via SET_SSV before the SSV GSS
context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC().
If SET_SSV has not been successfully called, attempts to emit tokens
MUST fail.

The SSV mechanism does not support replay detection and sequencing in
its tokens because RPCSEC_GSS does not use those features (See
Section 5.2.2 "Context Creation Requests" in [4]).

2.10.9.  Session Mechanics - Steady State

2.10.9.1.  Obligations of the Server

The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and backchannel connections). If these
resources vanish, the server takes action as specified in
Section 2.10.10.2.

2.10.9.2.  Obligations of the Client

The client SHOULD honor the following obligations in order to utilize
the session:
o  Keep a necessary session from going idle on the server. A client
   that requires a session, but nonetheless is not sending
   operations, risks having the session be destroyed by the server.
   This is because sessions consume resources, and resource
   limitations may force the server to cull an inactive session.
o  Destroy the session when not needed. If a client has multiple
   sessions and one of them has no requests waiting for replies, and
   has been idle for some period of time, it SHOULD destroy the
   session.

o  Maintain GSS contexts for the backchannel. If the client requires
   the server to use the RPCSEC_GSS security flavor for callbacks,
   then it needs to be sure the contexts handed to the server via
   BACKCHANNEL_CTL are unexpired.
2.10.10.1.  Events Requiring Client Action

The following events require client action to recover.

2.10.10.1.1.  RPCSEC_GSS Context Loss by Callback Path

If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE
results indicates when callback contexts are nearly expired, or fully
expired (see Section 18.46.3).
2.10.10.1.2.  Connection Loss

If the client loses the last connection of the session, and if it
wants to retain the session, then it must create a new connection,
and if, when the client ID was created, BIND_CONN_TO_SESSION was
specified in the spo_must_enforce list, the client MUST use
BIND_CONN_TO_SESSION to associate the connection with the session.

If there was a request outstanding at the time of connection
2.10.11.  Parallel NFS and Sessions

A client and server can potentially be a non-pNFS implementation, a
metadata server implementation, a data server implementation, or two
or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
mutually exclusive) are passed in the EXCHANGE_ID arguments and
results to allow the client to indicate how it wants to use sessions
created under the client ID, and to allow the server to indicate how
it will allow the sessions to be used. See Section 13.1 for pNFS
sessions considerations.

3.  Protocol Data Types

The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [2] and RPC RFC1831
[3] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol.

3.1.  Basic Data Types

These are the base NFSv4 data types.
+----------------------+--------------------------------------------+
| Data Type            | Definition                                 |
+----------------------+--------------------------------------------+
| int32_t              | typedef int int32_t;                       |
|                      | Various offset designations (READ, WRITE,  |
|                      | LOCK, COMMIT)                              |
| qop4                 | typedef uint32_t qop4;                     |
|                      | Quality of protection designation in       |
|                      | SECINFO                                    |
| sec_oid4<>           | typedef opaque sec_oid4<>;                 |
|                      | Security Object Identifier. The sec_oid4   |
|                      | data type is not really opaque. Instead    |
|                      | it contains an ASN.1 OBJECT IDENTIFIER as  |
|                      | used by GSS-API in the mech_type argument  |
|                      | to GSS_Init_sec_context. See [7] for       |
|                      | details.                                   |
| sequenceid4          | typedef uint32_t sequenceid4;              |
|                      | sequence number used for various session   |
|                      | operations (EXCHANGE_ID, CREATE_SESSION,   |
|                      | SEQUENCE, CB_SEQUENCE).                    |
| seqid4               | typedef uint32_t seqid4;                   |
|                      | Sequence identifier used for file locking  |
| sessionid4           | typedef opaque sessionid4[16];             |
|                      | Session identifier                         |
| slotid4              | typedef uint32_t slotid4;                  |
The r_netid and r_addr fields are specified in RFC1833 [26], but
RFC1833 [26] underspecifies what they should look like for specific
protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:

   h1.h2.h3.h4.p1.p2

The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four bytes long.
Assuming big-endian ordering, h1, h2, h3, and h4 are, respectively,
the first through fourth bytes, each converted to ASCII-decimal.
Assuming big-endian ordering, p1 and p2 are, respectively, the first
and second bytes of the port, each converted to ASCII-decimal. For
example, if a host, in big-endian order, has an address of 0x0A010307
and there is a service listening on, in big-endian order, port 0x020F
(decimal 527), then the complete universal address is "10.1.3.7.2.15".
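The construction above can be sketched as follows; the function name is illustrative, not part of the protocol:

```python
def ipv4_universal_address(dotted_quad, port):
    # r_addr for TCP/UDP over IPv4: the dotted-quad host address
    # followed by the two bytes of the port, most significant first,
    # each rendered in ASCII-decimal.
    p1, p2 = (port >> 8) & 0xFF, port & 0xFF
    return "%s.%d.%d" % (dotted_quad, p1, p2)

# The example from the text: host 0x0A010307 (10.1.3.7), port 0x020F.
addr = ipv4_universal_address("10.1.3.7", 0x020F)   # "10.1.3.7.2.15"
```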
For TCP over IPv4 the value of r_netid is the string "tcp". For UDP
over IPv4 the value of r_netid is the string "udp". That this
document specifies the universal address and netid for UDP/IPv4 does
not imply that UDP/IPv4 is a legal transport for NFSv4.1 (see
Section 2.9).
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string:

   x1:x2:x3:x4:x5:x6:x7:x8.p1.p2

The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884
[12]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [12] are also acceptable.

For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6". That this
document specifies the universal address and netid for UDP/IPv6 does
not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see
Section 2.9).
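Because the port suffix is always the last two dot-separated components, a universal address of either family can be split back into host and port; a sketch (the function name is illustrative):

```python
def parse_universal_address(r_addr):
    # Works for both "h1.h2.h3.h4.p1.p2" (IPv4) and
    # "x1:x2:...:x8.p1.p2" (IPv6) universal addresses: the last two
    # dot-separated fields are the big-endian port bytes.
    host, p1, p2 = r_addr.rsplit(".", 2)
    return host, (int(p1) << 8) | int(p2)
```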
3.2.12.  state_owner4

struct state_owner4 {
that clients have "layout drivers" that support one or more layout
types. The file server advertises the layout types it supports
through the fs_layout_type file system attribute (Section 5.11.1). A
client asks for layouts of a particular type in LAYOUTGET, and passes
those layouts to its layout driver.

The layouttype4 structure is 32 bits in length. The range
represented by the layout type is split into three parts. Type 0x0
is reserved. Types within the range 0x00000001-0x7FFFFFFF are
globally unique and are assigned according to the description in
Section 22.2; they are maintained by IANA. Types within the range
0x80000000-0xFFFFFFFF are site specific and for "private use" only.
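The three ranges can be checked mechanically; a sketch (the function name is not part of the protocol):

```python
def classify_layout_type(lt):
    # Partition a 32-bit layouttype4 value per the ranges above.
    if not 0 <= lt <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is a 32-bit value")
    if lt == 0x0:
        return "reserved"
    if lt <= 0x7FFFFFFF:
        return "IANA-assigned"
    return "private use"
```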
The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [29], is to be used.
Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [30], is to be used.
3.2.16.  deviceid4

struct deviceid4 {
        uint64_t        did_major;
        uint64_t        did_minor;
};
Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system
(FSID). See Section 12.2.10 for more details.
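A client-side cache of device information must therefore key on the layout type as well as the ID, and discard entries when the metadata server reboots; a minimal sketch under those two rules (class and method names are illustrative):

```python
class DeviceCache:
    # Device IDs are qualified by layout type and are only promised
    # to be stable within one metadata server instance.
    def __init__(self):
        self._entries = {}

    def put(self, layout_type, did_major, did_minor, addr):
        self._entries[(layout_type, did_major, did_minor)] = addr

    def get(self, layout_type, did_major, did_minor):
        return self._entries.get((layout_type, did_major, did_minor))

    def server_rebooted(self):
        # A client must not assume IDs survive a metadata server
        # reboot, so flush everything.
        self._entries.clear()
```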
3.2.17.  device_addr4

struct device_addr4 {
        layouttype4     da_layout_type;
        opaque          da_addr_body<>;
};

The device address is used to set up a communication channel with the
storage device. Different layout types will require different types
of structures to define how they communicate with storage devices.
The opaque da_addr_body field must be interpreted based on the
specified da_layout_type field.
This document defines the device address for the NFSv4.1 file layout
(see Section 13.3), which identifies a storage device by network IP
address and port number. This is sufficient for the clients to
communicate with the NFSv4.1 storage devices, and may be sufficient
for other layout types as well. Device types for object storage
devices and block storage devices (e.g., SCSI volume labels) will be
defined by their respective layout specifications.
3.2.18.  devlist_item4

struct devlist_item4 {
        deviceid4       dli_id;
        stateid4        dli_stateid;
        device_addr4    dli_device_addr;
};

An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
3.2.19.  layout_content4

struct layout_content4 {
        layouttype4     loc_type;
        opaque          loc_body<>;
};
The loc_body field must be interpreted based on the layout type
(loc_type). This document defines the loc_body for the NFSv4.1 file
layout type; see Section 13.3 for its definition.
3.2.20.  layout4

struct layout4 {
        offset4         lo_offset;
        length4         lo_length;
        layoutiomode4   lo_iomode;
        layout_content4 lo_content;
};
        opaque          loh_body<>;
};

The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file.
It is the structure specified by the layout_hint attribute described
in Section 5.11.4. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within
OPEN. The loh_body field is specific to the type of layout
(loh_type). The NFSv4.1 file-based layout uses the
nfsv4_1_file_layouthint4 structure as defined in Section 13.3.
3.2.23.  layoutiomode4

enum layoutiomode4 {
        LAYOUTIOMODE4_READ      = 1,
        LAYOUTIOMODE4_RW        = 2,
        LAYOUTIOMODE4_ANY       = 3
};

The iomode specifies whether the client intends to read or write
filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned
by the server.

4.2.1.  General Properties of a Filehandle

The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use
filehandle comparisons only to improve performance, not for correct
behavior. All clients need to be prepared for situations in which it
cannot be determined whether two filehandles denote the same object
and in such cases, avoid making invalid assumptions which might cause
incorrect behavior. Further discussion of filehandle and attribute
client should not depend on the ability to store any named attributes
in the server's file system. If a server does support named
attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as
well.

Names of attributes will not be controlled by this document or other
IETF standards track documents. See the section IANA Considerations
(Section 22.3) for further discussion.
5.4.  Classification of Attributes

Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per file system, or per file
system object. Note that it is possible that some per file system
attributes may vary within the file system. See the "homogeneous"
attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access
5.9.  Character Case Attributes

With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" RFC1345 [34] which may or may not include the word "CAPITAL"
or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table-driven mappings for case-
insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see the
section Internationalization (Section 14).
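The table-driven mapping alluded to above can be sketched with Python's unicodedata module, whose character names follow the same descriptive-name convention; the function name is illustrative:

```python
import unicodedata

def capital_equivalent(ch):
    # If the character's descriptive name contains the word "SMALL",
    # look up the character whose name has "CAPITAL" in its place;
    # otherwise the character maps to itself.
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch                    # unnamed code point
    if " SMALL " not in " %s " % name:
        return ch
    try:
        return unicodedata.lookup(name.replace("SMALL", "CAPITAL", 1))
    except KeyError:
        return ch                    # no CAPITAL counterpart exists
```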
5.10.  Directory Notification Attributes

As described in Section 18.39, the client can request a minimum delay
for notifications of changes to attributes, but the server is free to
ignore what the client requests. The client can determine in advance
what notification delays the server will accept by issuing a GETATTR
for either or both of two directory notification attributes. When
the client calls the GET_DIR_DELEGATION operation and asks for
attribute change notifications, it should request notification delays
both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the write
permission is enabled.

If a server receives a SETATTR request that it cannot accurately
implement, it should err in the direction of more restricted access,
except in the previously discussed cases of execute and read. For
example, suppose a server cannot distinguish overwriting data from
appending new data, as described in the previous paragraph. If a
client submits an ALLOW ACE where ACE4_APPEND_DATA is set but
ACE4_WRITE_DATA is not (or vice versa), the server should either turn
off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP.
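One conforming server behavior can be sketched as follows; the bit values are those from the NFSv4.1 ACL definition, and the function name is illustrative:

```python
ACE4_WRITE_DATA  = 0x00000002
ACE4_APPEND_DATA = 0x00000004

def normalize_allow_mask(mask):
    # If exactly one of the two bits is set in an ALLOW ACE on a
    # server that cannot tell overwrite from append, turn off
    # ACE4_APPEND_DATA -- the "err toward restricted access" choice.
    # (The alternative the text permits is rejecting the SETATTR
    # with NFS4ERR_ATTRNOTSUPP.)
    w = bool(mask & ACE4_WRITE_DATA)
    a = bool(mask & ACE4_APPEND_DATA)
    if w != a:
        mask &= ~ACE4_APPEND_DATA
    return mask
```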
6.2.1.3.2.  ACE4_DELETE vs. ACE4_DELETE_CHILD

Two access mask bits govern the ability to delete a directory entry:
ACE4_DELETE on the object itself (the "target"), and
ACE4_DELETE_CHILD on the containing directory (the "parent").

Many systems also take the "sticky bit" (MODE4_SVTX) on a directory
to allow unlink only to a user that owns either the target or the
parent; on some such systems the decision also depends on whether the
The following combinations of "other" and "seqid" are defined in
NFSv4.1:

o  When "other" and "seqid" are both zero, the stateid is treated as
   a special anonymous stateid, which can be used in READ, WRITE, and
   SETATTR requests to indicate the absence of any open state
   associated with the request. When an anonymous stateid value is
   used, and an existing open denies the form of access requested,
   then access will be denied to the request. This stateid MUST NOT
   be used on operations to data servers (Section 13.7).

o  When "other" and "seqid" are both all ones, the stateid is a
   special read bypass stateid. When this value is used in WRITE or
   SETATTR, it is treated like the anonymous value. When used in
   READ, the server MAY grant access, even if access would normally
   be denied to READ requests. This stateid MUST NOT be used on
   operations to data servers.

o  When "other" is zero and "seqid" is one, the stateid represents
   the current stateid, which is whatever value is the last stateid
o  If the "seqid" field is not zero, and it is less than the current
   sequence value corresponding to the current "other" field, return
   NFS4ERR_OLD_STATEID.
o  Otherwise, the stateid is valid and the table entry should contain
   any additional information about the type of stateid and
   information associated with that particular type of stateid, such
   as the associated set of locks, including open-owner and lock-
   owner information, as well as information on the specific locks,
   such as open modes and byte ranges.
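The special stateid values and the seqid check above can be sketched in Python (an illustrative, non-normative sketch; `classify_stateid` and `check_seqid` are hypothetical helper names, and the stateid is modeled as a (seqid, other) pair rather than the on-the-wire stateid4 XDR structure):

```python
ANON_OTHER = b"\x00" * 12      # "other" field all zero
BYPASS_OTHER = b"\xff" * 12    # "other" field all ones

def classify_stateid(seqid, other):
    """Return the special-stateid kind, or None for an ordinary stateid."""
    if other == ANON_OTHER and seqid == 0:
        return "anonymous"      # no open state; MUST NOT go to data servers
    if other == BYPASS_OTHER and seqid == 0xFFFFFFFF:
        return "read-bypass"    # READ may bypass deny; anonymous for WRITE/SETATTR
    if other == ANON_OTHER and seqid == 1:
        return "current"        # the last stateid returned by an operation
    return None

def check_seqid(request_seqid, current_seqid):
    """Seqid check for ordinary stateids: zero means 'most recent'."""
    if request_seqid != 0 and request_seqid < current_seqid:
        return "NFS4ERR_OLD_STATEID"
    return "NFS4_OK"
```

An all-ones stateid with a zero seqid (or any other mixture) falls through to the ordinary-stateid path and is validated against the server's tables as described above.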
8.2.5.  Stateid Use for IO Operations
Clients performing IO operations (and SETATTRs modifying the file
size) need to select an appropriate stateid based on the locks
(including opens and delegations) held by the client and the various
types of lock owners issuing the IO requests.
The following rules, applied in order of decreasing priority, govern
the selection of the appropriate stateid:
o  If the client holds a delegation for the file in question, the
   delegation stateid should be used.
o  Otherwise, if the entity corresponding to the lockowner (e.g. a
   process) issuing the IO has a lock stateid for the associated open
   file, then the lock stateid for that lockowner and open file
   should be used.  (See Section 13.10.1 for an exception when file
   layout data servers are being used).
o  If there is no lock stateid, then the open stateid for the open
   file in question is used.
o  Finally, if none of the above apply, then a special stateid should
   be used.
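The priority order above can be sketched as (an illustrative helper; the function and argument names are hypothetical):

```python
ANONYMOUS_STATEID = object()  # stands in for the all-zero special stateid

def select_io_stateid(delegation, lock_stateid, open_stateid):
    """Pick the stateid for an IO request, in decreasing priority:
    delegation, then lock stateid, then open stateid, then special."""
    if delegation is not None:
        return delegation
    if lock_stateid is not None:
        return lock_stateid
    if open_stateid is not None:
        return open_stateid
    return ANONYMOUS_STATEID
```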
8.3.  Lease Renewal
skipping to change at page 149, line 47
effectively expire, it must have been at least the lease interval
since the last SEQUENCE operation issued on any session and there
must be no active COMPOUND operations on any such session.
Because the SEQUENCE operation is the basic mechanism to renew a
lease, and because it must be done at least once for each lease
period, it is the natural mechanism whereby the server will inform
the client of changes in the lease status that the client needs to be
informed of.  The client should inspect the status flags
(sr_status_flags) returned by SEQUENCE and take the appropriate
action.  (See Section 18.46.3 for details).
o  The status bits SEQ4_STATUS_CB_PATH_DOWN and
   SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the
   backchannel which the client may need to address in order to
   receive callback requests.
o  The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and
   SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate actual problems with
   GSS contexts for the backchannel which the client may have to
   address to allow callback requests to be sent to it.
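A client's handling of these flags might be sketched as follows (illustrative only: the flag names are the protocol's, but the bit values shown should be checked against the XDR definitions, and `actions_for_status` is a hypothetical helper):

```python
# Bit assignments believed to follow the draft's XDR; verify against it.
SEQ4_STATUS_CB_PATH_DOWN             = 0x00000001
SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002
SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED  = 0x00000004
SEQ4_STATUS_RESTART_RECLAIM_NEEDED   = 0x00000100
SEQ4_STATUS_CB_PATH_DOWN_SESSION     = 0x00000200

def actions_for_status(sr_status_flags):
    """Map SEQUENCE status bits to the recovery actions a client takes."""
    actions = []
    if sr_status_flags & (SEQ4_STATUS_CB_PATH_DOWN |
                          SEQ4_STATUS_CB_PATH_DOWN_SESSION):
        actions.append("repair backchannel")
    if sr_status_flags & (SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING |
                          SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED):
        actions.append("refresh backchannel GSS contexts")
    if sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED:
        actions.append("reclaim locks")
    return actions
```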
skipping to change at page 151, line 14
discussed in Section 8.3, when a client has not failed and re-
establishes its lease before expiration occurs, requests for
conflicting locks will not be granted.
To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client-supplied verifier.  This
verifier is part of the client_owner4 sent in the initial EXCHANGE_ID
call made by the client.  The server returns a client ID as a result
of the EXCHANGE_ID operation.  The client then confirms the use of
the client ID by establishing a session associated with that client
ID.  See Section 18.36.3 for a description of how this is done.  All
locks, including opens, record locks, delegations, and layouts
obtained by sessions using that client ID are associated with that
client ID.
Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not
match.  This signifies the client's new instantiation and subsequent
loss of locking state.  As a result, the server is free to release
all locks held which are associated with the old client ID which was
skipping to change at page 151, line 49
requests because the server has granted conflicting access to another
client.  Likewise, if there is a possibility that clients have not
yet re-established their locking state for a file, and that such
locking state might make it invalid to perform READ or WRITE
operations, for example through the establishment of mandatory locks,
the server must disallow READ and WRITE operations for that file.
A client can determine that loss of locking state has occurred via
several methods.
1.  When a SEQUENCE (most common) or other operation returns
    NFS4ERR_BADSESSION, this may mean the session has been destroyed,
    but the client ID is still valid.  The client issues a
    CREATE_SESSION request with the client ID to re-establish the
    session.  If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID,
    the client must establish a new client ID (see Section 8.1) and
    re-establish its lock state after the CREATE_SESSION, with the
    new client ID, succeeds (Section 8.4.2.1).

2.  When a SEQUENCE (most common) or other operation on a persistent
    session returns NFS4ERR_DEADSESSION, this indicates that the
    session is no longer usable for new operations, i.e. operations
    not satisfied from the replay cache.  Once all pending operations
    are determined to have been either performed before the retry or
    not performed, the client issues a CREATE_SESSION request with
    the client ID to re-establish the session.  If CREATE_SESSION
    fails with NFS4ERR_STALE_CLIENTID, the client must establish a
    new client ID (see Section 8.1) and re-establish its lock state
    after the CREATE_SESSION, with the new client ID, succeeds
    (Section 8.4.2.1).

3.  When an operation that is neither SEQUENCE nor preceded by
    SEQUENCE (for example, CREATE_SESSION or DESTROY_SESSION) returns
    NFS4ERR_STALE_CLIENTID, the client MUST establish a new client
    ID (Section 8.1) and re-establish its lock state
    (Section 8.4.2.1).
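Taken together, cases 1 through 3 suggest client recovery logic along these lines (a non-normative sketch; `create_session` stands in for issuing CREATE_SESSION under the existing client ID and returning its status):

```python
def recover(error, create_session):
    """Sketch of the recovery choices above.  create_session(client_id)
    is a hypothetical callback returning 'ok' or
    'NFS4ERR_STALE_CLIENTID'."""
    steps = []
    if error in ("NFS4ERR_BADSESSION", "NFS4ERR_DEADSESSION"):
        # Try to re-establish the session under the existing client ID.
        if create_session("old-client-id") == "NFS4ERR_STALE_CLIENTID":
            steps.append("new EXCHANGE_ID")         # establish new client ID
            steps.append("CREATE_SESSION")
            steps.append("reclaim locks")           # Section 8.4.2.1
        else:
            steps.append("session re-established")  # client ID survived
    elif error == "NFS4ERR_STALE_CLIENTID":
        steps.append("new EXCHANGE_ID")
        steps.append("CREATE_SESSION")
        steps.append("reclaim locks")
    return steps
```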
8.4.2.1.  State Reclaim
When state information and the associated locks are lost as a result
of a server reboot, the protocol must provide a way to cause that
state to be re-established.  The approach used is to define, for most
skipping to change at page 152, line 46
are variants of the requests normally used to create locks of that
type and are referred to as "reclaim-type" requests and the process
of re-establishing such locks is referred to as "reclaiming" them.
Because each client must have an opportunity to reclaim all of the
locks that it has without the possibility that some other client will
be granted a conflicting lock, a special period called the "grace
period" is devoted to the reclaim process.  During this period,
requests creating client IDs and sessions are handled normally, but
locking requests are subject to special restrictions.  Only reclaim-
type locking requests are allowed, unless the server is able to
reliably determine (through state persistently maintained across
reboot instances) that granting any such lock cannot possibly
conflict with a subsequent reclaim.  When a request is made to obtain
a new lock (i.e. not a reclaim-type request) during the grace period
and such a determination cannot be made, the server must return the
error NFS4ERR_GRACE.
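The server-side gate described here can be sketched as (an illustrative helper; the argument names are hypothetical):

```python
def check_lock_request(is_reclaim, in_grace, conflict_info_available):
    """Grace-period gate: during grace, only reclaim-type requests
    proceed unless persistent state proves a new lock cannot conflict
    with any subsequent reclaim."""
    if not in_grace:
        return "NFS4_OK"
    if is_reclaim:
        return "NFS4_OK"          # reclaim-type requests proceed
    if conflict_info_available:
        return "NFS4_OK"          # server can prove no reclaim conflict
    return "NFS4ERR_GRACE"
```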
Once a session is established using the new client ID, the client
will use reclaim-type locking requests (e.g. LOCK requests with
reclaim set to true and OPEN operations with a claim type of
skipping to change at page 153, line 20
Once this is done, or if there is no such locking state to reclaim,
the client sends a global RECLAIM_COMPLETE operation, i.e. one with
the one_fs argument set to false, to indicate that it has reclaimed
all of the locking state that it will reclaim.  Once a client sends
such a RECLAIM_COMPLETE operation, it may attempt non-reclaim locking
operations, although it may get NFS4ERR_GRACE errors on these
operations until the period of special handling is over.  See
Section 11.6.7 for a discussion of the analogous handling of lock
reclamation in the case of file systems transitioning from server to
server.
Note that if the client ID persisted through a server reboot, which
will be self-evident if the client never received a
NFS4ERR_STALE_CLIENTID error, and instead got
SEQ4_STATUS_RESTART_RECLAIM_NEEDED status from SEQUENCE
(Section 18.46.4), no client ID was re-established.
During the grace period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e. other LOCK and OPEN
operations) with an error of NFS4ERR_GRACE, unless it is able to
guarantee that these may be done safely, as described below.
The grace period may last until all clients who are known to possibly
have had locks have done a global RECLAIM_COMPLETE operation,
indicating that they have finished reclaiming the locks they held
before the server reboot.  This means that a client which has done a
RECLAIM_COMPLETE must be prepared to receive an NFS4ERR_GRACE when
skipping to change at page 160, line 16
client ID) and the client will proceed with normal crash recovery as
described in Section 8.4.2.1.
The second occasion of lock revocation is the inability to renew the
lease before expiration, as discussed in Section 8.4.3.  While this
is considered a rare or unusual event, the client must be prepared to
recover.  The server is responsible for determining the precise
consequences of the lease expiration, informing the client of the
scope of the lock revocation decided upon.  The client then uses the
status information provided by the server in the SEQUENCE results
(field sr_status_flags, see Section 18.46.3) to synchronize its
locking state with that of the server, in order to recover.
The third occasion of lock revocation can occur as a result of
revocation of locks within the lease period, either because of
administrative intervention, or because a recallable lock (a
delegation or layout) was not returned within the lease period after
having been recalled.  While these are considered rare events, they
are possible and the client must be prepared to deal with them.  When
either of these events occurs, the client finds out about the
situation through the status returned by the SEQUENCE operation.  Any
skipping to change at page 166, line 36
derived from such an open is used, the server knows that the READ,
WRITE, or SETATTR does not conflict with the delegation, but is
issued under the aegis of the delegation.  Even though it is possible
for the server to determine from the clientid (gotten from the
sessionid) that the client does in fact have a delegation, the server
is not obliged to check this, so using a special stateid can result
in avoidable recall of the delegation.
9.2.  Lock Ranges
The protocol allows a lock owner to request a lock with a byte range
and then either upgrade, downgrade, or unlock a sub-range of the
initial lock.  It is expected that this will be an uncommon type of
request.  In any case, servers or server file systems may not be able
to support sub-range lock semantics.  In the event that a server
receives a locking request that represents a sub-range of current
locking state for the lock owner, the server is allowed to return the
error NFS4ERR_LOCK_RANGE to signify that it does not support sub-
range lock operations.  Therefore, the client should be prepared to
receive this error and, if appropriate, report the error to the
requesting application.
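A server that does not support sub-range semantics might apply this rule as follows (a sketch with hypothetical names; ranges are modeled as inclusive (lo, hi) offset pairs):

```python
def server_lock_downgrade(held_ranges, unlock_range, supports_subrange):
    """Return NFS4ERR_LOCK_RANGE when asked to operate on a strict
    sub-range of an existing lock and sub-ranges are unsupported."""
    lo, hi = unlock_range
    for held_lo, held_hi in held_ranges:
        exact = (lo, hi) == (held_lo, held_hi)
        inside = held_lo <= lo and hi <= held_hi and not exact
        if inside and not supports_subrange:
            return "NFS4ERR_LOCK_RANGE"
    return "NFS4_OK"
```

A client receiving NFS4ERR_LOCK_RANGE would typically surface the failure to the requesting application rather than retry.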
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
skipping to change at page 168, line 23
pending blocking locks.  Clients should use such a nonblocking
request to indicate to the server that this is the last time they
intend to poll for the lock, as may happen when the process
requesting the lock is interrupted.  This is a courtesy to the
server, to prevent it from unnecessarily waiting a lease period
before granting other lock requests.  However, clients are not
required to perform this courtesy, and servers must not depend on
them doing so.  Also, clients must be prepared for the possibility
that this final locking request will be accepted.
When the server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK,
that CB_NOTIFY_LOCK callbacks will be done for the current open file,
the client should take notice of this, but, since this is a hint,
cannot rely on a CB_NOTIFY_LOCK always being done.  A client may
reasonably reduce the frequency with which it polls for a denied
lock, since the greater latency that might occur is likely to be
eliminated given a prompt callback, but it still needs to poll.  When
it receives a CB_NOTIFY_LOCK it should promptly try to obtain the
lock, but it should be aware that other clients may be polling and
the server is under no obligation to reserve the lock for that
particular client.
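The polling cadence above might be sketched as (illustrative only; the delay policy, including the ten-fold back-off, is an arbitrary choice for the sketch, not something the protocol specifies):

```python
def next_poll_delay(base_delay, may_notify, got_notify):
    """Polling cadence for a denied lock: keep polling, but back off
    when the server hinted CB_NOTIFY_LOCK; retry at once on a notify."""
    if got_notify:
        return 0                # try to obtain the lock promptly
    if may_notify:
        return base_delay * 10  # hint: a callback is likely, poll rarely
    return base_delay           # no hint: poll at the normal rate
```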
9.5.  Share Reservations
A share reservation is a mechanism to control access to a file.  It
is a separate and independent mechanism from record locking.  When a
client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH).  If
the OPEN fails the client will fail the application's open request.
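The conflict rule behind share reservations can be sketched as follows (an illustrative Python sketch under assumed bit-mask encodings, not the specification's own pseudo-code): a new OPEN fails if its requested access intersects an existing deny, or its deny intersects access already granted.

```python
READ, WRITE = 1, 2  # assumed access/deny bit masks (BOTH == READ | WRITE)

def open_conflicts(existing_opens, req_access, req_deny):
    """An OPEN conflicts if its access is denied by an existing open,
    or its deny overlaps access an existing open already holds."""
    for acc, deny in existing_opens:
        if (req_access & deny) or (req_deny & acc):
            return True
    return False
```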
Pseudo-code definition of the semantics:
skipping to change at page 178, line 22
For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way
that matches in an equivalent way the actual READ and WRITE
operations executed.  This is as opposed to file locking that is
based on pure convention.  For example, it is possible to manipulate
a two-megabyte file by dividing the file into two one-megabyte
regions and protecting access to the two regions by file locks on
bytes zero and one.  A lock for write on byte zero of the file would
represent the right to do READ and WRITE operations on the first
region.  A lock for write on byte one of the file would represent the
right to do READ and WRITE operations on the second region.  As long
as all applications manipulating the file obey this convention, they
will work on a local file system.  However, they may not work with
the NFS version 4 protocol unless clients refrain from data caching.
The rules for data caching in the file locking environment are:
o  First, when a client obtains a file lock for a particular region,
   the data cache corresponding to that region (if any cache data
   exists) must be revalidated.  If the change attribute indicates
   that the file may have been updated since the cached data was
   obtained, the client must flush or invalidate the cached data for
   the newly locked region.  A client might choose to invalidate all
   of the non-modified cached data that it has for the file, but the
   only requirement for correct operation is to invalidate all of the
   data in the newly locked region.
o  Second, before releasing a write lock for a region, all modified
   data for that region must be flushed to the server.  The modified
   data must also be written to stable storage.
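The two rules can be sketched as (illustrative helpers over a hypothetical client cache structure; all names here are assumptions for the sketch):

```python
def on_lock_acquired(cache, region, server_change_attr):
    """Rule 1: revalidate cached data for a newly locked region."""
    if cache["change_attr"] != server_change_attr:
        # File may have changed; drop cached data for the locked region.
        cache["regions"].pop(region, None)
        cache["change_attr"] = server_change_attr

def on_write_lock_release(cache, region, write_to_server):
    """Rule 2: flush modified data for the region, to stable storage,
    before the write lock is released."""
    data = cache["dirty"].pop(region, None)
    if data is not None:
        write_to_server(region, data, stable=True)
```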
Note that flushing data to the server and the invalidation of cached
data must reflect the actual byte ranges locked or unlocked.
Rounding these up or down to reflect client cache block boundaries
will cause problems if not carefully done.  For example, writing a
modified block when only half of that block is within an area being
unlocked may cause invalid modification to the region outside the
unlocked area.  This, in turn, may be part of a region locked by
another client.  Clients can avoid this situation by synchronously
performing portions of write operations that overlap that portion
(initial or final) that is not a full block.  Similarly, invalidating
a locked area which is not an integral number of full buffer blocks
would require the client to read one or two partial blocks from the
skipping to change at page 179, line 28
server reboot might conflict with a lock held by another client.
A client implementation may choose to accommodate applications which
use record locking in non-standard ways (e.g. using a record lock as
a global semaphore) by flushing to the server more data upon a LOCKU
than is covered by the locked range. This may include modified data
within files other than the one for which the unlocks are being done.
In such cases, the client must not interfere with applications whose
READs and WRITEs are being done only within the bounds of record
locks which the application holds. For example, an application locks
a single byte of a file and proceeds to write that single byte. A
client that chose to handle a LOCKU by flushing all modified data to
the server could validly write that single byte in response to an
unrelated unlock. However, it would not be valid to write the entire
block in which that single written byte was located since it includes
an area that is not locked and might be locked by another client.
Client implementations can avoid this problem by dividing files with
modified data into those for which all modifications are done to
areas covered by an appropriate record lock and those for which there
are modifications not covered by a record lock. Any writes done for
the former class of files must not include areas not locked and thus
not modified on the client.
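The classification described above can be sketched as follows. This
is a hypothetical helper, not part of any NFS client implementation;
for simplicity it requires each modified range to lie within a single
held lock, whereas a real client might also accept a range spanning
several adjacent locks.

```python
def covered(mod, locks):
    """True if the modified byte range lies entirely within some held
    record lock.  Ranges are (offset, length) pairs."""
    mstart, mlen = mod
    return any(lstart <= mstart and mstart + mlen <= lstart + llen
               for lstart, llen in locks)

def safe_to_flush_all(modified_ranges, held_locks):
    """A file belongs to the former class -- all of its modified data
    may be written on LOCKU -- only if every modified range is covered
    by a record lock this client holds."""
    return all(covered(m, held_locks) for m in modified_ranges)
```

For example, a file with one modified byte inside a one-byte lock may
be flushed in full, while a file whose whole 8192-byte block is dirty
but only half locked may not.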
10.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when
it is in effect. The presence of mandatory file locking for a given
file is indicated when the client gets back NFS4ERR_LOCKED from a
READ or WRITE on a file it has an appropriate share reservation for.
When mandatory locking is in effect for a file, the client must check
for an appropriate file lock for data being read or written. If a
lock exists for the range being read or written, the client may
skipping to change at page 186, line 25
While the change attribute is opaque to the client in the sense that
it has no idea what units of time, if any, the server is counting
change with, it is not opaque in that the client has to treat it as
an unsigned integer, and the server has to be able to see the results
of the client's changes to that integer. Therefore, the server MUST
encode the change attribute in network order when sending it to the
client. The client MUST decode it from network order to its native
order when receiving it and the client MUST encode it in network
order when sending it to the server. For this reason, change is
defined as an unsigned integer rather than an opaque array of bytes.
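The encoding requirement above can be illustrated with a short
sketch. The function names are illustrative only; the point is that
a 64-bit unsigned change value crosses the wire in network
(big-endian) byte order regardless of the host's native order.

```python
import struct

def encode_change(change):
    # The change attribute is a 64-bit unsigned integer; "!Q" packs it
    # in network (big-endian) byte order, as the text requires.
    return struct.pack("!Q", change)

def decode_change(wire):
    # Unpack from network order back to the host's native integer.
    return struct.unpack("!Q", wire)[0]
```

A round trip preserves the value exactly, and the most significant
byte always appears first on the wire, so both ends see the same
unsigned integer.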
For the server, the following steps will be taken when providing a
write delegation:

o Upon providing a write delegation, the server will cache a copy of
the change attribute in the data structure it uses to record the
delegation. Let this value be represented by sc.

o When a second client sends a GETATTR operation on the same file to
the server, the server obtains the change attribute from the first
skipping to change at page 196, line 23
granted, not whether it has been modified again between successive
CB_GETATTR calls, and the server MUST assume that any file the
client has modified in cache has been modified again between
successive CB_GETATTR calls. Depending on the nature of the
client's memory management system, this weak obligation may not be
possible. A client MAY return stale information in CB_GETATTR
whenever the file is memory mapped.
o The mixture of memory mapping and file locking on the same file is
problematic. Consider the following scenario, where a page size
on each client is 8192 bytes.

* Client A memory maps first page (8192 bytes) of file X

* Client B memory maps first page (8192 bytes) of file X

* Client A write locks first 4096 bytes

* Client B write locks second 4096 bytes

* Client A, via a STORE instruction, modifies part of its locked
region.

* Simultaneous to client A, client B issues a STORE on part of
its locked region.
Here the challenge is for each client to resynchronize to get a
correct view of the first page. In many operating environments, the
virtual memory management systems on each client only know a page is
skipping to change at page 197, line 25
are record locks for.

o Clients and servers MAY deny a record lock on a file they know is
memory mapped.

o A client MAY deny memory mapping a file that it knows requires
mandatory locking for I/O. If mandatory locking is enabled after
the file is opened and mapped, the client MAY deny the application
further access to its mapped file.
10.8. Name and Directory Caching without Directory Delegations
Although NFSv4.1 defines a directory delegation facility (described
in Section 10.9 below), servers are allowed not to implement that
facility, and even where it is implemented, it may not always be
functional, because of resource availability issues or other
constraints. Because of that, it is important to understand how name
and directory caching are done in the absence of directory
delegations, and those topics are discussed in the sections
immediately below.
10.8.1. Name Caching
The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations. Just as in the case of
attribute caching, inconsistencies may arise among the various client
caches. To mitigate the effects of these inconsistencies and given
the context of typical file system APIs, an upper time boundary is
maintained on how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory
change operation performed by another client.

When a client is not making changes to a directory for which there
exist name cache entries, the client needs to periodically fetch
attributes for that directory to ensure that it is not being
modified. After determining that no modification has occurred, the
expiration time for the associated name cache entries may be updated
to be the current time plus the name cache staleness bound.
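The expiration handling above can be sketched as a small cache-entry
class. This is a hypothetical illustration, not any particular
client's data structure, and the 30-second staleness bound is an
assumed, client-chosen value.

```python
import time

NAME_CACHE_STALENESS_BOUND = 30.0  # seconds; the bound is client-chosen

class NameCacheEntry:
    """A cached name-to-filehandle mapping with an expiration time."""

    def __init__(self, name, filehandle):
        self.name = name
        self.filehandle = filehandle
        self.expires = time.monotonic() + NAME_CACHE_STALENESS_BOUND

    def is_valid(self):
        return time.monotonic() < self.expires

    def refresh(self, directory_unmodified):
        # Called after fetching the directory's attributes.  If no
        # modification occurred, extend the entry's lifetime to the
        # current time plus the staleness bound; otherwise the entry
        # must be discarded.
        if directory_unmodified:
            self.expires = time.monotonic() + NAME_CACHE_STALENESS_BOUND
            return True
        return False
```

An entry that fails revalidation returns False and is expected to be
purged by the caller.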
When a client is making changes to a given directory, it needs to
determine whether there have been changes made to the directory by
other clients. It does this by using the change attribute as
reported before and after the directory operation in the associated
change_info4 value returned for the operation. The server is able to
communicate to the client whether the change_info4 data is provided
atomically with respect to the directory operation. If the change
values are provided atomically, the client has a basis for
determining, given proper care, whether other clients are modifying
the directory in question.

The simplest way to enable the client to make this determination is
for the client to serialize all changes made to a specific directory.
When this is done, and the server provides before and after values of
the change attribute atomically, the client can simply compare the
after value of the change attribute from one operation on a directory
with the before value on the next subsequent operation modifying that
directory. When these are equal, the client is assured that no other
client is modifying the directory in question.
When such serialization is not used, and multiple simultaneous
outstanding operations modifying a single directory may be sent from
a single client, making this sort of determination can be more
complicated, since two such operations recognized as complete in a
different order than they were actually performed might give an
appearance consistent with modification being made by another client.
Where this appears to happen, the client needs to await the
completion of all such modifications that were started previously, to
see if the before and after change values can be sorted into a chain
such that the before value of each operation matches the after value
of a previous one, consistent with this client being the only one
modifying the directory.
In either of these cases, the client is able to determine whether the
directory is being modified by another client. If the comparison
indicates that the directory was updated by another client, the name
cache associated with the modified directory is purged from the
client. If the comparison indicates no modification, the name cache
can be updated on the client to reflect the directory operation and
the associated timeout extended. The post-operation change value
needs to be saved as the basis for future change_info4 comparisons.
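The chain check described above can be sketched as follows. This is
a hypothetical helper, not protocol machinery: change values are
treated as opaque, equality-comparable tokens, and any gap or
ambiguity in the chain is conservatively taken as evidence of another
client's modification.

```python
def sole_modifier(pairs, start):
    """pairs: (before, after) change-attribute values from
    change_info4 for this client's completed operations on a
    directory; start: the last change value the client had verified.
    True if the pairs can be chained -- each before value matching a
    prior after value -- consistent with this client acting alone."""
    remaining = list(pairs)
    current = start
    while remaining:
        matches = [p for p in remaining if p[0] == current]
        if len(matches) != 1:
            # Gap or ambiguity: assume another client modified the
            # directory, so the name cache must be purged.
            return False
        remaining.remove(matches[0])
        current = matches[0][1]
    return True
```

Two operations completing out of order, e.g. (2, 3) reported before
(1, 2), still chain successfully from start value 1; a pair whose
before value matches nothing, e.g. (5, 6), does not.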
As demonstrated by the scenario above, name caching requires that the
client revalidate name cache data by inspecting the change attribute
of a directory at the point when the name cache item was cached.
This requires that the server update the change attribute for
directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.
skipping to change at page 199, line 16
directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.
10.8.2. Directory Caching
The results of READDIR operations may be used to avoid subsequent
READDIR operations. Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the
context of typical file system APIs, the following rules should be
followed:

o Cached READDIR information for a directory which is not obtained
in a single READDIR operation must always be a consistent snapshot
skipping to change at page 200, line 10
directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.
10.9. Directory Delegations
10.9.1. Introduction to Directory Delegations
Directory caching for the NFSv4.1 protocol, as previously described,
is similar to file caching in previous versions. Clients typically
cache directory information for a duration determined by the client.
At the end of a predefined timeout, the client will query the server
to see if the directory has been updated. By caching attributes,
clients reduce the number of GETATTR calls made to the server to
validate attributes. Furthermore, frequently accessed files and
directories, such as the current working directory, have their
attributes cached on the client so that some NFS operations can be
performed without having to make an RPC call. By caching name and
inode information about most recently looked up entries in the
Directory Name Lookup Cache (DNLC), clients do not need to send
LOOKUP calls to the server every time these files are accessed.
This caching approach works reasonably well at reducing network
traffic in many environments. However, it does not address
environments where there are numerous queries for files that do not
exist. In these cases of "misses", the client must make RPC calls to
the server in order to provide reasonable application semantics and
promptly detect the creation of new directory entries. An example
of high-miss activity is compilation in software development
environments. The current behavior of NFS limits its potential
scalability and wide-area sharing effectiveness in these types of
environments. Other distributed stateful file system architectures
such as AFS and DFS have proven that adding state around directory
contents can greatly reduce network traffic in high-miss
environments.
Delegation of directory contents is a RECOMMENDED feature of NFSv4.1.
Directory delegations provide similar traffic reduction benefits as
with file delegations. By allowing clients to cache directory
contents (in a read-only fashion) while being notified of changes,
the client can avoid making frequent requests to interrogate the
contents of slowly-changing directories, reducing network traffic and
improving client performance. It can also simplify the task of
determining whether other clients are making changes to the directory
when the client itself is making many changes to the directory and
changes are not serialized.
Directory delegations allow improved namespace cache consistency to
be achieved through delegations and synchronous recalls, in the
absence of notifications. In addition, if time-based consistency is
sufficient, asynchronous notifications can provide performance
benefits for the client, and possibly the server, under some common
operating conditions such as slowly-changing and/or very large
directories.
10.9.2. Directory Delegation Design
NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation
to allow the client to ask for a directory delegation. The
delegation covers directory attributes and all entries in the
directory. If either of these change, the delegation will be
recalled synchronously. The operation causing the recall will have
to wait until the recall is complete. Any changes to directory
entry attributes will not cause the delegation to be recalled.
In addition to asking for delegations, a client can also ask for
notifications for certain events. These events include changes to
directory attributes and/or its contents. If a client asks for
notification for a certain event, the server will notify the client
when that event occurs. This will not result in the delegation being
recalled for that client. The notifications are asynchronous and
provide a way of avoiding recalls in situations where a directory is
changing enough that the pure recall model may not be effective while
trying to allow the client to get substantial benefit. In the
absence of notifications, once the delegation is recalled the client
has to refresh its directory cache which might not be very efficient
for very large directories.
The delegation is read-only and the client may not make changes to
the directory other than by performing NFSv4.1 operations that modify
the directory or the associated file attributes so that the server
has knowledge of these changes. In order to keep the client
namespace synchronized with the server, the server will, if the
client has requested notifications, notify the client holding the
delegation of the changes made as a result. This is to avoid any
need for subsequent GETATTR or READDIR calls to the server. If a
single client is holding the delegation and that client makes any
changes to the directory (i.e. the changes are made via operations
issued through a session associated with the clientid holding the
delegation), the delegation will not be recalled. Multiple clients
may hold a delegation on the same directory, but if any such client
modifies the directory, the server MUST recall the delegation from
the other clients, unless those clients have made provisions to be
notified of that sort of modification.
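The recall rule in the paragraph above can be sketched as a small
decision function. This is a hypothetical illustration of the
stated behavior, not server code; the parameter names are invented
for the sketch.

```python
def delegations_to_recall(holders, modifier, notifiable, want_notify):
    """holders: client ids holding a delegation on a directory;
    modifier: the client that modified it; notifiable: whether the
    change is of a kind the notification mechanism covers;
    want_notify: the holders that requested notification of it.
    Returns the holders whose delegation the server must recall."""
    recall = set()
    for client in holders:
        if client == modifier:
            continue  # the modifying client's own delegation survives
        if notifiable and client in want_notify:
            continue  # this holder is notified rather than recalled
        recall.add(client)
    return recall
```

So a holder that arranged to be notified keeps its delegation, while
one that did not, or whose requested notifications do not cover the
change, has its delegation recalled.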
Delegations can be recalled by the server at any time. Normally, the
server will recall the delegation when the directory changes in a way
that is not covered by the notification, or when the directory
changes and notifications have not been requested. If another client
removes the directory for which a delegation has been granted, the
server will recall the delegation.
10.9.3. Attributes in Support of Directory Notifications
See Section 5.10 for a description of the attributes associated with
directory notifications.
10.9.4. Directory Delegation Recall
The server will recall the directory delegation by sending a callback
to the client. It will use the same callback procedure as used for
recalling file delegations. The server will recall the delegation
when the directory changes in a way that is not covered by the
notification. However, the server need not recall the delegation if
attributes of an entry within the directory change.
If the server notices that handing out a delegation for a directory
is causing too many notifications to be sent out, it may decide not
to hand out delegations for that directory, or recall those already
granted. If a client tries to remove the directory for which a
delegation has been granted, the server will recall all associated
delegations.
10.9.5. Directory Delegation Recovery
Crash recovery for state on regular files has two main goals:
avoiding the necessity of breaking application guarantees with
respect to locked files and delivering updates cached at the client.
Neither of these applies to directories protected by read delegations
and notifications. Thus, no provision is made for reclaiming
directory delegations in the event of client or server failure. The
client can simply establish a directory delegation in the same
fashion as was done initially.
11. Multi-Server Namespace

NFSv4.1 supports attributes that allow a namespace to extend beyond
the boundaries of a single server. It is recommended that clients
and servers support construction of such multi-server namespaces.
Use of such multi-server namespaces is OPTIONAL however, and for many
purposes, single-server namespaces are perfectly acceptable. Use of
multi-server namespaces can provide many advantages, however, by
separating a file system's logical position in a namespace from the
(possibly changing) logistical and administrative considerations that
skipping to change at page 207, line 39
system location provides a means by which file systems located on one
server can be associated with a namespace defined by another server,
thus allowing a general multi-server namespace facility. Designation
of such a location, in place of an absent file system, is called
"referral".

Because client support for location-related attributes is OPTIONAL, a
server may (but is not required to) take action to hide migration and
referral events from such clients, by acting as a proxy, for example.
The server can determine the presence of client support from data
passed in the EXCHANGE_ID operation (see Section 18.35.3).
11.4.1. File System Replication

The fs_locations and fs_locations_info attributes provide alternative
locations, to be used to access data in place of or in addition to
the current file system instance. On first access to a file system,
the client should obtain the value of the set of alternate locations
by interrogating the fs_locations or fs_locations_info attribute,
with the latter being preferred.
skipping to change at page 208, line 21
read-only) file system data, or they may reflect alternate paths to
the same server or provide for the use of various forms of server
clustering in which multiple servers provide alternate ways of
accessing the same physical file system. How these different modes
of file system transition are represented within the fs_locations and
fs_locations_info attributes and how the client deals with file
system transition issues will be discussed in detail below.

Multiple server addresses may correspond to the same actual server,
as shown by a common so_major_id field within the eir_server_owner
field returned by EXCHANGE_ID (see Section 18.35.3). When such
server addresses exist, the client may assume that for each file
system in the namespace of a given server network address, there
exist file systems at corresponding namespace locations for each of
the other server network addresses. It may do this even in the
absence of explicit listing in fs_locations and fs_locations_info.
Such corresponding file system locations can be used as alternate
locations, just as those explicitly specified via the fs_locations
and fs_locations_info attributes. Where these specific locations are
designated in the fs_locations_info attribute, the conditions of use
specified in this attribute (e.g. priorities, specification of
[...]
issues in effecting a transition.

Where the fs_locations_info attribute is available, such file system
classification data will be made directly available to the client.
See Section 11.9 for details.  When only fs_locations is available,
default assumptions with regard to such classifications have to be
inferred.  See Section 11.8 for details.

In cases in which one server is expected to accept opaque values from
the client that originated from another server, the servers SHOULD
encode the "opaque" values in big endian byte order.  If this is
done, servers acting as replicas or immigrating file systems will be
able to parse values like stateids, directory cookies, filehandles,
etc. even if their native byte order is different from that of other
servers cooperating in the replication and migration of the file
system.
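As a minimal illustration of this recommendation (a sketch, not part
of the protocol), a server might fix the on-the-wire representation
of a 64-bit directory cookie as follows, independent of its native
byte order:

```python
import struct

def encode_cookie(cookie):
    # Pack a 64-bit directory cookie in big-endian byte order so a
    # cooperating replica with a different native byte order can
    # still parse it.  ">Q" = big-endian unsigned 64-bit.
    return struct.pack(">Q", cookie)

def decode_cookie(data):
    # Inverse operation on the server receiving the opaque value.
    return struct.unpack(">Q", data)[0]
```

The same approach applies to any opaque quantity (stateids, write
verifiers, filehandle components) that one server must parse after
another server generated it.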
11.6.1.  File System Transitions and Simultaneous Access

When a single file system may be accessed at multiple locations,
whether this is because of an indication of file system identity as
reported by the fs_locations or fs_locations_info attributes or
because two file system instances have corresponding locations on
server addresses which connect to the same server as indicated by a
[...]
root pathname together with an array of

/*
 * Defines an individual server replica
 */
struct fs_locations_server4 {
        int32_t         fls_currency;
        opaque          fls_info<>;
        utf8str_cis     fls_server;
};
/*
 * Byte indices of items within fls_info: flag fields, class numbers,
 * bytes indicating ranks and orders.
 */
const FSLI4BX_GFLAGS            = 0;
const FSLI4BX_TFLAGS            = 1;
const FSLI4BX_CLSIMUL           = 2;
const FSLI4BX_CLHANDLE          = 3;
const FSLI4BX_CLFILEID          = 4;
const FSLI4BX_CLWRITEVER        = 5;
const FSLI4BX_CLCHANGE          = 6;
[...]
const FSLI4GF_ABSENT            = 0x04;
const FSLI4GF_GOING             = 0x08;
const FSLI4GF_SPLIT             = 0x10;

/*
 * Bits defined within the transport flag byte.
 */
const FSLI4TF_RDMA              = 0x01;
/*
 * Defines a set of replicas sharing a common value of the root
 * path within the corresponding single-server namespaces.
 */
struct fs_locations_item4 {
        fs_locations_server4    fli_entries<>;
        pathname4               fli_rootpath;
};

/*
 * Defines the overall structure of the fs_locations_info attribute.
 */
struct fs_locations_info4 {
        uint32_t                fli_flags;
        int32_t                 fli_valid_for;
        pathname4               fli_fs_root;
        fs_locations_item4      fli_items<>;
};
/*
 * Flag bits in fli_flags.
[...]
   relative to the master copy.  A negative value indicates that the
   server is unable to give any reasonably useful value here.  A zero
   indicates that the file system is the actual writable data or a
   reliably coherent and fully up-to-date copy.  Positive values
   indicate how out-of-date this copy can normally be before it is
   considered for update.  Such a value is not a guarantee that such
   updates will always be performed on the required schedule but
   instead serves as a hint about how far the copy of the data would
   be expected to be behind the most up-to-date copy.
o  A counted array of one-byte values (fls_info) containing
   information about the particular file system instance.  This data
   includes general flags, transport capability flags, file system
   equivalence class information, and selection priority information.
   The encoding will be discussed below.

o  The server string (fls_server).  For the case of the replica
   currently being accessed (via GETATTR), a null string MAY be used
   to indicate the current address being used for the RPC call.

Data within the fls_info array is in the form of 8-bit data items
[...]
o  Explicit definition of the various specific data items within XDR
   would limit expandability in that any extension within a
   subsequent minor version would require yet another attribute,
   leading to specification and implementation clumsiness.

o  Such explicit definitions would also make it impossible to propose
   standards-track extensions apart from a full minor version.
This encoding scheme can be adapted to the specification of multi-
byte numeric values, even though none are currently defined.  If
extensions are made via standards-track RFC's, multi-byte quantities
will be encoded as a range of bytes with a range of indices, with the
byte interpreted in big endian byte order.  Further, any such index
assignments are constrained so that the relevant quantities will not
cross XDR word boundaries.

The set of fls_info data is subject to expansion in a future minor
version, or in a standards-track RFC, within the context of a single
minor version.  The server SHOULD NOT send and the client MUST NOT
use indices within the fls_info array that are not defined in
standards-track RFC's.
The fls_info array contains within it:
[...]
o  Two 8-bit flag fields, one devoted to general file-system
   characteristics and a second reserved for transport-related
   capabilities.

o  Six 8-bit class values which define various file system
   equivalence classes as explained below.

o  Four 8-bit priority values which govern file system selection as
   explained below.
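Using the FSLI4BX_* byte indices defined earlier, a client-side
sketch of pulling the defined items out of a received fls_info array
might look as follows (only the indices shown in this excerpt are
used; the dictionary field names are illustrative, not from the
specification):

```python
# Byte indices from the XDR definitions above.
FSLI4BX_GFLAGS     = 0
FSLI4BX_TFLAGS     = 1
FSLI4BX_CLSIMUL    = 2
FSLI4BX_CLHANDLE   = 3
FSLI4BX_CLFILEID   = 4
FSLI4BX_CLWRITEVER = 5
FSLI4BX_CLCHANGE   = 6

def decode_fls_info(fls_info):
    # fls_info is the counted array of one-byte values; each defined
    # index holds an independent 8-bit datum.
    return {
        "general_flags":   fls_info[FSLI4BX_GFLAGS],
        "transport_flags": fls_info[FSLI4BX_TFLAGS],
        "simul_class":     fls_info[FSLI4BX_CLSIMUL],
        "handle_class":    fls_info[FSLI4BX_CLHANDLE],
        "fileid_class":    fls_info[FSLI4BX_CLFILEID],
        "writever_class":  fls_info[FSLI4BX_CLWRITEVER],
        "change_class":    fls_info[FSLI4BX_CLCHANGE],
    }
```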
The general file system characteristics flag (at byte index
FSLI4BX_GFLAGS) has the following bits defined within it:

o  FSLI4GF_WRITABLE indicates that this file system target is
   writable, allowing it to be selected by clients which may need to
   write on this file system.  When the current file system instance
   is writable, and is defined as of the same simultaneous use class
   (as specified by the value at index FSLI4BX_CLSIMUL) to which the
   client was previously writing, then it must incorporate within its
   data any committed write made on the source file system instance.
   See the section on verifier class for issues related to
[...]
   working in concert).  Note that filehandles could be different for
   file systems that took part in the split from those newly
   accessed, allowing the server to determine when the need for such
   treatment is over.

   Although it is possible for this flag to be present in the event
   of referral, it would generally be of little interest to the
   client, since the client is not expected to have information
   regarding the current contents of the absent file system.
The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the
following bits related to the transport capabilities of the specific
file system.

o  FSLI4TF_RDMA indicates that this file system provides NFSv4.1 file
   system access using an RDMA-capable transport.

Attribute continuity and file system identity information are
expressed by defining equivalence relations on the sets of file
systems presented to the client.  Each such relation is expressed as
a set of file system equivalence classes.  For each relation, a file
[...]
simple to provide this data assumes that the server has knowledge of
the appropriate set of identity relationships to be encoded.  As each
instance entry is added, the relationships of this instance to
previously entered instances can be consulted, and if one is found
that bears the specified relationship, that entry's class value can
be copied to the new entry.  When no such previous entry exists, a
new value for that byte index, not previously used, can be selected,
most likely by incrementing the last class value assigned for that
index.
o  The field with byte index FSLI4BX_CLSIMUL defines the
   simultaneous-use class for the file system.

o  The field with byte index FSLI4BX_CLHANDLE defines the handle
   class for the file system.

o  The field with byte index FSLI4BX_CLFILEID defines the fileid
   class for the file system.

o  The field with byte index FSLI4BX_CLWRITEVER defines the write
   verifier class for the file system.

o  The field with byte index FSLI4BX_CLCHANGE defines the change
   class for the file system.

o  The field with byte index FSLI4BX_CLREADDIR defines the readdir
   class for the file system.
Server-specified preference information is also provided via 8-bit
values within the fls_info array.  The values provide a rank and an
order (see below) to be used with separate values specifiable for the
cases of read-only and writable file systems.  These values are
compared for different file systems to establish the server-specified
preference, with lower values indicating "more preferred".

Rank is used to express a strict server-imposed ordering on clients,
[...]
Within a rank, the order value is used to specify the server's
preference to guide the client's selection when the client's own
preferences are not controlling, with lower values of order
indicating "more preferred."  If replicas are approximately equal in
all respects, clients should defer to the order specified by the
server.  When clients look at server latency as part of their
selection, they are free to use this criterion but it is suggested
that when latency differences are not significant, the server-
specified order should guide selection.

o  The field at byte index FSLI4BX_READRANK gives the rank value to
   be used for read-only access.

o  The field at byte index FSLI4BX_READORDER gives the order value to
   be used for read-only access.

o  The field at byte index FSLI4BX_WRITERANK gives the rank value to
   be used for writable access.

o  The field at byte index FSLI4BX_WRITEORDER gives the order value
   to be used for writable access.
Depending on the potential need for write access by a given client,
one of the pairs of rank and order values is used.  The read rank and
order should only be used if the client knows that only reading will
ever be done or if it is prepared to switch to a different replica in
the event that any write access capability is required in the future.
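When the client has no overriding preferences of its own, its
replica-selection logic might therefore reduce to the following
sketch (the replica tuple layout is illustrative, not from the
specification):

```python
def pick_replica(replicas, need_write):
    # replicas: list of (server, read_rank, read_order,
    #                    write_rank, write_order) tuples.
    # Lower rank is more preferred; within equal ranks, lower order
    # is more preferred.  Use the write pair only when write access
    # may be needed.
    def preference(r):
        server, rrank, rorder, wrank, worder = r
        return (wrank, worder) if need_write else (rrank, rorder)
    return min(replicas, key=preference)[0]
```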
11.9.2.  The fs_locations_info4 Structure
[...]
The procedure above will ensure that before using any data from the
file system the client has in hand a newly-fetched current version of
the file system image.  Multiple values for multiple requests in
flight can be resolved by assembling them into the required partial
order (and the elements should form a total order within it) and
using the last.  The client may then, when switching among file
system instances, decline to use an instance which is not of type
STATUS4_VERSIONED or whose version field is earlier than the last one
obtained from the predecessor file system instance.
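A client-side check implementing this rule might look like the
following sketch (the boolean flag standing in for the
STATUS4_VERSIONED type and the integer version values are
illustrative):

```python
def latest_version(versions_in_flight):
    # Version values from multiple requests in flight form a total
    # order within the required partial order; keep the last
    # (greatest) one.
    return max(versions_in_flight)

def acceptable_instance(is_versioned, version, last_seen):
    # Decline an instance that is not of type STATUS4_VERSIONED, or
    # whose version field is earlier than the last one obtained from
    # the predecessor file system instance.
    return is_versioned and version >= last_seen
```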
12.  Directory Delegations
12.1. Introduction to Directory Delegations
Directory caching for the NFSv4.1 protocol is similar to file caching
in previous versions. Clients typically cache directory information
for a duration determined by the client. At the end of a predefined
timeout, the client will query the server to see if the directory has
been updated. By caching attributes, clients reduce the number of
GETATTR calls made to the server to validate attributes.
Furthermore, frequently accessed files and directories, such as the
current working directory, have their attributes cached on the client
so that some NFS operations can be performed without having to make
an RPC call. By caching name and inode information about most
recently looked up entries in the Directory Name Lookup Cache (DNLC),
clients do not need to send LOOKUP calls to the server every time
these files are accessed.
This caching approach works reasonably well at reducing network
traffic in many environments. However, it does not address
environments where there are numerous queries for files that do not
exist. In these cases of "misses", the client must make RPC calls to
the server in order to provide reasonable application semantics and
promptly detect the creation of new directory entries.  An example of
high miss activity is compilation in software development
environments.  The current behavior of NFS limits its potential
scalability and wide-area sharing effectiveness in these types of
environments. Other distributed stateful file system architectures
such as AFS and DFS have proven that adding state around directory
contents can greatly reduce network traffic in high miss
environments.
Delegation of directory contents is a RECOMMENDED feature of NFSv4.1.
Directory delegations provide similar traffic reduction benefits as
with file delegations. By allowing clients to cache directory
contents (in a read-only fashion) while being notified of changes,
the client can avoid making frequent requests to interrogate the
contents of slowly-changing directories, reducing network traffic and
improving client performance.
Directory delegations allow improved namespace cache consistency to
be achieved through delegations and synchronous recalls alone without
asking for notifications. In addition, if time-based consistency is
sufficient, asynchronous notifications can provide performance
benefits for the client, and possibly the server, under some common
operating conditions such as slowly-changing and/or very large
directories.
12.2. Directory Delegation Design
NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation
to allow the client to ask for a directory delegation. The
delegation covers directory attributes and all entries in the
directory.  If either of these changes, the delegation will be recalled
synchronously.  The operation causing the recall will have to wait
until the recall is complete.  Any changes to directory entry
attributes will not cause the delegation to be recalled.
In addition to asking for delegations, a client can also ask for
notifications for certain events. These events include changes to
directory attributes and/or its contents. If a client asks for
notification for a certain event, the server will notify the client
when that event occurs. This will not result in the delegation being
recalled for that client. The notifications are asynchronous and
provide a way of avoiding recalls in situations where a directory is
changing enough that the pure recall model may not be effective while
trying to allow the client to get substantial benefit. In the
absence of notifications, once the delegation is recalled the client
has to refresh its directory cache which might not be very efficient
for very large directories.
The delegation is read-only and the client may not make changes to
the directory other than by performing NFSv4.1 operations that modify
the directory or the associated file attributes so that the server
has knowledge of these changes. In order to keep the client
namespace synchronized with the server, the server will, if the
client has requested notifications, notify the client holding the
delegation of the changes made as a result. This is to avoid any
subsequent GETATTR or READDIR calls to the server. If a single
client is holding the delegation and that client makes any changes to
the directory (i.e. the changes are made via operations issued through
a session associated with the clientid holding the delegation), the
delegation will not be recalled. Multiple clients may hold a
delegation on the same directory, but if any such client modifies the
directory, the server MUST recall the delegation from the other
clients, unless those clients have made provisions to be notified
of that sort of modification.
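The recall rule in the preceding paragraph can be sketched as
follows; notification_requested is a hypothetical predicate standing
in for the server's record of which clients asked to be notified of
this kind of modification:

```python
def delegations_to_recall(holders, modifying_client,
                          notification_requested):
    # When one holder modifies the directory, the server MUST recall
    # the delegation from every other holder, except those that
    # arranged to be notified of this sort of modification instead.
    return [c for c in holders
            if c != modifying_client and not notification_requested(c)]
```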
Delegations can be recalled by the server at any time. Normally, the
server will recall the delegation when the directory changes in a way
that is not covered by the notification, or when the directory
changes and notifications have not been requested. If another client
removes the directory for which a delegation has been granted, the
server will recall the delegation.
12.3. Attributes in Support of Directory Notifications
See Section 5.10 for a description of the attributes associated with
directory notifications.
12.4. Delegation Recall
The server will recall the directory delegation by sending a callback
to the client. It will use the same callback procedure as used for
recalling file delegations. The server will recall the delegation
when the directory changes in a way that is not covered by the
notification. However the server will not recall the delegation if
attributes of an entry within the directory change. Also, if the
server notices that handing out a delegation for a directory is
causing too many notifications to be sent out, it may decide not to
hand out a delegation for that directory. If another client tries to
remove the directory for which a delegation has been granted, the
server will recall the delegation.
12.5. Directory Delegation Recovery
Crash recovery for state on regular files has two main goals,
avoiding the necessity of breaking application guarantees with
respect to locked files and delivery of updates cached at the client.
Neither of these applies to directories protected by read delegations
and notifications. Thus, the client is required to establish a new
delegation on a server or client reboot.
13. Parallel NFS (pNFS)
13.1.  Introduction
pNFS is a set of optional features within NFSv4.1; the pNFS feature
set allows direct client access to the storage devices containing
file data.  When file data for a single NFSv4.1 server is stored on
multiple and/or higher-throughput storage devices (by comparison to
the server's throughput capability), the result can be significantly
better file access performance.  The relationship among multiple
clients, a single server, and multiple storage devices for pNFS
(server and clients have access to all storage devices) is shown in
this diagram:
[...]
It is possible that various storage protocols are available to both
client and server and it may be possible that a client and server do
not have a matching storage protocol available to them.  Because of
this, the pNFS server MUST support normal NFSv4.1 access to any file
accessible by the pNFS feature; this will allow for continued
interoperability between an NFSv4.1 client and server.

There are interesting interactions between layouts and other NFSv4.1
abstractions such as data delegations and record locking.  Delegation
issues are discussed in Section 13.5.4.  Byte range locking issues
are discussed in Section 13.2.9 and Section 13.5.1.
13.2.  pNFS Definitions

NFSv4.1's pNFS feature partitions the file system protocol into two
parts: metadata and data, where data is the contents of a file and
metadata is "everything else".  The metadata functionality is
implemented by a metadata server that supports pNFS and the
operations described in Section 18.  The data functionality is
implemented by a storage device that supports the storage protocol.
A subset (defined in Section 14.7) of NFSv4.1 itself is one such
storage protocol.  New terms are introduced to the NFSv4.1
nomenclature and existing terms are clarified to allow for the
description of the pNFS feature.
13.2.1.  Metadata

Information about a file system object, such as its name, location
within the namespace, owner, ACL, and other attributes.  Metadata may
also include storage location information, and this will vary based
on the underlying storage mechanism that is used.
13.2.2.  Metadata Server

An NFSv4.1 server which supports the pNFS feature.  A variety of
architectural choices exists for the metadata server and its use of
what file system information is held at the server.  Some servers may
contain metadata only for the file objects that reside at the
metadata server, while file data resides on the associated storage
devices.  Other metadata servers may hold both metadata and a varying
degree of file data.
12.2.3. pNFS Client

An NFSv4.1 client that supports pNFS operations and supports at
least one storage protocol or layout type for performing I/O to
storage devices.
12.2.4. Storage Device

A storage device stores a regular file's data, but leaves metadata
management to the metadata server. A storage device could be another
NFSv4.1 server, an object storage device (OSD), a block device
accessed over a SAN (e.g., either a Fibre Channel or iSCSI SAN), or
some other entity.
12.2.5. Storage Protocol

A storage protocol is the protocol used between the pNFS client and
the storage device to access the file data.
12.2.6. Control Protocol

The control protocol is used by the exported file system between the
metadata server and storage devices. Specification of such protocols
is outside the scope of the NFSv4.1 protocol. Such control protocols
would be used to control activities such as the allocation and
deallocation of storage and the management of state required by the
storage devices to perform client access control.

A particular control protocol is not mandated by NFSv4.1, but
requirements are placed on the control protocol for maintaining
attributes like modify time, the change attribute, and the end-of-
file (EOF) position.
12.2.7. Layout Types

A layout describes the mapping of a file's data to the storage
devices that hold the data. A layout is said to belong to a specific
layout type (data type layouttype4, see Section 3.2.15). The layout
type allows for variants to handle different storage protocols, such
as those associated with block/volume [30], object [29], and file
(Section 13) layout types. A metadata server, along with its control
protocol, MUST support at least one layout type. A private sub-range
of the layout type name space is also defined. Values from the
private layout type range MAY be used for internal testing or
experimentation.

As an example, a file layout type could be an array of tuples (e.g.,
deviceID, file_handle), along with a definition of how the data is
stored across the devices (e.g., striping). A block/volume layout
might be an array of tuples that store <deviceID, block_number,
block count> along with information about block size and the
associated file offset of the block number. An object layout might
be an array of tuples <deviceID, objectID> and an additional
structure (i.e., the aggregation map) that defines how the logical
byte sequence of the file data is serialized into the different
objects. Note that the actual layouts are typically more complex
than these simple expository examples.
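The file layout example above (an array of <deviceID, file_handle>
tuples plus a striping definition) can be sketched in a few lines.
This is a non-normative illustration: the round-robin striping
scheme, the stripe-unit size, and all names are assumptions, not the
protocol's XDR definitions.

```python
# Illustrative sketch only, not the normative file layout structure.
STRIPE_UNIT = 65536  # bytes per stripe unit (example value)

def map_offset(layout, offset, stripe_unit=STRIPE_UNIT):
    """Map a file offset to (device_id, filehandle, per-device offset)
    under simple round-robin striping across the layout's tuples."""
    stripe_index = offset // stripe_unit        # which stripe unit
    device_slot = stripe_index % len(layout)    # which (deviceID, fh)
    device_id, fh = layout[device_slot]
    # Offset within the device's own logical byte sequence:
    local = (stripe_index // len(layout)) * stripe_unit \
            + offset % stripe_unit
    return device_id, fh, local

# Hypothetical layout: three (deviceID, file_handle) tuples.
layout = [(1, "fh-A"), (2, "fh-B"), (3, "fh-C")]
```

A real layout type would carry considerably more state (block sizes,
aggregation maps, etc.), as the paragraph above notes.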
12.2.8. Layout

A layout defines how a file's data is organized on one or more
storage devices. There are many potential layout types; each of the
layout types is differentiated by the storage protocol used to
access data and in the aggregation scheme that lays out the file
data on the underlying storage device. A layout is precisely
identified by the following tuple: <client ID, filehandle, layout
type, iomode, range>; where filehandle refers to the filehandle of
the file on the metadata server.

It is important to define when layouts overlap and/or conflict with
each other. For two layouts with overlapping byte ranges to actually
overlap each other, both layouts must be of the same layout type,
correspond to the same filehandle, and have the same iomode. Layouts
conflict when they overlap and differ in the content of the layout
(i.e., the storage device/file mapping parameters differ). Note that
differing iomodes do not lead to conflicting layouts. It is
permissible for layouts with different iomodes, pertaining to the
same byte range, to be held by the same client. An example of this
would be copy-on-write functionality for a block/volume layout type.
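The overlap and conflict rules above can be expressed as two small
predicates. The structures and field names below are simplified
stand-ins for illustration, not the protocol's data types.

```python
# Sketch of the layout overlap/conflict rules; illustrative only.
from collections import namedtuple

Layout = namedtuple(
    "Layout", "layout_type filehandle iomode offset length content")

def ranges_intersect(a, b):
    """True when the two byte ranges share at least one byte."""
    return a.offset < b.offset + b.length and \
           b.offset < a.offset + a.length

def layouts_overlap(a, b):
    # Overlap requires same layout type, same filehandle, and the
    # same iomode, plus intersecting byte ranges.
    return (a.layout_type == b.layout_type and
            a.filehandle == b.filehandle and
            a.iomode == b.iomode and
            ranges_intersect(a, b))

def layouts_conflict(a, b):
    # Conflict: overlapping layouts whose mapping content differs.
    return layouts_overlap(a, b) and a.content != b.content
```

Note how two layouts on the same byte range with different iomodes
come out as non-overlapping, matching the copy-on-write example.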
12.2.9. Layout Iomode

The layout iomode (data type layoutiomode4, see Section 3.2.23)
indicates to the metadata server the client's intent to perform
either just READ operations (Section 18.22) or a mixture of I/O
possibly containing WRITE (Section 18.32) and READ operations. For
certain layout types, it is useful for a client to specify this
intent at LAYOUTGET (Section 18.43) time. For example, for block/
volume-based protocols, block allocation could occur when a READ/
WRITE iomode is specified. A special LAYOUTIOMODE4_ANY iomode is
defined and can only be used for LAYOUTRETURN and LAYOUTRECALL, not
for
The iomode does not conflict with OPEN share modes or lock requests;
open mode and lock conflicts are enforced as they are without the
use of pNFS, and are logically separate from the pNFS layout level.
As well, open modes and locks are the preferred method for
restricting user access to data files. For example, an OPEN of read,
deny-write does not conflict with a LAYOUTGET containing an iomode
of READ/WRITE performed by another client. Applications that depend
on writing into the same file concurrently may use record locking to
serialize their accesses.
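As an illustration of what the iomode means for a block/volume-like
layout type, a storage device's admission check for an I/O request
might look like the sketch below. The constant names follow
layoutiomode4; the check itself is an assumption about one possible
implementation, not a normative rule.

```python
# Illustrative iomode admission check for a storage device.
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2
LAYOUTIOMODE4_ANY = 3  # only valid in LAYOUTRETURN/LAYOUTRECALL

def io_permitted(iomode, is_write):
    """A WRITE needs a READ/WRITE (RW) iomode; a READ is satisfied
    by either READ or RW. ANY never authorizes I/O by itself."""
    if iomode == LAYOUTIOMODE4_RW:
        return True
    if iomode == LAYOUTIOMODE4_READ:
        return not is_write
    return False
```

Open share modes and locks would be enforced separately, as the
paragraph above explains; this check covers only the layout level.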
12.2.10. Device IDs

The device ID (data type deviceid4, see Section 3.2.16) names a
storage device. In practice, a significant amount of information may
be required to fully address a storage device. Rather than embedding
all such information in a layout, layouts embed device IDs. The
NFSv4.1 operation GETDEVICEINFO (Section 18.40) is used to retrieve
the complete address information regarding the storage device
according to its layout type and device ID. For example, the address
of an NFSv4.1 data server or of an object storage device could be an
IP address and port. The address of a block storage device could be
a volume label.

Clients cannot expect the mapping between device ID and storage
device address to persist across metadata server restart. See
Section 12.7.4 for a description of how recovery works in that
situation.

To clearly define the lifetime of a device ID to storage address
mapping and to assist in modifications to those mappings over the
duration of server operation, the stateid mechanism is applied to
device IDs. Device ID mappings represent another form of stateid
(see Section 8.2.1). The GETDEVICEINFO and GETDEVICELIST operations
each return a device stateid. Like file delegations, the device
stateid is recallable. A recall of the device stateid will remove or
invalidate the device ID mappings, as will lease expiration. The
GETDEVICEINFO and GETDEVICELIST operations update the current
filehandle to facilitate the recall of the device stateid. To reduce
the need to recall the device ID stateid during mapping
modifications, the notifications mechanism may be used by the server
to update the client on changes that occur. The server must support
notifications and the client must request them before they can be
used. For further information about the notification types, see
Section 20.4.
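A client-side cache of device ID to address mappings, invalidated
when the device stateid is recalled or the metadata server restarts,
could be sketched as below. The get_device_info callable is a
hypothetical stand-in for issuing the GETDEVICEINFO operation; all
names here are illustrative.

```python
# Sketch of a pNFS client's device-ID mapping cache; not normative.
class DeviceCache:
    def __init__(self, get_device_info):
        self._fetch = get_device_info
        self._map = {}  # (layout_type, device_id) -> device address

    def resolve(self, layout_type, device_id):
        """Return the storage device address, fetching it via the
        (hypothetical) GETDEVICEINFO stand-in on a cache miss."""
        key = (layout_type, device_id)
        if key not in self._map:
            # How the address is interpreted depends on the layout
            # type: e.g., an IP address and port for a file layout,
            # or a volume label for a block layout.
            self._map[key] = self._fetch(layout_type, device_id)
        return self._map[key]

    def invalidate(self):
        # On a device stateid recall, lease expiration, or metadata
        # server restart, the mappings can no longer be trusted.
        self._map.clear()
```

With server notifications in use, individual entries could be
updated in place instead of dropping the whole cache.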
12.3. pNFS Operations

NFSv4.1 has several operations that are needed for pNFS servers,
regardless of layout type or storage protocol. These operations are
all issued to a metadata server and summarized here. Even though
pNFS is an OPTIONAL feature of NFSv4.1, if a server is supporting
the pNFS feature, it MUST support all of the pNFS operations.

GETDEVICEINFO. As noted previously (Section 12.2.10), GETDEVICEINFO
(Section 18.40) returns the mapping of device ID to storage device
address.

GETDEVICELIST (Section 18.41) allows clients to fetch all of the
mappings of device IDs to storage device addresses for a specific
file system.

LAYOUTGET (Section 18.43) is used by a client to get a layout for a
file.
ID.

CB_RECALL_ANY (Section 20.6) tells a client that it needs to return
some number of recallable objects, including layouts, to the
metadata server.

CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a
recallable object that it was denied (in the case of pNFS, a layout
denied by LAYOUTGET) due to resource exhaustion is now available.
12.4. pNFS Attributes

A number of attributes specific to pNFS are listed and described in
Section 5.11.
12.5. Layout Semantics

12.5.1. Guarantees Provided by Layouts

Layouts delegate to the client the ability to access data located at
a storage device with the appropriate storage protocol. The client
is guaranteed the layout will be recalled when one of two things
occurs: either a conflicting layout is requested, or the state
encapsulated by the layout becomes invalid, which can happen when an
event directly or indirectly modifies the layout. When a layout is
recalled and returned by the client, the client continues with the
ability to access file data with normal NFSv4.1 operations through
the metadata server. Only the ability to access the storage devices
is affected.
The requirement of NFSv4.1, that all user access rights MUST be
obtained through the appropriate open, lock, and access operations,
is not modified by the existence of layouts. Layouts are provided
to NFSv4.1 clients, and user access still follows the rules of the
protocol as if they did not exist. It is a requirement that for a
client to access a storage device, a layout must be held by the
client. If a storage device receives an I/O request for a byte range
for which the client does not hold a layout, the storage device
SHOULD reject that I/O request. Note that the act of modifying a
file for which a layout is held does not necessarily conflict with
the holding of the layout that describes the file being modified.
Therefore, it is the requirement of the storage protocol or layout
type that determines the necessary behavior. For example, block/
volume layout types require that the layout's iomode agree with the
type of I/O being performed.
Depending upon the layout type and storage protocol in use, storage
behave as they would without pNFS. Therefore, if mandatory file
locks and layouts are provided simultaneously, the storage device
MUST be able to enforce the mandatory file locks. For example, if
one client obtains a mandatory lock and a second client accesses the
storage device, the storage device MUST appropriately restrict I/O
for the byte range of the mandatory file lock. If the storage device
is incapable of providing this check in the presence of mandatory
file locks, the metadata server then MUST NOT grant layouts and
mandatory file locks simultaneously.
12.5.2. Getting a Layout

A client obtains a layout with the LAYOUTGET operation. The metadata
server will grant layouts of a particular type (e.g., block/volume,
object, or file). The client selects an appropriate layout type that
the server supports and the client is prepared to use. The layout
returned to the client may not exactly align with the requested byte
range. A field within the LAYOUTGET request, loga_minlength,
specifies the minimum overlap that MUST exist between the requested
layout and the layout returned by the metadata server. The
loga_minlength field should be at least one. As needed, a client may
make multiple LAYOUTGET requests; these will result in multiple
overlapping, non-conflicting layouts.
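The loga_minlength requirement above amounts to a simple range
check, sketched here with simplified scalar arguments rather than
the actual LAYOUTGET argument and result structures.

```python
# Illustrative loga_minlength check; not the normative semantics.
def satisfies_minlength(req_offset, req_minlength,
                        got_offset, got_length):
    """True when the returned layout covers at least req_minlength
    bytes starting at the requested offset."""
    if got_offset > req_offset:
        # Returned range starts past the requested offset.
        return False
    covered = got_offset + got_length - req_offset
    return covered >= req_minlength
```

A client finding the check false would typically retry LAYOUTGET or
fall back to I/O through the metadata server.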
There is no required ordering between getting a layout and
performing a file OPEN. For example, a layout may first be retrieved
by placing a LAYOUTGET operation in the same COMPOUND as the initial
file OPEN.
Although the metadata server is in control of the layout for a file,
the pNFS client can provide hints to the server when a file is
opened or created about the preferred layout type and aggregation
schemes. pNFS introduces a layout_hint (Section 5.11.4) attribute
that the client can set at file creation time to provide a hint to
the server for new files. Setting this attribute separately, after
the file has been created, might make it difficult, or impossible,
for the server implementation to comply. This further complicates
exclusive file creation via OPEN, which when done via the EXCLUSIVE4
createmode does not allow the setting of attributes at file creation
time. However, as noted in Section 18.16.3, if the server supports a
persistent reply cache, the EXCLUSIVE4 createmode is not needed.
Therefore, a metadata server that supports the layout_hint attribute
MUST support a persistent session reply cache, and a pNFS client
that wants to set layout_hint at file creation (OPEN) time MUST NOT
use the EXCLUSIVE4 createmode, and instead MUST use GUARDED for an
exclusive regular file creation.
12.5.3. Committing a Layout

Allowing for varying storage protocol capabilities, the pNFS
protocol does not require the metadata server and storage devices to
have a consistent view of file attributes and data location
mappings. Data location mapping refers to things like which offsets
store data as opposed to storing holes (see Section 13.5 for a
discussion). Related issues arise for storage protocols where a
layout may hold provisionally allocated blocks where the allocation
of those blocks does not survive a complete restart of both the
client and server. Because of this inconsistency, it is necessary to
re-synchronize the client with the metadata server and its storage
devices and make any potential changes available to other clients.
This is accomplished by use of the LAYOUTCOMMIT operation.
The LAYOUTCOMMIT operation is responsible for committing a modified
layout to the metadata server. The data should be written and
committed to the appropriate storage devices before the LAYOUTCOMMIT
occurs. If the data is being written asynchronously through the
metadata server, a COMMIT to the metadata server is required to
synchronize the data and make it visible on the storage devices (see
Section 12.5.5 for more details). The scope of the LAYOUTCOMMIT
operation depends on the storage protocol in use. It is important to
note that the level of synchronization is from the point of view of
the client that issued the LAYOUTCOMMIT. The updated state on the
metadata server need only reflect the state as of the client's last
operation previous to the LAYOUTCOMMIT. It is not REQUIRED to
maintain a global view that accounts for other clients' I/O that may
have occurred within the same time frame.

For block/volume-based layouts, the LAYOUTCOMMIT may require
updating the block list that comprises the file and committing this
layout to
The control protocol is free to synchronize the attributes before it
receives a LAYOUTCOMMIT; however, upon successful completion of a
LAYOUTCOMMIT, state that exists on the metadata server that
describes the file MUST be in sync with the state existing on the
storage devices that comprise that file as of the issuing client's
last operation. Thus, a client that queries the size of a file
between a WRITE to a storage device and the LAYOUTCOMMIT may observe
a size that does not reflect the actual data written.
12.5.3.1. LAYOUTCOMMIT and change/time_modify/time_change

The change, time_modify, and time_access attributes may be updated
by the server when the LAYOUTCOMMIT operation is processed. The
reason for this is that some layout types do not support the update
of these attributes when the storage devices process I/O operations.
The client is capable of providing suggested values to the server
for time_access and time_modify with the arguments to LAYOUTCOMMIT.
Based on layout type, the provided values may or may not be used.
The server should sanity-check the client-provided values before
they are used. For example, the server should ensure that time does
not
further update to the data has occurred since the last update of the
attributes; file-based protocols may have enough information to make
this determination or may update the change attribute upon each file
modification. This also applies for the time_modify and time_access
attributes. If the server implementation is able to determine that
the file has not been modified since the last time_modify update,
the server need not update time_modify at LAYOUTCOMMIT. At
LAYOUTCOMMIT completion, the updated attributes should be visible if
that file was modified since the latest previous LAYOUTCOMMIT or
LAYOUTGET.
12.5.3.2. LAYOUTCOMMIT and size

The size of a file may be updated when the LAYOUTCOMMIT operation is
used by the client.  One of the fields in the argument to
LAYOUTCOMMIT is loca_last_write_offset; this field indicates the
highest byte offset written but not yet committed with the
LAYOUTCOMMIT operation.  The data type of loca_last_write_offset is
newoffset4 and is switched on a boolean value, no_newoffset, that
indicates if a previous write occurred or not.  If no_newoffset is
FALSE, an offset is not given.  A loca_last_write_offset value of
zero means that one byte was written at offset zero.
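As a hypothetical illustration of the no_newoffset and
loca_last_write_offset semantics above (the helper function and its
use of a current size are assumptions for the sketch, not part of the
protocol), a metadata server treating the offset as a size hint might
compute a candidate size as follows:

```python
def candidate_size(no_newoffset, loca_last_write_offset, current_size):
    """Return a candidate file size from a LAYOUTCOMMIT argument.

    no_newoffset TRUE means the client supplied a last write offset;
    FALSE means no write occurred, so the size is left unchanged.
    """
    if not no_newoffset:
        return current_size
    # An offset of zero means one byte was written at offset zero,
    # so the candidate size is offset + 1.
    return max(current_size, loca_last_write_offset + 1)
```

Whether the server adopts this value as the true size or merely as a
hint is a server choice, as the list below describes.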
The metadata server may do one of the following:

1.  Update the file's size using the last write offset provided by
    the client as either the true file size or as a hint of the file
skipping to change at page 261, line 5
otherwise the new size is not provided.  If the file size is updated,
the metadata server SHOULD update the storage devices such that the
new file size is reflected when LAYOUTCOMMIT processing is complete.
For example, the client should be able to READ up to the new file
size.

If the client wants to explicitly zero-extend or truncate a file, the
SETATTR operation MUST be used; SETATTR use is not required when
simply writing past EOF via WRITE.
12.5.3.3. LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT argument contains a loca_layoutupdate field
(Section 18.42.1) of data type layoutupdate4 (Section 3.2.21).  This
argument is a layout type-specific structure.  The structure can be
used to pass arbitrary layout type-specific information from the
client to the metadata server at LAYOUTCOMMIT time.  For example, if
using a block/volume layout, the client can indicate to the metadata
server which reserved or allocated blocks the client used or did not
use.  The content of loca_layoutupdate (field lou_body) need not be
the same layout type-specific content returned by LAYOUTGET
(Section 18.43.2) in the loc_body field of the lo_content field of
the logr_layout field.  The content of loca_layoutupdate is defined
by the layout type specification and is opaque to LAYOUTCOMMIT.
12.5.4. Recalling a Layout

Since a layout protects a client's access to a file via a direct
client-storage-device path, a layout need only be recalled when it is
semantically unable to serve this function.  Typically, this occurs
when the layout no longer encapsulates the true location of the file
over the byte range it represents.  Any operation or action, such as
server-driven restriping or load balancing, that changes the layout
will result in a recall of the layout.  A layout is recalled by the
CB_LAYOUTRECALL callback operation (see Section 20.3) and returned
with LAYOUTRETURN (Section 18.44).  The CB_LAYOUTRECALL operation may
recall a layout identified by a byte range, all the layouts
associated with a file system (FSID), or all layouts associated with
a client ID.  Recalling all layouts associated with a client ID or
all the layouts associated with a file system also invalidates the
client's device cache for the affected file systems.

Section 12.5.4.2 discusses sequencing issues surrounding the getting,
returning, and recalling of layouts.
An iomode is also specified when recalling a layout.  Generally, the
iomode in the recall request must match the layout being returned;
for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause
the client to only return LAYOUTIOMODE4_RW layouts and not
LAYOUTIOMODE4_READ layouts.  However, a special LAYOUTIOMODE4_ANY
enumeration is defined to enable recalling a layout of any iomode; in
other words, the client must return both read-only and read/write
layouts.
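The iomode matching rule above can be sketched as follows.  The
enumeration names follow the specification; the numeric values and
the helper function are illustrative assumptions:

```python
# Illustrative constants; names mirror the spec's layoutiomode4
# enumeration, values are assumed for this sketch.
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2
LAYOUTIOMODE4_ANY = 3

def must_return(recall_iomode, held_iomode):
    """True if a held layout must be returned for a recall iomode."""
    if recall_iomode == LAYOUTIOMODE4_ANY:
        # ANY recalls both read-only and read/write layouts.
        return True
    return recall_iomode == held_iomode
```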
skipping to change at page 262, line 10
layout to prevent the client from accessing a non-existent file and
to reclaim state stored on the client.  Since a REMOVE may be delayed
until the last close of the file has occurred, the recall may also be
delayed until this time.  After the last reference on the file has
been released and the file has been removed, the client should no
longer be able to perform I/O using the layout.  In the case of a
files-based layout, the pNFS server SHOULD return NFS4ERR_STALE for
the removed file.
Once a layout has been returned, the client MUST NOT issue I/Os to
the storage devices for the file, byte range, and iomode represented
by the returned layout.  If a client does issue an I/O to a storage
device for which it does not hold a layout, the storage device SHOULD
reject the I/O.

Although pNFS does not alter the file data caching capabilities of
clients, or their semantics, it recognizes that some clients may
perform more aggressive write-behind caching to optimize the benefits
provided by pNFS.  However, write-behind caching may negatively
affect the latency in returning a layout in response to a
CB_LAYOUTRECALL; this is similar to file delegations and the impact
skipping to change at page 262, line 34
CB_LAYOUTRECALL.  Once a layout is recalled, a server MUST wait one
lease period before taking further action.  As soon as a lease period
has passed, the server may choose to fence the client's access to the
storage devices if the server perceives the client has taken too long
to return a layout; however, just as in the case of data delegation
and DELEGRETURN, the server may choose to wait, given that the client
is showing forward progress on its way to returning the layout.  This
forward progress can take the form of successful interaction with the
storage devices or sub-portions of the layout being returned by the
client.  The server can also limit exposure to these problems by
limiting the byte ranges initially provided in the layouts and thus
the amount of outstanding modified data.
12.5.4.1. Recall Callback Robustness

It has been assumed thus far that pNFS client state for a file
exactly matches the pNFS server state for that file and client
regarding layout ranges and iomode.  This assumption leads to the
implication that any callback results in a LAYOUTRETURN or set of
LAYOUTRETURNs that exactly match the range in the callback, since
both client and server agree about the state being maintained.
However, it can be useful if this assumption does not always hold.
For example:
skipping to change at page 264, line 9
specification MUST define whether unilateral layout revocation by the
metadata server is supported; if it is, the specification must also
describe how lingering writes are processed.  For example, storage
devices identified by the revoked layout could be fenced off from the
client that held the layout.

In order to ensure client/server convergence with regard to layout
state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN
operations for a particular recall MUST specify the entire range
being recalled, echoing the recalled layout type, iomode, recall/
return type (FILE, FSID, or ALL), and byte range, even if layouts
pertaining to partial ranges were previously returned.  In addition,
if the client holds no layouts that overlap the range being recalled,
the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to
CB_LAYOUTRECALL.  This allows the server to update its view of the
client's layout state.
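The convergence rule above can be sketched from the client's side.
The error-code name comes from the specification; the function,
range representation, and return convention are assumptions for the
sketch:

```python
NFS4ERR_NOMATCHING_LAYOUT = "NFS4ERR_NOMATCHING_LAYOUT"

def respond_to_recall(held_ranges, recalled):
    """held_ranges: list of (offset, length) layouts held by the
    client for the file; recalled: (offset, length) from
    CB_LAYOUTRECALL.  Returns the error code when nothing matches,
    or the range the final LAYOUTRETURN must carry."""
    r_off, r_len = recalled
    r_end = r_off + r_len
    overlaps = [(o, l) for (o, l) in held_ranges
                if o < r_end and o + l > r_off]
    if not overlaps:
        # No held layout overlaps the recall.
        return NFS4ERR_NOMATCHING_LAYOUT
    # The final LAYOUTRETURN specifies the entire recalled range,
    # even if partial ranges were returned along the way.
    return recalled
```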
12.5.4.2. Serialization of Layout Operations

As with other stateful operations, pNFS requires the correct
sequencing of layout operations.  pNFS uses the sessions feature of
NFSv4.1 to provide the correct sequencing between regular operations
and callbacks.  It is the server's responsibility to avoid
inconsistencies regarding the layouts provided and the client's
responsibility to properly serialize its layout requests and layout
returns.
12.5.4.2.1. Get/Return Serialization

The protocol allows the client to send concurrent LAYOUTGET and
LAYOUTRETURN operations to the server.  However, the protocol does
not provide any means for the server to process the requests in the
same order in which they were created, nor does it provide a way for
the client to determine the order in which parallel outstanding
operations were processed by the server.  Thus, when a layout
retrieved by an outstanding LAYOUTGET operation intersects with a
layout returned by an outstanding LAYOUTRETURN, the order in which
the two conflicting operations are processed determines the final
state
skipping to change at page 265, line 5
LAYOUTGET operations for the same file or multiple LAYOUTRETURN
operations for the same file, but never a mix of both.  It is also
permissible for the client to combine LAYOUTRETURN and LAYOUTGET
operations for the same file in the same COMPOUND request, since the
server MUST process these in order.  If a client does issue such
requests, it MUST NOT have more than one outstanding for the same
file at the same time and MUST NOT have other LAYOUTGET or
LAYOUTRETURN operations outstanding at the same time for that same
file.
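A minimal sketch of the client-side rule above: per file, multiple
parallel LAYOUTGETs or multiple parallel LAYOUTRETURNs are allowed,
but never a mix of both outstanding at once.  The function and its
bookkeeping are assumptions, not protocol elements:

```python
def may_issue(outstanding_ops, new_op):
    """outstanding_ops: set of operation names currently outstanding
    for one file.  Returns True if new_op may be sent now without
    violating the serialization rule."""
    assert new_op in ("LAYOUTGET", "LAYOUTRETURN")
    other = "LAYOUTRETURN" if new_op == "LAYOUTGET" else "LAYOUTGET"
    # Multiple operations of the same kind may be in flight;
    # a mix of gets and returns may not.
    return other not in outstanding_ops
```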
12.5.4.2.2. Recall/Return Sequencing

One critical issue with regard to operation sequencing concerns
callbacks.  The protocol must defend against races between the reply
to a LAYOUTGET operation and a subsequent CB_LAYOUTRECALL.  A client
MUST NOT process a CB_LAYOUTRECALL that identifies an outstanding
LAYOUTGET operation to which the client has not yet received a reply.
Intersecting LAYOUTGET operations are identified in the CB_SEQUENCE
preceding the CB_LAYOUTRECALL.

The callback races section (Section 2.10.5.3) describes the sessions
mechanism for allowing the client to detect such situations in order
to delay processing such a CB_LAYOUTRECALL.  The server MUST
reference all conflicting LAYOUTGET operations in the CB_SEQUENCE
that precedes the CB_LAYOUTRECALL.  A zero-length array of referenced
operations is used by the server to tell the client that the server
does not know of any LAYOUTGET operations that conflict with the
recall.
12.5.4.2.2.1. Client Considerations

Consider a pNFS client that has issued a LAYOUTGET and then receives
an overlapping CB_LAYOUTRECALL for the same file.  There are two
possibilities, which the client would be unable to distinguish
without additional information provided by the sessions
implementation.

1.  The server processed the LAYOUTGET before issuing the recall, so
    the LAYOUTGET response is in flight, and must be waited for
    because it may be carrying layout info that will need to be
skipping to change at page 266, line 34
recall, the client SHOULD wait for responses to any outstanding
LAYOUTGET that overlaps any portion of the new LAYOUTGET's range.
This is because it is possible (although unlikely) that the prior
operation may have arrived at the server after the recall completed
and hence will succeed.

o  The recall process can be considered completed by the client when
   the final LAYOUTRETURN operation for the recalled range is
   completed.
12.5.4.2.2.2. Server Considerations

Consider a related situation from the metadata server's point of
view.  The metadata server has issued a CB_LAYOUTRECALL and receives
an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s)
that respond to the CB_LAYOUTRECALL.  Again, there are two cases:

1.  The client issued the LAYOUTGET before processing the
    CB_LAYOUTRECALL.

2.  The client issued the LAYOUTGET after processing the
skipping to change at page 267, line 13
situation can even occur if the session is configured to use a single
connection for both operations and callbacks.

Given no method to disambiguate these cases, the metadata server MUST
reject the overlapping LAYOUTGET with the error
NFS4ERR_RECALLCONFLICT.  The client has two ways to avoid this
result.  It can issue the LAYOUTGET as a subsequent element of a
COMPOUND containing the LAYOUTRETURN that completes the
CB_LAYOUTRECALL, or it can wait for the response to that LAYOUTRETURN.
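The server-side behavior above can be sketched as follows.  The error
code is from the specification; the function, the range
representation, and the "NFS4_OK" return string are assumptions made
for the sketch:

```python
NFS4ERR_RECALLCONFLICT = "NFS4ERR_RECALLCONFLICT"

def check_layoutget(pending_recalls, request):
    """pending_recalls: list of (offset, length) ranges under recall
    for the file; request: (offset, length) of the new LAYOUTGET.
    While a recall is outstanding, an overlapping LAYOUTGET cannot
    be disambiguated and is rejected."""
    q_off, q_len = request
    q_end = q_off + q_len
    for off, length in pending_recalls:
        if off < q_end and off + length > q_off:
            return NFS4ERR_RECALLCONFLICT
    return "NFS4_OK"
```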
12.5.5. Metadata Server Write Propagation

Asynchronous writes written through the metadata server may be
propagated lazily to the storage devices.  For data written
asynchronously through the metadata server, a client performing a
read at the appropriate storage device is not guaranteed to see the
newly written data until a COMMIT occurs at the metadata server.
While the write is pending, reads to the storage device may give out
either the old data, the new data, or a mixture of new and old.  Upon
completion of a synchronous WRITE or COMMIT (for asynchronously
written data), the metadata server MUST ensure that storage devices
give out the new data and that the data has been written to stable
storage.  If the server implements its storage in any way such that
it cannot obey these constraints, then it MUST recall the layouts to
prevent reads being done that cannot be handled correctly.  Note that
the layouts MUST be recalled prior to the server responding to the
associated WRITE operations.
12.6. pNFS Mechanics

This section describes the operations flow taken by a pNFS client to
a metadata server and storage device.

When a pNFS client encounters a new FSID, it issues a GETATTR to the
NFSv4.1 server for the fs_layout_type (Section 5.11.1) attribute.  If
the attribute returns at least one layout type, and the layout types
returned are among the set supported by the client, the client knows
that pNFS is a possibility for the file system.  If, from the server
that returned the new FSID, the client does not have a client ID that
skipping to change at page 269, line 6
parallel.

If the I/O was a WRITE, then at some point the client may want to
commit the modification time and the new size of the file, if it
believes it extended the file size, to the metadata server, and the
modified data to the file system.  It uses the LAYOUTCOMMIT
operation.  If the I/O was a READ, then at some point the client may
want to commit the access time to the metadata server.  Again, it
uses LAYOUTCOMMIT.
12.7. Recovery

Recovery is complicated by the distributed nature of the pNFS
protocol.  In general, crash recovery for layouts is similar to crash
recovery for delegations in the base NFSv4.1 protocol.  However, the
client's ability to perform I/O without contacting the metadata
server, and the fact that, unlike delegations, layouts are not bound
to stateids, introduces subtleties that must be handled correctly if
the possibility of file system corruption is to be avoided.
12.7.1. Client Recovery

Client recovery for layouts is similar to client recovery for other
lock and delegation state.  When a pNFS client restarts, it will lose
all information about the layouts that it previously owned.  There
are two methods by which the server can reclaim these resources and
allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease.  If the client
recovery time is longer than the lease period, the client's lease
skipping to change at page 269, line 45
server will find that the client's co_ownerid matches the co_ownerid
of the previous client invocation, but that the verifier is
different.  The server uses this as a signal to release all layout
state associated with the client's previous invocation.  In this
scenario, the data written by the client but not covered by a
successful LAYOUTCOMMIT is in an undefined state; it may have been
written or it may now be lost.  This is acceptable behavior and it is
the client's responsibility to use LAYOUTCOMMIT to achieve the
desired level of stability.
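The restart check above can be sketched as follows; the function and
its table of prior invocations are assumptions for illustration,
while co_ownerid and the verifier are the protocol fields described
in the text:

```python
def detect_restart(known_clients, co_ownerid, co_verifier):
    """known_clients maps co_ownerid -> verifier of the prior client
    invocation.  Returns True when the same co_ownerid arrives with
    a different verifier, signaling a client restart whose prior
    layout state must be released."""
    prior = known_clients.get(co_ownerid)
    restarted = prior is not None and prior != co_verifier
    # Record the current invocation either way.
    known_clients[co_ownerid] = co_verifier
    return restarted
```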
12.7.2. Dealing with Lease Expiration on the Client

The mappings between device IDs and device addresses are what enable
a pNFS client to safely write data to and read data from a storage
device.  These mappings are leased (in a manner similar to locking
state) from the metadata server, and as long as the lease is valid,
the client has the ability to issue I/O to the storage devices.  The
lease on device ID to device address mappings is renewed when the
metadata server receives a SEQUENCE operation from the pNFS client.
The same is not specified to be true for the data server receiving a
SEQUENCE operation, and the client MUST NOT assume that a SEQUENCE
skipping to change at page 270, line 44

   client reestablish client ID and session with the server and obtain
   new layouts and device ID to device address mappings for the modified
   data ranges and then write the data to the storage devices with the
   newly obtained layouts.
   If sr_status_flags from the metadata server has
   SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns
   NFS4ERR_STALE_CLIENTID, or SEQUENCE returns NFS4ERR_BAD_SESSION and
   CREATE_SESSION returns NFS4ERR_STALE_CLIENTID), then the metadata
   server has restarted, and the client SHOULD recover using the methods
   described in Section 12.7.4.
   If sr_status_flags from the metadata server has
   SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following
   the procedure described in Section 11.6.7.1.  After that, the client
   may get an indication that the layout state was not moved with the
   file system.  The client recovers as in the other applicable
   situations discussed in Paragraph 3 or Paragraph 4 of this section.
   If sr_status_flags reports no loss of state, then the lease for the
   mappings the client has with the metadata server is valid and
   renewed, and the client can once again issue I/O requests to the
   storage devices.
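   The dispatch the preceding paragraphs describe can be sketched as
   follows.  This is an illustrative fragment, not part of the
   protocol's XDR; the flag bit values shown follow the NFSv4.1
   numbering but should be treated as assumptions here, and
   recovery_action is a hypothetical helper name.

   ```python
   # Hedged sketch: how a pNFS client might dispatch on the
   # sr_status_flags returned by SEQUENCE.  Flag values are assumed
   # for illustration.
   SEQ4_STATUS_LEASE_MOVED            = 0x00000080
   SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100

   def recovery_action(sr_status_flags):
       """Map SEQUENCE status flags to a coarse recovery decision."""
       if sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED:
           # Metadata server restarted; use Section 12.7.4 methods.
           return "server-restart-recovery"
       if sr_status_flags & SEQ4_STATUS_LEASE_MOVED:
           # Follow the Section 11.6.7.1 procedure.
           return "lease-moved-recovery"
       if sr_status_flags == 0:
           # Mappings are valid and renewed; resume I/O to storage.
           return "resume-io"
       return "other-state-recovery"
   ```

   The same precedence would apply when SEQUENCE itself fails with
   NFS4ERR_STALE_CLIENTID or NFS4ERR_BAD_SESSION, as noted above.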
   While clients SHOULD NOT issue I/Os to storage devices that may
   extend past the lease expiration time period, this is not always
   possible; for example, an extended network partition may start after
   the I/O is sent and not heal until the I/O request is received by
   the storage device.  Thus the metadata server and/or storage devices
   are responsible for protecting themselves from I/Os that are sent
   before the lease expires, but arrive after the lease expires.  See
   Section 12.7.3.
12.7.3.  Dealing with Loss of Layout State on the Metadata Server

   This is a description of the case where all of the following are
   true:

   o  the metadata server has not restarted

   o  a pNFS client's device ID to device address mappings and/or
      layouts have been discarded (usually because the client's lease
      expired) and are invalid

   o  an I/O from the pNFS client arrives at the storage device
   The metadata server and its storage devices MUST solve this by
   fencing the client; in other words, they must prevent the execution
   of I/O operations from the client to the storage devices after
   layout state loss.  The details of how fencing is done are specific
   to the layout type.  The solution for NFSv4.1 file-based layouts is
   described in (Section 13.12), and for other layout types in their
   respective external specification documents.
12.7.4.  Recovery from Metadata Server Restart

   The pNFS client will discover that the metadata server has restarted
   (e.g. rebooted) via the methods described in Section 8.4.2 and
   discussed in a pNFS-specific context in Paragraph 4 of
   Section 12.7.2.  The client MUST stop using layouts and delete
   device ID to device address mappings it previously received from the
   metadata server.  Having done that, if the client wrote data to the
   storage device without committing the layouts via LAYOUTCOMMIT, then
   the client has additional work to do in order to have the client,
   metadata server and storage device(s) all synchronized on the state
   of the data.

   o  If the client has data still modified and unwritten in the
      client's memory, the client has only two choices.
skipping to change at page 272, line 49

      The only recovery option for this scenario is to issue a
      LAYOUTCOMMIT in reclaim mode, which the metadata server will
      accept as long as it is in its grace period.  The use of
      LAYOUTCOMMIT in reclaim mode informs the metadata server that the
      layout has changed.  It is critical that the metadata server
      receive this information before its grace period ends, and thus
      before it starts allowing updates to the file system.

      To issue LAYOUTCOMMIT in reclaim mode, the client sets the
      loca_reclaim field of the operation's arguments (Section 18.42.1)
      to TRUE.  During the metadata server's recovery grace period (and
      only during the recovery grace period) the metadata server is
      prepared to accept LAYOUTCOMMIT requests with the loca_reclaim
      field set to TRUE.
      When loca_reclaim is TRUE, the client is attempting to commit
      changes to the layout that occurred prior to the restart of the
      metadata server.  The metadata server applies some consistency
      checks on the loca_layoutupdate field of the arguments to
      determine whether the client can commit the data written to the
      storage device to the file system.  The loca_layoutupdate field
      is of data type layoutupdate4, and contains layout type-specific
      content (in the lou_body field of loca_layoutupdate).  The layout
      type-specific information that loca_layoutupdate might have is
      discussed in Section 12.5.3.3.  If the metadata server's
      consistency checks on loca_layoutupdate succeed, then the
      metadata server MUST commit the data (as described by the
      loca_offset, loca_length, and loca_layoutupdate fields of the
      arguments) that was written to the storage device.  If the
      metadata server's consistency checks on loca_layoutupdate fail,
      the metadata server rejects the LAYOUTCOMMIT operation, and makes
      no changes to the file system.  However, any time LAYOUTCOMMIT
      with loca_reclaim TRUE fails, the pNFS client has lost all the
      data in the range defined by <loca_offset, loca_length>.  A
      client can defend against this risk by caching all data, whether
      written

skipping to change at page 273, line 39
      storage devices need not suffer from this limitation.
   o  The client does not have a copy of the data in its memory and the
      metadata server is no longer in its grace period; i.e. the
      metadata server returns NFS4ERR_NO_GRACE.  As with the scenario
      in the above bullet item, the failure of LAYOUTCOMMIT means the
      data in the range <loca_offset, loca_length> is lost.  The
      defense against the risk is the same: cache all written data on
      the client until a successful LAYOUTCOMMIT.
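   The defensive caching strategy recommended in the bullets above can
   be sketched as follows.  This is an illustrative fragment, not part
   of the protocol; WriteCache and its method names are hypothetical,
   and a real client would bound the cache and track ranges more
   carefully.

   ```python
   # Hedged sketch: retain every written range in client memory until a
   # successful LAYOUTCOMMIT covers it, so the data can be rewritten if
   # LAYOUTCOMMIT (reclaim mode or not) fails.
   class WriteCache:
       def __init__(self):
           self.pending = {}  # (offset, length) -> data

       def record_write(self, offset, data):
           # Called when data is written to a storage device but not
           # yet committed via LAYOUTCOMMIT.
           self.pending[(offset, len(data))] = data

       def on_layoutcommit_ok(self, offset, length):
           # A successful LAYOUTCOMMIT covering <offset, length> makes
           # it safe to drop ranges fully inside that region.
           self.pending = {
               (off, ln): d for (off, ln), d in self.pending.items()
               if not (off >= offset and off + ln <= offset + length)
           }

       def ranges_at_risk(self):
           # Data that would be lost if no LAYOUTCOMMIT ever succeeds.
           return sorted(self.pending)
   ```

   On LAYOUTCOMMIT failure, the client would replay the ranges still in
   the cache, e.g. by writing them through the metadata server.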
12.7.5.  Operations During Metadata Server Grace Period

   Some of the recovery scenarios thus far noted that some operations,
   namely WRITE and LAYOUTGET, might be permitted during the metadata
   server's grace period.  The metadata server may allow these
   operations during its grace period.  For LAYOUTGET, the metadata
   server must reliably determine that servicing such a request will
   not conflict with an impending LAYOUTCOMMIT reclaim request.  For
   WRITE, it must reliably determine that it will not conflict with an
   impending OPEN, or with a LOCK where the file has mandatory file
   locking enabled.
skipping to change at page 274, line 17

   operations by returning the NFS4ERR_GRACE error.  However, depending
   on the storage protocol (which is specific to the layout type) and
   metadata server implementation, the metadata server may be able to
   determine that a particular request is safe.  For example, a
   metadata server may save provisional allocation mappings for each
   file to stable storage, as well as information about potentially
   conflicting OPEN share modes and mandatory record locks that might
   have been in effect at the time of restart, and use this information
   during the recovery grace period to determine that a WRITE request
   is safe.
12.7.6.  Storage Device Recovery

   Recovery from storage device restart is mostly dependent upon the
   layout type in use.  However, there are a few general techniques a
   client can use if it discovers a storage device has crashed while
   holding modified, uncommitted data that was asynchronously written.
   First and foremost, it is important to realize that the client is
   the only one who has the information necessary to recover
   non-committed data, since it holds the modified data and most
   probably nobody else does.  Second, the best solution is for the
   client to err on the side of caution and attempt to re-write the
   modified data through another path.

   The client SHOULD immediately write the data to the metadata server,
   with the stable field in the WRITE4args set to FILE_SYNC4.  Once it
   does this, there is no need to wait for the original storage device.
12.8.  Metadata and Storage Device Roles

   If the same physical hardware is used to implement both a metadata
   server and storage device, then the same hardware entity is to be
   understood to be implementing two distinct roles and it is important
   that it be clearly understood on behalf of which role the hardware
   is executing at any given time.

   Various sub-cases can be distinguished.

   1.  The storage device uses NFSv4.1 as the storage protocol.  The
skipping to change at page 275, line 29

       set, the server_owner and server_scope results are the same, and
       the client IDs are the same, and if RPCSEC_GSS is used, the
       server principals are the same.  As noted in Section 2.10.4 the
       two servers are the same, whether they have the same network
       address or not.  If the pNFS server is ambiguous in its
       EXCHANGE_ID results as to what role a client ID may be used for,
       yet still requires the NFSv4.1 request be directed in a manner
       specific to a role (e.g. a READ request for a particular offset
       directed to the metadata server role might use a different
       offset if the READ was intended for the data server role, if the
       file is using STRIPE4_DENSE packing, see Section 13.5), the pNFS
       server may mark the metadata filehandle differently from the
       data filehandle so that operations addressed to the metadata
       server can be distinguished from those directed to the data
       servers.  Marking the metadata and data server filehandles
       differently (and this is RECOMMENDED) is possible because the
       former are derived from OPEN operations, and the latter are
       derived from LAYOUTGET operations.

       Note that it may be the case that while the metadata server and
       the storage device are distinct from one client's point of view,
skipping to change at page 276, line 8

   3.  The storage device does not use NFSv4.1 as the storage protocol,
       and the same physical hardware is used to implement both a
       metadata and storage device.  Whether distinct network addresses
       are used to access the metadata server and storage device is
       immaterial, because it is always clear to the pNFS client and
       server, from the upper layer protocol being used (NFSv4.1 or
       non-NFSv4.1), what role the request to the common server network
       address is directed to.
12.9.  Security Considerations

   pNFS separates file system metadata and data and provides access to
   both.  There are pNFS-specific operations (listed in Section 12.3)
   that provide access to the metadata; all existing NFSv4.1
   conventional (non-pNFS) security mechanisms and features apply to
   accessing the metadata.  The combination of components in a pNFS
   system (see Figure 65) is required to preserve the security
   properties of NFSv4.1 with respect to an entity accessing a storage
   device from a client, including security countermeasures to defend
   against threats that NFSv4.1 provides defenses for in environments
   where these threats are considered significant.

   In some cases, the security countermeasures for connections to
skipping to change at page 277, line 5

   storage device, or both (as applicable; when the storage device is
   an NFSv4.1 server, the storage device is ultimately responsible for
   controlling access).  If a pNFS configuration performs these checks
   only in the client, the risk of a misbehaving client obtaining
   unauthorized access is an important consideration in determining
   when it is appropriate to use such a pNFS configuration.  Such
   layout types SHOULD NOT be used when client-only access checks do
   not provide sufficient assurance that NFSv4.1 access control is
   being applied correctly.
13.  PNFS: NFSv4.1 File Layout Type

   This section describes the semantics and format of NFSv4.1
   file-based layouts for pNFS.  NFSv4.1 file-based layouts use the
   LAYOUT4_NFSV4_1_FILES layout type.  The LAYOUT4_NFSV4_1_FILES type
   defines striping data across multiple NFSv4.1 data servers.
13.1.  Client ID and Session Considerations

   Sessions are a mandatory feature of NFSv4.1, and this extends to
   both the metadata server and file-based (NFSv4.1-based) data
   servers.

   The role a server plays in pNFS is determined by the result it
   returns from EXCHANGE_ID.  The roles are:

   o  metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result
      eir_flags),

skipping to change at page 278, line 40
   If metadata server routing and/or identity information is encoded in
   data server filehandles, when the metadata server identity or
   location changes, the data server filehandles it gave out must
   become invalid (stale), and so the metadata server must first recall
   the layouts.  Invalidating a data server filehandle does not render
   the NFS client's data cache invalid.  The client's cache should map
   a data server filehandle to a metadata server filehandle, and a
   metadata server filehandle to cached data.
13.2.  File Layout Definitions

   The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout
   type, and may be applicable to other layout types.

   Unit.  A unit is a fixed size quantity of data written to a data
      server.

   Pattern.  A pattern is a method of distributing one or more equal
      sized units across a set of data servers.  A pattern is iterated
      one or more times.

   Stripe.  A stripe is a set of data distributed across a set of data
      servers in a pattern before that pattern repeats.

   Stripe Count.  A stripe count is the number of stripe units in a
      pattern.

   Stripe Width.  A stripe width is the size of a stripe in bytes.
      The stripe width = the stripe count * the size of the stripe
      unit.

   Hereafter, this document will refer to a unit that is written in a
   pattern as a "stripe unit".

   A pattern may have more stripe units than data servers.  If so,
   some data servers will have more than one stripe unit per stripe.
   A data server that has multiple stripe units per stripe MAY store
   each unit in a different data file (and depending on the
   implementation, will possibly assign a unique data filehandle to
   each data file).
13.3.  File Layout Data Types

   The high level NFSv4.1 layout types are nfsv4_1_file_layouthint4,
   nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4.

   The SETATTR operation supports a layout hint attribute
   (Section 5.11.4).  When the client sets a layout hint (data type
   layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the
   loh_type field), the loh_body field contains a value of data type
   nfsv4_1_file_layouthint4.

   const NFL4_UFLG_MASK                  = 0x0000003F;
   const NFL4_UFLG_DENSE                 = 0x00000001;
   const NFL4_UFLG_COMMIT_THRU_MDS       = 0x00000002;
   const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK = 0xFFFFFFC0;

   typedef uint32_t nfl_util4;

   /* Encoded in the loh_body field of type layouthint4: */
   enum filelayout_hint_care4 {
           NFLH4_CARE_DENSE            = NFL4_UFLG_DENSE,
           NFLH4_CARE_COMMIT_THRU_MDS  = NFL4_UFLG_COMMIT_THRU_MDS,
           NFLH4_CARE_STRIPE_UNIT_SIZE = 0x00000040,
           NFLH4_CARE_STRIPE_COUNT     = 0x00000080
   };

   struct nfsv4_1_file_layouthint4 {
           uint32_t        nflh_care;
           nfl_util4       nflh_util;
           count4          nflh_stripe_count;
   };
   The generic layout hint structure is described in Section 3.2.22.
   The client uses the layout hint in the layout_hint (Section 5.11.4)
   attribute to indicate the preferred type of layout to be used for a
   newly created file.  The LAYOUT4_NFSV4_1_FILES layout type-specific
   content for the layout hint is composed of three fields.  The first
   field, nflh_care, is a set of flags indicating which values of the
   hint the client cares about.  If the NFLH4_CARE_DENSE flag is set,
   then the client indicates in the second field, nflh_util, a
   preference for how the data file is packed (Section 13.5), which is
   controlled by the value of nflh_util & NFL4_UFLG_DENSE.  If the
   NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a
   preference for whether the client should send COMMIT operations to
   the metadata server or data server (Section 13.8), which is
   controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS.
   If the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client
   indicates its preferred stripe unit size, which is indicated in
   nflh_util & NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus the stripe unit
   size MUST be a multiple of 64 bytes).  If the
   NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the
   third field, nflh_stripe_count, the stripe count.  The stripe count
   multiplied by the stripe unit size is the stripe width.
When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in
the loc_type field of the lo_content field), the loc_body field of
the lo_content field contains a value of data type
nfsv4_1_file_layout4.  Among other content, nfsv4_1_file_layout4 has
a storage device ID (field nfl_deviceid) of data type deviceid4.  The
GETDEVICEINFO operation maps a device ID to a storage device address
skipping to change at page 281, line 22
struct nfsv4_1_file_layout_ds_addr4 {
        uint32_t        nflda_stripe_indices<>;
        multipath_list4 nflda_multipath_ds_list<>;
};
The nfsv4_1_file_layout_ds_addr4 data type represents the device
address.  It is composed of two fields:

1.  nflda_multipath_ds_list: An array of lists of data servers, where
    each list can be one or more elements, and each element
    represents an equivalent (see Section 13.6) data server.  The
    length of this array might be different than the stripe count.

2.  nflda_stripe_indices: An array of indexes used to index into
    nflda_multipath_ds_list.  Each element of nflda_stripe_indices
    MUST be less than the number of elements in
    nflda_multipath_ds_list.  Each element of
    nflda_multipath_ds_list SHOULD be referred to by one or more
    elements of nflda_stripe_indices.  The number of elements in
    nflda_stripe_indices is always equal to the stripe count.
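The two constraints above (every stripe index MUST be in range; every multipath list SHOULD be referenced) can be checked mechanically. The following Python sketch is purely illustrative — the data structures and function name are not part of the protocol:

```python
def check_ds_addr(nflda_stripe_indices, nflda_multipath_ds_list):
    """Check the nfsv4_1_file_layout_ds_addr4 invariants.

    Returns the (hopefully empty) list of multipath-list indexes that
    no stripe index refers to, i.e. violations of the SHOULD.
    """
    n = len(nflda_multipath_ds_list)
    # Each stripe index MUST be less than the number of multipath lists.
    for idx in nflda_stripe_indices:
        if not 0 <= idx < n:
            raise ValueError("stripe index %d out of range [0, %d)" % (idx, n))
    # Each multipath list SHOULD be referenced by at least one index.
    unused = set(range(n)) - set(nflda_stripe_indices)
    return sorted(unused)

# A device address with three equivalence classes and stripe count 4
# (values chosen to match the examples later in this section):
unreferenced = check_ds_addr(
    [1, 0, 2, 0],
    [["A", "B", "C", "D"], ["E"], ["F", "G"]])
```

Here `unreferenced` is empty, so the SHOULD is satisfied; a result like `[1]` would mean multipath list 1 is never used by the striping pattern.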
skipping to change at page 282, line 42
nflda_stripe_indices.  Thus when issuing I/O to any data server
in nflda_multipath_ds_list[nflda_stripe_indices[Y]], the
filehandle in nfl_fh_list[Y] MUST be used.  In addition, if at
any time there exist i and j (i != j) such that the intersection
of nflda_multipath_ds_list[nflda_stripe_indices[i]] and
nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty,
then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j].  In other
words, when dense packing is being used, if a data server appears
in two or more units of a striping pattern, each reference to the
data server MUST use a different filehandle.  See the discussion
on dense packing in Section 13.5.

The details on the interpretation of the layout are in Section 13.4.
13.4.  Interpreting the File Layout

13.4.1.  Interpreting the File Layout Using Sparse Packing
When sparse packing is used, the algorithm for determining the
filehandle and set of data server network addresses to write stripe
unit i (SUi) to is:

   stripe_count = number of elements in nflda_stripe_indices;

   j = (SUi + nfl_first_stripe_index) % stripe_count;

   idx = nflda_stripe_indices[j];
skipping to change at page 283, line 41
   }

   address_list = nflda_multipath_ds_list[idx];

The client would then select a data server from address_list, and
issue a READ or WRITE operation using the filehandle specified in fh.

Consider the following example:

Suppose we have a device address consisting of seven data servers,
arranged in three equivalence (Section 13.6) classes:

   { A, B, C, D }, { E }, { F, G }

Where A through G are network addresses.

Then

   nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G }

i.e.

   nflda_multipath_ds_list[0] = { A, B, C, D }

   nflda_multipath_ds_list[1] = { E }

   nflda_multipath_ds_list[2] = { F, G }
skipping to change at page 285, line 23
| 5   | 36         | A,B,C,D      |
| 6   | 67         | F,G          |
| 7   | 36         | A,B,C,D      |
| 8   | 87         | E            |
| 9   | 36         | A,B,C,D      |
| 10  | 67         | F,G          |
| 11  | 36         | A,B,C,D      |
| 12  | 87         | E            |
+-----+------------+--------------+
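The sparse-packing example can be reproduced with a short sketch. Note that the stripe index array [1, 0, 2, 0], nfl_first_stripe_index = 0, and the per-class filehandle values (36, 87, 67) are not stated normatively in the visible text; they are inferred from the table above and the filehandle-selection logic is simplified to a fixed mapping, so treat all of them as illustrative assumptions:

```python
# Device address from the example: three equivalence classes.
nflda_multipath_ds_list = [["A", "B", "C", "D"], ["E"], ["F", "G"]]
nflda_stripe_indices = [1, 0, 2, 0]   # stripe count = 4 (assumed)
nfl_first_stripe_index = 0            # assumed
fh_for_idx = {0: 36, 1: 87, 2: 67}    # assumed sparse filehandle mapping

def locate(su_i):
    """Return (filehandle, address_list) for stripe unit su_i,
    following the sparse-packing algorithm in the text."""
    stripe_count = len(nflda_stripe_indices)
    j = (su_i + nfl_first_stripe_index) % stripe_count
    idx = nflda_stripe_indices[j]
    return fh_for_idx[idx], nflda_multipath_ds_list[idx]
```

For example, `locate(8)` yields filehandle 87 and address list { E }, matching row SUi = 8 of the table.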
13.4.2.  Interpreting the File Layout Using Dense Packing
When dense packing is used, the algorithm for determining the
filehandle and set of data server network addresses to write stripe
unit i (SUi) to is:

   stripe_count = number of elements in nflda_stripe_indices;

   j = (SUi + nfl_first_stripe_index) % stripe_count;

   idx = nflda_stripe_indices[j];
skipping to change at page 286, line 10
   address_list = nflda_multipath_ds_list[idx];

The client would then select a data server from address_list, and
issue a READ or WRITE operation using the filehandle specified in fh.

Consider the following example (which is the same as the sparse
packing example, except for the filehandle list):

Suppose we have a device address consisting of seven data servers,
arranged in three equivalence (Section 13.6) classes:

   { A, B, C, D }, { E }, { F, G }

Where A through G are network addresses.

Then

   nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G }

i.e.
skipping to change at page 287, line 46
| 5   | 36         | A,B,C,D      |
| 6   | 67         | F,G          |
| 7   | 37         | A,B,C,D      |
| 8   | 87         | E            |
| 9   | 36         | A,B,C,D      |
| 10  | 67         | F,G          |
| 11  | 37         | A,B,C,D      |
| 12  | 87         | E            |
+-----+------------+--------------+
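The dense-packing example differs from the sparse one only in how the filehandle is chosen: it comes from nfl_fh_list[j], one entry per stripe, so the equivalence class { A, B, C, D }, which appears twice in the striping pattern, gets two distinct filehandles (36 and 37). The following sketch is illustrative; the concrete stripe index array [1, 0, 2, 0] and filehandle values [87, 36, 67, 37] are inferred from the table above, not stated normatively:

```python
nflda_multipath_ds_list = [["A", "B", "C", "D"], ["E"], ["F", "G"]]
nflda_stripe_indices = [1, 0, 2, 0]   # stripe count = 4 (assumed)
nfl_first_stripe_index = 0            # assumed
nfl_fh_list = [87, 36, 67, 37]        # assumed per-stripe filehandles

def locate_dense(su_i):
    """Return (filehandle, address_list) for stripe unit su_i under
    dense packing: the filehandle is indexed by the stripe slot j,
    not by the multipath-list index idx."""
    stripe_count = len(nflda_stripe_indices)
    j = (su_i + nfl_first_stripe_index) % stripe_count
    idx = nflda_stripe_indices[j]
    return nfl_fh_list[j], nflda_multipath_ds_list[idx]
```

For example, `locate_dense(7)` and `locate_dense(5)` both target { A, B, C, D } but return filehandles 37 and 36 respectively, satisfying the MUST NOT on equal filehandles for overlapping data server lists.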
13.5.  Sparse and Dense Stripe Unit Packing
The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util
of the data type nfsv4_1_file_layouthint4 and field nfl_util of the
data type nfsv4_1_file_layout4) specifies how the data is packed
within the data file on a data server.  It allows for two different
data packings: sparse and dense.  The packing type determines the
calculation that must be made to map the client visible file offset
to the offset within the data file located on the data server.
If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing
is being used.  Hence the logical offsets of the file as viewed by a
client issuing READs and WRITEs directly to the metadata server are
the same offsets each data server uses when storing a stripe unit.
The effect then, for striping patterns consisting of at least two
stripe units, is for each data server file to be sparse or holey.  So
for example, suppose a pattern with three stripe units, where the
stripe unit size is 4096 bytes and there are three data servers in
the pattern.  Then the file on data server 1 will have stripe units
0, 3, 6, 9, ... filled, data server 2's file will have stripe units
1, 4, 7, 10, ... filled, and data server 3's file will have stripe
units 2, 5, 8, 11, ... filled.  The unfilled stripe units of each
file will be holes; hence the files on each data server are sparse.
If sparse packing is being used and a client attempts I/O to one of
the holes, then an error MUST be returned by the data server.  Using
the above example, if data server 3 received a READ or WRITE request
for block 4, the data server would return NFS4ERR_PNFS_IO_HOLE.  Thus
skipping to change at page 288, line 49
the file of data server 1, logical stripe units 1, 4, 7, ... of the
file would live on stripe units 0, 1, 2, ... of the file of data
server 2, and logical stripe units 2, 5, 8, ... of the file would
live on stripe units 0, 1, 2, ... of the file of data server 3.

Since the dense packing does not leave holes on the data servers, the
pNFS client is allowed to write to any offset of any data file of any
data server in the stripe.  Thus the data servers need not know
the file's striping pattern.
The calculation to determine the byte offset within the data file for
dense data server layouts is:

   stripe_width = stripe_unit_size * N;
      where N = number of elements in nflda_stripe_indices.

   data_file_offset = floor(file_offset / stripe_width)
                    * stripe_unit_size
                    + file_offset % stripe_unit_size
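As a worked instance of this calculation, consider a stripe unit size of 4096 bytes and N = 3, so stripe_width = 12288. The values here are chosen for illustration; only the formula itself comes from the text:

```python
def dense_data_file_offset(file_offset, stripe_unit_size, n):
    """Map a client-visible file offset to the offset within the
    data file on a data server, for dense packing."""
    stripe_width = stripe_unit_size * n
    return ((file_offset // stripe_width) * stripe_unit_size
            + file_offset % stripe_unit_size)

# Logical stripe unit 4 (offset 4*4096 + 100 = 16484) falls in the
# second stripe (floor(16484/12288) = 1), so it lands at data-file
# stripe unit 1, byte 100: offset 4096 + 100 = 4196.
offset = dense_data_file_offset(16484, 4096, 3)
```

This matches the prose description of dense packing: each data server's file is packed back-to-back, so logical stripe units 0, 3, 6, ... of a three-wide pattern map onto data-file stripe units 0, 1, 2, ... with no holes.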
If dense packing is being used, and a data server appears more than
once in a striping pattern, then to distinguish one stripe unit from
another, the data server MUST use a different filehandle.  Let's
suppose there are two data servers.  Logical stripe units 0, 3, 6 are
served by data server 1, logical stripe units 1, 4, 7 are served by
data server 2, and logical stripe units 2, 5, 8 are also served by
data server 2.  Unless data server 2 has two filehandles (each
referring to a different data file), then, for example, a write to
logical stripe unit 1 overwrites the write to logical stripe unit 2,
because both logical stripe units are located in the same stripe unit
(0) of data server 2.
13.6.  Data Server Multipathing
The NFSv4.1 file layout supports multipathing to "equivalent"
(defined later in this section) data servers.  Data server-level
multipathing is used for bandwidth scaling via trunking
(Section 2.10.4) and for higher availability of use in the case of a
data s