draft-ietf-nfsv4-minorversion1-04.txt   draft-ietf-nfsv4-minorversion1-05.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: January 22, 2007 Editors Expires: February 16, 2007 Editors
July 21, 2006 August 15, 2006
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-04.txt draft-ietf-nfsv4-minorversion1-05.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on January 22, 2007. This Internet-Draft will expire on February 16, 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2006).
Abstract Abstract
This Internet-Draft describes NFSv4 minor version one, including This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major subsequently. The current draft includes description of the major
skipping to change at page 2, line 15 skipping to change at page 2, line 15
Group nfsv4@ietf.org and logged in the issue tracker. Group nfsv4@ietf.org and logged in the issue tracker.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 10 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 10 1.1. The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . . 10
1.2. Structured Data Types . . . . . . . . . . . . . . . . . 11 1.2. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 10
2. RPC and Security Flavor . . . . . . . . . . . . . . . . . . . 20 1.3. Minor Version 1 Goals . . . . . . . . . . . . . . . . . 11
2.1. Ports and Transports . . . . . . . . . . . . . . . . . . 20 1.4. Inconsistencies of this Document with Section XX . . . . 11
2.1.1. Client Retransmission Behavior . . . . . . . . . . . 21 1.5. Overview of NFS version 4.1 Features . . . . . . . . . . 11
2.2. Security Flavors . . . . . . . . . . . . . . . . . . . . 22 1.5.1. RPC and Security . . . . . . . . . . . . . . . . . . 12
2.2.1. Security mechanisms for NFS version 4 . . . . . . . 22 1.5.2. Protocol Structure . . . . . . . . . . . . . . . . . 12
2.3. Security Negotiation . . . . . . . . . . . . . . . . . . 24 1.5.3. File System Model . . . . . . . . . . . . . . . . . 14
2.3.1. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 24 1.5.4. Locking Facilities . . . . . . . . . . . . . . . . . 15
2.3.2. Security Error . . . . . . . . . . . . . . . . . . . 24 1.6. General Definitions . . . . . . . . . . . . . . . . . . 16
2.3.3. Callback RPC Authentication . . . . . . . . . . . . 25 1.7. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 18
2.3.4. GSS Server Principal . . . . . . . . . . . . . . . . 25 2. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 18
3. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 18
3.1. Obtaining the First Filehandle . . . . . . . . . . . . . 26 2.2. Structured Data Types . . . . . . . . . . . . . . . . . 20
3.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 26 3. RPC and Security Flavor . . . . . . . . . . . . . . . . . . . 29
3.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 26 3.1. Ports and Transports . . . . . . . . . . . . . . . . . . 29
3.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 27 3.1.1. Client Retransmission Behavior . . . . . . . . . . . 31
3.2.1. General Properties of a Filehandle . . . . . . . . . 27 3.2. Security Flavors . . . . . . . . . . . . . . . . . . . . 31
3.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 28 3.2.1. Security mechanisms for NFS version 4 . . . . . . . 31
3.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 28 3.3. Security Negotiation . . . . . . . . . . . . . . . . . . 33
3.3. One Method of Constructing a Volatile Filehandle . . . . 29 3.3.1. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 33
3.4. Client Recovery from Filehandle Expiration . . . . . . . 30 3.3.2. Security Error . . . . . . . . . . . . . . . . . . . 33
4. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.3. Callback RPC Authentication . . . . . . . . . . . . 34
4.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 32 3.3.4. GSS Server Principal . . . . . . . . . . . . . . . . 34
4.2. Recommended Attributes . . . . . . . . . . . . . . . . . 32 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 33 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 35
4.4. Classification of Attributes . . . . . . . . . . . . . . 33 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 35
4.5. Mandatory Attributes - Definitions . . . . . . . . . . . 34 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 35
4.6. Recommended Attributes - Definitions . . . . . . . . . . 36 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 36
4.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 43 4.2.1. General Properties of a Filehandle . . . . . . . . . 36
4.8. Interpreting owner and owner_group . . . . . . . . . . . 44 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 37
4.9. Character Case Attributes . . . . . . . . . . . . . . . 46 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 37
4.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 46 4.3. One Method of Constructing a Volatile Filehandle . . . . 39
4.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 47 4.4. Client Recovery from Filehandle Expiration . . . . . . . 39
4.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 48 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 40
4.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 48 5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 41
4.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 48 5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 41
4.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 49 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 42
5. Access Control Lists . . . . . . . . . . . . . . . . . . . . 49 5.4. Classification of Attributes . . . . . . . . . . . . . . 42
5.1. ACE type . . . . . . . . . . . . . . . . . . . . . . . . 51 5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 44
5.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . . 52 5.6. Recommended Attributes - Definitions . . . . . . . . . . 45
5.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD . . . . . . . . . 57 5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 54
5.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . . . 58 5.8. Interpreting owner and owner_group . . . . . . . . . . . 54
5.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . . 59 5.9. Character Case Attributes . . . . . . . . . . . . . . . 56
5.4.1. Discussion of EVERYONE@ . . . . . . . . . . . . . . 60 5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 56
5.4.2. Discussion of OWNER@ and GROUP@ . . . . . . . . . . 60 5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 57
5.5. Mode Attribute . . . . . . . . . . . . . . . . . . . . . 60 5.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 58
5.6. Interaction Between Mode and ACL Attributes . . . . . . 61 5.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 59
5.6.1. Recomputing mode upon SETATTR of ACL . . . . . . . . 62 5.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 59
5.6.2. Applying the mode given to CREATE or OPEN to an 5.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 59
inherited ACL . . . . . . . . . . . . . . . . . . . 65 5.16. mdsthreshold . . . . . . . . . . . . . . . . . . . . . . 59
5.6.3. Applying a Mode to an Existing ACL . . . . . . . . . 67 6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 60
5.6.4. ACL and mode in the same SETATTR . . . . . . . . . . 71 6.1. ACE type . . . . . . . . . . . . . . . . . . . . . . . . 62
5.6.5. Inheritance and turning it off . . . . . . . . . . . 72 6.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . . 63
5.6.6. Deficiencies in a Mode Representation of an ACL . . 73 6.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD . . . . . . . . . 67
6. Single-server Name Space . . . . . . . . . . . . . . . . . . 74 6.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 74 6.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 74 6.4.1. Discussion of EVERYONE@ . . . . . . . . . . . . . . 71
6.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 75 6.4.2. Discussion of OWNER@ and GROUP@ . . . . . . . . . . 71
6.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 76 6.5. Mode Attribute . . . . . . . . . . . . . . . . . . . . . 71
6.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 76 6.6. Interaction Between Mode and ACL Attributes . . . . . . 72
6.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 76 6.6.1. Recomputing mode upon SETATTR of ACL . . . . . . . . 73
6.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 76 6.6.2. Applying the mode given to CREATE or OPEN to an
6.8. Security Policy and Name Space Presentation . . . . . . 77 inherited ACL . . . . . . . . . . . . . . . . . . . 76
7. File Locking and Share Reservations . . . . . . . . . . . . . 78 6.6.3. Applying a Mode to an Existing ACL . . . . . . . . . 77
7.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 78 6.6.4. ACL and mode in the same SETATTR . . . . . . . . . . 82
7.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . 79 6.6.5. Inheritance and turning it off . . . . . . . . . . . 83
7.1.2. Server Release of Clientid . . . . . . . . . . . . . 81 6.6.6. Deficiencies in a Mode Representation of an ACL . . 84
7.1.3. lock_owner and stateid Definition . . . . . . . . . 82 7. Single-server Name Space . . . . . . . . . . . . . . . . . . 85
7.1.4. Use of the stateid and Locking . . . . . . . . . . . 84 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 85
7.1.5. Sequencing of Lock Requests . . . . . . . . . . . . 86 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 85
7.1.6. Recovery from Replayed Requests . . . . . . . . . . 87 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 86
7.1.7. Releasing lock_owner State . . . . . . . . . . . . . 87 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 86
7.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 87 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 87
7.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 89 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 87
7.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 89 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 87
7.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 89 7.8. Security Policy and Name Space Presentation . . . . . . 88
7.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 90 8. File Locking and Share Reservations . . . . . . . . . . . . . 89
7.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 91 8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 89
7.6.1. Client Failure and Recovery . . . . . . . . . . . . 91 8.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . 90
7.6.2. Server Failure and Recovery . . . . . . . . . . . . 92 8.1.2. Server Release of Clientid . . . . . . . . . . . . . 93
7.6.3. Network Partitions and Recovery . . . . . . . . . . 94 8.1.3. State-owner and Stateid Definition . . . . . . . . . 94
7.7. Recovery from a Lock Request Timeout or Abort . . . . . 98 8.1.4. Use of the Stateid and Locking . . . . . . . . . . . 97
7.8. Server Revocation of Locks . . . . . . . . . . . . . . . 98 8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 99
7.9. Share Reservations . . . . . . . . . . . . . . . . . . . 99 8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 99
7.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 100 8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 100
7.10.1. Close and Retention of State Information . . . . . . 101 8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 100
7.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 101 8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 101
7.12. Short and Long Leases . . . . . . . . . . . . . . . . . 102 8.6.1. Client Failure and Recovery . . . . . . . . . . . . 101
7.13. Clocks, Propagation Delay, and Calculating Lease 8.6.2. Server Failure and Recovery . . . . . . . . . . . . 102
Expiration . . . . . . . . . . . . . . . . . . . . . . . 102 8.6.3. Network Partitions and Recovery . . . . . . . . . . 104
8. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 103 8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 108
8.1. Performance Challenges for Client-Side Caching . . . . . 104 8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 109
8.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 104 8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 110
8.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 106 8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 110
8.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 108 8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 111
8.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 108 8.12. Clocks, Propagation Delay, and Calculating Lease
8.3.2. Data Caching and File Locking . . . . . . . . . . . 109 Expiration . . . . . . . . . . . . . . . . . . . . . . . 111
8.3.3. Data Caching and Mandatory File Locking . . . . . . 111 8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 112
8.3.4. Data Caching and File Identity . . . . . . . . . . . 111 9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 113
8.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 112 9.1. Performance Challenges for Client-Side Caching . . . . . 114
8.4.1. Open Delegation and Data Caching . . . . . . . . . . 115 9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 114
8.4.2. Open Delegation and File Locks . . . . . . . . . . . 116 9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 116
8.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 116 9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 118
8.4.4. Recall of Open Delegation . . . . . . . . . . . . . 119 9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 118
8.4.5. Clients that Fail to Honor Delegation Recalls . . . 121 9.3.2. Data Caching and File Locking . . . . . . . . . . . 119
8.4.6. Delegation Revocation . . . . . . . . . . . . . . . 122 9.3.3. Data Caching and Mandatory File Locking . . . . . . 121
8.5. Data Caching and Revocation . . . . . . . . . . . . . . 122 9.3.4. Data Caching and File Identity . . . . . . . . . . . 121
8.5.1. Revocation Recovery for Write Open Delegation . . . 123 9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 122
8.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 124 9.4.1. Open Delegation and Data Caching . . . . . . . . . . 125
8.7. Data and Metadata Caching and Memory Mapped Files . . . 126 9.4.2. Open Delegation and File Locks . . . . . . . . . . . 126
8.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 128 9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 126
8.9. Directory Caching . . . . . . . . . . . . . . . . . . . 129 9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 129
9. Security Negotiation . . . . . . . . . . . . . . . . . . . . 130 9.4.5. Clients that Fail to Honor Delegation Recalls . . . 131
10. Clarification of Security Negotiation in NFSv4.1 . . . . . . 130 9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 132
10.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 130 9.5. Data Caching and Revocation . . . . . . . . . . . . . . 132
10.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 131 9.5.1. Revocation Recovery for Write Open Delegation . . . 133
10.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 131 9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 134
10.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 131 9.7. Data and Metadata Caching and Memory Mapped Files . . . 136
11. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 132 9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 138
11.1. Sessions Background . . . . . . . . . . . . . . . . . . 132 9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 139
11.1.1. Introduction to Sessions . . . . . . . . . . . . . . 132 10. Security Negotiation . . . . . . . . . . . . . . . . . . . . 140
11.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 133 11. Clarification of Security Negotiation in NFSv4.1 . . . . . . 140
11.1.3. Problem Statement . . . . . . . . . . . . . . . . . 134 11.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 140
11.1.4. NFSv4 Session Extension Characteristics . . . . . . 136 11.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 141
11.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 136 11.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 141
11.2.1. Session Model . . . . . . . . . . . . . . . . . . . 136 11.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 141
11.2.2. Connection State . . . . . . . . . . . . . . . . . . 137 12. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 142
11.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 138 12.1. Sessions Background . . . . . . . . . . . . . . . . . . 142
11.2.4. Reconnection, Trunking and Failover . . . . . . . . 140 12.1.1. Introduction to Sessions . . . . . . . . . . . . . . 142
11.2.5. Server Duplicate Request Cache . . . . . . . . . . . 141 12.1.2. Session Model . . . . . . . . . . . . . . . . . . . 143
11.3. Session Initialization and Transfer Models . . . . . . . 142 12.1.3. Connection State . . . . . . . . . . . . . . . . . . 144
11.3.1. Session Negotiation . . . . . . . . . . . . . . . . 142 12.1.4. NFSv4 Channels, Sessions and Connections . . . . . . 145
11.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . 144 12.1.5. Reconnection, Trunking and Failover . . . . . . . . 146
11.3.3. RDMA Connection Resources . . . . . . . . . . . . . 144 12.1.6. Server Duplicate Request Cache . . . . . . . . . . . 147
11.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 145 12.2. Session Initialization and Transfer Models . . . . . . . 148
11.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 148 12.2.1. Session Negotiation . . . . . . . . . . . . . . . . 148
11.4. Connection Models . . . . . . . . . . . . . . . . . . . 151 12.2.2. RDMA Requirements . . . . . . . . . . . . . . . . . 150
11.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 152 12.2.3. RDMA Connection Resources . . . . . . . . . . . . . 150
11.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 153 12.2.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 151
11.4.3. Automatic RDMA Connection Model . . . . . . . . . . 154 12.2.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 154
11.5. Buffer Management, Transfer, Flow Control . . . . . . . 154 12.3. Connection Models . . . . . . . . . . . . . . . . . . . 157
11.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 157 12.3.1. TCP Connection Model . . . . . . . . . . . . . . . . 158
11.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 158 12.3.2. Negotiated RDMA Connection Model . . . . . . . . . . 159
11.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 159 12.3.3. Automatic RDMA Connection Model . . . . . . . . . . 160
11.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 159 12.4. Buffer Management, Transfer, Flow Control . . . . . . . 160
11.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 161 12.5. Retry and Replay . . . . . . . . . . . . . . . . . . . . 163
11.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 161 12.6. The Back Channel . . . . . . . . . . . . . . . . . . . . 164
11.10.2. Slot Identifiers and Server Duplicate Request 12.7. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 165
Cache . . . . . . . . . . . . . . . . . . . . . . . 161 12.8. Data Alignment . . . . . . . . . . . . . . . . . . . . . 165
11.10.3. Resolving server callback races with sessions . . . 165 12.9. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 167
11.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 166 12.9.1. Minor Versioning . . . . . . . . . . . . . . . . . . 167
11.10.5. eXternal Data Representation Efficiency . . . . . . 167 12.9.2. Slot Identifiers and Server Duplicate Request
11.10.6. Effect of Sessions on Existing Operations . . . . . 167 Cache . . . . . . . . . . . . . . . . . . . . . . . 167
11.10.7. Authentication Efficiencies . . . . . . . . . . . . 168 12.9.3. Resolving server callback races with sessions . . . 170
11.11. Sessions Security Considerations . . . . . . . . . . . . 169 12.9.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 171
11.11.1. Denial of Service via Unauthorized State Changes . . 170 12.10. Sessions Security Considerations . . . . . . . . . . . . 173
11.11.2. Authentication . . . . . . . . . . . . . . . . . . . 173 12.10.1. Denial of Service via Unauthorized State Changes . . 173
12. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 174 12.11. Session Mechanics - Steady State . . . . . . . . . . . . 177
12.1. Location attributes . . . . . . . . . . . . . . . . . . 174 12.11.1. Obligations of the Server . . . . . . . . . . . . . 177
12.2. File System Presence or Absence . . . . . . . . . . . . 175 12.11.2. Obligations of the Client . . . . . . . . . . . . . 177
12.3. Getting Attributes for an Absent File System . . . . . . 176 12.11.3. Steps the Client Takes To Establish a Session . . . 178
12.3.1. GETATTR Within an Absent File System . . . . . . . . 176 12.12. Session Mechanics - Recovery . . . . . . . . . . . . . . 178
12.3.2. READDIR and Absent File Systems . . . . . . . . . . 177 12.12.1. Events Requiring Client Action . . . . . . . . . . . 178
12.4. Uses of Location Information . . . . . . . . . . . . . . 178 12.12.2. Events Requiring Server Action . . . . . . . . . . . 180
12.4.1. File System Replication . . . . . . . . . . . . . . 178 13. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 180
12.4.2. File System Migration . . . . . . . . . . . . . . . 179 13.1. Location attributes . . . . . . . . . . . . . . . . . . 180
12.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 180 13.2. File System Presence or Absence . . . . . . . . . . . . 181
12.5. Additional Client-side Considerations . . . . . . . . . 180 13.3. Getting Attributes for an Absent File System . . . . . . 182
12.6. Effecting File System Transitions . . . . . . . . . . . 181 13.3.1. GETATTR Within an Absent File System . . . . . . . . 182
12.6.1. Transparent File System Transitions . . . . . . . . 182 13.3.2. READDIR and Absent File Systems . . . . . . . . . . 183
12.6.2. Filehandles and File System Transitions . . . . . . 184 13.4. Uses of Location Information . . . . . . . . . . . . . . 184
12.6.3. Fileid's and File System Transitions . . . . . . . . 184 13.4.1. File System Replication . . . . . . . . . . . . . . 185
12.6.4. Fsid's and File System Transitions . . . . . . . . . 185 13.4.2. File System Migration . . . . . . . . . . . . . . . 185
12.6.5. The Change Attribute and File System Transitions . . 185 13.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 186
12.6.6. Lock State and File System Transitions . . . . . . . 186 13.5. Additional Client-side Considerations . . . . . . . . . 187
12.6.7. Write Verifiers and File System Transitions . . . . 189 13.6. Effecting File System Transitions . . . . . . . . . . . 187
12.7. Effecting File System Referrals . . . . . . . . . . . . 190 13.6.1. Transparent File System Transitions . . . . . . . . 188
12.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 190 13.6.2. Filehandles and File System Transitions . . . . . . 190
12.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 194 13.6.3. Fileid's and File System Transitions . . . . . . . . 191
12.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 196 13.6.4. Fsid's and File System Transitions . . . . . . . . . 191
12.9. The Attribute fs_locations . . . . . . . . . . . . . . . 196 13.6.5. The Change Attribute and File System Transitions . . 192
12.10. The Attribute fs_locations_info . . . . . . . . . . . . 198 13.6.6. Lock State and File System Transitions . . . . . . . 192
12.11. The Attribute fs_status . . . . . . . . . . . . . . . . 207 13.6.7. Write Verifiers and File System Transitions . . . . 196
13. Directory Delegations . . . . . . . . . . . . . . . . . . . . 210 13.7. Effecting File System Referrals . . . . . . . . . . . . 196
13.1. Introduction to Directory Delegations . . . . . . . . . 211 13.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 196
13.2. Directory Delegation Design (in brief) . . . . . . . . . 212 13.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 200
13.3. Recommended Attributes in support of Directory 13.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 202
Delegations . . . . . . . . . . . . . . . . . . . . . . 213 13.9. The Attribute fs_locations . . . . . . . . . . . . . . . 203
13.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 214 13.10. The Attribute fs_locations_info . . . . . . . . . . . . 205
13.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 214 13.11. The Attribute fs_status . . . . . . . . . . . . . . . . 213
14. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 214 14. Directory Delegations . . . . . . . . . . . . . . . . . . . . 216
15. General Definitions . . . . . . . . . . . . . . . . . . . . . 217 14.1. Introduction to Directory Delegations . . . . . . . . . 217
15.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 217 14.2. Directory Delegation Design (in brief) . . . . . . . . . 218
15.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 217 14.3. Recommended Attributes in support of Directory
15.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 217 Delegations . . . . . . . . . . . . . . . . . . . . . . 219
15.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 217 14.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 220
15.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 218 14.5. Directory Delegation Recovery . . . . . . . . . . . . . 220
15.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 218 15. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 220
15.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 218 15.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 220
16. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 219 15.2. General Definitions . . . . . . . . . . . . . . . . . . 223
16.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 219 15.2.1. Metadata Server . . . . . . . . . . . . . . . . . . 223
16.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 219 15.2.2. Client . . . . . . . . . . . . . . . . . . . . . . . 223
16.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . 219 15.2.3. Storage Device . . . . . . . . . . . . . . . . . . . 223
16.1.3. Layout Segments . . . . . . . . . . . . . . . . . . 220 15.2.4. Storage Protocol . . . . . . . . . . . . . . . . . . 223
16.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 221 15.2.5. Control Protocol . . . . . . . . . . . . . . . . . . 224
16.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . 221 15.2.6. Metadata . . . . . . . . . . . . . . . . . . . . . . 224
16.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 222 15.2.7. Layout . . . . . . . . . . . . . . . . . . . . . . . 224
16.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 223 15.3. pNFS protocol semantics . . . . . . . . . . . . . . . . 225
16.4. Committing a Layout . . . . . . . . . . . . . . . . . . 224 15.3.1. Definitions . . . . . . . . . . . . . . . . . . . . 225
16.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . 224 15.3.2. Guarantees Provided by Layouts . . . . . . . . . . . 228
16.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . 225 15.3.3. Getting a Layout . . . . . . . . . . . . . . . . . . 229
16.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . 226 15.3.4. Committing a Layout . . . . . . . . . . . . . . . . 230
16.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 226 15.3.5. Recalling a Layout . . . . . . . . . . . . . . . . . 232
16.5.1. Basic Operation . . . . . . . . . . . . . . . . . . 226 15.3.6. Metadata Server Write Propagation . . . . . . . . . 237
16.5.2. Recall Callback Robustness . . . . . . . . . . . . . 228 15.3.7. Crash Recovery . . . . . . . . . . . . . . . . . . . 238
16.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 229 15.3.8. Security Considerations . . . . . . . . . . . . . . 243
16.6. Metadata Server Write Propagation . . . . . . . . . . . 231 15.4. The NFSv4 File Layout Type . . . . . . . . . . . . . . . 244
16.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 231 15.4.1. File Striping and Data Access . . . . . . . . . . . 244
16.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 232 15.4.2. Global Stateid Requirements . . . . . . . . . . . . 253
16.7.2. Client Recovery . . . . . . . . . . . . . . . . . . 233 15.4.3. The Layout Iomode . . . . . . . . . . . . . . . . . 253
16.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 233 15.4.4. Storage Device State Propagation . . . . . . . . . . 253
16.7.4. Storage Device Recovery . . . . . . . . . . . . . . 236 15.4.5. Storage Device Component File Size . . . . . . . . . 256
16.8. Security Considerations . . . . . . . . . . . . . . . . 237 15.4.6. Crash Recovery Considerations . . . . . . . . . . . 256
17. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 238 15.4.7. Security Considerations for the File Layout Type . . 257
17.1. File Striping and Data Access . . . . . . . . . . . . . 238 15.4.8. Alternate Approaches . . . . . . . . . . . . . . . . 257
17.1.1. Sparse and Dense Storage Device Data Layouts . . . . 240 16. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 258
17.1.2. Metadata and Storage Device Roles . . . . . . . . . 242 17. Internationalization . . . . . . . . . . . . . . . . . . . . 261
17.1.3. Device Multipathing . . . . . . . . . . . . . . . . 243 17.1. Stringprep profile for the utf8str_cs type . . . . . . . 262
17.1.4. Operations Issued to Storage Devices . . . . . . . . 243 17.2. Stringprep profile for the utf8str_cis type . . . . . . 264
17.1.5. COMMIT through metadata server . . . . . . . . . . . 244 17.3. Stringprep profile for the utf8str_mixed type . . . . . 265
17.2. Global Stateid Requirements . . . . . . . . . . . . . . 244 17.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 266
17.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 245 18. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 267
17.4. Storage Device State Propagation . . . . . . . . . . . . 245 18.1. Error Definitions . . . . . . . . . . . . . . . . . . . 267
17.4.1. Lock State Propagation . . . . . . . . . . . . . . . 246 18.2. Operations and their valid errors . . . . . . . . . . . 279
17.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 246 18.3. Callback operations and their valid errors . . . . . . . 287
17.4.3. File Attributes . . . . . . . . . . . . . . . . . . 246 18.4. Errors and the operations that use them . . . . . . . . 287
17.5. Storage Device Component File Size . . . . . . . . . . . 247 19. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 293
17.6. Crash Recovery Considerations . . . . . . . . . . . . . 248 19.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 293
17.7. Security Considerations . . . . . . . . . . . . . . . . 249 19.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 294
17.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 249 20. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 298
18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 250 20.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 299
19. Internationalization . . . . . . . . . . . . . . . . . . . . 252 20.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 301
19.1. Stringprep profile for the utf8str_cs type . . . . . . . 253 20.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 302
19.2. Stringprep profile for the utf8str_cis type . . . . . . 255 20.4. Operation 6: CREATE - Create a Non-Regular File Object . 305
19.3. Stringprep profile for the utf8str_mixed type . . . . . 256 20.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 258 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 307
20. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 258 20.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 308
20.1. Error Definitions . . . . . . . . . . . . . . . . . . . 258 20.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 309
20.2. Operations and their valid errors . . . . . . . . . . . 270 20.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 310
20.3. Callback operations and their valid errors . . . . . . . 279 20.9. Operation 11: LINK - Create Link to a File . . . . . . . 311
20.4. Errors and the operations that use them . . . . . . . . 279 20.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 312
21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 284 20.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 316
21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 284 20.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 317
21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 285 20.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 318
22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 287 20.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 320
22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 288 20.15. Operation 17: NVERIFY - Verify Difference in
22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 290 Attributes . . . . . . . . . . . . . . . . . . . . . . . 321
22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 291 20.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 322
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 294 20.17. Operation 19: OPENATTR - Open Named Attribute
22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Directory . . . . . . . . . . . . . . . . . . . . . . . 336
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 296 20.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 337
22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 297 20.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 338
22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 298 20.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 339
22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 299 20.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 341
22.9. Operation 11: LINK - Create Link to a File . . . . . . . 300 20.22. Operation 25: READ - Read from File . . . . . . . . . . 341
22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 301 20.23. Operation 26: READDIR - Read Directory . . . . . . . . . 343
22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 305 20.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 347
22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 306 20.25. Operation 28: REMOVE - Remove File System Object . . . . 348
22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 307 20.26. Operation 29: RENAME - Rename Directory Entry . . . . . 350
22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 309 20.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 351
22.15. Operation 17: NVERIFY - Verify Difference in 20.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 352
Attributes . . . . . . . . . . . . . . . . . . . . . . . 310 20.29. Operation 33: SECINFO - Obtain Available Security . . . 353
22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 311 20.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 356
22.17. Operation 19: OPENATTR - Open Named Attribute 20.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 358
Directory . . . . . . . . . . . . . . . . . . . . . . . 325 20.32. Operation 38: WRITE - Write to File . . . . . . . . . . 360
22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 326 20.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 364
22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 328 20.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 364
22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 329 20.35. Operation 42: CREATE_CLIENTID - Instantiate Clientid . . 368
22.21. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 330 20.36. Operation 43: CREATE_SESSION - Create New Session and
22.22. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 331 Confirm Clientid . . . . . . . . . . . . . . . . . . . . 374
22.23. Operation 25: READ - Read from File . . . . . . . . . . 332 20.37. Operation 44: DESTROY_SESSION - Destroy existing
22.24. Operation 26: READDIR - Read Directory . . . . . . . . . 334 session . . . . . . . . . . . . . . . . . . . . . . . . 382
22.25. Operation 27: READLINK - Read Symbolic Link . . . . . . 338 20.38. Operation 45: FREE_STATEID - Free stateid with no
22.26. Operation 28: REMOVE - Remove Filesystem Object . . . . 339 locks . . . . . . . . . . . . . . . . . . . . . . . . . 383
22.27. Operation 29: RENAME - Rename Directory Entry . . . . . 340 20.39. Operation 46: GET_DIR_DELEGATION - Get a directory
22.28. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 342
22.29. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 343
22.30. Operation 32: SAVEFH - Save Current Filehandle . . . . . 344
22.31. Operation 33: SECINFO - Obtain Available Security . . . 345
22.32. Operation 34: SETATTR - Set Attributes . . . . . . . . . 348
22.33. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 350
22.34. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 354
22.35. Operation 37: VERIFY - Verify Same Attributes . . . . . 357
22.36. Operation 38: WRITE - Write to File . . . . . . . . . . 359
22.37. Operation 39: RELEASE_LOCKOWNER - Release Lockowner
State . . . . . . . . . . . . . . . . . . . . . . . . . 363
22.38. Operation 40: BIND_BACKCHANNEL - Create a callback
channel binding . . . . . . . . . . . . . . . . . . . . 363
22.39. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 369
22.40. Operation 42: CREATECLIENTID - Instantiate Clientid . . 372
22.41. Operation 43: CREATESESSION - Create New Session and
Confirm Clientid . . . . . . . . . . . . . . . . . . . . 377
22.42. Operation 44: DESTROYSESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 383
22.43. Operation 45: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 384 delegation . . . . . . . . . . . . . . . . . . . . . . . 384
22.44. Operation 46: GETDEVICEINFO - Get Device Information . . 388 20.40. Operation 47: GETDEVICEINFO - Get Device Information . . 388
22.45. Operation 47: GETDEVICELIST . . . . . . . . . . . . . . 389 20.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 389
22.46. Operation 48: LAYOUTCOMMIT - Commit writes made using 20.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 390 a layout . . . . . . . . . . . . . . . . . . . . . . . . 390
22.47. Operation 49: LAYOUTGET - Get Layout Information . . . . 394 20.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 394
22.48. Operation 50: LAYOUTRETURN - Release Layout 20.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 396 Information . . . . . . . . . . . . . . . . . . . . . . 396
22.49. Operation 51: SECINFO_NO_NAME - Get Security on 20.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 398 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 399
22.50. Operation 52: SEQUENCE - Supply per-procedure 20.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 399 sequencing and control . . . . . . . . . . . . . . . . . 400
22.51. Operation 53: SET_SSV . . . . . . . . . . . . . . . . . 400 20.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 403
22.52. Operation 54: WANT_DELEGATION . . . . . . . . . . . . . 402 20.48. Operation 55: TEST_STATEID - Test stateids for
22.53. Operation 10044: ILLEGAL - Illegal operation . . . . . . 405 validity . . . . . . . . . . . . . . . . . . . . . . . . 405
23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 406 20.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 406
23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 406 20.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 409
23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 406 21. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 409
24. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 408 21.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 410
24.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 408 21.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 410
24.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 409 22. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 412
24.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 410 22.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 412
24.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 412 22.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 413
24.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 416 22.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 414
24.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 417 22.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 417
24.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 419 22.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 420
24.8. Operation 10: CB_RECALLCREDIT - change flow control 22.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 421
limits . . . . . . . . . . . . . . . . . . . . . . . . . 420 22.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 424
24.9. Operation 11: CB_SEQUENCE - Supply callback channel 22.8. Operation 10: CB_RECALL_CREDIT - change flow control
sequencing and control . . . . . . . . . . . . . . . . . 421 limits . . . . . . . . . . . . . . . . . . . . . . . . . 425
24.10. Operation 12: CB_SIZECHANGED . . . . . . . . . . . . . . 422 22.9. Operation 11: CB_SEQUENCE - Supply callback channel
24.11. Operation 10044: CB_ILLEGAL - Illegal Callback sequencing and control . . . . . . . . . . . . . . . . . 425
Operation . . . . . . . . . . . . . . . . . . . . . . . 423 22.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 427
25. References . . . . . . . . . . . . . . . . . . . . . . . . . 424 22.11. Operation 10044: CB_ILLEGAL - Illegal Callback
25.1. Normative References . . . . . . . . . . . . . . . . . . 424 Operation . . . . . . . . . . . . . . . . . . . . . . . 428
25.2. Informative References . . . . . . . . . . . . . . . . . 425 23. Security Considerations . . . . . . . . . . . . . . . . . . . 428
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 425 24. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 429
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 426 24.1. Defining new layout types . . . . . . . . . . . . . . . 429
Intellectual Property and Copyright Statements . . . . . . . . . 427 25. References . . . . . . . . . . . . . . . . . . . . . . . . . 429
25.1. Normative References . . . . . . . . . . . . . . . . . . 429
25.2. Informative References . . . . . . . . . . . . . . . . . 431
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 432
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 432
Intellectual Property and Copyright Statements . . . . . . . . . 434
1. Protocol Data Types 1. Introduction
1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for the minor
versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no
features may be introduced as mandatory in a minor version"). These
divergences are due to the introduction of the sessions model for
managing non-idempotent operations and the RECLAIM_COMPLETE
operation. These two new features are infrastructural in nature and
simplify implementation of existing and other new features. Making
them optional would add undue complexity to protocol definition and
implementation. NFSv4.1 accordingly updates the Minor Versioning
guidelines (Section 16).
NFSv4.1, as a minor version, is consistent with the overall goals for
NFS Version 4, but extends the protocol so as to better meet those
goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
adopted some additional goals, which motivate some of the major
extensions in minor version 1.
1.2. NFS Version 4 Goals
The NFS version 4 protocol is a further revision of the NFS protocol
defined already by versions 2 [17] and 3 [18]. It retains the
essential characteristics of previous versions: design for easy
recovery, independent of transport protocols, operating systems and
file systems, simplicity, and good performance. The NFS version 4
revision has the following goals:
o Improved access and good performance on the Internet.
The protocol is designed to transit firewalls easily, perform well
where latency is high and bandwidth is low, and scale to very
large numbers of clients per server.
o Strong security with negotiation built into the protocol.
The protocol builds on the work of the ONCRPC working group in
supporting the RPCSEC_GSS protocol. Additionally, the NFS version
4 protocol provides a mechanism to allow clients and servers the
ability to negotiate security and require clients and servers to
support a minimal set of security schemes.
o Good cross-platform interoperability.
The protocol features a file system model that provides a useful,
common set of features that does not unduly favor one file system
or operating system over another.
o Designed for protocol extensions.
The protocol is designed to accept standard extensions within a
framework that enables and encourages backward compatibility.
1.3. Minor Version 1 Goals
Minor version one has the following goals, within the framework
established by the overall version 4 goals.
o To correct significant structural weaknesses and oversights
discovered in the base protocol.
o To add clarity and specificity to areas left unaddressed or not
addressed in sufficient detail in the base protocol.
o To add specific features based on experience with the existing
protocol and recent industry developments.
o To provide protocol support to take advantage of clustered server
deployments, including the ability to provide scalable parallel
access to files distributed among multiple servers.
1.4. Inconsistencies of this Document with Section XX
Section XX, RPC Definition File, contains the definitions in XDR
description language of the constructs used by the protocol. Prior
to this section, several of the constructs are reproduced for
purposes of explanation. Although every effort has been made to
assure a correct and consistent description, the possibility of
inconsistencies exists. For any part of the document that is
inconsistent with Section XX, Section XX is to be considered
authoritative.
1.5. Overview of NFS version 4.1 Features
To provide context for the reader, the major features of the NFS
version 4.1 protocol are reviewed in brief. The review is intended
both for the reader who is familiar with previous versions of the NFS
protocol and for the reader who is new to the NFS protocols. For the
reader new to the
NFS protocols, there is still a set of fundamental knowledge that is
expected. The reader should be familiar with the XDR and RPC
protocols as described in [3] and [4]. A basic knowledge of file
systems and distributed file systems is expected as well.
This description of version 4.1 features will not distinguish those
added in minor version one from those present in the base protocol
but will treat minor version 1 as a unified whole. See Section 1.7 for
a description of the differences between the two minor versions.
1.5.1. RPC and Security
As with previous versions of NFS, the External Data Representation
(XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
version 4.1 protocol are those defined in [3] and [4]. To meet end-
to-end security requirements, the RPCSEC_GSS framework [5] will be
used to extend the basic RPC security. With the use of RPCSEC_GSS,
various mechanisms can be provided to offer authentication,
integrity, and privacy to the NFS version 4 protocol. Kerberos V5
will be used as described in [6] to provide one security framework.
The LIPKEY GSS-API mechanism described in [7] will be used to provide
for the use of user password and server public key by the NFS version
4 protocol. With the use of RPCSEC_GSS, other mechanisms may also be
specified and used for NFS version 4.1 security.
To enable in-band security negotiation, the NFS version 4.1 protocol
has operations which provide the client a method of querying the
server about its policies regarding which security mechanisms must be
used for access to the server's file system resources. With this,
the client can securely match the security mechanism that meets the
policies specified at both the client and server.
1.5.2. Protocol Structure
1.5.2.1. Core Protocol
Unlike NFS Versions 2 and 3, which used a series of ancillary
protocols (e.g. NLM, NSM, MOUNT), within all minor versions of NFS
version 4 only a single RPC protocol is used to make requests of the
server. Facilities that had been separate protocols, such as
locking, are now integrated within a single unified protocol.
A significant departure from the versions of the NFS protocol before
version 4 is the introduction of the COMPOUND procedure. For the NFS
version 4 protocol, in all minor versions, there are two RPC
procedures, NULL and COMPOUND. The COMPOUND procedure is defined as
a series of individual operations and these operations perform the
sorts of functions performed by traditional NFS procedures.
The operations combined within a COMPOUND request are evaluated in
order by the server, without any atomicity guarantees. A limited set
of facilities exist to pass results from one operation to another.
Once an operation returns a failing result, the evaluation ends and
the results of all evaluated operations are returned to the client.
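The evaluation rule described above can be illustrated with a minimal
sketch. The function name evaluate_compound and the modeling of each
operation as a (name, callable) pair are assumptions made for
illustration only; they are not part of the protocol definition.

```python
from typing import Callable, List, Tuple

NFS4_OK = 0  # status value for success

# Hypothetical model: an operation is a name plus a callable that
# performs the work and returns a status code.
Operation = Tuple[str, Callable[[], int]]


def evaluate_compound(ops: List[Operation]) -> List[Tuple[str, int]]:
    """Evaluate operations in order and stop after the first failure."""
    results = []
    for name, op in ops:
        status = op()
        results.append((name, status))
        if status != NFS4_OK:
            # Later operations are never evaluated. Effects of earlier
            # operations are NOT undone: there is no atomicity.
            break
    return results
```

The reply carries one result per evaluated operation, with the failing
operation last; nothing is reported for the operations that were
skipped.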
With the use of the COMPOUND procedure, the client is able to build
simple or complex requests. These COMPOUND requests allow for a
reduction in the number of RPCs needed for logical file system
operations. For example, multi-component lookup requests can be
constructed by combining multiple LOOKUP operations. Those can be
further combined with operations such as GETATTR, READDIR, or OPEN
plus READ to do more complicated sets of operation without incurring
additional latency.
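As a rough illustration of the multi-component lookup mentioned above,
the sketch below turns an absolute path into the list of operations a
client might place in a single COMPOUND. The helper name
build_lookup_compound and the textual "LOOKUP name" encoding are
hypothetical; a real client encodes operations in XDR.

```python
from typing import List


def build_lookup_compound(path: str, final_ops: List[str]) -> List[str]:
    """Return operation names for one COMPOUND that walks an absolute path."""
    # Empty components (leading, trailing, or doubled slashes) are skipped.
    components = [c for c in path.split("/") if c]
    ops = ["PUTROOTFH"]                         # start at the server root
    ops += [f"LOOKUP {c}" for c in components]  # one LOOKUP per component
    return ops + final_ops                      # e.g. GETFH, GETATTR
```

A three-component path thus costs one RPC rather than three, at the
price of a longer request.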
NFS Version 4.1 also contains a considerable set of callback
operations in which the server makes an RPC directed at the client.
Callback RPCs have a similar structure to that of the normal server
requests. For the NFS version 4 protocol callbacks in all minor
versions, there are two RPC procedures, CB_NULL and CB_COMPOUND. The
CB_COMPOUND procedure is defined in analogous fashion to that of
COMPOUND with its own set of callback operations.
The addition of new server and callback operations within the COMPOUND
and CB_COMPOUND request framework provides a means of extending the
protocol in subsequent minor versions.
Except for a small number of operations needed for session creation,
server requests and callback requests are performed within the
context of a session. Sessions provide a client context for every
request and support robust replay protection for non-idempotent
requests.
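The replay protection that sessions provide can be pictured with a
small sketch. It assumes, purely for illustration, that each request
is identified by a (session id, sequence number) pair and that the
server keeps the reply for every such pair; the class name ReplyCache
and its methods are hypothetical and do not reflect the actual
encoding or cache management rules of the protocol.

```python
class ReplyCache:
    """Hypothetical cache of replies to already-executed requests."""

    def __init__(self):
        self._replies = {}  # (session_id, sequence) -> cached reply

    def handle(self, session_id, sequence, execute):
        """Execute a request once; answer retransmissions from the cache."""
        key = (session_id, sequence)
        if key in self._replies:
            # Retransmission: return the saved reply without running the
            # non-idempotent operation a second time.
            return self._replies[key]
        reply = execute()
        self._replies[key] = reply
        return reply
```

With such a cache, a retransmitted REMOVE or RENAME receives the
original reply instead of a spurious error caused by re-execution.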
1.5.2.2. Parallel Access
Minor version one supports high-performance data access to a
clustered server implementation by enabling a separation of metadata
access and data access, with the latter done to multiple servers in
parallel.
Such parallel data access is controlled by recallable objects known
as "layouts", which are integrated into the protocol locking model.
Clients direct requests for data access to a set of data servers
specified by the layout via a data storage protocol which may be
NFSv4.1 or may be another protocol.
1.5.3. File System Model
The general file system model used for the NFS version 4.1 protocol
is the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as
opaque byte streams. In a slight departure, file and directory names
are encoded with UTF-8 to deal with the basics of
internationalization.
The NFS version 4.1 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle.
All file systems exported by a server are presented as a tree so that
all file systems are reachable from a special per-server global root
filehandle. This allows LOOKUP operations to be used to perform
functions previously provided by the MOUNT protocol. The server
provides any necessary pseudo file systems to bridge the unexported
gaps between exported file systems.
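The bridging role of the pseudo file system can be sketched as
follows. The function pseudo_directories is a hypothetical helper that
computes which ancestor directories a server would need to synthesize
so that every export is reachable from the root. It assumes that
exports are given as absolute paths and are not nested inside one
another.

```python
from typing import List, Set


def pseudo_directories(exports: List[str]) -> Set[str]:
    """Return the unexported ancestor directories that must be synthesized."""
    exported = {e.rstrip("/") or "/" for e in exports}
    pseudo = set()
    for path in exported:
        parts = [p for p in path.split("/") if p]
        # Walk every proper ancestor, from the root down to the parent.
        for depth in range(len(parts)):
            ancestor = "/" + "/".join(parts[:depth])
            if ancestor not in exported:
                pseudo.add(ancestor)
    return pseudo
```

For exports /export/home and /export/data/projects, the directories /,
/export, and /export/data are not exported themselves and would be
presented by the pseudo file system.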
1.5.3.1. Filehandles
As in previous versions of the NFS protocol, opaque filehandles are
used to identify individual files and directories. Lookup-type and
create operations are used to go from file and directory names to the
filehandle which is then used to identify the object to subsequent
operations.
The NFS version 4.1 protocol provides support for persistent
filehandles, guaranteed to be valid for the lifetime of the file
system object designated. In addition, it allows servers to provide
filehandles with more limited validity guarantees, called volatile
filehandles.
1.5.3.2. File Attributes
The NFS version 4.1 protocol has a rich and extensible attribute
structure. Only a small set of the defined attributes are mandatory
and must be provided by all server implementations. The other
attributes are known as "recommended" attributes.
One significant recommended file attribute is the Access Control List
(ACL) attribute. This attribute provides for directory and file
access control beyond the model used in NFS Versions 2 and 3. The
ACL definition allows for specification of specific sets of permissions
for individual users and groups. In addition, ACL inheritance allows
propagation of access permissions and restrictions down a directory
tree as file system objects are created.
One other type of attribute is the named attribute. A named
attribute is an opaque byte stream that is associated with a
directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate
application specific data with a regular file or directory.
1.5.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation
of namespaces that cross server boundaries and that allow and
facilitate a non-disruptive transfer of support for individual file
systems between servers. They are all based upon attributes that
allow one file system to specify alternate or new locations for that
file system.
These attributes may be used together with the concept of absent file
systems, which provide specifications for additional locations but no
actual file system content. This allows a number of important
facilities:
o Location attributes may be used with absent file systems to
implement referrals whereby one server may direct the client to a
file system provided by another server. This allows extensive
multi-server namespaces to be constructed.
o Location attributes may be provided for present file systems to
provide the locations of alternate file system instances or replicas
to be used in the event that the current file system instance
becomes unavailable.
o Location attributes may be provided when a previously present file
system becomes absent. This allows non-disruptive migration of
file systems to alternate servers.
1.5.4. Locking Facilities
As mentioned previously, NFSv4.1 is a single protocol which
includes locking facilities. These locking facilities include
support for many types of locks including a number of sorts of
recallable locks. Recallable locks such as delegations allow the
client to be assured that certain events will not occur so long as
that lock is held. When circumstances change, the lock is recalled
via a callback request. The assurances provided by
delegations allow more extensive caching to be done safely when
circumstances allow it. The types of locks are:
o Share reservations as established by OPEN operations.
o Byte-range locks.
o File delegations, which are recallable locks that assure the holder
that inconsistent opens and file changes cannot occur so long as
the delegation is held.
o Directory delegations, which are recallable delegations that assure
the holder that inconsistent directory modifications cannot occur
so long as the delegation is held.
o Layouts, which are recallable objects that assure the holder that
access to the file data may be performed directly by the
client and that no change to the data's location inconsistent with
that access may be made so long as the layout is held.
All locks for a given client are tied together under a single client-
wide lease. All requests made on sessions associated with the client
renew that lease. When leases are not promptly renewed, locks are
subject to revocation. In the event of server reinitialization,
clients have the opportunity to safely reclaim their locks within a
special grace period.
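The client-wide lease described above can be modeled with a minimal
sketch. The class ClientLease, its method names, and the 90-second
lease time used in the usage note are illustrative assumptions; the
clock is passed in explicitly so that the behavior is easy to follow.

```python
class ClientLease:
    """Hypothetical server-side record of one client's lease."""

    def __init__(self, lease_time: float, now: float):
        self.lease_time = lease_time        # fixed interval set by the server
        self.expires_at = now + lease_time

    def renew(self, now: float) -> None:
        """Called for every request on any session of this client."""
        self.expires_at = now + self.lease_time

    def locks_revocable(self, now: float) -> bool:
        """True once the lease has run out without being renewed."""
        return now >= self.expires_at
```

With a 90-second lease granted at time 0, a request at time 60 moves
expiry to time 150; absent further requests, all of the client's locks
become subject to revocation from that point on.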
1.6. General Definitions
The following definitions are provided for the purpose of providing
an appropriate context for the reader.
Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application which contains the
logic to access the NFS server directly. The client may also be
the traditional operating system client that provides remote file
system services for a set of applications.
In the case of file locking the client is the entity that
maintains a set of locks on behalf of one or more applications.
This client is responsible for crash or failure recovery for those
locks it manages.
Note that multiple clients may share the same transport and
multiple clients may exist on the same network node.
Clientid A 64-bit quantity used as a unique, short-hand reference to
a client-supplied Verifier and ID. The server is responsible for
supplying the Clientid.
Lease An interval of time defined by the server for which the client
is irrevocably granted a lock. At the end of a lease period the
lock may be revoked if the lease has not been extended. The lock
must be revoked if a conflicting lock has been granted after the
lease interval.
All leases granted by a server have the same fixed interval. Note
that the fixed interval was chosen to alleviate the expense a
server would have in maintaining state about variable length
leases across server failures.
Lock The term "lock" is used to refer to any of record (byte-range)
locks, share reservations, delegations or layouts unless
specifically stated otherwise.
Server The "Server" is the entity responsible for coordinating
client access to a set of file systems.
Stable Storage NFS version 4 servers must be able to recover without
data loss from multiple power failures (including cascading power
failures, that is, several power failures in quick succession),
operating system failures, and hardware failure of components
other than the storage medium itself (for example, disk,
nonvolatile RAM).
Some examples of stable storage that are allowable for an NFS
server include:
1. Media commit of data, that is, the modified data has been
successfully written to the disk media, for example, the disk
platter.
2. An immediate reply disk drive with battery-backed on-drive
intermediate storage or uninterruptible power system (UPS).
3. Server commit of data with battery-backed intermediate storage
and recovery software.
4. Cache commit with uninterruptible power system (UPS) and
recovery software.
Stateid A 128-bit quantity returned by a server that uniquely
defines the open and locking state provided by the server for a
specific open or lock owner for a specific file. Some stateid
values have special meaning and are reserved values.
Verifier A 64-bit quantity generated by the client that the server
can use to determine if the client has restarted and lost all
previous lock state.
1.7. Differences from NFSv4.0
The following summarizes the differences between minor version one
and the base protocol:
o Implementation of the sessions model.
o Support for parallel access to data.
o Addition of the RECLAIM_COMPLETE operation to better structure
the lock reclamation process.
o Support for directory delegation.
o Operations to re-obtain a delegation.
o Support for client and server implementation IDs.
2. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [2] and RPC RFC1831 version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[3] documents. The next sections build upon the XDR data types to [4] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
1.1. Basic Data Types 2.1. Basic Data Types
These are the base NFSv4 data types. These are the base NFSv4 data types.
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
| Data Type | Definition | | Data Type | Definition |
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
| int32_t | typedef int int32_t; | | int32_t | typedef int int32_t; |
| uint32_t | typedef unsigned int uint32_t; | | uint32_t | typedef unsigned int uint32_t; |
| int64_t | typedef hyper int64_t; | | int64_t | typedef hyper int64_t; |
| uint64_t | typedef unsigned hyper uint64_t; | | uint64_t | typedef unsigned hyper uint64_t; |
skipping to change at page 11, line 13 skipping to change at page 19, line 36
| | COMMIT) | | | COMMIT) |
| pathname4 | typedef component4 pathname4<>; | | pathname4 | typedef component4 pathname4<>; |
| | Represents path name for fs_locations | | | Represents path name for fs_locations |
| qop4 | typedef uint32_t qop4; | | qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO | | | Quality of protection designation in SECINFO |
| sec_oid4 | typedef opaque sec_oid4<>; | | sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier The sec_oid4 data type | | | Security Object Identifier The sec_oid4 data type |
| | is not really opaque. Instead contains an ASN.1 | | | is not really opaque. Instead contains an ASN.1 |
| | OBJECT IDENTIFIER as used by GSS-API in the | | | OBJECT IDENTIFIER as used by GSS-API in the |
| | mech_type argument to GSS_Init_sec_context. See | | | mech_type argument to GSS_Init_sec_context. See |
| | RFC2743 [4] for details. | | | RFC2743 [8] for details. |
| seqid4 | typedef uint32_t seqid4; | | seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking | | | Sequence identifier used for file locking |
| utf8string | typedef opaque utf8string<>; | | utf8string | typedef opaque utf8string<>; |
| | UTF-8 encoding for strings | | | UTF-8 encoding for strings |
| utf8str_cis | typedef opaque utf8str_cis; | | utf8str_cis | typedef opaque utf8str_cis; |
| | Case-insensitive UTF-8 string | | | Case-insensitive UTF-8 string |
| utf8str_cs | typedef opaque utf8str_cs; | | utf8str_cs | typedef opaque utf8str_cs; |
| | Case-sensitive UTF-8 string | | | Case-sensitive UTF-8 string |
| utf8str_mixed | typedef opaque utf8str_mixed; | | utf8str_mixed | typedef opaque utf8str_mixed; |
| | UTF-8 strings with a case sensitive prefix and a | | | UTF-8 strings with a case sensitive prefix and a |
skipping to change at page 11, line 36 skipping to change at page 20, line 14
| | Verifier used for various operations (COMMIT, | | | Verifier used for various operations (COMMIT, |
| | CREATE, OPEN, READDIR, SETCLIENTID, | | | CREATE, OPEN, READDIR, SETCLIENTID, |
| | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is | | | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is |
| | defined as 8. | | | defined as 8. |
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
End of Base Data Types End of Base Data Types
Table 1 Table 1
1.2. Structured Data Types 2.2. Structured Data Types
1.2.1. nfstime4 2.2.1. nfstime4
struct nfstime4 { struct nfstime4 {
int64_t seconds; int64_t seconds;
uint32_t nseconds; uint32_t nseconds;
}; };
The nfstime4 structure gives the number of seconds and nanoseconds The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC). Values greater than zero for the seconds field denote dates (UTC). Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970. Values less than zero for the after the 0 hour January 1, 1970. Values less than zero for the
skipping to change at page 12, line 16 skipping to change at page 20, line 42
nseconds fields would have a value of one-half second (500000000). nseconds fields would have a value of one-half second (500000000).
Values greater than 999,999,999 for nseconds are considered invalid. Values greater than 999,999,999 for nseconds are considered invalid.
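The split into seconds and nseconds can be sketched as follows. This is an illustration only, not part of the protocol definition: the helper names to_nfstime4 and is_valid_nfstime4 are hypothetical, and the sketch assumes that nseconds always counts forward from the (possibly negative) seconds value, which is what the half-second-before-the-epoch example implies (seconds of -1, nseconds of 500000000).

```python
import math
from fractions import Fraction

NSEC_PER_SEC = 1_000_000_000

def to_nfstime4(offset):
    """Split an offset from the epoch, in seconds, into (seconds, nseconds).

    `offset` may be negative (a date before January 1, 1970). The seconds
    field is the floor of the offset, so nseconds is never negative.
    """
    t = Fraction(offset)
    seconds = math.floor(t)
    nseconds = int((t - seconds) * NSEC_PER_SEC)
    return seconds, nseconds

def is_valid_nfstime4(seconds, nseconds):
    """Check the field ranges: int64_t seconds, nseconds at most 999,999,999."""
    return -2**63 <= seconds < 2**63 and 0 <= nseconds <= 999_999_999

# One-half second before the epoch.
print(to_nfstime4("-0.5"))
```

Using the floor rather than truncation toward zero is what keeps nseconds in the valid range for dates before the epoch.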
This data type is used to pass time and date information. A server This data type is used to pass time and date information. A server
converts to and from its local representation of time when processing converts to and from its local representation of time when processing
time values, preserving as much accuracy as possible. If the time values, preserving as much accuracy as possible. If the
precision of timestamps stored for a filesystem object is less than precision of timestamps stored for a filesystem object is less than
defined, loss of precision can occur. An adjunct time maintenance defined, loss of precision can occur. An adjunct time maintenance
protocol is recommended to reduce client and server time skew. protocol is recommended to reduce client and server time skew.
1.2.2. time_how4 2.2.2. time_how4
enum time_how4 { enum time_how4 {
SET_TO_SERVER_TIME4 = 0, SET_TO_SERVER_TIME4 = 0,
SET_TO_CLIENT_TIME4 = 1 SET_TO_CLIENT_TIME4 = 1
}; };
1.2.3. settime4 2.2.3. settime4
union settime4 switch (time_how4 set_it) { union settime4 switch (time_how4 set_it) {
case SET_TO_CLIENT_TIME4: case SET_TO_CLIENT_TIME4:
nfstime4 time; nfstime4 time;
default: default:
void; void;
}; };
The above definitions are used as the attribute definitions to set The above definitions are used as the attribute definitions to set
time values. If set_it is SET_TO_SERVER_TIME4, then the server uses time values. If set_it is SET_TO_SERVER_TIME4, then the server uses
its local representation of time for the time value. its local representation of time for the time value.
1.2.4. specdata4 2.2.4. specdata4
struct specdata4 { struct specdata4 {
uint32_t specdata1; /* major device number */ uint32_t specdata1; /* major device number */
uint32_t specdata2; /* minor device number */ uint32_t specdata2; /* minor device number */
}; };
This data type represents additional information for the device file This data type represents additional information for the device file
types NF4CHR and NF4BLK. types NF4CHR and NF4BLK.
1.2.5. fsid4 2.2.5. fsid4
struct fsid4 { struct fsid4 {
uint64_t major; uint64_t major;
uint64_t minor; uint64_t minor;
}; };
1.2.6. fs_location4 2.2.6. fs_location4
struct fs_location4 { struct fs_location4 {
utf8str_cis server<>; utf8str_cis server<>;
pathname4 rootpath; pathname4 rootpath;
}; };
1.2.7. fs_locations4 2.2.7. fs_locations4
struct fs_locations4 { struct fs_locations4 {
pathname4 fs_root; pathname4 fs_root;
fs_location4 locations<>; fs_location4 locations<>;
}; };
The fs_location4 and fs_locations4 data types are used for the The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and fs_locations recommended attribute which is used for migration and
replication support. replication support.
1.2.8. fattr4 2.2.8. fattr4
struct fattr4 { struct fattr4 {
bitmap4 attrmask; bitmap4 attrmask;
attrlist4 attr_vals; attrlist4 attr_vals;
}; };
The fattr4 structure is used to represent file and directory The fattr4 structure is used to represent file and directory
attributes. attributes.
The bitmap is a counted array of 32 bit integers used to contain bit The bitmap is a counted array of 32 bit integers used to contain bit
values. The position of the integer in the array that contains bit n values. The position of the integer in the array that contains bit n
can be computed from the expression (n / 32) and its bit within that can be computed from the expression (n / 32) and its bit within that
integer is (n mod 32). integer is (n mod 32).
0 1 0 1
+-----------+-----------+-----------+-- +-----------+-----------+-----------+--
| count | 31 .. 0 | 63 .. 32 | | count | 31 .. 0 | 63 .. 32 |
+-----------+-----------+-----------+-- +-----------+-----------+-----------+--
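The index arithmetic described above can be sketched as follows. The bitmap is modelled as a Python list of 32-bit words (the count is the list length), and the helper names bitmap4_set and bitmap4_test are hypothetical.

```python
def bitmap4_set(words, n):
    """Return a copy of the bitmap with bit n set, growing the array if needed."""
    index, bit = divmod(n, 32)          # word n / 32, bit n mod 32
    words = list(words) + [0] * (index + 1 - len(words))
    words[index] |= 1 << bit
    return words

def bitmap4_test(words, n):
    """Report whether bit n is set; bits beyond the array count as clear."""
    index, bit = divmod(n, 32)
    return index < len(words) and bool((words[index] >> bit) & 1)

# Bit 33 lives in the second word (index 1) at bit position 1.
print(bitmap4_set([], 33))
```

Treating bits past the end of the array as clear lets a receiver handle a shorter bitmap than it expects without special cases.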
1.2.9. change_info4 2.2.9. change_info4
struct change_info4 { struct change_info4 {
bool atomic; bool atomic;
changeid4 before; changeid4 before;
changeid4 after; changeid4 after;
}; };
This structure is used with the CREATE, LINK, REMOVE, RENAME This structure is used with the CREATE, LINK, REMOVE, RENAME
operations to let the client know the value of the change attribute operations to let the client know the value of the change attribute
for the directory in which the target filesystem object resides. for the directory in which the target filesystem object resides.
1.2.10. clientaddr4 2.2.10. netaddr4
struct clientaddr4 { struct netaddr4 {
/* see struct rpcb in RFC1833 */ /* see struct rpcb in RFC1833 */
string r_netid<>; /* network id */ string r_netid<>; /* network id */
string r_addr<>; /* universal address */ string r_addr<>; /* universal address */
}; };
The clientaddr4 structure is used as part of the SETCLIENTID The netaddr4 structure is used to identify TCP/IP based endpoints.
operation to either specify the address of the client that is using a The r_netid and r_addr fields are specified in RFC1833 [19], but they
clientid or as part of the callback registration. The r_netid and are underspecified in RFC1833 [19] as far as what they should look
r_addr fields are specified in RFC1833 [10], but they are like for specific protocols.
underspecified in RFC1833 [10] as far as what they should look like
for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string: US-ASCII string:
h1.h2.h3.h4.p1.p2 h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long. representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
the first through fourth octets each converted to ASCII-decimal. the first through fourth octets each converted to ASCII-decimal.
skipping to change at page 14, line 49 skipping to change at page 23, line 29
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string: US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2 x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix, as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884 representing an IPv6 address as defined in Section 2.2 of RFC1884
[5]. Additionally, the two alternative forms specified in Section [9]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [5] are also acceptable. 2.2 of RFC1884 [9] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6". over IPv6 the value of r_netid is the string "udp6".
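A minimal sketch of building and parsing the IPv4 form of r_addr follows. The passage that defines "p1.p2" is skipped in this comparison, so the sketch assumes the rule used by the NFSv4.0 specification: p1 and p2 are the high-order and low-order octets of the port, each written in ASCII-decimal. The function names ipv4_uaddr and parse_ipv4_uaddr are hypothetical.

```python
import ipaddress

def ipv4_uaddr(host, port):
    """Build the universal address "h1.h2.h3.h4.p1.p2" for TCP or UDP over IPv4."""
    address = ipaddress.IPv4Address(host)   # raises ValueError on a bad address
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    return f"{address}.{port >> 8}.{port & 0xFF}"

def parse_ipv4_uaddr(uaddr):
    """Split a universal address back into (host, port)."""
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("expected h1.h2.h3.h4.p1.p2")
    host = str(ipaddress.IPv4Address(".".join(parts[:4])))
    p1, p2 = int(parts[4]), int(parts[5])
    if not (0 <= p1 <= 255 and 0 <= p2 <= 255):
        raise ValueError("port octet out of range")
    return host, (p1 << 8) | p2

# Port 2049 is 0x0801, so the suffix is "8.1".
print(ipv4_uaddr("192.0.2.1", 2049))
```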

1.2.11. cb_client4 2.2.11. clientaddr4
typedef netaddr4 clientaddr4;
The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a
clientid or as part of the callback registration.
2.2.12. cb_client4
struct cb_client4 { struct cb_client4 {
unsigned int cb_program; unsigned int cb_program;
clientaddr4 cb_location; netaddr4 cb_location;
}; };
This structure is used by the client to inform the server of its call This structure is used by the client to inform the server of its call
back address; includes the program number and client address. back address; includes the program number and client address.
1.2.12. nfs_client_id4 2.2.13. nfs_client_id4
struct nfs_client_id4 { struct nfs_client_id4 {
verifier4 verifier; verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>; opaque id<NFS4_OPAQUE_LIMIT>;
}; };
This structure is part of the arguments to the SETCLIENTID operation. This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.13. open_owner4 2.2.14. open_owner4
struct open_owner4 { struct open_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>; opaque owner<NFS4_OPAQUE_LIMIT>;
}; };
This structure is used to identify the owner of open state. This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.14. lock_owner4 2.2.15. lock_owner4
struct lock_owner4 { struct lock_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>; opaque owner<NFS4_OPAQUE_LIMIT>;
}; };
This structure is used to identify the owner of file locking state. This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.15. open_to_lock_owner4 2.2.16. open_to_lock_owner4
struct open_to_lock_owner4 { struct open_to_lock_owner4 {
seqid4 open_seqid; seqid4 open_seqid;
stateid4 open_stateid; stateid4 open_stateid;
seqid4 lock_seqid; seqid4 lock_seqid;
lock_owner4 lock_owner; lock_owner4 lock_owner;
}; };
This structure is used for the first LOCK operation done for an This structure is used for the first LOCK operation done for an
open_owner4. It provides both the open_stateid and lock_owner such open_owner4. It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence. Using this mechanism avoids that of the new lock_stateid sequence. Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid. to established state in the form of the open_stateid/open_seqid.
1.2.16. stateid4 2.2.17. stateid4
struct stateid4 { struct stateid4 {
uint32_t seqid; uint32_t seqid;
opaque other[12]; opaque other[12];
}; };
This structure is used for the various state sharing mechanisms This structure is used for the various state sharing mechanisms
between the client and server. For the client, this data structure between the client and server. For the client, this data structure
is read-only. The starting value of the seqid field is undefined. is read-only. The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at The server is required to increment the seqid field monotonically at
each transition of the stateid. This is important since the client each transition of the stateid. This is important since the client
will inspect the seqid in OPEN stateids to determine the order of will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server. OPEN processing done by the server.
1.2.17. layouttype4 2.2.18. layouttype4
enum layouttype4 { enum layouttype4 {
LAYOUT_NFSV4_FILES = 1, LAYOUT_NFSV4_FILES = 1,
LAYOUT_OSD2_OBJECTS = 2, LAYOUT_OSD2_OBJECTS = 2,
LAYOUT_BLOCK_VOLUME = 3 LAYOUT_BLOCK_VOLUME = 3
}; };
A layout type specifies the layout being used. The implication is A layout type specifies the layout being used. The implication is
that clients have "layout drivers" that support one or more layout that clients have "layout drivers" that support one or more layout
types. The file server advertises the layout types it supports types. The file server advertises the layout types it supports
through the LAYOUT_TYPES file system attribute. A client asks for through the LAYOUT_TYPES file system attribute. A client asks for
layouts of a particular type in LAYOUTGET, and passes those layouts layouts of a particular type in LAYOUTGET, and passes those layouts
to its layout driver. The set of well known layout types must be to its layout driver.
defined. As well, a private range of layout types is to be defined
by this document. This would allow custom installations to introduce
new layout types.
[[Comment.1: Determine private range of layout types]]
New layout types must be specified in RFCs approved by the IESG The layouttype4 structure is 32 bits in length. The range
before becoming part of the pNFS specification. represented by the layout type is split into two parts. Types within
the range 0x00000000-0x7FFFFFFF are globally unique and are assigned
according to the description in Section 24.1; they are maintained by
IANA. Types within the range 0x80000000-0xFFFFFFFF are site specific
and for "private use" only.
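The split of the 32-bit layout type space can be sketched as a simple range check. This assumes the private range begins at 0x80000000, the value immediately after the end of the IANA-assigned range; the function name layouttype_scope and the returned labels are hypothetical.

```python
def layouttype_scope(value):
    """Classify a 32-bit layouttype4 value by who assigns it.

    0x00000000-0x7FFFFFFF: globally unique, maintained by IANA.
    0x80000000-0xFFFFFFFF: site specific, for private use only.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is a 32-bit quantity")
    return "private" if value & 0x80000000 else "iana"

print(layouttype_scope(1))            # LAYOUT_NFSV4_FILES
print(layouttype_scope(0x80000001))   # a site-specific type
```

Testing the top bit is equivalent to comparing against the two ranges, since the ranges divide the 32-bit space exactly in half.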
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [11], is to be used. specifies that the object layout, as defined in [20], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [12], is to be used. block/volume layout, as defined in [21], is to be used.
1.2.18. pnfs_deviceid4 2.2.19. deviceid4
typedef uint32_t pnfs_deviceid4; /* 32-bit device ID */ typedef uint32_t deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system ID is qualified by the layout type and is unique per file system
(FSID). This allows different layout drivers to generate device IDs (FSID). This allows different layout drivers to generate device IDs
without the need for co-ordination. See Section 16.1.4 for more without the need for co-ordination. See Section 15.3.1.4 for more
details. details.
1.2.19. pnfs_netaddr4 2.2.20. devlist_item4
struct pnfs_netaddr4 {
string r_netid<>; /* network ID */
string r_addr<>; /* universal address */
};
For a description of the r_netid and r_addr fields see the
descriptions provided in the clientaddr4 structure description.
1.2.20. pnfs_devlist_item4
struct pnfs_devlist_item4 { struct devlist_item4 {
pnfs_deviceid4 id; deviceid4 dli_id;
opaque device_addr<>; opaque dli_device_addr<>;
}; };
An array of these values is returned by the GETDEVICELIST operation. An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args. layout type specified in the GETDEVICELIST4args.
The device address is used to set up a communication channel with the The device address is used to set up a communication channel with the
storage device. Different layout types will require different types storage device. Different layout types will require different types
of structures to define how they communicate with storage devices. of structures to define how they communicate with storage devices.
The opaque device_addr field must be interpreted based on the The opaque device_addr field must be interpreted based on the
specified layout type. specified layout type.
This document defines the device address for the NFSv4 file layout This document defines the device address for the NFSv4 file layout
(struct pnfs_netaddr4), which identifies a storage device by network (struct netaddr4 (Section 2.2.10)), which identifies a storage device
IP address and port number (similar to struct clientaddr4). This is by network IP address and port number. This is sufficient for the
sufficient for the clients to communicate with the NFSv4 storage clients to communicate with the NFSv4 storage devices, and may be
devices, and may be sufficient for other layout types as well. sufficient for other layout types as well. Device types for object
Device types for object storage devices and block storage devices storage devices and block storage devices (e.g., SCSI volume labels)
(e.g., SCSI volume labels) will be defined by their respective layout will be defined by their respective layout specifications.
specifications.
1.2.21. pnfs_layout4 2.2.21. layout4
struct pnfs_layout4 { struct layout4 {
offset4 offset; offset4 lo_offset;
length4 length; length4 lo_length;
pnfs_layoutiomode4 iomode; layoutiomode4 lo_iomode;
pnfs_layouttype4 type; layouttype4 lo_type;
opaque layout<>; opaque lo_layout<>;
}; };
The pnfs_layout4 structure defines a layout for a file. The layout The layout4 structure defines a layout for a file. The layout type
type specific data is opaque within this structure and must be specific data is opaque within this structure and must be
interpreted based on the layout type. Currently, only the NFSv4 interpreted based on the layout type. Currently, only the NFSv4
file layout type is defined; see Section 17.1 for its definition. file layout type is defined; see Section 15.4.1 for its definition.
Since layouts are sub-dividable, the offset and length together with Since layouts are sub-dividable, the offset and length together with
the file's filehandle, the clientid, iomode, and layout type, the file's filehandle, the clientid, iomode, and layout type,
identifies the layout. identifies the layout.
[[Comment.2: there is a discussion of moving the striping 2.2.22. layoutupdate4
information, or more generally the "aggregation scheme", up to the
generic layout level. This creates a two-layer system where the top
level is a switch on different data placement layouts, and the next
level down is a switch on different data storage types. This lets
different layouts (e.g., striping or mirroring or redundant servers)
to be layered over different storage devices. This would move
geometry information out of nfsv4_file_layouttype4 and up into a
generic pnfs_striped_layout type that would specify a set of
pnfs_deviceid4 and pnfs_devicetype4 to use for storage. Instead of
nfsv4_file_layouttype4, there would be pnfs_nfsv4_devicetype4.]]
1.2.22. pnfs_layoutupdate4
struct pnfs_layoutupdate4 { struct layoutupdate4 {
pnfs_layouttype4 type; layouttype4 lou_type;
opaque layoutupdate_data<>; opaque lou_data<>;
}; };
The pnfs_layoutupdate4 structure is used by the client to return
'updated' layout information to the metadata server at LAYOUTCOMMIT
time. This structure provides a channel to pass layout type specific
information back to the metadata server. E.g., for block/volume
layout types this could include the list of reserved blocks that were
written. The contents of the opaque layoutupdate_data argument are
determined by the layout type and are defined in their context. The
NFSv4 file-based layout does not use this structure, thus the
update_data field should have a zero length.
1.2.23. layouthint4 The layoutupdate4 structure is used by the client to return 'updated'
layout information to the metadata server at LAYOUTCOMMIT time. This
structure provides a channel to pass layout type specific information
back to the metadata server. E.g., for block/volume layout types
this could include the list of reserved blocks that were written.
The contents of the opaque lou_data argument are determined by the
layout type and are defined in their context. The NFSv4 file-based
layout does not use this structure, thus the update_data field should
have a zero length.
struct pnfs_layouthint4 { 2.2.23. layouthint4
pnfs_layouttype4 type;
opaque layouthint_data<>; struct layouthint4 {
layouttype4 loh_type;
opaque loh_data<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute It is the structure specified by the FILE_LAYOUT_HINT attribute
described below. The metadata server may ignore the hint, or may described below. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 17.1. structure as defined in Section 15.4.1.
1.2.24. pnfs_layoutiomode4 2.2.24. layoutiomode4
enum pnfs_layoutiomode4 { enum layoutiomode4 {
LAYOUTIOMODE_READ = 1, LAYOUTIOMODE_READ = 1,
LAYOUTIOMODE_RW = 2, LAYOUTIOMODE_RW = 2,
LAYOUTIOMODE_ANY = 3 LAYOUTIOMODE_ANY = 3
}; };
The iomode specifies whether the client intends to read or write The iomode specifies whether the client intends to read or write
(with the possibility of reading) the data represented by the layout. (with the possibility of reading) the data represented by the layout.
The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be
used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies
that layouts pertaining to both READ and RW iomodes are being that layouts pertaining to both READ and RW iomodes are being
returned or recalled, respectively. The metadata server's use of the returned or recalled, respectively. The metadata server's use of the
iomode may depend on the layout type being used. The storage devices iomode may depend on the layout type being used. The storage devices
may validate I/O accesses against the iomode and reject invalid may validate I/O accesses against the iomode and reject invalid
accesses. accesses.
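The rule about which operations may carry the ANY iomode can be sketched as a lookup table. The operation names come from the text above; the class LayoutIOMode, the table ALLOWED_IOMODES, and the function iomode_allowed are hypothetical names used only for illustration.

```python
from enum import IntEnum

class LayoutIOMode(IntEnum):
    READ = 1   # LAYOUTIOMODE_READ
    RW = 2     # LAYOUTIOMODE_RW
    ANY = 3    # LAYOUTIOMODE_ANY

# ANY MUST NOT be used for LAYOUTGET; it is allowed when returning or
# recalling layouts, where it covers both READ and RW layouts.
ALLOWED_IOMODES = {
    "LAYOUTGET": {LayoutIOMode.READ, LayoutIOMode.RW},
    "LAYOUTRETURN": {LayoutIOMode.READ, LayoutIOMode.RW, LayoutIOMode.ANY},
    "LAYOUTRECALL": {LayoutIOMode.READ, LayoutIOMode.RW, LayoutIOMode.ANY},
}

def iomode_allowed(operation, iomode):
    """Report whether `iomode` is acceptable as an argument to `operation`."""
    return LayoutIOMode(iomode) in ALLOWED_IOMODES[operation]

print(iomode_allowed("LAYOUTGET", LayoutIOMode.ANY))
print(iomode_allowed("LAYOUTRETURN", LayoutIOMode.ANY))
```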
1.2.25. nfs_impl_id4 2.2.25. nfs_impl_id4
struct nfs_impl_id4 { struct nfs_impl_id4 {
utf8str_cis nii_domain; utf8str_cis nii_domain;
utf8str_cs nii_name; utf8str_cs nii_name;
nfstime4 nii_date; nfstime4 nii_date;
}; };
This structure is used to identify client and server implementation This structure is used to identify client and server implementation
detail. The nii_domain field is the DNS domain name that the detail. The nii_domain field is the DNS domain name that the
implementer is associated with. The nii_name field is the product implementer is associated with. The nii_name field is the product
name of the implementation and is completely free form. It is name of the implementation and is completely free form. It is
encouraged that the nii_name be used to distinguish machine encouraged that the nii_name be used to distinguish machine
architecture, machine platforms, revisions, versions, and patch architecture, machine platforms, revisions, versions, and patch
levels. The nii_date field is the timestamp of when the software levels. The nii_date field is the timestamp of when the software
instance was published or built. instance was published or built.
1.2.26. impl_ident4 2.2.26. impl_ident4
struct impl_ident4 { struct impl_ident4 {
clientid4 ii_clientid; clientid4 ii_clientid;
struct nfs_impl_id4 ii_impl_id; struct nfs_impl_id4 ii_impl_id;
}; };
This is used for exchanging implementation identification between This is used for exchanging implementation identification between
client and server. client and server.
2. RPC and Security Flavor 2.2.27. threshold_item4
struct threshold_item4 {
layouttype4 thi_layout_type;
bitmap4 thi_hintset;
opaque thi_hintlist<>;
};
This structure contains a list of hints specific to a layout type for
helping the client determine when it should issue I/O directly
through the metadata server vs. the data servers. The hint structure
consists of the layout type, a bitmap describing the set of hints
supported by the server (which may differ based on the layout type),
and a list of hints, whose structure is determined by the hintset
bitmap. See the mdsthreshold attribute for more details.
The hintset is a bitmap of the following values:
+-------------------------+---+---------+---------------------------+
| name | # | Data | Description |
| | | Type | |
+-------------------------+---+---------+---------------------------+
| threshold4_read_size | 0 | length4 | The file size below which |
| | | | it is recommended to read |
| | | | data through the MDS. |
| threshold4_write_size | 1 | length4 | The file size below which |
| | | | it is recommended to |
| | | | write data through the |
| | | | MDS. |
| threshold4_read_iosize | 2 | length4 | For read I/O sizes below |
| | | | this threshold it is |
| | | | recommended to read data |
| | | | through the MDS. |
| threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
| | | | this threshold it is |
| | | | recommended to write data |
| | | | through the MDS. |
+-------------------------+---+---------+---------------------------+
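A sketch of how a client might pair the hintset bitmap with the hint values follows. The encoding of thi_hintlist is not spelled out in this section, so the sketch assumes one 8-byte big-endian length4 value per set bit, in ascending bit order; that layout, the table name THRESHOLD_HINTS, and the function name decode_hints are all assumptions made for illustration.

```python
import struct

# Bit numbers and names from the hintset table.
THRESHOLD_HINTS = {
    0: "threshold4_read_size",
    1: "threshold4_write_size",
    2: "threshold4_read_iosize",
    3: "threshold4_write_iosize",
}

def decode_hints(hintset_words, hintlist):
    """Map each hint named in the bitmap to its value from the opaque list.

    `hintset_words` is the bitmap4 as a list of 32-bit words; `hintlist` is
    the opaque body, assumed to hold one uint64 per set bit in bit order.
    """
    names = [
        name
        for bit, name in sorted(THRESHOLD_HINTS.items())
        if bit // 32 < len(hintset_words)
        and (hintset_words[bit // 32] >> (bit % 32)) & 1
    ]
    values = struct.unpack(f">{len(names)}Q", hintlist)
    return dict(zip(names, values))

# Bits 0 and 2 set: a read-size and a read-iosize threshold.
body = (4096).to_bytes(8, "big") + (65536).to_bytes(8, "big")
print(decode_hints([0b0101], body))
```

A client holding these values could then send a read through the MDS whenever the file size or the I/O size falls below the corresponding threshold.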
2.2.28. mdsthreshold4
struct mdsthreshold4 {
threshold_item4 mth_hints<>;
};
This structure holds an array of threshold_item4 structures each of
which is valid for a particular layout type. An array is necessary
since a server can support multiple layout types for a single file.
3. RPC and Security Flavor
The NFS version 4.1 protocol is a Remote Procedure Call (RPC) The NFS version 4.1 protocol is a Remote Procedure Call (RPC)
application that uses RPC version 2 and the corresponding eXternal application that uses RPC version 2 and the corresponding eXternal
Data Representation (XDR) as defined in [RFC1831] and [RFC4506]. The Data Representation (XDR) as defined in RFC1831 [4] and RFC4506 [3].
RPCSEC_GSS security flavor as defined in [RFC2203] MUST be used as The RPCSEC_GSS security flavor as defined in RFC2203 [5] MUST be used
the mechanism to deliver stronger security for the NFS version 4 as the mechanism to deliver stronger security for the NFS version 4
protocol. protocol.
2.1. Ports and Transports 3.1. Ports and Transports
Historically, NFS version 2 and version 3 servers have resided on Historically, NFS version 2 and version 3 servers have resided on
port 2049. The registered port 2049 [RFC3232] for the NFS protocol port 2049. The registered port 2049 RFC3232 [22] for the NFS
should be the default configuration. NFSv4 clients SHOULD NOT use protocol should be the default configuration. NFSv4 clients SHOULD
the RPC binding protocols as described in [RFC1833]. NOT use the RPC binding protocols as described in RFC1833 [19].
Where an NFS version 4 implementation supports operation over the IP Where an NFS version 4 implementation supports operation over the IP
network protocol, the supported transports between NFS and IP MUST network protocol, the supported transports between NFS and IP MUST
have the following two attributes: have the following two attributes:
1. The transport must support reliable delivery of data in the order 1. The transport must support reliable delivery of data in the order
it was sent. it was sent.
2. The transport must be among the IETF-approved congestion control 2. The transport must be among the IETF-approved congestion control
transport protocols. transport protocols.
skipping to change at page 21, line 48 skipping to change at page 31, line 5
implementation of TCP connection management in proportion to the implementation of TCP connection management in proportion to the
number of expected client machines. NFS version 4.1 will not modify number of expected client machines. NFS version 4.1 will not modify
this connection management model. NFS version 4.1 clients that this connection management model. NFS version 4.1 clients that
violate this assumption can expect scaling issues on the server and violate this assumption can expect scaling issues on the server and
hence reduced service. hence reduced service.
Note that for various timers, the client and server should avoid Note that for various timers, the client and server should avoid
inadvertent synchronization of those timers. For further discussion inadvertent synchronization of those timers. For further discussion
of the general issue refer to [Floyd]. of the general issue refer to [Floyd].
2.1.1. Client Retransmission Behavior 3.1.1. Client Retransmission Behavior
When processing a request received over a reliable transport such as When processing a request received over a reliable transport such as
TCP, the NFS version 4.1 server MUST NOT silently drop the request, TCP, the NFS version 4.1 server MUST NOT silently drop the request,
except if the transport connection has been broken. Given such a except if the transport connection has been broken. Given such a
contract between NFS version 4.1 clients and servers, clients MUST contract between NFS version 4.1 clients and servers, clients MUST
NOT retry a request unless one or both of the following are true: NOT retry a request unless one or both of the following are true:
o The transport connection has been broken o The transport connection has been broken
o The procedure being retried is the NULL procedure o The procedure being retried is the NULL procedure
skipping to change at page 22, line 31 skipping to change at page 31, line 37
The client can then reconnect, and then retry the original request. The client can then reconnect, and then retry the original request.
If the NULL procedure call gets a response, the connection has not If the NULL procedure call gets a response, the connection has not
broken. The client can decide to wait longer for the original broken. The client can decide to wait longer for the original
request's response, or it can break the transport connection and request's response, or it can break the transport connection and
reconnect before re-sending the original request. reconnect before re-sending the original request.
For callbacks from the server to the client, the same rules apply, For callbacks from the server to the client, the same rules apply,
but the server doing the callback becomes the client, and the client but the server doing the callback becomes the client, and the client
receiving the callback becomes the server. receiving the callback becomes the server.
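The retry rule stated in this section reduces to a two-input predicate. The sketch below is only an illustration; the function name may_retry and its boolean parameters are hypothetical and not part of the protocol.

```python
def may_retry(connection_broken: bool, is_null_procedure: bool) -> bool:
    """Return True when a retry is permitted over a reliable transport.

    A request may be retried only if the transport connection has been
    broken, or the procedure being retried is the NULL procedure.
    """
    return connection_broken or is_null_procedure
```

For callbacks the same predicate applies with the roles reversed: the server evaluates it before re-sending a callback to the client.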
2.2. Security Flavors 3.2. Security Flavors
Traditional RPC implementations have included AUTH_NONE, AUTH_SYS, Traditional RPC implementations have included AUTH_NONE, AUTH_SYS,
AUTH_DH, and AUTH_KRB4 as security flavors. With [RFC2203] an AUTH_DH, and AUTH_KRB4 as security flavors. With RFC2203 [5] an
additional security flavor of RPCSEC_GSS has been introduced which additional security flavor of RPCSEC_GSS has been introduced which
uses the functionality of GSS-API [RFC2743]. This allows for the use uses the functionality of GSS-API RFC2743 [8]. This allows for the
of various security mechanisms by the RPC layer without the use of various security mechanisms by the RPC layer without the
additional implementation overhead of adding RPC security flavors. additional implementation overhead of adding RPC security flavors.
For NFS version 4, the RPCSEC_GSS security flavor MUST be implemented For NFS version 4, the RPCSEC_GSS security flavor MUST be implemented
to enable the mandatory security mechanism. Other flavors, such as to enable the mandatory security mechanism. Other flavors, such as
AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well. AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well.
2.2.1. Security mechanisms for NFS version 4 3.2.1. Security mechanisms for NFS version 4
The use of RPCSEC_GSS requires selection of: mechanism, quality of The use of RPCSEC_GSS requires selection of: mechanism, quality of
protection, and service (authentication, integrity, privacy). The protection, and service (authentication, integrity, privacy). The
remainder of this document will refer to these three parameters of remainder of this document will refer to these three parameters of
the RPCSEC_GSS security as the security triple. the RPCSEC_GSS security as the security triple.
2.2.1.1. Kerberos V5 3.2.1.1. Kerberos V5
The Kerberos V5 GSS-API mechanism as described in [RFC1964] MUST be The Kerberos V5 GSS-API mechanism as described in RFC1964 [6] MUST be
implemented. implemented.
column descriptions: column descriptions:
1 == number of pseudo flavor 1 == number of pseudo flavor
2 == name of pseudo flavor 2 == name of pseudo flavor
3 == mechanism's OID 3 == mechanism's OID
4 == RPCSEC_GSS service 4 == RPCSEC_GSS service
1 2 3 4 1 2 3 4
-------------------------------------------------------------------- --------------------------------------------------------------------
skipping to change at page 23, line 30 skipping to change at page 32, line 32
390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy 390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy
Note that the pseudo flavor is presented here as a mapping aid to the Note that the pseudo flavor is presented here as a mapping aid to the
implementor. Because this NFS protocol includes a method to implementor. Because this NFS protocol includes a method to
negotiate security and it understands the GSS-API mechanism, the negotiate security and it understands the GSS-API mechanism, the
pseudo flavor is not needed. The pseudo flavor is needed for NFS pseudo flavor is not needed. The pseudo flavor is needed for NFS
version 3 since the security negotiation is done via the MOUNT version 3 since the security negotiation is done via the MOUNT
protocol. protocol.
For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please
see [RFC2623]. see RFC2623 [23].
2.2.1.2. LIPKEY as a security triple 3.2.1.2. LIPKEY as a security triple
The LIPKEY GSS-API mechanism as described in [RFC2847] MUST be The LIPKEY GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The implemented and provide the following security triples. The
definition of the columns matches the previous subsection "Kerberos definition of the columns matches the previous subsection "Kerberos
V5 as security triple". V5 as security triple".
1 2 3 4 1 2 3 4
-------------------------------------------------------------------- --------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none 390006 lipkey 1.3.6.1.5.5.9 rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity 390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy 390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy
2.2.1.3. SPKM-3 as a security triple 3.2.1.3. SPKM-3 as a security triple
The SPKM-3 GSS-API mechanism as described in [RFC2847] MUST be The SPKM-3 GSS-API mechanism as described in RFC2847 [7] MUST be
implemented and provide the following security triples. The implemented and provide the following security triples. The
definition of the columns matches the previous subsection "Kerberos definition of the columns matches the previous subsection "Kerberos
V5 as security triple". V5 as security triple".
1 2 3 4 1 2 3 4
-------------------------------------------------------------------- --------------------------------------------------------------------
390009 spkm3 1.3.6.1.5.5.1.3 rpc_gss_svc_none 390009 spkm3 1.3.6.1.5.5.1.3 rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity 390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy 390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy
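The pseudo flavor rows shown in this section can be held in a small lookup table. The Python sketch below is an illustration only: it contains just the rows visible in this excerpt (the remaining Kerberos V5 rows are not reproduced here), and the names PSEUDO_FLAVORS and flavor_by_name are hypothetical.

```python
# pseudo flavor number -> (name, mechanism OID, RPCSEC_GSS service)
# Only rows that appear in the tables above are listed.
PSEUDO_FLAVORS = {
    390005: ("krb5p", "1.2.840.113554.1.2.2", "rpc_gss_svc_privacy"),
    390006: ("lipkey", "1.3.6.1.5.5.9", "rpc_gss_svc_none"),
    390007: ("lipkey-i", "1.3.6.1.5.5.9", "rpc_gss_svc_integrity"),
    390008: ("lipkey-p", "1.3.6.1.5.5.9", "rpc_gss_svc_privacy"),
    390009: ("spkm3", "1.3.6.1.5.5.1.3", "rpc_gss_svc_none"),
    390010: ("spkm3i", "1.3.6.1.5.5.1.3", "rpc_gss_svc_integrity"),
    390011: ("spkm3p", "1.3.6.1.5.5.1.3", "rpc_gss_svc_privacy"),
}


def flavor_by_name(name: str) -> int:
    """Return the pseudo flavor number for a pseudo flavor name."""
    for number, (flavor_name, _oid, _service) in PSEUDO_FLAVORS.items():
        if flavor_name == name:
            return number
    raise KeyError(name)
```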
2.3. Security Negotiation 3.3. Security Negotiation
With the NFS version 4 server potentially offering multiple security With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its filesystem name space NFS server may have multiple points within its filesystem name space
that are available for use by NFS clients. In turn the NFS server that are available for use by NFS clients. In turn the NFS server
may be configured such that each of these entry points may have may be configured such that each of these entry points may have
different or multiple security mechanisms in use. different or multiple security mechanisms in use.
The security negotiation between client and server must be done with The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired. server to choose a lower level of security than required or desired.
See the section "Security Considerations" for further discussion. See the section "Security Considerations" for further discussion.
2.3.1. SECINFO and SECINFO_NO_NAME 3.3.1. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security triple is to be determine, on a per filehandle basis, what security triple is to be
used for server access. In general, the client will not have to use used for server access. In general, the client will not have to use
either operation except during initial communication with the server either operation except during initial communication with the server
or when the client crosses policy boundaries at the server. It is or when the client crosses policy boundaries at the server. It is
possible that the server's policies change during the client's possible that the server's policies change during the client's
interaction, therefore forcing the client to negotiate a new security interaction, therefore forcing the client to negotiate a new security
triple. triple.
2.3.2. Security Error 3.3.2. Security Error
Based on the assumption that each NFS version 4 client and server Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its
communication with the server with one of the minimal security communication with the server with one of the minimal security
triples. During communication with the server, the client may triples. During communication with the server, the client may
receive an NFS error of NFS4ERR_WRONGSEC. This error allows the receive an NFS error of NFS4ERR_WRONGSEC. This error allows the
server to notify the client that the security triple currently being server to notify the client that the security triple currently being
used is not appropriate for access to the server's filesystem used is not appropriate for access to the server's filesystem
resources. The client is then responsible for determining what resources. The client is then responsible for determining what
security triples are available at the server and choosing one that is security triples are available at the server and choosing one that is
appropriate for the client. See the section for the "SECINFO" appropriate for the client. See the section for the "SECINFO"
operation for further discussion of how the client will respond to operation for further discussion of how the client will respond to
the NFS4ERR_WRONGSEC error and use SECINFO. the NFS4ERR_WRONGSEC error and use SECINFO.
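One way a client might react to NFS4ERR_WRONGSEC is to walk the list of triples obtained from the server and take the first one it also supports. The sketch below assumes the list is already in the server's order of preference and models a triple as a plain tuple; the function name choose_triple, the tuple layout, and the QOP value 0 in the usage example are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

# (mechanism OID, quality of protection, service); layout is an assumption.
Triple = Tuple[str, int, str]


def choose_triple(offered: Sequence[Triple],
                  supported: Iterable[Triple]) -> Optional[Triple]:
    """Return the first server-offered triple the client supports.

    Returns None when there is no overlap, in which case the client
    cannot access the resource under its current configuration.
    """
    supported_set = set(supported)
    for triple in offered:
        if triple in supported_set:
            return triple
    return None
```

For example, if the server offers the SPKM-3 privacy triple followed by the Kerberos V5 privacy triple and the client supports only the latter, choose_triple returns the Kerberos V5 entry.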
2.3.3. Callback RPC Authentication 3.3.3. Callback RPC Authentication
Callback authentication has changed in NFSv4.1 from NFSv4.0. Callback authentication has changed in NFSv4.1 from NFSv4.0.
NFSv4.0 required the NFS server to create a security context for NFSv4.0 required the NFS server to create a security context for
RPCSEC_GSS, AUTH_DH, and AUTH_KERB4, and any other security flavor RPCSEC_GSS, AUTH_DH, and AUTH_KERB4, and any other security flavor
that had a security context. It also required that the principal issuing that had a security context. It also required that the principal issuing
the callback be the same as the principal that accepted the callback the callback be the same as the principal that accepted the callback
parameters (via SETCLIENTID), and that the client principal accepting parameters (via SETCLIENTID), and that the client principal accepting
the callback be the same as that which issued the SETCLIENTID. This the callback be the same as that which issued the SETCLIENTID. This
required the NFS client to have an assigned machine credential. required the NFS client to have an assigned machine credential.
NFSv4.1 does not require a machine credential. Instead, NFSv4.1 NFSv4.1 does not require a machine credential. Instead, NFSv4.1
allows an RPCSEC_GSS security context initiated by the client and allows an RPCSEC_GSS security context initiated by the client and
established on both the client and server to be used on callback established on both the client and server to be used on callback
RPCs sent by the server to the client. The BIND_BACKCHANNEL RPCs sent by the server to the client. The BIND_BACKCHANNEL
operation is used to establish RPCSEC_GSS contexts (if the client so operation is used to establish RPCSEC_GSS contexts (if the client so
desires) on the server. No support for AUTH_DH or AUTH_KERB4 is desires) on the server. No support for AUTH_DH or AUTH_KERB4 is
specified. specified.
2.3.4. GSS Server Principal 3.3.4. GSS Server Principal
Regardless of what security mechanism under RPCSEC_GSS is being used, Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE
names are of the form: names are of the form:
service@hostname service@hostname
For NFS, the "service" element is For NFS, the "service" element is
nfs nfs
Implementations of security mechanisms will convert nfs@hostname to Implementations of security mechanisms will convert nfs@hostname to
various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the
following form is RECOMMENDED: following form is RECOMMENDED:
nfs/hostname nfs/hostname
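The conversion from the host-based service form to the RECOMMENDED form is a simple string rewrite. The sketch below is illustrative only; the function name hostbased_to_principal and the host name in the usage note are hypothetical, and a real mechanism may add further components (such as a realm) that are not modeled here.

```python
def hostbased_to_principal(name: str) -> str:
    """Convert 'service@hostname' to the 'service/hostname' form."""
    service, separator, hostname = name.partition("@")
    if not separator or not service or not hostname:
        raise ValueError("expected a name of the form service@hostname")
    return f"{service}/{hostname}"
```

For instance, hostbased_to_principal("nfs@server.example.com") yields "nfs/server.example.com".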
3. Filehandles 4. Filehandles
The filehandle in the NFS protocol is a per server unique identifier The filehandle in the NFS protocol is a per server unique identifier
for a filesystem object. The contents of the filehandle are opaque for a filesystem object. The contents of the filehandle are opaque
to the client. Therefore, the server is responsible for translating to the client. Therefore, the server is responsible for translating
the filehandle to an internal representation of the filesystem the filehandle to an internal representation of the filesystem
object. object.
3.1. Obtaining the First Filehandle 4.1. Obtaining the First Filehandle
The operations of the NFS protocol are defined in terms of one or The operations of the NFS protocol are defined in terms of one or
more filehandles. Therefore, the client needs a filehandle to more filehandles. Therefore, the client needs a filehandle to
initiate communication with the server. With the NFS version 2 initiate communication with the server. With the NFS version 2
protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there protocol RFC1094 [17] and the NFS version 3 protocol RFC1813 [18],
exists an ancillary protocol to obtain this first filehandle. The there exists an ancillary protocol to obtain this first filehandle.
MOUNT protocol, RPC program number 100005, provides the mechanism of The MOUNT protocol, RPC program number 100005, provides the mechanism
translating a string based filesystem path name to a filehandle which of translating a string based file system path name to a filehandle
can then be used by the NFS protocols. which can then be used by the NFS protocols.
The MOUNT protocol has deficiencies in the area of security and use The MOUNT protocol has deficiencies in the area of security and use
via firewalls. This is one reason that the use of the public via firewalls. This is one reason that the use of the public
filehandle was introduced in [RFC2054] and [RFC2055]. With the use filehandle was introduced in RFC2054 [24] and RFC2055 [25]. With the
of the public filehandle in combination with the LOOKUP operation in use of the public filehandle in combination with the LOOKUP operation
the NFS version 2 and 3 protocols, it has been demonstrated that the in the NFS version 2 and 3 protocols, it has been demonstrated that
MOUNT protocol is unnecessary for viable interaction between NFS the MOUNT protocol is unnecessary for viable interaction between NFS
client and server. client and server.
Therefore, the NFS version 4 protocol will not use an ancillary Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a protocol for translation from string based path names to a
filehandle. Two special filehandles will be used as starting points filehandle. Two special filehandles will be used as starting points
for the NFS client. for the NFS client.
3.1.1. Root Filehandle 4.1.1. Root Filehandle
The first of the special filehandles is the ROOT filehandle. The The first of the special filehandles is the ROOT filehandle. The
ROOT filehandle is the "conceptual" root of the filesystem name space ROOT filehandle is the "conceptual" root of the file system name
at the NFS server. The client uses or starts with the ROOT space at the NFS server. The client uses or starts with the ROOT
filehandle by employing the PUTROOTFH operation. The PUTROOTFH filehandle by employing the PUTROOTFH operation. The PUTROOTFH
operation instructs the server to set the "current" filehandle to the operation instructs the server to set the "current" filehandle to the
ROOT of the server's file tree. Once this PUTROOTFH operation is ROOT of the server's file tree. Once this PUTROOTFH operation is
used, the client can then traverse the entirety of the server's file used, the client can then traverse the entirety of the server's file
tree with the LOOKUP operation. A complete discussion of the server tree with the LOOKUP operation. A complete discussion of the server
name space is in the section "NFS Server Name Space". name space is in the section "NFS Server Name Space".
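The traversal described above, PUTROOTFH followed by one LOOKUP per path component, can be sketched as a list of operations. This is an illustration only: the tuples stand in for protocol operations and are not a wire encoding, and the function name ops_for_path is hypothetical.

```python
from typing import List, Tuple


def ops_for_path(path: str) -> List[Tuple[str, ...]]:
    """Build the operation sequence that walks from the ROOT filehandle.

    Starts with PUTROOTFH to set the current filehandle to the root,
    then issues one LOOKUP for each non-empty path component.
    """
    ops: List[Tuple[str, ...]] = [("PUTROOTFH",)]
    for component in path.split("/"):
        if component:
            ops.append(("LOOKUP", component))
    return ops
```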
3.1.2. Public Filehandle 4.1.2. Public Filehandle
The second special filehandle is the PUBLIC filehandle. Unlike the The second special filehandle is the PUBLIC filehandle. Unlike the
ROOT filehandle, the PUBLIC filehandle may be bound to or represent an ROOT filehandle, the PUBLIC filehandle may be bound to or represent an
arbitrary filesystem object at the server. The server is responsible arbitrary file system object at the server. The server is
for this binding. It may be that the PUBLIC filehandle and the ROOT responsible for this binding. It may be that the PUBLIC filehandle
filehandle refer to the same filesystem object. However, it is up to and the ROOT filehandle refer to the same file system object.
the administrative software at the server and the policies of the However, it is up to the administrative software at the server and
server administrator to define the binding of the PUBLIC filehandle the policies of the server administrator to define the binding of the
and server filesystem object. The client may not make any PUBLIC filehandle and server file system object. The client may not
assumptions about this binding. The client uses the PUBLIC make any assumptions about this binding. The client uses the PUBLIC
filehandle via the PUTPUBFH operation. filehandle via the PUTPUBFH operation.
3.2. Filehandle Types 4.2. Filehandle Types
In the NFS version 2 and 3 protocols, there was one type of In the NFS version 2 and 3 protocols, there was one type of
filehandle with a single set of semantics. This type of filehandle filehandle with a single set of semantics. This type of filehandle
is termed "persistent" in NFS Version 4. The semantics of a is termed "persistent" in NFS Version 4. The semantics of a
persistent filehandle remain the same as before. A new type of persistent filehandle remain the same as before. A new type of
filehandle introduced in NFS Version 4 is the "volatile" filehandle, filehandle introduced in NFS Version 4 is the "volatile" filehandle,
which attempts to accommodate certain server environments. which attempts to accommodate certain server environments.
The volatile filehandle type was introduced to address server The volatile filehandle type was introduced to address server
functionality or implementation issues which make correct functionality or implementation issues which make correct
implementation of a persistent filehandle infeasible. Some server implementation of a persistent filehandle infeasible. Some server
environments do not provide a filesystem level invariant that can be environments do not provide a filesystem level invariant that can be
used to construct a persistent filehandle. The underlying server used to construct a persistent filehandle. The underlying server
filesystem may not provide the invariant or the server's filesystem filesystem may not provide the invariant or the server's filesystem
programming interfaces may not provide access to the needed programming interfaces may not provide access to the needed
invariant. Volatile filehandles may ease the implementation of invariant. Volatile filehandles may ease the implementation of
server functionality such as hierarchical storage management or server functionality such as hierarchical storage management or file
filesystem reorganization or migration. However, the volatile system reorganization or migration. However, the volatile filehandle
filehandle increases the implementation burden for the client. increases the implementation burden for the client.
Since the client will need to handle persistent and volatile Since the client will need to handle persistent and volatile
filehandles differently, a file attribute is defined which may be filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned used by the client to determine the filehandle types being returned
by the server. by the server.
3.2.1. General Properties of a Filehandle 4.2.1. General Properties of a Filehandle
The filehandle contains all the information the server needs to The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT doing a byte-by-byte comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file. from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use filehandles and files but this is not required. Clients MUST use
skipping to change at page 28, line 9 skipping to change at page 37, line 18
"Data Caching and File Identity". "Data Caching and File Identity".
As an example, in the case that two different path names when As an example, in the case that two different path names when
traversed at the server terminate at the same filesystem object, the traversed at the server terminate at the same filesystem object, the
server SHOULD return the same filehandle for each path. This can server SHOULD return the same filehandle for each path. This can
occur if a hard link is used to create two file names which refer to occur if a hard link is used to create two file names which refer to
the same underlying file object and associated data. For example, if the same underlying file object and associated data. For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals. return the same filehandle for both path name traversals.
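A client that treats filehandles as opaque can still use them as cache keys, because equality is defined as a byte-by-byte comparison between filehandles from the same server. The sketch below illustrates that idea; the names same_filehandle and HandleCache, and the use of a server identifier string as part of the key, are assumptions for this example.

```python
from typing import Any, Dict, Optional, Tuple


def same_filehandle(fh_a: bytes, fh_b: bytes) -> bool:
    """Byte-by-byte equality; only meaningful for the same server."""
    return bytes(fh_a) == bytes(fh_b)


class HandleCache:
    """Hypothetical client cache keyed by (server, opaque filehandle)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, bytes], Any] = {}

    def put(self, server: str, fh: bytes, value: Any) -> None:
        # The filehandle is stored as-is and never interpreted.
        self._entries[(server, bytes(fh))] = value

    def get(self, server: str, fh: bytes) -> Optional[Any]:
        return self._entries.get((server, bytes(fh)))
```

Because servers are not required to keep a one-to-one correspondence between filehandles and files, a cache miss here does not show that two filehandles refer to different files.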
3.2.2. Persistent Filehandle 4.2.2. Persistent Filehandle
A persistent filehandle is defined as having a fixed value for the A persistent filehandle is defined as having a fixed value for the
lifetime of the filesystem object to which it refers. Once the lifetime of the filesystem object to which it refers. Once the
server creates the filehandle for a filesystem object, the server server creates the filehandle for a filesystem object, the server
MUST accept the same filehandle for the object for the lifetime of MUST accept the same filehandle for the object for the lifetime of
the object. If the server restarts or reboots, the NFS server must the object. If the server restarts or reboots, the NFS server must
honor the same filehandle value as it did in the server's previous honor the same filehandle value as it did in the server's previous
instantiation. Similarly, if the filesystem is migrated, the new NFS instantiation. Similarly, if the file system is migrated, the new
server must honor the same filehandle as the old NFS server. NFS server must honor the same filehandle as the old NFS server.
The persistent filehandle will become stale or invalid when the The persistent filehandle will become stale or invalid when the
filesystem object is removed. When the server is presented with a filesystem object is removed. When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE. A filehandle may become stale when the an error of NFS4ERR_STALE. A filehandle may become stale when the
filesystem containing the object is no longer available. The file filesystem containing the object is no longer available. The file
system may become unavailable if it exists on removable media and the system may become unavailable if it exists on removable media and the
media is no longer available at the server or the filesystem in whole media is no longer available at the server or the file system in
has been destroyed or the filesystem has simply been removed from the whole has been destroyed or the file system has simply been removed
server's name space (i.e. unmounted in a UNIX environment). from the server's name space (i.e. unmounted in a UNIX environment).
3.2.3. Volatile Filehandle 4.2.3. Volatile Filehandle
A volatile filehandle does not share the same longevity A volatile filehandle does not share the same longevity
characteristics of a persistent filehandle. The server may determine characteristics of a persistent filehandle. The server may determine
that a volatile filehandle is no longer valid at many different that a volatile filehandle is no longer valid at many different
points in time. If the server can definitively determine that a points in time. If the server can definitively determine that a
volatile filehandle refers to an object that has been removed, the volatile filehandle refers to an object that has been removed, the
server should return NFS4ERR_STALE to the client (as is the case for server should return NFS4ERR_STALE to the client (as is the case for
persistent filehandles). In all other cases where the server persistent filehandles). In all other cases where the server
determines that a volatile filehandle can no longer be used, it determines that a volatile filehandle can no longer be used, it
should return an error of NFS4ERR_FHEXPIRED. should return an error of NFS4ERR_FHEXPIRED.
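The choice between the two errors can be summarized in a small decision function. The sketch below is illustrative; the function name and parameters are hypothetical, and error names are returned as strings rather than protocol numbers.

```python
def error_for_unusable_filehandle(is_volatile: bool,
                                  object_known_removed: bool) -> str:
    """Pick the error for a filehandle the server can no longer honor.

    Persistent filehandles only ever become stale. A volatile filehandle
    is reported stale when the server can definitively tell the object
    was removed, and expired in all other cases.
    """
    if not is_volatile or object_known_removed:
        return "NFS4ERR_STALE"
    return "NFS4ERR_FHEXPIRED"
```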
The mandatory attribute "fh_expire_type" is used by the client to The mandatory attribute "fh_expire_type" is used by the client to
determine what type of filehandle the server is providing for a determine what type of filehandle the server is providing for a
particular filesystem. This attribute is a bitmask with the particular filesystem. This attribute is a bitmask with the
following values: following values:
FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a
persistent filehandle, which is valid until the object is removed persistent filehandle, which is valid until the object is removed
from the filesystem. The server will not return NFS4ERR_FHEXPIRED from the file system. The server will not return
for this filehandle. FH4_PERSISTENT is defined as a value in NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined
which none of the bits specified below are set. as a value in which none of the bits specified below are set.
FH4_VOLATILE_ANY The filehandle may expire at any time, except as FH4_VOLATILE_ANY The filehandle may expire at any time, except as
specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN). specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN).
FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set.
If this bit is set, then the meaning of FH4_VOLATILE_ANY is If this bit is set, then the meaning of FH4_VOLATILE_ANY is
qualified to exclude any expiration of the filehandle when it is qualified to exclude any expiration of the filehandle when it is
open. open.
FH4_VOL_MIGRATION The filehandle will expire as a result of a file FH4_VOL_MIGRATION The filehandle will expire as a result of a file
skipping to change at page 29, line 45 skipping to change at page 39, line 6
This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set, set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set,
or if a non-readonly file system has a transition target in a or if a non-readonly file system has a transition target in a
different _handle_ class. In these cases, the server should deny a different _handle_ class. In these cases, the server should deny a
RENAME or REMOVE that would affect an OPEN file of any of the RENAME or REMOVE that would affect an OPEN file of any of the
components leading to the OPEN file. In addition, the server should components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period, in order deny all RENAME or REMOVE requests during the grace period, in order
to make sure that reclaims of files where filehandles may have to make sure that reclaims of files where filehandles may have
expired do not do a reclaim for the wrong file. expired do not do a reclaim for the wrong file.
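The conditions listed above can be sketched as a predicate over the fh_expire_type bitmask. This is a minimal illustration, not part of either draft. It reads "this situation" as a filehandle that may expire while the file is open, which is an interpretation because the preceding text is elided in this diff. The numeric constants are taken from the NFSv4 XDR definitions rather than from this section, and the function name and its second parameter are hypothetical.

```python
# Bit values as defined in the NFSv4 XDR description (not shown in this section).
FH4_PERSISTENT = 0x00000000
FH4_NOEXPIRE_WITH_OPEN = 0x00000001
FH4_VOLATILE_ANY = 0x00000002
FH4_VOL_MIGRATION = 0x00000004
FH4_VOL_RENAME = 0x00000008


def may_expire_while_open(fh_expire_type, writable_fs_changes_handle_class=False):
    """Return True if a filehandle of this type could expire while open.

    writable_fs_changes_handle_class is a hypothetical flag standing in for
    "a non-readonly file system has a transition target in a different
    handle class", which cannot be derived from the bitmask alone.
    """
    # Migration or rename volatility applies regardless of open state.
    if fh_expire_type & (FH4_VOL_MIGRATION | FH4_VOL_RENAME):
        return True
    # FH4_VOLATILE_ANY without the FH4_NOEXPIRE_WITH_OPEN qualifier.
    if (fh_expire_type & FH4_VOLATILE_ANY
            and not fh_expire_type & FH4_NOEXPIRE_WITH_OPEN):
        return True
    return writable_fs_changes_handle_class
```

A server following the text above might consult such a predicate before deciding whether to deny a RENAME or REMOVE that affects an open file.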
3.3. One Method of Constructing a Volatile Filehandle 4.3. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client, could contain: A volatile filehandle, while opaque to the client, could contain:
[volatile bit = 1 | server boot time | slot | generation number] [volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/ o generation number is the generation number for the table entry/
slot slot
When the client presents a volatile filehandle, the server makes the When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit following checks, which assume that the check for the volatile bit
has passed. If the server boot time is less than the current server has passed. If the server boot time is less than the current server
boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_BADHANDLE. If the generation number does not match, return
NFS4ERR_FHEXPIRED. NFS4ERR_FHEXPIRED.
When the server reboots, the table is gone (it is volatile). When the server reboots, the table is gone (it is volatile).
If volatile bit is 0, then it is a persistent filehandle with a If volatile bit is 0, then it is a persistent filehandle with a
different structure following it. different structure following it.
3.4. Client Recovery from Filehandle Expiration 4.4. Client Recovery from Filehandle Expiration
If possible, the client SHOULD recover from the receipt of an If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error. The client must take on additional NFS4ERR_FHEXPIRED error. The client must take on additional
responsibility so that it may prepare itself to recover from the responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle. If the server returns expiration of a volatile filehandle. If the server returns
persistent filehandles, the client does not need these additional persistent filehandles, the client does not need these additional
steps. steps.
For volatile filehandles, most commonly the client will need to store For volatile filehandles, most commonly the client will need to store
the component names leading up to and including the filesystem object the component names leading up to and including the file system
in question. With these names, the client should be able to recover object in question. With these names, the client should be able to
by finding a filehandle in the name space that is still available or recover by finding a filehandle in the name space that is still
by starting at the root of the server's filesystem name space. available or by starting at the root of the server's file system name
space.
If the expired filehandle refers to an object that has been removed If the expired filehandle refers to an object that has been removed
from the filesystem, obviously the client will not be able to recover from the file system, obviously the client will not be able to
from the expired filehandle. recover from the expired filehandle.
It is also possible that the expired filehandle refers to a file that It is also possible that the expired filehandle refers to a file that
has been renamed. If the file was renamed by another client, again has been renamed. If the file was renamed by another client, again
it is possible that the original client will not be able to recover. it is possible that the original client will not be able to recover.
However, in the case that the client itself is renaming the file and However, in the case that the client itself is renaming the file and
the file is open, it is possible that the client may be able to the file is open, it is possible that the client may be able to
recover. The client can determine the new path name based on the recover. The client can determine the new path name based on the
processing of the rename request. The client can then regenerate the processing of the rename request. The client can then regenerate the
new filehandle based on the new path name. The client could also use new filehandle based on the new path name. The client could also use
the compound operation mechanism to construct a set of operations the compound operation mechanism to construct a set of operations
like: like:
RENAME A B RENAME A B
LOOKUP B LOOKUP B
GETFH GETFH
Note that the COMPOUND procedure does not provide atomicity. This Note that the COMPOUND procedure does not provide atomicity. This
example only reduces the overhead of recovering from an expired example only reduces the overhead of recovering from an expired
filehandle. filehandle.
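The name-based recovery described in this section can be sketched as a walk over the stored component names. This is a toy illustration under stated assumptions: recover_filehandle, toy_lookup, and the dictionary standing in for the server name space are hypothetical, and a real client would issue LOOKUP and GETFH operations instead.

```python
def recover_filehandle(lookup, start_fh, components):
    """Re-derive a filehandle by walking saved component names.

    lookup(dir_fh, name) is a caller-supplied stand-in for a LOOKUP
    followed by GETFH; start_fh is a handle that is still valid, such
    as the root of the server's name space.
    """
    fh = start_fh
    for name in components:
        fh = lookup(fh, name)  # fails here if a component no longer exists
    return fh


# Toy name space after "RENAME A B": (directory handle, name) -> handle.
namespace = {
    ("root", "export"): "fh-export",
    ("fh-export", "B"): "fh-B-new",
}


def toy_lookup(dir_fh, name):
    return namespace[(dir_fh, name)]
```

Because the client performed the rename itself, it knows to walk ["export", "B"] rather than the old name. Walking the old name raises KeyError in this toy model, which corresponds to the case where recovery is not possible.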
4. File Attributes 5. File Attributes
To meet the requirements of extensibility and increased To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes must be handled interoperability with non-UNIX platforms, attributes must be handled
in a flexible manner. The NFS version 3 fattr3 structure contains a in a flexible manner. The NFS version 3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able to fixed list of attributes that not all clients and servers are able to
support or care about. The fattr3 structure can not be extended as support or care about. The fattr3 structure can not be extended as
new needs arise and it provides no way to indicate non-support. With new needs arise and it provides no way to indicate non-support. With
the NFS version 4 protocol, the client is able to query what attributes the NFS version 4 protocol, the client is able to query what attributes
the server supports and construct requests with only those supported the server supports and construct requests with only those supported
attributes (or a subset thereof). attributes (or a subset thereof).
skipping to change at page 32, line 34 skipping to change at page 41, line 35
reasonably computable by the client when support is not provided on reasonably computable by the client when support is not provided on
the server. the server.
Note that the hidden directory returned by OPENATTR is a convenience Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing. The client should not make any assumptions for protocol processing. The client should not make any assumptions
about the server's implementation of named attributes and whether the about the server's implementation of named attributes and whether the
underlying filesystem at the server has a named attribute directory underlying filesystem at the server has a named attribute directory
or not. Therefore, operations such as SETATTR and GETATTR on the or not. Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined. named attribute directory are undefined.
4.1. Mandatory Attributes 5.1. Mandatory Attributes
These MUST be supported by every NFS version 4 client and server in These MUST be supported by every NFS version 4 client and server in
order to ensure a minimum level of interoperability. The server must order to ensure a minimum level of interoperability. The server must
store and return these attributes and the client must be able to store and return these attributes and the client must be able to
function with an attribute set limited to these attributes. With function with an attribute set limited to these attributes. With
just the mandatory attributes some client functionality may be just the mandatory attributes some client functionality may be
impaired or limited in some ways. A client may ask for any of these impaired or limited in some ways. A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request and attributes to be returned by setting a bit in the GETATTR request and
the server must return their value. the server must return their value.
4.2. Recommended Attributes 5.2. Recommended Attributes
These attributes are understood well enough to warrant support in the These attributes are understood well enough to warrant support in the
NFS version 4 protocol. However, they may not be supported on all NFS version 4 protocol. However, they may not be supported on all
clients and servers. A client may ask for any of these attributes to clients and servers. A client may ask for any of these attributes to
be returned by setting a bit in the GETATTR request but must handle be returned by setting a bit in the GETATTR request but must handle
the case where the server does not return them. A client may ask for the case where the server does not return them. A client may ask for
the set of attributes the server supports and should not request the set of attributes the server supports and should not request
attributes the server does not support. A server should be tolerant attributes the server does not support. A server should be tolerant
of requests for unsupported attributes and simply not return them of requests for unsupported attributes and simply not return them
rather than considering the request an error. It is expected that rather than considering the request an error. It is expected that
servers will support all attributes they comfortably can and only servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their fail to support attributes which are difficult to support in their
operating environments. A server should provide attributes whenever operating environments. A server should provide attributes whenever
they don't have to "tell lies" to the client. For example, a file they don't have to "tell lies" to the client. For example, a file
modification time should be either an accurate time or should not be modification time should be either an accurate time or should not be
supported by the server. This will not always be comfortable to supported by the server. This will not always be comfortable to
clients but the client is better positioned to decide whether and how to clients but the client is better positioned to decide whether and how to
fabricate or construct an attribute or whether to do without the fabricate or construct an attribute or whether to do without the
attribute. attribute.
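The advice that a client should request only attributes the server supports can be sketched with attribute bitmaps. The attribute numbers below come from the "#" column of the definition tables in this document. The word layout (attribute n stored as bit n mod 32 of 32-bit word n div 32) follows the NFSv4 bitmap convention and is not restated in this section, and the function names are hypothetical.

```python
# Attribute numbers from the "#" column of the attribute tables.
SUPP_ATTR, SIZE, FILEID, MODE, OWNER = 0, 4, 20, 33, 36


def to_bitmap(attr_numbers):
    """Encode attribute numbers as a list of 32-bit bitmap words."""
    words = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def restrict_to_supported(wanted, supported):
    """Drop requested bits that the server's supp_attr does not advertise.

    Words present in only one of the two bitmaps are dropped, since an
    attribute missing from either bitmap is not both wanted and supported.
    """
    return [w & s for w, s in zip(wanted, supported)]
```

For example, a client that wants size, mode, and owner from a server whose supp_attr covers supp_attr, size, fileid, and mode would end up requesting only size and mode.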
4.3. Named Attributes 5.3. Named Attributes
These attributes are not supported by direct encoding in the NFS These attributes are not supported by direct encoding in the NFS
Version 4 protocol but are accessed by string names rather than Version 4 protocol but are accessed by string names rather than
numbers and correspond to an uninterpreted stream of bytes which are numbers and correspond to an uninterpreted stream of bytes which are
stored with the filesystem object. The name space for these stored with the filesystem object. The name space for these
attributes may be accessed by using the OPENATTR operation. The attributes may be accessed by using the OPENATTR operation. The
OPENATTR operation returns a filehandle for a virtual "attribute OPENATTR operation returns a filehandle for a virtual "attribute
directory" and further perusal of the name space may be done using directory" and further perusal of the name space may be done using
READDIR and LOOKUP operations on this filehandle. Named attributes READDIR and LOOKUP operations on this filehandle. Named attributes
may then be examined or changed by normal READ and WRITE and CREATE may then be examined or changed by normal READ and WRITE and CREATE
skipping to change at page 33, line 44 skipping to change at page 42, line 46
attributes, a client which is also able to handle them should be able attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as regular directory entries are valid for named attribute names as
well. well.
Names of attributes will not be controlled by this document or other Names of attributes will not be controlled by this document or other
IETF standards track documents. See the section "IANA IETF standards track documents. See the section "IANA
Considerations" for further discussion. Considerations" for further discussion.
4.4. Classification of Attributes 5.4. Classification of Attributes
Each of the Mandatory and Recommended attributes can be classified in Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per filesystem, or per one of three categories: per server, per file system, or per file
filesystem object. Note that it is possible that some per filesystem system object. Note that it is possible that some per file system
attributes may vary within the filesystem. See the "homogeneous" attributes may vary within the filesystem. See the "homogeneous"
attribute for its definition. Note that the attributes attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR. and time_modify, and are used in a special instance of SETATTR.
o The per server attribute is: o The per server attribute is:
lease_time lease_time
skipping to change at page 34, line 33 skipping to change at page 44, line 5
type, change, size, named_attr, fsid, rdattr_error, filehandle, type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode, ACL, archive, fileid, hidden, maxlink, mimetype, mode,
numlinks, owner, owner_group, rawdev, space_used, system, numlinks, owner, owner_group, rawdev, space_used, system,
time_access, time_backup, time_create, time_metadata, time_access, time_backup, time_create, time_metadata,
time_modify, mounted_on_fileid, layout_type, layout_hint, time_modify, mounted_on_fileid, layout_type, layout_hint,
layout_blksize, layout_alignment layout_blksize, layout_alignment
For quota_avail_hard, quota_avail_soft, and quota_used see their For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification. definitions below for the appropriate classification.
4.5. Mandatory Attributes - Definitions 5.5. Mandatory Attributes - Definitions
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| supp_attr | 0 | bitmap | READ | The bit vector which | | supp_attr | 0 | bitmap | READ | The bit vector which |
| | | | | would retrieve all | | | | | | would retrieve all |
| | | | | mandatory and | | | | | | mandatory and |
| | | | | recommended | | | | | | recommended |
| | | | | attributes that are | | | | | | attributes that are |
| | | | | supported for this | | | | | | supported for this |
skipping to change at page 35, line 30 skipping to change at page 44, line 47
| | | | | data, directory | | | | | | data, directory |
| | | | | contents or | | | | | | contents or |
| | | | | attributes of the | | | | | | attributes of the |
| | | | | object have been | | | | | | object have been |
| | | | | modified. The server | | | | | | modified. The server |
| | | | | may return the | | | | | | may return the |
| | | | | object's | | | | | | object's |
| | | | | time_metadata | | | | | | time_metadata |
| | | | | attribute for this | | | | | | attribute for this |
| | | | | attribute's value | | | | | | attribute's value |
| | | | | but only if the | | | | | | but only if the file |
| | | | | filesystem object | | | | | | system object can |
| | | | | can not be updated | | | | | | not be updated more |
| | | | | more frequently than | | | | | | frequently than the |
| | | | | the resolution of | | | | | | resolution of |
| | | | | time_metadata. | | | | | | time_metadata. |
| size | 4 | uint64 | R/W | The size of the | | size | 4 | uint64 | R/W | The size of the |
| | | | | object in bytes. | | | | | | object in bytes. |
| link_support | 5 | bool | READ | True, if the | | link_support | 5 | bool | READ | True, if the |
| | | | | object's filesystem | | | | | | object's filesystem |
| | | | | supports hard links. | | | | | | supports hard links. |
| symlink_support | 6 | bool | READ | True, if the | | symlink_support | 6 | bool | READ | True, if the |
| | | | | object's filesystem | | | | | | object's filesystem |
| | | | | supports symbolic | | | | | | supports symbolic |
| | | | | links. | | | | | | links. |
skipping to change at page 36, line 29 skipping to change at page 45, line 44
| | | | | seconds. | | | | | | seconds. |
| rdattr_error | 11 | enum | READ | Error returned from | | rdattr_error | 11 | enum | READ | Error returned from |
| | | | | getattr during | | | | | | getattr during |
| | | | | readdir. | | | | | | readdir. |
| filehandle | 19 | nfs_fh4 | READ | The filehandle of | | filehandle | 19 | nfs_fh4 | READ | The filehandle of |
| | | | | this object | | | | | | this object |
| | | | | (primarily for | | | | | | (primarily for |
| | | | | readdir requests). | | | | | | readdir requests). |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
4.6. Recommended Attributes - Definitions 5.6. Recommended Attributes - Definitions
+--------------------+----+---------------+--------+----------------+
+--------------------+----+--------------+--------+-----------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+--------------------+----+--------------+--------+-----------------+ +--------------------+----+---------------+--------+----------------+
| ACL | 12 | nfsace4<> | R/W | The access | | ACL | 12 | nfsace4<> | R/W | The access |
| | | | | control list | | | | | | control list |
| | | | | for the object. | | | | | | for the |
| | | | | object. |
| aclsupport | 13 | uint32 | READ | Indicates what | | aclsupport | 13 | uint32 | READ | Indicates what |
| | | | | types of ACLs | | | | | | types of ACLs |
| | | | | are supported | | | | | | are supported |
| | | | | on the current | | | | | | on the current |
| | | | | filesystem. | | | | | | filesystem. |
| archive | 14 | bool | R/W | True, if this | | archive | 14 | bool | R/W | True, if this |
| | | | | file has been | | | | | | file has been |
| | | | | archived since | | | | | | archived since |
| | | | | the time of | | | | | | the time of |
| | | | | last | | | | | | last |
skipping to change at page 37, line 16 skipping to change at page 46, line 37
| | | | | change the | | | | | | change the |
| | | | | times for a | | | | | | times for a |
| | | | | filesystem | | | | | | filesystem |
| | | | | object as | | | | | | object as |
| | | | | specified in a | | | | | | specified in a |
| | | | | SETATTR | | | | | | SETATTR |
| | | | | operation. | | | | | | operation. |
| case_insensitive | 16 | bool | READ | True, if | | case_insensitive | 16 | bool | READ | True, if |
| | | | | filename | | | | | | filename |
| | | | | comparisons on | | | | | | comparisons on |
| | | | | this filesystem | | | | | | this file |
| | | | | are case | | | | | | system are |
| | | | | case |
| | | | | insensitive. | | | | | | insensitive. |
| case_preserving | 17 | bool | READ | True, if | | case_preserving | 17 | bool | READ | True, if |
| | | | | filename case | | | | | | filename case |
| | | | | on this | | | | | | on this file |
| | | | | filesystem are | | | | | | system are |
| | | | | preserved. | | | | | | preserved. |
| chown_restricted | 18 | bool | READ | If TRUE, the | | chown_restricted | 18 | bool | READ | If TRUE, the |
| | | | | server will | | | | | | server will |
| | | | | reject any | | | | | | reject any |
| | | | | request to | | | | | | request to |
| | | | | change either | | | | | | change either |
| | | | | the owner or | | | | | | the owner or |
| | | | | the group | | | | | | the group |
| | | | | associated with | | | | | | associated |
| | | | | a file if the | | | | | | with a file if |
| | | | | caller is not a | | | | | | the caller is |
| | | | | privileged user | | | | | | not a |
| | | | | (for example, | | | | | | privileged |
| | | | | user (for |
| | | | | example, |
| | | | | "root" in UNIX | | | | | | "root" in UNIX |
| | | | | operating | | | | | | operating |
| | | | | environments or | | | | | | environments |
| | | | | in Windows 2000 | | | | | | or in Windows |
| | | | | the "Take | | | | | | 2000 the "Take |
| | | | | Ownership" | | | | | | Ownership" |
| | | | | privilege). | | | | | | privilege). |
| dir_notif_delay | 56 | nfstime4 | READ | notification | | dir_notif_delay | 56 | nfstime4 | READ | notification |
| | | | | delays on | | | | | | delays on |
| | | | | directory | | | | | | directory |
| | | | | attributes | | | | | | attributes |
| dirent_notif_delay | 57 | nfstime4 | READ | notification | | dirent_notif_delay | 57 | nfstime4 | READ | notification |
| | | | | delays on child | | | | | | delays on |
| | | | | child |
| | | | | attributes | | | | | | attributes |
| fileid | 20 | uint64 | READ | A number | | fileid | 20 | uint64 | READ | A number |
| | | | | uniquely | | | | | | uniquely |
| | | | | identifying the | | | | | | identifying |
| | | | | file within the | | | | | | the file |
| | | | | within the |
| | | | | filesystem. | | | | | | filesystem. |
| files_avail | 21 | uint64 | READ | File slots | | files_avail | 21 | uint64 | READ | File slots |
| | | | | available to | | | | | | available to |
| | | | | this user on | | | | | | this user on |
| | | | | the filesystem | | | | | | the file |
| | | | | containing this | | | | | | system |
| | | | | object - this | | | | | | containing |
| | | | | should be the | | | | | | this object - |
| | | | | smallest | | | | | | this should be |
| | | | | relevant limit. | | | | | | the smallest |
| files_free | 22 | uint64 | READ | Free file slots | | | | | | relevant |
| | | | | on the | | | | | | limit. |
| files_free | 22 | uint64 | READ | Free file |
| | | | | slots on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing this | | | | | | containing |
| | | | | object - this | | | | | | this object - |
| | | | | should be the | | | | | | this should be |
| | | | | smallest | | | | | | the smallest |
| | | | | relevant limit. | | | | | | relevant |
| | | | | limit. |
| files_total | 23 | uint64 | READ | Total file | | files_total | 23 | uint64 | READ | Total file |
| | | | | slots on the | | | | | | slots on the |
| | | | | filesystem | | | | | | filesystem |
| | | | | containing this | | | | | | containing |
| | | | | object. | | | | | | this object. |
| fs_absent | 60 | bool | READ | Is current | | fs_absent | 60 | bool | READ | Is current |
| | | | | filesystem | | | | | | filesystem |
| | | | | present or | | | | | | present or |
| | | | | absent. | | | | | | absent. |
| fs_layout_type | 62 | layouttype4 | READ | Layout types | | fs_layout_type | 62 | layouttype4 | READ | Layout types |
| | | | | available for | | | | | | available for |
| | | | | the filesystem. | | | | | | the file |
| fs_locations | 24 | fs_locations | READ | Locations where | | | | | | system. |
| | | | | this filesystem | | fs_locations | 24 | fs_locations | READ | Locations |
| | | | | where this |
| | | | | file system |
| | | | | may be found. | | | | | | may be found. |
| | | | | If the server | | | | | | If the server |
| | | | | returns | | | | | | returns |
| | | | | NFS4ERR_MOVED | | | | | | NFS4ERR_MOVED |
| | | | | as an error, | | | | | | as an error, |
| | | | | this attribute | | | | | | this attribute |
| | | | | MUST be | | | | | | MUST be |
| | | | | supported. | | | | | | supported. |
| fs_locations_info | 67 | | READ | Full function | | fs_locations_info | 67 | | READ | Full function |
| | | | | filesystem | | | | | | filesystem |
| | | | | location. | | | | | | location. |
| fs_status | 61 | fs4_status | READ | Generic | | fs_status | 61 | fs4_status | READ | Generic file |
| | | | | filesystem type | | | | | | system type |
| | | | | information. | | | | | | information. |
| hidden | 25 | bool | R/W | True, if the | | hidden | 25 | bool | R/W | True, if the |
| | | | | file is | | | | | | file is |
| | | | | considered | | | | | | considered |
| | | | | hidden with | | | | | | hidden with |
| | | | | respect to the | | | | | | respect to the |
| | | | | Windows API? | | | | | | Windows API? |
| homogeneous | 26 | bool | READ | True, if this | | homogeneous | 26 | bool | READ | True, if this |
| | | | | object's | | | | | | object's file |
| | | | | filesystem is | | | | | | system is |
| | | | | homogeneous, | | | | | | homogeneous, |
| | | | | i.e. are per | | | | | | i.e. are per |
| | | | | filesystem | | | | | | filesystem |
| | | | | attributes the | | | | | | attributes the |
| | | | | same for all | | | | | | same for all |
| | | | | filesystem's | | | | | | filesystem's |
| | | | | objects. | | | | | | objects. |
| layout_alignment | 66 | uint32_t | READ | Preferred | | layout_alignment | 66 | uint32_t | READ | Preferred |
| | | | | alignment for | | | | | | alignment for |
| | | | | layout related | | | | | | layout related |
| | | | | I/O. | | | | | | I/O. |
| layout_blksize | 65 | uint32_t | READ | Preferred block | | layout_blksize | 65 | uint32_t | READ | Preferred |
| | | | | size for layout | | | | | | block size for |
| | | | | related I/O. | | | | | | layout related |
| | | | | I/O. |
| layout_hint | 63 | layouthint4 | WRITE | Client | | layout_hint | 63 | layouthint4 | WRITE | Client |
| | | | | specified hint | | | | | | specified hint |
| | | | | for file | | | | | | for file |
| | | | | layout. | | | | | | layout. |
| layout_type | 64 | layouttype4 | READ | Layout types | | layout_type | 64 | layouttype4 | READ | Layout types |
| | | | | available for | | | | | | available for |
| | | | | the file. | | | | | | the file. |
| maxfilesize | 27 | uint64 | READ | Maximum | | maxfilesize | 27 | uint64 | READ | Maximum |
| | | | | supported file | | | | | | supported file |
| | | | | size for the | | | | | | size for the |
skipping to change at page 40, line 26 skipping to change at page 50, line 22
| | | | | writable. Lack | | | | | | writable. Lack |
| | | | | of this | | | | | | of this |
| | | | | attribute can | | | | | | attribute can |
| | | | | lead to the | | | | | | lead to the |
| | | | | client either | | | | | | client either |
| | | | | wasting | | | | | | wasting |
| | | | | bandwidth or | | | | | | bandwidth or |
| | | | | not receiving | | | | | | not receiving |
| | | | | the best | | | | | | the best |
|                    |    |               |        | performance.   |
| mdsthreshold       | 68 | mdsthreshold4 | READ   | Hint to client |
|                    |    |               |        | as to when to  |
|                    |    |               |        | write through  |
|                    |    |               |        | the pnfs       |
|                    |    |               |        | metadata       |
|                    |    |               |        | server.        |
| mimetype           | 32 | utf8<>        | R/W    | MIME body      |
|                    |    |               |        | type/subtype   |
|                    |    |               |        | of this        |
|                    |    |               |        | object.        |
| mode               | 33 | mode4         | R/W    | UNIX-style     |
|                    |    |               |        | mode and       |
|                    |    |               |        | permission     |
|                    |    |               |        | bits for this  |
|                    |    |               |        | object.        |
| mounted_on_fileid  | 55 | uint64        | READ   | Like fileid,   |
|                    |    |               |        | but if the     |
|                    |    |               |        | target         |
|                    |    |               |        | filehandle is  |
|                    |    |               |        | the root of a  |
|                    |    |               |        | filesystem     |
|                    |    |               |        | return the     |
|                    |    |               |        | fileid of the  |
|                    |    |               |        | underlying     |
|                    |    |               |        | directory.     |
| no_trunc           | 34 | bool          | READ   | True, if a     |
|                    |    |               |        | name longer    |
|                    |    |               |        | than name_max  |
|                    |    |               |        | is used, an    |
|                    |    |               |        | error will be  |
|                    |    |               |        | returned and   |
|                    |    |               |        | the name is    |
|                    |    |               |        | not truncated. |
| numlinks           | 35 | uint32        | READ   | Number of hard |
|                    |    |               |        | links to this  |
|                    |    |               |        | object.        |
| owner              | 36 | utf8<>        | R/W    | The string     |
|                    |    |               |        | name of the    |
|                    |    |               |        | owner of this  |
|                    |    |               |        | object.        |
| owner_group        | 37 | utf8<>        | R/W    | The string     |
|                    |    |               |        | name of the    |
|                    |    |               |        | group          |
|                    |    |               |        | ownership of   |
|                    |    |               |        | this object.   |
| quota_avail_hard   | 38 | uint64        | READ   | For definition |
|                    |    |               |        | see "Quota     |
|                    |    |               |        | Attributes"    |
|                    |    |               |        | section below. |
| quota_avail_soft   | 39 | uint64        | READ   | For definition |
|                    |    |               |        | see "Quota     |
|                    |    |               |        | Attributes"    |
|                    |    |               |        | section below. |
| quota_used         | 40 | uint64        | READ   | For definition |
|                    |    |               |        | see "Quota     |
|                    |    |               |        | Attributes"    |
|                    |    |               |        | section below. |
| rawdev             | 41 | specdata4     | READ   | Raw device     |
|                    |    |               |        | identifier.    |
|                    |    |               |        | UNIX device    |
|                    |    |               |        | major/minor    |
|                    |    |               |        | node           |
|                    |    |               |        | information.   |
|                    |    |               |        | If the value   |
|                    |    |               |        | of type is not |
|                    |    |               |        | NF4BLK or      |
|                    |    |               |        | NF4CHR, the    |
|                    |    |               |        | value returned |
|                    |    |               |        | SHOULD NOT be  |
|                    |    |               |        | considered     |
|                    |    |               |        | useful.        |
| recv_impl_id       | 59 | nfs_impl_id4  | READ   | Client obtains |
|                    |    |               |        | server         |
|                    |    |               |        | implementation |
|                    |    |               |        | via GETATTR.   |
| send_impl_id       | 58 | impl_ident4   | WRITE  | Client         |
|                    |    |               |        | provides       |
|                    |    |               |        | server with    |
|                    |    |               |        | implementation |
|                    |    |               |        | identity via   |
|                    |    |               |        | SETATTR.       |
| space_avail        | 42 | uint64        | READ   | Disk space in  |
|                    |    |               |        | bytes          |
|                    |    |               |        | available to   |
|                    |    |               |        | this user on   |
|                    |    |               |        | the file       |
|                    |    |               |        | system         |
|                    |    |               |        | containing     |
|                    |    |               |        | this object -  |
|                    |    |               |        | this should be |
|                    |    |               |        | the smallest   |
|                    |    |               |        | relevant       |
|                    |    |               |        | limit.         |
| space_free         | 43 | uint64        | READ   | Free disk      |
|                    |    |               |        | space in bytes |
|                    |    |               |        | on the file    |
|                    |    |               |        | system         |
|                    |    |               |        | containing     |
|                    |    |               |        | this object -  |
|                    |    |               |        | this should be |
|                    |    |               |        | the smallest   |
|                    |    |               |        | relevant       |
|                    |    |               |        | limit.         |
| space_total        | 44 | uint64        | READ   | Total disk     |
|                    |    |               |        | space in bytes |
|                    |    |               |        | on the file    |
|                    |    |               |        | system         |
|                    |    |               |        | containing     |
|                    |    |               |        | this object.   |
| space_used         | 45 | uint64        | READ   | Number of file |
|                    |    |               |        | system bytes   |
|                    |    |               |        | allocated to   |
|                    |    |               |        | this object.   |
| system             | 46 | bool          | R/W    | True, if this  |
|                    |    |               |        | file is a      |
|                    |    |               |        | "system" file  |
|                    |    |               |        | with respect   |
|                    |    |               |        | to the Windows |
|                    |    |               |        | API?           |
| time_access        | 47 | nfstime4      | READ   | The time of    |
|                    |    |               |        | last access to |
|                    |    |               |        | the object by  |
|                    |    |               |        | a read that    |
|                    |    |               |        | was satisfied  |
|                    |    |               |        | by the server. |
| time_access_set    | 48 | settime4      | WRITE  | Set the time   |
|                    |    |               |        | of last access |
|                    |    |               |        | to the object. |
|                    |    |               |        | SETATTR use    |
|                    |    |               |        | only.          |
| time_backup        | 49 | nfstime4      | R/W    | The time of    |
|                    |    |               |        | last backup of |
|                    |    |               |        | the object.    |
| time_create        | 50 | nfstime4      | R/W    | The time of    |
|                    |    |               |        | creation of    |
|                    |    |               |        | the object.    |
|                    |    |               |        | This attribute |
|                    |    |               |        | does not have  |
|                    |    |               |        | any relation   |
|                    |    |               |        | to the         |
|                    |    |               |        | traditional    |
|                    |    |               |        | UNIX file      |
|                    |    |               |        | attribute      |
|                    |    |               |        | "ctime" or     |
|                    |    |               |        | "change time". |
| time_delta         | 51 | nfstime4      | READ   | Smallest       |
|                    |    |               |        | useful server  |
|                    |    |               |        | time           |
|                    |    |               |        | granularity.   |
| time_metadata      | 52 | nfstime4      | READ   | The time of    |
|                    |    |               |        | last meta-data |
|                    |    |               |        | modification   |
|                    |    |               |        | of the object. |
| time_modify        | 53 | nfstime4      | READ   | The time of    |
|                    |    |               |        | last           |
|                    |    |               |        | modification   |
|                    |    |               |        | to the object. |
| time_modify_set    | 54 | settime4      | WRITE  | Set the time   |
|                    |    |               |        | of last        |
|                    |    |               |        | modification   |
|                    |    |               |        | to the object. |
|                    |    |               |        | SETATTR use    |
|                    |    |               |        | only.          |
+--------------------+----+---------------+--------+----------------+

5.7. Time Access

As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating
environment and/or the server's filesystem semantics. For example,
for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object. Of course, setting
the corresponding time_access_set attribute is another way to modify
the time_access attribute.

Whenever the file object resides on a writable file system, the
server should make best efforts to record time_access into stable
storage. However, to mitigate the performance effects of doing so,
and most especially whenever the server is satisfying the read of the
object's content from its cache, the server MAY cache access time
updates and lazily write them to stable storage. It is also
acceptable to give administrators of the server the option to disable
time_access updates.
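
The caching behavior permitted above might be sketched as follows.
This is a non-normative illustration only: the class AccessTimeCache,
its method names, and the dictionary standing in for stable storage
are hypothetical and do not correspond to any protocol element.

```python
import time


class AccessTimeCache:
    """Hypothetical server-side helper that buffers time_access updates."""

    def __init__(self, stable_storage, updates_enabled=True):
        self.stable_storage = stable_storage    # dict: fileid -> time_access
        self.updates_enabled = updates_enabled  # administrator's on/off switch
        self.pending = {}                       # updates not yet written out

    def record_read(self, fileid, now=None):
        """Note a READ, READLINK, or READDIR that the server satisfied."""
        if not self.updates_enabled:
            return
        self.pending[fileid] = time.time() if now is None else now

    def time_access(self, fileid):
        """Value reported to clients: a pending update wins over the stored one."""
        return self.pending.get(fileid, self.stable_storage.get(fileid))

    def flush(self):
        """Lazily write every pending update to stable storage."""
        self.stable_storage.update(self.pending)
        self.pending.clear()
```

A server built this way would call flush() periodically or at
shutdown; how often is an implementation choice that the text above
leaves open.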

5.8. Interpreting owner and owner_group

The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a
UTF-8 string. To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the
UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [26]
provides additional rationale. It is expected that the client and
server will have their own local representation of owner and
owner_group that is used for local storage or presentation to the end
user. Therefore, it is expected that, when these attributes are
transferred between the client and server, the local representation
is translated to a syntax of the form "user@dns_domain". This gives
a client and server that do not use the same local representation the
ability to translate to a common syntax that can be interpreted by
both.
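
As a non-normative illustration of the translation described above,
the sketch below maps a local numeric identifier to the
"user@dns_domain" form and back. The table LOCAL_USERS, the function
names, and the domain used in the usage note are all hypothetical; a
real implementation would consult its own name service.

```python
LOCAL_USERS = {1001: "alice", 1002: "bob"}  # hypothetical local uid -> name table


def uid_to_owner(uid, dns_domain, users=LOCAL_USERS):
    """Translate a local numeric uid to the "user@dns_domain" wire form."""
    return f"{users[uid]}@{dns_domain}"


def owner_to_uid(owner, dns_domain, users=LOCAL_USERS):
    """Translate an owner string to a local uid, or None if it cannot be mapped."""
    name, sep, domain = owner.rpartition("@")
    if not sep or domain != dns_domain:
        return None  # no "@" present, or a domain this host does not serve
    for uid, user in users.items():
        if user == name:
            return uid
    return None
```

For instance, uid_to_owner(1001, "example.org") yields
"alice@example.org", and owner_to_uid() returns None for a string
from a domain the local side does not recognize.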
skipping to change at page 46, line 5 skipping to change at page 56, line 21
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or owner
designated in this way. In that case, the client must use the
appropriate name@domain string and not the special form for
compatibility.

The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute.

5.9. Character Case Attributes

With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" RFC1345 [27] which may or may not include the word "CAPITAL"
or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see the
section "Internationalization".
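
The name-driven mapping described above can be illustrated with the
following non-normative sketch. It assumes that the character names
exposed by Python's unicodedata module are an acceptable stand-in for
the RFC1345 long descriptive names; the function names to_small and
names_match are hypothetical.

```python
import unicodedata


def to_small(ch):
    """Map a CAPITAL character to its SMALL counterpart using its name."""
    words = unicodedata.name(ch, "").split()
    if "CAPITAL" not in words:
        return ch  # unnamed, or not a capital: leave unchanged
    small_name = " ".join("SMALL" if w == "CAPITAL" else w for w in words)
    try:
        return unicodedata.lookup(small_name)
    except KeyError:
        return ch  # no character carries the corresponding SMALL name


def names_match(a, b):
    """Case-insensitive comparison of two component names."""
    return [to_small(c) for c in a] == [to_small(c) for c in b]
```

A production server would precompute the mapping into a table rather
than consult names on every comparison; the sketch only shows why the
CAPITAL/SMALL naming makes such a table unambiguous to build.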

5.10. Quota Attributes

For the attributes related to filesystem quotas, the following
definitions apply:

quota_avail_soft The value in bytes which represents the amount of
additional disk space that can be allocated to this file or
directory before the user may reasonably be warned. It is
understood that this space may be consumed by allocations to other
files or directories though there is a rule as to which other
files or directories.
skipping to change at page 46, line 47 skipping to change at page 57, line 18
meets at least the criterion that allocating space to any file or
directory in the set will reduce the "quota_avail_hard" of every
other file or directory in the set.

Note that there may be a number of distinct but overlapping sets
of files or directories for which a quota_used value is
maintained, e.g., "all files with a given owner", "all files with
a given group owner", etc.

The server is at liberty to choose any of those sets but should do
so in a repeatable way. The rule may be configured per file
system or may be "choose the set with the smallest quota".
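
The "choose the set with the smallest quota" rule might be realized
as in the following non-normative sketch. The QuotaSet structure, its
field names, the tie-break on the set name, and the computation of
quota_avail_hard as limit minus usage are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass
class QuotaSet:
    name: str        # hypothetical label, e.g. "owner:alice" or "group:staff"
    hard_limit: int  # bytes the set may hold in total
    used: int        # bytes currently charged to the set


def choose_quota_set(sets):
    """Pick the set with the smallest quota; ties are broken by name so
    that the same set is chosen every time."""
    return min(sets, key=lambda s: (s.hard_limit, s.name))


def quota_attrs(sets):
    """Derive the reported attribute values from the chosen set."""
    chosen = choose_quota_set(sets)
    return {
        "quota_used": chosen.used,
        "quota_avail_hard": max(0, chosen.hard_limit - chosen.used),
    }
```

Because the key does not depend on the order in which the sets are
presented, the choice is repeatable, which is the property the text
above asks for.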

5.11. mounted_on_fileid

UNIX-based operating environments connect a filesystem into the
namespace by connecting (mounting) the filesystem onto the existing
file object (the mount point, usually a directory) of an existing
filesystem. When the mount point's parent directory is read via an
API like readdir(), the returned results are directory entries, each
with a component name and a fileid. The fileid of the mount point's
directory entry will be different from the fileid that the stat()
system call returns. The stat() system call is returning the fileid
of the root of the mounted file system, whereas readdir() is
returning the fileid stat() would have returned before any file
systems were mounted on the mount point.

Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
to cross other filesystems. The client detects the filesystem
crossing whenever the filehandle argument of LOOKUP has an fsid
attribute different from that of the filehandle returned by LOOKUP.
A UNIX-based client will consider this a "mount point crossing".
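
The detection rule above amounts to comparing fsid values across each
LOOKUP. In the following non-normative sketch, an fsid is modeled as
a (major, minor) tuple and the function name and argument shapes are
hypothetical.

```python
def mount_point_crossings(start_fsid, lookups):
    """Return the component names at which the fsid changes.

    start_fsid -- fsid of the filehandle passed to the first LOOKUP
    lookups    -- list of (component_name, fsid_of_returned_filehandle)
    """
    crossings = []
    current = start_fsid
    for name, fsid in lookups:
        if fsid != current:
            crossings.append(name)  # argument and result fsids differ
        current = fsid              # the result feeds the next LOOKUP
    return crossings
```

Each name returned is a point where a UNIX-based client would treat
the component as a mount point.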

UNIX has a legacy scheme for allowing a process to determine its
current working directory. This relies on readdir() of a mount
point's parent and stat() of the mount point returning fileids as
previously described. The mounted_on_fileid attribute corresponds to
skipping to change at page 48, line 4 skipping to change at page 58, line 20
The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
provide it if possible, and for a UNIX-based server, this is
straightforward. Usually, mounted_on_fileid will be requested during
a READDIR operation, in which case it is trivial (at least for
UNIX-based servers) to return mounted_on_fileid since it is equal to
the fileid of a directory entry returned by readdir(). If
mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal
to the fileid of the file object's entry in the object's parent
directory, i.e., what readdir() would have returned. Some operating
environments allow a series of two or more file systems to be mounted
onto a single mount point. In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points.
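
The search for the base mount point can be sketched as below. This
is a non-normative model: the FsObject structure and its "covers"
link (from a mounted file system's root to the object it was mounted
on) are hypothetical bookkeeping, not protocol elements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FsObject:
    fileid: int
    # For the root of a mounted file system: the object it was mounted on.
    covers: Optional["FsObject"] = None


def mounted_on_fileid(obj):
    """Walk down through stacked mounts to the base mount point."""
    while obj.covers is not None:
        obj = obj.covers
    return obj.fileid
```

For an object that is not the root of a mounted file system, the loop
body never runs and the result is simply the object's own fileid,
which matches the table's description of the attribute.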

5.12. send_impl_id and recv_impl_id

These recommended attributes are used to identify the client and
server. In the case of the send_impl_id attribute, the client sends
its clientid4 value along with the nfs_impl_id4. The use of the
clientid4 value allows the server to identify and match specific
client interaction. In the case of the recv_impl_id attribute, the
client receives the nfs_impl_id4 value.

Access to this identification information can be most useful at both
client and server. Being able to identify specific implementations
skipping to change at page 48, line 39 skipping to change at page 59, line 8
the client and server might refuse to interoperate.

Because it is likely some implementations will violate the protocol
specification and interpret the identity information, implementations
MUST allow the users of the NFSv4 client and server to set the
contents of the sent nfs_impl_id structure to any value.

Even though these attributes are recommended, if the server supports
one of them it MUST support the other.

5.13. fs_layout_type

This attribute applies to a file system and indicates what layout
types are supported by the file system. We expect this attribute to
be queried when a client encounters a new fsid. This attribute is
used by the client to determine if it has applicable layout drivers.

5.14. layout_type

This attribute indicates the particular layout type(s) used for a
file. This is for informational purposes only. The client needs to
use the LAYOUTGET operation to get enough information (e.g.,
specific device information) to perform I/O.

5.15. layout_hint

This attribute may be set on newly created files to influence the
metadata server's choice for the file's layout. It is suggested that
this attribute be set as one of the initial attributes within the
OPEN call. The metadata server may ignore this attribute. This
attribute is a sub-set of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file. It is up to the server
implementation to determine which fields within the layout it uses.

5.16. mdsthreshold
This attribute acts as a hint to the client to help it determine when
it is more efficient to issue read and write requests to the metadata
server vs. the dataserver. Two types of thresholds are described:
file size thresholds and I/O size thresholds. If a file's size is
smaller than the file size threshold, data accesses should be issued
to the metadata server. If an I/O is below the I/O size threshold,
the I/O should be issued to the metadata server. Each threshold can
be specified independently for read and write requests. For either
threshold type, a value of 0 indicates no read or write should be
issued to the metadata server, while a value of all 1s indicates all
reads or writes should be issued to the metadata server.
The attribute is available on a per filehandle basis. If the current
filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the
filehandle's file system. It is suggested that this attribute is
queried as part of the OPEN operation. Due to dynamic system
changes, the client should not assume that the attribute will remain
constant for any specific time period, thus it should be periodically
refreshed.
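
A client's use of these hints might look like the following
non-normative sketch. Two points are assumptions rather than
statements of the text above: that the threshold fields are 64 bits
wide (so "all 1s" is 2**64 - 1), and that meeting either threshold is
sufficient to direct the I/O to the metadata server. The function
names are hypothetical.

```python
ALL_ONES = 2**64 - 1  # assumed width of a threshold field


def below(value, threshold):
    """Apply one threshold, honoring the two special values."""
    if threshold == 0:
        return False  # 0: never use the metadata server
    if threshold == ALL_ONES:
        return True   # all 1s: always use the metadata server
    return value < threshold


def use_metadata_server(file_size, io_size, size_threshold, io_threshold):
    """Decide where to send one read (or one write).

    The caller passes the read thresholds for a read and the write
    thresholds for a write, since each may be set independently.
    """
    return below(file_size, size_threshold) or below(io_size, io_threshold)
```

Since the attribute may change over time, a client following this
pattern would re-fetch the thresholds periodically rather than cache
the outcome of the decision.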
6. Access Control Lists

The NFS version 4 ACL attribute is an array of access control entries
(ACEs). Although the client can read and write the ACL attribute,
the server is responsible for using the ACL to perform access
control. The client can use the OPEN or ACCESS operations to check
access without modifying or reading data or metadata.

The NFS ACE attribute is defined as follows:

typedef uint32_t acetype4;
skipping to change at page 51, line 16 skipping to change at page 62, line 8
multiple modules that enforce ACLs. For example, the enforcement for
NFS version 4 access may be different from the enforcement for local
access, and both may be different from the enforcement for access
through other protocols such as SMB. So it may be useful for a
server to accept an ACL even if not all of its modules are able to
support it.

The guiding principle in all cases is that the server must not accept
ACLs that appear to make the file more secure than it really is.

6.1. ACE type

Type      Description
_____________________________________________________
ALLOW     Explicitly grants the access defined in
          acemask4 to the file or directory.

DENY      Explicitly denies the access defined in
          acemask4 to the file or directory.

AUDIT     LOG (system dependent) any access
skipping to change at page 52, line 23 skipping to change at page 63, line 12
NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP.

Example: suppose a server can enforce NFS ACLs for NFS access but
cannot enforce ACLs for local access. If arbitrary processes can run
on the server, then the server SHOULD NOT indicate ACL support. On
the other hand, if only trusted administrative programs run locally,
then the server may indicate ACL support.

6.2. ACE Access Mask

The access_mask field contains values based on the following:

ACE4_READ_DATA
   Operation(s) affected:
        READ
        OPEN
   Discussion:
        Permission to read the data of the file.
skipping to change at page 57, line 5 skipping to change at page 67, line 42
If a server receives a SETATTR request that it cannot accurately
implement, it should err in the direction of more restricted
access. For example, suppose a server cannot distinguish overwriting
data from appending new data, as described in the previous paragraph.
If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
not (or vice versa), the server should reject the request with
NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the
server may silently turn on the other bit, so that both APPEND_DATA
and WRITE_DATA are denied.
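
The rule above, for a server that cannot tell overwriting from
appending, might be sketched as follows. This is non-normative. The
numeric constants are the values recalled from the NFS version 4 XDR
definitions rather than from this section and should be checked
against them; the Nfs4Error class and the function name are
hypothetical.

```python
ACE4_WRITE_DATA = 0x00000002          # value assumed from the NFSv4 XDR
ACE4_APPEND_DATA = 0x00000004         # value assumed from the NFSv4 XDR
ACE4_ACCESS_DENIED_ACE_TYPE = 1       # the DENY type, value assumed likewise
NFS4ERR_ATTRNOTSUPP = 10032           # value assumed likewise


class Nfs4Error(Exception):
    """Hypothetical carrier for an NFSv4 status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def normalize_write_bits(ace_type, access_mask):
    """Return the mask to store, or raise if the ACE must be rejected."""
    both = ACE4_WRITE_DATA | ACE4_APPEND_DATA
    present = access_mask & both
    if present in (0, both):
        return access_mask            # nothing the server cannot express
    if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
        return access_mask | both     # deny more, never less
    raise Nfs4Error(NFS4ERR_ATTRNOTSUPP)
```

Widening is applied only to DENY entries because doing the same to an
ALLOW entry would grant access the client never asked for, which
would err in the wrong direction.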

6.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD

Two access mask bits govern the ability to delete a file or directory
object: ACE4_DELETE on the object itself, and ACE4_DELETE_CHILD on
the object's parent directory.

Many systems also consult the "sticky bit" (MODE4_SVTX) and write
mode bit on the parent directory when determining whether to allow a
file to be deleted. The mode bit for write corresponds to
ACE4_WRITE_DATA, which is the same physical bit as ACE4_ADD_FILE.
Therefore, ACE4_ADD_FILE can come into play when determining
skipping to change at page 58, line 5 skipping to change at page 68, line 38
            ACE4_WRITE_DATA is allowed by the target
            object ACL:
            allow delete
        else:
            deny delete
    else:
        allow delete
else:
    deny delete

6.3. ACE flag

The "flag" field contains values based on the following descriptions.

ACE4_FILE_INHERIT_ACE
   Can be placed on a directory and indicates that this ACE should be
   added to each new non-directory file created.

ACE4_DIRECTORY_INHERIT_ACE
   Can be placed on a directory and indicates that this ACE should be
   added to each new directory created.
skipping to change at page 59, line 46 skipping to change at page 70, line 32
For example, suppose a client tries to set an ACE with For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
server does not support any form of ACL inheritance, the server server does not support any form of ACL inheritance, the server
should reject the request with NFS4ERR_ATTRNOTSUPP. If the server should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
supports a single "inherit ACE" flag that applies to both files and supports a single "inherit ACE" flag that applies to both files and
directories, the server may reject the request (i.e., requiring the directories, the server may reject the request (i.e., requiring the
client to set both the file and directory inheritance flags). The client to set both the file and directory inheritance flags). The
server may also accept the request and silently turn on the server may also accept the request and silently turn on the
ACE4_DIRECTORY_INHERIT_ACE flag. ACE4_DIRECTORY_INHERIT_ACE flag.
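The three server behaviors in the example above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name, the "support" labels, the "strict" switch, and the exception class are all hypothetical, and the numeric flag values are assumed here because they are not listed in this section.

```python
# Assumed flag values; they are not listed in this section.
ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002
INHERIT_BOTH = ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE


class AceRejected(Exception):
    """Hypothetical exception standing in for an error reply."""


def apply_inherit_policy(flags, support, strict=False):
    """Return the flags the server would store, or raise AceRejected.

    support is a hypothetical label for the server's capability:
      "none"     - no ACL inheritance at all
      "single"   - one inherit flag covering files and directories
      "separate" - independent file and directory inherit flags
    """
    requested = flags & INHERIT_BOTH
    if requested == 0:
        return flags  # nothing inheritance-related was requested
    if support == "none":
        raise AceRejected("NFS4ERR_ATTRNOTSUPP")
    if support == "single" and requested != INHERIT_BOTH:
        if strict:
            # The text permits rejection here but names no error code.
            raise AceRejected("client must set both inheritance flags")
        # Alternatively, silently turn on the missing flag.
        return flags | INHERIT_BOTH
    return flags
```

A server with a single inherit flag that chooses the lenient path would turn ACE4_FILE_INHERIT_ACE alone into both flags; a server with separate flags stores the request unchanged.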
5.4. ACE who 6.4. ACE who
There are several special identifiers ("who") which need to be There are several special identifiers ("who") which need to be
understood universally, rather than in the context of a particular understood universally, rather than in the context of a particular
DNS domain. Some of these identifiers cannot be understood when an DNS domain. Some of these identifiers cannot be understood when an
NFS client accesses the server, but have meaning when a local process NFS client accesses the server, but have meaning when a local process
accesses the file. Displaying and modifying these accesses the file. Displaying and modifying these
permissions is permitted over NFS, even if none of the access methods permissions is permitted over NFS, even if none of the access methods
on the server understands the identifiers. on the server understands the identifiers.
Who Description Who Description
skipping to change at page 60, line 26 skipping to change at page 71, line 24
"BATCH" Accessed from a batch job. "BATCH" Accessed from a batch job.
"ANONYMOUS" Accessed without any authentication. "ANONYMOUS" Accessed without any authentication.
"AUTHENTICATED" Any authenticated user (opposite of "AUTHENTICATED" Any authenticated user (opposite of
ANONYMOUS) ANONYMOUS)
"SERVICE" Access from a system service. "SERVICE" Access from a system service.
To avoid conflict, these special identifiers are distinguished by an To avoid conflict, these special identifiers are distinguished by an
appended "@" and should appear in the form "xxxx@" (note: no domain appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@. name after the "@"). For example: ANONYMOUS@.
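As an illustration of the "xxxx@" form, the following sketch tests whether a "who" string is syntactically a special identifier. The function name is hypothetical, and the check is purely syntactic: it does not verify that the name before the "@" is one of the identifiers in the table.

```python
def is_special_who(who):
    """Return True if who has the form "xxxx@" with no domain part."""
    # Exactly one "@", at the end, with a non-empty name before it.
    return len(who) > 1 and who.endswith("@") and who.count("@") == 1
```

For example, "ANONYMOUS@" satisfies the check, while "alice@example.com" (a name within a DNS domain) does not.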
5.4.1. Discussion of EVERYONE@ 6.4.1. Discussion of EVERYONE@
It is important to note that "EVERYONE@" is not equivalent to the It is important to note that "EVERYONE@" is not equivalent to the
UNIX "other" entity. This is because, by definition, UNIX "other" UNIX "other" entity. This is because, by definition, UNIX "other"
does not include the owner or owning group of a file. "EVERYONE@" does not include the owner or owning group of a file. "EVERYONE@"
means literally everyone, including the owner or owning group. means literally everyone, including the owner or owning group.
5.4.2. Discussion of OWNER@ and GROUP@ 6.4.2. Discussion of OWNER@ and GROUP@
The ACL itself cannot be used to determine the owner and owning group The ACL itself cannot be used to determine the owner and owning group
of a file. This information should be indicated by the values of the of a file. This information should be indicated by the values of the
owner and owner_group file attributes returned by the server. owner and owner_group file attributes returned by the server.
5.5. Mode Attribute 6.5. Mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined: following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */ const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */ const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */ const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */ const MODE4_RUSR = 0x100; /* read permission: owner */
const MODE4_WUSR = 0x080; /* write permission: owner */ const MODE4_WUSR = 0x080; /* write permission: owner */
const MODE4_XUSR = 0x040; /* execute permission: owner */ const MODE4_XUSR = 0x040; /* execute permission: owner */
skipping to change at page 61, line 29 skipping to change at page 72, line 17
identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and
MODE4_XGRP apply to the principals identified in the owner_group MODE4_XGRP apply to the principals identified in the owner_group
attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any
principal that does not match that in the owner attribute, and does not principal that does not match that in the owner attribute, and does not
have a group matching that of the owner_group attribute. have a group matching that of the owner_group attribute.
The remaining bits are not defined by this protocol and MUST NOT be The remaining bits are not defined by this protocol and MUST NOT be
used. The minor version mechanism must be used to define further bit used. The minor version mechanism must be used to define further bit
usage. usage.
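The bit definitions can be exercised with a small decoder. This is a sketch only: the function name is hypothetical, and the group and other values (0x020 down to 0x001) are assumed to follow the conventional UNIX layout, since that part of the listing is not shown above.

```python
MODE4_BITS = [
    ("MODE4_SUID", 0x800),
    ("MODE4_SGID", 0x400),
    ("MODE4_SVTX", 0x200),
    ("MODE4_RUSR", 0x100),
    ("MODE4_WUSR", 0x080),
    ("MODE4_XUSR", 0x040),
    # Assumed values: conventional UNIX layout for group and other.
    ("MODE4_RGRP", 0x020),
    ("MODE4_WGRP", 0x010),
    ("MODE4_XGRP", 0x008),
    ("MODE4_ROTH", 0x004),
    ("MODE4_WOTH", 0x002),
    ("MODE4_XOTH", 0x001),
]
DEFINED_MASK = 0xFFF  # union of all twelve bits above


def describe_mode(mode):
    """List the names of the bits set in mode, highest bit first."""
    if mode & ~DEFINED_MASK:
        # The remaining bits are undefined and MUST NOT be used.
        raise ValueError("undefined mode bits set")
    return [name for name, bit in MODE4_BITS if mode & bit]
```

For instance, describe_mode(0o750) names the three owner bits followed by MODE4_RGRP and MODE4_XGRP.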
5.6. Interaction Between Mode and ACL Attributes 6.6. Interaction Between Mode and ACL Attributes
As defined, there is a certain amount of overlap between ACL and mode As defined, there is a certain amount of overlap between ACL and mode
file attributes. Even though there is overlap, ACLs don't contain file attributes. Even though there is overlap, ACLs don't contain
all the information specified by a mode and modes can't possibly all the information specified by a mode and modes can't possibly
contain all the information specified by an ACL. contain all the information specified by an ACL.
For servers that support both mode and ACL, the mode's MODE4_R*, For servers that support both mode and ACL, the mode's MODE4_R*,
MODE4_W* and MODE4_X* values should be computed from the ACL and MODE4_W* and MODE4_X* values should be computed from the ACL and
should be recomputed upon each SETATTR of ACL. Similarly, upon should be recomputed upon each SETATTR of ACL. Similarly, upon
SETATTR of mode, the ACL should be modified in order to allow the SETATTR of mode, the ACL should be modified in order to allow the
mode computed from the ACL to be the same as the mode given to mode computed from the ACL to be the same as the mode given to
SETATTR. The mode computed from any given ACL should be SETATTR. The mode computed from any given ACL should be
deterministic. This means that given an ACL, the same mode will deterministic. This means that given an ACL, the same mode will
always be computed. always be computed.
For servers that support ACL and not mode, clients may handle For servers that support ACL and not mode, clients may handle
applications which set and get the mode by creating the correct ACL applications which set and get the mode by creating the correct ACL
to send to the server and by computing the mode from the ACL, to send to the server and by computing the mode from the ACL,
respectively. In this case, the methods used by the server to keep respectively. In this case, the methods used by the server to keep
the mode in sync with the ACL can also be used by the client. These the mode in sync with the ACL can also be used by the client. These
methods are explained in sections Section 5.6.3 Section 5.6.1 and methods are explained in Section 6.6.3, Section 6.6.1, and
Section 5.6.2. Section 6.6.2.
Since the mode can't possibly represent all of the information that Since the mode can't possibly represent all of the information that
is defined by an ACL, there are some discrepancies to be aware of. is defined by an ACL, there are some discrepancies to be aware of.
As explained in the section "Deficiencies in a Mode Representation of As explained in the section "Deficiencies in a Mode Representation of
an ACL", the mode bits computed from the ACL could potentially convey an ACL", the mode bits computed from the ACL could potentially convey
more restrictive permissions than what would be granted via the ACL. more restrictive permissions than what would be granted via the ACL.
Because of this, clients are advised not to perform their own access Because of this, clients are advised not to perform their own access
checks based on the mode of a file. checks based on the mode of a file.
Because the mode attribute includes bits (i.e. MODE4_SUID, Because the mode attribute includes bits (i.e. MODE4_SUID,
MODE4_SGID, MODE4_SVTX) that have nothing to do with ACL semantics, MODE4_SGID, MODE4_SVTX) that have nothing to do with ACL semantics,
it is permitted for clients to specify both the ACL attribute and it is permitted for clients to specify both the ACL attribute and
mode in the same SETATTR operation. However, because there is no mode in the same SETATTR operation. However, because there is no
prescribed order for processing the attributes in a SETATTR, clients prescribed order for processing the attributes in a SETATTR, clients
may see differing results. For recommendations on how to achieve may see differing results. For recommendations on how to achieve
consistent behavior, see Section 5.6.4. consistent behavior, see Section 6.6.4.
5.6.1. Recomputing mode upon SETATTR of ACL 6.6.1. Recomputing mode upon SETATTR of ACL
Keeping the mode and ACL attributes synchronized is important, but as Keeping the mode and ACL attributes synchronized is important, but as
mentioned previously, the mode cannot possibly represent all of the mentioned previously, the mode cannot possibly represent all of the
information in the ACL. Still, the mode should be modified to information in the ACL. Still, the mode should be modified to
represent the access as accurately as possible. represent the access as accurately as possible.
The general algorithm to assign a new mode attribute to an object The general algorithm to assign a new mode attribute to an object
based on a new ACL being set is: based on a new ACL being set is:
1. Walk through the ACEs in order, looking for ACEs with a "who" 1. Walk through the ACEs in order, looking for ACEs with a "who"
skipping to change at page 65, line 34 skipping to change at page 76, line 22
if a.type is ALLOW { if a.type is ALLOW {
mode |= XOTH; mode |= XOTH;
} }
} }
} }
} }
} }
} }
return mode | (old_mode & (SUID | SGID | SVTX)) return mode | (old_mode & (SUID | SGID | SVTX))
5.6.2. Applying the mode given to CREATE or OPEN to an inherited ACL 6.6.2. Applying the mode given to CREATE or OPEN to an inherited ACL
The goal of implementing ACL inheritance is for newly created objects The goal of implementing ACL inheritance is for newly created objects
to inherit the ACLs they were intended to inherit, but without to inherit the ACLs they were intended to inherit, but without
disregarding the mode that is given with the arguments to the CREATE disregarding the mode that is given with the arguments to the CREATE
or OPEN operations. The general algorithm is as follows: or OPEN operations. The general algorithm is as follows:
1. Form an ACL on the newly created object that is the concatenation 1. Form an ACL on the newly created object that is the concatenation
of all inheritable ACEs from its parent directory. Note that of all inheritable ACEs from its parent directory. Note that
there may be zero inheritable ACEs; thus, an object may start there may be zero inheritable ACEs; thus, an object may start
with an empty ACL. with an empty ACL.
skipping to change at page 66, line 50 skipping to change at page 77, line 39
G. On the second ACE, if the type field is ALLOW, an G. On the second ACE, if the type field is ALLOW, an
implementation MAY clear the following mask bits: implementation MAY clear the following mask bits:
ACE4_WRITE_ACL ACE4_WRITE_ACL
ACE4_WRITE_OWNER ACE4_WRITE_OWNER
3. To ensure that the mode is honored, apply the algorithm for 3. To ensure that the mode is honored, apply the algorithm for
applying a mode to a file/directory with an existing ACL on the applying a mode to a file/directory with an existing ACL on the
new object as described in Section 5.6.3, using the mode that is new object as described in Section 6.6.3, using the mode that is
to be used for file creation. to be used for file creation.
5.6.3. Applying a Mode to an Existing ACL 6.6.3. Applying a Mode to an Existing ACL
An existing ACL can mean two things in this context. One, that a An existing ACL can mean two things in this context. One, that a
file/directory already exists and it has an ACL. Two, that a file/directory already exists and it has an ACL. Two, that a
directory has inheritable ACEs that will make up the ACL for any new directory has inheritable ACEs that will make up the ACL for any new
files or directories created therein. files or directories created therein.
The high-level goal of the behavior when a mode is set on a file with The high-level goal of the behavior when a mode is set on a file with
an existing ACL is to take the new mode into account, without needing an existing ACL is to take the new mode into account, without needing
to delete a pre-existing ACL. to delete a pre-existing ACL.
skipping to change at page 71, line 36 skipping to change at page 82, line 36
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A3 else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A3
If XGRP is set: set ACE4_EXECUTE in A4 If XGRP is set: set ACE4_EXECUTE in A4
else: set ACE4_EXECUTE in A3 else: set ACE4_EXECUTE in A3
If ROTH is set: set ACE4_READ_DATA in A6 If ROTH is set: set ACE4_READ_DATA in A6
else: set ACE4_READ_DATA in A5 else: set ACE4_READ_DATA in A5
If WOTH is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A6 If WOTH is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A6
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A5 else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A5
If XOTH is set: set ACE4_EXECUTE in A6 If XOTH is set: set ACE4_EXECUTE in A6
else: set ACE4_EXECUTE in A5 else: set ACE4_EXECUTE in A5
5.6.4. ACL and mode in the same SETATTR 6.6.4. ACL and mode in the same SETATTR
The only reason that a mode and ACL should be set in the same SETATTR The only reason that a mode and ACL should be set in the same SETATTR
is if the user wants to set the SUID, SGID and SVTX bits along with is if the user wants to set the SUID, SGID and SVTX bits along with
setting the permissions by means of an ACL. There is still no way to setting the permissions by means of an ACL. There is still no way to
enforce which order the attributes will be set in, and it is likely enforce which order the attributes will be set in, and it is likely
that different orders of operations will produce different results. that different orders of operations will produce different results.
5.6.4.1. Client Side Recommendations 6.6.4.1. Client Side Recommendations
If an application needs to enforce a certain behavior, it is If an application needs to enforce a certain behavior, it is
recommended that the client implementations set mode and ACL in recommended that the client implementations set mode and ACL in
separate SETATTR requests. This will produce consistent and expected separate SETATTR requests. This will produce consistent and expected
results. results.
If an application wants to set SUID, SGID and SVTX bits and an ACL: If an application wants to set SUID, SGID and SVTX bits and an ACL:
In the first SETATTR, set the mode with SUID, SGID and SVTX bits In the first SETATTR, set the mode with SUID, SGID and SVTX bits
as desired and all other bits with a value of 0. as desired and all other bits with a value of 0.
In a following SETATTR (preferably in the same COMPOUND) set the In a following SETATTR (preferably in the same COMPOUND) set the
ACL. ACL.
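The two-step client recommendation can be sketched as below. The representation of operations as (name, attributes) tuples and the function name are hypothetical; no real client API is implied.

```python
MODE4_SUID = 0x800
MODE4_SGID = 0x400
MODE4_SVTX = 0x200
SPECIAL_BITS = MODE4_SUID | MODE4_SGID | MODE4_SVTX


def build_setattr_sequence(desired_mode, acl):
    """Return two SETATTR operations, intended for one COMPOUND.

    The first sets only SUID/SGID/SVTX (all other mode bits are 0);
    the second sets the ACL, which then determines the permissions.
    """
    first = ("SETATTR", {"mode": desired_mode & SPECIAL_BITS})
    second = ("SETATTR", {"acl": acl})
    return [first, second]
```

Masking with SPECIAL_BITS drops any permission bits the application passed in, so the mode sent in the first SETATTR cannot disagree with the ACL sent in the second.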
5.6.4.2. Server Side Recommendations 6.6.4.2. Server Side Recommendations
If both mode and ACL are given to SETATTR, server implementations If both mode and ACL are given to SETATTR, server implementations
should verify that the mode and ACL don't conflict, i.e. the mode should verify that the mode and ACL don't conflict, i.e. the mode
computed from the given ACL must be the same as the given mode, computed from the given ACL must be the same as the given mode,
excluding the SUID, SGID and SVTX bits. The algorithm for assigning excluding the SUID, SGID and SVTX bits. The algorithm for assigning
a new mode based on the ACL can be used. (This is described in a new mode based on the ACL can be used. (This is described in
section Section 5.6.1.) If a server receives a request to set both Section 6.6.1.) If a server receives a request to set both mode and
mode and ACL, but the two conflict, the server should return ACL, but the two conflict, the server should return NFS4ERR_INVAL.
NFS4ERR_INVAL.
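A sketch of the conflict check follows. It assumes the caller has already computed a mode from the given ACL using the algorithm for recomputing the mode upon SETATTR of ACL; the function names and the string return values are hypothetical.

```python
SPECIAL_BITS = 0x800 | 0x400 | 0x200  # SUID, SGID, SVTX
PERMISSION_BITS = 0x1FF               # the nine read/write/execute bits


def mode_conflicts_with_acl(given_mode, mode_from_acl):
    """Compare the two modes, excluding the SUID, SGID and SVTX bits."""
    return (given_mode & PERMISSION_BITS) != (mode_from_acl & PERMISSION_BITS)


def check_setattr(given_mode, mode_from_acl):
    """Return the status a server following this recommendation would use."""
    if mode_conflicts_with_acl(given_mode, mode_from_acl):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

A given mode of 04750 against an ACL-derived mode of 0750 does not conflict, because the SUID bit is excluded from the comparison.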
5.6.5. Inheritance and turning it off 6.6.5. Inheritance and turning it off
The inheritance of access permissions may be problematic if a user The inheritance of access permissions may be problematic if a user
cannot prevent their file from inheriting unwanted permissions. For cannot prevent their file from inheriting unwanted permissions. For
example, a user, "bob", sets up a shared project directory to be used example, a user, "bob", sets up a shared project directory to be used
by everyone working on Project Foo. "alice" is a part of Project Foo, by everyone working on Project Foo. "alice" is a part of Project Foo,
but is working on something that should not be seen by anyone else. but is working on something that should not be seen by anyone else.
How can "alice" make sure that any new files that she creates in this How can "alice" make sure that any new files that she creates in this
shared project directory do not inherit anything that could shared project directory do not inherit anything that could
compromise the security of her work? compromise the security of her work?
skipping to change at page 72, line 44 skipping to change at page 83, line 43
servers is the question of how to communicate the fact that user servers is the question of how to communicate the fact that user
"alice" doesn't want any permissions to be inherited to her newly "alice" doesn't want any permissions to be inherited to her newly
created file or directory. created file or directory.
To do this, implementors should standardize on what the behavior of To do this, implementors should standardize on what the behavior of
CREATE and OPEN must be if: CREATE and OPEN must be if:
1. just mode is given 1. just mode is given
In this case, inheritance will take place, but the mode will be In this case, inheritance will take place, but the mode will be
applied to the inherited ACL as described in Section 5.6.1, applied to the inherited ACL as described in Section 6.6.1,
thereby modifying the ACL. thereby modifying the ACL.
2. just ACL is given 2. just ACL is given
In this case, inheritance will not take place, and the ACL as In this case, inheritance will not take place, and the ACL as
defined in the CREATE or OPEN will be set without modification. defined in the CREATE or OPEN will be set without modification.
3. both mode and ACL are given 3. both mode and ACL are given
In this case, implementors should verify that the mode and ACL In this case, implementors should verify that the mode and ACL
don't conflict, i.e. the mode computed from the given ACL must be don't conflict, i.e. the mode computed from the given ACL must be
the same as the given mode. The algorithm for assigning a new the same as the given mode. The algorithm for assigning a new
mode based on the ACL can be used. This is described in mode based on the ACL can be used. This is described in
Section 5.6.1. If a server receives a request to set both mode Section 6.6.1. If a server receives a request to set both mode
and ACL, but the two conflict, the server should return and ACL, but the two conflict, the server should return
NFS4ERR_INVAL. If the mode and ACL don't conflict, inheritance NFS4ERR_INVAL. If the mode and ACL don't conflict, inheritance
will not take place and both the mode and the ACL will be set will not take place and both the mode and the ACL will be set
without modification. without modification.
4. neither mode nor ACL is given 4. neither mode nor ACL is given
In this case, inheritance will take place and no modifications to In this case, inheritance will take place and no modifications to
the ACL will happen. It is worth noting that if no inheritable the ACL will happen. It is worth noting that if no inheritable
ACEs exist on the parent directory, the file will be created with ACEs exist on the parent directory, the file will be created with
an empty ACL, thus granting no accesses. an empty ACL, thus granting no accesses.
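The four cases above reduce to a small decision table. The sketch below only records the outcome of each case; the function name and the wording of the returned descriptions are hypothetical.

```python
def creation_policy(mode_given, acl_given):
    """Return (inheritance_takes_place, description) for CREATE or OPEN."""
    if mode_given and acl_given:
        # Case 3: verify first; a conflict yields NFS4ERR_INVAL.
        return (False, "verify mode against ACL, then set both unmodified")
    if mode_given:
        # Case 1: inherit, then apply the mode to the inherited ACL.
        return (True, "apply the mode to the inherited ACL")
    if acl_given:
        # Case 2: the supplied ACL replaces inheritance entirely.
        return (False, "set the given ACL without modification")
    # Case 4: inherit and leave the result untouched (possibly empty).
    return (True, "leave the inherited ACL unmodified")
```

Inheritance is suppressed exactly when an ACL is supplied, which gives a user such as "alice" a way to opt out of inherited permissions.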
5.6.6. Deficiencies in a Mode Representation of an ACL 6.6.6. Deficiencies in a Mode Representation of an ACL
In the presence of an ACL, there are certain cases when the In the presence of an ACL, there are certain cases when the
representation of the mode is not guaranteed to be accurate. An representation of the mode is not guaranteed to be accurate. An
example of such a situation is detailed below. example of such a situation is detailed below.
As mentioned in Section 5.6, the representation of the mode is As mentioned in Section 6.6, the representation of the mode is
deterministic, but not guaranteed to be accurate. The mode bits deterministic, but not guaranteed to be accurate. The mode bits
potentially convey a more restrictive permission than what will potentially convey a more restrictive permission than what will
actually be granted via the ACL. actually be granted via the ACL.
Given the following ACL of two ACEs: Given the following ACL of two ACEs:
GROUP@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE: GROUP@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE:
ACE4_IDENTIFIER_GROUP:ALLOW ACE4_IDENTIFIER_GROUP:ALLOW
EVERYONE@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE::DENY EVERYONE@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE::DENY
skipping to change at page 74, line 25 skipping to change at page 85, line 23
longer in group "staff". User "bob" logs in to the system again, and longer in group "staff". User "bob" logs in to the system again, and
thus more processes are created, this time owned by "bob" but NOT in thus more processes are created, this time owned by "bob" but NOT in
group "staff". group "staff".
A mode of 0770 is inaccurate for processes not belonging to group A mode of 0770 is inaccurate for processes not belonging to group
"staff". But even if the mode of the file were proactively changed "staff". But even if the mode of the file were proactively changed
to 0070 at the time the group database was edited, mode 0070 would be to 0070 at the time the group database was edited, mode 0070 would be
inaccurate for the pre-existing processes owned by user "bob" and inaccurate for the pre-existing processes owned by user "bob" and
having membership in group "staff". having membership in group "staff".
6. Single-server Name Space 7. Single-server Name Space
This chapter describes the NFSv4 single-server name space. Single- This chapter describes the NFSv4 single-server name space. Single-
server namespaces may be presented directly to clients, or they may server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site- be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described wide or organization-wide) to be presented to clients, as described
in Section 12. in Section 13.
6.1. Server Exports 7.1. Server Exports
On a UNIX server, the name space describes all the files reachable by On a UNIX server, the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped the name space constitutes all the files on disks named by mapped
disk letters. NFS server administrators rarely make the entire disk letters. NFS server administrators rarely make the entire
server's filesystem name space available to NFS clients. More often server's filesystem name space available to NFS clients. More often
portions of the name space are made available via an "export" portions of the name space are made available via an "export"
feature. In previous versions of the NFS protocol, the root feature. In previous versions of the NFS protocol, the root
filehandle for each export is obtained through the MOUNT protocol; filehandle for each export is obtained through the MOUNT protocol;
the client sends a string that identifies the export of name space the client sends a string that identifies the export of name space
and the server returns the root filehandle for it. The MOUNT and the server returns the root filehandle for it. The MOUNT
protocol supports an EXPORTS procedure that will enumerate the protocol supports an EXPORTS procedure that will enumerate the
server's exports. server's exports.
6.2. Browsing Exports 7.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for the exports of a particular server, can use to obtain filehandles for the exports of a particular server,
via a series of LOOKUP operations within a COMPOUND, to traverse a via a series of LOOKUP operations within a COMPOUND, to traverse a
path. A common user experience is to use a graphical user interface path. A common user experience is to use a graphical user interface
(perhaps a file "Open" dialog window) to find a file via progressive (perhaps a file "Open" dialog window) to find a file via progressive
browsing through a directory tree. The client must be able to move browsing through a directory tree. The client must be able to move
from one export to another export via single-component, progressive from one export to another export via single-component, progressive
LOOKUP operations. LOOKUP operations.
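The path traversal described above can be sketched as the construction of an operation list for one COMPOUND. The tuple representation and the function name are hypothetical. PUTROOTFH and GETFH are not mentioned in this section; they are assumed here as the operations that select the root filehandle and return the final one.

```python
def path_to_compound(path):
    """Build the operation list that walks path from the root filehandle."""
    ops = [("PUTROOTFH",)]
    # One single-component LOOKUP per path element; empty elements
    # (leading, trailing or doubled slashes) are skipped.
    for component in path.split("/"):
        if component:
            ops.append(("LOOKUP", component))
    ops.append(("GETFH",))
    return ops
```

Because each LOOKUP names a single component, the same sequence works whether or not the path crosses from one export into another.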
This style of browsing is not well supported by the NFS version 2 and This style of browsing is not well supported by the NFS version 2 and
3 protocols. The client expects all LOOKUP operations to remain 3 protocols. The client expects all LOOKUP operations to remain
within a single server filesystem. For example, the device attribute within a single server file system. For example, the device
will not change. This prevents a client from taking name space paths attribute will not change. This prevents a client from taking name
that span exports. space paths that span exports.
An automounter on the client can obtain a snapshot of the server's An automounter on the client can obtain a snapshot of the server's
name space using the EXPORTS procedure of the MOUNT protocol. If it name space using the EXPORTS procedure of the MOUNT protocol. If it
understands the server's pathname syntax, it can create an image of understands the server's pathname syntax, it can create an image of
the server's name space on the client. The parts of the name space the server's name space on the client. The parts of the name space
that are not exported by the server are filled in with a "pseudo that are not exported by the server are filled in with a "pseudo file
filesystem" that allows the user to browse from one mounted system" that allows the user to browse from one mounted file system
filesystem to another. There is a drawback to this representation of to another. There is a drawback to this representation of the
the server's name space on the client: it is static. If the server server's name space on the client: it is static. If the server
administrator adds a new export the client will be unaware of it. administrator adds a new export the client will be unaware of it.
6.3. Server Pseudo Filesystem 7.3. Server Pseudo File System
NFS version 4 servers avoid this name space inconsistency by NFS version 4 servers avoid this name space inconsistency by
presenting all the exports for a given server within the framework of presenting all the exports for a given server within the framework of
a single namespace, for that server. An NFS version 4 client uses a single namespace, for that server. An NFS version 4 client uses
LOOKUP and READDIR operations to browse seamlessly from one export to LOOKUP and READDIR operations to browse seamlessly from one export to
another. Portions of the server name space that are not exported are another. Portions of the server name space that are not exported are
bridged via a "pseudo filesystem" that provides a view of exported bridged via a "pseudo filesystem" that provides a view of exported
directories only. A pseudo filesystem has a unique fsid and behaves directories only. A pseudo filesystem has a unique fsid and behaves
like a normal, read-only filesystem. like a normal, read-only filesystem.
skipping to change at page 76, line 5 skipping to change at page 86, line 48
that multiple pseudo filesystems may exist. For example, that multiple pseudo filesystems may exist. For example,
/a pseudo filesystem /a pseudo filesystem
/a/b real filesystem /a/b real filesystem
/a/b/c pseudo filesystem /a/b/c pseudo filesystem
/a/b/c/d real filesystem /a/b/c/d real filesystem
Each of the pseudo filesystems is considered a separate entity and Each of the pseudo filesystems is considered a separate entity and
therefore will have its own unique fsid. therefore will have its own unique fsid.
6.4. Multiple Roots 7.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as The DOS and Windows operating environments are sometimes described as
having "multiple roots". Filesystems are commonly represented as having "multiple roots". File Systems are commonly represented as
disk letters. MacOS represents filesystems as top level names. NFS disk letters. MacOS represents filesystems as top level names. NFS
version 4 servers for these platforms can construct a pseudo file version 4 servers for these platforms can construct a pseudo file
system above these root names so that disk letters or volume names system above these root names so that disk letters or volume names
are simply directory names in the pseudo root. are simply directory names in the pseudo root.
6.5. Filehandle Volatility 7.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical The nature of the server's pseudo filesystem is that it is a logical
representation of filesystem(s) available from the server. representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed Therefore, the pseudo filesystem is most likely constructed
dynamically when the server is first instantiated. It is expected dynamically when the server is first instantiated. It is expected
that the pseudo filesystem may not have an on disk counterpart from that the pseudo filesystem may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the preferable that the server provide persistent filehandles for the
pseudo filesystem, the NFS client should expect that pseudo file pseudo filesystem, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g. with a series of LOOKUP prepared to recover a filehandle value (e.g. with a series of LOOKUP
operations) when receiving an error of NFS4ERR_FHEXPIRED. operations) when receiving an error of NFS4ERR_FHEXPIRED.
6.6. Exported Root 7.6. Exported Root
If the server's root filesystem is exported, one might conclude that If the server's root filesystem is exported, one might conclude that
a pseudo-filesystem is unneeded. This is not necessarily so. Assume a pseudo-filesystem is unneeded. This is not necessarily so. Assume
the following filesystems on a server: the following filesystems on a server:
/ disk1 (exported) / disk1 (exported)
/a disk2 (not exported) /a disk2 (not exported)
/a/b disk3 (exported) /a/b disk3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple Because disk2 is not exported, disk3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-filesystem. LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
6.7. Mount Point Crossing 7.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way The server filesystem environment may be constructed in such a way
that one filesystem contains a directory which is 'covered' or that one filesystem contains a directory which is 'covered' or
mounted upon by a second filesystem. For example: mounted upon by a second filesystem. For example:
/a/b (filesystem 1) /a/b (filesystem 1)
/a/b/c/d (filesystem 2) /a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look The pseudo filesystem for this server may be constructed to look
like: like:
/ (place holder/not exported) / (place holder/not exported)
/a/b (filesystem 1) /a/b (filesystem 1)
/a/b/c/d (filesystem 2) /a/b/c/d (filesystem 2)
It is the server's responsibility to present a pseudo filesystem It is the server's responsibility to present a pseudo filesystem
that is complete to the client. If the client sends a lookup request that is complete to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of for the path "/a/b/c/d", the server's response is the filehandle of
the filesystem "/a/b/c/d". In previous versions of the NFS protocol, the file system "/a/b/c/d". In previous versions of the NFS
the server would respond with the filehandle of directory "/a/b/c/d" protocol, the server would respond with the filehandle of directory
within the filesystem "/a/b". "/a/b/c/d" within the file system "/a/b".
The NFS client will be able to determine if it crosses a server mount The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute. point by a change in the value of the "fsid" attribute.
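As an illustration of the fsid check described above, the following is a minimal client-side sketch. It assumes the client has already fetched the "fsid" attribute for each component during a series of LOOKUPs; the names Fsid and find_crossings are hypothetical and do not come from the protocol.

```python
from typing import List, NamedTuple, Tuple


class Fsid(NamedTuple):
    """Two-part filesystem identifier, as carried in the "fsid" attribute."""
    major: int
    minor: int


def find_crossings(walk: List[Tuple[str, Fsid]]) -> List[str]:
    """Return the path components whose fsid differs from the previous one.

    `walk` lists (component name, fsid) pairs in LOOKUP order, starting
    at the root of the server's name space.
    """
    crossings = []
    previous = None
    for name, fsid in walk:
        # A change in fsid between parent and child marks a crossing.
        if previous is not None and fsid != previous:
            crossings.append(name)
        previous = fsid
    return crossings
```

For the example name space above, a walk of /, a, b, c, d would report crossings at "b" (pseudo filesystem into filesystem 1) and at "d" (filesystem 1 into filesystem 2), provided the server assigns each a distinct fsid.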
6.8. Security Policy and Name Space Presentation 7.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully The application of the server's security policy needs to be carefully
considered by the implementor. One may choose to limit the considered by the implementor. One may choose to limit the
viewability of portions of the pseudo filesystem based on the viewability of portions of the pseudo filesystem based on the
server's perception of the client's ability to authenticate itself server's perception of the client's ability to authenticate itself
properly. However, with the support of multiple security mechanisms properly. However, with the support of multiple security mechanisms
and the ability to negotiate the appropriate use of these mechanisms, and the ability to negotiate the appropriate use of these mechanisms,
the server is unable to properly determine if a client will be able the server is unable to properly determine if a client will be able
to authenticate itself. If, based on its policies, the server to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo filesystem, the server chooses to limit the contents of the pseudo filesystem, the server
skipping to change at page 77, line 42 skipping to change at page 88, line 38
have legitimate access. have legitimate access.
As suggested practice, the server should apply the security policy of As suggested practice, the server should apply the security policy of
a shared resource in the server's namespace to the components of the a shared resource in the server's namespace to the components of the
resource's ancestors. For example: resource's ancestors. For example:
/ /
/a/b /a/b
/a/b/c /a/b/c
The /a/b/c directory is a real filesystem and is the shared resource. The /a/b/c directory is a real file system and is the shared
The security policy for /a/b/c is Kerberos with integrity. The resource. The security policy for /a/b/c is Kerberos with integrity.
server should apply the same security policy to /, /a, and /a/b. The server should apply the same security policy to /, /a, and /a/b.
This allows for the extension of the protection of the server's This allows for the extension of the protection of the server's
namespace to the ancestors of the real shared resource. namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of server's namespace should be the union of all security mechanisms of
all direct descendants. all direct descendants.
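The suggested practice can be sketched as follows. This is only an illustration under stated assumptions: exports are given as a mapping from path to a set of security flavor names, the flavor strings are placeholders, and the helper names ancestors and pseudo_fs_security are hypothetical. Accumulating over every exported descendant gives the same result as repeatedly taking the union over direct descendants when the intermediate nodes are pure place holders.

```python
import posixpath
from typing import Dict, List, Set


def ancestors(path: str) -> List[str]:
    """Return every ancestor of `path`, nearest first, ending at "/"."""
    result = []
    while path != "/":
        path = posixpath.dirname(path)
        result.append(path)
    return result


def pseudo_fs_security(exports: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Compute the flavors to accept on each ancestor of the exports.

    Each ancestor receives the union of the flavors of all shared
    resources found beneath it.
    """
    policy: Dict[str, Set[str]] = {}
    for path, flavors in exports.items():
        for parent in ancestors(path):
            policy.setdefault(parent, set()).update(flavors)
    return policy
```

With a single export /a/b/c using Kerberos with integrity, the nodes /, /a, and /a/b all receive that one flavor, matching the example above.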
7. File Locking and Share Reservations 8. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of share reservations the protocol stateful. With the inclusion of such features as share reservations,
becomes substantially more dependent on state than the traditional file and directory delegations, recallable layouts, and support for
combination of NFS and NLM [XNFS]. There are three components to mandatory byte-range locking the protocol becomes substantially more
making this state manageable: dependent on state than the traditional combination of NFS and NLM
[XNFS]. There are three components to making this state manageable:
o Clear division between client and server o Clear division between client and server
o Ability to reliably detect inconsistency in state between client o Ability to reliably detect inconsistency in state between client
and server and server
o Simple and robust recovery mechanisms o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client In this model, the server owns the state information. The client
communicates its view of this state to the server as needed. The requests changes in locks and the server responds with the changes
client is also able to detect inconsistent state before modifying a made. Non-client-initiated changes in locking state are infrequent
file. and the client receives prompt notification of them and can adjust
his view of the locking state to reflect the server's changes.
To support Win32 share reservations it is necessary to atomically To support Win32 share reservations it is necessary to provide
OPEN or CREATE files. Having a separate share/unshare operation operations which atomically OPEN or CREATE files. Having a separate
would not allow correct implementation of the Win32 OpenFile API. In share/unshare operation would not allow correct implementation of the
order to correctly implement share semantics, the previous NFS Win32 OpenFile API. In order to correctly implement share semantics,
protocol mechanisms used when a file is opened or created (LOOKUP, the previous NFS protocol mechanisms used when a file is opened or
CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS
an OPEN operation that subsumes the NFS version 3 methodology of version 4.1 protocol defines an OPEN operation which looks up or creates
LOOKUP, CREATE, and ACCESS. However, because many operations require a file and establishes locking state on the server.
a filehandle, the traditional LOOKUP is preserved to map a file name
to filehandle without establishing state on the server. The policy
of granting access or modifying files is managed by the server based
on the client's state. These mechanisms can implement policy ranging
from advisory only locking to full mandatory locking.
7.1. Locking 8.1. Locking
It is assumed that manipulating a lock is rare when compared to READ It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock information required to establish a lock and uniquely define the lock
owner. owner.
The following sections describe the transition from the heavyweight The following sections describe the transition from the heavyweight
information to the eventual stateid used for most client and server information to the eventual lightweight stateid used for most client
locking and lease interactions. and server locking interactions.
7.1.1. Client ID 8.1.1. Client ID
For each LOCK request, the client must identify itself to the server. For each operation that obtains or depends on locking state, the
This is done in such a way as to allow for correct lock specific client must be determinable by the server. In NFSv4, each
identification and crash recovery. A sequence of a SETCLIENTID distinct client instance is represented by a clientid, which is a 64-
operation followed by a SETCLIENTID_CONFIRM operation is required to bit identifier that identifies a specific client at a given time and
establish the identification onto the server. Establishment of which is changed whenever the client or the server re-initializes.
identification by a new incarnation of the client also has the effect Clientid's are used to support lock identification and crash
of immediately breaking any leased state that a previous incarnation recovery.
of the client might have had on the server, as opposed to forcing the
new client incarnation to wait for the leases to expire. Breaking In NFSv4.1, the clientid associated with each operation is derived
the lease state amounts to the server removing all lock, share from the session on which the operation is issued. Each session is
reservation, and, where the server is not supporting the associated with a specific clientid at session creation and that
CLAIM_DELEGATE_PREV claim type, all delegation state associated with clientid then becomes the clientid associated with all requests
same client with the same identity. For discussion of delegation issued using it.
state recovery, see the section "Delegation Recovery".
A sequence of a CREATE_CLIENTID operation followed by a
CREATE_SESSION operation using that clientid is required to establish
the identification on the server. Establishment of identification by
a new incarnation of the client also has the effect of immediately
releasing any locking state that a previous incarnation of that same
client might have had on the server. Such released state would
include all lock, share reservation, and, where the server is not
supporting the CLAIM_DELEGATE_PREV claim type, all delegation state
associated with same client with the same identity. For discussion
of delegation state recovery, see the section "Delegation Recovery".
Releasing such state requires that the server be able to determine
that one client instance is the successor of another. Where this
cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5)
and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests.
Client identification is encapsulated in the following structure: Client identification is encapsulated in the following structure:
struct nfs_client_id4 { struct nfs_client_id4 {
verifier4 verifier; verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>; opaque id<NFS4_OPAQUE_LIMIT>;
}; };
The first field, verifier is a client incarnation verifier that is The first field, verifier, is a client incarnation verifier that is
used to detect client reboots. Only if the verifier is different used to detect client reboots. Only if the verifier is different
from that the server has previously recorded for the client (as from that the server had previously recorded for the client (as
identified by the second field of the structure, id) does the server identified by the second field of the structure, id) does the server
start the process of canceling the client's leased state. start the process of canceling the client's leased state.
The second field, id, is a variable length string that uniquely The second field, id, is a variable length string that uniquely
defines the client. defines the client so that subsequent instances of the same client
bear the same id with a different verifier.
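The way a server might use the two fields of nfs_client_id4 can be sketched as below. This is a simplified model, not the protocol's specified procedure: the class and method names are hypothetical, locking state is reduced to a list of strings, and the principal check and session establishment are deliberately omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ClientRecord:
    verifier: bytes
    clientid: int
    state: List[str] = field(default_factory=list)


class ClientTable:
    """Hypothetical server-side table keyed by the client's id string."""

    def __init__(self) -> None:
        self.by_id: Dict[bytes, ClientRecord] = {}
        self._next_clientid = 1

    def create_clientid(self, verifier: bytes,
                        id_string: bytes) -> Tuple[int, List[str]]:
        """Return (clientid, state released from a previous incarnation)."""
        record = self.by_id.get(id_string)
        if record is not None and record.verifier == verifier:
            # Same id and same verifier: same incarnation, keep its state.
            return record.clientid, []
        # New client, or same id with a different verifier (a reboot):
        # any state held by the earlier incarnation is released.
        released = record.state if record is not None else []
        clientid = self._next_clientid
        self._next_clientid += 1
        self.by_id[id_string] = ClientRecord(verifier, clientid)
        return clientid, released
```

The point of the sketch is that state is released only when the id matches an existing record and the verifier differs; a matching verifier leaves the existing state untouched.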
There are several considerations for how the client generates the id There are several considerations for how the client generates the id
string: string:
o The string should be unique so that multiple clients do not o The string should be unique so that multiple clients do not
present the same string. The consequences of two clients present the same string. The consequences of two clients
presenting the same string range from one client getting an error presenting the same string range from one client getting an error
to one client having its leased state abruptly and unexpectedly to one client having its leased state abruptly and unexpectedly
canceled. canceled.
skipping to change at page 80, line 11 skipping to change at page 91, line 30
string. The implementor is cautioned from an approach that string. The implementor is cautioned from an approach that
requires the string to be recorded in a local file because this requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version there is no local disk and all file access is from an NFS version
4 server. 4 server.
o The string should be different for each server network address o The string should be different for each server network address
that the client accesses, rather than common to all server network that the client accesses, rather than common to all server network
addresses. The reason is that it may not be possible for the addresses. The reason is that it may not be possible for the
client to tell if the same server is listening on multiple network client to tell if the same server is listening on multiple network
addresses. If the client issues SETCLIENTID with the same id addresses. If the client issues CREATE_CLIENTID with the same id
string to each network address of such a server, the server will string to each network address of such a server, the server will
think it is the same client, and each successive SETCLIENTID will think it is the same client, and each successive CREATE_CLIENTID
cause the server to begin the process of removing the client's will cause the server to remove the client's previous leased state.
previous leased state.
o The algorithm for generating the string should not assume that the o The algorithm for generating the string should not assume that the
client's network address won't change. This includes changes client's network address won't change. This includes changes
between client incarnations and even changes while the client is between client incarnations and even changes while the client is
stilling running in its current incarnation. This means that if still running in its current incarnation. This means that if the
the client includes just the client's and server's network address client includes just the client's and server's network address in
in the id string, there is a real risk, after the client gives up the id string, there is a real risk, after the client gives up the
the network address, that another client, using a similar network address, that another client, using a similar algorithm
algorithm for generating the id string, will generate a for generating the id string, would generate a conflicting id
conflicting id string. string.
o Given the above considerations, an example of a well generated id Given the above considerations, an example of a well generated id
string is one that includes: string is one that includes:
o The server's network address. o The server's network address.
o The client's network address. o The client's network address.
o For a user level NFS version 4 client, it should contain o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user additional information to distinguish the client from other user
level clients running on the same host, such as a process id or level clients running on the same host, such as a process id or
other unique sequence. other unique sequence.
skipping to change at page 81, line 12 skipping to change at page 92, line 31
stored in a file, because the file might only be accessible stored in a file, because the file might only be accessible
over NFS version 4). over NFS version 4).
* A true random number. However since this number ought to be * A true random number. However since this number ought to be
the same between client incarnations, this shares the same the same between client incarnations, this shares the same
problem as that of using the timestamp of the software problem as that of using the timestamp of the software
installation. installation.
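One way to combine the components listed above into an id string is sketched here. The function name, the NUL separator, and the parameter names are illustrative assumptions; the value 1024 for NFS4_OPAQUE_LIMIT is also an assumption and should be checked against the XDR definition. The optional third argument stands in for whatever additional distinguishing information the client has, such as a process id for a user level client.

```python
NFS4_OPAQUE_LIMIT = 1024  # assumed value of the XDR constant


def build_id_string(server_addr: str, client_addr: str,
                    extra: str = "") -> bytes:
    """Build an opaque id string that differs per server network address.

    The fields are joined with NUL bytes so that the boundary between
    one field and the next cannot be confused with field contents.
    """
    fields = [server_addr, client_addr]
    if extra:
        fields.append(extra)
    encoded = "\0".join(fields).encode("utf-8")
    if len(encoded) > NFS4_OPAQUE_LIMIT:
        raise ValueError("id string exceeds NFS4_OPAQUE_LIMIT")
    return encoded
```

Because the server address is part of the string, the same client presents different id strings to different server addresses, while repeated calls with the same inputs yield the same string across client incarnations.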
As a security measure, the server MUST NOT cancel a client's leased As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given id string is state if the principal that established the state for a given id string is
not the same as the principal issuing the SETCLIENTID. not the same as the principal issuing the CREATE_CLIENTID.
Note that SETCLIENTID and SETCLIENTID_CONFIRM has a secondary purpose A server may compare an nfs_client_id4 in a CREATE_CLIENTID with an
of establishing the information the server needs to make callbacks to nfs_client_id4 established using SETCLIENTID using NFSv4 minor
the client for purpose of supporting delegations. It is permitted to version 0, so that an NFSv4.1 client is not forced to delay until
change this information via SETCLIENTID and SETCLIENTID_CONFIRM lease expiration for locking state established by the earlier client
within the same incarnation of the client without removing the using minor version 0.
client's leased state.
Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully Once a CREATE_CLIENTID has been done, and the resulting clientid
completed, the client uses the short hand client identifier, of type established as associated with a session, all requests made on that
clientid4, instead of the longer and less compact nfs_client_id4 session implicitly identify that clientid, which in turn designates
structure. This short hand client identifier (a clientid) is the client specified using the long-form nfs_client_id4 structure.
assigned by the server and should be chosen so that it will not The shorthand client identifier (a clientid) is assigned by the
conflict with a clientid previously assigned by the server. This server and should be chosen so that it will not conflict with a
applies across server restarts or reboots. When a clientid is clientid previously assigned by the server. This applies across
presented to a server and that clientid is not recognized, as would server restarts or reboots.
happen after a server reboot, the server will reject the request with
the error NFS4ERR_STALE_CLIENTID. When this happens, the client must
obtain a new clientid by use of the SETCLIENTID operation and then
proceed to any other necessary recovery for the server reboot case
(See the section "Server Failure and Recovery").
The client must also employ the SETCLIENTID operation when it In the event of a server restart, a client will find out that its
receives a NFS4ERR_STALE_STATEID error using a stateid derived from current clientid is no longer valid when it receives a
its current clientid, since this also indicates a server reboot which NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
has invalidated the existing clientid (see the next section the characteristics of the sessions involved, specifically whether
"lock_owner and stateid Definition" for details). the session is persistent.
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM When a session is not persistent, the client will need to create a
for a complete specification of the operations. new session. When the existing clientid is presented to a server as
part of creating a session and that clientid is not recognized, as
would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID. When this happens,
the client must obtain a new clientid by use of the CREATE_CLIENTID
operation and then use that clientid as the basis of a
new session and then proceed to any other necessary recovery for the
server reboot case (See Section 8.6.2).
7.1.2. Server Release of Clientid In the case of the session being persistent, the client will re-
establish communication using the existing session after the reboot.
This session will be associated with a stale clientid and the client
will receive an indication of that fact in the status field returned
by the SEQUENCE operation. The client can then use the existing
session to do whatever operations are necessary to determine the
status of requests outstanding at the time of reboot, while avoiding
issuing new requests, particularly any involving locking on that
session. Such requests would fail with an NFS4ERR_STALE_CLIENTID error
or an NFS4ERR_STALE_STATEID error, if attempted. In any case, the
client would create a new clientid using CREATE_CLIENTID, create a
new session based on that clientid, and proceed to other necessary
recovery for the server reboot case.
See the detailed descriptions of CREATE_CLIENTID and CREATE_SESSION
for a complete specification of these operations.
8.1.2. Server Release of Clientid
If the server determines that the client holds no associated state If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the the client receives the appropriate error so that it will use the
SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. CREATE_CLIENTID/CREATE_SESSION sequence to establish a new identity.
It should be clear that the server must be very hesitant to release a It should be clear that the server must be very hesitant to release a
clientid since the resulting work on the client to recover from such clientid since the resulting work on the client to recover from such
an event will be the same burden as if the server had failed and an event will be the same burden as if the server had failed and
restarted. Typically a server would not release a clientid unless restarted. Typically a server would not release a clientid unless
there had been no activity from that client for many minutes. there had been no activity from that client for many minutes.
Note that if the id string in a SETCLIENTID request is properly Note that if the id string in a CREATE_CLIENTID request is properly
constructed, and if the client takes care to use the same principal constructed, and if the client takes care to use the same principal
for each successive use of SETCLIENTID, then, barring an active for each successive use of CREATE_CLIENTID, then, barring an active
denial of service attack, NFS4ERR_CLID_INUSE should never be denial of service attack, NFS4ERR_CLID_INUSE should never be
returned. returned.
However, client bugs, server bugs, or perhaps a deliberate change of However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the id string (such as the case of a client the principal owner of the id string (such as the case of a client
that changes security flavors, and under the new flavor, there is no that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE. NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID for a client id In that event, when the server gets a CREATE_CLIENTID for a client id
that currently has no state, or it has state, but the lease has that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the SETCLIENTID, and confirm the new clientid if followed by allow the CREATE_CLIENTID, and confirm the new clientid if followed
the appropriate SETCLIENTID_CONFIRM. by the appropriate CREATE_SESSION.
7.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the 8.1.3. State-owner and Stateid Definition
clientid and an identifier for the owner of the requested lock.
These two fields are referred to as the lock_owner and the definition
of those fields are:
o A clientid returned by the server as part of the client's use of When opening a file or requesting a byte-range lock, the client must
the SETCLIENTID operation. specify an identifier which represents the owner of the requested
lock. This identifier is in the form of a state-owner, represented
in the protocol by a state_owner4, a variable-length opaque array
which, when concatenated with the current clientid, uniquely defines
the owner of a lock managed by the client. This may be a thread id,
process id, or other unique value.
o A variable length opaque array used to uniquely define the owner Owners of opens and owners of byte-range locks are separate entities
of a lock managed by the client. and remain separate even if the same opaque arrays are used to
designate owners of each. The protocol distinguishes between open-
owners (represented by open_owner4 structures) and lock-owners
(represented by lock_owner4 structures).
This may be a thread id, process id, or other unique value. Each open is associated with a specific open-owner while each byte-
range lock is associated with a lock-owner and an open-owner, the
latter being the open-owner associated with the open file under which
the LOCK operation was done. Delegations and layouts, on the other
hand, are not associated with a specific owner but are associated with the
client as a whole.
When the server grants the lock, it responds with a unique stateid. When the server grants a lock of any type (including opens, byte-
The stateid is used as a shorthand reference to the lock_owner, since range locks, delegations, and layouts) it responds with a unique
the server will be maintaining the correspondence between them. stateid, that represents a set of locks (often a single lock) for the
same file, of the same type, and sharing the same ownership
characteristics. Thus opens of the same file by different open-
owners each have an identifying stateid. Similarly, each set of
byte-range locks on a file owned by a specific lock-owner and gotten
via an open for a specific open-owner, has its own identifying
stateid. Delegations and layouts also have associated stateid's by
which they may be referenced. The stateid is used as a shorthand
reference to a lock or set of locks and given a stateid the server
can determine the associated state-owner or state-owners (in the case
of an open-owner/lock-owner pair) and the associated filehandle. Clients,
however, must not assume any such mapping and must not use a stateid
returned for a given filehandle and state-owner in the context of a
different filehandle or a different state-owner.
The server is free to form the stateid in any manner that it chooses The server is free to form the stateid in any manner that it chooses
as long as it is able to recognize invalid and out-of-date stateids. as long as it is able to recognize invalid and out-of-date stateids.
This requirement includes those stateids generated by earlier Although the protocol XDR definition divides the stateid into
instances of the server. From this, the client can be properly 'seqid' and 'other' fields, for the purposes of minor version one,
notified of a server restart. This notification will occur when the this distinction is not important and the server may use the
client presents a stateid to the server from a previous available space as it chooses, with one exception.
instantiation.
The server must be able to distinguish the following situations and The exception is that stateids whose 'other' field is either all
return the error as specified: zeros or all ones are reserved and may not be generated by the
server. Clients may use the protocol-defined special stateid values
for their defined purposes, but any use of stateid's in this reserved
class that are not specially defined by the protocol MUST result in
an NFS4ERR_BAD_STATEID being returned.
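The reserved-value rule can be sketched as a small server-side check. This is an illustration only: the 12-byte length of the 'other' field is an assumption to be confirmed against the XDR definition, the function names are hypothetical, and the set of protocol-defined special stateids is passed in by the caller rather than enumerated here.

```python
from typing import FrozenSet, Optional, Tuple

OTHER_SIZE = 12  # assumed length in bytes of the 'other' field

ALL_ZEROS = bytes(OTHER_SIZE)
ALL_ONES = b"\xff" * OTHER_SIZE


def is_reserved(other: bytes) -> bool:
    """True if 'other' is all zeros or all ones (never server-generated)."""
    return other in (ALL_ZEROS, ALL_ONES)


def check_stateid(seqid: int, other: bytes,
                  special: FrozenSet[Tuple[int, bytes]]) -> Optional[str]:
    """Return an error name for a misused reserved stateid, else None.

    `special` holds the (seqid, other) pairs that the protocol defines
    as special stateids; the caller supplies them.
    """
    if is_reserved(other) and (seqid, other) not in special:
        return "NFS4ERR_BAD_STATEID"
    return None
```

A server following this sketch would also avoid handing out any stateid for which is_reserved returns true, so that client-supplied reserved values can never collide with real locking state.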
o The stateid was generated by an earlier server instance (i.e. Clients may not compare stateids associated with different
before a server reboot). The error NFS4ERR_STALE_STATEID should filehandles, so that a server might use stateids with the same bit
be returned. pattern for all opens with a given open-owner or for all sets of
byte-range locks associated with a given lock-owner/open-owner pair.
However, if it does so, it must recognize and reject any use of
stateid when the current filehandle is such that no lock for that
filehandle by that open owner (or lock-owner/open-owner pair) exists.
o The stateid was generated by the current server instance but the Stateids must remain valid until either a client reboot or a server
stateid no longer designates the current locking state for the reboot or until the client returns all of the locks associated with
lockowner-file pair in question (i.e. one or more locking the stateid by means of an operation such as CLOSE or DELEGRETURN.
operations have occurred). The error NFS4ERR_OLD_STATEID should be If the locks are lost due to revocation the stateid remains usable
returned. until the client frees it by using FREE_STATEID. Stateids
associated with byte-range locks are an exception. They remain valid
even if a LOCKU frees all remaining locks, so long as the open file with
which they are associated remains open, unless the client does a
FREE_STATEID to cause the stateid to be freed.
This error condition will only occur when the client issues a Because each operation using a stateid occurs as part of a session,
locking request which changes a stateid while an I/O request that each stateid is implicitly associated with the clientid assigned to
uses that stateid is outstanding. that session. Use of a stateid in the context of a session where the
clientid is invalid should result in the error NFS4ERR_STALE_STATEID.
Servers MUST NOT do any validation or return other errors in this
case, even if they have sufficient information available to validate
stateids associated with an out-of-date client.
o The stateid was generated by the current server instance but the One mechanism that may be used to satisfy the requirement that the
stateid does not designate a locking state for any active server recognize invalid and out-of-date stateids is for the server
lockowner-file pair. The error NFS4ERR_BAD_STATEID should be to divide the stateid into two fields. This division may coincide
returned. with the documented division into 'seqid' and 'other' fields or it
may divide the stateid field up in any other way it chooses.
This error condition will occur when there has been a logic error o An index into a table of locking-state structures.
on the part of the client or server. This should not happen.
One mechanism that may be used to satisfy these requirements is for o A generation number which is incremented on each allocation of a
the server to, table entry to a particular allocation of a stateid.
o divide the "other" field of each stateid into two fields: And then store in each table entry,
* A server verifier which uniquely designates a particular server o The current generation number.
instantiation.
* An index into a table of locking-state structures. o The clientid with which the stateid is associated.
o utilize the "seqid" field of each stateid, such that seqid is o The filehandle of the file on which the locks are taken.
monotonically incremented for each stateid that is associated with
the same index into the locking-state table.
By matching the incoming stateid and its field values with the state o An indication of the type of stateid (open, byte-range lock, file
held at the server, the server is able to easily determine if a delegation, directory delegation, layout).
stateid is valid for its current instantiation and state. If the
stateid is not valid, the appropriate error can be supplied to the
client.
7.1.4. Use of the stateid and Locking With this information, the following procedure would be used to
validate an incoming stateid and return an appropriate error, when
necessary:
o If the current session is associated with an invalid clientid,
return NFS4ERR_STALE_STATEID.
o If the table index field is outside the range of the associated
table, return NFS4ERR_BAD_STATEID.
o If the selected table entry is of a different generation than that
specified in the incoming stateid, return NFS4ERR_BAD_STATEID.
o If the selected table entry does not match the current file
handle, return NFS4ERR_BAD_STATEID.
o If the clientid in the table entry does not match the clientid
associated with the current session, return NFS4ERR_BAD_STATEID.
o If the stateid type is not valid for the context in which the
stateid appears, return NFS4ERR_BAD_STATEID.
o Otherwise, the stateid is valid and the table entry should contain
any additional information about the associated set of locks, such
as open-owner and lock-owner information, as well as information
on the specific locks, such as open modes and byte ranges.
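The validation procedure above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `StateEntry` record, the `validate_stateid` function, its parameters, and the representation of error codes as strings are all hypothetical names chosen for the example, and a real server is free to lay out the stateid differently.

```python
from dataclasses import dataclass
from typing import List, Set

# Error names come from the text; modelling them as strings is an
# assumption made only for this sketch.
NFS4_OK = "NFS4_OK"
NFS4ERR_STALE_STATEID = "NFS4ERR_STALE_STATEID"
NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"


@dataclass
class StateEntry:
    """Hypothetical table entry holding the fields listed in the text."""
    generation: int    # current generation number of this slot
    clientid: int      # clientid the stateid is associated with
    filehandle: bytes  # file on which the locks are taken
    kind: str          # "open", "lock", "file_deleg", "dir_deleg", "layout"


def validate_stateid(table: List[StateEntry], index: int, generation: int,
                     session_clientid: int, clientid_valid: bool,
                     current_fh: bytes, allowed_kinds: Set[str]) -> str:
    """Apply the checks in the order given in the text."""
    if not clientid_valid:
        return NFS4ERR_STALE_STATEID
    if index < 0 or index >= len(table):
        return NFS4ERR_BAD_STATEID
    entry = table[index]
    if entry.generation != generation:
        return NFS4ERR_BAD_STATEID
    if entry.filehandle != current_fh:
        return NFS4ERR_BAD_STATEID
    if entry.clientid != session_clientid:
        return NFS4ERR_BAD_STATEID
    if entry.kind not in allowed_kinds:
        return NFS4ERR_BAD_STATEID
    # Valid: the entry would carry owner and lock details for the caller.
    return NFS4_OK
```

Here `index` and `generation` stand for the two fields the server extracted from the incoming stateid, and `allowed_kinds` is the set of stateid types acceptable to the operation being processed.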
8.1.4. Use of the Stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. explicitly mentioned in the text.
If the lock_owner performs a READ or WRITE in a situation in which it If the state-owner performs a READ or WRITE in a situation in which
has established a lock or share reservation on the server (any OPEN it has established a lock or share reservation on the server (any
constitutes a share reservation) the stateid (previously returned by OPEN constitutes a share reservation) the stateid (previously
the server) must be used to indicate what locks, including both returned by the server) must be used to indicate what locks,
record locks and share reservations, are held by the lockowner. If including both record locks and share reservations, are held by the
no state is established by the client, either record lock or share state-owner. If no state is established by the client, either record
reservation, a stateid of all bits 0 is used. Regardless of whether a lock or share reservation, a special stateid of all bits 0 (including
stateid of all bits 0, or a stateid returned by the server is used, all fields of the stateid) is used. Regardless of whether a stateid of
if there is a conflicting share reservation or mandatory record lock all bits 0, or a stateid returned by the server is used, if there is
held on the file, the server MUST refuse to service the READ or WRITE a conflicting share reservation or mandatory record lock held on the
operation. file, the server MUST refuse to service the READ or WRITE operation.
Share reservations are established by OPEN operations and by their Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record "mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record locks are required on the file before I/O is possible). When record
locks are advisory, they only prevent the granting of conflicting locks are advisory, they only prevent the granting of conflicting
lock requests and have no effect on READs or WRITEs. Mandatory lock requests and have no effect on READs or WRITEs. Mandatory
record locks, however, prevent conflicting I/O operations. When they record locks, however, prevent conflicting I/O operations. When they
are attempted, they are rejected with NFS4ERR_LOCKED. When the are attempted, they are rejected with NFS4ERR_LOCKED. When the
client gets NFS4ERR_LOCKED on a file it knows it has the proper share client gets NFS4ERR_LOCKED on a file it knows it has the proper share
reservation for, it will need to issue a LOCK request on the region reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on, of the file that includes the region the I/O was to be performed on,
with an appropriate locktype (i.e. READ*_LT for a READ operation, with an appropriate locktype (i.e. READ*_LT for a READ operation,
WRITE*_LT for a WRITE operation). WRITE*_LT for a WRITE operation).
With NFS version 3, there was no notion of a stateid so there was no
way to tell if the application process of the client sending the READ
or WRITE operation had also acquired the appropriate record lock on
the file. Thus there was no way to implement mandatory locking.
With the stateid construct, this barrier has been removed.
Note that for UNIX environments that support mandatory file locking, Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory record locks are exactly the same in so fact, advisory and mandatory record locks are exactly the same in so
far as the APIs and requirements on implementation. If the mandatory far as the APIs and requirements on implementation. If the mandatory
lock attribute is set on the file, the server checks to see if the lock attribute is set on the file, the server checks to see if the
lockowner has an appropriate shared (read) or exclusive (write) lock-owner has an appropriate shared (read) or exclusive (write)
record lock on the region it wishes to read or write to. If there is record lock on the region it wishes to read or write to. If there is
no appropriate lock, the server checks if there is a conflicting lock no appropriate lock, the server checks if there is a conflicting lock
(which can be done by attempting to acquire the conflicting lock on (which can be done by attempting to acquire the conflicting lock on
the behalf of the lockowner, and if successful, release the lock the behalf of the lock-owner, and if successful, release the lock
after the READ or WRITE is done), and if there is, the server returns after the READ or WRITE is done), and if there is, the server returns
NFS4ERR_LOCKED. NFS4ERR_LOCKED.
For Windows environments, there are no advisory record locks, so the For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests. server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4 between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces server's processing of the READ and WRITE operations that introduces
the distinction. the distinction.
Every stateid other than the special stateid values noted in this Every stateid other than the special stateid values noted in this
section, whether returned by an OPEN-type operation (i.e. OPEN, section, whether returned by an OPEN-type operation (i.e. OPEN,
OPEN_DOWNGRADE), or by a LOCK-type operation (i.e. LOCK or LOCKU), OPEN_DOWNGRADE), or by a LOCK-type operation (i.e. LOCK or LOCKU),
defines an access mode for the file (i.e. READ, WRITE, or READ- defines an access mode for the file (i.e. READ, WRITE, or READ-
WRITE) as established by the original OPEN which began the stateid WRITE) as established by the original OPEN which caused the
sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs allocation of the open stateid and as modified by subsequent OPENs
within that stateid sequence. When a READ, WRITE, or SETATTR which and OPEN_DOWNGRADEs for the same open-owner/file pair. Stateids
specifies the size attribute, is done, the operation is subject to returned by byte-range lock operations imply the access mode for the
checking against the access mode to verify that the operation is open stateid associated with the lock set represented by the stateid.
appropriate given the OPEN with which the operation is associated. Delegation stateids have an access mode based on the type of
delegation. When a READ, WRITE, or SETATTR which specifies the size
attribute, is done, the operation is subject to checking against the
access mode to verify that the operation is appropriate given the
OPEN with which the operation is associated.
In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist. that no conflicting share reservation can exist.
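The access-mode check described above might look like the following Python sketch. The flag values, the function name `check_access_mode`, and its parameters are hypothetical and chosen only for illustration. Returning NFS4ERR_OPENMODE for a failed READ check is an assumption, since the text names that error only for WRITE-type operations.

```python
# Hypothetical access-mode flags for this sketch (not protocol constants).
ACCESS_READ = 0x1
ACCESS_WRITE = 0x2

NFS4_OK = "NFS4_OK"
NFS4ERR_OPENMODE = "NFS4ERR_OPENMODE"


def check_access_mode(op, access_mode, sets_size=False,
                      enforce_read_check=True):
    """Check an I/O operation against the access mode of its stateid.

    op is "READ", "WRITE" or "SETATTR"; a SETATTR counts as a
    WRITE-type operation only when it sets the size attribute.
    """
    write_type = op == "WRITE" or (op == "SETATTR" and sets_size)
    if write_type:
        # The server must verify that the access mode allows writing.
        return NFS4_OK if access_mode & ACCESS_WRITE else NFS4ERR_OPENMODE
    if op == "READ" and enforce_read_check:
        # Optional: a server may instead allow READ on write-only opens.
        return NFS4_OK if access_mode & ACCESS_READ else NFS4ERR_OPENMODE
    return NFS4_OK
```

A server that passes `enforce_read_check=False` still has to check separately for locks that conflict with the READ; that step is not shown here.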
A stateid of all bits 1 (one) MAY allow READ operations to bypass A special stateid of all bits 1 (one), including all fields in the
locking checks at the server. However, WRITE operations with a stateid, indicates a desire to bypass locking checks. The server MAY
stateid with bits all 1 (one) MUST NOT bypass locking checks and are allow READ operations to bypass locking checks at the server, when
this special stateid is used. However, WRITE operations with
this special stateid value MUST NOT bypass locking checks and are
treated exactly the same as if a stateid of all bits 0 were used. treated exactly the same as if a stateid of all bits 0 were used.
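As a rough illustration of the all-ones rule, the following Python sketch decides whether an operation may skip locking checks. It assumes a 16-byte stateid (a 4-byte 'seqid' followed by a 12-byte 'other' field, as in the protocol XDR definition, which this section does not restate); the function name and the `server_allows_read_bypass` policy flag are hypothetical.

```python
STATEID_LEN = 16                  # assumed: 4-byte seqid + 12-byte other
ALL_ZEROS = bytes(STATEID_LEN)    # special stateid: no state established
ALL_ONES = b"\xff" * STATEID_LEN  # special stateid: request to bypass checks


def may_bypass_locking_checks(stateid, op, server_allows_read_bypass):
    """Return True only for a READ with the all-ones stateid on a
    server that chooses to permit the bypass."""
    if stateid != ALL_ONES:
        return False
    if op == "READ":
        return server_allows_read_bypass
    # WRITE with all ones is handled exactly like all zeros: no bypass.
    return False
```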
A lock may not be granted while a READ or WRITE operation using one A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above. WRITE as discussed above.
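The conflict rule in the paragraph above can be expressed as a small predicate. This Python sketch uses hypothetical function names and assumes simple (offset, length) ranges with non-zero lengths; it does not model any "to end of file" length encoding.

```python
def ranges_overlap(off_a, len_a, off_b, len_b):
    """True if the half-open byte ranges [off, off+len) intersect."""
    return off_a < off_b + len_b and off_b < off_a + len_a


def lock_conflicts_with_io(lock_exclusive, lock_off, lock_len,
                           io_is_write, io_off, io_len):
    """Decide whether a lock request must wait for an in-progress READ
    or WRITE that was issued with one of the special stateids."""
    if not ranges_overlap(lock_off, lock_len, io_off, io_len):
        return False
    if lock_exclusive:
        return True        # exclusive lock conflicts with READ and WRITE
    return io_is_write     # shared lock conflicts with WRITE only
```

A SETATTR that sets size would be passed in as a WRITE covering the range between the old and new sizes.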
7.1.5. Sequencing of Lock Requests 8.2. Lock Ranges
Locking is different from most NFS operations as it requires "at-
most-one" semantics that are not provided by ONCRPC. ONCRPC over a
reliable transport is not sufficient because a sequence of locking
requests may span multiple TCP connections. In the face of
retransmission or reordering, lock or unlock requests must have a
well defined and consistent behavior. To accomplish this, each lock
request contains a sequence number that is a consecutively increasing
integer. Different lock_owners have different sequences. The server
maintains the last sequence number (L) received and the response that
was returned. The first request issued for any given lock_owner is
issued with a sequence number of zero.
Note that for requests that contain a sequence number, for each
lock_owner, there should be no more than one outstanding request.
If a request (r) with a previous sequence number (r < L) is received,
it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a
properly-functioning client, the response to (r) must have been
received before the last request (L) was sent. If a duplicate of
last request (r == L) is received, the stored response is returned.
If a request beyond the next sequence (r == L + 2) is received, it is
rejected with the return of error NFS4ERR_BAD_SEQID. Sequence
history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM
sequence changes the client verifier.
Since the sequence number is represented with an unsigned 32-bit
integer, the arithmetic involved with the sequence number is mod
2^32. For an example of modulo arithmetic involving sequence numbers
see [RFC793].
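The sequence-number rules of the earlier draft can be illustrated with a short Python sketch. The function name and the string results are hypothetical. The sketch relies on the observation that, with arithmetic mod 2^32, any value other than L (a replay) or L + 1 (the next request) is rejected.

```python
SEQ_MOD = 1 << 32  # sequence numbers are unsigned 32-bit values


def classify_seqid(r, last):
    """Classify request seqid r against the last seqid L kept by the server.

    Returns "replay" (return the stored response), "next" (process the
    request) or "bad" (reject with NFS4ERR_BAD_SEQID).
    """
    if r == last:
        return "replay"
    if r == (last + 1) % SEQ_MOD:
        return "next"
    return "bad"
```

For example, with L equal to 0xFFFFFFFF the next acceptable value wraps around to 0.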
It is critical the server maintain the last response sent to the
client to provide a more reliable cache of duplicate non-idempotent
requests than that of the traditional cache described in [Juszczak].
The traditional duplicate request cache uses a least recently used
algorithm for removing unneeded requests. However, the last lock
request and response on a given lock_owner must be cached as long as
the lock state exists on the server.
The client MUST monotonically increment the sequence number for the
CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
operations. This is true even in the event that the previous
operation that used the sequence number received an error. The only
exception to this rule is if the previous operation received one of
the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
7.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long
as the server maintains the last sequence number received and follows
the methods described above, there are no risks of a Byzantine router
re-sending old requests. The server need only maintain the
(lock_owner, sequence number) state as long as there are open files
or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
number and therefore the risk of the replay of these operations
resulting in undesired effects is non-existent while the server
maintains the lock_owner state.
7.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking
state at the server, the server may choose to release the sequence
number state associated with the lock_owner. The server may make
this choice based on lease expiration, for the reclamation of server
memory, or other implementation specific details. In any event, the
server is able to do this safely only when the lock_owner no longer
is being utilized by the client. The server may choose to hold the
lock_owner state in the event that retransmitted requests are
received. However, the period to hold this state is implementation
specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner
state, the server will find that the lock_owner has no files open and
an error will be returned to the client. If the lock_owner does have
a file open, the stateid will not match and again an error is
returned to the client.
7.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being
used for the first time or the lock_owner state has been previously
released by the server, the use of the OPEN_CONFIRM operation will
prevent incorrect behavior. When the server observes the use of the
lock_owner for the first time, it will direct the client to perform
the OPEN_CONFIRM for the corresponding OPEN. This sequence
establishes the use of a lock_owner and associated sequence number.
Since the OPEN_CONFIRM sequence connects a new open_owner on the
server with an existing open_owner on a client, the sequence number
may have any value. The OPEN_CONFIRM step assures the server that
the value received is the correct one. See the section "OPEN_CONFIRM
- Confirm Open" for further details.
There are a number of situations in which the requirement to confirm
an OPEN would pose difficulties for the client and server, in that
they would be prevented from acting in a timely fashion on
information received, because that information would be provisional,
subject to deletion upon non-confirmation. Fortunately, these are
situations in which the server can avoid the need for confirmation
when responding to open requests. The two constraints are:
o The server must not bestow a delegation for any open which would
require confirmation.
o The server MUST NOT require confirmation on a reclaim-type open
(i.e. one specifying claim type CLAIM_PREVIOUS or
CLAIM_DELEGATE_PREV).
These constraints are related in that reclaim-type opens are the only
ones in which the server may be required to send a delegation. For
CLAIM_NULL, sending the delegation is optional while for
CLAIM_DELEGATE_CUR, no delegation is sent.
Delegations being sent with an open requiring confirmation are
troublesome because recovering from non-confirmation adds undue
complexity to the protocol while requiring confirmation on reclaim-
type opens poses difficulties in that the inability to resolve the
status of the reclaim until lease expiration may make it difficult to
have timely determination of the set of locks being reclaimed (since
the grace period may expire).
Requiring open confirmation on reclaim-type opens is avoidable
because of the nature of the environments in which such opens are
done. For CLAIM_PREVIOUS opens, this is immediately after server
reboot, so there should be no time for lockowners to be created,
found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we
are dealing with a client reboot situation. A server which supports
delegation can be sure that no lockowners for that client have been
recycled since client initialization and thus can ensure that
confirmation will not be required.
7.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock. and then either upgrade, downgrade, or unlock a sub-range of the
It is expected that this will be an uncommon type of request. In any initial lock. It is expected that this will be an uncommon type of
case, servers or server filesystems may not be able to support sub- request. In any case, servers or server filesystems may not be able
range lock semantics. In the event that a server receives a locking to support sub-range lock semantics. In the event that a server
request that represents a sub-range of current locking state for the receives a locking request that represents a sub-range of current
lock owner, the server is allowed to return the error locking state for the lock owner, the server is allowed to return the
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock error NFS4ERR_LOCK_RANGE to signify that it does not support sub-
operations. Therefore, the client should be prepared to receive this range lock operations. Therefore, the client should be prepared to
error and, if appropriate, report the error to the requesting receive this error and, if appropriate, report the error to the
application. requesting application.
The client is discouraged from combining multiple independent locking The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure. the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure. similar to the client's locking behavior prior to server failure.
7.3. Upgrading and Downgrading Locks 8.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application. appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the receive such errors and if appropriate, report the error to the
requesting application. requesting application.
7.4. Blocking Locks 8.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version Some clients require the support of blocking locks. NFSv4.1 does not
4 protocol must not rely on a callback mechanism and therefore is provide a callback when a previously unavailable lock becomes
unable to notify a client when a previously denied lock has been available. Clients thus have no choice but to continually poll for
granted. Clients have no choice but to continually poll for the the lock. This presents a fairness problem. Two new lock types are
lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that it is are required to poll at an interval sufficiently small that it is
likely to acquire the lock in a timely manner. The server is not likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocked locks as it is used to required to maintain a list of pending blocked locks as it is used to
increase fairness and not correct operation. Because of the increase fairness and not correct operation. Because of the
skipping to change at page 90, line 28 skipping to change at page 100, line 36
storage would be required to guarantee ordered granting of blocking storage would be required to guarantee ordered granting of blocking
locks. locks.
Servers may also note the lock types and delay returning denial of Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks. avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the The server should take care in the length of delay in the event the
client retransmits the request. client retransmits the request.
7.5. Lease Renewal 8.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired. renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for Since each session is associated with a specific client, any
a given client (i.e. all those sharing a given clientid). Each of operation issued on that session is an indication that the associated
these is a positive indication that the client is still active and client is reachable. When a request is issued for a given session,
that the associated state held at the server, for the client, is execution of a SEQUENCE operation will result in all leases for the
still valid. associated client being implicitly renewed. This approach allows for
low overhead lease renewal which scales well. In the typical case no
o An OPEN with a valid clientid. extra RPC calls are required for lease renewal and in the worst case
one RPC is required every lease period, via a COMPOUND that consists
o Any operation made with a valid stateid (CLOSE, DELEGRETURN, LOCK, solely of a single SEQUENCE operation. The number of locks held by
LOCKU, OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE, READ, SETATTR, WRITE). the client is not a factor since all state for the client is involved
This does not include the special stateids of all bits 0 or all with the lease renewal action.
bits 1.
Note that if the client had restarted or rebooted, the client
would not be making these requests without issuing the
SETCLIENTID/SETCLIENTID_CONFIRM sequence. The use of the
SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the
client verifier) notifies the server to drop the locking state
associated with the client. SETCLIENTID/SETCLIENTID_CONFIRM never
renews a lease.
If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID
error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be
valid hence preventing spurious renewals.
This approach allows for low overhead lease renewal which scales
well. In the typical case no extra RPC calls are required for lease
renewal and in the worst case one RPC is required every lease period
(i.e. a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the
lease renewal action.
Since all operations that create a new lease also renew existing Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions. easily updated upon implicit lease renewal actions.
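The renewal rule above can be sketched as follows. This is only an
illustrative model, not a definitive implementation: the LeaseTable
class, its method names, and the injectable clock are hypothetical,
and the sketch assumes that binding a session also starts the lease.
It keeps one common expiration time per clientid, so a SEQUENCE on
any session of the client renews every lease that client holds.

```python
import time


class LeaseTable:
    """Hypothetical server-side table: one common lease expiry per clientid."""

    def __init__(self, lease_period, clock=time.monotonic):
        self.lease_period = lease_period
        self.clock = clock              # injectable so the sketch can be tested
        self.expiry = {}                # clientid -> common lease expiration time
        self.session_owner = {}         # sessionid -> clientid

    def bind_session(self, sessionid, clientid):
        # Assumption: creating a session also starts (or renews) the lease.
        self.session_owner[sessionid] = clientid
        self.renew(clientid)

    def renew(self, clientid):
        # A single timestamp covers all state held by the client, so the
        # cost of renewal does not depend on the number of locks held.
        self.expiry[clientid] = self.clock() + self.lease_period

    def on_sequence(self, sessionid):
        # Executing SEQUENCE implicitly renews all leases of the
        # client associated with the session.
        clientid = self.session_owner[sessionid]
        self.renew(clientid)
        return clientid

    def expired(self, clientid):
        return self.clock() >= self.expiry[clientid]
```

In the worst case an otherwise idle client would call the equivalent
of on_sequence once per lease period, matching the single-SEQUENCE
COMPOUND described above.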
7.6. Crash Recovery 8.6. Crash Recovery
The important requirement in crash recovery is that both the client The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is and the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and client has successfully recovered the locks protecting the READ and
WRITE operations. WRITE operations.
7.6.1. Client Failure and Recovery 8.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's In the event that a client fails, the server may release the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease When a client has not failed and re-establishes its lease before
period the client may be forced to wait the remainder of the lease expiration occurs, requests for conflicting locks will not be
period before obtaining new locks. granted.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client supplied verifier. This
verifier is part of the initial SETCLIENTID call made by the client. verifier is part of the initial CREATE_CLIENTID call made by the
The server returns a clientid as a result of the SETCLIENTID client. The server returns a clientid as a result of the
operation. The client then confirms the use of the clientid with CREATE_CLIENTID operation. The client then confirms the use of the
SETCLIENTID_CONFIRM. The clientid in combination with an opaque clientid by establishing a session associated with that clientid.
owner field is then used by the client to identify the lock owner for All locks, including opens, byte-range locks, delegations, and layouts
OPEN. This chain of associations is then used to identify all locks obtained by sessions using that clientid are associated with that
for a particular client. clientid.
Since the verifier will be changed by the client upon each Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was all locks held which are associated with the old clientid which was
derived from the old verifier. derived from the old verifier. At this point conflicting locks from
other clients, kept waiting while the lease had not yet expired, can
be granted.
Note that the verifier must have the same uniqueness properties of Note that the verifier must have the same uniqueness properties of
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
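The verifier comparison described above might look like the sketch
below. All names here (ClientRecord, ClientRegistry, create_clientid
as a Python method, the string lock labels) are hypothetical, and the
sketch deliberately collapses the confirmation step: it releases the
old locks as soon as a new verifier is seen, whereas a real server
would wait until the new clientid is confirmed by session creation.

```python
class ClientRecord:
    """State kept for one client instance (hypothetical structure)."""

    def __init__(self, clientid, verifier):
        self.clientid = clientid
        self.verifier = verifier
        self.locks = set()      # labels of locks held under this clientid


class ClientRegistry:
    def __init__(self):
        self.by_owner = {}      # client owner string -> ClientRecord
        self.next_id = 1

    def create_clientid(self, owner, verifier):
        """Return (clientid, released_locks) for the presented verifier."""
        old = self.by_owner.get(owner)
        if old is not None and old.verifier == verifier:
            # Same client instance: keep its clientid and its locks.
            return old.clientid, []
        # A new verifier signifies a new instantiation of the client, so
        # the locks tied to the old clientid may be released.
        released = sorted(old.locks) if old is not None else []
        record = ClientRecord(self.next_id, verifier)
        self.next_id += 1
        self.by_owner[owner] = record
        return record.clientid, released
```

The released list is what would let conflicting requests from other
clients proceed without waiting for the old lease to expire.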
7.6.2. Server Failure and Recovery 8.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re- or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re- establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not client. Likewise, if there is a possibility that clients have not
yet re-established their locking state for a file, the server must yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. disallow READ and WRITE operations for that file.
A client can determine that server failure (and thus loss of locking A client can determine that server failure (and thus loss of locking
state) has occurred, when it receives one of two errors. The state) has occurred, when it receives one of two errors. The
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these are clientid invalidated by reboot or restart. When either of these are
received, the client must establish a new clientid (See the section received, the client must establish a new clientid (See
"Client ID") and re-establish its locking state. Section 8.1.1) and re-establish its locking state.
Once a session is established using the new clientid, the client will Once a session is established using the new clientid, the client will
use reclaim-type locking requests (i.e. LOCK requests with reclaim use reclaim-type locking requests (i.e. LOCK requests with reclaim
set to true and OPEN operations with a claim type of CLAIM_PREVIOUS) set to true and OPEN operations with a claim type of CLAIM_PREVIOUS)
to re-establish its locking state. Once this is done, or if there is to re-establish its locking state. Once this is done, or if there is
no such locking state to reclaim, the client does a RECLAIM_COMPLETE no such locking state to reclaim, the client does a RECLAIM_COMPLETE
operation to indicate that it has reclaimed all of the locking state operation to indicate that it has reclaimed all of the locking state
that it will reclaim. Once a client does a RECLAIM_COMPLETE that it will reclaim. Once a client does a RECLAIM_COMPLETE
operation, it may attempt non-reclaim locking operations, although it operation, it may attempt non-reclaim locking operations, although it
may get NFS4ERR_GRACE errors on these until the period of special may get NFS4ERR_GRACE errors on these until the period of special
handling is over. handling is over.
The period of special handling of locking and READs and WRITEs, is The period of special handling of locking and READs and WRITEs, is
referred to as the "grace period". During the grace period, clients referred to as the "grace period". During the grace period, clients
recover locks and the associated state using reclaim-type locking recover locks and the associated state using reclaim-type locking
requests. During this period, the server must reject READ and WRITE requests. During this period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations and non-reclaim locking requests (i.e. other LOCK and OPEN
operations) with an error of NFS4ERR_GRACE, unless it is able to operations) with an error of NFS4ERR_GRACE, unless it is able to
guarantee that these may be done safely, as described below. guarantee that these may be done safely, as described below.
The grace period may last until all clients to have locks have done a The grace period may last until all clients who are known to possibly
RECLAIM_COMPLETE operation, indicating that they have finished have had locks have done a RECLAIM_COMPLETE operation, indicating
reclaiming the locks they held before the server reboot. The server that they have finished reclaiming the locks they held before the
is assumed to maintain in stable storage a list of clients who may server reboot. The server is assumed to maintain in stable storage a
have such locks. The server may also terminate the grace period list of clients who may have such locks. The server may also
before all clients have done RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period before all clients have done
terminate the grace period before a time equal to the lease period in RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period
order to give clients an opportunity to find out about the server before a time equal to the lease period in order to give clients an
reboot. Some additional time in order to allow time to establish a opportunity to find out about the server reboot. Some additional
new clientid and session and to effect lock reclaims may be added. time in order to allow time to establish a new clientid and session
and to effect lock reclaims may be added.
If the server can reliably determine that granting a non-reclaim If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients, request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned even within the the NFS4ERR_GRACE error does not have to be returned even within the
grace period, although NFS4ERR_GRACE must always be returned to grace period, although NFS4ERR_GRACE must always be returned to
clients attempting a non-reclaim lock request before doing their own clients attempting a non-reclaim lock request before doing their own
RECLAIM_COMPLETE. For the server to be able to service READ and RECLAIM_COMPLETE. For the server to be able to service READ and
WRITE operations during the grace period, it must again be able to WRITE operations during the grace period, it must again be able to
guarantee that no possible conflict could arise between an impending guarantee that no possible conflict could arise between a potential
reclaim locking request and the READ or WRITE operation. If the reclaim locking request and the READ or WRITE operation. If the
server is unable to offer that guarantee, the NFS4ERR_GRACE error server is unable to offer that guarantee, the NFS4ERR_GRACE error
must be returned to the client. must be returned to the client.
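The grace period rules above can be summarized as a small decision
function. This is a sketch under stated assumptions: the function
name and its boolean parameters are hypothetical, it covers only
requests that arrive during the grace period, and provably_safe
stands for whatever server-specific check establishes that no
conflict with a possible reclaim can arise.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_GRACE = "NFS4ERR_GRACE"


def grace_period_status(is_reclaim, is_locking, reclaim_complete_done,
                        provably_safe):
    """Status for a request that arrives during the grace period.

    NFS4_OK here only means the request may go on to normal processing.
    """
    if is_reclaim:
        # Reclaim-type LOCK and OPEN requests are what the period is for.
        return NFS4_OK
    if is_locking and not reclaim_complete_done:
        # A client must do its own RECLAIM_COMPLETE before any
        # non-reclaim locking request, whatever the server can prove.
        return NFS4ERR_GRACE
    # Non-reclaim locking, READ, and WRITE are allowed only when the
    # server can guarantee no conflict with a potential reclaim.
    return NFS4_OK if provably_safe else NFS4ERR_GRACE
```

A server taking the simplest valid approach would pass
provably_safe=False for every request.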
For a server to provide simple, valid handling during the grace For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be could determine if a regular lock or READ or WRITE operation can be
skipping to change at page 94, line 27 skipping to change at page 104, line 19
A server may, upon restart, establish a new value for the lease A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established. previous server instance to be reliably re-established.
7.6.3. Network Partitions and Recovery 8.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease If the duration of a network partition is greater than the lease
period provided by the server, the server will have not received a period provided by the server, the server will have not received a
lease renewal from the client. If this occurs, the server may free lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the all locks held for the client, or it may allow the lock state to
client will become invalid or stale. Once the client is able to remain for a considerable period, subject to the constraint that if a
reach the server after such a network partition, all I/O submitted by request for a conflicting lock is made, locks associated with expired
leases do not prevent such a conflicting lock from being granted but
are revoked as necessary so as not to interfere with such conflicting
requests.
If the server chooses to delay freeing of lock state until there is a
conflict, it may either free all of the client's locks once there is a
conflict, or it may only revoke the minimum set of locks necessary to
allow conflicting requests. When it adopts the finer-grained
approach, it must revoke all locks associated with a given stateid,
as long as it revokes a single such lock.
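The two revocation policies above can be contrasted in a short
sketch. The function name, the dictionary layout (stateid mapped to
the set of locks held under it), and the conflicting predicate are
all hypothetical. The point illustrated is that revocation is done
per stateid: once one lock under a stateid must go, every lock under
that stateid goes with it.

```python
def revoke_for_conflict(expired_locks, conflicting, revoke_all=False):
    """Revoke locks held under an expired lease to admit a new request.

    expired_locks: dict mapping stateid -> set of locks (modified in place)
    conflicting:   predicate returning True for a lock that conflicts
    revoke_all:    True for the coarse policy (free all of the client's
                   locks on the first conflict), False for the
                   finer-grained policy
    Returns the set of revoked stateids.
    """
    hit = {sid for sid, locks in expired_locks.items()
           if any(conflicting(lock) for lock in locks)}
    if revoke_all and hit:
        revoked = set(expired_locks)
    else:
        # Finer-grained: only stateids with a conflicting lock, but all
        # locks under each such stateid are revoked together.
        revoked = hit
    for sid in revoked:
        del expired_locks[sid]
    return revoked
```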
When the server chooses to free all of a client's lock state, either
immediately upon lease expiration, or as a result of the first attempt
to get a lock, all stateids held by the client will become invalid or
stale. Once the client is able to reach the server after such a
network partition, the status returned by the SEQUENCE operation will
indicate a loss of locking state. In addition all I/O submitted by
the client with the now invalid stateids will fail with the server the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received, returning the error NFS4ERR_EXPIRED. Once the client learns of the
the client will suitably notify the application that held the lock. loss of locking state, it will suitably notify the applications that
held the invalidated locks. The client should then take action to
free invalidated stateids, either by establishing a new clientid
using a new verifier or by doing a FREE_STATEID operation to release
each of the invalidated stateids.
As a courtesy to the client or as an optimization, the server may When the server adopts a finer-grained approach to revocation of
continue to hold locks on behalf of a client for which recent locks when leases have expired, only a subset of stateids will
communication has extended beyond the lease period. If the server normally become invalid during a network partition. When the client
receives a lock or I/O request that conflicts with one of these is able to communicate with the server after such a network
courtesy locks, the server must free the courtesy lock and grant the partition, the status returned by the SEQUENCE operation will
new request. indicate a partial loss of locking state. In addition, operations,
including I/O, submitted by the client with the now invalid stateids
will fail with the server returning the error NFS4ERR_EXPIRED. Once
the client learns of the loss of locking state, it will use the
TEST_STATEID operation on all of its stateids to determine which
locks have been lost and then suitably notify the applications that
held the invalidated locks. The client can then release the
invalidated locking state and acknowledge the revocation of the
associated locks by doing a FREE_STATEID operation on each of the
invalidated stateids.
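The client-side handling of a partial loss of locking state can be
sketched as below. The function and its three callables are
hypothetical stand-ins: test_stateid and free_stateid represent the
TEST_STATEID and FREE_STATEID operations, and notify represents
whatever mechanism the client uses to tell applications that a lock
was lost. Treating any status other than NFS4_OK as a lost stateid is
an assumption of this sketch.

```python
def recover_after_partition(stateids, test_stateid, free_stateid, notify):
    """Find revoked stateids, notify their holders, and release them.

    stateids:     list of stateids the client believes it holds
    test_stateid: callable returning a status string for one stateid
    free_stateid: callable that releases one invalidated stateid
    notify:       callable that informs the application holding the lock
    Returns the stateids that are still valid.
    """
    lost = [sid for sid in stateids if test_stateid(sid) != "NFS4_OK"]
    for sid in lost:
        notify(sid)         # the application learns its lock is gone
        free_stateid(sid)   # acknowledges the revocation to the server
    return [sid for sid in stateids if sid not in lost]
```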