draft-ietf-nfsv4-minorversion1-08.txt draft-ietf-nfsv4-minorversion1-09.txt
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: April 25, 2007 Editors Expires: September 3, 2007 Editors
October 22, 2006 March 2, 2007
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-08.txt draft-ietf-nfsv4-minorversion1-09.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 25, 2007. This Internet-Draft will expire on September 3, 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The IETF Trust (2007).
Abstract Abstract
This Internet-Draft describes NFSv4 minor version one, including This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major subsequently. The current draft includes description of the major
extensions, Sessions, Directory Delegations, and parallel NFS (pNFS). extensions, Sessions, Directory Delegations, and parallel NFS (pNFS).
This Internet-Draft is an active work item of the NFSv4 working This Internet-Draft is an active work item of the NFSv4 working
group. Active and resolved issues may be found in the issue tracker group. Active and resolved issues may be found in the issue tracker
at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4. New issues at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4. New issues
skipping to change at page 2, line 15 skipping to change at page 2, line 15
Group nfsv4@ietf.org and logged in the issue tracker. Group nfsv4@ietf.org and logged in the issue tracker.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 9 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1. The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . . 9 1.1. The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . . 10
1.2. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 9 1.2. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 10
1.3. Minor Version 1 Goals . . . . . . . . . . . . . . . . . 10 1.3. Minor Version 1 Goals . . . . . . . . . . . . . . . . . 11
1.4. Overview of NFS version 4.1 Features . . . . . . . . . . 10 1.4. Overview of NFS version 4.1 Features . . . . . . . . . . 11
1.4.1. RPC and Security . . . . . . . . . . . . . . . . . . 11 1.4.1. RPC and Security . . . . . . . . . . . . . . . . . . 12
1.4.2. Protocol Structure . . . . . . . . . . . . . . . . . 11 1.4.2. Protocol Structure . . . . . . . . . . . . . . . . . 12
1.4.3. File System Model . . . . . . . . . . . . . . . . . 12 1.4.3. File System Model . . . . . . . . . . . . . . . . . 13
1.4.4. Locking Facilities . . . . . . . . . . . . . . . . . 13 1.4.4. Locking Facilities . . . . . . . . . . . . . . . . . 14
1.5. General Definitions . . . . . . . . . . . . . . . . . . 14 1.5. General Definitions . . . . . . . . . . . . . . . . . . 15
1.6. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 16 1.6. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 17
2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 16 2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 17
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 16 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 18
2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 16 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 16 2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 18
2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 20 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 21
2.4. Client Identifiers . . . . . . . . . . . . . . . . . . . 20 2.4. Client Identifiers and Client Owners . . . . . . . . . . 22
2.4.1. Server Release of Clientid . . . . . . . . . . . . . 24 2.4.1. Server Release of Client ID . . . . . . . . . . . . 26
2.5. Security Service Negotiation . . . . . . . . . . . . . . 25 2.4.2. Handling Client Owner Conflicts . . . . . . . . . . 26
2.5.1. NFSv4 Security Tuples . . . . . . . . . . . . . . . 25 2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . 27
2.5.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 25 2.6. Security Service Negotiation . . . . . . . . . . . . . . 27
2.5.3. Security Error . . . . . . . . . . . . . . . . . . . 26 2.6.1. NFSv4 Security Tuples . . . . . . . . . . . . . . . 28
2.6. Minor Versioning . . . . . . . . . . . . . . . . . . . . 29 2.6.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 28
2.7. Non-RPC-based Security Services . . . . . . . . . . . . 31 2.6.3. Security Error . . . . . . . . . . . . . . . . . . . 28
2.7.1. Authorization . . . . . . . . . . . . . . . . . . . 31 2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 32
2.7.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 32 2.8. Non-RPC-based Security Services . . . . . . . . . . . . 34
2.7.3. Intrusion Detection . . . . . . . . . . . . . . . . 32 2.8.1. Authorization . . . . . . . . . . . . . . . . . . . 34
2.8. Transport Layers . . . . . . . . . . . . . . . . . . . . 32 2.8.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 34
2.8.1. Required and Recommended Properties of Transports . 32 2.8.3. Intrusion Detection . . . . . . . . . . . . . . . . 35
2.8.2. Client and Server Transport Behavior . . . . . . . . 33 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 35
2.8.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 34 2.9.1. Required and Recommended Properties of Transports . 35
2.9. Session . . . . . . . . . . . . . . . . . . . . . . . . 34 2.9.2. Client and Server Transport Behavior . . . . . . . . 35
2.9.1. Motivation and Overview . . . . . . . . . . . . . . 34 2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 37
2.9.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 35 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 37
2.9.3. Channels . . . . . . . . . . . . . . . . . . . . . . 36 2.10.1. Motivation and Overview . . . . . . . . . . . . . . 37
2.9.4. Exactly Once Semantics . . . . . . . . . . . . . . . 39 2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 38
2.9.5. RDMA Considerations . . . . . . . . . . . . . . . . 47 2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 39
2.9.6. Sessions Security . . . . . . . . . . . . . . . . . 50 2.10.4. Exactly Once Semantics . . . . . . . . . . . . . . . 42
2.9.7. Session Mechanics - Steady State . . . . . . . . . . 54 2.10.5. RDMA Considerations . . . . . . . . . . . . . . . . 51
2.9.8. Session Mechanics - Recovery . . . . . . . . . . . . 55 2.10.6. Sessions Security . . . . . . . . . . . . . . . . . 53
3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 58 2.10.7. Session Mechanics - Steady State . . . . . . . . . . 57
3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 59 2.10.8. Session Mechanics - Recovery . . . . . . . . . . . . 59
3.2. Structured Data Types . . . . . . . . . . . . . . . . . 60 2.10.9. Parallel NFS and Sessions . . . . . . . . . . . . . 62
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 69 3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 62
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 70 3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 62
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 70 3.2. Structured Data Types . . . . . . . . . . . . . . . . . 64
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 70 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 71 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 74
4.2.1. General Properties of a Filehandle . . . . . . . . . 71 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 74
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 72 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 74
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 72 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 75
4.3. One Method of Constructing a Volatile Filehandle . . . . 73 4.2.1. General Properties of a Filehandle . . . . . . . . . 75
4.4. Client Recovery from Filehandle Expiration . . . . . . . 74 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 76
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 75 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 76
5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 76 4.3. One Method of Constructing a Volatile Filehandle . . . . 77
5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 76 4.4. Client Recovery from Filehandle Expiration . . . . . . . 78
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 77 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 79
5.4. Classification of Attributes . . . . . . . . . . . . . . 77 5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 80
5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 78 5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 80
5.6. Recommended Attributes - Definitions . . . . . . . . . . 80 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 81
5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 89 5.4. Classification of Attributes . . . . . . . . . . . . . . 81
5.8. Interpreting owner and owner_group . . . . . . . . . . . 90 5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 83
5.9. Character Case Attributes . . . . . . . . . . . . . . . 92 5.6. Recommended Attributes - Definitions . . . . . . . . . . 84
5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 92 5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 94
5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 93 5.8. Interpreting owner and owner_group . . . . . . . . . . . 95
5.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 94 5.9. Character Case Attributes . . . . . . . . . . . . . . . 97
5.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 94 5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 97
5.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 94 5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 98
5.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 95 5.12. Directory Notification Attributes . . . . . . . . . . . 99
5.16. mdsthreshold . . . . . . . . . . . . . . . . . . . . . . 95 5.12.1. dir_notif_delay . . . . . . . . . . . . . . . . . . 99
5.17. Retention Attributes . . . . . . . . . . . . . . . . . . 95 5.12.2. dirent_notif_delay . . . . . . . . . . . . . . . . . 99
6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 97 5.13. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 99
6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.13.1. fs_layout_type . . . . . . . . . . . . . . . . . . . 99
6.2. File Attributes Discussion . . . . . . . . . . . . . . . 99 5.13.2. layout_alignment . . . . . . . . . . . . . . . . . . 99
6.2.1. ACL Attribute . . . . . . . . . . . . . . . . . . . 99 5.13.3. layout_blksize . . . . . . . . . . . . . . . . . . . 100
6.2.2. mode Attribute . . . . . . . . . . . . . . . . . . . 110 5.13.4. layout_hint . . . . . . . . . . . . . . . . . . . . 100
6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 111 5.13.5. layout_type . . . . . . . . . . . . . . . . . . . . 100
6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 111 5.13.6. mdsthreshold . . . . . . . . . . . . . . . . . . . . 100
6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 112 5.14. Retention Attributes . . . . . . . . . . . . . . . . . . 101
6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 113 6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 103
6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 114 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 115 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 104
6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 115 6.2.1. ACL Attribute . . . . . . . . . . . . . . . . . . . 104
7. Single-server Name Space . . . . . . . . . . . . . . . . . . 117 6.2.2. dacl and sacl Attributes . . . . . . . . . . . . . . 115
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 117 6.2.3. mode Attribute . . . . . . . . . . . . . . . . . . . 116
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 118 6.2.4. mode_set_masked Attribute . . . . . . . . . . . . . 116
7.3. Server Pseudo File System . . . . . . . . . . . . . . . 118 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 117
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 119 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 117
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 119 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 118
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 119 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 119
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 119 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 120
7.8. Security Policy and Name Space Presentation . . . . . . 120 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 121
8. File Locking and Share Reservations . . . . . . . . . . . . . 121 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 122
8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 121 7. Single-server Name Space . . . . . . . . . . . . . . . . . . 125
8.1.1. Client and Session ID . . . . . . . . . . . . . . . 122 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 126
8.1.2. State-owner and Stateid Definition . . . . . . . . . 122 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 126
8.1.3. Use of the Stateid and Locking . . . . . . . . . . . 124 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 126
8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 127 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 127
8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 127 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 127
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 128 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 127
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 128 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 128
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 129 7.8. Security Policy and Name Space Presentation . . . . . . 128
8.6.1. Client Failure and Recovery . . . . . . . . . . . . 129 8. File Locking and Share Reservations . . . . . . . . . . . . . 129
8.6.2. Server Failure and Recovery . . . . . . . . . . . . 130 8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 130
8.6.3. Network Partitions and Recovery . . . . . . . . . . 132 8.1.1. Client and Session ID . . . . . . . . . . . . . . . 130
8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 136 8.1.2. State-owner Definition . . . . . . . . . . . . . . . 130
8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 137 8.1.3. Stateid Definition . . . . . . . . . . . . . . . . . 131
8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 138 8.1.4. Use of the Stateid and Locking . . . . . . . . . . . 134
8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 139 8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 137
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 139 8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 137
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 138
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 138
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 139
8.6.1. Client Failure and Recovery . . . . . . . . . . . . 139
8.6.2. Server Failure and Recovery . . . . . . . . . . . . 140
8.6.3. Network Partitions and Recovery . . . . . . . . . . 143
8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 147
8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 148
8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 149
8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 149
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 150
8.12. Clocks, Propagation Delay, and Calculating Lease 8.12. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 140 Expiration . . . . . . . . . . . . . . . . . . . . . . . 151
8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 140 8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 151
9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 141 9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 152
9.1. Performance Challenges for Client-Side Caching . . . . . 142 9.1. Performance Challenges for Client-Side Caching . . . . . 153
9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 143 9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 153
9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 144 9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 155
9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 146 9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 157
9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 146 9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 157
9.3.2. Data Caching and File Locking . . . . . . . . . . . 147 9.3.2. Data Caching and File Locking . . . . . . . . . . . 158
9.3.3. Data Caching and Mandatory File Locking . . . . . . 149 9.3.3. Data Caching and Mandatory File Locking . . . . . . 160
9.3.4. Data Caching and File Identity . . . . . . . . . . . 149 9.3.4. Data Caching and File Identity . . . . . . . . . . . 160
9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 150 9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 161
9.4.1. Open Delegation and Data Caching . . . . . . . . . . 153 9.4.1. Open Delegation and Data Caching . . . . . . . . . . 164
9.4.2. Open Delegation and File Locks . . . . . . . . . . . 154 9.4.2. Open Delegation and File Locks . . . . . . . . . . . 165
9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 154 9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 165
9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 157 9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 168
9.4.5. Clients that Fail to Honor Delegation Recalls . . . 159 9.4.5. Clients that Fail to Honor Delegation Recalls . . . 170
9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 160 9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 171
9.5. Data Caching and Revocation . . . . . . . . . . . . . . 160 9.5. Data Caching and Revocation . . . . . . . . . . . . . . 171
9.5.1. Revocation Recovery for Write Open Delegation . . . 161 9.5.1. Revocation Recovery for Write Open Delegation . . . 172
9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 162 9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 173
9.7. Data and Metadata Caching and Memory Mapped Files . . . 164 9.7. Data and Metadata Caching and Memory Mapped Files . . . 175
9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 166 9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 177
9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 167 9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 178
10. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 168 10. Multi-Server Name Space . . . . . . . . . . . . . . . . . . . 179
10.1. Location attributes . . . . . . . . . . . . . . . . . . 168 10.1. Location attributes . . . . . . . . . . . . . . . . . . 179
10.2. File System Presence or Absence . . . . . . . . . . . . 168 10.2. File System Presence or Absence . . . . . . . . . . . . 179
10.3. Getting Attributes for an Absent File System . . . . . . 170 10.3. Getting Attributes for an Absent File System . . . . . . 181
10.3.1. GETATTR Within an Absent File System . . . . . . . . 170 10.3.1. GETATTR Within an Absent File System . . . . . . . . 181
10.3.2. READDIR and Absent File Systems . . . . . . . . . . 171 10.3.2. READDIR and Absent File Systems . . . . . . . . . . 182
10.4. Uses of Location Information . . . . . . . . . . . . . . 172 10.4. Uses of Location Information . . . . . . . . . . . . . . 183
10.4.1. File System Replication . . . . . . . . . . . . . . 172 10.4.1. File System Replication . . . . . . . . . . . . . . 183
10.4.2. File System Migration . . . . . . . . . . . . . . . 174 10.4.2. File System Migration . . . . . . . . . . . . . . . 185
10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 175 10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 186
10.5. Additional Client-side Considerations . . . . . . . . . 176 10.5. Additional Client-side Considerations . . . . . . . . . 187
10.6. Effecting File System Transitions . . . . . . . . . . . 177 10.6. Effecting File System Transitions . . . . . . . . . . . 188
10.6.1. File System Transitions and Simultaneous Access . . 178 10.6.1. File System Transitions and Simultaneous Access . . 189
10.6.2. Simultaneous Use and Transparent Transitions . . . . 179 10.6.2. Simultaneous Use and Transparent Transitions . . . . 190
10.6.3. Filehandles and File System Transitions . . . . . . 181 10.6.3. Filehandles and File System Transitions . . . . . . 192
10.6.4. Fileid's and File System Transitions . . . . . . . . 181 10.6.4. Fileid's and File System Transitions . . . . . . . . 192
10.6.5. Fsid's and File System Transitions . . . . . . . . . 182 10.6.5. Fsids and File System Transitions . . . . . . . . . 193
10.6.6. The Change Attribute and File System Transitions . . 182 10.6.6. The Change Attribute and File System Transitions . . 193
10.6.7. Lock State and File System Transitions . . . . . . . 183 10.6.7. Lock State and File System Transitions . . . . . . . 194
10.6.8. Write Verifiers and File System Transitions . . . . 186 10.6.8. Write Verifiers and File System Transitions . . . . 197
10.7. Effecting File System Referrals . . . . . . . . . . . . 186 10.7. Effecting File System Referrals . . . . . . . . . . . . 197
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 187 10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 198
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 191 10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 202
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 193 10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 204
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 193 10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 204
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 195 10.10. The Attribute fs_locations_info . . . . . . . . . . . . 206
10.10.1. The location4_server Structure . . . . . . . . . . . 198 10.10.1. The fs_locations_server4 Structure . . . . . . . . . 209
10.10.2. The location4_info Structure . . . . . . . . . . . . 203 10.10.2. The fs_locations_info4 Structure . . . . . . . . . . 214
10.10.3. The location4_item Structure . . . . . . . . . . . . 204 10.10.3. The fs_locations_item4 Structure . . . . . . . . . . 215
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 205 10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 216
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 209 11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 220
11.1. Introduction to Directory Delegations . . . . . . . . . 209 11.1. Introduction to Directory Delegations . . . . . . . . . 220
11.2. Directory Delegation Design (in brief) . . . . . . . . . 210 11.2. Directory Delegation Design . . . . . . . . . . . . . . 221
11.3. Recommended Attributes in support of Directory 11.3. Attributes in Support of Directory Notifications . . . . 222
Delegations . . . . . . . . . . . . . . . . . . . . . . 211 11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 222
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 212 11.5. Directory Delegation Recovery . . . . . . . . . . . . . 222
11.5. Directory Delegation Recovery . . . . . . . . . . . . . 212 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 222
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 212 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 222
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 212 12.2. PNFS Definitions . . . . . . . . . . . . . . . . . . . . 224
12.2. General Definitions . . . . . . . . . . . . . . . . . . 215 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 224
12.2.1. Metadata Server . . . . . . . . . . . . . . . . . . 215 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 224
12.2.2. Client . . . . . . . . . . . . . . . . . . . . . . . 215 12.2.3. Client . . . . . . . . . . . . . . . . . . . . . . . 225
12.2.3. Storage Device . . . . . . . . . . . . . . . . . . . 215 12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 225
12.2.4. Storage Protocol . . . . . . . . . . . . . . . . . . 215 12.2.5. Data Server . . . . . . . . . . . . . . . . . . . . 225
12.2.5. Control Protocol . . . . . . . . . . . . . . . . . . 216 12.2.6. Storage Protocol or Data Protocol . . . . . . . . . 225
12.2.6. Metadata . . . . . . . . . . . . . . . . . . . . . . 216 12.2.7. Control Protocol . . . . . . . . . . . . . . . . . . 225
12.2.7. Layout . . . . . . . . . . . . . . . . . . . . . . . 216 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 226
12.3. pNFS protocol semantics . . . . . . . . . . . . . . . . 217 12.2.9. Layout Types . . . . . . . . . . . . . . . . . . . . 226
12.3.1. Definitions . . . . . . . . . . . . . . . . . . . . 217 12.2.10. Layout Iomode . . . . . . . . . . . . . . . . . . . 226
12.3.2. Guarantees Provided by Layouts . . . . . . . . . . . 220 12.2.11. Layout Segment . . . . . . . . . . . . . . . . . . . 227
12.3.3. Getting a Layout . . . . . . . . . . . . . . . . . . 221 12.2.12. Device IDs . . . . . . . . . . . . . . . . . . . . . 228
12.3.4. Committing a Layout . . . . . . . . . . . . . . . . 222 12.3. PNFS Operations . . . . . . . . . . . . . . . . . . . . 228
12.3.5. Recalling a Layout . . . . . . . . . . . . . . . . . 224 12.4. PNFS Attributes . . . . . . . . . . . . . . . . . . . . 229
12.3.6. Metadata Server Write Propagation . . . . . . . . . 230 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 229
12.3.7. Crash Recovery . . . . . . . . . . . . . . . . . . . 230 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 229
12.3.8. Security Considerations . . . . . . . . . . . . . . 236 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 230
12.4. The NFSv4.1 File Layout Type . . . . . . . . . . . . . . 237 12.5.3. Committing a Layout . . . . . . . . . . . . . . . . 231
12.4.1. Session Considerations . . . . . . . . . . . . . . . 237 12.5.4. Recalling a Layout . . . . . . . . . . . . . . . . . 234
12.4.2. File Striping and Data Access . . . . . . . . . . . 237 12.5.5. Metadata Server Write Propagation . . . . . . . . . 240
12.4.3. Global Stateid Requirements . . . . . . . . . . . . 246 12.6. PNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 240
12.4.4. The Layout Iomode . . . . . . . . . . . . . . . . . 246 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 241
12.4.5. Storage Device State Propagation . . . . . . . . . . 246 12.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 241
12.4.6. Storage Device Component File Size . . . . . . . . . 249 12.7.2. Dealing with Lease Expiration on the Client . . . . 242
12.4.7. Crash Recovery Considerations . . . . . . . . . . . 249 12.7.3. Dealing with Loss of Layout State on the Metadata
12.4.8. Security Considerations for the File Layout Type . . 250 Server . . . . . . . . . . . . . . . . . . . . . . . 243
12.4.9. Alternate Approaches . . . . . . . . . . . . . . . . 250 12.7.4. Recovery from Metadata Server Restart . . . . . . . 244
13. Internationalization . . . . . . . . . . . . . . . . . . . . 251 12.7.5. Operations During Metadata Server Grace Period . . . 246
13.1. Stringprep profile for the utf8str_cs type . . . . . . . 253 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 246
13.2. Stringprep profile for the utf8str_cis type . . . . . . 254 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 247
13.3. Stringprep profile for the utf8str_mixed type . . . . . 256 12.9. Security Considerations . . . . . . . . . . . . . . . . 248
13.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 257 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 249
14. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 257 13.1. Session Considerations . . . . . . . . . . . . . . . . . 249
14.1. Error Definitions . . . . . . . . . . . . . . . . . . . 258 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 251
14.2. Operations and their valid errors . . . . . . . . . . . 271 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 251
14.3. Callback operations and their valid errors . . . . . . . 284 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 255
14.4. Errors and the operations that use them . . . . . . . . 285 13.5. Sparse and Dense Stripe Unit Packing . . . . . . . . . . 257
15. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 292 13.6. Data Server Multipathing . . . . . . . . . . . . . . . . 259
15.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 292 13.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 259
15.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 293 13.8. COMMIT Through Metadata Server . . . . . . . . . . . . . 260
16. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 298 13.9. Global Stateid Requirements . . . . . . . . . . . . . . 261
16.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 298 13.10. The Layout Iomode . . . . . . . . . . . . . . . . . . . 261
16.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 300 13.11. Data Server State Propagation . . . . . . . . . . . . . 261
16.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 302 13.11.1. Lock State Propagation . . . . . . . . . . . . . . . 262
16.4. Operation 6: CREATE - Create a Non-Regular File Object . 304 13.11.2. Open-mode Validation . . . . . . . . . . . . . . . . 262
16.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 13.11.3. File Attributes . . . . . . . . . . . . . . . . . . 263
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 307 13.12. Data Server Component File Size . . . . . . . . . . . . 263
16.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 308 13.13. Recovery Considerations . . . . . . . . . . . . . . . . 264
16.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 308 13.14. Security Considerations for the File Layout Type . . . . 265
16.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 310 14. Internationalization . . . . . . . . . . . . . . . . . . . . 265
16.9. Operation 11: LINK - Create Link to a File . . . . . . . 311 14.1. Stringprep profile for the utf8str_cs type . . . . . . . 266
16.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 312 14.2. Stringprep profile for the utf8str_cis type . . . . . . 268
16.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 316 14.3. Stringprep profile for the utf8str_mixed type . . . . . 269
16.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 317 14.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 271
16.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 318 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 271
16.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 320 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 271
16.15. Operation 17: NVERIFY - Verify Difference in 15.2. Operations and their valid errors . . . . . . . . . . . 285
Attributes . . . . . . . . . . . . . . . . . . . . . . . 321 15.3. Callback operations and their valid errors . . . . . . . 299
16.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 323 15.4. Errors and the operations that use them . . . . . . . . 300
16.17. Operation 19: OPENATTR - Open Named Attribute 16. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 307
Directory . . . . . . . . . . . . . . . . . . . . . . . 337 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 307
16.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 338 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 308
16.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 339 17. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 313
16.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 340 17.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 313
16.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 342 17.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 315
16.22. Operation 25: READ - Read from File . . . . . . . . . . 343 17.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 317
16.23. Operation 26: READDIR - Read Directory . . . . . . . . . 345 17.4. Operation 6: CREATE - Create a Non-Regular File Object . 319
16.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 349 17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
16.25. Operation 28: REMOVE - Remove File System Object . . . . 350 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 322
16.26. Operation 29: RENAME - Rename Directory Entry . . . . . 352 17.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 323
16.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 354 17.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 323
16.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 355 17.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 325
16.29. Operation 33: SECINFO - Obtain Available Security . . . 355 17.9. Operation 11: LINK - Create Link to a File . . . . . . . 326
16.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 359 17.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 327
16.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 361 17.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 331
16.32. Operation 38: WRITE - Write to File . . . . . . . . . . 362 17.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 332
16.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 367 17.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 334
16.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 369 17.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 335
16.35. Operation 42: EXCHANGE_ID - Instantiate Clientid . . . . 373 17.15. Operation 17: NVERIFY - Verify Difference in
16.36. Operation 43: CREATE_SESSION - Create New Session and Attributes . . . . . . . . . . . . . . . . . . . . . . . 337
Confirm Clientid . . . . . . . . . . . . . . . . . . . . 379 17.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 338
16.37. Operation 44: DESTROY_SESSION - Destroy existing 17.17. Operation 19: OPENATTR - Open Named Attribute
session . . . . . . . . . . . . . . . . . . . . . . . . 389 Directory . . . . . . . . . . . . . . . . . . . . . . . 352
16.38. Operation 45: FREE_STATEID - Free stateid with no 17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 354
locks . . . . . . . . . . . . . . . . . . . . . . . . . 390 17.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 355
16.39. Operation 46: GET_DIR_DELEGATION - Get a directory 17.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 356
delegation . . . . . . . . . . . . . . . . . . . . . . . 391 17.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 357
16.40. Operation 47: GETDEVICEINFO - Get Device Information . . 395 17.22. Operation 25: READ - Read from File . . . . . . . . . . 358
16.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 396 17.23. Operation 26: READDIR - Read Directory . . . . . . . . . 360
16.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 17.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 364
a layout . . . . . . . . . . . . . . . . . . . . . . . . 397 17.25. Operation 28: REMOVE - Remove File System Object . . . . 365
16.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 401 17.26. Operation 29: RENAME - Rename Directory Entry . . . . . 367
16.44. Operation 51: LAYOUTRETURN - Release Layout 17.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 369
Information . . . . . . . . . . . . . . . . . . . . . . 404 17.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 370
16.45. Operation 52: SECINFO_NO_NAME - Get Security on 17.29. Operation 33: SECINFO - Obtain Available Security . . . 370
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 406 17.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 374
16.46. Operation 53: SEQUENCE - Supply per-procedure 17.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 376
sequencing and control . . . . . . . . . . . . . . . . . 408 17.32. Operation 38: WRITE - Write to File . . . . . . . . . . 377
16.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 411 17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 382
16.48. Operation 55: TEST_STATEID - Test stateids for 17.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 383
validity . . . . . . . . . . . . . . . . . . . . . . . . 413 17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 387
16.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 414 17.36. Operation 43: CREATE_SESSION - Create New Session and
16.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 417 Confirm Client ID . . . . . . . . . . . . . . . . . . . 395
17. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 418 17.37. Operation 44: DESTROY_SESSION - Destroy existing
17.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 418 session . . . . . . . . . . . . . . . . . . . . . . . . 405
17.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 418 17.38. Operation 45: FREE_STATEID - Free stateid with no
18. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 420 locks . . . . . . . . . . . . . . . . . . . . . . . . . 406
18.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 420 17.39. Operation 46: GET_DIR_DELEGATION - Get a directory
18.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 422 delegation . . . . . . . . . . . . . . . . . . . . . . . 407
18.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 423 17.40. Operation 47: GETDEVICEINFO - Get Device Information . . 412
18.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 425 17.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 413
18.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 428 17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
18.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 429 a layout . . . . . . . . . . . . . . . . . . . . . . . . 414
18.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 432 17.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 417
18.8. Operation 10: CB_RECALL_SLOT - change flow control 17.44. Operation 51: LAYOUTRETURN - Release Layout
limits . . . . . . . . . . . . . . . . . . . . . . . . . 433 Information . . . . . . . . . . . . . . . . . . . . . . 420
18.9. Operation 11: CB_SEQUENCE - Supply callback channel 17.45. Operation 52: SECINFO_NO_NAME - Get Security on
sequencing and control . . . . . . . . . . . . . . . . . 434 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 423
18.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 436 17.46. Operation 53: SEQUENCE - Supply per-procedure
18.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible sequencing and control . . . . . . . . . . . . . . . . . 424
lock availability . . . . . . . . . . . . . . . . . . . 437 17.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 429
18.12. Operation 10044: CB_ILLEGAL - Illegal Callback 17.48. Operation 55: TEST_STATEID - Test stateids for
Operation . . . . . . . . . . . . . . . . . . . . . . . 438 validity . . . . . . . . . . . . . . . . . . . . . . . . 431
19. Security Considerations . . . . . . . . . . . . . . . . . . . 439 17.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 432
20. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 439 17.50. Operation 57: DESTROY_CLIENTID - Destroy existing
20.1. Defining new layout types . . . . . . . . . . . . . . . 439 client ID . . . . . . . . . . . . . . . . . . . . . . . 435
21. References . . . . . . . . . . . . . . . . . . . . . . . . . 440 17.51. Operation 10044: ILLEGAL - Illegal operation . . . . . . 436
21.1. Normative References . . . . . . . . . . . . . . . . . . 440 18. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 437
21.2. Informative References . . . . . . . . . . . . . . . . . 441 18.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 437
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 442 18.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 437
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 443 19. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 439
Intellectual Property and Copyright Statements . . . . . . . . . 444 19.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 439
19.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 441
19.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 442
19.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 444
19.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 447
19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 448
19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 451
19.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 452
19.9. Operation 11: CB_SEQUENCE - Supply callback channel
sequencing and control . . . . . . . . . . . . . . . . . 453
19.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 455
19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
lock availability . . . . . . . . . . . . . . . . . . . 456
19.12. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 457
20. Security Considerations . . . . . . . . . . . . . . . . . . . 458
21. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 458
21.1. Defining new layout types . . . . . . . . . . . . . . . 458
22. References . . . . . . . . . . . . . . . . . . . . . . . . . 459
22.1. Normative References . . . . . . . . . . . . . . . . . . 459
22.2. Informative References . . . . . . . . . . . . . . . . . 460
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 461
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 462
Intellectual Property and Copyright Statements . . . . . . . . . 464
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for minor described in [2]. It generally follows the guidelines for minor
versioning model laid in Section 10 of RFC 3530. However, it versioning model laid in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
features may be introduced as mandatory in a minor version"). These features may be introduced as mandatory in a minor version"). These
divergences are due to the introduction of the sessions model for divergences are due to the introduction of the sessions model for
managing non-idempotent operations and the RECLAIM_COMPLETE managing non-idempotent operations and the RECLAIM_COMPLETE
operation. These two new features are infrastructural in nature and operation. These two new features are infrastructural in nature and
simplify implementation of existing and other new features. Making simplify implementation of existing and other new features. Making
them optional would add undue complexity to protocol definition and them optional would add undue complexity to protocol definition and
implementation. NFSv4.1 accordingly updates the Minor Versioning implementation. NFSv4.1 accordingly updates the Minor Versioning
guidelines (Section 2.6). guidelines (Section 2.7).
NFSv4.1, as a minor version, is consistent with the overall goals for NFSv4.1, as a minor version, is consistent with the overall goals for
NFS Version 4, but extends the protocol so as to better meet those NFS Version 4, but extends the protocol so as to better meet those
goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
adopted some additional goals, which motivate some of the major adopted some additional goals, which motivate some of the major
extensions in minor version 1. extensions in minor version 1.
1.2. NFS Version 4 Goals 1.2. NFS Version 4 Goals
The NFS version 4 protocol is a further revision of the NFS protocol The NFS version 4 protocol is a further revision of the NFS protocol
skipping to change at page 12, line 10 skipping to change at page 13, line 10
as "layouts", which are integrated into the protocol locking model. as "layouts", which are integrated into the protocol locking model.
Clients direct requests for data access to a set of data servers Clients direct requests for data access to a set of data servers
specified by the layout via a data storage protocol which may be specified by the layout via a data storage protocol which may be
NFSv4.1 or may be another protocol. NFSv4.1 or may be another protocol.
1.4.3. File System Model 1.4.3. File System Model
The general file system model used for the NFS version 4.1 protocol The general file system model used for the NFS version 4.1 protocol
is the same as previous versions. The server file system is is the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as hierarchical with the regular files contained within being treated as
opaque byte streams. In a slight departure, file and directory names opaque octet streams. In a slight departure, file and directory
are encoded with UTF-8 to deal with the basics of names are encoded with UTF-8 to deal with the basics of
internationalization. internationalization.
The NFS version 4.1 protocol does not require a separate protocol to The NFS version 4.1 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle. provide for the initial mapping between path name and filehandle.
All file systems exported by a server are presented as a tree so that All file systems exported by a server are presented as a tree so that
all file systems are reachable from a special per-server global root all file systems are reachable from a special per-server global root
filehandle. This allows LOOKUP operations to be used to perform filehandle. This allows LOOKUP operations to be used to perform
functions previously provided by the MOUNT protocol. The server functions previously provided by the MOUNT protocol. The server
provides any necessary pseudo filesystems to bridge any gaps that provides any necessary pseudo filesystems to bridge any gaps that
arise due unexported gaps between exported file systems. arise due to unexported gaps between exported file systems.
1.4.3.1. Filehandles 1.4.3.1. Filehandles
As in previous versions of the NFS protocol, opaque filehandles are As in previous versions of the NFS protocol, opaque filehandles are
used to identify individual files and directories. Lookup-type and used to identify individual files and directories. Lookup-type and
create operations are used to go from file and directory names to the create operations are used to go from file and directory names to the
filehandle which is then used to identify the object to subsequent filehandle which is then used to identify the object to subsequent
operations. operations.
The NFS version 4.1 protocol provides support for both persistent The NFS version 4.1 protocol provides support for persistent
filehandles, guaranteed to be valid for the lifetime of the file filehandles, guaranteed to be valid for the lifetime of the file
system object designated. In addition it provides support to servers system object designated. In addition it provides support to servers
to provide filehandles with more limited validity guarantees, called to provide filehandles with more limited validity guarantees, called
volatile filehandles. volatile filehandles.
1.4.3.2. File Attributes 1.4.3.2. File Attributes
The NFS version 4.1 protocol has a rich and extensible attribute The NFS version 4.1 protocol has a rich and extensible attribute
structure. Only a small set of the defined attributes are mandatory structure. Only a small set of the defined attributes are mandatory
and must be provided by all server implementations. The other and must be provided by all server implementations. The other
attributes are known as "recommended" attributes. attributes are known as "recommended" attributes.
One significant recommended file attribute is the Access Control List One significant recommended file attribute is the Access Control List
(ACL) attribute. This attribute provides for directory and file (ACL) attribute. This attribute provides for directory and file
access control beyond the model used in NFS Versions 2 and 3. The access control beyond the model used in NFS Versions 2 and 3. The
ACL definition allows for specification specific sets of permissions ACL definition allows for specification of specific sets of
for individual users and groups. In addition, ACL inheritance allows permissions for individual users and groups. In addition, ACL
propagation of access permissions and restriction down a directory inheritance allows propagation of access permissions and restriction
tree as filesystem objects are created. down a directory tree as file system objects are created.
One other type of attribute is the named attribute. A named One other type of attribute is the named attribute. A named
attribute is an opaque byte stream that is associated with a attribute is an opaque octet stream that is associated with a
directory or file and referred to by a string name. Named attributes directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate are meant to be used by client applications as a method to associate
application specific data with a regular file or directory. application-specific data with a regular file or directory.
1.4.3.3. Multi-server Namespace 1.4.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation NFS Version 4.1 contains a number of features to allow implementation
of namespaces that cross server boundaries and that allow to and of namespaces that cross server boundaries and that allow and
facilitate a non-disruptive transfer of support for individual file facilitate a non-disruptive transfer of support for individual file
systems between servers. They are all based upon attributes that systems between servers. They are all based upon attributes that
allow one file system to specify alternate or new locations for that allow one file system to specify alternate or new locations for that
file system. file system.
These attributes may be used together with the concept of absent file These attributes may be used together with the concept of absent file
system which provide specifications for additional locations but no system which provide specifications for additional locations but no
actual file system content. This allows a number of important actual file system content. This allows a number of important
facilities: facilities:
o Location attributes may be used with absent file systems to o Location attributes may be used with absent file systems to
implement referrals whereby one server may direct the client to a implement referrals whereby one server may direct the client to a
file system provided by another server. This allows extensive file system provided by another server. This allows extensive
multi-server namespaces to be constructed. multi-server namespaces to be constructed.
o Location attributes may be provided for present file systems to o Location attributes may be provided for present file systems to
provide the locations alternate file system instances or replicas provide the locations of alternate file system instances or
to be used in the event that the current file system instance replicas to be used in the event that the current file system
becomes unavailable. instance becomes unavailable.
o Location attributes may be provided when a previously present file o Location attributes may be provided when a previously present file
system becomes absent. This allows non-disruptive migration of system becomes absent. This allows non-disruptive migration of
file systems to alternate servers. file systems to alternate servers.
1.4.4. Locking Facilities 1.4.4. Locking Facilities
As mentioned previously, NFS v4.1 is a single protocol which As mentioned previously, NFS v4.1 is a single protocol which
includes locking facilities. These locking facilities include includes locking facilities. These locking facilities include
support for many types of locks including a number of sorts of support for many types of locks including a number of sorts of
recallable locks. Recallable locks such as delegations allow the recallable locks. Recallable locks such as delegations allow the
client to be assured that certain events will not occur so long as client to be assured that certain events will not occur so long as
that lock is held. When circumstances change, the lock is recalled that lock is held. When circumstances change, the lock is recalled
via a callback via a callback request. The assurances provided by via a callback request. The assurances provided by delegations allow
delegations allow more extensive caching to be done safely when more extensive caching to be done safely when circumstances allow it.
circumstances allow it.
o Share reservations as established by OPEN operations. o Share reservations as established by OPEN operations.
o Byte-range locks. o Byte-range locks.
o File delegations which are recallable locks that assure the holder o File delegations which are recallable locks that assure the holder
that inconsistent opens and file changes cannot occur so long as that inconsistent opens and file changes cannot occur so long as
the delegation is held. the delegation is held.
o Directory delegations which are recallable delegations that assure o Directory delegations which are recallable delegations that assure
skipping to change at page 14, line 38 skipping to change at page 15, line 36
The following definitions are provided for the purpose of providing The following definitions are provided for the purpose of providing
an appropriate context for the reader. an appropriate context for the reader.
Client The "client" is the entity that accesses the NFS server's Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application which contains the resources. The client may be an application which contains the
logic to access the NFS server directly. The client may also be logic to access the NFS server directly. The client may also be
the traditional operating system client remote file system the traditional operating system client remote file system
services for a set of applications. services for a set of applications.
A client is uniquely identified by a Client Owner.
In the case of file locking the client is the entity that In the case of file locking the client is the entity that
maintains a set of locks on behalf of one or more applications. maintains a set of locks on behalf of one or more applications.
This client is responsible for crash or failure recovery for those This client is responsible for crash or failure recovery for those
locks it manages. locks it manages.
Note that multiple clients may share the same transport and Note that multiple clients may share the same transport and
multiple clients may exist on the same network node. connection and multiple clients may exist on the same network
node.
Clientid A 64-bit quantity used as a unique, short-hand reference to Client ID A 64-bit quantity used as a unique, short-hand reference
a client supplied Verifier and ID. The server is responsible for to a client supplied Verifier and client owner. The server is
supplying the Clientid. responsible for supplying the client ID.
Client Owner The client owner is a unique string, opaque to the
server, which identifies a client. Multiple network connections
and source network addresses originating those connections may
share a client owner. The server is expected to treat requests
from connections with the same client owner as coming from the
same client.
Lease An interval of time defined by the server for which the client Lease An interval of time defined by the server for which the client
is irrevocably granted a lock. At the end of a lease period the is irrevocably granted a lock. At the end of a lease period the
lock may be revoked if the lease has not been extended. The lock lock may be revoked if the lease has not been extended. The lock
must be revoked if a conflicting lock has been granted after the must be revoked if a conflicting lock has been granted after the
lease interval. lease interval.
All leases granted by a server have the same fixed interval. Note All leases granted by a server have the same fixed interval. Note
that the fixed interval was chosen to alleviate the expense a that the fixed interval was chosen to alleviate the expense a
server would have in maintaining state about variable length server would have in maintaining state about variable length
leases across server failures. leases across server failures.
Lock The term "lock" is used to refer any of record (byte- range) Lock The term "lock" is used to refer to any of record (octet-range)
locks, share reservations, delegations or layouts unless locks, share reservations, delegations or layouts unless
specifically stated otherwise. specifically stated otherwise.
Server The "Server" is the entity responsible for coordinating Server The "Server" is the entity responsible for coordinating
client access to a set of file systems. client access to a set of file systems. A server can span
multiple network addresses. In NFSv4.1, a server is a two-tiered
entity, which allows servers consisting of multiple components the
flexibility to tightly or loosely couple their components without
requiring tight synchronization among the components. Every
server has a "Server Owner" which reflects the two tiers of a
server entity.
Server Owner The "Server Owner" identifies the server to the client.
The server owner consists of a major and minor identifier. When
the client has two connections each to a peer with the same major
and minor identifier, the client assumes both peers are the same
server (the server namespace is the same via each connection), and
further assumes session and lock state is sharable across both
connections. When each peer has the same major identifier but
different minor identifier, the client assumes both peers can
serve the same namespace, but session and lock state is not
sharable across both connections.
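The client-side reasoning in the Server Owner definition can be sketched as follows. This is only an illustrative sketch, not text from either draft: the names ServerOwner, PeerRelation and classify_peers are hypothetical, as are the field types, and the "unrelated" outcome for differing major identifiers is an assumption, since the definition above does not state what the client concludes in that case.

```python
from dataclasses import dataclass
from enum import Enum


class PeerRelation(Enum):
    # Same major and minor identifier: same server, state is sharable.
    SAME_SERVER = "same server; session and lock state sharable"
    # Same major, different minor: same namespace, state is not sharable.
    SAME_NAMESPACE = "same namespace; session and lock state not sharable"
    # Assumption: the definition is silent on differing major identifiers.
    UNRELATED = "no relationship assumed"


@dataclass(frozen=True)
class ServerOwner:
    # Hypothetical representation; the definition only says the server
    # owner consists of a major and a minor identifier.
    major_id: bytes
    minor_id: int


def classify_peers(a: ServerOwner, b: ServerOwner) -> PeerRelation:
    """Classify two connection peers by their server owners."""
    if a.major_id != b.major_id:
        return PeerRelation.UNRELATED
    if a.minor_id == b.minor_id:
        return PeerRelation.SAME_SERVER
    return PeerRelation.SAME_NAMESPACE
```

A client holding two connections would call classify_peers on the server owners reported over each connection and share session and lock state only when the result is SAME_SERVER.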
Stable Storage NFS version 4 servers must be able to recover without Stable Storage NFS version 4 servers must be able to recover without
data loss from multiple power failures (including cascading power data loss from multiple power failures (including cascading power
failures, that is, several power failures in quick succession), failures, that is, several power failures in quick succession),
operating system failures, and hardware failure of components operating system failures, and hardware failure of components
other than the storage medium itself (for example, disk, other than the storage medium itself (for example, disk,
nonvolatile RAM). nonvolatile RAM).
Some examples of stable storage that are allowable for an NFS Some examples of stable storage that are allowable for an NFS
server include: server include:
skipping to change at page 15, line 48 skipping to change at page 17, line 23
intermediate storage or uninterruptible power system (UPS).
3. Server commit of data with battery-backed intermediate storage
and recovery software.
4. Cache commit with uninterruptible power system (UPS) and
recovery software.
Stateid A 128-bit quantity returned by a server that uniquely
defines the open and locking state provided by the server for a
specific open or lock owner for a specific file and type of lock.
Verifier A 64-bit quantity generated by the client that the server
can use to determine if the client has restarted and lost all
previous lock state.
1.6. Differences from NFSv4.0
The following summarizes the differences between minor version one
and the base protocol:
skipping to change at page 17, line 43 skipping to change at page 19, line 20
RPC header, and the arguments or results. Finally, privacy, usually
via encryption, is a service available with RPCSEC_GSS. Privacy is
performed on the arguments and results. Note that if privacy is
selected, integrity, authentication, and identification are enabled.
If privacy is not selected, but integrity is selected, authentication
and identification are enabled. If integrity and privacy are not
selected, but authentication is enabled, identification is enabled.
RPCSEC_GSS does not provide identification as a separate service.
Although GSS-API has an authentication service distinct from its
privacy and integrity services, GSS-API's authentication service is
not used for RPCSEC_GSS's authentication service. Instead, each RPC
request and response header is integrity protected with the GSS-API
integrity service, and this allows RPCSEC_GSS to offer per-RPC
authentication and identity. See [5] for more information.
NFSv4 clients and servers MUST support RPCSEC_GSS's integrity and
authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's
privacy service.
2.2.1.1.1.2. Security mechanisms for NFS version 4
RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
security services. Therefore NFSv4 clients and servers MUST support
skipping to change at page 20, line 34 skipping to change at page 22, line 4
constructed by combining multiple LOOKUP operations. Those can be
further combined with operations such as GETATTR, READDIR, or OPEN
plus READ to do more complicated sets of operations without incurring
additional latency.
NFSv4 also contains a considerable set of callback operations in
which the server makes an RPC directed at the client. Callback RPCs
have a similar structure to that of the normal server requests. For
the NFS version 4 protocol callbacks in all minor versions, there are
two RPC procedures, NULL and CB_COMPOUND. The CB_COMPOUND procedure
is defined in an analogous fashion to that of COMPOUND with its own
set of callback operations.
The addition of new server and callback operations within the COMPOUND
and CB_COMPOUND request framework provides a means of extending the
protocol in subsequent minor versions.
Except for a small number of operations needed for session creation,
server requests and callback requests are performed within the
context of a session. Sessions provide a client context for every
request and support robust replay protection for non-idempotent
requests.
2.4. Client Identifiers and Client Owners
For each operation that obtains or depends on locking state, the
specific client must be determinable by the server. In NFSv4, each
distinct client instance is represented by a client ID, which is a
64-bit identifier that identifies a specific client at a given time
and which is changed whenever the client or the server
re-initializes. Client IDs are used to support lock identification and
crash recovery.
In NFSv4.1, during steady state operation, the client ID associated
with each operation is derived from the session (see Section 2.10) on
which the operation is issued. Each session is associated with a
specific client ID at session creation and that client ID then
becomes the client ID associated with all requests issued using it.
Therefore, unlike NFSv4.0, the only NFSv4.1 operations possible
before a client ID is established are those directly connected with
establishing the client ID.
A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
operation using that client ID (eir_clientid as returned from
EXCHANGE_ID) is required to establish the identification on the
server. Establishment of identification by a new incarnation of the
client also has the effect of immediately releasing any locking state
that a previous incarnation of that same client might have had on the
server. Such released state would include all lock, share
reservation, and, where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with
the same client with the same identity. For discussion of delegation
state recovery, see Section 9.2.1.
Releasing such state requires that the server be able to determine
that one client instance is the successor of another. Where this
cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5)
and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests.
Client identification is encapsulated in the following Client Owner
structure:
   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
   };
The first field, co_verifier, is a client incarnation verifier that
is used to detect client reboots. Only if the co_verifier is
different from that which the server had previously recorded for the
client (as identified by the second field of the structure,
co_ownerid) does
skipping to change at page 22, line 25 skipping to change at page 23, line 47
requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version
4 server.
o The string should be the same for every server network address that
the client accesses, rather than different for each server network
address (note: the precise opposite was advised in RFC3530).
This way, if a server has multiple interfaces, the client can
trunk traffic over multiple network paths as described in
Section 2.10.3.4.1.
o The algorithm for generating the string should not assume that the
client's network address will not change, unless the client
implementation knows it is using statically assigned network
addresses. This includes changes between client incarnations and
even changes while the client is still running in its current
incarnation. This means that if the client includes just the
client's network address in the co_ownerid string, there is a real
risk, with dynamic address assignment, that after the client gives
up the network address, another client, using a similar algorithm
for generating the co_ownerid string, would generate a conflicting
co_ownerid string.
Given the above considerations, an example of a well generated
co_ownerid string is one that includes:
o If applicable, the client's statically assigned network address.
o Additional information that tends to be unique, such as one or
more of:
* The client machine's serial number (for privacy reasons, it is
best to perform some one way function on the serial number).
* A MAC address (again, a one way function should be performed).
* The timestamp of when the NFS version 4 software was first
installed on the client (though this is subject to the
previously mentioned caution about using information that is
stored in a file, because the file might only be accessible
over NFS version 4).
* A true random number. However, since this number ought to be
the same between client incarnations, this shares the same
problem as using the timestamp of the software installation.
o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user
level clients running on the same host, such as a process
identifier or other unique sequence.
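The considerations above can be sketched as a small helper. This is an illustration only, not part of the protocol: the function name make_co_ownerid, its parameters, the "/" separator, and the choice of SHA-256 as the one way function are all assumptions made for the example, and NFS4_OPAQUE_LIMIT is assumed to be 1024 as in the protocol's XDR description.

```python
import hashlib

NFS4_OPAQUE_LIMIT = 1024  # assumed value of the XDR constant


def make_co_ownerid(hw_id, static_addr=None, pid=None):
    """Build a co_ownerid byte string (hypothetical helper).

    hw_id:       bytes that tend to be unique, e.g. a MAC address or serial number
    static_addr: the client's address, only if it is statically assigned
    pid:         process identifier, only for a user level client
    """
    parts = []
    if static_addr is not None:
        # A dynamically assigned address must not be used: another
        # client could later obtain it and generate a conflicting string.
        parts.append(static_addr.encode("ascii"))
    # One way function over the hardware identifier, for privacy.
    parts.append(hashlib.sha256(hw_id).hexdigest().encode("ascii"))
    if pid is not None:
        # Distinguishes user level clients running on the same host.
        parts.append(str(pid).encode("ascii"))
    ownerid = b"/".join(parts)
    if len(ownerid) > NFS4_OPAQUE_LIMIT:
        raise ValueError("co_ownerid exceeds NFS4_OPAQUE_LIMIT")
    return ownerid
```

Because the inputs are fixed properties of the client rather than values read from a file, the same string is produced by each incarnation of the client, and the server's network address never enters the computation.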
As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given
co_ownerid string is not the same as the principal issuing the
EXCHANGE_ID.
A server may compare a client_owner4 in an EXCHANGE_ID with an
nfs_client_id4 established using SETCLIENTID using NFSv4 minor
version 0, so that an NFSv4.1 client is not forced to delay until
lease expiration for locking state established by the earlier client
using minor version 0. This requires that the client_owner4 be
constructed the same way as the nfs_client_id4. If the latter's
contents included the server's network address, and the NFSv4.1
client does not wish to use a client ID that prevents trunking, it
should issue two EXCHANGE_ID operations. The first EXCHANGE_ID will
have a client_owner4 equal to the nfs_client_id4. This will clear
the state created by the NFSv4.0 client. The second EXCHANGE_ID will
not have the server's network address. The state created for the
second EXCHANGE_ID will not have to wait for lease expiration,
because there will be no state to expire.
Once an EXCHANGE_ID has been done, and the resulting client ID
established as associated with a session, all requests made on that
session implicitly identify that client ID, which in turn designates
the client specified using the long-form client_owner4 structure.
The shorthand client identifier (a client ID) is assigned by the
server (the eir_clientid result from EXCHANGE_ID) and should be
chosen so that it will not conflict with a client ID previously
assigned by the server. This applies across server restarts or
reboots.
In the event of a server restart, a client may find out that its
current client ID is no longer valid when it receives an
NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
the characteristics of the sessions involved, specifically whether
the session is persistent (see Section 2.10.4.5).
When a session is not persistent, the client will need to create a
new session. When the existing client ID is presented to a server as
part of creating a session and that client ID is not recognized, as
would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID. When this happens,
the client must obtain a new client ID by use of the EXCHANGE_ID
operation, then use that client ID as the basis of a new session,
and then proceed to any other necessary recovery for the
server reboot case (see Section 8.6.2).
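The non-persistent case described above can be sketched with a toy model. Everything here is hypothetical and exists only for illustration: ToyServer, StaleClientID, and new_session are not protocol elements, the server keeps its state in memory, and the boot counter placed in the upper 32 bits is just one assumed way of keeping client IDs from colliding across restarts.

```python
class StaleClientID(Exception):
    """Stands in for the NFS4ERR_STALE_CLIENTID error."""


class ToyServer:
    """In-memory stand-in for a server (hypothetical, for illustration)."""

    def __init__(self):
        self.boot = 0        # incremented on every simulated reboot
        self.next_id = 1
        self.clients = {}    # client ID -> co_ownerid

    def exchange_id(self, co_ownerid):
        # Boot counter in the high bits avoids reusing an old client ID.
        clientid = (self.boot << 32) | self.next_id
        self.next_id += 1
        self.clients[clientid] = co_ownerid
        return clientid

    def create_session(self, clientid):
        if clientid not in self.clients:
            raise StaleClientID(clientid)
        return ("session", clientid)

    def reboot(self):
        self.boot += 1
        self.clients.clear()  # client IDs do not survive in this model


def new_session(server, co_ownerid, clientid=None):
    """Create a session, obtaining a fresh client ID if the old one is stale."""
    if clientid is not None:
        try:
            return clientid, server.create_session(clientid)
        except StaleClientID:
            pass  # fall through and establish a new identity
    clientid = server.exchange_id(co_ownerid)
    return clientid, server.create_session(clientid)
```

After new_session returns with a different client ID than the one passed in, a real client would continue with the remaining recovery steps for the server reboot case; the sketch stops at session creation.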
In the case of the session being persistent, the client will
re-establish communication using the existing session after the reboot.
This session will be associated with a client ID that has had state
revoked (but the persistent session is never associated with a stale
client ID, because if the session is persistent, the client ID MUST
persist), and the client will receive an indication of that fact in
the sr_status_flags field returned by the SEQUENCE operation (see
Section 17.46.4). The client can then use the existing session to do
whatever operations are necessary to determine the status of requests
outstanding at the time of reboot, while avoiding issuing new
requests, particularly any involving locking on that session. Such
requests would fail with an NFS4ERR_STALE_STATEID error, if
attempted.
See the detailed descriptions of EXCHANGE_ID (Section 17.35) and
CREATE_SESSION (Section 17.36) for a complete specification of these
operations.
2.4.1. Server Release of Client ID
NFSv4.1 introduces a new operation called DESTROY_CLIENTID
(Section 17.50) which the client SHOULD use to destroy a client ID it
no longer needs. This permits graceful, bilateral release of a
client ID.
If the server determines that the client holds no associated state
for its client ID (including sessions, opens, locks, delegations,
layouts, and wants), the server may choose to unilaterally release
the client ID. The server may make this choice for an inactive
client so that resources are not consumed by those intermittently
active clients. If the client contacts the server after this
release, the server must ensure the client receives the appropriate
error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to
establish a new identity. It should be clear that the server must be
very hesitant to release a client ID since the resulting work on the
client to recover from such an event will be the same burden as if
the server had failed and restarted. Typically a server would not
release a client ID unless there had been no activity from that
client for many minutes. As long as there are sessions, opens,
locks, delegations, layouts, or wants, the server MUST NOT release
the client ID. See Section 2.10.8.1.4 for discussion on releasing
inactive sessions.
2.4.2. Handling Client Owner Conflicts
If the co_ownerid string in an EXCHANGE_ID request is properly
constructed, and if the client takes care to use the same principal
for each successive use of EXCHANGE_ID, then, barring an active
denial of service attack, conflicts are not possible.
However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the co_ownerid string (such as the case of a
client that changes security flavors, and under the new flavor, there
is no mapping to the previous owner) will in rare cases result in a
conflict.
When the server gets an EXCHANGE_ID for a client owner that currently
has no state, or if it has state, but the lease has expired, the server
MUST allow the EXCHANGE_ID, and confirm the new client ID if followed
by the appropriate CREATE_SESSION.
When the server gets an EXCHANGE_ID for a client owner that currently
has state, or an unexpired lease, and the principal that issues the
EXCHANGE_ID is different from the principal that previously established
the client owner, the server MUST NOT destroy any state that
currently exists for the client owner. Regardless, the server has two
choices. First, it can return NFS4ERR_CLID_INUSE. Second, it can
allow the EXCHANGE_ID, and simply treat the client owner as
consisting of both the co_ownerid and the principal that issued the
EXCHANGE_ID.
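The conflict rules in this section can be summarized as a decision function. This is a minimal sketch under stated assumptions: the function name, the dictionary layout of the server's record, the allow_coexist policy flag, and the string results other than NFS4ERR_CLID_INUSE are hypothetical. It also assumes the conflict case is exactly the complement of the "no state, or lease expired" case.

```python
def handle_exchange_id(record, principal, allow_coexist=False):
    """Decide how a server might answer an EXCHANGE_ID (hypothetical sketch).

    record:        None if the client owner is unknown, else a dict with
                   keys 'principal', 'has_state', and 'lease_expired'
    principal:     the principal issuing this EXCHANGE_ID
    allow_coexist: server policy choosing between its two options
    """
    # No state, or state whose lease has expired: the request MUST be allowed.
    if record is None or not record["has_state"] or record["lease_expired"]:
        return "ALLOW"
    # Same principal as before: an ordinary renewal or new incarnation.
    if record["principal"] == principal:
        return "ALLOW"
    # Different principal with live state: existing state is never destroyed.
    if allow_coexist:
        # Treat (co_ownerid, principal) as the client owner.
        return "ALLOW_AS_DISTINCT_OWNER"
    return "NFS4ERR_CLID_INUSE"
```

Neither branch for a differing principal touches the existing record, which reflects the requirement that the established state be left intact.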
2.5. Server Owners
The Server Owner is somewhat similar to a Client Owner (Section 2.4),
but unlike the Client Owner, there is no shorthand serverid. The
Server Owner is defined in the following structure:
   struct server_owner4 {
           uint64_t        so_minor_id;
           opaque          so_major_id<NFS4_OPAQUE_LIMIT>;
   };
The Server Owner is returned in the results of EXCHANGE_ID. When the
so_major_id fields are the same in two EXCHANGE_ID results, the
connections over which each EXCHANGE_ID was sent can be assumed to
address the same Server (as defined in Section 1.5). If the so_minor_id
fields are also the same, then not only do both connections connect
to the same server, but the session and other state can be shared
across both connections. The reader is cautioned that multiple
servers may deliberately or accidentally claim to have the same
so_major_id or so_major_id/so_minor_id; the reader should examine
Section 2.10.3.4.1 and Section 17.35.
The considerations for generating an so_major_id are similar to those
for generating a co_ownerid string (see Section 2.4). The
consequences of two servers generating conflicting so_major_id values
are less dire than they are for co_ownerid conflicts because the
client can use RPCSEC_GSS to compare the authenticity of each server
(see Section 2.10.3.4.1).
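The comparison a client makes between two EXCHANGE_ID results can be sketched as follows. The ServerOwner type mirrors the fields of server_owner4, but the function name and the returned strings are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class ServerOwner(NamedTuple):
    """Mirrors the fields of server_owner4."""
    so_minor_id: int
    so_major_id: bytes


def compare_server_owners(a, b):
    """Classify two connections by the server owners they returned."""
    if a.so_major_id != b.so_major_id:
        return "different servers"
    if a.so_minor_id != b.so_minor_id:
        # Same namespace, but session and lock state cannot be shared.
        return "same server, state not sharable"
    return "same server, state sharable"
```

A match is only a claim made by each peer. As cautioned above, two servers may present the same identifiers, so a client would still want to verify each server's authenticity, for example with RPCSEC_GSS, before sharing state across connections.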
2.6. Security Service Negotiation
With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its file system namespace
that are available for use by NFS clients. These points can be
considered security policy boundaries, and in some NFS
implementations are tied to NFS export points. In turn the NFS
server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use.
The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See Section 20 for further discussion.
2.6.1. NFSv4 Security Tuples
An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace. Each security tuple
consists of a security flavor (see Section 2.2.1.1), and if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service.
2.6.2. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security tuple is to be
used for server access. In general, the client will not have to use
either operation except during initial communication with the server
or when the client crosses security policy boundaries at the server.
It is possible that the server's policies change during the client's
interaction, therefore forcing the client to negotiate a new security
tuple.
Where the use of different security tuples would affect the type of
access that would be allowed if a request was issued over the same
connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.
read-only vs. read-write access), security tuples that allow greater
access should be presented first. Where the general level of access
is the same and different security flavors limit the range of
skipping to change at page 26, line 19 skipping to change at page 28, line 43
Where the use of different security tuples would affect the type of Where the use of different security tuples would affect the type of
access that would be allowed if a request was issued over the same access that would be allowed if a request was issued over the same
connection used for the SECINFO or SECINFO_NO_NAME operation (e.g. connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.
read-only vs. read-write) access, security tuples that allow greater read-only vs. read-write) access, security tuples that allow greater
access should be presented first. Where the general level of access access should be presented first. Where the general level of access
is the same and different security flavors limit the range of is the same and different security flavors limit the range of
principals whose privileges are recognized (e.g. allowing or principals whose privileges are recognized (e.g. allowing or
disallowing root access), flavors supporting the greatest range of disallowing root access), flavors supporting the greatest range of
principals should be listed first. principals should be listed first.
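As an illustration only (not part of either draft), the ordering guidance above can be sketched as a two-key sort. The SecTuple type and its numeric access_level and principal_range fields are hypothetical stand-ins for however a server ranks its tuples; neither draft defines such fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecTuple:
    """Hypothetical server-side description of one security tuple."""
    name: str
    access_level: int     # assumption: higher means greater access (2 = read-write, 1 = read-only)
    principal_range: int  # assumption: higher means a wider range of recognized principals


def order_for_secinfo(tuples):
    """Order tuples as the text suggests: greater access first, and for
    equal access, the widest range of recognized principals first."""
    return sorted(tuples, key=lambda t: (-t.access_level, -t.principal_range))
```

A client that walks the returned list front to back therefore tries the tuples that grant the most access first.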
2.5.3. Security Error 2.6.3. Security Error
Based on the assumption that each NFS version 4 client and server Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file
access to the server with one of the minimal security tuples. During access to the server with one of the minimal security tuples. During
communication with the server, the client may receive an NFS error of communication with the server, the client may receive an NFS error of
NFS4ERR_WRONGSEC. This error allows the server to notify the client NFS4ERR_WRONGSEC. This error allows the server to notify the client
that the security tuple currently being used contravenes the that the security tuple currently being used contravenes the
server's security policy. The client is then responsible for server's security policy. The client is then responsible for
determining (see Section 2.5.3.1) what security tuples are available determining (see Section 2.6.3.1) what security tuples are available
at the server and choose one which is appropriate for the client. at the server and choosing one which is appropriate for the client.
2.5.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME 2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME
This section explains the mechanics of NFSv4.1 security This section explains the mechanics of NFSv4.1 security
negotiation. Unless noted otherwise, for any mention of PUTFH in negotiation. The term "put filehandle operation" refers to
this section, the reader should interpret it as applying to PUTROOTFH PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH.
and PUTPUBFH in addition to PUTFH.
2.5.3.1.1. PUTFH + LOOKUP (or OPEN by Name) 2.6.3.1.1. Put Filehandle Operation + SAVEFH
The client is saving a filehandle for a future RESTOREFH. The server
MUST NOT return NFS4ERR_WRONGSEC to either the put filehandle operation
or SAVEFH.
2.6.3.1.2. Two or More Put Filehandle Operations
For a series of N put filehandle operations, the server MUST NOT
return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.
The Nth put filehandle operation is handled as if it is the first in
a series of operations, and the second in the series of operations is
not a put filehandle operation. For example if the server received
PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for
NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is
processed according to Section 2.6.3.1.3.
2.6.3.1.3. Put Filehandle Operation + LOOKUP (or OPEN by Name)
This situation also applies to a put filehandle operation followed by This situation also applies to a put filehandle operation followed by
an OPEN operation that specifies a component name. a LOOKUP or an OPEN operation that specifies a component name.
In this situation, the client is potentially crossing a security In this situation, the client is potentially crossing a security
policy boundary, and the set of security tuples the parent directory policy boundary, and the set of security tuples the parent directory
supports differ from those of the child. The server implementation supports differ from those of the child. The server implementation
may decide whether to impose any restrictions on security policy may decide whether to impose any restrictions on security policy
administration. There are at least three approaches administration. There are at least three approaches
(sec_policy_child is the tuple set of the child export, (sec_policy_child is the tuple set of the child export,
sec_policy_parent is that of the parent). sec_policy_parent is that of the parent).
a) sec_policy_child <= sec_policy_parent (<= for subset). This a) sec_policy_child <= sec_policy_parent (<= for subset). This
skipping to change at page 27, line 19 skipping to change at page 30, line 14
b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection,
{} for the empty set). This means that the security tuples {} for the empty set). This means that the security tuples
specified on the security policy of a child directory always has a specified on the security policy of a child directory always has a
non empty intersection with that of the parent. non empty intersection with that of the parent.
c) sec_policy_child ^ sec_policy_parent == {}. This means that c) sec_policy_child ^ sec_policy_parent == {}. This means that
the set of tuples specified on the security policy of a child the set of tuples specified on the security policy of a child
directory may not intersect with that of the parent. In other directory may not intersect with that of the parent. In other
words, there are no restrictions on how the system administrator words, there are no restrictions on how the system administrator
may set. may set up these tuples.
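The three approaches can be restated with ordinary set operations. The sketch below is illustrative only: classify_policy is a hypothetical helper, tuple sets are modeled as Python sets of strings, and the function reports the most restrictive approach that a given parent/child pair satisfies.

```python
def classify_policy(sec_policy_child, sec_policy_parent):
    """Return "a", "b", or "c" for the most restrictive approach that
    this child/parent pair of tuple sets satisfies."""
    child = frozenset(sec_policy_child)
    parent = frozenset(sec_policy_parent)
    if child <= parent:   # (a) child is a subset of parent
        return "a"
    if child & parent:    # (b) non-empty intersection
        return "b"
    return "c"            # (c) no common tuple; unrestricted administration
```

For example, a child export offering only "krb5" under a parent offering "krb5" and "krb5i" falls under approach (a), while a child offering only "sys" under a parent offering only "krb5" is possible only under approach (c).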
For a server to support approach (b) (when client chooses a flavor For a server to support approach (b) (when client chooses a flavor
that is not a member of sec_policy_parent) and (c), PUTFH must NOT that is not a member of sec_policy_parent) and (c), the put
return NFS4ERR_WRONGSEC in case of security mismatch. Instead, it filehandle operation must NOT return NFS4ERR_WRONGSEC in case of
should be returned from the LOOKUP (or OPEN by component name) that security mismatch. Instead, it should be returned from the LOOKUP
follows. (or OPEN by component name) that follows.
Since the above guideline does not contradict approach (a), it should Since the above guideline does not contradict approach (a), it should
be followed in general. Even if approach (a) is implemented, it is be followed in general. Even if approach (a) is implemented, it is
possible for the security tuple used to be acceptable for the target possible for the security tuple used to be acceptable for the target
of LOOKUP but not for the filehandles used in PUTFH. The PUTFH could of LOOKUP but not for the filehandles used in the put filehandle
really be a PUTROOTFH or PUTPUBFH, where the client does not know the operation. The put filehandle operation could be a PUTROOTFH or
security tuples for the root or public filehandle. Or the security PUTPUBFH, where the client cannot know the security tuples for the
policy for the filehandle used by PUTFH could have changed since the root or public filehandle. Or the security policy for the filehandle
used by the put filehandle operation could have changed since the
time the filehandle was obtained. time the filehandle was obtained.
Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in
response to PUTFH, PUTROOTFH, or PUTPUBFH if the operation is response to the put filehandle operation if the operation is
immediately followed by a LOOKUP or an OPEN by component name. immediately followed by a LOOKUP or an OPEN by component name.
2.5.3.1.2. PUTFH + LOOKUPP 2.6.3.1.4. Put Filehandle Operation + LOOKUPP
Since SECINFO only works its way down, there is no way LOOKUPP can Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME
solves this issue because via style "parent", it works in the solves this issue because via style SECINFO_STYLE4_PARENT, it works
opposite direction as SECINFO. As with Section 2.5.3.1.1, PUTFH must in the opposite direction as SECINFO. As with Section 2.6.3.1.3, the
not return NFS4ERR_WRONGSEC whenever it is followed by LOOKUPP. If put filehandle operation must not return NFS4ERR_WRONGSEC whenever it
the server does not support SECINFO_NO_NAME, the client's only is followed by LOOKUPP. If the server does not support
recourse is to issue the PUTFH, LOOKUPP, GETFH sequence of operations SECINFO_NO_NAME, the client's only recourse is to issue the put
with every security tuple it supports. filehandle operation, LOOKUPP, GETFH sequence of operations with
every security tuple it supports.
Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server
MUST NOT return NFS4ERR_WRONGSEC in response to PUTFH, PUTROOTFH, or MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle
PUTPUBFH if the operation is immediately followed by a LOOKUPP. operation if the operation is immediately followed by a LOOKUPP.
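The fallback described above, for a server without SECINFO_NO_NAME, amounts to a simple probing loop on the client. The sketch below is an assumption-laden illustration: send_compound is a hypothetical callable that issues the given operations under one security tuple and returns a status string, and tuples are modeled as plain strings.

```python
def probe_parent_security(supported_tuples, send_compound):
    """Try PUTFH, LOOKUPP, GETFH with each tuple the client supports and
    return the first tuple not rejected with NFS4ERR_WRONGSEC, or None."""
    for sec_tuple in supported_tuples:
        # send_compound is hypothetical: (tuple, list of op names) -> status
        status = send_compound(sec_tuple, ["PUTFH", "LOOKUPP", "GETFH"])
        if status != "NFS4ERR_WRONGSEC":
            return sec_tuple
    return None
```

The cost is one round trip per candidate tuple, which is the inefficiency SECINFO_NO_NAME with the parent style avoids.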
2.5.3.1.3. PUTFH + SECINFO or PUTFH + SECINFO_NO_NAME 2.6.3.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME
A security sensitive client is allowed to choose a strong security A security sensitive client is allowed to choose a strong security
tuple when querying a server to determine a file object's permitted tuple when querying a server to determine a file object's permitted
security tuples. The security tuple chosen by the client does not security tuples. The security tuple chosen by the client does not
have to be included in the tuple list of the security policy of the have to be included in the tuple list of the security policy of the
either parent directory indicated in PUTFH, or the child file object either parent directory indicated in the put filehandle operation, or
indicated in SECINFO (or any parent directory indicated in the child file object indicated in SECINFO (or any parent directory
SECINFO_NO_NAME). Of course the server has to be configured for indicated in SECINFO_NO_NAME). Of course the server has to be
whatever security tuple the client selects, otherwise the request configured for whatever security tuple the client selects, otherwise
will fail at RPC layer with an appropriate authentication error. the request will fail at RPC layer with an appropriate authentication
error.
In theory, there is no connection between the security flavor used by In theory, there is no connection between the security flavor used by
SECINFO or SECINFO_NO_NAME and those supported by the security SECINFO or SECINFO_NO_NAME and those supported by the security
policy. But in practice, the client may start looking for strong policy. But in practice, the client may start looking for strong
flavors from those supported by the security policy, followed by flavors from those supported by the security policy, followed by
those in the mandatory set. those in the mandatory set.
The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to PUTFH whenever The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put
it is immediately followed by SECINFO or SECINFO_NO_NAME. The filehandle operation whenever it is immediately followed by SECINFO
NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or or SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return
SECINFO_NO_NAME. NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.
2.5.3.1.4. PUTFH + PUTFH
This is a nonsensical situation, because the first put filehandle
operation is wasted. The NFSv4.1 server MAY return NFS4ERR_WRONGSEC
to the first PUTFH, or it MAY NOT. If it does not, it then processes
the subsequent PUTFH and any operation that follows it according to
the rules listed in Section 2.5.3.1.
2.5.3.1.5. PUTFH + Nothing 2.6.3.1.6. Put Filehandle Operation + Nothing
This too is nonsensical because the PUTFH is wasted. The NFSv4.1 The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.
server MAY or MAY NOT return NFS4ERR_WRONGSEC.
2.5.3.1.6. PUTFH + Anything Else 2.6.3.1.7. Put Filehandle Operation + Anything Else
"Anything Else" includes OPEN by filehandle. "Anything Else" includes OPEN by filehandle.
The security policy enforcement applies to the filehandle specified The security policy enforcement applies to the filehandle specified
in PUTFH. Therefore PUTFH must return NFS4ERR_WRONGSEC in case of in the put filehandle operation. Therefore PUTFH must return
security tuple mismatch. This avoids the NFS4ERR_WRONGSEC in case of security tuple mismatch.
complexity of adding NFS4ERR_WRONGSEC as an allowable error to every This avoids the complexity of adding NFS4ERR_WRONGSEC as an
other operation. allowable error to every other operation.
PUTFH + SECINFO_NO_NAME (style "current_fh") is an efficient way for A COMPOUND containing the series put filehandle operation +
the client to recover from NFS4ERR_WRONGSEC. SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way
for the client to recover from NFS4ERR_WRONGSEC.
The NFSv4.1 server, MUST NOT return NFS4ERR_WRONGSEC to any operation The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
other than LOOKUP, LOOKUPP, and OPEN (by component name). other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by
component name).
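The per-case rules in the revised subsections (2.6.3.1.1 through 2.6.3.1.7) can be collapsed into one decision on the operation that follows a put filehandle operation. This is a minimal sketch, not text from either draft: operations are modeled as plain strings, and "OPEN_BY_NAME" and "OPEN_BY_FH" are hypothetical labels, since on the wire the two forms of OPEN are distinguished by its arguments rather than by operation name.

```python
PUT_FH_OPS = {"PUTROOTFH", "PUTPUBFH", "PUTFH", "RESTOREFH"}

# Followers for which the put filehandle operation MUST NOT return
# NFS4ERR_WRONGSEC. "OPEN_BY_NAME" is a hypothetical label.
EXEMPT_FOLLOWERS = {
    "SAVEFH", "LOOKUP", "OPEN_BY_NAME", "LOOKUPP",
    "SECINFO", "SECINFO_NO_NAME",
}


def put_fh_may_return_wrongsec(ops, index):
    """Return True if the put filehandle operation at ops[index] is the
    one that reports NFS4ERR_WRONGSEC on a security tuple mismatch."""
    if ops[index] not in PUT_FH_OPS:
        raise ValueError("ops[index] is not a put filehandle operation")
    if index + 1 >= len(ops):
        return False  # put filehandle operation + nothing
    follower = ops[index + 1]
    if follower in PUT_FH_OPS:
        return False  # one of the first N-1 in a series
    return follower not in EXEMPT_FOLLOWERS  # "anything else" case
```

For the series PUTFH, PUTROOTFH, LOOKUP this returns False at both index 0 and index 1, so any mismatch is reported by the LOOKUP; for PUTFH, READ it returns True.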
2.6. Minor Versioning 2.7. Minor Versioning
To address the requirement of an NFS protocol that can evolve as the To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning. framework to allow for future minor changes or versioning.
The base assumption with respect to minor versioning is that any The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be future accepted minor version must follow the IETF process and be
documented in a standards track RFC. Therefore, each minor version documented in a standards track RFC. Therefore, each minor version
number will correspond to an RFC. Minor version zero of the NFS number will correspond to an RFC. Minor version zero of the NFS
version 4 protocol is represented by [2], and minor version one is version 4 protocol is represented by [2], and minor version one is
skipping to change at page 30, line 4 skipping to change at page 32, line 47
bitmap4, and GETATTR4res. bitmap4, and GETATTR4res.
This allows for the expansion of the attribute model to allow This allows for the expansion of the attribute model to allow
for future growth or adaptation. for future growth or adaptation.
* Minor version X must append any new attributes after the last * Minor version X must append any new attributes after the last
documented attribute. documented attribute.
Since attribute results are specified as an opaque array of Since attribute results are specified as an opaque array of
per-attribute XDR encoded results, the complexity of adding per-attribute XDR encoded results, the complexity of adding
new attributes in the midst of the current definitions will new attributes in the midst of the current definitions would
be too burdensome. be too burdensome.
3. Minor versions must not modify the structure of an existing 3. Minor versions must not modify the structure of an existing
operation's arguments or results. operation's arguments or results.
Again the complexity of handling multiple structure definitions Again the complexity of handling multiple structure definitions
for a single operation is too burdensome. New operations should for a single operation is too burdensome. New operations should
be added instead of modifying existing structures for a minor be added instead of modifying existing structures for a minor
version. version.
skipping to change at page 31, line 32 skipping to change at page 34, line 29
feature as mandatory. On the other hand, some classes of feature as mandatory. On the other hand, some classes of
features are infrastructural and have broad effects. Allowing features are infrastructural and have broad effects. Allowing
such features to not be mandatory complicates implementation of such features to not be mandatory complicates implementation of
the minor version. the minor version.
13. A client MUST NOT attempt to use a stateid, filehandle, or 13. A client MUST NOT attempt to use a stateid, filehandle, or
similar returned object from the COMPOUND procedure with minor similar returned object from the COMPOUND procedure with minor
version X for another COMPOUND procedure with minor version Y, version X for another COMPOUND procedure with minor version Y,
where X != Y. where X != Y.
2.7. Non-RPC-based Security Services 2.8. Non-RPC-based Security Services
As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for
identification, authentication, integrity, and privacy. NFSv4 itself identification, authentication, integrity, and privacy. NFSv4 itself
provides additional security services as described in the next provides additional security services as described in the next
several subsections. several subsections.
2.7.1. Authorization 2.8.1. Authorization
Authorization to access a file object via an NFSv4 operation is Authorization to access a file object via an NFSv4 operation is
ultimately determined by the NFSv4 server. A client can predetermine ultimately determined by the NFSv4 server. A client can predetermine
its access to a file object via the OPEN (Section 16.16) and the its access to a file object via the OPEN (Section 17.16) and the
ACCESS (Section 16.1) operations. ACCESS (Section 17.1) operations.
Principals with appropriate access rights can modify the Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 16.30) authorization on a file object via the SETATTR (Section 17.30)
operation. Four attributes that affect access rights are: mode, operation. Four attributes that affect access rights are: mode,
owner, owner_group, and acl. See Section 5. owner, owner_group, and acl. See Section 5.
2.7.2. Auditing 2.8.2. Auditing
NFSv4 provides auditing on a per file object basis, via the ACL NFSv4 provides auditing on a per file object basis, via the ACL
attribute as described in Section 6. It is outside the scope of this attribute as described in Section 6. It is outside the scope of this
specification to specify audit log formats or management policies. specification to specify audit log formats or management policies.
2.7.3. Intrusion Detection 2.8.3. Intrusion Detection
NFSv4 provides alarm control on a per file object basis, via the ACL NFSv4 provides alarm control on a per file object basis, via the ACL
attribute as described in Section 6. Alarms may serve as the basis attribute as described in Section 6. Alarms may serve as the basis
for instrusion detection. It is outside the scope of this for intrusion detection. It is outside the scope of this
specification to specify heuristics for detecting intrusion via specification to specify heuristics for detecting intrusion via
alarms. alarms.
2.8. Transport Layers 2.9. Transport Layers
2.8.1. Required and Recommended Properties of Transports 2.9.1. Required and Recommended Properties of Transports
NFSv4 works over RDMA and non-RDMA-based transports with the NFSv4 works over RDMA and non-RDMA-based transports with the
following attributes: following attributes:
o The transport supports reliable delivery of data, which NFSv4 o The transport supports reliable delivery of data, which NFSv4
requires but neither NFSv4 nor RPC has facilities for ensuring. requires but neither NFSv4 nor RPC has facilities for ensuring.
[20] [20]
o The transport delivers data in the order it was sent. Ordered o The transport delivers data in the order it was sent. Ordered
delivery simplifies detection of transmit errors, and simplifies delivery simplifies detection of transmit errors, and simplifies
skipping to change at page 32, line 46 skipping to change at page 35, line 40
network protocol, any transport used between NFS and IP MUST be among network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols. At the the IETF-approved congestion control transport protocols. At the
time this document was written, the only two transports that had the time this document was written, the only two transports that had the
above attributes were TCP and SCTP. To enhance the possibilities for above attributes were TCP and SCTP. To enhance the possibilities for
interoperability, an NFS version 4 implementation MUST support interoperability, an NFS version 4 implementation MUST support
operation over the TCP transport protocol. operation over the TCP transport protocol.
Even if NFS version 4 is used over a non-IP network protocol, it is Even if NFS version 4 is used over a non-IP network protocol, it is
RECOMMENDED that the transport support congestion control. RECOMMENDED that the transport support congestion control.
Note that it is permissible for connectionless transports to be used It is permissible for a connectionless transport to be used under
under NFSv4.1, however reliable and in-order delivery of data is NFSv4.1, however reliable and in-order delivery of data by the
still required. NFSv4.1 assumes that a client transport address and connectionless transport is still required. NFSv4.1 assumes that a
server transport address used to send data over a transport together client transport address and server transport address used to send
constitute a connection, even if the underlying transport eschews the data over a transport together constitute a connection, even if the
concept of a connection. underlying transport eschews the concept of a connection.
2.8.2. Client and Server Transport Behavior 2.9.2. Client and Server Transport Behavior
If a connection-oriented transport (e.g. TCP) is used the client and If a connection-oriented transport (e.g. TCP) is used the client and
server SHOULD use long lived connections for at least three reasons: server SHOULD use long lived connections for at least three reasons:
1. This will prevent the weakening of the transport's congestion 1. This will prevent the weakening of the transport's congestion
control mechanisms via short lived connections. control mechanisms via short lived connections.
2. This will improve performance for the WAN environment by 2. This will improve performance for the WAN environment by
eliminating the need for connection setup handshakes. eliminating the need for connection setup handshakes.
3. The NFSv4.1 callback model differs from NFSv4.0, and requires the 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the
client and server to maintain a client-created channel (see client and server to maintain a client-created channel (see
Section 2.9.3.4) for the server to use. Section 2.10.3.4) for the server to use.
In order to reduce congestion, if a connection-oriented transport is In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure, used, and the request is not the NULL procedure,
o A client (or the server, if issuing a callback), MUST NOT retry a o A requester MUST NOT retry a request unless the connection the
request unless the connection the request was issued over was request was issued over was disconnected before the reply was
disconnected before the reply was received. received.
o A server (or the client, if receiving a callback), MUST NOT o A replier MUST NOT silently drop a request, even if the request is
silently drop a request, even if the request is a retry. (The a retry. (The silent drop behavior of RPCSEC_GSS [5] does not
silent drop behavior of RPCSEC_GSS [5] does not apply because this apply because this behavior happens at the RPCSEC_GSS layer, a
behavior happens at the RPCSEC_GSS layer, a lower layer in the lower layer in the request processing). Instead, the replier
request processing). Instead, the server SHOULD return an SHOULD return an appropriate error (see Section 2.10.4.1) or it
appropriate error (see Section 2.9.4.1) or it MAY disconnect the MAY disconnect the connection.
connection.
When using RDMA transports there are other reasons not tolerating When using RDMA transports there are other reasons for not tolerating
retries over the same connection: retries over the same connection:
o RDMA transports use "credits" to enforce flow control, where a o RDMA transports use "credits" to enforce flow control, where a
credit is a right to a peer to transmit a message. If one peer credit is a right to a peer to transmit a message. If one peer
were to retransmit a request (or reply), it would consume an were to retransmit a request (or reply), it would consume an
additional credit. If the server retransmitted a reply, it would additional credit. If the replier retransmitted a reply, it would
certainly result in an RDMA connection loss, since the client certainly result in an RDMA connection loss, since the requester
would typically only post a single receive buffer for each would typically only post a single receive buffer for each
request. If the client retransmitted a request, the additional request. If the requester retransmitted a request, the additional
credit consumed on the server might lead to RDMA connection credit consumed on the server might lead to RDMA connection
failure unless the client accounted for it and decreased its failure unless the client accounted for it and decreased its
available credit, leading to wasted resources. available credit, leading to wasted resources.
o RDMA credits present a new issue to the reply cache in NFSv4.1. o RDMA credits present a new issue to the reply cache in NFSv4.1.
The reply cache may be used when a connection within a session is The reply cache may be used when a connection within a session is
lost, such as after the client reconnects. Credit information is lost, such as after the client reconnects. Credit information is
a dynamic property of the RDMA connection, and stale values must a dynamic property of the RDMA connection, and stale values must
not be replayed from the cache. This implies that the reply cache not be replayed from the cache. This implies that the reply cache
contents must not be blindly used when replies are issued from it, contents must not be blindly used when replies are issued from it,
and credit information appropriate to the channel must be and credit information appropriate to the channel must be
refreshed by the RPC layer. refreshed by the RPC layer.
In addition, the sender of an NFSv4.1 request is not allowed to stop In addition, the NFSv4.1 requester is not allowed to stop waiting for
waiting for a reply, as described in Section 2.9.4.2. a reply, as described in Section 2.10.4.2.
2.8.3. Ports 2.9.3. Ports
Historically, NFS version 2 and version 3 servers have resided on Historically, NFS version 2 and version 3 servers have resided on
port 2049. The registered port 2049 RFC3232 [21] for the NFS port 2049. The registered port 2049 RFC3232 [21] for the NFS
protocol should be the default configuration. NFSv4 clients SHOULD protocol should be the default configuration. NFSv4 clients SHOULD
NOT use the RPC binding protocols as described in RFC1833 [22]. NOT use the RPC binding protocols as described in RFC1833 [22].
2.9. Session 2.10. Session
2.9.1. Motivation and Overview 2.10.1. Motivation and Overview
Previous versions and minor versions of NFS have suffered from the Previous versions and minor versions of NFS have suffered from the
following: following:
o Lack of support for exactly once semantics (EOS). This includes o Lack of support for exactly once semantics (EOS). This includes
lack of support for EOS through server failure and recovery. lack of support for EOS through server failure and recovery.
o Limited callback support, including no support for sending o Limited callback support, including no support for sending
callbacks through firewalls, and races between responses from callbacks through firewalls, and races between responses from
normal requests, and callbacks. normal requests, and callbacks.
skipping to change at page 34, line 44 skipping to change at page 37, line 38
o Requiring machine credentials for fully secure operation. o Requiring machine credentials for fully secure operation.
Through the introduction of a session, NFSv4.1 addresses the above Through the introduction of a session, NFSv4.1 addresses the above
shortfalls with practical solutions: shortfalls with practical solutions:
o EOS is enabled by a reply cache with a bounded size, making it o EOS is enabled by a reply cache with a bounded size, making it
feasible to keep on persistent storage and enable EOS through feasible to keep on persistent storage and enable EOS through
server failure and recovery. One reason that previous revisions server failure and recovery. One reason that previous revisions
of NFS did not support EOS was because some EOS approaches often of NFS did not support EOS was because some EOS approaches often
limited parallelism. As will be explained in Section 2.9.4, limited parallelism. As will be explained in Section 2.10.4,
NFSv4.1 supports both EOS and unlimited parallelism. NFSv4.1 supports both EOS and unlimited parallelism.
o The NFSv4.1 client creates transport connections and o The NFSv4.1 client creates transport connections and
gives them to the server for sending callbacks, thus solving the gives them to the server for sending callbacks, thus solving the
firewall issue (Section 16.34). Races between responses from firewall issue (Section 17.34). Races between responses from
client requests, and callbacks caused by the requests are detected client requests, and callbacks caused by the requests are detected
via the session's sequencing properties which are a byproduct of via the session's sequencing properties which are a byproduct of
EOS (Section 2.9.4.3). EOS (Section 2.10.4.3).
o The NFSv4.1 client can add an arbitrary number of connections to o The NFSv4.1 client can add an arbitrary number of connections to
the session, and thus provide trunking (Section 2.9.3.4.1). the session, and thus provide trunking (Section 2.10.3.4.1).
o The NFSv4.1 session produces a session key independent of client o The NFSv4.1 session produces a session key independent of client
and server machine credentials which can be used to compute a and server machine credentials which can be used to compute a
digest for protecting key session management operations digest for protecting key session management operations
(Section 2.9.6.3). (Section 2.10.6.3).
o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
use by the session's callback channel that do not require the use by the session's callback channel that do not require the
server to authenticate to a client machine principal server to authenticate to a client machine principal
(Section 2.9.6.2). (Section 2.10.6.2).
A session is a dynamically created, long-lived server object created A session is a dynamically created, long-lived server object created
by a client, used over time from one or more transport connections. by a client, used over time from one or more transport connections.
Its function is to maintain the server's state relative to the Its function is to maintain the server's state relative to the
connection(s) belonging to a client instance. This state is entirely connection(s) belonging to a client instance. This state is entirely
independent of the connection itself, and indeed the state exists independent of the connection itself, and indeed the state exists
whether the connection exists or not (though locks, delegations, etc. whether the connection exists or not (though locks, delegations, etc.
generally expire in the extended absence of an open connection). generally expire in the extended absence of an open connection).
The session in effect becomes the object representing an active The session in effect becomes the object representing an active
client on a set of zero or more connections. client on a set of zero or more connections.
2.9.2. NFSv4 Integration 2.10.2. NFSv4 Integration
Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
infrastructure change like sessions would require a new major version infrastructure change like sessions would require a new major version
number to an RPC program like NFS. However, because NFSv4 number to an RPC program like NFS. However, because NFSv4
encapsulates its functionality in a single procedure, COMPOUND, and encapsulates its functionality in a single procedure, COMPOUND, and
because COMPOUND can support an arbitrary number of operations, because COMPOUND can support an arbitrary number of operations,
sessions are almost trivially added. COMPOUND includes a minor sessions are almost trivially added. COMPOUND includes a minor
version number field, and for NFSv4.1 this minor version is set to 1. version number field, and for NFSv4.1 this minor version is set to 1.
When the NFSv4 server processes a COMPOUND with the minor version set When the NFSv4 server processes a COMPOUND with the minor version set
to 1, it expects a different set of operations than it does for to 1, it expects a different set of operations than it does for
NFSv4.0. One operation it expects is the SEQUENCE operation, which NFSv4.0. One operation it expects is the SEQUENCE operation, which
is required for every COMPOUND that operates over an established is required for every COMPOUND that operates over an established
session. session.
2.9.2.1. SEQUENCE and CB_SEQUENCE 2.10.2.1. SEQUENCE and CB_SEQUENCE
In NFSv4.1, when the SEQUENCE operation is present, it is always the In NFSv4.1, when the SEQUENCE operation is present, it is always the
first operation in the COMPOUND procedure. The primary purpose of first operation in the COMPOUND procedure. The primary purpose of
SEQUENCE is to carry the session identifier. The session identifier SEQUENCE is to carry the session identifier. The session identifier
associates all other operations in the COMPOUND procedure with a associates all other operations in the COMPOUND procedure with a
particular session. SEQUENCE also contains required information for particular session. SEQUENCE also contains required information for
maintaining EOS (see Section 2.9.4). Session-enabled NFSv4.1 maintaining EOS (see Section 2.10.4). Session-enabled NFSv4.1
COMPOUND requests thus have the form: COMPOUND requests thus have the form:
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
| tag | minorversion | numops |SEQUENCE op | op + args | ... | tag | minorversion | numops |SEQUENCE op | op + args | ...
| | (== 1) | (limited) | + args | | | | (== 1) | (limited) | + args | |
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
and the reply's structure is: and the reply's structure is:
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
skipping to change at page 36, line 25 skipping to change at page 39, line 24
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
//-----------------------+---- //-----------------------+----
// status + op + results | ... // status + op + results | ...
//-----------------------+---- //-----------------------+----
A CB_COMPOUND procedure request and reply has a similar form, but A CB_COMPOUND procedure request and reply has a similar form, but
instead of a SEQUENCE operation, there is a CB_SEQUENCE operation, instead of a SEQUENCE operation, there is a CB_SEQUENCE operation,
and there is an additional field called "callback_ident", which is and there is an additional field called "callback_ident", which is
superfluous in NFSv4.1. CB_SEQUENCE has the same information as superfluous in NFSv4.1. CB_SEQUENCE has the same information as
SEQUENCE, but includes other information needed to solve callback SEQUENCE, but includes other information needed to solve callback
races (Section 2.9.4.3). races (Section 2.10.4.3).
2.9.2.2. Clientid and Session Association 2.10.2.2. Client ID and Session Association
Sessions are subordinate to the clientid (Section 2.4). Each Sessions are subordinate to the client ID (Section 2.4). Each client
clientid can have zero or more active sessions. A clientid, and a ID can have zero or more active sessions. A client ID, and a session
session bound to it are required to do anything useful in NFSv4.1. bound to it are required to do anything useful in NFSv4.1. Each time
Each time a session is used, the state leased to it associated a session is used, the state leased to its associated client ID is
clientid is automatically renewed. automatically renewed.
State such as share reservations, locks, delegations, and layouts State such as share reservations, locks, delegations, and layouts
(Section 1.4.4) is tied to the clientid, not the sessions of the (Section 1.4.4) is tied to the client ID, not the sessions of the
clientid. Successive state changing operations from a given state client ID. Successive state changing operations from a given state
owner can go over different sessions, as long as each session is owner can go over different sessions, as long as each session is
associated with the same clientid. Callbacks can arrive over a associated with the same client ID. Callbacks can arrive over a
different session than the session that sent the operation that different session than the session that sent the operation that
acquired the state that the callback is for. For example, if session acquired the state that the callback is for. For example, if session
A is used to acquire a delegation, a request to recall the delegation A is used to acquire a delegation, a request to recall the delegation
can arrive over session B. can arrive over session B.
2.9.3. Channels 2.10.3. Channels
Each session has one or two channels: the "operation" or "fore" Each session has one or two channels: the "operation" or "fore"
channel used for ordinary requests from client to server, and the channel used for ordinary requests from client to server, and the
"back" channel, used for callback requests from server to client. "back" channel, used for callback requests from server to client.
The session allocates resources for each channel, including separate The session allocates resources for each channel, including separate
reply caches (see Section 2.9.4.1 These resources are for the most reply caches (see Section 2.10.4.1). These resources are for the
part specified at time the session is created. most part specified at time the session is created.
2.9.3.1. Operation Channel 2.10.3.1. Operation Channel
The operation channel carries COMPOUND requests and responses. A The operation channel carries COMPOUND requests and responses. A
session always has an operation channel. session always has an operation channel.
2.9.3.2. Backchannel 2.10.3.2. Backchannel
The backchannel carries CB_COMPOUND requests and responses. Whether The backchannel carries CB_COMPOUND requests and responses. Whether
there is a backchannel or not is a decision of the client; NFSv4.1 there is a backchannel or not is a decision of the client; NFSv4.1
servers MUST support backchannels. servers MUST support backchannels.
2.9.3.3. Session and Channel Association 2.10.3.3. Session and Channel Association
Because there are at most two channels per session, and because each Because there are at most two channels per session, and because each
channel has a distinct purpose, channels are not assigned channel has a distinct purpose, channels are not assigned
identifiers. The operation and backchannel are implicitly created identifiers. The operation and backchannel are implicitly created
and associated when the session is created. and associated when the session is created.
2.9.3.4. Connection and Channel Association 2.10.3.4. Connection and Channel Association
Each channel is associated with zero or more transport connections. Each channel is associated with zero or more transport connections.
A connection can be bound to one channel or both channels of a A connection can be bound to one channel or both channels of a
session; the client and server negotiate whether a connection will session; the client and server negotiate whether a connection will
carry traffic for one channel or both channels via the CREATE_SESSION carry traffic for one channel or both channels via the CREATE_SESSION
(Section 16.36) and the BIND_CONN_TO_SESSION (Section 16.34) (Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34)
operations. When a session is created via CREATE_SESSION, it is operations. When a session is created via CREATE_SESSION, it is
automatically bound to the operation channel, and optionally the automatically bound to the operation channel, and optionally the
backchannel. If the client does not specify connection binding backchannel. If the client does not specify connection binding
enforcement when the session is created, then additional connections enforcement when the session is created, then additional connections
are automatically bound to the operation channel when they are used are automatically bound to the operation channel when they are used
with a SEQUENCE operation that has the session's sessionid. with a SEQUENCE operation that has the session's sessionid.
A connection MAY be bound to the channels of other sessions. The A connection MAY be bound to the channels of other sessions. The
client decides, and the NFSv4.1 server MUST allow it. A connection client decides, and the NFSv4.1 server MUST allow it. A connection
MAY be bound to the channels of other sessions of other clientids. MAY be bound to the channels of other sessions of other clientids.
skipping to change at page 38, line 6 skipping to change at page 41, line 5
the same channel. For example a TCP and RDMA connection can be bound the same channel. For example a TCP and RDMA connection can be bound
to the operation channel. In the event an RDMA and non-RDMA to the operation channel. In the event an RDMA and non-RDMA
connection are bound to the same channel, the maximum number of slots connection are bound to the same channel, the maximum number of slots
must be at least one more than the total number of credits. This way must be at least one more than the total number of credits. This way
if all RDMA credits are in use, the non-RDMA connection can have at if all RDMA credits are in use, the non-RDMA connection can have at
least one outstanding request. least one outstanding request.
It is permissible for a connection of one type to be bound to the It is permissible for a connection of one type to be bound to the
operation channel, and another type bound to the backchannel. operation channel, and another type bound to the backchannel.
2.9.3.4.1. Trunking 2.10.3.4.1. Trunking
The eir_server_owner results from EXCHANGE_ID give a client a hint A client is allowed to issue EXCHANGE_ID multiple times to the same
that the server it is connected to may be the same as the server it server. The client may be unaware that two different server network
is connected to via another connection. When two connections have addresses refer to the same server. The use of EXCHANGE_ID allows a
the same eir_server_owner.so_major_id, the client treats the client to become aware that an additional network address refers to a
connections as connected to the same server (even if the destination server the client already has an established client ID and session
network addresses are different) and uses a common clientid to for. The eir_server_owner and eir_server_scope results from
identify itself. The eir_server_owner.so_minor_id field allows the EXCHANGE_ID give a client a hint that the server it is connected to
server to control binding of connections to sessions. When two may be the same as the server it is connected to via another
connections have a matching so_major_id and so_minor_id, the client connection. When EXCHANGE_ID is issued over two different
may bind both connections to a common session; this is session connections, and each return the same eir_server_owner.so_major_id
trunking. When two connections have a matching so_major_id, but and eir_server_scope, the client treats the connections as connected
different so_minor_id, the client will need to create a new session to the same server (subject to verification, as described later in
for the clientid in order to use the connection; this is clientid this section (Paragraph 2), even if the destination network addresses
trunking. In either session or clientid trunking, the bandwidth are different). As long as two unrelated servers have not selected and
capacity can scale with the number of connections. returned a conflicting pair of so_major_id and eir_server_scope, or
unless the client has used different co_ownerid values in each
EXCHANGE_ID request, or the server has lost client ID state (e.g. the
server has rebooted) the server MUST return the same eir_clientid
result. Otherwise, the client and server use the common eir_clientid
to identify the client. The eir_server_owner.so_minor_id field
allows the server to control binding of connections to sessions.
When two connections have a matching eir_server_scope, so_major_id
and so_minor_id, the client may bind both connections to a common
session; this is session trunking. When two connections have a
matching so_major_id and eir_server_scope, but different so_minor_id,
the client will need to create a new session for the client ID in
order to use the connection; this is client ID trunking. In either
session or client ID trunking, the bandwidth capacity can scale with
the number of connections.
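
The matching rules in the revised trunking text can be sketched as follows. This is only an illustration, assuming the EXCHANGE_ID results have already been decoded into a simple record; the ExchangeIdResult class and the classify_trunking function are hypothetical names and are not part of the protocol. The outcome is a hint that remains subject to the verification steps described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeIdResult:
    """Hypothetical record holding the EXCHANGE_ID results used for trunking."""
    so_major_id: bytes       # eir_server_owner.so_major_id
    so_minor_id: int         # eir_server_owner.so_minor_id
    eir_server_scope: bytes  # eir_server_scope


def classify_trunking(a: ExchangeIdResult, b: ExchangeIdResult) -> str:
    """Classify two connections as 'session', 'client_id', or 'none'.

    The result is only a hint; the client may still verify the servers'
    claims before trunking traffic.
    """
    same_server = (a.so_major_id == b.so_major_id
                   and a.eir_server_scope == b.eir_server_scope)
    if not same_server:
        return "none"
    if a.so_minor_id == b.so_minor_id:
        # Both connections may be bound to a common session.
        return "session"
    # Same server, different so_minor_id: a new session is needed.
    return "client_id"
```

For example, two results that agree on so_major_id and eir_server_scope but differ in so_minor_id are classified as client ID trunking, so the client would create a new session for its client ID on the second connection.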
Just because two servers over two connections claim matching or When two servers over two connections claim matching or partially
partially matching server_owner4 values does not the client should or matching eir_server_owner, eir_server_scope, and eir_clientid values
must trust the servers' claims. The client may verify these claims the client does not have to trust the servers' claims. The client
before trunking traffic. may verify these claims before trunking traffic in the following
ways:
For session trunking, clients and servers can reliably verify if o For session trunking, clients and servers can reliably verify if
connections between different network paths are in fact bound to the connections between different network paths are in fact bound to
same NFSv4.1 server and usable on the same session. The SET_SSV the same NFSv4.1 server and usable on the same session. The
(Section 16.47) operation allows a client and server to establish a SET_SSV (Section 17.47) operation allows a client and server to
unique, shared key value (the SSV). When a new connection is bound establish a unique, shared key value (the SSV). When a new
to the session (via the BIND_CONN_TO_SESSION operation, see connection is bound to the session (via the BIND_CONN_TO_SESSION
Section 16.34), the client offers a digest that based on the SSV. If operation, see Section 17.34), the client offers a digest that is
the client mistakenly tries to bind a connection to a session of a based on the SSV. If the client mistakenly tries to bind a
wrong server, the server will either reject the attempt because it is connection to a session of a wrong server, the server will either
not aware of the session identifier of the BIND_CONN_TO_SESSION reject the attempt because it is not aware of the session
arguments, or it will reject the attempt because the digest for the identifier of the BIND_CONN_TO_SESSION arguments, or it will
SSV does not match what the server expects. Even if the server reject the attempt because the digest for the SSV does not match
mistakenly or maliciously accepts the connection bind attempt, the what the server expects. Even if the server mistakenly or
digest it computes in the response will not be verified by the maliciously accepts the connection bind attempt, the digest it
client, the client will know it cannot use the connection for computes in the response will not be verified by the client, the
trunking the specified channel. client will know it cannot use the connection for trunking the
specified channel.
In the case of clientid trunking, the client can use RPCSEC_GSS to o In the case of client ID trunking, the client can use RPCSEC_GSS
verify that each connection is aimed at the same server. When the to verify that each connection is aimed at the same server. When
client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each the client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each
RPCSEC_GSS context over each connection has the same server RPCSEC_GSS context over each connection has the same server
principal, then the servers at the end of each connection are the principal, then -- barring a compromise of the server's GSS
credentials -- the servers at the end of each connection are the
same. same.
2.9.4. Exactly Once Semantics 2.10.4. Exactly Once Semantics
Via the session, NFSv4.1 offers exactly once semantics (EOS) for Via the session, NFSv4.1 offers exactly once semantics (EOS) for
requests sent over a channel. EOS is supported on both the operation requests sent over a channel. EOS is supported on both the operation
and back channels. and back channels.
Each COMPOUND or CB_COMPOUND request that is issued with a leading Each COMPOUND or CB_COMPOUND request that is issued with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement holds regardless of whether the request is exactly once. This requirement holds regardless of whether the request is
issued with reply caching specified (see Section 2.9.4.1.2). The issued with reply caching specified (see Section 2.10.4.1.2). The
requirement holds even if the requester is issuing the request over a requirement holds even if the requester is issuing the request over a
session created between a pNFS data client and pNFS data server. The session created between a pNFS data client and pNFS data server. The
rationale for this requirement is understood by categorizing requests rationale for this requirement is understood by categorizing requests
into three classifications: into three classifications:
o Nonidempotent requests. o Nonidempotent requests.
o Idempotent modifying requests. o Idempotent modifying requests.
o Idempotent non-modifying requests. o Idempotent non-modifying requests.
skipping to change at page 40, line 13 skipping to change at page 43, line 29
Note that it is not required the server cache the reply to the Note that it is not required the server cache the reply to the
modifying operation to avoid data corruption (but if the client modifying operation to avoid data corruption (but if the client
specified the reply to be cached, the server must cache it). specified the reply to be cached, the server must cache it).
An example of an idempotent non-modifying request is a COMPOUND An example of an idempotent non-modifying request is a COMPOUND
containing SEQUENCE, PUTFH, READLINK and nothing else. The containing SEQUENCE, PUTFH, READLINK and nothing else. The
re-execution of such a request will not cause data corruption, or re-execution of such a request will not cause data corruption, or
produce an incorrect result. Nonetheless, for simplicity, the produce an incorrect result. Nonetheless, for simplicity, the
replier MUST enforce EOS for such requests. replier MUST enforce EOS for such requests.
2.9.4.1. Slot Identifiers and Reply Cache 2.10.4.1. Slot Identifiers and Reply Cache
The RPC layer provides a transaction ID (xid), which, while required The RPC layer provides a transaction ID (xid), which, while required
to be unique, is not especially convenient for tracking requests. to be unique, is not especially convenient for tracking requests.
The xid is only meaningful to the requester; it cannot be interpreted The xid is only meaningful to the requester; it cannot be interpreted
at the replier except to test for equality with previously issued at the replier except to test for equality with previously issued
requests. Because RPC operations may be completed by the replier in requests. Because RPC operations may be completed by the replier in
any order, many transaction IDs may be outstanding at any time. The any order, many transaction IDs may be outstanding at any time. The
requester may therefore perform a computationally expensive lookup requester may therefore perform a computationally expensive lookup
operation in the process of demultiplexing each reply. operation in the process of demultiplexing each reply.
In NFSv4.1, there is a limit to the number of active requests. In NFSv4.1, there is a limit to the number of active requests.
This immediately enables a computationally efficient index for each This immediately enables a computationally efficient index for each
request which is designated as a Slot Identifier, or slotid. request which is designated as a Slot Identifier, or slotid.
When the requester issues a new request, it selects a slotid in the When the requester issues a new request, it selects a slotid in the
range 0..N-1, where N is the replier's current "totalrequests" limit range 0..N-1, where N is the replier's current "outstanding requests"
granted to the requester on the session over which the request is to limit granted to the requester on the session over which the request
be issued. The slotid must be unused by any of the requests which is to be issued. The value of N outstanding requests starts out as
the requester has already active on the session. "Unused" here means the value of ca_maxrequests (Section 17.36), but can be adjusted by
the requester has no outstanding request for that slotid. Because the response to SEQUENCE or CB_SEQUENCE as described later in this
the slot id is always an integer in the range 0..N-1, requester section. The slotid must be unused by any of the requests which the
requester has already active on the session. "Unused" here means the
requester has no outstanding request for that slotid. Because the
slot id is always an integer in the range 0..N-1, requester
implementations can use the slotid from a replier response to implementations can use the slotid from a replier response to
efficiently match responses with outstanding requests, such as, for efficiently match responses with outstanding requests, such as, for
example, by using the slotid to index into a outstanding request example, by using the slotid to index into an outstanding request
array. This can be used to avoid expensive hashing and lookup array. This can be used to avoid expensive hashing and lookup
functions in the performance-critical receive path. functions in the performance-critical receive path.
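
A minimal requester-side sketch of the slot table described above, assuming a fixed limit N and a 32-bit sequenceid that starts at 1 for the first request on a slot (both are assumptions made for illustration). The SlotTable class and its method names are hypothetical. It picks the lowest unused slotid and uses the slotid from a reply as a direct array index, avoiding any hash lookup.

```python
SEQ_MOD = 2 ** 32  # assumed sequenceid width, for wraparound


class SlotTable:
    """Hypothetical requester-side table of outstanding requests."""

    def __init__(self, max_requests: int):
        # One entry per slotid in 0..N-1; None means the slot is unused.
        self.outstanding = [None] * max_requests
        self.sequenceid = [0] * max_requests

    def acquire(self, request):
        """Bind request to the lowest unused slot.

        Returns (slotid, sequenceid), or None if every slot is in use.
        """
        for slotid, entry in enumerate(self.outstanding):
            if entry is None:
                # A new transmit on a slot advances its sequenceid by one.
                self.sequenceid[slotid] = (self.sequenceid[slotid] + 1) % SEQ_MOD
                self.outstanding[slotid] = request
                return slotid, self.sequenceid[slotid]
        return None

    def complete(self, slotid: int):
        """Match a reply to its request by direct index and free the slot."""
        request = self.outstanding[slotid]
        self.outstanding[slotid] = None
        return request
```

A retransmission would reuse the slotid and sequenceid returned by acquire rather than calling acquire again.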
The sequenceid, which accompanies the slotid in each request, is The sequenceid, which accompanies the slotid in each request, is for
important for an important check at the server: it must be able to be an important check at the server: it must be able to be determined
determined efficiently whether a request using a certain slotid is a efficiently whether a request using a certain slotid is a retransmit
retransmit or a new, never-before-seen request. It is not feasible or a new, never-before-seen request. It is not feasible for the
for the client to assert that it is retransmitting to implement this, client to assert that it is retransmitting to implement this, because
because for any given request the client cannot know the server has for any given request the client cannot know the server has seen it
seen it unless the server actually replies. Of course, if the client unless the server actually replies. Of course, if the client has
has seen the server's reply, the client would not retransmit. seen the server's reply, the client would not retransmit.
The sequenceid MUST increase monotonically for each new transmit of a The sequenceid MUST increase monotonically for each new transmit of a
given slotid, and MUST remain unchanged for any retransmission. The given slotid, and MUST remain unchanged for any retransmission. The
server must in turn compare each newly received request's sequenceid server must in turn compare each newly received request's sequenceid
with the last one previously received for that slotid, to see if the with the last one previously received for that slotid, to see if the
new request is: new request is:
o A new request, in which the sequenceid is one greater than that o A new request, in which the sequenceid is one greater than that
previously seen in the slot (accounting for sequence wraparound). previously seen in the slot (accounting for sequence wraparound).
The replier proceeds to execute the new request. The replier proceeds to execute the new request.
skipping to change at page 41, line 50 skipping to change at page 45, line 20
entries for an effective reply cache. entries for an effective reply cache.
The slotid and sequenceid therefore take over the traditional role of The slotid and sequenceid therefore take over the traditional role of
the XID and port number in the replier reply cache implementation, the XID and port number in the replier reply cache implementation,
and the session replaces the IP address. This approach is and the session replaces the IP address. This approach is
considerably more portable and completely robust - it is not subject considerably more portable and completely robust - it is not subject
to the frequent reassignment of ports as clients reconnect over IP to the frequent reassignment of ports as clients reconnect over IP
networks. In addition, the RPC XID is not used in the reply cache, networks. In addition, the RPC XID is not used in the reply cache,
enhancing robustness of the cache in the face of any rapid reuse of enhancing robustness of the cache in the face of any rapid reuse of
XIDs by the client. [[Comment.3: We need to discuss the requirements XIDs by the client. [[Comment.3: We need to discuss the requirements
of the client for changing the XID.]] . of the client for changing the XID.]]
It is required to encode the slotid information into each request in The slotid information is included in each request, without violating
a way that does not violate the minor versioning rules of the NFSv4.0 the minor versioning rules of the NFSv4.0 specification, by encoding
specification. This is accomplished here by encoding it in the it in the SEQUENCE operation within each NFSv4.1 COMPOUND and
SEQUENCE operation within each NFSv4.1 COMPOUND and CB_COMPOUND CB_COMPOUND procedure. The operation easily piggybacks within
procedure. The operation easily piggybacks within existing messages. existing messages. [[Comment.4: Need a better term than piggyback]]
[[Comment.4: Need a better term than piggyback]]
In general, the receipt of a new sequenced request arriving on any The receipt of a new sequenced request arriving on any valid slot is
valid slot is an indication that the previous reply cache contents of an indication that the previous reply cache contents of that slot may
that slot may be discarded. In order to further assist the replier be discarded.
in slot management, the requester is required to use the lowest
available slot when issuing a new request. In this way, the replier
may be able to retire additional entries.
However, in the case where the replier is actively adjusting its The SEQUENCE (and CB_SEQUENCE) operation also carries a
granted maximum request count to the requester, it may not be able to "highest_slotid" value which carries additional client slot usage
use receipt of the slotid to retire cache entries. The slotid used information. The requester must always provide a slotid representing
in an incoming request may not reflect the server's current idea of the outstanding request with the highest-numbered slot value. The
the requester's session limit, because the request may have been sent requester should in all cases provide the most conservative value
from the requester before the update was received. Therefore, in the possible, although it can be increased somewhat above the actual
downward adjustment case, the replier may have to retain a number of instantaneous usage to maintain some minimum or optimal level. This
reply cache entries at least as large as the old value, until provides a way for the requester to yield unused request slots back
operation sequencing rules allow it to infer that the requester has to the replier, which in turn can use the information to reallocate
seen its reply. resources.
The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot" The replier responds with both a new target highest_slotid, and an
value which carries additional client slot usage information. The enforced highest_slotid, described as follows:
requester must always provide its highest-numbered outstanding slot
value in the maxslot argument, and the replier may reply with a new
recognized value. The requester should in all cases provide the most
conservative value possible, although it can be increased somewhat
above the actual instantaneous usage to maintain some minimum or
optimal level. This provides a way for the requester to yield unused
request slots back to the replier, which in turn can use the
information to reallocate resources. Obviously, maxslot can never be
zero, or the session would deadlock.
The replier also provides a target maxslot value to the requester, o The target highest_slotid is an indication to the requester of the
which is an indication to the requester of the maxslot the replier highest_slotid the replier wishes the requester to be using. This
wishes the requester to be using. This permits the server to permits the replier to withdraw (or add) resources from a
withdraw (or add) resources from a requester that has been found to requester that has been found to not be using them, in order to
not be using them, in order to more fairly share resources among a more fairly share resources among a varying level of demand from
varying level of demand from other requesters. The requester must other requesters. The requester must always comply with the
always comply with the replier's value updates, since they indicate replier's value updates, since they indicate newly established
newly established hard limits on the requester's access to session hard limits on the requester's access to session resources.
resources. However, because of request pipelining, the requester may However, because of request pipelining, the requester may have
have active requests in flight reflecting prior values, therefore the active requests in flight reflecting prior values, therefore the
replier must not immediately require the requester to comply. replier must not immediately require the requester to comply.
2.9.4.1.1. Errors from SEQUENCE and CB_SEQUENCE o The enforced highest_slotid indicates the highest slotid the
requester is permitted to use on a subsequent SEQUENCE or
CB_SEQUENCE operation.
The requester is required to use the lowest available slot when
issuing a new request. This way, the replier may be able to retire
slot entries faster. However, where the replier is actively
adjusting its granted maximum request count (i.e. the highest_slotid)
to the requester, it will not be able to use just the receipt of
the slotid and highest_slotid in the request. Neither the slotid nor
the highest_slotid used in a request may reflect the replier's
current idea of the requester's session limit, because the request
may have been sent from the requester before the update was received.
Therefore, in the downward adjustment case, the replier may have to
retain a number of reply cache entries at least as large as the old
value of maximum requests outstanding, until operation sequencing
rules allow it to infer that the requester has seen its reply.
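The slot selection rule and the downward adjustment case above can be sketched as follows. This is a minimal illustration, not part of either draft: the function names are hypothetical, and slotids are assumed to be numbered from zero, so that a highest_slotid of N corresponds to N + 1 slots.

```python
def pick_slot(in_use, enforced_highest_slotid):
    """Requester side: return the lowest free slotid, or None if all are busy.

    in_use is the set of slotids with a request outstanding. Only slotids
    up to the enforced highest_slotid are considered (zero-based slotids
    are an assumption of this sketch).
    """
    for slotid in range(enforced_highest_slotid + 1):
        if slotid not in in_use:
            return slotid
    return None


def entries_to_retain(old_highest_slotid, new_highest_slotid):
    """Replier side: reply cache entries to keep right after an adjustment.

    After a downward adjustment, requests sent before the requester saw
    the new limit may still arrive, so the larger (old) table is kept
    until the replier can infer that the requester has seen its reply.
    """
    return max(old_highest_slotid, new_highest_slotid) + 1
```

For example, with slots 0, 1 and 3 in use and an enforced highest_slotid of 4, pick_slot returns 2; after lowering highest_slotid from 7 to 3 the replier in this sketch still retains 8 entries.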
2.10.4.1.1. Errors from SEQUENCE and CB_SEQUENCE
Any time SEQUENCE or CB_SEQUENCE return an error, the sequenceid of Any time SEQUENCE or CB_SEQUENCE return an error, the sequenceid of
the slot MUST NOT change. The replier MUST NOT modify the reply the slot MUST NOT change. The replier MUST NOT modify the reply
cache entry for the slot whenever an error is returned from SEQUENCE cache entry for the slot whenever an error is returned from SEQUENCE
or CB_SEQUENCE. or CB_SEQUENCE.
2.9.4.1.2. Optional Reply Caching 2.10.4.1.2. Optional Reply Caching
On a per-request basis the requester can choose to direct the replier On a per-request basis the requester can choose to direct the replier
to cache the reply to all operations after the first operation to cache the reply to all operations after the first operation
(SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis
fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it
would not direct the replier to cache the entire reply is that the would not direct the replier to cache the entire reply is that the
request is composed of all idempotent operations [20]. Caching the request is composed of all idempotent operations [20]. Caching the
reply may offer little benefit, and if the reply is too large (see reply may offer little benefit, and if the reply is too large (see
Section 2.9.4.4, it may not be cacheable anyway. Section 2.10.4.4), it may not be cacheable anyway.
Whether the requester requests the reply to be cached or not has no Whether the requester requests the reply to be cached or not has no
effect on the slot processing. If the results of SEQUENCE or effect on the slot processing. If the results of SEQUENCE or
CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be
incremented by one. If a requester does not direct the replier to incremented by one. If a requester does not direct the replier to
cache the reply, the replier MUST do one of the following: cache the reply, the replier MUST do one of the following:
o The replier can cache the entire original reply. Even though o The replier can cache the entire original reply. Even though
sa_cachethis or csa_cachethis are FALSE, the replier is always sa_cachethis or csa_cachethis are FALSE, the replier is always
free to cache. It may choose this approach in order to simplify free to cache. It may choose this approach in order to simplify
implementation. implementation.
o The replier enters into its reply cache a reply consisting of the o The replier enters into its reply cache a reply consisting of the
original results to the SEQUENCE or CB_SEQUENCE operation, original results to the SEQUENCE or CB_SEQUENCE operation,
followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus when the followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus if the
requester later retries the request, it will get requester later retries the request, it will get
NFS4ERR_RETRY_UNCACHED_REP. NFS4ERR_RETRY_UNCACHED_REP.
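The slot and reply cache rules described above can be sketched as follows. This is an illustration under stated assumptions: the Slot class and record_reply function are hypothetical names, status codes are modelled as strings, and the replier is assumed to take the second option (caching only the SEQUENCE result plus the retry error) whenever caching was not requested.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_RETRY_UNCACHED_REP = "NFS4ERR_RETRY_UNCACHED_REP"


class Slot:
    """One slot table entry: a sequenceid and the cached reply (hypothetical)."""

    def __init__(self):
        self.sequenceid = 0
        self.cached_reply = None


def record_reply(slot, seq_status, seq_result, other_results, cachethis):
    """Update a slot after processing a request.

    seq_status is the status of SEQUENCE (or CB_SEQUENCE); other_results
    are the results of the operations that follow it.
    """
    if seq_status != NFS4_OK:
        # An error from SEQUENCE leaves the sequenceid and the cached
        # reply for the slot unchanged.
        return
    # Slot processing does not depend on whether caching was requested.
    slot.sequenceid += 1
    if cachethis:
        slot.cached_reply = [seq_result] + list(other_results)
    else:
        # A retry of this request will see the SEQUENCE result followed
        # by NFS4ERR_RETRY_UNCACHED_REP.
        slot.cached_reply = [seq_result, NFS4ERR_RETRY_UNCACHED_REP]
```

A replier that prefers the first option would simply store the full reply in both branches.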
2.9.4.1.3. Multiple Connections and Sharing the Reply Cache 2.10.4.1.3. Multiple Connections and Sharing the Reply Cache
Multiple connections can be bound to a session's channel, hence the Multiple connections can be bound to a session's channel, hence the
connections share the same table of slotids. For connections over connections share the same table of slotids. For connections over
non-RDMA transports like TCP, there are no particular considerations. non-RDMA transports like TCP, there are no particular considerations.
Considerations for multiple RDMA connections sharing a slot table are Considerations for multiple RDMA connections sharing a slot table are
discussed in Section 2.9.5.1. [[Comment.5: Also need to discuss when discussed in Section 2.10.5.1. [[Comment.5: Also need to discuss
RDMA and non-RDMA share a slot table.]] when RDMA and non-RDMA share a slot table.]]
2.9.4.2. Retry and Replay 2.10.4.2. Retry and Replay
A client MUST NOT retry a request, unless the connection it used to A client MUST NOT retry a request, unless the connection it used to
send the request disconnects. The client can then reconnect and send the request disconnects. The client can then reconnect and
resend the request, or it can resend the request over a different resend the request, or it can resend the request over a different
connection. In the case of the server resending over the connection. In the case of the server resending over the
backchannel, it cannot reconnect, and either resends the request over backchannel, it cannot reconnect, and either resends the request over
another connection that the client has bound to the backchannel, or another connection that the client has bound to the backchannel, or
if there is no other backchannel connection, waits for the client to if there is no other backchannel connection, waits for the client to
bind a connection to the backchannel. bind a connection to the backchannel.
skipping to change at page 44, line 29 skipping to change at page 48, line 5
NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
RDMA fabrics do not guarantee that the memory handles (Steering Tags) RDMA fabrics do not guarantee that the memory handles (Steering Tags)
within each RDMA three-tuple are valid on a scope [[Comment.6: What within each RDMA three-tuple are valid on a scope [[Comment.6: What
is a three-tuple?]] outside that of a single connection. Therefore, is a three-tuple?]] outside that of a single connection. Therefore,
handles used by the direct operations become invalid after connection handles used by the direct operations become invalid after connection
loss. The server must ensure that any RDMA operations which must be loss. The server must ensure that any RDMA operations which must be
replayed from the reply cache use the newly provided handle(s) from replayed from the reply cache use the newly provided handle(s) from
the most recent request. the most recent request.
2.9.4.3. Resolving server callback races with sessions 2.10.4.3. Resolving server callback races with sessions
It is possible for server callbacks to arrive at the client before It is possible for server callbacks to arrive at the client before
the reply from related forward channel operations. For example, a the reply from related forward channel operations. For example, a
client may have been granted a delegation to a file it has opened, client may have been granted a delegation to a file it has opened,
but the reply to the OPEN (informing the client of the granting of but the reply to the OPEN (informing the client of the granting of
the delegation) may be delayed in the network. If a conflicting the delegation) may be delayed in the network. If a conflicting
operation arrives at the server, it will recall the delegation using operation arrives at the server, it will recall the delegation using
the callback channel, which may be on a different transport the callback channel, which may be on a different transport
connection, perhaps even a different network. In NFSv4.0, if the connection, perhaps even a different network. In NFSv4.0, if the
callback request arrives before the related reply, the client may callback request arrives before the related reply, the client may
skipping to change at page 45, line 33 skipping to change at page 49, line 8
to arrive on any of the session's operations channels, because it is to arrive on any of the session's operations channels, because it is
possible that they will be delayed indefinitely. However, it should possible that they will be delayed indefinitely. However, it should
wait for a period of time, and if the time expires it can provide a wait for a period of time, and if the time expires it can provide a
more meaningful error such as NFS4ERR_DELAY. more meaningful error such as NFS4ERR_DELAY.
[[Comment.7: We need to consider the clients' options here, and [[Comment.7: We need to consider the clients' options here, and
describe them... NFS4ERR_DELAY has been discussed as a legal reply describe them... NFS4ERR_DELAY has been discussed as a legal reply
to CB_RECALL?]] to CB_RECALL?]]
There are other scenarios under which callbacks may race replies, There are other scenarios under which callbacks may race replies,
among them pnfs layout recalls, described in Section 12.3.5.3 among them pNFS layout recalls, described in Section 12.5.4.2
[[Comment.8: fill in the blanks w/others, etc...]] [[Comment.8: fill in the blanks w/others, etc...]]
2.9.4.4. COMPOUND and CB_COMPOUND Construction Issues 2.10.4.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues. When the issues (especially with RDMA) and reply cache issues. When the
session is created, (Section 16.36) the client and server negotiate session is created, (Section 17.36) the client and server negotiate
the maximum sized request they will send or process the maximum sized request they will send or process
(ca_maxrequestsize), the maximum sized reply they will return or (ca_maxrequestsize), the maximum sized reply they will return or
process (ca_maxresponsesize), and the maximum sized reply they will process (ca_maxresponsesize), and the maximum sized reply they will
store in the reply cache (ca_maxresponsesize_cached). store in the reply cache (ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the If a request exceeds ca_maxrequestsize, the reply will have the
status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG
as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request, or it may choose to return it on a subsequent operation. request, or it may choose to return it on a subsequent operation.
If a reply exceeds ca_maxresponsesize, the reply will have the status If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, status for the first operation (SEQUENCE or CB_SEQUENCE) in the request,
or it may choose to return it on a subsequent operation. or it may choose to return it on a subsequent operation.
If sa_cachethis or csa_cachethis are TRUE, then the replier MUST If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
cache a reply except if an error is returned by the SEQUENCE or cache a reply except if an error is returned by the SEQUENCE or
CB_SEQUENCE operation (see Section 2.9.4.1.1). If the reply exceeds CB_SEQUENCE operation (see Section 2.10.4.1.1). If the reply exceeds
ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are
TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even
if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter)
is returned on an operation other than the first operation (SEQUENCE or is returned on an operation other than the first operation (SEQUENCE or
CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or
csa_cachethis are TRUE. For example, if a COMPOUND has eleven csa_cachethis are TRUE. For example, if a COMPOUND has eleven
operations, including SEQUENCE, the fifth operation is a RENAME, and operations, including SEQUENCE, the fifth operation is a RENAME, and
the tenth operation is a READ for one million bytes, the server may the tenth operation is a READ for one million bytes, the server may
return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since
the server executed several operations, especially the non-idempotent the server executed several operations, especially the non-idempotent
RENAME, the client's request to cache the reply needs to be honored RENAME, the client's request to cache the reply needs to be honored
in order for correct operation of exactly once semantics. If the in order for correct operation of exactly once semantics. If the
client retries the request, the server will have cached a reply that client retries the request, the server will have cached a reply that
contains results for ten of the eleven requested operations, with the contains results for ten of the eleven requested operations, with the
tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.
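The three negotiated size limits and the errors tied to them can be summarized in a short sketch. The function name and the order in which the limits are checked are assumptions made for illustration; the text above only states which error corresponds to which limit.

```python
def size_status(request_size, reply_size, cachethis,
                ca_maxrequestsize, ca_maxresponsesize,
                ca_maxresponsesize_cached):
    """Return the status implied by the negotiated size limits.

    cachethis stands for sa_cachethis or csa_cachethis. Sizes are in
    bytes. The check order is an assumption of this sketch.
    """
    if request_size > ca_maxrequestsize:
        return "NFS4ERR_REQ_TOO_BIG"
    if reply_size > ca_maxresponsesize:
        return "NFS4ERR_REP_TOO_BIG"
    if cachethis and reply_size > ca_maxresponsesize_cached:
        return "NFS4ERR_REP_TOO_BIG_TO_CACHE"
    return "NFS4_OK"
```

In the eleven-operation example, a one million byte READ result with a smaller ca_maxresponsesize_cached and caching requested yields NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation, while the results of the earlier operations, including the RENAME, are still cached.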
A client needs to take care that when sending operations that change A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOFFH) the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)
that it not exceed the maximum reply buffer before the GETFH that it not exceed the maximum reply buffer before the GETFH
operation. Otherwise the client will have to retry the operation operation. Otherwise the client will have to retry the operation
that changed the current filehandle, in order to obtain the desired that changed the current filehandle, in order to obtain the desired
filehandle. For the OPEN operation (see Section 16.16), retry is not filehandle. For the OPEN operation (see Section 17.16), retry is not
always available as an option. The following guidelines for the always available as an option. The following guidelines for the
handling of filehandle changing operations are advised: handling of filehandle changing operations are advised:
o A client SHOULD issue GETFH immediately after a current filehandle o A client SHOULD issue GETFH immediately after a current filehandle
changing operation. This is especially important after any changing operation. This is especially important after any
current filehandle changing non-idempotent operation. It is current filehandle changing non-idempotent operation. It is
critical to issue GETFH immediately after OPEN. critical to issue GETFH immediately after OPEN.
o A server MAY return NFS4ERR_REP_TOO_BIG or o A server MAY return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
skipping to change at page 47, line 11 skipping to change at page 50, line 33
filehandle changing non-idempotent operation if the reply would be filehandle changing non-idempotent operation if the reply would be
too large on the next operation, especially if the operation is too large on the next operation, especially if the operation is
OPEN. OPEN.
o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the
next operation after a non-idempotent current filehandle changing next operation after a non-idempotent current filehandle changing
operation, and finds it is not GETFH. The server would do this if operation, and finds it is not GETFH. The server would do this if
it is unable to determine in advance whether the total response it is unable to determine in advance whether the total response
size would exceed ca_maxresponsesize_cached or ca_maxresponsesize. size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.
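The NFS4ERR_UNSAFE_COMPOUND guideline above amounts to a look-ahead over the operation list. The sketch below is illustrative only: the function name is hypothetical, operations are modelled as name strings, and the set of non-idempotent operations that change the current filehandle is an assumption chosen for the example (the text above singles out OPEN). A filehandle changing operation that ends the compound is treated as safe here, since there is no next operation to inspect.

```python
# Illustrative assumption: not an exhaustive or authoritative list.
NONIDEMPOTENT_FH_CHANGING = {"OPEN", "CREATE"}


def first_unsafe_index(ops):
    """Return the index of the first operation a server might reject.

    An operation is flagged when it is a non-idempotent current
    filehandle changing operation and the next operation is not GETFH.
    Returns None if the compound passes the check.
    """
    for i, op in enumerate(ops):
        if op in NONIDEMPOTENT_FH_CHANGING and i + 1 < len(ops):
            if ops[i + 1] != "GETFH":
                return i
    return None
```

A client that always places GETFH directly after OPEN never trips this check, which matches the first guideline in the list.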
2.9.4.5. Persistence 2.10.4.5. Persistence
Since the reply cache is bounded, it is practical for the server Since the reply cache is bounded, it is practical for the server
reply cache to persist across server reboots, and to be kept in reply cache to persist across server reboots, and to be kept in
stable storage (a client's reply cache for callbacks need not persist stable storage (a client's reply cache for callbacks need not persist
across client reboots unless the client intends for its session and across client reboots unless the client intends for its session and
other state to persist across reboots). other state to persist across reboots).
o The slot table including the sequenceid and cached reply for each o The slot table including the sequenceid and cached reply for each
slot. slot.
o The sessionid. o The sessionid.
o The clientid. o The client ID.
o The SSV (see Section 2.9.6.3). o The SSV (see Section 2.10.6.3).
The CREATE_SESSION (see Section 16.36) operation determines the The CREATE_SESSION (see Section 17.36) operation determines the
persistence of the reply cache. persistence of the reply cache.
2.9.5. RDMA Considerations 2.10.5. RDMA Considerations
A complete discussion of the operation of RPC-based protocols atop A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. A discussion of the operation of RDMA transports is in [RPCRDMA]. A discussion of the operation of
NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is
considered, this specification assumes the use of such a layering; it considered, this specification assumes the use of such a layering; it
addresses only the upper layer issues relevant to making best use of addresses only the upper layer issues relevant to making best use of
RPC/RDMA. RPC/RDMA.
2.9.5.1. RDMA Connection Resources 2.10.5.1. RDMA Connection Resources
RDMA requires its consumers to register memory and post buffers of a RDMA requires its consumers to register memory and post buffers of a
specific size and number for receive operations. specific size and number for receive operations.
Registration of memory can be a relatively high-overhead operation, Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes (e.g. since it requires pinning of buffers, assignment of attributes (e.g.
readable/writable), and initialization of hardware translation. readable/writable), and initialization of hardware translation.
Preregistration is desirable to reduce overhead. These registrations Preregistration is desirable to reduce overhead. These registrations
are specific to hardware interfaces and even to RDMA connection are specific to hardware interfaces and even to RDMA connection
endpoints, therefore negotiation of their limits is desirable to endpoints, therefore negotiation of their limits is desirable to
manage resources effectively. manage resources effectively.
Following the basic registration, these buffers must be posted by the Following the basic registration, these buffers must be posted by the
RPC layer to handle receives. These buffers remain in use by the RPC layer to handle receives. These buffers remain in use by the
RPC/NFSv4 implementation; the size and number of them must be known RPC/NFSv4 implementation; the size and number of them must be known
to the remote peer in order to avoid RDMA errors which would cause a to the remote peer in order to avoid RDMA errors which would cause a
fatal error on the RDMA connection. fatal error on the RDMA connection.
NFSv4.1 manages slots as resources on a per session basis (see NFSv4.1 manages slots as resources on a per session basis (see
Section 2.9), while RDMA connections manage credits on a per Section 2.10), while RDMA connections manage credits on a per
connection basis. This means that in order for a peer to send data connection basis. This means that in order for a peer to send data
over RDMA to a remote buffer, it has to have both an NFSv4.1 slot, over RDMA to a remote buffer, it has to have both an NFSv4.1 slot,
and an RDMA credit. and an RDMA credit.
2.9.5.2. Flow Control 2.10.5.2. Flow Control
NFSv4.0 and all previous versions do not provide for any form of flow NFSv4.0 and all previous versions do not provide for any form of flow
control; instead they rely on the windowing provided by transports control; instead they rely on the windowing provided by transports
like TCP to throttle requests. This does not work with RDMA, which like TCP to throttle requests. This does not work with RDMA, which
provides no operation flow control and will terminate a connection in provides no operation flow control and will terminate a connection in
error when limits are exceeded. Limits such as maximum number of error when limits are exceeded. Limits such as maximum number of
requests outstanding are therefore negotiated when a session is requests outstanding are therefore negotiated when a session is
created (see the ca_maxrequests field in Section 16.36). These created (see the ca_maxrequests field in Section 17.36). These
limits then provide the maxima each session's channels' connections limits then provide the maxima each session's channels' connections
must operate within. RDMA connections are managed within these must operate within. RDMA connections are managed within these
limits as described in section 3.3 of [RPCRDMA]; if there are limits as described in section 3.3 of [RPCRDMA]; if there are
multiple RDMA connections, then the maximum requests for a channel multiple RDMA connections, then the maximum requests for a channel
will be divided among the RDMA connections. The limits may also be will be divided among the RDMA connections. The limits may also be
modified dynamically at the server's choosing by manipulating certain modified dynamically at the server's choosing by manipulating certain
parameters present in each NFSv4.1 request. In addition, the parameters present in each NFSv4.1 request. In addition, the
CB_RECALL_SLOT callback operation (see Section 18.8) can be issued by CB_RECALL_SLOT callback operation (see Section 19.8) can be issued by
a server to a client to return RDMA credits to the server, thereby a server to a client to return RDMA credits to the server, thereby
lowering the maximum number of requests a client can have outstanding lowering the maximum number of requests a client can have outstanding
to the server. to the server.
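As a small illustration of dividing a channel's maximum requests among several RDMA connections, the sketch below uses an even split with the remainder given to the first connections. The policy and the function name are assumptions; the text above says only that the maximum is divided, not how.

```python
def divide_requests(ca_maxrequests, n_connections):
    """Split a channel's request limit across its RDMA connections.

    Returns one per-connection limit per connection. The shares always
    sum to ca_maxrequests. The even-split policy is an assumption.
    """
    base, extra = divmod(ca_maxrequests, n_connections)
    return [base + (1 if i < extra else 0) for i in range(n_connections)]
```

With ca_maxrequests of 10 and three connections this gives limits of 4, 3 and 3.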
2.9.5.3. Padding 2.10.5.3. Padding
Header padding is requested by each peer at session initiation (see Header padding is requested by each peer at session initiation (see
the csa_headerpadsize argument to CREATE_SESSION in Section 16.36), the csa_headerpadsize argument to CREATE_SESSION in Section 17.36),
and subsequently used by the RPC RDMA layer, as described in and subsequently used by the RPC RDMA layer, as described in
[RPCRDMA]. Zero padding is permitted. [RPCRDMA]. Zero padding is permitted.
Padding leverages the useful property that RDMA receives preserve Padding leverages the useful property that RDMA receives preserve
alignment of data, even when they are placed into anonymous alignment of data, even when they are placed into anonymous
(untagged) buffers. If requested, client inline writes will insert (untagged) buffers. If requested, client inline writes will insert
appropriate pad bytes within the request header to align the data appropriate pad bytes within the request header to align the data
payload on the specified boundary. The client is encouraged to add payload on the specified boundary. The client is encouraged to add
sufficient padding (up to the negotiated size) so that the "data" sufficient padding (up to the negotiated size) so that the "data"
field of the NFSv4.1 WRITE operation is aligned. Most servers can field of the NFSv4.1 WRITE operation is aligned. Most servers can
skipping to change at page 49, line 41 skipping to change at page 53, line 15
In the above case, the server may recycle unused buffers to the next In the above case, the server may recycle unused buffers to the next
posted receive if unused by the actual received request, or may pass posted receive if unused by the actual received request, or may pass
the now-complete buffers by reference for normal write processing. the now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for data For a server which can make use of it, this removes any need for data
copies of incoming data, without resorting to complicated end-to-end copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management. This includes most kernel-based buffer advertisement and management. This includes most kernel-based
and integrated server designs, among many others. The client may and integrated server designs, among many others. The client may
perform similar optimizations, if desired. perform similar optimizations, if desired.
2.9.5.4. Dual RDMA and Non-RDMA Transports 2.10.5.4. Dual RDMA and Non-RDMA Transports
Some RDMA transports (for example see [RDDP]), [[Comment.9: need Some RDMA transports (for example see [RDDP]), [[Comment.9: need
xref]] require a "streaming" (non-RDMA) phase, where ordinary traffic xref]] require a "streaming" (non-RDMA) phase, where ordinary traffic
might flow before "stepping" up to RDMA mode, commencing RDMA might flow before "stepping" up to RDMA mode, commencing RDMA
traffic. Some RDMA transports start connections always in RDMA mode. traffic. Some RDMA transports start connections always in RDMA mode.
NFSv4.1 allows, but does not assume, a streaming phase before RDMA NFSv4.1 allows, but does not assume, a streaming phase before RDMA
mode. When a connection is bound to a session, the client and server mode. When a connection is bound to a session, the client and server
negotiate whether the connection is used in RDMA or non-RDMA mode negotiate whether the connection is used in RDMA or non-RDMA mode
(see Section 16.36 and Section 16.34). (see Section 17.36 and Section 17.34).
2.9.6. Sessions Security 2.10.6. Sessions Security
2.9.6.1. Session Callback Security 2.10.6.1. Session Callback Security
The session connection binding improves security over that provided Via session connection binding, NFSv4.1 improves security over that
by NFSv4.0 for the callback channel. The connection is client- provided by NFSv4.0 for the callback channel. The connection is
initiated (see Section 16.34), and subject to the same firewall and client-initiated (see Section 17.34), and subject to the same
routing checks as the operations channel. The connection cannot be firewall and routing checks as the operations channel. The
hijacked by an attacker who connects to the client port prior to the connection cannot be hijacked by an attacker who connects to the
intended server. At the client's option (see Section 16.36), binding client port prior to the intended server. At the client's option
is fully authenticated before being activated (see Section 16.34). (see Section 17.36), binding is fully authenticated before being
Traffic from the server over the callback channel is authenticated activated (see Section 17.34). Traffic from the server over the
exactly as the client specifies (see Section 2.9.6.2). callback channel is authenticated exactly as the client specifies
(see Section 2.10.6.2).
2.9.6.2. Backchannel RPC Security 2.10.6.2. Backchannel RPC Security
When the NFSv4.1 client establishes the backchannel, it informs the When the NFSv4.1 client establishes the backchannel, it informs the
server what security flavors and principals it must use when sending server what security flavors and principals it must use when sending
requests over the backchannel. If the security flavor is RPCSEC_GSS, requests over the backchannel. If the security flavor is RPCSEC_GSS,
the client expresses the principal in the form of an established the client expresses the principal in the form of an established
RPCSEC_GSS context. The server is free to use any flavor/principal RPCSEC_GSS context. The server is free to use any flavor/principal
combination the server offers, but MUST NOT use unoffered combination the server offers, but MUST NOT use unoffered
combinations. combinations.
This way, the client does not have to provide a target GSS principal This way, the client does not have to provide a target GSS principal
as it did with NFSv4.0, and the server does not have to implement an as it did with NFSv4.0, and the server does not have to implement an
RPCSEC_GSS initiator as it did with NFSv4.0. [[Comment.10: xrefs]] RPCSEC_GSS initiator as it did with NFSv4.0. [[Comment.10: xrefs]]
The CREATE_SESSION (Section 16.36) and BACKCHANNEL_CTL The CREATE_SESSION (Section 17.36) and BACKCHANNEL_CTL
(Section 16.33) operations allow the client to specify flavor/ (Section 17.33) operations allow the client to specify flavor/
principal combinations. principal combinations.
2.9.6.3. Protection from Unauthorized State Changes 2.10.6.3. Protection from Unauthorized State Changes
Under some conditions, NFSv4.0 is vulnerable to a denial of service Under some conditions, NFSv4.0 is vulnerable to a denial of service
issue with respect to its state management. issue with respect to its state management.
The attack works via an unauthorized client faking an open_owner4, an The attack works via an unauthorized client faking an open_owner4, an
open_owner/lock_owner pair, or stateid, combined with a seqid. The open_owner/lock_owner pair, or stateid, combined with a seqid. The
operation is sent to the NFSv4 server. The NFSv4 server accepts the operation is sent to the NFSv4 server. The NFSv4 server accepts the
state information, and as long as any status code from the result of state information, and as long as any status code from the result of
this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
skipping to change at page 51, line 33 skipping to change at page 55, line 9
If we give each NFSv4.1 user their own session, and each user uses If we give each NFSv4.1 user their own session, and each user uses
RPCSEC_GSS authentication and integrity, then the denial of service RPCSEC_GSS authentication and integrity, then the denial of service
issue is solved, at the cost of additional per session state. The issue is solved, at the cost of additional per session state. The
alternative NFSv4.1 specifies is described as follows. alternative NFSv4.1 specifies is described as follows.
Transport connections MUST be bound to a session by the client. The Transport connections MUST be bound to a session by the client. The
server MUST return an error to an operation (other than the operation server MUST return an error to an operation (other than the operation
that binds the connection to the session) that uses an unbound that binds the connection to the session) that uses an unbound
connection. As a simplification, the transport connection used by connection. As a simplification, the transport connection used by
CREATE_SESSION (see Section 16.36) is automatically bound to the CREATE_SESSION (see Section 17.36) is automatically bound to the
session. Additional connections are bound to a session via session. Additional connections are bound to a session via
BIND_CONN_TO_SESSION (see Section 16.34). BIND_CONN_TO_SESSION (see Section 17.34).
To prevent attackers from issuing BIND_CONN_TO_SESSION operations, To prevent attackers from issuing BIND_CONN_TO_SESSION operations,
the arguments to BIND_CONN_TO_SESSION include a digest of a shared the arguments to BIND_CONN_TO_SESSION include a digest of a shared
secret called the secret session verifier (SSV) that only the client secret called the secret session verifier (SSV) that only the client
and server know. The digest is created via a one way, collision and server know. The digest is created via a one way, collision
resistant hash function, making it intractable for the attacker to resistant hash function, making it intractable for the attacker to
forge. forge.
The SSV is sent to the server via SET_SSV (see Section 16.47). To The SSV is sent to the server via SET_SSV (see Section 17.47). To
prevent eavesdropping, a SET_SSV for the SSV SHOULD be protected via prevent eavesdropping, a SET_SSV for the SSV SHOULD be protected via
RPCSEC_GSS with the privacy service. The SSV can be changed by the RPCSEC_GSS with the privacy service. The SSV can be changed by the
client at any time, by any principal. However, several aspects of SSV client at any time, by any principal. However, several aspects of SSV
changing prevent an attacker from engaging in a successful denial of changing prevent an attacker from engaging in a successful denial of
service attack: service attack:
o A SET_SSV on the SSV does not replace the SSV with the argument to o A SET_SSV on the SSV does not replace the SSV with the argument to
SET_SSV. Instead, the current SSV on the server is logically SET_SSV. Instead, the current SSV on the server is logically
exclusive ORed (XORed) with the argument to SET_SSV. SET_SSV MUST exclusive ORed (XORed) with the argument to SET_SSV. SET_SSV MUST
NOT be called with an SSV value that is zero. NOT be called with an SSV value that is zero.
skipping to change at page 54, line 17 skipping to change at page 57, line 38
and therefore known, Eve can issue a SET_SSV that will pass the and therefore known, Eve can issue a SET_SSV that will pass the
digest verification check. However, because the new connection has digest verification check. However, because the new connection has
not been bound to the session, the SET_SSV is rejected for that not been bound to the session, the SET_SSV is rejected for that
reason. reason.
o The connection to session binding model does not prevent o The connection to session binding model does not prevent
connection hijacking. However, if an attacker can perform connection hijacking. However, if an attacker can perform
connection hijacking, it can issue denial of service attacks that connection hijacking, it can issue denial of service attacks that
are less difficult than attacks based on forging sessions. are less difficult than attacks based on forging sessions.
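The XOR update rule for SET_SSV described in the list above can be sketched as follows. This is only an illustration, not the draft's definition: the function name apply_set_ssv is hypothetical, and the requirement that the argument have the same length as the current SSV is an assumption made so that the XOR is well defined.

```python
def apply_set_ssv(current_ssv: bytes, argument: bytes) -> bytes:
    """Return the new SSV: the current SSV XORed with the SET_SSV argument.

    Assumption (not stated in the draft text above): both values have
    the same length.
    """
    if len(current_ssv) != len(argument):
        raise ValueError("SSV and argument must have the same length")
    # The draft says SET_SSV MUST NOT be called with a zero value;
    # XORing with zero would leave the SSV unchanged.
    if not any(argument):
        raise ValueError("SET_SSV argument must not be zero")
    return bytes(a ^ b for a, b in zip(current_ssv, argument))
```

Because the server combines the argument with the existing SSV instead of replacing it, a caller that does not already know the current SSV cannot choose the resulting value.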
2.9.7. Session Mechanics - Steady State 2.10.7. Session Mechanics - Steady State
2.9.7.1. Obligations of the Server 2.10.7.1. Obligations of the Server
The server has the primary obligation to monitor the state of The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and back channel connections). When these (RPCSEC_GSS contexts and back channel connections). When these
resources go away, the server takes action as specified in resources go away, the server takes action as specified in
Section 2.9.8.2. Section 2.10.8.2.
2.9.7.2. Obligations of the Client 2.10.7.2. Obligations of the Client
The client has the following obligations in order to utilize the The client has the following obligations in order to utilize the
session: session:
o Keep a necessary session from going idle on the server. A client o Keep a necessary session from going idle on the server. A client
that requires a session, but nonetheless is not sending operations that requires a session, but nonetheless is not sending operations
risks having the session be destroyed by the server. This is risks having the session be destroyed by the server. This is
because sessions consume resources, and resource limitations may because sessions consume resources, and resource limitations may
force the server to cull the least recently used session. force the server to cull the least recently used session.
skipping to change at page 55, line 6 skipping to change at page 58, line 28
BACKCHANNEL_CTL are unexpired. A good practice is to keep at BACKCHANNEL_CTL are unexpired. A good practice is to keep at
least two contexts outstanding, where the expiration time of the least two contexts outstanding, where the expiration time of the
newest context at the time it was created, is N times that of the newest context at the time it was created, is N times that of the
oldest context, where N is the number of contexts available for oldest context, where N is the number of contexts available for
callbacks. callbacks.
o Maintain an active connection. The server requires a callback o Maintain an active connection. The server requires a callback
path in order to gracefully recall recallable state, or notify the path in order to gracefully recall recallable state, or notify the
client of certain events. client of certain events.
2.9.7.3. Steps the Client Takes To Establish a Session 2.10.7.3. Steps the Client Takes To Establish a Session
The client issues EXCHANGE_ID to establish a clientid. The client issues EXCHANGE_ID to establish a client ID.
The client uses the clientid to issue a CREATE_SESSION on a The client uses the client ID to issue a CREATE_SESSION on a
connection to the server. The results of CREATE_SESSION indicate connection to the server. The results of CREATE_SESSION indicate
whether the server will persist the session replay cache through a whether the server will persist the session replay cache through a
server reboot or not, and the client notes this for future reference. server reboot or not, and the client notes this for future reference.
The client SHOULD have specified connection binding enforcement when The client SHOULD have specified connection binding enforcement when
the session was created. If so, the client SHOULD issue SET_SSV in the session was created. If so, the client SHOULD issue SET_SSV in
the first COMPOUND after the session is created. If it is not using the first COMPOUND after the session is created. If it is not using
machine credentials, then each time a new principal goes to use the machine credentials, then each time a new principal goes to use the
session, it SHOULD issue a SET_SSV again. session, it SHOULD issue a SET_SSV again.
skipping to change at page 55, line 41 skipping to change at page 59, line 15
If the client wants to use additional connections for the If the client wants to use additional connections for the
backchannel, then it MUST call BIND_CONN_TO_SESSION on each backchannel, then it MUST call BIND_CONN_TO_SESSION on each
connection it wants to use with the session. If the client wants to connection it wants to use with the session. If the client wants to
use additional connections for the operation channel, then it MUST use additional connections for the operation channel, then it MUST
call BIND_CONN_TO_SESSION if it specified connection binding call BIND_CONN_TO_SESSION if it specified connection binding
enforcement before using the connection. enforcement before using the connection.
At this point the client has reached a steady state as far as session At this point the client has reached a steady state as far as session
use. use.
2.9.8. Session Mechanics - Recovery 2.10.8. Session Mechanics - Recovery
2.9.8.1. Events Requiring Client Action 2.10.8.1. Events Requiring Client Action
The following events require client action to recover. The following events require client action to recover.
2.9.8.1.1. RPCSEC_GSS Context Loss by Callback Path 2.10.8.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS contexts granted by the client to the server for If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL. The sr_status field of SEQUENCE results via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE
indicates when callback contexts are nearly expired, or fully expired results indicates when callback contexts are nearly expired, or fully
(see Section 16.46.4). expired (see Section 17.46.4).
2.9.8.1.2. Connection Disconnect 2.10.8.1.2. Connection Disconnect
If the client loses the last connection of the session, then it MUST If the client loses the last connection of the session, then it MUST
create a new connection, and if connection binding enforcement was create a new connection, and if connection binding enforcement was
specified when the session was created, bind it to the session via specified when the session was created, bind it to the session via
BIND_CONN_TO_SESSION. BIND_CONN_TO_SESSION.
If there were requests outstanding at the time of the connection If there were requests outstanding at the time of the connection
disconnect, then the client MUST retry the request, as described in disconnect, then the client MUST retry the request, as described in
Section 2.9.4.2. Note that it is not necessary to retry requests Section 2.10.4.2. Note that it is not necessary to retry requests
over a connection with the same source network address or the same over a connection with the same source network address or the same
destination network address as the disconnected connection. As long destination network address as the disconnected connection. As long
as the sessionid, slotid, and sequenceid in the retry match that of as the sessionid, slotid, and sequenceid in the retry match that of
the original request, the server will recognize the request as a the original request, the server will recognize the request as a
retry if it did see the request prior to disconnect. retry if it did see the request prior to disconnect.
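A minimal sketch of why the retry can arrive over any connection: recognition depends only on the sessionid, slotid, and sequenceid, never on network addresses. The class and method names below (ReplyCache, record, lookup) are hypothetical and are not taken from the draft; a real server's replay cache carries more state than this.

```python
class ReplyCache:
    """Toy replay cache keyed by session and slot (illustrative only)."""

    def __init__(self):
        # (sessionid, slotid) -> (sequenceid, cached reply)
        self._slots = {}

    def record(self, sessionid: bytes, slotid: int, sequenceid: int, reply):
        """Remember the reply for the latest request seen on this slot."""
        self._slots[(sessionid, slotid)] = (sequenceid, reply)

    def lookup(self, sessionid: bytes, slotid: int, sequenceid: int):
        """Return the cached reply if this request is a retry, else None.

        Note that no connection or address information is consulted.
        """
        entry = self._slots.get((sessionid, slotid))
        if entry is not None and entry[0] == sequenceid:
            return entry[1]
        return None
```

If the server never saw the original request, lookup finds no matching entry and the retry is simply processed as a new request.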
If the connection that was bound to the backchannel is lost, the If the connection that was bound to the backchannel is lost, the
client may need to reconnect, and use BIND_CONN_TO_SESSION, to give client may need to reconnect, and use BIND_CONN_TO_SESSION, to give
the connection to the backchannel. If the connection that was lost the connection to the backchannel. If the connection that was lost
was the last one bound to the backchannel, the client MUST reconnect, was the last one bound to the backchannel, the client MUST reconnect,
and bind the connection to the session and backchannel. The server and bind the connection to the session and backchannel. The server
should indicate when it has no callback connection via the sr_status should indicate when it has no callback connection via the
result from SEQUENCE. sr_status_flags result from SEQUENCE.
2.9.8.1.3. Backchannel GSS Context Loss 2.10.8.1.3. Backchannel GSS Context Loss
Via the sr_status result of the SEQUENCE operation or other means, Via the sr_status_flags result of the SEQUENCE operation or other
the client will learn if some or all of the RPCSEC_GSS contexts it means, the client will learn if some or all of the RPCSEC_GSS
assigned to the backchannel have been lost. The client may need to contexts it assigned to the backchannel have been lost. The client
use BACKCHANNEL_CTL to assign new contexts. It MUST assign new may need to use BACKCHANNEL_CTL to assign new contexts. It MUST
contexts if there are no more contexts. assign new contexts if there are no more contexts.
2.9.8.1.4. Loss of Session 2.10.8.1.4. Loss of Session
The server may lose a record of the session. Causes include: The server may lose a record of the session. Causes include:
o Server crash and reboot o Server crash and reboot
o A catastrophe that causes the cache to be corrupted or lost on the o A catastrophe that causes the cache to be corrupted or lost on the
media it was stored on. This applies even if the server indicated media it was stored on. This applies even if the server indicated
in the CREATE_SESSION results that it would persist the cache. in the CREATE_SESSION results that it would persist the cache.
o The server purges the session of a client that has been inactive o The server purges the session of a client that has been inactive
skipping to change at page 57, line 35 skipping to change at page 61, line 8
client has no general way to recover from this. client has no general way to recover from this.
Note that loss of session does not imply loss of lock, open, Note that loss of session does not imply loss of lock, open,
delegation, or layout state. Nor does loss of lock, open, delegation, or layout state. Nor does loss of lock, open,
delegation, or layout state imply loss of session state. delegation, or layout state imply loss of session state.
[[Comment.12: Add reference to lock recovery section]] . A session [[Comment.12: Add reference to lock recovery section]] . A session
can survive a server reboot, but lock recovery may still be needed. can survive a server reboot, but lock recovery may still be needed.
The converse is also true. The converse is also true.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server reboots and does not preserve clientid (for example the server reboots and does not preserve client ID
state). If so, the client needs to call EXCHANGE_ID, followed by state). If so, the client needs to call EXCHANGE_ID, followed by
CREATE_SESSION. CREATE_SESSION.
2.9.8.1.5. Failover 2.10.8.1.5. Failover
[[Comment.13: Dave Noveck requested this section; not sure what is [[Comment.13: Dave Noveck requested this section; not sure what is
needed here if this refers to failover to a replica. What are the needed here if this refers to failover to a replica. What are the
session ramifications?]] session ramifications?]]
2.9.8.2. Events Requiring Server Action 2.10.8.2. Events Requiring Server Action
The following events require server action to recover. The following events require server action to recover.
2.9.8.2.1. Client Crash and Reboot 2.10.8.2.1. Client Crash and Reboot
As described in Section 16.35, a rebooted client causes the server to As described in Section 17.35, a rebooted client causes the server to
delete any sessions it had. delete any sessions it had.
2.9.8.2.2. Client Crash with No Reboot 2.10.8.2.2. Client Crash with No Reboot
If a client crashes and never comes back, it will never issue If a client crashes and never comes back, it will never issue
EXCHANGE_ID with its old clientid. Thus the server has session state EXCHANGE_ID with its old client owner. Thus the server has session
that will never be used again. After an extended period of time and state that will never be used again. After an extended period of
if the server has resource constraints, it MAY destroy the old time and if the server has resource constraints, it MAY destroy the
session. old session.
2.9.8.2.3. Extended Network Partition 2.10.8.2.3. Extended Network Partition
To the server, the extended network partition may be no different To the server, the extended network partition may be no different
than a client crash with no reboot (see Section 2.9.8.2.2). Unless than a client crash with no reboot (see Section 2.10.8.2.2). Unless
the server can discern that there is a network partition, it is free the server can discern that there is a network partition, it is free
to treat the situation as if the client has crashed for good. to treat the situation as if the client has crashed for good.
2.9.8.2.4. Backchannel Connection Loss 2.10.8.2.4. Backchannel Connection Loss
If there were callback requests outstanding at the time of a If there were callback requests outstanding at the time of a
connection disconnect, then the server MUST retry the request, as connection disconnect, then the server MUST retry the request, as
described in Section 2.9.4.2. Note that it is not necessary to retry described in Section 2.10.4.2. Note that it is not necessary to
requests over a connection with the same source network address or retry requests over a connection with the same source network address
the same destination network address as the disconnected connection. or the same destination network address as the disconnected
As long as the sessionid, slotid, and sequenceid in the retry match connection. As long as the sessionid, slotid, and sequenceid in the
that of the original request, the callback target will recognize the retry match that of the original request, the callback target will
request as a retry if it did see the request prior to disconnect. recognize the request as a retry if it did see the request prior to
disconnect.
If the connection lost is the last one bound to the backchannel, then If the connection lost is the last one bound to the backchannel, then
the server MUST indicate that in the sr_status field of the next the server MUST indicate that in the sr_status_flags field of the
SEQUENCE reply. next SEQUENCE reply.
2.9.8.2.5. GSS Context Loss 2.10.8.2.5. GSS Context Loss
The server SHOULD monitor when the last RPCSEC_GSS context assigned The server SHOULD monitor when the last RPCSEC_GSS context assigned
to the backchannel is near expiry (i.e. between one and two periods to the backchannel is near expiry (i.e. between one and two periods
of lease time), and indicate so in the sr_status field of the next of lease time), and indicate so in the sr_status_flags field of the
SEQUENCE reply. The server MUST indicate when the backchannel's last next SEQUENCE reply. The server MUST indicate when the backchannel's
RPCSEC_GSS context has expired in the sr_status field of the next last RPCSEC_GSS context has expired in the sr_status_flags field of
SEQUENCE reply. the next SEQUENCE reply.
2.10.9. Parallel NFS and Sessions
A client and server can potentially be a non-pNFS implementation, a
metadata server implementation, a data server implementation, or two
or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
mutually exclusive) are passed in the EXCHANGE_ID arguments and
results to allow the client to indicate how it wants to use sessions
created under the client ID, and to allow the server to indicate how
it will allow the sessions to be used. See Section 13.1 for pNFS
sessions considerations.
3. Protocol Data Types 3. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831 version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[4] documents. The next sections build upon the XDR data types to [4] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
3.1. Basic Data Types 3.1. Basic Data Types
skipping to change at page 60, line 11 skipping to change at page 63, line 42
| pathname4 | typedef component4 pathname4<>; | | pathname4 | typedef component4 pathname4<>; |
| | Represents path name for fs_locations | | | Represents path name for fs_locations |
| qop4 | typedef uint32_t qop4; | | qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO | | | Quality of protection designation in SECINFO |
| sec_oid4 | typedef opaque sec_oid4<>; | | sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier. The sec_oid4 data type | | | Security Object Identifier. The sec_oid4 data type |
| | is not really opaque; instead it contains an ASN.1 | | | is not really opaque; instead it contains an ASN.1 |
| | OBJECT IDENTIFIER as used by GSS-API in the | | | OBJECT IDENTIFIER as used by GSS-API in the |
| | mech_type argument to GSS_Init_sec_context. See | | | mech_type argument to GSS_Init_sec_context. See |
| | RFC2743 [8] for details. | | | RFC2743 [8] for details. |
| sequenceid4 | typedef uint32_t sequenceid4; |
| | sequence number used for various session |
| | operations (EXCHANGE_ID, CREATE_SESSION, |
| | SEQUENCE, CB_SEQUENCE). |
| seqid4 | typedef uint32_t seqid4; | | seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking | | | Sequence identifier used for file locking |
| sessionid4 | typedef opaque sessionid4[16]; |
| | Session identifier |
| slotid4 | typedef uint32_t slotid4; |
| | Sequencing artifact for various session |
| | operations (SEQUENCE, CB_SEQUENCE). |
| utf8string | typedef opaque utf8string<>; | | utf8string | typedef opaque utf8string<>; |
| | UTF-8 encoding for strings | | | UTF-8 encoding for strings |
| utf8str_cis | typedef opaque utf8str_cis; | | utf8str_cis | typedef opaque utf8str_cis; |
| | Case-insensitive UTF-8 string | | | Case-insensitive UTF-8 string |
| utf8str_cs | typedef opaque utf8str_cs; | | utf8str_cs | typedef opaque utf8str_cs; |
| | Case-sensitive UTF-8 string | | | Case-sensitive UTF-8 string |
| utf8str_mixed | typedef opaque utf8str_mixed; | | utf8str_mixed | typedef opaque utf8str_mixed; |
| | UTF-8 strings with a case sensitive prefix and a | | | UTF-8 strings with a case sensitive prefix and a |
| | case insensitive suffix. | | | case insensitive suffix. |
| verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; |
| | Verifier used for various operations (COMMIT, | | | Verifier used for various operations (COMMIT, |
| | CREATE, OPEN, READDIR, SETCLIENTID, | | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) |
| | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is | | | NFS4_VERIFIER_SIZE is defined as 8. |
| | defined as 8. |
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
End of Base Data Types End of Base Data Types
Table 1 Table 1
3.2. Structured Data Types 3.2. Structured Data Types
3.2.1. nfstime4 3.2.1. nfstime4
skipping to change at page 63, line 36 skipping to change at page 67, line 34
representing an IPv4 address, which is always four octets long. representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
the first through fourth octets each converted to ASCII-decimal. the first through fourth octets each converted to ASCII-decimal.
Assuming big-endian ordering, p1 and p2 are, respectively, the first Assuming big-endian ordering, p1 and p2 are, respectively, the first
and second octets each converted to ASCII-decimal. For example, if a and second octets each converted to ASCII-decimal. For example, if a
host, in big-endian order, has an address of 0x0A010307 and there is host, in big-endian order, has an address of 0x0A010307 and there is
a service listening on, in big endian order, port 0x020F (decimal a service listening on, in big endian order, port 0x020F (decimal
527), then the complete universal address is "10.1.3.7.2.15". 527), then the complete universal address is "10.1.3.7.2.15".
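The construction of the IPv4 universal address can be sketched in a few lines. The function name universal_address_ipv4 is hypothetical; the example reuses the address and port values given in the paragraph above.

```python
def universal_address_ipv4(addr: int, port: int) -> str:
    """Format a 32-bit IPv4 address and 16-bit port as h1.h2.h3.h4.p1.p2."""
    octets = addr.to_bytes(4, "big")    # h1..h4 in big-endian order
    p1, p2 = port.to_bytes(2, "big")    # high and low octets of the port
    return ".".join(str(b) for b in (*octets, p1, p2))


print(universal_address_ipv4(0x0A010307, 0x020F))  # prints 10.1.3.7.2.15
```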
For TCP over IPv4 the value of r_netid is the string "tcp". For UDP For TCP over IPv4 the value of r_netid is the string "tcp". For UDP
over IPv4 the value of r_netid is the string "udp". over IPv4 the value of r_netid is the string "udp". That this
document specifies the universal address and netid for UDP/IPv4 does
not imply that UDP/IPv4 is a legal transport for NFSv4.1 (see
Section 2.9).
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string: US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2 x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix, as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884 representing an IPv6 address as defined in Section 2.2 of RFC1884
[9]. Additionally, the two alternative forms specified in Section [9]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [9] are also acceptable. 2.2 of RFC1884 [9] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6". over IPv6 the value of r_netid is the string "udp6". That this
document specifies the universal address and netid for UDP/IPv6 does
3.2.11. clientaddr4 not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see
Section 2.9).
typedef netaddr4 clientaddr4;
The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a
clientid or as part of the callback registration.
3.2.12. cb_client4
struct cb_client4 {
unsigned int cb_program;
netaddr4 cb_location;
};
This structure is used by the client to inform the server of its call
back address; includes the program number and client address.
3.2.13. nfs_client_id4
struct nfs_client_id4 {
verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>;
};
This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024.
3.2.14. open_owner4 3.2.11. open_owner4
struct open_owner4 { struct open_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>; opaque owner<NFS4_OPAQUE_LIMIT>;
}; };
This structure is used to identify the owner of open state. This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
3.2.15. lock_owner4 3.2.12. lock_owner4
struct lock_owner4 { struct lock_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>; opaque owner<NFS4_OPAQUE_LIMIT>;
}; };
This structure is used to identify the owner of file locking state. This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024.
3.2.16. open_to_lock_owner4 3.2.13. open_to_lock_owner4
struct open_to_lock_owner4 { struct open_to_lock_owner4 {
seqid4 open_seqid; seqid4 open_seqid;
stateid4 open_stateid; stateid4 open_stateid;
seqid4 lock_seqid; seqid4 lock_seqid;
lock_owner4 lock_owner; lock_owner4 lock_owner;
}; };
This structure is used for the first LOCK operation done for an This structure is used for the first LOCK operation done for an
open_owner4. It provides both the open_stateid and lock_owner such open_owner4. It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence. Using this mechanism avoids that of the new lock_stateid sequence. Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid. to established state in the form of the open_stateid/open_seqid.
3.2.17. stateid4 3.2.14. stateid4
struct stateid4 { struct stateid4 {
uint32_t seqid; uint32_t seqid;
opaque other[12]; opaque other[12];
}; };
This structure is used for the various state sharing mechanisms This structure is used for the various state sharing mechanisms
between the client and server. For the client, this data structure between the client and server. For the client, this data structure
is read-only. The starting value of the seqid field is undefined. is read-only. The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at The server is required to increment the seqid field monotonically at
each transition of the stateid. This is important since the client each transition of the stateid. This is important since the client
will inspect the seqid in OPEN stateids to determine the order of will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server. OPEN processing done by the server.
3.2.18. layouttype4 3.2.15. layouttype4
enum layouttype4 { enum layouttype4 {
LAYOUT_NFSV4_FILES = 1, LAYOUT4_NFSV4_1_FILES = 1,
LAYOUT_OSD2_OBJECTS = 2, LAYOUT4_OSD2_OBJECTS = 2,
LAYOUT_BLOCK_VOLUME = 3 LAYOUT4_BLOCK_VOLUME = 3
}; };
A layout type specifies the layout being used. The implication is A layout type specifies the layout being used. The implication is
that clients have "layout drivers" that support one or more layout that clients have "layout drivers" that support one or more layout
types. The file server advertises the layout types it supports types. The file server advertises the layout types it supports
through the LAYOUT_TYPES file system attribute. A client asks for through the fs_layout_type file system attribute (Section 5.13.1). A
layouts of a particular type in LAYOUTGET, and passes those layouts client asks for layouts of a particular type in LAYOUTGET, and passes
to its layout driver. those layouts to its layout driver.
The layouttype4 structure is 32 bits in length. The range The layouttype4 structure is 32 bits in length. The range
represented by the layout type is split into two parts. Types within represented by the layout type is split into three parts. Type 0x0
the range 0x00000000-0x7FFFFFFF are globally unique and are assigned is reserved. Types within the range 0x00000001-0x7FFFFFFF are
according to the description in Section 20.1; they are maintained by globally unique and are assigned according to the description in
IANA. Types within the range 0x80000000-0xFFFFFFFF are site specific Section 21.1; they are maintained by IANA. Types within the range
and for "private use" only. 0x80000000-0xFFFFFFFF are site specific and for "private use" only.
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [23], is to be used. specifies that the object layout, as defined in [23], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume
layout, as defined in [24], is to be used. layout, as defined in [24], is to be used.
3.2.19. deviceid4 3.2.16. deviceid4
typedef uint32_t deviceid4; /* 32-bit device ID */ typedef uint32_t deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system ID is qualified by the layout type and is unique per file system
(FSID). This allows different layout drivers to generate device IDs (FSID). This allows different layout drivers to generate device IDs
without the need for co-ordination. See Section 12.3.1.4 for more without the need for co-ordination. See Section 12.2.12 for more
details. details.
3.2.20. devlist_item4 3.2.17. device_addr4
struct devlist_item4 { struct device_addr4 {
deviceid4 dli_id; layouttype4 da_layout_type;
opaque dli_device_addr<>; opaque da_addr_body<>;
}; };
An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
The device address is used to set up a communication channel with the The device address is used to set up a communication channel with the
storage device. Different layout types will require different types storage device. Different layout types will require different types
of structures to define how they communicate with storage devices. of structures to define how they communicate with storage devices.
The opaque device_addr field must be interpreted based on the The opaque da_addr_body field must be interpreted based on the
specified layout type. specified da_layout_type field.
This document defines the device address for the NFSv4 file layout This document defines the device address for the NFSv4.1 file layout
(struct netaddr4 (Section 3.2.10)), which identifies a storage device ([[Comment.14: need xref]]), which identifies a storage device by
by network IP address and port number. This is sufficient for the network IP address and port number. This is sufficient for the
clients to communicate with the NFSv4 storage devices, and may be clients to communicate with the NFSv4.1 storage devices, and may be
sufficient for other layout types as well. Device types for object sufficient for other layout types as well. Device types for object
storage devices and block storage devices (e.g., SCSI volume labels) storage devices and block storage devices (e.g., SCSI volume labels)
will be defined by their respective layout specifications. will be defined by their respective layout specifications.
3.2.21. layout4 3.2.18. devlist_item4
struct devlist_item4 {
deviceid4 dli_id;
device_addr4 dli_device_addr<>;
};
An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
3.2.19. layout_content4
struct layout_content4 {
layouttype4 loc_type;
opaque loc_body<>;
};
The loc_body field must be interpreted based on the layout type
(loc_type). This document defines the loc_body for the NFSv4.1 file
layout type; see Section 13.3 for its definition.
3.2.20. layout4
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layouttype4 lo_type; layout_content4 lo_content;
opaque lo_layout<>;
}; };
The layout4 structure defines a layout for a file. The layout type The layout4 structure defines a layout for a file. The layout type
specific data is opaque within this structure and must be specific data is opaque within lo_content. Since layouts are sub-
interpreted based on the layout type. Currently, only the NFSv4 dividable, the offset and length, together with the file's filehandle,
file layout type is defined; see Section 12.4.2 for its definition. the client ID, iomode, and layout type, identify the layout.
Since layouts are sub-dividable, the offset and length, together with
the file's filehandle, the clientid, iomode, and layout type,
identify the layout.
3.2.22. layoutupdate4 3.2.21. layoutupdate4
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_data<>; opaque lou_body<>;
}; };
The layoutupdate4 structure is used by the client to return 'updated' The layoutupdate4 structure is used by the client to return 'updated'
layout information to the metadata server at LAYOUTCOMMIT time. This layout information to the metadata server at LAYOUTCOMMIT time. This
structure provides a channel to pass layout type specific information structure provides a channel to pass layout type specific information
back to the metadata server. E.g., for block/volume layout types (in field lou_body) back to the metadata server. E.g., for block/
this could include the list of reserved blocks that were written. volume layout types this could include the list of reserved blocks
The contents of the opaque lou_data argument are determined by the that were written. The contents of the opaque lou_body argument are
layout type and are defined in their context. The NFSv4 file-based determined by the layout type and are defined in their context. The
layout does not use this structure, thus the update_data field should NFSv4.1 file-based layout does not use this structure, thus the
have a zero length. lou_body field should have a zero length.
3.2.23. layouthint4 3.2.22. layouthint4
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_data<>; opaque loh_body<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute It is the structure specified by the layout_hint attribute described
described below. The metadata server may ignore the hint, or may in Section 5.13.4. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" OPEN. The loh_body field is specific to the type of layout
structure as defined in Section 12.4.2. (loh_type). The NFSv4.1 file-based layout uses the
nfsv4_1_file_layouthint4 structure as defined in Section 13.3.
3.2.24. layoutiomode4 3.2.23. layoutiomode4
enum layoutiomode4 { enum layoutiomode4 {
LAYOUTIOMODE_READ = 1, LAYOUTIOMODE4_READ = 1,
LAYOUTIOMODE_RW = 2, LAYOUTIOMODE4_RW = 2,
LAYOUTIOMODE_ANY = 3 LAYOUTIOMODE4_ANY = 3
}; };
The iomode specifies whether the client intends to read or write The iomode specifies whether the client intends to read or write
(with the possibility of reading) the data represented by the layout. (with the possibility of reading) the data represented by the layout.
The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be
used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies
that layouts pertaining to both READ and RW iomodes are being that layouts pertaining to both READ and RW iomodes are being
returned or recalled, respectively. The metadata server's use of the returned or recalled, respectively. The metadata server's use of the
iomode may depend on the layout type being used. The storage devices iomode may depend on the layout type being used. The storage devices
may validate I/O accesses against the iomode and reject invalid may validate I/O accesses against the iomode and reject invalid
accesses. accesses.
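The iomode rules above can be expressed as two small predicates. The following is a sketch using the -09 enumeration names; the enum layout_op tags and both function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum layoutiomode4 {
    LAYOUTIOMODE4_READ = 1,
    LAYOUTIOMODE4_RW   = 2,
    LAYOUTIOMODE4_ANY  = 3
};

/* Hypothetical operation tags, used only by this sketch. */
enum layout_op { OP_LAYOUTGET, OP_LAYOUTRETURN, OP_LAYOUTRECALL };

/* ANY is acceptable for LAYOUTRETURN and LAYOUTRECALL, never for LAYOUTGET. */
bool iomode_valid_for_op(enum layout_op op, int iomode)
{
    switch (iomode) {
    case LAYOUTIOMODE4_READ:
    case LAYOUTIOMODE4_RW:
        return true;
    case LAYOUTIOMODE4_ANY:
        return op != OP_LAYOUTGET;
    default:
        return false;   /* value outside the enumeration */
    }
}

/* Does a return or recall carrying 'requested' cover a layout held
 * with iomode 'held'?  ANY covers both READ and RW layouts. */
bool iomode_covers(int requested, int held)
{
    return requested == LAYOUTIOMODE4_ANY || requested == held;
}
```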
3.2.25. nfs_impl_id4 3.2.24. nfs_impl_id4
struct nfs_impl_id4 { struct nfs_impl_id4 {
utf8str_cis nii_domain; utf8str_cis nii_domain;
utf8str_cs nii_name; utf8str_cs nii_name;
nfstime4 nii_date; nfstime4 nii_date;
}; };
This structure is used to identify client and server implementation This structure is used to identify client and server implementation
detail. The nii_domain field is the DNS domain name that the detail. The nii_domain field is the DNS domain name that the
implementer is associated with. The nii_name field is the product implementer is associated with. The nii_name field is the product
name of the implementation and is completely free form. It is name of the implementation and is completely free form. It is
encouraged that the nii_name be used to distinguish machine recommended that the nii_name be used to distinguish machine
architecture, machine platforms, revisions, versions, and patch architecture, machine platforms, revisions, versions, and patch
levels. The nii_date field is the timestamp of when the software levels. The nii_date field is the timestamp of when the software
instance was published or built. instance was published or built.
3.2.26. threshold_item4 3.2.25. threshold_item4
struct threshold_item4 { struct threshold_item4 {
layouttype4 thi_layout_type; layouttype4 thi_layout_type;
bitmap4 thi_hintset; bitmap4 thi_hintset;
opaque thi_hintlist<>; opaque thi_hintlist<>;
}; };
This structure contains a list of hints specific to a layout type for This structure contains a list of hints specific to a layout type for
helping the client determine when it should issue I/O directly helping the client determine when it should issue I/O directly
through the metadata server vs. the data servers. The hint structure through the metadata server vs. the data servers. The hint structure
consists of the layout type, a bitmap describing the set of hints consists of the layout type (thi_layout_type), a bitmap (thi_hintset)
supported by the server, they may differ based on the layout type, describing the set of hints supported by the server (they may differ
and a list of hints, whose structure is determined by the hintset based on the layout type), and a list of hints (thi_hintlist), whose
bitmap. See the mdsthreshold attribute for more details. structure is determined by the hintset bitmap. See the mdsthreshold
attribute for more details.
The hintset is a bitmap of the following values: The thi_hintset field is a bitmap of the following values:
+-------------------------+---+---------+---------------------------+ +-------------------------+---+---------+---------------------------+
| name | # | Data | Description | | name | # | Data | Description |
| | | Type | | | | | Type | |
+-------------------------+---+---------+---------------------------+ +-------------------------+---+---------+---------------------------+
| threshold4_read_size | 0 | length4 | The file size below which | | threshold4_read_size | 0 | length4 | The file size below which |
| | | | it is recommended to read | | | | | it is recommended to read |
| | | | data through the MDS. | | | | | data through the MDS. |
| threshold4_write_size | 1 | length4 | The file size below which | | threshold4_write_size | 1 | length4 | The file size below which |
| | | | it is recommended to | | | | | it is recommended to |
skipping to change at page 69, line 33 skipping to change at page 73, line 31
| threshold4_read_iosize | 2 | length4 | For read I/O sizes below | | threshold4_read_iosize | 2 | length4 | For read I/O sizes below |
| | | | this threshold it is | | | | | this threshold it is |
| | | | recommended to read data | | | | | recommended to read data |
| | | | through the MDS | | | | | through the MDS |
| threshold4_write_iosize | 3 | length4 | For write I/O sizes below | | threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
| | | | this threshold it is | | | | | this threshold it is |
| | | | recommended to write data | | | | | recommended to write data |
| | | | through the MDS | | | | | through the MDS |
+-------------------------+---+---------+---------------------------+ +-------------------------+---+---------+---------------------------+
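One way a client might apply the read hints in the table is sketched below. Several things here are assumptions rather than protocol definitions: the struct decoded_thresholds is a hypothetical, already-decoded form of a threshold_item4 (only the first 32-bit word of thi_hintset is modeled, and the XDR decoding of thi_hintlist is not shown), and treating the two read hints as independent conditions, either of which sends the read through the MDS, is a choice made for this sketch that the text above does not mandate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t length4;

/* Bit numbers taken from the table above. */
enum {
    THRESHOLD4_READ_SIZE    = 0,
    THRESHOLD4_WRITE_SIZE   = 1,
    THRESHOLD4_READ_IOSIZE  = 2,
    THRESHOLD4_WRITE_IOSIZE = 3
};

/* Hypothetical decoded view of one threshold_item4. */
struct decoded_thresholds {
    uint32_t hintset;   /* first word of thi_hintset */
    length4  value[4];  /* indexed by bit number; valid only if the bit is set */
};

/* True if the server supplied the hint with the given bit number. */
bool hint_present(const struct decoded_thresholds *t, unsigned bit)
{
    return bit < 4 && (t->hintset & (UINT32_C(1) << bit)) != 0;
}

/* Assumed policy: read through the MDS if the file size or the I/O
 * size falls below a hint that the server actually provided. */
bool read_via_mds(const struct decoded_thresholds *t,
                  length4 file_size, length4 io_size)
{
    if (hint_present(t, THRESHOLD4_READ_SIZE) &&
        file_size < t->value[THRESHOLD4_READ_SIZE])
        return true;
    if (hint_present(t, THRESHOLD4_READ_IOSIZE) &&
        io_size < t->value[THRESHOLD4_READ_IOSIZE])
        return true;
    return false;
}
```

The write hints (bits 1 and 3) would be handled the same way; absent hints are simply ignored, so an empty hintset never steers I/O toward the MDS.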
3.2.27. mdsthreshold4 3.2.26. mdsthreshold4
struct mdsthreshold4 { struct mdsthreshold4 {
threshold_item4 mth_hints<>; threshold_item4 mth_hints<>;
}; };
This structure holds an array of threshold_item4 structures each of This structure holds an array of threshold_item4 structures each of
which is valid for a particular layout type. An array is necessary which is valid for a particular layout type. An array is necessary
since a server can support multiple layout types for a single file. since a server can support multiple layout types for a single file.
4. Filehandles 4. Filehandles
skipping to change at page 71, line 37 skipping to change at page 75, line 37
filehandles differently, a file attribute is defined which may be filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned used by the client to determine the filehandle types being returned
by the server. by the server.
4.2.1. General Properties of a Filehandle 4.2.1. General Properties of a Filehandle
The filehandle contains all the information the server needs to The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT doing an octet-by-octet comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file. from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use filehandles and files but this is not required. Clients MUST use
filehandle comparisons only to improve performance, not for correct filehandle comparisons only to improve performance, not for correct
behavior. All clients need to be prepared for situations in which it behavior. All clients need to be prepared for situations in which it
cannot be determined whether two filehandles denote the same object cannot be determined whether two filehandles denote the same object
and in such cases, avoid making invalid assumptions which might cause and in such cases, avoid making invalid assumptions which might cause
incorrect behavior. Further discussion of filehandle and attribute incorrect behavior. Further discussion of filehandle and attribute
comparison in the context of data caching is presented in the section comparison in the context of data caching is presented in the section
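The octet-by-octet comparison described in this section can be sketched as follows. The struct fh_buf container and its 128-octet capacity are assumptions of this sketch. Note the asymmetry in what the result means: a true result indicates that two filehandles from the same server refer to the same file, while a false result indicates only that identity could not be established.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical client-side container for an opaque filehandle. */
struct fh_buf {
    size_t        len;
    unsigned char data[128];  /* capacity is an assumption of this sketch */
};

/* Compare two filehandles obtained from the same server.  The contents
 * are never interpreted; only length and octets are compared. */
bool fh_equal(const struct fh_buf *a, const struct fh_buf *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```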
skipping to change at page 78, line 7 skipping to change at page 82, line 7
Each of the Mandatory and Recommended attributes can be classified in Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per file system, or per file one of three categories: per server, per file system, or per file
system object. Note that it is possible that some per file system system object. Note that it is possible that some per file system
attributes may vary within the file system. See the "homogeneous" attributes may vary within the file system. See the "homogeneous"
attribute for its definition. Note that the attributes attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR. and time_modify, and are used in a special instance of SETATTR.
o The per server attributes are: o The per server attribute is:
lease_time, send_impl_id, recv_impl_id lease_time
o The per file system attributes are: o The per file system attributes are:
supp_attr, fh_expire_type, link_support, symlink_support, supp_attr, fh_expire_type, link_support, symlink_support,
unique_handles, aclsupport, cansettime, case_insensitive, unique_handles, aclsupport, cansettime, case_insensitive,
case_preserving, chown_restricted, files_avail, files_free, case_preserving, chown_restricted, files_avail, files_free,
files_total, fs_locations, homogeneous, maxfilesize, maxname, files_total, fs_locations, homogeneous, maxfilesize, maxname,
maxread, maxwrite, no_trunc, space_avail, space_free, maxread, maxwrite, no_trunc, space_avail, space_free,
space_total, time_delta, fs_layout_type space_total, time_delta, fs_status, fs_layout_type,
fs_locations_info
o The per file system object attributes are: o The per file system object attributes are:
type, change, size, named_attr, fsid, rdattr_error, filehandle, type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode, ACL, archive, fileid, hidden, maxlink, mimetype, mode,
numlinks, owner, owner_group, rawdev, space_used, system, numlinks, owner, owner_group, rawdev, space_used, system,
time_access, time_backup, time_create, time_metadata, time_access, time_backup, time_create, time_metadata,
time_modify, mounted_on_fileid, layout_type, layout_hint, time_modify, mounted_on_fileid, dir_notif_delay,
layout_blksize, layout_alignment dirent_notif_delay, dacl, sacl, layout_type, layout_hint,
layout_blksize, layout_alignment, mdsthreshold, retention_get,
retention_set, retentevt_get, retentevt_set, retention_hold,
mode_set_masked
For quota_avail_hard, quota_avail_soft, and quota_used see their For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification. definitions below for the appropriate classification.
5.5. Mandatory Attributes - Definitions 5.5. Mandatory Attributes - Definitions
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
| supp_attr | 0 | bitmap | READ | The bit vector which | | supp_attr | 0 | bitmap | READ | The bit vector which |
| | | | | would retrieve all | | | | | | would retrieve all |
| | | | | mandatory and | | | | | | mandatory and |
| | | | | recommended | | | | | | recommended |
| | | | | attributes that are | | | | | | attributes that are |
| | | | | supported for this | | | | | | supported for this |
| | | | | object. The scope of | | | | | | object. The scope |
| | | | | this attribute | | | | | | of this attribute |
| | | | | applies to all | | | | | | applies to all |
| | | | | objects with a | | | | | | objects with a |
| | | | | matching fsid. | | | | | | matching fsid. |
| type | 1 | nfs4_ftype | READ | The type of the | | type | 1 | nfs4_ftype | READ | The type of the |
| | | | | object (file, | | | | | | object (file, |
| | | | | directory, symlink, | | | | | | directory, symlink, |
| | | | | etc.) | | | | | | etc.) |
| fh_expire_type | 2 | uint32 | READ | Server uses this to | | fh_expire_type | 2 | uint32 | READ | Server uses this to |
| | | | | specify filehandle | | | | | | specify filehandle |
| | | | | expiration behavior | | | | | | expiration behavior |
skipping to change at page 79, line 20 skipping to change at page 83, line 41
| | | | | additional | | | | | | additional |
| | | | | description. | | | | | | description. |
| change | 3 | uint64 | READ | A value created by | | change | 3 | uint64 | READ | A value created by |
| | | | | the server that the | | | | | | the server that the |
| | | | | client can use to | | | | | | client can use to |
| | | | | determine if file | | | | | | determine if file |
| | | | | data, directory | | | | | | data, directory |
| | | | | contents or | | | | | | contents or |
| | | | | attributes of the | | | | | | attributes of the |
| | | | | object have been | | | | | | object have been |
| | | | | modified. The server | | | | | | modified. The |
| | | | | may return the | | | | | | server may return |
| | | | | object's | | | | | | the object's |
| | | | | time_metadata | | | | | | time_metadata |
| | | | | attribute for this | | | | | | attribute for this |
| | | | | attribute's value | | | | | | attribute's value |
| | | | | but only if the file | | | | | | but only if the file |
| | | | | system object can | | | | | | system object can |
| | | | | not be updated more | | | | | | not be updated more |
| | | | | frequently than the | | | | | | frequently than the |
| | | | | resolution of | | | | | | resolution of |
| | | | | time_metadata. | | | | | | time_metadata. |
| size | 4 | uint64 | R/W | The size of the | | size | 4 | uint64 | R/W | The size of the |
| | | | | object in bytes. | | | | | | object in bytes. |
| link_support | 5 | bool | READ | True, if the | | link_support | 5 | bool | READ | True, if the |
| | | | | object's file system | | | | | | object's file system |
| | | | | supports hard links. | | | | | | supports hard links. |
| symlink_support | 6 | bool | READ | True, if the | | symlink_support | 6 | bool | READ | True, if the |
| | | | | object's file system | | | | | | object's file system |
| | | | | supports symbolic | | | | | | supports symbolic |
| | | | | links. | | | | | | links. |
| named_attr | 7 | bool | READ | True, if this object | | named_attr | 7 | bool | READ | True, if this object |
| | | | | has named | | | | | | has named |
| | | | | attributes. In other | | | | | | attributes. In |
| | | | | words, object has a | | | | | | other words, object |
| | | | | non-empty named | | | | | | has a non-empty |
| | | | | attribute directory. | | | | | | named attribute |
| | | | | directory. |
| fsid | 8 | fsid4 | READ | Unique file system | | fsid | 8 | fsid4 | READ | Unique file system |
| | | | | identifier for the | | | | | | identifier for the |
| | | | | file system holding | | | | | | file system holding |
| | | | | this object. fsid | | | | | | this object. fsid |
| | | | | contains major and | | | | | | contains major and |
| | | | | minor components | | | | | | minor components |
| | | | | each of which are | | | | | | each of which are |
| | | | | uint64. | | | | | | uint64. |
| unique_handles | 9 | bool | READ | True, if two | | unique_handles | 9 | bool | READ | True, if two |
| | | | | distinct filehandles | | | | | | distinct filehandles |
skipping to change at page 81, line 46 skipping to change at page 86, line 25
| | | | | privileged | | | | | | privileged |
| | | | | user (for | | | | | | user (for |
| | | | | example, | | | | | | example, |
| | | | | "root" in UNIX | | | | | | "root" in UNIX |
| | | | | operating | | | | | | operating |
| | | | | environments | | | | | | environments |
| | | | | or in Windows | | | | | | or in Windows |
| | | | | 2000 the "Take | | | | | | 2000 the "Take |
| | | | | Ownership" | | | | | | Ownership" |
| | | | | privilege). | | | | | | privilege). |
| dacl | 58 | nfsacl41 | R/W | Automatically |
| | | | | inheritable |
| | | | | access control |
| | | | | list used for |
| | | | | determining |
| | | | | access to file |
| | | | | system |
| | | | | objects. |
| dir_notif_delay | 56 | nfstime4 | READ | notification | | dir_notif_delay | 56 | nfstime4 | READ | notification |
| | | | | delays on | | | | | | delays on |
| | | | | directory | | | | | | directory |
| | | | | attributes | | | | | | attributes |
| dirent_ | 57 | nfstime4 | READ | notification | | dirent_ | 57 | nfstime4 | READ | notification |
| notif_delay | | | | delays on | | notif_delay | | | | delays on |
| | | | | child | | | | | | child |
| | | | | attributes | | | | | | attributes |
| fileid | 20 | uint64 | READ | A number | | fileid | 20 | uint64 | READ | A number |
| | | | | uniquely | | | | | | uniquely |
skipping to change at page 84, line 28 skipping to change at page 89, line 16
| | | | | for this | | | | | | for this |
| | | | | object. | | | | | | object. |
| maxwrite | 31 | uint64 | READ | Maximum write | | maxwrite | 31 | uint64 | READ | Maximum write |
| | | | | size supported | | | | | | size supported |
| | | | | for this | | | | | | for this |
| | | | | object. This | | | | | | object. This |
| | | | | attribute | | | | | | attribute |
| | | | | SHOULD be | | | | | | SHOULD be |
| | | | | supported if | | | | | | supported if |
| | | | | the file is | | | | | | the file is |
| | | | | writable. Lack | | | | | | writable. |
| | | | | of this | | | | | | Lack of this |
| | | | | attribute can | | | | | | attribute can |
| | | | | lead to the | | | | | | lead to the |
| | | | | client either | | | | | | client either |
| | | | | wasting | | | | | | wasting |
| | | | | bandwidth or | | | | | | bandwidth or |
| | | | | not receiving | | | | | | not receiving |
| | | | | the best | | | | | | the best |
| | | | | performance. | | | | | | performance. |
| mdsthreshold | 68 | mdsthreshold4 | READ | Hint to client | | mdsthreshold | 68 | mdsthreshold4 | READ | Hint to client |
| | | | | as to when to | | | | | | as to when to |
| | | | | write through | | | | | | write through |
| | | | | the pnfs | | | | | | the pnfs |
| | | | | metadata | | | | | | metadata |
| | | | | server. | | | | | | server. |
| mimetype | 32 | utf8<> | R/W | MIME body | | mimetype | 32 | utf8<> | R/W | MIME body |
| | | | | type/subtype | | | | | | type/subtype |
| | | | | of this | | | | | | of this |
| | | | | object. | | | | | | object. |
| mode | 33 | mode4 | R/W | UNIX-style | | mode | 33 | mode4 | R/W | UNIX-style |
| | | | | mode and | | | | | | mode including |
| | | | | permission | | | | | | permission |
| | | | | bits for this | | | | | | bits for this |
| | | | | object. | | | | | | object. |
| mode_set_masked | 74 | mode_masked4 | WRITE | Allows setting |
| | | | | or resetting a |
| | | | | subset of the |
| | | | | bits in a |
| | | | | UNIX-style |
| | | | | mode |
| mounted_on_fileid | 55 | uint64 | READ | Like fileid, | | mounted_on_fileid | 55 | uint64 | READ | Like fileid, |
| | | | | but if the | | | | | | but if the |
| | | | | target | | | | | | target |
| | | | | filehandle is | | | | | | filehandle is |
| | | | | the root of a | | | | | | the root of a |
| | | | | file system | | | | | | file system |
| | | | | return the | | | | | | return the |
| | | | | fileid of the | | | | | | fileid of the |
| | | | | underlying | | | | | | underlying |
| | | | | directory. | | | | | | directory. |
skipping to change at page 86, line 18 skipping to change at page 91, line 18
| | | | | node | | | | | | node |
| | | | | information. | | | | | | information. |
| | | | | If the value | | | | | | If the value |
| | | | | of type is not | | | | | | of type is not |
| | | | | NF4BLK or | | | | | | NF4BLK or |
| | | | | NF4CHR, the | | | | | | NF4CHR, the |
| | | | | value returned | | | | | | value returned |
| | | | | SHOULD NOT be | | | | | | SHOULD NOT be |
| | | | | considered | | | | | | considered |
| | | | | useful. | | | | | | useful. |
| recv_impl_id | 59 | impl_ident4 | READ | Client obtains |
| | | | | the server's |
| | | | | implementation |
| | | | | identity via |
| | | | | GETATTR. |
| retentevt_get | 71 | retention_get4 | READ | Get the | | retentevt_get | 71 | retention_get4 | READ | Get the |
| | | | | event-based | | | | | | event-based |
| | | | | retention | | | | | | retention |
| | | | | duration, and | | | | | | duration, and |
| | | | | if enabled, | | | | | | if enabled, |
| | | | | the | | | | | | the |
| | | | | event-based | | | | | | event-based |
| | | | | retention | | | | | | retention |
| | | | | begin time of | | | | | | begin time of |
| | | | | the file | | | | | | the file |
skipping to change at page 87, line 14 skipping to change at page 92, line 14
| retention_get | 69 | retention_get4 | READ | Get the | | retention_get | 69 | retention_get4 | READ | Get the |
| | | | | retention | | | | | | retention |
| | | | | duration, and | | | | | | duration, and |
| | | | | if enabled, | | | | | | if enabled, |
| | | | | the retention | | | | | | the retention |
| | | | | begin time of | | | | | | begin time of |
| | | | | the file | | | | | | the file |
| | | | | object. | | | | | | object. |
| | | | | GETATTR use | | | | | | GETATTR use |
| | | | | only. | | | | | | only. |
| retention_hold | 69 | uint64_t | R/W | Get or set | | retention_hold | 73 | uint64_t | R/W | Get or set |
| | | | | administrative | | | | | | administrative |
| | | | | retention | | | | | | retention |
| | | | | holds, one | | | | | | holds, one |
| | | | | hold per bit | | | | | | hold per bit |
| | | | | position. | | | | | | position. |
| retention_set | 70 | retention_set4 | WRITE | Set the | | retention_set | 70 | retention_set4 | WRITE | Set the |
| | | | | retention | | | | | | retention |
| | | | | duration, and | | | | | | duration, and |
| | | | | optionally | | | | | | optionally |
| | | | | enable | | | | | | enable |
| | | | | retention on | | | | | | retention on |
| | | | | the file | | | | | | the file |
| | | | | object. | | | | | | object. |
| | | | | SETATTR use | | | | | | SETATTR use |
| | | | | only. | | | | | | only. |
| send_impl_id | 58 | impl_ident4 | WRITE | Client | | sacl | 59 | nfsacl41 | R/W | Automatically |
| | | | | provides | | | | | | inheritable |
| | | | | server with | | | | | | access control |
| | | | | its | | | | | | list used for |
| | | | | implementation | | | | | | auditing |
| | | | | identity via | | | | | | access to |
| | | | | SETATTR. | | | | | | files. |
| space_avail | 42 | uint64 | READ | Disk space in | | space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes | | | | | | bytes |
| | | | | available to | | | | | | available to |
| | | | | this user on | | | | | | this user on |
| | | | | the file | | | | | | the file |
| | | | | system | | | | | | system |
| | | | | containing | | | | | | containing |
| | | | | this object - | | | | | | this object - |
| | | | | this should be | | | | | | this should be |
| | | | | the smallest | | | | | | the smallest |
skipping to change at page 94, line 9 skipping to change at page 99, line 9
fileid of a directory entry returned by readdir(). If fileid of a directory entry returned by readdir(). If
mounted_on_fileid is requested in a GETATTR operation, the server mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e. to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned. Some operating environments what readdir() would have returned. Some operating environments
allow a series of two or more file systems to be mounted onto a allow a series of two or more file systems to be mounted onto a
single mount point. In this case, for the server to obey the single mount point. In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point, aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points. and not the intermediate mount points.
5.12. send_impl_id and recv_impl_id 5.12. Directory Notification Attributes
These recommended attributes are used to identify the client and As described in Section 17.39, the client can request a minimum delay
server. In the case of the send_impl_id attribute, the client sends for notifications of changes to attributes, but the server is free to
its nfs_impl_id4. In the case of the recv_impl_id attribute, the ignore what the client requests. The client can determine in advance
client receives the server's nfs_impl_id4 value. what notification delays the server will accept by issuing a GETATTR
for either or both of two directory notification attributes. When
the client calls the GET_DIR_DELEGATION operation and asks for
attribute change notifications, it should request notification
delays that are no less than the values in the server-provided
attributes.
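The rule above can be sketched as a small client-side helper. This is
only an illustration: the function and parameter names are
hypothetical, and it assumes the client has already fetched the
relevant attribute (dir_notif_delay or dirent_notif_delay) with
GETATTR.

```python
def choose_notification_delay(wanted_seconds: int,
                              server_minimum_seconds: int) -> int:
    """Pick a delay to request in GET_DIR_DELEGATION.

    server_minimum_seconds is assumed to come from a prior GETATTR of
    dir_notif_delay or dirent_notif_delay (hypothetical plumbing).
    """
    # Never ask for less than the server-provided value.
    return max(wanted_seconds, server_minimum_seconds)
```

The server remains free to ignore the requested value, so the result
is a request, not a guarantee.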
Access to this identification information can be most useful at both 5.12.1. dir_notif_delay
client and server. Being able to identify specific implementations
can help in planning by administrators or implementors. For example,
diagnostic software may extract this information in an attempt to
identify interoperability problems, performance workload behaviors or
general usage statistics. Since the intent of having access to this
information is for planning or general diagnosis only, the client and
server MUST NOT interpret this implementation identity information in
a way that affects interoperational behavior of the implementation.
The reason is the if clients and servers did such a thing, they might
use fewer capabilities of the protocol than the peer can support, or
the client and server might refuse to interoperate.
Because it is likely some implementations will violate the protocol The dir_notif_delay attribute is the minimum number of seconds the
specification and interpret the identity information, implementations server will delay before notifying the client of a change to the
MUST allow the users of the NFSv4 client and server to set the directory's attributes.
contents of the sent nfs_impl_id structure to any value.
Even though these attributes are RECOMMENDED, if the server supports 5.12.2. dirent_notif_delay
one of them it MUST support the other.
5.13. fs_layout_type The dirent_notif_delay attribute is the minimum number of seconds the
server will delay before notifying the client of a change to a file
object that has an entry in the directory.
This attribute applies to a file system and indicates what layout 5.13. PNFS Attributes
types are supported by the file system. We expect this attribute to
5.13.1. fs_layout_type
The fs_layout_type attribute (data type layouttype4, see
Section 3.2.15) applies to a file system and indicates what layout
types are supported by the file system. This attribute is expected to
be queried when a client encounters a new fsid. This attribute is be queried when a client encounters a new fsid. This attribute is
used by the client to determine if it has applicable layout drivers. used by the client to determine if it supports the layout type.
5.14. layout_type 5.13.2. layout_alignment
The layout_alignment attribute indicates the preferred alignment for
I/O to files on the file system the client has layouts for. Where
possible, the client should issue READ and WRITE operations with
offsets that are whole multiples of the layout_alignment attribute.
5.13.3. layout_blksize
The layout_blksize attribute indicates the preferred block size for
I/O to files on the file system the client has layouts for. Where
possible, the client should issue READ operations with a count
argument that is a whole multiple of layout_blksize, and WRITE
operations with a data argument of size that is a whole multiple of
layout_blksize.
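As an illustration of how a client might honor both hints for a READ,
the sketch below widens a request so that its offset is a multiple of
layout_alignment and its count is a multiple of layout_blksize. The
helper name is hypothetical, and the sketch assumes both attributes
are positive values expressed in bytes, which the text above does not
state explicitly.

```python
def align_read(offset: int, count: int,
               layout_alignment: int, layout_blksize: int):
    """Return (offset, count) widened to the preferred alignment.

    Assumes layout_alignment and layout_blksize are positive byte
    counts (an assumption of this sketch).
    """
    # Round the starting offset down to the preferred alignment.
    aligned_offset = offset - (offset % layout_alignment)
    # Bytes that must be covered, measured from the aligned offset.
    span = (offset + count) - aligned_offset
    # Round the count up to a whole number of preferred blocks.
    blocks = -(-span // layout_blksize)  # ceiling division
    return aligned_offset, blocks * layout_blksize
```

Widening is only reasonable for READ; a WRITE cannot be padded the
same way without changing file contents, so a client would more likely
split or buffer writes instead.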
5.13.4. layout_hint
The layout_hint attribute (data type layouthint4, see Section 3.2.22)
may be set on newly created files to influence the metadata server's
choice for the file's layout. It is suggested that this attribute is
set as one of the initial attributes within the OPEN call. The
metadata server may ignore this attribute. This attribute is a sub-
set of the layout structure returned by LAYOUTGET. For example,
instead of specifying particular devices, this would be used to
suggest the stripe width of a file. It is up to the server
implementation to determine which fields within the layout it uses.
5.13.5. layout_type
This attribute indicates the particular layout type(s) used for a This attribute indicates the particular layout type(s) used for a
file. This is for informational purposes only. The client needs to file. This is for informational purposes only. The client needs to
use the LAYOUTGET operation in order to get enough information (e.g., use the LAYOUTGET operation in order to get enough information (e.g.,
specific device information) in order to perform I/O. specific device information) in order to perform I/O.
5.15. layout_hint 5.13.6. mdsthreshold
This attribute may be set on newly created files to influence the
metadata server's choice for the file's layout. It is suggested that
this attribute is set as one of the initial attributes within the
OPEN call. The metadata server may ignore this attribute. This
attribute is a sub-set of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file. It is up to the server
implementation to determine which fields within the layout it uses.
5.16. mdsthreshold
This attribute acts as a hint to the client to help it determine when This attribute acts as a hint to the client to help it determine when
it is more efficient to issue read and write requests to the metadata it is more efficient to issue read and write requests to the metadata
server vs. the data server. Two types of thresholds are described: server vs. the data server. Two types of thresholds are described:
file size thresholds and I/O size thresholds. If a file's size is file size thresholds and I/O size thresholds. If a file's size is
smaller than the file size threshold, data accesses should be issued smaller than the file size threshold, data accesses should be issued
to the metadata server. If an I/O is below the I/O size threshold, to the metadata server. If an I/O is below the I/O size threshold,
the I/O should be issued to the metadata server. Each threshold can the I/O should be issued to the metadata server. Each threshold can
be specified independently for read and write requests. For either be specified independently for read and write requests. For either
threshold type, a value of 0 indicates no read or write should be threshold type, a value of 0 indicates no read or write should be
skipping to change at page 95, line 39 skipping to change at page 101, line 7
The attribute is available on a per filehandle basis. If the current The attribute is available on a per filehandle basis. If the current
filehandle refers to a non-pNFS file or directory, the metadata filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the server should return an attribute that is representative of the
filehandle's file system. It is suggested that this attribute is filehandle's file system. It is suggested that this attribute is
queried as part of the OPEN operation. Due to dynamic system queried as part of the OPEN operation. Due to dynamic system
changes, the client should not assume that the attribute will remain changes, the client should not assume that the attribute will remain
constant for any specific time period, thus it should be periodically constant for any specific time period, thus it should be periodically
refreshed. refreshed.
5.17. Retention Attributes 5.14. Retention Attributes
Retention is a concept whereby a file object can be placed in an Retention is a concept whereby a file object can be placed in an
immutable, undeletable, unrenamable state for a fixed or infinite immutable, undeletable, unrenamable state for a fixed or infinite
duration of time. Once in this "retained" state, the file cannot be duration of time. Once in this "retained" state, the file cannot be
moved out of the state until the duration of retention has been moved out of the state until the duration of retention has been
reached. reached.
When retention is enabled, retention MUST extend to the data of the When retention is enabled, retention MUST extend to the data of the
file, and the name of the file. The server MAY extend retention to any file, and the name of the file. The server MAY extend retention to any
other property of the file, including any subset of mandatory, other property of the file, including any subset of mandatory,
skipping to change at page 97, line 33 skipping to change at page 103, line 8
even if the duration on enabled event or non-event-based retention even if the duration on enabled event or non-event-based retention
has been reached. The server MAY restrict the modification of has been reached. The server MAY restrict the modification of
retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL
permission. The enabling of administration retention holds does permission. The enabling of administration retention holds does
not prevent the enabling of event-based or non-event-based not prevent the enabling of event-based or non-event-based
retention. retention.
6. Access Control Lists 6. Access Control Lists
Access Control Lists (ACLs) are a file attribute that specify fine Access Control Lists (ACLs) are a file attribute that specify fine
grained access control. This chapter covers the "acl", "aclsupport", grained access control. This chapter covers the "acl", "dacl",
and "mode" file attributes, and their interactions. "sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and
their interactions.
6.1. Goals 6.1. Goals
ACLs and modes represent two well established but different models ACLs and modes represent two well established but different models
for specifying permissions. This chapter specifies requirements that for specifying permissions. This chapter specifies requirements that
attempt to meet the following goals: attempt to meet the following goals:
o If a server supports the mode attribute, it should provide o If a server supports the mode attribute, it should provide
reasonable semantics to clients that only set and retrieve the reasonable semantics to clients that only set and retrieve the
mode attribute. mode attribute.
skipping to change at page 98, line 23 skipping to change at page 103, line 44
* Setting only the mode attribute should effectively control the * Setting only the mode attribute should effectively control the
traditional UNIX-like permissions of read, write, and execute traditional UNIX-like permissions of read, write, and execute
on owner, owner_group, and other. on owner, owner_group, and other.
* Setting only the mode attribute should provide reasonable * Setting only the mode attribute should provide reasonable
security. For example, setting a mode of 000 should be enough security. For example, setting a mode of 000 should be enough
to ensure that future opens for read or write by any principal to ensure that future opens for read or write by any principal
should fail, regardless of a previously existing or inherited should fail, regardless of a previously existing or inherited
ACL. ACL.
o It must be possible to implement a server such that its clients
can have POSIX compliant semantics.
o This minor version of NFSv4 should not introduce significantly o This minor version of NFSv4 should not introduce significantly
different semantics relating to the mode and ACL attributes, nor different semantics relating to the mode and ACL attributes, nor
should it render invalid any existing implementations. Rather, should it render invalid any existing implementations. Rather,
this chapter provides clarifications based on previous this chapter provides clarifications based on previous
implementations and discussions around them. implementations and discussions around them.
o If a server supports the ACL attribute, then at any time, the o If a server supports the ACL attribute, then at any time, the
server can provide an ACL attribute when requested. The ACL server can provide an ACL attribute when requested. The ACL
attribute will describe all permissions on the file object, except attribute will describe all permissions on the file object, except
for the three high-order bits of the mode attribute (described in for the three high-order bits of the mode attribute (described in
Section 6.2.2). The ACL attribute will not conflict with the mode Section 6.2.3). The ACL attribute will not conflict with the mode
attribute, on servers that support the mode attribute. attribute, on servers that support the mode attribute.
o If a server supports the mode attribute, then at any time, the o If a server supports the mode attribute, then at any time, the
server can provide a mode attribute when requested. The mode server can provide a mode attribute when requested. The mode
attribute will not conflict with the ACL attribute, on servers attribute will not conflict with the ACL attribute, on servers
that support the ACL attribute. that support the ACL attribute.
o When a mode attribute is set on an object, the ACL attribute may o When a mode attribute is set on an object, the ACL attribute may
need to be modified so as to not conflict with the new mode. In need to be modified so as to not conflict with the new mode. In
such cases, it is desirable that the ACL keep as much information such cases, it is desirable that the ACL keep as much information
skipping to change at page 108, line 12 skipping to change at page 113, line 16
The bitmask constants used for the flag field are as follows: The bitmask constants used for the flag field are as follows:
const ACE4_FILE_INHERIT_ACE = 0x00000001; const ACE4_FILE_INHERIT_ACE = 0x00000001;
const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002;
const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004;
const ACE4_INHERIT_ONLY_ACE = 0x00000008; const ACE4_INHERIT_ONLY_ACE = 0x00000008;
const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020;
const ACE4_IDENTIFIER_GROUP = 0x00000040; const ACE4_IDENTIFIER_GROUP = 0x00000040;
const ACE4_INHERITED_ACE = 0x00000080;
A server need not support any of these flags. If the server supports A server need not support any of these flags. If the server supports
flags that are similar to, but not exactly the same as, these flags, flags that are similar to, but not exactly the same as, these flags,
the implementation may define a mapping between the protocol-defined the implementation may define a mapping between the protocol-defined
flags and the implementation-defined flags. Again, the guiding flags and the implementation-defined flags. Again, the guiding
principle is that the file not appear to be more secure than it principle is that the file not appear to be more secure than it
really is. really is.
For example, suppose a client tries to set an ACE with For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
skipping to change at page 109, line 5 skipping to change at page 114, line 10
directory, and AUDIT and ALARM ACEs with this bit set do not directory, and AUDIT and ALARM ACEs with this bit set do not
trigger log or alarm events. Such ACEs only take effect once they trigger log or alarm events. Such ACEs only take effect once they
are applied (with this bit cleared) to newly created files and are applied (with this bit cleared) to newly created files and
directories as specified by the above two flags. directories as specified by the above two flags.
ACE4_NO_PROPAGATE_INHERIT_ACE ACE4_NO_PROPAGATE_INHERIT_ACE
Can be placed on a directory. This flag tells the server that Can be placed on a directory. This flag tells the server that
inheritance of this ACE should stop at newly created child inheritance of this ACE should stop at newly created child
directories. directories.
ACE4_INHERITED_ACE
Indicates that this ACE is inherited from a parent directory. A
server that supports automatic inheritance will place this flag on
any ACEs inherited from the parent directory when creating a new
object. Client applications will use this to perform automatic
inheritance. Clients and servers MUST clear this bit in the acl
attribute; it may only be used in the dacl and sacl attributes.
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
ACE4_FAILED_ACCESS_ACE_FLAG ACE4_FAILED_ACCESS_ACE_FLAG
The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and
ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to
ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE
(ALARM) ACE types. If during the processing of the file's ACL, (ALARM) ACE types. If during the processing of the file's ACL,
the server encounters an AUDIT or ALARM ACE that matches the the server encounters an AUDIT or ALARM ACE that matches the
principal attempting the OPEN, the server notes that fact, and the principal attempting the OPEN, the server notes that fact, and the
presence, if any, of the SUCCESS and FAILED flags encountered in presence, if any, of the SUCCESS and FAILED flags encountered in
skipping to change at page 110, line 33 skipping to change at page 115, line 48
appended "@" and should appear in the form "xxxx@" (note: no domain appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@. name after the "@"). For example: ANONYMOUS@.
6.2.1.5.1. Discussion of EVERYONE@ 6.2.1.5.1. Discussion of EVERYONE@
It is important to note that "EVERYONE@" is not equivalent to the It is important to note that "EVERYONE@" is not equivalent to the
UNIX "other" entity. This is because, by definition, UNIX "other" UNIX "other" entity. This is because, by definition, UNIX "other"
does not include the owner or owning group of a file. "EVERYONE@" does not include the owner or owning group of a file. "EVERYONE@"
means literally everyone, including the owner or owning group. means literally everyone, including the owner or owning group.
6.2.2. mode Attribute 6.2.2. dacl and sacl Attributes
The dacl and sacl attributes are like the acl attribute, but dacl and
sacl each allow only certain types of ACEs. The dacl attribute
allows just ALLOW and DENY ACEs. The sacl attribute allows just
AUDIT and ALARM ACEs. The dacl and sacl attributes also have
improved support for automatic inheritance (see Section 6.4.3.2).
The separation of ACE types and inheritance support make dacl and
sacl a better choice (over acl) for clients when setting ACEs on a
file.
6.2.3. mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined: following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */ const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */ const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */ const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */ const MODE4_RUSR = 0x100; /* read permission: owner */
const MODE4_WUSR = 0x080; /* write permission: owner */ const MODE4_WUSR = 0x080; /* write permission: owner */
const MODE4_XUSR = 0x040; /* execute permission: owner */ const MODE4_XUSR = 0x040; /* execute permission: owner */
skipping to change at page 111, line 10 skipping to change at page 116, line 36
const MODE4_XOTH = 0x001; /* execute permission: other */ const MODE4_XOTH = 0x001; /* execute permission: other */
Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and
MODE4_XGRP apply to principals identified in the owner_group MODE4_XGRP apply to principals identified in the owner_group
attribute but who are not identified in the owner attribute. Bits attribute but who are not identified in the owner attribute. Bits
MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does
not match that in the owner attribute, and does not have a group not match that in the owner attribute, and does not have a group
matching that of the owner_group attribute. matching that of the owner_group attribute.
The remaining bits are not defined by this protocol. A server MUST Bits within the mode other than those specified above are not defined
NOT return bits other than those defined above in a GETATTR or by this protocol. A server MUST NOT return bits other than those
READDIR operation, and it MUST return NFS4ERR_INVAL if bits other defined above in a GETATTR or READDIR operation, and it MUST return
than those defined above are set in a SETATTR, CREATE, or OPEN NFS4ERR_INVAL if bits other than those defined above are set in a
operation. SETATTR, CREATE, or OPEN operation.
6.2.4. mode_set_masked Attribute
The mode_set_masked attribute is a write-only attribute that allows
individual bits in the mode attribute to be set or reset, without
changing others. It allows, for example, the bits MODE4_SUID,
MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified
any of the nine low-order mode bits devoted to permissions.
The mode_set_masked attribute consists of two words each in the form
of a mode4. The first consists of the value to be applied to the
current mode value and the second is a mask. Only bits set to one in
the mask word are changed (set or reset) in the file's mode. All
other bits in the mode remain unchanged. Bits in the first word that
correspond to bits which are zero in the mask are ignored, except
that undefined bits are checked for validity and can result in
NFS4ERR_INVAL as described below.
The mode_set_masked attribute is only valid in a SETATTR operation.
If it is used in a CREATE or OPEN operation, the server MUST return
NFS4ERR_INVAL.
Bits not defined as valid in the mode attribute are not valid in
either word of the mode_set_masked attribute. The server MUST return
NFS4ERR_INVAL if any of those are on in a SETATTR. If the mode and
mode_set_masked attributes are both specified in the same SETATTR,
the server MUST also return NFS4ERR_INVAL.
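The masked update described in this section can be sketched as
follows. This is a minimal illustration, not server code: the
exception class and function name are hypothetical, and MODE4_VALID is
simply the OR of the twelve mode bits defined in the mode attribute
section (MODE4_SUID through MODE4_XOTH, 0xFFF).

```python
MODE4_VALID = 0xFFF  # OR of the twelve defined MODE4_* bits


class Nfs4ErrInval(Exception):
    """Stands in for the NFS4ERR_INVAL status (hypothetical name)."""


def apply_mode_set_masked(current_mode: int, value: int, mask: int) -> int:
    """Return the new mode after applying (value, mask) to current_mode."""
    # Undefined bits are invalid in either word, even where the mask
    # would otherwise cause the value bit to be ignored.
    if (value | mask) & ~MODE4_VALID:
        raise Nfs4ErrInval("undefined mode bits set")
    # Bits selected by the mask take their value from the first word;
    # all other bits keep their current setting.
    return (current_mode & ~mask) | (value & mask)
```

For example, with a current mode of 0644, a value of 04000
(MODE4_SUID) and a mask of 07000 yields 04644: the three high-order
bits are rewritten and the nine permission bits are untouched.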
6.3. Common Methods 6.3. Common Methods
The requirements in this section will be referred to in future The requirements in this section will be referred to in future
sections, especially Section 6.4. sections, especially Section 6.4.
6.3.1. Interpreting an ACL 6.3.1. Interpreting an ACL
6.3.1.1. Server Considerations 6.3.1.1. Server Considerations
skipping to change at page 114, line 19 skipping to change at page 120, line 19
requirements refer to this section. But note that the methods have requirements refer to this section. But note that the methods have
behaviors specified with "SHOULD". This is intentional, to avoid behaviors specified with "SHOULD". This is intentional, to avoid
invalidating existing implementations that compute the mode according invalidating existing implementations that compute the mode according
to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by
actual permissions on owner, group, and other. actual permissions on owner, group, and other.
6.4.1. Setting the mode and/or ACL Attributes 6.4.1. Setting the mode and/or ACL Attributes
6.4.1.1. Setting mode and not ACL 6.4.1.1. Setting mode and not ACL
When setting a mode attribute and not an ACL attribute, the mode When any mode permission bits are subject to change, either because
attribute MUST be set as given. The ACL attribute MUST be modified the mode attribute was set or because the mode_set_masked attribute
such that the mode computed via the method in Section 6.3.2 yields was set and the mask included one or more bits from the low-order
the low-order nine bits (MODE4_R*, MODE4_W*, MODE4_X*) of the newly nine mode bits that control permissions, and the ACL attribute is not
set mode attribute. The ACL SHOULD also be modified such that: explicitly set, the ACL attribute must be modified in accordance with
the updated value of the permissions bits within the mode. This must
happen even if the value of the permission bits within the mode is
the same after the mode is set as before.
In cases in which the permissions bits are subject to change, the ACL
attribute MUST be modified such that the mode computed via the method
in Section 6.3.2 yields the low-order nine bits (MODE4_R*, MODE4_W*,
MODE4_X*) of the mode attribute as modified by the attribute change.
The ACL SHOULD also be modified such that:
1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL
other than OWNER@ and EVERYONE@ SHOULD NOT be granted other than OWNER@ and EVERYONE@ SHOULD NOT be granted
ACE4_READ_DATA. ACE4_READ_DATA.
2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL
other than OWNER@ and EVERYONE@ SHOULD NOT be granted other than OWNER@ and EVERYONE@ SHOULD NOT be granted
ACE4_WRITE_DATA or ACE4_APPEND_DATA. ACE4_WRITE_DATA or ACE4_APPEND_DATA.
3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL
skipping to change at page 115, line 7 skipping to change at page 121, line 15
Also note that the requirement may be met by discarding the ACL, in Also note that the requirement may be met by discarding the ACL, in
favor of an ACL that represents the mode and only the mode. This is favor of an ACL that represents the mode and only the mode. This is
permitted, but it is preferable for a server to preserve as much of permitted, but it is preferable for a server to preserve as much of
the ACL as possible without violating the above requirements. the ACL as possible without violating the above requirements.
Discarding the ACL makes it effectively impossible for a file created Discarding the ACL makes it effectively impossible for a file created
with a mode attribute to inherit an ACL (see Section 6.4.3). with a mode attribute to inherit an ACL (see Section 6.4.3).
6.4.1.2. Setting ACL and not mode 6.4.1.2. Setting ACL and not mode
When setting an ACL attribute and not a mode attribute, the ACL When setting an ACL attribute and not setting the mode or
attribute SHOULD be set as given. The nine low-order bits of the mode_set_masked attributes, the permission bits of the mode need to
mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to be derived from the ACL. In this case, the ACL attribute SHOULD be
match the result of the method Section 6.3.2. The three high-order set as given. The nine low-order bits of the mode attribute
bits of the mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result
unchanged. of the method Section 6.3.2. The three high-order bits of the mode
(MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged.
6.4.1.3. Setting both ACL and mode 6.4.1.3. Setting both ACL and mode
When setting both the mode and the ACL attribute in the same When setting both the mode (includes use of either the mode attribute
operation, the attributes MUST be applied in this order: mode, then or the mode_set_masked attribute) and the ACL attribute in the same
ACL. The mode attribute is set as given, then the ACL attribute is operation, the attributes MUST be applied in this order: mode (or
set as given, possibly changing the final mode, as described above in mode_set_masked), then ACL. The mode-related attribute is set as
Section 6.4.1.2. given, then the ACL attribute is set as given, possibly changing the
final mode, as described above in Section 6.4.1.2.
6.4.2. Retrieving the mode and/or ACL Attributes 6.4.2. Retrieving the mode and/or ACL Attributes
This section applies only to servers that support both the mode and This section applies only to servers that support both the mode and
the ACL attribute. the ACL attribute.
Some server implementations may have a concept of "objects without Some server implementations may have a concept of "objects without
ACLs", meaning that all permissions are granted and denied according ACLs", meaning that all permissions are granted and denied according
to the mode attribute, and that no ACL attribute is stored for that to the mode attribute, and that no ACL attribute is stored for that
object. If an ACL attribute is requested of such a server, the object. If an ACL attribute is requested of such a server, the
skipping to change at page 117, line 28 skipping to change at page 123, line 39
ACE4_INHERIT_ONLY_ACE flag set). This gives the user and the server, ACE4_INHERIT_ONLY_ACE flag set). This gives the user and the server,
in the cases which it must mask certain permissions upon creation, in the cases which it must mask certain permissions upon creation,
the ability to modify the effective permissions without modifying the the ability to modify the effective permissions without modifying the
ACE which is to be inherited to the new directory's children. ACE which is to be inherited to the new directory's children.
When a newly created object is created with attributes, and those When a newly created object is created with attributes, and those
attributes contain an ACL attribute and/or a mode attribute, the attributes contain an ACL attribute and/or a mode attribute, the
server MUST apply those attributes to the newly created object, as server MUST apply those attributes to the newly created object, as
described in Section 6.4.1. described in Section 6.4.1.
6.4.3.2. Automatic Inheritance
Unlike the acl attribute, the sacl and dacl (see Section 6.2.2)
attributes both have an additional flag field. The flag field
applies to the entire sacl or dacl; three flag values are defined:
const ACL4_AUTO_INHERIT = 0x00000001;
const ACL4_PROTECTED = 0x00000002;
const ACL4_DEFAULTED = 0x00000004;
and all other bits must be cleared. The ACE4_INHERITED_ACE flag may
be set in the ACEs of the sacl or dacl (whereas it must always be
cleared in the acl).
Together these features allow a server to support automatic
inheritance, which we now explain in more detail.
Inheritable ACEs are normally inherited by child objects only at the
time that the child objects are created; later modifications to
inheritable ACEs do not result in modifications to inherited ACEs on
descendants.
However, the dacl and sacl provide an optional mechanism which allows
a client application to propagate changes to inheritable ACEs to an
entire directory hierarchy.
A server that supports this performs inheritance at object creation
time in the normal way, but also sets the ACE4_INHERITED_ACE flag on
any inherited ACEs as they are added to the new object.
A client application such as an ACL editor may then propagate changes
to inheritable ACEs on a directory by recursively traversing that
directory's descendants and modifying each ACL encountered to remove
any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the
new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). It
uses the existing ACE inheritance flags in the obvious way to decide
which ACEs to propagate. (Note that it may encounter further
inheritable ACEs when descending the directory hierarchy, and that
those will also need to be taken into account when propagating
inheritable ACEs to further descendants.)
The reach of this propagation may be limited in two ways: first,
automatic inheritance is not performed from any directory ACL that
has the ACL4_AUTO_INHERIT flag cleared; and second, automatic
inheritance stops at any ACL that has the ACL4_PROTECTED flag set,
preventing modification of that ACL and also (if the ACL is set
on a directory) of the ACL on any of that directory's descendants.
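The propagation described above can be sketched as a recursive walk.
The following is a minimal, illustrative sketch only: it models the
tree in memory rather than issuing GETATTR/SETATTR, the Ace and Node
classes are hypothetical, the ACE flag values are assumptions (they
are not defined in this section), and the ACE4_INHERIT_ONLY_ACE and
ACE4_NO_PROPAGATE_INHERIT_ACE flags are deliberately ignored.

```python
from dataclasses import dataclass, field

ACL4_AUTO_INHERIT = 0x00000001
ACL4_PROTECTED = 0x00000002

# Assumed ACE flag values; they are not defined in this section.
ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002
ACE4_INHERITED_ACE = 0x00000080
INHERIT_BITS = ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE


@dataclass
class Ace:  # hypothetical, simplified ACE
    who: str
    mask: int
    flag: int = 0


@dataclass
class Node:  # hypothetical in-memory stand-in for a file or directory
    name: str
    is_dir: bool
    acl_flags: int = 0
    aces: list = field(default_factory=list)
    children: list = field(default_factory=list)


def inheritable_for(parent, child):
    """Return copies of the parent's ACEs that apply to the child."""
    want = ACE4_DIRECTORY_INHERIT_ACE if child.is_dir else ACE4_FILE_INHERIT_ACE
    out = []
    for ace in parent.aces:
        if ace.flag & want:
            # A directory keeps the inheritance bits so the ACE continues
            # down the tree; a file drops them.
            flag = ace.flag if child.is_dir else ace.flag & ~INHERIT_BITS
            out.append(Ace(ace.who, ace.mask, flag | ACE4_INHERITED_ACE))
    return out


def propagate(directory):
    """Replace inherited ACEs on all reachable descendants."""
    if not directory.acl_flags & ACL4_AUTO_INHERIT:
        return  # first limit: no propagation from this directory
    for child in directory.children:
        if child.acl_flags & ACL4_PROTECTED:
            continue  # second limit: skip this object and its subtree
        explicit = [a for a in child.aces if not a.flag & ACE4_INHERITED_ACE]
        # Inherited ACEs are placed after the explicit ones.
        child.aces = explicit + inheritable_for(directory, child)
        if child.is_dir:
            propagate(child)
```

Because each directory's own inheritable ACEs are included when
descending, ACEs first encountered lower in the hierarchy are carried
to further descendants as well.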
This propagation is performed independently for the sacl and the dacl
attributes; thus the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may
be independently set for the sacl and the dacl, and propagation of
one type of acl may continue down a hierarchy even where propagation
of the other acl has stopped.
New objects should be created with a dacl and a sacl that both have
the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to
the same value as that on, respectively, the dacl or sacl of the
parent object.
Both the dacl and sacl attributes are RECOMMENDED, and a server may
support one without supporting the other.
A server that supports both the old acl attribute and one or both of
the new dacl or sacl attributes must do so in such a way as to keep
all three attributes consistent with each other. Thus the ACEs
reported in the acl attribute should be the union of the ACEs
reported in the dacl and sacl attributes, except that the
ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl.
And of course a client that queries only the acl will be unable to
determine the values of the sacl or dacl flag fields.
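As an illustration of the consistency rule above, a server might
derive the acl view from the dacl and sacl roughly as follows. This
is a sketch under assumptions: the Ace class is hypothetical, the
ACE4_INHERITED_ACE value is assumed, and the ordering of the union
(dacl entries first) is a choice made here, not something the text
specifies.

```python
from dataclasses import dataclass, replace

ACE4_INHERITED_ACE = 0x00000080  # assumed value, not defined in this section


@dataclass(frozen=True)
class Ace:  # hypothetical, simplified ACE
    type: str
    flag: int
    mask: int
    who: str


def acl_from_dacl_sacl(dacl_aces, sacl_aces):
    """Union of dacl and sacl ACEs with ACE4_INHERITED_ACE cleared."""
    combined = list(dacl_aces) + list(sacl_aces)
    return [replace(a, flag=a.flag & ~ACE4_INHERITED_ACE) for a in combined]
```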
When a client performs a SETATTR for the acl attribute, the server
SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the
dacl. By using the acl attribute, as opposed to the dacl or sacl
attributes, the client signals that it may not understand automatic
inheritance, and thus cannot be trusted to set an ACL for which
automatic inheritance would make sense.
When a client application queries an ACL, modifies it, and sets it
again, it should leave any ACEs marked with ACE4_INHERITED_ACE
unchanged, in their original order, at the end of the ACL. If the
application is unable to do this, it should set the ACL4_PROTECTED
flag. This behavior is not enforced by servers, but violations of
this rule may lead to unexpected results when applications perform
automatic inheritance.
If a server also supports the mode attribute, it SHOULD set the mode
in such a way that leaves inherited ACEs unchanged, in their original
order, at the end of the ACL. If it is unable to do so, it SHOULD
set the ACL4_PROTECTED flag on the file's dacl.
Finally, in the case where the request that creates a new file or
directory does not also set permissions for that file or directory,
and there are also no ACEs to inherit from the parent directory,
then the server's choice of ACL for the new object is implementation-
dependent. In this case, the server SHOULD set the ACL4_DEFAULTED
flag on the ACL it chooses for the new object. An application
performing automatic inheritance takes the ACL4_DEFAULTED flag as a
sign that the ACL should be completely replaced by one generated
using the automatic inheritance rules.
7. Single-server Name Space 7. Single-server Name Space
This chapter describes the NFSv4 single-server name space. Single- This chapter describes the NFSv4 single-server name space. Single-
server namespaces may be presented directly to clients, or they may server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site- be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described wide or organization-wide) to be presented to clients, as described
in Section 10. in Section 10.
7.1. Server Exports 7.1. Server Exports
skipping to change at page 121, line 10 skipping to change at page 129, line 27
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of server's namespace should be the union of all security mechanisms of
all direct descendants. all direct descendants.
8. File Locking and Share Reservations 8. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of such features as share reservations, stateful. With the inclusion of such features as share reservations,
file and directory delegations, recallable layouts, and support for file and directory delegations, recallable layouts, and support for
mandatory byte-range locking the protocol becomes substantially more mandatory record locking the protocol becomes substantially more
dependent on state than the traditional combination of NFS and NLM dependent on state than the traditional combination of NFS and NLM
[XNFS]. There are three components to making this state manageable: [XNFS]. There are three components to making this state manageable:
o Clear division between client and server o Clear division between client and server
o Ability to reliably detect inconsistency in state between client o Ability to reliably detect inconsistency in state between client
and server and server
o Simple and robust recovery mechanisms o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client In this model, the server owns the state information. The client
requests changes in locks and the server responds with the changes requests changes in locks and the server responds with the changes
made. Non-client-initiated changes in locking state are infrequent made. Non-client-initiated changes in locking state are infrequent
and the client receives prompt notification of them and can adjust and the client receives prompt notification of them and can adjust
his view of the locking state to reflect the server's changes. its view of the locking state to reflect the server's changes.
To support Win32 share reservations it is necessary to provide To support Win32 share reservations it is necessary to provide
operations which atomically OPEN or CREATE files. Having a separate operations which atomically OPEN or CREATE files. Having a separate
share/unshare operation would not allow correct implementation of the share/unshare operation would not allow correct implementation of the
Win32 OpenFile API. In order to correctly implement share semantics, Win32 OpenFile API. In order to correctly implement share semantics,
the previous NFS protocol mechanisms used when a file is opened or the previous NFS protocol mechanisms used when a file is opened or
created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS
version 4.1 protocol defines an OPEN operation which looks up or creates version 4.1 protocol defines an OPEN operation which looks up or creates
a file and establishes locking state on the server. a file and establishes locking state on the server.
skipping to change at page 122, line 7 skipping to change at page 130, line 22
they possess a held lock. A lock request contains the heavyweight they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock information required to establish a lock and uniquely define the lock
owner. owner.
The following sections describe the transition from the heavyweight The following sections describe the transition from the heavyweight
information to the eventual lightweight stateid used for most client information to the eventual lightweight stateid used for most client
and server locking interactions. and server locking interactions.
8.1.1. Client and Session ID 8.1.1. Client and Session ID
A client must establish a clientid (see Section 2.4) and then one or A client must establish a client ID (see Section 2.4) and then one or
more sessionids (see Section 2.9) before performing any operations to more sessionids (see Section 2.10) before performing any operations
open, lock, or delegate a file object. The sessionid serves as a to open, lock, or delegate a file object. The sessionid serves as
shorthand referral to an NFSv4.1 client. a shorthand referral to an NFSv4.1 client.
8.1.2. State-owner and Stateid Definition 8.1.2. State-owner Definition
When opening a file or requesting a byte-range lock, the client must When opening a file or requesting a record lock, the client must
specify an identifier which represents the owner of the requested specify an identifier which represents the owner of the requested
lock. This identifier is in the form of a state-owner, represented lock. This identifier is in the form of a state-owner, represented
in the protocol by a state_owner4, a variable-length opaque array in the protocol by a state_owner4, a variable-length opaque array
which, when concatenated with the current clientid uniquely defines which, when concatenated with the current client ID uniquely defines
the owner of lock managed by the client. This may be a thread id, the owner of lock managed by the client. This may be a thread id,
process id, or other unique value. process id, or other unique value.
Owners of opens and owners of byte-range locks are separate entities Owners of opens and owners of record locks are separate entities and
and remain separate even if the same opaque arrays are used to remain separate even if the same opaque arrays are used to designate
designate owners of each. The protocol distinguishes between open- owners of each. The protocol distinguishes between open-owners
owners (represented by open_owner4 structures) and lock-owners (represented by open_owner4 structures) and lock-owners (represented
(represented by lock_owner4 structures). by lock_owner4 structures).
Each open is associated with a specific open-owner while each byte- Each open is associated with a specific open-owner while each record
range lock is associated with a lock-owner and an open-owner, the lock is associated with a lock-owner and an open-owner, the latter
latter being the open-owner associated with the open file under which being the open-owner associated with the open file under which the
the LOCK operation was done. Delegations and layouts, on the other LOCK operation was done. Delegations and layouts, on the other hand,
hand, are not associated with a specific owner but are associated with the are not associated with a specific owner but are associated with the
client as a whole. client as a whole.
When the server grants a lock of any type (including opens, byte- 8.1.3. Stateid Definition
range locks, delegations, and layouts) it responds with a unique
stateid, that represents a set of locks (often a single lock) for the When the server grants a lock of any type (including opens, record
same file, of the same type, and sharing the same ownership locks, delegations, and layouts) it responds with a unique stateid,
that represents a set of locks (often a single lock) for the same
file, of the same type, and sharing the same ownership
characteristics. Thus opens of the same file by different open- characteristics. Thus opens of the same file by different open-
owners each have an identifying stateid. Similarly, each set of owners each have an identifying stateid. Similarly, each set of
byte-range locks on a file owned by a specific lock-owner and gotten record locks on a file owned by a specific lock-owner and gotten via
via an open for a specific open-owner, has its own identifying an open for a specific open-owner, has its own identifying stateid.
stateid. Delegations and layouts also have associated stateid's by Delegations and layouts also have associated stateids by which they
which they may be referenced. The stateid is used as a shorthand may be referenced. The stateid is used as a shorthand reference to a
reference to a lock or set of locks and given a stateid the client lock or set of locks and given a stateid the client can determine the
can determine the associated state-owner or state-owners (in the case associated state-owner or state-owners (in the case of an open-owner/
of an open-owner/lock-owner pair) and the associated. Clients, lock-owner pair) and the associated filehandle. When stateids are
however, must not assume any such mapping and must not use a stateid used the current filehandle must be the one associated with that
returned for a given filehandle and state-owner in the context of a stateid.
different filehandle or a different state-owner.
The server is free to form the stateid in any manner that it chooses The server may assign stateids independently for different clients
as long as it is able to recognize invalid and out-of-date stateids. and a stateid with the same bit pattern for one client may designate
Although the protocol XDR definition divides the stateid into 'seqid' an entirely different set of locks for a different client. The
and 'other' fields, for the purposes of minor version one, this stateid is always interpreted with respect to the client ID
distinction is not important and the server may use the available associated with the current session. Stateids apply to all sessions
space as it chooses, with one exception. associated with the given client ID and the client may use a stateid
obtained from one session on another session associated with the same
client ID.
The exception is that stateids whose 'other' field is either all 8.1.3.1. Stateid Structure
zeros or all ones are reserved and may not be generated by the
server. Clients may use the protocol-defined special stateid values
for their defined purposes, but any use of stateid's in this reserved
class that are not specially defined by the protocol MUST result in
an NFS4ERR_BAD_STATED being returned.
Clients may not compare stateids associated with different Stateids are divided into two fields, a 96-bit "other" field
filehandles, so that a server might use stateids with the same bit identifying the specific set of locks and a 32-bit "seqid" sequence
pattern for all opens with a given open-owner or for all sets of value. Except in the case of special stateids, to be discussed
byte-range locks associated with a given lock-owner/open-owner pair. below, the purpose of the sequence value within NFSv4.1 is to allow
However, if it does so, it must recognize and reject any use of the server to communicate to the client the order in which operations
stateid when the current filehandle is such that no lock for that that modified locking state associated with a stateid have been
filehandle by that open owner (or lock-owner/open-owner pair) exists. processed.
Stateid's must remain valid until either a client reboot or a sever In the case of stateids associated with opens, i.e. the stateids
returned by OPEN (the state for the open, rather than that for the
delegation), OPEN_DOWNGRADE, or CLOSE, the server MUST provide a
"seqid" value starting at one for the first use of a given "other"
value and incremented by one with each subsequent operation returning
a stateid.
In the case of other sorts of stateids (i.e. stateids associated with
record locks and delegations), the server MAY provide an incrementing
sequence value on successive stateids returned with the same identifying
field, or it may return the value zero. If it does return a non-zero
"seqid" value it MUST start at one and be incremented by one with
each subsequent operation returning a stateid with the same "other"
value, just as is done with open state.
The client, when using a stateid as a parameter to an operation, must,
except in the case of a special stateid, set the sequence value to
zero. If the value is non-zero, the server MUST return the error
NFS4ERR_BAD_STATEID.
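The two-field layout and the sequence rules above can be sketched as
follows. This is illustrative only: the Stateid and SeqidCounter
names are hypothetical, the wire order (seqid first, then the 12
octets of "other") is an assumption about the XDR definition, and
32-bit wraparound of "seqid" is not handled.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid:  # hypothetical representation
    seqid: int    # 32-bit sequence value
    other: bytes  # 96-bit (12-octet) identifying field

    def pack(self):
        # Assumed wire order: seqid, then the fixed-length "other".
        return struct.pack(">I", self.seqid) + self.other

    @classmethod
    def unpack(cls, data):
        (seqid,) = struct.unpack(">I", data[:4])
        return cls(seqid, data[4:16])


class SeqidCounter:
    """Server side: seqid starts at one and increments by one."""

    def __init__(self, other):
        self.other = other
        self.last = 0

    def next_stateid(self):
        self.last += 1
        return Stateid(self.last, self.other)


def for_request(sid):
    """Client side: send a non-special stateid with seqid set to zero."""
    return Stateid(0, sid.other)
```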
8.1.3.2. Special Stateids
Stateid values whose "other" field is either all zeros or all ones
are reserved. They may not be assigned by the server but have
special meanings defined by the protocol. The particular meaning
depends on whether the "other" field is all zeros or all ones and the
specific value of the "seqid" field.
The following combinations of "other" and "seqid" are defined in
NFSv4.1:
o When "other" and "seqid" are both zero, the stateid is treated as
a special anonymous stateid, which can be used in READ, WRITE, and
SETATTR requests to indicate the absence of any open state
associated with the request. When an anonymous stateid value is
used, and an existing open denies the form of access requested,
then access will be denied to the request.
o When "other" and "seqid" are both all ones, the stateid is a
special read bypass stateid. When this value is used in WRITE or
SETATTR, it is treated like the anonymous value. When used in
READ, the server MAY grant access, even if access would normally
be denied to READ requests.
o When "other" is zero and "seqid" is one, the stateid represents
the current stateid, which is whatever value is the last stateid
returned by an operation within the COMPOUND. In the case of an
OPEN, the stateid returned for the open file, and not the
delegation is used. The stateid passed to the operation in place
of the special value has its "seqid" value set to zero. If there
is no operation in the COMPOUND which has returned a stateid
value, the server MUST return the error NFS4ERR_BAD_STATEID.
If a stateid value is used which has all zero or all ones in the
"other" field, but does not match one of the cases above, the server
MUST return the error NFS4ERR_BAD_STATEID.
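The combinations listed above can be summarized by a small
classifier. This is a sketch, not a normative algorithm; the
function name and the returned labels are hypothetical.

```python
ZEROS = bytes(12)        # "other" with all bits zero
ONES = b"\xff" * 12      # "other" with all bits one
ALL_ONES_SEQID = 0xFFFFFFFF


def classify_stateid(seqid, other):
    """Return a label for the stateid; "bad" maps to NFS4ERR_BAD_STATEID."""
    if other == ZEROS:
        if seqid == 0:
            return "anonymous"
        if seqid == 1:
            return "current"
        return "bad"
    if other == ONES:
        if seqid == ALL_ONES_SEQID:
            return "read-bypass"
        return "bad"
    return "ordinary"
```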
Special stateids, unlike other stateids, are not associated with
individual client IDs or filehandles and can be used with all valid
client IDs and filehandles. In the case of a special stateid
designating the current stateid, the current stateid value
substituted for the special stateid is associated with a particular
client ID and filehandle.
8.1.3.3. Stateid Lifetime and Validation
Stateids must remain valid until either a client reboot or a server
reboot or until the client returns all of the locks associated with reboot or until the client returns all of the locks associated with
the stateid by means of an operation such as CLOSE or DELEGRETURN. the stateid by means of an operation such as CLOSE or DELEGRETURN.
If the locks are lost due to revocation the stateid remains usable If the locks are lost due to revocation the stateid remains a valid
until the client frees it by using FREE_STATEID. Stateid's designation of that revoked state until the client frees it by using
associated with byte-range locks are an exception. They remain valid FREE_STATEID. Stateids associated with record locks are an
even if a LOCKU frees all remaining locks, so long as the open file exception. They remain valid even if a LOCKU frees all remaining
with which they are associated remains open, unless the client does a locks, so long as the open file with which they are associated
FREE_STATEID to caused the stateid to be freed. remains open, unless the client does a FREE_STATEID to cause the
stateid to be freed.
Because each operation using a stateid occurs as part of a session, An "other" value must never be reused for a different purpose (i.e.
each stateid is implicitly associated with the clientid assigned to different filehandle, owner, or type of locks) within the context of
that session. Use of a stateid in the context of a session where the a single client ID. A server may retain the "other" value for the
clientid is invalid should result in the error NFS4ERR_STALE_STATEID. same purpose beyond the point where it may otherwise be freed but if
Servers MUST NOT do any validation or return other errors in this it does so, it must maintain "seqid" continuity with previous values,
case, even if they have sufficient information available to validate in all cases in which it is required to return incrementing "seqid"
stateids associated with an out-of-date client. values in general.
One mechanism that may be used to satisfy the requirement that the One mechanism that may be used to satisfy the requirement that the
server recognize invalid and out-of-date stateids is for the server server recognize invalid and out-of-date stateids is for the server
to divide the stateid into two fields. This division may coincide to divide the "other" field of the stateid into two fields.
with the documented division into 'seqid' and 'other' fields or it
may divide the stateid field up in any other ay it chooses.
o An index into a table of locking-state structures. o An index into a table of locking-state structures.
o A generation number which is incremented on each allocation of a o A generation number which is incremented on each allocation of a
table entry a particular allocation of a stateid. table entry for a particular use.
And then store in each table entry, And then store in each table entry,
o The current generation number. o The current generation number.
o The clientid with which the stateid is associated. o The client ID with which the stateid is associated.
o The filehandle of the file on which the locks are taken. o The filehandle of the file on which the locks are taken.
o An indication of the type of stateid (open, byte-range lock, file o An indication of the type of stateid (open, record lock, file
delegation, directory delegation, layout). delegation, directory delegation, layout).
o The last "seqid" value returned corresponding to the current
"other" value.
With this information, the following procedure would be used to With this information, the following procedure would be used to
validate an incoming stateid and return an appropriate error, when validate an incoming stateid and return an appropriate error, when
necessary: necessary:
o If the current session is associated with an invalid clientid, o If the server has restarted resulting in loss of all leased state
return NFS4ERR_STALE_STATEID. but the sessionid and client ID are still valid, return
NFS4ERR_STALE_STATEID. (If server restart has resulted in an
invalid client ID or sessionid, SEQUENCE will return an
error - not NFS4ERR_STALE_STATEID - and the operation that takes a
stateid as an argument will never be processed.)
o If the "other" field is all zeros or all ones, check that the
"other" and "seqid" match a defined combination for a special
stateid and that that stateid can be used in the current context.
If not, then return NFS4ERR_BAD_STATEID.
o If the "seqid" field is not zero, return NFS4ERR_BAD_STATEID.
o Otherwise divide the "other" into a table index and an entry
generation.
o If the table index field is outside the range of the associated o If the table index field is outside the range of the associated
table, return NFS4ERR_BAD_STATEID. table, return NFS4ERR_BAD_STATEID.
o If the selected table entry is of a different generation than that o If the selected table entry is of a different generation than that
specified in the incoming stateid, return NFS4ERR_BAD_STATEID. specified in the incoming stateid, return NFS4ERR_BAD_STATEID.
o If the selected table entry does not match the current file o If the selected table entry does not match the current file
handle, return NFS4ERR_BAD_STATEID. handle, return NFS4ERR_BAD_STATEID.
o If the clientid in the table entry does not match the clientid o If the client ID in the table entry does not match the client ID
associated with the current session, return NFS4ERR_BAD_STATEID. associated with the current session, return NFS4ERR_BAD_STATEID.
o If the stateid type is not valid for the context in which the o If the stateid type is not valid for the context in which the
stateid appears, return NFS4ERR_BAD_STATEID. stateid appears, return NFS4ERR_BAD_STATEID.
o Otherwise, the stateid is valid and the table entry should contain o Otherwise, the stateid is valid and the table entry should contain
any additional information about the associated set of locks, such any additional information about the associated set of locks, such
as open-owner and lock-owner information, as well as information as open-owner and lock-owner information, as well as information
on the specific locks, such as open modes and byte ranges. on the specific locks, such as open modes and octet ranges.
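The validation procedure in the newer text can be sketched as shown
below. This is one possible arrangement, not a required one: the
Entry class, the 4-octet index / 8-octet generation split of
"other", and the use of error names as strings are all assumptions
made for illustration. Context checks for special stateids are
omitted.

```python
import struct
from dataclasses import dataclass

OK = "NFS4_OK"
BAD = "NFS4ERR_BAD_STATEID"
STALE = "NFS4ERR_STALE_STATEID"

ZEROS = bytes(12)
ONES = b"\xff" * 12
SPECIAL = {(0, ZEROS), (1, ZEROS), (0xFFFFFFFF, ONES)}


@dataclass
class Entry:  # hypothetical table entry
    generation: int
    client_id: int
    filehandle: bytes
    kind: str          # "open", "lock", "deleg", "layout", ...
    last_seqid: int = 0


def make_other(index, generation):
    # Assumed split: 4-octet table index, 8-octet generation number.
    # Generations start at one here so "other" is never all zeros.
    return struct.pack(">IQ", index, generation)


def validate(table, seqid, other, client_id, filehandle, allowed_kinds,
             restarted=False):
    """Return (status, entry) following the bulleted steps in order."""
    if restarted:
        return STALE, None
    if other in (ZEROS, ONES):
        return (OK, None) if (seqid, other) in SPECIAL else (BAD, None)
    if seqid != 0:
        return BAD, None
    index, generation = struct.unpack(">IQ", other)
    if index >= len(table) or table[index] is None:
        return BAD, None
    entry = table[index]
    if entry.generation != generation:
        return BAD, None
    if entry.filehandle != filehandle:
        return BAD, None
    if entry.client_id != client_id:
        return BAD, None
    if entry.kind not in allowed_kinds:
        return BAD, None
    return OK, entry
```

A caller would pass the client ID of the current session and the
current filehandle, and would look at the returned entry for owner
and lock details once the status is NFS4_OK.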
8.1.3. Use of the Stateid and Locking 8.1.4. Use of the Stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. explicitly mentioned in the text.
If the state-owner performs a READ or WRITE in a situation in which If the state-owner performs a READ or WRITE in a situation in which
it has established a lock or share reservation on the server (any it has established a lock or share reservation on the server (any
OPEN constitutes a share reservation) the stateid (previously OPEN constitutes a share reservation) the stateid (previously
returned by the server) must be used to indicate what locks, returned by the server) must be used to indicate what locks,
including both record locks and share reservations, are held by the including both record locks and share reservations, are held by the
state-owner. If no state is established by the client, either record state-owner. If no state is established by the client, either record
lock or share reservation, a special stateid of all bits 0 (including lock or share reservation, a special stateid for anonymous state
all fields of the stateid) is used. Regardless whether a stateid of (zero as "other" and "seqid") is used. Regardless whether a stateid
all bits 0, or a stateid returned by the server is used, if there is for anonymous state or a stateid returned by the server is used, if
a conflicting share reservation or mandatory record lock held on the there is a conflicting share reservation or mandatory record lock
file, the server MUST refuse to service the READ or WRITE operation. held on the file, the server MUST refuse to service the READ or WRITE
operation.
Share reservations are established by OPEN operations and by their Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record "mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record locks are required on the file before I/O is possible). When record
skipping to change at page 126, line 10 skipping to change at page 136, line 11
NFS4ERR_LOCKED. NFS4ERR_LOCKED.
For Windows environments, there are no advisory record locks, so the For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests. server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4 between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces server's processing of the READ and WRITE operations that introduces
the distinction. the distinction.
Every stateid other than the special stateid values noted in this Every stateid with the exception of special stateid values, whether
section, whether returned by an OPEN-type operation (i.e. OPEN, returned by an OPEN-type operation (i.e. OPEN, OPEN_DOWNGRADE), or
OPEN_DOWNGRADE), or by a LOCK-type operation (i.e. LOCK or LOCKU), by a LOCK-type operation (i.e. LOCK or LOCKU), defines an access
defines an access mode for the file (i.e. READ, WRITE, or READ- mode for the file (i.e. READ, WRITE, or READ-WRITE) as established
WRITE) as established by the original OPEN which caused the by the original OPEN which caused the allocation of the open stateid
allocation of the open stateid and as modified by subsequent OPENs and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the same
and OPEN_DOWNGRADEs for the same open-owner/file pair. Stateids open-owner/file pair. Stateids returned by record lock operations
returned by byte-range lock operations imply the access mode for the imply the access mode for the open stateid associated with the lock
open stateid associated with the lock set represented by the stateid. set represented by the stateid. Delegation stateids have an access
Delegation stateids have an access mode based on the type of mode based on the type of delegation. When a READ, WRITE, or SETATTR
delegation. When a READ, WRITE, or SETATTR which specifies the size which specifies the size attribute, is done, the operation is subject
attribute, is done, the operation is subject to checking against the to checking against the access mode to verify that the operation is
access mode to verify that the operation is appropriate given the appropriate given the OPEN with which the operation is associated.
OPEN with which the operation is associated.
In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist. that no conflicting share reservation can exist.
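
The access mode check described above can be sketched roughly as follows. This is only an illustration, not text from the draft: the function name, the string status codes, the `ACCESS_*` bit names, and the `enforce_read_check` switch (modeling the server's choice to allow READ on write-only opens) are all assumptions made for the example.

```python
# Hypothetical names throughout; a sketch of the access mode check only.
NFS4_OK = "NFS4_OK"
NFS4ERR_OPENMODE = "NFS4ERR_OPENMODE"

ACCESS_READ = 0x1   # illustrative bit for read access on the open
ACCESS_WRITE = 0x2  # illustrative bit for write access on the open


def check_access_mode(op, access_mode, sets_size=False, enforce_read_check=True):
    """Return a status for op given the access mode of the associated open."""
    # WRITE-type operations: WRITE, or SETATTR that sets the size attribute.
    write_type = op == "WRITE" or (op == "SETATTR" and sets_size)
    if write_type:
        # The server must reject these when the open does not allow writing.
        return NFS4_OK if access_mode & ACCESS_WRITE else NFS4ERR_OPENMODE
    if op == "READ":
        # The server may enforce the check, or may allow READ on opens
        # for WRITE only. Conflicting-lock checks still apply either way
        # and are not modeled here.
        if enforce_read_check and not (access_mode & ACCESS_READ):
            return NFS4ERR_OPENMODE
        return NFS4_OK
    # Other operations are not subject to this check in the sketch.
    return NFS4_OK
```

A server that sets `enforce_read_check=False` in this sketch would still have to check separately for locks that conflict with the READ.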
A special stateid of all bits 1 (one), including all fields in the The read bypass special stateid (all bits of "other" and "seqid" set
stateid indicates a desire to bypass locking checks. The server MAY to one) indicates a desire to bypass locking checks. The
allow READ operations to bypass locking checks at the server, when server MAY allow READ operations to bypass locking checks at the
this special stateid is used. However, WRITE operations with this server, when this special stateid is used. However, WRITE operations
special stateid value MUST NOT bypass locking checks and are treated with this special stateid value MUST NOT bypass locking checks and
exactly the same as if a stateid of all bits 0 were used. are treated exactly the same as if a special stateid for anonymous
state were used.
A lock may not be granted while a READ or WRITE operation using one A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above. WRITE as discussed above.
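
The conflict rule in the preceding paragraph can be illustrated with a small sketch. The function names, the `(offset, length)` tuple representation of ranges, and the string operation names are assumptions for the example. Lengths are assumed to be positive, and special length values are not modeled.

```python
# Hypothetical helper names; ranges are (offset, length) with length > 0.
def ranges_overlap(a_off, a_len, b_off, b_len):
    """True when the two half-open ranges share at least one octet."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def lock_conflicts_with_io(lock_type, lock_range, io_op, io_range, sets_size=False):
    """Decide whether a lock request conflicts with in-progress I/O
    that was issued using one of the special stateids."""
    if not ranges_overlap(*lock_range, *io_range):
        return False
    # A SETATTR that sets size is treated like a WRITE.
    is_write = io_op == "WRITE" or (io_op == "SETATTR" and sets_size)
    is_read = io_op == "READ"
    if lock_type == "shared":
        return is_write
    if lock_type == "exclusive":
        return is_read or is_write
    raise ValueError("unknown lock type: %r" % (lock_type,))
```

In this sketch a shared lock request is not blocked by a concurrent READ, while an exclusive lock request is blocked by either kind of I/O on an overlapping range.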
8.2. Lock Ranges 8.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range The protocol allows a lock owner to request a lock with an octet
and then either upgrade, downgrade, or unlock a sub-range of the range and then either upgrade, downgrade, or unlock a sub-range of
initial lock. It is expected that this will be an uncommon type of the initial lock. It is expected that this will be an uncommon type
request. In any case, servers or server filesystems may not be able of request. In any case, servers or server filesystems may not be
to support sub-range lock semantics. In the event that a server able to support sub-range lock semantics. In the event that a server
receives a locking request that represents a sub-range of current receives a locking request that represents a sub-range of current
locking state for the lock owner, the server is allowed to return the locking state for the lock owner, the server is allowed to return the
error NFS4ERR_LOCK_RANGE to signify that it does not support sub- error NFS4ERR_LOCK_RANGE to signify that it does not support sub-
range lock operations. Therefore, the client should be prepared to range lock operations. Therefore, the client should be prepared to
receive this error and, if appropriate, report the error to the receive this error and, if appropriate, report the error to the
requesting application. requesting application.
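
As a rough illustration of the sub-range case above, the sketch below shows a server that does not support sub-range operations returning NFS4ERR_LOCK_RANGE. The helper names and the `(offset, length)` representation are hypothetical. The sketch also assumes that "sub-range" means a request lying strictly inside a range the lock owner already holds, which is one possible reading rather than a definition taken from the draft.

```python
# Hypothetical sketch; ranges are (offset, length) tuples with length > 0.
def is_proper_sub_range(held, requested):
    """True when requested lies within held but is not identical to it."""
    h_off, h_len = held
    r_off, r_len = requested
    inside = h_off <= r_off and r_off + r_len <= h_off + h_len
    return inside and (r_off, r_len) != (h_off, h_len)


def handle_lock_request(held_ranges, requested, supports_sub_range):
    """Return a status string for a request from a single lock owner."""
    if not supports_sub_range and any(
        is_proper_sub_range(h, requested) for h in held_ranges
    ):
        return "NFS4ERR_LOCK_RANGE"
    return "NFS4_OK"
```

A client receiving NFS4ERR_LOCK_RANGE in this situation would, as the text says, report the error to the requesting application if appropriate.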
The client is discouraged from combining multiple independent locking The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to server may not support sub-range requests and for reasons related to
skipping to change at page 129, line 42 skipping to change at page 139, line 42
In the event that a client fails, the server may release the client's In the event that a client fails, the server may release the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
When a client has not failed and re-establishes its lease before When a client has not failed and re-establishes its lease before
expiration occurs, requests for conflicting locks will not be expiration occurs, requests for conflicting locks will not be
granted. granted.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client supplied verifier. This
verifier is part of the initial EXCHANGE_ID call made by the client. verifier is part of the initial EXCHANGE_ID call made by the client.
The server returns a clientid as a result of the EXCHANGE_ID The server returns a client ID as a result of the EXCHANGE_ID
operation. The client then confirms the use of the clientid by operation. The client then confirms the use of the client ID by
establishing a session associated with that clientid. All locks, establishing a session associated with that client ID. All locks,
including opens, byte-range locks, delegations, and layouts obtained including opens, record locks, delegations, and layouts obtained by
by sessions using that clientid are associated with that clientid. sessions using that client ID are associated with that client ID.
Since the verifier will be changed by the client upon each Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was all locks held which are associated with the old client ID which was
derived from the old verifier. At this point conflicting locks from derived from the old verifier. At this point conflicting locks from
other clients, kept waiting while the lease had not yet expired, can other clients, kept waiting while the lease had not yet expired, can
be granted. be granted.
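
The verifier comparison described above might look something like the following simplified model. The class, its method names, and the use of an opaque `owner` key are assumptions for illustration. The model deliberately omits sessions, client ID confirmation, and leases, all of which a real server has to handle.

```python
# Hypothetical, heavily simplified model of verifier-based restart detection.
class LockServer:
    def __init__(self):
        self.verifiers = {}  # owner -> verifier from the latest EXCHANGE_ID
        self.locks = {}      # owner -> set of lock descriptions

    def exchange_id(self, owner, verifier):
        """Record the verifier; return any locks released as a result."""
        released = set()
        old = self.verifiers.get(owner)
        if old is not None and old != verifier:
            # A changed verifier signals a new client instance, so the
            # locks tied to the old instance can be released.
            released = self.locks.pop(owner, set())
        self.verifiers[owner] = verifier
        self.locks.setdefault(owner, set())
        return released

    def add_lock(self, owner, lock):
        self.locks[owner].add(lock)
```

For example, a client that restarts and presents a new verifier causes its earlier locks to be returned as released, after which conflicting requests from other clients could be granted.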
Note that the verifier must have the same uniqueness properties as Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
8.6.2. Server Failure and Recovery 8.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re- or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re- establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another requests because the server has granted conflicting access to another
client. Likewise, if there is a possibility that clients have not client. Likewise, if there is a possibility that clients have not
yet re-established their locking state for a file, the server must yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. disallow READ and WRITE operations for that file.
A client can determine that server failure (and thus loss of locking A client can determine that loss of locking state has occurred via
state) has occurred, when it receives one of two errors. The several methods.
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these is
received, the client must establish a new clientid (See
Section 8.1.1) and re-establish its locking state.
Once a session is established using the new clientid, the client will 1. When a SEQUENCE succeeds, but sr_status_flags in the reply to
use reclaim-type locking requests (i.e. LOCK requests with reclaim SEQUENCE indicates SEQ4_STATUS_RESTART_RECLAIM_NEEDED (see
set to true and OPEN operations with a claim type of CLAIM_PREVIOUS) Section 17.46.4). The client's client ID and session are valid
to re-establish its locking state. Once this is done, or if there is (have persisted through server restart) and the client can now
no such locking state to reclaim, the client does a RECLAIM_COMPLETE re-establish its lock state (Section 8.6.2.1).
operation to indicate that it has reclaimed all of the locking state
that it will reclaim. Once a client does a RECLAIM_COMPLETE 2. When an operation returns NFS4ERR_STALE_STATEID, this indicates a
operation, it may attempt non-reclaim locking operations, although it stateid invalidated by a server reboot or restart. Since the
may get NFS4ERR_GRACE errors on these until the period of special operation that returned NFS4ERR_STALE_STATEID MUST have been
handling is over. preceded by SEQUENCE, and SEQUENCE did not return an error, this
means the client ID and session are valid. The client can now
re-establish its lock state as described in Section 8.6.2.1. Note
that the server should (MUST) have set
SEQ4_STATUS_RESTART_RECLAIM_NEEDED in the sr_status_flags of the
results of the SEQUENCE operation, and thus this situation should
be the same as that described above.
3. When a SEQUENCE operation returns NFS4ERR_STALE_CLIENTID, this
means both the sessionid SEQUENCE refers to (field sa_sessionid) and
the implied client ID are now invalid, where the client ID was
invalidated by server reboot or restart or by lease expiration.
When SEQUENCE returns NFS4ERR_STALE_CLIENTID, the client must
establish a new client ID (see Section 8.1.1) and re-establish
its lock state (Section 8.6.2.1).
4. When a SEQUENCE operation returns NFS4ERR_BADSESSION, this may
mean the session has been destroyed, but the client ID is still
valid. The client issues a CREATE_SESSION request with the
client ID to re-establish the session. If CREATE_SESSION fails
with NFS4ERR_STALE_CLIENTID, the client must establish a new
client ID (see Section 8.1.1) and re-establish its lock state
(Section 8.6.2.1). If CREATE_SESSION succeeds, the client must
then re-establish its lock state (Section 8.6.2.1).
5. When an operation that is neither SEQUENCE nor preceded by SEQUENCE
(for example, CREATE_SESSION, DESTROY_SESSION) returns
NFS4ERR_STALE_CLIENTID, the client MUST establish a new client
ID (Section 8.1.1) and re-establish its lock state