draft-ietf-nfsv4-minorversion1-02.txt   draft-ietf-nfsv4-minorversion1-03.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft Editor Internet-Draft Editor
Intended status: Standards Track March 6, 2006 Intended status: Standards Track June 20, 2006
Expires: September 7, 2006 Expires: December 22, 2006
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-02.txt draft-ietf-nfsv4-minorversion1-03.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 34 skipping to change at page 1, line 34
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 7, 2006. This Internet-Draft will expire on December 22, 2006.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2006).
Abstract Abstract
This Internet-Draft describes the NFSv4 minor version 1 protocol This Internet-Draft describes the NFSv4 minor version 1 protocol
extensions. The most significant of these extensions are commonly extensions. The most significant of these extensions are commonly
called: Sessions, Directory Delegations, and parallel NFS or pNFS called: Sessions, Directory Delegations, and parallel NFS or pNFS
skipping to change at page 2, line 13 skipping to change at page 2, line 13
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 9 1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 9
1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 9 1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 9
1.2. Structured Data Types . . . . . . . . . . . . . . . . . 10 1.2. Structured Data Types . . . . . . . . . . . . . . . . . 10
2. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 19 2. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1. Obtaining the First Filehandle . . . . . . . . . . . . . 19 2.1. Obtaining the First Filehandle . . . . . . . . . . . . . 19
2.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 20 2.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . . 20
2.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 20 2.1.2. Public Filehandle . . . . . . . . . . . . . . . . . . 20
2.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 20 2.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 20
2.2.1. General Properties of a Filehandle . . . . . . . . . 21 2.2.1. General Properties of a Filehandle . . . . . . . . . 21
2.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 22 2.2.2. Persistent Filehandle . . . . . . . . . . . . . . . . 21
2.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 22 2.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . . 22
2.3. One Method of Constructing a Volatile Filehandle . . . . 23 2.3. One Method of Constructing a Volatile Filehandle . . . . 23
2.4. Client Recovery from Filehandle Expiration . . . . . . . 24 2.4. Client Recovery from Filehandle Expiration . . . . . . . 24
3. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 25 3. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 25
3.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 26 3.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 26
3.2. Recommended Attributes . . . . . . . . . . . . . . . . . 26 3.2. Recommended Attributes . . . . . . . . . . . . . . . . . 26
3.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 27 3.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 27
3.4. Classification of Attributes . . . . . . . . . . . . . . 27 3.4. Classification of Attributes . . . . . . . . . . . . . . 27
3.5. Mandatory Attributes - Definitions . . . . . . . . . . . 28 3.5. Mandatory Attributes - Definitions . . . . . . . . . . . 28
3.6. Recommended Attributes - Definitions . . . . . . . . . . 30 3.6. Recommended Attributes - Definitions . . . . . . . . . . 30
3.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 38 3.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 38
3.8. Interpreting owner and owner_group . . . . . . . . . . . 38 3.8. Interpreting owner and owner_group . . . . . . . . . . . 38
3.9. Character Case Attributes . . . . . . . . . . . . . . . 40 3.9. Character Case Attributes . . . . . . . . . . . . . . . 40
3.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 40 3.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 41
3.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 41 3.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 41
3.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 42 3.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 42
3.13. fs_layouttype . . . . . . . . . . . . . . . . . . . . . 43 3.13. fs_layouttype . . . . . . . . . . . . . . . . . . . . . 43
3.14. layouttype . . . . . . . . . . . . . . . . . . . . . . . 43 3.14. layouttype . . . . . . . . . . . . . . . . . . . . . . . 43
3.15. layouthint . . . . . . . . . . . . . . . . . . . . . . . 43 3.15. layouthint . . . . . . . . . . . . . . . . . . . . . . . 43
3.16. Access Control Lists . . . . . . . . . . . . . . . . . . 43 3.16. Access Control Lists . . . . . . . . . . . . . . . . . . 44
3.16.1. ACE type . . . . . . . . . . . . . . . . . . . . . . 45 3.16.1. ACE type . . . . . . . . . . . . . . . . . . . . . . 46
3.16.2. ACE Access Mask . . . . . . . . . . . . . . . . . . 46 3.16.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . 47
3.16.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . 51 3.16.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . 52
3.16.4. ACE who . . . . . . . . . . . . . . . . . . . . . . 53 3.16.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . 54
3.16.5. Mode Attribute . . . . . . . . . . . . . . . . . . . 54 3.16.5. Mode Attribute . . . . . . . . . . . . . . . . . . . 55
3.16.6. Interaction Between Mode and ACL Attributes . . . . 55 3.16.6. Interaction Between Mode and ACL Attributes . . . . . 56
4. Filesystem Migration and Replication . . . . . . . . . . . . 69 4. Single-server Name Space . . . . . . . . . . . . . . . . . . 69
4.1. Replication . . . . . . . . . . . . . . . . . . . . . . 69 4.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 69
4.2. Migration . . . . . . . . . . . . . . . . . . . . . . . 70 4.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 69
4.3. Interpretation of the fs_locations Attribute . . . . . . 70 4.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 70
4.4. Filehandle Recovery for Migration or Replication . . . . 72 4.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 71
5. NFS Server Name Space . . . . . . . . . . . . . . . . . . . . 72 4.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 71
5.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 72 4.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 71
5.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 72 4.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 71
5.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 73 4.8. Security Policy and Name Space Presentation . . . . . . 72
5.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 73 5. File Locking and Share Reservations . . . . . . . . . . . . . 73
5.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 74 5.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 74 5.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . . 74
5.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 74 5.1.2. Server Release of Clientid . . . . . . . . . . . . . 76
5.8. Security Policy and Name Space Presentation . . . . . . 75 5.1.3. lock_owner and stateid Definition . . . . . . . . . . 77
6. File Locking and Share Reservations . . . . . . . . . . . . . 76 5.1.4. Use of the stateid and Locking . . . . . . . . . . . 79
6.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 76 5.1.5. Sequencing of Lock Requests . . . . . . . . . . . . . 81
6.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . 77 5.1.6. Recovery from Replayed Requests . . . . . . . . . . . 82
6.1.2. Server Release of Clientid . . . . . . . . . . . . . 79 5.1.7. Releasing lock_owner State . . . . . . . . . . . . . 82
6.1.3. lock_owner and stateid Definition . . . . . . . . . 80 5.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 82
6.1.4. Use of the stateid and Locking . . . . . . . . . . . 82 5.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 84
6.1.5. Sequencing of Lock Requests . . . . . . . . . . . . 84 5.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 84
6.1.6. Recovery from Replayed Requests . . . . . . . . . . 85 5.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 84
6.1.7. Releasing lock_owner State . . . . . . . . . . . . . 85 5.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 85
6.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 85 5.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 86
6.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 87 5.6.1. Client Failure and Recovery . . . . . . . . . . . . . 86
6.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 87 5.6.2. Server Failure and Recovery . . . . . . . . . . . . . 87
6.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 87 5.6.3. Network Partitions and Recovery . . . . . . . . . . . 89
6.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 88 5.7. Recovery from a Lock Request Timeout or Abort . . . . . 92
6.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 89 5.8. Server Revocation of Locks . . . . . . . . . . . . . . . 93
6.6.1. Client Failure and Recovery . . . . . . . . . . . . 89 5.9. Share Reservations . . . . . . . . . . . . . . . . . . . 94
6.6.2. Server Failure and Recovery . . . . . . . . . . . . 90 5.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 94
6.6.3. Network Partitions and Recovery . . . . . . . . . . 92 5.10.1. Close and Retention of State Information . . . . . . 95
6.7. Recovery from a Lock Request Timeout or Abort . . . . . 95 5.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 96
6.8. Server Revocation of Locks . . . . . . . . . . . . . . . 96 5.12. Short and Long Leases . . . . . . . . . . . . . . . . . 96
6.9. Share Reservations . . . . . . . . . . . . . . . . . . . 97 5.13. Clocks, Propagation Delay, and Calculating Lease
6.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 97 Expiration . . . . . . . . . . . . . . . . . . . . . . . 97
6.10.1. Close and Retention of State Information . . . . . . 98 6. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 97
6.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 99 6.1. Performance Challenges for Client-Side Caching . . . . . 98
6.12. Short and Long Leases . . . . . . . . . . . . . . . . . 99 6.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 99
6.13. Clocks, Propagation Delay, and Calculating Lease 6.2.1. Delegation Recovery . . . . . . . . . . . . . . . . . 100
Expiration . . . . . . . . . . . . . . . . . . . . . . . 100 6.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 102
6.14. Migration, Replication and State . . . . . . . . . . . . 100 6.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 102
6.14.1. Migration and State . . . . . . . . . . . . . . . . 101 6.3.2. Data Caching and File Locking . . . . . . . . . . . . 103
6.14.2. Replication and State . . . . . . . . . . . . . . . 102 6.3.3. Data Caching and Mandatory File Locking . . . . . . . 105
6.14.3. Notification of Migrated Lease . . . . . . . . . . . 102 6.3.4. Data Caching and File Identity . . . . . . . . . . . 105
6.14.4. Migration and the Lease_time Attribute . . . . . . . 103 6.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 106
7. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 103 6.4.1. Open Delegation and Data Caching . . . . . . . . . . 109
7.1. Performance Challenges for Client-Side Caching . . . . . 104 6.4.2. Open Delegation and File Locks . . . . . . . . . . . 110
7.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 105 6.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 110
7.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 106 6.4.4. Recall of Open Delegation . . . . . . . . . . . . . . 113
7.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 108 6.4.5. Clients that Fail to Honor Delegation Recalls . . . . 115
7.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 108 6.4.6. Delegation Revocation . . . . . . . . . . . . . . . . 116
7.3.2. Data Caching and File Locking . . . . . . . . . . . 109 6.5. Data Caching and Revocation . . . . . . . . . . . . . . 116
7.3.3. Data Caching and Mandatory File Locking . . . . . . 111 6.5.1. Revocation Recovery for Write Open Delegation . . . . 117
7.3.4. Data Caching and File Identity . . . . . . . . . . . 111 6.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 118
7.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 112 6.7. Data and Metadata Caching and Memory Mapped Files . . . 120
7.4.1. Open Delegation and Data Caching . . . . . . . . . . 115 6.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 122
7.4.2. Open Delegation and File Locks . . . . . . . . . . . 116 6.9. Directory Caching . . . . . . . . . . . . . . . . . . . 123
7.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 116 7. Security Negotiation . . . . . . . . . . . . . . . . . . . . 124
7.4.4. Recall of Open Delegation . . . . . . . . . . . . . 119 8. Clarification of Security Negotiation in NFSv4.1 . . . . . . 124
7.4.5. Clients that Fail to Honor Delegation Recalls . . . 121 8.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 125
7.4.6. Delegation Revocation . . . . . . . . . . . . . . . 122 8.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 125
7.5. Data Caching and Revocation . . . . . . . . . . . . . . 122 8.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 125
7.5.1. Revocation Recovery for Write Open Delegation . . . 123 8.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 126
7.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 124 9. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 126
7.7. Data and Metadata Caching and Memory Mapped Files . . . 126 9.1. Sessions Background . . . . . . . . . . . . . . . . . . 126
7.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 128 9.1.1. Introduction to Sessions . . . . . . . . . . . . . . 126
7.9. Directory Caching . . . . . . . . . . . . . . . . . . . 129 9.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 127
8. Security Negotiation . . . . . . . . . . . . . . . . . . . . 130 9.1.3. Problem Statement . . . . . . . . . . . . . . . . . . 128
9. Clarification of Security Negotiation in NFSv4.1 . . . . . . 130 9.1.4. NFSv4 Session Extension Characteristics . . . . . . . 130
9.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 130 9.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 130
9.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 131 9.2.1. Session Model . . . . . . . . . . . . . . . . . . . . 130
9.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 131 9.2.2. Connection State . . . . . . . . . . . . . . . . . . 132
9.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 131 9.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 132
10. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 132 9.2.4. Reconnection, Trunking and Failover . . . . . . . . . 134
10.1. Sessions Background . . . . . . . . . . . . . . . . . . 132 9.2.5. Server Duplicate Request Cache . . . . . . . . . . . 135
10.1.1. Introduction to Sessions . . . . . . . . . . . . . . 132 9.3. Session Initialization and Transfer Models . . . . . . . 136
10.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 133 9.3.1. Session Negotiation . . . . . . . . . . . . . . . . . 136
10.1.3. Problem Statement . . . . . . . . . . . . . . . . . 134 9.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . . 138
10.1.4. NFSv4 Session Extension Characteristics . . . . . . 136 9.3.3. RDMA Connection Resources . . . . . . . . . . . . . . 138
10.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 136 9.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 139
10.2.1. Session Model . . . . . . . . . . . . . . . . . . . 136 9.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 142
10.2.2. Connection State . . . . . . . . . . . . . . . . . . 137 9.4. Connection Models . . . . . . . . . . . . . . . . . . . 145
10.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 138 9.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 146
10.2.4. Reconnection, Trunking and Failover . . . . . . . . 140 9.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 147
10.2.5. Server Duplicate Request Cache . . . . . . . . . . . 141 9.4.3. Automatic RDMA Connection Model . . . . . . . . . . . 148
10.3. Session Initialization and Transfer Models . . . . . . . 142 9.5. Buffer Management, Transfer, Flow Control . . . . . . . 148
10.3.1. Session Negotiation . . . . . . . . . . . . . . . . 142 9.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 151
10.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . 144 9.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 152
10.3.3. RDMA Connection Resources . . . . . . . . . . . . . 144 9.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 153
10.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 145 9.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 153
10.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 148 9.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 155
10.4. Connection Models . . . . . . . . . . . . . . . . . . . 151 9.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 155
10.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 152 9.10.2. Slot Identifiers and Server Duplicate Request Cache . 155
10.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 153 9.10.3. Resolving server callback races with sessions . . . . 159
10.4.3. Automatic RDMA Connection Model . . . . . . . . . . 154 9.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 160
10.5. Buffer Management, Transfer, Flow Control . . . . . . . 154 9.10.5. eXternal Data Representation Efficiency . . . . . . . 161
10.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 157 9.10.6. Effect of Sessions on Existing Operations . . . . . . 161
10.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 158 9.10.7. Authentication Efficiencies . . . . . . . . . . . . . 162
10.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 159 9.11. Sessions Security Considerations . . . . . . . . . . . . 163
10.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 159 9.11.1. Authentication . . . . . . . . . . . . . . . . . . . 165
10.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 161 10. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 166
10.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 161 10.1. Location attributes . . . . . . . . . . . . . . . . . . 166
10.10.2. Slot Identifiers and Server Duplicate Request 10.2. File System Presence or Absence . . . . . . . . . . . . 166
Cache . . . . . . . . . . . . . . . . . . . . . . . 161 10.3. Getting Attributes for an Absent File System . . . . . . 168
10.10.3. Resolving server callback races with sessions . . . 165 10.3.1. GETATTR Within an Absent File System . . . . . . . . 168
10.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 166 10.3.2. READDIR and Absent File Systems . . . . . . . . . . . 169
10.10.5. eXternal Data Representation Efficiency . . . . . . 167 10.4. Uses of Location Information . . . . . . . . . . . . . . 170
10.10.6. Effect of Sessions on Existing Operations . . . . . 167 10.4.1. File System Replication . . . . . . . . . . . . . . . 170
10.10.7. Authentication Efficiencies . . . . . . . . . . . . 168 10.4.2. File System Migration . . . . . . . . . . . . . . . . 171
10.11. Sessions Security Considerations . . . . . . . . . . . . 169 10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . . 172
10.11.1. Authentication . . . . . . . . . . . . . . . . . . . 171 10.5. Additional Client-side Considerations . . . . . . . . . 172
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 172 10.6. Effecting File System Transitions . . . . . . . . . . . 173
11.1. Introduction to Directory Delegations . . . . . . . . . 172 10.6.1. Transparent File System Transitions . . . . . . . . . 174
11.2. Directory Delegation Design (in brief) . . . . . . . . . 173 10.6.2. Filehandles and File System Transitions . . . . . . . 176
10.6.3. Fileid's and File System Transitions . . . . . . . . 176
10.6.4. Fsid's and File System Transitions . . . . . . . . . 177
10.6.5. The Change Attribute and File System Transitions . . 177
10.6.6. Lock State and File System Transitions . . . . . . . 178
10.6.7. Write Verifiers and File System Transitions . . . . . 181
10.7. Effecting File System Referrals . . . . . . . . . . . . 181
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . . 182
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 186
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 188
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 188
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 190
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 199
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 202
11.1. Introduction to Directory Delegations . . . . . . . . . 203
11.2. Directory Delegation Design (in brief) . . . . . . . . . 204
11.3. Recommended Attributes in support of Directory 11.3. Recommended Attributes in support of Directory
Delegations . . . . . . . . . . . . . . . . . . . . . . 174 Delegations . . . . . . . . . . . . . . . . . . . . . . 205
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 175 11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 206
11.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 175 11.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 206
12. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 175 12. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 206
13. General Definitions . . . . . . . . . . . . . . . . . . . . . 178 13. General Definitions . . . . . . . . . . . . . . . . . . . . . 209
13.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 178 13.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 209
13.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 178 13.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 209
13.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 178 13.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 209
13.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 179 13.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 209
13.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 179 13.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 210
13.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 179 13.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 210
13.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 180 13.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 210
14. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 180 14. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 211
14.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 180 14.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 211
14.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 180 14.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 211
14.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . 181 14.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . . 211
14.1.3. Layout Segments . . . . . . . . . . . . . . . . . . 181 14.1.3. Layout Segments . . . . . . . . . . . . . . . . . . . 212
14.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 182 14.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 213
14.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . 183 14.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . . 213
14.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 183 14.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 214
14.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 184 14.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 215
14.4. Committing a Layout . . . . . . . . . . . . . . . . . . 185 14.4. Committing a Layout . . . . . . . . . . . . . . . . . . 216
14.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . 186 14.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . . 216
14.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . 186 14.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . . 217
14.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . 187 14.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . . 218
14.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 187 14.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 218
14.5.1. Basic Operation . . . . . . . . . . . . . . . . . . 188 14.5.1. Basic Operation . . . . . . . . . . . . . . . . . . . 218
14.5.2. Recall Callback Robustness . . . . . . . . . . . . . 189 14.5.2. Recall Callback Robustness . . . . . . . . . . . . . 220
14.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 190 14.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 221
14.6. Metadata Server Write Propagation . . . . . . . . . . . 192 14.6. Metadata Server Write Propagation . . . . . . . . . . . 223
14.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 193 14.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 223
14.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 193 14.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 224
14.7.2. Client Recovery . . . . . . . . . . . . . . . . . . 194 14.7.2. Client Recovery . . . . . . . . . . . . . . . . . . . 225
14.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 195 14.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 226
14.7.4. Storage Device Recovery . . . . . . . . . . . . . . 197 14.7.4. Storage Device Recovery . . . . . . . . . . . . . . . 228
15. Security Considerations . . . . . . . . . . . . . . . . . . . 198 15. Security Considerations . . . . . . . . . . . . . . . . . . . 229
15.1. File Layout Security . . . . . . . . . . . . . . . . . . 199 15.1. File Layout Security . . . . . . . . . . . . . . . . . . 230
15.2. Object Layout Security . . . . . . . . . . . . . . . . . 199 15.2. Object Layout Security . . . . . . . . . . . . . . . . . 230
15.3. Block/Volume Layout Security . . . . . . . . . . . . . . 201 15.3. Block/Volume Layout Security . . . . . . . . . . . . . . 232
16. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 201 16. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 232
16.1. File Striping and Data Access . . . . . . . . . . . . . 202 16.1. File Striping and Data Access . . . . . . . . . . . . . 232
16.1.1. Sparse and Dense Storage Device Data Layouts . . . . 203 16.1.1. Sparse and Dense Storage Device Data Layouts . . . . 234
16.1.2. Metadata and Storage Device Roles . . . . . . . . . 205 16.1.2. Metadata and Storage Device Roles . . . . . . . . . . 236
16.1.3. Device Multipathing . . . . . . . . . . . . . . . . 206 16.1.3. Device Multipathing . . . . . . . . . . . . . . . . . 237
16.1.4. Operations Issued to Storage Devices . . . . . . . . 206 16.1.4. Operations Issued to Storage Devices . . . . . . . . 237
16.2. Global Stateid Requirements . . . . . . . . . . . . . . 207 16.1.5. COMMIT through metadata server . . . . . . . . . . . 238
16.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 207 16.2. Global Stateid Requirements . . . . . . . . . . . . . . 238
16.4. Storage Device State Propagation . . . . . . . . . . . . 208 16.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 239
16.4.1. Lock State Propagation . . . . . . . . . . . . . . . 208 16.4. Storage Device State Propagation . . . . . . . . . . . . 239
16.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 209 16.4.1. Lock State Propagation . . . . . . . . . . . . . . . 240
16.4.3. File Attributes . . . . . . . . . . . . . . . . . . 209 16.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 240
16.5. Storage Device Component File Size . . . . . . . . . . . 210 16.4.3. File Attributes . . . . . . . . . . . . . . . . . . . 240
16.6. Crash Recovery Considerations . . . . . . . . . . . . . 211 16.5. Storage Device Component File Size . . . . . . . . . . . 241
16.7. Security Considerations . . . . . . . . . . . . . . . . 211 16.6. Crash Recovery Considerations . . . . . . . . . . . . . 242
16.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 211 16.7. Security Considerations . . . . . . . . . . . . . . . . 243
17. Layouts and Aggregation . . . . . . . . . . . . . . . . . . . 212 16.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 243
17.1. Simple Map . . . . . . . . . . . . . . . . . . . . . . . 213 17. Layouts and Aggregation . . . . . . . . . . . . . . . . . . . 244
17.2. Block Extent Map . . . . . . . . . . . . . . . . . . . . 213 17.1. Simple Map . . . . . . . . . . . . . . . . . . . . . . . 244
17.3. Striped Map (RAID 0) . . . . . . . . . . . . . . . . . . 213 17.2. Block Extent Map . . . . . . . . . . . . . . . . . . . . 245
17.4. Replicated Map . . . . . . . . . . . . . . . . . . . . . 213 17.3. Striped Map (RAID 0) . . . . . . . . . . . . . . . . . . 245
17.5. Concatenated Map . . . . . . . . . . . . . . . . . . . . 214 17.4. Replicated Map . . . . . . . . . . . . . . . . . . . . . 245
17.6. Nested Map . . . . . . . . . . . . . . . . . . . . . . . 214 17.5. Concatenated Map . . . . . . . . . . . . . . . . . . . . 245
18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 214 17.6. Nested Map . . . . . . . . . . . . . . . . . . . . . . . 246
19. Internationalization . . . . . . . . . . . . . . . . . . . . 216 18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 246
19.1. Stringprep profile for the utf8str_cs type . . . . . . . 218 19. Internationalization . . . . . . . . . . . . . . . . . . . . 248
19.2. Stringprep profile for the utf8str_cis type . . . . . . 219 19.1. Stringprep profile for the utf8str_cs type . . . . . . . 249
19.3. Stringprep profile for the utf8str_mixed type . . . . . 221 19.2. Stringprep profile for the utf8str_cis type . . . . . . 251
19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 222 19.3. Stringprep profile for the utf8str_mixed type . . . . . 252
20. Error Definitions . . . . . . . . . . . . . . . . . . . . . . 222 19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 254
21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 231 20. Error Definitions . . . . . . . . . . . . . . . . . . . . . . 254
21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 231 21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 263
21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 232 21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 263
22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 234 21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 264
22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 235 22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 266
22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 237 22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 267
22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 238 22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 269
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 241 22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 270
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 273
22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 244 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 276
22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 245 22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 277
22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 245 22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 277
22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 247 22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 279
22.9. Operation 11: LINK - Create Link to a File . . . . . . . 248 22.9. Operation 11: LINK - Create Link to a File . . . . . . . 280
22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 249 22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 281
22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 253 22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 285
22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 255 22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 287
22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 256 22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 288
22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 258 22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 290
22.15. Operation 17: NVERIFY - Verify Difference in 22.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 259 Attributes . . . . . . . . . . . . . . . . . . . . . . . 291
22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 260 22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 292
22.17. Operation 19: OPENATTR - Open Named Attribute 22.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 269 Directory . . . . . . . . . . . . . . . . . . . . . . . 306
22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 271 22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 307
22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 273 22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 309
22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 274 22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 310
22.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 275 22.21. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 311
22.22. Operation 25: READ - Read from File . . . . . . . . . . 276 22.22. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 313
22.23. Operation 26: READDIR - Read Directory . . . . . . . . . 278 22.23. Operation 25: READ - Read from File . . . . . . . . . . 314
22.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 282 22.24. Operation 26: READDIR - Read Directory . . . . . . . . . 316
22.25. Operation 28: REMOVE - Remove Filesystem Object . . . . 283 22.25. Operation 27: READLINK - Read Symbolic Link . . . . . . 320
22.26. Operation 29: RENAME - Rename Directory Entry . . . . . 285 22.26. Operation 28: REMOVE - Remove Filesystem Object . . . . 321
22.27. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 287 22.27. Operation 29: RENAME - Rename Directory Entry . . . . . 323
22.28. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 288 22.28. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 325
22.29. Operation 32: SAVEFH - Save Current Filehandle . . . . . 289 22.29. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 326
22.30. Operation 33: SECINFO - Obtain Available Security . . . 290 22.30. Operation 32: SAVEFH - Save Current Filehandle . . . . . 327
22.31. Operation 34: SETATTR - Set Attributes . . . . . . . . . 293 22.31. Operation 33: SECINFO - Obtain Available Security . . . 328
22.32. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 296 22.32. Operation 34: SETATTR - Set Attributes . . . . . . . . . 331
22.33. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 300 22.33. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 334
22.34. Operation 37: VERIFY - Verify Same Attributes . . . . . 303 22.34. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 338
22.35. Operation 38: WRITE - Write to File . . . . . . . . . . 304 22.35. Operation 37: VERIFY - Verify Same Attributes . . . . . 341
22.36. Operation 39: RELEASE_LOCKOWNER - Release Lockowner 22.36. Operation 38: WRITE - Write to File . . . . . . . . . . 342
State . . . . . . . . . . . . . . . . . . . . . . . . . 309 22.37. Operation 39: RELEASE_LOCKOWNER - Release Lockowner
22.37. Operation 10044: ILLEGAL - Illegal operation . . . . . . 310 State . . . . . . . . . . . . . . . . . . . . . . . . . 347
22.38. SECINFO_NO_NAME - Get Security on Unnamed Object . . . . 310 22.38. Operation 10044: ILLEGAL - Illegal operation . . . . . . 348
22.39. CREATECLIENTID - Instantiate Clientid . . . . . . . . . 312 22.39. SECINFO_NO_NAME - Get Security on Unnamed Object . . . . 348
22.40. CREATESESSION - Create New Session and Confirm 22.40. CREATECLIENTID - Instantiate Clientid . . . . . . . . . 350
Clientid . . . . . . . . . . . . . . . . . . . . . . . . 317 22.41. CREATESESSION - Create New Session and Confirm
22.41. BIND_BACKCHANNEL - Create a callback channel binding . . 322 Clientid . . . . . . . . . . . . . . . . . . . . . . . . 355
22.42. DESTROYSESSION - Destroy existing session . . . . . . . 324 22.42. BIND_BACKCHANNEL - Create a callback channel binding . . 360
22.43. SEQUENCE - Supply per-procedure sequencing and control . 325 22.43. DESTROYSESSION - Destroy existing session . . . . . . . 362
22.44. GET_DIR_DELEGATION - Get a directory delegation . . . . 326 22.44. SEQUENCE - Supply per-procedure sequencing and control . 363
22.45. LAYOUTGET - Get Layout Information . . . . . . . . . . . 330 22.45. GET_DIR_DELEGATION - Get a directory delegation . . . . 364
22.46. LAYOUTCOMMIT - Commit writes made using a layout . . . . 332 22.46. LAYOUTGET - Get Layout Information . . . . . . . . . . . 368
22.47. LAYOUTRETURN - Release Layout Information . . . . . . . 336 22.47. LAYOUTCOMMIT - Commit writes made using a layout . . . . 371
22.48. GETDEVICEINFO - Get Device Information . . . . . . . . . 337 22.48. LAYOUTRETURN - Release Layout Information . . . . . . . 375
22.49. GETDEVICELIST . . . . . . . . . . . . . . . . . . . . . 338 22.49. GETDEVICEINFO - Get Device Information . . . . . . . . . 376
23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 340 22.50. GETDEVICELIST . . . . . . . . . . . . . . . . . . . . . 377
23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 340 22.51. WANT_DELEGATION . . . . . . . . . . . . . . . . . . . . 379
23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 340 23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 382
24. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 342 23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 382
24.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 342 23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 383
24.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 343 24. CB_RECALLCREDIT - change flow control limits . . . . . . . . 385
24.3. Operation 10044: CB_ILLEGAL - Illegal Callback 25. CB_SEQUENCE - Supply callback channel sequencing and
Operation . . . . . . . . . . . . . . . . . . . . . . . 344 control . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
24.4. CB_RECALLCREDIT - change flow control limits . . . . . . 345 26. CB_NOTIFY - Notify directory changes . . . . . . . . . . . . 387
24.5. CB_SEQUENCE - Supply callback channel sequencing and 27. CB_RECALL_ANY - Keep any N delegations . . . . . . . . . . . 390
control . . . . . . . . . . . . . . . . . . . . . . . . 346 28. CB_SIZECHANGED . . . . . . . . . . . . . . . . . . . . . . . 393
24.6. CB_NOTIFY - Notify directory changes . . . . . . . . . . 348 29. CB_LAYOUTRECALL . . . . . . . . . . . . . . . . . . . . . . . 394
24.7. CB_RECALL_ANY - Keep any N delegations . . . . . . . . . 351 30. CB_PUSH_DELEG . . . . . . . . . . . . . . . . . . . . . . . . 397
24.8. CB_SIZECHANGED . . . . . . . . . . . . . . . . . . . . . 354 31. CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . . . . . . . . . . 398
24.9. CB_LAYOUTRECALL . . . . . . . . . . . . . . . . . . . . 355 32. References . . . . . . . . . . . . . . . . . . . . . . . . . 398
25. References . . . . . . . . . . . . . . . . . . . . . . . . . 357 32.1. Normative References . . . . . . . . . . . . . . . . . . 398
25.1. Normative References . . . . . . . . . . . . . . . . . . 357 32.2. Informative References . . . . . . . . . . . . . . . . . 399
25.2. Informative References . . . . . . . . . . . . . . . . . 357 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 399
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 358 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 400
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 358 Intellectual Property and Copyright Statements . . . . . . . . . 401
Intellectual Property and Copyright Statements . . . . . . . . . 360
1. Protocol Data Types 1. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC1832 [2] and RPC RFC1831 version 4 protocol are defined in the XDR RFC1832 [2] and RPC RFC1831
[3] documents. The next sections build upon the XDR data types to [3] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
1.1. Basic Data Types 1.1. Basic Data Types
These are the base NFSv4 data types. These are the base NFSv4 data types.
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
| Data Type | Definition | | Data Type | Definition |
+---------------+---------------------------------------------------+ +---------------+---------------------------------------------------+
| int32_t | typedef int int32_t; | | int32_t | typedef int int32_t; |
| uint32_t | typedef unsigned int uint32_t; | | uint32_t | typedef unsigned int uint32_t; |
| int64_t | typedef hyper int64_t; | | int64_t | typedef hyper int64_t; |
| uint64_t | typedef unsigned hyper uint64_t; | | uint64_t | typedef unsigned hyper uint64_t; |
| attrlist4 | typedef opaque attrlist4<> | | attrlist4 | typedef opaque attrlist4<>; |
| | Used for file/directory attributes | | | Used for file/directory attributes |
| bitmap4 | typedef uint32_t bitmap4<> | | bitmap4 | typedef uint32_t bitmap4<>; |
| | Used in attribute array encoding. | | | Used in attribute array encoding. |
| changeid4 | typedef uint64_t changeid4; | | changeid4 | typedef uint64_t changeid4; |
| | Used in definition of change_info | | | Used in definition of change_info |
| clientid4 | typedef uint64_t clientid4; | | clientid4 | typedef uint64_t clientid4; |
| | Shorthand reference to client identification | | | Shorthand reference to client identification |
| component4 | typedef utf8str_cs component4; | | component4 | typedef utf8str_cs component4; |
| | Represents path name components | | | Represents path name components |
| count4 | typedef uint32_t count4; | | count4 | typedef uint32_t count4; |
| | Various count parameters (READ, WRITE, COMMIT) | | | Various count parameters (READ, WRITE, COMMIT) |
| length4 | typedef uint64_t length4; | | length4 | typedef uint64_t length4; |
| | Describes LOCK lengths | | | Describes LOCK lengths |
| linktext4 | typedef utf8str_cs linktext4; | | linktext4 | typedef utf8str_cs linktext4; |
| | Symbolic link contents | | | Symbolic link contents |
| mode4 | typedef uint32_t mode4; | | mode4 | typedef uint32_t mode4; |
| | Mode attribute data type | | | Mode attribute data type |
| nfs_cookie4 | typedef uint64_t nfs_cookie4; | | nfs_cookie4 | typedef uint64_t nfs_cookie4; |
| | Opaque cookie value for READDIR | | | Opaque cookie value for READDIR |
| nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE> | | nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE> |
| | Filehandle definition; NFS4_FHSIZE is defined as | | | Filehandle definition; NFS4_FHSIZE is defined as |
| | 128 | | | 128 |
| nfs_ftype4 | enum nfs_ftype4; | | nfs_ftype4 | enum nfs_ftype4; |
| | Various defined file types | | | Various defined file types |
| nfsstat4 | enum nfsstat4; | | nfsstat4 | enum nfsstat4; |
| | Return value for operations | | | Return value for operations |
| offset4 | typedef uint64_t offset4; | | offset4 | typedef uint64_t offset4; |
| | Various offset designations (READ, WRITE, LOCK, | | | Various offset designations (READ, WRITE, LOCK, |
| | COMMIT) | | | COMMIT) |
| pathname4 | typedef component4 pathname4<> | | pathname4 | typedef component4 pathname4<>; |
| | Represents path name for fs_locations | | | Represents path name for fs_locations |
| qop4 | typedef uint32_t qop4; | | qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO | | | Quality of protection designation in SECINFO |
| sec_oid4 | typedef opaque sec_oid4<> | | sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier. The sec_oid4 data type | | | Security Object Identifier. The sec_oid4 data type |
| | is not really opaque. Instead it contains an ASN.1 | | | is not really opaque. Instead it contains an ASN.1 |
| | OBJECT IDENTIFIER as used by GSS-API in the | | | OBJECT IDENTIFIER as used by GSS-API in the |
| | mech_type argument to GSS_Init_sec_context. See | | | mech_type argument to GSS_Init_sec_context. See |
| | RFC2743 [4] for details. | | | RFC2743 [4] for details. |
| seqid4 | typedef uint32_t seqid4; | | seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking | | | Sequence identifier used for file locking |
| utf8string | typedef opaque utf8string<> | | utf8string | typedef opaque utf8string<>; |
| | UTF-8 encoding for strings | | | UTF-8 encoding for strings |
| utf8str_cis | typedef opaque utf8str_cis; | | utf8str_cis | typedef opaque utf8str_cis; |
| | Case-insensitive UTF-8 string | | | Case-insensitive UTF-8 string |
| utf8str_cs | typedef opaque utf8str_cs; | | utf8str_cs | typedef opaque utf8str_cs; |
| | Case-sensitive UTF-8 string | | | Case-sensitive UTF-8 string |
| utf8str_mixed | typedef opaque utf8str_mixed; | | utf8str_mixed | typedef opaque utf8str_mixed; |
| | UTF-8 strings with a case sensitive prefix and a | | | UTF-8 strings with a case sensitive prefix and a |
| | case insensitive suffix. | | | case insensitive suffix. |
| verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; |
| | Verifier used for various operations (COMMIT, | | | Verifier used for various operations (COMMIT, |
skipping to change at page 12, line 8 skipping to change at page 12, line 8
1.2.5. fsid4 1.2.5. fsid4
struct fsid4 { struct fsid4 {
uint64_t major; uint64_t major;
uint64_t minor; uint64_t minor;
}; };
1.2.6. fs_location4 1.2.6. fs_location4
struct fs_location4 { struct fs_location4 {
utf8str_cis server<>; utf8str_cis server<>;
pathname4 rootpath; pathname4 rootpath;
}; };
1.2.7. fs_locations4 1.2.7. fs_locations4
struct fs_locations4 { struct fs_locations4 {
pathname4 fs_root; pathname4 fs_root;
fs_location4 locations<>; fs_location4 locations<>;
}; };
The fs_location4 and fs_locations4 data types are used for the The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and fs_locations recommended attribute which is used for migration and
replication support. replication support.
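As a rough illustration of how these two types relate, the following C
sketch builds an fs_locations4 value for a filesystem that is also
available from two other servers.  The struct definitions are simplified
stand-ins for rpcgen-generated bindings (pathname4 is really an array of
components, collapsed here to a single string for brevity), and all server
names and paths are hypothetical.

   #include <stdio.h>

   /* Simplified stand-ins for the rpcgen-generated bindings; hypothetical
    * names and paths throughout. */
   struct fs_location4 {
       const char **server;     /* utf8str_cis server<> : alternate servers  */
       unsigned     server_len;
       const char  *rootpath;   /* pathname4 rootpath : path on those servers */
   };

   struct fs_locations4 {
       const char          *fs_root;   /* path of this fs on the present server */
       struct fs_location4 *locations;
       unsigned             locations_len;
   };

   int main(void)
   {
       const char *replica_a[] = { "server-a.example.net" };
       const char *replica_b[] = { "server-b.example.net" };
       struct fs_location4 locs[] = {
           { replica_a, 1, "/exports/vol0" },
           { replica_b, 1, "/backup/vol0"  },
       };
       struct fs_locations4 fsl = { "/vol/vol0", locs, 2 };

       for (unsigned i = 0; i < fsl.locations_len; i++)
           printf("%s is also served by %s at %s\n", fsl.fs_root,
                  fsl.locations[i].server[0], fsl.locations[i].rootpath);
       return 0;
   }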
1.2.8. fattr4 1.2.8. fattr4
struct fattr4 { struct fattr4 {
bitmap4 attrmask; bitmap4 attrmask;
skipping to change at page 13, line 11 skipping to change at page 13, line 11
}; };
This structure is used with the CREATE, LINK, REMOVE, RENAME This structure is used with the CREATE, LINK, REMOVE, RENAME
operations to let the client know the value of the change attribute operations to let the client know the value of the change attribute
for the directory in which the target filesystem object resides. for the directory in which the target filesystem object resides.
1.2.10. clientaddr4 1.2.10. clientaddr4
struct clientaddr4 { struct clientaddr4 {
/* see struct rpcb in RFC1833 */ /* see struct rpcb in RFC1833 */
string r_netid<> /* network id */ string r_netid<>; /* network id */
string r_addr<> /* universal address */ string r_addr<>; /* universal address */
}; };
The clientaddr4 structure is used as part of the SETCLIENTID The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a operation to either specify the address of the client that is using a
clientid or as part of the callback registration. The r_netid and clientid or as part of the callback registration. The r_netid and
r_addr fields are specified in RFC1833 [9], but they are r_addr fields are specified in RFC1833 [9], but they are
underspecified in RFC1833 [9] as far as what they should look like underspecified in RFC1833 [9] as far as what they should look like
for specific protocols. for specific protocols.
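As a sketch of how a client might fill in these two fields for TCP over
IPv4, assuming the universal address convention described in the next
paragraph (the four address octets followed by the two port octets, all
decimal and dot-separated):

   #include <stdio.h>

   /* Build the r_netid/r_addr pair for TCP over IPv4: r_netid is "tcp" and
    * r_addr is the address octets followed by the port octets.  A sketch
    * only; a real client would start from a sockaddr and also handle the
    * UDP and IPv6 netids. */
   static void make_clientaddr4(const unsigned ip[4], unsigned port,
                                char *r_addr, size_t r_addr_len,
                                const char **r_netid)
   {
       *r_netid = "tcp";
       snprintf(r_addr, r_addr_len, "%u.%u.%u.%u.%u.%u",
                ip[0], ip[1], ip[2], ip[3], (port >> 8) & 0xff, port & 0xff);
   }

   int main(void)
   {
       const unsigned ip[4] = { 192, 0, 2, 10 };   /* example address */
       const char *netid;
       char addr[32];

       make_clientaddr4(ip, 2049, addr, sizeof(addr), &netid);
       printf("r_netid = %s, r_addr = %s\n", netid, addr);  /* 192.0.2.10.8.1 */
       return 0;
   }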
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
skipping to change at page 14, line 22 skipping to change at page 14, line 22
clientaddr4 cb_location; clientaddr4 cb_location;
}; };
This structure is used by the client to inform the server of its call This structure is used by the client to inform the server of its call
back address; it includes the program number and client address. back address; it includes the program number and client address.
1.2.12. nfs_client_id4 1.2.12. nfs_client_id4
struct nfs_client_id4 { struct nfs_client_id4 {
verifier4 verifier; verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT> opaque id<NFS4_OPAQUE_LIMIT>
}; };
This structure is part of the arguments to the SETCLIENTID operation. This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.13. open_owner4 1.2.13. open_owner4
struct open_owner4 { struct open_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT> opaque owner<NFS4_OPAQUE_LIMIT>
}; };
This structure is used to identify the owner of open state. This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.14. lock_owner4 1.2.14. lock_owner4
struct lock_owner4 { struct lock_owner4 {
clientid4 clientid; clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT> opaque owner<NFS4_OPAQUE_LIMIT>
}; };
This structure is used to identify the owner of file locking state. This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024. NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.15. open_to_lock_owner4 1.2.15. open_to_lock_owner4
struct open_to_lock_owner4 { struct open_to_lock_owner4 {
seqid4 open_seqid; seqid4 open_seqid;
stateid4 open_stateid; stateid4 open_stateid;
skipping to change at page 16, line 29 skipping to change at page 16, line 29
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system ID is qualified by the layout type and is unique per file system
(FSID). This allows different layout drivers to generate device IDs (FSID). This allows different layout drivers to generate device IDs
without the need for co-ordination. See Section 14.1.4 for more without the need for co-ordination. See Section 14.1.4 for more
details. details.
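Because a device ID is only unique within a given layout type and file
system, a client caching the results of GETDEVICEINFO needs both
qualifiers in its cache key, and must discard entries across metadata
server reboots.  The following is a minimal, hypothetical client-side
sketch; the 32-bit representation of the device ID is an assumption made
purely for illustration.

   #include <stdint.h>

   /* Hypothetical client-side cache key for GETDEVICEINFO results. */
   struct device_cache_key {
       uint64_t fsid_major;
       uint64_t fsid_minor;
       uint32_t layout_type;    /* pnfs_layouttype4 */
       uint32_t device_id;      /* pnfs_deviceid4, width assumed */
   };

   int device_key_equal(const struct device_cache_key *a,
                        const struct device_cache_key *b)
   {
       return a->fsid_major  == b->fsid_major  &&
              a->fsid_minor  == b->fsid_minor  &&
              a->layout_type == b->layout_type &&
              a->device_id   == b->device_id;
   }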
1.2.19. pnfs_deviceaddr4 1.2.19. pnfs_netaddr4
struct pnfs_netaddr4 { struct pnfs_netaddr4 {
string r_netid<> /* network ID */ string r_netid<>; /* network ID */
string r_addr<> /* universal address */ string r_addr<>; /* universal address */
};
struct pnfs_deviceaddr4 {
pnfs_layouttype4 type;
opaque device_addr<>
}; };
The device address is used to set up a communication channel with the For a description of the r_netid and r_addr fields see the
storage device. Different layout types will require different types descriptions provided in the clientaddr4 structure description.
of structures to define how they communicate with storage devices.
The opaque device_addr field must be interpreted based on the
specified layout type.
Currently, the only defined device address is that for the NFSv4 file
layout (struct pnfs_netaddr4), which identifies a storage device by
network IP address and port number. This is sufficient for the
clients to communicate with the NFSv4 storage devices, and may also
be sufficient for object-based storage drivers to communicate with
OSDs. The other device address we expect to support is a SCSI volume
identifier. The final protocol specification will detail the allowed
values for device_type and the format of their associated location
information.
[NOTE: other device addresses will be added as the respective
specifications mature. It has been suggested that a separate
device_type enumeration is used as a switch to the pnfs_deviceaddr4
structure (e.g., if multiple types of addresses exist for the same
layout type). Until such a time as a real case is made and the
respective layout types have matured, the device address structure
will be left as is.]
1.2.20. pnfs_devlist_item4 1.2.20. pnfs_devlist_item4
struct pnfs_devlist_item4 { struct pnfs_devlist_item4 {
pnfs_deviceid4 id; pnfs_deviceid4 id;
pnfs_deviceaddr4 addr; opaque device_addr<>;
}; };
An array of these values is returned by the GETDEVICELIST operation. An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system. They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
The device address is used to set up a communication channel with the
storage device. Different layout types will require different types
of structures to define how they communicate with storage devices.
The opaque device_addr field must be interpreted based on the
specified layout type.
This document defines the device address for the NFSv4 file layout
(struct pnfs_netaddr4), which identifies a storage device by network
IP address and port number (similar to struct clientaddr4). This is
sufficient for the clients to communicate with the NFSv4 storage
devices, and may be sufficient for other layout types as well.
Device types for object storage devices and block storage devices
(e.g., SCSI volume labels) will be defined by their respective layout
specifications.
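As a non-normative illustration, the following sketch (in Python, with a
hypothetical helper name) shows how a client might form the r_netid and
r_addr values of pnfs_netaddr4 for an NFSv4 storage device, assuming the
universal address convention used for clientaddr4, in which a TCP/IPv4
address takes the form "h1.h2.h3.h4.p1.p2" with the port split into its
high-order and low-order octets.

   # Non-normative sketch: build a pnfs_netaddr4-style address for an
   # NFSv4 storage device.  The helper name is hypothetical; the "tcp"
   # netid and universal address form are assumed from the base NFSv4
   # clientaddr4 conventions.
   def make_netaddr(ip, port):
       p1, p2 = (port >> 8) & 0xFF, port & 0xFF   # high and low port octets
       return {"r_netid": "tcp",
               "r_addr": "%s.%d.%d" % (ip, p1, p2)}

   # A data server at 192.0.2.10, port 2049, yields
   # {'r_netid': 'tcp', 'r_addr': '192.0.2.10.8.1'}
   addr = make_netaddr("192.0.2.10", 2049)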
1.2.21. pnfs_layout4 1.2.21. pnfs_layout4
struct pnfs_layout4 { struct pnfs_layout4 {
offset4 offset; offset4 offset;
length4 length; length4 length;
pnfs_layoutiomode4 iomode; pnfs_layoutiomode4 iomode;
pnfs_layouttype4 type; pnfs_layouttype4 type;
opaque layout<>; opaque layout<>;
}; };
skipping to change at page 18, line 29 skipping to change at page 18, line 18
layout types this could include the list of reserved blocks that were layout types this could include the list of reserved blocks that were
written. The contents of the opaque layoutupdate_data argument are written. The contents of the opaque layoutupdate_data argument are
determined by the layout type and are defined in their context. The determined by the layout type and are defined in their context. The
NFSv4 file-based layout does not use this structure, thus the NFSv4 file-based layout does not use this structure, thus the
layoutupdate_data field should have a zero length. layoutupdate_data field should have a zero length.
1.2.23. layouthint4 1.2.23. layouthint4
struct pnfs_layouthint4 { struct pnfs_layouthint4 {
pnfs_layouttype4 type; pnfs_layouttype4 type;
opaque layouthint_data<>; opaque layouthint_data<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute It is the structure specified by the FILE_LAYOUT_HINT attribute
described below. The metadata server may ignore the hint, or may described below. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 16.1. structure as defined in Section 16.1.
skipping to change at page 23, line 10 skipping to change at page 23, line 5
which none of the bits specified below are set. which none of the bits specified below are set.
FH4_VOLATILE_ANY The filehandle may expire at any time, except as FH4_VOLATILE_ANY The filehandle may expire at any time, except as
specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN). specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN).
FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set.
If this bit is set, then the meaning of FH4_VOLATILE_ANY is If this bit is set, then the meaning of FH4_VOLATILE_ANY is
qualified to exclude any expiration of the filehandle when it is qualified to exclude any expiration of the filehandle when it is
open. open.
FH4_VOL_MIGRATION The filehandle will expire as a result of FH4_VOL_MIGRATION The filehandle will expire as a result of a file
migration. If FH4_VOL_ANY is set, FH4_VOL_MIGRATION is redundant. system transition (migration or replication), in those cases in
which the continuity of filehandle use is not specified by
_handle_ class information within the fs_locations_info attribute.
When this bit is set, clients without access to fs_locations_info
information should assume filehandles will expire on file system
transitions.
FH4_VOL_RENAME The filehandle will expire during rename. This FH4_VOL_RENAME The filehandle will expire during rename. This
includes a rename by the requesting client or a rename by any includes a rename by the requesting client or a rename by any
other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant.
Servers which provide volatile filehandles that may expire while open Servers which provide volatile filehandles that may expire while open
(i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if (i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if
FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set), should FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set), should
deny a RENAME or REMOVE that would affect an OPEN file of any of the deny a RENAME or REMOVE that would affect an OPEN file of any of the
components leading to the OPEN file. In addition, the server should components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period upon deny all RENAME or REMOVE requests during the grace period upon
server restart. server restart.
Note that the bits FH4_VOL_MIGRATION and FH4_VOL_RENAME allow the Servers which provide volatile filehandles that may expire while open
client to determine that expiration has occurred whenever a specific require special care as regards handling of RENAMEs and REMOVEs.
event occurs, without an explicit filehandle expiration error from This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
the server. FH4_VOL_ANY does not provide this form of information. set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set,
In situations where the server will expire many, but not all or if a non-readonly file system has a transition target in a
filehandles upon migration (e.g. all but those that are open), different _handle_ class. In these cases, the server should deny a
FH4_VOLATILE_ANY (in this case with FH4_NOEXPIRE_WITH_OPEN) is a RENAME or REMOVE that would affect an OPEN file of any of the
better choice since the client may not assume that all filehandles components leading to the OPEN file. In addition, the server should
will expire when migration occurs, and it is likely that additional deny all RENAME or REMOVE requests during the grace period, in order
expirations will occur (as a result of file CLOSE) that are separated to make sure that reclaims of files whose filehandles may have
in time from the migration event itself. expired are not applied to the wrong file.
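As a non-normative illustration, the sketch below shows how a client
might test the fh_expire_type bits for the "may expire while open"
condition described above.  The numeric bit values are assumed to match
those of the base NFSv4 protocol and are included only to make the
example self-contained.

   # Non-normative sketch: interpreting fh_expire_type on the client.
   # Bit values are assumed from the base NFSv4 definitions.
   FH4_PERSISTENT         = 0x00000000   # no volatility bits set
   FH4_NOEXPIRE_WITH_OPEN = 0x00000001
   FH4_VOLATILE_ANY       = 0x00000002
   FH4_VOL_MIGRATION      = 0x00000004
   FH4_VOL_RENAME         = 0x00000008

   def may_expire_while_open(fh_expire_type):
       # Mirrors the condition above: volatility on migration or rename,
       # or general volatility that is not excluded for open files.
       if fh_expire_type & (FH4_VOL_MIGRATION | FH4_VOL_RENAME):
           return True
       return bool(fh_expire_type & FH4_VOLATILE_ANY) and \
              not (fh_expire_type & FH4_NOEXPIRE_WITH_OPEN)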
2.3. One Method of Constructing a Volatile Filehandle 2.3. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client, could contain: A volatile filehandle, while opaque to the client, could contain:
[volatile bit = 1 | server boot time | slot | generation number] [volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/ o generation number is the generation number for the table entry/
skipping to change at page 38, line 12 skipping to change at page 38, line 12
| | | | | for file | | | | | | for file |
| | | | | layout. | | | | | | layout. |
| layout_blksize | TBD | uint32_t | READ | Preferred | | layout_blksize | TBD | uint32_t | READ | Preferred |
| | | | | block size for | | | | | | block size for |
| | | | | layout related | | | | | | layout related |
| | | | | I/O. | | | | | | I/O. |
| layout_alignment | TBD | uint32_t | READ | Preferred | | layout_alignment | TBD | uint32_t | READ | Preferred |
| | | | | alignment for | | | | | | alignment for |
| | | | | layout related | | | | | | layout related |
| | | | | I/O. | | | | | | I/O. |
| fs_absent | TBD | bool | READ | Is current |
| | | | | filesystem |
| | | | | present or |
| | | | | absent. |
| fs_locations_info | TBD | | READ | Full function |
| | | | | filesystem |
| | | | | location. |
| fs_status | TBD | fs4_status | READ | Generic |
| | | | | filesystem |
| | | | | type |
| | | | | information. |
| | TBD | | READ | desc | | | TBD | | READ | desc |
| | TBD | | READ | desc | | | TBD | | READ | desc |
+--------------------+-----+--------------+--------+----------------+ +--------------------+-----+--------------+--------+----------------+
3.7. Time Access 3.7. Time Access
As defined above, the time_access attribute represents the time of As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server. last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating The notion of what is an "access" depends on the server's operating
environment and/or the server's filesystem semantics. For example, environment and/or the server's filesystem semantics. For example,
skipping to change at page 69, line 25 skipping to change at page 69, line 25
longer in group "staff". User "bob" logs in to the system again, and longer in group "staff". User "bob" logs in to the system again, and
thus more processes are created, this time owned by "bob" but NOT in thus more processes are created, this time owned by "bob" but NOT in
group "staff". group "staff".
A mode of 0770 is inaccurate for processes not belonging to group A mode of 0770 is inaccurate for processes not belonging to group
"staff". But even if the mode of the file were proactively changed "staff". But even if the mode of the file were proactively changed
to 0070 at the time the group database was edited, mode 0070 would be to 0070 at the time the group database was edited, mode 0070 would be
inaccurate for the pre-existing processes owned by user "bob" and inaccurate for the pre-existing processes owned by user "bob" and
having membership in group "staff". having membership in group "staff".
4. Filesystem Migration and Replication 4. Single-server Name Space
With the use of the recommended attribute "fs_locations", the NFS
version 4 server has a method of providing filesystem migration or
replication services. For the purposes of migration and replication,
a filesystem will be defined as all files that share a given fsid
(both major and minor values are the same).
The fs_locations attribute provides a list of filesystem locations.
These locations are specified by providing the server name (either
DNS domain or IP address) and the path name representing the root of
the filesystem. Depending on the type of service being provided, the
list will provide a new location or a set of alternate locations for
the filesystem. The client will use this information to redirect its
requests to the new server.
4.1. Replication
It is expected that filesystem replication will be used in the case
of read-only data. Typically, the filesystem will be replicated on
two or more servers. The fs_locations attribute will provide the
list of these locations to the client. On first access of the
filesystem, the client should obtain the value of the fs_locations
attribute. If, in the future, the client finds the server
unresponsive, the client may attempt to use another server specified
by fs_locations.
If applicable, the client must take the appropriate steps to recover
valid filehandles from the new server. This is described in more
detail in the following sections.
4.2. Migration
Filesystem migration is used to move a filesystem from one server to
another. Migration is typically used for a filesystem that is
writable and has a single copy. The expected use of migration is for
load balancing or general resource reallocation. The protocol does
not specify how the filesystem will be moved between servers. This
server-to-server transfer mechanism is left to the server
implementor. However, the method used to communicate the migration
event between client and server is specified here.
Once the servers participating in the migration have completed the
move of the filesystem, the error NFS4ERR_MOVED will be returned for
subsequent requests received by the original server. The
NFS4ERR_MOVED error is returned for all operations except PUTFH and
GETATTR. Upon receiving the NFS4ERR_MOVED error, the client will
obtain the value of the fs_locations attribute. The client will then
use the contents of the attribute to redirect its requests to the
specified server. To facilitate the use of GETATTR, operations such
as PUTFH must also be accepted by the server for the migrated file
system's filehandles. Note that if the server returns NFS4ERR_MOVED,
the server MUST support the fs_locations attribute.
If the client requests more attributes than just fs_locations, the
server may return fs_locations only. This is to be expected since
the server has migrated the filesystem and may not have a method of
obtaining additional attribute data.
The server implementor needs to be careful in developing a migration
solution. The server must consider all of the state information
clients may have outstanding at the server. This includes but is not
limited to locking/share state, delegation state, and asynchronous
file writes which are represented by WRITE and COMMIT verifiers. The
server should strive to minimize the impact on its clients during and
after the migration process.
4.3. Interpretation of the fs_locations Attribute
The fs_location attribute is structured in the following way:
struct fs_location {
utf8str_cis server<>;
pathname4 rootpath;
};
struct fs_locations {
pathname4 fs_root;
fs_location locations<>;
};
The fs_location struct is used to represent the location of a
filesystem by providing a server name and the path to the root of the
filesystem. For a multi-homed server or a set of servers that use
the same rootpath, an array of server names may be provided. An
entry in the server array is an UTF8 string and represents one of a
traditional DNS host name, IPv4 address, or IPv6 address. It is not
a requirement that all servers that share the same rootpath be listed
in one fs_location struct. The array of server names is provided for
convenience. Servers that share the same rootpath may also be listed
in separate fs_location entries in the fs_locations attribute.
The fs_locations struct and attribute then contains an array of
locations. Since the name space of each server may be constructed
differently, the "fs_root" field is provided. The path represented
by fs_root represents the location of the filesystem in the server's
name space. Therefore, the fs_root path is only associated with the
server from which the fs_locations attribute was obtained. The
fs_root path is meant to aid the client in locating the filesystem at
the various servers listed.
As an example, there is a replicated filesystem located at two
servers (servA and servB). At servA the filesystem is located at
path "/a/b/c". At servB the filesystem is located at path "/x/y/z".
In this example the client accesses the filesystem first at servA
with a multi-component lookup path of "/a/b/c/d". Since the client
used a multi-component lookup to obtain the filehandle at "/a/b/c/d",
it is unaware that the filesystem's root is located in servA's name
space at "/a/b/c". When the client switches to servB, it will need
to determine that the directory it first referenced at servA is now
represented by the path "/x/y/z/d" on servB. To facilitate this, the
fs_locations attribute provided by servA would have a fs_root value
of "/a/b/c" and two entries in fs_location. One entry in fs_location
will be for itself (servA) and the other will be for servB with a
path of "/x/y/z". With this information, the client is able to
substitute "/x/y/z" for the "/a/b/c" at the beginning of its access
path and construct "/x/y/z/d" to use for the new server.
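As a non-normative illustration, the sketch below encodes the
substitution just described: given the fs_root reported by the current
server, the path the client has been using, and the rootpath of an
alternate location, it yields the path to present to the new server.
The helper name is hypothetical and paths are modeled as component
lists, as in pathname4.

   # Non-normative sketch: translate a client path across fs_locations
   # entries by replacing the fs_root prefix with the new rootpath.
   def translate_path(fs_root, client_path, new_rootpath):
       if client_path[:len(fs_root)] != fs_root:
           raise ValueError("path is not within the located filesystem")
       remainder = client_path[len(fs_root):]        # e.g. ["d"]
       return new_rootpath + remainder

   # servA reports fs_root "/a/b/c"; servB's rootpath is "/x/y/z".
   # The client's "/a/b/c/d" therefore becomes "/x/y/z/d" on servB.
   assert translate_path(["a", "b", "c"], ["a", "b", "c", "d"],
                         ["x", "y", "z"]) == ["x", "y", "z", "d"]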
See the section "Security Considerations" for a discussion on the
recommendations for the security flavor to be used by any GETATTR
operation that requests the "fs_locations" attribute.
4.4. Filehandle Recovery for Migration or Replication
Filehandles for filesystems that are replicated or migrated generally
have the same semantics as for filesystems that are not replicated or
migrated. For example, if a filesystem has persistent filehandles
and it is migrated to another server, the filehandle values for the
filesystem will be valid at the new server.
For volatile filehandles, the servers involved likely do not have a
mechanism to transfer filehandle format and content between
themselves. Therefore, a server may have difficulty in determining
if a volatile filehandle from an old server should return an error of
NFS4ERR_FHEXPIRED. Therefore, the client is informed, with the use
of the fh_expire_type attribute, whether volatile filehandles will
expire at the migration or replication event. If the bit
FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client
must treat the volatile filehandle as if the server had returned the
NFS4ERR_FHEXPIRED error. At the migration or replication event in
the presence of the FH4_VOL_MIGRATION bit, the client will not
present the original or old volatile filehandle to the new server.
The client will start its communication with the new server by
recovering its filehandles using the saved file names.
5. NFS Server Name Space This chapter describes the NFSv4 single-server name space. Single-
server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described
in Section 10.
5.1. Server Exports 4.1. Server Exports
On a UNIX server the name space describes all the files reachable by On a UNIX server, the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped the name space constitutes all the files on disks named by mapped
disk letters. NFS server administrators rarely make the entire disk letters. NFS server administrators rarely make the entire
server's filesystem name space available to NFS clients. More often server's filesystem name space available to NFS clients. More often
portions of the name space are made available via an "export" portions of the name space are made available via an "export"
feature. In previous versions of the NFS protocol, the root feature. In previous versions of the NFS protocol, the root
filehandle for each export is obtained through the MOUNT protocol; filehandle for each export is obtained through the MOUNT protocol;
the client sends a string that identifies the export of name space the client sends a string that identifies the export of name space
and the server returns the root filehandle for it. The MOUNT and the server returns the root filehandle for it. The MOUNT
protocol supports an EXPORTS procedure that will enumerate the protocol supports an EXPORTS procedure that will enumerate the
server's exports. server's exports.
5.2. Browsing Exports 4.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for these exports via a multi-component can use to obtain filehandles for the exports of a particular server,
LOOKUP. A common user experience is to use a graphical user via a series of LOOKUP operations within a COMPOUND, to traverse a
interface (perhaps a file "Open" dialog window) to find a file via path. A common user experience is to use a graphical user interface
progressive browsing through a directory tree. The client must be (perhaps a file "Open" dialog window) to find a file via progressive
able to move from one export to another export via single-component, browsing through a directory tree. The client must be able to move
progressive LOOKUP operations. from one export to another export via single-component, progressive
LOOKUP operations.
This style of browsing is not well supported by the NFS version 2 and This style of browsing is not well supported by the NFS version 2 and
3 protocols. The client expects all LOOKUP operations to remain 3 protocols. The client expects all LOOKUP operations to remain
within a single server filesystem. For example, the device attribute within a single server filesystem. For example, the device attribute
will not change. This prevents a client from taking name space paths will not change. This prevents a client from taking name space paths
that span exports. that span exports.
An automounter on the client can obtain a snapshot of the server's An automounter on the client can obtain a snapshot of the server's
name space using the EXPORTS procedure of the MOUNT protocol. If it name space using the EXPORTS procedure of the MOUNT protocol. If it
understands the server's pathname syntax, it can create an image of understands the server's pathname syntax, it can create an image of
the server's name space on the client. The parts of the name space the server's name space on the client. The parts of the name space
that are not exported by the server are filled in with a "pseudo that are not exported by the server are filled in with a "pseudo
filesystem" that allows the user to browse from one mounted filesystem" that allows the user to browse from one mounted
filesystem to another. There is a drawback to this representation of filesystem to another. There is a drawback to this representation of
the server's name space on the client: it is static. If the server the server's name space on the client: it is static. If the server
administrator adds a new export the client will be unaware of it. administrator adds a new export the client will be unaware of it.
5.3. Server Pseudo Filesystem 4.3. Server Pseudo Filesystem
NFS version 4 servers avoid this name space inconsistency by NFS version 4 servers avoid this name space inconsistency by
presenting all the exports within the framework of a single server presenting all the exports for a given server within the framework of
name space. An NFS version 4 client uses LOOKUP and READDIR a single namespace, for that server. An NFS version 4 client uses
operations to browse seamlessly from one export to another. Portions LOOKUP and READDIR operations to browse seamlessly from one export to
of the server name space that are not exported are bridged via a another. Portions of the server name space that are not exported are
"pseudo filesystem" that provides a view of exported directories bridged via a "pseudo filesystem" that provides a view of exported
only. A pseudo filesystem has a unique fsid and behaves like a directories only. A pseudo filesystem has a unique fsid and behaves
normal, read only filesystem. like a normal, read only filesystem.
Based on the construction of the server's name space, it is possible Based on the construction of the server's name space, it is possible
that multiple pseudo filesystems may exist. For example, that multiple pseudo filesystems may exist. For example,
/a pseudo filesystem /a pseudo filesystem
/a/b real filesystem /a/b real filesystem
/a/b/c pseudo filesystem /a/b/c pseudo filesystem
/a/b/c/d real filesystem /a/b/c/d real filesystem
Each of the pseudo filesystems are considered separate entities and Each of the pseudo filesystems are considered separate entities and
therefore will have a unique fsid. therefore will have its own unique fsid.
5.4. Multiple Roots 4.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as The DOS and Windows operating environments are sometimes described as
having "multiple roots". Filesystems are commonly represented as having "multiple roots". Filesystems are commonly represented as
disk letters. MacOS represents filesystems as top level names. NFS disk letters. MacOS represents filesystems as top level names. NFS
version 4 servers for these platforms can construct a pseudo file version 4 servers for these platforms can construct a pseudo file
system above these root names so that disk letters or volume names system above these root names so that disk letters or volume names
are simply directory names in the pseudo root. are simply directory names in the pseudo root.
5.5. Filehandle Volatility 4.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical The nature of the server's pseudo filesystem is that it is a logical
representation of filesystem(s) available from the server. representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed Therefore, the pseudo filesystem is most likely constructed
dynamically when the server is first instantiated. It is expected dynamically when the server is first instantiated. It is expected
that the pseudo filesystem may not have an on disk counterpart from that the pseudo filesystem may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the preferable that the server provide persistent filehandles for the
pseudo filesystem, the NFS client should expect that pseudo file pseudo filesystem, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g. with a multi-component prepared to recover a filehandle value (e.g. with a series of LOOKUP
LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED. operations) when receiving an error of NFS4ERR_FHEXPIRED.
5.6. Exported Root 4.6. Exported Root
If the server's root filesystem is exported, one might conclude that If the server's root filesystem is exported, one might conclude that
a pseudo-filesystem is not needed. This would be wrong. Assume the a pseudo-filesystem is unneeded. This is not necessarily so. Assume
following filesystems on a server: the following filesystems on a server:
/ disk1 (exported) / disk1 (exported)
/a disk2 (not exported) /a disk2 (not exported)
/a/b disk3 (exported) /a/b disk3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple Because disk2 is not exported, disk3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-filesystem. LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
5.7. Mount Point Crossing 4.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way The server filesystem environment may be constructed in such a way
that one filesystem contains a directory which is 'covered' or that one filesystem contains a directory which is 'covered' or
mounted upon by a second filesystem. For example: mounted upon by a second filesystem. For example:
/a/b (filesystem 1) /a/b (filesystem 1)
/a/b/c/d (filesystem 2) /a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look The pseudo filesystem for this server may be constructed to look
like: like:
skipping to change at page 75, line 16 skipping to change at page 72, line 20
It is the server's responsibility to present the pseudo filesystem It is the server's responsibility to present the pseudo filesystem
that is complete to the client. If the client sends a lookup request that is complete to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of for the path "/a/b/c/d", the server's response is the filehandle of
the filesystem "/a/b/c/d". In previous versions of the NFS protocol, the filesystem "/a/b/c/d". In previous versions of the NFS protocol,
the server would respond with the filehandle of directory "/a/b/c/d" the server would respond with the filehandle of directory "/a/b/c/d"
within the filesystem "/a/b". within the filesystem "/a/b".
The NFS client will be able to determine if it crosses a server mount The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute. point by a change in the value of the "fsid" attribute.
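A non-normative sketch of that check follows; the fsid is modeled as a
(major, minor) pair and the values shown are arbitrary.

   # Non-normative sketch: detect a server mount point crossing by
   # comparing the fsid attribute before and after a LOOKUP.
   def crossed_mount_point(fsid_before, fsid_after):
       # Any change in (major, minor) means the LOOKUP moved into a
       # different filesystem, real or pseudo.
       return fsid_before != fsid_after

   assert crossed_mount_point((17, 0), (42, 0)) is True
   assert crossed_mount_point((17, 0), (17, 0)) is False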
5.8. Security Policy and Name Space Presentation 4.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully The application of the server's security policy needs to be carefully
considered by the implementor. One may choose to limit the considered by the implementor. One may choose to limit the
viewability of portions of the pseudo filesystem based on the viewability of portions of the pseudo filesystem based on the
server's perception of the client's ability to authenticate itself server's perception of the client's ability to authenticate itself
properly. However, with the support of multiple security mechanisms properly. However, with the support of multiple security mechanisms
and the ability to negotiate the appropriate use of these mechanisms, and the ability to negotiate the appropriate use of these mechanisms,
the server is unable to properly determine if a client will be able the server is unable to properly determine if a client will be able
to authenticate itself. If, based on its policies, the server to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo filesystem, the server chooses to limit the contents of the pseudo filesystem, the server
skipping to change at page 76, line 5 skipping to change at page 73, line 5
The security policy for /a/b/c is Kerberos with integrity. The The security policy for /a/b/c is Kerberos with integrity. The
server should apply the same security policy to /, /a, and /a/b. server should apply the same security policy to /, /a, and /a/b.
This allows for the extension of the protection of the server's This allows for the extension of the protection of the server's
namespace to the ancestors of the real shared resource. namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of server's namespace should be the union of all security mechanisms of
all direct descendants. all direct descendants.
6. File Locking and Share Reservations 5. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of share reservations the protocol stateful. With the inclusion of share reservations the protocol
becomes substantially more dependent on state than the traditional becomes substantially more dependent on state than the traditional
combination of NFS and NLM [XNFS]. There are three components to combination of NFS and NLM [XNFS]. There are three components to
making this state manageable: making this state manageable:
o Clear division between client and server o Clear division between client and server
o Ability to reliably detect inconsistency in state between client o Ability to reliably detect inconsistency in state between client
skipping to change at page 76, line 39 skipping to change at page 73, line 39
protocol mechanisms used when a file is opened or created (LOOKUP, protocol mechanisms used when a file is opened or created (LOOKUP,
CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has
an OPEN operation that subsumes the NFS version 3 methodology of an OPEN operation that subsumes the NFS version 3 methodology of
LOOKUP, CREATE, and ACCESS. However, because many operations require LOOKUP, CREATE, and ACCESS. However, because many operations require
a filehandle, the traditional LOOKUP is preserved to map a file name a filehandle, the traditional LOOKUP is preserved to map a file name
to filehandle without establishing state on the server. The policy to filehandle without establishing state on the server. The policy
of granting access or modifying files is managed by the server based of granting access or modifying files is managed by the server based
on the client's state. These mechanisms can implement policy ranging on the client's state. These mechanisms can implement policy ranging
from advisory only locking to full mandatory locking. from advisory only locking to full mandatory locking.
6.1. Locking 5.1. Locking
It is assumed that manipulating a lock is rare when compared to READ It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock information required to establish a lock and uniquely define the lock
owner. owner.
The following sections describe the transition from the heavy weight The following sections describe the transition from the heavy weight
information to the eventual stateid used for most client and server information to the eventual stateid used for most client and server
locking and lease interactions. locking and lease interactions.
6.1.1. Client ID 5.1.1. Client ID
For each LOCK request, the client must identify itself to the server. For each LOCK request, the client must identify itself to the server.
This is done in such a way as to allow for correct lock This is done in such a way as to allow for correct lock
identification and crash recovery. A sequence of a SETCLIENTID identification and crash recovery. A sequence of a SETCLIENTID
operation followed by a SETCLIENTID_CONFIRM operation is required to operation followed by a SETCLIENTID_CONFIRM operation is required to
establish the identification onto the server. Establishment of establish the identification onto the server. Establishment of
identification by a new incarnation of the client also has the effect identification by a new incarnation of the client also has the effect
of immediately breaking any leased state that a previous incarnation of immediately breaking any leased state that a previous incarnation
of the client might have had on the server, as opposed to forcing the of the client might have had on the server, as opposed to forcing the
new client incarnation to wait for the leases to expire. Breaking new client incarnation to wait for the leases to expire. Breaking
the lease state amounts to the server removing all lock, share the lease state amounts to the server removing all lock, share
reservation, and, where the server is not supporting the reservation, and, where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with CLAIM_DELEGATE_PREV claim type, all delegation state associated with
same client with the same identity. For discussion of delegation same client with the same identity. For discussion of delegation
state recovery, see the section "Delegation Recovery". state recovery, see the section "Delegation Recovery".
Client identification is encapsulated in the following structure: Client identification is encapsulated in the following structure:
struct nfs_client_id4 { struct nfs_client_id4 {
verifier4 verifier; verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>; opaque id<NFS4_OPAQUE_LIMIT>;
}; };
The first field, verifier is a client incarnation verifier that is The first field, verifier is a client incarnation verifier that is
used to detect client reboots. Only if the verifier is different used to detect client reboots. Only if the verifier is different
from that which the server has previously recorded for the client (as from that which the server has previously recorded for the client (as
identified by the second field of the structure, id) does the server identified by the second field of the structure, id) does the server
start the process of canceling the client's leased state. start the process of canceling the client's leased state.
The second field, id is a variable length string that uniquely The second field, id is a variable length string that uniquely
defines the client. defines the client.
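As a non-normative illustration, the sketch below shows one way a client
implementation might populate nfs_client_id4: a verifier derived from
the client's boot time, so that it changes with each client incarnation,
and an id string that stays the same across reboots of that client.  The
exact contents of the id string are an implementation choice and the
names used here are hypothetical.

   # Non-normative sketch: one possible construction of nfs_client_id4.
   import struct
   import time

   def make_client_id(client_name, server_addr, boot_time):
       # verifier4 is an 8-byte quantity; deriving it from boot time
       # makes it differ for each incarnation of the client.
       verifier = struct.pack(">Q", int(boot_time))
       # The id string must be the same each time this client restarts,
       # and distinct from the id used by any other client.
       client_id = ("%s/%s" % (client_name, server_addr)).encode("utf-8")
       return {"verifier": verifier, "id": client_id}

   cid = make_client_id("clientA.example.com", "203.0.113.5",
                        boot_time=time.time())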
skipping to change at page 79, line 44 skipping to change at page 76, line 44
The client must also employ the SETCLIENTID operation when it The client must also employ the SETCLIENTID operation when it
receives a NFS4ERR_STALE_STATEID error using a stateid derived from receives a NFS4ERR_STALE_STATEID error using a stateid derived from
its current clientid, since this also indicates a server reboot which its current clientid, since this also indicates a server reboot which
has invalidated the existing clientid (see the next section has invalidated the existing clientid (see the next section
"lock_owner and stateid Definition" for details). "lock_owner and stateid Definition" for details).
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
for a complete specification of the operations. for a complete specification of the operations.
6.1.2. Server Release of Clientid 5.1.2. Server Release of Clientid
If the server determines that the client holds no associated state If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the the client receives the appropriate error so that it will use the
SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity.
It should be clear that the server must be very hesitant to release a It should be clear that the server must be very hesitant to release a
skipping to change at page 80, line 29 skipping to change at page 77, line 29
that changes security flavors, and under the new flavor, there is no that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE. NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID for a client id In that event, when the server gets a SETCLIENTID for a client id
that currently has no state, or it has state, but the lease has that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the SETCLIENTID, and confirm the new clientid if followed by allow the SETCLIENTID, and confirm the new clientid if followed by
the appropriate SETCLIENTID_CONFIRM. the appropriate SETCLIENTID_CONFIRM.
6.1.3. lock_owner and stateid Definition 5.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the When requesting a lock, the client must present to the server the
clientid and an identifier for the owner of the requested lock. clientid and an identifier for the owner of the requested lock.
These two fields are referred to as the lock_owner and the definition These two fields are referred to as the lock_owner and the definition
of those fields is: of those fields is:
o A clientid returned by the server as part of the client's use of o A clientid returned by the server as part of the client's use of
the SETCLIENTID operation. the SETCLIENTID operation.
o A variable length opaque array used to uniquely define the owner o A variable length opaque array used to uniquely define the owner
skipping to change at page 82, line 5 skipping to change at page 79, line 5
o utilize the "seqid" field of each stateid, such that seqid is o utilize the "seqid" field of each stateid, such that seqid is
monotonically incremented for each stateid that is associated with monotonically incremented for each stateid that is associated with
the same index into the locking-state table. the same index into the locking-state table.
By matching the incoming stateid and its field values with the state By matching the incoming stateid and its field values with the state
held at the server, the server is able to easily determine if a held at the server, the server is able to easily determine if a
stateid is valid for its current instantiation and state. If the stateid is valid for its current instantiation and state. If the
stateid is not valid, the appropriate error can be supplied to the stateid is not valid, the appropriate error can be supplied to the
client. client.
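A non-normative sketch of such a check follows.  The stateid is modeled
here as a (boot verifier, table index, seqid) triple purely for
illustration; the error names are those defined by the protocol,
returned here as strings for brevity.

   # Non-normative sketch: server-side validation of an incoming stateid
   # against the locking-state table.  Field layout is illustrative only.
   def check_stateid(stateid, server_boot, state_table):
       boot, index, seqid = stateid
       if boot != server_boot:
           return "NFS4ERR_STALE_STATEID"   # issued by a prior server instance
       entry = state_table.get(index)
       if entry is None:
           return "NFS4ERR_BAD_STATEID"     # no such locking-state entry
       if seqid < entry["seqid"]:
           return "NFS4ERR_OLD_STATEID"     # superseded by a newer stateid
       if seqid > entry["seqid"]:
           return "NFS4ERR_BAD_STATEID"     # seqid the server never issued
       return "NFS4_OK"

   table = {7: {"seqid": 3}}
   assert check_stateid((1234, 7, 3), 1234, table) == "NFS4_OK"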
6.1.4. Use of the stateid and Locking 5.1.4. Use of the stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. explicitly mentioned in the text.
If the lock_owner performs a READ or WRITE in a situation in which it If the lock_owner performs a READ or WRITE in a situation in which it
has established a lock or share reservation on the server (any OPEN has established a lock or share reservation on the server (any OPEN
skipping to change at page 84, line 14 skipping to change at page 81, line 14
A lock may not be granted while a READ or WRITE operation using one A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above. WRITE as discussed above.
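A non-normative sketch of this conflict test follows; byte ranges are
modeled as (offset, length) pairs and the names used are illustrative.

   # Non-normative sketch: does a lock request conflict with an
   # in-progress READ or WRITE that used a special stateid?
   def ranges_overlap(a, b):
       a_off, a_len = a
       b_off, b_len = b
       return a_off < b_off + b_len and b_off < a_off + a_len

   def conflicts(lock_exclusive, lock_range, io_is_write, io_range):
       if not ranges_overlap(lock_range, io_range):
           return False
       # A shared lock conflicts only with a WRITE (or size-setting
       # SETATTR); an exclusive lock conflicts with a READ as well.
       return io_is_write or lock_exclusive

   # A shared lock over bytes 0-99 conflicts with a WRITE to bytes 50-149.
   assert conflicts(False, (0, 100), True, (50, 100)) is True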
6.1.5. Sequencing of Lock Requests 5.1.5. Sequencing of Lock Requests
Locking is different than most NFS operations as it requires "at- Locking is different than most NFS operations as it requires "at-
most-one" semantics that are not provided by ONCRPC. ONCRPC over a most-one" semantics that are not provided by ONCRPC. ONCRPC over a
reliable transport is not sufficient because a sequence of locking reliable transport is not sufficient because a sequence of locking
requests may span multiple TCP connections. In the face of requests may span multiple TCP connections. In the face of
retransmission or reordering, lock or unlock requests must have a retransmission or reordering, lock or unlock requests must have a
well defined and consistent behavior. To accomplish this, each lock well defined and consistent behavior. To accomplish this, each lock
request contains a sequence number that is a consecutively increasing request contains a sequence number that is a consecutively increasing
integer. Different lock_owners have different sequences. The server integer. Different lock_owners have different sequences. The server
maintains the last sequence number (L) received and the response that maintains the last sequence number (L) received and the response that
skipping to change at page 85, line 14 skipping to change at page 82, line 14
The client MUST monotonically increment the sequence number for the The client MUST monotonically increment the sequence number for the
CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
operations. This is true even in the event that the previous operations. This is true even in the event that the previous
operation that used the sequence number received an error. The only operation that used the sequence number received an error. The only
exception to this rule is if the previous operation received one of exception to this rule is if the previous operation received one of
the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE. NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
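As a non-normative illustration, the sketch below shows one common way a
server might process a sequenced request for a given lock_owner,
consistent with the description above; a production server would
additionally verify that a retransmission is identical to the cached
request and would account for sequence number wraparound.

   # Non-normative sketch: per-lock_owner at-most-once handling of a
   # sequenced request (LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, CLOSE).
   def handle_sequenced_request(owner_state, seqid, request, execute):
       last = owner_state["last_seqid"]
       if seqid == last:
           # Retransmission of the last request: return the cached reply
           # instead of re-executing the operation.
           return owner_state["reply"]
       if seqid != last + 1:
           return "NFS4ERR_BAD_SEQID"       # out of order or stale replay
       reply = execute(request)              # new request: perform it once
       owner_state["last_seqid"] = seqid
       owner_state["reply"] = reply
       return reply

   state = {"last_seqid": 4, "reply": "NFS4_OK"}
   assert handle_sequenced_request(state, 4, None, lambda r: "NFS4_OK") == "NFS4_OK"
   assert handle_sequenced_request(state, 6, None, lambda r: "NFS4_OK") == "NFS4ERR_BAD_SEQID"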
6.1.6. Recovery from Replayed Requests 5.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long As described above, the sequence number is per lock_owner. As long
as the server maintains the last sequence number received and follows as the server maintains the last sequence number received and follows
the methods described above, there are no risks of a Byzantine router the methods described above, there are no risks of a Byzantine router
re-sending old requests. The server need only maintain the re-sending old requests. The server need only maintain the
(lock_owner, sequence number) state as long as there are open files (lock_owner, sequence number) state as long as there are open files
or closed files with locks outstanding. or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
number and therefore the risk of the replay of these operations number and therefore the risk of the replay of these operations
resulting in undesired effects is non-existent while the server resulting in undesired effects is non-existent while the server
maintains the lock_owner state. maintains the lock_owner state.
6.1.7. Releasing lock_owner State 5.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking When a particular lock_owner no longer holds open or file locking
state at the server, the server may choose to release the sequence state at the server, the server may choose to release the sequence
number state associated with the lock_owner. The server may make number state associated with the lock_owner. The server may make
this choice based on lease expiration, for the reclamation of server this choice based on lease expiration, for the reclamation of server
memory, or other implementation specific details. In any event, the memory, or other implementation specific details. In any event, the
server is able to do this safely only when the lock_owner no longer server is able to do this safely only when the lock_owner no longer
is being utilized by the client. The server may choose to hold the is being utilized by the client. The server may choose to hold the
lock_owner state in the event that retransmitted requests are lock_owner state in the event that retransmitted requests are
received. However, the period to hold this state is implementation received. However, the period to hold this state is implementation
specific. specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner retransmitted after the server has previously released the lock_owner
state, the server will find that the lock_owner has no files open and state, the server will find that the lock_owner has no files open and
an error will be returned to the client. If the lock_owner does have an error will be returned to the client. If the lock_owner does have
a file open, the stateid will not match and again an error is a file open, the stateid will not match and again an error is
returned to the client. returned to the client.
6.1.8. Use of Open Confirmation 5.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being In the case that an OPEN is retransmitted and the lock_owner is being
used for the first time or the lock_owner state has been previously used for the first time or the lock_owner state has been previously
released by the server, the use of the OPEN_CONFIRM operation will released by the server, the use of the OPEN_CONFIRM operation will
prevent incorrect behavior. When the server observes the use of the prevent incorrect behavior. When the server observes the use of the
lock_owner for the first time, it will direct the client to perform lock_owner for the first time, it will direct the client to perform
the OPEN_CONFIRM for the corresponding OPEN. This sequence the OPEN_CONFIRM for the corresponding OPEN. This sequence
establishes the use of a lock_owner and associated sequence number. establishes the use of a lock_owner and associated sequence number.
Since the OPEN_CONFIRM sequence connects a new open_owner on the Since the OPEN_CONFIRM sequence connects a new open_owner on the
server with an existing open_owner on a client, the sequence number server with an existing open_owner on a client, the sequence number
skipping to change at page 87, line 5 skipping to change at page 84, line 5
Requiring open confirmation on reclaim-type opens is avoidable Requiring open confirmation on reclaim-type opens is avoidable
because of the nature of the environments in which such opens are because of the nature of the environments in which such opens are
done. For CLAIM_PREVIOUS opens, this is immediately after server done. For CLAIM_PREVIOUS opens, this is immediately after server
reboot, so there should be no time for lockowners to be created, reboot, so there should be no time for lockowners to be created,
found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we
are dealing with a client reboot situation. A server which supports are dealing with a client reboot situation. A server which supports
delegation can be sure that no lockowners for that client have been delegation can be sure that no lockowners for that client have been
recycled since client initialization and thus can ensure that recycled since client initialization and thus can ensure that
confirmation will not be required. confirmation will not be required.
6.2. Lock Ranges 5.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock. and then either upgrade or unlock a sub-range of the initial lock.
It is expected that this will be an uncommon type of request. In any It is expected that this will be an uncommon type of request. In any
case, servers or server filesystems may not be able to support sub- case, servers or server filesystems may not be able to support sub-
range lock semantics. In the event that a server receives a locking range lock semantics. In the event that a server receives a locking
request that represents a sub-range of current locking state for the request that represents a sub-range of current locking state for the
lock owner, the server is allowed to return the error lock owner, the server is allowed to return the error
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock
operations. Therefore, the client should be prepared to receive this operations. Therefore, the client should be prepared to receive this
skipping to change at page 87, line 28 skipping to change at page 84, line 28
The client is discouraged from combining multiple independent locking The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure. the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure. similar to the client's locking behavior prior to server failure.
6.3. Upgrading and Downgrading Locks 5.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application. appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the receive such errors and if appropriate, report the error to the
requesting application. requesting application.
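The client-side handling of these errors can be sketched as follows. This C fragment is illustrative only: nfs4_lock() stands in for whatever LOCK RPC interface an implementation provides, and the enum values are placeholders rather than the protocol's actual numeric error codes.

   /* Illustrative sketch: choose the LOCK type for an atomic upgrade or
    * downgrade and map the possible results. */
   enum locktype4 { READ_LT, WRITE_LT, READW_LT, WRITEW_LT };
   enum nfsstat4  { NFS4_OK, NFS4ERR_DENIED, NFS4ERR_DEADLOCK,
                    NFS4ERR_LOCK_NOTSUPP };

   static enum nfsstat4 nfs4_lock(enum locktype4 type)
   {
       (void)type;
       return NFS4_OK;             /* placeholder for the real RPC */
   }

   /* Returns 0 on success; -1 if the error must be reported to the
    * requesting application (or a fallback attempted). */
   static int change_lock(int upgrade_to_write, int willing_to_wait)
   {
       enum locktype4 type = upgrade_to_write
           ? (willing_to_wait ? WRITEW_LT : WRITE_LT)  /* upgrade   */
           : READ_LT;                                  /* downgrade */

       switch (nfs4_lock(type)) {
       case NFS4_OK:
           return 0;
       case NFS4ERR_LOCK_NOTSUPP:  /* atomic upgrade/downgrade unsupported */
       case NFS4ERR_DENIED:        /* conflicting lock exists              */
       case NFS4ERR_DEADLOCK:      /* WRITEW_LT request would deadlock     */
       default:
           return -1;
       }
   }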
6.4. Blocking Locks 5.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version Some clients require the support of blocking locks. The NFS version
4 protocol must not rely on a callback mechanism and therefore is 4 protocol must not rely on a callback mechanism and therefore is
unable to notify a client when a previously denied lock has been unable to notify a client when a previously denied lock has been
granted. Clients have no choice but to continually poll for the granted. Clients have no choice but to continually poll for the
lock. This presents a fairness problem. Two new lock types are lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first is released, the server may wait the lease period for the first
skipping to change at page 88, line 28 skipping to change at page 85, line 28
storage would be required to guarantee ordered granting of blocking storage would be required to guarantee ordered granting of blocking
locks. locks.
Servers may also note the lock types and delay returning denial of Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks. avoid the burden of needlessly frequent polling for blocking locks.
The server should take care with the length of this delay in the The server should take care with the length of this delay in the
event the client retransmits the request. event the client retransmits the request.
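A client that needs blocking-lock semantics therefore polls, as in the following illustrative C sketch; nfs4_lock_rpc() is a placeholder for the real LOCK operation, and the polling interval is an implementation choice (polling much faster than the lease period gains little, since that is roughly how long the server may hold a released lock for the first waiting client).

   /* Illustrative sketch: poll for a blocking lock using the
    * "willing to wait" lock types. */
   #include <unistd.h>

   enum locktype4 { READ_LT, WRITE_LT, READW_LT, WRITEW_LT };
   enum nfsstat4  { NFS4_OK, NFS4ERR_DENIED };

   static enum nfsstat4 nfs4_lock_rpc(enum locktype4 type)
   {
       (void)type;
       return NFS4ERR_DENIED;      /* placeholder for the real RPC */
   }

   static int acquire_blocking_lock(int want_write, unsigned max_polls,
                                    unsigned poll_interval_secs)
   {
       enum locktype4 type = want_write ? WRITEW_LT : READW_LT;
       unsigned i;

       for (i = 0; i < max_polls; i++) {
           if (nfs4_lock_rpc(type) == NFS4_OK)
               return 0;                  /* lock granted */
           sleep(poll_interval_secs);     /* denied: wait and try again */
       }
       return -1;                  /* give up; report to the application */
   }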
6.5. Lease Renewal 5.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired. renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for The following events cause implicit renewal of all of the leases for
a given client (i.e. all those sharing a given clientid). Each of a given client (i.e. all those sharing a given clientid). Each of
these is a positive indication that the client is still active and these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is that the associated state held at the server, for the client, is
skipping to change at page 89, line 24 skipping to change at page 86, line 24
renewal and in the worst case one RPC is required every lease period renewal and in the worst case one RPC is required every lease period
(i.e. a RENEW operation). The number of locks held by the client is (i.e. a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the not a factor since all state for the client is involved with the
lease renewal action. lease renewal action.
Since all operations that create a new lease also renew existing Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions. easily updated upon implicit lease renewal actions.
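A minimal server-side sketch of this bookkeeping, illustrative only: a single expiration time is kept per client and is simply pushed forward by any lease-renewing operation.

   #include <time.h>

   struct nfs_client_rec {
       time_t lease_expiry;        /* common expiry for all of this
                                      client's leased state */
   };

   /* Called for RENEW and for any operation that implicitly renews. */
   static void renew_lease(struct nfs_client_rec *clp, unsigned lease_secs)
   {
       clp->lease_expiry = time(NULL) + lease_secs;
   }

   static int lease_expired(const struct nfs_client_rec *clp)
   {
       return time(NULL) > clp->lease_expiry;
   }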
6.6. Crash Recovery 5.6. Crash Recovery
The important requirement in crash recovery is that both the client The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is and the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and client has successfully recovered the locks protecting the READ and
WRITE operations. WRITE operations.
6.6.1. Client Failure and Recovery 5.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's In the event that a client fails, the server may recover the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease period the client may be forced to wait the remainder of the lease
period before obtaining new locks. period before obtaining new locks.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client supplied verifier. This
skipping to change at page 90, line 16 skipping to change at page 87, line 16
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was all locks held which are associated with the old clientid which was
derived from the old verifier. derived from the old verifier.
Note that the verifier must have the same uniqueness properties as Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
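One common way to obtain such a verifier is to derive it from the client's boot or initialization time, as in the illustrative sketch below; any other scheme with the required uniqueness across client instantiations would serve equally well. The 8-byte size matches the protocol's verifier type.

   #include <stdint.h>
   #include <string.h>
   #include <time.h>

   #define VERIFIER_SIZE 8         /* size of an NFSv4 verifier */

   /* Illustrative sketch: build a per-boot instance verifier. */
   static void make_boot_verifier(unsigned char verifier[VERIFIER_SIZE])
   {
       uint64_t boot_time = (uint64_t)time(NULL);  /* captured at client start */

       memset(verifier, 0, VERIFIER_SIZE);
       memcpy(verifier, &boot_time, sizeof(boot_time));  /* both are 8 bytes */
   }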
6.6.2. Server Failure and Recovery 5.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re- or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re- establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not client. Likewise, if there is the possibility that clients have not
yet re-established their locking state for a file, the server must yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. The duration of disallow READ and WRITE operations for that file. The duration of
this recovery period is equal to the duration of the lease period. this recovery period is equal to the duration of the lease period.
skipping to change at page 92, line 5 skipping to change at page 89, line 5
A server may, upon restart, establish a new value for the lease A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established. previous server instance to be reliably re-established.
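An illustrative server-side sketch of the grace-period check implied above: during a grace period at least as long as the previous instance's lease period, requests that would establish new (non-reclaimed) locking state are refused with NFS4ERR_GRACE.

   #include <time.h>

   enum nfsstat4 { NFS4_OK, NFS4ERR_GRACE };

   struct server_instance {
       time_t   restart_time;
       unsigned prev_lease_secs;   /* lease period of the previous instance */
   };

   /* Reclaim requests are subject to further checks, elided here. */
   static enum nfsstat4 grace_check(const struct server_instance *sv,
                                    int is_reclaim)
   {
       int in_grace = time(NULL) < sv->restart_time +
                                   (time_t)sv->prev_lease_secs;

       if (in_grace && !is_reclaim)
           return NFS4ERR_GRACE;   /* client retries after the grace period */
       return NFS4_OK;
   }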
6.6.3. Network Partitions and Recovery 5.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a period provided by the server, the server will not have received a
lease renewal from the client. If this occurs, the server may free lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the all locks held for the client. As a result, all stateids held by the
client will become invalid or stale. Once the client is able to client will become invalid or stale. Once the client is able to
reach the server after such a network partition, all I/O submitted by reach the server after such a network partition, all I/O submitted by
the client with the now invalid stateids will fail with the server the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received, returning the error NFS4ERR_EXPIRED. Once this error is received,
the client will suitably notify the application that held the lock. the client will suitably notify the application that held the lock.
skipping to change at page 95, line 27 skipping to change at page 92, line 27
client could also inform the application that its record lock or client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See the lost, such as via a UNIX signal, a GUI pop-up window, etc. See the
section, "Data Caching and Revocation" for a discussion of what the section, "Data Caching and Revocation" for a discussion of what the
client should do for dealing with unreclaimed delegations on client client should do for dealing with unreclaimed delegations on client
state. state.
For further discussion of revocation of locks see the section "Server For further discussion of revocation of locks see the section "Server
Revocation of Locks". Revocation of Locks".
6.7. Recovery from a Lock Request Timeout or Abort 5.7. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not In the event a lock request times out, a client may decide to not
retry the request. The client may also abort the request when the retry the request. The client may also abort the request when the
process for which it was issued is terminated (e.g. in UNIX due to a process for which it was issued is terminated (e.g. in UNIX due to a
signal). It is possible though that the server received the request signal). It is possible though that the server received the request
and acted upon it. This would change the state on the server without and acted upon it. This would change the state on the server without
the client being aware of the change. It is paramount that the the client being aware of the change. It is paramount that the
client re-synchronize state with the server before it attempts any other client re-synchronize state with the server before it attempts any other
operation that takes a seqid and/or a stateid with the same operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special re- lock_owner. This is straightforward to do without a special re-
skipping to change at page 96, line 5 skipping to change at page 93, line 5
not receive a response. From this, the next time the client does a not receive a response. From this, the next time the client does a
lock operation for the lock_owner, it can send the cached request, if lock operation for the lock_owner, it can send the cached request, if
there is one, and if the request was one that established state (e.g. there is one, and if the request was one that established state (e.g.
a LOCK or OPEN operation), the server will return the cached result a LOCK or OPEN operation), the server will return the cached result
or, if it never saw the request, perform it. The client can follow up or, if it never saw the request, perform it. The client can follow up
with a request to remove the state (e.g. a LOCKU or CLOSE operation). with a request to remove the state (e.g. a LOCKU or CLOSE operation).
With this approach, the sequencing and stateid information on the With this approach, the sequencing and stateid information on the
client and server for the given lock_owner will re-synchronize and in client and server for the given lock_owner will re-synchronize and in
turn the lock state will re-synchronize. turn the lock state will re-synchronize.
6.8. Server Revocation of Locks 5.8. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and responsible for validating the state information between itself and
the server. Validating locking state for the client means that it the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held. must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re- The first instance of lock revocation is upon server reboot or re-
initialization. In this instance the client will receive an error initialization. In this instance the client will receive an error
skipping to change at page 97, line 12 skipping to change at page 94, line 12
ensure that a conflicting lock has not been granted. The client may ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the or a zero-length read, specifying the stateid associated with the
lock in question. If the response to the request is success, the lock in question. If the response to the request is success, the
client has validated all of the locks governed by that stateid and client has validated all of the locks governed by that stateid and
re-established the appropriate state between itself and the server. re-established the appropriate state between itself and the server.
If the I/O request is not successful, then one or more of the locks If the I/O request is not successful, then one or more of the locks
associated with the stateid was revoked by the server and the client associated with the stateid was revoked by the server and the client
must notify the owner. must notify the owner.
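The validation step can be sketched as below (illustrative only); nfs4_read_rpc() is a placeholder for the real READ operation issued with the stateid in question and a zero length.

   #include <stdint.h>

   enum nfsstat4 { NFS4_OK, NFS4ERR_EXPIRED, NFS4ERR_BAD_STATEID };

   struct stateid4 {
       uint32_t      seqid;
       unsigned char other[12];
   };

   static enum nfsstat4 nfs4_read_rpc(const struct stateid4 *sid,
                                      uint64_t offset, uint32_t count)
   {
       (void)sid; (void)offset; (void)count;
       return NFS4_OK;             /* placeholder for the real RPC */
   }

   /* Returns 1 if all locks governed by the stateid are still valid,
    * 0 if one or more were revoked and the owner must be notified. */
   static int validate_locks(const struct stateid4 *sid)
   {
       return nfs4_read_rpc(sid, 0, 0) == NFS4_OK;
   }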
6.9. Share Reservations 5.9. Share Reservations
A share reservation is a mechanism to control access to a file. It A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request. the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics: Pseudo-code definition of the semantics:
skipping to change at page 97, line 45 skipping to change at page 94, line 45
const OPEN4_SHARE_ACCESS_READ = 0x00000001; const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000; const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001; const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002; const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003; const OPEN4_SHARE_DENY_BOTH = 0x00000003;
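The conflict test these constants support can be sketched as follows (illustrative only; a share conflict on OPEN is reported with NFS4ERR_SHARE_DENIED): a new OPEN is refused if its requested access intersects the deny modes already in effect, or if its requested deny modes intersect the access already granted.

   enum nfsstat4 { NFS4_OK, NFS4ERR_INVAL, NFS4ERR_SHARE_DENIED };

   struct file_share_state {
       unsigned access;            /* union of granted OPEN4_SHARE_ACCESS_* */
       unsigned deny;              /* union of granted OPEN4_SHARE_DENY_*   */
   };

   static enum nfsstat4 check_share(const struct file_share_state *fs,
                                    unsigned req_access, unsigned req_deny)
   {
       if (req_access == 0)
           return NFS4ERR_INVAL;   /* an OPEN must request some access */
       if ((req_access & fs->deny) || (req_deny & fs->access))
           return NFS4ERR_SHARE_DENIED;
       return NFS4_OK;
   }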
6.10. OPEN/CLOSE Operations 5.10. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired operation to obtain the initial filehandle and indicate the desired
access and what if any access to deny. Even if the client intends to access and what if any access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny have a deny mode built into their open programming interfaces, deny
equal to NONE should be used. equal to NONE should be used.
skipping to change at page 98, line 27 skipping to change at page 95, line 27
failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the
CLOSE. CLOSE.
The LOOKUP operation will return a filehandle without establishing The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file will assume the client has the least access. For example, a file
opened with deny READ/WRITE cannot be accessed using a filehandle opened with deny READ/WRITE cannot be accessed using a filehandle
obtained through LOOKUP because it would not have a valid stateid obtained through LOOKUP because it would not have a valid stateid
(i.e. using a stateid of all bits 0 or all bits 1). (i.e. using a stateid of all bits 0 or all bits 1).
6.10.1. Close and Retention of State Information 5.10.1. Close and Retention of State Information
Since a CLOSE operation requests deallocation of a stateid, dealing Since a CLOSE operation requests deallocation of a stateid, dealing
with retransmission of the CLOSE may pose special difficulties, with retransmission of the CLOSE may pose special difficulties,
since the state information, which normally would be used to since the state information, which normally would be used to
determine the state of the open file being designated, might be determine the state of the open file being designated, might be
deallocated, resulting in an NFS4ERR_BAD_STATEID error. deallocated, resulting in an NFS4ERR_BAD_STATEID error.
Servers may deal with this problem in a number of ways. To provide Servers may deal with this problem in a number of ways. To provide
the greatest degree of assurance that the protocol is being used the greatest degree of assurance that the protocol is being used
properly, a server should, rather than deallocate the stateid, mark properly, a server should, rather than deallocate the stateid, mark
skipping to change at page 99, line 16 skipping to change at page 96, line 16
Servers may avoid this complexity, at the cost of less complete Servers may avoid this complexity, at the cost of less complete
protocol error checking, by simply responding NFS4_OK in the event of protocol error checking, by simply responding NFS4_OK in the event of
a CLOSE for a deallocated stateid, on the assumption that this case a CLOSE for a deallocated stateid, on the assumption that this case
must be caused by a retransmitted close. When adopting this must be caused by a retransmitted close. When adopting this
approach, it is desirable to at least log an error when returning a approach, it is desirable to at least log an error when returning a
no-error indication in this situation. If the server maintains a no-error indication in this situation. If the server maintains a
reply-cache mechanism, it can verify the CLOSE is indeed a reply-cache mechanism, it can verify the CLOSE is indeed a
retransmission and avoid error logging in most cases. retransmission and avoid error logging in most cases.
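One way to structure the more careful approach described above is sketched below (illustrative only): the stateid record is retained and marked closed rather than freed, so a retransmitted CLOSE is answered NFS4_OK while any other use of the stateid still fails.

   enum nfsstat4   { NFS4_OK, NFS4ERR_BAD_STATEID };
   enum sid_status { SID_OPEN, SID_CLOSED };

   struct stateid_rec {
       enum sid_status status;
       /* open-owner, file, access/deny state, etc. elided */
   };

   static enum nfsstat4 server_close(struct stateid_rec *sid)
   {
       if (sid->status == SID_CLOSED)
           return NFS4_OK;         /* treat as a retransmitted CLOSE */
       sid->status = SID_CLOSED;   /* retain the record for later
                                      deallocation instead of freeing it */
       return NFS4_OK;
   }

   static enum nfsstat4 server_other_op(struct stateid_rec *sid)
   {
       if (sid->status == SID_CLOSED)
           return NFS4ERR_BAD_STATEID;
       return NFS4_OK;             /* normal processing continues */
   }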
6.11. Open Upgrade and Downgrade 5.11. Open Upgrade and Downgrade
When an OPEN is done for a file and the lockowner for which the open When an OPEN is done for a file and the lockowner for which the open
is being done already has the file open, the result is to upgrade the is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed. Only a single deny bits for all of the OPEN requests completed. Only a single
CLOSE will be done to reset the effects of both OPENs. Note that the CLOSE will be done to reset the effects of both OPENs. Note that the
client, when issuing the OPEN, may not know that the same file is in client, when issuing the OPEN, may not know that the same file is in
skipping to change at page 99, line 47 skipping to change at page 96, line 47
When multiple open files on the client are merged into a single open When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and open file on the server. This is because the union of the access and
deny bits for the remaining opens may be smaller (i.e. a proper deny bits for the remaining opens may be smaller (i.e. a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to subset) than previously. The OPEN_DOWNGRADE operation is used to
make the necessary change and the client should use it to update the make the necessary change and the client should use it to update the
server so that share reservation requests by other clients are server so that share reservation requests by other clients are
handled properly. handled properly.
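An illustrative client-side sketch: after one of the merged opens is closed locally, the access and deny bits are recomputed as the union over the remaining opens and, if either shrank, an OPEN_DOWNGRADE is sent; with no opens remaining the client would send CLOSE instead. send_open_downgrade() is a placeholder for the real operation.

   struct local_open { unsigned access; unsigned deny; };

   static void send_open_downgrade(unsigned access, unsigned deny)
   {
       (void)access; (void)deny;   /* placeholder for the real RPC */
   }

   /* remaining[0..n-1] are the opens still in effect on the client
    * (n > 0); cur_access/cur_deny are what the server currently holds. */
   static void after_local_close(const struct local_open *remaining, int n,
                                 unsigned cur_access, unsigned cur_deny)
   {
       unsigned access = 0, deny = 0;
       int i;

       for (i = 0; i < n; i++) {
           access |= remaining[i].access;
           deny   |= remaining[i].deny;
       }
       if (access != cur_access || deny != cur_deny)
           send_open_downgrade(access, deny);
   }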
6.12. Short and Long Leases 5.12. Short and Long Leases
When determining the time period for the server lease, the usual When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased RENEW or READ (with zero length) recovery at a cost of increased RENEW or READ (with zero length)
requests. Longer leases are certainly kinder and gentler to servers requests. Longer leases are certainly kinder and gentler to servers
trying to handle very large numbers of clients. The number of RENEW trying to handle very large numbers of clients. The number of RENEW
requests drops in proportion to the lease time. The disadvantages of requests drops in proportion to the lease time. The disadvantages of
long leases are slower recovery after server failure (the server must long leases are slower recovery after server failure (the server must
wait for the leases to expire and the grace period to elapse before wait for the leases to expire and the grace period to elapse before
granting new lock requests) and increased file contention (if a granting new lock requests) and increased file contention (if a
client fails to transmit an unlock request, the server must wait for client fails to transmit an unlock request, the server must wait for
lease expiration before granting new locks). lease expiration before granting new locks).
Long leases are usable if the server is able to store lease state in Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue. its clients and therefore long leases would not be an issue.
6.13. Clocks, Propagation Delay, and Calculating Lease Expiration 5.13. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be the possibility that requests will be lost and need to be
retransmitted. retransmitted.
To take propagation delay into account, the client should subtract it To take propagation delay into account, the client should subtract it
skipping to change at page 100, line 43 skipping to change at page 97, line 43
before the lease would expire. before the lease would expire.
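The resulting client-side arithmetic can be sketched as follows (illustrative only): the lease is assumed to start when the request carrying the renewal was transmitted, and two estimated one-way propagation delays are subtracted so that the next renewal reaches the server before the lease runs out.

   #include <time.h>

   /* Latest time at which the next renewal should be transmitted.
    * one_way_delay is the client's estimate in seconds; a real
    * implementation would use a finer-grained clock than time_t. */
   static time_t renewal_deadline(time_t request_sent, unsigned lease_secs,
                                  double one_way_delay)
   {
       double usable = (double)lease_secs - 2.0 * one_way_delay;

       if (usable < 0)
           usable = 0;             /* degenerate case: renew immediately */
       return request_sent + (time_t)usable;
   }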
The server's lease period configuration should take into account the The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into resources. It is expected that the lease period will take into
account the network propagation delays and other network delay account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period. server's administrator may have to tune the lease period.
6.14. Migration, Replication and State 6. Client-Side Caching
When responsibility for handling a given file system is transferred
to a new server (migration) or the client chooses to use an alternate
server (e.g. in response to server unresponsiveness) in the context
of file system replication, the appropriate handling of state shared
between the client and server (i.e. locks, leases, stateids, and
clientids) is as described below. The handling differs between
migration and replication. For related discussion of file server
state and recovery of such, see the sections under "File Locking and
Share Reservations".
If a server replica or a server immigrating a filesystem agrees to, or
is expected to, accept opaque values from the client that originated
from another server, then it is a wise implementation practice for
the servers to encode the "opaque" values in network byte order.
This way, servers acting as replicas or immigrating filesystems will
be able to parse values like stateids, directory cookies,
filehandles, etc. even if their native byte order is different from
other servers cooperating in the replication and migration of the
filesystem.
6.14.1. Migration and State
In the case of migration, the servers involved in the migration of a
filesystem SHOULD transfer all server state from the original to the
new server. This must be done in a way that is transparent to the
client. This state transfer will ease the client's transition when a
filesystem migration occurs. If the servers are successful in
transferring all state, the client will continue to use stateids
assigned by the original server. Therefore the new server must
recognize these stateids as valid. This holds true for the clientid
as well. Since responsibility for an entire filesystem is
transferred with a migration event, there is no possibility that
conflicts will arise on the new server as a result of the transfer of
locks.
As part of the transfer of information between servers, leases would
be transferred as well. The leases being transferred to the new
server will typically have a different expiration time from those for
the same client, previously on the old server. To maintain the
property that all leases on a given server for a given client expire
at the same time, the server should advance the expiration time to
the later of the leases being transferred or the leases already
present. This allows the client to maintain lease renewal of both
classes without special effort.
The servers may choose not to transfer the state information upon
migration. However, this choice is discouraged. In this case, when
the client presents state information from the original server, the
client must be prepared to receive either NFS4ERR_STALE_CLIENTID or
NFS4ERR_STALE_STATEID from the new server. The client should then
recover its state information as it normally would in response to a
server failure. The new server must take care to allow for the
recovery of state information as it would in the event of server
restart.
6.14.2. Replication and State
Since client switch-over in the case of replication is not under
server control, the handling of state is different. In this case,
leases, stateids and clientids do not have validity across a
transition from one server to another. The client must re-establish
its locks on the new server. This can be compared to the re-
establishment of locks by means of reclaim-type requests after a
server reboot. The difference is that the server has no provision to
distinguish requests reclaiming locks from those obtaining new locks
or to defer the latter. Thus, a client re-establishing a lock on the
new server (by means of a LOCK or OPEN request), may have the
requests denied due to a conflicting lock. Since replication is
intended for read-only use of filesystems, such denial of locks
should not pose large difficulties in practice. When an attempt to
re-establish a lock on a new server is denied, the client should
treat the situation as if its original lock had been revoked.
6.14.3. Notification of Migrated Lease
In the case of lease renewal, the client may not be submitting
requests for a filesystem that has been migrated to another server.
This can occur because of the implicit lease renewal mechanism. The
client renews leases for all filesystems when submitting a request to
any one filesystem at the server.
In order for the client to schedule renewal of leases that may have
been relocated to the new server, the client must find out about
lease relocation before those leases expire. To accomplish this, all
operations which implicitly renew leases for a client (i.e. OPEN,
CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error
NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be
renewed has been transferred to a new server. This condition will
continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR(fs_locations) for an access to
each filesystem for which a lease has been moved to a new server.
When a client receives an NFS4ERR_LEASE_MOVED error, it should
perform an operation on each filesystem associated with the server in
question. When the client receives an NFS4ERR_MOVED error, the
client can follow the normal process to obtain the new server
information (through the fs_locations attribute) and perform renewal
of those leases on the new server. If the server has not had state
transferred to it transparently, the client will receive either
NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server,
as described above, and the client can then recover state information
as it does in the event of server failure.
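The client-side reaction can be sketched as follows (illustrative only); probe_fs() stands for any operation directed at a particular filesystem (and thus able to draw NFS4ERR_MOVED), and follow_fs_locations() stands for fetching fs_locations, contacting the new server, and renewing or recovering state there.

   enum nfsstat4 { NFS4_OK, NFS4ERR_MOVED, NFS4ERR_LEASE_MOVED };

   struct known_fs { const char *root; int moved; };

   static enum nfsstat4 probe_fs(const struct known_fs *fs)
   {
       (void)fs;
       return NFS4_OK;             /* placeholder: an operation on this fs */
   }

   static void follow_fs_locations(struct known_fs *fs)
   {
       fs->moved = 1;              /* placeholder: fetch fs_locations,
                                      rebind, renew or recover state */
   }

   /* Called when any operation returns NFS4ERR_LEASE_MOVED. */
   static void handle_lease_moved(struct known_fs *fss, int count)
   {
       int i;

       for (i = 0; i < count; i++)
           if (probe_fs(&fss[i]) == NFS4ERR_MOVED)
               follow_fs_locations(&fss[i]);
   }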
6.14.4. Migration and the Lease_time Attribute
In order that the client may appropriately manage its leases in the
case of migration, the destination server must establish proper
values for the lease_time attribute.
When state is transferred transparently, that state should include
the correct value of the lease_time attribute. The lease_time
attribute on the destination server must never be less than that on
the source since this would result in premature expiration of leases
granted by the source server. Upon migration in which state is
transferred transparently, the client is under no obligation to re-
fetch the lease_time attribute and may continue to use the value
previously fetched (on the source server).
If state has not been transferred transparently (i.e. the client sees
a real or simulated server reboot), the client should fetch the value
of lease_time on the new (i.e. destination) server, and use it for
subsequent locking requests. However the server must respect a grace
period at least as long as the lease_time on the source server, in
order to ensure that clients have ample time to reclaim their locks
before potentially conflicting non-reclaimed locks are granted. The
means by which the new server obtains the value of lease_time on the
old server is left to the server implementations. It is not
specified by the NFS version 4 protocol.
7. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is Client-side caching of data, of file attributes, and of file names is
essential to providing good performance with the NFS protocol. essential to providing good performance with the NFS protocol.
Providing distributed cache coherence is a difficult problem and Providing distributed cache coherence is a difficult problem and
previous versions of the NFS protocol have not attempted it. previous versions of the NFS protocol have not attempted it.
Instead, several NFS client implementation techniques have been used Instead, several NFS client implementation techniques have been used
to reduce the problems that a lack of coherence poses for users. to reduce the problems that a lack of coherence poses for users.
These techniques have not been clearly defined by earlier protocol These techniques have not been clearly defined by earlier protocol
specifications and it is often unclear what is valid or invalid specifications and it is often unclear what is valid or invalid
client behavior. client behavior.
The NFS version 4 protocol uses many techniques similar to those that The NFS version 4 protocol uses many techniques similar to those that
have been used in previous protocol versions. The NFS version 4 have been used in previous protocol versions. The NFS version 4
protocol does not provide distributed cache coherence. However, it protocol does not provide distributed cache coherence. However, it
defines a more limited set of caching guarantees to allow locks and defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from share reservations to be used without destructive interference from
client side caching. client side caching.
skipping to change at page 104, line 8 skipping to change at page 98, line 22
defines a more limited set of caching guarantees to allow locks and defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from share reservations to be used without destructive interference from
client side caching. client side caching.
In addition, the NFS version 4 protocol introduces a delegation In addition, the NFS version 4 protocol introduces a delegation
mechanism which allows many decisions normally made by the server to mechanism which allows many decisions normally made by the server to
be made locally by clients. This mechanism provides efficient be made locally by clients. This mechanism provides efficient
support of the common cases where sharing is infrequent or where support of the common cases where sharing is infrequent or where
sharing is read-only. sharing is read-only.
7.1. Performance Challenges for Client-Side Caching 6.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have Caching techniques used in previous versions of the NFS protocol have
been successful in providing good performance. However, several been successful in providing good performance. However, several
scalability challenges can arise when those techniques are used with scalability challenges can arise when those techniques are used with
very large numbers of clients. This is particularly true when very large numbers of clients. This is particularly true when
clients are geographically distributed which classically increases clients are geographically distributed which classically increases
the latency for cache revalidation requests. the latency for cache revalidation requests.
The previous versions of the NFS protocol repeat their file data The previous versions of the NFS protocol repeat their file data
cache validation requests at the time the file is opened. This cache validation requests at the time the file is opened. This
skipping to change at page 105, line 5 skipping to change at page 99, line 14
o Compatibility with a large range of server semantics. o Compatibility with a large range of server semantics.
o Provide the same caching benefits as previous versions of the NFS o Provide the same caching benefits as previous versions of the NFS
protocol when unable to provide the more aggressive model. protocol when unable to provide the more aggressive model.
o Requirements for aggressive caching are organized so that a large o Requirements for aggressive caching are organized so that a large
portion of the benefit can be obtained even when not all of the portion of the benefit can be obtained even when not all of the
requirements can be met. requirements can be met.
The appropriate requirements for the server are discussed in later The appropriate requirements for the server are discussed in later
sections in which specific forms of caching are covered. (see the sections in which specific forms of caching are covered. (see the
section "Open Delegation"). section "Open Delegation").
7.2. Delegation and Callbacks 6.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a Recallable delegation of server responsibilities for a file to a
client improves performance by avoiding repeated requests to the client improves performance by avoiding repeated requests to the
server in the absence of inter-client conflict. With the use of a server in the absence of inter-client conflict. With the use of a
"callback" RPC from server to client, a server recalls delegated "callback" RPC from server to client, a server recalls delegated
responsibilities when another client engages in sharing of a responsibilities when another client engages in sharing of a
delegated file. delegated file.
A delegation is passed from the server to the client, specifying the A delegation is passed from the server to the client, specifying the
object of the delegation and the type of delegation. There are object of the delegation and the type of delegation. There are
skipping to change at page 106, line 26 skipping to change at page 100, line 35
The server will not know what opens are in effect on the client. The server will not know what opens are in effect on the client.
Without this knowledge the server will be unable to determine if the Without this knowledge the server will be unable to determine if the
access and deny state for the file allows any particular open until access and deny state for the file allows any particular open until
the delegation for the file has been returned. the delegation for the file has been returned.
A client failure or a network partition can result in failure to A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state the delegation which in turn will render useless any modified state
still on the client. still on the client.
7.2.1. Delegation Recovery 6.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with: There are three situations that delegation recovery must deal with:
o Client reboot or restart o Client reboot or restart
o Server reboot or restart o Server reboot or restart
o Network partition (full or callback-only) o Network partition (full or callback-only)
In the event the client reboots or restarts, the failure to renew In the event the client reboots or restarts, the failure to renew
skipping to change at page 108, line 21 skipping to change at page 102, line 31
by the client whose delegation is revoked and separately by other by the client whose delegation is revoked and separately by other
clients. See the section "Revocation Recovery for Write Open clients. See the section "Revocation Recovery for Write Open
Delegation" for a discussion of such issues. Note also that when Delegation" for a discussion of such issues. Note also that when
delegations are revoked, information about the revoked delegation delegations are revoked, information about the revoked delegation
will be written by the server to stable storage (as described in the will be written by the server to stable storage (as described in the
section "Crash Recovery"). This is done to deal with the case in section "Crash Recovery"). This is done to deal with the case in
which a server reboots after revoking a delegation but before the which a server reboots after revoking a delegation but before the
client holding the revoked delegation is notified about the client holding the revoked delegation is notified about the
revocation. revocation.
7.3. Data Caching 6.3. Data Caching
When applications share access to a set of files, they need to be When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications access by another application. This is true whether the applications
in question execute on different clients or reside on the same in question execute on different clients or reside on the same
client. client.
Share reservations and record locks are the facilities the NFS Share reservations and record locks are the facilities the NFS
version 4 protocol provides to allow applications to coordinate version 4 protocol provides to allow applications to coordinate
access by providing mutual exclusion facilities. The NFS version 4 access by providing mutual exclusion facilities. The NFS version 4
protocol's data caching must be implemented such that it does not protocol's data caching must be implemented such that it does not
invalidate the assumptions that those using these facilities depend invalidate the assumptions that those using these facilities depend
upon. upon.
7.3.1. Data Caching and OPENs 6.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that In order to avoid invalidating the sharing assumptions that
applications rely on, NFS version 4 clients should not provide cached applications rely on, NFS version 4 clients should not provide cached
data to applications or modify it on behalf of an application when it data to applications or modify it on behalf of an application when it
would not be valid to obtain or modify that same data via a READ or would not be valid to obtain or modify that same data via a READ or
WRITE operation. WRITE operation.
Furthermore, in the absence of open delegation (see the section "Open Furthermore, in the absence of open delegation (see the section "Open
Delegation") two additional rules apply. Note that these rules are Delegation") two additional rules apply. Note that these rules are
obeyed in practice by many NFS version 2 and version 3 clients. obeyed in practice by many NFS version 2 and version 3 clients.
skipping to change at page 109, line 35 skipping to change at page 103, line 45
a file OPENed for write. This is complementary to the first rule. a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after If the data is not flushed at CLOSE, the revalidation done after
a client OPENs a file is unable to achieve its purpose. The other a client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to reboot or restart and a CLOSEd file, it may not be possible to
retransmit the data to be written to the file. Hence, this retransmit the data to be written to the file. Hence, this
requirement. requirement.
7.3.2. Data Caching and File Locking 6.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching. analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way These rules are effective only if the file locking is used in a way
that matches in an equivalent way the actual READ and WRITE that matches in an equivalent way the actual READ and WRITE
operations executed. This is as opposed to file locking that is operations executed. This is as opposed to file locking that is
based on pure convention. For example, it is possible to manipulate based on pure convention. For example, it is possible to manipulate
a two-megabyte file by dividing the file into two one-megabyte a two-megabyte file by dividing the file into two one-megabyte
regions and protecting access to the two regions by file locks on regions and protecting access to the two regions by file locks on
skipping to change at page 111, line 17 skipping to change at page 105, line 26
unrelated unlock. However, it would not be valid to write the entire unrelated unlock. However, it would not be valid to write the entire
block in which that single written byte was located since it includes block in which that single written byte was located since it includes
an area that is not locked and might be locked by another client. an area that is not locked and might be locked by another client.
Client implementations can avoid this problem by dividing files with Client implementations can avoid this problem by dividing files with
modified data into those for which all modifications are done to modified data into those for which all modifications are done to
areas covered by an appropriate record lock and those for which there areas covered by an appropriate record lock and those for which there
are modifications not covered by a record lock. Any writes done for are modifications not covered by a record lock. Any writes done for
the former class of files must not include areas not locked and thus the former class of files must not include areas not locked and thus
not modified on the client. not modified on the client.
7.3.3. Data Caching and Mandatory File Locking 6.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when Client side data caching needs to respect mandatory file locking when
it is in effect. The presence of mandatory file locking for a given it is in effect. The presence of mandatory file locking for a given
file is indicated when the client gets back NFS4ERR_LOCKED from a file is indicated when the client gets back NFS4ERR_LOCKED from a
READ or WRITE on a file it has an appropriate share reservation for. READ or WRITE on a file it has an appropriate share reservation for.
When mandatory locking is in effect for a file, the client must check When mandatory locking is in effect for a file, the client must check
for an appropriate file lock for data being read or written. If a for an appropriate file lock for data being read or written. If a
lock exists for the range being read or written, the client may lock exists for the range being read or written, the client may
satisfy the request using the client's validated cache. If an satisfy the request using the client's validated cache. If an
appropriate file lock is not held for the range of the read or write, appropriate file lock is not held for the range of the read or write,
the read or write request must not be satisfied by the client's cache the read or write request must not be satisfied by the client's cache
and the request must be sent to the server for processing. When a and the request must be sent to the server for processing. When a
read or write request partially overlaps a locked region, the request read or write request partially overlaps a locked region, the request
should be subdivided into multiple pieces with each region (locked or should be subdivided into multiple pieces with each region (locked or
not) treated appropriately. not) treated appropriately.
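The overlap computation behind that subdivision can be sketched as follows (illustrative only): the portion of a read or write covered by a held lock may be satisfied from the validated cache, and whatever falls outside it, possibly two leftover pieces, must be sent to the server. A real client would repeat the computation against every lock it holds and must also guard the offset arithmetic against overflow.

   #include <stdint.h>

   struct byte_range { uint64_t off; uint64_t len; };

   /* Returns the part of 'req' covered by 'lock' (len == 0 if none). */
   static struct byte_range covered_by_lock(struct byte_range req,
                                            struct byte_range lock)
   {
       uint64_t start    = req.off > lock.off ? req.off : lock.off;
       uint64_t req_end  = req.off + req.len;
       uint64_t lock_end = lock.off + lock.len;
       uint64_t end      = req_end < lock_end ? req_end : lock_end;
       struct byte_range covered = { 0, 0 };

       if (start < end) {
           covered.off = start;
           covered.len = end - start;
       }
       return covered;
   }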
7.3.4. Data Caching and File Identity 6.3.4. Data Caching and File Identity
When clients cache data, the file data needs to be organized When clients cache data, the file data needs to be organized
according to the filesystem object to which the data belongs. For according to the filesystem object to which the data belongs. For
NFS version 3 clients, the typical practice has been to assume for NFS version 3 clients, the typical practice has been to assume for
the purpose of caching that distinct filehandles represent distinct the purpose of caching that distinct filehandles represent distinct
filesystem objects. The client then has the choice to organize and filesystem objects. The client then has the choice to organize and
maintain the data cache on this basis. maintain the data cache on this basis.
In the NFS version 4 protocol, there is now the possibility to have In the NFS version 4 protocol, there is now the possibility to have
significant deviations from a "one filehandle per object" model significant deviations from a "one filehandle per object" model
skipping to change at page 112, line 36 skipping to change at page 106, line 44
fileid attribute for both of the handles, then it cannot be fileid attribute for both of the handles, then it cannot be
determined whether the two objects are the same. Therefore, determined whether the two objects are the same. Therefore,
operations which depend on that knowledge (e.g. client side data operations which depend on that knowledge (e.g. client side data
caching) cannot be done reliably. caching) cannot be done reliably.
o If GETATTR directed to the two filehandles returns different o If GETATTR directed to the two filehandles returns different
values for the fileid attribute, then they are distinct objects. values for the fileid attribute, then they are distinct objects.
o Otherwise they are the same object. o Otherwise they are the same object.
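A sketch of this comparison follows; it is illustrative only, assumes the fsid attribute has also been fetched so that objects on different filesystems can be distinguished, and assumes the attributes were obtained with GETATTR as described above.

   #include <stdint.h>

   enum identity { IDENT_DISTINCT, IDENT_SAME, IDENT_UNKNOWN };

   struct object_attrs {
       uint64_t fsid_major, fsid_minor;  /* fsid attribute               */
       int      has_fileid;              /* fileid returned for this one */
       uint64_t fileid;
   };

   static enum identity same_object(const struct object_attrs *a,
                                    const struct object_attrs *b)
   {
       if (a->fsid_major != b->fsid_major || a->fsid_minor != b->fsid_minor)
           return IDENT_DISTINCT;        /* different filesystems */
       if (!a->has_fileid || !b->has_fileid)
           return IDENT_UNKNOWN;         /* cannot rely on identity */
       return a->fileid == b->fileid ? IDENT_SAME : IDENT_DISTINCT;
   }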
7.4. Open Delegation 6.4. Open Delegation
When a file is being OPENed, the server may delegate further handling When a file is being OPENed, the server may delegate further handling
of opens and closes for that file to the opening client. Any such of opens and closes for that file to the opening client. Any such
delegation is recallable, since the circumstances that allowed for delegation is recallable, since the circumstances that allowed for
the delegation are subject to change. In particular, the server may the delegation are subject to change. In particular, the server may
receive a conflicting OPEN from another client, in which case it must receive a conflicting OPEN from another client, in which case it must
recall the delegation before deciding whether the OPEN from the other recall the delegation before deciding whether the OPEN from the other
client may be granted. Making a delegation is up to the server and client may be granted. Making a delegation is up to the server and
clients should not assume that any particular OPEN either will or clients should not assume that any particular OPEN either will or
will not result in an open delegation. The following is a typical will not result in an open delegation. The following is a typical
skipping to change at page 115, line 5 skipping to change at page 109, line 13

The use of delegation together with various other forms of caching
creates the possibility that no server authentication will ever be
performed for a given user since all of the user's requests might be
satisfied locally. Where the client is depending on the server for
authentication, the client should be sure authentication occurs for
each user by use of the ACCESS operation. This should be the case
even if an ACCESS operation would not be required otherwise. As
mentioned before, the server may enforce frequent authentication by
returning an nfsace4 denying all access with every open delegation.

6.4.1. Open Delegation and Data Caching

OPEN delegation allows much of the message overhead associated with
the opening and closing of files to be eliminated. An open when an
open delegation is in effect does not require that a validation
message be sent to the server. The continued endurance of the "read
open delegation" provides a guarantee that no OPEN for write and thus
no write has occurred. Similarly, when closing a file opened for
write and if write open delegation is in effect, the data written
does not have to be flushed to the server until the open delegation
is recalled. The continued endurance of the open delegation provides a

skipping to change at page 110, line 28

With respect to authentication, flushing modified data to the server
after a CLOSE has occurred may be problematic. For example, the user
of the application may have logged off the client and unexpired
authentication credentials may not be present. In this case, the
client may need to take special care to ensure that local unexpired
credentials will in fact be available. This may be accomplished by
tracking the expiration time of credentials and flushing data well in
advance of their expiration or by making private copies of
credentials to assure their availability when needed.

6.4.2. Open Delegation and File Locks

When a client holds a write open delegation, lock operations are
performed locally. This includes those required for mandatory file
locking. This can be done since the delegation implies that there
can be no conflicting locks. Similarly, all of the revalidations
that would normally be associated with obtaining locks and the
flushing of data associated with the releasing of locks need not be
done.

When a client holds a read open delegation, lock operations are not
performed locally. All lock operations, including those requesting
non-exclusive locks, are sent to the server for resolution.

6.4.3. Handling of CB_GETATTR

The server needs to employ special handling for a GETATTR where the
target is a file that has a write open delegation in effect. The
reason for this is that the client holding the write delegation may
have modified the data and the server needs to reflect this change to
the second client that submitted the GETATTR. Therefore, the client
holding the write delegation needs to be interrogated. The server
will use the CB_GETATTR operation. The only attributes that the
server can reliably query via CB_GETATTR are size and change.

skipping to change at page 113, line 40

CB_GETATTR and responds to the second client as in the last step.

This methodology resolves issues of clock differences between client
and server and other scenarios where the use of CB_GETATTR breaks
down.

It should be noted that the server is under no obligation to use
CB_GETATTR and therefore the server MAY simply recall the delegation
to avoid its use.
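
The following non-normative C sketch outlines one way a server might
apply this guidance when answering a GETATTR of size and change for a
write-delegated file. The types and helper routines (cb_getattr,
recall_delegation) are stand-ins invented for the example, not
protocol elements; real implementations issue the CB_GETATTR and
CB_RECALL callbacks.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Hypothetical local state; field names are not protocol elements. */
   struct file_state {
       bool     write_delegated;  /* write open delegation outstanding */
       uint64_t size, change;     /* server's current view of the file */
   };

   /* Stub CB_GETATTR: pretend the delegation holder reports a larger
    * file and a newer change value.  A real server issues the callback
    * RPC and may fail if the callback path is down. */
   static bool cb_getattr(struct file_state *f, uint64_t *sz, uint64_t *chg)
   {
       (void)f;
       *sz = 8192;
       *chg = 43;
       return true;
   }

   /* Stub recall; a real server sends CB_RECALL and awaits the return
    * of the delegation (and any modified data). */
   static void recall_delegation(struct file_state *f)
   {
       f->write_delegated = false;
   }

   /* Interrogate the delegation holder, or recall, before answering
    * the second client's GETATTR. */
   static void getattr_size_change(struct file_state *f,
                                   uint64_t *size, uint64_t *change)
   {
       if (f->write_delegated) {
           uint64_t d_size, d_change;

           if (cb_getattr(f, &d_size, &d_change)) {
               *size = d_size;     /* reflect the holder's modifications */
               *change = d_change;
               return;
           }
           recall_delegation(f);   /* server MAY avoid CB_GETATTR */
       }
       *size = f->size;
       *change = f->change;
   }

   int main(void)
   {
       struct file_state f = { true, 4096, 42 };
       uint64_t sz, chg;

       getattr_size_change(&f, &sz, &chg);
       printf("size=%llu change=%llu\n",
              (unsigned long long)sz, (unsigned long long)chg);
       return 0;
   }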

6.4.4. Recall of Open Delegation

The following events necessitate recall of an open delegation:

o  Potentially conflicting OPEN request (or READ/WRITE done with
   "special" stateid)

o  SETATTR issued by another client

o  REMOVE request for the file

o  RENAME request for the file as either source or target of the
   RENAME

Whether a RENAME of a directory in the path leading to the file
results in recall of an open delegation depends on the semantics of
the server filesystem. If that filesystem denies such RENAMEs when a
file is open, the recall must be performed to determine whether the
file in question is, in fact, open.
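
The recall triggers listed above reduce to a simple decision routine,
sketched below in non-normative C. The request classification and the
filesystem-semantics flag are assumptions local to the example.

   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical request classification; not protocol elements. */
   enum op_kind {
       OP_CONFLICTING_OPEN,  /* or READ/WRITE with a "special" stateid */
       OP_SETATTR,           /* SETATTR issued by another client       */
       OP_REMOVE,            /* REMOVE request for the file            */
       OP_RENAME_OF_FILE,    /* file is source or target of a RENAME   */
       OP_RENAME_OF_PARENT,  /* a directory in the path is renamed     */
       OP_OTHER
   };

   /* Decide whether a request from another client necessitates recall
    * of the open delegation.  The flag models filesystem semantics
    * that deny RENAME of a directory while a file beneath it is open. */
   static bool must_recall(enum op_kind op, bool fs_denies_rename_when_open)
   {
       switch (op) {
       case OP_CONFLICTING_OPEN:
       case OP_SETATTR:
       case OP_REMOVE:
       case OP_RENAME_OF_FILE:
           return true;
       case OP_RENAME_OF_PARENT:
           /* Recall only if the filesystem needs to know whether the
            * file is, in fact, open. */
           return fs_denies_rename_when_open;
       default:
           /* The server may still choose to recall for other reasons. */
           return false;
       }
   }

   int main(void)
   {
       printf("SETATTR: recall=%d\n", must_recall(OP_SETATTR, false));
       printf("parent RENAME: recall=%d\n",
              must_recall(OP_RENAME_OF_PARENT, true));
       return 0;
   }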

In addition to the situations above, the server may choose to recall

skipping to change at page 115, line 35

except as part of delegation return. Only in the case of closing the
open that resulted in obtaining the delegation would clients be
likely to do this early, since, in that case, the close once done
will not be undone. Regardless of the client's choices on scheduling
these actions, all must be performed before the delegation is
returned, including (when applicable) the close that corresponds to
the open that resulted in the delegation. These actions can be
performed either in previous requests or in previous operations in
the same COMPOUND request.

6.4.5. Clients that Fail to Honor Delegation Recalls

A client may fail to respond to a recall for various reasons, such as
a failure of the callback path from server to the client. The client
may be unaware of a failure in the callback path. This lack of
awareness could result in the client finding out long after the
failure that its delegation has been revoked, and another client has
modified the data for which the client had a delegation. This is
especially a problem for the client that held a write delegation.

The server also has a dilemma in that the client that fails to

skipping to change at page 116, line 31

   time after the server attempted to recall the delegation. This
   period of time MUST NOT be less than the value of the
   lease_time attribute.

o  When the client holds a delegation, it cannot rely on operations,
   except for RENEW, that take a stateid, to renew delegation leases
   across callback path failures. The client that wants to keep
   delegations in force across callback path failures must use RENEW
   to do so.
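
A client might schedule explicit RENEWs along the lines of the
following non-normative sketch. The renew_rpc helper and the
half-lease renewal interval are assumptions of the example, not
requirements of the protocol.

   #include <stdbool.h>
   #include <stdio.h>

   /* Stub: issue RENEW for this clientid; a real client sends the RPC
    * and handles errors such as an expired lease. */
   static bool renew_rpc(unsigned long clientid)
   {
       printf("RENEW clientid=%lu\n", clientid);
       return true;
   }

   /* Renewing at half the lease_time is an assumed safety margin. */
   static unsigned int next_renew_delay(unsigned int lease_time)
   {
       return lease_time / 2;
   }

   int main(void)
   {
       unsigned int lease_time = 90;     /* seconds, from the server */
       unsigned long clientid = 0x1234;  /* illustrative value       */

       for (int i = 0; i < 3; i++) {     /* a real client loops forever */
           if (!renew_rpc(clientid))
               fprintf(stderr, "renewal failed; recovery needed\n");
           printf("sleep %u seconds\n", next_renew_delay(lease_time));
           /* sleep(next_renew_delay(lease_time)); in a real client */
       }
       return 0;
   }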

6.4.6. Delegation Revocation

At the point a delegation is revoked, if there are associated opens
on the client, the applications holding these opens need to be
notified. This notification usually occurs by returning errors for
READ/WRITE operations or when a close is attempted for the open file.

If no opens exist for the file at the point the delegation is
revoked, then notification of the revocation is unnecessary.
However, if there is modified data present at the client for the
file, the user of the application should be notified. Unfortunately,
it may not be possible to notify the user since active applications
may not be present at the client. See the section "Revocation
Recovery for Write Open Delegation" for additional details.

6.5. Data Caching and Revocation

When locks and delegations are revoked, the assumptions upon which
successful caching depend are no longer guaranteed. For any locks or
share reservations that have been revoked, the corresponding owner
needs to be notified. This notification includes applications with a
file open that has a corresponding delegation which has been revoked.

Cached data associated with the revocation must be removed from the
client. In the case of modified data existing in the client's cache,
that data must be removed from the client without it being written to
the server. As mentioned, the assumptions made by the client are no
longer valid at the point when a lock or delegation has been revoked.
For example, another client may have been granted a conflicting lock
after the revocation of the lock at the first client. Therefore, the
data within the lock range may have been modified by the other
client. Obviously, the first client is unable to guarantee to the
application what has occurred to the file in the case of revocation.

Notification to a lock owner will in many cases consist of simply
returning an error on the next and all subsequent READs/WRITEs to the
open file or on the close. Where the methods available to a client
make such notification impossible because errors for certain
operations may not be returned, more drastic action such as signals
or process termination may be appropriate. The justification for
this is that an invariant on which an application depends may be
violated. Depending on how errors are typically treated for the
client operating environment, further levels of notification
including logging, console messages, and GUI pop-ups may be
appropriate.

6.5.1. Revocation Recovery for Write Open Delegation

Revocation recovery for a write open delegation poses the special
issue of modified data in the client cache while the file is not
open. In this situation, any client which does not flush modified
data to the server on each close must ensure that the user receives
appropriate notification of the failure as a result of the
revocation. Since such situations may require human action to
correct problems, notification schemes in which the appropriate user
or administrator is notified may be necessary. Logging and console
messages are typical examples.

skipping to change at page 118, line 12

contents in these situations or mark the results specially to warn
users of possible problems.

Saving of such modified data in delegation revocation situations may
be limited to files of a certain size or might be used only when
sufficient disk space is available within the target filesystem.
Such saving may also be restricted to situations when the client has
sufficient buffering resources to keep the cached copy available
until it is properly stored to the target filesystem.

6.6. Attribute Caching

The attributes discussed in this section do not include named
attributes. Individual named attributes are analogous to files and
caching of the data for these needs to be handled just as data
caching is for ordinary files. Similarly, LOOKUP results from an
OPENATTR directory are to be cached on the same basis as any other
pathnames and similarly for directory contents.

Clients may cache file attributes obtained from the server and use
them to avoid subsequent GETATTR requests. Such caching is write

skipping to change at page 120, line 8

client will either eventually have to write the access time to the
server with bad performance effects, or it would never update the
server's time_access, thereby resulting in a situation where an
application that caches access time between a close and open of the
same file observes the access time oscillating between the past and
present. The time_access attribute always means the time of last
access to a file by a read that was satisfied by the server. This
way clients will tend to see only time_access changes that go forward
in time.

6.7. Data and Metadata Caching and Memory Mapped Files

Some operating environments include the capability for an application
to map a file's content into the application's address space. Each
time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the
application's address space).

As long as each memory mapped access to the file requires a page

skipping to change at page 122, line 16

   are record locks for.

o  Clients and servers MAY deny a record lock on a file they know is
   memory mapped.

o  A client MAY deny memory mapping a file that it knows requires
   mandatory locking for I/O. If mandatory locking is enabled after
   the file is opened and mapped, the client MAY deny the application
   further access to its mapped file.

6.8. Name Caching

The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations. Just as in the case of
attribute caching, inconsistencies may arise among the various client
caches. To mitigate the effects of these inconsistencies and given
the context of typical filesystem APIs, an upper time boundary is
maintained on how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory
change operation performed by another client.

When a client is not making changes to a directory for which there
exist name cache

skipping to change at page 123, line 16

directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.
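
The following non-normative C sketch shows how a client might apply
change_info4 to its name cache after performing a directory-modifying
operation. The cache structure and field names are invented for the
example.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Simplified change_info4 as returned by directory-modifying ops. */
   struct change_info {
       bool     atomic;  /* pre/post reported atomically by the server */
       uint64_t before;  /* directory change attribute before the op   */
       uint64_t after;   /* directory change attribute after the op    */
   };

   /* Hypothetical per-directory name cache state kept by the client. */
   struct dir_cache {
       uint64_t cached_change;  /* change value the cache was built on */
       bool     valid;
   };

   /* If the server reported the values atomically and "before" matches
    * the cached change attribute, only this client's own change
    * intervened and the cache may be rolled forward; otherwise another
    * client may have changed the directory and the cache is dropped. */
   static void update_name_cache(struct dir_cache *dc,
                                 const struct change_info *ci)
   {
       if (ci->atomic && dc->valid && dc->cached_change == ci->before)
           dc->cached_change = ci->after;
       else
           dc->valid = false;
   }

   int main(void)
   {
       struct dir_cache dc = { 7, true };
       struct change_info ci = { true, 7, 8 };

       update_name_cache(&dc, &ci);   /* cache kept, now at change 8 */
       ci = (struct change_info){ false, 8, 9 };
       update_name_cache(&dc, &ci);   /* non-atomic: cache discarded */
       printf("cache valid: %d\n", dc.valid);
       return 0;
   }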

6.9. Directory Caching

The results of READDIR operations may be used to avoid subsequent
READDIR operations. Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the
context of typical filesystem APIs, the following rules should be
followed:

o  Cached READDIR information for a directory which is not obtained
   in a single READDIR operation must always be a consistent snapshot

skipping to change at page 124, line 10

directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.

7. Security Negotiation

The NFSv4.0 specification contains three oversights and ambiguities
with respect to the SECINFO operation.

First, it is impossible for the client to use the SECINFO operation
to determine the correct security triple for accessing a parent
directory. This is because SECINFO takes as arguments the current
file handle and a component name. However, NFSv4.0 uses the LOOKUPP
operation to get the parent directory of the current file handle. If
the client uses the wrong security when issuing the LOOKUPP, and gets

Third, there is a problem as to what the client must do (or can do)
whenever the server returns NFS4ERR_WRONGSEC in response to a PUTFH
operation. The NFSv4.0 specification says that the client should
issue a SECINFO using the parent filehandle and the component name of
the filehandle that PUTFH was issued with. This may not be convenient
for the client.

This document resolves the above three issues in the context of
NFSv4.1.

8. Clarification of Security Negotiation in NFSv4.1

This section attempts to clarify NFSv4.1 security negotiation issues.
Unless noted otherwise, for any mention of PUTFH in this section, the
reader should interpret it as applying to PUTROOTFH and PUTPUBFH in
addition to PUTFH.

8.1. PUTFH + LOOKUP

The server implementation may decide whether to impose any
restrictions on export security administration. There are at least
three approaches (Sc is the flavor set of the child export, Sp that
of the parent),

a) Sc <= Sp (<= for subset)

b) Sc ^ Sp != {} (^ for intersection, {} for the empty set)

c) free form

To support b (when the client chooses a flavor that is not a member
of Sp) and c, PUTFH must NOT return NFS4ERR_WRONGSEC in case of
security mismatch. Instead, it should be returned from the LOOKUP
that follows.

Since the above guideline does not contradict a, it should be
followed in general.
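
The following non-normative C sketch illustrates the flavor-set
relationships above and where NFS4ERR_WRONGSEC is raised for a
PUTFH + LOOKUP sequence. The bitmask representation and the flavors
shown are assumptions of the example.

   #include <stdbool.h>
   #include <stdio.h>

   /* Represent a flavor set as a bitmask of illustrative flavors. */
   #define F_SYS    (1u << 0)   /* AUTH_SYS         */
   #define F_KRB5   (1u << 1)   /* RPCSEC_GSS krb5  */
   #define F_KRB5I  (1u << 2)   /* RPCSEC_GSS krb5i */

   static bool is_subset(unsigned sc, unsigned sp)  { return (sc & ~sp) == 0; }
   static bool intersects(unsigned sc, unsigned sp) { return (sc & sp) != 0; }

   /* To accommodate administrations of type b and c, the mismatch is
    * not reported at PUTFH; it is reported by the LOOKUP that follows
    * if the flavor in use is not in the child export's set Sc. */
   static const char *wrongsec_point(unsigned sc, unsigned used_flavor)
   {
       return (sc & used_flavor) ? "no error"
                                 : "NFS4ERR_WRONGSEC at LOOKUP";
   }

   int main(void)
   {
       unsigned sp = F_SYS | F_KRB5;    /* parent export flavors, Sp */
       unsigned sc = F_KRB5 | F_KRB5I;  /* child export flavors, Sc  */

       printf("a) Sc <= Sp:      %s\n", is_subset(sc, sp) ? "yes" : "no");
       printf("b) Sc ^ Sp != {}: %s\n", intersects(sc, sp) ? "yes" : "no");
       printf("AUTH_SYS used:    %s\n", wrongsec_point(sc, F_SYS));
       printf("krb5 used:        %s\n", wrongsec_point(sc, F_KRB5));
       return 0;
   }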

8.2. PUTFH + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without the server implementing
SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue because via style
"parent", it works in the opposite direction from SECINFO (the
component name is implicit in this case).

8.3. PUTFH + SECINFO

This case should be treated specially.

A security-sensitive client should be allowed to choose a strong
flavor when querying a server to determine a file object's permitted
security flavors. The security flavor chosen by the client does not
have to be included in the flavor list of the export. Of course, the
server has to be configured for whatever flavor the client selects;
otherwise the request will fail at RPC authentication.

In theory, there is no connection between the security flavor used by
SECINFO and those supported by the export. But in practice, the
client may start looking for strong flavors from those supported by
the export, followed by those in the mandatory set.

8.4. PUTFH + Anything Else

PUTFH must return NFS4ERR_WRONGSEC in case of security mismatch.
This is the most straightforward approach without having to add
NFS4ERR_WRONGSEC to every other operation.

PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the client
to recover from NFS4ERR_WRONGSEC.
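
A client-side recovery sequence might look like the following
non-normative sketch. The compound helpers are stubs invented for the
example, and the flavor number shown (390003, the krb5 pseudo-flavor)
is purely illustrative.

   #include <stdbool.h>
   #include <stdio.h>

   #define NFS4_OK              0
   #define NFS4ERR_WRONGSEC 10016

   /* Stub: send PUTFH plus the intended operation with this flavor.
    * Here only the krb5 pseudo-flavor is acceptable for the export. */
   static int compound_putfh_op(unsigned flavor)
   {
       return (flavor == 390003) ? NFS4_OK : NFS4ERR_WRONGSEC;
   }

   /* Stub: PUTFH + SECINFO_NO_NAME (style "current_fh"), which is not
    * rejected with NFS4ERR_WRONGSEC and so can always be used to learn
    * the permitted flavors for the filehandle. */
   static int compound_secinfo_no_name(unsigned flavors[], int max)
   {
       if (max < 1)
           return 0;
       flavors[0] = 390003;
       return 1;                 /* number of flavors returned */
   }

   /* Issue the request; on NFS4ERR_WRONGSEC, query the permitted
    * flavors and retry with one the server reports as acceptable. */
   static bool run_with_recovery(unsigned *flavor)
   {
       int status = compound_putfh_op(*flavor);

       if (status == NFS4_OK)
           return true;
       if (status != NFS4ERR_WRONGSEC)
           return false;         /* other failures not handled here */

       unsigned flavors[8];
       int n = compound_secinfo_no_name(flavors, 8);
       for (int i = 0; i < n; i++) {
           if (compound_putfh_op(flavors[i]) == NFS4_OK) {
               *flavor = flavors[i];   /* remember the working flavor */
               return true;
           }
       }
       return false;
   }

   int main(void)
   {
       unsigned flavor = 1;      /* start with AUTH_SYS */
       printf("recovered=%d flavor=%u\n", run_with_recovery(&flavor),
              flavor);
       return 0;
   }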

9. NFSv4.1 Sessions

9.1. Sessions Background

9.1.1. Introduction to Sessions

This draft proposes extensions to NFS version 4 [RFC3530] enabling it
to support sessions and endpoint management, and to support operation
atop RDMA-capable RPC over transports such as iWARP. [RDMAP, DDP]
These extensions enable support for exactly-once semantics by NFSv4
servers, multipathing and trunking of transport connections, and
enhanced security. The ability to operate over RDMA enables greatly
enhanced performance. Operation over existing TCP is enhanced as
well.

skipping to change at page 127, line 30

+-----------------+-------------------------------------+
|      NFSv4      |     NFSv4 + session extensions      |
+-----------------+------+----------------+-------------+
|       Operations       |    Session     |             |
+------------------------+----------------+             |
|                 RPC/XDR                 |             |
+-------------------------------+---------+             |
|       Stream Transport        |    RDMA Transport     |
+-------------------------------+-----------------------+

9.1.2. Motivation

NFS version 4 [RFC3530] has been granted "Proposed Standard" status.
The NFSv4 protocol was developed along several design points,
important among them: effective operation over wide-area networks,
including the Internet itself; strong security integrated into the
protocol; extensive cross-platform interoperability including
integrated locking semantics compatible with multiple operating
systems; and protocol extensibility.

The NFS version 4 protocol, however, does not provide support for

skipping to change at page 128, line 40

FJDAFS] and Harvard University [KM02] are all relevant.

By layering a session binding for NFS version 4 directly atop a
standard RDMA transport, a greatly enhanced level of performance and
transparency can be supported on a wide variety of operating system
platforms. These combined capabilities alter the landscape between
local filesystems and network attached storage, enable a new level of
performance, and lead new classes of application to take advantage of
NFS.

9.1.3. Problem Statement

Two issues drive the current proposal: correctness, and performance.
Both are instances of "raising the bar" for NFS, whereby the desire
to use NFS in new classes of applications can be accommodated by
providing the basic features to make such use feasible. Such
applications include tightly coupled sharing environments such as
cluster computing, high performance computing (HPC) and information
processing such as databases. These trends are explored in depth in
[NFSPS].

skipping to change at page 130, line 8

systems, NFSv4 over RDMA will enable applications running on a set of
client machines to interact through an NFSv4 file system, just as
applications running on a single machine might interact through a
local file system.

This raises the issue of whether additional protocol enhancements to
enable such interaction would be desirable and what such enhancements
would be. This is a complicated issue which the working group needs
to address and will not be further discussed in this document.

9.1.4. NFSv4 Session Extension Characteristics

This draft will present a solution based upon minor versioning of
NFSv4. It will introduce a session to collect transport endpoints
and resources such as reply caching, which in turn enables
enhancements such as trunking, failover and recovery. It will
describe use of RDMA by employing support within an underlying RPC
layer [RPCRDMA]. Most importantly, it will focus on making the best
possible use of an RDMA transport.

These extensions are proposed as elements of a new minor revision of

skipping to change at page 130, line 30

generically as "NFSv4", when describing properties common to all
minor versions. When referring specifically to properties of the
original, minor version 0 protocol, "NFSv4.0" will be used, and
changes proposed here for minor version 1 will be referred to as
"NFSv4.1".

This draft proposes only changes which are strictly upward-
compatible with existing RPC and NFS Application Programming
Interfaces (APIs).

9.2. Transport Issues

The Transport Issues section of the document explores the details of
utilizing the various supported transports.

9.2.1. Session Model

The first and most evident issue in supporting diverse transports is
how to provide for their differences. This draft proposes
introducing an explicit session.

A session introduces minimal protocol requirements, and provides for
a highly useful and convenient way to manage numerous endpoint-
related issues. The session is a local construct; it represents a
named, higher-layer object to which connections can refer, and
encapsulates properties important to each associated client.

skipping to change at page 132, line 5

Finally, given adequate connection-oriented transport security
semantics, authentication and authorization may be cached on a per-
session basis, enabling greater efficiency in the issuing and
processing of requests on both client and server. A proposal for
transparent, server-driven implementation of this in NFSv4 has been
made. [CCM] The existence of the session greatly facilitates the
implementation of this approach. This is discussed in detail in the
Authentication Efficiencies section later in this draft.

9.2.2. Connection State

In RFC3530, the combination of a connected transport endpoint and a
clientid forms the basis of connection state. While this has been
made to be workable with certain limitations, there are difficulties
in correct and robust implementation. The NFSv4.0 protocol must
provide a server-initiated connection for the callback channel, and
must carefully specify the persistence of client state at the server
in the face of transport interruptions. The server has only the
client's transport address binding (the IP 4-tuple) to identify the
client RPC transaction stream and to use as a lookup tag on the

skipping to change at page 132, line 41

The session identifier is unique within the server's scope and may be
subject to certain server policies such as being bounded in time.

It is envisioned that the primary transport model will be connection
oriented. Connection orientation brings with it certain potential
optimizations, such as caching of per-connection properties, which
are easily leveraged through the generality of the session. However,
it is possible that in future, other transport models could be
accommodated below the session abstraction.

9.2.3. NFSv4 Channels, Sessions and Connections

There are at least two types of NFSv4 channels: the "operations"
channel used for ordinary requests from client to server, and the
"back" channel, used for callback requests from server to client.

As mentioned above, different NFSv4 operations on these channels can
lead to different resource needs. For example, server callback
operations (CB_RECALL) are specific, small messages which flow from
server to client at arbitrary times, while data transfers such as
read and write have very different sizes and asymmetric behaviors.

skipping to change at page 134, line 41

In this way, implementation as well as resource management may be
optimized. Each session will have its own response caching and
buffering, and each connection or channel will have its own transport
resources, as appropriate. Clients which do not require certain
behaviors may optimize such resources away completely, by using
specific sessions and not even creating the additional channels and
connections.
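
One plausible client-side arrangement of these abstractions is
sketched below in non-normative C. All structure and field names are
local to the example and are not protocol elements.

   #include <stdint.h>
   #include <stdio.h>

   #define MAX_CONNS 4

   /* A channel owns its own transport connections and resources. */
   struct channel {
       int      conn_fds[MAX_CONNS];
       int      nconns;
       unsigned max_request_size;
   };

   struct reply_slot { uint32_t seq; char reply[512]; };

   /* The session collects the channels and per-session resources such
    * as the response cache. */
   struct nfs41_session {
       unsigned char     sessionid[16];  /* assigned by the server     */
       struct channel    operations;     /* client-to-server requests  */
       struct channel    back;           /* server-to-client callbacks */
       struct reply_slot *replies;       /* per-session reply caching  */
       unsigned          nslots;
   };

   /* A client that needs no callbacks simply leaves "back" empty and
    * never creates the additional connections. */
   static void bind_connection(struct channel *ch, int fd)
   {
       if (ch->nconns < MAX_CONNS)
           ch->conn_fds[ch->nconns++] = fd;
   }

   int main(void)
   {
       struct nfs41_session s = {0};

       bind_connection(&s.operations, 3 /* illustrative socket fd */);
       printf("operations channel: %d connection(s)\n",
              s.operations.nconns);
       return 0;
   }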

9.2.4. Reconnection, Trunking and Failover

Reconnection after failure references stored state on the server
associated with lease recovery during the grace period. The session
provides a convenient handle for storing and managing information
regarding the client's previous state on a per-connection basis,
e.g. to be used upon reconnection. Reconnection to a previously
existing session, and its stored resources, are covered in the
"Connection Models" section below.

One important aspect of reconnection is that of RPC library support.

skipping to change at page 135, line 32

of connections, something the RPC layer abstraction architecturally
abstracts away. Therefore the session binding is not handled in
connection scope but instead explicitly carried in each request.

For Reliability, Availability and Serviceability (RAS) issues such as
bandwidth aggregation and multipathing, clients frequently seek to
make multiple connections through multiple logical or physical
channels. The session is a convenient point to aggregate and manage
these resources.

9.2.5. Server Duplicate Request Cache

Server duplicate request caches, while not a part of an NFS protocol,
have become a standard, even required, part of any NFS
implementation. First described in [CJ89], the duplicate request
cache was initially found to reduce work at the server by avoiding
duplicate processing for retransmitted requests. A second, and in
the long run more important benefit, was improved correctness, as the
cache prevented certain destructive non-idempotent requests from
being reinvoked.

skipping to change at page 136, line 41

Similarly, it is important for the client to explicitly learn whether
the server is able to implement reliable semantics. Knowledge of
whether these semantics are in force is critical for a highly
reliable client, one which must provide transactional integrity
guarantees. When clients request that the semantics be enabled for a
given session, the session reply must inform the client if the mode
is in fact enabled. In this way the client can confidently proceed
with operations without having to implement consistency facilities of
its own.
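
The following non-normative C sketch shows a bounded, per-session
reply cache of the kind discussed above: a retransmitted request is
answered from the cache rather than re-executed. The keying by slot
and sequence, the sizes, and the error name shown are assumptions of
the example rather than definitions made by this document.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   #define NSLOTS    8
   #define REPLY_MAX 128

   /* One entry per slot, sized at session creation so the cache is
    * bounded and can be retained for the life of the session. */
   struct slot {
       uint32_t seq;              /* last sequence seen on this slot */
       bool     have_reply;
       char     reply[REPLY_MAX];
   };

   struct reply_cache { struct slot slots[NSLOTS]; };

   /* Execute a request at most once: a duplicate (same sequence) is
    * answered from the cache; an old sequence is an error; a new
    * sequence executes the operation and caches the result. */
   static const char *
   execute_once(struct reply_cache *rc, unsigned slotid, uint32_t seq,
                const char *(*execute)(void))
   {
       struct slot *s = &rc->slots[slotid % NSLOTS];

       if (s->have_reply && seq == s->seq)
           return s->reply;                  /* replay, do not redo */
       if (s->have_reply && seq < s->seq)
           return "ERR_SEQ_MISORDERED";      /* illustrative error  */

       snprintf(s->reply, sizeof(s->reply), "%s", execute());
       s->seq = seq;
       s->have_reply = true;
       return s->reply;
   }

   static const char *do_remove(void) { return "REMOVE: NFS4_OK"; }

   int main(void)
   {
       struct reply_cache rc = {0};

       printf("%s\n", execute_once(&rc, 0, 1, do_remove)); /* executes */
       printf("%s\n", execute_once(&rc, 0, 1, do_remove)); /* replayed */
       return 0;
   }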

9.3. Session Initialization and Transfer Models

Session initialization issues, and data transfer models relevant to
both TCP and RDMA are discussed in this section.

9.3.1. Session Negotiation

The following parameters are exchanged between client and server at
session creation time. Their values allow the server to properly
size resources allocated in order to service the client's requests,
and to provide the server with a way to communicate limits to the
client for proper and optimal operation. They are exchanged prior to
all session-related activity, over any transport type. Discussion of
their use is found in their descriptions as well as throughout this
section.

skipping to change at page 138, line 21

   bandwidth parameters. The client provides its chosen value to the
   server in the initial session creation; the value must be provided
   in each client RDMA endpoint. The values are asymmetric and
   should be set to zero at the server in order to conserve RDMA
   resources, since clients do not issue RDMA Read operations in this
   proposal. The result is communicated in the session response, to
   permit matching of values across the connection. The value may
   not be changed in the duration of the session, although a new
   value may be requested as part of a new session.
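
A server-side treatment of such parameters might look like the
following non-normative sketch, in which the server clamps the
client's proposed values to its own limits and returns the values
actually in effect. The parameter names are illustrative and do not
correspond to the protocol's XDR fields.

   #include <stdio.h>

   /* Illustrative session parameters; not the protocol's XDR fields. */
   struct session_params {
       unsigned max_request_size;   /* largest request accepted        */
       unsigned max_response_size;  /* largest reply accepted          */
       unsigned rdma_read_credits;  /* asymmetric; zero at the server  */
       unsigned max_slots;          /* bounds reply cache size         */
   };

   static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

   /* The client proposes values at session creation; the server sizes
    * its resources and replies with the values in effect, which then
    * hold for the duration of the session. */
   static struct session_params
   negotiate(const struct session_params *asked,
             const struct session_params *limit)
   {
       struct session_params got;

       got.max_request_size  = min_u(asked->max_request_size,
                                     limit->max_request_size);
       got.max_response_size = min_u(asked->max_response_size,
                                     limit->max_response_size);
       got.max_slots         = min_u(asked->max_slots, limit->max_slots);
       /* The server-side RDMA Read value is zero, since clients do not
        * issue RDMA Read operations in this proposal. */
       got.rdma_read_credits = 0;
       return got;
   }

   int main(void)
   {
       struct session_params asked = { 1048576, 1048576, 8, 64 };
       struct session_params limit = {   65536, 1048576, 0, 16 };
       struct session_params got = negotiate(&asked, &limit);

       printf("request=%u response=%u slots=%u read_credits=%u\n",
              got.max_request_size, got.max_response_size,
              got.max_slots, got.rdma_read_credits);
       return 0;
   }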

9.3.2. RDMA Requirements

A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. Where RDMA is considered, this
proposal assumes the use of such a layering; it addresses only the
upper layer issues relevant to making best use of RPC/RDMA.

A connection oriented (reliable sequenced) RDMA transport will be
required. There are several reasons for this. First, this model
most closely reflects the general NFSv4 requirement of long-lived and
congestion-controlled transports. Second, to operate correctly over

skipping to change at page 138, line 48

ordering semantic, which presents the same set of ordering and
reliability issues to the RDMA layer over such transports.

The RDMA implementation provides for making connections to other
RDMA-capable peers. In the case of the current proposals before the
RDDP working group, these RDMA connections are preceded by a
"streaming" phase, where ordinary TCP (or NFS) traffic might flow.
However, this is not assumed here and sizes and other parameters are
explicitly exchanged upon a session entering RDMA mode.

9.3.3. RDMA Connection Resources
On transport endpoints which support automatic RDMA mode, that is,
endpoints which are created in the RDMA-enabled state, a single,
preposted buffer must initially be provided by both peers, and the
client session negotiation must be the first exchange.

On transport endpoints supporting dynamic negotiation, a more
sophisticated negotiation is possible, but is not discussed in the
current draft.

skipping to change at page 139, line 33
RPC layer to handle receives. These buffers remain in use by the
RPC/NFSv4 implementation; the size and number of them must be known
to the remote peer in order to avoid RDMA errors, which would be
fatal to the RDMA connection.

The session provides a natural way for the server to manage resource
allocation to each client rather than to each transport connection
itself. This enables considerable flexibility in the administration
of transport endpoints.

9.3.4. TCP and RDMA Inline Transfer Model
The basic transfer model for both TCP and RDMA is referred to as
"inline". For TCP, this is the only transfer model supported, since
TCP carries both the RPC header and data together in the data stream.
For RDMA, the RDMA Send transfer model is used for all NFS requests
and replies, but data is optionally carried by RDMA Writes or RDMA
Reads. Use of Sends is required to ensure consistency of data and to
deliver completion notifications. The pure-Send method is typically
used where the data payload is small, or where for whatever reason

skipping to change at page 142, line 26
procedures. Since an arbitrary number (total size) of operations can
be specified in a single COMPOUND procedure, its size is effectively
unbounded. This cannot be supported by RDMA Sends, and therefore
this size negotiation places a restriction on the construction and
maximum size of both COMPOUND requests and responses. If a COMPOUND
results in a reply at the server that is larger than can be sent in
an RDMA Send to the client, then the COMPOUND must terminate and the
operation which causes the overflow will provide a TOOSMALL error
status result.
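
The following non-normative C sketch illustrates one way a server
might enforce this negotiated limit while encoding COMPOUND results;
the types, names, and the ERR_TOOSMALL stand-in are assumptions made
for this example only, not definitions from this proposal.

   /* Non-normative sketch: bound COMPOUND reply encoding by the
    * session's negotiated maximum reply size.  All names here are
    * illustrative and are not taken from the protocol definition. */
   #include <stddef.h>

   #define NFS4_OK       0
   #define ERR_TOOSMALL  1   /* stand-in for the TOOSMALL status */

   struct op_result {
       size_t encoded_size;  /* size of this result when XDR-encoded */
       int    status;
   };

   /* Encode as many results as fit within 'max_reply'; mark the first
    * result that would overflow with the TOOSMALL status and stop.
    * Returns the number of operation results actually encoded. */
   size_t encode_compound_reply(struct op_result *results, size_t nops,
                                size_t max_reply)
   {
       size_t used = 0, i;

       for (i = 0; i < nops; i++) {
           if (used + results[i].encoded_size > max_reply) {
               results[i].status = ERR_TOOSMALL;  /* terminate COMPOUND */
               return i;
           }
           results[i].status = NFS4_OK;
           used += results[i].encoded_size;
       }
       return nops;
   }
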
9.3.5. RDMA Direct Transfer Model

Placement of data by explicitly tagged RDMA operations is referred to
as "direct" transfer. This method is typically used where the data
payload is relatively large, that is, when RDMA setup has been
performed prior to the operation, or when any overhead for setting up
and performing the transfer is regained by avoiding the overhead of
processing an ordinary receive.

The client advertises RDMA buffers in this proposed model, and not
the server. This means the "XDR Decoding with Read Chunks" described

skipping to change at page 145, line 37
   buffer    : +-----------------------------> :
             :                                 :
             : [Segment]                       :
   tagged    : v------------------------------ :  [RDMA Read]
   buffer    : +-----------------------------> :
             :                                 :
             : Direct Write Response           :
   untagged  : <------------------------------ :  Send (w/Inv.)
   buffer    :                                 :

9.4. Connection Models
There are three scenarios in which to discuss the connection model.
Each will be discussed individually, after describing the common case
encountered at initial connection establishment.

After a successful connection, the first request proceeds, in the
case of a new client association, to initial session creation, and
then optionally to session callback channel binding, prior to regular
operation.

skipping to change at page 146, line 46
discarding this state at the server may affect the correctness of the
server as seen by the client across network partitioning, such
discarding of state should be done only in a conservative manner.

Each client request to the server carries a new SEQUENCE operation
within each COMPOUND, which provides the session context. This
session context then governs the request control, duplicate request
caching, and other persistent parameters managed by the server for a
session.

9.4.1. TCP Connection Model
The following is a schematic diagram of the NFSv4.1 protocol
exchanges leading up to normal operation on a TCP stream.

   Client                                            Server

   TCPmode   : Create Clientid(nfs_client_id4) :     TCPmode
             : ------------------------------> :
             :                                 :
             : Clientid reply(clientid, ...)   :
             : <------------------------------ :

skipping to change at page 147, line 35
No net additional exchange is added to the initial negotiation by
this proposal. In the NFSv4.1 exchange, CREATECLIENTID replaces
SETCLIENTID (eliding the callback "clientaddr4" addressing) and
CREATESESSION subsumes the function of SETCLIENTID_CONFIRM, as
described elsewhere in this document. Callback channel binding is
optional, as in NFSv4.0. Note that the STREAM transport type is
shown above, but since the transport mode remains unchanged and
transport attributes are not necessarily exchanged, DEFAULT could
also be passed.

9.4.2. Negotiated RDMA Connection Model
One possible design which has been considered is to have a
"negotiated" RDMA connection model, supported via use of a session
bind operation as a required first step. However, due to issues
mentioned earlier, this proved problematic. This section remains as
a reminder of that fact, and it is possible such a mode can be
supported.

It is not considered critical that this be supported, for two
reasons. One, the session persistence provides a way for the server to

skipping to change at page 148, line 14
supports an automatic RDMA connection mode, no further support is
required from the NFSv4.1 protocol for reconnection.

Note that the client must provide at least as many RDMA Read
resources to its local queue for the benefit of the server when
reconnecting as it used when negotiating the session. If this value
is no longer appropriate, the client should resynchronize its session
state, destroy the existing session, and start over with the more
appropriate values.

9.4.3. Automatic RDMA Connection Model
The following is a schematic diagram of the NFSv4.1 protocol
exchanges performed on an RDMA connection.

   Client                                            Server

   RDMAmode  :                                 :     RDMAmode
             :                                 :
   Prepost   :                                 :     Prepost
   receive   :                                 :     receive
             :                                 :

skipping to change at page 148, line 44
             :                                 :     Prepost <=N'
             : Session reply(sessionid, size S',:    receives of
             :   maxreq N')                    :     size S'
             : <------------------------------ :
             :                                 :
             : <normal operation>              :
             : ------------------------------> :
             : <------------------------------ :
             :                                 :

9.5. Buffer Management, Transfer, Flow Control
Inline operations in NFSv4.1 behave effectively the same as TCP
sends. Procedure results are passed in a single message, whose
completion at the client signals the receiving process to inspect the
message.

RDMA operations are performed solely by the server in this proposal,
as described in the previous "RDMA Direct Model" section. Since
server RDMA operations do not result in a completion at the client,
and due to ordering rules in RDMA transports, after all required RDMA

skipping to change at page 151, line 25
most efficient allocation of resources on both peers. There is an
important requirement on reconnection: the sizes posted by the server
at reconnect must be at least as large as previously used, to allow
recovery. Any replies that are replayed from the server's duplicate
request cache must be able to be received into client buffers. In
the case where a client has received replies to all its retried
requests (and therefore received all its expected responses), the
client may disconnect and reconnect with different buffers at will,
since no cache replay will be required.
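
A minimal, non-normative sketch of the client-side decision described
above follows; the structure fields, and the assumption that a single
negotiated receive size governs replay, are illustrative only.

   /* Non-normative sketch: may the client reconnect with a smaller
    * receive buffer size?  Field names are hypothetical. */
   #include <stdbool.h>
   #include <stddef.h>

   struct session_state {
       size_t   negotiated_recv_size; /* size in use when session began */
       unsigned outstanding_retries;  /* retried requests without replies */
   };

   /* Replays from the server's duplicate request cache were sized for
    * the original buffers, so shrink only when no replay can occur. */
   bool can_reconnect_with(const struct session_state *s, size_t new_size)
   {
       if (new_size >= s->negotiated_recv_size)
           return true;                     /* growing is always safe */
       return s->outstanding_retries == 0;  /* no cache replay expected */
   }
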
9.6. Retry and Replay

NFSv4.0 forbids retransmission on active connections over reliable
transports; this includes connected-mode RDMA. This restriction must
be maintained in NFSv4.1.

If one peer were to retransmit a request (or reply), it would consume
an additional credit on the other. If the server retransmitted a
reply, it would certainly result in an RDMA connection loss, since
the client would typically only post a single receive buffer for each
request. If the client retransmitted a request, the additional

skipping to change at page 152, line 9
refreshed by the RPC layer.

Finally, RDMA fabrics do not guarantee that the memory handles
(Steering Tags) within each RDMA three-tuple are valid on a scope
outside that of a single connection. Therefore, handles used by the
direct operations become invalid after connection loss. The server
must ensure that any RDMA operations which must be replayed from the
request cache use the newly provided handle(s) from the most recent
request.

9.7. The Back Channel
The NFSv4 callback operations present a significant resource problem
for the RDMA-enabled client. Clearly, callbacks must be negotiated
in the way credits are for the ordinary operations channel for
requests flowing from client to server. But for callbacks to arrive
on the same RDMA endpoint as operation replies would require
dedicating additional resources, and specialized demultiplexing and
event handling. Or, callbacks may not require RDMA service at all
(they do not normally carry substantial data payloads). It is highly
desirable to streamline this critical path via a second

skipping to change at page 153, line 22
prepared for them.

There is one special case: that in which the back channel is in fact
bound to the operations channel's connection. This configuration
would normally be used over a TCP stream connection to exactly
implement the NFSv4.0 behavior, but over RDMA it would require
complex resource and event management at both sides of the
connection. The server is not required to accept such a bind request
on an RDMA connection for this reason, though it is recommended.

9.8. COMPOUND Sizing Issues
Very large responses may pose duplicate request cache issues. Since
servers will want to bound the storage required for such a cache, the
unlimited size of response data in COMPOUND may be troublesome. If
COMPOUND is used in all its generality, then the inclusion of certain
non-idempotent operations within a single COMPOUND request may render
the entire request non-idempotent. (For example, a single COMPOUND
request which read a file or symbolic link, then removed it, would be
obliged to cache the data in order to allow identical replay.)
Therefore, many requests might include operations that return any
amount of data.

It is not satisfactory for the server to reject COMPOUNDs at will
with NFS4ERR_RESOURCE when they pose such difficulties for the
server, as this results in serious interoperability problems.
Instead, any such limits must be explicitly exposed as attributes of
the session, ensuring that the server can explicitly support any
duplicate request cache needs at all times.

9.9. Data Alignment
A negotiated data alignment enables certain scatter/gather
optimizations. A facility for this is supported by [RPCRDMA]. Where
NFS file data is the payload, specific optimizations become highly
attractive.

Header padding is requested by each peer at session initiation, and
may be zero (no padding). Padding leverages the useful property that
RDMA receives preserve alignment of data, even when they are placed
into anonymous (untagged) buffers. If requested, client inline

skipping to change at page 155, line 6
the now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for
copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management. This includes most kernel-based
and integrated server designs, among many others. The client may
perform similar optimizations, if desired.

Padding is negotiated by the session creation operation, and
subsequently used by the RPC RDMA layer, as described in [RPCRDMA].
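
The following non-normative sketch shows how a peer might compute the
padding to request so that a WRITE payload lands on an aligned
boundary; the header length, the page-size alignment target, and the
helper name are assumptions made for illustration.

   /* Non-normative sketch: compute the inline header padding a client
    * might request so that WRITE data lands on an aligned boundary in
    * the receiver's buffer.  The header size and alignment target are
    * assumptions, not values defined by this proposal. */
   #include <stdint.h>
   #include <stdio.h>

   /* Pad 'header_len' up to the next multiple of 'align', which must
    * be a power of two (e.g. the receiver's page size). */
   static uint32_t pad_for_alignment(uint32_t header_len, uint32_t align)
   {
       return (align - (header_len & (align - 1))) & (align - 1);
   }

   int main(void)
   {
       uint32_t rpc_and_nfs_header = 148;   /* hypothetical encoded size */
       uint32_t page = 4096;                /* hypothetical alignment */

       printf("request %u bytes of padding\n",
              (unsigned)pad_for_alignment(rpc_and_nfs_header, page));
       return 0;
   }
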
9.10. NFSv4 Integration

The following section discusses the integration of the proposed RDMA
extensions with NFSv4.0.

9.10.1. Minor Versioning

Minor versioning is the existing facility to extend the NFSv4
protocol, and this proposal takes that approach.

Minor versioning of NFSv4 is relatively restrictive, and allows for
tightly limited changes only. In particular, it does not permit
adding new "procedures" (it permits adding only new "operations").
Interoperability concerns make it impossible to consider additional
layering to be a minor revision. This somewhat limits the changes
that can be proposed when considering extensions.

skipping to change at page 155, line 44
If sessions are in use for a given clientid, this same clientid
cannot be used for non-session NFSv4 operation, including NFSv4.0.
Because the server will have allocated session-specific state to the
active clientid, it would be an unnecessary burden on the server
implementor to support and account for additional, non-session
traffic, in addition to being of no benefit. Therefore this proposal
prohibits a single clientid from doing this. Nevertheless, employing
a new clientid for such traffic is supported.

9.10.2. Slot Identifiers and Server Duplicate Request Cache
The presence of deterministic maximum request limits on a session
enables in-progress requests to be assigned unique values with useful
properties.

The RPC layer provides a transaction ID (xid), which, while required
to be unique, is not especially convenient for tracking requests.
The transaction ID is only meaningful to the issuer (client); it
cannot be interpreted at the server except to test for equality with

skipping to change at page 159, line 17
the client's access to session resources. However, because of
request pipelining, the client may have active requests in flight
reflecting prior values; therefore, the server must not immediately
require the client to comply.

It is worthwhile to note that Sprite RPC [BW87] defined a "channel"
which in some ways is similar to the slotid proposed here. Sprite
RPC used channels to implement parallel request processing and
request/response cache retirement.
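
The following non-normative C sketch illustrates the kind of
slot-based duplicate request cache such limits make possible; the
slot count, entry layout, and retirement policy shown are assumptions
of this example, not requirements of this proposal.

   /* Non-normative sketch: a server duplicate request cache indexed by
    * slotid.  One entry is retained per slot; a repeated sequenceid
    * means "replay the cached reply", the next sequenceid means "new
    * request, retire the previous entry". */
   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   #define MAX_SLOTS 64                 /* assumed negotiated slot count */

   struct slot_entry {
       uint32_t sequenceid;             /* last sequenceid seen on slot */
       bool     in_use;
       char     cached_reply[512];      /* stand-in for the cached reply */
   };

   static struct slot_entry slot_table[MAX_SLOTS];

   enum drc_action { DRC_NEW, DRC_REPLAY, DRC_MISORDERED };

   enum drc_action drc_check(uint32_t slotid, uint32_t sequenceid)
   {
       struct slot_entry *e;

       if (slotid >= MAX_SLOTS)
           return DRC_MISORDERED;               /* slot out of range */
       e = &slot_table[slotid];
       if (e->in_use && sequenceid == e->sequenceid)
           return DRC_REPLAY;                   /* resend cached reply */
       if (!e->in_use || sequenceid == e->sequenceid + 1) {
           e->in_use = true;                    /* retire prior entry */
           e->sequenceid = sequenceid;
           memset(e->cached_reply, 0, sizeof(e->cached_reply));
           return DRC_NEW;                      /* execute, then cache */
       }
       return DRC_MISORDERED;                   /* protocol violation */
   }
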
9.10.3. Resolving Server Callback Races with Sessions

It is possible for server callbacks to arrive at the client before
the reply from related forward channel operations. For example, a
client may have been granted a delegation to a file it has opened,
but the reply to the OPEN (informing the client of the granting of
the delegation) may be delayed in the network. If a conflicting
operation arrives at the server, it will recall the delegation using
the callback channel, which may be on a different TCP connection,
perhaps even a different network. If the callback request arrives
before the related reply, the client may reply to the server with an

skipping to change at page 160, line 23
Therefore, for each client operation which might result in some sort
of server callback, the server should "remember" the { slotid,
sequenceid } pair of the client request until the slotid retirement
rules allow the server to determine that the client has, in fact,
seen the server's reply. During this time, any recalls of the
associated object should carry these identifiers, for the benefit of
the client. After this time, it is not necessary for the server to
provide this information in related callbacks, since it is certain
that a race condition can no longer occur.
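
A non-normative client-side sketch of using these identifiers
follows; the table size and the simplified comparison (which ignores
sequenceid wraparound) are assumptions made for brevity.

   /* Non-normative client-side sketch: a recall carrying the
    * { slotid, sequenceid } of the originating forward request is held
    * until the reply to that request has been processed. */
   #include <stdbool.h>
   #include <stdint.h>

   #define MAX_SLOTS 64                  /* assumed slot table size */

   /* Highest sequenceid on each slot for which a reply has been seen;
    * maintained by the client's forward channel completion path. */
   static uint32_t last_replied_seq[MAX_SLOTS];

   /* Returns true if the recall may be processed now, false if it
    * should be queued until the related reply (e.g. the OPEN granting
    * the delegation) arrives. */
   bool recall_ready(uint32_t slotid, uint32_t sequenceid)
   {
       if (slotid >= MAX_SLOTS)
           return true;                 /* no referent: process normally */
       return last_replied_seq[slotid] >= sequenceid;
   }
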
9.10.4. COMPOUND and CB_COMPOUND

Support for per-operation control can be piggybacked onto NFSv4
COMPOUNDs with full transparency, by placing such facilities into
their own, new operation, and placing this operation first in each
COMPOUND under the new NFSv4 minor protocol revision. The contents
of the operation would then apply to the entire COMPOUND.

Recall that the NFSv4 minor revision is contained within the COMPOUND
header, encoded prior to the COMPOUNDed operations. By simply
requiring that the new operation always be contained in NFSv4 minor

skipping to change at page 161, line 25
   //-----------------------+----
   // status + op + results | ...
   //-----------------------+----

The single control operation within each NFSv4.1 COMPOUND defines the
context and operational session parameters which govern that COMPOUND
request and reply. Placing it first in the COMPOUND encoding is
required in order to allow its processing before other operations in
the COMPOUND.
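
The following non-normative sketch shows the kind of check a server
might apply to enforce this ordering; the opcode and error values are
placeholders, not assignments made by this document.

   /* Non-normative sketch: require the session control operation to be
    * the first operation of a minor version 1 COMPOUND.  The opcode
    * and error values are placeholders. */
   #include <stdint.h>

   #define OP_CONTROL          53       /* placeholder opcode */
   #define NFS4_OK              0
   #define ERR_BAD_CONTROL_POS  1       /* placeholder error */

   struct compound_args {
       uint32_t  minorversion;
       uint32_t  numops;
       uint32_t *opcodes;               /* opcode of each encoded op */
   };

   int check_compound_start(const struct compound_args *args)
   {
       if (args->minorversion == 0)
           return NFS4_OK;              /* v4.0 COMPOUNDs are unchanged */
       if (args->numops == 0 || args->opcodes[0] != OP_CONTROL)
           return ERR_BAD_CONTROL_POS;  /* control op must come first */
       return NFS4_OK;
   }
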
9.10.5. eXternal Data Representation Efficiency

RDMA is a copy avoidance technology, and it is important to maintain
this efficiency when decoding received messages. Traditional XDR
implementations frequently use generated unmarshaling code to convert
objects to local form, incurring a data copy in the process (in
addition to subjecting the caller to recursive calls, etc.). Often,
such conversions are carried out even when no size or byte order
conversion is necessary.
It is recommended that implementations pay close attention to the
details of memory referencing in such code. It is far more efficient
to inspect data in place, using native facilities to deal with word
size and byte order conversion into registers or local variables,
rather than formally (and blindly) performing the operation via
fetch, reallocate and store.

Of particular concern is the result of the READDIR operation, in
which such encoding abounds.
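
As a non-normative illustration of in-place inspection, the following
C fragment reads a single 32-bit XDR field directly from a receive
buffer; the buffer layout and field offset are assumptions supplied
by the caller in this example.

   /* Non-normative sketch: inspect an XDR-encoded 32-bit field in
    * place, rather than unmarshaling the whole message into a separate
    * copy. */
   #include <stdint.h>
   #include <string.h>
   #include <arpa/inet.h>               /* ntohl() */

   /* Load the big-endian word at 'offset' directly from the receive
    * buffer; only the four bytes of interest are touched. */
   static uint32_t xdr_peek_u32(const uint8_t *buf, size_t offset)
   {
       uint32_t be;

       memcpy(&be, buf + offset, sizeof(be));   /* alignment-safe load */
       return ntohl(be);
   }
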
9.10.6. Effect of Sessions on Existing Operations

The use of a session replaces the use of the SETCLIENTID and
SETCLIENTID_CONFIRM operations, and allows certain simplification of
the RENEW and callback addressing mechanisms in the base protocol.

The cb_program and cb_location which are obtained by the server in
SETCLIENTID_CONFIRM must not be used by the server, because the
NFSv4.1 client performs callback channel designation with
BIND_BACKCHANNEL. Therefore the SETCLIENTID and SETCLIENTID_CONFIRM
operations become obsolete when sessions are in use, and a server

skipping to change at page 162, line 42
An interesting issue arises, however, if an error occurs on such a
SEQUENCE operation. If the SEQUENCE operation fails, perhaps due to
an invalid slotid or other non-renewal-based issue, the server may or
may not have performed the RENEW. In this case, the state of any
renewal is undefined, and the client should make no assumption that
it has been performed. In practice, this should not occur, but even
if it did, it is expected that the client would perform some sort of
recovery which would result in a new, successful SEQUENCE operation
being run, assuring the client that the renewal took place.

9.10.7. Authentication Efficiencies
NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor
[RFC2203] to provide authentication, integrity, and privacy via
cryptography. The server dictates to the client the use of
RPCSEC_GSS, the service (authentication, integrity, or privacy), and
the specific GSS-API security mechanism that each remote procedure
call and result will use.

If the connection's integrity is protected by means other than
RPCSEC_GSS, such as IPsec, then the use of RPCSEC_GSS's

skipping to change at page 163, line 37
GSS-API context created previously over another GSS-API mechanism.
NFSv4.1 clients and servers should support CCM, and they must use as
the cookie the handle from a successful RPCSEC_GSS context creation
over a non-CCM mechanism (such as Kerberos V5). The value of the
cookie will be equal to the handle field of the rpc_gss_init_res
structure from the RPCSEC_GSS specification.

The [CCM] Draft provides further discussion and examples.

9.11. Sessions Security Considerations
NFSv4 minor version 1 retains all of the existing NFSv4 security; all
security considerations present in NFSv4.0 apply to it equally.

Security considerations of any underlying RDMA transport are
additionally important, all the more so due to the emerging nature of
such transports. Examining these issues is outside the scope of this
draft.

When protecting a connection with RPCSEC_GSS, all data in each

skipping to change at page 165, line 5
The proposed session callback channel binding improves security over
that provided by NFSv4 for the callback channel. The connection is
client-initiated, and subject to the same firewall and routing checks
as the operations channel. The connection cannot be hijacked by an
attacker who connects to the client port prior to the intended
server. The connection is set up by the client with its desired
attributes, such as optionally securing with IPsec or similar. The
binding is fully authenticated before being activated.

9.11.1. Authentication
Proper authentication of the principal which issues any session and
clientid in the proposed NFSv4.1 operations exactly follows the
similar requirement on client identifiers in NFSv4.0. It must not be
possible for a client to impersonate another by guessing its session
identifiers for NFSv4.1 operations, nor to bind a callback channel to
an existing session. To protect against this, NFSv4.0 requires
appropriate authentication and matching of the principal used. This
is discussed in Section 16, Security Considerations, of [RFC3530].
The same requirement when using a session identifier applies to

skipping to change at page 166, line 8
context for the server to use for this contingency.

The server should take care to protect itself against
denial-of-service attacks in the creation of sessions and clientids.
Clients that connect and create sessions, only to disconnect and
never use them, may leave significant state behind. (The same issue
applies to NFSv4.0 with clients who may perform SETCLIENTID, then
never perform SETCLIENTID_CONFIRM.) Careful authentication coupled
with resource checks is highly recommended.
10. Multi-server Name Space
NFSv4.1 supports attributes that allow a namespace to extend beyond
the boundaries of a single server. Use of such multi-server
namespaces is optional, and for many purposes, single-server
namespaces are perfectly acceptable. Use of multi-server namespaces
can provide many advantages, however, by separating a file system's
logical position in a name space from the (possibly changing)
logistical and administrative considerations that result in
particular file systems being located on particular servers.
10.1. Location attributes
NFSv4 contains recommended attributes that allow file systems on one
server to be associated with one or more instances of that file
system on other servers. These attributes specify such file systems
by specifying a server name (either a DNS name or an IP address)
together with the path of that filesystem within that server's
single-server name space.
The fs_locations_info recommended attribute allows specification of
one or more file system locations where the data corresponding to a
given file system may be found. This attribute provides to the
client, in addition to information about file system locations,
extensive information about the various file system choices (e.g.
priority for use, writability, currency, etc.) as well as information
to help the client efficiently effect as seamless a transition as
possible among multiple file system instances, when and if that
should be necessary.
The fs_locations recommended attribute is inherited from NFSv4.0 and
only allows specification of the file system locations where the data
corresponding to a given file system may be found. Servers should
make this attribute available whenever fs_locations_info is
supported, but client use of fs_locations_info is to be preferred.
10.2. File System Presence or Absence
A given location in an NFSv4 namespace (typically but not necessarily
a multi-server namespace) can have a number of file system locations
associated with it (via the fs_locations or fs_locations_info
attribute). There may also be an actual current file system at that
location, accessible via normal namespace operations (e.g. LOOKUP).
In this case, the file system is said to be "present" at that
position in the namespace and clients will typically use it,
reserving use of additional locations specified via the location-
related attributes to situations in which the principal location is
no longer available.
When there is no actual filesystem at the namespace location in
question, the file system is said to be "absent". An absent file
system contains no files or directories other than the root and any
reference to it, except to access a small set of attributes useful in
determining alternate locations, will result in an error,
NFS4ERR_MOVED. Note that if the server ever returns NFS4ERR_MOVED
(i.e. file systems may be absent), it MUST support the fs_locations
attribute and SHOULD support the fs_locations_info and fs_absent
attributes.
While the error name suggests that we have a case of a file system
which once was present, and has only become absent later, this is
only one possibility. A position in the namespace may be permanently
absent, with the file system(s) designated by the location attributes
being its only realization. The name NFS4ERR_MOVED reflects an earlier,
more limited conception of its function, but this error will be
returned whenever the referenced file system is absent, whether it
has moved or not.
Except in the case of GETATTR-type operations (to be discussed
later), when the current filehandle at the start of an operation is
within an absent file system, that operation is not performed and the
error NFS4ERR_MOVED is returned, to indicate that the filesystem is
absent on the current server.
Because a GETFH cannot succeed if the current filehandle is within
an absent file system, filehandles within an absent file system cannot
be transferred to the client. When a client does have filehandles
within an absent file system, it is the result of obtaining them when
the file system was present, and having the file system become absent
subsequently.
It should be noted that because the check for the current filehandle
being within an absent filesystem happens at the start of every
operation, operations which change the current filehandle so that it
is within an absent filesystem will not result in an error. This
allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be
used to get attribute information, particularly location attribute
information, as discussed below.
The recommended file system attribute fs_absent can be used to
interrogate the present/absent status of a given file system.
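
The following non-normative client-side sketch illustrates the use of
these attributes after an NFS4ERR_MOVED error; the attribute bit
values and the COMPOUND helper below are placeholders, not
definitions from this document.

   /* Non-normative client-side sketch: after NFS4ERR_MOVED, interrogate
    * the absent file system with a GETATTR restricted to the
    * location-related attributes. */
   #include <stdint.h>
   #include <stdio.h>

   #define ATTR_FS_LOCATIONS       (1u << 0)    /* placeholder bit */
   #define ATTR_FS_LOCATIONS_INFO  (1u << 1)    /* placeholder bit */
   #define ATTR_FS_ABSENT          (1u << 2)    /* placeholder bit */

   /* Stub standing in for the client's COMPOUND machinery; it would
    * issue PUTFH(fh) followed by GETATTR(mask). */
   static int compound_putfh_getattr(const void *fh, uint32_t mask)
   {
       (void)fh;
       printf("GETATTR with mask 0x%x on the absent file system\n",
              (unsigned)mask);
       return 0;
   }

   int probe_absent_fs(const void *fh)
   {
       /* Only the location-related attributes may be requested of an
        * absent file system; requesting others would again produce
        * NFS4ERR_MOVED. */
       uint32_t mask = ATTR_FS_LOCATIONS | ATTR_FS_LOCATIONS_INFO |
                       ATTR_FS_ABSENT;

       return compound_putfh_getattr(fh, mask);
   }
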
10.3. Getting Attributes for an Absent File System
When a file system is absent, most attributes are not available, but
it is necessary to allow the client access to the small set of
attributes that are available, and most particularly those that give
information about the correct current locations for this file system,
fs_locations and fs_locations_info.
10.3.1. GETATTR Within an Absent File System
As mentioned above, an exception is made for GETATTR in that
attributes may be obtained for a filehandle within an absent file
system. This exception only applies if the attribute mask contains
at least one attribute bit that indicates the client is interested in
a result regarding an absent file system: fs_locations,
fs_locations_info, or fs_absent. If none of these attributes is
requested, GETATTR will result in an NFS4ERR_MOVED error.
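
A non-normative sketch of this server-side check follows; the
attribute bit values are placeholders standing in for the actual
attribute numbers.

   /* Non-normative server-side sketch: decide whether a GETATTR whose
    * current filehandle lies within an absent file system may proceed.
    * The attribute bit values are placeholders. */
   #include <stdbool.h>
   #include <stdint.h>

   #define ATTR_FS_LOCATIONS       (1u << 0)    /* placeholder bit */
   #define ATTR_FS_LOCATIONS_INFO  (1u << 1)    /* placeholder bit */
   #define ATTR_FS_ABSENT          (1u << 2)    /* placeholder bit */

   /* The GETATTR proceeds only if at least one location-related
    * attribute is requested; otherwise it fails with NFS4ERR_MOVED. */
   bool absent_fs_getattr_allowed(uint32_t requested_mask)
   {
       const uint32_t location_attrs = ATTR_FS_LOCATIONS |
                                       ATTR_FS_LOCATIONS_INFO |
                                       ATTR_FS_ABSENT;

       return (requested_mask & location_attrs) != 0;
   }
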
When a GETATTR is done on an absent file system, the set of supported
attributes is very limited. Many attributes, including those that
are normally mandatory, will not be available on an absent file
system. In addition to the attributes mentioned above (fs_locations,
fs_locations_info, fs_absent), the following attributes SHOULD be
available on absent file systems, in the case of recommended
attributes at least to the same degree that they are available on
present file systems.
change: This attribute is useful for absent file systems and can be
helpful in summarizing to the client when any of the location-
related attributes changes.
fsid: This attribute should be provided so that the client can
determine file system boundaries, including, in particular, the
boundary between present and absent file systems.
mounted_on_fileid: For objects at the top of an absent file system
this attribute needs to be available. Since the fileid is one
which is within the present parent file system, there should be no
need to reference the absent file system to provide this
information.
Other attributes SHOULD NOT be made available for absent file
systems, even when it is possible to provide them. The server should
not assume that more information is always better and should avoid
gratuitously providing additional information.
When a GETATTR operation includes a bit mask for one of the
attributes fs_locations, fs_locations_info, or fs_absent, but where the
bit mask includes attributes which are not supported, GETATTR will
not return an error, but will return the mask of the actual
attributes supported with the results.
Handling of VERIFY/NVERIFY is similar to GETATTR in that if the
attribute mask does not include fs_locations, fs_locations_info, or
fs_absent, the error NFS4ERR_MOVED will result. It differs in that any
appearance in the attribute mask of an attribute not supported for an
absent file system (and note that this will include some normally
mandatory attributes) will also cause an NFS4ERR_MOVED result.
10.3.2. READDIR and Absent File Systems
A READDIR performed when the current filehandle is within an absent
file system will result in an NFS4ERR_MOVED error, since, unlike the
case of GETATTR, no such exception is made for READDIR.
Attributes for an absent file system may be fetched via a READDIR for
a directory in a present file system, when that directory contains
the root directories of one or more absent filesystems. In this
case, the handling is as follows:
o If the attribute set requested includes one of the attributes
fs_locations, fs_locations_info, or fs_absent, then fetching of
attributes proceeds normally and no NFS4ERR_MOVED indication is
returned, even when the rdattr_error attribute is requested.
o If the attribute set requested does not include one of the
attributes fs_locations, fs_locations_info, or fs_absent, then if
the rdattr_error attribute is requested, each directory entry for
the root of an absent file system will report NFS4ERR_MOVED as
the value of the rdattr_error attribute.
o If the attribute set requested does not include any of the
attributes fs_locations, fs_locations_info, fs_absent, or
rdattr_error, then the occurrence of the root of an absent file
system within the directory will result in the READDIR failing
with an NFS4ERR_MOVED error.
o The unavailability of an attribute because of a file system's
absence, even one that is ordinarily mandatory, does not result in
any error indication. The set of attributes returned for the root
directory of the absent filesystem in that case is simply
restricted to those actually available.
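
The per-entry handling described above can be summarized by the
following non-normative sketch; the attribute bits and action names
are placeholders used only for illustration.

   /* Non-normative sketch of the per-entry READDIR decision for an
    * entry that is the root of an absent file system.  The attribute
    * bit values are placeholders. */
   #include <stdint.h>

   #define ATTR_FS_LOCATIONS       (1u << 0)    /* placeholder bit */
   #define ATTR_FS_LOCATIONS_INFO  (1u << 1)    /* placeholder bit */
   #define ATTR_FS_ABSENT          (1u << 2)    /* placeholder bit */
   #define ATTR_RDATTR_ERROR       (1u << 3)    /* placeholder bit */

   enum entry_action {
       ENTRY_NORMAL,        /* return the attributes that are available */
       ENTRY_RDATTR_ERROR,  /* report NFS4ERR_MOVED via rdattr_error */
       ENTRY_FAIL_READDIR   /* fail the READDIR with NFS4ERR_MOVED */
   };

   enum entry_action absent_root_entry(uint32_t requested_mask)
   {
       const uint32_t location_attrs = ATTR_FS_LOCATIONS |
                                       ATTR_FS_LOCATIONS_INFO |
                                       ATTR_FS_ABSENT;

       if (requested_mask & location_attrs)
           return ENTRY_NORMAL;
       if (requested_mask & ATTR_RDATTR_ERROR)
           return ENTRY_RDATTR_ERROR;
       return ENTRY_FAIL_READDIR;
   }
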
10.4. Uses of Location Information
The location-bearing attributes (fs_locations and fs_locations_info),
provide, together with the possibility of absent filesystems, a
number of important facilities in providing reliable, manageable, and
scalable data access.
When a file system is present, these attributes can provide
alternative locations, to be used to access the same data, in the
event that server failures, communications problems, or other
difficulties, make continued access to the current file system
impossible or otherwise impractical. Provision of such alternate
locations is referred to as "replication" although there are cases in
which replicated sets of data are not in fact present, and the
replicas are instead different paths to the same data.
When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, at an
alternate location. In this case, a continued attempt to use the
data in the now-absent file system will result in an NFS4ERR_MOVED
error and at that point the successor locations (typically only one
but multiple choices are possible) can be fetched and used to
continue access. Transfer of the file system contents to the new
location is referred to as "migration", but it should be kept in mind
that there are cases in which this term can be used, like
"replication" when there is no actual data migration per se.
Where a file system was not previously present, specification of file
system location provides a means by which file systems located on one
server can be associated with a name space defined by another server,
thus allowing a general multi-server namespace facility. Designation
of such a location, in place of an absent filesystem, is called
"referral".
10.4.1. File System Replication
The fs_locations and fs_locations_info attributes provide alternative
locations, to be used to access data in place of the current file
system. On first access to a filesystem, the client should obtain
the value of the set of alternate locations by interrogating the
fs_locations or fs_locations_info attribute, with the latter being
preferred.
In the event that server failures, communications problems, or other
difficulties make continued access to the current file system
impossible or otherwise impractical, the client can use the alternate
locations as a way to get continued access to its data.
The alternate locations may be physical replicas of the (typically
read-only) file system data, or they may reflect alternate paths to
the same server or provide for the use of various forms of server
clustering in which multiple servers provide alternate ways of
accessing the same physical file system. How these different modes
of file system transition are represented within the fs_locations and
fs_locations_info attributes and how the client deals with file
system transition issues will be discussed in detail below.
10.4.2. File System Migration
When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, at an
alternate location, as specified by the fs_locations or
fs_locations_info attribute. Typically, a client will be accessing
the file system in question, get an NFS4ERR_MOVED error, and then
use the fs_locations or fs_locations_info attribute to determine the
new location of the data. When fs_locations_info is used, additional
information will be available which will define the nature of the
client's handling of the transition to a new server.
Such migration can be helpful in providing load balancing or general
resource reallocation. The protocol does not specify how the
filesystem will be moved between servers. It is anticipated that a
number of different server-to-server transfer mechanisms might be
used with the choice left to the server implementor. The NFSv4.1
protocol specifies the method used to communicate the migration event
between client and server.
The new location may be an alternate communication path to the same
server, or, in the case of various forms of server clustering,
another server providing access to the same physical file system.
The client's responsibilities in dealing with this transition depend
on the specific nature of the new access path and how and whether
data was in fact migrated. These issues will be discussed in detail
below.
Although a single successor location is typical, multiple locations
may be provided, together with information in the fs_locations_info
attribute that allows the relative priority of the choices to be
indicated.  Where suitable clustering mechanisms make it possible to
provide multiple identical file systems or paths to them, this allows
the client the opportunity to deal with any resource or
communications issues that might limit data availability.
10.4.3. Referrals
Referrals provide a way of placing a file system in a location
essentially without respect to its physical location on a given
server.  This allows a single server or a set of servers to present a
multi-server namespace that encompasses filesystems located on
multiple servers. Some likely uses of this include establishment of
site-wide or organization-wide namespaces, or even knitting such
together into a truly global namespace.
Referrals occur when a client determines, upon first referencing a
position in the current namespace, that it is part of a new file
system and that that file system is absent. When this occurs,
typically by receiving the error NFS4ERR_MOVED, the actual location
or locations of the file system can be determined by fetching the
fs_locations or fs_locations_info attribute.
Use of multi-server namespaces is enabled by NFSv4 but is not
required.  The use of multi-server namespaces and their scope will
depend on the applications used and on system administration
preferences.
Multi-server namespaces can be established by a single server
providing a large set of referrals to all of the included
filesystems. Alternatively, a single multi-server namespace may be
administratively segmented with separate referral file systems (on
separate servers) for each separately-administered section of the
name space. Any segment or the top-level referral file system may
use replicated referral file systems for higher availability.
10.5. Additional Client-side Considerations
When clients make use of servers that implement referrals and
migration, care should be taken so that a user who mounts a given
filesystem that includes a referral or a relocated filesystem
continues to see a coherent picture of that user-side filesystem,
despite the fact that it contains a number of server-side filesystems
which may be on different servers.
One important issue is upward navigation from the root of a server-
side filesystem to its parent (specified as ".." in UNIX). The
client needs to determine when it reaches an fsid root going up the
filetree.  When it is at such a point and needs to ascend to the
parent, it must do so locally instead of sending a LOOKUPP call to the
server. The LOOKUPP would normally return the ancestor of the target
filesystem on the target server, which may not be part of the space
that the client mounted.
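As a non-normative sketch (the client-side types and the helper below
are purely hypothetical and not part of the protocol), a client might
compare the cached fsid of a directory with that of its parent to
decide whether ".." must be resolved locally rather than by sending
LOOKUPP:

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical client-side copy of the fsid attribute. */
   struct client_fsid {
           uint64_t major;
           uint64_t minor;
   };

   /* Returns true when "dir" is the root of a server-side filesystem
    * within the client's mount, i.e. its fsid differs from that of
    * its parent; in that case the client ascends within its own
    * cached namespace instead of sending LOOKUPP to the server. */
   static bool must_ascend_locally(const struct client_fsid *dir,
                                   const struct client_fsid *parent)
   {
           return dir->major != parent->major ||
                  dir->minor != parent->minor;
   }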
Another issue concerns refresh of referral locations. When referrals
are used extensively, they may change as server configurations
change. It is expected that clients will cache information related
to traversing referrals so that future client side requests are
resolved locally without server communication. This is usually
rooted in client-side name lookup caching. Clients should
periodically purge this data for referral points in order to detect
changes in location information. When the change attribute changes
for directories that hold referral entries or for the referral
entries themselves, clients should consider any associated cached
referral information to be out of date.
10.6. Effecting File System Transitions
Transitions between file system instances, whether due to switching
between replicas upon server unavailability or in response to a
server-initiated migration event, are best dealt with together.  Even
though the prototypical use cases of replication and migration
contain distinctive sets of features, when all possibilities for
these operations are considered, their underlying unity, from the
client's point of view, is clear, even though pragmatic
considerations will normally force the server to adopt different
implementation strategies for planned and unplanned transitions.
A number of methods are possible for servers to replicate data and to
track client state in order to allow clients to transition between
file system instances with a minimum of disruption. Such methods
range from those that use inter-server clustering techniques to
limit the changes seen by the client, to those that are less
aggressive, use more standard methods of replicating data, and impose
a greater burden on the client to adapt to the transition.
The NFSv4.1 protocol does not impose choices on clients and servers
with regard to that spectrum of transition methods. In fact, there
are many valid choices, depending on client and application
requirements and their interaction with server implementation
choices. The NFSv4.1 protocol does define the specific choices that
can be made, how these choices are communicated to the client and how
the client is to deal with any discontinuities.
In the sections below references will be made to various possible
server implementation choices as a way of illustrating the transition
scenarios that clients may deal with. The intent here is not to
define or limit server implementations but rather to illustrate the
range of issues that clients may face.
In the discussion below, references will be made to a file system
having a particular property or to two file systems (typically the
source and destination) belonging to a common class of any of several
types.  Two file systems that belong to such a class share some
important aspect of file system behavior that clients may depend
upon, when present, to easily effect a seamless transition between file
system instances. Conversely, where the file systems do not belong
to such a common class, the client has to deal with various sorts of
implementation discontinuities which may cause performance or other
issues in effecting a transition.
Where the fs_locations_info attribute is available, such file system
classification data will be made directly available to the client.
See Section 10.10 for details. When only fs_locations is available,
default assumptions with regard to such classifications have to be
inferred. See Section 10.9 for details.
In cases in which one server is expected to accept opaque values from
the client that originated from another server, it is a wise
implementation practice for the servers to encode the "opaque" values
in network byte order. If this is done, servers acting as replicas
or immigrating filesystems will be able to parse values like
stateids, directory cookies, filehandles, etc. even if their native
byte order is different from that of other servers cooperating in the
replication and migration of the filesystem.
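The following non-normative sketch (in C, with hypothetical helper
names) shows one way such a byte-order-independent encoding of a
64-bit directory cookie might be performed; cooperating servers that
encode and decode this way will interpret the value identically
regardless of their native byte order.

   #include <stdint.h>

   /* Encode a 64-bit directory cookie in a fixed (big-endian) byte
    * order for exchange between cooperating servers. */
   static void cookie_encode(uint64_t cookie, unsigned char buf[8])
   {
           int i;

           for (i = 0; i < 8; i++)
                   buf[i] = (unsigned char)(cookie >> (56 - 8 * i));
   }

   /* Decode a cookie previously produced by cookie_encode(). */
   static uint64_t cookie_decode(const unsigned char buf[8])
   {
           uint64_t cookie = 0;
           int i;

           for (i = 0; i < 8; i++)
                   cookie = (cookie << 8) | buf[i];
           return cookie;
   }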
10.6.1. Transparent File System Transitions
Discussion of transition possibilities will start at the most
transparent end of the spectrum.  When there are
multiple paths to a single server, and there are network problems
that force another path to be used, or when a path is to be put out
of service, a replication or migration event may occur without any
real replication or migration. Nevertheless, such events fit within
the same general framework in that there is a transition between file
system locations, communicated just as other, less transparent
transitions are communicated.
Some transparent transitions may happen independent of location
information, in that a specific host name may map to several IP
addresses, allowing session trunking to provide alternate paths.  In
other cases, however, multiple addresses may have separate location
entries for specific file systems, so as to preferentially direct
traffic for those file systems to certain server addresses.  Changes
among such addresses, whether planned or unplanned, correspond to a
nominal replication or migration event.
The specific details of the transition depend on file system
equivalence class information (as provided by the fs_locations_info
and fs_locations attributes).
o Where the old and new filesystems belong to the same _endpoint_
class, the transition consists of creating a new connection which
is associated with the existing session to the old server
endpoint. Where a connection cannot be associated with the
existing session, the target server must be able to recognize the
sessionid as invalid and force creation of a new session or a new
clientid.
o Where the old and new filesystems do not belong to the same
_endpoint_ class, but to the same _server_ class, the transition
consists of creating a new session, associated with the existing
clientid.  Where the clientid is stale, the target server must be
able to recognize the clientid as no longer valid and force creation
of a new clientid.
In either of the above cases, the file system may be shown as
belonging to the same _sharing_ class, allowing the alternate
session or connection to be established in advance and used either to
accelerate the file system transition when necessary (avoiding
connection latency), or to provide higher performance by actively
using multiple paths simultaneously.
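The choice among these cases can be summarized by the following
non-normative sketch (in C, with hypothetical names); it is
illustrative only and does not constrain client or server
implementations.

   /* Hypothetical summary of the transition action implied by the
    * equivalence-class information described above. */
   enum transition_action {
           ADD_CONNECTION_TO_SESSION,    /* same _endpoint_ class    */
           CREATE_SESSION_SAME_CLIENTID, /* same _server_ class only */
           FULL_TRANSITION               /* neither; see later rules */
   };

   static enum transition_action
   choose_transition_action(int same_endpoint_class,
                            int same_server_class)
   {
           if (same_endpoint_class)
                   return ADD_CONNECTION_TO_SESSION;
           if (same_server_class)
                   return CREATE_SESSION_SAME_CLIENTID;
           return FULL_TRANSITION;
   }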
When two file systems belong to the same _endpoint_ class, or
_sharing_ class, many transition issues are eliminated, and any
information indicating otherwise is ignored as erroneous.
In all such transparent transition cases, the following apply:
o File handles stay the same if persistent and, if volatile, are only
subject to expiration if they would be in the absence of file
system transition.
o Fileid values do not change across the transition.
o The file system will have the same fsid in both the old and new
locations.
o Change attribute values are consistent across the transition and
do not have to be refetched. When change attributes indicate that
a cached object is still valid, it can remain cached.
o Session, client, and state identifiers retain their validity across
the transition, except where their staleness is recognized and
reported by the new server. Except where such staleness requires
it, no lock reclamation is needed.
o Write verifiers are presumed to retain their validity and can be
presented to COMMIT, with the expectation that if COMMIT on the
new server accepts them as valid, then that server has all of the
data unstably written to the original server and has committed it
to stable storage as requested.
10.6.2. Filehandles and File System Transitions
There are a number of ways in which filehandles can be handled across
a file system transition. These can be divided into two broad
classes depending upon whether the two file systems across which the
transition happens share sufficient state to effect some sort of
continuity of filesystem handling.
When there is no such co-operation in filehandle assignment, the two
file systems are reported as being in different _handle_ classes. In
this case, all filehandles are assumed to expire as part of the file
system transition.  Note that this behavior does not depend on the
fh_expire_type attribute and supersedes the specification of the
FH4_VOL_MIGRATION bit, which only affects behavior when
fs_locations_info is not available.
When there is co-operation in filehandle assignment, the two file
systems are reported as being in the same _handle_ class.  In this
case, persistent filehandles remain valid after the file system
transition, while volatile filehandles (excluding those which are
only volatile due to the FH4_VOL_MIGRATION bit) are subject to
expiration on the target server.
10.6.3. Fileid's and File System Transitions
In NFSv4.0, the issue of continuity of fileid's in the event of a
file system transition was not addressed. The general expectation
had been that in situations in which the two filesystem instances are
created by a single vendor using some sort of filesystem image copy,
fileid's will be consistent across the transition while in the
analogous multi-vendor transitions they will not. This poses
difficulties, especially for the client without special knowledge of
the transition mechanisms adopted by the server.
It is important to note that while clients themselves may have no
trouble with a fileid changing as a result of a file system
transition event, applications do typically have access to the fileid
(e.g. via stat), and the result of this is that an application may
work perfectly well if there is no filesystem instance transition or
if any such transition is among instances created by a single vendor,
yet be unable to deal with the situation in which a multi-vendor
transition occurs at the wrong time.
Providing the same fileid's in a multi-vendor (multiple server
vendors) environment has generally been held to be quite difficult.
While there is work to be done, it needs to be pointed out that this
difficulty is partly self-imposed. Servers have typically identified
fileid with inode number, i.e. with a quantity used to find the file
in question. This identification poses special difficulties for
migration of an fs between vendors where assigning the same index to
a given file may not be possible.  Note here that a fileid is not
required to be useful in finding the file in question, only to be
unique within the given fs.  Servers prepared to accept a fileid
as a single piece of metadata and store it apart from the value used
to index the file information can relatively easily maintain a fileid
value across a migration event, allowing a truly transparent
migration event.
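As a purely hypothetical illustration (no particular metadata layout
is required or implied by this protocol), a server could carry the
reported fileid as an ordinary piece of per-file metadata, distinct
from whatever key it uses internally to locate the file:

   #include <stdint.h>

   /* Hypothetical per-file metadata: the value returned in the
    * fileid attribute is stored separately from the internal lookup
    * key (e.g. an inode number), so it can be preserved unchanged
    * when the filesystem is migrated to a server with a different
    * internal indexing scheme. */
   struct file_metadata {
           uint64_t internal_index;   /* used by this server to find
                                         the file */
           uint64_t reported_fileid;  /* unique within the fs; carried
                                         across a migration event */
   };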
In any case, where servers can provide continuity of fileids, they
should do so, and the client should be able to find out that such
continuity is available and take appropriate action.  Information
about the continuity (or lack thereof) of fileid's across a file
system transition is represented by specifying whether the file
systems in question are of the same _fileid_ class.
10.6.4. Fsid's and File System Transitions
Since fsid's are only unique on a per-server basis, it is to be
expected that they will change during a file system transition.
Clients should not make the fsid's received from the server visible
to applications, since they may not be globally unique, and because
they may change during a file system transition event. Applications
are best served if they are isolated from such transitions to the
extent possible.
10.6.5. The Change Attribute and File System Transitions
Since the change attribute is defined as a server-specific one,
change attributes fetched from one server are normally presumed to be
invalid on another server. Such a presumption is troublesome since
it would invalidate all cached change attributes, requiring
refetching. Even more disruptive, the absence of any assured
continuity for the change attribute means that even if the same value
is obtained on refetch, no conclusions can be drawn as to whether the
object in question has changed.  The identical change attribute could
be merely an artifact of a modified file with a different change
attribute construction algorithm, with that new algorithm just
happening to result in an identical change value.
When the two file systems have consistent change attribute formats,
and this fact is communicated to the client by reporting them as
being in the same _change_ class, the client may assume a continuity of change
attribute construction and handle this situation just as it would be
handled without any filesystem transition.
10.6.6. Lock State and File System Transitions
In a file system transition, the two file systems may have co-
operated in state management. When this is the case, and the two
file systems belong to the same _state_ class, the two file systems
will have compatible state environments. In the case of migration,
the servers involved in the migration of a filesystem SHOULD transfer
all server state from the original to the new server.  When this is
done, it must be done in a way that is transparent to the client.
With replication, such a degree of common state is typically not
present.  Clients, however, should use the information provided by
the fs_locations_info attribute, when available, to determine whether
such sharing is in effect, and depend on the default assumptions only
when that attribute is not available.
This state transfer will reduce disruption to the client when a file
system transition occurs.  If the servers are successful in
transferring all state, the client will continue to use stateids
assigned by the original server.  Therefore the new server must
recognize these stateids as valid.  This holds true for the clientid
as well.  Since responsibility for an entire filesystem is
transferred with such an event, there is no possibility that
conflicts will arise on the new server as a result of the transfer of
locks.
As part of the transfer of information between servers, leases would
be transferred as well. The leases being transferred to the new
server will typically have a different expiration time from those for
the same client, previously on the old server. To maintain the
property that all leases on a given server for a given client expire
at the same time, the server should advance the expiration time to
the later of the leases being transferred or the leases already
present. This allows the client to maintain lease renewal of both
classes without special effort.
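The rule above can be expressed by the following non-normative sketch
(hypothetical helper; absolute expiration times are assumed):

   #include <time.h>

   /* When a client's leases are transferred in a migration event,
    * the destination server sets that client's expiration time to
    * the later of the transferred leases and those it already holds,
    * so that all of the client's leases continue to expire
    * together. */
   static time_t merged_lease_expiration(time_t transferred_expiry,
                                         time_t existing_expiry)
   {
           return (transferred_expiry > existing_expiry) ?
                   transferred_expiry : existing_expiry;
   }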
When the two servers belong to the same _state_ class, it does not
necessarily mean that when dealing with the transition, the client
will not have to reclaim state. However it does mean that the client
may proceed using its current clientid and stateid's just as if there
had been no file system transition event and only reclaim state when
an NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID error is received.
File systems co-operating in state management may actually share
state or simply divide the id space so as to recognize (and reject as
stale) each other's state ids and client ids.  Servers which do share
state may not do so under all conditions or at all times.  The
requirement for the server is that if it cannot be sure in accepting
an id that it reflects the locks the client was given, it must treat
all associated state as stale and report it as such to the client.
When two file systems belong to different _state_ classes, the client
must establish new state on the destination, and reclaim if
possible. In this case, old stateids and clientid's should not be
presented to the new server since there is no assurance that they
will not conflict with id's valid on that server.
In either case, when actual locks are not known to be maintained, the
destination server may establish a grace period specific to the given
file system, with non-reclaim locks being rejected for that file
system, even though normal locks are being granted for other file
systems. Clients should not infer the absence of a grace period for
file systems being transitioned to a server from responses to
requests for other file systems.
In the case of lock reclamation for a given file system after a file
system transition, edge conditions can arise similar to those for
reclaim after server reboot (although in the case of the planned
state transfer associated with migration, these can be avoided by
securely recording lock state as part of state migration).  Where the
destination server cannot guarantee that locks will not be
incorrectly granted, the destination server should not establish a
file-system-specific grace period.
In place of a file-system-specific version of RECLAIM_COMPLETE,
servers may assume that an attempt to obtain a new lock, other than
by reclaim, indicates the end of the client's attempt to reclaim locks
for that file system. [NOTE: The alternative would be to adapt
RECLAIM_COMPLETE to this task].
Information about client identity may be propagated between
servers in the form of nfs_client_id4 and associated verifiers, under
the assumption that the client presents the same values to all the
servers with which it deals. [NOTE: This contradicts what is
currently said about SETCLIENTID, and interacts with the issue of
what sessions should do about this.]
Servers are encouraged to provide facilities to allow locks to be
reclaimed on the new server after a file system transition. Often,
however, in cases in which the two file systems are not of the same
_state_ class, such facilities may not be available and the client
should be prepared to re-obtain locks, even though it is possible
that the client may have its LOCK or OPEN request denied due to a
conflicting lock.  In some environments, such as the transition
between read-only file systems, such denial of locks should not pose
large difficulties in practice.  When an attempt to re-establish a
lock on a new server is denied, the client should treat the situation
as if its original lock had been revoked.  In all cases in which the
lock is granted, the client cannot assume that no conflicting lock
could have been granted in the interim.  Where change attribute
continuity is present, the client may check the change attribute to
detect unwanted file modifications.  Where even this is not
available, and the file system is not read-only, a client may
reasonably treat all pending locks as having been revoked.
10.6.6.1. Leases and File System Transitions
In the case of lease renewal, the client may not be submitting
requests for a filesystem that has been transferred to another
server. This can occur because of the lease renewal mechanism. The
client renews leases for all filesystems when submitting a request to
any one filesystem at the server.
In order for the client to schedule renewal of leases that may have
been relocated to the new server, the client must find out about
lease relocation before those leases expire. To accomplish this, all
operations which renew leases for a client (i.e. OPEN, CLOSE, READ,
WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error
NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be
renewed has been transferred to a new server. This condition will
continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR for the fs_locations or
fs_locations_info attribute for an access to each filesystem for
which a lease has been moved to a new server.
[ISSUE: There is a conflict between this and the idea in the sessions
text that we can have every op in the session implicitly renew the
lease. This needs to be dealt with. D. Noveck will create an issue
in the issue tracker.]
When a client receives an NFS4ERR_LEASE_MOVED error, it should
perform an operation on each filesystem associated with the server in
question. When the client receives an NFS4ERR_MOVED error, the
client can follow the normal process to obtain the new server
information (through the fs_locations and fs_locations_info
attributes) and perform renewal of those leases on the new server,
unless information in the fs_locations_info attribute shows that no state
could have been transferred. If the server has not had state
transferred to it transparently, the client will receive either
NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server,
as described above, and the client can then recover state information
as it does in the event of server failure.
10.6.6.2. Transitions and the Lease_time Attribute
In order that the client may appropriately manage its leases in the
case of a file system transition, the destination server must
establish proper values for the lease_time attribute.
When state is transferred transparently, that state should include
the correct value of the lease_time attribute. The lease_time
attribute on the destination server must never be less than that on
the source since this would result in premature expiration of leases
granted by the source server. Upon transitions in which state is
transferred transparently, the client is under no obligation to re-
fetch the lease_time attribute and may continue to use the value
previously fetched (on the source server).
If state has not been transferred transparently (either because the
file systems are shown as being in different _state_ classes or
because the client sees a real or simulated server reboot), the
client should fetch the value of lease_time on the new (i.e.
destination) server, and use it for subsequent locking requests.
However the server must respect a grace period at least as long as
the lease_time on the source server, in order to ensure that clients
have ample time to reclaim their locks before potentially conflicting
non-reclaimed locks are granted.
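The two constraints above can be summarized by the following
non-normative sketch (hypothetical helpers; times are in seconds):

   #include <stdint.h>

   /* Transparent state transfer: the destination's lease_time must
    * not be shorter than the source's, or leases granted by the
    * source would expire prematurely. */
   static uint32_t min_destination_lease_time(uint32_t src_lease_time,
                                              uint32_t dst_lease_time)
   {
           return (dst_lease_time >= src_lease_time) ?
                   dst_lease_time : src_lease_time;
   }

   /* Non-transparent transfer: the grace period for the transitioned
    * filesystem must be at least the source's lease_time, giving
    * clients time to reclaim before conflicting non-reclaim locks
    * are granted. */
   static uint32_t min_grace_period(uint32_t src_lease_time)
   {
           return src_lease_time;
   }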
10.6.7. Write Verifiers and File System Transitions
In a file system transition, the two file systems may be clustered in
the handling of unstably written data. When this is the case, and
the two file systems belong to the same _verifier_ class, valid
verifiers from one system may be recognized by the other and
superfluous writes avoided. There is no requirement that all valid
verifiers be recognized, but it cannot be the case that a verifier is
recognized as valid when it is not. [NOTE: We need to resolve the
issue of proper verifier scope].
When two file systems belong to different _verifier_ classes, the
client must assume that all unstable writes in existence at the time
of the file system transition have been lost, since there is no way
the old verifier can be recognized as valid (or not) on the target
server.
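The resulting client-side decision can be summarized by the following
non-normative sketch (hypothetical helper; the flags are assumptions
derived from the class information and the COMMIT result):

   #include <stdbool.h>

   /* If the source and destination filesystems are in the same
    * _verifier_ class, the client may present its existing write
    * verifier to COMMIT on the new server; otherwise, or if that
    * COMMIT does not accept the verifier, it must assume its
    * unstable writes were lost and re-send them. */
   static bool must_resend_unstable_writes(bool same_verifier_class,
                                           bool commit_accepted_verifier)
   {
           if (!same_verifier_class)
                   return true;
           return !commit_accepted_verifier;
   }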
10.7. Effecting File System Referrals
Referrals are effected when an absent file system is encountered, and
one or more alternate locations are made available by the
fs_locations or fs_locations_info attributes. The client will
typically get an NFS4ERR_MOVED error, fetch the appropriate location
information and proceed to access the file system on different
server, even though it retains its logical position within the
original namespace.
The examples given in the sections below are somewhat artificial in
that an actual client will not typically do a multi-component lookup,
but will have cached information regarding the upper levels of the
name hierarchy.  However, these examples are chosen to make the
required behavior clear and easy to put within the scope of a small
number of requests, without getting unduly into details of how
specific clients might choose to cache things.
10.7.1. Referral Example (LOOKUP)
Let us suppose that the following COMPOUND is issued in an
environment in which /src/linux/2.7/latest is absent from the target
server. This may be for a number of reasons. It may be the case
that the file system has moved, or, it may be the case that the
target server is functioning mainly, or solely, to refer clients to
the servers on which various file systems are located.
o PUTROOTFH
o LOOKUP "src"
o LOOKUP "linux"
o LOOKUP "2.7"
o LOOKUP "latest"
o GETFH
o GETATTR fsid,fileid,size,ctime
Under the given circumstances, the following will be the result.
o PUTROOTFH --> NFS_OK. The current fh is now the root of the
pseudo-fs.
o LOOKUP "src" --> NFS_OK. The current fh is for /src and is within
the pseudo-fs.
o LOOKUP "linux" --> NFS_OK. The current fh is for /src/linux and
is within the pseudo-fs.
o LOOKUP "2.7" --> NFS_OK. The current fh is for /src/linux/2.7 and
is within the pseudo-fs.
o LOOKUP "latest" --> NFS_OK. The current fh is for /src/linux/2.7/
latest and is within a new, absent fs, but ... the client will
never see the value of that fh.
o GETFH --> NFS4ERR_MOVED. Fails because current fh is in an absent
fs at the start of the operation and the spec makes no exception
for GETFH.
o GETATTR fsid,fileid,size,ctime. Not executed because the failure
of the GETFH stops processing of the COMPOUND.
Given the failure of the GETFH, the client has the job of determining
the root of the absent file system and where to find that file
system, i.e. the server and path relative to that server's root fh.
Note here that in this example, the client did not obtain filehandles
and attribute information (e.g. fsid) for the intermediate
directories, so that it would not be sure where the absent file
system starts. It could be the case, for example, that
/src/linux/2.7 is the root of the moved filesystem and that the
reason that the lookup of "latest" succeeded is that the filesystem
was not absent on that op but was moved between the last LOOKUP and
the GETFH (since COMPOUND is not atomic). Even if we had the fsid's
for all of the intermediate directories, we would have no way of
knowing that /src/linux/2.7/latest was the root of a new fs, since we
don't yet have its fsid.
In order to get the necessary information, let us re-issue the chain
of lookup's with GETFH's and GETATTR's to at least get the fsid's so
we can be sure where the appropriate fs boundaries are. The client
could choose to get fs_locations_info at the same time but in most
cases the client will have a good guess as to where fs boundaries are
(because of where NFS4ERR_MOVED was gotten and where not) making
fetching of fs_locations_info unnecessary.
OP01: PUTROOTFH --> NFS_OK
- Current fh is root of pseudo-fs.
OP02: GETATTR(fsid) --> NFS_OK
- Just for completeness. Normally, clients will know the fsid of
the pseudo-fs as soon as they establish communication with a
server.
OP03: LOOKUP "src" --> NFS_OK
OP04: GETATTR(fsid) --> NFS_OK
- Get current fsid to see where fs boundaries are. The fsid will be
that for the pseudo-fs in this example, so no boundary.
OP05: GETFH --> NFS_OK
- Current fh is for /src and is within pseudo-fs.
OP06: LOOKUP "linux" --> NFS_OK
- Current fh is for /src/linux and is within pseudo-fs.
OP07: GETATTR(fsid) --> NFS_OK
- Get current fsid to see where fs boundaries are. The fsid will be
that for the pseudo-fs in this example, so no boundary.
OP08: GETFH --> NFS_OK
- Current fh is for /src/linux and is within pseudo-fs.
OP09: LOOKUP "2.7" --> NFS_OK
- Current fh is for /src/linux/2.7 and is within pseudo-fs.
OP10: GETATTR(fsid) --> NFS_OK
- Get current fsid to see where fs boundaries are. The fsid will be
that for the pseudo-fs in this example, so no boundary.
OP11: GETFH --> NFS_OK
- Current fh is for /src/linux/2.7 and is within pseudo-fs.
OP12: LOOKUP "latest" --> NFS_OK
- Current fh is for /src/linux/2.7/latest and is within a new,
absent fs, but ...
- The client will never see the value of that fh
OP13: GETATTR(fsid, fs_locations_info) --> NFS_OK
- We are getting the fsid to know where the fs boundaries are. Note
that the fsid we are given will not necessarily be preserved at
the new location. That fsid might be different and in fact the
fsid we have for this fs might be a valid fsid of a different fs on
that new server.
- In this particular case, we are pretty sure anyway that what has
moved is /src/linux/2.7/latest rather than /src/linux/2.7 since we
have the fsid of the latter and it is that of the pseudo-fs, which
presumably cannot move. However, in other examples, we might not
have this kind of information to rely on (e.g. /src/linux/2.7
might be a non-pseudo filesystem separate from /src/linux/2.7/
latest), so we need to have another reliable source of information on
the boundary of the fs which is moved. If, for example, the
filesystem "/src/linux" had moved we would have a case of
migration rather than referral, and once the boundaries of the
migrated filesystem were clear we could fetch fs_locations_info.
- We are fetching fs_locations_info because the fact that we got an
NFS4ERR_MOVED at this point means that it is most likely that this is
a referral and we need the destination. Even if it is the case
that "/src/linux/2.7" is a filesystem which has migrated, we will
still need the location information for that file system.
OP14: GETFH --> NFS4ERR_MOVED
- Fails because current fh is in an absent fs at the start of the
operation and the spec makes no exception for GETFH. Note that
this has the happy consequence that we don't have to worry about
the volatility or lack thereof of the fh. If the root of the fs
on the new location is a persistent fh, then we can assume that
this fh, which we never saw, is a persistent fh which, if we could
see it, would exactly match the new fh.  At least, there is no
evidence to disprove that. On the other hand, if we find a
volatile root at the new location, then the filehandle which we
never saw must have been volatile or at least nobody can prove
otherwise.
Given the above, the client knows where the root of the absent file
system is, by noting where the change of fsid occurred. The
fs_locations_info attribute also gives the client the actual location
of the absent file system, so that the referral can proceed. The
server gives the client the bare minimum of information about the
absent file system so that there will be very little scope for
problems of conflict between information sent by the referring server
and information from the file system's home.  No filehandles and very
few attributes are present on the referring server and the client can
treat those it receives as basically transient information with the
function of enabling the referral.
10.7.2. Referral Example (READDIR)
Another context in which a client may encounter referrals is when it
does a READDIR on a directory in which some of the sub-directories are
the roots of absent file systems.
Suppose such a directory is read as follows:
o PUTROOTFH
o LOOKUP "src"
o LOOKUP "linux"
o LOOKUP "2.7"
o READDIR (fsid, size, ctime, mounted_on_fileid)
In this case, because rdattr_error is not requested,
fs_locations_info is not requested, and some of the attributes cannot
be provided, the result will be an NFS4ERR_MOVED error on the
READDIR, with the detailed results as follows:
o PUTROOTFH --> NFS_OK. The current fh is at the root of the
pseudo-fs.
o LOOKUP "src" --> NFS_OK. The current fh is for /src and is within
the pseudo-fs.
o LOOKUP "linux" --> NFS_OK. The current fh is for /src/linux and
is within the pseudo-fs.
o LOOKUP "2.7" --> NFS_OK. The current fh is for /src/linux/2.7 and
is within the pseudo-fs.
o READDIR (fsid, size, ctime, mounted_on_fileid) --> NFS4ERR_MOVED.
Note that the same error would have been returned if
/src/linux/2.7 had migrated, when in fact it is because the
directory contains the root of an absent fs.
So now suppose that we reissue with rdattr_error:
o PUTROOTFH
o LOOKUP "src"
o LOOKUP "linux"
o LOOKUP "2.7"
o READDIR (rdattr_error, fsid, size, ctime, mounted_on_fileid)
The results will be:
o PUTROOTFH --> NFS_OK. The current fh is at the root of the
pseudo-fs.
o LOOKUP "src" --> NFS_OK. The current fh is for /src and is within
the pseudo-fs.
o LOOKUP "linux" --> NFS_OK. The current fh is for /src/linux and
is within the pseudo-fs.
o LOOKUP "2.7" --> NFS_OK. The current fh is for /src/linux/2.7 and
is within the pseudo-fs.
o READDIR (rdattr_error, fsid, size, ctime, mounted_on_fileid) -->
NFS_OK.  The attributes for "latest" will only contain
rdattr_error with the value NFS4ERR_MOVED, together with an fsid
value and a value for mounted_on_fileid.
So suppose we do another READDIR to get fs_locations_info, although
we could have used a GETATTR directly, as in the previous section.
o PUTROOTFH
o LOOKUP "src"
o LOOKUP "linux"
o LOOKUP "2.7"
o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid,
size, ctime)
The results would be:
o PUTROOTFH --> NFS_OK. The current fh is at the root of the
pseudo-fs.
o LOOKUP "src" --> NFS_OK. The current fh is for /src and is within
the pseudo-fs.
o LOOKUP "linux" --> NFS_OK. The current fh is for /src/linux and
is within the pseudo-fs.
o LOOKUP "2.7" --> NFS_OK. The current fh is for /src/linux/2.7 and
is within the pseudo-fs.
o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid,
size, ctime) --> NFS_OK. The attributes will be as shown below.
The attributes for "latest" will only contain
o rdattr_error (value: NFS4ERR_MOVED)
o fs_locations_info
o mounted_on_fileid (value: unique fileid within referring fs)
o fsid (value: unique value within referring server)
The attribute entry for "latest" will not contain size or ctime.
10.8. The Attribute fs_absent
In order to provide the client information about whether the current
file system is present or absent, the fs_absent attribute may be
interrogated.
As noted above, this attribute, when supported, may be requested of
absent filesystems without causing NFS4ERR_MOVED to be returned and
it should always be available. Servers are strongly urged to support
this attribute on all filesystems if they support it on any
filesystem.
10.9. The Attribute fs_locations
The fs_locations attribute is structured in the following way:
struct fs_location {
utf8str_cis server<>;
pathname4 rootpath;
};
struct fs_locations {
pathname4 fs_root;
fs_location locations<>;
};
The fs_location struct is used to represent the location of a
filesystem by providing a server name and the path to the root of the
file system within that server's namespace. When a set of servers
have corresponding file systems at the same path within their
namespaces, an array of server names may be provided. An entry in
the server array is a UTF8 string and represents one of a
traditional DNS host name, an IPv4 address, or an IPv6 address.  It is not
a requirement that all servers that share the same rootpath be listed
in one fs_location struct. The array of server names is provided for
convenience. Servers that share the same rootpath may also be listed
in separate fs_location entries in the fs_locations attribute.
The fs_locations struct and attribute contains an array of such
locations. Since the name space of each server may be constructed
differently, the "fs_root" field is provided. The path represented
by fs_root represents the location of the filesystem in the current
server's name space, i.e. that of the server from which the
fs_locations attribute was obtained. The fs_root path is meant to
aid the client by clearly referencing the root of the file system
whose locations are being reported, no matter what object within the
current file system the current filehandle designates.
As an example, suppose there is a replicated filesystem located at
two servers (servA and servB). At servA, the filesystem is located
at path "/a/b/c". At, servB the filesystem is located at path
"/x/y/z". If the client were to obtain the fs_locations value for
the directory at "/a/b/c/d", it might not necessarily know that the
filesystem's root is located in servA's name space at "/a/b/c". When
the client switches to servB, it will need to determine that the
directory it first referenced at servA is now represented by the path
"/x/y/z/d" on servB. To facilitate this, the fs_locations attribute
provided by servA would have a fs_root value of "/a/b/c" and two
entries in fs_locations. One entry in fs_locations will be for
itself (servA) and the other will be for servB with a path of
"/x/y/z". With this information, the client is able to substitute
"/x/y/z" for the "/a/b/c" at the beginning of its access path and
construct "/x/y/z/d" to use for the new server.
Since the fs_locations attribute lacks information defining various
attributes of the various file system choices presented, it should
only be interrogated and used when fs_locations_info is not
available. When fs_locations is used, information about the specific
locations should be assumed based on the following rules.