draft-ietf-nfsv4-minorversion2-09.txt | draft-ietf-nfsv4-minorversion2-10.txt | |||
---|---|---|---|---|
NFSv4 T. Haynes | NFSv4 T. Haynes | |||
Internet-Draft Editor | Internet-Draft Editor | |||
Intended status: Standards Track May 02, 2012 | Intended status: Standards Track May 08, 2012 | |||
Expires: November 3, 2012 | Expires: November 9, 2012 | |||
NFS Version 4 Minor Version 2 | NFS Version 4 Minor Version 2 | |||
draft-ietf-nfsv4-minorversion2-09.txt | draft-ietf-nfsv4-minorversion2-10.txt | |||
Abstract | Abstract | |||
This Internet-Draft describes NFS version 4 minor version two, | This Internet-Draft describes NFS version 4 minor version two, | |||
focusing mainly on the protocol extensions made from NFS version 4 | focusing mainly on the protocol extensions made from NFS version 4 | |||
minor version 0 and NFS version 4 minor version 1. Major extensions | minor version 0 and NFS version 4 minor version 1. Major extensions | |||
introduced in NFS version 4 minor version two include: Server-side | introduced in NFS version 4 minor version two include: Server-side | |||
Copy, Space Reservations, and Support for Sparse Files. | Copy, Application I/O Advise, Space Reservations, Sparse Files, | |||
Application Data Blocks, and Labeled NFS. | ||||
Requirements Language | Requirements Language | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in RFC 2119 [1]. | document are to be interpreted as described in RFC 2119 [1]. | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
skipping to change at page 1, line 40 | skipping to change at page 1, line 41 | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on November 3, 2012. | This Internet-Draft will expire on November 9, 2012. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2012 IETF Trust and the persons identified as the | Copyright (c) 2012 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 3, line 11 | skipping to change at page 3, line 11 | |||
not be created outside the IETF Standards Process, except to format | not be created outside the IETF Standards Process, except to format | |||
it for publication as an RFC or to translate it into languages other | it for publication as an RFC or to translate it into languages other | |||
than English. | than English. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6 | 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6 | |||
1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6 | 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6 | |||
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 | 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 | |||
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6 | 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 7 | |||
1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 6 | 1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 7 | |||
1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 7 | 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 7 | |||
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 | 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 | |||
2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7 | 2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7 | |||
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7 | 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7 | |||
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 8 | 2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 8 | |||
2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9 | 2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 10 | |||
2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 11 | 2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 11 | |||
2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 13 | 2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 14 | |||
2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 15 | 2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15 | 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 16 | |||
2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16 | 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 17 | |||
2.4. Security Considerations . . . . . . . . . . . . . . . . . 16 | 2.4. Security Considerations . . . . . . . . . . . . . . . . . 17 | |||
2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16 | 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 17 | |||
3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 25 | 3. Support for Application IO Hints . . . . . . . . . . . . . . . 26 | |||
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25 | 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 26 | |||
3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25 | 3.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 26 | |||
4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 26 | 3.3. Additional Requirements . . . . . . . . . . . . . . . . . 27 | |||
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 26 | 3.4. Security Considerations . . . . . . . . . . . . . . . . . 28 | |||
5. Support for Application IO Hints . . . . . . . . . . . . . . . 28 | 3.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 28 | |||
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 28 | 4. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 29 | 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 29 | |||
5.3. Additional Requirements . . . . . . . . . . . . . . . . . 30 | 4.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 29 | |||
5.4. Security Considerations . . . . . . . . . . . . . . . . . 31 | 5. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 30 | |||
5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 31 | 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 30 | |||
6. Application Data Block Support . . . . . . . . . . . . . . . . 31 | 6. Application Data Block Support . . . . . . . . . . . . . . . . 32 | |||
6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 32 | 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 33 | |||
6.1.1. Data Block Representation . . . . . . . . . . . . . . 32 | 6.1.1. Data Block Representation . . . . . . . . . . . . . . 33 | |||
6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 33 | 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 34 | |||
6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 33 | 6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 34 | |||
6.3. An Example of Detecting Corruption . . . . . . . . . . . 34 | 6.3. An Example of Detecting Corruption . . . . . . . . . . . 34 | |||
6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 35 | 6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 36 | |||
6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36 | 6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36 | |||
7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 36 | 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 36 | 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 37 | |||
7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 37 | 7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 38 | |||
7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 37 | 7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 38 | |||
7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 38 | 7.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 39 | |||
7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 39 | 7.3.2. Permission Checking . . . . . . . . . . . . . . . . . 39 | |||
7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 39 | 7.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 39 | |||
7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 39 | 7.3.4. Existing Objects . . . . . . . . . . . . . . . . . . . 40 | |||
7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 40 | 7.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 40 | |||
7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 40 | 7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 40 | |||
7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 41 | ||||
7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 41 | 7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 41 | |||
7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 41 | 7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 41 | |||
7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 42 | 7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 42 | |||
7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 43 | 7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 43 | |||
7.7. Security Considerations . . . . . . . . . . . . . . . . . 43 | 7.7. Security Considerations . . . . . . . . . . . . . . . . . 43 | |||
8. Sharing change attribute implementation details with NFSv4 | 8. Sharing change attribute implementation details with NFSv4 | |||
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 | clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 | |||
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44 | 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44 | |||
8.2. Definition of the 'change_attr_type' per-file system | 9. Security Considerations . . . . . . . . . . . . . . . . . . . 44 | |||
attribute . . . . . . . . . . . . . . . . . . . . . . . . 44 | 10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 44 | |||
9. Security Considerations . . . . . . . . . . . . . . . . . . . 46 | 10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 45 | |||
10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 46 | 10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 45 | |||
10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 46 | 10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 45 | |||
10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 46 | 10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 46 | |||
10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 46 | 11. New File Attributes . . . . . . . . . . . . . . . . . . . . . 46 | |||
10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 47 | 11.1. New RECOMMENDED Attributes - List and Definition | |||
11. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 47 | References . . . . . . . . . . . . . . . . . . . . . . . 46 | |||
11.1. Attribute Definitions . . . . . . . . . . . . . . . . . . 47 | 11.2. Attribute Definitions . . . . . . . . . . . . . . . . . . 47 | |||
12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 48 | 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 50 | |||
13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 52 | 13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 53 | |||
13.1. Operation 59: COPY - Initiate a server-side copy . . . . 52 | 13.1. Operation 59: COPY - Initiate a server-side copy . . . . 53 | |||
13.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 59 | 13.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 61 | |||
13.3. Operation 61: COPY_NOTIFY - Notify a source server of | 13.3. Operation 61: COPY_NOTIFY - Notify a source server of | |||
a future copy . . . . . . . . . . . . . . . . . . . . . . 60 | a future copy . . . . . . . . . . . . . . . . . . . . . . 62 | |||
13.4. Operation 62: COPY_REVOKE - Revoke a destination | 13.4. Operation 62: COPY_REVOKE - Revoke a destination | |||
server's copy privileges . . . . . . . . . . . . . . . . 62 | server's copy privileges . . . . . . . . . . . . . . . . 64 | |||
13.5. Operation 63: COPY_STATUS - Poll for status of a | 13.5. Operation 63: COPY_STATUS - Poll for status of a | |||
server-side copy . . . . . . . . . . . . . . . . . . . . 63 | server-side copy . . . . . . . . . . . . . . . . . . . . 65 | |||
13.6. Modification to Operation 42: EXCHANGE_ID - | 13.6. Modification to Operation 42: EXCHANGE_ID - | |||
Instantiate Client ID . . . . . . . . . . . . . . . . . . 64 | Instantiate Client ID . . . . . . . . . . . . . . . . . . 66 | |||
13.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 65 | 13.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 67 | |||
13.8. Operation 67: IO_ADVISE - Application I/O access | 13.8. Operation 67: IO_ADVISE - Application I/O access | |||
pattern hints . . . . . . . . . . . . . . . . . . . . . . 69 | pattern hints . . . . . . . . . . . . . . . . . . . . . . 71 | |||
13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 75 | 13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 77 | |||
13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 78 | 13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 80 | |||
13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 83 | 13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 85 | |||
14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 84 | 14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 86 | |||
14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that | 14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that | |||
the File's Attributes Changed . . . . . . . . . . . . . . 84 | the File's Attributes Changed . . . . . . . . . . . . . . 86 | |||
14.2. Operation 15: CB_COPY - Report results of a | 14.2. Operation 15: CB_COPY - Report results of a | |||
server-side copy . . . . . . . . . . . . . . . . . . . . 85 | server-side copy . . . . . . . . . . . . . . . . . . . . 87 | |||
15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 87 | 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 89 | |||
16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 87 | 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 89 | |||
16.1. Normative References . . . . . . . . . . . . . . . . . . 87 | 16.1. Normative References . . . . . . . . . . . . . . . . . . 89 | |||
16.2. Informative References . . . . . . . . . . . . . . . . . 88 | 16.2. Informative References . . . . . . . . . . . . . . . . . 90 | |||
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 89 | Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 91 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 90 | Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 92 | |||
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 90 | Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 92 | |||
1. Introduction | 1. Introduction | |||
1.1. The NFS Version 4 Minor Version 2 Protocol | 1.1. The NFS Version 4 Minor Version 2 Protocol | |||
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third | The NFS version 4 minor version 2 (NFSv4.2) protocol is the third | |||
minor version of the NFS version 4 (NFSv4) protocol. The first minor | minor version of the NFS version 4 (NFSv4) protocol. The first minor | |||
version, NFSv4.0, is described in [10] and the second minor version, | version, NFSv4.0, is described in [10] and the second minor version, | |||
NFSv4.1, is described in [2]. It follows the guidelines for minor | NFSv4.1, is described in [2]. It follows the guidelines for minor | |||
versioning that are listed in Section 11 of [10]. | versioning that are listed in Section 11 of [10]. | |||
skipping to change at page 6, line 39 | skipping to change at page 6, line 39 | |||
o modify the specification of the NFSv4.0 or NFSv4.1 protocols. | o modify the specification of the NFSv4.0 or NFSv4.1 protocols. | |||
o clarify the NFSv4.0 or NFSv4.1 protocols. I.e., any | o clarify the NFSv4.0 or NFSv4.1 protocols. I.e., any | |||
clarifications made here apply to NFSv4.2 and neither of the prior | clarifications made here apply to NFSv4.2 and neither of the prior | |||
protocols. | protocols. | |||
The full XDR for NFSv4.2 is presented in [3]. | The full XDR for NFSv4.2 is presented in [3]. | |||
1.3. NFSv4.2 Goals | 1.3. NFSv4.2 Goals | |||
[[Comment.1: This needs fleshing out! --TH]] | The goal of the design of NFSv4.2 is to take common local filesystem | |||
features and offer them remotely. These features might | ||||
o already be available on the servers, e.g., sparse files | ||||
o be under development as a new standard, e.g., SEEK_HOLE and | ||||
SEEK_DATA | ||||
o be used by clients with the servers via some proprietary means, | ||||
e.g., Labeled NFS | ||||
but the clients are not able to leverage them on the server within | ||||
the confines of the NFS protocol. | ||||
1.4. Overview of NFSv4.2 Features | 1.4. Overview of NFSv4.2 Features | |||
[[Comment.2: This needs fleshing out! --TH]] | [[Comment.1: This needs fleshing out! --TH]] | |||
1.4.1. Sparse Files | 1.4.1. Sparse Files | |||
Two new operations are defined to support the reading of sparse files | Two new operations are defined to support the reading of sparse files | |||
(READ_PLUS) and the punching of holes to remove backing storage | (READ_PLUS) and the punching of holes to remove backing storage | |||
(INITIALIZE). | (INITIALIZE). | |||
1.4.2. Application I/O Advise | 1.4.2. Application I/O Advise | |||
We propose a new IO_ADVISE operation for NFSv4.2 that clients can use | We propose a new IO_ADVISE operation for NFSv4.2 that clients can use | |||
skipping to change at page 25, line 5 | skipping to change at page 26, line 5 | |||
or an eavesdropper that observes the random number on the wire. | or an eavesdropper that observes the random number on the wire. | |||
Other secure communication techniques (e.g., IPsec) are necessary to | Other secure communication techniques (e.g., IPsec) are necessary to | |||
block these attacks. | block these attacks. | |||
2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 | 2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 | |||
The same techniques as Section 2.4.1.3, using unique URLs for each | The same techniques as Section 2.4.1.3, using unique URLs for each | |||
destination server, can be used for other protocols (e.g., HTTP [13] | destination server, can be used for other protocols (e.g., HTTP [13] | |||
and FTP [14]) as well. | and FTP [14]) as well. | |||
3. Sparse Files | 3. Support for Application IO Hints | |||
3.1. Introduction | 3.1. Introduction | |||
A sparse file is a common way of representing a large file without | ||||
having to utilize all of the disk space for it. Consequently, a | ||||
sparse file uses less physical space than its size indicates. This | ||||
means the file contains 'holes', byte ranges within the file that | ||||
contain no data. Most modern file systems support sparse files, | ||||
including most UNIX file systems and NTFS, but notably not Apple's | ||||
HFS+. Common examples of sparse files include Virtual Machine (VM) | ||||
OS/disk images, database files, log files, and even checkpoint | ||||
recovery files most commonly used by the HPC community. | ||||
If an application reads a hole in a sparse file, the file system must | ||||
return all zeros to the application. For local data access there is | ||||
little penalty, but with NFS these zeroes must be transferred back to | ||||
the client. If an application uses the NFS client to read data into | ||||
memory, this wastes time and bandwidth as the application waits for | ||||
the zeroes to be transferred. | ||||
A sparse file is typically created by initializing the file to be all | ||||
zeros; nothing is written to the data in the file, and instead the | ||||
hole is recorded in the metadata for the file. So an 8G disk image might | ||||
be represented initially by a couple hundred bits in the inode and | ||||
nothing on the disk. If the VM then writes 100M to a file in the | ||||
middle of the image, there would now be two holes represented in the | ||||
metadata and 100M in the data. | ||||
This section introduces a new operation READ_PLUS (Section 13.10) | ||||
which supports all the features of READ but includes an extension to | ||||
support sparse pattern files. READ_PLUS is guaranteed to perform no | ||||
worse than READ, and can dramatically improve performance with sparse | ||||
files. READ_PLUS does not depend on pNFS protocol features, but can | ||||
be used by pNFS to support sparse files. | ||||
3.2. Terminology | ||||
Regular file: An object of file type NF4REG or NF4NAMEDATTR. | ||||
Sparse file: A Regular file that contains one or more Holes. | ||||
Hole: A byte range within a Sparse file that contains regions of all | ||||
zeroes. For block-based file systems, this could also be an | ||||
unallocated region of the file. | ||||
Hole Threshold: The minimum length of a Hole as determined by the | ||||
server. If a server chooses to define a Hole Threshold, then it | ||||
would not return hole information about holes with a length | ||||
shorter than the Hole Threshold. | ||||
4. Space Reservation | ||||
4.1. Introduction | ||||
This section describes a set of operations that allow applications | ||||
such as hypervisors to reserve space for a file, report the amount of | ||||
actual disk space a file occupies, and free up the backing space of a | ||||
file when it is not required. In virtualized environments, virtual | ||||
disk files are often stored on NFS mounted volumes. Since virtual | ||||
disk files represent the hard disks of virtual machines, hypervisors | ||||
often have to guarantee certain properties for the file. | ||||
One such example is space reservation. When a hypervisor creates a | ||||
virtual disk file, it often tries to preallocate the space for the | ||||
file so that there are no future allocation related errors during the | ||||
operation of the virtual machine. Such errors prevent a virtual | ||||
machine from continuing execution and result in downtime. | ||||
Currently, in order to achieve such a guarantee, applications zero | ||||
the entire file. The initial zeroing allocates the backing blocks | ||||
and all subsequent writes are overwrites of already allocated blocks. | ||||
This approach is not only inefficient in terms of the amount of I/O | ||||
done, it is also not guaranteed to work on filesystems that are log | ||||
structured or deduplicated. An efficient way of guaranteeing space | ||||
reservation would be beneficial to such applications. | ||||
If the space_reserved attribute is set on a file, it is guaranteed | ||||
that writes that do not grow the file will not fail with | ||||
NFS4ERR_NOSPC. | ||||
Another useful feature would be the ability to report the number of | ||||
blocks that would be freed when a file is deleted. Currently, NFS | ||||
reports two size attributes: | ||||
size The logical file size of the file. | ||||
space_used The size in bytes that the file occupies on disk. | ||||
While these attributes are sufficient for space accounting in | ||||
traditional filesystems, they prove to be inadequate in modern | ||||
filesystems that support block sharing. In such filesystems, | ||||
multiple inodes can point to a single block with a block reference | ||||
count to guard against premature freeing. Having a way to tell the | ||||
number of blocks that would be freed if the file was deleted would be | ||||
useful to applications that wish to migrate files when a volume is | ||||
low on space. | ||||
Since virtual disks represent a hard drive in a virtual machine, a | ||||
virtual disk can be viewed as a filesystem within a file. Since not | ||||
all blocks within a filesystem are in use, there is an opportunity to | ||||
reclaim blocks that are no longer in use. A call to deallocate | ||||
blocks could result in better space efficiency. Less space MAY be | ||||
consumed for backups after block deallocation. | ||||
The following operations and attributes can be used to resolve these | ||||
issues: | ||||
space_reserved This attribute specifies whether the blocks backing | ||||
the file have been preallocated. | ||||
space_freed This attribute specifies the space freed when a file is | ||||
deleted, taking block sharing into consideration. | ||||
INITIALIZE This operation zeroes and/or deallocates the blocks | ||||
backing a region of the file. | ||||
If space_used of a file is interpreted to mean the size in bytes of | ||||
all disk blocks pointed to by the inode of the file, then shared | ||||
blocks get double counted, over-reporting the space utilization. | ||||
This also has the adverse effect that the deletion of a file with | ||||
shared blocks frees up less than space_used bytes. | ||||
On the other hand, if space_used is interpreted to mean the size in | ||||
bytes of those disk blocks unique to the inode of the file, then | ||||
shared blocks are not counted in any file, resulting in under- | ||||
reporting of the space utilization. | ||||
For example, two files A and B have 10 blocks each. Let 6 of these | ||||
blocks be shared between them. Thus, the combined space utilized by | ||||
the two files is 14 * BLOCK_SIZE bytes. In the former case, the | ||||
combined space utilization of the two files would be reported as 20 * | ||||
BLOCK_SIZE. However, deleting either would only result in 4 * | ||||
BLOCK_SIZE being freed. Conversely, the latter interpretation would | ||||
report that the space utilization is only 8 * BLOCK_SIZE. | ||||
Adding another size attribute, space_freed, is helpful in solving | ||||
this problem. space_freed is the number of blocks that are allocated | ||||
to the given file that would be freed on its deletion. In the | ||||
example, both A and B would report space_freed as 4 * BLOCK_SIZE and | ||||
space_used as 10 * BLOCK_SIZE. If A is deleted, B will report | ||||
space_freed as 10 * BLOCK_SIZE as the deletion of B would result in | ||||
the deallocation of all 10 blocks. | ||||
The addition of this attribute doesn't solve the problem of space being | ||||
over-reported. However, over-reporting is better than under- | ||||
reporting. | ||||
5. Support for Application IO Hints | ||||
5.1. Introduction | ||||
Applications currently have several options for communicating I/O | Applications currently have several options for communicating I/O | |||
access patterns to the NFS client. While this can help the NFS | access patterns to the NFS client. While this can help the NFS | |||
client optimize I/O and caching for a file, it does not allow the NFS | client optimize I/O and caching for a file, it does not allow the NFS | |||
server and its exported file system to do likewise. Therefore, here | server and its exported file system to do likewise. Therefore, here | |||
we put forth a proposal for the NFSv4.2 protocol to allow | we put forth a proposal for the NFSv4.2 protocol to allow | |||
applications to communicate their expected behavior to the server. | applications to communicate their expected behavior to the server. | |||
By communicating expected access patterns, e.g., sequential or random, | By communicating expected access patterns, e.g., sequential or random, | |||
and data re-use behavior, e.g., data range will be read multiple | and data re-use behavior, e.g., data range will be read multiple | |||
times and should be cached, the server will be able to better | times and should be cached, the server will be able to better | |||
skipping to change at page 29, line 5 | skipping to change at page 26, line 43 | |||
Application specific NFS clients such as those used by hypervisors | Application specific NFS clients such as those used by hypervisors | |||
and databases can also leverage application hints to communicate | and databases can also leverage application hints to communicate | |||
their specialized requirements. | their specialized requirements. | |||
This section adds a new IO_ADVISE operation to communicate the client | This section adds a new IO_ADVISE operation to communicate the client | |||
file access patterns to the NFS server. The NFS server upon | file access patterns to the NFS server. The NFS server upon | |||
receiving an IO_ADVISE operation MAY choose to alter its I/O and | receiving an IO_ADVISE operation MAY choose to alter its I/O and | |||
caching behavior, but is under no obligation to do so. | caching behavior, but is under no obligation to do so. | |||
5.2. POSIX Requirements | 3.2. POSIX Requirements | |||
The first key requirement of the IO_ADVISE operation is to support | The first key requirement of the IO_ADVISE operation is to support | |||
the posix_fadvise function [6], which is supported in Linux and many | the posix_fadvise function [6], which is supported in Linux and many | |||
other operating systems. Examples and guidance on how to use | other operating systems. Examples and guidance on how to use | |||
posix_fadvise to improve performance can be found in [16]. | posix_fadvise to improve performance can be found in [16]. | |||
posix_fadvise is defined as follows: | posix_fadvise is defined as follows: | |||
int posix_fadvise(int fd, off_t offset, off_t len, int advice); | int posix_fadvise(int fd, off_t offset, off_t len, int advice); | |||
The posix_fadvise() function shall advise the implementation on the | The posix_fadvise() function shall advise the implementation on the | |||
skipping to change at page 30, line 5 | skipping to change at page 27, line 41 | |||
POSIX_FADV_DONTNEED - Specifies that the application expects that it | POSIX_FADV_DONTNEED - Specifies that the application expects that it | |||
will not access the specified data in the near future. | will not access the specified data in the near future. | |||
POSIX_FADV_NOREUSE - Specifies that the application expects to | POSIX_FADV_NOREUSE - Specifies that the application expects to | |||
access the specified data once and then not reuse it thereafter. | access the specified data once and then not reuse it thereafter. | |||
Upon successful completion, posix_fadvise() shall return zero; | Upon successful completion, posix_fadvise() shall return zero; | |||
otherwise, an error number shall be returned to indicate the error. | otherwise, an error number shall be returned to indicate the error. | |||
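As an illustration of the interface described above, the following is a minimal sketch of how an application might issue two of the listed hints. The helper names advise_sequential and advise_dontneed are hypothetical and do not appear in the draft or in POSIX; only posix_fadvise() and the POSIX_FADV_* constants are real. Whether and how an NFS client forwards such hints to the server is implementation specific.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the draft): advise that the byte range
 * starting at offset, of length len, will be read sequentially.
 * Returns zero on success or an error number, as posix_fadvise() does.
 */
int advise_sequential(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len >= 0);
    return posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
}

/*
 * Hypothetical helper: advise that the range is not expected to be
 * accessed in the near future.
 */
int advise_dontneed(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len >= 0);
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

Note that the return value is the error number itself; posix_fadvise() does not report failure through errno.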
5.3. Additional Requirements | 3.3. Additional Requirements | |||
Many use cases exist for sending application I/O hints to the server | Many use cases exist for sending application I/O hints to the server | |||
that cannot utilize the POSIX supported interface. This is because | that cannot utilize the POSIX supported interface. This is because | |||
some applications may benefit from additional hints not specified by | some applications may benefit from additional hints not specified by | |||
posix_fadvise, and some applications may not use POSIX altogether. | posix_fadvise, and some applications may not use POSIX altogether. | |||
One use case is "Opportunistic Prefetch", which allows a stateid | One use case is "Opportunistic Prefetch", which allows a stateid | |||
holder to tell the server that it is possible that it will access the | holder to tell the server that it is possible that it will access the | |||
specified data in the near future. This is similar to | specified data in the near future. This is similar to | |||
POSIX_FADV_WILLNEED, but the client is unsure it will in fact read | POSIX_FADV_WILLNEED, but the client is unsure it will in fact read | |||
skipping to change at page 31, line 5 | skipping to change at page 28, line 39 | |||
Another use case is "Backward Sequential Read", which allows a stateid | Another use case is "Backward Sequential Read", which allows a stateid | |||
holder to inform the server that it intends to read the specified | holder to inform the server that it intends to read the specified | |||
data backwards, i.e., from the end to the beginning. This is | data backwards, i.e., from the end to the beginning. This is | |||
different than POSIX_FADV_SEQUENTIAL, whose implied intention was | different than POSIX_FADV_SEQUENTIAL, whose implied intention was | |||
that data will be read from beginning to end. This hint allows | that data will be read from beginning to end. This hint allows | |||
servers to prefetch data at the end of the range first, and then | servers to prefetch data at the end of the range first, and then | |||
prefetch data sequentially in a backwards manner to the start of the | prefetch data sequentially in a backwards manner to the start of the | |||
data range. One example of an application that can make use of this | data range. One example of an application that can make use of this | |||
hint is video editing. | hint is video editing. | |||
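The prefetch order implied by the "Backward Sequential Read" hint can be sketched as follows. This is only an illustration under stated assumptions: the function name backward_prefetch_order, the fixed chunk size, and the output array are all hypothetical, and the draft does not prescribe how a server schedules its prefetches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: split the range [start, start + len) into chunks and
 * list the chunk start offsets in the order a server might prefetch them for
 * a backward read, i.e., the chunk at the end of the range first.
 * Returns the number of offsets written to out (at most max_out).
 */
size_t backward_prefetch_order(uint64_t start, uint64_t len, uint64_t chunk,
                               uint64_t *out, size_t max_out)
{
    assert(out != NULL || max_out == 0);
    if (chunk == 0)
        return 0;

    uint64_t end = start + len;
    size_t n = 0;
    while (end > start && n < max_out) {
        /* The chunk nearest the start of the range may be shorter. */
        uint64_t begin = (end - start > chunk) ? end - chunk : start;
        out[n++] = begin;
        end = begin;
    }
    return n;
}
```

For a 10 byte range starting at offset 0 and a chunk size of 4, the sketch yields the offsets 6, 2, and 0, in that order.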
5.4. Security Considerations | 3.4. Security Considerations | |||
None. | None. | |||
5.5. IANA Considerations | 3.5. IANA Considerations | |||
The IO_ADVISE_type4 will be extended through an IANA registry. | The IO_ADVISE_type4 will be extended through an IANA registry. | |||
4. Sparse Files | ||||
4.1. Introduction | ||||
A sparse file is a common way of representing a large file without | ||||
having to utilize all of the disk space for it. Consequently, a | ||||
sparse file uses less physical space than its size indicates. This | ||||
means the file contains 'holes', byte ranges within the file that | ||||
contain no data. Most modern file systems support sparse files, | ||||
including most UNIX file systems and NTFS, but notably not Apple's | ||||
HFS+. Common examples of sparse files include Virtual Machine (VM) | ||||
OS/disk images, database files, log files, and even checkpoint | ||||
recovery files most commonly used by the HPC community. | ||||
If an application reads a hole in a sparse file, the file system must | ||||
return all zeros to the application. For local data access there is | ||||
little penalty, but with NFS these zeroes must be transferred back to | ||||
the client. If an application uses the NFS client to read data into | ||||
memory, this wastes time and bandwidth as the application waits for | ||||
the zeroes to be transferred. | ||||
A sparse file is typically created by initializing the file to be all | ||||
zeros - nothing is written to the data in the file; instead, the hole | ||||
is recorded in the metadata for the file. So an 8G disk image might | ||||
be represented initially by a couple hundred bits in the inode and | ||||
nothing on the disk. If the VM then writes 100M to a file in the | ||||
middle of the image, there would now be two holes represented in the | ||||
metadata and 100M in the data. | ||||
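The disk image example above can be worked through with a small sketch. The structure and function names (extent, sparse_layout, write_into_hole) are hypothetical and only model the bookkeeping; they say nothing about how a real file system records holes. The usage note assumes binary units for "8G" and "100M" and a write placed at the 4G mark, neither of which is stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a byte range within a file. */
struct extent {
    uint64_t offset;
    uint64_t length;
};

/* Layout of a file that was one large hole before a single write. */
struct sparse_layout {
    struct extent hole_before;
    struct extent data;
    struct extent hole_after;
};

/*
 * Hypothetical sketch: a file of file_size bytes starts as a single hole.
 * One write of write_len bytes at write_offset splits it into a leading
 * hole, a data extent, and a trailing hole.
 */
struct sparse_layout write_into_hole(uint64_t file_size,
                                     uint64_t write_offset,
                                     uint64_t write_len)
{
    assert(write_offset <= file_size);
    assert(write_len <= file_size - write_offset);

    struct sparse_layout l;
    l.hole_before.offset = 0;
    l.hole_before.length = write_offset;
    l.data.offset = write_offset;
    l.data.length = write_len;
    l.hole_after.offset = write_offset + write_len;
    l.hole_after.length = file_size - l.hole_after.offset;
    return l;
}
```

With an 8192 MiB image and a 100 MiB write at offset 4096 MiB, the leading hole covers 4096 MiB and the trailing hole starts at 4196 MiB and covers 3996 MiB, so only the 100 MiB data extent needs backing storage.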
Two new operations INITIALIZE (Section 13.7) and READ_PLUS | ||||
(Section 13.10) are introduced. INITIALIZE allows for the creation | ||||
of a sparse file and for hole punching. An application might want to | ||||
zero out a range of the file. READ_PLUS supports all the features of | ||||
READ but includes an extension to support sparse pattern files | ||||
(Section 6.1.2). READ_PLUS is guaranteed to perform no worse than | ||||
READ, and can dramatically improve performance with sparse files. | ||||
READ_PLUS does not depend on pNFS protocol features, but can be used | ||||
by pNFS to support sparse files. | ||||
4.2. Terminology | ||||
Regular file: An object of file type NF4REG or NF4NAMEDATTR. | ||||
Sparse file: A Regular file that contains one or more Holes. | ||||
Hole: A byte range within a Sparse file that contains regions of all | ||||
zeroes. For block-based file systems, this could also be an | ||||
unallocated region of the file. | ||||
Hole Threshold: The minimum length of a Hole as determined by the | ||||
server. If a server chooses to define a Hole Threshold, then it | ||||
would not return hole information about holes with a length | ||||
shorter than the Hole Threshold. | ||||
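The Hole Threshold definition amounts to a simple filter on the server side. The sketch below is hypothetical: the function name should_report_hole and the convention that a threshold of zero means "no threshold defined" are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: decide whether a server that has defined a Hole
 * Threshold would return hole information for a hole of the given length.
 * Holes shorter than the threshold are not reported. A threshold of zero
 * is assumed here to mean that every hole is reported.
 */
bool should_report_hole(uint64_t hole_length, uint64_t hole_threshold)
{
    assert(hole_length > 0);
    return hole_length >= hole_threshold;
}
```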
5. Space Reservation | ||||
5.1. Introduction | ||||
This section describes a set of operations that allow applications | ||||
such as hypervisors to reserve space for a file, report the amount of | ||||
actual disk space a file occupies and free up the backing space of a | ||||
file when it is not required. In virtualized environments, virtual | ||||
disk files are often stored on NFS mounted volumes. Since virtual | ||||
disk files represent the hard disks of virtual machines, hypervisors | ||||
often have to guarantee certain properties for the file. | ||||
One such example is space reservation. When a hypervisor creates a | ||||
virtual disk file, it often tries to preallocate the space for the | ||||
file so that there are no future allocation related errors during the | ||||
operation of the virtual machine. Such errors prevent a virtual | ||||
machine from continuing execution and result in downtime. | ||||
Currently, in order to achieve such a guarantee, applications zero | ||||
the entire file. The initial zeroing allocates the backing blocks | ||||
and all subsequent writes are overwrites of already allocated blocks. | ||||
This approach is not only inefficient in terms of the amount of I/O | ||||
done, it is also not guaranteed to work on filesystems that are log | ||||
structured or deduplicated. An efficient way of guaranteeing space | ||||
reservation would be beneficial to such applications. | ||||
If the space_reserved attribute (see Section 11.2.3) is set on a | ||||
file, it is guaranteed that writes that do not grow the file will not | ||||
fail with NFS4ERR_NOSPC. | ||||
Another useful feature would be the ability to report the number of | ||||
blocks that would be freed when a file is deleted. Currently, NFS | ||||
reports two size attributes: | ||||
size The logical file size of the file. | ||||
space_used The size in bytes that the file occupies on disk | ||||
While these attributes are sufficient for space accounting in | ||||
traditional filesystems, they prove to be inadequate in modern | ||||
filesystems that support block sharing. In such filesystems, | ||||
multiple inodes can point to a single block with a block reference | ||||
count to guard against premature freeing. Having a way to tell the | ||||
number of blocks that would be freed if the file was deleted would be | ||||
useful to applications that wish to migrate files when a volume is | ||||
low on space. | ||||
Since virtual disks represent a hard drive in a virtual machine, a | ||||
virtual disk can be viewed as a filesystem within a file. Since not | ||||
all blocks within a filesystem are in use, there is an opportunity to | ||||
reclaim blocks that are no longer in use. A call to deallocate | ||||
blocks could result in better space efficiency. Less space MAY be | ||||
consumed for backups after block deallocation. | ||||
The following operations and attributes can be used to resolve these | ||||
issues: | ||||
space_reserved This attribute specifies whether the blocks backing | ||||
the file have been preallocated. | ||||
space_freed This attribute specifies the space freed when a file is | ||||
deleted, taking block sharing into consideration. | ||||
INITIALIZE This operation zeroes and/or deallocates the blocks | ||||
backing a region of the file. | ||||
If space_used of a file is interpreted to mean the size in bytes of | ||||
all disk blocks pointed to by the inode of the file, then shared | ||||
blocks get double counted, over-reporting the space utilization. | ||||
This also has the adverse effect that the deletion of a file with | ||||
shared blocks frees up less than space_used bytes. | ||||
On the other hand, if space_used is interpreted to mean the size in | ||||
bytes of those disk blocks unique to the inode of the file, then | ||||
shared blocks are not counted in any file, resulting in under- | ||||
reporting of the space utilization. | ||||
For example, two files A and B have 10 blocks each. Let 6 of these | ||||
blocks be shared between them. Thus, the combined space utilized by | ||||
the two files is 14 * BLOCK_SIZE bytes. In the former case, the | ||||
combined space utilization of the two files would be reported as 20 * | ||||
BLOCK_SIZE. However, deleting either would only result in 4 * | ||||
BLOCK_SIZE being freed. Conversely, the latter interpretation would | ||||
report that the space utilization is only 8 * BLOCK_SIZE. | ||||
Adding another size attribute, space_freed (see Section 11.2.4), is | ||||
helpful in solving this problem. space_freed is the number of blocks | ||||
that are allocated to the given file that would be freed on its | ||||
deletion. In the example, both A and B would report space_freed as 4 | ||||
* BLOCK_SIZE and space_used as 10 * BLOCK_SIZE. If A is deleted, B | ||||
will report space_freed as 10 * BLOCK_SIZE as the deletion of B would | ||||
result in the deallocation of all 10 blocks. | ||||
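The arithmetic of the two-file example can be checked with a short sketch. All names here (file_blocks, space_used_blocks, space_freed_blocks, combined_actual_blocks) are hypothetical, and every value is expressed in units of BLOCK_SIZE; the sketch models the accounting only and is not a description of any server implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-file block counts, in units of BLOCK_SIZE. */
struct file_blocks {
    uint64_t total;   /* all blocks referenced by the file */
    uint64_t shared;  /* blocks also referenced by another file */
};

/* space_used under the first interpretation: shared blocks are counted. */
uint64_t space_used_blocks(struct file_blocks f)
{
    return f.total;
}

/* space_freed: only blocks unique to the file are freed on deletion. */
uint64_t space_freed_blocks(struct file_blocks f)
{
    assert(f.shared <= f.total);
    return f.total - f.shared;
}

/* Blocks actually consumed by two files that share shared_between blocks. */
uint64_t combined_actual_blocks(struct file_blocks a, struct file_blocks b,
                                uint64_t shared_between)
{
    assert(shared_between <= a.total && shared_between <= b.total);
    return a.total + b.total - shared_between;
}
```

For files A and B with 10 blocks each and 6 shared, the combined actual usage is 14, the sum of space_used is 20, and each file reports a space_freed of 4. Once A is deleted, B no longer shares any blocks and its space_freed becomes 10.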
The addition of this attribute does not solve the problem of space being | ||||
over-reported. However, over-reporting is better than under- | ||||
reporting. | ||||
6. Application Data Block Support | 6. Application Data Block Support | |||
At the OS level, files are contained on disk blocks. Applications | At the OS level, files are contained on disk blocks. Applications | |||
are also free to impose structure on the data contained in a file and | are also free to impose structure on the data contained in a file and | |||
we can define an Application Data Block (ADB) to be such a structure. | we can define an Application Data Block (ADB) to be such a structure. | |||
From the application's viewpoint, it only wants to handle ADBs and | From the application's viewpoint, it only wants to handle ADBs and | |||
not raw bytes (see [17]). An ADB is typically comprised of two | not raw bytes (see [17]). An ADB is typically comprised of two | |||
sections: a header and data. The header describes the | sections: a header and data. The header describes the | |||
characteristics of the block and can provide a means to detect | characteristics of the block and can provide a means to detect | |||
corruption in the data payload. The data section is typically | corruption in the data payload. The data section is typically | |||
skipping to change at page 32, line 9 | skipping to change at page 33, line 7 | |||
information necessary to later reconstruct the header portion of | information necessary to later reconstruct the header portion of | |||
the ADB when the contents are read back. Using sparse file | the ADB when the contents are read back. Using sparse file | |||
techniques, the disk blocks described by would not be allocated. | techniques, the disk blocks described by would not be allocated. | |||
Unlike sparse file techniques, there would be a small cost to store | Unlike sparse file techniques, there would be a small cost to store | |||
the compressed header data. | the compressed header data. | |||
In this section, we are going to define a generic framework for an | In this section, we are going to define a generic framework for an | |||
ADB, present one approach to detecting corruption in a given ADB | ADB, present one approach to detecting corruption in a given ADB | |||
implementation, and describe the model for how the client and server | implementation, and describe the model for how the client and server | |||
can support efficient initialization of ADBs, reading of ADB holes, | can support efficient initialization of ADBs, reading of ADB holes, | |||
punching holes in ADBs, and space reservation. Further, we need to | punching holes in ADBs, and space reservation. | |||
be able to extend this model to applications which do not support | ||||
ADBs, but wish to be able to handle sparse files, hole punching, and | ||||
space reservation. | ||||
6.1. Generic Framework | 6.1. Generic Framework | |||
We want the representation of the ADB to be flexible enough to | We want the representation of the ADB to be flexible enough to | |||
support many different applications. The most basic approach is no | support many different applications. The most basic approach is no | |||
imposition of a block at all, which means we are working with the raw | imposition of a block at all, which means we are working with the raw | |||
bytes. Such an approach would be useful for storing holes, punching | bytes. Such an approach would be useful for storing holes, punching | |||
holes, etc. In more complex deployments, a server might be | holes, etc. In more complex deployments, a server might be | |||
supporting multiple applications, each with their own definition of | supporting multiple applications, each with their own definition of | |||
the ADB. One might store the ADBN at the start of the block and then | the ADB. One might store the ADBN at the start of the block and then | |||
have a guard pattern to detect corruption [19]. The next might store | have a guard pattern to detect corruption [19]. The next might store | |||
the ADBN at an offset of 100 bytes within the block and have no guard | the ADBN at an offset of 100 bytes within the block and have no guard | |||
pattern at all. The point is that existing applications might | pattern at all. I.e., existing applications might already have well | |||
already have well defined formats for their data blocks. | defined formats for their data blocks. | |||
The guard pattern can be used to represent the state of the block, to | The guard pattern can be used to represent the state of the block, to | |||
protect against corruption, or both. Again, it needs to be able to | protect against corruption, or both. Again, it needs to be able to | |||
be placed anywhere within the ADB. | be placed anywhere within the ADB. | |||
We need to be able to represent the starting offset of the block and | We need to be able to represent the starting offset of the block and | |||
the size of the block. Note that nothing prevents the application | the size of the block. Note that nothing prevents the application | |||
from defining different sized blocks in a file. | from defining different sized blocks in a file. | |||
6.1.1. Data Block Representation | 6.1.1. Data Block Representation | |||
skipping to change at page 33, line 35 | skipping to change at page 34, line 31 | |||
While this document does not mandate how sparse ADBs are recorded on | While this document does not mandate how sparse ADBs are recorded on | |||
the server, it does make the assumption that such information is not | the server, it does make the assumption that such information is not | |||
in the file. I.e., the information is metadata. As such, the | in the file. I.e., the information is metadata. As such, the | |||
INITIALIZE operation is defined to be not supported by the DS - it | INITIALIZE operation is defined to be not supported by the DS - it | |||
must be issued to the MDS. But since the client must not assume a | must be issued to the MDS. But since the client must not assume a | |||
priori whether a read is sparse or not, the READ_PLUS operation MUST | priori whether a read is sparse or not, the READ_PLUS operation MUST | |||
be supported by both the DS and the MDS. I.e., the client might | be supported by both the DS and the MDS. I.e., the client might | |||
impose on the MDS to asynchronously read the data from the DS. | impose on the MDS to asynchronously read the data from the DS. | |||
Furthermore, each DS MUST NOT report to a client either a sparse ADB | Furthermore, each DS MUST NOT report to a client a sparse ADB which | |||
or data which belongs to another DS. One implication of this | belongs to another DS. One implication of this requirement is that | |||
requirement is that the app_data_block4's adb_block_size MUST | the app_data_block4's adb_block_size MUST either be the stripe | |||
either be the stripe width or the stripe width must be an even | width or the stripe width must be an even multiple of it. The second | |||
multiple of it. | implication here is that the DS must be able to use the Control | |||
Protocol to determine from the MDS where the sparse ADBs occur. | ||||
The second implication here is that the DS must be able to use the | ||||
Control Protocol to determine from the MDS where the sparse ADBs | ||||
occur. [[Comment.3: Need to discuss what happens if after the file | ||||
is being written to and an INITIALIZE occurs? --TH]] Perhaps instead | ||||
of the DS pulling from the MDS, the MDS pushes to the DS? Thus an | ||||
INITIALIZE causes a new push? [[Comment.4: Still need to consider | ||||
race cases of the DS getting a WRITE and the MDS getting an | ||||
INITIALIZE. --TH]] | ||||
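The constraint relating adb_block_size to the stripe width can be expressed as a simple check. This sketch reads "even multiple" as "whole-number multiple", which is an interpretation rather than something the text states explicitly; the function name adb_block_size_fits_stripe is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: return true if adb_block_size equals the stripe
 * width or divides it evenly, so that no ADB straddles two data servers.
 * Assumes "even multiple" means a whole-number multiple.
 */
bool adb_block_size_fits_stripe(uint64_t adb_block_size, uint64_t stripe_width)
{
    assert(stripe_width > 0);
    if (adb_block_size == 0)
        return false;
    return stripe_width % adb_block_size == 0;
}
```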
6.3. An Example of Detecting Corruption | 6.3. An Example of Detecting Corruption | |||
In this section, we define an ADB format in which corruption can be | In this section, we define an ADB format in which corruption can be | |||
detected. Note that this is just one possible format and means to | detected. Note that this is just one possible format and means to | |||
detect corruption. | detect corruption. | |||
Consider a very basic implementation of an operating system's disk | Consider a very basic implementation of an operating system's disk | |||
blocks. A block is either data or it is an indirect block which | blocks. A block is either data or it is an indirect block which | |||
allows for files to be larger than one block. It is desired to be | allows for files to be larger than one block. It is desired to be | |||
skipping to change at page 38, line 30 | skipping to change at page 39, line 19 | |||
component is a LFS as defined in [22] to allow for interoperability | component is a LFS as defined in [22] to allow for interoperability | |||
between MAC mechanisms. The second component is an opaque field | between MAC mechanisms. The second component is an opaque field | |||
which is the actual security attribute data. To allow for various | which is the actual security attribute data. To allow for various | |||
MAC models NFSv4 should be used solely as a transport mechanism for | MAC models NFSv4 should be used solely as a transport mechanism for | |||
the security attribute. It is the responsibility of the endpoints to | the security attribute. It is the responsibility of the endpoints to | |||
consume the security attribute and make access decisions based on | consume the security attribute and make access decisions based on | |||
their respective models. In addition, creation of objects through | their respective models. In addition, creation of objects through | |||
OPEN and CREATE allows for the security attribute to be specified | OPEN and CREATE allows for the security attribute to be specified | |||
upon creation. By providing an atomic create and set operation for | upon creation. By providing an atomic create and set operation for | |||
the security attribute it is possible to enforce the second and | the security attribute it is possible to enforce the second and | |||
fourth requirements. The recommended attribute FATTR4_SEC_LABEL will | fourth requirements. The recommended attribute FATTR4_SEC_LABEL (see | |||
be used to satisfy this requirement. | Section 11.2.2) will be used to satisfy this requirement. | |||
7.3.1. Interpreting FATTR4_SEC_LABEL | ||||
The XDR [23] necessary to implement Labeled NFSv4 is presented below: | ||||
const FATTR4_SEC_LABEL = 81; | ||||
typedef uint32_t policy4; | ||||
Figure 6 | ||||
struct labelformat_spec4 { | ||||
policy4 lfs_lfs; | ||||
policy4 lfs_pi; | ||||
}; | ||||
struct sec_label_attr_info { | ||||
labelformat_spec4 slai_lfs; | ||||
opaque slai_data<>; | ||||
}; | ||||
The FATTR4_SEC_LABEL contains an array of two components with the | ||||
first component being an LFS. It serves to provide the receiving end | ||||
with the information necessary to translate the security attribute | ||||
into a form that is usable by the endpoint. Label Formats assigned | ||||
an LFS may optionally choose to include a Policy Identifier field to | ||||
allow for complex policy deployments. The LFS and Label Format | ||||
Registry are described in detail in [22]. The translation used to | ||||
interpret the security attribute is not specified as part of the | ||||
protocol as it may depend on various factors. The second component | ||||
is an opaque section which contains the data of the attribute. This | ||||
component is dependent on the MAC model to interpret and enforce. | ||||
In particular, it is the responsibility of the LFS specification to | ||||
define a maximum size for the opaque section, slai_data<>. When | ||||
creating or modifying a label for an object, the client needs to be | ||||
guaranteed that the server will accept a label that is sized | ||||
correctly. By both client and server being part of a specific MAC | ||||
model, the client will be aware of the size. | ||||
7.3.2. Delegations | 7.3.1. Delegations | |||
In the event that a security attribute is changed on the server while | In the event that a security attribute is changed on the server while | |||
a client holds a delegation on the file, the client should follow the | a client holds a delegation on the file, the client should follow the | |||
existing protocol with respect to attribute changes. It should flush | existing protocol with respect to attribute changes. It should flush | |||
all changes back to the server and relinquish the delegation. | all changes back to the server and relinquish the delegation. | |||
7.3.3. Permission Checking | 7.3.2. Permission Checking | |||
It is not feasible to enumerate all possible MAC models and even | It is not feasible to enumerate all possible MAC models and even | |||
levels of protection within a subset of these models. This means | levels of protection within a subset of these models. This means | |||
that the NFSv4 client and servers cannot be expected to directly make | that the NFSv4 client and servers cannot be expected to directly make | |||
access control decisions based on the security attribute. Instead | access control decisions based on the security attribute. Instead | |||
NFSv4 should defer permission checking on this attribute to the host | NFSv4 should defer permission checking on this attribute to the host | |||
system. These checks are performed in addition to existing DAC and | system. These checks are performed in addition to existing DAC and | |||
ACL checks outlined in the NFSv4 protocol. Section 7.6 gives a | ACL checks outlined in the NFSv4 protocol. Section 7.6 gives a | |||
specific example of how the security attribute is handled under a | specific example of how the security attribute is handled under a | |||
particular MAC model. | particular MAC model. | |||
7.3.4. Object Creation | 7.3.3. Object Creation | |||
When creating files in NFSv4 the OPEN and CREATE operations are used. | When creating files in NFSv4 the OPEN and CREATE operations are used. | |||
One of the parameters to these operations is an fattr4 structure | One of the parameters to these operations is an fattr4 structure | |||
containing the attributes the file is to be created with. This | containing the attributes the file is to be created with. This | |||
allows NFSv4 to atomically set the security attribute of files upon | allows NFSv4 to atomically set the security attribute of files upon | |||
creation. When a client is MAC aware it must always provide the | creation. When a client is MAC aware it must always provide the | |||
initial security attribute upon file creation. In the event that the | initial security attribute upon file creation. In the event that the | |||
server is the only MAC aware entity in the system it should ignore | server is the only MAC aware entity in the system it should ignore | |||
the security attribute specified by the client and instead make the | the security attribute specified by the client and instead make the | |||
determination itself. A more in depth explanation can be found in | determination itself. A more in depth explanation can be found in | |||
Section 7.6. | Section 7.6. | |||
7.3.5. Existing Objects | 7.3.4. Existing Objects | |||
Note that under the MAC model, all objects must have labels. | Note that under the MAC model, all objects must have labels. | |||
Therefore, if an existing server is upgraded to include LNFS support, | Therefore, if an existing server is upgraded to include LNFS support, | |||
then it is the responsibility of the security system to define the | then it is the responsibility of the security system to define the | |||
behavior for existing objects. For example, if the security system | behavior for existing objects. For example, if the security system | |||
is LFS 0, which means the server just stores and returns labels, then | is LFS 0, which means the server just stores and returns labels, then | |||
existing files should return labels which are set to an empty value. | existing files should return labels which are set to an empty value. | |||
7.3.6. Label Changes | 7.3.5. Label Changes | |||
As per the requirements, when a file's security label is modified, | As per the requirements, when a file's security label is modified, | |||
the server must notify all clients which have the file opened of the | the server must notify all clients which have the file opened of the | |||
change in label. It does so with CB_ATTR_CHANGED. There are | change in label. It does so with CB_ATTR_CHANGED. There are | |||
preconditions to making an attribute change imposed by NFSv4 and the | preconditions to making an attribute change imposed by NFSv4 and the | |||
security system might want to impose others. In the process of | security system might want to impose others. In the process of | |||
meeting these preconditions, the server may choose to either serve the | meeting these preconditions, the server may choose to either serve the | |||
request in whole or return NFS4ERR_DELAY to the SETATTR operation. | request in whole or return NFS4ERR_DELAY to the SETATTR operation. | |||
If there are open delegations on the file belonging to a client other | If there are open delegations on the file belonging to a client other | |||
than the one making the label change, then the process described in | than the one making the label change, then the process described in | |||
Section 7.3.2 must be followed. | Section 7.3.1 must be followed. | |||
As the server is always presented with the subject label from the | As the server is always presented with the subject label from the | |||
client, it does not necessarily need to communicate the fact that the | client, it does not necessarily need to communicate the fact that the | |||
label has changed to the client. In the cases where the change | label has changed to the client. In the cases where the change | |||
outright denies the client access, the client will be able to quickly | outright denies the client access, the client will be able to quickly | |||
determine that there is a new label in effect. It is in cases where | determine that there is a new label in effect. It is in cases where | |||
the client may share the same object between multiple subjects or a | the client may share the same object between multiple subjects or a | |||
security system which is not strictly hierarchical that the | security system which is not strictly hierarchical that the | |||
CB_ATTR_CHANGED callback is very useful. It allows the server to | CB_ATTR_CHANGED callback is very useful. It allows the server to | |||
inform the clients that the cached security attribute is now stale. | inform the clients that the cached security attribute is now stale. | |||
skipping to change at page 44, line 27 | skipping to change at page 44, line 24 | |||
the way of guidance. The only feature that is mandated by them is | the way of guidance. The only feature that is mandated by them is | |||
that the value must change whenever the file data or metadata change. | that the value must change whenever the file data or metadata change. | |||
While this allows for a wide range of implementations, it also leaves | While this allows for a wide range of implementations, it also leaves | |||
the client with a conundrum: how does it determine which is the most | the client with a conundrum: how does it determine which is the most | |||
recent value for the change attribute in a case where several RPC | recent value for the change attribute in a case where several RPC | |||
calls have been issued in parallel? In other words if two COMPOUNDs, | calls have been issued in parallel? In other words if two COMPOUNDs, | |||
both containing WRITE and GETATTR requests for the same file, have | both containing WRITE and GETATTR requests for the same file, have | |||
been issued in parallel, how does the client determine which of the | been issued in parallel, how does the client determine which of the | |||
two change attribute values returned in the replies to the GETATTR | two change attribute values returned in the replies to the GETATTR | |||
requests corresponds to the most recent state of the file? In some | requests correspond to the most recent state of the file? In some | |||
cases, the only recourse may be to send another COMPOUND containing a | cases, the only recourse may be to send another COMPOUND containing a | |||
third GETATTR that is fully serialised with the first two. | third GETATTR that is fully serialised with the first two. | |||
NFSv4.2 avoids this kind of inefficiency by allowing the server to | NFSv4.2 avoids this kind of inefficiency by allowing the server to | |||
share details about how the change attribute is expected to evolve, | share details about how the change attribute is expected to evolve, | |||
so that the client may immediately determine which, out of the | so that the client may immediately determine which, out of the | |||
several change attribute values returned by the server, is the most | several change attribute values returned by the server, is the most | |||
recent. | recent. change_attr_type is defined as a new recommended attribute | |||
(see Section 11.2.1), and is per filesystem. | ||||
8.2. Definition of the 'change_attr_type' per-file system attribute | ||||
enum change_attr_typeinfo { | ||||
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, | ||||
NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 | ||||
}; | ||||
+------------------+----+---------------------------+-----+ | ||||
| Name | Id | Data Type | Acc | | ||||
+------------------+----+---------------------------+-----+ | ||||
| change_attr_type | XX | enum change_attr_typeinfo | R | | ||||
+------------------+----+---------------------------+-----+ | ||||
The solution enables the NFS server to provide additional information | ||||
about how it expects the change attribute value to evolve after the | ||||
file data or metadata has changed. 'change_attr_type' is defined as a | ||||
new recommended attribute, and takes values from enum | ||||
change_attr_typeinfo as follows: | ||||
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST | ||||
monotonically increase for every atomic change to the file | ||||
attributes, data or directory contents. | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST | ||||
be incremented by one unit for every atomic change to the file | ||||
attributes, data or directory contents. This property is | ||||
preserved when writing to pNFS data servers. | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute | ||||
value MUST be incremented by one unit for every atomic change to | ||||
the file attributes, data or directory contents. In the case | ||||
where the client is writing to pNFS data servers, the number of | ||||
increments is not guaranteed to exactly match the number of | ||||
writes. | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is | ||||
implemented as suggested in the NFSv4 spec [10] in terms of the | ||||
time_metadata attribute. | ||||
NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take | ||||
values that fit into any of these categories. | ||||
If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at | ||||
the very least that the change attribute is monotonically increasing, | ||||
which is sufficient to resolve the question of which value is the | ||||
most recent. | ||||
If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then | ||||
by inspecting the value of the 'time_delta' attribute it additionally | ||||
has the option of detecting rogue server implementations that use | ||||
time_metadata in violation of the spec. | ||||
Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it | ||||
has the ability to predict what the resulting change attribute value | ||||
should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. | ||||
This again allows it to detect changes made in parallel by another | ||||
client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits | ||||
the same, but only if the client is not doing pNFS WRITEs. | ||||
9. Security Considerations | 9. Security Considerations | |||
10. Error Values | 10. Error Values | |||
NFS error numbers are assigned to failed operations within a Compound | NFS error numbers are assigned to failed operations within a Compound | |||
(COMPOUND or CB_COMPOUND) request. A Compound request contains a | (COMPOUND or CB_COMPOUND) request. A Compound request contains a | |||
number of NFS operations that have their results encoded in sequence | number of NFS operations that have their results encoded in sequence | |||
in a Compound reply. The results of successful operations will | in a Compound reply. The results of successful operations will | |||
consist of an NFS4_OK status followed by the encoded results of the | consist of an NFS4_OK status followed by the encoded results of the | |||
skipping to change at page 47, line 43 | skipping to change at page 46, line 31 | |||
10.1.3.1. NFS4ERR_BADLABEL (Error Code 10093) | 10.1.3.1. NFS4ERR_BADLABEL (Error Code 10093) | |||
The label specified is invalid in some manner. | The label specified is invalid in some manner. | |||
10.1.3.2. NFS4ERR_WRONG_LFS (Error Code 10092) | 10.1.3.2. NFS4ERR_WRONG_LFS (Error Code 10092) | |||
The LFS specified in the subject label is not compatible with the LFS | The LFS specified in the subject label is not compatible with the LFS | |||
in object label. | in object label. | |||
11. File Attributes | 11. New File Attributes | |||
11.1. Attribute Definitions | 11.1. New RECOMMENDED Attributes - List and Definition References | |||
11.1.1. Attribute 77: space_reserved | The list of new RECOMMENDED attributes appears in Table 2. The | |||
meanings of the columns of the table are: | ||||
Name: The name of the attribute. | ||||
Id: The number assigned to the attribute. In the event of conflicts | ||||
between the assigned number and [3], the latter is likely | ||||
authoritative, but should be resolved with Errata to this document | ||||
and/or [3]. See [23] for the Errata process. | ||||
Data Type: The XDR data type of the attribute. | ||||
Acc: Access allowed to the attribute. | ||||
R means read-only (GETATTR may retrieve, SETATTR may not set). | ||||
W means write-only (SETATTR may set, GETATTR may not retrieve). | ||||
R W means read/write (GETATTR may retrieve, SETATTR may set). | ||||
Defined in: The section of this specification that describes the | ||||
attribute. | ||||
+------------------+----+-------------------+-----+----------------+ | ||||
| Name | Id | Data Type | Acc | Defined in | | ||||
+------------------+----+-------------------+-----+----------------+ | ||||
| change_attr_type | 79 | change_attr_type4 | R | Section 11.2.1 | | ||||
| sec_label | 80 | sec_label4 | R W | Section 11.2.2 | | ||||
| space_reserved | 77 | boolean | R W | Section 11.2.3 | | ||||
| space_freed | 78 | length4 | R | Section 11.2.4 | | ||||
+------------------+----+-------------------+-----+----------------+ | ||||
Table 2 | ||||
11.2. Attribute Definitions | ||||
11.2.1. Attribute 79: change_attr_type | ||||
enum change_attr_type4 { | ||||
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, | ||||
NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 | ||||
}; | ||||
change_attr_type is a per filesystem attribute which enables the | ||||
NFSv4.2 server to provide additional information about how it expects | ||||
the change attribute value to evolve after the file data or metadata | ||||
has changed. | ||||
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST | ||||
monotonically increase for every atomic change to the file | ||||
attributes, data or directory contents. | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER: The change attribute value MUST | ||||
be incremented by one unit for every atomic change to the file | ||||
attributes, data or directory contents. This property is | ||||
preserved when writing to pNFS data servers. | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute | ||||
value MUST be incremented by one unit for every atomic change to | ||||
the file attributes, data or directory contents. In the case | ||||
where the client is writing to pNFS data servers, the number of | ||||
increments is not guaranteed to exactly match the number of | ||||
writes. | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is | ||||
implemented as suggested in the NFSv4 spec [10] in terms of the | ||||
time_metadata attribute. | ||||
NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take | ||||
values that fit into any of these categories. | ||||
If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, | ||||
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or | ||||
NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at | ||||
the very least that the change attribute is monotonically increasing, | ||||
which is sufficient to resolve the question of which value is the | ||||
most recent. | ||||
If the client sees the value NFS4_CHANGE_TYPE_IS_TIME_METADATA, then | ||||
by inspecting the value of the 'time_delta' attribute it additionally | ||||
has the option of detecting rogue server implementations that use | ||||
time_metadata in violation of the spec. | ||||
Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it | ||||
has the ability to predict what the resulting change attribute value | ||||
should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. | ||||
This again allows it to detect changes made in parallel by another | ||||
client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits | ||||
the same, but only if the client is not doing pNFS WRITEs. | ||||
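The client-side reasoning described above can be sketched as follows. This is a minimal illustration, not part of the protocol: the class name ChangeAttrType and the functions most_recent_change and predict_change are hypothetical, the enum values mirror change_attr_type4, and wrap-around of the 64-bit change attribute is ignored.

```python
from enum import IntEnum
from typing import Optional, Sequence


class ChangeAttrType(IntEnum):
    """Mirrors enum change_attr_type4 as defined in this section."""
    MONOTONIC_INCR = 0
    VERSION_COUNTER = 1
    VERSION_COUNTER_NOPNFS = 2
    TIME_METADATA = 3
    UNDEFINED = 4


# The three values the text names as implying a monotonically
# increasing change attribute.
ORDERED_TYPES = frozenset({
    ChangeAttrType.MONOTONIC_INCR,
    ChangeAttrType.VERSION_COUNTER,
    ChangeAttrType.TIME_METADATA,
})


def most_recent_change(attr_type: ChangeAttrType,
                       values: Sequence[int]) -> Optional[int]:
    """Pick the newest change value from replies to parallel COMPOUNDs.

    Returns None when the client cannot tell and would have to send a
    further, fully serialised GETATTR.
    """
    if attr_type in ORDERED_TYPES and values:
        return max(values)
    return None


def predict_change(attr_type: ChangeAttrType, before: int,
                   num_changes: int,
                   pnfs_writes: bool = False) -> Optional[int]:
    """Predict the change value after num_changes atomic changes.

    A mismatch with the value the server returns suggests that another
    client changed the file in parallel. Returns None when no
    prediction is possible.
    """
    if attr_type == ChangeAttrType.VERSION_COUNTER:
        return before + num_changes
    if (attr_type == ChangeAttrType.VERSION_COUNTER_NOPNFS
            and not pnfs_writes):
        return before + num_changes
    return None
```

For example, a client that saw change value 10 and then sent one WRITE and one SETATTR to a VERSION_COUNTER file system would expect 12; any other value hints at a concurrent modification.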
11.2.2. Attribute 80: sec_label | ||||
typedef uint32_t policy4; | ||||
struct labelformat_spec4 { | ||||
policy4 lfs_lfs; | ||||
policy4 lfs_pi; | ||||
}; | ||||
struct sec_label4 { | ||||
labelformat_spec4 slai_lfs; | ||||
opaque slai_data<>; | ||||
}; | ||||
The FATTR4_SEC_LABEL contains an array of two components with the | ||||
first component being an LFS. It serves to provide the receiving end | ||||
with the information necessary to translate the security attribute | ||||
into a form that is usable by the endpoint. Label Formats assigned | ||||
an LFS may optionally choose to include a Policy Identifier field to | ||||
allow for complex policy deployments. The LFS and Label Format | ||||
Registry are described in detail in [22]. The translation used to | ||||
interpret the security attribute is not specified as part of the | ||||
protocol as it may depend on various factors. The second component | ||||
is an opaque section which contains the data of the attribute. This | ||||
component is dependent on the MAC model to interpret and enforce. | ||||
In particular, it is the responsibility of the LFS specification to | ||||
define a maximum size for the opaque section, slai_data<>. When | ||||
creating or modifying a label for an object, the client needs to be | ||||
guaranteed that the server will accept a label that is sized | ||||
correctly. Because both client and server are part of a specific MAC | ||||
model, the client will be aware of the size. | ||||
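To make the wire layout of sec_label4 concrete, the sketch below encodes one label using standard XDR rules: 32-bit big-endian unsigned integers, and variable-length opaque data carried as a length word followed by the bytes, zero-padded to a multiple of four. The function name encode_sec_label4 and the max_len parameter are hypothetical; max_len stands in for whatever maximum size the LFS specification defines.

```python
import struct
from typing import Optional


def encode_sec_label4(lfs_lfs: int, lfs_pi: int, slai_data: bytes,
                      max_len: Optional[int] = None) -> bytes:
    """XDR-encode a sec_label4: labelformat_spec4, then opaque<>."""
    # Hypothetical check: the LFS specification defines the real limit.
    if max_len is not None and len(slai_data) > max_len:
        raise ValueError("label exceeds the maximum size for this LFS")
    # XDR pads opaque data with zero bytes to a 4-byte boundary.
    padding = (4 - len(slai_data) % 4) % 4
    header = struct.pack(">III", lfs_lfs, lfs_pi, len(slai_data))
    return header + slai_data + b"\x00" * padding
```

A five-byte label therefore occupies 20 bytes on the wire: two policy4 words, one length word, and eight bytes of padded data.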
11.2.3. Attribute 77: space_reserved | ||||
The space_reserved attribute is a read/write attribute of type | The space_reserved attribute is a read/write attribute of type | |||
boolean. It is a per file attribute. When the space_reserved | boolean. It is a per file attribute. When the space_reserved | |||
attribute is set via SETATTR, the server must ensure that there is | attribute is set via SETATTR, the server must ensure that there is | |||
disk space to accommodate every byte in the file before it can return | disk space to accommodate every byte in the file before it can return | |||
success. If the server cannot guarantee this, it must return | success. If the server cannot guarantee this, it must return | |||
NFS4ERR_NOSPC. | NFS4ERR_NOSPC. | |||
If the client tries to grow a file which has the space_reserved | If the client tries to grow a file which has the space_reserved | |||
attribute set, the server must guarantee that there is disk space to | attribute set, the server must guarantee that there is disk space to | |||
skipping to change at page 48, line 28 | skipping to change at page 50, line 5 | |||
The value of space_reserved can be obtained at any time through | The value of space_reserved can be obtained at any time through | |||
GETATTR. | GETATTR. | |||
In order to avoid ambiguity, the space_reserved bit cannot be set | In order to avoid ambiguity, the space_reserved bit cannot be set | |||
along with the size bit in SETATTR. Increasing the size of a file | along with the size bit in SETATTR. Increasing the size of a file | |||
with space_reserved set will fail if space reservation cannot be | with space_reserved set will fail if space reservation cannot be | |||
guaranteed for the new size. If the file size is decreased, space | guaranteed for the new size. If the file size is decreased, space | |||
reservation is only guaranteed for the new size and the extra blocks | reservation is only guaranteed for the new size and the extra blocks | |||
backing the file can be released. | backing the file can be released. | |||
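The reservation rules above can be modelled with a toy server-side bookkeeping class. This is a sketch under stated assumptions, not a description of any real server: ReservedFile, do_setattr and the held counter are hypothetical, NoSpace stands in for NFS4ERR_NOSPC, ValueError stands in for whatever error a server returns when size and space_reserved are set together, and releasing the held space when the attribute is cleared is an assumption.

```python
class NoSpace(Exception):
    """Stands in for NFS4ERR_NOSPC."""


class ReservedFile:
    """Toy model of one file on a volume with limited free space."""

    def __init__(self, free_bytes: int, size: int = 0):
        self.free_bytes = free_bytes  # unreserved space on the volume
        self.size = size
        self.held = 0                 # bytes set aside for this file
        self.space_reserved = False

    def do_setattr(self, size=None, space_reserved=None):
        # The two attributes cannot be set in the same SETATTR.
        if size is not None and space_reserved is not None:
            raise ValueError("size and space_reserved set together")
        if space_reserved is not None:
            # Assumption: clearing the attribute releases held space.
            self._hold(self.size if space_reserved else 0)
            self.space_reserved = space_reserved
        if size is not None:
            if self.space_reserved:
                # Growing must reserve first; shrinking releases space.
                self._hold(size)
            self.size = size

    def _hold(self, target: int):
        need = target - self.held
        if need > self.free_bytes:
            raise NoSpace()
        self.free_bytes -= need
        self.held = target
```

In this model a failed grow leaves both the file size and the reservation unchanged, which matches the requirement that the server guarantee space before returning success.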
11.1.2. Attribute 78: space_freed | 11.2.4. Attribute 78: space_freed | |||
space_freed gives the number of bytes freed if the file is deleted. | space_freed gives the number of bytes freed if the file is deleted. | |||
This attribute is read only and is of type length4. It is a per file | This attribute is read only and is of type length4. It is a per file | |||
attribute. | attribute. | |||
12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL | 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL | |||
The following tables summarize the operations of the NFSv4.2 protocol | The following tables summarize the operations of the NFSv4.2 protocol | |||
and the corresponding designation of REQUIRED, RECOMMENDED, and | and the corresponding designation of REQUIRED, RECOMMENDED, and | |||
OPTIONAL to implement or MUST NOT implement. The designation of MUST | OPTIONAL to implement or either OBSOLETE if implemented or MUST NOT | |||
NOT implement is reserved for those operations that were defined in | implement. The designation of OBSOLETE if implemented is reserved | |||
either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. | for those operations which are defined in either NFSv4.0 or NFSv4.1, | |||
can be implemented in NFSv4.2, and are intended to become MUST NOT | ||||
implement in NFSv4.3. The designation of MUST NOT implement is | ||||
reserved for those operations that were defined in either NFSv4.0 or | ||||
NFSv4.1 and MUST NOT be implemented in NFSv4.2. | ||||
For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation | For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation | |||
for operations sent by the client is for the server implementation. | for operations sent by the client is for the server implementation. | |||
The client is generally required to implement the operations needed | The client is generally required to implement the operations needed | |||
for the operating environment that it serves. For example, a | for the operating environment that it serves. For example, a | |||
read-only NFSv4.2 client would have no need to implement the WRITE | read-only NFSv4.2 client would have no need to implement the WRITE | |||
operation and is not required to do so. | operation and is not required to do so. | |||
The REQUIRED or OPTIONAL designation for callback operations sent by | The REQUIRED or OPTIONAL designation for callback operations sent by | |||
the server is for both the client and server. Generally, the client | the server is for both the client and server. Generally, the client | |||
skipping to change at page 49, line 26 | skipping to change at page 51, line 11 | |||
The abbreviations used in the second and third columns of the table | The abbreviations used in the second and third columns of the table | |||
are defined as follows. | are defined as follows. | |||
REQ REQUIRED to implement | REQ REQUIRED to implement | |||
REC RECOMMENDED to implement | REC RECOMMENDED to implement | |||
OPT OPTIONAL to implement | OPT OPTIONAL to implement | |||
OBS OBSOLETE if implemented | ||||
MNI MUST NOT implement | MNI MUST NOT implement | |||
For the NFSv4.2 features that are OPTIONAL, the operations that | For the NFSv4.2 features that are OPTIONAL, the operations that | |||
support those features are OPTIONAL, and the server would return | support those features are OPTIONAL, and the server would return | |||
NFS4ERR_NOTSUPP in response to the client's use of those operations. | NFS4ERR_NOTSUPP in response to the client's use of those operations. | |||
If an OPTIONAL feature is supported, it is possible that a set of | If an OPTIONAL feature is supported, it is possible that a set of | |||
operations related to the feature becomes REQUIRED to implement. The | operations related to the feature becomes REQUIRED to implement. The | |||
third column of the table designates the feature(s) and if the | third column of the table designates the feature(s) and if the | |||
operation is REQUIRED or OPTIONAL in the presence of support for the | operation is REQUIRED or OPTIONAL in the presence of support for the | |||
feature. | feature. | |||
skipping to change at page 50, line 51 | skipping to change at page 52, line 36 | |||
| LOOKUP | REQ | | | | LOOKUP | REQ | | | |||
| LOOKUPP | REQ | | | | LOOKUPP | REQ | | | |||
| NVERIFY | REQ | | | | NVERIFY | REQ | | | |||
| OPEN | REQ | | | | OPEN | REQ | | | |||
| OPENATTR | OPT | | | | OPENATTR | OPT | | | |||
| OPEN_CONFIRM | MNI | | | | OPEN_CONFIRM | MNI | | | |||
| OPEN_DOWNGRADE | REQ | | | | OPEN_DOWNGRADE | REQ | | | |||
| PUTFH | REQ | | | | PUTFH | REQ | | | |||
| PUTPUBFH | REQ | | | | PUTPUBFH | REQ | | | |||
| PUTROOTFH | REQ | | | | PUTROOTFH | REQ | | | |||
| READ | OPT | | | | READ | OBS | | | |||
| READDIR | REQ | | | | READDIR | REQ | | | |||
| READLINK | OPT | | | | READLINK | OPT | | | |||
| READ_PLUS | OPT | ADB (REQ) | | | READ_PLUS | OPT | ADB (REQ) | | |||
| RECLAIM_COMPLETE | REQ | | | | RECLAIM_COMPLETE | REQ | | | |||
| RELEASE_LOCKOWNER | MNI | | | | RELEASE_LOCKOWNER | MNI | | | |||
| REMOVE | REQ | | | | REMOVE | REQ | | | |||
| RENAME | REQ | | | | RENAME | REQ | | | |||
| RENEW | MNI | | | | RENEW | MNI | | | |||
| RESTOREFH | REQ | | | | RESTOREFH | REQ | | | |||
| SAVEFH | REQ | | | | SAVEFH | REQ | | | |||
skipping to change at page 55, line 8 | skipping to change at page 57, line 5 | |||
server, the behavior is implementation dependent. | server, the behavior is implementation dependent. | |||
If the metadata flag is set and the client is requesting a whole file | If the metadata flag is set and the client is requesting a whole file | |||
copy (i.e., ca_count is 0 (zero)), a subset of the destination file's | copy (i.e., ca_count is 0 (zero)), a subset of the destination file's | |||
attributes MUST be the same as the source file's corresponding | attributes MUST be the same as the source file's corresponding | |||
attributes and a subset of the destination file's attributes SHOULD | attributes and a subset of the destination file's attributes SHOULD | |||
be the same as the source file's corresponding attributes. The | be the same as the source file's corresponding attributes. The | |||
attributes in the MUST and SHOULD copy subsets will be defined for | attributes in the MUST and SHOULD copy subsets will be defined for | |||
each NFS version. | each NFS version. | |||
For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED | For NFSv4.2, Table 3 and Table 4 list the REQUIRED and RECOMMENDED | |||
attributes respectively. A "MUST" in the "Copy to destination file?" | attributes respectively. A "MUST" in the "Copy to destination file?" | |||
column indicates that the attribute is part of the MUST copy set. A | column indicates that the attribute is part of the MUST copy set. A | |||
"SHOULD" in the "Copy to destination file?" column indicates that the | "SHOULD" in the "Copy to destination file?" column indicates that the | |||
attribute is part of the SHOULD copy set. | attribute is part of the SHOULD copy set. | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
| Name | Id | Copy to destination file? | | | Name | Id | Copy to destination file? | | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
| supported_attrs | 0 | no | | | supported_attrs | 0 | no | | |||
| type | 1 | MUST | | | type | 1 | MUST | | |||
skipping to change at page 55, line 33 | skipping to change at page 57, line 30 | |||
| symlink_support | 6 | no | | | symlink_support | 6 | no | | |||
| named_attr | 7 | no | | | named_attr | 7 | no | | |||
| fsid | 8 | no | | | fsid | 8 | no | | |||
| unique_handles | 9 | no | | | unique_handles | 9 | no | | |||
| lease_time | 10 | no | | | lease_time | 10 | no | | |||
| rdattr_error | 11 | no | | | rdattr_error | 11 | no | | |||
| filehandle | 19 | no | | | filehandle | 19 | no | | |||
| suppattr_exclcreat | 75 | no | | | suppattr_exclcreat | 75 | no | | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
Table 2 | Table 3 | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
| Name | Id | Copy to destination file? | | | Name | Id | Copy to destination file? | | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
| acl | 12 | MUST | | | acl | 12 | MUST | | |||
| aclsupport | 13 | no | | | aclsupport | 13 | no | | |||
| archive | 14 | no | | | archive | 14 | no | | |||
| cansettime | 15 | no | | | cansettime | 15 | no | | |||
| case_insensitive | 16 | no | | | case_insensitive | 16 | no | | |||
| case_preserving | 17 | no | | | case_preserving | 17 | no | | |||
| change_attr_type | 79 | no | | ||||
| change_policy | 60 | no | | | change_policy | 60 | no | | |||
| chown_restricted | 18 | MUST | | | chown_restricted | 18 | MUST | | |||
| dacl | 58 | MUST | | | dacl | 58 | MUST | | |||
| dir_notif_delay | 56 | no | | | dir_notif_delay | 56 | no | | |||
| dirent_notif_delay | 57 | no | | | dirent_notif_delay | 57 | no | | |||
| fileid | 20 | no | | | fileid | 20 | no | | |||
| files_avail | 21 | no | | | files_avail | 21 | no | | |||
| files_free | 22 | no | | | files_free | 22 | no | | |||
| files_total | 23 | no | | | files_total | 23 | no | | |||
| fs_charset_cap | 76 | no | | | fs_charset_cap | 76 | no | | |||
skipping to change at page 56, line 40 | skipping to change at page 58, line 37 | |||
| quota_avail_hard | 38 | no | | | quota_avail_hard | 38 | no | | |||
| quota_avail_soft | 39 | no | | | quota_avail_soft | 39 | no | | |||
| quota_used | 40 | no | | | quota_used | 40 | no | | |||
| rawdev | 41 | no | | | rawdev | 41 | no | | |||
| retentevt_get | 71 | MUST | | | retentevt_get | 71 | MUST | | |||
| retentevt_set | 72 | no | | | retentevt_set | 72 | no | | |||
| retention_get | 69 | MUST | | | retention_get | 69 | MUST | | |||
| retention_hold | 73 | MUST | | | retention_hold | 73 | MUST | | |||
| retention_set | 70 | no | | | retention_set | 70 | no | | |||
| sacl | 59 | MUST | | | sacl | 59 | MUST | | |||
| sec_label | 80 | MUST | | ||||
| space_avail | 42 | no | | | space_avail | 42 | no | | |||
| space_free | 43 | no | | | space_free | 43 | no | | |||
| space_freed | 78 | no | | | space_freed | 78 | no | | |||
| space_reserved | 77 | MUST | | | space_reserved | 77 | MUST | | |||
| space_total | 44 | no | | | space_total | 44 | no | | |||
| space_used | 45 | no | | | space_used | 45 | no | | |||
| system | 46 | MUST | | | system | 46 | MUST | | |||
| time_access | 47 | MUST | | | time_access | 47 | MUST | | |||
| time_access_set | 48 | no | | | time_access_set | 48 | no | | |||
| time_backup | 49 | no | | | time_backup | 49 | no | | |||
| time_create | 50 | MUST | | | time_create | 50 | MUST | | |||
| time_delta | 51 | no | | | time_delta | 51 | no | | |||
| time_metadata | 52 | SHOULD | | | time_metadata | 52 | SHOULD | | |||
| time_modify | 53 | MUST | | | time_modify | 53 | MUST | | |||
| time_modify_set | 54 | no | | | time_modify_set | 54 | no | | |||
+--------------------+----+---------------------------+ | +--------------------+----+---------------------------+ | |||
Table 3 | Table 4 | |||
[NOTE: The source file's attribute values will take precedence over | [NOTE: The source file's attribute values will take precedence over | |||
any attribute values inherited by the destination file.] | any attribute values inherited by the destination file.] | |||
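The MUST and SHOULD copy subsets can be applied mechanically once the table is available as data. The sketch below is illustrative only: COPY_POLICY holds a handful of rows taken from the tables above rather than the full list, and split_copy_sets is a hypothetical helper name.

```python
from typing import Any, Dict, Tuple

# A few rows from the tables above: attribute name -> copy policy.
COPY_POLICY = {
    "type": "MUST",
    "acl": "MUST",
    "sec_label": "MUST",
    "space_reserved": "MUST",
    "time_modify": "MUST",
    "time_metadata": "SHOULD",
    "change_attr_type": "no",
    "space_freed": "no",
    "fileid": "no",
}


def split_copy_sets(
    source_attrs: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split source attributes into the MUST and SHOULD copy subsets.

    Attributes marked "no", or absent from the policy table, are left
    out of both subsets.
    """
    must = {name: value for name, value in source_attrs.items()
            if COPY_POLICY.get(name) == "MUST"}
    should = {name: value for name, value in source_attrs.items()
              if COPY_POLICY.get(name) == "SHOULD"}
    return must, should
```

A destination server would then set every attribute in the first dictionary and make a best effort on the second.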
In the case of an inter-server copy or an intra-server copy between | In the case of an inter-server copy or an intra-server copy between | |||
file systems, the attributes supported for the source file and | file systems, the attributes supported for the source file and | |||
destination file could be different. By definition, the REQUIRED | destination file could be different. By definition, the REQUIRED | |||
attributes will be supported in all cases. If the metadata flag is | attributes will be supported in all cases. If the metadata flag is | |||
set and the source file has a RECOMMENDED attribute that is not | set and the source file has a RECOMMENDED attribute that is not | |||
supported for the destination file, the copy MUST fail with | supported for the destination file, the copy MUST fail with | |||
skipping to change at page 59, line 4 | skipping to change at page 60, line 48 | |||
o NFS4ERR_FBIG | o NFS4ERR_FBIG | |||
o NFS4ERR_NOTDIR | o NFS4ERR_NOTDIR | |||
o NFS4ERR_WRONG_TYPE | o NFS4ERR_WRONG_TYPE | |||
o NFS4ERR_ISDIR | o NFS4ERR_ISDIR | |||
o NFS4ERR_INVAL | o NFS4ERR_INVAL | |||
o NFS4ERR_DELAY | ||||
o NFS4ERR_METADATA_NOTSUPP | o NFS4ERR_METADATA_NOTSUPP | |||
o NFS4ERR_WRONGSEC | o NFS4ERR_WRONGSEC | |||
13.2. Operation 60: COPY_ABORT - Cancel a server-side copy | 13.2. Operation 60: COPY_ABORT - Cancel a server-side copy | |||
13.2.1. ARGUMENT | 13.2.1. ARGUMENT | |||
struct COPY_ABORT4args { | struct COPY_ABORT4args { | |||
/* CURRENT_FH: destination file */ | /* CURRENT_FH: destination file */ | |||
skipping to change at page 65, line 34 | skipping to change at page 67, line 34 | |||
o The server will not reply to DESTROY_SESSION until all operations | o The server will not reply to DESTROY_SESSION until all operations | |||
in progress are completed or aborted. | in progress are completed or aborted. | |||
o The server will not reply to subsequent EXCHANGE_ID invoked on the | o The server will not reply to subsequent EXCHANGE_ID invoked on the | |||
same Client Owner with a new verifier until all operations in | same Client Owner with a new verifier until all operations in | |||
progress on the Client ID's session are completed or aborted. | progress on the Client ID's session are completed or aborted. | |||
o When DESTROY_CLIENTID is invoked, if there are sessions (both idle | o When DESTROY_CLIENTID is invoked, if there are sessions (both idle | |||
and non-idle), opens, locks, delegations, layouts, and/or wants | and non-idle), opens, locks, delegations, layouts, and/or wants | |||
(Section 18.49) associated with the client ID, they are removed. | (Section 18.49 of [2]) associated with the client ID, they are removed. | |||
Pending operations will be completed or aborted before the | Pending operations will be completed or aborted before the | |||
sessions, opens, locks, delegations, layouts, and/or wants are | sessions, opens, locks, delegations, layouts, and/or wants are | |||
deleted. | deleted. | |||
o The NFS server SHOULD support client ID trunking, and if it does | o The NFS server SHOULD support client ID trunking, and if it does | |||
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a | and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a | |||
session ID created on one node of the storage cluster MUST be | session ID created on one node of the storage cluster MUST be | |||
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID | destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID | |||
and an EXCHANGE_ID with a new verifier affects all sessions | and an EXCHANGE_ID with a new verifier affects all sessions | |||
regardless of what node the sessions were created on. | regardless of what node the sessions were created on. | |||
skipping to change at page 67, line 4 | skipping to change at page 68, line 45 | |||
}; | }; | |||
union INITIALIZE4res switch (nfsstat4 status) { | union INITIALIZE4res switch (nfsstat4 status) { | |||
case NFS4_OK: | case NFS4_OK: | |||
INITIALIZE4resok resok4; | INITIALIZE4resok resok4; | |||
default: | default: | |||
void; | void; | |||
}; | }; | |||
13.7.3. DESCRIPTION | 13.7.3. DESCRIPTION | |||
Using the data_content4 (Section 6.1.2), INITIALIZE can be used | ||||
either to punch holes or to impose ADB structure on a file. | ||||
13.7.3.1. Hole punching | 13.7.3.1. Hole punching | |||
Whenever a client wishes to zero the blocks backing a particular | Whenever a client wishes to zero the blocks backing a particular | |||
region in the file, it calls the INITIALIZE operation with the | region in the file, it calls the INITIALIZE operation with the | |||
current filehandle set to the filehandle of the file in question, and | current filehandle set to the filehandle of the file in question, and | |||
the equivalent of start offset and length in bytes of the region set | the equivalent of start offset and length in bytes of the region set | |||
in ia_hole.di_offset and ia_hole.di_length respectively. If the | in ia_hole.di_offset and ia_hole.di_length respectively. If the | |||
ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed | ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed | |||
and if it is set to FALSE, then they will be deallocated. All | and if it is set to FALSE, then they will be deallocated. All | |||
further reads to this region MUST return zeros until overwritten. | further reads to this region MUST return zeros until overwritten. | |||
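The two hole-punching behaviours can be illustrated with a small in-memory model. This is a sketch, not server code: SparseFile, BLOCK_SIZE and the allocated set are hypothetical, and clamping the region at end of file is an assumption made to keep the example short.

```python
BLOCK_SIZE = 4096  # hypothetical allocation unit


class SparseFile:
    """In-memory stand-in for a file and its backing blocks."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        nblocks = -(-len(data) // BLOCK_SIZE)  # ceiling division
        self.allocated = set(range(nblocks))

    def initialize_hole(self, di_offset: int, di_length: int,
                        di_allocated: bool) -> None:
        # Assumption: a region running past end of file is clamped.
        end = min(di_offset + di_length, len(self.data))
        if end <= di_offset:
            return
        # Either way, later reads of the region must return zeros.
        self.data[di_offset:end] = bytes(end - di_offset)
        if not di_allocated:
            # Deallocate only blocks that lie wholly inside the region.
            first = -(-di_offset // BLOCK_SIZE)
            last = end // BLOCK_SIZE
            for block in range(first, last):
                self.allocated.discard(block)

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.data[offset:offset + length])
```

With di_allocated set to TRUE the blocks stay in the allocated set and are only zeroed; with FALSE the fully covered blocks are dropped, yet reads of the region return zeros in both cases.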
skipping to change at page 69, line 10 | skipping to change at page 71, line 10 | |||
misaligned creation of ADBs. Even while it can detect them, it | misaligned creation of ADBs. Even while it can detect them, it | |||
cannot disallow them, as the application might be in the process of | cannot disallow them, as the application might be in the process of | |||
changing the size of the ADBs. Thus the server must be prepared to | changing the size of the ADBs. Thus the server must be prepared to | |||
handle an INITIALIZE into an existing ADB. | handle an INITIALIZE into an existing ADB. | |||
This document does not mandate the manner in which the server stores | This document does not mandate the manner in which the server stores | |||
ADBs sparsely for a file. It does assume that if ADBs are stored | ADBs sparsely for a file. It does assume that if ADBs are stored | |||
sparsely, then the server can detect when an INITIALIZE arrives that | sparsely, then the server can detect when an INITIALIZE arrives that | |||
will force a new ADB to start inside an existing ADB. For example, | will force a new ADB to start inside an existing ADB. For example, | |||
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE | assume that ADBi has an adb_block_size of 4k and that an INITIALIZE | |||
starts 1k inside ADBi. The server should [[Comment.5: Need to flesh | starts 1k inside ADBi. The server should [[Comment.2: Need to flesh | |||
this out. --TH]] | this out. --TH]] | |||
13.8. Operation 67: IO_ADVISE - Application I/O access pattern hints | 13.8. Operation 67: IO_ADVISE - Application I/O access pattern hints | |||
This section introduces a new operation, named IO_ADVISE, which | This section introduces a new operation, named IO_ADVISE, which | |||
allows NFS clients to communicate application I/O access pattern | allows NFS clients to communicate application I/O access pattern | |||
hints to the NFS server. This new operation will allow hints to be | hints to the NFS server. This new operation will allow hints to be | |||
sent to the server when applications use posix_fadvise, direct I/O, | sent to the server when applications use posix_fadvise, direct I/O, | |||
or at any other point that the client finds useful. | or at any other point that the client finds useful. | |||
skipping to change at page 75, line 37 | skipping to change at page 77, line 37 | |||
byte range: | byte range: | |||
o IO_ADVISE4_READ | o IO_ADVISE4_READ | |||
o IO_ADVISE4_WRITE | o IO_ADVISE4_WRITE | |||
13.9. Changes to Operation 51: LAYOUTRETURN | 13.9. Changes to Operation 51: LAYOUTRETURN | |||
13.9.1. Introduction | 13.9.1. Introduction | |||
In the pNFS description provided in [2], the client is not enabled to | In the pNFS description provided in [2], the client is not capable to | |||
relay an error code from the DS to the MDS. In the specification of | relay an error code from the DS to the MDS. In the specification of | |||
the Objects-Based Layout protocol [9], use is made of the opaque | the Objects-Based Layout protocol [9], use is made of the opaque | |||
lrf_body field of the LAYOUTRETURN argument to do such a relaying of | lrf_body field of the LAYOUTRETURN argument to do such a relaying of | |||
error codes. In this section, we define a new data structure to | error codes. In this section, we define a new data structure to | |||
enable the passing of error codes back to the MDS and provide some | enable the passing of error codes back to the MDS and provide some | |||
guidelines on what both the client and MDS should expect in such | guidelines on what both the client and MDS should expect in such | |||
circumstances. | circumstances. | |||
There are two broad classes of errors, transient and persistent. The | There are two broad classes of errors, transient and persistent. The | |||
client SHOULD strive to only use this new mechanism to report | client SHOULD strive to only use this new mechanism to report | |||
persistent errors. It MUST be able to deal with transient issues by | persistent errors. It MUST be able to deal with transient issues by | |||
itself. Also, while the client might consider an issue to be | itself. Also, while the client might consider an issue to be | |||
persistent, it MUST be prepared for the MDS to consider such issues | persistent, it MUST be prepared for the MDS to consider such issues | |||
to be persistent. A prime example of this is if the MDS fences off a | to be transient. A prime example of this is if the MDS fences off a | |||
client from either a stateid or a filehandle. The client will get an | client from either a stateid or a filehandle. The client will get an | |||
error from the DS and might relay either NFS4ERR_ACCESS or | error from the DS and might relay either NFS4ERR_ACCESS or | |||
NFS4ERR_STALE_STATEID back to the MDS, with the belief that this is a | NFS4ERR_BAD_STATEID back to the MDS, with the belief that this is a | |||
hard error. The MDS on the other hand, is waiting for the client to | hard error. If the MDS is informed by the client that there is an | |||
report such an error. For it, the mission is accomplished in that | error, it can safely ignore that. For it, the mission is | |||
the client has returned a layout that the MDS had most likely | accomplished in that the client has returned a layout that the MDS | |||
recalled. | had most likely recalled. | |||
The client might also need to inform the MDS that it cannot reach one | ||||
or more of the DSes. While the MDS can detect the connectivity of | ||||
both of these paths: | ||||
o MDS to DS | ||||
o MDS to client | ||||
it cannot determine if the client and DS path is working. As with | ||||
the case of the DS passing errors to the client, it must be prepared | ||||
for the MDS to consider such outages as being transitory. | ||||
The existing LAYOUTRETURN operation is extended by introducing a new | The existing LAYOUTRETURN operation is extended by introducing a new | |||
data structure to report errors, layoutreturn_device_error4. Also, | data structure to report errors, layoutreturn_device_error4. Also, | |||
layoutreturn_error_report4 is introduced to enable an array of errors | layoutreturn_error_report4 is introduced to enable an array of errors | |||
to be reported. | to be reported. | |||
13.9.2. ARGUMENT | 13.9.2. ARGUMENT | |||
The ARGUMENT specification of the LAYOUTRETURN operation in section | The ARGUMENT specification of the LAYOUTRETURN operation in section | |||
18.44.1 of [2] is augmented by the following XDR code [23]: | 18.44.1 of [2] is augmented by the following XDR code [24]: | |||
struct layoutreturn_device_error4 { | struct layoutreturn_device_error4 { | |||
deviceid4 lrde_deviceid; | deviceid4 lrde_deviceid; | |||
nfsstat4 lrde_status; | nfsstat4 lrde_status; | |||
nfs_opnum4 lrde_opnum; | nfs_opnum4 lrde_opnum; | |||
}; | }; | |||
struct layoutreturn_error_report4 { | struct layoutreturn_error_report4 { | |||
layoutreturn_device_error4 lrer_errors<>; | layoutreturn_device_error4 lrer_errors<>; | |||
}; | }; | |||
skipping to change at page 76, line 42 | skipping to change at page 79, line 10 | |||
13.9.3. RESULT | 13.9.3. RESULT | |||
The RESULT of the LAYOUTRETURN operation is unchanged; see section | The RESULT of the LAYOUTRETURN operation is unchanged; see section | |||
18.44.2 of [2]. | 18.44.2 of [2]. | |||
13.9.4. DESCRIPTION | 13.9.4. DESCRIPTION | |||
The following text is added to the end of the LAYOUTRETURN operation | The following text is added to the end of the LAYOUTRETURN operation | |||
DESCRIPTION in section 18.44.3 of [2]. | DESCRIPTION in section 18.44.3 of [2]. | |||
When a client used LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, | When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, | |||
then if the lrf_body field is NULL, it indicates to the MDS that the | then if the lrf_body field is NULL, it indicates to the MDS that the | |||
client experienced no errors. If lrf_body is non-NULL, then the | client experienced no errors. If lrf_body is non-NULL, then the | |||
field references error information which is layout type specific. | field references error information which is layout type specific. | |||
I.e., the Objects-Based Layout protocol can continue to utilize | I.e., the Objects-Based Layout protocol can continue to utilize | |||
lrf_body as specified in [9]. For both Files-Based Layouts, the | lrf_body as specified in [9]. For both Files-Based and Block-Based | |||
field references a layoutreturn_error_report4, which contains an | Layouts, the field references a layoutreturn_error_report4, which | |||
array of layoutreturn_device_error4. | contains an array of layoutreturn_device_error4. | |||
Each individual layoutreturn_device_error4 describes a single error | Each individual layoutreturn_device_error4 describes a single error | |||
associated with a DS, which is identified via lrde_deviceid. The | associated with a DS, which is identified via lrde_deviceid. The | |||
operation which returned the error is identified via lrde_opnum. | operation which returned the error is identified via lrde_opnum. | |||
Finally the NFS error value (nfsstat4) encountered is provided via | Finally the NFS error value (nfsstat4) encountered is provided via | |||
lrde_status and may consist of the following error codes: | lrde_status and may consist of the following error codes: | |||
NFS4_OK: No issues were found for this device. | ||||
NFS4ERR_NXIO: The client was unable to establish any communication | NFS4ERR_NXIO: The client was unable to establish any communication | |||
with the DS. | with the DS. | |||
NFS4ERR_*: The client was able to establish communication with the | NFS4ERR_*: The client was able to establish communication with the | |||
DS and is returning one of the allowed error codes for the | DS and is returning one of the allowed error codes for the | |||
operation denoted by lrde_opnum. | operation denoted by lrde_opnum. | |||
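As a rough illustration of the two structures described above, the following Python sketch models layoutreturn_device_error4 and layoutreturn_error_report4 and encodes a report in XDR form. The 16-byte deviceid4 and the numeric values of NFS4ERR_NXIO and OP_WRITE are taken from RFC 5661; the class name DeviceError, the helper encode_report, and the sample device ID are hypothetical and exist only for this example.

```python
import struct
from dataclasses import dataclass
from typing import List

NFS4_DEVICEID4_SIZE = 16   # fixed opaque size of deviceid4 in RFC 5661
NFS4ERR_NXIO = 6           # nfsstat4 value from RFC 5661
OP_WRITE = 38              # nfs_opnum4 value from RFC 5661


@dataclass
class DeviceError:
    """Illustrative stand-in for layoutreturn_device_error4."""
    deviceid: bytes   # lrde_deviceid
    status: int       # lrde_status
    opnum: int        # lrde_opnum

    def encode(self) -> bytes:
        if len(self.deviceid) != NFS4_DEVICEID4_SIZE:
            raise ValueError("deviceid4 must be exactly 16 bytes")
        # Fixed-length opaque, followed by two 32-bit big-endian enums.
        return self.deviceid + struct.pack(">II", self.status, self.opnum)


def encode_report(errors: List[DeviceError]) -> bytes:
    """Encode lrer_errors<> as an XDR variable-length array (hypothetical helper)."""
    body = b"".join(e.encode() for e in errors)
    # XDR variable-length arrays start with a 32-bit element count.
    return struct.pack(">I", len(errors)) + body


# Sample report: the client could not reach one DS while attempting a WRITE.
report = encode_report([DeviceError(b"\x01" * 16, NFS4ERR_NXIO, OP_WRITE)])
```

A report with no entries encodes as a bare zero count, which is distinct from the NULL lrf_body case that signals no errors at all.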
13.9.5. IMPLEMENTATION | 13.9.5. IMPLEMENTATION | |||
The following text is added to the end of the LAYOUTRETURN operation | The following text is added to the end of the LAYOUTRETURN operation | |||
IMPLEMENTATION in section 18.44.4 of [2]. | IMPLEMENTATION in section 18.44.4 of [2]. | |||
A client that expects to use pNFS for a mounted filesystem SHOULD | ||||
check for pNFS support at mount time. This check SHOULD be performed | ||||
by sending a GETDEVICELIST operation, followed by layout-type- | ||||
specific checks for accessibility of each storage device returned by | ||||
GETDEVICELIST. If the NFS server does not support pNFS, the | ||||
GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP | ||||
error; in this situation it is up to the client to determine whether | ||||
it is acceptable to proceed with NFS-only access. | ||||
Clients are expected to tolerate transient storage device errors, and | Clients are expected to tolerate transient storage device errors, and | |||
hence clients SHOULD NOT use the LAYOUTRETURN error handling for | hence clients SHOULD NOT use the LAYOUTRETURN error handling for | |||
device access problems that may be transient. The methods by which a | device access problems that may be transient. The methods by which a | |||
client decides whether an access problem is transient vs. persistent | client decides whether a device access problem is transient vs. | |||
are implementation-specific, but may include retrying I/Os to a data | persistent are implementation-specific, but may include retrying I/Os | |||
server under appropriate conditions. | to a data server under appropriate conditions. | |||
When an I/O fails to a storage device, the client SHOULD retry the | When an I/O fails to a storage device, the client SHOULD retry the | |||
failed I/O via the MDS. In this situation, before retrying the I/O, | failed I/O via the MDS. In this situation, before retrying the I/O, | |||
the client SHOULD return the layout, or the affected portion thereof, | the client SHOULD return the layout, or the affected portion thereof, | |||
and SHOULD indicate which storage device or devices were problematic. | and SHOULD indicate which storage device or devices were problematic. | |||
If the client does not do this, the MDS may issue a layout recall | The client needs to do this when the DS is being unresponsive in | |||
order to fence off any failed write attempts, and ensure that they do | ||||
not end up overwriting any later data being written through the MDS. | ||||
If the client does not do this, the MDS MAY issue a layout recall | ||||
callback in order to perform the retried I/O. | callback in order to perform the retried I/O. | |||
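The client-side sequence described above (retry against the data server, then return the layout naming the failing device, then retry through the MDS) might be sketched as follows. This is only one possible policy: the draft leaves the transient-versus-persistent decision to the implementation, so the fixed attempt count, the function name write_with_fallback, and the callback signatures are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

NFS4_OK = 0


def write_with_fallback(
    ds_write: Callable[[], int],
    mds_write: Callable[[], int],
    layout_return: Callable[[List[Tuple[bytes, int]]], None],
    device_id: bytes,
    max_ds_attempts: int = 3,   # assumed policy; not specified by the draft
) -> str:
    """Try the data server first, then fall back to the MDS."""
    status = NFS4_OK
    for _ in range(max_ds_attempts):
        status = ds_write()
        if status == NFS4_OK:
            return "ds"
    # The failure is now treated as persistent: return the layout and name
    # the failing device before the I/O is retried through the MDS.
    layout_return([(device_id, status)])
    mds_write()
    return "mds"


# Stub callbacks that record the order of events.
events = []
path = write_with_fallback(
    ds_write=lambda: 6,  # always fails with NFS4ERR_NXIO
    mds_write=lambda: events.append("mds_write") or NFS4_OK,
    layout_return=lambda errs: events.append(("layoutreturn", errs)),
    device_id=b"\x02" * 16,
)
```

Returning the layout before the MDS retry matters because it gives the MDS the chance to fence off any writes still in flight to the unresponsive DS.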
The client needs to be cognizant that since this error handling is | The client needs to be cognizant that since this error handling is | |||
optional in the MDS, the MDS may silently ignore this functionality. | optional in the MDS, the MDS may silently ignore this functionality. | |||
Also, as the MDS may consider some issues the client reports to be | Also, as the MDS may consider some issues the client reports to be | |||
expected (see Section 13.9.1), the client might find it difficult to | expected (see Section 13.9.1), the client might find it difficult to | |||
detect an MDS which has not implemented error handling via | detect an MDS which has not implemented error handling via | |||
LAYOUTRETURN. | LAYOUTRETURN. | |||
If an MDS is aware that a storage device is proving problematic to a | If an MDS is aware that a storage device is proving problematic to a | |||
client, the MDS SHOULD NOT include that storage device in any pNFS | client, the MDS SHOULD NOT include that storage device in any pNFS | |||
layouts sent to that client. If the MDS is aware that a storage | layouts sent to that client. If the MDS is aware that a storage | |||
device is affecting many clients, then the MDS SHOULD NOT include | device is affecting many clients, then the MDS SHOULD NOT include | |||
that storage device in any pNFS layouts sent out. Clients must still | that storage device in any pNFS layouts sent out. If a client asks | |||
be aware that the MDS might not have any choice in using the storage | for a new layout for the file from the MDS, it MUST be prepared for | |||
device, i.e., there might only be one possible layout for the system. | the MDS to return that storage device in the layout. The MDS might | |||
not have any choice in using the storage device, i.e., there might | ||||
Another interesting complication is that for existing files, the MDS | only be one possible layout for the system. Also, in the case of | |||
might have no choice in which storage devices to hand out to clients. | existing files, the MDS might have no choice in which storage devices | |||
The MDS might try to restripe a file across a different storage | to hand out to clients. | |||
device, but clients need to be aware that not all implementations | ||||
have restriping support. | ||||
An MDS SHOULD react to a client return of layouts with errors by not | The MDS is not required to indefinitely retain per-client storage | |||
using the problematic storage devices in layouts for that client, but | ||||
the MDS is not required to indefinitely retain per-client storage | ||||
device error information. An MDS is also not required to | device error information. An MDS is also not required to | |||
automatically reinstate use of a previously problematic storage | automatically reinstate use of a previously problematic storage | |||
device; administrative intervention may be required instead. | device; administrative intervention may be required instead. | |||
A client MAY perform I/O via the MDS even when the client holds a | ||||
layout that covers the I/O; servers MUST support this client | ||||
behavior, and MAY recall layouts as needed to complete I/Os. | ||||
13.10. Operation 65: READ_PLUS | 13.10. Operation 65: READ_PLUS | |||
READ_PLUS is a new read operation which allows NFS clients to avoid | READ_PLUS is a new variant of the NFSv4.1 READ operation [2]. | |||
reading holes in a sparse file and to efficiently transfer ADBs. | Besides being able to support all of the data semantics of READ, it | |||
READ_PLUS supports all the features of the existing NFSv4.1 READ | can also be used by the server to return either holes or ADBs to the | |||
operation [2] but also extends the response to avoid returning data | client. For holes, READ_PLUS extends the response to avoid returning | |||
for portions of the file which are either initialized and contain no | data for portions of the file which are either initialized and | |||
backing store or if the result would appear to be so. I.e., if the | contain no backing store or if the result would appear to be so. | |||
result was a data block composed entirely of zeros, then it is easier | I.e., if the result was a data block composed entirely of zeros, then | |||
to return a hole. Returning data blocks of uninitialized data wastes | it is easier to return a hole. Returning data blocks of uninitialized | |||
computational and network resources, thus reducing performance. | data wastes computational and network resources, thus reducing | |||
READ_PLUS uses a new result structure that tells the client that the | performance. For ADBs, READ_PLUS is used to return the metadata | |||
result is all zeroes AND the byte-range of the hole in which the | describing the portions of the file which are initialized and | |||
request was made. | contain no backing store. | |||
If the client sends a READ operation, it is explicitly stating that | If the client sends a READ operation, it is explicitly stating that | |||
it is neither supporting sparse files nor ADBs. So if a READ occurs | it is neither supporting sparse files nor ADBs. So if a READ occurs | |||
on a sparse ADB or file, then the server must expand such data to be | on a sparse ADB or file, then the server must expand such data to be | |||
raw bytes. If a READ occurs in the middle of a hole or ADB, the | raw bytes. If a READ occurs in the middle of a hole or ADB, the | |||
server can only send back bytes starting from that offset. | server can only send back bytes starting from that offset. In | |||
contrast, if a READ_PLUS occurs in the middle of a hole or ADB, the | ||||
server can send back a range which starts before the offset and | ||||
extends past the range. | ||||
Such an operation is inefficient for transfer of sparse sections of | READ is inefficient for transfer of sparse sections of the file. As | |||
the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, | such, READ is marked as OBSOLETE in NFSv4.2. Instead, a client | |||
a client should issue READ_PLUS. Note that as the client has no a | should issue READ_PLUS. Note that as the client has no a priori | |||
priori knowledge of whether either an ADB or a hole is present or | knowledge of whether either an ADB or a hole is present or not, it | |||
not, it should always use READ_PLUS. | should always use READ_PLUS. | |||
13.10.1. ARGUMENT | 13.10.1. ARGUMENT | |||
struct READ_PLUS4args { | struct READ_PLUS4args { | |||
/* CURRENT_FH: file */ | /* CURRENT_FH: file */ | |||
stateid4 rpa_stateid; | stateid4 rpa_stateid; | |||
offset4 rpa_offset; | offset4 rpa_offset; | |||
count4 rpa_count; | count4 rpa_count; | |||
}; | }; | |||
skipping to change at page 80, line 10 | skipping to change at page 82, line 16 | |||
The READ_PLUS operation is based upon the NFSv4.1 READ operation [2] | The READ_PLUS operation is based upon the NFSv4.1 READ operation [2] | |||
and similarly reads data from the regular file identified by the | and similarly reads data from the regular file identified by the | |||
current filehandle. | current filehandle. | |||
The client provides an rpa_offset of where the READ_PLUS is to start | The client provides an rpa_offset of where the READ_PLUS is to start | |||
and an rpa_count of how many bytes are to be read. An rpa_offset of | and an rpa_count of how many bytes are to be read. An rpa_offset of | |||
zero means to read data starting at the beginning of the file. If | zero means to read data starting at the beginning of the file. If | |||
rpa_offset is greater than or equal to the size of the file, the | rpa_offset is greater than or equal to the size of the file, the | |||
status NFS4_OK is returned with di_length (the data length) set to | status NFS4_OK is returned with di_length (the data length) set to | |||
zero and eof set to TRUE. READ_PLUS is subject to access permissions | zero and eof set to TRUE. | |||
checking. | ||||
The READ_PLUS result is comprised of an array of rpr_contents, each | The READ_PLUS result is comprised of an array of rpr_contents, each | |||
of which describes a data_content4 type of data. For NFSv4.2, the | of which describes a data_content4 type of data (Section 6.1.2). For | |||
allowed values are data, ADB, and hole. A server is required to | NFSv4.2, the allowed values are data, ADB, and hole. A server is | |||
support the data type, but neither ADB nor hole. Both an ADB and a | required to support the data type, but neither ADB nor hole. Both an | |||
hole must be returned in its entirety - clients must be prepared to | ADB and a hole must be returned in its entirety - clients must be | |||
get more information than they requested. | prepared to get more information than they requested. | |||
READ_PLUS has to support all of the errors which are returned by READ | READ_PLUS has to support all of the errors which are returned by READ | |||
plus NFS4ERR_UNION_NOTSUPP. If the client asks for a hole and the | plus NFS4ERR_UNION_NOTSUPP. If the client asks for a hole and the | |||
server does not support that arm of the discriminated union, but does | server does not support that arm of the discriminated union, but does | |||
support one or more additional arms, it can signal to the client that | support one or more additional arms, it can signal to the client that | |||
it supports the operation, but not the arm with | it supports the operation, but not the arm with | |||
NFS4ERR_UNION_NOTSUPP. | NFS4ERR_UNION_NOTSUPP. | |||
If the data to be returned is comprised entirely of zeros, then the | If the data to be returned is comprised entirely of zeros, then the | |||
server may elect to return that data as a hole. The server | server may elect to return that data as a hole. The server | |||
skipping to change at page 80, line 41 | skipping to change at page 82, line 46 | |||
to determine the full extent of the "hole" - it does not need to | to determine the full extent of the "hole" - it does not need to | |||
determine where the zeros start and end. | determine where the zeros start and end. | |||
The server may elect to return adjacent elements of the same type. | The server may elect to return adjacent elements of the same type. | |||
For example, the guard pattern or block size of an ADB might change, | For example, the guard pattern or block size of an ADB might change, | |||
which would require adjacent elements of type ADB. Likewise if the | which would require adjacent elements of type ADB. Likewise if the | |||
server has a range of data comprised entirely of zeros and then a | server has a range of data comprised entirely of zeros and then a | |||
hole, it might want to return two adjacent holes to the client. | hole, it might want to return two adjacent holes to the client. | |||
If the client specifies an rpa_count value of zero, the READ_PLUS | If the client specifies an rpa_count value of zero, the READ_PLUS | |||
succeeds and returns zero bytes of data, again subject to access | succeeds and returns zero bytes of data. In all situations, the | |||
permissions checking. In all situations, the server may choose to | server may choose to return fewer bytes than specified by the client. | |||
return fewer bytes than specified by the client. The client needs to | The client needs to check for this condition and handle the condition | |||
check for this condition and handle the condition appropriately. | appropriately. | |||
If the client specifies an rpa_offset and rpa_count value that is | If the client specifies an rpa_offset and rpa_count value that is | |||
entirely contained within a hole of the file, then the di_offset and | entirely contained within a hole of the file, then the di_offset and | |||
di_length returned must be for the entire hole. This result is | di_length returned must be for the entire hole. This result is | |||
considered valid until the file is changed (detected via the change | considered valid until the file is changed (detected via the change | |||
attribute). The server MUST provide the same semantics for the hole | attribute). The server MUST provide the same semantics for the hole | |||
as if the client read the region and received zeroes; the implied | as if the client read the region and received zeroes; the implied | |||
hole's contents lifetime MUST be exactly the same as any other read | hole's contents lifetime MUST be exactly the same as any other read | |||
data. | data. | |||
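Because a hole is returned in its entirety, a client has to clip the returned segments back to the byte range it actually asked for, and it also has to cope with a short result. The sketch below shows one way a client might do that for the data and hole types only. The tuple layout of a segment and the function name materialize are hypothetical; segments are assumed to arrive sorted by offset.

```python
from typing import List, Optional, Tuple

# (kind, offset, length, data); data is None for holes. Hypothetical layout.
Segment = Tuple[str, int, int, Optional[bytes]]


def materialize(segments: List[Segment], offset: int, count: int) -> bytes:
    """Rebuild the requested bytes from READ_PLUS-style segments."""
    out = bytearray()
    pos = offset
    end = offset + count
    for kind, seg_off, seg_len, data in segments:
        seg_end = seg_off + seg_len
        if seg_end <= pos:
            continue                  # segment lies before the wanted range
        if seg_off > pos or pos >= end:
            break                     # gap in coverage, or request satisfied
        stop = min(seg_end, end)      # clip to the requested range
        if kind == "hole":
            out += bytes(stop - pos)  # a hole reads back as zeros
        elif kind == "data":
            out += data[pos - seg_off:stop - seg_off]
        else:
            raise ValueError("segment type not handled in this sketch")
        pos = stop
    # May be shorter than count if the server returned fewer bytes.
    return bytes(out)


# A 1000-byte request that falls inside a much larger hole.
inside_hole = materialize([("hole", 32000, 224000, None)], 40000, 1000)
```

A result shorter than the requested count tells the caller to issue a follow-up READ_PLUS starting at the first byte that was not covered.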
skipping to change at page 82, line 9 | skipping to change at page 84, line 11 | |||
In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of | In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of | |||
[2] also apply to READ_PLUS. One delta is that when the owner has a | [2] also apply to READ_PLUS. One delta is that when the owner has a | |||
locked byte range, the server MUST return an array of rpr_contents | locked byte range, the server MUST return an array of rpr_contents | |||
with values inside that range. | with values inside that range. | |||
13.10.4.1. Additional pNFS Implementation Information | 13.10.4.1. Additional pNFS Implementation Information | |||
With pNFS, the semantics of using READ_PLUS remains the same. Any | With pNFS, the semantics of using READ_PLUS remains the same. Any | |||
data server MAY return a hole or ADB result for a READ_PLUS request | data server MAY return a hole or ADB result for a READ_PLUS request | |||
that it receives. | that it receives. When a data server chooses to return such a | |||
result, it has the option of returning information for the data | ||||
When a data server chooses to return a hole result, it has the option | stored on that data server (as defined by the data layout), but it | |||
of returning hole information for the data stored on that data server | MUST NOT return results for a byte range that includes data managed | |||
(as defined by the data layout), but it MUST NOT return results for a | by another data server. | |||
byte range that includes data managed by another data server. Data | ||||
servers that can obtain hole information for the parts of the file | ||||
stored on that data server, the data server SHOULD return HOLE_INFO | ||||
and the byte range of the hole stored on that data server. | ||||
A data server should do its best to return as much information about | A data server should do its best to return as much information about | |||
a hole as is feasible without having to contact the metadata server. | a hole or ADB as is feasible without having to contact the metadata | |||
If communication with the metadata server is required, then every | server. If communication with the metadata server is required, then | |||
attempt should be taken to minimize the number of requests. | every attempt should be taken to minimize the number of requests. | |||
If mandatory locking is enforced, then the data server must also | If mandatory locking is enforced, then the data server must also | |||
ensure that it returns only information for a Hole that is within the | ensure that it returns only information that is within the owner's | |||
owner's locked byte range. | locked byte range. | |||
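The two restrictions above (stay within the data this data server manages, and stay within the owner's locked range under mandatory locking) both amount to intersecting byte ranges. A minimal sketch follows, under the simplifying assumption that the data server's portion of the file is a single contiguous extent; real striped layouts may map several disjoint extents to one data server. The names clip and reportable_hole are hypothetical.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # (offset, length)


def clip(rng: Range, bound: Range) -> Optional[Range]:
    """Intersect two byte ranges; return None when they do not overlap."""
    lo = max(rng[0], bound[0])
    hi = min(rng[0] + rng[1], bound[0] + bound[1])
    return (lo, hi - lo) if hi > lo else None


def reportable_hole(
    hole: Range,
    ds_extent: Range,                # assumed: one contiguous extent per DS
    locked: Optional[Range] = None,  # owner's locked range, if enforced
) -> Optional[Range]:
    """Limit a hole to what this data server may report."""
    clipped = clip(hole, ds_extent)
    if clipped is not None and locked is not None:
        clipped = clip(clipped, locked)
    return clipped


# A hole spanning several stripe units, seen from the DS holding one of them.
visible = reportable_hole((32000, 224000), (65536, 65536))
```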
13.10.5. READ_PLUS with Sparse Files Example | 13.10.5. READ_PLUS with Sparse Files Example | |||
The following table describes a sparse file. For each byte range, | The following table describes a sparse file. For each byte range, | |||
the file contains either non-zero data or a hole. In addition, the | the file contains either non-zero data or a hole. In addition, the | |||
server in this example uses a Hole Threshold of 32K. | server in this example uses a Hole Threshold of 32K. | |||
+-------------+----------+ | +-------------+----------+ | |||
| Byte-Range | Contents | | | Byte-Range | Contents | | |||
+-------------+----------+ | +-------------+----------+ | |||
| 0-15999 | Hole | | | 0-15999 | Hole | | |||
| 16K-31999 | Non-Zero | | | 16K-31999 | Non-Zero | | |||
| 32K-255999 | Hole | | | 32K-255999 | Hole | | |||
| 256K-287999 | Non-Zero | | | 256K-287999 | Non-Zero | | |||
| 288K-353999 | Hole | | | 288K-353999 | Hole | | |||
| 354K-417999 | Non-Zero | | | 354K-417999 | Non-Zero | | |||
+-------------+----------+ | +-------------+----------+ | |||
Table 4 | Table 5 | |||
Under the given circumstances, if a client was to read from the file | Under the given circumstances, if a client was to read from the file | |||
with a max read size of 64K, the following will be the results for | with a max read size of 64K, the following will be the results for | |||
the given READ_PLUS calls. This assumes the client has already | the given READ_PLUS calls. This assumes the client has already | |||
opened the file, acquired a valid stateid ('s' in the example), and | opened the file, acquired a valid stateid ('s' in the example), and | |||
just needs to issue READ_PLUS requests. | just needs to issue READ_PLUS requests. | |||
1. READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K], | 1. READ_PLUS(s, 0, 64K) --> NFS4_OK, eof = false, <data[0,32K], | |||
hole[32K,224K]>. Since the first hole is less than the server's | hole[32K,224K]>. Since the first hole is less than the server's | |||
Hole Threshold, the first 32K of the file is returned as data | Hole Threshold, the first 32K of the file is returned as data | |||
skipping to change at page 87, line 10 | skipping to change at page 89, line 10 | |||
support the CB_COPY operation. | support the CB_COPY operation. | |||
The CB_COPY operation may fail for the following reasons (this is a | The CB_COPY operation may fail for the following reasons (this is a | |||
partial list): | partial list): | |||
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the | NFS4ERR_NOTSUPP: The copy offload operation is not supported by the | |||
NFS client receiving this request. | NFS client receiving this request. | |||
15. IANA Considerations | 15. IANA Considerations | |||
This section uses terms that are defined in [24]. | This section uses terms that are defined in [25]. | |||
16. References | 16. References | |||
16.1. Normative References | 16.1. Normative References | |||
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement | [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement | |||
Levels", March 1997. | Levels", March 1997. | |||
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System | [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System | |||
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, | (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, | |||
skipping to change at page 89, line 9 | skipping to change at page 91, line 9 | |||
Symposium on File and Storage Technologies (FAST '08) , 2008. | Symposium on File and Storage Technologies (FAST '08) , 2008. | |||
[21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: | [21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: | |||
Deployment, configuration and administration of Red Hat | Deployment, configuration and administration of Red Hat | |||
Enterprise Linux 5, Edition 6", 2011. | Enterprise Linux 5, Edition 6", 2011. | |||
[22] Quigley, D. and J. Lu, "Registry Specification for MAC Security | [22] Quigley, D. and J. Lu, "Registry Specification for MAC Security | |||
Label Formats", draft-quigley-label-format-registry (work in | Label Formats", draft-quigley-label-format-registry (work in | |||
progress), 2011. | progress), 2011. | |||
[23] Eisler, M., "XDR: External Data Representation Standard", | [23] IESG, "IESG Processing of RFC Errata for the IETF Stream", | |||
2008. | ||||
[24] Eisler, M., "XDR: External Data Representation Standard", | ||||
RFC 4506, May 2006. | RFC 4506, May 2006. | |||
[24] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA | [25] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA | |||
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. | Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. | |||
Appendix A. Acknowledgments | Appendix A. Acknowledgments | |||
For the pNFS Access Permissions Check, the original draft was by | For the pNFS Access Permissions Check, the original draft was by | |||
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work | Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work | |||
was influenced by discussions with Benny Halevy and Bruce Fields. A | was influenced by discussions with Benny Halevy and Bruce Fields. A | |||
review was done by Tom Haynes. | review was done by Tom Haynes. | |||
For the Sharing change attribute implementation details with NFSv4 | For the Sharing change attribute implementation details with NFSv4 | |||
End of changes. 81 change blocks. | ||||
466 lines changed or deleted | 501 lines changed or added | |||
This html diff was produced by rfcdiff 1.41. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |