NFSv4                                                          B. Halevy
Internet-Draft
Intended status: Standards Track                               T. Haynes
Expires: February 8, 2018                                   Primary Data
                                                         August 07, 2017

               Parallel NFS (pNFS) Flexible File Layout
                   draft-ietf-nfsv4-flex-files-13.txt
Abstract

The Parallel Network File System (pNFS) allows a separation between
the metadata (onto a metadata server) and data (onto a storage
device) for a file.  The flexible file layout type is defined in this
document as an extension to pNFS which allows the use of storage
devices in a fashion such that they require only a quite limited
degree of interaction with the metadata server, using already
existing protocols.  Client side mirroring is also added to provide
skipping to change at page 1, line 38
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 8, 2018.
Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents

1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   3
  1.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   3
  1.2.  Requirements Language . . . . . . . . . . . . . . . . . .   5
2.  Coupling of Storage Devices . . . . . . . . . . . . . . . . .   6
  2.1.  LAYOUTCOMMIT  . . . . . . . . . . . . . . . . . . . . . .   6
  2.2.  Fencing Clients from the Storage Device . . . . . . . . .   6
    2.2.1.  Implementation Notes for Synthetic uids/gids  . . . .   7
    2.2.2.  Example of using Synthetic uids/gids  . . . . . . . .   8
  2.3.  State and Locking Models  . . . . . . . . . . . . . . . .   9
    2.3.1.  Loosely Coupled Locking Model . . . . . . . . . . . .   9
    2.3.2.  Tightly Coupled Locking Model . . . . . . . . . . . .  10
3.  XDR Description of the Flexible File Layout Type  . . . . . .  12
  3.1.  Code Components Licensing Notice  . . . . . . . . . . . .  13
4.  Device Addressing and Discovery . . . . . . . . . . . . . . .  14
  4.1.  ff_device_addr4 . . . . . . . . . . . . . . . . . . . . .  14
  4.2.  Storage Device Multipathing . . . . . . . . . . . . . . .  16
5.  Flexible File Layout Type . . . . . . . . . . . . . . . . . .  17
  5.1.  ff_layout4  . . . . . . . . . . . . . . . . . . . . . . .  17
    5.1.1.  Error Codes from LAYOUTGET  . . . . . . . . . . . . .  21
    5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS  . .  21
  5.2.  Interactions Between Devices and Layouts  . . . . . . . .  22
  5.3.  Handling Version Errors . . . . . . . . . . . . . . . . .  22
6.  Striping via Sparse Mapping . . . . . . . . . . . . . . . . .  23
7.  Recovering from Client I/O Errors . . . . . . . . . . . . . .  23
8.  Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . .  24
  8.1.  Selecting a Mirror  . . . . . . . . . . . . . . . . . . .  24
  8.2.  Writing to Mirrors  . . . . . . . . . . . . . . . . . . .  25
    8.2.1.  Single Storage Device Updates Mirrors . . . . . . . .  25
    8.2.2.  Single Storage Device Updates Mirrors . . . . . . . .  25
    8.2.3.  Handling Write Errors . . . . . . . . . . . . . . . .  25
    8.2.4.  Handling Write COMMITs  . . . . . . . . . . . . . . .  26
  8.3.  Metadata Server Resilvering of the File . . . . . . . . .  27
9.  Flexible Files Layout Type Return . . . . . . . . . . . . . .  27
  9.1.  I/O Error Reporting . . . . . . . . . . . . . . . . . . .  28
    9.1.1.  ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . .  28
  9.2.  Layout Usage Statistics . . . . . . . . . . . . . . . . .  29
    9.2.1.  ff_io_latency4  . . . . . . . . . . . . . . . . . . .  29
    9.2.2.  ff_layoutupdate4  . . . . . . . . . . . . . . . . . .  30
    9.2.3.  ff_iostats4 . . . . . . . . . . . . . . . . . . . . .  31
  9.3.  ff_layoutreturn4  . . . . . . . . . . . . . . . . . . . .  32
10. Flexible Files Layout Type LAYOUTERROR  . . . . . . . . . . .  32
11. Flexible Files Layout Type LAYOUTSTATS  . . . . . . . . . . .  33
12. Flexible File Layout Type Creation Hint . . . . . . . . . . .  33
  12.1.  ff_layouthint4 . . . . . . . . . . . . . . . . . . . . .  33
13. Recalling a Layout  . . . . . . . . . . . . . . . . . . . . .  34
  13.1.  CB_RECALL_ANY  . . . . . . . . . . . . . . . . . . . . .  34
14. Client Fencing  . . . . . . . . . . . . . . . . . . . . . . .  35
15. Security Considerations . . . . . . . . . . . . . . . . . . .  35
  15.1.  RPCSEC_GSS and Security Services . . . . . . . . . . . .  36
    15.1.1.  Loosely Coupled  . . . . . . . . . . . . . . . . . .  36
    15.1.2.  Tightly Coupled  . . . . . . . . . . . . . . . . . .  36
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  37
17. References  . . . . . . . . . . . . . . . . . . . . . . . . .  38
  17.1.  Normative References . . . . . . . . . . . . . . . . . .  38
  17.2.  Informative References . . . . . . . . . . . . . . . . .  39
Appendix A.  Acknowledgments  . . . . . . . . . . . . . . . . . .  39
Appendix B.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  39
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  40
1.  Introduction

In the parallel Network File System (pNFS), the metadata server
returns layout type structures that describe where file data is
located.  There are different layout types for different storage
systems and methods of arranging data on storage devices.  This
document defines the flexible file layout type used with file-based
data servers that are accessed using the Network File System (NFS)
skipping to change at page 3, line 42
To provide a global state model equivalent to that of the files
layout type, a back-end control protocol MAY be implemented between
the metadata server and NFSv4.1+ storage devices.  It is out of scope
for this document to specify such a protocol, yet the requirements
for the protocol are specified in [RFC5661] and clarified in
[pNFSLayouts].
1.1.  Definitions

control communication requirements:  defines for a layout type the
   details regarding information on layouts, stateids, file metadata,
   and file data which must be communicated between the metadata
   server and the storage devices.

control protocol:  defines a particular mechanism that an
   implementation of a layout type would use to meet the control
   communication requirement for that layout type.  This need not be
   a protocol as normally understood.  In some cases the same
   protocol may be used as a control protocol and data access
   protocol.
client-side mirroring:  is when the client and not the server is
   responsible for updating all of the mirrored copies of a layout
   segment.

data file:  is that part of the file system object which contains the
   content.

data server (DS):  is another term for storage device.
fencing:  is when the metadata server prevents the storage devices
   from processing I/O from a specific client to a specific file.

file layout type:  is a layout type in which the storage devices are
   accessed via the NFS protocol (see Section 13 of [RFC5661]).

layout:  informs a client of which storage devices it needs to
   communicate with (and over which protocol) to perform I/O on a
   file.  The layout might also provide some hints about how the
skipping to change at page 5, line 34
this can also be done to create a new mirrored copy of the layout
segment.

rsize:  is the data transfer buffer size used for reads.

stateid:  is a 128-bit quantity returned by a server that uniquely
   defines the open and locking states provided by the server for a
   specific open-owner or lock-owner/open-owner pair for a specific
   file and type of lock.

storage device:  designates the target to which clients may direct
   I/O requests when they hold an appropriate layout.  See
   Section 2.1 of [pNFSLayouts] for further discussion of the
   difference between a data store and a storage device.

tight coupling:  is when the metadata server and the storage devices
   do have a control protocol present.

wsize:  is the data transfer buffer size used for writes.
1.2.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
2.  Coupling of Storage Devices

The coupling of the metadata server with the storage devices can be
either tight or loose.  In a tight coupling, there is a control
protocol present to manage security, LAYOUTCOMMITs, etc.  With a
skipping to change at page 6, line 44
With loosely coupled storage devices, the metadata server uses
synthetic uids and gids for the data file, where the uid owner of the
data file is allowed read/write access and the gid owner is allowed
read-only access.  As part of the layout (see ffds_user and
ffds_group in Section 5.1), the client is provided with the user and
group to be used in the Remote Procedure Call (RPC) [RFC5531]
credentials needed to access the data file.  Fencing off of clients
is achieved by the metadata server changing the synthetic uid and/or
gid owners of the data file on the storage device to implicitly
revoke the outstanding RPC credentials.  A client presenting the
wrong credential for the desired access will get an NFS4ERR_ACCESS
error.

With this loosely coupled model, the metadata server is not able to
fence off a single client; it is forced to fence off all clients.
However, as the other clients react to the fencing, returning their
layouts and trying to get new ones, the metadata server can hand out
a new uid and gid to allow access.
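The fencing scheme described above can be sketched as follows.  This
is purely illustrative: the class and function names are invented for
the sketch, and only the NFS4ERR_ACCESS value (13, from [RFC5661])
and the uid/gid access semantics come from this document.

```python
NFS4_OK = 0
NFS4ERR_ACCESS = 13  # value from RFC 5661

class DataFile:
    """Data file on the storage device, owned by synthetic ids."""
    def __init__(self, uid, gid):
        self.uid = uid   # synthetic uid owner: read/write access
        self.gid = gid   # synthetic gid owner: read-only access

    def check_io(self, cred_uid, cred_gid, write):
        """Storage device checks the RPC credential on an I/O request."""
        if cred_uid == self.uid:
            return NFS4_OK                   # uid owner may read and write
        if cred_gid == self.gid and not write:
            return NFS4_OK                   # gid owner may only read
        return NFS4ERR_ACCESS

def fence_all_clients(data_file, next_uid, next_gid):
    """Metadata server fences by rotating the synthetic owners,
    implicitly revoking every outstanding ffds_user/ffds_group
    credential held by any client."""
    data_file.uid = next_uid
    data_file.gid = next_gid

# A client holding the layout's ffds_user/ffds_group can do I/O ...
f = DataFile(uid=12001, gid=12001)
assert f.check_io(12001, 12001, write=True) == NFS4_OK

# ... until the metadata server fences.  The stale credential now
# fails, and the client must return its layout and request a new one.
fence_all_clients(f, next_uid=12002, next_gid=12002)
assert f.check_io(12001, 12001, write=True) == NFS4ERR_ACCESS
```

Note that, as the text says, the rotation revokes access for all
clients at once; per-client fencing is not possible in this model.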
Note: it is recommended to implement common access control methods at
the storage device filesystem to allow only the metadata server root
skipping to change at page 7, line 41
hand out no layout (forcing the I/O through it), or deny the client
further access to the file.
2.2.1.  Implementation Notes for Synthetic uids/gids
The selection method for the synthetic uids and gids to be used for
fencing in loosely coupled storage devices is strictly an
implementation issue.  For example, an administrator might restrict a
range of such ids available to the Lightweight Directory Access
Protocol (LDAP) 'uid' field [RFC4519].  She might also be able to
choose an id that would never be used to grant access.  Then when the
metadata server had a request to access a file, a SETATTR would be
sent to the storage device to set the owner and group of the data
file.  The user and group might be selected in a round robin fashion
from the range of available ids.

Those ids would be sent back as ffds_user and ffds_group to the
client, and it would present them as the RPC credentials to the
storage device.  When the client was done accessing the file and the
metadata server knew that no other client was accessing the file, it
could reset the owner and group to restrict access to the data file.
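As a sketch, the round-robin selection from an administrator-
restricted range might look like this.  The specific range and the
helper name are assumptions for illustration; nothing here is
mandated by this document.

```python
from itertools import cycle

# Hypothetical administrator-restricted range of synthetic ids;
# these ids would never be handed out to grant access to real users.
SYNTHETIC_IDS = cycle(range(12000, 12010))

def assign_synthetic_owners():
    """Pick the next (uid, gid) pair.  The metadata server would
    SETATTR these onto the data file and return them in the layout
    as ffds_user and ffds_group."""
    uid = next(SYNTHETIC_IDS)
    gid = next(SYNTHETIC_IDS)
    return uid, gid

u1, g1 = assign_synthetic_owners()
u2, g2 = assign_synthetic_owners()
assert (u1, g1) != (u2, g2)      # each new grant gets fresh ids
assert all(12000 <= x < 12010 for x in (u1, g1, u2, g2))
```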
skipping to change at page 9, line 51
follows:

o  OPENs are dealt with by the metadata server.  Stateids are
   selected by the metadata server and associated with the client id
   describing the client's connection to the metadata server.  The
   metadata server may need to interact with the storage device to
   locate the file to be opened, but no locking-related functionality
   need be used on the storage device.

   OPEN_DOWNGRADE and CLOSE only require local execution on the
   metadata server.

o  Advisory byte-range locks can be implemented locally on the
   metadata server.  As in the case of OPENs, the stateids associated
   with byte-range locks are assigned by the metadata server and only
   used on the metadata server.

o  Delegations are assigned by the metadata server which initiates
   recalls when conflicting OPENs are processed.  No storage device
   involvement is required.
skipping to change at page 10, line 44
been revoked.

As the client never receives a stateid generated by a storage device,
there is no client lease on the storage device and no prospect of
lease expiration, even when access is via NFSv4 protocols.  Clients
will have leases on the metadata server.  In dealing with lease
expiration, the metadata server may need to use fencing to prevent
revoked stateids from being relied upon by a client unaware of the
fact that they have been revoked.
2.3.2.  Tightly Coupled Locking Model

When locking-related operations are requested, they are primarily
dealt with by the metadata server, which generates the appropriate
stateids.  These stateids must be made known to the storage device
using control protocol facilities, the details of which are not
discussed in this document.

Given this basic structure, locking-related operations are handled as
follows:
o  OPENs are dealt with primarily on the metadata server.  Stateids
   are selected by the metadata server and associated with the client
   id describing the client's connection to the metadata server.  The
   metadata server needs to interact with the storage device to
   locate the file to be opened, and to make the storage device aware
   of the association between the metadata-server-chosen stateid and
   the client and openowner that it represents.

   OPEN_DOWNGRADE and CLOSE are executed initially on the metadata
   server but the state change made must be propagated to the storage
   device.
o  Advisory byte-range locks can be implemented locally on the
   metadata server.  As in the case of OPENs, the stateids associated
   with byte-range locks are assigned by the metadata server and are
   available for use on the metadata server.  Because I/O operations
   are allowed to present lock stateids, the metadata server needs
   the ability to make the storage device aware of the association
   between the metadata-server-chosen stateid and the corresponding
   open stateid it is associated with.

o  Mandatory byte-range locks can be supported when both the metadata
   server and the storage devices have the appropriate support.  As
   in the case of advisory byte-range locks, these are assigned by
   the metadata server and are available for use on the metadata
   server.  To enable mandatory lock enforcement on the storage
   device, the metadata server needs the ability to make the storage
   device aware of the association between the metadata-server-chosen
   stateid and the client, openowner, and lock (i.e., lockowner,
   byte-range, lock-type) that it represents.  Because I/O operations
   are allowed to present lock stateids, this information needs to be
   propagated to all storage devices to which I/O might be directed
   rather than only to those storage devices that contain the locked
   region.
o  Delegations are assigned by the metadata server which initiates
   recalls when conflicting OPENs are processed.  Because I/O
   operations are allowed to present delegation stateids, the
   metadata server requires the ability to make the storage device
   aware of the association between the metadata-server-chosen
   stateid and the filehandle and delegation type it represents, and
   to break such an association.

o  TEST_STATEID is processed locally on the metadata server, without
skipping to change at page 15, line 46
The ffdv_rsize and ffdv_wsize are used to communicate the maximum
rsize and wsize supported by the storage device.  As the storage
device can have a different rsize or wsize than the metadata server,
the ffdv_rsize and ffdv_wsize allow the metadata server to
communicate that information on behalf of the storage device.
ffdv_tightly_coupled informs the client as to whether the metadata ffdv_tightly_coupled informs the client as to whether the metadata
server is tightly coupled with the storage devices or not. Note that server is tightly coupled with the storage devices or not. Note that
even if the data protocol is at least NFSv4.1, it may still be the even if the data protocol is at least NFSv4.1, it may still be the
case that there is loose coupling is in effect. If case that there is loose coupling in effect. If ffdv_tightly_coupled
ffdv_tightly_coupled is not set, then the client MUST commit writes is not set, then the client MUST commit writes to the storage devices
to the storage devices for the file before sending a LAYOUTCOMMIT to for the file before sending a LAYOUTCOMMIT to the metadata server.
the metadata server. I.e., the writes MUST be committed by the I.e., the writes MUST be committed by the client to stable storage
client to stable storage via issuing WRITEs with stable_how == via issuing WRITEs with stable_how == FILE_SYNC or by issuing a
FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != COMMIT after WRITEs with stable_how != FILE_SYNC (see Section 3.3.7
FILE_SYNC (see Section 3.3.7 of [RFC1813]). of [RFC1813]).
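A minimal sketch (not part of the specification, and in Python rather than the draft's XDR) of the client-side rule above: under loose coupling the client may send LAYOUTCOMMIT only after every WRITE was FILE_SYNC or a COMMIT has covered the non-FILE_SYNC writes. The stable_how values are those of RFC 1813, Section 3.3.7; the helper name is hypothetical.

```python
# stable_how values from RFC 1813, Section 3.3.7
UNSTABLE = 0
DATA_SYNC = 1
FILE_SYNC = 2

def writes_committed(stable_how_values, commit_sent):
    """Hypothetical check: True only if every WRITE used FILE_SYNC,
    or a COMMIT followed the non-FILE_SYNC WRITEs.  Only then may a
    loosely coupled client send LAYOUTCOMMIT to the metadata server."""
    if all(s == FILE_SYNC for s in stable_how_values):
        return True
    return commit_sent
```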
4.2. Storage Device Multipathing 4.2. Storage Device Multipathing
The flexible file layout type supports multipathing to multiple The flexible file layout type supports multipathing to multiple
storage device addresses. Storage device level multipathing is used storage device addresses. Storage device level multipathing is used
for bandwidth scaling via trunking and for higher availability of use for bandwidth scaling via trunking and for higher availability of use
in the event of a storage device failure. Multipathing allows the in the event of a storage device failure. Multipathing allows the
client to switch to another storage device address which may be that client to switch to another storage device address which may be that
of another storage device that is exporting the same data stripe of another storage device that is exporting the same data stripe
unit, without having to contact the metadata server for a new layout. unit, without having to contact the metadata server for a new layout.
skipping to change at page 18, line 4 skipping to change at page 17, line 33
}; };
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
<CODE ENDS> <CODE ENDS>
This document defines structure associated with the layouttype4 value
LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an This document defines structures associated with the layouttype4
XDR type "opaque". The opaque layout is uninterpreted by the generic value LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure
pNFS client layers, but is interpreted by the flexible file layout as an XDR type "opaque". The opaque layout is uninterpreted by the
type implementation. This section defines the structure of this generic pNFS client layers, but is interpreted by the flexible file
otherwise opaque value, ff_layout4. layout type implementation. This section defines the structure of
this otherwise opaque value, ff_layout4.
5.1. ff_layout4 5.1. ff_layout4
<CODE BEGINS> <CODE BEGINS>
/// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001;
/// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; /// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002;
/// const FF_FLAGS_NO_READ_IO = 0x00000004; /// const FF_FLAGS_NO_READ_IO = 0x00000004;
/// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; /// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008;
skipping to change at page 20, line 41 skipping to change at page 20, line 23
NFSv4.x storage protocols: NFSv4.x storage protocols:
loosely couple: the stateid has to be an anonymous stateid, loosely coupled: the stateid has to be an anonymous stateid,
tightly couple: the stateid has to be a global stateid. tightly coupled: the stateid has to be a global stateid.
These stem from a mismatch of ffds_stateid being a singleton and These stem from a mismatch of ffds_stateid being a singleton and
ffds_fh_vers being an array - each open file on the storage device ffds_fh_vers being an array - each open file on the storage device
might need an open stateid. As there are established loosely coupled might need an open stateid. As there are established loosely coupled
implementations of this version of the protocol, it can not be fixed. implementations of this version of the protocol, it cannot be fixed.
If an implementation needs a different statedid per file handle, then If an implementation needs a different stateid per file handle, then
this issue will require a new version of the protocol. this issue will require a new version of the protocol.
For loosely coupled storage devices, ffds_user and ffds_group provide For loosely coupled storage devices, ffds_user and ffds_group provide
the synthetic user and group to be used in the RPC credentials that the synthetic user and group to be used in the RPC credentials that
the client presents to the storage device to access the data files. the client presents to the storage device to access the data files.
For tightly coupled storage devices, the user and group on the For tightly coupled storage devices, the user and group on the
storage device will be the same as on the metadata server. I.e., if storage device will be the same as on the metadata server. I.e., if
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
ignore both ffds_user and ffds_group. ignore both ffds_user and ffds_group.
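The credential rule above can be sketched as follows (a hypothetical helper, not from the draft): under tight coupling the client MUST ignore the synthetic ffds_user/ffds_group and reuse the identity it uses with the metadata server; under loose coupling it presents the synthetic identity from the layout.

```python
def rpc_identity(ffdv_tightly_coupled, ffds_user, ffds_group,
                 mds_user, mds_group):
    """Hypothetical sketch: choose the user/group for the RPC
    credentials presented to the storage device."""
    if ffdv_tightly_coupled:
        # Tightly coupled: same identity as on the metadata server;
        # ffds_user and ffds_group MUST be ignored.
        return mds_user, mds_group
    # Loosely coupled: synthetic identity supplied in the layout.
    return ffds_user, ffds_group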
skipping to change at page 21, line 31 skipping to change at page 21, line 14
ffl_flags is a bitmap that allows the metadata server to inform the ffl_flags is a bitmap that allows the metadata server to inform the
client of particular conditions that may result from the more or less client of particular conditions that may result from the more or less
tight coupling of the storage devices. tight coupling of the storage devices.
FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is
not required to send LAYOUTCOMMIT to the metadata server. not required to send LAYOUTCOMMIT to the metadata server.
F_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client
should not send I/O operations to the metadata server. I.e., even should not send I/O operations to the metadata server. I.e., even
if the client could determine that there was a network diconnect if the client could determine that there was a network disconnect
to a storage device, the client should not try to proxy the I/O to a storage device, the client should not try to proxy the I/O
through the metadata server. through the metadata server.
FF_FLAGS_NO_READ_IO: can be set to indicate that the client should FF_FLAGS_NO_READ_IO: can be set to indicate that the client should
not send READ requests with the layouts of iomode not send READ requests with the layouts of iomode
LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode
LAYOUTIOMODE4_READ from the metadata server. LAYOUTIOMODE4_READ from the metadata server.
FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client
only needs to update one of the mirrors (see Section 8.2). only needs to update one of the mirrors (see Section 8.2).
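The flag constants are defined in the ff_layout4 XDR above; as an illustration only (Python rather than XDR, helper names hypothetical), a client would test ffl_flags as a bitmap:

```python
# Flag values from the ff_layout4 definition in this document.
FF_FLAGS_NO_LAYOUTCOMMIT  = 0x00000001
FF_FLAGS_NO_IO_THRU_MDS   = 0x00000002
FF_FLAGS_NO_READ_IO       = 0x00000004
FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008

def must_send_layoutcommit(ffl_flags):
    """LAYOUTCOMMIT is required unless the server set NO_LAYOUTCOMMIT."""
    return not (ffl_flags & FF_FLAGS_NO_LAYOUTCOMMIT)

def may_proxy_io_through_mds(ffl_flags):
    """Proxying I/O through the metadata server is discouraged when
    NO_IO_THRU_MDS is set."""
    return not (ffl_flags & FF_FLAGS_NO_IO_THRU_MDS)

def may_write_single_mirror(ffl_flags):
    """With WRITE_ONE_MIRROR set, updating one mirror suffices."""
    return bool(ffl_flags & FF_FLAGS_WRITE_ONE_MIRROR)
```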
5.1.1. Error Codes from LAYOUTGET 5.1.1. Error Codes from LAYOUTGET
[RFC5661] provides little guidance as to how the client is to proceed [RFC5661] provides little guidance as to how the client is to proceed
with a LAYOUTEGT which returns an error of either with a LAYOUTGET which returns an error of either
NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, and NFS4ERR_DELAY. NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.
Within the context of this document: Within the context of this document:
NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O
is to go to the metadata server. Note that it is possible to have is to go to the metadata server. Note that it is possible to have
had a layout before a recall and not after. had a layout before a recall and not after.
NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout
from being granted. If the client already has an appropriate from being granted. If the client already has an appropriate
layout, it should continue with I/O to the storage devices. layout, it should continue with I/O to the storage devices.
skipping to change at page 24, line 41 skipping to change at page 24, line 24
compression or deduplication on the data. However, the over the wire compression or deduplication on the data. However, the over the wire
transfer of the file contents MUST appear identical. Note, this is a transfer of the file contents MUST appear identical. Note, this is a
constraint of the selected XDR representation in which each mirrored constraint of the selected XDR representation in which each mirrored
copy of the layout segment has the same striping pattern (see copy of the layout segment has the same striping pattern (see
Figure 1). Figure 1).
The metadata server is responsible for determining the number of The metadata server is responsible for determining the number of
mirrored copies and the location of each mirror. While the client mirrored copies and the location of each mirror. While the client
may provide a hint to how many copies it wants (see Section 12), the may provide a hint to how many copies it wants (see Section 12), the
metadata server can ignore that hint and in any event, the client has metadata server can ignore that hint and in any event, the client has
no means to dictate neither the storage device (which also means the no means to dictate either the storage device (which also means the
coupling and/or protocol levels to access the layout segments) nor coupling and/or protocol levels to access the layout segments) or the
the location of said storage device. location of said storage device.
The updating of mirrored layout segments is done via client-side The updating of mirrored layout segments is done via client-side
mirroring. With this approach, the client is responsible for making mirroring. With this approach, the client is responsible for making
sure modifications are made on all copies of the layout segments it sure modifications are made on all copies of the layout segments it
is informed of via the layout. If a layout segment is being is informed of via the layout. If a layout segment is being
resilvered to a storage device, that mirrored copy will not be in the resilvered to a storage device, that mirrored copy will not be in the
layout. Thus the metadata server MUST update that copy until the layout. Thus the metadata server MUST update that copy until the
client is presented it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR client is presented with it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR
is set in ffl_flags, the client need only update one of the mirrors is set in ffl_flags, the client need only update one of the mirrors
(see Section 8.2. If the client is writing to the layout segments (see Section 8.2). If the client is writing to the layout segments
via the metadata server, then the metadata server MUST update all via the metadata server, then the metadata server MUST update all
copies of the mirror. As seen in Section 8.3, during the copies of the mirror. As seen in Section 8.3, during the
resilvering, the layout is recalled, and the client has to make resilvering, the layout is recalled, and the client has to make
modifications via the metadata server. modifications via the metadata server.
8.1. Selecting a Mirror 8.1. Selecting a Mirror
When the metadata server grants a layout to a client, it MAY let the When the metadata server grants a layout to a client, it MAY let the
client know how fast it expects each mirror to be once the request client know how fast it expects each mirror to be once the request
arrives at the storage devices via the ffds_efficiency member. While arrives at the storage devices via the ffds_efficiency member. While
skipping to change at page 25, line 29 skipping to change at page 25, line 14
However, what should not be involved in that calculation is a However, what should not be involved in that calculation is a
perceived network distance between the client and the storage device. perceived network distance between the client and the storage device.
The client is better situated for making that determination based on The client is better situated for making that determination based on
past interaction with the storage device over the different available past interaction with the storage device over the different available
network interfaces between the two. I.e., the metadata server might network interfaces between the two. I.e., the metadata server might
not know about a transient outage between the client and storage not know about a transient outage between the client and storage
device because it has no presence on the given subnet. device because it has no presence on the given subnet.
As such, it is the client which decides which mirror to access for As such, it is the client which decides which mirror to access for
reading the file. The requirements for writing to a mirrored layout reading the file. The requirements for writing to mirrored layout
segments are presented below. segments are presented below.
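A hypothetical scoring sketch (the draft deliberately leaves the policy to the client): combine the server-provided ffds_efficiency hint with latency the client has itself observed, since the metadata server may have no presence on the client's subnet and cannot see transient outages.

```python
def choose_read_mirror(mirrors):
    """Hypothetical mirror selection for reads.
    mirrors: list of (index, ffds_efficiency, observed_rtt_ms), where
    observed_rtt_ms is None if the storage device is unreachable.
    Prefer higher server-advertised efficiency, then lower
    client-observed round-trip time."""
    reachable = [m for m in mirrors if m[2] is not None]
    if not reachable:
        return None  # no mirror reachable; fall back to the MDS
    return max(reachable, key=lambda m: (m[1], -m[2]))[0]
```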
8.2. Writing to Mirrors 8.2. Writing to Mirrors
8.2.1. Single Storage Device Updates Mirrors 8.2.1. Single Storage Device Updates Mirrors
If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client
only needs to update one of the copies of the layout segment. For only needs to update one of the copies of the layout segment. For
this case, the storage device MUST ensure that all copies of the this case, the storage device MUST ensure that all copies of the
mirror are updated when any one of the mirrors is updated. If the mirror are updated when any one of the mirrors is updated. If the
storage device gets an error when updating one of the mirrors, then storage device gets an error when updating one of the mirrors, then
it MUST inform the client that the original WRITE had an error. The it MUST inform the client that the original WRITE had an error. The
client then MUST inform the metadata server (see Section 8.2.3. The client then MUST inform the metadata server (see Section 8.2.3). The
client's responsibility with resepect to COMMIT is explained in client's responsibility with respect to COMMIT is explained in
Section 8.2.4. The client may choose any one of the mirrors and may Section 8.2.4. The client may choose any one of the mirrors and may
use ffds_efficiency in the same manner as for reading when making use ffds_efficiency in the same manner as for reading when making
this choice. this choice.
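The single-mirror write flow above can be sketched as follows (hypothetical helper names; the transport details are elided): the client writes to one chosen mirror, and if the storage device reports that its fan-out to the other mirrors failed, the client MUST inform the metadata server.

```python
def write_one_mirror(mirrors, write_fn, report_error_to_mds):
    """Hypothetical sketch of FF_FLAGS_WRITE_ONE_MIRROR handling.
    mirrors: candidate mirrors, any one of which may be chosen (the
    client may rank them by ffds_efficiency, as for reads).
    write_fn(mirror) -> bool: True if the storage device confirmed the
    WRITE (including its fan-out to the other mirrors)."""
    target = mirrors[0]  # selection policy is up to the client
    ok = write_fn(target)
    if not ok:
        # The storage device reported an error updating a mirror;
        # the client MUST inform the metadata server.
        report_error_to_mds(target)
    return ok
```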
8.2.2. Single Storage Device Updates Mirrors 8.2.2. Client Updates All of the Mirrors
If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the
client is responsible for updating all mirrored copies of the layout client is responsible for updating all mirrored copies of the layout
segments that it is given in the layout. A single failed update is segments that it is given in the layout. A single failed update is
sufficient to fail the entire operation. If all but one copy is sufficient to fail the entire operation. If all but one copy is
skipping to change at page 27, line 16 skipping to change at page 26, line 40
When stable writes are done to the metadata server or to a single When stable writes are done to the metadata server or to a single
replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR ), it is replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is
the responsibility of the receiving node to propagate the written the responsibility of the receiving node to propagate the written
data stably, before replying to the client. data stably, before replying to the client.
In the corresponding cases in which unstable writes are done, the In the corresponding cases in which unstable writes are done, the
receiving node does not have any such obligation, although it may receiving node does not have any such obligation, although it may
choose to asynchronously propagate the updates. However, once a choose to asynchronously propagate the updates. However, once a
COMMIT is replied to, all replicas must reflect the writes that have COMMIT is replied to, all replicas must reflect the writes that have
been done, and these data must have been committed to stable storage been done, and this data must have been committed to stable storage
on all replicas. on all replicas.
In order to avoid situations in which stale data is read from In order to avoid situations in which stale data is read from
replicas to which writes have not been propagated: replicas to which writes have not been propagated:
o A client which has outstanding unstable writes made to single node o A client which has outstanding unstable writes made to a single node
(metadata server or storage device) MUST do all reads from that (metadata server or storage device) MUST do all reads from that
same node. same node.
o When writes are flushed to the server, for example to implement, o When writes are flushed to the server, for example to implement,
skipping to change at page 30, line 22 skipping to change at page 29, line 47
/// uint64_t ffil_bytes_completed; /// uint64_t ffil_bytes_completed;
/// uint64_t ffil_bytes_not_delivered; /// uint64_t ffil_bytes_not_delivered;
/// nfstime4 ffil_total_busy_time; /// nfstime4 ffil_total_busy_time;
/// nfstime4 ffil_aggregate_completion_time; /// nfstime4 ffil_aggregate_completion_time;
/// }; /// };
/// ///
<CODE ENDS> <CODE ENDS>
Both operation counts and bytes transferred are kept in the Both operation counts and bytes transferred are kept in the
ff_io_latency4. READ operations are used for read latencies. Both ff_io_latency4. As seen in ff_layoutupdate4 (See Section 9.2.2) read
WRITE and COMMIT operations are used for write latencies. and write operations are aggregated separately. READ operations are
"Requested" counters track what the client is attempting to do and used for the ff_io_latency4 ffl_read. Both WRITE and COMMIT
"completed" counters track what was done. Note that there is no operations are used for the ff_io_latency4 ffl_write. "Requested"
requirement that the client only report completed results that have counters track what the client is attempting to do and "completed"
matching requested results from the reported period. counters track what was done. There is no requirement that the
client only report completed results that have matching requested
results from the reported period.
ffil_bytes_not_delivered is used to track the aggregate number of ffil_bytes_not_delivered is used to track the aggregate number of
bytes requested by not fulfilled due to error conditions. bytes requested but not fulfilled due to error conditions.
ffil_total_busy_time is the aggregate time spent with outstanding RPC ffil_total_busy_time is the aggregate time spent with outstanding RPC
calls, ffil_aggregate_completion_time is the sum of all latencies for calls. ffil_aggregate_completion_time is the sum of all round trip
completed RPC calls. times for completed RPC calls.
In Section 3.3.1 of [RFC5661], the nfstime4 is defined as the number
of seconds and nanoseconds since midnight or zero hour January 1,
1970 Coordinated Universal Time (UTC). The use of nfstime4 in
ff_io_latency4 is to store time since the start of the first I/O from
the client after receiving the layout. In other words, these are to
be decoded as duration and not as a date and time.
Note that LAYOUTSTATS are cumulative, i.e., not reset each time the Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
operation is sent. If two LAYOUTSTATS ops for the same file, layout operation is sent. If two LAYOUTSTATS ops for the same file, layout
stateid, and originating from the same NFS client are processed at stateid, and originating from the same NFS client are processed at
the same time by the metadata server, then the one containing the the same time by the metadata server, then the one containing the
larger values contains the most recent time series data. larger values contains the most recent time series data.
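Two consequences of the text above can be illustrated with a small sketch (Python, hypothetical helper names): because the counters are cumulative, of two LAYOUTSTATS reports for the same file, layout stateid, and client, the one with the larger values carries the later time-series data; and the nfstime4 fields in ff_io_latency4 decode as durations since the first I/O under the layout, not as calendar times.

```python
def more_recent(stats_a, stats_b):
    """Given two cumulative LAYOUTSTATS counter sets (dicts with the
    same keys) from the same client/file/layout stateid, return the
    one carrying the later time-series data."""
    if all(stats_a[k] >= stats_b[k] for k in stats_b):
        return stats_a
    return stats_b

def duration_seconds(nfstime4):
    """Decode an ff_io_latency4 nfstime4 (seconds, nanoseconds) as a
    duration since the client's first I/O, not as a date and time."""
    seconds, nseconds = nfstime4
    return seconds + nseconds / 1e9
```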
9.2.2. ff_layoutupdate4 9.2.2. ff_layoutupdate4
<CODE BEGINS> <CODE BEGINS>
skipping to change at page 36, line 15 skipping to change at page 35, line 44
[RFC5661]). To avoid data corruption, the metadata server MUST fence [RFC5661]). To avoid data corruption, the metadata server MUST fence
off the revoked clients from the respective data files as described off the revoked clients from the respective data files as described
in Section 2.2. in Section 2.2.
15. Security Considerations 15. Security Considerations
The pNFS extension partitions the NFSv4.1+ file system protocol into The pNFS extension partitions the NFSv4.1+ file system protocol into
two parts, the control path and the data path (storage protocol). two parts, the control path and the data path (storage protocol).
The control path contains all the new operations described by this The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path (see Sections 1.7.1 and 2.2.1 of [RFC5661]). The
is required to preserve the security properties of NFSv4.1+ with combination of components in a pNFS system is required to preserve
respect to an entity accessing data via a client, including security the security properties of NFSv4.1+ with respect to an entity
countermeasures to defend against threats that NFSv4.1+ provides accessing data via a client, including security countermeasures to
defenses for in environments where these threats are considered defend against threats that NFSv4.1+ provides defenses for in
significant. environments where these threats are considered significant.
The metadata server enforces the file access-control policy at The metadata server enforces the file access-control policy at
LAYOUTGET time. The client should use RPC authorization credentials LAYOUTGET time. The client should use RPC authorization credentials
(uid/gid for AUTH_SYS or tickets for Kerberos) for getting the layout for getting the layout for the requested iomode (READ or RW) and the
for the requested iomode (READ or RW) and the server verifies the server verifies the permissions and ACL for these credentials,
permissions and ACL for these credentials, possibly returning possibly returning NFS4ERR_ACCESS if the client is not allowed the
NFS4ERR_ACCESS if the client is not allowed the requested iomode. If requested iomode. If the LAYOUTGET operation succeeds the client
the LAYOUTGET operation succeeds the client receives, as part of the receives, as part of the layout, a set of credentials allowing it I/O
layout, a set of credentials allowing it I/O access to the specified access to the specified data files corresponding to the requested
data files corresponding to the requested iomode. When the client iomode. When the client acts on I/O operations on behalf of its
acts on I/O operations on behalf of its local users, it MUST local users, it MUST authenticate and authorize the user by issuing
authenticate and authorize the user by issuing respective OPEN and respective OPEN and ACCESS calls to the metadata server, similar to
ACCESS calls to the metadata server, similar to having NFSv4 data having NFSv4 data delegations.
delegations.
If access is allowed, the client uses the corresponding (READ or RW) If access is allowed, the client uses the corresponding (READ or RW)
credentials to perform the I/O operations at the data file's storage credentials to perform the I/O operations at the data file's storage
devices. When the metadata server receives a request to change a devices. When the metadata server receives a request to change a
file's permissions or ACL, it SHOULD recall all layouts for that file file's permissions or ACL, it SHOULD recall all layouts for that file
and then MUST fence off any clients still holding outstanding layouts and then MUST fence off any clients still holding outstanding layouts
for the respective files by implicitly invalidating the previously for the respective files by implicitly invalidating the previously
distributed credential on all data file comprising the file in distributed credential on all data files comprising the file in
question. It is REQUIRED that this be done before committing to the question. It is REQUIRED that this be done before committing to the
new permissions and/or ACL. By requesting new layouts, the clients new permissions and/or ACL. By requesting new layouts, the clients
will reauthorize access against the modified access control metadata. will reauthorize access against the modified access control metadata.
Recalling the layouts in this case is intended to prevent clients Recalling the layouts in this case is intended to prevent clients
from getting an error on I/Os done after the client was fenced off. from getting an error on I/Os done after the client was fenced off.
15.1. Kerberized File Access 15.1. RPCSEC_GSS and Security Services
15.1.1. Loosely Coupled 15.1.1. Loosely Coupled
RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] could be used to RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] could be used to
authorize the client to the storage device on behalf of the metadata authorize the client to the storage device on behalf of the metadata
server. This would require that each of the metadata server, storage server. This would require that each of the metadata server, storage
device, and client would have to implement RPCSEC_GSSv3. The second device, and client would have to implement RPCSEC_GSSv3 via an RPC-
requirement does not match the intent of the loosely coupled model application-defined structured privilege assertion in a manner
that the storage device need not be modified. described in Section 4.9.1 of [RFC7862]. These requirements do not
match the intent of the loosely coupled model that the storage device
Under this coupling model, the principal used to authenticate the need not be modified. (Note that this does not preclude the use of
metadata file is different than that used to authenticate the data RPCSEC_GSSv3 in a loosely coupled model.)
file. For the metadata server, the user credentials would be
generated by the same Kerberos server as the client. However, for
the data storage access, the metadata server would generate the
ticket granting tickets and provide them to the client. Fencing
would then be controlled either by expiring the ticket or by
modifying the syntethic uid or gid on the data file.
15.1.2. Tightly Coupled 15.1.2. Tightly Coupled
With tight coupling, the principal used to access the metadata file With tight coupling, the principal used to access the metadata file
is exactly the same as used to access the data file. As a result is exactly the same as used to access the data file. The storage
there are no security issues related to using Kerberos with a tightly device can use the control protocol to validate any RPC credentials.
coupled system. As a result there are no security issues related to using RPCSEC_GSS
with a tightly coupled system. For example, if Kerberos V5 GSS-API
[RFC4121] is used as the security mechanism, then the storage device
could use a control protocol to validate the RPC credentials to the
metadata server.
16. IANA Considerations 16. IANA Considerations
[RFC5661] introduced a registry for "pNFS Layout Types Registry" and [RFC5661] introduced a registry for "pNFS Layout Types Registry" and
as such, new layout type numbers need to be assigned by IANA. This as such, new layout type numbers need to be assigned by IANA. This
document defines the protocol associated with the existing layout document defines the protocol associated with the existing layout
type number, LAYOUT4_FLEX_FILES (see Table 1). type number, LAYOUT4_FLEX_FILES (see Table 1).
+--------------------+-------+----------+-----+----------------+ +--------------------+-------+----------+-----+----------------+
| Layout Type Name | Value | RFC | How | Minor Versions | | Layout Type Name | Value | RFC | How | Minor Versions |
skipping to change at page 38, line 43 skipping to change at page 38, line 19
[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents",
November 2008, <http://trustee.ietf.org/docs/ November 2008, <http://trustee.ietf.org/docs/
IETF-Trust-License-Policy.pdf>. IETF-Trust-License-Policy.pdf>.
[RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813,
June 1995. June 1995.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4121] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos
Version 5 Generic Security Service Application Program
Interface (GSS-API) Mechanism Version 2", RFC 4121, July
2005.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", [RFC4506] Eisler, M., "XDR: External Data Representation Standard",
STD 67, RFC 4506, May 2006. STD 67, RFC 4506, May 2006.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, May 2009. Specification Version 2", RFC 5531, May 2009.
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1 "Network File System (NFS) Version 4 Minor Version 1
Protocol", RFC 5661, January 2010. Protocol", RFC 5661, January 2010.
skipping to change at page 39, line 18 skipping to change at page 38, line 47
RFC 5662, January 2010. RFC 5662, January 2010.
[RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS) [RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 7530, March 2015. version 4 Protocol", RFC 7530, March 2015.
[RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862, [RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862,
November 2016. November 2016.
[pNFSLayouts]
           Haynes, T., "Requirements for pNFS Layout Types", draft-
           ietf-nfsv4-layout-types-05 (Work In Progress), July 2017.
17.2.  Informative References
[RFC4519]  Sciberras, A., Ed., "Lightweight Directory Access Protocol
           (LDAP): Schema for User Applications", RFC 4519,
           DOI 10.17487/RFC4519, June 2006,
           <http://www.rfc-editor.org/info/rfc4519>.
[RFC7861]  Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
           Security Version 3", RFC 7861, November 2016.
Appendix A.  Acknowledgments
Those who provided miscellaneous comments to early drafts of this
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
and Lev Solomonov.
Those who provided miscellaneous comments to the final drafts of this
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
Sundharraj, Trond Myklebust, Rick Macklem, and Jim Sermersheim.
Idan Kedar caught a nasty bug in the interaction of client side
mirroring and the minor versioning of devices.
Dave Noveck provided comprehensive reviews of the document during the
working group last calls.  He also rewrote Section 2.3.
Olga Kornievskaia made a convincing case against the use of a
credential versus a principal in the fencing approach.  Andy Adamson
and Benjamin Kaduk helped to sharpen the focus.
Benjamin Kaduk and Olga Kornievskaia also helped provide concrete
scenarios for loosely coupled security mechanisms.  In the end, Olga
proved that, as defined, the loosely coupled model would not work
with RPCSEC_GSS.
Tigran Mkrtchyan provided the use case for not allowing the client to
proxy the I/O through the data server.
Rick Macklem provided the use case for only writing to a single
mirror.
Appendix B.  RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this
document as an RFC]