draft-ietf-nfsv4-flex-files-06.txt   draft-ietf-nfsv4-flex-files-07.txt 
NFSv4 B. Halevy NFSv4 B. Halevy
Internet-Draft Internet-Draft
Intended status: Standards Track T. Haynes Intended status: Standards Track T. Haynes
Expires: January 22, 2016 Primary Data Expires: July 25, 2016 Primary Data
July 21, 2015 January 22, 2016
Parallel NFS (pNFS) Flexible File Layout Parallel NFS (pNFS) Flexible File Layout
draft-ietf-nfsv4-flex-files-06.txt draft-ietf-nfsv4-flex-files-07.txt
Abstract Abstract
The Parallel Network File System (pNFS) allows a separation between The Parallel Network File System (pNFS) allows a separation between
the metadata (onto a metadata server) and data (onto a storage the metadata (onto a metadata server) and data (onto a storage
device) for a file. The Flexible File Layout Type is defined in this device) for a file. The Flexible File Layout Type is defined in this
document as an extension to pNFS to allow the use of storage devices document as an extension to pNFS to allow the use of storage devices
in a fashion such that they require only a quite limited degree of in a fashion such that they require only a quite limited degree of
interaction with the metadata server, using already existing interaction with the metadata server, using already existing
protocols. Client side mirroring is also added to provide protocols. Client side mirroring is also added to provide
skipping to change at page 1, line 38 skipping to change at page 1, line 38
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 22, 2016. This Internet-Draft will expire on July 25, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 2, line 22 skipping to change at page 2, line 22
1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Difference Between a Data Server and a Storage Device . . 5 1.2. Difference Between a Data Server and a Storage Device . . 5
1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6
2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6
2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6
2.2. Fencing Clients from the Data Server . . . . . . . . . . 6 2.2. Fencing Clients from the Data Server . . . . . . . . . . 6
2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7
2.2.2. Example of using Synthetic uids/gids . . . . . . . . 7 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 7
2.3. State and Locking Models . . . . . . . . . . . . . . . . 8 2.3. State and Locking Models . . . . . . . . . . . . . . . . 8
3. XDR Description of the Flexible File Layout Type . . . . . . 9 3. XDR Description of the Flexible File Layout Type . . . . . . 9
3.1. Code Components Licensing Notice . . . . . . . . . . . . 9 3.1. Code Components Licensing Notice . . . . . . . . . . . . 10
4. Device Addressing and Discovery . . . . . . . . . . . . . . . 11 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 11
4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 11 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 11
4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 12 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 13
5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 13 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 14
5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 14 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 14
5.2. Interactions Between Devices and Layouts . . . . . . . . 17 5.1.1. Error codes from LAYOUTGET . . . . . . . . . . . . . 17
5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 18
5.2. Interactions Between Devices and Layouts . . . . . . . . 18
5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 18 5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 18
6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 18 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 19
7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 19 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 19
8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 19 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 20 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 21
8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 20 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 21
8.3. Metadata Server Resilvering of the File . . . . . . . . . 21 8.3. Metadata Server Resilvering of the File . . . . . . . . . 22
9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 21 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 22
9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 22 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 23
9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 22 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 23
9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 23 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 24
9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 23 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 24
9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 24 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 25
9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 24 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 25
9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 25 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 26
10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 26 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 27
11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 26 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 27
12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 26 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 27
12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 27 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 28
13. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 27 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 28
13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 27 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 28
14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 28
15. Security Considerations . . . . . . . . . . . . . . . . . . . 29 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 29
15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 29 15. Security Considerations . . . . . . . . . . . . . . . . . . . 30
15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 30 15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 30
15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 30 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 31
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 31
17. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31
17.1. Normative References . . . . . . . . . . . . . . . . . . 30 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 31
17.2. Informative References . . . . . . . . . . . . . . . . . 31 17.1. Normative References . . . . . . . . . . . . . . . . . . 31
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 31 17.2. Informative References . . . . . . . . . . . . . . . . . 32
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 32 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 32
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 32 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 33
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33
1. Introduction 1. Introduction
In the parallel Network File System (pNFS), the metadata server In the parallel Network File System (pNFS), the metadata server
returns Layout Type structures that describe where file data is returns Layout Type structures that describe where file data is
located. There are different Layout Types for different storage located. There are different Layout Types for different storage
systems and methods of arranging data on storage devices. This systems and methods of arranging data on storage devices. This
document defines the Flexible File Layout Type used with file-based document defines the Flexible File Layout Type used with file-based
data servers that are accessed using the Network File System (NFS) data servers that are accessed using the Network File System (NFS)
protocols: NFSv3 [RFC1813], NFSv4.0 [RFCNFSv4], NFSv4.1 [RFC5661], protocols: NFSv3 [RFC1813], NFSv4.0 [RFCNFSv4], NFSv4.1 [RFC5661],
skipping to change at page 6, line 24 skipping to change at page 6, line 27
either tight or loose. In a tight coupling, there is a control either tight or loose. In a tight coupling, there is a control
protocol present to manage security, LAYOUTCOMMITs, etc. With a protocol present to manage security, LAYOUTCOMMITs, etc. With a
loose coupling, the only control protocol might be a version of NFS. loose coupling, the only control protocol might be a version of NFS.
As such, semantics for managing security, state, and locking models As such, semantics for managing security, state, and locking models
MUST be defined. MUST be defined.
2.1. LAYOUTCOMMIT 2.1. LAYOUTCOMMIT
With a tightly coupled system, when the metadata server receives a With a tightly coupled system, when the metadata server receives a
LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the
File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]). With File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]). It
a loosely coupled system, a LAYOUTCOMMIT to the metadata server MUST is the responsibility of the client to make sure the data file is
be proceeded with a COMMIT to the storage device. It is the stable before the metadata server begins to query the storage devices
responsibility of the client to make sure the data file is stable about the changes to the file. With a loosely coupled system, if any
before the metadata server begins to query the storage devices about WRITE to a storage device did not result with stable_how equal to
the changes to the file. Note that if the client has not done a FILE_SYNC, a LAYOUTCOMMIT to the metadata server MUST be preceded
COMMIT to the storage device, then the LAYOUTCOMMIT might not be with a COMMIT to the storage device. Note that if the client has not
synchronized to the last WRITE operation to the storage device. done a COMMIT to the storage device, then the LAYOUTCOMMIT might not
be synchronized to the last WRITE operation to the storage device.
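The loosely coupled rule in the -07 text can be sketched as a small client-side check. This is an illustration only, not part of the draft: the function name and the idea of collecting the stable_how value returned by each WRITE are assumptions, while the enum values follow stable_how in [RFC1813].

```python
from enum import IntEnum


class StableHow(IntEnum):
    # Values of the stable_how enum in NFSv3 (RFC 1813).
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


def commit_required_before_layoutcommit(write_results):
    """Loosely coupled case: return True if a COMMIT to the storage
    device must precede the LAYOUTCOMMIT to the metadata server.

    write_results holds the stable_how value each WRITE reply carried
    (hypothetical bookkeeping kept by the client for this sketch).
    """
    return any(result != StableHow.FILE_SYNC for result in write_results)
```

A client that only ever saw FILE_SYNC replies could skip the COMMIT; a single UNSTABLE or DATA_SYNC reply would make it mandatory.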
2.2. Fencing Clients from the Data Server 2.2. Fencing Clients from the Data Server
With loosely coupled storage devices, the metadata server uses With loosely coupled storage devices, the metadata server uses
synthetic uids and gids for the data file, where the uid owner of the synthetic uids and gids for the data file, where the uid owner of the
data file is allowed read/write access and the gid owner is allowed data file is allowed read/write access and the gid owner is allowed
read only access. As part of the layout (see ffds_user and read only access. As part of the layout (see ffds_user and
ffds_group in Section 5.1), the client is provided with the user and ffds_group in Section 5.1), the client is provided with the user and
group to be used in the Remote Procedure Call (RPC) [RFC5531] group to be used in the Remote Procedure Call (RPC) [RFC5531]
credentials needed to access the data file. Fencing off of clients credentials needed to access the data file. Fencing off of clients
is achieved by the metadata server changing the synthetic uid and/or is achieved by the metadata server changing the synthetic uid and/or
gid owners of the data file on the storage device to implicitly gid owners of the data file on the storage device to implicitly
revoke the outstanding RPC credentials. revoke the outstanding RPC credentials.
With this loosely coupled model, the metadata server is not able to With this loosely coupled model, the metadata server is not able to
fence off a single client, it forced to fence off all clients. fence off a single client, it is forced to fence off all clients.
However, as the other clients react to the fencing, returning their However, as the other clients react to the fencing, returning their
layouts and trying to get new ones, the metadata server can hand out layouts and trying to get new ones, the metadata server can hand out
a new uid and gid to allow access. a new uid and gid to allow access.
Note: it is recommended to implement common access control methods at Note: it is recommended to implement common access control methods at
the storage device filesystem to allow only the metadata server root the storage device filesystem to allow only the metadata server root
(super user) access to the storage device, and to set the owner of (super user) access to the storage device, and to set the owner of
all directories holding data files to the root user. This approach all directories holding data files to the root user. This approach
provides a practical model to enforce access control and fence off provides a practical model to enforce access control and fence off
cooperative clients, but it can not protect against malicious cooperative clients, but it can not protect against malicious
skipping to change at page 7, line 24 skipping to change at page 7, line 26
With tightly coupled storage devices, the metadata server sets the With tightly coupled storage devices, the metadata server sets the
user and group owners, mode bits, and ACL of the data file to be the user and group owners, mode bits, and ACL of the data file to be the
same as the metadata file. And the client must authenticate with the same as the metadata file. And the client must authenticate with the
storage device and go through the same authorization process it would storage device and go through the same authorization process it would
go through via the metadata server. go through via the metadata server.
2.2.1. Implementation Notes for Synthetic uids/gids 2.2.1. Implementation Notes for Synthetic uids/gids
The selection method for the synthetic uids and gids to be used for The selection method for the synthetic uids and gids to be used for
fencing in loosely coupled storage devices is strictly an fencing in loosely coupled storage devices is strictly an
implementation issue. An implementation might allow an administrator implementation issue. I.e., an administrator might restrict a range
to restrict a range of such ids in the name servers. She might also of such ids available to the Lightweight Directory Access Protocol
be able to choose an id that would never be used to grant access. (LDAP) 'uid' field [RFC4519]. She might also be able to choose an id
Then when the metadata server had a request to access a file, a that would never be used to grant access. Then when the metadata
SETATTR would be sent to the storage device to set the owner and server had a request to access a file, a SETATTR would be sent to the
group of the data file. The user and group might be selected in a storage device to set the owner and group of the data file. The user
round robin fashion from the range of available ids. and group might be selected in a round robin fashion from the range
of available ids.
Those ids would be sent back as ffds_user and ffds_group to the Those ids would be sent back as ffds_user and ffds_group to the
client. And it would present them as the RPC credentials to the client. And it would present them as the RPC credentials to the
storage device. When the client was done accessing the file and the storage device. When the client was done accessing the file and the
metadata server knew that no other client was accessing the file, it metadata server knew that no other client was accessing the file, it
could reset the owner and group to restrict access to the data file. could reset the owner and group to restrict access to the data file.
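One possible way to model the round robin selection, the reset, and the fencing step is sketched below. The draft leaves the selection method to the implementation, so every class and function name here is hypothetical, and the in-memory DataFile merely stands in for the SETATTR the metadata server would send to the storage device.

```python
from itertools import cycle


class SyntheticIdAllocator:
    """Hands out synthetic ids round robin from an administrator-chosen
    range and keeps one restricted id that never grants access."""

    def __init__(self, id_range, restricted_id):
        if restricted_id in id_range:
            raise ValueError("restricted id must lie outside the range")
        self._ids = cycle(id_range)
        self.restricted_id = restricted_id

    def next_id(self):
        return next(self._ids)


class DataFile:
    """Stand-in for the owner and group of a data file on a storage device."""

    def __init__(self, restricted_id):
        self.uid = restricted_id
        self.gid = restricted_id


def grant_access(data_file, allocator):
    # Models the SETATTR of owner and group; the returned pair is what
    # would be sent to the client as ffds_user and ffds_group.
    data_file.uid = allocator.next_id()
    data_file.gid = allocator.next_id()
    return data_file.uid, data_file.gid


def fence(data_file, allocator):
    # Changing the owners implicitly revokes outstanding RPC credentials.
    data_file.uid = allocator.restricted_id
    data_file.gid = allocator.restricted_id


def may_write(data_file, cred_uid):
    # Only the uid owner has read/write access in the loosely coupled model.
    return cred_uid == data_file.uid
```

After fence() runs, every client still presenting the old ffds_user is rejected, which matches the observation that a single client cannot be fenced in isolation.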
When the metadata server wanted to fence off a client, it would When the metadata server wanted to fence off a client, it would
change the synthetic uid and/or gid to the restricted ids. Note that change the synthetic uid and/or gid to the restricted ids. Note that
using a restricted id ensures that there is a change of owner and at using a restricted id ensures that there is a change of owner and at
skipping to change at page 9, line 5 skipping to change at page 9, line 5
Metadata file OPEN, LOCK, and DELEGATION operations are always Metadata file OPEN, LOCK, and DELEGATION operations are always
executed only against the metadata server. executed only against the metadata server.
The metadata server responds to state changing operations by The metadata server responds to state changing operations by
executing them against the respective data files on the storage executing them against the respective data files on the storage
devices. It then sends the storage device open stateid as part of devices. It then sends the storage device open stateid as part of
the layout (see the ffm_stateid in Section 5.1) and it is then used the layout (see the ffm_stateid in Section 5.1) and it is then used
by the client for executing READ/WRITE operations against the storage by the client for executing READ/WRITE operations against the storage
device. device.
Standalone NFSv4.1+ storage devices that do not return the NFSv4.1+ storage devices that do not return the
EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating that
as NFSv4 storage devices. they are loosely coupled. As such, they are treated the same way as
NFSv4 storage devices.
NFSv4.1+ clustered storage devices that do identify themselves with NFSv4.1+ storage devices that do identify themselves with the
the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are strongly
control protocol as described in [RFC5661] to implement a global coupled. They will be using a back-end control protocol as described
stateid model as defined there. in [RFC5661] to implement a global stateid model as defined there.
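The classification described in the -07 text amounts to a single bit test on the flags returned by EXCHANGE_ID. The sketch below assumes the flag value defined in [RFC5661]; the function name is hypothetical.

```python
# Flag value as defined for EXCHANGE_ID results in RFC 5661.
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000


def storage_device_coupling(eir_flags):
    """Classify an NFSv4.1+ storage device from its EXCHANGE_ID reply flags."""
    if eir_flags & EXCHGID4_FLAG_USE_PNFS_DS:
        return "tight"   # back-end control protocol, global stateid model
    return "loose"       # treated the same way as an NFSv4 storage device
```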
3. XDR Description of the Flexible File Layout Type 3. XDR Description of the Flexible File Layout Type
This document contains the external data representation (XDR) This document contains the external data representation (XDR)
[RFC4506] description of the Flexible File Layout Type. The XDR [RFC4506] description of the Flexible File Layout Type. The XDR
description is embedded in this document in a way that makes it description is embedded in this document in a way that makes it
simple for the reader to extract into a ready-to-compile form. The simple for the reader to extract into a ready-to-compile form. The
reader can feed this document into the following shell script to reader can feed this document into the following shell script to
produce the machine readable XDR description of the Flexible File produce the machine readable XDR description of the Flexible File
Layout Type: Layout Type:
skipping to change at page 11, line 47 skipping to change at page 12, line 4
/// uint32_t ffdv_wsize; /// uint32_t ffdv_wsize;
/// bool ffdv_tightly_coupled; /// bool ffdv_tightly_coupled;
/// }; /// };
/// ///
/// struct ff_device_addr4 { /// struct ff_device_addr4 {
/// multipath_list4 ffda_netaddrs; /// multipath_list4 ffda_netaddrs;
/// ff_device_versions4 ffda_versions<>; /// ff_device_versions4 ffda_versions<>;
/// }; /// };
/// ///
<CODE ENDS> <CODE ENDS>
The ffda_netaddrs field is used to locate the storage device. It The ffda_netaddrs field is used to locate the storage device. It
MUST be set by the server to a list holding one or more of the device MUST be set by the server to a list holding one or more of the device
network addresses. network addresses.
The ffda_versions array allows the metadata server to present The ffda_versions array allows the metadata server to present choices
multiple NFS versions and/or minor versions to the client. The as to NFS version, minor version, and coupling strength to the
ffdv_version and ffdv_minorversion represent the NFS protocol to be client. The ffdv_version and ffdv_minorversion represent the NFS
used to access the storage device. This layout specification defines protocol to be used to access the storage device. This layout
the semantics for ffdv_versions 3 and 4. If ffdv_version equals 3 specification defines the semantics for ffdv_versions 3 and 4. If
then server MUST set ffdv_minorversion to 0 and the client MUST ffdv_version equals 3 then the server MUST set ffdv_minorversion to 0
access the storage device using the NFSv3 protocol [RFC1813]. If and ffdv_tightly_coupled to false. The client MUST then access the
ffdv_version equals 4 then the server MUST set ffdv_minorversion to storage device using the NFSv3 protocol [RFC1813]. If ffdv_version
one of the NFSv4 minor version numbers and the client MUST access the equals 4 then the server MUST set ffdv_minorversion to one of the
storage device using NFSv4. NFSv4 minor version numbers and the client MUST access the storage
device using NFSv4.
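The constraints on an ff_device_versions4 entry, following the -07 wording, can be expressed as a small validator. This is a sketch under stated assumptions: the Python class mirrors the XDR fields but is not generated from the XDR, and the set of NFSv4 minor versions is assumed to be 0, 1, and 2.

```python
from dataclasses import dataclass

# Assumption for this sketch: the NFSv4 minor versions in existence.
KNOWN_NFSV4_MINOR_VERSIONS = frozenset({0, 1, 2})


@dataclass
class FFDeviceVersion:
    version: int            # ffdv_version
    minorversion: int       # ffdv_minorversion
    rsize: int              # ffdv_rsize
    wsize: int              # ffdv_wsize
    tightly_coupled: bool   # ffdv_tightly_coupled


def violations(entry):
    """Return a list of rule violations for one ffda_versions entry."""
    problems = []
    if entry.version == 3:
        if entry.minorversion != 0:
            problems.append("NFSv3 requires ffdv_minorversion == 0")
        if entry.tightly_coupled:
            problems.append("NFSv3 requires ffdv_tightly_coupled == false")
    elif entry.version == 4:
        if entry.minorversion not in KNOWN_NFSV4_MINOR_VERSIONS:
            problems.append("unknown NFSv4 minor version")
    else:
        problems.append("semantics are defined only for ffdv_version 3 and 4")
    return problems
```

A client could run the same check over each entry of ffda_versions to pick the first combination it is able to use.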
Note that while the client might determine that it can not use any of Note that while the client might determine that it cannot use any of
the configured ffdv_version or ffdv_minorversion, when it gets the the configured combinations of ffdv_version, ffdv_minorversion, and
device list from the metadata server, there is no way to indicate to ffdv_tightly_coupled, when it gets the device list from the metadata
the metadata server as to which device it is version incompatible. server, there is no way to indicate to the metadata server as to
If however the client waits until it retrieves the layout from the which device it is version incompatible. If however, the client
metadata server, it can at that time clearly identify the storage waits until it retrieves the layout from the metadata server, it can
device in question (see Section 5.3). at that time clearly identify the storage device in question (see
Section 5.3).
The ffdv_rsize and ffdv_wsize are used to communicate the maximum The ffdv_rsize and ffdv_wsize are used to communicate the maximum
rsize and wsize supported by the storage device. As the storage rsize and wsize supported by the storage device. As the storage
device can have a different rsize or wsize than the metadata server, device can have a different rsize or wsize than the metadata server,
the ffdv_rsize and ffdv_wsize allow the metadata server to the ffdv_rsize and ffdv_wsize allow the metadata server to
communicate that information on behalf of the storage device. communicate that information on behalf of the storage device.
ffdv_tightly_coupled informs the client as to whether the metadata ffdv_tightly_coupled informs the client as to whether the metadata
server is tightly coupled with the storage devices or not. Note that server is tightly coupled with the storage devices or not. Note that
even if the data protocol is at least NFSv4.1, it may still be the even if the data protocol is at least NFSv4.1, it may still be the
case that there is no control protocol present. If case that loose coupling is in effect. If
ffdv_tightly_coupled is not set, then the client MUST commit writes ffdv_tightly_coupled is not set, then the client MUST commit writes
to the storage devices for the file before sending a LAYOUTCOMMIT to to the storage devices for the file before sending a LAYOUTCOMMIT to
the metadata server. I.e., the writes MUST be committed by the the metadata server. I.e., the writes MUST be committed by the
client to stable storage via issuing WRITEs with stable_how == client to stable storage via issuing WRITEs with stable_how ==
FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how !=
FILE_SYNC (see Section 3.3.7 of [RFC1813]). FILE_SYNC (see Section 3.3.7 of [RFC1813]).
4.2. Storage Device Multipathing 4.2. Storage Device Multipathing
The Flexible File Layout Type supports multipathing to multiple The Flexible File Layout Type supports multipathing to multiple
storage device addresses. Storage device level multipathing is used storage device addresses. Storage device level multipathing is used
for bandwidth scaling via trunking and for higher availability of use for bandwidth scaling via trunking and for higher availability of use
in the case of a storage device failure. Multipathing allows the in the event of a storage device failure. Multipathing allows the
client to switch to another storage device address which may be that client to switch to another storage device address which may be that
of another storage device that is exporting the same data stripe of another storage device that is exporting the same data stripe
unit, without having to contact the metadata server for a new layout. unit, without having to contact the metadata server for a new layout.
To support storage device multipathing, ffda_netaddrs contains an To support storage device multipathing, ffda_netaddrs contains an
array of one or more storage device network addresses. This array array of one or more storage device network addresses. This array
(data type multipath_list4) represents a list of storage device (each (data type multipath_list4) represents a list of storage devices
identified by a network address), with the possibility that some (each identified by a network address), with the possibility that
storage device will appear in the list multiple times. some storage device will appear in the list multiple times.
The client is free to use any of the network addresses as a The client is free to use any of the network addresses as a
destination to send storage device requests. If some network destination to send storage device requests. If some network
addresses are less optimal paths to the data than others, then the addresses are less desirable paths to the data than others, then the
MDS SHOULD NOT include those network addresses in ffda_netaddrs. If MDS SHOULD NOT include those network addresses in ffda_netaddrs. If
less optimal network addresses exist to provide failover, the less desirable network addresses exist to provide failover, the
RECOMMENDED method to offer the addresses is to provide them in a RECOMMENDED method to offer the addresses is to provide them in a
replacement device-ID-to-device-address mapping, or a replacement replacement device-ID-to-device-address mapping, or a replacement
device ID. When a client finds no response from the storage device device ID. When a client finds no response from the storage device
using all addresses available in ffda_netaddrs, it SHOULD send a using all addresses available in ffda_netaddrs, it SHOULD send a
GETDEVICEINFO to attempt to replace the existing device-ID-to-device- GETDEVICEINFO to attempt to replace the existing device-ID-to-device-
address mappings. If the MDS detects that all network paths address mappings. If the MDS detects that all network paths
represented by ffda_netaddrs are unavailable, the MDS SHOULD send a represented by ffda_netaddrs are unavailable, the MDS SHOULD send a
CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID
notifications for changed device IDs) to change the device-ID-to- notifications for changed device IDs) to change the device-ID-to-
device-address mappings to the available addresses. If the device ID device-address mappings to the available addresses. If the device ID
itself will be replaced, the MDS SHOULD recall all layouts with the itself will be replaced, the MDS SHOULD recall all layouts with the
device ID, and thus force the client to get new layouts and device ID device ID, and thus force the client to get new layouts and device ID
mappings via LAYOUTGET and GETDEVICEINFO. mappings via LAYOUTGET and GETDEVICEINFO.
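The client-side failover behavior described above might look like the following sketch. The callables send and get_device_info are hypothetical placeholders for the real RPC machinery: send(addr) issues the storage device request to one address and raises ConnectionError when there is no response, and get_device_info() stands for a GETDEVICEINFO that returns a fresh ffda_netaddrs list.

```python
def _first_success(addrs, send):
    """Try each address in order; return (True, reply) on the first success."""
    for addr in addrs:
        try:
            return True, send(addr)
        except ConnectionError:
            continue
    return False, None


def send_with_failover(netaddrs, send, get_device_info):
    """Use any address in ffda_netaddrs; if none responds, refresh the
    device-ID-to-device-address mapping once and try the new list."""
    ok, reply = _first_success(netaddrs, send)
    if ok:
        return reply
    # No address responded: model the GETDEVICEINFO the client SHOULD send.
    ok, reply = _first_success(get_device_info(), send)
    if ok:
        return reply
    raise ConnectionError("no reachable storage device address")
```

The sketch tries addresses in list order for simplicity; the draft lets the client use any of the addresses, so a real client could just as well spread requests across them.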
Generally, if two network addresses appear in ffda_netaddrs, they Generally, if two network addresses appear in ffda_netaddrs, they
will designate the same storage device. When the storage device is will designate the same storage device. When the storage device is
accessed over NFSv4.1 or higher minor version the two storage device accessed over NFSv4.1 or a higher minor version, the two storage
addresses will support the implementation of client ID or session device addresses will support the implementation of client ID or
trunking (the latter is RECOMMENDED) as defined in [RFC5661]. The session trunking (the latter is RECOMMENDED) as defined in [RFC5661].
two storage device addresses will share the same server owner or The two storage device addresses will share the same server owner or
major ID of the server owner. It is not always necessary for the two major ID of the server owner. It is not always necessary for the two
storage device addresses to designate the same storage device with storage device addresses to designate the same storage device with
trunking being used. For example, the data could be read-only, and trunking being used. For example, the data could be read-only, and
the data consist of exact replicas. the data consist of exact replicas.
5. Flexible File Layout Type 5. Flexible File Layout Type
The layout4 type is defined in [RFC5662] as follows: The layout4 type is defined in [RFC5662] as follows:
<CODE BEGINS> <CODE BEGINS>
skipping to change at page 14, line 30 skipping to change at page 14, line 37
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
<CODE ENDS> <CODE ENDS>
This document defines structure associated with the layouttype4 value This document defines structure associated with the layouttype4 value
LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an
XDR type "opaque". The opaque layout is uninterpreted by the generic XDR type "opaque". The opaque layout is uninterpreted by the generic
pNFS client layers, but obviously must be interpreted by the Flexible pNFS client layers, but is interpreted by the Flexible File Layout
File Layout Type implementation. This section defines the structure Type implementation. This section defines the structure of this
of this opaque value, ff_layout4. otherwise opaque value, ff_layout4.
5.1. ff_layout4 5.1. ff_layout4
<CODE BEGINS> <CODE BEGINS>
/// const FF_FLAGS_NO_LAYOUTCOMMIT = 1; /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001;
/// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002;
/// typedef uint32_t ff_flags4; /// typedef uint32_t ff_flags4;
/// ///
/// struct ff_data_server4 { /// struct ff_data_server4 {
/// deviceid4 ffds_deviceid; /// deviceid4 ffds_deviceid;
/// uint32_t ffds_efficiency; /// uint32_t ffds_efficiency;
/// stateid4 ffds_stateid; /// stateid4 ffds_stateid;
/// nfs_fh4 ffds_fh_vers<>; /// nfs_fh4 ffds_fh_vers<>;
/// fattr4_owner ffds_user; /// fattr4_owner ffds_user;
/// fattr4_owner_group ffds_group; /// fattr4_owner_group ffds_group;
/// }; /// };
/// ///
/// struct ff_mirror4 { /// struct ff_mirror4 {
/// ff_data_server4 ffm_data_servers<>; /// ff_data_server4 ffm_data_servers<>;
/// }; /// };
/// ///
/// struct ff_layout4 { /// struct ff_layout4 {
/// length4 ffl_stripe_unit; /// length4 ffl_stripe_unit;
/// ff_mirror4 ffl_mirrors<>; /// ff_mirror4 ffl_mirrors<>;
/// ff_flags4 ffl_flags; /// ff_flags4 ffl_flags;
/// uint32_t ffl_stats_collect_hint;
/// }; /// };
/// ///
<CODE ENDS> <CODE ENDS>
The ff_layout4 structure specifies a layout over a set of mirrored The ff_layout4 structure specifies a layout over a set of mirrored
copies of that portion of the data file described in the current copies of that portion of the data file described in the current
layout segment. This mirroring protects against loss of data in layout segment. This mirroring protects against loss of data in
layout segments. Note that while not explicitly shown in the above layout segments. Note that while not explicitly shown in the above
XDR, each layout4 element returned in the logr_layout array of XDR, each layout4 element returned in the logr_layout array of
skipping to change at page 16, line 5 skipping to change at page 16, line 8
mirror by the number of elements in ffm_data_servers. If the number mirror by the number of elements in ffm_data_servers. If the number
of stripes is one, then the value for ffl_stripe_unit MUST default to of stripes is one, then the value for ffl_stripe_unit MUST default to
zero. The only supported mapping scheme is sparse and is detailed in zero. The only supported mapping scheme is sparse and is detailed in
Section 6. Note that there is an assumption here that both the Section 6. Note that there is an assumption here that both the
stripe unit size and the number of stripes are the same across all stripe unit size and the number of stripes are the same across all
mirrors. mirrors.
The ffl_mirrors field is the array of mirrored storage devices which The ffl_mirrors field is the array of mirrored storage devices which
provide the storage for the current stripe, see Figure 1. provide the storage for the current stripe, see Figure 1.
The ffl_stats_collect_hint field provides a hint to the client on how
often the server wants it to report LAYOUTSTATS for a file. The time
is in seconds.
+-----------+ +-----------+
| | | |
| | | |
| File | | File |
| | | |
| | | |
+-----+-----+ +-----+-----+
| |
+------------+------------+ +------------+------------+
| | | |
skipping to change at page 16, line 38 skipping to change at page 16, line 45
The ffs_mirrors field represents an array of state information for The ffs_mirrors field represents an array of state information for
each mirrored copy of the current layout segment. Each element is each mirrored copy of the current layout segment. Each element is
described by a ff_mirror4 type. described by a ff_mirror4 type.
ffds_deviceid provides the deviceid of the storage device holding the ffds_deviceid provides the deviceid of the storage device holding the
data file. data file.
ffds_fh_vers is an array of filehandles of the data file matching to ffds_fh_vers is an array of filehandles of the data file matching to
the available NFS versions on the given storage device. There MUST the available NFS versions on the given storage device. There MUST
be exactly as many elements in ffds_fh_vers as there are in be exactly as many elements in ffds_fh_vers as there are in
ffda_versions. Each element of the array corresponds to each ffda_versions. Each element of the array corresponds to a particular
ffdv_version and ffdv_minorversion provided for the device. The combination of ffdv_version, ffdv_minorversion, and
array allows for server implementations which have different ffdv_tightly_coupled provided for the device. The array allows for
filehandles for different version and minor version combinations. server implementations which have different filehandles for different
See Section 5.3 for how to handle versioning issues between the combinations of version, minor version, and coupling strength. See
client and storage devices. Section 5.3 for how to handle versioning issues between the client
and storage devices.
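The pairing between ffds_fh_vers and ffda_versions can be sketched as follows. This is a minimal illustration, not part of either draft: it assumes the correspondence is by array index, and the DeviceVersion class, the pick_filehandle helper, and the client_supported set are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class DeviceVersion:
    """Illustrative stand-in for one ffda_versions element."""
    ffdv_version: int
    ffdv_minorversion: int
    ffdv_tightly_coupled: bool


def pick_filehandle(
    ffda_versions: Sequence[DeviceVersion],
    ffds_fh_vers: Sequence[bytes],
    client_supported: Set[Tuple[int, int]],
) -> Optional[Tuple[DeviceVersion, bytes]]:
    """Return the first (version, filehandle) pair the client can use."""
    # The text requires exactly as many filehandles as versions.
    if len(ffda_versions) != len(ffds_fh_vers):
        raise ValueError("ffds_fh_vers and ffda_versions differ in length")
    # Assumption: element i of ffds_fh_vers belongs to element i of
    # ffda_versions.
    for version, fh in zip(ffda_versions, ffds_fh_vers):
        key = (version.ffdv_version, version.ffdv_minorversion)
        if key in client_supported:
            return version, fh
    # No usable combination; Section 5.3 covers how this is reported.
    return None
```

A client that only speaks NFSv3 would call pick_filehandle(versions, fhs, {(3, 0)}) and receive the NFSv3 entry together with its filehandle, or None if the device offers no NFSv3 entry.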
For tight coupling, ffds_stateid provides the stateid to be used by For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file. For loose coupling and a NFSv4 the client to access the file. For loose coupling and a NFSv4
storage device, the client may use an anonymous stateid to perform I/ storage device, the client may use an anonymous stateid to perform I/
O on the storage device as there is no use for the metadata server O on the storage device as there is no use for the metadata server
stateid (no control protocol). In such a scenario, the server MUST stateid (no control protocol). In such a scenario, the server MUST
set the ffds_stateid to be zero. set the ffds_stateid to be the anonymous stateid.
For loosely coupled storage devices, ffds_user and ffds_group provide For loosely coupled storage devices, ffds_user and ffds_group provide
the synthetic user and group to be used in the RPC credentials that the synthetic user and group to be used in the RPC credentials that
the client presents to the storage device to access the data files. the client presents to the storage device to access the data files.
For tightly coupled storage devices, the user and group on the For tightly coupled storage devices, the user and group on the
storage device will be the same as on the metadata server. I.e., if storage device will be the same as on the metadata server. I.e., if
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
ignore both ffds_user and ffds_group. ignore both ffds_user and ffds_group.
The allowed values for both ffds_user and ffds_group are specified in The allowed values for both ffds_user and ffds_group are specified in
skipping to change at page 17, line 32 skipping to change at page 17, line 39
ffds_efficiency describes the metadata server's evaluation as to the ffds_efficiency describes the metadata server's evaluation as to the
effectiveness of each mirror. Note that this is per layout and not effectiveness of each mirror. Note that this is per layout and not
per device as the metric may change due to perceived load, per device as the metric may change due to perceived load,
availability to the metadata server, etc. Higher values denote availability to the metadata server, etc. Higher values denote
higher perceived utility. The way the client can select the best higher perceived utility. The way the client can select the best
mirror to access is discussed in Section 8.1. mirror to access is discussed in Section 8.1.
ffl_flags is a bitmap that allows the metadata server to inform the ffl_flags is a bitmap that allows the metadata server to inform the
client of particular conditions that may result from the more or less client of particular conditions that may result from the more or less
tight coupling of the storage devices. FF_FLAGS_NO_LAYOUTCOMMIT, can tight coupling of the storage devices. FF_FLAGS_NO_LAYOUTCOMMIT can
be set to indicate that the client is not required to send be set to indicate that the client is not required to send
LAYOUTCOMMIT to the metadata server. LAYOUTCOMMIT to the metadata server. FF_FLAGS_NO_IO_THRU_MDS can be
set to indicate that the client SHOULD NOT send IO operations to the
metadata server. I.e., even if a storage device is partitioned from
the client, the client SHOULD NOT try to proxy the IO through the
metadata server.
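As a rough sketch of how a client might test the ffl_flags bitmap, the following uses the two constants from the XDR above. The helper function names are hypothetical and chosen only for illustration.

```python
# Constants as given in the -07 XDR for ff_flags4.
FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001
FF_FLAGS_NO_IO_THRU_MDS = 0x00000002


def layoutcommit_required(ffl_flags: int) -> bool:
    """True unless the server said LAYOUTCOMMIT is not required."""
    return not (ffl_flags & FF_FLAGS_NO_LAYOUTCOMMIT)


def may_proxy_io_through_mds(ffl_flags: int) -> bool:
    """True unless the server asked the client not to send IO to the MDS."""
    return not (ffl_flags & FF_FLAGS_NO_IO_THRU_MDS)
```

Because the flags are independent bits, a layout can carry either, both, or neither of them.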
5.1.1. Error codes from LAYOUTGET
[RFC5661] provides little guidance as to how the client is to proceed
with a LAYOUTGET which returns one of the errors
NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.
NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the IO
is to go to the metadata server. Note that it is possible to have
had a layout before a recall and not after.
NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout
from being granted. If the client already has an appropriate
layout, it SHOULD continue with IO to the storage devices.
NFS4ERR_DELAY: there is some issue preventing the layout from being
granted. If the client already has an appropriate layout, it
SHOULD NOT continue with IO to the storage devices.
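The three cases above can be summarized as a small decision function. This is a sketch under stated assumptions: the IoPath enum and next_io_path name are hypothetical, errors are passed as their symbolic names, and the "wait and retry" outcome for cases the text leaves open (NFS4ERR_LAYOUTTRYLATER without an existing layout, and NFS4ERR_DELAY) is an assumption of this example rather than something the draft states.

```python
from enum import Enum


class IoPath(Enum):
    STORAGE_DEVICES = "storage devices"
    METADATA_SERVER = "metadata server"
    WAIT_AND_RETRY = "wait and retry LAYOUTGET"  # assumed fallback


def next_io_path(error: str, has_appropriate_layout: bool) -> IoPath:
    """Map a LAYOUTGET error to where the client sends IO next."""
    if error == "NFS4ERR_LAYOUTUNAVAILABLE":
        # No layout is available; IO goes to the metadata server.
        return IoPath.METADATA_SERVER
    if error == "NFS4ERR_LAYOUTTRYLATER":
        # An existing appropriate layout SHOULD keep being used.
        if has_appropriate_layout:
            return IoPath.STORAGE_DEVICES
        return IoPath.WAIT_AND_RETRY
    if error == "NFS4ERR_DELAY":
        # Even with a layout, IO to the storage devices SHOULD NOT continue.
        return IoPath.WAIT_AND_RETRY
    raise ValueError(f"not a LAYOUTGET error covered here: {error}")
```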
5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS
If the client does not ask for a layout for a file, then the IO will
go through the metadata server. Thus, even if the metadata server
sets the FF_FLAGS_NO_IO_THRU_MDS flag, it can recall the layout and
either not set the flag on the new layout or not provide a layout.
When a client encounters an error with a storage device, it typically
returns the layout to the metadata server and requests a new layout.
The client's IO would then proceed according to the status codes as
outlined in Section 5.1.1.
5.2. Interactions Between Devices and Layouts 5.2. Interactions Between Devices and Layouts
In [RFC5661], the File Layout Type is defined such that the In [RFC5661], the File Layout Type is defined such that the
relationship between multipathing and filehandles can result in relationship between multipathing and filehandles can result in
either 0, 1, or N filehandles (see Section 13.3). Some rationales for either 0, 1, or N filehandles (see Section 13.3). Some rationales for
this are clustered servers which share the same filehandle or this are clustered servers which share the same filehandle or
allowing for multiple read-only copies of the file on the same allowing for multiple read-only copies of the file on the same
storage device. In the Flexible File Layout Type, while there is an storage device. In the Flexible File Layout Type, while there is an
array of filehandles, they are independent of the multipathing being array of filehandles, they are independent of the multipathing being
skipping to change at page 18, line 10 skipping to change at page 18, line 49
provide multiple ff_device_addr4, each as a mirror. The client can provide multiple ff_device_addr4, each as a mirror. The client can
then determine that since the ffds_fh_vers are different, then there then determine that since the ffds_fh_vers are different, then there
are multiple copies of the file for the current layout segment are multiple copies of the file for the current layout segment
available. available.
5.3. Handling Version Errors 5.3. Handling Version Errors
When the metadata server provides the ffda_versions array in the When the metadata server provides the ffda_versions array in the
ff_device_addr4 (see Section 4.1), the client is able to determine if ff_device_addr4 (see Section 4.1), the client is able to determine if
it can not access a storage device with any of the supplied it can not access a storage device with any of the supplied
ffdv_version and ffdv_minorversion combinations. However, due to the combinations of ffdv_version, ffdv_minorversion, and
limitations of reporting errors in GETDEVICEINFO (see Section 18.40 ffdv_tightly_coupled. However, due to the limitations of reporting
in [RFC5661]), the client is not able to specify which specific device errors in GETDEVICEINFO (see Section 18.40 in [RFC5661]), the client
it can not communicate with over one of the provided ffdv_version and is not able to specify which specific device it can not communicate
ffdv_minorversion combinations. Using ff_ioerr4 (see Section 9.1.1) with over one of the provided ffdv_version and ffdv_minorversion
inside either the LAYOUTRETURN (see Section 18.44 of [RFC5661]) or combinations. Using ff_ioerr4 (see Section 9.1.1) inside either the
the LAYOUTERROR (see Section 15.6 of [NFSv42] and Section 10 of this LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see
document), the client can isolate the problematic storage device. Section 15.6 of [NFSv42] and Section 10 of this document), the client
can isolate the problematic storage device.
The error code to return for LAYOUTRETURN and/or LAYOUTERROR is The error code to return for LAYOUTRETURN and/or LAYOUTERROR is
NFS4ERR_MINOR_VERS_MISMATCH. It does not matter whether the mismatch NFS4ERR_MINOR_VERS_MISMATCH. It does not matter whether the mismatch
is a major version (e.g., client can use NFSv3 but not NFSv4) or is a major version (e.g., client can use NFSv3 but not NFSv4) or
minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the
error indicates that for all the supplied combinations for error indicates that for all the supplied combinations for
ffdv_version and ffdv_minorversion, the client can not communicate ffdv_version and ffdv_minorversion, the client can not communicate
with the storage device. The client can retry the GETDEVICEINFO to with the storage device. The client can retry the GETDEVICEINFO to
see if the metadata server can provide a different combination or it see if the metadata server can provide a different combination or it
can fall back to doing the I/O through the metadata server. can fall back to doing the I/O through the metadata server.
skipping to change at page 20, line 8 skipping to change at page 20, line 48
mirrored copies and the location of each mirror. While the client mirrored copies and the location of each mirror. While the client
may provide a hint to how many copies it wants (see Section 12), the may provide a hint to how many copies it wants (see Section 12), the
metadata server can ignore that hint and in any event, the client has metadata server can ignore that hint and in any event, the client has
no means to dictate neither the storage device (which also means the no means to dictate neither the storage device (which also means the
coupling and/or protocol levels to access the layout segments) nor coupling and/or protocol levels to access the layout segments) nor
the location of said storage device. the location of said storage device.
The updating of mirrored layout segments is done via client-side The updating of mirrored layout segments is done via client-side
mirroring. With this approach, the client is responsible for making mirroring. With this approach, the client is responsible for making
sure modifications get to all copies of the layout segments it is sure modifications get to all copies of the layout segments it is
informed of via the layout. If a layout segments is being resilvered informed of via the layout. If a layout segment is being resilvered
to a storage device, that mirrored copy will not be in the layout. to a storage device, that mirrored copy will not be in the layout.
Thus the metadata server MUST update that copy until the client is Thus the metadata server MUST update that copy until the client is
presented it in a layout. Also, if the client is writing to the presented it in a layout. Also, if the client is writing to the
layout segments via the metadata server, e.g., using an earlier layout segments via the metadata server, e.g., using an earlier
version of the protocol, then the metadata server MUST update all version of the protocol, then the metadata server MUST update all
copies of the mirror. As seen in Section 8.3, during the copies of the mirror. As seen in Section 8.3, during the
resilvering, the layout is recalled, and the client has to make resilvering, the layout is recalled, and the client has to make
modifications via the metadata server. modifications via the metadata server.
8.1. Selecting a Mirror 8.1. Selecting a Mirror
When the metadata server grants a layout to a client, it can let the When the metadata server grants a layout to a client, it MAY let the
client know how fast it expects each mirror to be once the request client know how fast it expects each mirror to be once the request
arrives at the storage devices via the ffds_efficiency member. While arrives at the storage devices via the ffds_efficiency member. While
the algorithms to calculate that value are left to the metadata the algorithms to calculate that value are left to the metadata
server implementations, factors that could contribute to that server implementations, factors that could contribute to that
calculation include speed of the storage device, physical memory calculation include speed of the storage device, physical memory
available to the device, operating system version, current load, etc. available to the device, operating system version, current load, etc.
However, what should not be involved in that calculation is a However, what should not be involved in that calculation is a
perceived network distance between the client and the storage device. perceived network distance between the client and the storage device.
The client is better situated for making that determination based on The client is better situated for making that determination based on
skipping to change at page 20, line 43 skipping to change at page 21, line 34
not know about a transient outage between the client and storage not know about a transient outage between the client and storage
device because it has no presence on the given subnet. device because it has no presence on the given subnet.
As such, it is the client which decides which mirror to access for As such, it is the client which decides which mirror to access for
reading the file. The requirements for writing to a mirrored layout reading the file. The requirements for writing to a mirrored layout
segments are presented below. segments are presented below.
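One possible way for a client to combine the server's ffds_efficiency hint with its own view of the network is sketched below. The select_read_mirror function, the (name, efficiency) tuples, and the is_reachable callback are all hypothetical; the draft leaves the client's actual selection policy open.

```python
from typing import Callable, Optional, Sequence, Tuple


def select_read_mirror(
    mirrors: Sequence[Tuple[str, int]],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Pick the reachable mirror with the highest ffds_efficiency.

    mirrors holds (mirror name, ffds_efficiency) pairs. Reachability is
    judged by the client, which is better placed than the metadata
    server to know about network conditions.
    """
    candidates = [(eff, name) for name, eff in mirrors if is_reachable(name)]
    if not candidates:
        return None
    # Higher values denote higher perceived utility; ties keep list order.
    return max(candidates, key=lambda c: c[0])[1]
```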
8.2. Writing to Mirrors 8.2. Writing to Mirrors
The client is responsible for updating all mirrored copies of the The client is responsible for updating all mirrored copies of the
layout segments that it is given in the layout. If all but one copy layout segments that it is given in the layout. A single failed
is updated successfully and the last one provides an error, then the update is sufficient to fail the entire operation. I.e., if all but
client needs to return the layout to the metadata server with an one copy is updated successfully and the last one provides an error,
error indicating that the update failed to that storage device. then the client needs to return the layout to the metadata server
with an error indicating that the update failed to that storage
device. If the client is updating the mirrors serially, then it
SHOULD stop at the first error encountered and report that to the
metadata server. If the client is updating the mirrors in parallel,
then it SHOULD wait until all storage devices respond such that it
can report all errors encountered during the update.
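The difference between the serial and parallel cases in the -07 text can be sketched as follows. The write_fn callback, the mirror names, and both function names are hypothetical; write_fn is assumed to return None on success or an error string on failure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple

WriteFn = Callable[[str], Optional[str]]  # None means success


def update_serially(mirrors: Sequence[str],
                    write_fn: WriteFn) -> List[Tuple[str, str]]:
    """Stop at the first failing mirror and report only that error."""
    for mirror in mirrors:
        err = write_fn(mirror)
        if err is not None:
            return [(mirror, err)]
    return []


def update_in_parallel(mirrors: Sequence[str],
                       write_fn: WriteFn) -> List[Tuple[str, str]]:
    """Wait for every mirror to respond and report all errors."""
    if not mirrors:
        return []
    with ThreadPoolExecutor(max_workers=len(mirrors)) as pool:
        # map() preserves input order and blocks until all calls finish.
        results = list(pool.map(write_fn, mirrors))
    return [(m, e) for m, e in zip(mirrors, results) if e is not None]
```

In both cases a non-empty result means the whole update failed, and the listed errors are what the client would report when returning the layout to the metadata server.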
The metadata server is then responsible for determining if it wants The metadata server is then responsible for determining if it wants
to remove the errant mirror from the layout, if the mirror has to remove the errant mirror from the layout, if the mirror has
recovered from some transient error, etc. When the client tries to recovered from some transient error, etc. When the client tries to
get a new layout, the metadata server informs it of the decision by get a new layout, the metadata server informs it of the decision by
the contents of the layout. The client MUST NOT make any assumptions the contents of the layout. The client MUST NOT make any assumptions
that the contents of the previous layout will match those of the new that the contents of the previous layout will match those of the new
one. If it has updates that were not committed, it MUST resend those one. If it has updates that were not committed, it MUST resend those
updates to all mirrors. updates to all mirrors.
skipping to change at page 23, line 49 skipping to change at page 24, line 49
"completed" counters track what was done. Note that there is no "completed" counters track what was done. Note that there is no
requirement that the client only report completed results that have requirement that the client only report completed results that have
matching requested results from the reported period. matching requested results from the reported period.
ffil_bytes_not_delivered is used to track the aggregate number of ffil_bytes_not_delivered is used to track the aggregate number of
bytes requested but not fulfilled due to error conditions. bytes requested but not fulfilled due to error conditions.
ffil_total_busy_time is the aggregate time spent with outstanding RPC ffil_total_busy_time is the aggregate time spent with outstanding RPC
calls, ffil_aggregate_completion_time is the sum of all latencies for calls, ffil_aggregate_completion_time is the sum of all latencies for
completed RPC calls. completed RPC calls.
Note that LAYOUTSTATS are cummulative, i.e., not reset each time the Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
operation is sent. If two RPC calls originating from the same NFS operation is sent. If two LAYOUTSTATS ops for the same file, layout
client are processed at the same time by the metdata server, then the stateid, and originating from the same NFS client are processed at
call containing the larger values contains the most recent time the same time by the metadata server, then the one containing the
series data. larger values contains the most recent time series data.
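Because the counters are cumulative, a metadata server that processes two reports at the same time can order them by value. The sketch below assumes a simplified counter set; the IoCounters class, its field names, and most_recent are hypothetical and do not reproduce the XDR of ff_io_latency4.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class IoCounters:
    """Simplified, hypothetical cumulative counters for one report."""
    ops_requested: int
    bytes_requested: int
    ops_completed: int
    bytes_completed: int


def most_recent(a: IoCounters, b: IoCounters) -> IoCounters:
    """Return the report with the larger (hence later) counters."""
    pairs = list(zip(astuple(a), astuple(b)))
    if all(x >= y for x, y in pairs):
        return a
    if all(y >= x for x, y in pairs):
        return b
    # Cumulative counters never decrease, so one report must dominate.
    raise ValueError("reports are not from the same cumulative series")
```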
9.2.2. ff_layoutupdate4 9.2.2. ff_layoutupdate4
<CODE BEGINS> <CODE BEGINS>
/// struct ff_layoutupdate4 { /// struct ff_layoutupdate4 {
/// netaddr4 ffl_addr; /// netaddr4 ffl_addr;
/// nfs_fh4 ffl_fhandle; /// nfs_fh4 ffl_fhandle;
/// ff_io_latency4 ffl_read; /// ff_io_latency4 ffl_read;
/// ff_io_latency4 ffl_write; /// ff_io_latency4 ffl_write;
skipping to change at page 26, line 19 skipping to change at page 27, line 19
array is set to zero. The client MAY also use fflr_iostats_report<> array is set to zero. The client MAY also use fflr_iostats_report<>
to report a list of I/O statistics as an array of elements of type to report a list of I/O statistics as an array of elements of type
ff_iostats4. Each element in the array represents statistics for a ff_iostats4. Each element in the array represents statistics for a
particular byte range. Byte ranges are not guaranteed to be disjoint particular byte range. Byte ranges are not guaranteed to be disjoint
and MAY repeat or intersect. and MAY repeat or intersect.
10. Flexible Files Layout Type LAYOUTERROR 10. Flexible Files Layout Type LAYOUTERROR
If the client is using NFSv4.2 to communicate with the metadata If the client is using NFSv4.2 to communicate with the metadata
server, then instead of waiting for a LAYOUTRETURN to send error server, then instead of waiting for a LAYOUTRETURN to send error
information to the metadata server (see Section 9.1), it can use information to the metadata server (see Section 9.1), it MAY use
LAYOUTERROR (see Section 15.6 of [NFSv42]) to communicate that LAYOUTERROR (see Section 15.6 of [NFSv42]) to communicate that
information. For the Flexible Files Layout Type, this means that information. For the Flexible Files Layout Type, this means that
LAYOUTERROR4args is treated the same as ff_ioerr4. LAYOUTERROR4args is treated the same as ff_ioerr4.
11. Flexible Files Layout Type LAYOUTSTATS 11. Flexible Files Layout Type LAYOUTSTATS
If the client is using NFSv4.2 to communicate with the metadata If the client is using NFSv4.2 to communicate with the metadata
server, then instead of waiting for a LAYOUTRETURN to send I/O server, then instead of waiting for a LAYOUTRETURN to send I/O
statistics to the metadata server (see Section 9.2), it can use statistics to the metadata server (see Section 9.2), it MAY use
LAYOUTSTATS (see Section 15.7 of [NFSv42]) to communicate that LAYOUTSTATS (see Section 15.7 of [NFSv42]) to communicate that
information. For the Flexible Files Layout Type, this means that information. For the Flexible Files Layout Type, this means that
LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
contents as in ffis_layoutupdate. contents as in ffis_layoutupdate.
12. Flexible File Layout Type Creation Hint 12. Flexible File Layout Type Creation Hint
The layouthint4 type is defined in the [RFC5661] as follows: The layouthint4 type is defined in the [RFC5661] as follows:
<CODE BEGINS> <CODE BEGINS>
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_body<>; opaque loh_body<>;
}; };
<CODE ENDS> <CODE ENDS>
The layouthint4 structure is used by the client to pass a hint about The layouthint4 structure is used by the client to pass a hint about
the type of layout it would like created for a particular file. If the type of layout it would like created for a particular file. If
the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
opaque value is defined by the ff_layouthint4 type. opaque value is defined by the ff_layouthint4 type.
12.1. ff_layouthint4 12.1. ff_layouthint4
skipping to change at page 27, line 18 skipping to change at page 28, line 18
/// union ff_mirrors_hint switch (bool ffmc_valid) { /// union ff_mirrors_hint switch (bool ffmc_valid) {
/// case TRUE: /// case TRUE:
/// uint32_t ffmc_mirrors; /// uint32_t ffmc_mirrors;
/// case FALSE: /// case FALSE:
/// void; /// void;
/// }; /// };
/// ///
/// struct ff_layouthint4 { /// struct ff_layouthint4 {
/// ff_mirrors_hint fflh_mirrors_hint; /// ff_mirrors_hint fflh_mirrors_hint;
/// }; /// };
/// ///
<CODE ENDS> <CODE ENDS>
This type conveys hints for the desired data map. All parameters are This type conveys hints for the desired data map. All parameters are
optional so the client can give values for only the parameter it optional so the client can give values for only the parameter it
cares about. cares about.
13. Recalling Layouts 13. Recalling a Layout
The Flexible File Layout Type metadata server should recall While Section 12.5.5 of [RFC5661] discusses layout type independent
outstanding layouts in the following cases: reasons for recalling a layout, the Flexible File Layout Type
metadata server should recall outstanding layouts in the following
cases:
o When the file's security policy changes, i.e., Access Control o When the file's security policy changes, i.e., Access Control
Lists (ACLs) or permission mode bits are set. Lists (ACLs) or permission mode bits are set.
o When the file's layout changes, rendering outstanding layouts o When the file's layout changes, rendering outstanding layouts
invalid. invalid.
o When there are sharing conflicts. o When there are sharing conflicts.
o When a file is being resilvered, either due to being repaired
after a write error or to load balance.
13.1. CB_RECALL_ANY 13.1. CB_RECALL_ANY
The metadata server can use the CB_RECALL_ANY callback operation to The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts. The notify the client to return some or all of its layouts. The
[RFC5661] defines the following types: [RFC5661] defines the following types:
<CODE BEGINS> <CODE BEGINS>
const RCA4_TYPE_MASK_FF_LAYOUT_MIN = -2; const RCA4_TYPE_MASK_FF_LAYOUT_MIN = -2;
const RCA4_TYPE_MASK_FF_LAYOUT_MAX = -1; const RCA4_TYPE_MASK_FF_LAYOUT_MAX = -1;
[[RFC Editor: please insert assigned constants]] [[RFC Editor: please insert assigned constants]]
skipping to change at page 28, line 50 skipping to change at page 29, line 50
The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return
layouts of iomode LAYOUTIOMODE4_READ. Similarly, the layouts of iomode LAYOUTIOMODE4_READ. Similarly, the
PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client
is notified to return layouts of either iomode. is notified to return layouts of either iomode.
14. Client Fencing 14. Client Fencing
In cases where clients are uncommunicative and their lease has In cases where clients are uncommunicative and their lease has
expired or when clients fail to return recalled layouts within a expired or when clients fail to return recalled layouts within a
lease period, at the least the server MAY revoke client layouts and/ lease period, at the least the server MAY revoke client layouts and
or device address mappings and reassign these resources to other reassign these resources to other clients (see Section 12.5.5 in
clients (see "Recalling a Layout" in [RFC5661]). To avoid data
corruption, the metadata server MUST fence off the revoked clients [RFC5661]). To avoid data corruption, the metadata server MUST fence
from the respective data files as described in Section 2.2. off the revoked clients from the respective data files as described
in Section 2.2.
15. Security Considerations 15. Security Considerations
The pNFS extension partitions the NFSv4.1+ file system protocol into The pNFS extension partitions the NFSv4.1+ file system protocol into
two parts, the control path and the data path (storage protocol). two parts, the control path and the data path (storage protocol).
The control path contains all the new operations described by this The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path. The combination of components in a pNFS system
is required to preserve the security properties of NFSv4.1+ with is required to preserve the security properties of NFSv4.1+ with
respect to an entity accessing data via a client, including security respect to an entity accessing data via a client, including security
skipping to change at page 31, line 31 skipping to change at page 32, line 31
Haynes, T. and D. Noveck, "NFS Version 4 Protocol", draft- Haynes, T. and D. Noveck, "NFS Version 4 Protocol", draft-
ietf-nfsv4-rfc3530bis-35 (work in progress), Dec 2014. ietf-nfsv4-rfc3530bis-35 (work in progress), Dec 2014.
[pNFSLayouts] [pNFSLayouts]
Haynes, T., "Considerations for a New pNFS Layout Type", Haynes, T., "Considerations for a New pNFS Layout Type",
draft-ietf-nfsv4-layout-types-02 (Work In Progress), draft-ietf-nfsv4-layout-types-02 (Work In Progress),
October 2014. October 2014.
17.2. Informative References 17.2. Informative References
[RFC4519] Sciberras, A., Ed., "Lightweight Directory Access Protocol
(LDAP): Schema for User Applications", RFC 4519, DOI
10.17487/RFC4519, June 2006,
<http://www.rfc-editor.org/info/rfc4519>.
[rpcsec_gssv3] [rpcsec_gssv3]
Adamson, W. and N. Williams, "Remote Procedure Call (RPC) Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", November 2014. Security Version 3", November 2014.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Those who provided miscellaneous comments to early drafts of this Those who provided miscellaneous comments to early drafts of this
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
and Lev Solomonov. and Lev Solomonov.
skipping to change at page 32, line 9 skipping to change at page 33, line 12
Idan Kedar caught a nasty bug in the interaction of client side Idan Kedar caught a nasty bug in the interaction of client side
mirroring and the minor versioning of devices. mirroring and the minor versioning of devices.
Dave Noveck provided a comprehensive review of the document during Dave Noveck provided a comprehensive review of the document during
the working group last call. the working group last call.
Olga Kornievskaia led the charge against the use of a credential Olga Kornievskaia led the charge against the use of a credential
versus a principal in the fencing approach. Andy Adamson and versus a principal in the fencing approach. Andy Adamson and
Benjamin Kaduk helped to sharpen the focus. Benjamin Kaduk helped to sharpen the focus.
Tigran Mkrtchyan provided the use case for not allowing the client to
proxy the IO through the data server.
Appendix B. RFC Editor Notes Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this [RFC Editor: please remove this section prior to publishing this
document as an RFC] document as an RFC]
[RFC Editor: prior to publishing this document as an RFC, please [RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document] RFC number of this document]
Authors' Addresses Authors' Addresses
 End of changes. 49 change blocks. 
138 lines changed or deleted 207 lines changed or added
