NFSv4                                                          B. Halevy
Internet-Draft
Intended status: Standards Track                               T. Haynes
Expires: August 13, 2015                                    Primary Data
                                                       February 09, 2015

              Parallel NFS (pNFS) Flexible File Layout
                  draft-ietf-nfsv4-flex-files-05.txt
Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The Flexible File Layout Type is defined in
   this document as an extension to pNFS to allow the use of storage
   devices in a fashion such that they require only a quite limited
   degree of interaction with the metadata server, using already
   existing protocols.  Client side mirroring is also added to provide
   replication of files.
Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 13, 2015.
Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Difference Between a Data Server and a Storage Device . .   5
     1.3.  Requirements Language . . . . . . . . . . . . . . . . . .   6
   2.  Coupling of Storage Devices . . . . . . . . . . . . . . . . .   6
     2.1.  LAYOUTCOMMIT  . . . . . . . . . . . . . . . . . . . . . .   6
     2.2.  Security Models . . . . . . . . . . . . . . . . . . . . .   6
       2.2.1.  Implementation Notes for Synthetic uids/gids  . . . .   7
       2.2.2.  Example of using Synthetic uids/gids  . . . . . . . .   7
     2.3.  State and Locking Models  . . . . . . . . . . . . . . . .   8
   3.  XDR Description of the Flexible File Layout Type  . . . . . .   9
     3.1.  Code Components Licensing Notice  . . . . . . . . . . . .   9
   4.  Device Addressing and Discovery . . . . . . . . . . . . . . .  11
     4.1.  ff_device_addr4 . . . . . . . . . . . . . . . . . . . . .  11
     4.2.  Storage Device Multipathing . . . . . . . . . . . . . . .  12
   5.  Flexible File Layout Type . . . . . . . . . . . . . . . . . .  13
     5.1.  ff_layout4  . . . . . . . . . . . . . . . . . . . . . . .  14
     5.2.  Interactions Between Devices and Layouts  . . . . . . . .  17
     5.3.  Handling Version Errors . . . . . . . . . . . . . . . . .  17
   6.  Striping via Sparse Mapping . . . . . . . . . . . . . . . . .  18
   7.  Recovering from Client I/O Errors . . . . . . . . . . . . . .  18
   8.  Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . .  19
     8.1.  Selecting a Mirror  . . . . . . . . . . . . . . . . . . .  20
     8.2.  Writing to Mirrors  . . . . . . . . . . . . . . . . . . .  20
     8.3.  Metadata Server Resilvering of the File . . . . . . . . .  21
   9.  Flexible Files Layout Type Return . . . . . . . . . . . . . .  21
     9.1.  I/O Error Reporting . . . . . . . . . . . . . . . . . . .  22
       9.1.1.  ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . .  22
     9.2.  Layout Usage Statistics . . . . . . . . . . . . . . . . .  23
       9.2.1.  ff_io_latency4  . . . . . . . . . . . . . . . . . . .  23
       9.2.2.  ff_layoutupdate4  . . . . . . . . . . . . . . . . . .  23
       9.2.3.  ff_iostats4 . . . . . . . . . . . . . . . . . . . . .  24
     9.3.  ff_layoutreturn4  . . . . . . . . . . . . . . . . . . . .  25
   10. Flexible Files Layout Type LAYOUTERROR  . . . . . . . . . . .  25
   11. Flexible Files Layout Type LAYOUTSTATS  . . . . . . . . . . .  25
   12. Flexible File Layout Type Creation Hint . . . . . . . . . . .  26
     12.1. ff_layouthint4  . . . . . . . . . . . . . . . . . . . . .  26
   13. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . .  27
     13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . .  27
   14. Client Fencing  . . . . . . . . . . . . . . . . . . . . . . .  28
   15. Security Considerations . . . . . . . . . . . . . . . . . . .  28
     15.1. Kerberized File Access  . . . . . . . . . . . . . . . . .  29
       15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . .  29
       15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . .  29
   16. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  30
   17. References  . . . . . . . . . . . . . . . . . . . . . . . . .  30
     17.1. Normative References  . . . . . . . . . . . . . . . . . .  30
     17.2. Informative References  . . . . . . . . . . . . . . . . .  31
   Appendix A.  Acknowledgments  . . . . . . . . . . . . . . . . . .  31
   Appendix B.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  31
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  31
1.  Introduction

   In the parallel Network File System (pNFS), the metadata server
   returns Layout Type structures that describe where file data is
   located.  There are different Layout Types for different storage
   systems and methods of arranging data on storage devices.  This
   document defines the Flexible File Layout Type used with file-based
   data servers that are accessed using the Network File System (NFS)
   protocols: NFSv3 [RFC1813], NFSv4.0 [RFCNFSv4], NFSv4.1 [RFC5661],
   and NFSv4.2 [NFSv42].

   To provide a global state model equivalent to that of the Files
   Layout Type, a back-end control protocol MAY be implemented between
   the metadata server and NFSv4.1+ storage devices.  It is out of
   scope for this document to specify the wire protocol of such a
   protocol, yet the requirements for the protocol are specified in
   [RFC5661] and clarified in [pNFSLayouts].
1.1.  Definitions

   control protocol:  is a set of requirements for the communication of
      information on layouts, stateids, file metadata, and file data
      between the metadata server and the storage devices (see
      [pNFSLayouts]).

   client-side mirroring:  is when the client and not the server is
      responsible for updating all of the mirrored copies of a layout
      segment.

   data file:  is that part of the file system object which describes
      the payload and not the object.  E.g., it is the file contents.

   data server (DS):  is one of the pNFS servers which provides the
      contents of a file system object which is a regular file.
      Depending on the layout, there might be one or more data servers
      over which the data is striped.  Note that while the metadata
      server is strictly accessed over the NFSv4.1+ protocol, depending
      on the Layout Type, the data server could be accessed via any
      protocol that meets the pNFS requirements.

   fencing:  is when the metadata server prevents the storage devices
      from processing I/O from a specific client to a specific file.

   File Layout Type:  is a Layout Type in which the storage devices are
      accessed via the NFS protocol.

   layout:  informs a client of which storage devices it needs to
      communicate with (and over which protocol) to perform I/O on a
      file.  The layout might also provide some hints about how the
      storage is physically organized.

   layout iomode:  describes whether the layout granted to the client
      is for read or read/write I/O.

   layout segment:  describes a sub-division of a layout.  That sub-
      division might be by the iomode (see Sections 3.3.20 and 12.2.9
      of [RFC5661]), a striping pattern (see Section 13.3 of
      [RFC5661]), or requested byte range.

   layout stateid:  is a 128-bit quantity returned by a server that
      uniquely defines the layout state provided by the server for a
      specific layout that describes a Layout Type and file (see
      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 describes
      the difference between a layout stateid and a normal stateid.

   layout type:  describes both the storage protocol used to access the
      data and the aggregation scheme used to lay out the file data on
      the underlying storage devices.

   loose coupling:  is when the metadata server and the storage devices
      do not have a control protocol present.

   metadata file:  is that part of the file system object which
      describes the object and not the payload.  E.g., it could be the
      time since last modification, access, etc.

   metadata server (MDS):  is the pNFS server which provides metadata
      information for a file system object.  It also is responsible for
      generating layouts for file system objects.  Note that the MDS is
      responsible for directory-based operations.

   mirror:  is a copy of a layout segment.  While mirroring can be used
      for backing up a layout segment, the copies can be distributed
      such that each remote site has a locally available copy.  Note
      that if one copy of the mirror is updated, then all copies must
      be updated.

   Object Layout Type:  is a Layout Type in which the storage devices
      are accessed via the OSD protocol [ANSI400-2004].  It is defined
      in [RFC5664].

   recalling a layout:  is when the metadata server uses a back channel
      to inform the client that the layout is to be returned in a
      graceful manner.  Note that the client could be able to flush any
      writes, etc., before replying to the metadata server.

   revoking a layout:  is when the metadata server invalidates the
      layout such that neither the metadata server nor any storage
      device will accept any access from the client with that layout.

   resilvering:  is the act of rebuilding a mirrored copy of a layout
      segment from a known good copy of the layout segment.  Note that
      this can also be done to create a new mirrored copy of the layout
      segment.

   rsize:  is the data transfer buffer size used for reads.

   stateid:  is a 128-bit quantity returned by a server that uniquely
      defines the open and locking states provided by the server for a
      specific open-owner or lock-owner/open-owner pair for a specific
      file and type of lock.

   storage device:  is another term used almost interchangeably with
      data server.  See Section 1.2 for the nuances between the two.

   tight coupling:  is when the metadata server and the storage devices
      do have a control protocol present.

   wsize:  is the data transfer buffer size used for writes.
1.2.  Difference Between a Data Server and a Storage Device

   We defined a data server as a pNFS server, which implies that it can
   utilize the NFSv4.1+ protocol to communicate with the client.  As
   such, only the File Layout Type would currently meet this
   requirement.  The more generic concept is a storage device, which
   can use any protocol to communicate with the client.  The
   requirements for a storage device to act together with the metadata
   server to provide data to a client are that there is a Layout Type
   specification for the given protocol and that the metadata server
   has granted a layout to the client.  Note that nothing precludes
   there being multiple supported Layout Types (i.e., protocols)
   between a metadata server, storage devices, and client.
   ...
2.  Coupling of Storage Devices

   The coupling of the metadata server with the storage devices can be
   either tight or loose.  In a tight coupling, there is a control
   protocol present to manage security, LAYOUTCOMMITs, etc.  With a
   loose coupling, the only control protocol might be a version of NFS.
   As such, semantics for managing security, state, and locking models
   MUST be defined.

   A file is split into metadata and data.  The "metadata file" is that
   part of the file stored on the metadata server.  The "data file" is
   that part of the file stored on the storage device.  And the "file"
   is the combination of the two.
2.1.  LAYOUTCOMMIT

   With a tightly coupled system, when the metadata server receives a
   LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the
   File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]).
   With a loosely coupled system, a LAYOUTCOMMIT to the metadata server
   MUST be preceded by a COMMIT to the storage device.  It is the
   responsibility of the client to make sure the data file is stable
   before the metadata server begins to query the storage devices about
   the changes to the file.  Note that if the client has not done a
   COMMIT to the storage device, then the LAYOUTCOMMIT might not be
   synchronized to the last WRITE operation to the storage device.
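   The required ordering can be sketched in pseudo-client form.  This
   is a hypothetical illustration only; the class and method names are
   invented and do not correspond to any real NFS client API:

   ```python
   # Sketch of the loosely coupled rule: a COMMIT to the storage
   # device must precede the LAYOUTCOMMIT to the metadata server.

   class LooseClient:
       def __init__(self, mds, storage_device):
           self.mds = mds
           self.sd = storage_device
           self.uncommitted = False   # dirty data written unstably

       def write(self, offset, data):
           self.sd.write(offset, data)   # WRITE to the storage device
           self.uncommitted = True

       def layoutcommit(self, byte_range):
           # The data file must be stable before the metadata server
           # queries the storage device about changes to the file.
           if self.uncommitted:
               self.sd.commit()          # COMMIT to the storage device
               self.uncommitted = False
           self.mds.layoutcommit(byte_range)   # then LAYOUTCOMMIT
   ```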
2.2.  Security Models

   With loosely coupled storage devices, the metadata server uses
   synthetic uids and gids for the data file, where the uid owner of
   the data file is allowed read/write access and the gid owner is
   allowed read only access.  As part of the layout (see ffds_user and
   ffds_group in Section 5.1), the client is provided with the user and
   group to be used in the Remote Procedure Call (RPC) [RFC5531]
   credentials needed to access the data file.  Fencing off of clients
   is achieved by the metadata server changing the synthetic uid and/or
   gid owners of the data file on the storage device to implicitly
   revoke the outstanding RPC credentials.

   With this loosely coupled model, the metadata server is not able to
   fence off a single client; it is forced to fence off all clients.
   However, as the other clients react to the fencing, returning their
   layouts and trying to get new ones, the metadata server can hand out
   a new uid and gid to allow access.

   Note: it is recommended to implement common access control methods
   at the storage device filesystem to allow only the metadata server
   root (super user) access to the storage device, and to set the owner
   of all directories holding data files to the root user.  This
   approach provides a practical model to enforce access control and
   fence off cooperative clients, but it can not protect against
   malicious clients; hence it provides a level of security equivalent
   to AUTH_SYS.

   With tightly coupled storage devices, the metadata server sets the
   user and group owners, mode bits, and ACL of the data file to be the
   same as the metadata file.  And the client must authenticate with
   the storage device and go through the same authorization process it
   would go through via the metadata server.
2.2.1.  Implementation Notes for Synthetic uids/gids

   The selection method for the synthetic uids and gids to be used for
   fencing in loosely coupled storage devices is strictly an
   implementation issue.  An implementation might allow an
   administrator to restrict a range of such ids in the name servers.
   She might also be able to choose an id that would never be used to
   grant access.  Then when the metadata server had a request to access
   a file, a SETATTR would be sent to the storage device to set the
   owner and group of the data file.  The user and group might be
   selected in a round robin fashion from the range of available ids.

   Those ids would be sent back as ffds_user and ffds_group to the
   client.  And it would present them as the RPC credentials to the
   storage device.  When the client was done accessing the file and the
   metadata server knew that no other client was accessing the file, it
   could reset the owner and group to restrict access to the data file.

   When the metadata server wanted to fence off a client, it would
   change the synthetic uid and/or gid to the restricted ids.  Note
   that using a restricted id ensures that there is a change of owner
   and at least one id available that never gets allowed access.
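   One possible shape of such an allocator is sketched below.  The id
   ranges, the restricted id value, and all names are invented for the
   example; real implementations choose their own policy:

   ```python
   # Sketch of a metadata server's synthetic id policy: grant ids
   # round robin from an administrator-configured range, and reserve
   # a restricted id (never granted) for fencing.
   from itertools import cycle

   RESTRICTED_UID = 9999   # never used to grant access
   RESTRICTED_GID = 9999

   class SyntheticIdAllocator:
       def __init__(self, uid_range, gid_range):
           self._uids = cycle(uid_range)   # round robin over the range
           self._gids = cycle(gid_range)

       def grant(self):
           # Ids to set on the data file via SETATTR and to return to
           # the client as ffds_user/ffds_group.
           return next(self._uids), next(self._gids)

       def fence(self):
           # Ids that revoke all outstanding RPC credentials.
           return RESTRICTED_UID, RESTRICTED_GID
   ```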
2.2.2.  Example of using Synthetic uids/gids

   The user loghyr creates a file "ompha.c" on the metadata server,
   which then creates a corresponding data file on the storage device.

   The metadata server entry may look like:

   -rw-r--r--  1 loghyr  staff  1697 Dec  4 11:31 ompha.c

   On the storage device, it may be assigned some random synthetic
   uid/gid to deny access:

   -rw-r-----  1 19452  28418  1697 Dec  4 11:31 data_ompha.c

   When the file is opened on a client, since the layout knows nothing
   about the user (and does not care), whether loghyr or garbo opens
   the file does not matter.  The owner and group are modified and
   those values are returned.

   -rw-r-----  1 1066  1067  1697 Dec  4 11:31 data_ompha.c

   The set of synthetic gids on the storage device should be selected
   such that there is no mapping in any of the name services used by
   the storage device.  I.e., each group should have no members.

   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the
   metadata server should return a synthetic uid that is not set on the
   storage device.  Only the synthetic gid would be valid.

   The client is thus solely responsible for enforcing file permissions
   in a loosely coupled model.  To allow loghyr write access, it will
   send an RPC to the storage device with a credential of 1066:1067.
   To allow garbo read access, it will send an RPC to the storage
   device with a credential of 1067:1067.  The value of the uid does
   not matter as long as it is not the synthetic uid granted when
   getting the layout.

   While pushing the enforcement of permission checking onto the client
   may seem to weaken security, the client may already be responsible
   for enforcing permissions before modifications are sent to a server.
   With cached writes, the client is always responsible for tracking
   who is modifying a file and making sure to not coalesce requests
   from multiple users into one request.
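   The storage device's side of this example can be modeled as a
   simple AUTH_SYS-style check against a rw-r----- data file; the
   function name is invented for illustration:

   ```python
   # Model of the storage device access check for a data file with
   # mode rw-r-----: the uid owner gets read/write, the gid owner
   # gets read only, and everyone else is denied (fenced off).

   def sd_allows(op, cred_uid, cred_gid, owner_uid, owner_gid):
       """op is 'READ' or 'WRITE'."""
       if cred_uid == owner_uid:        # uid owner: read/write access
           return True
       if cred_gid == owner_gid:        # gid owner: read only access
           return op == "READ"
       return False                     # no match: request rejected
   ```

   With the data_ompha.c example above, a credential of 1066:1067 gets
   write access, a credential of 1067:1067 gets read-only access, and
   once the metadata server chowns the file to new synthetic ids both
   credentials stop working.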
2.3.  State and Locking Models

   Metadata file OPEN, LOCK, and DELEGATION operations are always
   executed only against the metadata server.

   The metadata server responds to state changing operations by
   executing them against the respective data files on the storage
   devices.  It then sends the storage device open stateid as part of
   the layout (see the ffm_stateid in Section 5.1) and it is then used
   by the client for executing READ/WRITE operations against the
   storage device.

   Standalone NFSv4.1+ storage devices that do not return the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way
   as NFSv4 storage devices.

   NFSv4.1+ clustered storage devices that do identify themselves with
   the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end
   control protocol as described in [RFC5661] to implement a global
   stateid model as defined there.
3.  XDR Description of the Flexible File Layout Type

This document contains the external data representation (XDR)
[RFC4506] description of the Flexible File Layout Type.  The XDR
description is embedded in this document in a way that makes it
simple for the reader to extract into a ready-to-compile form.  The
skipping to change at page 11, line 21
///  * %#include <nfsv42.x>
///  * %#include <rpc_prot.x>
///  */
///

<CODE ENDS>

4.  Device Addressing and Discovery
Data operations to a storage device require the client to know the
network address of the storage device.  The NFSv4.1+ GETDEVICEINFO
operation (Section 18.40 of [RFC5661]) is used by the client to
retrieve that information.

4.1.  ff_device_addr4

The ff_device_addr4 data structure is returned by the server as the
storage protocol specific opaque field da_addr_body in the
device_addr4 structure by a successful GETDEVICEINFO operation.

<CODE BEGINS>
skipping to change at page 14, line 4
major ID of the server owner.  It is not always necessary for the two
storage device addresses to designate the same storage device with
trunking being used.  For example, the data could be read-only, and
consist of exact replicas.

5.  Flexible File Layout Type

The layout4 type is defined in [RFC5662] as follows:

<CODE BEGINS>
enum layouttype4 {
    LAYOUT4_NFSV4_1_FILES   = 1,
    LAYOUT4_OSD2_OBJECTS    = 2,
    LAYOUT4_BLOCK_VOLUME    = 3,
    LAYOUT4_FLEX_FILES      = 4
    [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
    to be the layouttype assigned by IANA]]
};

struct layout_content4 {
    layouttype4             loc_type;
    opaque                  loc_body<>;
};

struct layout4 {
skipping to change at page 14, line 37
This document defines the structure associated with the layouttype4
value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body structure
as an XDR type "opaque".  The opaque layout is uninterpreted by the
generic pNFS client layers, but obviously must be interpreted by the
Flexible File Layout Type implementation.  This section defines the
structure of this opaque value, ff_layout4.

5.1.  ff_layout4

<CODE BEGINS>
/// struct ff_data_server4 {
///     deviceid4           ffds_deviceid;
///     uint32_t            ffds_efficiency;
///     stateid4            ffds_stateid;
///     nfs_fh4             ffds_fh_vers<>;
///     fattr4_owner        ffds_user;
///     fattr4_owner_group  ffds_group;
/// };
///
/// struct ff_mirror4 {
///     ff_data_server4     ffm_data_servers<>;
/// };
///
/// struct ff_layout4 {
///     length4             ffl_stripe_unit;
///     ff_mirror4          ffl_mirrors<>;
/// };
///

<CODE ENDS>
The ff_layout4 structure specifies a layout over a set of mirrored
copies of that portion of the data file described in the current
layout segment.  This mirroring protects against loss of data in
layout segments.  Note that while not explicitly shown in the above
XDR, each layout4 element returned in the logr_layout array of
LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout
segment.  Hence each ff_layout4 also describes a layout segment.
It is possible that the file is concatenated from more than one
layout segment.  Each layout segment MAY represent different striping
parameters, applying respectively only to the layout segment byte
range.

The ffl_stripe_unit field is the stripe unit size in use for the
current layout segment.  The number of stripes is given inside each
mirror by the number of elements in ffm_data_servers.  If the number
of stripes is one, then the value for ffl_stripe_unit MUST default to
skipping to change at page 16, line 29
+-----------+
|+-----------+
||+-----------+
+|| Storage  |
 +| Devices  |
  +-----------+

                            Figure 1
The ffl_mirrors field represents an array of state information for
each mirrored copy of the current layout segment.  Each element is
described by an ff_mirror4 type.
ffds_deviceid provides the deviceid of the storage device holding the
data file.

ffds_fh_vers is an array of filehandles of the data file matching the
available NFS versions on the given storage device.  There MUST be
exactly as many elements in ffds_fh_vers as there are in
ffda_versions.  Each element of the array corresponds to each
ffdv_version and ffdv_minorversion provided for the device.  The
array allows for server implementations which have different
skipping to change at page 17, line 5
See Section 5.3 for how to handle versioning issues between the
client and storage devices.

For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.  For loose coupling and an NFSv4
storage device, the client may use an anonymous stateid to perform
I/O on the storage device as there is no use for the metadata server
stateid (no control protocol).  In such a scenario, the server MUST
set the ffds_stateid to be zero.
For loosely coupled storage devices, ffds_user and ffds_group provide
the synthetic user and group to be used in the RPC credentials that
the client presents to the storage device to access the data files.
For tightly coupled storage devices, the user and group on the
storage device will be the same as on the metadata server.  I.e., if
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
ignore both ffds_user and ffds_group.
The allowed values for both ffds_user and ffds_group are specified in
Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group
strings that consist of decimal numeric values with no leading zeros
can be given a special interpretation by clients and servers that
choose to provide such support. The receiver may treat such a user
or group string as representing the same user as would be represented
by an NFSv3 uid or gid having the corresponding numeric value. Note
that if using Kerberos for security, the expectation is that these
values will be a name@domain string.
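The numeric-string interpretation described above can be sketched as
follows; the function name is illustrative and the name@domain value
is an invented example:

```python
# Sketch of the receiver-side rule: a user or group string consisting
# of decimal digits with no leading zeros may be treated as an NFSv3
# uid/gid; anything else (e.g., a Kerberos name@domain string) is kept
# as a name.

def interpret_owner(owner):
    """Return ('numeric', id) or ('name', string)."""
    if owner.isdigit() and (owner == "0" or not owner.startswith("0")):
        return ("numeric", int(owner))
    return ("name", owner)
```

So "1066" maps to uid 1066, while "017" (leading zero) and
"loghyr@example.net" are both treated as plain name strings.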
ffds_efficiency describes the metadata server's evaluation as to the
effectiveness of each mirror.  Note that this is per layout and not
per device as the metric may change due to perceived load,
availability to the metadata server, etc.  Higher values denote
higher perceived utility.  The way the client can select the best
mirror to access is discussed in Section 8.1.

5.2.  Interactions Between Devices and Layouts
skipping to change at page 17, line 43
relationship between multipathing and filehandles can result in
either 0, 1, or N filehandles (see Section 13.3).  Some rationales
for this are clustered servers which share the same filehandle or
allowing for multiple read-only copies of the file on the same
storage device.  In the Flexible File Layout Type, while there is an
array of filehandles, they are independent of the multipathing being
used.  If the metadata server wants to provide multiple read-only
copies of the same file on the same storage device, then it should
provide multiple ff_device_addr4, each as a mirror.  The client can
then determine that since the ffds_fh_vers are different, there are
multiple copies of the file for the current layout segment available.

5.3.  Handling Version Errors
When the metadata server provides the ffda_versions array in the
ff_device_addr4 (see Section 4.1), the client is able to determine if
it can not access a storage device with any of the supplied
ffdv_version and ffdv_minorversion combinations.  However, due to the
limitations of reporting errors in GETDEVICEINFO (see Section 18.40
in [RFC5661]), the client is not able to specify which specific
device it can not communicate with over one of the provided
ffdv_version and
skipping to change at page 18, line 24
minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the
error indicates that for all the supplied combinations for
ffdv_version and ffdv_minorversion, the client can not communicate
with the storage device.  The client can retry the GETDEVICEINFO to
see if the metadata server can provide a different combination or it
can fall back to doing the I/O through the metadata server.

6.  Striping via Sparse Mapping
While other Layout Types support both dense and sparse mapping of
logical offsets to physical offsets within a file (see for example
Section 13.4 of [RFC5661]), the Flexible File Layout Type only
supports a sparse mapping.

With sparse mappings, the logical offset within a file (L) is also
the physical offset on the storage device.  As detailed in
Section 13.4.4 of [RFC5661], this results in holes across each
storage device which does not contain the current stripe index.
L: logical offset into the file L: logical offset into the file
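The sparse mapping can be sketched in a few lines; the function name
is illustrative:

```python
# Sketch of the sparse mapping: the logical file offset L is used
# unchanged as the physical offset on whichever storage device holds
# that stripe unit, leaving holes on the other devices.

def sparse_map(L, stripe_unit, num_stripes):
    """Return (stripe_index, physical_offset) for logical offset L."""
    stripe_index = (L // stripe_unit) % num_stripes
    # Sparse: physical offset equals the logical offset.
    return stripe_index, L
```

For example, with a 1 MB stripe unit and three stripes, logical
offset 5 MB lands on stripe index 2, at physical offset 5 MB on that
device.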
skipping to change at page 19, line 29
LAYOUTGET and retry the I/O operation(s) using the new layout, or the
client MAY just retry the I/O operation(s) using regular NFS READ or
WRITE operations via the metadata server.  The client SHOULD attempt
to retrieve a new layout and retry the I/O operation using the
storage device first and only if the error persists, retry the I/O
operation via the metadata server.

8.  Mirroring
The Flexible File Layout Type has a simple model in place for the
mirroring of the file data constrained by a layout segment.  There is
no assumption that each copy of the mirror is stored identically on
the storage devices, i.e., one device might employ compression or
deduplication on the data.  However, the over the wire transfer of
the file contents MUST appear identical.  Note, this is a construct
of the selected XDR representation that each mirrored copy of the
layout segment has the same striping pattern (see Figure 1).
The metadata server is responsible for determining the number of
mirrored copies and the location of each mirror.  While the client
may provide a hint to how many copies it wants (see Section 12), the
metadata server can ignore that hint and in any event, the client has
no means to dictate either the storage device (which also means the
coupling and/or protocol levels to access the layout segments) or
the location of said storage device.
The updating of mirrored layout segments is done via client-side
mirroring.  With this approach, the client is responsible for making
sure modifications get to all copies of the layout segments it is
informed of via the layout.  If a layout segment is being resilvered
to a storage device, that mirrored copy will not be in the layout.
Thus the metadata server MUST update that copy until the client is
presented with it in a layout.  Also, if the client is writing to the
layout segments via the metadata server, e.g., using an earlier
version of the protocol, then the metadata server MUST update all
copies of the mirror.  As seen in Section 8.3, during the
resilvering, the layout is recalled, and the client has to make
modifications via the metadata server.
8.1.  Selecting a Mirror

When the metadata server grants a layout to a client, it can let the
client know how fast it expects each mirror to be once the request
arrives at the storage devices via the ffds_efficiency member.  While
the algorithms to calculate that value are left to the metadata
server implementations, factors that could contribute to that
calculation include speed of the storage device, physical memory
available to the device, operating system version, current load, etc.

However, what should not be involved in that calculation is a
perceived network distance between the client and the storage device.
The client is better situated for making that determination based on
past interaction with the storage device over the different available
network interfaces between the two.  I.e., the metadata server might
not know about a transient outage between the client and storage
device because it has no presence on the given subnet.
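A client heuristic along these lines can be sketched as follows.
This is purely illustrative (the scoring formula and field names are
invented; the protocol only mandates that ffds_efficiency be a hint
with higher values denoting higher utility), combining the server's
efficiency hint with the client's own measured latency:

```python
# Hypothetical mirror-selection heuristic: weigh the metadata server's
# ffds_efficiency hint against client-observed round-trip latency,
# since only the client can judge network distance to each mirror.

def select_mirror(mirrors):
    """mirrors: list of dicts with 'efficiency' (server hint, higher is
    better) and 'measured_latency_ms' (client-observed, lower is
    better).  Returns the index of the chosen mirror."""
    def score(m):
        return m["efficiency"] / (1.0 + m["measured_latency_ms"])
    return max(range(len(mirrors)), key=lambda i: score(mirrors[i]))
```

A mirror the server rates highly but that the client observes to be
slow (e.g., across a degraded link) can thus lose to a lower-rated
but nearby mirror.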
As such, it is the client which decides which mirror to access for
reading the file.  The requirements for writing to mirrored layout
segments are presented below.
8.2.  Writing to Mirrors

The client is responsible for updating all mirrored copies of the
layout segments that it is given in the layout.  If all but one copy
is updated successfully and the last one provides an error, then the
client needs to return the layout to the metadata server with an
error indicating that the update failed to that storage device.
The metadata server is then responsible for determining if it wants
to remove the errant mirror from the layout, if the mirror has
recovered from some transient error, etc.  When the client tries to
get a new layout, the metadata server informs it of the decision by
the contents of the layout.  The client MUST NOT make any assumptions
that the contents of the previous layout will match those of the new
one.  If it has updates that were not committed, it MUST resend those
updates to all mirrors.
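The client-side write path above can be sketched as follows; the
function names are illustrative, not protocol operations:

```python
# Sketch of client-side mirroring for writes: every mirror in the
# layout must be updated; any failure is collected so the caller can
# LAYOUTRETURN with an error naming each failed storage device, and
# uncommitted updates must later be resent to all mirrors of the
# fresh layout.

def write_through_mirrors(mirrors, offset, data, write_rpc):
    """Attempt the WRITE on every mirror; return the list of mirrors
    that failed (empty on full success)."""
    failed = []
    for mirror in mirrors:
        try:
            write_rpc(mirror, offset, data)
        except IOError:
            failed.append(mirror)
    return failed
```

A non-empty result obliges the client to return the layout with an
error report for each failed device rather than silently dropping
that copy.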
8.3.  Metadata Server Resilvering of the File

The metadata server may elect to create a new mirror of the layout
segments at any time.  This might be to resilver a copy on a storage
device which was down for servicing, to provide a copy of the layout
segments on storage with different storage performance
characteristics, etc.  As the client will not be aware of the new
mirror and the metadata server will not be aware of updates that the
client is making to the layout segments, the metadata server MUST
recall the writable layout segment(s) that it is resilvering.  If the
client issues a LAYOUTGET for a writable layout segment which is in
the process of being resilvered, then the metadata server MUST deny
that request with an NFS4ERR_LAYOUTTRYLATER.  The client can then
perform the I/O through the metadata server.
9.  Flexible Files Layout Type Return

layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
layout-type specific information to the server.  It is defined in
[RFC5661] as follows:

<CODE BEGINS>

struct layoutreturn_file4 {
skipping to change at page 22, line 15
lrf_body opaque value is defined by ff_layoutreturn4 (See
Section 9.3).  It allows the client to report I/O error information
or layout usage statistics back to the metadata server as defined
below.

9.1.  I/O Error Reporting

9.1.1.  ff_ioerr4

<CODE BEGINS>

/// struct ff_ioerr4 {
///     offset4        ffie_offset;
///     length4        ffie_length;
///     stateid4       ffie_stateid;
///     device_error4  ffie_errors<>;
/// };
///

<CODE ENDS>

Recall that [NFSv42] defines device_error4 as:

<CODE BEGINS>

struct device_error4 {
skipping to change at page 23, line 35
9.2.2.  ff_layoutupdate4

<CODE BEGINS>

/// struct ff_layoutupdate4 {
///     netaddr4        ffl_addr;
///     nfs_fh4         ffl_fhandle;
///     ff_io_latency4  ffl_read;
///     ff_io_latency4  ffl_write;
///     nfstime4        ffl_duration;
///     bool            ffl_local;
/// };
///

<CODE ENDS>
ffl_addr differentiates which network address the client connected to
on the storage device.  In the case of multipathing, ffl_fhandle
indicates which read-only copy was selected.  ffl_read and ffl_write
convey the latencies respectively for both read and write operations.
ffl_duration is used to indicate the time period over which the
statistics were collected.  ffl_local if true indicates that the I/O
was serviced by the client's cache.  This flag allows the client to
inform the metadata server about "hot" access to a file it would not
normally be allowed to report on.
9.2.3.  ff_iostats4

<CODE BEGINS>

/// struct ff_iostats4 {
///     offset4           ffis_offset;
///     length4           ffis_length;
///     stateid4          ffis_stateid;
///     io_info4          ffis_read;
///     io_info4          ffis_write;
///     deviceid4         ffis_deviceid;
///     ff_layoutupdate4  ffis_layoutupdate;
/// };
///

<CODE ENDS>

Recall that [NFSv42] defines io_info4 as:

<CODE BEGINS>

struct io_info4 {
skipping to change at page 25, line 8
example, a client can define the default byte range resolution to be
1 MB in size and the thresholds for reporting to be 1 MB/second or 10
I/O operations per second.  For each byte range, ffis_offset and
ffis_length represent the starting offset of the range and the range
length in bytes.  ffis_read.ii_count, ffis_read.ii_bytes,
ffis_write.ii_count, and ffis_write.ii_bytes represent, respectively,
the number of contiguous read and write I/Os and the respective
aggregate number of bytes transferred within the reported byte range.
The combination of ffis_deviceid and ffl_addr uniquely identify both
the storage path and the network route to it.  Finally, the
ffl_fhandle allows the metadata server to differentiate between
multiple read-only copies of the file on the same storage device.
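A threshold check using the example values above (1 MB/second or 10
I/O operations per second) can be sketched as follows; the function
name and the exact thresholds are illustrative, since the protocol
leaves reporting policy to the client:

```python
# Illustrative decision of whether a byte range's io_info4 statistics
# exceed the example reporting thresholds and so belong in
# fflr_iostats_report.

def should_report(ii_count, ii_bytes, duration_s,
                  iops_threshold=10, bandwidth_threshold=2**20):
    """ii_count/ii_bytes: I/O count and bytes over duration_s seconds;
    thresholds default to 10 IOPS and 1 MB/second."""
    if duration_s <= 0:
        return False
    iops = ii_count / duration_s
    bandwidth = ii_bytes / duration_s  # bytes per second
    return iops >= iops_threshold or bandwidth >= bandwidth_threshold
```

Either threshold alone suffices: a range doing 20 small I/Os per
second is reported even if it moves very few bytes, and vice versa.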
9.3.  ff_layoutreturn4

<CODE BEGINS>

/// struct ff_layoutreturn4 {
///     ff_ioerr4    fflr_ioerr_report<>;
///     ff_iostats4  fflr_iostats_report<>;
skipping to change at page 26, line 35 skipping to change at page 28, line 35
   In cases where clients are uncommunicative and their lease has
   expired or when clients fail to return recalled layouts within a
   lease period, the server MAY, at the least, revoke client layouts
   and/or device address mappings and reassign these resources to other
   clients (see "Recalling a Layout" in [RFC5661]).  To avoid data
   corruption, the metadata server MUST fence off the revoked clients
   from the respective data files as described in Section 2.2.
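   The server-side revocation decision can be sketched as follows.
   This is a hypothetical illustration, not specification text: the
   names (ClientState, must_revoke, revoke_and_fence) and the lease
   value are invented for the example.

```python
import time

LEASE_PERIOD = 90  # seconds; an example server-chosen lease time


class ClientState:
    def __init__(self, client_id):
        self.client_id = client_id
        self.last_renewal = time.monotonic()
        self.recall_sent_at = None   # set when a layout recall is issued
        self.layouts = []            # outstanding layouts


def must_revoke(client, now=None):
    """A client is revocable when its lease has expired, or when it has
    not returned a recalled layout within a lease period."""
    now = time.monotonic() if now is None else now
    if now - client.last_renewal > LEASE_PERIOD:
        return True
    if client.recall_sent_at is not None and \
            now - client.recall_sent_at > LEASE_PERIOD:
        return True
    return False


def revoke_and_fence(client, fence_fn):
    """Before reassigning resources, fence the client from each data
    file (a MUST in the text above) to avoid data corruption."""
    for layout in client.layouts:
        fence_fn(client.client_id, layout)   # e.g. invalidate credentials
    client.layouts = []
```

   The key ordering constraint mirrored here is that fencing happens
   before the revoked resources are handed to other clients.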
15.  Security Considerations

   The pNFS extension partitions the NFSv4.1+ file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4.1+ with
   respect to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4.1+ provides
   defenses for in environments where these threats are considered
   significant.
   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ or
   RW) and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified data files corresponding to
   the requested iomode.  When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations.  If access is
   allowed, the client uses the corresponding (READ or RW) credentials
   to perform the I/O operations at the data file's storage devices.

   When the metadata server receives a request to change a file's
   permissions or ACL, it SHOULD recall all layouts for that file and
   it MUST fence off the clients holding outstanding layouts for the
   respective file by implicitly invalidating the outstanding
   credentials on all data files comprising the file before committing
   to the new permissions and ACL.  Doing this will ensure that clients
   re-authorize their layouts according to the modified permissions and
   ACL by requesting new layouts.  Recalling the layouts in this case
   is a courtesy of the server, intended to prevent clients from
   getting an error on I/Os done after the client was fenced off.
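   The client-side authorization flow described above can be sketched
   in Python.  Every class and call here (MetadataServer-style `open`,
   `access`, `layoutget`, and the storage device API) is a hypothetical
   stand-in for illustration; this is not a real NFS client library.

```python
class AccessError(Exception):
    """Raised when the requested iomode is not allowed (NFS4ERR_ACCESS)."""


class FencedError(Exception):
    """Raised when the layout credentials have been invalidated."""


def client_io(mds, user, filename, iomode, offset, data=None):
    # 1. Authenticate and authorize the local user against the metadata
    #    server with OPEN and ACCESS, as with NFSv4 data delegations.
    fh = mds.open(user, filename)
    if not mds.access(user, fh, iomode):
        raise AccessError("user not allowed the requested iomode")

    # 2. LAYOUTGET: the metadata server verifies permissions/ACL and,
    #    on success, returns a layout carrying per-iomode credentials.
    layout = mds.layoutget(user, fh, iomode)   # may raise AccessError

    # 3. Perform the I/O at the storage device with those credentials.
    try:
        dev = layout.storage_device
        if iomode == "RW":
            return dev.write(layout.credentials, offset, data)
        return dev.read(layout.credentials, offset)
    except FencedError:
        # The metadata server invalidated our credentials (e.g. after a
        # permissions change); re-authorize by requesting a new layout.
        return client_io(mds, user, filename, iomode, offset, data)
```

   The retry on FencedError models the re-authorization path: a fenced
   client obtains fresh credentials, under the modified permissions and
   ACL, by requesting a new layout.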
   [...]
   [NFSv42]  Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf-
             nfsv4-minorversion2-28 (Work In Progress), November 2014.

   [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813,
             June 1995.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4506] Eisler, M., "XDR: External Data Representation Standard",
             STD 67, RFC 4506, May 2006.

   [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
             "Network File System (NFS) Version 4 Minor Version 1
             Protocol", RFC 5661, January 2010.

   [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
             "Network File System (NFS) Version 4 Minor Version 1
             External Data Representation Standard (XDR) Description",
             RFC 5662, January 2010.

   [RFCNFSv4]
             Haynes, T. and D. Noveck, "NFS Version 4 Protocol",
             draft-ietf-nfsv4-rfc3530bis-35 (Work In Progress),
             December 2014.

   [pNFSLayouts]
             Haynes, T., "Considerations for a New pNFS Layout Type",
             draft-ietf-nfsv4-layout-types-02 (Work In Progress),
             October 2014.
17.2.  Informative References

   [rpcsec_gssv3]
             Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
             Security Version 3", November 2014.
Appendix A.  Acknowledgments

   Those who provided miscellaneous comments to early drafts of this
   document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
   and Lev Solomonov.
Those who provided miscellaneous comments to the final drafts of this
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
Sundharraj, and Trond Myklebust.
   Idan Kedar caught a nasty bug in the interaction of client side
   mirroring and the minor versioning of devices.
Dave Noveck provided a comprehensive review of the document during
the working group last call.
   Olga Kornievskaia led the charge against the use of a credential
   versus a principal in the fencing approach.  Andy Adamson and
   Benjamin Kaduk helped to sharpen the focus.
Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]
Authors' Addresses

   Benny Halevy

   Email: bhalevy@gmail.com


   Thomas Haynes
   Primary Data, Inc.
   4300 El Camino Real Ste 100
   Los Altos, CA 94022
   USA

   Phone: +1 408 215 1519
   Email: thomas.haynes@primarydata.com