draft-ietf-nfsv4-flex-files-09.txt   draft-ietf-nfsv4-flex-files-10.txt

NFSv4                                                          B. Halevy
Internet-Draft
Intended status: Standards Track                               T. Haynes
Expires: January 18, 2018                                   Primary Data
                                                           July 17, 2017

              Parallel NFS (pNFS) Flexible File Layout
                  draft-ietf-nfsv4-flex-files-10.txt
Abstract

The Parallel Network File System (pNFS) allows a separation between
the metadata (onto a metadata server) and data (onto a storage
device) for a file.  The flexible file layout type is defined in this
document as an extension to pNFS which allows the use of storage
devices in a fashion such that they require only a quite limited
degree of interaction with the metadata server, using already
existing protocols.  Client side mirroring is also added to provide
skipping to change at page 1, line 38
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 18, 2018.
Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents

skipping to change at page 2, line 17

described in the Simplified BSD License.
Table of Contents

1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
  1.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   3
  1.2.  Difference Between a Data Server and a Storage Device . .   5
  1.3.  Requirements Language . . . . . . . . . . . . . . . . . .   6
2.  Coupling of Storage Devices . . . . . . . . . . . . . . . . .   6
  2.1.  LAYOUTCOMMIT  . . . . . . . . . . . . . . . . . . . . . .   6
  2.2.  Fencing Clients from the Storage Device . . . . . . . . .   6
    2.2.1.  Implementation Notes for Synthetic uids/gids  . . . .   7
    2.2.2.  Example of using Synthetic uids/gids  . . . . . . . .   8
  2.3.  State and Locking Models  . . . . . . . . . . . . . . . .   9
    2.3.1.  Loosely Coupled Locking Model . . . . . . . . . . . .   9
    2.3.2.  Tightly Coupled Locking Model . . . . . . . . . . . .  10
3.  XDR Description of the Flexible File Layout Type  . . . . . .  12
  3.1.  Code Components Licensing Notice  . . . . . . . . . . . .  13
4.  Device Addressing and Discovery . . . . . . . . . . . . . . .  14
  4.1.  ff_device_addr4 . . . . . . . . . . . . . . . . . . . . .  14
  4.2.  Storage Device Multipathing . . . . . . . . . . . . . . .  16
5.  Flexible File Layout Type . . . . . . . . . . . . . . . . . .  17
  5.1.  ff_layout4  . . . . . . . . . . . . . . . . . . . . . . .  17
    5.1.1.  Error Codes from LAYOUTGET  . . . . . . . . . . . . .  21
    5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS  . .  21
  5.2.  Interactions Between Devices and Layouts  . . . . . . . .  22
  5.3.  Handling Version Errors . . . . . . . . . . . . . . . . .  22
6.  Striping via Sparse Mapping . . . . . . . . . . . . . . . . .  23
7.  Recovering from Client I/O Errors . . . . . . . . . . . . . .  23
8.  Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . .  24
  8.1.  Selecting a Mirror  . . . . . . . . . . . . . . . . . . .  24
  8.2.  Writing to Mirrors  . . . . . . . . . . . . . . . . . . .  25
  8.3.  Metadata Server Resilvering of the File . . . . . . . . .  26
9.  Flexible Files Layout Type Return . . . . . . . . . . . . . .  26
  9.1.  I/O Error Reporting . . . . . . . . . . . . . . . . . . .  27
    9.1.1.  ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . .  27
  9.2.  Layout Usage Statistics . . . . . . . . . . . . . . . . .  28
    9.2.1.  ff_io_latency4  . . . . . . . . . . . . . . . . . . .  28
    9.2.2.  ff_layoutupdate4  . . . . . . . . . . . . . . . . . .  29
    9.2.3.  ff_iostats4 . . . . . . . . . . . . . . . . . . . . .  29
  9.3.  ff_layoutreturn4  . . . . . . . . . . . . . . . . . . . .  31
10. Flexible Files Layout Type LAYOUTERROR  . . . . . . . . . . .  31
11. Flexible Files Layout Type LAYOUTSTATS  . . . . . . . . . . .  31
12. Flexible File Layout Type Creation Hint . . . . . . . . . . .  32
  12.1.  ff_layouthint4 . . . . . . . . . . . . . . . . . . . . .  32
13. Recalling a Layout  . . . . . . . . . . . . . . . . . . . . .  33
  13.1.  CB_RECALL_ANY  . . . . . . . . . . . . . . . . . . . . .  33
14. Client Fencing  . . . . . . . . . . . . . . . . . . . . . . .  34
15. Security Considerations . . . . . . . . . . . . . . . . . . .  34
  15.1.  Kerberized File Access . . . . . . . . . . . . . . . . .  35
    15.1.1.  Loosely Coupled . . . . . . . . . . . . . . . . . .  35
    15.1.2.  Tightly Coupled . . . . . . . . . . . . . . . . . .  35
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  36
17. References  . . . . . . . . . . . . . . . . . . . . . . . . .  36
  17.1.  Normative References . . . . . . . . . . . . . . . . . .  36
  17.2.  Informative References . . . . . . . . . . . . . . . . .  37
Appendix A.  Acknowledgments  . . . . . . . . . . . . . . . . . .  37
Appendix B.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  37
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  37
1.  Introduction

In the parallel Network File System (pNFS), the metadata server
returns layout type structures that describe where file data is
located.  There are different layout types for different storage
systems and methods of arranging data on storage devices.  This
document defines the flexible file layout type used with file-based
data servers that are accessed using the Network File System (NFS)
protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and
skipping to change at page 6, line 25
The coupling of the metadata server with the storage devices can be
either tight or loose.  In a tight coupling, there is a control
protocol present to manage security, LAYOUTCOMMITs, etc.  With a
loose coupling, the only control protocol might be a version of NFS.
As such, semantics for managing security, state, and locking models
MUST be defined.

2.1.  LAYOUTCOMMIT
When tightly coupled storage devices are used, the metadata server
has the responsibility, upon receiving a LAYOUTCOMMIT (see
Section 18.42 of [RFC5661]), of ensuring that the semantics of pNFS
are respected (see Section 12.5.4 of [RFC5661]).  These semantics do
not include a requirement that data written to the storage device be
stable upon completion of the LAYOUTCOMMIT.

In the case of loosely coupled storage devices, it is the
responsibility of the client to make sure the data file is stable
before the metadata server begins to query the storage devices about
the changes to the file.  If any WRITE to a storage device did not
result in stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the
metadata server MUST be preceded by a COMMIT to the storage devices
written to.  Note that if the client has not done a COMMIT to the
storage device, then the LAYOUTCOMMIT might not be synchronized to
the last WRITE operation to the storage device.
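As a non-normative illustration, the loosely coupled rule above can
be sketched as follows.  The Python class and names are invented for
this sketch; it simply tracks which storage devices still hold WRITE
data not yet known to be stable, and therefore must receive a COMMIT
before the client sends LAYOUTCOMMIT to the metadata server.

```python
FILE_SYNC = "FILE_SYNC4"  # stable_how value indicating stable data

class LooselyCoupledClient:
    """Hypothetical client-side bookkeeping, not a real NFS client."""

    def __init__(self):
        # Storage device ids that hold data not yet known to be stable.
        self.unstable_devices = set()

    def record_write(self, device_id, stable_how):
        # Any WRITE whose reply is not FILE_SYNC leaves unstable data.
        if stable_how != FILE_SYNC:
            self.unstable_devices.add(device_id)

    def record_commit(self, device_id):
        # A COMMIT to the storage device makes its data stable.
        self.unstable_devices.discard(device_id)

    def devices_needing_commit(self):
        # These MUST receive a COMMIT before a LAYOUTCOMMIT is sent
        # to the metadata server.
        return set(self.unstable_devices)
```

A client following this sketch would drain devices_needing_commit()
with COMMITs before issuing LAYOUTCOMMIT.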
2.2. Fencing Clients from the Storage Device
With loosely coupled storage devices, the metadata server uses
synthetic uids and gids for the data file, where the uid owner of the
data file is allowed read/write access and the gid owner is allowed
read only access.  As part of the layout (see ffds_user and
ffds_group in Section 5.1), the client is provided with the user and
group to be used in the Remote Procedure Call (RPC) [RFC5531]
credentials needed to access the data file.  Fencing off of clients
is achieved by the metadata server changing the synthetic uid and/or
gid owners of the data file on the storage device to implicitly
skipping to change at page 7, line 26
all directories holding data files to the root user.  This approach
provides a practical model to enforce access control and fence off
cooperative clients, but it cannot protect against malicious
clients; hence it provides a level of security equivalent to
AUTH_SYS.
With tightly coupled storage devices, the metadata server sets the
user and group owners, mode bits, and ACL of the data file to be the
same as the metadata file.  And the client must authenticate with the
storage device and go through the same authorization process it would
go through via the metadata server.  In the case of tight coupling,
fencing is the responsibility of the control protocol and is not
described in detail here.  However, implementations of the tight
coupling locking model (see Section 2.3) will need a way to prevent
access by certain clients to specific files by invalidating the
corresponding stateids on the storage device.
2.2.1.  Implementation Notes for Synthetic uids/gids

The selection method for the synthetic uids and gids to be used for
fencing in loosely coupled storage devices is strictly an
implementation issue.  That is, an administrator might restrict a
range of such ids available to the Lightweight Directory Access
Protocol (LDAP) 'uid' field [RFC4519].  She might also be able to
choose an id that would never be used to grant access.  Then when the
metadata server had a request to access a file, a SETATTR would be
sent to the
skipping to change at page 8, line 10
client.  And it would present them as the RPC credentials to the
storage device.  When the client was done accessing the file and the
metadata server knew that no other client was accessing the file, it
could reset the owner and group to restrict access to the data file.

When the metadata server wanted to fence off a client, it would
change the synthetic uid and/or gid to the restricted ids.  Note that
using a restricted id ensures that there is a change of owner and at
least one id available that never gets allowed access.
Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD
be avoided.  These typically either grant superuser access to files
on a storage device or are mapped to an anonymous id.  In the first
case, even if the data file is fenced, the client might still be able
to access the file.  In the second case, multiple ids might be mapped
to the same anonymous id.
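The grant/fence cycle described above can be sketched, non-
normatively, as a small state machine.  The class, the id values, and
the access check below are invented for illustration; a real storage
device enforces access through its own AUTH_SYS credential checking.

```python
# Ids reserved by the administrator and never handed to any client.
RESTRICTED_UID = 19452
RESTRICTED_GID = 28418

class DataFile:
    """Hypothetical view of a data file's owners on the storage device."""

    def __init__(self):
        # Files start out owned by the restricted ids, denying access.
        self.uid = RESTRICTED_UID
        self.gid = RESTRICTED_GID

    def grant(self, synthetic_uid, synthetic_gid):
        # The metadata server issues a SETATTR: the uid owner gets
        # read/write access, the gid owner read-only access, and the
        # same ids are handed to the client in ffds_user/ffds_group.
        self.uid = synthetic_uid
        self.gid = synthetic_gid

    def fence(self):
        # Changing the owners back to the restricted ids implicitly
        # revokes the credentials previously handed out in the layout.
        self.uid = RESTRICTED_UID
        self.gid = RESTRICTED_GID

    def may_write(self, cred_uid):
        # Simplified AUTH_SYS check: only the uid owner may write.
        return cred_uid == self.uid
```

Because the restricted ids are never given to a client, fencing
always changes the owner to an id the client cannot present.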
2.2.2.  Example of using Synthetic uids/gids

The user loghyr creates a file "ompha.c" on the metadata server and
it creates a corresponding data file on the storage device.

The metadata server entry may look like:

-rw-r--r--  1 loghyr  staff  1697 Dec  4 11:31 ompha.c

On the storage device, it may be assigned some random synthetic uid/
gid to deny access:

-rw-r-----  1 19452  28418  1697 Dec  4 11:31 data_ompha.c

When the file is opened on a client, since the layout knows nothing
about the user (and does not care), whether loghyr or garbo opens the
file does not matter.  The owner and group are modified and those
values are returned.
skipping to change at page 9, line 14
While pushing the enforcement of permission checking onto the client
may seem to weaken security, the client may already be responsible
for enforcing permissions before modifications are sent to a server.
With cached writes, the client is always responsible for tracking who
is modifying a file and making sure to not coalesce requests from
multiple users into one request.

2.3.  State and Locking Models
The choice of locking models is governed by the following rules:

o  Storage devices implementing the NFSv3 and NFSv4.0 protocols are
always treated as loosely coupled.

o  NFSv4.1+ storage devices that do not return the
EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating
that they are to be treated as loosely coupled.  From the locking
viewpoint they are treated in the same way as NFSv4.0 storage
devices.

o  NFSv4.1+ storage devices that do identify themselves with the
EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are considered
strongly coupled.  They would use a back-end control protocol to
implement the global stateid model as described in [RFC5661].

2.3.1.  Loosely Coupled Locking Model

When locking-related operations are requested, they are primarily
dealt with by the metadata server, which generates the appropriate
stateids.  When an NFSv4 version is used as the data access protocol,
the metadata server may make stateid-related requests of the storage
devices.  However, it is not required to do so and the resulting
stateids are known only to the metadata server and the storage
device.

Given this basic structure, locking-related operations are handled as
follows:

o  OPENs are dealt with by the metadata server.  Stateids are
selected by the metadata server and associated with the client id
describing the client's connection to the metadata server.  The
metadata server may need to interact with the storage device to
locate the file to be opened, but no locking-related functionality
need be used on the storage device.

OPEN_DOWNGRADE and CLOSE only require local execution on the
metadata server.
o Advisory byte-range locks can be implemented locally on the
metadata server. As in the case of OPENs, the stateids associated
with byte-range locks are assigned by the metadata server and only
used on the metadata server.
o Delegations are assigned by the metadata server which initiates
recalls when conflicting OPENs are processed. No storage device
involvement is required.
o TEST_STATEID and FREE_STATEID are processed locally on the
metadata server, without storage device involvement.
All I/O operations to the storage device are done using the anonymous
stateid. Thus the storage device has no information about the
openowner and lockowner responsible for issuing a particular I/O
operation. As a result:
o Mandatory byte-range locking cannot be supported because the
storage device has no way of distinguishing I/O done on behalf of
the lock owner from those done by others.
o Enforcement of share reservations is the responsibility of the
client. Even though I/O is done using the anonymous stateid, the
client must ensure that it has a valid stateid associated with the
openowner, that allows the I/O being done before issuing the I/O.
In the event that a stateid is revoked, the metadata server is
responsible for preventing client access, since it has no way of
being sure that the client is aware that the stateid in question has
been revoked.
As the client never receives a stateid generated by a storage device,
there is no client lease on the storage device and no prospect of
lease expiration, even when access is via NFSv4 protocols. Clients
will have leases on the metadata server. In dealing with lease
expiration, the metadata server may need to use fencing to prevent
revoked stateids from being relied upon by a client unaware of the
fact that they have been revoked.
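Since all I/O in this model carries the anonymous stateid, share
reservation enforcement falls entirely on the client, as described
above.  A non-normative sketch of that client-side check follows; the
class and constant names are invented for illustration, with the
constants loosely mirroring the NFSv4 share_access bits.

```python
# Illustrative share_access bits (modeled on the NFSv4 values).
OPEN4_SHARE_ACCESS_READ = 1
OPEN4_SHARE_ACCESS_WRITE = 2

class ClientOpenState:
    """Hypothetical per-client open-state table for the loose model."""

    def __init__(self):
        # filehandle -> share_access bits granted by the metadata server
        self.opens = {}

    def record_open(self, fh, share_access):
        # A successful OPEN against the metadata server records the
        # access mode the returned open stateid permits.
        self.opens[fh] = share_access

    def may_issue_io(self, fh, want_write):
        # The storage device cannot check this (it sees only the
        # anonymous stateid), so before issuing I/O the client must
        # verify it holds an open stateid allowing that access.
        access = self.opens.get(fh, 0)
        needed = (OPEN4_SHARE_ACCESS_WRITE if want_write
                  else OPEN4_SHARE_ACCESS_READ)
        return bool(access & needed)
```

An I/O request for which may_issue_io() returns false must not be
sent to the storage device under this model.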
2.3.2.  Tightly Coupled Locking Model
When locking-related operations are requested, they are primarily
dealt with by the metadata server, which generates the appropriate
stateids. These stateids must be made known to the storage device
using control protocol facilities, the details of which are not
discussed in this document.
Given this basic structure, locking-related operations are handled as
follows:
o OPENs are dealt with primarily on the metadata server. Stateids
are selected by the metadata server and associated with the client
id describing the client's connection to the metadata server. The
metadata server needs to interact with the storage device to
locate the file to be opened, and to make the storage device aware
of the association between the metadata-server-chosen stateid and
the client and openowner that it represents.
OPEN_DOWNGRADE and CLOSE are executed initially on the metadata
server but the state change made must be propagated to the storage
device.
o Advisory byte-range locks can be implemented locally on the
metadata server. As in the case of OPENs, the stateids associated
with byte-range locks are assigned by the metadata server and are
available for use on the metadata server. Because I/O operations
are allowed to present lock stateids, the metadata server needs
the ability to make the storage device aware of the association
between the metadata-server-chosen stateid and the corresponding
open stateid it is associated with.
o Mandatory byte-range locks can be supported when both the metadata
server and the storage devices have the appropriate support. As
in the case of advisory byte-range locks, these are assigned by
the metadata server and are available for use on the metadata
server. To enable mandatory lock enforcement on the storage
device, the metadata server needs the ability to make the storage
device aware of the association between the metadata-server-chosen
stateid and the client, openowner, and lock (i.e., lockowner,
byte-range, lock-type) that it represents. Because I/O operations
are allowed to present lock stateids, this information needs to be
propagated to all storage devices to which I/O might be directed
rather than only to the storage devices that contain the locked
region.
o Delegations are assigned by the metadata server which initiates
recalls when conflicting OPENs are processed. Because I/O
operations are allowed to present delegation stateids, the
metadata server requires the ability to make the storage device
aware of the association between the metadata-server-chosen
stateid and the filehandle and delegation type it represents, and
to break such an association.
o TEST_STATEID is processed locally on the metadata server, without
storage device involvement.
o FREE_STATEID is processed on the metadata server but the metadata
server requires the ability to propagate the request to the
corresponding storage devices.
Because the client will possess and use stateids valid on the storage
device, there will be a client lease on the storage device and the
possibility of lease expiration does exist. The best approach for
the storage device is to retain these locks as a courtesy. However,
if it does not do so, control protocol facilities need to provide the
means to synchronize lock state between the metadata server and
storage device.
Clients will also have leases on the metadata server, which are
subject to expiration. In dealing with lease expiration, the
metadata server would be expected to use control protocol facilities
enabling it to invalidate revoked stateids on the storage device. In
the event the client is not responsive, the metadata server may need
to use fencing to prevent revoked stateids from being acted upon by
the storage device.
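The stateid-propagation obligations listed above can be sketched,
non-normatively, as follows.  The classes and method names stand in
for control protocol facilities whose details this document does not
specify; they are invented for illustration only.

```python
class StorageDevice:
    """Hypothetical storage device tracking stateids it will honor."""

    def __init__(self):
        self.known_stateids = set()

    def register_stateid(self, stateid):
        # Control protocol: learn a metadata-server-chosen stateid.
        self.known_stateids.add(stateid)

    def invalidate_stateid(self, stateid):
        # Control protocol: stop honoring a stateid (e.g., after
        # FREE_STATEID or revocation on the metadata server).
        self.known_stateids.discard(stateid)

class MetadataServer:
    """Hypothetical metadata server side of the control protocol."""

    def __init__(self, storage_devices):
        self.devices = storage_devices
        self.next_stateid = 1

    def open_file(self):
        # The stateid is chosen by the metadata server, then the
        # association is propagated to every storage device to which
        # I/O presenting it might be directed.
        sid = self.next_stateid
        self.next_stateid += 1
        for ds in self.devices:
            ds.register_stateid(sid)
        return sid

    def free_stateid(self, sid):
        # FREE_STATEID is processed on the metadata server but must
        # be propagated to the corresponding storage devices.
        for ds in self.devices:
            ds.invalidate_stateid(sid)
```

The same invalidation path would be used when the metadata server
needs to revoke stateids on an unresponsive client's behalf.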
3.  XDR Description of the Flexible File Layout Type
This document contains the external data representation (XDR)
[RFC4506] description of the flexible file layout type.  The XDR
description is embedded in this document in a way that makes it
simple for the reader to extract into a ready-to-compile form.  The
reader can feed this document into the following shell script to
produce the machine readable XDR description of the flexible file
layout type:
skipping to change at page 15, line 26
The ffda_versions array allows the metadata server to present choices
as to NFS version, minor version, and coupling strength to the
client.  The ffdv_version and ffdv_minorversion represent the NFS
protocol to be used to access the storage device.  This layout
specification defines the semantics for ffdv_versions 3 and 4.  If
ffdv_version equals 3 then the server MUST set ffdv_minorversion to 0
and ffdv_tightly_coupled to false.  The client MUST then access the
storage device using the NFSv3 protocol [RFC1813].  If ffdv_version
equals 4 then the server MUST set ffdv_minorversion to one of the
NFSv4 minor version numbers and the client MUST access the storage
device using NFSv4 with the specified minor version.
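The MUST-level constraints above lend themselves to a simple validity
check on each ffda_versions entry.  The following non-normative
sketch mirrors the XDR field names; the set of accepted NFSv4 minor
versions (0, 1, and 2 as of this writing) is an assumption of the
sketch, not a statement of this specification.

```python
def ffdv_entry_valid(version, minorversion, tightly_coupled):
    """Check one ffda_versions entry against the rules quoted above."""
    if version == 3:
        # MUST: ffdv_minorversion 0 and ffdv_tightly_coupled false.
        return minorversion == 0 and not tightly_coupled
    if version == 4:
        # MUST: one of the NFSv4 minor version numbers (assumed 0-2).
        return minorversion in (0, 1, 2)
    # This layout specification defines semantics only for 3 and 4.
    return False
```

A client could apply such a check to each entry before selecting a
version with which to access the storage device.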
Note that while the client might determine that it cannot use any of
the configured combinations of ffdv_version, ffdv_minorversion, and
ffdv_tightly_coupled when it gets the device list from the metadata
server, there is no way to indicate to the metadata server which
device is version incompatible.  If, however, the client waits until
it retrieves the layout from the metadata server, it can at that time
clearly identify the storage device in question (see Section 5.3).
will designate the same storage device.  When the storage device is
accessed over NFSv4.1 or a higher minor version, the two storage
device addresses will support the implementation of client ID or
session trunking (the latter is RECOMMENDED) as defined in [RFC5661].
The two storage device addresses will share the same server owner or
major ID of the server owner.  It is not always necessary for the two
storage device addresses to designate the same storage device with
trunking being used.  For example, the data could be read-only, and
the data consist of exact replicas.
5.  Flexible File Layout Type
The layout4 type is defined in [RFC5662] as follows:
<CODE BEGINS>

enum layouttype4 {
    LAYOUT4_NFSV4_1_FILES   = 1,
    LAYOUT4_OSD2_OBJECTS    = 2,
    LAYOUT4_BLOCK_VOLUME    = 3,
    LAYOUT4_FLEX_FILES      = 4
Section 5.3 for how to handle versioning issues between the client
and storage devices.

For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.  For loose coupling and an NFSv4
storage device, the client may use an anonymous stateid to perform
I/O on the storage device as there is no use for the metadata server
stateid (no control protocol).  In such a scenario, the server MUST
set the ffds_stateid to be the anonymous stateid.
This specification of the ffds_stateid is mostly broken for the
tightly coupled model.  There needs to exist a one-to-one mapping
from ffds_stateid to ffds_fh_vers: each open file on the storage
device might need an open stateid.  As there are established loosely
coupled implementations of this version of the protocol, the only
viable approaches for a tightly coupled implementation would be to
either use an anonymous stateid for the ffds_stateid or restrict the
size of ffds_fh_vers to one.  Fixing this issue will require a new
version of the protocol.
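The stateid rules above can be condensed into a small sketch.  This
is non-normative: ANONYMOUS_STATEID stands in for the all-zero
special stateid of [RFC5661], and the helper name is hypothetical.

```python
# Non-normative sketch: which stateid a flex-files client presents
# to the storage device for I/O.
ANONYMOUS_STATEID = bytes(16)  # seqid + other, all zero (stand-in)

def stateid_for_io(tightly_coupled, ffds_stateid):
    if not tightly_coupled:
        # Loose coupling: there is no control protocol, so the
        # metadata server's stateid is of no use; the anonymous
        # stateid is presented instead.
        return ANONYMOUS_STATEID
    # Tight coupling: use the stateid handed out in the layout.
    return ffds_stateid
```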
For loosely coupled storage devices, ffds_user and ffds_group provide
the synthetic user and group to be used in the RPC credentials that
the client presents to the storage device to access the data files.
For tightly coupled storage devices, the user and group on the
storage device will be the same as on the metadata server.  I.e., if
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
ignore both ffds_user and ffds_group.
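A sketch of this credential selection follows.  It is illustrative
only: the Credential tuple and helper are hypothetical, and a real
client would build full RPC credentials rather than bare ids.

```python
# Non-normative sketch of credential selection for data-file access.
from collections import namedtuple

Credential = namedtuple("Credential", ["user", "group"])

def rpc_credential(tightly_coupled, ffds_user, ffds_group,
                   client_user, client_group):
    if tightly_coupled:
        # ffds_user/ffds_group MUST be ignored; the same principal
        # used against the metadata server is presented.
        return Credential(client_user, client_group)
    # Loose coupling: present the synthetic ids from the layout.
    return Credential(ffds_user, ffds_group)
```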
The allowed values for both ffds_user and ffds_group are specified in
Section 5.9 of [RFC5661].  For NFSv3 compatibility, user and group
higher perceived utility.  The way the client can select the best
mirror to access is discussed in Section 8.1.

ffl_flags is a bitmap that allows the metadata server to inform the
client of particular conditions that may result from the more or less
tight coupling of the storage devices.

FF_FLAGS_NO_LAYOUTCOMMIT:  can be set to indicate that the client is
   not required to send LAYOUTCOMMIT to the metadata server.
FF_FLAGS_NO_IO_THRU_MDS:  can be set to indicate that the client
   SHOULD NOT send I/O operations to the metadata server.  I.e., even
   if the client could determine that there was a network disconnect
   to a storage device, the client SHOULD NOT try to proxy the I/O
   through the metadata server.
FF_FLAGS_NO_READ_IO:  can be set to indicate that the client SHOULD
   NOT send READ requests with the layouts of iomode
   LAYOUTIOMODE4_RW.  Instead, it should request a layout of iomode
   LAYOUTIOMODE4_READ from the metadata server.
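A client-side reading of these flag bits can be sketched as below.
The bit values shown follow the ff_flags4 constants in the XDR
description elsewhere in this document; the helper functions are
illustrative, not part of the protocol.

```python
# Non-normative sketch: interpreting ffl_flags on the client.
FF_FLAGS_NO_LAYOUTCOMMIT = 0x1
FF_FLAGS_NO_IO_THRU_MDS  = 0x2
FF_FLAGS_NO_READ_IO      = 0x4

def layoutcommit_needed(ffl_flags):
    # LAYOUTCOMMIT may be skipped when the flag is set.
    return not (ffl_flags & FF_FLAGS_NO_LAYOUTCOMMIT)

def may_proxy_through_mds(ffl_flags):
    # Proxying I/O via the metadata server is discouraged when set.
    return not (ffl_flags & FF_FLAGS_NO_IO_THRU_MDS)

def may_read_with_rw_layout(ffl_flags):
    # READs with a LAYOUTIOMODE4_RW layout are discouraged when set.
    return not (ffl_flags & FF_FLAGS_NO_READ_IO)
```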
5.1.1.  Error Codes from LAYOUTGET
[RFC5661] provides little guidance as to how the client is to proceed
with a LAYOUTGET which returns an error of NFS4ERR_LAYOUTTRYLATER,
NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.
Within the context of this document:
NFS4ERR_LAYOUTUNAVAILABLE:  there is no layout available and the I/O
   is to go to the metadata server.  Note that it is possible to have
   had a layout before a recall and not after.

NFS4ERR_LAYOUTTRYLATER:  there is some issue preventing the layout
   from being granted.  If the client already has an appropriate
   layout, it should continue with I/O to the storage devices.

NFS4ERR_DELAY:  there is some issue preventing the layout from being
   granted.  If the client already has an appropriate layout, it
   should not continue with I/O to the storage devices.
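These interpretations can be summarized in a small dispatch sketch.
The function and the returned action strings are illustrative only;
a real client would drive its I/O state machine directly.

```python
# Non-normative sketch of the client reaction to LAYOUTGET errors,
# following the interpretations above.
def on_layoutget_error(error, have_usable_layout):
    if error == "NFS4ERR_LAYOUTUNAVAILABLE":
        # No layout is forthcoming; I/O goes to the metadata server.
        return "io_through_mds"
    if error == "NFS4ERR_LAYOUTTRYLATER":
        # An already-held appropriate layout may continue to be used.
        return "io_to_storage" if have_usable_layout else "retry_later"
    if error == "NFS4ERR_DELAY":
        # Do not keep driving I/O to the storage devices.
        return "wait_and_retry"
    raise ValueError(error)
```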
5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS

Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS
flag, the client can still perform I/O to the metadata server.  The
flag is at best a hint.  The flag indicates to the client that the
metadata server most likely wants to separate the metadata I/O from
the data I/O to increase the performance of the metadata operations.
If the metadata server detects that the client is performing I/O
against it despite the use of the FF_FLAGS_NO_IO_THRU_MDS flag, it
can recall the layout and either not set the flag on the new layout
or not provide a layout (perhaps the intent was for the server to
temporarily prevent data I/O to meet some goal).  The client's I/O
would then proceed according to the status codes as outlined in
Section 5.1.1.
5.2.  Interactions Between Devices and Layouts

In [RFC5661], the file layout type is defined such that the
relationship between multipathing and filehandles can result in
either 0, 1, or N filehandles (see Section 13.3).  Some rationales
for this are clustered servers which share the same filehandle or
allowing for multiple read-only copies of the file on the same
storage device.  In the flexible file layout type, while there is an
array of filehandles, they are independent of the multipathing being
The metadata server analyzes the error and determines the required
recovery operations such as recovering media failures or
reconstructing missing data files.

The metadata server SHOULD recall any outstanding layouts to allow it
exclusive write access to the stripes being recovered and to prevent
other clients from hitting the same error condition.  In these cases,
the server MUST complete recovery before handing out any new layouts
to the affected byte ranges.
Although the client implementation has the option to propagate a
corresponding error to the application that initiated the I/O
operation and drop any unwritten data, the client should attempt to
retry the original I/O operation by either requesting a new layout or
sending the I/O via regular NFSv4.1+ READ or WRITE operations to the
metadata server.  The client SHOULD attempt to retrieve a new layout
and retry the I/O operation using the storage device first and, only
if the error persists, retry the I/O operation via the metadata
server.
8.  Mirroring

The flexible file layout type has a simple model in place for the
mirroring of the file data constrained by a layout segment.  There is
no assumption that each copy of the mirror is stored identically on
the storage devices.  For example, one device might employ
compression or deduplication on the data.  However, the over-the-wire
transfer of the file contents MUST appear identical.  Note, this is a
constraint of the selected XDR representation in which each mirrored
copy of the layout segment has the same striping pattern (see
Figure 1).
The metadata server is responsible for determining the number of
mirrored copies and the location of each mirror.  While the client
may provide a hint to how many copies it wants (see Section 12), the
metadata server can ignore that hint, and in any event, the client
has no means to dictate either the storage device (which also means
the coupling and/or protocol levels to access the layout segments)
or the location of said storage device.
The updating of mirrored layout segments is done via client-side
mirroring.  With this approach, the client is responsible for making
sure modifications are made on all copies of the layout segments it
is informed of via the layout.  If a layout segment is being
resilvered to a storage device, that mirrored copy will not be in the
layout.  Thus the metadata server MUST update that copy until the
client is presented with it in a layout.  If the client is writing to
the layout segments via the metadata server, then the metadata server
MUST update all copies of the mirror.  As seen in Section 8.3, during
the resilvering, the layout is recalled, and the client has to make
modifications via the metadata server.
8.1.  Selecting a Mirror

When the metadata server grants a layout to a client, it MAY let the
client know how fast it expects each mirror to be once the request
arrives at the storage devices via the ffds_efficiency member.  While
the algorithms to calculate that value are left to the metadata
server implementations, factors that could contribute to that
calculation include speed of the storage device, physical memory
device because it has no presence on the given subnet.

As such, it is the client which decides which mirror to access for
reading the file.  The requirements for writing to mirrored layout
segments are presented below.
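One possible read-mirror selection policy, combining the server's
ffds_efficiency hint with the client's own reachability knowledge,
is sketched below.  The tuple shape and helper are illustrative; a
real client would weigh many more factors.

```python
# Non-normative sketch: choose a mirror for reads.
def pick_read_mirror(mirrors):
    """mirrors: iterable of (mirror_id, ffds_efficiency, reachable).

    Prefer the highest server-advertised efficiency among mirrors
    the client can actually reach; None means read via the MDS.
    """
    usable = [(mid, eff) for mid, eff, reachable in mirrors if reachable]
    if not usable:
        return None
    return max(usable, key=lambda t: t[1])[0]
```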
8.2.  Writing to Mirrors

The client is responsible for updating all mirrored copies of the
layout segments that it is given in the layout.  A single failed
update is sufficient to fail the entire operation.  If all but one
copy is updated successfully and the last one provides an error, then
the client needs to inform the metadata server via either
LAYOUTRETURN or LAYOUTERROR that the update failed to that storage
device.  If the client is updating the mirrors serially, then it
SHOULD stop at the first error encountered and report that to the
metadata server.  If the client is updating the mirrors in parallel,
then it SHOULD wait until all storage devices respond such that it
can report all errors encountered during the update.
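The parallel-update rule can be sketched as below: issue the write
to every mirror, wait for all responses, and gather every failure
for the subsequent LAYOUTRETURN/LAYOUTERROR report.  The helper and
the demo writer are illustrative only.

```python
# Non-normative sketch of parallel mirror updates with full error
# collection, per the SHOULD above.
from concurrent.futures import ThreadPoolExecutor

def write_all_mirrors(mirrors, do_write):
    errors = {}
    with ThreadPoolExecutor(max_workers=max(1, len(mirrors))) as pool:
        futures = {pool.submit(do_write, m): m for m in mirrors}
        for fut, mirror in futures.items():
            try:
                fut.result()          # wait for every storage device
            except Exception as exc:
                errors[mirror] = exc  # reported to the MDS afterwards
    # A single failure fails the whole operation; an empty dict
    # means every mirror was updated successfully.
    return errors

def _demo_write(mirror):
    # Stand-in for the WRITE RPCs to each storage device.
    if mirror == "m2":
        raise IOError("storage device m2 unreachable")

failed = write_all_mirrors(["m1", "m2", "m3"], _demo_write)
```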
The metadata server is then responsible for determining if it wants
to remove the errant mirror from the layout, if the mirror has
recovered from some transient error, etc.  When the client tries to
get a new layout, the metadata server informs it of the decision by
the contents of the layout.  The client MUST NOT make any assumptions
that the contents of the previous layout will match those of the new
one.  If it has updates that were not committed to all mirrors, then
it MUST resend those updates to all mirrors.
There is no provision in the protocol for the metadata server to
directly determine that the client has or has not recovered from an
error.  For example, assume that the storage device was network
partitioned from the client and all of the copies are successfully
updated after the error was reported.  There is no mechanism for the
client to report that fact, and the metadata server is forced to
repair the file across the mirror.

If the client supports NFSv4.2, it can use LAYOUTERROR and
The metadata server may elect to create a new mirror of the layout
segments at any time.  This might be to resilver a copy on a storage
device which was down for servicing, to provide a copy of the layout
segments on storage with different storage performance
characteristics, etc.  As the client will not be aware of the new
mirror and the metadata server will not be aware of updates that the
client is making to the layout segments, the metadata server MUST
recall the writable layout segment(s) that it is resilvering.  If the
client issues a LAYOUTGET for a writable layout segment which is in
the process of being resilvered, then the metadata server can deny
that request with NFS4ERR_LAYOUTUNAVAILABLE.  The client would then
have to perform the I/O through the metadata server.
9.  Flexible Files Layout Type Return

layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
layout-type specific information to the server.  It is defined in
Section 18.44.1 of [RFC5661] as follows:

<CODE BEGINS>
/* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */
const LAYOUT4_RET_REC_FILE = 1;
const LAYOUT4_RET_REC_FSID = 2;
const LAYOUT4_RET_REC_ALL = 3;
enum layoutreturn_type4 {
LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE,
LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID,
LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL
};
struct layoutreturn_file4 {
        offset4         lrf_offset;
        length4         lrf_length;
        stateid4        lrf_stateid;
        /* layouttype4 specific data */
        opaque          lrf_body<>;
};

union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
        case LAYOUTRETURN4_FILE:
                layoutreturn_file4     lr_layout;
        default:
                void;
};
struct LAYOUTRETURN4args {
        /* CURRENT_FH: file */
        bool                    lora_reclaim;
        layoutreturn_stateid    lora_recallstateid;
        layouttype4             lora_layout_type;
        layoutiomode4           lora_iomode;
        layoutreturn4           lora_layoutreturn;
};
<CODE ENDS>
If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the
lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value
is defined by ff_layoutreturn4 (see Section 9.3).  It allows the
client to report I/O error information or layout usage statistics
back to the metadata server as defined below.  Note that while the
data structures are built on concepts introduced in NFSv4.2, the
effective discriminated union (lora_layout_type combined with
ff_layoutreturn4) allows for an NFSv4.1 metadata server to utilize
the data.
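Because lrf_body is a variable-length XDR opaque, its wire form is a
4-byte big-endian length followed by the bytes, zero-padded to a
4-byte boundary.  A minimal encoding sketch (illustrative only; a
real implementation would use a full XDR library) follows:

```python
# Non-normative sketch: XDR variable-length opaque encoding, as used
# for the lrf_body payload carrying ff_layoutreturn4.
import struct

def encode_xdr_opaque(data):
    # 4-byte big-endian length, then the bytes, padded to 4 bytes.
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad
```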
9.1.  I/O Error Reporting

9.1.1.  ff_ioerr4

<CODE BEGINS>

/// struct ff_ioerr4 {
///         offset4        ffie_offset;
///         length4        ffie_length;
<CODE BEGINS>

struct io_info4 {
        uint64_t        ii_count;
        uint64_t        ii_bytes;
};

<CODE ENDS>
With pNFS, data transfers are performed directly between the pNFS
client and the storage devices.  Therefore, the metadata server has
no direct knowledge of the I/O operations being done and thus cannot
create on its own statistical information about client I/O to
optimize data storage location.  ff_iostats4 MAY be used by the
client to report I/O statistics back to the metadata server upon
returning the layout.

Since it is not feasible for the client to report every I/O that used
the layout, the client MAY identify "hot" byte ranges for which to
report I/O statistics.  The definition and/or configuration mechanism
of what is considered "hot" and the size of the reported byte range
is out of the scope of this document.  It is suggested that client
implementations provide reasonable default values and an optional
run-time management interface to control these parameters.  For
example, a client can define the default byte range resolution to be
1 MB in size and the thresholds for reporting to be 1 MB/second or 10
I/O operations per second.

For each byte range, ffis_offset and ffis_length represent the
starting offset of the range and the range length in bytes.
ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and
ffis_write.ii_bytes represent, respectively, the number of contiguous
read and write I/Os and the respective aggregate number of bytes
transferred within the reported byte range.

The combination of ffis_deviceid and ffl_addr uniquely identifies
both the storage path and the network route to it.  Finally, the
ffl_fhandle allows the metadata server to differentiate between
multiple read-only copies of the file on the same storage device.
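A "hot" range filter using the example thresholds from the text (1
MB/second or 10 I/O operations per second) might be sketched as
follows.  All names are illustrative, and the policy itself is
explicitly outside the scope of this specification.

```python
# Non-normative sketch of a "hot range" reporting filter with the
# example thresholds suggested above.
MB = 1024 * 1024

def is_hot(byte_count, op_count, interval_seconds,
           bytes_per_sec=MB, ops_per_sec=10):
    """Report the byte range once either rate threshold is met."""
    if interval_seconds <= 0:
        return False
    return (byte_count / interval_seconds >= bytes_per_sec or
            op_count / interval_seconds >= ops_per_sec)
```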
9.3.  ff_layoutreturn4

<CODE BEGINS>

/// struct ff_layoutreturn4 {
///     ff_ioerr4     fflr_ioerr_report<>;
///     ff_iostats4   fflr_iostats_report<>;
reasons for recalling a layout, the flexible file layout type
metadata server should recall outstanding layouts in the following
cases:

o  When the file's security policy changes, i.e., Access Control
   Lists (ACLs) or permission mode bits are set.

o  When the file's layout changes, rendering outstanding layouts
   invalid.

o  When existing layouts are inconsistent with the need to enforce
   locking constraints.

o  When existing layouts are inconsistent with the requirements
   regarding resilvering as described in Section 8.3.
13.1.  CB_RECALL_ANY

The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts.  [RFC5661]
defines the allowed types, but makes no provision to expand them.  It
does hint that "storage protocols" can expand the range, but does not
define such a process.  If the values were put under IANA control,
the following types could be defined:
<CODE BEGINS>

const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = -2;
const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = -1;
[[RFC Editor: please insert assigned constants]]

struct  CB_RECALL_ANY4args      {
    uint32_t        craa_layouts_to_keep;
    bitmap4         craa_type_mask;
};

<CODE ENDS>
[[AI13: No, 5661 does not define these above values. The ask here is
to create these and _add_ them to 5661. --TH]]
Typically, CB_RECALL_ANY will be used to recall client state when the
server needs to reclaim resources.  The craa_type_mask bitmap
specifies the type of resources that are recalled and the
craa_layouts_to_keep value specifies how many of the recalled
flexible file layouts the client is allowed to keep.  The flexible
file layout type mask flags are defined as follows:
<CODE BEGINS>

/// enum ff_cb_recall_any_mask {
///     FF_RCA4_TYPE_MASK_READ = -2,
///     FF_RCA4_TYPE_MASK_RW   = -1
[[RFC Editor: please insert assigned constants]]
/// };
///

<CODE ENDS>
These mask flags represent the iomode of the recalled layouts.  In
response, the
skipping to change at page 34, line 50
The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply
to the control path.  The combination of components in a pNFS system
is required to preserve the security properties of NFSv4.1+ with
respect to an entity accessing data via a client, including security
countermeasures to defend against threats that NFSv4.1+ provides
defenses for in environments where these threats are considered
significant.
The metadata server enforces the file access-control policy at
LAYOUTGET time.  The client should use RPC authorization credentials
(uid/gid for AUTH_SYS or tickets for Kerberos) for getting the layout
for the requested iomode (READ or RW) and the server verifies the
permissions and ACL for these credentials, possibly returning
NFS4ERR_ACCESS if the client is not allowed the requested iomode.  If
the LAYOUTGET operation succeeds the client receives, as part of the
layout, a set of credentials allowing it I/O access to the specified
data files corresponding to the requested iomode.  When the client
acts on I/O operations on behalf of its local users, it MUST
authenticate and authorize the user by issuing respective OPEN and
ACCESS calls to the metadata server, similar to having NFSv4 data
delegations.
If access is allowed, the client uses the corresponding (READ or RW)
credentials to perform the I/O operations at the data file's storage
devices.  When the metadata server receives a request to change a
file's permissions or ACL, it SHOULD recall all layouts for that file
and then MUST fence off any clients still holding outstanding layouts
for the respective files by implicitly invalidating the previously
distributed credential on all data files comprising the file in
question.  It is REQUIRED that this be done before committing to the
new permissions and/or ACL.  By requesting new layouts, the clients
will reauthorize access against the modified access control metadata.
Recalling the layouts in this case is intended to prevent clients
from getting an error on I/Os done after the client was fenced off.
15.1.  Kerberized File Access
15.1.1.  Loosely Coupled
RPCSEC_GSS version 3 (RPCSEC_GSSv3) [rpcsec_gssv3] could be used to
authorize the client to the storage device on behalf of the metadata
server.  This would require the metadata server, storage device, and
client to each implement RPCSEC_GSSv3.  Requiring changes to the
storage device does not match the intent of the loosely coupled
model, in which the storage device need not be modified.
Under this coupling model, the principal used to authenticate the
metadata file is different from that used to authenticate the data
file.  For the metadata server, the user credentials would be
generated by the same Kerberos server as the client.  However, for
the data storage access, the metadata server would generate the
ticket granting tickets and provide them to the client.  Fencing
would then be controlled either by expiring the ticket or by
modifying the synthetic uid or gid on the data file.
15.1.2.  Tightly Coupled
With tight coupling, the principal used to access the metadata file
is exactly the same as used to access the data file.  As a result
there are no security issues related to using Kerberos with a
tightly coupled system.
16.  IANA Considerations

As described in [RFC5661], new layout type numbers have been assigned
by IANA.  This document defines the protocol associated with the
existing layout type number, LAYOUT4_FLEX_FILES.

17.  References

17.1.  Normative References
skipping to change at page 37, line 29
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
and Lev Solomonov.

Those who provided miscellaneous comments to the final drafts of this
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
Sundharraj, and Trond Myklebust.

Idan Kedar caught a nasty bug in the interaction of client side
mirroring and the minor versioning of devices.
Dave Noveck provided comprehensive reviews of the document during the
working group last calls.

Olga Kornievskaia made a convincing case against the use of a
credential versus a principal in the fencing approach.  Andy Adamson
and Benjamin Kaduk helped to sharpen the focus.

Tigran Mkrtchyan provided the use case for not allowing the client to
proxy the I/O through the data server.
Appendix B.  RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this
document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]