NFSv4                                                         C. Hellwig
Internet-Draft                                         November 03, 2015
Intended status: Standards Track
Expires: May 6, 2016

                     Parallel NFS (pNFS) SCSI Layout
                   draft-ietf-nfsv4-scsi-layout-03.txt
Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The SCSI Layout Type is defined in this document
   as an extension to pNFS to allow the use of SCSI based block storage
   devices.
Status of this Memo

skipping to change at page 1, line 34

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 6, 2016.
Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents

skipping to change at page 2, line 20
     1.3.  Code Components Licensing Notice . . . . . . . . . . . .   4
     1.4.  XDR Description  . . . . . . . . . . . . . . . . . . . .   4
   2.  SCSI Layout Description  . . . . . . . . . . . . . . . . . .   6
     2.1.  Background and Architecture  . . . . . . . . . . . . . .   6
     2.2.  layouttype4  . . . . . . . . . . . . . . . . . . . . . .   7
     2.3.  GETDEVICEINFO  . . . . . . . . . . . . . . . . . . . . .   8
       2.3.1.  Volume Identification  . . . . . . . . . . . . . . .   8
       2.3.2.  Volume Topology  . . . . . . . . . . . . . . . . . .   9
     2.4.  Data Structures: Extents and Extent Lists  . . . . . . .  12
       2.4.1.  Layout Requests and Extent Lists . . . . . . . . . .  14
       2.4.2.  Layout Commits . . . . . . . . . . . . . . . . . . .  16
       2.4.3.  Layout Returns . . . . . . . . . . . . . . . . . . .  16
       2.4.4.  Client Copy-on-Write Processing  . . . . . . . . . .  17
       2.4.5.  Extents are Permissions  . . . . . . . . . . . . . .  18
       2.4.6.  End-of-file Processing . . . . . . . . . . . . . . .  19
       2.4.7.  Layout Hints . . . . . . . . . . . . . . . . . . . .  20
       2.4.8.  Client Fencing . . . . . . . . . . . . . . . . . . .  20
     2.5.  Crash Recovery Issues  . . . . . . . . . . . . . . . . .  22
     2.6.  Recalling Resources: CB_RECALL_ANY . . . . . . . . . . .  22
     2.7.  Transient and Permanent Errors . . . . . . . . . . . . .  23
     2.8.  Volatile write caches  . . . . . . . . . . . . . . . . .  23
   3.  Enforcing NFSv4 Semantics  . . . . . . . . . . . . . . . . .  24
     3.1.  Use of Open Stateids . . . . . . . . . . . . . . . . . .  24
     3.2.  Enforcing Security Restrictions  . . . . . . . . . . . .  25
     3.3.  Enforcing Locking Restrictions . . . . . . . . . . . . .  25
   4.  Security Considerations  . . . . . . . . . . . . . . . . . .  26
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . .  27
   6.  Normative References . . . . . . . . . . . . . . . . . . . .  27
   Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . .  28
   Appendix B.  RFC Editor Notes  . . . . . . . . . . . . . . . . .  28
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . .  29
1.  Introduction

   Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
   system:

       +-----------+
       |+-----------+                              +-----------+
       ||+-----------+                             |           |
       |||           |       NFSv4.1 + pNFS        |           |
skipping to change at page 3, line 50

   supported by the SCSI layout type, and all future references to SCSI
   storage devices will imply a block based SCSI command set.

   The Server to Storage System protocol, called the "Control Protocol",
   is not of concern for interoperability, although it will typically be
   the same SCSI based storage protocol.
   This document is based on and updates [RFC5663] to provide a better
   pNFS layout protocol for SCSI based storage devices, and functionally
   obsoletes [RFC6688] by providing mandatory disk access protection as
   part of the protocol.  Unlike [RFC5663], this document can make use
   of SCSI protocol features and thus can provide reliable fencing by
   using SCSI Persistent Reservations, and it can provide reliable and
   efficient device discovery by using SCSI device identifiers instead
   of having to rely on probing all devices potentially attached to a
   client for a signature.  The document also optimizes the I/O path by
   reducing the size of the LAYOUTCOMMIT payload.
1.1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.2.  General Definitions

   The following definitions are provided for the purpose of providing

skipping to change at page 6, line 35
2.  SCSI Layout Description

2.1.  Background and Architecture

   The fundamental storage abstraction supported by SCSI storage devices
   is a Logical Unit (LU) consisting of a sequential series of fixed-
   size blocks.  This can be thought of as a logical disk; it may be
   realized by the storage system as a physical disk, a portion of a
   physical disk, or something more complex (e.g., concatenation,
   striping, RAID, and combinations thereof) involving multiple physical
   disks or portions thereof.  Logical units used as devices for pNFS
   SCSI layouts, and the SCSI initiators used by the pNFS metadata
   server and clients, MUST support SCSI Persistent Reservations.
   A pNFS layout for this SCSI class of storage is responsible for
   mapping from an NFS file (or portion of a file) to the blocks of
   storage volumes that contain the file.  The blocks are expressed as
   extents with 64-bit offsets and lengths using the existing NFSv4
   offset4 and length4 types.  Clients MUST be able to perform I/O to
   the block extents without affecting additional areas of storage
   (especially important for writes); therefore, extents MUST be aligned
   to 512-byte boundaries, and writable extents MUST be aligned to the
   block size used by the NFSv4 server in managing the actual file
skipping to change at page 7, line 26

   initially performed on the read-only storage, with writes going to
   the un-initialized storage.  After the first write that initializes
   the un-initialized storage, all reads are performed to that now-
   initialized writable storage, and the corresponding read-only storage
   is no longer used.
   The SCSI layout solution expands the security responsibilities of the
   pNFS clients, and there are a number of environments where the
   mandatory-to-implement security properties for NFS cannot be
   satisfied.  The additional security responsibilities of the client
   follow, and a full discussion is present in Section 4, "Security
   Considerations".
   o  Typically, SCSI storage devices provide access control mechanisms
      (e.g., Logical Unit Number (LUN) mapping and/or masking), which
      operate at the granularity of individual hosts, not individual
      blocks.  For this reason, block-based protection must be provided
      by the client software.

   o  Similarly, SCSI storage devices typically are not able to validate
      NFS locks that apply to file regions.  For instance, if a file is

skipping to change at page 8, line 25
   LAYOUT4_SCSI.  [RFC5661] specifies the loc_body structure as an XDR
   type "opaque".  The opaque layout is uninterpreted by the generic
   pNFS client layers, but obviously must be interpreted by the Layout
   Type implementation.

2.3.  GETDEVICEINFO

2.3.1.  Volume Identification
   SCSI targets implementing [SPC3] export unique LU names for each LU
   through the Device Identification VPD page (page code 0x83), which
   can be obtained using the INQUIRY command with the EVPD bit set to
   one.  This document uses a subset of this information to identify LUs
   backing pNFS SCSI layouts.  It is similar to the "Identification
   Descriptor Target Descriptor" specified in [SPC3], but limits the
   allowed values to those that uniquely identify a LU.  Device
   Identification VPD page descriptors used to identify LUs for use with
   pNFS SCSI layouts must adhere to the following restrictions:
   1.  The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is
       associated with the addressed logical unit).

   2.  The "DESIGNATOR TYPE" MUST be set to one of four values that are
       required for the mandatory logical unit name in [SPC3], as
       explicitly listed in the "pnfs_scsi_designator_type" enumeration:

          PS_DESIGNATOR_T10    T10 vendor ID based

          PS_DESIGNATOR_EUI64  EUI-64-based

          PS_DESIGNATOR_NAA    NAA

          PS_DESIGNATOR_NAME   SCSI name string

       Any other association or designator type MUST NOT be used.
   The "CODE SET" VPD page field is stored in the "sbv_code_set" field
   of the "pnfs_scsi_base_volume_info4" structure, the "DESIGNATOR TYPE"
   is stored in "sbv_designator_type", and the DESIGNATOR is stored in
   "sbv_designator".  Due to the use of an XDR array the "DESIGNATOR
   LENGTH" field does not need to be set separately.  Only certain
   combinations of "sbv_code_set" and "sbv_designator_type" are valid;
   please refer to [SPC3] for details, and note that ASCII may be used
   as the code set for UTF-8 text that contains only printable ASCII
   characters.  Note that a Device Identification VPD page MAY contain
   multiple descriptors with the same association, code set and
   designator type.  NFS clients thus MUST check all the descriptors for
   a possible match to "sbv_code_set", "sbv_designator_type" and
   "sbv_designator".
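   The descriptor scan described above can be sketched as follows.  The
   byte layout follows the Device Identification VPD page format in
   [SPC3]; the function name and calling convention are illustrative
   only and are not part of the protocol.

```python
# Hypothetical sketch: scan every descriptor of a Device Identification
# VPD page (page code 0x83) for one matching the volume identification
# from GETDEVICEINFO.  A page MAY contain several descriptors with the
# same association, code set, and designator type, so all descriptors
# must be checked, not just the first with matching type fields.

def find_designator(vpd_page, code_set, designator_type, designator):
    """Return True if any logical-unit descriptor in the VPD page
    matches the (sbv_code_set, sbv_designator_type, sbv_designator)
    triple from pnfs_scsi_base_volume_info4."""
    assert vpd_page[1] == 0x83          # Device Identification VPD page
    page_len = (vpd_page[2] << 8) | vpd_page[3]
    off = 4
    while off < 4 + page_len:
        d_code_set = vpd_page[off] & 0x0F
        association = (vpd_page[off + 1] >> 4) & 0x03
        d_type = vpd_page[off + 1] & 0x0F
        d_len = vpd_page[off + 3]
        d_value = vpd_page[off + 4:off + 4 + d_len]
        # The draft requires ASSOCIATION == 0 (addressed logical unit).
        if (association == 0 and d_code_set == code_set
                and d_type == designator_type and d_value == designator):
            return True
        off += 4 + d_len
    return False
```

   Note that the DESIGNATOR LENGTH is taken from each descriptor header,
   mirroring how the XDR array length stands in for it on the wire.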
   Storage devices such as storage arrays can have multiple physical
   network ports that need not be connected to a common network,
   resulting in a pNFS client having simultaneous multipath access to
   the same storage volumes via different ports on different networks.
   Selection of one or multiple ports to access the storage device is
   left up to the client.

   Additionally the server returns a Persistent Reservation key in the
   "sbv_pr_key" field.  See Section 2.4.8 for more details on the use of
skipping to change at page 10, line 21

   ///     PS_CODE_SET_UTF8        = 3
   /// };
   ///
   /// /*
   ///  * Designator types taken from SPC-3.
   ///  *
   ///  * Other values are allocated in SPC-3, but not mandatory to
   ///  * implement or aren't Logical Unit names.
   ///  */
   /// enum pnfs_scsi_designator_type {
   ///     PS_DESIGNATOR_T10       = 1,
   ///     PS_DESIGNATOR_EUI64     = 2,
   ///     PS_DESIGNATOR_NAA       = 3,
   ///     PS_DESIGNATOR_NAME      = 8
   /// };
   ///
   /// /*
   ///  * Logical Unit name + reservation key.
   ///  */
   /// struct pnfs_scsi_base_volume_info4 {
   ///     pnfs_scsi_code_set      sbv_code_set;
skipping to change at page 21, line 11

   device device ID that gives the client a reservation key to access
   the LU, and allows the MDS to revoke access to the logical unit at
   any time.
2.4.8.1.  PRs - Key Generation

   To allow fencing of individual systems, each system must use a unique
   Persistent Reservation key.  [SPC3] does not specify a way to
   generate keys.  This document assigns the burden of generating unique
   keys to the MDS, which must generate a key for itself before
   exporting a volume, and a key for each client that accesses SCSI
   layout volumes.  Individual keys for each volume that a client can
   access are permitted but not required.
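   A possible key-generation scheme is sketched below.  [SPC3] PR keys
   are 8-byte values, and the draft leaves the generation scheme to the
   MDS; embedding an MDS instance ID in the high bytes and a per-client
   counter in the low bytes is purely an assumption for illustration.

```python
# Hypothetical key allocator for an MDS.  Keys are 64-bit values; this
# sketch guarantees uniqueness across the MDS's own key and all client
# keys it hands out via sbv_pr_key.  The layout (instance ID in the
# high 32 bits, counter in the low 32 bits) is not mandated anywhere.

import itertools

class PrKeyAllocator:
    def __init__(self, mds_instance_id):
        self._high = (mds_instance_id & 0xFFFFFFFF) << 32
        self._next = itertools.count(1)   # low word 0 reserved for the MDS

    def mds_key(self):
        """Key the MDS registers for itself before exporting a volume."""
        return self._high

    def client_key(self):
        """Fresh unique key for a client (returned in sbv_pr_key)."""
        return self._high | next(self._next)
```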
2.4.8.2.  PRs - MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using PRs.  This is done
   by registering the reservation generated for the MDS with the device
   using the "PERSISTENT RESERVE OUT" command with a service action of
   "REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a
   service action of "RESERVE" and the type field set to 8h (Exclusive
   Access - All Registrants).  To make sure all I_T nexuses are

skipping to change at page 21, line 48

   performed for each initiator port.
   When a client stops using a device earlier returned by GETDEVICEINFO,
   it MUST unregister the earlier registered key by issuing a
   "PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
   with the "RESERVATION KEY" set to the earlier registered reservation
   key.
2.4.8.4.  PRs - Fencing Action

   In case of a non-responding client the MDS fences the client by
   issuing a "PERSISTENT RESERVE OUT" command with the service action
   set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key field
   set to the server's reservation key, the service action reservation
   key field set to the reservation key associated with the non-
   responding client, and the type field set to 8h (Exclusive Access -
   All Registrants).
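   The fencing command above can be sketched as a CDB plus parameter
   list.  The byte layouts follow [SPC3]; how the CDB is actually issued
   to the device (SG_IO, iSCSI, etc.) is outside the scope of the draft,
   and the function name is illustrative only.

```python
# Sketch of a PERSISTENT RESERVE OUT (opcode 5Fh, 10-byte CDB) with
# service action PREEMPT and the 24-byte parameter list from SPC-3.

import struct

PERSISTENT_RESERVE_OUT = 0x5F
SA_PREEMPT = 0x04                 # 0x05 would be PREEMPT AND ABORT
TYPE_EXCLUSIVE_ACCESS_AR = 0x08   # Exclusive Access - All Registrants

def preempt_cdb_and_params(mds_key, victim_key):
    """Build (cdb, parameter_list) fencing the client registered with
    victim_key, using the server's own reservation key mds_key."""
    cdb = struct.pack(">BBBBBIB",
                      PERSISTENT_RESERVE_OUT,
                      SA_PREEMPT,
                      TYPE_EXCLUSIVE_ACCESS_AR,  # scope 0 / type 8h
                      0, 0,                      # reserved
                      24,                        # parameter list length
                      0)                         # control byte
    params = struct.pack(">QQIBB2x",
                         mds_key,      # RESERVATION KEY (the server's)
                         victim_key,   # SERVICE ACTION RESERVATION KEY
                         0,            # obsolete (scope-specific address)
                         0,            # flags (SPEC_I_PT etc. unused)
                         0)            # reserved
    return cdb, params
```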
   After the MDS preempts a client, all client I/O to the LU fails.  The
   client should at this point return any layout that refers to the
   device ID that points to the LU.  Note that the client can
   distinguish I/O errors due to fencing from other errors based on the
   "RESERVATION CONFLICT" SCSI status.  Refer to [SPC3] for details.
2.4.8.5.  Client Recovery After a Fence Action

   A client that detects a "RESERVATION CONFLICT" SCSI status on the
   storage devices MUST commit all layouts that use the storage device
   through the MDS, return all outstanding layouts for the device,
   forget the device ID and unregister the reservation key.  Future
   GETDEVICEINFO calls may refer to the storage device again, in which
   case the client will perform a new registration based on the key
   provided (via sbv_pr_key) at that time.
2.5.  Crash Recovery Issues

   A critical requirement in crash recovery is that both the client and
   the server know when the other has failed.  Additionally, it is
   required that a client sees a consistent view of data across server
   restarts.  These requirements and a full discussion of crash recovery
   issues are covered in the "Crash Recovery" section of the NFSv4.1
   specification [RFC5661].  This document contains additional crash
   recovery material specific only to the SCSI layout.
skipping to change at page 23, line 52

   the layout.  As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the
   client SHOULD send future READ and WRITE requests directly to the
   server.  It is expected that a client will not cache the file's
   layoutunavailable state forever, particularly if the file is closed,
   and thus eventually, the client MAY reissue a LAYOUTGET operation.
2.8.  Volatile write caches

   Many storage devices implement volatile write caches that require an
   explicit flush to persist the data from write operations to stable
   storage.  Storage devices implementing [SBC3] indicate a volatile
   write cache by setting the WCE bit to 1 in the Caching mode page.
   When a volatile write cache is used, the pNFS server must ensure the
   volatile write cache has been committed to stable storage before the
   LAYOUTCOMMIT operation returns, by using one of the SYNCHRONIZE CACHE
   commands.
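   The two steps above can be sketched as follows, assuming the raw
   Caching mode page and CDB bytes are available; the helper names are
   illustrative and not taken from the draft or [SBC3].

```python
# Sketch: detect a volatile write cache via the WCE bit in the Caching
# mode page (page code 08h), and build a SYNCHRONIZE CACHE (10) CDB
# (opcode 35h) that flushes the whole LU (LBA 0 with a NUMBER OF
# LOGICAL BLOCKS of 0 means "all blocks on the medium").

import struct

def has_volatile_write_cache(caching_mode_page):
    """WCE is bit 2 of byte 2 of the Caching mode page ([SBC3])."""
    assert caching_mode_page[0] & 0x3F == 0x08    # Caching mode page
    return bool(caching_mode_page[2] & 0x04)

def synchronize_cache10_cdb():
    # opcode, flags, LBA, group number, number of blocks, control
    return struct.pack(">BBIBHB", 0x35, 0, 0, 0, 0, 0)
```

   A server would issue the flush before replying to LAYOUTCOMMIT, and
   could skip it entirely on devices where WCE is 0.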
3.  Enforcing NFSv4 Semantics

   The functionality provided by SCSI Persistent Reservations makes it
   possible for the MDS to control access by individual client machines
   to specific LUs.  Individual client machines may be allowed to or
   prevented from reading or writing to certain block devices.  Finer-
   grained access control methods are not generally available.
   For this reason, certain responsibilities for enforcing NFSv4
   semantics, including security and locking, are delegated to pNFS
   clients when SCSI layouts are being used.  The metadata server's role
   is to only grant layouts appropriately and the pNFS clients have to
   be trusted to only perform accesses allowed by the layout extents
   they currently hold (e.g., and not access storage for files on which
   a layout extent is not held).  In general, the server will not be
   able to prevent a client that holds a layout for a file from
   accessing parts of the physical disk not covered by the layout.
   Similarly, the server will not be able to prevent a client from
   accessing blocks covered by a layout that it has already returned.
   The pNFS client must respect the layout model for this mapping type
   to appropriately respect NFSv4 semantics.
   Furthermore, there is no way for the storage to determine the
   specific NFSv4 entity (principal, openowner, lockowner) on whose
   behalf the IO operation is being done.  This fact may limit the
   functionality to be supported and require the pNFS client to
   implement server policies other than those describable by layouts.
   In cases in which layouts previously granted become invalid, the
   server has the option of recalling them.  In situations in which
   communication difficulties prevent this from happening, layouts may
   be revoked by the server.  This revocation is accompanied by changes
   in persistent reservation which have the effect of preventing SCSI
   access to the LUs in question by the client.

3.1.  Use of Open Stateids
The effective implementation of these NFSv4 semantic constraints is
complicated by the different granularities of the actors for the
different types of functionality to be enforced:

o To enforce security constraints for particular principals.

o To enforce locking constraints for particular owners (openowners
and lockowners).
Fundamental to enforcing both of these sorts of constraints is the
principle that a pNFS client must not issue a SCSI IO operation
unless it possesses both:
o A valid open stateid for the file in question that allows IO of
the type being performed and that is associated with the openowner
and principal on whose behalf the IO is to be done.
o A valid layout stateid for the file in question that covers the
byte range on which the IO is to be done and that allows IO of
that type to be done.
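The two requirements above can be illustrated with a short, non-
normative Python sketch of the client-side check (all class and field
names here are hypothetical; the draft itself defines no such data
structures):

```python
# Non-normative sketch: the precondition a pNFS SCSI client applies
# before issuing a SCSI IO directly to the storage device.
from dataclasses import dataclass

READ, WRITE = "READ", "WRITE"

@dataclass
class OpenStateid:
    valid: bool
    allows: set        # IO types granted at OPEN, e.g. {READ, WRITE}

@dataclass
class LayoutStateid:
    valid: bool
    iomode: set        # IO types the held layout extent permits
    offset: int        # start of the byte range the extent covers
    length: int        # length of that byte range

def may_issue_scsi_io(open_sid, layout_sid, io_type, offset, length):
    """True only if both stateids license this IO: a valid open
    stateid allowing IO of this type, and a valid layout stateid
    covering the byte range and allowing IO of this type."""
    if not (open_sid.valid and io_type in open_sid.allows):
        return False
    if not (layout_sid.valid and io_type in layout_sid.iomode):
        return False
    return (offset >= layout_sid.offset and
            offset + length <= layout_sid.offset + layout_sid.length)
```

If either check fails, the client must fall back to READs and WRITEs
sent through the MDS rather than touch the storage device directly.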
As a result, if the equivalent of IO with an anonymous or write-
bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI
layout type.  The client MAY attempt such IO using READs and WRITEs
that do not use pNFS and are directed to the MDS.
When open stateids are revoked, due to lease expiration or any form
of administrative revocation, the server MUST recall all layouts that
allow IO to be done on any of the files for which open revocation
happens. When there is a failure to successfully return those
layouts, the client MUST be fenced.
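The recall-then-fence rule above can be sketched as follows (a non-
normative illustration; revoke_opens, recall, and fence are invented
names standing in for server internals, and for SCSI layouts fencing
means removing the client's persistent reservation registration):

```python
# Non-normative sketch: on open-stateid revocation, recall every
# layout allowing IO on the file and fence clients that fail to
# return theirs.  recall(client) stands in for CB_LAYOUTRECALL and
# reports whether the layout was returned; fence(client) stands in
# for removing that client's persistent reservation registration.

def revoke_opens(layouts_by_file, recall, fence, filehandle):
    for client in layouts_by_file.get(filehandle, []):
        if not recall(client):
            fence(client)          # MUST fence on failed return
    layouts_by_file[filehandle] = []
```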
3.2. Enforcing Security Restrictions
The restriction noted above provides adequate enforcement of
appropriate security restrictions when the principal issuing the IO
is the same as the one that opened the file.  The server is
responsible for
checking that the IO mode requested by the open is allowed for the
principal doing the OPEN. If the correct sort of IO is done on
behalf of the same principal, then the security restriction is
thereby enforced.
If IO is done by a principal different from the one that opened the
file, the client SHOULD send the IO to be performed by the metadata
server rather than doing it directly to the storage device.
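As a non-normative illustration, the resulting client-side routing
decision might look like the following (the function and label names
are invented for this sketch):

```python
# Non-normative sketch: route IO to the storage device only when the
# issuing principal matches the one that opened the file and a
# covering layout is held; otherwise fall back to the MDS.

def choose_io_path(open_principal, io_principal, holds_covering_layout):
    if io_principal != open_principal:
        return "MDS"    # different principal: regular NFSv4 READ/WRITE
    if not holds_covering_layout:
        return "MDS"    # no layout covering the range
    return "SCSI"       # direct IO to the storage device
```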
3.3. Enforcing Locking Restrictions
Mandatory enforcement of whole-file locking by means of share
reservations is provided when the pNFS client obeys the requirement
set forth in Section 3.1 above.  Since performing IO requires a valid
open stateid, an IO that violates an existing share reservation would
only be possible when the server allows conflicting open stateids to
exist.
The nature of the SCSI layout type is such that implementation/
enforcement of mandatory byte-range locks is very difficult.  Given
that layouts
are granted to clients rather than owners, the pNFS client is in no
position to successfully arbitrate among multiple lockowners on the
same client. Suppose lockowner A is doing a write and, while the IO
is pending, lockowner B requests a mandatory byte-range lock for a byte
range potentially overlapping the pending IO. In such a situation,
the lock request cannot be granted while the IO is pending. In a
non-pNFS environment, the server would have to wait for the pending
IO to complete before granting the mandatory byte-range lock.  In the
pNFS
environment the server does not issue the IO and is thus in no
position to wait for its completion. The server may recall such
layouts but in doing so, it has no way of distinguishing those being
used by lockowners A and B, making it difficult to allow B to perform
IO while forbidding A from doing so.  Given this fact, the MDS needs
to successfully recall all layouts that overlap the range being
locked before returning a successful response to the LOCK request.
While the lock is in effect, the server SHOULD respond to requests
for layouts which overlap a currently locked area with
NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic a server
MAY do this for all layout requests on the file in question as long
as there are any byte-range locks in effect.
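A non-normative sketch of this server behavior follows (the helper
names are invented; NFS4_OK and NFS4ERR_LAYOUTUNAVAILABLE are the
numeric status codes defined by RFC 5661):

```python
# Non-normative sketch: recall all layouts overlapping a requested
# mandatory byte-range lock before granting it, and, while any
# byte-range lock is held, fail layout requests for the file with
# NFS4ERR_LAYOUTUNAVAILABLE (the simplification the text permits).

NFS4_OK = 0
NFS4ERR_LAYOUTUNAVAILABLE = 10059   # status codes from RFC 5661

def overlaps(a_off, a_len, b_off, b_len):
    return a_off < b_off + b_len and b_off < a_off + a_len

def grant_lock(layouts, recall, lock_off, lock_len):
    """layouts: outstanding (offset, length) extents on the file.
    recall(extent) -> True if the layout was successfully returned."""
    for extent in list(layouts):
        if overlaps(extent[0], extent[1], lock_off, lock_len):
            if not recall(extent):
                return False       # cannot grant until recall succeeds
            layouts.remove(extent)
    return True

def layoutget_status(file_has_byte_range_locks):
    return (NFS4ERR_LAYOUTUNAVAILABLE if file_has_byte_range_locks
            else NFS4_OK)
```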
Given these issues, it may be difficult for servers supporting
mandatory byte-range locks to also support SCSI layouts.  Servers can
support advisory byte-range locks instead.  The NFSv4 protocol
currently has no way of determining whether byte-range lock support
on a particular file system will be mandatory or advisory, except by
trying an operation that would conflict if mandatory locking is in
effect.  Therefore, to avoid confusion, servers SHOULD NOT switch
between mandatory and advisory byte-range locking based on whether
any SCSI layouts have been obtained or whether a client that has
obtained a SCSI layout has requested a byte-range lock.
4. Security Considerations
Access to SCSI storage devices is logically at a lower layer of the
I/O stack than NFSv4, and hence NFSv4 security is not directly
applicable to protocols that access such storage directly.  Depending
on the protocol, some of the security mechanisms provided by NFSv4
(e.g., encryption, cryptographic integrity) may not be available or
may be provided via different means.  At one extreme, pNFS with SCSI
layouts can be used with storage access protocols (e.g., Serial
Attached SCSI [SAS3]) that provide essentially no security
functionality.  At the other extreme, pNFS may be used with storage
protocols such as iSCSI [RFC7143] that can provide significant
security functionality.  It is the responsibility of those
administering and deploying pNFS with a SCSI storage access protocol
to ensure that appropriate protection is provided to that protocol
(physical security is a common means for protocols not based on IP).
In environments where the security requirements for the storage
protocol cannot be met, pNFS SCSI layouts SHOULD NOT be used.
When security is available for a storage protocol, it is generally at
a different granularity and with a different notion of identity than
NFSv4 (e.g., NFSv4 controls user access to files, iSCSI controls
initiator access to volumes).  The responsibility for enforcing
appropriate correspondences between these security layers is placed
upon the pNFS client.  As with the issues in the first paragraph of
this section, in environments where the security requirements are
such that client-side protection from access to storage outside of
the layout is not sufficient, pNFS SCSI layouts SHOULD NOT be used.
5.  IANA Considerations

IANA is requested to assign a new pNFS layout type in the pNFS Layout
Types Registry as follows (the value 5 is suggested):

Layout Type Name: LAYOUT4_SCSI
Value: 0x00000005
RFC: RFCTBD10
How: L (new layout type)
Minor Versions: 1
6.  Normative References

[LEGAL]    IETF Trust, "Legal Provisions Relating to IETF Documents",
           November 2008, <http://trustee.ietf.org/docs/
           IETF-Trust-License-Policy.pdf>.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
           STD 67, RFC 4506, May 2006.

[RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
           "Network File System (NFS) Version 4 Minor Version 1
           External Data Representation Standard (XDR) Description",
           RFC 5662, January 2010.

[RFC5663]  Black, D., Ed., Fridella, S., Ed., and J. Glasgow, Ed.,
           "Parallel NFS (pNFS) Block/Volume Layout", RFC 5663,
           January 2010.

[RFC6688]  Black, D., Ed., Glasgow, J., and S. Faibish, "Parallel NFS
           (pNFS) Block Disk Protection", RFC 6688, July 2012.

[RFC7143]  Chadalapaka, M., Meth, K., and D. Black, "Internet Small
           Computer System Interface (iSCSI) Protocol
           (Consolidated)", RFC 7143, April 2014.

[SAM-4]    INCITS Technical Committee T10, "SCSI Architecture Model -
           4 (SAM-4)", ANSI INCITS 447-2008, ISO/IEC 14776-414, 2008.

[SAS3]     INCITS Technical Committee T10, "Serial Attached SCSI-3",
           ANSI INCITS 519-2014, ISO/IEC 14776-154, 2014.

[SBC3]     INCITS Technical Committee T10, "SCSI Block Commands-3",
           ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.

[SPC3]     INCITS Technical Committee T10, "SCSI Primary Commands-3",
           ANSI INCITS 408-2005, ISO/IEC 14776-453, 2005.
Appendix A.  Acknowledgments

Large parts of this document were copied verbatim, and others were
inspired by [RFC5663].  Thanks to David Black, Stephen Fridella and
Jason Glasgow for their work on the pNFS block/volume layout
protocol.

David Black, Robert Elliott and Tom Haynes provided a thorough
review of early drafts of this document, and their input led to the
current form of the document.
David Noveck provided ample feedback to earlier drafts of this
document and wrote the section on enforcing NFSv4 semantics.
Appendix B.  RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this
document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]

Author's Address