draft-ietf-nfsv4-scsi-layout-05.txt   draft-ietf-nfsv4-scsi-layout-06.txt 
NFSv4 C. Hellwig NFSv4 C. Hellwig
Internet-Draft December 03, 2015 Internet-Draft
Intended status: Standards Track Intended status: Standards Track June 27, 2016
Expires: June 5, 2016 Expires: December 29, 2016
Parallel NFS (pNFS) SCSI Layout Parallel NFS (pNFS) SCSI Layout
draft-ietf-nfsv4-scsi-layout-05.txt draft-ietf-nfsv4-scsi-layout-06.txt
Abstract Abstract
The Parallel Network File System (pNFS) allows a separation between The Parallel Network File System (pNFS) allows a separation between
the metadata (onto a metadata server) and data (onto a storage the metadata (onto a metadata server) and data (onto a storage
device) for a file. The SCSI Layout Type is defined in this document device) for a file. The SCSI Layout Type is defined in this document
as an extension to pNFS to allow the use of SCSI-based block storage as an extension to pNFS to allow the use of SCSI-based block storage
devices. devices.
Status of this Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 5, 2016. This Internet-Draft will expire on December 29, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Conventions Used in This Document . . . . . . . . . . . . 4 1.1. Conventions Used in This Document . . . . . . . . . . . . 4
1.2. General Definitions . . . . . . . . . . . . . . . . . . . 4 1.2. General Definitions . . . . . . . . . . . . . . . . . . . 4
1.3. Code Components Licensing Notice . . . . . . . . . . . . . 4 1.3. Code Components Licensing Notice . . . . . . . . . . . . 4
1.4. XDR Description . . . . . . . . . . . . . . . . . . . . . 4 1.4. XDR Description . . . . . . . . . . . . . . . . . . . . . 4
2. SCSI Layout Description . . . . . . . . . . . . . . . . . . . 6 2. SCSI Layout Description . . . . . . . . . . . . . . . . . . . 6
2.1. Background and Architecture . . . . . . . . . . . . . . . 6 2.1. Background and Architecture . . . . . . . . . . . . . . . 6
2.2. layouttype4 . . . . . . . . . . . . . . . . . . . . . . . 7 2.2. layouttype4 . . . . . . . . . . . . . . . . . . . . . . . 7
2.3. GETDEVICEINFO . . . . . . . . . . . . . . . . . . . . . . 8 2.3. GETDEVICEINFO . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1. Volume Identification . . . . . . . . . . . . . . . . 8 2.3.1. Volume Identification . . . . . . . . . . . . . . . . 8
2.3.2. Volume Topology . . . . . . . . . . . . . . . . . . . 9 2.3.2. Volume Topology . . . . . . . . . . . . . . . . . . . 9
2.4. Data Structures: Extents and Extent Lists . . . . . . . . 12 2.4. Data Structures: Extents and Extent Lists . . . . . . . . 12
2.4.1. Layout Requests and Extent Lists . . . . . . . . . . . 14 2.4.1. Layout Requests and Extent Lists . . . . . . . . . . 14
2.4.2. Layout Commits . . . . . . . . . . . . . . . . . . . . 15 2.4.2. Layout Commits . . . . . . . . . . . . . . . . . . . 15
2.4.3. Layout Returns . . . . . . . . . . . . . . . . . . . . 16 2.4.3. Layout Returns . . . . . . . . . . . . . . . . . . . 16
2.4.4. Layout Revocation . . . . . . . . . . . . . . . . . . 16 2.4.4. Layout Revocation . . . . . . . . . . . . . . . . . . 16
2.4.5. Client Copy-on-Write Processing . . . . . . . . . . . 17 2.4.5. Client Copy-on-Write Processing . . . . . . . . . . . 17
2.4.6. Extents are Permissions . . . . . . . . . . . . . . . 18 2.4.6. Extents are Permissions . . . . . . . . . . . . . . . 18
2.4.7. Partial-Bock Updates . . . . . . . . . . . . . . . . . 19 2.4.7. Partial-Block Updates . . . . . . . . . . . . . . . . 19
2.4.8. End-of-file Processing . . . . . . . . . . . . . . . . 19 2.4.8. End-of-file Processing . . . . . . . . . . . . . . . 19
2.4.9. Layout Hints . . . . . . . . . . . . . . . . . . . . . 20 2.4.9. Layout Hints . . . . . . . . . . . . . . . . . . . . 20
2.4.10. Client Fencing . . . . . . . . . . . . . . . . . . . . 20 2.4.10. Client Fencing . . . . . . . . . . . . . . . . . . . 20
2.5. Crash Recovery Issues . . . . . . . . . . . . . . . . . . 22 2.5. Crash Recovery Issues . . . . . . . . . . . . . . . . . . 22
2.6. Recalling Resources: CB_RECALL_ANY . . . . . . . . . . . . 22 2.6. Recalling Resources: CB_RECALL_ANY . . . . . . . . . . . 22
2.7. Transient and Permanent Errors . . . . . . . . . . . . . . 23 2.7. Transient and Permanent Errors . . . . . . . . . . . . . 23
2.8. Volatile write caches . . . . . . . . . . . . . . . . . . 23 2.8. Volatile write caches . . . . . . . . . . . . . . . . . . 23
3. Enforcing NFSv4 Semantics . . . . . . . . . . . . . . . . . . 24 3. Enforcing NFSv4 Semantics . . . . . . . . . . . . . . . . . . 24
3.1. Use of Open Stateids . . . . . . . . . . . . . . . . . . . 24 3.1. Use of Open Stateids . . . . . . . . . . . . . . . . . . 24
3.2. Enforcing Security Restrictions . . . . . . . . . . . . . 25 3.2. Enforcing Security Restrictions . . . . . . . . . . . . . 25
3.3. Enforcing Locking Restrictions . . . . . . . . . . . . . . 25 3.3. Enforcing Locking Restrictions . . . . . . . . . . . . . 25
4. Security Considerations . . . . . . . . . . . . . . . . . . . 26 4. Security Considerations . . . . . . . . . . . . . . . . . . . 26
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27
6. Normative References . . . . . . . . . . . . . . . . . . . . . 27 6. Normative References . . . . . . . . . . . . . . . . . . . . 27
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 28 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 28
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 28 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 28
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 28 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 28
1. Introduction 1. Introduction
Figure 1 shows the overall architecture of a Parallel NFS (pNFS) Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
system: system:
+-----------+ +-----------+
|+-----------+ +-----------+ |+-----------+ +-----------+
||+-----------+ | | ||+-----------+ | |
||| | NFSv4.1 + pNFS | | ||| | NFSv4.1 + pNFS | |
skipping to change at page 4, line 42 skipping to change at page 4, line 36
Server The "server" is the entity responsible for coordinating Server The "server" is the entity responsible for coordinating
client access to a set of file systems and is identified by a client access to a set of file systems and is identified by a
server owner. server owner.
1.3. Code Components Licensing Notice 1.3. Code Components Licensing Notice
The external data representation (XDR) description and scripts for The external data representation (XDR) description and scripts for
extracting the XDR description are Code Components as described in extracting the XDR description are Code Components as described in
Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL].
These Code Components are licensed according to the terms of Section These Code Components are licensed according to the terms of
4 of "Legal Provisions Relating to IETF Documents". Section 4 of "Legal Provisions Relating to IETF Documents".
1.4. XDR Description 1.4. XDR Description
This document contains the XDR [RFC4506] description of the NFSv4.1 This document contains the XDR [RFC4506] description of the NFSv4.1
SCSI layout protocol. The XDR description is embedded in this SCSI layout protocol. The XDR description is embedded in this
document in a way that makes it simple for the reader to extract into document in a way that makes it simple for the reader to extract into
a ready-to-compile form. The reader can feed this document into the a ready-to-compile form. The reader can feed this document into the
following shell script to produce the machine readable XDR following shell script to produce the machine readable XDR
description of the NFSv4.1 SCSI layout: description of the NFSv4.1 SCSI layout:
skipping to change at page 6, line 34 skipping to change at page 6, line 29
/// ///
/// %#include "nfsv41.h" /// %#include "nfsv41.h"
/// ///
2. SCSI Layout Description 2. SCSI Layout Description
2.1. Background and Architecture 2.1. Background and Architecture
The fundamental storage model supported by SCSI storage devices is a The fundamental storage model supported by SCSI storage devices is a
Logical Unit (LU) consisting of a sequential series of fixed-size Logical Unit (LU) consisting of a sequential series of fixed-size
blocks. Logical units used as devices for NFS scsi layouts, and the blocks. Logical units used as devices for NFS SCSI layouts, and the
SCSI initiators used for the pNFS Metadata Server and clients MUST SCSI initiators used for the pNFS Metadata Server and clients MUST
support SCSI persistent reservations. support SCSI persistent reservations.
A pNFS layout for this SCSI class of storage is responsible for A pNFS layout for this SCSI class of storage is responsible for
mapping from an NFS file (or portion of a file) to the blocks of mapping from an NFS file (or portion of a file) to the blocks of
storage volumes that contain the file. The blocks are expressed as storage volumes that contain the file. The blocks are expressed as
extents with 64-bit offsets and lengths using the existing NFSv4 extents with 64-bit offsets and lengths using the existing NFSv4
offset4 and length4 types. Clients MUST be able to perform I/O to offset4 and length4 types. Clients MUST be able to perform I/O to
the block extents without affecting additional areas of storage the block extents without affecting additional areas of storage
(especially important for writes); therefore, extents MUST be aligned (especially important for writes); therefore, extents MUST be aligned
skipping to change at page 8, line 38 skipping to change at page 8, line 25
backing pNFS SCSI layouts. It is similar to the "Identification backing pNFS SCSI layouts. It is similar to the "Identification
Descriptor Target Descriptor" specified in [SPC4], but limits the Descriptor Target Descriptor" specified in [SPC4], but limits the
allowed values to those that uniquely identify a LU. Device allowed values to those that uniquely identify a LU. Device
Identification VPD page descriptors used to identify LUs for use with Identification VPD page descriptors used to identify LUs for use with
pNFS SCSI layouts must adhere to the following restrictions: pNFS SCSI layouts must adhere to the following restrictions:
1. The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is 1. The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is
associated with the addressed logical unit). associated with the addressed logical unit).
2. The "DESIGNATOR TYPE" MUST be set to one of four values that are 2. The "DESIGNATOR TYPE" MUST be set to one of four values that are
required for the mandatory logical unit name in [SPC4], as required for the mandatory logical unit name in section 7.7.3 of
explicitly listed in the "pnfs_scsi_designator_type" enumeration: [SPC4], as explicitly listed in the "pnfs_scsi_designator_type"
enumeration:
PS_DESIGNATOR_T10 T10 vendor ID based PS_DESIGNATOR_T10 T10 vendor ID based
PS_DESIGNATOR_EUI64 EUI-64-based PS_DESIGNATOR_EUI64 EUI-64-based
PS_DESIGNATOR_NAA NAA PS_DESIGNATOR_NAA NAA
PS_DESIGNATOR_NAME SCSI name string PS_DESIGNATOR_NAME SCSI name string
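The two restrictions above can be sketched as a small check. This is a minimal illustration, not part of the protocol definition: the function name and dictionary are hypothetical, and only the numeric codes are taken from the text and from the "pnfs_scsi_designator_type" enumeration.

```python
# Hypothetical helper; only the numeric codes come from the document.
ALLOWED_DESIGNATOR_TYPES = {
    1: "PS_DESIGNATOR_T10",    # T10 vendor ID based
    2: "PS_DESIGNATOR_EUI64",  # EUI-64-based
    3: "PS_DESIGNATOR_NAA",    # NAA
    8: "PS_DESIGNATOR_NAME",   # SCSI name string
}


def designator_usable(association: int, designator_type: int) -> bool:
    """Return True if a Device Identification VPD page descriptor
    satisfies both restrictions: ASSOCIATION is 0 (addressed logical
    unit) and DESIGNATOR TYPE is one of the four listed values."""
    return association == 0 and designator_type in ALLOWED_DESIGNATOR_TYPES
```

For example, designator_usable(0, 3) accepts an NAA designator for the addressed LU, while a descriptor with ASSOCIATION 1 is rejected regardless of its type.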
Any other association or designator type MUST NOT be used. Use Any other association or designator type MUST NOT be used. Use
skipping to change at page 9, line 32 skipping to change at page 9, line 20
the same storage volumes via different ports on different networks. the same storage volumes via different ports on different networks.
Selection of one or multiple ports to access the storage device is Selection of one or multiple ports to access the storage device is
left up to the client. left up to the client.
Additionally the server returns a Persistent Reservation key in the Additionally the server returns a Persistent Reservation key in the
"sbv_pr_key" field. See Section 2.4.10 for more details on the use "sbv_pr_key" field. See Section 2.4.10 for more details on the use
of Persistent Reservations. of Persistent Reservations.
2.3.2. Volume Topology 2.3.2. Volume Topology
The pNFS SCSI layout volume topology is expressed as an arbitrary The pNFS SCSI layout volume topology is expressed in terms of the
combination of base volume types enumerated in the following data volume types described below. The individual components of the
structures. The individual components of the topology are contained topology are contained in an array and components may refer to other
in an array and components may refer to other components by using components by using array indices.
array indices.
/// enum pnfs_scsi_volume_type4 { /// enum pnfs_scsi_volume_type4 {
/// PNFS_SCSI_VOLUME_SLICE = 1, /* volume is a slice of /// PNFS_SCSI_VOLUME_SLICE = 1, /* volume is a slice of
/// another volume */ /// another volume */
/// PNFS_SCSI_VOLUME_CONCAT = 2, /* volume is a /// PNFS_SCSI_VOLUME_CONCAT = 2, /* volume is a
/// concatenation of /// concatenation of
/// multiple volumes */ /// multiple volumes */
/// PNFS_SCSI_VOLUME_STRIPE = 3, /* volume is striped across /// PNFS_SCSI_VOLUME_STRIPE = 3, /* volume is striped across
/// multiple volumes */ /// multiple volumes */
/// PNFS_SCSI_VOLUME_BASE = 4 /* volume maps to a single /// PNFS_SCSI_VOLUME_BASE = 4 /* volume maps to a single
/// LU */ /// LU */
/// }; /// };
/// ///
/// /* /// /*
/// * Code sets from SPC-3. /// * Code sets from SPC-4.
/// */ /// */
/// enum pnfs_scsi_code_set { /// enum pnfs_scsi_code_set {
/// PS_CODE_SET_BINARY = 1, /// PS_CODE_SET_BINARY = 1,
/// PS_CODE_SET_ASCII = 2, /// PS_CODE_SET_ASCII = 2,
/// PS_CODE_SET_UTF8 = 3 /// PS_CODE_SET_UTF8 = 3
/// }; /// };
/// ///
/// /* /// /*
/// * Designator types taken from SPC-3. /// * Designator types taken from SPC-4.
/// * /// *
/// * Other values are allocated in SPC-3, but not mandatory to /// * Other values are allocated in SPC-4, but not mandatory to
/// * implement or aren't Logical Unit names. /// * implement or aren't Logical Unit names.
/// */ /// */
/// enum pnfs_scsi_designator_type { /// enum pnfs_scsi_designator_type {
/// PS_DESIGNATOR_T10 = 1, /// PS_DESIGNATOR_T10 = 1,
/// PS_DESIGNATOR_EUI64 = 2, /// PS_DESIGNATOR_EUI64 = 2,
/// PS_DESIGNATOR_NAA = 3, /// PS_DESIGNATOR_NAA = 3,
/// PS_DESIGNATOR_NAME = 8 /// PS_DESIGNATOR_NAME = 8
/// }; /// };
/// ///
/// /* /// /*
skipping to change at page 16, line 42 skipping to change at page 16, line 40
Layouts may be unilaterally revoked by the server, due to the Layouts may be unilaterally revoked by the server, due to the
client's lease time expiring, or the client failing to return a client's lease time expiring, or the client failing to return a
layout which has been recalled in a timely manner. For the SCSI layout which has been recalled in a timely manner. For the SCSI
layout type this is accomplished by fencing off the client from layout type this is accomplished by fencing off the client from
access to storage as described in Section 2.4.10. When this is done, access to storage as described in Section 2.4.10. When this is done,
it is necessary that all I/Os issued by the fenced-off client be it is necessary that all I/Os issued by the fenced-off client be
rejected by the storage. This includes any in-flight I/Os that the rejected by the storage. This includes any in-flight I/Os that the
client issued before the layout was revoked. client issued before the layout was revoked.
Note that the granularity of this operation can only be at the Note that the granularity of this operation can only be at the host/
host/LU level. Thus, if one of a client's layouts is unilaterally LU level. Thus, if one of a client's layouts is unilaterally revoked
revoked by the server, it will effectively render useless *all* of by the server, it will effectively render useless *all* of the
the client's layouts for files located on the storage units client's layouts for files located on the storage units comprising
comprising the volume. This may render useless the client's layouts the volume. This may render useless the client's layouts for files
for files in other file systems. See Section 2.4.10.5 for a in other file systems. See Section 2.4.10.5 for a discussion of
discussion of recovery from fencing. recovery from fencing.
2.4.5. Client Copy-on-Write Processing 2.4.5. Client Copy-on-Write Processing
Copy-on-write is a mechanism used to support file and/or file system Copy-on-write is a mechanism used to support file and/or file system
snapshots. When writing to unaligned regions, or to regions smaller snapshots. When writing to unaligned regions, or to regions smaller
than a file system block, the writer must copy the portions of the than a file system block, the writer must copy the portions of the
original file data to a new location on disk. This behavior can original file data to a new location on disk. This behavior can
either be implemented on the client or the server. The paragraphs either be implemented on the client or the server. The paragraphs
below describe how a pNFS SCSI layout client implements access to a below describe how a pNFS SCSI layout client implements access to a
file that requires copy-on-write semantics. file that requires copy-on-write semantics.
skipping to change at page 19, line 8 skipping to change at page 19, line 8
layouts, I/Os will be issued from the clients that hold the layouts layouts, I/Os will be issued from the clients that hold the layouts
directly to the storage devices that host the data. These devices directly to the storage devices that host the data. These devices
have no knowledge of files, mandatory locks, or share reservations, have no knowledge of files, mandatory locks, or share reservations,
and are not in a position to enforce such restrictions. For this and are not in a position to enforce such restrictions. For this
reason the NFSv4 server MUST NOT grant layouts that conflict with reason the NFSv4 server MUST NOT grant layouts that conflict with
mandatory locks or share reservations. Further, if a conflicting mandatory locks or share reservations. Further, if a conflicting
mandatory lock request or a conflicting open request arrives at the mandatory lock request or a conflicting open request arrives at the
server, the server MUST recall the part of the layout in conflict server, the server MUST recall the part of the layout in conflict
with the request before granting the request. with the request before granting the request.
2.4.7. Partial-Bock Updates 2.4.7. Partial-Block Updates
SCSI storage devices do not provide byte granularity access and can SCSI storage devices do not provide byte granularity access and can
only perform read and write operations atomically on a block only perform read and write operations atomically on a block
granularity. WRITES to SCSI storage devices thus require read- granularity. WRITES to SCSI storage devices thus require read-
modify-write cycles to write data smaller than the block size or modify-write cycles to write data smaller than the block size or
which is otherwise not block-aligned. Write operations from multiple which is otherwise not block-aligned. Write operations from multiple
clients to the same block can thus lead to data corruption even if clients to the same block can thus lead to data corruption even if
the byte range written by the applications does not overlap. When the byte range written by the applications does not overlap. When
there are multiple clients who wish to access the same block, a pNFS there are multiple clients who wish to access the same block, a pNFS
server MUST avoid these conflicts by implementing a concurrency server MUST avoid these conflicts by implementing a concurrency
skipping to change at page 20, line 30 skipping to change at page 20, line 30
The pNFS SCSI protocol must handle situations in which a system The pNFS SCSI protocol must handle situations in which a system
failure, typically a network connectivity issue, requires the server failure, typically a network connectivity issue, requires the server
to unilaterally revoke extents from a client after the client fails to unilaterally revoke extents from a client after the client fails
to respond to a CB_LAYOUTRECALL request. This is implemented by to respond to a CB_LAYOUTRECALL request. This is implemented by
fencing off a non-responding client from access to the storage fencing off a non-responding client from access to the storage
device. device.
The pNFS SCSI protocol implements fencing using Persistent The pNFS SCSI protocol implements fencing using Persistent
Reservations (PRs), similar to the fencing method used by existing Reservations (PRs), similar to the fencing method used by existing
shared disk file systems. By placing a PR of type "Exclusive Access shared disk file systems. By placing a PR of type "Exclusive Access
- All Registrants" on each SCSI LU exported to pNFS clients the MDS - Registrants Only" on each SCSI LU exported to pNFS clients the MDS
prevents access from any client that does not have an outstanding prevents access from any client that does not have an outstanding
device ID that gives the client a reservation key to access device ID that gives the client a reservation key to access
the LU, and allows the MDS to revoke access to the logical unit at any the LU, and allows the MDS to revoke access to the logical unit at any
time. time.
2.4.10.1. PRs - Key Generation 2.4.10.1. PRs - Key Generation
To allow fencing individual systems, each system must use a unique To allow fencing individual systems, each system must use a unique
Persistent Reservation key. [SPC4] does not specify a way to Persistent Reservation key. [SPC4] does not specify a way to
generate keys. This document assigns the burden to generate unique generate keys. This document assigns the burden to generate unique
keys to the MDS, which must generate a key for itself before keys to the MDS, which must generate a key for itself before
exporting a volume, and a key for each client that accesses a scsi exporting a volume, and a key for each client that accesses SCSI
layout volumes. Individual keys for each volume that a client can layout volumes. Individual keys for each volume that a client can
access are permitted but not required. access are permitted but not required.
2.4.10.2. PRs - MDS Registration and Reservation 2.4.10.2. PRs - MDS Registration and Reservation
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
MDS needs to prepare the volume for fencing using PRs. This is done MDS needs to prepare the volume for fencing using PRs. This is done
by registering the reservation key generated for the MDS with the device by registering the reservation key generated for the MDS with the device
using the "PERSISTENT RESERVE OUT" command with a service action of using the "PERSISTENT RESERVE OUT" command with a service action of
"REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a "REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a
service action of "RESERVE" and the type field set to 8h (Exclusive service action of "RESERVE" and the type field set to 8h (Exclusive
Access - All Registrants). To make sure all I_T nexuses are Access - Registrants Only). To make sure all I_T nexuses (see
registered, the MDS SHOULD set the "All Target Ports" (ALL_TG_PT) bit section 3.1.45 of [SAM-4]) are registered, the MDS SHOULD set the
when registering the key, or otherwise ensure the registration is "All Target Ports" (ALL_TG_PT) bit when registering the key, or
performed for each initiator port. otherwise ensure the registration is performed for each initiator
port.
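The REGISTER-then-RESERVE sequence above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the byte layout of the CDB and the 24-byte basic parameter list is an assumption taken from the PERSISTENT RESERVE OUT definition in [SPC4], and the type code 8h is simply the value given in the text. Sending the commands to a device (for example through an operating-system pass-through interface) is out of scope.

```python
import struct

PR_OUT_OPCODE = 0x5F   # PERSISTENT RESERVE OUT operation code
SA_REGISTER = 0x00     # service action REGISTER
SA_RESERVE = 0x01      # service action RESERVE
PR_TYPE_FROM_TEXT = 0x8  # type field value 8h, as given in the text
ALL_TG_PT = 0x04       # assumed position: byte 20, bit 2 of parameter list
PARAM_LEN = 24         # length of the basic parameter list


def pr_out_cdb(service_action: int, pr_type: int = 0) -> bytes:
    """Build a 10-byte CDB: opcode, service action, scope/type
    (scope 0 = logical unit), 2 reserved bytes, 4-byte parameter
    list length, control byte."""
    return struct.pack(">BBBHIB", PR_OUT_OPCODE, service_action & 0x1F,
                       pr_type & 0x0F, 0, PARAM_LEN, 0)


def pr_out_params(reservation_key: int, sa_key: int,
                  all_tg_pt: bool = False) -> bytes:
    """Build the 24-byte parameter list: RESERVATION KEY,
    SERVICE ACTION RESERVATION KEY, 4 obsolete bytes, flags byte,
    1 reserved byte, 2 obsolete bytes."""
    flags = ALL_TG_PT if all_tg_pt else 0
    return struct.pack(">QQ4xBx2x", reservation_key, sa_key, flags)


def mds_prepare(mds_key: int):
    """Return the (CDB, parameter list) pairs the MDS would issue:
    first register its key on all target ports, then reserve."""
    register = (pr_out_cdb(SA_REGISTER),
                pr_out_params(0, mds_key, all_tg_pt=True))
    reserve = (pr_out_cdb(SA_RESERVE, PR_TYPE_FROM_TEXT),
               pr_out_params(mds_key, 0))
    return [register, reserve]
```

The sketch sets ALL_TG_PT on the registration only, matching the SHOULD above; an MDS that instead registers once per initiator port would clear the bit and repeat the REGISTER on each path.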
2.4.10.3. PRs - Client Registration 2.4.10.3. PRs - Client Registration
Before performing the first IO to a device returned from a Before performing the first I/O to a device returned from a
GETDEVICEINFO operation, the client will register the reservation key GETDEVICEINFO operation, the client will register the reservation key
returned in sbv_pr_key with the storage device by issuing a returned in sbv_pr_key with the storage device by issuing a
"PERSISTENT RESERVE OUT" command with a service action of REGISTER "PERSISTENT RESERVE OUT" command with a service action of REGISTER
with the "SERVICE ACTION RESERVATION KEY" set to the reservation key with the "SERVICE ACTION RESERVATION KEY" set to the reservation key
returned in sbv_pr_key. To make sure all I_T nexus are registered, returned in sbv_pr_key. To make sure all I_T nexuses are registered,
the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when
registering the key, or otherwise ensure the registration is registering the key, or otherwise ensure the registration is
performed for each initiator port. performed for each initiator port.
When a client stops using a device earlier returned by GETDEVICEINFO When a client stops using a device earlier returned by GETDEVICEINFO
it MUST unregister the earlier registered key by issuing a it MUST unregister the earlier registered key by issuing a
"PERSISTENT RESERVE OUT" command with a service action of "REGISTER" "PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
with the "RESERVATION KEY" set to the earlier registered reservation with the "RESERVATION KEY" set to the earlier registered reservation
key. key.
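The difference between registering and unregistering lies only in which key field of the parameter list carries the key. The sketch below shows both parameter lists under stated assumptions: the function names are hypothetical, the 24-byte layout is assumed from [SPC4], and the zero "SERVICE ACTION RESERVATION KEY" used to unregister reflects SPC-4 behavior rather than anything spelled out in this section.

```python
import struct

ALL_TG_PT = 0x04  # assumed position: byte 20, bit 2 of the parameter list


def _pr_register_params(reservation_key: int, sa_key: int,
                        all_tg_pt: bool) -> bytes:
    """24-byte PERSISTENT RESERVE OUT parameter list: RESERVATION KEY,
    SERVICE ACTION RESERVATION KEY, obsolete/reserved bytes, flags."""
    flags = ALL_TG_PT if all_tg_pt else 0
    return struct.pack(">QQ4xBx2x", reservation_key, sa_key, flags)


def client_register_params(sbv_pr_key: int, all_tg_pt: bool = True) -> bytes:
    """First use of a device: no key is registered yet, so the
    RESERVATION KEY is zero and the new key goes in the
    SERVICE ACTION RESERVATION KEY field."""
    return _pr_register_params(0, sbv_pr_key, all_tg_pt)


def client_unregister_params(sbv_pr_key: int,
                             all_tg_pt: bool = False) -> bytes:
    """Done with the device: the RESERVATION KEY names the existing
    registration and a zero SERVICE ACTION RESERVATION KEY removes it.
    The text does not say how ALL_TG_PT should be set here, so the
    choice is left to the caller."""
    return _pr_register_params(sbv_pr_key, 0, all_tg_pt)
```

Both lists accompany a CDB whose service action is REGISTER; only the placement of sbv_pr_key changes.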
2.4.10.4. PRs - Fencing Action 2.4.10.4. PRs - Fencing Action
In case of a non-responding client the MDS fences the client by In case of a non-responding client the MDS fences the client by
issuing a "PERSISTENT RESERVE OUT" command with the service action issuing a "PERSISTENT RESERVE OUT" command with the service action
set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key field set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key field
set to the server's reservation key, the service action reservation set to the server's reservation key, the service action reservation
key field set to the reservation key associated with the non- key field set to the reservation key associated with the non-
responding client, and the type field set to 8h (Exclusive Access - responding client, and the type field set to 8h (Exclusive Access -
All Registrants). Registrants Only).
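The fencing command described above can be sketched as a single CDB and parameter list pair. As before, this is a hedged illustration: the function name is hypothetical, the byte layout is assumed from the PERSISTENT RESERVE OUT definition in [SPC4], and the type code 8h is the value stated in the text.

```python
import struct

PR_OUT_OPCODE = 0x5F        # PERSISTENT RESERVE OUT operation code
SA_PREEMPT = 0x04           # service action PREEMPT
SA_PREEMPT_AND_ABORT = 0x05  # service action PREEMPT AND ABORT
PR_TYPE_FROM_TEXT = 0x8     # type field value 8h, as given in the text
PARAM_LEN = 24              # length of the basic parameter list


def fence_client(mds_key: int, client_key: int, abort: bool = False):
    """Return (cdb, params) for fencing one client.

    The reservation key field carries the server's own key and the
    service action reservation key field carries the key of the
    non-responding client that is to be preempted."""
    service_action = SA_PREEMPT_AND_ABORT if abort else SA_PREEMPT
    # CDB: opcode, service action, scope (0 = LU) and type,
    # 2 reserved bytes, 4-byte parameter list length, control byte.
    cdb = struct.pack(">BBBHIB", PR_OUT_OPCODE, service_action,
                      PR_TYPE_FROM_TEXT, 0, PARAM_LEN, 0)
    # Parameter list: both keys, then obsolete/reserved bytes and a
    # flags byte left at zero.
    params = struct.pack(">QQ4xBx2x", mds_key, client_key, 0)
    return cdb, params
```

Choosing abort=True selects "PREEMPT AND ABORT", which additionally asks the device to abort commands already queued by the preempted client; whether an MDS prefers it is an implementation decision the text leaves open.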
After the MDS preempts a client, all client I/O to the LU fails. The After the MDS preempts a client, all client I/O to the LU fails. The
client should at this point return any layout that refers to the client should at this point return any layout that refers to the
device ID that points to the LU. Note that the client can device ID that points to the LU. Note that the client can
distinguish I/O errors due to fencing from other errors based on the distinguish I/O errors due to fencing from other errors based on the
"RESERVATION CONFLICT" SCSI status. Refer to [SPC4] for details. "RESERVATION CONFLICT" SCSI status. Refer to [SPC4] for details.
2.4.10.5. Client Recovery After a Fence Action 2.4.10.5. Client Recovery After a Fence Action
A client that detects a "RESERVATION CONFLICT" SCSI status (I/O A client that detects a "RESERVATION CONFLICT" SCSI status (I/O
skipping to change at page 24, line 29 skipping to change at page 24, line 29
a layout extent is not held). In general, the server will not be a layout extent is not held). In general, the server will not be
able to prevent a client that holds a layout for a file from able to prevent a client that holds a layout for a file from
accessing parts of the physical disk not covered by the layout. accessing parts of the physical disk not covered by the layout.
Similarly, the server will not be able to prevent a client from Similarly, the server will not be able to prevent a client from
accessing blocks covered by a layout that it has already returned. accessing blocks covered by a layout that it has already returned.
The pNFS client must respect the layout model for this mapping type The pNFS client must respect the layout model for this mapping type
to appropriately respect NFSv4 semantics. to appropriately respect NFSv4 semantics.
Furthermore, there is no way for the storage to determine the Furthermore, there is no way for the storage to determine the
specific NFSv4 entity (principal, openowner, lockowner) on whose specific NFSv4 entity (principal, openowner, lockowner) on whose
behalf the IO operation is being done. This fact may limit the behalf the I/O operation is being done. This fact may limit the
functionality to be supported and require the pNFS client to functionality to be supported and require the pNFS client to
implement server policies other than those describable by layouts. implement server policies other than those describable by layouts.
In cases in which layouts previously granted become invalid, the In cases in which layouts previously granted become invalid, the
server has the option of recalling them. In situations in which server has the option of recalling them. In situations in which
communication difficulties prevent this from happening, layouts may communication difficulties prevent this from happening, layouts may
be revoked by the server. This revocation is accompanied by changes be revoked by the server. This revocation is accompanied by changes
in persistent reservation which have the effect of preventing SCSI in persistent reservation which have the effect of preventing SCSI
access to the LUs in question by the client. access to the LUs in question by the client.
3.1. Use of Open Stateids 3.1. Use of Open Stateids
skipping to change at page 24, line 51 skipping to change at page 24, line 51
The effective implementation of these NFSv4 semantic constraints is The effective implementation of these NFSv4 semantic constraints is
complicated by the different granularities of the actors for the complicated by the different granularities of the actors for the
different types of the functionality to be enforced: different types of the functionality to be enforced:
o To enforce security constraints for particular principals. o To enforce security constraints for particular principals.
o To enforce locking constraints for particular owners (openowners o To enforce locking constraints for particular owners (openowners
and lockowners) and lockowners)
Fundamental to enforcing both of these sorts of constraints is the Fundamental to enforcing both of these sorts of constraints is the
principle that a pNFS client must not issue a SCSI IO operation principle that a pNFS client must not issue a SCSI I/O operation
unless it possesses both: unless it possesses both:
o A valid open stateid for the file in question, performing the IO o A valid open stateid for the file in question, performing the I/O
that allows IO of the type in question, which is associated with that allows I/O of the type in question, which is associated with
the openowner and principal on whose behalf the IO is to be done. the openowner and principal on whose behalf the I/O is to be done.
o A valid layout stateid for the file in question that covers the o A valid layout stateid for the file in question that covers the
byte range on which the IO is to be done and that allows IO of byte range on which the I/O is to be done and that allows I/O of
that type to be done. that type to be done.
As a result, if the equivalent of IO with an anonymous or write- As a result, if the equivalent of I/O with an anonymous or write-
bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI
layout type. The client MAY attempt such IO using READs and WRITEs layout type. The client MAY attempt such I/O using READs and WRITEs
that do not use pNFS and are directed to the MDS. that do not use pNFS and are directed to the MDS.
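The two-stateid precondition above can be illustrated with a minimal client-side sketch. It is not taken from either draft revision: the structures, field names, and the choose_io_path() helper are hypothetical, and a real client would track considerably more state. The sketch only shows that direct SCSI I/O is chosen when both a suitable open stateid and a covering layout are held, and that every other case falls back to the MDS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client bookkeeping; none of these names come from the draft. */
enum io_type { IO_READ, IO_WRITE };
enum io_path { IO_PATH_SCSI, IO_PATH_MDS };

struct open_state {
    bool valid;          /* false once revoked; anonymous I/O has no open state */
    bool allows_read;
    bool allows_write;
    uint32_t principal;  /* principal that performed the OPEN */
};

struct layout_state {
    bool valid;
    bool allows_write;   /* false models a read-only layout */
    uint64_t offset;     /* start of the byte range covered */
    uint64_t length;     /* length of the byte range covered */
};

static bool open_allows(const struct open_state *os, enum io_type t)
{
    return t == IO_READ ? os->allows_read : os->allows_write;
}

/* True if the layout is valid, permits this I/O type, and covers
 * [off, off + len) entirely. Written to avoid 64-bit overflow. */
static bool layout_covers(const struct layout_state *ls, enum io_type t,
                          uint64_t off, uint64_t len)
{
    if (!ls->valid)
        return false;
    if (t == IO_WRITE && !ls->allows_write)
        return false;
    if (off < ls->offset)
        return false;
    uint64_t rel = off - ls->offset;
    return rel <= ls->length && len <= ls->length - rel;
}

/* Direct SCSI I/O is permitted only when both preconditions hold;
 * otherwise the I/O is left to ordinary READ/WRITE through the MDS,
 * which applies its own checks. */
enum io_path choose_io_path(const struct open_state *os,
                            const struct layout_state *ls,
                            uint32_t principal, enum io_type t,
                            uint64_t off, uint64_t len)
{
    if (os == NULL || !os->valid || !open_allows(os, t))
        return IO_PATH_MDS;
    if (os->principal != principal)
        return IO_PATH_MDS;
    if (ls == NULL || !layout_covers(ls, t, off, len))
        return IO_PATH_MDS;
    return IO_PATH_SCSI;
}
```

Returning IO_PATH_MDS here means only that the SCSI path is ruled out; whether the MDS accepts the request is decided by the server.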
When open stateids are revoked, due to lease expiration or any form When open stateids are revoked, due to lease expiration or any form
of administrative revocation, the server MUST recall all layouts that of administrative revocation, the server MUST recall all layouts that
allow IO to be done on any of the files for which open revocation allow I/O to be done on any of the files for which open revocation
happens. When there is a failure to successfully return those happens. When there is a failure to successfully return those
layouts, the client MUST be fenced. layouts, the client MUST be fenced.
3.2. Enforcing Security Restrictions 3.2. Enforcing Security Restrictions
The restriction noted above provides adequate enforcement of The restriction noted above provides adequate enforcement of
appropriate security restriction when the principal issuing the IO is appropriate security restriction when the principal issuing the I/O
the same as that opening the file. The server is responsible for is the same as that opening the file. The server is responsible for
checking that the IO mode requested by the open is allowed for the checking that the I/O mode requested by the open is allowed for the
principal doing the OPEN. If the correct sort of IO is done on principal doing the OPEN. If the correct sort of I/O is done on
behalf of the same principal, then the security restriction is behalf of the same principal, then the security restriction is
thereby enforced. thereby enforced.
If IO is done by a principal different from the one that opened the If I/O is done by a principal different from the one that opened the
file, the client SHOULD send the IO to be performed by the metadata file, the client SHOULD send the I/O to be performed by the metadata
server rather than doing it directly to the storage device. server rather than doing it directly to the storage device.
3.3. Enforcing Locking Restrictions 3.3. Enforcing Locking Restrictions
Mandatory enforcement of whole-file locking by means of share Mandatory enforcement of whole-file locking by means of share
reservations is provided when the pNFS client obeys the requirement reservations is provided when the pNFS client obeys the requirement
set forth in Section 3.1 above. Since performing IO requires a valid set forth in Section 3.1 above. Since performing I/O requires a
open stateid, an IO that violates an existing share reservation would valid open stateid, an I/O that violates an existing share reservation
only be possible when the server allows conflicting open stateids to would only be possible when the server allows conflicting open
exist. stateids to exist.
The nature of the SCSI layout type is such that implementation/enforcement The nature of the SCSI layout type is such that implementation/enforcement
of mandatory byte-range locks is very difficult. Given that layouts of mandatory byte-range locks is very difficult. Given that layouts
are granted to clients rather than owners, the pNFS client is in no are granted to clients rather than owners, the pNFS client is in no
position to successfully arbitrate among multiple lockowners on the position to successfully arbitrate among multiple lockowners on the
same client. Suppose lockowner A is doing a write and, while the IO same client. Suppose lockowner A is doing a write and, while the I/O
is pending, lockowner B requests a mandatory byte-range lock for a byte is pending, lockowner B requests a mandatory byte-range lock for a byte
range potentially overlapping the pending IO. In such a situation, range potentially overlapping the pending I/O. In such a situation,
the lock request cannot be granted while the IO is pending. In a the lock request cannot be granted while the I/O is pending. In a
non-pNFS environment, the server would have to wait for pending IO non-pNFS environment, the server would have to wait for pending I/O
before granting the mandatory byte-range lock. In the pNFS before granting the mandatory byte-range lock. In the pNFS
environment the server does not issue the IO and is thus in no environment the server does not issue the I/O and is thus in no
position to wait for its completion. The server may recall such position to wait for its completion. The server may recall such
layouts but in doing so, it has no way of distinguishing those being layouts but in doing so, it has no way of distinguishing those being
used by lockowners A and B, making it difficult to allow B to perform used by lockowners A and B, making it difficult to allow B to perform
IO while forbidding A from doing so. Given this fact, the MDS needs I/O while forbidding A from doing so. Given this fact, the MDS needs
to successfully recall all layouts that overlap the range being to successfully recall all layouts that overlap the range being
locked before returning a successful response to the LOCK request. locked before returning a successful response to the LOCK request.
While the lock is in effect, the server SHOULD respond to requests While the lock is in effect, the server SHOULD respond to requests
for layouts which overlap a currently locked area with for layouts which overlap a currently locked area with
NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic a server NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic a server
MAY do this for all layout requests on the file in question as long MAY do this for all layout requests on the file in question as long
as there are any byte-range locks in effect. as there are any byte-range locks in effect.
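The server-side rule for layout requests while mandatory byte-range locks are held might be sketched as follows. This is an illustration under stated assumptions, not text from the draft: struct byte_range, ranges_overlap(), layoutget_status(), and the coarse flag are hypothetical names. The numeric value of NFS4ERR_LAYOUTUNAVAILABLE is the one assigned in RFC 5661. The recall of existing overlapping layouts before a LOCK is granted is not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Status codes as assigned by NFSv4.1 (RFC 5661). */
enum { NFS4_OK = 0, NFS4ERR_LAYOUTUNAVAILABLE = 10059 };

/* Hypothetical representation of a locked or requested byte range. */
struct byte_range {
    uint64_t offset;
    uint64_t length;
};

/* True if the two ranges share at least one byte.
 * Written to avoid 64-bit overflow in offset + length. */
static bool ranges_overlap(struct byte_range a, struct byte_range b)
{
    if (a.length == 0 || b.length == 0)
        return false;
    if (a.offset <= b.offset)
        return b.offset - a.offset < a.length;
    return a.offset - b.offset < b.length;
}

/* Decide the status for a layout request on a file whose mandatory
 * byte-range locks are listed in locks[0..nlocks).
 * coarse == true models the simplification the text permits: refuse
 * every layout request while any byte-range lock is in effect. */
int layoutget_status(const struct byte_range *locks, size_t nlocks,
                     struct byte_range req, bool coarse)
{
    if (coarse)
        return nlocks > 0 ? NFS4ERR_LAYOUTUNAVAILABLE : NFS4_OK;

    for (size_t i = 0; i < nlocks; i++) {
        if (ranges_overlap(locks[i], req))
            return NFS4ERR_LAYOUTUNAVAILABLE;
    }
    return NFS4_OK;
}
```

The coarse mode trades parallelism for simpler bookkeeping, since the server no longer needs to compare each request against every lock range.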
Given these difficulties it may be difficult for servers supporting Given these difficulties it may be difficult for servers supporting
mandatory byte-range locks to also support SCSI layouts. Servers can mandatory byte-range locks to also support SCSI layouts. Servers can
skipping to change at page 28, line 27 skipping to change at page 28, line 25
ANSI INCITS 513-2015, 2015. ANSI INCITS 513-2015, 2015.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Large parts of this document were copied verbatim, and others were Large parts of this document were copied verbatim, and others were
inspired by [RFC5663]. Thanks to David Black, Stephen Fridella and inspired by [RFC5663]. Thanks to David Black, Stephen Fridella and
Jason Glasgow for their work on the pNFS block/volume layout Jason Glasgow for their work on the pNFS block/volume layout
protocol. protocol.
David Black, Robert Elliott and Tom Haynes provided a thorough David Black, Robert Elliott and Tom Haynes provided a thorough
review of early drafts of this document, and their input lead to the review of early drafts of this document, and their input led to the
current form of the document. current form of the document.
David Noveck provided ample feedback to various drafts of this David Noveck provided ample feedback to various drafts of this
document, wrote the section on enforcing NFSv4 semantics and rewrote document, wrote the section on enforcing NFSv4 semantics and rewrote
various sections to better catch the intent. various sections to better catch the intent.
Appendix B. RFC Editor Notes Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this [RFC Editor: please remove this section prior to publishing this
document as an RFC] document as an RFC]
 End of changes. 36 change blocks. 
99 lines changed or deleted 100 lines changed or added
