NFSv4                                                         C. Hellwig
Internet-Draft                                         December 03, 2015
Intended status: Standards Track
Expires: June 5, 2016

                     Parallel NFS (pNFS) SCSI Layout
                   draft-ietf-nfsv4-scsi-layout-05.txt
Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The SCSI Layout Type is defined in this
   document as an extension to pNFS to allow the use of SCSI-based
   block storage devices.
Status of this Memo

   skipping to change at page 1, line 34

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 5, 2016.
Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents

   skipping to change at page 2, line 14
Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Conventions Used in This Document  . . . . . . . . . . . .  4
     1.2.  General Definitions  . . . . . . . . . . . . . . . . . . .  4
     1.3.  Code Components Licensing Notice . . . . . . . . . . . . .  4
     1.4.  XDR Description  . . . . . . . . . . . . . . . . . . . . .  4
   2.  SCSI Layout Description  . . . . . . . . . . . . . . . . . . .  6
     2.1.  Background and Architecture  . . . . . . . . . . . . . . .  6
     2.2.  layouttype4  . . . . . . . . . . . . . . . . . . . . . . .  7
     2.3.  GETDEVICEINFO  . . . . . . . . . . . . . . . . . . . . . .  8
       2.3.1.  Volume Identification  . . . . . . . . . . . . . . . .  8
       2.3.2.  Volume Topology  . . . . . . . . . . . . . . . . . . .  9
     2.4.  Data Structures: Extents and Extent Lists  . . . . . . . . 12
       2.4.1.  Layout Requests and Extent Lists . . . . . . . . . . . 14
       2.4.2.  Layout Commits . . . . . . . . . . . . . . . . . . . . 15
       2.4.3.  Layout Returns . . . . . . . . . . . . . . . . . . . . 16
       2.4.4.  Layout Revocation  . . . . . . . . . . . . . . . . . . 16
       2.4.5.  Client Copy-on-Write Processing  . . . . . . . . . . . 17
       2.4.6.  Extents are Permissions  . . . . . . . . . . . . . . . 18
       2.4.7.  Partial-Block Updates  . . . . . . . . . . . . . . . . 19
       2.4.8.  End-of-file Processing . . . . . . . . . . . . . . . . 19
       2.4.9.  Layout Hints . . . . . . . . . . . . . . . . . . . . . 20
       2.4.10. Client Fencing . . . . . . . . . . . . . . . . . . . . 20
     2.5.  Crash Recovery Issues  . . . . . . . . . . . . . . . . . . 22
     2.6.  Recalling Resources: CB_RECALL_ANY . . . . . . . . . . . . 22
     2.7.  Transient and Permanent Errors . . . . . . . . . . . . . . 23
     2.8.  Volatile write caches  . . . . . . . . . . . . . . . . . . 23
   3.  Enforcing NFSv4 Semantics  . . . . . . . . . . . . . . . . . . 24
     3.1.  Use of Open Stateids . . . . . . . . . . . . . . . . . . . 24
     3.2.  Enforcing Security Restrictions  . . . . . . . . . . . . . 25
     3.3.  Enforcing Locking Restrictions . . . . . . . . . . . . . . 25
   4.  Security Considerations  . . . . . . . . . . . . . . . . . . . 26
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 27
   6.  Normative References . . . . . . . . . . . . . . . . . . . . . 27
   Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . . . 28
   Appendix B.  RFC Editor Notes  . . . . . . . . . . . . . . . . . . 28
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.  Introduction

   Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
   system:

      +-----------+
      |+-----------+                              +-----------+
      ||+-----------+                             |           |
      |||           |       NFSv4.1 + pNFS        |           |

   skipping to change at page 4, line 5
   The Server to Storage System protocol, called the "Control
   Protocol", is not of concern for interoperability, although it will
   typically be the same SCSI-based storage protocol.

   This document is based on [RFC5663] and makes changes to the block
   layout type to provide a better pNFS layout protocol for SCSI-based
   storage devices.  Despite these changes, [RFC5663] remains the
   defining document for the existing block layout type.  [RFC6688] is
   unnecessary in the context of the SCSI layout type because the new
   layout type provides mandatory disk access protection as part of
   the layout type definition.  In contrast to [RFC5663], this
   document uses SCSI protocol features to provide reliable fencing by
   using SCSI Persistent Reservations, and it can provide reliable and
   efficient device discovery by using SCSI device identifiers instead
   of having to rely on probing all devices potentially attached to a
   client for a signature.  This new layout type also optimizes the
   I/O path by reducing the size of the LAYOUTCOMMIT payload.
1.1.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].
1.2.  General Definitions

   The following definitions are provided for the purpose of providing

   skipping to change at page 5, line 26

   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.
   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both
   nfs types that end with a 4, such as offset4, length4, etc., as
   well as more generic types such as uint32_t and uint64_t.
   /// /*
   ///  * This code was derived from RFCTBD10
   ///  * Please reproduce this note if possible.
   ///  */
   /// /*
   ///  * Copyright (c) 2010,2015 IETF Trust and the persons
   ///  * identified as the document authors.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///
   /// /*
   ///  * nfs4_scsi_layout_prot.x
   ///  */
   ///
   /// %#include "nfsv41.h"
   ///
2.  SCSI Layout Description

2.1.  Background and Architecture
   The fundamental storage model supported by SCSI storage devices is
   a Logical Unit (LU) consisting of a sequential series of fixed-size
   blocks.  Logical units used as devices for NFS SCSI layouts, and
   the SCSI initiators used for the pNFS Metadata Server and clients,
   MUST support SCSI persistent reservations.
   A pNFS layout for this SCSI class of storage is responsible for
   mapping from an NFS file (or portion of a file) to the blocks of
   storage volumes that contain the file.  The blocks are expressed as
   extents with 64-bit offsets and lengths using the existing NFSv4
   offset4 and length4 types.  Clients MUST be able to perform I/O to
   the block extents without affecting additional areas of storage
   (especially important for writes); therefore, extents MUST be
   aligned to 512-byte boundaries.
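A minimal sketch of the 512-byte alignment check follows.  It is illustrative only and not part of the protocol; the field names (storage_offset, extent_length) are hypothetical stand-ins for the extent fields defined later in this document.

```python
# Illustrative check that an extent honors the 512-byte alignment rule.
# Names are hypothetical; real clients validate the decoded XDR extent.

SECTOR = 512

def extent_is_aligned(storage_offset, extent_length):
    # Both the starting offset and the length must fall on sector
    # boundaries so that writes cannot touch neighboring storage.
    return storage_offset % SECTOR == 0 and extent_length % SECTOR == 0

# A 4 KiB extent at offset 0 is acceptable; an extent starting at an
# odd byte offset is not.
assert extent_is_aligned(0, 4096)
assert not extent_is_aligned(100, 512)
```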
   The pNFS operation for requesting a layout (LAYOUTGET) includes the
   "layoutiomode4 loga_iomode" argument, which indicates whether the
   requested layout is for read-only use or read-write use.  A read-
   only layout may contain holes that are read as zero, whereas a
   read-write layout will contain allocated, but un-initialized
   storage in those holes (read as zero, can be written by client).
   This document also supports client participation in copy-on-write
   (e.g., for file systems with snapshots) by providing both read-only
   and un-initialized storage for the same range in a layout.  Reads
   are initially performed on the read-only storage, with writes going
   to the un-initialized storage.  After the first write that
   initializes the un-initialized storage, all reads are performed to
   that now-initialized writable storage, and the corresponding read-
   only storage is no longer used.
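The read/write routing described above can be sketched as follows.  This is an illustrative model of the client's behavior, not part of the protocol; the class and method names (CowRange, read_block, write_block) are hypothetical.

```python
# Hypothetical sketch of client-side copy-on-write routing for one
# layout range that carries both read-only and un-initialized storage.

class CowRange:
    def __init__(self, ro_storage, rw_storage):
        self.ro = ro_storage          # read-only (snapshot) blocks
        self.rw = rw_storage          # allocated, un-initialized blocks
        self.initialized = set()      # block indices written at least once

    def read_block(self, idx):
        # Before the first write, reads go to the read-only storage;
        # afterwards they go to the now-initialized writable storage.
        if idx in self.initialized:
            return self.rw[idx]
        return self.ro[idx]

    def write_block(self, idx, data):
        # Writes always target the writable (un-initialized) storage.
        self.rw[idx] = data
        self.initialized.add(idx)

# Example: after block 0 is written, reads of block 0 switch to the
# writable storage while block 1 still comes from the read-only copy.
r = CowRange(ro_storage={0: b"old0", 1: b"old1"}, rw_storage={})
r.write_block(0, b"new0")
assert r.read_block(0) == b"new0"
assert r.read_block(1) == b"old1"
```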
   The SCSI layout solution expands the security responsibilities of
   the pNFS clients, and there are a number of environments where the
   mandatory-to-implement security properties for NFS cannot be
   satisfied.  The additional security responsibilities of the client
   follow, and a full discussion is present in Section 4, "Security
   Considerations".
   o  Typically, SCSI storage devices provide access control
      mechanisms (e.g., Logical Unit Number (LUN) mapping and/or
      masking), which operate at the granularity of individual hosts,
      not individual blocks.  For this reason, block-based protection
      must be provided by the client software.

   o  Similarly, SCSI storage devices typically are not able to
      validate NFS locks that apply to file regions.  For instance, if
      a file is

   skipping to change at page 8, line 24
   This document defines the structure associated with the layouttype4
   value LAYOUT4_SCSI.  [RFC5661] specifies the loc_body structure as
   an XDR type "opaque".  The opaque layout is uninterpreted by the
   generic pNFS client layers, but obviously must be interpreted by
   the Layout Type implementation.

2.3.  GETDEVICEINFO

2.3.1.  Volume Identification
   SCSI targets implementing [SPC4] export unique LU names for each LU
   through the Device Identification VPD page (page code 0x83), which
   can be obtained using the INQUIRY command with the EVPD bit set to
   one.  This document uses a subset of this information to identify
   LUs backing pNFS SCSI layouts.  It is similar to the
   "Identification Descriptor Target Descriptor" specified in [SPC4],
   but limits the allowed values to those that uniquely identify a LU.
   Device Identification VPD page descriptors used to identify LUs for
   use with pNFS SCSI layouts must adhere to the following
   restrictions:
   1.  The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is
       associated with the addressed logical unit).

   2.  The "DESIGNATOR TYPE" MUST be set to one of four values that
       are required for the mandatory logical unit name in [SPC4], as
       explicitly listed in the "pnfs_scsi_designator_type"
       enumeration:

       PS_DESIGNATOR_T10    T10 vendor ID based

       PS_DESIGNATOR_EUI64  EUI-64-based

       PS_DESIGNATOR_NAA    NAA

       PS_DESIGNATOR_NAME   SCSI name string

       Any other association or designator type MUST NOT be used.
       Use of T10 vendor IDs is discouraged when one of the other
       types can be used.
   The "CODE SET" VPD page field is stored in the "sbv_code_set" field
   of the "pnfs_scsi_base_volume_info4" structure, the "DESIGNATOR
   TYPE" is stored in "sbv_designator_type", and the DESIGNATOR is
   stored in "sbv_designator".  Due to the use of an XDR array, the
   "DESIGNATOR LENGTH" field does not need to be set separately.  Only
   certain combinations of "sbv_code_set" and "sbv_designator_type"
   are valid; please refer to [SPC4] for details, and note that ASCII
   may be used as the code set for UTF-8 text that contains only
   printable ASCII characters.  Note that a Device Identification VPD
   page MAY contain multiple descriptors with the same association,
   code set, and designator type.  NFS clients thus MUST check all the
   descriptors for a possible match to "sbv_code_set",
   "sbv_designator_type", and "sbv_designator".
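The descriptor matching described above amounts to scanning every descriptor on the page for one that matches all three sbv_* fields.  The following sketch is illustrative only; descriptors are modeled as hypothetical (association, code_set, designator_type, designator) tuples rather than parsed INQUIRY data.

```python
# Hypothetical sketch: match descriptors from a Device Identification
# VPD page (page 0x83) against the sbv_* fields from GETDEVICEINFO.
# Real code would parse the descriptors from the INQUIRY response.

def matches_volume(descriptors, sbv_code_set, sbv_designator_type,
                   sbv_designator):
    for association, code_set, designator_type, designator in descriptors:
        if (association == 0                  # addressed logical unit
                and code_set == sbv_code_set
                and designator_type == sbv_designator_type
                and designator == sbv_designator):
            return True
    return False

# Example page: a target-port descriptor (association 1, ignored) and
# an LU descriptor (association 0, binary code set 1, NAA type 3).
page = [
    (1, 1, 3, b"\x50\x01\x02\x03"),   # wrong association: skipped
    (0, 1, 3, b"\x60\x0b\x12\x34"),   # matches the sbv_* fields
]
assert matches_volume(page, 1, 3, b"\x60\x0b\x12\x34")
assert not matches_volume(page, 1, 3, b"\x60\x0b\xff\xff")
```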
   Storage devices such as storage arrays can have multiple physical
   network ports that need not be connected to a common network,
   resulting in a pNFS client having simultaneous multipath access to
   the same storage volumes via different ports on different networks.
   Selection of one or multiple ports to access the storage device is
   left up to the client.

   Additionally, the server returns a Persistent Reservation key in
   the "sbv_pr_key" field.  See Section 2.4.10 for more details on the
   use of Persistent Reservations.
2.3.2.  Volume Topology

   The pNFS SCSI layout volume topology is expressed as an arbitrary
   combination of base volume types enumerated in the following data
   structures.  The individual components of the topology are
   contained in an array, and components may refer to other components
   by using array indices.
   /// enum pnfs_scsi_volume_type4 {
   ///     PNFS_SCSI_VOLUME_SLICE  = 1,  /* volume is a slice of
   ///                                      another volume */
   ///     PNFS_SCSI_VOLUME_CONCAT = 2,  /* volume is a
   ///                                      concatenation of
   ///                                      multiple volumes */
   ///     PNFS_SCSI_VOLUME_STRIPE = 3,  /* volume is striped across
   ///                                      multiple volumes */
   ///     PNFS_SCSI_VOLUME_BASE   = 4   /* volume maps to a single
   ///                                      LU */
   /// };
   ///
   /// /*
   ///  * Code sets from SPC-3.
   ///  */
   /// enum pnfs_scsi_code_set {
   ///     PS_CODE_SET_BINARY = 1,
   ///     PS_CODE_SET_ASCII  = 2,
   ///     PS_CODE_SET_UTF8   = 3
   /// };
   ///
   /// /*

   skipping to change at page 10, line 37
   ///  * Logical Unit name + reservation key.
   ///  */
   /// struct pnfs_scsi_base_volume_info4 {
   ///     pnfs_scsi_code_set         sbv_code_set;
   ///     pnfs_scsi_designator_type  sbv_designator_type;
   ///     opaque                     sbv_designator<>;
   ///     uint64_t                   sbv_pr_key;
   /// };
   ///
   /// struct pnfs_scsi_slice_volume_info4 {
   ///     offset4   ssv_start;   /* offset of the start of
   ///                               the slice in bytes */
   ///     length4   ssv_length;  /* length of slice in
   ///                               bytes */
   ///     uint32_t  ssv_volume;  /* array index of sliced
   ///                               volume */
   /// };
   ///
   /// struct pnfs_scsi_concat_volume_info4 {
   ///     uint32_t  scv_volumes<>;  /* array indices of volumes
   ///                                  which are concatenated */
   /// };
   ///
   /// struct pnfs_scsi_stripe_volume_info4 {
   ///     length4   ssv_stripe_unit;  /* size of stripe in bytes */
   ///     uint32_t  ssv_volumes<>;    /* array indices of
   ///                                    volumes which are striped
   ///                                    across -- MUST be same
   ///                                    size */
   /// };
   ///
   /// union pnfs_scsi_volume4 switch (pnfs_scsi_volume_type4 type) {
   /// case PNFS_SCSI_VOLUME_BASE:
   ///     pnfs_scsi_base_volume_info4    sv_simple_info;
   /// case PNFS_SCSI_VOLUME_SLICE:
   ///     pnfs_scsi_slice_volume_info4   sv_slice_info;
   /// case PNFS_SCSI_VOLUME_CONCAT:
   ///     pnfs_scsi_concat_volume_info4  sv_concat_info;
   /// case PNFS_SCSI_VOLUME_STRIPE:
   ///     pnfs_scsi_stripe_volume_info4  sv_stripe_info;
   /// };
   ///
   /// /* SCSI layout-specific type for da_addr_body */
   /// struct pnfs_scsi_deviceaddr4 {
   ///     pnfs_scsi_volume4  sda_volumes<>;  /* array of volumes */
   /// };
   ///
The "pnfs_scsi_deviceaddr4" data structure allows arbitrarily complex
nested volume structures to be encoded.  The types of aggregation
allowed are stripes, concatenations, and slices.  Note that the
volume topology expressed in the pnfs_scsi_deviceaddr4 data structure
will always resolve to a set of volumes of type
PNFS_SCSI_VOLUME_BASE.  The array of volumes is ordered such that the
root of the volume hierarchy is the last element of the array.
Concat, slice, and stripe volumes MUST refer to volumes defined by
lower-indexed elements of the array.
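Because children always have lower indices than their parents, a
client can resolve any offset down to a base volume by a simple
recursive walk.  The following C sketch illustrates this for slice
and stripe aggregations (concat is analogous); the structures are
simplified in-memory stand-ins for the XDR types above, not part of
the protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified, illustrative forms of the XDR volume types. */
enum vol_type { VOL_BASE, VOL_SLICE, VOL_CONCAT, VOL_STRIPE };

struct volume {
    enum vol_type type;
    uint32_t      slice_child;        /* VOL_SLICE: child index */
    uint64_t      slice_start;        /* VOL_SLICE: start offset */
    uint64_t      stripe_unit;        /* VOL_STRIPE: unit in bytes */
    uint32_t      stripe_children[4]; /* VOL_STRIPE: child indices */
    uint32_t      stripe_count;
};

/* Resolve (idx, off) to a VOL_BASE volume index and the offset on
 * it.  Children have lower array indices than their parent, so the
 * recursion terminates. */
static uint32_t resolve(const struct volume *vols, uint32_t idx,
                        uint64_t off, uint64_t *base_off)
{
    const struct volume *v = &vols[idx];

    switch (v->type) {
    case VOL_BASE:
        *base_off = off;
        return idx;
    case VOL_SLICE:
        return resolve(vols, v->slice_child, v->slice_start + off,
                       base_off);
    case VOL_STRIPE: {
        uint64_t su = v->stripe_unit;
        uint64_t stripe_no = off / su;   /* which stripe unit */
        uint32_t child =
            v->stripe_children[stripe_no % v->stripe_count];
        uint64_t child_off =
            (stripe_no / v->stripe_count) * su + off % su;
        return resolve(vols, child, child_off, base_off);
    }
    default: /* VOL_CONCAT omitted for brevity */
        return idx;
    }
}
```

For example, a slice at offset 8192 of a two-disk stripe with a
4096-byte stripe unit resolves offset 0 to offset 4096 of the first
base volume.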
skipping to change at page 12, line 21
signature components; thus, the device address may require several
kilobytes.  The client SHOULD be prepared to allocate a large buffer
to contain the result.  In the case of the server returning
NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at least
gdir_mincount_bytes to contain the expected result and retry the
GETDEVICEINFO request.
2.4.  Data Structures: Extents and Extent Lists
A pNFS SCSI layout is a list of extents within a flat array of data
blocks in a volume.  The details of the volume topology can be
determined by using the GETDEVICEINFO operation.  The SCSI layout
describes the individual block extents on the volume that make up the
file.  The offsets and lengths contained in an extent are specified
in units of bytes.
/// enum pnfs_scsi_extent_state4 {
///     PNFS_SCSI_READ_WRITE_DATA = 0, /* the data located by
///                                       this extent is valid
///                                       for reading and
///                                       writing. */
///     PNFS_SCSI_READ_DATA = 1,       /* the data located by this
///                                       extent is valid for
///                                       reading only; it may not
///                                       be written. */
///     PNFS_SCSI_INVALID_DATA = 2,    /* the location is valid; the
///                                       data is invalid.  It is a
///                                       newly (pre-) allocated
///                                       extent.  The client MUST
///                                       NOT read from this
///                                       space. */
///     PNFS_SCSI_NONE_DATA = 3        /* the location is invalid.
///                                       It is a hole in the file.
///                                       The client MUST NOT read
///                                       from or write to this
///                                       space. */
/// };
///
/// struct pnfs_scsi_extent4 {
///     deviceid4    se_vol_id;        /* id of the volume on
///                                       which extent of file is
///                                       stored. */
///     offset4      se_file_offset;   /* starting byte offset
///                                       in the file */
///     length4      se_length;        /* size in bytes of the
///                                       extent */
///     offset4      se_storage_offset;/* starting byte offset
///                                       in the volume */
///     pnfs_scsi_extent_state4 se_state;
///                                    /* state of this extent */
/// };
///
/// /* SCSI layout-specific type for loc_body */
/// struct pnfs_scsi_layout4 {
///     pnfs_scsi_extent4    sl_extents<>;
///                              /* extents which make up this
///                                 layout. */
/// };
///
The SCSI layout consists of a list of extents that map the regions of
the file to locations on a volume.  The "se_storage_offset" field
within each extent identifies a location on the volume specified by
the "se_vol_id" field in the extent.  The se_vol_id itself is
shorthand for the whole topology of the volume on which the file is
stored.  The client is responsible for translating this volume-
relative offset into an offset on the appropriate underlying SCSI LU.
Each extent maps a region of the file onto a portion of the specified
LU.  The se_file_offset, se_length, and se_state fields for an extent
returned from the server are valid for all extents.  In contrast, the
interpretation of the se_storage_offset field depends on the value of
se_state as follows (in increasing order):
PNFS_SCSI_READ_WRITE_DATA means that se_storage_offset is valid, and
   points to valid/initialized data that can be read and written.
PNFS_SCSI_READ_DATA means that se_storage_offset is valid and points
   to valid/initialized data that can only be read.  Write operations
   are prohibited; the client may need to request a read-write
   layout.
PNFS_SCSI_INVALID_DATA means that se_storage_offset is valid, but
   points to invalid, uninitialized data.  This data must not be read
   from the disk until it has been initialized.  A read request for a
   PNFS_SCSI_INVALID_DATA extent must fill the user buffer with
   zeros, unless the extent is covered by a PNFS_SCSI_READ_DATA
   extent of a copy-on-write file system.  Write requests must write
   whole server-sized blocks to the disk; bytes not initialized by
   the user must be set to zero.  Any write to storage in a
   PNFS_SCSI_INVALID_DATA extent changes the written portion of the
   extent to PNFS_SCSI_READ_WRITE_DATA; the pNFS client is
   responsible for reporting this change via LAYOUTCOMMIT.
PNFS_SCSI_NONE_DATA means that se_storage_offset is not valid, and
   this extent may not be used to satisfy write requests.  Read
   requests may be satisfied by zero-filling as for
   PNFS_SCSI_INVALID_DATA.  PNFS_SCSI_NONE_DATA extents may be
   returned by requests for readable extents; they are never returned
   if the request was for a writable extent.
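The state-dependent handling above can be summarized by a small
client-side dispatch.  This is an illustrative sketch, not protocol
text; the predicate names are hypothetical.

```c
#include <assert.h>

/* Values mirror the pnfs_scsi_extent_state4 enum. */
enum pnfs_scsi_extent_state4 {
    PNFS_SCSI_READ_WRITE_DATA = 0,
    PNFS_SCSI_READ_DATA       = 1,
    PNFS_SCSI_INVALID_DATA    = 2,
    PNFS_SCSI_NONE_DATA       = 3,
};

/* Nonzero when se_storage_offset may be used to satisfy a read;
 * otherwise the client must zero-fill the user buffer. */
static int storage_read_ok(enum pnfs_scsi_extent_state4 state)
{
    switch (state) {
    case PNFS_SCSI_READ_WRITE_DATA:
    case PNFS_SCSI_READ_DATA:
        return 1;               /* valid, initialized data */
    case PNFS_SCSI_INVALID_DATA: /* pre-allocated, uninitialized */
    case PNFS_SCSI_NONE_DATA:    /* hole, no physical space */
    default:
        return 0;               /* reads as zeros */
    }
}

/* Nonzero when the extent may satisfy a write.  A write to an
 * INVALID extent converts the written portion to READ_WRITE, which
 * the client must later report via LAYOUTCOMMIT. */
static int storage_write_ok(enum pnfs_scsi_extent_state4 state)
{
    return state == PNFS_SCSI_READ_WRITE_DATA ||
           state == PNFS_SCSI_INVALID_DATA;
}
```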
An extent list contains all relevant extents in increasing order of
the se_file_offset of each extent; any ties are broken by increasing
skipping to change at page 15, line 18
   logically contiguous.  Every PNFS_SCSI_READ_DATA extent in a read-
   write layout MUST be covered by one or more PNFS_SCSI_INVALID_DATA
   extents.  This overlap of PNFS_SCSI_READ_DATA and
   PNFS_SCSI_INVALID_DATA extents is the only permitted extent
   overlap.

o  Extents MUST be ordered in the list by starting offset, with
   PNFS_SCSI_READ_DATA extents preceding PNFS_SCSI_INVALID_DATA
   extents in the case of equal se_file_offsets.
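A client can sanity-check the ordering invariants above when it
decodes a layout.  The following sketch uses shortened, hypothetical
field names in place of the XDR definitions; it checks only the
ordering rules, not the coverage rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sextent {
    uint64_t file_off;  /* se_file_offset */
    uint64_t len;       /* se_length */
    int      state;     /* pnfs_scsi_extent_state4 value */
};

/* Extents must be sorted by se_file_offset; on equal offsets a
 * READ_DATA extent (state 1) must precede the INVALID_DATA extent
 * (state 2) that covers it. */
static int extent_list_ordered(const struct sextent *e, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (e[i].file_off < e[i - 1].file_off)
            return 0;
        if (e[i].file_off == e[i - 1].file_off &&
            e[i].state <= e[i - 1].state)
            return 0;
    }
    return 1;
}
```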
According to [RFC5661], if the minimum requested size,
loga_minlength, is zero, this is an indication to the metadata server
that the client desires any layout at offset loga_offset or less that
the metadata server has "readily available".  Given the lack of a
clear definition of this phrase, in the context of the SCSI layout
type, when loga_minlength is zero, the metadata server SHOULD:

o  when processing requests for readable layouts, return all such
   layouts, even if some extents are in the PNFS_SCSI_NONE_DATA
   state.

o  when processing requests for writable layouts, return extents
   which can be returned in the PNFS_SCSI_READ_WRITE_DATA state.
2.4.2.  Layout Commits
///
/// /* SCSI layout-specific type for lou_body */
///
/// struct pnfs_scsi_range4 {
///     offset4    sr_file_offset;  /* starting byte offset
///                                    in the file */
///     length4    sr_length;       /* size in bytes */
/// };
///
/// struct pnfs_scsi_layoutupdate4 {
///     pnfs_scsi_range4    slu_commit_list<>;
///                             /* list of extents which
///                              * now contain valid data.
///                              */
/// };
The "pnfs_scsi_layoutupdate4" structure is used by the client as the
SCSI layout-specific argument in a LAYOUTCOMMIT operation.  The
"slu_commit_list" field is a list covering regions of the file layout
that were previously in the PNFS_SCSI_INVALID_DATA state, but have
been written by the client and should now be considered in the
PNFS_SCSI_READ_WRITE_DATA state.  The extents in the commit list MUST
be disjoint and MUST be sorted by sr_file_offset.  Implementors
should be aware that a server may be unable to commit regions at a
granularity smaller than a file-system block (typically 4 KB or 8
KB).  As noted above, the block-size that the server uses is
available as an NFSv4 attribute, and any extents included in the
"slu_commit_list" MUST be aligned to this granularity and have a size
that is a multiple of this granularity.  Since the block in question
is in state PNFS_SCSI_INVALID_DATA, byte ranges not written should be
filled with zeros.  This applies even if it appears that the area
being written is beyond what the client believes to be the end of
file.
2.4.3.  Layout Returns
A LAYOUTRETURN operation represents an explicit release of resources
by the client.  This may be done in response to a CB_LAYOUTRECALL or
before any recall, in order to avoid a future CB_LAYOUTRECALL.  When
the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return
type, then the layoutreturn_file4 data structure specifies the region
of the file layout that is no longer needed by the client.

The LAYOUTRETURN operation is done without any SCSI layout-specific
data.  The opaque "lrf_body" field of the "layoutreturn_file4" data
structure MUST have length zero.
2.4.4.  Layout Revocation

Layouts may be unilaterally revoked by the server, due to the
client's lease time expiring, or the client failing to return a
layout which has been recalled in a timely manner.  For the SCSI
layout type this is accomplished by fencing off the client from
access to storage as described in Section 2.4.10.  When this is done,
it is necessary that all I/Os issued by the fenced-off client be
rejected by the storage.  This includes any in-flight I/Os that the
client issued before the layout was revoked.

Note that the granularity of this operation can only be at the
host/LU level.  Thus, if one of a client's layouts is unilaterally
revoked by the server, it will effectively render useless *all* of
the client's layouts for files located on the storage units
comprising the volume.  This may render useless the client's layouts
for files in other file systems.  See Section 2.4.10.5 for a
discussion of recovery from fencing.
2.4.5.  Client Copy-on-Write Processing
Copy-on-write is a mechanism used to support file and/or file system
snapshots.  When writing to unaligned regions, or to regions smaller
than a file system block, the writer must copy the portions of the
original file data to a new location on disk.  This behavior can
either be implemented on the client or the server.  The paragraphs
below describe how a pNFS SCSI layout client implements access to a
file that requires copy-on-write semantics.

Distinguishing the PNFS_SCSI_READ_WRITE_DATA and PNFS_SCSI_READ_DATA
skipping to change at page 18, line 9
client writes only a portion of an extent, the extent may be split at
block-aligned boundaries.
When a client wishes to write data to a PNFS_SCSI_INVALID_DATA extent
that is not covered by a PNFS_SCSI_READ_DATA extent, it MUST treat
this write identically to a write to a file not involved with copy-
on-write semantics.  Thus, data must be written in at least block-
sized increments, aligned to multiples of block-sized offsets, and
unwritten portions of blocks must be zero filled.
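A sketch of preparing such a block-sized write follows: the block
buffer is zeroed first and only the user's bytes are copied in, so
unwritten portions reach the disk as zeros.  The helper is
hypothetical and, for brevity, assumes the user data fits within a
single block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build one full block for a write to an INVALID extent not covered
 * by a READ extent.  blk_buf must be bsize bytes; the user's data
 * must lie entirely within the block containing file_off. */
static void build_block_write(uint8_t *blk_buf, uint64_t bsize,
                              uint64_t file_off, const uint8_t *data,
                              uint64_t len, uint64_t *blk_start)
{
    uint64_t in_blk = file_off % bsize;   /* offset within block */

    *blk_start = file_off - in_blk;       /* aligned write offset */
    memset(blk_buf, 0, bsize);            /* zero-fill whole block */
    memcpy(blk_buf + in_blk, data, len);  /* user bytes in place */
}
```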
2.4.6.  Extents are Permissions
Layout extents returned to pNFS clients grant permission to read or
write; PNFS_SCSI_READ_DATA and PNFS_SCSI_NONE_DATA are read-only
(PNFS_SCSI_NONE_DATA reads as zeros), PNFS_SCSI_READ_WRITE_DATA and
PNFS_SCSI_INVALID_DATA are read/write (PNFS_SCSI_INVALID_DATA reads
as zeros; any write converts it to PNFS_SCSI_READ_WRITE_DATA).  This
is the only means a client has of obtaining permission to perform
direct I/O to storage devices; a pNFS client MUST NOT perform direct
I/O operations that are not permitted by an extent held by the
client.  Client adherence to this rule places the pNFS server in
control of potentially conflicting storage device operations,
enabling the server to determine what does conflict and how to avoid
conflicts by granting and recalling extents to/from clients.
If a client makes a layout request that conflicts with an existing
layout delegation, the request will be rejected with the error
NFS4ERR_LAYOUTTRYLATER.  This client is then expected to retry the
request after a short interval.  During this interval, the server
SHOULD recall the conflicting portion of the layout delegation from
the client that currently holds it.  This reject-and-retry approach
does not prevent client starvation when there is contention for the
layout of a particular file.  For this reason, a pNFS server SHOULD
implement a mechanism to prevent starvation.  One possibility is that
the server can maintain a queue of rejected layout requests.  Each
skipping to change at page 19, line 8
layouts, I/Os will be issued from the clients that hold the layouts
directly to the storage devices that host the data.  These devices
have no knowledge of files, mandatory locks, or share reservations,
and are not in a position to enforce such restrictions.  For this
reason the NFSv4 server MUST NOT grant layouts that conflict with
mandatory locks or share reservations.  Further, if a conflicting
mandatory lock request or a conflicting open request arrives at the
server, the server MUST recall the part of the layout in conflict
with the request before granting the request.
2.4.7.  Partial-Block Updates

SCSI storage devices do not provide byte granularity access and can
only perform read and write operations atomically on a block
granularity.  Writes to SCSI storage devices thus require read-
modify-write cycles to write data smaller than the block size or
which is otherwise not block-aligned.  Write operations from multiple
clients to the same block can thus lead to data corruption even if
the byte ranges written by the applications do not overlap.  When
there are multiple clients who wish to access the same block, a pNFS
server MUST avoid these conflicts by implementing a concurrency
control policy of single writer XOR multiple readers for a given data
block.
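A toy admission check for the single-writer XOR multiple-readers
policy might look as follows.  The per-block grant counters are a
hypothetical server-side bookkeeping structure, not part of the
protocol.

```c
#include <assert.h>

/* Outstanding grants for one data block (server-side bookkeeping). */
struct blk_grants {
    unsigned readers;
    unsigned writers;
};

/* Nonzero if a new grant of the requested kind may be issued without
 * violating single writer XOR multiple readers. */
static int may_grant(const struct blk_grants *g, int write)
{
    if (write)
        return g->readers == 0 && g->writers == 0; /* exclusive */
    return g->writers == 0;         /* any number of readers */
}
```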
2.4.8. End-of-file Processing
The end-of-file location can be changed in two ways: implicitly as
the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file,
or explicitly as the result of a SETATTR request.  Typically, when a
file is truncated by an NFSv4 client via the SETATTR call, the server
frees any disk blocks belonging to the file that are beyond the new
end-of-file byte, and MUST write zeros to the portion of the new end-
of-file block beyond the new end-of-file byte.  These actions render
any pNFS layouts that refer to the blocks that are freed or written
semantically invalid.  Therefore, the server MUST recall from clients
the portions of any pNFS layouts that refer to blocks that will be
freed or written by the server before effecting the file truncation.
These recalls may take time to complete; as explained in [RFC5661],
if the server cannot respond to the client SETATTR request in a
reasonable amount of time, it SHOULD reply to the client with the
error NFS4ERR_DELAY.
Blocks in the PNFS_SCSI_INVALID_DATA state that lie beyond the new
end-of-file block present a special case.  The server has reserved
these blocks for use by a pNFS client with a writable layout for the
file, but the client has yet to commit the blocks, and they are not
yet a part of the file mapping on disk.  The server MAY free these
blocks while processing the SETATTR request.  If so, the server MUST
recall any layouts from pNFS clients that refer to the blocks before
processing the truncate.  If the server does not free the
PNFS_SCSI_INVALID_DATA blocks while processing the SETATTR request,
it need not recall layouts that refer only to the
PNFS_SCSI_INVALID_DATA blocks.
When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond
the current end-of-file, or extended explicitly by a SETATTR request,
the server need not recall any portions of any pNFS layouts.
2.4.9.  Layout Hints
The layout hint attribute specified in [RFC5661] is not supported by
the SCSI layout, and the pNFS server MUST reject setting a layout
hint attribute with a loh_type value of LAYOUT4_SCSI_VOLUME during
OPEN or SETATTR operations.  On a file system only supporting the
SCSI layout, a server MUST NOT report the layout_hint attribute in
the supported_attrs attribute.
2.4.10.  Client Fencing
The pNFS SCSI protocol must handle situations in which a system
failure, typically a network connectivity issue, requires the server
to unilaterally revoke extents from a client after the client fails
to respond to a CB_LAYOUTRECALL request.  This is implemented by
fencing off a non-responding client from access to the storage
device.
The pNFS SCSI protocol implements fencing using Persistent
Reservations (PRs), similar to the fencing method used by existing
shared disk file systems.  By placing a PR of type "Exclusive Access
- All Registrants" on each SCSI LU exported to pNFS clients, the MDS
prevents access from any client that does not have an outstanding
device ID that gives the client a reservation key to access the LU,
and allows the MDS to revoke access to the logical unit at any time.
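For illustration, the CDB and parameter list of the PERSISTENT
RESERVE OUT command used for this fencing scheme can be sketched as
below.  The byte layout is assumed from SPC-4 (10-byte CDB, opcode
5Fh, 24-byte basic parameter list) and should be verified against the
standard; the helper itself is hypothetical and does not issue any
I/O.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PR_OUT_OPCODE  0x5F   /* PERSISTENT RESERVE OUT */
#define PR_SA_REGISTER 0x00   /* service action: REGISTER */
#define PR_SA_RESERVE  0x01   /* service action: RESERVE */
#define PR_TYPE_EA_AR  0x08   /* Exclusive Access - All Registrants */
#define PR_ALL_TG_PT   0x04   /* ALL_TG_PT bit in the flags byte */

/* Fill in a PERSISTENT RESERVE OUT CDB and its basic parameter
 * list; layout per SPC-4 (assumed, verify against the standard). */
static void build_pr_out(uint8_t cdb[10], uint8_t param[24],
                         uint8_t service_action, uint8_t type,
                         uint64_t res_key, uint8_t flags)
{
    memset(cdb, 0, 10);
    cdb[0] = PR_OUT_OPCODE;
    cdb[1] = service_action & 0x1F; /* low 5 bits of byte 1 */
    cdb[2] = type & 0x0F;           /* scope = LU_SCOPE (0) */
    cdb[8] = 24;                    /* parameter list length */

    memset(param, 0, 24);
    for (int i = 0; i < 8; i++)     /* reservation key, big-endian */
        param[i] = (uint8_t)(res_key >> (8 * (7 - i)));
    param[20] = flags;              /* e.g. ALL_TG_PT on REGISTER */
}
```

A REGISTER for the MDS key followed by a RESERVE with type 8h, as
described in Section 2.4.10.2, would use this helper twice with
different service actions.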
2.4.10.1.  PRs - Key Generation
To allow fencing individual systems, each system must use a unique
Persistent Reservation key.  [SPC4] does not specify a way to
generate keys.  This document assigns the burden to generate unique
keys to the MDS, which must generate a key for itself before
exporting a volume, and a key for each client that accesses SCSI
layout volumes.  Individual keys for each volume that a client can
access are permitted but not required.
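Since [SPC4] leaves key generation entirely to the implementation, an MDS might, for example, draw random 64-bit values and track them for uniqueness, avoiding zero because several service actions use a zero key to mean "no key".  The sketch below is purely illustrative; the `KeyAllocator` class and its interface are assumptions, not part of this specification.

```python
import secrets

class KeyAllocator:
    """Illustrative MDS-side allocator for unique 64-bit PR keys.

    SPC-4 reservation keys are 8 bytes; zero is avoided because it
    denotes "no key" in several PERSISTENT RESERVE service actions.
    """

    def __init__(self):
        self.allocated = set()

    def new_key(self) -> int:
        # Retry until we draw a nonzero, previously unused key.
        while True:
            key = secrets.randbits(64)
            if key != 0 and key not in self.allocated:
                self.allocated.add(key)
                return key

alloc = KeyAllocator()
mds_key = alloc.new_key()      # key the MDS registers for itself
client_key = alloc.new_key()   # key handed to a client via sbv_pr_key
```

A production MDS would additionally persist the allocated set so keys survive server restarts.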
2.4.10.2. PRs - MDS Registration and Reservation
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
MDS needs to prepare the volume for fencing using PRs.  This is done
by registering the reservation generated for the MDS with the device
using the "PERSISTENT RESERVE OUT" command with a service action of
"REGISTER", followed by a "PERSISTENT RESERVE OUT" command with a
service action of "RESERVE" and the type field set to 8h (Exclusive
Access - All Registrants).  To make sure all I_T nexuses are
registered, the MDS SHOULD set the "All Target Ports" (ALL_TG_PT) bit
when registering the key, or otherwise ensure the registration is
performed for each initiator port.
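The two commands above can be illustrated by building the raw CDBs and parameter lists.  The sketch below follows the SPC-4 PERSISTENT RESERVE OUT layout (opcode 5Fh, 10-byte CDB, 24-byte basic parameter list); the `MDS_KEY` value and the helper names are illustrative assumptions, not mandated by this document.

```python
import struct

PR_OUT = 0x5F        # PERSISTENT RESERVE OUT opcode (SPC-4)
SA_REGISTER = 0x00   # service action codes
SA_RESERVE = 0x01
TYPE_EA_AR = 0x8     # type 8h: Exclusive Access - All Registrants
ALL_TG_PT = 0x04     # bit 2 of the parameter-list flags byte
MDS_KEY = 0x1122334455667788  # illustrative reservation key

def pr_out_cdb(service_action, pr_type=0):
    """Build the 10-byte PERSISTENT RESERVE OUT CDB."""
    cdb = bytearray(10)
    cdb[0] = PR_OUT
    cdb[1] = service_action & 0x1F
    cdb[2] = pr_type & 0x0F               # upper nibble: scope 0 (LU_SCOPE)
    struct.pack_into(">I", cdb, 5, 24)    # parameter list length
    return bytes(cdb)

def pr_out_params(res_key, sa_res_key=0, flags=0):
    """Build the 24-byte PR OUT parameter list."""
    buf = bytearray(24)
    struct.pack_into(">Q", buf, 0, res_key)     # RESERVATION KEY
    struct.pack_into(">Q", buf, 8, sa_res_key)  # SERVICE ACTION RES. KEY
    buf[20] = flags                             # e.g. ALL_TG_PT
    return bytes(buf)

# REGISTER: for an unregistered I_T nexus the RESERVATION KEY is zero
# and the new key goes in the SERVICE ACTION RESERVATION KEY field.
register = (pr_out_cdb(SA_REGISTER),
            pr_out_params(0, sa_res_key=MDS_KEY, flags=ALL_TG_PT))

# RESERVE with type 8h, using the key just registered.
reserve = (pr_out_cdb(SA_RESERVE, TYPE_EA_AR),
           pr_out_params(MDS_KEY))
```

A real MDS would pass these buffers to the device via its SCSI pass-through interface (e.g. SG_IO on Linux).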
2.4.10.3. PRs - Client Registration
Before performing the first I/O to a device returned from a
GETDEVICEINFO operation, the client will register the registration
key returned in sbv_pr_key with the storage device by issuing a
"PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
with the "SERVICE ACTION RESERVATION KEY" set to the reservation key
returned in sbv_pr_key.  To make sure all I_T nexuses are registered,
the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when
registering the key, or otherwise ensure the registration is
performed for each initiator port.
When a client stops using a device earlier returned by GETDEVICEINFO,
it MUST unregister the earlier registered key by issuing a
"PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
with the "RESERVATION KEY" set to the earlier registered reservation
key.
2.4.10.4. PRs - Fencing Action
In case of a non-responding client, the MDS fences the client by
issuing a "PERSISTENT RESERVE OUT" command with the service action
set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key field
set to the server's reservation key, the service action reservation
key field set to the reservation key associated with the non-
responding client, and the type field set to 8h (Exclusive Access -
All Registrants).
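The fencing command can be sketched the same way as the registration commands, again following the SPC-4 PERSISTENT RESERVE OUT layout.  The function name and key values are illustrative assumptions.

```python
import struct

# PERSISTENT RESERVE OUT, service action PREEMPT AND ABORT (SPC-4).
PR_OUT = 0x5F
SA_PREEMPT_AND_ABORT = 0x05
TYPE_EA_AR = 0x8   # Exclusive Access - All Registrants

def preempt_and_abort(server_key, client_key):
    """Build the CDB and parameter list that fence one client."""
    cdb = bytearray(10)
    cdb[0] = PR_OUT
    cdb[1] = SA_PREEMPT_AND_ABORT
    cdb[2] = TYPE_EA_AR                   # scope 0 (LU_SCOPE), type 8h
    struct.pack_into(">I", cdb, 5, 24)    # parameter list length
    params = bytearray(24)
    struct.pack_into(">Q", params, 0, server_key)  # MDS's own key
    struct.pack_into(">Q", params, 8, client_key)  # key being preempted
    return bytes(cdb), bytes(params)
```

Sending this command removes the non-responding client's registration, so all further I/O from that client fails with "RESERVATION CONFLICT".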
After the MDS preempts a client, all client I/O to the LU fails.  The
client should at this point return any layout that refers to the
device ID that points to the LU.  Note that the client can
distinguish I/O errors due to fencing from other errors based on the
"RESERVATION CONFLICT" SCSI status.  Refer to [SPC4] for details.
2.4.10.5. Client Recovery After a Fence Action
A client that detects a "RESERVATION CONFLICT" SCSI status (I/O
error) on the storage devices MUST commit all layouts that use the
storage device through the MDS, return all outstanding layouts for
the device, forget the device ID, and unregister the reservation key.
Future GETDEVICEINFO calls may refer to the storage device again, in
which case the client will perform a new registration based on the
key provided (via sbv_pr_key) at that time.
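The recovery steps can be sketched as a handler that a client runs when it sees "RESERVATION CONFLICT".  `StubClient` is a minimal in-memory stand-in; every method name here is hypothetical and merely mirrors the operations the text requires.

```python
class StubClient:
    """Minimal stand-in for a pNFS client (illustrative only)."""
    def __init__(self):
        self.layouts = {"dev1": ["layout-a", "layout-b"]}
        self.committed, self.returned = [], []
        self.devices = {"dev1"}
        self.keys = {"dev1": 0x1234}

    def layouts_for(self, dev): return list(self.layouts.get(dev, []))
    def layoutcommit(self, layout): self.committed.append(layout)
    def layoutreturn(self, dev): self.returned += self.layouts.pop(dev, [])
    def forget_device(self, dev): self.devices.discard(dev)
    def unregister_key(self, dev): self.keys.pop(dev, None)

def handle_reservation_conflict(client, device_id):
    # 1. Commit data already written, going through the MDS.
    for layout in client.layouts_for(device_id):
        client.layoutcommit(layout)
    # 2. Return all outstanding layouts that refer to the device.
    client.layoutreturn(device_id)
    # 3. Forget the device ID; a later GETDEVICEINFO may re-learn it
    #    along with a fresh sbv_pr_key for re-registration.
    client.forget_device(device_id)
    # 4. Drop the now-preempted reservation key locally.
    client.unregister_key(device_id)
```

The ordering matters: committing before returning ensures the MDS learns about written extents while the client still holds the layouts.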
2.5. Crash Recovery Issues
A critical requirement in crash recovery is that both the client and
the server know when the other has failed.  Additionally, it is
required that a client sees a consistent view of data across server
restarts.  These requirements and a full discussion of crash recovery
issues are covered in the "Crash Recovery" section of the NFSv4.1
specification [RFC5661].  This document contains additional crash
recovery material specific only to the SCSI layout.
skipping to change at page 23, line 11
should immediately respond with NFS4_OK, and then asynchronously
return complete file layouts until the number of files with layouts
cached on the client is less than craa_object_to_keep.
2.7. Transient and Permanent Errors
The server may respond to LAYOUTGET with a variety of error statuses.
These errors can convey transient conditions or more permanent
conditions that are unlikely to be resolved soon.
The error NFS4ERR_RECALLCONFLICT indicates that the server has
recently issued a CB_LAYOUTRECALL to the requesting client, making it
necessary for the client to respond to the recall before processing
the layout request.  A client can wait for that recall to be received
and processed, or it can retry as for NFS4ERR_TRYLATER, as described
below.

The error NFS4ERR_TRYLATER is used to indicate that the server cannot
immediately grant the layout to the client.  This may be due to
constraints on writable sharing of blocks by multiple clients or to a
conflict with a recallable lock (e.g. a delegation).  In either case,
a reasonable approach for the client is to wait several milliseconds
and retry the request.  The client SHOULD track the number of
retries, and if forward progress is not made, the client should
abandon the attempt to get a layout and perform READ and WRITE
operations by sending them to the server.
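The retry-then-fallback behaviour described above might look like the sketch below.  `send_layoutget` is an assumed callable returning a (status, layout) pair, and the status names are passed around as strings purely for illustration; a real client would use the numeric NFSv4.1 status codes and its own RPC machinery.

```python
import time

def get_layout_with_fallback(send_layoutget, max_retries=10,
                             delay_s=0.005):
    """Bounded-retry sketch for LAYOUTGET (hypothetical interface).

    Returns the layout on success, or None when the client should
    abandon layouts and send READ/WRITE operations to the MDS.
    """
    for _ in range(max_retries):
        status, layout = send_layoutget()
        if status == "NFS4_OK":
            return layout
        if status not in ("NFS4ERR_TRYLATER",
                          "NFS4ERR_RECALLCONFLICT"):
            break                     # permanent error: give up now
        time.sleep(delay_s)           # wait a few ms, then retry
    return None

# Example: two transient failures, then success.
attempts = iter([("NFS4ERR_TRYLATER", None),
                 ("NFS4ERR_TRYLATER", None),
                 ("NFS4_OK", "layout-1")])
result = get_layout_with_fallback(lambda: next(attempts), delay_s=0)
```

Capping the retry count is what guarantees the forward-progress requirement: after `max_retries` transient failures the client stops asking for a layout and falls back to MDS I/O.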
The error NFS4ERR_LAYOUTUNAVAILABLE may be returned by the server if
layouts are not supported for the requested file or its containing
file system.  The server may also return this error code if the
server is in the process of migrating the file from secondary
storage, there is a conflicting lock that would prevent the layout
from being granted, or for any other reason that causes the server to
be unable to supply the layout.  As a result of receiving
NFS4ERR_LAYOUTUNAVAILABLE, the client should abandon the attempt to
get a layout and perform READ and WRITE operations by sending them to
the MDS.  It is expected that a client will not cache the file's
layoutunavailable state forever.  In particular, when the file is
closed or opened by the client, issuing a new LAYOUTGET is
appropriate.
2.8. Volatile write caches
Many storage devices implement volatile write caches that require an
explicit flush to persist the data from write operations to stable
storage.  Storage devices implementing [SBC3] should indicate a
volatile write cache by setting the WCE bit to 1 in the Caching mode
page.  When a volatile write cache is used, the pNFS server must
ensure the volatile write cache has been committed to stable storage
before the LAYOUTCOMMIT operation returns by using one of the
SYNCHRONIZE CACHE commands.
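A server implementation might check for a volatile cache and flush it as sketched below.  The byte offsets follow the SBC-3 Caching mode page (page code 08h, WCE is bit 2 of byte 2) and the SYNCHRONIZE CACHE (10) CDB (opcode 35h); the function names are illustrative.

```python
def wce_enabled(caching_mode_page: bytes) -> bool:
    """Check the WCE bit in the Caching mode page (SBC-3, page 08h).

    The page is as returned by MODE SENSE with the header stripped:
    byte 0 carries the page code, byte 2 bit 2 is WCE.
    """
    assert caching_mode_page[0] & 0x3F == 0x08, "not the Caching mode page"
    return bool(caching_mode_page[2] & 0x04)

def synchronize_cache10_cdb() -> bytes:
    """SYNCHRONIZE CACHE (10) CDB flushing the whole medium.

    With both the logical block address (bytes 2-5) and the number of
    blocks (bytes 7-8) left zero, the device flushes all cached data.
    """
    cdb = bytearray(10)
    cdb[0] = 0x35   # SYNCHRONIZE CACHE (10) opcode
    return bytes(cdb)
```

A pNFS server would issue the flush from its LAYOUTCOMMIT handler whenever `wce_enabled` reported a volatile cache on the LU.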
3. Enforcing NFSv4 Semantics
The functionality provided by SCSI Persistent Reservations makes it
possible for the MDS to control access by individual client machines
skipping to change at page 28, line 16
[SAM-4]    INCITS Technical Committee T10, "SCSI Architecture Model -
           4 (SAM-4)", ANSI INCITS 447-2008, ISO/IEC 14776-414, 2008.

[SAS3]     INCITS Technical Committee T10, "Serial Attached SCSI-3",
           ANSI INCITS 519-2014, ISO/IEC 14776-154, 2014.

[SBC3]     INCITS Technical Committee T10, "SCSI Block Commands-3",
           ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.

[SPC4]     INCITS Technical Committee T10, "SCSI Primary Commands-4",
           ANSI INCITS 513-2015, 2015.
Appendix A. Acknowledgments
Large parts of this document were copied verbatim, and others were
inspired by [RFC5663].  Thanks to David Black, Stephen Fridella and
Jason Glasgow for their work on the pNFS block/volume layout
protocol.
David Black, Robert Elliott and Tom Haynes provided a thorough review
of early drafts of this document, and their input led to the current
form of the document.
David Noveck provided ample feedback to various drafts of this
document, wrote the section on enforcing NFSv4 semantics, and rewrote
various sections to better capture the intent.
Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this
document as an RFC]
[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]