draft-ietf-nfsv4-pnfs-obj-09.txt   draft-ietf-nfsv4-pnfs-obj-10.txt 
NFSv4 B. Halevy NFSv4 B. Halevy
Internet-Draft B. Welch Internet-Draft B. Welch
Intended status: Standards Track J. Zelenka Intended status: Standards Track J. Zelenka
Expires: December 21, 2008 Panasas Expires: June 5, 2009 Panasas
June 19, 2008 December 02, 2008
Object-based pNFS Operations Object-based pNFS Operations
draft-ietf-nfsv4-pnfs-obj-09 draft-ietf-nfsv4-pnfs-obj-10
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 21, 2008. This Internet-Draft will expire on June 5, 2009.
Copyright Notice
Copyright (C) The IETF Trust (2008).
Abstract Abstract
This Internet-Draft provides a description of the object-based pNFS Parallel NFS (pNFS) extends NFSv4 to allow clients to directly access
extension for NFSv4. This is a companion to the main pnfs file data on the storage used by the NFSv4 server. This ability to
specification in the NFSv4 Minor Version 1 Internet Draft, which is bypass the server for data access can increase both performance and
currently draft-ietf-nfsv4-minorversion1-23. parallelism, but requires additional client functionality for data
access, some of which is dependent on the class of storage used,
a.k.a. the Layout Type. The main pNFS operations and data types in
NFSv4 Minor Version 1 specify a layout-type-independent layer;
layout-type-specific information is conveyed using opaque data
structures whose internal structure is further defined by the
particular layout type specification. This document specifies the
NFSv4.1 Object-based pNFS Layout Type in companion with the main
NFSv4 Minor Version 1 specification.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. XDR Description of the Objects-Based Layout Protocol . . . . . 4 2. XDR Description of the Objects-Based Layout Protocol . . . . . 4
2.1. Basic Data Type Definitions . . . . . . . . . . . . . . . 5 2.1. Basic Data Type Definitions . . . . . . . . . . . . . . . 5
2.1.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . 5 2.1.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . 5
2.1.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . 6 2.1.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . 6
2.1.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . 6 2.1.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . 6
2.1.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . 8 2.1.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . 7
3. Object Storage Device Addressing and Discovery . . . . . . . . 8 3. Object Storage Device Addressing and Discovery . . . . . . . . 8
3.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 9 3.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 9
3.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 9 3.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 9
3.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 10 3.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 10
3.2.2. Device Network Address . . . . . . . . . . . . . . . . 11 3.2.2. Device Network Address . . . . . . . . . . . . . . . . 11
4. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 11 4. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 11
4.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 12 4.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 12
4.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 13 4.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 13
4.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 14 4.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 14
4.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 14 4.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 14
skipping to change at page 4, line 12 skipping to change at page 4, line 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34
Intellectual Property and Copyright Statements . . . . . . . . . . 35 Intellectual Property and Copyright Statements . . . . . . . . . . 35
1. Introduction 1. Introduction
In pNFS, the file server returns typed layout structures that In pNFS, the file server returns typed layout structures that
describe where file data is located. There are different layouts for describe where file data is located. There are different layouts for
different storage systems and methods of arranging data on storage different storage systems and methods of arranging data on storage
devices. This document describes the layouts used with object-based devices. This document describes the layouts used with object-based
storage devices (OSD) that are accessed according to the OSD storage storage devices (OSD) that are accessed according to the OSD storage
protocol standard (SNIA T10/1355-D [2]). protocol standard (ANSI INCITS 400-2004 [2]).
An "object" is a container for data and attributes, and files are An "object" is a container for data and attributes, and files are
stored in one or more objects. The OSD protocol specifies several stored in one or more objects. The OSD protocol specifies several
operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES,
SET ATTRIBUTES, CREATE and DELETE. However, using the object-based SET ATTRIBUTES, CREATE and DELETE. However, using the object-based
layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH
commands. The other commands are only used by the pNFS server. commands. The other commands are only used by the pNFS server.
An object-based layout for pNFS includes object identifiers, An object-based layout for pNFS includes object identifiers,
capabilities that allow clients to READ or WRITE those objects, and capabilities that allow clients to READ or WRITE those objects, and
skipping to change at page 5, line 6 skipping to change at page 5, line 6
sh extract.sh < spec.txt > pnfs_osd_prot.x sh extract.sh < spec.txt > pnfs_osd_prot.x
The effect of the script is to remove leading white space from each The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///". line, plus a sentinel sequence of "///".
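As an illustrative sketch only (the body of extract.sh is not shown in this comparison), the effect described above could be reproduced in Python roughly as follows. The helper name extract_xdr and the decision to drop lines that do not carry the sentinel are assumptions, not part of the draft.

```python
import re

# Hypothetical helper, not taken from the draft. The pattern matches
# optional leading white space followed by the "///" sentinel.
_SENTINEL = re.compile(r"^\s*///")


def extract_xdr(text: str) -> str:
    """Return the sentinel-marked lines with white space and "///" removed."""
    out = []
    for line in text.splitlines():
        if _SENTINEL.match(line):
            # Assumption: lines without the sentinel are prose and are dropped.
            out.append(_SENTINEL.sub("", line, count=1))
    return "\n".join(out)


sample = "   ////*\n   /// * pnfs_osd_prot.x\n   /// */\nprose line\n   ///\n"
print(extract_xdr(sample))
```

Under these assumptions, the line "////*" becomes "/*", and a line holding only the sentinel becomes an empty line.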
The embedded XDR file header follows. Subsequent XDR descriptions, The embedded XDR file header follows. Subsequent XDR descriptions,
with the sentinel sequence are embedded throughout the document. with the sentinel sequence are embedded throughout the document.
Note that the XDR code contained in this document depends on types Note that the XDR code contained in this document depends on types
from the NFSv4.1 nfs4_prot.x file ([8]). This includes both nfs from the NFSv4.1 nfs4_prot.x file ([4]). This includes both nfs
types that end with a 4, such as offset4, length4, etc, as well as types that end with a 4, such as offset4, length4, etc, as well as
more generic types such as uint32_t and uint64_t. more generic types such as uint32_t and uint64_t.
////* ////*
/// * This file was machine generated for /// * This code was derived from IETF RFC &rfc.number.
/// * draft-ietf-nfsv4-pnfs-obj-09 [[RFC Editor: please insert RFC number if needed]]
/// * Last updated Thu Jun 19 07:35:44 UTC 2008 /// * Please reproduce this note if possible.
/// *
/// * Copyright (C) The IETF Trust (2007-2008)
/// * All Rights Reserved.
/// *
/// * Copyright (C) The Internet Society (1998-2006).
/// * All Rights Reserved.
/// */ /// */
/// ///
////* ////*
/// * pnfs_osd_prot.x /// * pnfs_osd_prot.x
/// */ /// */
/// ///
///%#include <nfs4_prot.x> ///%#include <nfs4_prot.x>
/// ///
2.1. Basic Data Type Definitions 2.1. Basic Data Type Definitions
skipping to change at page 6, line 4 skipping to change at page 5, line 47
/// uint64_t oid_object_id; /// uint64_t oid_object_id;
///}; ///};
/// ///
The pnfs_osd_objid4 type is used to identify an object within a The pnfs_osd_objid4 type is used to identify an object within a
partition on a specified object storage device. "oid_device_id" partition on a specified object storage device. "oid_device_id"
selects the object storage device from the set of available storage selects the object storage device from the set of available storage
devices. The device is identified with the deviceid4 type, which is devices. The device is identified with the deviceid4 type, which is
an index into addressing information about that device returned by an index into addressing information about that device returned by
the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data
type is defined in NFSv4.1 draft [9]. Within an OSD, a partition is type is defined in NFSv4.1 [5]. Within an OSD, a partition is
identified with a 64-bit number, "oid_partition_id". Within a identified with a 64-bit number, "oid_partition_id". Within a
partition, an object is identified with a 64-bit number, partition, an object is identified with a 64-bit number,
"oid_object_id". Creation and management of partitions is outside "oid_object_id". Creation and management of partitions is outside
the scope of this standard, and is a facility provided by the object the scope of this standard, and is a facility provided by the object
storage file system. storage file system.
2.1.2. pnfs_osd_version4 2.1.2. pnfs_osd_version4
///enum pnfs_osd_version4 { ///enum pnfs_osd_version4 {
/// PNFS_OSD_MISSING = 0, /// PNFS_OSD_MISSING = 0,
/// PNFS_OSD_VERSION_1 = 1, /// PNFS_OSD_VERSION_1 = 1,
/// PNFS_OSD_VERSION_2 = 2 /// PNFS_OSD_VERSION_2 = 2
///}; ///};
/// ///
Pnfs_osd_version4 is used to indicate the OSD protocol version or pnfs_osd_version4 is used to indicate the OSD protocol version or
whether an object is missing (i.e., unavailable). Some of the whether an object is missing (i.e., unavailable). Some of the
object-based layout supported raid algorithms encode redundant object-based layout supported raid algorithms encode redundant
information and can compensate for missing components, but the data information and can compensate for missing components, but the data
placement algorithm needs to know what parts are missing. placement algorithm needs to know what parts are missing.
At this time the OSD standard is at version 1.0, and we anticipate a At this time the OSD standard is at version 1.0, and we anticipate a
version 2.0 of the standard (SNIA T10/1729-D [10]). The second version 2.0 of the standard (SNIA T10/1729-D [12]). The second
generation OSD protocol has additional proposed features to support generation OSD protocol has additional proposed features to support
more robust error recovery, snapshots, and byte-range capabilities. more robust error recovery, snapshots, and byte-range capabilities.
Therefore, the OSD version is explicitly called out in the Therefore, the OSD version is explicitly called out in the
information returned in the layout. (This information can also be information returned in the layout. (This information can also be
deduced by looking inside the capability type at the format field, deduced by looking inside the capability type at the format field,
which is the first byte. The format value is 0x1 for an OSD v1 which is the first byte. The format value is 0x1 for an OSD v1
capability. However, it seems most robust to call out the version capability. However, it seems most robust to call out the version
explicitly.) explicitly.)
2.1.3. pnfs_osd_object_cred4 2.1.3. pnfs_osd_object_cred4
skipping to change at page 7, line 26 skipping to change at page 7, line 20
Section 12). Therefore, a client SHOULD either issue the LAYOUTGET Section 12). Therefore, a client SHOULD either issue the LAYOUTGET
or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service
or to previously establish an SSV for the sessions via the NFSv4.1 or to previously establish an SSV for the sessions via the NFSv4.1
SET_SSV operation. The pnfs_osd_cap_key_sec4 type is used to SET_SSV operation. The pnfs_osd_cap_key_sec4 type is used to
identify the method used by the server to secure the capability key. identify the method used by the server to secure the capability key.
o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is
not encrypted in which case the client SHOULD issue the LAYOUTGET not encrypted in which case the client SHOULD issue the LAYOUTGET
or GETDEVICEINFO operations with RPCSEC_GSS with the privacy or GETDEVICEINFO operations with RPCSEC_GSS with the privacy
service or the NFSv4.1 transport should be secured by using service or the NFSv4.1 transport should be secured by using
methods that are external to NFSv4.1 like the use of IPSEC [11] methods that are external to NFSv4.1 like the use of IPSEC [13]
for transporting the NFSv4.1 protocol. for transporting the NFSv4.1 protocol.
o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key
contents are encrypted using the SSV GSS context and the contents are encrypted using the SSV GSS context and the
capability key as inputs to the GSS_Wrap() function (see GSS-API capability key as inputs to the GSS_Wrap() function (see GSS-API
[4]) with the conf_req_flag set to TRUE. The client MUST use the [6]) with the conf_req_flag set to TRUE. The client MUST use the
secret SSV key as part of the client's GSS context to decrypt the secret SSV key as part of the client's GSS context to decrypt the
capability key using the value of the oc_capability_key field as capability key using the value of the oc_capability_key field as
the input_message to the GSS_unwrap() function. Note that to the input_message to the GSS_unwrap() function. Note that to
prevent eavesdropping of the SSV key the client SHOULD issue prevent eavesdropping of the SSV key the client SHOULD issue
SET_SSV via RPCSEC_GSS with the privacy service. SET_SSV via RPCSEC_GSS with the privacy service.
The actual method chosen depends on whether the client established a The actual method chosen depends on whether the client established a
SSV key with the server and whether it issued the operation with the SSV key with the server and whether it issued the operation with the
RPCSEC_GSS privacy method. Naturally, if the client did not RPCSEC_GSS privacy method. Naturally, if the client did not
establish a SSV key via SET_SSV the server MUST use the establish a SSV key via SET_SSV the server MUST use the
skipping to change at page 8, line 45 skipping to change at page 8, line 34
in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and
"oda_osdname" fields. "oda_osdname" fields.
In some situations, SCSI target discovery may need to be driven based In some situations, SCSI target discovery may need to be driven based
on information contained in the GETDEVICEINFO response. One example on information contained in the GETDEVICEINFO response. One example
of this is iSCSI targets that are not known to the client until a of this is iSCSI targets that are not known to the client until a
layout has been requested. The information provided as the layout has been requested. The information provided as the
"targetid", "netaddr", and "lun" fields in the pnfs_osd_deviceaddr4 "targetid", "netaddr", and "lun" fields in the pnfs_osd_deviceaddr4
type described below (see Section 3.2), allows the client to probe a type described below (see Section 3.2), allows the client to probe a
specific device given its network address and optionally its iSCSI specific device given its network address and optionally its iSCSI
Name (see iSCSI [5]), or when the device network address is omitted, Name (see iSCSI [7]), or when the device network address is omitted,
to discover the object storage device using the provided device name to discover the object storage device using the provided device name
or SCSI device identifier (See SPC-3 [6].) or SCSI device identifier (See SPC-3 [8].)
The oda_systemid is implicitly used by the client, by using the The oda_systemid is implicitly used by the client, by using the
object credential signing key to sign each request with the request object credential signing key to sign each request with the request
integrity check value. This method protects the client from integrity check value. This method protects the client from
unintentionally accessing a device if the device address mapping was unintentionally accessing a device if the device address mapping was
changed (or revoked). The server computes the capability key using changed (or revoked). The server computes the capability key using
its own view of the systemid associated with the respective deviceid its own view of the systemid associated with the respective deviceid
present in the credential. If the client's view of the deviceid present in the credential. If the client's view of the deviceid
mapping is stale, the client will use the wrong systemid (which must mapping is stale, the client will use the wrong systemid (which must
be system-wide unique) and the I/O request to the OSD will fail to be system-wide unique) and the I/O request to the OSD will fail to
skipping to change at page 10, line 37 skipping to change at page 10, line 37
/// opaque oda_systemid<>; /// opaque oda_systemid<>;
/// pnfs_osd_object_cred4 oda_root_obj_cred; /// pnfs_osd_object_cred4 oda_root_obj_cred;
/// opaque oda_osdname<>; /// opaque oda_osdname<>;
///}; ///};
/// ///
3.2.1. SCSI Target Identifier 3.2.1. SCSI Target Identifier
When "oda_targetid" is specified as a OBJ_TARGET_SCSI_NAME, the When "oda_targetid" is specified as a OBJ_TARGET_SCSI_NAME, the
"oti_scsi_name" string MUST be formatted as a "iSCSI Name" as "oti_scsi_name" string MUST be formatted as a "iSCSI Name" as
specified in iSCSI [5] and [7]. Note that the specification of the specified in iSCSI [7] and [9]. Note that the specification of the
oti_scsi_name string format is outside the scope of this document. oti_scsi_name string format is outside the scope of this document.
Parsing the string is based on the string prefix, e.g. "iqn.", Parsing the string is based on the string prefix, e.g. "iqn.",
"eui.", or "naa." and more formats MAY be specified in the future in "eui.", or "naa." and more formats MAY be specified in the future in
accordance with iSCSI Names properties. accordance with iSCSI Names properties.
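A minimal sketch of the prefix-based dispatch described above, in Python. The function name, the returned labels, and the example name are hypothetical, and the sketch compares the string exactly as received. It does not validate the remainder of the name against the iSCSI naming rules.

```python
# Prefixes named in the text above; the labels are illustrative only.
_PREFIXES = {"iqn.": "iqn", "eui.": "eui", "naa.": "naa"}


def scsi_name_format(name: str) -> str:
    """Classify an oti_scsi_name string by its prefix."""
    for prefix, label in _PREFIXES.items():
        if name.startswith(prefix):
            return label
    # Formats added in the future would land here until the table is extended.
    return "unknown"


# Hypothetical example name, not taken from the draft.
print(scsi_name_format("iqn.2008-12.com.example:osd0"))
```

Keeping the prefixes in a table means a newly specified format only requires one more entry.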
Currently, the iSCSI Name provides for naming the target device using Currently, the iSCSI Name provides for naming the target device using
a string formmatted as an iSCSI Qualified Name (IQN) or as an EUI a string formatted as an iSCSI Qualified Name (IQN) or as an EUI [10]
[12] string. Those are typically used to identify iSCSI or SRP [13] string. Those are typically used to identify iSCSI or SRP [14]
devices. The Network Address Authority (NAA) string format (see [7]) devices. The Network Address Authority (NAA) string format (see [9])
provides for naming the device using globally unique identifiers, as provides for naming the device using globally unique identifiers, as
defined in FC-FS [14]. These are typically used to identify Fibre defined in FC-FS [15]. These are typically used to identify Fibre
Channel or SAS [15] (Serial Attached SCSI) devices. In particular, Channel or SAS [16] (Serial Attached SCSI) devices. In particular,
such devices that are dual-attached both over Fibre Channel or SAS, such devices that are dual-attached both over Fibre Channel or SAS,
and over iSCSI. and over iSCSI.
When "oda_targetid" is specified as a OBJ_TARGET_SCSI_DEVICE_ID, the When "oda_targetid" is specified as a OBJ_TARGET_SCSI_DEVICE_ID, the
"oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
Identifier as defined in SPC-3 [6] VPD Page 83h (Section 7.6.3. Identifier as defined in SPC-3 [8] VPD Page 83h (Section 7.6.3.
"Device Identification VPD Page".) If the Device Identifier is "Device Identification VPD Page".) If the Device Identifier is
identical to the OSD System ID, as given by oda_systemid, the server identical to the OSD System ID, as given by oda_systemid, the server
SHOULD provide a zero-length oti_scsi_device_id<> opaque value. Note SHOULD provide a zero-length oti_scsi_device_id<> opaque value. Note
that similarly to the "oti_scsi_name", the specification of the that similarly to the "oti_scsi_name", the specification of the
oti_scsi_device_id opaque contents is outside the scope of this oti_scsi_device_id opaque contents is outside the scope of this
document and more formats MAY be specified in the future in document and more formats MAY be specified in the future in
accordance with SPC-3. accordance with SPC-3.
The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing
no target identification. In this case only the OSD System ID and no target identification. In this case only the OSD System ID and
optionally, the provided network address, are used to locate the optionally, the provided network address, are used to locate the
device. device.
3.2.2. Device Network Address 3.2.2. Device Network Address
The optional "oda_targetaddr" field MAY be provided by the server as The optional "oda_targetaddr" field MAY be provided by the server as
a hint to accelerate device discovery over e.g., the iSCSI transport a hint to accelerate device discovery over e.g., the iSCSI transport
protocol. The network address is given with the netaddr4 type, which protocol. The network address is given with the netaddr4 type, which
specifies a TCP/IP based endpoint (as specified in NFSv4.1 draft specifies a TCP/IP based endpoint (as specified in NFSv4.1 [5]).
[9]). When given, the client SHOULD use it to probe for the SCSI When given, the client SHOULD use it to probe for the SCSI device at
device at the given network address. The client MAY still use other the given network address. The client MAY still use other discovery
discovery mechanisms such as iSNS [16] to locate the device using the mechanisms such as iSNS [11] to locate the device using the
oda_targetid. In particular, such an external name service SHOULD be oda_targetid. In particular, such an external name service SHOULD be
used when the devices may be attached to the network using multiple used when the devices may be attached to the network using multiple
connections, and/or multiple storage fabrics (e.g. Fibre-Channel and connections, and/or multiple storage fabrics (e.g. Fibre-Channel and
iSCSI.) iSCSI.)
4. Object-Based Layout 4. Object-Based Layout
The layout4 type is defined in the NFSv4.1 draft [9] as follows: The layout4 type is defined in the NFSv4.1 [5] as follows:
enum layouttype4 { enum layouttype4 {
LAYOUT4_NFSV4_1_FILES = 1, LAYOUT4_NFSV4_1_FILES = 1,
LAYOUT4_OSD2_OBJECTS = 2, LAYOUT4_OSD2_OBJECTS = 2,
LAYOUT4_BLOCK_VOLUME = 3 LAYOUT4_BLOCK_VOLUME = 3
}; };
struct layout_content4 { struct layout_content4 {
layouttype4 loc_type; layouttype4 loc_type;
opaque loc_body<>; opaque loc_body<>;
}; };
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
This document defines structure associated with the layouttype4 This document defines structure associated with the layouttype4
value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 draft [9] specifies the value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 [5] specifies the loc_body
loc_body structure as an XDR type "opaque". The opaque layout is structure as an XDR type "opaque". The opaque layout is
uninterpreted by the generic pNFS client layers, but obviously must uninterpreted by the generic pNFS client layers, but obviously must
be interpreted by the object-storage layout driver. This section be interpreted by the object-storage layout driver. This section
defines the structure of this opaque value, pnfs_osd_layout4. defines the structure of this opaque value, pnfs_osd_layout4.
4.1. pnfs_osd_data_map4 4.1. pnfs_osd_data_map4
///struct pnfs_osd_data_map4 { ///struct pnfs_osd_data_map4 {
/// uint32_t odm_num_comps; /// uint32_t odm_num_comps;
/// length4 odm_stripe_unit; /// length4 odm_stripe_unit;
/// uint32_t odm_group_width; /// uint32_t odm_group_width;
skipping to change at page 14, line 18 skipping to change at page 14, line 18
GETATTR commands to the metadata server. The client uses the file GETATTR commands to the metadata server. The client uses the file
size to decide if it should fill holes with zeros, or return a short size to decide if it should fill holes with zeros, or return a short
read. Striping patterns can cause cases where component objects are read. Striping patterns can cause cases where component objects are
shorter than other components because a hole happens to correspond to shorter than other components because a hole happens to correspond to
the last part of the component object. the last part of the component object.
4.3. Data Mapping Schemes 4.3. Data Mapping Schemes
This section describes the different data mapping schemes in detail. This section describes the different data mapping schemes in detail.
The object layout always uses a "dense" layout as described in The object layout always uses a "dense" layout as described in
NFSv4.1 draft [9]. This means that the second stripe unit of the NFSv4.1 [5]. This means that the second stripe unit of the file
file starts at offset 0 of the second component, rather than at starts at offset 0 of the second component, rather than at offset
offset stripe_unit bytes. After a full stripe has been written, the stripe_unit bytes. After a full stripe has been written, the next
next stripe unit is appended to the first component object in the stripe unit is appended to the first component object in the list
list without any holes in the component objects. without any holes in the component objects.
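The dense placement described above can be sketched in Python as follows. This is an illustration derived from the prose in this section, not the draft's normative equations. It assumes plain striping with no groups, mirroring, or parity, and the names dense_map, stripe_unit, and num_comps are chosen for the example.

```python
def dense_map(logical_offset: int, stripe_unit: int, num_comps: int):
    """Map a file offset to (component index, offset within that component)."""
    unit = logical_offset // stripe_unit   # which stripe unit of the file
    comp = unit % num_comps                # units rotate across components
    full_stripes = unit // num_comps       # complete stripes written before it
    # Dense: each component holds one unit per stripe, packed without holes.
    obj_offset = full_stripes * stripe_unit + logical_offset % stripe_unit
    return comp, obj_offset


# Second stripe unit of the file: component 1, offset 0 (not stripe_unit).
print(dense_map(4096, 4096, 3))
# First unit after one full stripe: appended to component 0 at stripe_unit.
print(dense_map(3 * 4096, 4096, 3))
```

The two calls correspond to the two statements in the paragraph above: the second stripe unit starts at offset 0 of the second component, and the unit following a full stripe is appended to the first component.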
4.3.1. Simple Striping 4.3.1. Simple Striping
The mapping from the logical offset within a file (L) to the The mapping from the logical offset within a file (L) to the
component object C and object-specific offset O is defined by the component object C and object-specific offset O is defined by the
following equations: following equations:
L = logical offset into the file L = logical offset into the file
W = total number of components W = total number of components
S = W * stripe_unit S = W * stripe_unit
skipping to change at page 19, line 24 skipping to change at page 19, line 24
object, the result could include different data in the same ranges of object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information. It is the mirrored tuples, or corrupt parity information. It is the
responsibility of the metadata server to enforce serialization responsibility of the metadata server to enforce serialization
requirements such as this. For example, the metadata server may do requirements such as this. For example, the metadata server may do
so by not granting overlapping write layouts within mirrored objects. so by not granting overlapping write layouts within mirrored objects.
5. Object-Based Layout Update 5. Object-Based Layout Update
layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates
to the layout and additional information to the metadata server. It to the layout and additional information to the metadata server. It
is defined in the NFSv4.1 draft [9] as follows: is defined in the NFSv4.1 [5] as follows:
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_body<>; opaque lou_body<>;
}; };
The layoutupdate4 type is an opaque value at the generic pNFS client The layoutupdate4 type is an opaque value at the generic pNFS client
level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the
lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type.
skipping to change at page 20, line 30 skipping to change at page 20, line 30
write (*), which can be different than the number of bytes written write (*), which can be different than the number of bytes written
because of internal overhead like block-level allocation and indirect because of internal overhead like block-level allocation and indirect
blocks, and the client reflects this back to the pNFS server so it blocks, and the client reflects this back to the pNFS server so it
can accurately track quota. The pNFS server can choose to trust this can accurately track quota. The pNFS server can choose to trust this
information coming from the clients and therefore avoid querying the information coming from the clients and therefore avoid querying the
OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain
this information from the OSD, it simply returns invalid this information from the OSD, it simply returns invalid
olu_delta_space_used. olu_delta_space_used.
(*) Note: At the time this document is written, a per-command used (*) Note: At the time this document is written, a per-command used
capacity attribute is not yet standardized by OSD2 draft [10]. The capacity attribute is not yet standardized by OSD2 draft [12]. The
client MAY use vendor-specific attributes to calculate space client MAY use vendor-specific attributes to calculate space
utilization, provided that the vendor defines and publishes a utilization, provided that the vendor defines and publishes a
suitable vendor-specific attributes page for current-command suitable vendor-specific attributes page for current-command
attributes as defined by OSD2 draft [10], Section 7.1.2.2. attributes as defined by OSD2 draft [12], Section 7.1.2.2.
5.2. pnfs_osd_layoutupdate4 5.2. pnfs_osd_layoutupdate4
///struct pnfs_osd_layoutupdate4 { ///struct pnfs_osd_layoutupdate4 {
/// pnfs_osd_deltaspaceused4 olu_delta_space_used; /// pnfs_osd_deltaspaceused4 olu_delta_space_used;
/// bool olu_ioerr_flag; /// bool olu_ioerr_flag;
///}; ///};
/// ///
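A minimal sketch of serializing the structure above, under one assumption that is not visible in this excerpt: pnfs_osd_deltaspaceused4 is taken to be an XDR union discriminated by a boolean validity flag, carrying a signed 64-bit delta only when the flag is true. The function names are hypothetical. Passing None models the case where the client could not obtain the value from the OSD and returns an invalid olu_delta_space_used.

```python
import struct
from typing import Optional


def encode_deltaspaceused(delta: Optional[int]) -> bytes:
    """Encode the assumed union: bool discriminant, then int64 only if valid."""
    if delta is None:
        # Invalid: discriminant FALSE, no further data.
        return struct.pack(">I", 0)
    # Valid: discriminant TRUE followed by a signed 64-bit delta.
    return struct.pack(">I", 1) + struct.pack(">q", delta)


def encode_osd_layoutupdate(delta: Optional[int], ioerr_flag: bool) -> bytes:
    """Encode pnfs_osd_layoutupdate4: delta space used, then the error flag."""
    return encode_deltaspaceused(delta) + struct.pack(">I", 1 if ioerr_flag else 0)
```

The delta is signed because space usage can shrink as well as grow.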
"olu_delta_space_used" is used to convey capacity usage information "olu_delta_space_used" is used to convey capacity usage information
skipping to change at page 21, line 47 skipping to change at page 21, line 47
client MAY just retry the I/O operation(s) using regular NFS READ or client MAY just retry the I/O operation(s) using regular NFS READ or
WRITE operations via the metadata server. The client SHOULD attempt WRITE operations via the metadata server. The client SHOULD attempt
to retrieve a new layout and retry the I/O operation using OSD to retrieve a new layout and retry the I/O operation using OSD
commands first and only if the error persists, retry the I/O commands first and only if the error persists, retry the I/O
operation via the metadata server. operation via the metadata server.
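The recommended ordering (retry through the OSDs with a fresh layout first, fall back to the metadata server only if the error persists) could be sketched as below. The exception class and the three callables are hypothetical stand-ins for client internals and are not defined by this document.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class OsdIoError(Exception):
    """Hypothetical error raised when an OSD command fails."""


def io_with_fallback(
    osd_io: Callable[[], T],
    refresh_layout: Callable[[], None],
    mds_io: Callable[[], T],
) -> T:
    """Try OSD I/O, retry once with a new layout, then fall back to the MDS."""
    try:
        return osd_io()
    except OsdIoError:
        pass
    # SHOULD: retrieve a new layout and retry using OSD commands first.
    refresh_layout()
    try:
        return osd_io()
    except OsdIoError:
        # The error persisted: use regular NFS READ/WRITE via the metadata server.
        return mds_io()
```

A client that takes the MAY path instead would skip the refresh and call mds_io directly after the first failure.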
7. Object-Based Layout Return 7. Object-Based Layout Return
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
layout-type specific information to the server. It is defined in the layout-type specific information to the server. It is defined in the
NFSv4.1 draft [9] as follows: NFSv4.1 [5] as follows:
struct layoutreturn_file4 { struct layoutreturn_file4 {
offset4 lrf_offset; offset4 lrf_offset;
length4 lrf_length; length4 lrf_length;
stateid4 lrf_stateid; stateid4 lrf_stateid;
/* layouttype4 specific data */ /* layouttype4 specific data */
opaque lrf_body<>; opaque lrf_body<>;
}; };
union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
skipping to change at page 24, line 26 skipping to change at page 24, line 26
7.3. pnfs_osd_layoutreturn4 7.3. pnfs_osd_layoutreturn4
///struct pnfs_osd_layoutreturn4 { ///struct pnfs_osd_layoutreturn4 {
/// pnfs_osd_ioerr4 olr_ioerr_report<>; /// pnfs_osd_ioerr4 olr_ioerr_report<>;
///}; ///};
/// ///
When OSD I/O operations failed, "olr_ioerr_report<>" is used to When OSD I/O operations failed, "olr_ioerr_report<>" is used to
report these errors to the metadata server as an array of elements of report these errors to the metadata server as an array of elements of
type pnfs_osd_ioerr4. Each element in the array represents an error type pnfs_osd_ioerr4. Each element in the array represents an error
that occured on the object specified by oer_component. If no errors that occurred on the object specified by oer_component. If no errors
are to be reported, the size of the olr_ioerr_report<> array is set are to be reported, the size of the olr_ioerr_report<> array is set
to zero. to zero.
8. Object-Based Creation Layout Hint 8. Object-Based Creation Layout Hint
The layouthint4 type is defined in the NFSv4.1 draft [9] as follows: The layouthint4 type is defined in the NFSv4.1 [5] as follows:
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_body<>; opaque loh_body<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the
loh_body opaque value is defined by the pnfs_osd_layouthint4 type. loh_body opaque value is defined by the pnfs_osd_layouthint4 type.
skipping to change at page 27, line 35 skipping to change at page 27, line 35
prevent corruption of the file's parity, multiple clients must not prevent corruption of the file's parity, multiple clients must not
hold valid write layouts for the same stripes. An outstanding RW hold valid write layouts for the same stripes. An outstanding RW
layout should be recalled when a conflicting LAYOUTGET is received layout should be recalled when a conflicting LAYOUTGET is received
from a different client for LAYOUTIOMODE4_RW and for a byte-range from a different client for LAYOUTIOMODE4_RW and for a byte-range
overlapping with the outstanding layout segment. overlapping with the outstanding layout segment.
10.1. CB_RECALL_ANY 10.1. CB_RECALL_ANY
The metadata server can use the CB_RECALL_ANY callback operation to The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts. The NFSv4.1 notify the client to return some or all of its layouts. The NFSv4.1
draft [9] defines the following types: [5] defines the following types:
const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8;
const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9;
struct CB_RECALL_ANY4args { struct CB_RECALL_ANY4args {
uint32_t craa_objects_to_keep; uint32_t craa_objects_to_keep;
bitmap4 craa_type_mask; bitmap4 craa_type_mask;
}; };
Typically, CB_RECALL_ANY will be used to recall client state when the Typically, CB_RECALL_ANY will be used to recall client state when the
skipping to change at page 28, line 22 skipping to change at page 28, line 22
The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return
layouts of iomode LAYOUTIOMODE4_READ. Similarly, the layouts of iomode LAYOUTIOMODE4_READ. Similarly, the
PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client
is notified to return layouts of either iomode. is notified to return layouts of either iomode.
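To make the flag handling concrete, here is a small sketch of how a client might decode craa_type_mask. It assumes that PNFS_OSD_RCA4_TYPE_MASK_READ and PNFS_OSD_RCA4_TYPE_MASK_RW are bit numbers equal to the MIN (8) and MAX (9) constants shown earlier, and that bitmap4 follows the usual NFSv4 convention of placing bit n in word n / 32 at position n mod 32. The function names are illustrative only.

```python
from typing import List

RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8
RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9

# Assumed bit assignments within the object-layout range.
PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN
PNFS_OSD_RCA4_TYPE_MASK_RW = RCA4_TYPE_MASK_OBJ_LAYOUT_MAX


def bit_is_set(bitmap: List[int], bit: int) -> bool:
    """Test bit number `bit` in a bitmap4 given as a list of 32-bit words."""
    word, offset = divmod(bit, 32)
    return word < len(bitmap) and bool(bitmap[word] & (1 << offset))


def iomodes_to_return(craa_type_mask: List[int]) -> List[str]:
    """List the layout iomodes the client is being asked to return."""
    modes = []
    if bit_is_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_READ):
        modes.append("LAYOUTIOMODE4_READ")
    if bit_is_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_RW):
        modes.append("LAYOUTIOMODE4_RW")
    return modes
```

An empty result means the recall does not concern object layouts at all.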
11. Client Fencing 11. Client Fencing
In cases where clients are uncommunicative and their lease has In cases where clients are uncommunicative and their lease has
expired or when clients fail to return recalled layouts in a timely expired or when clients fail to return recalled layouts within a
manner the server MAY revoke client layouts and/or device address lease period at the least (see "Recalling a Layout"[5]), the server
mappings and reassign these resources to other clients. To avoid MAY revoke client layouts and/or device address mappings and reassign
data corruption, the metadata server MUST fence off the revoked these resources to other clients. To avoid data corruption, the
clients from the respective objects as described in Section 12.4. metadata server MUST fence off the revoked clients from the
respective objects as described in Section 12.4.
12. Security Considerations 12. Security Considerations
The pNFS extension partitions the NFSv4 file system protocol into two The pNFS extension partitions the NFSv4 file system protocol into two
parts, the control path and the data path (storage protocol). The parts, the control path and the data path (storage protocol). The
control path contains all the new operations described by this control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path. The combination of components in a pNFS system
is required to preserve the security properties of NFSv4 with respect is required to preserve the security properties of NFSv4 with respect
to an entity accessing data via a client, including security to an entity accessing data via a client, including security
skipping to change at page 31, line 26 skipping to change at page 31, line 27
authentic metadata server and allows access to the object, as allowed authentic metadata server and allows access to the object, as allowed
by the Cap. by the Cap.
12.3. Protocol Privacy Requirements 12.3. Protocol Privacy Requirements
Note that if the server LAYOUTGET reply, holding CapKey and Cap, is Note that if the server LAYOUTGET reply, holding CapKey and Cap, is
snooped by another client, it can be used to generate valid OSD snooped by another client, it can be used to generate valid OSD
requests (within the Cap access restrictions). requests (within the Cap access restrictions).
To provide the required privacy requirements for the capability key To provide the required privacy requirements for the capability key
returned by LAYOUTGET, the GSS-API [4] framework can be used, e.g. by returned by LAYOUTGET, the GSS-API [6] framework can be used, e.g. by
using the RPCSEC_GSS privacy method to send the LAYOUTGET operation using the RPCSEC_GSS privacy method to send the LAYOUTGET operation
or by using the SSV key to encrypt the oc_capability_key using the or by using the SSV key to encrypt the oc_capability_key using the
GSS_Wrap() function. Two general ways to provide privacy in the GSS_Wrap() function. Two general ways to provide privacy in the
absence of GSS-API that are independent of NFSv4 are either an absence of GSS-API that are independent of NFSv4 are either an
isolated network such as a VLAN or a secure channel provided by IPsec isolated network such as a VLAN or a secure channel provided by IPsec
[11]. [13].
12.4. Revoking Capabilities 12.4. Revoking Capabilities
At any time, the metadata server may invalidate all outstanding At any time, the metadata server may invalidate all outstanding
capabilities on an object by changing its POLICY ACCESS TAG capabilities on an object by changing its POLICY ACCESS TAG
attribute. The value of the POLICY ACCESS TAG is part of a attribute. The value of the POLICY ACCESS TAG is part of a
capability, and it must match the state of the object attribute. If capability, and it must match the state of the object attribute. If
they do not match, the OSD rejects accesses to the object with the they do not match, the OSD rejects accesses to the object with the
sense key set to ILLEGAL REQUEST and an additional sense code set to sense key set to ILLEGAL REQUEST and an additional sense code set to
INVALID FIELD IN CDB. When a client attempts to use a capability and INVALID FIELD IN CDB. When a client attempts to use a capability and
skipping to change at page 32, line 27 skipping to change at page 32, line 29
the pNFS client will obtain a separate layout for each user accessing the pNFS client will obtain a separate layout for each user accessing
a shared object. The client SHOULD use OPEN and ACCESS calls to a shared object. The client SHOULD use OPEN and ACCESS calls to
check user permissions when performing I/O so that the server's check user permissions when performing I/O so that the server's
access control policies are correctly enforced. The result of the access control policies are correctly enforced. The result of the
ACCESS operation may be cached while the client holds a valid layout ACCESS operation may be cached while the client holds a valid layout
as the server is expected to recall layouts when the file's access as the server is expected to recall layouts when the file's access
permissions or ACL change. permissions or ACL change.
13. IANA Considerations 13. IANA Considerations
As described in the NFSv4.1 draft [9], new layout type numbers will As described in the NFSv4.1 [5], new layout type numbers will be
be requested from IANA. This document defines the protocol requested from IANA. This document defines the protocol associated
associated with the existing layout type number, with the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it
LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA. requires no further actions for IANA.
14. References 14. References
14.1. Normative References 14.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", RFC 2119, March 1997. Levels", RFC 2119, March 1997.
[2] Weber, R., "SCSI Object-Based Storage Device Commands", [2] Weber, R., "Information Technology - SCSI Object-Based Storage
July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>. Device Commands (OSD)", ANSI INCITS 400-2004, July 2004.
[3] Eisler, M., "XDR: External Data Representation Standard", [3] Eisler, M., "XDR: External Data Representation Standard",
STD 67, RFC 4506, May 2006. STD 67, RFC 4506, May 2006.
[4] Linn, J., "Generic Security Service Application Program [4] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version 1
Interface Version 2, Update 1", RFC 2743, January 2000. XDR Description", RFC [[RFC Editor: please insert NFSv4 Minor
Version 1 XDR Description RFC number]], [[RFC Editor: please
insert NFSv4 Minor Version 1 XDR Description RFC month]] [[RFC
Editor: please insert NFSv4 Minor Version 1 XDR Description RFC
year]].
[5] IBM, IBM, Cisco Systems, Hewlett-Packard Co., and IBM, [5] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version
"Internet Small Computer Systems Interface (iSCSI)", RFC 3720, 1", RFC [[RFC Editor: please insert NFSv4 Minor Version 1 RFC
April 2004, <http://www.ietf.org/rfc/rfc3720.txt>. number]], [[RFC Editor: please insert NFSv4 Minor Version 1 RFC
month]] [[RFC Editor: please insert NFSv4 Minor Version 1 RFC
year]].
[6] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", INCITS 408- [6] Linn, J., "Generic Security Service Application Program
2005, May 2005. Interface Version 2, Update 1", RFC 2743, January 2000.
[7] Hewlett-Packard Co., Hewlett-Packard Co., and Hewlett-Packard [7] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
Co., "T11 Network Address Authority (NAA) Naming Format for Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
iSCSI Node Names", RFC 3980, February 2005, RFC 3720, April 2004, <http://www.ietf.org/rfc/rfc3720.txt>.
<http://www.ietf.org/rfc/rfc3980.txt>.
14.2. Informative References [8] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI
INCITS 408-2005, May 2005.
[8] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version 1 [9] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network
XDR Description", May 2008, <http://www.ietf.org/ Address Authority (NAA) Naming Format for iSCSI Node Names",
internet-drafts/draft-ietf-nfsv4-minorversion1-dot-x-06.txt>. RFC 3980, February 2005, <http://www.ietf.org/rfc/rfc3980.txt>.
[9] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version [10] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64)
1", May 2008, <http://www.ietf.org/internet-drafts/ Registration Authority",
draft-ietf-nfsv4-minorversion1-23.txt>. <http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.
[10] Weber, R., "SCSI Object-Based Storage Device Commands -2 [11] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J.
(OSD-2)", January 2008, Souza, "Internet Storage Name Service (iSNS)", RFC 4171,
<http://www.t10.org/ftp/t10/drafts/osd2/osd2r03.pdf>. September 2005, <http://www.ietf.org/rfc/rfc4171.txt>.
[11] Kent, S. and K. Seo, "Security Architecture for the Internet 14.2. Informative References
Protocol", RFC 4301, December 2005.
[12] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) [12] Weber, R., "SCSI Object-Based Storage Device Commands -2
Registration Authority", (OSD-2)", July 2008,
<http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>. <http://www.t10.org/ftp/t10/drafts/osd2/osd2r04.pdf>.
[13] T10/ANSI INCITS 365-2002, "SCSI RDMA Protocol (SRP)", [13] Kent, S. and K. Seo, "Security Architecture for the Internet
INCITS 365-2002, Protocol", RFC 4301, December 2005.
<http://ftp.t10.org/ftp/t10/drafts/srp/srp-r16a.pdf>.
[14] T11 1619-D/ANSI INCITS 424-2007, "Fibre Channel Framing and [14] T10/ANSI INCITS 365-2002, "SCSI RDMA Protocol (SRP)", ANSI
Signaling - 2 (FC-FS-2)", INCITS 424-2007, August 2006, INCITS 365-2002.
<http://www.t11.org/t11/stat.nsf/upnum/1619-d>.
[15] T10 1601-D/ANSI INCITS 417-2006, "Serial Attached SCSI - 1.1 [15] T11 1619-D/ANSI INCITS 424-2007, "Fibre Channel Framing and
(SAS-1.1)", INCITS 417-2006, September 2005, Signaling - 2 (FC-FS-2)", INCITS 424-2007, August 2006.
<http://www.t10.org/ftp/t10/drafts/sas1/sas1r10.pdf>.
[16] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. [16] T10 1601-D/ANSI INCITS 417-2006, "Serial Attached SCSI - 1.1
Souza, "Internet Storage Name Service (iSNS)", RFC 4171, (SAS-1.1)", INCITS 417-2006, September 2005.
September 2005, <http://www.ietf.org/rfc/rfc4171.txt>.
[17] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting [17] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting
Codes, Part I", 1977. Codes, Part I", 1977.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Todd Pisek was a co-editor of the initial drafts for this document. Todd Pisek was a co-editor of the initial drafts for this document.
Daniel E. Messinger and Pete Wyckoff reviewed and commented on this Daniel E. Messinger and Pete Wyckoff reviewed and commented on this
document. document.
skipping to change at page 35, line 44 skipping to change at line 1578
attempt made to obtain a general license or permission for the use of attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr. http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at this standard. Please address the information to the IETF at
ietf-ipr@ietf.org. ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).
 End of changes. 50 change blocks. 
108 lines changed or deleted 108 lines changed or added
