draft-ietf-nfsv4-pnfs-obj-06.txt   draft-ietf-nfsv4-pnfs-obj-07.txt 
NFSv4 B. Halevy NFSv4 B. Halevy
Internet-Draft B. Welch Internet-Draft B. Welch
Intended status: Standards Track J. Zelenka Intended status: Standards Track J. Zelenka
Expires: September 18, 2008 Panasas Expires: October 3, 2008 Panasas
March 17, 2008 April 01, 2008
Object-based pNFS Operations Object-based pNFS Operations
draft-ietf-nfsv4-pnfs-obj-06 draft-ietf-nfsv4-pnfs-obj-07
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 34 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 18, 2008. This Internet-Draft will expire on October 3, 2008.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2008). Copyright (C) The IETF Trust (2008).
Abstract Abstract
This Internet-Draft provides a description of the object-based pNFS This Internet-Draft provides a description of the object-based pNFS
extension for NFSv4. This is a companion to the main pnfs extension for NFSv4. This is a companion to the main pnfs
specification in the NFSv4 Minor Version 1 Internet Draft, which is specification in the NFSv4 Minor Version 1 Internet Draft, which is
currently draft-ietf-nfsv4-minorversion1-21.txt. currently draft-ietf-nfsv4-minorversion1-21.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Object Storage Device Addressing and Discovery . . . . . . . . 4 2. XDR Description of the Objects-Based Layout Protocol . . . . . 4
2.1. pnfs_osd_addr_type4 . . . . . . . . . . . . . . . . . . . 5 2.1. Basic Data Type Definitions . . . . . . . . . . . . . . . 5
2.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 5 2.1.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . 5
2.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 6 2.1.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . 6
2.2.2. Device Network Address . . . . . . . . . . . . . . . . 7 2.1.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . 6
3. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 7 2.1.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . 8
3.1. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 8 3. Object Storage Device Addressing and Discovery . . . . . . . . 8
3.1.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . 8 3.1. pnfs_osd_addr_type4 . . . . . . . . . . . . . . . . . . . 9
3.1.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . 9 3.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 9
3.1.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . 10 3.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 10
3.1.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . 11 3.2.2. Device Network Address . . . . . . . . . . . . . . . . 11
3.1.5. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . 11 4. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 11
3.2. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 12 4.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 12
3.2.1. Simple Striping . . . . . . . . . . . . . . . . . . . 12 4.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 13
3.2.2. Nested Striping . . . . . . . . . . . . . . . . . . . 13 4.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 13
3.2.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 14 4.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 14
3.3. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 15 4.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 15
3.3.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 15 4.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 15 4.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 17
3.3.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 16 4.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 17
3.3.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 16 4.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 17
3.3.5. RAID Usage and implementation notes . . . . . . . . . 17 4.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 17
4. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 17 4.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 18
4.1. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 17 4.4.5. RAID Usage and Implementation Notes . . . . . . . . . 18
4.1.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . 18 5. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 19
5. Recovering from Client I/O Errors . . . . . . . . . . . . . . 18 5.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . . 19
6. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 19 5.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 20
6.1. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . . 20 6. Recovering from Client I/O Errors . . . . . . . . . . . . . . 20
6.1.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . 21 7. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 21
6.1.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . 22 7.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 22
7. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 22 7.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 23
7.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 22 7.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . . 24
8. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 24 8. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 24
8.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 24 8.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 24
8.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 25 9. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 26
9. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 25 9.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 26
9.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 25 9.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 27
10. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 26 10. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 27
11. Security Considerations . . . . . . . . . . . . . . . . . . . 26 10.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 27
11.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 27 11. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 28
11.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 28 12. Security Considerations . . . . . . . . . . . . . . . . . . . 28
11.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 29 12.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 29
11.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 29 12.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 30
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 12.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 31
13. XDR Description of the Objects layout type . . . . . . . . . . 30 12.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 31
14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 34 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32
14.1. Normative References . . . . . . . . . . . . . . . . . . . 34 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 32
14.2. Informative References . . . . . . . . . . . . . . . . . . 35 14.1. Normative References . . . . . . . . . . . . . . . . . . . 32
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 36 14.2. Informative References . . . . . . . . . . . . . . . . . . 33
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 36 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 34
Intellectual Property and Copyright Statements . . . . . . . . . . 38 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34
Intellectual Property and Copyright Statements . . . . . . . . . . 35
1. Introduction 1. Introduction
In pNFS, the file server returns typed layout structures that In pNFS, the file server returns typed layout structures that
describe where file data is located. There are different layouts for describe where file data is located. There are different layouts for
different storage systems and methods of arranging data on storage different storage systems and methods of arranging data on storage
devices. This document describes the layouts used with object-based devices. This document describes the layouts used with object-based
storage devices (OSD) that are accessed according to the OSD storage storage devices (OSD) that are accessed according to the OSD storage
protocol standard (SNIA T10/1355-D [2]). protocol standard (SNIA T10/1355-D [2]).
skipping to change at page 4, line 27 skipping to change at page 4, line 27
SET ATTRIBUTES, CREATE and DELETE. However, using the object-based SET ATTRIBUTES, CREATE and DELETE. However, using the object-based
layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH
commands. The other commands are only used by the pNFS server. commands. The other commands are only used by the pNFS server.
An object-based layout for pNFS includes object identifiers, An object-based layout for pNFS includes object identifiers,
capabilities that allow clients to READ or WRITE those objects, and capabilities that allow clients to READ or WRITE those objects, and
various parameters that control how file data is striped across their various parameters that control how file data is striped across their
component objects. The OSD protocol has a capability-based security component objects. The OSD protocol has a capability-based security
scheme that allows the pNFS server to control what operations and scheme that allows the pNFS server to control what operations and
what objects can be used by clients. This scheme is described in what objects can be used by clients. This scheme is described in
more detail in the Security Considerations section (Section 11). more detail in the Security Considerations section (Section 12).
2. Object Storage Device Addressing and Discovery 2. XDR Description of the Objects-Based Layout Protocol
This document contains the XDR [3] description of the NFSv4.1 objects
layout protocol. The XDR description is embedded in this document in
a way that makes it simple for the reader to extract into a ready to
compile form. The reader can feed this document into the following
shell script to produce the machine readable XDR description of the
NFSv4.1 objects layout protocol:
#!/bin/sh
grep "^ *///" | sed 's?^ *///??'
That is, if the above script is stored in a file called "extract.sh",
and this document is in a file called "spec.txt", then the reader can run:
sh extract.sh < spec.txt > pnfs_osd_prot.x
The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///".
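For readers without a POSIX shell, the same extraction can be sketched in Python. This is only an illustrative equivalent of the grep/sed pipeline above and is not part of the specification; the function name extract_xdr is hypothetical.

```python
import re

# Optional leading spaces followed by the "///" sentinel, mirroring
# the "^ *///" pattern used by the shell script.
_SENTINEL = re.compile(r"^ *///")


def extract_xdr(text):
    """Return the embedded XDR description found in `text`.

    Keeps only the lines that begin with the sentinel and strips the
    leading white space and the sentinel itself from each of them.
    """
    out = []
    for line in text.splitlines():
        if _SENTINEL.match(line):
            out.append(_SENTINEL.sub("", line, count=1))
    # Terminate the result with a newline when anything was extracted.
    return "\n".join(out) + ("\n" if out else "")
```

Reading "spec.txt", passing its contents to extract_xdr(), and writing the result to "pnfs_osd_prot.x" should then produce the same file as the shell invocation shown above.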
The embedded XDR file header follows. Subsequent XDR descriptions,
with the sentinel sequence, are embedded throughout the document.
Note that the XDR code contained in this document depends on types
from the NFSv4.1 nfs4_prot.x file ([8]). This includes both NFS
types that end with a 4, such as offset4, length4, etc., as well as
more generic types such as uint32_t and uint64_t.
////*
/// * This file was machine generated for
/// * draft-ietf-nfsv4-pnfs-obj-07
/// * Last updated Tue Apr 1 21:35:08 IDT 2008
/// *
/// * Copyright (C) The IETF Trust (2007-2008)
/// * All Rights Reserved.
/// *
/// * Copyright (C) The Internet Society (1998-2006).
/// * All Rights Reserved.
/// */
///
////*
/// * pnfs_osd_prot.x
/// */
///
///%#include <nfs4_prot.x>
///
2.1. Basic Data Type Definitions
The following sections define basic data types and constants used by
the Object-Based Layout protocol.
2.1.1. pnfs_osd_objid4
An object is identified by a number, somewhat like an inode number.
The object storage model has a two level scheme, where the objects
within an object storage device are grouped into partitions.
///struct pnfs_osd_objid4 {
/// deviceid4 oid_device_id;
/// uint64_t oid_partition_id;
/// uint64_t oid_object_id;
///};
///
The pnfs_osd_objid4 type is used to identify an object within a
partition on a specified object storage device. "oid_device_id"
selects the object storage device from the set of available storage
devices. The device is identified with the deviceid4 type, which is
an index into addressing information about that device returned by
the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data
type is defined in NFSv4.1 draft [9]. Within an OSD, a partition is
identified with a 64-bit number, "oid_partition_id". Within a
partition, an object is identified with a 64-bit number,
"oid_object_id". Creation and management of partitions is outside
the scope of this standard, and is a facility provided by the object
storage file system.
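As a rough illustration of how the three fields appear on the wire, the following Python sketch XDR-encodes and decodes a pnfs_osd_objid4. It assumes that deviceid4 is a 16-byte fixed-length opaque, as in the published NFSv4.1 specification; the draft referenced here may define it differently. The helper names encode_objid and decode_objid are hypothetical.

```python
import struct

# Assumption: deviceid4 is a 16-byte fixed-length opaque. The NFSv4.1
# draft referenced by this document may define it differently.
DEVICEID4_SIZE = 16


def encode_objid(device_id, partition_id, object_id):
    """XDR-encode a pnfs_osd_objid4 value.

    A 16-byte fixed opaque needs no XDR padding, and each uint64_t is
    encoded as an 8-byte big-endian unsigned hyper.
    """
    if len(device_id) != DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    return device_id + struct.pack(">QQ", partition_id, object_id)


def decode_objid(data):
    """Inverse of encode_objid: (device_id, partition_id, object_id)."""
    if len(data) != DEVICEID4_SIZE + 16:
        raise ValueError("unexpected pnfs_osd_objid4 length")
    device_id = data[:DEVICEID4_SIZE]
    partition_id, object_id = struct.unpack(">QQ", data[DEVICEID4_SIZE:])
    return device_id, partition_id, object_id
```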
2.1.2. pnfs_osd_version4
///enum pnfs_osd_version4 {
/// PNFS_OSD_MISSING = 0,
/// PNFS_OSD_VERSION_1 = 1,
/// PNFS_OSD_VERSION_2 = 2
///};
///
pnfs_osd_version4 is used to indicate the OSD protocol version or
whether an object is missing (i.e., unavailable). Some of the RAID
algorithms supported by the object-based layout encode redundant
information and can compensate for missing components, but the data
placement algorithm needs to know what parts are missing.
At this time the OSD standard is at version 1.0, and we anticipate a
version 2.0 of the standard (SNIA T10/1729-D [10]). The second
generation OSD protocol has additional proposed features to support
more robust error recovery, snapshots, and byte-range capabilities.
Therefore, the OSD version is explicitly called out in the
information returned in the layout. (This information can also be
deduced by looking inside the capability type at the format field,
which is the first byte. The format value is 0x1 for an OSD v1
capability. However, it seems most robust to call out the version
explicitly.)
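The cross-check mentioned in the parenthetical can be sketched as follows. This is a hedged illustration only: it compares the whole first byte of the capability against 0x1, exactly as the text describes, and it does not inspect version 2 capabilities because no format value for them is given here. The function name check_cred_version is hypothetical.

```python
# Values taken from the pnfs_osd_version4 enum above.
PNFS_OSD_MISSING = 0
PNFS_OSD_VERSION_1 = 1
PNFS_OSD_VERSION_2 = 2

# Format value of an OSD v1 capability, as stated in the text.
OSD_V1_CAP_FORMAT = 0x1


def check_cred_version(osd_version, capability):
    """Return True when the explicit version agrees with the capability.

    Only the v1 case is cross-checked against the capability bytes;
    other known values are accepted without inspection.
    """
    if osd_version == PNFS_OSD_MISSING:
        return True  # component unavailable, nothing to cross-check
    if osd_version == PNFS_OSD_VERSION_1:
        return len(capability) > 0 and capability[0] == OSD_V1_CAP_FORMAT
    # Unknown enum values are rejected.
    return osd_version == PNFS_OSD_VERSION_2
```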
2.1.3. pnfs_osd_object_cred4
///enum pnfs_osd_cap_key_sec4 {
/// PNFS_OSD_CAP_KEY_SEC_NONE = 0,
/// PNFS_OSD_CAP_KEY_SEC_SSV = 1
///};
///
///struct pnfs_osd_object_cred4 {
/// pnfs_osd_objid4 oc_object_id;
/// pnfs_osd_version4 oc_osd_version;
/// pnfs_osd_cap_key_sec4 oc_cap_key_sec;
/// opaque oc_capability_key<>;
/// opaque oc_capability<>;
///};
///
The pnfs_osd_object_cred4 structure is used to identify each
component comprising the file. The "oc_object_id" identifies the
component object, the "oc_osd_version" represents the OSD protocol
version, or whether that component is unavailable, and the
"oc_capability" and "oc_capability_key", along with the
"oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD
security credentials needed to access that object. The
"oc_cap_key_sec" value denotes the method used to secure the
oc_capability_key (see Section 12.1 for more details).
To comply with the OSD security requirements the capability key
SHOULD be transferred securely to prevent eavesdropping (see
Section 12). Therefore, a client SHOULD either issue the LAYOUTGET
or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service,
or previously establish an SSV for the sessions via the NFSv4.1
SET_SSV operation. The pnfs_osd_cap_key_sec4 type is used to
identify the method used by the server to secure the capability key.
o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is
not encrypted, in which case the client SHOULD issue the LAYOUTGET
or GETDEVICEINFO operations with RPCSEC_GSS with the privacy
service, or the NFSv4.1 transport should be secured by using
methods that are external to NFSv4.1, such as the use of IPsec [11]
for transporting the NFSv4.1 protocol.
o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key
contents are encrypted using the SSV GSS context and the
capability key as inputs to the GSS_Wrap() function (see GSS-API
[4]) with the conf_req_flag set to TRUE. The client MUST use the
secret SSV key as part of the client's GSS context to decrypt the
capability key using the value of the oc_capability_key field as
the input_message to the GSS_Unwrap() function. Note that to
prevent eavesdropping of the SSV key the client SHOULD issue
SET_SSV via RPCSEC_GSS with the privacy service.
The actual method chosen depends on whether the client established an
SSV key with the server and whether it issued the operation with the
RPCSEC_GSS privacy method. Naturally, if the client did not
establish an SSV key via SET_SSV the server MUST use the
PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the operation was
not issued with the RPCSEC_GSS privacy method the server SHOULD
secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV
method. The server MAY use the PNFS_OSD_CAP_KEY_SEC_SSV method also
when the operation was issued with the RPCSEC_GSS privacy method.
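The selection rules in the preceding paragraph can be summarized by the following Python sketch. It is an illustration under stated assumptions rather than a server implementation: the function name choose_cap_key_sec and the prefer_ssv parameter (which models the MAY clause as a local server policy) are hypothetical.

```python
# Values taken from the pnfs_osd_cap_key_sec4 enum above.
PNFS_OSD_CAP_KEY_SEC_NONE = 0
PNFS_OSD_CAP_KEY_SEC_SSV = 1


def choose_cap_key_sec(ssv_established, gss_privacy, prefer_ssv=False):
    """Pick the method a server uses to secure oc_capability_key.

    ssv_established: the client set an SSV key via SET_SSV.
    gss_privacy:     the operation was issued with RPCSEC_GSS privacy.
    prefer_ssv:      hypothetical policy knob for the MAY case.
    """
    if not ssv_established:
        # MUST: without an SSV key only the NONE method is possible.
        return PNFS_OSD_CAP_KEY_SEC_NONE
    if not gss_privacy:
        # SHOULD: the key would otherwise travel unprotected.
        return PNFS_OSD_CAP_KEY_SEC_SSV
    # MAY: privacy already protects the key; SSV wrapping is optional.
    if prefer_ssv:
        return PNFS_OSD_CAP_KEY_SEC_SSV
    return PNFS_OSD_CAP_KEY_SEC_NONE
```

Because the second rule is a SHOULD rather than a MUST, a real server could still return PNFS_OSD_CAP_KEY_SEC_NONE in that case; the sketch simply follows the recommended behavior.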
2.1.4. pnfs_osd_raid_algorithm4
///enum pnfs_osd_raid_algorithm4 {
/// PNFS_OSD_RAID_0 = 1,
/// PNFS_OSD_RAID_4 = 2,
/// PNFS_OSD_RAID_5 = 3,
/// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */
///};
///
pnfs_osd_raid_algorithm4 represents the data redundancy algorithm
used to protect the file's contents. See Section 4.4 for more
details.
3. Object Storage Device Addressing and Discovery
Data operations to an OSD require the client to know the "address" of Data operations to an OSD require the client to know the "address" of
each OSD's root object. The root object is synonymous with SCSI each OSD's root object. The root object is synonymous with SCSI
logical unit. The client specifies SCSI logical units to its SCSI logical unit. The client specifies SCSI logical units to its SCSI
protocol stack using a representation local to the client. Because protocol stack using a representation local to the client. Because
these representations are local, GETDEVICEINFO must return these representations are local, GETDEVICEINFO must return
information that can be used by the client to select the correct information that can be used by the client to select the correct
local representation. local representation.
In the block world, a set offset (logical block number or track/ In the block world, a set offset (logical block number or track/
sector) contains a disk label. This label identifies the disk sector) contains a disk label. This label identifies the disk
uniquely. In contrast, an OSD has a standard set of attributes on uniquely. In contrast, an OSD has a standard set of attributes on
its root object. For device identification purposes the OSD System its root object. For device identification purposes the OSD System
ID (root information attribute number 3) and the OSD Name (root ID (root information attribute number 3) and the OSD Name (root
information attribute number 9) are used as the label. These appear information attribute number 9) are used as the label. These appear
in the pnfs_osd_deviceaddr4 type below under the "systemid" and in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and
"osdname" fields. "oda_osdname" fields.
In some situations, SCSI target discovery may need to be driven based In some situations, SCSI target discovery may need to be driven based
on information contained in the GETDEVICEINFO response. One example on information contained in the GETDEVICEINFO response. One example
of this is iSCSI targets that are not known to the client until a of this is iSCSI targets that are not known to the client until a
layout has been requested. The information provided as the layout has been requested. The information provided as the
"targetid", "netaddr", and "lun" fields in the pnfs_osd_deviceaddr4 "targetid", "netaddr", and "lun" fields in the pnfs_osd_deviceaddr4
type described below (see Section 2.2), allows the client to probe a type described below (see Section 3.2), allows the client to probe a
specific device given its network address and optionally its iSCSI specific device given its network address and optionally its iSCSI
Name (see iSCSI [3]), or when the device network address is omitted, Name (see iSCSI [5]), or when the device network address is omitted,
to discover the object storage device using the provided device name to discover the object storage device using the provided device name
or SCSI device identifier (See SPC-3 [4].) or SCSI device identifier (See SPC-3 [6].)
The systemid is used by the client, along with the object credential The oda_systemid is used by the client, along with the object
to sign each request with the request integrity check value. This credential to sign each request with the request integrity check
method protects the client from unintentionally accessing a device if value. This method protects the client from unintentionally
the device address mapping was changed (or revoked). The server accessing a device if the device address mapping was changed (or
computes the capability_key using its own view of the systemid revoked). The server computes the capability key using its own view
associated with the respective deviceid present in the credential. of the systemid associated with the respective deviceid present in
If the client's view of the deviceid mapping is stale, the client the credential. If the client's view of the deviceid mapping is
will use the wrong systemid (which must be system-wide unique) and stale, the client will use the wrong systemid (which must be system-
the I/O request to the OSD will fail to pass the integrity check wide unique) and the I/O request to the OSD will fail to pass the
verification. integrity check verification.
To recover from this condition the client should report the error and To recover from this condition the client should report the error and
return the layout using LAYOUTRETURN, and invalidate all the device return the layout using LAYOUTRETURN, and invalidate all the device
address mappings associated with this layout. The client can then address mappings associated with this layout. The client can then
ask for a new layout if it wishes using LAYOUTGET and resolve the ask for a new layout if it wishes using LAYOUTGET and resolve the
referenced deviceids using GETDEVICEINFO or GETDEVICELIST. referenced deviceids using GETDEVICEINFO or GETDEVICELIST.
The server MUST provide the systemid and SHOULD also provide the The server MUST provide the oda_systemid and SHOULD also provide the
osdname. When the OSD name is present the client SHOULD get the root oda_osdname. When the OSD name is present the client SHOULD get the
information attributes whenever it establishes communication with the root information attributes whenever it establishes communication
OSD and verify that the OSD name it got from the OSD matches the one with the OSD and verify that the OSD name it got from the OSD matches
sent by the metadata server. To do so, the client uses the the one sent by the metadata server. To do so, the client uses the
root_obj_cred credentials. root_obj_cred credentials.
2.1. pnfs_osd_addr_type4 3.1. pnfs_osd_addr_type4
The following enum specifies the manner in which a scsi target can be The following enum specifies the manner in which a scsi target can be
specified. The target can be specified as an SCSI Name, or as a SCSI specified. The target can be specified as an SCSI Name, or as a SCSI
Device Identifier. Device Identifier.
enum pnfs_obj_addr_type4 { ///enum pnfs_osd_targetid_type4 {
OBJ_TARGET_ANON = 1, /// OBJ_TARGET_ANON = 1,
OBJ_TARGET_SCSI_NAME = 2, /// OBJ_TARGET_SCSI_NAME = 2,
OBJ_TARGET_SCSI_DEVICE_ID = 3 /// OBJ_TARGET_SCSI_DEVICE_ID = 3
}; ///};
///
2.2. pnfs_osd_deviceaddr4 3.2. pnfs_osd_deviceaddr4
The specification for an object device address is as follows: The specification for an object device address is as follows:
struct pnfs_osd_deviceaddr4 { ///union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
union targetid switch (pnfs_osd_addr_type4 type) { /// case OBJ_TARGET_SCSI_NAME:
case OBJ_TARGET_SCSI_NAME: /// string oti_scsi_name<>;
string scsi_name<>; ///
/// case OBJ_TARGET_SCSI_DEVICE_ID:
case OBJ_TARGET_SCSI_DEVICE_ID: /// opaque oti_scsi_device_id<>;
opaque scsi_device_id<>; ///
/// default:
default: /// void;
void; ///};
}; ///
union netaddr switch (bool netaddr_available) { ///union pnfs_osd_targetaddr4 switch (bool ota_available) {
case TRUE: /// case TRUE:
netaddr4 netaddr; /// netaddr4 ota_netaddr;
case FALSE: /// case FALSE:
void; /// void;
}; ///};
uint64_t lun; ///
opaque systemid<>; ///struct pnfs_osd_deviceaddr4 {
pnfs_osd_object_cred4 root_obj_cred; /// pnfs_osd_targetid4 oda_targetid;
opaque osdname<>; /// pnfs_osd_targetaddr4 oda_targetaddr;
}; /// uint64_t oda_lun;
/// opaque oda_systemid<>;
/// pnfs_osd_object_cred4 oda_root_obj_cred;
/// opaque oda_osdname<>;
///};
///
2.2.1. SCSI Target Identifier 3.2.1. SCSI Target Identifier
When "targetid" is specified as a OBJ_TARGET_SCSI_NAME, the When "oda_targetid" is specified as a OBJ_TARGET_SCSI_NAME, the
"scsi_name" string MUST be formatted as a "iSCSI Name" as specified "oti_scsi_name" string MUST be formatted as a "iSCSI Name" as
in iSCSI [3] and [5]. Note that the specification of the scsi_name specified in iSCSI [5] and [7]. Note that the specification of the
string format is outside the scope of this document. Parsing the oti_scsi_name string format is outside the scope of this document.
string is based on the string prefix, e.g. "iqn.", "eui.", or "naa." Parsing the string is based on the string prefix, e.g. "iqn.",
and more formats MAY be specified in the future in accordance with "eui.", or "naa." and more formats MAY be specified in the future in
iSCSI Names properties. accordance with iSCSI Names properties.
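Prefix-based parsing of the name string might look like the Python sketch below. It is a minimal illustration that recognizes only the three prefixes named above and performs an exact, case-sensitive prefix match; it does not validate the remainder of the name. The function name iscsi_name_format and the returned labels are hypothetical.

```python
# The three prefixes named in the text, mapped to illustrative labels.
KNOWN_PREFIXES = {
    "iqn.": "IQN",
    "eui.": "EUI",
    "naa.": "NAA",
}


def iscsi_name_format(name):
    """Classify a name string by its prefix.

    Returns one of the labels above, or None when the prefix is not
    recognized (for example, a format specified in the future).
    """
    for prefix, label in KNOWN_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return None
```

Returning None for unknown prefixes, rather than raising an error, leaves room for the additional formats that the text says MAY be specified later.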
Currently, the iSCSI Name provides for naming the target device using Currently, the iSCSI Name provides for naming the target device using
a string formatted as an iSCSI Qualified Name (IQN) or as an EUI [8] a string formatted as an iSCSI Qualified Name (IQN) or as an EUI
string. Those are typically used to identify iSCSI or SRP [9] [12] string. Those are typically used to identify iSCSI or SRP [13]
devices. The Network Address Authority (NAA) string format (see [5]) devices. The Network Address Authority (NAA) string format (see [7])
provides for naming the device using globally unique identifiers, as provides for naming the device using globally unique identifiers, as
defined in FC-FS [10]. These are typically used to identify Fibre defined in FC-FS [14]. These are typically used to identify Fibre
Channel or SAS [11] (Serial Attached SCSI) devices. In particular, Channel or SAS [15] (Serial Attached SCSI) devices. In particular,
such devices that are dual-attached both over Fibre Channel or SAS, such devices that are dual-attached both over Fibre Channel or SAS,
and over iSCSI. and over iSCSI.
When "targetid" is specified as a OBJ_TARGET_SCSI_DEVICE_ID, the When "oda_targetid" is specified as a OBJ_TARGET_SCSI_DEVICE_ID, the
"scsi_device_id" opaque field MUST be formatted as a SCSI Device "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
Identifier as defined in SPC-3 [4] VPD Page 83h (Section 7.6.3. Identifier as defined in SPC-3 [6] VPD Page 83h (Section 7.6.3.
"Device Identification VPD Page".) Note that similarly to the "Device Identification VPD Page".) Note that similarly to the
"scsi_name", the specification of the scsi_device_id opaque contents "oti_scsi_name", the specification of the oti_scsi_device_id opaque
is outside the scope of this document and more formats MAY be contents is outside the scope of this document and more formats MAY
specified in the future in accordance with SPC-3. be specified in the future in accordance with SPC-3.
The OBJ_TARGET_ANON pnfs_osd_addr_type4 MAY be used for providing no The OBJ_TARGET_ANON pnfs_osd_addr_type4 MAY be used for providing no
target identification. In this case only the OSD systemid and target identification. In this case only the OSD System ID and
optionally, the provided network address, are used to locate the optionally, the provided network address, are used to locate the
device. device.
2.2.2. Device Network Address 3.2.2. Device Network Address
The optional "netaddr" field MAY be provided by the server as a hint The optional "oda_targetaddr" field MAY be provided by the server as
to accelerate device discovery over e.g., the iSCSI transport a hint to accelerate device discovery over e.g., the iSCSI transport
protocol. The network address is given with the netaddr4 type, which protocol. The network address is given with the netaddr4 type, which
specifies a TCP/IP based endpoint (as specified in NFSv4.1 draft specifies a TCP/IP based endpoint (as specified in NFSv4.1 draft
[12]). When given, the client SHOULD use it to probe for the SCSI [9]). When given, the client SHOULD use it to probe for the SCSI
device at the given network address. The client MAY still use other device at the given network address. The client MAY still use other
discovery mechanisms such as iSNS [13] to locate the device using the discovery mechanisms such as iSNS [16] to locate the device using the
targetid. In particular, such external name service, SHOULD be used oda_targetid. In particular, such external name service, SHOULD be
when the devices may be attached to the network using multiple used when the devices may be attached to the network using multiple
connections, and/or multiple storage fabrics (e.g. Fibre-Channel and connections, and/or multiple storage fabrics (e.g. Fibre-Channel and
iSCSI.) iSCSI.)
3. Object-Based Layout 4. Object-Based Layout
The layout4 type is defined in the NFSv4.1 draft [12] as follows: The layout4 type is defined in the NFSv4.1 draft [9] as follows:
enum layouttype4 { enum layouttype4 {
LAYOUT4_NFSV4_1_FILES = 1, LAYOUT4_NFSV4_1_FILES = 1,
LAYOUT4_OSD2_OBJECTS = 2, LAYOUT4_OSD2_OBJECTS = 2,
LAYOUT4_BLOCK_VOLUME = 3 LAYOUT4_BLOCK_VOLUME = 3
}; };
struct layout_content4 { struct layout_content4 {
layouttype4 loc_type; layouttype4 loc_type;
opaque loc_body<>; opaque loc_body<>;
}; };
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
This document defines the structure associated with the layouttype4 This document defines the structure associated with the layouttype4
value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 draft [12] specifies the value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 draft [9] specifies the
loc_body structure as an XDR type "opaque". The opaque layout is loc_body structure as an XDR type "opaque". The opaque layout is
uninterpreted by the generic pNFS client layers, but obviously must uninterpreted by the generic pNFS client layers, but obviously must
be interpreted by the object-storage layout driver. This document be interpreted by the object-storage layout driver. This section
defines the structure of this opaque value, pnfs_osd_layout4. defines the structure of this opaque value, pnfs_osd_layout4.
3.1. pnfs_osd_layout4 4.1. pnfs_osd_data_map4
struct pnfs_osd_layout4 { ///struct pnfs_osd_data_map4 {
pnfs_osd_data_map4 map; /// uint32_t odm_num_comps;
uint32_t comps_index; /// length4 odm_stripe_unit;
pnfs_osd_object_cred4 components<>; /// uint32_t odm_group_width;
}; /// uint32_t odm_group_depth;
/// uint32_t odm_mirror_cnt;
/// pnfs_osd_raid_algorithm4 odm_raid_algorithm;
///};
///
The pnfs_osd_data_map4 structure parameterizes the algorithm that
maps a file's contents over the component objects. Instead of
limiting the system to a simple striping scheme where loss of a single
component object results in data loss, the map parameters support
mirroring and more complicated schemes that protect against loss of a
component object.
"odm_num_comps" is the number of component objects the file is
striped over. The server MAY grow the file by adding more components
to the stripe while clients hold valid layouts until the file has
reached its final stripe width. The file length in this case MUST be
limited to the number of bytes in a full stripe.
The "odm_stripe_unit" is the number of bytes placed on one component
before advancing to the next one in the list of components. The
number of bytes in a full stripe is odm_stripe_unit times the number
of components. In some raid schemes, a stripe includes redundant
information (i.e., parity) that lets the system recover from loss or
damage to a component object.
The "odm_group_width" and "odm_group_depth" parameters allow a nested
striping pattern (See Section 4.3.2 for details). If there is no
nesting, then odm_group_width and odm_group_depth MUST be zero. The
size of the components array MUST be a multiple of odm_group_width.
The "odm_mirror_cnt" is used to replicate a file by replicating its
component objects. If there is no mirroring, then odm_mirror_cnt
MUST be 0. If odm_mirror_cnt is greater than zero, then the size of
the component array MUST be a multiple of (odm_mirror_cnt+1).
See Section 4.3 for more details.
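The size constraints on the components array can be checked mechanically. The following is a minimal sketch in Python, not part of the protocol definition; the class name DataMap and the function check_components_size are hypothetical, and the combined rule (a multiple of odm_group_width * (odm_mirror_cnt+1) when both are non-zero) is taken from the Mirroring section.

```python
from dataclasses import dataclass


@dataclass
class DataMap:
    """Illustrative mirror of pnfs_osd_data_map4 (hypothetical Python type)."""
    odm_num_comps: int
    odm_stripe_unit: int
    odm_group_width: int
    odm_group_depth: int
    odm_mirror_cnt: int


def check_components_size(m: DataMap, n_components: int) -> bool:
    """Return True if the components array size obeys the 'multiple of' rules."""
    # With no mirroring, odm_mirror_cnt is 0 and this factor is 1.
    unit = m.odm_mirror_cnt + 1
    # With nesting, the size must also be a multiple of odm_group_width.
    if m.odm_group_width > 0:
        unit *= m.odm_group_width
    return n_components % unit == 0
```

For example, a map with odm_group_width = 4 and odm_mirror_cnt = 1 accepts arrays of 8 or 16 components but rejects an array of 12.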
4.2. pnfs_osd_layout4
///struct pnfs_osd_layout4 {
/// pnfs_osd_data_map4 olo_map;
/// uint32_t olo_comps_index;
/// pnfs_osd_object_cred4 olo_components<>;
///};
///
The pnfs_osd_layout4 structure specifies a layout over a set of The pnfs_osd_layout4 structure specifies a layout over a set of
component objects. The components field is an array of object component objects. The "olo_components" field is an array of object
identifiers and security credentials that grant access to each identifiers and security credentials that grant access to each
object. The organization of the data is defined by the object. The organization of the data is defined by the
pnfs_osd_data_map4 type that specifies how the file's data is mapped pnfs_osd_data_map4 type that specifies how the file's data is mapped
onto the component objects (i.e., the striping pattern). The data onto the component objects (i.e., the striping pattern). The data
placement algorithm that maps file data onto component objects assumes placement algorithm that maps file data onto component objects assumes
that each component object occurs exactly once in the array of that each component object occurs exactly once in the array of
components. Therefore, component objects MUST appear in the components. Therefore, component objects MUST appear in the
component array only once. The components array may represent all olo_components array only once. The components array may represent
objects comprising the file, in which case comps_index is set to zero all objects comprising the file, in which case "olo_comps_index" is
and the number of entries in the "components" array is equal to set to zero and the number of entries in the olo_components array is
map.num_comps. The server MAY return fewer components than equal to olo_map.odm_num_comps. The server MAY return fewer
num_comps, provided that the returned components are sufficient to components than odm_num_comps, provided that the returned components
access any byte in the layout's data range (e.g., a sub-stripe of are sufficient to access any byte in the layout's data range (e.g., a
"group_width" components). In this case, comps_index represents the sub-stripe of "odm_group_width" components). In this case,
position of the returned components array within the full array of olo_comps_index represents the position of the returned components
components that comprise the file. array within the full array of components that comprise the file.
Note that the layout depends on the file size, which the client Note that the layout depends on the file size, which the client
learns from the generic return parameters of LAYOUTGET, by doing learns from the generic return parameters of LAYOUTGET, by doing
GETATTR commands to the metadata server. The client uses the file GETATTR commands to the metadata server. The client uses the file
size to decide if it should fill holes with zeros, or return a short size to decide if it should fill holes with zeros, or return a short
read. Striping patterns can cause cases where component objects are read. Striping patterns can cause cases where component objects are
shorter than other components because a hole happens to correspond to shorter than other components because a hole happens to correspond to
the last part of the component object. the last part of the component object.
3.1.1. pnfs_osd_objid4 4.3. Data Mapping Schemes
An object is identified by a number, somewhat like an inode number.
The object storage model has a two level scheme, where the objects
within an object storage device are grouped into partitions.
struct pnfs_osd_objid4 {
deviceid4 device_id;
uint64_t partition_id;
uint64_t object_id;
};
The pnfs_osd_objid4 type is used to identify an object within a
partition on a specified object storage device. "device_id" selects
the object storage device from the set of available storage devices.
The device is identified with the deviceid4 type, which is an index
into addressing information about that device returned by the
GETDEVICELIST and GETDEVICEINFO pnfs operations. The deviceid4 data
type is defined in NFSv4.1 draft [12]. Within an OSD, a partition is
identified with a 64-bit number, "partition_id". Within a partition,
an object is identified with a 64-bit number, "object_id". Creation
and management of partitions is outside the scope of this standard,
and is a facility provided by the object storage file system.
3.1.2. pnfs_osd_version4
enum pnfs_osd_version4 {
PNFS_OSD_MISSING = 0,
PNFS_OSD_VERSION_1 = 1,
PNFS_OSD_VERSION_2 = 2
};
The osd_version is used to indicate the OSD protocol version or
whether an object is missing (i.e., unavailable). Some layout
schemes encode redundant information and can compensate for missing
components, but the data placement algorithm needs to know what parts
are missing.
At this time the OSD standard is at version 1.0, and we anticipate a
version 2.0 of the standard (SNIA T10/1729-D [14]). The second
generation OSD protocol has additional proposed features to support
more robust error recovery, snapshots, and byte-range capabilities.
Therefore, the OSD version is explicitly called out in the
information returned in the layout. (This information can also be
deduced by looking inside the capability type at the format field,
which is the first byte. The format value is 0x1 for an OSD v1
capability. However, it seems most robust to call out the version
explicitly.)
3.1.3. pnfs_osd_object_cred4
enum pnfs_osd_cap_key_sec4 {
PNFS_OSD_CAP_KEY_SEC_NONE = 0,
PNFS_OSD_CAP_KEY_SEC_SSV = 1
};
struct pnfs_osd_object_cred4 {
pnfs_osd_objid4 object_id;
pnfs_osd_version4 osd_version;
pnfs_osd_cap_key_sec4 cap_key_sec;
opaque capability_key<>;
opaque capability<>;
};
The pnfs_osd_object_cred4 structure is used to identify each
component comprising the file. The object_id identifies the
component object, the osd_version represents the osd protocol
version, or whether that component is unavailable, and the capability
and capability key, along with the systemid from the
pnfs_osd_deviceaddr, provide the OSD security credentials needed to
access that object. The cap_key_sec value denotes the method used to
secure the capability_key (see Section 11.1 for more details).
To comply with the OSD security requirements the capability key
SHOULD be transferred securely to prevent eavesdropping (see
Section 11). Therefore, a client SHOULD either issue the LAYOUTGET
operation via RPCSEC_GSS with the privacy service or previously
establish an SSV for the sessions via the NFSv4.1 SET_SSV operation.
The pnfs_osd_cap_key_sec4 type is used to identify the method used by
the server to secure the capability key.
o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the capability_key is not
encrypted in which case the client SHOULD issue the LAYOUTGET
operation with RPCSEC_GSS with the privacy service or the NFSv4.1
transport should be secured by using methods that are external to
NFSv4.1 like the use of IPsec [15] for transporting the NFSv4.1
protocol.
o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the capability_key contents
are encrypted using the SSV GSS context and the capability key as
inputs to the GSS_Wrap() function (see GSS-API [6]) with the
conf_req_flag set to TRUE. The client MUST use the secret SSV key
as part of the client's GSS context to decrypt the capability key
using the value of the capability_key field as the input_message
to the GSS_Unwrap() function. Note that to prevent eavesdropping
of the SSV key the client SHOULD issue SET_SSV via RPCSEC_GSS with
the privacy service.
The actual method chosen depends on whether the client established an
SSV key with the server and whether it issued the LAYOUTGET operation
with the RPCSEC_GSS privacy method. Naturally, if the client did not
establish an SSV key via SET_SSV, the server MUST use the
PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the LAYOUTGET
operation was not issued with the RPCSEC_GSS privacy method the
server SHOULD secure the capability_key with the
PNFS_OSD_CAP_KEY_SEC_SSV method. The server MAY use the
PNFS_OSD_CAP_KEY_SEC_SSV method also when the LAYOUTGET operation was
issued with the RPCSEC_GSS privacy method.
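The selection rules above can be summarized as a small decision function. This is only a sketch of the server-side choice, not a normative algorithm; the function name choose_cap_key_sec and the prefer_ssv parameter (which models the MAY case) are hypothetical.

```python
PNFS_OSD_CAP_KEY_SEC_NONE = 0
PNFS_OSD_CAP_KEY_SEC_SSV = 1


def choose_cap_key_sec(ssv_established, layoutget_private, prefer_ssv=True):
    """Pick the method used to secure capability_key in a LAYOUTGET reply.

    ssv_established:   the client set up an SSV key via SET_SSV
    layoutget_private: LAYOUTGET was issued with the RPCSEC_GSS privacy service
    prefer_ssv:        local policy for the case where either method is allowed
    """
    if not ssv_established:
        # No SSV key exists, so the server MUST use the NONE method.
        return PNFS_OSD_CAP_KEY_SEC_NONE
    if not layoutget_private:
        # The key would otherwise travel in the clear; the server SHOULD wrap it.
        return PNFS_OSD_CAP_KEY_SEC_SSV
    # Both protections are available; the server MAY still use SSV.
    return PNFS_OSD_CAP_KEY_SEC_SSV if prefer_ssv else PNFS_OSD_CAP_KEY_SEC_NONE
```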
3.1.4. pnfs_osd_raid_algorithm4
enum pnfs_osd_raid_algorithm4 {
PNFS_OSD_RAID_0 = 1,
PNFS_OSD_RAID_4 = 2,
PNFS_OSD_RAID_5 = 3,
PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */
};
pnfs_osd_raid_algorithm4 represents the data redundancy algorithm
used to protect the file's contents. See Section 3.3 for more
details.
3.1.5. pnfs_osd_data_map4
struct pnfs_osd_data_map4 {
uint32_t num_comps;
length4 stripe_unit;
uint32_t group_width;
uint32_t group_depth;
uint32_t mirror_cnt;
pnfs_osd_raid_algorithm4 raid_algorithm;
};
The pnfs_osd_data_map4 structure parameterizes the algorithm that
maps a file's contents over the component objects. Instead of
limiting the system to a simple striping scheme where loss of a single
component object results in data loss, the map parameters support
mirroring and more complicated schemes that protect against loss of a
component object.
num_comps is the number of component objects the file is striped
over. The server MAY grow the file by adding more components to the
stripe while clients hold valid layouts until the file has reached
its final stripe width. The file length in this case MUST be limited
to the number of bytes in a full stripe.
The stripe_unit is the number of bytes placed on one component before
advancing to the next one in the list of components. The number of
bytes in a full stripe is stripe_unit times the number of components.
In some raid schemes, a stripe includes redundant information (i.e.,
parity) that lets the system recover from loss or damage to a
component object.
The group_width and group_depth parameters allow a nested striping
pattern (See Section 3.2.2 for details). If there is no nesting,
then group_width and group_depth MUST be zero. The size of the
components array MUST be a multiple of group_width.
The mirror_cnt is used to replicate a file by replicating its
component objects. If there is no mirroring, then mirror_cnt MUST be
0. If mirror_cnt is greater than zero, then the size of the
component array MUST be a multiple of (mirror_cnt+1).
See Section 3.2 for more details.
3.2. Data Mapping Schemes
This section describes the different data mapping schemes in detail. This section describes the different data mapping schemes in detail.
The object layout always uses a "dense" layout as described in the The object layout always uses a "dense" layout as described in
pNFS document. This means that the second stripe unit of the file NFSv4.1 draft [9]. This means that the second stripe unit of the
starts at offset 0 of the second component, rather than at offset file starts at offset 0 of the second component, rather than at
stripe_unit bytes. After a full stripe has been written, the next offset stripe_unit bytes. After a full stripe has been written, the
stripe unit is appended to the first component object in the list next stripe unit is appended to the first component object in the
without any holes in the component objects. list without any holes in the component objects.
3.2.1. Simple Striping 4.3.1. Simple Striping
The mapping from the logical offset within a file (L) to the The mapping from the logical offset within a file (L) to the
component object C and object-specific offset O is defined by the component object C and object-specific offset O is defined by the
following equations: following equations:
L = logical offset into the file L = logical offset into the file
W = total number of components W = total number of components
S = W * stripe_unit S = W * stripe_unit
N = L / S N = L / S
C = (L-(N*S)) / stripe_unit C = (L-(N*S)) / stripe_unit
O = (N*stripe_unit)+(L%stripe_unit) O = (N*stripe_unit)+(L%stripe_unit)
In these equations, S is the number of bytes in a full stripe, and N In these equations, S is the number of bytes in a full stripe, and N
is the stripe number. C is an index into the array of components, so is the stripe number. C is an index into the array of components, so
it selects a particular object storage device. Both N and C count it selects a particular object storage device. Both N and C count
from zero. O is the offset within the object that corresponds to the from zero. O is the offset within the object that corresponds to the
file offset. Note that this computation does not accommodate the file offset. Note that this computation does not accommodate the
same object appearing in the component array multiple times. same object appearing in the olo_components array multiple times.
For example, consider an object striped over four devices, <D0 D1 D2 For example, consider an object striped over four devices, <D0 D1 D2
D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 *
4096 = 16384. 4096 = 16384.
Offset 0: Offset 0:
N = 0 / 16384 = 0 N = 0 / 16384 = 0
C = (0-(0*16384)) / 4096 = 0 (D0) C = (0-(0*16384)) / 4096 = 0 (D0)
O = 0*4096 + (0%4096) = 0 O = 0*4096 + (0%4096) = 0
skipping to change at page 13, line 29 skipping to change at page 15, line 5
Offset 9000: Offset 9000:
N = 9000 / 16384 = 0 N = 9000 / 16384 = 0
C = (9000-(0*16384)) / 4096 = 2 (D2) C = (9000-(0*16384)) / 4096 = 2 (D2)
O = (0*4096)+(9000%4096) = 808 O = (0*4096)+(9000%4096) = 808
Offset 132000: Offset 132000:
N = 132000 / 16384 = 8 N = 132000 / 16384 = 8
C = (132000-(8*16384)) / 4096 = 0 (D0) C = (132000-(8*16384)) / 4096 = 0 (D0)
O = (8*4096) + (132000%4096) = 33696 O = (8*4096) + (132000%4096) = 33696
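The simple striping equations translate directly into integer arithmetic. The following Python sketch is illustrative only; the function name simple_stripe is hypothetical, and all divisions are integer divisions as in the equations above.

```python
def simple_stripe(L, stripe_unit, W):
    """Map logical file offset L to (N, C, O) for simple striping.

    N is the stripe number, C the component index, and O the offset within
    the component object. W is the total number of components.
    """
    S = W * stripe_unit                       # bytes in a full stripe
    N = L // S                                # stripe number
    C = (L - (N * S)) // stripe_unit          # component index
    O = (N * stripe_unit) + (L % stripe_unit) # offset inside the component
    return N, C, O


# The worked example: four devices, stripe_unit of 4096 bytes.
for offset in (0, 9000, 132000):
    print(offset, simple_stripe(offset, 4096, 4))
```

The loop reproduces the three worked offsets: (0, 0, 0), (0, 2, 808), and (8, 0, 33696).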
3.2.2. Nested Striping 4.3.2. Nested Striping
The group_width and group_depth parameters allow a nested striping The odm_group_width and odm_group_depth parameters allow a nested
pattern. The group_width defines the width of a data stripe and the striping pattern. odm_group_width defines the width of a data stripe
group_depth defines how many stripes are written before advancing to and odm_group_depth defines how many stripes are written before
the next group of components in the list of component objects for the advancing to the next group of components in the list of component
file. The math used to map from a file offset to a component object objects for the file. The math used to map from a file offset to a
and offset within that object is shown below. The computations map component object and offset within that object is shown below. The
from the logical offset L to the component index C and relative computations map from the logical offset L to the component index C
offset O within that component object. and relative offset O within that component object.
L = logical offset into the file L = logical offset into the file
W = total number of components W = total number of components
S = stripe_unit * group_depth * W S = stripe_unit * group_depth * W
T = stripe_unit * group_depth * group_width T = stripe_unit * group_depth * group_width
U = stripe_unit * group_width U = stripe_unit * group_width
M = L / S M = L / S
G = (L - (M * S)) / T G = (L - (M * S)) / T
H = (L - (M * S)) % T H = (L - (M * S)) % T
N = H / U N = H / U
skipping to change at page 14, line 46 skipping to change at page 16, line 33
O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB
Offset 7232 MB: Offset 7232 MB:
M = 7232 MB / 5000 MB = 1 M = 7232 MB / 5000 MB = 1
G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4 G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
N = 232 MB / 10 MB = 23 N = 232 MB / 10 MB = 23
C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42 C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB
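The nested striping computation can be sketched in Python as follows. The expressions for S, T, U, M, G, H, and N follow the equations listed above; the expressions for C and O are an assumption inferred from the worked examples (stripe_unit = 1 MB, group_width = 10, group_depth = 50, W = 100), since the corresponding equations are not visible in this excerpt. The function name nested_stripe and the constant MB are hypothetical.

```python
MB = 1024 * 1024


def nested_stripe(L, stripe_unit, group_width, group_depth, W):
    """Map logical offset L to (M, G, H, N, C, O) for nested striping."""
    S = stripe_unit * group_depth * W            # bytes before the pattern repeats
    T = stripe_unit * group_depth * group_width  # bytes held by one group
    U = stripe_unit * group_width                # bytes in one stripe of a group
    M = L // S
    G = (L - (M * S)) // T                       # group number
    H = (L - (M * S)) % T                        # offset within the group
    N = H // U                                   # stripe number within the group
    # Assumed from the worked examples, not from a quoted equation:
    C = (H - (N * U)) // stripe_unit + G * group_width
    O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
    return M, G, H, N, C, O


print(nested_stripe(7232 * MB, MB, 10, 50, 100))
```

For offset 7232 MB this yields M = 1, G = 4, H = 232 MB, N = 23, C = 42, and O = 73 MB, in agreement with the example above.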
3.2.3. Mirroring 4.3.3. Mirroring
The mirror_cnt is used to replicate a file by replicating its The odm_mirror_cnt is used to replicate a file by replicating its
component objects. If there is no mirroring, then mirror_cnt MUST be component objects. If there is no mirroring, then odm_mirror_cnt
0. If mirror_cnt is greater than zero, then the size of the MUST be 0. If odm_mirror_cnt is greater than zero, then the size of
component array MUST be a multiple of (mirror_cnt+1). Thus, for a the olo_components array MUST be a multiple of (odm_mirror_cnt+1).
classic mirror on two objects, mirror_cnt is one. Note that Thus, for a classic mirror on two objects, odm_mirror_cnt is one.
mirroring can be defined over any raid algorithm and striping pattern Note that mirroring can be defined over any raid algorithm and
(either simple or nested). If group_width is also non-zero, then the striping pattern (either simple or nested). If odm_group_width is
size MUST be a multiple of group_width * (mirror_cnt+1). Replicas also non-zero, then the size of the olo_components array MUST be a
are adjacent in the components array, and the value C produced by the multiple of odm_group_width * (odm_mirror_cnt+1). Replicas are
above equations is not a direct index into the components array. adjacent in the olo_components array, and the value C produced by the
above equations is not a direct index into the olo_components array.
Instead, the following equations determine the replica component Instead, the following equations determine the replica component
index RCi, where i ranges from 0 to mirror_cnt. index RCi, where i ranges from 0 to odm_mirror_cnt.
C = component index for striping or two-level striping C = component index for striping or two-level striping
i ranges from 0 to mirror_cnt, inclusive i ranges from 0 to odm_mirror_cnt, inclusive
RCi = C * (mirror_cnt+1) + i RCi = C * (odm_mirror_cnt+1) + i
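As a brief illustration, the replica indices for a given component index C can be enumerated as below. This is a sketch only; the function name replica_indices is hypothetical.

```python
def replica_indices(C, mirror_cnt):
    """Return the indices RC0..RCn into the components array that hold the
    replicas of striping component C, where n equals mirror_cnt."""
    return [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]


# Classic two-way mirror (mirror_cnt = 1): component 2 lives at entries 4 and 5.
print(replica_indices(2, 1))
```

With no mirroring (mirror_cnt = 0) the function returns the single index C, so the same code path serves both cases.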
3.3. RAID Algorithms 4.4. RAID Algorithms
pnfs_osd_raid_algorithm4 determines the algorithm and placement of pnfs_osd_raid_algorithm4 determines the algorithm and placement of
redundant data. This section defines the different RAID algorithms. redundant data. This section defines the different RAID algorithms.
3.3.1. PNFS_OSD_RAID_0 4.4.1. PNFS_OSD_RAID_0
PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
component objects are data bytes located by the above equations for C component objects are data bytes located by the above equations for C
and O. If a component object is marked as PNFS_OSD_MISSING, the pNFS and O. If a component object is marked as PNFS_OSD_MISSING, the pNFS
client MUST either return an I/O error if a read of this component client MUST either return an I/O error if a read of this component
is attempted or, alternatively, retry the READ against the pNFS is attempted or, alternatively, retry the READ against the pNFS
server. server.
3.3.2. PNFS_OSD_RAID_4 4.4.2. PNFS_OSD_RAID_4
PNFS_OSD_RAID_4 means that the last component object, or the last in PNFS_OSD_RAID_4 means that the last component object, or the last in
each group if group_width is > zero, contains parity information each group (if odm_group_width is greater than zero), contains parity
computed over the rest of the stripe with an XOR operation. If a information computed over the rest of the stripe with an XOR
component object is unavailable, the client can read the rest of the operation. If a component object is unavailable, the client can read
stripe units in the damaged stripe and recompute the missing stripe the rest of the stripe units in the damaged stripe and recompute the
unit by XORing the other stripe units in the stripe. Or the client missing stripe unit by XORing the other stripe units in the stripe.
can replay the READ against the pNFS server which will presumably Or the client can replay the READ against the pNFS server which will
perform the reconstructed read on the client's behalf. presumably perform the reconstructed read on the client's behalf.
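The reconstruction step relies on the fact that XOR parity is its own inverse: XORing the surviving stripe units with the parity unit yields the missing unit. A minimal sketch, assuming all stripe units in a stripe have equal length (the function name xor_units is hypothetical):

```python
def xor_units(units):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(units[0]))
    for unit in units:
        for k, byte in enumerate(unit):
            out[k] ^= byte
    return bytes(out)


# Three data stripe units and their parity unit.
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_units([d0, d1, d2])

# If the component holding d1 is unavailable, rebuild it from the rest.
rebuilt = xor_units([d0, d2, parity])
print(rebuilt == d1)
```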
When parity is present in the file, then there is an additional When parity is present in the file, then there is an additional
computation to map from the file offset L to the offset that accounts computation to map from the file offset L to the offset that accounts
for embedded parity, L'. First compute L', and then use L' in the for embedded parity, L'. First compute L', and then use L' in the
above equations for C and O. above equations for C and O.
L = file offset, not accounting for parity L = file offset, not accounting for parity
P = number of parity devices in each stripe P = number of parity devices in each stripe
W = group_width, if not zero, else size of component array W = group_width, if not zero, else size of olo_components array
N = L / ((W-P) * stripe_unit) N = L / ((W-P) * stripe_unit)
L' = N * (W * stripe_unit) + L' = N * (W * stripe_unit) +
(L % ((W-P) * stripe_unit)) (L % ((W-P) * stripe_unit))
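The parity adjustment can be sketched as follows. The sketch assumes that the divisor is the number of data bytes in a stripe, that is, (W-P) components times stripe_unit; the function name parity_adjusted_offset is hypothetical.

```python
def parity_adjusted_offset(L, W, P, stripe_unit):
    """Return L', the file offset adjusted for embedded parity.

    W is the stripe (or group) width in components and P the number of
    parity components per stripe.
    """
    data_bytes = (W - P) * stripe_unit  # data bytes in one stripe (assumed grouping)
    N = L // data_bytes                 # stripe number
    return N * (W * stripe_unit) + (L % data_bytes)


# Four components, one of them parity, stripe_unit of 4096 bytes.
print(parity_adjusted_offset(20000, 4, 1, 4096))
```

In this configuration each stripe carries 12288 data bytes, so file offset 20000 falls in stripe 1 and maps to L' = 16384 + 7712 = 24096. Feeding L' into the simple striping equations then gives component 1 at object offset 7712.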
3.3.3. PNFS_OSD_RAID_5 4.4.3. PNFS_OSD_RAID_5
PNFS_OSD_RAID_5 means that the position of the parity data is rotated PNFS_OSD_RAID_5 means that the position of the parity data is rotated
on each stripe. In the first stripe, the last component holds the on each stripe or each group (if odm_group_width is greater than
parity. In the second stripe, the next-to-last component holds the zero). In the first stripe, the last component holds the parity. In
parity, and so on. In this scheme, all stripe units are rotated so the second stripe, the next-to-last component holds the parity, and
that I/O is evenly spread across objects as the file is read so on. In this scheme, all stripe units are rotated so that I/O is
sequentially. The rotated parity layout is illustrated here, with evenly spread across objects as the file is read sequentially. The
numbers indicating the stripe unit. rotated parity layout is illustrated here, with numbers indicating
the stripe unit.
0 1 2 P 0 1 2 P
4 5 P 3 4 5 P 3
8 P 6 7 8 P 6 7
P 9 a b P 9 a b
To compute the component object C, first compute the offset that To compute the component object C, first compute the offset that
accounts for parity L' and use that to compute C. Then rotate C to accounts for parity L' and use that to compute C. Then rotate C to
get C'. Finally, increase C' by one if the parity information comes get C'. Finally, increase C' by one if the parity information comes
at or before C' within that stripe. The following equations at or before C' within that stripe. The following equations
illustrate this by computing I, which is the index of the component illustrate this by computing I, which is the index of the component
that contains parity for a given stripe. that contains parity for a given stripe.
L = file offset, not accounting for parity L = file offset, not accounting for parity
W = group_width, if not zero, else size of component array W = odm_group_width, if not zero, else size of olo_components array
N = L / ((W-1) * stripe_unit) N = L / ((W-1) * stripe_unit)
(Compute L' as described above) (Compute L' as described above)
(Compute C based on L' as described above) (Compute C based on L' as described above)
C' = (C - (N%W)) % W C' = (C - (N%W)) % W
I = W - (N%W) - 1 I = W - (N%W) - 1
if (C' <= I) { if (C' <= I) {
C'++ C'++
} }
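The rotated layout illustrated above can be regenerated from the parity index I and the rotation of C. The sketch below covers only those two steps, I = W - (N%W) - 1 and C' = (C - (N%W)) % W; the final conditional adjustment of C' is not applied here. The function name raid5_row and the consecutive numbering of stripe units (N * (W-1) + C, printed in hexadecimal) are assumptions made to match the illustration.

```python
def raid5_row(N, W):
    """Render stripe N of a W-wide rotated-parity layout as a text row."""
    I = W - (N % W) - 1                  # component holding parity in stripe N
    row = ["P" if pos == I else None for pos in range(W)]
    for C in range(W - 1):               # data components before rotation
        unit = N * (W - 1) + C           # assumed stripe unit numbering
        c_rot = (C - (N % W)) % W        # rotated component index C'
        row[c_rot] = format(unit, "x")
    return " ".join(row)


for stripe in range(4):
    print(raid5_row(stripe, 4))
```

For W = 4 the four printed rows match the illustration: "0 1 2 P", "4 5 P 3", "8 P 6 7", and "P 9 a b".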
3.3.4. PNFS_OSD_RAID_PQ 4.4.4. PNFS_OSD_RAID_PQ
PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
P+Q encoding scheme [16]. In this layout, the last two component P+Q encoding scheme [17]. In this layout, the last two component
objects hold the P and Q data, respectively. P is parity computed objects hold the P and Q data, respectively. P is parity computed
with XOR, and Q is a more complex equation that is not described with XOR, and Q is a more complex equation that is not described
here. The equations given above for embedded parity can be used to here. The equations given above for embedded parity can be used to
map a file offset to the correct component object by setting the map a file offset to the correct component object by setting the
number of parity components to 2 instead of 1 for RAID4 or RAID5. number of parity components to 2 instead of 1 for RAID4 or RAID5.
Clients may simply choose to read data through the metadata server if Clients may simply choose to read data through the metadata server if
two components are missing or damaged. two components are missing or damaged.
Issue: This scheme also has a RAID_4 like layout where the ECC blocks Issue: This scheme also has a RAID_4 like layout where the ECC blocks
are stored on the same components on every stripe and a rotated, are stored on the same components on every stripe and a rotated,
RAID-5 like layout where the stripe units are rotated. Should we RAID-5 like layout where the stripe units are rotated. Should we
make the following properties orthogonal: RAID_4 or RAID_5 (i.e., make the following properties orthogonal: RAID_4 or RAID_5 (i.e.,
non-rotated or rotated), and then have the number of parity non-rotated or rotated), and then have the number of parity
components and the associated algorithm be the orthogonal parameter? components and the associated algorithm be the orthogonal parameter?
3.3.5. RAID Usage and implementation notes 4.4.5. RAID Usage and Implementation Notes
RAID layouts with redundant data in their stripes require additional RAID layouts with redundant data in their stripes require additional
serialization of updates to ensure correct operation. Otherwise, if serialization of updates to ensure correct operation. Otherwise, if
two clients simultaneously write to the same logical range of an two clients simultaneously write to the same logical range of an
object, the result could include different data in the same ranges of object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information. It is the mirrored tuples, or corrupt parity information. It is the
responsibility of the metadata server to enforce serialization responsibility of the metadata server to enforce serialization
requirements such as this. For example, the metadata server may do requirements such as this. For example, the metadata server may do
so by not granting overlapping write layouts within mirrored objects. so by not granting overlapping write layouts within mirrored objects.
4. Object-Based Layout Update 5. Object-Based Layout Update
layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates
to the layout and additional information to the metadata server. It to the layout and additional information to the metadata server. It
is defined in the NFSv4.1 draft [12] as follows: is defined in the NFSv4.1 draft [9] as follows:
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_body<>; opaque lou_body<>;
}; };
The layoutupdate4 type is an opaque value at the generic pNFS client The layoutupdate4 type is an opaque value at the generic pNFS client
level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the
lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type.
4.1. pnfs_osd_layoutupdate4
struct pnfs_osd_layoutupdate4 {
pnfs_osd_deltaspaceused4 lou_delta_space_used;
bool lou_ioerr;
};
Object-Based pNFS clients are not allowed to modify the layout. Object-Based pNFS clients are not allowed to modify the layout.
"lou_delta_space_used" is used to convey capacity usage information Therefore, the information passed in pnfs_osd_layoutupdate4 is used
back to the metadata server. only to update the file's attributes. In addition to the generic
information the client can pass to the metadata server in
LAYOUTCOMMIT such as the highest offset the client wrote to and the
last time it modified the file, the client MAY use
pnfs_osd_layoutupdate4 to convey the capacity consumed (or released)
by writes using the layout, and to indicate that I/O errors were
encountered by such writes.
4.1.1. pnfs_osd_deltaspaceused4 5.1. pnfs_osd_deltaspaceused4
union pnfs_osd_deltaspaceused4 switch (bool valid) { ///union pnfs_osd_deltaspaceused4 switch (bool valid) {
case TRUE: /// case TRUE:
int64_t dsu_delta; /* Bytes consumed by write activity */ /// int64_t dsu_delta;
case FALSE: /// case FALSE:
void; /// void;
}; ///};
///
pnfs_osd_deltaspaceused4 is used to convey space utilization pnfs_osd_deltaspaceused4 is used to convey space utilization
information at the time of LAYOUTCOMMIT. For the file system to information at the time of LAYOUTCOMMIT. For the file system to
properly maintain capacity used information, it needs to track how properly maintain capacity used information, it needs to track how
much capacity was consumed by WRITE operations performed by the much capacity was consumed by WRITE operations performed by the
client. In this protocol, the OSD returns the capacity consumed by a client. In this protocol, the OSD returns the capacity consumed by a
write, which can be different than the number of bytes written write (*), which can be different than the number of bytes written
because of internal overhead like block-based allocation and indirect because of internal overhead like block-level allocation and indirect
blocks, and the client reflects this back to the pNFS server so it blocks, and the client reflects this back to the pNFS server so it
can accurately track quota. The pNFS server can choose to trust this can accurately track quota. The pNFS server can choose to trust this
information coming from the clients and therefore avoid querying the information coming from the clients and therefore avoid querying the
OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain
this information from the OSD, it simply returns invalid this information from the OSD, it simply returns invalid
lou_delta_space_used. olu_delta_space_used.
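The wire form of this union can be sketched as follows. This is a minimal illustration, assuming standard XDR encoding (a 4-byte big-endian boolean discriminant followed by an 8-byte signed hyper for the TRUE arm). The helper name encode_deltaspaceused is hypothetical and not part of the draft.

```python
import struct

def encode_deltaspaceused(delta=None):
    """XDR-encode pnfs_osd_deltaspaceused4 (hypothetical helper).

    delta=None models a client that could not obtain the capacity
    information from the OSD and therefore sends the 'invalid' arm.
    """
    if delta is None:
        return struct.pack(">I", 0)        # valid = FALSE, void arm
    return struct.pack(">Iq", 1, delta)    # valid = TRUE, then int64 dsu_delta
```

A negative delta models capacity released rather than consumed, for example encode_deltaspaceused(-8192).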
The "lou_ioerr" flag is used when I/O errors were encountered while (*) Note: At the time this document is written, a per-command used
capacity attribute is not yet standardized by OSD2 draft [10]. The
client MAY use vendor-specific attributes to calculate space
utilization, provided that the vendor defines and publishes a
suitable vendor-specific attributes page for current-command
attributes as defined by OSD2 draft [10], Section 7.1.2.2.
5.2. pnfs_osd_layoutupdate4
///struct pnfs_osd_layoutupdate4 {
/// pnfs_osd_deltaspaceused4 olu_delta_space_used;
/// bool olu_ioerr_flag;
///};
///
"olu_delta_space_used" is used to convey capacity usage information
back to the metadata server.
The "olu_ioerr_flag" is used when I/O errors were encountered while
writing the file. The client MUST report the errors using the writing the file. The client MUST report the errors using the
pnfs_osd_ioerr4 structure (See Section 6.1.1) at LAYOUTRETURN time. pnfs_osd_ioerr4 structure (See Section 7.1) at LAYOUTRETURN time.
If the client updated the file successfully before hitting the I/O If the client updated the file successfully before hitting the I/O
errors it MAY use LAYOUTCOMMIT to update the metadata server as errors it MAY use LAYOUTCOMMIT to update the metadata server as
described above. Typically, in the error-free case, the server MAY described above. Typically, in the error-free case, the server MAY
turn around and update the file's attributes on the storage devices. turn around and update the file's attributes on the storage devices.
However, if I/O errors were encountered, the server should not attempt However, if I/O errors were encountered, the server should not attempt
to write the new attributes on the storage devices until it receives to write the new attributes on the storage devices until it receives
the I/O error report, therefore the client MUST set the lou_ioerr the I/O error report, therefore the client MUST set the
flag to true. Note that in this case, the client SHOULD send both olu_ioerr_flag to true. Note that in this case, the client SHOULD
the LAYOUTCOMMIT and LAYOUTRETURN operations in the same COMPOUND send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same
RPC. COMPOUND RPC.
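The following sketch shows one way a client might build the layout update body and decide which operations to place in the COMPOUND. It assumes standard XDR encoding and the field order given in the structure above (space delta first, then the error flag). The function names encode_layoutupdate and compound_ops are hypothetical.

```python
import struct

def encode_layoutupdate(delta=None, ioerr=False):
    """XDR-encode pnfs_osd_layoutupdate4 (hypothetical helper)."""
    if delta is None:
        space = struct.pack(">I", 0)           # invalid delta: discriminant only
    else:
        space = struct.pack(">Iq", 1, delta)   # valid delta: discriminant + int64
    return space + struct.pack(">I", 1 if ioerr else 0)  # error flag as XDR bool

def compound_ops(ioerr):
    """Operations to send in one COMPOUND after a write phase.

    When I/O errors were seen, LAYOUTRETURN (carrying the error report)
    follows LAYOUTCOMMIT in the same COMPOUND, as the text recommends.
    """
    return ["LAYOUTCOMMIT", "LAYOUTRETURN"] if ioerr else ["LAYOUTCOMMIT"]
```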
5. Recovering from Client I/O Errors 6. Recovering from Client I/O Errors
The pNFS client may encounter errors when directly accessing the The pNFS client may encounter errors when directly accessing the
object storage devices. However, it is the responsibility of the object storage devices. However, it is the responsibility of the
metadata server to handle the I/O errors. When the metadata server to handle the I/O errors. When the
LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the
I/O errors to the server at LAYOUTRETURN time using the I/O errors to the server at LAYOUTRETURN time using the
pnfs_osd_ioerr4 structure (See Section 6.1.1). pnfs_osd_ioerr4 structure (See Section 7.1).
The metadata server analyzes the error and determines the required The metadata server analyzes the error and determines the required
recovery operations such as repairing any parity inconsistencies, recovery operations such as repairing any parity inconsistencies,
recovering media failures, or reconstructing missing objects. recovering media failures, or reconstructing missing objects.
The metadata server SHOULD recall any outstanding layouts to allow it The metadata server SHOULD recall any outstanding layouts to allow it
exclusive write access to the stripes being recovered and to prevent exclusive write access to the stripes being recovered and to prevent
other clients from hitting the same error condition. In these cases, other clients from hitting the same error condition. In these cases,
the server MUST complete recovery before handing out any new layouts the server MUST complete recovery before handing out any new layouts
to the affected byte ranges. to the affected byte ranges.
skipping to change at page 19, line 26 skipping to change at page 21, line 27
corresponding error to the application that initiated the I/O corresponding error to the application that initiated the I/O
operation and drop any unwritten data, the client SHOULD attempt to operation and drop any unwritten data, the client SHOULD attempt to
retry the original I/O operation by requesting a new layout using retry the original I/O operation by requesting a new layout using
LAYOUTGET and retry the I/O operation(s) using the new layout or the LAYOUTGET and retry the I/O operation(s) using the new layout or the
client MAY just retry the I/O operation(s) using regular NFS READ or client MAY just retry the I/O operation(s) using regular NFS READ or
WRITE operations via the metadata server. The client SHOULD attempt WRITE operations via the metadata server. The client SHOULD attempt
to retrieve a new layout and retry the I/O operation using OSD to retrieve a new layout and retry the I/O operation using OSD
commands first and only if the error persists, retry the I/O commands first and only if the error persists, retry the I/O
operation via the metadata server. operation via the metadata server.
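The retry order described above can be sketched as follows. The three callables are hypothetical stand-ins for the client's LAYOUTGET, OSD I/O, and NFS READ/WRITE paths. None of these names comes from the draft, and a real client would also report the original error at LAYOUTRETURN time.

```python
def retry_io(layoutget, osd_io, mds_io):
    """Retry a failed I/O: new layout and OSD commands first, then the MDS.

    layoutget() returns a fresh layout, osd_io(layout) performs the I/O
    directly against the OSDs, and mds_io() performs it through the
    metadata server. All three are assumptions made for illustration.
    """
    try:
        layout = layoutget()
        return osd_io(layout)
    except OSError:
        # The error persisted on the OSD path; fall back to regular
        # NFS READ or WRITE via the metadata server. A failure here
        # propagates to the caller, that is, to the application.
        return mds_io()
```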
6. Object-Based Layout Return 7. Object-Based Layout Return
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
layout-type specific information to the server. It is defined in the layout-type specific information to the server. It is defined in the
NFSv4.1 draft [12] as follows: NFSv4.1 draft [9] as follows:
struct layoutreturn_file4 { struct layoutreturn_file4 {
offset4 lrf_offset; offset4 lrf_offset;
length4 lrf_length; length4 lrf_length;
stateid4 lrf_stateid; stateid4 lrf_stateid;
/* layouttype4 specific data */ /* layouttype4 specific data */
opaque lrf_body<>; opaque lrf_body<>;
}; };
union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
skipping to change at page 20, line 32 skipping to change at page 22, line 32
bool lora_reclaim; bool lora_reclaim;
layoutreturn_stateid lora_recallstateid; layoutreturn_stateid lora_recallstateid;
layouttype4 lora_layout_type; layouttype4 lora_layout_type;
layoutiomode4 lora_iomode; layoutiomode4 lora_iomode;
layoutreturn4 lora_layoutreturn; layoutreturn4 lora_layoutreturn;
}; };
If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the
lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type. lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type.
6.1. pnfs_osd_layoutreturn4 The pnfs_osd_layoutreturn4 type allows the client to report I/O error
information back to the metadata server as defined below.
struct pnfs_osd_layoutreturn4 {
pnfs_osd_ioerr4 ioerr<>;
};
When OSD I/O operations failed, "ioerr" is used to report these
errors to the metadata server. The pnfs_osd_ioerr4 data structure is
defined as follows:
6.1.1. pnfs_osd_errno4 7.1. pnfs_osd_errno4
enum pnfs_osd_errno4 { ///enum pnfs_osd_errno4 {
PNFS_OSD_ERR_EIO = 1, /// PNFS_OSD_ERR_EIO = 1,
PNFS_OSD_ERR_NOT_FOUND = 2, /// PNFS_OSD_ERR_NOT_FOUND = 2,
PNFS_OSD_ERR_NO_SPACE = 3, /// PNFS_OSD_ERR_NO_SPACE = 3,
PNFS_OSD_ERR_BAD_CRED = 4, /// PNFS_OSD_ERR_BAD_CRED = 4,
PNFS_OSD_ERR_NO_ACCESS = 5, /// PNFS_OSD_ERR_NO_ACCESS = 5,
PNFS_OSD_ERR_UNREACHABLE = 6, /// PNFS_OSD_ERR_UNREACHABLE = 6,
PNFS_OSD_ERR_RESOURCE = 7 /// PNFS_OSD_ERR_RESOURCE = 7
}; ///};
///
pnfs_osd_errno4 is used to represent error types when read/write pnfs_osd_errno4 is used to represent error types when read/write
errors are reported to the metadata server. The error codes serve as errors are reported to the metadata server. The error codes serve as
hints to the metadata server that may help it in diagnosing the exact hints to the metadata server that may help it in diagnosing the exact
reason for the error and in repairing it. reason for the error and in repairing it.
o PNFS_OSD_ERR_EIO indicates the operation failed because the Object o PNFS_OSD_ERR_EIO indicates the operation failed because the Object
Storage Device experienced a failure trying to access the object. Storage Device experienced a failure trying to access the object.
The most common source of these errors is media errors, but other The most common source of these errors is media errors, but other
internal errors might cause this. In this case, the metadata internal errors might cause this. In this case, the metadata
skipping to change at page 22, line 10 skipping to change at page 23, line 41
o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the
I/O operation at the Object Storage Device due to a communication I/O operation at the Object Storage Device due to a communication
failure. Whether the I/O operation was executed by the OSD or not failure. Whether the I/O operation was executed by the OSD or not
is undetermined. is undetermined.
o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O
operation due to a local problem on the initiator (i.e. client) operation due to a local problem on the initiator (i.e. client)
side, e.g., when running out of memory. The client MUST guarantee side, e.g., when running out of memory. The client MUST guarantee
that the OSD command was never dispatched to the OSD. that the OSD command was never dispatched to the OSD.
6.1.2. pnfs_osd_ioerr4 7.2. pnfs_osd_ioerr4
struct pnfs_osd_ioerr4 { ///struct pnfs_osd_ioerr4 {
pnfs_osd_objid4 component; /// pnfs_osd_objid4 oer_component;
length4 comp_offset; /// length4 oer_comp_offset;
length4 comp_length; /// length4 oer_comp_length;
bool iswrite; /// bool oer_iswrite;
pnfs_osd_errno4 errno; /// pnfs_osd_errno4 oer_errno;
}; ///};
///
The pnfs_osd_ioerr4 structure is used to return error indications for The pnfs_osd_ioerr4 structure is used to return error indications for
objects that generated errors during data transfers. These are hints objects that generated errors during data transfers. These are hints
to the metadata server that there are problems with that object. For to the metadata server that there are problems with that object. For
each error, "component", "comp_offset", and "comp_length" represent each error, "oer_component", "oer_comp_offset", and "oer_comp_length"
the object and byte range within the component object in which the represent the object and byte range within the component object in
error occurred, "iswrite" is set to "true" if the failed OSD which the error occurred, "oer_iswrite" is set to "true" if the
operation was data modifying, and "errno" represents the type of failed OSD operation was data modifying, and "oer_errno" represents
error. the type of error.
Component byte ranges in the optional pnfs_osd_ioerr4 structure are Component byte ranges in the optional pnfs_osd_ioerr4 structure are
used for recovering the object and MUST be set by the client to cover used for recovering the object and MUST be set by the client to cover
all failed I/O operations to the component. all failed I/O operations to the component.
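One way to satisfy the coverage requirement is to report, for each component, a single range spanning every failed I/O to it. The sketch below takes that approach; reporting several narrower entries per component would also cover the failures. The function name and the tuple layout are assumptions made for illustration.

```python
def covering_ranges(failures):
    """Compute one covering (offset, length) per component object.

    failures is an iterable of (component, offset, length) tuples, one
    per failed I/O operation. The result maps each component to a byte
    range that covers all of its failed operations.
    """
    spans = {}
    for comp, off, length in failures:
        lo, hi = spans.get(comp, (off, off + length))
        spans[comp] = (min(lo, off), max(hi, off + length))
    return {comp: (lo, hi - lo) for comp, (lo, hi) in spans.items()}
```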
7. Object-Based Creation Layout Hint 7.3. pnfs_osd_layoutreturn4
The layouthint4 type is defined in the NFSv4.1 draft [12] as follows: ///struct pnfs_osd_layoutreturn4 {
/// pnfs_osd_ioerr4 olr_ioerr_report<>;
///};
///
When OSD I/O operations failed, "olr_ioerr_report<>" is used to
report these errors to the metadata server as an array of elements of
type pnfs_osd_ioerr4. Each element in the array represents an error
that occurred on the object specified by oer_component. If no errors
are to be reported, the size of the olr_ioerr_report<> array is set
to zero.
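A sketch of the resulting wire form, assuming standard XDR: a variable-length array is a 4-byte element count followed by the elements, length4 is a 64-bit unsigned value, and booleans and enums occupy 4 bytes each. The pnfs_osd_objid4 layout is defined outside this section, so the caller passes the component already encoded (component_xdr); that parameter and both function names are hypothetical.

```python
import struct

PNFS_OSD_ERR_EIO = 1  # value taken from the pnfs_osd_errno4 enum above

def encode_ioerr(component_xdr, offset, length, iswrite, errno):
    """Encode one pnfs_osd_ioerr4; component_xdr is pre-encoded objid bytes."""
    return component_xdr + struct.pack(
        ">QQII", offset, length, 1 if iswrite else 0, errno)

def encode_layoutreturn(ioerrs):
    """Encode pnfs_osd_layoutreturn4 as an XDR variable-length array."""
    body = struct.pack(">I", len(ioerrs))  # array size, zero if no errors
    return body + b"".join(encode_ioerr(*e) for e in ioerrs)
```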
8. Object-Based Creation Layout Hint
The layouthint4 type is defined in the NFSv4.1 draft [9] as follows:
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_body<>; opaque loh_body<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the
loh_body opaque value is defined by the pnfs_osd_layouthint4 type. loh_body opaque value is defined by the pnfs_osd_layouthint4 type.
7.1. pnfs_osd_layouthint4 8.1. pnfs_osd_layouthint4
union num_comps_hint4 switch (bool valid) {
case TRUE:
uint32_t num_comps;
case FALSE:
void;
};
union stripe_unit_hint4 switch (bool valid) {
case TRUE:
length4 stripe_unit;
case FALSE:
void;
};
union group_width_hint4 switch (bool valid) {
case TRUE:
uint32_t group_width;
case FALSE:
void;
};
union group_depth_hint4 switch (bool valid) {
case TRUE:
uint32_t group_depth;
case FALSE:
void;
};
union mirror_cnt_hint4 switch (bool valid) {
case TRUE:
uint32_t mirror_cnt;
case FALSE:
void;
};
union raid_algorithm_hint4 switch (bool valid) { ///union pnfs_osd_max_comps_hint4 switch (bool omx_valid) {
case TRUE: /// case TRUE:
pnfs_osd_raid_algorithm4 raid_algorithm; /// uint32_t omx_max_comps;
case FALSE: /// case FALSE:
void; /// void;
}; ///};
///
///union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) {
/// case TRUE:
/// length4 osu_stripe_unit;
/// case FALSE:
/// void;
///};
///
///union pnfs_osd_group_width_hint4 switch (bool ogw_valid) {
/// case TRUE:
/// uint32_t ogw_group_width;
/// case FALSE:
/// void;
///};
///
///union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) {
/// case TRUE:
/// uint32_t ogd_group_depth;
/// case FALSE:
/// void;
///};
///
///union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) {
/// case TRUE:
/// uint32_t omc_mirror_cnt;
/// case FALSE:
/// void;
///};
///
///union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) {
/// case TRUE:
/// pnfs_osd_raid_algorithm4 ora_raid_algorithm;
/// case FALSE:
/// void;
///};
///
///struct pnfs_osd_layouthint4 {
/// pnfs_osd_max_comps_hint4 olh_max_comps_hint;
/// pnfs_osd_stripe_unit_hint4 olh_stripe_unit_hint;
/// pnfs_osd_group_width_hint4 olh_group_width_hint;
/// pnfs_osd_group_depth_hint4 olh_group_depth_hint;
/// pnfs_osd_mirror_cnt_hint4 olh_mirror_cnt_hint;
/// pnfs_osd_raid_algorithm_hint4 olh_raid_algorithm_hint;
///};
///
struct pnfs_osd_layouthint4 {
num_comps_hint4 num_comps_hint;
stripe_unit_hint4 stripe_unit_hint;
group_width_hint4 group_width_hint;
group_depth_hint4 group_depth_hint;
mirror_cnt_hint4 mirror_cnt_hint;
raid_algorithm_hint4 raid_algorithm_hint;
};
This type conveys hints for the desired data map. All parameters are This type conveys hints for the desired data map. All parameters are
optional so the client can give values for only the parameters it optional so the client can give values for only the parameters it
cares about, e.g. it can provide a hint for the desired number of cares about, e.g. it can provide a hint for the desired number of
mirrored components, regardless of the raid algorithm selected mirrored components, regardless of the raid algorithm selected
for the file. The server should make an attempt to honor the hints for the file. The server should make an attempt to honor the hints
but it can ignore any or all of them at its own discretion and but it can ignore any or all of them at its own discretion and
without failing the respective create operation. without failing the respective CREATE operation.
The num_comps hint can be used to limit the total number of component The "olh_max_comps_hint" can be used to limit the total number of
objects comprising the file. All other hints correspond directly to component objects comprising the file. All other hints correspond
the different fields of pnfs_osd_data_map4. directly to the different fields of pnfs_osd_data_map4.
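Because every hint is an optional union, a client that cares about only one parameter still encodes all six discriminants. The sketch below assumes standard XDR encoding, the field order of the structure above, length4 as a 64-bit value, and pnfs_osd_raid_algorithm4 as a 4-byte enum. The keyword names and the function name are hypothetical.

```python
import struct

# (keyword, XDR format of the TRUE arm), in structure order
_HINT_FIELDS = [
    ("max_comps", ">I"), ("stripe_unit", ">Q"), ("group_width", ">I"),
    ("group_depth", ">I"), ("mirror_cnt", ">I"), ("raid_algorithm", ">I"),
]

def encode_layouthint(**hints):
    """Encode pnfs_osd_layouthint4; omitted hints get the FALSE arm."""
    unknown = set(hints) - {name for name, _ in _HINT_FIELDS}
    if unknown:
        raise ValueError("unknown hint(s): %s" % sorted(unknown))
    out = b""
    for name, fmt in _HINT_FIELDS:
        value = hints.get(name)
        if value is None:
            out += struct.pack(">I", 0)                          # no hint given
        else:
            out += struct.pack(">I", 1) + struct.pack(fmt, value)
    return out
```

For example, encode_layouthint(mirror_cnt=2) asks for two mirrored components and leaves every other parameter to the server.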
8. Layout Segments 9. Layout Segments
The pNFS layout operations operate on logical byte ranges. There is The pNFS layout operations operate on logical byte ranges. There is
no requirement in the protocol for any relationship between byte no requirement in the protocol for any relationship between byte
ranges used in LAYOUTGET to acquire layouts and byte ranges used in ranges used in LAYOUTGET to acquire layouts and byte ranges used in
CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD
capabilities poses limitations on these operations since the byte-range capabilities poses limitations on these operations since
capabilities associated with layout segments cannot be merged or the capabilities associated with layout segments cannot be merged or
split. The following guidelines should be followed for proper split. The following guidelines should be followed for proper
operation of object-based layouts. operation of object-based layouts.
8.1. CB_LAYOUTRECALL and LAYOUTRETURN 9.1. CB_LAYOUTRECALL and LAYOUTRETURN
In general, the object-based layout driver should keep track of each In general, the object-based layout driver should keep track of each
layout segment it got, keeping record of the segment's iomode, layout segment it got, keeping record of the segment's iomode,
offset, and length. The server should allow the client to get offset, and length. The server should allow the client to get
multiple overlapping layout segments but is free to recall the layout multiple overlapping layout segments but is free to recall the layout
to prevent overlap. to prevent overlap.
In response to CB_LAYOUTRECALL, the client should return all layout In response to CB_LAYOUTRECALL, the client should return all layout
segments matching the given iomode and overlapping with the recalled segments matching the given iomode and overlapping with the recalled
range. When returning the layouts for this byte range with range. When returning the layouts for this byte range with
skipping to change at page 25, line 6 skipping to change at page 27, line 5
the clientid, iomode, and byte range given in LAYOUTRETURN. If no the clientid, iomode, and byte range given in LAYOUTRETURN. If no
exact match is found then the server should release all layout exact match is found then the server should release all layout
segments matching the clientid and iomode and that are fully segments matching the clientid and iomode and that are fully
contained in the returned byte range. If none are found and the byte contained in the returned byte range. If none are found and the byte
range is a subset of an outstanding layout segment for the same range is a subset of an outstanding layout segment for the same
clientid and iomode, then the client can be considered malfunctioning clientid and iomode, then the client can be considered malfunctioning
and the server SHOULD recall all layouts from this client to reset and the server SHOULD recall all layouts from this client to reset
its state. If this behavior repeats the server SHOULD deny all its state. If this behavior repeats the server SHOULD deny all
LAYOUTGETs from this client. LAYOUTGETs from this client.
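The matching order described above can be sketched as follows, assuming the server first looks for an exact match on clientid, iomode, and byte range. The segment representation (a tuple of clientid, iomode, offset, length), the function name, and the returned action strings are all hypothetical.

```python
def match_layoutreturn(segments, clientid, iomode, offset, length):
    """Decide what a LAYOUTRETURN releases; returns (action, segments)."""
    end = offset + length
    mine = [s for s in segments if s[0] == clientid and s[1] == iomode]

    # 1. Exact match on the returned byte range.
    exact = [s for s in mine if s[2] == offset and s[3] == length]
    if exact:
        return "release", exact

    # 2. Segments fully contained in the returned byte range.
    inside = [s for s in mine if s[2] >= offset and s[2] + s[3] <= end]
    if inside:
        return "release", inside

    # 3. Returned range is a strict piece of an outstanding segment:
    #    segments cannot be split, so treat the client as malfunctioning.
    if any(s[2] <= offset and end <= s[2] + s[3] for s in mine):
        return "recall_all", []

    return "none", []
```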
8.2. LAYOUTCOMMIT 9.2. LAYOUTCOMMIT
LAYOUTCOMMIT is only used by object-based pNFS to convey modified LAYOUTCOMMIT is only used by object-based pNFS to convey modified
attributes hints and/or to report I/O errors to the MDS. Therefore, attributes hints and/or to report I/O errors to the MDS. Therefore,
the offset and length in LAYOUTCOMMIT4args are reserved for future the offset and length in LAYOUTCOMMIT4args are reserved for future
use and should be set to 0. use and should be set to 0.
9. Recalling Layouts 10. Recalling Layouts
The object-based metadata server should recall outstanding layouts in The object-based metadata server should recall outstanding layouts in
the following cases: the following cases:
o When the file's security policy changes, i.e. ACLs or permission o When the file's security policy changes, i.e. ACLs or permission
mode bits are set. mode bits are set.
o When the file's aggregation map changes, rendering outstanding o When the file's aggregation map changes, rendering outstanding
layouts invalid. layouts invalid.
o When there are sharing conflicts. For example, the server will o When there are sharing conflicts. For example, the server will
issue stripe aligned layout segments for RAID-5 objects. To issue stripe aligned layout segments for RAID-5 objects. To
prevent corruption of the file's parity, multiple clients must not prevent corruption of the file's parity, multiple clients must not
hold valid write layouts for the same stripes. An outstanding RW hold valid write layouts for the same stripes. An outstanding RW
layout should be recalled when a conflicting LAYOUTGET is received layout should be recalled when a conflicting LAYOUTGET is received
from a different client for LAYOUTIOMODE4_RW and for a byte-range from a different client for LAYOUTIOMODE4_RW and for a byte-range
overlapping with the outstanding layout segment. overlapping with the outstanding layout segment.
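The sharing-conflict case can be illustrated with a simple overlap test. This is a sketch under the assumption that the server tracks outstanding segments as (clientid, iomode, offset, length) tuples; that representation and the function name are hypothetical.

```python
def conflicting_rw_layouts(outstanding, clientid, offset, length):
    """Return the RW segments to recall for a new LAYOUTIOMODE4_RW request.

    A segment conflicts when it is held by a different client, has
    iomode RW, and its byte range overlaps the requested range.
    """
    end = offset + length
    return [
        s for s in outstanding
        if s[0] != clientid and s[1] == "RW"
        and s[2] < end and offset < s[2] + s[3]
    ]
```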
9.1. CB_RECALL_ANY 10.1. CB_RECALL_ANY
The metadata server can use the CB_RECALL_ANY callback operation to The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts. The NFSv4.1 notify the client to return some or all of its layouts. The NFSv4.1
draft [12] defines the following types: draft [9] defines the following types:
const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8;
const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 11; const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 11;
struct CB_RECALL_ANY4args { struct CB_RECALL_ANY4args {
uint32_t craa_objects_to_keep; uint32_t craa_objects_to_keep;
bitmap4 craa_type_mask; bitmap4 craa_type_mask;
}; };
Typically, CB_RECALL_ANY will be used to recall client state when the Typically, CB_RECALL_ANY will be used to recall client state when the
server needs to reclaim resources. The craa_type_mask bitmap server needs to reclaim resources. The craa_type_mask bitmap
specifies the type of resources that are recalled and the specifies the type of resources that are recalled and the
craa_objects_to_keep value specifies how many of the recalled objects craa_objects_to_keep value specifies how many of the recalled objects
the client is allowed to keep. The object-based layout type mask the client is allowed to keep. The object-based layout type mask
flags are defined as follows. They represent the iomode of the flags are defined as follows. They represent the iomode of the
recalled layouts. In response, the client SHOULD return layouts of recalled layouts. In response, the client SHOULD return layouts of
the recalled iomode that it needs the least, keeping at most the recalled iomode that it needs the least, keeping at most
craa_objects_to_keep object-based layouts. craa_objects_to_keep object-based layouts.
const PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN; ///const PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN;
const PNFS_OSD_RCA4_TYPE_MASK_RW = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN+1; ///const PNFS_OSD_RCA4_TYPE_MASK_RW = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN+1;
///
The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return
layouts of iomode LAYOUTIOMODE4_READ. Similarly, the layouts of iomode LAYOUTIOMODE4_READ. Similarly, the
PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client
is notified to return layouts of either iomode. is notified to return layouts of either iomode.
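A sketch of how a client might interpret the mask and choose which layouts to give back. It assumes the type mask constants are bit numbers within the first 32-bit word of the bitmap4, and it uses last-use time as a stand-in for "needs the least"; both are assumptions, as are the function names.

```python
RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8
PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN
PNFS_OSD_RCA4_TYPE_MASK_RW = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN + 1

def recalled_iomodes(mask_word0):
    """Map the first craa_type_mask word to the recalled iomodes."""
    modes = []
    if mask_word0 & (1 << PNFS_OSD_RCA4_TYPE_MASK_READ):
        modes.append("LAYOUTIOMODE4_READ")
    if mask_word0 & (1 << PNFS_OSD_RCA4_TYPE_MASK_RW):
        modes.append("LAYOUTIOMODE4_RW")
    return modes

def layouts_to_return(layouts, keep):
    """Keep the `keep` most recently used layouts, return the rest.

    layouts is a list of (name, last_used) pairs for the recalled iomode.
    """
    by_recency = sorted(layouts, key=lambda l: l[1], reverse=True)
    return by_recency[keep:]
```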
10. Client Fencing 11. Client Fencing
In cases where clients are uncommunicative and their lease has In cases where clients are uncommunicative and their lease has
expired or when clients fail to return recalled layouts in a timely expired or when clients fail to return recalled layouts in a timely
manner the server MAY revoke client layouts and/or device address manner the server MAY revoke client layouts and/or device address
mappings and reassign these resources to other clients. To avoid mappings and reassign these resources to other clients. To avoid
data corruption, the metadata server MUST fence off the revoked data corruption, the metadata server MUST fence off the revoked
clients from the respective objects as described in Section 11.4. clients from the respective objects as described in Section 12.4.
11. Security Considerations 12. Security Considerations
The pNFS extension partitions the NFSv4 file system protocol into two The pNFS extension partitions the NFSv4 file system protocol into two
parts, the control path and the data path (storage protocol). The parts, the control path and the data path (storage protocol). The
control path contains all the new operations described by this control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path. The combination of components in a pNFS system
is required to preserve the security properties of NFSv4 with respect is required to preserve the security properties of NFSv4 with respect
to an entity accessing data via a client, including security to an entity accessing data via a client, including security
countermeasures to defend against threats that NFSv4 provides countermeasures to defend against threats that NFSv4 provides
defenses for in environments where these threats are considered defenses for in environments where these threats are considered
skipping to change at page 27, line 33 skipping to change at page 29, line 32
and enforcement of the server access control policy using the layout and enforcement of the server access control policy using the layout
security credentials, the NOSEC security method MUST NOT be used for security credentials, the NOSEC security method MUST NOT be used for
I/O operation. It MAY only be used to get the System ID attribute I/O operation. It MAY only be used to get the System ID attribute
when the metadata server provided only the OSD name with the device when the metadata server provided only the OSD name with the device
address. The remainder of this section gives an overview of the address. The remainder of this section gives an overview of the
security mechanism described in that standard. The goal is to give security mechanism described in that standard. The goal is to give
the reader a basic understanding of the object security model. Any the reader a basic understanding of the object security model. Any
discrepancies between this text and the actual standard are obviously discrepancies between this text and the actual standard are obviously
to be resolved in favor of the OSD standard. to be resolved in favor of the OSD standard.
11.1. OSD Security Data Types 12.1. OSD Security Data Types
There are three main data types associated with object security: a There are three main data types associated with object security: a
capability, a credential, and security parameters. The capability is capability, a credential, and security parameters. The capability is
a set of fields that specifies an object and what operations can be a set of fields that specifies an object and what operations can be
performed on it. A credential is a signed capability. Only a performed on it. A credential is a signed capability. Only a
security manager that knows the secret device keys can correctly sign security manager that knows the secret device keys can correctly sign
a capability to form a valid credential. In pNFS, the file server a capability to form a valid credential. In pNFS, the file server
acts as the security manager and returns signed capabilities (i.e., acts as the security manager and returns signed capabilities (i.e.,
credentials) to the pNFS client. The security parameters are values credentials) to the pNFS client. The security parameters are values
computed by the issuer of OSD commands (i.e., the client) that prove computed by the issuer of OSD commands (i.e., the client) that prove
skipping to change at page 28, line 8 skipping to change at page 30, line 7
resulting signatures into the security_parameters field of the OSD resulting signatures into the security_parameters field of the OSD
command. The object storage device uses the secret keys it shares command. The object storage device uses the secret keys it shares
with the security manager to validate the signature values in the with the security manager to validate the signature values in the
security parameters. security parameters.
The security types are opaque to the generic layers of the pNFS The security types are opaque to the generic layers of the pNFS
client. The credential contents are defined as opaque within the client. The credential contents are defined as opaque within the
pnfs_osd_object_cred4 type. Instead of repeating the definitions pnfs_osd_object_cred4 type. Instead of repeating the definitions
here, the reader is referred to section 4.9.2.2 of the OSD standard. here, the reader is referred to section 4.9.2.2 of the OSD standard.
11.2. The OSD Security Protocol 12.2. The OSD Security Protocol
The object storage protocol relies on a cryptographically secure The object storage protocol relies on a cryptographically secure
capability to control accesses at the object storage devices. capability to control accesses at the object storage devices.
Capabilities are generated by the metadata server, returned to the Capabilities are generated by the metadata server, returned to the
client, and used by the client as described below to authenticate client, and used by the client as described below to authenticate
their requests to the Object Storage Device (OSD). Capabilities their requests to the Object Storage Device (OSD). Capabilities
therefore achieve the required access and open mode checking. They therefore achieve the required access and open mode checking. They
allow the file server to define and check a policy (e.g., open mode) allow the file server to define and check a policy (e.g., open mode)
and the OSD to enforce that policy without knowing the details (e.g., and the OSD to enforce that policy without knowing the details (e.g.,
user IDs and ACLs). user IDs and ACLs).
skipping to change at page 29, line 20 skipping to change at page 31, line 19
OSD uses the SecretKey it shares with the metadata server to compare OSD uses the SecretKey it shares with the metadata server to compare
the ReqMAC the client sent with a locally computed value: the ReqMAC the client sent with a locally computed value:
LocalCapKey = MAC<SecretKey>(Cap, SystemID) LocalCapKey = MAC<SecretKey>(Cap, SystemID)
LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce) LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce)
and if they match the OSD assumes that the capabilities came from an and if they match the OSD assumes that the capabilities came from an
authentic metadata server and allows access to the object, as allowed authentic metadata server and allows access to the object, as allowed
by the Cap. by the Cap.
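The check above can be sketched in Python. This is an illustration only, not the normative algorithm: it assumes the MAC is HMAC-SHA1 and that the inputs are simply concatenated, and the helper names (mac, client_sign, osd_validate) are hypothetical. The OSD standard defines the actual field layout and integrity check.

```python
import hashlib
import hmac


def mac(key: bytes, *fields: bytes) -> bytes:
    """MAC<key>(fields...). Assumption: HMAC-SHA1 over concatenated fields."""
    return hmac.new(key, b"".join(fields), hashlib.sha1).digest()


def client_sign(cap_key: bytes, req: bytes, req_nonce: bytes) -> bytes:
    """Client side: compute ReqMAC with the CapKey obtained via LAYOUTGET."""
    return mac(cap_key, req, req_nonce)


def osd_validate(secret_key: bytes, cap: bytes, system_id: bytes,
                 req: bytes, req_nonce: bytes, req_mac: bytes) -> bool:
    """OSD side: recompute the keys locally and compare with the client's ReqMAC."""
    local_cap_key = mac(secret_key, cap, system_id)
    local_req_mac = mac(local_cap_key, req, req_nonce)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(local_req_mac, req_mac)
```

Because LocalCapKey is derived from the Cap itself, a client that alters the Cap after it was issued ends up with a ReqMAC that no longer matches the value the OSD computes.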
11.3. Protocol Privacy Requirements 12.3. Protocol Privacy Requirements
Note that if the server LAYOUTGET reply, holding CapKey and Cap, is Note that if the server LAYOUTGET reply, holding CapKey and Cap, is
snooped by another client, it can be used to generate valid OSD snooped by another client, it can be used to generate valid OSD
requests (within the Cap access restrictions). requests (within the Cap access restrictions).
To meet the privacy requirements for the capability key To meet the privacy requirements for the capability key
returned by LAYOUTGET, the GSS-API [6] framework can be used, e.g. by returned by LAYOUTGET, the GSS-API [4] framework can be used, e.g. by
using the RPCSEC_GSS privacy method to send the LAYOUTGET operation using the RPCSEC_GSS privacy method to send the LAYOUTGET operation
or by using the SSV key to encrypt the capability_key using the or by using the SSV key to encrypt the oc_capability_key using the
GSS_Wrap() function. Two general ways to provide privacy in the GSS_Wrap() function. Two general ways to provide privacy in the
absence of GSS-API that are independent of NFSv4 are either an absence of GSS-API that are independent of NFSv4 are either an
isolated network such as a VLAN or a secure channel provided by IPsec isolated network such as a VLAN or a secure channel provided by IPsec
[15]. [11].
11.4. Revoking Capabilities 12.4. Revoking Capabilities
At any time, the metadata server may invalidate all outstanding At any time, the metadata server may invalidate all outstanding
capabilities on an object by changing its POLICY ACCESS TAG capabilities on an object by changing its POLICY ACCESS TAG
attribute. The value of the POLICY ACCESS TAG is part of a attribute. The value of the POLICY ACCESS TAG is part of a
capability, and it must match the state of the object attribute. If capability, and it must match the state of the object attribute. If
they do not match, the OSD rejects accesses to the object with the they do not match, the OSD rejects accesses to the object with the
sense key set to ILLEGAL REQUEST and an additional sense code set to sense key set to ILLEGAL REQUEST and an additional sense code set to
INVALID FIELD IN CDB. When a client attempts to use a capability and INVALID FIELD IN CDB. When a client attempts to use a capability and
is rejected this way, it should issue a LAYOUTCOMMIT for the object is rejected this way, it should issue a LAYOUTCOMMIT for the object
and specify PNFS_OSD_BAD_CRED in the ioerr parameter. The client may and specify PNFS_OSD_BAD_CRED in the olr_ioerr_report parameter. The
elect to issue a compound LAYOUTRETURN/LAYOUTGET (or LAYOUTCOMMIT/ client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or
LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed set of LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed
capabilities. set of capabilities.
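The revocation check described above can be modeled as follows. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the tag is modeled as a plain integer, and incrementing it is only one possible way for the metadata server to change its value.

```python
from dataclasses import dataclass


@dataclass
class OsdObject:
    policy_access_tag: int  # current value of the object attribute


@dataclass
class Capability:
    policy_access_tag: int  # value copied into the capability when it was issued


class CapabilityRejected(Exception):
    """Rejection as described in the text above."""
    sense_key = "ILLEGAL REQUEST"
    additional_sense_code = "INVALID FIELD IN CDB"


def osd_check_tag(obj: OsdObject, cap: Capability) -> None:
    """OSD side: reject access when the capability's tag is stale."""
    if cap.policy_access_tag != obj.policy_access_tag:
        raise CapabilityRejected()


def revoke_all(obj: OsdObject) -> None:
    """Metadata server side: changing the tag invalidates every outstanding capability."""
    obj.policy_access_tag += 1  # assumption: any change of value would do
```

A client that catches this rejection would then report PNFS_OSD_BAD_CRED and may try to fetch refreshed capabilities, as the text describes.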
The metadata server may elect to change the access policy tag on an The metadata server may elect to change the access policy tag on an
object at any time, for any reason (with the understanding that there object at any time, for any reason (with the understanding that there
is likely an associated performance penalty, especially if there are is likely an associated performance penalty, especially if there are
outstanding layouts for this object). The metadata server MUST outstanding layouts for this object). The metadata server MUST
revoke outstanding capabilities when any one of the following occurs: revoke outstanding capabilities when any one of the following occurs:
o the permissions on the object change, o The permissions on the object change,
o a conflicting mandatory byte-range lock is granted, or o a conflicting mandatory byte-range lock is granted, or
o a layout is revoked and reassigned to another client o a layout is revoked and reassigned to another client.
A pNFS client will typically hold one layout for each byte range for A pNFS client will typically hold one layout for each byte range for
either READ or READ/WRITE. The client's credentials are checked by either READ or READ/WRITE. The client's credentials are checked by
the metadata server at LAYOUTGET time and it is the client's the metadata server at LAYOUTGET time and it is the client's
responsibility to enforce access control among multiple users responsibility to enforce access control among multiple users
accessing the same file. It is neither required nor expected that accessing the same file. It is neither required nor expected that
the pNFS client will obtain a separate layout for each user accessing the pNFS client will obtain a separate layout for each user accessing
a shared object. The client SHOULD use OPEN and ACCESS calls to a shared object. The client SHOULD use OPEN and ACCESS calls to
check user permissions when performing I/O so that the server's check user permissions when performing I/O so that the server's
access control policies are correctly enforced. The result of the access control policies are correctly enforced. The result of the
ACCESS operation may be cached while the client holds a valid layout ACCESS operation may be cached while the client holds a valid layout
as the server is expected to recall layouts when the file's access as the server is expected to recall layouts when the file's access
permissions or ACL change. permissions or ACL change.
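One way a client might tie cached ACCESS results to layout validity is sketched below. The AccessCache class and the access_call callable are hypothetical and stand in for the client's real ACCESS path. The sketch assumes that a layout recall is the only event that invalidates the cache.

```python
class AccessCache:
    """Caches per-user ACCESS results only while a valid layout is held."""

    def __init__(self, access_call):
        # access_call: hypothetical callable mapping a user to permission
        # bits, standing in for an ACCESS operation sent to the server.
        self._access_call = access_call
        self._cache = {}
        self.layout_valid = False

    def on_layout_granted(self):
        self.layout_valid = True

    def on_layout_recalled(self):
        # The server recalls layouts when permissions or the ACL change,
        # so every cached result is dropped here.
        self.layout_valid = False
        self._cache.clear()

    def check(self, user):
        if self.layout_valid and user in self._cache:
            return self._cache[user]
        result = self._access_call(user)
        if self.layout_valid:
            self._cache[user] = result
        return result
```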
12. IANA Considerations 13. IANA Considerations
As described in the NFSv4.1 draft [12], new layout type numbers will As described in the NFSv4.1 draft [9], new layout type numbers will
be requested from IANA. This document defines the protocol be requested from IANA. This document defines the protocol
associated with the existing layout type number, associated with the existing layout type number,
LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA. LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA.
13. XDR Description of the Objects layout type
This section contains the XDR [7] description of the objects layout
protocol. The XDR description is provided in this document in a way
that makes it simple for the reader to extract into ready-to-compile
form. The reader can feed this document into the following shell
script to produce the machine-readable XDR description of the objects
layout protocol:
#!/bin/sh
grep "^ *///" | sed 's?^ *///??'
That is, if the above script is stored in a file called "extract.sh",
and this document is in a file called "spec.txt", then the reader can run:
sh extract.sh < spec.txt > pnfs_obj_prot.x
The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///".
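For readers without a POSIX shell, the same extraction can be sketched in Python. The function name extract_xdr is hypothetical. The behavior is intended to mirror the grep and sed pipeline above: lines without the sentinel are dropped, and the leading spaces plus the first "///" are removed from the rest.

```python
import re

# Same pattern as the shell script: optional leading spaces, then "///".
SENTINEL = re.compile(r"^ *///")


def extract_xdr(text: str) -> str:
    """Return only the sentinel-marked lines, with the sentinel stripped."""
    kept = [
        SENTINEL.sub("", line, count=1)
        for line in text.splitlines()
        if SENTINEL.match(line)
    ]
    return "\n".join(kept) + ("\n" if kept else "")
```

Note that a line such as "////*" becomes "/*", because only the first three slashes belong to the sentinel.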
The XDR description, with the sentinel sequence, follows:
///%#include <nfs4_prot.h>
///
////*
/// * Device information
/// */
///enum pnfs_osd_addr_type4 {
/// OBJ_TARGET_ANON = 1,
/// OBJ_TARGET_SCSI_NAME = 2,
/// OBJ_TARGET_SCSI_DEVICE_ID = 3
///};
///
///struct pnfs_osd_deviceaddr4 {
/// union targetid switch (pnfs_osd_addr_type4 type) {
/// case OBJ_TARGET_SCSI_NAME:
/// string scsi_name<>;
///
/// case OBJ_TARGET_SCSI_DEVICE_ID:
/// opaque scsi_device_id<>;
///
/// default:
/// void;
/// };
/// union netaddr switch (bool netaddr_available) {
/// case TRUE:
/// netaddr4 netaddr;
/// case FALSE:
/// void;
/// };
/// uint64_t lun;
/// opaque systemid<>;
/// pnfs_osd_object_cred4 root_obj_cred;
/// opaque osdname<>;
///};
///
////*
/// * Layout type
/// */
///enum pnfs_osd_raid_algorithm4 {
/// PNFS_OSD_RAID_0 = 1,
/// PNFS_OSD_RAID_4 = 2,
/// PNFS_OSD_RAID_5 = 3,
/// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */
///};
///
///struct pnfs_osd_data_map4 {
/// uint32_t num_comps;
/// length4 stripe_unit;
/// uint32_t group_width;
/// uint32_t group_depth;
/// uint32_t mirror_cnt;
/// pnfs_osd_raid_algorithm4 raid_algorithm;
///};
///
////* Note: deviceid4 is defined by the nfsv4.1 protocol */
///
///struct pnfs_osd_objid4 {
/// deviceid4 device_id;
/// uint64_t partition_id;
/// uint64_t object_id;
///};
///
///enum pnfs_osd_version4 {
/// PNFS_OSD_MISSING = 0,
/// PNFS_OSD_VERSION_1 = 1,
/// PNFS_OSD_VERSION_2 = 2
///};
///
///enum pnfs_osd_cap_key_sec4 {
/// PNFS_OSD_CAP_KEY_SEC_NONE = 0,
/// PNFS_OSD_CAP_KEY_SEC_SSV = 1
///};
///
///struct pnfs_osd_object_cred4 {
/// pnfs_osd_objid4 object_id;
/// pnfs_osd_version4 osd_version;
/// pnfs_osd_cap_key_sec4 cap_key_sec;
/// opaque capability_key<>;
/// opaque capability<>;
///};
///
///struct pnfs_osd_layout4 {
/// pnfs_osd_data_map4 map;
/// uint32_t comps_index;
/// pnfs_osd_object_cred4 components<>;
///};
///
////*
/// * Layout update
/// */
///union pnfs_osd_deltaspaceused4 switch (bool valid) {
///case TRUE:
/// int64_t delta; /* Bytes consumed by write activity */
///case FALSE:
/// void;
///};
///
///struct pnfs_osd_layoutupdate4 {
/// pnfs_osd_deltaspaceused4 lou_delta_space_used;
/// bool lou_ioerr;
///};
///
////*
/// * Layout return
/// */
///enum pnfs_osd_errno4 {
/// PNFS_OSD_ERR_EIO = 1,
/// PNFS_OSD_ERR_NOT_FOUND = 2,
/// PNFS_OSD_ERR_NO_SPACE = 3,
/// PNFS_OSD_ERR_BAD_CRED = 4,
/// PNFS_OSD_ERR_NO_ACCESS = 5,
/// PNFS_OSD_ERR_UNREACHABLE = 6,
/// PNFS_OSD_ERR_RESOURCE = 7
///};
///
///struct pnfs_osd_ioerr4 {
/// pnfs_osd_objid4 component;
/// length4 comp_offset;
/// length4 comp_length;
/// bool iswrite;
/// pnfs_osd_errno4 errno;
///};
///
///struct pnfs_osd_layoutreturn4 {
/// pnfs_osd_ioerr4 ioerr<>;
///};
///
////*
/// * Layout hint
/// */
///union num_comps_hint4 switch (bool valid) {
/// case TRUE:
/// uint32_t num_comps;
/// case FALSE:
/// void;
///};
///
///union stripe_unit_hint4 switch (bool valid) {
/// case TRUE:
/// length4 stripe_unit;
/// case FALSE:
/// void;
///};
///
///union group_width_hint4 switch (bool valid) {
/// case TRUE:
/// uint32_t group_width;
/// case FALSE:
/// void;
///};
///
///union group_depth_hint4 switch (bool valid) {
/// case TRUE:
/// uint32_t group_depth;
/// case FALSE:
/// void;
///};
///
///union mirror_cnt_hint4 switch (bool valid) {
/// case TRUE:
/// uint32_t mirror_cnt;
/// case FALSE:
/// void;
///};
///
///union raid_algorithm_hint4 switch (bool valid) {
/// case TRUE:
/// pnfs_osd_raid_algorithm4 raid_algorithm;
/// case FALSE:
/// void;
///};
///
///struct pnfs_osd_layouthint4 {
/// num_comps_hint4 num_comps_hint;
/// stripe_unit_hint4 stripe_unit_hint;
/// group_width_hint4 group_width_hint;
/// group_depth_hint4 group_depth_hint;
/// mirror_cnt_hint4 mirror_cnt_hint;
/// raid_algorithm_hint4 raid_algorithm_hint;
///};
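As an illustration of how the XDR above maps to bytes on the wire, the following sketch encodes a pnfs_osd_objid4. It assumes that deviceid4 is a 16-byte fixed-length opaque, as in the final NFSv4.1 definition; the NFSv4.1 draft cited here may define it differently. The function name encode_objid is hypothetical.

```python
import struct

DEVICEID4_SIZE = 16  # assumption: matches NFS4_DEVICEID4_SIZE in NFSv4.1


def encode_objid(device_id: bytes, partition_id: int, object_id: int) -> bytes:
    """XDR-encode pnfs_osd_objid4: fixed opaque, then two big-endian uint64_t."""
    if len(device_id) != DEVICEID4_SIZE:
        raise ValueError("device_id must be exactly 16 bytes")
    # A 16-byte fixed opaque is already 4-byte aligned, so no padding is needed.
    return struct.pack(">16sQQ", device_id, partition_id, object_id)
```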
14. References 14. References
14.1. Normative References 14.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", RFC 2119, March 1997. Levels", RFC 2119, March 1997.
[2] Weber, R., "SCSI Object-Based Storage Device Commands", [2] Weber, R., "SCSI Object-Based Storage Device Commands",
July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>. July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>.
[3] IBM, IBM, Cisco Systems, Hewlett-Packard Co., and IBM, [3] Eisler, M., "XDR: External Data Representation Standard",
STD 67, RFC 4506, May 2006.
[4] Linn, J., "Generic Security Service Application Program
Interface Version 2, Update 1", RFC 2743, January 2000.
[5] IBM, IBM, Cisco Systems, Hewlett-Packard Co., and IBM,
"Internet Small Computer Systems Interface (iSCSI)", RFC 3720, "Internet Small Computer Systems Interface (iSCSI)", RFC 3720,
April 2004, <http://www.ietf.org/rfc/rfc3720.txt>. April 2004, <http://www.ietf.org/rfc/rfc3720.txt>.
[4] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", INCITS 408- [6] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", INCITS 408-
2005, May 2005. 2005, May 2005.
[5] Hewlett-Packard Co., Hewlett-Packard Co., and Hewlett-Packard [7] Hewlett-Packard Co., Hewlett-Packard Co., and Hewlett-Packard
Co., "T11 Network Address Authority (NAA) Naming Format for Co., "T11 Network Address Authority (NAA) Naming Format for
iSCSI Node Names", RFC 3980, February 2005, iSCSI Node Names", RFC 3980, February 2005,
<http://www.ietf.org/rfc/rfc3980.txt>. <http://www.ietf.org/rfc/rfc3980.txt>.
[6] Linn, J., "Generic Security Service Application Program 14.2. Informative References
Interface Version 2, Update 1", RFC 2743, January 2000.
[7] Eisler, M., "XDR: External Data Representation Standard", [8] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version 1
STD 67, RFC 4506, May 2006. XDR Description", February 2008, <http://www.ietf.org/
internet-drafts/draft-ietf-nfsv4-minorversion1-dot-x-04.txt>.
14.2. Informative References [9] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version
1", February 2008, <http://www.ietf.org/internet-drafts/
draft-ietf-nfsv4-minorversion1-21.txt>.
[8] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) [10] Weber, R., "SCSI Object-Based Storage Device Commands -2
(OSD-2)", January 2008,
<http://www.t10.org/ftp/t10/drafts/osd2/osd2r03.pdf>.
[11] Kent, S. and K. Seo, "Security Architecture for the Internet
Protocol", RFC 4301, December 2005.
[12] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64)
Registration Authority", Registration Authority",
<http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>. <http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.
[9] T10/ANSI INCITS 365-2002, "SCSI RDMA Protocol (SRP)", [13] T10/ANSI INCITS 365-2002, "SCSI RDMA Protocol (SRP)",
INCITS 365-2002, INCITS 365-2002,
<http://ftp.t10.org/ftp/t10/drafts/srp/srp-r16a.pdf>. <http://ftp.t10.org/ftp/t10/drafts/srp/srp-r16a.pdf>.
[10] T11 1619-D/ANSI INCITS 424-2007, "Fibre Channel Framing and [14] T11 1619-D/ANSI INCITS 424-2007, "Fibre Channel Framing and
Signaling - 2 (FC-FS-2)", INCITS 424-2007, August 2006, Signaling - 2 (FC-FS-2)", INCITS 424-2007, August 2006,
<http://www.t11.org/t11/stat.nsf/upnum/1619-d>. <http://www.t11.org/t11/stat.nsf/upnum/1619-d>.
[11] T10 1601-D/ANSI INCITS 417-2006, "Serial Attached SCSI - 1.1 [15] T10 1601-D/ANSI INCITS 417-2006, "Serial Attached SCSI - 1.1
(SAS-1.1)", INCITS 417-2006, September 2005, (SAS-1.1)", INCITS 417-2006, September 2005,
<http://www.t10.org/ftp/t10/drafts/sas1/sas1r10.pdf>. <http://www.t10.org/ftp/t10/drafts/sas1/sas1r10.pdf>.
[12] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version [16] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J.
1", February 2008, <http://www.ietf.org/internet-drafts/
draft-ietf-nfsv4-minorversion1-21.txt>.
[13] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J.
Souza, "Internet Storage Name Service (iSNS)", RFC 4171, Souza, "Internet Storage Name Service (iSNS)", RFC 4171,
September 2005, <http://www.ietf.org/rfc/rfc4171.txt>. September 2005, <http://www.ietf.org/rfc/rfc4171.txt>.
[14] Weber, R., "SCSI Object-Based Storage Device Commands -2 [17] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting
(OSD-2)", January 2008,
<http://www.t10.org/ftp/t10/drafts/osd2/osd2r03.pdf>.
[15] Kent, S. and K. Seo, "Security Architecture for the Internet
Protocol", RFC 4301, December 2005.
[16] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting
Codes, Part I", 1977. Codes, Part I", 1977.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Todd Pisek was a co-editor of the initial drafts for this document. Todd Pisek was a co-editor of the initial drafts for this document.
Daniel E. Messinger and Pete Wyckoff reviewed and commented on this Daniel E. Messinger and Pete Wyckoff reviewed and commented on this
document. document.
Authors' Addresses Authors' Addresses
 End of changes. 122 change blocks. 
747 lines changed or deleted 647 lines changed or added
