NFSv4                                                          B. Halevy
Internet-Draft                                                  B. Welch
Intended status: Standards Track                              J. Zelenka
Expires: October 3, 2008                                         Panasas
                                                          April 01, 2008

                      Object-based pNFS Operations
                      draft-ietf-nfsv4-pnfs-obj-07

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 3, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   This Internet-Draft provides a description of the object-based pNFS
   extension for NFSv4.  This is a companion to the main pNFS
   specification in the NFSv4 Minor Version 1 Internet Draft, which is
   currently draft-ietf-nfsv4-minorversion1-21.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  XDR Description of the Objects-Based Layout Protocol . . . . .  4
     2.1.  Basic Data Type Definitions  . . . . . . . . . . . . . . .  5
       2.1.1.  pnfs_osd_objid4  . . . . . . . . . . . . . . . . . . .  5
       2.1.2.  pnfs_osd_version4  . . . . . . . . . . . . . . . . . .  6
       2.1.3.  pnfs_osd_object_cred4  . . . . . . . . . . . . . . . .  6
       2.1.4.  pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . .  8
   3.  Object Storage Device Addressing and Discovery . . . . . . . .  8
     3.1.  pnfs_osd_addr_type4  . . . . . . . . . . . . . . . . . . .  9
     3.2.  pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . .  9
       3.2.1.  SCSI Target Identifier . . . . . . . . . . . . . . . . 10
       3.2.2.  Device Network Address . . . . . . . . . . . . . . . . 11
   4.  Object-Based Layout  . . . . . . . . . . . . . . . . . . . . . 11
     4.1.  pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 12
     4.2.  pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 13
     4.3.  Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 13
       4.3.1.  Simple Striping  . . . . . . . . . . . . . . . . . . . 14
       4.3.2.  Nested Striping  . . . . . . . . . . . . . . . . . . . 15
       4.3.3.  Mirroring  . . . . . . . . . . . . . . . . . . . . . . 16
     4.4.  RAID Algorithms  . . . . . . . . . . . . . . . . . . . . . 17
       4.4.1.  PNFS_OSD_RAID_0  . . . . . . . . . . . . . . . . . . . 17
       4.4.2.  PNFS_OSD_RAID_4  . . . . . . . . . . . . . . . . . . . 17
       4.4.3.  PNFS_OSD_RAID_5  . . . . . . . . . . . . . . . . . . . 17
       4.4.4.  PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 18
       4.4.5.  RAID Usage and Implementation Notes  . . . . . . . . . 18
   5.  Object-Based Layout Update . . . . . . . . . . . . . . . . . . 19
     5.1.  pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . . 19
     5.2.  pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 20
   6.  Recovering from Client I/O Errors  . . . . . . . . . . . . . . 20
   7.  Object-Based Layout Return . . . . . . . . . . . . . . . . . . 21
     7.1.  pnfs_osd_errno4  . . . . . . . . . . . . . . . . . . . . . 22
     7.2.  pnfs_osd_ioerr4  . . . . . . . . . . . . . . . . . . . . . 23
     7.3.  pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . . 24
   8.  Object-Based Creation Layout Hint  . . . . . . . . . . . . . . 24
     8.1.  pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 24
   9.  Layout Segments  . . . . . . . . . . . . . . . . . . . . . . . 26
     9.1.  CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 26
     9.2.  LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 27
   10. Recalling Layouts  . . . . . . . . . . . . . . . . . . . . . . 27
     10.1. CB_RECALL_ANY  . . . . . . . . . . . . . . . . . . . . . . 27
   11. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 28
   12. Security Considerations  . . . . . . . . . . . . . . . . . . . 28
     12.1. OSD Security Data Types  . . . . . . . . . . . . . . . . . 29
     12.2. The OSD Security Protocol  . . . . . . . . . . . . . . . . 30
     12.3. Protocol Privacy Requirements  . . . . . . . . . . . . . . 31
     12.4. Revoking Capabilities  . . . . . . . . . . . . . . . . . . 31
   13. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 32
   14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 32
     14.1. Normative References . . . . . . . . . . . . . . . . . . . 32
     14.2. Informative References . . . . . . . . . . . . . . . . . . 33
   Appendix A.  Acknowledgments . . . . . . . . . . . . . . . . . . . 34
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34
   Intellectual Property and Copyright Statements . . . . . . . . . . 35

1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts for
   different storage systems and methods of arranging data on storage
   devices.  This document describes the layouts used with object-based
   storage devices (OSD) that are accessed according to the OSD storage
   protocol standard (SNIA T10/1355-D [2]).

   An "object" is a container for data and attributes, and files are
   stored in one or more objects.  The OSD protocol specifies several
   operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES,
   SET ATTRIBUTES, CREATE and DELETE.  However, using the object-based
   layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH
   commands.  The other commands are only used by the pNFS server.

   An object-based layout for pNFS includes object identifiers,
   capabilities that allow clients to READ or WRITE those objects, and
   various parameters that control how file data is striped across their
   component objects.  The OSD protocol has a capability-based security
   scheme that allows the pNFS server to control what operations and
   what objects can be used by clients.  This scheme is described in
   more detail in the Security Considerations section (Section 12).

2.  XDR Description of the Objects-Based Layout Protocol

   This document contains the XDR [3] description of the NFSv4.1 objects
   layout protocol.  The XDR description is embedded in this document in
   a way that makes it simple for the reader to extract into a ready to
   compile form.  The reader can feed this document into the following
   shell script to produce the machine readable XDR description of the
   NFSv4.1 objects layout protocol:

   #!/bin/sh
   grep "^  *///" | sed 's?^  *///??'

   I.e. if the above script is stored in a file called "extract.sh", and
   this document is in a file called "spec.txt", then the reader can do:

   sh extract.sh < spec.txt > pnfs_osd_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file ([8]).  This includes both nfs
   types that end with a 4, such as offset4, length4, etc, as well as
   more generic types such as uint32_t and uint64_t.

   ////*
   /// * This file was machine generated for
   /// * draft-ietf-nfsv4-pnfs-obj-07
   /// * Last updated Tue Apr  1 21:35:08 IDT 2008
   /// *
   /// * Copyright (C) The IETF Trust (2007-2008)
   /// * All Rights Reserved.
   /// *
   /// * Copyright (C) The Internet Society (1998-2006).
   /// * All Rights Reserved.
   /// */
   ///
   ////*
   /// * pnfs_osd_prot.x
   /// */
   ///
   ///%#include <nfs4_prot.x>
   ///

2.1.  Basic Data Type Definitions

   The following sections define basic data types and constants used by
   the Object-Based Layout protocol.

2.1.1.  pnfs_osd_objid4

   An object is identified by a number, somewhat like an inode number.
   The object storage model has a two level scheme, where the objects
   within an object storage device are grouped into partitions.

   ///struct pnfs_osd_objid4 {
   ///    deviceid4       oid_device_id;
   ///    uint64_t        oid_partition_id;
   ///    uint64_t        oid_object_id;
   ///};
   ///

   The pnfs_osd_objid4 type is used to identify an object within a
   partition on a specified object storage device.  "oid_device_id"
   selects the object storage device from the set of available storage
   devices.  The device is identified with the deviceid4 type, which is
   an index into addressing information about that device returned by
   the GETDEVICELIST and GETDEVICEINFO operations.  The deviceid4 data
   type is defined in NFSv4.1 draft [9].  Within an OSD, a partition is
   identified with a 64-bit number, "oid_partition_id".  Within a
   partition, an object is identified with a 64-bit number,
   "oid_object_id".  Creation and management of partitions is outside
   the scope of this standard, and is a facility provided by the object
   storage file system.
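   For illustration only (the struct and helper below are a C sketch of
   ours, not part of the protocol), the two-level scheme means a stored
   object is named by the (device, partition, object) triple, and two
   references denote the same object only when all three levels match:

   ```c
   #include <stdint.h>
   #include <stdbool.h>

   /* Hypothetical C mirror of the XDR pnfs_osd_objid4 triple. */
   struct objid {
       uint64_t device_index;   /* stands in for deviceid4 */
       uint64_t partition_id;   /* partition within the device */
       uint64_t object_id;      /* object within the partition */
   };

   /* Two identifiers name the same stored object only when the
    * device, partition, and object components all match. */
   bool objid_equal(const struct objid *a, const struct objid *b)
   {
       return a->device_index == b->device_index &&
              a->partition_id == b->partition_id &&
              a->object_id == b->object_id;
   }
   ```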

2.1.2.  pnfs_osd_version4

   ///enum pnfs_osd_version4 {
   ///    PNFS_OSD_MISSING    = 0,
   ///    PNFS_OSD_VERSION_1  = 1,
   ///    PNFS_OSD_VERSION_2  = 2
   ///};
   ///

   Pnfs_osd_version4 is used to indicate the OSD protocol version or
   whether an object is missing (i.e., unavailable).  Some of the
   supported raid algorithms encode redundant information and can
   compensate for missing components, but the data placement algorithm
   needs to know what parts are missing.

   At this time the OSD standard is at version 1.0, and we anticipate a
   version 2.0 of the standard ((SNIA T10/1729-D [10])).  The second
   generation OSD protocol has additional proposed features to support
   more robust error recovery, snapshots, and byte-range capabilities.
   Therefore, the OSD version is explicitly called out in the
   information returned in the layout.  (This information can also be
   deduced by looking inside the capability type at the format field,
   which is the first byte.  The format value is 0x1 for an OSD v1
   capability.  However, it seems most robust to call out the version
   explicitly.)
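   As a purely illustrative sketch of the parenthetical above, deducing
   the version from a capability's first byte might look as follows.
   The function name is ours, and the fallback mapping of any other
   non-empty capability to version 2 is an assumption of this sketch,
   which is exactly why the explicit field is the more robust choice:

   ```c
   #include <stdint.h>
   #include <stddef.h>

   /* Hypothetical helper: infer the OSD protocol generation from a
    * capability.  The format field is the first byte; 0x1 marks an
    * OSD v1 capability.  An absent capability mirrors
    * PNFS_OSD_MISSING (0); any other format byte is assumed v2. */
   int osd_version_from_capability(const uint8_t *cap, size_t len)
   {
       if (cap == NULL || len == 0)
           return 0;               /* PNFS_OSD_MISSING */
       return (cap[0] == 0x1) ? 1  /* PNFS_OSD_VERSION_1 */
                              : 2; /* PNFS_OSD_VERSION_2 (assumed) */
   }
   ```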

2.1.3.  pnfs_osd_object_cred4

   ///enum pnfs_osd_cap_key_sec4 {
   ///    PNFS_OSD_CAP_KEY_SEC_NONE = 0,
   ///    PNFS_OSD_CAP_KEY_SEC_SSV  = 1
   ///};
   ///
   ///struct pnfs_osd_object_cred4 {
   ///    pnfs_osd_objid4         oc_object_id;
   ///    pnfs_osd_version4       oc_osd_version;
   ///    pnfs_osd_cap_key_sec4   oc_cap_key_sec;
   ///    opaque                  oc_capability_key<>;
   ///    opaque                  oc_capability<>;
   ///};
   ///

   The pnfs_osd_object_cred4 structure is used to identify each
   component comprising the file.  The "oc_object_id" identifies the
   component object, the "oc_osd_version" represents the osd protocol
   version, or whether that component is unavailable, and the
   "oc_capability" and "oc_capability_key", along with the
   "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD
   security credentials needed to access that object.  The
   "oc_cap_key_sec" value denotes the method used to secure the
   oc_capability_key (see Section 12.1 for more details).

   To comply with the OSD security requirements the capability key
   SHOULD be transferred securely to prevent eavesdropping (see
   Section 12).  Therefore, a client SHOULD either issue the LAYOUTGET
   or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service
   or previously establish an SSV for the sessions via the NFSv4.1
   SET_SSV operation.  The pnfs_osd_cap_key_sec4 type is used to
   identify the method used by the server to secure the capability key.

   o  PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is
      not encrypted, in which case the client SHOULD issue the
      LAYOUTGET or GETDEVICEINFO operations with RPCSEC_GSS with the
      privacy service, or the NFSv4.1 transport should be secured by
      using methods that are external to NFSv4.1 like the use of IPSEC
      [11] for transporting the NFSV4.1 protocol.

   o  PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key
      contents are encrypted using the SSV GSS context and the
      capability key as inputs to the GSS_Wrap() function (see GSS-API
      [4]) with the conf_req_flag set to TRUE.  The client MUST use the
      secret SSV key as part of the client's GSS context to decrypt the
      capability key, using the value of the oc_capability_key field as
      the input_message to the GSS_unwrap() function.  Note that to
      prevent eavesdropping of the SSV key the client SHOULD issue
      SET_SSV via RPCSEC_GSS with the privacy service.

   The actual method chosen depends on whether the client established a
   SSV key with the server and whether it issued the operation with the
   RPCSEC_GSS privacy method.  Naturally, if the client did not
   establish a SSV key via SET_SSV the server MUST use the
   PNFS_OSD_CAP_KEY_SEC_NONE method.  Otherwise, if the operation was
   not issued with the RPCSEC_GSS privacy method the server SHOULD
   secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV
   method.  The server MAY use the PNFS_OSD_CAP_KEY_SEC_SSV method also
   when the operation was issued with the RPCSEC_GSS privacy method.
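   The server-side decision procedure in the paragraph above can be
   sketched in C (the enum and function names are ours, not from the
   protocol; this simply encodes the MUST/SHOULD/MAY cases):

   ```c
   #include <stdbool.h>

   enum cap_key_sec { CAP_KEY_SEC_NONE = 0, CAP_KEY_SEC_SSV = 1 };

   /* Without an established SSV the server has no shared key, so
    * PNFS_OSD_CAP_KEY_SEC_NONE is the only option (MUST).  With an
    * SSV but no RPCSEC_GSS privacy on the request, wrapping with the
    * SSV is the SHOULD-level choice.  With both, either method is
    * allowed (MAY); this sketch conservatively keeps the SSV wrap. */
   enum cap_key_sec choose_cap_key_sec(bool ssv_established,
                                       bool rpc_privacy)
   {
       if (!ssv_established)
           return CAP_KEY_SEC_NONE;   /* MUST */
       if (!rpc_privacy)
           return CAP_KEY_SEC_SSV;    /* SHOULD */
       return CAP_KEY_SEC_SSV;        /* MAY use either */
   }
   ```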

2.1.4.  pnfs_osd_raid_algorithm4

   ///enum pnfs_osd_raid_algorithm4 {
   ///    PNFS_OSD_RAID_0     = 1,
   ///    PNFS_OSD_RAID_4     = 2,
   ///    PNFS_OSD_RAID_5     = 3,
   ///    PNFS_OSD_RAID_PQ    = 4     /* Reed-Solomon P+Q */
   ///};
   ///

   pnfs_osd_raid_algorithm4 represents the data redundancy algorithm
   used to protect the file's contents.  See Section 4.4 for more
   details.

3.  Object Storage Device Addressing and Discovery

   Data operations to an OSD require the client to know the "address"
   of each OSD's root object.  The root object is synonymous with SCSI
   logical unit.  The client specifies SCSI logical units to its SCSI
   protocol stack using a representation local to the client.  Because
   these representations are local, GETDEVICEINFO must return
   information that can be used by the client to select the correct
   local representation.

   In the block world, a set offset (logical block number or track/
   sector) contains a disk label.  This label identifies the disk
   uniquely.  In contrast, an OSD has a standard set of attributes on
   its root object.  For device identification purposes the OSD System
   ID (root information attribute number 3) and the OSD Name (root
   information attribute number 9) are used as the label.  These appear
   in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and
   "oda_osdname" fields.

   In some situations, SCSI target discovery may need to be driven
   based on information contained in the GETDEVICEINFO response.  One
   example of this is iSCSI targets that are not known to the client
   until a layout has been requested.  The information provided as the
   "targetid", "netaddr", and "lun" fields in the pnfs_osd_deviceaddr4
   type described below (see Section 3.2), allows the client to probe a
   specific device given its network address and optionally its iSCSI
   Name (see iSCSI [5]), or when the device network address is omitted,
   to discover the object storage device using the provided device name
   or SCSI device identifier (See SPC-3 [6].)

   The oda_systemid is used by the client, along with the object
   credential, to sign each request with the request integrity check
   value.  This method protects the client from unintentionally
   accessing a device if the object storage device address mapping was
   changed (or revoked).  The server computes the capability key using
   its own view of the systemid associated with the respective deviceid
   present in the credential.  If the client's view of the deviceid
   mapping is stale, the client will use the wrong systemid (which must
   be system-wide unique) and the I/O request to the OSD will fail to
   pass the integrity check verification.

   To recover from this condition the client should report the error
   and return the layout using LAYOUTRETURN, and invalidate all the
   device address mappings associated with this layout.  The client can
   then ask for a new layout if it wishes using LAYOUTGET and resolve
   the referenced deviceids using GETDEVICEINFO or GETDEVICELIST.

   The server MUST provide the oda_systemid and SHOULD also provide the
   oda_osdname.  When the OSD name is present the client SHOULD get the
   root information attributes whenever it establishes communication
   with the OSD and verify that the OSD name it got from the OSD
   matches the one sent by the metadata server.  To do so, the client
   uses the root_obj_cred credentials.
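   The name check described above can be sketched as follows (the
   helper is hypothetical; a real client would fetch the OSD Name root
   attribute over the OSD protocol using the root_obj_cred before
   comparing):

   ```c
   #include <string.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical client-side check: compare the oda_osdname sent by
    * the metadata server against the OSD Name root information
    * attribute read from the device.  The osdname is optional, so an
    * empty server-provided name leaves nothing to verify. */
   bool osd_name_matches(const char *oda_osdname, size_t oda_len,
                         const char *root_attr_name, size_t attr_len)
   {
       if (oda_len == 0)
           return true;  /* server did not provide an osdname */
       return oda_len == attr_len &&
              memcmp(oda_osdname, root_attr_name, oda_len) == 0;
   }
   ```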

3.1.  pnfs_osd_addr_type4

   The following enum specifies the privacy service or to previously
   establish manner in which a scsi target can be
   specified.  The target can be specified as an SSV for the sessions via the NFSv4.1 SET_SSV operation. SCSI Name, or as a SCSI
   Device Identifier.

   ///enum pnfs_osd_targetid_type4 {
   ///    OBJ_TARGET_ANON             = 1,
   ///    OBJ_TARGET_SCSI_NAME        = 2,
   ///    OBJ_TARGET_SCSI_DEVICE_ID   = 3
   ///};
   ///

3.2.  pnfs_osd_deviceaddr4

   The specification for an object device address is as follows:

 ///union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
 ///    case OBJ_TARGET_SCSI_NAME:
 ///        string              oti_scsi_name<>;
 ///
 ///    case OBJ_TARGET_SCSI_DEVICE_ID:
 ///        opaque              oti_scsi_device_id<>;
 ///
 ///    default:
 ///        void;
 ///};
 ///
 ///union pnfs_osd_targetaddr4 switch (bool ota_available) {
 ///    case TRUE:
 ///        netaddr4            ota_netaddr;
 ///    case FALSE:
 ///        void;
 ///};
 ///
 ///struct pnfs_osd_deviceaddr4 {
 ///    pnfs_osd_targetid4      oda_targetid;
 ///    pnfs_osd_targetaddr4    oda_targetaddr;
 ///    uint64_t                oda_lun;
 ///    opaque                  oda_systemid<>;
 ///    pnfs_osd_object_cred4   oda_root_obj_cred;
 ///    opaque                  oda_osdname<>;
 ///};
 ///

3.2.1.  SCSI Target Identifier

   When "oda_targetid" is specified as OBJ_TARGET_SCSI_NAME, the
   "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as
   specified in iSCSI [5] and [7].  Note that the specification of the
   oti_scsi_name string format is outside the scope of this document.
   Parsing the string is based on the string prefix, e.g., "iqn.",
   "eui.", or "naa.", and more formats MAY be specified in the future in
   accordance with iSCSI Names properties.

   Currently, the iSCSI Name provides for naming the target device
   using a string formatted as an iSCSI Qualified Name (IQN) or as an
   EUI [12] string.  These are typically used to identify iSCSI or SRP
   [13] devices.  The Network Address Authority (NAA) string format (see
   [7]) provides for naming the device using globally unique
   identifiers, as defined in FC-FS [14].  These are typically used to
   identify Fibre Channel or SAS [15] (Serial Attached SCSI) devices, in
   particular devices that are dual-attached over Fibre Channel or SAS
   and over iSCSI.

   When "oda_targetid" is specified as OBJ_TARGET_SCSI_DEVICE_ID, the
   "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
   Identifier as defined in SPC-3 [6] VPD Page 83h (Section 7.6.3,
   "Device Identification VPD Page").  Note that, similarly to
   "oti_scsi_name", the specification of the oti_scsi_device_id opaque
   contents is outside the scope of this document and more formats MAY
   be specified in the future in accordance with SPC-3.

   The OBJ_TARGET_ANON pnfs_osd_targetid_type4 value MAY be used to
   provide no target identification.  In this case, only the OSD System
   ID and, optionally, the provided network address are used to locate
   the device.

3.2.2.  Device Network Address

   The optional "oda_targetaddr" field MAY be provided by the server as
   a hint to accelerate device discovery over, e.g., the iSCSI transport
   protocol.  The network address is given with the netaddr4 type, which
   specifies a TCP/IP based endpoint (as specified in the NFSv4.1 draft
   [9]).  When given, the client SHOULD use it to probe for the SCSI
   device at the given network address.  The client MAY still use other
   discovery mechanisms such as iSNS [16] to locate the device using the
   oda_targetid.  In particular, such an external name service SHOULD be
   used when the devices may be attached to the network using multiple
   connections and/or multiple storage fabrics (e.g., Fibre Channel and
   iSCSI).

4.  Object-Based Layout

   The layout4 type is defined in the NFSv4.1 draft [9] as follows:

   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3
   };

   struct layout_content4 {
       layouttype4             loc_type;
       opaque                  loc_body<>;
   };

   struct layout4 {
       offset4                 lo_offset;
       length4                 lo_length;
       layoutiomode4           lo_iomode;
       layout_content4         lo_content;
   };

   This document defines the structure associated with the layouttype4
   value LAYOUT4_OSD2_OBJECTS.  The NFSv4.1 draft [9] specifies the
   loc_body structure as an XDR type "opaque".  The opaque layout is
   uninterpreted by the generic pNFS client layers, but obviously must
   be interpreted by the object-storage layout driver.  This section
   defines the structure of this opaque value, pnfs_osd_layout4.

4.1.  pnfs_osd_data_map4

   ///struct pnfs_osd_data_map4 {
   ///    uint32_t                    odm_num_comps;
   ///    length4                     odm_stripe_unit;
   ///    uint32_t                    odm_group_width;
   ///    uint32_t                    odm_group_depth;
   ///    uint32_t                    odm_mirror_cnt;
   ///    pnfs_osd_raid_algorithm4    odm_raid_algorithm;
   ///};
   ///

   The pnfs_osd_data_map4 structure parameterizes the algorithm that
   maps a file's contents over the component objects.  Instead of
   limiting the system to simple striping scheme where loss of a single
   component object results in data loss, the map parameters support
   mirroring and more complicated schemes that protect against loss of a
   component object.


   "odm_num_comps" is the number of component objects the file is
   striped over.  The server MAY grow the file by adding more components
   to the stripe while clients hold valid layouts until the file has
   reached its final stripe width.  The file length in this case MUST be
   limited to the number of bytes in a full stripe.

   The "odm_stripe_unit" is the number of bytes placed on one component
   before advancing to the next one in the list of components.  The
   number of bytes in a full stripe is odm_stripe_unit times the number
   of components.  In some raid schemes, a stripe includes redundant
   information (i.e., parity) that lets the system recover from loss or
   damage to a component object.

   The "odm_group_width" and "odm_group_depth" parameters allow a nested
   striping pattern (see Section 4.3.2 for details).  If there is no
   nesting, then odm_group_width and odm_group_depth MUST be zero.  The
   size of the components array MUST be a multiple of odm_group_width.

   The "odm_mirror_cnt" is used to replicate a file by replicating its
   component objects.  If there is no mirroring, then odm_mirror_cnt
   MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
   the component array MUST be a multiple of (odm_mirror_cnt+1).

   See Section 4.3 for more details.

4.2.  pnfs_osd_layout4

   ///struct pnfs_osd_layout4 {
   ///    pnfs_osd_data_map4      olo_map;
   ///    uint32_t                olo_comps_index;
   ///    pnfs_osd_object_cred4   olo_components<>;
   ///};
   ///

   The pnfs_osd_layout4 structure specifies a layout over a set of
   component objects.  The "olo_components" field is an array of object
   identifiers and security credentials that grant access to each
   object.  The organization of the data is defined by the
   pnfs_osd_data_map4 type that specifies how the file's data is mapped
   onto the component objects (i.e., the striping pattern).  The data
   placement algorithm that maps file data onto component objects
   assumes that each component object occurs exactly once in the array
   of components.  Therefore, component objects MUST appear in the
   olo_components array only once.  The components array may represent
   all objects comprising the file, in which case "olo_comps_index" is
   set to zero and the number of entries in the olo_components array is
   equal to olo_map.odm_num_comps.  The server MAY return fewer
   components than odm_num_comps, provided that the returned components
   are sufficient to access any byte in the layout's data range (e.g., a
   sub-stripe of "odm_group_width" components).  In this case,
   olo_comps_index represents the position of the returned components
   array within the full array of components that comprise the file.

   Note that the layout depends on the file size, which the client
   learns from the generic return parameters of LAYOUTGET, by doing
   GETATTR commands to the metadata server.  The client uses the file
   size to decide if it should fill holes with zeros, or return a short
   read.  Striping patterns can cause cases where component objects are
   shorter than other components because a hole happens to correspond to
   the last part of the component object.

4.3.  Data Mapping Schemes

   This section describes the different data mapping schemes in detail.
   The object layout always uses a "dense" layout as described in
   NFSv4.1 draft [9].  This means that the second stripe unit of the
   file starts at offset 0 of the second component, rather than at
   offset stripe_unit bytes.  After a full stripe has been written, the
   next stripe unit is appended to the first component object in the
   list without any holes in the component objects.

4.3.1.  Simple Striping

   The mapping from the logical offset within a file (L) to the
   component object C and object-specific offset O is defined by the
   following equations:

   L = logical offset into the file
   W = total number of components
   S = W * stripe_unit
   N = L / S
   C = (L-(N*S)) / stripe_unit
   O = (N*stripe_unit)+(L%stripe_unit)

   In these equations, S is the number of bytes in a full stripe, and N
   is the stripe number.  C is an index into the array of components, so
   it selects a particular object storage device.  Both N and C count
   from zero.  O is the offset within the object that corresponds to the
   file offset.  Note that this computation does not accommodate the
   same object appearing in the olo_components array multiple times.

   For example, consider an object striped over four devices, <D0 D1 D2
   D3>.  The stripe_unit is 4096 bytes.  The stripe width S is thus 4 *
   4096 = 16384.

   Offset 0:
     N = 0 / 16384 = 0
     C = (0-(0*16384)) / 4096 = 0 (D0)
     O = 0*4096 + (0%4096) = 0

   Offset 4096:
     N = 4096 / 16384 = 0
     C = (4096-(0*16384)) / 4096 = 1 (D1)
     O = (0*4096)+(4096%4096) = 0

   Offset 9000:
     N = 9000 / 16384 = 0
     C = (9000-(0*16384)) / 4096 = 2 (D2)
     O = (0*4096)+(9000%4096) = 808

   Offset 132000:
     N = 132000 / 16384 = 8
     C = (132000-(8*16384)) / 4096 = 0 (D0)
     O = (8*4096) + (132000%4096) = 33696
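
   The mapping above can be sketched in code.  The following Python
   fragment is illustrative only (the function name and variables are
   ours, not part of the protocol); it reproduces the worked examples:

```python
def map_simple(L, stripe_unit, num_comps):
    """Map logical file offset L to (component index C, object offset O)."""
    S = num_comps * stripe_unit            # bytes in a full stripe
    N = L // S                             # stripe number
    C = (L - N * S) // stripe_unit         # component index (selects the OSD)
    O = N * stripe_unit + L % stripe_unit  # offset within that object
    return C, O

# Four devices <D0 D1 D2 D3>, stripe_unit = 4096 bytes:
for L in (0, 4096, 9000, 132000):
    print(L, map_simple(L, 4096, 4))
```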

4.3.2.  Nested Striping

   The odm_group_width and odm_group_depth parameters allow a nested
   striping pattern.  The odm_group_width defines the width of a data
   stripe and the odm_group_depth defines how many stripes are written
   before
   advancing to the next group of components in the list of component
   objects for the file.  The math used to map from a file offset to a
   component object and offset within that object is shown below.  The
   computations map from the logical offset L to the component index C
   and relative offset O within that component object.

   L = logical offset into the file
   W = total number of components
   S = stripe_unit * group_depth * W
   T = stripe_unit * group_depth * group_width
   U = stripe_unit * group_width
   M = L / S
   G = (L - (M * S)) / T
   H = (L - (M * S)) % T
   N = H / U
   C = (H - (N * U)) / stripe_unit + G * group_width
   O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit

   In these equations, S is the number of bytes striped across all
   component objects before the pattern repeats.  T is the number of
   bytes striped within a group of component objects before advancing to
   the next group.  U is the number of bytes in a stripe within a group.
   M is the "major" (i.e., across all components) stripe number, and N
   is the "minor" (i.e., across the group) stripe number.  G counts the
   groups from the beginning of the major stripe, and H is the byte
   offset within the group.

   For example, consider an object striped over 100 devices with a
   group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB.
   In this scheme, 500 MB are written to the first 10 components, and
   5000 MB is written before the pattern wraps back around to the first
   component in the array.

   Offset 0:
     W = 100
     S = 1 MB * 50 * 100 = 5000 MB
     T = 1 MB * 50 * 10 = 500 MB
     U = 1 MB * 10 = 10 MB
     M = 0 / 5000 MB = 0
     G = (0 - (0 * 5000 MB)) / 500 MB = 0
     H = (0 - (0 * 5000 MB)) % 500 MB = 0
     N = 0 / 10 MB = 0
     C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0
     O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0

   Offset 27 MB:
     M = 27 MB / 5000 MB = 0
     G = (27 MB - (0 * 5000 MB)) / 500 MB = 0
     H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB
     N = 27 MB / 10 MB = 2
     C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7
     O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB

   Offset 7232 MB:
     M = 7232 MB / 5000 MB = 1
     G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
     H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
     N = 232 MB / 10 MB = 23
     C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
     O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB
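
   The nested-striping equations can likewise be cross-checked with a
   short illustrative Python sketch (names are ours); it reproduces the
   example above:

```python
MB = 1 << 20  # 1 MB

def map_nested(L, stripe_unit, group_width, group_depth, W):
    """Map logical offset L to (component C, object offset O), nested striping."""
    S = stripe_unit * group_depth * W            # bytes before the pattern repeats
    T = stripe_unit * group_depth * group_width  # bytes per group
    U = stripe_unit * group_width                # bytes per stripe within a group
    M = L // S                                   # major stripe number
    G = (L - M * S) // T                         # group number within major stripe
    H = (L - M * S) % T                          # byte offset within the group
    N = H // U                                   # minor stripe number
    C = (H - N * U) // stripe_unit + G * group_width
    O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
    return C, O

# 100 devices, group_width 10, group_depth 50, stripe_unit 1 MB:
print(map_nested(7232 * MB, 1 * MB, 10, 50, 100))
```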

3.2.3.

4.3.3.  Mirroring

   The odm_mirror_cnt is used to replicate a file by replicating its
   component objects.  If there is no mirroring, then odm_mirror_cnt
   MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
   the olo_components array MUST be a multiple of (odm_mirror_cnt+1).
   Thus, for a classic mirror on two objects, odm_mirror_cnt is one.
   Note that mirroring can be defined over any RAID algorithm and
   striping pattern (either simple or nested).  If odm_group_width is
   also non-zero, then the size of the olo_components array MUST be a
   multiple of odm_group_width * (odm_mirror_cnt+1).  Replicas are
   adjacent in the olo_components array, and the value C produced by the
   above equations is not a direct index into the olo_components array.
   Instead, the following equations determine the replica component
   index RCi, where i ranges from 0 to odm_mirror_cnt.

   C = component index for striping or two-level striping
   i ranges from 0 to odm_mirror_cnt, inclusive
   RCi = C * (odm_mirror_cnt+1) + i
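
   The replica expansion can be sketched as follows (an illustrative
   Python fragment; the function name is ours):

```python
def replica_indices(C, mirror_cnt):
    """All olo_components indices holding replicas of striping component C."""
    return [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]

# Classic two-way mirror (odm_mirror_cnt == 1): striping index 2 occupies
# entries 4 and 5 of the olo_components array.
print(replica_indices(2, 1))  # [4, 5]
```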

4.4.  RAID Algorithms

   pnfs_osd_raid_algorithm4 determines the algorithm and placement of
   redundant data.  This section defines the different RAID algorithms.

4.4.1.  PNFS_OSD_RAID_0

   PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
   component objects are data bytes located by the above equations for C
   and O.  If a component object is marked as PNFS_OSD_MISSING, the pNFS
   client MUST either return an I/O error when an attempt is made to
   read this component, or alternatively retry the READ against the pNFS
   server.

4.4.2.  PNFS_OSD_RAID_4

   PNFS_OSD_RAID_4 means that the last component object, or the last in
   each group (if odm_group_width is greater than zero), contains parity
   information computed over the rest of the stripe with an XOR
   operation.  If a component object is unavailable, the client can read
   the rest of the stripe units in the damaged stripe and recompute the
   missing stripe unit by XORing the other stripe units in the stripe.
   Or the client can replay the READ against the pNFS server which will
   presumably perform the reconstructed read on the client's behalf.

   When parity is present in the file, then there is an additional
   computation to map from the file offset L to the offset that accounts
   for embedded parity, L'.  First compute L', and then use L' in the
   above equations for C and O.

   L = file offset, not accounting for parity
   P = number of parity devices in each stripe
   W = odm_group_width, if not zero, else size of olo_components array
   N = L / ((W-P) * stripe_unit)
   L' = N * (W * stripe_unit) +
        (L % ((W-P) * stripe_unit))
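
   The computation of L' can be sketched as follows (an illustrative
   Python fragment; names are ours).  The resulting L' is then fed to
   the striping equations above to obtain C and O:

```python
def raid4_offset(L, stripe_unit, W, P=1):
    """Map data-only file offset L to L', the offset in a layout that embeds
    P parity units at the end of each W-component stripe."""
    data_per_stripe = (W - P) * stripe_unit  # data bytes per stripe
    N = L // data_per_stripe                 # stripe number
    return N * (W * stripe_unit) + L % data_per_stripe

# Four components, one of them parity, stripe_unit = 4096: the second data
# stripe (L = 12288) starts at L' = 16384, skipping the first parity unit.
print(raid4_offset(12288, 4096, 4))  # 16384
```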

4.4.3.  PNFS_OSD_RAID_5

   PNFS_OSD_RAID_5 means that the position of the parity data is rotated
   on each stripe or each group (if odm_group_width is greater than
   zero).  In the first stripe, the last component holds the parity.  In
   the second stripe, the next-to-last component holds the parity, and
   so on.  In this scheme, all stripe units are rotated so that I/O is
   evenly spread across objects as the file is read sequentially.  The
   rotated parity layout is illustrated here, with numbers indicating
   the stripe unit.

   0 1 2 P
   4 5 P 3
   8 P 6 7
   P 9 a b

   To compute the component object C, first compute the offset that
   accounts for parity L' and use that to compute C. Then rotate C to
   get C'.  Finally, increase C' by one if the parity information comes
   at or before C' within that stripe.  The following equations
   illustrate this by computing I, which is the index of the component
   that contains parity for a given stripe.

   L = file offset, not accounting for parity
   W = odm_group_width, if not zero, else size of olo_components array
   N = L / ((W-1) * stripe_unit)
   (Compute L' as described above)
   (Compute C based on L' as described above)
   C' = (C - (N%W)) % W
   I = W - (N%W) - 1
   if (C' <= I) {
     C'++
   }
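
   The rotation of the parity position can be illustrated with a small
   sketch (Python, names ours); the full data mapping additionally
   applies the L' and C' computations above:

```python
def parity_index(N, W):
    """Index I of the component holding parity for stripe number N."""
    return W - (N % W) - 1

# With W = 4 the parity rotates across components 3, 2, 1, 0, 3, ...
print([parity_index(N, 4) for N in range(5)])  # [3, 2, 1, 0, 3]
```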

4.4.4.  PNFS_OSD_RAID_PQ

   PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
   P+Q encoding scheme [17].  In this layout, the last two component
   objects hold the P and Q data, respectively.  P is parity computed
   with XOR, and Q is a more complex equation that is not described
   here.  The equations given above for embedded parity can be used to
   map a file offset to the correct component object by setting the
   number of parity components to 2 instead of 1 for RAID4 or RAID5.
   Clients may simply choose to read data through the metadata server if
   two components are missing or damaged.

   Issue: This scheme also has a RAID_4 like layout where the ECC blocks
   are stored on the same components on every stripe and a rotated,
   RAID-5 like layout where the stripe units are rotated.  Should we
   make the following properties orthogonal: RAID_4 or RAID_5 (i.e.,
   non-rotated or rotated), and then have the number of parity
   components and the associated algorithm be the orthogonal parameter?

4.4.5.  RAID Usage and Implementation Notes

   RAID layouts with redundant data in their stripes require additional
   serialization of updates to ensure correct operation.  Otherwise, if
   two clients simultaneously write to the same logical range of an
   object, the result could include different data in the same ranges of
   mirrored tuples, or corrupt parity information.  It is the
   responsibility of the metadata server to enforce serialization
   requirements such as this.  For example, the metadata server may do
   so by not granting overlapping write layouts within mirrored objects.

5.  Object-Based Layout Update

   layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates
   to the layout and additional information to the metadata server.  It
   is defined in the NFSv4.1 draft [9] as follows:

   struct layoutupdate4 {
       layouttype4             lou_type;
       opaque                  lou_body<>;
   };

   The layoutupdate4 type is an opaque value at the generic pNFS client
   level.  If the layout type is LAYOUT4_OSD2_OBJECTS, then the
   lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type.

   Object-Based pNFS clients are not allowed to modify the layout.
   Therefore, the information passed in pnfs_osd_layoutupdate4 is used
   only to update the file's attributes.  In addition to the generic
   information the client can pass to the metadata server in
   LAYOUTCOMMIT, such as the highest offset the client wrote to and the
   last time it modified the file, the client MAY use
   pnfs_osd_layoutupdate4 to convey the capacity consumed (or released)
   by writes using the layout, and to indicate that I/O errors were
   encountered by such writes.

5.1.  pnfs_osd_deltaspaceused4

   ///union pnfs_osd_deltaspaceused4 switch (bool valid) {
   ///    case TRUE:
   ///        int64_t     dsu_delta;  /* Bytes consumed by write activity */
   ///    case FALSE:
   ///        void;
   ///};
   ///

   pnfs_osd_deltaspaceused4 is used to convey space utilization
   information at the time of LAYOUTCOMMIT.  For the file system to
   properly maintain capacity used information, it needs to track how
   much capacity was consumed by WRITE operations performed by the
   client.  In this protocol, the OSD returns the capacity consumed by a
   write (*), which can be different than the number of bytes written
   because of internal overhead like block-level allocation and indirect
   blocks, and the client reflects this back to the pNFS server so it
   can accurately track quota.  The pNFS server can choose to trust this
   information coming from the clients and therefore avoid querying the
   OSDs at the time of LAYOUTCOMMIT.  If the client is unable to obtain
   this information from the OSD, it simply returns an invalid
   olu_delta_space_used.

   (*) Note: At the time this document is written, a per-command used
   capacity attribute is not yet standardized by OSD2 draft [10].  The
   client MAY use vendor-specific attributes to calculate space
   utilization, provided that the vendor defines and publishes a
   suitable vendor-specific attributes page for current-command
   attributes as defined by OSD2 draft [10], Section 7.1.2.2.

5.2.  pnfs_osd_layoutupdate4

   ///struct pnfs_osd_layoutupdate4 {
   ///    pnfs_osd_deltaspaceused4    olu_delta_space_used;
   ///    bool                        olu_ioerr_flag;
   ///};
   ///

   "olu_delta_space_used" is used to convey capacity usage information
   back to the metadata server.

   The "olu_ioerr_flag" is used when I/O errors were encountered while
   writing the file.  The client MUST report the errors using the
   pnfs_osd_ioerr4 structure (see Section 7.1) at LAYOUTRETURN time.

   If the client updated the file successfully before hitting the I/O
   errors it MAY use LAYOUTCOMMIT to update the metadata server as
   described above.  Typically, in the error-free case, the server MAY
   turn around and update the file's attributes on the storage devices.
   However, if I/O errors were encountered, the server should not
   attempt to write the new attributes on the storage devices until it
   receives the I/O error report; therefore, the client MUST set the
   olu_ioerr_flag to true.  Note that in this case, the client SHOULD
   send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same
   COMPOUND RPC.

6.  Recovering from Client I/O Errors

   The pNFS client may encounter errors when directly accessing the
   object storage devices.  However, it is the responsibility of the
   metadata server to handle the I/O errors.  When the
   LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the
   I/O errors to the server at LAYOUTRETURN time using the
   pnfs_osd_ioerr4 structure (see Section 7.1).

   The metadata server analyzes the error and determines the required
   recovery operations such as repairing any parity inconsistencies,
   recovering media failures, or reconstructing missing objects.

   The metadata server SHOULD recall any outstanding layouts to allow it
   exclusive write access to the stripes being recovered and to prevent
   other clients from hitting the same error condition.  In these cases,
   the server MUST complete recovery before handing out any new layouts
   to the affected byte ranges.

   Although it MAY be acceptable for the client to propagate a
   corresponding error to the application that initiated the I/O
   operation and drop any unwritten data, the client SHOULD attempt to
   retry the original I/O operation by requesting a new layout using
   LAYOUTGET and retrying the I/O operation(s) using the new layout, or
   the client MAY just retry the I/O operation(s) using regular NFS READ
   or WRITE operations via the metadata server.  The client SHOULD attempt
   to retrieve a new layout and retry the I/O operation using OSD
   commands first and only if the error persists, retry the I/O
   operation via the metadata server.

7.  Object-Based Layout Return

   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
   layout-type specific information to the server.  It is defined in the
   NFSv4.1 draft [9] as follows:

   struct layoutreturn_file4 {
           offset4         lrf_offset;
           length4         lrf_length;
           stateid4        lrf_stateid;
           /* layouttype4 specific data */
           opaque          lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
           case LAYOUTRETURN4_FILE:
                   layoutreturn_file4      lr_layout;
           default:
                   void;
   };

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool                    lora_reclaim;
           layoutreturn_stateid    lora_recallstateid;
           layouttype4             lora_layout_type;
           layoutiomode4           lora_iomode;
           layoutreturn4           lora_layoutreturn;
   };

   If the layout type is LAYOUT4_OSD2_OBJECTS, then the
   lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type.

   The pnfs_osd_layoutreturn4 type allows the client to report I/O
   error information back to the metadata server as defined below.

7.1.  pnfs_osd_errno4

   ///enum pnfs_osd_errno4 {
   ///    PNFS_OSD_ERR_EIO            = 1,
   ///    PNFS_OSD_ERR_NOT_FOUND      = 2,
   ///    PNFS_OSD_ERR_NO_SPACE       = 3,
   ///    PNFS_OSD_ERR_BAD_CRED       = 4,
   ///    PNFS_OSD_ERR_NO_ACCESS      = 5,
   ///    PNFS_OSD_ERR_UNREACHABLE    = 6,
   ///    PNFS_OSD_ERR_RESOURCE       = 7
   ///};
   ///

   pnfs_osd_errno4 is used to represent error types when read/write
   errors are reported to the metadata server.  The error codes serve as
   hints to the metadata server that may help it in diagnosing the exact
   reason for the error and in repairing it.

   o  PNFS_OSD_ERR_EIO indicates the operation failed because the Object
      Storage Device experienced a failure trying to access the object.
      The most common source of these errors is media errors, but other
      internal errors might cause this.  In this case, the metadata
      server should go examine the broken object more closely, hence it
      should be used as the default error code.

   o  PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object
      that does not exist on the Object Storage Device.

   o  PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the
      Object Storage Device ran out of free capacity during the
      operation.

   o  PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not
      valid.  The primary cause of this is that the capability has
      expired, or the access policy tag (a.k.a., capability version
      number) has been changed to revoke capabilities.  The client will
      need to return the layout and get a new one with fresh
      capabilities.

   o  PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the
      requested operation.  This should not occur in normal operation
      because the metadata server should give out correct capabilities,
      or none at all.

   o  PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the
      I/O operation at the Object Storage Device due to a communication
      failure.  Whether the I/O operation was executed by the OSD or not
      is undetermined.

   o  PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O
      operation due to a local problem on the initiator (i.e., client)
      side, e.g., when running out of memory.  The client MUST guarantee
      that the OSD command was never dispatched to the OSD.

7.2.  pnfs_osd_ioerr4

   ///struct pnfs_osd_ioerr4 {
   ///    pnfs_osd_objid4     oer_component;
   ///    length4             oer_comp_offset;
   ///    length4             oer_comp_length;
   ///    bool                oer_iswrite;
   ///    pnfs_osd_errno4     oer_errno;
   ///};
   ///

   The pnfs_osd_ioerr4 structure is used to return error indications for
   objects that generated errors during data transfers.  These are hints
   to the metadata server that there are problems with that object.  For
   each error, "oer_component", "oer_comp_offset", and "oer_comp_length"
   represent the object and byte range within the component object in
   which the error occurred, "oer_iswrite" is set to "true" if the
   failed OSD operation was data modifying, and "oer_errno" represents
   the type of error.

   Component byte ranges in the optional pnfs_osd_ioerr4 structure are
   used for recovering the object and MUST be set by the client to cover
   all failed I/O operations to the component.
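
   As a non-normative illustration, the following Python sketch computes
   a single covering byte range from a set of failed I/O operations to
   one component object, as required above.  The covering_range helper
   and its list-of-tuples input format are assumptions of the sketch.

```python
# Hypothetical sketch: compute the (oer_comp_offset, oer_comp_length)
# pair that covers every failed I/O operation to one component object.
def covering_range(failed_ios):
    """failed_ios: list of (offset, length) pairs; returns a single
    (offset, length) range covering all of them."""
    start = min(off for off, _ in failed_ios)
    end = max(off + length for off, length in failed_ios)
    return start, end - start

# Example: failed writes at [4096, +8192) and [65536, +4096) are
# covered by a single range starting at offset 4096.
```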

7.3.  pnfs_osd_layoutreturn4

   ///struct pnfs_osd_layoutreturn4 {
   ///    pnfs_osd_ioerr4             olr_ioerr_report<>;
   ///};
   ///

   When OSD I/O operations fail, "olr_ioerr_report<>" is used to report
   these errors to the metadata server as an array of elements of type
   pnfs_osd_ioerr4.  Each element in the array represents an error that
   occurred on the object specified by oer_component.  If no errors are
   to be reported, the size of the olr_ioerr_report<> array is set to
   zero.

8.  Object-Based Creation Layout Hint

   The layouthint4 type is defined in the NFSv4.1 draft [9] as follows:

   struct layouthint4 {
       layouttype4           loh_type;
       opaque                loh_body<>;
   };

   The layouthint4 structure is used by the client to pass in a hint
   about the type of layout it would like created for a particular file.
   If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the
   loh_body opaque value is defined by the pnfs_osd_layouthint4 type.

8.1.  pnfs_osd_layouthint4

   ///union pnfs_osd_max_comps_hint4 switch (bool omx_valid) {
   ///    case TRUE:
   ///        uint32_t            omx_max_comps;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) {
   ///    case TRUE:
   ///        length4             osu_stripe_unit;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///union pnfs_osd_group_width_hint4 switch (bool ogw_valid) {
   ///    case TRUE:
   ///        uint32_t            ogw_group_width;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) {
   ///    case TRUE:
   ///        uint32_t            ogd_group_depth;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) {
   ///    case TRUE:
   ///        uint32_t            omc_mirror_cnt;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) {
   ///    case TRUE:
   ///        pnfs_osd_raid_algorithm4    ora_raid_algorithm;
   ///    case FALSE:
   ///        void;
   ///};
   ///
   ///struct pnfs_osd_layouthint4 {
   ///    pnfs_osd_max_comps_hint4        olh_max_comps_hint;
   ///    pnfs_osd_stripe_unit_hint4      olh_stripe_unit_hint;
   ///    pnfs_osd_group_width_hint4      olh_group_width_hint;
   ///    pnfs_osd_group_depth_hint4      olh_group_depth_hint;
   ///    pnfs_osd_mirror_cnt_hint4       olh_mirror_cnt_hint;
   ///    pnfs_osd_raid_algorithm_hint4   olh_raid_algorithm_hint;
   ///};
   ///

   This type conveys hints for the desired data map.  All parameters are
   optional so the client can give values for only the parameters it
   cares about, e.g., it can provide a hint for the desired number of
   mirrored components, regardless of the RAID algorithm selected for
   the file.  The server should make an attempt to honor the hints, but
   it can ignore any or all of them at its own discretion and without
   failing the respective CREATE operation.

   The "olh_max_comps_hint" can be used to limit the total number of
   component objects comprising the file.  All other hints correspond
   directly to the different fields of pnfs_osd_data_map4.
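
   The following non-normative Python sketch shows a client building a
   layout hint with only the parameters it cares about.  The dict
   encoding stands in for the XDR optional-union pattern (a None value
   corresponds to the FALSE arm of the respective union); the helper
   name is an assumption of the sketch.

```python
# Hypothetical sketch: each argument left as None encodes as the FALSE
# arm of its union, i.e., "no hint for this parameter".
def make_layouthint(max_comps=None, stripe_unit=None, group_width=None,
                    group_depth=None, mirror_cnt=None,
                    raid_algorithm=None):
    return {
        "olh_max_comps_hint": max_comps,
        "olh_stripe_unit_hint": stripe_unit,
        "olh_group_width_hint": group_width,
        "olh_group_depth_hint": group_depth,
        "olh_mirror_cnt_hint": mirror_cnt,
        "olh_raid_algorithm_hint": raid_algorithm,
    }

# A client hinting only at the desired mirror count, leaving all other
# data-map parameters to the server's discretion:
hint = make_layouthint(mirror_cnt=2)
```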

9.  Layout Segments

   The pnfs layout operations operate on logical byte ranges.  There is
   no requirement in the protocol for any relationship between byte
   ranges used in LAYOUTGET to acquire layouts and byte ranges used in
   CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN.  However, using OSD
   byte-range capabilities poses limitations on these operations since
   the capabilities associated with layout segments cannot be merged or
   split.  The following guidelines should be followed for proper
   operation of object-based layouts.

9.1.  CB_LAYOUTRECALL and LAYOUTRETURN

   In general, the object-based layout driver should keep track of each
   layout segment it gets, keeping a record of the segment's iomode,
   offset, and length.  The server should allow the client to get
   multiple overlapping layout segments but is free to recall the layout
   to prevent overlap.

   In response to CB_LAYOUTRECALL, the client should return all layout
   segments matching the given iomode and overlapping with the recalled
   range.  When returning the layouts for this byte range with
   LAYOUTRETURN the client MUST NOT return a sub-range of a layout
   segment it has; each LAYOUTRETURN sent MUST completely cover at least
   one outstanding layout segment.

   The server, in turn, should release any segment that exactly matches
   the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
   exact match is found then the server should release all layout
   segments matching the clientid and iomode and that are fully
   contained in the returned byte range.  If none are found and the byte
   range is a subset of an outstanding layout segment for the same
   clientid and iomode, then the client can be considered malfunctioning
   and the server SHOULD recall all layouts from this client to reset
   its state.  If this behavior repeats, the server SHOULD deny all
   LAYOUTGETs from this client.
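
   The matching rules above can be sketched as follows.  This Python
   fragment is a non-normative illustration; the segment tuple format
   and the helper name are assumptions of the sketch.

```python
# Hypothetical sketch of the server-side LAYOUTRETURN matching rules.
# A segment is represented as (clientid, iomode, offset, length).
def segments_to_release(outstanding, clientid, iomode, offset, length):
    """Return the outstanding segments released by a LAYOUTRETURN."""
    end = offset + length
    # First look for segments that exactly match the clientid, iomode,
    # and byte range.
    exact = [s for s in outstanding
             if s[0] == clientid and s[1] == iomode
             and s[2] == offset and s[3] == length]
    if exact:
        return exact
    # No exact match: release all segments for this clientid and iomode
    # that are fully contained in the returned byte range.
    return [s for s in outstanding
            if s[0] == clientid and s[1] == iomode
            and s[2] >= offset and s[2] + s[3] <= end]
```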

9.2.  LAYOUTCOMMIT

   LAYOUTCOMMIT is only used by object-based pNFS to convey modified
   attribute hints and/or to report I/O errors to the MDS.  Therefore,
   the offset and length in LAYOUTCOMMIT4args are reserved for future
   use and should be set to 0.

10.  Recalling Layouts

   The object-based metadata server should recall outstanding layouts in
   the following cases:

   o  When the file's security policy changes, i.e., ACLs or permission
      mode bits are set.

   o  When the file's aggregation map changes, rendering outstanding
      layouts invalid.

   o  When there are sharing conflicts.  For example, the server will
      issue stripe aligned layout segments for RAID-5 objects.  To
      prevent corruption of the file's parity, multiple clients must not
      hold valid write layouts for the same stripes.  An outstanding RW
      layout should be recalled when a conflicting LAYOUTGET is received
      from a different client for LAYOUTIOMODE4_RW and for a byte-range
      overlapping with the outstanding layout segment.

10.1.  CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts.  The NFSv4.1
   draft [9] defines the following types:

   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 11;

   struct  CB_RECALL_ANY4args      {
       uint32_t        craa_objects_to_keep;
       bitmap4         craa_type_mask;
   };

   Typically, CB_RECALL_ANY will be used to recall client state when the
   server needs to reclaim resources.  The craa_type_mask bitmap
   specifies the type of resources that are recalled and the
   craa_objects_to_keep value specifies how many of the recalled objects
   the client is allowed to keep.  The object-based layout type mask
   flags are defined as follows.  They represent the iomode of the
   recalled layouts.  In response, the client SHOULD return layouts of
   the recalled iomode that it needs the least, keeping at most
   craa_objects_to_keep object-based layouts.

///const PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN;
///const PNFS_OSD_RCA4_TYPE_MASK_RW   = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN+1;
///

   The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
   PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
   of iomode LAYOUTIOMODE4_RW.  When both mask flags are set, the client
   is notified to return layouts of either iomode.
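
   As a non-normative illustration, the following Python sketch
   interprets the object-based bits (bit numbers 8 and 9) of the
   craa_type_mask bitmap; the helper name and the string labels are
   assumptions of the sketch.

```python
# Bit numbers from the constants above; the mask flags occupy bits 8
# and 9 of the craa_type_mask bitmap.
RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8
PNFS_OSD_RCA4_TYPE_MASK_READ = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN       # bit 8
PNFS_OSD_RCA4_TYPE_MASK_RW = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN + 1     # bit 9

def iomodes_to_return(craa_type_mask):
    """Return the set of iomodes the client is being asked to return."""
    modes = set()
    if craa_type_mask & (1 << PNFS_OSD_RCA4_TYPE_MASK_READ):
        modes.add("LAYOUTIOMODE4_READ")
    if craa_type_mask & (1 << PNFS_OSD_RCA4_TYPE_MASK_RW):
        modes.add("LAYOUTIOMODE4_RW")
    return modes
```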

11.  Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired, or when clients fail to return recalled layouts in a timely
   manner, the server MAY revoke client layouts and/or device address
   mappings and reassign these resources to other clients.  To avoid
   data corruption, the metadata server MUST fence off the revoked
   clients from the respective objects as described in Section 12.4.

12.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into two
   parts, the control path and the data path (storage protocol).  The
   control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4 with respect
   to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4 provides
   defenses for in environments where these threats are considered
   significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ or
   RW) and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds
   the client receives, as part of the layout, a set of object
   capabilities allowing it I/O access to the specified objects
   corresponding to the requested iomode.  When the client acts on I/O
   operations on behalf of its local users it MUST authenticate and
   authorize the user by issuing respective OPEN and ACCESS calls to the
   metadata server, similarly to having NFSv4 data delegations.  If
   access is allowed the client uses the corresponding (READ or RW)
   capabilities to perform the I/O operations at the object-storage
   devices.  When the metadata server receives a request to change a
   file's permissions or ACL, it SHOULD recall all layouts for that file
   and it MUST change the capability version attribute on all objects
   comprising the file to implicitly invalidate any outstanding
   capabilities before committing to the new permissions and ACL.  Doing
   this will ensure that clients re-authorize their layouts according to
   the modified permissions and ACL by requesting new layouts.
   Recalling the layouts in this case is a courtesy of the server,
   intended to prevent clients from getting an error on I/Os done after
   the capability version changed.

   The object storage protocol MUST implement the security aspects
   described in version 1 of the T10 OSD protocol definition [2].  The
   standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and
   ALLDATA.  To provide a minimum level of security allowing
   verification and enforcement of the server's access control policy
   using the layout security credentials, the NOSEC security method MUST
   NOT be used for I/O operations.  It MAY only be used to get the
   System ID attribute
   when the metadata server provided only the OSD name with the device
   address.  The remainder of this section gives an overview of the
   security mechanism described in that standard.  The goal is to give
   the reader a basic understanding of the object security model.  Any
   discrepancies between this text and the actual standard are obviously
   to be resolved in favor of the OSD standard.

12.1.  OSD Security Data Types

   There are three main data types associated with object security: a
   capability, a credential, and security parameters.  The capability is
   a set of fields that specifies an object and what operations can be
   performed on it.  A credential is a signed capability.  Only a
   security manager that knows the secret device keys can correctly sign
   a capability to form a valid credential.  In pNFS, the file server
   acts as the security manager and returns signed capabilities (i.e.,
   credentials) to the pNFS client.  The security parameters are values
   computed by the issuer of OSD commands (i.e., the client) that prove
   they hold valid credentials.  The client uses the credential as a
   signing key to sign the requests it makes to OSD, and puts the
   resulting signatures into the security_parameters field of the OSD
   command.  The object storage device uses the secret keys it shares
   with the security manager to validate the signature values in the
   security parameters.

   The security types are opaque to the generic layers of the pNFS
   client.  The credential contents are defined as opaque within the
   pnfs_osd_object_cred4 type.  Instead of repeating the definitions
   here, the reader is referred to section 4.9.2.2 of the OSD standard.

12.2.  The OSD Security Protocol

   The object storage protocol relies on a cryptographically secure
   capability to control accesses at the object storage devices.
   Capabilities are generated by the metadata server, returned to the
   client, and used by the client as described below to authenticate
   their requests to the Object Storage Device (OSD).  Capabilities
   therefore achieve the required access and open mode checking.  They
   allow the file server to define and check a policy (e.g., open mode)
   and the OSD to enforce that policy without knowing the details (e.g.,
   user IDs and ACLs).

   Since capabilities are tied to layouts, and since they are used to
   enforce access control, when the file ACL or mode changes the
   outstanding capabilities MUST be revoked to enforce the new access
   permissions.  The server SHOULD recall layouts to allow clients to
   gracefully return their capabilities before the access permissions
   change.

   Each capability is specific to a particular object, an operation on
   that object, a byte range within the object (in OSDv2), and has an
   explicit expiration time.  The capabilities are signed with a secret
   key that is shared by the object storage devices (OSD) and the
   metadata managers.  Clients do not have device keys so they are
   unable to forge the signatures in the security parameters.  The
   combination of a capability, the OSD system id, and a signature is
   called a "credential" in the OSD specification.

   The details of the security and privacy model for Object Storage are
   defined in the T10 OSD standard.  The following sketch of the
   algorithm should help the reader understand the basic model.

   LAYOUTGET returns a CapKey and a Cap which, together with the OSD
   SystemID, are also called a credential.  It is a capability and a
   signature over that capability and the SystemID.  The OSD Standard
   refers to the CapKey as the "Credential integrity check value" and to
   the ReqMAC as the "Request integrity check value".

   CapKey = MAC<SecretKey>(Cap, SystemID)
   Credential = {Cap, SystemID, CapKey}

   The client uses CapKey to sign all the requests it issues for that
   object using the respective Cap. In other words, the Cap appears in
   the request to the storage device, and that request is signed with
   the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, ReqNonce)
   Request = {Cap, Req, ReqNonce, ReqMAC}

   The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}.  The
   OSD uses the SecretKey it shares with the metadata server to compare
   the ReqMAC the client sent with a locally computed value:

   LocalCapKey = MAC<SecretKey>(Cap, SystemID)
   LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce)

   and if they match the OSD assumes that the capabilities came from an
   authentic metadata server and allows access to the object, as allowed
   by the Cap.

12.3.  Protocol Privacy Requirements

   Note that if the server LAYOUTGET reply, holding CapKey and Cap, is
   snooped by another client, it can be used to generate valid OSD
   requests (within the Cap access restrictions).

   To provide the required privacy requirements for the capability key
   returned by LAYOUTGET, the GSS-API [4] framework can be used, e.g.,
   by using the RPCSEC_GSS privacy method to send the LAYOUTGET
   operation or by using the SSV key to encrypt the oc_capability_key
   using the
   GSS_Wrap() function.  Two general ways to provide privacy in the
   absence of GSS-API that are independent of NFSv4 are either an
   isolated network such as a VLAN or a secure channel provided by IPsec
   [11].

12.4.  Revoking Capabilities

   At any time, the metadata server may invalidate all outstanding
   capabilities on an object by changing its POLICY ACCESS TAG
   attribute.  The value of the POLICY ACCESS TAG is part of a
   capability, and it must match the state of the object attribute.  If
   they do not match, the OSD rejects accesses to the object with the
   sense key set to ILLEGAL REQUEST and an additional sense code set to
   INVALID FIELD IN CDB.  When a client attempts to use a capability and
   is rejected this way, it should issue a LAYOUTRETURN for the object
   and specify PNFS_OSD_ERR_BAD_CRED in the olr_ioerr_report parameter.
   The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or
   LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed
   set of capabilities.

   The metadata server may elect to change the access policy tag on an
   object at any time, for any reason (with the understanding that there
   is likely an associated performance penalty, especially if there are
   outstanding layouts for this object).  The metadata server MUST
   revoke outstanding capabilities when any one of the following occurs:

   o  The permissions on the object change,

   o  A conflicting mandatory byte-range lock is granted, or

   o  A layout is revoked and reassigned to another client.

   A pNFS client will typically hold one layout for each byte range for
   either READ or READ/WRITE.  The client's credentials are checked by
   the metadata server at LAYOUTGET time and it is the client's
   responsibility to enforce access control among multiple users
   accessing the same file.  It is neither required nor expected that
   the pNFS client will obtain a separate layout for each user accessing
   a shared object.  The client SHOULD use OPEN and ACCESS calls to
   check user permissions when performing I/O so that the server's
   access control policies are correctly enforced.  The result of the
   ACCESS operation may be cached while the client holds a valid layout
   as the server is expected to recall layouts when the file's access
   permissions or ACL change.

13.  IANA Considerations

   As described in the NFSv4.1 draft [9], new layout type numbers will
   be requested from IANA.  This document defines the protocol
   associated with the existing layout type number,
   LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA.

14.  References

14.1.  Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", RFC 2119, March 1997.

   [2]   Weber, R., "SCSI Object-Based Storage Device Commands",
         July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>.

   [3]   Eisler, M., "XDR: External Data Representation Standard",
         STD 67, RFC 4506, May 2006.

   [4]   Linn, J., "Generic Security Service Application Program
         Interface Version 2, Update 1", RFC 2743, January 2000.

   [5]   Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
         Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
         RFC 3720, April 2004, <http://www.ietf.org/rfc/rfc3720.txt>.

   [6]   Weber, R., "SCSI Primary Commands - 3 (SPC-3)", INCITS 408-
         2005, May 2005.

   [7]   Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network
         Address Authority (NAA) Naming Format for iSCSI Node Names",
         RFC 3980, February 2005,
         <http://www.ietf.org/rfc/rfc3980.txt>.

14.2.  Informative References

   [8]   Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version 1
         XDR Description", February 2008, <http://www.ietf.org/
         internet-drafts/draft-ietf-nfsv4-minorversion1-dot-x-04.txt>.

   [9]   Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor Version
         1", February 2008, <http://www.ietf.org/internet-drafts/
         draft-ietf-nfsv4-minorversion1-21.txt>.

   [10]  Weber, R., "SCSI Object-Based Storage Device Commands -2
         (OSD-2)", January 2008,
         <http://www.t10.org/ftp/t10/drafts/osd2/osd2r03.pdf>.

   [11]  Kent, S. and K. Seo, "Security Architecture for the Internet
         Protocol", RFC 4301, December 2005.

   [12]  IEEE, "Guidelines for 64-bit Global Identifier (EUI-64)
         Registration Authority",
         <http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.

   [13]  T10/ANSI INCITS 365-2002, "SCSI RDMA Protocol (SRP)",
         INCITS 365-2002,
         <http://ftp.t10.org/ftp/t10/drafts/srp/srp-r16a.pdf>.

   [14]  T11 1619-D/ANSI INCITS 424-2007, "Fibre Channel Framing and
         Signaling - 2 (FC-FS-2)", INCITS 424-2007, August 2006,
         <http://www.t11.org/t11/stat.nsf/upnum/1619-d>.

   [15]  T10 1601-D/ANSI INCITS 417-2006, "Serial Attached SCSI - 1.1
         (SAS-1.1)", INCITS 417-2006, September 2005,
         <http://www.t10.org/ftp/t10/drafts/sas1/sas1r10.pdf>.

   [16]  Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J.
         Souza, "Internet Storage Name Service (iSNS)", RFC 4171,
         September 2005, <http://www.ietf.org/rfc/rfc4171.txt>.

   [17]  MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting
         Codes, Part I", 1977.

Appendix A.  Acknowledgments

   Todd Pisek was a co-editor of the initial drafts for this document.
   Daniel E. Messinger and Pete Wyckoff reviewed and commented on this
   document.

Authors' Addresses

   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: bhalevy@panasas.com
   URI:   http://www.panasas.com/

   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA  94555
   USA

   Phone: +1-650-608-7770
   Email: welch@panasas.com
   URI:   http://www.panasas.com/

   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: jimz@panasas.com
   URI:   http://www.panasas.com/

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).