NFSv4                                                          B. Halevy
Internet-Draft                                                  B. Welch
Expires: December 27, 2006                                    J. Zelenka
                                                                 Panasas
                                                                T. Pisek
                                                                     Sun
                                                           June 25, 2006

                      Object-based pNFS Operations
                    draft-ietf-nfsv4-pnfs-obj-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 27, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This Internet-Draft provides a description of the object-based pNFS
   extension for NFSv4.  This is a companion to the main pNFS
   specification in the NFSv4 Minor Version 1 Internet Draft, which is
   currently draft-ietf-nfsv4-minorversion1-03.txt.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction
   2.  Object Storage Device Addressing and Discovery
     2.1   pnfs_osd_addr_type4
     2.2   pnfs_osd_deviceaddr4
   3.  Object-Based Layout
     3.1   pnfs_osd_layout4
       3.1.1   pnfs_osd_objid4
       3.1.2   pnfs_osd_version4
       3.1.3   pnfs_osd_object_cred4
       3.1.4   pnfs_osd_raid_algorithm4
       3.1.5   pnfs_osd_data_map4
     3.2   Data Mapping Schemes
       3.2.1   Simple Striping
       3.2.2   Nested Striping
       3.2.3   Mirroring
     3.3   RAID Algorithms
       3.3.1   PNFS_OSD_RAID_0
       3.3.2   PNFS_OSD_RAID_4
       3.3.3   PNFS_OSD_RAID_5
       3.3.4   PNFS_OSD_RAID_PQ
       3.3.5   RAID Usage and implementation notes
   4.  Object-Based Layout Update
     4.1   pnfs_osd_layoutupdate4
       4.1.1   pnfs_osd_deltaspaceused4
       4.1.2   pnfs_osd_errno4
       4.1.3   pnfs_osd_ioerr4
   5.  Object-Based Creation Layout Hint
     5.1   pnfs_osd_layouthint4
   6.  Layout Segments
     6.1   CB_LAYOUTRECALL and LAYOUTRETURN
     6.2   LAYOUTCOMMIT
   7.  Recalling Layouts
   8.  Security Considerations
     8.1   OSD Security Data Types
     8.2   The OSD Security Protocol
     8.3   Revoking capabilities
   9.  References
     9.1   Normative References
     9.2   Informative References
       Authors' Addresses
       Intellectual Property and Copyright Statements

1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts for
   different storage systems and methods of arranging data on storage
   devices.  This document describes the layouts used with object-based
   storage devices (OSD) that are accessed according to the iSCSI/OSD
   storage protocol standard (SNIA T10/1355-D [2]).

   An "object" is a container for data and attributes, and files are
   stored in one or more objects.  The OSD protocol specifies several
   operations on objects, including READ, WRITE, FLUSH, GETATTR,
   SETATTR, CREATE and DELETE.  However, in this proposal the client
   only uses the READ, WRITE, GETATTR and FLUSH commands.  The other
   commands are only used by the pNFS server.

   An object-based layout for pNFS includes object identifiers,
   capabilities that allow clients to READ or WRITE those objects, and
   various parameters that control how file data is striped across its
   component objects.  The OSD protocol has a capability-based security
   scheme that allows the pNFS server to control what operations and
   what objects are used by clients.  This scheme is described in more
   detail in the "Security Considerations" section (Section 8).

2.  Object Storage Device Addressing and Discovery

   Data operations to an OSD require the client to know the "address" of
   each OSD's root object.  The root object is synonymous with SCSI
   logical unit.  The client specifies SCSI logical units to its SCSI
   stack using a representation local to the client.  Because these
   representations are local, GETDEVICEINFO must return information that
   can be used by the client to select the correct local representation.

   In the block world, a set offset (logical block number or track/
   sector) contains a disk label.  This label identifies the disk
   uniquely.  In contrast, an OSD has a standard set of attributes on
   its root object.  For device identification purposes, the OSD name
   (root information attribute number 9) will be used as the label.
   This appears in the pnfs_osd_deviceaddr4 type below under the
   "root_id" field.

   In some situations, SCSI target discovery may need to be driven based
   on information contained in the GETDEVICEINFO response.  One example
   of this is iSCSI targets that are not known to the client until a
   layout has been requested.  Eventually iSCSI will adopt ANSI T10
   SAM-3, at which time the World Wide Name (WWN aka, EUI-64/EUI-128)
   naming conventions can be specified.  In addition, Fibre Channel (FC)
   SCSI targets have a unique WWN.  Although these FC targets have
   already been discovered, some implementations may want to specify the
   WWN in addition to the label.  This information appears as the
   "target" and "lun" fields in the pnfs_obj_deviceaddr4 pnfs_osd_deviceaddr4 type described
   below.

2.1  pnfs_osd_addr_type4

   The following enum specifies the manner in which a SCSI target can
   be specified.  The target can be specified as a network address, as
   an Internet Qualified Name (IQN), or by the World-Wide Name (WWN) of
   the target.

   enum pnfs_osd_addr_type4 {
       OBJ_TARGET_NETADDR  = 1,
       OBJ_TARGET_IQN      = 2,
       OBJ_TARGET_WWN      = 3
   };

   A device can be specified by the tuple <target, logical unit number
   (LUN), OSD Name>, or in the default case, just by the OSD Name.

   To summarize, the device addressing is fundamentally done by
   specifying the OSD name (i.e., root_id).  In order to help the client
   resource discovery process, physical address hints can also be
   provided.

2.2  pnfs_osd_deviceaddr4

   The specification for an object device address is as follows:

   struct pnfs_osd_deviceaddr4 {
       union target switch (pnfs_osd_addr_type4 type) {
           case OBJ_TARGET_NETADDR:
               pnfs_netaddr4   netaddr;

           case OBJ_TARGET_IQN:
               string          iqn<>;

           case OBJ_TARGET_WWN:
               string          wwn<>;

           default:
               void;
       };
       uint64_t            lun;
       opaque              root_id<>;
   };
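
   As an illustration (all of the following values are made up), an
   iSCSI OSD might be described with a target of type OBJ_TARGET_IQN
   carrying an iqn such as "iqn.2006-06.com.example:storage.osd1", a
   lun of 0, and the OSD name attribute of its root object carried in
   root_id.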

3.  Object-Based Layout

   The pnfs_layout4 type is defined in the NFSv4.1 draft [5] as follows:

   enum pnfs_layouttype4 {
       LAYOUT_NFSV4_FILES  = 1,
       LAYOUT_OSD2_OBJECTS = 2,
       LAYOUT_BLOCK_VOLUME = 3
   };

   struct pnfs_layout4 {
       offset4                 offset;
       length4                 length;
       pnfs_layoutiomode4      iomode;
       pnfs_layouttype4        type;
       opaque                  layout<>;
   };

   This draft defines the structure associated with the
   pnfs_layouttype4 value LAYOUT_OSD2_OBJECTS.  The NFSv4.1 draft
   specifies the structure as an XDR type "opaque".  The opaque layout
   is uninterpreted by the generic pNFS client layers, but obviously
   must be interpreted by the object-storage layout driver.  This
   document defines the structure of this opaque value,
   pnfs_osd_layout4.

3.1  pnfs_osd_layout4

   struct pnfs_osd_layout4 {
       pnfs_osd_object_cred4   components<>;
       pnfs_osd_data_map4      map;
   };

   The pnfs_osd_layout4 structure specifies a layout over a set of
   component objects.  The components field is an array of object
   identifiers and security credentials that grant access to each
   object.  The organization of the data is defined by the
   pnfs_osd_data_map4 type that specifies how the file's data is mapped
   onto the component objects (i.e., the striping pattern).  The data
   placement algorithm that maps file data onto component objects assumes
   that each component object occurs exactly once in the array of
   components.  Therefore, component objects MUST appear in the
   component array only once.

   Note that the layout depends on the file size, which the client
   learns from the generic return parameters of LAYOUTGET, by doing
   GETATTR commands to the metadata server, and by getting
   CB_SIZE_CHANGED callbacks from the metadata server.  The client uses
   the file size to decide if it should fill holes with zeros, or return
   a short read.  Striping patterns can cause cases where component
   objects are shorter than other components because a hole happens to
   correspond to the last part of the component object.

3.1.1  pnfs_osd_objid4

   An object is identified by a number, somewhat like an inode number.
   The object storage model has a two-level scheme, where the objects
   within an object storage device are grouped into partitions.

   struct pnfs_osd_objid4 {
       pnfs_deviceid4  device_id;
       uint64_t        partition_id;
       uint64_t        object_id;
   };

   The pnfs_osd_objid4 type is used to identify an object within a
   partition on a specified object storage device.  The "device_id"
   selects the object storage device from the set of available storage
   devices.  The device is identified with the pnfs_deviceid4 type,
   which is an index into addressing information about that device
   returned by the GETDEVICEINFO pNFS operation.  Within an OSD, a
   partition is identified with a 64-bit number, "partition_id".
   Within a partition, an object is identified with a 64-bit number,
   "object_id".  Creation and management of partitions is outside the
   scope of this standard, and is a facility provided by the object
   storage file system.

3.1.2  pnfs_osd_version4

   enum pnfs_osd_version4 {
       PNFS_OSD_MISSING    = 0,
       PNFS_OSD_VERSION_1  = 1,
       PNFS_OSD_VERSION_2  = 2
   };

   osd_version is used to indicate the OSD protocol version or whether
   an object is missing (i.e., unavailable).  Some layout schemes
   encode redundant information and can compensate for missing
   components, but the data placement algorithm needs to know what
   parts are missing.

   At this time the OSD standard is at version 1.0, and we anticipate a
   version 2.0 of the standard.  The second generation OSD protocol has
   additional proposed features to support more robust error recovery,
   snapshots, and byte-range capabilities.  Therefore, the OSD version
   is explicitly called out in the information returned in the layout.
   (This information can also be deduced by looking inside the
   capability type at the format field, which is the first byte.  The
   format value is 0x1 for an OSD v1 capability.  However, it seems most
   robust to call out the version explicitly.)

3.1.3  pnfs_osd_object_cred4

   struct pnfs_osd_object_cred4 {
       pnfs_osd_objid4     object_id;
       pnfs_osd_version4   osd_version;
       opaque              credential<>;
   };

   The pnfs_osd_object_cred4 structure is used to identify each
   component comprising the file.  The object_id identifies the
   component object, the osd_version represents the OSD protocol
   version or whether that component is unavailable, and the credential
   provides the OSD security credentials needed to access that object
   (see Section 8.1 for more details).

3.1.4  pnfs_osd_raid_algorithm4

   enum pnfs_osd_raid_algorithm4 {
       PNFS_OSD_RAID_0     = 1,
       PNFS_OSD_RAID_4     = 2,
       PNFS_OSD_RAID_5     = 3,
       PNFS_OSD_RAID_PQ    = 4     /* Reed-Solomon P+Q */
   };

   pnfs_osd_raid_algorithm4 represents the data redundancy algorithm
   used to protect the file's contents.  See Section 3.3 for more
   details.

3.1.5  pnfs_osd_data_map4

   struct pnfs_osd_data_map4 {
       length4                     stripe_unit;
       uint16_t                    group_width;
       uint16_t                    group_depth;
       uint16_t                    mirror_cnt;
       pnfs_osd_raid_algorithm4    raid_algorithm;
   };

   The pnfs_osd_data_map4 structure parameterizes the algorithm that
   maps a file's contents over the component objects.  Instead of
   limiting the system to a simple striping scheme, where loss of a
   single component object results in data loss, the map parameters
   support mirroring and more complicated schemes that protect against
   the loss of a component object.

   The stripe_unit is the number of bytes placed on one component
   before advancing to the next one in the list of components.  The
   number of bytes in a full stripe is stripe_unit times the number of
   components.

   In some RAID schemes, a stripe includes redundant information (i.e.,
   parity) that lets the system recover from loss or damage to a
   component object.

   The group_width and group_depth parameters allow a nested striping
   pattern.  If there is no nesting, then group_width and group_depth
   MUST be zero.  Otherwise, the group_width defines the width of a
   data stripe, and the group_depth defines how many stripes are
   written before advancing to the next group of components in the list
   of component objects for the file.  The size of the components array
   MUST be a multiple of group_width.

   The mirror_cnt is used to replicate a file by replicating its
   component objects.  If there is no mirroring, then mirror_cnt MUST
   be 0.  If mirror_cnt is greater than zero, then the size of the
   component array MUST be a multiple of (mirror_cnt+1).

   See Section 3.2 for more details.

3.2  Data Mapping Schemes

   This section describes the different data mapping schemes in detail.

3.2.1  Simple Striping

   The object layout always uses a "dense" layout as described in the
   pNFS document.  This means that the second stripe unit of the file
   starts at offset 0 of the second component, rather than at offset
   stripe_unit bytes.  After a full stripe has been written, the next
   stripe unit is appended to the first component object in the list
   without any holes in the component objects.  The mapping from the
   logical offset within a file (L) to the component object C and the
   object-specific offset O is defined by the following equations:

   L = logical offset into the file
   W = total number of components
   S = W * stripe_unit
   N = L / S
   C = (L-(N*S)) / stripe_unit
   O = (N*stripe_unit)+(L%stripe_unit)

   In these equations, S is the list of components.  The number of bytes in a full stripe, and N
   is the stripe number.  C is stripe_unit times an index into the number array of components.
   In some raid schemes, components, so
   it selects a stripe includes redundant information (i.e.,
   parity) that lets the system recover particular object storage device.  Both N and C count
   from loss or damage zero.  O is the offset within the object that corresponds to a
   component object.

   The the
   file offset.  Note that this computation does not accommodate the
   same object layout always uses a "dense" layout as described appearing in the
   pNFS document.  This means that component array multiple times.

   For example, consider an object striped over four devices, <D0 D1 D2
   D3>.  The stripe_unit is 4096 bytes.  The stripe width S is thus 4 *
   4096 = 16384.

   Offset 0:
     N = 0 / 16384 = 0
     C = (0-(0*16384)) / 4096 = 0 (D0)
     O = (0*4096)+(0%4096) = 0

   Offset 4096:
     N = 4096 / 16384 = 0
     C = (4096-(0*16384)) / 4096 = 1 (D1)
     O = (0*4096)+(4096%4096) = 0

   Offset 9000:
     N = 9000 / 16384 = 0
     C = (9000-(0*16384)) / 4096 = 2 (D2)
     O = (0*4096)+(9000%4096) = 808

   Offset 132000:
     N = 132000 / 16384 = 8
     C = (132000-(8*16384)) / 4096 = 0
     O = (8*4096) + (132000%4096) = 33696
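
   The same computation can be written as a small C helper; this is
   illustrative only (the function and type names are not part of the
   protocol):

   /* Illustrative only: simple striping map for PNFS_OSD_RAID_0 with
    * no nesting (group_width == group_depth == 0) and no mirroring. */
   #include <stdint.h>

   struct smap { uint64_t n, c, o; };

   static struct smap
   map_simple(uint64_t l, uint64_t stripe_unit, uint64_t num_comps)
   {
       struct smap r;
       uint64_t s = num_comps * stripe_unit;  /* bytes per full stripe */

       r.n = l / s;                           /* stripe number         */
       r.c = (l - (r.n * s)) / stripe_unit;   /* component index       */
       r.o = (r.n * stripe_unit) + (l % stripe_unit);
       return r;
   }

   With stripe_unit = 4096 and num_comps = 4, map_simple(132000, 4096,
   4) returns N = 8, C = 0, and O = 33696, matching the last example
   above.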

3.2.2  Nested Striping

   The group_width and group_depth parameters allow a nested striping
   pattern.  If there is no nesting, then group_width and group_depth
   MUST be zero.  Otherwise, the group_width defines the width of a
   data stripe, and the group_depth defines how many stripes are
   written before advancing to the next group of components in the list
   of component objects for the file.  The size of the components array
   MUST be a multiple of group_width.  The math used to map from a file
   offset to a component object and an offset within that object is
   shown below.  The computations map from the logical offset L to the
   component index C and offset O within that component object.

   L = logical offset into the file
   W = total number of components
   S = stripe_unit * group_depth * W
   T = stripe_unit * group_depth * group_width
   U = stripe_unit * group_width
   M = L / S
   G = (L - (M * S)) / T
   H = (L - (M * S)) % T
   N = H / U
   C = (H - (N * U)) / stripe_unit + G * group_width
   O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit

   In these equations, S is the number of bytes striped across all
   component objects before the pattern repeats.  T is the number of
   bytes striped within a group of component objects before advancing
   to the next group.  U is the number of bytes in a stripe within a
   group.  M is the "major" (i.e., across all components) stripe
   number, and N is the "minor" (i.e., across the group) stripe number.
   G counts the groups from the beginning of the major stripe, and H is
   the byte offset within the group.

   For example, consider an object striped over 100 devices with a
   group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB.
   In this scheme, 500 MB are written to the first 10 components, and
   5000 MB are written before the pattern wraps back around to the
   first component in the array.

   Offset 0:
     W = 100
     S = 1 MB * 50 * 100 = 5000 MB
     T = 1 MB * 50 * 10 = 500 MB
     U = 1 MB * 10 = 10 MB
     M = 0 / 5000 MB = 0
     G = (0 - (0 * 5000 MB)) / 500 MB = 0
     H = (0 - (0 * 5000 MB)) % 500 MB = 0
     N = 0 / 10 MB = 0
     C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0
     O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0

   Offset 27 MB:
     M = 27 MB / 5000 MB = 0
     G = (27 MB - (0 * 5000 MB)) / 500 MB = 0
     H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB
     N = 27 MB / 10 MB = 2
     C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7
     O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB

   Offset 7232 MB:
     M = 7232 MB / 5000 MB = 1
     G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
     H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
     N = 232 MB / 10 MB = 23
     C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
     O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB
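
   In C, the nested mapping might look like this (again illustrative
   only, assuming group_width and group_depth are non-zero and there is
   no mirroring):

   /* Illustrative only: nested striping map. */
   #include <stdint.h>

   struct nmap { uint64_t c, o; };

   static struct nmap
   map_nested(uint64_t l, uint64_t su, uint64_t w,
              uint64_t gw, uint64_t gd)
   {
       struct nmap r;
       uint64_t s = su * gd * w;     /* bytes before pattern repeats  */
       uint64_t t = su * gd * gw;    /* bytes striped within a group  */
       uint64_t u = su * gw;         /* bytes per stripe in a group   */
       uint64_t m = l / s;           /* major stripe number           */
       uint64_t g = (l - m * s) / t; /* group within the major stripe */
       uint64_t h = (l - m * s) % t; /* byte offset within the group  */
       uint64_t n = h / u;           /* minor stripe number           */

       r.c = (h - n * u) / su + g * gw;
       r.o = l % su + n * su + m * gd * su;
       return r;
   }

   With a 1 MB stripe_unit, 100 components, group_width = 10, and
   group_depth = 50, map_nested(7232 MB, ...) returns C = 42 and
   O = 73 MB, as in the last worked example above.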

3.2.3  Mirroring

   The mirror_cnt is the number of bytes striped across all used to replicate a file by replicating its
   component objects before the pattern repeats.  T objects.  If there is no mirroring, then mirror_cnt MUST be
   0.  If mirror_cnt is greater than zero, then the number size of
   bytes striped within the
   component array MUST be a group multiple of component objects before advancing to
   the next group.  U (mirror_cnt+1).  Thus, for a
   classic mirror on two objects, mirror_cnt is one.  If group_width is
   also non-zero, then the number size MUST be a multiple of bytes group_width *
   (mirror_cnt+1).  Replicas are adjacent in a stripe within a group.
   M is the "major" (i.e., across all components) stripe number, components array, and N
   is the "minor" (i.e., across
   the group) stripe number.  G counts value C produced by the
   groups from above equations is not a direct index
   into the beginning of components array.  Instead, the major stripe, and H is following equations
   determine the byte
   offset within replica component index RCi, where i ranges from 0 to
   mirror_cnt.

   C = component index for striping or two-level striping
   i ranges from 0 to mirror_cnt, inclusive
   RCi = C * (mirror_cnt+1) + i
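
   For example, with a mirror_cnt of 1 (a two-way mirror) and a
   computed stripe index C of 3, the two replicas of that stripe unit
   live at components RC0 = 3 * 2 + 0 = 6 and RC1 = 3 * 2 + 1 = 7 in
   the components array.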

3.3  RAID Algorithms

   pnfs_osd_raid_algorithm4 determines the algorithm and placement of
   redundant data.  This section defines the different RAID algorithms.

3.3.1  PNFS_OSD_RAID_0

   PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
   component objects are data bytes located by the above equations for
   C and O.  If a component object is unavailable, the pNFS client can
   choose to return NULLs for the missing data, or it can retry the
   READ against the pNFS server, or it can return an EIO error.

3.3.2  PNFS_OSD_RAID_4

   PNFS_OSD_RAID_4 means that the last component object, or the last in
   each group if group_width is greater than zero, contains parity
   information computed over the rest of the stripe with an XOR
   operation.  If a component object is unavailable, the client can
   read the rest of the stripe units in the damaged stripe and
   recompute the missing stripe unit by XORing the other stripe units
   in the stripe.  Or the client can replay the READ against the pNFS
   server, which will presumably perform the reconstructed read on the
   client's behalf.

   When parity is present in the file, then there is an additional
   computation to map from the file offset L to the offset that
   accounts for embedded parity, L'.  First compute L', and then use L'
   in the above equations for C and O.

   L = file offset, not accounting for parity
   P = number of parity devices in each stripe
   W = group_width, if not zero, else size of component array
   N = L / ((W-P) * stripe_unit)
   L' = N * (W * stripe_unit) +
        (L % ((W-P) * stripe_unit))
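
   For example, applying these equations with four components, one
   parity component (P = 1), and a stripe_unit of 4096, a file offset
   L of 20000 gives N = 20000 / (3 * 4096) = 1 and L' = 1 * (4 * 4096)
   + (20000 % 12288) = 24096.  Feeding L' into the simple striping
   equations yields C = 1 and O = 7712, so data never lands on the
   last component, which holds the parity for each stripe.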

3.3.3  PNFS_OSD_RAID_5

   PNFS_OSD_RAID_5 means that the position of the parity data is
   rotated on each stripe.  In the first stripe, the last component
   holds the parity.  In the second stripe, the next-to-last component
   holds the parity, and so on.  In this scheme, all stripe units are
   rotated so that I/O is evenly spread across objects as the file is
   read sequentially.  The rotated parity layout is illustrated here,
   with numbers indicating the stripe unit.

   0 1 2 P
   4 5 P 3
   8 P 6 7
   P 9 a b

   To compute the components array, and component object C, first compute the value offset that
   accounts for parity L' and use that to compute C. Then rotate C produced to
   get C'.  Finally, increase C' by one if the above parity information comes
   at or before C' within that stripe.  The following equations
   illustrate this by computing I, which is not a direct index
   into the components array.  Instead, the following equations deterine index of the replica component index RCi, where i ranges from 0 to mirror_cnt.

   C
   that contains parity for a given stripe.

   L = file offset, not accounting for parity
   W = group_width, if not zero, else size of component index for striping or two-level striping
   i ranges from 0 to mirror_cnt, inclusive
   RCi array
   N = C L / (W-1 * (mirror_cnt+1) + i

                                 Figure 12 stripe_unit)
   (Compute L' as describe above)
   (Compute C based on L' as described above)
   C' = (C - (N%W)) % W
   I = W - (N%W) - 1
   if (C' <= I) {
     C'++
   }

3.3.4  PNFS_OSD_RAID_PQ

   PNFS_OSD_RAID_PQ is a double-parity scheme that uses the
   Reed-Solomon P+Q encoding scheme.  In this layout, the last two
   component objects hold the P and Q data, respectively.  P is parity
   computed with XOR, and Q is a more complex equation that is not
   described here.  The equations given above for embedded parity can
   be used to map a file offset to the correct component object by
   setting the number of parity components to 2 instead of 1 for RAID4
   or RAID5.  Clients may simply choose to read data through the
   metadata server if two components are missing or damaged.

   Issue: This scheme also has a RAID_4-like layout where the ECC
   blocks are stored on the same components on every stripe, and a
   rotated, RAID_5-like layout where the stripe units are rotated.
   Should we make the following properties orthogonal: RAID_4 or
   RAID_5 (i.e., non-rotated or rotated), and then have the number of
   parity components and the associated algorithm be the orthogonal
   parameter?

3.3.5  RAID Usage and implementation notes

   RAID layouts with redundant data in their stripes require additional
   serialization of updates to ensure correct operation.  Otherwise, if
   two clients simultaneously write to the same logical range of an
   object, the result could include different data in the same ranges
   of mirrored tuples, or corrupt parity information.  It is the
   responsibility of the metadata server to enforce serialization
   requirements such as this.  For example, the metadata server may do
   so by not granting overlapping write layouts within mirrored
   objects.

4.  Object-Based Layout Update

   pnfs_layoutupdate4 is used in the LAYOUTCOMMIT operation to convey
   updates to the layout and additional information to the metadata
   server.  It is defined in the NFSv4.1 draft [5] as follows:

   struct pnfs_layoutupdate4 {
       pnfs_layouttype4    type;
       opaque              layoutupdate_data<>;
   };

   The pnfs_layoutupdate4 type is an opaque value at the generic pNFS
   client level.  If the type is LAYOUT_OSD2_OBJECTS, then the opaque
   value is described by the pnfs_osd_layoutupdate4 type.

4.1  pnfs_osd_layoutupdate4

   struct pnfs_osd_layoutupdate4 {
       pnfs_osd_deltaspaceused4    delta_space_used;
       pnfs_osd_ioerr4             ioerr<>;
   };

   Object-Based pNFS clients are not allowed to modify the next-to-last component holds layout.
   "delta_space_used" is used to convey capacity usage information back
   to the
   parity, and so on.  In this scheme, all stripe units are rotated so
   that metadata server and, in case OSD I/O operations failed,
   "ioerr" is evenly spread across objects as used to report these errors to the file is read
   sequentially.  The rotated parity layout metadata server.

4.1.1  pnfs_osd_deltaspaceused4

   union pnfs_osd_deltaspaceused4 switch (bool valid) {
       case TRUE:
           length4     delta;  /* Bytes consumed by write activity */
       case FALSE:
           void;
   };

   pnfs_osd_deltaspaceused4 is used to convey space utilization
   information at the time of LAYOUTCOMMIT.  For the file system to
   properly maintain capacity used information, it needs to track how
   much capacity was consumed by WRITE operations performed by the
   client.  In this protocol, the OSD returns the capacity consumed by
   a write, which can be different because of internal overhead like
   block-based allocation and indirect blocks, and the client reflects
   this back to the pNFS server so it can accurately track quota.  The
   pNFS server can choose to trust this information coming from the
   clients and therefore avoid querying the OSDs at the time of
   LAYOUTCOMMIT.  If the client is unable to obtain this information
   from the OSD, it simply returns an invalid deltaspaceused4.
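
   For example (with illustrative numbers), a client whose WRITEs
   logically added 1 MB to a file on a block-based OSD might learn from
   the OSD that slightly more capacity was actually consumed because of
   indirect blocks, and would return valid = TRUE with delta = 1056768;
   a client that cannot obtain this information returns valid = FALSE.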

4.1.2  pnfs_osd_errno4

   enum pnfs_osd_errno4 {
       PNFS_OSD_NOT_FOUND      = 1,
       PNFS_OSD_NO_SPACE       = 2,
       PNFS_OSD_EIO            = 3,
       PNFS_OSD_BAD_CRED       = 4,
       PNFS_OSD_NO_ACCESS      = 5,
       PNFS_OSD_UNREACHABLE    = 6
   };

   pnfs_osd_errno4 is used to represent error types when read/write
   errors are reported to the metadata server.

   o  PNFS_OSD_NOT_FOUND indicates that the object ID specifies an
      object that does not exist on the Object Storage Device.

   o  PNFS_OSD_NO_SPACE indicates that the operation failed because the
      Object Storage Device ran out of free capacity during the
      operation.

   o  PNFS_OSD_EIO indicates that the operation failed because the
      Object Storage Device experienced a failure trying to access the
      object.  The most common source of these errors is media errors,
      but other internal errors might cause this as well.  In this
      case, the metadata server should go examine the broken object
      more closely.

   o  PNFS_OSD_BAD_CRED indicates that the security parameters are not
      valid.  The primary cause of this is that the capability has
      expired, or the security policy tag (i.e., capability version
      number) has been changed to revoke capabilities.  The client will
      need to return the layout and get a new one with fresh
      capabilities.

   o  PNFS_OSD_NO_ACCESS indicates that the capability does not allow
      the requested operation.  This should not occur in normal
      operation because the metadata server should give out correct
      capabilities, or none at all.

   o  PNFS_OSD_UNREACHABLE indicates that the client was unable to
      contact the Object Storage Device due to a communication failure.

4.1.3  pnfs_osd_ioerr4

   struct pnfs_osd_ioerr4 {
       pnfs_osd_objid4     component;
       length4             offset;
       length4             length;
       bool                iswrite;
       pnfs_osd_errno4     errno;
   };

   The pnfs_osd_ioerr4 structure is used to return error indications
   for objects that generated errors during data transfers.  These are
   hints to the metadata server that there are problems with that
   object.  For each error, "component", "offset", and "length"
   represent the object and byte range in which the error occurred.
   "iswrite" is set to "true" if the failed OSD operation was data
   modifying, and "errno" represents the type of error.
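
   As an illustration in C-like notation (the initializer style and the
   "failed_objid" variable are hypothetical), a client that hit a media
   error while reading 4096 bytes at offset 8192 of a component might
   report:

   struct pnfs_osd_ioerr4 ioerr = {
       .component = failed_objid,  /* object that returned the error */
       .offset    = 8192,          /* start of the failed byte range */
       .length    = 4096,          /* length of the failed range     */
       .iswrite   = FALSE,         /* a READ; not data modifying     */
       .errno     = PNFS_OSD_EIO   /* media/internal access failure  */
   };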

5.  Object-Based Creation Layout Hint

   The pnfs_layouthint4 type is defined in the NFSv4.1 draft [5] as
   follows:

   struct pnfs_layouthint4 {
       pnfs_layouttype4    type;
       opaque              layouthint_data<>;
   };

   The pnfs_layouthint4 type is an opaque value at the generic pNFS
   client level.  If the layout type is LAYOUT_OSD2_OBJECTS, then the
   opaque value is described by the pnfs_osd_layouthint4 type.

5.1  pnfs_osd_layouthint4

   union num_comps_hint4 switch (bool valid) {
       case TRUE:
           uint32_t            num_comps;
       case FALSE:
           void;
   };

   union stripe_unit_hint4 switch (bool valid) {
       case TRUE:
           length4             stripe_unit;
       case FALSE:
           void;
   };

   union group_width_hint4 switch (bool valid) {
       case TRUE:
           uint16_t            group_width;
       case FALSE:
           void;
   };

   union group_depth_hint4 switch (bool valid) {
       case TRUE:
           uint16_t            group_depth;
       case FALSE:
           void;
   };

   union mirror_cnt_hint4 switch (bool valid) {
       case TRUE:
           uint16_t            mirror_cnt;
       case FALSE:
           void;
   };

   union raid_algorithm_hint4 switch (bool valid) {
       case TRUE:
           pnfs_osd_raid_algorithm4    raid_algorithm;
       case FALSE:
           void;
   };

   struct pnfs_osd_layouthint4 {
       num_comps_hint4         num_comps_hint;
       stripe_unit_hint4       stripe_unit_hint;
       group_width_hint4       group_width_hint;
       group_depth_hint4       group_depth_hint;
       mirror_cnt_hint4        mirror_cnt_hint;
       raid_algorithm_hint4    raid_algorithm_hint;
   };

   This type conveys hints for the desired data map.  All parameters
   are optional, so the client can give values for only the parameters
   it cares about; e.g., it can provide a hint for the desired number
   of mirrored components, regardless of the RAID algorithm selected
   for the file.  The server should make an attempt to honor the hints,
   but it can ignore any or all of them at its own discretion and
   without failing the respective create operation.

   The num_comps hint can be used to limit the total number of
   component objects comprising the file.  All other hints correspond
   directly to the different fields of pnfs_osd_data_map4.
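
   For instance, a client that only cares about mirroring could fill in
   the hint as follows (a C-like rendering of the XDR unions; the
   values are illustrative):

   struct pnfs_osd_layouthint4 hint = {
       .num_comps_hint      = { .valid = FALSE },
       .stripe_unit_hint    = { .valid = FALSE },
       .group_width_hint    = { .valid = FALSE },
       .group_depth_hint    = { .valid = FALSE },
       .mirror_cnt_hint     = { .valid = TRUE, .mirror_cnt = 1 },
       .raid_algorithm_hint = { .valid = FALSE }
   };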

6.  Layout Segments

   The pNFS layout operations operate on logical byte ranges.  There is
   no requirement in the protocol for any relationship between byte
   ranges used in LAYOUTGET to acquire layouts and byte ranges used in
   CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN.  However, using OSD
   capabilities poses limitations on these operations since the
   capabilities associated with layout segments cannot be merged or
   split.  The following guidelines should be followed for proper
   operation of object-based layouts.
   block-based allocation object-based layouts.

6.1  CB_LAYOUTRECALL and indirect blocks, LAYOUTRETURN

   In general, the object-based layout driver should keep track of each
   layout segment it got, keeping record of the segment's iomode,
   offset, and length.  The server should allow the client to get
   multiple overlapping layout segments but is free to recall the
   layout to prevent overlap.

   In response to CB_LAYOUTRECALL, the client should return all layout
   segments matching the given iomode and overlapping with the recalled
   range.  When returning the layouts for this byte range with
   LAYOUTRETURN, the client MUST NOT return a sub-range of a layout
   segment it has; each LAYOUTRETURN sent MUST completely cover at
   least one outstanding layout segment.

   The server, in turn, should release any segment that exactly matches
   the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
   exact match is found, then the server should release all layout
   segments matching the clientid and iomode that are fully contained
   in the returned byte range.  If none are found and the byte range is
   a subset of an outstanding layout segment with the same clientid and
   iomode, then the client will be considered malfunctioning and the
   server SHOULD recall all layouts from this client to reset its
   state.  If this behavior repeats, the server SHOULD deny all
   LAYOUTGETs from this client.

6.2  LAYOUTCOMMIT

   LAYOUTCOMMIT is only used by object-based pNFS to convey modified
   attributes hints and/or to report I/O errors to the MDS.  Therefore,
   the offset and length in LAYOUTCOMMIT4args are reserved for future
   use and should be set to 0.  However, component byte ranges in the
   optional pnfs_osd_ioerr4 structure are used for recovering the
   component and MUST be set by the client to cover all failed I/O
   operations to the component.

7.  Recalling Layouts

   The object-based metadata server should recall outstanding layouts
   in the following cases:

   o  When the file's security policy changes, i.e., ACLs or permission
      mode bits are set.

   o  When the file's aggregation map changes, rendering outstanding
      layouts invalid.

   o  When there are sharing conflicts.  For example, the server will
      issue stripe-aligned layout segments for RAID-5 objects.  To
      prevent corruption of the file's parity, multiple clients must
      not hold valid write layouts for the same stripes.  An
      outstanding RW layout should be recalled when a conflicting
      LAYOUTGET is received from a different client for
      LAYOUTIOMODE_RW and for a byte range overlapping with the
      outstanding layout segment.

8.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into two
   parts, the control path and the data path (storage protocol).  The
   control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path.  The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4 with respect
   to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4 provides
   defenses for in environments where these threats are considered
   significant.

   The object storage protocol MUST implement the security aspects
   described in version 1 of the T10 OSD protocol definition [2].  The
   remainder of this section gives an overview of the security mechanism
   described in that standard.  The goal is to give the reader a basic
   understanding of the object security model.  Any discrepancies
   between this text and the actual standard are obviously to be
   resolved in favor of the OSD standard.

8.1  OSD Security Data Types

   There are three main data types associated with object security: a
   capability, a credential, and security parameters.  The capability is
   a set of fields that specifies an object and what operations can be
   performed on it.  A credential is a signed capability.  Only a
   security manager that knows the secret device keys can correctly sign
   a capability to form a valid credential.  In pNFS, the file server
   acts as the security manager and returns signed capabilities (i.e.,
   credentials) to the pNFS client.  The security parameters are values
   computed by the issuer of OSD commands (i.e., the client) that prove
   they hold valid credentials.  The client uses the credential as a
   signing key to sign the requests it makes to OSD, and puts the
   resulting signatures into the security_parameters field of the OSD
   command.  The object storage device uses the secret keys it shares
   with the security manager to validate the signature values in the
   security parameters.

   The security types are opaque to the generic layers of the pNFS
   client.  The credential is defined as opaque within the
   pnfs_osd_object_cred4 type.  Instead of repeating the definitions
   here, the reader is referred to section 4.9.2.2 of the OSD standard.

8.2  The OSD Security Protocol

   The object storage protocol relies on a cryptographically secure
   capability to control accesses at the object storage devices.
   Capabilities are generated by the metadata server, returned to the
   client, and used by the client as described below to authenticate
   their requests to the Object Storage Device (OSD).  Capabilities
   therefore achieve the required access and open mode checking.  They
   allow the file server to define and check a policy (e.g., open mode)
   and the OSD to enforce that policy without knowing the details (e.g.,
   user IDs and ACLs).

   Since capabilities are tied to layouts, and since they are used to
   enforce access control, when the file ACL or mode changes the
   outstanding capabilities MUST be revoked to enforce the new access
   permissions.  The server SHOULD recall layouts to allow clients to
   gracefully return their capabilities before the access permissions
   change.

   Each capability is specific to a particular object and an operation
   on that object, covers a byte range within the object (in OSDv2),
   and has an explicit expiration time.  The capabilities are signed
   with a secret
   key that is shared by the object storage devices (OSD) and the
   metadata managers.  Clients do not have device keys so they are
   unable to forge the signatures in the security parameters.  The
   combination of a capability and its signature is called a
   "credential" in the OSD specification.

   The details of the security and privacy model for Object Storage are
   defined in the T10 OSD standard.  The following sketch of the
   algorithm should help the reader understand the basic model.

   LAYOUTGET returns a credential, which consists of a capability
   (CapArgs) and a signature over that capability (CapKey):

   CapKey = MAC<SecretKey>(CapArgs)
   Credential = {CapKey, CapArgs}

   The client uses CapKey to sign all the requests it issues for that
   object using the respective CapArgs.  In other words, the CapArgs
   appears in the request to the storage device, and that request is
   signed with the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, NonceIn)
   Request = {CapArgs, Req, NonceIn, ReqMAC}

   The OSD uses the SecretKey it shares with the metadata server to
   compare the ReqMAC the client sent with a locally computed value:

   MAC<MAC<SecretKey>(CapArgs)>(Req, NonceIn)

   and if they match the OSD assumes that the capabilities came from an
   authentic metadata server and allows access to the object, as allowed
   by the CapArgs.  Therefore, if the server's LAYOUTGET reply, holding
   CapKey and CapArgs, is snooped by another client, it can be used to
   generate valid OSD requests (within the CapArgs access restriction).
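
   The following non-normative Python sketch illustrates the exchange,
   with HMAC-SHA1 standing in for the MAC function; the encodings of
   CapArgs and Req below are illustrative placeholders, not the OSD
   wire format.

   import hmac, hashlib, os

   def mac(key: bytes, *fields: bytes) -> bytes:
       # MAC<key>(fields), modeled here as HMAC-SHA1 (an assumption;
       # the OSD standard defines the actual MAC algorithm).
       m = hmac.new(key, digestmod=hashlib.sha1)
       for f in fields:
           m.update(f)
       return m.digest()

   # Metadata server: sign CapArgs with the SecretKey it shares with
   # the OSD, and return the credential {CapKey, CapArgs} to the
   # client in the LAYOUTGET reply.
   secret_key = os.urandom(20)          # known to MDS and OSD only
   cap_args = b"obj=0x123;ops=READ"     # illustrative encoding
   cap_key = mac(secret_key, cap_args)

   # Client: sign each request with CapKey and send
   # {CapArgs, Req, NonceIn, ReqMAC} to the OSD.
   req, nonce_in = b"READ off=0 len=4096", os.urandom(12)
   req_mac = mac(cap_key, req, nonce_in)

   # OSD: recompute MAC<MAC<SecretKey>(CapArgs)>(Req, NonceIn) from
   # the shared SecretKey; a match shows the capability was issued by
   # an authentic metadata server.
   expected = mac(mac(secret_key, cap_args), req, nonce_in)
   assert hmac.compare_digest(req_mac, expected)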

   To provide the required privacy for the capabilities returned by
   LAYOUTGET, the GSS-API can be used, e.g., by using a session key
   known to the file server and to the client to encrypt the whole
   layout or parts of it.  Two general ways to provide privacy in the
   absence of the GSS-API, independent of NFSv4, are an isolated
   network such as a VLAN or a secure channel provided by IPsec.

8.3  Revoking Capabilities

   At any time, the metadata server may invalidate all outstanding
   capabilities on an object by changing its capability version
   attribute.  There is also a "fence bit" attribute that the metadata
   server can toggle to temporarily block access without permanently
   revoking capabilities.  The value of the fence bit and the capability
   version are part of a capability, and they must match the state of
   the attributes.  If they do not match, the OSD rejects accesses to
   the object.  When a client attempts to use a capability and discovers
   a capability version mismatch, it should issue a LAYOUTRETURN for the
   object and specify PNFS_OSD_BAD_CRED in the pnfs_osd_ioerr
   parameter.
   The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or
   LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed
   set of capabilities.
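
   A non-normative sketch of the OSD-side validity check follows;
   the type and field names are illustrative, not part of this
   specification.

   from dataclasses import dataclass

   @dataclass
   class CapVersionFields:
       # The two capability fields that must match the object's
       # current attributes for the OSD to honor a request.
       capability_version: int
       fence_bit: bool

   def osd_accepts(cap: CapVersionFields, obj_version: int,
                   obj_fence: bool) -> bool:
       # Bumping the object's capability version revokes all
       # outstanding capabilities; toggling the fence bit blocks
       # access temporarily without permanent revocation.
       return (cap.capability_version == obj_version
               and cap.fence_bit == obj_fence)

   # A capability issued before the server bumped the version
   # attribute is rejected; the client then returns the layout with
   # PNFS_OSD_BAD_CRED and may fetch a refreshed one.
   cap = CapVersionFields(capability_version=7, fence_bit=False)
   assert osd_accepts(cap, obj_version=7, obj_fence=False)
   assert not osd_accepts(cap, obj_version=8, obj_fence=False)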

   The metadata server may elect to change the capability version on an
   object at any time, for any reason (with the understanding that there
   is likely an associated performance penalty, especially if there are
   outstanding layouts for this object).  The metadata server MUST
   revoke outstanding capabilities when any one of the following occurs:
   (1) the permissions on the object change, or (2) a conflicting
   mandatory byte-range lock is granted.

   A pNFS client will typically hold one layout for each byte range for
   either READ or READ/WRITE.  It is the pNFS client's responsibility to
   enforce access control among multiple users accessing the same file.
   It is neither required nor expected that the pNFS client will obtain
   a separate layout for each user accessing a shared object.  The
   client SHOULD use ACCESS calls to check user permissions when
   performing I/O so that the server's access control policies are
   correctly enforced.  The result of the ACCESS operation may be cached
   indefinitely, as the server is expected to recall layouts when the
   file's access permissions or ACL change.
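
   A non-normative sketch of the client-side ACCESS caching described
   above; nfs_access is an illustrative stand-in for issuing the
   NFSv4 ACCESS operation, not an actual interface.

   def nfs_access(user: str, fh: str) -> bool:
       # Stand-in for an ACCESS call to the server (assumption).
       return True

   # (user, filehandle) -> allowed.  Entries never expire on their
   # own, since the server recalls layouts when permissions or ACLs
   # change.
   access_cache: dict = {}

   def may_do_io(user: str, fh: str) -> bool:
       key = (user, fh)
       if key not in access_cache:
           access_cache[key] = nfs_access(user, fh)
       return access_cache[key]

   def on_layout_recall(fh: str) -> None:
       # A layout recall invalidates cached ACCESS results for the
       # file.
       for key in [k for k in access_cache if k[1] == fh]:
           del access_cache[key]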

9.  References

9.1  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.

   [2]  Weber, R., "SCSI Object-Based Storage Device Commands",
        July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>.

   [3]  Eisler, M., "XDR: External Data Representation Standard",
        RFC 4506, May 2006.

   [4]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
        C., Eisler, M., and D. Noveck, "Network File System (NFS)
        version 4 Protocol", RFC 3530, April 2003.

9.2  Informative References

   [5]  Shepler, S., "NFSv4 Minor Version 1", June 2006, <http://
        www.ietf.org/internet-drafts/
        draft-ietf-nfsv4-minorversion1-03.txt>.

   [6]  Weber, R., "SCSI Object-Based Storage Device Commands - 2
        (OSD-2)", October 2004,
        <http://www.t10.org/ftp/t10/drafts/osd2/osd2r00.pdf>.

Authors' Addresses

   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: bhalevy@panasas.com
   URI:   http://www.panasas.com/

   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA  95444
   USA

   Phone: +1-650-608-7770
   Email: welch@panasas.com
   URI:   http://www.panasas.com/

   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: jimz@panasas.com
   URI:   http://www.panasas.com/

   Todd Pisek
   Sun Microsystems, Inc.
   1270 Eagan Industrial Rd. - Suite 160
   Eagan, MN  55121-1231
   USA

   Phone: +1-651-552-6415
   Email: trp@sun.com
   URI:   http://www.sun.com/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.