draft-ietf-nfsv4-pnfs-obj-00.txt   draft-ietf-nfsv4-pnfs-obj-01.txt 
Network B. Halevy NFSv4 B. Halevy
Internet-Draft B. Welch Internet-Draft B. Welch
Expires: July 27, 2006 J. Zelenka Expires: December 27, 2006 J. Zelenka
Panasas Panasas
T. Pisek T. Pisek
Sun Sun
January 23, 2006 June 25, 2006
Object-based pNFS Operations Object-based pNFS Operations
draft-ietf-nfsv4-pnfs-obj-00.txt draft-ietf-nfsv4-pnfs-obj-01.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 37 skipping to change at page 1, line 37
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 27, 2006. This Internet-Draft will expire on December 27, 2006.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2006).
Abstract Abstract
This Internet-Draft provides a description of the object-based pNFS This Internet-Draft provides a description of the object-based pNFS
extension for NFSv4. This is a companion to the main pnfs operations extension for NFSv4. This is a companion to the main pnfs
draft, which is currently draft-ietf-nfsv4-pnfs-00.txt specification in the NFSv4 Minor Version 1 Internet Draft, which is
currently draft-ietf-nfsv4-minorversion1-03.txt.
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Object Storage Device Addressing and Discovery . . . . . . . . 3 2. Object Storage Device Addressing and Discovery . . . . . . . . 3
3. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 5 2.1 pnfs_osd_addr_type4 . . . . . . . . . . . . . . . . . . . 4
3.1 pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 5 2.2 pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 4
3.2 pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 6 3. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 4
3.3 pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 7 3.1 pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 5
3.3.1 Simple Striping . . . . . . . . . . . . . . . . . . . 7 3.1.1 pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . 6
3.3.2 Nested Striping . . . . . . . . . . . . . . . . . . . 9 3.1.3 pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . 7
3.3.3 Mirroring . . . . . . . . . . . . . . . . . . . . . . 11 3.1.4 pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . 7
3.3.4 RAID . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.5 pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . 7
3.3.5 Usage and implementation notes . . . . . . . . . . . . 13 3.2 Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 8
3.4 pnfs_layoutupdate4 . . . . . . . . . . . . . . . . . . . . 13 3.2.1 Simple Striping . . . . . . . . . . . . . . . . . . . 8
4. Security Considerations . . . . . . . . . . . . . . . . . . . 15 3.2.2 Nested Striping . . . . . . . . . . . . . . . . . . . 9
4.1 Security Data Types . . . . . . . . . . . . . . . . . . . 16 3.2.3 Mirroring . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Security Protocol . . . . . . . . . . . . . . . . . . . . 16 3.3 RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 11
4.3 Revoking capabilities . . . . . . . . . . . . . . . . . . 17 3.3.1 PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 11
5. Normative References . . . . . . . . . . . . . . . . . . . . . 18 3.3.2 PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 19 3.3.3 PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 12
Intellectual Property and Copyright Statements . . . . . . . . 20 3.3.4 PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 12
3.3.5 RAID Usage and implementation notes . . . . . . . . . 13
4. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 13
4.1 pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 13
4.1.1 pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . 14
4.1.2 pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . 14
4.1.3 pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . 15
5. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 15
5.1 pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 16
6. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 17
6.1 CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 17
6.2 LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 18
7. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 18
8. Security Considerations . . . . . . . . . . . . . . . . . . . 18
8.1 OSD Security Data Types . . . . . . . . . . . . . . . . . 19
8.2 The OSD Security Protocol . . . . . . . . . . . . . . . . 19
8.3 Revoking capabilities . . . . . . . . . . . . . . . . . . 21
9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 21
9.1 Normative References . . . . . . . . . . . . . . . . . . . 21
9.2 Informative References . . . . . . . . . . . . . . . . . . 22
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 22
Intellectual Property and Copyright Statements . . . . . . . . 24
1. Introduction 1. Introduction
In pNFS, the file server returns typed layout structures that In pNFS, the file server returns typed layout structures that
describe where file data is located. There are different layouts for describe where file data is located. There are different layouts for
different storage systems and methods of arranging data on storage different storage systems and methods of arranging data on storage
devices. This document describes the layouts used with object-based devices. This document describes the layouts used with object-based
storage devices (OSD) that are accessed according to the iSCSI/OSD storage devices (OSD) that are accessed according to the iSCSI/OSD
storage protocol standard (SNIA T10/1355-D [2]). storage protocol standard (SNIA T10/1355-D [2]).
skipping to change at page 3, line 27 skipping to change at page 3, line 27
SETATTR, CREATE and DELETE. However, in this proposal the client SETATTR, CREATE and DELETE. However, in this proposal the client
only uses the READ, WRITE, GETATTR and FLUSH commands. The other only uses the READ, WRITE, GETATTR and FLUSH commands. The other
commands are only used by the pNFS server. commands are only used by the pNFS server.
An object-based layout for pNFS includes object identifiers, An object-based layout for pNFS includes object identifiers,
capabilities that allow clients to READ or WRITE those objects, and capabilities that allow clients to READ or WRITE those objects, and
various parameters that control how file data is striped across their various parameters that control how file data is striped across their
component objects. The OSD protocol has a capability-based security component objects. The OSD protocol has a capability-based security
scheme that allows the pNFS server to control what operations and scheme that allows the pNFS server to control what operations and
what objects are used by clients. This scheme is described in more what objects are used by clients. This scheme is described in more
detail in the "Security Considerations" section. detail in the "Security Considerations" section (Section 8).
2. Object Storage Device Addressing and Discovery 2. Object Storage Device Addressing and Discovery
Data operations to an OSD require the client to know the "address" of Data operations to an OSD require the client to know the "address" of
each OSD's root object. The root object is synonymous with SCSI each OSD's root object. The root object is synonymous with SCSI
logical unit. The client specifies SCSI logical units to its SCSI logical unit. The client specifies SCSI logical units to its SCSI
stack using a representation local to the client. Because these stack using a representation local to the client. Because these
representations are local, GETDEVICEINFO must return information that representations are local, GETDEVICEINFO must return information that
can be used by the client to select the correct local representation. can be used by the client to select the correct local representation.
In the block world, a set offset (logical block number or track/ In the block world, a set offset (logical block number or track/
sector) contains a disk label. This label identifies the disk sector) contains a disk label. This label identifies the disk
uniquely. In constrast, an OSD has a standard set of attributes on uniquely. In contrast, an OSD has a standard set of attributes on
its root object. For device identification purposes, the OSD name its root object. For device identification purposes, the OSD name
(root information attribute number 9) will be used as the label. (root information attribute number 9) will be used as the label.
This appears in the pnfs_obj_deviceaddr4 type below under the This appears in the pnfs_osd_deviceaddr4 type below under the
"root_id" field. "root_id" field.
In some situations, SCSI target discovery may need to be driven based In some situations, SCSI target discovery may need to be driven based
on information contained in the GETDEVICEINFO response. One example on information contained in the GETDEVICEINFO response. One example
of this is iSCSI targets that are not known to the client until a of this is iSCSI targets that are not known to the client until a
layout has been requested. Eventually iSCSI will adopt ANSI T10 layout has been requested. Eventually iSCSI will adopt ANSI T10
SAM-3, at which time the World Wide Name (WWN, also known as EUI-64/EUI-128) SAM-3, at which time the World Wide Name (WWN, also known as EUI-64/EUI-128)
naming conventions can be specified. In addition, Fibre Channel (FC) naming conventions can be specified. In addition, Fibre Channel (FC)
SCSI targets have a unique WWN. Although these FC targets have SCSI targets have a unique WWN. Although these FC targets have
already been discovered, some implementations may want to specify the already been discovered, some implementations may want to specify the
WWN in addition to the label. This information appears as the WWN in addition to the label. This information appears as the
"target" and "lun" fields in the pnfs_obj_deviceaddr4 type described "target" and "lun" fields in the pnfs_osd_deviceaddr4 type described
below. below.
2.1 pnfs_osd_addr_type4
The following enum specifies the manner in which a SCSI target can be The following enum specifies the manner in which a SCSI target can be
specified. The target can be specified as an IP address (v4 or v6), specified. The target can be specified as a network address, as an
as an Internet Qualified Name (IQN), or by the WWN of the target. Internet Qualified Name (IQN), or by the World-Wide Name (WWN) of the
target.
enum pnfs_obj_addr_type4 { enum pnfs_obj_addr_type4 {
OBJ_TARGET_IP_ADDR = 1, OBJ_TARGET_NETADDR = 1,
OBJ_TARGET_IQN = 2, OBJ_TARGET_IQN = 2,
OBJ_TARGET_WWN = 3 OBJ_TARGET_WWN = 3
}; };
Figure 1 2.2 pnfs_osd_deviceaddr4
A device can be specified by the tuple <target, logical unit number
(LUN), OSD Name>, or in the default case, just by the OSD Name. The
following enum is used to select the format:
enum pnfs_obj_dev_specifier4 { The specification for an object device address is as follows:
OBJ_DEV_SPEC_TARGET = 1
};
Figure 2 struct pnfs_osd_deviceaddr4 {
union target switch (pnfs_osd_addr_type4 type) {
case OBJ_TARGET_NETADDR:
pnfs_netaddr4 netaddr;
To summarize, the device addressing is fundamentally done by case OBJ_TARGET_IQN:
specifying the OSD name (i.e., root_id). In order to help the client string iqn<>;
resource discovery process, physical address hints can also be
provided. The specification for an object device address is as
follows:
union pnfs_obj_deviceaddr4 switch (pnfs_obj_dev_specifier4 dev) { case OBJ_TARGET_WWN:
case OBJ_DEV_SPEC_TARGET: string wwn<>;
pnfs_obj_addr_type4 addr_type;
string target<>;
uint64 lun;
opaque root_id<>;
default: default:
void;
};
uint64_t lun;
opaque root_id<>; opaque root_id<>;
}; };
Figure 3
3. Object-Based Layout 3. Object-Based Layout
This draft defines the structure associated with the pnfs_layouttype4 The pnfs_layout4 type is defined in the NFSv4.1 draft [5] as follows:
value, LAYOUT_OSD_OBJECTS. The pnfs draft specifies the structure as
an XDR type "opaque". The opaque layout is uninterpreted by the
generic pNFS client layers, but obviously must be interpreted by the
object-storage layout driver. This document defines the structure of
this opaque value.
This is the pnfs_layoutdata4 type from the general pNFS
specification:
enum pnfs_layouttype4 { enum pnfs_layouttype4 {
LAYOUT_NFSV4_FILES = 1, LAYOUT_NFSV4_FILES = 1,
LAYOUT_OSD_OBJECTS = 2, LAYOUT_OSD2_OBJECTS = 2,
LAYOUT_BLOCK_VOLUME = 3 LAYOUT_BLOCK_VOLUME = 3
}; };
struct pnfs_layoutdata4 { struct pnfs_layout4 {
pnfs_layouttype4 layout_type; offset4 offset;
opaque layout_data<>; length4 length;
pnfs_layoutiomode4 iomode;
pnfs_layouttype4 type;
opaque layout<>;
}; };
Figure 4 This draft defines the structure associated with the pnfs_layouttype4
value, LAYOUT_OSD2_OBJECTS. The NFSv4.1 draft specifies the
structure as an XDR type "opaque". The opaque layout is
uninterpreted by the generic pNFS client layers, but obviously must
be interpreted by the object-storage layout driver. This document
defines the structure of this opaque value, pnfs_osd_layout4.
3.1 pnfs_osd_objid4 3.1 pnfs_osd_layout4
struct pnfs_osd_layout4 {
pnfs_osd_object_cred4 components<>;
pnfs_osd_data_map4 map;
};
The pnfs_osd_layout4 structure specifies a layout over a set of
component objects. The components field is an array of object
identifiers and security credentials that grant access to each
object. The organization of the data is defined by the
pnfs_osd_data_map4 type that specifies how the file's data is mapped
onto the component objects (i.e., the striping pattern). The data
placement algorithm that maps file data onto component objects assumes
that each component object occurs exactly once in the array of
components. Therefore, component objects MUST appear in the
component array only once.
Note that the layout depends on the file size, which the client
learns from the generic return parameters of LAYOUTGET, by doing
GETATTR commands to the metadata server, and by getting
CB_SIZE_CHANGED callbacks from the metadata server. The client uses
the file size to decide if it should fill holes with zeros, or return
a short read. Striping patterns can cause cases where component
objects are shorter than other components because a hole happens to
correspond to the last part of the component object.
3.1.1 pnfs_osd_objid4
An object is identified by a number, somewhat like an inode number. An object is identified by a number, somewhat like an inode number.
The object storage model has a two level scheme, where the objects The object storage model has a two level scheme, where the objects
within an object storage device are grouped into partitions. within an object storage device are grouped into partitions.
struct pnfs_osd_objid4 { struct pnfs_osd_objid4 {
pnfs_deviceid4 device_id; pnfs_deviceid4 device_id;
uint64 partition_id; uint64_t partition_id;
uint64 object_id; uint64_t object_id;
}; };
Figure 5 The pnfs_osd_objid4 type is used to identify an object within a
partition on a specified object storage device. "device_id" selects
The pnfs_osd_objid4 identifies an object within a partition on a the object storage device from the set of available storage devices.
specified object storage device. The device selects the object The device is identified with the pnfs_deviceid4 type, which is an
storage device from the set of available storage devices. The device index into addressing information about that device returned by the
is identified with the pnfs_deviceid4 type, which is an index into
addressing information about that device returned by the
GETDEVICEINFO pnfs operation. Within an OSD, a partition is GETDEVICEINFO pnfs operation. Within an OSD, a partition is
identified with a 64-bit number. Within a partition, an object is identified with a 64-bit number, "partition_id". Within a partition,
identified with a 64-bit number. Creation and management of an object is identified with a 64-bit number, "object_id". Creation
partitions is outside the scope of this standard, and is a facility and management of partitions is outside the scope of this standard,
provided by the object storage file system. and is a facility provided by the object storage file system.
3.2 pnfs_osd_layout4 3.1.2 pnfs_osd_version4
The pnfs_osd_layout4 specifies a layout over a set of component enum pnfs_osd_version4 {
objects. The components field is an array of object identifiers and PNFS_OSD_MISSING = 0,
security credentials that grant access to each object. The PNFS_OSD_VERSION_1 = 1,
organization of the data is defined by the pnfs_osd_data_map4 type PNFS_OSD_VERSION_2 = 2
that specifies how the file's data is mapped onto the component };
objects (i.e., the striping pattern). The data placement algorithm
that maps file data onto component objects assumes that each component The osd_version is used to indicate the OSD protocol version or
object occurs exactly once in the array of components. Therefore, whether an object is missing (i.e., unavailable). Some layout
component objects MUST appear in the component array only once. schemes encode redundant information and can compensate for missing
components, but the data placement algorithm needs to know what parts
are missing.
At this time the OSD standard is at version 1.0, and we anticipate a At this time the OSD standard is at version 1.0, and we anticipate a
version 2.0 of the standard. The second generation OSD protocol has version 2.0 of the standard. The second generation OSD protocol has
additional proposed features to support more robust error recovery, additional proposed features to support more robust error recovery,
snapshots, and byte-range capabilities. Therefore, the OSD version snapshots, and byte-range capabilities. Therefore, the OSD version
is explicitly called out in the information returned in the layout. is explicitly called out in the information returned in the layout.
(This information can also be deduced by looking inside the (This information can also be deduced by looking inside the
capability type at the format field, which is the first byte. The capability type at the format field, which is the first byte. The
format value is 0x1 for an OSD v1 capability. However, it seems most format value is 0x1 for an OSD v1 capability. However, it seems most
robust to call out the version explicitly.) robust to call out the version explicitly.)
In addition, the osd_version field is used to indicate that an object 3.1.3 pnfs_osd_object_cred4
may be missing (i.e., unavailable). Some layout schemes encode
redundant information and can compensate for missing components, but
the data placement algorithm needs to know what parts are missing.
enum pnfs_osd_version {
PNFS_OSD_MISSING = 0,
PNFS_OSD_VERSION_1 = 1,
PNFS_OSD_VERSION_2 = 2
};
struct pnfs_osd_object_cred4 { struct pnfs_osd_object_cred4 {
pnfs_osd_objid4 object_id; pnfs_osd_objid4 object_id;
pnfs_osd_version osd_version; pnfs_osd_version4 osd_version;
opaque credential<>; opaque credential<>;
}; };
struct pnfs_osd_layout4 { The pnfs_osd_object_cred4 structure is used to identify each
pnfs_osd_object_cred4 components<>; component comprising the file. The object_id identifies the
pnfs_osd_data_map4 map; component object, the osd_version represents the osd protocol
}; version, or whether that component is unavailable, and the credential
Figure 6 provides the OSD security credentials needed to access that object
(see Section 8.1 for more details).
Note that the layout depends on the file size, which the client
learns from the generic return parameters of LAYOUTGET, by doing
GETATTR commands to the metadata server, and by getting
CB_SIZE_CHANGED callbacks from the metadata server. The client uses
the file size to decide if it should fill holes with zeros, or return
a short read. Striping patterns can cause cases where component
objects are shorter than other components because a hole happens to
correspond to the last part of the component object.
3.3 pnfs_osd_data_map4
The pnfs_osd_data_map4 parameterizes the algorithm that maps a file's 3.1.4 pnfs_osd_raid_algorithm4
contents over the component objects. Instead of limiting the system
to a simple striping scheme where loss of a single component object
results in data loss, the map parameters support mirroring and more
complicated schemes that protect against loss of a component object.
The type is shown first, and then each parameter is explained.
enum pnfs_osd_raid_algorithm4 { enum pnfs_osd_raid_algorithm4 {
PNFS_OSD_RAID_0 = 1, PNFS_OSD_RAID_0 = 1,
PNFS_OSD_RAID_4 = 2, PNFS_OSD_RAID_4 = 2,
PNFS_OSD_RAID_5 = 3, PNFS_OSD_RAID_5 = 3,
PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */ PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */
}; };
pnfs_osd_raid_algorithm4 represents the data redundancy algorithm
used to protect the file's contents. See Section 3.3 for more
details.
3.1.5 pnfs_osd_data_map4
struct pnfs_osd_data_map4 { struct pnfs_osd_data_map4 {
length4 stripe_unit; length4 stripe_unit;
uint16 group_width; uint16_t group_width;
uint16 group_depth; uint16_t group_depth;
uint16 mirror_cnt; uint16_t mirror_cnt;
pnfs_osd_raid_algorithm4 raid_algorithm; pnfs_osd_raid_algorithm4 raid_algorithm;
}; };
Figure 7 The pnfs_osd_data_map4 structure parameterizes the algorithm that
maps a file's contents over the component objects. Instead of
3.3.1 Simple Striping limiting the system to a simple striping scheme where loss of a single
component object results in data loss, the map parameters support
mirroring and more complicated schemes that protect against loss of a
component object.
The stripe_unit is the number of bytes placed on one component before The stripe_unit is the number of bytes placed on one component before
advancing to the next one in the list of components. The number of advancing to the next one in the list of components. The number of
bytes in a full stripe is stripe_unit times the number of components. bytes in a full stripe is stripe_unit times the number of components.
In some RAID schemes, a stripe includes redundant information (i.e., In some RAID schemes, a stripe includes redundant information (i.e.,
parity) that lets the system recover from loss or damage to a parity) that lets the system recover from loss or damage to a
component object. component object.
The group_width and group_depth parameters allow a nested striping
pattern. If there is no nesting, then group_width and group_depth
MUST be zero. Otherwise, the group_width defines the width of a data
stripe, and the group_depth defines how many stripes are written
before advancing to the next group of components in the list of
component objects for the file. The size of the components array
MUST be a multiple of group_width.
The mirror_cnt is used to replicate a file by replicating its
component objects. If there is no mirroring, then mirror_cnt MUST be
0. If mirror_cnt is greater than zero, then the size of the
component array MUST be a multiple of (mirror_cnt+1).
See Section 3.2 for more details.
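The size constraints on the components array stated above can be checked mechanically. The following Python fragment is a minimal sketch, not part of the protocol: the function name check_data_map and its return convention are hypothetical, and the rule that group_width and group_depth are either both zero or both non-zero is an assumption inferred from the "MUST be zero" wording.

```python
def check_data_map(num_components, group_width, group_depth, mirror_cnt):
    """Return a list of violated constraints (empty when the map is acceptable).

    Hypothetical helper: the parameter names follow pnfs_osd_data_map4, but
    the function itself is only an illustration.
    """
    problems = []

    # Assumption: nesting is either fully disabled (both zero) or fully
    # specified (both non-zero).
    if (group_width == 0) != (group_depth == 0):
        problems.append("group_width and group_depth must both be zero "
                        "or both be non-zero")

    # With nesting, the components array size MUST be a multiple of group_width.
    if group_width > 0 and num_components % group_width != 0:
        problems.append("components array size is not a multiple of group_width")

    # With mirroring, the components array size MUST be a multiple of
    # (mirror_cnt + 1).
    if mirror_cnt > 0 and num_components % (mirror_cnt + 1) != 0:
        problems.append("components array size is not a multiple of "
                        "(mirror_cnt + 1)")

    return problems
```

A layout driver could run such a check once when it decodes a layout and reject the layout if the list is non-empty.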
3.2 Data Mapping Schemes
This section describes the different data mapping schemes in detail.
3.2.1 Simple Striping
The object layout always uses a "dense" layout as described in the The object layout always uses a "dense" layout as described in the
pNFS document. This means that the second stripe unit of the file pNFS document. This means that the second stripe unit of the file
starts at offset 0 of the second component, rather than at offset starts at offset 0 of the second component, rather than at offset
stripe_unit bytes. After a full stripe has been written, the next stripe_unit bytes. After a full stripe has been written, the next
stripe unit is appended to the first component object in the list stripe unit is appended to the first component object in the list
without any holes in the component objects. The mapping from the without any holes in the component objects. The mapping from the
logical offset within a file (L) to the component object C and logical offset within a file (L) to the component object C and
object-specific offset O is defined by the following equations: object-specific offset O is defined by the following equations:
L = logical offset into the file L = logical offset into the file
W = total number of components W = total number of components
S = W * stripe_unit S = W * stripe_unit
N = L / S N = L / S
C = (L-(N*S)) / stripe_unit C = (L-(N*S)) / stripe_unit
O = (N*stripe_unit)+(L%stripe_unit) O = (N*stripe_unit)+(L%stripe_unit)
Figure 8
In these equations, S is the number of bytes in a full stripe, and N In these equations, S is the number of bytes in a full stripe, and N
is the stripe number. C is an index into the array of components, so is the stripe number. C is an index into the array of components, so
it selects a particular object storage device. Both N and C count it selects a particular object storage device. Both N and C count
from zero. O is the offset within the object that corresponds to the from zero. O is the offset within the object that corresponds to the
file offset. Note that this computation does not accomodate the same file offset. Note that this computation does not accommodate the
object appearing in the component array multiple times. same object appearing in the component array multiple times.
For example, consider an object striped over four devices, <D0 D1 D2 For example, consider an object striped over four devices, <D0 D1 D2
D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 *
4096 = 16384. 4096 = 16384.
Offset 0: Offset 0:
N = 0 / 16384 = 0 N = 0 / 16384 = 0
C = (0-(0*16384)) / 4096 = 0 (D0) C = (0-(0*16384)) / 4096 = 0 (D0)
O = 0*4096 + (0%4096) = 0 O = 0*4096 + (0%4096) = 0
skipping to change at page 9, line 25 skipping to change at page 9, line 29
Offset 9000: Offset 9000:
N = 9000 / 16384 = 0 N = 9000 / 16384 = 0
C = (9000-(0*16384)) / 4096 = 2 (D2) C = (9000-(0*16384)) / 4096 = 2 (D2)
O = (0*4096)+(9000%4096) = 808 O = (0*4096)+(9000%4096) = 808
Offset 132000: Offset 132000:
N = 132000 / 16384 = 8 N = 132000 / 16384 = 8
C = (132000-(8*16384)) / 4096 = 0 C = (132000-(8*16384)) / 4096 = 0
O = (8*4096) + (132000%4096) = 33696 O = (8*4096) + (132000%4096) = 33696
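The simple striping equations translate directly into integer arithmetic. The Python fragment below is an illustrative sketch only; the function name map_simple and its argument order are hypothetical, and all divisions are integer (floor) divisions, as the worked examples imply.

```python
def map_simple(L, stripe_unit, W):
    """Map logical file offset L to (C, O) for simple dense striping.

    L           -- logical offset into the file, in bytes
    stripe_unit -- bytes placed on one component before advancing
    W           -- total number of components
    Returns (component index C, offset O within that component object).
    """
    S = W * stripe_unit                      # bytes in a full stripe
    N = L // S                               # stripe number, counting from zero
    C = (L - N * S) // stripe_unit           # index into the components array
    O = N * stripe_unit + L % stripe_unit    # offset within component C
    return C, O
```

For the four-device example with a 4096-byte stripe_unit, map_simple(9000, 4096, 4) yields (2, 808) and map_simple(132000, 4096, 4) yields (0, 33696), which agrees with the values computed by hand.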
Figure 9 3.2.2 Nested Striping
3.3.2 Nested Striping
The group_width and group_depth parameters allow a nested striping The group_width and group_depth parameters allow a nested striping
pattern. If there is no nesting, then group_width and group_depth pattern. If there is no nesting, then group_width and group_depth
MUST be zero. Otherwise, the group_width defines the width of a data MUST be zero. Otherwise, the group_width defines the width of a data
stripe, and the group_depth defines how many stripes are written stripe, and the group_depth defines how many stripes are written
before advancing to the next group of components in the list of before advancing to the next group of components in the list of
component objects for the file. The size of the components array component objects for the file. The size of the components array
MUST be a multiple of group_width. The math used to map from a file MUST be a multiple of group_width. The math used to map from a file
offset to a component object and offset within that object is shown offset to a component object and offset within that object is shown
below. The computations map from the logical offset L to the below. The computations map from the logical offset L to the
skipping to change at page 10, line 4 skipping to change at page 10, line 5
W = total number of components W = total number of components
S = stripe_unit * group_depth * W S = stripe_unit * group_depth * W
T = stripe_unit * group_depth * group_width T = stripe_unit * group_depth * group_width
U = stripe_unit * group_width U = stripe_unit * group_width
M = L / S M = L / S
G = (L - (M * S)) / T G = (L - (M * S)) / T
H = (L - (M * S)) % T H = (L - (M * S)) % T
N = H / U N = H / U
C = (H - (N * U)) / stripe_unit + G * group_width C = (H - (N * U)) / stripe_unit + G * group_width
O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
Figure 10
In these equations, S is the number of bytes striped across all In these equations, S is the number of bytes striped across all
component objects before the pattern repeats. T is the number of component objects before the pattern repeats. T is the number of
bytes striped within a group of component objects before advancing to bytes striped within a group of component objects before advancing to
the next group. U is the number of bytes in a stripe within a group. the next group. U is the number of bytes in a stripe within a group.
M is the "major" (i.e., across all components) stripe number, and N M is the "major" (i.e., across all components) stripe number, and N
is the "minor" (i.e., across the group) stripe number. G counts the is the "minor" (i.e., across the group) stripe number. G counts the
groups from the beginning of the major stripe, and H is the byte groups from the beginning of the major stripe, and H is the byte
offset within the group. offset within the group.
skipping to change at page 10, line 49 skipping to change at page 11, line 5
O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB
Offset 7232 MB: Offset 7232 MB:
M = 7232 MB / 5000 MB = 1 M = 7232 MB / 5000 MB = 1
G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4 G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
N = 232 MB / 10 MB = 23 N = 232 MB / 10 MB = 23
C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42 C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB
Figure 11 3.2.3 Mirroring
3.3.3 Mirroring
The mirror_cnt is used to replicate a file by replicating its The mirror_cnt is used to replicate a file by replicating its
component objects. If there is no mirroring, then mirror_cnt MUST be component objects. If there is no mirroring, then mirror_cnt MUST be
0. If mirror_cnt is greater than zero, then the size of the 0. If mirror_cnt is greater than zero, then the size of the
component array MUST be a multiple of (mirror_cnt+1). Thus, for a component array MUST be a multiple of (mirror_cnt+1). Thus, for a
classic mirror on two objects, mirror_cnt is one. If group_width is classic mirror on two objects, mirror_cnt is one. If group_width is
also non-zero, then the size MUST be a multiple of group_width * also non-zero, then the size MUST be a multiple of group_width *
(mirror_cnt+1). Replicas are adjacent in the components array, and (mirror_cnt+1). Replicas are adjacent in the components array, and
the value C produced by the above equations is not a direct index the value C produced by the above equations is not a direct index
into the components array. Instead, the following equations deterine into the components array. Instead, the following equations
the replica component index RCi, where i ranges from 0 to mirror_cnt. determine the replica component index RCi, where i ranges from 0 to
mirror_cnt.
C = component index for striping or two-level striping C = component index for striping or two-level striping
i ranges from 0 to mirror_cnt, inclusive i ranges from 0 to mirror_cnt, inclusive
RCi = C * (mirror_cnt+1) + i RCi = C * (mirror_cnt+1) + i
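As a small illustration of the replica index equation, the following Python sketch lists every index in the components array that holds a copy of striping component C. The function name is hypothetical.

```python
def replica_indices(C, mirror_cnt):
    """Return RC0..RCmirror_cnt, the component-array indices of all replicas of C."""
    return [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]


# Classic two-way mirror (mirror_cnt = 1): component 3 lives at indices 6 and 7.
pair = replica_indices(3, 1)
```

With mirror_cnt equal to 0 the list contains only C itself, so the same code path serves unmirrored layouts.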
Figure 12 3.3 RAID Algorithms
3.3.4 RAID pnfs_osd_raid_algorithm4 determines the algorithm and placement of
redundant data. This section defines the different RAID algorithms.
The raid_algorithm determines the algorithm and placement of 3.3.1 PNFS_OSD_RAID_0
redundant data. PNFS_OSD_RAID_0 means there is no parity data, so
all bytes in the component objects are data bytes located by the PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
above equations for C and O. If a component object is unavailable, component objects are data bytes located by the above equations for C
the pNFS client can choose to return NULLs for the missing data, or and O. If a component object is unavailable, the pNFS client can
it can retry the READ against the pNFS server, or it can return an choose to return NULLs for the missing data, or it can retry the READ
EIO error. against the pNFS server, or it can return an EIO error.
3.3.2 PNFS_OSD_RAID_4
PNFS_OSD_RAID_4 means that the last component object, or the last in PNFS_OSD_RAID_4 means that the last component object, or the last in
each group if group_width is > zero, contains parity information each group if group_width is > zero, contains parity information
computed over the rest of the stripe with an XOR operation. If a computed over the rest of the stripe with an XOR operation. If a
component object is unavailable, the client can read the rest of the component object is unavailable, the client can read the rest of the
stripe units in the damaged stripe and recompute the missing stripe stripe units in the damaged stripe and recompute the missing stripe
unit by XORing the other stripe units in the stripe. Or the client unit by XORing the other stripe units in the stripe. Or the client
can replay the READ against the pNFS server which will presumably can replay the READ against the pNFS server which will presumably
perform the reconstructed read on the client's behalf. perform the reconstructed read on the client's behalf.
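The XOR reconstruction described above can be sketched as follows. This is an illustration only. It assumes all stripe units in a stripe have the same length and that exactly one unit is missing. The helper names are hypothetical.

```python
from functools import reduce


def xor_units(units):
    """Byte-wise XOR of equally sized stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def reconstruct_missing(surviving_units):
    """Recompute the single missing stripe unit from the surviving data and parity units."""
    return xor_units(surviving_units)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_units(data)                       # what the parity component stores
recovered = reconstruct_missing([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine computes parity on write and recovers a lost data unit on read.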
skipping to change at page 12, line 12 skipping to change at page 12, line 12
for embedded parity, L'. First compute L', and then use L' in the for embedded parity, L'. First compute L', and then use L' in the
above equations for C and O. above equations for C and O.
L = file offset, not accounting for parity L = file offset, not accounting for parity
P = number of parity devices in each stripe P = number of parity devices in each stripe
W = group_width, if not zero, else size of component array W = group_width, if not zero, else size of component array
N = L / ((W-P) * stripe_unit) N = L / ((W-P) * stripe_unit)
L' = N * (W * stripe_unit) + L' = N * (W * stripe_unit) +
(L % ((W-P) * stripe_unit)) (L % ((W-P) * stripe_unit))
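A minimal Python sketch of the L' computation follows. It reads the divisor as (W - P) * stripe_unit, that is, the number of data bytes in one stripe. This grouping is an assumption about the intended precedence. The function name is hypothetical.

```python
def parity_adjusted_offset(L, W, P, stripe_unit):
    """Return L', the file offset after accounting for embedded parity."""
    data_bytes_per_stripe = (W - P) * stripe_unit   # assumed grouping
    N = L // data_bytes_per_stripe                  # stripe number
    return N * (W * stripe_unit) + (L % data_bytes_per_stripe)


# Four components, one parity unit per stripe, 10-byte stripe units:
# data offset 30 is the first data byte of the second stripe.
adjusted = parity_adjusted_offset(30, 4, 1, 10)
```

In this example the first stripe holds 30 data bytes and 10 parity bytes, so data offset 30 maps to L' = 40.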
Figure 13 3.3.3 PNFS_OSD_RAID_5
PNFS_OSD_RAID_5 means that the position of the parity data is rotated PNFS_OSD_RAID_5 means that the position of the parity data is rotated
on each stripe. In the first stripe, the last component holds the on each stripe. In the first stripe, the last component holds the
parity. In the second stripe, the next-to-last component holds the parity. In the second stripe, the next-to-last component holds the
parity, and so on. In this scheme, all stripe units are rotated so parity, and so on. In this scheme, all stripe units are rotated so
that I/O is evenly spread across objects as the file is read that I/O is evenly spread across objects as the file is read
sequentially. The rotated parity layout is illustrated here, with sequentially. The rotated parity layout is illustrated here, with
numbers indicating the stripe unit. numbers indicating the stripe unit.
0 1 2 P 0 1 2 P
4 5 P 3 4 5 P 3
8 P 6 7 8 P 6 7
P 9 a b P 9 a b
Figure 14
To compute the component object C, first compute the offset that To compute the component object C, first compute the offset that
accounts for parity L' and use that to compute C. Then rotate C to accounts for parity L' and use that to compute C. Then rotate C to
get C'. Finally, increase C' by one if the parity information comes get C'. Finally, increase C' by one if the parity information comes
at or before C' within that stripe. The following equations at or before C' within that stripe. The following equations
illustrate this by computing I, which is the index of the component illustrate this by computing I, which is the index of the component
that contains parity for a given stripe. that contains parity for a given stripe.
L = file offset, not accounting for parity L = file offset, not accounting for parity
W = group_width, if not zero, else size of component array W = group_width, if not zero, else size of component array
N = L / ((W-1) * stripe_unit) N = L / ((W-1) * stripe_unit)
(Compute L' as described above) (Compute L' as described above)
(Compute C based on L' as described above) (Compute C based on L' as described above)
C' = (C - (N%W)) % W C' = (C - (N%W)) % W
I = W - (N%W) - 1 I = W - (N%W) - 1
if (C' <= I) { if (C' <= I) {
C'++ C'++
} }
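The rotated layout illustrated above can be reproduced with the following Python sketch. It applies the rotation C' = (C - (N%W)) % W and the parity index I = W - (N%W) - 1, working in whole stripe units rather than byte offsets. It deliberately does not apply the conditional increment of C'. That omission is an assumption made so that the output matches the illustrated layout, in which parity rotates together with the data units. The function name is hypothetical.

```python
def raid5_row(stripe, W):
    """Return the W columns of one stripe: hex data unit numbers, 'P' for parity."""
    data_per_stripe = W - 1
    row = [None] * W
    parity_col = W - (stripe % W) - 1             # I in the equations
    row[parity_col] = "P"
    for k in range(data_per_stripe):              # k plays the role of C
        col = (k - (stripe % W)) % W              # C' = (C - (N%W)) % W
        row[col] = format(stripe * data_per_stripe + k, "x")
    return row


layout = [raid5_row(n, 4) for n in range(4)]
```

Python's % operator returns a non-negative result for a positive modulus, so the rotation wraps correctly. In languages where % can be negative, W would need to be added before taking the remainder.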
Figure 15 3.3.4 PNFS_OSD_RAID_PQ
PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
P+Q encoding scheme. In this layout, the last two component objects P+Q encoding scheme. In this layout, the last two component objects
hold the P and Q data, respectively. P is parity computed with XOR, hold the P and Q data, respectively. P is parity computed with XOR,
and Q is a more complex equation that is not described here. The and Q is a more complex equation that is not described here. The
equations given above for embedded parity can be used to map a file equations given above for embedded parity can be used to map a file
offset to the correct component object by setting the number of offset to the correct component object by setting the number of
parity components to 2 instead of 1 for RAID4 or RAID5. Clients may parity components to 2 instead of 1 for RAID4 or RAID5. Clients may
simply choose to read data through the metadata server if two simply choose to read data through the metadata server if two
components are missing or damaged. components are missing or damaged.
Issue: this scheme also has a RAID_4 like layout where the ECC Issue: This scheme also has a RAID_4 like layout where the ECC blocks
blocks are stored on the same components on every stripe and a are stored on the same components on every stripe and a rotated,
rotated, RAID-5 like layout where the stripe units are rotated. RAID-5 like layout where the stripe units are rotated. Should we
Should we make the following properties orthogonal: RAID_4 or RAID_5 make the following properties orthogonal: RAID_4 or RAID_5 (i.e.,
(i.e., non-rotated or rotated), and then have the number of parity non-rotated or rotated), and then have the number of parity
components and the associated algorithm be the orthogonal parameter? components and the associated algorithm be the orthogonal parameter?
3.3.5 Usage and implementation notes 3.3.5 RAID Usage and implementation notes
RAID layouts with redundant data in their stripes require additional RAID layouts with redundant data in their stripes require additional
serialization of updates to ensure correct operation. Otherwise, if serialization of updates to ensure correct operation. Otherwise, if
two clients simultaneously write to the same logical range of an two clients simultaneously write to the same logical range of an
object, the result could include different data in the same ranges of object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information. It is the mirrored tuples, or corrupt parity information. It is the
responsibility of the metadata server to enforce serialization responsibility of the metadata server to enforce serialization
requirements such as this. For example, the metadata server may do requirements such as this. For example, the metadata server may do
so by not granting overlapping write layouts within mirrored objects. so by not granting overlapping write layouts within mirrored objects.
3.4 pnfs_layoutupdate4 4. Object-Based Layout Update
The pnfs_layoutupdate4 type is an opaque value at the generic pNFS pnfs_layoutupdate4 is used in the LAYOUTCOMMIT operation to convey
client level. If the type is LAYOUT_OSD_OBJECTS, then the opaque updates to the layout and additional information to the metadata
value is described by the pnfs_osd_layoutupdate4 type. This type server. It is defined in the NFSv4.1 draft [5] as follows:
conveys error information, timestamp information, and capacity used
information back to the metadata server.
struct pnfs_layoutupdate4 { struct pnfs_layoutupdate4 {
pnfs_layouttype4 type; pnfs_layouttype4 type;
opaque layoutupdate_data<>; opaque layoutupdate_data<>;
}; };
enum pnfs_osd_errno { The pnfs_layoutupdate4 type is an opaque value at the generic pNFS
PNFS_OBJ_NOT_FOUND = 1, client level. If the type is LAYOUT_OSD2_OBJECTS, then the opaque
PNFS_OBJ_NO_SPACE = 2, value is described by the pnfs_osd_layoutupdate4 type.
PNFS_OBJ_EIO = 3,
PNFS_OBJ_BAD_CRED = 4,
PNFS_OBJ_NO_ACCESS = 5,
PNFS_OBJ_UNREACHABLE = 6
};
struct pnfs_osd_ioerr4 { 4.1 pnfs_osd_layoutupdate4
pnfs_osd_objid4 component;
length4 offset; struct pnfs_osd_layoutupdate4 {
length4 length; pnfs_osd_deltaspaceused4 delta_space_used;
pnfs_osd_errno errno; pnfs_osd_ioerr4 ioerr<>;
}; };
union deltaspaceused4 switch (bool valid) { Object-Based pNFS clients are not allowed to modify the layout.
"delta_space_used" is used to convey capacity usage information back
to the metadata server and, in case OSD I/O operations failed,
"ioerr" is used to report these errors to the metadata server.
4.1.1 pnfs_osd_deltaspaceused4
union pnfs_osd_deltaspaceused4 switch (bool valid) {
case TRUE: case TRUE:
length4 delta; /* Bytes consumed by write activity */ length4 delta; /* Bytes consumed by write activity */
case FALSE: case FALSE:
void; void;
}
struct pnfs_osd_layoutupdate4 {
deltaspaceused4 delta_space_used;
newtime4 time_metadata;
pnfs_osd_ioerr4 ioerr<>;
}; };
Figure 16 pnfs_osd_deltaspaceused4 is used to convey space utilization
The deltaspaceused4 type is used to convey space utilization
information at the time of LAYOUTCOMMIT. For the file system to information at the time of LAYOUTCOMMIT. For the file system to
properly maintain capacity used information, it needs to track how properly maintain capacity used information, it needs to track how
much capacity was consumed by WRITE operations performed by the much capacity was consumed by WRITE operations performed by the
client. In this protocol, the OSD returns the capacity consumed by a client. In this protocol, the OSD returns the capacity consumed by a
write, which can be different because of internal overhead like write, which can be different because of internal overhead like
block-based allocation and indirect blocks, and the client reflects block-based allocation and indirect blocks, and the client reflects
this back to the pNFS server so it can accurately track quota. The this back to the pNFS server so it can accurately track quota. The
pNFS server can choose to trust this information coming from the pNFS server can choose to trust this information coming from the
clients and therefore avoid querying the OSDs at the time of clients and therefore avoid querying the OSDs at the time of
LAYOUTCOMMIT. If the client is unable to obtain this information LAYOUTCOMMIT. If the client is unable to obtain this information
from the OSD, it simply returns invalid deltaspaceused4. from the OSD, it simply returns invalid deltaspaceused4.
The time_metadata value indicates the new modify time of the file. 4.1.2 pnfs_osd_errno4
The server can choose to trust the client's view of this attribute,
or it can query storage to determine the actual modify time. A
file's modify time will be the latest modify time among all
components of the file. A client can avoid returning time
information by returning an invalid time_metadata (i.e., the
time_changed union discriminator is FALSE.)
The pnfs_osd_ioerr4 returns error indications for objects that enum pnfs_osd_errno4 {
generated errors during data transfers. These are hints to the PNFS_OSD_NOT_FOUND = 1,
metadata server that there are problems with that object. PNFS_OSD_NO_SPACE = 2,
PNFS_OSD_EIO = 3,
PNFS_OSD_BAD_CRED = 4,
PNFS_OSD_NO_ACCESS = 5,
PNFS_OSD_UNREACHABLE = 6
};
PNFS_OBJ_NOT_FOUND indicates the object ID specifies an object that pnfs_osd_errno4 is used to represent error types when read/write
does not exist on the Object Storage Device. errors are reported to the metadata server.
PNFS_OBJ_NO_SPACE indicates the operation failed because the Object o PNFS_OSD_NOT_FOUND indicates the object ID specifies an object
Storage Device ran out of free capacity during the operation. that does not exist on the Object Storage Device.
PNFS_OBJ_EIO indicates the operation failed because the Object o PNFS_OSD_NO_SPACE indicates the operation failed because the
Object Storage Device ran out of free capacity during the operation.
o PNFS_OSD_EIO indicates the operation failed because the Object
Storage Device experienced a failure trying to access the object. Storage Device experienced a failure trying to access the object.
The most common source of these errors is media errors, but other The most common source of these errors is media errors, but other
internal errors might cause this. In this case, the metadata server internal errors might cause this. In this case, the metadata server
should go examine the broken object more closely. should go examine the broken object more closely.
PNFS_OBJ_BAD_CRED indicates the security parameters are not valid. o PNFS_OSD_BAD_CRED indicates the security parameters are not valid.
The primary cause of this is that the capability has expired, or the The primary cause of this is that the capability has expired, or the
security policy tag (i.e., capability version number) has been security policy tag (i.e., capability version number) has been
changed to revoke capabilities. The client will need to return the changed to revoke capabilities. The client will need to return the
layout and get a new one with fresh capabilities. layout and get a new one with fresh capabilities.
PNFS_OBJ_NO_ACCESS indicates the capability does not allow the o PNFS_OSD_NO_ACCESS indicates the capability does not allow the
requested operation. This should not occur in normal operation requested operation. This should not occur in normal operation
because the metadata server should give out correct capabilities, or because the metadata server should give out correct capabilities, or
none at all. none at all.
PNFS_OBJ_UNREACHABLE indicates the client was unable to contact the o PNFS_OSD_UNREACHABLE indicates the client was unable to contact
Object Storage Device due to a communication failure. the Object Storage Device due to a communication failure.
4. Security Considerations 4.1.3 pnfs_osd_ioerr4
struct pnfs_osd_ioerr4 {
pnfs_osd_objid4 component;
length4 offset;
length4 length;
bool iswrite;
pnfs_osd_errno4 errno;
};
The pnfs_osd_ioerr4 structure is used to return error indications for
objects that generated errors during data transfers. These are hints
to the metadata server that there are problems with that object. For
each error, "component", "offset", and "length" represent the object
and byte range in which the error occurred. "iswrite" is set to
"true" if the failed OSD operation was data modifying, and "errno"
represents the type of error.
5. Object-Based Creation Layout Hint
The pnfs_layouthint4 type is defined in the NFSv4.1 draft [5] as
follows:
struct pnfs_layouthint4 {
pnfs_layouttype4 type;
opaque layouthint_data<>;
};
The pnfs_layouthint4 type is an opaque value at the generic pNFS
client level. If the layout type is LAYOUT_OSD2_OBJECTS, then the
opaque value is described by the pnfs_osd_layouthint4 type.
5.1 pnfs_osd_layouthint4
union num_comps_hint4 switch (bool valid) {
case TRUE:
uint32_t num_comps;
case FALSE:
void;
};
union stripe_unit_hint4 switch (bool valid) {
case TRUE:
length4 stripe_unit;
case FALSE:
void;
};
union group_width_hint4 switch (bool valid) {
case TRUE:
uint16_t group_width;
case FALSE:
void;
};
union group_depth_hint4 switch (bool valid) {
case TRUE:
uint16_t group_depth;
case FALSE:
void;
};
union mirror_cnt_hint4 switch (bool valid) {
case TRUE:
uint16_t mirror_cnt;
case FALSE:
void;
};
union raid_algorithm_hint4 switch (bool valid) {
case TRUE:
pnfs_osd_raid_algorithm4 raid_algorithm;
case FALSE:
void;
};
struct pnfs_osd_layouthint4 {
num_comps_hint4 num_comps_hint;
stripe_unit_hint4 stripe_unit_hint;
group_width_hint4 group_width_hint;
group_depth_hint4 group_depth_hint;
mirror_cnt_hint4 mirror_cnt_hint;
raid_algorithm_hint4 raid_algorithm_hint;
};
This type conveys hints for the desired data map. All parameters are
optional so the client can give values for only the parameters it
cares about, e.g. it can provide a hint for the desired number of
mirrored components, regardless of the RAID algorithm selected
for the file. The server should make an attempt to honor the hints
but it can ignore any or all of them at its own discretion and
without failing the respective create operation.
The num_comps hint can be used to limit the total number of component
objects comprising the file. All other hints correspond directly to
the different fields of pnfs_osd_data_map4.
6. Layout Segments
The pnfs layout operations operate on logical byte ranges. There is
no requirement in the protocol for any relationship between byte
ranges used in LAYOUTGET to acquire layouts and byte ranges used in
CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD
capabilities poses limitations on these operations since the
capabilities associated with layout segments cannot be merged or
split. The following guidelines should be followed for proper
operation of object-based layouts.
6.1 CB_LAYOUTRECALL and LAYOUTRETURN
In general, the object-based layout driver should keep track of each
layout segment it got, keeping record of the segment's iomode,
offset, and length. The server should allow the client to get
multiple overlapping layout segments but is free to recall the layout
to prevent overlap.
In response to CB_LAYOUTRECALL, the client should return all layout
segments matching the given iomode and overlapping with the recalled
range. When returning the layouts for this byte range with
LAYOUTRETURN the client MUST NOT return a sub-range of a layout
segment it has; each LAYOUTRETURN sent MUST completely cover at least
one outstanding layout segment.
The server, in turn, should release any segment that exactly matches
the clientid, iomode, and byte range given in LAYOUTRETURN. If no
exact match is found then the server should release all layout
segments matching the clientid and iomode and that are fully
contained in the returned byte range. If none are found and the byte
range is a subset of an outstanding layout segment for the same
clientid and iomode, then the client can be considered malfunctioning
and the server SHOULD recall all layouts from this client to reset
its state. If this behavior repeats the server SHOULD deny all
LAYOUTGETs from this client.
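The server-side matching rules above can be sketched as follows. This is an illustration under stated assumptions, not a normative algorithm. The Segment type, its field names, and the function name are hypothetical, and byte ranges are modeled as half-open intervals [offset, offset + length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    clientid: int
    iomode: str
    offset: int
    length: int

    @property
    def end(self):
        return self.offset + self.length


def segments_to_release(outstanding, ret):
    """Return (segments to release, client_malfunctioning) for one LAYOUTRETURN."""
    same = [s for s in outstanding
            if s.clientid == ret.clientid and s.iomode == ret.iomode]
    # 1. Prefer an exact match on the byte range.
    exact = [s for s in same if s.offset == ret.offset and s.length == ret.length]
    if exact:
        return exact, False
    # 2. Otherwise release every segment fully contained in the returned range.
    contained = [s for s in same if ret.offset <= s.offset and s.end <= ret.end]
    if contained:
        return contained, False
    # 3. A return that is a strict sub-range of a held segment signals a faulty client.
    malfunction = any(s.offset <= ret.offset and ret.end <= s.end for s in same)
    return [], malfunction
```

A caller would react to a true second element by recalling all layouts from that client, as the text above recommends.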
6.2 LAYOUTCOMMIT
LAYOUTCOMMIT is only used by object-based pNFS to convey modified
attribute hints and/or to report I/O errors to the MDS. Therefore,
the offset and length in LAYOUTCOMMIT4args are reserved for future
use and should be set to 0. However, component byte ranges in the
optional pnfs_osd_ioerr4 structure are used for recovering the object
and MUST be set by the client to cover all failed I/O operations to
the component.
7. Recalling Layouts
The object-based metadata server should recall outstanding layouts in
the following cases:
o When the file's security policy changes, i.e. ACLs or permission
mode bits are set.
o When the file's aggregation map changes, rendering outstanding
layouts invalid.
o When there are sharing conflicts. For example, the server will
issue stripe aligned layout segments for RAID-5 objects. To prevent
corruption of the file's parity, multiple clients must not hold valid
write layouts for the same stripes. An outstanding RW layout should
be recalled when a conflicting LAYOUTGET is received from a different
client for LAYOUTIOMODE_RW and for a byte-range overlapping with the
outstanding layout segment.
8. Security Considerations
The pNFS extension partitions the NFSv4 file system protocol into two The pNFS extension partitions the NFSv4 file system protocol into two
parts, the control path and the data path (storage protocol). The parts, the control path and the data path (storage protocol). The
control path contains all the new operations described by this control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path. The combination of components in a pNFS system
is required to preserve the security properties of NFSv4 with respect is required to preserve the security properties of NFSv4 with respect
to an entity accessing data via a client, including security to an entity accessing data via a client, including security
countermeasures to defend against threats that NFSv4 provides countermeasures to defend against threats that NFSv4 provides
defenses for in environments where these threats are considered defenses for in environments where these threats are considered
significant. significant.
4.1 Security Data Types The object storage protocol MUST implement the security aspects
described in version 1 of the T10 OSD protocol definition [2]. The
remainder of this section gives an overview of the security mechanism
described in that standard. The goal is to give the reader a basic
understanding of the object security model. Any discrepancies
between this text and the actual standard are obviously to be
resolved in favor of the OSD standard.
8.1 OSD Security Data Types
There are three main data types associated with object security: a There are three main data types associated with object security: a
capability, a credential, and security parameters. The capability is capability, a credential, and security parameters. The capability is
a set of fields that specifies an object and what operations can be a set of fields that specifies an object and what operations can be
performed on it. A credential is a signed capability. Only a performed on it. A credential is a signed capability. Only a
security manager that knows the secret device keys can correctly sign security manager that knows the secret device keys can correctly sign
a capability to form a valid credential. In pNFS, the file server a capability to form a valid credential. In pNFS, the file server
acts as the security manager and returns signed capabilities (i.e., acts as the security manager and returns signed capabilities (i.e.,
credentials) to the pNFS client. The security parameters are values credentials) to the pNFS client. The security parameters are values
computed by the issuer of OSD commands (i.e., the client) that prove computed by the issuer of OSD commands (i.e., the client) that prove
they hold valid credentials. The client uses the credential as a they hold valid credentials. The client uses the credential as a
signing key to sign the requests it makes to OSD, and puts the signing key to sign the requests it makes to OSD, and puts the
resulting signatures into the security_parameters field of the OSD resulting signatures into the security_parameters field of the OSD
command. The object storage device uses the secret keys it shares command. The object storage device uses the secret keys it shares
with the security manager to validate the signature values in the with the security manager to validate the signature values in the
security parameters. security parameters.
The security types are opaque to the generic layers of the pNFS The security types are opaque to the generic layers of the pNFS
client. The credential is defined as opaque within the client. The credential is defined as opaque within the
pnfs_obj_and_cred type. Instead of repeating the definitions here, pnfs_osd_and_cred type. Instead of repeating the definitions here,
the reader is referred to section 4.9.2.2 of the OSD standard. the reader is referred to section 4.9.2.2 of the OSD standard.
4.2 Security Protocol 8.2 The OSD Security Protocol
The object storage protocol relies on a cryptographically secure The object storage protocol relies on a cryptographically secure
capability to control accesses at the object storage devices. capability to control accesses at the object storage devices.
Capabilities are generated by the metadata server, returned to the Capabilities are generated by the metadata server, returned to the
client, and used by the client as described below to authenticate client, and used by the client as described below to authenticate
their requests to the Object Storage Device (OSD). Capabilities their requests to the Object Storage Device (OSD). Capabilities
therefore achieve the required access and open mode checking. They therefore achieve the required access and open mode checking. They
allow the file server to define and check a policy (e.g., open mode) allow the file server to define and check a policy (e.g., open mode)
and the OSD to enforce that policy without knowing the details (e.g., and the OSD to enforce that policy without knowing the details (e.g.,
user IDs and ACLs). user IDs and ACLs).
Since capabilities are tied to layouts, and since they are used to
enforce access control, when the file ACL or mode changes the
outstanding capabilities MUST be revoked to enforce the new access
permissions. The server SHOULD recall layouts to allow clients to
gracefully return their capabilities before the access permissions
change.
Each capability is specific to a particular object, an operation on Each capability is specific to a particular object, an operation on
that object, a byte range within the object (in OSDv2), and has an that object, a byte range within the object (in OSDv2), and has an
explicit expiration time. The capabilities are signed with a secret explicit expiration time. The capabilities are signed with a secret
key that is shared by the object storage devices (OSD) and the key that is shared by the object storage devices (OSD) and the
metadata managers. Clients do not have device keys so they are metadata managers. Clients do not have device keys so they are
unable to forge the signatures in the security parameters. The unable to forge the signatures in the security parameters. The
combination of a capability and its signature is called a combination of a capability and its signature is called a
"credential" in the OSD specification. "credential" in the OSD specification.
The details of the security and privacy model for Object Storage are The details of the security and privacy model for Object Storage are
defined in the T10 OSD standard. The following sketch of the defined in the T10 OSD standard. The following sketch of the
algorithm should help the reader understand the basic model. algorithm should help the reader understand the basic model.
LAYOUTGET returns a CapKey, which is also called a credential. It is LAYOUTGET returns a CapKey, which is also called a credential. It is
a capability and a signature over that capability. a capability and a signature over that capability.
{CapKey = MAC<SecretKey>(CapArgs), CapArgs} CapKey = MAC<SecretKey>(CapArgs)
Credential = {CapKey, CapArgs}
The client uses CapKey to sign all the requests it issues for that The client uses CapKey to sign all the requests it issues for that
object using the respective CapArgs. In other words, the CapArgs object using the respective CapArgs. In other words, the CapArgs
appears in the request to the storage device, and that request is appears in the request to the storage device, and that request is
signed with the CapKey as follows: signed with the CapKey as follows:
ReqMAC = MAC<CapKey>(Req, Nonceln) ReqMAC = MAC<CapKey>(Req, Nonceln)
Request = {CapArgs, Req, Nonceln, ReqMAC}
The following is sent to the OSD: {CapArgs, Req, Nonceln, ReqMAC}. The following is sent to the OSD: {CapArgs, Req, Nonceln, ReqMAC}.
The OSD uses the SecretKey it shares with the metadata server to The OSD uses the SecretKey it shares with the metadata server to
compare the ReqMAC the client sent with a locally computed compare the ReqMAC the client sent with a locally computed value:
MAC<MAC<SecretKey>(CapArgs)>(Req, Nonceln) MAC<MAC<SecretKey>(CapArgs)>(Req, Nonceln)
and if they match the OSD assumes that the capabilities came from an and if they match the OSD assumes that the capabilities came from an
authentic metadata server and allows access to the object, as allowed authentic metadata server and allows access to the object, as allowed
by the CapArgs. Therefore, if the server's LAYOUTGET reply, holding by the CapArgs. Therefore, if the server's LAYOUTGET reply, holding
CapKey and CapArgs, is snooped by another client, it can be used to CapKey and CapArgs, is snooped by another client, it can be used to
generate valid OSD requests (within the CapArgs access restriction). generate valid OSD requests (within the CapArgs access restriction).
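The signing and verification steps sketched above can be illustrated in Python. This is a sketch only, under several assumptions: HMAC-SHA256 stands in for whatever MAC the OSD standard specifies, the length prefix on each field is an illustrative way to keep concatenated fields unambiguous, and the function names and byte-string encodings are hypothetical.

```python
import hashlib
import hmac


def mac(key, *parts):
    """MAC<key>(parts...), with each part length-prefixed (illustrative encoding)."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for p in parts:
        h.update(len(p).to_bytes(4, "big"))
        h.update(p)
    return h.digest()


def issue_credential(secret_key, cap_args):
    """Metadata server: CapKey = MAC<SecretKey>(CapArgs)."""
    return mac(secret_key, cap_args), cap_args


def sign_request(cap_key, cap_args, req, nonce):
    """Client: ReqMAC = MAC<CapKey>(Req, nonce); send {CapArgs, Req, nonce, ReqMAC}."""
    return cap_args, req, nonce, mac(cap_key, req, nonce)


def osd_verify(secret_key, request):
    """OSD: recompute MAC<MAC<SecretKey>(CapArgs)>(Req, nonce) and compare."""
    cap_args, req, nonce, req_mac = request
    expected = mac(mac(secret_key, cap_args), req, nonce)
    return hmac.compare_digest(expected, req_mac)
```

Note that the OSD never sees CapKey on the wire. It rederives the key from the CapArgs in the request, which is why altering CapArgs invalidates the request MAC.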
To meet the privacy requirements for the capabilities To meet the privacy requirements for the capabilities
returned by LAYOUTGET, the GSS-API can be used, e.g. by using a returned by LAYOUTGET, the GSS-API can be used, e.g. by using a
session key known to the file server and to the client to encrypt the session key known to the file server and to the client to encrypt the
whole layout or parts of it. Two general ways to provide privacy in whole layout or parts of it. Two general ways to provide privacy in
the absence of GSS-API that are independent of NFSv4 are either an the absence of GSS-API that are independent of NFSv4 are either an
isolated network such as a VLAN or a secure channel provided by isolated network such as a VLAN or a secure channel provided by
IPsec. IPsec.
4.3 Revoking capabilities 8.3 Revoking capabilities
At any time, the metadata server may invalidate all outstanding At any time, the metadata server may invalidate all outstanding
capabilities on an object by changing its capability version capabilities on an object by changing its capability version
attribute. There is also a "fence bit" attribute that the metadata attribute. There is also a "fence bit" attribute that the metadata
server can toggle to temporarily block access without permanently server can toggle to temporarily block access without permanently
revoking capabilities. The value of the fence bit and the capability revoking capabilities. The value of the fence bit and the capability
version are part of a capability, and they must match the state of version are part of a capability, and they must match the state of
the attributes. If they do not match, the OSD rejects accesses to the attributes. If they do not match, the OSD rejects accesses to
the object. When a client attempts to use a capability and discovers the object. When a client attempts to use a capability and discovers
a capability version mismatch, it should issue a LAYOUTRETURN for the a capability version mismatch, it should issue a LAYOUTRETURN for the
object and specify PNFS_OBJ_BAD_CRED in the pnfs_obj_ioerr parameter. object and specify PNFS_OSD_BAD_CRED in the pnfs_osd_ioerr parameter.
The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or
LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed
set of capabilities. set of capabilities.
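
The matching rule for the capability version and the fence bit can be
sketched as below. This is an illustrative model only: the class and function
names (Capability, ObjectAttrs, osd_allows, revoke_all, toggle_fence) are
hypothetical, and representing the capability version as an incrementing
integer is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Values copied into a capability when the metadata server issues it."""
    object_id: int
    cap_version: int
    fence_bit: bool


@dataclass
class ObjectAttrs:
    """Current attribute state of the object as seen by the OSD."""
    cap_version: int = 1
    fence_bit: bool = False


def osd_allows(cap: Capability, attrs: ObjectAttrs) -> bool:
    # Both values carried in the capability must match the object's
    # attributes; otherwise the OSD rejects the access.
    return (cap.cap_version == attrs.cap_version
            and cap.fence_bit == attrs.fence_bit)


def revoke_all(attrs: ObjectAttrs) -> None:
    # Changing the capability version invalidates every outstanding capability.
    attrs.cap_version += 1


def toggle_fence(attrs: ObjectAttrs) -> None:
    # Flipping the fence bit blocks access temporarily; flipping it back
    # makes previously issued capabilities usable again.
    attrs.fence_bit = not attrs.fence_bit
```

In this model a fenced capability becomes valid again once the bit is
restored, whereas a version change is permanent and the client has to fetch
a fresh layout.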
The metadata server may elect to change the capability version on an The metadata server may elect to change the capability version on an
object at any time, for any reason (with the understanding that there object at any time, for any reason (with the understanding that there
is likely an associated performance penalty, especially if there are is likely an associated performance penalty, especially if there are
outstanding layouts for this object). The metadata server MUST outstanding layouts for this object). The metadata server MUST
revoke outstanding capabilities when any one of the following occurs: revoke outstanding capabilities when any one of the following occurs:
(1) the permissions on the object change, (2) a conflicting mandatory (1) the permissions on the object change, (2) a conflicting mandatory
skipping to change at page 18, line 28 skipping to change at page 21, line 40
either READ or READ/WRITE. It is the pNFS client's responsibility to either READ or READ/WRITE. It is the pNFS client's responsibility to
enforce access control among multiple users accessing the same file. enforce access control among multiple users accessing the same file.
It is neither required nor expected that the pNFS client will obtain It is neither required nor expected that the pNFS client will obtain
a separate layout for each user accessing a shared object. The a separate layout for each user accessing a shared object. The
client SHOULD use ACCESS calls to check user permissions when client SHOULD use ACCESS calls to check user permissions when
performing I/O so that the server's access control policies are performing I/O so that the server's access control policies are
correctly enforced. The result of the ACCESS operation may be cached correctly enforced. The result of the ACCESS operation may be cached
indefinitely, as the server is expected to recall layouts when the indefinitely, as the server is expected to recall layouts when the
file's access permissions or ACL change. file's access permissions or ACL change.
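
A client-side cache of ACCESS results, dropped when a layout is recalled,
might look like the following sketch. The AccessCache class, its method
names, and the access_rpc callable are hypothetical stand-ins for whatever
the client uses to issue the NFSv4 ACCESS operation.

```python
class AccessCache:
    """Caches per-user ACCESS results for files with an outstanding layout."""

    def __init__(self, access_rpc):
        # access_rpc(user, file_id) -> set of granted rights (hypothetical).
        self._access_rpc = access_rpc
        self._cache = {}

    def allowed(self, user, file_id, right):
        key = (user, file_id)
        if key not in self._cache:
            # First I/O by this user on this file: ask the server once.
            self._cache[key] = self._access_rpc(user, file_id)
        return right in self._cache[key]

    def on_layout_recall(self, file_id):
        # The server recalls layouts when permissions or the ACL change,
        # so every cached result for the file is dropped at that point.
        for key in [k for k in self._cache if k[1] == file_id]:
            del self._cache[key]
```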
5. Normative References 9. References
9.1 Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", RFC 2119, March 1997.
[2] Weber, R., "SCSI Object-Based Storage Device Commands", [2] Weber, R., "SCSI Object-Based Storage Device Commands",
July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>. July 2004, <http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf>.
[3] Goodson, G., "NFSv4 pNFS Extensions", October 2005, <ftp:// [3] Eisler, M., "XDR: External Data Representation Standard",
www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-00.txt>. RFC 4506, May 2006.
[4] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, [4] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS) C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003. version 4 Protocol", RFC 3530, April 2003.
9.2 Informative References
[5] Shepler, S., "NFSv4 Minor Version 1", June 2006, <http://
www.ietf.org/internet-drafts/
draft-ietf-nfsv4-minorversion1-03.txt>.
[6] Weber, R., "SCSI Object-Based Storage Device Commands -2
(OSD-2)", October 2004,
<http://www.t10.org/ftp/t10/drafts/osd2/osd2r00.pdf>.
Authors' Addresses Authors' Addresses
Benny Halevy Benny Halevy
Panasas, Inc. Panasas, Inc.
1501 Reedsdale St. Suite 400 1501 Reedsdale St. Suite 400
Pittsburgh, PA 15233 Pittsburgh, PA 15233
USA USA
Phone: +1-412-323-3500 Phone: +1-412-323-3500
Email: bhalevy@panasas.com Email: bhalevy@panasas.com
 End of changes. 84 change blocks. 
223 lines changed or deleted 454 lines changed or added
