draft-ietf-nfsv4-pnfs-block-08.txt

NFSv4 Working Group                                        D. Black
Internet Draft                                          S. Fridella
Expires: October 2, 2008                                 J. Glasgow
Intended Status: Proposed Standard                  EMC Corporation
                                                      April 1, 2008

pNFS Block/Volume Layout
draft-ietf-nfsv4-pnfs-block-08.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that
any applicable patent or other IPR claims of which he or she is
aware have been or will be disclosed, and any of which he or she
becomes aware will be disclosed, in accordance with Section 6 of
BCP 79.

Internet-Drafts are working documents of the Internet Engineering
skipping to change at page 2, line 40
2.3.6. End-of-file Processing..............................18
2.3.7. Layout Hints........................................18
2.3.8. Client Fencing......................................19
2.4. Crash Recovery Issues....................................21
2.5. Recalling resources: CB_RECALL_ANY.......................21
2.6. Transient and Permanent Errors...........................22
3. Security Considerations.......................................22
4. Conclusions...................................................24
5. IANA Considerations...........................................24
6. Acknowledgments...............................................24
7. References....................................................25
7.1. Normative References.....................................25
7.2. Informative References...................................25
Author's Addresses...............................................25
Intellectual Property Statement..................................26
Disclaimer of Validity...........................................26
Copyright Statement..............................................27
Acknowledgment...................................................27
1. Introduction
skipping to change at page 5, line 8
The effect of the script is to remove both leading white space and a
sentinel sequence of "///" from each matching line.
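The extraction step described above can be sketched in a few lines of Python; the function name and interface are illustrative (the draft itself uses a shell script for this purpose), but the matching rule follows the description: keep only lines carrying the "///" sentinel, dropping leading white space and the sentinel itself.

```python
import re

def extract_xdr(draft_text):
    """Keep only the embedded-XDR lines of a draft: lines whose first
    non-whitespace characters are the '///' sentinel.  Leading white
    space and the sentinel are stripped; all other lines are dropped.
    Illustrative stand-in for the draft's extraction script."""
    out = []
    for line in draft_text.splitlines():
        m = re.match(r"^\s*///(.*)$", line)
        if m:
            out.append(m.group(1))
    return "\n".join(out)
```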
The embedded XDR file header follows, with subsequent pieces embedded
throughout the document:
////*
/// * This file was machine generated for
/// * draft-ietf-nfsv4-pnfs-block-07
/// * Last updated Tue Apr 1 15:57:06 EST 2008
/// */
////*
/// * Copyright (C) The IETF Trust (2007-2008)
/// * All Rights Reserved.
/// *
/// * Copyright (C) The Internet Society (1998-2006).
/// * All Rights Reserved.
/// */
///
////*
skipping to change at page 7, line 6
access to the same volume via both iSCSI and Fibre Channel is
possible, hence network addresses are difficult to use for volume
identification. For this reason, this pNFS block layout identifies
storage volumes by content, for example providing the means to match
(unique portions of) labels used by volume managers. Any block pNFS
system using this layout MUST support a means of content-based unique
volume identification that can be employed via the data structure
given here.
///struct pnfs_block_sig_component4 { /* disk signature component */
/// int64_t bsc_sig_offset; /* byte offset of component
/// on volume*/
/// opaque bsc_contents<>; /* contents of this component
/// of the signature */
///};
///
Note that the opaque "bsc_contents" field in the
"pnfs_block_sig_component4" structure MUST NOT be interpreted as a
zero-terminated string, as it may contain embedded zero-valued bytes.
There are no restrictions on alignment (e.g., neither bsc_sig_offset
nor the length are required to be multiples of 4). The
bsc_sig_offset is a signed quantity which when positive represents a
byte offset from the start of the volume, and when negative
represents a byte offset from the end of the volume.
Negative offsets are permitted in order to simplify the client
implementation on systems where the device label is found at a fixed
offset from the end of the volume. If the server uses negative
offsets to describe the signature, then the client and server MUST
NOT see different volume sizes. Negative offsets SHOULD NOT be used
in systems that dynamically resize volumes unless care is taken to
ensure that the device label is always present at the offset from the
end of the volume as seen by the clients.
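The signed-offset convention above amounts to a single line of arithmetic; the following sketch shows it, with the helper name and parameters (resolve_sig_offset, volume_size) being illustrative rather than part of the protocol.

```python
def resolve_sig_offset(bsc_sig_offset, volume_size):
    """Map a signed bsc_sig_offset to an absolute byte offset on the
    volume: non-negative values count from the start of the volume,
    negative values count back from the end.  Illustrative helper;
    volume_size must be the size as seen by both client and server."""
    if bsc_sig_offset >= 0:
        return bsc_sig_offset
    return volume_size + bsc_sig_offset
```

Note that the negative branch only works when client and server agree on volume_size, which is exactly why the draft forbids them seeing different volume sizes when negative offsets are used.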
skipping to change at page 8, line 20
/// PNFS_BLOCK_VOLUME_CONCAT = 2, /* volume is a
/// concatenation of
/// multiple volumes */
/// PNFS_BLOCK_VOLUME_STRIPE = 3 /* volume is striped across
/// multiple volumes */
///};
///
///const PNFS_BLOCK_MAX_SIG_COMP = 16; /* maximum components per
/// signature */
///struct pnfs_block_simple_volume_info4 {
/// pnfs_block_sig_component4 bsv_ds<PNFS_BLOCK_MAX_SIG_COMP>;
/// /* disk signature */
///};
///
///
///struct pnfs_block_slice_volume_info4 {
/// offset4 bsv_start; /* offset of the start of the
/// slice in bytes */
/// length4 bsv_length; /* length of slice in bytes */
/// uint32_t bsv_volume; /* array index of sliced
/// volume */
///};
///
///struct pnfs_block_concat_volume_info4 {
/// uint32_t bcv_volumes<>; /* array indices of volumes
/// which are concatenated */
///};
///
///struct pnfs_block_stripe_volume_info4 {
/// length4 bsv_stripe_unit; /* size of stripe in bytes */
/// uint32_t bsv_volumes<>; /* array indices of volumes
/// which are striped across --
/// MUST be same size */
///};
///
///union pnfs_block_volume4 switch (pnfs_block_volume_type4 type) {
/// case PNFS_BLOCK_VOLUME_SIMPLE:
/// pnfs_block_simple_volume_info4 bv_simple_info;
/// case PNFS_BLOCK_VOLUME_SLICE:
/// pnfs_block_slice_volume_info4 bv_slice_info;
/// case PNFS_BLOCK_VOLUME_CONCAT:
/// pnfs_block_concat_volume_info4 bv_concat_info;
/// case PNFS_BLOCK_VOLUME_STRIPE:
/// pnfs_block_stripe_volume_info4 bv_stripe_info;
///};
///
////* block layout specific type for da_addr_body */
///struct pnfs_block_deviceaddr4 {
/// pnfs_block_volume4 bda_volumes<>; /* array of volumes */
///};
///
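As a sketch of how a client might walk such a nested topology, the following Python resolves a byte offset on one of these volume types down to a simple (physical) volume. The dict encoding and the resolve helper are purely illustrative; the field names echo the XDR above. PNFS_BLOCK_VOLUME_CONCAT is omitted because translating through a concatenation needs the component volume sizes, which the client learns from the devices themselves rather than from this structure.

```python
def resolve(volumes, index, offset):
    """Translate a byte offset on volumes[index] down the volume
    hierarchy to a (simple-volume index, byte offset) pair.
    Illustrative only; `volumes` is a flat array as in
    pnfs_block_deviceaddr4, with sub-volumes referenced by index."""
    v = volumes[index]
    if v["type"] == "SIMPLE":
        return index, offset
    if v["type"] == "SLICE":
        # a slice is a contiguous window into another volume
        return resolve(volumes, v["bsv_volume"], v["bsv_start"] + offset)
    if v["type"] == "STRIPE":
        unit = v["bsv_stripe_unit"]
        comps = v["bsv_volumes"]          # round-robin stripe components
        stripe, within = divmod(offset, unit)
        target = comps[stripe % len(comps)]
        return resolve(volumes, target,
                       (stripe // len(comps)) * unit + within)
    raise ValueError("unhandled volume type: " + v["type"])
```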
The "pnfs_block_deviceaddr4" data structure allows arbitrarily
complex nested volume structures to be encoded. The types of
aggregations that are allowed are stripes, concatenations, and
slices. Note that the volume topology expressed in the
pnfs_block_deviceaddr4 data structure will always resolve to a set of
pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE. The array of
volumes is ordered such that the root of the volume hierarchy is
skipping to change at page 11, line 26
/// extent. There is physical
/// space on the volume. */
/// PNFS_BLOCK_NONE_DATA = 3 /* the location is invalid. It
/// is a hole in the file.
/// There is no physical space
/// on the volume. */
///};
///
///struct pnfs_block_extent4 {
/// deviceid4 bex_vol_id; /* id of logical volume on
/// which extent of file is
/// stored. */
/// offset4 bex_file_offset; /* the starting byte offset in
/// the file */
/// length4 bex_length; /* the size in bytes of the
/// extent */
/// offset4 bex_storage_offset;/* the starting byte offset in
/// the volume */
/// pnfs_block_extent_state4 bex_state;
/// /* the state of this extent */
///};
///
////* block layout specific type for loc_body */
///struct pnfs_block_layout4 {
/// pnfs_block_extent4 blo_extents<>;
/// /* extents which make up this
/// layout. */
///};
///
The block layout consists of a list of extents which map the logical
regions of the file to physical locations on a volume. The
"bex_storage_offset" field within each extent identifies a location
on the logical volume specified by the "bex_vol_id" field in the
extent. The bex_vol_id itself is shorthand for the whole topology of
the logical volume on which the file is stored. The client is
responsible for translating this logical offset into an offset on the
appropriate underlying SAN logical unit. In most cases all extents
in a layout will reside on the same volume and thus have the same
bex_vol_id. In the case of copy on write file systems, the
PNFS_BLOCK_READ_DATA extents may have a different bex_vol_id from the
writable extents.

Each extent maps a logical region of the file onto a portion of the
specified logical volume. The bex_file_offset, bex_length, and
bex_state fields for an extent returned from the server are valid for
all extents. In contrast, the interpretation of the
bex_storage_offset field depends on the value of bex_state as follows
(in increasing order):
o PNFS_BLOCK_READ_WRITE_DATA means that bex_storage_offset is valid,
and points to valid/initialized data that can be read and written.

o PNFS_BLOCK_READ_DATA means that bex_storage_offset is valid and
points to valid/initialized data which can only be read. Write
operations are prohibited; the client may need to request a read-
write layout.

o PNFS_BLOCK_INVALID_DATA means that bex_storage_offset is valid,
but points to invalid un-initialized data. This data must not be
physically read from the disk until it has been initialized. A
read request for a PNFS_BLOCK_INVALID_DATA extent must fill the
user buffer with zeros, unless the extent is covered by a
PNFS_BLOCK_READ_DATA extent of a copy-on-write file system. Write
requests must write whole server-sized blocks to the disk; bytes
not initialized by the user must be set to zero. Any write to
storage in a PNFS_BLOCK_INVALID_DATA extent changes the written
portion of the extent to PNFS_BLOCK_READ_WRITE_DATA; the pNFS
client is responsible for reporting this change via LAYOUTCOMMIT.

o PNFS_BLOCK_NONE_DATA means that bex_storage_offset is not valid,
and this extent may not be used to satisfy write requests. Read
requests may be satisfied by zero-filling as for
PNFS_BLOCK_INVALID_DATA. PNFS_BLOCK_NONE_DATA extents may be
returned by requests for readable extents; they are never returned
if the request was for a writeable extent.
An extent list lists all relevant extents in increasing order of the
bex_file_offset of each extent; any ties are broken by increasing
order of the extent state (bex_state).
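The ordering rule above can be expressed directly as a sort key. The sketch below encodes extents as dicts reusing the XDR field names; the state values are numbered in the "increasing order" the enumeration defines (PNFS_BLOCK_READ_WRITE_DATA = 0 through PNFS_BLOCK_NONE_DATA = 3), and sort_extent_list is an illustrative helper, not a protocol element.

```python
# Extent states in the draft's increasing order.
STATE_ORDER = {
    "PNFS_BLOCK_READ_WRITE_DATA": 0,
    "PNFS_BLOCK_READ_DATA": 1,
    "PNFS_BLOCK_INVALID_DATA": 2,
    "PNFS_BLOCK_NONE_DATA": 3,
}

def sort_extent_list(extents):
    """Order extents by increasing bex_file_offset, ties broken by
    increasing bex_state.  Illustrative dict-based encoding."""
    return sorted(extents,
                  key=lambda e: (e["bex_file_offset"],
                                 STATE_ORDER[e["bex_state"]]))
```

This tie-break is what makes a PNFS_BLOCK_READ_DATA extent precede the PNFS_BLOCK_INVALID_DATA extent that covers the same offset in a read-write layout.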
2.3.1. Layout Requests and Extent Lists

Each request for a layout specifies at least three parameters: file
offset, desired size, and minimum size. If the status of a request
indicates success, the extent list returned must meet the following
criteria:

o A request for a readable (but not writeable) layout returns only
PNFS_BLOCK_READ_DATA or PNFS_BLOCK_NONE_DATA extents (but not
skipping to change at page 13, line 46
read-only layout. For a read-write layout, the set of writable
extents (i.e., excluding PNFS_BLOCK_READ_DATA extents) MUST be
logically contiguous. Every PNFS_BLOCK_READ_DATA extent in a
read-write layout MUST be covered by one or more
PNFS_BLOCK_INVALID_DATA extents. This overlap of
PNFS_BLOCK_READ_DATA and PNFS_BLOCK_INVALID_DATA extents is the
only permitted extent overlap.

o Extents MUST be ordered in the list by starting offset, with
PNFS_BLOCK_READ_DATA extents preceding PNFS_BLOCK_INVALID_DATA
extents in the case of equal bex_file_offsets.
2.3.2. Layout Commits

////* block layout specific type for lou_body */
///struct pnfs_block_layoutupdate4 {
/// pnfs_block_extent4 blu_commit_list<>;
/// /* list of extents which
/// * now contain valid data.
/// */
///};
///
The "pnfs_block_layoutupdate4" structure is used by the client as the
block-protocol specific argument in a LAYOUTCOMMIT operation. The
"blu_commit_list" field is an extent list covering regions of the
file layout that were previously in the PNFS_BLOCK_INVALID_DATA
state, but have been written by the client and should now be
considered in the PNFS_BLOCK_READ_WRITE_DATA state. The bex_state
field of each extent in the blu_commit_list MUST be set to
PNFS_BLOCK_READ_WRITE_DATA. The extents in the commit list MUST be
disjoint and MUST be sorted by bex_file_offset. The
bex_storage_offset field is unused. Implementers should be aware
that a server may be unable to commit regions at a granularity
smaller than a file-system block (typically 4KB or 8KB). As noted
above, the block-size that the server uses is available as an NFSv4
attribute, and any extents included in the "blu_commit_list" MUST be
aligned to this granularity and have a size that is a multiple of
this granularity. If the client believes that its actions have moved
the end-of-file into the middle of a block being committed, the
client MUST write zeroes from the end-of-file to the end of that
block before committing the block. Failure to do so may result in
junk (uninitialized data) appearing in that area if the file is
subsequently extended by moving the end-of-file.
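The two client-side obligations in this paragraph, block-granular commit extents and zero-filling the tail of a block that contains the end-of-file, reduce to simple modular arithmetic. The helpers below are an illustrative sketch (the names is_commit_aligned and zero_fill_range are not protocol elements).

```python
def is_commit_aligned(bex_file_offset, bex_length, block_size):
    """Check the commit-list MUST: each extent aligned to, and a
    multiple of, the server's file-system block size."""
    return (bex_file_offset % block_size == 0
            and bex_length % block_size == 0)

def zero_fill_range(eof, block_size):
    """Return the (offset, length) region the client must write as
    zeroes before committing the block containing the end-of-file,
    or None when the EOF already falls on a block boundary."""
    rem = eof % block_size
    if rem == 0:
        return None
    return (eof, block_size - rem)
```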
2.3.3. Layout Returns

The LAYOUTRETURN operation is done without any block layout specific
data. When the LAYOUTRETURN operation specifies a
LAYOUTRETURN4_FILE_return type, then the layoutreturn_file4 data
structure specifies the region of the file layout that is no longer
needed by the client. The opaque "lrf_body" field of the
"layoutreturn_file4" data structure MUST have length zero. A
LAYOUTRETURN operation represents an explicit release of resources by
skipping to change at page 15, line 45
extents allows copy-on-write processing to be done by pNFS clients.
In classic NFS, this operation would be done by the server. Since
pNFS enables clients to do direct block access, it is useful for
clients to participate in copy-on-write operations. All block/volume
pNFS clients MUST support this copy-on-write processing.
When a client wishes to write data covered by a PNFS_BLOCK_READ_DATA
extent, it MUST have requested a writable layout from the server;
that layout will contain PNFS_BLOCK_INVALID_DATA extents to cover all
the data ranges of that layout's PNFS_BLOCK_READ_DATA extents. More
precisely, for any bex_file_offset range covered by one or more
PNFS_BLOCK_READ_DATA extents in a writable layout, the server MUST
include one or more PNFS_BLOCK_INVALID_DATA extents in the layout
that cover the same bex_file_offset range. When performing a write
to such an area of a layout, the client MUST effectively copy the
data from the PNFS_BLOCK_READ_DATA extent for any partial blocks of
bex_file_offset and range, merge in the changes to be written, and
write the result to the PNFS_BLOCK_INVALID_DATA extent for the blocks
for that bex_file_offset and range. That is, if entire blocks of
data are to be overwritten by an operation, the corresponding
PNFS_BLOCK_READ_DATA blocks need not be fetched, but any partial-
block writes must be merged with data fetched via
PNFS_BLOCK_READ_DATA extents before storing the result via
PNFS_BLOCK_INVALID_DATA extents. For the purposes of this
discussion, "entire blocks" and "partial blocks" refer to the
server's file-system block size. Storing of data in a
PNFS_BLOCK_INVALID_DATA extent converts the written portion of the
PNFS_BLOCK_INVALID_DATA extent to a PNFS_BLOCK_READ_WRITE_DATA
extent; all subsequent reads MUST be performed from this extent; the
corresponding portion of the PNFS_BLOCK_READ_DATA extent MUST NOT be
skipping to change at page 16, line 34
extent that is not covered by a PNFS_BLOCK_READ_DATA extent, it MUST
treat this write identically to a write to a file not involved with
copy-on-write semantics. Thus, data must be written in at least
block size increments, aligned to multiples of block sized offsets,
and unwritten portions of blocks must be zero filled.
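The per-block merge that copy-on-write requires can be sketched as follows. cow_write_block and its parameters are illustrative names: read_block stands for the block fetched via the PNFS_BLOCK_READ_DATA extent, and the returned buffer is what the client would store via the PNFS_BLOCK_INVALID_DATA extent.

```python
def cow_write_block(read_block, write_data, offset_in_block, block_size):
    """Produce one server-sized block to store into the
    PNFS_BLOCK_INVALID_DATA extent.  A whole-block overwrite needs no
    read; a partial-block write is merged into the data fetched from
    the PNFS_BLOCK_READ_DATA extent.  Illustrative sketch only."""
    if offset_in_block == 0 and len(write_data) == block_size:
        # Entire block overwritten: the READ_DATA fetch can be skipped.
        return bytes(write_data)
    # Pad the fetched block to full size (unwritten bytes read as zero
    # when the covering extent is INVALID_DATA), then merge the write.
    buf = bytearray(read_block.ljust(block_size, b"\0"))
    buf[offset_in_block:offset_in_block + len(write_data)] = write_data
    return bytes(buf)
```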
In the LAYOUTCOMMIT operation that normally sends updated layout
information back to the server, for writable data, some
PNFS_BLOCK_INVALID_DATA extents may be committed as
PNFS_BLOCK_READ_WRITE_DATA extents, signifying that the storage at
the corresponding bex_storage_offset values has been stored into and
is now to be considered as valid data to be read.
PNFS_BLOCK_READ_DATA extents are not committed to the server. For
extents that the client receives via LAYOUTGET as
PNFS_BLOCK_INVALID_DATA and returns via LAYOUTCOMMIT as
PNFS_BLOCK_READ_WRITE_DATA, the server will understand that the
PNFS_BLOCK_READ_DATA mapping for that extent is no longer valid or
necessary for that file.
2.3.5. Extents are Permissions 2.3.5. Extents are Permissions
Layout extents returned to pNFS clients grant permission to read or
write; PNFS_BLOCK_READ_DATA and PNFS_BLOCK_NONE_DATA are read-only
(PNFS_BLOCK_NONE_DATA reads as zeroes), PNFS_BLOCK_READ_WRITE_DATA
and PNFS_BLOCK_INVALID_DATA are read/write (PNFS_BLOCK_INVALID_DATA
reads as zeroes; any write converts it to
PNFS_BLOCK_READ_WRITE_DATA).  This is the only means by which a
client obtains permission to perform direct I/O to storage devices;
a pNFS client MUST NOT perform direct I/O operations that are not
permitted by an extent held by the client.  Client adherence to this
rule places the pNFS server in control of potentially conflicting
storage device operations, enabling the server to determine what
does conflict and how to avoid conflicts by granting and recalling
extents to/from clients.
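
The permission rule above amounts to a small state-to-permission
table, sketched here in Python.  The numeric state values and the
may_do_direct_io helper are assumptions for illustration; only the
mapping from extent states to read/write permission comes from this
document.

```python
# Extent states (illustrative values, not the on-the-wire encoding).
PNFS_BLOCK_READ_WRITE_DATA = 0
PNFS_BLOCK_READ_DATA = 1
PNFS_BLOCK_INVALID_DATA = 2
PNFS_BLOCK_NONE_DATA = 3

# Direct-I/O permissions granted by each extent state.
_PERMS = {
    PNFS_BLOCK_READ_WRITE_DATA: {"read", "write"},
    PNFS_BLOCK_INVALID_DATA:    {"read", "write"},  # reads as zeroes
    PNFS_BLOCK_READ_DATA:       {"read"},
    PNFS_BLOCK_NONE_DATA:       {"read"},           # reads as zeroes
}

def may_do_direct_io(state, op):
    """Return True if a client holding an extent in 'state' may
    issue a direct 'read' or 'write' to the storage device."""
    return op in _PERMS.get(state, set())
```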
Block/volume class storage devices are not required to perform read
and write operations atomically.  Overlapping concurrent read and
2.3.7. Layout Hints

The SETATTR operation supports a layout hint attribute [NFSv4.1].
When the client sets a layout hint (data type layouthint4) with a
layout type of LAYOUT4_BLOCK_VOLUME (the loh_type field), the
loh_body field contains a value of data type pnfs_block_layouthint4.
////* block layout specific type for loh_body */
///struct pnfs_block_layouthint4 {
///    uint64_t blh_maximum_io_time;   /* maximum i/o time in seconds
///                                     */
///};
///
The block layout client uses the layout hint data structure to
communicate to the server the maximum time that it may take an I/O
to execute on the client.  Clients using block layouts MUST set the
layout hint attribute before using LAYOUTGET operations.
2.3.8. Client Fencing

provides a means to discover and mask LUNs, including a means of
associating clients with the necessary World Wide Names or Initiator
names to be masked.
In the absence of support for LUN masking, the server has to rely on
the clients to implement a timed-lease I/O fencing mechanism.
Because clients do not know if the server is using LUN masking, in
all cases the client MUST implement timed-lease fencing.  Timed-lease
fencing defines two time periods: the first, "lease_time", is the
length of a lease as defined by the server's lease_time attribute
(see [NFSV4.1]); the second, "blh_maximum_io_time", is the maximum
time it can take for a client I/O to the storage system to either
complete or fail.  This value is often 30 or 60 seconds, but may be
longer in some environments.  If the maximum client I/O time cannot
be bounded, the client MUST use a value of all 1s as the
blh_maximum_io_time.
The client MUST use SETATTR with a layout hint of type
LAYOUT4_BLOCK_VOLUME to inform the server of its maximum I/O time
prior to issuing the first LAYOUTGET operation.  The maximum I/O
time hint is a per-client attribute, and as such the server SHOULD
maintain the value set by each client.  A server that implements
fencing via LUN masking SHOULD accept any maximum I/O time value
from a client.  A server that does not implement fencing may return
the error NFS4ERR_INVAL to the SETATTR operation.  Such a server
SHOULD return NFS4ERR_INVAL when a client sends an unbounded maximum
I/O time (all 1s), or when the maximum I/O time is significantly
greater than that of other clients using block layouts with pNFS.
When a client receives the error NFS4ERR_INVAL in response to the
SETATTR operation for a layout hint, the client MUST NOT use the
is not renewed prior to expiration, the client MUST cease to use the
layout after "lease_time" seconds from when it either sent the
original LAYOUTGET command, or sent the last operation renewing the
lease.  In other words, the client may not issue any I/O to blocks
specified by an expired layout.  In the presence of large
communication delays between the client and server, it is even
possible for the lease to expire prior to the server response
arriving at the client.  In such a situation, the client MUST NOT
use the expired layouts, and SHOULD revert to using standard NFSv4.1
READ and WRITE operations.  Furthermore, the client must be
configured such that I/O operations complete within the
"blh_maximum_io_time" even in the presence of multipath drivers that
will retry I/Os via multiple paths.
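
The client-side timing rule above reduces to a simple comparison,
sketched here in Python.  The function name and parameters are
assumptions for illustration; only the rule itself (stop using a
layout "lease_time" seconds after the last lease-renewing operation
was sent) comes from this document.

```python
def layout_usable(now, last_renewal_sent, lease_time):
    """Client-side timed-lease check (sketch).

    last_renewal_sent is the time at which the client sent the
    original LAYOUTGET or the last lease-renewing operation; once
    lease_time seconds have elapsed from that point, the client must
    stop issuing direct I/O under the layout and fall back to
    NFSv4.1 READ and WRITE.  All times are in seconds on a common
    clock.
    """
    return now < last_renewal_sent + lease_time
```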
As stated in the section "Dealing with Lease Expiration on the
Client" of [NFSV4.1], if any SEQUENCE operation is successful, but
sr_status_flags has SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or
SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST immediately
cease to use all layouts and device ID to device address mappings
associated with the corresponding server.
In the absence of known two-way communication between the client and
the server on the fore channel, the server must wait for at least
the time period "lease_time" plus "blh_maximum_io_time" before
transferring layouts from the original client to any other client.
The server, like the client, must take a conservative approach, and
start the lease expiration timer from the time that it received the
operation that last renewed the lease.
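
The server-side wait can be expressed as a one-line computation,
sketched here in Python.  The function name and parameters are
assumptions for illustration; the arithmetic (lease_time plus the
client's declared maximum I/O time, measured from receipt of the
last lease-renewing operation) is the rule stated above.

```python
def earliest_layout_transfer(last_renewal_received, lease_time,
                             blh_maximum_io_time):
    """Earliest time at which a server without LUN-masking fencing
    may transfer a client's layouts to another client, absent known
    two-way communication on the fore channel.  All times are in
    seconds; last_renewal_received is when the server received the
    operation that last renewed the lease.
    """
    return last_renewal_received + lease_time + blh_maximum_io_time
```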
2.4. Crash Recovery Issues

When the server crashes while the client holds a writable layout,
and the client has written data to blocks covered by the layout, and
the blocks are still in the PNFS_BLOCK_INVALID_DATA state, the
client has two options for recovery.  If the data that has been
written to these blocks is still cached by the client, the client
can simply re-write the data via NFSv4, once the server has come
back online.  However, if the data is no longer in the client's
cache, the client MUST NOT