draft-ietf-nfsv4-pnfs-block-09.txt   draft-ietf-nfsv4-pnfs-block-10.txt 
NFSv4 Working Group                                             D. Black
Internet Draft                                               S. Fridella
Expires: May 25, 2009                                         J. Glasgow
Intended Status: Proposed Standard                       EMC Corporation
                                                       November 25, 2008

                        pNFS Block/Volume Layout
                   draft-ietf-nfsv4-pnfs-block-10.txt
Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering

skipping to change at page 1, line 36
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire in May 2009.

Abstract
Parallel NFS (pNFS) extends NFSv4 to allow clients to directly access
file data on the storage used by the NFSv4 server. This ability to
bypass the server for data access can increase both performance and
parallelism, but requires additional client functionality for data
access, some of which is dependent on the class of storage used. The
main pNFS operations draft specifies storage-class-independent
extensions to NFS; this draft specifies the additional extensions

skipping to change at page 2, line 37
      2.3.5. Extents are Permissions.............................17
      2.3.6. End-of-file Processing..............................18
      2.3.7. Layout Hints........................................19
      2.3.8. Client Fencing......................................19
   2.4. Crash Recovery Issues....................................21
   2.5. Recalling resources: CB_RECALL_ANY.......................22
   2.6. Transient and Permanent Errors...........................22
3. Security Considerations.......................................23
4. Conclusions...................................................24
5. IANA Considerations...........................................24
6. Acknowledgments...............................................25
7. References....................................................25
   7.1. Normative References.....................................25
   7.2. Informative References...................................25
Author's Addresses...............................................26
Intellectual Property Statement..................................26
Disclaimer of Validity...........................................27
Copyright Statement..............................................27
Acknowledgment...................................................27

1. Introduction

skipping to change at page 9, line 33
The types of aggregations that are allowed are stripes,
concatenations, and slices. Note that the volume topology expressed
in the pnfs_block_deviceaddr4 data structure will always resolve to a
set of volumes of pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE.
The array of volumes is ordered such that the root of the volume
hierarchy is the last element of the array. Concat, slice, and stripe
volumes MUST refer to volumes defined by lower-indexed elements of the
array.
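As a sketch of the ordering rule above (composite volumes may only
reference lower-indexed array elements, with the root last), a client-side
resolver might look like the following. The `Volume` class and the
function names are illustrative, not part of the protocol's XDR:

```python
# Hypothetical sketch of resolving a pnfs_block_deviceaddr4 volume array
# down to its PNFS_BLOCK_VOLUME_SIMPLE leaves. Names are illustrative.

SIMPLE, SLICE, CONCAT, STRIPE = range(4)

class Volume:
    def __init__(self, vol_type, children=()):
        self.vol_type = vol_type
        self.children = children  # indices of lower-numbered volumes

def resolve_simple_volumes(volumes):
    """Walk from the root (last array element) down to SIMPLE volumes,
    checking that composite volumes only reference lower indices."""
    simple = set()

    def walk(idx):
        vol = volumes[idx]
        if vol.vol_type == SIMPLE:
            simple.add(idx)
            return
        for child in vol.children:
            if child >= idx:  # MUST refer to lower-indexed elements
                raise ValueError("volume %d references %d" % (idx, child))
            walk(child)

    walk(len(volumes) - 1)  # root is the last element
    return simple
```

For example, a stripe over two simple volumes would be encoded as
`[simple, simple, stripe]`, and resolving it yields the two leaves.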
The "pnfs_block_deviceaddr4" data structure is returned by the server
as the storage-protocol-specific opaque field da_addr_body in the
"device_addr4" structure by a successful GETDEVICEINFO operation
[NFSV4.1].
As noted above, all device_addr4 structures eventually resolve to a
set of volumes of type PNFS_BLOCK_VOLUME_SIMPLE. These volumes are
each uniquely identified by a set of signature components.
Complicated volume hierarchies may be composed of dozens of volumes,
each with several signature components; thus the device address may
require several kilobytes. The client SHOULD be prepared to allocate
a large buffer to contain the result. In the case of the server
returning NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at

skipping to change at page 17, line 34
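A client-side buffer-sizing loop for the NFS4ERR_TOOSMALL case above
might look like the following sketch; `getdeviceinfo` and
`BufferTooSmall` are hypothetical stand-ins for the transport call and
the error, not real API names:

```python
# Illustrative client retry for fetching a potentially large device
# address. BufferTooSmall models NFS4ERR_TOOSMALL, which carries the
# size the server needs; names are hypothetical, not from the spec.

class BufferTooSmall(Exception):
    def __init__(self, required):
        self.required = required

def fetch_device_addr(getdeviceinfo, initial_maxcount=4096):
    maxcount = initial_maxcount
    while True:
        try:
            return getdeviceinfo(maxcount)
        except BufferTooSmall as e:
            # Server indicated the needed size; retry with a larger buffer.
            maxcount = max(e.required, maxcount * 2)
```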
Block/volume class storage devices are not required to perform read
and write operations atomically. Overlapping concurrent read and
write operations to the same data may cause the read to return a
mixture of before-write and after-write data. Overlapping write
operations can be worse, as the result could be a mixture of data
from the two write operations; data corruption can occur if the
underlying storage is striped and the operations complete in
different orders on different stripes. A pNFS server can avoid these
conflicts by implementing a single writer XOR multiple readers
concurrency control policy when there are multiple clients who wish
to access the same data. This policy MUST be implemented when
storage devices do not provide atomicity for concurrent read/write
and write/write operations to the same data.
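A minimal sketch of such a single-writer XOR multiple-readers grant
policy follows; the class and its methods are illustrative server-side
bookkeeping, not protocol elements:

```python
# Sketch of a single-writer XOR multiple-readers layout policy for one
# byte range of one file. On a denied grant, a real server would return
# NFS4ERR_LAYOUTTRYLATER and recall the conflicting layout.

class LayoutState:
    def __init__(self):
        self.readers = set()
        self.writer = None

    def may_grant(self, client, want_write):
        if want_write:
            # A write layout excludes all other readers and writers.
            return self.writer is None and self.readers <= {client}
        # A read layout is allowed unless another client holds a write layout.
        return self.writer is None or self.writer == client

    def grant(self, client, want_write):
        if not self.may_grant(client, want_write):
            return False  # caller rejects with NFS4ERR_LAYOUTTRYLATER
        if want_write:
            self.writer = client
        else:
            self.readers.add(client)
        return True
```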
If a client makes a layout request that conflicts with an existing
layout delegation, the request will be rejected with the error
NFS4ERR_LAYOUTTRYLATER. The client is then expected to retry the
request after a short interval. During this interval the server
SHOULD recall the conflicting portion of the layout delegation from
the client that currently holds it. This reject-and-retry approach
does not prevent client starvation when there is contention for the

skipping to change at page 23, line 12
supply the layout. As a result of receiving
NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD send future READ and
WRITE requests directly to the server. It is expected that a client
will not cache the file's layoutunavailable state forever,
particularly if the file is closed, and thus eventually the client
MAY reissue a LAYOUTGET operation.
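The client behavior described above might be sketched as follows; the
class and method names are hypothetical, not part of any client
implementation:

```python
# Illustrative client policy for NFS4ERR_LAYOUTUNAVAILABLE: remember the
# per-file state, route I/O through the server while it is cached, and
# forget it on close so a later open MAY reissue LAYOUTGET.

class FileIOPolicy:
    def __init__(self):
        self.layout_unavailable = set()

    def on_layoutget_error(self, filehandle, error):
        if error == "NFS4ERR_LAYOUTUNAVAILABLE":
            self.layout_unavailable.add(filehandle)

    def send_through_server(self, filehandle):
        # While the state is cached, READ/WRITE go to the NFSv4 server.
        return filehandle in self.layout_unavailable

    def on_close(self, filehandle):
        # Do not cache the state forever; a future LAYOUTGET is allowed.
        self.layout_unavailable.discard(filehandle)
```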
3. Security Considerations
Typically, SAN disk arrays and SAN protocols provide access control
mechanisms (e.g., logical unit number mapping and/or masking) which
operate at the granularity of individual hosts. The functionality
provided by such mechanisms makes it possible for the server to
"fence" individual client machines from certain physical disks---that
is to say, to prevent individual client machines from reading or
writing to certain physical disks. Finer-grained access control
methods are not generally available. For this reason, certain
security responsibilities are delegated to pNFS clients for
block/volume layouts. Block/volume storage systems generally control
access at a volume granularity, and hence pNFS clients have to be
trusted to only perform accesses allowed by the layout extents they
currently hold (e.g., and not access storage for files on which a
layout extent is not held). In general, the server will not be able
to prevent a client which holds a layout for a file from accessing
parts of the physical disk not covered by the layout. Similarly, the
server will not be able to prevent a client from accessing blocks
covered by a layout that it has already returned. This block-based
level of protection must be provided by the client software.
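The client-side check implied here (every direct storage access must
fall within a currently held layout extent) could be sketched as
follows; `Extent` and `io_allowed` are illustrative names, not protocol
elements:

```python
# Sketch of a client-side guard: a direct block I/O is permitted only if
# its byte range [offset, offset+length) is fully covered by extents the
# client currently holds.

class Extent:
    def __init__(self, offset, length):
        self.offset = offset
        self.length = length

def io_allowed(extents, offset, length):
    """Return True only if the requested range is covered by held extents."""
    end = offset + length
    covered = offset
    for ext in sorted(extents, key=lambda e: e.offset):
        if ext.offset > covered:
            break  # gap in coverage before the range is satisfied
        covered = max(covered, ext.offset + ext.length)
        if covered >= end:
            return True
    return covered >= end
```

A client would reject (or redirect to the server) any I/O for which this
check fails, including I/O against layouts it has already returned.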
An alternative method of block/volume protocol use is for the storage
devices to export virtualized block addresses, which do reflect the
files to which blocks belong. These virtual block addresses are
exported to pNFS clients via layouts. This allows the storage device
to make appropriate access checks, while mapping virtual block
addresses to physical block addresses. In environments where the
security requirements are such that client-side protection from
access to storage outside of the authorized layout extents is not
sufficient, pNFS block/volume storage layouts SHOULD NOT be used
unless the storage device is able to implement the appropriate access
checks, via use of virtualized block addresses or other means. In
contrast, an environment where client-side protection may suffice
consists of co-located clients, server and storage systems in a
datacenter with a physically isolated SAN under control of a single
system administrator or small group of system administrators.
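Under the assumption of virtualized block addresses, the device's
combined access check and address translation might be sketched as
follows; all names are illustrative, and a real device would track
grants per layout rather than per block:

```python
# Hedged sketch of a storage device exporting per-client virtual block
# addresses: translation succeeds only for blocks granted via a layout,
# so the access check and the mapping happen in one step.

class VirtualBlockMap:
    def __init__(self):
        self.mapping = {}  # (client, virtual_block) -> physical_block

    def grant(self, client, virtual_block, physical_block):
        # Called when the server hands out a layout covering this block.
        self.mapping[(client, virtual_block)] = physical_block

    def translate(self, client, virtual_block):
        try:
            return self.mapping[(client, virtual_block)]
        except KeyError:
            raise PermissionError("block not covered by a granted layout")
```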
This also has implications for some NFSv4 functionality outside pNFS.
For instance, if a file is covered by a mandatory read-only lock, the
server can ensure that only readable layouts for the file are granted
to pNFS clients. However, it is up to each pNFS client to ensure
that the readable layout is used only to service read requests, and
not to allow writes to the existing parts of the file. Since
block/volume storage systems are generally not capable of enforcing
such file-based security, in environments where pNFS clients cannot
be trusted to enforce such policies, pNFS block/volume storage

skipping to change at page 24, line 21
Access to block/volume storage is logically at a lower layer of the
I/O stack than NFSv4, and hence NFSv4 security is not directly
applicable to protocols that access such storage directly. Depending
on the protocol, some of the security mechanisms provided by NFSv4
(e.g., encryption, cryptographic integrity) may not be available, or
may be provided via different means. At one extreme, pNFS with
block/volume storage can be used with storage access protocols (e.g.,
parallel SCSI) that provide essentially no security functionality.
At the other extreme, pNFS may be used with storage protocols such as
iSCSI that can provide significant security functionality. It is the
responsibility of those administering and deploying pNFS with a
block/volume storage access protocol to ensure that appropriate
protection is provided to that protocol (physical security is a
common means for protocols not based on IP). In environments where
the security requirements for the storage protocol cannot be met,
pNFS block/volume storage layouts SHOULD NOT be used.
When security is available for a storage protocol, it is generally at
a different granularity and with a different notion of identity than
NFSv4 (e.g., NFSv4 controls user access to files, iSCSI controls

skipping to change at page 25, line 17
This draft draws extensively on the authors' familiarity with the
mapping functionality and protocol in EMC's MPFS (previously named
HighRoad) system [MPFS]. The protocol used by MPFS is called FMP
(File Mapping Protocol); it is an add-on protocol that runs in
parallel with file system protocols such as NFSv3 to provide
pNFS-like functionality for block/volume storage. While drawing on
FMP, the data structures and functional considerations in this draft
differ in significant ways, based on lessons learned and the
opportunity to take advantage of NFSv4 features such as COMPOUND
operations. The design to support pNFS client participation in
copy-on-write is based on text and ideas contributed by Craig
Everhart.

Andy Adamson, Ben Campbell, Richard Chandler, Benny Halevy, Fredric
Isaman, and Mario Wurzl all helped to review drafts of this
specification.
7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [NFSV4.1] Shepler, S., Eisler, M., and Noveck, D., ed., "NFSv4 Minor
             Version 1", draft-ietf-nfsv4-minorversion1-26.txt,
             Internet Draft, September 2008.

   [XDR]     Eisler, M., "XDR: External Data Representation Standard",
             STD 67, RFC 4506, May 2006.

7.2. Informative References

   [MPFS]    EMC Corporation, "EMC Celerra Multi-Path File System", EMC
             Data Sheet, available at:
             http://www.emc.com/collateral/software/data-sheet/h2006-celerra-mpfs-mpfsi.pdf

skipping to change at page 26, line 24
Stephen Fridella
EMC Corporation
228 South Street
Hopkinton, MA 01748

Phone: +1 (508) 249-3528
Email: fridella_stephen@emc.com

Jason Glasgow
Google
5 Cambridge Center
Cambridge, MA 02142

Phone: +1 (617) 575 1599
Email: jglasgow@aya.yale.edu
Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
End of changes. 15 change blocks; 41 lines changed or deleted, 45 lines changed or added.

This html diff was produced by rfcdiff 1.35. The latest version is available from http://tools.ietf.org/tools/rfcdiff/