draft-ietf-nfsv4-scsi-layout-06.txt   draft-ietf-nfsv4-scsi-layout-07.txt 
NFSv4 C. Hellwig NFSv4 C. Hellwig
Internet-Draft Internet-Draft
Intended status: Standards Track June 27, 2016 Intended status: Standards Track August 16, 2016
Expires: December 29, 2016 Expires: February 17, 2017
Parallel NFS (pNFS) SCSI Layout Parallel NFS (pNFS) SCSI Layout
draft-ietf-nfsv4-scsi-layout-06.txt draft-ietf-nfsv4-scsi-layout-07.txt
Abstract Abstract
The Parallel Network File System (pNFS) allows a separation between The Parallel Network File System (pNFS) allows a separation between
the metadata (onto a metadata server) and data (onto a storage the metadata (onto a metadata server) and data (onto a storage
device) for a file. The SCSI Layout Type is defined in this document device) for a file. The SCSI Layout Type is defined in this document
as an extension to pNFS to allow the use of SCSI based block storage as an extension to pNFS to allow the use of SCSI based block storage
devices. devices.
Status of This Memo Status of This Memo
skipping to change at page 1, line 34 skipping to change at page 1, line 34
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 29, 2016. This Internet-Draft will expire on February 17, 2017.
Copyright Notice Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 30 skipping to change at page 3, line 30
+| Systems | +| Systems |
+-----------+ +-----------+
Figure 1 Figure 1
The overall approach is that pNFS-enhanced clients obtain sufficient The overall approach is that pNFS-enhanced clients obtain sufficient
information from the server to enable them to access the underlying information from the server to enable them to access the underlying
storage (on the storage systems) directly. See Section 12 of storage (on the storage systems) directly. See Section 12 of
[RFC5661] for more details. This document is concerned with access [RFC5661] for more details. This document is concerned with access
from pNFS clients to storage devices over block storage protocols from pNFS clients to storage devices over block storage protocols
based on the SCSI Architecture Model ([SAM-4]), e.g., Fibre based on the SCSI Architecture Model ([SAM-5]), e.g., Fibre
Channel Protocol (FCP) for Fibre Channel, Internet SCSI (iSCSI) or Channel Protocol (FCP) for Fibre Channel, Internet SCSI (iSCSI) or
Serial Attached SCSI (SAS). pNFS SCSI layout requires block based Serial Attached SCSI (SAS). pNFS SCSI layout requires block based
SCSI command sets, for example SCSI Block Commands ([SBC3]). While SCSI command sets, for example SCSI Block Commands ([SBC3]). While
SCSI command sets for non-block based access exist, these are not SCSI command sets for non-block based access exist, these are not
supported by the SCSI layout type, and all future references to SCSI supported by the SCSI layout type, and all future references to SCSI
storage devices will imply a block based SCSI command set. storage devices will imply a block based SCSI command set.
The Server to Storage System protocol, called the "Control Protocol", The Server to Storage System protocol, called the "Control Protocol",
is not of concern for interoperability, although it will typically be is not of concern for interoperability, although it will typically be
the same SCSI based storage protocol. the same SCSI based storage protocol.
skipping to change at page 3, line 52 skipping to change at page 3, line 52
This document is based on [RFC5663] and makes changes to the block This document is based on [RFC5663] and makes changes to the block
layout type to provide a better pNFS layout protocol for SCSI based layout type to provide a better pNFS layout protocol for SCSI based
storage devices. Despite these changes, [RFC5663] remains the storage devices. Despite these changes, [RFC5663] remains the
defining document for the existing block layout type. [RFC6688] is defining document for the existing block layout type. [RFC6688] is
unnecessary in the context of the SCSI layout type because the new unnecessary in the context of the SCSI layout type because the new
layout type provides mandatory disk access protection as part of the layout type provides mandatory disk access protection as part of the
layout type definition. In contrast to [RFC5663], this document uses layout type definition. In contrast to [RFC5663], this document uses
SCSI protocol features to provide reliable fencing by using SCSI SCSI protocol features to provide reliable fencing by using SCSI
Persistent Reservations, and it can provide reliable and efficient Persistent Reservations, and it can provide reliable and efficient
device discovery by using SCSI device identifiers instead of having device discovery by using SCSI device identifiers instead of having
to rely on probing all devices potentially attached to a client for a to rely on probing all devices potentially attached to a client.
signature. This new layout type also optimizes the I/O path by
reducing the size of the LAYOUTCOMMIT payload This new layout type also optimizes the I/O path by reducing the size
of the LAYOUTCOMMIT payload. Except for these changes, the protocol
is identical to [RFC5663]; most importantly, there are no changes to
the way the volume topology is built (Section 2.3.2), and the data
structures that describe extents (Section 2.4) are unchanged compared
to [RFC5663] as well.
1.1. Conventions Used in This Document 1.1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
1.2. General Definitions 1.2. General Definitions
The following definitions are provided for the purpose of providing The following definitions are provided for the purpose of providing
skipping to change at page 6, line 31 skipping to change at page 6, line 34
/// ///
2. SCSI Layout Description 2. SCSI Layout Description
2.1. Background and Architecture 2.1. Background and Architecture
The fundamental storage model supported by SCSI storage devices is a The fundamental storage model supported by SCSI storage devices is a
Logical Unit (LU) consisting of a sequential series of fixed-size Logical Unit (LU) consisting of a sequential series of fixed-size
blocks. Logical units used as devices for NFS SCSI layouts, and the blocks. Logical units used as devices for NFS SCSI layouts, and the
SCSI initiators used for the pNFS Metadata Server and clients MUST SCSI initiators used for the pNFS Metadata Server and clients MUST
support SCSI persistent reservations. support SCSI persistent reservations as defined in [SPC4].
A pNFS layout for this SCSI class of storage is responsible for A pNFS layout for this SCSI class of storage is responsible for
mapping from an NFS file (or portion of a file) to the blocks of mapping from an NFS file (or portion of a file) to the blocks of
storage volumes that contain the file. The blocks are expressed as storage volumes that contain the file. The blocks are expressed as
extents with 64-bit offsets and lengths using the existing NFSv4 extents with 64-bit offsets and lengths using the existing NFSv4
offset4 and length4 types. Clients MUST be able to perform I/O to offset4 and length4 types. Clients MUST be able to perform I/O to
the block extents without affecting additional areas of storage the block extents without affecting additional areas of storage
(especially important for writes); therefore, extents MUST be aligned (especially important for writes); therefore, extents MUST be aligned
to 512-byte boundaries. to logical block size boundaries of the underlying logical units
(typically 512 or 4096 bytes). For complex volume topologies the
server MUST ensure extents are aligned to the boundaries of the
largest logical block size in the volume topology.
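The alignment rule in the -07 text can be illustrated with a minimal sketch. The function and parameter names below are hypothetical and do not come from the draft. The sketch also assumes that the file offset, the storage offset, and the length of an extent are all checked, and that logical block sizes are powers of two, so that alignment to the largest size implies alignment to every smaller one.

```python
def required_alignment(logical_block_sizes):
    """Return the alignment, in bytes, for extents in a volume topology.

    logical_block_sizes holds the logical block size of every logical
    unit in the topology (hypothetical input, e.g. [512, 4096]).
    """
    if not logical_block_sizes:
        raise ValueError("volume topology contains no logical units")
    # The largest logical block size governs the whole topology.
    return max(logical_block_sizes)


def extent_is_aligned(file_offset, storage_offset, length, logical_block_sizes):
    """Check that an extent (all values in bytes) honors the alignment rule."""
    align = required_alignment(logical_block_sizes)
    return all(value % align == 0
               for value in (file_offset, storage_offset, length))
```

For a topology that mixes 512-byte and 4096-byte logical units, an extent starting at byte 512 would be rejected by this check even though it is aligned for the 512-byte device.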
The pNFS operation for requesting a layout (LAYOUTGET) includes the The pNFS operation for requesting a layout (LAYOUTGET) includes the
"layoutiomode4 loga_iomode" argument, which indicates whether the "layoutiomode4 loga_iomode" argument, which indicates whether the
requested layout is for read-only use or read-write use. A read-only requested layout is for read-only use or read-write use. A read-only
layout may contain holes that are read as zero, whereas a read-write layout may contain holes that are read as zero, whereas a read-write
layout will contain allocated, but un-initialized storage in those layout will contain allocated, but un-initialized storage in those
holes (read as zero, can be written by client). This document also holes (read as zero, can be written by client). This document also
supports client participation in copy-on-write (e.g., for file supports client participation in copy-on-write (e.g., for file
systems with snapshots) by providing both read-only and un- systems with snapshots) by providing both read-only and un-
initialized storage for the same range in a layout. Reads are initialized storage for the same range in a layout. Reads are
skipping to change at page 8, line 15 skipping to change at page 8, line 28
Type implementation. Type implementation.
2.3. GETDEVICEINFO 2.3. GETDEVICEINFO
2.3.1. Volume Identification 2.3.1. Volume Identification
SCSI targets implementing [SPC4] export unique LU names for each LU SCSI targets implementing [SPC4] export unique LU names for each LU
through the Device Identification VPD page (page code 0x83), which through the Device Identification VPD page (page code 0x83), which
can be obtained using the INQUIRY command with the EVPD bit set to can be obtained using the INQUIRY command with the EVPD bit set to
one. This document uses a subset of this information to identify LUs one. This document uses a subset of this information to identify LUs
backing pNFS SCSI layouts. It is similar to the "Identification backing pNFS SCSI layouts. Device Identification VPD page
Descriptor Target Descriptor" specified in [SPC4], but limits the descriptors used to identify LUs for use with pNFS SCSI layouts must
allowed values to those that uniquely identify a LU. Device adhere to the following restrictions:
Identification VPD page descriptors used to identify LUs for use with
pNFS SCSI layouts must adhere to the following restrictions:
1. The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is 1. The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is
associated with the addressed logical unit). associated with the addressed logical unit).
2. The "DESIGNATOR TYPE" MUST be set to one of four values that are 2. The "DESIGNATOR TYPE" MUST be set to one of four values that are
required for the mandatory logical unit name in section 7.7.3 of required for the mandatory logical unit name in section 7.7.3 of
[SPC4], as explicitly listed in the "pnfs_scsi_designator_type" [SPC4], as explicitly listed in the "pnfs_scsi_designator_type"
enumeration: enumeration:
PS_DESIGNATOR_T10 T10 vendor ID based PS_DESIGNATOR_T10 T10 vendor ID based
skipping to change at page 11, line 50 skipping to change at page 11, line 50
to volumes defined by lower indexed elements of the array. to volumes defined by lower indexed elements of the array.
The "pnfs_scsi_device_addr4" data structure is returned by the server The "pnfs_scsi_device_addr4" data structure is returned by the server
as the storage-protocol-specific opaque field da_addr_body in the as the storage-protocol-specific opaque field da_addr_body in the
"device_addr4" structure by a successful GETDEVICEINFO operation "device_addr4" structure by a successful GETDEVICEINFO operation
[RFC5661]. [RFC5661].
As noted above, all device_addr4 structures eventually resolve to a As noted above, all device_addr4 structures eventually resolve to a
set of volumes of type PNFS_SCSI_VOLUME_BASE. Complicated volume set of volumes of type PNFS_SCSI_VOLUME_BASE. Complicated volume
hierarchies may be composed of dozens of volumes each with several hierarchies may be composed of dozens of volumes each with several
signature components; thus, the device address may require several components; thus, the device address may require several kilobytes.
kilobytes. The client SHOULD be prepared to allocate a large buffer The client SHOULD be prepared to allocate a large buffer to contain
to contain the result. In the case of the server returning the result. In the case of the server returning NFS4ERR_TOOSMALL,
NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at least the client SHOULD allocate a buffer of at least gdir_mincount_bytes
gdir_mincount_bytes to contain the expected result and retry the to contain the expected result and retry the GETDEVICEINFO request.
GETDEVICEINFO request.
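The retry behavior described above might be sketched as follows. This is an illustration only: send_getdeviceinfo, TooSmallError, and the default sizes are hypothetical stand-ins for whatever RPC layer a client uses, and the mincount attribute models the minimum byte count the server reports alongside NFS4ERR_TOOSMALL.

```python
class TooSmallError(Exception):
    """Hypothetical stand-in for a GETDEVICEINFO reply of NFS4ERR_TOOSMALL."""

    def __init__(self, mincount):
        super().__init__(f"reply buffer too small, server needs {mincount} bytes")
        self.mincount = mincount  # minimum buffer size reported by the server


def get_device_info(send_getdeviceinfo, initial_maxcount=4096, max_attempts=3):
    """Issue GETDEVICEINFO, growing the reply buffer on NFS4ERR_TOOSMALL.

    send_getdeviceinfo(maxcount) is a caller-supplied callable that either
    returns the opaque device address or raises TooSmallError.
    """
    maxcount = initial_maxcount
    for _ in range(max_attempts):
        try:
            return send_getdeviceinfo(maxcount)
        except TooSmallError as err:
            # Retry with at least the size the server asked for.
            maxcount = max(maxcount, err.mincount)
    raise RuntimeError("GETDEVICEINFO still failing after retries")
```

Bounding the number of attempts is a design choice of this sketch, not a requirement stated in the text; it only guards against a server that keeps raising its reported minimum.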
2.4. Data Structures: Extents and Extent Lists 2.4. Data Structures: Extents and Extent Lists
A pNFS SCSI layout is a list of extents within a flat array of data A pNFS SCSI layout is a list of extents within a flat array of data
blocks in a volume. The details of the volume topology can be blocks in a volume. The details of the volume topology can be
determined by using the GETDEVICEINFO operation. The SCSI layout determined by using the GETDEVICEINFO operation. The SCSI layout
describes the individual block extents on the volume that make up the describes the individual block extents on the volume that make up the
file. The offsets and length contained in an extent are specified in file. The offsets and length contained in an extent are specified in
units of bytes. units of bytes.
skipping to change at page 14, line 32 skipping to change at page 14, line 32
the se_file_offset of each extent; any ties are broken by increasing the se_file_offset of each extent; any ties are broken by increasing
order of the extent state (se_state). order of the extent state (se_state).
2.4.1. Layout Requests and Extent Lists 2.4.1. Layout Requests and Extent Lists
Each request for a layout specifies at least three parameters: file Each request for a layout specifies at least three parameters: file
offset, desired size, and minimum size. If the status of a request offset, desired size, and minimum size. If the status of a request
indicates success, the extent list returned must meet the following indicates success, the extent list returned must meet the following
criteria: criteria:
o A request for a readable (but not writable) layout returns only o A request for a readable (but not writable) layout MUST return
PNFS_SCSI_READ_DATA or PNFS_SCSI_NONE_DATA extents (but not either PNFS_SCSI_READ_DATA or PNFS_SCSI_NONE_DATA extents. It
PNFS_SCSI_INVALID_DATA or PNFS_SCSI_READ_WRITE_DATA extents). SHALL NOT return PNFS_SCSI_INVALID_DATA or
PNFS_SCSI_READ_WRITE_DATA extents.
o A request for a writable layout returns PNFS_SCSI_READ_WRITE_DATA o A request for a writable layout MUST return
or PNFS_SCSI_INVALID_DATA extents (but not PNFS_SCSI_NONE_DATA PNFS_SCSI_READ_WRITE_DATA or PNFS_SCSI_INVALID_DATA extents, and
extents). It may also return PNFS_SCSI_READ_DATA extents only it MAY return additional PNFS_SCSI_READ_DATA extents for ranges
when the offset ranges in those extents are also covered by covered by PNFS_SCSI_INVALID_DATA extents to allow client side
PNFS_SCSI_INVALID_DATA extents to permit writes. copy-on-write operations. A request for a writable layout SHALL
NOT return PNFS_SCSI_NONE_DATA extents.
o The first extent in the list MUST contain the requested starting o The first extent in the list MUST contain the requested starting
offset. offset.
o The total size of extents within the requested range MUST cover at o The total size of extents within the requested range MUST cover at
least the minimum size. One exception is allowed: the total size least the minimum size. One exception is allowed: the total size
MAY be smaller if only readable extents were requested and EOF is MAY be smaller if only readable extents were requested and EOF is
encountered. encountered.
o Extents in the extent list MUST be logically contiguous for a o Extents in the extent list MUST be logically contiguous for a
skipping to change at page 21, line 6 skipping to change at page 21, line 6
2.4.10.2. PRs - MDS Registration and Reservation 2.4.10.2. PRs - MDS Registration and Reservation
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
MDS needs to prepare the volume for fencing using PRs. This is done MDS needs to prepare the volume for fencing using PRs. This is done
by registering the reservation generated for the MDS with the device by registering the reservation generated for the MDS with the device
using the "PERSISTENT RESERVE OUT" command with a service action of using the "PERSISTENT RESERVE OUT" command with a service action of
"REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a "REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a
service action of "RESERVE" and the type field set to 8h (Exclusive service action of "RESERVE" and the type field set to 8h (Exclusive
Access - Registrants Only). To make sure all I_T nexuses (see Access - Registrants Only). To make sure all I_T nexuses (see
section 3.1.45 of [SAM-4]) are registered, the MDS SHOULD set the section 3.1.45 of [SAM-5]) are registered, the MDS SHOULD set the
"All Target Ports" (ALL_TG_PT) bit when registering the key, or "All Target Ports" (ALL_TG_PT) bit when registering the key, or
otherwise ensure the registration is performed for each initiator otherwise ensure the registration is performed for each target port,
port. and MUST perform registration for each initiator port.
2.4.10.3. PRs - Client Registration 2.4.10.3. PRs - Client Registration
Before performing the first I/O to a device returned from a Before performing the first I/O to a device returned from a
GETDEVICEINFO operation the client will register the registration key GETDEVICEINFO operation the client will register the registration key
returned in sbv_pr_key with the storage device by issuing a returned in sbv_pr_key with the storage device by issuing a
"PERSISTENT RESERVE OUT" command with a service action of REGISTER "PERSISTENT RESERVE OUT" command with a service action of REGISTER
with the "SERVICE ACTION RESERVATION KEY" set to the reservation key with the "SERVICE ACTION RESERVATION KEY" set to the reservation key
returned in sbv_pr_key. To make sure all I_T nexuses are registered, returned in sbv_pr_key. To make sure all I_T nexuses are registered,
the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when
registering the key, or otherwise ensure the registration is registering the key, or otherwise ensure the registration is
performed for each initiator port. performed for each target port, and MUST perform registration for
each initiator port.
When a client stops using a device earlier returned by GETDEVICEINFO When a client stops using a device earlier returned by GETDEVICEINFO
it MUST unregister the earlier registered key by issuing a it MUST unregister the earlier registered key by issuing a
"PERSISTENT RESERVE OUT" command with a service action of "REGISTER" "PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
with the "RESERVATION KEY" set to the earlier registered reservation with the "RESERVATION KEY" set to the earlier registered reservation
key. key.
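The client-side register and unregister steps can be sketched by building the command bytes directly. The byte layout below (opcode 5Fh, service action in byte 1, a 24-byte parameter list with the ALL_TG_PT bit in byte 20) is written from memory of [SPC4] and should be treated as an assumption to verify against that document. The function name is hypothetical, and sending the command to a device is out of scope here.

```python
import struct

PERSISTENT_RESERVE_OUT = 0x5F  # operation code (assumed from SPC-4)
SA_REGISTER = 0x00             # REGISTER service action
ALL_TG_PT = 0x04               # "All Target Ports" bit in parameter byte 20


def pr_out_register(reservation_key, service_action_key, all_tg_pt=True):
    """Build (cdb, parameter_list) for PERSISTENT RESERVE OUT / REGISTER.

    Register:   reservation_key=0, service_action_key=sbv_pr_key
    Unregister: reservation_key=sbv_pr_key, service_action_key=0
    """
    # Parameter list: RESERVATION KEY, SERVICE ACTION RESERVATION KEY,
    # four obsolete bytes, one flags byte, three reserved/obsolete bytes.
    params = struct.pack(
        ">QQ4xB3x",
        reservation_key,
        service_action_key,
        ALL_TG_PT if all_tg_pt else 0,
    )
    # 10-byte CDB: opcode, service action, scope/type (unused by REGISTER),
    # two reserved bytes, 32-bit parameter list length, control byte.
    cdb = struct.pack(
        ">BBB2xIB",
        PERSISTENT_RESERVE_OUT, SA_REGISTER, 0, len(params), 0,
    )
    return cdb, params
```

A client would call pr_out_register(0, key) before its first I/O and pr_out_register(key, 0) when it stops using the device, where key is the value returned in sbv_pr_key.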
2.4.10.4. PRs - Fencing Action 2.4.10.4. PRs - Fencing Action
In case of a non-responding client the MDS fences the client by In case of a non-responding client the MDS fences the client by
skipping to change at page 28, line 5 skipping to change at page 28, line 5
"Parallel NFS (pNFS) Block/Volume Layout", RFC 5663, "Parallel NFS (pNFS) Block/Volume Layout", RFC 5663,
January 2010. January 2010.
[RFC6688] Black, D., Ed., Glasgow, J., and S. Faibish, "Parallel NFS [RFC6688] Black, D., Ed., Glasgow, J., and S. Faibish, "Parallel NFS
(pNFS) Block Disk Protection", RFC 6688, July 2012. (pNFS) Block Disk Protection", RFC 6688, July 2012.
[RFC7143] Chadalapaka, M., Meth, K., and D. Black, "Internet Small [RFC7143] Chadalapaka, M., Meth, K., and D. Black, "Internet Small
Computer System Interface (iSCSI) Protocol Computer System Interface (iSCSI) Protocol
(Consolidated)", RFC 7143, April 2014. (Consolidated)", RFC 7143, April 2014.
[SAM-4] INCITS Technical Committee T10, "SCSI Architecture Model - [SAM-5] INCITS Technical Committee T10, "SCSI Architecture Model -
4 (SAM-4)", ANSI INCITS 447-2008, ISO/IEC 14776-414, 2008. 5 (SAM-5)", ANSI INCITS 515-XXXXX, 2016.
[SAS3] INCITS Technical Committee T10, "Serial Attached SCSI-3", [SAS3] INCITS Technical Committee T10, "Serial Attached SCSI-3",
ANSI INCITS 519-2014, ISO/IEC 14776-154, 2014. ANSI INCITS 519-2014, ISO/IEC 14776-154, 2014.
[SBC3] INCITS Technical Committee T10, "SCSI Block Commands-3", [SBC3] INCITS Technical Committee T10, "SCSI Block Commands-3",
ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014. ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.
[SPC4] INCITS Technical Committee T10, "SCSI Primary Commands-4", [SPC4] INCITS Technical Committee T10, "SCSI Primary Commands-4",
ANSI INCITS 513-2015, 2015. ANSI INCITS 513-2015, 2015.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Large parts of this document were copied verbatim, and others were Large parts of this document were copied verbatim, and others were
inspired by [RFC5663]. Thanks to David Black, Stephen Fridella and inspired by [RFC5663]. Thanks to David Black, Stephen Fridella and
Jason Glasgow for their work on the pNFS block/volume layout Jason Glasgow for their work on the pNFS block/volume layout
protocol. protocol.
David Black, Robert Elliott and Tom Haynes provided a thorough David Black, Robert Elliott and Tom Haynes provided a thorough
review of early drafts of this document, and their input led to the review of drafts of this document, and their input led to the current
current form of the document. form of the document.
David Noveck provided ample feedback to various drafts of this David Noveck provided ample feedback to various drafts of this
document, wrote the section on enforcing NFSv4 semantics and rewrote document, wrote the section on enforcing NFSv4 semantics and rewrote
various sections to better capture the intent. various sections to better capture the intent.
Appendix B. RFC Editor Notes Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this [RFC Editor: please remove this section prior to publishing this
document as an RFC] document as an RFC]
 End of changes. 16 change blocks. 
37 lines changed or deleted 45 lines changed or added
