NFSv4 Working Group                                          Tom Talpey
Internet-Draft                                  Network Appliance, Inc.
Intended status: Informational                            Chet Juszczak
Expires: January 1, 2008                                    July 1, 2007

                      NFS RDMA Problem Statement
           draft-ietf-nfsv4-nfs-rdma-problem-statement-07

Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time.  It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on January 1, 2008.

Copyright Notice

Copyright (C) The IETF Trust (2007).

Abstract

This draft addresses applying Remote Direct Memory Access to the
NFS protocols.  NFS implementations historically incur significant
overhead due to data copies on end-host systems, as well as other
processing overhead.  The potential benefits of RDMA to these
implementations are explored, and the reasons why RDMA is
especially well-suited to NFS and network file protocols in general
are evaluated.

Table Of Contents

1. Introduction
2. Problem Statement
3. File Protocol Architecture
4. Sources of Overhead
   4.1. Savings from TOE
   4.2. Savings from RDMA
5. Application of RDMA to NFS
6. Conclusions
   Security Considerations
   IANA Considerations
   Acknowledgements
   Normative References
   Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements
Acknowledgement

1. Introduction

The Network File System (NFS) protocol (as described in [RFC1094],
[RFC1813], and [RFC3530]) is one of several remote file access
protocols used in the class of processing architecture sometimes
called Network Attached Storage (NAS).

Historically, remote file access has proven to be a convenient,
cost-effective way to share information over a network, a concept
proven over time by the popularity of the NFS protocol.  However,
there are issues in such a deployment.

As compared to a local (direct-attached) file access architecture,
NFS removes the overhead of managing the local on-disk filesystem
state and its metadata, but interposes at least a transport network
and two network endpoints between an application process and the
files it is accessing.  This tradeoff has to date usually resulted
in a net performance loss as a result of reduced bandwidth,

...

implementations.

Replication of local file access performance on NAS using
traditional network protocol stacks has proven difficult, not
because of protocol processing overheads, but because of data copy
costs in the network endpoints.  This is especially true since host
buses are now often the main bottleneck in NAS architectures
[MOG03] [CHA+01].

The External Data Representation [RFC4506] employed beneath NFS and
RPC [RFC1831bis] can add more data copies, exacerbating the
problem.

Data copy-avoidance designs have not been widely adopted for a
variety of reasons.  [BRU99] points out that "many copy avoidance
techniques for network I/O are not applicable or may even backfire
if applied to file I/O."  Other designs that eliminate unnecessary
copies, such as [PAI+00], are incompatible with existing APIs and
therefore force application changes.

In recent years, an effort to standardize a set of protocols for
Remote Direct Memory Access, RDMA, over the standard Internet
Protocol Suite has been chartered [RDDP].  A complete IP-based RDMA
protocol suite is available in the published Standards Track
specifications.

RDMA is a general solution to the problem of CPU overhead incurred
due to data copies, primarily at the receiver.  Substantial
research has addressed this and has borne out the efficacy of the
approach.  An overview of this is the RDDP "Remote Direct Memory
Access (RDMA) over IP Problem Statement" document, [RFC4297].

In addition to the per-byte savings of off-loading data copies,
RDMA-enabled NICs (RNICs) offload the underlying protocol layers as
well, e.g., TCP, further reducing CPU overhead due to NAS
processing.

1.1. Background

The RDDP Problem Statement [RFC4297] asserts:

   "High costs associated with copying are an issue primarily for
   large scale systems ... with high bandwidth feeds, usually
   multiprocessors and clusters, that are adversely affected by
   copying overhead.  Examples of such machines include all

...

nature of the RPC and XDR protocols, the NFS data payload arrives
at arbitrary alignment, necessitating a copy at the receiver, and
the NFS requests are completed in an arbitrary sequence.

The data copies consume system bus bandwidth and CPU time, reducing
the available system capacity for applications [RFC4297].

Achieving zero-copy with NFS has, to date, required sophisticated,
version-specific "header cracking" hardware and/or extensive
platform-specific virtual memory mapping tricks.  Such approaches
become even more difficult for NFS version 4 due to the existence
of the COMPOUND operation and presence of Kerberos and other
security information, which further reduce alignment and greatly
complicate ULP offload.

Furthermore, NFS is challenged by high-speed network fabrics such
as 10 Gbits/s Ethernet.  Performing even raw network I/O such as
TCP is an issue at such speeds with today's hardware.  The problem
is fundamental in nature and has led the IETF to explore RDMA
[RFC4297].

Zero-copy techniques benefit file protocols extensively, as they
enable direct user I/O, reduce the overhead of protocol stacks,
provide perfect alignment into caches, etc.  Many studies have
already shown the performance benefits of such techniques [SKE+01]
[DCK+03] [FJNFS] [FJDAFS] [KM02] [MAF+02].

RDMA is compelling here for another reason: hardware-offloaded
networking support in itself does not avoid data copies without
resorting to implementing part of the NFS protocol in the NIC.

...

ubiquitous and interoperable solutions.

By providing file access performance equivalent to that of local
file systems, NFS over RDMA will enable applications running on a
set of client machines to interact through an NFS file system, just
as applications running on a single machine might interact through
a local file system.

3. File Protocol Architecture

NFS runs as an ONC RPC [RFC1831bis] application.  Being a file
access protocol, NFS is very "rich" in data content (versus control
information).

NFS messages can range from very small (under 100 bytes) to very
large (from many kilobytes to a megabyte or more).  They are all
contained within an RPC message and follow a variable length RPC
header.  This layout provides an alignment challenge for the data
items contained in an NFS call (request) or reply (response)
message.

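As a rough illustration of the alignment problem (a minimal sketch
in C, not taken from any implementation; the structure and field
names are hypothetical), the payload of a READ reply begins wherever
the variable-length RPC and NFS headers happen to end, so the
receiver typically copies it to an aligned application buffer:

   /*
    * Illustrative sketch only -- not actual NFS/RPC code.  It shows why
    * the variable-length RPC header leaves the NFS data payload at an
    * arbitrary offset in the receive buffer.
    */
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   struct rx_buffer {
       uint8_t data[65536];   /* bytes as they arrived from the transport */
       size_t  len;
   };

   /* Copy the READ payload to the aligned buffer the application gave us. */
   static void deliver_read_payload(const struct rx_buffer *rx,
                                    size_t rpc_hdr_len,  /* varies per message */
                                    size_t nfs_hdr_len,  /* varies per version/op */
                                    void *app_buf, size_t payload_len)
   {
       /* The payload offset depends on credentials, verifiers, status
        * codes, etc., so it is rarely page-aligned -- or even
        * 8-byte aligned. */
       size_t payload_off = rpc_hdr_len + nfs_hdr_len;

       memcpy(app_buf, rx->data + payload_off, payload_len);  /* the copy RDMA avoids */
   }
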
In addition to the control information in each NFS call or reply

...

The encoding of XDR data into transport buffers is referred to as
"marshalling", and the decoding of XDR data contained within
transport buffers into destination RPC procedure result buffers is
referred to as "unmarshalling".  The process of marshalling
therefore takes place at the sender of any particular message, be
it an RPC request or an RPC response.  Unmarshalling, of course,
takes place at the receiver.

Normally, any bulk data is moved (copied) as a result of the
unmarshalling process, because the destination address is not known
until the RPC code receives control and subsequently invokes the
XDR unmarshalling routine.  In other words, XDR-encoded data is not
self-describing, and it carries no placement information.  This
results in a data copy in most NFS implementations.

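A minimal sketch of why this copy happens (a simplified, hypothetical
decoder, not the actual XDR library API; bounds checks on the
fixed-size fields are omitted for brevity): the fields preceding the
opaque data must be decoded before the receiver knows how much data
there is and where it belongs, by which time the data already sits
in an anonymous transport buffer.

   #include <stdint.h>
   #include <string.h>

   struct xdr_stream {
       const uint8_t *pos;
       const uint8_t *end;
   };

   static uint32_t xdr_get_u32(struct xdr_stream *x)
   {
       /* XDR integers are big-endian and 4-byte aligned. */
       uint32_t v = ((uint32_t)x->pos[0] << 24) | ((uint32_t)x->pos[1] << 16) |
                    ((uint32_t)x->pos[2] << 8)  |  (uint32_t)x->pos[3];
       x->pos += 4;
       return v;
   }

   /* Decode a (much simplified) READ reply into the caller's buffer. */
   static int decode_read_reply(struct xdr_stream *x, void *caller_buf,
                                uint32_t buflen)
   {
       uint32_t status = xdr_get_u32(x);  /* these fields must be decoded ... */
       uint32_t count  = xdr_get_u32(x);  /* ... before the data can be found */

       if (status != 0 || count > buflen ||
           count > (uint32_t)(x->end - x->pos))
           return -1;

       /* Only now is the destination known: copy out of the anonymous
        * transport buffer.  Direct data placement would have put the
        * bytes in caller_buf to begin with. */
       memcpy(caller_buf, x->pos, count);
       x->pos += count;
       return (int)count;
   }
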
One mechanism by which the RPC layer may overcome this is for each
request to include placement information, to be used for direct
placement during XDR encode.  This "write chunk" can avoid sending
bulk data inline in an RPC message and generally results in one or
more RDMA Write operations.

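The following sketch illustrates the kind of placement information
such a chunk conveys (illustrative C declarations only; these are
not the [RPCRDMA] wire format, and the field names are assumptions):

   #include <stdint.h>

   struct rdma_segment {
       uint32_t handle;   /* steering tag identifying a registered region */
       uint32_t length;   /* number of bytes the peer may write there */
       uint64_t offset;   /* virtual address / offset within the region */
   };

   struct write_chunk {
       uint32_t            nsegments;
       struct rdma_segment segment[4];  /* one per discontiguous buffer */
   };

   /* A client READ request would carry a write_chunk describing the
    * application's buffer; the server then issues RDMA Write(s) into
    * it and returns only the small RPC reply header inline. */
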
Similarly, a "read chunk", where placement information referring to
bulk data which may be directly fetched via one or more RDMA Read

...

overhead it targets is a larger share of the total cost.  As other
sources of overhead, such as the checksumming and interrupt
handling above, are eliminated, the remaining overheads (primarily
data copy) loom larger.

With copies crossing the bus twice per copy, network processing
overhead is high whenever network bandwidth is large in comparison
to CPU and memory bandwidths.  Generally with today's end-systems,
the effects are observable at network speeds at or above 1 Gbits/s.

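A back-of-the-envelope calculation makes the point; the sketch below
uses assumed, illustrative numbers rather than measurements from any
cited study:

   #include <stdio.h>

   int main(void)
   {
       double link_gbps       = 10.0;                 /* assumed link rate */
       double payload_bytes_s = link_gbps * 1e9 / 8;  /* ~1.25 GB/s payload */

       double dma_crossings   = 1.0;  /* NIC DMA writes the packet to memory */
       double copy_crossings  = 2.0;  /* CPU copy: read source + write dest */

       double bus_bytes_s = payload_bytes_s * (dma_crossings + copy_crossings);
       printf("~%.2f GB/s of memory bus traffic for %.0f Gbit/s of payload\n",
              bus_bytes_s / 1e9, link_gbps);
       return 0;
   }

Under these assumptions a single receive-side copy turns a 10 Gbit/s
stream into several gigabytes per second of memory traffic before
the application ever touches the data.
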
A common question is whether an increase in CPU processing power
alleviates the problem of high processing costs of network I/O.
The answer is no; it is the memory bandwidth that is the issue.
Faster CPUs do not help if the CPU spends most of its time waiting
for memory [RFC4297].

TCP offload engine (TOE) technology aims to offload the CPU by
moving TCP/IP protocol processing to the NIC.  However, TOE
technology by itself does nothing to avoid necessary data copies
within upper layer protocols.  [MOG03] provides a description of
the role TOE can play in reducing per-packet and per-message costs.
Beyond the offloads commonly provided by today's network interface
hardware, TOE alone (w/o RDMA) helps in protocol header processing,
but this has been shown to be a minority component of the total
protocol processing overhead.  [CHA+01]

Numerous software approaches to the optimization of network
throughput have been made.  Experience has shown that network I/O
interacts with other aspects of system processing such as file I/O
and disk I/O.  [BRU99] [CHU96] Zero-copy optimizations based on
page remapping [CHU96] can be dependent upon machine architecture,
and are not scalable to multi-processor architectures.  Correct
buffer alignment and sizing together are needed to optimize the
performance of zero-copy movement mechanisms [SKE+01].  The NFS
message layout described above does not facilitate the splitting of
headers from data nor does it facilitate providing correct data
buffer alignment.

4.1. Savings from TOE

The expected improvement of TOE specifically for NFS protocol
processing can be quantified and shown to be fundamentally limited.

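The flavor of that limit can be seen with a simple Amdahl-style
bound; the fraction used below is purely an assumption for
illustration, not a figure from this document or its references.
If TOE removes only the share of per-message overhead attributable
to header processing, the remaining copy and handling costs cap the
achievable savings.

   #include <stdio.h>

   int main(void)
   {
       /* Assumed, illustrative share of total NFS receive overhead that
        * is TCP/IP header processing (the part TOE can remove). */
       double f = 0.20;

       double remaining     = 1.0 - f;          /* copies, interrupts, etc. */
       double speedup_bound = 1.0 / remaining;  /* Amdahl's law upper bound */

       printf("TOE alone removes at most %.0f%% of overhead: <= %.2fx speedup\n",
              f * 100.0, speedup_bound);
       return 0;
   }
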
...

Neither peer, however, is aware of the other's data destination in
the current NFS, RPC or XDR protocols.  Existing NFS
implementations have struggled with the performance costs of data
copies when using traditional Ethernet transports.

With the onset of faster networks, the network I/O bottleneck will
worsen.  Fortunately, new transports that support RDMA have
emerged.  RDMA excels at bulk transfer efficiency; it is an
efficient way to deliver direct data placement and remove a major
part of the problem: data copies.  RDMA also addresses other
overheads, e.g., underlying protocol offload, and offers separation
of control information from data.

The current NFS message layout provides the performance-enhancing
opportunity for an NFS over RDMA protocol that separates the
control information from data chunks while meeting the alignment
needs of both.  The data chunks can be copied "directly" between
the client and server memory addresses above (with a single
occurrence on each memory bus) while the control information can be
passed "inline".  [RPCRDMA] describes such a protocol.

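A server-side sketch of this separation for an NFS READ reply is
shown below.  The transport helpers are hypothetical stand-ins for
an RDMA provider interface, not a real API, and the logic is
schematic rather than drawn from [RPCRDMA].

   #include <stddef.h>
   #include <stdint.h>

   struct rdma_segment { uint32_t handle; uint32_t length; uint64_t offset; };

   /* Hypothetical stubs standing in for a real RDMA provider interface. */
   static int rdma_write(void *conn, const void *src, size_t len,
                         const struct rdma_segment *dst)
   { (void)conn; (void)src; (void)len; (void)dst; return 0; }

   static int rdma_send(void *conn, const void *msg, size_t len)
   { (void)conn; (void)msg; (void)len; return 0; }

   static int send_read_reply(void *conn,
                              const void *file_data, size_t count,
                              const struct rdma_segment *client_chunk,
                              const void *reply_hdr, size_t hdr_len)
   {
       if (count > client_chunk->length)
           return -1;            /* data must fit the advertised buffer */

       /* 1. Bulk data: placed directly into the client's advertised
        *    buffer by RDMA Write, with no receive-side copy. */
       if (rdma_write(conn, file_data, count, client_chunk) != 0)
           return -1;

       /* 2. Control information: the small RPC/NFS reply header travels
        *    "inline" as an ordinary send. */
       return rdma_send(conn, reply_hdr, hdr_len);
   }
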
...

performance enhancements and improved semantics described above.
The minor versioning support defined in NFS version 4 was designed
to support protocol improvements without disruption to the
installed base.  Evolutionary improvement of the protocol via minor
versioning is a conservative and cautious approach to current and
future problems and shortcomings.

Many arguments can be made as to the efficacy of the file
abstraction in meeting the future needs of enterprise data service
and the Internet.  Fine grained Quality of Service (QoS) policies
(e.g., data delivery, retention, availability, security, ...) are
high among them.

It is vital that the NFS protocol continue to provide these
benefits to a wide range of applications, without its usefulness
being compromised by concerns about performance and semantic
inadequacies.  This can reasonably be addressed in the existing NFS
protocol framework.  A cautious evolutionary improvement of
performance and semantics allows building on the value already
present in the NFS protocol, while addressing new requirements that
have arisen from the application of networking technology.

7. Security Considerations

The NFS protocol, in conjunction with its layering on RPC, provides
a rich and widely interoperable security model to applications and
systems.  Any layering of NFS over RDMA transports must address the
NFS security requirements, and additionally must ensure that no new
vulnerabilities are introduced.  For RDMA, the integrity, and any
privacy, of the data stream are of particular importance.

Security Considerations must be addressed by any relevant RDMA
transport layering.  The protocol described in [RPCRDMA] provides
one such approach.

8. IANA Considerations

This document has no IANA considerations.

9. Acknowledgements

The authors wish to thank Jeff Chase who provided many useful
suggestions.

10. Normative References

[RFC3530]
   S. Shepler, et al., "NFS Version 4 Protocol", Standards Track
   RFC

[RFC1831bis]
   R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
   Specification Version 2", Standards Track RFC

[RFC4506]
   M. Eisler, Ed., "XDR: External Data Representation Standard",
   Standards Track RFC

[RFC1813]
   B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
   Protocol Specification", Informational RFC