draft-ietf-rddp-arch-00.txt   draft-ietf-rddp-arch-01.txt 
Internet-Draft                     Stephen Bailey (Sandburst)
[draft-00] Expires: June 2003      Tom Talpey (NetApp)
[draft-01] Expires: August 2003    Tom Talpey (NetApp)

The Architecture of Direct Data Placement (DDP)
And Remote Direct Memory Access (RDMA)
On Internet Protocols
[draft-00] draft-ietf-rddp-arch-00
[draft-01] draft-ietf-rddp-arch-01

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-Drafts.

skipping to change at page 1, line 35

progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright Notice

[draft-00] Copyright (C) The Internet Society (2002). All Rights Reserved.
[draft-01] Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document defines an abstract architecture for Direct Data
Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to
run on Internet Protocol-suite transports. This architecture does
not necessarily reflect the proper way to implement such protocols,
but is, rather, a descriptive tool for defining and understanding
the protocols.

Table Of Contents

   1. Introduction . . . . . . . . . . . . . . . . . . . . . . 2
   2. Architecture . . . . . . . . . . . . . . . . . . . . . . 3
   2.1. Direct Data Placement (DDP) Protocol Architecture . . . 3
   2.1.1. Transport Operations . . . . . . . . . . . . . . . . . . 5
   2.1.2. DDP Operations . . . . . . . . . . . . . . . . . . . . . 6
   2.1.3. Transport Characteristics in DDP . . . . . . . . . . . . 9
[draft-00]    2.2. Remote Direct Memory Access Protocol Architecture . . . 10
[draft-01]    2.2. Remote Direct Memory Access Protocol Architecture . . . 11
[draft-00]    2.2.1. RDMA Operations . . . . . . . . . . . . . . . . . . . . 11
[draft-01]    2.2.1. RDMA Operations . . . . . . . . . . . . . . . . . . . . 12
   2.2.2. Transport Characteristics in RDMA . . . . . . . . . . . 14
[draft-00]    3. Security Considerations . . . . . . . . . . . . . . . . 14
[draft-01]    3. Security Considerations . . . . . . . . . . . . . . . . 15
   4. IANA Considerations . . . . . . . . . . . . . . . . . . 15
   5. Acknowledgements . . . . . . . . . . . . . . . . . . . . 15
[draft-00]    References . . . . . . . . . . . . . . . . . . . . . . . 15
[draft-01]    References . . . . . . . . . . . . . . . . . . . . . . . 16
   Authors' Addresses . . . . . . . . . . . . . . . . . . . 16
   Full Copyright Statement . . . . . . . . . . . . . . . . 17

1. Introduction

This document defines an abstract architecture for Direct Data
Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to
run on Internet Protocol-suite transports [RDDP, ROM]. This
architecture does not necessarily reflect the proper way to
implement such protocols, but is, rather, a descriptive tool for

skipping to change at page 6, line 46

Real implementations of xpt_send() and xpt_recv() typically return
error indications, but that is not relevant to this architecture.

2.1.2. DDP Operations

The DDP layer provides:

   void ddp_send(socket_t s, message_t m);
   void ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d,
                     ddp_notify_t n);
[draft-00]    ddp_recv_t ddp_recv(socket_t s);
[draft-01]    void ddp_post_recv(socket_t s, bdesc_t b);
[draft-01]    ddp_ind_t ddp_recv(socket_t s);
   bdesc_t ddp_register(socket_t s, ddp_buffer_t b);
   void ddp_deregister(bhand_t bh);
   msizes_t ddp_max_msizes(socket_t s);

ddp_addr_t
   the buffer address portion of a tagged message:
   typedef struct {
      stag_t stag;
      address_t offset;

skipping to change at page 8, line 21

   which may have no relationship to the `start' or `end'
   addresses of that buffer. However, particular
   implementations, such as DDP on a multicast transport (see
   below), may allow some client protocol control over the
   starting offset.
bhand_t
   an opaque buffer handle used to deregister a buffer.
[draft-00] ddp_recv_t
[draft-01] recv_message_t
[draft-01]    a description of a completed untagged receive buffer:
[draft-01]    typedef struct {
[draft-01]       bdesc_t b;
[draft-01]       length l;
[draft-01]    } recv_message_t;
[draft-01] ddp_ind_t
   an untagged message, a tagged message reception indication, or
   a tagged message reception error:
   typedef union {
[draft-00]       message_t m;
[draft-01]       recv_message_t m;
      ddp_msg_id_t i;
      ddp_err_t e;
[draft-00]    } ddp_recv_t;
[draft-01]    } ddp_ind_t;
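
As written, ddp_ind_t is a bare union, so a C rendering needs some way to record which member is valid. The minimal sketch below assumes a discriminant (the hypothetical ddp_ind_kind_t) wrapped around the draft-01 union. It also picks concrete integer types for bdesc_t, length, ddp_msg_id_t and ddp_err_t, none of which the draft specifies. handle_indication() is a hypothetical client routine that tells the three cases apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical concrete types; the draft leaves all of these abstract. */
typedef uint32_t bdesc_t;       /* descriptor of a posted receive buffer */
typedef size_t   length;        /* octet count, as used in recv_message_t */
typedef uint32_t ddp_msg_id_t;  /* identifies a received tagged message */
typedef int      ddp_err_t;     /* tagged message reception error code */

typedef struct {
    bdesc_t b;   /* which posted buffer was filled */
    length  l;   /* how many octets it received */
} recv_message_t;

/* Assumed discriminant: not part of the draft's union. */
typedef enum {
    DDP_IND_UNTAGGED,
    DDP_IND_TAGGED,
    DDP_IND_ERROR
} ddp_ind_kind_t;

/* Discriminated wrapper around the draft-01 ddp_ind_t union. */
typedef struct {
    ddp_ind_kind_t kind;
    union {
        recv_message_t m;  /* completed untagged receive buffer */
        ddp_msg_id_t   i;  /* tagged message reception indication */
        ddp_err_t      e;  /* tagged message reception error */
    } u;
} ddp_ind_t;

/* Hypothetical client handler: returns the octet count of an untagged
   message, 0 for a tagged indication (its data was placed directly and
   only the identifier is reported), or -1 for a reception error. */
long handle_indication(const ddp_ind_t *ind)
{
    switch (ind->kind) {
    case DDP_IND_UNTAGGED:
        return (long)ind->u.m.l;
    case DDP_IND_TAGGED:
        return 0;
    case DDP_IND_ERROR:
        return -1;
    }
    assert(0 && "unknown indication kind");
    return -1;
}
```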
ddp_err_t
   indicates an error while receiving a tagged message, typically
   `offset' out of bounds, or `stag' is not registered to the
   socket.
msizes_t
   the maximum untagged and tagged message sizes that fit in a
   single transport message:
   typedef struct {
      msize_t max_untagged;
      msize_t max_tagged;
   } msizes_t;
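
The following sketch shows how a receiver might sort an arriving tagged message into the two typical ddp_err_t cases listed above. It assumes a simple table of registrations. The registration_t record, its base/len fields, check_placement() and the DDP_OK value are all hypothetical. The draft leaves the representation of offsets and errors open and says only that offsets need not correspond to the buffer's start or end addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical concrete types for the draft's abstract ones. */
typedef int      socket_t;
typedef uint32_t stag_t;
typedef uint64_t address_t;

typedef struct { stag_t stag; address_t offset; } ddp_addr_t;

/* DDP_OK is an added "no error" value for convenience. */
typedef enum { DDP_OK = 0, DDP_ERR_STAG, DDP_ERR_BOUNDS } ddp_err_t;

/* One registration: an STag bound to a socket, covering offsets
   [base, base + len). `base` need not be the buffer's start address. */
typedef struct {
    stag_t    stag;
    socket_t  sock;
    address_t base;
    uint64_t  len;
} registration_t;

/* Decides whether a tagged message of `n` octets arriving on socket `s`
   for address `d` may be placed. */
ddp_err_t check_placement(const registration_t *regs, size_t nregs,
                          socket_t s, ddp_addr_t d, uint64_t n)
{
    assert(regs != NULL || nregs == 0);
    for (size_t k = 0; k < nregs; k++) {
        const registration_t *r = &regs[k];
        if (r->stag != d.stag || r->sock != s)
            continue;
        /* Written as subtractions so that d.offset + n cannot overflow. */
        if (d.offset < r->base || d.offset - r->base > r->len ||
            n > r->len - (d.offset - r->base))
            return DDP_ERR_BOUNDS;
        return DDP_OK;
    }
    return DDP_ERR_STAG;  /* STag not registered to this socket */
}
```

Because the lookup matches on both STag and socket, a buffer registered on several sockets appears as several table entries in this sketch. A shared ("common") registration would need a different representation.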
ddp_send(socket_t s, message_t m)
   send an untagged message.
ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d, ddp_notify_t n)
[draft-00]    send a tagged message.
[draft-01]    send a tagged message to remote buffer address d.
[draft-01] ddp_post_recv(socket_t s, bdesc_t b)
[draft-01]    post a registered buffer to accept received untagged messages.
ddp_recv(socket_t s)
   get the next received untagged message, tagged message
   reception indication, or tagged message error.
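
Draft-01 pairs ddp_post_recv() with ddp_recv(), but this section does not say how posted buffers are matched to arriving untagged messages. The sketch below assumes first-in, first-out consumption, in line with the RDMA text's "next untagged RDMA buffer". It also assumes that a message is rejected if no buffer is posted or if it does not fit. MAX_POSTED, posted_buf_t, post_recv() and deliver_untagged() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_POSTED 8  /* hypothetical limit on outstanding posted buffers */

typedef struct {
    uint32_t id;    /* hypothetical stand-in for the bdesc_t */
    uint8_t *base;  /* start of the registered buffer */
    size_t   cap;   /* its size in octets */
} posted_buf_t;

typedef struct {
    uint32_t b;  /* id of the buffer that was consumed */
    size_t   l;  /* octets placed in it */
} recv_message_t;

typedef struct {
    posted_buf_t q[MAX_POSTED];  /* ring of posted buffers */
    size_t head;                 /* index of the oldest posted buffer */
    size_t count;                /* number currently posted */
} recv_queue_t;

/* Models ddp_post_recv(): append a buffer to the tail of the queue. */
bool post_recv(recv_queue_t *rq, uint8_t *base, size_t cap, uint32_t id)
{
    assert(rq->count <= MAX_POSTED);
    if (rq->count == MAX_POSTED)
        return false;
    rq->q[(rq->head + rq->count) % MAX_POSTED] =
        (posted_buf_t){ .id = id, .base = base, .cap = cap };
    rq->count++;
    return true;
}

/* Models the arrival of one untagged message: it fills the oldest
   posted buffer, whose completion ddp_recv() would then report. */
bool deliver_untagged(recv_queue_t *rq, const uint8_t *msg, size_t len,
                      recv_message_t *out)
{
    if (rq->count == 0)
        return false;                 /* nothing posted */
    posted_buf_t *pb = &rq->q[rq->head];
    if (len > pb->cap)
        return false;                 /* does not fit; buffer stays posted */
    memcpy(pb->base, msg, len);
    *out = (recv_message_t){ .b = pb->id, .l = len };
    rq->head = (rq->head + 1) % MAX_POSTED;
    rq->count--;
    return true;
}
```

Leaving an oversized message's target buffer posted is a choice made for this sketch. A real protocol might instead consume the buffer and report an error through ddp_recv().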
ddp_register(socket_t s, ddp_buffer_t b)
   register a buffer for DDP on a socket. The same buffer may be
   registered multiple times on the same or different sockets.
[draft-00]    Different buffers may also refer to portions of the same
[draft-00]    underlying addressable object (buffer aliasing).
[draft-01]    The same buffer registered on different sockets may result in
[draft-01]    a common registration. Different buffers may also refer to
[draft-01]    portions of the same underlying addressable object (buffer
[draft-01]    aliasing).
ddp_deregister(bhand_t bh)
   remove a registration from a buffer.
ddp_max_msizes(socket_t s)
   get the current maximum untagged and tagged message sizes that
   will fit in a single transport message.
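
ddp_max_msizes() reports how much fits in one transport message. This suggests that a client protocol may need to split a larger tagged transfer into several ddp_send_ddp() calls. The sketch below plans such a split under that assumption. Each fragment advances the remote offset, and only the last one requests notification, mirroring the "final fragment" role that the RDMA section gives rdma_notify_t (described there as identical in function to ddp_notify_t). The fragment_t record, plan_tagged_send() and the concrete field types are hypothetical. ddp_notify_t is modeled on the rdma_notify_t layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical concrete types for the draft's abstract ones. */
typedef uint32_t stag_t;
typedef uint64_t address_t;
typedef uint32_t ddp_msg_id_t;

typedef struct { stag_t stag; address_t offset; } ddp_addr_t;

/* Assumed to mirror rdma_notify_t. */
typedef struct { bool notify; ddp_msg_id_t i; } ddp_notify_t;

/* Arguments of one ddp_send_ddp() call the sender would issue. */
typedef struct {
    size_t       src_off;  /* where this piece starts in the local data */
    size_t       len;      /* octets carried by this tagged message */
    ddp_addr_t   d;        /* remote placement address for this piece */
    ddp_notify_t n;        /* notification request */
} fragment_t;

/* Splits a `total`-octet transfer to remote address `d` into tagged
   messages of at most `max_tagged` octets (e.g. the max_tagged field
   returned by ddp_max_msizes()). Only the final fragment asks for
   notification. Returns the fragment count, or 0 if the arguments are
   invalid or `out` cannot hold every fragment. */
size_t plan_tagged_send(size_t total, size_t max_tagged, ddp_addr_t d,
                        ddp_msg_id_t id, fragment_t *out, size_t out_cap)
{
    if (total == 0 || max_tagged == 0)
        return 0;
    size_t nfrag = total / max_tagged + (total % max_tagged != 0);
    if (nfrag > out_cap)
        return 0;
    for (size_t k = 0; k < nfrag; k++) {
        size_t off = k * max_tagged;
        size_t len = (total - off < max_tagged) ? total - off : max_tagged;
        out[k] = (fragment_t){
            .src_off = off,
            .len = len,
            .d = { .stag = d.stag, .offset = d.offset + off },
            .n = { .notify = (k == nfrag - 1), .i = id },
        };
    }
    assert(out[nfrag - 1].n.notify);
    return nfrag;
}
```

For example, planning 2500 octets with max_tagged = 1000 yields three fragments of 1000, 1000 and 500 octets. Only the third carries notify = true.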

[draft-00] skipping to change at page 10, line 6
[draft-01] skipping to change at page 10, line 24

Some transports support several combinations of these
characteristics. For example, SCTP [SCTP] is reliable, single
source, single destination (point-to-point) and supports both
ordered and unordered modes.

In general, these transport characteristics equally affect
transport and DDP message delivery. However, there are several
issues specific to DDP messages.

[draft-01] DDP messages carried by transport are framed for processing by the
[draft-01] receiver, and may be further protected for integrity or privacy in
[draft-01] accordance with the transport capabilities. DDP does not provide
[draft-01] such functions.

A key component of DDP is how the following operations on the
receiving side are ordered among themselves, and how they relate to
corresponding operations on the sending side:
   o set()s,
   o untagged message reception indications, and
   o tagged message reception indications.

[draft-00] skipping to change at page 12, line 9
[draft-01] skipping to change at page 12, line 27

   } rdma_buffer_t;

2.2.1. RDMA Operations

The RDMA layer provides:

   void rdma_send(socket_t s, message_t m);
   void rdma_write(socket_t s, message_t m, ddp_addr_t d,
                   rdma_notify_t n);
   void rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d);
[draft-00]    rdma_recv_t rdma_recv(socket_t s);
[draft-01]    void rdma_post_recv(socket_t s, bdesc_t b);
[draft-01]    rdma_ind_t rdma_recv(socket_t s);
   bdesc_t rdma_register(socket_t s, rdma_buffer_t b,
                         bmode_t mode);
   void rdma_deregister(bhand_t bh);
   msizes_t rdma_max_msizes(socket_t s);

Although, for clarity, these data transfer interfaces are
synchronous, rdma_read() and possibly rdma_send() (in the presence
of Send flow control) can require an arbitrary amount of time to
complete. To express the full concurrency and interleaving of RDMA
[draft-00] data transfer, these interfaces are also defined to be
[draft-00] multithreaded. For example, a client protocol may perform an
[draft-00] rdma_send() while an rdma_read() operation is in progress.
[draft-01] data transfer, these interfaces should also be reentrant. For
[draft-01] example, a client protocol may perform an rdma_send() while an
[draft-01] rdma_read() operation is in progress.

rdma_notify_t
   RDMA Write notification information, used to signal that the
   message represents the final fragment of a multi-segmented
   RDMA message:
   typedef struct {
      boolean_t notify;
      rdma_write_id_t i;
   } rdma_notify_t;
   identical in function to ddp_notify_t, except that the type
   rdma_write_id_t may not be equivalent to ddp_msg_id_t.
rdma_write_id_t (scalar)
   an RDMA Write identifier.
[draft-00] rdma_recv_t
[draft-00]    a Send message, an RDMA Write completion identifier, or an
[draft-00]    RDMA error:
[draft-01] rdma_ind_t
[draft-01]    a Send message, or an RDMA error:
   typedef union {
[draft-00]       message_t m;
[draft-00]       rdma_write_id_t i;
[draft-01]       recv_message_t m;
      rdma_err_t e;
[draft-00]    } rdma_recv_t;
[draft-01]    } rdma_ind_t;
rdma_err_t
   an RDMA protocol error indication. RDMA errors include buffer
   addressing errors corresponding to ddp_err_ts, and buffer
   protection violations (e.g., an RDMA Write to a buffer
   registered only for reading).
bmode_t
   buffer registration mode (permissions). Any combination of

[draft-00] skipping to change at page 13, line 26
[draft-01] skipping to change at page 13, line 48

rdma_send(socket_t s, message_t m)
   send a message, delivering it to the next untagged RDMA buffer
   at the remote peer.
rdma_write(socket_t s, message_t m, ddp_addr_t d, rdma_notify_t n)
   RDMA Write to remote buffer address d.
[draft-00] rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d)
[draft-00]    RDMA Read from remote buffer address s to local buffer address
[draft-00]    d.
[draft-01] rdma_read(socket_t s, ddp_addr_t s, length l, ddp_addr_t d)
[draft-01]    RDMA Read l octets from remote buffer address s to local
[draft-01]    buffer address d.
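
The sketch below illustrates the draft-01 rdma_read() semantics by emulating an RDMA Read inside one process. The responder's registrations resolve the remote address, the requester's registrations resolve the local one, and l octets are copied. The remote source is named src and the local destination dst, because the draft uses s for both the socket and the source address. region_t, resolve() and emulate_rdma_read() are hypothetical. The socket, the wire exchange and access permissions are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical concrete types for the draft's abstract ones. */
typedef uint32_t stag_t;
typedef uint64_t address_t;
typedef struct { stag_t stag; address_t offset; } ddp_addr_t;

/* Hypothetical registered region: the STag names it and offsets start at 0. */
typedef struct {
    stag_t   stag;
    uint8_t *mem;
    size_t   len;
} region_t;

/* Maps an address to memory if l octets starting there lie inside a
   registered region; returns NULL on an addressing error. */
static uint8_t *resolve(region_t *regs, size_t n, ddp_addr_t a, size_t l)
{
    for (size_t k = 0; k < n; k++)
        if (regs[k].stag == a.stag)
            return (a.offset <= regs[k].len && l <= regs[k].len - a.offset)
                       ? regs[k].mem + a.offset
                       : NULL;
    return NULL;
}

/* Emulates rdma_read(s, src, l, dst) within one address space.
   Returns 0 on success, -1 on an addressing error at either end. */
int emulate_rdma_read(region_t *remote, size_t nremote, ddp_addr_t src,
                      size_t l, region_t *local, size_t nlocal, ddp_addr_t dst)
{
    assert(remote != NULL && local != NULL);
    const uint8_t *from = resolve(remote, nremote, src, l);  /* responder side */
    uint8_t *to = resolve(local, nlocal, dst, l);            /* requester side */
    if (from == NULL || to == NULL)
        return -1;
    memcpy(to, from, l);
    return 0;
}
```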
[draft-01] rdma_post_recv(socket_t s, bdesc_t b)
[draft-01]    post a registered buffer to accept received Send messages.
rdma_recv(socket_t s);
   get the next received Send message, RDMA Write completion
   identifier, or RDMA error.
rdma_register(socket_t s, rdma_buffer_t b, bmode_t mode)
   register a buffer for RDMA on a socket (for read access, write
   access or both). As with DDP, the same buffer may be

[draft-00] skipping to change at page 15, line 11
[draft-01] skipping to change at page 15, line 36

Because a Steering Tag exports access to a memory region, one
critical aspect of security is the scope of this access. It must
be possible to individually control specific attributes of the
access provided by a Steering Tag, including remote read access,
remote write access, and others that might be identified. A
specification must provide both implementation requirements
relevant to this issue, and guidelines to assist implementors in
making the appropriate design decisions.

[draft-01] A number of other potential attacks have been envisioned and must
[draft-01] be addressed. Some such examples are outlined in [RDMACON].

Resource issues leading to denial-of-service attacks, overwrites
and other concurrent operations, the ordering of completions as
required by the RDMA protocol, and the granularity of transfer are
all within the required scope of any security analysis of RDMA and
DDP.

4. IANA Considerations

IANA considerations are not addressed by this document. Any
IANA considerations resulting from the use of DDP or RDMA must be

[draft-00] skipping to change at page 15, line 50
[draft-01] skipping to change at page 16, line 25

[IB]
   InfiniBand Architecture Specification, Volumes 1 and 2,
   Release 1.0.a. http://www.infinibandta.org
[MYR]
   Myrinet, http://www.myricom.com
[RDDP]
   Remote Direct Data Placement Working Group charter,
   http://www.ietf.org/html.charters/rddp-charter.html
[draft-01] [RDMACON]
[draft-01]    D. Black, M. Speer, J. Wroclawski, "DDP and RDMA Concerns",
[draft-01]    http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdma-concerns-00.txt,
[draft-01]    Work in Progress, December 2002
[ROM]
   A. Romanow, J. Mogul, T. Talpey, S. Bailey, "RDMA over IP
   Problem Statement",
[draft-00]    http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-statement-00.txt,
[draft-00]    Work in Progress, December 2002
[draft-01]    http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-statement-01.txt,
[draft-01]    Work in Progress, February 2003
[SCTP]
   R. Stewart et al., "Stream Control Transmission Protocol",
   Standards Track RFC, http://www.ietf.org/rfc/rfc2960.txt
[SDP]
   Sockets Direct Protocol v1.0
[SRVNET]
   Compaq Servernet,
   http://nonstop.compaq.com/view.asp?PAGE=ServerNet
[VI]
   Virtual Interface Architecture Specification Version 1.0.
[draft-00]    http://www.viarch.org/html/collateral/san_10.pdf
[draft-01]    http://www.vidf.org/info/04standards.html

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA
Phone: +1 978 689 1614
Email: steph@sandburst.com

[draft-00] skipping to change at page 17, line 7
[draft-01] skipping to change at page 17, line 23

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

[draft-00] Copyright (C) The Internet Society (2002). All Rights Reserved.
[draft-01] Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the

End of changes.

This html diff was produced by rfcdiff 1.23, available from http://www.levkowetz.com/ietf/tools/rfcdiff/