Remote Direct Data Placement Work Group   P. Culley
   INTERNET-DRAFT                              Hewlett-Packard Company
   draft-ietf-rddp-mpa-06.txt                U. Elzur
                                               Broadcom Corporation
                                             R. Recio
                                               IBM Corporation
                                             S. Bailey
                                               Sandburst Corporation
                                             J. Carrier
                                               Cray Inc.

   Expires: February 2007                    September 5, 2006

             Marker PDU Aligned Framing for TCP Specification

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html.  The list of Internet-Draft
   Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Abstract

   MPA (Marker Protocol data unit Aligned framing) is designed to work
   as an "adaptation layer" between TCP and the Direct Data Placement
   [DDP] protocol, preserving the reliable, in-order delivery of TCP,
   while adding the preservation of higher-level protocol record
   boundaries that DDP requires.  MPA is fully compliant with applicable
   TCP RFCs and can be utilized with existing TCP implementations.  MPA
   also supports integrated implementations that combine TCP, MPA and
   DDP to reduce buffering requirements in the implementation and
   improve performance at the system level.

   Table of Contents

   Status of this Memo                                                 1
   Abstract                                                            1
   1      Glossary                                                     5
   2      Introduction                                                 8
   2.1    Motivation                                                   8
   2.2    Protocol Overview                                            8
   3      MPA's interactions with DDP                                 12
   4      MPA Full Operation Mode                                     14
   4.1    FPDU Format                                                 14
   4.2    Marker Format                                               15
   4.3    MPA Markers                                                 15
   4.4    CRC Calculation                                             18
   4.5    FPDU Size Considerations                                    21
   5      MPA's interactions with TCP                                 23
   5.1    MPA transmitters with a standard layered TCP                23
   5.2    MPA receivers with a standard layered TCP                   24
   6      MPA Receiver FPDU Identification                            24
   7      Connection Semantics                                        26
   7.1    Connection setup                                            26
   7.1.1  MPA Request and Reply Frame Format                          28
   7.1.2  Connection Startup Rules                                    29
   7.1.3  Example Delayed Startup sequence                            32
   7.1.4  Use of Private Data                                         35
   7.1.4.1  Motivation                                                35
   7.1.4.2  Example Immediate Startup using Private Data              36
   7.1.5  "Dual stack" implementations                                38
   7.2    Normal Connection Teardown                                  39
   8      Error Semantics                                             40
   9      Security Considerations                                     41
   9.1    Protocol-specific Security Considerations                   41
   9.1.1  Spoofing                                                    41
   9.1.1.1  Impersonation                                             41
   9.1.1.2  Stream Hijacking                                          42
   9.1.1.3  Man in the Middle Attack                                  42
   9.1.2  Eavesdropping                                               42
   9.2    Introduction to Security Options                            43
   9.3    Using IPsec With MPA                                        43
   9.4    Requirements for IPsec Encapsulation of MPA/DDP             44
   10     IANA Considerations                                         45
   A Appendix. Optimized MPA-aware TCP implementations                46
   A.1    Optimized MPA/TCP transmitters                              46
   A.2    Effects of Optimized MPA/TCP Segmentation                   47
   A.3    Optimized MPA/TCP receivers                                 49
   A.4    Re-segmenting Middle boxes and non optimized MPA/TCP senders50
   A.5    Receiver implementation                                     51
   A.5.1  Network Layer Reassembly Buffers                            52
   A.5.2  TCP Reassembly buffers                                      53
   B Appendix. Analysis of MPA over TCP Operations                    54
   B.1    Assumptions                                                 54
   B.1.1  MPA is layered beneath DDP [DDP]                            54
   B.1.2  MPA preserves DDP message framing                           55
   B.1.3  The size of the ULPDU passed to MPA is less than EMSS under
          normal conditions                                           55
   B.1.4  Out-of-order placement but NO out-of-order Delivery         55
   B.2    The Value of FPDU Alignment                                 55
   B.2.1  Impact of lack of FPDU Alignment on the receiver computational
          load and complexity                                         57
   B.2.2  FPDU Alignment effects on TCP wire protocol                 61
   C Appendix. IETF Implementation Interoperability with RDMA Consortium
          Protocols                                                   63
   C.1    Negotiated Parameters                                       63
   C.2    RDMAC RNIC and Non-permissive IETF RNIC                     64
   C.2.1  RDMAC RNIC Initiator                                        65
   C.2.2  Non-Permissive IETF RNIC Initiator                          66
   C.2.3  RDMAC RNIC and Permissive IETF RNIC                         66
   C.2.4  RDMAC RNIC Initiator                                        67
   C.2.5  Permissive IETF RNIC Initiator                              67
   C.3    Non-Permissive IETF RNIC and Permissive IETF RNIC           67
   Normative References                                               69
   Informative References                                             69
   Author's Addresses                                                 71
   Acknowledgments                                                    72
   Full Copyright Statement                                           75
   Intellectual Property                                              75

   Table of Figures

   Figure 1 ULP MPA TCP Layering                                       9
   Figure 2 FPDU Format                                               14
   Figure 3 Marker Format                                             15
   Figure 4 Example FPDU Format with Marker                           17
   Figure 5 Annotated Hex Dump of an FPDU                             20
   Figure 6 Annotated Hex Dump of an FPDU with Marker                 21
   Figure 7 Fully layered implementation                              23
   Figure 8 MPA Request/Reply Frame                                   28
   Figure 9: Example Delayed Startup negotiation                      33
   Figure 10: Example Immediate Startup negotiation                   36
   Figure 11 Optimized MPA/TCP implementation                         46
   Figure 12: Non-aligned FPDU freely placed in TCP octet stream      57
   Figure 13: Aligned FPDU placed immediately after TCP header        59
   Figure 14.  Connection Parameters for the RNIC Types.              64
   Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
          IETF RNIC.                                                  65
   Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
          IETF RNIC.                                                  66
   Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
          Permissive IETF RNIC.                                       68

   Revision history [To be deleted prior to RFC publication]

   [draft-ietf-rddp-mpa-06] workgroup draft with the following changes:

        Document restructuring to move descriptive information on
        implementing optimized MPA/TCP implementations to an appendix.
        All normative text was removed from the appendix.  Paragraph
        added to security section explaining IPSEC version.  Added
        informative references to architecture, applicability, and
        problem statement documents.

   [draft-ietf-rddp-mpa-05] workgroup draft with the following changes:

        Document restructuring to differentiate between fully layered
        MPA on TCP implementations and optimized MPA/TCP
        implementations.  This involved somewhat blurring the artificial
        layer between MPA and an MPA-aware TCP, along with a small
        amount of terminology change.

        Re-wrote the requirement to avoid duplicate segments during TCP
        out of order passing to MPA; this is now a co-responsibility
        between MPA and TCP; also explained that the requirement was to
        avoid data corruption through bypassing MPA CRCs and other
        checks.

1  Glossary

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
       this document are to be interpreted as described in [RFC2119].

   Consumer - the ULPs or applications that lie above MPA and DDP.  The
       Consumer is responsible for making TCP connections, starting MPA
       and DDP connections, and generally controlling operations.

   Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
       the process of informing DDP that a particular PDU is ordered for
       use.  A PDU is Delivered in the exact order that it was sent by
       the original sender; MPA uses TCP's byte stream ordering to
       determine when Delivery is possible.  This is specifically
       different from "passing the PDU to DDP", which may generally
       occur in any order, while the order of Delivery is strictly
       defined.

   EMSS - Effective Maximum Segment Size.  EMSS is the smaller of the
       TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
       and the current path Maximum Transfer Unit (MTU) [RFC1191].

   FPDU - Framed Protocol Data Unit.  The unit of data created by an MPA
       sender.

   FPDU Alignment - the property that an FPDU is Header Aligned with the
       TCP segment, and the TCP segment includes an integer number of
        FPDUs.  A TCP segment with FPDU Alignment allows immediate
       processing of the contained FPDUs without waiting on other TCP
       segments to arrive or combining with prior segments.

   FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate
       the beginning of an FPDU.

   Full Operation (Full Operation Phase) - After the completion of the
        Startup Phase, MPA begins exchanging FPDUs.

   Header Alignment - the property that a TCP segment begins with an
       FPDU.  The FPDU is Header Aligned when the FPDU header is exactly
       at the start of the TCP segment (right behind the TCP headers on
       the wire).

   Initiator - The endpoint of a connection that sends the MPA Request
       Frame, i.e. the first to actually send data (which may not be the
       one which sends the TCP SYN).

   Marker - A four octet field that is placed in the MPA data stream at
       fixed octet intervals (every 512 octets).

   MPA-aware TCP - a TCP implementation that is aware of the receiver
       efficiencies of MPA FPDU Alignment and is capable of sending TCP
       segments that begin with an FPDU.

   MPA-enabled - MPA is enabled if the MPA protocol is visible on the
       wire.  When the sender is MPA-enabled, it is inserting framing
       and Markers.  When the receiver is MPA-enabled, it is
       interpreting framing and Markers.

   MPA Request Frame - Data sent from the MPA Initiator to the MPA
       Responder during the Startup Phase.

   MPA Reply Frame - Data sent from the MPA Responder to the MPA
       Initiator during the Startup Phase.

   MPA - Marker-based ULP PDU Aligned Framing for TCP protocol.  This
       document defines the MPA protocol.

   MULPDU - Maximum ULPDU.  The current maximum size of the record that
       is acceptable for DDP to pass to MPA for transmission.

   Node - A computing device attached to one or more links of a Network.
       A Node in this context does not refer to a specific application
       or protocol instantiation running on the computer.  A Node may
       consist of one or more MPA on TCP devices installed in a host
       computer.

   PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact
       modulo 4 size.

   PDU - protocol data unit

   Private Data - A block of data exchanged between MPA endpoints during
       initial connection setup.

   Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that
        ties the use of various endpoint resources (memory access, etc.)
        to the
       specific RDMA/DDP/MPA connection.

   RDDP - a suite of protocols including MPA, [DDP], [RDMAP], an overall
       security document [RDMASEC], a problem statement [RFC4297], an
       architecture document [RFC4296], and an applicability document
       [APPL].

   RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
       to enable applications to transfer data directly from memory
       buffers.  See [RDMAP].

   Remote Peer - The MPA protocol implementation on the opposite end of
       the connection.  Used to refer to the remote entity when
       describing protocol exchanges or other interactions between two
       Nodes.

   Responder - The connection endpoint which responds to an incoming MPA
        connection request (the MPA Request Frame).  This may not be the
       endpoint which awaited the TCP SYN.

   Startup Phase - The initial exchanges of an MPA connection which
        serve to more fully identify MPA endpoints to each other and
       pass connection specific setup information to each other.

   ULP - Upper Layer Protocol.  The protocol layer above the protocol
       layer currently being referenced.  The ULP for MPA is DDP [DDP].

   ULPDU - Upper Layer Protocol Data Unit.  The data record defined by
      the layer above MPA (DDP).  ULPDU corresponds to DDP's DDP
      segment.

   ULPDU_Length - a field in the FPDU describing the length of the
      included ULPDU.

2  Introduction

   This section discusses the reason for creating MPA on TCP and a
   general overview of the protocol.

2.1 Motivation

   The Direct Data Placement protocol [DDP], when used with TCP [RFC793]
   requires a mechanism to detect record boundaries.  The DDP records
   are referred to as Upper Layer Protocol Data Units by this document.
   The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
   boundary is useful to a hardware network adapter that uses DDP to
   directly place the data in the application buffer based on the
   control information carried in the ULPDU header.  This may be done
   without requiring that the packets arrive in order.  Potential
   benefits of this capability are the avoidance of the memory copy
   overhead and a smaller memory requirement for handling out of order
   or dropped packets.

   Many approaches have been proposed for a generalized framing
   mechanism.  Some are probabilistic in nature and others are
   deterministic.  A probabilistic approach is characterized by a
   detectable value embedded in the octet stream.  It is probabilistic
   because under some conditions the receiver may incorrectly interpret
   application data as the detectable value.  Under these conditions,
   the protocol may fail with unacceptable frequency.  A deterministic
   approach is characterized by embedded controls at known locations in
   the octet stream.  Because the receiver can guarantee it will only
   examine the data stream at locations that are known to contain the
   embedded control, the protocol can never misinterpret application
   data as being embedded control data.  For unambiguous handling of an
   out of order packet, the deterministic approach is preferred.

   The MPA protocol provides a framing mechanism for DDP running over
   TCP using the deterministic approach.  It allows the location of the
   ULPDU to be determined in the TCP stream even if the TCP segments
   arrive out of order.

2.2 Protocol Overview

   The layering of PDUs with MPA is shown in Figure 1, below.

               +------------------+
               |     ULP client   |
               +------------------+  <- Consumer messages
               |        DDP       |
               +------------------+  <- ULPDUs
               |        MPA*      |
               +------------------+  <- FPDUs (containing ULPDUs)
               |        TCP*      |
               +------------------+  <- TCP Segments (containing FPDUs)
               |      IP etc.     |
               +------------------+
                * These may be fully layered or optimized together.

                       Figure 1 ULP MPA TCP Layering

   MPA is described as an extra layer above TCP and below DDP.  The
   operation sequence is:

   1.  A TCP connection is established by ULP action.  This is done
       using methods not described by this specification.  The ULP may
       exchange some amount of data in streaming mode prior to starting
       MPA, but is not required to do so.

   2.  The Consumer negotiates the use of DDP and MPA at both ends of a
       connection.  The mechanisms to do this are not described in this
       specification.  The negotiation may be done in streaming mode, or
       by some other mechanism (such as a pre-arranged port number).

   3.  The ULP activates MPA on each end in the Startup Phase, either as
       an Initiator or a Responder, as determined by the ULP.  This mode
       verifies the usage of MPA, specifies the use of CRC and Markers,
       and allows the ULP to communicate some additional data via a
       Private Data exchange.  See section 7.1 Connection setup for more
       details on the startup process.

   4.  At the end of the Startup Phase, the ULP puts MPA (and DDP) into
       Full Operation and begins sending DDP data as further described
       below.  In this document, DDP data chunks are called ULPDUs.  For
       a description of the DDP data, see [DDP].

   Following is a description of data transfer when MPA is in Full
   Operation.

   1.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
       for this value.  MPA derives this information from TCP or IP,
       when it is available, or chooses a reasonable value.

   2.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
       MPA at the sender.

   3.  MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
       header, optionally inserting Markers, and appending a CRC field
       after the ULPDU and PAD (if any).  MPA delivers the FPDU to TCP.

   4.  The TCP sender puts the FPDUs into the TCP stream.  If the sender
       is optimized MPA/TCP, it segments the TCP stream in such a way
       that a TCP Segment boundary is also the boundary of an FPDU.  TCP
       then passes each segment to the IP layer for transmission.

   5.  The receiver may or may not be optimized.  If it is optimized
       MPA/TCP, it may separate passing the TCP payload to MPA from
       passing the TCP payload ordering information to MPA.  In either
       case, RFC compliant TCP wire behavior is observed at both the
       sender and receiver.

   6.  The MPA receiver locates and assembles complete FPDUs within the
       stream, verifies their integrity, and removes MPA Markers (when
       present), ULPDU_Length, PAD and the CRC field.

   7.  MPA then provides the complete ULPDUs to DDP.  MPA may also
       separate passing MPA payload to DDP from passing the MPA payload
       ordering information.
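
   The receiver-side steps (5 through 7) amount to scanning the TCP
   byte stream for complete FPDUs.  The sketch below illustrates this
   for an in-order stream with Markers disabled and CRC checking
   omitted; it is an illustration only, and the function name is not
   part of this specification.

   ```python
   def split_fpdus(stream: bytes):
       """Recover complete ULPDUs from an in-order MPA byte stream.

       Assumes no Markers are present and does not verify the CRC;
       returns the ULPDUs found plus the number of octets consumed.
       """
       ulpdus, off = [], 0
       while off + 2 <= len(stream):
           n = int.from_bytes(stream[off:off + 2], "big")  # ULPDU_Length
           pad = (4 - (2 + n) % 4) % 4                     # 0-3 zero octets
           total = 2 + n + pad + 4                         # hdr+ULPDU+PAD+CRC
           if off + total > len(stream):
               break                                       # FPDU incomplete
           ulpdus.append(stream[off + 2:off + 2 + n])
           off += total
       return ulpdus, off
   ```

   A fully layered receiver would call this each time TCP delivers
   in-order bytes, retaining any trailing partial FPDU until the rest
   arrives.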

   A fully layered MPA on TCP is implemented as a data stream ULP for
   TCP and is therefore RFC compliant.

   An optimized DDP/MPA/TCP uses a TCP layer which potentially contains
   some additional behaviors as suggested in this document.  When
   DDP/MPA/TCP are cross-layer optimized, the behavior of TCP (esp.
   sender segmentation) may change from that of the un-optimized
   implementation, but the changes are within the bounds permitted by
   the TCP RFC specifications, and will interoperate with an un-
   optimized TCP.  The additional behaviors are described in Appendix A
   and are not normative; they are described at a TCP interface layer as
   a convenience.  Implementations may achieve the described
   functionality using any method, including cross layer optimizations
   between TCP, MPA and DDP.

   An optimized DDP/MPA/TCP sender is able to segment the data stream
   such that TCP segments begin with FPDUs (FPDU Alignment).  This has
   significant advantages for receivers.  When segments arrive with
   aligned FPDUs the receiver usually need not buffer any portion of the
   segment, allowing DDP to place it in its destination memory
   immediately, thus avoiding copies from intermediate buffers (DDP's
   reason for existence).

   An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation
   to locate the start of ULPDUs that may be received out of order.  It
   also allows the implementation to determine if the entire ULPDU has
   been received.  As a result, MPA can pass out of order ULPDUs to DDP
   for immediate use.  This enables a DDP on MPA implementation to save
   a significant amount of intermediate storage by placing the ULPDUs in
   the right locations in the application buffers when they arrive,
   rather than waiting until full ordering can be restored.

   The ability of a receiver to recover out of order ULPDUs is optional
   and declared to the transmitter during startup.  When the receiver
   declares that it does not support out of order recovery, the
   transmitter does not add the control information to the data stream
   needed for out of order recovery.

   If the receiver is fully layered, then MPA receives a strictly
   ordered stream of data and does not deal with out of order ULPDUs.
   In this case MPA passes each ULPDU to DDP when the last bytes arrive
   from TCP, along with the indication that they are in order.

   MPA implementations that support recovery of out of order ULPDUs MUST
   support a mechanism to indicate the ordering of ULPDUs as the sender
   transmitted them and indicate when missing intermediate segments
   arrive.  These mechanisms allow DDP to reestablish record ordering
   and report Delivery of complete messages (groups of records).

   MPA also addresses enhanced data integrity.  Some users of TCP have
   noted that the TCP checksum is not as strong as could be desired (see
   [CRCTCP]).  Studies such as [CRCTCP] have shown that the TCP checksum
   indicates segments in error at a much higher rate than the underlying
   link characteristics would indicate.  With these higher error rates,
   the chance that an error will escape detection, when using only the
   TCP checksum for data integrity, becomes a concern.  A stronger
   integrity check can reduce the chance of data errors being missed.

   MPA includes a CRC check to increase the ULPDU data integrity to the
   level provided by other modern protocols, such as SCTP [RFC2960].  It
   is possible to disable this CRC check, however CRCs MUST be enabled
   unless it is clear that the end to end connection through the network
   has data integrity at least as good as an MPA with CRC enabled (for
   example when IPsec is implemented end to end).  DDP's ULP expects
   this level of data integrity and therefore the ULP does not have to
   provide its own duplicate data integrity and error recovery for lost
   data.

3  MPA's interactions with DDP

   DDP requires MPA to maintain DDP record boundaries from the sender to
   the receiver.  When using MPA on TCP to send data, DDP provides
   records (ULPDUs) to MPA.  MPA will use the reliable transmission
   abilities of TCP to transmit the data, and will insert appropriate
   additional information into the TCP stream to allow the MPA receiver
   to locate the record boundary information.

   As such, MPA accepts complete records (ULPDUs) from DDP at the sender
   and returns them to DDP at the receiver.

   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
   contained in one FPDU.

   MPA over a standard TCP stack can usually provide FPDU Alignment with
   the TCP Header if the FPDU is equal to TCP's EMSS.  An optimized
   MPA/TCP stack can also maintain alignment as long as the FPDU is less
   than or equal to TCP's EMSS.  Since FPDU Alignment is generally
   desired by the receiver, DDP must cooperate with MPA to ensure FPDUs'
   lengths do not exceed the EMSS under normal conditions.  This is done
   with the MULPDU mechanism.

   MPA provides information to DDP on the current maximum size of the
   record that is acceptable to send (MULPDU).  DDP SHOULD limit each
   record size to MULPDU.  The range of MULPDU values MUST be between
   128 octets and 64768 octets, inclusive.

   The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
   MPA.  DDP MAY post a ULPDU of any size between one and 64768 octets,
   however MPA is not REQUIRED to support a ULPDU Length that is greater
   than the current MULPDU.

   While the maximum theoretical length supported by the MPA header
   ULPDU_Length field is 65535, TCP over IP requires the IP datagram
   maximum length to be 65535 octets.  To enable MPA to support FPDU
   Alignment, the maximum size of the FPDU must fit within an IP
   datagram.  Thus the ULPDU limit of 64768 octets was derived by taking
   the maximum IP datagram length, subtracting from it the maximum total
   length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
   options, and the worst case MPA overhead, and then rounding the
   result down to a 128 octet boundary.
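
   As a rough sanity check of that derivation (an illustration only:
   the worst-case accounting below assumes 40 octets each of IPv4 and
   TCP options, and models Markers as occupying 4 octets of every 512
   octets of the MPA stream):

   ```python
   def max_wire_length(ulpdu_len: int) -> int:
       """Worst-case IP datagram length needed to carry one FPDU."""
       # FPDU: 2-octet ULPDU_Length + ULPDU + PAD (to multiple of 4) + CRC
       pad = (4 - (2 + ulpdu_len) % 4) % 4
       fpdu = 2 + ulpdu_len + pad + 4
       # Markers leave 508 of every 512 stream octets for FPDU data.
       stream = -(-fpdu * 512 // 508)        # ceiling division
       ip_tcp = (20 + 40) + (20 + 40)        # max IPv4 + TCP hdrs/options
       return ip_tcp + stream

   # Under this model, 64768 is the largest multiple of 128 whose FPDU
   # still fits within the 65535-octet IP datagram limit:
   assert max_wire_length(64768) <= 65535
   ```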

   On receive, MPA MUST pass each ULPDU with its length to DDP when it
   has been validated.

   If an MPA implementation supports passing out of order ULPDUs to DDP,
   the MPA implementation SHOULD:

   *   Pass each ULPDU with its length to DDP as soon as it has been
       fully received and validated.

   *   Provide a mechanism to indicate the ordering of ULPDUs as the
       sender transmitted them.  One possible mechanism might be
       providing the TCP sequence number for each ULPDU.

   *   Provide a mechanism to indicate when a given ULPDU (and prior
       ULPDUs) are complete (Delivered to DDP).  One possible mechanism
       might be to allow DDP to see the current outgoing TCP Ack
       sequence number.

   *   Provide an indication to DDP that the TCP has closed or has begun
       to close the connection (e.g. received a FIN).
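
   The first three mechanisms above can use TCP sequence numbers as the
   ordering handle, as the suggestions imply.  The following
   bookkeeping sketch is purely hypothetical (the class and method
   names are not part of this specification):

   ```python
   class OutOfOrderTracker:
       """Track ULPDUs passed to DDP out of order, keyed by the TCP
       sequence number of the first octet of their FPDU (one possible
       mechanism for indicating sender transmission order)."""

       def __init__(self):
           self.pending = {}          # tcp_seq -> ULPDU length

       def ulpdu_validated(self, tcp_seq: int, length: int) -> None:
           # ULPDU fully received and validated; DDP may place it now.
           self.pending[tcp_seq] = length

       def ack_advanced(self, ack_seq: int):
           # All stream octets below the cumulative Ack point are in
           # order; report those ULPDUs as Delivered, in sender order.
           done = [s for s in sorted(self.pending)
                   if s + self.pending[s] <= ack_seq]
           for s in done:
               del self.pending[s]
           return done
   ```

   (Sequence-number wraparound is ignored here for brevity; a real
   implementation would compare sequence numbers modulo 2^32.)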

   MPA MUST provide the protocol version negotiated with its peer to
   DDP.  DDP will use this version to set the version in its header and
   to report the version to [RDMAP].

4  MPA Full Operation Mode

   The following sections describe the main semantics of the full
   operation mode of MPA.

4.1 FPDU Format

   MPA senders create FPDUs out of ULPDUs.  The format of an FPDU shown
   below MUST be used for all MPA FPDUs.  For purposes of clarity,
   Markers are not shown in Figure 2.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          ULPDU_Length         |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
      |                                                               |
      ~                                                               ~
      ~                            ULPDU                              ~
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |          PAD (0-3 octets)     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                             CRC                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                           Figure 2 FPDU Format

   ULPDU_Length: 16 bits (unsigned integer).  This is the number of
   octets of the contained ULPDU.  It does not include the length of the
   FPDU header itself, the pad, the CRC, or of any Markers that fall
   within the ULPDU.  The 16-bit ULPDU Length field is large enough to
   support the largest IP datagrams for IPv4 or IPv6.

   PAD: The PAD field trails the ULPDU and contains between zero and
   three octets of data.  The pad data MUST be set to zero by the sender
   and ignored by the receiver (except for CRC checking).  The length of
   the pad is set so as to make the size of the FPDU an integral
   multiple of four.

   CRC: 32 bits.  When CRCs are enabled, this field contains a CRC32C
   check value, which is used to verify the entire contents of the FPDU,
   using CRC32C.  See section 4.4 CRC Calculation on page 18.  When CRCs
   are not enabled, this field is still present, may contain any value,
   and MUST NOT be checked.

   The FPDU adds a minimum of 6 octets to the length of the ULPDU.  In
   addition, the total length of the FPDU will include the length of any
   Markers and from 0 to 3 pad octets added to round-up the ULPDU size.
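To illustrate the sizing rules above (informative only), the pad length and minimum FPDU size can be computed as follows:

```python
def fpdu_pad(ulpdu_len):
    # Pad so that 2 (ULPDU_Length) + ulpdu_len + pad + 4 (CRC) is a
    # multiple of 4; Markers are 4 octets and do not affect the result.
    return -(ulpdu_len + 6) % 4

def min_fpdu_len(ulpdu_len):
    # FPDU length excluding any Markers
    return 6 + ulpdu_len + fpdu_pad(ulpdu_len)

print(fpdu_pad(16), min_fpdu_len(16))   # 2 24
```

The 16-octet case matches Figure 4, where a 16-octet ULPDU carries 2 pad octets.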

4.2 Marker Format

   The format of a Marker MUST be as specified in Figure 3:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           RESERVED            |            FPDUPTR            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                          Figure 3 Marker Format

   RESERVED: The Reserved field MUST be set to zero on transmit and
   ignored on receive (except for CRC calculation).

   FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long,
   interpreted as an unsigned integer that indicates the number of
   octets in the TCP stream from the beginning of the ULPDU Length field
   to the first octet of the entire Marker.  The least significant two
   bits MUST always be set to zero at the transmitter, and the receivers
   MUST always treat these as zero for calculations.

4.3 MPA Markers

   MPA Markers are used to identify the start of FPDUs when packets are
   received out of order.  This is done by locating the Markers at fixed
   intervals in the data stream (which is correlated to the TCP sequence
   number) and using the Marker value to locate the preceding FPDU
   start.

   All MPA Markers are included in the containing FPDU CRC calculation
   (when both CRCs and Markers are in use).

   The MPA receiver's ability to locate out of order FPDUs and pass the
   ULPDUs to DDP is implementation dependent.  MPA/DDP allows those
   receivers that are able to deal with out of order FPDUs in this way
   to require the insertion of Markers in the data stream.  When the
   receiver cannot deal with out of order FPDUs in this way, it may
   disable the insertion of Markers at the sender.  All MPA senders MUST
   be able to generate Markers when their use is declared by the
   opposing receiver (see section 7.1 Connection setup on page 26).

   When Markers are enabled, MPA senders MUST insert a Marker into the
   data stream at a 512 octet periodic interval in the TCP Sequence
   Number Space.  The Marker contains a 16 bit unsigned integer referred
   to as the FPDUPTR (FPDU Pointer).

   If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
   relative back-pointer.  FPDUPTR MUST contain the number of octets in
   the TCP stream from the beginning of the ULPDU Length field to the
   first octet of the Marker, unless the Marker falls between FPDUs.
   Thus the location of the first octet of the previous FPDU header can
   be determined by subtracting the value of the given Marker from the
   current octet-stream sequence number (i.e. TCP sequence number) of
   the first octet of the Marker.  Note that this computation MUST take
   into account that the TCP sequence number could have wrapped between
   the Marker and the header.

   An FPDUPTR value of 0x0000 is a special case - it is used when the
   Marker falls exactly between FPDUs (between the preceding FPDU CRC
   field, and the next FPDU's ULPDU Length field).  In this case, the
   Marker is considered to be contained in the following FPDU; the
   Marker MUST be included in the CRC calculation of the FPDU following
   the Marker (if CRCs are being generated or checked).  Thus an FPDUPTR
   value of 0x0000 means that immediately following the Marker is an
   FPDU header (the ULPDU Length field).

   Since all FPDUs are integral multiples of 4 octets, the bottom two
   bits of the FPDUPTR as calculated by the sender are zero.  MPA
   reserves these bits so they MUST be treated as zero for computation
   at the receiver.

   When Markers are enabled (see section 7.1 Connection setup on page
   26), the MPA Markers MUST be inserted immediately preceding the first
   FPDU of Full Operation phase, and at every 512th octet of the TCP
   octet stream thereafter.  As a result, the first Marker has an
   FPDUPTR value of 0x0000.  If the first Marker begins at octet
   sequence number SeqStart, then Markers are inserted such that the
   first octet of the Marker is at octet sequence number SeqNum if the
   remainder of (SeqNum - SeqStart) mod 512 is zero.  Note that SeqNum
   can wrap.

   For example, if the TCP sequence number were used to calculate the
   insertion point of the Marker, the starting TCP sequence number is
   unlikely to be zero, and 512 octet multiples are unlikely to fall on
   a modulo 512 of zero.  If the MPA connection is started at TCP
   sequence number 11, then the 1st Marker will begin at 11, and
   subsequent Markers will begin at 523, 1035, etc.
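The placement and back-pointer rules above can be sketched as follows (informative; the position generator ignores 32-bit sequence-number wrap for brevity, while the back-pointer arithmetic is taken modulo 2^32):

```python
MARKER_INTERVAL = 512

def marker_seqs(seq_start, seq_end):
    """Octet sequence numbers at which Markers begin, given the first
    Marker at seq_start (wrap ignored for brevity)."""
    return [s for s in range(seq_start, seq_end)
            if (s - seq_start) % MARKER_INTERVAL == 0]

def fpdu_header_seq(marker_seq, fpduptr):
    # FPDUPTR == 0: an FPDU header immediately follows the 4-octet
    # Marker; otherwise FPDUPTR points back to the ULPDU_Length field.
    if fpduptr == 0:
        return (marker_seq + 4) % 2**32
    return (marker_seq - fpduptr) % 2**32

print(marker_seqs(11, 1100))   # [11, 523, 1035]
```

The back-pointer case matches Figure 6: a Marker at stream offset 512 (0x200) with FPDUPTR 20 (0x14) points to the FPDU header at offset 492 (0x1ec).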

   If an FPDU is large enough to contain multiple Markers, they MUST all
   point to the same point in the TCP stream: the first octet of the
   ULPDU Length field for the FPDU.

   If a Marker interval contains multiple FPDUs (the FPDUs are small),
   the Marker MUST point to the start of the ULPDU Length field for the
   FPDU containing the Marker unless the Marker falls between FPDUs, in
   which case the Marker MUST be zero.

   The following example shows an FPDU containing a Marker.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |       ULPDU Length (0x0010)   |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
      |                                                               |
      +                                                               +
      |                         ULPDU (octets 0-9)                    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |            (0x0000)           |        FPDU ptr (0x000C)      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        ULPDU (octets 10-15)                   |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |          PAD (2 octets:0,0)   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                              CRC                              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                 Figure 4 Example FPDU Format with Marker

   MPA Receivers MUST preserve ULPDU boundaries when passing data to
   DDP.  MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
   DDP and not the Markers, headers, and CRC.

4.4 CRC Calculation

   An MPA implementation MUST implement CRC support and MUST either:

   (1) always use CRCs; the MPA provider is not REQUIRED to support an
       administrator's request that CRCs not be used.

       or

   (2a) only indicate a preference to not use CRCs on the explicit
       request of the system administrator, via an interface not defined
       in this spec.  The default configuration for a connection MUST be
       to use CRCs.

   (2b) disable CRC checking (and possibly generation) if both the local
       and remote endpoints indicate preference to not use CRCs.

   The decision for hosts to request CRC suppression MAY be made on an
   administrative basis for any path that provides equivalent protection
   from undetected errors as an end-to-end CRC32c.

   The process MUST be invisible to the ULP.

   After receipt of an MPA startup declaration indicating that its peer
   requires CRCs, an MPA instance MUST continue generating and checking
   CRCs until the connection terminates.  If an MPA instance has
   declared that it does not require CRCs, it MUST turn off CRC checking
   immediately after receipt of an MPA mode declaration indicating that
   its peer also does not require CRCs.  It MAY continue generating
   CRCs.  See section 7.1 Connection setup on page 26 for details on the
   MPA startup.

   When sending an FPDU, the sender MUST include a CRC field.  When CRCs
   are enabled, the CRC field in the MPA FPDU MUST be computed using the
   CRC32C polynomial in the manner described in the iSCSI Protocol
   [iSCSI] document for Header and Data Digests.
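For reference, a minimal bit-at-a-time CRC32c (Castagnoli) computation is sketched below; this is informative only, and real implementations typically use table-driven or hardware-assisted variants:

```python
def crc32c(data: bytes) -> int:
    """CRC32c as used by iSCSI digests: reflected, polynomial
    0x1EDC6F41 (0x82F63B78 bit-reversed), initial value and final
    XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Well-known CRC32c check value for the ASCII string "123456789"
print(hex(crc32c(b"123456789")))  # 0xe3069283
```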

   The fields which MUST be included in the CRC calculation when sending
   an FPDU are as follows:

   1)  If a Marker does not immediately precede the ULPDU Length field,
       the CRC-32c is calculated from the first octet of the ULPDU
       Length field, through all the ULPDU and Markers (if present), to
       the last octet of the PAD (if present), inclusive.  If there is a
       Marker immediately following the PAD, the Marker is included in
       the CRC calculation for this FPDU.

   2)  If a Marker immediately precedes the first octet of the ULPDU
       Length field of the FPDU, (i.e. the Marker fell between FPDUs,
       and thus is required to be included in the second FPDU), the CRC-
       32c is calculated from the first octet of the Marker, through the
       ULPDU Length header, through all the ULPDU and Markers (if
       present), to the last octet of the PAD (if present), inclusive.

   3)  After calculating the CRC-32c, the resultant value is placed into
       the CRC field at the end of the FPDU.

   When an FPDU is received, and CRC checking is enabled, the receiver
   MUST first perform the following:

   1)  Calculate the CRC of the incoming FPDU in the same fashion as
       defined above.

   2)  Verify that the calculated CRC-32c value is the same as the
       received CRC-32c value found in the FPDU CRC field.  If not, the
       receiver MUST treat the FPDU as an invalid FPDU.

   The procedure for handling invalid FPDUs is covered in the Error
   Section (see section 8 on page 40).

   The following is an annotated hex dump of an example FPDU sent as the
   first FPDU on the stream.  As such, it starts with a Marker.  The
   FPDU contains a 42 octet ULPDU (an example DDP segment), which in
   turn carries a 24 octet ULP payload that is all zeros.  The CRC32c
   has been correctly calculated and can be
   used as a reference.  See the [DDP] and [RDMAP] specification for
   definitions of the DDP Control field, Queue, MSN, MO, and Send Data.

       Octet Contents  Annotation
       Count

       0000    00      Marker: Reserved
       0001    00
       0002    00      Marker: FPDUPTR
       0003    00
       0004    00      ULPDU Length
       0005    2a
       0006    41      DDP Control Field, Send with Last flag set
       0007    43
       0008    00      Reserved (DDP STag position with no STag)
       0009    00
       000a    00
       000b    00
       000c    00      DDP Queue = 0
       000d    00
       000e    00
       000f    00
       0010    00      DDP MSN = 1
       0011    00
       0012    00
       0013    01
       0014    00      DDP MO = 0
       0015    00
       0016    00
       0017    00
       0018    00      DDP Send Data (24 octets of zeros)
       ...
       002f    00
       0030    52      CRC32c
       0031    23
       0032    99
       0033    83
                  Figure 5 Annotated Hex Dump of an FPDU

   The following is an example sent as the second FPDU of the stream
   where the first FPDU (which is not shown here) had a length of 492
   octets and was also a Send to Queue 0 with Last Flag set.  This
   example contains a Marker.

       Octet Contents  Annotation
       Count

       01ec    00      Length
       01ed    2a
       01ee    41      DDP Control Field: Send with Last Flag set
       01ef    43
       01f0    00      Reserved (DDP STag position with no STag)
       01f1    00
       01f2    00
       01f3    00
       01f4    00      DDP Queue = 0
       01f5    00
       01f6    00
       01f7    00
       01f8    00      DDP MSN = 2
       01f9    00
       01fa    00
       01fb    02
       01fc    00      DDP MO = 0
       01fd    00
       01fe    00
       01ff    00
       0200    00      Marker: Reserved
       0201    00
       0202    00      Marker: FPDUPTR
       0203    14
       0204    00      DDP Send Data (24 octets of zeros)
       ...
       021b    00
       021c    84      CRC32c
       021d    92
       021e    58
       021f    98
            Figure 6 Annotated Hex Dump of an FPDU with Marker

4.5 FPDU Size Considerations

   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
   the size of the largest ULPDU fitting in an FPDU.  For an empty TCP
   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
   space for Markers and pad octets.

        The maximum ULPDU Length for a single ULPDU when Markers are
        present MUST be computed as:

        MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

   The formula above accounts for the worst-case number of Markers.

        The maximum ULPDU Length for a single ULPDU when Markers are NOT
        present MUST be computed as:

        MULPDU = EMSS - (6 + EMSS mod 4)
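As an informative example, for a typical Ethernet-derived EMSS of 1460 octets the two formulas give:

```python
import math

def mulpdu(emss, markers_enabled):
    if markers_enabled:
        # 6 octets of FPDU overhead, one 4-octet Marker per 512 octets
        # (worst case), plus space reserved for pad octets
        return emss - (6 + 4 * math.ceil(emss / 512) + emss % 4)
    return emss - (6 + emss % 4)

print(mulpdu(1460, True), mulpdu(1460, False))  # 1442 1454
```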

   As a further optimization of the wire efficiency an MPA
   implementation MAY dynamically adjust the MULPDU (see section 5 for
   latency and wire efficiency trade-offs).  When one or more FPDUs are
   already packed into a TCP Segment, MULPDU MAY be reduced accordingly.

   DDP SHOULD provide ULPDUs that are as large as possible, but less
   than or equal to MULPDU.

   If the TCP implementation needs to adjust EMSS to support MTU changes
   or changing TCP options, the MULPDU value is changed accordingly.

   In certain rare situations, the EMSS may shrink below 128 octets in
   size.  If this occurs, the MPA on TCP sender MUST NOT shrink the
   MULPDU below 128 octets and is not required to follow the
   segmentation rules in Appendix A.

   If one or more FPDUs are already packed into a TCP segment, such that
   the remaining room is less than 128 octets, MPA MUST NOT provide a
   MULPDU smaller than 128.  In this case, MPA would typically provide a
   MULPDU for the next full sized segment, but may still pack the next
   FPDU into the small remaining room, provided that the next FPDU is
   small enough to fit.

   The value 128 is chosen so as to allow DDP designers room for the DDP
   Header and some user data.

5  MPA's interactions with TCP

   The following sections describe MPA's interactions with TCP.  This
   section discusses using a standard layered TCP stack with MPA
   attached above a TCP socket.  Discussion of using an optimized MPA-
   aware TCP with an MPA implementation that takes advantage of the
   extra optimizations is done in Appendix A.

                   +-----------------------------------+
                   | +-----+       +-----------------+ |
                   | | MPA |       | Other Protocols | |
                   | +-----+       +-----------------+ |
                   |    ||                  ||         |
                   |  ----- socket API --------------  |
                   |            ||                     |
                   |         +-----+                   |
                   |         | TCP |                   |
                   |         +-----+                   |
                   |            ||                     |
                   |         +-----+                   |
                   |         | IP  |                   |
                   |         +-----+                   |
                   +-----------------------------------+

                   Figure 7 Fully layered implementation

   The Fully layered implementation is described for completeness;
   however, the user is cautioned that the reduced probability of FPDU
   alignment when transmitting with this implementation will tend to
   introduce a higher overhead at optimized receivers.  In addition, the
   lack of out-of-order receive processing will significantly reduce the
   value of DDP/MPA by imposing higher buffering and copying overhead in
   the local receiver.

                   +-----------------------------------+
                   | +-----------+ +-----------------+ |
                   | | Optimized | | Other Protocols | |
                   | |  MPA/TCP  | +-----------------+ |
                   | +-----------+        ||           |
                   |         \\     --- socket API --- |
                   |          \\          ||           |
                   |           \\      +-----+         |
                   |            \\     | TCP |         |
                   |             \\    +-----+         |
                   |              \\    //             |
                   |             +-------+             |
                   |             |  IP   |             |
                   |             +-------+             |
                   +-----------------------------------+

                 Figure 8 Optimized MPA/TCP implementation

   The optimized MPA/TCP implementations described in Appendix A are
   only applicable to MPA; all other TCP applications continue to use
   the standard TCP stacks and interfaces.

5.1 MPA transmitters with a standard layered TCP

   MPA transmitters SHOULD calculate a MULPDU as described in section
   4.5.  If the TCP implementation allows EMSS to be determined by MPA,
   that value should be used.  If the transmit side TCP implementation
   is not able to report the EMSS, MPA SHOULD use the current MTU value
   to establish a likely FPDU size, taking into account the various
   expected header sizes.

   MPA transmitters SHOULD also use whatever facilities the TCP stack
   presents to cause the TCP transmitter to start TCP segments at FPDU
   boundaries.  Multiple FPDUs MAY be packed into a single TCP segment
   as determined by the EMSS calculation as long as they are entirely
   contained in the TCP segment.

   For example, passing FPDU buffers sized to the current EMSS to the
   TCP socket and using the TCP_NODELAY socket option to disable the
   Nagle [RFC0896] algorithm will usually result in many of the segments
   starting with an FPDU.
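A sketch of that approach on a Berkeley-sockets stack follows (informative; the MPA framing, connection setup, and error handling are omitted, and the commented-out names are placeholders):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable the Nagle algorithm so each send() of a complete FPDU tends
# to leave the stack at a segment boundary.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# After connecting, write exactly one EMSS-sized FPDU per send() call
# so that most TCP segments begin with an FPDU header (best effort;
# TCP is still free to resegment, e.g. on retransmission).
# sock.connect((peer_host, peer_port))
# sock.send(fpdu_bytes)
```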

   It is recognized that various effects can cause a FPDU alignment to
   be lost.  Following are a few of the effects:

   *   ULPDUs that are smaller than the MULPDU.  If these are sent in a
       continuous stream, FPDU alignment will be lost.  Note that
       careful use of a dynamic MULPDU can help in this case; the MULPDU
       for future FPDUs can be adjusted to re-establish alignment with
       the segments based on the current EMSS.

   *   Sending enough data that the TCP receive window limit is reached.
       TCP may send a smaller segment to exactly fill the receive
       window.

   *   Sending data when TCP is operating up against the congestion
       window.  If TCP is not tracking the congestion window in
       segments, it may transmit a smaller segment to exactly fill the
       receive window.

   *   Changes in EMSS due to varying TCP options, or changes in MTU.

   If FPDU alignment with TCP segments is lost for any reason, the
   alignment is regained after a break in transmission where the TCP
   send buffers are emptied.  Many usage models for DDP/MPA will include
   such breaks.

   MPA receivers are REQUIRED to be able to operate correctly even if
   alignment is lost (see section 6).

5.2 MPA receivers with a standard layered TCP

   MPA receivers will get TCP data in the usual ordered stream.  The
   receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH
   field, as described in section 6.  Receivers MAY utilize Markers to
   check for FPDU boundary consistency, but they are NOT required to
   examine the Markers to determine the FPDU boundaries.

5.3  Optimized MPA/TCP transmitters

   The various TCP RFCs allow considerable choice in segmenting a TCP
   stream.  In order to optimize

6  MPA Receiver FPDU recovery at the Identification

   An MPA receiver, an
   optimized MPA/TCP implementation uses additional segmentation rules.

   To provide optimum performance, an optimized MPA/TCP transmit side
   implementation SHOULD be enabled to:

   *   With an EMSS large enough to contain receiver MUST first verify the FPDU(s), segment FPDU before passing the
       outgoing TCP stream such that ULPDU
   to DDP.  To do this, the first octet receiver MUST:

   *   locate the start of every TCP
       Segment begins with an FPDU.  Multiple FPDUs MAY be packed into a
       single TCP segment as long as they are entirely contained in the
       TCP segment. FPDU unambiguously,

   *   Report the current EMSS from   verify its CRC (if CRC checking is enabled).

   If the TCP to above conditions are true, the MPA transmit layer.

   There are exceptions to receiver passes the above rule.  Once an ULPDU is provided
   to
   MPA, DDP.

   To detect the MPA/TCP sender MUST transmit it or fail start of the connection; it
   cannot FPDU unambiguously one of the following
   MUST be repudiated.  As a result, during changes used:

   1:  In an ordered TCP stream, the ULPDU Length field in MTU and EMSS,
   or the current
       FPDU when TCP's Receive Window size (RWIN) becomes too small, it may FPDU has a valid CRC, can be
   necessary to send FPDUs that do not conform used to identify the segmentation rule
   above.

   A possible, but less desirable, alternative is to use IP
   fragmentation on accepted FPDUs to deal with MTU reductions or
   extremely small EMSS.

   The sender MUST still format
       beginning of the FPDU according to FPDU format as
   shown in Figure 2.

   On next FPDU.

   2:  For optimized MPA/TCP receivers that support out of order
       reception of FPDUs (see section 4.3 MPA Markers on page 15) a retransmission, TCP does not necessarily preserve original TCP
   segmentation boundaries.  This
       Marker can lead always be used to locate the loss beginning of an FPDU Alignment
   and containment within (in
       FPDUs with valid CRCs).  Since the location of the Marker is
       known in the octet stream (sequence number space), the Marker can
       always be found.

   3:  Having found an FPDU by means of a TCP segment during TCP retransmissions.  An Marker, an optimized MPA/TCP sender SHOULD try
       receiver can find following contiguous FPDUs by using the ULPDU
       Length fields (from FPDUs with valid CRCs) to preserve original TCP
   segmentation boundaries establish the next
       FPDU boundary.

   The ULPDU Length field (see section 4 on a retransmission.

5.3.1  Effects of Optimized MPA/TCP Segmentation

   Optimized MPA/TCP senders will fill TCP segments page 14) MUST be used to
   determine if the EMSS with a
   single entire FPDU when a DDP message is large enough.  Since present before forwarding the DDP
   message may not exactly fit into TCP segments, a "message tail" often
   occurs that results ULPDU
   to DDP.

   CRC calculation is discussed in an FPDU section 4.4 on page 18 above.

7  Connection Semantics

7.1 Connection setup

   MPA requires that is smaller than the Consumer MUST activate MPA, and any TCP
   enhancements for MPA, on a single TCP
   segment.  Additionally some DDP messages may be considerably shorter
   than half connection at the EMSS.  If a small FPDU is sent same location
   in a single TCP segment the
   result is a "short" TCP segment.

   Applications expected to see strong advantages from Direct Data
   Placement include transaction-based applications and throughput
   applications.  Request/response protocols typically send one FPDU per
   TCP segment and then wait for a response.  Under these conditions,
   these "short" TCP segments are an appropriate octet stream at both the sender and expected effect of the segmentation.

   Another possibility receiver.  This is that
   required in order for the application might be sending multiple
   messages (FPDUs) Marker scheme to correctly locate the same endpoint before waiting for a response.
   In this case, the segmentation policy would tend
   Markers (if enabled) and to reduce correctly locate the
   available connection bandwidth first FPDU.

   MPA, and any TCP enhancements for MPA are enabled by under-filling the ULP in both
   directions at once at an endpoint.

   This can be accomplished several ways, and is left up to DDP's ULP:

   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP segments.

   Standard TCP implementations often utilize
       connection setup.  This has the Nagle [RFC0896]
   algorithm to ensure advantage that segments are filled to the EMSS whenever the
   round trip latency no streaming mode
       negotiation is large enough that the source stream can fully
   fill segments before Acks arrive.  The algorithm does this by
   delaying the transmission needed.  An example of TCP segments until such a ULP can fill protocol is shown in
       Figure 10: Example Immediate Startup negotiation on page 36.

       This may be accomplished by using a
   segment, well-known port, or until an ACK arrives from the far side.  The algorithm
   thus allows for smaller segments when latencies are shorter to keep
   the ULP's end to end latency a service
       locator protocol to reasonable levels.

   The Nagle algorithm locate an appropriate port on which DDP on
       MPA is not mandatory expected to use [RFC1122].

   When used with optimized MPA/TCP stacks, Nagle and similar algorithms
   can result in the "packing" of multiple FPDUs into TCP segments.

   If a "message tail", small DDP messages, or operate.

   *   DDP's ULP MAY negotiate the start of a larger DDP
   message are available, on MPA MAY pack multiple FPDUs into sometime after a
       normal TCP segments.
   When this is done, the startup, using TCP segments can be more fully utilized, but,
   due to the size constraints of FPDUs, segments may not be filled to streaming data exchanges on the EMSS.  A dynamic MULPDU
       same connection.  The exchange establishes that informs DDP of the size of on MPA (as
       well as other ULPs) will be used, and exactly locates the
   remaining TCP segment space makes filling point
       in the TCP segment more
   effective.

        Note that octet stream where MPA receivers must do more processing of a TCP segment
        that contains multiple FPDUs, this may affect the performance of
        some receiver implementations.

   It is up to the ULP to decide if Nagle is useful with DDP/MPA. begin operation.  Note that many of the applications expected to take advantage of MPA/DDP
   prefer to avoid the extra delays caused by Nagle.  In
       such scenarios
   it a negotiation protocol is anticipated there will be minimal opportunity for packing at outside the transmitter and receivers may choose to optimize their
   performance for scope of this anticipated behavior.

   Therefore, the application is expected to set TCP parameters
       specification.  A simplified example of such
   that it can trade off latency and wire efficiency.  This is
   accomplished by setting the TCP_NODELAY socket option (which disables
   Nagle).

   When latency a protocol is not critical, application shown
       in Figure 9: Example Delayed Startup negotiation on page 33.

   An MPA endpoint operates in two distinct phases.

   The Startup Phase is expected used to leave Nagle
   enabled.  In verify correct MPA setup, exchange CRC
   and Marker configuration, and optionally pass Private Data between
   endpoints prior to completing a DDP connection.  During this case the TCP implementation may pack any available
   FPDUs into TCP segments so that the segments phase,
   specifically formatted frames are filled exchanged as TCP byte streams
   without using CRCs or Markers.  During this phase a DDP endpoint need
   not be "bound" to the EMSS.
   If MPA connection.  In fact, the amount choice of data available is DDP
   endpoint and its operating parameters may not enough to fill be known until the TCP segment
   when it is prepared for transmission, TCP can send
   Consumer supplied Private Data (if any) has been examined by the segment partly
   filled, or use
   Consumer.

   The second distinct phase is Full Operation during which FPDUs are
   sent using all the Nagle algorithm rules that pertain (CRCs, Markers, MULPDU
   restrictions etc.).  A DDP endpoint MUST be "bound" to wait for the ULP MPA
   connection at entry to post more
   data.

5.4  Optimized MPA/TCP receivers this phase.

   When an MPA receive implementation and Private Data is passed between ULPs in the MPA-aware receive side TCP
   implementation support handling out of order ULPDUs, Startup Phase, the TCP receive
   implementation SHOULD be enabled
   ULP is responsible for interpreting that data, and then placing MPA
   into Full Operation.

   Note: The following text differentiates the two endpoints by calling
       them Initiator and Responder.  This is quite arbitrary and is NOT
       related to perform the following functions:

   1)  The implementation SHOULD pass incoming TCP segments to MPA as
       soon as they have been received and validated, even if not
       received in order.

   Note: The TCP layer MUST have committed to keeping each segment
       before it can be passed to the MPA.  This means that the segment
       must have passed the TCP, IP, and lower layer data integrity
       validation (i.e., checksum), must be in the receive window, must
       be part of the same epoch (if timestamps are used to verify this)
       and any other checks required by TCP RFCs.

       This is not to imply that the data must be completely ordered
       before use.  An implementation MAY accept out of order segments,
       SACK them [RFC2018], and pass them to MPA immediately, before the
       reception of the segments needed to fill in the gaps arrive.  MPA
       expects to utilize these segments when they are complete FPDUs or
       can be combined into complete FPDUs to allow the passing of
       ULPDUs to DDP when they arrive, independent of ordering.  DDP
       uses the passed ULPDU to "place" the DDP segments (see [DDP] for
       more details).

       Since MPA performs a CRC calculation and other checks on received
       FPDUs, the MPA/TCP implementation MUST ensure that any TCP
       segments that duplicate data already received and processed (as
       can happen during TCP retries) do not overwrite already received
       and processed FPDUs.  This avoids the possibility that duplicate
       data may corrupt already validated FPDUs.

   2)  The implementation MUST provide a mechanism to indicate the
       ordering of TCP segments as the sender transmitted them.  One
       possible mechanism might be attaching the TCP sequence number to
       each segment.

   3)  The implementation MUST provide a mechanism to indicate when a
       given TCP segment (and the prior TCP stream) is complete.  One
       possible mechanism might be to utilize the leading (left) edge of
       the TCP Receive Window.  MPA uses the ordering and completion
       indications to inform DDP when a ULPDU is complete; MPA Delivers
       the FPDU to DDP.  DDP uses the indications to "deliver" its
       messages to the DDP consumer (see [DDP] for more details).

   DDP on MPA MUST utilize these two mechanisms to establish the
   Delivery semantics that DDP's consumers agree to.  These semantics
   are described fully in [DDP].  These include requirements on DDP's
   consumer to respect ownership of buffers prior to the time that DDP
   delivers them to the Consumer.

6  MPA Receiver FPDU Identification

   An MPA receiver MUST first verify the FPDU before passing the ULPDU
   to DDP.  To do this, the receiver MUST:

   *   locate the start of the FPDU unambiguously,

   *   verify its CRC (if CRC checking is enabled).

   If the above conditions are true, the MPA receiver passes the ULPDU
   to DDP.

   To detect the start of the FPDU unambiguously one of the following
   MUST be used:

   1:  In an ordered TCP stream, the ULPDU Length field in the current
       FPDU, when the FPDU has a valid CRC, can be used to identify the
       beginning of the next FPDU.

   2:  For optimized MPA/TCP receivers that support out of order
       reception of FPDUs (see section 4.3 MPA Markers on page 14) a
       Marker can always be used to locate the beginning of an FPDU (in
       FPDUs with valid CRCs).  Since the location of the Marker is
       known in the octet stream (sequence number space), the Marker can
       always be found.

   3:  Having found an FPDU by means of a Marker, an optimized MPA/TCP
       receiver can find following contiguous FPDUs by using the ULPDU
       Length fields (from FPDUs with valid CRCs) to establish the next
       FPDU boundary.

   The ULPDU Length field (see section 4) MUST be used to determine if
   the entire FPDU is present before forwarding the ULPDU to DDP.

   CRC calculation is discussed in section 4.4 on page 17 above.
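   For an ordered TCP stream with no Markers present, the boundary
   chaining of rule 1 reduces to simple arithmetic over the FPDU layout
   (2-octet ULPDU Length, ULPDU, 0-3 PAD octets bringing the FPDU to a
   multiple of 4, 4-octet CRC).  The following non-normative sketch
   shows that arithmetic; the function names are invented:

```python
# Illustrative only: chaining FPDU boundaries in an ordered stream with
# no Markers present.  Function names are invented, not from the spec.
import struct

def fpdu_size(ulpdu_length):
    """Total FPDU size: ULPDU Length field (2) + ULPDU + PAD + CRC (4).
    PAD brings the Length field plus ULPDU up to a multiple of 4."""
    unpadded = 2 + ulpdu_length
    padded = (unpadded + 3) & ~3      # round up to a 4-octet multiple
    return padded + 4                 # plus the 4-octet CRC field

def next_fpdu_offset(stream, offset):
    """Offset of the next FPDU, or None if the current FPDU is not yet
    entirely present (its ULPDU must not be forwarded in that case)."""
    if offset + 2 > len(stream):
        return None
    (length,) = struct.unpack_from(">H", stream, offset)
    end = offset + fpdu_size(length)
    return end if end <= len(stream) else None
```

   A receiver would only follow this chain across FPDUs whose CRCs have
   been validated (when CRC checking is enabled).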

6.1  Re-segmenting Middle boxes and non optimized MPA/TCP senders

   Since MPA senders often start FPDUs on TCP segment boundaries, a
   receiving optimized MPA/TCP implementation may be able to optimize
   the reception of data in various ways.

   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
   segment boundaries.

   Some MPA senders may be unable to conform to the sender requirements
   because their implementation of TCP is not designed with MPA in mind.
   Even for optimized MPA/TCP senders, the network may contain "middle
   boxes" which modify the TCP stream by changing the segmentation.
   This is generally interoperable with TCP and its users and MPA must
   be no exception.

   The presence of Markers in MPA (when enabled) allows an optimized
   MPA/TCP receiver to recover the FPDUs despite these obstacles,
   although it may be necessary to utilize additional buffering at the
   receiver to do so.

   Some of the cases that a MPA receiver may have to contend with are
   listed below as a reminder to the implementer:

   *   A single Aligned and complete FPDU, either in order, or out of
       order:  This can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Multiple FPDUs in a TCP segment, aligned and fully contained,
       either in order, or out of order:  These can be passed to DDP as
       soon as validated, and Delivered when ordering is established.

   *   Incomplete FPDU:  The receiver should buffer until the remainder
       of the FPDU arrives.  If the remainder of the FPDU is already
       available, this can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Unaligned FPDU start:  The partial FPDU must be combined with its
       preceding portion(s).  If the preceding parts are already
       available, and the whole FPDU is present, this can be passed to
       DDP as soon as validated, and Delivered when ordering is
       established.  If the whole FPDU is not available, the receiver
       should buffer until the remainder of the FPDU arrives.

   *   Combinations of Unaligned or incomplete FPDUs (and potentially
       other complete FPDUs) in the same TCP segment:  If any FPDU is
       present in its entirety, or can be completed with portions
       already available, it can be passed to DDP as soon as validated,
       and Delivered when ordering is established.
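   The buffering cases above can be illustrated for the simplest
   situation, an ordered stream with no Markers, by a toy model that
   extracts complete FPDUs from arriving bytes and retains a trailing
   partial FPDU until its remainder arrives.  This is a non-normative
   sketch with invented names, not an implementation requirement:

```python
# Toy model of the buffering cases above, for the ordered-stream case
# with no Markers present.  Complete FPDUs are extracted; a trailing
# partial FPDU is retained until its remainder arrives.
import struct

def split_fpdus(pending, arriving):
    """Return (complete_fpdus, new_pending) for an ordered stream."""
    buf = pending + arriving
    fpdus, offset = [], 0
    while offset + 2 <= len(buf):
        (length,) = struct.unpack_from(">H", buf, offset)
        size = ((2 + length + 3) & ~3) + 4   # pad to 4 octets, plus CRC
        if offset + size > len(buf):
            break                            # incomplete: keep buffering
        fpdus.append(buf[offset:offset + size])
        offset += size
    return fpdus, buf[offset:]
```

   The out of order and unaligned cases add bookkeeping (sequence
   numbers and Marker-derived anchors) but follow the same pattern.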

7  Connection Semantics

7.1  Connection setup

   MPA requires that the Consumer MUST activate MPA, and any TCP
   enhancements for MPA, on a TCP half connection at the same location
   in the octet stream at both the sender and the receiver.  This is
   required in order for the Marker scheme to correctly locate the
   Markers (if enabled) and to correctly locate the first FPDU.

   MPA, and any TCP enhancements for MPA, are enabled by the ULP in both
   directions at once at an endpoint.

   This can be accomplished several ways, and is left up to DDP's ULP:

   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
       connection setup.  This has the advantage that no streaming mode
       negotiation is needed.  An example of such a protocol is shown in
       Figure 11: Example Immediate Startup negotiation on page 40.

       This may be accomplished by using a well-known port, or a service
       locator protocol to locate an appropriate port on which DDP on
       MPA is expected to operate.

   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
       normal TCP startup, using TCP streaming data exchanges on the
       same connection.  The exchange establishes that DDP on MPA (as
       well as other ULPs) will be used, and exactly locates the point
       in the octet stream where MPA is to begin operation.  Note that
       such a negotiation protocol is outside the scope of this
       specification.  A simplified example of such a protocol is shown
       in Figure 10: Example Delayed Startup negotiation on page 37.

   An acceptable, the implementation SHOULD return an MPA endpoint operates in two distinct phases.

   The Startup Phase is used Reply
       Frame with the Rejected Connection bit set to verify correct '0'; the MPA setup, exchange CRC Reply
       Frame MAY include ULP supplied Private Data; and Marker configuration, the Responder
       SHOULD prepare to interpret any data received as FPDUs and optionally pass Private Data between
   endpoints prior
       any received ULPDUs to completing a DDP connection.  During this phase,
   specifically formatted DDP.

       Note: Since the receiver's ability to deal with Markers is
           unknown until the Request and Reply frames are exchanged as TCP byte streams
   without using CRCs or Markers.  During have been
           received, sending FPDUs before this phase a DDP endpoint need occurs is not be "bound" possible.

       Note: The requirement to wait on a Request Frame before sending a
           Reply frame is a design choice, it makes for well ordered
           sequence of events at each end, and avoids having to specify
           how to deal with situations where both ends start at the same
           time.

   3.  MPA connection.  In fact, Initiator mode implementations MUST receive and validate a
       MPA Reply Frame.

       If the choice of DDP
   endpoint MPA Reply Frame is improperly formatted, the
       implementation MUST close the TCP connection and its operating parameters may not be known until exit MPA.

       If the MPA Reply Frame is properly formatted but is the
   Consumer supplied Private
       Data (if any) has been examined by the
   Consumer.

   The second distinct phase is Full Operation during which FPDUs are
   sent using all not acceptable, or if the rules that pertain (CRCs, Markers, MULPDU
   restrictions etc.).  A DDP endpoint MUST be "bound" Rejected Connection bit set to
       '1', the MPA implementation MUST exit MPA, leaving the TCP connection at entry to this phase.

   When
       open.  The ULP may close TCP or use the connection for other
       purposes.

       If the MPA Reply Frame is properly formatted and the Private Data
       is passed between ULPs in the Startup Phase, acceptable, and the
   ULP Reject Connection bit is responsible for set to '0', the
       implementation SHOULD enter full MPA/DDP operation mode;
       interpreting that data, any received data as FPDUs and then placing sending DDP ULPDUs as
       FPDUs.

   4.  MPA
   into Full Operation.

   Note: The following text differentiates the two endpoints by calling
       them Initiator and Responder.  This is quite arbitrary Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: this requirement is NOT
       related present to allow the TCP startup (SYN, SYN/ACK sequence).  The Initiator is the side that sends first in the MPA startup
       sequence (the MPA Request Frame).

   Note: The possibility that both endpoints would be allowed time to make a
       connection
           get its receiver into Full Operation before an FPDU arrives,
           avoiding potential race conditions at the same time, sometimes called an active/active
       connection, Initiator.  This
           was considered by also subject to some debate in the work group and rejected.  There
       were several motivations for this decision.  One before
           rough consensus was that
       applications needing this facility were few (none other than
       theoretical at the time of reached.  Eliminating this draft).  Another was that the
       facility created requirement
           would allow faster startup in some implementation difficulties, particularly
       with the "dual stack" designs described later on.  A last issue
       was types of applications.
           However, that dealing with rejected connections at startup would have
       required at least an additional frame type, and more recovery
       actions, complicating the protocol.  While none of these issues
       was overwhelming, also make certain implementations
           (particularly "dual stack") much harder.

   5.  If a received "Key" does not match the group expected value, (See 7.1.1
       MPA Request and implementers were not motivated
       to do Reply Frame Format above) the work to resolve these issues.  The protocol includes a
       method of detecting these active/active startup attempts so that
       they can TCP/DDP connection
       MUST be rejected closed, and an error reported. returned to the ULP.

   6.  The ULP is responsible for determining which side is Initiator or
   Responder.  For client/server type ULPs this is easy.  For peer-peer
   ULPs (which might utilize a TCP style active/active startup), some
   mechanism (not defined by this specification) must received Private Data fields may be established, or
   some streaming mode data exchanged prior to MPA startup used by Consumers at
       either end to determine further validate the side which starts in Initiator connection, and which starts in set up DDP or
       other ULP parameters.  The Initiator ULP MAY close the
       TCP/MPA/DDP connection as a result of validating the Private Data
       fields.  The Responder SHOULD return a MPA
   mode.

7.1.1  MPA Request and Reply Frame Format

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    0  |                                                               |
       +         Key (16 bytes containing "MPA ID Req Frame")          +
    4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
       +         Or  (16 bytes containing "MPA ID Rep Frame")          +
    8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
       +                                                               +
    12 |                                                               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    16 |M|C|R| Res     |     Rev       |          PD_Length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                                                               |
       ~                                                               ~
       ~                   Private Data                                ~
       |                                                               |
       |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      Figure 9: MPA Request/Reply Frame

   Key: This field contains the "key" used to validate that the sender
       is an MPA sender.  Initiator mode senders MUST set this field to
       the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).  Responder
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.  Responder mode senders MUST set this field to
       the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).  Initiator
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.

   M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame,
       declares a receiver's requirement for Markers.  When in a
       received MPA Request Frame or MPA Reply Frame and the value is
       '0', Markers MUST NOT be added to the data stream by the sender.
       When '1', Markers MUST be added as described in section 4.3 MPA
       Markers on page 14.

   C: This bit declares an endpoint's preferred CRC usage.  When this
       field is '0' in the MPA Request Frame and the MPA Reply Frame,
       CRCs MUST not be checked and need not be generated by either
       endpoint.  When this bit is '1' in either the MPA Request Frame
       or MPA Reply Frame, CRCs MUST be generated and checked by both
       endpoints.  Note that even when not in use, the CRC field remains
       present in the FPDU.  When CRCs are not in use, the CRC field
       MUST be considered valid for FPDU checking regardless of its
       contents.

   R: This bit is set to zero, and not checked on reception, in the MPA
       Request Frame.  In the MPA Reply Frame, this bit is the Rejected
       Connection bit, set by the Responders ULP to indicate acceptance
       '0', or rejection '1', of the connection parameters provided in
       the Private Data.

   Res: This field is reserved for future use.  It MUST be set to zero
       when sending, and not checked on reception.

   Rev: This field contains the Revision of MPA.  For this version of
       the specification senders MUST set this field to one.  MPA
       receivers compliant with this version of the specification MUST
       check this field.  If the MPA receiver cannot interoperate with
       the received version, then it MUST close the connection and
       report an error locally.  Otherwise, the MPA receiver should
       report the received version to the ULP.

   PD_Length: This field MUST contain the length in Octets of the
       Private Data field.  A value of zero indicates that there is no
       Private Data field present at all.  If the receiver detects that
       the PD_Length field does not match the length of the Private Data
       field, or if the length of the Private Data field exceeds 512
       octets, the receiver MUST close the connection and report an
       error locally.  Otherwise, the MPA receiver should pass the
       PD_Length value and Private Data to the ULP.

   Private Data: This field may contain any value defined by ULPs or may
       not be present.  The Private Data field MUST be between 0 and 512
       octets in length.  ULPs define how to size, set, and validate
       this field within these limits.
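   The field rules above can be condensed into a small, non-normative
   encoder/decoder sketch.  The helper names are invented; the 20-octet
   header layout follows the frame figure above (16-octet Key, one flags
   octet holding M, C, R and five reserved bits, the Rev octet, and a
   16-bit PD_Length, followed by 0-512 octets of Private Data):

```python
# Non-normative sketch of building and validating the startup frames.
# Helper names are invented; layout follows the frame figure above.
import struct

REQ_KEY = b"MPA ID Req Frame"  # 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65
REP_KEY = b"MPA ID Rep Frame"  # 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65

def build_frame(key, m, c, r, rev=1, private_data=b""):
    """Assemble a Request or Reply frame.  Res bits remain zero."""
    assert key in (REQ_KEY, REP_KEY) and len(private_data) <= 512
    flags = (m << 7) | (c << 6) | (r << 5)
    return key + struct.pack(">BBH", flags, rev, len(private_data)) \
               + private_data

def parse_frame(frame, expected_key):
    """Return decoded fields, or None if the frame is invalid (the
    receiver would then close the connection and report an error)."""
    if len(frame) < 20 or frame[:16] != expected_key:
        return None
    flags, rev, pd_length = struct.unpack_from(">BBH", frame, 16)
    if pd_length > 512 or pd_length != len(frame) - 20:
        return None
    return {"M": flags >> 7, "C": (flags >> 6) & 1,
            "R": (flags >> 5) & 1, "rev": rev,
            "private_data": frame[20:]}
```

   A Responder would call parse_frame with REQ_KEY and an Initiator with
   REP_KEY, which also provides the Initiator/Initiator startup check of
   the rules below.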

7.1.2  Connection Startup Rules

   The following rules apply message to MPA connection Startup Phase:

   1.  When MPA is started in the Initiator mode, the MPA implementation
       MUST send a valid MPA Request Frame. Consumer.  The MPA Request Frame MAY
       include ULP supplied Private Data.

   2.  When MPA is started in the Responder mode, the MPA implementation
       MUST wait until a MPA Request Frame is received and validated
       before entering full MPA/DDP operation.

       If the MPA Request Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Request Frame is properly formatted but the Private
       Data is not acceptable, the implementation SHOULD return an MPA
       Reply Frame with the Rejected Connection bit set to '1'; the MPA
       Reply Frame MAY include ULP supplied Private Data; the
       implementation MUST exit MPA, leaving the TCP connection open.
       The ULP may close TCP or use the connection for other purposes.

       If the MPA Request Frame is properly formatted and the Private
       Data is acceptable, the implementation SHOULD return an MPA Reply
       Frame with the Rejected Connection bit set to '0'; the MPA Reply
       Frame MAY include ULP supplied Private Data; and the Responder
       SHOULD prepare to interpret any data received as FPDUs and pass
       any received ULPDUs to DDP.

       Note: Since the receiver's ability to deal with Markers is
           unknown until the Request and Reply frames have been
           received, sending FPDUs before this occurs is not possible.

       Note: The requirement to wait on a Request Frame before sending a
           Reply frame is a design choice; it makes for a well ordered
           sequence of events at each end, and avoids having to specify
           how to deal with situations where both ends start at the same
           time.

   3.  MPA Initiator mode implementations MUST receive and validate the
       MPA Reply Frame.

       If the MPA Reply Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Reply Frame is properly formatted but the Private Data
       is not acceptable, or if the Rejected Connection bit is set to
       '1', the implementation MUST exit MPA, leaving the TCP connection
       open.  The ULP may close TCP or use the connection for other
       purposes.

       If the MPA Reply Frame is properly formatted and the Private Data
       is acceptable, and the Rejected Connection bit is set to '0', the
       implementation SHOULD enter full MPA/DDP operation mode;
       interpreting any received data as FPDUs and sending DDP ULPDUs as
       FPDUs.

   4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: This requirement is present to allow the Initiator time to
           get its receiver into Full Operation before an FPDU arrives,
           avoiding potential race conditions at the Initiator.  This
           was also subject to some debate in the work group before
           rough consensus was reached.  Eliminating this requirement
           would allow faster startup in some types of applications.
           However, that would also make certain implementations
           (particularly "dual stack") much harder.

   5.  If a received "Key" does not match the expected value (see 7.1.1
       MPA Request and Reply Frame Format above), the TCP/DDP connection
       MUST be closed, and an error returned to the ULP.

   6.  The received Private Data fields may be used by Consumers at
       either end to further validate the connection, and set up DDP or
       other ULP parameters.  The Initiator ULP MAY close the
       TCP/MPA/DDP connection as a result of validating the Private Data
       fields.  The Responder SHOULD return an MPA Reply Frame with the
       "Rejected Connection" Bit set to '1' if the validation of the
       Private Data is not acceptable to the ULP.

   7.  When the first FPDU is to be sent, then if Markers are enabled,
       the first octets sent are the special Marker 0x00000000, followed
       by the start of the FPDU (the FPDU's ULPDU Length field).  If
       Markers are not enabled, the first octets sent are the start of
       the FPDU (the FPDU's ULPDU Length field).

   8.  MPA implementations MUST use the difference between the MPA
       Request Frame and the MPA Reply Frame to check for incorrect
       "Initiator/Initiator" startups.  Implementations SHOULD put a
       timeout on waiting for the MPA Request Frame when started in
       Responder mode, to detect incorrect "Responder/Responder"
       startups.

   9.  MPA implementations MUST validate the PD_Length field.  The
       buffer that receives the Private Data field MUST be large enough
       to receive that data; the amount of Private Data MUST not exceed
       the PD_Length, or the application buffer.  If any of the above
       fails, the startup frame MUST be considered improperly formatted.

   10. MPA implementations SHOULD implement a reasonable timeout while
       waiting for the entire startup frames; this prevents certain
       denial of service attacks.  ULPs SHOULD implement a reasonable
       timeout while waiting for FPDUs, ULPDUs and application level
       messages to guard against application failures and certain denial
       of service attacks.
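
   A rough sketch of how a receiver might apply rules 5 and 9 when
   parsing a startup frame (Python).  The 16-octet Key length and the
   placement of PD_Length immediately after the Key are simplifying
   assumptions made for this sketch; the authoritative frame layout is
   given in Section 7.1.1.

   ```python
   # Illustrative check of rules 5 (Key match) and 9 (PD_Length bounds).
   # ASSUMED layout: 16-octet Key, 2-octet big-endian PD_Length,
   # Private Data.  The real layout is defined in Section 7.1.1.

   KEY_LEN = 16

   def parse_startup_frame(frame: bytes, expected_key: bytes,
                           pd_buffer_size: int) -> bytes:
       """Return the Private Data or raise on an improperly formatted
       startup frame."""
       if frame[:KEY_LEN] != expected_key:
           # Rule 5: close the connection, return an error to the ULP.
           raise ValueError("startup frame Key mismatch")
       pd_length = int.from_bytes(frame[KEY_LEN:KEY_LEN + 2], "big")
       private_data = frame[KEY_LEN + 2:]
       if pd_length > pd_buffer_size or len(private_data) != pd_length:
           # Rule 9: frame MUST be considered improperly formatted.
           raise ValueError("improperly formatted startup frame")
       return private_data
   ```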

7.1.3  Example Delayed Startup sequence

   A variety of startup sequences are possible when using MPA on TCP.
   Following is an example of an MPA/DDP startup that occurs after TCP
   has been running for a while and has exchanged some amount of
   streaming data.  This example does not use any Private Data (an
   example that does is shown later in 7.1.4.2 Example Immediate Startup
   using Private Data on page 40), although it is perfectly legal to
   include the Private Data.  Note that since the example does not use
   any Private Data, there are no ULP interactions shown between
   receiving "Startup frames" and putting MPA into Full Operation.

          Initiator                                 Responder

   +---------------------------+
   |ULP streaming mode         |
   | <Hello> request to        |
   | transition to DDP/MPA     | --------> +--------------------------+
   | mode (optional)           |           |ULP gets request;         |
   +---------------------------+           |enables MPA Responder mode|
                                           |with last (optional)      |
                                           |streaming mode <Hello Ack>|
                                           |for MPA to send.          |
   +---------------------------+           |MPA waits for incoming    |
   |ULP receives streaming     | <-------- |  <MPA Request Frame>     |
   | <Hello Ack>;              |           +--------------------------+
   |Enters MPA Initiator mode; |
   |MPA sends                  |
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>     |
                                           |Consumer binds DDP to MPA,|
                                           |MPA sends the             |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>        |           +--------------------------+
   |Consumer binds DDP to MPA, |
   |DDP/MPA begins full        |
   |operation.                 |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available).               |
                                           +--------------------------+

             Figure 10: Example Delayed Startup negotiation
   An example Delayed Startup sequence is described below:

   *   Active and passive sides start up a TCP connection in the usual
       fashion, probably using sockets APIs.  They exchange some amount
       of streaming mode data.  At some point one side (the MPA
       Initiator) sends streaming mode data that effectively says
       "Hello, Let's go into MPA/DDP mode."

   *   When the remote side (the MPA Responder) gets this streaming mode
       message, the Consumer would send a last streaming mode message
       that effectively says "I Acknowledge your Hello, and am now in
       MPA Responder Mode".  The exchange of these messages establishes
       the exact point in the TCP stream where MPA is enabled.  The
       Responding Consumer enables MPA in the Responder mode and waits
       for the initial MPA startup message.

   *   The Initiating Consumer would enable MPA startup in the Initiator
       mode, which then sends the MPA Request Frame.  It is assumed that
       no Private Data messages are needed for this example, although it
       is possible to do so.

   *   The Responding MPA would receive the initial MPA Request Frame
       and would inform the Consumer that this message arrived.  The
       Consumer can then accept the MPA/DDP connection or close the TCP
       connection.

   *   To accept the connection request, the Responding Consumer would
       use an appropriate API to bind the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation.  In the
       process of going to Full Operation, MPA sends the MPA Reply
       Frame.  MPA/DDP waits for the first incoming FPDU before sending
       any FPDUs.
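
   The essential property of the sequence above is that the streaming
   mode <Hello>/<Hello Ack> exchange fixes the exact byte position in
   each TCP stream where MPA framing begins.  A toy sketch of that
   handoff point (the message byte strings are invented for
   illustration; real ULPs define their own):

   ```python
   # Toy model of the Delayed Startup handoff point.  The <Hello> and
   # <Hello Ack> values below are invented; the ULP defines the actual
   # streaming-mode messages.

   HELLO = b"<Hello>"
   HELLO_ACK = b"<Hello Ack>"

   def responder_handoff(inbound: bytes):
       """Split the Responder's inbound stream: streaming data first,
       then bytes MPA must parse, starting with the Request Frame."""
       if not inbound.startswith(HELLO):
           raise ValueError("expected the streaming-mode <Hello> first")
       return inbound[:len(HELLO)], inbound[len(HELLO):]

   def initiator_handoff(inbound: bytes):
       """Split the Initiator's inbound stream: the last streaming-mode
       message, then bytes parsed as the MPA Reply Frame and FPDUs."""
       if not inbound.startswith(HELLO_ACK):
           raise ValueError("expected the streaming-mode <Hello Ack> first")
       return inbound[:len(HELLO_ACK)], inbound[len(HELLO_ACK):]
   ```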

7.1.4  Use of Private Data

   This section is advisory in nature, in that it suggests a method that
   a ULP can deal with pre-DDP connection information exchange.

7.1.4.1  Motivation

   Prior RDMA protocols have been developed that provide Private Data
   via out of band mechanisms.  As a result, many applications now
   expect some form of Private Data to be available for application use
   prior to setting up the DDP/RDMA connection.  Following are some
   examples of the use of Private Data.

   An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
   and the [VERBS]) must be associated with a Protection Domain.  No
   receive operations may be posted to the endpoint before it is
   associated with a Protection Domain.  Indeed under both the
   InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is
   created within a Protection Domain.

   There are some applications where the choice of Protection Domain is
   dependent upon the identity of the remote ULP client.  For example,
   if a user session requires multiple connections, it is highly
   desirable for all of those connections to use a single Protection
   Domain.  Note: use of Protection Domains is further discussed in
   [RDMASEC].

   InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for
   the active side ULP to provide Private Data when requesting a
   connection.  This data is passed to the ULP to allow it to determine
   whether to accept the connection, and if so with which endpoint (and
   implicitly which Protection Domain).

   The Private Data can also be used to ensure that both ends of the
   connection have configured their RDMA endpoints compatibly on such
   matters as the RDMA Read capacity (see [RDMAP]).  Further ULP-
   specific uses are also presumed, such as establishing the identity of
   the client.

   Private Data is also allowed for when accepting the connection, to
   allow completion of any negotiation on RDMA resources and for other
   ULP reasons.

   There are several potential ways to exchange this Private Data.  For
   example, the InfiniBand specification includes a connection
   management protocol that allows a small amount of Private Data to be
   exchanged using datagrams before actually starting the RDMA
   connection.

   This draft allows for small amounts of Private Data to be exchanged
   as part of the MPA startup sequence.  The actual Private Data fields
   are carried in the MPA Request Frame and the MPA Reply Frame.

   If larger amounts of Private Data or more negotiation is necessary,
   TCP streaming mode messages may be exchanged prior to enabling MPA.
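
   As a purely hypothetical illustration of the compatibility check
   mentioned above (MPA itself only carries the Private Data as opaque
   octets and does not define its contents), two ULPs might encode an
   RDMA Read capacity in the Private Data of the Request and Reply
   Frames and settle on a value both ends support:

   ```python
   # Hypothetical ULP use of Private Data for capability negotiation.
   # The JSON encoding and field name are invented for this sketch.
   import json

   def encode_caps(rdma_read_depth: int) -> bytes:
       blob = json.dumps({"rdma_read_depth": rdma_read_depth}).encode()
       assert len(blob) <= 512  # Private Data limit from Section 7.1.1
       return blob

   def agree(request_pd: bytes, reply_pd: bytes) -> int:
       """Settle on a depth both ends can support (the smaller one)."""
       req = json.loads(request_pd)
       rep = json.loads(reply_pd)
       return min(req["rdma_read_depth"], rep["rdma_read_depth"])
   ```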

7.1.4.2  Example Immediate Startup using Private Data

          Initiator                                 Responder

   +---------------------------+
   |TCP SYN sent               |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  Sends SYN-Ack           |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  Sends Ack                |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Consumer enables MUST recover appropriately even when this
       occurs while switching stacks.

7.2 Normal Connection Teardown

   Each half connection of MPA       |           |Responder Mode, waits for |
   |Initiator mode with        |           |  <MPA Request frame>     |
   |Private Data; terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA sends    |           +--------------------------+
   |  <MPA Request Frame>;     |
   |MPA waits to DDP for incoming     |           +--------------------------+
   |  <MPA Reply Frame         | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>     |
                                           |Consumer examines Private |
                                           |Data, provides MPA with   |
                                           |return Private Data,      |
                                           |binds DDP to MPA, and     |
                                           |enables be made aware
   that a graceful close of the LLP connection has been received by the
   LLP (e.g. FIN is received).

8  Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but DDP or other Consumer:

    Code Error

    1    TCP connection closed, terminated or lost.  This includes lost
         by timeout, too many retries, RST received or FIN received.

    2    Received MPA CRC does not    |
   |MPA receives match the calculated value for the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>        |           +--------------------------+
   |Consumer examines Private  |
   |Data, binds DDP to MPA,    |
   |and enables DDP/MPA to     |
   |begin Full Operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA Receives first
         FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available.                |
                                           +--------------------------+
             Figure 11: Example Immediate Startup negotiation

   Note:

    3    In the exact order of when MPA is started in event that the TCP connection
       sequence CRC is implementation dependent; valid, received MPA Marker (if
         enabled) and ULPDU Length fields do not agree on the above diagram shows one
       possible sequence.  Also, start of
         a FPDU.  If the Initiator "Ack" to FPDU start determined from previous ULPDU
         Length fields does not match with the Responder's
       "SYN-Ack" MPA Marker position, MPA
         SHOULD deliver an error to DDP.  It may not be combined into the same TCP possible to
         make this check as a segment containing arrives, but the MPA Request Frame (as is allowed by TCP RFCs).

   The example immediate startup check SHOULD be
         made when a gap creating an out of order sequence is described below:

   *   The passive side (Responding Consumer) would listen on the TCP
       destination port, to indicate its readiness to accept a
       connection.

       *   The active side (Initiating Consumer) would request a
           connection from a TCP endpoint (that expected to upgrade to
           MPA/DDP/RDMA closed
         and expected the Private Data) to any time a destination
           address and port.

       *   The Initiating Consumer would initiate Marker points to an already identified FPDU.
         It is OPTIONAL for a TCP connection receiver to
           the destination port.  Acceptance/rejection of the connection
           would proceed as per normal TCP connection establishment.

   *   The passive side (Responding Consumer) would receive the TCP
       connection request as usual allowing normal TCP gatekeepers, such
       as INETD and TCPserver, to exercise their normal
       safeguard/logging functions.  On acceptance of the TCP
       connection, the Responding Consumer would enable MPA check each Marker, if
         multiple Markers are present in an FPDU, or if the
       Responder mode and wait for the initial MPA startup message.

       *   The Initiating Consumer would enable MPA startup segment is
         received in the
           Initiator mode to send an initial order.

    4    Invalid MPA Request Frame with its
           included Private Data message to send.  The Initiating or MPA
           (and Consumer) would also wait for Response Frame received.  In
         this case, the MPA TCP connection to MUST be
           accepted, and any returned Private Data.

   *   The Responding MPA would receive the initial MPA Request Frame
       with the Private Data message immediately closed.  DDP
         and would pass the Private Data
       through other ULPs should treat this similar to code 1, above.

   When conditions 2 or 3 above are detected, an optimized MPA/TCP
   implementation MAY choose to silently drop the Consumer.  The Consumer can then accept TCP segment rather
   than reporting the
       MPA/DDP connection, close error to DDP.  In this case, the sending TCP connection, or reject will
   retry the MPA
       connection with a return message.

   *   To accept segment, usually correcting the connection request, error, unless the Responding Consumer would
       use an appropriate API to bind problem
   was at the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation. source.  In that case, the
       process source will usually exceed the
   number of going to Full Operation, MPA sends retries and terminate the connection.

   Once MPA Reply Frame
       which includes the Consumer supplied Private Data containing delivers an error of any
       appropriate Consumer response.  MPA/DDP waits for the first
       incoming FPDU before sending type, it MUST NOT pass or deliver
   any FPDUs.

   *   If the initial TCP data was not a properly formatted MPA Request
       Frame, additional FPDUs on that half connection.

   For Error codes 2 and 3, MPA will MUST NOT close or reset the TCP connection immediately.

   *   To reject
   following a reported error.  Closing the MPA connection request, is the Responding Consumer
       would send an
   responsibility of DDP's ULP.

        Note that since MPA Reply Frame with will not Deliver any ULP supplied Private Data
       (with reason for rejection), with FPDUs on a half
        connection following an error detected on the "Rejected Connection" bit
       set receive side of
        that connection, DDP's ULP is expected to '1', and may close tear down the TCP
        connection.

       *   The Initiating MPA would receive the MPA Reply Frame with  This may not occur until after one or more last
        messages are transmitted on the
           Private Data message and would report this opposite half connection.  This
        allows a diagnostic error message to be sent.

9  Security Considerations

   This section discusses the
           Consumer, including the supplied Private Data.

           If security considerations for MPA.

9.1 Protocol-specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   any other protocol running over TCP.  A third party, by sending
   packets into the "rejected Connection" bit is set network that are delivered to an MPA receiver, could
   launch a '1', variety of attacks that take advantage of how MPA will
           close the TCP connection and exit.

           If the "Rejected Connection" bit is set to operates.
   For example, a '0', and on
           determining from the MPA Reply Frame Private Data third party could send random packets that the
           Connection is acceptable, the Initiating Consumer would use are valid
   for TCP, but contain no FPDU headers.  An MPA receiver reports an appropriate API to bind the TCP/MPA connections
   error to a DDP
           endpoint thus enabling MPA/DDP into Full Operation.  MPA/DDP
           would begin sending DDP messages as MPA FPDUs.

7.1.5  "Dual stack" implementations

   MPA/DDP implementations are commonly expected to when any packet arrives that cannot be implemented validated as
   part of a "dual stack" architecture.  One "stack" is the traditional
   TCP stack, usually with a sockets interface API (Application
   Programming Interface).  The second stack is the MPA/DDP "stack" with
   its own API, and potentially separate code or hardware to deal with
   the MPA/DDP data.  Of course, implementations may vary, so the
   following comments are of an advisory nature only.

   The use of the two "stacks" offers advantages:

        TCP connection setup is usually done with the TCP stack.  This
        allows use of the usual naming and addressing mechanisms.  It
        also means that any mechanisms used to "harden" the connection
        setup against security threats are also used when starting
        MPA/DDP.

        Some applications may have been originally designed for TCP, but
        are "enhanced" to utilize MPA/DDP after a negotiation reveals
        the capability to do so.  The negotiation process takes place in
        TCP's streaming mode, using the usual TCP APIs.

        Some new applications, designed for RDMA or DDP, still need to
        exchange some data prior to starting MPA/DDP.  This exchange can
        be of arbitrary length or complexity, but often consists of only
        a small amount of information, perhaps only a single message.
        Using the TCP streaming mode for this exchange allows this to be
        done using well understood methods.

   The main disadvantage of using two stacks is the conversion of an
   active TCP connection between them.  This process must be done with
   care to prevent loss of data.

   To avoid some of the problems when using a "dual stack" architecture
   the following additional restrictions may be required by the
   implementation:

   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
       stream data is expected.  This is typically managed by the ULP
       protocol.  When following the recommended startup sequence, the
       Responder side enters DDP/MPA mode, sends the last streaming mode
       data, and then waits for the MPA Request Frame.  No additional
       streaming mode data is expected.  The Initiator side ULP receives
       the last streaming mode data, and then enters DDP/MPA mode.
       Again, no additional streaming mode data is expected.

   2.  The DDP/MPA MAY provide the ability to send a "last streaming
       message" as part of its Responder DDP/MPA enable function.  This
       allows the DDP/MPA stack to more easily manage the conversion to
       DDP/MPA mode (and avoid problems with a very fast return of the
       MPA Request Frame from the Initiator side).

   Note: Regardless of the "stack" architecture used, TCP's rules MUST
       be followed.  For example, if network data is lost, re-segmented,
       or re-ordered, TCP MUST recover appropriately even when this
       occurs while switching stacks.
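   The ordering constraints above can be modeled in a few lines.  The
   sketch below is illustrative only; the Endpoint and Mode names and
   methods are invented for this example and are not part of the
   specification.  It shows why restriction 1 makes the stack switch
   safe: the DDP/MPA stack is enabled only after the ULP declares that
   no further streaming-mode data is expected.

```python
from enum import Enum, auto

class Mode(Enum):
    STREAMING = auto()   # normal TCP streaming mode
    DDP_MPA = auto()     # MPA/DDP mode (FPDUs only)

class Endpoint:
    """Toy model of restriction 1: enable the DDP/MPA stack only when
    no further incoming streaming-mode data is expected."""

    def __init__(self):
        self.mode = Mode.STREAMING
        self.streaming_data_expected = True

    def ulp_declares_last_streaming_data(self):
        # The ULP protocol decides when the streaming exchange is done.
        self.streaming_data_expected = False

    def enable_ddp_mpa(self):
        if self.streaming_data_expected:
            raise RuntimeError("streaming data still expected; "
                               "cannot switch stacks safely")
        self.mode = Mode.DDP_MPA

# Responder side: send last streaming data, declare it, then enable
# DDP/MPA and wait for the MPA Request Frame (not modeled here).
responder = Endpoint()
responder.ulp_declares_last_streaming_data()
responder.enable_ddp_mpa()
assert responder.mode is Mode.DDP_MPA
```

   The Initiator side would mirror this: receive the last streaming-mode
   data first, and only then call the equivalent of enable_ddp_mpa().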

7.2  Normal Connection Teardown

   Each half connection of MPA terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
   that a graceful close of the LLP connection has been received by the
   LLP (e.g. FIN is received).

8  Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to DDP or other Consumer:

    Code Error

    1    TCP connection closed, terminated or lost.  This includes lost
         by timeout, too many retries, RST received or FIN received.

    2    Received MPA CRC does not match the calculated value for the
         FPDU.

    3    In the event that the CRC is valid, received MPA Marker (if
         enabled) and ULPDU Length fields do not agree on the start of
         a FPDU.  If the FPDU start determined from previous ULPDU
         Length fields does not match with the MPA Marker position, MPA
         SHOULD deliver an error to DDP.  It may not be possible to
         make this check as a segment arrives, but the check SHOULD be
         made when a gap creating an out of order sequence is closed
         and any time a Marker points to an already identified FPDU.
         It is OPTIONAL for a receiver to check each Marker, if
         multiple Markers are present in an FPDU, or if the segment is
         received in order.

    4    Invalid MPA Request Frame or MPA Response Frame received.  In
         this case, the TCP connection MUST be immediately closed.  DDP
         and other ULPs should treat this similar to code 1, above.

   When conditions 2 or 3 above are detected, an optimized MPA/TCP
   implementation MAY choose to silently drop the TCP segment rather
   than reporting the error to DDP.  In this case, the sending TCP will
   retry the segment, usually correcting the error, unless the problem
   was at the source.  In that case, the source will usually exceed the
   number of retries and terminate the connection.

   Once MPA delivers an error of any type, it MUST NOT pass or deliver
   any additional FPDUs on that half connection.

   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
   following a reported error.  Closing the connection is the
   responsibility of DDP's ULP.

        Note that since MPA will not Deliver any FPDUs on a half
        connection following an error detected on the receive side of
        that connection, DDP's ULP is expected to tear down the
        connection.  This may not occur until after one or more last
        messages are transmitted on the opposite half connection.  This
        allows a diagnostic error message to be sent.
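   For illustration only, the following sketch shows the receive-side
   check that gives rise to error code 2.  It is a deliberately
   simplified model of the FPDU layout (no Markers, and Python's
   standard-library zlib.crc32 standing in for the CRC32c that MPA
   actually specifies); all function names here are invented:

```python
import struct
import zlib

def crc32c_stub(data: bytes) -> int:
    # Stand-in only: MPA specifies CRC32c; the Python stdlib provides
    # plain CRC-32, so a real receiver would use a CRC32c routine here.
    return zlib.crc32(data) & 0xFFFFFFFF

def make_fpdu(ulpdu: bytes) -> bytes:
    # Simplified FPDU: ULPDU Length (2 octets) | ULPDU | pad to a
    # 4-octet boundary | CRC (4 octets), Markers omitted.
    header = struct.pack(">H", len(ulpdu))
    pad = b"\x00" * (-(len(header) + len(ulpdu)) % 4)
    body = header + ulpdu + pad
    return body + struct.pack(">I", crc32c_stub(body))

def receive_fpdu(segment: bytes):
    """Return (ulpdu, 0) on success, or (None, 2) when the received
    CRC does not match the calculated value (error code 2)."""
    (length,) = struct.unpack(">H", segment[:2])
    pad = -(2 + length) % 4
    body = segment[: 2 + length + pad]
    (received_crc,) = struct.unpack(">I", segment[2 + length + pad:][:4])
    if received_crc != crc32c_stub(body):
        return None, 2      # Error code 2: CRC mismatch
    return body[2 : 2 + length], 0

ulpdu, err = receive_fpdu(make_fpdu(b"hello"))
assert (ulpdu, err) == (b"hello", 0)
```

   The code 3 consistency check (Marker position versus the FPDU start
   implied by previous ULPDU Length fields) is not modeled above, since
   it depends on Marker placement within the TCP sequence space.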

9  Security Considerations

   This section discusses the security considerations for MPA.

9.1  Protocol-specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   any other protocol running over TCP.  A third party, by sending
   packets into the network that are delivered to an MPA receiver, could
   launch a variety of attacks that take advantage of how MPA operates.
   For example, a third party could send random packets that are valid
   for TCP, but contain no FPDU headers.  An MPA receiver reports an
   error to DDP when any packet arrives that cannot be validated as an
   FPDU when properly located on an FPDU boundary.  A third party could
   also send packets that are valid for TCP, MPA, and DDP, but do not
   target valid buffers.  These types of attacks ultimately result in
   loss of connection and thus become a type of DOS (Denial Of Service)
   attack.  Communication security mechanisms such as IPsec [RFC2401]
   may be used to prevent such attacks.

   Independent of how MPA operates, a third party could use ICMP
   messages to reduce the path MTU to such a small size that performance
   would likewise be severely impacted.  Range checking on path MTU
   sizes in ICMP packets may be used to prevent such attacks.

   [RDMAP] and [DDP] are used to control, read and write data buffers
   over IP networks.  Therefore, the control and data packets of these
   protocols are vulnerable to the spoofing, tampering and information
   disclosure attacks listed below.  In addition, Connection to/from an
   unauthorized or unauthenticated endpoint is a potential problem with
   most applications using RDMA, DDP, and MPA.

9.1.1  Spoofing

   Spoofing attacks can be launched by the Remote Peer, or by a network
   based attacker.  A network based spoofing attack applies to all
   Remote Peers.  Because the MPA Stream requires a TCP Stream in the
   ESTABLISHED state, certain types of traditional forms of wire attacks
   do not apply -- an end-to-end handshake must have occurred to
   establish the MPA Stream.  So, the only form of spoofing that applies
   is one when a remote node can both send and receive packets.  Yet
   even with this limitation the Stream is still exposed to the
   following spoofing attacks.

9.1.1.1  Impersonation

   A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
   (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
   Stream with the victim.  End to end authentication (i.e. IPsec or ULP
   authentication) provides protection against this attack.

9.1.1.2  Stream Hijacking

   Stream hijacking happens when a network based attacker follows the
   Stream establishment phase, and waits until the authentication phase
   (if such a phase exists) is completed successfully.  He can then
   spoof the IP address and re-direct the Stream from the victim to its
   own machine.  For example, an attacker can wait until an iSCSI
   authentication is completed successfully, and hijack the iSCSI
   Stream.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec to prevent
   spoofing.  Another option is to provide physical security.
   Discussion of physical security is out of scope for this document.

9.1.1.3  Man in the Middle Attack

   If a network based attacker has the ability to delete, inject,
   replay, or modify packets which will still be accepted by MPA (e.g.,
   TCP sequence number is correct, FPDU is valid etc.) then the Stream
   can be exposed to a man in the middle attack.  The attacker could
   potentially use the services of [DDP] and [RDMAP] to read the
   contents of the associated data buffer, modify the contents of the
   associated data buffer, or to disable further access to the buffer.
   The only countermeasure for this form of attack is to either secure
   the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to
   provide physical security to prevent man-in-the-middle type attacks.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing or tampering.  If Stream or session level authentication and
   integrity protection are not used, then a man-in-the-middle attack
   can occur, enabling spoofing and tampering.

   Another approach is to restrict access to only the local subnet/link,
   and provide some mechanism to limit access, such as physical security
   or 802.1.x.  This model is an extremely limited deployment scenario,
   and will not be further examined here.

9.1.2  Eavesdropping

   Generally speaking, Stream confidentiality protects against
   eavesdropping.  Stream and/or session authentication and integrity
   protection is a counter measurement against various spoofing and
   tampering attacks.  The effectiveness of authentication and integrity
   against a specific attack, depend on whether the authentication is
   machine level authentication (as the one provided by IPsec), or ULP
   authentication.

9.2  Introduction to Security Options

   The following security services can be applied to an MPA/DDP/RDMAP
   Stream:

   1.  Session confidentiality - protects against eavesdropping.

   2.  Per-packet data source authentication - protects against the
       following spoofing attacks: network based impersonation, Stream
       hijacking, and man in the middle.

   3.  Per-packet integrity - protects against tampering done by
       network based modification of FPDUs (indirectly affecting buffer
       content through DDP services).

   4.  Packet sequencing - protects against replay attacks, which is a
       special case of the above tampering attack.

   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
   or Stream hijacking attacks, it is recommended that the Stream be
   authenticated, integrity protected, and protected from replay
   attacks; it may use confidentiality protection to protect from
   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
   network).

   IPsec is capable of providing the above security services for IP and
   TCP traffic.

   ULP protocols may be able to provide part of the above security
   services.  See [NFSv4CHANNEL] for additional information on a
   promising approach called "channel binding".  From [NFSv4CHANNEL]:

        "The concept of channel bindings allows applications to prove
        that the end-points of two secure channels at different network
        layers are the same by binding authentication at one channel to
        the session protection at the other channel.  The use of channel
        bindings allows applications to delegate session protection to
        lower layers, which may significantly improve performance for
        some applications."

9.3  Using IPsec With MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above.  Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change.  IPsec packets are
   processed (e.g., integrity checked and decrypted) in the order they
   are received, and an MPA receiver will process the decrypted FPDUs
   contained in these packets in the same manner as FPDUs contained in
   unsecured IP packets.

   MPA Implementations MUST implement IPsec as described in Section 9.4
   below.  The use of IPsec is up to ULPs and administrators.

9.4  Requirements for IPsec Encapsulation of MPA/DDP

   The IP Storage working group has spent significant time and effort to
   define the normative IPsec requirements for IP Storage [RFC3723].
   Portions of that specification are applicable to a wide variety of
   protocols, including the RDDP protocol suite.  In order to not
   replicate this effort, an MPA on TCP implementation MUST follow the
   requirements defined in RFC3723 Section 2.3 and Section 5, including
   the associated normative references for those sections.

   Additionally, since IPsec acceleration hardware may only be able to
   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
   messages MAY be sent for idle SAs, as a means of keeping the number
   of active Phase 2 SAs to a minimum.  The receipt of an IKE Phase 2
   delete message MUST NOT be interpreted as a reason for tearing down
   a DDP/RDMA Stream.  Rather, it is preferable to leave the Stream up,
   and if additional traffic is sent on it, to bring up another IKE
   Phase 2 SA to protect it.  This avoids the potential for continually
   bringing Streams up and down.

   The IPsec requirements for RDDP are based on the version of IPsec
   specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC
   3723 [RFC3723], despite the existence of a newer version of IPsec
   specified in RFC 4301 [RFC4301] and related RFCs.  One of the
   important early applications of the RDDP protocols is their use with
   iSCSI [iSER]; RDDP's IPsec requirements follow those of IPsec in
   order to facilitate that usage by allowing a common profile of IPsec
   to be used with iSCSI and the RDDP protocols.  In the future, RFC
   3723 may be updated to the newer version of IPsec; the IPsec
   security requirements of any such update should apply uniformly to
   iSCSI and the RDDP protocols.

   Note that there are serious security issues if IPsec is not
   implemented end-to-end.  For example, if IPsec is implemented as a
   tunnel in the middle of the network, any hosts between the peer and
   the IPsec tunneling device can freely attack the unprotected Stream.

10 IANA Considerations

   No IANA actions are required by this document.

   If a well-known port is chosen as the mechanism to identify a DDP on
   MPA on TCP, the well-known port must be registered with IANA.
   Because the use of the port is DDP specific, registration of the port
   with IANA is left to DDP.

11 References

11.1 Normative References

   [iSCSI] Satran, J., "Internet Small Computer Systems Interface
       (iSCSI)", RFC 3720, April 2004.

   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
       November 1990.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
       Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
       Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3723] Aboba, B., et al, "Securing Block Storage Protocols over
       IP", RFC 3723, April 2004.

   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
       Program Protocol Specification", RFC 793, September 1981.

   [RDMASEC] Pinkerton, J., Deleganes, E., Bitan, S., "DDP/RDMAP
       Security", draft-ietf-rddp-security-09.txt (work in progress),
       May 2006.

11.2 Informative References

   [CRCTCP] Stone, J., Partridge, C., "When the CRC and TCP checksum
       disagree", ACM Sigcomm, Sept. 2000.

   [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
       Library) and uDAPL (User Direct Access Programming Library)",
       http://www.datcollaborative.org.

   [DDP] Shah, H., et al., "Direct Data Placement over Reliable
       Transports", draft-ietf-rddp-ddp-06.txt (work in progress), May
       2006.

   [iSER] Ko, M., et al., "iSCSI Extensions for RDMA Specification",
       draft-ietf-rddp-iser (work in progress).

   [IT-API] The Open Group, "Interconnect Transport API (IT-API)",
       Version 2.1, http://www.opengroup.org.

   [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
       Internet Protocol", RFC 2401, November 1998.

   [RFC4301] Kent, S., Seo, K., "Security Architecture for the Internet
       Protocol", RFC 4301, December 2005.

   [RFC0896] Nagle, J., "Congestion Control in IP/TCP Internetworks",
       RFC 896, January 1984.

   [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
       Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
       bindings-02.txt (work in progress), July 2004.

   [RDMAP] Recio, R., et al., "RDMA Protocol Specification",
       draft-ietf-rddp-rdmap-06.txt (work in progress), May 2006.

   [RFC2960] Stewart, R., et al., "Stream Control Transmission
       Protocol", RFC 2960, October 2000.

   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
       September 1981.

   [RFC1122] Braden, R., "Requirements for Internet hosts -
       communication layers", RFC 1122, October 1989.

   [VERBS] Hilland, J., et al., "RDMA Protocol Verbs Specification",
       draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf, April 2003,
       http://www.rdmaconsortium.org.

12 Appendix

   This appendix is for information only and is NOT part of the
   standard.

   The appendix covers three topics;

   Section 12.1 is an analysis of MPA on TCP and why it is useful to
   integrate MPA with TCP (with modifications to typical TCP
   implementations) to reduce overall system buffering and overhead.

   Section 12.2 covers some MPA receiver implementation notes.

   Section 12.3 covers methods of making MPA implementations
   interoperate with both IETF and RDMA Consortium versions of the
   protocols.

12.1 Analysis of MPA over TCP Operations

   This appendix analyzes the impact of MPA on the TCP sender, receiver,
   and wire protocol.

   One of MPA's high level goals is to provide enough information, when
   combined with the Direct Data Placement Protocol [DDP], to enable
   out-of-order placement of DDP payload into the final Upper Layer
   Protocol (ULP) buffer.  Note that DDP separates the act of placing
   data into a ULP buffer from that of notifying the ULP that the ULP
   buffer is available for use.  In DDP terminology, the former is
   defined as "Placement", and the later is defined as "Delivery".  MPA
   supports in-order Delivery of the data to the ULP, including support
   for Direct Data Placement in the final ULP buffer location when TCP
   segments arrive out-of-order.  Effectively, the goal is to use the
   pre-posted ULP buffers as the TCP receive buffer, where the
   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
   DDP) is done in place, in the ULP buffer, with no data copies.

   This Appendix walks through the advantages and disadvantages of the
   TCP sender modifications proposed by MPA:

   1) that MPA prefers that the TCP sender do Header Alignment, where a
      TCP segment should begin with an MPA Framing Protocol Data Unit
      (FPDU) (if there is payload present).

   2) that there be an integral number of FPDUs in a TCP segment (under
      conditions where the Path MTU is not changing).

   This Appendix concludes that the scaling advantages of FPDU Alignment
   are strong, based primarily on fairly drastic TCP receive buffer
   reduction requirements and simplified receive handling.  The analysis
   also shows that there is little effect to TCP wire behavior.

12.1.1 Assumptions

12.1.1.1 MPA is layered beneath DDP [DDP]

   MPA is an adaptation layer between DDP and TCP.  DDP requires
   preservation of DDP segment boundaries and a CRC32C digest covering
   the DDP header and data.  MPA adds these features to the TCP stream
   so that DDP over TCP has the same basic properties as DDP over SCTP.

12.1.1.2 MPA preserves DDP message framing

   MPA was designed as a framing layer specifically for DDP and was not
   intended as a general-purpose framing layer for any other ULP using
   TCP.

   A framing layer allows ULPs using it to receive indications from the
   transport layer only when complete ULPDUs are present.  As a framing
   layer, MPA is not aware of the content of the DDP PDU, only that it
   has received and, if necessary, reassembled a complete PDU for
   Delivery to the DDP.

12.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
      normal conditions

   To make reception of a complete DDP PDU on every received segment
   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
   the underlying fabric.  Each FPDU that MPA creates contains
   sufficient information for the receiver to directly place the ULP
   payload in the correct location in the correct receive buffer.

   Edge cases when this condition does not occur are dealt with, but do
   not need to be on the fast path.

12.1.1.4 Out-of-order placement but NO out-of-order Delivery

   DDP receives complete DDP PDUs from MPA.  Each DDP PDU contains the
   information necessary to place its ULP payload directly in the
   correct location in host memory.

   Because each DDP segment is self-describing, it is possible for DDP
   segments received out of order to have their ULP payload placed
   immediately in the ULP receive buffer.

   Data delivery to the ULP is guaranteed to be in the order the data
   was sent.  DDP only indicates data delivery to the ULP after TCP has
   acknowledged the complete byte stream.

12.1.2 The Value of FPDU Alignment

   Significant receiver optimizations can be achieved when Header
   Alignment and complete FPDUs are the common case.  The optimizations
   allow utilizing significantly fewer buffers on the receiver and less
   computation per FPDU.

A Appendix.
            Optimized MPA-aware TCP implementations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix covers some Optimized MPA-aware TCP implementation
   guidance to implementers.  It is intended for those implementations
   that want to send/receive as much traffic as possible in an aligned
   and zero-copy fashion.

                   +-----------------------------------+
                   | +-----------+ +-----------------+ |
                   | | Optimized | | Other Protocols | |
                   | |  MPA/TCP  | +-----------------+ |
                   | +-----------+        ||           |
                   |         \\     --- socket API --- |
                   |          \\          ||           |
                   |           \\      +-----+         |
                   |            \\     | TCP |         |
                   |             \\    +-----+         |
                   |              \\    //             |
                   |             +-------+             |
                   |             |  IP   |             |
                   |             +-------+             |
                   +-----------------------------------+

                Figure 11 Optimized MPA/TCP implementation

   The diagram above shows a block diagram of a potential
   implementation.  The network sub-system in the diagram can support
   traditional sockets based connections using the normal API as shown
   on the right side of the diagram.  Connections for DDP/MPA/TCP are
   run using the facilities shown on the left side of the diagram.

   The DDP/MPA/TCP connections can be started using the facilities shown
   on the left side using some suitable API, or they can be initiated
   using the facilities shown on the right side and transitioned to the
   left side at the point in the connection setup where MPA goes to
   "full MPA/DDP operation mode" as described in section 7.1.2 on page
   29.

   The optimized MPA/TCP implementations (left side of diagram and
   described below) are only applicable to MPA, all other TCP
   applications continue to use the standard TCP stacks and interfaces
   shown in the right side of the diagram.

A.1  Optimized MPA/TCP transmitters

   The various TCP RFCs allow considerable choice in segmenting a TCP
   stream.  In order to optimize FPDU recovery at the MPA receiver, an
   optimized MPA/TCP implementation uses additional segmentation rules.

   To provide optimum performance, an optimized MPA/TCP transmit side
   implementation should be enabled to:

   *   With an EMSS large enough to contain the FPDU(s), segment the
       outgoing TCP stream such that the first octet of every TCP
       Segment begins with an FPDU.  Multiple FPDUs may be packed into a
       single TCP segment as long as they are entirely contained in the
       TCP segment.

   *   Report the current EMSS from the TCP to the MPA transmit layer.

   There are exceptions to the above rule.  Once an ULPDU is provided to
   MPA, the optimized MPA/TCP sender transmits it or fails the
   connection; it cannot be repudiated.  As a result, during changes in
   MTU and EMSS, or when TCP's Receive Window size (RWIN) becomes too
   small, it may be necessary to send FPDUs that do not conform to the
   segmentation rule above.

   A possible, but less desirable, alternative is to use IP
   fragmentation on accepted FPDUs to deal with MTU reductions or
   extremely small EMSS.

   Even when alignment with TCP segments is lost, the sender still
   formats the FPDU according to FPDU format as shown in Figure 2.

   On a retransmission, TCP does not necessarily preserve original TCP
   segmentation boundaries.  This can lead to the loss of FPDU Alignment
   and containment within a TCP segment during TCP retransmissions.  An
   optimized MPA/TCP sender should try to preserve original TCP
   segmentation boundaries on a retransmission.

A.2  Effects of Optimized MPA/TCP Segmentation

   Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
   single FPDU when a DDP message is large enough.  Since the DDP
   message may not exactly fit into TCP segments, a "message tail" often
   occurs that results in an FPDU that is smaller than a single TCP
   segment.  Additionally some DDP messages may be considerably shorter
   than the EMSS.  If a small FPDU is sent in a single TCP segment the
   result is a "short" TCP segment.

   Applications expected to see strong advantages from Direct Data
   Placement include transaction-based applications and throughput
   applications.  Request/response protocols typically send one FPDU per
   TCP segment and then wait for a response.  Under these conditions,
   these "short" TCP segments are an appropriate and expected effect of
   the segmentation.

   Another possibility is that the application might be sending multiple
   messages (FPDUs) to the same endpoint before waiting for a response.
   In this case, the segmentation policy would tend to reduce the
   available connection bandwidth by under-filling the TCP segments.

   Standard TCP implementations often utilize the Nagle [RFC0896]
   algorithm to ensure that segments are filled to the EMSS whenever the
   round trip latency is large enough that the source stream can fully
   fill segments before Acks arrive.  The algorithm does this by
   delaying the transmission of TCP segments until a ULP can fill a
   segment, or until an ACK arrives from the far side.  The algorithm
   thus allows for smaller segments when latencies are shorter to keep
   the ULP's end to end latency to reasonable levels.

   The Nagle algorithm is not mandatory to use [RFC1122].

   When used with optimized MPA/TCP stacks, Nagle and similar algorithms
   can result in the "packing" of multiple FPDUs into TCP segments.

   If a "message tail", small DDP messages, or the start of a larger DDP
   message are available, MPA may pack multiple FPDUs into TCP segments.
   When this is done, the TCP segments can be more fully utilized, but,
   due to the size constraints of FPDUs, segments may not be filled to
   the EMSS.  A dynamic MULPDU that informs DDP of the size of the
   remaining TCP segment space makes filling the TCP segment more
   effective.

        Note that MPA receivers do more processing of a TCP segment that
        contains multiple FPDUs, this may affect the performance of some
        receiver implementations.

   It is up to the ULP to decide if Nagle is useful with DDP/MPA.  Note
   that many of the applications expected to take advantage of MPA/DDP
   prefer to avoid the extra delays caused by Nagle.  In such scenarios
   it is anticipated there will be minimal opportunity for packing at
   the transmitter and receivers may choose to optimize their
   performance for this anticipated behavior.

   Therefore, the ability application is expected to build a
   "flow-through" receiver set TCP parameters such
   that enables TCP-based solutions to scale to
   10G it can trade off latency and beyond in an economical way.  The optimizations are
   especially relevant wire efficiency.  Implementations
   should provide a connection option which disables Nagle for MPA/TCP
   similar to hardware implementations of receivers that
   process multiple protocol layers - Data Link Layer (e.g., Ethernet),
   Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
   of TCP (e.g., MPA/DDP).  As network speed increases, there the way the TCP_NODELAY socket option is an
   increasing desire to use provided for a hardware based receiver in order
   traditional sockets interface.

   When latency is not critical, application is expected to
   achieve an efficient high performance solution.

   A TCP receiver, under worst leave Nagle
   enabled.  In this case conditions, has to allocate buffers
   (BufferSizeTCP) whose capacities are a function of the bandwidth-
   delay product.  Thus:

       BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].

   Where bandwidth is TCP implementation may pack any available
   FPDUs into TCP segments so that the end-to-end bandwidth of segments are filled to the connection, delay
   is EMSS.
   If the round trip delay amount of the connection, and K data available is an implementation
   dependent constant.

   Thus BufferSizeTCP scales with not enough to fill the end-to-end bandwidth (10x more
   buffers for a 10x increase in end-to-end bandwidth).  As this
   buffering approach may scale poorly TCP segment
   when it is prepared for hardware transmission, TCP can send the segment partly
   filled, or software
   implementations alike, several approaches allow reduction in use the
   amount of buffering required for high-speed TCP communication.

   The MPA/DDP approach is Nagle algorithm to enable wait for the ULP's buffer ULP to be used as post more
   data.
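
   The packing and dynamic MULPDU behavior described above can be
   sketched in a few lines of code.  The sketch below is illustrative
   only and not part of this specification; the function names are
   invented here and Markers are ignored for brevity.  It packs whole
   FPDUs (ULPDU_Length field, ULPDU, pad to a 4-octet boundary, CRC)
   into one segment until the remaining space is too small.

```python
# Illustrative sketch only, not part of the specification: how a
# sender might pack whole FPDUs into one TCP segment.  Field sizes
# follow the MPA FPDU format (2-octet ULPDU_Length, pad to a 4-octet
# boundary, 4-octet CRC); Markers are ignored here for brevity.

def fpdu_size(ulpdu_len: int) -> int:
    """Wire size of one FPDU: ULPDU_Length field + ULPDU + pad + CRC."""
    unpadded = 2 + ulpdu_len          # ULPDU_Length field plus payload
    pad = (4 - unpadded % 4) % 4      # pad the FPDU to a 4-octet boundary
    return unpadded + pad + 4         # plus CRC32c

def pack_segment(ulpdu_lens, emss):
    """Pack complete FPDUs into one segment; only whole FPDUs are sent."""
    packed, space = [], emss
    for n in ulpdu_lens:
        size = fpdu_size(n)
        if size > space:
            break                     # next FPDU goes in a later segment
        packed.append(n)
        space -= size
    return packed, ulpdu_lens[len(packed):]

# With an EMSS of 1460 octets, two 500-octet ULPDUs (508 octets each
# on the wire) fit; the third is left for the next segment.
packed, rest = pack_segment([500, 500, 500], 1460)
```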

A.3  Optimized MPA/TCP receivers

   When an MPA receive implementation and the MPA-aware receive side
   TCP implementation support handling out of order ULPDUs, the TCP
   receive implementation performs the following functions:

   1)  The implementation passes incoming TCP segments to MPA as soon as
       they have been received and validated, even if not received in
       order.  The TCP layer commits to keeping each segment before it
       can be passed to MPA.  This means that the segment must have
       passed the TCP, IP, and lower layer data integrity validation
       (i.e., checksum), must be in the receive window, must be part of
       the same epoch (if timestamps are used to verify this) and any
       other checks required by TCP RFCs.

       This is not to imply that the data must be completely ordered
       before use.  An implementation can accept out of order segments,
       SACK them [RFC2018], and pass them to MPA immediately, before the
       segments needed to fill in the gaps arrive.  MPA expects to
       utilize these segments when they are complete FPDUs or can be
       combined into complete FPDUs to allow the passing of ULPDUs to
       DDP when they arrive, independent of ordering.  DDP uses the
       passed ULPDU to "place" the DDP segments (see [DDP] for more
       details).

       Since MPA performs a CRC calculation and other checks on received
       FPDUs, the MPA/TCP implementation ensures that any TCP segments
       that duplicate data already received and processed (as can happen
       during TCP retries) do not overwrite already received and
       processed FPDUs.  This avoids the possibility that duplicate data
       may corrupt already validated FPDUs.

   2)  The implementation provides a mechanism to indicate the ordering
       of TCP segments as the sender transmitted them.  One possible
       mechanism might be attaching the TCP sequence number to each
       segment.

   3)  The implementation also provides a mechanism to indicate when a
       given TCP segment (and the prior TCP stream) is complete.  One
       possible mechanism might be to utilize the leading (left) edge of
       the TCP Receive Window.

       MPA uses the ordering and completion indications to inform DDP
       when a ULPDU is complete; MPA Delivers the FPDU to DDP.  DDP uses
       the indications to "deliver" its messages to the DDP consumer
       (see [DDP] for more details).

       DDP on MPA utilizes the above two mechanisms to establish the
       Delivery semantics that DDP's consumers agree to.  These
       semantics are described fully in [DDP].  These include
       requirements on DDP's consumer to respect ownership of buffers
       prior to the time that DDP delivers them to the Consumer.

   The use of SACK [RFC2018] significantly improves network utilization
   and performance and is therefore recommended.  When combined with the
   out-of-order passing of segments to MPA and DDP, significant
   buffering and copying of received data can be avoided.
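
   As an illustration of functions 1) through 3) above, the following
   sketch models an MPA-aware TCP receive side.  It is not normative and
   the class and attribute names are invented here.  Validated segments
   are passed up immediately (even out of order), tagged with their TCP
   sequence number, and the left edge of the receive window is advanced
   to indicate which prefix of the stream is complete.

```python
# Illustrative sketch (assumed names, not from the specification) of
# the two indications an MPA-aware TCP receive side provides:
# per-segment ordering (the TCP sequence number) and stream
# completeness (the left edge of the receive window).

class MpaAwareReceiver:
    def __init__(self, initial_seq):
        self.left_edge = initial_seq    # stream is complete below this
        self.ooo = {}                   # seq -> validated payload
        self.passed_to_mpa = []         # (seq, payload), any order
        self.complete_to_mpa = []       # in-order completion indications

    def segment_arrived(self, seq, payload):
        # Assume TCP/IP checksum, window and epoch checks already passed.
        self.passed_to_mpa.append((seq, payload))   # passed even if OOO
        self.ooo[seq] = payload
        # Advance the left edge over any now-contiguous data and tell
        # MPA which prefix of the stream is complete.
        while self.left_edge in self.ooo:
            data = self.ooo.pop(self.left_edge)
            self.complete_to_mpa.append((self.left_edge, data))
            self.left_edge += len(data)

rx = MpaAwareReceiver(0)
rx.segment_arrived(100, b'B' * 100)   # out of order: passed, not complete
rx.segment_arrived(0, b'A' * 100)     # fills the gap; both now complete
```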

A.4  Re-segmenting Middle boxes and non optimized MPA/TCP senders

   Since MPA senders often start FPDUs on TCP segment boundaries, a
   receiving optimized MPA/TCP implementation may be able to optimize
   the reception of data in various ways.

   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
   segment boundaries.

   Some MPA senders may be unable to conform to the sender requirements
   because their implementation of TCP is not designed with MPA in mind.
   Even for optimized MPA/TCP senders, the network may contain "middle
   boxes" which modify the TCP stream by changing the segmentation.
   This is generally interoperable with TCP and its users and MPA must
   be no exception.

   The presence of Markers in MPA (when enabled) allows an optimized
   MPA/TCP receiver to recover the FPDUs despite these obstacles,
   although it may be necessary to utilize additional buffering at the
   receiver to do so.

   Some of the cases that a receiver may have to contend with are listed
   below as a reminder to the implementer:

   *   A single Aligned and complete FPDU, either in order, or out of
       order:  This can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Multiple FPDUs in a segment, aligned and fully contained, either
       in order, or out of order:  These can be passed to DDP as soon as
       validated, and Delivered when ordering is established.

   *   Incomplete FPDU:  The receiver should buffer until the remainder
       of the FPDU arrives.  If the remainder of the FPDU is already
       available, this can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Unaligned FPDU start:  The partial FPDU must be combined with its
       preceding portion(s).  If the preceding parts are already
       available, and the whole FPDU is present, this can be passed to
       DDP as soon as validated, and Delivered when ordering is
       established.  If the whole FPDU is not available, the receiver
       should buffer until the remainder of the FPDU arrives.

   *   Combinations of Unaligned or incomplete FPDUs (and potentially
       other complete FPDUs) in the same TCP segment:  If any FPDU is
       present in its entirety, or can be completed with portions
       already available, it can be passed to DDP as soon as validated,
       and Delivered when ordering is established.
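
   When alignment has been lost, a receiver must fall back to walking
   the byte stream using the ULPDU_Length field to find FPDU boundaries.
   The following sketch (illustrative only; Markers and the CRC check
   are elided) shows such a scan over an in-order stream, buffering any
   incomplete FPDU tail:

```python
# Minimal sketch (Markers and the CRC check are elided) of a receiver
# walking an in-order byte stream whose TCP segmentation no longer
# matches FPDU boundaries, e.g. after a re-segmenting middle box.

def extract_fpdus(buf: bytearray):
    """Pop complete FPDUs from the front of buf; leave a partial tail."""
    fpdus = []
    while len(buf) >= 2:
        ulpdu_len = int.from_bytes(buf[:2], 'big')   # ULPDU_Length field
        unpadded = 2 + ulpdu_len
        pad = (4 - unpadded % 4) % 4
        total = unpadded + pad + 4                   # plus CRC32c
        if len(buf) < total:
            break                          # incomplete FPDU: keep buffering
        fpdus.append(bytes(buf[2:2 + ulpdu_len]))    # CRC check would go here
        del buf[:total]
    return fpdus
```

   Feeding this example a 12-octet FPDU split across two deliveries
   yields nothing on the first call and the complete ULPDU on the
   second.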

A.5  Receiver implementation

   Transport & Network Layer Reassembly Buffers:

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent.  When
   MPA is enabled, reassembly buffers are needed if out of order packets
   arrive and Markers are not enabled.  Buffers are also needed if FPDU
   Alignment is lost or if IP fragmentation occurs.  This is because the
   incoming out of order segment may not contain enough information for
   MPA to process all of the FPDU.  For cases where a re-segmenting
   middle box is present, or where the TCP sender is not optimized, the
   presence of Markers significantly reduces the amount of buffering
   needed.

   Recovery from IP Fragmentation must be transparent to the MPA
   Consumers.

A.5.1  Network Layer Reassembly Buffers

   The MPA/TCP implementation should set the IP Don't Fragment bit at
   the IP layer.  Thus upon a path MTU change, intermediate devices drop
   the IP datagram if it is too large and reply with an ICMP message
   which tells the source TCP that the path MTU has changed.  This
   causes TCP to emit segments conformant with the new path MTU size.
   Thus IP fragments under most conditions should never occur at the
   receiver.  But it is possible.

   There are several options for implementation of network layer
   reassembly buffers:

   1.  drop any IP fragments, and reply with an ICMP message according
       to [RFC792] (fragmentation needed and DF set) to tell the Remote
       Peer to resize its TCP segment.

   2.  support an IP reassembly buffer, but have it of limited size
       (possibly the same size as the local link's MTU).  The end Node
       would normally never advertise a path MTU larger than the local
       link MTU.  It is recommended that a dropped IP fragment cause an
       ICMP message to be generated according to RFC792.

   3.  multiple IP reassembly buffers, of effectively unlimited size.

   4.  support an IP reassembly buffer for the largest IP datagram (64
       KB).

   5.  support for a large IP reassembly buffer which could span
       multiple IP datagrams.

   An implementation should support at least 2 or 3 above, to avoid
   dropping packets that have traversed the entire fabric.

   There is no end-to-end ACK for IP reassembly buffers, so there is no
   flow control on the buffer.  The only end-to-end ACK is a TCP ACK,
   which can only occur when a complete IP datagram is delivered to TCP.
   Because of this, under worst case, pathological scenarios, the
   largest IP reassembly buffer is the TCP receive window (to buffer
   multiple IP datagrams that have all been fragmented).

   Note that if the Remote Peer does not implement re-segmentation of
   the data stream upon receiving the ICMP reply updating the path MTU,
   it is possible to halt forward progress because the opposite peer
   would continue to retransmit using a transport segment size that is
   too large.  This deadlock scenario is no different than if the fabric
   MTU (not last hop MTU) was reduced after connection setup, and the
   remote Node's behavior is not compliant with [RFC1122].

A.5.2  TCP segment Reassembly buffers

   A TCP reassembly buffer is also needed.  TCP reassembly buffers are
   needed if FPDU Alignment is lost when using TCP with MPA or when the
   MPA FPDU spans multiple TCP segments.  Buffers are also needed if
   Markers are disabled and out of order packets arrive.

   Since lost FPDU Alignment often means that FPDUs are incomplete, an
   MPA on TCP implementation must have a reassembly buffer large enough
   to recover an FPDU that is less than or equal to the MTU of the
   locally attached link (this should be the largest possible advertised
   TCP path MTU).  If the MTU is smaller than 140 octets, a buffer of at
   least 140 octets long is needed to support the minimum FPDU size.
   The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad,
   2 of ULPDU_Length, 4 of CRC, and space for a possible Marker.  As
   usual, additional buffering is likely to provide better performance.

   Note that if the TCP segments were not stored, it is possible to
   deadlock the MPA algorithm.  If the path MTU is reduced, FPDU
   Alignment requires the source TCP to re-segment the data stream to
   the new path MTU.  The source MPA will detect this condition and
   reduce the MPA segment size, but any FPDUs already posted to the
   source TCP will be re-segmented and lose FPDU Alignment.  If the
   destination does not support a TCP reassembly buffer, these segments
   can never be successfully transmitted and the protocol deadlocks.

   When a complete FPDU is received, processing continues normally.
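
   The arithmetic behind the 140-octet minimum buffer is easy to check;
   the following is a worked example only, restating the components
   listed above, not an additional protocol requirement.

```python
# Worked arithmetic for the 140-octet minimum reassembly buffer; the
# constants restate the components listed in the text above.
MIN_MULPDU   = 128  # minimum MULPDU
ULPDU_LENGTH = 2    # ULPDU_Length field octets
PAD          = 2    # pad octets
CRC          = 4    # CRC32c octets
MARKER       = 4    # space for one possible Marker
MIN_BUFFER = MIN_MULPDU + ULPDU_LENGTH + PAD + CRC + MARKER
```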

B  Analysis of MPA over TCP Operations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix is an analysis of MPA on TCP and why it is useful to
   integrate MPA with TCP (with modifications to typical TCP
   implementations) to reduce overall system buffering and overhead.

   One of MPA's high level goals is to provide enough information, when
   combined with the Direct Data Placement Protocol [DDP], to enable
   out-of-order placement of DDP payload into the final Upper Layer
   Protocol (ULP) buffer.  Note that DDP separates the act of placing
   data into a ULP buffer from that of notifying the ULP that the ULP
   buffer is available for use.  In DDP terminology, the former is
   defined as "Placement", and the latter is defined as "Delivery".  MPA
   supports in-order Delivery of the data to the ULP, including support
   for Direct Data Placement in the final ULP buffer location when TCP
   segments arrive out-of-order.  Effectively, the goal is to use the
   pre-posted ULP buffers as the TCP receive buffer, where the
   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
   DDP) is done in place, in the ULP buffer, with no data copies.

   This Appendix walks through the advantages and disadvantages of the
   TCP sender modifications proposed by MPA:

   1) that MPA prefers that the TCP sender do Header Alignment, where a
      TCP segment should begin with an MPA Framing Protocol Data Unit
      (FPDU) (if there is payload present).

   2) that there be an integral number of FPDUs in a TCP segment (under
      conditions where the Path MTU is not changing).

   This Appendix concludes that the scaling advantages of FPDU Alignment
   are strong, based primarily on fairly drastic TCP receive buffer
   reduction requirements and simplified receive handling.  The analysis
   also shows that there is little effect on TCP wire behavior.

B.1  Assumptions

B.1.1  MPA is layered beneath DDP [DDP]

   MPA is an adaptation layer between DDP and TCP.  DDP requires
   preservation of DDP segment boundaries and a CRC32C digest covering
   the DDP header and data.  MPA adds these features to the TCP stream
   so that DDP over TCP has the same basic properties as DDP over SCTP.

B.1.2  MPA preserves DDP message framing

   MPA was designed as a framing layer specifically for DDP and was not
   intended as a general-purpose framing layer for any other ULP using
   TCP.

   A framing layer allows ULPs using it to receive indications from the
   transport layer only when complete ULPDUs are present.  As a framing
   layer, MPA is not aware of the content of the DDP PDU, only that it
   has received and, if necessary, reassembled a complete PDU for
   Delivery to the DDP.

B.1.3  The size of the ULPDU passed to MPA is less than EMSS under
       normal conditions

   To make reception of a complete DDP PDU on every received segment
   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
   the underlying fabric.  Each FPDU that MPA creates contains
   sufficient information for the receiver to directly place the ULP
   payload in the correct location in the correct receive buffer.

   Edge cases when this condition does not occur are dealt with, but do
   not need to be on the fast path.

B.1.4  Out-of-order placement but NO out-of-order Delivery

   DDP receives complete DDP PDUs from MPA.  Each DDP PDU contains the
   information necessary to place its ULP payload directly in the
   correct location in host memory.

   Because each DDP segment is self-describing, it is possible for DDP
   segments received out of order to have their ULP payload placed
   immediately in the ULP receive buffer.

   Data delivery to the ULP is guaranteed to be in the order the data
   was sent.  DDP only indicates data delivery to the ULP after TCP has
   acknowledged the complete byte stream.

12.2 Receiver implementation

   Transport & Network Layer Reassembly Buffers:

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent.  When
   MPA is enabled, reassembly buffers are needed if out of order packets
   arrive and Markers are not enabled.  Buffers are also needed if FPDU
   Alignment is lost or if IP fragmentation occurs.  This is because in the
   incoming out of order segment may not contain enough information for
   MPA to process all of the FPDU.  For cases where a re-segmenting
   middle box is present, or where data
   was sent.  DDP only indicates data delivery to the ULP after TCP sender is not optimized, the
   presence of Markers significantly reduces has
   acknowledged the amount complete byte stream.
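
   The distinction above between out-of-order placement and in-order
   Delivery can be sketched as follows.  This is a non-normative
   illustration; the flat offset-keyed dictionary is an illustrative
   stand-in for DDP's placement machinery, not part of the protocol.

```python
# Non-normative sketch: payload may be placed out of order, but
# Delivery is indicated only once the byte stream is contiguous
# (everything up to the message end has arrived).

placed = {}                       # message offset -> payload

def place(offset, payload):       # placement: may run out of order
    placed[offset] = payload

def deliverable(message_length):  # Delivery: requires contiguity
    have = 0
    while have in placed:
        have += len(placed[have])
    return have >= message_length

place(5, b"world")                # arrives first; placed immediately
place(0, b"hello")                # fills the gap
print(deliverable(10))            # True
```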

B.2  The Value of FPDU Alignment

   Significant receiver optimizations can be achieved when Header
   Alignment and complete FPDUs are the common case.  The optimizations
   allow utilizing significantly fewer buffers on the receiver and less
   computation per FPDU.  The net effect is the ability to build a
   "flow-through" receiver that enables TCP-based solutions to scale to
   10G and beyond in an economical way.  The optimizations are
   especially relevant to hardware implementations of receivers that
   process multiple protocol layers - Data Link Layer (e.g., Ethernet),
   Network and Transport Layer (e.g., TCP/IP), and some ULP on top
   of TCP (e.g., MPA/DDP).  As network speed increases, there is an
   increasing desire to use a hardware based receiver in order to
   achieve an efficient high performance solution.

   A TCP receiver, under worst case conditions, has to allocate buffers
   (BufferSizeTCP) whose capacities are a function of the bandwidth-
   delay product.  Thus:

       BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].

   Where bandwidth is the end-to-end bandwidth of the connection, delay
   is the round trip delay of the connection, and K is an implementation
   dependent constant.
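
   As non-normative arithmetic for the formula above, with K = 1
   assumed (K is implementation dependent) and example bandwidth and
   delay figures chosen for illustration only:

```python
# Non-normative arithmetic for BufferSizeTCP, with K = 1 assumed.

def buffer_size_tcp(bandwidth_octets_per_sec, delay_sec, k=1):
    # BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds]
    return k * bandwidth_octets_per_sec * delay_sec

# 10 Gbit/s end-to-end bandwidth with a 1 ms round-trip delay:
print(buffer_size_tcp(10e9 / 8, 1e-3))   # 1250000.0 octets (~1.25 MB)
```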

   Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
   buffers for a 10x increase in end-to-end bandwidth).  As this
   buffering approach may scale poorly for hardware or software
   implementations alike, several approaches allow reduction in the
   amount of buffering required for high-speed TCP communication.

   The MPA/DDP approach is to enable the ULP's buffer to be used as the
   TCP receive buffer.  If the application pre-posts a sufficient amount
   of buffering, and each TCP segment has sufficient information to
   place the payload into the right application buffer, when an out-of-
   order TCP segment arrives it could potentially be placed directly in
   the ULP buffer.  However, placement can only be done when a complete
   FPDU with the placement information is available to the receiver, and
   the FPDU contents contain enough information to place the data into
   the correct ULP buffer (e.g., there is a DDP header available).

   For the case when the FPDU is not aligned with the TCP segment, it
   may take, on average, 2 TCP segments to assemble one FPDU.
   Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
   Non-Aligned FPDU) octets:

       BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS

   Where K1 and K2 are implementation dependent constants and EMSS is
   the effective maximum segment size.

   For example, a 1 Gbps link with 10,000 connections and an EMSS of
   1500B would require 15 MB of memory.  Often the number of connections
   used scales with the network speed, aggravating the situation for
   higher speeds.

   FPDU Alignment would allow the receiver to allocate BufferSizeAF
   (Buffer Size, Aligned FPDU) octets:

       BufferSizeAF = K2 * EMSS

   for the same conditions.  A FPDU Aligned receiver may require memory
   in the range of ~100s of KB - which is feasible for an on-chip memory
   and enables a "flow-through" design, in which the data flows through
   the NIC and is placed directly in the destination buffer.  Assuming
   most of the connections support FPDU Alignment, the receiver buffers
   no longer scale with number of connections.
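
   The two sizing formulas above can be compared numerically using the
   example figures from the text.  This is non-normative; K1 = K2 = 1
   is assumed, since both constants are implementation dependent.

```python
# Non-normative comparison of BufferSizeNAF and BufferSizeAF for the
# example above; K1 = K2 = 1 assumed (implementation dependent).

EMSS = 1500          # octets
CONNECTIONS = 10_000

def buffer_size_naf(emss, connections, k1=1, k2=1):
    # BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS
    return k1 * emss * connections + k2 * emss

def buffer_size_af(emss, k2=1):
    # BufferSizeAF = K2 * EMSS
    return k2 * emss

print(buffer_size_naf(EMSS, CONNECTIONS))   # 15001500 octets (~15 MB)
print(buffer_size_af(EMSS))                 # 1500 octets
```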

   Additional optimizations can be achieved in a balanced I/O sub-system
   -- where the system interface of the network controller provides
   ample bandwidth as compared with the network bandwidth.  For almost
   twenty years this has been the case and the trend is expected to
   continue - while Ethernet speeds have scaled by 1000 (from 10
   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
   PCI-X DDR).  Under these conditions, the FPDU Alignment approach
   allows BufferSizeAF to be indifferent to network speed.  It is
   primarily a function of the local processing time for a given frame.
   Thus when the FPDU Alignment approach is used, receive buffering is
   expected to scale gracefully (i.e. less than linear scaling) as
   network speed is increased.

B.2.1  Impact of lack of FPDU Alignment on the receiver computational
       load and complexity

   The receiver must perform IP and TCP processing, and then perform
   FPDU CRC checks, before it can trust the FPDU header placement
   information.  For simplicity of the description, the assumption is
   that a FPDU is carried in no more than 2 TCP segments.  In reality,
   with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
   segments (e.g., if the PMTU was reduced).

   ----++-----------------------------++-----------------------++-----
   +---||---------------+    +--------||--------+   +----------||----+
   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
   +---||---------------+    +--------||--------+   +----------||----+
   ----++-----------------------------++-----------------------++-----
                   FPDU #N-1                  FPDU #N

       Figure 12: Non-aligned FPDU freely placed in TCP octet stream

   The receiver algorithm for processing TCP segments (e.g., TCP
   segment #X in Figure 12) carrying non-aligned FPDUs (in-order or
   out-of-order) includes:

      1.  Data Link Layer processing (whole frame) - typically
          including a CRC calculation.

      2.  Network Layer processing (assuming not an IP fragment, the
          whole Data Link Layer frame contains one IP datagram.  IP
          fragments should be reassembled in a local buffer.  This is
          not a performance optimization goal)

      3.  Transport Layer processing -- TCP protocol processing, header
          and checksum checks.

          a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
              IP DST, TCP SRC Port, TCP DST Port, protocol)

      4.  Find FPDU message boundaries.

          a.  Get MPA state information for the connection

              If the TCP segment is in-order, use the receiver managed
                  MPA state information to calculate where the previous
                  FPDU message (#N-1) ends in the current TCP segment X.
                  (previously, when the receiver processed the first
                  part of FPDU #N-1, it calculated the number of bytes
                  remaining to complete FPDU #N-1 by using the MPA
                  Length field).

                  Get the stored partial CRC for FPDU #N-1

                  Complete CRC calculation for FPDU #N-1 data (first
                      portion of TCP segment #X)

                  Check CRC calculation for FPDU #N-1

                  If no FPDU CRC errors, placement is allowed

                  Locate the local buffer for the first portion of
                      FPDU#N-1, CopyData(local buffer of first portion
                      of FPDU #N-1, host buffer address, length)

                  Compute host buffer address for second portion of FPDU
                      #N-1

                  CopyData (local buffer of second portion of FPDU #N-1,
                      host buffer address for second portion, length)

                  Calculate the octet offset into the TCP segment for
                      the next FPDU #N.

                  Start Calculation of CRC for available data for FPDU
                      #N

                  Store partial CRC results for FPDU #N

                  Store local buffer address of first portion of FPDU #N

                  No further action is possible on FPDU #N, before it is
                      completely received

              If TCP is out-of-order, the receiver must buffer the data
                  until at least one complete FPDU is received.
                  Typically buffering for more than one TCP segment per
                  connection is required.  Use the MPA based Markers to
                  calculate where FPDU boundaries are.

                  When a complete FPDU is available, a similar procedure
                      to the in-order algorithm above is used.  There is
                      additional complexity, though, because when the
                      missing segment arrives, this TCP segment must be
                      run through the CRC engine after the CRC is
                      calculated for the missing segment.

   If we assume FPDU Alignment, the MPA Initiator.  The columns show following diagram and the state algorithm
   below apply.  Note that when using MPA, the receiver is assumed to
   actively detect presence or loss of FPDU Alignment for every TCP
   segment received.

      +--------------------------+      +--------------------------+
   +--|--------------------------+   +--|--------------------------+
   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
   +--|--------------------------+   +--|--------------------------+
      +--------------------------+      +--------------------------+
                FPDU #N                          FPDU #N+1

         Figure 13: Aligned FPDU placed immediately after TCP header

   The receiver algorithm for FPDU Aligned frames (in-order or out-of-
   order) includes:

       1)  Data Link Layer processing (whole frame) - typically
           including a CRC calculation.

       2)  Network Layer processing (assuming not an IP fragment, the
           whole Data Link Layer frame contains one IP datagram.  IP
           fragments should be reassembled in a local buffer.  This is
           not a performance optimization goal)

       3)  Transport Layer processing -- TCP protocol processing, header
           and checksum checks.

           a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
               IP DST, TCP SRC Port, TCP DST Port, protocol)

       4)  Check for Header Alignment.  (Described in detail in Section
           6.)  Assuming Header Alignment for the rest of the algorithm
           below.

           a.  If the header is not aligned, see the algorithm defined
               in the prior section.

       5)  If TCP is in-order or out-of-order, the MPA header is at the
           beginning of the current TCP payload.  Get the FPDU length
           from the FPDU header.

       6)  Calculate CRC over FPDU

       7)  Check CRC calculation for FPDU #N

       8)  If no FPDU CRC errors, placement is allowed

       9)  CopyData(TCP segment #X, host buffer address, length)

       10) Loop to #5 until all the FPDUs in the TCP segment are
           consumed in order to handle FPDU packing.
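
   Steps 5) through 10) above can be sketched as follows.  This is
   non-normative; the FPDU layout assumed here is a 2-octet
   ULPDU_Length, the ULPDU, pad to a 4-octet boundary, and a 4-octet
   CRC, with Markers ignored and binascii.crc32 standing in for MPA's
   CRC32c.

```python
# Non-normative sketch of consuming packed FPDUs from one TCP payload.
# Assumed layout per FPDU: 2-octet ULPDU_Length, ULPDU, pad to a
# 4-octet boundary, 4-octet CRC (Markers ignored for illustration).

import binascii
import struct

def build_fpdu(ulpdu):
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)              # pad to 4-octet multiple
    return body + struct.pack("!I", binascii.crc32(body))

def split_fpdus(payload):
    ulpdus, off = [], 0
    while off < len(payload):
        (length,) = struct.unpack_from("!H", payload, off)  # step 5
        body_len = 2 + length + (-(2 + length) % 4)
        body = payload[off:off + body_len]
        (crc,) = struct.unpack_from("!I", payload, off + body_len)
        assert crc == binascii.crc32(body)                  # steps 6-8
        ulpdus.append(body[2:2 + length])                   # step 9
        off += body_len + 4                                 # step 10
    return ulpdus

segment = build_fpdu(b"first") + build_fpdu(b"second")
print(split_fpdus(segment))   # [b'first', b'second']
```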

   Implementation note: In both cases the receiver has to classify the
   incoming TCP segment and associate it with one of the flows it
   maintains.  In the case of no FPDU Alignment, the receiver is forced
   to classify incoming traffic before it can calculate the FPDU CRC.
   In the case of FPDU Alignment the operations order is left to the
   implementer.

   The FPDU Aligned receiver algorithm is significantly simpler.  There
   is no need to locally buffer portions of FPDUs.  Accessing state
   information is also substantially simplified - the normal case does
   not require retrieving information to find out where a FPDU starts
   and ends or retrieval of a partial CRC before the CRC calculation can
   commence.  This avoids adding internal latencies, having multiple
   data passes through the CRC machine, or scheduling multiple commands
   for moving the data to the host buffer.

   The aligned FPDU approach is useful for in-order and out-of-order
   reception.  The receiver can use the same mechanisms for data storage
   in both cases, and only needs to account for when all the TCP
   segments have arrived to enable Delivery.  The Header Alignment,
   along with the high probability that at least one complete FPDU is
   found with every TCP segment, allows the receiver to perform data
   placement for out-of-order TCP segments with no need for intermediate
   buffering.  Essentially the TCP receive buffer has been eliminated
   and TCP reassembly is done in place within the ULP buffer.

   In case FPDU Alignment is not found, the receiver should follow the
   algorithm for non aligned FPDU reception which may be slower and less
   efficient.

B.2.2  FPDU Alignment effects on TCP wire protocol

   In an optimized MPA/TCP implementation, TCP exposes its EMSS to MPA.
   MPA uses the EMSS to calculate its MULPDU, which it then exposes to
   DDP, its ULP.  DDP uses the MULPDU to segment its payload so that
   each FPDU sent by MPA fits completely into one TCP segment.  This
   has no impact on wire protocol and exposing this information is
   already supported on many TCP implementations, including all modern
   flavors of BSD networking, through the TCP_MAXSEG socket option.
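
   As a non-normative illustration of the socket option mentioned
   above, the snippet below reads TCP_MAXSEG from an unconnected
   socket.  The value reported is stack dependent (many stacks report
   their default MSS before a connection is established).

```python
# Non-normative illustration of reading the MSS via TCP_MAXSEG.
# The value is stack dependent and only meaningful per connection.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
s.close()
print(mss)
```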

   In the common case, the ULP (i.e. DDP over MPA) messages provided to
   the TCP layer are segmented to MULPDU size.  It is assumed that the
   ULP message size is bounded by MULPDU, such that a single ULP message
   can be encapsulated in a single TCP segment.  Therefore, in the
   common case, there is no increase in the number of TCP segments
   emitted.  For smaller ULP messages, the sender can also apply
   packing, i.e. the sender packs as many complete FPDUs as possible
   into one TCP segment.  The requirement to always have a complete FPDU
   may increase the number of TCP segments emitted.  Typically, a ULP
   message size varies from few bytes to multiple EMSS (e.g., 64
   Kbytes).  In some cases the ULP may post more than one message at a
   time for transmission, giving the sender an opportunity for packing.
   In the case where more than one FPDU is available for transmission
   and the FPDUs are encapsulated into a TCP segment and there is no
   room in the TCP segment to include the next complete FPDU, another
   TCP segment is sent.  In this corner case some of the TCP segments
   are not full size.  In the worst case scenario, the ULP may choose a
   FPDU size that is EMSS/2 +1 and has multiple messages available for
   transmission.  For this poor choice of FPDU size, the average TCP
   segment size is therefore about 1/2 of the EMSS and the number of TCP
   segments emitted is approaching 2x of what is possible without the
   requirement to encapsulate an integer number of complete FPDUs in
   every TCP segment.  This is a dynamic situation that only lasts for
   the duration where the sender ULP has multiple non-optimal messages
   for transmission and this causes a minor impact on the wire
   utilization.
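
   The worst-case arithmetic above can be checked directly.  This is
   non-normative; the burst of 100 queued messages is an assumed
   example figure.

```python
# Non-normative check of the EMSS/2 + 1 worst case: only one complete
# FPDU fits per segment, so the segment count roughly doubles.

EMSS = 1500
FPDU = EMSS // 2 + 1      # 751 octets; two such FPDUs exceed the EMSS
MESSAGES = 100            # assumed burst of queued messages

ideal_segments = -(-MESSAGES * FPDU // EMSS)   # ceiling division: 51
worst_segments = MESSAGES                      # one FPDU per segment

print(worst_segments / ideal_segments)         # ~1.96, approaching 2x
```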

   However, it is not expected that requiring FPDU Alignment will have a
   measurable impact on wire behavior of most applications.  Throughput
   applications with large I/Os are expected to take full advantage of
   the EMSS.  Another class of applications with many small outstanding
   buffers (as compared to EMSS) is expected to use packing when
   applicable.  Transaction oriented applications are also optimal.

   TCP retransmission is another area that can affect sender behavior.
   TCP supports retransmission of the exact, originally transmitted
   segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
   window" and [RFC1122] section 4.2.2.15).  In the unlikely event that
   part of the original segment has been received and acknowledged by
   the remote peer (e.g., a re-segmenting middle box, as documented in
   Appendix A.4, Re-segmenting Middle boxes and non optimized MPA/TCP
   senders on page 50), a better available bandwidth utilization may be
   possible by re-transmitting only the missing octets.  If an optimized
   MPA/TCP retransmits complete FPDUs, there may be some marginal
   bandwidth loss.

   Another area where a change in the TCP segment number may have impact
   is that of Slow Start and Congestion Avoidance.  Slow-start
   exponential increase is measured in segments per second, as the
   algorithm focuses on the overhead per segment at the source for
   congestion that eventually results in dropped segments.  Slow-start
   exponential bandwidth growth for optimized MPA/TCP is similar to any
   TCP implementation.  Congestion Avoidance allows for a linear growth
   in available bandwidth when recovering after a packet drop.  Similar
   to the analysis for slow-start, optimized MPA/TCP doesn't change the
   behavior of the algorithm.  Therefore the average size of the segment
   versus EMSS is not a major factor in the assessment of the bandwidth
   growth for a sender.  Both Slow Start and Congestion Avoidance for an
   optimized MPA/TCP will behave similarly to any TCP sender and allow
   an optimized MPA/TCP to enjoy the theoretical performance limits of
   the algorithms.

   In summary, the ULP messages generated at the sender (e.g., the
   number of messages grouped for every transmission request) and the
   message size distribution have the most significant impact on the
   number of TCP segments emitted.  The worst case effect for certain
   ULPs (with an average message size of EMSS/2+1 to EMSS) is bounded by
   an increase of up to 2x in the number of TCP segments and
   acknowledgements.  In reality the effect is expected to be marginal.
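
   The 2x bound above can be checked with simple arithmetic: with FPDU
   Alignment each ULP message of S octets occupies ceil(S / EMSS)
   segments of its own, whereas a packed TCP byte stream would use
   ceil(total / EMSS) segments.  The following sketch is illustrative
   only; the EMSS value and message sizes are arbitrary assumptions,
   not taken from this specification.

   ```python
   import math

   def aligned_segments(msg_sizes, emss):
       # With FPDU Alignment each FPDU starts on a segment boundary,
       # so every message occupies its own ceil(size / EMSS) segments.
       return sum(math.ceil(s / emss) for s in msg_sizes)

   def packed_segments(msg_sizes, emss):
       # A plain TCP byte stream packs messages back to back.
       return math.ceil(sum(msg_sizes) / emss)

   emss = 1460                     # assumed Ethernet-derived EMSS
   worst = [emss // 2 + 1] * 100   # messages of EMSS/2 + 1 octets
   print(aligned_segments(worst, emss))   # 100: one segment per message
   print(packed_segments(worst, emss))    # 51: nearly half as many
   ```

   The ratio approaches 2x exactly when every message is one octet
   larger than EMSS/2, which is the worst case cited above.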

Appendix C.  IETF Implementation Interoperability with RDMA Consortium
             Protocols

   This appendix is for information only and is NOT part of the
   standard.

   This appendix covers methods of making MPA implementations
   interoperate with both IETF and RDMA Consortium versions of the
   protocols.

   The RDMA Consortium created early specifications of the MPA/DDP/RDMA
   protocols and some manufacturers created implementations of those
   protocols before the IETF versions were finalized.  These protocols
   are very similar to the IETF versions, making it possible for
   implementations to be created or modified to support either set of
   specifications.

   For those interested, the RDMA Consortium protocol documents
   (draft-culley-iwarp-mpa-v1.0.pdf, draft-shah-iwarp-ddp-v1.0.pdf, and
   draft-recio-iwarp-rdmac-v1.0.pdf) can be obtained at
   http://www.rdmaconsortium.org.

   In this section, implementations of MPA/DDP/RDMA that conform to the
   RDMAC specifications are called RDMAC RNICs.  Implementations of
   MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.

   Without the exchange of MPA Request/Reply Frames, there is no
   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
   RNICs.  Even if a ULP uses a well-known port to start an IETF RNIC
   immediately in RDMA mode (i.e., without exchanging the MPA
   Request/Reply messages), there is no reason to believe an IETF RNIC
   will interoperate with an RDMAC RNIC because of the differences in
   the version number in the DDP and RDMAP headers on the wire.

   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
   implement MPA Request/Reply Frames on behalf of the RNIC in order to
   negotiate the connection parameters.  The following section describes
   the results following the exchange of the MPA Request/Reply Frames
   before the conversion from streaming to RDMA mode.

C.1  Negotiated Parameters

   Three types of RNICs are considered:

   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
       has a ULP or other supporting entity that exchanges the MPA
       Request/Reply Frames in streaming mode before the conversion to
       RDMA mode.

   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
       which is not capable of implementing the RDMAC protocols.  Such
       an RNIC can only interoperate with other IETF RNICs.

   Permissive IETF RNIC - an RNIC implementing the IETF protocols which
       is capable of implementing the RDMAC protocols on a per
       connection basis.

   The Permissive IETF RNIC is recommended for those implementers that
   want maximum interoperability with other RNIC implementations.

   The values used by these three RNIC types for the MPA, DDP, and RDMAP
   versions as well as MPA Markers and CRC are summarized in Figure 14.

    +----------------++-----------+-----------+-----------+-----------+
    | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
    |                ||  Version  | Revision  |  Markers  |    CRC    |
    +----------------++-----------+-----------+-----------+-----------+
    +----------------++-----------+-----------+-----------+-----------+
    | RDMAC          ||     0     |     0     |     1     |     1     |
    |                ||           |           |           |           |
    +----------------++-----------+-----------+-----------+-----------+
    | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
    | Non-permissive ||           |           |           |           |
    +----------------++-----------+-----------+-----------+-----------+
    | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
    | permissive     ||           |           |           |           |
    +----------------++-----------+-----------+-----------+-----------+
           Figure 14.  Connection Parameters for the RNIC Types.
            For MPA Markers and MPA CRC, enabled=1, disabled=0.

   It is assumed there is no mixing of versions allowed between MPA, DDP
   and RDMAP.  The RNIC either generates the RDMAC protocols on the wire
   (version is zero) or the IETF protocols (version is one).

   During the exchange of the MPA Request/Reply Frames, each peer
   provides its MPA Revision, Marker preference (M: 0=disabled,
   1=enabled), and CRC preference.  The MPA Revision provided in the MPA
   Request Frame and the MPA Reply Frame may differ.

   From the information in the MPA Request/Reply Frames, each side sets
   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
   well as the state of the Markers for each half connection.  The DDP
   and RDMAP versions MUST be identical in the two directions.

   In the following sections, the figures do not discuss CRC negotiation
   because there is no interoperability issue for CRCs.  Since the
   RDMAC RNIC will always request CRC use, according to the IETF MPA
   specification both peers MUST generate and check CRCs.
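
   The negotiation outcomes tabulated in Figures 15 through 17 can be
   summarized as a small decision procedure.  The sketch below is
   illustrative, not normative; the type names and the return shape are
   this sketch's own conventions, not part of the protocol.

   ```python
   def negotiate(init_type, init_m, resp_type, resp_m):
       """Sketch of the MPA negotiation outcomes in Figures 15-17.

       RNIC types: 'rdmac' (RDMAC RNIC: Rev=0, always requests Markers
       and CRC), 'nonperm' (Non-permissive IETF RNIC: Rev=1 only), and
       'perm' (Permissive IETF RNIC: can fall back to Rev=0).  init_m
       and resp_m are each side's Marker request for its own receive
       direction.  Returns (version, init_rx_markers, resp_rx_markers)
       or 'close'.
       """
       if 'rdmac' in (init_type, resp_type):
           peer = resp_type if init_type == 'rdmac' else init_type
           if peer == 'nonperm':
               return 'close'   # incompatible Rev fields (Figure 15)
           # A Permissive IETF RNIC adjusts to the RDMAC peer: version
           # zero with Markers enabled in both directions (Figure 16).
           return (0, 1, 1)
       # Two IETF RNICs: version one; each side's Marker request
       # governs its own receive direction (Figure 17).
       return (1, init_m, resp_m)
   ```

   For example, negotiate('nonperm', 1, 'perm', 0) yields (1, 1, 0),
   matching the V=1, M=1/0 cell of Figure 17.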

C.2  RDMAC RNIC and Non-permissive IETF RNIC

   Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
   with an RDMAC RNIC, despite the fact that both peers exchange MPA
   Request/Reply Frames.  For a Non-permissive IETF RNIC, the MPA
   negotiation has no effect on the DDP/RDMAP version and it is unable
   to interoperate with the RDMAC RNIC.

   The rows in the figure show the state of the Marker field in the MPA
   Request Frame sent by the MPA Initiator.  The columns show the state
   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
   Each type of RNIC is shown as an Initiator and a Responder.  The
   connection results are shown in the lower right corner, at the
   intersection of the different RNIC types, where V=0 is the RDMAC
   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
   Markers are disabled and M=1 means MPA Markers are enabled.  The
   negotiated Marker state is shown as X/Y, for the receive direction of
   the Initiator/Responder.

          +---------------------------++-----------------------+
          |   MPA                     ||          MPA          |
          | CONNECT                   ||       Responder       |
          |   MODE  +-----------------++-------+---------------+
          |         |   RNIC          || RDMAC |     IETF      |
          |         |   TYPE          ||       | Non-permissive|
          |         |          +------++-------+-------+-------+
          |         |          |MARKER|| M=1   | M=0   |  M=1  |
          +---------+----------+------++-------+-------+-------+
          +---------+----------+------++-------+-------+-------+
          |         |   RDMAC  | M=1  || V=0   | close | close |
          |         |          |      || M=1/1 |       |       |
          |         +----------+------++-------+-------+-------+
          |   MPA   |          | M=0  || close | V=1   | V=1   |
          |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
          |         |Non-perms.+------++-------+-------+-------+
          |         |          | M=1  || close | V=1   | V=1   |
          |         |          |      ||       | M=1/0 | M=1/1 |
          +---------+----------+------++-------+-------+-------+
   Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
                                IETF RNIC.

C.2.1  RDMAC RNIC Initiator

   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
   Frame with Rev field set to zero and the M and C bits set to one.
   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
   version number it uses for DDP and RDMAP, it would send an MPA Reply
   Frame with the Rev field equal to one and then gracefully close the
   connection.

C.2.2  Non-Permissive IETF RNIC Initiator

   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
   Request Frame with Rev field equal to one.  The ULP or supporting
   entity for the RDMAC RNIC responds with an MPA Reply Frame that has
   the Rev field equal to zero and the M bit set to one.  The Non-
   permissive IETF RNIC will gracefully close the connection after it
   reads the incompatible Rev field in the MPA Reply Frame.

C.2.3  RDMAC RNIC and Permissive IETF RNIC

   Figure 16 shows that a Permissive IETF RNIC can interoperate with an
   RDMAC RNIC regardless of the Permissive RNIC's Marker preference.
   The figure uses the same format as Figure 15.

          +---------------------------++-----------------------+
          |   MPA                     ||          MPA          |
          | CONNECT                   ||       Responder       |
          |   MODE  +-----------------++-------+---------------+
          |         |   RNIC          || RDMAC |     IETF      |
          |         |   TYPE          ||       |  Permissive   |
          |         |          +------++-------+-------+-------+
          |         |          |MARKER|| M=1   | M=0   | M=1   |
          +---------+----------+------++-------+-------+-------+
          +---------+----------+------++-------+-------+-------+
          |         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
          |         |          |      || M=1/1 |       | M=1/1 |
          |         +----------+------++-------+-------+-------+
          |   MPA   |          | M=0  || V=0   | V=1   | V=1   |
          |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
          |         |Permissive+------++-------+-------+-------+
          |         |          | M=1  || V=0   | V=1   | V=1   |
          |         |          |      || M=1/1 | M=1/0 | M=1/1 |
          +---------+----------+------++-------+-------+-------+
     Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
                                IETF RNIC.

   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
   Rev field of the MPA Req/Rep Frames and then adjust its receive
   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.  As
   a result, as an MPA Responder, the Permissive IETF RNIC will never
   return an MPA Reply Frame with the M bit set to zero.  This case is
   shown as a not applicable (N/A) in Figure 16.

C.2.4  RDMAC RNIC Initiator

   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
   entity prepares an MPA Request message and sets the revision to zero
   and the M bit and C bit to one.

   The Permissive IETF Responder receives the MPA Request message and
   checks the revision field.  Since it is capable of generating RDMAC
   DDP/RDMAP headers, it sends an MPA Reply message with revision set to
   zero and the M and C bits set to one.  The Responder must inform its
   ULP that it is generating version zero DDP/RDMAP messages.

C.2.5  Permissive IETF RNIC Initiator

   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
   Request Frame setting the Rev field to one.  Regardless of the value
   of the M bit in the MPA Request Frame, the ULP or other supporting
   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
   equal to zero and the M bit set to one.

   When the Initiator reads the Rev field of the MPA Reply Frame and
   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
   should generate version zero DDP/RDMAP messages and enable MPA
   Markers and CRC.
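
   The initiator-side behavior described in C.2.2 and C.2.5 reduces to
   a check of the Rev field in the MPA Reply Frame.  This sketch is
   illustrative; the function name and return values are hypothetical.

   ```python
   def handle_mpa_reply(permissive, reply_rev):
       """Initiator's handling of the Rev field in an MPA Reply Frame.

       An IETF RNIC sends its MPA Request Frame with Rev=1; this
       decides what to do when the Reply arrives.
       """
       if reply_rev == 1:
           return ('rdma_mode', 1)   # IETF peer: DDP/RDMAP version one
       if permissive:
           # Rev=0 identifies an RDMAC peer: inform the ULP to generate
           # version zero DDP/RDMAP messages and enable Markers and CRC.
           return ('rdma_mode', 0)
       # A Non-permissive IETF RNIC gracefully closes the connection
       # on an incompatible Rev field.
       return ('close', None)
   ```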

C.3  Non-Permissive IETF RNIC and Permissive IETF RNIC

   For completeness, Figure 17 below shows the results of MPA
   negotiation between a Non-permissive IETF RNIC and a Permissive IETF
   RNIC.  The important point from this figure is that an IETF RNIC
   cannot detect whether its peer is a Permissive or Non-permissive
   RNIC.

      +---------------------------++-------------------------------+
      |   MPA                     ||              MPA              |
      | CONNECT                   ||            Responder          |
      |   MODE  +-----------------++---------------+---------------+
      |         |   RNIC          ||     IETF      |     IETF      |
      |         |   TYPE          || Non-permissive|  Permissive   |
      |         |          +------++-------+-------+-------+-------+
      |         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
      +---------+----------+------++-------+-------+-------+-------+
      +---------+----------+------++-------+-------+-------+-------+
      |         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
      |         |Non-perms.+------++-------+-------+-------+-------+
      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
      |   MPA   +----------+------++-------+-------+-------+-------+
      |Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
      |         |Permissive+------++-------+-------+-------+-------+
      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
      +---------+----------+------++-------+-------+-------+-------+
    Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
                           Permissive IETF RNIC.

Normative References

   [iSCSI] Satran, J., "Internet Small Computer Systems Interface
       (iSCSI)", RFC 3720, April 2004.

   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
       November 1990.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
       Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
       Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2401] Kent, S., Atkinson, R., "Security Architecture for the
       Internet Protocol", RFC 2401, November 1998.

   [RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
       IP", RFC3723, April 2004.

   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
       Program Protocol Specification", RFC 793, September 1981.

   [RDMASEC]  Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP
       Security", draft-ietf-rddp-security-09.txt (work in progress),
        May 2006.

Informative References

   [APPL] Bestler, C., "Applicability of Remote Direct Memory Access
       Protocol (RDMA) and Direct Data Placement (DDP)", draft-ietf-
       rddp-applicability-08.txt (Work in progress), June 2006.

   [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
       disagree", ACM Sigcomm, Sept. 2000.

   [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
       Library) and uDAPL (User Direct Access Programming Library)",
       http://www.datcollaborative.org.

   [DDP] H. Shah et al., "Direct Data Placement over Reliable
       Transports", draft-ietf-rddp-ddp-06.txt (Work in progress), May
       2006.

   [iSER] Mike Ko et al., "iSCSI Extensions for RDMA Specification",
       draft-ietf-ips-iser-05.txt (Work in progress), October 2005.

   [IT-API] The Open Group, "Interconnect Transport API (IT-API)"
       Version 2.1, http://www.opengroup.org.

   [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
       Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
       bindings-02.txt, July 2004.

   [RDMAP] R. Recio et al., "RDMA Protocol Specification",
       draft-ietf-rddp-rdmap-06.txt, May 2006.

   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
       September 1981

   [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
       896, January 1984.

   [RFC1122] Braden, R.T., "Requirements for Internet Hosts -
       Communication Layers", RFC 1122, October 1989.

   [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
       RFC 2960, October 2000.

   [RFC4296] Bailey, S., Talpey, T., "The Architecture of Direct Data
       Placement (DDP) and Remote Direct Memory Access (RDMA) on
       Internet Protocols", RFC 4296, December 2005

   [RFC4297] Romanow, A., et al., "Remote Direct Memory Access (RDMA)
       over IP Problem Statement", RFC 4297, December 2005

   [RFC4301] Kent, S., Seo, K., "Security Architecture for the Internet
       Protocol", RFC 4301, December 2005

   [VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
       draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf April 2003,
       http://www.rdmaconsortium.org.

Authors' Addresses

   Stephen Bailey
       Sandburst Corporation
       600 Federal Street
       Andover, MA  01810 USA
       Phone: +1 978 689 1614
       Email: steph@sandburst.com

   Paul R. Culley
       Hewlett-Packard Company
       20555 SH 249
       Houston, Tx. USA 77070-2698
       Phone:  281-514-5543
       Email:  paul.culley@hp.com

   Uri Elzur
       Broadcom
       16215 Alton Parkway
        Irvine, CA 92618
       Phone: 949.585.6432
       Email:  uri@broadcom.com

   Renato J Recio
       IBM
       Internal Zip 9043
       11400 Burnett Road
       Austin,  Texas  78759
       Phone:  512-838-3685
       Email:  recio@us.ibm.com

   John Carrier
       Cray Inc.
       411 First Avenue S, Suite 600
       Seattle, WA 98104-2860
       Phone: 206-701-2090
       Email: carrier@cray.com

Acknowledgments

   Dwight Barron
       Hewlett-Packard Company
       20555 SH 249
       Houston, Tx. USA 77070-2698
       Phone: 281-514-2769
       Email: dwight.barron@hp.com

   Jeff Chase
       Department of Computer Science
       Duke University
       Durham, NC 27708-0129 USA
       Phone: +1 919 660 6559
       Email: chase@cs.duke.edu

   Ted Compton
       EMC Corporation
       Research Triangle Park, NC 27709, USA
       Phone: 919-248-6075
       Email: compton_ted@emc.com

   Dave Garcia
       Hewlett-Packard Company
       19333 Vallco Parkway
       Cupertino, Ca. USA 95014
       Phone: 408.285.6116
       Email: dave.garcia@hp.com

   Hari Ghadia
       Adaptec, Inc.
       691 S. Milpitas Blvd.,
       Milpitas, CA 95035  USA
       Phone: +1 (408) 957-5608
       Email: hari_ghadia@adaptec.com

   Howard C. Herbert
       Intel Corporation
       MS CH7-404
       5000 West Chandler Blvd.
       Chandler, Arizona 85226
       Phone: 480-554-3116
       Email: howard.c.herbert@intel.com
   Jeff Hilland
       Hewlett-Packard Company
       20555 SH 249
       Houston, Tx. USA 77070-2698
       Phone: 281-514-9489
       Email: jeff.hilland@hp.com

   Mike Ko
       IBM
       650 Harry Rd.
       San Jose, CA 95120
       Phone: (408) 927-2085
       Email: mako@us.ibm.com

   Mike Krause
       Hewlett-Packard Corporation, 43LN
       19410 Homestead Road
       Cupertino, CA 95014 USA
       Phone: +1 (408) 447-3191
       Email: krause@cup.hp.com

   Dave Minturn
       Intel Corporation
       MS JF1-210
       5200 North East Elam Young Parkway
       Hillsboro, Oregon  97124
       Phone: 503-712-4106
       Email: dave.b.minturn@intel.com

   Jim Pinkerton
       Microsoft, Inc.
       One Microsoft Way
       Redmond, WA, USA 98052
       Email: jpink@microsoft.com

   Hemal Shah
        Broadcom Corporation
        16215 Alton Parkway
       Irvine, California 92619-7013 USA
       Phone: +1 949 926-6941
       Email: hemal@broadcom.com

   Allyn Romanow
       Cisco Systems
       170 W Tasman Drive
       San Jose, CA 95134 USA
       Phone: +1 408 525 8836
       Email: allyn@cisco.com
   Tom Talpey
       Network Appliance
       375 Totten Pond Road
       Waltham, MA 02451 USA
       Phone: +1 (781) 768-5329
        Email: thomas.talpey@netapp.com

   Patricia Thaler
       Broadcom
       16215 Alton Parkway
       Irvine, CA 92618
       Phone: 916 570 2707
        Email: pthaler@broadcom.com

   Jim Wendt
       Hewlett Packard Corporation
       8000 Foothills Boulevard MS 5668
       Roseville, CA 95747-5668 USA
       Phone: +1 916 785 5198
       Email: jim_wendt@hp.com

   Jim Williams
       Emulex Corporation
       580 Main Street
       Bolton, MA 01740 USA
       Phone: +1 978 779 7224
       Email: jim.williams@emulex.com

Full Copyright Statement

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.