Inter-Domain Multicast Routing (IDMR)                    A. J. Ballardie
INTERNET-DRAFT                                 University College London
                                                      S. Reeve & N. Jain
                                                      Bay Networks, Inc.
                                                          September 1996


                   Core Based Trees (CBT) Multicast

                     -- Protocol Specification --

                   <draft-ietf-idmr-cbt-spec-05.txt>

Status of this Memo

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas, and
its Working Groups. Note that other groups may also distribute working
documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress."

Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet
Draft.

Abstract

This document describes the Core Based Tree (CBT) network layer
multicast protocol. CBT is a next-generation multicast protocol that
makes use of a shared delivery tree rather than the separate per-sender
trees utilized by most other multicast schemes [1, 2, 3].

The CBT architecture is described in [4a].

This specification includes an optimization whereby unencapsulated
(native) IP-style multicasts are forwarded by CBT routers, resulting in
very good forwarding performance. This mode of operation is called CBT
"native mode". Native mode can only be used in CBT-only domains
(footnote 1).

This revision contains two appendices; Appendix A describes simple CBT
add-on mechanisms for dynamically migrating a CBT tree to one whose
core is directly attached to a source's subnetwork, thereby allowing
CBT to emulate shortest-path trees. Appendix B describes a group state
aggregation scheme.

This document is progressing through the IDMR working group of the
IETF. The CBT architecture is described in an accompanying document:
ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-arch-03.txt. Other
related documents include [4, 5]. For all IDMR-related documents, see
http://www.cs.ucl.ac.uk/ietf/idmr.

NOTE that core placement and management is not discussed in this
document.

_________________________
1 The term "domain" should be considered synonymous with "routing
domain" throughout, as are the terms "region" and "cloud".

1. Changes since Previous Revision (05)

This note summarizes the changes to this document since the previous
revision (revision 05).

o  inclusion of "first hop router" and "primary core" fields in the
   CBT mode data packet header.

o  removal of the term "non-core" router, replaced by "on-tree"
   router.

o  removal of the term "default DR (D-DR)", replaced simply by DR.

o  inclusion of T and S bits in the CBT control and data packet
   headers (type of service, and security, respectively).

o  CBT control messages are now carried directly over IP rather than
   UDP (for all implementations).

o  inclusion of an Appendix (A) describing extensions to the CBT
   protocol to achieve dynamic source-migration of core routers for
   shortest-path tree emulation.

o  inclusion of an Appendix (B) describing a group state aggregation
   scheme.

o  editorial changes and some re-organisation throughout for extra
   clarity.

2. Some Terminology

In CBT, the core routers for a particular group are categorised into
PRIMARY CORE, and NON-PRIMARY (secondary) CORES.

The "core tree" is the part of a tree linking all core routers of a
particular group together.

On-tree routers are those with a forwarding database entry for the
corresponding group.

3. Protocol Specification

3.1. Tree Joining Process -- Overview

A CBT router is notified of a local host's desire to join a group via
IGMP [6]. We refer to a CBT router with directly attached hosts as a
"leaf CBT router", or just "leaf" router.

The following CBT control messages come into play subsequent to a
subnet's CBT leaf router receiving an IGMP membership report (also
termed "IGMP join"):

o  JOIN_REQUEST

o  JOIN_ACK

If the CBT leaf router is the subnet's designated router (see next
section), it generates a CBT join-request in response to receiving an
IGMP group membership report from a directly connected host. The CBT
join is sent to the next-hop on the unicast path to a target core,
specified in the join packet; a router elects a "target core" based on
a static configuration. If, on receipt of an IGMP-join, the
locally-elected DR has already joined the corresponding tree, then it
need do nothing more with respect to joining.

The join is processed by each such hop on the path to the core, until
either the join reaches the target core itself, or hits a router that
is already part of the corresponding distribution tree (as identified
by the group address). In both cases, the router concerned terminates
the join, and responds with a join-ack (join acknowledgement), which
traverses the reverse-path of the corresponding join. This is possible
due to the transient path state created by a join traversing a CBT
router. The ack fixes that state.

3.2. DR Election

Multiple CBT routers may be connected to a multi-access subnetwork.
In such cases it is necessary to elect a subnetwork designated router
(DR) that is responsible for generating and sending CBT joins upstream,
on behalf of hosts on the subnetwork.

CBT DR election happens "on the back" of IGMP [6]; on a subnet with
multiple multicast routers, an IGMP "querier" is elected as part of
IGMP. At start-up, a multicast router assumes no other multicast
routers are present on its subnetwork, and so begins by believing it is
the subnet's IGMP querier. It sends a small number of
IGMP-HOST-MEMBERSHIP-QUERYs in short succession in order to quickly
learn about any group memberships on the subnet. If other multicast
routers are present on the same subnet, they will receive these IGMP
queries; a multicast router yields querier duty as soon as it hears an
IGMP query from a lower-addressed router on the same subnetwork.

The CBT DR is always the subnet's IGMP querier (footnote 2). As a
result, there is no protocol overhead whatsoever associated with
electing a CBT DR.

_________________________
2 Or lowest addressed CBT router if the subnet's IGMP querier is
non-CBT capable.

3.3. Tree Joining Process -- Details

The receipt of an IGMP group membership report by a CBT DR for a CBT
group not previously heard from triggers the tree joining process; the
DR unicasts a JOIN-REQUEST to the first hop on the (unicast) path to
the target core specified in the CBT join packet.

Each CBT-capable router traversed on the path between the sending DR
and the core processes the join. However, if a join hits a CBT router
that is already on-tree, the join is not propagated further, but
acknowledged downstream from that point.

JOIN-REQUESTs carry the identity of all the cores associated with the
group.
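
As an illustration only, the per-hop decision just described might be sketched as below. The names `Action` and `handle_join` are hypothetical and not part of the protocol; the sketch deliberately ignores the pending-join state, retransmission, and the secondary-core behaviour covered in the remainder of this section.

```python
from enum import Enum


class Action(Enum):
    """What a router does with an arriving JOIN_REQUEST (illustrative names)."""
    ACK = "terminate the join and send a JOIN_ACK downstream"
    FORWARD = "cache transient state and forward towards the target core"


def handle_join(router, group, target_core, on_tree_groups):
    """Decide how one CBT router treats a JOIN_REQUEST for `group`.

    on_tree_groups is the set of groups for which this router already holds
    a forwarding database entry, i.e. for which it is on-tree.
    """
    # An on-tree router, or the target core itself, terminates the join.
    if group in on_tree_groups or router == target_core:
        return Action.ACK
    # Any other CBT router keeps transient state and passes the join on.
    return Action.FORWARD
```

With the figure 1 routers, `handle_join("R3", "G", "R4", set())` yields `Action.FORWARD`, while the same call made once R3 is on-tree for G yields `Action.ACK`.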

Assuming there are no on-tree routers in between, once the join
(subcode ACTIVE_JOIN) reaches the target core, if the target core is
not the primary core (as indicated in a separate field of the join
packet) it first acknowledges the received join by means of a JOIN-ACK,
then sends a JOIN-REQUEST, subcode REJOIN-ACTIVE, to the primary core
router.

If the rejoin-active reaches the primary core, it responds by sending a
JOIN-ACK, subcode PRIMARY-REJOIN-ACK, which traverses the reverse-path
of the join (rejoin). The primary-rejoin-ack serves to confirm no loop
is present, and so explicit loop detection is not necessary.

If some other on-tree router is encountered before the rejoin-active
reaches the primary, that router responds with a JOIN-ACK, subcode
NORMAL. On receipt of the ack, subcode normal, the router sends a join,
subcode REJOIN-NACTIVE, which acts as a loop detection packet (see
section 4.3). Note that loop detection is not necessary subsequent to
receiving a join-ack with subcode PRIMARY-REJOIN-ACK.

To facilitate detailed protocol description, we use a sample topology,
illustrated in Figure 1 (shown over). Member hosts are shown as
individual capital letters, routers are prefixed with R, and subnets
are prefixed with S.

[Figure 1. Example Network Topology: ASCII diagram showing member
hosts A-J, routers R1-R10 and R12, and subnets S1-S10 and S12-S15;
the layout is not recoverable here.]

                Figure 1. Example Network Topology

Taking the example topology in figure 1, host A wishes to join group G.
All subnets' routers have been configured to use core routers R4
(primary core) and R9 (secondary core) for a range of group addresses,
including G.

Router R1 receives an IGMP host membership report, and proceeds to
unicast a JOIN-REQUEST, subcode ACTIVE-JOIN, to the next-hop on the
path to R4 (R3), the target core. R3 receives the join, caches the
necessary group information (transient state), and forwards it to R4
-- the target of the join.

R4, being the target of the join, sends a JOIN_ACK (subcode NORMAL)
back out of the receiving interface to the previous-hop sender of the
join, R3. A JOIN-ACK, like a JOIN-REQUEST, is processed hop-by-hop by
each router on the reverse-path of the corresponding join. The receipt
of a join-ack establishes the receiving router on the corresponding CBT
tree, i.e. the router becomes part of a branch on the delivery tree.
Finally, R3 sends a join-ack to R1. A new CBT branch has been created,
attaching subnet S1 to the CBT delivery tree for the corresponding
group.
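
The walk-through above can be mimicked with a small simulation. This is a minimal sketch under stated assumptions, not the protocol's data structures: the function `send_join`, the `tree` dictionary layout, and the single-group simplification are all hypothetical, and secondary-core rejoins, pending-join caching and retransmission are left out.

```python
def send_join(path, tree):
    """Simulate one JOIN_REQUEST / JOIN_ACK exchange for a single group.

    path: non-empty list of routers from the originating DR to the target
          core, in unicast next-hop order.
    tree: dict mapping each on-tree router to
          {"parent": router or None, "children": set of routers};
          it is updated in place.
    Returns the router that terminated (acknowledged) the join.
    """
    transient = {}  # router -> (downstream hop, upstream hop), set up by the join
    for i, router in enumerate(path):
        prev_hop = path[i - 1] if i > 0 else None
        # The join stops at the first on-tree router, or at the target core.
        if router in tree or i == len(path) - 1:
            terminator = router
            break
        transient[router] = (prev_hop, path[i + 1])

    # The terminating router records its new child and sends the JOIN_ACK.
    entry = tree.setdefault(terminator, {"parent": None, "children": set()})
    if prev_hop is not None:
        entry["children"].add(prev_hop)

    # The ack travels the reverse path and turns transient state into tree state.
    for router in reversed(list(transient)):
        downstream, upstream = transient[router]
        tree[router] = {
            "parent": upstream,
            "children": {downstream} if downstream is not None else set(),
        }
    return terminator


tree = {}
first = send_join(["R1", "R3", "R4"], tree)         # host A's join, via R1
second = send_join(["R6", "R2", "R3", "R4"], tree)  # a later join from S4, via R6
```

In this run the first join is terminated by the target core R4, and the second by R3, which is by then already on-tree, so the second join never reaches R4.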

For the period between any CBT-capable router forwarding (or
originating) a JOIN_REQUEST and receiving a JOIN_ACK the corresponding
router is not permitted to acknowledge any subsequent joins received
for the same group; rather, the router caches such joins till such time
as it has itself received a JOIN_ACK for the original join. Only then
can it acknowledge any cached joins. A router is said to be in a
"pending-join" state if it is awaiting a JOIN_ACK itself.

Note that the presence of asymmetric routes in the underlying unicast
routing does not affect the tree-building process; CBT tree branches
are symmetric by the nature in which they are built. Joins set up
transient state (incoming and outgoing interface state) in all routers
along a path to a particular core. The corresponding join-ack traverses
the reverse-path of the join as dictated by the transient state, and
not necessarily the path that underlying routing would dictate. Whilst
permanent asymmetric routes could pose a problem for CBT, transient
asymmetry is detected by the CBT protocol.

3.4. Forwarding Joins on Multi-Access Subnets

The DR election mechanism does not guarantee that the DR will be the
router that actually forwards a join off a multi-access network; the
first hop on the path to a particular core might be via another router
on the same subnetwork, which actually forwards off-subnet.

Although very much the same, let's see another example using our
example topology of figure 1 of a host joining a CBT tree for the case
where more than one CBT router exists on the host subnetwork.

B's subnet, S4, has 3 CBT routers attached. Assume also that R6 has
been elected IGMP-querier and CBT DR.

R6 (S4's DR) receives an IGMP group membership report. R6's configured
information suggests R4 as the target core for this group. R6 thus
generates a join-request for target core R4, subcode ACTIVE_JOIN.
R6's routing table says the next-hop on the path to R4 is R2, which is
on the same subnet as R6. This is irrelevant to R6, which unicasts the
join to R2. R2 unicasts it to R3, which happens to be already on-tree
for the specified group (from R1's join). R3 therefore can acknowledge
the arrived join and unicast the ack back to R2. R2 forwards it to R6,
the origin of the join-request.

If an IGMP membership report is received by a DR with a join for the
same group already pending, or if the DR is already on-tree for the
group, it takes no action.

3.5. On-Demand "Core Tree" Building

The "core tree", the part of a CBT tree linking all of its cores
together, is built on-demand. That is, the core tree is only built
subsequent to a non-primary (secondary) core receiving a join-request.
This triggers the secondary core to join the primary core; the primary
need never join anything.

Join-requests carry an ordered list of core routers (and the identity
of the primary core in its own separate field), making it possible for
the secondary cores to know where to join when they themselves receive
a join. Hence, the primary core must be uniquely identified as such
across the whole group. A secondary joins the primary subsequent to
sending an ack for the first join it receives.

3.6. Tree Teardown

There are two scenarios whereby a tree branch may be torn down:

o  During a re-configuration. If a router's best next-hop to the
   specified core is one of its existing children, then before sending
   the join it must tear down that particular downstream branch. It
   does so by sending a FLUSH_TREE message which is processed
   hop-by-hop down the branch. All routers receiving this message must
   process it and forward it to all their children.
   Routers that have received a flush message will re-establish
   themselves on the delivery tree if they have directly connected
   subnets with group presence.

o  If a CBT router has no children it periodically checks all its
   directly connected subnets for group member presence. If no member
   presence is ascertained on any of its subnets it sends a
   QUIT_REQUEST upstream to remove itself from the tree.

The receipt of a quit-request triggers the receiving parent router to
immediately query its forwarding database to establish whether there
remains any directly connected group membership, or any children, for
the said group. If not, the router itself sends a quit-request
upstream.

The following example, using the example topology of figure 1, shows
how a tree branch is gracefully torn down using a QUIT_REQUEST.

Assume group member B leaves group G on subnet S4. B issues an IGMP
HOST-MEMBERSHIP-LEAVE (relevant only to IGMPv2 and later versions)
message which is multicast to the "all-routers" group (224.0.0.2). R6,
the subnet's DR and IGMP-querier, responds with a group-specific-QUERY.
No hosts respond within the required response interval, so the DR
assumes group G traffic is no longer wanted on subnet S4.

Since R6 has no CBT children, and no other directly attached subnets
with group G presence, it immediately follows on by sending a
QUIT_REQUEST to R2, its parent on the tree for group G. R2 responds
with a QUIT-ACK, unicast to R6; R2 removes the corresponding child
information. R2 in turn sends a QUIT upstream to R3 (since it has no
other children or subnet(s) with group presence).

NOTE: immediately subsequent to sending a QUIT-REQUEST, the sender
removes the corresponding parent information, i.e. it does not wait for
the receipt of a QUIT-ACK.

R3 responds to the QUIT by unicasting a QUIT-ACK to R2.
R3 subsequently checks whether it in turn can send a quit by checking
group G presence on its directly attached subnets, and any group G
children. It has the latter (R1 is its child on the group G tree), and
so R3 cannot itself send a quit. However, the branch R3-R2-R6 has been
removed from the tree.

4. Tree Maintenance

Once a tree branch has been created, i.e. a CBT router has received a
JOIN_ACK for a JOIN_REQUEST previously sent (or forwarded), a child
router is required to monitor the status of its parent, and of the link
to its parent, at fixed intervals by means of a "keepalive" mechanism
operating between them. The "keepalive" protocol is simple, and
implemented by means of two CBT control messages: CBT_ECHO_REQUEST and
CBT_ECHO_REPLY; a child unicasts a CBT-ECHO-REQUEST to its parent,
which unicasts a CBT-ECHO-REPLY in response.

Adjacent CBT routers only need to send one keepalive representing all
children having the same parent, reachable over a particular link,
regardless of group.
This aggregation strategy is expected to conserve considerable
bandwidth on "busy" links, such as transit network, or backbone
network, links.

For any CBT router, if its parent router, or path to the parent, fails,
the child is initially responsible for re-attaching itself, and
therefore all routers subordinate to it on the same branch, to the
tree.

4.1. Router Failure

An on-tree router can detect a failure from the following two cases:

o  if the child responsible for sending keepalives across a particular
   link stops receiving CBT_ECHO_REPLY messages. In this case the child
   realises that its parent has become unreachable and must therefore
   try and re-connect to the tree for all groups represented on the
   parent/child link. For all groups sharing a common core set
   (corelist), provided those groups can be specified as a CIDR-like
   aggregate, an aggregated join can be sent representing the range of
   groups.
   Aggregated joins are made possible by the presence of a "group
   mask" field in the CBT control packet header (footnote 3). If a
   range of groups cannot be represented by a mask, then each group
   must be re-joined individually.

   CBT's re-join strategy is as follows: the rejoining router which is
   immediately subordinate to the failure sends a JOIN_REQUEST (subcode
   ACTIVE_JOIN if it has no children attached, and subcode
   ACTIVE_REJOIN if at least one child is attached) to the best
   next-hop router on the path to the elected core. If no JOIN-ACK is
   received after three retransmissions, each transmission being at
   PEND-JOIN-INTERVAL (5 secs) intervals, the next-highest priority
   core is elected from the core list, and the process repeated. If all
   cores have been tried unsuccessfully, the DR has no option but to
   give up.

o  if a parent stops receiving CBT_ECHO_REQUESTs from a child. In this
   case, if the parent has not received an expected keepalive after
   CHILD_ASSERT_EXPIRE_TIME, all children reachable across that link
   are removed from the parent's forwarding database.

_________________________
3 There are situations where it is advantageous to send a single
join-request that represents potentially many groups. One such example
is provided in [11], whereby a designated border router is required to
join all groups inside a CBT domain.

4.2. Router Re-Starts

There are two cases to consider here:

o  Core re-start. All JOIN-REQUESTs (all types) carry the identities
   (i.e. IP addresses) of each of the cores for a group. If a router is
   a core for a group, but has only recently re-started, it will not be
   aware that it is a core for any group(s). In such circumstances, a
   core only becomes aware that it is such by receiving a JOIN-REQUEST.
   Subsequent to a core learning its status in this way, if it is not
   the primary core it acknowledges the received join, then sends a
   JOIN_REQUEST (subcode ACTIVE_REJOIN) to the primary core. If the
   re-started router is the primary core, it need take no action, i.e.
   in all circumstances, the primary core simply waits to be joined by
   other routers.

o  Non-core re-start.
   In this case, the router can only join the tree again if a
   downstream router sends a JOIN_REQUEST through it, or it is elected
   DR for one of its directly attached subnets, and subsequently
   receives an IGMP membership report.

4.3. Route Loops

Routing loops are only a concern when a router with at least one child
is attempting to re-join a CBT tree. In this case the re-joining router
sends a JOIN_REQUEST (subcode ACTIVE_REJOIN) to the best next-hop on
the path to an elected core. This join is forwarded as normal until it
reaches either the specified core, another core, or another on-tree
router. If the rejoin reaches the primary core, loop detection is not
necessary because the primary never has a parent. The primary core acks
an active-rejoin by means of a JOIN-ACK, subcode PRIMARY-REJOIN-ACK.
This ack must be processed by each router on the reverse-path of the
active-rejoin; this ack creates tree state, just like a normal
join-ack.

If an active-rejoin is terminated by any router on the tree other than
the primary core, loop detection must take place, as we now describe.

If, in response to an active-rejoin, a JOIN-ACK is returned, subcode
NORMAL (as opposed to an ack with subcode PRIMARY-REJOIN-ACK), the
router receiving the ack subsequently generates a JOIN-REQUEST, subcode
NACTIVE-REJOIN (non-active rejoin). This packet serves only to detect
loops; it does not create any transient state in the routers it
traverses, other than the originating router (in case retransmissions
are necessary). Any on-tree router receiving a non-active rejoin is
required to forward it over its parent interface for the specified
group.
In this way, it will either reach the primary core, which unicasts,
directly to the sender, a join ack with subcode PRIMARY-NACTIVE-ACK (so
the sender knows no loop is present), or the sender receives the
non-active rejoin it sent, via one of its child interfaces, in which
case the rejoin obviously formed a loop.

If a loop is present, the non-active join originator immediately sends
a QUIT_REQUEST to its newly-established parent and the loop is broken.

Using figure 2 (over) to demonstrate this, suppose R3 is attempting to
re-join the tree (R1 is the core in figure 2). If R3 believes its best
next-hop to R1 is R6, R6 believes R5 is its best next-hop to R1, and R5
in turn sees R4 as its best next-hop to R1, a loop is formed. R3 begins
by sending a JOIN_REQUEST (subcode ACTIVE_REJOIN, since R4 is its
child) to R6. R6 forwards the join to R5.
R5 is on-tree for the group, so responds to the active-rejoin with a
JOIN-ACK, subcode NORMAL (the ack traverses R6 on its way to R3).

R3 now generates a JOIN-REQUEST, subcode NACTIVE-REJOIN, and forwards
this to its parent, R6. R6 forwards the non-active rejoin to R5, its
parent. R5 does similarly, as does R4. Now, the non-active rejoin has
reached R3, which originated it, so R3 concludes a loop is present on
the parent interface for the specified group. It immediately sends a
QUIT_REQUEST to R6, which in turn sends a quit unless it has already
received an ACK from R5 and has itself a child or subnets with member
presence. In that case it does not send a quit -- the loop has been
broken by R3 sending the first quit.

QUIT_REQUESTs are typically acknowledged by means of a QUIT_ACK. A
child removes its parent information immediately subsequent to sending
its first QUIT-REQUEST. The ack here serves to notify the (old) child
that it (the parent) has in fact removed its child information.
However, there might be cases where, due to failure, the parent cannot
respond. The child sends a QUIT-REQUEST a maximum of three times, at
PEND-QUIT-INTERVAL (5 sec) intervals.
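
The loop check walked through in this section can be sketched as follows. This is an illustrative model only: `rejoin_forms_loop` and the `parent_of` mapping are hypothetical names, the mapping is assumed to reflect parent state after the active-rejoin has been acknowledged, and message formats, timers and retransmissions are not modelled.

```python
def rejoin_forms_loop(originator, parent_of, primary_core):
    """Follow a non-active rejoin upstream, one parent interface at a time.

    parent_of maps each on-tree router to its parent for the group.
    Returns True if the packet comes back to its originator (a loop, so
    the originator must send a QUIT_REQUEST to its new parent), or False
    if it reaches the primary core (which would reply PRIMARY-NACTIVE-ACK).
    """
    hop = parent_of[originator]
    # A packet can visit at most len(parent_of) routers before repeating.
    for _ in range(len(parent_of)):
        if hop == primary_core:
            return False
        if hop == originator:
            return True
        hop = parent_of[hop]
    raise ValueError("rejoin neither reached the primary core nor returned")


# The looping case from the text: R3 -> R6 -> R5 -> R4 -> R3.
looped = rejoin_forms_loop(
    "R3", {"R3": "R6", "R6": "R5", "R5": "R4", "R4": "R3"}, "R1")

# A loop-free case: R3 -> R2 -> R1, where R1 is the primary core.
loop_free = rejoin_forms_loop("R3", {"R3": "R2", "R2": "R1"}, "R1")
```

Here `looped` evaluates to `True` and `loop_free` to `False`.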
               ------
               | R1 |
               ------
                 |
    ---------------------------
                 |
               ------
               | R2 |
               ------
                 |
    ---------------------------
                 |                             |
               ------                          |
               | R3 |--------------------------|
               ------                          |
                 |                             |
    ---------------------------                |
                 |                             |       ------
               ------                          |       |    |
               | R4 |                          |-------| R6 |
               ------                          |       |----|
                 |                             |
    ---------------------------                |
                 |                             |
               ------                          |
               | R5 |--------------------------|
               ------                          |
                                               |

              Figure 2: Example Loop Topology

In another scenario the rejoin travels over a loop-free path, and the
first on-tree router encountered is the primary core, R1. In figure
2, R3 sends a join, subcode REJOIN-ACTIVE, to R2, the next-hop on the
path to core R1. R2 forwards the re-join to R1, the primary core,
which returns a JOIN-ACK, subcode PRIMARY-REJOIN-ACK, over the
reverse-path of the rejoin-active. Whenever a router receives a
PRIMARY-REJOIN-ACK no loop detection is necessary.

If we assume R2 is on-tree for the corresponding group, R3 sends a
join, subcode REJOIN-ACTIVE, to R2, which replies with a join ack,
subcode NORMAL. R3 must then generate a loop detection packet (join
request, subcode REJOIN-NACTIVE) which is forwarded to its parent,
R2, which does similarly. On receipt of the rejoin-Nactive, the
primary core unicasts a join ack back directly to R3, with subcode
PRIMARY-NACTIVE-ACK. This confirms to R3 that its rejoin does not
form a loop.

5. Data Packet Loops

The CBT protocol builds a loop-free distribution tree. If all routers
that comprise a particular tree function correctly, data packets
should never traverse a tree branch more than once (footnote 4).

CBT mode data packets from a non-member sender must arrive on a tree
via an "off-tree" interface.
The CBT mode data packet's header includes an "on-tree" field, which
contains the value 0x00 until the data packet reaches an on-tree
router. The first on-tree router must convert this value to 0xff.
This value remains unchanged, and from here on the packet should
traverse only on-tree interfaces. If an encapsulated packet happens
to "wander" off-tree and back on again, an on-tree router will
receive the CBT encapsulated packet via an off-tree interface.
However, this router will recognise that the "on-tree" field of the
encapsulating CBT header is set to 0xff, and so immediately discards
the packet.

_________________________
4 The exception to this is when CBT mode is operating between CBT
routers connected to a multi-access link; a data packet may traverse
the link in native mode (if group members are present on the link),
as well as in CBT mode for sending the data between CBT routers on
the tree.

6. Data Packet Forwarding Rules

6.1. Native Mode

In native mode, when a CBT router receives a data packet, the packet
may be forwarded over outgoing tree interfaces (member subnets and
interfaces leading to outgoing on-tree neighbours) if and only if it
has been received via a valid on-tree interface (or the packet has
arrived encapsulated from a non-member, i.e. off-tree, sender).
Otherwise, the packet is discarded.

Before a packet is forwarded by a subnet's DR, provided the packet's
TTL is greater than 1, the packet's TTL is decremented.

6.2. CBT Mode

In CBT mode, routers ignore all non-locally originated native mode
multicast data packets. Locally-originated multicast data is only
processed by a subnet's DR; in this case, the DR forwards the native
multicast data packet, TTL 1, over any outgoing member subnets for
which that router is DR. Additionally, the DR encapsulates the
locally-originated multicast and forwards it, CBT mode, over all tree
interfaces, as dictated by the CBT forwarding database.

When a router, operating in CBT mode, receives a CBT-mode
encapsulated data packet, it decapsulates one copy to send, native
mode and TTL 1, over any directly attached member subnets for which
it is DR. Additionally, an encapsulated copy is forwarded over all
outgoing tree interfaces, as dictated by its CBT forwarding database.

Like the outer encapsulating IP header, the TTL value of the
encapsulating CBT header is decremented each time it is processed by
a CBT router.

An example of CBT mode forwarding is provided towards the end of the
next section.

7. CBT Mode -- Encapsulation Details

In a multi-protocol environment, whose infrastructure may include
non-multicast-capable routers, it is necessary to tunnel data packets
between CBT-capable routers. This is called "CBT mode". Data packets
are de-capsulated by CBT routers (such that they become native mode
data packets) before being forwarded over subnets with member hosts.
When multicasting (native mode) to member hosts, the TTL value of the
original IP header is set to one. CBT mode encapsulation is as
follows:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| encaps IP hdr | CBT hdr | original IP hdr | data ....|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

          Figure 3. Encapsulation for CBT mode

The TTL value of the CBT header is set by the encapsulating CBT
router directly attached to the origin of a data packet. This value
is decremented each time it is processed by a CBT router. An
encapsulated data packet is discarded when the CBT header TTL value
reaches zero.

The purpose of the (outer) encapsulating IP header is to "tunnel"
data packets between CBT-capable routers (or "islands"). The outer IP
header's TTL value is set to the "length" of the corresponding
tunnel, or MAX_TTL (255) if this is not known, or subject to change.
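The "on-tree" check of section 5 and the CBT header TTL rule above
can be sketched together as follows. This is an illustrative sketch
only: the CbtPacket class, the receive_cbt_packet function and its
return strings are hypothetical names, and the order in which the two
checks are applied is an assumption, since this document does not
specify it.

```python
from dataclasses import dataclass

OFF_TREE = 0x00
ON_TREE = 0xFF


@dataclass
class CbtPacket:
    on_tree: int   # "on-tree" field of the CBT header
    cbt_ttl: int   # TTL value carried in the CBT header


def receive_cbt_packet(pkt, arrived_on_tree_interface):
    """Return "forward" or "discard" for a CBT mode data packet
    received by an on-tree router (hypothetical helper)."""
    if not arrived_on_tree_interface:
        if pkt.on_tree == ON_TREE:
            # The packet wandered off-tree and back on again.
            return "discard"
        # First on-tree router reached: mark the packet as on-tree.
        pkt.on_tree = ON_TREE
    # The CBT header TTL is decremented by every CBT router; the
    # packet is discarded when the value reaches zero.
    pkt.cbt_ttl -= 1
    if pkt.cbt_ttl <= 0:
        return "discard"
    return "forward"
```

A packet from a non-member sender therefore enters the tree exactly
once: it arrives with the field set to 0x00, is marked 0xff by the
first on-tree router, and is dropped if it is ever again received via
an off-tree interface.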
It is worth pointing out here the distinction between subnetworks and
tree branches (especially apparent in CBT mode), although they can be
one and the same. For example, a multi-access subnetwork containing
routers and end-systems could potentially be both a CBT tree branch
and a subnetwork with group member presence. A tree branch which is
not simultaneously a subnetwork is either a "tunnel" or a
point-to-point link.

In CBT mode there are three forwarding methods used by CBT routers:

o  IP multicasting. This method sends an unaltered (unencapsulated)
   data packet across a directly-connected subnetwork with group
   member presence. Any host originating multicast data does so in
   this form.

o  CBT unicasting. This method is used for sending data packets
   encapsulated (as illustrated above) across a tunnel or
   point-to-point link; the IP destination address of the
   encapsulating IP header is a unicast address. En/de-capsulation
   takes place in CBT routers.

o  CBT multicasting. A CBT router on a multi-access link can take
   advantage of multicast in the case where multiple on-tree
   neighbours are reachable across a single physical link; the outer
   encapsulating IP header contains a multicast address as its
   destination address. The IP module of end-systems on the same link
   subscribed to the same group will discard these multicasts since
   the CBT payload type (protocol id) of the outer IP header is not
   recognizable by hosts.

CBT routers create forwarding database (db) entries whenever they
send or receive a JOIN-ACK. The forwarding database describes the
parent-child relationships on a per-group basis. A forwarding
database entry dictates over which tree interfaces, and how (unicast
or multicast), a data packet is to be sent. Note that a CBT
forwarding db is required for both CBT-mode and native-mode
multicasting.

Using our example topology in figure 1, let's assume the CBT routers
are operating in CBT mode.

Member G originates an IP multicast (native mode) packet. R8 is the
DR for subnet S10. R8 therefore sends a (native mode, TTL 1) copy
over any member subnets for which it is DR - S14 and S10 (the copy
over S10 is not sent, since the packet was originally received from
S10). The multicast packet is CBT mode encapsulated by R8, and
unicast to each of its children, R9 and R12; these children are not
reachable over the same interface, otherwise R8 could have sent a CBT
mode multicast. R9, the DR for S12, need not IP multicast (native
mode) onto S12 since there are no members present there. R9 unicasts
the packet in CBT mode to R10, which is the DR for S13 and S15. R10
decapsulates the CBT mode packet and IP multicasts (native mode, TTL
1) to each of S13 and S15.

Going upstream from R8, R8 CBT mode unicasts to R4. R4 is DR for all
its directly connected subnets and therefore IP multicasts (native
mode) the data packet onto S5, S6 and S7, all of which have member
presence. R4 unicasts, CBT mode, the packet to all outgoing children,
R3 and R7 (NOTE: R4 does not have a parent since it is the primary
core router for the group). R7 IP multicasts (native mode) onto S9.
R3 CBT mode unicasts to R1 and R2, its children. Finally, R1 IP
multicasts (native mode) onto S1 and S3, and R2 IP multicasts (native
mode) onto S4.

8. Non-Member Sending

For a multicast data packet to span beyond the scope of the
originating subnetwork at least one CBT-capable router must be
present on that subnetwork. The DR for the group on the subnetwork
must encapsulate the (native) IP-style packet and unicast it to a
core for the group (footnote 5).
The encapsulation required is shown in figure 3; CBT mode
encapsulation is necessary so the receiving CBT router can
demultiplex the packet accordingly.

If the encapsulated packet hits the tree at an on-tree router, the
packet is forwarded according to the forwarding rules of section 6.1
or 6.2, depending on whether the receiving router is operating in
native mode or CBT mode. Note that it is possible for the different
interfaces of a router to operate in different (and independent)
modes.

If the first on-tree router encountered is the target core, various
scenarios define what happens next:

o  if the target core is not the primary, and the target core has not
   yet joined the tree (because it has not yet itself received any
   join-requests), the target core simply forwards the encapsulated
   packet to the primary core; the primary core IP address is
   included in the encapsulating CBT data packet header.

o  if the target core is not the primary, but has children, the
   target core forwards the data according to the rules of section 6.

o  if the target core is the primary, the primary forwards the data
   according to the rules of section 6.2.

_________________________
5 It is assumed that CBT-capable routers discover <core, group>
mappings by means of some discovery protocol. Such a protocol is
outside the scope of this document.

9. Eliminating the Topology-Discovery Protocol in the Presence of
   Tunnels

Traditionally, multicast protocols operating within a virtual
topology, i.e. an overlay of the physical topology, have required the
assistance of a multicast topology discovery protocol, such as that
present in DVMRP [1]. However, it is possible to have a multicast
protocol operate within a virtual topology without the need for a
multicast topology discovery protocol. One way to achieve this is by
having a router configure all its tunnels to its virtual neighbours
in advance. A tunnel is identified by a local interface address and a
remote interface address.

Routing is replaced by "ranking" each such tunnel interface
associated with a particular core address; if the highest-ranked
route is unavailable (tunnel end-points are required to run a
Hello-like protocol between themselves) then the next-highest ranked
available route is selected, and so on. The exact specification of
the Hello protocol is outside the scope of this document.

CBT trees are built using the same join/join-ack mechanisms as
before, only now some branches of a delivery tree run in native mode,
whilst others (tunnels) run in CBT mode.
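The ranking rule described above amounts to choosing the first
available interface from an ordered list. A minimal sketch follows,
assuming a hypothetical select_interface helper and an is_up map that
stands in for the result of the Hello-like protocol; the interface
names and their ranking are purely illustrative.

```python
def select_interface(ranked, is_up):
    """Return the highest-ranked interface that is currently up.

    `ranked` lists the interfaces for one core address, best first
    (hypothetical layout); `is_up` maps interface name -> bool.
    Returns None when no listed interface is available.
    """
    for intf in ranked:
        if is_up.get(intf, False):
            return intf
    return None


ranked_for_core = ["#2", "#5", "#4"]  # illustrative ranking only
```

With "#2" reported down and "#5" up, the sketch returns "#5"; if
every listed interface is down it returns None, leaving the caller to
decide how to proceed.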
Underlying unicast routing dictates which interface a packet should
be forwarded over. Each interface is configured as either native mode
or CBT mode, so a packet can be encapsulated (decapsulated)
accordingly.

As an example, router R's configuration would be as follows:

     intf    type     mode     remote addr
     -----------------------------------
     #1      phys     native   -
     #2      tunnel   cbt      128.16.8.117
     #3      phys     native   -
     #4      tunnel   cbt      128.16.6.8
     #5      tunnel   cbt      128.96.41.1

     core    backup-intfs
     --------------------
     A       #5, #2
     B       #3, #5
     C       #2, #4

The CBT forwarding database needs to be slightly modified to
accommodate an extra field, "backup-intfs" (backup interfaces). The
entry in this field specifies a backup interface whenever a tunnel
interface specified in the forwarding db is down. Additional backups
(should the first-listed backup be down) are specified for each core
in the core backup table. For example, if interface (tunnel) #2 were
down, and the target core of a CBT control packet were core A, the
core backup table suggests using interface #5 as a replacement. If
interface #5 happened to be down also, then the same table recommends
interface #2 as a backup for core A.

10. CBT Packet Formats and Message Types

We distinguish between two types of CBT packet: CBT mode data
packets, and CBT control packets. CBT control packets carry a CBT
control packet header.
CBT control packets are encapsulated in IP, as illustrated below:

+++++++++++++++++++++++++++++++
| IP header | CBT control pkt |
+++++++++++++++++++++++++++++++

In CBT mode, the original data packet is encapsulated in a CBT header
and an IP header, as illustrated below:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| IP header | CBT header | original IP hdr | data .... |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The IP protocol field of the encapsulating (outer) IP header is used
to demultiplex a packet correctly; CBT has been assigned IP protocol
number 7. The CBT module then demultiplexes based on the
encapsulating CBT header's "type" field, thereby distinguishing
between CBT control packets and CBT mode data packets.

The CBT data packet header is illustrated below.

10.1. CBT Header Format (for CBT Mode data)

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |unused |     type      |  hdr length   |on-tree|unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           checksum            |    IP TTL     |    unused     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        group identifier                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        first-hop router                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         primary core                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   reserved    |  reserved |T|S|     Type      |    Length     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    .....Flow-id value.....                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    unused     |    unused     |     Type      |    Length     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    .....Security data.....                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 4. CBT Header

Each of the fields is described below:

o  vers: version number -- this release specifies version 1.

o  type: indicates CBT payload; values are defined for control
   (0x00), and data (0xff). For the value 0x00 (control), a CBT
   control header is assumed present rather than a CBT header.

o  hdr length: length of the header, for purpose of checksum
   calculation.

o  on-tree: indicates whether the packet is on-tree (0xff) or
   off-tree (0x00).

o  checksum: the 16-bit one's complement of the one's complement sum
   of the CBT header, calculated across all fields.

o  IP TTL: the TTL value of the original multicast packet, set in the
   CBT header by the DR directly attached to the origin host, and
   decremented by each CBT router visited.

o  group identifier: multicast group address.

o  first-hop router: identifies the encapsulating router directly
   attached to the origin of a multicast packet. This field is
   relevant to source-migration of a core to the source (see Appendix
   A). It is set to NULL when core migration is disabled.

o  primary core: the primary core for the group, as identified by
   "group-id". This field is necessary for the case where non-member
   senders happen to send to a secondary core, which may not yet be
   joined to the primary core. This field allows the secondary to
   know which is the primary for the group, so that the secondary can
   forward the (encapsulated) data onwards to the primary.

o  T bit: indicates the presence (1) or absence (0) of a Type of
   Service/flow-id value ("type", "length", "type of
   service/flow-id").
o  S bit: indicates the presence (1) or absence (0) of a security
   value ("type", "length", "security data").

10.2. Control Packet Header Format

The individual fields are described below.

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |unused |     type      |     code      |    # cores    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          hdr length           |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        group identifier                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          group mask                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         packet origin                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     primary core address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 target core address (core #1)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Core #2                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Core #3                            |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   reserved    |  reserved |T|S|     Type      |    Length     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    type of service/flow-id                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    unused     |    unused     |     Type      |    Length     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    .....Security data.....                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 5. CBT Control Packet Header

o  vers: version number -- this release specifies version 1.

o  type: indicates control message type (see section 10.3).

o  code: indicates subcode of control message type.

o  # cores: number of core addresses carried by this control packet.

o  hdr length: length of the header, for purpose of checksum
   calculation.
o  checksum: the 16-bit one's complement of the one's complement sum
   of the CBT control header, calculated across all fields.

o  group identifier: multicast group address.

o  group mask: mask value for aggregated CBT joins/join-acks. Zero
   for non-aggregated joins/join-acks.

o  packet origin: address of the CBT router that originated the
   control packet.

o  primary core address: the address of the primary core for the
   group.

o  target core address: desired core affiliation of control message.

o  Core #N: IP address for each of a group's cores.

o  T bit: indicates the presence (1) or absence (0) of a Type of
   Service/flow-id value ("type", "length", "type of
   service/flow-id").

o  S bit: indicates the presence (1) or absence (0) of a security
   value ("type", "length", "security data").

10.3. CBT Control Message Types

There are ten types of CBT message. All are encoded in the CBT
control header, shown in figure 5.

o  JOIN-REQUEST (type 1): generated by a router and unicast to the
   specified core address. It is processed hop-by-hop on its way to
   the specified core. Its purpose is to establish the originating
   CBT router, and all intermediate CBT routers, as part of the
   corresponding delivery tree. Note that all cores for the
   corresponding group are carried in join-requests.

o  JOIN-ACK (type 2): an acknowledgement to the above. The full list
   of core addresses is carried in a JOIN-ACK, together with the
   actual core affiliation (the join may have been terminated by an
   on-tree router on its journey to the specified core, and the
   terminating router may or may not be affiliated to the core
   specified in the original join). A JOIN-ACK traverses the reverse
   path of the corresponding JOIN-REQUEST, with each CBT router on
   the path processing the ack. It is the receipt of a JOIN-ACK that
   actually "fixes" tree state.

o  JOIN-NACK (type 3): a negative acknowledgement, indicating that
   the tree join process has not been successful.
+o QUIT-REQUEST (type 4): a request, sent from a child to a parent, to be removed as a child of that parent.

+o QUIT-ACK (type 5): acknowledgement to the above. If the parent, or the path to it, is down, no acknowledgement will be received within the timeout period. The child nevertheless removes its parent information.

+o FLUSH-TREE (type 6): a message sent from parent to all children, which traverses a complete branch. This message results in all tree interface information being removed from each router on the branch, possibly because of a re-configuration scenario.

+o CBT-ECHO-REQUEST (type 7): once a tree branch is established, this message acts as a "keepalive", and is unicast from child to parent (can be aggregated from one per group to one per link. See section 4).

+o CBT-ECHO-REPLY (type 8): positive reply to the above.

+o CBT-BR-KEEPALIVE (type 9): applicable to border routers only. See [11] for more information.

+o CBT-BR-KEEPALIVE-ACK (type 10): acknowledgement to the above.

10.3.1. CBT Control Message Subcodes

The JOIN-REQUEST has three valid subcodes:

+o ACTIVE-JOIN (code 0) - sent from a CBT router that has no children for the specified group.

+o REJOIN-ACTIVE (code 1) - sent from a CBT router that has at least one child for the specified group.

+o REJOIN-NACTIVE (code 2) - generated by a router subsequent to receiving a join ack, subcode NORMAL, in response to an active-rejoin.

A JOIN-ACK has three valid subcodes:

+o NORMAL (code 0) - sent by a core router, or on-tree router, acknowledging joins with subcodes ACTIVE-JOIN and REJOIN-ACTIVE.

+o PRIMARY-REJOIN-ACK (code 1) - sent by a primary core to acknowledge the receipt of a join-request received with subcode REJOIN-ACTIVE. This message traverses the reverse-path of the corresponding re-join, and is processed by each router on that path.
+o PRIMARY-NACTIVE-ACK (code 2) - sent by a primary core to acknowledge the receipt of a join-request received with subcode REJOIN-NACTIVE. This ack is unicast directly to the router that generated the rejoin-Nactive, i.e. the ack is not processed hop-by-hop.

11. CBT Protocol Number

CBT has been assigned IP protocol number 7. CBT control messages are carried directly over IP.

12. Default Timer Values

There are several CBT control messages which are transmitted at fixed intervals. These values, retransmission times, and timeout values, are given below. Note these are recommended default values only, and are configurable in each implementation (all times are in seconds):

+o CBT-ECHO-INTERVAL 30 (time between sending successive CBT-ECHO-REQUESTs to parent).

+o PEND-JOIN-INTERVAL 5 (retransmission time for join-request if no ack rec'd)

+o PEND-JOIN-TIMEOUT 30 (time to try joining a different core, or give up)

+o EXPIRE-PENDING-JOIN 90 (remove transient state for join that has not been ack'd)

+o PEND_QUIT_INTERVAL 5 (retransmission time for quit-request if no ack rec'd)

+o CBT-ECHO-TIMEOUT 90 (time to consider parent unreachable)

+o CHILD-ASSERT-INTERVAL 90 (increment child timeout if no ECHO rec'd from a child)

+o CHILD-ASSERT-EXPIRE-TIME 180 (time to consider child gone)

+o IFF-SCAN-INTERVAL 300 (scan all interfaces for group presence. If none, send QUIT)

+o BR-KEEPALIVE-INTERVAL 200 (backup designated BR to designated BR keepalive interval)

+o BR-KEEPALIVE-RETRY-INTERVAL 30 (keepalive interval if BR fails to respond)

13. Interoperability Issues

Interoperability between CBT and DVMRP has recently been defined in [11].

Interoperability with other multicast protocols will be fully specified as the need arises.

14. CBT Security Architecture

See [4].

Acknowledgements

Special thanks go to Paul Francis, NTT Japan, for the original brainstorming sessions that brought about this work.

Thanks too to Sue Thompson (Bellcore).
Her detailed reviews led to the identification of some subtle protocol flaws, and she suggested several simplifications.

Thanks also to the networking team at Bay Networks for their comments and suggestions, in particular Steve Ostrowski for his suggestion of using "native mode" as a router optimization, and Eric Crawley.

Thanks also to Ken Carlberg (SAIC) for reviewing the text, and generally providing constructive comments throughout.

I would also like to thank the participants of the IETF IDMR working group meetings for their general constructive comments and suggestions since the inception of CBT.

APPENDICES

DISCLAIMER: As of writing, the mechanisms described in Appendices A and B have not been tested, simulated, or demonstrated.

APPENDIX A

Dynamic Source-Migration of Cores

A.0 Abstract

This appendix describes CBT protocol mechanisms that allow a CBT multicast tree, initially constructed around a randomly-placed set of core routers, to dynamically reconfigure itself in response to an active source, such that the CBT tree becomes rooted at the source's local CBT router. Henceforth, CBT emulates a shortest-path tree.

For clarity, the mechanisms are described in the context of "flat" multicasting, but are transferable to a hierarchical model with only minor changes.

A.1 Motivation

One of the criticisms levelled against shared tree multicast schemes is that they potentially result in sub-optimal routes between receivers. Another criticism is that shared trees incur a high traffic concentration effect on the core routers. Given that any shared tree is likely to have two, three, or more cores which can be strategically placed in the network, as well as the fact that any on-tree router can act as a "branch point" (or "exploder point"), shared tree traffic concentration can be significantly reduced.
This note nevertheless addresses both of these criticisms by describing new mechanisms that

+o allow a CBT to dynamically transition from a random configuration to one where any CBT router can become a core - more precisely, that which is local to a source, and...

+o remove the traffic concentration issue completely, as a result of the above; traffic concentration is not an issue with source-rooted trees.

The mechanisms described here are relevant to non-concurrent sources; the concurrent-sender case is not addressed here, although experience with MBONE applications for the past several years suggests that most multicast applications are of the single, infrequently-changing sender type.

Also, it is not necessarily implied that the initial CBT tree must be transitioned. Any transition is an "all-or-nothing" transition, meaning that either all the tree transitions, or none of it does (footnote 6).

A.2 Goals & Requirements

By means of the mechanisms described, this Appendix sets out to achieve the following:

+o provide mechanisms that allow the dynamic transition from an initial CBT, constructed around a pre-configured set of cores, to a CBT that is rooted at a core attached to a sender's local subnetwork. This is source-rooted tree emulation.

+o ensure that these mechanisms do not impact CBT's simplicity or scalability.

+o eliminate completely the traffic concentration issue from CBT.

+o eliminate the core placement/core advertisement problems.

+o ensure that the scheme is robust, such that if a source's local router (or link to it) should fail, the CBT re-organises itself and returns to its original configuration.

+o ensure that the mechanisms provide the same to non-member senders.
The above incurs a few additional requirements on existing baseline CBT mechanisms described in this specification:

+o a new JOIN-REQUEST subcode, REVERSE-JOIN

+o a new JOIN-ACK subcode, REVERSE-ACK

+o a new JOIN-ACK subcode, CORE-MIGRATE

+o a "first-hop router" field needs to be included in the CBT data packet header.

+o a new message type:

   - SOURCE-NOTIFICATION

+o CBT-mode data encapsulation is required until the local CBT router connected to an active source receives a JOIN-REQUEST, whose "target core address" field is one of its own IP addresses.

These new additions are explained in the next section.

_________________________
6 This is the expected behaviour of PIM Sparse Mode; on receipt of high-bandwidth traffic, most receivers' local routers will be configured to transition to source trees.

A.3 Source-Tree Emulation Criteria

CBT routers are configured with a lower-bound data-rate threshold that is the expected boundary between low- and high-bandwidth data rate traffic. CBT also monitors the duration each sender sends. If this duration exceeds a pre-configured value (global across CBT), say 3 minutes, AND the data rate threshold is exceeded, the CBT tree transitions such that receivers become joined to the "core" local to the source's subnet, i.e. the CBT tree becomes source-rooted, but nevertheless remains a CBT.

A.4 Source-Migration Mechanisms

                                            E o       o D
                                               \     /
                                                \   /
   L o                                           \ /
      \                                           o C
       \                    N                    /
        \                                       /
         \A(2)                            (1)B /
          O===================================O
          |                                   |
    M     |                                   |
          |                                   |
        K o                                   o H
         /\                                  /\
        /  \                                /  \
       /    \                              /    \
  s J o      o I                        G o      o F
 ----------

Key:
   B = primary core
   A = secondary core
   s = sending host
   J = sending host's local DR
   M & N = network nodes not on original CBT tree

Figure A1: Original CBT Tree

In figure A1, host s starts sending native mode multicast data. CBT router J encapsulates it as CBT mode, inserting its own IP address in the "first-hop router" field of the CBT mode data packet header. This data packet flows over the CBT tree.
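The transition test of section A.3 can be sketched as follows. This is an illustrative sketch only, not part of the protocol specification: the function names and the data-rate threshold value are hypothetical, and only the 3 minute duration is taken from the text. It assumes the first-hop router applies the test and leaves the "first-hop router" field NULL (None here) when the criteria are not met.

```python
from typing import Optional

# Hypothetical configuration values. Only the 3 minute duration comes from
# section A.3; the rate threshold is an assumed, site-configured number.
MIN_DURATION_S = 180
RATE_THRESHOLD_BPS = 64_000


def should_migrate(duration_s: float, rate_bps: float,
                   min_duration_s: float = MIN_DURATION_S,
                   rate_threshold_bps: float = RATE_THRESHOLD_BPS) -> bool:
    """Return True only if BOTH the duration and the data-rate criteria are exceeded."""
    return duration_s > min_duration_s and rate_bps > rate_threshold_bps


def first_hop_field(router_addr: str, duration_s: float,
                    rate_bps: float) -> Optional[str]:
    """Value for the "first-hop router" field: own address, or None (NULL)."""
    if should_migrate(duration_s, rate_bps):
        return router_addr
    return None
```

Both conditions are combined with a logical AND, mirroring the "AND" in section A.3, so a long-lived low-rate sender or a short high-rate burst leaves the tree unchanged.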
Note that tree migration can be disabled either by sending all packets in native mode, or by inserting a NULL value into the "first-hop router" field. Since the first-hop router is the original encapsulating router (data packets are always originated from hosts in native mode), the first-hop router knows whether the sender's data rate warrants activating the "first-hop router" field; for the purpose of the ensuing protocol description, we assume this is the case.

Any router on the tree receiving the CBT mode data packet inspects the "first-hop router" field of the CBT header, and compiles a join-request to send to it. In order to fully specify the join, it must inspect its underlying unicast routing table(s) to find the best next-hop to the source's first hop router. That next hop will be either on or off the existing CBT tree for the group. If the next hop is off-tree, the join generated is given a subcode of ACTIVE-JOIN (as per CBT spec), and a "target core address" of the source's first hop router. The join is then forwarded and processed according to the CBT specification. The primary core, and the original core list, remain specified in their respective fields of the CBT control packet header.

Using figure A1 to illustrate an example, node L's routing tables suggest that the best next-hop to J, the source's first hop router, is via node M, not yet on the tree. So, node L generates a join and forwards it to M, which forwards it to J. The join-ack (subcode NORMAL) returns to L via M on the reverse-path of the join. When the join-ack reaches L, L sends a QUIT-REQUEST to A, its old parent. The shortest-path branch now exists, L-M-J.

If the best next hop to the source's first hop router is via an existing on-tree interface, and that interface is the node's parent on the current tree, no further action need be taken, and no join need be sent towards the source, J.
However, the join's best next hop may be via an existing child interface - this is where the new join type, subcode REVERSE-JOIN, comes in. The purpose of this join type is to simply reverse the existing parent-child relationship between two adjacent on-tree routers; each end of the link between the two routers is re-labelled. This join must be acknowledged by means of a JOIN-ACK, subcode REVERSE-ACK. A reverse-join is only ever sent from a child to its parent.

Immediately subsequent to sending a reverse-join-ACK, the sending node's old parent interface is labelled as "pending child", and a timer is set on that interface. This is a delay timer, set at a default of 5 seconds, during which time a reverse-join is expected over that interface from the node's old parent. Should this timer expire, a REVERSE-ASSERT message is sent to the old parent (new child) to cause it to agree to the change in the parent-child relationship. A REVERSE-ASSERT must be ack'd (REVERSE-ASSERT-ACK). If, after (say) three retransmissions (at 5 sec intervals) no reverse-assert-ack has been received, a QUIT-REQUEST is sent to the old parent and the corresponding interface is removed from this node's current forwarding database.

Of course, if a node has already received a reverse-join during the period one of its other interfaces was changing its parent-child relationship with another of its neighbours, then the pending-child delay timer need not be activated.

Looking at figure A1 again, here's the process of how the parent-child relationships change on the tree when an active source, s, starts sending. Of course, links E-C, I-J, and L-J do not do this because they forge completely new paths towards the source's local router, J.

K sends a reverse-join to J. J acks this with a join-ack, subcode REVERSE-ACK. At this point, J is K's parent, and I is still K's child. K now sets the pending-child delay timer on its interface to A (K's old parent), and expects a reverse-join from A.
If none arrives before the delay timer expires, and several retransmissions of a reverse-assert control message also go unanswered, K sends a quit to A (it sends a quit because A still considers K to be its child) and removes the K-A interface from its CBT forwarding database. However, assuming a reverse-join does arrive at K from A before the delay timer expires, K acks the reverse-join and cancels the delay timer on that interface.

Next, let's consider CBT router (node) I. I's unicast routing table suggests it can reach J directly (next-hop) via a different interface than the I-K interface, so I sends a join-request, subcode active-join, to J, which acks it as normal. On receipt of the ack, I sends a quit to K and removes K as its parent from its database.

Now let's consider node L. Like I, it finds a new path to J, via M, so simply sends a new join to J, via M, and on receipt of the join-ack, sends a quit to A, and removes A from its forwarding database. A new, shortest-path, branch now exists, J-M-L.

Next let's consider A-B, the link between the cores. A is the secondary, and B is the primary, so A originally joined towards B. So, B sends a reverse-join to A. A sends a reverse-ack to B, so A is now B's parent, and B has children B-H, and B-C. Note that the role of primary and secondary is not affected - the target of B's join to A is the source's local router, J.

The existing branches D-C-B, F-H-B, and G-H-B, need not change any of their parent-child relationships, since each of these nodes' unicast routing tables indicates that the best next-hop a join-request, targeted at source J, would take, is via the corresponding existing parent.

For E, it sends a new join via N to J. On receipt of the join-ack, it sends a quit to C. A new branch has been created, E-N-J.

Each node on the tree now has a shortest-path to J, the source's local CBT router. Hence, J is the root ("core") of a shortest-path multicast tree.
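The per-router decision walked through above can be summarised in a short sketch. This is illustrative only: the names (Action, migration_action) and the representation of interfaces as neighbour labels are assumptions made for the example, not part of the CBT specification.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"              # best next hop is the existing parent
    ACTIVE_JOIN = "ACTIVE-JOIN"     # best next hop is off-tree: build a new branch
    REVERSE_JOIN = "REVERSE-JOIN"   # best next hop is an existing child: reverse the link


def migration_action(next_hop, parent, children):
    """Classify the best next hop towards the source's first-hop router.

    next_hop -- neighbour given by unicast routing for the first-hop router
    parent   -- current parent neighbour on the tree (None for the primary core)
    children -- set of current child neighbours on the tree
    """
    if parent is not None and next_hop == parent:
        return Action.NONE
    if next_hop in children:
        return Action.REVERSE_JOIN
    return Action.ACTIVE_JOIN
```

Applied to figure A1, with J as the source's first-hop router: L (parent A, next hop M) and I (parent K, next hop J) get ACTIVE_JOIN, K (children J and I, next hop J) and B (children A, H and C, next hop A) get REVERSE_JOIN, and D (parent C, next hop C) gets NONE, matching the walkthrough.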
Note that these new mechanisms augment the CBT protocol, and the baseline CBT protocol engine is not affected in any way by this add-on mechanism.

A.5 Robustness Issues

Some immediate questions might be:

+o what happens to the source-rooted tree if the source's local CBT router fails?

+o what happens if the source's local CBT router fails whilst the initial tree is transitioning?

+o what happens if the tree is partitioned, or not yet fully connected, when a source starts sending?

+o how do new receivers join an already-transitioned tree?

All of these questions are now addressed:

+o What happens to the source-rooted tree if the source's local CBT router fails?

A source-rooted CBT has a single point of failure - the root of the tree.

In spite of a source being joined, the corelist (primary & secondaries) is carried in CBT control packets, as per the CBT spec. However, the contents of the "target core address" field identifies the IP address of the source's local CBT router. So, in the event of a failure, the CBT routers still have all the information they need to rejoin the original tree, constructed around the corelist. Rejoining then, proceeds according to the rules of the CBT specification.

Of course, rejoining the original tree happens only after several attempts have been made to rejoin the source's "core".

+o What happens if the source's local CBT router fails whilst the initial tree is transitioning?

This really is no different to the above case. The parts of the tree that have transitioned will rejoin the original tree according to their corresponding corelist. Those parts of the tree in the process of transitioning may temporarily transition, but eventually those nodes will receive a FLUSH from a CBT router adjacent to the failed source router ("core"). They then rejoin the original tree.

+o What happens if the tree is partitioned, or not yet fully connected, when a source starts sending?
The problem here is that some parts of the network (CBT tree) may not receive CBT encapsulated mode data packets before the source's local DR starts forwarding data in native mode, and so those receivers will not know the IP address of the local DR to join to.

For example, assume a secondary core with downstream members cannot reach the primary. If the routers adjacent to the secondaries are all functioning correctly, the secondaries themselves may not be aware that a partition has occurred somewhere further upstream. So, what if a source downstream from a secondary, starts sending data after the partition has happened?

A new control message, the SOURCE-NOTIFICATION, is used to solve this problem. As soon as any core receives CBT mode encapsulated data, it caches the source "core" IP address, and starts multicasting (to the group) SOURCE-NOTIFICATION messages, one every minute. Source-notifications contain the IP address of the source's local DR. A core continues to multicast source-notifications at 1 minute intervals until the source has ceased transmitting data for more than 20 seconds.

Obviously, if a CBT is fully connected, the larger proportion of source-notifications will be redundant. However, the robustness the scheme provides justifies this cost.

If an off-tree source begins sending data, which first hits the tree at a secondary core with no receivers attached, the secondary does not trigger a join towards the primary, but instead just unicasts the data, in CBT mode, to the primary (as per CBT spec). The primary then forwards the data over any connected tree branches. Receivers can then begin transitioning.
In this way, a transitioned CBT tree extends to the first hop router of a non-member sender.

Note that cores and on-tree routers only ever react to active sources iff they have an existing CBT forwarding database for the said group. For example, a primary core would not establish a shortest-path branch to a non-member sender unless it has at least one existing child registered for the corresponding group.

+o How do new receivers join an already-transitioned CBT?

New receivers will always attempt to join one of the cores in the corelist for a group. Two things can happen here: firstly, a new join, targeted at one of the cores in the corelist, eventually reaches that target core. Secondly, the new join hits a router already established on-tree, but the router encountered is now joined to the source tree (source "core").

For the first scenario, all on-tree routers and all core routers maintain the address of which upstream core their CBT branch actually emanates from (as per CBT spec). When a new join arrives at one of the original cores, the core checks whether its own current core affiliation is to a core outside the corelist set.
If so, that core is a source "core", so the core responds to the new join with a JOIN-ACK, subcode CORE-MIGRATE. This join-ack contains the address of the active source "core". This join-ack causes a join-request to be issued by one of the routers that receives it - the router whose path to the core (just joined) diverges from that to the source "core"; this can easily be gleaned from unicast routing. The router then simply directs its new join at the source "core", and on receipt of the join-ack, sends a quit to its now "old" parent.

For the second case, the solution is trivial; any on-tree router receiving a join targeted either at one of the original cores for the group, or the active source "core", simply acks (subcode NORMAL) the join and includes in the ack the source "core" affiliation (as per CBT spec).

A.6 Loops

It may seem that the potential for a transitioning tree to form loops, especially in the presence of reverse-joins, is greatly increased. This is probably NOT the case; "reversed branches" are those that are already part of a loop-free tree that CBT constructs around the original set of cores. Transitioned trees are just CBTs, whereby the core is simply rooted at the source. Loops are no more likely with these mechanisms than they are with baseline CBT.
Note that these are assertions - formal proofs may be more appropriate.

APPENDIX B

Group State Aggregation

B.1 Introduction

Although the scalability of shared tree multicast schemes is attractive now, to scale over the longer-term, a combination of hierarchy (support mechanisms that facilitate domain-oriented multicasting), and group aggregation strategies, is required. If IP multicast is to have a long-term future in the Internet as a global transport mechanism, by far the most serious challenge is to address the issue of group state aggregation.

Shared trees were developed partly to address scalability with regards to multicast state maintained in the network, which resulted in an improvement in that state by a factor of the number of active sources (a source being a subnetwork aggregate). However, it is perceived that the number of sources sending to any one group will not grow as fast as the number of groups, indeed the latter will probably grow at several orders of magnitude faster [12].
Therefore, it is essential to contain this potential problem, particularly for the benefit of routers on wide-area links, by designing an effective group state aggregation mechanism, capable of collapsing group state.

Unlike unicast addresses, multicast addresses cannot be aggregated according to topological locality; multicast addresses are truly location-independent. Thus, it would not seem obvious how the problem can be addressed - clearly, it must be looked at in a different way.

In order to be effective, flexibility and efficiency must be facets of group aggregation; an aggregation scheme must be able to accommodate groups with wide-ranging characteristics in the least constraining way possible. For example, consider the trend towards small, non-local groups (e.g. 4 or 5 person audio/video conferences between different user groups spread over different countries/continents); it is these types of groups that are likely to result in an explosive growth in state. Also, these groups will, in all likelihood, utilize multicast addresses that are randomly spread across the multicast address space, making aggregation seemingly more difficult. An aggregation scheme must therefore account for this.

B.2 Design Overview

This scheme involves replacing a subset of individual tree state present on inter-domain links, and aggregating it over a single shared tree. The scheme does not yet specify how candidate groups for aggregation are arrived at, but an obvious scheme would be to aggregate already-overlapping distribution trees.
The pivotal idea behind this approach encompasses two inter-dependent strategies:

+o administratively defining a portion of the multicast address space for aggregate groups. For brevity, an example might be the range 238.0.0.0 - 238.255.255.255.

+o associated with each aggregate group address is a mask, specifying the portion of the address that is used to identify the aggregate group itself (the portion covered by the mask); the remaining address space is used as an index to an ordered list of groups with which the aggregate address is associated. The ordered list and its association with a group aggregate address is conveyed by means of a protocol message (TBD). The index is used to de-aggregate at region boundaries (border routers).

The scheme subscribes to the notion of aggregation-on-demand; a border router (BR) is configured with
a threshold number of groups on a BR's external interface, above which it begins to solicit aggregations periodically, say once every hour.

As an example, say BR 123 wishes to aggregate 200 groups. BR 123 randomly chooses (or by some address allocation algorithm) a group aggregate address. It has been established that the number of groups for which aggregation is desired is 200. The smallest power of 2 not less than 200 is 256 (2^8), and so the aggregate mask covers 24 bits, leaving 8 bits to specify each individual group's traffic flowing over the aggregate tree.
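The mask arithmetic in this example can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the index is assumed to be added directly to the aggregate address, and 238.10.12.0 is used as the aggregate address because the example that follows uses it.

```python
import ipaddress


def aggregate_prefix_len(group_count: int) -> int:
    """Mask length that leaves enough index bits for group_count groups."""
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    # Bits needed for the smallest power of 2 that is >= group_count.
    index_bits = (group_count - 1).bit_length()
    return 32 - index_bits


def indexed_address(aggregate: str, prefix_len: int, index: int) -> str:
    """Address carrying traffic for the index-th listed group (assumed encoding)."""
    net = ipaddress.IPv4Network((aggregate, prefix_len))
    if not 0 <= index < net.num_addresses:
        raise ValueError("index does not fit in the unmasked bits")
    return str(net.network_address + index)


prefix = aggregate_prefix_len(200)   # 200 groups need 8 index bits
addr_30 = indexed_address("238.10.12.0", prefix, 30)
```

With 200 groups the sketch yields a 24 bit mask, and index 30 under 238.10.12.0/24 maps to 238.10.12.30. How index 0, which coincides with the aggregate address itself, should be treated is left open by the draft and is not decided here.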
So we have:

   Group aggregate address: 238.10.12.0
   Group aggregate mask: 238.10.12/24

A data packet for the 30th listed group (listed in a protocol message (TBD) as described above) would be addressed to: 238.10.12.30. Similarly, a data packet pertaining to the 150th listed group would be addressed to: 238.10.12.150, and so on.

All routers comprising the aggregate tree need only maintain the group aggregate address and mask, together with the aggregate tree's associated interfaces. If a number of individual shared trees have been replaced by an aggregate tree, then the core routers (RPs) of each of those shared trees must additionally maintain the complete list of groups associated with an <aggregate address/mask-len> so as to be able to "re-direct" any incoming joins for already aggregated groups. Similarly, border routers (BRs) incur the storage cost of maintaining the individual groups associated with an <aggregate address/mask-len>, so as to be able to aggregate and de-aggregate as data packets flow across a (sub)region's border.

B.3 Scaling Further

The scheme described can be applied recursively (to border routers) to accommodate a hierarchy containing an arbitrary number of levels.

The scheme described imposes two general requirements (or assumptions):

+o a well defined aggregate group address space for each level of hierarchy (or scope levels).

+o the ability to arbitrarily create boundaries in multicast routers, thereby separating different hierarchical levels.

The former will require consensus within the IETF and approval from the IANA.
The latter capability isadvantageous to sendalready available in multicast routers; boundaries are specified in asingle join- request that represents potentially many groups. One such examplemulticast routers configura- tion file. This capability isprovidedcurrently available in[11], wherebythe best known multicast routing protocols: DVMRP, M-OSPF, PIM, and CBT. Defining boundaries may require some degree of coordination; whenever adesignated border routerparticular scoped level (boundary) isrequired to joinintroduced which has multiple entry/exit multicast routers, these must allgroups inside a CBT domain. Such aggregated joining is only possible ifbe configured such that their boundary definitions are identical, i.e. they must eachof the groupsbe con- figured with thejoin represents shares a common corelist. Furthermore, aggregationsame boundary-address/mask (the range 239.0.0.0 - 239.255.255.255 isonly efficient over contiguous ranges of group addresses; the "group mask" field intheCBT control packet header is used to specify a CIDR-like groupIANA-defined multicast boundary addressmask. Authors' Addresses:range). Author Information: Tony Ballardie, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, ENGLAND, U.K. Tel: ++44 (0)71 419 3462 e-mail: A.Ballardie@cs.ucl.ac.uk Scott Reeve,Bay Networks, Inc. 3, Federal Street, Billerica, MA 01821, USA. Tel: ++1 508 670 8888 e-mail: sreeve@BayNetworks.comNitin Jain, Bay Networks, Inc. 3, Federal Street, Billerica, MA 01821, USA. Tel: ++1 508 670 8888 e-mail:njain@BayNetworks.com{sreeve, njain}@BayNetworks.com References [1]DVMRP. Described in "MulticastT. Pusateri. Distance Vector Multicast Routingin a Datagram Internet- work", S. Deering, PhD Thesis, 1990. Available via anonymous ftp from: gregorio.stanford.edu:vmtp/sd-thesis.ps. NOTE: DVMRP version 3 is specified as a working draft.Protocol. Working draft, June 1996. (draft-ietf-idmr-dvmrp-v3-01.{ps,txt}). [2] J. Moy. 
Multicast Routing Extensions to OSPF. Communications of the ACM,
37(8): 61-66, August 1994. Also RFC 1584, March 1994.

[3] D. Farinacci, S. Deering, D. Estrin, and V. Jacobson. Protocol
Independent Multicast (PIM) Dense-Mode Specification. Working draft,
July 1996. (draft-ietf-idmr-pim-dm-spec-02.{ps,txt}; previously
draft-ietf-idmr-pim-spec-01.ps, 1994).

[4a] A. Ballardie. Core Based Tree (CBT) Multicast Architecture.
Working draft, July 1996. (draft-ietf-idmr-cbt-arch-04.txt)

[4] A. J. Ballardie. Scalable Multicast Key Distribution; RFC 1949,
SRI Network Information Center, 1996.

[5] A. J. Ballardie. "A New Approach to Multicast Communication in a
Datagram Internetwork", PhD Thesis, 1995. Available via anonymous ftp
from: cs.ucl.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z.

[6] W. Fenner. Internet Group Management Protocol, version 2
(IGMPv2). Working draft, May 1996. (draft-idmr-igmp-v2-03.txt;
previously draft-idmr-igmp-v2-01.txt).

[7] B. Cain, S. Deering, A. Thyagarajan. Internet Group Management
Protocol Version 3 (IGMPv3) (draft-cain-igmp-00.txt).

[8] M. Handley, J. Crowcroft, I. Wakeman. Hierarchical Rendezvous
Point proposal, work in progress.
(http://www.cs.ucl.ac.uk/staff/M.Handley/hpim.ps) and
(ftp://cs.ucl.ac.uk/darpa/IDMR/IETF-DEC95/hpim-slides.ps).

[9] D. Estrin et al. USC/ISI, Work in progress.
(http://netweb.usc.edu/pim/).

[10] D. Estrin et al. PIM Sparse Mode Specification. Working draft,
July 1996. (draft-ietf-idmr-pim-sparse-spec-04.{ps,txt}; previously
draft-ietf-idmr-pim-sparse-spec-00.txt).

[11] A. Ballardie. CBT - Dense Mode Interoperability: Border Router
Specification; Working draft, July 1996. Also available from:
ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-cbt-dm-interop-XX.txt.
Previously "CBT Multicast Interoperability - Stage 1", working draft,
April 1996
(ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-cbt-dvmrp-00.txt).

[12] S. Deering. Private communication, August 1996.
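The aggregate addressing arithmetic in the BR 123 example of Appendix
B (200 groups, a 24-bit aggregate mask, the 30th listed group
addressed to 238.10.12.30) can be sketched roughly as follows. This
is only an illustration: the function names are hypothetical and not
part of the CBT specification, and it assumes that the position of a
group in the (TBD) protocol message is carried directly in the
low-order bits of the aggregate address, as the example suggests.

```python
import ipaddress


def aggregate_mask_len(num_groups: int) -> int:
    """Mask length leaving enough low-order bits to distinguish
    num_groups groups (hypothetical helper, not from the spec)."""
    if num_groups < 1:
        raise ValueError("need at least one group")
    # Smallest number of bits b such that 2**b >= num_groups.
    index_bits = (num_groups - 1).bit_length()
    return 32 - index_bits


def group_data_address(aggregate: str, mask_len: int,
                       position: int) -> ipaddress.IPv4Address:
    """Address for the group listed at `position`, assuming the
    position occupies the bits not covered by the aggregate mask."""
    net = ipaddress.IPv4Network(f"{aggregate}/{mask_len}")
    if not 0 <= position < net.num_addresses:
        raise ValueError("position does not fit in the aggregate range")
    return net.network_address + position


mask_len = aggregate_mask_len(200)
print(mask_len)                                           # 24
print(group_data_address("238.10.12.0", mask_len, 30))    # 238.10.12.30
print(group_data_address("238.10.12.0", mask_len, 150))   # 238.10.12.150
```

Rounding up to the next power of 2, rather than to the nearest one,
is what guarantees that every listed group gets a distinct address
under the aggregate.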