<draft-ietf-idmr-cbt-spec-02.txt>

Inter-Domain Multicast Routing (IDMR)                    A. J. Ballardie
INTERNET-DRAFT                                 University College London
                                                                 N. Jain
                                                      Bay Networks, Inc.
                                                                S. Reeve
                                                      Bay Networks, Inc.
                                                     November 21st, 1995

                    Core Based Trees (CBT) Multicast

                      -- Protocol Specification --

Status of this Memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. (Note that other groups may also distribute working documents as Internet Drafts).

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress."

Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft.

Abstract

This document describes the Core Based Tree (CBT) multicast protocol specification. CBT is a next-generation multicast protocol that makes use of a shared delivery tree rather than the separate per-sender trees utilized by most other multicast schemes [1, 2, 3].

This specification includes a description of an optimization whereby native IP-style multicasts are forwarded over tree branches as well as subnetworks with group member presence. This mode of operation will be called CBT "native mode" and obviates the need to encapsulate data packets before forwarding over CBT interfaces. Native mode is only relevant to CBT-only domains or "clouds". Also included are some new "data-driven" features.

A special authors' note is included explaining the primary differences between this latest specification and the previous release (June 1995).

The CBT architecture is described in an accompanying document: draft-ietf-idmr-arch-00.txt. Other related documents include [4, 5].
For all IDMR-related documents, see http://www.cs.ucl.ac.uk/ietf/idmr.

1. Authors' Note

The purpose of this note is to explain how the CBT protocol has evolved since the previous version (June 1995).

The CBT designers have constantly sought to streamline the protocol and to find new mechanisms that simplify the group initiation procedure. In particular, it has been a high priority to ensure that the group joining process is as transparent as possible for new receivers; ideally, from a user perspective, only a minimum of information should be required in order to join a CBT group -- the knowledge/input of two group parameters, group address and TTL value, is a reasonable expectation.

At the same time, we strive to keep join latency to an absolute minimum. The factor most affecting join latency in CBT is the mechanism by which each group on a LAN elects a so-called designated router (DR). This mechanism has now been redesigned; it is simpler, and keeps join latency to a minimum. This new DR election process is explained in section 2.3.

Core selection, placement, and management have prevented a simple group initiation/joining process of the kind inherent in data-driven schemes (like DVMRP); some network entity needs to elect a group's cores, and a mechanism is needed to distribute this information throughout the network so it is available to potential new receivers.

CBT separates out most aspects of core management from the protocol itself. This has been made easier by the fact that core management is not a problem unique to CBT; it also arises in PIM-Sparse Mode. Separate, protocol-independent core management mechanisms are currently being proposed/developed [8, 9].
In the absence of a core management/distribution protocol, the task could be handled manually by network management facilities.

In CBT, the core routers for a particular group are categorised into PRIMARY CORE, and NON-PRIMARY (secondary) CORES.

The core tree, the part of a tree linking all core routers together, is built on-demand. That is, the core tree is only built subsequent to a non-primary core receiving a join-request (non-primary core routers join the primary core router -- the primary need do nothing). Join-requests carry an ordered list of core routers, making it possible for the non-primary cores to know where to join.

CBT now supports the aggregation of certain types of control message on distribution trees, provided aggregation is at all possible. This depends on coordinated multicast address assignment.

Also catalytic in the simplification of the CBT protocol are the "multi-protocol support" aspects of the latest proposal of IGMP (IGMPv3 [6]), in particular, the introduction of the RP/Core-Report message (see Appendix and [6]).

The end result of these developments is that the CBT protocol is further simplified and more efficient; six message types have been eliminated from the previous version of the protocol, thereby reducing protocol overhead. Furthermore, the new DR election mechanism ensures group join latency is kept to a minimum.

Throughout this draft, we assume IGMPv3 is operating between hosts and routers on a LAN.

2. Protocol Specification

2.1. CBT Group Initiation

A group's initiator elects a small number of candidate cores (which may be advertised by "some means"). Subsequently, the core distribution engine (if available) is notified of the new group now associated with the elected cores. Subsequent network advertisements provide the <core,group> mapping information for potential new senders and/or receivers.

2.2. Tree Joining Process -- Overview

It is assumed that hosts receive <core,group> mapping advertisements via some protocol external to CBT. Given this assumption, the following steps are involved in a host joining a CBT tree:

   o  the joining host learns of the candidate cores for the group.

   o  subsequently, an IGMP RP/Core-Report is issued on the subnetwork, addressed to the corresponding multicast group. All IGMP messages are received by all operational CBT multicast routers on the subnetwork. One CBT-capable router per subnetwork is initially elected as the default LAN CBT DR (DEFAULT DR) for all groups. This election happens automatically when CBT routers are initialised. If the subnetwork has multiple CBT routers present, a (possibly different) group-specific DR (GROUP DR) may subsequently be elected. This is fully explained in section 2.3.

   o  on receiving an IGMP RP/Core-Report, the local DR takes care of establishing the subnet as part of the corresponding CBT delivery tree.

The following CBT control messages come into play during the host joining process:

   o  JOIN_REQUEST

   o  JOIN_ACK

A join-request is generated by a locally-elected DR (see next section) in response to receiving an IGMP group membership report from a directly connected host. The join is sent to the next-hop on the path to the target core, as specified in the join packet. The join is processed by each such hop on the path to the core, until either the join reaches the target core itself, or hits a router that is already part of the corresponding distribution tree (as identified by the group address). In both cases, the router concerned terminates the join, and responds with a join-ack, which traverses the reverse-path of the corresponding join. This is possible due to the transient path state created by a join traversing a CBT router. The ack simply fixes that state.

2.3. DR Election

Multiple CBT routers may be connected to a multi-access subnetwork.
In such cases it is necessary to elect a (sub)network designated router (DR) that is responsible for sending IGMP host membership queries, and for generating join-requests in response to receiving IGMP group membership reports. Such joins are forwarded upstream by the DR.

At start-up, a CBT router assumes it is the only CBT-capable router on its subnetwork. It therefore sends two or three IGMP-HOST-MEMBERSHIP-QUERYs in short succession (for robustness) in order to quickly learn about any group memberships on the subnet. If other CBT routers are present on the same subnet, they will receive these IGMP queries, and depending on which router was already the elected querier, yield querier duty to the new router iff the new router is lower-addressed. If it is not, then the newly-started CBT router will yield when it hears a query from the already established querier.

The CBT DEFAULT DR (D-DR) is always (exception, next para) the subnet's IGMP-querier; in CBT these two roles go hand-in-hand. As a result, there is no protocol overhead whatsoever associated with electing the CBT D-DR.

On multi-access LANs where different routers may be running different multicast routing protocols, there may be times when a LAN's (subnet's) elected querier is a non-CBT router. CBT routers keep track of their immediate CBT neighbouring routers, and can therefore easily establish if the source of an IGMP query is CBT-capable or not. If an elected querier is not CBT-capable, the DR is (implicitly) elected to be the lowest-addressed neighbour on the same link; if a CBT router on such a link knows of a lower-addressed neighbour on the same link, it either does not attempt to claim DR status, or relinquishes its DR status if it was previously elected DR.

2.4. Backwards Compatibility with IGMPv1 & v2 Hosts

To comply with this specification, CBT routers are expected to run IGMP version 3 [7].
However, it cannot be assumed that all hosts on a subnetwork will be running IGMPv3; there may be instances of IGMP versions 1 and/or 2.

IGMPv1 & v2 hosts will not be able to issue RP/Core Reports, available with IGMPv3. The main implication is that such hosts must inform a D-DR of <core, group> mappings by means of network management. Alternatively, hosts may implement minimal user-level code to emulate IGMPv3-specific messages, and send them as CBT auxiliary control messages to the specified group address.

NOTE: one recent core distribution proposal [8] does not require hosts to participate in core election at all. Rather, a local DR is configured to know a set of core addresses in the lowest level of a core hierarchy, and a function is used to map a group address onto a particular core in the hierarchy.

2.5. Tree Joining Process -- Details

The receipt of an IGMP group membership report by a CBT D-DR for a CBT group not previously heard from triggers the tree joining process.

Immediately subsequent to receiving an IGMP group membership report for a CBT group not previously heard from, the D-DR unicasts a JOIN-REQUEST to the first hop on the (unicast) path to the specified core. Core information is gleaned either by means of an IGMP RP/Core Report, also sent in response to an IGMP host membership query, but prior to an IGMP host membership report, or by some other means.

Each CBT-capable router traversed on the path between the sending DR and the core processes the join. However, if a join hits a CBT router that is already on-tree, the join is not propagated further, but ACK'd from that point.

JOIN-REQUESTs carry the identity of all cores for the group.
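The hop-by-hop join and acknowledgement just described can be sketched in a few lines of Python. This is an illustrative model only, not part of the specification: the Router class, its fields, and the send_join function are hypothetical names, and subcodes, timers, join caching and real packet formats are all omitted. The join creates transient state at each hop, and the ack retraces that state to fix it.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    next_hop: dict                                # core -> next-hop router (unicast routing)
    on_tree: set = field(default_factory=set)     # groups with established tree state
    pending: dict = field(default_factory=dict)   # group -> previous hop of a join in flight

def send_join(routers, origin, group, cores):
    """Send a join from `origin` toward cores[0]; return the hops the ack visits."""
    target = cores[0]                             # the join carries all cores; the first is the target here
    routers[origin].pending[group] = None         # the originator has no previous hop
    prev, cur = origin, routers[origin].next_hop[target]
    # Forward hop-by-hop until the target core or an on-tree router is reached.
    while cur != target and group not in routers[cur].on_tree:
        routers[cur].pending[group] = prev        # transient state: remember where the join came from
        prev, cur = cur, routers[cur].next_hop[target]
    routers[cur].on_tree.add(group)               # the terminating router acknowledges the join
    ack_path = [cur]
    hop = prev
    while hop is not None:                        # the ack retraces the transient state in reverse
        ack_path.append(hop)
        routers[hop].on_tree.add(group)           # receipt of the ack fixes the state
        hop = routers[hop].pending.pop(group)
    return ack_path

# Hypothetical routing tables for a join from R1 via R3 to core R4.
routers = {
    "R1": Router(next_hop={"R4": "R3"}),
    "R3": Router(next_hop={"R4": "R4"}),
    "R4": Router(next_hop={}),
}
print(send_join(routers, "R1", "G", ["R4", "R9"]))  # ['R4', 'R3', 'R1']
```

Keeping the previous hop per group is what lets the ack follow the join's own path rather than whatever the unicast routing table would choose for the return direction.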
Assuming there are no on-tree routers in between, once the join (subcode ACTIVE_JOIN) reaches the target core, if the target core is not the primary core (the first listed in the core listing, contained within the join) it first acknowledges the received join by means of a JOIN-ACK, then sends a JOIN-REQUEST, subcode REJOIN-ACTIVE, to the primary core router. Either the primary core, or the first on-tree router encountered, acknowledges the received rejoin by means of a JOIN-ACK. Any such router other than the primary core proceeds by transforming the rejoin into a REJOIN-NACTIVE for loop detection. This is described in section 6.3.

To facilitate detailed protocol description, we use a sample topology, illustrated in Figure 1 (shown over). Member hosts are shown as individual capital letters, routers are prefixed with R, and subnets are prefixed with S.

[Diagram: member hosts A-J, routers R1-R10 and R12, subnets S1-S10 and S12-S15]

Figure 1. Example Network Topology

Taking the example topology in figure 1, host A is the group initiator, and has elected core routers R4 (primary core) and R9 (secondary core) by some external protocol. The <core,group> mapping is subsequently advertised by some (possibly the same) protocol.

Host A generates an IGMP RP/Core-Report and an IGMP group membership report when the multicast application is invoked on host A. Both reports are multicast to the corresponding group address. All multicast routers receive all multicast-addressed messages by default.

The only CBT router on A's subnet (S1) is R1, which is, by default, the D-DR. Router R1 receives the RP/Core-Report and the group membership report, and proceeds to unicast a JOIN-REQUEST, subcode ACTIVE-JOIN, to the next-hop on the path to R4 (R3), the target core in the RP/Core Report. R3 receives the join, caches the necessary group information, and forwards it to R4 -- the target of the join.

R4, being the target of the join, sends a JOIN_ACK back out of the receiving interface to the previous-hop sender of the join, R3. A JOIN-ACK, like a JOIN-REQUEST, is processed hop-by-hop by each router on the reverse-path of the corresponding join. The receipt of a join-ack establishes the receiving router on the corresponding CBT tree, i.e. the router becomes part of a branch on the delivery tree. R3 in turn sends a join-ack to R1. A new CBT branch has been created, attaching subnet S1 to the CBT delivery tree for the corresponding group.

At this point, it is proposed that IGMP (v3) multicasts a notification across the subnet indicating to member hosts that the delivery tree has been joined successfully.
Such a message would greatly benefit multicast protocols requiring explicit joins [5, 10].

For the period between any CBT-capable router forwarding (or originating) a JOIN_REQUEST and receiving a JOIN_ACK, the corresponding router is not permitted to acknowledge any subsequent joins received for the same group; rather, the router caches such joins until it has itself received a JOIN_ACK for the original join. Only then can it acknowledge any cached joins. A router is said to be in a pending-join state if it is awaiting a JOIN_ACK itself.

2.6. D-DRs, G-DRs, and Proxy-acks

The DR election mechanism does not guarantee that the DR will be the router that actually forwards a join off a multi-access network; the first hop on the path to a particular core might be via another router on the same (sub)network, which actually forwards off-LAN.
It is not necessary or desirable to have a tree branch rooted anywhere other than at a router that is the interface to and from the LAN; only this router need keep group state information. The join originator (D-DR) need not, since the first hop is on the same LAN. Because of this, CBT incorporates a simple mechanism that prevents the D-DR in such scenarios from keeping group state.

If a join-ack has returned to the originating subnet of the corresponding join, but has not yet reached the originating router of the corresponding join, then the join-request's first hop is on the same subnet as the originating router (the D-DR). A router knows when it is in this situation by extracting the origin router's subnet address using its own subnet mask, then comparing the result with its own address (using address and mask of the subnet that is about to be forwarded over). If one further hop is required for the join-ack to reach the originator of the corresponding join-request, the router does not send a normal join-ack, but rather sends a JOIN-ACK with subcode PROXY-ACK.
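The mask-and-compare test described above can be illustrated with a short Python sketch. It is a hedged example rather than a definitive implementation: the function names and the addresses (taken from documentation ranges) are hypothetical, and it assumes that the origin address carried in the join is the originating router's address on the subnet where the membership report was received.

```python
import ipaddress

def same_subnet(origin_addr, own_addr, own_mask):
    """Mask the origin's address and our own address with the mask of the
    interface the ack is about to be forwarded over, then compare."""
    mask = int(ipaddress.IPv4Address(own_mask))
    origin_net = int(ipaddress.IPv4Address(origin_addr)) & mask
    own_net = int(ipaddress.IPv4Address(own_addr)) & mask
    return origin_net == own_net

def ack_subcode(origin_addr, own_addr, own_mask):
    """Choose the subcode for the JOIN-ACK about to be forwarded."""
    if same_subnet(origin_addr, own_addr, own_mask):
        return "PROXY-ACK"   # the next hop is the originator, on this very subnet
    return "JOIN-ACK"        # the ack still has off-LAN hops to travel

# Hypothetical addressing: the originating D-DR is 192.0.2.6 on 192.0.2.0/24.
print(ack_subcode("192.0.2.6", "198.51.100.3", "255.255.255.0"))  # JOIN-ACK
print(ack_subcode("192.0.2.6", "192.0.2.2", "255.255.255.0"))     # PROXY-ACK
```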
Proxy-acks, like normal join-acks, are unicast. A router receiving a proxy-ack cancels any transient state it has created for the corresponding group.

The sender of a proxy-ack becomes the group-specific DR (G-DR) for the group - a token (implicit) identity. In the normal case where there is no extra LAN hop, the receipt of a JOIN-ACK means that the D-DR becomes the G-DR for the specified group.

Control packets generated by the D-DR may continue to incur an extra hop, but data packets will not; since only the sender of the proxy-ack keeps a FIB entry for the group, it is the only router on the LAN that has an upstream forwarding entry.

Now let's see an illustration of this; a host joins a CBT group (the first to do so on the subnet), but more than one router is present on its subnet. B's subnet, S4, has 3 CBT routers attached.
Assume also that R6 has been elected IGMP-querier and CBT D-DR.

Invoking a multicast application on B causes an IGMP RP/Core-Report and an IGMP group membership report to be multicast to the corresponding group. The target core and ordered core list are contained within the RP/Core report. R6 generates a join-request for target core R4, subcode ACTIVE_JOIN. R6's routing table says the next-hop on the path to R4 is R2, which is on the same subnet as R6. This is irrelevant to R6, which unicasts the join to R2. R2 unicasts it to R3, which happens to be already on-tree for the specified group (from R1's join). R3 can therefore acknowledge the arrived join, and unicasts the join-ack back to R2. R2 realises it is not the origin of the corresponding join-request, but sees that the origin (R6) is on the same subnet as itself, namely the subnet over which the join-ack would be forwarded to reach R6. R2 unicasts the join-ack on its final hop, but sets the ack subcode to PROXY-ACK.
This results in the D-DR (R6) removing its pending join information for the specified group. Another consequence of receiving a proxy-ack is that the D-DR need not create a FIB entry for the specified group.

If an IGMP RP/Core-Report is received by a D-DR with a join for the same group already pending, it takes no action.

Note that the presence of underlying transient asymmetric routes is irrelevant to the tree-building process; CBT tree branches are symmetric by the nature in which they are built. Joins set up transient state (incoming and outgoing interface state) in all routers along a path to a particular core. The corresponding join-ack traverses the reverse-path of the join as dictated by the transient state, and not the path that underlying routing would dictate. Whilst permanent asymmetric routes could pose a problem for CBT, transient asymmetricity is detected by the CBT protocol.

2.7. Tree Teardown

There are two scenarios whereby a tree branch may be torn down:

   o  During a re-configuration. If a router's best next-hop to the specified core is one of its existing children, then before sending the join it must tear down that particular downstream branch. It does so by sending a FLUSH_TREE message which is processed hop-by-hop down the branch.
      All routers receiving this message must process it and forward it to all their children. Routers that have received a flush message will re-establish themselves on the delivery tree if they have directly connected subnets with group presence.

   o  If a CBT router has no children it periodically checks all its directly connected subnets for group member presence. If no member presence is ascertained on any of its subnets it sends a QUIT_REQUEST upstream to remove itself from the tree.

Let's see, using the example topology of figure 1, how a tree branch is gracefully torn down using a QUIT_REQUEST.

Assume group member B leaves group G on subnet S4. B issues an IGMP HOST-MEMBERSHIP-LEAVE message which is multicast to the "all-routers" group (224.0.0.2). R6, the subnet's D-DR and IGMP-querier, responds with a group-specific-QUERY. No hosts respond within the required response interval, so the D-DR assumes group G traffic is no longer wanted on subnet S4.

Since R2 has no CBT children, and no other directly attached subnets with group G presence, it immediately follows on by sending a QUIT_REQUEST to R3, its parent on the tree for group G. R3 responds by unicasting a QUIT_ACK to R2. R3 subsequently checks whether it in turn can send a quit by checking group G presence on its directly attached subnets, and any group G children. It has the latter (R1 is its child on the group G tree), and so R3 cannot itself send a quit. However, the branch R3-R2 has been removed from the tree.

3. CBT Protocol Ports

CBT routers implement user-level code for tree building, maintenance, and teardown. This results in a group-specific forwarding information base (FIB) being built in user-space. This FIB is downloaded into kernel-space for fast and efficient data packet forwarding. Any changes in FIB entries are communicated to the kernel as they occur, so that the kernel FIB always reflects the current state of any particular group's tree.
CBT primary and auxiliary control packets then travel inside UDP datagrams, as the following diagram illustrates:

      ++++++++++++++++++++++++++++++++++++++++++++
      | IP header | UDP header | CBT control pkt |
      ++++++++++++++++++++++++++++++++++++++++++++

      Figure 2. Encapsulation for CBT control messages

The following UDP port numbers are currently being used (their use at this stage is unofficial, and pending official approval):

   o  CBT Primary control messages - UDP port 7777

   o  CBT Auxiliary control messages - UDP port 7778

4. Data Packet Forwarding (native mode)

In CBT "native mode" only one forwarding method is used, namely all data packets are forwarded over CBT tree interfaces as native IP multicasts, i.e. there are no encapsulations required. This assumes that CBT is the multicast routing protocol in operation within the domain (or "cloud") in question, and that all routers within the domain of operation are CBT-capable, i.e. there are no "tunnels". If this latter constraint cannot be satisfied it is necessary to encapsulate IP-over-IP before forwarding to a child or parent reachable via non-CBT-capable router(s).
The rules for native mode forwarding are altogether simpler than those for CBT-mode forwarding (see next section); data packets are sent over child/parent interfaces as specified in the corresponding FIB entry, as native IP multicasts. This applies to point-to-point links as well as broadcast-type subnetworks such as Ethernets.

5. Data Packet Forwarding (CBT mode)

"CBT mode" as opposed to "native mode" describes the forwarding of data packets over CBT tree interfaces containing a CBT header encapsulation. For efficiency, this encapsulation is as follows:

      ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      | encaps IP hdr | CBT hdr | original IP hdr | data ....|
      ++++++++++++++++++++++++++++++++++++++++++++++++++++++++

      Figure 3. Encapsulation for CBT mode

By using the encapsulations above there is no necessity to modify a packet's original IP header until it is forwarded over subnets with group member presence in native mode. When this happens, the TTL value of the original IP header is set to one before forwarding.

The TTL value of the CBT header is set by the encapsulating CBT router directly attached to the origin of a data packet. This value is decremented each time it is processed by a CBT router. An encapsulated data packet is discarded when the CBT header TTL value reaches zero.

The purpose of the (outer) encapsulating IP header is to "tunnel" data packets between CBT-capable routers (or "islands"). The outer IP header's TTL value is set to the "length" of the corresponding tunnel, or MAX_TTL if this is not known, or subject to change.

For native mode IP multicasts, i.e. those without any extra encapsulation, the TTL value of the IP header is decremented each time the packet is received by a multicast router.
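The three TTL rules for CBT mode can be summarised in a small Python sketch. The function and constant names are hypothetical, and the value 255 for MAX_TTL is an assumption (the largest value an 8-bit IP TTL field can hold); the text above names MAX_TTL without defining it.

```python
MAX_TTL = 255            # assumed value; not defined in the text above
NATIVE_DELIVERY_TTL = 1  # original IP header TTL when delivering onto a member subnet

def outer_ip_ttl(tunnel_length=None):
    """TTL for the outer (encapsulating) IP header.

    Pass None when the tunnel length is unknown or subject to change."""
    return MAX_TTL if tunnel_length is None else tunnel_length

def process_cbt_header_ttl(cbt_ttl):
    """Decrement the CBT header TTL at a CBT router.

    Returns the new TTL, or None if the packet must be discarded."""
    cbt_ttl -= 1
    return cbt_ttl if cbt_ttl > 0 else None

print(outer_ip_ttl(3))            # 3
print(outer_ip_ttl())             # 255
print(process_cbt_header_ttl(2))  # 1
print(process_cbt_header_ttl(1))  # None
```

Note that the original IP header is left untouched while the packet is on the tree; only the CBT header TTL bounds how far the packet travels through CBT routers.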
It is worth noting here the distinction between subnetworks and tree branches, although they can be one and the same. For example, a multi-access subnetwork containing routers and end-systems could potentially be both a CBT tree branch and a subnetwork with group member presence. A tree branch which is not simultaneously a subnetwork is either a "tunnel" or a point-to-point link.

In CBT forwarding mode there are three forwarding methods used by CBT routers:

o  IP multicasting. This method is used to send a data packet across a directly-connected subnetwork with group member presence. Host changes are therefore not required for CBT. Similarly, end-systems originating multicast data do so in traditional IP-style.

o  CBT unicasting. This method is used for sending data packets encapsulated (as illustrated above) across a tunnel or point-to-point link. En/de-capsulation takes place in CBT routers.

o  CBT multicasting. This method sends data packets encapsulated (as illustrated above) but the outer encapsulating IP header contains a multicast address. This method is used when a parent or multiple children are reachable over a single physical interface, as could be the case on a multi-access Ethernet. The IP module of end-systems subscribed to the same group will discard these multicasts since the CBT payload type (protocol id) of the outer IP header is not recognizable by hosts.

CBT routers create Forwarding Information Base (FIB) entries whenever they send or receive a JOIN_ACK (with the exception of a proxy-ack, as explained in section 2.5). The FIB describes the parent-child relationships on a per-group basis. A FIB entry dictates over which tree interfaces, and how (unicast or multicast), a data packet is to be sent. Additionally, a data packet is IP multicast over any directly-connected subnetworks with group member presence.
Such interfaces are kept in a separate table relating to IGMP. A FIB entry is shown below:

   32-bits          4            4           4          4     |    4
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  group-id  | parent addr | parent vif |  No. of  |                    |
|            |    index    |   index    | children |      children      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                   |chld addr |chld vif |
                                                   |  index   |  index  |
                                                   |+-+-+-+-+-+-+-+-+-+-+
                                                   |chld addr |chld vif |
                                                   |  index   |  index  |
                                                   |+-+-+-+-+-+-+-+-+-+-+
                                                   |chld addr |chld vif |
                                                   |  index   |  index  |
                                                   |+-+-+-+-+-+-+-+-+-+-+
                                                   |                    |
                                                   |        etc.        |
                                                   |+-+-+-+-+-+-+-+-+-+-+

Figure 4. CBT FIB entry

Note that a CBT FIB is required for both CBT-mode and native-mode multicasting.

The field lengths shown above assume a maximum of 16 directly connected neighbouring routers.

When a data packet arrives at a CBT router, the following rules apply:

o  if the packet is an IP-style multicast, it is checked to see if it originated locally (i.e. if the arrival interface subnet mask bitwise ANDed with the packet's source IP address equals the arrival interface's subnet number, the packet was sourced locally). If the packet is not of local origin, it is discarded.

o  the packet is IP multicast to all directly connected subnets with group member presence. The packet is sent with an IP TTL value of 1 in this case.

o  the packet is encapsulated for CBT forwarding (see figure 3) and unicast to parent and children. However, if more than one child is reachable over the same interface the packet will be CBT multicast. Therefore, it is possible that an IP-style multicast and a CBT multicast will be forwarded over a particular subnetwork.

NOTE: the TTL value of encapsulated data packets is manipulated as described at the beginning of this section.

Using our example topology in figure 1, let's assume member G originates an IP multicast packet. R8 is the DR for subnet S10. R8 CBT unicasts the packet to each of its children, R9 and R12.
These children are not reachable over the same interface. R8, being the DR for subnets S14 and S10, also IP multicasts the packet to S14 (S10 received the IP style packet already from the originator). R9, the DR for S12, need not IP multicast onto S12 since there are no members present there. R9 CBT unicasts the packet to R10, which is the DR for S13 and S15. It IP multicasts to both S13 and S15.

Going upstream from R8, R8 CBT unicasts to R4. R4 is DR for all its directly connected subnets and therefore IP multicasts the data packet onto S5, S6 and S7, all of which have member presence. R4 unicasts the packet to all outgoing children, R3 and R7 (NOTE: R4 does not have a parent since it is the primary core router for the group). R7 IP multicasts onto S9. R3 CBT unicasts to R1 and R2, its children. Finally, R1 IP multicasts onto S1 and S3, and R2 IP multicasts onto S4.

5.1. Non-Member Sending (CBT mode)

For a multicast data packet to span beyond the scope of the originating subnetwork at least one CBT-capable router must be present on that subnetwork. The default DR (D-DR) for the group on the subnetwork must encapsulate the IP-style packet and unicast it to a core for the group. This requires CBT routers to have access to a mapping mechanism between group addresses and core routers. This mechanism is currently beyond the scope of this document.

Alternatively, hosts could perform the CBT encapsulation themselves, but this would require hosts to run a core discovery protocol. Host modifications required for such a protocol, and the subsequent data packet encapsulation, are considered extremely undesirable, and are therefore not considered further.

5.2. Eliminating the Topology-Discovery Protocol in the Presence of Tunnels

Traditionally, multicast protocols operating within a virtual topology, i.e. an overlay of the physical topology, have required the assistance of a multicast topology discovery protocol, such as that present in DVMRP. However, it is possible to have a multicast protocol operate within a virtual topology without the need for a multicast topology discovery protocol. One way to achieve this is by having a router configure all its tunnels to its virtual neighbours in advance. A tunnel is identified by a local interface address and a remote interface address. Routing is replaced by "ranking" each such tunnel interface associated with a particular core address; if the highest-ranked route is unavailable (tunnel end-points are required to run a Hello-like protocol between themselves) then the next-highest ranked available route is selected, and so on.

CBT trees are built using the same join/join-ack mechanisms as before, only now some branches of a delivery tree run in native mode, whilst others (tunnels) run in CBT mode. Underlying unicast routing dictates which interface a packet should be forwarded over. Each interface is configured as either native mode or CBT mode, so a packet can be encapsulated (decapsulated) accordingly.

As an example, router R's configuration would be as follows:

intf    type      mode      remote addr
-----------------------------------------
#1      phys      native    -
#2      tunnel    cbt       128.16.8.117
#3      phys      native    -
#4      tunnel    cbt       128.16.6.8
#5      tunnel    cbt       128.96.41.1

core    backup-intfs
--------------------
A       #5, #2
B       #3, #5
C       #2, #4

The CBT FIB needs to be slightly modified to accommodate an extra field, "backup-intfs" (backup interfaces).
The entry in this field specifies a backup interface whenever a tunnel interface specified in the FIB is down. Additional backups (should the first-listed backup be down) are specified for each core in the core backup table. For example, if interface (tunnel) #2 were down, and the target core of a CBT control packet were core A, the core backup table suggests using interface #5 as a replacement. If interface #5 happened to be down also, then the same table recommends interface #2 as a backup for core A.

5.3. Non-Member Sending (native mode)

For a multicast data packet to span beyond the scope of the originating subnetwork at least one CBT-capable router must be present on that subnetwork. The default DR (D-DR) on the subnetwork must encapsulate (IP-over-IP) the IP-style packet and unicast it to a core for the group. This requires CBT routers to have access to a mapping mechanism between group addresses and core routers. This mechanism is currently beyond the scope of this document.

Again, host changes could obviate the need for a local router to perform a <core, group> mapping and an encapsulation, but this is not considered a desirable option.

6. Tree Maintenance

Once a tree branch has been created, i.e.
a CBT router has received a JOIN_ACK for a JOIN_REQUEST previously sent (forwarded), a child router is required to monitor the status of its parent, and of the link to its parent, at fixed intervals by means of a ``keepalive'' mechanism operating between them. The ``keepalive'' mechanism is implemented by means of two CBT control messages: CBT_ECHO_REQUEST and CBT_ECHO_REPLY.

Immediately subsequent to a parent/child relationship being established, a child unicasts a CBT-ECHO-REQUEST to its parent, which unicasts a CBT-ECHO-REPLY in response.

CBT echo requests and replies may be aggregated to conserve bandwidth on links over which tree branches overlap. However, this is only possible if group address assignment has been coordinated to facilitate aggregation (see section 8.4).

For any CBT router, if its parent router, or path to the parent, fails, the child is initially responsible for re-attaching itself, and therefore all routers subordinate to it on the same branch, to the tree.

6.1. Router Failure

An on-tree router can detect a failure from the following two cases:

o  if a child stops receiving CBT_ECHO_REPLY messages. In this case the child realises that its parent has become unreachable and must therefore try and re-connect to the tree. The router on the tree immediately subordinate to the failed router arbitrarily elects a core from its list of cores for this group. The rejoining router then sends a JOIN_REQUEST (subcode ACTIVE_JOIN if it has no children attached, and subcode ACTIVE_REJOIN if at least one child is attached) to the best next-hop router on the path to the elected core. If no JOIN-ACK is received after the specified number of retransmissions, an alternate core is arbitrarily elected from the core list. The process is repeated until a JOIN-ACK is received, for a maximum of RECONNECT-TIMEOUT seconds (90 secs is the recommended default).

o  if a parent stops receiving CBT_ECHO_REQUESTs from a child. In this case the parent simply removes the child interface from its FIB entry for the particular group.

6.2. Router Re-Starts

There are two cases to consider here:

o  Core re-start. All JOIN-REQUESTs (all types) carry the identities (i.e. addresses) of each of the cores for a group. If a router is a core for a group, but has only recently re-started, it will not be aware that it is a core for any group(s). In such circumstances, a core only becomes aware that it is such by receiving a JOIN-REQUEST. Subsequent to a core learning its status in this way, if it is not the primary core it acknowledges the received join, then sends a JOIN_REQUEST (subcode ACTIVE_REJOIN) to the primary core. If the re-started router is the primary core, it need take no action, i.e. in all circumstances, the primary core simply waits to be joined by other routers.

o  Non-core re-start.
In this case, the router can only join the tree again if a downstream router sends a JOIN_REQUEST through it, or it is elected DR for one of its directly attached subnets, and subsequently receives an IGMP RP/Core Report.

6.3. Route Loops

Routing loops are only a concern when a router with at least one child is attempting to re-join a CBT tree. In this case the re-joining router sends a JOIN_REQUEST (subcode ACTIVE_REJOIN) to the best next-hop on the path to the core. This join is forwarded as normal until it reaches either the core, or a non-core router that is already part of the tree. If the join reaches the specified core, the join terminates there and is ACKd as normal. If, however, the join is terminated by a non-core router, the ACTIVE_REJOIN is converted to a NON_ACTIVE_REJOIN, keeping the origin as that specified in the ACTIVE_REJOIN, and forwarded upstream. A JOIN_ACK is also sent downstream to acknowledge the received join.

The NON_ACTIVE_REJOIN is a loop detection packet. All routers receiving this must forward it over their parent interface. This process continues until the NON_ACTIVE_REJOIN is received by the primary core for the group, or the NON_ACTIVE_REJOIN is received by the originator of the corresponding ACTIVE_REJOIN. A router will know this since the "origin" field remains unchanged when a join is converted from an ACTIVE_REJOIN to a NON_ACTIVE_REJOIN. In the former case, the primary core acknowledges the NON_ACTIVE_REJOIN with JOIN-ACK, subcode NACTIVE_REJOIN. This message is unicast directly to the REJOIN_ACTIVE originator. In the latter case, the ACTIVE_REJOIN originator immediately sends a QUIT_REQUEST to its newly-established parent and the loop is broken.
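The loop-detection procedure of section 6.3 can be traced with a small simulation. This is a hedged Python sketch rather than an implementation of the protocol: the function trace_rejoin, its parameters, the dictionary representation of next hops and parent interfaces, and the max_hops guard are all hypothetical and introduced only for illustration. It returns only the final outcome seen by the re-joining router and omits the JOIN_ACK that the terminating on-tree router sends downstream.

```python
from typing import Dict, Set


def trace_rejoin(origin: str, target_core: str, primary_core: str,
                 next_hop: Dict[str, str], parent: Dict[str, str],
                 on_tree: Set[str], max_hops: int = 64) -> str:
    """Return the outcome seen by the router that originated an ACTIVE_REJOIN."""
    # Phase 1: the ACTIVE_REJOIN is forwarded towards the target core until
    # it reaches a core or a non-core router that is already on-tree.
    node = next_hop[origin]
    for _ in range(max_hops):
        if node in (target_core, primary_core):
            return "JOIN_ACK"  # join terminates at the core, ACKd as normal
        if node in on_tree:
            break              # terminated by an on-tree, non-core router
        node = next_hop[node]
    else:
        raise RuntimeError("ACTIVE_REJOIN did not terminate")

    # Phase 2: the join is converted to a NON_ACTIVE_REJOIN (origin field
    # unchanged) and forwarded over each router's parent interface.
    for _ in range(max_hops):
        node = parent[node]
        if node == origin:
            return "QUIT_REQUEST"              # loop: origin quits new parent
        if node == primary_core:
            return "JOIN_ACK(NACTIVE_REJOIN)"  # primary core acks: no loop
    raise RuntimeError("NON_ACTIVE_REJOIN did not terminate")
```

With R3 re-joining via R6 and R5, where R5's parent is R4 and R4's parent is R3, the trace ends in a QUIT_REQUEST because the NON_ACTIVE_REJOIN arrives back at its origin.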
o  Using figure 5 to demonstrate this, if R3 is attempting to re-join the tree (R1 is the core in figure 5) and R3 believes its best next-hop to R1 is R6, and R6 believes R5 is its best next-hop to R1, which sees R4 as its best next-hop to R1 -- a loop is formed. R3 begins by sending a JOIN_REQUEST (subcode ACTIVE_REJOIN, since R4 is its child) to R6. R6 forwards the join to R5. R5 is on-tree for the group, so changes the join subcode to NON_ACTIVE_REJOIN, and forwards this to its parent, R4. R4 forwards the NON_ACTIVE_REJOIN to R3, its parent. R3 originated the corresponding ACTIVE_REJOIN, and so it immediately sends a QUIT_REQUEST to R6, which in turn sends a quit if it has not received an ACK from R5 already AND has itself a child or subnets with member presence. If so it does not send a quit -- the loop has been broken by R3 sending the first quit.

QUIT_REQUESTs are typically acknowledged by means of a QUIT_ACK, but there might be cases where, due to failure, the parent cannot respond. In this case the child nevertheless removes the parent information after some small number (typically 3) of re-tries.

              ------
              | R1 |
              ------
                |
        ---------------------------
                |
              ------
              | R2 |
              ------
                |
        ---------------------------
                |                 |
              ------              |
              | R3 |--------------------------|
              ------              |           |
                |                 |           |
        ---------------------------           |
                |                 |         ------
              ------              |         |    |
              | R4 |              |-------| R6 |
              ------              |         |----|
                |                 |           |
        ---------------------------           |
                |                             |
              ------                          |
              | R5 |--------------------------|
              ------                          |
                |

Figure 5: Example Loop Topology

In the other scenario where no loop is actually formed, router R3 sends a join, subcode REJOIN_ACTIVE, to R2, the next-hop on the path to core R1.
R2 forwards the re-join to R1, the primary core, which unicasts a JOIN-ACK to the originator of the REJOIN_ACTIVE, i.e. the join-ack remains invisible to R2.

7. Data Packet Loops

The CBT protocol builds a loop-free distribution tree. If all routers that comprise a particular tree function correctly, data packets should never traverse a tree branch more than once.

CBT routers will only forward native-style data packets if they are received over a valid on-tree interface. A native-style data packet that is not received over such an interface is discarded.

Encapsulated CBT data packets from a non-member sender can arrive via an "off-tree" interface (this is how CBT-mode sends data across tunnels, and how data from non-member senders in native-mode or CBT-mode reaches a tree). The encapsulating CBT data packet header includes an "on-tree" field, which contains the value 0x00 until the data packet reaches an on-tree router. At this point, the router must convert this value to 0xff to indicate the data packet is now on-tree. This value remains unchanged, and from here on the packet should traverse only on-tree interfaces. If an encapsulated packet happens to "wander" off-tree and back on again, the latter on-tree router will receive the CBT encapsulated packet via an off-tree interface. However, this router will recognise that the "on-tree" field of the encapsulating CBT header is set to 0xff, and so immediately discards the packet.

8. CBT Packet Formats and Message Types

CBT packets travel in IP datagrams. We distinguish between two types of CBT packet: CBT data packets, and CBT control packets.

CBT data packets carry a CBT header when these packets are traversing CBT tree branches. The encapsulation (for "CBT mode") is shown below:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| encaps IP hdr | CBT hdr | original IP hdr | data ....|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Figure 6. Encapsulation for CBT mode

CBT control packets carry a CBT control header. All CBT control messages are implemented over UDP. This makes sense for several reasons: firstly, all the information required to build a CBT delivery tree is kept in user space.
Secondly, implementation is made considerably easier.

CBT control messages fall into two categories: primary maintenance messages, which are concerned with tree-building, re-configuration, and teardown, and auxiliary maintenance messages, which are mainly concerned with general tree maintenance.

8.1. CBT Header Format

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |unused |     type      |  hdr length   |on-tree|unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           checksum            |    IP TTL     |    unused     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       group identifier                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         core address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         packet origin                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        flow identifier                        |
|                            (T.B.D)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        security fields                        |
|                            (T.B.D)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7. CBT Header

Each of the fields is described below:

o  Vers: Version number -- this release specifies version 1.

o  type: indicates whether the payload is data or control information.

o  hdr length: length of the header, for purpose of checksum calculation.

o  on-tree: indicates whether the packet is on-tree (0xff) or off-tree (0x00). Once this field is set (i.e. on-tree), it is non-changing.

o  checksum: the 16-bit one's complement of the one's complement sum of the CBT header, calculated across all fields.

o  IP TTL: TTL value gleaned from the IP header where the packet originated. It is decremented each time it traverses a CBT router.
o  group identifier: multicast group address.

o  core address: the unicast address of a core for the group. A core address is always inserted into the CBT header by an originating host, since at any instant, it does not know if the local DR for the group is on-tree. If it is not, the local DR must unicast the packet to the specified core.

o  packet origin: source address of the originating end-system.

o  flow-identifier: (T.B.D) value uniquely identifying a previously set up data stream.

o  security fields: these fields (T.B.D.) will ensure the authenticity and integrity of the received packet.

8.2. Control Packet Header Format

The individual fields are described below. It should be noted that only certain fields beyond ``group identifier'' are processed for the different control messages.

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |unused |     type      |     code      |    # cores    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          hdr length           |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       group identifier                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         packet origin                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      target core address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Core #1                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Core #2                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Core #3....                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Resource Reservation fields                  |
|                            (T.B.D)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        security fields                        |
|                            (T.B.D)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 8. CBT Control Packet Header

o  Vers: Version number -- this release specifies version 1.

o  type: indicates control message type (see sections 8.3, 8.4).

o  code: indicates subcode of control message type.

o  # cores: number of core addresses carried by this control packet.

o  header length: length of the header, for purpose of checksum calculation.

o  checksum: the 16-bit one's complement of the one's complement sum of the CBT control header, calculated across all fields.

o  group identifier: multicast group address.

o  packet origin: source address of the originating end-system.

o  target core address: desired/actual core affiliation of control message.

o  Core #Z: IP address of core #Z.

o  Resource Reservation fields: these fields (T.B.D.) are used to reserve resources as part of the CBT tree set up procedure.

o  Security fields: these fields (T.B.D.) ensure the authenticity and integrity of the received packet.

8.3. Primary Maintenance Message Types

There are six types of CBT primary maintenance message. Primary message subcodes are described in the next section.

o  JOIN-REQUEST (type 1): generated by a router and unicast to the specified core address. It is processed hop-by-hop on its way to the specified core.
Its purpose is to establish the sending CBT router, and all intermediate CBT routers, as part of the corresponding delivery tree.

o  JOIN-ACK (type 2): an acknowledgement to the above. The full list of core addresses is carried in a JOIN-ACK, together with the actual core affiliation (the join may have been terminated by an on-tree router on its journey to the specified core, and the terminating router may or may not be affiliated to the core specified in the original join). A JOIN-ACK traverses the same path as the corresponding JOIN-REQUEST, with each CBT router on the path processing the ack. It is the receipt of a JOIN-ACK that actually creates a tree branch.

o  JOIN-NACK (type 3): a negative acknowledgement, indicating that the tree join process has not been successful.

o  QUIT-REQUEST (type 4): a request, sent from a child to a parent, to be removed as a child to that parent.

o  QUIT-ACK (type 5): acknowledgement to the above. If the parent, or the path to it, is down, no acknowledgement will be received within the timeout period. This results in the child nevertheless removing its parent information.

o  FLUSH-TREE (type 6): a message sent from parent to all children, which traverses a complete branch. This message results in all tree interface information being removed from each router on the branch, possibly because of a re-configuration scenario.

8.3.1. Primary Maintenance Message Subcodes

The JOIN-REQUEST has three valid subcodes:

o  ACTIVE-JOIN (code 0) - sent from a CBT router that has no children for the specified group.

o  REJOIN-ACTIVE (code 1) - sent from a CBT router that has at least one child for the specified group.

o  REJOIN-NACTIVE (code 2) - converted from a REJOIN-ACTIVE by the first on-tree router receiving a REJOIN-ACTIVE. This message is forwarded over a router's parent interface until it either reaches the primary core, or is received by the originator of the corresponding REJOIN-ACTIVE.

A JOIN-ACK has three valid subcodes:

o  NORMAL (code 0) - sent by a core router, or on-tree non-core router, acknowledging joins with subcodes REJOIN-ACTIVE and ACTIVE-JOIN.

o  PROXY-ACK (code 1) - acknowledgement of a join-request by a router connected to the same subnet as the originator (subnet D-DR) of the corresponding join.

o  REJOIN-NACTIVE (code 2) - sent by a primary core to acknowledge the receipt of a join-request received with subcode REJOIN-NACTIVE. This ack is unicast directly to the router that converted the corresponding REJOIN-ACTIVE to REJOIN-NACTIVE. The CBT control packet "origin" field contains the IP address of the originator of the REJOIN-ACTIVE, so in order for the primary core
to directly reach the source of the REJOIN-NACTIVE, the converting router inserts its IP address in the "core address" field of the control packet header. The primary core uses the address in this field to determine the target of the join-ack, subcode REJOIN-NACTIVE.

8.4. Auxiliary Maintenance Message Types

There are two CBT auxiliary maintenance message types. CBT auxiliary messages are encoded in a CBT control packet header, and the fields of the control packet are interpreted as illustrated below. The interpretation of certain fields further depends on whether aggregation and security are implemented.

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers  |unused |     type      |     code      |   aggregate   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          hdr length           |           checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            group identifier (or low end of range)             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     group id mask or NULL                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                NULL (if security implemented)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            security fields if implemented or NULL             |
|                            (T.B.D)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 9. CBT Echo Request/Reply

o  CBT-ECHO-REQUEST (type 7): once a tree branch is established, this message acts as a ``keepalive'', and is unicast from child to parent.

o  CBT-ECHO-REPLY (type 8): positive reply to the above.
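The message type and subcode values listed in sections 8.3, 8.3.1 and 8.4, together with the unofficial UDP port numbers given for CBT control messages, can be collected as in the Python sketch below. The class names, constant names and the helper udp_port_for are hypothetical and exist only for illustration; only the numeric values are taken from this document.

```python
from enum import IntEnum

PRIMARY_UDP_PORT = 7777    # unofficial port for primary control messages
AUXILIARY_UDP_PORT = 7778  # unofficial port for auxiliary control messages


class MessageType(IntEnum):
    JOIN_REQUEST = 1
    JOIN_ACK = 2
    JOIN_NACK = 3
    QUIT_REQUEST = 4
    QUIT_ACK = 5
    FLUSH_TREE = 6
    ECHO_REQUEST = 7  # auxiliary
    ECHO_REPLY = 8    # auxiliary


class JoinRequestCode(IntEnum):
    ACTIVE_JOIN = 0
    REJOIN_ACTIVE = 1
    REJOIN_NACTIVE = 2


class JoinAckCode(IntEnum):
    NORMAL = 0
    PROXY_ACK = 1
    REJOIN_NACTIVE = 2


def udp_port_for(msg_type: int) -> int:
    """Pick the UDP port for a control message type (hypothetical helper)."""
    kind = MessageType(msg_type)  # raises ValueError for unknown type values
    if kind in (MessageType.ECHO_REQUEST, MessageType.ECHO_REPLY):
        return AUXILIARY_UDP_PORT
    return PRIMARY_UDP_PORT
```

For instance, udp_port_for(MessageType.JOIN_REQUEST) selects the primary port, while the two echo types select the auxiliary port.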
The purpose of this messageEcho Requests/Replies can be sent as aggregates, or individually for each group if multicast address assignment isto establish core reachability before sending a JOIN- REQUEST to it. o+ CBT-PING-REPLY: positive reply to the above. o+ CBT-TAG-REPORT: unicast from an end-system tosuch that aggrega- tion is not possible. If aggregation is implemented, thedesignated router for"aggregate" field (which replaces thecorresponding group, subsequent to"# cores" field of theend- system receiving a designated router advertisement (as well as a core notification reply if group-initiating host). This message invokesstandard control packet header. In this case, no cores are assumed present in thesending of a JOIN-REQUEST ifmes- sage) will contain thereceiv- ing routervalue 0xff, otherwise 0x00. If aggregation is notalready part of the corresponding tree. o+ CBT-HOST_JOIN_ACK: group-specific multicast by a CBT router that originated a JOIN-REQUEST on behalf of some end-system onimplemented, thesame LAN (subnet). The purpose of this message"group id mask" field is set tonotify end-systemsNULL, or is not present, depending onthe LAN belongingwhether security is imple- mented or not. Masks are used according tothe specified grouptheir standard networking usage. The "flow-id" field (to be done) ofsuch things as: success in joining the delivery tree; actual core affiliation. o+ CBT-DR-ADV-NOTIFICATION: multicast totheCBT ``all-routers'' address, this messagestandard control packet header issent subsequent to receiving a CBT- DR-SOLICITATION, but priorNULL if security is implemented, not present otherwise. The security fields (to be done) are only present if security is implemented. _9. _D_e_f_a_u_l_t _T_i_m_e_r _V_a_l_u_e_s There are several CBT control messages which are transmitted at fixed intervals. These values, retransmission times, and timeout values, are given below. 
Note these are recommended default values only, and are configurable
with each implementation (all times are in seconds):

o  CBT-ECHO-INTERVAL 30 (time between sending successive
   CBT-ECHO-REQUESTs to parent)

o  PEND-JOIN-INTERVAL 10 (retransmission time for join-request if no
   ack received)

o  PEND-JOIN-TIMEOUT 30 (time to try joining a different core, or
   give up)

o  EXPIRE-PENDING-JOIN 90 (remove transient state for join that has
   not been acknowledged)

o  CBT-ECHO-TIMEOUT 90 (time to consider parent unreachable)

o  CHILD-ASSERT-INTERVAL 90 (check last time we received an ECHO from
   each child)

o  CHILD-ASSERT-EXPIRE-TIME 180 (remove child information if no ECHO
   received)

o  IFF-SCAN-INTERVAL 300 (scan all interfaces for group presence; if
   none, send QUIT)

10. Interoperability Issues

One of the design goals of CBT is for it to fully interwork with other
IP multicast schemes. We have already described how CBT-style packets
are transformed into IP-style multicasts, and vice-versa.

In order for CBT to fully interwork with other schemes, it is
necessary to define the interface(s) between a ``CBT cloud'' and the
cloud of another scheme. The CBT authors are currently working out the
details of the ``CBT-other'' interface, and therefore we omit further
discussion of this topic at the present time.

11. CBT Security Architecture

See current I-D: draft-ietf-idmr-mkd-01.{ps,txt}

Acknowledgements

Special thanks go to Paul Francis, NTT Japan, for the original
brainstorming sessions that brought about this work.
Thanks also to the networking team at Bay Networks for their comments
and suggestions, in particular Steve Ostrowski for his suggestion of
using "native mode" as a router optimization, Eric Crawley, Scott
Reeve, and Nitin Jain. Thanks also to Ken Carlberg (SAIC) for
reviewing the text, and generally providing constructive comments
throughout.

I would also like to thank the participants of the IETF IDMR working
group meetings for their general constructive comments and suggestions
since the inception of CBT.

APPENDIX

IGMP version 3 has recently been proposed [7]. The authors have the
following recommendations for amendments (all minor) to IGMPv3:

o  The IGMPv3 draft [7] introduces a new IGMP message type, the PIM
   RP-REPORT message. Its message format is shown below:

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Group Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Version    |   Reserved    |         # of RP's (N)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        RP Address [1]                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       RP Address [...]                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        RP Address [N]                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 10. PIM RP-REPORT

The CBT authors propose the following minor amendments to the IGMP PIM
RP-REPORT:

o  the report to be re-named RP/CORE-REPORT

o  RP fields to be re-named RP/Core fields

o  the reserved field to be re-named the "target core" field, to
   contain the numeric value of the position of the target core in
   the RP/Core list

o  the introduction of a new code value to distinguish PIM RP reports
   from CBT Core reports

These minor amendments to IGMPv3 would satisfy CBT's operational
requirements.
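As an illustration only, the proposed RP/CORE-REPORT could be packed
as sketched below. This is not part of the specification: the type
and code numbers, the version value, and the widths of the third word
(8-bit version, 8-bit target core, 16-bit count) are assumptions made
for the example, as is treating the target core as a 1-based position
in the core list. The checksum shown is the usual 16-bit
one's-complement Internet checksum used by IGMP messages.

```python
import socket
import struct

# Hypothetical values: the draft does not assign these numbers.
RP_CORE_REPORT_TYPE = 0x20  # assumed IGMP type for the report
CODE_PIM_RP = 0             # assumed code for a PIM RP report
CODE_CBT_CORE = 1           # assumed new code for a CBT core report
ASSUMED_VERSION = 3         # assumed content of the version field


def internet_checksum(data: bytes) -> int:
    """Return the complemented 16-bit one's-complement sum of data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        # Fold carries back into the low 16 bits (end-around carry).
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_core_report(group: str, cores: list[str], target_core: int) -> bytes:
    """Pack a CBT core report; target_core is a 1-based list position."""
    if not 1 <= target_core <= len(cores):
        raise ValueError("target core must be a position in the core list")
    # Assumed widths: version 8 bits, target core 8 bits, count 16 bits.
    body = socket.inet_aton(group)
    body += struct.pack("!BBH", ASSUMED_VERSION, target_core, len(cores))
    body += b"".join(socket.inet_aton(core) for core in cores)
    # Compute the checksum with the checksum field set to zero.
    header = struct.pack("!BBH", RP_CORE_REPORT_TYPE, CODE_CBT_CORE, 0)
    checksum = internet_checksum(header + body)
    return struct.pack("!BBH", RP_CORE_REPORT_TYPE, CODE_CBT_CORE, checksum) + body
```

A receiver could validate such a report by recomputing
internet_checksum() over the whole message and checking for zero, and
could tell CBT core reports from PIM RP reports by the code octet
alone.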
Author's Address:

   Tony Ballardie,
   Department of Computer Science,
   University College London,
   Gower Street,
   London, WC1E 6BT,
   ENGLAND, U.K.

   Tel: ++44 (0)71 419 3462
   e-mail: A.Ballardie@cs.ucl.ac.uk

   Nitin Jain,
   Bay Networks, Inc.
   3 Federal Street,
   Billerica, MA 01821,
   USA.

   Tel: ++1 508 670 8888
   e-mail: njain@BayNetworks.com

   Scott Reeve,
   Bay Networks, Inc.
   3 Federal Street,
   Billerica, MA 01821,
   USA.

   Tel: ++1 508 670 8888
   e-mail: sreeve@BayNetworks.com

References

[1]  DVMRP. Described in "Multicast Routing in a Datagram
     Internetwork", S. Deering, PhD Thesis, 1990. Available via
     anonymous ftp from: gregorio.stanford.edu:vmtp/sd-thesis.ps.

[2]  J. Moy. Multicast Routing Extensions to OSPF. Communications of
     the ACM, 37(8): 61-66, August 1994.

[3]  D. Farinacci, S. Deering, D. Estrin, and V. Jacobson. Protocol
     Independent Multicast (PIM) Dense-Mode Specification
     (draft-ietf-idmr-pim-spec-01.ps). Working draft, 1994.

[4]  A. J. Ballardie. Scalable Multicast Key Distribution
     (draft-ietf-idmr-mkd-01.txt). Working draft, 1995.

[5]  A. J. Ballardie. "A New Approach to Multicast Communication in a
     Datagram Internetwork", PhD Thesis, 1995. Available via anonymous
     ftp from: cs.ucl.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z.

[6]  W. Fenner. Internet Group Management Protocol, version 2
     (IGMPv2) (draft-idmr-igmp-v2-01.txt).

[7]  B. Cain, S. Deering, A. Thyagarajan. Internet Group Management
     Protocol Version 3 (IGMPv3) (draft-cain-igmp-00.txt).

[8]  M. Handley, J. Crowcroft, I. Wakeman. Hierarchical Rendezvous
     Point proposal, work in progress.
     (http://www.cs.ucl.ac.uk/staff/M.Handley/hpim.ps).

[9]  D. Estrin et al. USC/ISI, Work in progress. (document not yet
     available).

[10] D. Estrin et al. PIM Sparse Mode Specification
     (draft-ietf-idmr-pim-sparse-spec-00.txt).