draft-ietf-idmr-cbt-spec-06.txt   draft-ietf-idmr-cbt-spec-07.txt 
Inter-Domain Multicast Routing (IDMR) A. Ballardie Inter-Domain Multicast Routing (IDMR) A. Ballardie
INTERNET-DRAFT University College London INTERNET-DRAFT Consultant
S. Reeve & N. Jain
Bay Networks, Inc.
September 1996 March 1997
Core Based Trees (CBT) Multicast Core Based Trees (CBT) Multicast Routing
-- Protocol Specification -- -- Protocol Specification --
Status of this Memo Status of this Memo
This document is an Internet Draft. Internet Drafts are working doc- This document is an Internet Draft. Internet Drafts are working doc-
uments of the Internet Engineering Task Force (IETF), its Areas, and uments of the Internet Engineering Task Force (IETF), its Areas, and
its Working Groups. Note that other groups may also distribute work- its Working Groups. Note that other groups may also distribute work-
ing documents as Internet Drafts). ing documents as Internet Drafts).
skipping to change at page 1, line 34 skipping to change at page 1, line 32
Drafts as reference material or to cite them other than as a "working Drafts as reference material or to cite them other than as a "working
draft" or "work in progress." draft" or "work in progress."
Please check the I-D abstract listing contained in each Internet Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other Draft directory to learn the current status of this or any other
Internet Draft. Internet Draft.
Abstract Abstract
This document describes the Core Based Tree (CBT) network layer mul- This document describes the Core Based Tree (CBT) network layer mul-
ticast protocol. CBT is a next-generation multicast protocol that ticast routing protocol. CBT builds a shared multicast distribution
makes use of a shared delivery tree rather than separate per-sender tree per group, and is suited to inter- and intra-domain multicast
trees utilized by most other multicast schemes [1, 2, 3]. The CBT routing.
architecture is described in [4a].
This specification includes an optimization whereby unencapsulated CBT is protocol independent in that it makes use of unicast routing
(native) IP-style multicasts are forwarded by CBT routers, resulting to establish paths between senders and receivers. The CBT architec-
in very good forwarding performance. This mode of operation is ture is described in [1].
called CBT "native mode". Native mode can only be used in CBT-only
domains (footnote 1).
_________________________
This revision contains two appendices; Appendix A describes simple
CBT add-on mechanisms for dynamically migrating a CBT tree to one
whose core is directly attached to a source's subnetwork, thereby
allowing CBT to emulate shortest-path trees. Appendix B describes a
group state aggregation scheme.
This document is progressing through the IDMR working group of the This document is progressing through the IDMR working group of the
IETF. CBT related documents include [4, 5]. For all IDMR-related IETF. CBT related documents include [1, 5, 6]. For all IDMR-related
documents, see http://www.cs.ucl.ac.uk/ietf/idmr. documents, see http://www.cs.ucl.ac.uk/ietf/idmr.
NOTE that core placement and management is not discussed in this doc- TABLE OF CONTENTS
ument.
1. Changes since Previous Revision (05)
This note summarizes the changes to this document since the previous
revision (revision 05).
+o inclusion of "first hop router" and "primary core" fields in the
CBT mode data packet header.
+o removal of the term "non-core" router, replaced by "on-tree"
router.
+o removal of the term "default DR (D-DR)", replaced simply by DR.
+o inclusion of T and S bits in the CBT control and data packet
headers (type of service, and security, respectively).
+o CBT control messages are now carried directly over IP rather
than UDP (for all implementations).
+o inclusion of an Appendix (A) describing extensions to the CBT
protocol to achieve dynamic source-migration of core routers for
shortest-path tree emulation.
+o inclusion of an Appendix (B) describing a group state aggrega-
tion scheme.
_________________________
1 The term "domain" should be considered synonymous
with "routing domain" throughout, as are the terms "re-
gion" and "cloud".
+o editorial changes and some re-organisation throughout for extra
clarity.
2. Some Terminology
In CBT, the core routers for a particular group are categorised into
PRIMARY CORE, and NON-PRIMARY (secondary) CORES.
The "core tree" is the part of a tree linking all core routers of a
particular group together.
On-tree routers are those with a forwarding database entry for the
corresponding group.
3. Protocol Specification
3.1. Tree Joining Process -- Overview
A CBT router is notified of a local host's desire to join a group via
IGMP [6]. We refer to a CBT router with directly attached hosts as a
"leaf CBT router", or just "leaf" router.
The following CBT control messages come into play subsequent to a
subnet's CBT leaf router receiving an IGMP membership report (also
termed "IGMP join"):
+o JOIN_REQUEST
+o JOIN_ACK
If the CBT leaf router is the subnet's designated router (see next
section), it generates a CBT join-request in response to receiving an
IGMP group membership report from a directly connected host. The CBT
join is sent to the next-hop on the unicast path to a target core,
specified in the join packet; a router elects a "target core" based
on a static configuration. If, on receipt of an IGMP-join, the
locally-elected DR has already joined the corresponding tree, then it
need do nothing more with respect to joining.
The join is processed by each such hop on the path to the core, until
either the join reaches the target core itself, or hits a router that
is already part of the corresponding distribution tree (as identified
by the group address). In both cases, the router concerned terminates
the join, and responds with a join-ack (join acknowledgement), which
traverses the reverse-path of the corresponding join. This is possi-
ble due to the transient path state created by a join traversing a
CBT router. The ack fixes that state.
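The hop-by-hop behaviour described above can be sketched as a small
simulation. This is an illustrative sketch only, not part of the
specification: the Router class, its next_hop attribute, and the
send_join function are hypothetical names, and unicast routing is
reduced to a single static next hop per router.

```python
class Router:
    """Hypothetical CBT router model, for illustration only."""

    def __init__(self, name, next_hop=None):
        self.name = name
        self.next_hop = next_hop      # next router on the unicast path to the core
        self.on_tree = set()          # groups with confirmed (acked) state
        self.transient = {}           # group -> previous hop, awaiting a join-ack


def send_join(origin, group, core):
    """Forward a join hop by hop, then ack back along the reverse path.

    Returns the names of the routers the join-ack traversed.
    """
    prev, hop = origin, origin.next_hop
    origin.transient[group] = None    # the originator has no previous hop
    # The join stops at the target core or at the first on-tree router.
    while hop is not core and group not in hop.on_tree:
        hop.transient[group] = prev
        prev, hop = hop, hop.next_hop
    hop.on_tree.add(group)            # terminating router is (or becomes) on-tree
    # The join-ack follows the transient state backwards and fixes it.
    acked = []
    node = prev
    while node is not None:
        back = node.transient.pop(group)
        node.on_tree.add(group)
        acked.append(node.name)
        node = back
    return acked
```

With a chain R1 -> R3 -> R4 (core), a join from R1 is acknowledged by
R4 and the ack passes through R3 and then R1. A later join from a
router whose path reaches R3 terminates at R3, since R3 is already
on-tree.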
3.2. DR Election
Multiple CBT routers may be connected to a multi-access subnetwork.
In such cases it is necessary to elect a subnetwork designated router
(DR) that is responsible for generating and sending CBT joins
upstream, on behalf of hosts on the subnetwork.
CBT DR election happens "on the back" of IGMP [6]; on a subnet with
multiple multicast routers, an IGMP "querier" is elected as part of
IGMP. At start-up, a multicast router assumes no other multicast
routers are present on its subnetwork, and so begins by believing it
is the subnet's IGMP querier. It sends a small number of
IGMP-HOST-MEMBERSHIP-QUERYs in short succession in order to quickly
learn about any group memberships on the subnet. If other multicast
routers are
present on the same subnet, they will receive these IGMP queries; a
multicast router yields querier duty as soon as it hears an IGMP
query from a lower-addressed router on the same subnetwork.
The CBT DR is always the subnet's IGMP querier (footnote 2). As a
result, there is no protocol overhead whatsoever associated with
electing a CBT DR.
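The election rule can be illustrated with a short sketch. It assumes
IPv4 addresses given as strings and a hypothetical per-router flag
saying whether the router is CBT capable; the function names are
illustrative and do not appear in the specification. The fallback to
the lowest-addressed CBT router follows footnote 2.

```python
import ipaddress


def elect_querier(router_addrs):
    """Return the lowest-addressed router, which wins IGMP querier duty."""
    return min(router_addrs, key=ipaddress.ip_address)


def elect_cbt_dr(routers):
    """routers: mapping of address -> True if the router is CBT capable.

    The DR is the IGMP querier when it speaks CBT; otherwise it is the
    lowest-addressed CBT router on the subnet.
    """
    querier = elect_querier(routers)
    if routers[querier]:
        return querier
    cbt_capable = [addr for addr, is_cbt in routers.items() if is_cbt]
    return elect_querier(cbt_capable) if cbt_capable else None
```

Addresses are compared numerically rather than as strings, so
10.0.0.9 is lower than 10.0.0.10.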
3.3. Tree Joining Process -- Details
The receipt of an IGMP group membership report by a CBT DR for a CBT
group not previously heard from triggers the tree joining process;
the DR unicasts a JOIN-REQUEST to the first hop on the (unicast) path
to the target core specified in the CBT join packet.
_________________________
2 Or lowest addressed CBT router if the subnet's IGMP
querier is non-CBT capable.
Each CBT-capable router traversed on the path between the sending DR
and the core processes the join. However, if a join hits a CBT router
that is already on-tree, the join is not propagated further, but
acknowledged downstream from that point.
JOIN-REQUESTs carry the identity of all the cores associated with the
group. Assuming there are no on-tree routers in between, once the
join (subcode ACTIVE_JOIN) reaches the target core, if the target
core is not the primary core (as indicated in a separate field of the
join packet) it first acknowledges the received join by means of a
JOIN-ACK, then sends a JOIN-REQUEST, subcode REJOIN-ACTIVE, to the
primary core router.
If the rejoin-active reaches the primary core, it responds by sending
a JOIN-ACK, subcode PRIMARY-REJOIN-ACK, which traverses the reverse-
path of the join (rejoin). The primary-rejoin-ack serves to confirm
no loop is present, and so explicit loop detection is not necessary.
If some other on-tree router is encountered before the rejoin-active
reaches the primary, that router responds with a JOIN-ACK, subcode
NORMAL. On receipt of the ack, subcode normal, the router sends a
join, subcode REJOIN-NACTIVE, which acts as a loop detection packet
(see section 8.3). Note that loop detection is not necessary subse-
quent to receiving a join-ack with subcode PRIMARY-REJOIN-ACK.
To facilitate detailed protocol description, we use a sample topol- 1. Changes Since Previous Revision............................ 3
ogy, illustrated in Figure 1 (shown over). Member hosts are shown as
individual capital letters, routers are prefixed with R, and subnets
are prefixed with S.
A B 2. Introduction & Terminology................................. 4
| S1 S4 |
------------------- -----------------------------------------------
| | | |
------ ------ ------ ------
| R1 | | R2 | | R5 | | R6 |
------ ------ ------ ------
C | | | | |
| | | | S2 | S8 |
---------- ------------------------------------------ -------------
S3 |
------
| R3 |
| ------ D
| S9 | | S5 |
| | ---------------------------------------------
| |----| | |
---| R7 |-----| ------
| |----| |------------------| R4 |
| S7 | ------ F
| | | S6 |
|-E | ---------------------------------
| |
| ------
|---| |---------------------| R8 |
|R12 ----| ------ G
|---| | | | S10
| S14 ----------------------------
| |
I --| ------
| | R9 |
------
| S12
| ----------------------------
S15 | |
| ------
|----------------------|R10 |
J ---| ------ H
| | |
| ----------------------------
| S13
Figure 1. Example Network Topology 3. CBT Functional Overview.................................... 5
Taking the example topology in figure 1, host A wishes to join group
G. All subnets' routers have been configured to use core routers R4
(primary core) and R9 (secondary core) for a range of group
addresses, including G.
Router R1 receives an IGMP host membership report, and proceeds to 4. CBT Protocol Specification Details........................ 8
unicast a JOIN-REQUEST, subcode ACTIVE-JOIN to the next-hop on the
path to R4 (R3), the target core. R3 receives the join, caches the
necessary group information (transient state), and forwards it to R4
-- the target of the join.
R4, being the target of the join, sends a JOIN_ACK (subcode NORMAL) 4.1 CBT HELLO Protocol..................................... 8
back out of the receiving interface to the previous-hop sender of the
join, R3. A JOIN-ACK, like a JOIN-REQUEST, is processed hop-by-hop by
each router on the reverse-path of the corresponding join. The
receipt of a join-ack establishes the receiving router on the corre-
sponding CBT tree, i.e. the router becomes part of a branch on the
delivery tree. Finally, R3 sends a join-ack to R1. A new CBT branch
has been created, attaching subnet S1 to the CBT delivery tree for
the corresponding group.
For the period between any CBT-capable router forwarding (or origi- 4.1.1 Sending HELLOs................................... 9
nating) a JOIN_REQUEST and receiving a JOIN_ACK the corresponding
router is not permitted to acknowledge any subsequent joins received
for the same group; rather, the router caches such joins till such
time as it has itself received a JOIN_ACK for the original join. Only
then can it acknowledge any cached joins. A router is said to be in a
"pending-join" state if it is awaiting a JOIN_ACK itself.
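The pending-join rule can be sketched as follows. This is a minimal
model for one router, assuming joins are identified only by group and
sender; the class and method names are hypothetical and chosen for
illustration.

```python
class PendingJoinRouter:
    """Hypothetical model of the pending-join rule for one router."""

    def __init__(self):
        self.pending = {}     # group -> joins cached while awaiting our own ack
        self.on_tree = set()  # groups for which a JOIN_ACK has been received

    def receive_join(self, group, sender):
        """Return the action taken for a join received from `sender`."""
        if group in self.on_tree:
            return "ack"                      # already on-tree: acknowledge at once
        if group in self.pending:
            self.pending[group].append(sender)
            return "cached"                   # our own join is still unacknowledged
        self.pending[group] = [sender]
        return "forwarded"                    # first join: forward upstream and wait

    def receive_ack(self, group):
        """Our join was acknowledged; return the senders we may now ack."""
        self.on_tree.add(group)
        return self.pending.pop(group, [])
```

A second join for the same group is held back until receive_ack is
called, at which point every cached sender can be acknowledged.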
Note that the presence of asymmetric routes in the underlying unicast 4.1.2 Receiving HELLOs................................. 9
routing does not affect the tree-building process; CBT tree branches
are symmetric by the nature in which they are built. Joins set up
transient state (incoming and outgoing interface state) in all
routers along a path to a particular core. The corresponding join-ack
traverses the reverse-path of the join as dictated by the transient
state, and not necessarily the path that underlying routing would
dictate. Whilst permanent asymmetric routes could pose a problem for
CBT, transient asymmetry is detected by the CBT protocol.
3.4. Forwarding Joins on Multi-Access Subnets 4.2 JOIN_REQUEST Processing................................ 10
The DR election mechanism does not guarantee that the DR will be the 4.2.1 Sending JOIN_REQUESTs............................ 10
router that actually forwards a join off a multi-access network; the
first hop on the path to a particular core might be via another
router on the same subnetwork, which actually forwards off-subnet.
Although very much the same, let's see another example using our 4.2.2 Receiving JOIN_REQUESTs.......................... 10
example topology of figure 1 of a host joining a CBT tree for the
case where more than one CBT router exists on the host subnetwork.
B's subnet, S4, has 3 CBT routers attached. Assume also that R6 has 4.3 JOIN_ACK Processing.................................... 11
been elected IGMP-querier and CBT DR.
R6 (S4's DR) receives an IGMP group membership report. R6's config- 4.3.1 Sending JOIN_ACKs................................ 11
ured information suggests R4 as the target core for this group. R6
thus generates a join-request for target core R4, subcode
ACTIVE_JOIN. R6's routing table says the next-hop on the path to R4
is R2, which is on the same subnet as R6. This is irrelevant to R6,
which unicasts it to R2. R2 unicasts it to R3, which happens to be
already on-tree for the specified group (from R1's join). R3 there-
fore can acknowledge the arrived join and unicast the ack back to R2.
R2 forwards it to R6, the origin of the join-request.
If an IGMP membership report is received by a DR with a join for the 4.3.2 Receiving JOIN_ACKs.............................. 12
same group already pending, or if the DR is already on-tree for the
group, it takes no action.
3.5. On-Demand "Core Tree" Building 4.4 QUIT_NOTIFICATION Processing........................... 12
The "core tree" - the part of a CBT tree linking all of its cores 4.4.1 Sending QUIT_NOTIFICATIONs....................... 12
together, is built on-demand. That is, the core tree is only built
subsequent to a non-primary (secondary) core receiving a join-
request. This triggers the secondary core to join the primary core;
the primary need never join anything.
Join-requests carry a list of core routers (and the identity of the 4.4.2 Receiving QUIT_NOTIFICATIONs..................... 13
primary core in its own separate field), making it possible for the
secondary cores to know where to join when they themselves receive a
join. Hence, the primary core must be uniquely identified as such
across the whole group. A secondary joins the primary subsequent to
sending an ack for the first join it receives.
3.6. Tree Teardown 4.5 CBT ECHO_REQUEST Processing............................ 14
There are two scenarios whereby a tree branch may be torn down: 4.5.1 Sending ECHO_REQUESTs............................ 14
+o During a re-configuration. If a router's best next-hop to the 4.5.2 Receiving ECHO_REQUESTs.......................... 14
specified core is one of its existing children, then before
sending the join it must tear down that particular downstream
branch. It does so by sending a FLUSH_TREE message which is pro-
cessed hop-by-hop down the branch. All routers receiving this
message must process it and forward it to all their children.
Routers that have received a flush message will re-establish
themselves on the delivery tree if they have directly connected
subnets with group presence.
+o If a CBT router has no children it periodically checks all its 4.6 ECHO_REPLY Processing.................................. 15
directly connected subnets for group member presence. If no mem-
ber presence is ascertained on any of its subnets it sends a
QUIT_REQUEST upstream to remove itself from the tree.
The receipt of a quit-request triggers the receiving parent 4.6.1 Sending ECHO_REPLYs.............................. 15
router to immediately query its forwarding database to establish
whether there remains any directly connected group membership,
or any children, for the said group. If not, the router itself
sends a quit-request upstream.
The following example, using the example topology of figure 1, shows 4.6.2 Receiving ECHO_REPLYs............................ 15
how a tree branch is gracefully torn down using a QUIT_REQUEST. 4.7 FLUSH_TREE Processing.................................. 16
Assume group member B leaves group G on subnet S4. B issues an IGMP 4.7.1 Sending FLUSH_TREE Messages...................... 16
HOST-MEMBERSHIP-LEAVE (relevant only to IGMPv2 and later versions)
message which is multicast to the "all-routers" group (224.0.0.2).
R6, the subnet's DR and IGMP-querier, responds with a group-specific-
QUERY. No hosts respond within the required response interval, so DR
assumes group G traffic is no longer wanted on subnet S4.
Since R6 has no CBT children, and no other directly attached subnets 4.7.2 Receiving FLUSH_TREE Messages.................... 16
with group G presence, it immediately follows on by sending a
QUIT_REQUEST to R2, its parent on the tree for group G. R2 responds
with a QUIT-ACK, unicast to R6; R2 removes the corresponding child
information. R2 in turn sends a QUIT upstream to R3 (since it has no
other children or subnet(s) with group presence).
NOTE: immediately subsequent to sending a QUIT-REQUEST, the sender 5. Timers and Default Values.................................. 16
removes the corresponding parent information, i.e. it does not
wait for the receipt of a QUIT-ACK.
R3 responds to the QUIT by unicasting a QUIT-ACK to R2. R3 subse- 6. CBT Packet Formats and Message Types....................... 17
quently checks whether it in turn can send a quit by checking group G
presence on its directly attached subnets, and any group G children.
It has the latter (R1 is its child on the group G tree), and so R3
cannot itself send a quit. However, the branch R3-R2-R6 has been
removed from the tree.
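The teardown walk in this example can be sketched in a few lines. The
sketch assumes a single group and models only parent and child state;
TreeNode and send_quit are hypothetical names, and the QUIT-ACK
exchange is reduced to the parent dropping its child entry.

```python
class TreeNode:
    """Hypothetical on-tree router state for a single group."""

    def __init__(self, name, parent=None, has_members=False):
        self.name = name
        self.parent = parent
        self.children = set()
        self.has_members = has_members   # group presence on attached subnets
        if parent is not None:
            parent.children.add(self)


def send_quit(node):
    """Send quits upstream while routers have no children and no members.

    Returns the names of the routers that left the tree.
    """
    removed = []
    while node.parent is not None and not node.children and not node.has_members:
        parent = node.parent
        node.parent = None               # sender drops parent state immediately
        parent.children.discard(node)    # parent drops child state, then acks
        removed.append(node.name)
        node = parent
    return removed
```

Applied to the example, a quit starting at R6 removes R6 and R2, and
stops at R3 because R1 is still its child.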
4. Tree Maintenance 6.1 CBT Common Control Packet Header....................... 18
Once a tree branch has been created, i.e. a CBT router has received a 6.2 HELLO Packet Format.................................... 19
JOIN_ACK for a JOIN_REQUEST previously sent (or forwarded), a child
router is required to monitor the status of its parent/child link at
fixed intervals by means of a "keepalive" mechanism operating between
them. The "keepalive" protocol is simple, and implemented by means
of two CBT control messages: CBT_ECHO_REQUEST and CBT_ECHO_REPLY; a
child unicasts a CBT-ECHO-REQUEST to its parent, which unicasts a
CBT-ECHO-REPLY in response.
Adjacent CBT routers only need to send one keepalive representing all 6.3 JOIN_REQUEST Packet Format............................. 19
children having the same parent, reachable over a particular link,
regardless of group. This aggregation strategy is expected to con-
serve considerable bandwidth on "busy" links, such as transit net-
work, or backbone network, links.
For any CBT router, if its parent router, or path to the parent, 6.4 JOIN_ACK Packet Format................................. 20
fails, the child is initially responsible for re-attaching itself,
and therefore all routers subordinate to it on the same branch, to
the tree.
4.1. Router Failure 6.5 QUIT_NOTIFICATION Packet Format........................ 21
An on-tree router can detect a failure from the following two cases: 6.6 ECHO_REQUEST Packet Format............................. 21
+o if the child responsible for sending keepalives across a partic- 6.7 ECHO_REPLY Packet Format............................... 22
ular link stops receiving CBT_ECHO_REPLY messages. In this case
the child realises that its parent has become unreachable and
must therefore try and re-connect to the tree for all groups
represented on the parent/child link. For all groups sharing a
common core set (corelist), provided those groups can be speci-
fied as a CIDR-like aggregate, an aggregated join can be sent
representing the range of groups. Aggregated joins are made
possible by the presence of a "group mask" field in the CBT con-
trol packet header (footnote 3).
If a range of groups cannot be represented by a mask, then each 6.8 FLUSH_TREE Packet Format............................... 23
group must be re-joined individually.
CBT's re-join strategy is as follows: the rejoining router which 7. Core Router Discovery...................................... 23
is immediately subordinate to the failure sends a JOIN_REQUEST
(subcode ACTIVE_JOIN if it has no children attached, and subcode
ACTIVE_REJOIN if at least one child is attached) to the best
next-hop router on the path to the elected core. If no JOIN-ACK
is received after three retransmissions, each transmission being
at PEND-JOIN-INTERVAL (5 secs) intervals, the next-highest pri-
ority core is elected from the core list, and the process
repeated. If all cores have been tried unsuccessfully, the DR
has no option but to give up.
+o if a parent stops receiving CBT_ECHO_REQUESTs from a child. In 7.1 Bootstrap Message Format.............................. 25
this case, if the parent has not received an expected keepalive
after CHILD_ASSERT_EXPIRE_TIME, all children reachable across
that link are removed from the parent's forwarding database.
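The re-join strategy in the first case above can be sketched as a
retry loop. Two assumptions are made that the text does not state
outright: "three retransmissions" is read as one initial join plus
three repeats, and try_join is a hypothetical callback that sends one
JOIN_REQUEST and reports whether a JOIN-ACK arrived within the
interval.

```python
PEND_JOIN_INTERVAL = 5    # seconds, as given in the text
MAX_RETRANSMISSIONS = 3   # repeats after the initial join (assumed reading)


def rejoin(core_list, try_join, has_children, sleep=lambda seconds: None):
    """Try each core in priority order; return the core that acked, or None."""
    # A router with children attached must use the rejoin subcode.
    subcode = "ACTIVE_REJOIN" if has_children else "ACTIVE_JOIN"
    for core in core_list:
        for _attempt in range(1 + MAX_RETRANSMISSIONS):
            if try_join(core, subcode):
                return core
            sleep(PEND_JOIN_INTERVAL)
    return None   # all cores tried unsuccessfully: give up
```

The sleep parameter defaults to a no-op so the sketch can be exercised
without waiting; a real caller would pass time.sleep or use a timer.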
4.2. Router Re-Starts 7.2 Candidate Core Advertisement Message Format........... 25
There are two cases to consider here: 8. Interoperability Issues.................................... 25
+o Core re-start. All JOIN-REQUESTs (all types) carry the identi- Acknowledgements.............................................. 26
ties (i.e. IP addresses) of each of the cores for a group. If a
router is a core for a group, but has only recently re-started,
it will not be aware that it is a core for any group(s). In such
circumstances, a core only becomes aware that it is such by
receiving a JOIN-REQUEST. Subsequent to a core learning its
status in this way, if it is not the primary core it acknowl-
edges the received join, then sends a JOIN_REQUEST (subcode
ACTIVE_REJOIN) to the primary core. If the re-started router is
the primary core, it need take no action, i.e. in all
_________________________
3 There are situations where it is advantageous to
send a single join-request that represents potentially
many groups. One such example is provided in [11],
whereby a designated border router is required to join
all groups inside a CBT domain.
circumstances, the primary core simply waits to be joined by References.................................................... 26
other routers.
+o Non-core re-start. In this case, the router can only join the Author Information............................................ 27
tree again if a downstream router sends a JOIN_REQUEST through
it, or it is elected DR for one of its directly attached sub-
nets, and subsequently receives an IGMP membership report.
4.3. Route Loops 1. Changes since Previous Revision (05)
Routing loops are only a concern when a router with at least one This revision of the CBT protocol specification differs significantly
child is attempting to re-join a CBT tree. In this case the re- from the previously released revision (05). Consequently, this revi-
joining router sends a JOIN_REQUEST (subcode ACTIVE REJOIN) to the sion represents version 2 of the CBT protocol. CBT version 2 is not,
best next-hop on the path to an elected core. This join is forwarded and was not, intended to be backwards compatible with version 1; we
as normal until it reaches either the specified core, another core, do not expect this to cause extensive compatibility problems because
or a on-tree router that is already part of the tree. If the rejoin we do not believe CBT is at all widely deployed at this stage. How-
reaches the primary core, loop detection is not necessary because the ever, any future versions of CBT can be expected to be backwards com-
primary never has a parent. The primary core acks an active-rejoin by patible with this version.
means of a JOIN-ACK, subcode PRIMARY-REJOIN-ACK. This ack must be
processed by each router on the reverse-path of the active-rejoin;
this ack creates tree state, just like a normal join-ack.
If an active-rejoin is terminated by any router on the tree other The most significant changes to version 2 compared to version 1
than the primary core, loop detection must take place, as we now include:
describe.
If, in response to an active-rejoin, a JOIN-ACK is returned, subcode +o new LAN mechanisms, including the incorporation of an HELLO pro-
NORMAL (as opposed to an ack with subcode PRIMARY-REJOIN-ACK), the tocol.
router receiving the ack subsequently generates a JOIN-REQUEST, sub-
code NACTIVE-REJOIN (non-active rejoin). This packet serves only to
detect loops; it does not create any transient state in the routers
it traverses, other than the originating router (in case retransmis-
sions are necessary). Any on-tree router receiving a non-active
rejoin is required to forward it over its parent interface for the
specified group. In this way, it will either reach the primary core,
which unicasts, directly to the sender, a join ack with subcode PRI-
MARY-NACTIVE-ACK (so the sender knows no loop is present), or the
sender receives the non-active rejoin it sent, via one of its child
interfaces, in which case the rejoin obviously formed a loop.
If a loop is present, the non-active join originator immediately +o new simplified packet formats, with the definition of a common
sends a QUIT_REQUEST to its newly-established parent and the loop is CBT control packet header.
broken.
Using figure 2 (over) to demonstrate this, if R3 is attempting to re- +o a generic intra-domain core discovery ("bootstrap") mechanism,
join the tree (R1 is the core in figure 2) and R3 believes its best to be specified separately, and published soon.
next-hop to R1 is R6, and R6 believes R5 is its best next-hop to R1,
which sees R4 as its best next-hop to R1 -- a loop is formed. R3
begins by sending a JOIN_REQUEST (subcode ACTIVE_REJOIN, since R4 is
its child) to R6. R6 forwards the join to R5. R5 is on-tree for the
group, so responds to the active-rejoin with a JOIN-ACK, subcode NOR-
MAL (the ack traverses R6 on its way to R3).
R3 now generates a JOIN-REQUEST, subcode NACTIVE-REJOIN, and forwards This specification revision is a complete re-write of the previous
this to its parent, R6. R6 forwards the non-active rejoin to R5, its revision.
parent. R5 does similarly, as does R4. Now, the non-active rejoin has
reached R3, which originated it, so R3 concludes a loop is present on
the parent interface for the specified group. It immediately sends a
QUIT_REQUEST to R6, which in turn sends a quit if it has not received
an ACK from R5 already AND has itself a child or subnets with member
presence. If so it does not send a quit -- the loop has been broken
by R3 sending the first quit.
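The loop check performed by the non-active rejoin amounts to following
parent pointers from the originator. The sketch below assumes a
hypothetical parent_of mapping from router name to parent name for the
group, with the primary core mapped to None; it is illustrative and
ignores retransmission.

```python
def detect_loop(originator, parent_of):
    """Follow parent pointers as a non-active rejoin would be forwarded.

    Returns True if the rejoin comes back to its originator, and False
    if it reaches the primary core (the router with no parent).
    """
    hop = parent_of[originator]
    seen = set()
    while hop is not None:
        if hop == originator:
            return True          # rejoin arrived back via a child interface
        if hop in seen:
            raise ValueError("loop that does not include the originator")
        seen.add(hop)
        hop = parent_of[hop]
    return False                 # reached the primary core
```

For the loop described above, the mapping R3 -> R6 -> R5 -> R4 -> R3
yields True, so R3 would send a QUIT_REQUEST to R6.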
QUIT_REQUESTs are typically acknowledged by means of a QUIT_ACK. A 2. Introduction & Terminology
child removes its parent information immediately subsequent to send-
ing its first QUIT-REQUEST. The ack here serves to notify the (old)
child that it (the parent) has in fact removed its child information.
However, there might be cases where, due to failure, the parent can-
not respond. The child sends a QUIT-REQUEST a maximum of three
times, at PEND-QUIT-INTERVAL (5 sec) intervals.
------ In CBT, a "core router" (or just "core") is a router which is configured
| R1 | to act as a "meeting point" between a sender and group receivers. The
------ term "rendezvous point (RP)" is used equivalently in some contexts
| [2]. Each core router is configured to know it is a core router.
---------------------------
|
------
| R2 |
------
|
---------------------------
| |
------ |
| R3 |--------------------------|
------ |
| |
--------------------------- |
| | ------
------ | | |
| R4 | |-------| R6 |
------ | |----|
| |
--------------------------- |
| |
------ |
| R5 |--------------------------|
------ |
|
Figure 2: Example Loop Topology A router that is part of a CBT distribution tree is known as an "on-
tree" router. An on-tree router maintains active state for the group.
In another scenario the rejoin travels over a loop-free path, and the We refer to a broadcast interface as any interface that supports mul-
first on-tree router encountered is the primary core, R1. In figure ticast transmission.
2, R3 sends a join, subcode REJOIN_ACTIVE to R2, the next-hop on the
path to core R1. R2 forwards the re-join to R1, the primary core,
which returns a JOIN-ACK, subcode PRIMARY-REJOIN-ACK, over the
reverse-path of the rejoin-active. Whenever a router receives a PRI-
MARY-REJOIN-ACK no loop detection is necessary.
If we assume R2 is on tree for the corresponding group, R3 sends a An "upstream" interface (or router) is one which is on the path
join, subcode REJOIN_ACTIVE to R2, which replies with a join ack, towards the group's core router with respect to this router. A "down-
subcode NORMAL. R3 must then generate a loop detection packet (join stream" interface (or router) is one which is on the path away from
request, subcode REJOIN-NACTIVE) which is forwarded to its parent, the group's core router with respect to this router.
R2, which does similarly. On receipt of the rejoin-Nactive, the pri-
mary core unicasts a join ack back directly to R3, with subcode PRI-
MARY-NACTIVE-ACK. This confirms to R3 that its rejoin does not form
a loop.
5. Data Packet Loops Other terminology is introduced in its context throughout the text.
The CBT protocol builds a loop-free distribution tree. If all routers 3. CBT Functional Overview
that comprise a particular tree function correctly, data packets
should never traverse a tree branch more than once (footnote 4).
CBT mode data packets from a non-member sender must arrive on a tree The CBT protocol is designed to build and maintain a shared multicast
via an "off-tree" interface. The CBT mode data packet's header distribution tree that spans only those networks and links leading to
includes an "on-tree" field, which contains the value 0x00 until the interested receivers.
data packet reaches an on-tree router. The first on-tree router must
convert this value to 0xff. This value remains unchanged, and from
here on the packet should traverse only on-tree interfaces. If an
encapsulated packet happens to "wander" off-tree and back on again,
an on-tree router will receive the CBT encapsulated packet via an
off-tree interface. However, this router will recognise that the "on-
tree" field of the encapsulating CBT header is set to 0xff, and so
immediately discards the packet.
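The "on-tree" field check can be sketched as a small decision
function. The function name and its return convention are
hypothetical; only the 0x00 and 0xff values come from the text, and a
packet arriving via an on-tree interface is assumed to be forwarded
with the field unchanged.

```python
OFF_TREE, ON_TREE = 0x00, 0xFF


def process_cbt_packet(on_tree_field, arrived_on_tree_interface):
    """Return (action, new_field) for a CBT mode data packet at an on-tree router."""
    if arrived_on_tree_interface:
        return "forward", on_tree_field
    if on_tree_field == ON_TREE:
        # The packet was on the tree, wandered off, and came back: drop it.
        return "discard", on_tree_field
    # First on-tree router for a packet from a non-member sender.
    return "forward", ON_TREE
```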
_________________________ To achieve this, a host first expresses its interest in joining a
4 The exception to this is when CBT mode is operating group by multicasting an IGMP host membership report [3] across its
between CBT routers connected to a multi-access link; a attached link. On receiving this report, a local CBT aware router
data packet may traverse the link in native mode (if invokes the tree joining process (unless it has already) by generat-
group members are present on the link), as well as CBT ing a JOIN_REQUEST message, which is sent to the next hop on the path
mode for sending the data between CBT routers on the towards the group's core router (how the local router discovers which
core to join is discussed in section 7). This join message must be
explicitly acknowledged (JOIN_ACK) either by the core router itself,
or by another router that is on the unicast path between the sending
router and the core, which itself has already successfully joined the
tree. tree.
6. Data Packet Forwarding Rules The join message sets up transient join state in the routers it tra-
verses, and this state consists of <group, incoming interface, outgo-
6.1. Native Mode ing interface>. "Incoming interface" and "outgoing interface" may be
"previous hop" and "next hop", respectively, if the corresponding
In native mode, when a CBT router receives a data packet, the packet links do not support multicast transmission. "Previous hop" is taken
may only be forwarded over outgoing tree interfaces (member subnets from the incoming control packet's IP source address, and "next hop"
and interfaces leading to outgoing on-tree neighbours) iff it has is gleaned from the routing table - the next hop to the specified
been received via a valid on-tree interface (or the packet has core address. This transient state eventually times out unless it is
arrived encapsulated from a non-member, i.e. off-tree, sender). Oth- "confirmed" with a join acknowledgement (JOIN_ACK) from upstream. The
erwise, the packet is discarded. JOIN_ACK traverses the reverse path of the corresponding join mes-
sage, which is possible due to the presence of the transient join
Before a packet is forwarded by a subnet's DR, provided the packet's state. Once the acknowledgement reaches the router that originated
TTL is greater than 1, the packet's TTL is decremented. the join message, the new receiver can receive traffic sent to the
group.
6.2. CBT Mode
In CBT mode, routers ignore all non-locally originated native mode
multicast data packets. Locally-originated multicast data is only
processed by a subnet's DR; in this case, the DR forwards the native
multicast data packet, TTL 1, over any outgoing member subnets for
which that router is DR. Additionally, the DR encapsulates the
locally-originated multicast and forwards it, CBT mode, over all tree
interfaces, as dictated by the CBT forwarding database.
When a router, operating in CBT mode, receives a CBT-mode encapsu-
lated data packet, it decapsulates one copy to send, native mode and
TTL 1, over any directly attached member subnets for which it is DR.
Additionally, an encapsulated copy is forwarded over all outgoing
tree interfaces, as dictated by its CBT forwarding database.
Like the outer encapsulating IP header, the TTL value of the encapsu-
lating CBT header is decremented each time it is processed by a CBT
router.
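
As a rough sketch of the CBT mode rule for a received encapsulated
packet, the function below lists the transmissions a router would
make. The GroupEntry layout, interface names, and tuple format are
illustrative assumptions; excluding the arrival interface from the
outgoing set is also assumed. Discarding on TTL expiry is not
modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class GroupEntry:
    # Hypothetical per-group forwarding database entry.
    tree_interfaces: set = field(default_factory=set)
    dr_member_subnets: set = field(default_factory=set)


def forward_cbt_mode(entry, arrival_if, cbt_ttl):
    """Return (interface, mode, ttl) tuples for one received packet."""
    # One decapsulated copy, native mode and TTL 1, per member subnet
    # for which this router is DR.
    sends = [(s, "native", 1) for s in sorted(entry.dr_member_subnets)]
    # The CBT header TTL is decremented each time it is processed.
    out_ttl = cbt_ttl - 1
    # An encapsulated copy goes over every other tree interface.
    for ifc in sorted(entry.tree_interfaces - {arrival_if}):
        sends.append((ifc, "cbt", out_ttl))
    return sends
```

For instance, a leaf router that is DR for two member subnets and has
only a parent interface sends two native copies and no encapsulated
ones.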
An example of CBT mode forwarding is provided towards the end of the
next section.
7. CBT Mode -- Encapsulation Details
In a multi-protocol environment, whose infrastructure may include
non-multicast-capable routers, it is necessary to tunnel data packets
between CBT-capable routers. This is called "CBT mode". Data packets
are de-capsulated by CBT routers (such that they become native mode
data packets) before being forwarded over subnets with member hosts.
When multicasting (native mode) to member hosts, the TTL value of the
original IP header is set to one. CBT mode encapsulation is as fol-
lows:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| encaps IP hdr | CBT hdr | original IP hdr | data ....|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Figure 3. Encapsulation for CBT mode
The TTL value of the CBT header is set by the encapsulating CBT
router directly attached to the origin of a data packet. This value
is decremented each time it is processed by a CBT router. An encap-
sulated data packet is discarded when the CBT header TTL value
reaches zero.
The purpose of the (outer) encapsulating IP header is to "tunnel"
data packets between CBT-capable routers (or "islands"). The outer IP
header's TTL value is set to the "length" of the corresponding tunnel,
or MAX_TTL (255) if this is not known or is subject to change.
It is worth pointing out here the distinction between subnetworks and
tree branches (especially apparent in CBT mode), although they can be
one and the same. For example, a multi-access subnetwork containing
routers and end-systems could potentially be both a CBT tree branch
and a subnetwork with group member presence. A tree branch which is
not simultaneously a subnetwork is either a "tunnel" or a point-to-
point link.
In CBT mode there are three forwarding methods used by CBT routers:
+o IP multicasting. This method sends an unaltered (unencapsulated)
data packet across a directly-connected subnetwork with group
member presence. Any host originating multicast data does so
in this form.
+o CBT unicasting. This method is used for sending data packets
encapsulated (as illustrated above) across a tunnel or point-to-
point link; the IP destination address of the encapsulating IP
header is a unicast address. En/de-capsulation takes place in
CBT routers.
+o CBT multicasting. A CBT router on a multi-access link can take
advantage of multicast in the case where multiple on-tree neigh-
bours are reachable across a single physical link; the outer
encapsulating IP header contains a multicast address as its des-
tination address. The IP module of end-systems on the same link
subscribed to the same group will discard these multicasts since
the CBT payload type (protocol id) of the outer IP header is not
recognizable by hosts.
CBT routers create forwarding database (db) entries whenever they
send or receive a JOIN_ACK. The forwarding database describes the
parent-child relationships on a per-group basis. A forwarding
database entry dictates over which tree interfaces, and how (unicast
or multicast) a data packet is to be sent.
Note that a CBT forwarding db is required for both CBT-mode and
native-mode multicasting.
Using our example topology in figure 1, let's assume the CBT routers
are operating in CBT mode.
Member G originates an IP multicast (native mode) packet. R8 is the
DR for subnet S10. R8 therefore sends a (native mode, TTL 1) copy
over any member subnets for which it is DR - S14 and S10 (the copy
over S10 is not sent, since the packet was originally received from
S10). The multicast packet is CBT mode encapsulated by R8, and uni-
cast to each of its children, R9 and R12; these children are not
reachable over the same interface, otherwise R8 could have sent a CBT
mode multicast. R9, the DR for S12, need not IP multicast (native
mode) onto S12 since there are no members present there. R9 unicasts
the packet in CBT mode to R10, which is the DR for S13 and S15. R10
decapsulates the CBT mode packet and IP multicasts (native mode, TTL
1) to each of S13 and S15.
Going upstream from R8, R8 CBT mode unicasts to R4. R4 is DR for all
directly connected subnets and therefore IP multicasts (native mode)
the data packet onto S5, S6 and S7, all of which have member pres-
ence. R4 unicasts, CBT mode, the packet to all outgoing children, R3
and R7 (NOTE: R4 does not have a parent since it is the primary core
router for the group). R7 IP multicasts (native mode) onto S9. R3 CBT
mode unicasts to R1 and R2, its children. Finally, R1 IP multicasts
(native mode) onto S1 and S3, and R2 IP multicasts (native mode) onto
S4.
8. Non-Member Sending
For a multicast data packet to span beyond the scope of the originat-
ing subnetwork at least one CBT-capable router must be present on
that subnetwork. The DR for the group on the subnetwork must encap-
sulate the (native) IP-style packet and unicast it to a core for the
group (footnote 5). The encapsulation required is shown in figure 3;
CBT mode encapsulation is necessary so the receiving CBT router can
demultiplex the packet accordingly.
If the encapsulated packet hits the tree at an on-tree router, the
packet is forwarded according to the forwarding rules of section 6.1
or 6.2, depending on whether the receiving router is operating in
native- or CBT mode. Note that it is possible for the different
interfaces of a router to operate in different (and independent)
modes.
If the first on-tree router encountered is the target core, various Loops cannot be created in a CBT tree because a) there is only one
scenarios define what happens next: active core per group, and b) tree building/maintenance scenarios
which may lead to the creation of tree loops are avoided. For exam-
ple, if a router's upstream neighbour becomes unreachable, the router
immediately "flushes" all of its downstream branches, allowing them
to individually rejoin if necessary. Transient unicast loops do not
pose a threat because a new join message that loops back on itself
will never get acknowledged, and thus eventually times out.
+o if the target core is not the primary, and the target core has The state created in routers by the sending or receiving of a
not yet joined the tree (because it has not yet itself received JOIN_ACK is bi-directional - data can flow either way along a tree
any join-requests), the target core simply forwards the encapsu- "branch", and the state is group specific - it consists of the group
lated packet to the primary core; the primary core IP address is address and a list of local interfaces over which join messages for
included in the encapsulating CBT data packet header. the group have previously been acknowledged. There is no concept of
"incoming" or "outgoing" interfaces, though it is necessary to be
able to distinguish the upstream interface from any downstream inter-
faces. In CBT, these interfaces are known as the "parent" and "child"
interfaces, respectively. We recommend the parent be distinguished as
such by a single bit in each multicast forwarding cache entry.
if the target core is not the primary, but has children, the With regards to the information contained in the multicast forwarding
target core forwards the data according to the rules of section cache, on link types not supporting native multicast transmission an
6. on-tree router must store the address of a parent and any children.
_________________________ On links supporting multicast however, parent and any child informa-
5 It is assumed that CBT-capable routers discover tion is represented with local interface addresses (or similar iden-
<core, group> mappings by means of some discovery pro- tifying information, such as an interface "index") over which the
tocol. Such a protocol is outside the scope of this parent or child is reachable.
document.
+o if the target core is the primary, the primary forwards the data When a multicast data packet arrives at a router, the router uses the
according to the rules of section 6.2. group address as an index into the multicast forwarding cache. A copy
of the incoming multicast data packet is forwarded over each inter-
face (or to each address) listed in the entry except the incoming
interface.
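
The forwarding cache lookup described above can be sketched as
follows. The cache layout (a mapping from group address to a mapping
of interface name to a parent flag) is an assumption made for
illustration; the text only requires that the parent be
distinguishable, for example by a single bit.

```python
def forward_targets(cache, group, incoming):
    """Return the interfaces over which a data packet is copied.

    cache: {group: {interface: is_parent}} (hypothetical layout).
    """
    entry = cache.get(group)
    if entry is None:
        # No state for this group: nothing to forward over.
        return []
    # Copy over every listed interface except the incoming one; the
    # parent flag plays no part because the state is bi-directional.
    return [ifc for ifc in entry if ifc != incoming]
```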
9. Eliminating the Topology-Discovery Protocol in the Presence of Tun- Each router that comprises a CBT multicast tree, except the core
nels router, is responsible for maintaining its upstream link, provided it
has interested downstream receivers, i.e. the child interface list is
non-NULL. A child interface is one over which a member host is
directly attached, or one over which a downstream on-tree router is
attached. This "tree maintenance" is achieved by each downstream
router periodically sending a CBT "keepalive" message (ECHO_REQUEST)
to its upstream neighbour, i.e. its parent router on the tree. One
keepalive message is sent to represent entries with the same parent,
thereby improving scalability on links which are shared by many
groups. On multicast capable links, a keepalive is multicast to the
"all-cbt-routers" group (IANA assigned as 224.0.0.15); this has a
suppressing effect on any other router for which the link is its par-
ent link. If a parent link does not support multicast transmission,
keepalives are unicast.
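
One way to picture the keepalive aggregation described above: a
router sends one ECHO_REQUEST per distinct parent, covering every
group entry that shares that parent and still has children. The
entry layout below is a hypothetical one chosen for the sketch.

```python
def keepalive_parents(cache):
    """Return the parents that need one ECHO_REQUEST each.

    cache: {group: {"parent": name or None, "children": set}}
    (hypothetical layout; the core has parent None).
    """
    parents = set()
    for entry in cache.values():
        # Only maintain the upstream link while the child interface
        # list is non-empty; the core has no parent to maintain.
        if entry["parent"] is not None and entry["children"]:
            parents.add(entry["parent"])
    return sorted(parents)
```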
Traditionally, multicast protocols operating within a virtual topol- The receipt of a keepalive message over a valid child interface imme-
ogy, i.e. an overlay of the physical topology, have required the diately prompts a response (ECHO_REPLY), which is either unicast or
assistance of a multicast topology discovery protocol, such as that multicast, as appropriate.
present in DVMRP [1]. However, it is possible to have a multicast
protocol operate within a virtual topology without the need for a
multicast topology discovery protocol. One way to achieve this is by
having a router configure all its tunnels to its virtual neighbours
in advance. A tunnel is identified by a local interface address and a
remote interface address. Routing is replaced by "ranking" each such
tunnel interface associated with a particular core address; if the
highest-ranked route is unavailable (tunnel end-points are required
to run an Hello-like protocol between themselves) then the next-
highest ranked available route is selected, and so on. The exact
specification of the Hello protocol is outside the scope of this doc-
ument.
CBT trees are built using the same join/join-ack mechanisms as The ECHO_REQUEST does not contain any group information; the
before, only now some branches of a delivery tree run in native mode, ECHO_REPLY does, but only periodically. To maintain consistent infor-
whilst others (tunnels) run in CBT mode. Underlying unicast routing mation between parent and child,
dictates which interface a packet should be forwarded over. Each the parent periodically reports, in an ECHO_REPLY, all groups for
interface is configured as either native mode or CBT mode, so a which it has state, over each of its child interfaces for those
packet can be encapsulated (decapsulated) accordingly. groups. This group-carrying echo reply is not prompted explicitly by
the receipt of an echo request message. A child is notified of the
time to expect the next echo reply message containing group informa-
tion in an echo reply prompted by a child's echo request. The fre-
quency of parent group reporting is at the granularity of minutes.
As an example, router R's configuration would be as follows: It cannot be assumed all of the routers on a multi-access link have a
uniform view of unicast routing; this is particularly the case when a
multi-access link spans two or more unicast routing domains. This
could lead to multiple upstream tree branches being formed (an error
condition) unless steps are taken to ensure all routers on the link
agree which is the upstream router for a particular group. CBT
routers attached to a multi-access link participate in an explicit
election mechanism that elects a single router, the designated router
(DR), as the link's upstream router for all groups. Since the DR
might not be the link's best next-hop for a particular core router,
this may result in join messages being re-directed back across a
multi-access link. If this happens, the re-directed join message is
unicast across the link by the DR to the best next-hop, thereby pre-
venting a looping scenario. This re-direction only ever applies to
join messages. Whilst this is suboptimal for join messages, which
are generated infrequently, multicast data never traverses a link
more than once (either natively, or encapsulated).
intf type mode remote addr In all but the exception case described above, all CBT control mes-
----------------------------------- sages are multicast over multicast supporting links to the "all-cbt-
#1 phys native - routers" group, with IP TTL 1. The IP source address of CBT control
#2 tunnel cbt 128.16.8.117 messages is the outgoing interface of the sending router. The IP des-
#3 phys native - tination address of CBT control messages is either the "all-cbt-
#4 tunnel cbt 128.16.6.8 routers" group address, or the IP address of a router reachable over
#5 tunnel cbt 128.96.41.1 one of the sending router's interfaces, depending on whether the
sender's outgoing link supports multicast transmission. All the nec-
essary addressing information is obtained as part of tree set up.
core backup-intfs If CBT is implemented over a tunnelled topology, when sending a CBT
-------------------- control packet over a tunnel interface, the sending router uses as
A #5, #2 the packet's IP source address the local tunnel end point address,
B #3, #5 and the remote tunnel end point address as the packet's IP destina-
C #2, #4 tion address.
The CBT forwarding database needs to be slightly modified to accommo- 4. Protocol Specification Details
date an extra field, "backup-intfs" (backup interfaces). The entry in
this field specifies a backup interface whenever a tunnel interface
specified in the forwarding db is down. Additional backups (should
the first-listed backup be down) are specified for each core in the
core backup table. For example, if interface (tunnel) #2 were down,
and the target core of a CBT control packet were core A, the core
backup table suggests using interface #5 as a replacement. If inter-
face #5 happened to be down also, then the same table recommends
interface #2 as a backup for core A.
10. CBT Packet Formats and Message Types Details of the CBT protocol are presented in the context of a single
router implementation.
We distinguish between two types of CBT packet: CBT mode data pack- 4.1. CBT HELLO Protocol
ets, and CBT control packets. CBT control packets carry a CBT control
packet header.
CBT control packets are encapsulated in IP, as illustrated below: The HELLO protocol is used to elect a designated router (DR) on
broadcast-type links. It is also used to elect a designated border
router (BR) when interconnecting a CBT domain with other domains (see
[5]).
+++++++++++++++++++++++++++++++ A router represents its status as a link's DR by setting the DR-flag
| IP header | CBT control pkt | on that interface; a DR flag is associated with each of a router's
+++++++++++++++++++++++++++++++ broadcast interfaces. This flag can only assume one of two values:
TRUE or FALSE. By default, this flag is FALSE.
In CBT mode, the original data packet is encapsulated in a CBT header HELLO messages are multicast periodically to the all-cbt-routers
and an IP header, as illustrated below: group, 224.0.0.15, using IP TTL 1. The advertisement period is
[HELLO_TIMER] seconds. [HELLO_TIMER] comprises a configured
[HELLO_INTERVAL], to which is added [RND_RSP] seconds - a random
response interval. This random response additive is required to
avoid the potential problem of synchronisation between HELLO adver-
tisements (or other control messages) from different routers. The
HELLO protocol's convergence time is set at [HELLO_CONV] seconds -
the time after which no further HELLOs are expected in any one round
of the protocol.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Each HELLO advertising router includes the upper bound of its
| IP header | CBT header | original IP hdr | data .... | [RND_RSP] timer in its HELLO advertisements. This is necessary so
++++++++++++++++++++++++++++++++++++++++++++++++++++++++ that all routers attached to the link can agree on a common HELLO
convergence time [HELLO_CONV]; in any one round of the HELLO protocol,
a router assumes the minimum of the upper bound of its configured
[RND_RSP] and that of any received advertisement. The minimum
upper bound is then used as this router's [RND_RSP] upper bound in
the next round of the protocol. [HELLO_CONV] is set to this minimum
upper bound + 2 seconds (the 2 seconds being a response "safety mar-
gin") for the next round of the protocol.
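
The timer arithmetic above amounts to the following small
calculation. The function and parameter names are illustrative
only; the 2 second safety margin is the value given in the text.

```python
SAFETY_MARGIN = 2  # seconds, the response "safety margin"


def next_round_timers(configured_upper, received_uppers):
    """Return (rnd_rsp_upper, hello_conv) for the next round.

    configured_upper: this router's configured [RND_RSP] upper bound.
    received_uppers: upper bounds seen in received HELLOs this round.
    """
    min_upper = min([configured_upper, *received_uppers])
    return min_upper, min_upper + SAFETY_MARGIN
```

For example, a router configured with an upper bound of 10 seconds
that hears advertisements carrying 8 and 12 would use 8 seconds as
its [RND_RSP] upper bound and 10 seconds as [HELLO_CONV].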
The IP protocol field of the inner (original) IP header is used to A network manager can preference a router's DR eligibility by option-
demultiplex a packet correctly; CBT has been assigned IP protocol ally configuring a HELLO preference. Valid configuration values range
number 7. The CBT module then demultiplexes based on the encapsulat- from 1 to 254 (decimal), 1 representing the "most eligible" value. In
ing CBT header's "type" field, thereby distinguishing between CBT the absence of explicit configuration, a router assumes the default
control packets and CBT mode data packets. HELLO preference value of 255. The elected DR uses HELLO preference
zero (0) in HELLO advertisements, irrespective of any configured
preference. The DR continues to use preference zero for as long as
it is running.
The CBT data packet header is illustrated below. The DR election winner is that which advertises the lowest HELLO
preference, or the lowest-addressed in the event of a tie.
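
The election rule above is an ordering on (preference, address)
pairs. A minimal sketch, assuming advertisements are available as
(preference, dotted-quad address) tuples; addresses are compared
numerically, since comparing them as strings would give the wrong
order.

```python
import ipaddress


def hello_key(preference, address):
    """Sort key: lower preference first, then lower address."""
    return (preference, int(ipaddress.IPv4Address(address)))


def elect_dr(adverts):
    """Return the winning (preference, address) advertisement."""
    return min(adverts, key=lambda a: hello_key(*a))
```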
10.1. CBT Header Format (for CBT Mode data) The situation where two or more routers attached to the same broad-
cast link are advertising HELLO preference 0 should never arise. How-
ever, should this situation arise, all but the lowest addressed zero-
advertising router relinquishes its claim as DR immediately by unset-
ting the DR flag on the corresponding interface. The relinquishing
router(s) subsequently advertise their previously used preference
value in HELLO advertisements.
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4.1.1. Sending HELLOs
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers |unused | type | hdr length | on-tree|unused|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| checksum | IP TTL | unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group identifier |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| first-hop router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| primary core |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved | reserved |T|S| Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| .....Flow-id value..... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused | unused | Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| .....Security data...... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4. CBT Header When a router starts up, it multicasts two HELLO messages over each
of its broadcast interfaces in succession. The DR flag is initially
unset (FALSE) on each broadcast interface.
Each of the fields is described below: A router sends a HELLO message whenever its [HELLO_TIMER] expires.
+o Vers: Version number -- this release specifies version 1. Whenever a router sends a HELLO message, it resets its [HELLO_TIMER].
+o type: indicates CBT payload; values are defined for control 4.1.2. Receiving HELLOs
(0x00), and data (0xff). For the value 0x00 (control), a CBT
control header is assumed present rather than a CBT header.
+o hdr length: length of the header, for purpose of checksum On receipt of any HELLO message, a router adjusts its [RND_RSP] upper
calculation. bound to the minimum of this router's configured [RND_RSP] upper
bound and that received in the received HELLO. The router also
adjusts its [HELLO_CONV] as described above.
+o on-tree: indicates whether the packet is on-tree (0xff) or A router need not respond to a HELLO message if the received HELLO is
off-tree (0x00). "better" than its own. Thus, in steady state, the HELLO protocol
incurs very little traffic overhead.
+o checksum: the 16-bit one's complement of the one's complement If the received HELLO message is "better" (lower preferenced, or
of the CBT header, calculated across all fields. equally preferenced but lower addressed) than it would send itself,
it immediately unsets its DR flag on the arriving interface if the DR
flag is set on that interface. It also resets its [HELLO_TIMER].
+o IP TTL: TTL value corresponding to the value of the IP TTL If the received HELLO message is not "better" than this router would
value of the original multicast packet, and set in the CBT send itself, it sets its [RND_RSP] random response timer; on expiry,
header by the DR directly attached to the origin host (decre- the router responds with its own HELLO message . If no "better" HELLO
mented by CBT routers visited). message is received within the current [HELLO_CONV], the router sets
the DR flag on the corresponding interface.
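
A simplified sketch of the per-interface HELLO receive logic is
given below. The class, method names, and returned strings are
hypothetical, timers are represented only by the action returned,
and the sketch assumes that the HELLO a DR "would send itself"
carries preference zero, as stated for an elected DR.

```python
import ipaddress


class HelloInterface:
    """Hypothetical per-interface DR election state."""

    def __init__(self, preference, address):
        self.preference = preference
        self.address = int(ipaddress.IPv4Address(address))
        self.dr_flag = False      # FALSE by default
        self.better_seen = False  # "better" HELLO seen this round

    def own_key(self):
        # An elected DR advertises preference zero.
        pref = 0 if self.dr_flag else self.preference
        return (pref, self.address)

    def on_hello(self, preference, address):
        received = (preference, int(ipaddress.IPv4Address(address)))
        if received < self.own_key():
            # Received HELLO is "better": give up any DR claim.
            self.better_seen = True
            self.dr_flag = False
            return "reset HELLO_TIMER"
        # Otherwise answer with our own HELLO after a random delay.
        return "start RND_RSP"

    def on_convergence(self):
        # Called when [HELLO_CONV] expires for the current round.
        if not self.better_seen:
            self.dr_flag = True
        self.better_seen = False
        return self.dr_flag
```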
+o group identifier: multicast group address. 4.2. JOIN_REQUEST Processing
+o first-hop router: identifies the encapsulating router A JOIN_REQUEST is the CBT control message used to register a member
directly attached to the origin of a multicast packet. This host's interest in joining the distribution tree for the group.
field is relevant to source-migration of a core to the source
(see Appendix A). It is set to NULL when core migration is
disabled.
+o primary core: the primary core for the group, as identified 4.2.1. Sending JOIN_REQUESTs
by "group-id". This field is necessary for the case where
non-member senders happen to send to a secondary core, which
may not yet be joined to the primary core. This field allows
the secondary to know which is the primary for the group, so
that the secondary can forward the (encapsulated) data
onwards to the primary.
+o T bit: indicates the presence (1) or absence (0) of Type of A JOIN_REQUEST can only ever be originated by a leaf router, i.e. a
Service/flow-id value ("type", "length", "type of ser- router with directly attached member hosts. This join message is sent
vice/flow-id") . hop-by-hop towards the core router for the group (see section 7).
The originating router caches <group, NULL, upstream interface> state
for each join it originates. This state is known as "transient join
state". The absence of a "downstream interface" (NULL) indicates
that this router is the join message originator, and is therefore
responsible for any retransmissions of this message if a response is
not received within [JOIN_RTX_INTERVAL]. It is an error if no
response is received after [JOIN_TIMEOUT] seconds. If this error
condition occurs, the joining process may be re-invoked by the
receipt of the next IGMP host membership report from a locally
attached member host.
+o S bit: indicates the presence (1) or absence (0) of a secu- Note that if the interface over which a JOIN_REQUEST is to be sent
rity value ("type", "length", "security data"). supports multicast, the JOIN_REQUEST is multicast to the all-cbt-
routers group, using IP TTL 1. If the link does not support multi-
cast, the JOIN_REQUEST is unicast to the next hop on the unicast path
to the group's core.
10.2. Control Packet Header Format 4.2.2. Receiving JOIN_REQUESTs
The individual fields are described below. On broadcast links, JOIN_REQUESTs which are multicast may only be
forwarded by the link's DR. Other routers attached to the link may
process the join (see below). JOIN_REQUESTs which are multicast over
a point-to-point link are only processed by the router on the link
which does not have a local interface corresponding to the join's
network layer (IP) source address. Unicast JOIN_REQUESTs may only be
processed by the router which has a local interface corresponding to
the join's network layer (IP) destination address.
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 With regard to forwarding a received JOIN_REQUEST, if the receiving
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ router is not on-tree for the group, and is not the group's core
| vers |unused | type | code | # cores | router, the join is forwarded to the next hop on the path towards the
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ core. The join is multicast, or unicast, according to whether the
| hdr length | checksum | outgoing interface supports multicast. The router caches the follow-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ing information with respect to the forwarded join: <group, down-
| group identifier | stream interface, upstream interface>.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group mask |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packet origin |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| primary core address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| target core address (core #1) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Core #2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Core #3 |
| .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved | reserved |T|S| Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| type of service/flow-id |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused | unused | Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| .....Security data..... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 5. CBT Control Packet Header If this transient join state is not "confirmed" with a join acknowl-
edgement (JOIN_ACK) message from upstream, the state is timed out
after 1.5 times [JOIN_RTX_INTERVAL].
+o Vers: Version number -- this release specifies version 1. If the receiving router is the group's core router, the join is "ter-
minated" and acknowledged by means of a JOIN_ACK. Similarly, if the
router is on-tree and the JOIN_REQUEST arrives over an interface that
is not the upstream interface for the group, the join is acknowl-
edged.
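
The forward-or-acknowledge decision for a received JOIN_REQUEST can
be sketched as below. All names are illustrative. Transient state is
keyed by group alone, which is a simplifying assumption, and the
case of a join arriving over the upstream interface of an on-tree
router is left as "no-action" because it is not covered by the rules
above.

```python
def handle_join_request(group, arrival_if, is_core, parent_if,
                        next_hop_if, transient):
    """Return "ack", "forward" or "no-action".

    parent_if is None when the router is not on-tree for the group.
    transient maps group -> (downstream_if, upstream_if).
    """
    if is_core:
        # The core terminates the join and acknowledges it.
        return "ack"
    if parent_if is not None:
        # On-tree: acknowledge unless the join came from upstream.
        return "ack" if arrival_if != parent_if else "no-action"
    # Not on-tree and not the core: cache transient join state and
    # forward towards the core.
    transient[group] = (arrival_if, next_hop_if)
    return "forward"
```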
+o type: indicates control message type (see section 10.3). If [RND_RSP] pertaining to a JOIN_REQUEST is active (i.e. running),
if a JOIN_REQUEST is received for the same group over that group's
parent interface, cancel [RND_RSP] for the impending JOIN_REQUEST.
+o code: indicates subcode of control message type. If this router has a cache-deletion-timer [CACHE_DEL_TIMER] running
on the arrival interface for the group specified in a multicast join,
the timer is cancelled.
+o # cores: number of core addresses carried by this control If a multicast JOIN_REQUEST is received and the QUIT_TIME bit (see
packet. section 4.4.1) is set on the arrival interface for the specified
group, unset the QUIT_TIME bit.
+o header length: length of the header, for purpose of checksum 4.3. JOIN_ACK Processing
calculation.
+o checksum: the 16-bit one's complement of the one's complement A JOIN_ACK is the mechanism by which an interface is added to a
of the CBT control header, calculated across all fields. router's multicast forwarding cache; thus, the interface becomes part
of the group distribution tree.
+o group identifier: multicast group address. 4.3.1. Sending JOIN_ACKs
+o group mask: mask value for aggregated CBT joins/join-acks. The JOIN_ACK is sent over the same interface as the corresponding
Zero for non-aggregated joins/join-acks. JOIN_REQUEST was received. The sending of the acknowledgement causes
the router to add the interface to its child interface list in its
forwarding cache for the group, if it is not already. If the router
does not yet have active state for this group, this router must be
the core router for the group; the core creates a forwarding cache
entry and includes the interface in its child interface list, and
sends the JOIN_ACK downstream.
+o packet origin: address of the CBT router that originated the A JOIN_ACK is multicast or unicast, according to whether the outgoing
control packet. interface supports multicast transmission or not.
+o primary core address: the address of the primary core for the 4.3.2. Receiving JOIN_ACKs
group.
+o target core address: desired core affiliation of control mes- The group and arrival interface must be matched to a <group, ....,
sage. upstream interface> from the router's cached transient state. If no
match is found, the JOIN_ACK is discarded. If a match is found, a
CBT forwarding cache entry for the group is created, with "upstream
interface" marked as the group's parent interface.
+o Core #N: IP address for each of a group's cores. If "downstream interface" in the cached transient state is NULL, the
JOIN_ACK has reached the originator of the corresponding
JOIN_REQUEST; the JOIN_ACK is not forwarded downstream. If "down-
stream interface" is non-NULL, a JOIN_ACK for the group is sent over
the "downstream interface" (multicast or unicast, accordingly). This
interface is installed in the child interface list of the group's
forwarding cache entry.
+o T bit: indicates the presence (1) or absence (0) of Type of Once transient state has been confirmed by transferring it to the
Service/flow-id value ("type", "length", "type of ser- forwarding cache, the transient state is deleted.
vice/flow-id") .
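The JOIN_ACK receipt rules of section 4.3.2 (in the -07 text) can be sketched roughly as follows. This is an illustration only: the names Router, ForwardingEntry, transient and on_join_ack are hypothetical and are not defined by the specification, and transient state is assumed to be keyed by (group, upstream interface).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ForwardingEntry:
    parent: str                                   # upstream (parent) interface
    children: Set[str] = field(default_factory=set)


class Router:
    """Hypothetical router model; not part of the CBT specification."""

    def __init__(self) -> None:
        # (group, upstream interface) -> downstream interface, or None if
        # this router originated the corresponding JOIN_REQUEST.
        self.transient: Dict[Tuple[str, str], Optional[str]] = {}
        self.cache: Dict[str, ForwardingEntry] = {}
        self.sent: List[Tuple[str, str, str]] = []  # (message, group, interface)

    def on_join_ack(self, group: str, arrival_if: str) -> bool:
        key = (group, arrival_if)
        if key not in self.transient:
            return False                          # no match: JOIN_ACK is discarded
        # Confirming transient state deletes it.
        downstream = self.transient.pop(key)
        entry = self.cache.setdefault(group, ForwardingEntry(parent=arrival_if))
        entry.parent = arrival_if                 # arrival interface becomes the parent
        if downstream is not None:
            # Not the originator: forward the ack and install the child interface.
            self.sent.append(("JOIN_ACK", group, downstream))
            entry.children.add(downstream)
        return True
```

A router that originated the join stores None as the downstream interface, so the ack stops there and nothing is forwarded.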
+o S bit: indicates the presence (1) or absence (0) of a secu- 4.4. QUIT_NOTIFICATION Processing
rity value ("type", "length", "security data").
10.3. CBT Control Message Types A CBT tree is "pruned" in the direction downstream-to-upstream when-
ever a CBT router's child interface list for a group becomes NULL.
There are ten types of CBT message. All are encoded in the CBT con- 4.4.1. Sending QUIT_NOTIFICATIONs
trol header, shown in figure 5.
+o JOIN-REQUEST (type 1): generated by a router and unicast to A QUIT_NOTIFICATION is sent to a router's parent router on the tree
the specified core address. It is processed hop-by-hop on its whenever the router's child interface list becomes NULL.
way to the specified core. Its purpose is to establish the
originating CBT router, and all intermediate CBT routers, as
part of the corresponding delivery tree. Note that all cores
for the corresponding group are carried in join-requests.
+o JOIN-ACK (type 2): an acknowledgement to the above. The full A QUIT_NOTIFICATION is not acknowledged; once sent, all information
list of core addresses is carried in a JOIN-ACK, together pertaining to the group it represents is deleted from the forwarding
with the actual core affiliation (the join may have been ter- cache after a short interval.
minated by an on-tree router on its journey to the specified
core, and the terminating router may or may not be affiliated
to the core specified in the original join). A JOIN-ACK tra-
verses the reverse path as the corresponding JOIN-REQUEST,
with each CBT router on the path processing the ack. It is
the receipt of a JOIN-ACK that actually "fixes" tree state.
+o JOIN-NACK (type 3): a negative acknowledgement, indicating To ensure consistency between a child and parent router given the
that the tree join process has not been successful. potential for loss of a QUIT_NOTIFICATION, there is a QUIT_TIME bit
associated with the parent of each group entry; whenever a
QUIT_NOTIFICATION is sent for a group, the QUIT_TIME bit for that
group entry is set for a maximum of [QUIT_TIME] seconds before the
entry is deleted and the QUIT_TIME bit unset. By default, this bit is
unset.
+o QUIT-REQUEST (type 4): a request, sent from a child to a par- When the QUIT_TIME bit is set, if the router detects multicast traf-
ent, to be removed as a child of that parent. fic for the group arriving over a to-be-deleted parent interface (one
over which a quit has recently been sent), the router sends another
QUIT_NOTIFICATION over that interface. This is multicast, or unicast,
as appropriate for the outgoing link. It continues to do so at
[QUIT_RATE] second intervals so long as data continues to arrive, and
provided [QUIT_TIME] has not yet expired.
+o QUIT-ACK (type 5): acknowledgement to the above. If the par- If, after sending a QUIT_NOTIFICATION a multicast JOIN_REQUEST for
ent, or the path to it is down, no acknowledgement will be the specified group arrives over the interface the quit was sent, the
received within the timeout period. This results in the QUIT_TIME bit is immediately unset if it is set (any traffic arriving
child nevertheless removing its parent information. over the interface will be for/from another child router attached to
the same link).
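The repeat-quit behaviour of section 4.4.1 (in the -07 text) can be illustrated with a small sketch. It assumes, beyond what the text states, that the first repeat is sent as soon as traffic is detected and that later repeats are spaced at least [QUIT_RATE] seconds apart; the function name and the representation of traffic as a list of arrival times are hypothetical.

```python
QUIT_TIME = 60   # default, seconds
QUIT_RATE = 15   # default, seconds


def quit_resend_times(traffic_times, quit_time=QUIT_TIME, quit_rate=QUIT_RATE):
    """Return the times (seconds after the first QUIT_NOTIFICATION) at which
    a repeat QUIT_NOTIFICATION would be sent over the old parent interface."""
    resends = []
    last_sent = None
    for t in sorted(traffic_times):
        if t >= quit_time:
            break                      # entry deleted, QUIT_TIME bit unset
        if last_sent is None or t - last_sent >= quit_rate:
            resends.append(t)          # traffic still arriving: quit again
            last_sent = t
    return resends
```

With traffic seen at 1, 5, 16, 20, 31 and 70 seconds, this sketch repeats the quit at 1, 16 and 31 seconds and ignores the arrival at 70 seconds, by which time [QUIT_TIME] has expired.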
+o FLUSH-TREE (type 6): a message sent from parent to all chil- 4.4.2. Receiving QUIT_NOTIFICATIONs
dren, which traverses a complete branch. This message results
in all tree interface information being removed from each
router on the branch, possibly because of a re-configuration
scenario.
+o CBT-ECHO-REQUEST (type 7): once a tree branch is established, The group reported in the QUIT_NOTIFICATION must be matched with a
this message acts as a "keepalive", and is unicast from forwarding cache entry. If no match is found, the QUIT_NOTIFICATION

child to parent (can be aggregated from one per group to one is ignored and discarded. If a match is found, if the arrival inter-
per link. See section 4). face is a valid child interface in the group entry, how the router
proceeds depends on whether the QUIT_NOTIFICATION was multicast or
unicast.
+o CBT-ECHO-REPLY (type 8): positive reply to the above. If the QUIT_NOTIFICATION was unicast, the corresponding child inter-
face is deleted from the group's forwarding cache entry, and no fur-
ther processing is required.
+o CBT-BR-KEEPALIVE (type 9): applicable to border routers only. If the QUIT_NOTIFICATION was multicast, and the arrival interface is
See [11] for more information. a valid child interface for the specified group, the router sets a
cache-deletion-timer [CACHE_DEL_TIMER].
+o CBT-BR-KEEPALIVE-ACK (type 10): acknowledgement to the above. Because this router might be acting as a parent router for multiple
downstream routers attached to the arrival link, [CACHE_DEL_TIMER]
interval gives those routers that did not send the
QUIT_NOTIFICATION, but received it over their parent interface, the
opportunity to ensure that the parent router does not remove the link
from its child interface list.
10.3.1. CBT Control Message Subcodes Therefore, on receipt of a multicast QUIT_NOTIFICATION over a parent
interface, a receiving router starts a random response interval timer
which is set to [RND_RSP] seconds.
The JOIN-REQUEST has three valid subcodes: If a multicast JOIN_REQUEST is received over the same interface (par-
ent) for the same group before this router's [RND_RSP] timer expires,
it suppresses the multicasting of its own similar JOIN_REQUEST.
+o ACTIVE-JOIN (code 0) - sent from a CBT router that has no If a multicast JOIN_REQUEST is not received via the router's parent
children for the specified group. link before [RND_RSP] expires, a JOIN_REQUEST is multicast over the
link for the previously quit group, with IP TTL 1.
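The join suppression described in section 4.4.2 (in the -07 text) can be sketched as below. The sketch assumes that each downstream router on the link picks its own random timer value bounded by [RND_RSP] and that propagation delay on the link is negligible; neither detail is spelled out in the text, and the function name is hypothetical.

```python
def resolve_link(timers):
    """timers maps a router name to the expiry (in seconds) of its random
    response timer, started on hearing the multicast QUIT_NOTIFICATION.

    Returns (senders, suppressed): routers whose timer expires first
    multicast a JOIN_REQUEST with IP TTL 1; the rest hear that join
    before their own timer expires and stay silent."""
    first = min(timers.values())
    senders = sorted(r for r, t in timers.items() if t == first)
    suppressed = sorted(r for r, t in timers.items() if t > first)
    return senders, suppressed
```

For example, with timers of 1.4, 0.3 and 1.9 seconds for routers R1, R2 and R3, only R2 multicasts a JOIN_REQUEST and the other two suppress theirs.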
+o REJOIN-ACTIVE (code 1) - sent from a CBT router that has at 4.5. ECHO_REQUEST Processing
least one child for the specified group.
+o REJOIN-NACTIVE (code 2) - generated by a router subsequent to The ECHO_REQUEST message allows a child to monitor reachability to
receiving a join ack, subcode NORMAL, in response to a its parent router for a group (or range of groups if the parent
active-rejoin. router is the parent for multiple groups). Group information is not
carried in ECHO_REQUEST messages.
A JOIN-ACK has three valid subcodes: 4.5.1. Sending ECHO_REQUESTs
+o NORMAL (code 0) - sent by a core router, or on-tree router, Whenever a router creates a forwarding cache entry due to the receipt
acknowledging joins with subcodes ACTIVE-JOIN and REJOIN- of a JOIN_ACK, the router begins the periodic sending of ECHO_REQUEST
ACTIVE. messages over its parent interface. The ECHO_REQUEST is multicast to
the "all-cbt-routers" group over multicast-capable interfaces, and
unicast to the parent router otherwise.
+o PRIMARY-REJOIN-ACK (code 1) - sent by a primary core to ECHO_REQUEST messages are sent at [ECHO_INTERVAL] second intervals.
acknowledge the receipt of a join-request received with sub- Whenever an ECHO_REQUEST is sent, [ECHO_INTERVAL] is reset.
code REJOIN-ACTIVE. This message traverses the reverse-path
of the corresponding re-join, and is processed by each router
on that path.
+o PRIMARY-NACTIVE-ACK (code 2) - sent by a primary core to If, for any echo-request sent to a parent, the expected response
acknowledge the receipt of a join-request received with sub- (ECHO_REPLY) is not forthcoming within [ECHO_RTX_INTERVAL], the echo
code REJOIN-NACTIVE. This ack is unicast directly to the request message is retransmitted. If no response is forthcoming
router that generated the rejoin-Nactive, i.e. the ack is within [ECHO_TIMEOUT] seconds, the router sends a FLUSH_TREE message
not processed hop-by-hop. over each of its child interfaces for the group, then removes all
forwarding cache state for the group.
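As a worked example of the retransmission rule in section 4.5.1 (in the -07 text), the sketch below lists when an unanswered ECHO_REQUEST would be retransmitted before the parent is declared unreachable. It assumes retransmissions are evenly spaced at [ECHO_RTX_INTERVAL] and uses the default values from section 5; the function name is hypothetical.

```python
def echo_schedule(rtx_interval=2.0, timeout_factor=3.5):
    """Return (retransmit_times, timeout) in seconds after the first
    ECHO_REQUEST, where timeout = timeout_factor * rtx_interval."""
    timeout = timeout_factor * rtx_interval
    retransmits = []
    t = rtx_interval
    while t < timeout:
        retransmits.append(t)      # no ECHO_REPLY yet: send the request again
        t += rtx_interval
    return retransmits, timeout
```

With the defaults this gives retransmissions at 2, 4 and 6 seconds and a timeout at 7 seconds, at which point the router would send FLUSH_TREE over each child interface for the group and remove its forwarding cache state.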
11. CBT Protocol Number 4.5.2. Receiving ECHO_REQUESTs
CBT has been assigned IP protocol number 7. CBT control messages are If an ECHO_REQUEST is received over any valid child interface, the
carried directly over IP. receiving router responds with an ECHO_REPLY message over the same
interface. This message is multicast to the "all-cbt-routers" group
over multicast-capable interfaces, and unicast otherwise.
12. Default Timer Values If a multicast ECHO_REQUEST message arrives via any valid parent
interface, the router resets its [ECHO_INTERVAL] timer for that
upstream interface, thereby suppressing the sending of its own
ECHO_REQUEST over that upstream interface.
There are several CBT control messages which are transmitted at fixed 4.6. ECHO_REPLY Processing
intervals. These values, retransmission times, and timeout values,
are given below. Note these are recommended default values only, and
are configurable with each implementation (all times are in seconds):
+o CBT-ECHO-INTERVAL 30 (time between sending successive CBT-ECHO- ECHO_REPLY messages allow a child to monitor the reachability of its
REQUESTs to parent). parent, and ensure the group state information is consistent between
them.
+o PEND-JOIN-INTERVAL 5 (retransmission time for join-request if no 4.6.1. Sending ECHO_REPLY messages
ack rec'd)
+o PEND-JOIN-TIMEOUT 30 (time to try joining a different core, or An ECHO_REPLY message is sent in direct response to receiving an
give up) ECHO_REQUEST message, provided the ECHO_REQUEST is received over any
one of this router's valid child interfaces. Additionally, an
ECHO_REPLY is sent periodically by a parent router over each of its
child links, reporting all groups for which the link is its child.
+o EXPIRE-PENDING-JOIN 90 (remove transient state for join that has ECHO_REPLY messages are unicast or multicast, as appropriate.
not been ack'd)
+o PEND_QUIT_INTERVAL 5 (retransmission time for quit-request if no 4.6.2. Receiving ECHO_REPLY messages
ack rec'd)
+o CBT-ECHO-TIMEOUT 90 (time to consider parent unreachable) An ECHO_REPLY message must be received via a valid parent interface.
When received, the child router resets its [ECHO_INTERVAL] timer for
this upstream interface. The child router also caches the reported
"group report interval" (seconds) - the time at which the next group
carrying ECHO_REPLY will be sent by the parent router. Like
[ECHO_INTERVAL], this is cached per upstream interface. If the group
carrying ECHO_REPLY does not arrive shortly after "group report
interval" has expired, a QUIT_NOTIFICATION is sent for each group for
which the non-reporting router is the parent.
+o CHILD-ASSERT-INTERVAL 90 (increment child timeout if no ECHO If this echo reply carries a list of groups, the child router must
rec'd from a child) match all those of its forwarding cache entries for which the arrival
interface is the upstream interface. If the parent router does not
consider itself the parent router for group(s) which the child thinks
is its parent, the child sends a FLUSH_TREE message downstream for
each such group. If this router has directly attached members for any
of the flushed groups, the receipt of an IGMP host membership report
for any of those groups will prompt this router to rejoin the corre-
sponding tree(s).
+o CHILD-ASSERT-EXPIRE-TIME 180 (time to consider child gone) If the upstream router considers itself the parent for more groups
than does the receiving router, this router sends a QUIT_NOTIFICATION
for each of those groups for which the QUIT_TIME bit is set in the
forwarding cache. Otherwise, the router takes no action.
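The group comparison performed on receipt of a group-carrying ECHO_REPLY (section 4.6.2 of the -07 text) amounts to two set differences. The sketch below is a simplified reading: it assumes the child keeps its active groups for the upstream interface separately from the groups whose QUIT_TIME bit is set, and all names are hypothetical.

```python
def reconcile(child_groups, reported_groups, quit_time_set):
    """Compare the child's view with the parent's ECHO_REPLY group list.

    child_groups:    groups the child believes use this upstream interface
    reported_groups: groups listed by the parent in the ECHO_REPLY
    quit_time_set:   groups whose QUIT_TIME bit is currently set

    Returns (flush, quit): groups to FLUSH_TREE downstream, and groups
    for which a QUIT_NOTIFICATION is sent again."""
    child, reported = set(child_groups), set(reported_groups)
    # Parent does not consider itself parent: tear down the branch below us.
    flush = sorted(child - reported)
    # Parent still lists a group we quit: repeat the quit if the bit is set.
    quit_again = sorted(g for g in reported - child if g in quit_time_set)
    return flush, quit_again
```

If neither difference yields a group, the router takes no action.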
+o IFF-SCAN-INTERVAL 300 (scan all interfaces for group presence. 4.7. FLUSH_TREE Processing
If none, send QUIT)
+o BR-KEEPALIVE-INTERVAL 200 (backup designated BR to designated BR The FLUSH_TREE (flush) message is the mechanism by which a router
keepalive interval) invokes the tearing down of all its downstream branches for a partic-
ular group. The flush message is multicast to the "all-cbt-routers"
group when sent over multicast-capable interfaces, and unicast other-
wise.
+o BR-KEEPALIVE-RETRY-INTERVAL 30 (keepalive interval if BR fails 4.7.1. Sending FLUSH_TREE messages
to respond)
13. Interoperability Issues A FLUSH_TREE message is sent over each downstream (child) interface
when a router has lost reachability with its parent router for the
group (detected via ECHO_REQUEST and ECHO_REPLY messages). All group
state is removed from an interface over which a flush message is
sent.
Interoperability between CBT and DVMRP has recently been defined in 4.7.2. Receiving FLUSH_TREE messages
[11].
Interoperability with other multicast protocols will be fully speci- A FLUSH_TREE message must be received over the parent interface for
fied as the need arises. the specified group, otherwise the message is discarded.
14. CBT Security Architecture The flush message must be forwarded over each child interface for the
specified group.
see [4]. Once the flush message has been forwarded, all state for the group is
removed from the router's forwarding cache.
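The FLUSH_TREE receive rules of section 4.7.2 (in the -07 text) can be sketched as follows. The forwarding cache is modelled here as a plain dictionary with hypothetical "parent" and "children" keys; this layout is an assumption made for illustration.

```python
def on_flush_tree(cache, group, arrival_if):
    """Process a FLUSH_TREE for `group` arriving over `arrival_if`.

    Returns the list of (message, group, interface) tuples forwarded
    downstream; an empty list means the message was discarded."""
    entry = cache.get(group)
    if entry is None or entry["parent"] != arrival_if:
        return []                               # not from the parent: discard
    forwarded = [("FLUSH_TREE", group, child)
                 for child in sorted(entry["children"])]
    del cache[group]                            # remove all state for the group
    return forwarded
```

Note that state is removed only after the message has been forwarded over every child interface, matching the order given in the text.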
Acknowledgements 5. Timers and Default Values
Special thanks goes to Paul Francis, NTT Japan, for the original This section provides a summary of the timers described above,
brainstorming sessions that brought about this work. together with their default values.
Thanks too to Sue Thompson (Bellcore). Her detailed reviews led to +o [HELLO_INTERVAL]: a base value making up the bulk of the inter-
the identification of some subtle protocol flaws, and she suggested val between sending a HELLO message. Default: 60 seconds.
several simplifications.
Thanks also to the networking team at Bay Networks for their comments +o [RND_RSP]: router's random response interval. Default: 2 sec-
and suggestions, in particular Steve Ostrowski for his suggestion of onds.
using "native mode" as a router optimization, and Eric Crawley.
Thanks also to Ken Carlberg (SAIC) for reviewing the text, and gener- +o [HELLO_TIMER]: (variable) interval between sending HELLO mes-
ally providing constructive comments throughout. sages. [HELLO_TIMER] = [HELLO_INTERVAL + RND_RSP]
I would also like to thank the participants of the IETF IDMR working +o [HELLO_CONV]: convergence time of one round of the HELLO proto-
group meetings for their general constructive comments and sugges- col. [HELLO_CONV] = [min(RND_RSP) + 2 seconds].
tions since the inception of CBT.
APPENDICES +o [JOIN_RTX_INTERVAL]: retransmission time for JOIN_REQUESTs.
Default: 5 seconds.
DISCLAIMER: As of writing, the mechanisms described in Appendices A and +o [JOIN_TIMEOUT]: time to raise exception due to tree join fail-
B have not been tested, simulated, or demonstrated. ure. Default: 3.5 times [JOIN_RTX_INTERVAL].
APPENDIX A +o [CACHE_DEL_TIMER]: time to remove child interface from forward-
ing cache. Default: 2 seconds.
Dynamic Source-Migration of Cores +o [QUIT_TIME]: time to remove parent interface from forwarding
cache entry. Unset QUIT_TIME bit. Default: 60 seconds.
A.0 Abstract +o [QUIT_RATE]: period for sending QUIT_NOTIFICATION if traffic
persists. Default: 15 seconds.
This appendix describes CBT protocol mechanisms that allow a CBT mul- +o [ECHO_INTERVAL]: interval between sending ECHO_REQUEST to parent
ticast tree, initially constructed around a randomly-placed set of routers. Default: 60 seconds.
core routers, to dynamically reconfigure itself in response to an
active source, such that the CBT tree becomes rooted at the source's
local CBT router. Henceforth, CBT emulates a shortest-path tree.
For clarity, the mechanisms are described in the context of "flat" +o [ECHO_RTX_INTERVAL]: retransmission time for ECHO_REQUESTs.
multicasting, but are transferrable to a hierarchical model with only Default: 2 seconds.
minor changes.
A.1 Motivation +o [ECHO_TIMEOUT]: time to consider parent unreachable. Default:
3.5 times [ECHO_RTX_INTERVAL].
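The derived timer values in section 5 (of the -07 text) follow from the listed defaults by simple arithmetic, as the sketch below shows. It assumes [RND_RSP] contributes its default value to [HELLO_TIMER], and it omits [HELLO_CONV] because min(RND_RSP) is not given a numeric value in the list above; the dictionary and function names are hypothetical.

```python
DEFAULTS = {                       # seconds, as listed in the timer summary
    "HELLO_INTERVAL": 60,
    "RND_RSP": 2,
    "JOIN_RTX_INTERVAL": 5,
    "CACHE_DEL_TIMER": 2,
    "QUIT_TIME": 60,
    "QUIT_RATE": 15,
    "ECHO_INTERVAL": 60,
    "ECHO_RTX_INTERVAL": 2,
}


def derived_timers(d=DEFAULTS):
    """Compute the timers that the list defines in terms of other timers."""
    return {
        "HELLO_TIMER": d["HELLO_INTERVAL"] + d["RND_RSP"],
        "JOIN_TIMEOUT": 3.5 * d["JOIN_RTX_INTERVAL"],
        "ECHO_TIMEOUT": 3.5 * d["ECHO_RTX_INTERVAL"],
    }
```

With the defaults, [HELLO_TIMER] is 62 seconds, [JOIN_TIMEOUT] is 17.5 seconds and [ECHO_TIMEOUT] is 7 seconds.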
One of the criticisms levelled against shared tree multicast schemes 6. CBT Packet Formats and Message Types
is that they potentially result in sub-optimal routes between
receivers. Another criticism is that shared trees incur a high traf-
fic concentration effect on the core routers. Given that any shared
tree is likely to have two, three, or more cores which can be strate-
gically placed in the network, as well as the fact that any on-tree
router can act as a "branch point" (or "exploder point"), shared tree
traffic concentration can be significantly reduced. This note never-
theless addresses both of these criticisms by describing new mecha-
nisms that
+o allow a CBT to dynamically transition from a random configura- CBT control packets are encapsulated in IP. CBT has been assigned IP
tion to one where any CBT router can become a core - more pre- protocol number 7 by IANA [4].
cisely, that which is local to a source, and...
+o remove the traffic concentration issue completely, as a result 6.1. CBT Common Control Packet Header
of the above; traffic concentration is not an issue with source-
rooted trees.
The mechanisms described here are relevant to non-concurrent sources; All CBT control messages have a common fixed length header.
the concurrent-sender case is not addressed here, although experience
with MBONE applications for the past several years suggests that most
multicast applications are of the single, infrequently-changing
sender type. Also, it is not necessarily implied that the initial
CBT tree must be transitioned. Any transition is an "all-or-nothing"
transition, meaning that either all the tree transitions, or none of
it does (footnote 6).
A.2 Goals & Requirements 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| vers | type | addr len | checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
By means of the mechanisms described, this Appendix sets out to Figure 1. CBT Common Control Packet Header
achieve the following:
+o provide mechanisms that allow the dynamic transition from an This CBT specification is version 2.
initial CBT, constructed around a pre-configured set of cores,
to a CBT that is rooted at a core attached to a sender's local
subnetwork. This is source-rooted tree emulation.
+o ensure that these mechanisms do not impact CBT's simplicity or CBT packet types are:
scalability.
+o eliminate completely the traffic concentration issue from CBT. +o type 0: HELLO
+o to eliminate the core placement/core advertisement problems. +o type 1: JOIN_REQUEST
+o ensure that the scheme is robust, such that if a source's local +o type 2: JOIN_ACK
router (or link to it) should fail, the CBT self-organises
itself and returns to its original configuration.
+o the mechanisms should provide the same even to non-member +o type 3: QUIT_NOTIFICATION
senders.
The above incurs a few additional requirements on existing baseline +o type 4: ECHO_REQUEST
CBT mechanisms described in this specification:
+o a new JOIN-REQUEST subcode, REVERSE-JOIN +o type 5: ECHO_REPLY
+o a new JOIN-ACK subcode, REVERSE-ACK +o type 6: FLUSH_TREE
_________________________
6 This is the expected behaviour of PIM Sparse Mode;
on receipt of high-bandwidth traffic, most receivers'
local routers will be configured to transition to
source trees.
+o new JOIN-ACK subcode, CORE-MIGRATE +o type 7: Bootstrap Message
+o a "first-hop router" field needs to be included in the CBT data +o type 8: Candidate Core Advertisement
packet header.
+o a new message type: +o Addr Length: address length in bytes of unicast or multicast
addresses carried in the control packet.
- SOURCE-NOTIFICATION +o Checksum: the 16-bit one's complement of the one's complement
sum of the entire CBT control packet.
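The common control packet header of section 6.1 (in the -07 text) and its checksum can be sketched as below. The field widths are an assumption: the column spacing of Figure 1 is not preserved here, so the sketch takes vers and type as 4 bits each, addr len as 8 bits and checksum as 16 bits. The function names are hypothetical, and the checksum routine is the usual one's complement sum over 16-bit words.

```python
import struct


def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole 16-bit word
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_packet(ptype: int, addr_len: int, body: bytes = b"", version: int = 2) -> bytes:
    """Build header + body, computing the checksum over the entire packet
    with the checksum field set to zero (assumed field widths, see above)."""
    first = ((version & 0x0F) << 4) | (ptype & 0x0F)
    zeroed = struct.pack("!BBH", first, addr_len, 0)
    csum = ones_complement_checksum(zeroed + body)
    return struct.pack("!BBH", first, addr_len, csum) + body


def parse_header(packet: bytes) -> dict:
    first, addr_len, csum = struct.unpack("!BBH", packet[:4])
    return {"vers": first >> 4, "type": first & 0x0F,
            "addr_len": addr_len, "checksum": csum}
```

A receiver can verify a packet by running the same checksum over the whole packet as received; a result of zero indicates the packet is intact.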
+o CBT-mode data encapsulation is required until the local CBT 6.2. HELLO Packet Format
router connected to an active source receives a JOIN-REQUEST,
whose "target core address" field is one of its own IP
addresses.
These new additions are explained in the next section. 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CBT Control Packet Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| rnd response | Preference | reserved | option type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| option len | option value |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
A.3 Source-Tree Emulation Criteria Figure 2. HELLO Packet Format
CBT routers are configured with a lower-bound data-rate threshold HELLO Packet Field Definitions:
that is the expected boundary between low- and high-bandwidth data
rate traffic. CBT also monitors the duration each sender sends. If
this duration exceeds a pre-configured value (global across CBT), say
3 minutes, AND the data rate threshold is exceeded, the CBT tree
transitions such that receivers become joined to the "core" local to
the source's subnet, i.e. the CBT tree becomes source-rooted, but
nevertheless remains a CBT.
A.4 Source-Migration Mechanisms +o rnd response: random response interval in seconds.
E o o D
\ /
\ /
L o \ /
\ o C
\ N /
\ /
\A(2) (1)B /
O===================================O
| |
M | |
| |
K o o H
/\ /\
/ \ / \
/ \ / \
s J o o I G o o F
----------
Key: B = primary core +o preference: sender's HELLO preference.
A = secondary core
s = sending host
J = sending host's local DR
M & N = network nodes not on original CBT tree
Figure A1: Original CBT Tree +o option type: the type of option present in the "option value"
field. One option type is currently defined: option type 0
(zero) = BR_HELLO; option value 0 (zero); option length 0
(zero). This option type is used with HELLO messages sent by a
border router (BR) as part of designated BR election (see [5]).
In figure A1, host s starts sending native mode multicast data. CBT +o option len: length of the "option value" field in bytes.
router J encapsulates it as CBT mode, inserting its own IP address in
the "first-hop router" field of the CBT mode data packet header. This
data packet flows over the CBT tree.
Note that tree migration can be disabled either by sending all pack- +o option value: variable length field carrying the option value.
ets in native mode, or by inserting NULL value into the "first-hop
router" field. Since the first-hop router is the original encapsulat-
ing router (data packets are always originated from hosts in native
mode), the first-hop router knows whether the sender's data rate war-
rants activating the "first-hop router" field; for the purpose of the
ensuing protocol description, we assume this is the case.
Any router on the tree receiving the CBT mode data packet, inspects 6.3. JOIN_REQUEST Packet Format
the "first-hop router" field of the CBT header, and compiles a join-
request to send to it. In order to fully specify the join, it must
inspect its underlying unicast routing table(s) to find the best
next-hop to the source's first hop router. That next hop will be
either on or off the existing CBT tree for the group. If the next hop
is off-tree, the join generated is given a subcode of ACTIVE-JOIN (as
per CBT spec), and a "target core address" of the source's first hop
router. The join is then forwarded and processed according to the CBT
specification. The primary core, and the original core list, remain
specified in their respective fields of the CBT control packet
header.
Using figure A1 to illustrate an example, node L's routing tables JOIN_REQUEST Field Definitions
suggest that the best next-hop to J, the source's first hop router,
is via node M, not yet on the tree. So, node L generates a join and
forwards it to M, which forwards it to J. The join-ack (subcode NOR-
MAL) returns to L via M on the reverse-path of the join. When the
join-ack reaches L, L sends a QUIT-REQUEST to A, its old parent. The
shortest-path branch now exists, L-M-J.
If the best next hop to the source's first hop router is via an +o group address: multicast group address of the group being
existing on-tree interface, if that interface is the node's parent on joined. For a "wildcard" join (see [5]), this field contains
the current tree, no further action need be taken, and no join need the value of INADDR_ANY.
be sent towards the source, J.
However, the join's best next hop may be via an existing child inter- +o originating router: router that originated this JOIN_REQUEST.
face - this is where the new join type, subcode REVERSE-JOIN, comes
in. The purpose of this join type is to simply reverse the existing
parent-child relationship between two adjacent on-tree routers; each
end of the link between the two routers is re-labelled. This join
must be acknowledged by means of a JOIN-ACK, subcode REVERSE-ACK. A
reverse-join is only ever sent from a child to its parent.
Immediately subsequent to sending a reverse-join-ACK, the sending 0 1 2 3
node's old parent interface is labelled as "pending child", and a 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
timer is set on that interface. This is a delay timer, set at a +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
default of 5 seconds, during which time a reverse-join is expected | CBT Control Packet Header |
over that interface from the node's old parent. Should this timer +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
expire, a REVERSE-ASSERT message is sent to the old parent (new | group address |
child) to cause it to agree to the change in the parent-child rela- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
tionship. A REVERSE-ASSERT must be ack'd (REVERSE-ASSERT-ACK). If, | originating router |
after (say) three retransmissions (at 5 sec intervals) no reverse- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
assert-ack has been received, a QUIT-REQUEST is sent to the old par- | target router |
ent and the corresponding interface is removed from this node's cur- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
rent forwarding database. | option type | option len | option value |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Of course, if a node has already received a reverse-join during the Figure 3. JOIN_REQUEST Packet Format
period one of its other interfaces was changing its parent-child +o target router: target (core) router for the group.
relationship with another of its neighbours, then the pending-child
delay timer need not be activated.
Looking at figure A1 again, here's the process of how the parent- +o option type: allows the specification of a variety of
child relationships change on the tree when an active source, s, JOIN_REQUEST options. One option is currently defined: option
starts sending. Of course, links E-C, I-J, and L-J do not do this type 0 (zero) = BR_JOIN; option length 0 (zero); option value 0
because they forge completely new paths towards the source's local (zero). This option is used by a CBT domain border router to
router, J. join an internal core for all groups that map to that core. The
state instantiated by a JOIN_REQUEST with this option set
represents (*, core). For further details, see [5].
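The three address fields of the JOIN_REQUEST in Figure 3 (of the -07 text) can be encoded and decoded as in the sketch below. It assumes IPv4 addresses, i.e. an addr len of 4 bytes, and it deliberately leaves out the option fields because their widths cannot be read reliably from the figure as reproduced here. The function names are hypothetical.

```python
import ipaddress

INADDR_ANY = "0.0.0.0"     # group address value used for a "wildcard" join


def pack_join_addresses(group: str, originating_router: str, target_router: str) -> bytes:
    """Concatenate group address, originating router and target router."""
    return b"".join(ipaddress.IPv4Address(a).packed
                    for a in (group, originating_router, target_router))


def unpack_join_addresses(body: bytes, addr_len: int = 4) -> dict:
    """Split the first three addr_len-sized fields that follow the header."""
    fields = [body[i * addr_len:(i + 1) * addr_len] for i in range(3)]
    group, orig, target = (str(ipaddress.IPv4Address(f)) for f in fields)
    return {"group address": group,
            "originating router": orig,
            "target router": target,
            "wildcard": group == INADDR_ANY}
```

The addresses used in any test of this sketch are arbitrary examples and carry no meaning in the specification.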
K sends a reverse-join to J. J acks this with a join-ack, subcode 6.4. JOIN_ACK Packet Format
REVERSE-ACK. At this point, J is K's parent, and I is still K's
child. K now sets the pending-child delay timer on its interface to
A (K's old parent), and expects a reverse-join from A. If it weren't
to arrive after the delay timer expires, plus several retransmissions
of a reverse-assert control message, K can send a quit to A (it sends
a quit because, as far as A is concerned, it thinks K is still its
child) and removes the K-A interface from its CBT forwarding
database. However, assuming a reverse-join does arrive at K from A
before the delay timer expires, K acks the reverse-join and cancels
the delay timer on that interface.
Next, let's consider CBT router (node) I. I's unicast routing table 0 1 2 3
suggests it can reach J directly (next-hop) via a different interface 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
than the I-K interface, so I sends a join-request, subcode active- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
join, to J, which acks it as normal. On receipt of the ack, I sends a | CBT Control Packet Header |
quit to K and removes K as its parent from its database. +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| target router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| option type | option len | option value |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Now let's consider node L. Like I, it finds a new path to J, via M, Figure 4. JOIN_ACK Packet Format
so simply sends a new join to J, via M, and on receipt of the join- JOIN_ACK Field Definitions
ack, sends a quit to A, and removes A from its forwarding database.
A new, shortest-path, branch now exists, J-M-L.
Next let's consider A-B, the link between the cores. A is the sec- +o group address: multicast group address of the group being
ondary, and B is the primary, so A originally joined towards B. So, joined.
B sends a reverse-join to A. A sends a reverse-ack to B, so A is now
B's parent, and B has children B-H, and B-C. Note that the role of
primary and secondary is not affected - the target of B's join to A
is the source's local router, J.
The existing branches D-C-B, F-H-B, and G-H-B, need not change any of +o target router: router (DR) that originated the corresponding
their parent-child relationships, since each of these nodes' unicast JOIN_REQUEST.
routing tables indicate that the best next-hop a join-request,
targeted at source J, would take, is via the corresponding existing
parent.
For E, it sends a new join via N to J. On receipt of the join-ack, it 6.5. QUIT_NOTIFICATION Packet Format
sends a quit to C. A new branch has been created, E-N-J.
Each node on the tree now has a shortest-path to J, the source's 0 1 2 3
local CBT router. Hence, J is the root ("core") of a shortest-path 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
multicast tree. +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CBT Control Packet Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| originating child router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Note that these new mechanisms augment the CBT protocol, and the Figure 5. QUIT_NOTIFICATION Packet Format
baseline CBT protocol engine is not affected in any way by this add-
on mechanism.
A.5 Robustness Issues QUIT_NOTIFICATION Field Definitions
Some immediate questions might be: +o group address: multicast group address of the group being
joined.
+o what happens to the source-rooted tree if the source's local CBT +o originating child router: address of the router that originates
router fails? the QUIT_NOTIFICATION.
+o what happens if the source's local CBT router fails whilst the 6.6. ECHO_REQUEST Packet Format
initial tree is transitioning?
+o what happens if the tree is partitioned, or not yet fully con- ECHO_REQUEST Field Definitions
nected, when a source starts sending?
+o how do new receivers join an already-transitioned tree? +o originating child router: address of the router that originates
the ECHO_REQUEST.
All of these questions are now addressed: 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CBT Control Packet Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| originating child router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+o What happens to the source-rooted tree if the source's local CBT Figure 6. ECHO_REQUEST Packet Format
router fails? 6.7. ECHO_REPLY Packet Format
A source-rooted CBT has a single point of failure - the root of 0 1 2 3
the tree. 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| CBT Control Packet Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| originating parent router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group report interval | num groups |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group address #1 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group address #2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| group address #n |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
In spite of a source being joined, the corelist (primary & sec- Figure 7. ECHO_REPLY Packet Format
ondaries) is carried in CBT control packets, as per the CBT
spec. However, the contents of the "target core address" field
identifies the IP address of the source's local CBT router. So,
in the event of a failure, the CBT routers still have all the
information they need to rejoin the original tree, constructed
around the corelist. Rejoining then, proceeds according to the
rules of the CBT specification.
Of course, rejoining the original tree happens only after sev- ECHO_REPLY Field Definitions
eral attempts have been made to rejoin the source's "core".
+o What happens if the source's local CBT router fails whilst the +o originating parent router: address of the router originating
initial tree is transitioning? this ECHO_REPLY.
This really is no different to the above case. The parts of the +o group report interval: number of seconds until the sending
tree that have transitioned will rejoin the original tree router will send its next ECHO_REPLY containing a list of group
according to their corresponding corelist. Those parts of the addresses.
tree in the process of transitioning may temporarily transition,
but eventually those nodes will receive a FLUSH from a CBT
router adjacent to the failed source router ("core"). They then
rejoin the original tree.
+o What happens if the tree is partitioned, or not yet fully con- +o num groups: the number of groups being reported by this
nected, when a source starts sending? ECHO_REPLY.
The problem here is that some parts of the network (CBT tree) +o group address: a list of multicast group addresses for which
may not receive CBT encapsulated mode data packets before the this router considers itself a parent router w.r.t. the link
source's local DR starts forwarding data in native mode, and so over which this message is sent.
those receivers will not know the IP address of the local DR to
join to.
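As an illustration of the ECHO_REPLY layout in Figure 7, the following is a minimal Python sketch that packs and unpacks the fields following the CBT control packet header. The header itself is not defined in this section and is left out. The 16-bit widths for "group report interval" and "num groups" are an assumption read off the figure, and the function names are hypothetical.

```python
import ipaddress
import struct


def pack_echo_reply_body(parent_router, report_interval, groups):
    """Pack the ECHO_REPLY fields that follow the CBT control packet header.

    Assumed layout (network byte order):
      32 bits  originating parent router
      16 bits  group report interval (seconds)
      16 bits  num groups
      32 bits  per reported group address
    """
    body = ipaddress.IPv4Address(parent_router).packed
    body += struct.pack("!HH", report_interval, len(groups))
    for group in groups:
        body += ipaddress.IPv4Address(group).packed
    return body


def unpack_echo_reply_body(body):
    """Inverse of pack_echo_reply_body; returns (parent, interval, groups)."""
    parent = str(ipaddress.IPv4Address(body[0:4]))
    interval, num_groups = struct.unpack("!HH", body[4:8])
    groups = [
        str(ipaddress.IPv4Address(body[8 + 4 * i:12 + 4 * i]))
        for i in range(num_groups)
    ]
    return parent, interval, groups
```

A receiver would take "num groups" from the fixed part and then read that many 32-bit group addresses, as unpack_echo_reply_body does.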
For example, assume a secondary core with downstream members 6.8. FLUSH_TREE Packet Format
cannot reach the primary. If the routers adjacent to the secon-
daries are all functioning correctly, the secondaries themselves
may not be aware that a partition has occurred somewhere further
upstream. So, what if a source downstream from a secondary,
starts sending data after the partition has happened?
A new control message, the SOURCE-NOTIFICATION, is used to solve 0 1 2 3
this problem. As soon as any core receives CBT mode encapsulated 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
data, it caches the source "core" IP address, and starts multi- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
casting (to the group) SOURCE-NOTIFICATION messages, one every | CBT Control Packet Header |
minute. Source-notifications contain the IP address of the +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
source's local DR. A core continues to multicast source- | group address |
notifications at 1 minute intervals until the source has ceased +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
transmitting data for more than 20 seconds.
Obviously, if a CBT is fully connected, the larger proportion of Figure 8. FLUSH_TREE Packet Format
source-notifications will be redundant. However, this cost jus-
tifies the robustness the scheme provides.
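The SOURCE-NOTIFICATION timing described above (one message per minute, stopping once the source has been silent for more than 20 seconds) might be modelled as follows. This is a sketch only: the class and method names are hypothetical, time is passed in explicitly as seconds, and the actual transmission of the message is left to the caller.

```python
NOTIFY_INTERVAL = 60.0    # seconds between source-notifications
SOURCE_IDLE_LIMIT = 20.0  # silence after which notifications stop


class SourceNotifier:
    """Tracks when a core should multicast a SOURCE-NOTIFICATION."""

    def __init__(self):
        self.source_dr = None    # cached address of the source's local DR
        self.last_data = None    # time the last encapsulated data arrived
        self.last_notify = None  # time the last notification was sent

    def on_encapsulated_data(self, source_dr, now):
        # Cache the source "core" address whenever CBT mode data arrives.
        self.source_dr = source_dr
        self.last_data = now

    def poll(self, now):
        """Return the DR address to advertise now, or None."""
        if self.source_dr is None:
            return None
        if now - self.last_data > SOURCE_IDLE_LIMIT:
            # The source has ceased sending; forget it and stop notifying.
            self.source_dr = None
            self.last_notify = None
            return None
        if self.last_notify is None or now - self.last_notify >= NOTIFY_INTERVAL:
            self.last_notify = now
            return self.source_dr
        return None
```

A core would call poll() from its timer loop and multicast a notification to the group whenever an address is returned.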
If an off-tree source begins sending data, which first hits the FLUSH_TREE Field Definitions
tree at a secondary core with no receivers attached, the
secondary does not trigger a join towards the primary, but
instead just unicasts the data, in CBT mode, to the primary (as
per CBT spec). The primary then forwards the data over any con-
nected tree branches. Receivers can then begin transitioning. In
this way, a transitioned CBT tree extends to the first hop
router of a non-member sender.
Note that cores and on-tree routers only ever react to active +o group address: multicast group address of the group being
sources iff they have an existing CBT forwarding database for "flushed".
the said group. For example, a primary core would not establish
a shortest-path branch to a non-member sender unless it has at
least one existing child registered for the corresponding group.
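The FLUSH_TREE and QUIT_NOTIFICATION bodies (Figures 8 and 5) carry only 32-bit addresses after the CBT control packet header. A minimal Python sketch of packing those bodies follows. The header is omitted because it is not defined in this section, the function names are hypothetical, and the multicast check is an added safeguard rather than something the text requires.

```python
import ipaddress


def _group_bytes(group):
    """Return the 4-byte form of a multicast group address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast group address")
    return addr.packed


def pack_flush_tree_body(group):
    """FLUSH_TREE: the group address of the group being flushed."""
    return _group_bytes(group)


def pack_quit_notification_body(group, child_router):
    """QUIT_NOTIFICATION: group address, then originating child router."""
    return _group_bytes(group) + ipaddress.IPv4Address(child_router).packed
```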
+o How do new receivers join an already-transitioned CBT? 7. Core Router Discovery
New receivers will always attempt to join one of the cores in For intra-domain core discovery, CBT has decided to adopt the "boot-
the corelist for a group. Two things can happen here: firstly, a strap" mechanism currently specified with the PIM sparse mode proto-
new join, targeted at one of the cores in the corelist eventu- col [2]. This bootstrap mechanism is scalable, robust, and does not
ally reaches that target core. Secondly, the new join hits a rely on underlying multicast routing support to deliver core router
router already established on-tree, but the router encountered information; this information is distributed via traditional unicast
is now joined to the source tree (source "core"). hop-by-hop forwarding.
For the first scenario, all on-tree routers and all core routers It is expected that the bootstrap mechanism will be specified inde-
maintain the address of which upstream core their CBT branch pendently as a "generic" RP/Core discovery mechanism in its own sepa-
actually emanates from (as per CBT spec). When a new join rate document. It is unlikely at this stage that the bootstrap mecha-
arrives at one of the original cores, the core checks whether nism will be appended to a well-known network layer protocol, such as
its own current core affiliation is to a core outside the IGMP [3], though this would facilitate its ubiquitous (intra-domain)
corelist set. If so, that core is a source "core", so the core deployment. Therefore, each multicast routing protocol requiring the
responds to the new join with a JOIN-ACK, subcode CORE-MIGRATE. bootstrap mechanism must implement it as part of the multicast rout-
This join-ack contains the address of the active source "core". ing protocol itself.
This join-ack causes a join-request to be issued by one of the
routers that receives it - the router whose path to the core
(just joined) diverges from that to the source "core"; this can
easily be gleaned from unicast routing. The router then simply
directs its new join at the source "core", and on receipt of the
join-ack, sends a quit to its now "old" parent.
For the second case, the solution is trivial; any on-tree router A summary of the operation of the bootstrap mechanism follows
receiving a join targeted either at one of the original cores (details are provided in [7]). It is assumed that all routers within
for the group, or the active source "core", simply acks (subcode the domain implement the "bootstrap" protocol, or at least forward
NORMAL) the join and includes in the ack the source "core" bootstrap protocol messages.
affiliation (as per CBT spec).
A.6 Loops A subset of the domain's routers are configured to be CBT candidate
It may seem that the potential for a transitioning tree to form core routers. Each candidate core router periodically (default every
loops, especially in the presence of reverse-joins, is greatly 60 secs) advertises itself to the domain's Bootstrap Router (BSR),
increased. This is probably NOT the case; "reversed branches" are using "Core Advertisement" messages. The BSR is itself elected
those that are already part of a loop-free tree that CBT constructs dynamically from all (or participating) routers in the domain. The
around the original set of cores. Transitioned trees are just CBTs, domain's elected BSR collects "Core Advertisement" messages from can-
whereby the core is simply rooted at the source. Loops are no more didate core routers and periodically advertises a candidate core set
likely with these mechanisms than they are with baseline CBT. Note (CC-set) to each other router in the domain, using traditional hop-
that these are assertions - formal proofs may be more appropriate. by-hop unicast forwarding. The BSR uses "Bootstrap Messages" to
advertise the CC-set. Together, "Core Advertisements" and "Bootstrap
Messages" comprise the "bootstrap" protocol.
APPENDIX B When a router receives an IGMP host membership report from one of its
directly attached hosts, the local router uses a hash function on the
reported group address, the result of which is used as an index into
the CC-set. This is how local routers discover which core to use for
a particular group.
Group State Aggregation Note the hash function is specifically tailored such that a small
number of consecutive groups always hash to the same core. Further-
more, bootstrap messages can carry a "group mask", potentially limit-
ing a CC-set to a particular range of groups. This can help reduce
traffic concentration at the core.
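The group-to-core mapping described above can be sketched in Python as follows. The real hash function is the one specified in [7]; the CRC-32 used here is only a stand-in, and the hash_mask_len parameter and function name are hypothetical. Masking off the low-order bits of the group address before hashing is one way to make a small run of consecutive groups select the same core.

```python
import ipaddress
import zlib


def select_core(group, cc_set, hash_mask_len=30):
    """Pick a core for a group by hashing into the candidate core set.

    cc_set is the ordered CC-set as advertised by the BSR. Groups that
    share their top hash_mask_len bits always map to the same core.
    """
    if not cc_set:
        raise ValueError("empty CC-set")
    group_int = int(ipaddress.IPv4Address(group))
    mask = (0xFFFFFFFF << (32 - hash_mask_len)) & 0xFFFFFFFF
    key = (group_int & mask).to_bytes(4, "big")
    # Stand-in hash (assumption); the result is used as an index.
    index = zlib.crc32(key) % len(cc_set)
    return cc_set[index]
```

Every router in the domain must apply the same function to the same ordered CC-set for the mapping to agree domain-wide.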
B.1 Introduction If a BSR detects a particular core as being unreachable (it has not
announced its availability within some period), it deletes the rele-
vant core from the CC-set sent in its next bootstrap message. This is
how a local router discovers a group's core is unreachable; the
router must re-hash for each affected group and join the new core
after removing the old state. The removal of the "old" state follows
the sending of a QUIT_NOTIFICATION upstream, and a FLUSH_TREE message
downstream.
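The recovery sequence above might be expressed as an ordered list of actions per affected group. This Python sketch is illustrative only: the action tuples and the function name are hypothetical, and the core selection function is passed in by the caller (the real hash is defined in [7]).

```python
def recovery_actions(group_to_core, new_cc_set, select_core):
    """List the steps a router takes after receiving a new CC-set.

    group_to_core maps each joined group to its current core.
    select_core(group, cc_set) returns the core for a group.
    """
    live_cores = set(new_cc_set)
    actions = []
    for group, core in sorted(group_to_core.items()):
        if core in live_cores:
            continue  # this group's core is still advertised
        # Old state is removed only after notifying both directions.
        actions.append(("QUIT_NOTIFICATION", group, "upstream"))
        actions.append(("FLUSH_TREE", group, "downstream"))
        actions.append(("REMOVE_STATE", group, core))
        # Re-hash against the new CC-set and join the new core.
        actions.append(("JOIN", group, select_core(group, new_cc_set)))
    return actions
```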
Although the scalability of shared tree multicast schemes is attrac- 7.1. Bootstrap Message Format
tive now, to scale over the longer-term, a combination of hierarchy
(support mechanisms that facilitate domain-oriented multicasting),
and group aggregation strategies, is required. If IP multicast is to
have a long-term future in the Internet as a global transport mecha-
nism, by far the most serious challenge is to address the issue of
group state aggregation.
Shared trees were developed partly to address scalability with 0 1 2 3
regards to multicast state maintained in the network, which resulted 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
in an improvement in that state by a factor of the number of active +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
sources (a source being a subnetwork aggregate). However, it is per- | CBT common control packet header |
ceived that the number of sources sending to any one group will not +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
grow as fast as the number of groups, indeed the latter will probably | For full Bootstrap Message specification, see [7] |
grow at several orders of magnitude faster [12]. Therefore, it is +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
essential to contain this potential problem, particularly for the
benefit of routers on wide-area links, by designing an effective
group state aggregation mechanism, capable of collapsing group state.
Unlike unicast addresses, multicast addresses cannot be aggregated Figure 9. Bootstrap Message Format
according to topological locality; multicast addresses are truly
location-independent. Thus, it would not seem obvious how the problem
can be addressed - clearly, it must be looked at in a different way.
In order to be effective, flexibility and efficiency must be facets 7.2. Candidate Core Advertisement Message Format
of group aggregation; an aggregation scheme must be able to accommo-
date groups with wide-ranging characteristics in the least constrain-
ing way possible. For example, the trend towards small, non-local
groups (e.g. 4 or 5 person audio/video conferences between different
user groups spread over different countries/continents); it is these
types of groups that are likely to result in an explosive growth in
state. Also, these groups will, in all likelihood, utilize multicast
addresses that are randomly spread across the multicast address
space, making aggregation seemingly more difficult. An aggregation
scheme must therefore account for this.
B.2 Design Overview 0 1 2 3
This scheme involves replacing a subset of individual tree state pre- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
sent on inter-domain links, and aggregating it over a single shared +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
tree. The scheme does not yet specify how candidate groups for aggre- | CBT common control packet header |
gation are arrived at, but an obvious scheme would be to aggregate +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
already-overlapping distribution trees. The pivotal idea behind this | For full Candidate Core Adv. Message specification, see [7] |
approach encompasses two inter-dependent strategies: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+o administratively defining a portion of the multicast address Figure 10. Candidate Core Advertisement Message Format
space for aggregate groups. For brevity, an example might be the
range 238.0.0.0 - 238.255.255.255.
+o associated with each aggregate group address is a mask, specify- 8. Interoperability Issues
ing the portion of the address that is used to identify the
aggregate group itself (the portion covered by the mask); the
remaining address space is used as an index to an ordered list
of groups with which the aggregate address is associated. The
ordered list and its association with a group aggregate address
is conveyed by means of a protocol message (TBD). The index is
used to de-aggregate at region boundaries (border routers).
The scheme subscribes to the notion of aggregation-on-demand; a bor- Interoperability between CBT and DVMRP is specified in [5].
der router (BR) is configured with a threshold number of groups on a
BR's external interface, above which it begins to solicit aggregations
periodically, say once every hour.
As an example, say BR 123 wishes to aggregate 200 groups. BR 123 ran- Interoperability with other multicast protocols will be fully speci-
domly chooses (or by some address allocation algorithm) a group fied as the need arises.
aggregate address. It has been established that the number of groups
for which aggregation is desired is 200. The nearest power of 2 value
to 200 is 256 (2^8), and so the aggregate mask covers 24 bits, leav-
ing 8 to specify each individual group's traffic flowing over the
aggregate tree.
So we have: Acknowledgements
Group aggregate address: 238.10.12.0
Group aggregate mask: 238.10.12/24 Special thanks goes to Paul Francis, NTT Japan, for the original
brainstorming sessions that brought about this work.
A data packet for the 30th listed group (listed in a protocol message Others that have contributed to the progress of CBT include Ken Carl-
(TBD) as described above) would be addressed to: 238.10.12.30. berg, Eric Crawley, Nitin Jain, Steven Ostrowsksi, Radia Perlman,
Scott Reeve, Clay Shields, Sue Thompson, Paul White.
Similarly, a data packet pertaining to the 150th listed group would The participants of the IETF IDMR working group have provided useful
be addressed to: 238.10.12.150, and so on. feedback since the inception of CBT.
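The arithmetic of the aggregation example can be checked with a short Python sketch. The function names are hypothetical. Following the example in the text, the index part of the address is taken to equal the group's position in the ordered list, which is an interpretation of the example rather than a specified rule.

```python
import ipaddress


def aggregate_prefix_len(num_groups):
    """Mask length that leaves enough index bits for num_groups groups."""
    if num_groups < 1:
        raise ValueError("need at least one group")
    # Round up to the next power of two: 200 groups need 8 index bits.
    index_bits = max(1, (num_groups - 1).bit_length())
    return 32 - index_bits


def aggregate_address(aggregate, position):
    """Address used for the group at the given list position."""
    net = ipaddress.IPv4Network(aggregate)
    if not 0 <= position < net.num_addresses:
        raise ValueError("position does not fit in the index bits")
    return str(net.network_address + position)
```

With 200 groups the mask covers 24 bits, and positions 30 and 150 under 238.10.12.0/24 give 238.10.12.30 and 238.10.12.150, matching the example.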
All routers comprising the aggregate tree need only maintain the References
group aggregate address and mask, together with the aggregate tree's
associated interfaces. If a number of individual shared trees have
been replaced by an aggregate tree, then the core routers (RPs) of
each of those shared trees must additionally maintain the complete
list of groups associated with an <aggregate address/mask-len> so as
to be able to "re-direct" any incoming joins for already aggregated
groups. Similarly, border routers (BRs) incur the storage
cost of maintaining the individual groups associated with an <aggre-
gate address/mask-len>, so as to be able to aggregate and de-
aggregate as data packets flow across a (sub)region's border.
B.3 Scaling Further [1] Core Based Trees (CBT) Multicast Routing Architecture;
A. Ballardie; ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-
cbt-arch-**.txt. Working draft, 1997.
The scheme described can be applied recursively (to border routers) [2] Protocol Independent Multicast (PIM) Sparse Mode/Dense Mode; D.
to accommodate a hierarchy containing an arbitrary number of levels. Estrin et al; ftp://netweb.usc.edu/pim Working drafts, 1996.
The scheme described imposes two general requirements (or assump- [3] Internet Group Management Protocol, version 2 (IGMPv2); W. Fenner;
tions): ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-igmp-v2-**.txt.
Working draft, 1996.
+o a well defined aggregate group address space for each level of [4] Assigned Numbers; J. Reynolds and J. Postel; RFC 1700, October
hierarchy (or scope levels). 1994.
+o the ability to arbitrarily create boundaries in multicast [5] CBT Border Router Specification for Interconnecting a CBT Stub
routers, thereby separating different hierarchical levels. Region to a DVMRP Backbone; A. Ballardie;
ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-cbt-
dvmrp-**.txt. Working draft, March 1997.
The former will require consensus within the IETF and approval from [6] Scalable Multicast Key Distribution; A. Ballardie; RFC 1949, July
the IANA. The latter capability is already available in multicast 1996.
routers; boundaries are specified in a multicast router's configura-
tion file. This capability is currently available in the best known
multicast routing protocols: DVMRP, M-OSPF, PIM, and CBT.
Defining boundaries may require some degree of coordination; whenever [7] A Dynamic Bootstrap Mechanism for Rendezvous-based Multicast Rout-
a particular scoped level (boundary) is introduced which has multiple ing; D. Estrin et al.; Technical Report; ftp://catarina.usc.edu/pim
entry/exit multicast routers, these must all be configured such that
their boundary definitions are identical, i.e. they must each be con-
figured with the same boundary-address/mask (the range 239.0.0.0 -
239.255.255.255 is the IANA-defined multicast boundary address
range).
Author Information: Author Information:
Tony Ballardie, Tony Ballardie,
Department of Computer Science, Research Consultant,
University College London,
Gower Street,
London, WC1E 6BT,
ENGLAND, U.K.
Tel: ++44 (0)71 419 3462
e-mail: A.Ballardie@cs.ucl.ac.uk
Scott Reeve, Nitin Jain,
Bay Networks, Inc.
3, Federal Street,
Billerica, MA 01821,
USA.
Tel: ++1 508 670 8888
e-mail: {sreeve, njain}@BayNetworks.com
References
[1] T. Pusateri. Distance Vector Multicast Routing Protocol. Working
draft, June 1996. (draft-ietf-idmr-dvmrp-v3-01.{ps,txt}).
[2] J. Moy. Multicast Routing Extensions to OSPF. Communications of
the ACM, 37(8): 61-66, August 1994. Also RFC 1584, March 1994.
[3] D. Farinacci, S. Deering, D. Estrin, and V. Jacobson. Protocol
Independent Multicast (PIM) Dense-Mode Specification. Working draft,
July 1996. (draft-ietf-idmr-pim-dm-spec-02.{ps,txt}).
[4a] A. Ballardie. Core Based Tree (CBT) Multicast Architecture.
Working draft, July 1996. (draft-ietf-idmr-cbt-arch-04.txt)
[4] A. J. Ballardie. Scalable Multicast Key Distribution; RFC 1949,
SRI Network Information Center, 1996.
[5] A. J. Ballardie. "A New Approach to Multicast Communication in a
Datagram Internetwork", PhD Thesis, 1995. Available via anonymous ftp
from: cs.ucl.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z.
[6] W. Fenner. Internet Group Management Protocol, version 2 (IGMPv2).
Working draft, May 1996. (draft-idmr-igmp-v2-03.txt).
[7] B. Cain, S. Deering, A. Thyagarajan. Internet Group Management
Protocol Version 3 (IGMPv3) (draft-cain-igmp-00.txt).
[8] M. Handley, J. Crowcroft, I. Wakeman. Hierarchical Rendezvous
Point proposal, work in progress.
(http://www.cs.ucl.ac.uk/staff/M.Handley/hpim.ps) and
(ftp://cs.ucl.ac.uk/darpa/IDMR/IETF-DEC95/hpim-slides.ps).
[9] D. Estrin et al. USC/ISI, Work in progress.
(http://netweb.usc.edu/pim/).
[10] D. Estrin et al. PIM Sparse Mode Specification. Working draft,
July 1996. (draft-ietf-idmr-pim-sparse-spec-04.{ps,txt}).
[11] A. Ballardie. CBT - Dense Mode Interoperability: Border Router
Specification; Working draft, July 1996. Also available from:
ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-cbt-dm-interop-XX.txt
[12] S. Deering. Private communication, August 1996. e-mail: ABallardie@acm.org