Inter-Domain Multicast Routing (IDMR)                       A. Ballardie
INTERNET-DRAFT                                                Consultant
                                                               March 1997

         Core Based Trees (CBT) Multicast Routing Architecture
                     <draft-ietf-idmr-cbt-arch-04.txt>

Status of this Memo

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress."

Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.

Abstract

CBT is a multicast routing architecture that builds a single delivery
tree per group which is shared by all of the group's senders and
receivers. Most multicast algorithms build one multicast tree per
sender (subnetwork), the tree being rooted at the sender's
subnetwork. The primary advantage of the shared tree approach is that
it typically offers more favourable scaling characteristics than all
other multicast algorithms.

The CBT protocol [1] is a network layer multicast routing protocol
that builds and maintains a shared delivery tree for a multicast
group. The sending and receiving of multicast data by hosts on a
subnetwork conforms to the traditional IP multicast service model
[2].

CBT is progressing through the IDMR working group of the IETF. The
CBT protocol is described in an accompanying document [1]. For this,
and all IDMR-related documents, see http://www.cs.ucl.ac.uk/ietf/idmr

TABLE OF CONTENTS

1. Background
2. Introduction
3. Source-Based Tree Algorithms
   3.1 Distance-Vector Multicast Algorithm
   3.2 Link State Multicast Algorithm
   3.3 The Motivation for Shared Trees
4. CBT - The New Architecture
   4.1 Design Requirements
   4.2 Components & Functions
       4.2.1 Core Router Discovery
       4.2.2 CBT Control Message Retransmission Strategy
       4.2.3 Non-Member Sending
5. Interoperability with Other Multicast Routing Protocols
6. Summary
Acknowledgements
References
Author Information
Appendix

1. Background

Shared trees were first described by Wall in his investigation into
low-delay approaches to broadcast and selective broadcast [3]. Wall
concluded that delay will not be minimal, as with shortest-path
trees, but the delay can be kept within bounds that may be
acceptable. Back then, the benefits and uses of multicast were not
fully understood, and it wasn't until much later that the IP
multicast address space was defined (class D space [4]). Deering's
work [2] in the late 1980's was pioneering in that he defined the IP
multicast service model, and invented algorithms which allow hosts to
arbitrarily join and leave a multicast group. All of Deering's
multicast algorithms build source-rooted delivery trees, with one
delivery tree per sender subnetwork. These algorithms are documented
in [2].

After several years' practical experience with multicast, we see a
diversity of multicast applications and, correspondingly, a wide
variety of multicast application requirements. For example,
distributed interactive simulation (DIS) applications have strict
requirements in terms of join latency, group membership dynamics, and
group sender populations, far exceeding the requirements of many
other multicast applications.

The multicast-capable part of the Internet, the MBONE, continues to
expand rapidly. The obvious popularity and growth of multicast means
that the scaling aspects of wide-area multicasting cannot be
overlooked; some predictions talk of thousands of groups being
present at any one time in the Internet.

We evaluate scalability in terms of network state maintenance,
bandwidth efficiency, and protocol overhead. Other factors that can
affect these parameters include sender set size, and wide-area
distribution of group members.

2. Introduction

Multicasting on the local subnetwork does not require either the
presence of a multicast router or the implementation of a multicast
routing algorithm; on most shared media (e.g. Ethernet), a host,
which need not necessarily be a group member, simply sends a
multicast data packet, which is received by any member hosts
connected to the same medium.

For multicasts to extend beyond the scope of the local subnetwork,
the subnet must have a multicast-capable router attached, which
itself is attached (possibly "virtually") to another
multicast-capable router, and so on. The collection of these
(virtually) connected multicast routers forms the Internet's MBONE.

All multicast routing protocols make use of IGMP [5], a protocol that
operates between hosts and multicast router(s) belonging to the same
subnetwork. IGMP enables the subnet's multicast router(s) to monitor
group membership presence on its directly attached links, so that if
multicast data arrives, it knows over which of its links to send a
copy of the packet.

In our description of the MBONE so far, we have assumed that all mul-
ticast routers on the MBONE are running the same multicast routing
protocol. In reality, this is not the case; the MBONE is a collection
of autonomously administered multicast regions, each region defined
by one or more multicast-capable border routers. Each region indepen-
dently chooses to run whichever multicast routing protocol best suits
its needs, and the regions interconnect via the "backbone region",
which currently runs the Distance Vector Multicast Routing Protocol
(DVMRP) [6]. Therefore, it follows that a region's border router(s)
must interoperate with DVMRP.

Different algorithms use different techniques for establishing a
distribution tree. If we classify these algorithms into source-based
tree algorithms and shared tree algorithms, we'll see that the
different classes have considerably different scaling
characteristics, and the characteristics of the resulting trees
differ too, for example, average delay. Let's look at source-based
tree algorithms first.

3. Source-Based Tree Algorithms

The strategy we'll use for motivating (CBT) shared tree multicast is
based, in part, on explaining the characteristics of source-based
tree multicast, in particular its scalability.

Source-based tree multicast algorithms are often referred to as
"dense-mode" algorithms; they assume that the receiver population
densely populates the domain of operation, and therefore the
accompanying overhead (in terms of state, bandwidth usage, and/or
processing costs) is justified. Whilst this might be the case in a
local environment, wide-area group membership tends to be sparsely
distributed throughout the Internet. There may be "pockets" of
denseness, but if one views the global picture, wide-area groups tend
to be sparsely distributed.

Source-based multicast trees are either built by a distance-vector
style algorithm, which may be implemented separately from the unicast
routing algorithm (as is the case with DVMRP), or the multicast tree
may be built using the information present in the underlying unicast
routing table (as is the case with PIM-DM [7]). The other algorithm
used for building source-based trees is the link-state algorithm (a
protocol instance being M-OSPF [8]).

3.1. Distance-Vector Multicast Algorithm

The distance-vector multicast algorithm builds a multicast delivery
tree using a variant of the Reverse-Path Forwarding technique [9].
The technique is basically as follows: when a multicast router
receives a multicast data packet, if the packet arrives on the
interface used to reach the source of the packet, the packet is
forwarded over all outgoing interfaces, except leaf subnets with no
members attached. A "leaf" subnet is one which no router would use
to reach the source of a multicast packet. If the data packet does
not arrive over the link that would be used to reach the source, the
packet is discarded.

This constitutes a "broadcast & prune" approach to multicast tree
construction; when a data packet reaches a leaf router, if that
router has no membership registered on any of its directly attached
subnetworks, the router sends a prune message one hop back towards
the source. The receiving router then checks its leaf subnets for
group membership, and checks whether it has received a prune from all
of its downstream routers (downstream with respect to the source).
If so, the router itself can send a prune upstream over the interface
leading to the source.

The sender and receiver of a prune message must cache the <source,
group> pair being reported, for a "lifetime" which is at the
granularity of minutes. Unless a router's prune information is
refreshed by the receipt of a new prune for <source, group> before
its "lifetime" expires, that information is removed, allowing data to
flow over the branch again. State that expires in this way is
referred to as "soft state".

Interestingly, routers that do not lead to group members still incur
the state overhead imposed by prune messages. For wide-area
multicasting, which potentially has to support many thousands of
active groups, each of which may be sparsely distributed, this
technique clearly does not scale.

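As an illustration of the "soft state" just described, the following
sketch (in Python, with invented names and an assumed lifetime value)
shows a prune cache whose <source, group> entries expire unless
refreshed, at which point data may flow over the branch again:

    import time

    PRUNE_LIFETIME = 120  # seconds; illustrative value only

    class PruneCache:
        """Soft-state cache of <source, group> prunes, per interface."""

        def __init__(self):
            # (source, group, interface) -> absolute expiry time
            self.entries = {}

        def record_prune(self, source, group, interface):
            # A newly received prune (re)starts the entry's lifetime.
            self.entries[(source, group, interface)] = \
                time.time() + PRUNE_LIFETIME

        def is_pruned(self, source, group, interface):
            expiry = self.entries.get((source, group, interface))
            if expiry is None:
                return False
            if time.time() >= expiry:
                # Lifetime expired: drop the soft state so data flows again.
                del self.entries[(source, group, interface)]
                return False
            return True
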
3.2. Link-State Multicast Algorithm

Routers implementing a link state algorithm periodically collect
reachability information to their directly attached neighbours, then
flood this throughout the routing domain in so-called link state
update packets. Deering extended the link state algorithm for
multicasting by having a router additionally detect group membership
changes on its incident links before flooding this information in
link state packets.

Each router then has a complete, up-to-date image of a domain's
topology and group membership. On receiving a multicast data packet,
each router uses its membership and topology information to calculate
a shortest-path tree rooted at the sender subnetwork. Provided the
calculating router falls within the computed tree, it forwards the
data packet over the interfaces defined by its calculation. Hence,
multicast data packets only ever traverse routers leading to members,
either directly attached, or further downstream. That is, the
delivery tree is a true multicast tree right from the start.

However, the flooding (reliable broadcasting) of group membership
information is the predominant factor preventing the link state
multicast algorithm being applicable over the wide-area. The other
limiting factor is the processing cost of the Dijkstra calculation to
compute the shortest-path tree for each active source.

3.3. The Motivation for Shared Trees

The algorithms described in the previous sections clearly motivate
the need for multicast algorithms that are more scalable. CBT was
designed primarily to address the topic of scalability; a shared tree
architecture offers an improvement in scalability over source tree
architectures by a factor of the number of active sources (where a
source is usually a subnetwork aggregate). Source trees scale
O(S * G), since a distinct delivery tree is built per active source.
Shared trees eliminate the source (S) scaling factor; all sources use
the same shared tree, and hence a shared tree scales O(G). The
implication of this is that applications with many active senders,
such as distributed interactive simulation applications, and
distributed video-gaming (where most receivers are also senders),
have a significantly lesser impact on underlying multicast routing if
shared trees are used.

In the "back of the envelope" table below we compare the amount of
state required by CBT and DVMRP for different group sizes with
different numbers of active sources:

   |--------------|---------------------------------------------------|
   | Number of    |                  |                 |              |
   | groups       |        10        |       100       |     1000     |
   =====================================================================
   | Group size   |                  |                 |              |
   | (# members)  |        20        |       40        |      60      |
   --------------------------------------------------------------------
   | No. of srcs  |     |      |     |     |     |     |    |     |    |
   | per group    | 10% |  50% |100% | 10% | 50% |100% |10% | 50% |100%|
   --------------------------------------------------------------------
   | No. of DVMRP |     |      |     |     |     |     |    |     |    |
   | router       |     |      |     |     |     |     |    |     |    |
   | entries      | 20  |  100 | 200 | 400 | 2K  | 4K  | 6K | 30K | 60K|
   --------------------------------------------------------------------
   | No. of CBT   |                  |                 |              |
   | router       |                  |                 |              |
   | entries      |        10        |       100       |     1000     |
   |------------------------------------------------------------------|

         Figure 1: Comparison of DVMRP and CBT Router State

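To make the arithmetic behind Figure 1 explicit, here is a small
Python sketch (purely illustrative; the function names are ours) that
reproduces the table's numbers, assuming DVMRP stores one <source,
group> entry per active source per group while CBT stores a single
entry per group:

    # Back-of-envelope reproduction of Figure 1 (illustrative only).

    def dvmrp_entries(num_groups, group_size, sender_fraction):
        senders_per_group = group_size * sender_fraction
        return int(num_groups * senders_per_group)

    def cbt_entries(num_groups):
        return num_groups

    for num_groups, group_size in [(10, 20), (100, 40), (1000, 60)]:
        dvmrp = [dvmrp_entries(num_groups, group_size, f)
                 for f in (0.1, 0.5, 1.0)]
        print(f"groups={num_groups}: DVMRP={dvmrp}, "
              f"CBT={cbt_entries(num_groups)}")

    # Prints 20/100/200, 400/2000/4000 and 6000/30000/60000 DVMRP
    # entries, matching Figure 1 (2K = 2000, etc.).
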
Shared trees also offer significant bandwidth and state savings
compared with source trees; firstly, the tree only spans a group's
receivers (including links/routers leading to receivers) -- there is
no cost to routers/links in other parts of the network. Secondly,
routers between a non-member sender and the delivery tree do not
incur any cost pertaining to multicast, and indeed, these routers
need not even be multicast-capable -- packets from non-member senders
are encapsulated and unicast to a core on the tree.

The figure below illustrates a core based tree.
b b b-----b
\ | |
\ | |
b---b b------b
/ \ / KEY....
/ \/
b X---b-----b X = Core
/ \ b = on-tree router
/ \
/ \
b b------b
/ \ |
/ \ |
b b b
Figure 2: CBT Tree

4. CBT - The New Architecture

4.1. Design Requirements

The CBT shared tree design was geared towards several design
objectives:

o  scalability - the CBT designers decided not to sacrifice CBT's
   O(G) scaling characteristic to optimize delay using SPTs, as does
   PIM. This was an important design decision, and one, we think,
   was taken with foresight; once multicasting becomes ubiquitous,
   router state maintenance will be a predominant scaling factor.

   It is possible in some circumstances to improve/optimize the
   delay of shared trees by other means. For example, a
   broadcast-type lecture with a single sender (or limited set of
   infrequently changing senders) could have its core placed in the
   locality of the sender, allowing the CBT to emulate a
   shortest-path tree (SPT) whilst still maintaining its O(G)
   scaling characteristic. More generally, because CBT does not
   incur source-specific state, it is particularly suited to
   many-sender applications.

o  routing protocol independence - a facet of some source tree
   algorithms is that they are tied inextricably, one way or
   another, to a particular routing protocol. For example, DVMRP
   relies for its correct operation on some of the features of RIP
   (e.g. poison reverse). Similarly, M-OSPF can only be implemented
   in routers supporting OSPF, the corresponding unicast protocol.

   If multicast routing is to become ubiquitous in the Internet,
   multicast routing protocol operation should remain independent
   of particular unicast protocols to facilitate inter-domain
   multicast routing in a heterogeneous unicast routing environment.

o  robustness - source-based tree algorithms are clearly robust; a
   sender simply sends its data, and intervening routers "conspire"
   to get the data where it needs to, creating state along the way.
   This is the so-called "data driven" approach -- there is no
   set-up protocol involved.

   It is not as easy to achieve the same degree of robustness in
   shared tree algorithms; a shared tree's core router maintains
   connectivity between all group members, and is thus a single
   point of failure. Protocol mechanisms must be present that
   ensure a core failure is detected quickly, and the tree
   reconnected quickly using a replacement core router.

o  simplicity - the CBT protocol is relatively simple compared to
   most other multicast routing protocols. This simplicity can lead
   to enhanced performance compared to other protocols.

o  interoperability - from a multicast perspective, the Internet is
   a collection of heterogeneous multicast regions. The protocol
   interconnecting these multicast regions is currently DVMRP [6];
   any regions not running DVMRP connect to the DVMRP "backbone" as
   stub regions. Thus, the current Internet multicast infrastructure
   resembles a tree structure. CBT has well-defined interoperability
   mechanisms with DVMRP [15].

   Over the longer term, the imposing of a tree structure on the
   multicast infrastructure is unacceptable for various reasons
   (e.g. administrative burden). Ideally, we want to be able to
   randomly interconnect multicast regions and use a shared tree
   protocol, such as CBT, to dynamically build inter-domain shared
   trees spanning only those domains with interested receivers (or
   those leading to interested receivers). It is still not clear
   how we can achieve this efficiently and effectively in the
   presence of heterogeneous multicast regions/domains, each with
   their differing multicast forwarding rules. Besides this, there
   is the issue of core router discovery in the inter-domain
   environment. These, and other outstanding issues regarding
   inter-domain multicast routing, are discussed in [10].

   Clearly therefore, significant aspects of the inter-domain
   multicast routing (IDMR) architecture remain areas of ongoing
   research. When an IDMR architecture can be fully defined, new
   CBT interoperability mechanisms will be specified as deemed
   necessary to accommodate the IDMR architecture.

4.2. CBT Components & Functions

The CBT protocol is designed to build and maintain a shared multicast
distribution tree that spans only those networks and links leading to
interested receivers.

To achieve this, a host first expresses its interest in joining a
group by multicasting an IGMP host membership report [5] across its
attached link. On receiving this report, a local CBT aware router
invokes the tree joining process (unless it has already done so) by
generating a JOIN_REQUEST message, which is sent to the next hop on
the path towards the group's core router (how the local router
discovers which core to join is discussed in section 4.2.1). This
join message must be explicitly acknowledged (JOIN_ACK) either by the
core router itself, or by another router that is on the unicast path
between the sending router and the core, which itself has already
successfully joined the tree.

The join message sets up transient join state in the routers it
traverses, and this state consists of <group, incoming interface,
outgoing interface>. "Incoming interface" and "outgoing interface"
may be "previous hop" and "next hop", respectively, if the
corresponding links do not support multicast transmission. "Previous
hop" is taken from the incoming control packet's IP source address,
and "next hop" is gleaned from the routing table - the next hop to
the specified core address. This transient state eventually times out
unless it is "confirmed" with a join acknowledgement (JOIN_ACK) from
upstream. The JOIN_ACK traverses the reverse path of the
corresponding join message, which is possible due to the presence of
the transient join state. Once the acknowledgement reaches the router
that originated the join message, the new receiver can receive
traffic sent to the group.

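To make the joining mechanics concrete, the following sketch (Python,
with invented names; the real message formats and procedures are
defined in the CBT protocol specification [1]) shows transient join
state being created as a JOIN_REQUEST is forwarded hop-by-hop towards
the core, and confirmed when the JOIN_ACK retraces that path:

    # Illustrative sketch only: one router's view of join handling.

    transient_joins = {}   # group -> (downstream iface, upstream iface)
    forwarding_cache = {}  # group -> set of confirmed child interfaces

    def send(msg_type, group, out_iface):
        # Stand-in for actual message transmission.
        print(f"sending {msg_type} for {group} via {out_iface}")

    def handle_join_request(group, arrival_iface, core_addr, route_lookup):
        # Record transient state; forward the join one hop towards the core.
        upstream_iface = route_lookup(core_addr)
        transient_joins[group] = (arrival_iface, upstream_iface)
        send("JOIN_REQUEST", group, upstream_iface)

    def handle_join_ack(group):
        # A JOIN_ACK from upstream confirms transient state for this branch.
        if group not in transient_joins:
            return                        # state already timed out: ignore
        downstream_iface, _ = transient_joins.pop(group)
        forwarding_cache.setdefault(group, set()).add(downstream_iface)
        send("JOIN_ACK", group, downstream_iface)  # ack retraces reverse path
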
Loops cannot be created in a CBT tree because a) there is only one
active core per group, and b) tree building/maintenance scenarios
which may lead to the creation of tree loops are avoided. For
example, if a router's upstream neighbour becomes unreachable, the
router immediately "flushes" all of its downstream branches, allowing
them to individually rejoin if necessary. Transient unicast loops do
not pose a threat because a new join message that loops back on
itself will never get acknowledged, and thus eventually times out.

The state created in routers by the sending or receiving of a
JOIN_ACK is bi-directional - data can flow either way along a tree
"branch", and the state is group specific - it consists of the group
address and a list of local interfaces over which join messages for
the group have previously been acknowledged. There is no concept of
"incoming" or "outgoing" interfaces, though it is necessary to be
able to distinguish the upstream interface from any downstream
interfaces. In CBT, these interfaces are known as the "parent" and
"child" interfaces, respectively. We recommend the parent be
distinguished as such by a single bit in each multicast forwarding
cache entry.

With regards to the information contained in the multicast forwarding
cache, on link types not supporting native multicast transmission an
on-tree router must store the address of a parent and any children.
On links supporting multicast however, parent and any child
information is represented with local interface addresses (or similar
identifying information, such as an interface "index") over which the
parent or child is reachable.

When a multicast data packet arrives at a router, the router uses the
group address as an index into the multicast forwarding cache. A copy
of the incoming multicast data packet is forwarded over each
interface (or to each address) listed in the entry except the
incoming interface.

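The sketch below (Python, with names of our own choosing) illustrates
a bi-directional forwarding cache entry of this kind, with the parent
flagged per entry, together with the data forwarding rule just
described. The check that the packet arrived over a tree interface is
an assumption of this sketch rather than a statement of the
specification:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ForwardingCacheEntry:
        group: str
        interfaces: set = field(default_factory=set)  # parent + children
        parent: Optional[str] = None                  # flagged parent iface

    def forward_multicast(cache, group, packet, arrival_iface, transmit):
        entry = cache.get(group)
        if entry is None or arrival_iface not in entry.interfaces:
            return                               # not on-tree for this group
        for iface in entry.interfaces:
            if iface != arrival_iface:           # copy over every tree iface
                transmit(packet, iface)          # except the arrival iface
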
Each router that comprises a CBT multicast tree, except the core
router, is responsible for maintaining its upstream link, provided it
has interested downstream receivers, i.e. the child interface list is
non-NULL. A child interface is one over which a member host is
directly attached, or one over which a downstream on-tree router is
attached. This "tree maintenance" is achieved by each downstream
router periodically sending a "keepalive" message (ECHO_REQUEST) to
its upstream neighbour, i.e. its parent router on the tree. One
keepalive message is sent to represent entries with the same parent,
thereby improving scalability on links which are shared by many
groups. On multicast capable links, a keepalive is multicast to the
"all-cbt-routers" group (IANA assigned as 224.0.0.15); this has a
suppressing effect on any other router for which the link is its
parent link. If a parent link does not support multicast
transmission, keepalives are unicast.

The receipt of a keepalive message over a valid child interface
immediately prompts a response (ECHO_REPLY), which is either unicast
or multicast, as appropriate.

The ECHO_REQUEST does not contain any group information; the
ECHO_REPLY does, but only periodically. To maintain consistent
information between parent and child, the parent periodically
reports, in an ECHO_REPLY, all groups for which it has state, over
each of its child interfaces for those groups. This group-carrying
echo reply is not prompted explicitly by the receipt of an echo
request message. A child is notified of the time to expect the next
echo reply message containing group information in an echo reply
prompted by a child's echo request. The frequency of parent group
reporting is at the granularity of minutes.

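A minimal sketch of the keepalive aggregation described above
(illustrative names only): one ECHO_REQUEST per distinct parent,
rather than one per group, multicast to the "all-cbt-routers" group
where the parent link supports multicast, and unicast otherwise:

    ALL_CBT_ROUTERS = "224.0.0.15"

    def send_keepalives(group_parents, link_supports_multicast, transmit):
        # group_parents: group -> parent interface/address (None at the core)
        for parent in {p for p in group_parents.values() if p is not None}:
            if link_supports_multicast(parent):
                transmit("ECHO_REQUEST", dest=ALL_CBT_ROUTERS, via=parent)
            else:
                transmit("ECHO_REQUEST", dest=parent, via=parent)
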
It cannot be assumed all of the routers on a multi-access link have a
uniform view of unicast routing; this is particularly the case when a
multi-access link spans two or more unicast routing domains. This
could lead to multiple upstream tree branches being formed (an error
condition) unless steps are taken to ensure all routers on the link
agree which is the upstream router for a particular group. CBT
routers attached to a multi-access link participate in an explicit
election mechanism that elects a single router, the designated router
(DR), as the link's upstream router for all groups. Since the DR
might not be the link's best next-hop for a particular core router,
this may result in join messages being re-directed back across a
multi-access link. If this happens, the re-directed join message is
unicast across the link by the DR to the best next-hop, thereby
preventing a looping scenario. This re-direction only ever applies to
join messages. Whilst this is suboptimal for join messages, which
are generated infrequently, multicast data never traverses a link
more than once (either natively, or encapsulated).

In all but the exception case described above, all CBT control
messages are multicast over multicast supporting links to the
"all-cbt-routers" group, with IP TTL 1. When a CBT control message is
sent over a non-multicast supporting link, it is explicitly addressed
to the appropriate next hop.

4.2.1. Core Router Discovery

Core router discovery is by far the most controversial and difficult
aspect of shared tree multicast architectures, particularly in the
context of inter-domain multicast routing (IDMR). There have been
many proposals over the past three years or so, including advertising
core addresses in a multicast session directory like "sdr" [11],
manual placement, and the HPIM [12] approach of strictly dividing up
the multicast address space into many "hierarchical scopes" and using
explicit advertising of core routers between scope levels.

For intra-domain core discovery, CBT has decided to adopt the
"bootstrap" mechanism currently specified with the PIM sparse mode
protocol [7]. This bootstrap mechanism is scalable, robust, and does
not rely on underlying multicast routing support to deliver core
router information; this information is distributed via traditional
unicast hop-by-hop forwarding.

It is expected that the bootstrap mechanism will be specified
independently as a "generic" RP/Core discovery mechanism in its own
separate document. It is unlikely at this stage that the bootstrap
mechanism will be appended to a well-known network layer protocol,
such as IGMP [5] or ICMP [13], though this would facilitate its
ubiquitous (intra-domain) deployment. Therefore, each multicast
routing protocol requiring the bootstrap mechanism must implement it
as part of the multicast routing protocol itself.

A summary of the operation of the bootstrap mechanism follows. It is
assumed that all routers within the domain implement the "bootstrap"
protocol, or at least forward bootstrap protocol messages.

A subset of the domain's routers are configured to be CBT candidate
core routers. Each candidate core router periodically (default every
60 secs) advertises itself to the domain's Bootstrap Router (BSR),
using "Core Advertisement" messages. The BSR is itself elected
dynamically from all (or participating) routers in the domain. The
domain's elected BSR collects "Core Advertisement" messages from
candidate core routers and periodically advertises a candidate core
set (CC-set) to each other router in the domain, using traditional
hop-by-hop unicast forwarding. The BSR uses "Bootstrap Messages" to
advertise the CC-set. Together, "Core Advertisements" and "Bootstrap
Messages" comprise the "bootstrap" protocol.

When a router receives an IGMP host membership report from one of its
directly attached hosts, the local router uses a hash function on the
reported group address, the result of which is used as an index into
the CC-set. This is how local routers discover which core to use for
a particular group.

Note the hash function is specifically tailored such that a small
number of consecutive groups always hash to the same core.
Furthermore, bootstrap messages can carry a "group mask", potentially
limiting a CC-set to a particular range of groups. This can help
reduce traffic concentration at the core.

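The real hash function is defined by the PIM sparse mode bootstrap
specification [7]; the following placeholder (Python, our own names)
merely illustrates the idea of mapping a group address
deterministically onto the CC-set, with low-order group bits masked
so that a small run of consecutive groups selects the same core:

    import ipaddress

    def select_core(group_addr, cc_set, consecutive_bits=8):
        # Illustrative placeholder, not the specified hash function.
        group = int(ipaddress.IPv4Address(group_addr))
        masked = group >> consecutive_bits   # ignore low-order group bits
        return cc_set[masked % len(cc_set)]

    cc_set = ["10.0.0.1", "10.0.1.1", "10.0.2.1"]  # hypothetical cores
    print(select_core("224.1.2.3", cc_set))  # all of 224.1.2.x -> one core
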
If a BSR detects a particular core as being unreachable (it has not
announced its availability within some period), it deletes the
relevant core from the CC-set sent in its next bootstrap message.
This is how a local router discovers a group's core is unreachable;
the router must re-hash for each affected group and join the new core
after removing the old state. The removal of the "old" state follows
the sending of a QUIT_NOTIFICATION upstream, and a FLUSH_TREE message
downstream.

4.2.2. CBT Control Message Retransmission Strategy

Certain CBT control messages elicit a response of some sort. Lack of
response may be due to an upstream router crashing, or the loss of
the original message, or its response. To detect these events, CBT
retransmits those control messages for which it expects a response,
if that response is not forthcoming within the
retransmission-interval, which varies depending on the type of
message involved. There is an upper bound (typically 3) on the number
of retransmissions of the original message before an exception
condition is raised.

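A minimal sketch of this strategy (illustrative names; the actual
per-message retransmission intervals and limits are defined in the
protocol specification [1]):

    MAX_RETRANSMISSIONS = 3   # typical upper bound described above

    def send_with_retransmission(send, await_response, rtx_interval):
        # Send a control message, retransmitting until a response arrives
        # or the retransmission limit is reached.
        for _ in range(1 + MAX_RETRANSMISSIONS):
            send()
            if await_response(timeout=rtx_interval):
                return True          # response received
        return False                 # exception condition raised by caller
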
For example, the exception procedure for lack of response to a
JOIN_REQUEST is to rehash the corresponding group to another core
router from the advertised CC-set, then restart the joining process.

The exception procedure for lack of response to an ECHO_REQUEST is to
send a QUIT_NOTIFICATION upstream and a FLUSH_TREE message downstream
for the group. If this router has group members attached, it
establishes whether the group's core is still active (it appears in
the most recent CC-set advertisement from the BSR), and if so
restarts the joining process to that core. If the contents of the
CC-set indicate the core is unreachable, the router must rehash the
group, then restart the joining process towards the newly elected
core.

4.2.3. Non-Member Sending

If a non-member sender's local router is already on-tree for the
group being sent to, the subnet's upstream router simply forwards the
data packet over all outgoing interfaces corresponding to that
group's forwarding cache entry. This is in contrast to PIM-SM [7],
which must encapsulate data from a non-member sender, irrespective of
whether the local router has joined the tree. This is due to PIM's
uni-directional state.
If the sender's subnet is not attached to the group tree, the local
DR must encapsulate the data packet and unicast it to the group's
core router, where it is decapsulated and disseminated over all tree
interfaces, as specified by the core's forwarding cache entry for the
group. The data packet encapsulation method is IP-in-IP [14].

Routers in between a non-member sender and the group's core need not
know anything about the multicast group, and indeed may even be
multicast-unaware. This makes CBT particularly attractive for
applications with non-member senders.
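The local DR's decision can be pictured as in the following sketch.
The forwarding-cache layout, helper names and addresses are
hypothetical and serve only to illustrate the on-tree versus off-tree
cases described above; they are not taken from the CBT specification.

    # Illustrative sketch only; the forwarding-cache layout and helper
    # names are hypothetical, not taken from the CBT specification.
    IPPROTO_IPIP = 4      # IP-in-IP encapsulation, per RFC 2003 [14]

    def transmit(packet, interface):   # placeholder for the real send path
        print("multicast copy out", interface)

    def unicast(outer):                # placeholder for the real unicast path
        print("IP-in-IP towards core", outer["dst"])

    def forward_from_nonmember_sender(packet, group, forwarding_cache,
                                      core_of):
        """Local DR behaviour for a data packet from a non-member sender."""
        entry = forwarding_cache.get(group)
        if entry is not None:
            # Already on-tree: replicate over all outgoing tree interfaces
            # listed in the group's forwarding cache entry.
            for interface in entry["outgoing_interfaces"]:
                transmit(packet, interface)
        else:
            # Off-tree: encapsulate IP-in-IP and unicast to the group's
            # core, which decapsulates and forwards over its own tree
            # interfaces.
            outer = {"dst": core_of[group], "proto": IPPROTO_IPIP,
                     "payload": packet}
            unicast(outer)

    # Example: an off-tree DR encapsulating towards a (hypothetical) core.
    forward_from_nonmember_sender(b"data", "224.1.2.3", {},
                                  {"224.1.2.3": "192.0.2.1"})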
5. Interoperability with Other Multicast Routing Protocols

See "interoperability" in section 4.1.

The interoperability mechanisms for interfacing CBT with DVMRP are
defined in [15].

6. Summary

This document presents an architecture for intra- and inter-domain
multicast routing, though some aspects of inter-domain multicast
routing remain to be solved (e.g. inter-domain core router
discovery). We motivated this architecture by describing how an
inter-domain multicast routing algorithm must scale to large numbers
of groups present in the internetwork, and discussed why most other
existing algorithms are less suited to inter-domain multicast
routing. We followed by describing the features and components of the
architecture, illustrating its simplicity and scalability. Finally,
the Appendix summarizes simulation results comparing CBT with PIM.
7. Acknowledgements
Special thanks goes to Paul Francis, NTT Japan, for the original
brainstorming sessions that brought about this work.
Thanks also to Bibb Cain et al. (Harris Corporation) for allowing the
publication of their simulation results in the Appendix, and the
duplication of figures 3 and 4.
The participants of the IETF IDMR working group have provided useful
feedback since the inception of CBT.
References
[1] Core Based Trees (CBT) Multicast Routing: Protocol Specification;
A. Ballardie; ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-
cbt-spec-**.txt. Working draft, March 1997.
[2] Multicast Routing in a Datagram Internetwork; S. Deering, PhD The-
sis, 1991; ftp://gregorio.stanford.edu/vmtp/sd-thesis.ps.
[3] Mechanisms for Broadcast and Selective Broadcast; D. Wall; PhD
thesis, Stanford University, June 1980. Technical Report #90.
[4] Assigned Numbers; J. Reynolds and J. Postel; RFC 1700, October
1994.
[5] Internet Group Management Protocol, version 2 (IGMPv2); W. Fenner;
ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-igmp-v2-**.txt.
Working draft, 1996.
[6] Distance Vector Multicast Routing Protocol (DVMRP); T. Pusateri;
ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-dvmrp-v3-**.txt.
Working draft, 1997.
[7] Protocol Independent Multicast (PIM) Sparse Mode/Dense Mode Speci-
fications; D. Estrin et al; ftp://netweb.usc.edu/pim. Working drafts,
1996.
[8] Multicast Extensions to OSPF; J. Moy; RFC 1584; March 1994.
[9] Reverse path forwarding of broadcast packets; Y.K. Dalal and R.M.
Metcalfe; Communications of the ACM, 21(12):1040--1048, 1978.
[10] Some Issues for an Inter-Domain Multicast Routing Protocol; D.
Meyer; ftp://ds.internic.net/internet-drafts/draft-ietf-mboned-imrp-
some-issues-**.txt. Working draft, 1997.
[11] SDP: Session Description Protocol; M. Handley and V. Jacobson;
ftp://ds.internic.net/internet-drafts/draft-ietf-mmusic-sdp-02.txt;
Working draft, 1996.
[12] Hierarchical Protocol Independent Multicast; M. Handley, J.
Crowcroft, I. Wakeman. Available from:
http://www.cs.ucl.ac.uk/staff/M.Handley/hpim.ps and
ftp://cs.ucl.ac.uk/darpa/IDMR/hpim.ps; 1995.
[13] Internet Control Message Protocol (ICMP); J. Postel; RFC 792;
September 1981.
[14] IP Encapsulation within IP; C. Perkins; RFC 2003; October 1996.
[15] CBT - Dense Mode Multicast Interoperability; A. Ballardie;
ftp://ds.internic.net/internet-drafts/draft-ietf-idmr-cbt-
dvmrp-**.txt. Working draft, March 1997.
[16] Performance and Resource Cost Comparisons of Multicast Routing
Algorithms for Distributed Interactive Simulation Applications; T.
Billhartz, J. Bibb Cain, E. Farrey-Goudreau, and D. Feig. Available
from: http://www.epm.ornl.gov/~sgb/pubs.html; July 1995.
Author Information
Tony Ballardie,
Research Consultant
e-mail: ABallardie@acm.org
APPENDIX Part 1: Comparisons & Simulation Results
Part 1 of this appendix summarises the results of in-depth simula-
tions carried out by Harris Corp., USA, investigating the performance
and resource cost comparisons of multicast algorithms for distributed
interactive simulation (DIS) applications [16]. More precisely, the
report summarises the work on the Real-time Information Transfer &
Networking (RITN) program, comparing the cost and performance of PIM
and CBT in a DIS environment. As we said earlier, DIS applications
have wide-ranging requirements. We feel it is important to take into
account a wide range of requirements so that future applications can
be accommodated with ease; most other multicast architectures are
tailored to the requirements of applications in the current Internet,
particularly audio and video applications. Figure 3 shows a compari-
son of application requirements.
We also present results from a study of whether source-based trees or
shared trees are the best choice for multicasting in the RITN
program. In the study of shortest-path trees (SPTs) vs. shared trees,
PIM Dense-Mode and PIM-SM with SPTs were used as SPTs, with CBT and
PIM-SM used as shared trees. This section assumes the reader is
familiar with the different modes of PIM [7].
|--------------|----------------------------------------------------|
|              |                      Paradigm                      |
|              |----------------------------------------------------|
| Requirement  |             |  audio/video   |    audio/video      |
|              |     DIS     | lecture dist'n |     conference      |
=====================================================================
| senders      |    many     |  small number  |    small number     |
---------------------------------------------------------------------
| receivers    | also senders|   many recvrs  |    also senders     |
---------------------------------------------------------------------

                               [...]

| tree         |   virtual   |  and includes  | and can slowly move |
|              | space, with | current recvrs | over phys topology  |
|              |  rapid mvmt |                |                     |
|              |  over phys  |                |                     |
|              |  topology   |                |                     |
---------------------------------------------------------------------

           Figure 3: Comparison of Multicast Requirements
The following criteria were considered in the simulations:

   o end-to-end delay
   o group join time
   o scalability (all participants are both senders & receivers)
   o bandwidth utilization
   o overhead traffic
   o protocol complexity

A brief summary of the results of the evaluation follows. For a
detailed description of the simulations and test environments, refer
to [16].
   o End-to-end delay. It was shown that PIM-DM and PIM-SM with SPTs
     deliver packets between 13% and 31% faster than CBT. PIM-SM has
     about the same delay characteristics as CBT. Processing delay
     was not taken into account here; taking it into account would
     stand in CBT's favour, since PIM routers are likely to have much
     larger routing tables, and thus, much greater search times.

   o Join time. There was very little difference between any of the
     algorithms.

   o Bandwidth efficiency. It was shown that PIM-SM with shared trees
     and PIM-SM with SPTs both required only about 4% more bandwidth
     in total to deliver data to hosts. PIM-DM, however, is very
     bandwidth inefficient, but this improves greatly as the network
     becomes densely populated with group receivers.

   o Overhead comparisons (for tree creation, maintenance, and
     teardown). CBT exhibited the lowest overhead percentage, even
     less than PIM-SM with shared trees. PIM-DM was shown to have
     more than double the overhead of PIM-SM with SPTs due to the
     periodic broadcasting & pruning.
     The Harris simulations [16] measured the average number of
     routing table entries required at each router for CBT, PIM-DM,
     PIM-SM with SPTs, and PIM-SM with shared trees. The following
     parameters were used in the simulations:

        + N = number of active multicast groups in the network.
        + n = average number of senders to a group.
        + b = fraction of groups moving to source tree in PIM-SM.
        + c = average percentage of routers on the shared tree for a
          group (or source tree for a particular sender).
        + s = average percentage of routers on a source-based tree for
          a group (but not on shared tree).
        + d = average number of hops from a host to the RP.
        + r = total number of routers in the network.

     The following results were calculated, given b = 1, c = 0.5,
     s = 0.25, and d/r = 0.05. The formulae for the calculation are
     given in Part 2 of this Appendix.
|--------------|----------------------------------------------------|
|              |               Group size parameters                |
|              |----------------------------------------------------|
| Protocol     | N = 1000   | N = 1000   | N = 20,000 | N = 20,000  |
|              | n = 10     | n = 200    | n = 10     | n = 200     |
=====================================================================
|              |            |            |            |             |
| CBT          |    500     |    500     |   10,000   |   10,000    |
---------------------------------------------------------------------

                               [...]

---------------------------------------------------------------------
| PIM-Sparse   |            |            |            |             |
| with SPT     |   8000     |  150,500   |  160,000   | 3,010,000   |
---------------------------------------------------------------------
| PIM-Sparse,  |            |            |            |             |
| shared tree  |   1000     |   1,500    |   20,000   |  210,000    |
---------------------------------------------------------------------

         Figure 4: Comparison of Router State Requirements
   o Complexity comparisons. Protocol complexity, protocol traffic
     overhead, and routing table size were considered here. CBT was
     found to be considerably simpler than all other schemes, on all
     counts.
Traffic Concentration

One of the criticisms leveled at shared trees is that traffic is
aggregated onto a smaller subset of network links than with source
tree protocols.

It has been shown that if a shared tree spans a large number of
receivers, a not insignificant proportion of which are also senders,
and the tree has only a small number of cores (RPs), traffic
concentration is a problem on the links incident on the cores, and
for the cores themselves. The Harris simulation results [16] suggest
increasing the number of cores as group size increases, thereby
largely eliminating the problem.
Shared Trees and Load Balancing

An important question raised in the SPT vs. CBT debate is: how
effectively can load sharing be achieved by the different schemes?
DVMRP's scope for load sharing is limited primarily by its forwarding
algorithm, reverse path forwarding (RPF), which allows only
single-path routing.
With shared tree schemes however, each receiver can choose which of
the small selection of cores it wishes to join. Cores and on-tree
nodes can be configured to accept only a certain number of joins,
forcing a receiver to join via a different path. This flexibility
gives shared tree schemes the ability to achieve load balancing.
In general, spread over all groups, CBT has the ability to randomize
the group set over different trees (spanning different links around
the centre of the network), something that would not seem possible
under SPT schemes.
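As a rough illustration of the join-limit idea above (the quota value
and the core preference order are hypothetical, not something CBT
mandates), a receiver's join can simply fall back to the next
candidate core once the preferred core has reached its limit:

    # Illustrative sketch only: the join quota and the preference order
    # are hypothetical, not part of the CBT specification.
    def choose_core(group, preferred_cores, join_counts, join_limit):
        """Join via the first preferred core whose quota is not yet full."""
        for core in preferred_cores:    # e.g. ordered by the re-hash value
            if join_counts.get(core, 0) < join_limit:
                join_counts[core] = join_counts.get(core, 0) + 1
                return core             # join accepted via this core
        raise RuntimeError("all candidate cores have reached their limit")

    # Two groups sharing the same candidates end up on different trees,
    # spreading load over different links around the centre of the network.
    counts = {}
    core_a = choose_core("224.1.1.1", ["coreX", "coreY"], counts, join_limit=1)
    core_b = choose_core("224.2.2.2", ["coreX", "coreY"], counts, join_limit=1)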
Simulation Conclusions
In comparing CBT with the other shared tree architecture, PIM, CBT
was found to be more favourable in terms of scalability, complexity,
and overhead. Other characteristics were found to be similar.

When comparing SPTs with shared trees, we find that the end-to-end
delays of shared trees are usually acceptable, and can be improved by
strategic core placement. Routing table size is another important
factor in support of shared trees, as figures 1 and 4 clearly
illustrate.
Appendix: Part 2
The following formulae were used by Harris Corp. in calculating the
values in figure 4. The meaning of the formulae arguments precedes
figure 4.

   o The average no. of entries in each PIM-DM router is approximated
     by:

        T(PIM-DM) ~ N * n

   o The average no. of entries a CBT router maintains is the
     likelihood, c, that the router will be on a shared tree, times
     the total number, N, of shared trees:

        T(CBT) = N * c

   o For PIM-SM with shared trees, the average no. of entries a
     router maintains due to each source-based tree is the average
     no. of hops, d, from a host to the RP, divided by the number of
     routers, r, in the network; hence:

        T(PIM-SM, shared tree) = N * c + N * n * d/r

   o The average no. of routing table entries for PIM-SM, with some
     groups setting up source-based trees, is:

        T(PIM, SPT) = N * [b * n + 1] * c + N * b * n * s

For more detailed information on these formulae, refer to [16].
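As a quick numerical check (not part of the Harris report), the
following short Python fragment evaluates these formulae with the
parameter values quoted before figure 4; the printed values
correspond to the rightmost column of figure 4.

    # Evaluate the formulae with b = 1, c = 0.5, s = 0.25, d/r = 0.05.
    def t_pim_dm(N, n):
        return N * n                                  # T(PIM-DM) ~ N * n

    def t_cbt(N, c=0.5):
        return N * c                                  # T(CBT) = N * c

    def t_pim_sm_shared(N, n, c=0.5, d_over_r=0.05):
        return N * c + N * n * d_over_r               # T(PIM-SM, shared tree)

    def t_pim_spt(N, n, b=1, c=0.5, s=0.25):
        return N * (b * n + 1) * c + N * b * n * s    # T(PIM, SPT)

    # Rightmost column of figure 4 (N = 20,000 groups, n = 200 senders):
    print(t_cbt(20000))                  # 10,000
    print(t_pim_sm_shared(20000, 200))   # 210,000
    print(t_pim_spt(20000, 200))         # 3,010,000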