Internet Engineering Task Force                        Curtis Villamizar
INTERNET-DRAFT                                                     UUNET
draft-ietf-ospf-omp-02.txt

OSPF Optimized Multipath (OSPF-OMP)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright (C) The Internet Society (February 24, 1999). All Rights
Reserved.
Abstract
OSPF may form multiple equal cost paths between points. This is true
of any link state protocol. In the absence of any explicit support to
take advantage of this, a path may be chosen arbitrarily. Techniques
have been utilized to divide traffic somewhat evenly among the
available paths. These techniques have been referred to as Equal Cost
Multipath (ECMP). An unequal division of traffic among the available
paths is generally preferable. Routers generally have no knowledge
[...]
traffic are suddenly moved, making the technique prone to oscillations
[3]. The oscillation is damped to some extent by providing a range of
composite metric differences in which composite metrics are considered
equal and equal cost multipath techniques are used. Even then the
technique still suffers oscillations due to the coarse adjustments made
at equal/unequal metric boundaries.
1.2 Equal Cost Multipath
A widely utilized technique to improve loading is known as Equal Cost
Multipath (ECMP). ECMP is specified in [5]. In ECMP no attempt is made
to dynamically adjust OSPF costs based on loading, and therefore ECMP
is completely stable. If the topology is such that equal cost paths
exist, then an attempt is made to divide traffic equally among the
paths. At least three methods of splitting traffic have been used.
1. Per packet round robin forwarding.

2. Dividing destination prefixes among available next hops in the
forwarding entries.

3. Dividing traffic according to a hash function applied to the source
and destination pair.
The ``per packet round robin forwarding'' technique is only applicable
if the delays on the paths are almost equal. The delay difference must
be small relative to packet serialization time. Delay differences
greater than three times the packet serialization time can cause
terrible TCP performance. For example, packets 2, 4, and 6 may arrive
before packet 1, triggering TCP fast retransmit. The result will be
limiting TCP to a very small window and very poor performance over long
delay paths.
The delay differences must be quite small. A 532 byte packet is
serialized onto a DS1 link in under 2.8 msec. At DS3 speed,
serialization is accomplished in under 100 usec. At OC12 it is under
7 usec. For this reason ``per packet round robin forwarding'' is not
applicable to a high speed WAN.
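The serialization-time figures above can be verified with a short
calculation. The exact line rates used below (DS1 payload of roughly
1.536 Mbps, DS3 of roughly 44.736 Mbps, OC12 of roughly 622.08 Mbps)
are assumed nominal rates, not taken from this document.

```python
# Serialization time of a 532 byte packet at several link speeds.
# Line rates are assumed nominal payload rates, not from the draft.
PACKET_BITS = 532 * 8

def serialization_time(rate_bps):
    """Seconds required to clock one packet onto the wire."""
    return PACKET_BITS / rate_bps

ds1 = serialization_time(1.536e6)    # about 2.77 msec ("under 2.8 msec")
ds3 = serialization_time(44.736e6)   # about 95 usec ("under 100 usec")
oc12 = serialization_time(622.08e6)  # about 6.8 usec ("under 7 usec")
```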
Dividing destination prefixes among available next hops provides a
very coarse and unpredictable load split. Very short prefixes are
problematic. In reaching an end node, the majority of traffic is often
destined to a single prefix. This technique is applicable to a high
speed WAN, but with the drawbacks just mentioned, better techniques are
needed.
The ``source/destination hash'' based technique was used as far back
as the T1-NSFNET in the IBM RT-PC based routers. A hash function, such
as CRC-16, is applied over the source address and destination address.
The hash space is then split evenly among the available paths by either
setting thresholds or performing a modulo operation. Traffic between
any given source and destination remains on the same path. Because the
technique is based on host addresses, and uses both the source and
destination address, it does not suffer the coarse granularity problem
of the prefix based technique, even when forwarding to a single prefix.
Source/destination hash is the best technique available for a high
speed WAN.
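A minimal sketch of the hash-based split follows. The particular CRC
variant (reflected polynomial 0xA001) and the modulo split are
assumptions for illustration; the draft does not mandate either.

```python
def crc16(data):
    """Bitwise CRC-16. The polynomial 0xA001 (reflected 0x8005) is an
    assumption -- any uniformly distributed 16-bit hash would do."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

def path_index(src, dst, n_paths):
    """Split the hash space by a modulo operation so that traffic
    between a given source and destination always takes the same path."""
    return crc16(src + dst) % n_paths
```

Because the hash covers both host addresses, even traffic destined to a
single prefix spreads across the available paths.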
The forwarding decision for the ``source/destination hash'' based
technique is quite simple. When a packet arrives, look up the
forwarding entry in the radix tree. The next hop entry can be an array
index into a set of structures, each containing one or more actual next
hops. If more than one next hop is present, compute a CRC16 value
based on the source and destination addresses. The CRC16 can be
implemented in hardware and computed in parallel to the radix tree
lookup in high speed implementations, and discarded if not needed.
             .----.
            /      \
           |   N2   |
            \      /
             `----'
            //    \\
           //      \\
          //        \\
     .----.          .----.
    /      \        /      \
   |   N1   |======|   N3   |
    \      /        \      /
     `----'          `----'

Figure 1: A very simple application of ECMP
Each next hop entry in the structure must contain a boundary value and
the next hop itself. An integer ``less than'' comparison is made
against the boundary value, determining whether to use this next hop or
move on to the next comparison. In hardware the full set of
comparisons can be made simultaneously for up to some number of next
hops, or a binary search can be performed. This yields the next hop to
use.
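In software the boundary comparisons reduce to a binary search over the
sorted boundary values. A sketch, with illustrative names:

```python
import bisect

def select_next_hop(hash_value, boundaries, next_hops):
    """Return the first next hop whose boundary value is greater than
    the hash value (the integer ``less than'' comparison described
    above). boundaries must be sorted ascending, ending at 65536."""
    return next_hops[bisect.bisect_right(boundaries, hash_value)]
```

With boundaries [32768, 65536] and next hops [A, B], hash values below
32768 select A and the remainder select B.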
1.3 Optimized Multipath differs from ECMP
For ECMP, the boundary values are set by first dividing one more than
the maximum value that the hash computation can return (65536 for
CRC16) by the number of available next hops and then setting the Nth
boundary to N times that number (with the last boundary fixed at one
more than the maximum value regardless of underflow caused by
truncating during division, 65536 for CRC16).
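A sketch of this boundary computation, assuming a CRC16 hash so that
the hash space is 65536:

```python
HASH_SPACE = 65536  # one more than the maximum CRC16 value

def ecmp_boundaries(n_hops):
    """The Nth boundary is N times (HASH_SPACE // n_hops); the last
    boundary is fixed at HASH_SPACE regardless of truncation during
    the integer division."""
    step = HASH_SPACE // n_hops
    return [step * n for n in range(1, n_hops)] + [HASH_SPACE]
```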
An equal load split is not always optimal. Consider the example in
Figure 1 with the offered traffic in Table 1. If all of the link costs
are set equally, then the link N1---N3 is significantly overloaded
(135.75%) while the path N1---N2---N3 is lightly loaded (45.25% and
22.62%). If the cost on the N1---N3 link is equal to the cost of the
N1---N2---N3 path, then N1 will try to split the load destined toward
N3 across the two paths.
                              No     ECMP     OMP
Nodes   Node Names          Split  Traffic  Traffic
n3-n1   Node 3 -> Node 1      60     30       40
n1-n3   Node 1 -> Node 3      60     30       40
n3-n2   Node 3 -> Node 2      20     50       40
n2-n3   Node 2 -> Node 3      20     50       40
n2-n1   Node 2 -> Node 1      10     40       30
n1-n2   Node 1 -> Node 2      10     40       30

Table 1: Traffic loading for the example in Figure 1

Given the offered traffic in Table 1 the loading on N1---N3 is reduced
to 67.87% but the link loading on the path N2---N3 becomes 113.12%.
Ideally node N1 should put 1/3 of the traffic toward N3 on the path
N1---N2---N3 and 2/3 on the path N1---N3. To know to do this N1 must
know the loading on N2---N3.

This is where Optimized Multipath (OMP) provides additional benefit
over ECMP. Ignoring for the moment how node N1 knows to put 1/3 of the
traffic toward N3 on the path N1---N2---N3 (described later in Section
2), the way the distribution of traffic is accomplished from a
forwarding standpoint is to move the boundary in the forwarding
structure from the default value of 1/2 of 65536 to about 1/3 of 65536.
If there is a very large set of source and destination host address
pairs, then the traffic will be split among the 65536 possible hash
values. This provides the means for a very fine granularity of
adjustment.
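The unequal split generalizes the equal one: boundaries are placed in
proportion to per-path weights. A sketch, assuming the weights are
supplied by the OMP load-adjustment algorithm:

```python
HASH_SPACE = 65536

def weighted_boundaries(weights):
    """Boundary values proportional to the per-path weights. For
    weights (1, 2) the single movable boundary lands at about 1/3 of
    65536, putting 1/3 of the flows on the first path."""
    total = sum(weights)
    bounds, acc = [], 0
    for w in weights[:-1]:
        acc += w
        bounds.append(HASH_SPACE * acc // total)
    return bounds + [HASH_SPACE]
```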
Having explained how a fine granularity of forwarding adjustment can
be accomplished, what remains is to define how nodes in a large
topology can know what the loading levels are elsewhere in the
topology, and to define an algorithm which can allow autonomous
unsynchronized decisions on the parts of many routers in a topology to
quickly converge on a near optimal loading without the risk of
oscillation. This is covered in the following sections.
2 Flooding Loading Information
Loading information is flooded within an OSPF area using Opaque LSAs
[1]. Area local scope (link-state type 10) link state attributes are
flooded containing an ``Opaque Type'' of LSA_OMP_LINK_LOAD or
LSA_OMP_PATH_LOAD. The type LSA_OMP_LINK_LOAD Opaque LSA is used to
flood link loading information within an area. The type
LSA_OMP_PATH_LOAD Opaque LSA is used to flood loading information for
use with inter-area routes. Loading information obtained from an
exterior routing protocol may also be considered if available. The
means of passing loading information in an exterior routing protocol
is beyond the scope of this document.
2.1 Link Loading Information
Within an area link loading is flooded using the type
LSA_OMP_LINK_LOAD Opaque LSA. The format of this LSA is described in
Appendix A.

The ``Opaque Information'' in the type LSA_OMP_LINK_LOAD Opaque LSA
contains the following.

1. a measure of link loading in each direction as a fraction of link
capacity,

2. a measure of packets dropped due to queue overflow in each direction
(if known) expressed as a fraction,

3. the link capacity in kilobits per second (or unity if less than 1000
bytes per second).
Generally the number of output packets dropped will be known. In
designs where drops occur on the input, the rate of input queue drops
should be recorded. These measures of loading and drop are computed
using the interface counters generally maintained for SNMP purposes,
plus a running count of output queue drops if available. The counters
are sampled every 15 seconds but generally flooded at longer time
intervals.
The previous value of each of the counters is subtracted from the
current value. The counters required are 1) bytes out, 2) bytes in,
3) packets out, 4) packets in, 5) output queue drops, and 6) input
queue drops. These counters should already exist to satisfy SNMP
requirements.
A value of instantaneous load in each direction is based on byte count
and link capacity. An instantaneous output queue drop rate is based on
queue drops and packet count. Some of the values are filtered as
described in Appendix B.1.
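As a sketch, the instantaneous output load and drop rate might be
derived from two successive counter samples as follows. The field
names and the absence of counter-wrap handling are assumptions for
illustration, not requirements of the draft.

```python
def link_sample(prev, cur, capacity_bps, interval=15.0):
    """Instantaneous output load (fraction of capacity) and output
    queue drop rate from two counter samples taken `interval` seconds
    apart. Counter wrap handling is omitted from this sketch."""
    load = (cur["bytes_out"] - prev["bytes_out"]) * 8 / (interval * capacity_bps)
    pkts = cur["pkts_out"] - prev["pkts_out"]
    drops = cur["out_drops"] - prev["out_drops"]
    drop_rate = drops / (pkts + drops) if (pkts + drops) else 0.0
    return load, drop_rate
```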
The last time that a type LSA_OMP_LINK_LOAD Opaque LSA with the same
Opaque ID was sent is recorded and the values sent are recorded. For
the purpose of determining when to reflood, an equivalent loading
figure is used. The computation of equivalent loading is described in
Section 2.3.
The higher of the current equivalent loading computation and the
previous is used when determining whether to send the type
LSA_OMP_LINK_LOAD Opaque LSA. The type LSA_OMP_LINK_LOAD Opaque LSA
is flooded according to elapsed time since last flooded, the current
equivalent load, and the difference between the current equivalent
load and the previously flooded equivalent load. The reflooding
decision is described in detail in Appendix B.1.
The point of this graduated reflooding schedule is to reduce the
amount of flooding that occurs unless links are in trouble or
undergoing a significant traffic shift. Change may occur in a
quiescent network due to failure external to the network that causes
traffic to take alternate paths. In this case, the more frequent
flooding will trigger a faster convergence. Traffic shift may also
occur due to shedding of loading by the OMP algorithm itself as the
algorithm converges in response to an external change.
2.2 Path Loading Information
Path loading information regarding an adjacent area is flooded by an
Area Border Router (ABR) using the type LSA_OMP_PATH_LOAD Opaque LSA.
The format of this LSA is described in Appendix A.
The ``Opaque Information'' in the type LSA_OMP_PATH_LOAD Opaque LSA
contains the following.
[...]
These values are taken from the link on the path from the ABR to the
destination of the summary LSA. The link with the highest loading may
not be the link with the lowest capacity. The queue drop value is one
minus the product of the fraction of packets that are not dropped at
each measurement point on the path (input and output in the direction
of the path). The following computation is used.
path-loss = 1 - product(1 - link-loss)
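A direct transcription of this formula, with loss values expressed as
fractions between 0 and 1:

```python
def path_loss(link_losses):
    """path-loss = 1 - product(1 - link-loss), taken over all
    measurement points (input and output) along the path."""
    survive = 1.0
    for loss in link_losses:
        survive *= 1.0 - loss
    return 1.0 - survive
```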
The path loading and path loss rate are filtered according to the
algorithm defined in Appendix B.1. Rather than polling a set of
counters, the current value of the path loading and path loss rate is
used. An equivalent load is calculated for each path to a summary LSA
destination as described in Section 2.3. A type LSA_OMP_PATH_LOAD
Opaque LSA is flooded according to the same rate schedule as described
in the prior section and Appendix B.1.
An ABR may be configured to not send type LSA_OMP_PATH_LOAD Opaque
LSAs into any given area. See Appendix C.
2.3 Computing equivalent loading
The equivalent load is the actual fractional loading multiplied by a
factor that provides an estimate, based on loss, of the extent to
which TCP is expected to slow down to avoid congestion. This estimate
is based on the link bandwidth and loss rate, knowledge of TCP
dynamics, and some assumption about the characteristics of the TCP
flows being passed through the link. Some of the assumptions must be
configured.

If loss is low or zero, the equivalent load will be equal to the
actual fractional loading (link utilization expressed as a number
between 0 and 1). If loss is high and loading is at or near 100%, then
the equivalent load calculation provides a means of deciding which
links are more heavily overloaded. The equivalent load figure is not
intended to be an accurate prediction of offered load, simply a metric
for use in deciding which link to offload.
Mathis et al provide the following estimate of loss given TCP window
size and round trip time [4].

p < (MSS / (BW * RTT))**2
[...]
estimated using the following computation.

equiv-load = load * K * sqrt(loss)

The inverse of the square root of 1% is 10, so 10 may be used for the
value of ``K''.
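A sketch of the equivalent-load computation. Flooring the result at
the raw load is an assumption added here so that a low-loss or
loss-free link reports its measured utilization unchanged, matching
the description above:

```python
import math

K = 10  # roughly 1 / sqrt(loss) at a configured reference loss rate

def equivalent_load(load, loss):
    """equiv-load = load * K * sqrt(loss). The floor at the raw load
    is an assumption so that low or zero loss leaves the measured
    fractional loading unchanged."""
    return max(load, load * K * math.sqrt(loss))
```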
The conversion of loss to estimated loading is not at all accurate.
The non-linearity does affect the time to converge, though convergence
still occurs as long as loss is positively correlated to loading.
This is discussed further in Appendix E.1.
3 Next hop structures
A ``next hop structure'' contains a set of complete paths to a
destination, some of which may share the same immediate next hop. The
name is not meant to imply a single next hop. A given route can
reference only one next hop structure, which can contain multiple
paths and multiple next hops. Entries for paths that use the same next
hop are combined before moving information to the forwarding table. A
next hop structure contains the information necessary to balance load
across a set of next hops.
For intra-area routes, a separate next hop structure must exist for
each destination router or network. For inter-area routes (summary
routes), at most one next hop structure is needed for each combination
of ABRs which announce summary routes that are considered equidistant.
Optimizing inter-area and external routing is discussed in Section
3.2.
The set of intra-area next hop structures is initialized after the
OSPF SPF calculation is completed. An additional set of next hops is
then added by relaxing the best path criteria.
The use of the next hop structure and its contents is described in
Section 4.1.
3.1 Relaxing the Best Path Criteria
The exercise of setting link costs to produce the most beneficial set
of equal cost paths is tedious and very difficult for large
topologies. OSPF as defined in RFC 2328 requires that only the best
path be considered. For the purpose of Optimized Multipath, this
criterion can be relaxed to allow a greater number of multipaths, but
not to the point of creating routing loops. Any next hop which is
closer in terms of cost than the current hop and does not cross a
virtual link can be considered a viable next hop for multipath
routing. If next hops were used where the cost at the next hop is
equal or greater, routing loops would form.
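The relaxed criterion can be sketched directly from the SPF results.
The distance map and names below are illustrative, and the
virtual-link exclusion is omitted from this sketch:

```python
def viable_next_hops(dist_to_dest, this_router, neighbors):
    """A neighbor is a viable multipath next hop iff its SPF cost to
    the destination is strictly less than this router's cost; an
    equal or greater cost could form a routing loop."""
    mine = dist_to_dest[this_router]
    return [n for n in neighbors if dist_to_dest[n] < mine]
```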
In considering the paths beyond the next hop, only the best paths
should be considered. There is no way to determine if subsequent
routers have relaxed the best path criterion. In addition, there is
no need to consider the additional paths if the best path criterion is
relaxed downstream. If the best path criterion is relaxed downstream,
the best paths must be part of the downstream next hop structure. If
there are additional paths that the downstream is able to use to
further distribute the load, the entire set of paths will still
converge toward optimal loading.
The best path criteria is relaxed only for intra-area routes. The
best path criteria can also be relaxed when considering the cost to
reach ABRs or ASBRs. The best path criteria should not be relaxed
when considering the total cost to reach a summary route or external
route.
3.2 Offloading Congestion Outside the OSPF Area
For inter-area routes or external routes, a separate next hop structure must exist for each such route if it is desirable to reduce loading outside of the area and the loading within the area is sufficiently low to safely allow this.
The existing procedures regarding selection of inter-area and exter-
nal routes outlined in [5] still apply. For inter-area routes the
intra-area cost and cost of the summary route are summed. For exter-
nal routes the intra-area cost is summed with a type 1 external cost
and considered before a type 2 external cost. The best path criteria
is not relaxed when applied to the sum of intra-area cost and summary
route cost or intra-area cost and type 1 external cost.
In order for an ABR or ASBR to be considered as a viable exit point to
the area for a given destination, it must be advertising an applicable
summary route or external route. The best summary route or external
route must still be chosen. If a single ABR or ASBR advertises the
best route, multiple paths to that ABR or ASBR may be used, but traf-
fic cannot be sent toward an ABR or ASBR advertising a higher cost
summary route or external route. If two or more ABR or ASBR advertise
a route at the same cost, then traffic load can be split among these
ABR or ASBR.
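The exit point selection described above can be sketched as follows. This is illustrative Python, not part of the protocol; the data representation is an assumption.

```python
def usable_exits(advertisers):
    """Only ABRs or ASBRs advertising the applicable summary or
    external route at the best (lowest) cost may carry traffic.
    Ties allow the load to be split among the advertising routers.
    `advertisers` maps a border router id to its advertised cost."""
    best = min(advertisers.values())
    return [r for r, cost in advertisers.items() if cost == best]
```

If a single border router advertises the best cost, multiple paths to that one router may still be used; traffic simply may not be sent toward a router advertising a higher cost.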
For inter-area routes, if a type LSA_OMP_PATH_LOAD Opaque LSA exists for the summary LSA and more than one ABR is advertising an equally preferred summary route and the equivalent load for the summary LSA is greater than 90% and the equivalent load within the area is sufficiently smaller than the inter-area loading, then a next hop structure can be created specifically to allow offloading of the inter-area route. For external routes, if an equivalent loading exists, and more than one ASBR is advertising an equally preferred external route, and the equivalent load is greater than 95% and the equivalent load within the area is sufficiently smaller than the external route loading, then a separate structure is used.
Hysteresis must be used in the algorithm for determining if an equivalent load on a summary LSA or external route is considered sufficiently larger than the intra-area equivalent load or if an external route loading is considered sufficiently larger than the inter-area equivalent load. For the purpose of describing this algorithm one equivalent load is referred to as the more external, and the other as the more internal equivalent load.
If the more external equivalent load exceeds the more internal equivalent load by 15% and the more internal equivalent load is under 85%, then a separate next hop structure is created. If the more external equivalent load falls below 20% of the more internal equivalent load or the more internal equivalent load exceeds 98%, then an existing separate next hop structure is marked for removal and combined with the more internal next hop structure (see Section 3.3). The more external equivalent load should not fall significantly below the more
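The hysteresis thresholds above can be expressed as a small decision function. The following Python sketch is illustrative and not part of this specification; it assumes the 15% create threshold is additive (the text does not state this explicitly) and expresses all equivalent loads as fractions of capacity.

```python
def update_separate_structure(has_separate, ext_load, int_load):
    """Decide whether to create, keep, or remove a separate next
    hop structure.  ext_load is the more external equivalent load,
    int_load the more internal, both as fractions (0.0 to 1.0).
    Create: external exceeds internal by 15% and internal < 85%.
    Remove: external < 20% of internal, or internal > 98%."""
    if not has_separate:
        if ext_load > int_load + 0.15 and int_load < 0.85:
            return "create"
        return "keep"
    if ext_load < 0.20 * int_load or int_load > 0.98:
        return "remove"
    return "keep"
```

The gap between the create and remove conditions is what prevents the structure from flapping in and out of existence as loads fluctuate.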
As described in Section 3.2 a separate next hop structure is needed if the loading indicated by the type LSA_OMP_PATH_LOAD Opaque LSA or exterior routing protocol is sufficiently high to require separate balancing for traffic to the summary-LSA or exterior route and the intra-AS loading is sufficiently low.
When a separate next hop structure is created, the same available paths appear in the structure, leading to the same set of ABR or ASBR. The balance on these available paths should be copied from the existing more internal next hop structure. By initializing the new next hop structure this way, a sudden change in loading is avoided if a great deal of traffic is destined toward the summary route or external route.
When a separate next hop structure can be destroyed, the traffic should be transitioned gradually. The next hop structure must be marked for deletion. The traffic share in this separate next hop structure should be gradually changed so that it exactly matches the traffic share in the more internal next hop structure. The gradual change should follow the adjustment rate schedule described in Section 4.1 where the move increment is increased gradually as moves continue in the same direction. The only difference is that there is no need to overshoot when adjusting to match the more internal next hop structure parameters. Once the separate next hop structure marked for deletion matches the more internal next hop structure, the summary route or external route can be changed to point to the more internal next hop structure and the deletion can be made.
3.4 Critically Loaded Segment

For every set of paths, one link or part of the path is identified as the ``critically loaded'' segment. This is the part of the path with the highest equivalent load as defined in Section 2.3. For an inter-area route with a separate next hop structure, the critically loaded segment may be the critically loaded segment for the intra-area set of paths, or it may be the summary LSA if the equivalent load on the summary LSA is greater. For an external route with a separate next hop
area loading exceeds this ceiling, the summary LSA loads or external route loads are in effect ignored.
Each next hop structure has exactly one ``critically loaded'' segment. There may be more than one path in the next hop structure sharing this critically loaded segment. A particular Opaque LSA may be the critically loaded segment for no next hop structures if it is lightly loaded. Another Opaque LSA may be the critically loaded segment for many next hop structures if it is heavily loaded.
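Selection of the critically loaded segment reduces to a maximum over the links appearing in a next hop structure. A minimal Python sketch, with an assumed representation of per-path link loads:

```python
def critically_loaded_segment(paths):
    """Return the one segment with the highest equivalent load
    among all paths of a next hop structure.  `paths` maps a path
    id to a list of (segment, equivalent_load) pairs.  The same
    segment may appear on several paths; a lightly loaded segment
    may be critical for no structure at all."""
    best_seg, best_load = None, -1.0
    for segments in paths.values():
        for seg, load in segments:
            if load > best_load:
                best_seg, best_load = seg, load
    return best_seg
```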
3.5 Optimizing Partial Paths
Under some circumstances multiple paths will exist to a destination
where all of the available paths share one or more links. In some
cases overall system convergence time can be substantially improved by
optimizing a partial path when the most heavily loaded link is shared
by all available paths to a destination.
Computations are actually reduced when partial paths are considered. The next hop structures kept within the routing process must contain the full paths used to reach a destination (this is already a requirement). After an SPF calculation has changed the next hop structure and before attempting any optimization the set of paths is examined looking for intra-area links which are common to all paths. If any such links are found, only intra-area links closer than any of these links can be considered as candidates for the ``critically loaded'' segment (Section 3.4). If there is only one immediate hop, no attempt is made to load balance.
The change in load adjustment parameters should be applied to the data structures for the full paths even though only a subset of the links are eligible to be considered as the critically loaded segment. For the purpose of building type LSA_OMP_PATH_LOAD Opaque LSA, loading along the entire path must be considered including links shared by all available paths.
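The candidate filtering described in this section can be sketched as follows. This is illustrative only; the path representation (an ordered list of links from this router toward the destination) is an assumption.

```python
def candidate_segments(paths):
    """Links common to every available path are excluded as
    candidates for the critically loaded segment; only links on
    each path closer than the first shared link remain eligible."""
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    candidates = []
    for path in paths:
        for link in path:
            if link in shared:
                break          # nothing at or beyond a shared link
            candidates.append(link)
    return candidates
```

Note that with a single path every link is shared, so no candidates remain, which matches the rule that no load balancing is attempted when there is only one immediate hop.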
4 Adjusting Equal Cost Path Loadings
Next hop structures are described in Section 3. A next hop structure
contains a set of complete paths to a destination.
Adjustments are made to a next hop structure to reflect differences in loading on the paths as reported by the type LSA_OMP_LINK_LOAD Opaque LSA and type LSA_OMP_PATH_LOAD Opaque LSA. Section 3.4 describes the selection of a ``critically loaded segment'' which is used to determine when to make adjustments and the size of the adjustments. Section 3.5 describes conditions under which some links are excluded from consideration as the ``critically loaded segment''.
An adjustment to loading of a given set of equal cost paths is made when one of two conditions is true. Either the ``critically loaded segment'' has been reflooded, or a criteria is met involving 1) the difference between the equivalent load of the ``critically loaded segment'' and the lightest loaded path, 2) the equivalent load of the ``critically loaded segment'', 3) the type of destination, intra-area, inter-area, or external, and 4) the amount of time since the last load adjustment. The details of this conditional are described in Appendix B.
1. The current ``traffic share'' (an integer, the range is 0 to 65535 for a CRC16 hash),

2. The current ``move increment'' used when moving traffic toward this path (an integer, the range is 0 to 65535 for a CRC16 hash),

3. The number of moves in the same direction, referred to as the ``move count''.
If there is no prior history for a path, then the move increment is initialized to a constant, typically about 1% (about 650 for CRC16). The number of moves in the same direction is initialized to 0. No loading adjustment is made on the first iteration.
If the critically loaded segment has changed, all paths now containing the critically loaded segment are first examined. The lowest move increment of any one of these paths is noted.
The move increment is adjusted for each path before any traffic is moved. One of the following actions is taken for each path.
1. If the path contains the critically loaded segment its move increment is left unchanged.

2. If the path does not contain the critically loaded segment but the critically loaded segment has changed and the path contains the prior critically loaded segment, then first the move increment is replaced with the lowest move increment from any of the paths containing the critically loaded segment unless the move increment is already lower. Then in either case the move increment is cut in half.

3. If the path does not contain the critically loaded segment and either the critically loaded segment has not changed, or the path does not contain the prior critically loaded segment, then the move increment is increased.
The amount of increase in the move increment is described in Appendix B.4. The increase is designed to minimize the possibility of dramatic overshoot due to too great an increase in adjustment rate. The move increment is never less than a configured minimum. The increase in move increment is never less than one but generally is constrained to a higher number by virtue of being calculated based on the prior move increment. The configured minimum for the move increment is typically 0.1% (65 for CRC16). The move increment is never allowed to exceed the size of the hash space divided by the number of equal cost paths in the next hop structure.
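The three move increment rules and the bounds above can be summarized in code. The Python sketch below is illustrative; the rule 3 growth schedule here is a simple stand-in for the one defined in Appendix B.4, and the constants are the typical values mentioned in the text.

```python
HASH_SPACE = 65536   # CRC16 hash space
INC_MIN = 65         # configured minimum, about 0.1% of the hash space

def adjust_move_increment(inc, on_critical, critical_changed,
                          on_prior_critical, lowest_critical_inc,
                          num_paths):
    """Update one path's move increment before any traffic is moved."""
    if on_critical:
        new = inc                          # rule 1: left unchanged
    elif critical_changed and on_prior_critical:
        # rule 2: take the lowest increment among paths containing the
        # critical segment (unless already lower), then cut in half
        new = min(inc, lowest_critical_inc) // 2
    else:
        # rule 3: increase (Appendix B.4 defines the actual schedule)
        new = inc + max(inc // 8, 1)
    # never below the configured minimum, never above the hash space
    # divided by the number of equal cost paths
    return max(INC_MIN, min(new, HASH_SPACE // num_paths))
```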
The dramatic decrease in move increment when move direction is reversed and the slow increase in move increment when it remains in the same direction keeps the algorithm stable. The exponential nature of the increase allows the algorithm to track externally caused changes in traffic loading.
The traffic share allocated to a path not containing the critically
ing loops and have a severe short term impact on link loading. An oscillating link can cause high levels of loss and is generally better off held in the neighbor adjacency ``down'' state. The algorithm described in [7] can be used when advertising OSPF type 1 or type 2 LSA (router and network LSAs).
Regardless of whether router and network LSAs are damped, neighbor adjacency state changes will occur and router and network LSAs will have to be handled. The LSA may indicate an up transition or a down transition. In either an up or down transition, when the SPF algorithm is applied, existing paths to specific destinations may no longer be usable and new paths may become usable. In the case of an up transition, some paths may no longer be usable because their cost is no longer among those tied for the best. In the case of down transitions, new paths may become usable because they are now the best path still available.
4.2.2 Handling the Loss of Paths

When a path becomes unusable, paths which previously had the same cost may remain. This can only occur on an LSA down transition. A new next hop entry should be created in which the proportion of source/destination hash space allocated to the now infeasible path is distributed to the remaining paths proportionally to their prior allocation. Very high loading percentages should result, triggering an increase in LSA_OMP_LINK_LOAD Opaque LSA flooding rate until
4.2.3 Handling the Addition of Paths

When a new path becomes usable it may be tied for best with paths carrying existing traffic. This can only occur on an LSA up transition. A new next hop entry should be created in which the loading on the new path is zero. If such a path were to oscillate, little or no load would be affected. If the path remains usable, the shift of load to this path will accelerate until a balance is reached.
If a completely new set of best paths becomes available, the load should be split across the available paths. The split used in simulations was a share on a given link proportional to 10% of link capacity plus the remaining link bandwidth as determined by prior LSA_OMP_LINK_LOAD Opaque LSA values. The contribution of link capacity in the weighting should be configurable. See Appendix C.5.
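The initial split described above can be computed as below. This Python sketch is illustrative; the representation of link capacity and prior load, and the conversion to hash space shares, are assumptions.

```python
def initial_shares(links, capacity_weight=0.10, hash_space=65536):
    """Split load across a completely new set of best paths: each
    link is weighted by 10% of its capacity plus its remaining
    bandwidth from the most recent LSA_OMP_LINK_LOAD values.
    `links` maps a link id to (capacity, used) in common units.
    The capacity contribution (here 10%) should be configurable."""
    weights = {l: capacity_weight * cap + (cap - used)
               for l, (cap, used) in links.items()}
    total = sum(weights.values())
    return {l: int(hash_space * w / total) for l, w in weights.items()}
```

An idle 100 Mb/s link thus starts with roughly eleven times the share of a fully loaded link of the same capacity.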
Acknowledgements

Numerous individuals have provided valuable comments regarding this work. Dave Ward made a very substantial contribution by pointing out that the best path criteria could be relaxed. Geoffrey Cristallo provided comments on the handling of inter-area and external routes with worked examples which resulted in corrections and clarifications to this document. John Scudder, Tony Li, and Daniel Awduche have also provided particularly valuable review and comments.
References

[1] R. Coltun. The OSPF opaque LSA option. Technical Report RFC 2370, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2370.txt.

[2] Atul Khanna and John Zinky. The revised ARPAnet routing metric. In SIGCOMM Symposium on Communications Architectures and Protocols, pages 45--56, Austin, Texas, September 1989. ACM.

[3] Steven H. Low and P. Varaiya. Dynamic behavior of a class of adaptive routing protocols (IGRP). In Proceedings of the Conference on Computer Communications (IEEE Infocom), pages 610--616, March/April 1993.

[4] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic behavior of the TCP congestion avoidance algorithm. ACM Computer Communication Review, 27(3), July 1997.
[5] J. Moy. OSPF version 2. Technical Report RFC 2328, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2328.txt.

[6] W. Stevens. TCP slow start, congestion avoidance, fast retransmit, and fast recovery algorithms. Technical Report RFC 2001, Internet Engineering Task Force, 1997. ftp://ftp.isi.edu/in-notes/rfc2001.txt.

[7] C. Villamizar, R. Chandra, and R. Govindan. BGP route flap damping. Technical Report RFC 2439, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2439.txt.
Security Considerations

The use of the Opaque LSAs described in this document does not impact the security of OSPF deployments. In deployments which use a strong OSPF authentication method, and require signatures on LSAs from the originating router, no leveraging of a partial compromise beyond a localized disruption of service is possible. In deployments which use a strong OSPF authentication method, but do not require signatures on LSAs from the originating router, compromise of a single router can be leveraged to cause significant disruption of service with or without the use of these Opaque LSAs, but disruption of service cannot be achieved without such a compromise. In deployments which use a weak OSPF authentication method, significant disruption of service can be caused through forged protocol interactions with or without the use of these Opaque LSAs.
Author's Addresses

Curtis Villamizar
UUNET Network Architecture Group
<curtis@uu.net>
Full Copyright Statement
Copyright (C) The Internet Society (February 24, 1999). All Rights
Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are in-
cluded on all such copies and derivative works. However, this doc-
ument itself may not be modified in any way, such as by removing the
copyright notice or references to the Internet Society or other In-
ternet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
``AS IS'' basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEER-
ING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUD-
ING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MER-
CHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
A Data Formats

+----------------+----------------+----------------+----------------+
|  Link State Advertisement Age   |    Options     |    LSA Type    |
+----------------+----------------+----------------+----------------+
|  Opaque Type   |                    Opaque ID                     |
+----------------+----------------+----------------+----------------+
|                        Advertising Router                         |
+----------------+----------------+----------------+----------------+
|             Link State Advertisement Sequence Number              |
+----------------+----------------+----------------+----------------+
|          LSA Checksum           |           LSA length            |
+----------------+----------------+----------------+----------------+
|    Version     | Reference Type | Packing Method |    BW Scale    |
+----------------+----------------+----------------+----------------+
|      Reference to a Type 1-5 LSA (32 or 64 bits, see below)       |
+----------------+----------------+----------------+----------------+
|      Packed Loading Information (variable length, see below)      |
+----------------+----------------+----------------+----------------+
The ``LSA Age'', ``Options'', and ``LSA Type'' are part of the Link
State Advertisement Format described in Appendix A of RFC--2328. The
LSA Type is 10, signifying an Opaque LSA with Area local scope, as
defined in RFC--2370. RFC--2370 splits the Link State ID field into
two parts, Opaque Type and Opaque ID. The Opaque Type contains either
the value LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD as described in
Section 2. The ``Advertising Router'', ``Link State Advertisement
Sequence Number'', ``LSA Checksum'', and ``LSA length'' are part of
the Link State Advertisement Format described in Appendix A of
RFC--2328. The remainder of the packet is specific to Opaque Type
LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD LSAs.
Opaque Type The Opaque Type will contain the value LSA_OMP_LINK_LOAD
   or LSA_OMP_PATH_LOAD as described in Section 2. Numeric values
   will be requested from IANA.
Opaque ID The Opaque ID will contain an integer which will be unique
   per router and interface, virtual interface, or MAC address for
   For a LSA_OMP_PATH_LOAD Opaque LSA, the ``Opaque ID'' must contain
   a 24 bit integer that is unique to a summary LSA or AS-external LSA
   advertised by the same router. The method of assignment of these
   24 bit integers is a local matter. A router must be capable of
   uniquely identifying a summary LSA or AS-external LSA using a 24
   bit number.
Version The version number is 1.
Reference Type The Reference Type indicates the type of LSA that is
   being referenced in the ``Reference to a Type 1-5 LSA'' field.
Packing Method The Packing Method is an integer that indicates how
   the ``Packed Loading Information'' field is formatted.
BW Scale Bandwidth numbers must be scaled by shifting the 32 bit
   integer left by this amount. If this value is non-zero, 64 bit
   integers or larger should be used to represent the bandwidth.
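A minimal sketch of the scaling rule, assuming a plain left shift
into a 64 bit result (the function name is illustrative):

```c
#include <stdint.h>

/* Recover a bandwidth in kilobits per second from the advertised
 * 32 bit integer and the BW Scale shift count.  With a non-zero
 * scale the result can exceed 32 bits, hence the 64 bit type. */
static inline uint64_t scaled_bandwidth_kbps(uint32_t advertised,
                                             uint8_t bw_scale)
{
    return (uint64_t)advertised << bw_scale;
}
```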
Reference to a Type 1-5 LSA This field contains the 32 bit ``Link
   State ID'' field of a Type 1-5 LSA. Types 1-5 indicate:

      1. Router-LSAs
      2. Network-LSAs
      3. Summary-LSAs (IP network)
      4. Summary-LSAs (ASBR)
      5. AS-external-LSAs
   Loading information for a Type 1 LSA, a Router-LSA, is sent as a
   LSA_OMP_LINK_LOAD Opaque LSA. For a Type 1 LSA the ``Link State
   ID'' field is followed by a 32 bit ``Link ID''. This identifies a
   single link. There are four types of Links.

      1. Point-to-point connection to another router
      2. Connection to a transit network
      3. Connection to a stub network
      4. Virtual link

   Normally loading information is provided for a Link Type 1.
   Loading information may also be provided for a Link Type 2 or 3.
   Loading information cannot be provided for a Link Type 4.
   Loading information is not provided for Type 2 LSAs, Network-LSAs.

   Loading information for Type 3 and Type 4 LSAs, Summary-LSAs for
   IP networks in another area and Summary-LSAs for ASBRs, is sent as
   a LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For
   a Type 3 and Type 4 LSA there is no information in the Reference
   following the ``Link State ID''.

   Loading information for Type 5 LSAs, AS-external-LSAs, is sent as
   a LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For a
   Type 5 LSA there is no information in the Reference following the
   ``Link State ID''.
Packed Loading Information The format of the Packed Loading
   Information depends on the value of the Packing Method field.
   Currently only the value 1 is defined.

The following format is used when the Packing Method field contains 1.
The LSA must be ignored if values other than 1 are found in the
Packing Method field.
+----------------+----------------+----------------+----------------+
|          In Scaled Link Capacity in kilobits per second           |
to counting or truncation error, greater than full loading, substitute
0xffff. The Link Drop Fraction is a 16 bit unsigned integer from 0 to
0xffff representing the number of packets dropped relative to the
number of packets received. This value can be derived from the change
in two MIB-2 counters: (ifOutDiscard<<16)/ifInPacket. The hex number
0x10000 would represent unity (all of the packets being dropped) so
0xffff must be substituted.
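The derivation above can be sketched as follows; the function name
and the use of 64 bit counter deltas are illustrative assumptions:

```c
#include <stdint.h>

/* Compute the 16 bit Link Drop Fraction from deltas of the MIB-2
 * counters ifOutDiscard and ifInPacket over a sampling interval.
 * (delta_discards << 16) / delta_packets yields 0x10000 at unity,
 * which does not fit in 16 bits, so 0xffff is substituted; the
 * same clamp covers counting or truncation error. */
static uint16_t link_drop_fraction(uint64_t delta_discards,
                                   uint64_t delta_packets)
{
    uint64_t frac;

    if (delta_packets == 0)
        return 0;              /* no packets received, nothing to report */
    frac = (delta_discards << 16) / delta_packets;
    if (frac > 0xffff)
        frac = 0xffff;         /* clamp unity or apparent over-full loss */
    return (uint16_t)frac;
}
```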
B Concise Statement of the Algorithms
An OSPF router may play one of two roles, or both. The two functions
are flooding loading information and load balancing. Interior routers
and edge routers will flood loading information. A router may choose
not to flood information if it is believed that there is no way that
the interface could become congested or if it has no way to measure
the load, as is the case on a shared broadcast interface. An ingress
or interior router will process loading information and if it has
equal cost paths will balance load across those paths.
The description of algorithms is broken down into the following
subsections.

   Flooding Loading Information       Appendix B.1
   Building Next Hop Structures       Appendix B.2
   Processing Loading Information     Appendix B.3
   Adjusting Loading                  Appendix B.4
The algorithms are defined in the following section in pseudocode.
B.1 Flooding Loading Information
It is assumed that counters are large enough to avoid multiple
overflow (ie: 64 bit counters are used for high speed interfaces) and
that counter overflow is easily detected and compensated for in
counter deltas. It is assumed that ifInDiscard and ifOutDiscard
accurately count all queueing drops.
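The single-wrap compensation mentioned above can be sketched with
unsigned modular arithmetic; the helper name is illustrative and 32
bit counters are assumed:

```c
#include <stdint.h>

/* Delta of a 32 bit SNMP counter between two samples.  Unsigned
 * arithmetic makes a single wrap come out correctly; the working
 * assumption above is that counters are wide enough, or sampled
 * often enough, that at most one wrap occurs per interval. */
static inline uint32_t counter32_delta(uint32_t prev, uint32_t now)
{
    return now - prev;   /* modulo 2^32 */
}
```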
The following counters are sampled at 15 second intervals:
ifInOctets, ifOutOctets, ifInPacket, ifOutPacket, ifInDiscard and
ifOutDiscard. The value of ifInSpeed and ifOutSpeed is assumed to
be constant. Some state must be stored. The previously used value
of each raw counter is needed to compute deltas. State variables
InFilteredUtil, OutFilteredUtil, InLoss, OutLoss, InEquivLoad and
OutEquivLoad must be saved. The last time a reflooding occurred must
also be stored.
The input and output utilizations are expressed as fractions using
ifInOctets, ifOutOctets, ifInSpeed, and ifOutSpeed. Call the raw 15
second fractional utilizations InRawUtil and OutRawUtil. Compute the
following filtered values for both In and Out, making sure to save the
previous values.
    PrevFilteredUtil = FilteredUtil;
    if (RawUtil < FilteredUtil) {
        FilteredUtil -= (FilteredUtil >> 3);
        FilteredUtil += (RawUtil >> 3);
    } else if (RawUtil > FilteredUtil) {
        FilteredUtil -= (FilteredUtil >> 1);
        FilteredUtil += (RawUtil >> 1);
    }
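The asymmetric filter above tracks rising utilization quickly (gain
1/2) and falling utilization slowly (gain 1/8). It can be written
directly in C; the 16.16 fixed point representation (0x10000 means
100% utilization) is an assumption of this sketch, not mandated by
the draft:

```c
#include <stdint.h>

/* One 15 second update of the utilization filter from Appendix B.1,
 * in 16.16 fixed point.  Rising samples are tracked with gain 1/2,
 * falling samples with gain 1/8, exactly as in the pseudocode. */
static uint32_t filter_util(uint32_t filtered, uint32_t raw)
{
    if (raw < filtered) {
        filtered -= filtered >> 3;   /* slow decay toward lower sample */
        filtered += raw >> 3;
    } else if (raw > filtered) {
        filtered -= filtered >> 1;   /* fast rise toward higher sample */
        filtered += raw >> 1;
    }
    return filtered;
}
```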
If the decision is made to reflood an LSA according to the test above,
then input and output FilteredUtil and Loss must be packed into an LSA
and flooded.
The 15 second timer interval should be jittered by a random number in
the range of plus or minus 5 seconds (obviously taking the actual time
interval into account in computing rates).
B.2 Building Next Hop Structures
The OSPF SPF calculation is done as per RFC--2328. Minor differences
in the outcome result from relaxing the best path criteria as
described in Section 3.1. Modification to the SPF algorithm is
described in Appendix D. The arrival of loading information does not
require new SPF calculations since neither reachability nor costs are
changed.
The SPF calculation yields the shortest paths from the given node
to all other routers and networks in the OSPF area. In some cases
multipaths will already exist. For all destinations, every feasible
next hop is examined, and paths through next hops that simply provide
an improvement are added, as described in Section 3.1.
After completing the SPF calculation and relaxing the best path
criteria, intra-area multipath sets are recorded as next hop
structures. If a previous SPF had been in use, loadings are
transferred to the new
heavily loaded link for a next hop structure, then the next hop
structure should be readjusted immediately. The last time each next
hop structure was readjusted must be maintained, and each structure
must be periodically readjusted. Timer events are handled as follows.
    foreach NextHopStruct ( AllNextHopStructs ) {
        Elapsed = Now - LastReadjust[NextHopStruct];
        MinLoaded = MinEquivLoad(NextHopStruct);
        MaxLoaded = MaxEquivLoad(NextHopStruct);
        AbsDiff = MaxLoaded - MinLoaded;
        if (((Elapsed >= 60)
             && (AbsDiff > 0.045) && (MaxLoaded > 0.95))
            || ((Elapsed >= 90)
                && (AbsDiff > 0.03) && (MaxLoaded > 0.95))
            || ((Elapsed >= 120)
                && (AbsDiff > 0.01) && (MaxLoaded > 0.97))
            || ((Elapsed >= 240)
                && (AbsDiff > 0.005) && (MaxLoaded > 0.98))
            || ((Elapsed >= 90)
                && (AbsDiff > 0.05) && (MaxLoaded > 0.90))
            || ((Elapsed >= 120)
                && (AbsDiff > 0.03) && (MaxLoaded > 0.90))
            || ((Elapsed >= 180)
                && (AbsDiff > 0.01) && (MaxLoaded > 0.90))
            || (Elapsed >= 300)) {
            /* we need to readjust this multipath set */
            ...
This loop and conditional result in a subset of the available next
hop structures being adjusted based on the timer. The same effect may
be accomplished by determining when a next hop structure will need to
be adjusted if no further flooding changes arrive and queueing next
hop structures on lists according to how long they will remain idle.
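The conditional above can be restated as a single predicate; the
function name and the use of doubles are illustrative:

```c
/* The timer test from the loop above as a predicate: given seconds
 * since the last readjustment and the spread of equivalent loads
 * across a next hop structure, decide whether to readjust now. */
static int needs_readjust(double elapsed, double min_loaded,
                          double max_loaded)
{
    double abs_diff = max_loaded - min_loaded;

    return ((elapsed >= 60  && abs_diff > 0.045 && max_loaded > 0.95)
         || (elapsed >= 90  && abs_diff > 0.03  && max_loaded > 0.95)
         || (elapsed >= 120 && abs_diff > 0.01  && max_loaded > 0.97)
         || (elapsed >= 240 && abs_diff > 0.005 && max_loaded > 0.98)
         || (elapsed >= 90  && abs_diff > 0.05  && max_loaded > 0.90)
         || (elapsed >= 120 && abs_diff > 0.03  && max_loaded > 0.90)
         || (elapsed >= 180 && abs_diff > 0.01  && max_loaded > 0.90)
         || (elapsed >= 300));
}
```

Note how the thresholds loosen as more time elapses: a large
imbalance on a nearly full set of paths is acted on after a minute,
while any structure is eventually revisited after 300 seconds.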
B.4 Adjusting Loading
        }
        if (Increase < 1)
            Increase = 1;
        MoveIncrement += Increase;
    } else {
        if (MoveIncrement > MinRateWithCritical)
            MoveIncrement = MinRateWithCritical;
        MoveIncrement /= 2;
        MoveCount = 0;
    }
    if (MoveIncrement < MinMoveIncrement)
        MoveIncrement = MinMoveIncrement;
    if (MoveIncrement > 65535)
        MoveIncrement = 65535;
    }
Then traffic share is adjusted.
    foreach Path1 ( SetofPaths ) {
        if (!Path1.HasCriticalSegment)
            continue;
        foreach Path2 ( SetofPaths ) {
            if (Path2.HasCriticalSegment)
                continue;
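One plausible shape for the pairwise traffic share move that the
nested loops above set up is sketched below. The struct fields, the
per-pair move, and the clamping are assumptions of this sketch, not
the draft's exact procedure; the 65535 ceiling echoes the clamp on
MoveIncrement above.

```c
/* Hedged sketch: shift MoveIncrement units of traffic share from
 * each path that crosses the critical (most loaded) segment to
 * each path that avoids it. */
struct omp_path {
    int has_critical_segment;
    unsigned traffic_share;    /* relative share, e.g. out of 65535 */
};

static void move_share(struct omp_path *paths, int npaths,
                       unsigned move_increment)
{
    for (int i = 0; i < npaths; i++) {
        if (!paths[i].has_critical_segment)
            continue;                         /* only unload these */
        for (int j = 0; j < npaths; j++) {
            if (paths[j].has_critical_segment)
                continue;                     /* only load these */
            unsigned step = move_increment;
            if (step > paths[i].traffic_share)
                step = paths[i].traffic_share; /* share cannot go negative */
            paths[i].traffic_share -= step;
            paths[j].traffic_share += step;
        }
    }
}
```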
C.6 Parameters associated with Flooding and Traffic Share
Parameters associated with flooding rate, move increment and traffic
share adjustment should not be configurable by the user or should be
well hidden so as only to be accessible to developers. Adjustment of
these parameters can slow convergence or increase overshoot. If the
parameters are sufficiently altered, instability could result. While
it is likely that some improvements could result from further tuning,
experimentation on live networks is not encouraged.
D Modified SPF Calculation
The most common implementation of the SPF calculation is Dijkstra's
algorithm. Most implementations do not yield the full path as a
consequence of the SPF calculation. Retaining the full path as the
algorithm proceeds is a relatively minor modification.
If the best path criteria is relaxed, the information obtained from
a single Dijkstra calculation is insufficient. Dijkstra's algorithm
provides a very efficient single-source shortest path calculation.
For the relaxed best path criteria, the cost to any destination from
any immediately adjacent node is needed in addition to the set of best
paths and costs from the current node.
It is believed to be more efficient to compute an SPF using Dijkstra's
algorithm from the standpoint of each adjacent node, in addition to an
SPF from the current node, than it is to use an algorithm to compute
the costs from any node to any other node. The former runs in order
N squared while algorithms to accomplish the latter run in order N
cubed, where N is the number of nodes in the graph. The amount of
computation would be expected to be about equal in the case where all
nodes are immediately adjacent to the current node.
There are likely to be more efficient methods of computing the costs
from a subset of nodes to all destinations than either using multiple
Dijkstra calculations or computing the costs from all nodes to all
others and only making use of a subset of the results. This efficiency
consideration is left as an exercise for the implementor.
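The multiple-Dijkstra approach can be sketched with the classic order
N squared adjacency matrix formulation; MAXN, INF, and the array
layout are assumptions of this sketch. Running it once from the
current node and once from each adjacent node yields the costs needed
to relax the best path criteria.

```c
#include <limits.h>

#define MAXN 16
#define INF  INT_MAX

/* O(N^2) single-source Dijkstra over an adjacency matrix
 * (cost[i][j] == INF when there is no link from i to j). */
static void dijkstra(int n, int cost[MAXN][MAXN], int src,
                     int dist[MAXN])
{
    int done[MAXN] = {0};

    for (int i = 0; i < n; i++)
        dist[i] = INF;
    dist[src] = 0;
    for (int iter = 0; iter < n; iter++) {
        int u = -1;
        for (int i = 0; i < n; i++)      /* pick nearest unfinished node */
            if (!done[i] && (u < 0 || dist[i] < dist[u]))
                u = i;
        if (u < 0 || dist[u] == INF)
            break;                       /* remaining nodes unreachable */
        done[u] = 1;
        for (int v = 0; v < n; v++)      /* relax edges out of u */
            if (cost[u][v] != INF && dist[u] + cost[u][v] < dist[v])
                dist[v] = dist[u] + cost[u][v];
    }
}
```

Applied to the example topology of Appendix F, this reproduces the
costs cited there: D reaches G at cost 4 while F reaches G at cost 6.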
E Algorithm Performance
A number of simulations have been performed to test the OSPF-OMP
algorithms. In these simulations the algorithm has been demonstrated
to be very stable. This work has not been formally published yet but
is currently available at http://engr.ans.net/ospf-omp.
The simulations done to date have only modeled behavior of load
balancing intra-area routes. This would be applicable to networks in
which external routing was mapped onto IGP routing with a single OSPF
area. Passing loading information between areas, allowing loading in
one area to affect an adjacent area, has not been simulated. Similarly
passing loading information with external routes and affecting loading
in a peer AS has not been simulated.
E.1 Conversion from Loss to Equivalent Load
The current adjustment for loss in the equivalent load is based on
the premise that traffic is dominated by TCP and that TCP flows
sufficiently unconstrained by other bottlenecks and of sufficient
duration exist to keep the aggregate traffic flow elastic. This is
considered a very safe assumption for Internet traffic. Enterprise
networks may have a higher contribution of incompressible traffic
(traffic not conforming to a form of congestion avoidance such as TCP
congestion avoidance described in RFC--2001).
sured directly when links are at full utilization, using the link
utilization and loss. The load adjustment algorithm should remain
stable as long as the first derivative of the estimator over offered
load remains positive. If the first derivative is negative within
some region, then oscillation will occur within that range of
operation. The greatest risk of this occurring is in routers where
active queue management is not used (ie: where drop-tail queues are
used) and in particular where buffering is too small. In such cases,
as offered load increases just beyond full utilization, loss increases
somewhat, but utilization can drop substantially (typically to about
90%) as offered load increases. In this region, as the offered load
increases, the estimator of offered load may decrease, causing the
link to appear less loaded than another. The rather aggressive
compensation for loss is intended to ensure that this effect either
does not occur, or occurs only within a very narrow range of offered
load at just over full utilization. If the derivative is negative
within a narrow range, oscillations can occur only within that range,
and the oscillations are well bounded.
E.2 Performance as traffic loads change
This work has considerable advantages over other approaches,
particularly traffic engineering approaches that involve adjustment of
virtual circuit layouts based on historic traffic figures. The
advantage is the ability to adjust loading gradually as actual offered
traffic deviates from expected. The advantage is even greater when
compared to dynamic virtual circuit layouts, using PNNI for example,
since these algorithms have proven to often converge on very
suboptimal layouts.

Simulations demonstrating behavior in these particular cases can be
found at http://engr.ans.net/ospf-omp/ramp.html.
E.3 Convergence after a major perturbation
Simulations have been performed in which link failures are examined.
Without relaxing the best path criteria, OSPF OMP is quite dependent
on the set of link metrics to create a set of equal cost paths that
will allow the most heavily loaded portions of the topology to be
offloaded. When links fail, the set of metrics is often far from
ideal for the remaining topology. Relaxing the best path criteria
significantly improves the response to link failure.

Simulations are available at http://engr.ans.net/ospf-omp/fail.html.
F Examples
Figure 2 provides a simple topology. For the purpose of illustrating
how OMP works, only the traffic flow from left to right between a few
pairs of dominant traffic contributors is considered.
The traffic mapped onto the topology in Figure 2 is dominated by the
ingress nodes A and F and the egress nodes E and G. The capacity of
each link is 1 except link E-G, which has a capacity of 2. The load
contributed by the ingress-egress pairs A-E, F-G, and F-E are 0.5.
The node pair A-G contributes a load of 1. Link costs are all 2,
except F-D which is 3, and F-G which is 6.
If ECMP were used, all the traffic from F to E would take the path
F-D-E. All the traffic from F to G would take the link F-G. All the
traffic from A to E would take the hop from A to B and then at B it
would be split evenly between the paths B-C-E and B-D-E. Half of the
traffic from A to G would take the hop A-B and half would take the
hop A-F.

            cost=2        cost=2        cost=2       cost=2
    A ----------> B ----------> C ----------> E ----------> G
    |  load=1.5   |  load=0.5     load=0.5   ^  load<25%   ^
    |             |                          |             |
    |             |    cost=2       cost=2   |             |
    |             \-----------> D -----------/             |
    |                load=0.5   ^   load=1.0               |
    |                           |                          |
    |                  cost=3   |  load=0.5                |
    |  cost=2                   |             cost=6       |
    \-------------------------> F -------------------------/
                 load=0.5                 load=1.0

                     Figure 2: A Simple Example
ingress-egress   load         path
F-E              0.5    0.50  F-D-E
F-G              0.5    0.50  F-G
A-E              0.5    0.25  A-B-C-E
                        0.25  A-B-D-E
A-G              1.0    0.33  A-B-C-E-G
                        0.33  A-B-D-E-G
                        0.33  A-F-G
link   loading   status
A-B    1.16      overloaded
A-F    0.33      underutilized
B-C    0.58      underutilized
B-D    0.58      underutilized
C-E    0.58      underutilized
D-E    1.08      overloaded
E-G    0.66      underutilized
F-D    0.50      underutilized
F-G    0.83      near capacity
Above is the initial loading for OMP, which differs slightly from
ECMP. With ECMP, half the traffic from A to G would take the link
A-F, whereas OMP starts out with one third of the A-G traffic on link
A-F.
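The link loadings in the table follow directly from summing, per link,
the demand share of every path that crosses it, then comparing against
capacity. A minimal sketch (the variable names are invented for
illustration; the shares are the initial OMP shares from the table
above):

```python
from collections import defaultdict

# (demand share, path) for each ingress-egress pair, per the initial
# OMP loading table above.
paths = [
    (0.50, ["F", "D", "E"]),
    (0.50, ["F", "G"]),
    (0.25, ["A", "B", "C", "E"]),
    (0.25, ["A", "B", "D", "E"]),
    (1 / 3, ["A", "B", "C", "E", "G"]),
    (1 / 3, ["A", "B", "D", "E", "G"]),
    (1 / 3, ["A", "F", "G"]),
]

capacity = defaultdict(lambda: 1.0)
capacity[("E", "G")] = 2.0  # every link has capacity 1 except E-G

# Sum each path's share onto every link it traverses.
load = defaultdict(float)
for share, path in paths:
    for u, v in zip(path, path[1:]):
        load[(u, v)] += share

# The table lists raw load; utilization additionally divides by
# capacity, so E-G carries load 0.66 at only 33% utilization.
utilization = {link: load[link] / capacity[link] for link in load}
```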
Note that using OMP the path F-D-E-G with cost 7 is considered close
enough to equal to the path F-G with cost 6. This is because the
next hop D is closer to G, with a cost of 4, than F is, with a cost of
6. Initially node F would not move load over because link D-E, at a
loading of 1.08, is in worse shape than link F-G at a loading of 0.83.
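One illustrative reading of this relaxed best-path test (a sketch, not
normative draft text; the names below are invented) is that a neighbor
is admitted as a next hop whenever it is strictly closer to the
destination than the forwarding node itself, which is what makes
F-D-E-G (cost 7) usable alongside the best path F-G (cost 6):

```python
# Costs to destination G taken from Figure 2; each router already
# knows these distances from its OSPF shortest-path calculation.
dist_to_g = {"F": 6, "D": 4, "E": 2, "G": 0}

def acceptable_next_hop(node, neighbor, dist):
    """A neighbor may serve as a next hop if it is strictly closer to
    the destination than this node, which rules out forwarding loops."""
    return dist[neighbor] < dist[node]

# F may forward G-bound traffic to D, making F-D-E-G (cost 3+4 = 7)
# available, because D's distance to G (4) beats F's own distance (6).
assert acceptable_next_hop("F", "D", dist_to_g)
```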
For illustrative purposes three opportunities for moving load are
considered separately. These are shown in Figure 3, Figure 4, and
Figure 5.
     cost=2        cost=2        cost=2        cost=2
A ----------> B ----------> C ----------> E ----------> G
|  load=1.25  |  load=.75      load=.75   ^  load<25%   ^
|             |                          |              |
|             |   cost=2         cost=2  |              |
|             \-----------> D -----------/              |
|               load=.25    ^  load=.75                 |
|                           |                           |
|                  cost=3   |  load=0.5                 |
|     cost=2                |            cost=6         |
\-------------------------> F --------------------------/
          load=0.75                    load=1.25

Figure 3: First Opportunity for Load Adjustment
Node B has a clear opportunity to reduce the load on the link D-E by
moving traffic from the next hop D to the next hop C. If this opti-
mization were to be applied alone, utilizations on the links B-C,
C-E, and D-E would approach 0.75 and utilization on the link B-D would
approach 0.25. This is illustrated in Figure 3.
Node A can reduce the loss on link A-B by putting more load on link
A-F. This will initially have the effect of lowering the loading of
A-B to 1.0 and raising that of F-G to 1.0. Since the links can only
pass 100% of their capacity, this would just reduce loss on link A-B
at the expense of increasing loss on link F-G. The load on link A-F
would increase to 0.5.
After node B had moved enough traffic away from link D-E so that its
loading fell below the 1.0 loading of link F-G, node F would begin
to move traffic destined to G away from link F-G. Load would be added
to link D-E but node B would continue to move load away from D-E.
Utilizations of B-C, C-E, and D-E would rise. Utilization of F-D
would also approach 0.5 and utilization of F-G would fall. This is
illustrated in Figure 4.
Node A would be faced with an overloaded link A-B and a better path to
G of A-F-G, with the worst loading being at link F-G, initially only
slightly over capacity. Both links A-B and F-G would be reporting
100% utilization but link A-B would be expected to report higher loss.
In addition, as the other optimizations proceed, the utilization of
link F-G would approach 100%.
     cost=2        cost=2        cost=2        cost=2
A ----------> B ----------> C ----------> E ----------> G
|  load=1.25  |  load=.92      load=.92   ^  load<25%   ^
|             |                          |              |
|             |   cost=2         cost=2  |              |
|             \-----------> D -----------/              |
|               load=.08    ^  load=.92                 |
|                           |                           |
|                  cost=3   |  load=0.92                |
|     cost=2                |            cost=6         |
\-------------------------> F --------------------------/
          load=0.75                    load=0.92

Figure 4: Second Opportunity for Load Adjustment
     cost=2        cost=2        cost=2        cost=2
A ----------> B ----------> C ----------> E ----------> G
|  load=1.0   |  load=1.0      load=1.0   ^  load=25%   ^
|             |                          |              |
|             |   cost=2         cost=2  |              |
|             \-----------> D -----------/              |
|               load=0      ^  load=1.0                 |
|                           |                           |
|                  cost=3   |  load=1.0                 |
|     cost=2                |            cost=6         |
\-------------------------> F --------------------------/
          load=1.0                     load=1.0

Figure 5: Another Opportunity for Load Adjustment
Node A would move traffic from the next hop of B to the next hop of
F. Node F would continue to move load from F-G to F-D-E. Node B would
continue to move load from B-D-E to B-C-E. The utilizations of links
A-B, B-C, C-E, D-E, F-D, and F-G would approach 0.83. Utilization of
link A-F would approach 0.63. Utilization of link B-D would approach
zero. This is illustrated in Figure 5.
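The gradual convergence described above can be illustrated with a toy
model (this is a sketch of the incremental load-moving behavior, not
the draft's exact adjustment algorithm; the background loads and step
size below are invented numbers): at each step a node shifts only a
small increment of traffic toward the less loaded alternative, so loads
equalize smoothly instead of flapping all traffic at once.

```python
def move_load(split, load_a, load_b, step=0.05):
    """Shift a small increment of traffic toward the less loaded path.

    `split` is the fraction of the demand on path A.  Differences
    smaller than `step` are ignored (a deadband), so the adjustment
    stops once the two loads are effectively equal.
    """
    if load_a - load_b > step:
        return max(0.0, split - step)
    if load_b - load_a > step:
        return min(1.0, split + step)
    return split

# A demand of 1.0 split across two equal-capacity paths that also
# carry fixed background loads of 0.6 (path A) and 0.2 (path B).
split = 0.5
for _ in range(100):
    load_a = 0.6 + split
    load_b = 0.2 + (1.0 - split)
    split = move_load(split, load_a, load_b)

# The split settles near 0.3, where both paths carry about 0.9.
```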
The following table provides the approximate state of traffic loading
achieved in a brief simulation. Of the six links that could be
balanced at about 0.83, three converged to about 0.85 and three to
about 0.82. Note that the path F-G-E was used to get from F to E in
addition to the lower cost F-D-E.
ingress-egress   load         path
F-E              0.5    0.25  F-D-E
                        0.25  F-G-E
F-G              0.5    0.25  F-G
                        0.25  F-D-E-G
A-E              0.5    0.25  A-B-C-E
                        0.00  A-B-D-E
                        0.12  A-F-D-E
                        0.12  A-F-G-E
A-G              1.0    0.60  A-B-C-E-G
                        0.00  A-B-D-E-G
                        0.20  A-F-G
                        0.20  A-F-D-E-G
link   loading
A-B    0.85
A-F    0.65
B-C    0.85
B-D    0.00
C-E    0.85
D-E    0.82
E-G    0.80
F-D    0.82
F-G    0.82
In this example, multiple links are balanced at 82% to 85%
utilization. Without OMP it is difficult (and might be impossible
using only ECMP) to avoid applying an offered load that exceeds link
capacity on parts of the topology. This example is intended as a more
advanced tutorial than the trivial three node example in Figure 1.
This example is among the simulations at
http://engr.ans.net/ospf-omp/tutorial. More complex topologies and
traffic patterns have been simulated and are available at the same
URL.