Internet Engineering Task Force                       Curtis Villamizar
INTERNET-DRAFT                                                       ANS
draft-ietf-ospf-omp-01.txt

                   OSPF Optimized Multipath (OSPF-OMP)

Status of this Memo

This document is an Internet-Draft.  Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups.  Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To view the entire list of current Internet-Drafts, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

OSPF may form multiple equal cost paths between points.  This is true
of any link state protocol.  In the absence of any explicit support
to take advantage of this, a path may be chosen arbitrarily.
Techniques have been utilized to divide traffic somewhat evenly among
the available paths.  These techniques have been referred to as Equal
Cost Multipath (ECMP).  An unequal division of traffic among the
available paths is generally preferable.  Routers generally have no
knowledge of traffic loading on distant links and therefore have no
basis to optimize the allocation of traffic.

Optimized Multipath is a compatible extension to OSPF, utilizing the
Opaque LSA to distribute loading information, proposing a means to
adjust forwarding, and providing an algorithm to make the adjustments
gradually enough to ensure stability yet provide reasonably fast
adjustment when needed.

1 Overview

Networks running OSPF are often heavily loaded.  Topologies often
evolve to include multiple paths.  Multiple paths may be initially
designed to provide redundancy but also result from incremental
addition of circuits to accommodate traffic growth.  The redundant
paths provide a potential to distribute traffic loading and reduce
congestion.  Optimized Multipath (OMP) provides a means for OSPF to
make better use of this potential to distribute loading.

1.1 Past Attempts
Early attempts to provide load sensitive routing involved changing
link costs according to loading.  These attempts were doomed to
failure because the adjustment increment was grossly coarse and
oscillation was inevitable [2].  This early experience is largely
responsible for the common belief that any form of load sensitive
routing will fail due to severe oscillations resulting from
instability.

Attempts to use a metric composed of weighted components of delay,
traffic, and fixed costs have also been met with very limited
success.  The problem again is the granularity of adjustment.  As the
composition of weighted components switches favored paths, large
amounts of traffic are suddenly moved, making the technique prone to
oscillations [3].  The oscillation is damped to some extent by
providing a range of composite metric differences in which composite
metrics are considered equal and equal cost multipath techniques are
used.  Even then the technique still suffers oscillations due to the
coarse adjustments made at equal/unequal metric boundaries.

1.2 Equal Cost Multipath
A widely utilized technique to improve loading is known as Equal Cost
Multipath (ECMP).  If the topology is such that equal cost paths
exist, then an attempt is made to divide traffic equally among the
paths.  Three methods of splitting traffic have been used.

1. Per packet round robin forwarding.

2. Dividing destination prefixes among available next hops in the
   forwarding entries.

3. Dividing traffic according to a hash function applied to the
   source and destination pair.

The ``per packet round robin forwarding'' technique is only
applicable if the delays on the paths are almost equal.  The delay
difference must be small relative to packet serialization time.
Delay differences greater than three times the packet serialization
time can cause terrible TCP performance.  For example, packets 2, 4,
and 6 may arrive before packet 1, triggering TCP fast retransmit.
The result will be limiting TCP to a very small window and very poor
performance over long delay paths.

The delay differences must be quite small.  A 532 byte packet is
serialized onto a DS1 link in under 2.8 msec.  At DS3 speed,
serialization is accomplished in under 100 usec.  At OC12 it is under
7 usec.  For this reason ``per packet round robin forwarding'' is not
applicable to a high speed WAN.

Dividing destination prefixes among available next hops provides a
very coarse and unpredictable load split.  Long prefixes are
problematic.  In reaching an end node, the majority of traffic is
often destined to a single prefix.  This technique is applicable to a
high speed WAN but with the drawbacks just mentioned better
techniques are needed.

The ``source/destination hash'' based technique was used as far back
as the T3-NSFNET in the NSS routers.  A hash function, such as
CRC-16, is applied over the source address and destination address.
The hash space is then split evenly among the available paths by
either setting thresholds or performing a modulo operation.  Traffic
between any given source and destination remains on the same path.
Because the technique is based on host addresses, and uses both the
source and destination address, it does not suffer the coarse
granularity problem of the prefix based technique, even when
forwarding to a single prefix.  Source/destination hash is the best
technique available for a high speed WAN.

The forwarding decision for the ``source/destination hash'' based
technique is quite simple.  When a packet arrives, look up the
forwarding entry in the radix tree.  The next hop entry can be an
array index into a set of structures, each containing one or more
actual next hops.  If more than one next hop is present, compute a
CRC16 value based on the source and destination addresses.  The CRC16
can be implemented in hardware and computed in parallel to the radix
tree lookup in high speed implementations, and discarded if not
needed.  Each next hop entry in the structure must contain a boundary
value and the next hop itself.  An integer ``less than'' comparison
is made against the boundary value determining whether to use this
next hop or move to the next comparison.  In hardware the full set of
comparisons can be made simultaneously for up to some number of next
hops.  This yields the next hop to use.

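As a non-normative illustration, the forwarding decision might look
like the following C sketch.  The structure layout and names
(omp_nexthop, omp_nexthop_set, boundary, ifindex) are assumptions
made for illustration, and the crc16() routine is assumed to be
supplied by the platform, possibly in hardware.

   #include <stdint.h>

   struct omp_nexthop {
       uint32_t boundary;     /* first hash value NOT mapped to this hop */
       int      ifindex;      /* stand-in for the actual next hop */
   };

   struct omp_nexthop_set {
       int                count;
       struct omp_nexthop hop[8];
   };

   /* crc16() over the source and destination addresses is assumed to
    * be supplied by the platform, possibly in hardware. */
   extern uint16_t crc16(const void *buf, int len);

   int omp_select_ifindex(const struct omp_nexthop_set *set,
                          uint32_t src, uint32_t dst)
   {
       uint32_t key[2] = { src, dst };
       uint16_t hash;
       int i;

       if (set->count == 1)              /* single path, no hash needed */
           return set->hop[0].ifindex;

       hash = crc16(key, sizeof(key));

       /* An integer ``less than'' comparison against each boundary; in
        * hardware these comparisons could be made in parallel. */
       for (i = 0; i < set->count; i++)
           if (hash < set->hop[i].boundary)
               return set->hop[i].ifindex;

       return set->hop[set->count - 1].ifindex;   /* not reached if the
                                                     last boundary is 65536 */
   }
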
                          .----.
                         /      \
                        |   N2   |
                         \      /
                          `----'
                         //    \\
                        //      \\
                       //        \\
                 .----.            .----.
                /      \          /      \
               |   N1   | ====== |   N3   |
                \      /          \      /
                 `----'            `----'

           Figure 1: A very simple application of ECMP

1.3 Optimized Multipath differs from ECMP

For ECMP, the boundary values are set by first dividing one more than
the maximum value that the hash computation can return (65536 for
CRC16) by the number of available next hops and then setting the Nth
boundary to N times that number (with the Nth value fixed at one more
than the maximum value regardless of underflow caused by truncating
during division, 65536 for CRC16).

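A minimal, non-normative sketch of this boundary calculation follows;
boundary[] here is the per next hop boundary array used in the sketch
in Section 1.2, and the truncation behavior described above is
preserved by the integer division.

   #include <stdint.h>

   /* Set ECMP boundaries: divide the hash space (65536 for CRC16)
    * evenly among ``count'' next hops.  The last boundary is pinned to
    * 65536 regardless of truncation during the integer division. */
   void omp_set_equal_boundaries(uint32_t *boundary, int count)
   {
       uint32_t share = 65536 / (uint32_t)count;
       int i;

       for (i = 0; i < count; i++)
           boundary[i] = (uint32_t)(i + 1) * share;

       boundary[count - 1] = 65536;
   }
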
An equal load split is not always optimal.  Consider the example in
Figure 1 with the offered traffic in Table 1.  If all of the link
costs are set equally, then the link N1---N3 is significantly
overloaded (135.75%) while the path N1---N2---N3 is lightly loaded
(45.25% and 22.62%).  If the cost on the N1---N3 link is equal to the
cost of the N1---N2---N3 path, then N1 will try to split the load
destined toward N3 across the two paths.

Given the offered traffic in Table 1 the loading on N1---N3 is
reduced to 67.87% but the link loading on the path N2---N3 becomes
113.12%.  Ideally node N1 should put 1/3 of the traffic toward N3 on
the path N1---N2---N3 and 2/3 on the path N1---N3.  To know to do
this N1 must know the loading on N2---N3.

                                            No     ECMP      OMP
     Nodes    Node Names                  Split   Traffic   Traffic

     n3-n1    Node 3 -> Node 1              60      30        40
     n1-n3    Node 1 -> Node 3              60      30        40
     n3-n2    Node 3 -> Node 2              20      50        40
     n2-n3    Node 2 -> Node 3              20      50        40
     n2-n1    Node 2 -> Node 1              10      40        30
     n1-n2    Node 1 -> Node 2              10      40        30

        Table 1: Traffic loading for the example in Figure 1

This is where Optimized Multipath (OMP) provides additional benefit
over ECMP.  Ignoring for the moment how node N1 knows to put 1/3 of
the traffic toward N3 on the path N1---N2---N3, the way this is
accomplished from a forwarding standpoint is to move the boundary in
the forwarding structure from the default value of 1/2 of 65536 to
about 1/3 of 65536.  If there is a very large set of source and
destination host address pairs, then the traffic will be split among
the 65536 possible hash values.  This provides the means for a very
fine granularity of adjustment.

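In the example above, that one adjustment is simply a boundary
assignment; a short sketch under the same assumptions as the earlier
sketches:

   #include <stdint.h>

   /* Two paths toward N3: boundary[0] covers the path through N2 and
    * boundary[1] the direct link.  Moving boundary[0] from 1/2 to
    * about 1/3 of 65536 moves roughly 1/6 of the hash space (and so
    * roughly 1/6 of the flows) from the path through N2 to the
    * direct link. */
   void omp_example_one_third_split(uint32_t boundary[2])
   {
       boundary[0] = 65536 / 3;   /* about 1/3 of the hash space via N2 */
       boundary[1] = 65536;       /* always one more than the hash maximum */
   }
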
Having explained how a fine granularity of forwarding adjustment can
be accomplished, what remains is to define how nodes in a large
topology can know what the loading levels are elsewhere in the
topology and to define an algorithm which allows autonomous
unsynchronized decisions on the part of many routers in a topology to
quickly converge on a near optimal loading without the risk of
oscillation.

2 Flooding Loading Information

Loading information is flooded within an OSPF area using Opaque LSAs
[1].  Area local scope (link-state type 10) link state attributes are
flooded containing an ``Opaque Type'' of LSA_OMP_LINK_LOAD or
LSA_OMP_PATH_LOAD.  The type LSA_OMP_LINK_LOAD Opaque LSA is used to
flood link loading information within an area.  The type
LSA_OMP_PATH_LOAD Opaque LSA is used to flood loading information for
use with inter-area routes.  Loading information obtained from an
exterior routing protocol may also be considered if available.  The
means of passing loading information in an exterior routing protocol
is beyond the scope of this document.

2.1 Link Loading Information

Within an area link loading is flooded using the type
LSA_OMP_LINK_LOAD Opaque LSA.  The format of this LSA is described in
Appendix A.

The ``Opaque Information'' in the type LSA_OMP_LINK_LOAD Opaque LSA
contains the following.

1. a measure of link loading in each direction as a fraction of link
   capacity,

2. a measure of packets dropped due to queue overflow in each
   direction (if known) expressed as a fraction,

3. the link capacity in kilobits per second (or unity if less than
   1000 bits per second).

Generally the number of output packets dropped will be known.  In
designs where drops occur on the input (generally a bad design
practice), the rate of input queue drops should be recorded.  These
measures of loading and drop are computed using the interface
counters generally maintained for SNMP purposes, plus a running count
of output queue drops if available.  The counters are sampled every
15 seconds.

The previous value of each of the counters is subtracted from the
current value.  The counters required are 1) bytes out, 2) bytes in,
3) packets out, 4) packets in, 5) output queue drops, and 6) input
queue drops.  These counters should already exist to satisfy SNMP
requirements.

A value of instantaneous load in each direction is based on byte
count and link capacity.  An instantaneous output queue drop rate is
based on queue drops and packet count.  Some of the values are
filtered as described in Appendix B.1.

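The filter itself is specified in Appendix B.1 rather than here.  The
earlier revision of this draft described a simple shift based running
average, which might look like the following non-normative sketch;
OMP_SHIFT_BITS and the fixed point scaling are assumptions made for
illustration.

   #include <stdint.h>

   /* Shift based running average: subtract 1/2^OMP_SHIFT_BITS of the
    * running value and add 1/2^OMP_SHIFT_BITS of the new sample.  The
    * value of OMP_SHIFT_BITS is an assumption, not specified here. */
   #define OMP_SHIFT_BITS 3

   uint32_t omp_filter(uint32_t running, uint32_t sample)
   {
       running -= running >> OMP_SHIFT_BITS;
       running += sample >> OMP_SHIFT_BITS;
       return running;
   }
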
The last time that a type LSA_OMP_LINK_LOAD Opaque LSA with the same
Opaque ID was sent is recorded and the values sent are recorded.  For
the purpose of determining when to reflood, an equivalent loading
figure is used.  The computation of equivalent loading is described
in Section 2.3.

The higher of the current equivalent loading computation and the
previous is used when determining whether to send the type
LSA_OMP_LINK_LOAD Opaque LSA.  The type LSA_OMP_LINK_LOAD Opaque LSA
is flooded according to elapsed time since last flooded, the current
equivalent load, and the difference between the equivalent load of
the most loaded and least loaded path.  The reflooding decision is
described in detail in Appendix B.1.

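The detailed schedule is given in Appendix B.1 and is not reproduced
in this section.  For illustration only, a table driven
implementation of the graduated schedule from the earlier revision of
this draft might look like the following sketch; the threshold values
shown are taken from that earlier revision and are not normative
here, and delta_load stands for the load difference term used by the
schedule (the change since the LSA was last flooded, in that earlier
revision).

   #include <stdbool.h>

   /* One row of a graduated reflooding schedule: reflood when the
    * equivalent load exceeds load_pct, the load difference exceeds
    * delta_pct (0 means no difference requirement), and at least
    * min_elapsed seconds have passed since the LSA was last flooded. */
   struct reflood_rule {
       unsigned load_pct;
       unsigned delta_pct;
       unsigned min_elapsed;      /* seconds */
   };

   /* Threshold values from the earlier revision of this draft; the
    * normative schedule is in Appendix B.1. */
   static const struct reflood_rule reflood_schedule[] = {
       { 100,  5,   30 }, { 100,  2,   90 }, { 100,  0,  180 },
       {  90,  5,   60 }, {  90,  2,  240 }, {  90,  0,  600 },
       {  70, 10,   60 }, {  70,  5,  120 }, {  70,  2,  480 },
       {  70,  0,  900 }, {  50, 10,   60 }, {  50,  5,  300 },
       {  25, 25,  120 }, {  25,  0, 1200 },
   };

   bool omp_should_reflood(unsigned equiv_load, unsigned delta_load,
                           unsigned elapsed, bool never_sent)
   {
       unsigned i;

       if (never_sent)
           return true;
       for (i = 0;
            i < sizeof(reflood_schedule) / sizeof(reflood_schedule[0]); i++) {
           const struct reflood_rule *r = &reflood_schedule[i];
           if (equiv_load > r->load_pct &&
               (r->delta_pct == 0 || delta_load > r->delta_pct) &&
               elapsed >= r->min_elapsed)
               return true;
       }
       return false;
   }
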
The point of this graduated scale is to reduce the amount of flooding
that is occurring unless links are in trouble or undergoing a
significant traffic shift.  Change may occur in a quiescent network
due to failure external to the network that causes traffic to take
alternate paths.  In this case, the more frequent flooding will
trigger a faster convergence.  Traffic shift may also occur due to
shedding of loading by the OMP algorithm itself as the algorithm
converges in response to an external change.

2.2 Path Loading Information

Path loading information regarding an adjacent area is flooded by an
Area Border Router (ABR) using the type LSA_OMP_PATH_LOAD Opaque LSA.
The format of this LSA is described in Appendix A.

The ``Opaque Information'' in the type LSA_OMP_PATH_LOAD Opaque LSA
contains the following.

1. the highest loading in the direction toward the destination as a
   fraction of link capacity,

2. a measure of total packet drop due to queue overflow in the
   direction toward the destination expressed as a fraction,

3. the smallest link capacity on the path to the destination.

These values are taken from the link on the path from the ABR to the
destination of the summary LSA.  The link with the highest loading
may not be the link with the lowest capacity.  The queue drop value
is one minus the product of fraction of packets that are not dropped
at each measurement point on the path (input and output in the
direction of the path).  The following computation is used.

   path-loss = 1 - product(1 - link-loss)

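A small non-normative sketch of this computation, assuming the per
link loss fractions for each measurement point along the path have
been collected into an array:

   /* path-loss = 1 - product(1 - link-loss), with each link-loss
    * given as a fraction in [0,1] for one measurement point on the
    * path. */
   double omp_path_loss(const double *link_loss, int nlinks)
   {
       double pass = 1.0;     /* fraction of packets surviving so far */
       int i;

       for (i = 0; i < nlinks; i++)
           pass *= 1.0 - link_loss[i];

       return 1.0 - pass;
   }
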
The path loading and path loss rate are filtered according to the
same algorithm defined in the prior section.  Rather than polling a
set of counters the current value of the path loading and path loss
rate is used.  An equivalent load is calculated for each path to a
summary LSA destination as described in Section 2.3.  A type
LSA_OMP_PATH_LOAD Opaque LSA is flooded according to the same rate
schedule as described in the prior section.

An ABR may be configured not to send the type LSA_OMP_PATH_LOAD
Opaque LSA into any given area.

2.3 Computing equivalent loading

The equivalent load is the actual percent loading multiplied by a
factor that provides an estimate of the extent to which TCP would be
slowing down to avoid congestion.  This estimate is based on the link
bandwidth and loss rate, knowledge of TCP dynamics, and some
assumption about the characteristics of the TCP flows being passed
through the link.  Some of the assumptions must be configured.

If loss is low, the equivalent load will be equal to the actual
percent loading.  If loss is high and loading is at or near 100%,
then the equivalent load calculation provides a means of deciding
which links are more heavily overloaded.  The equivalent load figure
is not intended to be an accurate prediction of offered load, simply
a metric for use in deciding which link to offload.

Mathis et al. provide the following estimate of loss given TCP window
size and round trip time [4].

   p < (MSS / (BW * RTT))**2

The basis for the estimate is that TCP slows down roughly in
proportion to the inverse of the square root of loss.  There is no
way to know how fast TCP would be going if no loss were present if
there are other bottlenecks.  A somewhat arbitrary assumption is made
that TCP would go no faster than if loss were at 0.5%.  If loss is
greater than 0.5% then TCP performance would be reduced.  The
equivalent loading is estimated using the following computation.

   equiv-load = load * K * sqrt(loss)

The inverse of the square root of 1% is 10, so 10 may be used for the
value of ``K''.

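A non-normative sketch of the equivalent load computation with K = 10
follows.  The clamp that leaves the load figure unchanged when loss
is low is an assumption made for this sketch (it matches the
statement above that low loss yields the actual percent loading), and
an implementation could replace sqrt() with a small lookup table, as
the earlier revision of this draft noted.

   #include <math.h>

   /* equiv-load = load * K * sqrt(loss), with load and loss given as
    * fractions in [0,1] and K = 10.  The clamp so that low loss
    * leaves the load figure unchanged is an assumption made for this
    * sketch. */
   double omp_equivalent_load(double load, double loss)
   {
       const double K = 10.0;
       double factor = K * sqrt(loss);

       if (factor < 1.0)
           factor = 1.0;
       return load * factor;
   }
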
The conversion of loss to estimated loading is not at all accurate.
The non-linearity does affect the time to converge though convergence
still occurs as long as loss is positively correlated to loading.
This is discussed further in Appendix D.1.

3 Next hop structures

For intra-area routes, a separate next hop structure must exist for
each destination router or network.  For inter-area routes (summary
routes), one next hop structure is needed for each set of equidistant
ABRs.  For external routes (AS-external routes or routes from an
exterior routing protocol, such as BGP), one next hop structure is
needed for each set of equidistant ASBRs.

The set of intra-area next hop structures is initialized after the
OSPF SPF calculation is completed.  An additional set of next hops is
then added by relaxing the best path criteria.

3.1 Relaxing the Best Path Criteria

The exercise of setting link costs to produce the most beneficial set
of equal cost paths is tedious and very difficult for large
topologies.  OSPF as defined in RFC 2328 requires that only the best
path be considered.  For the purpose of Optimized Multipath, this
criteria can be relaxed to allow a greater number of multipaths but
not to the point of creating routing loops.  Any next hop which is
closer in terms of cost than the current hop can be considered a
viable next hop for multipath routing.  If next hops were used where
the cost at the next hop is equal or greater, routing loops would
form.

In considering the paths beyond the next hop path, only the best
paths should be considered.  There is no way to determine if
subsequent routers have relaxed the best path criteria.  In addition,
there is no need to consider the additional paths if the best path
criteria is relaxed downstream.  If the best path criteria is relaxed
downstream, the best paths must be part of the downstream next hop
structure.  If there are additional paths that the downstream is able
to use to further distribute the load, the entire set of paths will
still converge toward optimal loading.

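In terms of the SPF result, the relaxed criteria reduces to a strict
cost comparison at each candidate neighbor.  A non-normative sketch
follows; the names and the fixed path limit are assumptions made for
illustration.

   #include <stdint.h>

   #define OMP_MAX_PATHS 8

   struct relaxed_nexthops {
       int      count;
       uint32_t neighbor_id[OMP_MAX_PATHS];
   };

   /* Add every neighbor that is strictly closer (in SPF cost) to the
    * destination than this router.  Equal or greater cost at the next
    * hop could form a routing loop, so such neighbors are skipped. */
   void omp_relax_best_path(struct relaxed_nexthops *out,
                            uint32_t my_cost_to_dest,
                            const uint32_t *neighbor_id,
                            const uint32_t *neighbor_cost_to_dest,
                            int nneighbors)
   {
       int i;

       out->count = 0;
       for (i = 0; i < nneighbors && out->count < OMP_MAX_PATHS; i++)
           if (neighbor_cost_to_dest[i] < my_cost_to_dest)
               out->neighbor_id[out->count++] = neighbor_id[i];
   }
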
3.2 Offloading Congestion Outside the OSPF Area

For inter-area routes or external routes, a separate next hop
structure must exist for each such route if it is desirable to reduce
loading outside of the area and the loading within the area is
sufficiently low to safely allow this.

For inter-area routes, if a type LSA_OMP_PATH_LOAD Opaque LSA exists
for the summary LSA, and more than one ABR is advertising the summary
route, and the equivalent load for the summary LSA is greater than
90%, and the equivalent load within the area is sufficiently smaller
than the inter-area loading, then a next hop structure can be created
specifically to allow offloading of the inter-area route.  For
external routes, if an equivalent loading exists, and more than one
ASBR is advertising the route, and the equivalent load is greater
than 95%, and the equivalent load within the area is sufficiently
smaller than the external route loading, then a separate structure is
used.

Hysteresis must be used in the algorithm for determining if an
equivalent load on a summary LSA or external route is considered
sufficiently larger than the intra-area equivalent load or if an
external route loading is considered sufficiently larger than the
inter-area equivalent load.  For the purpose of describing this
algorithm one equivalent load is referred to as the more external,
and the other as the more internal equivalent load.

If the more external equivalent load exceeds the more internal
equivalent load by 15% and the more internal equivalent load is under
85%, then a separate next hop structure is created.  If the more
external equivalent load falls below 20% of the more internal
equivalent load or the more internal equivalent load exceeds 98%,
then an existing separate next hop structure is marked for removal
and combined with the more internal next hop structure (see
Section 3.3).  The more external equivalent load should not fall
significantly below the more internal unless either the traffic
toward the more external destination increases or the loading on the
more internal increases, since the more internal equivalent load will
become the critical segment on the separate next hop structure if the
load is sufficiently shifted but is unlikely to overshoot by 20%.
These thresholds should be configurable at least per type of route
(inter-area or external).

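A non-normative sketch of this hysteresis follows.  The 15% figure is
read as percentage points and ``falls below 20% of'' is read
literally as one fifth of the more internal load; both readings are
this sketch's interpretation, and the thresholds should be
configurable.

   #include <stdbool.h>

   /* Decide whether a separate next hop structure should exist for a
    * more external route, given equivalent loads in percent.  The
    * thresholds are the ones given above and should be configurable. */
   bool omp_want_separate_structure(bool exists,
                                    unsigned external_load,
                                    unsigned internal_load)
   {
       if (!exists)
           return external_load > internal_load + 15 && internal_load < 85;

       /* Already exists: destroy only if the external load collapses
        * or the internal load gets dangerously high. */
       if (external_load * 100 < internal_load * 20 || internal_load > 98)
           return false;
       return true;
   }
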
The degree to which summary LSA loading and external route loading
will be considered is limited.  This serves two purposes.  First, it
prevents compensating for external congestion to the point of loading
the internal network beyond a fixed threshold.  Second, it prevents
triggering the removal of the next hop structure, which if allowed to
occur could trigger a hysteresis loop.  These mechanisms are
described in Section 3.4 and Appendix C.4.

3.3 Creating and destroying next hop structures

As described in Section 3.2, a separate next hop structure is needed
if the loading indicated by the type LSA_OMP_PATH_LOAD Opaque LSA or
exterior routing protocol is sufficiently high to require separate
balancing for traffic to the summary LSA or exterior route and the
intra-area loading is sufficiently low.

When a separate next hop structure is created, the same available
paths appear in the structure, leading to the same set of ABRs or
ASBRs.  The balance on these available paths should be copied from
the existing more internal next hop structure.  By initializing the
new next hop structure this way, a sudden change in loading is
avoided if the summary route or external route sinks a great deal of
traffic.

When a separate next hop structure can be destroyed, the traffic
should be transitioned gradually. The next hop structure must be
marked for deletion. The traffic share in this separate next hop
structure should be gradually changed so that it exactly matches the
traffic share in the more internal next hop structure. The gradual
change should follow the adjustment rate schedule described in Sec-
tion 4.1 where the move increment is increased gradually as moves
continue in the same direction. Once the separate next hop structure
marked for deletion matches the more internal next hop structure, the
summary route or external route can be changed to point to the more
internal next hop structure and the deletion can be made.
3.4 Critically loaded segment

For every set of paths, one link or part of the path is identified as
the ``critically loaded'' segment.  This is the part of the path with
the highest equivalent load as defined in Section 2.3.  For an
inter-area route with a separate next hop structure, the critically
loaded segment may be the critically loaded segment for the
intra-area set of paths, or it may be the summary LSA if the
equivalent load on the summary LSA is greater.  For an external route
with a separate next hop structure, the critically loaded segment may
be the critically loaded segment for the internal route or it may be
the external route if the equivalent load on the external route is
greater.  In considering loading reported for summary LSAs or
external routes, the loading may be clamped to some configured
ceiling (see Appendix C.4).  If intra-area loading exceeds this
ceiling, the summary LSA loads or external route loads are in effect
ignored.

Each next hop structure has exactly one ``critically loaded''
segment.  There may be more than one path in the next hop structure
sharing this critically loaded segment.  A particular Opaque LSA may
be the critically loaded segment for no next hop structures if it is
lightly loaded.  Another Opaque LSA may be the critically loaded
segment for many next hop structures if it is heavily loaded.

4 Adjusting Equal Cost Path Loadings

Adjustments are made to a next hop structure to reflect differences
in loading on the paths as reported by the type LSA_OMP_LINK_LOAD
Opaque LSA and type LSA_OMP_PATH_LOAD Opaque LSA.  Section 3.4
describes the selection of a ``critically loaded segment'' which is
used to determine when to make adjustments and the size of the
adjustments.  An adjustment to loading of a given set of equal cost
paths is made when one of two conditions is true.  Either the
``critically loaded segment'' has been reflooded, or a criteria is
met involving 1) the difference between the equivalent load of the
``critically loaded segment'' and the lightest loaded path, 2) the
equivalent load of the ``critically loaded segment'', 3) the type of
destination, intra-area, inter-area, or external, and 4) the amount
of time since the last load adjustment.  The details of this
conditional are described in Appendix B.

The reflooding algorithm is designed to be slightly less aggressive
than the adjustment algorithm.  This reduces the need to continuously
flood small changes except in conditions of overload or substantial
change in loading.  Some overshoot may occur due to adjustments made
in the absence of accurate knowledge of loading.

4.1 Load Adjustment Rate

In order to assure stability the rate of adjustment must be
sufficiently limited.  An adaptive adjustment rate method is used.

A ``critically loaded'' segment for a next hop structure is
determined as described in Section 3.4.  When the type
LSA_OMP_LINK_LOAD Opaque LSA or type LSA_OMP_PATH_LOAD Opaque LSA for
this segment is updated or the criteria in Appendix B is met, load is
shed from all paths in the next hop structure that include that
segment toward all paths in the next hop structure that do not
include that segment.  A separate set of variables controlling rate
of adjustment is kept for each path receiving load.

The number of paths usually exceeds the number of next hops.  The
distinction between paths which share a next hop is important if one
of the paths sharing a next hop goes down (see Section 4.2).  This
distinction is only needed in making the computations.  When moving
the next hop structure into the data structures used for forwarding,
paths which share a common next hop may be combined.

The following variables are kept for each path in a next hop
structure.

1. The current ``traffic share'' (an integer, the range is 0 to 65535
   for a CRC16 hash),

2. The current ``move increment'' used when moving traffic toward
   this path (an integer, the range is 0 to 65535 for a CRC16 hash),

3. The number of moves in the same direction, referred to as the
   ``move count''.

If there is no prior history for a path, then the move increment is
initialized to about 1% or 650.  The number of moves in the same
direction is initialized to 0.  No loading adjustment is made on the
first iteration.

If the critically loaded segment has changed, all paths now
containing the critically loaded segment are first examined.  The
lowest move increment of any one of these paths is noted.

The move increment is adjusted for each path before any traffic is
moved.  One of the following actions is taken for each path.

1. If the path contains the critically loaded segment its move
   increment is left unchanged.

2. If the path does not contain the critically loaded segment but the
   critically loaded segment has changed and the path contains the
   prior critically loaded segment, then first the move increment is
   replaced with the lowest move increment from any of the paths
   containing the critically loaded segment unless the move increment
   is already lower, then the move increment is cut in half.

3. If the path does not contain the critically loaded segment and
   either the critically loaded segment has not changed, or the path
   does not contain the prior critically loaded segment, then the
   move increment is increased.

The amount of increase in the move increment is described in
Appendix B.4.  The increase is designed to minimize the possibility
of dramatic overshoot due to too great an increase in adjustment
rate.

The move increment is never less than one and the increase in move
increment is never less than one.  The move increment is never
allowed to exceed the size of the hash space divided by the number of
equal cost paths in the next hop structure.

The dramatic decrease in move increment when move direction is
reversed and the slow increase in move increment when it remains in
the same direction keeps the algorithm stable.  The exponential
nature of the increase allows the algorithm to track externally
caused changes in traffic loading.

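For illustration, the per path move increment update might look like
the following non-normative sketch.  The growth used in case 3 and
the move count handling are assumptions modeled on the earlier
revision of this draft; the normative rule is in Appendix B.4.

   #include <stdbool.h>
   #include <stdint.h>

   /* Per path move increment update, run before any traffic is moved. */
   void omp_update_move_increment(uint32_t *move_increment,
                                  unsigned *move_count,
                                  bool on_critical,
                                  bool critical_changed,
                                  bool on_prior_critical,
                                  uint32_t lowest_critical_increment,
                                  unsigned npaths)
   {
       const uint32_t max_increment = 65536 / npaths;
       uint32_t inc;

       if (on_critical)
           return;                          /* case 1: left unchanged */

       if (critical_changed && on_prior_critical) {
           /* case 2: the adjustment direction has reversed */
           if (lowest_critical_increment < *move_increment)
               *move_increment = lowest_critical_increment;
           *move_increment /= 2;
           *move_count = 0;
       } else {
           /* case 3: continuing in the same direction; the growth rule
            * here is an assumption modeled on the earlier revision */
           inc = (*move_count > 4 ? *move_increment
                                  : *move_increment / 2) / npaths;
           if (inc < 1)
               inc = 1;            /* the increase is never less than one */
           *move_increment += inc;
           *move_count += 1;
       }

       if (*move_increment < 1)
           *move_increment = 1;             /* never less than one */
       if (*move_increment > max_increment)
           *move_increment = max_increment; /* never more than space/npaths */
   }
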
The traffic share allocated to a path not containing the critically
loaded segment is incremented by the move amount for that path and
the traffic share allocated to the path or paths containing the
critically loaded segment are reduced by this amount divided by the
number of paths containing the critically loaded segment.  This
adjustment is described in pseudocode in Appendix B.4.

This adjustment process is repeated for each path in a next hop
structure.  The new hash space boundaries are then moved to the
forwarding engine.

structure should be gradually changed so that it exactly matches the
traffic share in the more internal next hop structure. The gradual
change should follow the adjustment rate schedule described in
Section 3.3 where the move increment is increased gradually as moves
continue in the same direction. Once the separate next hop structure
marked for deletion matches the more internal next hop structure, the
summary route or external route can be changed to point to the more
internal next hop structure and the deletion can be made.
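A possible shape for this gradual transition toward deletion is
sketched below; the function and field names are illustrative only,
and the growth of the move increment, which follows the schedule of
Section 3.3, is not shown:

    /* Hypothetical sketch: step the traffic shares of a next hop
     * structure marked for deletion toward the shares of the more
     * internal next hop structure.  Returns nonzero once the shares
     * match exactly and the deletion can be made. */
    int converge_marked_structure(int npaths,
                                  int share[],        /* marked structure   */
                                  const int target[], /* internal structure */
                                  const int move_increment[])
    {
        int i, matched = 1;

        for (i = 0; i < npaths; i++) {
            int diff = target[i] - share[i];

            if (diff == 0)
                continue;
            matched = 0;
            /* move toward the target by at most the move increment */
            if (diff > 0)
                share[i] += (diff < move_increment[i])
                            ? diff : move_increment[i];
            else
                share[i] -= (-diff < move_increment[i])
                            ? -diff : move_increment[i];
        }
        return matched;
    }

A real implementation must also keep the shares within the structure
summing to the full hash space at every intermediate step.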
4.2 Dealing with Link Adjacency Changes
Link failures do occur for various reasons. OSPF routing will
converge to a new set of paths. Whatever load balance had previously
existed will be upset and the load balancing will have to converge to
a new load balanced state. Previous load balancing parameters should
remain intact to the extent possible after the SPF calculation has
completed. Adjustments for new or deleted paths in the SPF result are
described here. These adjustments must be made after the best path
criteria is relaxed as described in Section 3.1.
4.2.1 Impact of Link Adjacency Changes
Links which are intermittent may be the most harmful. The OSPF
``Hello'' protocol is inadequate for handling intermittent links.
When such a link is up it may draw traffic during periods of high
loss, even brief periods of complete loss.
The inadequacies of the OSPF ``Hello'' protocol are well known and
many implementations provide lower level protocol state information
to OSPF to indicate a link in the ``down'' state. For example,
indications may include carrier loss, excessive framing errors,
unavailable seconds, or loss indications from PPP LQM.
Even where the use of a link is avoided by providing indication of
lower level link availability, intermittent links are still
problematic. During a brief period immediately after a link state
attribute is initially flooded, OSPF state can be inconsistent among
routers within the OSPF area. This inconsistency can cause
intermittent routing loops and have a severe short term impact on
link loading. An oscillating link can cause high levels of loss and
is generally better off held in the neighbor adjacency ``down''
state. The algorithm described in [7] can be used when advertising
OSPF type 1 or type 2 LSAs (router and network LSAs).
Regardless of whether router and network LSAs are damped, neighbor
adjacency state changes will occur and router and network LSAs will
have to be handled. The LSA may indicate an up transition or a down
transition. In either an up or down transition, when the SPF
algorithm is applied, existing paths from the router doing the SPF to
specific destinations may no longer be usable and new paths may
become usable. In the case of an up transition, some paths may no
longer be usable because their cost is no longer among those tied for
the best. In the case of down transitions, new paths may become
usable because they are now the best path still available.
4.2.2 Handling the Loss of Paths
When a path becomes unusable, paths which previously had the same
cost may remain. This can only occur on an LSA down transition.
A new next hop entry should be created in which the proportion of
source/destination hash space allocated to the now infeasible path
is distributed to the remaining paths proportionally to their prior
allocation. Very high loading percentages should result, triggering
an increase in LSA_OMP_LINK_LOAD Opaque LSA flooding rate until
convergence is approached.
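A minimal sketch of this redistribution is shown below; the function
and variable names are illustrative, and the 65536 entry hash space
used elsewhere in this document is assumed:

    /* Hypothetical sketch: path "lost" has become infeasible, so its
     * share of the hash space is handed to the surviving paths in
     * proportion to their prior allocation. */
    void redistribute_lost_path(int npaths, int share[], int lost)
    {
        int freed = share[lost];
        int remaining = 65536 - freed;
        int given = 0;
        int i, last = -1;

        if (remaining <= 0)
            return;    /* degenerate case: the lost path had it all */
        share[lost] = 0;
        for (i = 0; i < npaths; i++) {
            int extra;

            if (i == lost || share[i] == 0)
                continue;
            /* proportional to the prior allocation of each survivor */
            extra = (int)(((long long)share[i] * freed) / remaining);
            share[i] += extra;
            given += extra;
            last = i;
        }
        /* give any rounding remainder to the last surviving path so
         * the shares still cover the whole hash space */
        if (last >= 0)
            share[last] += freed - given;
    }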
4.2.3 Handling the Addition of Paths
When a new path becomes usable it may be tied for best with paths
carrying existing traffic. This can only occur on an LSA up
transition. A new next hop entry should be created in which the
loading on the new path is zero. If such a path were to oscillate,
little or no load would be affected. If the path remains usable, the
shift of load to this path will accelerate until a balance is
reached.
If a completely new set of best paths becomes available, the load
should be split across the available paths, proportional to 10% of
link capacity plus the remaining link bandwidth as determined by
prior LSA_OMP_LINK_LOAD Opaque LSA values. The contribution of link
capacity in the weighting should be configurable. See Appendix C.5.
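A sketch of this initial split is given below, using the weighting
described in Appendix C.5; the 10% contribution corresponds to the
configurable parameter in that appendix, and all names are
illustrative:

    /* Hypothetical sketch: weight each new path by 10% of its link
     * capacity plus the unused capacity reported in prior
     * LSA_OMP_LINK_LOAD LSAs, then divide the 65536 entry hash space
     * in proportion to the weights.  Assumes npaths <= 16 and at
     * least one path with nonzero capacity. */
    void initial_split(int npaths,
                       const double capacity[],    /* e.g. kbit/s */
                       const double utilization[], /* 0.0 .. 1.0  */
                       int share[])
    {
        double weight[16];
        double total = 0.0;
        int i, assigned = 0;

        for (i = 0; i < npaths; i++) {
            weight[i] = capacity[i] * 0.10
                      + capacity[i] * (1.0 - utilization[i]);
            total += weight[i];
        }
        for (i = 0; i < npaths; i++) {
            share[i] = (int)(65536.0 * weight[i] / total);
            assigned += share[i];
        }
        share[npaths - 1] += 65536 - assigned; /* absorb rounding */
    }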
Acknowledgements
Numerous individuals have provided valuable comments regarding this
work. Dave Ward made a very substantial contribution by pointing out
that the best path criteria could be relaxed. Dave Ward, John Scud-
der, Tony Li, and Daniel Awduche have provided particularly valuable
review and comments.
References
[1] R. Coltun. The OSPF Opaque LSA Option. Technical Report RFC 2370,
    Internet Engineering Task Force, 1998.
    ftp://ftp.isi.edu/in-notes/rfc2370.txt.
[2] Atul Khanna and John Zinky. The revised ARPAnet routing metric.
    In SIGCOMM Symposium on Communications Architectures and
    Protocols, pages 45--56, Austin, Texas, September 1989. ACM.
[3] Steven H. Low and P. Varaiya. Dynamic behavior of a class of
    adaptive routing protocols (IGRP). In Proceedings of the
    Conference on Computer Communications (IEEE Infocom), pages
    610--616, March/April 1993.
[4] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic
    behavior of the TCP congestion avoidance algorithm. ACM Computer
    Communication Review, 27(3), July 1997.
[5] J. Moy. OSPF Version 2. Technical Report RFC 2328, Internet
    Engineering Task Force, 1998.
    ftp://ftp.isi.edu/in-notes/rfc2328.txt.
[6] W. Stevens. TCP Slow Start, Congestion Avoidance, Fast
    Retransmit, and Fast Recovery Algorithms. Technical Report RFC
    2001, Internet Engineering Task Force, 1997.
    ftp://ftp.isi.edu/in-notes/rfc2001.txt.
[7] Curtis Villamizar, R. Govindan, and R. Chandra. BGP Route Flap
    Damping. Internet Draft (Work in Progress)
    draft-ietf-idr-route-damp-03, Internet Engineering Task Force,
    May 1998.
    ftp://ftp.isi.edu/internet-drafts/draft-ietf-idr-route-damp-03.txt.
Security Considerations
The use of the Opaque LSAs described in this document does not impact
the security of OSPF deployments. In deployments which use a strong
OSPF authentication method, and require signatures on LSAs from the
originating router, no leveraging of a partial compromise beyond a
localized disruption of service is possible. In deployments which use
a strong OSPF authentication method, but do not require signatures on
LSAs from the originating router, compromise of a single router can
be leveraged to cause significant disruption of service with or
without the use of these Opaque LSAs, but disruption of service
cannot be achieved without such a compromise. In deployments which
use a weak OSPF authentication method, significant disruption of
service can be caused through forged protocol interactions with or
without the use of these Opaque LSAs.
Author's Addresses
Curtis Villamizar
ANS Communications
<curtis@ans.net>
A Data Formats
+----------------+----------------+----------------+----------------+
| Link State Advertisement Age | Options | LSA Type |
+----------------+----------------+----------------+----------------+
| Opaque Type | Opaque ID |
+----------------+----------------+----------------+----------------+
| Advertising Router |
+----------------+----------------+----------------+----------------+
| Link State Advertisement Sequence Number |
+----------------+----------------+----------------+----------------+
| LSA Checksum | LSA length |
+----------------+----------------+----------------+----------------+
| Version | Reference Type | Packing Method | BW Scale |
+----------------+----------------+----------------+----------------+
| Reference to a Type 1-5 LSA (32 or 64 bits, see below) |
+----------------+----------------+----------------+----------------+
| Packed Loading Information (variable length, see below) |
+----------------+----------------+----------------+----------------+
The "LSA Age", "Options", and "LSA Type" are part of the Link State
Advertisement Format described in Appendix A of RFC--2328. The LSA
Type is is 10, signifying an Opaque LSA with Area local scope, as de-
fined in RFC--2370. RFC--2370 splits the Link State ID field into
two part, Opaque Type, and Opaque ID. The Opaque Type contains either
the value LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOADas described in Sec-
tion 2. The "Advertising Router", "Link State Advertisement Sequence
Number", "LSA Checksum", and "LSA length" are part of the Link State
Advertisement Format described in Appendix A of RFC--2328. The re-
mainder of the packet is specific to Opaque Type LSA_OMP_LINK_LOAD or
LSA_OMP_PATH_LOADLSAs.
Opaque Type The Opaque Type will contain the value LSA_OMP_LINK_LOAD
or LSA_OMP_PATH_LOAD as described in Section 2. Numeric values will
be requested from IANA.
Opaque ID The Opaque ID will contain an integer which will be unique
per router and interface, virtual interface, or MAC address for
which loading is reported. These numbers are only of significance
to the advertising router except as a means of identification of
subsequent LSAs.
For a LSA_OMP_LINK_LOAD Opaque LSA, the ``Opaque ID'' must contain
a 24 bit integer that is unique to the link or virtual link. The
method of assignment of these 24 bit integers is a local matter. A
router must be capable of uniquely identifying an interface using a 24
bit number.
For a LSA_OMP_PATH_LOAD Opaque LSA, the ``Opaque ID'' must contain
a 24 bit integer that is unique to a summary LSA or AS-external LSA
advertised by the same router. The method of assignment of these
24 bit integers is a local matter. A router must be capable of
uniquely identifying a summary LSA or AS-external LSA using a 24 bit
number.
Version The version number is 1.
Reference Type The Reference Type indicates the type of LSA that is
being referenced in the "Reference to a Type 1-5 LSA" field.
Packing Method The Packing Method is an integer that indicates how
the "Packed Loading Information" field is formated.
BW Scale Bandwidth numbers must be scaled by shifting the 32 bit
   integer left by this amount. If this value is non-zero, integers
   larger than 32 bits (for example, 64 bit integers) must be used to
   represent the unscaled value.
Reference to a Type 1-5 LSA This field contains the 32 bit "Link
State ID" field of a Type 1-5 LSA. Type 1-5 indicate:
1. Router-LSAs
2. Network-LSAs
3. Summary-LSAs (IP network)
4. Summary-LSAs (ASBR)
5. AS-external-LSAs
Loading information for a Type 1 LSA, a Router-LSA, is sent as
a LSA_OMP_LINK_LOAD Opaque LSA. For a Type 1 LSA the "Link State ID"
field is followed by a 32 bit "Link ID". This identifies a single
link. There are four types of Links.
1. Point-to-point connection to another router
2. Connection to a transit network
3. Connection to a stub network
4. Virtual link
Normally loading information is provided for a Link Type 1. Load-
ing information may also be provided for a Link Type 2 or 3. Load-
ing information cannot be provided for a Link Type 4.
Loading information is not provided for Type 2 LSAs, Network-LSAs.
Loading information for Type 3 and Type 4 LSAs, Summary-LSAs for
IP networks in another area and Summary-LSAs for ASBRs, is sent as
a LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For
a Type 3 and Type 4 LSA there is no information in the Reference
following the "Link State ID".
Loading information for Type 5 LSAs, AS-external-LSAs, is sent as
a LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For a
Type 5 LSA there is no information in the Reference following the
"Link State ID".
Packed Loading Information The format of the Packed Loading Informa-
tion depends on the value of the Packing Method field. Currently
only the value 1 is defined.
The following format is used when the Packing Method field contains 1.
The LSA must be ignored if values other than 1 are found in Packing
Method.
+----------------+----------------+----------------+----------------+
| In Scaled Link Capacity in kilobits per second |
+----------------+----------------+----------------+----------------+
| In Link Loading Fraction | In Link Drop Fraction (packets) |
+----------------+----------------+----------------+----------------+
| Out Scaled Link Capacity in kilobits per second |
+----------------+----------------+----------------+----------------+
| Out Link Loading Fraction | Out Link Drop Fraction (packet) |
+----------------+----------------+----------------+----------------+
The Scaled Link Capacity is an unsigned integer in kilobits per sec-
ond. If this would be larger than a 32 bit integer, the value should
be shifted to the right until the top bit is in the 32 bit number MSB
and the number of bit shifts recorded in the BW Scale field.
The Link Loading Fraction is a 16 bit unsigned integer from 0 to
0xffff representing capacity in bytes being used relative to capacity
in bytes per the measurement interval. The hex number 0x10000 would
represent unity, but if full loading has been achieved, or if
counting or truncation error indicates greater than full loading,
0xffff is substituted. The Link Drop Fraction is a 16 bit unsigned
integer from 0 to 0xffff representing the number of packets dropped
relative to the number of packets received. This value can be
derived from the change in two MIB-2 counters, for example
(ifOutDiscard<<16)/ifOutPacket for the output direction. The hex
number 0x10000 would represent unity (all of the packets being
dropped) so 0xffff must be substituted.
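The following sketch shows one way the Packing Method 1 fields for a
single direction might be filled in from sampled counter deltas. It
assumes the sampling described in Appendix B.1; the delta and field
names are illustrative, not MIB objects or required data structures:

    #include <stdint.h>

    struct omp_load_fields {
        uint32_t scaled_capacity; /* kbit/s, after scaling          */
        uint8_t  bw_scale;        /* left shifts needed to recover  */
        uint16_t load_fraction;   /* 0..0xffff, 0x10000 would be 1  */
        uint16_t drop_fraction;   /* 0..0xffff, 0x10000 would be 1  */
    };

    void pack_load(uint64_t capacity_kbps, uint64_t octet_delta,
                   uint64_t pkt_delta, uint64_t drop_delta,
                   uint32_t interval_secs, struct omp_load_fields *f)
    {
        uint64_t cap = capacity_kbps;
        uint64_t can_carry_bytes;
        uint64_t frac;

        /* capacity: shift right until it fits in 32 bits, recording
         * the number of shifts in BW Scale */
        f->bw_scale = 0;
        while (cap > 0xffffffffULL) {
            cap >>= 1;
            f->bw_scale++;
        }
        f->scaled_capacity = (uint32_t)cap;

        /* loading: octets carried relative to the octets the link
         * could have carried during the measurement interval */
        can_carry_bytes = capacity_kbps * 1000ULL / 8ULL * interval_secs;
        frac = can_carry_bytes ? (octet_delta << 16) / can_carry_bytes : 0;
        f->load_fraction = frac > 0xffff ? 0xffff : (uint16_t)frac;

        /* drops: discarded packets relative to packets handled in the
         * same direction */
        frac = pkt_delta ? (drop_delta << 16) / pkt_delta : 0;
        f->drop_fraction = frac > 0xffff ? 0xffff : (uint16_t)frac;
    }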
B Concise Statement of the Algorithms
An OSPF router may play one of two roles, or both. Interior routers
and edge routers will flood loading information. A router may choose
not to flood information if it is believed that there is no way that
the interface could become congested. An ingress router will process
loading information and, if it has equal cost paths, will balance
load across those paths.
The description of algorithms is broken down into the following
subsections.
   Flooding Loading Information      Appendix B.1
   Building Next Hop Structures      Appendix B.2
   Processing Loading Information    Appendix B.3
   Adjusting Loading                 Appendix B.4
The algorithms are defined in the following sections in pseudocode.
B.1 Flooding Loading Information
It is assumed that counters are large enough to avoid multiple
overflow (i.e., 64 bit counters are used for high speed interfaces)
and that counter overflow is easily detected and compensated for in
counter deltas. It is assumed that ifInDiscard and ifOutDiscard
accurately count all queueing drops.
The following counters are sampled at 15 second intervals: ifInOctets,
ifOutOctets, ifInPacket, ifOutPacket, ifInDiscard and ifOutDiscard.
The value of ifInSpeed and ifOutSpeed is assumed to be constant. Some
state must be stored. The previously used value of each raw counter
is needed to compute deltas. State variables InFilteredUtil,
OutFilteredUtil, InLoss, OutLoss, InEquivLoad and OutEquivLoad must be
saved. The last time a reflooding occurred must also be stored.
The input and output utilizations are expressed as fractions using
ifInOctets, ifOutOctets, ifInSpeed, and ifOutSpeed. Call the raw
15 second fractional utilizations InRawUtil and OutRawUtil. Compute
the following filtered values for both In and Out, making sure to save
the previous values.
PrevFilteredUtil = FilteredUtil;
if (RawUtil < FilteredUtil) {
FilteredUtil -= (FilteredUtil >> 3);
FilteredUtil += (RawUtil >> 3);
} else if (RawUtil > FilteredUtil) {
FilteredUtil -= (FilteredUtil >> 1);
FilteredUtil += (RawUtil >> 1);
}
InLoss and OutLoss are computed using the ratio of the deltas of
Discard to Packet SNMP counters. Next compute an ``equivalent
loading'' for the purpose of determining whether to reflood.
PrevEquivLoad = EquivLoad;
if (Loss < 0.005) {
EquivLoad = FilteredUtil;
} else {
if (Loss <= 0.09) {
LossComp = 10 * sqrt(Loss);
} else {
LossComp = 3;
}
EquivLoad = FilteredUtil * LossComp;
}
A square root is somewhat time consuming to compute, so a table lookup
can be done to avoid this computation. Increments of 0.1% loss would
yield a 90 entry table. A 128-512 entry table would be adequate.
The table can be sized so a shift and mask can be used to index it.
The computation could then be done with a table lookup, a shift, and
an integer multiply. At most this needs to be done for links with
nonzero loss once every 15 seconds.
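One possible form of the lookup is sketched here; the table size and
fixed point scaling are illustrative choices, not requirements of
this document:

    #include <math.h>

    /* 128 entries indexed by loss in steps of 1/1024 (about 0.1%),
     * each holding the loss compensation factor scaled by 256. */
    #define LOSSCOMP_ENTRIES 128
    #define LOSSCOMP_SHIFT   8

    static unsigned int losscomp[LOSSCOMP_ENTRIES];

    void losscomp_init(void)
    {
        int i;

        for (i = 0; i < LOSSCOMP_ENTRIES; i++) {
            double loss = (double)i / 1024.0;

            if (loss < 0.005)
                losscomp[i] = 1 << LOSSCOMP_SHIFT;  /* no compensation */
            else if (loss <= 0.09)
                losscomp[i] = (unsigned int)(10.0 * sqrt(loss) * 256.0);
            else
                losscomp[i] = 3 << LOSSCOMP_SHIFT;  /* capped at 3 */
        }
    }

    /* utilization, loss_frac and the result are fractions scaled by
     * 65536; the computation is a lookup, a multiply and a shift */
    unsigned int equiv_load(unsigned int utilization,
                            unsigned int loss_frac)
    {
        unsigned int idx = loss_frac >> 6; /* 1/65536 -> 1/1024 steps */

        if (idx >= LOSSCOMP_ENTRIES)
            idx = LOSSCOMP_ENTRIES - 1;
        return (utilization * losscomp[idx]) >> LOSSCOMP_SHIFT;
    }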
Then decide whether to flood based on the greater of the relative
change in InEquivLoad or OutEquivLoad and on the time elapsed since
the last reflooding (taking care not to divide by zero).
Diff = max(abs(InEquivLoad - InPrevEquivLoad)
/ InPrevEquivLoad,
abs(OutEquivLoad - OutPrevEquivLoad)
/ OutPrevEquivLoad);
Load = max(InEquivLoad, OutEquivLoad);
Elapsed = Now - LastReflood;
if (((Load > 1.00) && (Diff > 0.05) && (Elapsed >= 30))
|| ((Load > 1.00) && (Diff > 0.02) && (Elapsed >= 60))
|| ((Load > 1.00) && (Diff > 0.01) && (Elapsed >= 90))
|| ((Load > 1.00) && (Elapsed >= 180))
|| ((Load > 0.90) && (Diff > 0.05) && (Elapsed >= 60))
|| ((Load > 0.90) && (Diff > 0.02) && (Elapsed >= 240))
|| ((Load > 0.90) && (Diff > 0.01) && (Elapsed >= 480))
|| ((Load > 0.90) && (Elapsed >= 600))
|| ((Load > 0.70) && (Diff > 0.10) && (Elapsed >= 60))
|| ((Load > 0.70) && (Diff > 0.05) && (Elapsed >= 120))
|| ((Load > 0.70) && (Diff > 0.02) && (Elapsed >= 480))
|| ((Load > 0.70) && (Elapsed >= 900))
|| ((Load > 0.50) && (Diff > 0.10) && (Elapsed >= 60))
|| ((Load > 0.50) && (Diff > 0.05) && (Elapsed >= 300))
|| ((Load > 0.25) && (Diff > 0.25) && (Elapsed >= 120))
|| ((Load > 0.25) && (Elapsed >= 1200))
) {
/* we passed one of the tests so flood it */
LastReflood = Now;
...
If the decision is made to reflood an LSA according to the test above,
then input and output FilteredUtil and Loss must be packed into an LSA
and flooded.
The 15 second timer interval should be jittered by a random number in
the range of plus or minus 5 seconds (obviously taking the actual time
interval into account in computing rates).
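A trivial sketch of the jittered timer follows; the random source and
helper names are illustrative and not specified by this document:

    #include <stdlib.h>

    /* nominal 15 second sampling interval, jittered to 10..20 seconds */
    unsigned int next_sample_delay(void)
    {
        return 10 + (unsigned int)(rand() % 11);
    }

    /* convert a counter delta to a rate using the actual elapsed time
     * rather than the nominal interval */
    double rate_from_delta(unsigned long long delta, double elapsed_secs)
    {
        return elapsed_secs > 0.0 ? (double)delta / elapsed_secs : 0.0;
    }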
B.2 Building Next Hop Structures
The OSPF SPF calculation is done as per RFC--2328. The arrival of
loading information does not require new SPF calculations since
neither reachability nor costs are changed.
The SPF calculation yields the shortest paths from the given node
to all other routers and networks in the OSPF area. In some cases
multipaths will already exist. For all destinations, every feasible
hop is examined, and paths through next hops that simply provide an
improvement are added, as described in Section 3.1.
After completing the SPF calculation and relaxing the best path
criteria, intra-area multipath sets are recorded as next hop
structures. If a previous SPF had been in use, loadings are
transferred to the new set of data structures and links are added or
removed as needed as described in Section 4.2.
After recording the intra-area next hop structures, the existing set
of inter-area next hop structures and the set of external route next
hop structures are examined. Paths are added or removed from next
hop structures as needed, as described in Section 3, Section 3.3, and
Section 4.2.
Inter-area and external routes map onto the intra-area routing. These
therefore share the same set of paths and the same next hop structure
as the intra-area route to the nearest ABR or ASBR. Next hop
structures may be needed to reach any one in a set of ABRs or ASBRs
if 1) there are multiple ABRs equally distant to a summary route,
2) there are multiple ASBRs equally distant advertising an external
route at the same cost, or 3) relaxing the best path criteria for an
intra-area destination results in going to a next-hop which does not
share the same closest ABR or ASBR.
Next hop structures may also be needed to offload paths in adjacent
areas or external paths. The following conditional is used to
determine whether a next hop structure should be added for a Summary
LSA.
if (IntraAreaLoad < 85%
&& SummaryLSALoad > 90%
&& SummaryLSALoad - IntraAreaLoad > 15%) {
/* add a next hop structure */
...
The conditional for an external route is the same, except the
intra-area load would be a more internal load, intra-area, or Summary
LSA, and the 90% threshold is increased to 95%.
The following conditional is used to determine if an existing
separate next hop structure for a Summary LSA or external route
should be deleted.
if (MoreInternalLoad > 98%
|| MoreInternalLoad - MoreExternalLoad > 20%) {
/* delete a next hop structure */
...
B.3 Processing Loading Information
Adjustments to loading may be triggered by one of two events. When a
new loading LSA is received, if the LSA corresponds to the most
heavily loaded link for a next hop structure, then the next hop
structure should be readjusted immediately. In addition, the time at
which each next hop structure was last readjusted must be maintained
and each next hop structure must be periodically checked for
readjustment. Timer events are handled as follows.
foreach NextHopStruct ( AllNextHopStructs ) {
Elapsed = Now - LastReadjust[NextHopStruct];
MinLoaded = MinEquivLoad(NextHopStruct);
MaxLoaded = MaxEquivLoad(NextHopStruct);
AbsDiff = MaxLoaded - MinLoaded;
if (((Elapsed >= 60)
      && (AbsDiff > 0.045) && (MaxLoaded > 0.95))
    || ((Elapsed >= 90)
      && (AbsDiff > 0.03) && (MaxLoaded > 0.95))
    || ((Elapsed >= 120)
      && (AbsDiff > 0.01) && (MaxLoaded > 0.97))
    || ((Elapsed >= 240)
      && (AbsDiff > 0.005) && (MaxLoaded > 0.98))
    || ((Elapsed >= 90)
      && (AbsDiff > 0.05) && (MaxLoaded > 0.90))
    || ((Elapsed >= 120)
      && (AbsDiff > 0.03) && (MaxLoaded > 0.90))
    || ((Elapsed >= 180)
      && (AbsDiff > 0.01) && (MaxLoaded > 0.90))
    || (Elapsed >= 300)
   ) {
/* we need to readjust this multipath set */
...
This loop and conditional results in a subset of the available next
hop structures being adjusted based on the timer. The same effect may
be accomplished by determining when a next hop structure will need to
be adjusted if no further flooding changes arrive and queueing next
hop structures on lists according to how long they will remain idle.
B.4 Adjusting Loading
A next hop structure will need to be adjusted when one of the two cri-
teria in the prior section is met. The adjustment procedure itself
relies upon the following stored state.
NextHopStruct {
LastReadjust;
PrevCriticalSegment;
TotalPaths;
SetofPaths (
Path;
bit HasCriticalSegment,
HasPrevCriticalSeg;
TrafficShare;
MoveIncrement;
MoveCount;
);
};
Before the path move increments are adjusted, a preliminary pass is
made over the next hop structure. This pass notes which paths con-
tain the prior critical segment, notes which paths contain the current
critical segment and counts the number of paths containing the current
critical segment.
NumberWithCritical = 0;
MinRateWithCritical = 65536;
foreach Path ( SetofPaths ) {
SetOrClear HasCriticalSegment;
SetOrClear HasPrevCriticalSeg;
if (HasCriticalSegment) {
++NumberWithCritical;
if (MoveIncrement < MinRateWithCritical)
MinRateWithCritical = MoveIncrement;
}
}
Next the move increment for each path is adjusted.
foreach Path ( SetofPaths ) {
if (HasCriticalSegment)
continue;
if (!HasPrevCriticalSeg) {
++MoveCount;
if (MoveCount > 4) {
Increase = MoveIncrement
/ (2 * (1 + NumberWithCritical));
} else {
Increase = MoveIncrement
/ (4 * (1 + NumberWithCritical));
}
if (Increase < 1)
Increase = 1;
MoveIncrement += Increase;
} else {
if (MoveIncrement > MinRateWithCritical)
MoveIncrement = MinRateWithCritical;
MoveIncrement /= 2;
MoveCount = 0;
}
if (MoveIncrement < 1)
    MoveIncrement = 1;
if (MoveIncrement > 65535)
    MoveIncrement = 65535;
}
Then traffic share is adjusted.
foreach Path1 ( SetofPaths ) {
if (!Path1.HasCriticalSegment)
continue;
foreach Path2 ( SetofPaths ) {
if (Path2.HasCriticalSegment)
continue;
Move = Path2.MoveIncrement / NumberWithCritical;
if (Move < 1)
Move = 1;
if (Move > (65536 - Path2.TrafficShare)) {
Move = 65536 - Path2.TrafficShare;
Path2.MoveIncrement = Move;
}
if (Move > Path1.TrafficShare)
Move = Path1.TrafficShare;
Path2.TrafficShare += Move;
Path1.TrafficShare -= Move;
}
}
The traffic shares for paths sharing a common next hop are then summed
and the appropriate information is transferred to the forwarding data
structures.
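The transfer to the forwarding engine is implementation specific.
The following sketch shows one plausible mapping of the summed shares
onto hash space boundaries; the structures and function names are
illustrative only:

    #define HASH_SPACE 65536

    struct fwd_entry {
        int next_hop;    /* index of the next hop                    */
        int boundary;    /* first hash value NOT using this next hop */
    };

    /* shares[] are per next hop, already summed over paths, and are
     * assumed to add up to HASH_SPACE */
    int build_fwd_entries(int nhops, const int shares[],
                          struct fwd_entry fwd[])
    {
        int i, edge = 0, n = 0;

        for (i = 0; i < nhops; i++) {
            if (shares[i] == 0)
                continue;
            edge += shares[i];
            fwd[n].next_hop = i;
            fwd[n].boundary = edge;
            n++;
        }
        if (n > 0)
            fwd[n - 1].boundary = HASH_SPACE; /* guard against rounding */
        return n;
    }

    /* at forwarding time the packet's source/destination hash selects
     * the first entry whose boundary exceeds the hash value */
    int select_next_hop(int hash, const struct fwd_entry fwd[], int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (hash < fwd[i].boundary)
                return fwd[i].next_hop;
        return n > 0 ? fwd[n - 1].next_hop : -1;
    }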
C Configuration Options
Many of the capabilities described here must be configurable. The
ability to enable and disable capability subsets is needed. Many
parameters used by the algorithm should also be configurable.
C.1 Capability Subsets
There should at least be the ability to enable and disable the
following.
default description of capability
ON Flooding any loading information
ON Flooding loading information for specific links
- Relaxing best path criteria
- Adjusting traffic shares (default to even split)
OFF Flooding loading information for Summary LSA
OFF Flooding loading information for specific Summary LSA
OFF Flooding loading information for external routes
OFF Flooding loading information for specific external routes
OFF Considering loading information for Summary LSA
OFF Considering loading information for specific Summary LSA
OFF Considering loading information for external routes
OFF Considering loading information for specific exter-
nal routes
Flooding and considering Summary LSA and external route loading in-
formation should be disabled by default. Flooding loading information
should be enabled by default. Relaxing best path criteria and adjust-
ing traffic shares could be enabled or disabled by default, depending
on vendor preference.
C.2 Parameters for Equivalent Load Calculation
The following values affect the computation of equivalent load.
default description of parameter
10 The value of K in ``equiv-load = load * K * sqrt(loss)''
0.5% The minimum loss rate at which to compensate for loss
9% The maximum loss rate above which compensation is fixed
C.3 Parameters for Relaxing the Best Path Criteria
The following parameter affects the number of next hops and paths
added as a result of relaxing the best path criteria. For example,
increasing the metric difference to 2 would require the next hop to be
a metric of ``2'' closer than the current distance to the destination,
and reduce the number of paths added.
default description of parameter
1 The metric difference required to relax the best path criteria
C.4 Parameters for Loading Outside of the OSPF Area
The following parameters affect the creation of separate next hop
structures to compensate for loading on Summary LSA and external
routes when those loadings are high and intra-AS loadings are
substantially lower.
default description of parameter
15% The loading difference to consider a more external load
over a more internal load
85% The maximum internal loading where a more external load
will become eligible for consideration
90% The minimum loading at which a Summary LSA will be
considered over an intra-area loading
95% The minimum loading at which an external route will be
considered over an intra-area loading
20% The load difference at which an external load will be
removed from consideration due to being well under the
internal load.
94% The maximum value used in place of the loading for a
Summary LSA when performing traffic share adjustment.
98% The internal loading where a Summary LSA will be
removed from consideration over the internal load
90% The maximum value used in place of the loading for an
external route when performing traffic share adjustment.
98% The internal loading where an external route will be
removed from consideration over the internal load
Limiting the compensation that will be made to accommodate external
loading is consistent with the reason for considering external routes.
Rarely does a business go out of its way to improve the performance of
their competitor's product, a network service or otherwise. Avoiding
congestion in a peer's network is doing a favor for one's own cus-
tomers by not sending their traffic into known areas of congestion
but only if it does not significantly impact congestion in one's own
network.
Limiting the compensation for Summary LSA loading and external route
loading avoids triggering the hysteresis criteria where a separate
next hop structure is deleted if an internal loading exceeds a fixed
threshold. In effect the Summary LSA loading or external route
loading is ignored if internal loadings exceed a given threshold,
since the Summary LSA loading or external route loading will no
longer be considered as the critical segment. If internal loading
reaches a point where even with load balancing internal paths exceed
the higher threshold, the next hop structure will be removed.
C.5 Parameters for Initial Loading of Paths
When determining the initial loading on a new set of paths, where the
destination was previously unreachable, or none of the previous paths
appear in the new next hop structure, the following weighting is used.
weight = (link-capacity * 0.10)
+ (link-capacity * (1 - utilization))
The contribution of link capacity in the weighting should be config-
urable.
default description of parameter
10% The fraction of total link capacity to consider in
addition to the reported unused link capacity.
C.6 Parameters associated with Flooding and Traffic Share
Parameters associated with flooding rate, move increment and traffic
share adjustment should not be configurable by the user or should be
well hidden so as only to be accessible to developers. Adjustment of
these parameters can slow convergence or increase overshoot. If the
parameters are sufficiently altered instability could result. While
it is likely that some improvements could result from further tuning,
experimentation on live networks is not encouraged.
D Algorithm Performance
A number of simulations have been performed to test the OSPF-OMP algo-
rithms. In these simulations the algorithm has been demonstrated to
be very stable. This work has not been formally published yet but is
currently available at http://engr.ans.net/ospf-omp.
The simulations done to date have only modeled behavior of load bal-
ancing intra-area routes. This would be applicable to networks in
which external routing was mapped onto IGP routing with a single OSPF
area. Passing loading information between areas, allowing loading in
one area to affect an adjacent area, has not been simulated. Similarly
passing loading information with external routes and affecting loading
in a peer AS has not been simulated.
D.1 Conversion from Loss to Equivalent Load
The current adjustment for loss in the equivalent load is based on
the premise that traffic is dominated by TCP and that TCP flows suffi-
ciently unconstrained by other bottlenecks and of sufficient duration
exist to keep the aggregate traffic flow elastic. This is considered
a very safe assumption for Internet traffic. Enterprise networks
may have a higher contribution of incompressible traffic (traffic not
conforming to a form of congestion avoidance such as TCP congestion
avoidance described in RFC--2001).
The assumed relationship between packet loss and link utilization is
based on the work of Mathis et al [4]. The constants in this rela-
tionship cannot be determined as they depend on the delay bandwidth product
of TCP flows, the number and duration of TCP flows, and whether TCP
flows are severely constrained elsewhere.
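For reference, the relationship derived in [4] can be written
approximately as

   rate  ~=  (MSS / RTT) * (C / sqrt(p))

where MSS is the TCP segment size, RTT the round trip time, p the
packet loss rate, and C a constant near one that depends on the loss
model and the use of delayed acknowledgements. The compensation used
here relies only on the inverse square root dependence on loss; as
noted above, the remaining constants cannot be determined in general.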
The objective is to estimate the offered load, which cannot be mea-
sured directly when links are at full utilization, using the link
utilization and loss. The load adjustment algorithm should remain
stable as long as the first derivative of the estimator over offered
load remains positive. If the first derivative is negative within
some region, then oscillation will occur within that range of opera-
tion. The greatest risk of this occurring is in routers where active
queue management is not used (i.e., where drop-tail queues are used)
and in particular where buffering is too small. In such cases, as
offered load increases just beyond full utilization, loss increases
somewhat, but utilization can drop substantially (typically to about
90%). As loss increases further, the estimator of offered load may
decrease, causing the link to appear less loaded than another. The
rather aggressive compensation for loss is intended to insure that
this effect either does not occur, or occurs only within a very
narrow range of offered load at just over full utilization. If the
derivative is negative within a narrow range, oscillations can occur
only within that range, and the oscillations are well bounded.
D.2 Performance as traffic loads change
This work has considerable advantages over other approaches, particu-
larly traffic engineering approaches that involve adjustment of vir-
tual circuit layouts based on historic traffic figures. The advantage
is the ability to adjust loading gradually as actual offered traffic
deviates from expected. The advantage is even greater when compared to
dynamic virtual circuit layouts, using PNNI for example, since these
algorithms have proven to often converge on very suboptimal layouts.
Simulations demonstrating behavior in these particular cases can be
found at http://engr.ans.net/ospf-omp/ramp.html.
D.3 Convergence after a major perturbation
Simulations have been performed in which link failures are examined.
Without relaxing the best path criteria, OSPF OMP is quite dependent
on the set of link metrics to create a set of equal cost paths that
will allow the most heavily loaded portions of the topology to be
offloaded. When links fail, the set of metrics often are far from
ideal for the remaining topology. Relaxing the best path criteria
significantly improves the response to link failure.
Simulations are available at http://engr.ans.net/ospf-omp/fail.html.