|
List Info
Thread: Re: BGP path hunting, MRAI timer and Path Length Damping
|
|
| Re: BGP path hunting, MRAI timer and
Path Length Damping |
  United States |
2007-06-21 18:27:54 |
Hi Robin,
On Jun 21, 2007, at 1:39 AM, Robin Whittle wrote:
> Tony, can you advise on this? Anyone else?
I'll try.
> Geoff's diagram :
>
> ht
tp://www.potaroo.net/ispcol/2007-06/dampbgp.html
>
> explains the pattern of behavior which results in a
series of
> longer best-path messages emanating out from the local
part of
> the network, followed by a withdrawal.
>
> I am not sure about the time elapsed between the four
sections
> of Fig 1, which I will call A, B, C and D.
The first thing to understand is that there is no absolute
standard
of time in any of this. BGP (and routers in general) are
'soft' real-
time systems. That is, given a task to do, they will get
around to
it, but it will frequently be a 'best effort' approach, and
not
scheduled to a precise moment. Other computations within
the
protocol and outside of the protocol in other management
functions
may cause significant time variances in event latency.
> Here is my understanding, based on Geoff's diagram and
Tony's
> written explanation. Is there a better explanation
somewhere?
>
> A This is the steady state before T = 0.0, when (2)
detects
> that its link to (1) goes down.
We should note that at this point, AS 5 should have paths as
follows:
2: 2 1
3: 3 2 1
4: 4 3 2 1
> B, sort of . . .
>
> T = 0.1 secs
>
> (2) sends withdrawals to (5) and (3).
>
> This is almost immediately after the link goes down
- say
> a 0.1 second processing time after (2) determines
that (1)
> is unreachable. This is the time it takes to
generate two
> withdrawal announcements.
Again, this would be ideal... Reality is that you have
asynchronous
systems, updates are not computed simultaneously, and can be
delayed
for a variety of reasons.
> (3) gets the withdrawal too. This is where my
understanding
> departs from the diagram. In B, (5) has sent an
> announcement but 3 hasn't. I figure they would
typically
> send their announcements at about the same time.
Recall 5's path table from above. After processing the
update from
(2), it's path table should look like:
3: 3 2 1
4: 4 3 2 1
Since 5 and 3 are not synchronized at all, there's a
significant race
between the withdrawal from 3 and the advertisement of
"5 3 2 1" by (5).
> In my understanding, at 0.1 seconds, both (5) and
(3) have
> been sent the withdrawal from (2) but have not yet
processed
> them.
In figure B, we see (5) advertising its alternate path. It
has
processed the withdrawal from 2, but not processed the
to-be-
delivered withdrawal from 3.
> In my understanding, there is no strung out sequence of
longer
> and longer paths being announced by 5, as Geoff's
diagram
> depicts. Just the first announcement at T = 0.1s of a
longer
> path, followed by a withdrawal at T = 30.2 seconds.
This does
> not really resemble the "path hunting"
pattern Geoff and Tony
> refer to.
First, that is *exactly* path hunting.
You're assuming that 5 implements MRAI and doing so fairly
precisely. If 5 does not implement MRAI, you'd see exactly
the set
of updates in Geoff's diagram.
> My understanding of what Geoff writes is that
"path hunting"
> occurs anyway, and is exacerbated in variations in the
time
> constant of the Minimum Route Advertisement Interval
(MRAI)
> Timer. I am not sure if these are small variations of
fractions
> of a second from the default 30 seconds, or some other
default
> or configured values such as 20 to 50 seconds. I don't
see how
> various values of MRAI in different routers would
affect my
> explanation of what happens in the situation depicted
by Geoff's
> diagram, other than to change the delay time to 20.2 or
50.2
> seconds, for instance.
You just need a slightly richer topology to create more
interesting
race conditions.
> II guess this would need to replace the MRAI timer's
function, or
> operate with longer time constants than that timer, but
I don't
> think Tony's I-D mentions this.
The assumption was that this would subsume MRAI.
> This sounds like a great idea. I understand that when
applied
> to Geoff's diagram, here is what would happen, assuming
for the
> moment that there was no MRAI timer:
>
> T = 0.1 seconds
>
> (2) sends withdrawals to (5) and (3).
>
>
> T = 0.2 seconds.
>
> (5) gets (2)'s withdrawal and does not announce
anything,
> since its best path "3, 2, 1" is now
longer than before,
> and Tony's (Geoff's?) PLD algorithm is invoked.
Exact ownership of the algorithm isn't clear to me. It's
been kicked
around a bit and I thought I heard of it from someone, but I
checked
with them and they denied it. In any case, I'll explicitly
relinquish any claim to it. I'm much more interested in
getting it
deployed...
> So with this PLD algorithm, as I understand it, there
is only
> one message sent by (5), and it is the correct one,
sent at 0.4
> seconds, not at 30.2 seconds as with current
arrangements.
Sounds right.
> Tony, you co-wrote the RFC - what do you think?
Co-edited and really just for 1771.
As always, I think that the RFC is a fine goal, and that
implementors
frequently take short cuts that may or may not be warranted.
I'm not
throwing rocks at them because frankly, many of those rocks
would end
up back in my lap. As I noted earlier, implementing MRAI
per the
letter of the spec is non-trivial, and I sympathize with
those who
decided not to implement it. On the flip side, I'm also
leery of
having a system that has no built-in protections against
persistent,
rapid oscillation. If there are no protections built in,
then some
poor network operator will be awakened at 4am with a router
that is
melting down under the update load. I've done that too and
wouldn't
wish that on anyone. Thus, I think we have a responsibility
to
provide a better set of mechanisms.
Regards,
Tony
_______________________________________________
Idr mailing list
Idr ietf.org
https://ww
w1.ietf.org/mailman/listinfo/idr
|
|
| Re: BGP path hunting, MRAI timer and
Path Length Damping |
  Australia |
2007-06-22 03:27:44 |
Hi Tony,
Thanks for your response. My knowledge of BGP is shallower
than
probably everyone on this list. I hope to understand BGP
and
the behavior of the BGP system better. Maybe I can
contribute
some useful ideas or provoke helpful discussion.
I am glad to have worked out those examples and found what
you
define as "path hunting":
>> In my understanding, there is no strung out
sequence of longer
>> and longer paths being announced by 5, as Geoff's
diagram
>> depicts. Just the first announcement at T = 0.1s
of a longer
>> path, followed by a withdrawal at T = 30.2 seconds.
This does
>> not really resemble the "path hunting"
pattern Geoff and Tony
>> refer to.
>
> First, that is *exactly* path hunting.
with the qualification that actual behavior varies a lot due
to
timing variations, topology, router implementation details,
configuration etc.
I also understand that the term "path hunting"
extends to the
broader patterns which resemble reverberation in acoustics -
of
series of typically longer-path announcements for a prefix,
often at about 30 second intervals, which are later followed
by
a withdrawal.
Your I-D also states that "path hunting" occurs
when a prefix is
first advertised, but that seems pretty natural and hard to
prevent. I don't think this would be as destructive as the
MRAI
timer holding back a withdrawal for 30 seconds just after an
AA+
has been sent.
"Path hunting" after a first advertisement is a
consequence of
the path-vector protocol. Maybe your proposed Path Length
Damping would be marginally effective at reducing the
propagation of sub-optimal "best paths" sometimes
before the
final, stable, optimal "best path", by expediting
the
announcements of shorter paths and withdrawals, and
delaying
longer path announcements.
> The assumption was that this would subsume MRAI.
I am reading the second version of your ID now:
htt
p://tools.ietf.org/html/draft-li-bgp-stability
and I see Geoff is a co-author and that you have a section
"Deprecate MRAI".
Perhaps you could add to the "Path Hunting"
section the history
of the MRAI timer's changing specification - as discussed in
the
last day or so - and the statement from 2002 that few
router
vendors implemented the RFC 1771 version of it.
Does the apparent failure of Flap Damping coincide with the
general change from (I guess) late 1990s MRAI compliance
with
RFC 1771 to the problematic RFC 4271 mode of also stopping
withdrawals?
I think it would also be helpful to add minimal example of
the
mechanism, as I tried to work it out for single and
multiple
systems. All that is needed is:
(3) (6)
/ /
/ / Upstream
(1)--X--(2)---(4)---(5)---(7)---> routers
to show how it generates a prompt AA+, a 30 second delayed
A++
and a 60 second delayed withdrawal. Also, I could show how
the
new algorithm would result in a much better outcome.
I can write a concise account of this, for an appendix of
your
I-D if you like. I would base it on the exact timings as I
did
before, but add the qualification that in reality, timings
would
not be so exact, but the behavior would be broadly similar.
>From the 01 version of your and Geoff's I-D:
(39% of activity involves a pair of updates inside 31
seconds.)
Only 28% of the updates for the month are not part of
any
sequence, while 26% of updates are part of a coupled
update
pair, and 46% of updates are part of sequences of 3 or
more
updates.
So more than 2/3 of activity is seems clearly to be
associated
with delays caused by the MRAI timer in a small triangle of
routers/ASes, or in more complex mesh structures. Getting
rid
of that mechanism would greatly reduce the burden of
messages
and speed up the propagation of withdrawals. Sequences of
multiple messages might be replaced by a single message,
which
is more informative, and propagates quickly.
> Exact ownership of the algorithm isn't clear to me.
It's been
> kicked around a bit and I thought I heard of it from
someone,
> but I checked with them and they denied it. In any
case, I'll
> explicitly relinquish any claim to it. I'm much more
> interested in getting it deployed...
Now I understand from messages on this list that the MRAI
function is typically not implemented in a way which lets
through withdrawals quickly, and looking at Geoff's
analysis, it
seems to me that some changes as you suggest would make a
really
significant reduction in message volumes and convergence
time.
> As I noted earlier, implementing MRAI per the letter
of
> the spec is non-trivial, and I sympathize with those
who
> decided not to implement it. On the flip side, I'm
also
> leery of having a system that has no built-in
protections
> against persistent, rapid oscillation. If there are
no
> protections built in, then some poor network operator
will
> be awakened at 4am with a router that is melting down
under
> the update load. I've done that too and wouldn't wish
that
> on anyone.
I entirely agree, and unless something is done, the rising
tide
of prefixes and delayed, reverberated, unnecessary AA+
messages,
with the much delayed withdrawals, will swamp all existing
transit and multihomed border routers. All Internet users
collectively pay for running and replacing these routers.
> Thus, I think we have a responsibility to provide a
better set
> of mechanisms.
Yes, but it seems to me the old RFC 1771 had a much better
MRAI
specification than what router vendors were generally
implementing by 2002 or so, and better than is mandated by
RFC 4271.
Thanks John for the excerpt from the IDR list archives.
>> Date: Tue, 05 Mar 2002 10:50:02 -0800
>> From: Pedro Roque Marques <roque juniper.net>
>>
>> Alex Zinin writes:
>>
>> > Pedro,
>>
>> > I see your point. Note that when you are
sending a
>> > WITHDRAWN, you know it is a bad news.
>>
>> Alex,
>>
>> A link up event will often cause a router to send
withrawns
>> as it changes it's best path from one that it's
policy and/or
>> ibgp flooding rules allow it to advertise to one
that is
>> rejected.
>>
>> The inverse is true for a link down event...
I don't understand these statements enough to criticise
them.
>> I still maintain that it is not possible in BGP to
>> distinguish if an update corresponds to a link
up/down
>> event... or what information you wish to send
fast/slow.
I am a newbie to BGP, but the behavior of path length
hunting is
a major burden on all the routers, so I think that while
the
above statement may have some theoretical truth to it, the
practical situation is that delaying the withdrawals is
more
more trouble than it is worth.
>> I also believe that it is a major mistake to treat
>> advertisements differently than withrawns,
specially from a
>> queuing/timing point of view. The only thing you
are going to
>> get out of that is weird blackholes and more
instability...
This is written by someone who presumably knows far more
about
BGP than I do, but I don't see why this would be true.
>> AFAIK, no major vendor implements their
advertisement
>> rate-controller/queueing in a way as to threat
updates and
>> withrawns differently.
>>
>> Pedro.
Lets say we have 500,000 prefixes. Do we have a single
down
counter for each prefix? I guess so. As far as I know,
the
same announcements go to all peers.
This is a major challenge to implement. It is either a
massive
number of downcounters each with its own pointers to
whatever it
affects, or some kind of linked list of events to remember
in
the future, which could get very long, and would require
tricky
methods to take things out of and to incrementally chew
through
and take action upon as time passes by.
There would never be 500,000 timers active at once, so maybe
it
could be implemented by having a pool of downcounters, say a
few
thousand, which are used by the first instances which need
them
and which are returned to the pool once they time out or no
longer needed.
Maybe there could be 4,000 down counters and each instance
which
wanted another down counter when the 4,000 are all busy
would
have its details and a time-stamp put into a queue. That
would
function as a timer and from the point of view of each
instance.
(However, then clearing the timer means finding the
pointer in
he queue and setting it to zero.) The end of the timer
period
would be when its pointer appeared at the exit of the
queue.
The router wouldn't bother emptying that queue while it was
so
busy as to be running 4,000 MRAI-like down-counters, which
would
only be for brief periods.
Then there is the question of each router's message queue
to
each peer being throttled via TCP, with the software being
able
to find messages in the queue and delete them if something
else
happens after they were put in the queue.
It seems there is now general agreement that Path Length
Damping
is a worthwhile goal. This is a slightly fancier version of
the
old (RFC 1771) MRAI specification.
Yet by 2002, we are told that Cisco, Juniper et. al. had
given
up on following RFC 1771's strict instructions to exempt
withdrawals from the MRAI mechanism.
They presumably didn't do this lightly. I guess it was
either
that the cost of fully implementing RFC 1771's MRAI spec
was
either prohibitive, considering all the other things the CPU
is
trying to do - and/or that some engineers had already
decided
that there was little or no value in treating withdrawals
differently from advertisements of feasible paths.
Now the cost of this simplification in implementations -
now
mandated by the RFC 4271 - is either unacceptable or so
costly
that the simple MRAI function needs to be replaced by
something
fussier.
CPUs and memories of newer routers have become faster, so
maybe
it is easier now to have the more complex MRAI-like timer.
On
the other hand, there are now far more prefixes and maybe
more
peers it has to be implemented for.
Implementing a more complex MRAI-like timer in new firmware
upgrades to older routers may be prohibitive.
So I guess the trick is to find the simplest solution which
will
solve the problem.
Here are my thoughts on the various algorithms:
RFC 1771 MRAI
-------------
Data: "most recently sent update"
"most recent update not yet announced"
Timer down counter
[ Is update a
[ withdrawal? Yes: Announce immediately and clear the
counter.
[ Write it to "most recently sent
update".
[
[ No:
Counter set? No: Announce update, save it
as
"most recently sent
update"
and set timer to count
down.
Yes: Store update in
"most
recent update not yet
announced".
When timer
counts down: If "most recent update not yet
announced"
differs from "most recently sent
update", then
announce "most recently received
update not
yet announced" and write this to
"most
recently sent update".
RFC 4271 MRAI
-------------
This is the same as the above, minus the first section
marked
with '['.
This is simpler logic, but I don't see that it involves any
less
state than the RFC 1771 version.
Path Length Damping
-------------------
Data: "most recently sent update"
"most recent update not yet announced"
Integer: "previously announced update
length"
Timer down counter
Update: (Treat withdrawal as having zero length.)
Compare this update's length with
"previously
announced update length".
Write this update's length to "previously
announced
update length".
Was new update longer?
Longer: Write it to "most recent update
not
yet announced". Set timer to
count down.
Same or
shorter: Send update, write it to "most
recently
sent update" and clear counter.
When timer
counts down: Announce "most recently received
update not
yet announced" and write this to
"most
recently sent update".
This requires storing an extra 16 bit integer for each
prefix on
each interface. I can't think of a simpler solution.
I don't know enough about routers to estimate the real
implementation costs of these, but I don't see why there
was
such resistance to implementing the RFC 1771 MRAI timer.
It would be interesting to know what the difficulty was -
because those difficulties may still be important and might
prevent the adoption of Path Length Damping or some other
mechanisms as your I-D suggests.
By the way, I have a web page pointer to this discussion,
which
started on the RRG list and moved to here at the IDR list:
http://www.firstpr.com.au/ip/sram-ip-forwa
rding/#BGP_hunting_MRAI_disc
and am trying to maintain a set of papers and proposals
regarding improvements to BGP, and analysis of its
problems:
http://www.firstpr.com.au/ip/sram-ip-forwarding
/#BGP_improvements
The list starts with this I-D.
- Robin http://www.firstpr
.com.au/ip/ivip/
_______________________________________________
Idr mailing list
Idr ietf.org
https://ww
w1.ietf.org/mailman/listinfo/idr
|
|
[1-2]
|
|