Joel M. Halpern | 14 Oct 2004 16:26

Re: Re: FW: I-D ACTION:draft-bhatia-ecmp-routes-in-bgp-01.txt

BGP currently has no mechanism for carrying two AS_PATHs in the same UPDATE.
The ECMP mechanism does permit that.
However, when an ECMP participant that is using multiple paths speaks to a
non-ECMP router (one that does not support the extension), it must advertise
an AS_PATH that can be used to avoid loops.
That advertisement is where the synthesized AS_SET is required.
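
As a rough illustration only: the sketch below assumes the position-wise merging shown in the example quoted further down (paths "1 2" and "3 4" become "[1 3] [2 4]"). The helper names are hypothetical and this is not the draft's normative procedure (see its Sec. 16.1 for that).

```python
from itertools import zip_longest

def synthesize_as_path(paths):
    """Merge AS_SEQUENCE-only paths position by position into AS_SETs.

    Assumption: each input path is a plain list of AS numbers.
    """
    synthetic = []
    for column in zip_longest(*paths):
        # Collect the distinct ASes seen at this position across all paths.
        members = sorted({asn for asn in column if asn is not None})
        synthetic.append(members)
    return synthetic

def would_loop(synthetic, local_as):
    """Loop check a non-ECMP speaker could run: is my AS anywhere in the path?"""
    return any(local_as in segment for segment in synthetic)

synthetic = synthesize_as_path([[1, 2], [3, 4]])
print(synthetic)                 # [[1, 3], [2, 4]]
print(would_loop(synthetic, 4))  # True
```

Every AS from every contributing path appears in the result, so the receiver's ordinary loop detection still works.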

Yours,
Joel M. Halpern

At 02:11 PM 10/14/2004 +0000, john smith wrote:
>I would also like to know why the draft cannot keep things simple and 
>propagate all the paths as AS_PATHs instead of in the proposed AS_SET manner?
>
>Either way, you are assuming that the presence of the ECMP_NEXT_HOP 
>suffices to differentiate between the 2 cases, i.e. vanilla BGP or ECMP_CAP BGP.
>
>-JS
>
>>From: "ephim  era" <ephemera6380 <at> rediffmail.com>
>>Reply-To: ephim  era <ephemera6380 <at> rediffmail.com>
>>To: idr <at> ietf.org
>>Subject: Re: Re: [Idr] FW: I-D ACTION:draft-bhatia-ecmp-routes-in-bgp-01.txt
>>Date: 14 Oct 2004 11:35:34 -0000
>>
>>   Hi All,
>>        I have a question. When OSPF has ECMP routes and we 
>> redistribute OSPF routes into BGP, should BGP get all 3 prefixes with 
>> different next hops?
>>
>>Thanks in advance,
>>Ephim
>>
>>
>>On Thu, 23 Sep 2004 Manav Bhatia wrote :
>> >Jeff,
>> >
>> > > WD_NLRI: NULL
>> > > Path Attributes:
>> > >   + Origin - IGP
>> > >   + AS_PATH - 65535
>> > >   + NEXT_HOP - 192.168.1.1
>> > >   + MP_REACH_NLRI
>> > >     o AFI - 1 (ipv4)
>> > >     o SAFI - 1 (unicast)
>> > >     o MP_NEXTHOP - 192.168.1.2
>> > >     o MP_NLRI - 10.0.1/24
>> > > NLRI: 10.0.0/24
>> > >
>> > > If one wanted to insert an IPv4/Unicast ECMP nexthop set here, which
>> > > NLRI set would it apply to?
>> >
>> >Like the other path attributes (ORIGIN, AS_PATH, etc.), ECMP_NEXT_HOP
>> >applies to all the NLRIs listed in an UPDATE message. If in this case you
>> >had an additional next-hop for only 10.0.0/24, you could announce this
>> >information in the following two ways.
>> >
>> >Let the additional NEXT_HOP be 192.168.1.3
>> >
>> >(i) In addition to the above UPDATE message you could announce another one
>> >containing
>> >
>> >WD_NLRI: NULL
>> >Path Attributes:
>> >   + Origin - IGP
>> >   + AS_PATH - 65535
>> >   + ECMP_NEXT_HOP
>> >     o AFI - 1 (ipv4)
>> >     o SAFI - 1 (unicast)
>> >     o NUM NEXTHOPs - 1
>> >     o LENGTH - 4
>> >     o MP_NEXTHOP - 192.168.1.3
>> >NLRI: 10.0.0/24
>> >
>> >The IBGP peer, upon receiving this UPDATE message, will know that it needs to
>> >append this route rather than replace the existing one.
>> >
>> >(ii) You could announce the following two UPDATE messages.
>> >
>> >WD_NLRI: NULL
>> >Path Attributes:
>> >   + Origin - IGP
>> >   + AS_PATH - 65535
>> >   + NEXT_HOP - 192.168.1.1
>> >   + ECMP_NEXT_HOP
>> >     o AFI - 1 (ipv4)
>> >     o SAFI - 1 (unicast)
>> >     o NUM NEXTHOPs - 1
>> >     o LENGTH - 4
>> >     o MP_NEXTHOP - 192.168.1.3
>> >NLRI: 10.0.0/24
>> >
>> >and
>> >
>> >WD_NLRI: NULL
>> >Path Attributes:
>> >   + Origin - IGP
>> >   + AS_PATH - 65535
>> >   + MP_REACH_NLRI
>> >     o AFI - 1 (ipv4)
>> >     o SAFI - 1 (unicast)
>> >     o MP_NEXTHOP - 192.168.1.2
>> >     o MP_NLRI - 10.0.1/24
>> >
>> >How you advertise the routes depends upon the time (apart from your
>> >implementation) at which the information about this additional NEXT_HOP is
>> >known to the originating router.
>> >
>> >[..]
>> > > > Put the NLRI in MP_REACH_NLRI. Set the length and the address of the MP
>> > > > nexthop as Zero. Put this MP nexthop info in ECMP_NEXT_HOP and advertise
>> > > > this to your IBGP peer. It will append this route, as it has been
>> > > > advertised with ECMP_NEXT_HOP.
>> > >
>> > > I think what makes me uncomfortable with this mechanism is implicitly
>> > > using the normal NEXT_HOP as a reset mechanism.  Just from a coding
>> > > standpoint, I'd rather see either no normal nexthop and a set of
>> > > ECMP nexthops which always have an implicit withdrawal behavior
>> > > or no ECMP nexthops at all.
>> >
>> >Could you elaborate on why using the ordinary NEXT_HOP for implicit
>> >withdrawals makes you feel uneasy? We think having such a mechanism in place
>> >makes it easier to understand the new capability (everybody is familiar with
>> >how implicit withdrawals work) and introduces no additional burden on the
>> >implementation. The way it works is that any time you receive an UPDATE
>> >message with ECMP_NEXT_HOP, you know that the route needs to be appended and
>> >not replaced!
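
A toy sketch of the receive-side behaviour described above, under the assumption that a RIB entry is just a list of next hops per prefix. The `apply_update` helper and the dictionary layout are hypothetical, not taken from the draft or from any implementation.

```python
def apply_update(rib, prefix, next_hop=None, ecmp_next_hops=()):
    """Apply one UPDATE to a toy RIB mapping prefix -> list of next hops."""
    if next_hop is not None:
        # Ordinary NEXT_HOP: implicit withdraw, earlier next hops are replaced.
        rib[prefix] = [next_hop]
    for hop in ecmp_next_hops:
        # ECMP_NEXT_HOP: append to whatever is already installed.
        hops = rib.setdefault(prefix, [])
        if hop not in hops:
            hops.append(hop)
    return rib

rib = {}
apply_update(rib, "10.0.0/24", next_hop="192.168.1.1")
apply_update(rib, "10.0.0/24", ecmp_next_hops=["192.168.1.3"])
print(rib)  # {'10.0.0/24': ['192.168.1.1', '192.168.1.3']}
```

A later UPDATE carrying only an ordinary NEXT_HOP would reset the list to that single next hop, which is the implicit reset being debated here.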
>> >
>> >Let me explain what I think you are suggesting. Correct me if I am wrong.
>> >
>> >Let's assume we have two next-hops N1 and N2 for an NLRI. Say after some
>> >point, we get another one N3. You suggest that we should now announce
>> >ECMP_NEXT_HOP N1, N2 and N3. That way, whenever you receive a new UPDATE,
>> >you always treat that as an implicit withdraw, the way BGP works now. Is
>> >that it?
>> >
>> >I would prefer the former approach because there can be cases (L3 VPN NLRIs,
>> >etc.) where the NLRI may be clubbed with some information (labels, etc.) that
>> >may make it non-unique. In such cases, it's much easier to use the mechanism
>> >that we have described. This'll become clearer when I explain how ECMP can
>> >work with 3107.
>> >
>> > >
>> > > > IMO a PE can do load splitting in whatever manner it wants to. It doesn't
>> > > > need to inform the other PE that there is more than one next-hop available
>> > > > to reach a destination. Basically it's something similar to what we have
>> > > > done for EBGP peers.
>> > >
>> > > Yes, it could do whatever it wanted to.  However, RFC 3107 implicitly
>> > > expects that you only get one nexthop.  If you get more than one,
>> > > this will potentially affect the behavior.  The change in behavior
>> > > should be discussed somewhere - and I think that your draft is probably
>> > > the right place to do that.
>> > >
>> >
>> >I may be missing something, but isn't this more of an
>> >implementation-specific issue?
>> >
>> >Anyway, there isn't anything in this draft that precludes this.
>> >
>> >Assume that a PE router has two next-hops to reach some NLRI x.y.z.w. Let L1
>> >and L2 be the labels associated with these next-hops. It could send this
>> >information in the following way:
>> >
>> >WD_NLRI: NULL
>> >Path Attributes:
>> >   + Origin - IGP
>> >   + AS_PATH - 65535
>> >   + MP_REACH_NLRI
>> >     o AFI - IPv4
>> >     o SAFI - VPN
>> >     o MP_NEXTHOP - 0::192.168.1.2
>> >     o MP_NLRI - L1:RD:x.y.z.w
>> >
>> >and
>> >
>> >WD_NLRI: NULL
>> >Path Attributes:
>> >   + Origin - IGP
>> >   + AS_PATH - 65535
>> >   + MP_REACH_NLRI
>> >     o AFI - IPv4
>> >     o SAFI - VPN
>> >     o MP_NEXTHOP - 0
>> >     o MP_NLRI - L2:RD:x.y.z.w
>> >   + ECMP_NEXT_HOP
>> >     o AFI - IPv4
>> >     o SAFI - VPN
>> >     o NUM NEXTHOPs - 1
>> >     o LENGTH - 4
>> >     o MP_NEXTHOP - 0::192.168.1.3
>> >
>> > > > ]I'm most concerned about the synthesized AS_PATH.
>> > > > ] o In some circumstances, particularly with large AS_SETs, it will not
>> > > > ]   be possible to preserve path length.
>> > > >
>> > > > Could you give me an example to illustrate this?
>> > >
>> > > 10/8 NH a Path 1 2 [<255 ASes>]
>> > > 10/8 NH b Path 3 4 [<255 ASes>]
>> >
>> >I believe the above means that each AS_PATH contains 2 ASes in an AS_SEQ and
>> >one large AS_SET containing 255 ASes, as against each path having 257 ASes in
>> >an AS_SEQ.
>> >
>> > >
>> > > Resulting advertisement:
>> > > 10/8 ECMP NH a,b Path [1 3] [2 4] [<first part of first path/second path>]
>> > > [<second part of first path/second path>]
>> >
>> >Refer to Sec. 16.1 on how we construct the synthetic AS Paths in cases where
>> >the contributing AS Paths consist of AS_SETs.
>> >
>> >Resulting advertisement will be:
>> >10/8 ECMP NH a,b Path [1 3] [2 4] [<255 ASes from the first path> <255 ASes
>> >in the second path>]
>> >
>> >The overflow problem that will occur in constructing this extremely
>> >large AS_SET is also present in ordinary BGP, and yes, it is something that we
>> >need to work on!
>> >
>> > >
>> > > Note that this presumes that the set elements in the original
>> > > advertisement are disjoint and thus not prone to merging.
>> > >
>> > > The path is thus lengthened.
>> >
>> >Yes. If we have more than 255 ASes present in the AS_SET, then we may need
>> >to prepend a new segment of type AS_SET, which will increase the path
>> >length, as each AS_SET is counted as 1, irrespective of the number of ASes
>> >present in the set.
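
The counting argument above can be checked with a small sketch. It assumes the usual rule that an AS_SEQUENCE contributes one hop per AS, an AS_SET contributes one hop in total, and a segment holds at most 255 ASes. The tuple representation and helper names are hypothetical.

```python
MAX_SEGMENT_ASES = 255  # a path segment's length field is one octet

def split_as_set(ases):
    """Split an oversized AS_SET into several AS_SET segments of <= 255 ASes."""
    return [("SET", ases[i:i + MAX_SEGMENT_ASES])
            for i in range(0, len(ases), MAX_SEGMENT_ASES)]

def path_length(segments):
    """AS_SEQUENCE counts one per AS; each AS_SET counts as 1 regardless of size."""
    return sum(len(members) if kind == "SEQ" else 1
               for kind, members in segments)

# One contributing path: "1 2 [<255 ASes>]"
original = [("SEQ", [1, 2]), ("SET", list(range(1000, 1255)))]

# Synthetic path: [1 3] [2 4] plus the 510 disjoint ASes from both sets
merged_set = list(range(1000, 1255)) + list(range(2000, 2255))
synthetic = [("SET", [1, 3]), ("SET", [2, 4])] + split_as_set(merged_set)

print(path_length(original), path_length(synthetic))  # 3 4
```

The extra hop comes entirely from the second AS_SET segment needed to hold the 510 disjoint ASes.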
>> >
>> >However, this can be easily avoided by introducing a new path segment type
>> >AS_EXT_SET (value 3) which will have exactly the same semantics as the
>> >existing AS_SET, except that it won't be considered when counting the AS_PATH
>> >length.
>> >
>> > >
>> > > > ] o Processing ECMP routes into paths of equivalent length with many
>> > > > ]   AS_SETs will impact AS_PATH regular expression engines.
>> > > > ] o The set merging rules of BGP (9.2.2.2, draft 25) can result
>> > > > ]   in the AS_PATH being shortened.
>> > > >
>> > > > I am sorry, I didn't get this!
>> > >
>> > > As for the first:
>> > >
>> > > I get [1 3] [2 4] (rest)
>> > >
>> > > Previously I would have gotten 1 2 (rest) or 3 4 (rest).
>> > >
>> > > I can do a regular expression that says "Match 1 2 .*$" and prefer this
>> > > path.
>> >
>> >You really can't run your regular expression here that says "match 1 2 .*$"
>> >because now your traffic will be split across paths "1 2 (rest)" and "3 4
>> >(rest)". Moreover, this seems to be an implementation issue and I am sure
>> >vendors can find clever ways to do this.
>> >
>> > > Per the aggregation rules, one would typically not see sets such as
>> > > [1 3] [2 4] in the net.  An implementation would typically create
>> > > [1 2 3 4] as part of aggregation.  Generating separate sets is a deviation
>> > > from the specification, but not a mandatory one:
>> > >
>> > >             - for each pair of adjacent tuples in the aggregated
>> > >             AS_PATH, if both tuples have the same type, merge them
>> > >             together, as long as doing so will not cause a segment with
>> > >             length greater than 255 to be generated.
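
A minimal sketch of the merging rule quoted above, assuming segments are modelled as (type, members) tuples and that merging two AS_SETs means taking their union. The `merge_adjacent` name and the representation are hypothetical.

```python
def merge_adjacent(segments, limit=255):
    """Merge adjacent segments of the same type unless the result exceeds `limit`."""
    merged = []
    for kind, members in segments:
        if merged and merged[-1][0] == kind:
            previous = merged[-1][1]
            if kind == "SET":
                # Sets are unordered, so a merge is a de-duplicated union.
                combined = sorted(set(previous) | set(members))
            else:
                combined = previous + list(members)
            if len(combined) <= limit:
                merged[-1] = (kind, combined)
                continue
        merged.append((kind, list(members)))
    return merged

print(merge_adjacent([("SET", [1, 3]), ("SET", [2, 4])]))
# [('SET', [1, 2, 3, 4])]
```

An implementation that applies this step would collapse the synthetic "[1 3] [2 4]" into a single set, which is the shortening concern raised in this thread.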
>> > >
>> > > The multiple set element encoding doesn't break BGP, but it makes a
>> > > presumption that really isn't all that clear in -25.
>> >
>> >I am not sure if we can call this "deviating" from the base spec.
>> >Multiprotocol extensions did something similar for NEXT_HOP attribute.
>> >
>> > > That presumption
>> > > is that an implementation will not take adjacent set elements and
>> > > merge them.  Also, it presumes that an implementation won't choose
>> > > to clean up sets of the form:
>> > >
>> > > [1 2] [3 4] [3 4]
>> > >
>> > > where the original was:
>> > > 1 3 3
>> > > 2 4 4
>> > >
>> > > As by the aggregation rules, a given set element should exist within
>> > > the AS_PATH only once.
>> > >
>> > > I suspect it would be ... instructive to find out what various implementations
>> > > do with odd AS_PATHs as would be generated by the synthetic AS_PATH.
>> >
>> >Definitely.
>> >
>> >Cheers,
>> >Manav
>> >
>> >
>> >_______________________________________________
>> >Idr mailing list
>> >Idr <at> ietf.org
>> >https://www1.ietf.org/mailman/listinfo/idr
>>
>
>