The Poretsky Family | 3 Jun 2006 03:25

Re: FW: Proposal on Protection Benchmarking

Thanks Curtis.  The author team will begin working through your very
worthwhile comments.

Scott
----- Original Message -----
From: "Curtis Villamizar" <curtis <at> occnc.com>
To: <bmwg <at> ietf.org>
Sent: Friday, June 02, 2006 10:32 AM
Subject: Re: FW: [bmwg] Proposal on Protection Benchmarking

>
> At 03:30 PM 5/3/2006, Al Morton wrote:
> >BMWG,
> >
> >The proponents of the Protection Benchmarking Work Proposal have
> >prepared the description of this work effort, below.
> >
> >BMWG discussed this work at the Dallas IETF-65 session, where there was
> >strong support and involved membership (see meeting minutes).
> >
> >Please weigh-in on whether this topic should become part of BMWG's
> >chartered work, by
> >
> >                    June 2, 2006
> >
> >And, if you support the work, please say:
> >
> >+  HOW you intend to support the development in BMWG,
> >    (by reviewing draft X by MM/DD/YY, for example),  or
> >
> >+  WHY this work will be beneficial to BMWG's user community, or
> >
> >+  Modifications that would make the proposal more useful
> >    (which we will discuss on the list), and
> >
> >+  (anything else that's constructive)
> >
> >And remember, we'd like to hear your opinion on the list, even if you
> >spoke in favor of this proposal at the meeting.
> >
> >Thanks in advance for your efforts and commitment to BMWG!
> >
> >Al
> >bmwg co-chair
>
> [ ... snip ... ]
>
> Comments below inline.  The issues are all fixable and these are good
> BMWG work group work items IMHO.
>
> I think this falls under "anything else that's constructive".  I have
> not provided replacement text but if there is agreement on all or
> part of the comments below, I could do so.
>
> Curtis
>
>
> ------------------------------------------------------------
>
>   draft-poretsky-protection-term-01.txt
>
> Terminology doesn't match what is normally used.
>
> btw - Not sure why you can't say MPLS instead of sub-IP when you mean
> MPLS or maybe MPLS plus GMPLS.
>
> PHP might violate the definition of "Path" ever so slightly.
>
> You might want to define "Tunnel" as the collection of related Paths
> (LSP in MPLS terminology).  A Tunnel is used to carry a specific flow
> of traffic which is generally a very large aggregation of microflows
> but may be any flow defined by a classifier at the ingress.
> A Tunnel may include two primary Paths during MPLS make-before-break
> reroute and one or both may have a backup Path during transition.
>
> A backup path is always computed before the failover event.  A new path
> computed after the failover event is simply a reroute of the primary
> path.
>
> A backup path may be signaled or unsignaled.  If it is unsignaled it
> has been computed but has not been signaled, saving some time in
> restoration.  Juniper called unsignaled backup paths "standby" and the
> name stuck (Avici at least uses the same name though the feature works
> a little differently).  This is opposite of the standby and dynamic
> terminology in the draft.
>
> A pair of paths are "disjoint" if they do not share a common link.  A
> path segment may be one or more hops (which means you have to define
> hop).  Paths that protect a segment of a path may merge beyond the
> segment being protected and are considered disjoint if they do not use
> a link from the set of links in the protected segment.  A path is node
> disjoint if it does not share a common node other than the ingress and
> egress.  A node disjoint specification can be expressed as a link
> disjoint specification.
>
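The disjointness definitions above can be sketched in a few lines. This is only an illustration: it assumes each path is an ordered list of node names, that both paths share the same ingress and egress, and that links are undirected. The function names are hypothetical and do not come from either draft.

```python
def links(path):
    """Return the set of undirected links (node pairs) along a path."""
    return {frozenset(hop) for hop in zip(path, path[1:])}


def link_disjoint(a, b):
    """True if the two paths do not share a common link."""
    return not (links(a) & links(b))


def node_disjoint(a, b):
    """True if the paths share no node other than the ingress and egress.

    Assumes both paths start and end at the same nodes.
    """
    shared = (set(a) & set(b)) - {a[0], a[-1]}
    return not shared


# Hypothetical topology: both paths run from A to D and cross at node C.
primary = ["A", "B", "C", "D"]
backup = ["A", "E", "C", "F", "D"]
```

Here `link_disjoint(primary, backup)` is true while `node_disjoint(primary, backup)` is false, because the two paths meet at node C without sharing a link.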
> A shared risk link group (SRLG) is a set of links which are likely to
> fail concurrently due to sharing a physical resource (same fiber using
> wdm for example).  If SRLGs are considered, then the set of links that
> must be avoided for paths to be considered disjoint includes those
> links on the path or path segment being protected plus any that share
> a common SRLG with them.
>
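A rough sketch of the SRLG rule above, assuming links are identified by strings and SRLG membership is given as a mapping from a group name to the set of links in that group. The names `links_to_avoid` and `srlgs`, and the example groups, are hypothetical.

```python
def links_to_avoid(protected_links, srlgs):
    """Protected links plus every link that shares an SRLG with one of them."""
    protected = set(protected_links)
    avoid = set(protected)
    for members in srlgs.values():
        # A group matters only if it contains at least one protected link.
        if members & protected:
            avoid |= members
    return avoid


# Hypothetical SRLGs: links riding the same fiber fail together.
srlgs = {
    "fiber-1": {"A-B", "E-F"},
    "fiber-2": {"C-D", "G-H"},
}
avoid = links_to_avoid(["A-B"], srlgs)
```

With these inputs `avoid` contains "A-B" and "E-F", so a backup path using E-F would not count as disjoint from a primary path using A-B.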
> Failover may be at the point of local repair (PLR - MPLS FRR term) or
> at the ingress.  If failover is at the ingress it is generally on a
> disjoint path from ingress to egress.  If failover is at a PLR it will
> use MPLS FRR which has two flavors, 1-to-1 and facility (aka detour
> and bypass).  You should just import the FRR terminology and note that
> the terms detour and bypass are commonly used.
>
> The link/node/path protection terminology doesn't cover the above
> adequately.
>
> The only major omission is terms to describe the type of failure.
>
>   A failure may be a node failure or link failure.
>
>   One of the following may be true.
>
>     A failure may be completely isolated (single link failure).
>
>     A failure may affect a set of links which share a single SRLG (for
>     example, a multiple interface line card may fail; a physical link
>     with sublinks, such as channelized, switched service, or ethernet
>     VLAN, may fail; a common transport resource, such as wavelengths
>     on the same fiber or common transport equipment, may fail; or
>     common power may fail).
>
>     A failure may affect multiple links that are not covered by any
>     common SRLG.
>
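The three link-failure cases above could be told apart mechanically along these lines. This is a sketch under assumptions: it covers link failures only (not node failures), it reuses the terms "single logical link", "single SRLG", and "unexpected correlated failure" mentioned in this message, and the function name and data layout are hypothetical.

```python
def classify_failure(failed_links, srlgs):
    """Classify a set of concurrently failed links.

    srlgs maps an SRLG name to the set of links in that group.
    """
    failed = set(failed_links)
    if not failed:
        raise ValueError("no failed links given")
    if len(failed) == 1:
        return "single logical link"
    # All failed links are covered by one common SRLG.
    if any(failed <= members for members in srlgs.values()):
        return "single SRLG"
    return "unexpected correlated failure"


# Hypothetical SRLG: three links on one line card.
srlgs = {"linecard-3": {"A-B", "A-C", "A-D"}}
```

For example, `classify_failure({"A-B", "A-C"}, srlgs)` yields "single SRLG", while `classify_failure({"A-B", "X-Y"}, srlgs)` yields "unexpected correlated failure".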
> You can try to find or think up terms for the above since terminology
> varies.  Single logical link, single SRLG, and unexpected correlated
> failure are terms commonly used but if you come up with something
> better it wouldn't hurt.  It is very important to test for unexpected
> correlated failures since these do quite often occur and very long
> restoration time can occur with some equipment.
>
> Note that "Restoration" is used more often than "Failure Recovery"
> meaning that service is restored either completely or partially.
> Often interim restoration of IP service using FRR experiences
> congestion but reroute of primary paths avoids that congestion.  This is a
> two step service restoration.  Occasionally restored service
> experiences congestion after the primary paths are rerouted.
>
> Another useful metric is the quality of traffic layout after a
> failover.  This is very difficult to measure quantitatively and is
> affected by all of the nodes in a network which make path
> determinations.  One quantitative measure is the worst loaded link in
> the resulting traffic layout but it is by no means the only measure
> and may not be the best.
>
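The worst loaded link measure mentioned above might be computed as follows. This is a minimal sketch that assumes per-link offered load and capacity are known after the failover; the function name, the dictionaries, and the example numbers are hypothetical.

```python
def worst_loaded_link(load_bps, capacity_bps):
    """Return (link, utilization) for the most heavily utilized link."""
    utilization = {
        link: load_bps[link] / capacity_bps[link] for link in load_bps
    }
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


# Hypothetical post-failover traffic layout on two 10 Gb/s links.
load = {"A-B": 4e9, "B-C": 9e9}
capacity = {"A-B": 10e9, "B-C": 10e9}
link, util = worst_loaded_link(load, capacity)
```

In this layout the worst loaded link is B-C at 90% utilization. As noted above, this is only one possible measure of layout quality.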
> An unavoidable problem in any restoration is the discontinuity in
> end-to-end delay when the primary and backup path delays differ
> significantly.
> If the backup path has a shorter delay out of order delivery may occur
> if restoration is fast.  If the backup path is longer then a sudden
> increase in delay will occur which can affect real time applications
> which use playback buffers to remove limited jitter.
>
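The two delay cases above can be written down as a simple check. This is a deliberately simplified sketch: the playout buffer threshold is an assumption about how a real-time application would react, and the function name and parameters are hypothetical.

```python
def delay_step_effects(primary_ms, backup_ms, playout_buffer_ms):
    """Summarize the effect of the delay discontinuity at failover."""
    step = backup_ms - primary_ms
    return {
        "delay_step_ms": step,
        # A shorter backup path lets later packets overtake earlier ones.
        "reordering_possible": step < 0,
        # Assumption: a delay jump larger than the buffer drains it.
        "playout_underrun_possible": step > playout_buffer_ms,
    }


# Hypothetical delays: 20 ms primary, 65 ms backup, 40 ms playout buffer.
longer = delay_step_effects(20, 65, 40)
shorter = delay_step_effects(65, 20, 40)
```

With these numbers `longer` reports a 45 ms delay step that exceeds the 40 ms buffer, and `shorter` reports that out of order delivery is possible.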
> ------------------------------------------------------------
>
>   draft-poretsky-mpls-protection-meth-05.txt
>
> Some terminology missing in draft-poretsky-protection-term-01.txt may
> be implicitly defined here.
>
> These are extremely minimal tests and a note on what is *NOT* covered
> should be made in Section 1 "Introduction".  Those cases could be
> covered in later work so as not to hold this up.  Incomplete tests are
> better than nothing.
>
> Note that unexpected correlated failures are not covered by these
> tests.  This type of failure requires a new path computation, and a new
> path must be signaled.  Also not covered is the reroute of the primary
> path which in many real world cases restores relatively uncongested
> service which the interim restoration provided by FRR does not do.
>
> The FRR Scalability section is good to have (5.3).  Delay is often
> considerably longer for hundreds of protected paths than for one.
> This is regardless of the use of detour or bypass FRR since the
> limiting factor is changing the insegments unless a two stage
> insegment hardware lookup is used (this detail is fyi only).
>
> For failure types and restoration that require path recomputation,
> the speed of path recomputation is dependent on the complexity of the
> IGP topology, so a similar scalability section would be needed.
>
> The microflow diversity mentioned in the Vapiwala and Karthik draft
> could be moved to this draft as another possible scalability
> consideration.  Microflow diversity has been known to affect some
> architectures (though I'm not sure it affects any still in business).
>
> ------------------------------------------------------------
>
>   merge:  draft-vapiwala-bmwg-frr-failover-meth-00.txt
>
> This draft seems to add a lot of test cases but is essentially more of
> the same in draft-poretsky-mpls-protection-meth-05.txt.  Maybe they
> can remain separate but related with Poretsky et al. describing
> minimal test cases and Vapiwala and Karthik describing an expanded set
> of test cases.  The two could reference each other and initially
> advance together but later the expanded set of test cases might
> further expand independently.
>
> Sending 3 traffic streams is almost silly considering at least
> thousands if not millions will be encountered in the field.
>
> Note that FRR is generally not affected by the number of nodes or
> links advertised in the IGP.
>
> The number of tunnels and number of tunnels affected by the failover
> is very significant.
>
> Useful parameters for ingress are total number of prefixes and total
> number of affected prefixes.  Loss to prefixes that are not affected
> should be checked.  Past architectures did lose traffic to prefixes
> not affected by route change due to an ill-conceived cache
> architecture so it's worth measuring and reporting.  Useful metrics are
> percentage of traffic lost over time (as routes are installed), total
> number of prefixes affected over time, and total number of microflows
> affected over time.  Convergence is not instantaneous at ingress for
> most architectures.  If hardware entries for individual prefixes have
> to be changed the restoration is gradual.  If there is a two stage
> lookup at ingress (prefix to tunnel, tunnel to LSP/inseg) then
> restoration can be an atomic operation (complete FRR restoration
> happens all at once).
>
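The difference between gradual and atomic restoration at ingress can be illustrated with a toy model. It assumes hardware entries are rewritten one at a time at a fixed cost per entry, and that with a two stage lookup a single entry covers all affected prefixes. The function name, parameters, and numbers are hypothetical, not measurements.

```python
def unrestored_fraction(t_ms, n_affected, update_ms, two_stage):
    """Fraction of affected prefixes still unrestored t_ms after failover."""
    if two_stage:
        # One tunnel-to-LSP entry is rewritten; all prefixes switch together.
        return 0.0 if t_ms >= update_ms else 1.0
    # Per-prefix entries are rewritten one after another.
    restored = min(n_affected, int(t_ms // update_ms))
    return 1.0 - restored / n_affected


# Hypothetical case: 1000 affected prefixes, 1 ms per hardware update.
gradual = unrestored_fraction(250, 1000, 1, two_stage=False)
atomic = unrestored_fraction(250, 1000, 1, two_stage=True)
```

At 250 ms the per-prefix model still has 75% of the affected prefixes unrestored, while the two stage model has restored all of them. Sampling this over time gives the loss curve suggested above as a metric.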
> ------------------------------------------------------------
>
> Another draft should cover unexpected correlated failures and primary
> path rerouting.  From a practical standpoint these are very important
> to providers and too often overlooked in testing since they are more
> difficult to test.
>
> _______________________________________________
> bmwg mailing list
> bmwg <at> ietf.org
> https://www1.ietf.org/mailman/listinfo/bmwg
>
