3 Jun 2006 03:25
Re: FW: Proposal on Protection Benchmarking
The Poretsky Family <s.poretsky <at> verizon.net>
2006-06-03 01:25:16 GMT
Thanks Curtis. The author team will begin working through your very
worthwhile comments.

Scott

----- Original Message -----
From: "Curtis Villamizar" <curtis <at> occnc.com>
To: <bmwg <at> ietf.org>
Sent: Friday, June 02, 2006 10:32 AM
Subject: Re: FW: [bmwg] Proposal on Protection Benchmarking

>
> At 03:30 PM 5/3/2006, Al Morton wrote:
> >BMWG,
> >
> >The proponents of the Protection Benchmarking Work Proposal have
> >prepared the description of this work effort, below.
> >
> >BMWG discussed this work at the Dallas IETF-65 session, where there was
> >strong support and involved membership (see meeting minutes).
> >
> >Please weigh-in on whether this topic should become part of BMWG's
> >chartered work, by
> >
> >    June 2, 2006
> >
> >And, if you support the work, please say:
> >
> >+ HOW you intend to support the development in BMWG,
> >  (by reviewing draft X by MM/DD/YY, for example), or
> >
> >+ WHY this work will be beneficial to BMWG's user community, or
> >
> >+ Modifications that would make the proposal more useful
> >  (which we will discuss on the list), and
> >
> >+ (anything else that's constructive)
> >
> >And remember, we'd like to hear your opinion on the list, even if you
> >spoke in favor of this proposal at the meeting.
> >
> >Thanks in advance for your efforts and commitment to BMWG!
> >
> >Al
> >bmwg co-chair
>
> [ ... snip ... ]
>
> Comments below inline. The issues are all fixable and these are good
> BMWG work items IMHO.
>
> I think this falls under "anything else that's constructive". I have
> not provided replacement text but if there is agreement on all or
> parts of the comments below, I could do so.
>
> Curtis
>
>
> ------------------------------------------------------------
>
> draft-poretsky-protection-term-01.txt
>
> Terminology doesn't match what is normally used.
>
> btw - Not sure why you can't say MPLS instead of sub-IP when you mean
> MPLS or maybe MPLS plus GMPLS.
>
> PHP might violate the definition of "Path" ever so slightly.
>
> You might want to define "Tunnel" as the collection of related Paths
> (LSP in MPLS terminology). A Tunnel is used to carry a specific flow
> of traffic which is generally a very large aggregation of microflows
> but may be any flow defined by a classifier at the ingress.
> A Tunnel may include two primary Paths during MPLS make-before-break
> reroute and one or both may have a backup Path during transition.
>
> A backup path is always computed before the failover event. A new path
> computed after the failover event is simply a reroute of the primary
> path.
>
> A backup path may be signaled or unsignaled. If it is unsignaled it
> has been computed but has not been signaled, saving some time in
> restoration. Juniper called unsignaled backup paths "standby" and the
> name stuck (Avici at least uses the same name though the feature works
> a little differently). This is the opposite of the standby and dynamic
> terminology in the draft.
>
> A pair of paths is "disjoint" if they do not share a common link. A
> path segment may be one or more hops (which means you have to define
> hop). Paths that protect a segment of a path may merge beyond the
> segment being protected and are considered disjoint if they do not use
> a link from the set of links in the protected segment. A path is node
> disjoint if it does not share a common node other than the ingress and
> egress. A node disjoint specification can be expressed as a link
> disjoint specification.
>
> A shared risk link group (SRLG) is a set of links which are likely to
> fail concurrently due to sharing a physical resource (same fiber using
> WDM for example). If SRLGs are considered then the set of links to be
> avoided for a path to be considered disjoint includes those links on
> the path or path segment being protected plus any that share a common
> SRLG.
>
> Failover may be at the point of local repair (PLR - MPLS FRR term) or
> at the ingress.
> If failover is at the ingress it is generally on a
> disjoint path from ingress to egress. If failover is at a PLR it will
> use MPLS FRR which has two flavors, 1-to-1 and facility (aka detour
> and bypass). You should just import the FRR terminology and note that
> the terms detour and bypass are commonly used.
>
> The link/node/path protection terminology doesn't cover the above
> adequately.
>
> The only major omission is a set of terms to describe the type of
> failure.
>
> A failure may be a node failure or link failure.
>
> One of the following may be true.
>
>   A failure may be completely isolated (single link failure).
>
>   A failure may affect a set of links which share a single SRLG (for
>   example a multiple interface line card may fail, a physical link
>   with sublinks may fail such as channelized, switched service, or
>   Ethernet VLAN, or a common transport resource may be used such as
>   wavelengths on the same fiber or common transport equipment, or
>   common power may fail).
>
>   A failure may affect multiple links that are not covered by any
>   common SRLG.
>
> You can try to find or think up terms for the above since terminology
> varies. Single logical link, single SRLG, and unexpected correlated
> failure are terms commonly used but if you come up with something
> better it wouldn't hurt. It is very important to test for unexpected
> correlated failures since these do quite often occur and very long
> restoration times can occur with some equipment.
>
> Note that "Restoration" is used more often than "Failure Recovery",
> meaning that service is restored either completely or partially.
> Often interim restoration of IP service using FRR experiences
> congestion but reroute of primary paths avoids the congestion. This is
> a two step service restoration. Occasionally restored service
> experiences congestion after the primary paths are rerouted.
>
> Another useful metric is the quality of traffic layout after a
> failover.
> This is very difficult to measure quantitatively and is
> affected by all of the nodes in a network which make path
> determinations. One quantitative measure is the worst loaded link in
> the resulting traffic layout but it is by no means the only measure
> and may not be the best.
>
> An unavoidable problem in any restoration is the discontinuity in end to end
> delay when the primary and backup path delays differ significantly.
> If the backup path has a shorter delay, out-of-order delivery may occur
> if restoration is fast. If the backup path is longer then a sudden
> increase in delay will occur which can affect real time applications
> which use playback buffers to remove limited jitter.
>
> ------------------------------------------------------------
>
> draft-poretsky-mpls-protection-meth-05.txt
>
> Some terminology missing in draft-poretsky-protection-term-01.txt may
> be implicitly defined here.
>
> These are extremely minimal tests and a note on what is *NOT* covered
> should be made in Section 1 "Introduction". Those cases could be
> covered in later work so as not to hold this up. Incomplete tests are
> better than nothing.
>
> Note that unexpected correlated failures are not covered by these
> tests. This type of failure requires a new path computation and a new
> path must be signaled. Also not covered is the reroute of the primary
> path which in many real world cases restores relatively uncongested
> service which the interim restoration provided by FRR does not do.
>
> The FRR Scalability section is good to have (5.3). Delay is often
> considerably longer for hundreds of protected paths than for one.
> This is regardless of the use of detour or bypass FRR since the
> limiting factor is changing the insegments unless a two stage
> insegment hardware lookup is used (this detail is fyi only).
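
The "worst loaded link" measure mentioned in the terminology comments above can be sketched in a few lines. This is only an illustration of that one measure, not something taken from either draft: the link names, the per-link load and capacity dictionaries, and the function name are all hypothetical, and as noted above it is by no means the only measure of layout quality.

```python
from typing import Dict, Tuple


def worst_loaded_link(load_bps: Dict[str, float],
                      capacity_bps: Dict[str, float]) -> Tuple[str, float]:
    """Return (link, utilization) for the most heavily utilized link.

    Both dictionaries are hypothetical inputs keyed by link name: the
    offered load after failover and the link capacity. Links with no
    entry in load_bps are treated as carrying no traffic.
    """
    utilization = {
        link: load_bps.get(link, 0.0) / capacity
        for link, capacity in capacity_bps.items()
    }
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    capacity = {"A-B": 10e9, "B-C": 10e9, "A-C": 2.5e9}
    load_after_failover = {"A-B": 4e9, "B-C": 9e9, "A-C": 2e9}
    link, util = worst_loaded_link(load_after_failover, capacity)
    print(f"worst loaded link: {link} at {util:.0%}")
```

Comparing this value before the failure, after the interim FRR restoration, and after the primary paths are rerouted would give one number for each step of the two step restoration described above.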
>
> For failure types and restoration that require path recomputation,
> the speed of path recomputation is dependent on the complexity of the
> IGP topology and a similar scalability section would be needed.
>
> The microflow diversity mentioned in the Vapiwala and Karthik draft
> could be moved to this draft as another possible scalability
> consideration. Microflow diversity has been known to affect some
> architectures (though I'm not sure it affects any still in business).
>
> ------------------------------------------------------------
>
> merge: draft-vapiwala-bmwg-frr-failover-meth-00.txt
>
> This draft seems to add a lot of test cases but is essentially more of
> the same in draft-poretsky-mpls-protection-meth-05.txt. Maybe they
> can remain separate but related with Poretsky et al. describing
> minimal test cases and Vapiwala and Karthik describing an expanded set
> of test cases. The two could reference each other and initially
> advance together but later the expanded set of test cases might
> further expand independently.
>
> Sending 3 traffic streams is almost silly considering at least
> thousands if not millions will be encountered in the field.
>
> Note that FRR is generally not affected by the number of nodes or
> links advertised in the IGP.
>
> The number of tunnels and number of tunnels affected by the failover
> are very significant.
>
> Useful parameters for ingress are total number of prefixes and total
> number of affected prefixes. Loss to prefixes that are not affected
> should be checked. Past architectures did lose traffic to prefixes
> not affected by route change due to an ill conceived cache
> architecture so it's worth measuring and reporting. Useful metrics are
> percentage of traffic lost over time (as routes are installed), total
> number of prefixes affected over time, and total number of microflows
> affected over time. Convergence is not instantaneous at ingress for
> most architectures.
> If hardware entries for individual prefixes have
> to be changed, the restoration is gradual. If there is a two stage
> lookup at ingress (prefix to tunnel, tunnel to LSP/inseg) then
> restoration can be an atomic operation (complete FRR restoration
> happens all at once).
>
> ------------------------------------------------------------
>
> Another draft should cover unexpected correlated failures and primary
> path rerouting. From a practical standpoint these are very important
> to providers and too often overlooked in testing since they are more
> difficult to test.
>
> _______________________________________________
> bmwg mailing list
> bmwg <at> ietf.org
> https://www1.ietf.org/mailman/listinfo/bmwg
>
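
The disjointness, SRLG, and failure-type definitions in the quoted comments can be made concrete with a small sketch. It assumes a hypothetical representation that appears in neither draft: a path is a list of node names, a link is an unordered pair of node names, and an SRLG is a named collection of links. The category strings are the commonly used terms mentioned in the comments.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Link = Tuple[str, str]  # hypothetical: an undirected link named by its end nodes


def norm(link: Link) -> Link:
    """Order the endpoints so (A, B) and (B, A) compare equal."""
    a, b = link
    return (a, b) if a <= b else (b, a)


def path_links(nodes: List[str]) -> Set[Link]:
    """Links traversed by a path (or path segment) given as a node list."""
    return {norm((a, b)) for a, b in zip(nodes, nodes[1:])}


def links_to_avoid(protected: Set[Link],
                   srlgs: Dict[str, Iterable[Link]]) -> Set[Link]:
    """Protected links plus every link sharing an SRLG with one of them."""
    avoid = set(protected)
    for members in srlgs.values():
        group = {norm(l) for l in members}
        if group & protected:
            avoid |= group
    return avoid


def is_link_disjoint(protected_nodes: List[str], backup_nodes: List[str],
                     srlgs: Optional[Dict[str, Iterable[Link]]] = None) -> bool:
    """True if the backup uses no link from the avoid set."""
    avoid = links_to_avoid(path_links(protected_nodes), srlgs or {})
    return not (path_links(backup_nodes) & avoid)


def is_node_disjoint(primary_nodes: List[str], backup_nodes: List[str]) -> bool:
    """True if the paths share no node other than ingress and egress."""
    endpoints = {primary_nodes[0], primary_nodes[-1]}
    return not ((set(primary_nodes) & set(backup_nodes)) - endpoints)


def classify_failure(failed: Iterable[Link],
                     srlgs: Dict[str, Iterable[Link]]) -> str:
    """Map a set of concurrently failed links to one of the three cases."""
    failed_set = {norm(l) for l in failed}
    if not failed_set:
        raise ValueError("no failed links given")
    if len(failed_set) == 1:
        return "single logical link"
    for members in srlgs.values():
        if failed_set <= {norm(l) for l in members}:
            return "single SRLG"
    return "unexpected correlated failure"
```

For example, with an SRLG containing links B-C and E-F, the backup A-E-F-D is link disjoint from the primary A-B-C-D when SRLGs are ignored but not when they are considered, and a concurrent failure of A-B and E-F is classified as an unexpected correlated failure.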