Joachim Nilsson | 30 May 14:12 2012

[quagga-dev 9389] Re: OSPF: Problem with route distribution when restarting ospfd, since 0.99.18

Hi again,

the fix/revert discussed below is available for cherry-picking from
my GitHub repository, branch westermo/ospfd-fixes

https://github.com/troglobit/quagga/commits/westermo/ospfd-fixes

Regards
 /Joachim
 
On 05/28/2012 02:00 PM, Joachim Nilsson wrote:
Hi,

we've been looking further into this issue, done some bisecting
and found the root cause, or change, that causes the problems
we see. (Please see the full report in the original mail for details)

http://git.savannah.gnu.org/gitweb/?p=quagga.git;a=commit;h=02d942c9d4afabf04bd781f0e1e5e8aa36945df2

This commit causes the problem. It introduces a #define for
MaxAge removal used by ospf_lsa.c:ospf_maxage_lsa_remover().

#define OSPF_LSA_MAXAGE_REMOVE_DELAY_DEFAULT 60

This value used to be hard coded to 2 and if I change this back to
2 ... yes, that's all I need to do, then our problem disappears!

The reason for the change is the discussion on quagga-dev:4132
about a bug in long delay networks.

http://lists.quagga.net/pipermail/quagga-dev/2006-May/004133.html

I'm still not at all up to speed on how the inner workings of OSPF
should behave, but is this perhaps related to Step 8 of Section 13,
as discussed in "G.1 Flooding Modifications" of RFC2328?

http://tools.ietf.org/html/rfc2328#section-13

I may be mixing things up now, but Cisco seems to have some
other form of countermeasure for this in their

    interface <type> <id>
        ip ospf flood-reduction

http://www.cisco.com/en/US/docs/ios/12_1t/12_1t2/feature/guide/dt_ospff.html

Anyone else with input on this?

Regards
 /Joachim

On 05/14/2012 05:50 PM, Joachim Nilsson wrote:
We've seen OSPF issues when reconfiguring ospfd at runtime in our
testbeds at Westermo.  By editing its config file and restarting ospfd
we see problems with route distribution.

Laptop setup, very simple. Two areas, one RIP network with loopback
/32 nets on R1 and R4. Most of the time it works, but sometimes when
we restart ospfd on R2 the edge router R1 never gets any routes.

 ______
|      |---| 66.66.66.66/32
|      |
|  R1  |
|      |
|______|
   |.1
   |
   | 172.12.0.0/24
   | Area 2
   |
   |.2
 ______
|      |
|      |
|  R2  |
|      |
|______|
   |.1
   |
   | 172.10.0.0/24
   | Area 0
   |
   |.2
 ______
|      |
| ASBR |
|  R3  |
|      |
|______|
   |.1
   |
   | 10.1.0.0/24
   |   RIP
   |.2
 ______
|      |---| 14.0.0.1/32
|      |---| 15.0.0.1/32
|  R4  |---| 16.0.0.1/32
|      |
|______|

In reproducing the bug we restart ospfd on R2 by pressing Ctrl-C

while [ 1 ]; do
    sudo ./ospfd/ospfd -u root -g root -f ../ospfd.conf
    sleep 1;
done

Results ...

0.99.17:
    - Stable as a rock!

0.99.18 - Git:
    - Sometimes loss of 66-network in R3 and R4
    - Loss of all RIP nets in R1

Sadly we haven't yet had the time to try and figure out where the bug
is located.  However, we noticed that there was quite an overhaul of
the LSA refresh logic in 0.99.18.


