6 Jun 2012 17:03
Re: CTDB asymetric (non-)recovery
Abhijith Das <adas <at> redhat.com>
2012-06-06 15:03:20 GMT
----- Original Message -----
> From: "Nicolas Ecarnot" <nicolas.ecarnot <at> gmail.com>
> To: "Steven Whitehouse" <swhiteho <at> redhat.com>
> Cc: adas <at> redhat.com, samba-technical <at> samba.org
> Sent: Wednesday, June 6, 2012 9:47:56 AM
> Subject: CTDB asymetric (non-)recovery
>
> On 06/06/2012 11:00, Steven Whitehouse wrote:
> >> Bonus question: do you know a better channel where I could ask a
> >> precise ctdb question?
> >>
> >
> > I probably missed that... I'm just catching up after the Jubilee
> > holiday. Abhi should be able to point you in the right direction
> > wrt ctdb,
>
> Thank you Steve.
>
> Well, as Abhi is in Cc, here is the situation:
> > I had a 2-node cluster running perfectly fine under Ubuntu Server 11.10,
> > with cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
> >
> > So I decided to upgrade to Precise (12.04).
>
> ctdb seems to run as well as it did under 11.10.
>
> But an asymmetric behaviour is affecting my setup.
> My tests show this:
>
> Test 01
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
>
> Test 02
> - both nodes down (ctdb stop)
> - node 1 : ctdb start : OK
> - node 0 : ctdb start : both OK
>
> Test 03
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
> - node 1 : ctdb stop : OK
> - node 1 : ctdb start : both OK
>
> Test 04
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
> - node 0 : ctdb stop : OK
> - node 0 : ctdb start : node 0 down, only node 1 OK !!!
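
Reading the four sequences side by side, Test 04 is the only one in which node 0 is restarted while node 1 stays up; Test 03 is its mirror image and succeeds. A minimal Python sketch of that comparison follows. The data layout and the helper name `restarted_while_peer_up` are illustrative only, the outcomes are copied from the report above, and nothing here talks to a real cluster.

```python
# Each test is a list of (node, action) steps applied after both nodes are down.
# Outcomes are copied from the report; this is bookkeeping, not a ctdb client.
TESTS = {
    "Test 01": [(0, "start"), (1, "start")],
    "Test 02": [(1, "start"), (0, "start")],
    "Test 03": [(0, "start"), (1, "start"), (1, "stop"), (1, "start")],
    "Test 04": [(0, "start"), (1, "start"), (0, "stop"), (0, "start")],
}
REPORTED_OK = {"Test 01": True, "Test 02": True, "Test 03": True, "Test 04": False}


def restarted_while_peer_up(steps):
    """Return the nodes that were started again while the other node was running."""
    running = set()        # nodes currently up
    stopped_once = set()   # nodes that have been stopped at least once
    restarted = set()
    for node, action in steps:
        if action == "start":
            # A restart counts only if some *other* node is up at that moment.
            if node in stopped_once and running - {node}:
                restarted.add(node)
            running.add(node)
        elif action == "stop":
            running.discard(node)
            stopped_once.add(node)
    return restarted


for name, steps in TESTS.items():
    status = "OK" if REPORTED_OK[name] else "FAILED"
    print(name, sorted(restarted_while_peer_up(steps)), status)
```

Under this reading, the failing case is specifically "node 0 rejoins a cluster in which node 1 is already running", which narrows where to look in the logs.
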
>
> I ran these tests with ctdb managing samba+winbind, and ran the same
> tests without ctdb managing them.
>
> Without managing them, Test 04 behaves much better, but not on every
> try, so I guess this is not related to the 50.samba script.
> (It may be related to timing?)
>
> I'm reading the docs and the source files to understand what's wrong.
> In my log files, the difference between success (Test 01, 02, 03)
> and failure (Test 04) is the following message, which loops:
>
> Good situation :
> || [...]
> || [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> || [recoverd: 5894]: Trigger takeoverrun
> || [18295]: CTDB_WAIT_UNTIL_RECOVERED
> || [ 1077]: startup event OK - enabling monitoring
> || [...]
>
> Bad situation :
> || [...]
> || [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> || [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
> || [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> || [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
> || [Infinite repeating and looping...]
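
The difference between the two excerpts can be checked mechanically: in the good case a "startup event OK" line follows the CTDB_WAIT_UNTIL_RECOVERED line, while in the bad case the wait line keeps repeating. The following Python sketch counts wait lines since the last successful startup. The function names and the threshold of 2 are assumptions made for illustration, and the sample lines are the ones quoted above with the mail wrapping undone.

```python
WAIT_MARKER = "CTDB_WAIT_UNTIL_RECOVERED"
STARTUP_OK_MARKER = "startup event OK"


def waits_since_startup_ok(lines):
    """Count WAIT_MARKER lines seen after the most recent STARTUP_OK_MARKER line."""
    count = 0
    for line in lines:
        if STARTUP_OK_MARKER in line:
            count = 0          # startup completed, so earlier waits no longer matter
        elif WAIT_MARKER in line:
            count += 1
    return count


def looks_stuck(lines, threshold=2):
    """Heuristic: repeated waits with no startup completion afterwards (threshold is arbitrary)."""
    return waits_since_startup_ok(lines) >= threshold


GOOD = [
    "[recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run",
    "[recoverd: 5894]: Trigger takeoverrun",
    "[18295]: CTDB_WAIT_UNTIL_RECOVERED",
    "[ 1077]: startup event OK - enabling monitoring",
]
BAD = [
    "[recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run",
    "[ 5846]: CTDB_WAIT_UNTIL_RECOVERED",
    "[recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run",
    "[ 5846]: CTDB_WAIT_UNTIL_RECOVERED",
]

print("good excerpt stuck?", looks_stuck(GOOD))
print("bad excerpt stuck?", looks_stuck(BAD))
```

Run against a full log file (one line per list element), this gives a quick way to tell how many retries of Test 04 actually end in the loop.
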
>
>
> On the healthy node, I see every 10 seconds:
> || [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8068 opcode:18 dstnode:0
> || [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
> || [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8070 opcode:18 dstnode:0
> || [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
> || [18295]: Recovery daemon ping timeout. Count : 0
> || [recoverd:18343]: Could not find idr:8068
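
To see whether every timeout targets the same node and opcode, the "control timed out" lines can be tallied. The Python sketch below does that with a regular expression written against the exact wording quoted above; the pattern and the function name `timeouts_by_target` are assumptions for illustration, and the sketch makes no claim about what opcode 18 means.

```python
import re
from collections import Counter

# Matches the quoted wording: "control timed out. reqid:N opcode:N dstnode:N"
TIMEOUT_RE = re.compile(
    r"control timed out\.\s+reqid:(\d+)\s+opcode:(\d+)\s+dstnode:(\d+)"
)


def timeouts_by_target(lines):
    """Count timed-out controls per (dstnode, opcode) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            opcode = int(match.group(2))
            dstnode = int(match.group(3))
            counts[(dstnode, opcode)] += 1
    return counts


SAMPLE = [
    "[recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8068 opcode:18 dstnode:0",
    "[recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed",
    "[recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8070 opcode:18 dstnode:0",
    "[recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed",
    "[18295]: Recovery daemon ping timeout. Count : 0",
    "[recoverd:18343]: Could not find idr:8068",
]

for (dstnode, opcode), n in timeouts_by_target(SAMPLE).items():
    print(f"dstnode={dstnode} opcode={opcode} timeouts={n}")
```

In the excerpt, both timeouts are controls sent to node 0, which is consistent with node 0 being the node that never finishes starting.
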
>
> I also see signs that node 1 cannot pull the db from node 0. Is this
> just a consequence, or something else?
> What is my cluster trying to whisper to my deaf ears?
>
> --
> Nicolas Ecarnot
>