Borgström Jonas | 24 Sep 19:29 2007

RE: Possible cman init script race condition

From: David Teigland [mailto:teigland <at> redhat.com] 
Sent: den 24 september 2007 18:10
To: Borgström Jonas
Cc: linux clustering
Subject: Re: [Linux-cluster] Possible cman init script race condition
*snip*
> > 1190645596 node "prod-db1" not a cman member, cn 1
> > 1190645597 node "prod-db1" not a cman member, cn 1
> > 1190645598 node "prod-db1" not a cman member, cn 1
> > 1190645599 node "prod-db1" not a cman member, cn 1
> > 1190645600 reduce victim prod-db1
> > 1190645600 delay of 16s leaves 0 victims
> > 1190645600 finish default 1
> > 1190645600 stop default
> > 1190645600 start default 2 members 1 2 
> > 1190645600 do_recovery stop 1 start 2 finish 1
> 
> I think something has gone wrong here, either in groupd or fenced, that's
> preventing this start from finishing (we don't get a 'finish default 2'
> which we expect).  A 'group_tool -v' here should show the state of the
> fence group still in transition.  Could you run that, plus a 'group_tool
> dump' at this point, in addition to the 'dump fence' you have.  And please
> run those commands on both nodes.
> 
Hi david, thanks for your fast response. Here's the output you requested:

[root <at> prod-db1 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 2 200020001 1
[1 2]

[root <at> prod-db2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 JOIN_START_WAIT 1 100020001 1
[1 2]

I attached "group_tool dump" output as files, since they are quite long.

> > 1190645954 client 3: dump    <--- Before killing prod-db1
> > 1190645985 stop default
> > 1190645985 start default 3 members 2 
> > 1190645985 do_recovery stop 2 start 3 finish 1
> > 1190645985 finish default 3
> > 1190646008 client 3: dump    <--- After killing prod-db1
> 
> Node 1 isn't fenced here because it never completed joining the fence
> group above.
>
> > The scary part is that as far as I can tell fenced is the only cman
> > daemon being affected by this. So your cluster appears to work fine. But
> > when a node needs to be fenced the operation it isn't carried out and
> > that can cause gfs filesystem corruption.
>
> You shouldn't be able to mount gfs on the node where joining the fence
> group is stuck.

My current setup is very stripped down so I haven't configured gfs. But on my original setup where I
initially noticed this issue I had no problem mounting gfs filesystems and after a simulated network
failure I could still continue to write to the filesystem from both nodes since no node was fenced, and that
quickly corrupted the filesystem.

Regards,
Jonas

1190653002 cman: our nodeid 2 name prod-db2 quorum 1
1190653002 setup_cpg groupd_handle 6b8b456700000000
1190653002 groupd confchg total 1 left 0 joined 1
1190653002 client connection 3
1190653002 got client 3 setup
1190653002 setup fence 0
1190653002 client connection 4
1190653002 got client 4 setup
1190653002 setup dlm 1
1190653003 client connection 5
1190653003 got client 5 setup
1190653003 setup gfs 2
1190653004 got client 3 join
1190653004 0:default got join
1190653004 0:default is cpg client 6 name 0_default handle 327b23c600000001
1190653004 0:default cpg_join ok
1190653004 0:default waiting for first cpg event
1190653004 0:default confchg left 0 joined 1 total 1
1190653004 0:default process_node_join 2
1190653004 0:default cpg add node 2 total 1
1190653004 0:default create group id 10002 our_nodeid 2
1190653004 0:default make_event_id 200010001 nodeid 2 memb_count 1 type 1
1190653004 0:default queue join event for nodeid 2
1190653004 0:default process_current_event 200010001 2 JOIN_BEGIN
1190653004 0:default app node init: add 2 total 1
1190653004 0:default process_current_event 200010001 2 JOIN_ALL_STOPPED
1190653004 0:default action for app: setid default 65538
1190653004 0:default action for app: start default 1 2 1 2
1190653004 client connection 7
1190653004 got client 7 get_group
1190653004 client connection 7
1190653004 got client 7 get_group
1190653005 client connection 7
1190653005 got client 7 get_group
1190653006 client connection 7
1190653006 got client 7 get_group
1190653007 client connection 7
1190653007 got client 7 get_group
1190653008 client connection 7
1190653008 got client 7 get_group
1190653009 client connection 7
1190653009 got client 7 get_group
1190653010 client connection 7
1190653010 got client 7 get_group
1190653011 client connection 7
1190653011 got client 7 get_group
1190653012 client connection 7
1190653012 got client 7 get_group
1190653013 client connection 7
1190653013 got client 7 get_group
1190653014 client connection 7
1190653014 got client 7 get_group
1190653015 client connection 7
1190653015 got client 7 get_group
1190653016 client connection 7
1190653016 got client 7 get_group
1190653017 client connection 7
1190653017 got client 7 get_group
1190653018 client connection 7
1190653018 got client 7 get_group
1190653019 client connection 7
1190653019 got client 7 get_group
1190653020 client connection 7
1190653020 got client 7 get_group
1190653021 client connection 7
1190653021 got client 7 get_group
1190653022 client connection 7
1190653022 got client 7 get_group
1190653023 client connection 7
1190653023 got client 7 get_group
1190653024 client connection 7
1190653024 got client 7 get_group
1190653025 client connection 7
1190653025 got client 7 get_group
1190653026 client connection 7
1190653026 got client 7 get_group
1190653027 client connection 7
1190653027 got client 7 get_group
1190653028 client connection 7
1190653028 got client 7 get_group
1190653029 client connection 7
1190653029 got client 7 get_group
1190653030 client connection 7
1190653030 got client 7 get_group
1190653031 client connection 7
1190653031 got client 7 get_group
1190653032 client connection 7
1190653032 got client 7 get_group
1190653033 client connection 7
1190653033 got client 7 get_group
1190653034 client connection 7
1190653034 got client 7 get_group
1190653035 client connection 7
1190653035 got client 7 get_group
1190653036 client connection 7
1190653036 got client 7 get_group
1190653037 client connection 7
1190653037 got client 7 get_group
1190653038 client connection 7
1190653038 got client 7 get_group
1190653039 client connection 7
1190653039 got client 7 get_group
1190653040 client connection 7
1190653040 got client 7 get_group
1190653040 cman: node 1 added
1190653040 groupd confchg total 2 left 0 joined 1
1190653040 0:default confchg left 0 joined 1 total 2
1190653040 0:default process_node_join 1
1190653040 0:default cpg add node 1 total 2
1190653040 0:default make_event_id 100020001 nodeid 1 memb_count 2 type 1
1190653040 0:default queue join event for nodeid 1
1190653041 0:default ignore msg from 1 id 100010001 started
1190653041 0:default ignore msg from 1 id 200020001 stopped
1190653041 0:default ignore msg from 1 id 200020001 started
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 2 started
1190653041 0:default process_current_event 200010001 2 JOIN_ALL_STARTED
1190653041 0:default action for app: finish default 1
1190653041 0:default purge msg 200020001 from 2 EST_JOIN_BEGIN
1190653041 0:default purge msg 200020001 from 2 EST_JOIN_BEGIN
1190653041 0:default process_current_event 100020001 1 JOIN_BEGIN
1190653041 0:default action for app: stop default
1190653041 0:default app node join: add 1 total 2
1190653041 got client 3 stop_done
1190653041 0:default send stopped
1190653041 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 1
1190653041 0:default mark node 2 stopped
1190653041 0:default process_current_event 100020001 1 JOIN_ALL_STOPPED
1190653041 0:default action for app: start default 2 2 2 1 2
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 2 started
1190653042 client connection 7
1190653042 got client 7 get_group
1190653099 client connection 7
1190653099 got client 7 get_groups
1190653143 client connection 7
1190653143 got client 7 dump
1190653023 cman: our nodeid 1 name prod-db1 quorum 1
1190653023 setup_cpg groupd_handle 6b8b456700000000
1190653023 groupd confchg total 1 left 0 joined 1
1190653023 client connection 3
1190653023 got client 3 setup
1190653023 setup fence 0
1190653023 client connection 4
1190653023 got client 4 setup
1190653023 setup dlm 1
1190653024 client connection 5
1190653024 got client 5 setup
1190653024 setup gfs 2
1190653025 got client 3 join
1190653025 0:default got join
1190653025 0:default is cpg client 6 name 0_default handle 327b23c600000001
1190653025 0:default cpg_join ok
1190653025 0:default waiting for first cpg event
1190653025 client connection 7
1190653025 0:default waiting for first cpg event
1190653025 got client 7 get_group
1190653025 0:default waiting for first cpg event
1190653025 0:default confchg left 0 joined 1 total 1
1190653025 0:default process_node_join 1
1190653025 0:default cpg add node 1 total 1
1190653025 0:default create group id 10001 our_nodeid 1
1190653025 0:default make_event_id 100010001 nodeid 1 memb_count 1 type 1
1190653025 0:default queue join event for nodeid 1
1190653025 0:default process_current_event 100010001 1 JOIN_BEGIN
1190653025 0:default app node init: add 1 total 1
1190653025 0:default process_current_event 100010001 1 JOIN_ALL_STOPPED
1190653025 0:default action for app: setid default 65537
1190653025 0:default action for app: start default 1 2 1 1
1190653025 client connection 7
1190653025 got client 7 get_group
1190653026 client connection 7
1190653026 got client 7 get_group
1190653027 client connection 7
1190653027 got client 7 get_group
1190653028 client connection 7
1190653028 got client 7 get_group
1190653029 client connection 7
1190653029 got client 7 get_group
1190653030 client connection 7
1190653030 got client 7 get_group
1190653031 client connection 7
1190653031 got client 7 get_group
1190653032 client connection 7
1190653032 got client 7 get_group
1190653033 client connection 7
1190653033 got client 7 get_group
1190653034 client connection 7
1190653034 got client 7 get_group
1190653035 client connection 7
1190653035 got client 7 get_group
1190653036 client connection 7
1190653036 got client 7 get_group
1190653037 client connection 7
1190653037 got client 7 get_group
1190653038 client connection 7
1190653038 got client 7 get_group
1190653039 client connection 7
1190653039 got client 7 get_group
1190653040 client connection 7
1190653040 got client 7 get_group
1190653040 cman: node 2 added
1190653040 groupd confchg total 2 left 0 joined 1
1190653040 0:default confchg left 0 joined 1 total 2
1190653040 0:default process_node_join 2
1190653040 0:default cpg add node 2 total 2
1190653040 0:default make_event_id 200020001 nodeid 2 memb_count 2 type 1
1190653040 0:default queue join event for nodeid 2
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default process_current_event 100010001 1 JOIN_ALL_STARTED
1190653041 0:default action for app: finish default 1
1190653041 0:default process_current_event 200020001 2 JOIN_BEGIN
1190653041 0:default action for app: stop default
1190653041 0:default app node join: add 2 total 2
1190653041 got client 3 stop_done
1190653041 0:default send stopped
1190653041 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 2
1190653041 0:default mark node 1 stopped
1190653041 0:default process_current_event 200020001 2 JOIN_ALL_STOPPED
1190653041 0:default action for app: start default 2 2 2 2 1
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default ignore msg from 2 id 200010001 started
1190653041 0:default ignore msg from 2 id 100020001 stopped
1190653041 0:default ignore msg from 2 id 100020001 started
1190653042 client connection 7
1190653042 got client 7 get_group
1190653080 client connection 7
1190653080 got client 7 get_groups
1190653124 client connection 7
1190653124 got client 7 dump
1190653023 cman: our nodeid 1 name prod-db1 quorum 1
1190653023 setup_cpg groupd_handle 6b8b456700000000
1190653023 groupd confchg total 1 left 0 joined 1
1190653023 client connection 3
1190653023 got client 3 setup
1190653023 setup fence 0
1190653023 client connection 4
1190653023 got client 4 setup
1190653023 setup dlm 1
1190653024 client connection 5
1190653024 got client 5 setup
1190653024 setup gfs 2
1190653025 got client 3 join
1190653025 0:default got join
1190653025 0:default is cpg client 6 name 0_default handle 327b23c600000001
1190653025 0:default cpg_join ok
1190653025 0:default waiting for first cpg event
1190653025 client connection 7
1190653025 0:default waiting for first cpg event
1190653025 got client 7 get_group
1190653025 0:default waiting for first cpg event
1190653025 0:default confchg left 0 joined 1 total 1
1190653025 0:default process_node_join 1
1190653025 0:default cpg add node 1 total 1
1190653025 0:default create group id 10001 our_nodeid 1
1190653025 0:default make_event_id 100010001 nodeid 1 memb_count 1 type 1
1190653025 0:default queue join event for nodeid 1
1190653025 0:default process_current_event 100010001 1 JOIN_BEGIN
1190653025 0:default app node init: add 1 total 1
1190653025 0:default process_current_event 100010001 1 JOIN_ALL_STOPPED
1190653025 0:default action for app: setid default 65537
1190653025 0:default action for app: start default 1 2 1 1
1190653025 client connection 7
1190653025 got client 7 get_group
1190653026 client connection 7
1190653026 got client 7 get_group
1190653027 client connection 7
1190653027 got client 7 get_group
1190653028 client connection 7
1190653028 got client 7 get_group
1190653029 client connection 7
1190653029 got client 7 get_group
1190653030 client connection 7
1190653030 got client 7 get_group
1190653031 client connection 7
1190653031 got client 7 get_group
1190653032 client connection 7
1190653032 got client 7 get_group
1190653033 client connection 7
1190653033 got client 7 get_group
1190653034 client connection 7
1190653034 got client 7 get_group
1190653035 client connection 7
1190653035 got client 7 get_group
1190653036 client connection 7
1190653036 got client 7 get_group
1190653037 client connection 7
1190653037 got client 7 get_group
1190653038 client connection 7
1190653038 got client 7 get_group
1190653039 client connection 7
1190653039 got client 7 get_group
1190653040 client connection 7
1190653040 got client 7 get_group
1190653040 cman: node 2 added
1190653040 groupd confchg total 2 left 0 joined 1
1190653040 0:default confchg left 0 joined 1 total 2
1190653040 0:default process_node_join 2
1190653040 0:default cpg add node 2 total 2
1190653040 0:default make_event_id 200020001 nodeid 2 memb_count 2 type 1
1190653040 0:default queue join event for nodeid 2
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 client connection 7
1190653041 got client 7 get_group
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default process_current_event 100010001 1 JOIN_ALL_STARTED
1190653041 0:default action for app: finish default 1
1190653041 0:default process_current_event 200020001 2 JOIN_BEGIN
1190653041 0:default action for app: stop default
1190653041 0:default app node join: add 2 total 2
1190653041 got client 3 stop_done
1190653041 0:default send stopped
1190653041 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 2
1190653041 0:default mark node 1 stopped
1190653041 0:default process_current_event 200020001 2 JOIN_ALL_STOPPED
1190653041 0:default action for app: start default 2 2 2 2 1
1190653041 got client 3 start_done
1190653041 0:default send started
1190653041 0:default mark node 1 started
1190653041 0:default ignore msg from 2 id 200010001 started
1190653041 0:default ignore msg from 2 id 100020001 stopped
1190653041 0:default ignore msg from 2 id 100020001 started
1190653042 client connection 7
1190653042 got client 7 get_group
1190653080 client connection 7
1190653080 got client 7 get_groups
1190653124 client connection 7
1190653124 got client 7 dump

Gmane