From: Borgström Jonas <jobot <at> wmdata.com>
Subject: RE: Possible cman init script race condition
Newsgroups: gmane.linux.redhat.cluster
Date: Monday, 24 September 2007 17:29:01 UTC
From: David Teigland [mailto:[email protected]] 
Sent: 24 September 2007 18:10
To: Borgström Jonas
Cc: linux clustering
Subject: Re: [Linux-cluster] Possible cman init script race condition
*snip*
> > 1190645596 node "prod-db1" not a cman member, cn 1
> > 1190645597 node "prod-db1" not a cman member, cn 1
> > 1190645598 node "prod-db1" not a cman member, cn 1
> > 1190645599 node "prod-db1" not a cman member, cn 1
> > 1190645600 reduce victim prod-db1
> > 1190645600 delay of 16s leaves 0 victims
> > 1190645600 finish default 1
> > 1190645600 stop default
> > 1190645600 start default 2 members 1 2 
> > 1190645600 do_recovery stop 1 start 2 finish 1
> 
> I think something has gone wrong here, either in groupd or fenced, that's
> preventing this start from finishing (we don't get a 'finish default 2'
> which we expect).  A 'group_tool -v' here should show the state of the
> fence group still in transition.  Could you run that, plus a 'group_tool
> dump' at this point, in addition to the 'dump fence' you have.  And please
> run those commands on both nodes.
> 
Hi David, thanks for your fast response. Here's the output you requested:

[[email protected] ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 2 200020001 1
[1 2]

[[email protected] ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 JOIN_START_WAIT 1 100020001 1
[1 2]

I attached "group_tool dump" output as files, since they are quite long.
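
For what it's worth, the stuck state is easy to spot mechanically. A minimal
sketch (assuming the 'group_tool -v' column layout shown above, where a
settled group reports "none" in the state column) that flags groups still in
transition:

```shell
# Hypothetical check: print any group whose state column is not "none",
# i.e. a group (like the fence group above) stuck mid-transition.
# Assumed columns, from the output above: type level name id state ...
sample='fence            0     default  00010001 JOIN_START_WAIT 2 200020001 1'
printf '%s\n' "$sample" | awk '$5 != "none" { print $3, "stuck in", $5 }'
```

On the output above this prints "default stuck in JOIN_START_WAIT", matching
what 'group_tool -v' shows on both nodes.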

> > 1190645954 client 3: dump    <--- Before killing prod-db1
> > 1190645985 stop default
> > 1190645985 start default 3 members 2 
> > 1190645985 do_recovery stop 2 start 3 finish 1
> > 1190645985 finish default 3
> > 1190646008 client 3: dump    <--- After killing prod-db1
> 
> Node 1 isn't fenced here because it never completed joining the fence
> group above.
>
> > The scary part is that as far as I can tell fenced is the only cman
> > daemon being affected by this. So your cluster appears to work fine. But
> > when a node needs to be fenced the operation isn't carried out, and
> > that can cause gfs filesystem corruption.
>
> You shouldn't be able to mount gfs on the node where joining the fence
> group is stuck.

My current setup is very stripped down, so I haven't configured gfs. But on
my original setup, where I first noticed this issue, I had no problem
mounting gfs filesystems. After a simulated network failure both nodes could
still write to the filesystem, since neither node was fenced, and that
quickly corrupted the filesystem.

Regards,
Jonas