From: matthew devney <matthew <at> devney.net>
Subject: Re: pbs_mom changing /dev/null mode and perms?
Newsgroups: gmane.comp.clustering.torque.user
Date: Monday 17th May 2010 17:26:43 UTC
This has happened to me a couple of times too, though I haven't done
nearly this much investigation.  Every couple of months I find that a
job has died because a node no longer works, and upon ssh'ing there to
check, I am presented with an error about /dev/null and find that it's
now a regular file.

I have no further information; just thought I'd share that this is not
an isolated case.
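
For anyone who wants to catch this before jobs start dying, here is a
minimal sketch of a check that could run from cron or a node health
script. It assumes Linux, where /dev/null is the character device with
major 1, minor 3 (the "1, 3" in the ls output quoted below). The
function names are made up for illustration and are not part of TORQUE.

```python
import os
import stat


def describe_node(path="/dev/null"):
    """Return the file type details that matter for a device node."""
    st = os.lstat(path)  # lstat: do not follow a symlink left in its place
    is_char = stat.S_ISCHR(st.st_mode)
    return {
        "is_char_device": is_char,
        # st_rdev is only meaningful for device nodes
        "major": os.major(st.st_rdev) if is_char else None,
        "minor": os.minor(st.st_rdev) if is_char else None,
        "mode": stat.filemode(st.st_mode),
    }


def is_healthy_null(path="/dev/null"):
    """True only if path is a character device with major 1, minor 3."""
    info = describe_node(path)
    return info["is_char_device"] and (info["major"], info["minor"]) == (1, 3)


if __name__ == "__main__":
    info = describe_node()
    if is_healthy_null():
        print("/dev/null OK:", info["mode"])
    else:
        print("/dev/null is BROKEN:", info)
```

If the check fails, the usual manual repair as root is to remove the
regular file and recreate the node with "mknod -m 666 /dev/null c 1 3".
That only treats the symptom, not whatever is unlinking it.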

Matthew Devney
[email protected]


On Tue, May 11, 2010 at 2:15 AM, Arnau Bria wrote:
> Hi all,
>
> I've faced a really strange problem in our cluster.
>
> From time to time /dev/null changes its perms/mode from:
> 0 crw-rw-rw- 1 root root 1, 3 May 10 12:31 /dev/null
> to
> 0 -rw-r--r-- 1 root root 0 May 11 10:42 /dev/null
>
> After some debugging with audit I found this:
>
> ----
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=1 name=/dev/null inode=2085 dev=00:11 mode=character,666 ouid=root ogid=root rdev=01:03
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=0 name=/dev/ inode=1120 dev=00:11 mode=dir,755 ouid=root ogid=root rdev=00:00
> type=CWD msg=audit(05/11/2010 03:32:18.394:111668) : cwd=/var/spool/pbs/mom_priv
> type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64 syscall=unlink success=yes exit=0 a0=6a17a0 a1=15c724b6 a2=15c724ac a3=726576726573206e items=2 ppid=1 pid=18208 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=7327 comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch
> ----
>
> Notice the "syscall=unlink"
>
> At that time, pbs was doing this:
>
> # grep 03:32 /var/spool/pbs/mom_logs/20100511
> 05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid=3571
> 05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;checking job post-processing routine
> 05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;obit sent to server
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;scan_for_terminated: job 10268969.pbs02.pic.es task 1 terminated, sid=19965
> 05/11/2010 03:32:18;0008;   pbs_mom;Job;10268969.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10327910.pbs02.pic.es;removing transient job directory /home/tmp/10327910.pbs02.pic.es
> 05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:18;0001;   pbs_mom;Job;10268969.pbs02.pic.es;preobit_reply, unknown on server, deleting locally
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;removing transient job directory /home/tmp/10268969.pbs02.pic.es
>
> # tracejob 10327910.pbs02.pic.es
>
> pbs02.pic.es
>
> 32:17  M    scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid
> 32:17  M    job was terminated
> 32:17  M    checking job post-processing routine
> 32:17  M    obit sent to server
> 32:18  M    removing transient job directory /home/tmp/10327910.pbs02.pic.es
>
>
> I don't know how to stop this behaviour, but as you can expect,
> a changed /dev/null leaves the node unstable.
>
> Has anyone faced this before? Could anyone give me a hint on how to
> prevent this? Can any developer guess where the problem might be?
>
> # rpm -qa|grep torque
> torque-2.3.6-2cri.el5.x86_64
> torque-mom-2.3.6-2cri.el5.x86_64
> torque-client-2.3.6-2cri.el5.x86_64
>
> TIA,
> Arnau
> _______________________________________________
> torqueusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
 