matthew devney | 17 May 19:26 2010

Re: pbs_mom changing /dev/null mode and perms?

This has happened to me a couple of times too, though I haven't
investigated nearly as thoroughly.  Every couple of months I find that
a job has died because a node no longer works.  When I ssh in to
check, I get an error about /dev/null and find that it has become a
regular file.
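
For anyone who wants to catch this before a job dies, here is a minimal Python sketch of a node check. It assumes Linux, where /dev/null is normally a character device with major 1, minor 3 and mode 666, as in the ls output quoted below. The function name dev_null_problems is made up for illustration and is not part of TORQUE.

```python
import os
import stat

# Expected identity of /dev/null on Linux, matching the "1, 3" and
# crw-rw-rw- shown in the quoted ls output (an assumption for other systems).
EXPECTED_MAJOR, EXPECTED_MINOR, EXPECTED_PERMS = 1, 3, 0o666


def dev_null_problems(path="/dev/null"):
    """Return a list of problem descriptions; an empty list means the path looks fine."""
    try:
        # lstat so that a symlink left in place of the device is also flagged
        st = os.lstat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    if not stat.S_ISCHR(st.st_mode):
        # The failure described in this thread: a regular file sitting
        # where the character device should be.
        return [f"{path} is not a character device"]

    problems = []
    dev = (os.major(st.st_rdev), os.minor(st.st_rdev))
    if dev != (EXPECTED_MAJOR, EXPECTED_MINOR):
        problems.append(f"{path} has device numbers {dev}, expected (1, 3)")
    perms = stat.S_IMODE(st.st_mode)
    if perms != EXPECTED_PERMS:
        problems.append(f"{path} has mode {perms:o}, expected 666")
    return problems


if __name__ == "__main__":
    for problem in dev_null_problems():
        print(problem)
```

Run from cron or a node health check, any printed line would be a reason to take the node offline before more jobs land on it.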

I have no further information; just thought I'd share that this is not
an isolated case.

Matthew Devney
matthew <at> devney.net

On Tue, May 11, 2010 at 2:15 AM, Arnau Bria <arnaubria <at> pic.es> wrote:
> Hi all,
>
> I've faced a really strange problem in our cluster.
>
> From time to time /dev/null changes its perms/mode from:
> 0 crw-rw-rw- 1 root root 1, 3 May 10 12:31 /dev/null
> to
> 0 -rw-r--r-- 1 root root 0 May 11 10:42 /dev/null
>
> After some debugging with audit I found this:
>
> ----
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=1 name=/dev/null inode=2085 dev=00:11 mode=character,666 ouid=root ogid=root rdev=01:03
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=0 name=/dev/ inode=1120 dev=00:11 mode=dir,755 ouid=root ogid=root rdev=00:00
> type=CWD msg=audit(05/11/2010 03:32:18.394:111668) :  cwd=/var/spool/pbs/mom_priv
> type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64 syscall=unlink success=yes exit=0 a0=6a17a0 a1=15c724b6 a2=15c724ac a3=726576726573206e items=2 ppid=1 pid=18208 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=7327 comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch
> ----
>
> Notice the "syscall=unlink", issued by pbs_mom (comm=pbs_mom).
>
> At that time, pbs was doing this:
>
> # grep 03:32 /var/spool/pbs/mom_logs/20100511
> 05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid=3571
> 05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;checking job post-processing routine
> 05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;obit sent to server
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;scan_for_terminated: job 10268969.pbs02.pic.es task 1 terminated, sid=19965
> 05/11/2010 03:32:18;0008;   pbs_mom;Job;10268969.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10327910.pbs02.pic.es;removing transient job directory /home/tmp/10327910.pbs02.pic.es
> 05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:18;0001;   pbs_mom;Job;10268969.pbs02.pic.es;preobit_reply, unknown on server, deleting locally
> 05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;removing transient job directory /home/tmp/10268969.pbs02.pic.es
>
> # tracejob 10327910.pbs02.pic.es
>
> pbs02.pic.es
>
> 32:17  M    scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid
> 32:17  M    job was terminated
> 32:17  M    checking job post-processing routine
> 32:17  M    obit sent to server
> 32:18  M    removing transient job directory /home/tmp/10327910.pbs02.pic.es
>
>
> I don't know how to stop this behavior but, as you can expect,
> replacing /dev/null leaves the node unstable.
>
> Has anyone faced this before? Could anyone give me a hint on how to
> prevent it? Can any developer think of where the problem could be?
>
> # rpm -qa|grep torque
> torque-2.3.6-2cri.el5.x86_64
> torque-mom-2.3.6-2cri.el5.x86_64
> torque-client-2.3.6-2cri.el5.x86_64
>
> TIA,
> Arnau
>
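
The audit record quoted above is what ties the unlink to pbs_mom, so it may help to pull out just the relevant fields. The following is a rough Python sketch that splits one interpreted SYSCALL record into key=value pairs. The record is abbreviated from the one in the quoted message, and the helper name parse_audit_fields is hypothetical; it is not part of the audit tools.

```python
import re

# One SYSCALL record, abbreviated from the audit output quoted in this thread.
RECORD = (
    "type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64 "
    "syscall=unlink success=yes exit=0 items=2 ppid=1 pid=18208 uid=root "
    "comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch"
)


def parse_audit_fields(record):
    """Split an interpreted audit record into a dict of key=value fields."""
    # msg=audit(...) contains a space, so it is matched as a unit first;
    # every other value is assumed to run up to the next whitespace.
    pairs = re.findall(r"(\w+)=(audit\([^)]*\)|\S+)", record)
    return dict(pairs)


fields = parse_audit_fields(RECORD)
print(fields["syscall"], fields["comm"], fields["exe"], fields["key"])
# prints: unlink pbs_mom /usr/sbin/pbs_mom NULL_touch
```

Filtering a longer audit log this way for syscall=unlink with comm=pbs_mom would show how often the daemon itself removes the device, which could then be lined up against the mom_logs timestamps.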
