17 May 2010 19:26
Re: pbs_mom changing /dev/null mode and perms?
matthew devney <matthew <at> devney.net>
2010-05-17 17:26:43 GMT
This has happened to me a couple times too. I haven't done nearly this much investigation. Every couple months I find that a job has died because a node no longer works, and upon ssh'ing there to check, am presented with an error about /dev/null and find that it's now a regular file.

I have no further information; just thought I'd share that this is not an isolated case.

Matthew Devney
matthew <at> devney.net

On Tue, May 11, 2010 at 2:15 AM, Arnau Bria <arnaubria <at> pic.es> wrote:
> Hi all,
>
> I've faced a really strange problem in our cluster.
>
> Time to time /dev/null changes its perms/mode from:
> 0 crw-rw-rw- 1 root root 1, 3 May 10 12:31 /dev/null
> to
> 0 -rw-r--r-- 1 root root 0 May 11 10:42 /dev/null
>
> After some debuggin with audit I found this:
>
> ----
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=1 name=/dev/null inode=2085 dev=00:11 mode=character,666 ouid=root ogid=root rdev=01:03
> type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=0 name=/dev/ inode=1120 dev=00:11 mode=dir,755 ouid=root ogid=root rdev=00:00
> type=CWD msg=audit(05/11/2010 03:32:18.394:111668) : cwd=/var/spool/pbs/mom_priv
> type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64 syscall=unlink success=yes exit=0 a0=6a17a0 a1=15c724b6 a2=15c724ac a3=726576726573206e items=2 ppid=1 pid=18208 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=7327 comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch
> ----
>
> Notice the "syscall=unlink"
>
> At that time, pbs was doing this:
>
> # grep 03:32 /var/spool/pbs/mom_logs/20100511
> 05/11/2010 03:32:17;0080; pbs_mom;Job;10327910.pbs02.pic.es;scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid=3571
> 05/11/2010 03:32:17;0008; pbs_mom;Job;10327910.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 05/11/2010 03:32:17;0008; pbs_mom;Job;10327910.pbs02.pic.es;checking job post-processing routine
> 05/11/2010 03:32:17;0080; pbs_mom;Job;10327910.pbs02.pic.es;obit sent to server
> 05/11/2010 03:32:18;0080; pbs_mom;Job;10268969.pbs02.pic.es;scan_for_terminated: job 10268969.pbs02.pic.es task 1 terminated, sid=19965
> 05/11/2010 03:32:18;0008; pbs_mom;Job;10268969.pbs02.pic.es;job was terminated
> 05/11/2010 03:32:18;0080; pbs_mom;Job;10327910.pbs02.pic.es;removing transient job directory /home/tmp/10327910.pbs02.pic.es
> 05/11/2010 03:32:18;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 05/11/2010 03:32:18;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 05/11/2010 03:32:18;0001; pbs_mom;Job;10268969.pbs02.pic.es;preobit_reply, unknown on server, deleting locally
> 05/11/2010 03:32:18;0080; pbs_mom;Job;10268969.pbs02.pic.es;removing transient job directory /home/tmp/10268969.pbs02.pic.es
>
> # tracejob 10327910.pbs02.pic.es
>
> pbs02.pic.es
>
> 32:17 M scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid
> 32:17 M job was terminated
> 32:17 M checking job post-processing routine
> 32:17 M obit sent to server
> 32:18 M removing transient job directory /home/tmp/10327910.pbs02.pic.es
>
>
> I don't know how to stop this behave, but as you can expect,
> changing /dev/null brings the node as unstable.
>
> Anyone faced this before? Anyone could give me some hint on how to
> prevent this? Some developer imagines where could be the problem?
>
> # rpm -qa|grep torque
> torque-2.3.6-2cri.el5.x86_64
> torque-mom-2.3.6-2cri.el5.x86_64
> torque-client-2.3.6-2cri.el5.x86_64
>
> TIA,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
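
The symptom described in this thread (/dev/null turning from a character device with major 1, minor 3 and mode 666 into a regular file) can at least be detected early. The following is only a minimal sketch of such a check, not something taken from Torque or from either poster's setup: the function name null_device_problems and the expected values hard-coded at the top are assumptions based on the ls output quoted above. It detects the condition; it does not prevent pbs_mom from unlinking the node in the first place.

```python
import os
import stat
import sys

# Expected identity of /dev/null on Linux, taken from the "ls" listing
# quoted in the thread: character device, major 1, minor 3, mode 0666.
EXPECTED_MAJOR = 1
EXPECTED_MINOR = 3
EXPECTED_PERMS = 0o666


def null_device_problems(path="/dev/null"):
    """Return a list of human-readable problems found at `path`.

    An empty list means the path looks like a healthy null device.
    (Hypothetical helper for illustration; not part of Torque.)
    """
    try:
        # lstat so that a symlink left in place of the device is not followed.
        st = os.lstat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    if not stat.S_ISCHR(st.st_mode):
        # This is the failure reported in the thread: something other than
        # a character device (e.g. a regular file) sits at the path.
        return [
            f"{path} is not a character device "
            f"(mode {stat.filemode(st.st_mode)})"
        ]

    problems = []
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    if (major, minor) != (EXPECTED_MAJOR, EXPECTED_MINOR):
        problems.append(
            f"{path} has device numbers {major},{minor}, "
            f"expected {EXPECTED_MAJOR},{EXPECTED_MINOR}"
        )

    perms = stat.S_IMODE(st.st_mode)
    if perms != EXPECTED_PERMS:
        problems.append(
            f"{path} has permissions {perms:o}, expected {EXPECTED_PERMS:o}"
        )
    return problems


if __name__ == "__main__":
    found = null_device_problems()
    for problem in found:
        print(problem)
    # A non-zero exit status lets a calling script react to a broken node.
    sys.exit(1 if found else 0)
```

Run periodically on each compute node (for example from cron, or from whatever node health check the cluster already uses), a non-zero exit status would flag the node before further jobs die on it.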