3 May 2012 21:03
Re: Condor 7.6 - Windows Parallel Universe problems
I recently investigated a problem I reported a while ago on this list but got no reply to (so I was probably the only one experiencing it).
However, I want to report the solution here in case someone else
stumbles across it.
The problematic setup:
I use a pool of Windows machines dedicated to Condor. The pool runs a
Central Manager, a Schedd, and a credd, and the users all submit from
external Windows client machines to the pool's schedd using the
"-remote" option of condor_submit. I've enabled PASSWORD authentication
on the pool, which might be part of the problem.
As long as the "vanilla" universe is used, everything works nicely. But
if one submits a job to the "parallel" universe, the job starts, but
after it has finished the shadow gives the error message
01/20/12 16:39:36 (80.0) (4436): SetEffectiveOwner(FelixWolfheimer) failed with errno=13: Permission denied.
01/20/12 16:39:36 (80.0) (4436): Failed to perform final update to job queue!
and the job is rescheduled, runs into the same problem at the end, is rescheduled again, and so on.
Solution: I found out that I had to add the dummy(?) account "condor_pool" to QUEUE_SUPER_USERS in the Condor config file on the machine running the schedd of the pool.
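A minimal sketch of that change, for anyone who wants to try it. The other entries in the list are assumptions on my part: keep whatever "condor_config_val QUEUE_SUPER_USERS" currently reports on your schedd machine and just append condor_pool.

```
# Local Condor config file on the machine running the pool's schedd.
# "condor, SYSTEM" stands in for whatever the list already contains
# (an assumption; check your own value first). Only condor_pool is new.
QUEUE_SUPER_USERS = condor, SYSTEM, condor_pool
```

The schedd has to re-read its configuration before the new value is used, for example via condor_reconfig, or a restart if in doubt.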
This does not seem very obvious to me, and I wonder whether it is the intended behaviour.
Anyway, now the parallel jobs run fine and just do what they are supposed to do.