[Swift-devel] Re: Coaster error
Jonathan Monette
jon.monette at gmail.com
Tue Aug 17 12:08:29 CDT 2010
OK, I have run more tests on this problem. I am running on both
localhost and PADS. In the first stage of my workflow I run on
localhost to collect some metadata. I then use this metadata to
reproject the images, submitting these jobs to PADS. All the images are
reprojected and that stage completes without error. After this, coasters
is waiting for more jobs to submit to the workers while localhost is
collecting more metadata. I believe coasters starts to shut down some of
the workers because they are idle and it wants to free the resources on
the machine (am I correct so far?). During the shutdown some workers are
shut down successfully, but there are always 1 or 2 that fail to shut
down, and I get the qdel error 153 I mentioned yesterday.

If coasters fails to shut down a job, does the service terminate? I ask
because after the job fails to shut down, no more jobs are submitted to
the queue and my script hangs, since it is waiting for the next stage in
my workflow to complete. Is there a coaster parameter that tells coasters
not to shut down the workers even if they become idle for a bit, or is
this a legitimate error in coasters?

On 8/16/10 1:38 PM, Jonathan Monette wrote:
> Hello,
> I am getting this error when running coasters on PADS.
>
> Canceling job 449188.svc.pads.ci.uchicago.edu
> Canceling job 449189.svc.pads.ci.uchicago.edu
> Failed to shut down block
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159)
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101)
>     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90)
>     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block$1.run(Block.java:284)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>
> I am assuming that coasters could not qdel a job. As soon as this
> error appeared all my jobs in the queue disappeared and no more jobs
> are submitted. My script hangs because it is waiting for some apps to
> run but the jobs are never submitted to the PADS scheduler. My run
> and all the log files are located at
> /home/jonmon/Workspace/Montage/m101_j_6x6/runs/m101_montage_Aug-16-2010_13-24-43
> on the CI machines.
>
--
Jon
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein