[Swift-devel] Re: Coaster error

Jonathan Monette jon.monette at gmail.com
Tue Aug 17 12:08:29 CDT 2010


OK.  I have run more tests on this problem.  I am running on both 
localhost and PADS.  In the first stage of my workflow I run on 
localhost to collect some metadata.  I then use this metadata to 
reproject the images, submitting these jobs to PADS.  All the images 
are reprojected and the stage completes without error.  After this, 
coasters is waiting for more jobs to submit to the workers while 
localhost is collecting more metadata.  I believe coasters starts to 
shut down some of the workers because they are idle and it wants to 
free the resources on the machine (am I correct so far?).  During the 
shutdown some workers are shut down successfully, but there are always 
1 or 2 that fail to shut down, and I get the qdel error 153 I mentioned 
yesterday.  If coasters fails to shut down a job, does the service 
terminate?  I ask this because after the job fails to shut down, no 
more jobs are submitted to the queue and my script hangs, since it is 
waiting for the next stage in my workflow to complete.  Is there a 
coaster parameter that tells coasters not to shut down the workers 
even if they become idle for a bit, or is this a legitimate error in 
coasters?
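
A note on the exit code, offered as a possible reading rather than 
something the logs confirm: a POSIX exit status keeps only the low 8 
bits of the value a program exits with, and TORQUE/PBS defines 
PBSE_UNKJOBID ("Unknown Job Id") as 15001, which truncates to 153.  If 
qdel on PADS exits with the raw PBS error code (an assumption), 153 
would mean the job was already gone when the cancel was attempted.  
The helper names in this sketch are hypothetical and are not part of 
coasters or CoG:

```python
# Assumption: TORQUE/PBS defines PBSE_UNKJOBID ("Unknown Job Id") as 15001.
PBSE_UNKJOBID = 15001


def truncated_exit_status(error_code: int) -> int:
    """Return the low 8 bits of a code, which is all a POSIX exit status keeps."""
    return error_code & 0xFF


def cancel_target_already_gone(qdel_exit_code: int) -> bool:
    """Hypothetical helper: True if qdel's status matches a truncated PBSE_UNKJOBID."""
    return qdel_exit_code == truncated_exit_status(PBSE_UNKJOBID)


if __name__ == "__main__":
    print(truncated_exit_status(PBSE_UNKJOBID))
    print(cancel_target_already_gone(153))
```

If that reading holds, a failed cancel with status 153 could be 
treated as "already shut down" rather than as an error, though whether 
coasters should do so is a question for the developers.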

On 8/16/10 1:38 PM, Jonathan Monette wrote:
> Hello,
>     I am getting this error when running coasters on PADS.
>
> Canceling job 449188.svc.pads.ci.uchicago.edu
> Canceling job 449189.svc.pads.ci.uchicago.edu
> Failed to shut down block
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159)
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101)
>     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90)
>     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block$1.run(Block.java:284)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>
> I am assuming that coasters could not qdel a job.  As soon as this 
> error appeared all my jobs in the queue disappeared and no more jobs 
> are submitted.  My script hangs because it is waiting for some apps to 
> run but the jobs are never submitted to the PADS scheduler.  My run 
> and all the log files are located at 
> /home/jonmon/Workspace/Montage/m101_j_6x6/runs/m101_montage_Aug-16-2010_13-24-43 
> on the CI machines.
>

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein