[Swift-devel] PBS coasters miscalculate PBS options

Michael Wilde wilde at mcs.anl.gov
Thu Feb 25 10:11:14 CST 2010


I suspect that even though qstat on PADS doesn't show node limits on the queues, PBS likely balks if you ask for more nodes than exist on the system. I'll try setting provider options that keep requests below 48 nodes (or 384 nodes, depending on how this is counted), or better yet much lower, so that coasters don't create jobs that will never be runnable.
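
Roughly what I have in mind for the sites entry is sketched below. It assumes the coaster provider honors the slots, maxNodes, and nodeGranularity profile keys in the globus namespace as the Swift user guide describes them. The values 4 and 8 are placeholders picked for illustration, not limits confirmed for PADS, and this is untested.

```xml
<pool handle="pbs">
  <profile namespace="globus" key="maxwalltime">00:00:10</profile>
  <profile namespace="globus" key="maxtime">1800</profile>
  <!-- Assumed coaster keys; the values are illustrative only. -->
  <!-- slots: upper bound on the number of blocks (PBS jobs) queued at once -->
  <profile namespace="globus" key="slots">4</profile>
  <!-- maxNodes: upper bound on the nodes requested by any single block -->
  <profile namespace="globus" key="maxNodes">8</profile>
  <!-- nodeGranularity: block sizes are multiples of this many nodes -->
  <profile namespace="globus" key="nodeGranularity">1</profile>
  <execution provider="coaster" url="none" jobManager="local:pbs"/>
  <profile namespace="globus" key="workersPerNode">1</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
  <filesystem provider="local"/>
  <workdirectory>$(pwd)</workdirectory>
</pool>
```

If the keys behave as assumed, the largest block would ask PBS for 8 nodes and at most 4 blocks would be queued at once, so the total request would stay at 32 nodes, under the 48-node figure.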

- Mike

login1$ qstat -q

server: svc.pads.ci.uchicago.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
short              --      --    04:00:00   --    0   0 --   E R
extended           --      --       --      --    2   0 --   E R
fast               --      --    01:00:00   --    0   0 --   E R
long               --      --    24:00:00   --    0   0 --   E R
                                               ----- -----
                                                   2     0
login1$ 


----- wilde at mcs.anl.gov wrote:

> Mihael, running a 1000 job workflow with minimal specs in the
> sites.xml entry for coasters on PADS gave the error "(qsub reported an
> exit code of 188). 
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible
> nodes" (full trace below). The sites entry was:
> 
>   <pool handle="pbs">
>     <profile namespace="globus" key="maxwalltime">00:00:10</profile>
>     <profile namespace="globus" key="maxtime">1800</profile>
>     <execution provider="coaster" url="none" jobManager="local:pbs"/>
>     <profile namespace="globus" key="workersPerNode">1</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="karajan" key="jobThrottle">5.99</profile>
>     <filesystem provider="local"/>
>     <workdirectory>$(pwd)</workdirectory>
>   </pool>
> 
> - Mike
> 
> 
> Swift running in SwiftR.run.056 
> Swift svn swift-r3202 cog-r2683
> 
> RunID: 20100225-0813-xn3bajnc
> Progress:
> Progress:  uninitialized:1
> Progress:  Selecting site:399  Stage in:600  Submitting:1
> Progress:  Selecting site:399  Stage in:529  Submitting:2  Submitted:70
> Progress:  Selecting site:399  Stage in:413  Submitted:188
> Progress:  Selecting site:399  Stage in:326  Submitted:275
> Worker task failed: Error submitting block task
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (qsub reported an exit code of 188).
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> 
>         at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>         at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66)
> Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of 188).
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> 
>         at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:86)
>         at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
>         ... 3 more
> Failed to shut down block
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Can only cancel an active task
>         at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:149)
>         at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>         at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:96)
>         at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:85)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
>         at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:271)
>         at org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:252)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:151)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Exception caught in block processor
> java.util.ConcurrentModificationException
>         at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>         at java.util.AbstractList$Itr.next(AbstractList.java:343)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:149)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Exception caught in block processor
> java.util.ConcurrentModificationException
>         at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>         at java.util.AbstractList$Itr.next(AbstractList.java:343)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:149)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Cleaning up...
> Shutting down service at https://192.5.86.5:50002
> Got channel MetaChannel: 1151109057 -> null
> +Canceling job 4970.svc.pads.ci.uchicago.edu
> Canceling job 4971.svc.pads.ci.uchicago.edu
> Canceling job 4972.svc.pads.ci.uchicago.edu
> Canceling job 4973.svc.pads.ci.uchicago.edu
> Canceling job 4974.svc.pads.ci.uchicago.edu
>  Done
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


