[Swift-devel] PBS coasters miscalculate PBS options
Michael Wilde
wilde at mcs.anl.gov
Thu Feb 25 10:11:14 CST 2010
I suspect that even though qstat on PADS doesn't show node limits on the queue, it likely balks if you ask for more nodes than exist on the system. I'll try setting provider options that keep the request below 48 nodes (or 384, depending on how this is counted), or better yet much lower, so that it doesn't create jobs that will never be runnable.
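A minimal sketch of what such a capped sites entry might look like, based on the entry quoted below. It assumes the coaster provider accepts "slots" (maximum number of concurrent blocks) and "maxnodes" (maximum nodes per block) as profile keys in the globus namespace. The values 4 and 8 are illustrative only and untested on PADS; they are chosen so that 4 x 8 = 32 nodes stays under the 48-node figure:

```xml
<pool handle="pbs">
  <profile namespace="globus" key="maxwalltime">00:00:10</profile>
  <profile namespace="globus" key="maxtime">1800</profile>
  <execution provider="coaster" url="none" jobManager="local:pbs"/>
  <profile namespace="globus" key="workersPerNode">1</profile>
  <!-- Assumed coaster keys: at most 4 blocks of at most 8 nodes each (32 nodes total) -->
  <profile namespace="globus" key="slots">4</profile>
  <profile namespace="globus" key="maxnodes">8</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
  <filesystem provider="local"/>
  <workdirectory>$(pwd)</workdirectory>
</pool>
```

Whether PBS on PADS counts the request in nodes or in cores would change which cap is appropriate, so the numbers above would need adjusting accordingly.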
- Mike
login1$ qstat -q
server: svc.pads.ci.uchicago.edu
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
short              --      --    04:00:00  --     0   0 --   E R
extended           --      --       --     --     2   0 --   E R
fast               --      --    01:00:00  --     0   0 --   E R
long               --      --    24:00:00  --     0   0 --   E R
                                              ----- -----
                                                  2     0
login1$
----- wilde at mcs.anl.gov wrote:
> Mihael, running a 1000 job workflow with minimal specs in the
> sites.xml entry for coasters on PADS gave the error "(qsub reported an
> exit code of 188).
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible
> nodes" (full trace below). The sites entry was:
>
> <pool handle="pbs">
> <profile namespace="globus" key="maxwalltime">00:00:10</profile>
> <profile namespace="globus" key="maxtime">1800</profile>
> <execution provider="coaster" url="none" jobManager="local:pbs"/>
> <profile namespace="globus" key="workersPerNode">1</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <profile namespace="karajan" key="jobThrottle">5.99</profile>
> <filesystem provider="local"/>
> <workdirectory>$(pwd)</workdirectory>
> </pool>
>
> - Mike
>
>
> Swift running in SwiftR.run.056
> Swift svn swift-r3202 cog-r2683
>
> RunID: 20100225-0813-xn3bajnc
> Progress:
> Progress: uninitialized:1
> Progress: Selecting site:399 Stage in:600 Submitting:1
> Progress: Selecting site:399 Stage in:529 Submitting:2 Submitted:70
> Progress: Selecting site:399 Stage in:413 Submitted:188
> Progress: Selecting site:399 Stage in:326 Submitted:275
> Worker task failed: Error submitting block task
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job: Could not submit job (qsub reported an exit code of
> 188).
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible
> nodes
>
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
> at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
> at
> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66)
> Caused by:
> org.globus.cog.abstraction.impl.scheduler.common.ProcessException:
> Could not submit job (qsub reported an exit code of 188).
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible
> nodes
>
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:86)
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
> ... 3 more
> Failed to shut down block
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Can only cancel an active task
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:149)
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
> at
> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:96)
> at
> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:85)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:271)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:252)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:151)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Exception caught in block processor
> java.util.ConcurrentModificationException
> at
> java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
> at java.util.AbstractList$Itr.next(AbstractList.java:343)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:149)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Exception caught in block processor
> java.util.ConcurrentModificationException
> at
> java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
> at java.util.AbstractList$Itr.next(AbstractList.java:343)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:149)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:436)
> at
> org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:78)
> Cleaning up...
> Shutting down service at https://192.5.86.5:50002
> Got channel MetaChannel: 1151109057 -> null
> +Canceling job 4970.svc.pads.ci.uchicago.edu
> Canceling job 4971.svc.pads.ci.uchicago.edu
> Canceling job 4972.svc.pads.ci.uchicago.edu
> Canceling job 4973.svc.pads.ci.uchicago.edu
> Canceling job 4974.svc.pads.ci.uchicago.edu
> Done
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel