[Swift-user] using queuing system
Marc Parisien
mparisien at uchicago.edu
Tue Nov 9 13:50:23 CST 2010
Hi All,
I'm sorry, but I really don't know what I'm doing. This is what I want to do: make Swift use the queuing system on the IBI cluster (qsub/qstat).
IBI has 8-core nodes, and I want to use at most 4 cores per node.
I made a sites.xml file like this:
--------------------------------------------------------
<config>
<pool handle="pbs">
<execution provider="coaster" url="none" jobmanager="local:pbs"/>
<profile namespace="globus" key="workersPerNode">4</profile>
<profile namespace="globus" key="slots">4096</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<!-- run up to 256 app() tasks at once: (2.55*100)+1 -->
<profile namespace="karajan" key="jobThrottle">2.55</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<filesystem provider="local" url="none"/>
<workdirectory>/cchome/mparis_x/swift</workdirectory>
</pool>
</config>
--------------------------------------------------------
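As a sanity check on the numbers in this sites.xml, here is a minimal Python sketch of the arithmetic. The (jobThrottle*100)+1 formula is taken from the comment in the file itself; the function names are hypothetical, and the node count assumes each coaster worker runs one app() task at a time, which I have not verified.

```python
import math


def max_concurrent_tasks(job_throttle: float) -> int:
    """Task limit per the comment in sites.xml: (jobThrottle * 100) + 1."""
    # round() rather than int(): 2.55 * 100 is 254.99999999999997 in
    # floating point, and truncating would give 254 instead of 255.
    return round(job_throttle * 100) + 1


def nodes_needed(tasks: int, workers_per_node: int) -> int:
    """Nodes required if every worker runs one task at a time (assumption)."""
    return math.ceil(tasks / workers_per_node)


tasks = max_concurrent_tasks(2.55)   # jobThrottle from sites.xml
nodes = nodes_needed(tasks, 4)       # workersPerNode from sites.xml
print(tasks, nodes)
```

With jobThrottle=2.55 and workersPerNode=4 this gives 256 tasks spread over 64 nodes, far more than the 4 tasks my loop actually generates.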
Here's the shell trace of the exec:
--------------------------------------------------------
Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified locally)
RunID: 20101109-1328-he8pfis2
Progress:
Progress: Selecting site:3 Initializing site shared directory:1
Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1
Progress: Stage in:1 Submitting:3
Progress: Submitted:3 Active:1
Progress: Active:4
Worker task failed:
org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode file not found 5 queue polls after the job was reported done
at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
at java.lang.Thread.run(Unknown Source)
Progress: Active:3 Failed but can retry:1
Progress: Stage in:1 Failed but can retry:3
Progress: Stage in:1 Active:2 Failed but can retry:1
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:3 Checking status:1
Progress: Active:2 Checking status:1 Finished successfully:1
Progress: Checking status:1 Finished successfully:3
Progress: Checking status:1 Finished successfully:4
Final status: Finished successfully:5
Cleaning up...
Shutting down service at https://172.16.0.149:52228
Got channel MetaChannel: 1235930463[1821457857: {}] -> null[1821457857: {}]
+ Done
--------------------------------------------------------
Q1. When I log into the node that processes the job, I see that it has spawned 8 processes, but my Swift script should spawn at most 4 (because my for loop is [0:3]). Why? Is it because of the retries?
Q2. The worker task seems to fail, but then seems to get back on its feet (Active:4)? And Active:4 can't be right: top tells me there are 8 processes running at the same time!
Q3. Swift returns and qstat shows that I don't have anything queued, but if I log into the node that ran the job, I still see active processes:
[mparis_x@compute-14-41 ~]$ ps aux | grep mparis_x
mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters
-> Are these going to "finish" by themselves at some point? They just seem to hang there.
Thanks for your time,
Marc.