[Swift-user] using queuing system

Tue Nov 9 13:50:23 CST 2010

Hi All,

  I'm sorry, but I really don't know what I'm doing. This is what I want to do: make Swift use the queuing system on the IBI cluster (qsub/qstat).

  IBI has 8-core nodes, and I want to use at most 4 cores per node.
  I made a sites.xml file like this:

--------------------------------------------------------
<config>
  <pool handle="pbs">
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>

    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="slots">4096</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>

    <!-- run up to 256 app() tasks at once: (2.55*100)+1 -->
    <profile namespace="karajan" key="jobThrottle">2.55</profile>   
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="local" url="none"/>
    <workdirectory>/cchome/mparis_x/swift</workdirectory>
  </pool>
</config>
--------------------------------------------------------
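As a sanity check on the throttle setting, here is a minimal sketch of the arithmetic stated in the sites.xml comment above, (jobThrottle * 100) + 1. It only restates that comment's formula and is not taken from the Swift sources; the function names max_concurrent_tasks and throttle_for are hypothetical helpers for illustration.

```python
from decimal import Decimal


def max_concurrent_tasks(job_throttle: str) -> int:
    """Apply the formula from the sites.xml comment: (jobThrottle * 100) + 1."""
    # Decimal avoids binary floating-point error: as a float,
    # 2.55 * 100 is slightly below 255 and would truncate to 254.
    return int(Decimal(job_throttle) * 100) + 1


def throttle_for(tasks: int) -> Decimal:
    """Invert the same formula: jobThrottle = (tasks - 1) / 100."""
    return Decimal(tasks - 1) / Decimal(100)


if __name__ == "__main__":
    print(max_concurrent_tasks("2.55"))  # value used in the config above
    print(throttle_for(4))               # throttle that the formula maps to 4 tasks
```

Under that formula, a jobThrottle of 2.55 allows far more than 4 concurrent app() tasks, so the throttle is not what limits this run to 4.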

Here's the shell trace of the exec:

--------------------------------------------------------
Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified locally)

RunID: 20101109-1328-he8pfis2
Progress:
Progress:  Selecting site:3  Initializing site shared directory:1
Progress:  Selecting site:2  Initializing site shared directory:1  Stage in:1
Progress:  Stage in:1  Submitting:3
Progress:  Submitted:3  Active:1
Progress:  Active:4
Worker task failed: 
org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode file not found 5 queue polls after the job was reported done
	at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
	at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
	at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
	at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
	at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
	at java.lang.Thread.run(Unknown Source)
Progress:  Active:3 Failed but can retry:1
Progress:  Stage in:1 Failed but can retry:3
Progress:  Stage in:1  Active:2 Failed but can retry:1
Progress:  Active:4
Progress:  Active:4
Progress:  Active:4
Progress:  Active:4
Progress:  Active:3  Checking status:1
Progress:  Active:2  Checking status:1  Finished successfully:1
Progress:  Checking status:1  Finished successfully:3
Progress:  Checking status:1  Finished successfully:4
Final status:  Finished successfully:5
Cleaning up...
Shutting down service at https://172.16.0.149:52228
Got channel MetaChannel: 1235930463[1821457857: {}] -> null[1821457857: {}]
+ Done
--------------------------------------------------------

Q1. When I log into the node that processes the job, I see that it has spawned 8 processes, but my Swift script should spawn at most 4 (because my for loop is [0:3]). Why? Is it because of the retries?

Q2. The worker task seems to fail, but then seems to get back on its feet (Active:4)? Yet although Swift reports Active:4, top tells me there are 8 processes running at the same time!

Q3. Swift returns and qstat shows that I don't have anything queued, but if I log into the node that ran the job, I still see active processes:

[mparis_x@compute-14-41 ~]$ ps aux | grep mparis_x
mparis_x 12754  0.0  0.0   8696  1004 ?        Ss   13:29   0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
mparis_x 12755  0.0  0.0  39360  6368 ?        S    13:29   0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters

-> Are these going to finish by themselves at some point? They just seem to hang there.
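A minimal sketch of how the ps output above could be filtered for leftover coaster worker processes. It assumes the standard `ps aux` layout (ten fixed columns followed by the command); the function name leftover_workers and the ".globus/coasters" marker string are illustrative choices, not part of Swift.

```python
# Sample taken verbatim from the ps output above.
SAMPLE = """\
mparis_x 12754  0.0  0.0   8696  1004 ?        Ss   13:29   0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
mparis_x 12755  0.0  0.0  39360  6368 ?        S    13:29   0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters
"""


def leftover_workers(ps_output, user, marker=".globus/coasters"):
    """Return (pid, command) pairs for the user's processes whose command contains marker."""
    found = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME, then the full command.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user or not fields[1].isdigit():
            continue  # skip header lines, short lines, and other users
        if marker in fields[10]:
            found.append((int(fields[1]), fields[10]))
    return found


if __name__ == "__main__":
    for pid, command in leftover_workers(SAMPLE, "mparis_x"):
        print(pid, command.split()[1])  # PID and the worker script path
```

On a live node, the same function could be fed the text captured from `ps aux` instead of the sample string.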

Thanks for your time,
    Marc.