[Swift-user] using queuing system

Michael Wilde wilde at mcs.anl.gov
Tue Nov 9 14:35:00 CST 2010


Hi Marc,

The IBI cluster, I think, is an SGE machine, not PBS.

I had previously sent you a non-coaster-based sites entry that looked like this:

  <pool handle="sge">
    <execution provider="sge" url="none" />
    <profile namespace="globus" key="pe">threaded</profile>
    <profile key="jobThrottle" namespace="karajan">.49</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local" url="none" />
    <workdirectory>$(pwd)/swiftwork</workdirectory>
  </pool>

So change the coaster version you posted (below) to this:

  <pool handle="sge">
    <execution provider="coaster" url="none" jobmanager="local:sge"/>
    <profile namespace="globus" key="pe">threaded</profile>
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="slots">128</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxnodes">1</profile>
    <profile namespace="karajan" key="jobThrottle">5.11</profile>   
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local" url="none"/>
    <workdirectory>/cchome/mparis_x/swift</workdirectory>
  </pool>

The changes above are:
- added the "pe" profile, which SGE needs ("parallel environment"); "threaded" seems to be the right PE for ibicluster
- changed slots to 128: submit up to 128 SGE jobs at once
- nodeGranularity 1, maxnodes 1: each job should request exactly 1 node
- raised the throttle to allow up to 512 Swift app() calls to run at once (4 workersPerNode x 128 slots); see the arithmetic just below
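
For reference, here is the arithmetic behind that throttle value, using the
same (throttle * 100) + 1 rule as the comment in your original sites.xml:

  max concurrent app() calls = workersPerNode * slots = 4 * 128 = 512
  jobThrottle = (512 - 1) / 100 = 5.11
  check: (5.11 * 100) + 1 = 512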

Then, change your tc.data file to read "sge" instead of "pbs".
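
For example, a tc.data line like this one (the app name and path here are
just placeholders for your own entries):

  pbs    myapp    /cchome/mparis_x/bin/myapp    INSTALLED    INTEL32::LINUX    null

would become:

  sge    myapp    /cchome/mparis_x/bin/myapp    INSTALLED    INTEL32::LINUX    null

Only the first column (the site name) changes; it has to match the pool
handle in sites.xml.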

Lastly, set the following in a file named "cf":

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false

Most of these are debugging settings: they turn off retries and lazy errors
and keep the wrapper logs and the site work directory, so failures are
easier to diagnose.

Then run Swift using a command similar to this:

swift -config cf -sites.file sites.xml -tc.file tc.data yourscript.swift -args=etc

(changing the file names to match yours)
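
While the run is active, you can watch the SGE jobs that Swift submits with
the usual queue commands, for example:

  qstat -u mparis_x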

I will need to add a config file into my "latest/" Swift release on ibicluster to retain SGE submit files and stdout/err logs. But for now, you can proceed as above, without that.

More notes below...

----- Original Message -----
> Hi All,
> 
> I'm sorry, but I really don't know what I'm doing. This is what I want
> to do: make Swift use the queuing system on the IBI cluster
> (qsub/qstat).
> 
> 
> 
> I made a sites.xml file, like this:
> IBI has 8-core nodes and I want to use at most 4 cores per node.
> 
> --------------------------------------------------------
> <config>
> <pool handle="pbs">
> <execution provider="coaster" url="none" jobmanager="local:pbs"/>
> 
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="slots">4096</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> 
> <!-- run up to 256 app() tasks at once: (2.55*100)+1 -->
> <profile namespace="karajan" key="jobThrottle">2.55</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> 
> <filesystem provider="local" url="none"/>
> <workdirectory>/cchome/mparis_x/swift</workdirectory>
> </pool>
> </config>
> --------------------------------------------------------
> 
> 
> Here's the shell trace of the exec:
> 
> --------------------------------------------------------
> Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified
> locally)
> 
> RunID: 20101109-1328-he8pfis2
> Progress:
> Progress: Selecting site:3 Initializing site shared directory:1
> Progress: Selecting site:2 Initializing site shared directory:1 Stage
> in:1
> Progress: Stage in:1 Submitting:3
> Progress: Submitted:3 Active:1
> Progress: Active:4
> Worker task failed:
> org.globus.cog.abstraction.impl.scheduler.common.ProcessException:
> Exitcode file not found 5 queue polls after the job was reported done
> at

I suspect this is because Swift submitted PBS-style jobs to SGE, based on the incorrect sites.xml attributes.

> at java.lang.Thread.run(Unknown Source)
> Progress: Active:3 Failed but can retry:1
> Progress: Stage in:1 Failed but can retry:3
> Progress: Stage in:1 Active:2 Failed but can retry:1
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:3 Checking status:1
> Progress: Active:2 Checking status:1 Finished successfully:1
> Progress: Checking status:1 Finished successfully:3
> Progress: Checking status:1 Finished successfully:4
> Final status: Finished successfully:5
> Cleaning up...
> Shutting down service at https://172.16.0.149:52228
> Got channel MetaChannel: 1235930463[1821457857: {}] ->
> null[1821457857: {}]
> + Done
> --------------------------------------------------------
> 
> 
> 
> Q1. When I log into the node that processes the job, I see that it has
> spawned 8 processes, but my swift script should only spawn at most 4
> (because my for loop is [0:3]). Why? Because of the retries??

I'm not sure. There should be one swift worker.pl process running per node. If the problem persists, please send a snapshot of what you see in ps using:
  ps -fjH -u mparis_x

> Q2. The worker task seems to fail; but then seems to come back on its
> feet (Active:4)? Active:4... no no, top tells me there are 8 processes
> running at the same time!

> Q3. Swift returns, qstat shows that I don't have anything queued,
> but if I log into the node that handled the job, I still see active
> processes:
> 
> [mparis_x at compute-14-41 ~]$ ps aux | grep mparis_x
> mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash
> /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
> mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl
> /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl
> http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608
> 1109-290113-000000 /cchome/mparis_x/.globus/coasters
> 
> -> are these going to "finish" anytime by themselves... they just seem
> to hang there...

I'll look at this with you if it persists once we correct the sites file.

> Thanks for your time,
> Marc.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



