[Swift-user] Coaster jobs are not running with expected parallelism

Michael Wilde wilde at mcs.anl.gov
Tue Jan 19 13:26:36 CST 2010


I'm running a script on PADS that emits 20 jobs in parallel with a foreach().

I set coasters to use 8 workers per node, and my throttle to allow 64 
jobs to run in parallel, so I would expect *at least* 8 jobs to be 
running in parallel. But what I see is:

- 3 PBS worker jobs start
- 2 of these have a single core (c19/0 and c19/1)
- 1 of these has 18 *nodes*
- all 20 jobs show up as submitted or active, but never more than *3* 
active (note that 1 job is a setup job and completes right away).

Below is info on this run.

Any idea why the coaster provider is behaving this way?

- Mike

pool entry is:

   <pool handle="pbs">
     <profile namespace="globus" key="maxwalltime">00:05:00</profile>
     <profile namespace="globus" key="maxtime">1800</profile>
     <execution provider="coaster" url="none" jobManager="local:pbs"/>
     <profile namespace="globus" key="coastersPerNode">8</profile>
     <profile namespace="karajan" key="jobThrottle">.63</profile>
     <profile namespace="karajan" key="initialScore">10000</profile>
     <gridftp  url="local://localhost" />
     <workdirectory>$rundir</workdirectory>
   </pool>
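
A quick sketch of the arithmetic behind the expectation above. It assumes the usual Swift rule that a site's concurrent-job limit is jobThrottle * 100 + 1; the function names are made up for illustration and are not part of Swift or the coaster provider.

```python
def throttle_limit(job_throttle):
    # Assumed rule: concurrent jobs allowed per site = jobThrottle * 100 + 1.
    return int(round(job_throttle * 100)) + 1


def expected_parallelism(job_throttle, workers_per_node, nodes, jobs):
    # Parallelism is capped by the throttle, by the available worker
    # slots (workers per node * nodes), and by the number of jobs.
    slots = workers_per_node * nodes
    return min(throttle_limit(job_throttle), slots, jobs)


limit = throttle_limit(0.63)                      # jobThrottle from the pool entry
one_node = expected_parallelism(0.63, 8, 1, 20)   # coastersPerNode=8, 20 jobs
print(limit, one_node)
```

With jobThrottle .63 the limit comes out to 64, and even a single node with 8 workers should allow 8 of the 20 jobs to be active at once.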

Running on login2, I see:

/home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
Running from host with compute-node reachable address of 172.5.86.6
Running in /home/wilde/protests/run.loops.1498
protlib2 home is /home/wilde/protlib2
Swift svn swift-r3202 cog-r2682

RunID: 20100119-1309-l72sbpg8
Progress:
Progress:  Checking status:1
Progress:  Selecting site:18  Initializing site shared directory:1  Stage in:1  Finished successfully:1
Progress:  Submitting:19  Submitted:1  Finished successfully:1
Progress:  Submitted:19  Active:1  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:3  Finished successfully:1
Progress:  Submitted:17  Active:2  Checking status:1  Finished successfully:1
Progress:  Submitted:15  Active:3  Stage out:1  Finished successfully:2
Progress:  Submitted:15  Active:3  Finished successfully:3

...and this keeps up - the script is progressing but only 3 jobs are 
running at a time. (Each takes about 5 minutes)
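
To confirm the ceiling without reading every line, the Active count can be pulled out of the progress output. This is a minimal sketch that assumes the "Progress:" line format shown above; the helper names are hypothetical and the sample lines are copied from this run.

```python
import re


def active_count(line):
    # Return the number after "Active:" in a progress line, or 0 if absent.
    match = re.search(r"\bActive:(\d+)", line)
    return int(match.group(1)) if match else 0


def peak_active(lines):
    # Highest Active count seen across all progress lines.
    return max((active_count(line) for line in lines), default=0)


sample = [
    "Progress:  Submitting:19  Submitted:1  Finished successfully:1",
    "Progress:  Submitted:19  Active:1  Finished successfully:1",
    "Progress:  Submitted:17  Active:3  Finished successfully:1",
    "Progress:  Submitted:15  Active:3  Stage out:1  Finished successfully:2",
]
peak = peak_active(sample)
print(peak)
```

For the lines above the peak is 3, far below the 8 or more expected.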

PBS shows:

login2$ qstat -n

svc.pads.ci.uchicago.edu:
 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
912.svc.pads.ci.     wilde    extended null              14877     1  --     --  00:29 R   --
    c19
913.svc.pads.ci.     wilde    extended null                --     18  --     --  00:29 R   --
    c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
914.svc.pads.ci.     wilde    extended null              15135     1  --     --  00:29 R   --
    c19
login2$ qstat -f
Job Id: 912.svc.pads.ci.uchicago.edu
     Job_Name = null
     Job_Owner = wilde at login2.pads.ci.uchicago.edu
     resources_used.cput = 00:00:58
     resources_used.mem = 165768kb
     resources_used.vmem = 757612kb
     resources_used.walltime = 00:01:14
     job_state = R
     queue = extended
     server = svc.pads.ci.uchicago.edu
     Checkpoint = u
     ctime = Tue Jan 19 13:09:16 2010
     Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58
         66754363410172037.submit.stderr
     exec_host = c19.pads.ci.uchicago.edu/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = n
     mtime = Tue Jan 19 13:09:18 2010
     Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
         866754363410172037.submit.stdout
     Priority = 0
     qtime = Tue Jan 19 13:09:16 2010
     Rerunable = True
     Resource_List.nodect = 1
     Resource_List.nodes = 1
     Resource_List.walltime = 00:29:00
     session_id = 14877
     Shell_Path_List = /bin/sh
     Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
         PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
         oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
         8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
         :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
         e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
         6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
         1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
         ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
         /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
         r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
         swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
         svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
         PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
         PBS_SERVER=login2.pads.ci.uchicago.edu,
         PBS_O_HOST=login2.pads.ci.uchicago.edu,
         PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
         PBS_O_QUEUE=extended
     etime = Tue Jan 19 13:09:16 2010
     submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
     start_time = Tue Jan 19 13:09:17 2010
     start_count = 1

Job Id: 913.svc.pads.ci.uchicago.edu
     Job_Name = null
     Job_Owner = wilde at login2.pads.ci.uchicago.edu
     resources_used.cput = 00:00:36
     resources_used.mem = 166452kb
     resources_used.vmem = 765732kb
     resources_used.walltime = 00:00:51
     job_state = R
     queue = extended
     server = svc.pads.ci.uchicago.edu
     Checkpoint = u
     ctime = Tue Jan 19 13:09:16 2010
     Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89
         90749016166185054.submit.stderr
     exec_host = c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
         .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
         /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
         chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
         2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
         go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
         s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
         u/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = n
     mtime = Tue Jan 19 13:09:55 2010
     Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8
         990749016166185054.submit.stdout
     Priority = 0
     qtime = Tue Jan 19 13:09:16 2010
     Rerunable = True
     Resource_List.nodect = 18
     Resource_List.nodes = 18
     Resource_List.walltime = 00:29:00
     session_id = 13956
     Shell_Path_List = /bin/sh
     Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
         PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
         oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
         8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
         :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
         e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
         6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
         1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
         ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
         /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
         r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
         swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
         svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
         PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
         PBS_SERVER=login2.pads.ci.uchicago.edu,
         PBS_O_HOST=login2.pads.ci.uchicago.edu,
         PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
         PBS_O_QUEUE=extended
     etime = Tue Jan 19 13:09:16 2010
     submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
     start_time = Tue Jan 19 13:09:18 2010
     start_count = 1

Job Id: 914.svc.pads.ci.uchicago.edu
     Job_Name = null
     Job_Owner = wilde at login2.pads.ci.uchicago.edu
     resources_used.cput = 00:00:58
     resources_used.mem = 165760kb
     resources_used.vmem = 757612kb
     resources_used.walltime = 00:01:11
     job_state = R
     queue = extended
     server = svc.pads.ci.uchicago.edu
     Checkpoint = u
     ctime = Tue Jan 19 13:09:18 2010
     Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54
         46269528052212820.submit.stderr
     exec_host = c19.pads.ci.uchicago.edu/1
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = n
     mtime = Tue Jan 19 13:09:20 2010
     Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
         446269528052212820.submit.stdout
     Priority = 0
     qtime = Tue Jan 19 13:09:18 2010
     Rerunable = True
     Resource_List.nodect = 1
     Resource_List.nodes = 1
     Resource_List.walltime = 00:29:00
     session_id = 15135
     Shell_Path_List = /bin/sh
     Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
         PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
         oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
         8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
         :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
         e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
         6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
         1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
         ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
         /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
         r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
         swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
         svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
         PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
         PBS_SERVER=login2.pads.ci.uchicago.edu,
         PBS_O_HOST=login2.pads.ci.uchicago.edu,
         PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
         PBS_O_QUEUE=extended
     etime = Tue Jan 19 13:09:18 2010
     submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
     start_time = Tue Jan 19 13:09:20 2010
     start_count = 1

login2$
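
For what it's worth, the exec_host values above can be tallied to see how many slots PBS actually granted. This is only a sketch: the host names are copied from the qstat output with the ".pads.ci.uchicago.edu" suffix dropped, the function name is hypothetical, and it assumes each "node/N" entry in exec_host is one slot.

```python
from collections import Counter

# exec_host per PBS job, abbreviated from the qstat -f output above.
EXEC_HOSTS = {
    "912": "c19/0",
    "913": "+".join(
        f"{node}/0"
        for node in (
            "c46 c45 c44 c06 c07 c08 c10 c12 c14 c17 "
            "c22 c24 c28 c34 c35 c37 c39 c40"
        ).split()
    ),
    "914": "c19/1",
}


def slots_by_node(exec_hosts):
    # Count "node/N" entries per node across all jobs (assumed: one slot each).
    counts = Counter()
    for value in exec_hosts.values():
        for entry in value.split("+"):
            node, _, _slot = entry.partition("/")
            counts[node] += 1
    return counts


counts = slots_by_node(EXEC_HOSTS)
total_slots = sum(counts.values())
busiest = max(counts.values())
print(total_slots, len(counts), busiest)
```

That tally is 20 slots across 19 nodes, with no node contributing more than 2. Whether the workers on those nodes are each supposed to run 8 jobs is exactly what is unclear here.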