[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 13:26:36 CST 2010
I'm running a script on PADS that emits 20 jobs in parallel with a foreach().
I set coasters to use 8 workers per node, and my throttle to allow 64
jobs to run in parallel, so I would expect *at least* 8 jobs to be
running in parallel. But what I see is:
- 3 PBS worker jobs start
- 2 of these have a single core (c19/0 and c19/1)
- 1 of these has 18 *nodes*
- all 20 jobs show up as submitted or active, but never more than *3*
active (note that 1 job is a setup job and completes right away).
Below is info on this run.
Any idea why the coaster provider is behaving this way?
- Mike
pool entry is:
<pool handle="pbs">
<profile namespace="globus" key="maxwalltime">00:05:00</profile>
<profile namespace="globus" key="maxtime">1800</profile>
<execution provider="coaster" url="none" jobManager="local:pbs"/>
<profile namespace="globus" key="coastersPerNode">8</profile>
<profile namespace="karajan" key="jobThrottle">.63</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<gridftp url="local://localhost" />
<workdirectory>$rundir</workdirectory>
</pool>
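As a rough check of the numbers in this pool entry, the sketch below works out how much concurrency the settings should permit. It assumes the rule from the Swift user guide that a site may run jobThrottle * 100 + 1 jobs at once; the function names are made up for illustration and are not part of Swift.

```python
# Hypothetical helper names. The throttle rule is an assumption taken from
# the Swift user guide: concurrent jobs = jobThrottle * 100 + 1.

def throttle_limit(job_throttle):
    """Maximum concurrent jobs the site throttle allows under the assumed rule."""
    # round() guards against floating-point noise in job_throttle * 100.
    return int(round(job_throttle * 100)) + 1


def expected_concurrency(jobs, job_throttle, workers_per_node, nodes):
    """Jobs that could run at once, capped by the throttle and by worker slots."""
    slots = workers_per_node * nodes
    return min(jobs, throttle_limit(job_throttle), slots)


limit = throttle_limit(0.63)                      # jobThrottle from the pool entry
one_node = expected_concurrency(20, 0.63, 8, 1)   # 20 jobs, coastersPerNode = 8

print("throttle limit:", limit)
print("expected on a single node:", one_node)
```

Under these assumptions the throttle is not the bottleneck for 20 jobs, and even a single node running 8 workers should keep 8 jobs active.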
Running on login2, I see:
/home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
Running from host with compute-node reachable address of 172.5.86.6
Running in /home/wilde/protests/run.loops.1498
protlib2 home is /home/wilde/protlib2
Swift svn swift-r3202 cog-r2682
RunID: 20100119-1309-l72sbpg8
Progress:
Progress: Checking status:1
Progress: Selecting site:18 Initializing site shared directory:1
Stage in:1 Finished successfully:1
Progress: Submitting:19 Submitted:1 Finished successfully:1
Progress: Submitted:19 Active:1 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:17 Active:2 Checking status:1 Finished successfully:1
Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2
Progress: Submitted:15 Active:3 Finished successfully:3
...and this keeps up - the script is progressing but only 3 jobs are
running at a time. (Each takes about 5 minutes)
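To confirm the ceiling from a longer log without reading it by eye, something like the following could pull the Active count out of each Progress line. It is only a sketch: the sample lines are copied from the output above, and the parsing assumes every state is printed as "name:count".

```python
import re

# A few lines copied from the progress output above.
LOG = """\
Progress: Submitting:19 Submitted:1 Finished successfully:1
Progress: Submitted:19 Active:1 Finished successfully:1
Progress: Submitted:17 Active:3 Finished successfully:1
Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2
Progress: Submitted:15 Active:3 Finished successfully:3
"""

# State names may contain spaces ("Finished successfully"), so match
# "<words>:<count>" pairs instead of splitting on whitespace.
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return {state: count} for one line that starts with 'Progress:'."""
    body = line.split("Progress:", 1)[1]
    return {name.strip(): int(count) for name, count in STATE_RE.findall(body)}


def peak_active(log_text):
    """Largest Active count seen on any Progress line (0 if none)."""
    counts = [
        parse_progress(line).get("Active", 0)
        for line in log_text.splitlines()
        if line.startswith("Progress:")
    ]
    return max(counts, default=0)


print("peak active jobs:", peak_active(LOG))
```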
PBS shows:
login2$ qstat -n
svc.pads.ci.uchicago.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
912.svc.pads.ci.     wilde    extended null              14877     1  --     -- 00:29 R    --
   c19
913.svc.pads.ci.     wilde    extended null                 --    18  --     -- 00:29 R    --
   c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
914.svc.pads.ci.     wilde    extended null              15135     1  --     -- 00:29 R    --
   c19
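For comparison with the 3 active jobs, the sketch below tallies the nodes that PBS reports for these three worker jobs and the worker slots they would provide. It assumes, purely for illustration, that every allocated node starts the 8 workers requested by coastersPerNode; whether the coaster provider actually does that is the open question here.

```python
# Host lists copied from the qstat -n output above, keyed by PBS job number.
NODE_LISTS = {
    "912": "c19",
    "913": "c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40",
    "914": "c19",
}
WORKERS_PER_NODE = 8  # coastersPerNode from the pool entry


def hosts(node_list):
    """Split a PBS node list such as 'c46+c45' into individual host names."""
    return node_list.split("+")


# Nodes allocated to each PBS job.
allocated = {job: len(hosts(nodes)) for job, nodes in NODE_LISTS.items()}
total_nodes = sum(allocated.values())

# c19 backs two separate single-node jobs, so count distinct hosts as well.
distinct_hosts = len({h for nodes in NODE_LISTS.values() for h in hosts(nodes)})

# Assumption: each allocated node runs WORKERS_PER_NODE workers.
slots = total_nodes * WORKERS_PER_NODE

print("nodes per PBS job:", allocated)
print("total allocated nodes:", total_nodes)
print("distinct hosts:", distinct_hosts)
print("worker slots if every node ran 8 workers:", slots)
```

The 3 active jobs match the number of PBS jobs, not the number of allocated nodes or worker slots.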
login2$ qstat -f
Job Id: 912.svc.pads.ci.uchicago.edu
Job_Name = null
Job_Owner = wilde at login2.pads.ci.uchicago.edu
resources_used.cput = 00:00:58
resources_used.mem = 165768kb
resources_used.vmem = 757612kb
resources_used.walltime = 00:01:14
job_state = R
queue = extended
server = svc.pads.ci.uchicago.edu
Checkpoint = u
ctime = Tue Jan 19 13:09:16 2010
Error_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58
66754363410172037.submit.stderr
exec_host = c19.pads.ci.uchicago.edu/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Tue Jan 19 13:09:18 2010
Output_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
866754363410172037.submit.stdout
Priority = 0
qtime = Tue Jan 19 13:09:16 2010
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:29:00
session_id = 14877
Shell_Path_List = /bin/sh
Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
:/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
PBS_SERVER=login2.pads.ci.uchicago.edu,
PBS_O_HOST=login2.pads.ci.uchicago.edu,
PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
PBS_O_QUEUE=extended
etime = Tue Jan 19 13:09:16 2010
submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
start_time = Tue Jan 19 13:09:17 2010
start_count = 1
Job Id: 913.svc.pads.ci.uchicago.edu
Job_Name = null
Job_Owner = wilde at login2.pads.ci.uchicago.edu
resources_used.cput = 00:00:36
resources_used.mem = 166452kb
resources_used.vmem = 765732kb
resources_used.walltime = 00:00:51
job_state = R
queue = extended
server = svc.pads.ci.uchicago.edu
Checkpoint = u
ctime = Tue Jan 19 13:09:16 2010
Error_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89
90749016166185054.submit.stderr
exec_host =
c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
.ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
/0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
u/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Tue Jan 19 13:09:55 2010
Output_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8
990749016166185054.submit.stdout
Priority = 0
qtime = Tue Jan 19 13:09:16 2010
Rerunable = True
Resource_List.nodect = 18
Resource_List.nodes = 18
Resource_List.walltime = 00:29:00
session_id = 13956
Shell_Path_List = /bin/sh
Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
:/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
PBS_SERVER=login2.pads.ci.uchicago.edu,
PBS_O_HOST=login2.pads.ci.uchicago.edu,
PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
PBS_O_QUEUE=extended
etime = Tue Jan 19 13:09:16 2010
submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
start_time = Tue Jan 19 13:09:18 2010
start_count = 1
Job Id: 914.svc.pads.ci.uchicago.edu
Job_Name = null
Job_Owner = wilde at login2.pads.ci.uchicago.edu
resources_used.cput = 00:00:58
resources_used.mem = 165760kb
resources_used.vmem = 757612kb
resources_used.walltime = 00:01:11
job_state = R
queue = extended
server = svc.pads.ci.uchicago.edu
Checkpoint = u
ctime = Tue Jan 19 13:09:18 2010
Error_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54
46269528052212820.submit.stderr
exec_host = c19.pads.ci.uchicago.edu/1
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Tue Jan 19 13:09:20 2010
Output_Path =
login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
446269528052212820.submit.stdout
Priority = 0
qtime = Tue Jan 19 13:09:18 2010
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:29:00
session_id = 15135
Shell_Path_List = /bin/sh
Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
:/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
PBS_SERVER=login2.pads.ci.uchicago.edu,
PBS_O_HOST=login2.pads.ci.uchicago.edu,
PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
PBS_O_QUEUE=extended
etime = Tue Jan 19 13:09:18 2010
submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
start_time = Tue Jan 19 13:09:20 2010
start_count = 1
login2$