[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 13:38:44 CST 2010
On 1/19/10 1:32 PM, Mihael Hategan wrote:
> Maybe PBS is lying about that 18 node job.
I would be surprised if that's the case. But even if it had *1* node you
would think it would run at least 8 jobs in parallel.
I'm confused why it has started three jobs, two with only one core and
one with 18 nodes.
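As a rough check on that expectation, here is a minimal Python sketch of the arithmetic. It assumes each coaster worker runs one job at a time, and that a jobThrottle of t allows about t*100 + 1 concurrent jobs (which is how .63 maps to the 64 jobs mentioned in this thread). expected_parallelism is a hypothetical helper, not part of Swift.

```python
# Hypothetical helper (not a Swift API): upper bound on jobs running at once.
def expected_parallelism(jobs, nodes, workers_per_node, job_throttle):
    # Assumption: the site throttle allows about job_throttle * 100 + 1 jobs.
    throttle_limit = round(job_throttle * 100) + 1
    # Assumption: every worker on every allocated node runs one job at a time.
    worker_slots = nodes * workers_per_node
    return min(jobs, throttle_limit, worker_slots)

# A single node with 8 workers per node should still allow 8 of the 20 jobs.
print(expected_parallelism(jobs=20, nodes=1, workers_per_node=8, job_throttle=0.63))   # 8
# The 1 + 1 + 18 nodes that PBS reports should allow all 20.
print(expected_parallelism(jobs=20, nodes=20, workers_per_node=8, job_throttle=0.63))  # 20
```

Either way the estimate is well above the 3 active jobs actually observed.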
But the 18 node job just hit its wall time limit; now coasters seems to
have started a 10 node job:
login2$ qstat -n
svc.pads.ci.uchicago.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
912.svc.pads.ci.     wilde    extended null              14877     1  --     -- 00:29 R 00:25
   c19
915.svc.pads.ci.     wilde    extended null               9028     1  --     -- 00:29 R    --
   c38
916.svc.pads.ci.     wilde    extended null                 --    10  --     -- 00:29 R    --
   c45+c44+c06+c07+c08+c10+c12+c14+c17+c22
login2$
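To turn that listing into worker-slot counts, a small Python sketch like the one below could be used. The job IDs and host lists are copied from the qstat -n output above; count_nodes is a hypothetical helper, and the 8 workers per node is the coastersPerNode setting from my pool entry. It assumes every allocated node would actually start all 8 workers, which is exactly what does not seem to be happening.

```python
# Hypothetical helper: count distinct hosts in a qstat host list.
def count_nodes(exec_hosts):
    """Accepts 'c45+c44+c06' or the qstat -f form 'c19.example/0+c20.example/0'."""
    hosts = {h.split("/")[0] for h in exec_hosts.strip().split("+") if h}
    return len(hosts)

# Host lists copied from the qstat -n output above.
jobs = {
    "912": "c19",
    "915": "c38",
    "916": "c45+c44+c06+c07+c08+c10+c12+c14+c17+c22",
}
workers_per_node = 8  # coastersPerNode from the pool entry

total_slots = 0
for job_id, hosts in jobs.items():
    nodes = count_nodes(hosts)
    total_slots += nodes * workers_per_node
    print(job_id, "nodes:", nodes, "slots:", nodes * workers_per_node)
print("total slots if every node ran 8 workers:", total_slots)
```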
> The coaster or worker logs on
> pads/~/.globus/coasters could shed some light on this.
I'll look and make these readable by you.
- Mike
> On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote:
>> I'm running a script on PADS that emits 20 jobs in parallel with a foreach().
>>
>> I set coasters to use 8 workers per node, and my throttle to allow 64
>> jobs to run in parallel, so I would expect *at least* 8 jobs to be
>> running in parallel. But what I see is:
>>
>> - 3 PBS worker jobs start
>> - 2 of these have a single core (c19/0 and c19/1)
>> - 1 of these has 18 *nodes*
>> - all 20 jobs show up as submitted or active, but never more than *3*
>> active (note that 1 job is a setup job and completes right away).
>>
>> Below is info on this run.
>>
>> Any idea why the coaster provider is behaving this way?
>>
>> - Mike
>>
>> pool entry is:
>>
>> <pool handle="pbs">
>> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>> <profile namespace="globus" key="maxtime">1800</profile>
>> <execution provider="coaster" url="none" jobManager="local:pbs"/>
>> <profile namespace="globus" key="coastersPerNode">8</profile>
>> <profile namespace="karajan" key="jobThrottle">.63</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> <gridftp url="local://localhost" />
>> <workdirectory>$rundir</workdirectory>
>> </pool>
>>
>> Running on login2, I see:
>>
>> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
>> Running from host with compute-node reachable address of 172.5.86.6
>> Running in /home/wilde/protests/run.loops.1498
>> protlib2 home is /home/wilde/protlib2
>> Swift svn swift-r3202 cog-r2682
>>
>> RunID: 20100119-1309-l72sbpg8
>> Progress:
>> Progress: Checking status:1
>> Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1
>> Progress: Submitting:19 Submitted:1 Finished successfully:1
>> Progress: Submitted:19 Active:1 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:3 Finished successfully:1
>> Progress: Submitted:17 Active:2 Checking status:1 Finished successfully:1
>> Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2
>> Progress: Submitted:15 Active:3 Finished successfully:3
>>
>> ...and this keeps up - the script is progressing but only 3 jobs are
>> running at a time. (Each takes about 5 minutes)
>>
>> PBS shows:
>>
>> login2$ qstat -n
>>
>> svc.pads.ci.uchicago.edu:
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 912.svc.pads.ci.     wilde    extended null              14877     1  --     -- 00:29 R    --
>>    c19
>> 913.svc.pads.ci.     wilde    extended null                 --    18  --     -- 00:29 R    --
>>    c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
>> 914.svc.pads.ci.     wilde    extended null              15135     1  --     -- 00:29 R    --
>>    c19
>> login2$ qstat -f
>> Job Id: 912.svc.pads.ci.uchicago.edu
>> Job_Name = null
>> Job_Owner = wilde at login2.pads.ci.uchicago.edu
>> resources_used.cput = 00:00:58
>> resources_used.mem = 165768kb
>> resources_used.vmem = 757612kb
>> resources_used.walltime = 00:01:14
>> job_state = R
>> queue = extended
>> server = svc.pads.ci.uchicago.edu
>> Checkpoint = u
>> ctime = Tue Jan 19 13:09:16 2010
>> Error_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58
>> 66754363410172037.submit.stderr
>> exec_host = c19.pads.ci.uchicago.edu/0
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Tue Jan 19 13:09:18 2010
>> Output_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>> 866754363410172037.submit.stdout
>> Priority = 0
>> qtime = Tue Jan 19 13:09:16 2010
>> Rerunable = True
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1
>> Resource_List.walltime = 00:29:00
>> session_id = 14877
>> Shell_Path_List = /bin/sh
>> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>
>> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>
>> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>
>> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>
>> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>
>> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>
>> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>
>> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>
>> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>
>> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>
>> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>
>> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>> PBS_SERVER=login2.pads.ci.uchicago.edu,
>> PBS_O_HOST=login2.pads.ci.uchicago.edu,
>> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>> PBS_O_QUEUE=extended
>> etime = Tue Jan 19 13:09:16 2010
>> submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
>> start_time = Tue Jan 19 13:09:17 2010
>> start_count = 1
>>
>> Job Id: 913.svc.pads.ci.uchicago.edu
>> Job_Name = null
>> Job_Owner = wilde at login2.pads.ci.uchicago.edu
>> resources_used.cput = 00:00:36
>> resources_used.mem = 166452kb
>> resources_used.vmem = 765732kb
>> resources_used.walltime = 00:00:51
>> job_state = R
>> queue = extended
>> server = svc.pads.ci.uchicago.edu
>> Checkpoint = u
>> ctime = Tue Jan 19 13:09:16 2010
>> Error_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89
>> 90749016166185054.submit.stderr
>> exec_host =
>> c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
>>
>> .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
>>
>> /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
>>
>> chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
>>
>> 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
>>
>> go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
>>
>> s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
>> u/0
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Tue Jan 19 13:09:55 2010
>> Output_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8
>> 990749016166185054.submit.stdout
>> Priority = 0
>> qtime = Tue Jan 19 13:09:16 2010
>> Rerunable = True
>> Resource_List.nodect = 18
>> Resource_List.nodes = 18
>> Resource_List.walltime = 00:29:00
>> session_id = 13956
>> Shell_Path_List = /bin/sh
>> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>
>> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>
>> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>
>> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>
>> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>
>> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>
>> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>
>> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>
>> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>
>> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>
>> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>
>> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>> PBS_SERVER=login2.pads.ci.uchicago.edu,
>> PBS_O_HOST=login2.pads.ci.uchicago.edu,
>> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>> PBS_O_QUEUE=extended
>> etime = Tue Jan 19 13:09:16 2010
>> submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
>> start_time = Tue Jan 19 13:09:18 2010
>> start_count = 1
>>
>> Job Id: 914.svc.pads.ci.uchicago.edu
>> Job_Name = null
>> Job_Owner = wilde at login2.pads.ci.uchicago.edu
>> resources_used.cput = 00:00:58
>> resources_used.mem = 165760kb
>> resources_used.vmem = 757612kb
>> resources_used.walltime = 00:01:11
>> job_state = R
>> queue = extended
>> server = svc.pads.ci.uchicago.edu
>> Checkpoint = u
>> ctime = Tue Jan 19 13:09:18 2010
>> Error_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54
>> 46269528052212820.submit.stderr
>> exec_host = c19.pads.ci.uchicago.edu/1
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Tue Jan 19 13:09:20 2010
>> Output_Path =
>> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>> 446269528052212820.submit.stdout
>> Priority = 0
>> qtime = Tue Jan 19 13:09:18 2010
>> Rerunable = True
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1
>> Resource_List.walltime = 00:29:00
>> session_id = 15135
>> Shell_Path_List = /bin/sh
>> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>
>> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>
>> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>
>> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>
>> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>
>> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>
>> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>
>> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>
>> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>
>> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>
>> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>
>> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>> PBS_SERVER=login2.pads.ci.uchicago.edu,
>> PBS_O_HOST=login2.pads.ci.uchicago.edu,
>> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>> PBS_O_QUEUE=extended
>> etime = Tue Jan 19 13:09:18 2010
>> submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
>> start_time = Tue Jan 19 13:09:20 2010
>> start_count = 1
>>
>> login2$