[Swift-user] Coaster jobs are not running with expected parallelism
Mihael Hategan
hategan at mcs.anl.gov
Tue Jan 19 13:32:30 CST 2010
Maybe PBS is lying about that 18-node job. The coaster or worker logs in
~/.globus/coasters on PADS could shed some light on this.
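
One quick way to see what PBS actually handed out is to count the hosts and
slots in the exec_host field of "qstat -f". The following is only a sketch:
the helper name count_exec_hosts is made up for illustration, and it assumes
the usual Torque layout of "host/slot" entries joined by "+".

```python
from collections import Counter


def count_exec_hosts(exec_host: str) -> Counter:
    """Count allocated slots per host in a PBS/Torque exec_host string.

    Assumes entries look like "host/slot" joined by "+"; an entry with no
    "/slot" suffix (as printed by `qstat -n`) is counted as one slot.
    """
    slots = Counter()
    # Drop all whitespace so a value wrapped across lines can be pasted in.
    compact = "".join(exec_host.split())
    for entry in compact.split("+"):
        if not entry:
            continue
        host, _, _slot = entry.partition("/")
        slots[host] += 1
    return slots


# Hosts of job 913 as listed by `qstat -n`, each with slot index 0
# as shown in the exec_host field of `qstat -f`.
hosts_913 = ("c46 c45 c44 c06 c07 c08 c10 c12 c14 c17 "
             "c22 c24 c28 c34 c35 c37 c39 c40").split()
job_913 = "+".join(f"{h}/0" for h in hosts_913)

slots = count_exec_hosts(job_913)
print(len(slots), "hosts,", sum(slots.values()), "slots")
```

For job 913 as quoted below, each of the 18 hosts appears once with slot
index 0, so this counts 18 hosts and 18 slots.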
On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote:
> I'm running a script on PADS that emits 20 jobs in parallel with a foreach().
>
> I set coasters to use 8 workers per node, and my throttle to allow 64
> jobs to run in parallel, so I would expect *at least* 8 jobs to be
> running in parallel. But what I see is:
>
> - 3 PBS worker jobs start
> - 2 of these have a single core (c19/0 and c19/1)
> - 1 of these has 18 *nodes*
> - all 20 jobs show up as submitted or active, but never more than *3*
> active (note that 1 job is a setup job and completes right away).
>
> Below is info on this run.
>
> Any idea why the coaster provider is behaving this way?
>
> - Mike
>
> pool entry is:
>
> <pool handle="pbs">
> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> <profile namespace="globus" key="maxtime">1800</profile>
> <execution provider="coaster" url="none" jobManager="local:pbs"/>
> <profile namespace="globus" key="coastersPerNode">8</profile>
> <profile namespace="karajan" key="jobThrottle">.63</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <gridftp url="local://localhost" />
> <workdirectory>$rundir</workdirectory>
> </pool>
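
For reference, the arithmetic behind the "64 jobs" expectation can be
sketched as follows. This assumes the commonly documented Swift rule that a
site allows jobThrottle * 100 + 1 concurrent jobs; the function names are
illustrative only and are not part of Swift.

```python
def throttle_limit(job_throttle: float) -> int:
    """Concurrent-job limit implied by jobThrottle (assumed rule: t*100 + 1)."""
    # round() guards against floating-point error in job_throttle * 100.
    return round(job_throttle * 100) + 1


def expected_concurrency(job_throttle: float, workers_per_node: int,
                         nodes: int, pending_jobs: int) -> int:
    """Upper bound on jobs running at once: the tightest of three limits."""
    worker_slots = workers_per_node * nodes
    return min(throttle_limit(job_throttle), worker_slots, pending_jobs)


print(throttle_limit(0.63))                   # 64
print(expected_concurrency(0.63, 8, 1, 20))   # 8
```

With jobThrottle .63 the limit is 64, and a single node running 8 workers
would already allow 8 of the 20 jobs to run at once, which is where the
"at least 8" expectation comes from.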
>
> Running on login2, I see:
>
> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
> Running from host with compute-node reachable address of 172.5.86.6
> Running in /home/wilde/protests/run.loops.1498
> protlib2 home is /home/wilde/protlib2
> Swift svn swift-r3202 cog-r2682
>
> RunID: 20100119-1309-l72sbpg8
> Progress:
> Progress: Checking status:1
> Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1
> Progress: Submitting:19 Submitted:1 Finished successfully:1
> Progress: Submitted:19 Active:1 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:3 Finished successfully:1
> Progress: Submitted:17 Active:2 Checking status:1 Finished successfully:1
> Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2
> Progress: Submitted:15 Active:3 Finished successfully:3
>
> ...and this keeps up - the script is progressing but only 3 jobs are
> running at a time. (Each takes about 5 minutes)
>
> PBS shows:
>
> login2$ qstat -n
>
> svc.pads.ci.uchicago.edu:
>
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 912.svc.pads.ci.     wilde    extended null              14877     1  --     -- 00:29 R    --
>    c19
> 913.svc.pads.ci.     wilde    extended null                 --    18  --     -- 00:29 R    --
>    c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
> 914.svc.pads.ci.     wilde    extended null              15135     1  --     -- 00:29 R    --
>    c19
> login2$ qstat -f
> Job Id: 912.svc.pads.ci.uchicago.edu
> Job_Name = null
> Job_Owner = wilde at login2.pads.ci.uchicago.edu
> resources_used.cput = 00:00:58
> resources_used.mem = 165768kb
> resources_used.vmem = 757612kb
> resources_used.walltime = 00:01:14
> job_state = R
> queue = extended
> server = svc.pads.ci.uchicago.edu
> Checkpoint = u
> ctime = Tue Jan 19 13:09:16 2010
> Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5866754363410172037.submit.stderr
> exec_host = c19.pads.ci.uchicago.edu/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Tue Jan 19 13:09:18 2010
> Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5866754363410172037.submit.stdout
> Priority = 0
> qtime = Tue Jan 19 13:09:16 2010
> Rerunable = True
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:29:00
> session_id = 14877
> Shell_Path_List = /bin/sh
> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=login2.pads.ci.uchicago.edu,
> PBS_O_HOST=login2.pads.ci.uchicago.edu,
> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
> PBS_O_QUEUE=extended
> etime = Tue Jan 19 13:09:16 2010
> submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
> start_time = Tue Jan 19 13:09:17 2010
> start_count = 1
>
> Job Id: 913.svc.pads.ci.uchicago.edu
> Job_Name = null
> Job_Owner = wilde at login2.pads.ci.uchicago.edu
> resources_used.cput = 00:00:36
> resources_used.mem = 166452kb
> resources_used.vmem = 765732kb
> resources_used.walltime = 00:00:51
> job_state = R
> queue = extended
> server = svc.pads.ci.uchicago.edu
> Checkpoint = u
> ctime = Tue Jan 19 13:09:16 2010
> Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8990749016166185054.submit.stderr
> exec_host = c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
> .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
> /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
> chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
> 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
> go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
> s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
> u/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Tue Jan 19 13:09:55 2010
> Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8990749016166185054.submit.stdout
> Priority = 0
> qtime = Tue Jan 19 13:09:16 2010
> Rerunable = True
> Resource_List.nodect = 18
> Resource_List.nodes = 18
> Resource_List.walltime = 00:29:00
> session_id = 13956
> Shell_Path_List = /bin/sh
> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=login2.pads.ci.uchicago.edu,
> PBS_O_HOST=login2.pads.ci.uchicago.edu,
> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
> PBS_O_QUEUE=extended
> etime = Tue Jan 19 13:09:16 2010
> submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
> start_time = Tue Jan 19 13:09:18 2010
> start_count = 1
>
> Job Id: 914.svc.pads.ci.uchicago.edu
> Job_Name = null
> Job_Owner = wilde at login2.pads.ci.uchicago.edu
> resources_used.cput = 00:00:58
> resources_used.mem = 165760kb
> resources_used.vmem = 757612kb
> resources_used.walltime = 00:01:11
> job_state = R
> queue = extended
> server = svc.pads.ci.uchicago.edu
> Checkpoint = u
> ctime = Tue Jan 19 13:09:18 2010
> Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5446269528052212820.submit.stderr
> exec_host = c19.pads.ci.uchicago.edu/1
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Tue Jan 19 13:09:20 2010
> Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5446269528052212820.submit.stdout
> Priority = 0
> qtime = Tue Jan 19 13:09:18 2010
> Rerunable = True
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:29:00
> session_id = 15135
> Shell_Path_List = /bin/sh
> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=login2.pads.ci.uchicago.edu,
> PBS_O_HOST=login2.pads.ci.uchicago.edu,
> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
> PBS_O_QUEUE=extended
> etime = Tue Jan 19 13:09:18 2010
> submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
> start_time = Tue Jan 19 13:09:20 2010
> start_count = 1
>
> login2$