[Swift-user] Coaster jobs are not running with expected parallelism

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 19 13:32:30 CST 2010


Maybe PBS is lying about that 18-node job. The coaster or worker logs on
PADS, in ~/.globus/coasters, could shed some light on this.
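
For reference, here is a rough sketch of the arithmetic behind the expected
parallelism, using the numbers from the quoted message below (jobThrottle .63,
coastersPerNode 8, 20 jobs, and PBS jobs of 1, 18 and 1 nodes). The function
names are made up for illustration and are not Swift or coaster APIs. The
throttle formula (jobThrottle * 100 + 1) is an assumption that matches the
"64 jobs" figure in the message.

```python
# Back-of-the-envelope concurrency check for the quoted configuration.
# All names are illustrative only; nothing here calls Swift or PBS.

def throttle_limit(job_throttle):
    # Assumption: the site allows jobThrottle * 100 + 1 concurrent jobs,
    # which gives 64 for the .63 in the quoted pool entry.
    return round(job_throttle * 100) + 1

def worker_slots(nodes_per_pbs_job, workers_per_node):
    # Slots available if every allocated node ran workers_per_node workers.
    return sum(nodes * workers_per_node for nodes in nodes_per_pbs_job)

nodes_per_pbs_job = [1, 18, 1]  # PBS jobs 912, 913, 914 as reported by qstat
app_jobs = 20                   # jobs emitted by the foreach()

limit = throttle_limit(0.63)
slots_if_honored = worker_slots(nodes_per_pbs_job, 8)
expected_active = min(app_jobs, limit, slots_if_honored)

# What the run would look like if each PBS job hosted only one busy worker.
one_worker_per_pbs_job = len(nodes_per_pbs_job)

print("throttle limit:", limit)
print("slots if coastersPerNode were honored:", slots_if_honored)
print("expected active jobs:", expected_active)
print("active jobs with one worker per PBS job:", one_worker_per_pbs_job)
```

With these inputs the throttle is not the limiting factor, and the last figure
equals the Active:3 seen in the progress output. That is only an observation
and not a diagnosis, but it is one reason to check the worker logs for how many
workers actually started in each PBS job.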

On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote:
> I'm running a script on PADS that emits 20 jobs in parallel with a foreach().
> 
> I set coasters to use 8 workers per node, and my throttle to allow 64 
> jobs to run in parallel, so I would expect *at least* 8 jobs to be 
> running in parallel. But what I see is:
> 
> - 3 PBS worker jobs start
> - 2 of these have a single core (c19/0 and c19/1)
> - 1 of these has 18 *nodes*
> - all 20 jobs show up as submitted or active, but never more than *3* 
> active (note that 1 job is a setup job and completes right away).
> 
> Below is info on this run.
> 
> Any idea why the coaster provider is behaving this way?
> 
> - Mike
> 
> pool entry is:
> 
>    <pool handle="pbs">
>      <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>      <profile namespace="globus" key="maxtime">1800</profile>
>      <execution provider="coaster" url="none" jobManager="local:pbs"/>
>      <profile namespace="globus" key="coastersPerNode">8</profile>
>      <profile namespace="karajan" key="jobThrottle">.63</profile>
>      <profile namespace="karajan" key="initialScore">10000</profile>
>      <gridftp  url="local://localhost" />
>      <workdirectory>$rundir</workdirectory>
>    </pool>
> 
> Running on login2, I see:
> 
> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
> Running from host with compute-node reachable address of 172.5.86.6
> Running in /home/wilde/protests/run.loops.1498
> protlib2 home is /home/wilde/protlib2
> Swift svn swift-r3202 cog-r2682
> 
> RunID: 20100119-1309-l72sbpg8
> Progress:
> Progress:  Checking status:1
> Progress:  Selecting site:18  Initializing site shared directory:1  Stage in:1  Finished successfully:1
> Progress:  Submitting:19  Submitted:1  Finished successfully:1
> Progress:  Submitted:19  Active:1  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:3  Finished successfully:1
> Progress:  Submitted:17  Active:2  Checking status:1  Finished successfully:1
> Progress:  Submitted:15  Active:3  Stage out:1  Finished successfully:2
> Progress:  Submitted:15  Active:3  Finished successfully:3
> 
> ...and this keeps up - the script is progressing but only 3 jobs are 
> running at a time. (Each takes about 5 minutes)
> 
> PBS shows:
> 
> login2$ qstat -n
> 
> svc.pads.ci.uchicago.edu:
>  
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 912.svc.pads.ci.     wilde    extended null              14877     1  --     --  00:29 R   --
>     c19
> 913.svc.pads.ci.     wilde    extended null                --     18  --     --  00:29 R   --
>     c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
> 914.svc.pads.ci.     wilde    extended null              15135     1  --     --  00:29 R   --
>     c19
> login2$ qstat -f
> Job Id: 912.svc.pads.ci.uchicago.edu
>      Job_Name = null
>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>      resources_used.cput = 00:00:58
>      resources_used.mem = 165768kb
>      resources_used.vmem = 757612kb
>      resources_used.walltime = 00:01:14
>      job_state = R
>      queue = extended
>      server = svc.pads.ci.uchicago.edu
>      Checkpoint = u
>      ctime = Tue Jan 19 13:09:16 2010
>      Error_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58
>          66754363410172037.submit.stderr
>      exec_host = c19.pads.ci.uchicago.edu/0
>      Hold_Types = n
>      Join_Path = n
>      Keep_Files = n
>      Mail_Points = n
>      mtime = Tue Jan 19 13:09:18 2010
>      Output_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>          866754363410172037.submit.stdout
>      Priority = 0
>      qtime = Tue Jan 19 13:09:16 2010
>      Rerunable = True
>      Resource_List.nodect = 1
>      Resource_List.nodes = 1
>      Resource_List.walltime = 00:29:00
>      session_id = 14877
>      Shell_Path_List = /bin/sh
>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>  
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>  
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>  
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>  
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>  
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>  
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>  
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>  
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>  
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>  
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>  
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>          PBS_O_QUEUE=extended
>      etime = Tue Jan 19 13:09:16 2010
>      submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
>      start_time = Tue Jan 19 13:09:17 2010
>      start_count = 1
> 
> Job Id: 913.svc.pads.ci.uchicago.edu
>      Job_Name = null
>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>      resources_used.cput = 00:00:36
>      resources_used.mem = 166452kb
>      resources_used.vmem = 765732kb
>      resources_used.walltime = 00:00:51
>      job_state = R
>      queue = extended
>      server = svc.pads.ci.uchicago.edu
>      Checkpoint = u
>      ctime = Tue Jan 19 13:09:16 2010
>      Error_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89
>          90749016166185054.submit.stderr
>      exec_host = 
> c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
>  
> .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
>  
> /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
>  
> chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
>  
> 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
>  
> go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
>  
> s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
>          u/0
>      Hold_Types = n
>      Join_Path = n
>      Keep_Files = n
>      Mail_Points = n
>      mtime = Tue Jan 19 13:09:55 2010
>      Output_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8
>          990749016166185054.submit.stdout
>      Priority = 0
>      qtime = Tue Jan 19 13:09:16 2010
>      Rerunable = True
>      Resource_List.nodect = 18
>      Resource_List.nodes = 18
>      Resource_List.walltime = 00:29:00
>      session_id = 13956
>      Shell_Path_List = /bin/sh
>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>  
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>  
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>  
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>  
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>  
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>  
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>  
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>  
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>  
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>  
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>  
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>          PBS_O_QUEUE=extended
>      etime = Tue Jan 19 13:09:16 2010
>      submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
>      start_time = Tue Jan 19 13:09:18 2010
>      start_count = 1
> 
> Job Id: 914.svc.pads.ci.uchicago.edu
>      Job_Name = null
>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>      resources_used.cput = 00:00:58
>      resources_used.mem = 165760kb
>      resources_used.vmem = 757612kb
>      resources_used.walltime = 00:01:11
>      job_state = R
>      queue = extended
>      server = svc.pads.ci.uchicago.edu
>      Checkpoint = u
>      ctime = Tue Jan 19 13:09:18 2010
>      Error_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54
>          46269528052212820.submit.stderr
>      exec_host = c19.pads.ci.uchicago.edu/1
>      Hold_Types = n
>      Join_Path = n
>      Keep_Files = n
>      Mail_Points = n
>      mtime = Tue Jan 19 13:09:20 2010
>      Output_Path = 
> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>          446269528052212820.submit.stdout
>      Priority = 0
>      qtime = Tue Jan 19 13:09:18 2010
>      Rerunable = True
>      Resource_List.nodect = 1
>      Resource_List.nodes = 1
>      Resource_List.walltime = 00:29:00
>      session_id = 15135
>      Shell_Path_List = /bin/sh
>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>  
> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>  
> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>  
> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>  
> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>  
> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>  
> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>  
> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>  
> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>  
> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>  
> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>  
> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>          PBS_O_QUEUE=extended
>      etime = Tue Jan 19 13:09:18 2010
>      submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
>      start_time = Tue Jan 19 13:09:20 2010
>      start_count = 1
> 
> login2$