[Swift-user] Coaster jobs are not running with expected parallelism

Michael Wilde wilde at mcs.anl.gov
Tue Jan 19 13:38:44 CST 2010



On 1/19/10 1:32 PM, Mihael Hategan wrote:
> Maybe PBS is lying about that 18 node job. 

I would be surprised if that's the case. But even if it had *1* node, you 
would think it would run at least 8 jobs in parallel.
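
A quick back-of-envelope check of that expectation, as a minimal Python sketch. It assumes the rule from the Swift user guide that a site may run roughly jobThrottle * 100 + 1 jobs at once, and that each allocated node hosts coastersPerNode workers. The function names are made up for illustration and are not part of Swift.

```python
def throttle_limit(job_throttle):
    """Concurrent-job cap implied by jobThrottle (assumed rule: throttle * 100 + 1)."""
    return int(round(job_throttle * 100)) + 1


def expected_parallelism(nodes, workers_per_node, job_throttle, pending_jobs):
    """Upper bound on simultaneously active jobs under the assumptions above."""
    worker_slots = nodes * workers_per_node
    return min(worker_slots, throttle_limit(job_throttle), pending_jobs)


# Values from the pool entry quoted below: coastersPerNode=8, jobThrottle=.63
print(throttle_limit(0.63))                    # 64
print(expected_parallelism(1, 8, 0.63, 20))    # 8
print(expected_parallelism(18, 8, 0.63, 20))   # 20
```

So with these settings the throttle should not be the limiting factor; if the model is right, something in how workers are being started is.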

I'm confused about why it has started three jobs, two with only one core 
and one with 18 nodes.
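
To make the "one core" observation concrete, here is a small Python sketch that counts PBS slots per host from an exec_host string. It assumes the usual host/slot entries joined by "+", as in the qstat -f output quoted below; the helper name slots_per_host is hypothetical.

```python
from collections import Counter


def slots_per_host(exec_host):
    """Count PBS slots per host in an exec_host string such as 'c19.x/0+c20.x/0'."""
    counts = Counter()
    for entry in exec_host.split("+"):
        host, _, _slot = entry.strip().partition("/")
        if host:
            counts[host] += 1
    return counts


# exec_host values copied from the qstat -f output quoted below
# (job 913 shortened to its first three entries for brevity).
jobs = {
    "912": "c19.pads.ci.uchicago.edu/0",
    "913": "c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads.ci.uchicago.edu/0",
    "914": "c19.pads.ci.uchicago.edu/1",
}

for job_id, exec_host in jobs.items():
    counts = slots_per_host(exec_host)
    print(job_id, "hosts:", len(counts), "slots:", sum(counts.values()))
```

Every host in these strings appears with a single slot, and jobs 912 and 914 share c19 (slots 0 and 1). Whether a single slot also limits how many workers coasters starts on that node is an open question here, not something this sketch can answer.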

But the 18-node job just hit its walltime limit; now coasters seems to 
have started a 10-node job:

login2$ qstat -n

svc.pads.ci.uchicago.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
912.svc.pads.ci.     wilde    extended null              14877     1  --     --  00:29 R 00:25
    c19
915.svc.pads.ci.     wilde    extended null               9028     1  --     --  00:29 R   --
    c38
916.svc.pads.ci.     wilde    extended null                --     10  --     --  00:29 R   --
    c45+c44+c06+c07+c08+c10+c12+c14+c17+c22
login2$


> The coaster or worker logs on
> pads/~/.globus/coasters could shed some light on this.

I'll look and make these readable by you.
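
For reference, one way to open up a log directory for reading, as a hedged Python sketch. The path ~/.globus/coasters comes from the message above; the choice of permission bits and the function name make_world_readable are assumptions for illustration only.

```python
import os
import stat


def make_world_readable(root):
    """Add read permission for group and others to every entry under root.

    Directories also get the execute (search) bit so they can be traversed.
    Returns the number of entries whose mode was changed.
    """
    changed = 0
    read_bits = stat.S_IRGRP | stat.S_IROTH
    search_bits = stat.S_IXGRP | stat.S_IXOTH
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk yields every directory as dirpath, so this covers dirs and files.
        for name in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            mode = stat.S_IMODE(os.stat(name).st_mode)
            wanted = mode | read_bits
            if os.path.isdir(name):
                wanted |= search_bits
            if wanted != mode:
                os.chmod(name, wanted)
                changed += 1
    return changed


if __name__ == "__main__":
    log_dir = os.path.expanduser("~/.globus/coasters")
    if os.path.isdir(log_dir):
        print(make_world_readable(log_dir), "entries updated")
```

Note that the parent directories (the home directory and ~/.globus) would also need to be searchable by the other user for this to help.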

- Mike

> On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote:
>> I'm running a script on PADS that emits 20 jobs in parallel with a foreach().
>>
>> I set coasters to use 8 workers per node, and my throttle to allow 64 
>> jobs to run in parallel, so I would expect *at least* 8 jobs to be 
>> running in parallel. But what I see is:
>>
>> - 3 PBS worker jobs start
>> - 2 of these have a single core (c19/0 and c19/1)
>> - 1 of these has 18 *nodes*
>> - all 20 jobs show up as submitted or active, but never more than *3* 
>> active (note that 1 job is a setup job and completes right away).
>>
>> Below is info on this run.
>>
>> Any idea why the coaster provider is behaving this way?
>>
>> - Mike
>>
>> pool entry is:
>>
>>    <pool handle="pbs">
>>      <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>>      <profile namespace="globus" key="maxtime">1800</profile>
>>      <execution provider="coaster" url="none" jobManager="local:pbs"/>
>>      <profile namespace="globus" key="coastersPerNode">8</profile>
>>      <profile namespace="karajan" key="jobThrottle">.63</profile>
>>      <profile namespace="karajan" key="initialScore">10000</profile>
>>      <gridftp  url="local://localhost" />
>>      <workdirectory>$rundir</workdirectory>
>>    </pool>
>>
>> Running on login2, I see:
>>
>> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
>> Running from host with compute-node reachable address of 172.5.86.6
>> Running in /home/wilde/protests/run.loops.1498
>> protlib2 home is /home/wilde/protlib2
>> Swift svn swift-r3202 cog-r2682
>>
>> RunID: 20100119-1309-l72sbpg8
>> Progress:
>> Progress:  Checking status:1
>> Progress:  Selecting site:18  Initializing site shared directory:1  Stage in:1  Finished successfully:1
>> Progress:  Submitting:19  Submitted:1  Finished successfully:1
>> Progress:  Submitted:19  Active:1  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:3  Finished successfully:1
>> Progress:  Submitted:17  Active:2  Checking status:1  Finished successfully:1
>> Progress:  Submitted:15  Active:3  Stage out:1  Finished successfully:2
>> Progress:  Submitted:15  Active:3  Finished successfully:3
>>
>> ...and this keeps up: the script is progressing, but only 3 jobs are 
>> running at a time. (Each takes about 5 minutes.)
>>
>> PBS shows:
>>
>> login2$ qstat -n
>>
>> svc.pads.ci.uchicago.edu:
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 912.svc.pads.ci.     wilde    extended null              14877     1  --     --  00:29 R   --
>>     c19
>> 913.svc.pads.ci.     wilde    extended null                --     18  --     --  00:29 R   --
>>     c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40
>> 914.svc.pads.ci.     wilde    extended null              15135     1  --     --  00:29 R   --
>>     c19
>> login2$ qstat -f
>> Job Id: 912.svc.pads.ci.uchicago.edu
>>      Job_Name = null
>>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>>      resources_used.cput = 00:00:58
>>      resources_used.mem = 165768kb
>>      resources_used.vmem = 757612kb
>>      resources_used.walltime = 00:01:14
>>      job_state = R
>>      queue = extended
>>      server = svc.pads.ci.uchicago.edu
>>      Checkpoint = u
>>      ctime = Tue Jan 19 13:09:16 2010
>>      Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58
>>          66754363410172037.submit.stderr
>>      exec_host = c19.pads.ci.uchicago.edu/0
>>      Hold_Types = n
>>      Join_Path = n
>>      Keep_Files = n
>>      Mail_Points = n
>>      mtime = Tue Jan 19 13:09:18 2010
>>      Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>>          866754363410172037.submit.stdout
>>      Priority = 0
>>      qtime = Tue Jan 19 13:09:16 2010
>>      Rerunable = True
>>      Resource_List.nodect = 1
>>      Resource_List.nodes = 1
>>      Resource_List.walltime = 00:29:00
>>      session_id = 14877
>>      Shell_Path_List = /bin/sh
>>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>          PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>          oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>          8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>          :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>          e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>          6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>          1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>          ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>          /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>          r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>          swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>>          PBS_O_QUEUE=extended
>>      etime = Tue Jan 19 13:09:16 2010
>>      submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit
>>      start_time = Tue Jan 19 13:09:17 2010
>>      start_count = 1
>>
>> Job Id: 913.svc.pads.ci.uchicago.edu
>>      Job_Name = null
>>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>>      resources_used.cput = 00:00:36
>>      resources_used.mem = 166452kb
>>      resources_used.vmem = 765732kb
>>      resources_used.walltime = 00:00:51
>>      job_state = R
>>      queue = extended
>>      server = svc.pads.ci.uchicago.edu
>>      Checkpoint = u
>>      ctime = Tue Jan 19 13:09:16 2010
>>      Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89
>>          90749016166185054.submit.stderr
>>      exec_host = c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads
>>          .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu
>>          /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u
>>          chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2
>>          2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica
>>          go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad
>>          s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed
>>          u/0
>>      Hold_Types = n
>>      Join_Path = n
>>      Keep_Files = n
>>      Mail_Points = n
>>      mtime = Tue Jan 19 13:09:55 2010
>>      Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8
>>          990749016166185054.submit.stdout
>>      Priority = 0
>>      qtime = Tue Jan 19 13:09:16 2010
>>      Rerunable = True
>>      Resource_List.nodect = 18
>>      Resource_List.nodes = 18
>>      Resource_List.walltime = 00:29:00
>>      session_id = 13956
>>      Shell_Path_List = /bin/sh
>>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>          PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>          oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>          8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>          :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>          e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>          6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>          1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>          ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>          /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>          r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>          swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>>          PBS_O_QUEUE=extended
>>      etime = Tue Jan 19 13:09:16 2010
>>      submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit
>>      start_time = Tue Jan 19 13:09:18 2010
>>      start_count = 1
>>
>> Job Id: 914.svc.pads.ci.uchicago.edu
>>      Job_Name = null
>>      Job_Owner = wilde at login2.pads.ci.uchicago.edu
>>      resources_used.cput = 00:00:58
>>      resources_used.mem = 165760kb
>>      resources_used.vmem = 757612kb
>>      resources_used.walltime = 00:01:11
>>      job_state = R
>>      queue = extended
>>      server = svc.pads.ci.uchicago.edu
>>      Checkpoint = u
>>      ctime = Tue Jan 19 13:09:18 2010
>>      Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54
>>          46269528052212820.submit.stderr
>>      exec_host = c19.pads.ci.uchicago.edu/1
>>      Hold_Types = n
>>      Join_Path = n
>>      Keep_Files = n
>>      Mail_Points = n
>>      mtime = Tue Jan 19 13:09:20 2010
>>      Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5
>>          446269528052212820.submit.stdout
>>      Priority = 0
>>      qtime = Tue Jan 19 13:09:18 2010
>>      Rerunable = True
>>      Resource_List.nodect = 1
>>      Resource_List.nodes = 1
>>      Resource_List.walltime = 00:29:00
>>      session_id = 15135
>>      Shell_Path_List = /bin/sh
>>      Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde,
>>          PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s
>>          oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0.
>>          8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin
>>          :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar
>>          e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1.
>>          6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2.
>>          1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma
>>          ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin:
>>          /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-
>>          r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/
>>          swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift-
>>          svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin,
>>          PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
>>          PBS_SERVER=login2.pads.ci.uchicago.edu,
>>          PBS_O_HOST=login2.pads.ci.uchicago.edu,
>>          PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498,
>>          PBS_O_QUEUE=extended
>>      etime = Tue Jan 19 13:09:18 2010
>>      submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit
>>      start_time = Tue Jan 19 13:09:20 2010
>>      start_count = 1
>>
>> login2$