[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 14:23:43 CST 2010
With workersPerNode = 8, I now see 2 PBS jobs; one has 1 node, one has 3
nodes.
Now *16* jobs are active.
The pattern seems to be that it's only running workersPerNode app() tasks
per PBS job (i.e., per block): 8 workers x 2 blocks = 16 active, where
8 workers x 4 nodes = 32 would be expected.
I'll see if I can get it to run workersPerNode tasks per *node* with
more explicit settings in the sites file.
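Something like the entry below, assuming the coaster provider picks
workersPerNode up from the globus profile namespace (the workdirectory
path here is just a placeholder):

  <config>
    <pool handle="pbs">
      <execution provider="coaster" jobmanager="local:pbs"/>
      <!-- intent: 8 concurrent app() tasks on each allocated node -->
      <profile namespace="globus" key="workersPerNode">8</profile>
      <!-- same queue the blocks above landed in -->
      <profile namespace="globus" key="queue">extended</profile>
      <filesystem provider="local"/>
      <workdirectory>/home/wilde/swiftwork</workdirectory>
    </pool>
  </config>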
The current job is:
/home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
Running from host with compute-node reachable address of 172.5.86.6
Running in /home/wilde/protests/run.loops.5357
protlib2 home is /home/wilde/protlib2
Swift svn swift-r3202 cog-r2682
RunID: 20100119-1414-q09uz2c0
Progress:
Progress: Checking status:1
Progress: Selecting site:18 Initializing site shared directory:1
Stage in:1 Finished successfully:1
Progress: Stage in:19 Submitting:1 Finished successfully:1
Progress: Submitted:19 Active:1 Finished successfully:1
Progress: Submitted:11 Active:9 Finished successfully:1
Progress: Submitted:7 Active:13 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
PBS says:
login2$ qstat -n
svc.pads.ci.uchicago.edu:
                                                           Req'd  Req'd   Elap
Job ID           Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
---------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
917.svc.pads.ci. wilde    extended null        16709   1  --     -- 00:29 R 00:04
   c19
918.svc.pads.ci. wilde    extended null        15309   3  --     -- 00:29 R 00:04
   c46+c45+c44
login2$
Swift log is in:
login2$ ls -l $(pwd)/*0.log
-rw-r--r-- 1 wilde ci-users 386242 Jan 19 14:21 /home/wilde/protests/run.loops.5357/psim.loops-20100119-1414-q09uz2c0.log
login2$
On 1/19/10 2:09 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote:
>
>> -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49
>> /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log
>>
>> I killed the run and will retry with workersPerNode corrected; maybe you
>> can see, though, in this log, why the run was limited to only 3 active
>> at once.
>>
>> I'll see if the same happens with workersPerNode set.
>>
>> This would be explained if leaving workersPerNode *not* set somehow
>> defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker
>> per node. Could that be happening?
>
> Not intentionally.
>
>