[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 14:23:43 CST 2010
With workersPerNode = 8, I now see 2 PBS jobs; one has 1 node, one has 3
nodes.
Now *16* jobs are active.
The pattern seems to be that it's only running workersPerNode app() tasks
per PBS job (i.e., per block): 8 workers x 2 blocks = 16 active, where
8 workers x 4 nodes = 32 would be expected.
I'll see if I can get it to run workersPerNode tasks per *node* with
more explicit settings in the sites file.
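Something like the entry below, assuming the coaster provider picks
workersPerNode up from the globus profile namespace (the workdirectory
path here is just a placeholder):

  <config>
    <pool handle="pbs">
      <execution provider="coaster" jobmanager="local:pbs"/>
      <!-- intent: 8 concurrent app() tasks on each allocated node -->
      <profile namespace="globus" key="workersPerNode">8</profile>
      <!-- same queue the blocks above landed in -->
      <profile namespace="globus" key="queue">extended</profile>
      <filesystem provider="local"/>
      <workdirectory>/home/wilde/swiftwork</workdirectory>
    </pool>
  </config>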
The current job is:
/home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs
Running from host with compute-node reachable address of 172.5.86.6
Running in /home/wilde/protests/run.loops.5357
protlib2 home is /home/wilde/protlib2
Swift svn swift-r3202 cog-r2682
RunID: 20100119-1414-q09uz2c0
Progress:
Progress: Checking status:1
Progress: Selecting site:18 Initializing site shared directory:1
Stage in:1 Finished successfully:1
Progress: Stage in:19 Submitting:1 Finished successfully:1
Progress: Submitted:19 Active:1 Finished successfully:1
Progress: Submitted:11 Active:9 Finished successfully:1
Progress: Submitted:7 Active:13 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
Progress: Submitted:4 Active:16 Finished successfully:1
PBS says:
login2$ qstat -n
svc.pads.ci.uchicago.edu:
                                                           Req'd  Req'd   Elap
Job ID           Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
---------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
917.svc.pads.ci. wilde    extended null        16709   1  --     -- 00:29 R 00:04
   c19
918.svc.pads.ci. wilde    extended null        15309   3  --     -- 00:29 R 00:04
   c46+c45+c44
login2$
Swift log is in:
login2$ ls -l $(pwd)/*0.log
-rw-r--r-- 1 wilde ci-users 386242 Jan 19 14:21 /home/wilde/protests/run.loops.5357/psim.loops-20100119-1414-q09uz2c0.log
login2$
On 1/19/10 2:09 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote:
>
>> -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49
>> /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log
>>
>> I killed the run and will retry with workersPerNode corrected; maybe you
>> can see, though, in this log, why the run was limited to only 3 active
>> at once.
>>
>> I'll see if the same happens with workersPerNode set.
>>
>> This would be explained if leaving workersPerNode *not* set somehow
>> defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker
>> per node. Could that be happening?
>
> Not intentionally.
>
>