[Swift-user] Coaster jobs are not running with expected parallelism

Michael Wilde wilde at mcs.anl.gov
Tue Jan 19 13:49:02 CST 2010



On 1/19/10 1:44 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote:
>> On 1/19/10 1:32 PM, Mihael Hategan wrote:
>>> Maybe PBS is lying about that 18 node job. 
>> I would be surprised if that's the case. But even if it had *1* node you
>> would think it would run at least 8 jobs in parallel.
> 
> I see. Though not with your current setup. You should use
> "workersPerNode" instead of "coastersPerNode".

Thanks!  I'll fix that and try again. This makes more sense now, if it's
assuming 1 worker per node.

That still doesn't explain why it's not starting more jobs, since it
allocated abundant nodes (even assuming 1 worker per node).
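
To make the expectation concrete, here is a rough sketch of the arithmetic I have in mind. It assumes each worker runs one job at a time and that a node has 8 cores (my reading of the "at least 8 jobs" estimate above); the function name is made up for illustration and is not part of Swift or coasters.

```python
# Back-of-the-envelope bound on coaster parallelism.
# Assumption: one job per worker at a time, so concurrent jobs
# cannot exceed nodes * workers_per_node for an allocated block.

def max_parallel_jobs(nodes: int, workers_per_node: int = 1) -> int:
    """Upper bound on concurrent jobs for one allocated block (hypothetical helper)."""
    if nodes < 0 or workers_per_node < 1:
        raise ValueError("nodes must be >= 0 and workers_per_node >= 1")
    return nodes * workers_per_node

# The 18-node block with 1 worker per node (what the current setup
# appears to assume) versus 8 workers per node.
print(max_parallel_jobs(18))     # 18
print(max_parallel_jobs(18, 8))  # 144
print(max_parallel_jobs(1, 8))   # 8
```

Even the most conservative case (18) is well above the parallelism I am seeing, which is why the worker setting alone does not seem to account for it.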


> 
>> I'm confused why it has started three jobs, two with only one core and
>> one with 18 nodes.
> 
> It does that. It spreads out the block sizes to exploit non-linearities
> in queuing times.
> 
>> But the 18 node job just hit its wall time limit; now coasters seems to 
>> have started a 10 node job:
> 
> Don't know about that. Logs please.
> 

Here are the logs from that dir for this run. I don't understand why the
coasters.log file in that directory has not been written to since Jan 13.

login2$ ls -dt * | head
worker-0119-090116-000002.log  worker-0114-310129-000005.log
worker-0119-090116-000004.log  worker-0114-310129-000006.log
worker-0119-090116-000003.log  worker-0114-310129-000007.log
worker-0119-090116-000001.log  worker-0114-310129-000008.log
worker-0119-090116-000000.log  worker-0114-310129-000009.log
cscript7310283766853084762.pl  worker-0114-310129-000000.log
worker-0119-491225-000001.log  worker-0114-110123-000004.log
worker-0119-491225-000000.log  worker-0114-110123-000002.log
worker-0119-151225-000001.log  worker-0114-110123-000003.log
worker-0119-151225-000000.log  worker-0114-110123-000000.log
login2$ ls -1dt * | head
worker-0119-090116-000002.log
worker-0119-090116-000004.log
worker-0119-090116-000003.log
worker-0119-090116-000001.log
worker-0119-090116-000000.log
cscript7310283766853084762.pl
worker-0119-491225-000001.log
worker-0119-491225-000000.log
worker-0119-151225-000001.log
worker-0119-151225-000000.log
login2$ more *0119-090116*
::::::::::::::
worker-0119-090116-000000.log
::::::::::::::
1263928159 0119-090116-000000 Logging started
1263928159 INFO - Running on node c19.pads.ci.uchicago.edu
1263928159 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000001.log
::::::::::::::
1263928159 0119-090116-000001 Logging started
1263928159 INFO - Running on node c46.pads.ci.uchicago.edu
1263928160 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000002.log
::::::::::::::
1263928160 0119-090116-000002 Logging started
1263928161 INFO - Running on node c19.pads.ci.uchicago.edu
1263928161 INFO 000000 Registration successful. ID=000000
1263929738 INFO 000000 Acknowledged shutdown. Exiting
1263929738 INFO 000000 Ran a total of 3 jobs
1263929738 INFO - All sub-processes finished. Exiting.
::::::::::::::
worker-0119-090116-000003.log
::::::::::::::
1263929733 0119-090116-000003 Logging started
1263929733 INFO - Running on node c38.pads.ci.uchicago.edu
1263929733 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000004.log
::::::::::::::
1263929734 0119-090116-000004 Logging started
1263929734 INFO - Running on node c45.pads.ci.uchicago.edu
1263929734 INFO 000000 Registration successful. ID=000000
login2$
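
In case it helps when reading these, below is a small sketch for summarizing a worker log. It assumes only the line layout visible above (a 10-digit epoch timestamp followed by the message); the function name and the summary fields are my own, not anything coasters provides.

```python
import re
from datetime import datetime, timezone

# Assumed layout, taken from the logs above: "<epoch seconds> <message>".
LINE = re.compile(r"^(\d{10})\s+(.*)$")

def summarize_worker_log(text: str) -> dict:
    """Return node name, first/last timestamp and job count for one worker log."""
    summary = {"node": None, "start": None, "end": None, "jobs": None}
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip separators such as "::::::::::::::"
        ts, msg = int(m.group(1)), m.group(2)
        if summary["start"] is None:
            summary["start"] = ts
        summary["end"] = ts
        node = re.search(r"Running on node (\S+)", msg)
        if node:
            summary["node"] = node.group(1)
        jobs = re.search(r"Ran a total of (\d+) jobs", msg)
        if jobs:
            summary["jobs"] = int(jobs.group(1))
    return summary

# Contents of worker-0119-090116-000002.log as pasted above.
sample = """\
1263928160 0119-090116-000002 Logging started
1263928161 INFO - Running on node c19.pads.ci.uchicago.edu
1263928161 INFO 000000 Registration successful. ID=000000
1263929738 INFO 000000 Acknowledged shutdown. Exiting
1263929738 INFO 000000 Ran a total of 3 jobs
1263929738 INFO - All sub-processes finished. Exiting.
"""

info = summarize_worker_log(sample)
print(info["node"], info["jobs"], info["end"] - info["start"], "seconds")
print(datetime.fromtimestamp(info["start"], tz=timezone.utc).isoformat())
```

For that one worker this gives 3 jobs over 1578 seconds of logged lifetime, and it is the only one of the five logs that records any completed jobs so far.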



