[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 13:49:02 CST 2010
On 1/19/10 1:44 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote:
>> On 1/19/10 1:32 PM, Mihael Hategan wrote:
>>> Maybe PBS is lying about that 18 node job.
>> I would be surprised if that's the case. But even if it had *1* node you
>> would think it would run at least 8 jobs in parallel.
>
> I see. Though not with your current setup. You should use
> "workersPerNode" instead of "coastersPerNode".
Thanks! I'll fix that and try again. This makes more sense now, if it's
assuming 1 worker per node.
That still doesn't explain why it's not starting more jobs, since it allocated
abundant nodes (even assuming 1 worker per node).
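
As a rough sanity check on the numbers in this thread, the sketch below computes an upper bound on concurrent jobs for one block. It assumes each allocated node runs `workersPerNode` workers and each worker runs one job at a time; the function name `expected_slots` is made up for illustration and is not part of Swift or Coasters.

```python
def expected_slots(nodes: int, workers_per_node: int) -> int:
    """Upper bound on jobs running at once in a single block.

    Assumption (not taken from the Coasters source): every allocated node
    starts `workers_per_node` workers, and each worker runs one job at a time.
    """
    if nodes < 0 or workers_per_node < 0:
        raise ValueError("nodes and workers_per_node must be non-negative")
    return nodes * workers_per_node


if __name__ == "__main__":
    # 18-node block with the default of 1 worker per node
    print(expected_slots(18, 1))
    # Same block with 8 workers per node
    print(expected_slots(18, 8))
    # A single node with 8 workers per node
    print(expected_slots(1, 8))
```

Under these assumptions, even the 1-worker-per-node case leaves room for 18 concurrent jobs on the 18-node block, which is why the low job count is still unexplained.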
>
>> I'm confused why it has started three jobs, two with only one core and
>> one with 18 nodes.
>
> It does that. It spreads out the block sizes to exploit non-linearities
> in queuing times.
>
>> But the 18 node job just hit its wall time limit; now coasters seems to
>> have started a 10 node job:
>
> Don't know about that. Logs please.
>
Here are the logs from that dir for this run. I don't understand why the
coasters.log file in that directory has not been written to since Jan 13.
login2$ ls -dt * | head
worker-0119-090116-000002.log worker-0114-310129-000005.log
worker-0119-090116-000004.log worker-0114-310129-000006.log
worker-0119-090116-000003.log worker-0114-310129-000007.log
worker-0119-090116-000001.log worker-0114-310129-000008.log
worker-0119-090116-000000.log worker-0114-310129-000009.log
cscript7310283766853084762.pl worker-0114-310129-000000.log
worker-0119-491225-000001.log worker-0114-110123-000004.log
worker-0119-491225-000000.log worker-0114-110123-000002.log
worker-0119-151225-000001.log worker-0114-110123-000003.log
worker-0119-151225-000000.log worker-0114-110123-000000.log
login2$ ls -1dt * | head
worker-0119-090116-000002.log
worker-0119-090116-000004.log
worker-0119-090116-000003.log
worker-0119-090116-000001.log
worker-0119-090116-000000.log
cscript7310283766853084762.pl
worker-0119-491225-000001.log
worker-0119-491225-000000.log
worker-0119-151225-000001.log
worker-0119-151225-000000.log
login2$ more *0119-090116*
::::::::::::::
worker-0119-090116-000000.log
::::::::::::::
1263928159 0119-090116-000000 Logging started
1263928159 INFO - Running on node c19.pads.ci.uchicago.edu
1263928159 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000001.log
::::::::::::::
1263928159 0119-090116-000001 Logging started
1263928159 INFO - Running on node c46.pads.ci.uchicago.edu
1263928160 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000002.log
::::::::::::::
1263928160 0119-090116-000002 Logging started
1263928161 INFO - Running on node c19.pads.ci.uchicago.edu
1263928161 INFO 000000 Registration successful. ID=000000
1263929738 INFO 000000 Acknowledged shutdown. Exiting
1263929738 INFO 000000 Ran a total of 3 jobs
1263929738 INFO - All sub-processes finished. Exiting.
::::::::::::::
worker-0119-090116-000003.log
::::::::::::::
1263929733 0119-090116-000003 Logging started
1263929733 INFO - Running on node c38.pads.ci.uchicago.edu
1263929733 INFO 000000 Registration successful. ID=000000
::::::::::::::
worker-0119-090116-000004.log
::::::::::::::
1263929734 0119-090116-000004 Logging started
1263929734 INFO - Running on node c45.pads.ci.uchicago.edu
1263929734 INFO 000000 Registration successful. ID=000000
login2$
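
To read these worker logs at a glance, a small script can pull out the node, the lifetime, and the job count of each worker. The following is a minimal sketch that assumes the line layout seen above (an epoch timestamp followed by the message); the function name `summarize` and the embedded `SAMPLE` text (copied from worker-0119-090116-000002.log) are for illustration only.

```python
import re
from datetime import datetime, timezone

# Copied from worker-0119-090116-000002.log above.
SAMPLE = """\
1263928160 0119-090116-000002 Logging started
1263928161 INFO - Running on node c19.pads.ci.uchicago.edu
1263928161 INFO 000000 Registration successful. ID=000000
1263929738 INFO 000000 Acknowledged shutdown. Exiting
1263929738 INFO 000000 Ran a total of 3 jobs
1263929738 INFO - All sub-processes finished. Exiting.
"""


def summarize(text):
    """Summarize one worker log: node, first/last timestamp, jobs run."""
    summary = {"node": None, "start": None, "end": None,
               "jobs": None, "finished": False}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip anything that does not start with an epoch timestamp.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        ts, rest = int(parts[0]), parts[1]
        if summary["start"] is None:
            summary["start"] = ts
        summary["end"] = ts
        m = re.search(r"Running on node (\S+)", rest)
        if m:
            summary["node"] = m.group(1)
        m = re.search(r"Ran a total of (\d+) jobs", rest)
        if m:
            summary["jobs"] = int(m.group(1))
        if "All sub-processes finished" in rest:
            summary["finished"] = True
    return summary


if __name__ == "__main__":
    s = summarize(SAMPLE)
    started = datetime.fromtimestamp(s["start"], tz=timezone.utc)
    print("node:", s["node"])
    print("started (UTC):", started.isoformat())
    print("lifetime (s):", s["end"] - s["start"])
    print("jobs:", s["jobs"], "finished:", s["finished"])
```

For the sample log this reports a lifetime of 1578 seconds and 3 jobs on c19. The other four workers show only a registration line, so their `jobs` value stays `None`.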