[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 14:02:34 CST 2010
On 1/19/10 1:55 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote:
>> On 1/19/10 1:44 PM, Mihael Hategan wrote:
>>> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote:
>>>> On 1/19/10 1:32 PM, Mihael Hategan wrote:
>>>>> Maybe PBS is lying about that 18 node job.
>>>> I would be surprised if that's the case. But even if it had *1* node you
>>>> would think it would run at least 8 jobs in parallel.
>>> I see. Though not with your current setup. You should use
>>> "workersPerNode" instead of "coastersPerNode".
>> Thanks! I'll fix that and try again. This makes more sense now, if it's
>> assuming 1 worker per node.
>>
>> Still doesn't explain why it's not starting more jobs, since it allocated
>> abundant nodes (even assuming 1 worker per node).
>
> Trunk or branch?
Stable branch.
>
>>
>>>> I'm confused why it has started three jobs, two with only one core and
>>>> one with 18 nodes.
>>> It does that. It spreads out the block sizes to exploit non-linearities
>>> in queuing times.
>>>
>>>> But the 18-node job just hit its wall time limit; now coasters seems to
>>>> have started a 10-node job:
>>> Don't know about that. Logs please.
>>>
>> Here are the logs from that dir for this run. I don't understand why the
>> coasters.log file in that directory has not been written to since Jan 13.
>
> If you run swift on the head node and the coaster bootstrap provider is
> "local", then the coaster service runs in the same jvm as swift, and it
> writes to the same log as swift.
>
>> login2$ more *0119-090116*
>
> [...]
>
> Seems fine so far. Swift log then.
-rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49
/home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log
I killed the run and will retry with workersPerNode corrected; maybe you
can see, though, in this log, why the run was limited to only 3 active
jobs at once.
I'll see if same happens with workersPerNode set.
This would be explained if leaving workersPerNode *not* set somehow
defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker
per node. Could that be happening?
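
For reference, a sites.xml pool entry using the corrected key might look
something like the sketch below (the pool handle, work directory, and
throttle value are illustrative, assuming a PBS-backed coaster provider):

```xml
<pool handle="pbs-coasters">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <!-- workersPerNode is the key coasters actually reads; a key like
       coastersPerNode is not recognized, so the setting falls back to
       the default of 1 worker per node -->
  <profile namespace="globus" key="workersPerNode">8</profile>
  <filesystem provider="local"/>
  <workdirectory>/home/wilde/swiftwork</workdirectory>
</pool>
```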
- Mike