[Swift-user] Coaster jobs are not running with expected parallelism
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 19 14:02:34 CST 2010
On 1/19/10 1:55 PM, Mihael Hategan wrote:
> On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote:
>> On 1/19/10 1:44 PM, Mihael Hategan wrote:
>>> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote:
>>>> On 1/19/10 1:32 PM, Mihael Hategan wrote:
>>>>> Maybe PBS is lying about that 18 node job.
>>>> I would be surprised if that's the case. But even if it had *1* node you
>>>> would think it would run at least 8 jobs in parallel.
>>> I see. Though not with your current setup. You should use
>>> "workersPerNode" instead of "coastersPerNode".
>> Thanks! I'll fix that and try again. This makes more sense now, if it's
>> assuming 1 worker per node.
>>
>> Still doesn't explain why it's not starting more jobs, since it allocated
>> abundant nodes (even assuming 1 worker per node).
>
> Trunk or branch?
Stable branch.
>
>>
>>>> I'm confused why it has started three jobs, two with only one core and
>>>> one with 18 nodes.
>>> It does that. It spreads out the block sizes to exploit non-linearities
>>> in queuing times.
>>>
>>>> But the 18-node job just hit its wall time limit; now coasters seems to
>>>> have started a 10-node job:
>>> Don't know about that. Logs please.
>>>
>> Here are the logs from that dir for this run. I don't understand why the
>> coasters.log file in that directory has not been written to since Jan 13.
>
> If you run swift on the head node and the coaster bootstrap provider is
> "local", then the coaster service runs in the same jvm as swift, and it
> writes to the same log as swift.
>
>> login2$ more *0119-090116*
>
> [...]
>
> Seems fine so far. Swift log then.
-rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49
/home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log
I killed the run and will retry with workersPerNode corrected; maybe you
can see, though, in this log, why the run was limited to only 3 active
jobs at once.
I'll see if same happens with workersPerNode set.
This would be explained if leaving workersPerNode *not* set somehow
defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker
per node. Could that be happening?
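
For reference, a sites.xml pool entry using the corrected key might look
something like the sketch below (the pool handle, work directory, and
throttle value are illustrative, assuming a PBS-backed coaster provider):

```xml
<pool handle="pbs-coasters">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <!-- workersPerNode is the key coasters actually reads; a key like
       coastersPerNode is not recognized, so the setting falls back to
       the default of 1 worker per node -->
  <profile namespace="globus" key="workersPerNode">8</profile>
  <filesystem provider="local"/>
  <workdirectory>/home/wilde/swiftwork</workdirectory>
</pool>
```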
- Mike