[Swift-devel] Re: examining the plots of a 65535 job CNARI run

Mihael Hategan hategan at mcs.anl.gov
Thu Sep 25 10:16:27 CDT 2008


On Thu, 2008-09-25 at 15:06 +0000, Ben Clifford wrote:
> On Thu, 25 Sep 2008, Mihael Hategan wrote:
> 
> > When a task needs a new worker it becomes bound to the request for the
> > new worker. It does not go to another worker if it becomes available
> > while its worker is queued.
> 
> Could be that, but ~2000s seems a long time for that - the every 6300s 
> trough/peak periods where the coaster workers get restarted are only a 
> couple hundred seconds long.

Maybe unrelated, but there were these workers that, as far as the
queuing system was concerned, were running, without having produced any
logs. They kept "running" after things stopped happening, despite the
fact that they should have shut down for being idle for too long.

So I suspect there is a problem there. If replication was on, the long
jobs may be early replicas that happened to go to such a funny worker,
and which were eventually canceled after another job went through the
whole pipe.

I will add some code to cancel workers if no registration is received
within a certain time after the respective job goes into running state.
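
Roughly, the check I have in mind looks like the sketch below. It is written
in Python only to illustrate the idea and is not the coaster code; the class
name, the method names, the cancel callback and the timeout value are all
made up for the example.

```python
import time


class WorkerWatchdog:
    """Illustrative sketch: cancel workers that never register.

    All names here are hypothetical; none come from the coaster code.
    """

    def __init__(self, registration_timeout, cancel, clock=time.monotonic):
        self.registration_timeout = registration_timeout  # seconds (assumed value)
        self.cancel = cancel    # callback that cancels the worker's queued job
        self.clock = clock      # injectable clock, so the logic can be tested
        self.pending = {}       # worker id -> time its job went into running state

    def job_running(self, worker_id):
        # The queuing system reports the worker's job as running.
        self.pending[worker_id] = self.clock()

    def worker_registered(self, worker_id):
        # The worker registered, so it is no longer a suspect.
        self.pending.pop(worker_id, None)

    def check(self):
        # Cancel every worker whose job has been running for longer than
        # the timeout without a registration having been received.
        now = self.clock()
        expired = [w for w, started in self.pending.items()
                   if now - started > self.registration_timeout]
        for w in expired:
            del self.pending[w]
            self.cancel(w)
        return expired
```

check() would be called periodically. A worker that registers in time is
dropped from the pending table and never canceled.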
