[Swift-devel] Re: Broken pipe on persistent coasters (was Re: Next steps on making the ExTENCI SCEC workflow run reliably)

Mihael Hategan hategan at mcs.anl.gov
Wed May 11 20:59:06 CDT 2011


On Wed, 2011-05-11 at 20:02 -0500, Allan Espinosa wrote:
> 2011/5/11 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> 
> > 2011/5/11 Mihael Hategan <hategan at mcs.anl.gov>:
> >> On Wed, 2011-05-11 at 16:42 -0500, Allan Espinosa wrote:
> >>> Right. Workers die because they exceed the maximum walltime.  Does the
> >>> coaster service expect the workers to die cleanly (passive ones)?
> >>
> >> Hmm. They aren't expected to die. Which may be a problem.
> >>
> >> We (as in I) need to change that. Passive workers should advertise their
> >> walltime to the service and the service should take that into account so
> >> that jobs don't get sent to workers who don't have enough time left.
> 
> I remember that previous versions of worker.pl had an idle timeout
> parameter.

That was only to shut them down in case they lose connection to the
service, but I removed that since the heartbeats pretty much do the same
thing. Or so I remember.

> 
> >>
> >> However, as inefficient as this may be, the service should notify the
> >> client that the jobs that were running on a dying worker have failed,
> >> and those jobs should be restarted by swift. Is that not happening?
> >>
> 
> In this case, it hasn't (yet)

Ok. That's a bug, and I think it's a major bug in your case. Please file
a bug report on it and I'll get to it as soon as I can.
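
For reference, here is a rough sketch of the two behaviours discussed
above: workers advertising their walltime so that jobs only go to workers
with enough time left, and the service reporting the jobs of a dead worker
as failed so they can be restarted. It is written in Python for brevity and
is not the coaster service code. Every name in it (Worker, Service,
pick_worker, worker_died) is hypothetical, and the tightest-fit choice is
only one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical record for a passive worker; not a coaster service type.
    worker_id: str
    walltime: float      # seconds the worker advertised when it registered
    started_at: float    # time the worker started, in seconds
    running: set = field(default_factory=set)  # ids of jobs on this worker

    def time_left(self, now):
        return self.walltime - (now - self.started_at)


class Service:
    def __init__(self):
        self.workers = {}

    def register(self, worker):
        self.workers[worker.worker_id] = worker

    def pick_worker(self, job_id, job_walltime, now):
        """Assign the job to a worker with enough time left, or return None."""
        candidates = [w for w in self.workers.values()
                      if w.time_left(now) >= job_walltime]
        if not candidates:
            return None
        # Assumed policy: tightest fit, so workers with more time left
        # stay available for longer jobs.
        best = min(candidates, key=lambda w: w.time_left(now))
        best.running.add(job_id)
        return best.worker_id

    def worker_died(self, worker_id):
        """Drop a dead worker and return the job ids to report as failed."""
        worker = self.workers.pop(worker_id, None)
        return sorted(worker.running) if worker else []
```

With this, a worker that has 500 seconds left is never handed a 900-second
job, and the list returned by worker_died() is what the service would pass
back to the client so that swift can restart those jobs.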



