[Swift-devel] Re: worker.pl IDLETIMEOUT

Michael Wilde wilde at mcs.anl.gov
Fri Dec 10 19:56:57 CST 2010


I would think the service knows when each worker registered and how long the worker has been idle, regardless of whether the server started the worker itself.
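
To make that concrete, here is a minimal sketch of the bookkeeping I have in mind. It is not the coaster service's actual code: the class, method, and field names (IdleTracker, register, fits, to_shut_down, maxtime as a per-worker value) are all hypothetical, and timestamps are passed in explicitly so the logic is easy to test. It covers both questions raised in the quoted thread: (a) refuse a job that will not fit in a worker's remaining time, and (b) pick out workers that have been idle past a timeout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorkerRecord:
    registered_at: float         # time the worker registered with the service
    maxtime: float               # seconds the worker may live (hypothetical field)
    idle_since: Optional[float]  # None while a job is running on the worker


class IdleTracker:
    """Hypothetical service-side bookkeeping, for illustration only."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self.workers: Dict[str, WorkerRecord] = {}

    def register(self, worker_id: str, now: float, maxtime: float) -> None:
        # A worker counts as idle from the moment it registers,
        # no matter who started it (active or passive).
        self.workers[worker_id] = WorkerRecord(now, maxtime, idle_since=now)

    def job_started(self, worker_id: str) -> None:
        self.workers[worker_id].idle_since = None

    def job_finished(self, worker_id: str, now: float) -> None:
        self.workers[worker_id].idle_since = now

    def remaining(self, worker_id: str, now: float) -> float:
        w = self.workers[worker_id]
        return w.registered_at + w.maxtime - now

    def fits(self, worker_id: str, job_walltime: float, now: float) -> bool:
        # (a) never start a job that needs more time than the worker has left.
        return job_walltime <= self.remaining(worker_id, now)

    def to_shut_down(self, now: float) -> List[str]:
        # (b) workers idle for at least idle_timeout are SHUTDOWN candidates.
        return sorted(
            wid
            for wid, w in self.workers.items()
            if w.idle_since is not None
            and now - w.idle_since >= self.idle_timeout
        )
```

The point of the sketch is only that registration time plus a per-worker idle timestamp is enough state to make both decisions on the service side.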

We should be able to test this readily in a small controlled setup, and validate the results with Mihael regarding what's supposed to happen and what we'd like to have happen.
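
For such a test, the timer suggested further down the thread for the pilot scripts could be sketched roughly as follows. This is an illustration under assumptions, not something taken from the Swift tree: run_with_deadline is a made-up helper name, the 10-second grace period is arbitrary, and the command you pass in would be whatever your pilot script uses to launch worker.pl. The idea is that when the time budget runs out, the wrapper stops the worker and still reports exit code 0, so the scheduler does not count the pilot job as failed.

```python
import subprocess
from typing import List


def run_with_deadline(cmd: List[str], budget_seconds: float) -> int:
    """Run cmd; if it outlives budget_seconds, stop it and report success.

    Returns the command's own exit code when it finishes in time,
    and 0 when it had to be stopped because the budget ran out.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        # Ask the process to stop (SIGTERM on POSIX), then escalate
        # if it ignores the request for 10 seconds (arbitrary grace period).
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        # The deadline was planned, so treat it as a clean exit.
        return 0
```

A pilot script would call this with a budget somewhat below the Condor job time and pass the result to sys.exit(). A worker that fails on its own before the deadline still reports its real exit code.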

- Mike


----- Original Message -----
> I am not sure about passive workers though. Since swift is not
> involved in the creation of the workers, it has no idea when to issue
> the SHUTDOWN command to the workers (and service).
> 
> -Allan
> 
> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> > Since your pilot jobs are scripts that launch worker.pl, you could
> > put a timer in those scripts to kill worker.pl and exit cleanly.
> >
> > If you set maxtime in the pool entry to be somewhat less than the
> > Condor jobtime setting for the pilot job, will Swift, even in the
> > case of persistent coasters, (a) not start a job whose maxwalltime
> > is greater than the maxtime remaining, and (b) shut down workers
> > when no queued job has fit into the remaining time of the worker
> > for some idle timeout period? (I.e., I thought the reason
> > IDLETIMEOUT could be removed from the worker was that the client
> > (or the service) has similar logic.)
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> Looking at the worker.pl I use, yes, there are no more IDLE timeout
> >> cases. This will leave pilot jobs failing when they exceed the
> >> maxwalltime. This is another explanation for the large number of
> >> job failures in OSG as well.
> >>
> >> Before the changes, I simply changed the IDLE timeout to exit
> >> cleanly (exit 0 instead of die).
> >>
> >> -Allan
> >>
> >> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> >> > I added that idle timeout arg to worker.pl, I think. But in recent
> >> > changes I think Mihael removed the idle timeout entirely. Are you
> >> > using a recent trunk version with those changes? That seemed to
> >> > work best for me in my latest tests using passive persistent
> >> > coaster servers.
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> The idle timeout exiting with a non-zero exit code generated a
> >> >> lot of "JOB FAILED" stats in OSG. This skews their usage report
> >> >> in a weird fashion. I made some modifications before, but my
> >> >> upgrade to the latest trunk code somehow broke it.
> >> >>
> >> >> 2010/10/12 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> >> > Poking at worker.pl, I see that it accepts a third argument
> >> >> > for idle time. Is this in seconds?
> >> >> >
> >> >> > Also, I'm using swift to drive a number of passive workers.
> >> >> > The worker jobs fail due to this timeout. I may have to modify
> >> >> > things to suit this kind of setup.
> >> >> >
> >> >> > Thanks,
> >> >> > -Allan
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



