[Swift-devel] Re: worker.pl IDLETIMEOUT
Mihael Hategan
hategan at mcs.anl.gov
Fri Dec 10 22:39:56 CST 2010
On Fri, 2010-12-10 at 18:26 -0600, Allan Espinosa wrote:
> I am not sure about passive workers though. Since swift is not
> involved in the creation of the workers, it has no idea when to issue
> the SHUTDOWN command to the workers (and service).
Well, the problem with passive workers is that it becomes your
responsibility not only to start them, but also to shut them down.
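
A minimal sketch of that shutdown side, assuming the pilot job wraps
worker.pl in a small launcher with its own deadline. Nothing here is part
of Swift or worker.pl: the name run_with_deadline and the grace period are
made up for illustration. The point is only that the wrapper, not the
worker, decides when to stop, and that it reports a clean exit (0) so the
batch system does not count the pilot as failed.

```python
import subprocess


def run_with_deadline(cmd, deadline_s, grace_s=5.0):
    """Run cmd and stop it if it is still alive after deadline_s seconds.

    Returns 0 when the deadline stopped the process (treated as a clean
    shutdown), otherwise the child's own exit code.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the child finishes on its own before the deadline.
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Deadline reached: ask politely first (SIGTERM on POSIX).
        proc.terminate()
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Still alive after the grace period: force it.
            proc.kill()
            proc.wait()
        return 0
```

In a pilot script this would be called with the worker.pl command line and
a deadline somewhat shorter than the walltime requested from the batch
system, and its return value passed to sys.exit().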
>
> -Allan
>
> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> > Since your pilot jobs are scripts that launch worker.pl, you could put a timer in those scripts to kill worker.pl and exit cleanly.
> >
> > If you set maxtime in the pool entry to be somewhat less than the Condor jobtime setting for the pilot job, will Swift, even in the case of persistent coasters, (a) not start a job whose maxwalltime is greater than the maxtime remaining, and (b) shut down workers when no queued job has fit into the remaining time of the worker for some idle timeout period? (I.e., I thought the reason IDLETIMEOUT could be removed from the worker was that the client (or the service) has similar logic.)
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> Looking at the worker.pl I use, yes, there are no more IDLE timeout
> >> cases. This will leave pilot jobs failing when they exceed the
> >> maxwalltime. This is another explanation for the large number of job
> >> failures in OSG as well.
> >>
> >> Before the changes, I simply changed the IDLE timeout to exit cleanly
> >> (exit 0 instead of die)
> >>
> >> -Allan
> >>
> >> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> >> > I added that idle timeout arg to worker.pl I think. But in recent
> >> > changes I think Mihael removed the idle timeout entirely. Are you
> >> > using a recent trunk version with those changes? That seemed to work
> >> > best for me in my latest tests using passive persistent coaster
> >> > servers.
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> The idle timeout having a non-zero exit code generated a lot of "JOB
> >> >> FAILED" stats in OSG. This skews their usage report in a weird
> >> >> fashion. I made some modifications before, but my upgrade to the
> >> >> latest trunk code somehow broke them.
> >> >>
> >> >> 2010/10/12 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> >> > Poking at worker.pl, I see that it accepts a third argument for
> >> >> > idle time. Is this in seconds?
> >> >> >
> >> >> > Also, I'm using Swift to drive a number of passive workers. The
> >> >> > worker jobs fail due to this timeout. I may have to modify things
> >> >> > to suit this kind of setup.
> >> >> >
> >> >> > Thanks,
> >> >> > -Allan
>
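
For reference, the rule Mike asks about in (a) and (b) can be written down
as two small predicates. This is only an illustration of the question, not
a description of what the coaster service actually does; Worker, fits and
should_shut_down are hypothetical names, and all times are in seconds.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    time_left_s: float  # walltime left before the batch system kills the pilot
    idle_s: float       # how long the worker has had nothing to run


def fits(job_maxwalltime_s, worker):
    """(a) A job may start only if its maxwalltime fits in the time left."""
    return job_maxwalltime_s <= worker.time_left_s


def should_shut_down(worker, queued_maxwalltimes_s, idle_timeout_s):
    """(b) Shut down once no queued job fits and the idle timeout has passed."""
    nothing_fits = not any(fits(j, worker) for j in queued_maxwalltimes_s)
    return nothing_fits and worker.idle_s >= idle_timeout_s
```

With an empty queue nothing fits by definition, so under this reading an
idle worker is shut down as soon as the idle timeout expires.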