[Swift-devel] Error messages and execution retries
Mihael Hategan
hategan at mcs.anl.gov
Wed Aug 18 16:08:31 CDT 2010
On Wed, 2010-08-18 at 20:54 +0000, Ben Clifford wrote:
> > Retries are meant to deal with transient errors, where transient is
> > pretty much defined as "eventually stops happening if you retry enough
> > times". The determination of whether they are transient or not (to a
> > certain degree of confidence) requires that the operations are retried.
>
> Right.
>
> Sometimes there are transient errors. Sometimes there are not.
>
> The theory of distributed computing likes to talk about transient errors
> and how they can be dealt with this way. But it's not clear to me in
> practice how much that happens - my gut feeling from when I ran stuff was
> that most errors were non-transient and retries happened rarely. But I
> have no numerical evidence. That numerical evidence (either way) is
> probably the decider for retries.
Right. It used to be the case somewhat with GT2/GT4.
There is, of course, also the issue that in the multi-site case, retries
also imply re-scheduling. So this may iron out temporarily bad sites.
I think that is an essential point (and it is commonly relied on in
automated Swift installations).
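
To make the re-scheduling point concrete, here is a minimal sketch in Python (not Swift's actual scheduler). The function name run_with_retries, the site names, and the "prefer a site not yet tried" policy are all assumptions made up for illustration. It only shows how a retry that lands on a different site can hide a temporarily bad one.

```python
def run_with_retries(job, sites, retries):
    """Run job(site), re-scheduling onto another site after each failure.

    Hypothetical helper for illustration only; it does not reflect how
    Swift picks sites internally.
    """
    tried = []
    last_error = None
    for _ in range(retries + 1):
        # Prefer sites that have not failed yet; fall back to all sites
        # once every site has been tried.
        candidates = [s for s in sites if s not in tried] or list(sites)
        site = candidates[0]
        tried.append(site)
        try:
            return job(site), tried
        except Exception as exc:  # any failure counts as "maybe transient"
            last_error = exc
    raise RuntimeError(
        f"job failed after {retries + 1} attempt(s) on {tried}"
    ) from last_error


def flaky(site):
    # Stand-in for a job submission: one site is temporarily broken.
    if site == "bad":
        raise IOError("site is down")
    return "ok@" + site
```

With retries=1 and sites ["bad", "good"], the first attempt fails on "bad" and the retry succeeds on "good", so the caller never sees the failure. With retries=0 the first error is reported immediately, which is the debugging behaviour discussed in this thread.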
>
> > A skilled person could perhaps, by looking at the error, be able to make
> > a quicker determination. But then the same skilled person would probably
> > be able to set retries to 0 if he/she wanted to debug.
>
> A skilled person equally well could turn retries on.
>
> This thread is starting to sound pretty much like a complaint people have
> about condor where rather than failing a job, it will keep trying over and
> over. A 'skilled person' knows how and where to look to see what's going
> on. A non-skilled person sees their job go into the queue and never
> complete.
The distinction here being "never" as opposed to after some finite (and
reasonably short compared to the expected workflow run time) amount of
time.
From the perspective of a user, they should never even have to see that
there were retries. So I think this argument is a bit silly. All we are
saying is that there could be a way to find out about errors a little
faster. And while we could automate that, it comes at a cost. We already
have a mechanism to find out about errors as soon as they happen: it's
called lazy.errors=false and retries=0.