[Swift-devel] Error messages and execution retries

Wed Aug 18 16:08:31 CDT 2010

On Wed, 2010-08-18 at 20:54 +0000, Ben Clifford wrote:
> > Retries are meant to deal with transient errors, where transient is
> > pretty much defined as "eventually stops happening if you retry enough
> > times". The determination of whether they are transient or not (to a
> > certain degree of confidence) requires that the operations are retried.
> 
> Right.
> 
> Sometimes there are transient errors. Sometimes there are not.
> 
> The theory of distributed computing likes to talk about transient errors 
> and how they can be dealt with this way. But it's not clear to me in 
> practice how much that happens - my gut feeling from when I ran stuff was 
> that most errors were non-transient and retries happened rarely. But I 
> have no numerical evidence. That numerical evidence (either way) is 
> probably the decider for retries.

Right. It used to be the case somewhat with GT2/GT4.

There is, of course, also the issue that in the multi-site case, retries
imply re-scheduling, so they may iron out temporarily bad sites. I think
that is essential (and commonly relied on in automated Swift
installations).
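
To make the re-scheduling point concrete, here is a minimal Python sketch
(not Swift's actual scheduler) of a retry loop that moves to a different
site on each attempt. The names run_with_retries, job, and sites are
hypothetical, and the round-robin site choice is an assumption made only
for illustration.

```python
def run_with_retries(job, sites, retries=2):
    """Run job(site), moving to the next site after each failure.

    Returns (result, failures), where failures lists the (site, error)
    pairs that were retried. Raises RuntimeError once the initial attempt
    and all retries have failed.
    """
    failures = []
    for attempt in range(retries + 1):
        # Assumed policy: naive round-robin, so a retry lands on another site.
        site = sites[attempt % len(sites)]
        try:
            return job(site), failures
        except Exception as exc:
            # Record the error and keep going; the caller can inspect it later.
            failures.append((site, exc))
    raise RuntimeError(
        "job failed after %d attempt(s): %r" % (retries + 1, failures)
    )
```

With retries=0 the first failure is reported immediately, which is the
debugging behaviour discussed in this thread; with retries of 1 or more, a
temporarily bad site is skipped and the caller only sees the failure if it
looks at the returned list.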

> 
> > A skilled person could perhaps, by looking at the error, be able to make
> > a quicker determination. But then the same skilled person would probably
> > be able to set retries to 0 if he/she wanted to debug.
> 
> A skilled person equally well could turn retries on.
> 
> This thread is starting to sound pretty much like a complaint people have 
> about condor where rather than failing a job, it will keep trying over and 
> over. A 'skilled person' knows how and where to look to see what's going 
> on. A non-skilled person sees their job go into the queue and never 
> complete.

The distinction here being "never" as opposed to after some finite (and
reasonably short compared to the expected workflow run time) amount of
time.

From the perspective of a user, they should never even have to see that
there were retries. So I think this argument is a bit silly. All we are
saying is that there could be a way to find out about errors a little
faster. And while we could automate that, it comes at a cost. We already
have a mechanism to find out about errors as soon as they happen:
setting lazy.errors=false and retries=0.