[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script

Mihael Hategan hategan at mcs.anl.gov
Thu Aug 30 14:03:06 CDT 2007


On Thu, 2007-08-30 at 13:23 -0500, Michael Wilde wrote:
> You make some good points here, Mihael.
> I'll wait till I get a bit more experience.
> But I don't want to lose the "newbie" perspective, as that's where most 
> users will start (and end) their experience with Swift.

Then don't try all the fancy non-newbie flags. "-debug" means "I want
the details because I think I can make sense of them".

> 
> I went back to the log/out-err files and I think I see where I was 
> confused: the indication of nonzero exit codes comes out much later in 
> the log; it seems like the earlier jobs failed on output file retrieval 
> long before there was any indication of a non-zero job exit code.

The exit code is checked first. So exit code errors and missing file
errors for a given job are mutually exclusive. Normally these are only
reported at the end of the workflow. Anyway, I'll try to put in the
stamp file, to distinguish between application failures and filesystem
failures.
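
The checking order described above can be sketched roughly as follows.
This is only an illustration in Python, not Swift's actual code: the
function name, the category strings, and the stamp file name
"_job.stamp" are all assumptions made for the example. The idea is
that the exit code is looked at first, so a job is never reported with
both an exit code error and a missing file error, and that a stamp
file written at the end of the job would let a missing stamp point at
the filesystem rather than the application.

```python
import os
import tempfile

# Hypothetical stamp file name; the real name (if any) is not given in the thread.
STAMP_NAME = "_job.stamp"


def classify_job(exit_code, job_dir, expected_outputs):
    """Classify one job's outcome, checking the exit code before any files."""
    # 1. Exit code first: a non-zero code ends the check, so exit code
    #    errors and missing file errors are mutually exclusive per job.
    if exit_code != 0:
        return "application-failure"
    # 2. Assumed stamp semantics: the job writes the stamp when it finishes,
    #    so a clean exit without a visible stamp suggests a filesystem problem.
    if not os.path.exists(os.path.join(job_dir, STAMP_NAME)):
        return "filesystem-failure"
    # 3. Stamp is visible, so the filesystem works; a missing output means
    #    the application did not produce it.
    for name in expected_outputs:
        if not os.path.exists(os.path.join(job_dir, name)):
            return "missing-output"
    return "ok"


def demo():
    """Run the classifier through the four cases in a scratch directory."""
    results = []
    with tempfile.TemporaryDirectory() as job_dir:
        results.append(classify_job(1, job_dir, ["out.txt"]))
        results.append(classify_job(0, job_dir, ["out.txt"]))
        open(os.path.join(job_dir, STAMP_NAME), "w").close()
        results.append(classify_job(0, job_dir, ["out.txt"]))
        open(os.path.join(job_dir, "out.txt"), "w").close()
        results.append(classify_job(0, job_dir, ["out.txt"]))
    return results
```

Calling demo() walks through a non-zero exit, a clean exit with no
stamp, a stamp with no output, and a fully successful job, in that
order.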

> 
> This seems to me to need much more scrutiny; either I need to try 
> several more controlled test cases and annotate the logs, or we should 
> walk through a log together and I can explain what questions a newbie 
> has about various messages and what an improved format might be.

There are two directions here. One is improving what we have, and the
other is re-inventing what we have.

Now, I'm not saying that there are no mistakes at all in the reasoning
leading to the current state. But most of the things in there were not
thrown in at random; they are the result of (I'd like to think) careful
thinking, as much as there can be given the complexity of the problem.
So there is a fine line between improving and re-inventing. If it's
aggressively crossed, we may end up improving few things at the expense
of considerable time.

Of course, not crossing that line assumes a certain level of trust.
Which is hard to formally define. In any event, those are, I think, the
options.

Mihael

> 
> I'll try same with debug off to see what the default looks like.
> 
> Onwards for now but we need to come back to this.
> 
> - Mike
> 
> 
> Mihael Hategan wrote:
> >>>> Also noted that:
> >>>>
> >>>> - the retry logic here did more harm than good.
> >>> Can you be more specific?
> >> In this case there was a script error.  Every retry that wound up on an 
> >> IA64 host would fail. But there was no feedback on this aspect of the 
> >> runtime environment.
> >>
> >> I suspect a better default is "stop the workflow on first failure", then 
> >> let the user re-run till the wf is considered "debugged" and then let 
> >> the user set how things should be retried.
> > 
> > I think that's an overgeneralization of a solution to your particular
> > case. It ignores errors due to sites having problems, which is pretty
> > standard, and would cause lots of annoyances. Ioan asked for more
> > retries, and I can understand why. Now you're asking for no retries.
> > 
> > The assumption was this: if there's a problem with the application
> > invocation, all retries will eventually fail. There is no way to tell
> > application failures from site failures (even the exit code may
> > not be the right indicator). Retries dramatically decrease the odds of
> > failing the whole workflow because of a bad node/site (although it
> > depends on the exact initial probability of finding bad nodes). But they
> > do not change much if the invocation is broken. The application not
> > being installed properly is, to a certain extent, a site problem, and
> > chances are that running the same thing on a different site will make it
> > work.
> > 
> > Perhaps there should be two different sets of settings: one for setting
> > up the workflow, and one for running it in production mode.
> > 
> > Or, perhaps, the information about the workflow should be organized
> > better, using interfaces more intuitive than endless streams of loosely
> > structured text, so that the user can, interactively, explore the
> > various details of what has happened.
> > 
> > Now, there's retries and there's lazy errors (compute everything that's
> > possible and only stop after nothing more can be done). You can disable
> > that. swift -help. I think it's -lazy.errors=false.
> > 
> > Mihael
> > 
> >> - Mike
> >>
> > 
> > 
> > 
> 
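
The retry argument in the quoted message can be made concrete with a
small calculation. The model below is an assumption added for
illustration, not something stated in the thread: each attempt lands
on a bad node independently with probability p_bad, a job fails only
if all of its attempts land on bad nodes, and the workflow fails if
any job fails. A broken invocation corresponds to p_bad = 1, where
retries change nothing.

```python
def workflow_success_probability(p_bad, retries, jobs):
    """Probability that every job succeeds within 1 + retries attempts.

    Assumes independent attempts, each failing with probability p_bad.
    """
    # A job fails only when the first attempt and every retry all fail.
    p_job_fails = p_bad ** (retries + 1)
    # The workflow succeeds only when no job fails.
    return (1.0 - p_job_fails) ** jobs


if __name__ == "__main__":
    # Illustrative numbers only: 10% bad nodes, 100 jobs.
    for retries in (0, 1, 3):
        p = workflow_success_probability(0.10, retries, 100)
        print("retries=%d  success probability=%.5f" % (retries, p))
    # Broken invocation: every attempt fails, so retries do not help.
    print(workflow_success_probability(1.0, 3, 100))
```

Under these assumptions a 100-job workflow with 10% bad nodes almost
never finishes without retries and almost always finishes with three,
while a broken invocation fails regardless, which is the trade-off the
message describes.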



