[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script

Michael Wilde wilde at mcs.anl.gov
Thu Aug 30 13:23:24 CDT 2007


You make some good points here, Mihael.
I'll wait till I get a bit more experience.
But I don't want to lose the "newbie" perspective, as that's where most 
users will start (and end) their experience with Swift.

I went back to the log/out-err files and I think I see where I was 
confused: the indication of nonzero exit codes comes out much later in 
the log; it seems like the earlier jobs failed on output file retrieval 
long before there was any indication of a nonzero job exit code.

This seems to me to need much more scrutiny; either I need to try 
several more controlled test cases and annotate the logs, or we should 
walk through a log together and I can explain what questions a newbie 
has about various messages and what an improved format might be.

I'll try the same with debug off to see what the default looks like.

Onwards for now but we need to come back to this.

- Mike


Mihael Hategan wrote:
>>>> Also noted that:
>>>>
>>>> - the retry logic here did more harm than good.
>>> Can you be more specific?
>> In this case there was a script error.  Every retry that wound up on an 
>> IA64 host would fail. But there was no feedback on this aspect of the 
>> runtime environment.
>>
>> I suspect a better default is "stop the workflow on first failure", then 
>> let the user re-run till the wf is considered "debugged" and then let 
>> the user set how things should be retried.
> 
> I think that's an overgeneralization of a solution to your particular
> case. It ignores errors due to sites having problems, which is pretty
> standard, and would cause lots of annoyances. Ioan asked for more
> retries, and I can understand why. Now you're asking for no retries.
> 
> The assumption was this: if there's a problem with the application
> invocation, all retries will eventually fail. There is no way to
> distinguish application failures from site failures (even the exit code may
> not be the right indicator). Retries dramatically decrease the odds of
> failing the whole workflow because of a bad node/site (although it
> depends on the exact initial probability of finding bad nodes). But they
> do not change much if the invocation is broken. The application not
> being installed properly is, to a certain extent, a site problem, and
> chances are that running the same thing on a different site will make it
> work.
> 
> Perhaps there should be two different sets of settings: one for setting
> up the workflow, and one for running it in production mode.
> 
> Or, perhaps, the information about the workflow should be organized
> better, using interfaces more intuitive than endless streams of loosely
> structured text, so that the user can, interactively, explore the
> various details of what has happened.
> 
> Now, there are retries and there are lazy errors (compute everything that's
> possible and only stop after nothing more can be done). You can disable
> the latter. See swift -help. I think it's -lazy.errors=false.
> 
> Mihael
> 
>> - Mike
>>
> 
> 
> 


