[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script

Michael Wilde wilde at mcs.anl.gov
Thu Aug 30 12:43:03 CDT 2007


Mihael Hategan wrote:
> Note: moved to swift-devel.
> 
> On Thu, 2007-08-30 at 10:56 -0500, Michael Wilde wrote:
>> Great - thanks.  That was indeed the problem: my application script had 
>> a typo and was trying to run the 32-bit binary regardless of what processor 
>> type it wound up on.  When I last ran successfully, I was getting most 
>> or all i686 machines; this time I was getting ia64 machines.
>>
>> I'll try to re-run it w/o debug, and see if the messages need improvement.
> 
> There is no translation for the cryptic missing-file message that I know
> of, so I doubt that will improve.
> 
>> Kickstart would have helped here - would have told me that I'm running on 
>> ia64.
> 
> What stops you from enabling it?

Nothing - that was just an observation.  I'll try it once I get 
comfortable with how the default options behave.

> 
>> This is the kind of problem that on a local machine would have been 
>> recognizable instantly but on a remote machine through swift, karajan, 
>> globus and PBS is a much greater challenge to diagnose.  We should think 
>> in terms of how to make that long pipeline to the remote execution 
>> environment much more transparent to the user.
> 
> I don't think it's the long pipeline that is the problem, but the fact
> that the assumptions that you can usually make about your local machine
> don't hold for a random machine out there. Moreover, they change
> depending on where your job happens to run, whereas your machine stays
> the same. We can improve things, I hope, and for that we need concrete
> ideas.
> 
>> Think: "what would I see if I ran this locally" and "how do I bring that 
>> environment to the swift user"?
> 
> You can't bring that environment to the swift user. Remote != local, and
> it may take a long time until it is, if ever. The question is "what is
> a useful set of things/information to troubleshoot such problems and how
> do we get that without compromising other things too much".
> 
>> Also noted that:
>>
>> - the retry logic here did more harm than good.
> 
> Can you be more specific?

In this case there was a script error.  Every retry that wound up on an 
IA64 host would fail. But there was no feedback on this aspect of the 
runtime environment.

I suspect a better default is "stop the workflow on first failure", then 
let the user re-run till the wf is considered "debugged" and then let 
the user set how things should be retried.
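
A minimal sketch of that default, written in Python rather than as an
actual Swift setting. The names run_workflow, max_retries and the job
callables are hypothetical, not existing Swift options. With
max_retries=0 the run stops at the first failing job; retries only
happen once the user asks for them.

```python
from typing import Callable, Dict, List


def run_workflow(jobs: Dict[str, Callable[[], int]], max_retries: int = 0) -> List[str]:
    """Run jobs in order and return the names of the jobs that completed.

    Each job is a callable returning an exit code (0 means success).
    With max_retries=0 (the proposed debugging default) the first
    failure stops the whole run instead of being retried elsewhere.
    """
    completed = []
    for name, job in jobs.items():
        attempts = 0
        while True:
            attempts += 1
            code = job()
            if code == 0:
                completed.append(name)
                break
            if attempts > max_retries:
                # Fail fast and say which job broke and how often it was tried.
                raise RuntimeError(
                    f"job {name!r} failed with exit code {code} "
                    f"after {attempts} attempt(s); stopping workflow"
                )
    return completed
```

Once the workflow is debugged, the user would raise max_retries to get
the current retry behaviour back.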

- Mike

> 
>>  Maybe we want the 
>> default for this to be off, especially during debugging.
> 
> That, I'm guessing, could be added as an option.
> 
>> - in my latest run, which succeeded, the final job completion was 
>> excessively delayed. The output files were all back on the submit host, 
>> 4 of 5 jobs were logged as completed, and the completion of the final 
>> job seemed to take a few minutes longer.
>>
>> I'll work through the error logs more closely and file an enhancement 
>> request in bugz.
>>
>> I can batch these for later discussion or bring them as I encounter 
>> things, whatever people prefer.  I don't want to distract anyone at the 
>> moment into long discussions on these; I'll organize them into bug 
>> reports and enhancement requests and file for discussion when we next 
>> review priorities.
>>
>> Ian was suggesting that this be soon - now is when we need to pick the 
>> next features for you to work on, Ben and Mihael.  Maybe a review of 
>> bugs and requests next week, which can be started by email discussion, 
>> and we'll note which topics need voice or f2f discussion.
> 
> Action items! Yummy.
> 
> Mihael
> 
>> - Mike
>>
>>
>> Mihael Hategan wrote:
>>> Ok. You have a bunch of errors, mainly of two types:
>>> 1. Missing output file (we should add a rule in error.properties to make
>>> that verbose message a little more readable). This may be because the
>>> application didn't run or because the filesystem is broken. Right now an
>>> exit code file is produced by the wrapper only if the exit code of the
>>> application is not 0. This does not allow distinguishing between the
>>> application having completed successfully and the filesystem being
>>> broken. I believe that a stamp file should also be created by the
>>> wrapper in order to distinguish between the two. The reason for the
>>> stamp file instead of always having an exit code file is that it is more
>>> efficient to check the existence of a file than to stage it out and look
>>> at its contents.
>>>
>>> 2. Exit code != 0. Looks like some issues with R.
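
The stamp-file proposal in item 1 could be sketched roughly as follows.
This is only an illustration under stated assumptions: the file names
"wrapper.stamp" and "exitcode" and the function classify_job are
hypothetical, and a real check would test for the files on the remote
site rather than through a local path.

```python
from pathlib import Path


def classify_job(job_dir: Path, output_name: str) -> str:
    """Guess what happened to a job from which marker files exist.

    Assumed convention (hypothetical names):
      - "exitcode" is written by the wrapper only on a non-zero exit code
      - "wrapper.stamp" is written by the wrapper whenever it finishes
    Only file existence is checked; no file contents are read.
    """
    exit_file = job_dir / "exitcode"
    stamp = job_dir / "wrapper.stamp"
    output = job_dir / output_name

    if exit_file.exists():
        return "app-failed"           # application returned a non-zero exit code
    if not stamp.exists():
        return "no-run-or-broken-fs"  # wrapper never finished, or files are not visible
    if not output.exists():
        return "missing-output"       # wrapper finished with exit code 0, output absent
    return "ok"
```

The point of the stamp is the second branch: without it, "no exit code
file" looks the same whether the application succeeded or nothing was
written at all.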
>>>
>>> Mihael
>>>
>>> On Thu, 2007-08-30 at 08:31 -0500, Michael Wilde wrote:
>>>> Resending this after changing list to take larger attachments.
>>>> Previous message seems to have gotten lost (I musta pressed the wrong 
>>>> button in the list manager?)
>>>>
>>>> ---
>>>>
>>>> I'm progressing on the angle runs. Previous errors were due to problems
>>>> with svn update, and then apparently needing ant clean and distclean.
>>>>
>>>> Now I'm executing but getting I/O errors.  I've attached all the logs and
>>>> output from this run.
>>>>
>>>> My result files are coming back zero-length and I'm seeing I/O errors in
>>>> the logs (e.g., in swift.out):
>>>>
>>>> ...
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Submitted
>>>> Task(type=2, identity=urn:0-0-6-0-1-1188429807121) setting status to Active
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Active
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Failed Exception in getFile
>>>>
>>>> ...
>>>>
>>>> My suspicion is that the app is failing and not producing an expected
>>>> output file.  Perhaps there's a clean error in the log that says this but
>>>> I haven't found it yet.  I think I saw error #500's from gridftp in the log.
>>>>
>>>> While I debug further, if anyone sees a different or obvious cause, I'd
>>>> appreciate your eyeballs on it.
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>>
> 
> 


