[Swift-user] Resending: I/O errors in swift script
Michael Wilde
wilde at mcs.anl.gov
Thu Aug 30 10:56:02 CDT 2007
Great - thanks. That was indeed the problem: my application script had
a typo and was trying to run the 32-bit binary regardless of what processor
type it wound up on. When I last ran successfully, I was getting most
or all i686 machines; this time I was getting ia64 machines.
I'll try to re-run it w/o debug, and see if the messages need improvement.
Kickstart would have helped here - it would have told me that I'm running
on ia64.
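
A minimal sketch of the kind of architecture check that avoids this class
of failure, written in Python purely for illustration. It is not the actual
application script: the binary names and the mapping are hypothetical, and
only the i686 and ia64 machine types come from this thread.

```python
import platform

# Hypothetical mapping from machine type to binary name; the real
# application's file names are not given in this thread.
BINARIES = {
    "i686": "app-i686",
    "ia64": "app-ia64",
}

def pick_binary(machine=None):
    """Return the binary name for a machine type (default: this host)."""
    machine = machine or platform.machine()
    try:
        return BINARIES[machine]
    except KeyError:
        # Fail loudly instead of silently falling back to the 32-bit build.
        raise RuntimeError(f"no binary built for machine type {machine!r}")

def describe_host():
    """One-line host summary, useful to log at the start of a remote job."""
    return f"host={platform.node()} machine={platform.machine()}"
```

Logging describe_host() at the top of the job gives some of the
information Kickstart would have reported, and pick_binary() turns an
unexpected architecture into an explicit error rather than a failed exec.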
This is the kind of problem that on a local machine would have been
recognizable instantly but on a remote machine through swift, karajan,
globus and PBS is a much greater challenge to diagnose. We should think
in terms of how to make that long pipeline to the remote execution
environment much more transparent to the user.
Think: "what would I see if I ran this locally" and "how do I bring that
environment to the swift user"?
Also noted that:
- the retry logic here did more harm than good. Maybe we want the
default for this to be off, especially during debugging.
- in my latest run, which succeeded, the final job completion was
excessively delayed. The output files were all back on the submit host,
4 of 5 jobs were logged as completed, and the completion of the final
job seemed to take a few minutes longer.
I'll work through the error logs more closely and file an enhancement
request in bugz.
I can batch these for later discussion or bring them as I encounter
things, whatever people prefer. I don't want to distract anyone at the
moment into long discussions on these; I'll organize them into bug
reports and enhancement requests and file for discussion when we next
review priorities.
Ian was suggesting that this be soon - now is when we need to pick the
next features for you to work on, Ben and Mihael. Maybe a review of
bugs and requests next week, which can be started by email discussion,
and we'll note which topics need voice or f2f discussion.
- Mike
Mihael Hategan wrote:
> Ok. You have a bunch of errors, mainly of two types:
> 1. Missing output file (we should add a rule in error.properties to make
> that verbose message a little more readable). This may be because the
> application didn't run or because the filesystem is broken. Right now an
> exit code file is produced by the wrapper only if the exit code of the
> application is not 0. This does not allow telling between the
> application having completed successfully or the filesystem being
> broken. I believe that a stamp file should also be created by the
> wrapper in order to distinguish between the two. The reason for the
> stamp file instead of always having an exit code file is that it is more
> efficient to check the existence of a file than to stage it out and look
> at its contents.
>
> 2. Exit code != 0. Looks like some issues with R.
>
> Mihael
>
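
The stamp-file idea in point 1 above can be sketched roughly as follows.
This is an illustration in Python, not the Swift wrapper itself: the file
names "exitcode" and "stamp", the function names, and the status strings
are all assumptions made for the example.

```python
import os
import subprocess

def run_wrapped(cmd, jobdir):
    """Run cmd; write an exit code file only on failure, then a stamp file."""
    os.makedirs(jobdir, exist_ok=True)
    rc = subprocess.call(cmd)
    if rc != 0:
        # Matches the behavior described above: exit code file only if != 0.
        with open(os.path.join(jobdir, "exitcode"), "w") as f:
            f.write(str(rc))
    # Written last, so its presence means the wrapper ran to the end and
    # the filesystem accepted the write.
    open(os.path.join(jobdir, "stamp"), "w").close()
    return rc

def classify(jobdir, output_name):
    """Classify a job using existence checks only (no file contents read)."""
    def exists(name):
        return os.path.exists(os.path.join(jobdir, name))
    if not exists("stamp"):
        return "no-stamp"        # wrapper never finished, or filesystem broken
    if exists("exitcode"):
        return "app-failed"      # application returned a nonzero exit code
    if not exists(output_name):
        return "missing-output"  # application exited 0 but produced no output
    return "ok"
```

Because classify() only tests for existence, the submit side would not need
to stage out and parse a file in the common success case, which is the
efficiency argument made above.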
> On Thu, 2007-08-30 at 08:31 -0500, Michael Wilde wrote:
>> Resending this after changing list to take larger attachments.
>> Previous message seems to have gotten lost (I musta pressed the wrong
>> button in the list manager?)
>>
>> ---
>>
>> I'm progressing on the angle runs. Previous errors were due to problems
>> with svn update, and then apparently needing ant clean and distclean.
>>
>> Now I'm executing but getting I/O errors. I've attached all the logs and
>> output from this run.
>>
>> My result files are coming back zero-length and I'm seeing I/O errors in
>> the logs (e.g., in swift.out):
>>
>> ...
>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Submitted
>> Task(type=2, identity=urn:0-0-6-0-1-1188429807121) setting status to Active
>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Active
>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Failed Exception in getFile
>>
>> ...
>>
>> My suspicion is that the app is failing and not producing an expected
>> output file. Perhaps there's a clean error in the log that says this but
>> I haven't found it yet. I think I saw error #500's from gridftp in the log.
>>
>> While I debug further, if anyone sees a different or obvious cause, I'd
>> appreciate your eyeballs on it.
>>
>> Thanks,
>>
>> Mike
>>