[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script
Mihael Hategan
hategan at mcs.anl.gov
Thu Aug 30 11:30:51 CDT 2007
Note: moved to swift-devel.
On Thu, 2007-08-30 at 10:56 -0500, Michael Wilde wrote:
> Great - thanks. That was indeed the problem: my application script had
> a typo and was trying to run the 32-bit binary regardless of what processor
> type it wound up on. When I last ran successfully, I was getting most
> or all i686 machines; this time I was getting ia64 machines.
>
> I'll try to re-run it w/o debug, and see if the messages need improvement.
As far as I know, there is no translation for the cryptic missing-file
message, so I doubt that will improve.
>
> Kickstart would have helped here - would have told me that I'm running on
> ia64.
What stops you from enabling it?
>
> This is the kind of problem that on a local machine would have been
> recognizable instantly but on a remote machine through swift, karajan,
> globus and PBS is a much greater challenge to diagnose. We should think
> in terms of how to make that long pipeline to the remote execution
> environment much more transparent to the user.
I don't think it's the long pipeline that is the problem, but the fact
that the assumptions that you can usually make about your local machine
don't hold for a random machine out there. Moreover, they change
depending on where your job happens to run, whereas your machine stays
the same. We can improve things, I hope, and for that we need concrete
ideas.
>
> Think: "what would I see if I ran this locally" and "how do I bring that
> environment to the swift user"?
You can't bring that environment to the swift user. Remote != local, and
it may take a long time until it is, if at all. The question is "what is
a useful set of information for troubleshooting such problems, and how
do we get it without compromising other things too much?"
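
As one concrete starting point, here is a minimal sketch of a per-job record of where a job actually landed. It is plain Python and not part of Swift or Kickstart; the function names and the field names are hypothetical, and the strict equality check on the machine type is a simplifying assumption.

```python
import json
import platform
import socket


def host_report():
    """Collect basic facts about the machine this job is running on."""
    return {
        "hostname": socket.gethostname(),
        "machine": platform.machine(),  # e.g. "i686" or "ia64"
        "system": platform.system(),
        "release": platform.release(),
    }


def arch_mismatch(built_for, report):
    """Return a readable message if the binary's target differs from the host.

    Assumption: an exact string match is required. Real sites may run
    binaries built for a compatible machine type, which this ignores.
    """
    if report["machine"] == built_for:
        return None
    return (f"binary built for {built_for} but host "
            f"{report['hostname']} is {report['machine']}")


if __name__ == "__main__":
    # Print the record so it ends up in the job's stdout.
    print(json.dumps(host_report(), indent=2))
```

Something of this kind, written alongside the job's output, would probably have made the i686/ia64 mix-up visible without turning on debug logging.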
>
> Also noted that:
>
> - the retry logic here did more harm than good.
Can you be more specific?
> Maybe we want the
> default for this to be off, especially during debugging.
That, I'm guessing, could be added as an option.
>
> - in my latest run, which succeeded, the final job completion was
> excessively delayed. The output files were all back on the submit host,
> 4 of 5 jobs were logged as completed, and the completion of the final
> job seemed to take a few minutes longer.
>
> I'll work through the error logs more closely and file an enhancement
> request in bugz.
>
> I can batch these for later discussion or bring them as I encounter
> things, whatever people prefer. I don't want to distract anyone at the
> moment into long discussions on these; I'll organize them into bug
> reports and enhancement requests and file for discussion when we next
> review priorities.
>
> Ian was suggesting that this be soon - now is when we need to pick the
> next features for you to work on, Ben and Mihael. Maybe a review of
> bugs and requests next week, which can be started by email discussion,
> and we'll note which topics need voice or f2f discussion.
Action items! Yummy.
Mihael
>
> - Mike
>
>
> Mihael Hategan wrote:
> > Ok. You have a bunch of errors, mainly of two types:
> > 1. Missing output file (we should add a rule in error.properties to make
> > that verbose message a little more readable). This may be because the
> > application didn't run or because the filesystem is broken. Right now an
> > exit code file is produced by the wrapper only if the exit code of the
> > application is not 0. This makes it impossible to tell whether the
> > application completed successfully or the filesystem is
> > broken. I believe that a stamp file should also be created by the
> > wrapper in order to distinguish between the two. The reason for the
> > stamp file instead of always having an exit code file is that it is more
> > efficient to check the existence of a file than to stage it out and look
> > at its contents.
> >
> > 2. Exit code != 0. Looks like some issues with R.
> >
> > Mihael
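
The stamp-file proposal in item 1 of the quoted message can be sketched as follows. This is a minimal illustration in Python, not the actual wrapper; the file names `wrapper.done` and `exitcode` and both function names are made up for the example.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical file names; the real wrapper's names are not given in this thread.
STAMP_NAME = "wrapper.done"
EXIT_CODE_NAME = "exitcode"


def run_with_stamp(cmd, workdir):
    """Run cmd in workdir and leave markers describing how it ended."""
    workdir = Path(workdir)
    result = subprocess.run(cmd, cwd=workdir)
    if result.returncode != 0:
        # Behaviour described in the message: exit code file only on failure.
        (workdir / EXIT_CODE_NAME).write_text(f"{result.returncode}\n")
    # Proposed addition: a stamp written whenever the application has finished.
    (workdir / STAMP_NAME).touch()
    return result.returncode


def classify(workdir):
    """Interpret the markers using existence checks only, no file contents."""
    workdir = Path(workdir)
    if not (workdir / STAMP_NAME).exists():
        # Application never finished, or the filesystem lost the files.
        return "not_finished"
    if (workdir / EXIT_CODE_NAME).exists():
        return "failed"
    return "succeeded"


if __name__ == "__main__":
    # Usage: python wrapper_sketch.py <workdir> <command> [args...]
    code = run_with_stamp(sys.argv[2:], sys.argv[1])
    print(classify(sys.argv[1]))
    sys.exit(code)
```

With both markers in place, a missing output file plus a present stamp and no exit code file points at the application's own output handling, while a missing stamp points at the job not finishing or at the filesystem.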
> >
> > On Thu, 2007-08-30 at 08:31 -0500, Michael Wilde wrote:
> >> Resending this after changing list to take larger attachments.
> >> Previous message seems to have gotten lost (I musta pressed the wrong
> >> button in the list manager?)
> >>
> >> ---
> >>
> >> I'm progressing on the angle runs. Previous errors were due to problems
> >> with svn update, and then apparently needing ant clean and distclean.
> >>
> >> Now I'm executing but getting I/O errors. I've attached all the logs and
> >> output from this run.
> >>
> >> My result files are coming back zero-length and I'm seeing I/O errors in
> >> the logs (e.g., in swift.out):
> >>
> >> ...
> >> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Submitted
> >> Task(type=2, identity=urn:0-0-6-0-1-1188429807121) setting status to Active
> >> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Active
> >> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Failed Exception in getFile
> >>
> >> ...
> >>
> >> My suspicion is that the app is failing and not producing an expected
> >> output file. Perhaps there's a clean error in the log that says this but
> >> I haven't found it yet. I think I saw error #500's from gridftp in the log.
> >>
> >> While I debug further, if anyone sees a different or obvious cause, I'd
> >> appreciate your eyeballs on it.
> >>
> >> Thanks,
> >>
> >> Mike
> >>
> >> _______________________________________________
> >> Swift-user mailing list
> >> Swift-user at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
> >
>