[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script

Mihael Hategan hategan at mcs.anl.gov
Thu Aug 30 12:45:19 CDT 2007


On Thu, 2007-08-30 at 12:22 -0500, Michael Wilde wrote:
> Following up on this, Mihael, you said:
> 
>  >> 2. Exit code != 0. Looks like some issues with R.
> 
> I don't see where in the logs you observed that the jobs were failing. I 
> think that would have tipped me off earlier that I have an app problem.

It normally comes out on stderr.

grep -A 1000 "The following errors have occurred" swift.out

But that's fundamentally the problem with information overload: it's
hard to tell what the relevant part is. That's why you shouldn't run
with -d. That information is in the logs anyway.
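
As a rough alternative to guessing a large enough -A count, something like
the following prints everything from the marker line to the end of the
file. It assumes the marker text appears verbatim, as in the grep above;
the function name and the demo file contents are made up for illustration.

```shell
# Print everything from the first line matching the marker through end of file.
# Assumption: the marker text below appears verbatim in the output file.
# extract_errors is a hypothetical helper name, not part of Swift.
extract_errors() {
    sed -n '/The following errors have occurred/,$p' "$1"
}

# Demo on a synthetic file (contents invented for illustration only).
demo=$(mktemp)
printf '%s\n' \
    'DEBUG noise line' \
    'The following errors have occurred:' \
    '1. made-up error text' > "$demo"
extract_errors "$demo"
```

On a real run this would be called as extract_errors swift.out.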

> 
> I must be looking in the wrong place.  I redirected stdout and stderr 
> into a file, starting swift like this:
> 
> $ swift -debug awf2.swift >swift.out 2>&1 &
> 
> from which I get the following logs when all is done:
> 
> $ wc -l *log *out
>        1 awf2-rm4p72i7lp0r0.0.rlog
>     1322 awf2-rm4p72i7lp0r0.log
>        1 swift.log
>     1400 swift.out
>     2724 total
> $
> 
> The awf2*.log file seems to be more or less a timestamped version of 
> stdout/err.

That's because you run with -d, which pretty much means "show me
everything on stdout".

>  (Interesting to note where the extra lines are going that 
> are in swift.out but not in awf2*.log, though. )

Those are the error reports. They are printed on stderr. And yes, the
actual log should also contain these. Bug report.

> 
> In the .log file I see the text that I've excerpted below. I think the 
> following improvements could be made and wonder if you agree:
> 
> - Clearly show job exit code (I still don't see this)

grep -A 10 "exit code" swift.out. I'm not sure what can be more clear in
a log file than spelling "application x failed with an exit code of y".
Please, don't confuse the clarity of a particular message with the
difficulty of finding that message in a haystack of messages.

> - Use mnemonic codes for task types (rather than 1,2...)

Makes sense. Should be a cog bug report.

> - for the logs, map task URNs to simple integers;

That's not such a good idea. The current scheme allows one to
figure out the thread hierarchy.
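
A small sketch of what that means in practice. It assumes that the last
dash-separated field of an identity is a per-task id and that the fields
before it name the thread; that reading is inferred from the log excerpts
in this thread, not taken from the cog source, and thread_of is a
hypothetical helper name.

```shell
# Split a task identity into its thread path by dropping the trailing id.
# Assumption: identity = "urn:" + thread path + "-" + per-task id.
thread_of() {
    id=${1#urn:}               # drop the "urn:" prefix
    printf '%s\n' "${id%-*}"   # drop the last "-field"
}

thread_of urn:0-0-6-0-1188429807109     # prints 0-0-6-0
thread_of urn:0-0-6-0-1-1188429807115   # prints 0-0-6-0-1
```

Under that reading, 0-0-6-0-1 sits below 0-0-6-0, which is the kind of
relationship a mapping to plain integers would hide.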

>    display the mapping up front

grep "running in" swift.out

> 
> - Mike
> 
> 
> 2007-08-29 18:23:53,895 INFO  vdl:dostagein Staged in pc1.pcap to 
> awf2-rm4p72i7lp0r0/shared/ on UC
> 2007-08-29 18:23:53,896 INFO  vdl:execute2 Running job angle4-h2fjbhgi 
> angle4 with arguments [pc1.pcap, 
> of-75398839-775c-40ac-bd5c-49275e3269d5-0-1, 
> cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1] in 
> awf2-rm4p72i7lp0r0/angle4-h2fjbhgi on UC
> 2007-08-29 18:23:54,078 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-3-0-1188429807105) setting status to Submitted
> 2007-08-29 18:23:54,943 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-2-0-1188429807107) setting status to Submitted
> 2007-08-29 18:23:55,364 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-6-0-1188429807109) setting status to Submitted
> 2007-08-29 18:23:55,503 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-3-0-1188429807105) setting status to Active
> 2007-08-29 18:23:57,057 DEBUG TaskImpl Task(type=2, 
> identity=urn:0-0-1-0-1-1188429807096) setting status to Completed
> ...
> 2007-08-29 18:23:58,117 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-1-0-1188429807111) setting status to Submitted
> 2007-08-29 18:24:01,480 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-4-0-1188429807103) setting status to Active
> 2007-08-29 18:24:06,322 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-2-0-1188429807107) setting status to Active
> 2007-08-29 18:24:06,727 DEBUG TaskImpl Task(type=1, 
> identity=urn:0-0-6-0-1188429807109) setting status to Completed
> 2007-08-29 18:24:06,729 DEBUG TaskImpl Task(type=4, 
> identity=urn:0-0-6-0-1188429807113) setting status to Active
> 2007-08-29 18:24:06,734 DEBUG TaskImpl Task(type=4, 
> identity=urn:0-0-6-0-1188429807113) setting status to Completed
> 2007-08-29 18:24:06,735 INFO  vdl:execute2 Completed job angle4-h2fjbhgi 
> angle4 with arguments [pc1.pcap, 
> of-75398839-775c-40ac-bd5c-49275e3269d5-0-1, 
> cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1] on UC
> 2007-08-29 18:24:06,744 INFO  vdl:dostageout Staging out 
> awf2-rm4p72i7lp0r0/shared/of-75398839-775c-40ac-bd5c-49275e3269d5-0-1 to 
> file://localhost/of-75398839-775c-40ac-bd5c-49275e3269d5-0-1 from UC
> 2007-08-29 18:24:06,744 INFO  vdl:dostageout Staging out 
> awf2-rm4p72i7lp0r0/shared/cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1 to 
> file://localhost/cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1 from UC
> 2007-08-29 18:24:06,745 DEBUG TaskImpl Task(type=4, 
> identity=urn:0-0-6-0-1-1188429807115) setting status to Active
> 
> 
> Michael Wilde wrote:
> > Great - thanks.  That was indeed the problem: my application script had 
> > a typo and was trying to run the 32-bit binary regardless of what processor 
> > type it wound up on.  When I last ran successfully, I was getting most 
> > or all i686 machines; this time I was getting ia64 machines.
> > 
> > I'll try to re-run it w/o debug, and see if the messages need improvement.
> > 
> > Kickstart would have helped here - would have told me that I'm running on 
> > ia64.
> > 
> > This is the kind of problem that on a local machine would have been 
> > recognizable instantly but on a remote machine through swift, karajan, 
> > globus and PBS is a much greater challenge to diagnose.  We should think 
> > in terms of how to make that long pipeline to the remote execution 
> > environment much more transparent to the user.
> > 
> > Think: "what would I see if I ran this locally" and "how do I bring that 
> > environment to the swift user"?
> > 
> > Also noted that:
> > 
> > - the retry logic here did more harm than good. Maybe we want the 
> > default for this to be off, especially during debugging.
> > 
> > - in my latest run, which succeeded, the final job completion was 
> > excessively delayed. The output files were all back on the submit host, 
> > 4 of 5 jobs were logged as completed, and the completion of the final 
> > job seemed to take a few minutes longer.
> > 
> > I'll work through the error logs more closely and file an enhancement 
> > request in bugz.
> > 
> > I can batch these for later discussion or bring them as I encounter 
> > things, whatever people prefer.  I don't want to distract anyone at the 
> > moment into long discussions on these; I'll organize them into bug 
> > reports and enhancement requests and file for discussion when we next 
> > review priorities.
> > 
> > Ian was suggesting that this be soon - now is when we need to pick the 
> > next features for you to work on, Ben and Mihael.  Maybe a review of 
> > bugs and requests next week, which can be started by email discussion, 
> > and we'll note which topics need voice or f2f discussion.
> > 
> > - Mike
> > 
> > 
> > Mihael Hategan wrote:
> >> Ok. You have a bunch of errors, mainly of two types:
> >> 1. Missing output file (we should add a rule in error.properties to make
> >> that verbose message a little more readable). This may be because the
> >> application didn't run or because the filesystem is broken. Right now an
> >> exit code file is produced by the wrapper only if the exit code of the
> >> application is not 0. This does not allow distinguishing between the
> >> application having completed successfully and the filesystem being
> >> broken. I believe that a stamp file should also be created by the
> >> wrapper in order to distinguish between the two. The reason for the
> >> stamp file instead of always having an exit code file is that it is more
> >> efficient to check the existence of a file than to stage it out and look
> >> at its contents.
> >>
> >> 2. Exit code != 0. Looks like some issues with R.
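
The stamp file idea in point 1 could look roughly like the sketch below.
This is not the actual wrapper: run_wrapped and the file names "exitcode"
and "stamp" are placeholders chosen for illustration.

```shell
# Hypothetical wrapper sketch. "exitcode" and "stamp" are placeholder
# names, not the names used by the real Swift wrapper.
run_wrapped() {
    dir=$1; shift
    "$@"
    rc=$?
    # Exit code file only when the application fails (current behaviour
    # as described in point 1).
    if [ "$rc" -ne 0 ]; then
        printf '%s\n' "$rc" > "$dir/exitcode"
    fi
    # Stamp file in every case: if it is missing, the wrapper never got
    # this far or the filesystem lost it.
    : > "$dir/stamp"
    return "$rc"
}

ok=$(mktemp -d); bad=$(mktemp -d)
run_wrapped "$ok" true
run_wrapped "$bad" sh -c 'exit 3' || echo "failed with $(cat "$bad/exitcode")"
```

The submit side would then only need an existence check on the stamp
file, and would stage out the exit code file only when it is present.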
> >>
> >> Mihael
> >>
> >> On Thu, 2007-08-30 at 08:31 -0500, Michael Wilde wrote:
> >>> Resending this after changing list to take larger attachments.
> >>> Previous message seems to have gotten lost (I musta pressed the wrong 
> >>> button in the list manager?)
> >>>
> >>> ---
> >>>
> >>> I'm progressing on the angle runs. Previous errors were due to problems
> >>> with svn update, and then apparently needing ant clean and distclean.
> >>>
> >>> Now I'm executing but getting I/O errors.  I've attached all the logs and
> >>> output from this run.
> >>>
> >>> My result files are coming back zero-length and I'm seeing I/O errors in
> >>> the logs (eg, in swift.out):
> >>>
> >>> ...
> >>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Submitted
> >>> Task(type=2, identity=urn:0-0-6-0-1-1188429807121) setting status to Active
> >>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Active
> >>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to Failed Exception in getFile
> >>>
> >>> ...
> >>>
> >>> My suspicion is that the app is failing and not producing an expected
> >>> output file.  Perhaps there's a clean error in the log that says this but
> >>> I haven't found it yet.  I think I saw error #500's from gridftp in 
> >>> the log.
> >>>
> >>> While I debug further, if anyone sees a different or obvious cause, I'd
> >>> appreciate your eyeballs on it.
> >>>
> >>> Thanks,
> >>>
> >>> Mike
> >>>
> >>> _______________________________________________
> >>> Swift-user mailing list
> >>> Swift-user at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >>
> >>