[Swift-devel] Re: [Swift-user] Resending: I/O errors in swift script
Michael Wilde
wilde at mcs.anl.gov
Thu Aug 30 12:36:38 CDT 2007
I should note that I did see the jobs getting retried, but no clear
indication of what their exit code was.
Michael Wilde wrote:
> Following up on this, Mihael, you said:
>
> >> 2. Exit code != 0. Looks like some issues with R.
>
> I don't see where in the logs you observed that the jobs were failing. I
> think that would have tipped me off earlier that I have an app problem.
>
> I must be looking in the wrong place. I redirected stdout and stderr
> into a file, starting swift like this:
>
> $ swift -debug awf2.swift >swift.out 2>&1 &
>
> from which I get the following logs when all is done:
>
> $ wc -l *log *out
> 1 awf2-rm4p72i7lp0r0.0.rlog
> 1322 awf2-rm4p72i7lp0r0.log
> 1 swift.log
> 1400 swift.out
> 2724 total
> $
>
> The awf2*.log file seems to be more or less a timestamped version of
> stdout/err. (Interesting to note where the extra lines are going that
> are in swift.out but not in awf2*.log, though. )
>
> In the .log file I see the text that I've excerpted below. I think the
> following improvements could be made and wonder if you agree:
>
> - Clearly show the job exit code (I still don't see this)
> - Use mnemonic codes for task types (rather than 1, 2, ...)
> - For the logs, map task URNs to simple integers and
>   display the mapping up front
>
> - Mike
>
>
> 2007-08-29 18:23:53,895 INFO vdl:dostagein Staged in pc1.pcap to
> awf2-rm4p72i7lp0r0/shared/ on UC
> 2007-08-29 18:23:53,896 INFO vdl:execute2 Running job angle4-h2fjbhgi
> angle4 with arguments [pc1.pcap,
> of-75398839-775c-40ac-bd5c-49275e3269d5-0-1,
> cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1] in
> awf2-rm4p72i7lp0r0/angle4-h2fjbhgi on UC
> 2007-08-29 18:23:54,078 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-3-0-1188429807105) setting status to Submitted
> 2007-08-29 18:23:54,943 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-2-0-1188429807107) setting status to Submitted
> 2007-08-29 18:23:55,364 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-6-0-1188429807109) setting status to Submitted
> 2007-08-29 18:23:55,503 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-3-0-1188429807105) setting status to Active
> 2007-08-29 18:23:57,057 DEBUG TaskImpl Task(type=2,
> identity=urn:0-0-1-0-1-1188429807096) setting status to Completed
> ...
> 2007-08-29 18:23:58,117 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-1-0-1188429807111) setting status to Submitted
> 2007-08-29 18:24:01,480 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-4-0-1188429807103) setting status to Active
> 2007-08-29 18:24:06,322 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-2-0-1188429807107) setting status to Active
> 2007-08-29 18:24:06,727 DEBUG TaskImpl Task(type=1,
> identity=urn:0-0-6-0-1188429807109) setting status to Completed
> 2007-08-29 18:24:06,729 DEBUG TaskImpl Task(type=4,
> identity=urn:0-0-6-0-1188429807113) setting status to Active
> 2007-08-29 18:24:06,734 DEBUG TaskImpl Task(type=4,
> identity=urn:0-0-6-0-1188429807113) setting status to Completed
> 2007-08-29 18:24:06,735 INFO vdl:execute2 Completed job angle4-h2fjbhgi
> angle4 with arguments [pc1.pcap,
> of-75398839-775c-40ac-bd5c-49275e3269d5-0-1,
> cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1] on UC
> 2007-08-29 18:24:06,744 INFO vdl:dostageout Staging out
> awf2-rm4p72i7lp0r0/shared/of-75398839-775c-40ac-bd5c-49275e3269d5-0-1 to
> file://localhost/of-75398839-775c-40ac-bd5c-49275e3269d5-0-1 from UC
> 2007-08-29 18:24:06,744 INFO vdl:dostageout Staging out
> awf2-rm4p72i7lp0r0/shared/cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1 to
> file://localhost/cf-a8272a9e-0f23-472f-8b4e-9f7825877a5a-0-1 from UC
> 2007-08-29 18:24:06,745 DEBUG TaskImpl Task(type=4,
> identity=urn:0-0-6-0-1-1188429807115) setting status to Active
>
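
A rough sketch of the last two suggestions, written as a post-processing
filter rather than a change to Swift itself. It rewrites TaskImpl lines like
the ones excerpted above so that each URN becomes a small integer and each
numeric task type becomes a mnemonic. The function name and the short
mnemonics are made up for illustration, and the meaning assigned to each type
number is an assumption that should be checked against the CoG Kit Task
interface before anyone relies on it.

```python
import re

# ASSUMPTION: meanings of the numeric task types. The mnemonics are invented
# for this sketch; verify the numbers against the CoG Kit Task interface.
TASK_TYPE_NAMES = {1: "JOB", 2: "XFER", 3: "INFO", 4: "FILEOP"}

# Matches e.g. "Task(type=1, identity=urn:0-0-3-0-1188429807105)"
TASK_RE = re.compile(r"Task\(type=(\d+),\s*identity=(urn:[\w-]+)\)")


def shorten_task_ids(log_text):
    """Return (mapping, rewritten_text) with URNs replaced by small integers."""
    mapping = {}  # URN -> integer, numbered in order of first appearance

    def replace(match):
        type_code = int(match.group(1))
        urn = match.group(2)
        task_no = mapping.setdefault(urn, len(mapping) + 1)
        name = TASK_TYPE_NAMES.get(type_code, "TYPE%d" % type_code)
        return "Task(%s #%d)" % (name, task_no)

    rewritten = TASK_RE.sub(replace, log_text)
    return mapping, rewritten


if __name__ == "__main__":
    sample = (
        "DEBUG TaskImpl Task(type=1, identity=urn:0-0-3-0-1188429807105) "
        "setting status to Submitted\n"
        "DEBUG TaskImpl Task(type=1, identity=urn:0-0-3-0-1188429807105) "
        "setting status to Active\n"
    )
    mapping, text = shorten_task_ids(sample)
    # Display the mapping up front, then the shortened log.
    for urn, task_no in mapping.items():
        print("#%d = %s" % (task_no, urn))
    print(text, end="")
```

Numbering by first appearance keeps the integers stable within one log file,
which is all that is needed to follow a single task from Submitted to
Completed by eye.
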
>
> Michael Wilde wrote:
>> Great - thanks. That was indeed the problem: my application script
>> had a typo and was trying to run the 32-bit binary regardless of what
>> processor type it wound up on. When I last ran successfully, I was
>> getting most or all i686 machines; this time I was getting ia64 machines.
>>
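
For illustration only, here is roughly what the selection step in such an
application script is meant to do, sketched in Python. The binary names and
the BINARIES table are hypothetical (the real script is not shown in this
thread); the point is that an unexpected processor type should fail loudly
instead of silently running the wrong binary.

```python
import platform
import sys

# HYPOTHETICAL binary names; the actual application script and its paths
# do not appear in this thread.
BINARIES = {
    "i686": "angle4.i686",
    "x86_64": "angle4.x86_64",
    "ia64": "angle4.ia64",
}


def pick_binary(machine=None):
    """Return the binary name for this processor type, or raise."""
    if machine is None:
        machine = platform.machine()
    # Record the processor type on stderr so it ends up in the job's logs.
    sys.stderr.write("processor type: %s\n" % machine)
    try:
        return BINARIES[machine]
    except KeyError:
        raise RuntimeError("no binary built for processor type %r" % machine)
```

Writing the processor type to stderr costs one line per job and would have
shown the switch from i686 to ia64 directly in the output.
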
>> I'll try to re-run it w/o debug, and see if the messages need
>> improvement.
>>
>> Kickstart would have helped here - would have told me that I'm running
>> on ia64.
>>
>> This is the kind of problem that on a local machine would have been
>> recognizable instantly but on a remote machine through swift, karajan,
>> globus and PBS is a much greater challenge to diagnose. We should
>> think in terms of how to make that long pipeline to the remote
>> execution environment much more transparent to the user.
>>
>> Think: "what would I see if I ran this locally" and "how do I bring
>> that environment to the swift user"?
>>
>> Also noted that:
>>
>> - the retry logic here did more harm than good. Maybe we want the
>> default for this to be off, especially during debugging.
>>
>> - in my latest run, which succeeded, the final job completion was
>> excessively delayed. The output files were all back on the submit
>> host, 4 of 5 jobs were logged as completed, and the completion of the
>> final job seemed to take a few minutes longer.
>>
>> I'll work through the error logs more closely and file an enhancement
>> request in bugz.
>>
>> I can batch these for later discussion or bring them as I encounter
>> things, whatever people prefer. I don't want to distract anyone at the
>> moment into long discussions on these; I'll organize them into bug
>> reports and enhancement requests and file for discussion when we next
>> review priorities.
>>
>> Ian was suggesting that this be soon - now is when we need to pick the
>> next features for you to work on, Ben and Mihael. Maybe a review of
>> bugs and requests next week, which can be started by email discussion,
>> and we'll note which topics need voice or f2f discussion.
>>
>> - Mike
>>
>>
>> Mihael Hategan wrote:
>>> Ok. You have a bunch of errors, mainly of two types:
>>> 1. Missing output file (we should add a rule in error.properties to make
>>> that verbose message a little more readable). This may be because the
>>> application didn't run or because the filesystem is broken. Right now an
>>> exit code file is produced by the wrapper only if the exit code of the
>>> application is not 0. This makes it impossible to tell whether the
>>> application completed successfully or the filesystem is
>>> broken. I believe that a stamp file should also be created by the
>>> wrapper in order to distinguish between the two. The reason for the
>>> stamp file instead of always having an exit code file is that it is more
>>> efficient to check the existence of a file than to stage it out and look
>>> at its contents.
>>>
>>> 2. Exit code != 0. Looks like some issues with R.
>>>
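
The stamp file idea in item 1 can be sketched as follows. This is a Python
illustration of the proposal only, not the wrapper's actual code; the file
names and function names are hypothetical. The wrapper writes an exit code
file only on failure and always writes a stamp file last, so the submit side
can classify a job with existence checks alone.

```python
import os
import subprocess
import sys
import tempfile

STAMP_FILE = "wrapper.stamp"   # hypothetical name
EXIT_CODE_FILE = "exitcode"    # hypothetical name


def run_wrapped(argv, job_dir):
    """Run the application in job_dir, then record what happened there."""
    try:
        exit_code = subprocess.call(argv, cwd=job_dir)
    except OSError:
        exit_code = 127  # the application could not be started at all
    if exit_code != 0:
        with open(os.path.join(job_dir, EXIT_CODE_FILE), "w") as f:
            f.write("%d\n" % exit_code)
    # Written last, so its presence means the wrapper ran to the end.
    with open(os.path.join(job_dir, STAMP_FILE), "w") as f:
        f.write("done\n")
    return exit_code


def diagnose(job_dir):
    """Classify a job using only file-existence checks."""
    if os.path.exists(os.path.join(job_dir, EXIT_CODE_FILE)):
        return "application failed"
    if os.path.exists(os.path.join(job_dir, STAMP_FILE)):
        return "application succeeded"
    return "wrapper did not finish or filesystem problem"


if __name__ == "__main__":
    job_dir = tempfile.mkdtemp()
    run_wrapped([sys.executable, "-c", "import sys; sys.exit(3)"], job_dir)
    print(diagnose(job_dir))
```

With this layout, a missing output file plus a present stamp points at the
application, while a missing stamp points at the wrapper or the filesystem.
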
>>> Mihael
>>>
>>> On Thu, 2007-08-30 at 08:31 -0500, Michael Wilde wrote:
>>>> Resending this after changing list to take larger attachments.
>>>> Previous message seems to have gotten lost (I musta pressed the
>>>> wrong button in the list manager?)
>>>>
>>>> ---
>>>>
>>>> I'm progressing on the angle runs. Previous errors were due to problems
>>>> with svn update, and then apparently needing ant clean and distclean.
>>>>
>>>> Now I'm executing but getting I/O errors. I've attached all the logs
>>>> and output from this run.
>>>>
>>>> My result files are coming back zero-length and I'm seeing I/O errors in
>>>> the logs (e.g., in swift.out):
>>>>
>>>> ...
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to
>>>> Submitted
>>>> Task(type=2, identity=urn:0-0-6-0-1-1188429807121) setting status to
>>>> Active
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to
>>>> Active
>>>> Task(type=2, identity=urn:0-0-6-0-2-1188429807124) setting status to
>>>> Failed Exception in getFile
>>>>
>>>> ...
>>>>
>>>> My suspicion is that the app is failing and not producing an expected
>>>> output file. Perhaps there's a clean error in the log that says this,
>>>> but I haven't found it yet. I think I saw #500 errors from gridftp in
>>>> the log.
>>>>
>>>> While I debug further, if anyone sees a different or obvious cause, I'd
>>>> appreciate your eyeballs on it.
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>>
>>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>