[Swift-user] 1) disable retry mechanism and 2) continue on failure?
Ioan Raicu
iraicu at cs.uchicago.edu
Sun Mar 30 19:02:08 CDT 2008
Hi all,
We have a workflow (a simple for loop) that is failing with this error:
> runam6 failed
> Execution failed:
> Exception in runam6:
> Arguments: [000000, 0.200000, 0.000391, 0.204419, 0.200000, 0.000391,
> 0.204419, 10]
> Host: bgp
> Directory: amps2-20080330-1849-hnpls37c/jobs/y/runam6-yvkudjqi
> stderr.txt: mkdir: cannot create directory `am.000000': File exists
> mkdir: cannot create directory `source': File exists
> mkdir: cannot create directory `scendat': File exists
> mkdir: cannot create directory `scendat/bgtestcases': File exists
> mkdir: cannot create directory `out_dir': File exists
> mkdir: cannot create directory `rftxtdat': File exists
Here is the command we invoke as seen by Falkon:
> execute: /bin/bash shared/wrapper.sh runam6-xvkudjqi -jobdir x -e
> /tmp/runam6 -out stdout.txt -err stderr.txt -i -d -if -of
> amdi.000000 -k -a 000000 0.200000 0.000391 0.204419 0.200000 0.000391
> 0.204419 10
> 1206921034.45020: executing taskID urn:0-1-1-1206920989652 /bin/bash
> shared/wrapper.sh runam6-xvkudjqi -jobdir x -e /tmp/runam6 -out
> stdout.txt -err stderr.txt -i -d -if -of amdi.000000 -k -a 000000
> 0.200000 0.000391 0.204419 0.200000 0.000391 0.204419 10 ... completed
> with exit code 0 in 40585.011719 ms!
> sendResults: urn:0-1-1-1206920989652#0
Things look fine when we inspect the output manually, but somehow Swift
doesn't think so. BTW, the application execution time is about 40
seconds, and we are running things on a single CPU at the moment, so I
don't think it's a concurrency issue. The same problem appears even if
we run Swift, Falkon, and the Falkon worker all on the same node.
Either way, we need to get some results tonight for a talk tomorrow, and
we don't have the time to fix the real problem, whatever it may be.
So, here are the 2 questions I have.
1) How do we disable the retry mechanism, to make sure that Swift won't
retry failed jobs?
2) How do we configure Swift to keep submitting every task it can (in
our case, that should be all tasks, as we have only one for loop, with
no data dependencies between iterations), even though all tasks will
eventually fail?
The motivation for these questions is so we can do a large run via Swift
and let the Falkon exit codes tell us whether tasks failed or
succeeded. Our hope is that things are actually executing fine at
the application level (we'll do some sanity checks to make sure), and
that Swift is reporting errors for some reason we don't yet
understand. If the application does run successfully, we can
produce some graphs of the run from the Falkon logs, and get those
results for the talk tomorrow. We'll then have to figure out exactly
why the error is happening, and how to fix it, but that seems out of
the scope of our work for the next 12 hours.
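As a rough illustration of what "letting the Falkon exit codes guide us" could look like, here is a minimal sketch that tallies exit codes from Falkon log lines. It assumes each task record sits on a single line in the real log (the excerpt above is wrapped by the mail client) and follows the "executing taskID ... completed with exit code N in T ms!" form shown there. The function name `tally_exit_codes` and the regular expression are hypothetical helpers written for this example, not part of Falkon or Swift.

```python
import re
from collections import Counter

# Pattern modeled on the log excerpt in this message. Assumption: in the
# real log file each task record is one unwrapped line.
TASK_RE = re.compile(
    r"executing taskID (?P<task>\S+) .*? completed with exit code "
    r"(?P<code>-?\d+) in (?P<ms>\d+(?:\.\d+)?) ms"
)


def tally_exit_codes(lines):
    """Count how many task records ended with each exit code.

    Lines that do not match the task-record pattern (for example the
    'sendResults:' lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = TASK_RE.search(line)
        if match:
            counts[int(match.group("code"))] += 1
    return counts
```

Something like `tally_exit_codes(open(path))` on the Falkon log would then give a per-exit-code count, where `path` stands for wherever the log actually lives. A nonzero key in the result would point to tasks that failed at the application level rather than in Swift.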
One last thing. The Swift and Falkon installs we have from SVN (updated
today) passed sanity checks... we could run sleep jobs just fine. But
this app unfortunately doesn't.
Thanks,
Ioan
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================