[Swift-user] 1) disable retry mechanism and 2) continue on failure?

Sun Mar 30 19:02:08 CDT 2008

Hi all,
We have a workflow (a simple for loop) that is failing with the following error:
> runam6 failed
> Execution failed:
>         Exception in runam6:
> Arguments: [000000, 0.200000, 0.000391, 0.204419, 0.200000, 0.000391, 
> 0.204419, 10]
> Host: bgp
> Directory: amps2-20080330-1849-hnpls37c/jobs/y/runam6-yvkudjqi
> stderr.txt: mkdir: cannot create directory `am.000000': File exists
> mkdir: cannot create directory `source': File exists
> mkdir: cannot create directory `scendat': File exists
> mkdir: cannot create directory `scendat/bgtestcases': File exists
> mkdir: cannot create directory `out_dir': File exists
> mkdir: cannot create directory `rftxtdat': File exists
Here is the command we invoke as seen by Falkon:
> execute: /bin/bash shared/wrapper.sh runam6-xvkudjqi -jobdir x -e 
> /tmp/runam6 -out stdout.txt -err stderr.txt -i -d  -if  -of 
> amdi.000000 -k  -a 000000 0.200000 0.000391 0.204419 0.200000 0.000391 
> 0.204419 10
> 1206921034.45020: executing taskID urn:0-1-1-1206920989652 /bin/bash 
> shared/wrapper.sh runam6-xvkudjqi -jobdir x -e /tmp/runam6 -out 
> stdout.txt -err stderr.txt -i -d  -if  -of amdi.000000 -k  -a 000000 
> 0.200000 0.000391 0.204419 0.200000 0.000391 0.204419 10 ... completed 
> with exit code 0 in 40585.011719 ms!
> sendResults: urn:0-1-1-1206920989652#0
Things look fine when we look at the output manually, but somehow Swift 
doesn't think so.  BTW, the application execution time is about 40 
seconds, and we are running things on a single CPU at the moment, so I 
don't think it's a concurrency issue.  The same problem appears even if 
we run Swift, Falkon, and the Falkon worker all on the same node.  
Either way, we need to get some results tonight for a talk tomorrow, and 
we don't have the time to fix the real problem, whatever it may be. 

So, here are the 2 questions I have. 

1) How do we disable the retry mechanism, to make sure that Swift won't 
retry failed jobs?
2) How do we configure Swift to keep submitting every task it is able to 
(in our case, that should be all tasks, as we only have 1 for loop, with 
no data dependencies between iterations), even though all tasks will 
eventually fail?

The motivation for these questions is so we can do a large run via Swift 
and let the Falkon exit codes guide us to whether tasks failed or were 
successful.  Our hope is that things are actually executing fine at 
the application level (we'll do some sanity checks to make sure), and 
that Swift is reporting errors for some reason we don't yet 
understand.  If the application indeed runs successfully, we could 
produce some graphs of the run from the Falkon logs, and get those 
results for the talk tomorrow.   We'll then have to figure out exactly 
why the error is happening, and how to fix that, but that seems out of 
the scope of our work for the next 12 hours.
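
As a rough sketch of the log-based check described above: assuming each Falkon log record sits on a single line in the same form as the "executing taskID ... completed with exit code ..." entry quoted earlier, something like the following Python could tally exit codes and collect run times for plotting. The names TASK_RE, SAMPLE, and summarize are made up for illustration and are not part of Falkon or Swift; the layout of a real log file may differ from the quoted line.

```python
import re
from collections import Counter

# Pattern modeled on the single Falkon log entry quoted in this message.
# Assumption: in the real log file each record is on one line.
TASK_RE = re.compile(
    r"executing\s+taskID\s+(?P<task>\S+)\s.*\scompleted\s+with\s+exit\s+code\s+"
    r"(?P<code>-?\d+)\s+in\s+(?P<ms>\d+(?:\.\d+)?)\s+ms"
)

# The quoted log entry, rejoined onto one line (email wrapping removed).
SAMPLE = (
    "1206921034.45020: executing taskID urn:0-1-1-1206920989652 /bin/bash "
    "shared/wrapper.sh runam6-xvkudjqi -jobdir x -e /tmp/runam6 -out "
    "stdout.txt -err stderr.txt -i -d  -if  -of amdi.000000 -k  -a 000000 "
    "0.200000 0.000391 0.204419 0.200000 0.000391 0.204419 10 ... completed "
    "with exit code 0 in 40585.011719 ms!"
)


def summarize(lines):
    """Return (Counter mapping exit code -> task count, list of run times in ms).

    Lines that do not look like a task-completion record are skipped.
    """
    codes = Counter()
    times_ms = []
    for line in lines:
        match = TASK_RE.search(line)
        if match is None:
            continue
        codes[int(match.group("code"))] += 1
        times_ms.append(float(match.group("ms")))
    return codes, times_ms
```

Calling summarize() on an open log file (it iterates line by line) would give the number of tasks per exit code, with any nonzero code pointing at a task that failed at the application level regardless of what Swift reports.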

One last thing.  The Swift and Falkon installs we have from SVN (updated 
today) passed sanity checks... we could run sleep jobs just fine.  But 
this app unfortunately doesn't. 

Thanks,
Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================