[Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?

Sun Mar 30 20:46:28 CDT 2008

On Sun, 30 Mar 2008, Ioan Raicu wrote:

hmm. this:

> > Directory: amps2-20080330-1849-hnpls37c/jobs/y/runam6-yvkudjqi

doesn't match up with this:

> Here is the command we invoke as seen by Falkon:
> > execute: /bin/bash shared/wrapper.sh runam6-xvkudjqi -jobdir x -e

as in, Swift is expecting to run job ID yvkudjqi, but your log shows
Falkon running the wrapper script associated with job ID xvkudjqi.

That seems wrong - did you cut and paste the wrong log line, or is there 
some mismatch going on in the code? Can you check? Swift should make at 
most one submission for any job ID (an identifier that looks like 
yvkudjqi), so if you see Falkon trying to run code for any one job ID 
more than once, something is awry.
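
One way to check this is to count how often each job ID appears in the
Falkon log. The following is only a rough Python sketch, not part of Swift
or Falkon. It assumes that each "executing taskID" entry sits on a single
line in the real log (the quoted copy in this thread is wrapped by the
mailer) and that the job ID is the first argument after shared/wrapper.sh,
as in the excerpt you sent. The function names are made up for
illustration.

```python
import re
from collections import Counter

# Assumed log shape, taken from the line quoted in this thread:
#   <time>: executing taskID <urn> /bin/bash shared/wrapper.sh <jobid> -jobdir ...
# Only "executing taskID" lines are matched, so the separate "execute:" line
# for the same task is not counted a second time.
WRAPPER_RE = re.compile(
    r"executing taskID \S+ /bin/bash shared/wrapper\.sh (\S+)"
)


def count_job_ids(lines):
    """Return a Counter mapping each job ID to the number of times it was run."""
    counts = Counter()
    for line in lines:
        match = WRAPPER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_job_ids(lines):
    """Return only the job IDs that show up in more than one execution."""
    return {job: n for job, n in count_job_ids(lines).items() if n > 1}
```

Passing the lines of the Falkon log to repeated_job_ids() should give an
empty dict if every job ID was run at most once. Any entry in the result is
a job ID that Falkon ran more than once and is worth a closer look.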

Also, I remembered that another time I've seen errors like this 
mkdir/already exists problem, the cause was broken file mappings. But it 
looks like you are inputting only numerical values and outputting a single 
file, so I think that won't be a problem here.

> > /tmp/runam6 -out stdout.txt -err stderr.txt -i -d  -if  -of amdi.000000 -k
> > -a 000000 0.200000 0.000391 0.204419 0.200000 0.000391 0.204419 10
> > 1206921034.45020: executing taskID urn:0-1-1-1206920989652 /bin/bash
> > shared/wrapper.sh runam6-xvkudjqi -jobdir x -e /tmp/runam6 -out stdout.txt
> > -err stderr.txt -i -d  -if  -of amdi.000000 -k  -a 000000 0.200000 0.000391
> > 0.204419 0.200000 0.000391 0.204419 10 ... completed with exit code 0 in
> > 40585.011719 ms!
> > sendResults: urn:0-1-1-1206920989652#0
> Things look fine when we look at the output manually, but somehow Swift
> doesn't think so.  BTW, the application execution time is about 40 seconds,
> and we are running things on a single CPU at the moment, so I don't think it's
> a concurrency issue.  The same problem appears even if we run Swift,
> Falkon, and the Falkon worker all on the same node.  Either way, we need to get
> some results tonight for a talk tomorrow, and we don't have the time to fix
> the real problem, whatever it may be. 
> So, here are the 2 questions I have. 
> 1) How do we disable the retry mechanism, to make sure that Swift won't retry
> failed jobs?
> 2) How do we configure Swift to continue sending all tasks it is able to (in
> our case, it should be all tasks, as we only have 1 for loop, with no data
> dependencies between iterations), although all tasks will eventually fail?
> 
> The motivation for these questions is so we can do a large run via Swift and
> let the Falkon exit codes guide us to whether tasks failed or were successful.
> Our hopes are that things are actually executing fine at the application
> (we'll do some sanity checks to make sure), and that somehow Swift is
> reporting errors due to some reason we don't understand.  If the application
> indeed runs successfully, we could produce some graphs of the run from the
> Falkon logs, and get those results for the talk tomorrow.   We'll then have to
> figure out exactly why the error is happening, and how to fix that, but that
> seems out of the scope of our work for the next 12 hours.
> 
> One last thing.  The Swift and Falkon installs we have from SVN (updated
> today) passed sanity checks... we could run sleep jobs just fine.  But this
> app unfortunately doesn't. 
> Thanks,
> Ioan