[Swift-user] Swift run finished with errors
lixi at uchicago.edu
Thu Jul 3 07:21:24 CDT 2008
Thank you for the detailed explanation.
In addition, I would like to know which sites these 3 tries were
submitted to, and likewise for the replications, because I want to
explore the details of the scheduler's behavior.
Thanks,
Xi
---- Original message ----
>Date: Thu, 3 Jul 2008 08:22:57 +0000 (GMT)
>From: Ben Clifford <benc at hawaga.org.uk>
>Subject: Re: [Swift-user] Swift run finished with errors
>To: lixi at uchicago.edu
>Cc: swift-user <swift-user at ci.uchicago.edu>
>
>
>That job failed 3 times. Sometimes that will happen.
>
>There are various things you can do to reduce the effect this has on your
>run:
>
>Turn on lazy.errors in swift.properties:
> Normally when one job has failed (eg. it has used up all of its
> retries) then the whole run is immediately abandoned.
> If you turn on lazy errors, then the rest of the run will attempt to
> continue. This means that you might end up with a run in which only
> that one job (or perhaps only a small number of jobs) has failed. The
> restart log (*.rlog) should then let you run again to try that small
> number again.
>
>Increase the number of retries in swift.properties - execution.retries.
> This is set to 2 by default, meaning that a job will be executed up to
> three times - once originally, and twice more as retries if there are
> failures. You can increase this a small amount, eg to 5, to massively
> reduce the probability of a job failure caused by random job failures. (eg
> if you have p=0.01 chance of a job submission failing, then
> execution.retries=2 gives p^3 = 0.000001 chance of failure; but
> execution.retries=5 gives p^6 = 0.000000000001 chance of failure)
>
> This does not help when the failures are caused by a broken job (such
> as missing input files on the submit side); in such a case it will
> increase load on remote systems and slow the run down.
>
>--
>
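
The retry arithmetic in the quoted message can be checked with a short
Python sketch. It assumes each attempt fails independently with the same
probability p, which is the simplification the message itself makes; the
function name failure_probability is only illustrative and is not part
of Swift.

```python
def failure_probability(p: float, retries: int) -> float:
    """Chance that a job fails on every attempt.

    Assumes attempts fail independently with the same probability p
    (a simplification; real site failures are often correlated).
    """
    attempts = retries + 1  # one original execution plus the retries
    return p ** attempts


# Compare the default (2 retries) with the suggested value (5 retries)
# for the p = 0.01 example used in the message.
for retries in (2, 5):
    prob = failure_probability(0.01, retries)
    print(f"execution.retries={retries}: chance of failure ~ {prob:.0e}")
```

In swift.properties the two settings discussed would presumably be
written as lazy.errors=true and execution.retries=5; the exact value
syntax is an assumption here and should be checked against the
properties file shipped with your Swift release.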