[Swift-devel] execution.retries
lixi at uchicago.edu
Tue Jun 10 11:58:52 CDT 2008
>Look how many lines there are in the log like this:
>
>2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START host=OSG_LIGO_MIT
>
>followed closely by:
>
>2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task(type=FILE_OPERATION,
>identity=urn:0-1-701-1-1213112750531) setting status to Failed
>org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Error
>communicating with the GridFTP server
Yes, I've seen that. My question is: do these lines correspond to
different execution retries? If so, could we add another factor,
besides the final failureFactor, that changes the site's score?
That might prevent further retries of this single job from being
submitted to this site again.
>The site fails rapidly. The scheduler will never go below 2 jobs per site
>at once, no matter how much it fails.
>So, Swift will submit to that site many, many times, all of which will
>fail; and so that site will absorb all the retries for a site.
However, during execution, couldn't the site's score be decreased to a
negative value, which would remove the at-least-2-jobs limit?
Then, to some extent, that site would have no chance of
absorbing more retries.
>Pretty much to stop this, the scheduler needs to be able to go below
>2-jobs-per-site. The lower limit could go to 0; however, once a site has
>gone to 0, it's possible that no more jobs will be run and the score will
>never go up; thus a transient failure might cause a site to be ignored
>forever. And if every site does this, then eventually you might end up
>with no sites at all to run.
I agree that the lower limit should not be 0.
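To make the trade-off concrete, here is a rough toy model of a score-based
throttle with a lower limit. It is only a sketch under my own assumptions,
not the actual Swift scheduler code: the names job_limit and simulate, the
one-slot-per-score-point mapping, the starting score, and the penalty of one
point per failed job are all hypothetical.

```python
# Toy model of a per-site job throttle. All names and numbers here are
# hypothetical; this is NOT how the Swift/CoG scheduler is implemented.


def job_limit(score: float, min_jobs: int) -> int:
    """Concurrent-job limit for a site: grows with score, never below min_jobs.

    Assumed mapping: one job slot per whole point of score.
    """
    return max(min_jobs, int(score))


def simulate(min_jobs: int, rounds: int,
             failure_penalty: float = 1.0,
             start_score: float = 4.0) -> int:
    """Count jobs submitted to a site on which every job fails."""
    score = start_score
    submitted = 0
    for _ in range(rounds):
        n = job_limit(score, min_jobs)
        submitted += n
        # Every submitted job fails, so each one lowers the score.
        score -= failure_penalty * n
    return submitted


if __name__ == "__main__":
    # Floor of 2: the broken site keeps receiving jobs in every round.
    print("min_jobs=2:", simulate(min_jobs=2, rounds=10))
    # Floor of 0: submissions stop, but the score can then never recover.
    print("min_jobs=0:", simulate(min_jobs=0, rounds=10))
```

In this model, with a floor of 2 the failing site still gets 2 jobs per round
however negative its score becomes, and with a floor of 0 it gets nothing
once the score reaches 0, so nothing can ever raise the score again.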
>If you would like to experiment, I'll show you where in the source code to
>change that lower limit.
In fact, I know where to modify the limit. What I don't
know is how to use the whole framework to compile and
generate a new version.
Xi