[Swift-devel] execution.retries
Ben Clifford
benc at hawaga.org.uk
Tue Jun 10 11:38:14 CDT 2008
In that log file, it looks like there are a lot of attempts to initialise
the shared directory (each of which fails, by the looks of it).
Look how many lines there are in the log like this:
2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START host=OSG_LIGO_MIT
followed closely by:
2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-701-1-1213112750531) setting status to Failed org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Error communicating with the GridFTP server
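To put a number on "how many", something like the following rough sketch could count the START entries per host. It assumes each log entry sits on a single line in the actual log file (the wrapping above is just from the mail), and the function name is made up for illustration; it is not part of Swift.

```python
import re
from collections import Counter

# Matches the INFO entries quoted above and captures the host name.
START_RE = re.compile(r"vdl:initshareddir START\s+host=(\S+)")

def count_initshareddir_starts(lines):
    """Count 'vdl:initshareddir START' entries per host.

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    counts = Counter()
    for line in lines:
        match = START_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_initshareddir_starts() gives a per-host tally, so a site that is being retried over and over should stand out.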
It looks like you are suffering from the "fast fail" problem here though -
see bug 101.
The site fails rapidly. The scheduler will never go below 2 jobs per site
at once, no matter how much it fails.
So, Swift will submit to that site many many times, all of which will
fail; and so that site will absorb all the retries for a site.
To stop this, the scheduler pretty much needs to be able to go below
2-jobs-per-site. The lower limit could go to 0; however, once a site has
gone to 0, it's possible that no more jobs will be run and the score will
never go up; thus a transient failure might cause a site to be ignored
forever. And if every site does this, then eventually you might end up
with no sites at all to run.
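The trade-off can be seen in a toy model. This is only a sketch and not the real scheduler: the rule used here (halve the per-site job limit on a failed round, add one on a successful round) is an assumption made up for illustration, as are the function and parameter names. The only things taken from the description above are the lower limit and the fact that a site at 0 gets no jobs and so never produces a result that could raise its limit.

```python
def simulate(min_jobs, rounds, fails_until_round):
    """Toy model of one site's concurrent-job limit (not Swift's real scoring).

    The site fails every job until round `fails_until_round`, then recovers.
    Returns (final_limit, jobs_submitted, jobs_failed).
    """
    limit = 2
    submitted = 0
    failed = 0
    for r in range(rounds):
        if limit == 0:
            # No jobs are sent, so no result can ever raise the limit again.
            continue
        submitted += limit
        if r >= fails_until_round:
            limit += 1  # assumed reward for a successful round
        else:
            failed += limit
            limit = max(min_jobs, limit // 2)  # assumed penalty, with a floor
    return limit, submitted, failed


# Floor of 2: the broken site keeps receiving (and failing) 2 jobs per round.
print(simulate(min_jobs=2, rounds=10, fails_until_round=5))
# Floor of 0: few jobs are wasted, but the site is never tried again,
# even though it recovers at round 5.
print(simulate(min_jobs=0, rounds=10, fails_until_round=5))
```

In this model the floor of 2 wastes 10 jobs on the broken site but lets it come back once it recovers, while the floor of 0 wastes only 3 and then ignores the site for good.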
If you would like to experiment, I'll show you where in the source code to
change that lower limit.
There are a few other ideas being tossed around - I have some, and I think
Mihael has some too - to deal with this.