[Swift-devel] execution.retries

Tue Jun 10 11:38:14 CDT 2008

In that log file, it looks like there are a lot of attempts to initialise 
the shared directory (each of which fails, by the looks of it).

Look how many lines there are in the log like this:

2008-06-10 10:48:03,137-0500 INFO  vdl:initshareddir START host=OSG_LIGO_MIT

followed closely by:

2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task(type=FILE_OPERATION,
identity=urn:0-1-701-1-1213112750531) setting status to Failed
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
Error communicating with the GridFTP server
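
If you want a number rather than eyeballing it, here is a minimal sketch
that counts those START lines per host. It assumes each entry sits on one
line in the real log, in the format quoted above; the function name
count_initshareddir_starts is made up for this example.

```python
import re
from collections import Counter

# Matches the "vdl:initshareddir START host=<name>" entries quoted above.
# Assumption: the whole entry is on a single line in the actual log file.
START_RE = re.compile(r"vdl:initshareddir START\s+host=(\S+)")

def count_initshareddir_starts(lines):
    """Return a Counter mapping host name -> number of initshareddir STARTs."""
    counts = Counter()
    for line in lines:
        match = START_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Pass it an open log file (any iterable of lines works) and look at
most_common() on the result; a site stuck in fast fail should stand out
with a far higher count than the others.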

It looks like you are suffering from the "fast fail" problem here though - 
see bug 101.

The site fails rapidly. The scheduler will never go below 2 concurrent 
jobs per site, no matter how often that site fails.

So Swift will submit to that site many, many times, and every one of 
those submissions will fail; that site ends up absorbing all the 
available retries.
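
To make the effect concrete, here is a toy model of it. This is not 
Swift's scheduler: the two sites, the tick-based timing, the FIFO retry 
queue and the name simulate are all assumptions made up for illustration. 
One site fails every job instantly, the other succeeds after good_runtime 
ticks, and each is always allowed slots_per_site jobs.

```python
from collections import deque

def simulate(num_jobs, max_attempts, good_runtime, slots_per_site=2):
    """Toy model: one site fails instantly, one succeeds after good_runtime ticks.

    Returns (succeeded, abandoned); abandoned jobs ran out of attempts.
    """
    queue = deque([0] * num_jobs)  # each entry: attempts a job has used so far
    good_busy = []                 # remaining ticks of jobs on the good site
    succeeded = abandoned = 0
    while queue or good_busy:
        # Good site: fill any free slots; these jobs will eventually succeed.
        while queue and len(good_busy) < slots_per_site:
            queue.popleft()
            good_busy.append(good_runtime)
        # Bad site: its slots free up at once, so on every tick it takes up
        # to slots_per_site more jobs and fails each of them.
        for _ in range(min(slots_per_site, len(queue))):
            attempts = queue.popleft() + 1
            if attempts >= max_attempts:
                abandoned += 1          # out of retries
            else:
                queue.append(attempts)  # goes back in the queue for a retry
        # Advance the good site by one tick and collect finished jobs.
        good_busy = [t - 1 for t in good_busy]
        succeeded += sum(1 for t in good_busy if t <= 0)
        good_busy = [t for t in good_busy if t > 0]
    return succeeded, abandoned
```

Trying something like simulate(20, 3, 10) shows the pattern: the good 
site is busy most of the time, so queued jobs keep landing on the failing 
site and use up their attempts there.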

To stop this, the scheduler pretty much needs to be able to go below 2 
jobs per site. The lower limit could go to 0; however, once a site has 
gone to 0, it's possible that no more jobs will be run there and so its 
score will never go up; thus a transient failure might cause a site to be 
ignored forever. And if every site does this, then eventually you might 
end up with no sites at all to run on.
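
The trade-off can be sketched with a toy score. Again, this is not the 
real scoring code: the +/-1 per job update, the starting score and the 
names allowed_jobs and run_rounds are assumptions chosen only to show why 
a floor of 0 can get stuck.

```python
def allowed_jobs(score, floor):
    """Concurrent jobs allowed at a site: the score, clamped to a lower limit."""
    return max(floor, int(score))

def run_rounds(outcomes, floor, start_score=2.0):
    """Feed per-round site health (True = healthy) into a toy score.

    Every job submitted in a round moves the score by +1 if the site was
    healthy that round and by -1 if not. Returns the allowance per round.
    """
    score = start_score
    history = []
    for healthy in outcomes:
        jobs = allowed_jobs(score, floor)
        history.append(jobs)
        # The score only moves when jobs actually run at the site, so an
        # allowance of 0 means it can never recover.
        score += jobs if healthy else -jobs
        score = max(score, 0.0)
    return history
```

With outcomes such as two failing rounds followed by healthy ones, a 
floor of 0 leaves the allowance at 0 even after the site recovers, while 
a floor of 2 keeps probing the site and lets the score climb back.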

If you would like to experiment, I'll show you where in the source code to 
change that lower limit.

There are a few other ideas being tossed around - I have some, and I think 
Mihael has some too - to deal with this.

--