[Swift-devel] execution.retries

lixi at uchicago.edu lixi at uchicago.edu
Tue Jun 10 11:58:52 CDT 2008


>Look how many lines there are in the log like this:
>
>2008-06-10 10:48:03,137-0500 INFO  vdl:initshareddir START host=OSG_LIGO_MIT
>
>followed closely by:
>
>2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-701-1-1213112750531) setting status to Failed org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Error communicating with the GridFTP server

Yes, I've seen that. My question is: does each of these lines
correspond to a separate execution retry? If so, could we add
another factor, besides the final failureFactor, that changes
the site's score when this happens? That might prevent further
retries of this single job from being submitted to the same site.

>The site fails rapidly. The scheduler will never go below 2 jobs per site
>at once, no matter how much it fails.

>So, Swift will submit to that site many many times, all of which will
>fail; and so that site will absorb all the retries for a site.

However, during execution, couldn't the site's score drop to a
negative value, which would cancel the at-least-2-jobs limit?

Then, to some extent, that site would have no chance of
absorbing more retries.

>Pretty much to stop this, the scheduler needs to be able to go below
>2-jobs-per-site. The lower limit could go to 0; however, once a site has
>gone to 0, it's possible that no more jobs will be run and the score will
>never go up; thus a transient failure might cause a site to be ignored
>forever. And if every site does this, then eventually you might end up
>with no sites at all to run.

I agree that the lower limit should not be 0.
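
To make the trade-off concrete, here is a toy sketch in Python. It is
not the actual scheduler code: the function names, the "+ 2" offset and
the success/failure factors are all made up for illustration. It only
shows why a floor of 0 can leave a site stuck after a transient outage,
while a floor of 1 still lets the score recover.

```python
import math

# Toy model only: every name and number here is hypothetical,
# not taken from Swift's scheduler.

def job_limit(score, min_jobs):
    """Concurrent jobs allowed on a site: grows with score, never below min_jobs."""
    return max(min_jobs, math.floor(score) + 2)

def simulate(min_jobs, outcomes, score=0.0, success_factor=0.1, failure_factor=0.5):
    """Replay per-round site health (True = site works) and return the final score.

    Each round, one job is submitted only if the limit allows at least one job.
    """
    for site_is_up in outcomes:
        if job_limit(score, min_jobs) < 1:
            continue  # nothing is submitted, so the score can never change again
        score += success_factor if site_is_up else -failure_factor
    return score

# A transient outage: 6 failing rounds, then the site works for 20 rounds.
outcomes = [False] * 6 + [True] * 20

stuck = simulate(min_jobs=0, outcomes=outcomes)       # site is ignored forever
recovering = simulate(min_jobs=1, outcomes=outcomes)  # site keeps getting probed
print(stuck, recovering)
```

With a floor of 0 the site stops receiving jobs after a few failures
and its score stays frozen; with a floor of 1 it absorbs fewer retries
than with a floor of 2 but can still climb back once the site works.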

>If you would like to experiment, I'll show you where in the source code to
>change that lower limit.

In fact, I know where to modify the limit. What I don't
know is how to use the whole framework to compile and
generate a new version.

Xi


