[Swift-devel] Question about retry behavior

Ben Clifford benc at hawaga.org.uk
Fri Mar 2 02:30:26 CST 2012


The scenario below was also a problem with grid sites on OSG that "failed fast"; there, though, there was (and still is) a site scoring mechanism that tries to slow down submissions to such a site. Plus ça change, plus c'est la même chose.
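
For illustration, here is a rough sketch of what a scoring mechanism of that kind could look like when applied per node. It is not Swift's actual scoring code: the class name, the update rule, and the constants are all assumptions made up for this example. The idea is only that each failure lowers a node's score, and a lower score makes the node less likely to be handed the next job.

```python
import random


class ScoredNode:
    """Hypothetical per-node score: successes raise it, failures lower it."""

    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails   # True models a node that fails every job
        self.score = 1.0     # assumed starting score

    def record(self, ok):
        # Assumed update rule: small additive reward, multiplicative penalty,
        # with the score clamped to the range [0.01, 10.0].
        if ok:
            self.score = min(self.score + 0.1, 10.0)
        else:
            self.score = max(self.score * 0.5, 0.01)


def pick(nodes, rng):
    """Choose one node with probability proportional to its score."""
    weights = [n.score for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def run(num_jobs=1000, seed=0):
    """Dispatch num_jobs jobs over 100 nodes, one of which always fails.

    Returns (number of jobs sent to the bad node, final score of that node).
    """
    rng = random.Random(seed)
    nodes = [ScoredNode(f"node{i}", fails=(i == 0)) for i in range(100)]
    bad_hits = 0
    for _ in range(num_jobs):
        node = pick(nodes, rng)
        ok = not node.fails
        node.record(ok)
        if not ok:
            bad_hits += 1
    return bad_hits, nodes[0].score
```

Calling run() returns how many jobs landed on the failing node and that node's final score. The sketch ignores the fact that a fail-fast node also frees up sooner than a healthy one, which is what makes the real problem worse.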

On Mar 2, 2012, at 9:05 AM, David Kelly wrote:

> 
> Consider the case of one John Q. Swifterson. 
> 
> Mr. Swifterson is working late one night performing science. He has written a very important program to simulate the effects of cocaine on honeybee dance behavior. 
> 
> John is using persistent coasters and has 100 nodes available. Each node performs only 1 task at a time. In an abundance of caution, he sets execution.retries=50.
> 
> John then submits 100,000 jobs. 99 jobs start immediately and work as expected. But 1 job fails due to a corrupted binary on its node. What should happen next?
> 
> The Swift user guide says this:
> ---
> If an application procedure execution fails, Swift will attempt that execution again repeatedly until it succeeds, up until the limit defined in the execution.retries configuration property.
> 
> Site selection will occur for retried jobs in the same way that it happens for new jobs. Retried jobs may run on the same site or may run on a different site.
> 
> If the retry limit execution.retries is reached for an application procedure, then that application procedure will fail. This will cause the entire run to fail - either immediately (if the lazy.errors property is false) or after all other possible work has been attempted (if the lazy.errors property is true).
> ---
> 
> Since 99 of the 100 nodes are in use, all 50 retries will occur on the same problematic node. This causes the entire run to fail. Is this correct? Is there any way to change this behavior?
> 
> One possibility is to set a job throttle lower than the number of sites actually available. That might increase the chances of success a bit.
> 
> Is there any way to force retries to happen on a different node? And also, optionally, to disconnect nodes that experience high failure rates?
> 
> Thanks,
> David
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
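
The scenario in the quoted message can be sketched as a small simulation. This is only a model under stated assumptions, not Swift or coasters code: the function names are hypothetical, the node numbering is arbitrary, and it assumes that execution.retries counts retries after one initial attempt. Policy A models the situation described, where the bad node is the only free one every time the job is retried. Policy B models the requested behavior, where a retry skips any node the job has already failed on.

```python
BAD = 99  # hypothetical index of the node with the corrupted binary


def is_bad(node):
    """True if the application always fails on this node."""
    return node == BAD


def run_with_retries(pick_node, retries):
    """Attempt a job once, then retry up to `retries` times.

    pick_node(tried) returns the node for the next attempt, given the list
    of nodes this job has already failed on.
    Returns (succeeded, list of nodes tried in order).
    """
    tried = []
    for _ in range(1 + retries):
        node = pick_node(tried)
        tried.append(node)
        if not is_bad(node):
            return True, tried
    return False, tried


def only_free_node(tried):
    """Policy A: the other 99 nodes are busy, so the bad node is always chosen."""
    return BAD


def avoid_failed(tried):
    """Policy B (hypothetical): skip nodes this job has already failed on.

    Assumes the job can wait until some other node becomes free.
    """
    for node in range(99, -1, -1):
        if node not in tried:
            return node
    raise RuntimeError("job has failed on every node")


ok_a, tried_a = run_with_retries(only_free_node, retries=50)
ok_b, tried_b = run_with_retries(avoid_failed, retries=50)
```

Under policy A the job never succeeds and every attempt lands on the bad node, so the run fails once the retry limit is reached. Under policy B the job fails once, moves to a different node, and succeeds on the second attempt.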



