[Swift-devel] Question about retry behavior

Ioan Raicu iraicu at cs.iit.edu
Fri Mar 2 09:56:52 CST 2012


In Falkon, we used to keep track of failures at the workers; when a worker saw too many repeated failures with certain exit codes or keywords in the output stream, we would suspend that faulty worker for some period of time. This worked great for intermittent shared file system problems due to load, as backing off for a while usually fixed the problem. For other things, such as apps not being installed or missing data, it only slowed down the failure rate, but at least the behavior was controllable through the way we configured the worker logic.
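
A rough sketch of that worker-side logic (illustrative Python, not the actual Falkon worker code; the threshold, back-off period, and helper functions are assumptions):

    import time

    MAX_FAILURES = 3          # assumed: repeated failures tolerated before suspending
    SUSPEND_SECONDS = 300     # assumed: how long the faulty worker backs off

    def worker_loop(next_job, run_job, report, looks_faulty):
        # next_job/run_job/report/looks_faulty are hypothetical hooks into
        # the dispatcher; looks_faulty matches exit codes or output keywords.
        failures = 0
        while True:
            job = next_job()                    # fetch the next task
            rc, output = run_job(job)           # execute and capture exit code/output
            report(job, rc, output)             # send the result back
            if rc != 0 and looks_faulty(rc, output):
                failures += 1
                if failures >= MAX_FAILURES:
                    # back off; often long enough for shared-FS load to clear
                    time.sleep(SUSPEND_SECONDS)
                    failures = 0
            else:
                failures = 0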

Ioan

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street, Chicago, IL 60616
=================================================================
Cell:  1-847-722-0876
Email: iraicu at cs.iit.edu
Web:   http://www.cs.iit.edu/~iraicu/
=================================================================



On Mar 2, 2012, at 9:37 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> I think the problem here is that in David's case there is only one site, "OSG" (using the glidein workload management system, GWMS), so he has no control over where his coaster workers start.
> 
> Jobs are failing because he has not yet told Condor to avoid launching workers on sites where his app is not correctly installed.
> 
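> Something along these lines in the Condor submit file might do it (the
> attribute and site names here are illustrative, not from David's actual
> setup):
> 
>     # avoid sites where the app is known to be broken
>     Requirements = (GLIDEIN_Site =!= "BadSiteA") && (GLIDEIN_Site =!= "BadSiteB")
> 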
> If this were a more general case where apps fail on specific nodes, we'd want both to prevent Condor from running workers on that node, and to prevent the coaster worker from taking jobs for that node. In one mode we could train the worker to "freeze" on the node after any job on the node fails in a certain way. That way we'd take the node out of service and stop it from failing future jobs.
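> 
> A minimal sketch of that freeze behavior (hypothetical Python with
> hypothetical hooks, not the actual coaster worker code):
> 
>     def worker_loop(next_job, run_job, report, is_fatal):
>         # Take jobs until one fails in a way that marks this node as
>         # broken, then freeze: stay allocated but accept no more work.
>         while True:
>             job = next_job()
>             rc, output = run_job(job)
>             report(job, rc, output)
>             if is_fatal(rc, output):
>                 break    # node out of service; stops failing future jobs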
> 
> I think for now we have simple workarounds for this kind of problem, but moving forward we should look at increasingly robust solutions.
> 
> - Mike
> 
> 
> ----- Original Message -----
>> From: "Ben Clifford" <benc at hawaga.org.uk>
>> To: "David Kelly" <davidk at ci.uchicago.edu>
>> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Friday, March 2, 2012 2:30:26 AM
>> Subject: Re: [Swift-devel] Question about retry behavior
>> The below was a problem with grid sites that "failed fast" on OSG; but
>> there, there was (and is) a site scoring mechanism to try to slow down
>> submissions to such a site. The more things change, the more they stay
>> the same.
>> 
>> On Mar 2, 2012, at 9:05 AM, David Kelly wrote:
>> 
>>> 
>>> Consider the case of one John Q. Swifterson.
>>> 
>>> Mr. Swifterson is working late one night performing science. He has
>>> written a very important program to simulate the effects of cocaine
>>> on honeybee dance behavior.
>>> 
>>> John is using persistent coasters and has 100 nodes available. Each
>>> node performs only 1 task at a time. In an abundance of caution, he
>>> sets execution.retries=50.
>>> 
>>> John then submits 100,000 jobs. The first 100 start immediately;
>>> 99 work as expected, but one fails due to a corrupted binary on its
>>> node. What should happen next?
>>> 
>>> The Swift user guide says this:
>>> ---
>>> If an application procedure execution fails, Swift will attempt that
>>> execution again repeatedly until it succeeds, up until the limit
>>> defined in the execution.retries configuration property.
>>> 
>>> Site selection will occur for retried jobs in the same way that it
>>> happens for new jobs. Retried jobs may run on the same site or may
>>> run on a different site.
>>> 
>>> If the retry limit execution.retries is reached for an application
>>> procedure, then that application procedure will fail. This will
>>> cause the entire run to fail - either immediately (if the
>>> lazy.errors property is false) or after all other possible work has
>>> been attempted (if the lazy.errors property is true).
>>> ---
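>>> 
>>> For reference, both knobs are plain key=value entries in
>>> swift.properties (the retries value matches John's; the lazy.errors
>>> value shown is just one choice):
>>> 
>>>     # swift.properties
>>>     execution.retries=50
>>>     lazy.errors=false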
>>> 
>>> Since 99 of the 100 nodes are in use, all 50 retries will occur on
>>> the same problematic node. This causes the entire run to fail. Is
>>> this correct? Is there any way to change this behavior?
>>> 
>>> One possibility is to set a job throttle lower than the number of
>>> nodes actually available. That might increase the chances of success
>>> a bit.
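>>> 
>>> For example, something like this in sites.xml (the value is
>>> illustrative; if I remember the mapping right, a jobThrottle of 0.5
>>> allows about 51 concurrent jobs):
>>> 
>>>     <profile namespace="karajan" key="jobThrottle">0.5</profile>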
>>> 
>>> Is there any way to force retries to happen on a different node?
>>> And, optionally, to disconnect nodes which experience high failure
>>> rates?
>>> 
>>> Thanks,
>>> David
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>> 
>> 
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


