[Swift-devel] Question about retry behavior

Fri Mar 2 10:10:15 CST 2012

Good points, Ioan - I'd forgotten about that part of the Falkon work. Per-worker fault analysis seems like a good thing, but higher-level analysis and actions are also needed. Maybe we want both per-worker and per-site analysis, with the ability to mark either one down.
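
To make the per-worker and per-site idea concrete, here is a minimal sketch of central failure bookkeeping. It is not Swift or coaster code: the class name FailureTracker, its methods, and the three thresholds are all hypothetical and chosen only for illustration. It marks a worker down once it has run enough jobs and failed most of them, marks a site down once most of its known workers are down, and never offers a retry to a worker on which that same job has already failed.

```python
from collections import defaultdict


class FailureTracker:
    """Hypothetical central record of job outcomes, keyed by site and worker."""

    def __init__(self, min_jobs=3, max_failure_rate=0.5, max_down_fraction=0.5):
        # All three thresholds are illustrative assumptions, not Swift settings.
        self.min_jobs = min_jobs
        self.max_failure_rate = max_failure_rate
        self.max_down_fraction = max_down_fraction
        self.outcomes = defaultdict(lambda: [0, 0])  # (site, worker) -> [failures, total]
        self.failed_on = defaultdict(set)            # job id -> {(site, worker), ...}

    def record(self, job, site, worker, ok):
        """Record one finished attempt of `job` on `worker` at `site`."""
        counts = self.outcomes[(site, worker)]
        counts[1] += 1
        if not ok:
            counts[0] += 1
            self.failed_on[job].add((site, worker))

    def worker_down(self, site, worker):
        """A worker is down once it has enough samples and fails most of them."""
        failures, total = self.outcomes.get((site, worker), (0, 0))
        return total >= self.min_jobs and failures / total > self.max_failure_rate

    def site_down(self, site):
        """A site is down once most of its known workers are down."""
        workers = [w for (s, w) in self.outcomes if s == site]
        if not workers:
            return False
        down = sum(self.worker_down(site, w) for w in workers)
        return down / len(workers) > self.max_down_fraction

    def eligible(self, job, candidates):
        """Candidates that are up and on which `job` has not already failed."""
        already_failed = self.failed_on.get(job, set())
        return [
            (s, w) for (s, w) in candidates
            if not self.site_down(s)
            and not self.worker_down(s, w)
            and (s, w) not in already_failed
        ]
```

In David's scenario, the scheduler would call record() after every attempt and pick retry targets from eligible(), so a job that failed on the node with the corrupted binary would wait for a different node instead of burning all 50 retries on the same one.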

- Mike

----- Original Message -----
> From: "Ioan Raicu" <iraicu at cs.iit.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Ben Clifford" <benc at hawaga.org.uk>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Friday, March 2, 2012 9:59:11 AM
> Subject: Re: [Swift-devel] Question about retry behavior
> And BTW, the logic was part of the worker, and each worker was making a
> separate independent decision. I think a central place to have this
> logic might also be useful, and perhaps might perform better, as it
> might be possible to differentiate between failures due to a specific
> node and system-wide failures that would cause some job to fail no
> matter where it was submitted.
> 
> Ioan
> 
> --
> =================================================================
> Ioan Raicu, Ph.D.
> Assistant Professor
> =================================================================
> Computer Science Department
> Illinois Institute of Technology
> 10 W. 31st Street Chicago, IL 60616
> =================================================================
> Cel: 1-847-722-0876
> Email: iraicu at cs.iit.edu
> Web: http://www.cs.iit.edu/~iraicu/
> =================================================================
> 
> 
> 
> On Mar 2, 2012, at 9:37 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> 
> > I think the problem here is that in David's case there is only one
> > site, "OSG", (using the Glidein workload mgmt system GWMS), so he
> > has no control of where his coaster workers start.
> >
> > Jobs are failing because he has not yet told Condor to avoid
> > launching workers on sites where his app is not correctly installed.
> >
> > If this were a more general case where apps fail on specific nodes,
> > we'd want to try to both prevent Condor from running workers on that
> > node, and prevent the coaster worker from taking jobs for that node.
> > In one mode we could train the worker to "freeze" on the node after
> > any job on the node fails in a certain way. That way we'd take the
> > node out of service and stop it from failing future jobs.
> >
> > I think for now we have simple workarounds for this kind of problem
> > but moving forward we should look at increasingly more robust
> > solutions.
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> From: "Ben Clifford" <benc at hawaga.org.uk>
> >> To: "David Kelly" <davidk at ci.uchicago.edu>
> >> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> >> Sent: Friday, March 2, 2012 2:30:26 AM
> >> Subject: Re: [Swift-devel] Question about retry behavior
> >> The below was a problem with grid sites that "failed fast" on OSG;
> >> but there, there was/is a site scoring mechanism to try to slow down
> >> submissions to that site. Plus ça change, plus c'est la même chose.
> >>
> >> On Mar 2, 2012, at 9:05 AM, David Kelly wrote:
> >>
> >>>
> >>> Consider the case of one John Q. Swifterson.
> >>>
> >>> Mr. Swifterson is working late one night performing science. He has
> >>> written a very important program to simulate the effects of cocaine
> >>> on honeybee dance behavior.
> >>>
> >>> John is using persistent coasters and has 100 nodes available. Each
> >>> node performs only 1 task at a time. In an abundance of caution, he
> >>> sets execution.retries=50.
> >>>
> >>> John then submits 100,000 jobs. 99 jobs start immediately and start
> >>> working as expected. But, 1 job fails due to a corrupted binary on
> >>> that node. What should happen next?
> >>>
> >>> The Swift user guide says this:
> >>> ---
> >>> If an application procedure execution fails, Swift will attempt that
> >>> execution again repeatedly until it succeeds, up until the limit
> >>> defined in the execution.retries configuration property.
> >>>
> >>> Site selection will occur for retried jobs in the same way that it
> >>> happens for new jobs. Retried jobs may run on the same site or may
> >>> run on a different site.
> >>>
> >>> If the retry limit execution.retries is reached for an application
> >>> procedure, then that application procedure will fail. This will
> >>> cause the entire run to fail - either immediately (if the
> >>> lazy.errors property is false) or after all other possible work has
> >>> been attempted (if the lazy.errors property is true).
> >>> ---
> >>>
> >>> Since 99 of the 100 nodes are in use, all 50 retries will occur on
> >>> the same problematic node. This causes the entire run to fail. Is
> >>> this correct? Is there any way to change this behavior?
> >>>
> >>> One possibility is to set a job throttle lower than the number of
> >>> sites actually available. That might increase the chances of success
> >>> a bit.
> >>>
> >>> Is there any way to force retries to happen on a different node? And
> >>> also, optionally, to disconnect nodes that experience high failure
> >>> rates?
> >>>
> >>> Thanks,
> >>> David
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory