[Swift-devel] Re: Error throttling for MD-244 Molecule Run

Mihael Hategan hategan at mcs.anl.gov
Wed Sep 12 10:14:34 CDT 2007


Old Swift. You should upgrade.

On Wed, 2007-09-12 at 09:57 -0500, Veronika Nefedova wrote:
> The throttling parameters are set in swift.properties. The last run
> (the one we are discussing now) was performed by Ioan from
> viper.uchicago.edu. I am not sure which cogl install he used, but the
> one I was using before is r1047 from 8/1. I put a copy of the
> swift.properties from that install at
> www.ci.uchicago.edu/~nefedova/swift.properties.
> Ioan, please confirm which Swift install you used (yours or mine),
> and if your swift.properties differs from mine, please send yours.
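> 
> For reference, the entries I have in mind are the throttle settings
> in that file, roughly like the sketch below (a sketch only; please
> check the exact names and values against the posted file rather than
> trusting this list):
> 
>     # limit on concurrent job submissions, overall and per site
>     throttle.submit=4
>     throttle.host.submit=2
>     # scales how many concurrent jobs a site's score allows
>     throttle.score.job.factor=4
>     # concurrent file transfers and file operations
>     throttle.transfers=4
>     throttle.file.operations=8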
> 
> Thanks,
> 
> Nika
> 
> On Sep 12, 2007, at 9:26 AM, Michael Wilde wrote:
> 
> > [Changing subject from "Re: 244 MolDyn run was successful!" to
> > start a new thread.]
> >
> > Ioan, Nika, when we last discussed this in various conversations, I  
> > think we were going to try two approaches:
> >
> > - Ioan was going to modify Falkon to recognize the stale-file- 
> > handle error, "down" the offending host, and re-queue the job,  
> > transparently to the client (Swift).
> >
> > - At the same time, we were discussing with Mihael adjustments to
> > the Swift error-retry throttling so that these errors would not
> > cause the workflow to slow down so drastically. As I recall,
> > Mihael's view was that the current throttle control parameters are
> > sufficient as they stand. Unless we have evidence from tests that
> > this is *not* the case, we should try this now, without waiting for
> > any Falkon (or Swift) code changes.  Nika, can you send to the list
> > the throttling parameters that you are using?
> >
> > - Mike
> >
> >
> > Veronika Nefedova wrote:
> >> Hi, Ioan:
> >> I am wondering what is happening with the Falkon scheduler and
> >> whether it can now avoid 'bad' nodes during execution?
> >> Thanks,
> >> Nika
> >> On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:
> >>> Hi,
> >>> I will look at the Falkon scheduler to see what I can do to either
> >>> throttle or blacklist task dispatch to bad nodes.
> >>>
> >>> On a similar note, IMO, the heuristic in Karajan should be
> >>> modified to take into account the execution time of each failed
> >>> or successful task, and not just the number of tasks.  This would
> >>> ensure that Swift does not throttle task submission to Falkon when
> >>> there are 1000s of successful tasks that take on the order of 100s
> >>> of seconds to complete, yet there are also 1000s of failed tasks
> >>> that are only 10 ms long.  This is exactly the case with MolDyn:
> >>> when we get a bad node in a batch of 100s of nodes, it ends up
> >>> throttling the number of active and running tasks to about 100,
> >>> regardless of the number of processors Falkon has.
> >>> I also think that when Swift runs in conjunction with Falkon, we  
> >>> should increase the number of retry attempts Swift is willing to  
> >>> make per task before giving up.  Currently it is set to 3, but a
> >>> higher number would be better, considering the low overhead of
> >>> task submission that Falkon has!
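> >>>
> >>> If I have the property name right, that knob is a single line in
> >>> swift.properties, something like the line below (please check the
> >>> exact name against your install; this is only an illustration):
> >>>
> >>>     # allow more re-submissions per failed task than the current 3
> >>>     execution.retries=10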
> >>>
> >>> I think the combination of these three changes (one in Falkon and
> >>> two in Swift) should increase the probability of large workflows
> >>> completing on a large number of resources!
> >>>
> >>> Ioan
> >>>
> >>> Veronika Nefedova wrote:
> >>>> OK. I looked at the output and it looks like 14 molecules still
> >>>> failed. They all failed due to hardware problems -- I saw nothing
> >>>> application-specific in the application logs, all very consistent
> >>>> with the stale NFS handle errors that Ioan reported seeing.
> >>>> It would be great to be able to stop submitting jobs to 'bad'
> >>>> nodes during the run (long term), or to increase the number of
> >>>> retries in Swift (short term), to enable the whole workflow to go
> >>>> through.
> >>>>
> >>>> Nika
> >>>>
> >>>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Mihael Hategan wrote:
> >>>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
> >>>>>>
> >>>>>>>>>
> >>>>>>>> small != not at all
> >>>>>>>>
> >>>>>>> Check out these two graphs, showing the # of active tasks within
> >>>>>>> Falkon!  Active tasks = queued+pending+active 
> >>>>>>> +done_and_not_delivered.
> >>>>>>>
> >>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg
> >>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg
> >>>>>>>
> >>>>>>> Notice that after some 3600 seconds (after all the jobs that
> >>>>>>> were going to fail had failed), the # of active tasks in Falkon
> >>>>>>> oscillates between 100 and 101!  The numbers presented in these
> >>>>>>> graphs are the median value per minute (the raw data were 60
> >>>>>>> samples per minute).  Notice that only at the very end of the
> >>>>>>> experiment, at 30K+ seconds, does the # of active tasks
> >>>>>>> increase to a max of 109 for a brief period before it drops
> >>>>>>> towards 0 as the workflow completes.  I did notice that towards
> >>>>>>> the end of the workflow the jobs were typically shorter, and
> >>>>>>> perhaps that somehow influenced the # of active tasks within
> >>>>>>> Falkon...  So, when I said "not at all", I was referring to the
> >>>>>>> flat line of 100~101 active tasks shown in these figures!
> >>>>>>>
> >>>>>> Then say "it appears (from x and y) that the number of  
> >>>>>> concurrent jobs
> >>>>>> does not increase by an observable amount". This is not the  
> >>>>>> same as "the
> >>>>>> score does not increase at all".
> >>>>>>
> >>>>> You are playing with words here... the bottom line is that  
> >>>>> after 19K+ jobs and several hours of successful jobs, there was  
> >>>>> no indication that the heuristic was adapting to the new  
> >>>>> conditions, in which no jobs were failing!
> >>>>>>
> >>>>>>>>> So you are saying that 19K+ successful jobs was not enough to
> >>>>>>>>> counteract the 10K+ failed jobs from the early part of the
> >>>>>>>>> experiment?
> >>>>>>>> Yep. 19*1/5 = 3.8 < 10.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Can this ratio (1:5) be changed?
> >>>>>>>>>
> >>>>>>>> Yes. The scheduler has two relevant properties:  
> >>>>>>>> successFactor (currently
> >>>>>>>> 0.1) and failureFactor (currently -0.5). The term "factor"  
> >>>>>>>> is not used
> >>>>>>>> formally, since these get added to the current score.
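> >>>>>>>>
> >>>>>>>> Since the update is just additive, the arithmetic for this run
> >>>>>>>> works out roughly as in the sketch below (the class and
> >>>>>>>> variable names are made up for illustration; only the two
> >>>>>>>> constants are the actual current values):
> >>>>>>>>
> >>>>>>>> // Standalone sketch of the additive score update.
> >>>>>>>> public class ScoreSketch {
> >>>>>>>>     static final double SUCCESS_FACTOR = 0.1;
> >>>>>>>>     static final double FAILURE_FACTOR = -0.5;
> >>>>>>>>
> >>>>>>>>     public static void main(String[] args) {
> >>>>>>>>         int succeeded = 19000;  // ~19K successful jobs in this run
> >>>>>>>>         int failed = 10000;     // ~10K fast failures early on
> >>>>>>>>
> >>>>>>>>         double score = succeeded * SUCCESS_FACTOR   // +1900
> >>>>>>>>                      + failed * FAILURE_FACTOR;     // -5000
> >>>>>>>>
> >>>>>>>>         // Net is about -3100: in the 1:5 ratio above, the
> >>>>>>>>         // successes buy back 19 * 1/5 = 3.8 "units" against
> >>>>>>>>         // the 10 lost to failures, so the score never recovers.
> >>>>>>>>         System.out.println("net score = " + score);
> >>>>>>>>     }
> >>>>>>>> }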
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> From this experiment, it would seem that the heuristic is a
> >>>>>>>>> slow learner... maybe you have ideas on how to make it
> >>>>>>>>> quicker to adapt to changes?
> >>>>>>>>>
> >>>>>>>> That could perhaps be done.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> In the context in which jobs are sent to non-busy workers,  
> >>>>>>>>>> the system
> >>>>>>>>>> would tend to produce lots of failed jobs if it takes  
> >>>>>>>>>> little time
> >>>>>>>>>> (compared to the normal run-time of a job) for a bad  
> >>>>>>>>>> worker to fail a
> >>>>>>>>>> job. This *IS* why the swift scheduler throttles in the  
> >>>>>>>>>> beginning: to
> >>>>>>>>>> avoid sending a large number of jobs to a resource that is  
> >>>>>>>>>> broken.
> >>>>>>>>>>
> >>>>>>>>> But not the whole resource is broken...
> >>>>>>>> No, just slightly more than 1/3 of it. At least that's how  
> >>>>>>>> it appears
> >>>>>>>> from the outside.
> >>>>>>>>
> >>>>>>> But a failed job should not be given the same weight as a
> >>>>>>> successful job, in my opinion.
> >>>>>>>
> >>>>>> Nope. I'd punish failures quite harshly. That's because the  
> >>>>>> expected
> >>>>>> behavior is for things to work. I would not want a site that  
> >>>>>> fails half
> >>>>>> the jobs to be anywhere near keeping a constant score.
> >>>>>>
> >>>>> That is fine, but you have a case (such as this one) in which  
> >>>>> this is not ideal... how do you propose we adapt to cover this  
> >>>>> corner case?
> >>>>>>
> >>>>>>>   For example, it seems to me that you are giving failed jobs
> >>>>>>> 5 times more weight than successful jobs, but in reality it
> >>>>>>> should be the other way around.  Failed jobs will usually fail
> >>>>>>> quickly (as in the case we have in MolDyn), or they will fail
> >>>>>>> slowly (within the lifetime of the resource allocation).  On
> >>>>>>> the other hand, most successful jobs will likely take more time
> >>>>>>> to complete than it takes for a job to fail (if it fails
> >>>>>>> quickly).  Perhaps instead of
> >>>>>>>> successFactor (currently
> >>>>>>>> 0.1) and failureFactor (currently -0.5)
> >>>>>>>>
> >>>>>>> it should be more like:
> >>>>>>> successFactor: +1*(executionTime)
> >>>>>>> failureFactor: -1*(failureTime)
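> >>>>>>>
> >>>>>>> Something along these lines is what I have in mind (just a
> >>>>>>> sketch of the idea; the names and the 1x weights are
> >>>>>>> placeholders that would need tuning):
> >>>>>>>
> >>>>>>> // Run-time-biased scoring: a success earns credit proportional
> >>>>>>> // to how long the job ran, a failure costs credit proportional
> >>>>>>> // to how long it took to fail.  A 10 ms failure then barely
> >>>>>>> // moves the score, while a 100 s success moves it a lot.
> >>>>>>> public class TimeBiasedScore {
> >>>>>>>     private double score = 0.0;
> >>>>>>>
> >>>>>>>     public void jobSucceeded(double executionTimeSec) {
> >>>>>>>         score += 1.0 * executionTimeSec;  // e.g. +100 for a 100 s job
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     public void jobFailed(double failureTimeSec) {
> >>>>>>>         score -= 1.0 * failureTimeSec;    // e.g. -0.01 for a 10 ms failure
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     public double getScore() { return score; }
> >>>>>>>
> >>>>>>>     public static void main(String[] args) {
> >>>>>>>         TimeBiasedScore s = new TimeBiasedScore();
> >>>>>>>         for (int i = 0; i < 10000; i++) s.jobFailed(0.01);    // fast failures
> >>>>>>>         for (int i = 0; i < 19000; i++) s.jobSucceeded(100);  // long successes
> >>>>>>>         System.out.println(s.getScore());  // strongly positive
> >>>>>>>     }
> >>>>>>> }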
> >>>>>>>
> >>>>>> That's a very good idea. Biasing score based on run-time (at  
> >>>>>> least when
> >>>>>> known). Please note: you should still fix Falkon to not do  
> >>>>>> that thing
> >>>>>> it's doing.
> >>>>>>
> >>>>> It's not clear to me this should be done all the time; Falkon
> >>>>> needs to know why the failure happened to decide whether to
> >>>>> throttle!
> >>>>>>
> >>>>>>> The 1 could of course be replaced with some other weight to
> >>>>>>> give preference to successful jobs, or to failed jobs.  With
> >>>>>>> this kind of strategy, the problems we are facing with
> >>>>>>> throttling when there are a large # of short failures wouldn't
> >>>>>>> be happening!  Do you see any drawbacks to this approach?
> >>>>>>>
> >>>>>> None that are obvious. It's in fact a good thing if the goal is
> >>>>>> performance, since it takes execution time into account. I've  
> >>>>>> had manual
> >>>>>> "punishments" for connection time-outs because they take a  
> >>>>>> long time to
> >>>>>> happen. But this time biasing naturally integrates that kind  
> >>>>>> of stuff.
> >>>>>> So thanks.
> >>>>>>
> >>>>>>
> >>>>>>>>> that is the whole point here...
> >>>>>>>> This point comes because you KNOW how things work  
> >>>>>>>> internally. All Swift
> >>>>>>>> sees is 10K failed jobs out of 29K.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> anyway, I think this is a valid case that we need to discuss
> >>>>>>>>> how to handle, to make the entire Swift+Falkon stack more
> >>>>>>>>> robust!
> >>>>>>>>>
> >>>>>>>>> BTW, here is another experiment with MolDyn that shows the
> >>>>>>>>> throttling and this heuristic behaving as I would expect!
> >>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
> >>>>>>>>>
> >>>>>>>>> Notice that the queue length (blue line) at around 11K
> >>>>>>>>> seconds dropped sharply, but then grew back up.  That sudden
> >>>>>>>>> drop was many jobs failing fast on a bad node, and the sudden
> >>>>>>>>> growth back up was Swift re-submitting almost the same # of
> >>>>>>>>> jobs that had failed back to Falkon.
> >>>>>>>>>
> >>>>>>>> That failing many jobs fast behavior is not right,  
> >>>>>>>> regardless of whether
> >>>>>>>> Swift can deal with it or not.
> >>>>>>> If it's a machine error, then it would be best not to fail
> >>>>>>> many jobs fast...
> >>>>>>> however, if it's an app error, you want to fail the tasks as
> >>>>>>> fast as possible in order to fail the entire workflow faster,
> >>>>>>>
> >>>>>> But you can't distinguish between the two. The best you can do
> >>>>>> is assume that the failure is a linear combination of broken
> >>>>>> application and broken node. If it's a broken node, rescheduling
> >>>>>> would do (which does not happen in your case: jobs keep being
> >>>>>> sent to the worker that is not busy, and that's the broken one).
> >>>>>> If it's a broken application, then the way to distinguish it
> >>>>>> from the other case is that after a bunch of retries on
> >>>>>> different nodes, it still fails. Notice that "different nodes"
> >>>>>> is essential here.
> >>>>>>
> >>>>> Right, I could try to keep track of statistics on each node and,
> >>>>> when failures happen, try to determine whether it's a system-wide
> >>>>> failure (all nodes reporting errors) or whether the failures are
> >>>>> isolated to a single node (or a small set of nodes)...  I'll have
> >>>>> to think about how to do this efficiently!
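> >>>>>
> >>>>> A first cut could be as simple as per-node counters plus a check
> >>>>> of how widespread the failures are (a rough sketch; all names and
> >>>>> thresholds below are invented):
> >>>>>
> >>>>> import java.util.HashMap;
> >>>>> import java.util.Map;
> >>>>>
> >>>>> // Track successes and failures per worker node.  Failures
> >>>>> // concentrated on a few nodes suggest bad nodes (blacklist them);
> >>>>> // failures spread over most nodes suggest an application-wide
> >>>>> // problem (better to fail fast).
> >>>>> public class NodeFailureTracker {
> >>>>>     private final Map<String, Integer> failures = new HashMap<>();
> >>>>>     private final Map<String, Integer> successes = new HashMap<>();
> >>>>>
> >>>>>     public void record(String node, boolean failed) {
> >>>>>         (failed ? failures : successes).merge(node, 1, Integer::sum);
> >>>>>     }
> >>>>>
> >>>>>     // Blacklist candidate: several failures, almost no successes.
> >>>>>     public boolean isSuspect(String node) {
> >>>>>         int f = failures.getOrDefault(node, 0);
> >>>>>         int s = successes.getOrDefault(node, 0);
> >>>>>         return f >= 5 && f > 10 * s;
> >>>>>     }
> >>>>>
> >>>>>     // If more than half of all nodes have reported failures, it
> >>>>>     // looks more like a broken app than a few broken nodes.
> >>>>>     public boolean looksSystemWide(int totalNodes) {
> >>>>>         return totalNodes > 0 && failures.size() > totalNodes / 2;
> >>>>>     }
> >>>>>
> >>>>>     public static void main(String[] args) {
> >>>>>         NodeFailureTracker t = new NodeFailureTracker();
> >>>>>         for (int i = 0; i < 20; i++) t.record("node-13", true);
> >>>>>         t.record("node-14", false);
> >>>>>         System.out.println(t.isSuspect("node-13"));   // true
> >>>>>         System.out.println(t.looksSystemWide(100));   // false
> >>>>>     }
> >>>>> }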
> >>>>>>
> >>>>>>>  so the app can be fixed and the workflow retried!  For
> >>>>>>> example, say you had 1000 tasks (all independent) and had the
> >>>>>>> wrong path set for the app... with the current Falkon
> >>>>>>> behaviour, the entire workflow would likely fail within some
> >>>>>>> 10~20 seconds of submitting the first task!  However, if Falkon
> >>>>>>> does some "smart" throttling when it sees failures, it's going
> >>>>>>> to take time proportional to the number of failures to fail the
> >>>>>>> workflow!
> >>>>>>>
> >>>>>> You're missing the part where all nodes fail the jobs equally,
> >>>>>> thus not creating the inequality we're talking about (the one
> >>>>>> where broken nodes get higher chances of getting more jobs).
> >>>>>>
> >>>>> Right, maybe we can use this to distinguish between node  
> >>>>> failure and app failure!
> >>>>>>
> >>>>>>>   Essentially, I am not a big fan of throttling task dispatch
> >>>>>>> due to failed executions, unless we know why these tasks
> >>>>>>> failed!
> >>>>>>>
> >>>>>> Stop putting exclamation marks after every sentence. It  
> >>>>>> diminishes the
> >>>>>> meaning of it!
> >>>>>>
> >>>>> So you are going from playing with words to picking on my
> >>>>> exclamation marks! :)
> >>>>>> Well, you can't know why these tasks failed. That's the whole  
> >>>>>> problem.
> >>>>>> You're dealing with incomplete information and you have to devise
> >>>>>> heuristics that get things done efficiently.
> >>>>>>
> >>>>> But Swift might know why it failed; it has a bunch of
> >>>>> STDOUT/STDERR that it always captures!  Falkon might capture the
> >>>>> same output, but it's optional ;(  Could these outputs be parsed
> >>>>> for certain well-known errors, with different exit codes to
> >>>>> indicate different kinds of errors?
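> >>>>>
> >>>>> For instance, the captured output could be scanned for a handful
> >>>>> of known error strings and mapped to distinct exit codes,
> >>>>> something like the toy sketch below (patterns and code values are
> >>>>> only examples, not an agreed convention):
> >>>>>
> >>>>> // Toy classifier: map well-known stderr messages to distinct exit
> >>>>> // codes so the dispatcher can tell an infrastructure failure from
> >>>>> // an application failure.
> >>>>> public class ErrorClassifier {
> >>>>>     public static final int EXIT_APP_ERROR  = 1;   // default: blame the app
> >>>>>     public static final int EXIT_NODE_ERROR = 70;  // e.g. stale NFS handle
> >>>>>     public static final int EXIT_ENV_ERROR  = 71;  // e.g. missing executable
> >>>>>
> >>>>>     public static int classify(String stderr) {
> >>>>>         String s = stderr.toLowerCase();
> >>>>>         if (s.contains("stale nfs file handle")) return EXIT_NODE_ERROR;
> >>>>>         if (s.contains("no such file or directory")
> >>>>>                 || s.contains("command not found")) return EXIT_ENV_ERROR;
> >>>>>         return EXIT_APP_ERROR;
> >>>>>     }
> >>>>>
> >>>>>     public static void main(String[] args) {
> >>>>>         System.out.println(classify("sh: ./myapp: No such file or directory"));
> >>>>>     }
> >>>>> }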
> >>>>>>
> >>>>>>>   Exit codes are not usually enough in general, unless we  
> >>>>>>> define our
> >>>>>>> own and the app and wrapper scripts generate these particular  
> >>>>>>> exit
> >>>>>>> codes that Falkon can intercept and interpret reliably!
> >>>>>>>
> >>>>>> That would be an improvement, but probably not a universally  
> >>>>>> valid
> >>>>>> assumption. So I wouldn't design with only that in mind.
> >>>>>>
> >>>>> But it would be an improvement over what we currently have...
> >>>>>>
> >>>>>>>> Frankly I'd rather Swift not be the part
> >>>>>>>> to deal with it because it has to resort to heuristics,  
> >>>>>>>> whereas Falkon
> >>>>>>>> has direct knowledge of which nodes do what.
> >>>>>>>>
> >>>>>>> That's fine, but I don't think Falkon can do it alone; it needs
> >>>>>>> context and a definition of failure, which I believe only the
> >>>>>>> application and Swift can provide for certain!
> >>>>>>>
> >>>>>> Nope, they can't. Swift does not meddle with semantics of  
> >>>>>> applications.
> >>>>>> They're all equally valuable functions.
> >>>>>>
> >>>>>> Now, there's stuff you can do to improve things, I'm guessing.  
> >>>>>> You can
> >>>>>> choose not to, and then we can keep having this discussion.  
> >>>>>> There might
> >>>>>> be stuff Swift can do, but it's not insight into applications,  
> >>>>>> so you'll
> >>>>>> have to ask for something else.
> >>>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> Ioan
> >>>>>> Mihael
> >>>>>>
> >>>>>>
> >>>>>>> Ioan
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>> -- 
> >>> ============================================
> >>> Ioan Raicu
> >>> Ph.D. Student
> >>> ============================================
> >>> Distributed Systems Laboratory
> >>> Computer Science Department
> >>> University of Chicago
> >>> 1100 E. 58th Street, Ryerson Hall
> >>> Chicago, IL 60637
> >>> ============================================
> >>> Email: iraicu at cs.uchicago.edu
> >>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>       http://dsl.cs.uchicago.edu/
> >>> ============================================
> >>> ============================================
> >>>
> >
> 



