[Swift-devel] Re: Error throttling for MD-244 Molecule Run

Wed Sep 12 16:34:49 CDT 2007

On Wed, 2007-09-12 at 16:32 -0500, Michael Wilde wrote:
> Nika, Ioan, Mihael:
> 
> The throttling parameters in the file Nika points to below are:
> 
> throttle.submit=16
> throttle.host.submit=16
> throttle.transfers=16
> throttle.file.operations=16
> 
> In addition, the latest file documents the throttle.score parameter, 
> which I've set to "off" in my recent Angle runs:
> 
> "The Swift scheduler has the ability to limit the number of concurrent 
> jobs allowed on a site based on the performance history of that site. 
> Each site is assigned a score (initially 1), which can increase or 
> decrease based on whether the site yields successful or faulty job runs. 
> The score for a site can take values in the (0.1, 100) interval. The 
> number of allowed jobs is calculated using the following formula:
>     2 + score*throttle.score.job.factor
> This means a site will always be allowed at least two concurrent jobs 
> and at most 2 + 100*throttle.score.job.factor. With a default of 4 this 
> means at least 2 jobs and at most 402.
> # Default: 4 "
> 
> #throttle.score.job.factor=4
> throttle.score.job.factor=off
> 
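For reference, the job-limit formula in the documentation quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Swift's scheduler code: the function name and the clamping of out-of-range scores are assumptions, while the formula, the (0.1, 100) score range and the default factor of 4 come from the quoted text.

```python
def allowed_jobs(score: float, job_factor: float = 4.0) -> float:
    """Concurrent-job limit for a site: 2 + score * throttle.score.job.factor."""
    # The quoted documentation puts the score in the (0.1, 100) interval.
    # Clamping out-of-range values here is an assumption made for this sketch.
    clamped = min(max(score, 0.1), 100.0)
    return 2 + clamped * job_factor


if __name__ == "__main__":
    print(allowed_jobs(1.0))    # initial score of 1, default factor of 4
    print(allowed_jobs(100.0))  # top of the score range, quoted as 402 jobs
```

With the default factor, a site at the bottom of the score range is still allowed a little over two jobs, which matches the "at least 2" in the quoted text.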
> This is, I believe, the parameter that Mihael provided to work around 
> the problem that Nika and Ioan were observing at uc-teragrid, where the 
> workflow would slow down drastically when stale-nfs-filehandle errors 
> occurred.
> 
> So as far as I can tell, setting this to "off" should solve the problem, 
> assuming that you are running the Swift code base in which this was 
> implemented.
> 
> Mihael, can you confirm?

Sounds about right.

> 
> Nika, Ioan, do you agree? Can you try this?
> 
> - Mike
> 
> 
> Veronika Nefedova wrote:
> > The throttling parameters are set in swift.properties. The last run 
> > (that we are discussing now) was performed by Ioan from 
> > viper.uchicago.edu. I am not sure which cogl install he used, but the 
> > one that I was using before is a r1047 from 8/1. I put the copy of 
> > swift.properties from that install to 
> > www.ci.uchicago.edu/~nefedova/swift.properties.
> > Ioan, please confirm which swift install you used (yours or mine), and 
> > if this file is different from mine, please send your swift.properties.
> > 
> > Thanks,
> > 
> > Nika
> > 
> > On Sep 12, 2007, at 9:26 AM, Michael Wilde wrote:
> > 
> >> [Changing Subject: Re: 244 MolDyn run was successful! to start a new 
> >> thread.]
> >>
> >> Ioan, Nika, when we last discussed this in various conversations, I 
> >> think we were going to try two approaches:
> >>
> >> - Ioan was going to modify Falkon to recognize the stale-file-handle 
> >> error, "down" the offending host, and re-queue the job, transparently 
> >> to the client (Swift).
> >>
> >> - At the same time, we were discussing with Mihael adjustments to the 
> >> Swift error retry throttling so that these errors would not cause the 
> >> workflow to slow down so drastically. As I recall, Mihael's view was 
> >> that the current throttle control parameters were sufficient to try 
> >> this now. Unless we have evidence from tests that this is *not* the 
> >> case, we should try this now, without waiting for any Falkon (or 
> >> Swift) code changes.  Nika, can you send to the list the throttling 
> >> parameters that you are using?
> >>
> >> - Mike
> >>
> >>
> >> Veronika Nefedova wrote:
> >>> Hi, Ioan:
> >>> I am wondering what is happening with the Falkon scheduler and whether it 
> >>> can now avoid 'bad' nodes during the execution?
> >>> Thanks,
> >>> Nika
> >>> On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:
> >>>> Hi,
> >>>> I will look at the Falkon scheduler to see what I can do to either 
> >>>> throttle or blacklist task dispatches to bad nodes.
> >>>>
> >>>> On a similar note, IMO, the heuristic in Karajan should be modified 
> >>>> to take into account the task execution time of the failed or 
> >>>> successful task, and not just the number of tasks.  This would 
> >>>> ensure that Swift is not throttling task submission to Falkon when 
> >>>> there are 1000s of successful tasks that take on the order of 100s 
> >>>> of seconds to complete, yet there are also 1000s of failed tasks that 
> >>>> are only 10 ms long.  This is exactly the case with MolDyn, when we 
> >>>> get a bad node in a bunch of 100s of nodes, which ends up throttling 
> >>>> the number of active and running tasks to about 100, regardless of 
> >>>> the number of processors Falkon has.
> >>>> I also think that when Swift runs in conjunction with Falkon, we 
> >>>> should increase the number of retry attempts Swift is willing to 
> >>>> make per task before giving up.  Currently, it is set to 3, but a 
> >>>> higher number would be better, considering the low overhead of 
> >>>> task submission Falkon has!
> >>>>
> >>>> I think the combination of these three changes (one from Falkon and 
> >>>> two from Swift) should increase the probability of large 
> >>>> workflows completing on a large number of resources!
> >>>>
> >>>> Ioan
> >>>>
> >>>> Veronika Nefedova wrote:
> >>>>> OK. I looked at the output and it looks like 14 molecules have 
> >>>>> still failed. They all failed due to hardware problems -- I saw 
> >>>>> nothing application-specific in applications logs, all very 
> >>>>> consistent with the stale NFS handle that Ioan reported seeing.
> >>>>> It would be great to be able to stop submitting jobs to 'bad' nodes 
> >>>>> during the run (long term), or to increase the number of retries in 
> >>>>> swift (short term) to enable the whole workflow to go through.
> >>>>>
> >>>>> Nika
> >>>>>
> >>>>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Mihael Hategan wrote:
> >>>>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>> small != not at all
> >>>>>>>>>
> >>>>>>>> Check out these two graphs, showing the # of active tasks within
> >>>>>>>> Falkon!  Active tasks = 
> >>>>>>>> queued+pending+active+done_and_not_delivered.
> >>>>>>>>
> >>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg 
> >>>>>>>>
> >>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg 
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Notice that after 3600 some seconds (after all the jobs that failed
> >>>>>>>> had failed), the # of active tasks in Falkon oscillates between 100
> >>>>>>>> and 101 active tasks!  The # presented on these graphs are taken 
> >>>>>>>> from
> >>>>>>>> the median value per minute (the raw samples were 60 samples per
> >>>>>>>> minute).  Notice that only at the very end of the experiment, at 
> >>>>>>>> 30K+
> >>>>>>>> seconds, the # of active tasks increases to a max of 109 for a 
> >>>>>>>> brief
> >>>>>>>> period of time before it drops towards 0 as the workflow 
> >>>>>>>> completes.  I
> >>>>>>>> did notice that towards the end of the workflow, the jobs were
> >>>>>>>> typically shorter, and perhaps that somehow influenced the # of 
> >>>>>>>> active
> >>>>>>>> tasks within Falkon...  So, when I said not at all, I was 
> >>>>>>>> referring to
> >>>>>>>> this flat line 100~101 active tasks that is shown in these figures!
> >>>>>>>>
> >>>>>>> Then say "it appears (from x and y) that the number of concurrent 
> >>>>>>> jobs
> >>>>>>> does not increase by an observable amount". This is not the same 
> >>>>>>> as "the
> >>>>>>> score does not increase at all".
> >>>>>>>
> >>>>>> You are playing with words here... the bottom line is that after 
> >>>>>> 19K+ jobs and several hours of successful jobs, there was no 
> >>>>>> indication that the heuristic was adapting to the new conditions, 
> >>>>>> in which no jobs were failing!
> >>>>>>>
> >>>>>>>>>> So you are saying that 19K+ successful jobs was not enough to
> >>>>>>>>>> counteract the 10K+ failed jobs from the early part of the
> >>>>>>>>>> experiment?
> >>>>>>>>> Yep. 19*1/5 = 3.8 < 10.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Can this ratio (1:5) be changed?
> >>>>>>>>>>
> >>>>>>>>> Yes. The scheduler has two relevant properties: successFactor 
> >>>>>>>>> (currently
> >>>>>>>>> 0.1) and failureFactor (currently -0.5). The term "factor" is 
> >>>>>>>>> not used
> >>>>>>>>> formally, since these get added to the current score.
> >>>>>>>>>
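A quick sketch of the additive score arithmetic discussed above, using the successFactor and failureFactor values given in this thread. It assumes the raw score is a plain running sum with no clamping; how that raw value maps onto the (0.1, 100) range is not described here and is not modeled. The function names are hypothetical.

```python
SUCCESS_FACTOR = 0.1   # added to the score per successful job (value from this thread)
FAILURE_FACTOR = -0.5  # added to the score per failed job (value from this thread)


def raw_score_change(successes: int, failures: int) -> float:
    """Net change of an unclamped, purely additive score (an assumption)."""
    return successes * SUCCESS_FACTOR + failures * FAILURE_FACTOR


def successes_to_offset(failures: int) -> float:
    """Number of successful jobs needed to cancel the given failures."""
    return failures * abs(FAILURE_FACTOR) / SUCCESS_FACTOR


if __name__ == "__main__":
    # 19K successes against 10K failures, as in the run under discussion.
    print(raw_score_change(19_000, 10_000))
    # Successes needed before 10K failures are fully offset.
    print(successes_to_offset(10_000))
```

Under these assumptions the net change stays negative, and roughly five successful jobs are needed for every failed one, which is the 1:5 ratio mentioned above.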
> >>>>>>>>>
> >>>>>>>>>> From this experiment, it would seem that the heuristic is a slow
> >>>>>>>>>> learner... maybe you have ideas on how to make it more quick 
> >>>>>>>>>> to adapt
> >>>>>>>>>> to changes?
> >>>>>>>>>>
> >>>>>>>>> That could perhaps be done.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>> In the context in which jobs are sent to non-busy workers, 
> >>>>>>>>>>> the system
> >>>>>>>>>>> would tend to produce lots of failed jobs if it takes little 
> >>>>>>>>>>> time
> >>>>>>>>>>> (compared to the normal run-time of a job) for a bad worker 
> >>>>>>>>>>> to fail a
> >>>>>>>>>>> job. This *IS* why the swift scheduler throttles in the 
> >>>>>>>>>>> beginning: to
> >>>>>>>>>>> avoid sending a large number of jobs to a resource that is 
> >>>>>>>>>>> broken.
> >>>>>>>>>>>
> >>>>>>>>>> But not the whole resource is broken...
> >>>>>>>>> No, just slightly more than 1/3 of it. At least that's how it 
> >>>>>>>>> appears
> >>>>>>>>> from the outside.
> >>>>>>>>>
> >>>>>>>> But a failed job should not be given the same weight as a successful
> >>>>>>>> job, in my opinion.
> >>>>>>>>
> >>>>>>> Nope. I'd punish failures quite harshly. That's because the expected
> >>>>>>> behavior is for things to work. I would not want a site that 
> >>>>>>> fails half
> >>>>>>> the jobs to be anywhere near keeping a constant score.
> >>>>>>>
> >>>>>> That is fine, but you have a case (such as this one) in which this 
> >>>>>> is not ideal... how do you propose we adapt to cover this corner 
> >>>>>> case?
> >>>>>>>
> >>>>>>>>   For example, it seems to me that you are giving the failed jobs 5
> >>>>>>>> times more weight than successful jobs, but in reality it should 
> >>>>>>>> be the
> >>>>>>>> other way around.  Failed jobs usually will fail quickly (as in the
> >>>>>>>> case that we have in MolDyn), or they will fail slowly (within the
> >>>>>>>> lifetime of the resource allocation).  On the other hand, most
> >>>>>>>> successful jobs will likely take more time to complete than it 
> >>>>>>>> takes
> >>>>>>>> for a job to fail (if it fails quickly).   Perhaps instead of
> >>>>>>>>> successFactor (currently
> >>>>>>>>> 0.1) and failureFactor (currently -0.5)
> >>>>>>>>>
> >>>>>>>> it should be more like:
> >>>>>>>> successFactor: +1*(executionTime)
> >>>>>>>> failureFactor: -1*(failureTime)
> >>>>>>>>
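The time-weighted proposal above can be sketched as follows. This is an illustration of the suggestion only, not an existing Swift or Karajan feature. The function name, the separate weights and the job durations in the example (100 s per success, 10 ms per failure) are assumptions chosen for illustration; only the job counts come from the run being discussed.

```python
def time_weighted_change(successes: int, success_seconds: float,
                         failures: int, failure_seconds: float,
                         success_weight: float = 1.0,
                         failure_weight: float = 1.0) -> float:
    """Net score change when each job contributes in proportion to its run time.

    Successful jobs add success_weight * run time; failed jobs subtract
    failure_weight * time-to-failure.
    """
    gained = success_weight * successes * success_seconds
    lost = failure_weight * failures * failure_seconds
    return gained - lost


if __name__ == "__main__":
    # Illustrative durations: long successful jobs, very short failures.
    print(time_weighted_change(19_000, 100.0, 10_000, 0.01))
```

With durations like these, many fast failures barely dent the score, while a failure that takes a long time to surface (a connection time-out, say) is penalized heavily without any special casing.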
> >>>>>>> That's a very good idea. Biasing score based on run-time (at 
> >>>>>>> least when
> >>>>>>> known). Please note: you should still fix Falkon to not do that 
> >>>>>>> thing
> >>>>>>> it's doing.
> >>>>>>>
> >>>>>> It's not clear to me this should be done all the time; Falkon needs 
> >>>>>> to know why the failure happened to decide to throttle!
> >>>>>>>
> >>>>>>>> The 1 could of course be changed with some other weight to give
> >>>>>>>> preference to successful jobs, or to failed jobs.  With this 
> >>>>>>>> kind of
> >>>>>>>> strategy, the problems we are facing with throttling when there are
> >>>>>>>> large # of short failures wouldn't be happening!  Do you see any
> >>>>>>>> drawbacks to this approach?
> >>>>>>>>
> >>>>>>> None that are obvious. It's in fact a good thing if the goal is
> >>>>>>> performance, since it takes execution time into account. I've had 
> >>>>>>> manual
> >>>>>>> "punishments" for connection time-outs because they take a long 
> >>>>>>> time to
> >>>>>>> happen. But this time biasing naturally integrates that kind of 
> >>>>>>> stuff.
> >>>>>>> So thanks.
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> that is the whole point here...
> >>>>>>>>> This point comes because you KNOW how things work internally. 
> >>>>>>>>> All Swift
> >>>>>>>>> sees is 10K failed jobs out of 29K.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> anyways, I think this is a valid case that we need to discuss 
> >>>>>>>>>> how to
> >>>>>>>>>> handle, to make the entire Swift+Falkon more robust!
> >>>>>>>>>>
> >>>>>>>>>> BTW, here is another experiment with MolDyn that shows the 
> >>>>>>>>>> throttling
> >>>>>>>>>> and this heuristic behaving as I would expect!
> >>>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg 
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Notice the queue length (blue line) at around 11K seconds dropped
> >>>>>>>>>> sharply, but then grew back up.  That sudden drop was many jobs
> >>>>>>>>>> failing fast on a bad node, and the sudden growth back up was 
> >>>>>>>>>> Swift
> >>>>>>>>>> re-submitting almost the same # of jobs that failed back to 
> >>>>>>>>>> Falkon.
> >>>>>>>>>>
> >>>>>>>>> That failing many jobs fast behavior is not right, regardless 
> >>>>>>>>> of whether
> >>>>>>>>> Swift can deal with it or not.
> >>>>>>>> If it's a machine error, then it would be best to not fail many jobs
> >>>>>>>> fast...
> >>>>>>>> however, if it's an app error, you want to fail the tasks as fast as
> >>>>>>>> possible to fail the entire workflow faster,
> >>>>>>>>
> >>>>>>> But you can't distinguish between the two. The best you can do is 
> >>>>>>> assume
> >>>>>>> that the failure is a linear combination between broken 
> >>>>>>> application and
> >>>>>>> broken node. If it's broken node, rescheduling would do (which 
> >>>>>>> does not
> >>>>>>> happen in your case: jobs keep being sent to the worker that is not
> >>>>>>> busy, and that's the broken one). If it's a broken application, 
> >>>>>>> then the
> >>>>>>> way to distinguish it from the other one is that after a bunch of
> >>>>>>> retries on different nodes, it still fails. Notice that different 
> >>>>>>> nodes
> >>>>>>> is essential here.
> >>>>>>>
> >>>>>> Right, I could try to keep track of statistics on each node, and 
> >>>>>> when failures happen, try to determine if it's a system-wide 
> >>>>>> failure (all nodes reporting errors), or whether the failures are isolated 
> >>>>>> to a single node (or a small set of nodes)...  I'll have to think about 
> >>>>>> how to do this efficiently!
> >>>>>>>
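One way the per-node bookkeeping described above might look, as a rough sketch. This is not Falkon code: the class, the method names and both thresholds are hypothetical, and a real dispatcher would also need to age out old samples.

```python
from collections import defaultdict


class NodeStats:
    """Per-node success/failure counters (hypothetical helper, not Falkon code)."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"ok": 0, "failed": 0})

    def record(self, node: str, succeeded: bool) -> None:
        self.counts[node]["ok" if succeeded else "failed"] += 1

    def failure_rate(self, node: str) -> float:
        c = self.counts.get(node)
        if c is None:
            return 0.0
        total = c["ok"] + c["failed"]
        return c["failed"] / total if total else 0.0

    def bad_nodes(self, bad_rate: float = 0.5) -> list:
        """Nodes whose failure rate is at or above bad_rate (assumed threshold)."""
        return [n for n in self.counts if self.failure_rate(n) >= bad_rate]

    def classify(self, bad_rate: float = 0.5, widespread: float = 0.8) -> str:
        """'isolated' suggests broken nodes; 'system-wide' suggests a broken app."""
        if not self.counts:
            return "no-data"
        failing = self.bad_nodes(bad_rate)
        if not failing:
            return "healthy"
        if len(failing) / len(self.counts) >= widespread:
            return "system-wide"
        return "isolated"


if __name__ == "__main__":
    stats = NodeStats()
    for i in range(9):
        stats.record(f"node{i}", True)
    for _ in range(3):
        stats.record("node9", False)
    print(stats.classify())  # prints "isolated"
```

When every node fails its jobs at a similar rate the result is "system-wide", which is the case where failing the workflow quickly is the desired behaviour.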
> >>>>>>>>  so the app can be fixed and the workflow retried!  For example, 
> >>>>>>>> say
> >>>>>>>> you had 1000 tasks (all independent), and had a wrong path set 
> >>>>>>>> to the
> >>>>>>>> app... with the current Falkon behaviour, the entire workflow would
> >>>>>>>> likely fail within some 10~20 seconds of it submitting the first 
> >>>>>>>> task!
> >>>>>>>> However, if Falkon does some "smart" throttling when it sees 
> >>>>>>>> failures,
> >>>>>>>> it's going to take time proportional to the failures to fail the
> >>>>>>>> workflow!
> >>>>>>>>
> >>>>>>> You're missing the part where all nodes fail the jobs equally, 
> >>>>>>> thus not
> >>>>>>> creating the inequality we're talking about (the ones where 
> >>>>>>> broken nodes
> >>>>>>> get higher chances of getting more jobs).
> >>>>>>>
> >>>>>> Right, maybe we can use this to distinguish between node failure 
> >>>>>> and app failure!
> >>>>>>>
> >>>>>>>>   Essentially, I am not a big fan of throttling task dispatch 
> >>>>>>>> due to
> >>>>>>>> failed executions, unless we know why these tasks failed!
> >>>>>>>>
> >>>>>>> Stop putting exclamation marks after every sentence. It 
> >>>>>>> diminishes the
> >>>>>>> meaning of it!
> >>>>>>>
> >>>>>> So you are going from playing with words to picking on my 
> >>>>>> exclamation! :)
> >>>>>>> Well, you can't know why these tasks failed. That's the whole 
> >>>>>>> problem.
> >>>>>>> You're dealing with incomplete information and you have to devise
> >>>>>>> heuristics that get things done efficiently.
> >>>>>>>
> >>>>>> But Swift might know why it failed, it has a bunch of 
> >>>>>> STDOUT/STDERR that it always captures!  Falkon might capture the 
> >>>>>> same output, but it's optional ;(  Could these outputs not be 
> >>>>>> parsed for certain well-known errors, and have different exit codes 
> >>>>>> to mean different kinds of errors?
> >>>>>>>
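As a sketch of the idea of scanning captured STDERR for well-known errors, something along these lines could map output text to a coarse failure class. The patterns, the class names and the function are all assumptions made for illustration. Neither Swift nor Falkon is known to do this, and, as noted elsewhere in this thread, such matching would not be universally reliable.

```python
import re

# Hypothetical pattern table: (failure class, compiled pattern).
# The stale-file-handle pattern reflects the error discussed in this thread;
# treating "command not found" as an application error is an assumption.
ERROR_PATTERNS = [
    ("node", re.compile(r"stale[ -](nfs[ -])?file[ -]?handle", re.IGNORECASE)),
    ("application", re.compile(r"command not found", re.IGNORECASE)),
]


def classify_failure(stderr_text: str) -> str:
    """Return the first matching failure class, or 'unknown' if none match."""
    for kind, pattern in ERROR_PATTERNS:
        if pattern.search(stderr_text):
            return kind
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("cat: input.dat: Stale NFS file handle"))
    print(classify_failure("sh: myapp: command not found"))
    print(classify_failure("Segmentation fault"))
```

A wrapper script could translate these classes into distinct exit codes, so that the dispatcher would not need to parse the output itself.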
> >>>>>>>>   Exit codes are not usually enough in general, unless we define 
> >>>>>>>> our
> >>>>>>>> own and the app and wrapper scripts generate these particular exit
> >>>>>>>> codes that Falkon can intercept and interpret reliably!
> >>>>>>>>
> >>>>>>> That would be an improvement, but probably not a universally valid
> >>>>>>> assumption. So I wouldn't design with only that in mind.
> >>>>>>>
> >>>>>> But it would be an improvement over what we currently have...
> >>>>>>>
> >>>>>>>>> Frankly I'd rather Swift not be the part
> >>>>>>>>> to deal with it because it has to resort to heuristics, whereas 
> >>>>>>>>> Falkon
> >>>>>>>>> has direct knowledge of which nodes do what.
> >>>>>>>>>
> >>>>>>>> That's fine, but I don't think Falkon can do it alone, it needs
> >>>>>>>> context and failure definition, which I believe only the 
> >>>>>>>> application
> >>>>>>>> and Swift could say for certain!
> >>>>>>>>
> >>>>>>> Nope, they can't. Swift does not meddle with semantics of 
> >>>>>>> applications.
> >>>>>>> They're all equally valuable functions.
> >>>>>>>
> >>>>>>> Now, there's stuff you can do to improve things, I'm guessing. 
> >>>>>>> You can
> >>>>>>> choose not to, and then we can keep having this discussion. There 
> >>>>>>> might
> >>>>>>> be stuff Swift can do, but it's not insight into applications, so 
> >>>>>>> you'll
> >>>>>>> have to ask for something else.
> >>>>>>>
> >>>>>> Any suggestions?
> >>>>>>
> >>>>>> Ioan
> >>>>>>> Mihael
> >>>>>>>
> >>>>>>>
> >>>>>>>> Ioan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>> -- 
> >>>> ============================================
> >>>> Ioan Raicu
> >>>> Ph.D. Student
> >>>> ============================================
> >>>> Distributed Systems Laboratory
> >>>> Computer Science Department
> >>>> University of Chicago
> >>>> 1100 E. 58th Street, Ryerson Hall
> >>>> Chicago, IL 60637
> >>>> ============================================
> >>>> Email: iraicu at cs.uchicago.edu
> >>>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>>       http://dsl.cs.uchicago.edu/
> >>>> ============================================
> >>>>
> >>
> > 
> > 
>