[Swift-devel] Re: Error throttling for MD-244 Molecule Run

Michael Wilde wilde at mcs.anl.gov
Wed Sep 12 16:32:25 CDT 2007


Nika, Ioan, Mihael:

The throttling parameters in the file Nika points to below are:

throttle.submit=16
throttle.host.submit=16
throttle.transfers=16
throttle.file.operations=16

In addition, the latest file documents the throttle.score.job.factor 
parameter, which I've set to "off" in my recent Angle runs:

"The Swift scheduler has the ability to limit the number of concurrent 
jobs allowed on a site based on the performance history of that site. 
Each site is assigned a score (initially 1), which can increase or 
decrease based on whether the site yields successful or faulty job runs. 
The score for a site can take values in the (0.1, 100) interval. The 
number of allowed jobs is calculated using the following formula:
    2 + score*throttle.score.job.factor
This means a site will always be allowed at least two concurrent jobs 
and at most 2 + 100*throttle.score.job.factor. With a default of 4 this 
means at least 2 jobs and at most 402.
# Default: 4 "

#throttle.score.job.factor=4
throttle.score.job.factor=off
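
For reference, here is a minimal sketch (Java, with hypothetical class and 
method names; this is not the actual Karajan scheduler code) of how the 
documented formula maps a site score to a concurrent-job limit:

    // Illustrative only: allowed jobs = 2 + score * throttle.score.job.factor,
    // with the score clamped to the documented (0.1, 100) interval.
    public class SiteJobLimit {
        static int allowedJobs(double score, double jobFactor) {
            double clamped = Math.max(0.1, Math.min(100.0, score));
            return 2 + (int) (clamped * jobFactor);
        }

        public static void main(String[] args) {
            System.out.println(allowedJobs(1.0, 4));   // initial score of 1   -> 6
            System.out.println(allowedJobs(100.0, 4)); // best possible score  -> 402
            System.out.println(allowedJobs(0.1, 4));   // worst possible score -> 2
        }
    }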

This is, I believe, the parameter that Mihael provided to work around 
the problem that Nika and Ioan were observing at uc-teragrid, where the 
workflow would slow down drastically when stale-nfs-filehandle errors 
occurred.

So as far as I can tell, setting this to "off" should solve the problem, 
assuming that you are running the Swift code base in which this was 
implemented.

Mihael, can you confirm?

Nika, Ioan, do you agree? Can you try this?

- Mike


Veronika Nefedova wrote:
> The throttling parameters are set in swift.properties. The last run 
> (the one we are discussing now) was performed by Ioan from 
> viper.uchicago.edu. I am not sure which cogl install he used, but the 
> one that I was using before is r1047 from 8/1. I put a copy of the 
> swift.properties from that install at 
> www.ci.uchicago.edu/~nefedova/swift.properties.
> Ioan, please confirm which swift install you used (yours or mine), and 
> if your file is different from mine, please send your swift.properties.
> 
> Thanks,
> 
> Nika
> 
> On Sep 12, 2007, at 9:26 AM, Michael Wilde wrote:
> 
>> [Changing Subject: Re: 244 MolDyn run was successful! to start a new 
>> thread.]
>>
>> Ioan, Nika, when we last discussed this in various conversations, I 
>> think we were going to try two approaches:
>>
>> - Ioan was going to modify Falkon to recognize the stale-file-handle 
>> error, "down" the offending host, and re-queue the job, transparently 
>> to the client (Swift).
>>
>> - At the same time, we were discussing with Mihael adjustments to the 
>> Swift error retry throttling so that these errors would not cause the 
>> workflow to slow down so drastically. As I recall, Mihael's view was 
>> that the current throttle control parameters were sufficient to try 
>> this now. Unless we have evidence from tests that this is *not* the 
>> case, we should try this now, without waiting for any Falkon (or 
>> Swift) code changes.  Nika, can you send to the list the throttling 
>> parameters that you are using?
>>
>> - Mike
>>
>>
>> Veronika Nefedova wrote:
>>> Hi, Ioan:
>>> I am wondering what is happening with the Falkon scheduler and whether 
>>> it can now avoid 'bad' nodes during execution?
>>> Thanks,
>>> Nika
>>> On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:
>>>> Hi,
>>>> I will look at the Falkon scheduler to see what I can do to either 
>>>> throttle or blacklist task dispatches to bad nodes.
>>>>
>>>> On a similar note, IMO, the heuristic in Karajan should be modified 
>>>> to take into account the task execution time of the failed or 
>>>> successful task, and not just the number of tasks.  This would 
>>>> ensure that Swift is not throttling task submission to Falkon when 
>>>> there are 1000s of successful tasks that take on the order of 100s 
>>>> of seconds to complete, yet there are also 1000s of failed tasks that 
>>>> are only 10 ms long.  This is exactly the case with MolDyn, when we 
>>>> get a bad node in a bunch of 100s of nodes, which ends up throttling 
>>>> the number of active and running tasks to about 100, regardless of 
>>>> the number of processors Falkon has.
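
[For concreteness: with the per-task factors discussed further down in 
this thread, +0.1 per successful job and -0.5 per failed job, an 
illustrative mix of 5,000 successful 100-second tasks and 5,000 
10-millisecond failures adds 500 to the score but subtracts 2,500, so the 
score still drops sharply even though essentially all of the useful 
compute time went to successful work.]
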
>>>> I also think that when Swift runs in conjunction with Falkon, we 
>>>> should increase the number of retry attempts Swift is willing to 
>>>> make per task before giving up.  Currently, it is set to 3, but a 
>>>> higher number would be better, considering the low overhead of 
>>>> task submission Falkon has!
>>>>
>>>> I think the combination of these three changes (one in Falkon and 
>>>> two in Swift) should increase the probability of large 
>>>> workflows completing on a large number of resources!
>>>>
>>>> Ioan
>>>>
>>>> Veronika Nefedova wrote:
>>>>> OK. I looked at the output and it looks like 14 molecules still 
>>>>> failed. They all failed due to hardware problems -- I saw nothing 
>>>>> application-specific in the application logs; everything is 
>>>>> consistent with the stale NFS handle errors that Ioan reported seeing.
>>>>> It would be great to be able to stop submitting jobs to 'bad' nodes 
>>>>> during the run (long term), or to increase the number of retries in 
>>>>> Swift (short term), to enable the whole workflow to go through.
>>>>>
>>>>> Nika
>>>>>
>>>>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
>>>>>>>
>>>>>>>>>>
>>>>>>>>> small != not at all
>>>>>>>>>
>>>>>>>> Check out these two graphs, showing the # of active tasks within
>>>>>>>> Falkon!  Active tasks = 
>>>>>>>> queued+pending+active+done_and_not_delivered.
>>>>>>>>
>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg 
>>>>>>>>
>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg 
>>>>>>>>
>>>>>>>>
>>>>>>>> Notice that after some 3600 seconds (after all the jobs that failed
>>>>>>>> had failed), the # of active tasks in Falkon oscillates between 100
>>>>>>>> and 101 active tasks!  The #s presented on these graphs are the
>>>>>>>> median value per minute (the raw samples were 60 samples per
>>>>>>>> minute).  Notice that only at the very end of the experiment, at 30K+
>>>>>>>> seconds, does the # of active tasks increase to a max of 109 for a
>>>>>>>> brief period of time before it drops towards 0 as the workflow
>>>>>>>> completes.  I did notice that towards the end of the workflow, the
>>>>>>>> jobs were typically shorter, and perhaps that somehow influenced the
>>>>>>>> # of active tasks within Falkon...  So, when I said "not at all", I
>>>>>>>> was referring to the flat line of 100~101 active tasks shown in
>>>>>>>> these figures!
>>>>>>>>
>>>>>>> Then say "it appears (from x and y) that the number of concurrent 
>>>>>>> jobs
>>>>>>> does not increase by an observable amount". This is not the same 
>>>>>>> as "the
>>>>>>> score does not increase at all".
>>>>>>>
>>>>>> You are playing with words here... the bottom line is that after 
>>>>>> 19K+ successful jobs over several hours, there was no indication 
>>>>>> that the heuristic was adapting to the new conditions, in which no 
>>>>>> jobs were failing!
>>>>>>>
>>>>>>>>>> So you are saying that 19K+ successful jobs was not enough to
>>>>>>>>>> counteract the 10K+ failed jobs from the early part of the
>>>>>>>>>> experiment?
>>>>>>>>> Yep. 19*1/5 = 3.8 < 10.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Can this ratio (1:5) be changed?
>>>>>>>>>>
>>>>>>>>> Yes. The scheduler has two relevant properties: successFactor 
>>>>>>>>> (currently
>>>>>>>>> 0.1) and failureFactor (currently -0.5). The term "factor" is 
>>>>>>>>> not used
>>>>>>>>> formally, since these get added to the current score.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> From this experiment, it would seem that the heuristic is a slow
>>>>>>>>>> learner... maybe you have ideas on how to make it quicker to adapt
>>>>>>>>>> to changes?
>>>>>>>>>>
>>>>>>>>> That could perhaps be done.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> In the context in which jobs are sent to non-busy workers, 
>>>>>>>>>>> the system
>>>>>>>>>>> would tend to produce lots of failed jobs if it takes little 
>>>>>>>>>>> time
>>>>>>>>>>> (compared to the normal run-time of a job) for a bad worker 
>>>>>>>>>>> to fail a
>>>>>>>>>>> job. This *IS* why the Swift scheduler throttles in the 
>>>>>>>>>>> beginning: to
>>>>>>>>>>> avoid sending a large number of jobs to a resource that is 
>>>>>>>>>>> broken.
>>>>>>>>>>>
>>>>>>>>>> But not the whole resource is broken...
>>>>>>>>> No, just slightly more than 1/3 of it. At least that's how it 
>>>>>>>>> appears
>>>>>>>>> from the outside.
>>>>>>>>>
>>>>>>>> But a failed job should not be given the same weight as a successful
>>>>>>>> job, in my opinion.
>>>>>>>>
>>>>>>> Nope. I'd punish failures quite harshly. That's because the expected
>>>>>>> behavior is for things to work. I would not want a site that 
>>>>>>> fails half
>>>>>>> the jobs to be anywhere near keeping a constant score.
>>>>>>>
>>>>>> That is fine, but you have a case (such as this one) in which this 
>>>>>> is not ideal... how do you propose we adapt to cover this corner 
>>>>>> case?
>>>>>>>
>>>>>>>>   For example, it seems to me that you are giving the failed jobs 5
>>>>>>>> times more weight than successful jobs, but in reality it should be
>>>>>>>> the other way around.  Failed jobs usually will fail quickly (as in
>>>>>>>> the case that we have in MolDyn), or they will fail slowly (within
>>>>>>>> the lifetime of the resource allocation).  On the other hand, most
>>>>>>>> successful jobs will likely take more time to complete than it takes
>>>>>>>> for a job to fail (if it fails quickly).   Perhaps instead of
>>>>>>>>> successFactor (currently
>>>>>>>>> 0.1) and failureFactor (currently -0.5)
>>>>>>>>>
>>>>>>>> it should be more like:
>>>>>>>> successFactor: +1*(executionTime)
>>>>>>>> failureFactor: -1*(failureTime)
>>>>>>>>
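
[A minimal sketch, purely illustrative and not actual Swift/Karajan code, 
of what such a time-weighted additive score update could look like; the 
class and method names are hypothetical, and clamping the score to its 
allowed interval is omitted:]

    // Successes add their execution time to the score, failures subtract
    // their time-to-failure, so 10 ms failures barely dent the credit
    // earned by 100 s successes.
    public class TimeWeightedScore {
        private double score = 1.0;  // initial site score

        void jobSucceeded(double executionTimeSec) { score += executionTimeSec; }
        void jobFailed(double failureTimeSec)      { score -= failureTimeSec; }
        double score()                             { return score; }
    }

[Under this scheme, 19K successes of roughly 100 s each would add about 
1.9M to the score, while 10K failures of 10 ms each would subtract only 
about 100, which is the behavior being argued for here.]
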
>>>>>>> That's a very good idea. Biasing score based on run-time (at 
>>>>>>> least when
>>>>>>> known). Please note: you should still fix Falkon to not do that 
>>>>>>> thing
>>>>>>> it's doing.
>>>>>>>
>>>>>> It's not clear to me this should be done all the time; Falkon needs 
>>>>>> to know why the failure happened to decide whether to throttle!
>>>>>>>
>>>>>>>> The 1 could of course be replaced with some other weight to give
>>>>>>>> preference to successful jobs, or to failed jobs.  With this kind of
>>>>>>>> strategy, the problems we are facing with throttling when there is a
>>>>>>>> large # of short failures wouldn't be happening!  Do you see any
>>>>>>>> drawbacks to this approach?
>>>>>>>>
>>>>>>> None that are obvious. It's in fact a good thing if the goal is
>>>>>>> performance, since it takes execution time into account. I've had 
>>>>>>> manual
>>>>>>> "punishments" for connection time-outs because they take a long 
>>>>>>> time to
>>>>>>> happen. But this time biasing naturally integrates that kind of 
>>>>>>> stuff.
>>>>>>> So thanks.
>>>>>>>
>>>>>>>
>>>>>>>>>> that is the whole point here...
>>>>>>>>> This point comes because you KNOW how things work internally. 
>>>>>>>>> All Swift
>>>>>>>>> sees is 10K failed jobs out of 29K.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Anyway, I think this is a valid case that we need to discuss how
>>>>>>>>>> to handle, to make the entire Swift+Falkon system more robust!
>>>>>>>>>>
>>>>>>>>>> BTW, here is another experiment with MolDyn that shows the
>>>>>>>>>> throttling and this heuristic behaving as I would have expected!
>>>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Notice that the queue length (blue line) at around 11K seconds
>>>>>>>>>> dropped sharply, but then grew back up.  That sudden drop was many
>>>>>>>>>> jobs failing fast on a bad node, and the sudden growth back up was
>>>>>>>>>> Swift re-submitting almost the same # of jobs that failed back to
>>>>>>>>>> Falkon.
>>>>>>>>>>
>>>>>>>>> That failing many jobs fast behavior is not right, regardless 
>>>>>>>>> of whether
>>>>>>>>> Swift can deal with it or not.
>>>>>>>> If it's a machine error, then it would be best not to fail many jobs
>>>>>>>> fast...
>>>>>>>> however, if it's an app error, you want to fail the tasks as fast as
>>>>>>>> possible to fail the entire workflow faster,
>>>>>>>>
>>>>>>> But you can't distinguish between the two. The best you can do is
>>>>>>> assume that the failure is a linear combination of a broken
>>>>>>> application and a broken node. If it's a broken node, rescheduling
>>>>>>> would do (which does not happen in your case: jobs keep being sent to
>>>>>>> the worker that is not busy, and that's the broken one). If it's a
>>>>>>> broken application, then the way to distinguish it from the other one
>>>>>>> is that after a bunch of retries on different nodes, it still fails.
>>>>>>> Notice that "different nodes" is essential here.
>>>>>>>
>>>>>> Right, I could try to keep track of statistics on each node, and 
>>>>>> when failures happen, try to determine whether it's a system-wide 
>>>>>> failure (all nodes reporting errors) or the failures are isolated 
>>>>>> to a single node (or a small set of nodes)...  I'll have to think 
>>>>>> about how to do this efficiently!
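
[A rough sketch, purely illustrative, of the kind of per-node bookkeeping 
described above, assuming the executor can attribute each task result to 
a worker node; the class name and thresholds are hypothetical:]

    import java.util.HashMap;
    import java.util.Map;

    // Track per-node success/failure counts and flag a node as suspect when
    // its failure rate is far above the overall rate, which points at a bad
    // node rather than a broken application (where all nodes fail alike).
    public class NodeFailureTracker {
        private static class Stats { long ok; long failed; }

        private final Map<String, Stats> perNode = new HashMap<>();
        private long totalOk, totalFailed;

        synchronized void record(String node, boolean success) {
            Stats s = perNode.computeIfAbsent(node, n -> new Stats());
            if (success) { s.ok++; totalOk++; } else { s.failed++; totalFailed++; }
        }

        synchronized boolean isSuspect(String node) {
            Stats s = perNode.get(node);
            if (s == null || s.ok + s.failed < 10) return false;  // too few samples
            double nodeRate = (double) s.failed / (s.ok + s.failed);
            double overall = (double) totalFailed / Math.max(1, totalOk + totalFailed);
            return nodeRate > 0.5 && nodeRate > 5 * overall;      // arbitrary thresholds
        }
    }
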
>>>>>>>
>>>>>>>>  so the app can be fixed and the workflow retried!  For example, say
>>>>>>>> you had 1000 tasks (all independent) and had a wrong path set to the
>>>>>>>> app... with the current Falkon behaviour, the entire workflow would
>>>>>>>> likely fail within some 10~20 seconds of it submitting the first task!
>>>>>>>> However, if Falkon does some "smart" throttling when it sees failures,
>>>>>>>> it's going to take time proportional to the number of failures to
>>>>>>>> fail the workflow!
>>>>>>>>
>>>>>>> You're missing the part where all nodes fail the jobs equally, thus
>>>>>>> not creating the inequality we're talking about (the one where broken
>>>>>>> nodes get higher chances of getting more jobs).
>>>>>>>
>>>>>> Right, maybe we can use this to distinguish between node failure 
>>>>>> and app failure!
>>>>>>>
>>>>>>>>   Essentially, I am not a big fan of throttling task dispatch due to
>>>>>>>> failed executions, unless we know why these tasks failed!
>>>>>>>>
>>>>>>> Stop putting exclamation marks after every sentence. It 
>>>>>>> diminishes the
>>>>>>> meaning of it!
>>>>>>>
>>>>>> So you are going from playing with words to picking on my 
>>>>>> exclamation marks! :)
>>>>>>> Well, you can't know why these tasks failed. That's the whole 
>>>>>>> problem.
>>>>>>> You're dealing with incomplete information and you have to devise
>>>>>>> heuristics that get things done efficiently.
>>>>>>>
>>>>>> But Swift might know why it failed; it has a bunch of 
>>>>>> STDOUT/STDERR that it always captures!  Falkon might capture the 
>>>>>> same output, but it's optional ;(  Could these outputs not be 
>>>>>> parsed for certain well-known errors, and different exit codes used 
>>>>>> to mean different kinds of errors?
>>>>>>>
>>>>>>>>   Exit codes are not usually enough in general, unless we define 
>>>>>>>> our
>>>>>>>> own and the app and wrapper scripts generate these particular exit
>>>>>>>> codes that Falkon can intercept and interpret reliably!
>>>>>>>>
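
[One way to picture the kind of convention being suggested here -- the 
specific codes below are purely hypothetical and are not something Swift 
or Falkon currently defines -- is a small mapping from wrapper exit codes 
to failure categories that the executor could interpret:]

    // Hypothetical wrapper exit-code convention: the wrapper script maps
    // known error patterns in stdout/stderr to distinct exit codes, and the
    // executor uses them to decide whether to retry elsewhere or fail fast.
    public enum WrapperExitCode {
        SUCCESS(0),
        APP_ERROR(1),           // application failed: retrying will not help
        STALE_NFS_HANDLE(70),   // node-local filesystem problem: retry on another node
        MISSING_EXECUTABLE(71), // wrong path to the app: fail the workflow fast
        UNKNOWN(255);

        public final int code;
        WrapperExitCode(int code) { this.code = code; }

        public static WrapperExitCode fromCode(int code) {
            for (WrapperExitCode c : values()) {
                if (c.code == code) return c;
            }
            return UNKNOWN;
        }
    }
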
>>>>>>> That would be an improvement, but probably not a universally valid
>>>>>>> assumption. So I wouldn't design with only that in mind.
>>>>>>>
>>>>>> But it would be an improvement over what we currently have...
>>>>>>>
>>>>>>>>> Frankly I'd rather Swift not be the part
>>>>>>>>> to deal with it because it has to resort to heuristics, whereas 
>>>>>>>>> Falkon
>>>>>>>>> has direct knowledge of which nodes do what.
>>>>>>>>>
>>>>>>>> That's fine, but I don't think Falkon can do it alone; it needs
>>>>>>>> context and a definition of failure, which I believe only the
>>>>>>>> application and Swift can provide for certain!
>>>>>>>>
>>>>>>> Nope, they can't. Swift does not meddle with semantics of 
>>>>>>> applications.
>>>>>>> They're all equally valuable functions.
>>>>>>>
>>>>>>> Now, there's stuff you can do to improve things, I'm guessing. 
>>>>>>> You can
>>>>>>> choose not to, and then we can keep having this discussion. There 
>>>>>>> might
>>>>>>> be stuff Swift can do, but it's not insight into applications, so 
>>>>>>> you'll
>>>>>>> have to ask for something else.
>>>>>>>
>>>>>> Any suggestions?
>>>>>>
>>>>>> Ioan
>>>>>>> Mihael
>>>>>>>
>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>> -- 
>>>> ============================================
>>>> Ioan Raicu
>>>> Ph.D. Student
>>>> ============================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ============================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>       http://dsl.cs.uchicago.edu/
>>>> ============================================
>>>> ============================================
>>>>
>>
> 
> 


