[Swift-devel] Error throttling for MD-244 Molecule Run

Michael Wilde wilde at mcs.anl.gov
Wed Sep 12 09:26:38 CDT 2007


[Changing the subject from "Re: 244 MolDyn run was successful!" to start 
a new thread.]

Ioan, Nika, when we last discussed this in various conversations, I 
think we were going to try two approaches:

- Ioan was going to modify Falkon to recognize the stale-file-handle 
error, "down" the offending host, and re-queue the job, transparently to 
the client (Swift).

- At the same time, we were discussing with Mihael adjustments to the 
Swift error retry throttling so that these errors would not cause the 
workflow to slow down so drastically. As I recall, Mihael's view was 
that the current throttle control parameters are sufficient for this. 
Unless we have evidence from tests that this is *not* the case, we 
should try it now, without waiting for any Falkon (or Swift) code 
changes.  Nika, can you send to the list the throttling parameters that 
you are using?
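
For reference, these are the knobs I have in mind in swift.properties 
(the values below are only what I remember the defaults to be, not 
necessarily what Nika has set):

  execution.retries=3
  throttle.submit=4
  throttle.host.submit=2
  throttle.score.job.factor=4
  throttle.transfers=4
  throttle.file.operations=8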

- Mike


Veronika Nefedova wrote:
> Hi, Ioan:
> 
> I am wondering what is happening with the Falkon scheduler and whether 
> it can now avoid 'bad' nodes during execution?
> 
> Thanks,
> 
> Nika
> 
> On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:
> 
>> Hi,
>> I will look at the Falkon scheduler to see what I can do to either 
>> throttle or blacklist task dispatches to bad nodes.
>>
>> On a similar note, IMO, the heuristic in Karajan should be modified to 
>> take into account the execution time of each failed or successful 
>> task, and not just the number of tasks.  This would ensure that Swift 
>> does not throttle task submission to Falkon when there are 1000s of 
>> successful tasks that take on the order of 100s of seconds to complete, 
>> yet there are also 1000s of failed tasks that are only 10 ms long.  
>> This is exactly the case with MolDyn, where one bad node in a batch of 
>> 100s of nodes ends up throttling the number of active and running 
>> tasks to about 100, regardless of the number of processors Falkon has.
>> I also think that when Swift runs in conjunction with Falkon, we 
>> should increase the number of retry attempts Swift is willing to make 
>> per task before giving up.  Currently, it is set to 3, but a higher 
>> number would be better, considering Falkon's low task-submission 
>> overhead!
>>
>> I think the combination of these three changes (one in Falkon and two 
>> in Swift) should increase the probability of large workflows 
>> completing on a large number of resources!
>>
>> Ioan
>>
>> Veronika Nefedova wrote:
>>> OK. I looked at the output and it looks like 14 molecules have still 
>>> failed. They all failed due to hardware problems -- I saw nothing 
>>> application-specific in the application logs, and everything is very 
>>> consistent with the stale NFS handle errors that Ioan reported seeing.
>>> It would be great to be able to stop submitting jobs to 'bad' nodes 
>>> during the run (long term), or to increase the number of retries in 
>>> Swift (short term), to enable the whole workflow to go through.
>>>
>>> Nika
>>>
>>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
>>>
>>>>
>>>>
>>>> Mihael Hategan wrote:
>>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
>>>>>
>>>>>>>>
>>>>>>> small != not at all
>>>>>>>
>>>>>> Check out these two graphs, showing the # of active tasks within
>>>>>> Falkon!  Active tasks = queued+pending+active+done_and_not_delivered.
>>>>>>
>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg 
>>>>>>
>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg 
>>>>>>
>>>>>>
>>>>>> Notice that after some 3600 seconds (after all the jobs that failed
>>>>>> had failed), the # of active tasks in Falkon oscillates between 100
>>>>>> and 101 active tasks!  The #s presented in these graphs are the
>>>>>> median values per minute (the raw data was 60 samples per minute).
>>>>>> Notice that only at the very end of the experiment, at 30K+ seconds,
>>>>>> the # of active tasks increases to a max of 109 for a brief period
>>>>>> of time before it drops towards 0 as the workflow completes.  I did
>>>>>> notice that towards the end of the workflow the jobs were typically
>>>>>> shorter, and perhaps that somehow influenced the # of active tasks
>>>>>> within Falkon...  So, when I said "not at all", I was referring to
>>>>>> the flat line of 100~101 active tasks shown in these figures!
>>>>>>
>>>>> Then say "it appears (from x and y) that the number of concurrent jobs
>>>>> does not increase by an observable amount". This is not the same as
>>>>> "the score does not increase at all".
>>>>>
>>>> You are playing with words here... the bottom line is that after 
>>>> 19K+ successful jobs over several hours, there was no indication 
>>>> that the heuristic was adapting to the new conditions, in which no 
>>>> jobs were failing!
>>>>>
>>>>>>>> So you are saying that 19K+ successful jobs was not enough to
>>>>>>>> counteract the 10K+ failed jobs from the early part of the
>>>>>>>> experiment?
>>>>>>> Yep. 19*1/5 = 3.8 < 10.
>>>>>>>
>>>>>>>
>>>>>>>> Can this ratio (1:5) be changed?
>>>>>>>>
>>>>>>> Yes. The scheduler has two relevant properties: successFactor
>>>>>>> (currently 0.1) and failureFactor (currently -0.5). The term
>>>>>>> "factor" is used loosely, since these values get added to the
>>>>>>> current score rather than multiplied.
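>>>>>>> To put rough numbers on it: 19K successes add about
>>>>>>> 19000*0.1 = 1900 to the score, while 10K failures subtract about
>>>>>>> 10000*0.5 = 5000, so the net effect is roughly -3100. That is the
>>>>>>> 1:5 ratio above.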
>>>>>>>
>>>>>>>
>>>>>>>> From this experiment, it would seem that the heuristic is a slow
>>>>>>>> learner... maybe you have ideas on how to make it quicker to
>>>>>>>> adapt to changes?
>>>>>>>>
>>>>>>> That could perhaps be done.
>>>>>>>
>>>>>>>
>>>>>>>>> In the context in which jobs are sent to non-busy workers, the
>>>>>>>>> system would tend to produce lots of failed jobs if it takes
>>>>>>>>> little time (compared to the normal run-time of a job) for a bad
>>>>>>>>> worker to fail a job. This *IS* why the Swift scheduler throttles
>>>>>>>>> in the beginning: to avoid sending a large number of jobs to a
>>>>>>>>> resource that is broken.
>>>>>>>>>
>>>>>>>> But not the whole resource is broken...
>>>>>>> No, just slightly more than 1/3 of it. At least that's how it
>>>>>>> appears from the outside.
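>>>>>>> (10K failed out of 29K total is about 34%.)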
>>>>>>>
>>>>>> But a failed job should not be given the same weight as a successful
>>>>>> job, in my opinion.
>>>>>>
>>>>> Nope. I'd punish failures quite harshly. That's because the expected
>>>>> behavior is for things to work. I would not want a site that fails
>>>>> half the jobs to be anywhere near keeping a constant score.
>>>>>
>>>> That is fine, but you have a case (such as this one) in which this 
>>>> is not ideal... how do you propose we adapt to cover this corner case?
>>>>>
>>>>>>   For example, it seems to me that you are giving the failed jobs 5
>>>>>> times more weight than successful jobs, but in reality it should be
>>>>>> the other way around.  Failed jobs will usually fail either quickly
>>>>>> (as in the case that we have in MolDyn) or slowly (within the
>>>>>> lifetime of the resource allocation).  On the other hand, most
>>>>>> successful jobs will likely take more time to complete than it takes
>>>>>> for a job to fail (if it fails quickly).   Perhaps instead of
>>>>>>> successFactor (currently 0.1) and failureFactor (currently -0.5)
>>>>>>>
>>>>>> it should be more like:
>>>>>> successFactor: +1*(executionTime)
>>>>>> failureFactor: -1*(failureTime)
>>>>>>
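>>>>>> Here is a minimal sketch of the kind of scoring I mean (toy code,
>>>>>> not the actual Karajan scheduler classes; the names and the 1.0
>>>>>> weights are just placeholders):
>>>>>>
>>>>>>   // Toy model of a time-biased site score (illustrative only).
>>>>>>   class SiteScore {
>>>>>>       private double score = 0.0;
>>>>>>
>>>>>>       // Reward a success in proportion to how long it ran (seconds).
>>>>>>       void jobSucceeded(double executionTimeSec) {
>>>>>>           score += 1.0 * executionTimeSec;
>>>>>>       }
>>>>>>
>>>>>>       // Punish a failure in proportion to how long it took to fail.
>>>>>>       void jobFailed(double failureTimeSec) {
>>>>>>           score -= 1.0 * failureTimeSec;
>>>>>>       }
>>>>>>
>>>>>>       public static void main(String[] args) {
>>>>>>           SiteScore s = new SiteScore();
>>>>>>           // 1000 successes of ~100 s each vs. 1000 failures of ~0.01 s:
>>>>>>           for (int i = 0; i < 1000; i++) s.jobSucceeded(100);
>>>>>>           for (int i = 0; i < 1000; i++) s.jobFailed(0.01);
>>>>>>           // Net score stays strongly positive (~ +99990), so a burst
>>>>>>           // of fast failures would not throttle the site.
>>>>>>           System.out.println("score = " + s.score);
>>>>>>       }
>>>>>>   }
>>>>>>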
>>>>> That's a very good idea: biasing the score based on run-time (at
>>>>> least when known). Please note: you should still fix Falkon to not do
>>>>> that thing it's doing.
>>>>>
>>>> It's not clear to me this should be done all the time; Falkon needs 
>>>> to know why the failure happened to decide whether to throttle!
>>>>>
>>>>>> The 1 could of course be replaced with some other weight to give
>>>>>> preference to successful jobs, or to failed jobs.  With this kind of
>>>>>> strategy, the problems we are facing with throttling when there are
>>>>>> large #s of short failures wouldn't be happening!  Do you see any
>>>>>> drawbacks to this approach?
>>>>>>
>>>>> None that are obvious. It's in fact a good thing if the goal is
>>>>> performance, since it takes execution time into account. I've had
>>>>> manual "punishments" for connection time-outs because they take a
>>>>> long time to happen. But this time biasing naturally integrates that
>>>>> kind of stuff. So thanks.
>>>>>
>>>>>
>>>>>>>> that is the whole point here...
>>>>>>> This point comes because you KNOW how things work internally. All
>>>>>>> Swift sees is 10K failed jobs out of 29K.
>>>>>>>
>>>>>>>
>>>>>>>> Anyway, I think this is a valid case that we need to discuss how
>>>>>>>> to handle, to make the entire Swift+Falkon stack more robust!
>>>>>>>>
>>>>>>>> BTW, here is another experiment with MolDyn that shows the
>>>>>>>> throttling and this heuristic behaving as I would expect!
>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg 
>>>>>>>>
>>>>>>>>
>>>>>>>> Notice that the queue length (blue line) dropped sharply at around
>>>>>>>> 11K seconds, but then grew back up.  The sudden drop was many jobs
>>>>>>>> failing fast on a bad node, and the sudden growth back up was Swift
>>>>>>>> re-submitting almost the same # of failed jobs back to Falkon.
>>>>>>>>
>>>>>>> That failing-many-jobs-fast behavior is not right, regardless of
>>>>>>> whether Swift can deal with it or not.
>>>>>> If it's a machine error, then it would be best not to fail many jobs
>>>>>> fast...
>>>>>> however, if it's an app error, you want to fail the tasks as fast as
>>>>>> possible to fail the entire workflow faster,
>>>>>>
>>>>> But you can't distinguish between the two. The best you can do is
>>>>> assume that the failure is a linear combination of a broken
>>>>> application and a broken node. If it's a broken node, rescheduling
>>>>> would do (which does not happen in your case: jobs keep being sent to
>>>>> the worker that is not busy, and that's the broken one). If it's a
>>>>> broken application, the way to distinguish it from the other case is
>>>>> that after a bunch of retries on different nodes, it still fails.
>>>>> Notice that "different nodes" is essential here.
>>>>>
>>>> Right, I could try to keep track of statistics on each node, and 
>>>> when failures happen, try to determine whether it's a system-wide 
>>>> failure (all nodes reporting errors) or whether the failures are 
>>>> isolated to a single node (or a small set of nodes)...  I'll have to 
>>>> think about how to do this efficiently!
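>>>> Something along these lines is what I have in mind (a rough sketch
>>>> only, not actual Falkon code; the class name and thresholds are
>>>> invented):
>>>>
>>>>   import java.util.HashMap;
>>>>   import java.util.Map;
>>>>
>>>>   // Track per-node outcomes to tell node-specific failures
>>>>   // (blacklist that node) from system-wide ones (probably an
>>>>   // application or shared-service problem).
>>>>   class NodeHealthTracker {
>>>>       private final Map<String, Integer> failures = new HashMap<>();
>>>>       private final Map<String, Integer> successes = new HashMap<>();
>>>>
>>>>       synchronized void recordSuccess(String node) {
>>>>           successes.merge(node, 1, Integer::sum);
>>>>       }
>>>>
>>>>       synchronized void recordFailure(String node) {
>>>>           failures.merge(node, 1, Integer::sum);
>>>>       }
>>>>
>>>>       // A node looks "bad" if it has failed at least 3 jobs and has
>>>>       // at least 10x more failures than successes.
>>>>       synchronized boolean isBadNode(String node) {
>>>>           int f = failures.getOrDefault(node, 0);
>>>>           int s = successes.getOrDefault(node, 0);
>>>>           return f >= 3 && f >= 10 * s;
>>>>       }
>>>>
>>>>       // If most nodes that ran jobs look bad, the problem is probably
>>>>       // not the individual nodes.
>>>>       synchronized boolean looksSystemWide() {
>>>>           java.util.Set<String> nodes = new java.util.HashSet<>(successes.keySet());
>>>>           nodes.addAll(failures.keySet());
>>>>           long bad = nodes.stream().filter(this::isBadNode).count();
>>>>           return !nodes.isEmpty() && bad > nodes.size() / 2;
>>>>       }
>>>>
>>>>       public static void main(String[] args) {
>>>>           NodeHealthTracker t = new NodeHealthTracker();
>>>>           t.recordFailure("node-17");
>>>>           t.recordFailure("node-17");
>>>>           t.recordFailure("node-17");
>>>>           t.recordSuccess("node-03");
>>>>           System.out.println(t.isBadNode("node-17"));  // true
>>>>           System.out.println(t.looksSystemWide());     // false
>>>>       }
>>>>   }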
>>>>>
>>>>>>  so the app can be fixed and the workflow retried!  For example, say
>>>>>> you had 1000 tasks (all independent), and had a wrong path set to the
>>>>>> app... with the current Falkon behaviour, the entire workflow would
>>>>>> likely fail within some 10~20 seconds of submitting the first task!
>>>>>> However, if Falkon does some "smart" throttling when it sees
>>>>>> failures, it's going to take time proportional to the number of
>>>>>> failures to fail the workflow!
>>>>>>
>>>>> You're missing the part where all nodes fail the jobs equally, thus
>>>>> not creating the inequality we're talking about (the one where broken
>>>>> nodes get a higher chance of getting more jobs).
>>>>>
>>>> Right, maybe we can use this to distinguish between node failure and 
>>>> app failure!
>>>>>
>>>>>>   Essentially, I am not a big fan of throttling task dispatch due to
>>>>>> failed executions, unless we know why these tasks failed!
>>>>>>
>>>>> Stop putting exclamation marks after every sentence. It diminishes the
>>>>> meaning of it!
>>>>>
>>>> So you are going from playing with words to picking on my 
>>>> exclamation! :)
>>>>> Well, you can't know why these tasks failed. That's the whole problem.
>>>>> You're dealing with incomplete information and you have to devise
>>>>> heuristics that get things done efficiently.
>>>>>
>>>> But Swift might know why it failed; it has a bunch of STDOUT/STDERR 
>>>> that it always captures!  Falkon might capture the same output, but 
>>>> it's optional ;(  Could these outputs not be parsed for certain 
>>>> well-known errors, and mapped to different exit codes to mean 
>>>> different kinds of errors?
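>>>> For instance, something like this in the wrapper or in Falkon (purely
>>>> illustrative; the error strings and exit codes below are made up, not
>>>> an agreed-upon convention):
>>>>
>>>>   // Map well-known error strings in a job's stderr to distinct exit
>>>>   // codes, so the scheduler can tell node faults from app faults.
>>>>   class ErrorClassifier {
>>>>       static final int EXIT_NODE_FAULT = 70;  // e.g. stale NFS handle
>>>>       static final int EXIT_APP_FAULT  = 71;  // e.g. bad app path
>>>>       static final int EXIT_UNKNOWN    = 1;
>>>>
>>>>       static int classify(String stderr) {
>>>>           String s = stderr.toLowerCase();
>>>>           if (s.contains("stale nfs file handle")
>>>>                   || s.contains("stale file handle")) {
>>>>               return EXIT_NODE_FAULT;  // blame the node: re-queue elsewhere
>>>>           }
>>>>           if (s.contains("command not found")
>>>>                   || s.contains("no such file or directory")) {
>>>>               return EXIT_APP_FAULT;   // blame the app/config: fail fast
>>>>           }
>>>>           return EXIT_UNKNOWN;         // unknown failure: default handling
>>>>       }
>>>>
>>>>       public static void main(String[] args) {
>>>>           System.out.println(classify("mount: Stale NFS file handle"));   // 70
>>>>           System.out.println(classify("sh: md_run: command not found"));  // 71
>>>>       }
>>>>   }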
>>>>>
>>>>>>   Exit codes are usually not enough, unless we define our own and
>>>>>> have the app and wrapper scripts generate these particular exit
>>>>>> codes so that Falkon can intercept and interpret them reliably!
>>>>>>
>>>>> That would be an improvement, but probably not a universally valid
>>>>> assumption. So I wouldn't design with only that in mind.
>>>>>
>>>> But it would be an improvement over what we currently have...
>>>>>
>>>>>>> Frankly, I'd rather Swift not be the part that deals with it,
>>>>>>> because it has to resort to heuristics, whereas Falkon has direct
>>>>>>> knowledge of which nodes do what.
>>>>>>>
>>>>>> That's fine, but I don't think Falkon can do it alone; it needs
>>>>>> context and a definition of failure, which I believe only the
>>>>>> application and Swift can provide for certain!
>>>>>>
>>>>> Nope, they can't. Swift does not meddle with the semantics of
>>>>> applications. They're all equally valuable functions.
>>>>>
>>>>> Now, there's stuff you can do to improve things, I'm guessing. You can
>>>>> choose not to, and then we can keep having this discussion. There
>>>>> might be stuff Swift can do, but it's not insight into applications,
>>>>> so you'll have to ask for something else.
>>>>>
>>>> Any suggestions?
>>>>
>>>> Ioan
>>>>> Mihael
>>>>>
>>>>>
>>>>>> Ioan
>>>>>>
>>>>>>
>>>>>
>>>
>>
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>       http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
>>
> 
> 


