[Swift-devel] Re: Error throttling for MD-244 Molecule Run

Wed Sep 12 09:57:58 CDT 2007

The throttling parameters are set in swift.properties. The last run  
(that we are discussing now) was performed by Ioan from  
viper.uchicago.edu. I am not sure which cogl install he used, but the  
one that I was using before is a r1047 from 8/1. I put the copy of  
swift.properties from that install to www.ci.uchicago.edu/~nefedova/ 
swift.properties.
Ioan, please confirm which swift install you used (yours or mine)  
and if this file is different from mine, please send your  
swift.properties.

Thanks,

Nika

On Sep 12, 2007, at 9:26 AM, Michael Wilde wrote:

> [Changing Subject: Re: 244 MolDyn run was successful! to start a  
> new thread.]
>
> Ioan, Nika, when we last discussed this in various conversations, I  
> think we were going to try two approaches:
>
> - Ioan was going to modify Falkon to recognize the stale-file- 
> handle error, "down" the offending host, and re-queue the job,  
> transparently to the client (Swift).
>
> - At the same time, we were discussing with Mihael adjustments to  
> the Swift error retry throttling so that these errors would not  
> cause the workflow to slow down so drastically. As I recall,  
> Mihael's view was that the current throttle control parameters were  
> sufficient to try this now. Unless we have evidence from tests that  
> this is *not* the case, we should try this now, without waiting for  
> any Falkon (or Swift) code changes.  Nika, can you send to the list  
> the throttling parameters that you are using?
>
> - Mike
>
>
> Veronika Nefedova wrote:
>> Hi, Ioan:
>> I am wondering what is happening with the Falkon scheduler and whether  
>> it can now avoid 'bad' nodes during the execution?
>> Thanks,
>> Nika
>> On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:
>>> Hi,
>>> I will look at the Falkon scheduler to see what I can do to either  
>>> throttle or blacklist task dispatches to bad nodes.
>>>
>>> On a similar note, IMO, the heuristic in Karajan should be  
>>> modified to take into account the task execution time of the  
>>> failed or successful task, and not just the number of tasks.   
>>> This would ensure that Swift is not throttling task submission to  
>>> Falkon when there are 1000s of successful tasks that take on the  
>>> order of 100s of seconds to complete, yet there are also 1000s of  
>>> failed tasks that are only 10 ms long.  This is exactly the case  
>>> with MolDyn, when we get a bad node in a bunch of 100s of nodes,  
>>> which ends up throttling the number of active and running tasks  
>>> to about 100, regardless of the number of processors Falkon has.
>>> I also think that when Swift runs in conjunction with Falkon, we  
>>> should increase the number of retry attempts Swift is willing to  
>>> make per task before giving up.  Currently, it is set to 3, but a  
>>> higher number would be better, considering the low overhead of  
>>> task submission Falkon has!
>>>
>>> I think the combination of these three changes (one from Falkon  
>>> and two from Swift) should increase the probability of large  
>>> workflows completing on a large number of resources!
>>>
>>> Ioan
>>>
>>> Veronika Nefedova wrote:
>>>> OK. I looked at the output and it looks like 14 molecules have  
>>>> still failed. They all failed due to hardware problems -- I saw  
>>>> nothing application-specific in applications logs, all very  
>>>> consistent with the stale NFS handle that Ioan reported seeing.
>>>> It would be great to be able to stop submitting jobs to 'bad'  
>>>> nodes during the run (long term), or to increase the number of  
>>>> retries in swift (short term) to enable the whole workflow to go  
>>>> through.
>>>>
>>>> Nika
>>>>
>>>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
>>>>
>>>>>
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
>>>>>>
>>>>>>>>>
>>>>>>>> small != not at all
>>>>>>>>
>>>>>>> Check out these two graphs, showing the # of active tasks within
>>>>>>> Falkon!  Active tasks = queued+pending+active 
>>>>>>> +done_and_not_delivered.
>>>>>>>
>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/ 
>>>>>>> 244-mol-success-8-10-07/number-of-active-tasks.jpg
>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/ 
>>>>>>> 244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg
>>>>>>>
>>>>>>> Notice that after 3600 some seconds (after all the jobs that  
>>>>>>> failed
>>>>>>> had failed), the # of active tasks in Falkon oscillates  
>>>>>>> between 100
>>>>>>> and 101 active tasks!  The # presented on these graphs are  
>>>>>>> taken from
>>>>>>> the median value per minute (the raw samples were 60 samples per
>>>>>>> minute).  Notice that only at the very end of the experiment,  
>>>>>>> at 30K+
>>>>>>> seconds, the # of active tasks increases to a max of 109 for  
>>>>>>> a brief
>>>>>>> period of time before it drops towards 0 as the workflow  
>>>>>>> completes.  I
>>>>>>> did notice that towards the end of the workflow, the jobs were
>>>>>>> typically shorter, and perhaps that somehow influenced the #  
>>>>>>> of active
>>>>>>> tasks within Falkon...  So, when I said not at all, I was  
>>>>>>> referring to
>>>>>>> this flat line 100~101 active tasks that is shown in these  
>>>>>>> figures!
>>>>>>>
>>>>>> Then say "it appears (from x and y) that the number of  
>>>>>> concurrent jobs
>>>>>> does not increase by an observable amount". This is not the  
>>>>>> same as "the
>>>>>> score does not increase at all".
>>>>>>
>>>>> You are playing with words here... the bottom line is that  
>>>>> after 19K+ jobs and several hours of successful jobs, there was  
>>>>> no indication that the heuristic was adapting to the new  
>>>>> conditions, in which no jobs were failing!
>>>>>>
>>>>>>>>> So you are saying that 19K+ successful jobs was not enough to
>>>>>>>>> counteract the 10K+ failed jobs from the early part of the
>>>>>>>>> experiment?
>>>>>>>> Yep. 19*1/5 = 3.8 < 10.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Can this ratio (1:5) be changed?
>>>>>>>>>
>>>>>>>> Yes. The scheduler has two relevant properties:  
>>>>>>>> successFactor (currently
>>>>>>>> 0.1) and failureFactor (currently -0.5). The term "factor"  
>>>>>>>> is not used
>>>>>>>> formally, since these get added to the current score.
>>>>>>>>
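The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming the score is a plain running sum that starts at zero; the function names are hypothetical and not taken from the Swift or Karajan source. Only the two factor values (0.1 and -0.5) and the round 19K/10K job counts come from this thread.

```python
SUCCESS_FACTOR = 0.1   # value quoted in this thread
FAILURE_FACTOR = -0.5  # value quoted in this thread


def count_based_score(successes, failures, start=0.0):
    """Hypothetical additive score: every finished job adds its factor."""
    return start + successes * SUCCESS_FACTOR + failures * FAILURE_FACTOR


def failures_offset_by(successes):
    """How many failures a given number of successes can cancel out."""
    return successes * SUCCESS_FACTOR / abs(FAILURE_FACTOR)


# Round numbers from the discussion: ~19K successful jobs, ~10K failed jobs.
print(count_based_score(successes=19_000, failures=10_000))
print(failures_offset_by(19_000))
```

With these values one failure cancels five successes, so 19K successes offset only about 3.8K failures and the net score stays negative, which is the 19*1/5 = 3.8 < 10 comparison made above.
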
>>>>>>>>
>>>>>>>>> From this experiment, it would seem that the heuristic is a  
>>>>>>>>> slow
>>>>>>>>> learner... maybe you have ideas on how to make it more  
>>>>>>>>> quick to adapt
>>>>>>>>> to changes?
>>>>>>>>>
>>>>>>>> That could perhaps be done.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> In the context in which jobs are sent to non-busy workers,  
>>>>>>>>>> the system
>>>>>>>>>> would tend to produce lots of failed jobs if it takes  
>>>>>>>>>> little time
>>>>>>>>>> (compared to the normal run-time of a job) for a bad  
>>>>>>>>>> worker to fail a
>>>>>>>>>> job. This *IS* why the swift scheduler throttles in the  
>>>>>>>>>> beginning: to
>>>>>>>>>> avoid sending a large number of jobs to a resource that is  
>>>>>>>>>> broken.
>>>>>>>>>>
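A toy Python simulation of this effect, under strong simplifying assumptions: jobs always go to whichever worker becomes idle first, exactly one worker is broken and fails every job after a fixed short time, and no retries or throttling are modeled. The function name, worker counts, and run-times are all hypothetical.

```python
import heapq


def count_failed_jobs(num_jobs, good_workers, good_runtime=100.0, bad_runtime=0.01):
    """Dispatch each job to the worker that becomes idle first.

    Worker 0 is the broken one: it "finishes" (fails) every job after
    bad_runtime seconds. All other workers need good_runtime seconds.
    """
    # Heap of (time_at_which_worker_is_idle, worker_id).
    idle_at = [(0.0, worker) for worker in range(good_workers + 1)]
    heapq.heapify(idle_at)
    failed = 0
    for _ in range(num_jobs):
        now, worker = heapq.heappop(idle_at)
        if worker == 0:
            failed += 1
            heapq.heappush(idle_at, (now + bad_runtime, worker))
        else:
            heapq.heappush(idle_at, (now + good_runtime, worker))
    return failed


failed = count_failed_jobs(num_jobs=1000, good_workers=99)
print(f"{failed} of 1000 jobs landed on the single broken worker")
```

In this model the broken worker is idle again long before any good worker finishes, so it receives most of the jobs even though it is only one worker out of 100.
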
>>>>>>>>> But not the whole resource is broken...
>>>>>>>> No, just slightly more than 1/3 of it. At least that's how  
>>>>>>>> it appears
>>>>>>>> from the outside.
>>>>>>>>
>>>>>>> But a failed job should not be given the same weight as a  
>>>>>>> successful
>>>>>>> job, in my opinion.
>>>>>>>
>>>>>> Nope. I'd punish failures quite harshly. That's because the  
>>>>>> expected
>>>>>> behavior is for things to work. I would not want a site that  
>>>>>> fails half
>>>>>> the jobs to be anywhere near keeping a constant score.
>>>>>>
>>>>> That is fine, but you have a case (such as this one) in which  
>>>>> this is not ideal... how do you propose we adapt to cover this  
>>>>> corner case?
>>>>>>
>>>>>>>   For example, it seems to me that you are giving the failed  
>>>>>>> jobs 5
>>>>>>> times more weight than successful jobs, but in reality it  
>>>>>>> should be the
>>>>>>> other way around.  Failed jobs usually will fail quickly (as  
>>>>>>> in the
>>>>>>> case that we have in MolDyn), or they will fail slowly  
>>>>>>> (within the
>>>>>>> lifetime of the resource allocation).  On the other hand, most
>>>>>>> successful jobs will likely take more time to complete than  
>>>>>>> it takes
>>>>>>> for a job to fail (if it fails quickly).   Perhaps instead of
>>>>>>>> successFactor (currently
>>>>>>>> 0.1) and failureFactor (currently -0.5)
>>>>>>>>
>>>>>>> it should be more like:
>>>>>>> successFactor: +1*(executionTime)
>>>>>>> failureFactor: -1*(failureTime)
>>>>>>>
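A rough Python sketch of this time-weighted proposal, for illustration only. The function name, the weight parameters, and the sample durations (100 s per successful job, 10 ms per failed job) are assumptions chosen to resemble the MolDyn case; nothing here is taken from the Swift scheduler.

```python
def time_weighted_score(jobs, success_weight=1.0, failure_weight=1.0):
    """Sum +weight*duration for successes and -weight*duration for failures.

    jobs is an iterable of (succeeded, duration_in_seconds) pairs.
    """
    score = 0.0
    for succeeded, duration in jobs:
        if succeeded:
            score += success_weight * duration
        else:
            score -= failure_weight * duration
    return score


# Assumed workload: many long successes, many very short failures.
jobs = [(True, 100.0)] * 19_000 + [(False, 0.010)] * 10_000
score = time_weighted_score(jobs)
print("score is positive:", score > 0)
```

Under these assumed durations the 10K short failures subtract only about 100 seconds in total, so the long successful jobs dominate the score. Raising failure_weight is one way to keep failures punished more harshly than their run-time alone would suggest.
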
>>>>>> That's a very good idea. Biasing score based on run-time (at  
>>>>>> least when
>>>>>> known). Please note: you should still fix Falkon to not do  
>>>>>> that thing
>>>>>> it's doing.
>>>>>>
>>>>> It's not clear to me this should be done all the time; Falkon  
>>>>> needs to know why the failure happened to decide to throttle!
>>>>>>
>>>>>>> The 1 could of course be changed with some other weight to give
>>>>>>> preference to successful jobs, or to failed jobs.  With this  
>>>>>>> kind of
>>>>>>> strategy, the problems we are facing with throttling when  
>>>>>>> there are
>>>>>>> large # of short failures wouldn't be happening!  Do you see any
>>>>>>> drawbacks to this approach?
>>>>>>>
>>>>>> None that are obvious. It's in fact a good thing if the goal is
>>>>>> performance, since it takes execution time into account. I've  
>>>>>> had manual
>>>>>> "punishments" for connection time-outs because they take a  
>>>>>> long time to
>>>>>> happen. But this time biasing naturally integrates that kind  
>>>>>> of stuff.
>>>>>> So thanks.
>>>>>>
>>>>>>
>>>>>>>>> that is the whole point here...
>>>>>>>> This point comes because you KNOW how things work  
>>>>>>>> internally. All Swift
>>>>>>>> sees is 10K failed jobs out of 29K.
>>>>>>>>
>>>>>>>>
>>>>>>>>> anyways, I think this is a valid case that we need to  
>>>>>>>>> discuss how to
>>>>>>>>> handle, to make the entire Swift+Falkon more robust!
>>>>>>>>>
>>>>>>>>> BTW, here is another experiment with MolDyn that shows the  
>>>>>>>>> throttling
>>>>>>>>> and this heuristic behaving as I would expect!
>>>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/ 
>>>>>>>>> 244-mol-failed/summary_graph.jpg
>>>>>>>>>
>>>>>>>>> Notice the queue length (blue line) at around 11K seconds  
>>>>>>>>> dropped
>>>>>>>>> sharply, but then grew back up.  That sudden drop was many  
>>>>>>>>> jobs
>>>>>>>>> failing fast on a bad node, and the sudden growth back up  
>>>>>>>>> was Swift
>>>>>>>>> re-submitting almost the same # of jobs that failed back to  
>>>>>>>>> Falkon.
>>>>>>>>>
>>>>>>>> That failing many jobs fast behavior is not right,  
>>>>>>>> regardless of whether
>>>>>>>> Swift can deal with it or not.
>>>>>>> If it's a machine error, then it would be best to not fail  
>>>>>>> many jobs
>>>>>>> fast...
>>>>>>> however, if it's an app error, you want to fail the tasks as  
>>>>>>> fast as
>>>>>>> possible to fail the entire workflow faster,
>>>>>>>
>>>>>> But you can't distinguish between the two. The best you can do  
>>>>>> is assume
>>>>>> that the failure is a linear combination between broken  
>>>>>> application and
>>>>>> broken node. If it's broken node, rescheduling would do (which  
>>>>>> does not
>>>>>> happen in your case: jobs keep being sent to the worker that  
>>>>>> is not
>>>>>> busy, and that's the broken one). If it's a broken  
>>>>>> application, then the
>>>>>> way to distinguish it from the other one is that after a bunch of
>>>>>> retries on different nodes, it still fails. Notice that  
>>>>>> different nodes
>>>>>> is essential here.
>>>>>>
>>>>> Right, I could try to keep track of statistics on each node,  
>>>>> and when failures happen, try to determine if it's a system-wide  
>>>>> failure (all nodes reporting errors), or are the failures  
>>>>> isolated on a single (or small set) node(s)...  I'll have to  
>>>>> think about how to do this efficiently!
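One way this bookkeeping could look, as a hedged Python sketch. The class name, thresholds, and labels are all hypothetical and this is not Falkon code; it only illustrates separating failures isolated on a few nodes from failures reported by nearly all nodes.

```python
from collections import Counter


class NodeStats:
    """Per-node success/failure counters (hypothetical helper)."""

    def __init__(self):
        self.ok = Counter()
        self.failed = Counter()

    def record(self, node, succeeded):
        (self.ok if succeeded else self.failed)[node] += 1

    def failure_rate(self, node):
        total = self.ok[node] + self.failed[node]
        return self.failed[node] / total if total else 0.0

    def classify(self, rate_threshold=0.5, system_wide_fraction=0.8, min_jobs=5):
        """Return (label, bad_nodes), ignoring nodes with fewer than min_jobs results."""
        nodes = [n for n in set(self.ok) | set(self.failed)
                 if self.ok[n] + self.failed[n] >= min_jobs]
        bad = sorted(n for n in nodes if self.failure_rate(n) >= rate_threshold)
        if not bad:
            return "healthy", []
        if len(bad) / len(nodes) >= system_wide_fraction:
            return "system-wide", bad   # looks like an application problem
        return "isolated", bad          # looks like a few broken nodes


stats = NodeStats()
for i in range(10):
    for _ in range(20):
        stats.record(f"node{i}", succeeded=(i != 3))  # only node3 fails
print(stats.classify())
```

An "isolated" result would suggest taking the listed nodes out of rotation and re-queueing their jobs, while a "system-wide" result would suggest failing fast so the application can be fixed. The thresholds would need tuning against real runs.
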
>>>>>>
>>>>>>>  so the app can be fixed and the workflow retried!  For  
>>>>>>> example, say
>>>>>>> you had 1000 tasks (all independent), and had a wrong path  
>>>>>>> set to the
>>>>>>> app... with the current Falkon behaviour, the entire workflow  
>>>>>>> would
>>>>>>> likely fail within some 10~20 seconds of it submitting the  
>>>>>>> first task!
>>>>>>> However, if Falkon does some "smart" throttling when it sees  
>>>>>>> failures,
>>>>>>> it's going to take time proportional to the failures to fail the
>>>>>>> workflow!
>>>>>>>
>>>>>> You're missing the part where all nodes fail the jobs equally,  
>>>>>> thus not
>>>>>> creating the inequality we're talking about (the ones where  
>>>>>> broken nodes
>>>>>> get higher chances of getting more jobs).
>>>>>>
>>>>> Right, maybe we can use this to distinguish between node  
>>>>> failure and app failure!
>>>>>>
>>>>>>>   Essentially, I am not a big fan of throttling task dispatch  
>>>>>>> due to
>>>>>>> failed executions, unless we know why these tasks failed!
>>>>>>>
>>>>>> Stop putting exclamation marks after every sentence. It  
>>>>>> diminishes the
>>>>>> meaning of it!
>>>>>>
>>>>> So you are going from playing with words to picking on my  
>>>>> exclamation! :)
>>>>>> Well, you can't know why these tasks failed. That's the whole  
>>>>>> problem.
>>>>>> You're dealing with incomplete information and you have to devise
>>>>>> heuristics that get things done efficiently.
>>>>>>
>>>>> But Swift might know why it failed, it has a bunch of STDOUT/ 
>>>>> STDERR that it always captures!  Falkon might capture the same  
>>>>> output, but it's optional ;(  Could these outputs not be parsed  
>>>>> for certain well-known errors, and have different exit codes to  
>>>>> mean different kinds of errors?
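A minimal Python sketch of that idea, assuming the captured STDERR is available as text. The pattern table, the category names, and the exit codes 101 and 102 are made up for illustration; neither Swift nor Falkon is known to define them, and treating a missing file or command as an application error is itself an assumption.

```python
import re

# Hypothetical table: (pattern, failure kind, illustrative exit code).
KNOWN_ERRORS = [
    (re.compile(r"stale (nfs )?file handle", re.IGNORECASE), "node", 101),
    (re.compile(r"no such file or directory|command not found", re.IGNORECASE),
     "application", 102),
]
UNKNOWN = ("unknown", 1)


def classify_failure(stderr_text):
    """Return (kind, exit_code) for the first known pattern found in stderr."""
    for pattern, kind, code in KNOWN_ERRORS:
        if pattern.search(stderr_text):
            return kind, code
    return UNKNOWN


print(classify_failure("cat: input.dat: Stale NFS file handle"))
print(classify_failure("sh: moldyn: command not found"))
print(classify_failure("Segmentation fault"))
```

Anything that matches no pattern stays "unknown", so a scheduler using this would still need a fallback heuristic for failures it cannot name.
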
>>>>>>
>>>>>>>   Exit codes are not usually enough in general, unless we  
>>>>>>> define our
>>>>>>> own and the app and wrapper scripts generate these particular  
>>>>>>> exit
>>>>>>> codes that Falkon can intercept and interpret reliably!
>>>>>>>
>>>>>> That would be an improvement, but probably not a universally  
>>>>>> valid
>>>>>> assumption. So I wouldn't design with only that in mind.
>>>>>>
>>>>> But it would be an improvement over what we currently have...
>>>>>>
>>>>>>>> Frankly I'd rather Swift not be the part
>>>>>>>> to deal with it because it has to resort to heuristics,  
>>>>>>>> whereas Falkon
>>>>>>>> has direct knowledge of which nodes do what.
>>>>>>>>
>>>>>>> That's fine, but I don't think Falkon can do it alone, it needs
>>>>>>> context and failure definition, which I believe only the  
>>>>>>> application
>>>>>>> and Swift could say for certain!
>>>>>>>
>>>>>> Nope, they can't. Swift does not meddle with semantics of  
>>>>>> applications.
>>>>>> They're all equally valuable functions.
>>>>>>
>>>>>> Now, there's stuff you can do to improve things, I'm guessing.  
>>>>>> You can
>>>>>> choose not to, and then we can keep having this discussion.  
>>>>>> There might
>>>>>> be stuff Swift can do, but it's not insight into applications,  
>>>>>> so you'll
>>>>>> have to ask for something else.
>>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> Ioan
>>>>>> Mihael
>>>>>>
>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>> -- 
>>> ============================================
>>> Ioan Raicu
>>> Ph.D. Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>       http://dsl.cs.uchicago.edu/
>>> ============================================
>>>
>