[Swift-devel] Re: 244 MolDyn run was successful!

Mon Sep 10 12:02:42 CDT 2007

Hi, Ioan:

I am wondering what is happening with the Falkon scheduler and whether it  
can now avoid 'bad' nodes during execution?

Thanks,

Nika

On Aug 27, 2007, at 12:30 PM, Ioan Raicu wrote:

> Hi,
> I will look at the Falkon scheduler to see what I can do to either  
> throttle or blacklist task dispatches to bad nodes.
>
> On a similar note, IMO, the heuristic in Karajan should be modified  
> to take into account the task execution time of the failed or  
> successful task, and not just the number of tasks.  This would  
> ensure that Swift is not throttling task submission to Falkon when  
> there are 1000s of successful tasks that take on the order of 100s  
> of seconds to complete, yet there are also 1000s of failed tasks  
> that are only 10 ms long.  This is exactly the case with MolDyn,  
> when we get a bad node in a bunch of 100s of nodes, which ends up  
> throttling the number of active and running tasks to about 100,  
> regardless of the number of processors Falkon has.
> I also think that when Swift runs in conjunction with Falkon, we  
> should increase the number of retry attempts Swift is willing to  
> make per task before giving up.  Currently, it is set to 3, but a  
> higher number would be better, considering the low overhead of  
> task submission Falkon has!
>
> I think the combination of these three changes (one in Falkon and  
> two in Swift) should increase the probability of large  
> workflows completing on a large number of resources!
>
> Ioan
>
> Veronika Nefedova wrote:
>> OK. I looked at the output and it looks like 14 molecules have  
>> still failed. They all failed due to hardware problems -- I saw  
>> nothing application-specific in the application logs, all very  
>> consistent with the stale NFS handle that Ioan reported seeing.
>> It would be great to be able to stop submitting jobs to 'bad'  
>> nodes during the run (long term), or to increase the number of  
>> retries in Swift (short term) to enable the whole workflow to go  
>> through.
>>
>> Nika
>>
>> On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:
>>
>>>
>>>
>>> Mihael Hategan wrote:
>>>> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
>>>>
>>>>>>>
>>>>>> small != not at all
>>>>>>
>>>>> Check out these two graphs, showing the # of active tasks within
>>>>> Falkon!  Active tasks = queued + pending + active +
>>>>> done_and_not_delivered.
>>>>>
>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg
>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg
>>>>>
>>>>> Notice that after 3600-some seconds (once all the jobs that
>>>>> failed had failed), the # of active tasks in Falkon oscillates
>>>>> between 100 and 101!  The numbers shown in these graphs are the
>>>>> median value per minute (the raw data had 60 samples per minute).
>>>>> Notice that only at the very end of the experiment, at 30K+
>>>>> seconds, does the # of active tasks increase to a max of 109 for
>>>>> a brief period of time before it drops towards 0 as the workflow
>>>>> completes.  I did notice that towards the end of the workflow, the
>>>>> jobs were typically shorter, and perhaps that somehow influenced
>>>>> the # of active tasks within Falkon...  So, when I said not at
>>>>> all, I was referring to this flat line of 100~101 active tasks
>>>>> that is shown in these figures!
>>>>>
>>>> Then say "it appears (from x and y) that the number of concurrent
>>>> jobs does not increase by an observable amount". This is not the
>>>> same as "the score does not increase at all".
>>>>
>>> You are playing with words here... the bottom line is that after  
>>> 19K+ jobs and several hours of successful jobs, there was no  
>>> indication that the heuristic was adapting to the new conditions,  
>>> in which no jobs were failing!
>>>>
>>>>>>> So you are saying that 19K+ successful jobs was not enough to
>>>>>>> counteract the 10K+ failed jobs from the early part of the
>>>>>>> experiment?
>>>>>> Yep. 19*1/5 = 3.8 < 10.
>>>>>>
>>>>>>
>>>>>>> Can this ratio (1:5) be changed?
>>>>>>>
>>>>>> Yes. The scheduler has two relevant properties: successFactor
>>>>>> (currently 0.1) and failureFactor (currently -0.5). The term
>>>>>> "factor" is not used formally, since these get added to the
>>>>>> current score.
>>>>>>
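
The arithmetic behind this exchange can be checked with a short sketch. It assumes the score is a plain running sum with no clamping or decay, which the thread does not confirm. The names `net_score_change` and `successes_to_offset` are illustrative, not Karajan APIs; only the two factor values and the job counts come from the messages.

```python
# Factor values quoted in the thread; everything else is an assumption.
SUCCESS_FACTOR = 0.1
FAILURE_FACTOR = -0.5


def net_score_change(successes: int, failures: int) -> float:
    """Net change of an additive score (assumed: plain sum, no clamping)."""
    return successes * SUCCESS_FACTOR + failures * FAILURE_FACTOR


def successes_to_offset(failures: int) -> float:
    """Number of successes needed to cancel the given number of failures."""
    return failures * abs(FAILURE_FACTOR) / SUCCESS_FACTOR


if __name__ == "__main__":
    # 19K successes against 10K failures, as in the 244-molecule run.
    print(net_score_change(19_000, 10_000))  # negative: failures dominate
    print(successes_to_offset(10_000))       # five successes per failure
```

Under these assumptions the 10K early failures would need roughly 50K successes to cancel out, which is the 1:5 ratio being discussed.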
>>>>>>
>>>>>>> From this experiment, it would seem that the heuristic is a slow
>>>>>>> learner... maybe you have ideas on how to make it quicker to
>>>>>>> adapt to changes?
>>>>>>>
>>>>>> That could perhaps be done.
>>>>>>
>>>>>>
>>>>>>>> In the context in which jobs are sent to non-busy workers, the
>>>>>>>> system would tend to produce lots of failed jobs if it takes
>>>>>>>> little time (compared to the normal run-time of a job) for a bad
>>>>>>>> worker to fail a job. This *IS* why the Swift scheduler throttles
>>>>>>>> in the beginning: to avoid sending a large number of jobs to a
>>>>>>>> resource that is broken.
>>>>>>>>
>>>>>>> But not the whole resource is broken...
>>>>>> No, just slightly more than 1/3 of it. At least that's how it
>>>>>> appears from the outside.
>>>>>>
>>>>> But a failed job should not be given the same weight as a
>>>>> successful job, in my opinion.
>>>>>
>>>> Nope. I'd punish failures quite harshly. That's because the
>>>> expected behavior is for things to work. I would not want a site
>>>> that fails half the jobs to be anywhere near keeping a constant
>>>> score.
>>>>
>>> That is fine, but you have a case (such as this one) in which  
>>> this is not ideal... how do you propose we adapt to cover this  
>>> corner case?
>>>>
>>>>>   For example, it seems to me that you are giving the failed jobs
>>>>> 5 times more weight than successful jobs, but in reality it should
>>>>> be the other way around.  Failed jobs usually will fail quickly (as
>>>>> in the case that we have in MolDyn), or they will fail slowly
>>>>> (within the lifetime of the resource allocation).  On the other
>>>>> hand, most successful jobs will likely take more time to complete
>>>>> than it takes for a job to fail (if it fails quickly).   Perhaps
>>>>> instead of
>>>>>> successFactor (currently
>>>>>> 0.1) and failureFactor (currently -0.5)
>>>>>>
>>>>> it should be more like:
>>>>> successFactor: +1*(executionTime)
>>>>> failureFactor: -1*(failureTime)
>>>>>
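
A minimal sketch of the time-weighted scoring proposed here, assuming the score is still a plain running sum and that each job's duration is known. The function names and the example durations (100 s per success, 10 ms per failure) are illustrative choices, not values taken from the scheduler.

```python
def time_weighted_delta(succeeded: bool, duration_s: float,
                        weight: float = 1.0) -> float:
    """Score change for one job: +weight*time on success, -weight*time on failure."""
    return weight * duration_s if succeeded else -weight * duration_s


def net_score(jobs, weight: float = 1.0) -> float:
    """Sum the time-weighted deltas over (succeeded, duration_s) pairs."""
    return sum(time_weighted_delta(ok, d, weight) for ok, d in jobs)


if __name__ == "__main__":
    # Hypothetical mix: 19K long successes and 10K very short failures.
    jobs = [(True, 100.0)] * 19_000 + [(False, 0.01)] * 10_000
    print(net_score(jobs))  # strongly positive: short failures barely count
```

With these made-up durations the 10K failures subtract only about 100 score units in total, so they no longer dominate the way they do with fixed per-job factors.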
>>>> That's a very good idea. Biasing score based on run-time (at least
>>>> when known). Please note: you should still fix Falkon to not do
>>>> that thing it's doing.
>>>>
>>> It's not clear to me that this should be done all the time; Falkon  
>>> needs to know why the failure happened to decide to throttle!
>>>>
>>>>> The 1 could of course be replaced with some other weight to give
>>>>> preference to successful jobs, or to failed jobs.  With this kind
>>>>> of strategy, the problems we are facing with throttling when there
>>>>> is a large # of short failures wouldn't be happening!  Do you see
>>>>> any drawbacks to this approach?
>>>>>
>>>> None that are obvious. It's in fact a good thing if the goal is
>>>> performance, since it takes execution time into account. I've had
>>>> manual "punishments" for connection time-outs because they take a
>>>> long time to happen. But this time biasing naturally integrates
>>>> that kind of stuff. So thanks.
>>>>
>>>>
>>>>>>> that is the whole point here...
>>>>>> This point comes because you KNOW how things work internally. All
>>>>>> Swift sees is 10K failed jobs out of 29K.
>>>>>>
>>>>>>
>>>>>>> anyways, I think this is a valid case that we need to discuss
>>>>>>> how to handle, to make Swift+Falkon as a whole more robust!
>>>>>>>
>>>>>>> BTW, here is another experiment with MolDyn that shows the
>>>>>>> throttling and this heuristic behaving as I would expect!
>>>>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
>>>>>>>
>>>>>>> Notice that the queue length (blue line) dropped sharply at
>>>>>>> around 11K seconds, but then grew back up.  That sudden drop was
>>>>>>> many jobs failing fast on a bad node, and the sudden growth back
>>>>>>> up was Swift re-submitting almost the same # of jobs that failed
>>>>>>> back to Falkon.
>>>>>>>
>>>>>> That failing many jobs fast behavior is not right, regardless of
>>>>>> whether Swift can deal with it or not.
>>>>> If it's a machine error, then it would be best to not fail many
>>>>> jobs fast...
>>>>> however, if it's an app error, you want to fail the tasks as fast
>>>>> as possible to fail the entire workflow faster,
>>>>>
>>>> But you can't distinguish between the two. The best you can do is
>>>> assume that the failure is a linear combination between broken
>>>> application and broken node. If it's a broken node, rescheduling
>>>> would do (which does not happen in your case: jobs keep being sent
>>>> to the worker that is not busy, and that's the broken one). If it's
>>>> a broken application, then the way to distinguish it from the other
>>>> one is that after a bunch of retries on different nodes, it still
>>>> fails. Notice that different nodes is essential here.
>>>>
>>> Right, I could try to keep track of statistics on each node, and  
>>> when failures happen, try to determine whether it's a system-wide  
>>> failure (all nodes reporting errors) or whether the failures are  
>>> isolated to a single node (or a small set of nodes)...  I'll have  
>>> to think about how to do this efficiently!
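
One way to sketch the per-node bookkeeping described here is shown below. It is not Falkon code: the class name, the thresholds (`bad_rate`, `systemic_fraction`, `min_samples`), and the three labels are all hypothetical choices made for illustration. Recording an outcome is a constant-time counter update, which is one possible answer to the efficiency concern.

```python
class NodeFailureTracker:
    """Per-node success/failure counters (hypothetical sketch, not Falkon code)."""

    def __init__(self, bad_rate: float = 0.5, systemic_fraction: float = 0.8,
                 min_samples: int = 5):
        self.bad_rate = bad_rate                    # failure rate marking a node bad
        self.systemic_fraction = systemic_fraction  # share of bad nodes meaning "everywhere"
        self.min_samples = min_samples              # ignore nodes with too little data
        self._ok = {}
        self._failed = {}

    def record(self, node: str, succeeded: bool) -> None:
        """Count one task outcome for a node (O(1) per task)."""
        counts = self._ok if succeeded else self._failed
        counts[node] = counts.get(node, 0) + 1

    def nodes(self) -> set:
        return set(self._ok) | set(self._failed)

    def bad_nodes(self) -> list:
        """Nodes with enough samples whose failure rate reaches bad_rate."""
        bad = []
        for node in self.nodes():
            failed = self._failed.get(node, 0)
            total = failed + self._ok.get(node, 0)
            if total >= self.min_samples and failed / total >= self.bad_rate:
                bad.append(node)
        return sorted(bad)

    def classify(self) -> str:
        """'healthy', 'isolated' (few bad nodes) or 'systemic' (most nodes bad)."""
        bad = self.bad_nodes()
        if not bad:
            return "healthy"
        if len(bad) / len(self.nodes()) >= self.systemic_fraction:
            return "systemic"
        return "isolated"
```

An "isolated" result would suggest blacklisting or throttling only the listed nodes, while "systemic" would point towards an application or site-wide problem where failing fast is preferable.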
>>>>
>>>>>  so the app can be fixed and the workflow retried!  For example,
>>>>> say you had 1000 tasks (all independent), and had a wrong path set
>>>>> to the app... with the current Falkon behaviour, the entire
>>>>> workflow would likely fail within some 10~20 seconds of it
>>>>> submitting the first task!  However, if Falkon does some "smart"
>>>>> throttling when it sees failures, it's going to take time
>>>>> proportional to the failures to fail the workflow!
>>>>>
>>>> You're missing the part where all nodes fail the jobs equally, thus
>>>> not creating the inequality we're talking about (the one where
>>>> broken nodes get higher chances of getting more jobs).
>>>>
>>> Right, maybe we can use this to distinguish between node failure  
>>> and app failure!
>>>>
>>>>>   Essentially, I am not a big fan of throttling task dispatch due
>>>>> to failed executions, unless we know why these tasks failed!
>>>>>
>>>> Stop putting exclamation marks after every sentence. It diminishes
>>>> the meaning of it!
>>>>
>>> So you are going from playing with words to picking on my  
>>> exclamation! :)
>>>> Well, you can't know why these tasks failed. That's the whole
>>>> problem. You're dealing with incomplete information and you have to
>>>> devise heuristics that get things done efficiently.
>>>>
>>> But Swift might know why it failed; it has a bunch of  
>>> STDOUT/STDERR that it always captures!  Falkon might capture the  
>>> same output, but it's optional ;(  Could these outputs not be  
>>> parsed for certain well-known errors, and have different exit  
>>> codes to mean different kinds of errors?
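
The idea of scanning captured output for well-known errors could look roughly like the sketch below. The exit-code values and the pattern list are invented for illustration; neither Swift nor Falkon defines them, and a real deployment would need its own list. Matching text is only a heuristic, so an unrecognised message falls back to a generic code.

```python
import re

# Hypothetical exit codes; these are not defined by Swift or Falkon.
EXIT_NODE_FAILURE = 70
EXIT_APP_FAILURE = 71
EXIT_UNKNOWN = 1

# Illustrative patterns only; matching is case-insensitive.
_PATTERNS = [
    (re.compile(r"stale (nfs )?file handle", re.IGNORECASE), EXIT_NODE_FAILURE),
    (re.compile(r"no space left on device", re.IGNORECASE), EXIT_NODE_FAILURE),
    (re.compile(r"command not found", re.IGNORECASE), EXIT_APP_FAILURE),
]


def classify_stderr(stderr_text: str) -> int:
    """Map captured STDERR text to a failure category via known patterns."""
    for pattern, code in _PATTERNS:
        if pattern.search(stderr_text):
            return code
    return EXIT_UNKNOWN


if __name__ == "__main__":
    print(classify_stderr("cp: cannot stat 'in.dat': Stale NFS file handle"))
    print(classify_stderr("sh: moldyn: command not found"))
```

A wrapper script could exit with the returned code so that the dispatcher sees a node-level failure differently from an application-level one.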
>>>>
>>>>>   Exit codes are not usually enough in general, unless we define
>>>>> our own and the app and wrapper scripts generate these particular
>>>>> exit codes that Falkon can intercept and interpret reliably!
>>>>>
>>>> That would be an improvement, but probably not a universally valid
>>>> assumption. So I wouldn't design with only that in mind.
>>>>
>>> But it would be an improvement over what we currently have...
>>>>
>>>>>> Frankly I'd rather Swift not be the part to deal with it because
>>>>>> it has to resort to heuristics, whereas Falkon has direct
>>>>>> knowledge of which nodes do what.
>>>>>>
>>>>> That's fine, but I don't think Falkon can do it alone; it needs
>>>>> context and failure definition, which I believe only the
>>>>> application and Swift could say for certain!
>>>>>
>>>> Nope, they can't. Swift does not meddle with semantics of
>>>> applications. They're all equally valuable functions.
>>>>
>>>> Now, there's stuff you can do to improve things, I'm guessing. You
>>>> can choose not to, and then we can keep having this discussion.
>>>> There might be stuff Swift can do, but it's not insight into
>>>> applications, so you'll have to ask for something else.
>>>>
>>> Any suggestions?
>>>
>>> Ioan
>>>> Mihael
>>>>
>>>>
>>>>> Ioan
>>>>>
>>>>>
>>>>
>>
>
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>       http://dsl.cs.uchicago.edu/
> ============================================
>