[Swift-devel] Re: 244 MolDyn run was successful!

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 13 23:52:24 CDT 2007



Mihael Hategan wrote:
> On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
>   
>>>>     
>>>>         
>>> small != not at all
>>>   
>>>       
>> Check out these two graphs, showing the # of active tasks within
>> Falkon!  Active tasks = queued+pending+active+done_and_not_delivered.
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg
>>
>> Notice that after some 3600 seconds (after all the jobs that were going
>> to fail had failed), the # of active tasks in Falkon oscillates between
>> 100 and 101!  The #s presented in these graphs are the median value per
>> minute (the raw data was 60 samples per minute).  Notice that only at
>> the very end of the experiment, at 30K+ seconds, does the # of active
>> tasks increase to a max of 109 for a brief period before it drops
>> towards 0 as the workflow completes.  I did notice that towards the end
>> of the workflow the jobs were typically shorter, and perhaps that
>> somehow influenced the # of active tasks within Falkon...  So, when I
>> said "not at all", I was referring to the flat line of 100~101 active
>> tasks shown in these figures!
>>     
>
> Then say "it appears (from x and y) that the number of concurrent jobs
> does not increase by an observable amount". This is not the same as "the
> score does not increase at all".
>   
You are playing with words here... the bottom line is that after 19K+ 
successful jobs over several hours, there was no indication that the 
heuristic was adapting to the new conditions, in which no jobs were 
failing!
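
Just to make the arithmetic concrete, here is a rough sketch (assuming 
the site score is simply the running sum of the two factors Mihael 
mentions below; the class itself is made up for illustration):

    // hypothetical sketch of the cumulative score under the current heuristic
    public class ScoreSketch {
        public static void main(String[] args) {
            double successFactor = 0.1;   // added to the score per successful job
            double failureFactor = -0.5;  // added to the score per failed job
            int failed = 10000;           // ~10K early failures
            int succeeded = 19000;        // ~19K later successes
            double score = failed * failureFactor + succeeded * successFactor;
            System.out.println(score);    // roughly -5000 + 1900 = -3100, still deeply negative
            // successes needed just to climb back to zero: ~50000
            System.out.println(-(failed * failureFactor) / successFactor);
        }
    }
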
>   
>>>> So you are saying that 19K+ successful jobs were not enough to
>>>> counteract the 10K+ failed jobs from the early part of the
>>>> experiment?
>>>>     
>>>>         
>>> Yep. 19K * 1/5 = 3.8K < 10K.
>>>
>>>   
>>>       
>>>> Can this ratio (1:5) be changed?
>>>>     
>>>>         
>>> Yes. The scheduler has two relevant properties: successFactor (currently
>>> 0.1) and failureFactor (currently -0.5). The term "factor" is not used
>>> formally, since these get added to the current score.
>>>
>>>   
>>>       
>>>> From this experiment, it would seem that the heuristic is a slow
>>>> learner... maybe you have ideas on how to make it quicker to adapt
>>>> to changes?
>>>>     
>>>>         
>>> That could perhaps be done.
>>>
>>>   
>>>       
>>>>> In the context in which jobs are sent to non-busy workers, the system
>>>>> would tend to produce lots of failed jobs if it takes little time
>>>>> (compared to the normal run-time of a job) for a bad worker to fail a
>>>>> job. This *IS* why the swift scheduler throttles in the beginning: to
>>>>> avoid sending a large number of jobs to a resource that is broken.
>>>>>   
>>>>>       
>>>>>           
>>>> But it's not the whole resource that is broken... 
>>>>     
>>>>         
>>> No, just slightly more than 1/3 of it. At least that's how it appears
>>> from the outside.
>>>   
>>>       
>> But a failed job should not be given the same weight as a successful
>> job, in my opinion.
>>     
>
> Nope. I'd punish failures quite harshly. That's because the expected
> behavior is for things to work. I would not want a site that fails half
> the jobs to be anywhere near keeping a constant score.
>   
That is fine, but you have a case (such as this one) in which this is 
not ideal... how do you propose we adapt to cover this corner case? 
>   
>>   For example, it seems to me that you are giving failed jobs 5
>> times more weight than successful jobs, but in reality it should be the
>> other way around.  Failed jobs will usually fail quickly (as in the
>> case we have in MolDyn), or they will fail slowly (within the
>> lifetime of the resource allocation).  On the other hand, most
>> successful jobs will likely take more time to complete than it takes
>> for a job to fail (if it fails quickly).  Perhaps instead of 
>>     
>>> successFactor (currently
>>> 0.1) and failureFactor (currently -0.5)
>>>       
>> it should be more like:
>> successFactor: +1*(executionTime)
>> failureFactor: -1*(failureTime)
>>     
>
> That's a very good idea. Biasing score based on run-time (at least when
> known). Please note: you should still fix Falkon to not do that thing
> it's doing.
>   
It's not clear to me that this should be done all the time; Falkon needs 
to know why the failure happened in order to decide whether to throttle!
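
That said, to make the run-time biasing I suggested above concrete, here 
is roughly what I have in mind (just a sketch, not the actual scheduler 
code; the names and weights are made up):

    // sketch of a run-time-biased site score; names/weights are hypothetical
    public class TimeBiasedScore {
        static final double SUCCESS_WEIGHT = 1.0;  // tune to favor successes...
        static final double FAILURE_WEIGHT = 1.0;  // ...or failures

        double score = 0.0;

        void jobSucceeded(double executionTimeSec) {
            score += SUCCESS_WEIGHT * executionTimeSec;  // long successful jobs earn a lot
        }

        void jobFailed(double timeToFailureSec) {
            score -= FAILURE_WEIGHT * timeToFailureSec;  // fast failures cost very little
        }

        public static void main(String[] args) {
            TimeBiasedScore site = new TimeBiasedScore();
            for (int i = 0; i < 10000; i++) site.jobFailed(10);      // 10K failures at ~10s
            for (int i = 0; i < 19000; i++) site.jobSucceeded(300);  // 19K successes at ~5min
            System.out.println(site.score);  // strongly positive, unlike the count-based score
        }
    }
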
>   
>> The 1 could of course be replaced with some other weight to give
>> preference to successful jobs, or to failed jobs.  With this kind of
>> strategy, the problems we are facing with throttling when there is a
>> large # of short failures wouldn't be happening!  Do you see any
>> drawbacks to this approach?
>>     
>
> None that are obvious. It's in fact a good thing if the goal is
> performance, since it takes execution time into account. I've had manual
> "punishments" for connection time-outs because they take a long time to
> happen. But this time biasing naturally integrates that kind of stuff.
> So thanks.
>
>   
>>>> that is the whole point here... 
>>>>     
>>>>         
>>> This point comes because you KNOW how things work internally. All Swift
>>> sees is 10K failed jobs out of 29K.
>>>
>>>   
>>>       
>>>> Anyway, I think this is a valid case that we need to discuss how to
>>>> handle, to make the entire Swift+Falkon stack more robust!
>>>>
>>>> BTW, here is another experiment with MolDyn that shows the throttling
>>>> and this heuristic behaving as I would expect!
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
>>>>
>>>> Notice that the queue length (blue line) at around 11K seconds dropped
>>>> sharply, but then grew back up.  That sudden drop was many jobs
>>>> failing fast on a bad node, and the sudden growth back up was Swift
>>>> re-submitting almost the same # of jobs that had failed back to Falkon.
>>>>     
>>>>         
>>> That failing many jobs fast behavior is not right, regardless of whether
>>> Swift can deal with it or not. 
>>>       
>> If it's a machine error, then it would be best not to fail many jobs
>> fast...
>> however, if it's an app error, you want to fail the tasks as fast as
>> possible so that the entire workflow fails faster,
>>     
>
> But you can't distinguish between the two. The best you can do is assume
> that the failure is a linear combination between broken application and
> broken node. If it's broken node, rescheduling would do (which does not
> happen in your case: jobs keep being sent to the worker that is not
> busy, and that's the broken one). If it's a broken application, then the
> way to distinguish it from the other one is that after a bunch of
> retries on different nodes, it still fails. Notice that different nodes
> is essential here.
>   
Right, I could try to keep track of statistics on each node, and when 
failures happen, try to determine whether it's a system-wide failure (all 
nodes reporting errors) or whether the failures are isolated to a single 
node (or a small set of nodes)...  I'll have to think about how to do 
this efficiently!
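
For example, something like this is the kind of per-node bookkeeping I am 
thinking of (a rough sketch; the class and the thresholds are made up for 
illustration):

    import java.util.HashMap;
    import java.util.Map;

    // sketch: per-node failure statistics to tell a broken node from a broken app
    public class NodeFailureTracker {
        static class Stats { int failed; int total; }
        private final Map<String, Stats> perNode = new HashMap<String, Stats>();

        void record(String nodeId, boolean success) {
            Stats s = perNode.get(nodeId);
            if (s == null) { s = new Stats(); perNode.put(nodeId, s); }
            s.total++;
            if (!success) s.failed++;
        }

        // if most nodes are failing, it smells like an application (or site-wide)
        // problem; if only a few are, suspect those nodes and stop dispatching to them
        boolean looksLikeAppFailure(double nodeFailureRate, double fractionOfNodes) {
            int failingNodes = 0;
            for (Stats s : perNode.values())
                if (s.total > 0 && (double) s.failed / s.total > nodeFailureRate)
                    failingNodes++;
            return !perNode.isEmpty()
                && (double) failingNodes / perNode.size() > fractionOfNodes;
        }

        public static void main(String[] args) {
            NodeFailureTracker t = new NodeFailureTracker();
            for (int i = 0; i < 100; i++) t.record("node-007", false);      // one bad node
            for (int i = 0; i < 900; i++) t.record("node-" + (i % 9), true); // healthy nodes
            System.out.println(t.looksLikeAppFailure(0.5, 0.5));  // false: isolated node failure
        }
    }
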
>   
>>  so the app can be fixed and the workflow retried!  For example, say
>> you had 1000 tasks (all independent) and a wrong path set to the
>> app... with the current Falkon behaviour, the entire workflow would
>> likely fail within some 10~20 seconds of submitting the first task!
>> However, if Falkon does some "smart" throttling when it sees failures,
>> it's going to take time proportional to the number of failures to fail
>> the workflow!
>>     
>
> You're missing the part where all nodes fail the jobs equally, thus not
> creating the inequality we're talking about (the ones where broken nodes
> get higher chances of getting more jobs).
>   
Right, maybe we can use this to distinguish between node failure and app 
failure!
>   
>>   Essentially, I am not a big fan of throttling task dispatch due to
>> failed executions, unless we know why these tasks failed!
>>     
>
> Stop putting exclamation marks after every sentence. It diminishes the
> meaning of it!
>   
So you are going from playing with words to picking on my exclamation marks! :)
> Well, you can't know why these tasks failed. That's the whole problem.
> You're dealing with incomplete information and you have to devise
> heuristics that get things done efficiently.
>   
But Swift might know why it failed; it has a bunch of STDOUT/STDERR that 
it always captures!  Falkon might capture the same output, but there it's 
optional ;(  Could these outputs not be parsed for certain well-known 
errors, with different exit codes to mean different kinds of errors?
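
Something along these lines, for instance (just a sketch; the patterns 
and the exit codes below are invented for illustration):

    // sketch: classify captured STDERR into coarse exit-code categories so that
    // Falkon could throttle differently for app errors vs. node/system errors
    public class ErrorClassifier {
        static final int EXIT_APP_ERROR = 70;   // e.g. wrong path to the application
        static final int EXIT_NODE_ERROR = 71;  // e.g. local disk / NFS trouble
        static final int EXIT_UNKNOWN = 1;

        static int classify(String stderr) {
            String s = stderr.toLowerCase();
            if (s.contains("no such file or directory") || s.contains("command not found"))
                return EXIT_APP_ERROR;
            if (s.contains("input/output error") || s.contains("stale nfs file handle"))
                return EXIT_NODE_ERROR;
            return EXIT_UNKNOWN;
        }

        public static void main(String[] args) {
            System.out.println(classify("/bin/sh: /path/app: No such file or directory"));  // 70
        }
    }
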
>   
>>   Exit codes are not usually enough in general, unless we define our
>> own and the app and wrapper scripts generate these particular exit
>> codes that Falkon can intercept and interpret reliably!
>>     
>
> That would be an improvement, but probably not a universally valid
> assumption. So I wouldn't design with only that in mind.
>   
But it would be an improvement over what we currently have...
>   
>>> Frankly I'd rather Swift not be the part
>>> to deal with it because it has to resort to heuristics, whereas Falkon
>>> has direct knowledge of which nodes do what.
>>>   
>>>       
>> That's fine, but I don't think Falkon can do it alone; it needs
>> context and a definition of failure, which I believe only the
>> application and Swift can provide for certain!
>>     
>
> Nope, they can't. Swift does not meddle with semantics of applications.
> They're all equally valuable functions.
>
> Now, there's stuff you can do to improve things, I'm guessing. You can
> choose not to, and then we can keep having this discussion. There might
> be stuff Swift can do, but it's not insight into applications, so you'll
> have to ask for something else.
>   
Any suggestions?

Ioan
> Mihael
>
>   
>> Ioan
>>
>>     
>
>
>   