[Swift-devel] Re: 244 MolDyn run was successful!

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 13 23:07:20 CDT 2007



Mihael Hategan wrote:
> On Mon, 2007-08-13 at 15:17 -0500, Ioan Raicu wrote:
>   
>> Mihael Hategan wrote: 
>>     
>
>>>
>> Look at
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg!
>> Do you see the # of active (green) workers as a relatively flat line
>> at around 100 (and this is with the wait queue length being 0, so
>> Swift was simply not sending enough work to keep Falkon's 200+ workers
>> busy)?  If the score had improved, then I would have expected
>> an upward trend on the number of active workers!
>>     
>
> small != not at all
>   
Check out these two graphs, showing the # of active tasks within 
Falkon!  Active tasks = queued+pending+active+done_and_not_delivered.

http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg

Notice that after some 3600 seconds (after all the jobs that were going to 
fail had failed), the # of active tasks in Falkon oscillates between 100 and 
101!  The numbers presented in these graphs are the median values per minute 
(the raw data was 60 samples per minute).  Notice that only at the very end 
of the experiment, at 30K+ seconds, the # of active tasks increases to a max 
of 109 for a brief period before dropping towards 0 as the workflow 
completes.  I did notice that towards the end of the workflow the jobs were 
typically shorter, and perhaps that somehow influenced the # of active tasks 
within Falkon...  So, when I said "not at all", I was referring to this flat 
line of 100~101 active tasks shown in these figures!
>   
>>>       
>> So you are saying that 19K+ successful jobs was not enough to
>> counteract the 10K+ failed jobs from the early part of the
>> experiment? 
>>     
>
> Yep. 19*1/5 = 3.8 < 10.
>
>   
>>  Can this ratio (1:5) be changed?
>>     
>
> Yes. The scheduler has two relevant properties: successFactor (currently
> 0.1) and failureFactor (currently -0.5). The term "factor" is not used
> formally, since these get added to the current score.
>
>   
>>   From this experiment, it would seem that the heuristic is a slow
>> learner... maybe you have ideas on how to make it quicker to adapt
>> to changes?
>>     
>
> That could perhaps be done.
>
>   
>>> In the context in which jobs are sent to non-busy workers, the system
>>> would tend to produce lots of failed jobs if it takes little time
>>> (compared to the normal run-time of a job) for a bad worker to fail a
>>> job. This *IS* why the swift scheduler throttles in the beginning: to
>>> avoid sending a large number of jobs to a resource that is broken.
>>>   
>>>       
>> But not the whole resource is broken... 
>>     
>
> No, just slightly more than 1/3 of it. At least that's how it appears
> from the outside.
>   
But a failed job should not be given the same weight as a successful job, 
in my opinion.  It seems to me that you are giving failed jobs 5 times 
more weight than successful jobs, but in reality it should be the other 
way around.  Failed jobs will usually either fail quickly (as in the case 
we have in MolDyn) or fail slowly (within the lifetime of the resource 
allocation).  On the other hand, most successful jobs will likely take 
more time to complete than it takes for a job to fail (if it fails 
quickly).  Perhaps instead of
> successFactor (currently
> 0.1) and failureFactor (currently -0.5)
it should be more like:
successFactor: +1*(executionTime)
failureFactor: -1*(failureTime)

The 1 could of course be replaced with some other weight to give 
preference to successful jobs, or to failed jobs.  With this kind of 
strategy, the throttling problems we are facing when there are large 
#s of short failures would not happen!  Do you see any drawbacks to 
this approach?
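
To make the proposal concrete, here is a rough Java sketch (Java only 
because that is what Swift and Falkon are written in) comparing the 
current fixed-increment score update with the time-weighted one; the 
class and method names are made up for illustration, this is not the 
actual scheduler code:

// A minimal sketch (NOT the actual Swift scheduler code) comparing the
// current fixed score increments with the time-weighted increments
// proposed above.  All class and method names here are made up.
public class SiteScoreSketch {

    // current values from the discussion above
    static final double SUCCESS_FACTOR = 0.1;
    static final double FAILURE_FACTOR = -0.5;

    private double score = 0.0;

    // current behavior: every job moves the score by a fixed amount,
    // so 10K fast failures easily outweigh 19K successes
    public void updateFixed(boolean success) {
        score += success ? SUCCESS_FACTOR : FAILURE_FACTOR;
    }

    // proposed behavior: weight each job by how long it ran (or how long
    // it took to fail), so thousands of jobs that die in a few seconds
    // cannot drown out successes that ran for many minutes
    public void updateTimeWeighted(boolean success, double seconds) {
        final double weight = 1.0; // tunable to favor successes or failures
        score += (success ? 1.0 : -1.0) * weight * seconds;
    }

    public double getScore() {
        return score;
    }
}

With the time-weighted version, the 10K jobs that failed within seconds 
would pull the score down far less than the 19K successes that ran for 
minutes would pull it up.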
>   
>> that is the whole point here... 
>>     
>
> This point comes because you KNOW how things work internally. All Swift
> sees is 10K failed jobs out of 29K.
>
>   
>> anyways, I think this is a valid case that we need to discuss how to
>> handle, to make the entire Swift+Falkon more robust!
>>
>> BTW, here is another experiment with MolDyn that shows the throttling
>> and this heuristic behaving as I would have expected!
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
>>
>> Notice the queue length (blue line) at around 11K seconds dropped
>> sharply, but then grew back up.  That sudden drop was many jobs
>> failing fast on a bad node, and the sudden growth back up was Swift
>> re-submitting almost the same # of jobs that failed back to Falkon.
>>     
>
> That failing many jobs fast behavior is not right, regardless of whether
> Swift can deal with it or not. 
If it's a machine error, then it would be best not to fail many jobs fast...
however, if it's an app error, you want to fail the tasks as fast as 
possible so the entire workflow fails faster, the app can be fixed, and 
the workflow retried!  For example, say you had 1000 independent tasks 
and a wrong path set for the app... with the current Falkon behaviour, 
the entire workflow would likely fail within some 10~20 seconds of 
submitting the first task!  However, if Falkon does some "smart" 
throttling when it sees failures, it is going to take time proportional 
to the number of failures to fail the workflow!  Essentially, I am not a 
big fan of throttling task dispatch due to failed executions, unless we 
know why those tasks failed!  Exit codes are generally not enough, 
unless we define our own and the app and wrapper scripts generate these 
particular exit codes that Falkon can intercept and interpret reliably!
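
For example, here is a rough sketch of the kind of convention I have in 
mind (the specific codes, names, and classes are purely hypothetical; 
nothing like this exists in the wrapper scripts or in Falkon today):

// A hypothetical sketch of an agreed-upon exit-code convention; none of
// these codes, names, or classes exist in the wrapper or in Falkon today.
public final class FailureClassifierSketch {

    // codes the app/wrapper scripts would agree to emit
    static final int APP_ERROR     = 70; // e.g. wrong path, bad arguments
    static final int MACHINE_ERROR = 71; // e.g. node-local problem

    enum Action { RETRY_ELSEWHERE, FAIL_WORKFLOW_FAST, SUSPEND_WORKER }

    // map a non-zero exit code to a dispatch decision Falkon could act on
    static Action classify(int exitCode) {
        switch (exitCode) {
            case APP_ERROR:     return Action.FAIL_WORKFLOW_FAST; // app is broken everywhere
            case MACHINE_ERROR: return Action.SUSPEND_WORKER;     // stop using this node
            default:            return Action.RETRY_ELSEWHERE;    // cause unknown
        }
    }
}

The wrapper scripts would be the ones emitting the codes; Falkon would 
only map them to a dispatch decision, and Swift would still decide what 
counts as a workflow failure.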
> Frankly I'd rather Swift not be the part
> to deal with it because it has to resort to heuristics, whereas Falkon
> has direct knowledge of which nodes do what.
>   
That's fine, but I don't think Falkon can do it alone; it needs context 
and a definition of failure, which I believe only the application and 
Swift can provide!

Ioan
