[Swift-devel] Re: 244 MolDyn run was successful!

Mon Aug 13 23:31:18 CDT 2007

On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:
> > >     
> > 
> > small != not at all
> >   
> Check out these two graphs, showing the # of active tasks within
> Falkon!  Active tasks = queued+pending+active+done_and_not_delivered.
> 
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg
> 
> Notice that after 3600 some seconds (after all the jobs that failed
> had failed), the # of active tasks in Falkon oscillates between 100
> and 101 active tasks!  The numbers presented on these graphs are taken from
> the median value per minute (the raw samples were 60 samples per
> minute).  Notice that only at the very end of the experiment, at 30K+
> seconds, the # of active tasks increases to a max of 109 for a brief
> period of time before it drops towards 0 as the workflow completes.  I
> did notice that towards the end of the workflow, the jobs were
> typically shorter, and perhaps that somehow influenced the # of active
> tasks within Falkon...  So, when I said not at all, I was referring to
> this flat line 100~101 active tasks that is shown in these figures!

Then say "it appears (from x and y) that the number of concurrent jobs
does not increase by an observable amount". This is not the same as "the
score does not increase at all".

> > > So you are saying that 19K+ successful jobs was not enough to
> > > counteract the 10K+ failed jobs from the early part of the
> > > experiment? 
> > >     
> > 
> > Yep. 19*1/5 = 3.8 < 10.
> > 
> >   
> > > Can this ratio (1:5) be changed?
> > >     
> > 
> > Yes. The scheduler has two relevant properties: successFactor (currently
> > 0.1) and failureFactor (currently -0.5). The term "factor" is not used
> > formally, since these get added to the current score.
> > 
> >   
> > > From this experiment, it would seem that the heuristic is a slow
> > > learner... maybe you have ideas on how to make it quicker to adapt
> > > to changes?
> > >     
> > 
> > That could perhaps be done.
> > 
> >   
> > > > In the context in which jobs are sent to non-busy workers, the system
> > > > would tend to produce lots of failed jobs if it takes little time
> > > > (compared to the normal run-time of a job) for a bad worker to fail a
> > > > job. This *IS* why the swift scheduler throttles in the beginning: to
> > > > avoid sending a large number of jobs to a resource that is broken.
> > > >   
> > > >       
> > > But not the whole resource is broken... 
> > >     
> > 
> > No, just slightly more than 1/3 of it. At least that's how it appears
> > from the outside.
> >   
> But a failed job should not be given the same weight as a successful
> job, in my opinion.

Nope. I'd punish failures quite harshly. That's because the expected
behavior is for things to work. I would not want a site that fails half
the jobs to be anywhere near keeping a constant score.
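
To put numbers on that, here is a minimal sketch using the two factors
quoted above (0.1 and -0.5). It assumes the score is updated by plain
addition, once per finished job; the function names are made up for
illustration and are not the scheduler's actual code.

```python
# Values quoted in this thread; the per-job additive update is an assumption.
SUCCESS_FACTOR = 0.1
FAILURE_FACTOR = -0.5


def net_score_change(successes, failures):
    """Total change in score after the given numbers of successes and failures."""
    return successes * SUCCESS_FACTOR + failures * FAILURE_FACTOR


def break_even_success_rate():
    """Fraction of jobs that must succeed for the score to stay constant.

    Solves p * SUCCESS_FACTOR + (1 - p) * FAILURE_FACTOR == 0 for p.
    """
    return -FAILURE_FACTOR / (SUCCESS_FACTOR - FAILURE_FACTOR)


# The run discussed here: roughly 19K successes against 10K failures.
print(net_score_change(19_000, 10_000))
print(break_even_success_rate())
```

With these factors a site has to succeed on about five jobs out of six
just to hold its score, so one that fails half its jobs keeps sinking.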

>   For example, it seems to me that you are giving the failed jobs 5
> times more weight than successful jobs, but in reality it should be the
> other way around.  Failed jobs usually will fail quickly (as in the
> case that we have in MolDyn), or they will fail slowly (within the
> lifetime of the resource allocation).  On the other hand, most
> successful jobs will likely take more time to complete than it takes
> for a job to fail (if it fails quickly).   Perhaps instead of 
> > successFactor (currently
> > 0.1) and failureFactor (currently -0.5)
> it should be more like:
> successFactor: +1*(executionTime)
> failureFactor: -1*(failureTime)

That's a very good idea. Biasing score based on run-time (at least when
known). Please note: you should still fix Falkon to not do that thing
it's doing.

> 
> The 1 could of course be changed with some other weight to give
> preference to successful jobs, or to failed jobs.  With this kind of
> strategy, the problems we are facing with throttling when there are
> large # of short failures wouldn't be happening!  Do you see any
> drawbacks to this approach?

None that are obvious. It's in fact a good thing if the goal is
performance, since it takes execution time into account. I've had manual
"punishments" for connection time-outs because they take a long time to
happen. But this time biasing naturally integrates that kind of stuff.
So thanks.
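
A minimal sketch of the time-biased score as proposed above, assuming
the run-time is known in seconds and the weight is a plain multiplier.
The function name, the default weights and the durations are all made
up for illustration; none of this is the scheduler's code.

```python
def time_weighted_delta(succeeded, duration_s,
                        success_weight=1.0, failure_weight=1.0):
    """Score change for one job, scaled by how long the job ran (seconds)."""
    if succeeded:
        return success_weight * duration_s
    return -failure_weight * duration_s


# Hypothetical durations: successes run for 100 s, failures die after 1 s.
jobs = [(True, 100.0)] * 19 + [(False, 1.0)] * 10
score = sum(time_weighted_delta(ok, d) for ok, d in jobs)
print(score)

# A slow failure (say a 300 s connection time-out) is punished in
# proportion to the time it wasted, with no special case needed.
print(time_weighted_delta(False, 300.0))
```

Under these made-up durations the short failures barely dent the score,
while a long time-out costs as much as three successful jobs earn.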

> > > that is the whole point here... 
> > >     
> > 
> > This point comes because you KNOW how things work internally. All Swift
> > sees is 10K failed jobs out of 29K.
> > 
> >   
> > > anyways, I think this is a valid case that we need to discuss how to
> > > handle, to make the entire Swift+Falkon more robust!
> > > 
> > > BTW, here is another experiment with MolDyn that shows the throttling
> > > and this heuristic behaving as I would expect!
> > > http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
> > > 
> > > Notice the queue length (blue line) at around 11K seconds dropped
> > > sharply, but then grew back up.  That sudden drop was many jobs
> > > failing fast on a bad node, and the sudden growth back up was Swift
> > > re-submitting almost the same # of jobs that failed back to Falkon.
> > >     
> > 
> > That behavior of failing many jobs fast is not right, regardless of
> > whether Swift can deal with it or not. 
> If it's a machine error, then it would be best to not fail many jobs
> fast...
> however, if it's an app error, you want to fail the tasks as fast as
> possible to fail the entire workflow faster,

But you can't distinguish between the two. The best you can do is assume
that the failure is a linear combination of a broken application and a
broken node. If it's a broken node, rescheduling would do (which does not
happen in your case: jobs keep being sent to the worker that is not
busy, and that's the broken one). If it's a broken application, then the
way to distinguish it from the other one is that after a bunch of
retries on different nodes, it still fails. Notice that different nodes
is essential here.
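
A minimal sketch of that heuristic, assuming failures are recorded per
job together with the node they happened on. The class name, method
names and the threshold of 3 distinct nodes are hypothetical, picked
only for illustration.

```python
from collections import defaultdict


class FailureTracker:
    """Remembers which distinct nodes each job has already failed on."""

    def __init__(self, distinct_node_threshold=3):
        self.threshold = distinct_node_threshold
        self.failed_nodes = defaultdict(set)  # job id -> set of node ids

    def record_failure(self, job_id, node_id):
        self.failed_nodes[job_id].add(node_id)

    def nodes_to_avoid(self, job_id):
        """Nodes a retry of this job should not be sent to."""
        return set(self.failed_nodes.get(job_id, ()))

    def looks_like_broken_app(self, job_id):
        """True once the job has failed on enough *different* nodes."""
        return len(self.failed_nodes.get(job_id, ())) >= self.threshold


tracker = FailureTracker()
# Two failures on the same node say nothing about the application.
tracker.record_failure("job-1", "node-a")
tracker.record_failure("job-1", "node-a")
print(tracker.looks_like_broken_app("job-1"))
# Failures on two more nodes make a broken application the likelier cause.
tracker.record_failure("job-1", "node-b")
tracker.record_failure("job-1", "node-c")
print(tracker.looks_like_broken_app("job-1"))
```

Counting distinct nodes rather than raw failures is the point: retries
that land on the same broken worker add no information.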

>  so the app can be fixed and the workflow retried!  For example, say
> you had 1000 tasks (all independent), and had a wrong path set to the
> app... with the current Falkon behaviour, the entire workflow would
> likely fail within some 10~20 seconds of it submitting the first task!
> However, if Falkon does some "smart" throttling when it sees failures,
> it's going to take time proportional to the failures to fail the
> workflow!

You're missing the part where all nodes fail the jobs equally, thus not
creating the inequality we're talking about (the one where broken nodes
get a higher chance of getting more jobs).
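
A toy simulation of that inequality, assuming a dispatcher that hands
the next job to whichever worker becomes idle first and a queue that
never runs dry. This is not Falkon's actual dispatcher; the worker
names and durations are made up.

```python
import heapq


def jobs_per_worker(job_durations, horizon_s):
    """Count jobs handed to each worker before horizon_s.

    job_durations maps worker name -> seconds that worker holds a job
    (run time for a healthy worker, time-to-fail for a broken one).
    """
    counts = {name: 0 for name in job_durations}
    # Heap of (time the worker becomes idle, name); all are idle at t=0.
    idle_at = [(0.0, name) for name in job_durations]
    heapq.heapify(idle_at)
    while True:
        t, name = heapq.heappop(idle_at)
        if t >= horizon_s:
            break  # earliest idle time is past the horizon, so all are
        counts[name] += 1
        heapq.heappush(idle_at, (t + job_durations[name], name))
    return counts


# One broken worker failing in 1 s next to two healthy 100 s workers.
mixed = jobs_per_worker({"good-1": 100.0, "good-2": 100.0, "bad": 1.0}, 1000.0)
# A broken application: every worker fails every job in 1 s.
all_bad = jobs_per_worker({"w1": 1.0, "w2": 1.0, "w3": 1.0}, 1000.0)
print(mixed)
print(all_bad)
```

In the mixed case the broken worker soaks up nearly all the jobs; when
every worker fails at the same rate, the jobs are spread evenly.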

>   Essentially, I am not a big fan of throttling task dispatch due to
> failed executions, unless we know why these tasks failed!

Stop putting exclamation marks after every sentence. It diminishes the
meaning of it!

Well, you can't know why these tasks failed. That's the whole problem.
You're dealing with incomplete information and you have to devise
heuristics that get things done efficiently.

>   Exit codes are not usually enough in general, unless we define our
> own and the app and wrapper scripts generate these particular exit
> codes that Falkon can intercept and interpret reliably!

That would be an improvement, but probably not a universally valid
assumption. So I wouldn't design with only that in mind.

> > Frankly I'd rather Swift not be the part
> > to deal with it because it has to resort to heuristics, whereas Falkon
> > has direct knowledge of which nodes do what.
> >   
> That's fine, but I don't think Falkon can do it alone, it needs
> context and failure definition, which I believe only the application
> and Swift could say for certain!

Nope, they can't. Swift does not meddle with semantics of applications.
They're all equally valuable functions.

Now, there's stuff you can do to improve things, I'm guessing. You can
choose not to, and then we can keep having this discussion. There might
be stuff Swift can do, but it's not insight into applications, so you'll
have to ask for something else.

Mihael

> 
> Ioan
>