[Swift-devel] Re: 244 MolDyn run was successful!
Mihael Hategan
hategan at mcs.anl.gov
Mon Aug 27 13:07:59 CDT 2007
On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>
> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>
> > On a similar note, IMO, the heuristic in Karajan should be modified to take
> > into account the task execution time of the failed or successful task, and not
> > just the number of tasks. This would ensure that Swift is not throttling task
> > submission to Falkon when there are 1000s of successful tasks that take on the
> > order of 100s of seconds to complete, yet there are also 1000s of failed tasks
> > that are only 10 ms long. This is exactly the case with MolDyn, when we get a
> > bad node in a bunch of 100s of nodes, which ends up throttling the number of
> > active and running tasks to about 100, regardless of the number of processors
> > Falkon has.
>
> Is that different from when submitting to PBS or GRAM where there are
> 1000s of successful tasks taking 100s of seconds to complete but with
> 1000s of failed tasks that are only 10ms long?
In your scenario, assuming that GRAM and PBS do work (since some jobs
succeed), then you can't really submit that fast. So the same thing
would happen, but slower. Unfortunately, in the PBS case, there's not
much that can be done except to throttle until the number of jobs running
at one time no longer exceeds the number of good nodes.
Now, there is the probing part, which makes the system start with a
lower throttle that increases until problems appear. If this is
disabled (as it was in the MolDyn run), large numbers of parallel jobs
will be submitted, causing a large number of failures.
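
To make the probing idea concrete, here is a rough sketch of a throttle
that starts low, grows while jobs succeed, and backs off when they fail.
This is not the actual Karajan scheduler code. The class name, the
additive-increase/multiplicative-decrease rule, and all parameter values
are assumptions chosen for illustration only.

```python
class ProbingThrottle:
    """Hypothetical throttle: starts low, grows on success, shrinks on failure.

    Not the Karajan implementation; the update rule and defaults are assumed.
    """

    def __init__(self, initial=2, maximum=1000, increase=1, decrease_factor=0.5):
        self.limit = float(initial)             # current cap on concurrent jobs
        self.maximum = maximum                  # hard upper bound on the cap
        self.increase = increase                # added to the cap per success
        self.decrease_factor = decrease_factor  # cap multiplier per failure

    def on_success(self):
        # Probe upward: each successful job raises the cap a little.
        self.limit = min(self.maximum, self.limit + self.increase)

    def on_failure(self):
        # Back off: each failed job cuts the cap, never below one job.
        self.limit = max(1.0, self.limit * self.decrease_factor)

    def allowed(self):
        # Number of jobs that may be in flight right now.
        return int(self.limit)


throttle = ProbingThrottle(initial=2)
for _ in range(10):
    throttle.on_success()
throttle.on_failure()
print(throttle.allowed())
```

Disabling the probing corresponds roughly to setting `initial` equal to
`maximum`, so the first failures arrive only after a large batch has
already been submitted.
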
So this whole thing is close to a linear system with negative feedback.
If the initial state is very far away from stability, there will be
large transients. You're more than welcome to study how to make it
converge faster, or how to guess the initial state better (knowing the
number of nodes a cluster has would be a step).
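
As a toy illustration of the transient argument, the sketch below models
the throttle as a first-order linear system with negative feedback: each
step moves the throttle toward the number of good nodes by a fixed
fraction of the current error. The function name, the gain, and the
tolerance are assumptions, and the real system is only approximately
linear.

```python
def steps_to_settle(initial, good_nodes, gain=0.5, tolerance=1.0):
    """Count feedback steps until the throttle is within `tolerance` of
    `good_nodes`. Hypothetical model: x <- x - gain * (x - good_nodes)."""
    x = float(initial)
    steps = 0
    while abs(x - good_nodes) > tolerance:
        x -= gain * (x - good_nodes)  # negative feedback on the error
        steps += 1
    return steps


# Starting far from the stable point gives a longer transient than
# starting from a better initial guess.
far = steps_to_settle(initial=1000, good_nodes=100)
near = steps_to_settle(initial=128, good_nodes=100)
print(far, near)
```

In this model the error shrinks geometrically, so the settling time grows
with the logarithm of the initial distance. A better initial guess, for
example one based on the known node count, shortens the transient
directly.
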