<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
The question I am interested in is this: can you modify the heuristic to
take into account the execution time of tasks when updating the site score?
I think it is important to use only the execution time (and not Falkon
queue time + execution time + result delivery time). In that case, how
does Falkon pass this information back to Swift?<br>
<br>
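To make the question concrete, here is a minimal sketch of what I mean by
weighting the score update by execution time. The class, the method names
and both factors are made up for illustration; this is not the actual
Karajan scheduler code, and it assumes Falkon can report the pure execution
time of each task.<br>
<pre wrap="">
```python
from dataclasses import dataclass


@dataclass
class SiteScore:
    """Hypothetical site score; not Karajan's actual data structure."""
    score: float = 0.0
    success_factor: float = 0.1  # assumed gain per second of successful execution
    failure_factor: float = 0.5  # assumed loss per second of failed execution

    def update(self, succeeded, exec_seconds):
        # Weight the update by execution time only
        # (no queue time, no result delivery time).
        if succeeded:
            self.score += self.success_factor * exec_seconds
        else:
            self.score -= self.failure_factor * exec_seconds
        return self.score


def replay(site, successes, failures, ok_seconds, bad_seconds):
    """Feed a batch of successful and failed tasks into the score."""
    for _ in range(successes):
        site.update(True, ok_seconds)
    for _ in range(failures):
        site.update(False, bad_seconds)
    return site.score
```
</pre>
With these made-up factors, replay(SiteScore(), 1000, 1000, 100.0, 0.01)
ends up at about 9995, so the site stays in good standing. Counting every
task equally with the same factors would give 0.1*1000 - 0.5*1000 = -400,
which is the throttling behaviour I am seeing.<br>
<br>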
Ioan<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1188238079.31798.25.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Mon, 27 Aug 2007, Ioan Raicu wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On a similar note, IMO, the heuristic in Karajan should be modified to take
into account the task execution time of the failed or successful task, and not
just the number of tasks. This would ensure that Swift is not throttling task
submission to Falkon when there are 1000s of successful tasks that take on the
order of 100s of seconds to complete, yet there are also 1000s of failed tasks
that are only 10 ms long. This is exactly the case with MolDyn, when we get a
bad node in a bunch of 100s of nodes, which ends up throttling the number of
active and running tasks to about 100, regardless of the number of processors
Falkon has.
</pre>
</blockquote>
<pre wrap="">Is that different from when submitting to PBS or GRAM where there are
1000s of successful tasks taking 100s of seconds to complete but with
1000s of failed tasks that are only 10ms long?
</pre>
</blockquote>
<pre wrap=""><!---->
In your scenario, assuming that GRAM and PBS do work (since some jobs
succeed), then you can't really submit that fast. So the same thing
would happen, but slower. Unfortunately, in the PBS case, there's not
much that can be done but to throttle until no more jobs than good nodes
are being run at one time.
Now, there is the probing part, which makes the system start with a
lower throttle which increases until problems appear. If this is
disabled (as it was in the MolDyn run), large numbers of parallel jobs
will be submitted causing a large number of failures.
So this whole thing is close to a linear system with negative feedback.
If the initial state is very far away from stability, there will be
large transients. You're more than welcome to study how to make it
converge faster, or how to guess the initial state better (knowing the
number of nodes a cluster has would be a step).
</pre>
</blockquote>
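For reference, the feedback loop Mihael describes above can be sketched
roughly as follows. This is a toy model under my own assumptions: the
additive-increase / halving rule, the parameter names and the "every job
beyond the number of good nodes fails" model are all invented for
illustration and are not necessarily what Karajan does.<br>
<pre wrap="">
```python
def run_throttle(good_nodes, initial_limit, rounds, step=1.0, backoff=0.5):
    """Toy negative-feedback throttle; all parameters are hypothetical.

    Each round submits int(limit) jobs. Jobs beyond good_nodes are
    assumed to land on bad nodes and fail.
    Returns a list of (submitted, failures) pairs, one per round.
    """
    limit = float(initial_limit)
    history = []
    for _ in range(rounds):
        submitted = int(limit)
        failures = max(0, submitted - good_nodes)
        history.append((submitted, failures))
        if failures == 0:
            # Probing: raise the throttle while everything succeeds.
            limit += step
        else:
            # Negative feedback: back off once failures appear.
            limit = max(1.0, limit * backoff)
    return history
```
</pre>
Starting low, e.g. run_throttle(100, 2, 50), the model sees no failures
while it probes upward. Starting far from the stable point, e.g.
run_throttle(100, 1000, 10), the first round alone has 900 failures, which
is the large transient mentioned above.<br>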
<br>
<pre class="moz-signature" cols="72">--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>
============================================
============================================</pre>
</body>
</html>