<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

My question is: can you modify the heuristic to take into account the

execution time of tasks when updating the site score? 

I think it is important to use only the execution time (and not Falkon

queue time + execution time + result delivery time). In that case, how

does Falkon pass this information back to Swift?<br>

<br>
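A minimal sketch of what I have in mind, assuming a score that simply adds or subtracts the execution time of each finished task. The function names, the weights, and the update rule itself are my own illustration, not the actual Karajan scoring code:<br>

<pre wrap="">
```python
def update_score(score, succeeded, exec_seconds,
                 success_weight=1.0, failure_weight=1.0):
    """Return the new site score after one task completes.

    Hypothetical rule: weight each outcome by execution time only,
    so a 10 ms failure moves the score far less than a 100 s success.
    """
    if succeeded:
        return score + success_weight * exec_seconds
    return score - failure_weight * exec_seconds


def replay(outcomes, score=0.0):
    """Apply update_score over a list of (succeeded, exec_seconds) pairs."""
    for succeeded, exec_seconds in outcomes:
        score = update_score(score, succeeded, exec_seconds)
    return score


# The MolDyn-like case: 1000 successes of 100 s, 1000 failures of 10 ms.
outcomes = [(True, 100.0)] * 1000 + [(False, 0.01)] * 1000

# Counting tasks only: every success is +1 and every failure is -1.
count_based = sum(1 if ok else -1 for ok, _ in outcomes)

# Weighting by execution time: the short failures barely register.
time_based = replay(outcomes)
```
</pre>

With these numbers the count-based tally nets to zero, while the time-weighted score stays strongly positive (close to 99990), so the site would not be throttled because of the short failures.<br>

<br>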

Ioan<br>

<br>

Mihael Hategan wrote:

<blockquote cite="mid:1188238079.31798.25.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">On Mon, 27 Aug 2007, Ioan Raicu wrote:

    </pre>

    <blockquote type="cite">

      <pre wrap="">On a similar note, IMO, the heuristic in Karajan should be modified to take

into account the task execution time of the failed or successful task, and not

just the number of tasks.  This would ensure that Swift is not throttling task

submission to Falkon when there are 1000s of successful tasks that take on the

order of 100s of seconds to complete, yet there are also 1000s of failed tasks

that are only 10 ms long.  This is exactly the case with MolDyn, when we get a

bad node in a bunch of 100s of nodes, which ends up throttling the number of

active and running tasks to about 100, regardless of the number of processors

Falkon has. 

      </pre>

    </blockquote>

    <pre wrap="">Is that different from when submitting to PBS or GRAM where there are 

1000s of successful tasks taking 100s of seconds to complete but with 

1000s of failed tasks that are only 10ms long?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

In your scenario, assuming that GRAM and PBS do work (since some jobs

succeed), then you can't really submit that fast. So the same thing

would happen, but slower. Unfortunately, in the PBS case, there's not

much that can be done but to throttle until no more jobs than good nodes

are being run at one time.

Now, there is the probing part, which makes the system start with a

lower throttle which increases until problems appear. If this is

disabled (as it was in the MolDyn run), large numbers of parallel jobs

will be submitted causing a large number of failures.

So this whole thing is close to a linear system with negative feedback.

If the initial state is very far away from stability, there will be

large transients. You're more than welcome to study how to make it

converge faster, or how to guess the initial state better (knowing the

number of nodes a cluster has would be a step).

  </pre>


</blockquote>
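To make the negative-feedback picture above concrete, here is a toy sketch. The linear update rule, the gain of 0.5, and the function name are assumptions for illustration only, not how Karajan actually adjusts its throttle:<br>

<pre wrap="">
```python
def throttle_trajectory(initial_throttle, good_nodes, gain=0.5, steps=10):
    """Return the throttle value after each feedback step.

    Hypothetical linear model: every step corrects a fixed fraction
    (gain) of the gap between the throttle and the stable point,
    which is taken here to be the number of good nodes.
    """
    throttle = float(initial_throttle)
    history = [throttle]
    for _ in range(steps):
        error = good_nodes - throttle        # negative when over-submitting
        throttle = throttle + gain * error   # feedback pulls toward good_nodes
        history.append(throttle)
    return history


# Probing enabled: start close to the stable point of 100 good nodes.
near = throttle_trajectory(initial_throttle=90, good_nodes=100)

# Probing disabled: start far above it, as with a very large initial throttle.
far = throttle_trajectory(initial_throttle=2000, good_nodes=100)

# Distance from the stable point at every step of each run.
near_gap = [abs(t - 100) for t in near]
far_gap = [abs(t - 100) for t in far]
```
</pre>

In this model both runs converge at the same rate, but the run that starts far from the stable point carries a much larger gap at every step, which is the large transient described above.<br>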

<br>

<pre class="moz-signature" cols="72">-- 

============================================

Ioan Raicu

Ph.D. Student

============================================

Distributed Systems Laboratory

Computer Science Department

University of Chicago

1100 E. 58th Street, Ryerson Hall

Chicago, IL 60637

============================================

Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>

Web:   <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>

       <a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>

============================================</pre>

</body>

</html>