<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; ">OK. I looked at the output and it looks like 14 molecules have still failed. They all failed due to hardware problems -- I saw nothing application-specific in the application logs, all very consistent with the stale NFS handle that Ioan reported seeing.<DIV>It would be great to be able to stop submitting jobs to 'bad' nodes during the run (long term), or to increase the number of retries in Swift (short term) to enable the whole workflow to go through.</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>Nika</DIV><DIV><BR><DIV><DIV>On Aug 13, 2007, at 11:52 PM, Ioan Raicu wrote:</DIV><BR class="Apple-interchange-newline"><BLOCKQUOTE type="cite">  <BR> <BR> Mihael Hategan wrote: <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">On Mon, 2007-08-13 at 23:07 -0500, Ioan Raicu wrote:

  </PRE>  <BLOCKQUOTE type="cite">    <BLOCKQUOTE type="cite">      <BLOCKQUOTE type="cite">        <PRE wrap="">            </PRE>      </BLOCKQUOTE>      <PRE wrap="">small != not at all

      </PRE>    </BLOCKQUOTE>    <PRE wrap="">Check out these two graphs, showing the # of active tasks within

Falkon!  Active tasks = queued+pending+active+done_and_not_delivered.

<A class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks.jpg</A>

<A class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/number-of-active-tasks-zoom.jpg</A>

Notice that after some 3600 seconds (after all the jobs that failed

had failed), the # of active tasks in Falkon oscillates between 100

and 101 active tasks!  The #s presented in these graphs are taken from

the median value per minute (the raw samples were 60 samples per

minute).  Notice that only at the very end of the experiment, at 30K+

seconds, the # of active tasks increases to a max of 109 for a brief

period of time before it drops towards 0 as the workflow completes.  I

did notice that towards the end of the workflow, the jobs were

typically shorter, and perhaps that somehow influenced the # of active

tasks within Falkon...  So, when I said not at all, I was referring to

this flat line 100~101 active tasks that is shown in these figures!

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">Then say "it appears (from x and y) that the number of concurrent jobs

does not increase by an observable amount". This is not the same as "the

score does not increase at all".

  </PRE> </BLOCKQUOTE> You are playing with words here... the bottom line is that after 19K+ jobs and several hours of successful jobs, there was no indication that the heuristic was adapting to the new conditions, in which no jobs were failing!<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <BLOCKQUOTE type="cite">      <BLOCKQUOTE type="cite">        <PRE wrap="">So you are saying that 19K+ successful jobs was not enough to

counteract the 10K+ failed jobs from the early part of the

experiment? 

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">Yep. 19*1/5 = 3.8 < 10.

      </PRE>      <BLOCKQUOTE type="cite">        <PRE wrap="">Can this ratio (1:5) be changed?

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">Yes. The scheduler has two relevant properties: successFactor (currently

0.1) and failureFactor (currently -0.5). The term "factor" is not used

formally, since these get added to the current score.
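
[Editorial sketch] The arithmetic above can be checked with a few lines of
Python. This is only an illustration of an additive score using the two
values quoted in this thread; the function names are hypothetical and this
is not Swift's actual scheduler code.

```python
# Hypothetical sketch of an additive site score; not Swift's actual scheduler code.
SUCCESS_FACTOR = 0.1   # added to the score per successful job (value quoted in the thread)
FAILURE_FACTOR = -0.5  # added to the score per failed job (value quoted in the thread)


def score_after(successes, failures, start=0.0):
    """Return the score after adding one factor per job outcome."""
    return start + successes * SUCCESS_FACTOR + failures * FAILURE_FACTOR


def successes_to_offset(failures):
    """Number of successful jobs needed to cancel out the given failures."""
    return failures * abs(FAILURE_FACTOR) / SUCCESS_FACTOR


# Roughly the job counts discussed in the thread: 19K successes, 10K failures.
net = score_after(successes=19_000, failures=10_000)
needed = successes_to_offset(10_000)
print(net, needed)
```

With these factors the net score stays well below its starting value, and
about five successes are needed to offset each failure, which matches the
1:5 ratio mentioned above.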

      </PRE>      <BLOCKQUOTE type="cite">        <PRE wrap="">From this experiment, it would seem that the heuristic is a slow

learner... maybe you have ideas on how to make it quicker to adapt

to changes?

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">That could perhaps be done.

      </PRE>      <BLOCKQUOTE type="cite">        <BLOCKQUOTE type="cite">          <PRE wrap="">In the context in which jobs are sent to non-busy workers, the system

would tend to produce lots of failed jobs if it takes little time

(compared to the normal run-time of a job) for a bad worker to fail a

job. This *IS* why the swift scheduler throttles in the beginning: to

avoid sending a large number of jobs to a resource that is broken.

          </PRE>        </BLOCKQUOTE>        <PRE wrap="">But not the whole resource is broken... 

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">No, just slightly more than 1/3 of it. At least that's how it appears

from the outside.

      </PRE>    </BLOCKQUOTE>    <PRE wrap="">But a failed job should not be given the same weight as a successful

job, in my opinion.

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">Nope. I'd punish failures quite harshly. That's because the expected

behavior is for things to work. I would not want a site that fails half

the jobs to be anywhere near keeping a constant score.

  </PRE> </BLOCKQUOTE> That is fine, but you have a case (such as this one) in which this is not ideal... how do you propose we adapt to cover this corner case?  <BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap="">  For example, it seems to me that you are giving the failed jobs 5

times more weight than successful jobs, but in reality it should be the

other way around.  Failed jobs usually will fail quickly (as in the

case that we have in MolDyn), or they will fail slowly (within the

lifetime of the resource allocation).  On the other hand, most

successful jobs will likely take more time to complete than it takes

for a job to fail (if it fails quickly).   Perhaps instead of 

    </PRE>    <BLOCKQUOTE type="cite">      <PRE wrap="">successFactor (currently

0.1) and failureFactor (currently -0.5)

      </PRE>    </BLOCKQUOTE>    <PRE wrap="">it should be more like:

successFactor: +1*(executionTime)

failureFactor: -1*(failureTime)
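
[Editorial sketch] A minimal Python illustration of the time-weighted idea
proposed here. The function names, the weight parameter, and the sample
workload are all made up for illustration; this is not code from Swift or
Falkon.

```python
# Hypothetical sketch of the time-weighted proposal; names and numbers are illustrative.
def score_delta(succeeded, seconds, weight=1.0):
    """Score change for one job, scaled by how long the job ran before finishing."""
    sign = 1.0 if succeeded else -1.0
    return sign * weight * seconds


def total_score(jobs):
    """Sum the score changes over (succeeded, seconds) pairs."""
    return sum(score_delta(ok, secs) for ok, secs in jobs)


# Made-up workload: 10 failures of 2 s each, then 3 successes of 600 s each.
jobs = [(False, 2.0)] * 10 + [(True, 600.0)] * 3
print(total_score(jobs))
```

Under this weighting, many fast failures cost little compared with what a
few long successful runs earn, which is the effect described above.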

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">That's a very good idea. Biasing score based on run-time (at least when

known). Please note: you should still fix Falkon to not do that thing

it's doing.

  </PRE> </BLOCKQUOTE> It's not clear to me this should be done all the time; Falkon needs to know why the failure happened to decide whether to throttle!<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap="">The 1 could of course be replaced with some other weight to give

preference to successful jobs, or to failed jobs.  With this kind of

strategy, the problems we are facing with throttling when there are

a large # of short failures wouldn't be happening!  Do you see any

drawbacks to this approach?

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">None that are obvious. It's in fact a good thing if the goal is

performance, since it takes execution time into account. I've had manual

"punishments" for connection time-outs because they take a long time to

happen. But this time biasing naturally integrates that kind of stuff.

So thanks.

  </PRE>  <BLOCKQUOTE type="cite">    <BLOCKQUOTE type="cite">      <BLOCKQUOTE type="cite">        <PRE wrap="">that is the whole point here... 

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">This point comes because you KNOW how things work internally. All Swift

sees is 10K failed jobs out of 29K.

      </PRE>      <BLOCKQUOTE type="cite">        <PRE wrap="">anyways, I think this is a valid case that we need to discuss how to

handle, to make the entire Swift+Falkon more robust!

BTW, here is another experiment with MolDyn that shows the throttling

and this heuristic behaving as I would expect!

<A class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg</A>

Notice the queue length (blue line) at around 11K seconds dropped

sharply, but then grew back up.  That sudden drop was many jobs

failing fast on a bad node, and the sudden growth back up was Swift

re-submitting almost the same # of jobs that failed back to Falkon.

        </PRE>      </BLOCKQUOTE>      <PRE wrap="">That failing many jobs fast behavior is not right, regardless of whether

Swift can deal with it or not. 

      </PRE>    </BLOCKQUOTE>    <PRE wrap="">If it's a machine error, then it would be best to not fail many jobs

fast...

however, if it's an app error, you want to fail the tasks as fast as

possible to fail the entire workflow faster,

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">But you can't distinguish between the two. The best you can do is assume

that the failure is a linear combination between broken application and

broken node. If it's broken node, rescheduling would do (which does not

happen in your case: jobs keep being sent to the worker that is not

busy, and that's the broken one). If it's a broken application, then the

way to distinguish it from the other one is that after a bunch of

retries on different nodes, it still fails. Notice that different nodes

is essential here.
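
[Editorial sketch] The "retries on different nodes" rule above could look
roughly like this in Python. The threshold of three distinct nodes, the
node names, and the labels returned are assumptions chosen for
illustration; nothing here is taken from Swift or Falkon.

```python
# Hypothetical sketch; the threshold and names are assumptions, not Swift or Falkon code.
def classify_failure(failed_nodes, min_distinct_nodes=3):
    """Guess the cause of one job's repeated failures from the nodes it failed on."""
    distinct = len(set(failed_nodes))
    if distinct >= min_distinct_nodes:
        # Still failing after retries on different nodes: likely the application.
        return "application"
    # Too few distinct nodes were tried; a single bad node cannot be ruled out.
    return "undetermined"


print(classify_failure(["n1", "n1", "n1"]))  # three tries, all on the same node
print(classify_failure(["n1", "n2", "n3"]))  # three tries on three different nodes
```

Retrying three times on the same node says nothing about the application,
which is why the count is taken over distinct nodes rather than attempts.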

  </PRE> </BLOCKQUOTE> Right, I could try to keep track of statistics on each node, and when failures happen, try to determine whether it's a system-wide failure (all nodes reporting errors) or whether the failures are isolated to a single node (or a small set of nodes)...  I'll have to think about how to do this efficiently!<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap=""> so the app can be fixed and the workflow retried!  For example, say

you had 1000 tasks (all independent), and had a wrong path set to the

app... with the current Falkon behaviour, the entire workflow would

likely fail within some 10~20 seconds of it submitting the first task!

However, if Falkon does some "smart" throttling when it sees failures,

it's going to take time proportional to the failures to fail the

workflow!

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">You're missing the part where all nodes fail the jobs equally, thus not

creating the inequality we're talking about (the one where broken nodes

get higher chances of getting more jobs).

  </PRE> </BLOCKQUOTE> Right, maybe we can use this to distinguish between node failure and app failure!<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap="">  Essentially, I am not a big fan of throttling task dispatch due to

failed executions, unless we know why these tasks failed!

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">Stop putting exclamation marks after every sentence. It diminishes the

meaning of it!

  </PRE> </BLOCKQUOTE> So you are going from playing with words to picking on my exclamation! :)<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">Well, you can't know why these tasks failed. That's the whole problem.

You're dealing with incomplete information and you have to devise

heuristics that get things done efficiently.

  </PRE> </BLOCKQUOTE> But Swift might know why it failed; it has a bunch of STDOUT/STDERR that it always captures!  Falkon might capture the same output, but it's optional ;(  Could these outputs not be parsed for certain well-known errors, and have different exit codes to mean different kinds of errors?<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap="">  Exit codes are not usually enough in general, unless we define our

own and the app and wrapper scripts generate these particular exit

codes that Falkon can intercept and interpret reliably!

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">That would be an improvement, but probably not a universally valid

assumption. So I wouldn't design with only that in mind.

  </PRE> </BLOCKQUOTE> But it would be an improvement over what we currently have...<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">  </PRE>  <BLOCKQUOTE type="cite">    <BLOCKQUOTE type="cite">      <PRE wrap="">Frankly I'd rather Swift not be the part

to deal with it because it has to resort to heuristics, whereas Falkon

has direct knowledge of which nodes do what.

      </PRE>    </BLOCKQUOTE>    <PRE wrap="">That's fine, but I don't think Falkon can do it alone, it needs

context and failure definition, which I believe only the application

and Swift could say for certain!

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">Nope, they can't. Swift does not meddle with semantics of applications.

They're all equally valuable functions.

Now, there's stuff you can do to improve things, I'm guessing. You can

choose not to, and then we can keep having this discussion. There might

be stuff Swift can do, but it's not insight into applications, so you'll

have to ask for something else.

  </PRE> </BLOCKQUOTE> Any suggestions?<BR> <BR> Ioan<BR> <BLOCKQUOTE cite="mid:1187065878.4015.19.camel@blabla.mcs.anl.gov" type="cite">  <PRE wrap="">Mihael

  </PRE>  <BLOCKQUOTE type="cite">    <PRE wrap="">Ioan

    </PRE>  </BLOCKQUOTE>  <PRE wrap="">

  </PRE> </BLOCKQUOTE>  </BLOCKQUOTE></DIV><BR></DIV></BODY></HTML>