<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<br>

Michael Wilde wrote:

<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite">Some

comments on this thread:

  <br>

  <br>

- We need to agree on a rule of thumb on what workflow profiles will

run OK on GRAM and which won't, and need Falkon.  We could approximate

an answer to this with a few calculations and assumptions.

  <br>

Measuring this would not hurt. The dominant factor seems to be queuing

delays.

  <br>

</blockquote>

I think job size distribution vs. the scale of the deployment is

important.  For example, a workload with 1 second jobs on average might

do OK over GRAM/PBS if you only want to use 1 node, but would be

relatively inefficient for 10 nodes, VERY inefficient for 100 nodes,

and useless for 1000 nodes.  On the other hand, if you had 1 hour jobs

on average, running over 100 nodes is fine, and maybe even 1000 nodes

might get decent utilization.  I have some graphs and formulas that

allow you to input the job length, number of processors, and the rate

of job submission/execution, yielding the resource utilization for the

particular input.  For example, see the attached graph, showing the

theoretical efficiency of various job lengths for a grid site of 1000

processors and with various job throughput (1/sec such as PBS, 10/sec

such as the latest development Condor, 500/sec such as Falkon, and

1,000/sec through 1,000,000/sec hypothetical throughputs).  With graphs

like these (see attached; they have some nice and simple formulas

behind them), users can characterize their workload, testbed, and resource

manager, figure out what kind of efficiency they can expect to get,

and choose between, say, GRAM and Falkon accordingly.  <br>

<br>

As another example, let's take MolDyn.  The large run

with 244 mol on the Purdue hardware probably had an average job length

of &lt;30 min.  If you wanted to scale that to, say, 1000 processors, you

would only get about 66% efficiency of the hardware assuming that

GRAM/PBS could sustain a 1 job/sec throughput.  <br>
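<br>

As a rough sketch of this kind of calculation (an assumed model, not necessarily the formula behind the attached graph): suppose filling P processors at a dispatch rate of T jobs/sec takes P/T seconds, and count that ramp time as overhead against a job of length L, giving efficiency = L / (L + P/T).  The function and parameter names below are illustrative only.<br>

<pre>
```python
def dispatch_limited_efficiency(job_length_s, processors, throughput_jobs_per_s):
    """Estimated fraction of processor time spent running jobs.

    Assumed model (not taken from the original graph): filling all
    processors takes processors / throughput seconds, and that ramp
    time is counted as overhead against one job length of useful work.
    """
    ramp_s = processors / throughput_jobs_per_s
    return job_length_s / (job_length_s + ramp_s)


if __name__ == "__main__":
    # 30-minute jobs on 1000 processors, at the throughputs named in the message.
    for label, rate in [("PBS", 1.0), ("Condor (dev)", 10.0), ("Falkon", 500.0)]:
        eff = dispatch_limited_efficiency(1800.0, 1000, rate)
        print(f"{label}: {eff:.1%}")
```
</pre>

Under that assumption, 30 min jobs on 1000 processors at 1 job/sec come out at about 64%, in the same range as the figure above, and 1 second jobs at 1 job/sec drop from 50% on 1 node to about 0.1% on 1000 nodes.<br>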

<br>

<img alt="abstract efficiency 1000 processors"

 src="cid:part1.03070805.01070101@cs.uchicago.edu" height="623"

 width="911"><br>

<br>

And then, on top of everything, there are the queuing delays, which

provisioning can help with by amortizing the queuing delay over many

jobs.<br>
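<br>

To make the amortization concrete, here is a minimal sketch with hypothetical numbers (the 600 second queue wait and the function name are assumptions for illustration, not measurements): if one provisioned allocation waits in the queue once and then runs n jobs, each job is charged only 1/n of that wait.<br>

<pre>
```python
def amortized_queue_wait_s(queue_wait_s, jobs_per_allocation):
    """Queue wait charged to each job when one allocation runs many jobs."""
    return queue_wait_s / jobs_per_allocation


if __name__ == "__main__":
    queue_wait_s = 600.0  # hypothetical queue wait for one allocation
    for jobs in (1, 10, 100):
        per_job = amortized_queue_wait_s(queue_wait_s, jobs)
        print(f"{jobs} jobs per allocation: {per_job:.0f} s of queue wait per job")
```
</pre>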

<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>

I'm in favor of using only GRAM when that is effective, and of agreeing

on a metric of when that is, and when that is not. This will need more

discussion. Mihael I think made a start on that in this thread; it

needs to be developed further.

  <br>

  <br>

- I'm waiting to hear comments (and see action) on the thread I

started asking to test if the existing Swift throttles will get past

the last known blocker of MolDyn-244.

  <br>

</blockquote>

I think it will also be interesting to compare Swift+GRAM and

Swift+Falkon for MolDyn 244-mol if possible.<br>

<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>

- We need to discuss on-list with Ioan when his Falkon-side error

recovery logic will be available and when to test it, in part based on

assessment of whether it is necessary given the throttling. </blockquote>

The Falkon-side error recovery logic is in; it just needs a good

large-scale test (hopefully with some errors on which Falkon can exercise its

recovery mechanisms) to verify the new features.<br>

<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite">I think

the main issue here is whether a Swift-level throttle override that

does not deal with taking bad nodes out-of-service can be effective or

not.

  <br>

  <br>

- We still need to determine a direction for support of Falkon vs

development of a supportable alternative to "glide-in" provisioning.

  <br>

Progress is being made on Falkon (both in its development and

integration/support); it's a question of how much of whose development

time to devote to the overall problem, on what schedule.

  <br>

  <br>

- As Ioan goes on to develop data-aware methods in Falkon, we need to

determine how to support the basic Swift needs and to isolate the two

efforts from each other until such time as we decide they should be

coupled.

  <br>

</blockquote>

Right.  I am adding new features to Falkon while keeping in mind that I

need to stay compatible with older client code (e.g. the Falkon

provider in Swift).  I am adding enable/disable knobs that

will allow users to enable and disable all new features on top of the

basic Falkon task execution service (e.g. data caching, new schedulers,

new provisioning mechanisms, etc.).<br>

<br>

Ioan<br>

<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>

Mike

  <br>

</blockquote>

<br>

</body>

</html>