<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<br>
Michael Wilde wrote:
<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite">Some
comments on this thread:
<br>
<br>
- We need to agree on a rule of thumb on what workflow profiles will
run OK on GRAM and which won't, and need Falkon. We could approximate
an answer to this with a few calculations and assumptions.
<br>
Measuring this would not hurt. The dominant factor seems to be queuing
delays.
<br>
</blockquote>
I think job size distribution vs. the scale of the deployment is
important. For example, a workload with 1 second jobs on average might
do OK over GRAM/PBS if you only want to use 1 node, but would be
relatively inefficient for 10 nodes, VERY inefficient for 100 nodes,
and useless for 1000 nodes. On the other hand, if you had 1 hour jobs
on average, running over 100 nodes is fine, and maybe even 1000 nodes
might get decent utilization. I have some graphs and formulas that
take the job length, the number of processors, and the rate of job
submission/execution as input, and yield the resource utilization for
that input. For example, see the attached graph, which shows the
theoretical efficiency of various job lengths for a grid site of 1000
processors at various job throughputs (1/sec such as PBS, 10/sec
such as the latest development Condor, 500/sec such as Falkon, and
hypothetical throughputs of 1,000/sec through 1,000,000/sec). With
graphs like these (which have some nice and simple formulas behind
them), users can characterize their workload, testbed, and resource
manager, figure out what kind of efficiency they can expect to get,
and choose between, say, GRAM and Falkon accordingly. <br>
<br>
As another example, let's take MolDyn. The large run
with 244 mol on the Purdue hardware probably had an average job length
of &lt;30 min. If you wanted to scale that to, say, 1000 processors, you
would only get about 66% efficiency from the hardware, assuming that
GRAM/PBS could sustain a throughput of 1 job/sec. <br>
<br>
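A minimal sketch of one formula of this kind, in Python. It is an assumed
ramp-up model, not necessarily the one behind the attached graph: the
resource manager starts jobs one at a time at a fixed rate, so a wave of
P equal-length jobs on P processors takes P/rate seconds to dispatch plus
one job length to finish. The function and parameter names are
hypothetical.
<pre>
```python
def ramp_up_efficiency(job_seconds: float, processors: int,
                       jobs_per_second: float) -> float:
    """Efficiency of one wave of `processors` equal-length jobs.

    Assumed model (hypothetical, not taken from the attached graph):
    jobs are started one at a time at `jobs_per_second`, so the last
    processor gets work after processors / jobs_per_second seconds and
    the wave ends job_seconds later. Efficiency is useful processor
    time divided by processor time held for the whole wave.
    """
    dispatch_seconds = processors / jobs_per_second
    return job_seconds / (job_seconds + dispatch_seconds)


if __name__ == "__main__":
    # Throughput of 1 job/sec, the figure quoted above for GRAM/PBS.
    for label, job_seconds in (("1 s", 1.0), ("30 min", 1800.0), ("1 h", 3600.0)):
        for processors in (1, 10, 100, 1000):
            eff = ramp_up_efficiency(job_seconds, processors, jobs_per_second=1.0)
            print(f"{label:>6} jobs, {processors:>4} processors: {eff:6.1%}")
```
</pre>
Under this assumed model at 1 job/sec, 1-second jobs give 50% on 1
processor, 9.1% on 10, 1.0% on 100, and 0.1% on 1000, while 1-hour jobs
give 97.3% on 100 processors and 78.3% on 1000. A 30-minute job on 1000
processors comes out at 64.3%, in the same range as the MolDyn estimate,
though the exact figure depends on the model used.<br>
<br>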
<img alt="abstract efficiency 1000 processors"
src="cid:part1.03070805.01070101@cs.uchicago.edu" height="623"
width="911"><br>
<br>
And then, on top of everything, there are the queuing delays, which
provisioning can help with by amortizing the queuing delay over many
jobs.<br>
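<br>
The amortization argument can be sketched the same way. This is a
simplified illustration with made-up numbers (60-second jobs, a
10-minute queue wait); the function names are hypothetical, and it
assumes a provisioned allocation waits in the queue once and then runs
its jobs back to back.
<pre>
```python
def per_job_queueing(job_seconds: float, queue_seconds: float) -> float:
    """Fraction of elapsed time spent computing when every job queues on its own."""
    return job_seconds / (job_seconds + queue_seconds)


def provisioned(job_seconds: float, queue_seconds: float,
                jobs_per_allocation: int) -> float:
    """Same fraction when one queued allocation runs many jobs back to back."""
    work_seconds = job_seconds * jobs_per_allocation
    return work_seconds / (work_seconds + queue_seconds)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s jobs and a 600 s queue wait.
    print(f"per-job queueing: {per_job_queueing(60.0, 600.0):.1%}")
    for n in (1, 10, 100, 1000):
        frac = provisioned(60.0, 600.0, n)
        print(f"provisioned, {n:>4} jobs per allocation: {frac:.1%}")
```
</pre>
With these assumed numbers, a job that queues on its own spends about 9%
of its elapsed time computing, while 100 jobs sharing one allocation
spend about 91%.<br>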
<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>
I'm in favor of using only GRAM when that is effective, and of agreeing
on a metric of when that is, and when that is not. This will need more
discussion. Mihael I think made a start on that in this thread; it
needs to be developed further.
<br>
<br>
- I'm waiting to hear comments (and see action) on the thread I
started asking to test if the existing Swift throttles will get past
the last known blocker of MolDyn-244.
<br>
</blockquote>
I think it will also be interesting to compare Swift+GRAM and
Swift+Falkon for MolDyn 244-mol if possible.<br>
<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>
- We need to discuss on-list with Ioan when his Falkon-side error
recovery logic will be available and when to test it, in part based on
assessment of whether it is necessary given the throttling. </blockquote>
The Falkon-side error recovery logic is in; it just needs a good
large-scale test (hopefully with some errors that exercise Falkon's
recovery mechanisms) to verify the new features.<br>
<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite">I think
the main issue here is whether a Swift-level throttle override that
does not deal with taking bad nodes out-of-service can be effective or
not.
<br>
<br>
- We still need to determine a direction for support of Falkon vs
development of a supportable alternative to "glide-in" provisioning.
<br>
Progress is being made on Falkon (both in its development and
integration/support); it's a question of how much of whose development
time to devote to the overall problem, on what schedule.
<br>
<br>
- As Ioan goes on to develop data-aware methods in Falkon, we need to
determine how to support the basic Swift needs and to isolate the two
efforts from each other until such time as we decide they should be
coupled.
<br>
</blockquote>
Right. I am adding new features to Falkon while keeping in mind that I
need to stay compatible with older client code (i.e. the Falkon
provider in Swift). I am adding enable/disable knobs that will allow
users to turn each new feature on or off on top of the basic Falkon
task execution service (e.g. data caching, new schedulers, new
provisioning mechanisms, etc.).<br>
<br>
Ioan<br>
<blockquote cite="mid:46EEB586.7020807@mcs.anl.gov" type="cite"><br>
Mike
<br>
</blockquote>
<br>
</body>
</html>