[Swift-devel] bug 53
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Sep 17 23:49:08 CDT 2007
Michael Wilde wrote:
> Some comments on this thread:
>
> - We need to agree on a rule of thumb on what workflow profiles will
> run OK on GRAM and which won't, and so need Falkon. We could approximate
> an answer to this with a few calculations and assumptions.
> Measuring this would not hurt. The dominant factor seems to be queuing
> delays.
I think job size distribution vs. the scale of the deployment is
important. For example, a workload with 1 second jobs on average might
do OK over GRAM/PBS if you only want to use 1 node, but would be
relatively inefficient for 10 nodes, VERY inefficient for 100 nodes, and
useless for 1000 nodes. On the other hand, if you had 1 hour jobs on
average, running over 100 nodes is fine, and maybe even 1000 nodes might
get decent utilization. I have some graphs and formulas that allow you
to input the job length, number of processors, and the rate of job
submission/execution, yielding the resource utilization for the
particular input. For example, see the attached graph, showing the
theoretical efficiency of various job lengths for a grid site of 1000
processors and with various job throughput (1/sec such as PBS, 10/sec
such as the latest development Condor, 500/sec such as Falkon, and
1,000/sec through 1,000,000/sec hypothetical throughputs). With graphs
like these (see attached, which have some nice and simple formulas
behind them), users can characterize their workload, testbed, and
resource manager, figure out what kind of efficiency they can expect to
get, and choose between, say, GRAM and Falkon accordingly.
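
The formulas behind the attached graph are not reproduced in this
message. As a rough illustration only, here is a minimal sketch of one
simple model that behaves the way the examples above describe. It
assumes that handing one job to each of P processors at a dispatch rate
of R jobs/sec costs about P/R seconds of idle time per job, so that
efficiency is L / (L + P/R) for jobs of length L. The function name,
parameter names, and the model itself are assumptions, not necessarily
what was used to produce the graph.

```python
def efficiency(job_len_s, processors, dispatch_rate_per_s):
    """Estimated fraction of processor time spent on useful work.

    Assumed model (not taken from the attached graph): each job of
    length job_len_s pays an idle cost of processors / dispatch_rate_per_s
    seconds, the time the dispatcher needs to feed every processor once.
    """
    if job_len_s <= 0 or processors <= 0 or dispatch_rate_per_s <= 0:
        raise ValueError("all inputs must be positive")
    refill_time_s = processors / dispatch_rate_per_s
    return job_len_s / (job_len_s + refill_time_s)


if __name__ == "__main__":
    # 1 job/sec is the PBS-like throughput mentioned in the text.
    for job_len_s in (1, 3600):
        for processors in (1, 10, 100, 1000):
            e = efficiency(job_len_s, processors, dispatch_rate_per_s=1.0)
            print(f"{job_len_s:>5d} s jobs on {processors:>4d} processors: {e:6.1%}")
```

Under this assumed model, 1 second jobs at 1 job/sec drop off quickly as
the processor count grows, while 1 hour jobs stay efficient at 100
processors and remain usable at 1000, which matches the qualitative
picture above.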
As another example, let's take MolDyn. The large run
with 244 mol on the Purdue hardware probably had an average job length
of <30 min. If you wanted to scale that to say 1000 processors, you
would only get about 66% efficiency of the hardware assuming that
GRAM/PBS could sustain a 1 job/sec throughput.
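
As a rough sanity check on that number, assume a simple model (an
assumption, not necessarily the formula behind the attached graph) in
which a site with P processors fed at R jobs/sec loses about P/R seconds
per job of length L. Taking L = 1800 s as the upper bound on the average
job length, P = 1000 and R = 1 job/sec:

```latex
E = \frac{L}{L + P/R} = \frac{1800}{1800 + 1000/1} \approx 0.64
```

This lands in the same neighborhood as the figure quoted above; the
exact value depends on the model and on the real average job length.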
[Attached figure: abstract efficiency, 1000 processors]
And then, on top of everything, there are the queuing delays, which
provisioning can help with by amortizing the queuing delay over many jobs.
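
To make the amortization argument concrete, here is a minimal sketch
under stated assumptions: a provisioned allocation waits in the batch
queue once, then runs some number of jobs back to back, so each job
carries only its share of that wait. The function names and the numbers
in the usage example are hypothetical, chosen only for illustration.

```python
def per_job_queue_overhead_s(queue_wait_s, jobs_per_allocation):
    """Queue wait charged to each job when one wait is shared by many jobs."""
    if queue_wait_s < 0:
        raise ValueError("queue_wait_s must be non-negative")
    if jobs_per_allocation < 1:
        raise ValueError("jobs_per_allocation must be at least 1")
    return queue_wait_s / jobs_per_allocation


def amortized_efficiency(job_len_s, queue_wait_s, jobs_per_allocation):
    """Useful time per job divided by useful time plus its share of the wait."""
    if job_len_s <= 0:
        raise ValueError("job_len_s must be positive")
    overhead_s = per_job_queue_overhead_s(queue_wait_s, jobs_per_allocation)
    return job_len_s / (job_len_s + overhead_s)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s jobs and a 10 minute queue wait.
    for jobs in (1, 10, 100):
        e = amortized_efficiency(60, 600, jobs)
        print(f"{jobs:>3d} jobs per allocation: {e:6.1%}")
```

With one job per allocation, as when every job is queued on its own,
each job pays the full wait; with many jobs per provisioned allocation
the per-job share of the wait shrinks in proportion.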
>
> I'm in favor of using only GRAM when that is effective, and of
> agreeing on a metric of when that is, and when that is not. This will
> need more discussion. Mihael I think made a start on that in this
> thread; it needs to be developed further.
>
> - I'm waiting to hear comments (and see action) on the thread I
> started asking to test if the existing Swift throttles will get past
> the last known blocker of MolDyn-244.
I think it will also be interesting to compare Swift+GRAM and
Swift+Falkon for MolDyn 244-mol if possible.
>
> - We need to discuss on-list with Ioan when his Falkon-side error
> recovery logic will be available and when to test it, in part based on
> assessment of whether it is necessary given the throttling.
The Falkon-side error recovery logic is in; it just needs a good
large-scale test (hopefully with some errors on which Falkon can
exercise its recovery mechanisms) to verify the new features.
> I think the main issue here is whether a Swift-level throttle override
> that does not deal with taking bad nodes out-of-service can be
> effective or not.
>
> - We still need to determine a direction for support of Falkon vs
> development of a supportable alternative to "glide-in" provisioning.
> Progress is being made on Falkon (both in its development and
> integration/support); it's a question of how much of whose development
> time to devote to the overall problem, on what schedule.
>
> - As Ioan goes on to develop data-aware methods in Falkon, we need to
> determine how to support the basic Swift needs and to isolate the two
> efforts from each other until such time as we decide they should be
> coupled.
Right. I am adding new features to Falkon while keeping in mind that I
need to stay compatible with older client code (i.e. the Falkon
provider in Swift). I am making all sorts of enable/disable knobs that
will allow users to enable and disable all new features on top of the
basic Falkon task execution service (e.g. data caching, new schedulers,
new provisioning mechanisms).
Ioan
>
> Mike
-------------- next part --------------
A non-text attachment was scrubbed...
Name: abstract-efficiency-1000proc.gif
Type: image/gif
Size: 23787 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20070917/77ff488f/attachment.gif>