[Swift-devel] bug 53

Mon Sep 17 23:49:08 CDT 2007

Michael Wilde wrote:
> Some comments on this thread:
>
> - We need to agree on a rule of thumb for which workflow profiles will
> run OK on GRAM and which won't, and so need Falkon.  We could approximate
> an answer to this with a few calculations and assumptions.
> Measuring this would not hurt. The dominant factor seems to be queuing
> delays.
I think job size distribution vs. the scale of the deployment is 
important.  For example, a workload with 1 second jobs on average might 
do OK over GRAM/PBS if you only want to use 1 node, but would be 
relatively inefficient for 10 nodes, VERY inefficient for 100 nodes, and 
useless for 1000 nodes.  On the other hand, if you had 1 hour jobs on 
average, running over 100 nodes is fine, and maybe even 1000 nodes might 
get decent utilization.  I have some graphs and formulas that take the
job length, number of processors, and the rate of job
submission/execution as input, and yield the resource utilization for
that input.  For example, see the attached graph, showing the
theoretical efficiency of various job lengths for a grid site of 1000
processors at various job throughputs (1/sec such as PBS, 10/sec
such as the latest development Condor, 500/sec such as Falkon, and
hypothetical throughputs from 1,000/sec through 1,000,000/sec).  With
graphs like these (which have some nice and simple formulas behind
them), users can characterize their workload, testbed, and resource
manager, figure out what kind of efficiency they can expect to get,
and choose between, say, GRAM and Falkon accordingly.
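
As an illustration of the kind of calculation described above, here is a minimal sketch in Python.  It is not the formula behind the attached graph.  It assumes a simple model in which jobs are dispatched one at a time at a fixed rate, so filling N processors costs N/rate seconds of dispatch time per wave of jobs, giving efficiency = L / (L + N/rate).  The function name and its parameters are hypothetical.

```python
def efficiency(job_seconds, processors, jobs_per_second):
    """Estimated fraction of processor time spent on useful work.

    Assumed model (not taken from the original post): dispatching one job
    to each of `processors` processors takes processors / jobs_per_second
    seconds, and that dispatch time is pure overhead on top of job_seconds.
    """
    dispatch_seconds = processors / jobs_per_second
    return job_seconds / (job_seconds + dispatch_seconds)


# Sweep the two workloads from the text at an assumed 1 job/sec throughput.
for job_seconds in (1, 3600):
    for processors in (1, 10, 100, 1000):
        e = efficiency(job_seconds, processors, jobs_per_second=1.0)
        print(f"{job_seconds:>5} s jobs, {processors:>4} processors: {e:6.1%}")
```

Under this assumed model, 1 hour jobs at 1 job/sec come out at roughly 97% on 100 processors and roughly 78% on 1000, while 1 second jobs fall below 1% at 1000 processors, which is at least consistent with the qualitative picture above.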

As another example, take MolDyn.  The large run with 244 mol on the
Purdue hardware probably had an average job length of <30 min.  If you
wanted to scale that to, say, 1000 processors, you would only get about
66% efficiency from the hardware, assuming that GRAM/PBS could sustain
a throughput of 1 job/sec.
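
As a rough cross-check of that number, here is the arithmetic under one assumed model, which is not necessarily the formula behind the attached graph: efficiency E = L / (L + N/r), where L is the job length in seconds, N the number of processors, and r the dispatch rate in jobs/sec.  Taking L = 1800 s as the upper bound on the job length:

```latex
E = \frac{L}{L + N/r}
  = \frac{1800}{1800 + 1000/1}
  = \frac{1800}{2800}
  \approx 0.64
```

That lands in the same neighborhood as the figure quoted above.  Under the same assumption, a dispatch rate of 500 jobs/sec would give 1800 / (1800 + 2), or about 99.9%.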

[Attached graph: abstract efficiency, 1000 processors]

And then, on top of everything, there are the queuing delays, which
provisioning can help with by amortizing the queuing delay over many jobs.
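
The amortization argument can be sketched in a few lines of Python.  This assumes, hypothetically, that the queue wait is paid once per allocation and that jobs then run back to back on the provisioned processor.  The 600 second wait and 60 second job length are made-up inputs, not measurements, and the function names are illustrative only.

```python
def queue_overhead_per_job(queue_wait_seconds, jobs_per_allocation):
    """Queue wait charged to each job when one allocation serves many jobs."""
    return queue_wait_seconds / jobs_per_allocation


def useful_fraction(job_seconds, queue_wait_seconds, jobs_per_allocation):
    """Share of each job's turnaround time spent running rather than queued."""
    overhead = queue_overhead_per_job(queue_wait_seconds, jobs_per_allocation)
    return job_seconds / (job_seconds + overhead)


# Hypothetical inputs: 60 s jobs behind a 600 s queue wait.
for jobs_per_allocation in (1, 10, 100):
    f = useful_fraction(60, 600, jobs_per_allocation)
    print(f"{jobs_per_allocation:>3} jobs per allocation: {f:.1%} useful")
```

With one job per allocation every job pays the full wait.  With 100 jobs per allocation the wait charged to each job drops to 6 seconds.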
>
> I'm in favor of using only GRAM when that is effective, and of 
> agreeing on a metric of when that is, and when that is not. This will 
> need more discussion. Mihael I think made a start on that in this 
> thread; it needs to be developed further.
>
> - I'm waiting to hear comments (and see action) on the thread I
> started asking to test whether the existing Swift throttles will get
> past the last known blocker of MolDyn-244.
I think it will also be interesting to compare Swift+GRAM and 
Swift+Falkon for MolDyn 244-mol if possible.
>
> - We need to discuss on-list with Ioan when his Falkon-side error 
> recovery logic will be available and when to test it, in part based on 
> assessment of whether it is necessary given the throttling. 
The Falkon-side error recovery logic is in; it just needs a good
large-scale test (hopefully with some errors that exercise Falkon's
recovery mechanisms) to verify the new features.
> I think the main issue here is whether a Swift-level throttle override 
> that does not deal with taking bad nodes out-of-service can be 
> effective or not.
>
> - We still need to determine a direction for support of Falkon vs 
> development of a supportable alternative to "glide-in" provisioning.
> Progress is being made on Falkon (both in its development and
> integration/support); it's a question of how much of whose development
> time to devote to the overall problem, on what schedule.
>
> - As Ioan goes on to develop data-aware methods in Falkon, we need to 
> determine how to support the basic Swift needs and to isolate the two 
> efforts from each other until such time as we decide they should be 
> coupled.
Right.  I am adding new features to Falkon while keeping in mind that I
need to stay compatible with older client code (i.e. the Falkon
provider in Swift).  I am adding enable/disable knobs that will allow
users to turn each new feature on or off on top of the basic Falkon
task execution service (e.g. data caching, new schedulers, new
provisioning mechanisms, etc.).

Ioan
>
> Mike

-------------- next part --------------
A non-text attachment was scrubbed...
Name: abstract-efficiency-1000proc.gif
Type: image/gif
Size: 23787 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20070917/77ff488f/attachment.gif>