<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1213934501.1194.18.camel@localhost" type="cite">
<pre wrap="">On Thu, 2008-06-19 at 22:24 -0500, Ioan Raicu wrote:
</pre>
<blockquote type="cite">
<pre wrap="">
Mihael Hategan wrote:
</pre>
<blockquote type="cite">
<pre wrap="">There's probably a misunderstanding. Mike seemed to suggest that, when
using BG/P, there should be multiple services in order to distribute
load.
</pre>
</blockquote>
<pre wrap="">Yes, he was correct.
</pre>
<blockquote type="cite">
<pre wrap="">That I think is a problem.
</pre>
</blockquote>
<pre wrap="">I don't follow. If your goal is to just show that it works at small
scales (100s, maybe 1000s of CPUs), you don't need this, but if you
want to have any chance of scaling to 160K CPUs, I don't think you'll
have many options :(
</pre>
</blockquote>
<pre wrap=""><!---->
If your service scales linearly, then splitting it into multiple
processes does not help. But now you have more services to maintain.
That's because k*n = c*k*(n/c), where k would be your linearity factor.
If you have worse, say k*n^2, then dividing makes sense because
c*k*((n/c)^2) = k*n^2/c, which is better than k*n^2.
The point is that I'd rather spend my time making the algorithm linear
than dealing with multiple services.
Now, of course, as you mention, it may not be possible to do so because
the problem is at the networking layer. So we should probably stop
talking until we know what the actual bottleneck is. And I mean *know*.
Do we?
</pre>
</blockquote>
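Just to spell out the arithmetic in the quote above with some concrete
(entirely made-up) numbers: take n = 160K CPUs, c = 8 services, and some
per-CPU cost factor k.<br>
<pre>
linear service,    1 service:   k*n            = 160000*k
linear service,    8 services:  8 * k*(n/8)    = 160000*k            (no gain, 8x the services to run)
quadratic service, 1 service:   k*n^2          = 2.56e10 * k
quadratic service, 8 services:  8 * k*(n/8)^2  = k*n^2/8 = 3.2e9 * k (8x better)
</pre>
So splitting only buys you something when the per-service cost grows
faster than linearly in the number of CPUs it manages.<br>
<br>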
For Falkon, it was a networking issue (coupled with the amount of
CPU/RAM on the node where the service was running) that kept a single
Falkon service from scaling reliably beyond ~10K CPUs when using
persistent sockets. Note that without persistent sockets, as is the
case with GT4.0.x WS, we scaled to 50K CPUs just fine; but in that
case the service never had to maintain more than a few hundred TCP
connections at the same time, which is why it scaled so well.<br>
<br>
That is not to say that your Coaster implementation won't scale to
160K CPUs from a single service, but in my experience a server
(implemented in Java, anyway) using select with 2-4 GB of memory and
4 CPU cores will not be able to handle 100K+ concurrent TCP
connections that are all active at the same time. That said, I never
did a thorough study to pin down which part of the networking stack or
which OS-level calls was the problem... I'd be curious to see how far
Coaster will scale with a single service using TCP, so it might be
worth running one Coaster service on a login node and trying to
see how many CPUs it can manage before running into trouble.<br>
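<br>
For reference, here is a minimal sketch of what I mean by a select()-based
server in Java NIO. This is not the Falkon or Coaster code (the class name
and port number are made up); it just illustrates the single event loop
that ends up owning every persistent worker socket in one JVM:<br>
<pre>
// Minimal select()-style server loop in Java NIO -- an illustrative sketch,
// not the actual Falkon/Coaster service (class name and port are made up).
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class SelectServerSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(50001));  // arbitrary port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(4096);
        int connections = 0;  // every worker holds one persistent socket

        while (true) {
            selector.select();  // blocks until some registered channel is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    // new worker connecting: register it with the same selector
                    SocketChannel c = server.accept();
                    c.configureBlocking(false);
                    c.register(selector, SelectionKey.OP_READ);
                    connections++;
                } else if (key.isReadable()) {
                    SocketChannel c = (SocketChannel) key.channel();
                    buf.clear();
                    if (c.read(buf) == -1) {  // worker went away
                        key.cancel();
                        c.close();
                        connections--;
                    }
                    // a real service would parse and dispatch the request here
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
</pre>
With 100K+ workers all active, that one loop (and the one process and OS
socket table behind it) is what has to keep up, which is where I suspect
the wall is.<br>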
<br>
Ioan<br>
<blockquote cite="mid:1213934501.1194.18.camel@localhost" type="cite">
<pre wrap="">
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dev.globus.org/wiki/Incubator/Falkon">http://dev.globus.org/wiki/Incubator/Falkon</a>
<a class="moz-txt-link-freetext" href="http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page">http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page</a>
===================================================
===================================================
</pre>
</body>
</html>