[Swift-devel] Try coaster on BG/P ?
Ioan Raicu
iraicu at cs.uchicago.edu
Fri Jun 20 12:44:35 CDT 2008
Mihael Hategan wrote:
> On Thu, 2008-06-19 at 22:24 -0500, Ioan Raicu wrote:
>
>> Mihael Hategan wrote:
>>
>>> There's probably a misunderstanding. Mike seemed to suggest that, when
>>> using BG/P, there should be multiple services in order to distribute
>>> load.
>>>
>> Yes, he was correct.
>>
>>> That I think is a problem.
>>>
>> I don't follow. If your goal is to just show that it works at small
>> scales (100s, maybe 1000s of CPUs), you don't need this, but if you
>> want to have any chance of scaling to 160K CPUs, I don't think you'll
>> have many options :(
>>
>
> If your service scales linearly, then splitting it into multiple
> processes does not help. But now you have more services to maintain.
> That's because k*n = c*k*(n/c), where k would be your linearity factor.
> If you have worse, say k*n^2, then dividing makes sense because
> c*k*((n/c)^2) = k*(n^2)/c, which is better than k*(n^2).
>
> The point is that I'd rather spend my time making the algorithm linear
> than dealing with multiple services.
>
> Now, of course, as you mention, it may not be possible to do so because
> the problem is at the networking layer. So we should probably stop
> talking until we know what the actual bottleneck is. And I mean *know*.
> Do we?
>
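As a rough check of the cost argument quoted above, here is a minimal sketch in Python. The function name total_cost and its parameters are hypothetical, chosen only for illustration; the model assumes n clients are split evenly across c services and that each service costs k * (clients ** exponent), which is the simplification used in the quoted text, not a measurement of Falkon or Coaster.

```python
def total_cost(n, c=1, k=1.0, exponent=1):
    """Total work when n clients are split evenly across c services.

    Assumes (hypothetically) that one service handling m clients costs
    k * m ** exponent, as in the quoted argument.
    """
    per_service = k * (n / c) ** exponent
    return c * per_service


# Linear service: splitting 4 ways changes nothing (k*n == c*k*(n/c)).
print(total_cost(1000, c=1))              # 1000.0
print(total_cost(1000, c=4))              # 1000.0

# Quadratic service: splitting 4 ways cuts the total by a factor of 4,
# i.e. c*k*(n/c)^2 == k*n^2/c.
print(total_cost(1000, c=1, exponent=2))  # 1000000.0
print(total_cost(1000, c=4, exponent=2))  # 250000.0
```

Under this model, multiple services only pay off when the per-service cost grows faster than linearly, which is the point being made in the quote.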
For Falkon, it was a networking issue (coupled with the amount of CPU
and RAM on the node where the service was running) that kept one Falkon
service from scaling reliably beyond 10K+ CPUs when using persistent
sockets. Note that when not using persistent sockets, as is the case
with GT4.0.x WS, we were able to scale to 50K CPUs just fine, but in
that case the service never had to maintain more than a few hundred TCP
connections at the same time, which is why it scaled so well. Now, that
is not to say that your implementation of Coaster won't scale to 160K
CPUs all from 1 service, but in my experience, a server (implemented in
Java, anyway) using select with 2~4GB of memory and 4 CPU cores will not
be able to handle 100K+ concurrent TCP connections that are all active
at the same time. That said, I never did a thorough study to see which
part of the networking stack or which OS-level calls were the problem...
I'd be curious to see how far Coaster will scale with a single service
using TCP, so it might be worth running 1 Coaster service on a login
node and seeing how many CPUs it can manage before running into trouble.
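
The kind of probe suggested above could look roughly like the following sketch. It is not Falkon or Coaster code: it is a small Python stand-in (using the standard selectors and socket modules) for a single select-based service, and the function name count_concurrent_connections is hypothetical. It opens loopback TCP connections one at a time and reports how many the single selector loop was holding open at once. On a real system the count would be capped by per-process file descriptor limits long before 100K, so this only illustrates the shape of the experiment.

```python
import selectors
import socket


def count_concurrent_connections(num_clients):
    """Hold num_clients loopback TCP connections open behind one selector
    and return how many the server side accepted."""
    sel = selectors.DefaultSelector()
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)
    port = server.getsockname()[1]

    clients, accepted = [], []
    try:
        for _ in range(num_clients):
            # Connect one client, then let the selector loop accept it.
            clients.append(socket.create_connection(("127.0.0.1", port)))
            for key, _mask in sel.select(timeout=5):
                conn, _addr = key.fileobj.accept()
                accepted.append(conn)
        return len(accepted)
    finally:
        for s in clients + accepted:
            s.close()
        sel.unregister(server)
        server.close()
        sel.close()


if __name__ == "__main__":
    print(count_concurrent_connections(50))
```

Raising num_clients until the call fails or slows down gives a first, crude idea of where a single process starts to struggle; it says nothing about which layer is the bottleneck.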
Ioan
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================