[Swift-devel] Try coaster on BG/P ?

Fri Jun 20 12:44:35 CDT 2008


Mihael Hategan wrote:
> On Thu, 2008-06-19 at 22:24 -0500, Ioan Raicu wrote:
>   
>> Mihael Hategan wrote: 
>>     
>>> There's probably a misunderstanding. Mike seemed to suggest that, when
>>> using BG/P, there should be multiple services in order to distribute
>>> load. 
>>>       
>> Yes, he was correct.
>>     
>>> That I think is a problem. 
>>>       
>> I don't follow.  If your goal is to just show that it works at small
>> scales (100s, maybe 1000s of CPUs), you don't need this, but if you
>> want to have any chance of scaling to 160K CPUs, I don't think you'll
>> have many options :(
>>     
>
> If your service scales linearly, then splitting it into multiple
> processes does not help. But now you have more services to maintain.
> That's because k*n = c*k*(n/c), where k would be your linearity factor.
> If you have worse, say k*n^2, then dividing makes sense because
> c*k*((n/c)^2) = k*(n^2)/c, which is better than k*(n^2).
>
> The point is that I'd rather spend my time making the algorithm linear
> than dealing with multiple services.
>
> Now, of course, as you mention, it may not be possible to do so because
> the problem is at the networking layer. So we should probably stop
> talking until we know what the actual bottleneck is. And I mean *know*.
> Do we?
>   
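To make the arithmetic in the quoted argument concrete, here is a small
sketch. The cost model (k * n^exponent per service) is the one from the
quoted text; the function name, k = 1, and the choice of c = 16 services
are only illustrative assumptions, not measurements of any real service.

```python
def total_cost(n, services=1, exponent=1, k=1.0):
    """Total work when n clients are split evenly across `services`
    services, each costing k * (clients_per_service ** exponent)."""
    per_service = n / services
    return services * k * per_service ** exponent


n = 160_000   # CPU count discussed in the thread
c = 16        # hypothetical number of services

# Linear service: splitting changes nothing, k*n == c*k*(n/c).
linear_one = total_cost(n, services=1, exponent=1)
linear_split = total_cost(n, services=c, exponent=1)

# Quadratic service: splitting divides the total by c,
# since c*k*(n/c)^2 == k*n^2/c.
quad_one = total_cost(n, services=1, exponent=2)
quad_split = total_cost(n, services=c, exponent=2)
```

Under this model, splitting only pays off when the per-service cost
grows faster than linearly, which is the point of the quoted argument.
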
For Falkon, it was a networking issue (coupled with the amount of CPU/RAM
on the node where the service was running) that kept a single Falkon
service from scaling reliably beyond roughly 10K CPUs when using
persistent sockets.  Note that when not using persistent sockets, as is
the case with GT4.0.x WS, we were able to scale to 50K CPUs just fine,
but in that case the service never had to maintain more than a few
hundred TCP connections at the same time, which is why it scaled so
well.  Now, that is not to say that your implementation of Coaster won't
scale to 160K CPUs all from 1 service, but in my experience, a server
(implemented in Java, anyway) using select, with 2~4GB of memory and 4
CPU cores, will not be able to handle 100K+ concurrent TCP connections
that are all active at the same time.  That said, I never did a thorough
study to see which part of the networking stack or which OS-level calls
were the problem... I'd be curious to see how far Coaster will scale
with a single service using TCP, so it might be worth running 1 Coaster
service on a login node and seeing how many CPUs it can manage before
running into trouble.
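
As a rough illustration of the kind of single-service test suggested
above, here is a minimal sketch of one select-style loop serving many
connections at once. It is not Coaster or Falkon code: the function name
echo_probe is made up, it uses Python's selectors module rather than
Java, and it uses local socket pairs as a stand-in for real TCP
connections from worker nodes. The connection count you can reach will
also depend on the per-process file descriptor limit.

```python
import selectors
import socket


def echo_probe(num_connections):
    """Serve `num_connections` local socket pairs from one selector loop
    and return how many of them completed a ping/echo round trip."""
    sel = selectors.DefaultSelector()
    sockets = []
    served = 0
    try:
        for _ in range(num_connections):
            client, server = socket.socketpair()
            sockets.extend((client, server))
            server.setblocking(False)
            # Keep the peer in `data` so the echo can be verified later.
            sel.register(server, selectors.EVENT_READ, data=client)
            client.sendall(b"ping")

        while served < num_connections:
            events = sel.select(timeout=1.0)
            if not events:
                break  # nothing became readable: stop instead of hanging
            for key, _mask in events:
                server = key.fileobj
                payload = server.recv(16)
                if payload:
                    server.sendall(payload)          # echo back
                    if key.data.recv(16) == payload:  # peer saw the echo
                        served += 1
                sel.unregister(server)
        return served
    finally:
        sel.close()
        for s in sockets:
            s.close()
```

Calling echo_probe with increasing counts (for example 100, 1000,
10000) shows where a single loop on a given node starts to fail, though
a real test would need long-lived, active TCP connections from remote
workers to say anything about the bottleneck discussed here.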

Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

