[Swift-devel] Try coaster on BG/P ?

Ioan Raicu iraicu at cs.uchicago.edu
Thu Jun 19 22:24:10 CDT 2008



Mihael Hategan wrote:
> There's probably a misunderstanding. Mike seemed to suggest that, when
> using BG/P, there should be multiple services in order to distribute
> load. 
Yes, he was correct.
> That I think is a problem. 
I don't follow.  If your goal is to just show that it works at small 
scales (100s, maybe 1000s of CPUs), you don't need this, but if you want 
to have any chance of scaling to 160K CPUs, I don't think you'll have 
many options :(
> But I also now think he was referring
> to the case in which multiple clusters are used, in which case what you
> say applies. 
You can call them whatever you want.  It is 1 machine, composed of 640 
P-SETs, each P-SET having an I/O node (i.e. think of a login node from 
a cluster) and 64 compute nodes (4 CPU cores per node) on a private 
network behind that I/O node.  So, if you want to think of the 160K-CPU 
BG/P as a collection of 640 clusters, you can, but it's really 1 big 
machine.
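Just to spell out the arithmetic behind those figures (nothing new 
here, only a restatement of the numbers above):

    # Arithmetic behind the BG/P numbers quoted above.
    P_SETS         = 640   # P-SETs in the full machine
    NODES_PER_PSET = 64    # compute nodes behind each I/O node
    CORES_PER_NODE = 4     # CPU cores per compute node

    cores_per_pset = NODES_PER_PSET * CORES_PER_NODE  # 256 CPUs per I/O node
    total_cores    = P_SETS * cores_per_pset          # 163,840, i.e. "160K CPUs"
    print(cores_per_pset, total_cores)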

The trouble with not using the 640 I/O nodes is that 1 login node (or a 
few, up to 10) has to manage 160K CPUs.  If you use the I/O nodes as 
well, then 1 (or up to 10) login nodes can manage 640 I/O nodes, which 
in turn will each manage 256 CPUs, a breakdown that is certainly more 
manageable.  We have had trouble running Falkon reliably at more than 
10K-20K CPUs using a single Falkon service (i.e. running on 1 login 
node), but when we switched to the hierarchical solution, we got up to 
64K CPUs and it worked without any sign of stress or problems.  BTW, 
the trouble we had when managing all CPUs from a single service was 
probably due to the fact that we were using persistent sockets, and 
using select to manage 10K+ active sockets.  We have an option to run 
without persistent sockets, which should scale better, but we haven't 
tested it on the BG/P yet, as it involves Java on the compute nodes 
(which we heard works, but we haven't tried it yet).
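Purely as an illustrative sketch (not Falkon's actual code), the 
pattern below is what a single service multiplexing many persistent 
worker sockets looks like.  With select() the whole socket list has to 
be handed to the kernel and rescanned on every call, and many select() 
implementations cap the fd set at 1024 descriptors, which is roughly 
why 10K+ persistent connections become painful; the epoll/kqueue-backed 
selector used here avoids the rescan, though the single-service 
bottleneck remains.  Names and the port number are made up:

    import selectors
    import socket

    sel = selectors.DefaultSelector()        # epoll/kqueue where available

    def accept(server_sock):
        conn, _ = server_sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, handle_worker)

    def handle_worker(conn):
        data = conn.recv(4096)
        if not data:                         # worker went away
            sel.unregister(conn)
            conn.close()
            return
        # ... hand the message (task result, slot report, etc.) to the
        # service's dispatch logic here ...

    server = socket.socket()
    server.bind(("", 50000))                 # arbitrary port for the example
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:
        for key, _events in sel.select():
            key.data(key.fileobj)            # invoke the registered handler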
> We've pretty much discussed this, and (1) is what we would
> eventually want to achieve with Swift + Coasters.
>   
Right, I agree that option #1 is the desired goal.
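For concreteness, here is a minimal client-side sketch of what option 
#1 amounts to; the names and structure are made up for illustration and 
are not Falkon's (or Coasters') actual code:

    from collections import deque

    class SlotAwareDispatcher:
        """Client-side dispatcher: tasks queue here, never at the services."""

        def __init__(self, services):
            # services: mapping of service id -> CPUs it manages (e.g. 256 each)
            self.free_slots = dict(services)
            self.pending = deque()           # tasks waiting at the client

        def submit(self, task):
            self.pending.append(task)
            self._drain()

        def task_finished(self, service_id):
            # a service reports a freed slot; reuse it right away
            self.free_slots[service_id] += 1
            self._drain()

        def _drain(self):
            while self.pending:
                # pick any service with a free slot (round-robin would do too)
                service_id = max(self.free_slots, key=self.free_slots.get)
                if self.free_slots[service_id] == 0:
                    break                    # everything busy; keep queuing locally
                self.free_slots[service_id] -= 1
                self._send(service_id, self.pending.popleft())

        def _send(self, service_id, task):
            # stand-in for the real wire protocol
            print("dispatch", task, "to service", service_id)

The point being that with this scheme no task ever sits in a remote 
service's queue while a CPU behind some other service is free.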

Ioan
> On Thu, 2008-06-19 at 19:58 -0500, Ioan Raicu wrote:
>   
>> I am not sure which problem you are proposing to fix.
>>
>> The issue with Falkon is that there are queues at the services.  If a
>> client submits all its jobs to a single service (which only manages 256
>> CPUs), there could be 639 other services, with 160K - 256 CPUs, left
>> idle (a worst case that wouldn't happen very often, but could still
>> happen towards the end of runs when there isn't enough work to keep
>> everyone busy).  There are only 2 solutions.  
>>
>> 1) never queue anything up at the services; only send tasks from the
>> client to a service when we know there is an available CPU to run that
>> task; this is the approach we took
>> 2) allow tasks to time out after some time, and trigger a resubmit of
>> the same task to another service, and keep doing this until a reply to
>> that task comes back; this seems like it would introduce unnecessarily
>> long delays, and cause load imbalances towards the end of runs when
>> there isn't enough work to keep everyone busy
>>
>> In essence, there is no problem to solve here; it's just a question of
>> which solution you take in such a distributed, tree-like environment,
>> where you have 1 client, N services, and M workers.  N is a value
>> between 1 and 640, and M could be as high as 160K, with a ratio of
>> 1:256 between N and M.  
>>
>> Ioan
>>
>> Mihael Hategan wrote: 
>>     
>>> On Thu, 2008-06-19 at 18:56 -0500, Michael Wilde wrote:
>>>
>>>   
>>>       
>>>> What Ioan did in Falkon when he went to the multiple-server architecture 
>>>> is relevant here: the client load-shares among all the servers, 
>>>> round-robin, only sending a job to a server when it knows that the 
>>>> server has a free cpu slot. In this way, no queues build up on the 
>>>> servers, and it avoids having a job wait in any server's queue when a 
>>>> free cpu might be available on some other server.
>>>>
>>>>     
>>>>         
>>> If you have O(1) scheduling, this shouldn't be necessary. It's like
>>> i2u2: Don't build a cluster to reduce the odds of triggering a problem.
>>> Fix the problem instead.
>>>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

