[Swift-devel] Try coaster on BG/P ?

Mihael Hategan hategan at mcs.anl.gov
Thu Jun 19 20:30:57 CDT 2008


There's probably a misunderstanding. Mike seemed to suggest that, when
using BG/P, there should be multiple services in order to distribute
load. That, I think, is a problem. But I also now think he was referring
to the case in which multiple clusters are used, in which case what you
say applies. We've pretty much discussed this, and (1) is what we would
eventually want to achieve with Swift + Coasters.

On Thu, 2008-06-19 at 19:58 -0500, Ioan Raicu wrote:
> I am not sure what problem you are referring to that needs fixing.
> 
> The issue with Falkon is that there are queues at each service.  If a
> client submits all its jobs to a single service (one that only manages
> 256 CPUs), there could be 639 other services, with roughly 160K - 256
> CPUs between them, left idle (the worst case, which wouldn't happen very
> often, but could still happen towards the end of runs when there isn't
> enough work to keep everyone busy).  There are only two solutions:
> 
> 1) never queue anything up at the services; only send tasks from the
> client to a service when we know there is an available CPU to run that
> task; this is the approach we took (see the sketch below)
> 2) allow tasks to time out after some time, trigger a resubmit of the
> same task to another service, and keep doing this until a reply to
> that task comes back; this seems like it would introduce unnecessarily
> long delays, and cause load imbalances towards the end of runs when
> there isn't enough work to keep everyone busy
> 
> In essence, there is no problem to solve here; it's just a question of
> which solution you take in such a distributed, tree-like environment,
> where you have 1 client, N services, and M workers.  N is a value
> between 1 and 640, and M could be as high as 160K (640 x 256 = 163,840),
> with an N:M ratio of 1:256.
> 
> Ioan
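
As a minimal illustration of approach (1) above: the client tracks free CPU
slots per service and dispatches each task, round-robin, only to a service
known to have a free slot, so nothing ever queues up at a service.  The
Task, Service, and SlotAwareDispatcher names below are hypothetical and are
not the actual Falkon or Coasters API; this is a sketch of the idea, not the
real implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical task and service abstractions (not the Falkon/Coasters API).
    class Task {
        final String id;
        Task(String id) { this.id = id; }
    }

    interface Service {
        void submit(Task task);   // assumed to be a non-blocking submit call
    }

    class SlotAwareDispatcher {
        private final List<Service> services = new ArrayList<Service>();
        private final List<Integer> freeSlots = new ArrayList<Integer>();
        private int next = 0;     // round-robin cursor

        synchronized void addService(Service s, int cpus) {
            services.add(s);
            freeSlots.add(cpus);  // e.g. 256 CPUs behind each service
        }

        // Dispatch only when some service has a free CPU; otherwise the
        // task stays at the client and is never queued at a service.
        synchronized boolean dispatch(Task t) {
            for (int i = 0; i < services.size(); i++) {
                int idx = (next + i) % services.size();
                if (freeSlots.get(idx) > 0) {
                    freeSlots.set(idx, freeSlots.get(idx) - 1);
                    services.get(idx).submit(t);
                    next = (idx + 1) % services.size();
                    return true;
                }
            }
            return false;         // all CPUs busy: hold the task client-side
        }

        // Called when a service reports a completion, freeing one of its slots.
        synchronized void taskCompleted(int serviceIndex) {
            freeSlots.set(serviceIndex, freeSlots.get(serviceIndex) + 1);
        }
    }

With up to 640 services of 256 CPUs each, the trade-off is that the single
client has to track on the order of 160K slot states itself, which is why the
cost of the client-side scheduling matters.
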
> 
> Mihael Hategan wrote: 
> > On Thu, 2008-06-19 at 18:56 -0500, Michael Wilde wrote:
> > 
> > > What Ioan did in Falkon when he went to the multiple-server architecture 
> > > is relevant here: the client load-shares among all the servers, 
> > > round-robin, only sending a job to a server when it knows that the 
> > > server has a free cpu slot. In this way, no queues build up on the 
> > > servers, and it avoids having a job wait in any server's queue when a 
> > > free cpu might be available on some other server.
> > > 
> > 
> > If you have O(1) scheduling, this shouldn't be necessary. It's like
> > i2u2: Don't build a cluster to reduce the odds of triggering a problem.
> > Fix the problem instead.
> > 



