[Swift-devel] Falkon and Coaster support for MPI

Mon Jun 30 03:43:45 CDT 2008

A few thoughts:

1) It should be straightforward to submit MPI programs from Swift via
the GRAM provider; the only issue is passing the appropriate
parameters to the GRAM submission. (I realize this is not completely
trivial, but as Ben says, we have done it before.)

2) The challenge is doing this in conjunction with multi-level  
scheduling (aka Falkon/Coaster/Glideins), which we may require for  
reasons of feasibility (on BG/P, where I don't think you can request  
less than a rack?) and/or performance.

3) So my view is that we want to set up Swift to support both modes  
(1) and (2).

4) Ioan points out that a fully general multi-level scheduling
solution with support for multi-CPU jobs may introduce the need for a
smarter scheduler than our current FIFO approach. E.g., if we have 256
nodes and a queue with jobs of sizes {32,256,32,32,32,32,32,32,32,32},
a FIFO strategy would run them in that order and waste much CPU time:
the 256-node job cannot start until the first 32-node job finishes,
and in the meantime the other 224 nodes sit idle. On the other hand, a
simple "first-fit" strategy might starve large jobs.
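
To make the queue example concrete, here is a minimal sketch in Python
(not Falkon or Coaster code). It assumes, purely for illustration, that
every job runs for exactly one time step, so all nodes are free again
at the start of each step. The names schedule and idle_per_step are
hypothetical.

```python
def schedule(sizes, nodes, policy):
    """Return, for each time step, the indices of the jobs started in it.

    Illustrative assumption: every job runs for exactly one time step,
    so the full allocation is free again at the start of each step.
    """
    if any(size > nodes for size in sizes):
        raise ValueError("a job is larger than the whole allocation")
    pending = list(range(len(sizes)))
    steps = []
    while pending:
        free = nodes
        started = []
        for job in pending:
            if sizes[job] <= free:
                free -= sizes[job]
                started.append(job)
            elif policy == "fifo":
                # Strict FIFO: nothing may overtake the blocked job.
                break
            # First-fit: skip the job that does not fit and keep scanning.
        pending = [job for job in pending if job not in started]
        steps.append(started)
    return steps


def idle_per_step(sizes, nodes, steps):
    """Number of nodes left unused in each time step."""
    return [nodes - sum(sizes[job] for job in step) for step in steps]


QUEUE = [32, 256, 32, 32, 32, 32, 32, 32, 32, 32]

for policy in ("fifo", "first-fit"):
    steps = schedule(QUEUE, 256, policy)
    print(policy, steps, idle_per_step(QUEUE, 256, steps))
```

Under the equal-duration assumption, FIFO leaves 224 nodes idle in the
first step while the 256-node job waits behind the first 32-node job,
whereas first-fit fills the first step and leaves the idle nodes at
the tail. The sketch does not model staggered run times, which is
where first-fit could keep passing over the 256-node job.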

I think we should be nervous about getting into the business of  
implementing scheduler functionality like this.

I'd like to advocate that in the short term, we try to make this  
problem go away by requiring that if an application includes MPI  
tasks, they all be of the same size.

Of course the problem remains that we will probably still have a mix  
of uniprocessor (P=1) and multiprocessor (P=N) tasks. Again, we could  
make this problem go away by reserving some nodes for P=1 and some for  
P=N, at the cost of some inefficiency.
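
As a rough illustration of that trade-off, the sketch below splits an
allocation into a pool of nodes for P=1 tasks and fixed-size slots for
P=N tasks. The function name partition and the example numbers are
hypothetical; this is only the arithmetic, not a proposal for how
Falkon or Coaster would do it.

```python
def partition(total_nodes, mpi_size, serial_nodes):
    """Split an allocation into a serial pool and fixed-size MPI slots.

    Hypothetical helper: serial_nodes are reserved for P=1 tasks, and
    the rest is carved into slots of mpi_size nodes for P=N tasks.
    """
    if mpi_size < 1:
        raise ValueError("mpi_size must be at least 1")
    if not 0 <= serial_nodes <= total_nodes:
        raise ValueError("serial_nodes must be between 0 and total_nodes")
    mpi_nodes = total_nodes - serial_nodes
    slots = mpi_nodes // mpi_size
    return {
        "serial_nodes": serial_nodes,
        "mpi_slots": slots,
        # Nodes that fit neither pool: the static cost of the split.
        "unused_nodes": mpi_nodes - slots * mpi_size,
    }


print(partition(256, 32, 64))
print(partition(256, 32, 48))
```

With all MPI tasks the same size, scheduling within the P=N pool
reduces to counting free slots. The remaining inefficiency is any
remainder that does not fill a slot (16 nodes in the second call),
plus whatever idle time arises when one pool is busy and the other is
not.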

5) The question has been raised of how to implement (2). One proposal
is to adapt Coaster to support MPI jobs. I'm a bit concerned that this
could be expensive: we already have Falkon running well on BG/P, and
given our other commitments to support NSF user communities, putting
scarce resources into replicating that work may not be optimal.

Regards -- Ian.

On Jun 29, 2008, at 2:25 PM, Ioan Raicu wrote:

> How do you guarantee that N nodes will all start at the same time?   
> At least, in Falkon, this is the largest problem that I am not sure  
> how to address... the Falkon scheduler doesn't know how to handle a  
> task of N processors, it only knows tasks of 1 processor...
>
> Let's take an example.  Assume you have 256 CPUs free.  Let's say you
> get an MPI job for 32 CPUs; the new, improved scheduler could simply
> replicate that single-CPU task 32 times and start the task on 32
> CPUs.  Now, while the 32-CPU MPI job is running, a 256-CPU MPI job
> comes in and needs to be scheduled.  The naive replication of 1 ==>
> 256 tasks doesn't work anymore, as there are not 256 CPUs free, only
> 224.  So the scheduler needs to be smart enough to wait for the
> 32-CPU MPI job to finish before it can launch the 256-CPU job, and to
> make sure that no other job slips in ahead of it and uses up any of
> the free CPUs.  All this is certainly doable, as PBS, SGE, Condor,
> etc. (most LRMs) can deal with MPI jobs just fine, but it's certainly
> not a trivial addition to Falkon, and it's not clear to me how easy
> or hard it would be in Coaster either, depending on how much effort
> is put into the scheduler that feeds jobs to Coaster.
>
> Ioan
>
> Ben Clifford wrote:
>> I had a brief look at how MPI runs inside PBS.
>>
>> It looks something like:
>>
>> 1. The user requests n nodes from PBS.
>> 2. PBS allocates those n nodes and writes a description of them all
>>    into a per-job (not per-node) $PBS_NODEFILE.
>> 3. PBS runs the same MPI command line on all of the nodes,
>>    specifying $PBS_NODEFILE somewhere in that command line.
>> 4. The MPI command line uses $PBS_NODEFILE to know how to get to
>>    all the nodes.
>>
>> It might not be too much of a change to the coaster code to get
>> this behaviour. You'd need to be able to specify a node count
>> (rather than the implicit count of 1 at the moment) and have the
>> coaster manager handle starting everything up at the same time and
>> providing something like $PBS_NODEFILE.
>>
>>
>
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================