[Swift-devel] Falkon and Coaster support for MPI

Mon Jun 30 12:18:10 CDT 2008

I am just now catching up with the dozens of emails...

Ian Foster wrote:
> 4) Ioan points out that a fully general multi-level scheduling 
> solution with support for multi-CPU jobs may introduce the need for a 
> smarter scheduler than our current FIFO approach. E.g., if we have 256 
> nodes and a queue with jobs of size {32,256,32,32,32,32,32,32,32,32}, 
> a FIFO strategy would run them in that order, and waste much CPU time. 
> On the other hand, a simple "first-fit" strategy might starve large jobs.
This is all true. In the case of Falkon, there are further 
limitations. For example:
  - A 32-CPU MPI job starts and runs for 10 min.
  - A 256-CPU MPI job is ready to run, but not enough CPUs are 
    available.
What is easy to do in Falkon is to place the 256-CPU job back in the 
queue and process the next one, which needs 32 CPUs, and keep doing 
this until all 256 CPUs are free to schedule the 256-CPU MPI job. This 
means that the order will be {32, ...., 32, 256}, and this assumes that 
at some point the smaller MPI jobs stop coming and let the 256-CPU MPI 
job start; otherwise the 256-CPU MPI job runs the risk of being 
starved. 
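
To make the requeue behavior concrete, here is a toy simulation (not 
Falkon code). It assumes a fixed pool of CPUs and that every job runs 
for the same fixed time; the function name and the `runtime` parameter 
are made up for illustration.

```python
from collections import deque


def requeue_schedule(sizes, total_cpus, runtime=10):
    """Return job indices in start order under a requeue-on-miss policy.

    Assumption: every job runs for `runtime` time units.
    """
    if any(size > total_cpus for size in sizes):
        raise ValueError("a job needs more CPUs than the pool has")

    queue = deque(enumerate(sizes))
    running = []            # (finish_time, cpus) of jobs in flight
    free = total_cpus
    now = 0
    start_order = []

    while queue:
        misses = 0
        # Walk the queue: start whatever fits, put the rest back at the end.
        while queue and misses < len(queue):
            idx, size = queue.popleft()
            if size <= free:
                free -= size
                running.append((now + runtime, size))
                start_order.append(idx)
                misses = 0
            else:
                queue.append((idx, size))
                misses += 1
        if queue:
            # Nothing queued fits: advance time to the next job completion.
            running.sort()
            now, size = running.pop(0)
            free += size

    return start_order


# Ian's example queue on 256 CPUs: the 256-CPU job (index 1) starts last.
print(requeue_schedule([32, 256] + [32] * 8, 256))
# [0, 2, 3, 4, 5, 6, 7, 8, 9, 1]
```

If more 32-CPU jobs kept arriving in this model, job 1 would never find 
all 256 CPUs free, which is the starvation case described above.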

The thing that is a bit harder to achieve (in Falkon) is to actually 
pause all scheduling decisions when the scheduler reaches an MPI job 
that needs more CPUs than are free, so that enough CPUs drain and free 
up to let the larger MPI job go through as fast as possible.  Come to 
think of it, maybe this is not that hard to implement, as we could 
simply do a blocking wait until enough CPUs are freed up, and run the 
scheduler in a single-threaded mode to ensure that no other threads can 
schedule anything else. 
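
A minimal sketch of that blocking-wait idea, assuming a single 
scheduler loop and a shared counter of free CPUs. `CpuPool` and 
`run_in_order` are hypothetical names, not part of Falkon; the sketch 
uses only Python's standard `threading` module.

```python
import threading


class CpuPool:
    """Counts free CPUs; acquire() blocks until the request can be met."""

    def __init__(self, total):
        self.free = total
        self.cond = threading.Condition()

    def acquire(self, n):
        with self.cond:
            # Blocking wait: sleep until at least n CPUs have drained.
            self.cond.wait_for(lambda: self.free >= n)
            self.free -= n

    def release(self, n):
        with self.cond:
            self.free += n
            self.cond.notify_all()


def run_in_order(jobs, pool):
    """Single scheduler loop: strict queue order, blocking on each job.

    `jobs` is a list of (name, cpus, work) tuples, where `work` is a
    callable standing in for the MPI job itself.
    """
    started = []
    workers = []
    for name, cpus, work in jobs:
        pool.acquire(cpus)      # nothing behind this job is scheduled
        started.append(name)

        def body(cpus=cpus, work=work):
            try:
                work()
            finally:
                pool.release(cpus)

        t = threading.Thread(target=body)
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
    return started
```

Because only the one loop ever calls `acquire`, a 256-CPU job at the 
head of the queue holds up everything behind it until the pool has 
drained, which preserves order at the cost of idle CPUs.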

So, in a way, I guess it's possible to do both of these, probably not 
at the same time, but configurable at startup time: whether you want to 
maintain order requirements (and potentially get poor utilization), or 
whether you can re-order jobs and do a smallest-job-first ordering that 
will maximize the utilization (but potentially starve large jobs).
>
> I think we should be nervous about getting into the business of 
> implementing scheduler functionality like this.
That was my impression as well, at least when it comes to adding this 
kind of logic to Falkon.  In the end, it's probably not as hard as I 
thought it would be, but with any new code/functionality there is 
always a bag of new bugs, so the time investment is certainly not 
trivial.
>
> I'd like to advocate that in the short term, we try to make this 
> problem go away by requiring that if an application includes MPI 
> tasks, they all be of the same size.
Yes, that would certainly make it easier on the implementation side.

Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================