[mpich-discuss] Using MPI_Comm_split to MPI_COMM_LOCAL

John Bent johnbent at lanl.gov
Wed Nov 10 11:44:06 CST 2010


Excerpts from Dave Goodell's message of Wed Nov 10 10:20:52 -0700 2010:
> On Nov 8, 2010, at 4:47 PM CST, John Bent wrote:
> 
> > We'd like to create an MPI Communicator for just the processes on each
> > local node (i.e. something like MPI_COMM_LOCAL).  We were doing this
> > previously very naively by having everyone send out their hostnames and
> > then doing string parsing.  We realize that a much simpler way to do it
> > would be to use MPI_Comm_split to split MPI_COMM_WORLD by the IP
> > address.  Unfortunately, the IP address is 64 bits and the max "color"
> > to pass to MPI_Comm_split is only 2^16.  So we're currently planning on
> > splitting iteratively on each 16 bits in the 64 bit IP address.
> 
> Hi John,
> 
Hi Dave, thanks for your very helpful reply!  I've interspersed a bit
below.

> Are your IP addresses really 64 bits?  IPv4 addresses are 32-bit and (AFAIK) full IPv6 addresses are 128-bit.  If you have IPv6 then maybe you could just use the low order 64-bits for most HPC MPI scenarios, but I'm not overly knowledgeable about IPv6...
> 
Oh, good catch about IPv6.

> Also, as I read the MPI-2.2 standard, the only restriction on color values is that it is a non-negative integer.  So FWIW you really have 2^31 values available on most platforms.
> 
Unfortunately, it appears (from a cursory Google search and from
checking on two 64-bit architectures here at lanl) that ints are still
just 32 bits even on 64-bit architectures.  And since the color
parameter is a signed int that must be non-negative, we only get 2^31
usable values, i.e. 31 of those 32 bits.
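
For reference, the iterative split we're planning looks roughly like
the sketch below (untested; get_node_id64() is just a placeholder for
however we end up packing the 64-bit per-node value):

#include <mpi.h>
#include <stdint.h>

/* Placeholder: returns a 64-bit value that is identical for every
 * rank on the same node (e.g. the node's address packed into an
 * integer). */
uint64_t get_node_id64(void);

/* Split MPI_COMM_WORLD into per-node communicators by splitting
 * iteratively on each 16-bit chunk of the 64-bit node id. */
MPI_Comm split_by_node_id(void)
{
    uint64_t id = get_node_id64();
    int rank;
    MPI_Comm comm = MPI_COMM_WORLD, next;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int shift = 48; shift >= 0; shift -= 16) {
        int color = (int)((id >> shift) & 0xFFFF);  /* 0..65535 */
        MPI_Comm_split(comm, color, rank, &next);
        if (comm != MPI_COMM_WORLD)
            MPI_Comm_free(&comm);   /* drop the intermediate comm */
        comm = next;
    }
    return comm;  /* ranks sharing the full 64-bit id end up together */
}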

> > Anyone know a better way to achieve MPI_COMM_LOCAL?  Or can
> > MPI_Comm_split be enhanced to take a 64 bit color?
> 
> For MPICH2 we could conceivably add an extension (MPIX_Comm_split64 or
> whatever) that took a longer (perhaps arbitrary length) color.  Also,
> the MPI Forum could provide this sort of capability in future versions
> of the MPI standard.  But there's nothing that can be done to
> MPI_Comm_split itself in the short term.
> 
> w.r.t. the end goal of MPI_COMM_LOCAL: In MPICH2 we usually create
> this communicator anyway, so we also could probably expose this
> communicator directly in an easy fashion with some sort of extension.
> 
That'd be excellent.  Are there other communicators also generated?
Specifically, once I have the MPI_COMM_LOCAL, I'm planning on making
another communicator consisting of all the rank 0's from all the
MPI_COMM_LOCAL's (call this MPI_COMM_ONE_PER_NODE).  We specifically
want this for doing intelligent scheduling of file system accesses.  
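
For concreteness, once we have the per-node communicator, I'm imagining
MPI_COMM_ONE_PER_NODE would come from something like this (untested
sketch; comm_local stands for whatever per-node communicator gets
exposed):

#include <mpi.h>

/* Build a communicator containing only the rank-0 process of each
 * per-node communicator.  Every other rank gets MPI_COMM_NULL. */
MPI_Comm make_one_per_node(MPI_Comm comm_local)
{
    int world_rank, local_rank;
    MPI_Comm comm_one_per_node;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(comm_local, &local_rank);

    /* Color 0 for node leaders, MPI_UNDEFINED for everyone else;
     * key = world rank keeps the leaders in world-rank order. */
    int color = (local_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank,
                   &comm_one_per_node);

    return comm_one_per_node;  /* MPI_COMM_NULL on non-leader ranks */
}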

More generally, once we have an MPI_COMM_ONE_PER_NODE, we could also use
it to optimize general MPI communication; e.g. first broadcast across
MPI_COMM_ONE_PER_NODE and then broadcast again within each
MPI_COMM_LOCAL.  But I assume this sort of optimization is not
necessary, since the internal implementations of these collectives
presumably do this sort of thing (and hopefully even better!) already.
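
(Roughly this, as a sketch, assuming the data originates on world rank
0 and that rank is also rank 0 in both its local and its leader
communicator:)

#include <mpi.h>

/* Two-stage broadcast: first across the node leaders, then within
 * each node.  comm_leaders is expected to be MPI_COMM_NULL on every
 * rank that is not rank 0 of its comm_local. */
void bcast_hierarchical(void *buf, int count, MPI_Datatype type,
                        MPI_Comm comm_leaders, MPI_Comm comm_local)
{
    /* Stage 1: broadcast among the node leaders. */
    if (comm_leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, comm_leaders);

    /* Stage 2: each leader re-broadcasts within its own node. */
    MPI_Bcast(buf, count, type, 0, comm_local);
}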

> Otherwise, your iterative solution seems very reasonable, especially as a fix entirely outside of the MPI library.  Alternatively you could re-implement MPI_Comm_split at the user level with communication calls and MPI_Comm_create.  This could be a bit more efficient if you take the time to do it right.
> 
That's what I was thinking too, but I don't know how to make it any
more efficient.  Off the top of your head, could you propose some basic
pseudocode for this or describe the algorithm, please?
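
(For the record, the naive user-level version I can picture is below;
it ignores the key argument and MPI_UNDEFINED, and assumes I'm reading
MPI-2.2 correctly in that MPI_Comm_create accepts disjoint groups.  I
just don't see where the extra efficiency would come from.)

#include <mpi.h>
#include <stdlib.h>

/* Rough user-level stand-in for MPI_Comm_split(comm, color, ...)
 * built from MPI_Allgather plus MPI_Group_incl/MPI_Comm_create. */
int comm_split_user(MPI_Comm comm, int color, MPI_Comm *newcomm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Everyone learns everyone else's color. */
    int *colors = malloc(size * sizeof(int));
    MPI_Allgather(&color, 1, MPI_INT, colors, 1, MPI_INT, comm);

    /* Collect the ranks that share my color, in rank order. */
    int *members = malloc(size * sizeof(int));
    int nmembers = 0;
    for (int i = 0; i < size; i++)
        if (colors[i] == color)
            members[nmembers++] = i;

    /* Each rank passes the group for its own color; MPI_Comm_create
     * then hands back one communicator per disjoint group. */
    MPI_Group comm_group, new_group;
    MPI_Comm_group(comm, &comm_group);
    MPI_Group_incl(comm_group, nmembers, members, &new_group);
    MPI_Comm_create(comm, new_group, newcomm);

    MPI_Group_free(&new_group);
    MPI_Group_free(&comm_group);
    free(colors);
    free(members);
    return MPI_SUCCESS;
}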
-- 
Thanks,

John 

