[mpich-discuss] Using MPI_Comm_split to MPI_COMM_LOCAL

Dave Goodell goodell at mcs.anl.gov
Wed Nov 10 14:51:32 CST 2010


On Nov 10, 2010, at 11:44 AM CST, John Bent wrote:

> Excerpts from Dave Goodell's message of Wed Nov 10 10:20:52 -0700 2010:
>> On Nov 8, 2010, at 4:47 PM CST, John Bent wrote:
>> 
>>> We'd like to create an MPI Communicator for just the processes on each
>>> local node (i.e. something like MPI_COMM_LOCAL).  We were doing this
>>> previously very naively by having everyone send out their hostnames and
>>> then doing string parsing.  We realize that a much simpler way to do it
>>> would be to use MPI_Comm_split to split MPI_COMM_WORLD by the IP
>>> address.  Unfortunately, the IP address is 64 bits and the max "color"
>>> to pass to MPI_Comm_split is only 2^16.  So we're currently planning on
>>> splitting iteratively on each 16 bits in the 64 bit IP address.
>> 
>> Are your IP addresses really 64 bits?  IPv4 addresses are 32 bits and (AFAIK) full IPv6 addresses are 128 bits.  If you have IPv6 then maybe you could just use the low-order 64 bits for most HPC MPI scenarios, but I'm not overly knowledgeable about IPv6...
>> 
> Oh, good catch about IPv6.

So what size are your addresses actually?

>> Also, as I read the MPI-2.2 standard, the only restriction on a color value is that it be a non-negative integer.  So FWIW you really have 2^31 values available on most platforms.
>> 
> Unfortunately, it appears (from a cursory Google search and
> checking on two 64-bit architectures here at LANL) that ints are
> still just 32 bits even on 64-bit architectures.  And since the color
> parameter is a signed int but is required to be non-negative, we have
> just half of those bits.

That's not really how binary representations work...  If you have a two's-complement binary integer (the mainstream representation), then you only lose _one_ bit if you restrict yourself to non-negative numbers.  For example, let's assume you have 4-bit numbers:

0000:  0
0001:  1
0010:  2
0011:  3
0100:  4
0101:  5
0110:  6
0111:  7
----8<----
1000: -8
1001: -7
1010: -6
1011: -5
1100: -4
1101: -3
1110: -2
1111: -1

Excluding negative numbers still leaves you with all of the numbers above the cut line, which is 8 values, or 3 bits' worth.  So in the 32-bit int case that I was assuming, you still have 31 usable bits if you restrict yourself to non-negative color values.
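
For what it's worth, here is a rough, untested sketch of what the iterative split could look like for a 32-bit IPv4 address: two rounds of 16 bits each, so every color fits comfortably in a non-negative int.  The get_local_ipv4() helper is just a stand-in for however you already obtain the node's address:

#include <mpi.h>
#include <stdint.h>

/* Hypothetical helper: returns this node's IPv4 address as a 32-bit value. */
extern uint32_t get_local_ipv4(void);

/* Split comm_world so that the resulting communicator contains exactly
 * the processes whose full 32-bit address matches, 16 bits per round. */
MPI_Comm split_by_address(MPI_Comm comm_world)
{
    uint32_t addr = get_local_ipv4();
    MPI_Comm cur = comm_world, next;
    int rank;

    MPI_Comm_rank(comm_world, &rank);

    /* Round 1 uses the high 16 bits, round 2 the low 16 bits.  Each
     * chunk is in [0, 65535], well under the 2^31 - 1 limit. */
    for (int shift = 16; shift >= 0; shift -= 16) {
        int color = (int)((addr >> shift) & 0xFFFF);
        MPI_Comm_split(cur, color, rank, &next);
        if (cur != comm_world)
            MPI_Comm_free(&cur);
        cur = next;
    }
    return cur;  /* the node-local communicator */
}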

>>> Anyone know a better way to achieve MPI_COMM_LOCAL?  Or can
>>> MPI_Comm_split be enhanced to take a 64 bit color?
>> 
>> w.r.t. the end goal of MPI_COMM_LOCAL: In MPICH2 we usually create
>> this communicator anyway, so we also could probably expose this
>> communicator directly in an easy fashion with some sort of extension.
>> 
> That'd be excellent.  Are there other communicators also generated?
> Specifically, once I have the MPI_COMM_LOCAL, I'm planning on making
> another communicator consisting of all the rank 0's from all the
> MPI_COMM_LOCAL's (call this MPI_COMM_ONE_PER_NODE).  We specifically
> want this for doing intelligent scheduling of file system accesses.  

Yes, we also (usually) have this communicator available.
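
In case it's useful, here is a minimal, untested sketch of how you could derive the one-per-node communicator yourself once a node-local communicator exists; comm_world and comm_local are placeholders for whatever you build on your end:

#include <mpi.h>

/* All processes call this collectively.  Local rank 0 on each node
 * joins the leader communicator; everyone else gets MPI_COMM_NULL. */
MPI_Comm make_one_per_node(MPI_Comm comm_world, MPI_Comm comm_local)
{
    int world_rank, local_rank;
    MPI_Comm comm_leaders;

    MPI_Comm_rank(comm_world, &world_rank);
    MPI_Comm_rank(comm_local, &local_rank);

    /* Non-leaders pass MPI_UNDEFINED and are left out of the split. */
    int color = (local_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(comm_world, color, world_rank, &comm_leaders);
    return comm_leaders;
}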

> More generally, once we have an MPI_COMM_ONE_PER_NODE, we could also use
> it to optimize general MPI communications; e.g. first broadcast to
> MPI_COMM_ONE_PER_NODE and then broadcast again on MPI_COMM_LOCAL.  But I
> assume this sort of optimization is not necessary, since the internal
> implementations of these communications presumably do this sort of thing
> (and hopefully even better!) already.

This sort of collective optimization is the exact reason that we already have these two communicators lying around.  They would need to be comm_dup'ed in order to avoid stomping on other user communication, but that's a fairly easy task as long as we're careful about the context IDs under the hood.
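
Just to illustrate the two-level scheme you describe, a sketch might look like the following.  It assumes the data starts on a process that is local rank 0 on its node and rank 0 in the leader communicator, and that comm_leaders is MPI_COMM_NULL on non-leader ranks as in the sketch above:

#include <mpi.h>

/* Two-level broadcast: first across the node leaders, then within
 * each node.  Every process calls this with its comm_local; only the
 * leaders have a non-NULL comm_leaders. */
void bcast_hierarchical(void *buf, int count, MPI_Datatype type,
                        MPI_Comm comm_leaders, MPI_Comm comm_local)
{
    /* Step 1: leader rank 0 fans the data out to the other leaders. */
    if (comm_leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, comm_leaders);

    /* Step 2: each leader (local rank 0) fans out within its node. */
    MPI_Bcast(buf, count, type, 0, comm_local);
}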

>> Otherwise, your iterative solution seems very reasonable, especially as a fix entirely outside of the MPI library.  Alternatively you could re-implement MPI_Comm_split at the user level with communication calls and MPI_Comm_create.  This could be a bit more efficient if you take the time to do it right.
>> 
> That's what I was thinking too but I don't know how to make it a bit
> more efficient.  Off the top of your head could you propose some basic
> pseudocode for this or describe the algorithm please?

By "more efficient", I meant more efficient than your strategy of several iterations of comm_split.  This mostly comes down to avoiding the extra overhead of allocating multiple communicators, but there could be other minor efficiencies as well, especially depending on the numerical distribution of colors.

The current implementation of MPI_Comm_split in MPICH2 would be a good starting point.  Note that it isn't an overly efficient implementation; there are better algorithms and implementations out there.  Bill Gropp had a nice paper about this recently at EuroMPI: http://www.springerlink.com/content/934177v2k58kuqh6/

Here's the code to look at if you want to pursue this route: https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/comm/comm_split.c#L54
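
To give a flavor of the user-level approach, here is an untested sketch built on MPI_Allgather plus MPI_Comm_create.  It assumes every rank participates (no MPI_UNDEFINED colors), keeps ranks in their existing order within a color (i.e. key == rank), and omits error handling:

#include <mpi.h>
#include <stdlib.h>

int user_comm_split(MPI_Comm comm, int color, MPI_Comm *newcomm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Everyone learns everyone else's color. */
    int *colors = malloc(size * sizeof(int));
    MPI_Allgather(&color, 1, MPI_INT, colors, 1, MPI_INT, comm);

    /* Collect, in rank order, the ranks that share my color; every
     * process with the same color builds an identical list. */
    int *members = malloc(size * sizeof(int));
    int nmembers = 0;
    for (int i = 0; i < size; i++)
        if (colors[i] == color)
            members[nmembers++] = i;

    MPI_Group world_group, new_group;
    MPI_Comm_group(comm, &world_group);
    MPI_Group_incl(world_group, nmembers, members, &new_group);

    /* MPI_Comm_create is collective over comm; the groups passed by
     * different colors are disjoint, so each color gets its own comm. */
    int rc = MPI_Comm_create(comm, new_group, newcomm);

    MPI_Group_free(&new_group);
    MPI_Group_free(&world_group);
    free(members);
    free(colors);
    return rc;
}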

But your iterative solution sounds like a very pragmatic approach to me.

-Dave



