[mpich-discuss] best/fastest way to get a node communicator

Jim Dinan dinan at mcs.anl.gov
Tue Jan 17 17:00:07 CST 2012


Hi Jeff,

If you're worried about space, you could follow a simple algorithm that 
splits off one node's group per iteration.  This uses constant space but 
needs roughly one round per node, so it will scale poorly.  There's 
probably some algorithm in the middle that would improve on this (have N 
processes broadcast their names to get a more effective split?).  Here's 
a (very) rough sketch:

MPI_Comm node_comm   = MPI_COMM_NULL;
MPI_Comm parent_comm;

// Dup so we don't leak communicators
MPI_Comm_dup(MPI_COMM_WORLD, &parent_comm);

while (node_comm == MPI_COMM_NULL) {
   char my_name[MPI_MAX_PROCESSOR_NAME];
   char root_name[MPI_MAX_PROCESSOR_NAME];
   int  rank, len;
   MPI_Comm old_parent;

   MPI_Comm_rank(parent_comm, &rank);
   MPI_Get_processor_name(my_name, &len);

   // Rank 0 broadcasts its node name; everyone compares against it
   if (rank == 0) {
     MPI_Bcast(my_name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, parent_comm);
     strncpy(root_name, my_name, MPI_MAX_PROCESSOR_NAME);
   } else {
     MPI_Bcast(root_name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, parent_comm);
   }

   old_parent = parent_comm;

   if (strcmp(my_name, root_name) == 0) {
     //  My group splits off, I'm done after this
     MPI_Comm_split(parent_comm, /* color */ 1, /* key */ rank, &node_comm);
   } else {
     // My group keeps going, separate from the others
     MPI_Comm_split(parent_comm, /* color */ 0, /* key */ rank, &parent_comm);
   }

   // Old parent is no longer needed
   MPI_Comm_free(&old_parent);
}

  ~Jim.

On 1/17/12 4:23 PM, Jeff Hammond wrote:
> Thanks for the helpful suggestions, Rhys.
>
> Using a perfect hash function was the first thing to jump into my
> head.  Unfortunately, I do not know how to write perfect hash
> functions for MPI_MAX_PROCESSOR_NAME-length character arrays, and the
> memory savings on clusters, which almost always have fewer than 20000
> cores and more than 20 GB of DRAM, aren't worth me learning how to use
> gperf.  The memory required for the Gather is an issue on a Blue Gene
> or a Cray, but like I said, I have
> another solution there.
>
> Whenever MPI_Get_processor_name returns an IP address, I don't need a
> hash function because I can just convert the IP address into an
> integer (32-bit for IPv4 and 128-bit for IPv6) and use that as the
> key.  A straightforward optimization is to test whether all nodes have
> returned an IP address and, if so, use that; otherwise fall back to
> the gather+qsort implementation.
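>
> For the IPv4 case that conversion is essentially inet_pton plus ntohl;
> a rough, untested sketch (assuming the processor name really is a
> dotted-quad string, and dropping the sign bit so the color stays
> nonnegative for MPI_Comm_split):
>
> #include <arpa/inet.h>
>
> /* Map "a.b.c.d" to a nonnegative color, or -1 if the name is not an
>  * IPv4 address (in which case fall back to gather+qsort). */
> static int ipv4_name_to_color(const char *name)
> {
>   struct in_addr addr;
>
>   if (inet_pton(AF_INET, name, &addr) != 1)
>     return -1;
>
>   /* Nodes in one cluster share the leading address bits, so masking
>    * the top bit to keep the value nonnegative should not collide. */
>   return (int)(ntohl(addr.s_addr) & 0x7fffffff);
> }
>
> Each rank would feed that color straight into MPI_Comm_split once it
> has checked that every rank's name converted successfully.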
>
> Jeff
>
>
> On Tue, Jan 17, 2012 at 3:39 PM, Rhys Ulerich<rhys.ulerich at gmail.com>  wrote:
>>> On 01/17/2012 01:24 PM, Jeff Hammond wrote:
>>>> I'm interested in being able to create a communicator for each node.
>>>> I have custom APIs for this on Blue Gene and Cray, but that doesn't
>>>> help on clusters.
>>>>
>>>> I was thinking of doing the following:
>>>> - call gethostname on each rank
>>>> - gather these values to root
>>>> - sort the array and assign a different color number for each unique
>>>> value in a new array
>>>> - scatter the color array and call comm_split
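>>>>
>>>> In code those steps come out to roughly the following untested
>>>> sketch (with a quadratic scan at the root standing in for the
>>>> sort-based color assignment):
>>>>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>> #include <mpi.h>
>>>>
>>>> /* Gather names to rank 0, assign one color per unique name,
>>>>  * scatter the colors, and split. */
>>>> void split_by_node(MPI_Comm comm, MPI_Comm *node_comm)
>>>> {
>>>>   char  name[MPI_MAX_PROCESSOR_NAME];
>>>>   char *all = NULL;
>>>>   int  *colors = NULL, len, rank, size, color;
>>>>
>>>>   MPI_Comm_rank(comm, &rank);
>>>>   MPI_Comm_size(comm, &size);
>>>>   memset(name, 0, sizeof(name));
>>>>   MPI_Get_processor_name(name, &len);
>>>>
>>>>   if (rank == 0) {
>>>>     all    = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
>>>>     colors = malloc((size_t)size * sizeof(int));
>>>>   }
>>>>   MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
>>>>              all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, comm);
>>>>
>>>>   if (rank == 0) {
>>>>     int next = 0;
>>>>     for (int i = 0; i < size; i++) {
>>>>       colors[i] = -1;
>>>>       for (int j = 0; j < i; j++) {
>>>>         if (strcmp(all + i * MPI_MAX_PROCESSOR_NAME,
>>>>                    all + j * MPI_MAX_PROCESSOR_NAME) == 0) {
>>>>           colors[i] = colors[j];
>>>>           break;
>>>>         }
>>>>       }
>>>>       if (colors[i] == -1)
>>>>         colors[i] = next++;
>>>>     }
>>>>   }
>>>>
>>>>   MPI_Scatter(colors, 1, MPI_INT, &color, 1, MPI_INT, 0, comm);
>>>>   MPI_Comm_split(comm, color, rank, node_comm);
>>>>
>>>>   free(all);
>>>>   free(colors);
>>>> }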
>>>>
>>>> Does anyone know of a better/faster way to do this?
>>
>> This may avoid the memory overhead on root and avoids the sort.
>>
>> - call MPI_Get_processor_name as Jim Dinan suggested
>> - hash the processor name into a nonnegative int color (with care as
>> to the chosen hash function)
>> - MPI_Comm_split on the color
>> - Set value '1' into some integer buffer on all ranks.
>> - MPI_Allreduce MPI_SUM on each integer buffer to find the number of
>> ranks in each rank's color-communicator
>> - MPI_Allreduce to find the minimum and maximum of the ranks in each
>> color-communicator across COMM_WORLD
>> - If minimum and maximum are not identical, throw away the bad
>> node-specific communicators (you encountered a hash collision), add
>> some salt to the hash (maybe an iteration number), and repeat the
>> process.
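>>
>> As a rough, untested sketch of those steps (the hash below is just
>> djb2 with an iteration-number salt, picked purely for illustration,
>> and the size check assumes every node runs the same number of ranks):
>>
>> #include <mpi.h>
>>
>> MPI_Comm node_comm_by_hash(MPI_Comm comm)
>> {
>>   char name[MPI_MAX_PROCESSOR_NAME];
>>   int  len, salt = 0;
>>
>>   MPI_Get_processor_name(name, &len);
>>
>>   for (;;) {
>>     MPI_Comm node_comm;
>>     unsigned long h = 5381UL + (unsigned long)salt;
>>     int color, one = 1, nranks, sendbuf[2], minmax[2];
>>
>>     /* Hash the processor name into a nonnegative color */
>>     for (int i = 0; i < len; i++)
>>       h = 33UL * h + (unsigned char)name[i];
>>     color = (int)(h & 0x7fffffff);
>>
>>     MPI_Comm_split(comm, color, 0, &node_comm);
>>
>>     /* Number of ranks that landed in my color-communicator */
>>     MPI_Allreduce(&one, &nranks, 1, MPI_INT, MPI_SUM, node_comm);
>>
>>     /* Smallest and largest such count across comm; a mismatch means
>>      * two node names collided on one color.  (A collision that still
>>      * produces equal counts everywhere would slip through.) */
>>     sendbuf[0] = nranks;
>>     sendbuf[1] = -nranks;
>>     MPI_Allreduce(sendbuf, minmax, 2, MPI_INT, MPI_MIN, comm);
>>
>>     if (minmax[0] == -minmax[1])
>>       return node_comm;          /* all counts agree: accept it */
>>
>>     MPI_Comm_free(&node_comm);   /* suspected collision: re-salt */
>>     salt++;
>>   }
>> }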
>>
>> Presumably you won't be hit by hash collisions indefinitely.  With a
>> good hash function, you probably won't hit any collisions at all.
>>
>> It is likely not faster (as it is a bit chatty) but I've never
>> measured it.  Nor implemented it.
>>
>> - Rhys
>> _______________________________________________
>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> To manage subscription options or unsubscribe:
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
>

