[MPICH] Doing collectives a better way in not-so-small nodes

Sylvain Jeaugey sylvain.jeaugey at bull.net
Tue Nov 28 03:12:03 CST 2006


Hi all,

When using not-so-small nodes, i.e. nodes with more than 4 cores, most
collective operations suffer a lot from intra-node contention.
Therefore, the best performance may be obtained by using different
algorithms for the intra-node and inter-node parts.

Example: allreduce:
intra-node preferred algorithm: tree-based (reduce, bcast)
inter-node preferred algorithm: the current one (Rabenseifner's algorithm)

So, basically, what I'd like to do is elect a master on each node,
which would do a tree-based reduce within the node, then perform an
inter-node allreduce with the other nodes' masters, and finally perform
a bcast within the node.
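To make the idea concrete, here is a minimal user-level sketch of that
scheme, written with plain MPI calls rather than MPICH internals. All
names (hier_allreduce_double, node_comm, leader_comm) are hypothetical,
and I assume leader_comm is MPI_COMM_NULL on every rank that is not its
node's master:

#include <mpi.h>

/* Hierarchical allreduce: reduce on the node, allreduce between node
 * masters, then broadcast the result back inside the node. */
int hier_allreduce_double(double *sendbuf, double *recvbuf, int count,
                          MPI_Comm node_comm, MPI_Comm leader_comm)
{
    /* Step 1: tree-based reduce to the node master (rank 0 of node_comm). */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: only the node masters run the inter-node allreduce
     * (e.g. Rabenseifner's algorithm, as today). */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);

    /* Step 3: broadcast the final result to the other ranks on the node. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node_comm);
    return MPI_SUCCESS;
}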

To do that nicely, I would like to attach two other communicators to
any communicator: first, a localnode_comm, used to perform intra-node
collectives; then, an internode_comm, used to perform inter-node
collectives.

For example, using 16 8-core machines, localnode_comm would be of size 8
and internode_comm of size 16, whereas the original communicator would
contain 128 processes. The internode_comm would be defined only for one
process per node, the only one allowed to perform global collectives.

localnode_comm and internode_comm would be defined at comm creation 
according to the underlying topology.
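In case it helps the discussion, here is one possible way to build the
two communicators at the MPI level with nothing but MPI_Comm_split.
Deriving the node color from a hash of the processor name is of course
a simplification (it can collide); a real implementation would use the
topology information the device already has. build_node_comms is a
hypothetical name:

#include <mpi.h>

static void build_node_comms(MPI_Comm comm, MPI_Comm *localnode_comm,
                             MPI_Comm *internode_comm)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen, rank, local_rank, i;
    unsigned color = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Get_processor_name(name, &namelen);

    /* Cheap hash of the host name: processes on the same node get the
     * same color (collisions are ignored in this sketch). */
    for (i = 0; i < namelen; i++)
        color = color * 31u + (unsigned char)name[i];
    color &= 0x7fffffffu;   /* split colors must be non-negative */

    /* All the ranks of a node end up in the same localnode_comm. */
    MPI_Comm_split(comm, (int)color, rank, localnode_comm);
    MPI_Comm_rank(*localnode_comm, &local_rank);

    /* Only the node masters (local rank 0) join internode_comm; the
     * other ranks pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   internode_comm);
}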

With this, many collectives could be written like this:

MPI_MyColl(comm) {
    if (comm->collctx->type == real) {                  /* a real communicator */
        MPI_MyColl_up(comm->collctx->localnode_comm);   /* reduce for allreduce */
        if (comm->collctx->internode_comm)              /* I am the master on my node */
            MPI_MyColl(comm->collctx->internode_comm);
        MPI_MyColl_down(comm->collctx->localnode_comm); /* bcast for allreduce */
    } else if (comm->collctx->type == local) {
        /* put here the algorithm for MyColl, using shared memory */
    } else if (comm->collctx->type == inter) {
        /* put here the algorithm for MyColl, using the network only */
    }
}

I'd like to know what you think is the best way to do it.

Thanks in advance,

Sylvain
