[MPICH] Doing collectives a better way in not-so-small nodes
Sylvain Jeaugey
sylvain.jeaugey at bull.net
Tue Nov 28 03:12:03 CST 2006
Hi all,
When using not-so-small nodes, i.e. nodes with more than 4 cores, most
collective operations suffer a lot from intra-node contention.
Therefore, the best performance may be obtained by using different
algorithms for the intra-node and inter-node parts.
Example: allreduce:
  intra-node preferred algorithm: tree-based (reduce + bcast)
  inter-node preferred algorithm: the current one (Rabenseifner's algorithm)
So, basically, what I'd like to do is elect a master on each node,
have it do a tree-based reduce within the node, then perform an
inter-node allreduce with the other nodes' masters, and finally perform
a bcast within the node.
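To make the idea concrete, here is a rough sketch of that composition,
assuming the two sub-communicators described further down already exist
and that the node master is rank 0 of the node-local communicator
(hier_allreduce and these conventions are only illustrative, not an
actual implementation):

#include <mpi.h>

/* Rough sketch: hierarchical allreduce built from standard collectives.
 * localnode_comm: all processes of this node.
 * internode_comm: one process per node (MPI_COMM_NULL on the others). */
int hier_allreduce(void *sendbuf, void *recvbuf, int count,
                   MPI_Datatype dtype, MPI_Op op,
                   MPI_Comm localnode_comm, MPI_Comm internode_comm)
{
    /* 1. intra-node reduce to the node master (local rank 0) */
    MPI_Reduce(sendbuf, recvbuf, count, dtype, op, 0, localnode_comm);

    /* 2. the masters combine the per-node partial results */
    if (internode_comm != MPI_COMM_NULL)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, dtype, op,
                      internode_comm);

    /* 3. intra-node bcast of the global result */
    MPI_Bcast(recvbuf, count, dtype, 0, localnode_comm);
    return MPI_SUCCESS;
}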
To do that nicely, I would like to add two other communicators to every
communicator. First, a localnode_comm, used to perform intra-node
collectives. Then an internode_comm, used to perform inter-node
collectives.
For example, on 16 8-core machines, localnode_comm would be of size 8
and internode_comm of size 16, while the original communicator would
contain 128 processes. The internode_comm would be defined for only one
process per node, the only one allowed to perform global collectives.
localnode_comm and internode_comm would be defined at communicator
creation time according to the underlying topology.
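To give an idea of how they could be built (again only a sketch: the
function name is illustrative, and how the per-node colour is obtained
is deliberately left to the topology detection code), two
MPI_Comm_split calls are enough:

#include <mpi.h>

/* Sketch: build the two sub-communicators at communicator creation time.
 * node_color is assumed to identify the node (e.g. derived from the host
 * name). */
void build_hier_comms(MPI_Comm comm, int node_color,
                      MPI_Comm *localnode_comm, MPI_Comm *internode_comm)
{
    int rank, local_rank;
    MPI_Comm_rank(comm, &rank);

    /* processes sharing node_color end up in the same local communicator */
    MPI_Comm_split(comm, node_color, rank, localnode_comm);
    MPI_Comm_rank(*localnode_comm, &local_rank);

    /* only the node masters (local rank 0) join the inter-node
     * communicator; the others get MPI_COMM_NULL */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   internode_comm);
}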
With this, many collectives could be written like this:

MPI_MyColl(comm) {
    if (comm->collctx->type == real) {      /* this is a real communicator */
        MPI_MyColl_up(comm->collctx->localnode_comm);    /* reduce for allreduce */
        if (comm->collctx->internode_comm)  /* I am the master in my node */
            MPI_MyColl(comm->collctx->internode_comm);
        MPI_MyColl_down(comm->collctx->localnode_comm);  /* bcast for allreduce */
    } else if (comm->collctx->type == local) {
        /* put here the algorithm for MyColl, using shm */
    } else if (comm->collctx->type == inter) {
        /* put here the algorithm for MyColl, using the network only */
    }
}
I'd like to know what you think is the best way to do this.
Thanks in advance,
Sylvain