[MPICH] Doing collectives a better way in not-so-small nodes
Rajeev Thakur
thakur at mcs.anl.gov
Fri Dec 8 12:38:18 CST 2006
Sylvain,
Your approach sounds good. We have been meaning to add
topology-aware collectives to MPICH2 for a while now, but haven't gotten
around to it. We hope to add something next year.
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Sylvain Jeaugey
> Sent: Tuesday, November 28, 2006 3:12 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Doing collectives a better way in not-so-small nodes
>
> Hi all,
>
> When using not-so-small nodes, meaning nodes with more than 4 cores,
> most collective operations suffer a lot from intra-node contention.
> Therefore, the best performance may be obtained by using different
> algorithms for the intra-node and inter-node stages.
>
> Example: allreduce
>   intra-node preferred algorithm: tree-based (reduce, bcast)
>   inter-node preferred algorithm: the current one (Rabenseifner's algorithm)
>
> So, basically, what I'd like to do is to elect a master on each node,
> which will do a tree-based reduce within the node, then perform an
> inter-node allreduce with other nodes' masters, then perform a bcast
> within the node.
>
> To do that nicely, I would like to attach two other communicators to
> every communicator: first, a localnode_comm, used to perform intra-node
> collectives; then an internode_comm, used to perform inter-node
> collectives.
>
> For example, using 16 8-core machines, localnode_comm would be of size 8
> and internode_comm of size 16, while the original communicator would
> contain 128 processes. The internode_comm would be defined for only one
> process per node, the only one allowed to perform global collectives.
>
> localnode_comm and internode_comm would be defined at comm creation
> according to the underlying topology.
>
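The layout described above can be made concrete with a small sketch in
plain C (no MPI calls). It assumes block placement of ranks (ranks 0..7 on
node 0, 8..15 on node 1, and so on) and that local rank 0 is the elected
master; neither assumption comes from the original message. The names
node_layout and layout_of are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical helper type: where a world rank sits in the two-level layout. */
typedef struct {
    int node;        /* index of the node hosting this rank */
    int local_rank;  /* rank inside localnode_comm */
    int local_size;  /* size of localnode_comm */
    int inter_rank;  /* rank inside internode_comm, or -1 if not a master */
    int inter_size;  /* size of internode_comm (number of nodes) */
} node_layout;

/* Assumes block placement and that world_size is a multiple of
 * cores_per_node. Both are illustration-only assumptions. */
node_layout layout_of(int world_rank, int world_size, int cores_per_node)
{
    assert(cores_per_node > 0 && world_size % cores_per_node == 0);
    assert(world_rank >= 0 && world_rank < world_size);

    node_layout l;
    l.node = world_rank / cores_per_node;
    l.local_rank = world_rank % cores_per_node;
    l.local_size = cores_per_node;
    l.inter_size = world_size / cores_per_node;
    /* Local rank 0 is the master; only it belongs to internode_comm. */
    l.inter_rank = (l.local_rank == 0) ? l.node : -1;
    return l;
}
```

With world_size = 128 and cores_per_node = 8, every rank sees a
localnode_comm of size 8, and only the 16 masters get a valid rank in
internode_comm. In a real MPI library the node membership would have to be
discovered from the underlying topology rather than computed arithmetically.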
> With these in place, many collectives could be written like this:
>
> MPI_MyColl(comm) {
>     if (comm->collctx->type == real) {
>         /* this is a real communicator */
>         MPI_MyColl_up(comm->collctx->localnode_comm);    /* reduce for allreduce */
>         if (comm->collctx->internode_comm)               /* I am the master in my node */
>             MPI_MyColl(comm->collctx->internode_comm);
>         MPI_MyColl_down(comm->collctx->localnode_comm);  /* bcast for allreduce */
>     } else if (comm->collctx->type == local) {
>         /* put here the algorithm for MyColl, using shm */
>     } else if (comm->collctx->type == inter) {
>         /* put here the algorithm for MyColl, using network only */
>     }
> }
>
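To check that the three phases in the pseudocode give the same result as a
flat allreduce, here is a single-process simulation in plain C for the sum
case. It is only a sketch: the array vals stands in for the per-rank
buffers, the loops stand in for the shared-memory tree and the network
algorithm, and the function name hier_allreduce_sum is hypothetical. Block
placement of ranks and local rank 0 as master are assumed.

```c
#include <assert.h>

/* Single-process simulation of a sum allreduce done in three phases.
 * vals[r] holds the contribution of world rank r on entry and the
 * allreduce result on exit. */
void hier_allreduce_sum(long *vals, int nodes, int cores_per_node)
{
    assert(nodes > 0 && cores_per_node > 0);

    /* Phase 1 ("up"): reduce within each node onto its master. A real
     * implementation would use a tree over shared memory; this loop
     * only reproduces the result. */
    for (int n = 0; n < nodes; n++) {
        long *node_vals = vals + (long)n * cores_per_node;
        for (int c = 1; c < cores_per_node; c++)
            node_vals[0] += node_vals[c];
    }

    /* Phase 2: allreduce among masters only (stand-in for internode_comm). */
    long total = 0;
    for (int n = 0; n < nodes; n++)
        total += vals[(long)n * cores_per_node];
    for (int n = 0; n < nodes; n++)
        vals[(long)n * cores_per_node] = total;

    /* Phase 3 ("down"): each master broadcasts within its node. */
    for (int n = 0; n < nodes; n++) {
        long *node_vals = vals + (long)n * cores_per_node;
        for (int c = 1; c < cores_per_node; c++)
            node_vals[c] = node_vals[0];
    }
}
```

Calling hier_allreduce_sum(vals, 16, 8) on 128 contributions leaves the
global sum in every slot, while phase 2 involves only 16 participants
instead of 128, which is the point of electing one master per node.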
> I'd like to know what you think is the best way to do it.
>
> Thanks in advance,
>
> Sylvain
>
>