[MPICH] Doing collectives a better way in not-so-small nodes

Rajeev Thakur thakur at mcs.anl.gov
Fri Dec 8 12:38:18 CST 2006


Sylvain,
        Your approach sounds good. We have been meaning to add
topology-aware collectives to MPICH2 for a while now, but haven't gotten
around to it. We hope to add something next year.

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Sylvain Jeaugey
> Sent: Tuesday, November 28, 2006 3:12 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Doing collectives a better way in not-so-small nodes
> 
> Hi all,
> 
> When using not-so-small nodes, meaning nodes with more than 4 cores,
> most collective operations suffer a lot from intra-node contention.
> Therefore, the best performance may be obtained by using different
> algorithms for intra-node and inter-node communication.
> 
> Example: allreduce
> intra-node preferred algorithm: tree-based (reduce, bcast)
> inter-node preferred algorithm: current one (Rabenseifner's algorithm)
> 
> So, basically, what I'd like to do is to elect a master on each node, 
> which will do a tree-based reduce within the node, then perform an 
> inter-node allreduce with other nodes' masters, then perform a bcast 
> within the node.
> 
> To do that nicely, I would like to attach two other communicators to
> every communicator: first, a localnode_comm, used to perform intra-node
> collectives; then an internode_comm, used to perform inter-node
> collectives.
> 
> For example, using 16 8-core machines, localnode_comm would be of size 8
> and internode_comm of size 16, where the original communicator would
> contain 128 processes. The internode_comm would be defined only for one
> process per node, the only one allowed to perform global collectives.
> 
> localnode_comm and internode_comm would be defined at comm creation 
> according to the underlying topology.
> 
> With this, many collectives could be written like this:
> 
> MPI_MyColl(comm) {
>     if (comm->collctx->type == real) {
>         /* this is a real communicator */
>         MPI_MyColl_up(comm->collctx->localnode_comm);    /* reduce for allreduce */
>         if (comm->collctx->internode_comm)                /* I am the master in my node */
>             MPI_MyColl(comm->collctx->internode_comm);
>         MPI_MyColl_down(comm->collctx->localnode_comm);  /* bcast for allreduce */
>     } else if (comm->collctx->type == local) {
>         /* put here the algorithm for MyColl, using shm */
>     } else if (comm->collctx->type == inter) {
>         /* put here the algorithm for MyColl, using network only */
>     }
> }
> 
> I'd like to know what you think is the best way to do it.
> 
> Thanks in advance,
> 
> Sylvain
> 
> 
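The three-phase scheme described in the quoted message (reduce within each node, allreduce among the node masters, bcast within each node) can be illustrated with a small plain-C simulation. This is only a sketch under stated assumptions, not MPICH code: it makes no MPI calls, ranks are simply array indices, and the names `rank_layout`, `layout_of` and `hier_allreduce_sum` are hypothetical. It also assumes block placement (world rank r runs on node r / cores_per_node) and takes local rank 0 as the master, neither of which is specified in the original message. The intra-node reduce here is a simple linear loop rather than the tree-based one proposed.

```c
#include <assert.h>

/* Hypothetical description of where a world rank sits in the hierarchy.
   Assumes block placement: rank r runs on node r / cores_per_node. */
typedef struct {
    int node;        /* node index; also the master's rank in internode_comm */
    int local_rank;  /* rank inside localnode_comm */
    int is_master;   /* 1 if this process would belong to internode_comm */
} rank_layout;

rank_layout layout_of(int world_rank, int cores_per_node)
{
    rank_layout l;
    assert(world_rank >= 0 && cores_per_node > 0);
    l.node = world_rank / cores_per_node;
    l.local_rank = world_rank % cores_per_node;
    l.is_master = (l.local_rank == 0);  /* assumption: local rank 0 is master */
    return l;
}

/* Simulated three-phase allreduce (sum).
   On entry values[r] is rank r's contribution; on return every
   values[r] holds the global sum. */
void hier_allreduce_sum(long *values, int nprocs, int cores_per_node)
{
    int nnodes, n, i;
    long total = 0;

    assert(nprocs > 0 && cores_per_node > 0);
    assert(nprocs % cores_per_node == 0);
    nnodes = nprocs / cores_per_node;

    /* Phase 1 ("up"): each node reduces into its master. */
    for (n = 0; n < nnodes; n++) {
        int master = n * cores_per_node;
        for (i = 1; i < cores_per_node; i++)
            values[master] += values[master + i];
    }

    /* Phase 2: allreduce among the masters only (the internode_comm). */
    for (n = 0; n < nnodes; n++)
        total += values[n * cores_per_node];
    for (n = 0; n < nnodes; n++)
        values[n * cores_per_node] = total;

    /* Phase 3 ("down"): each master broadcasts within its node. */
    for (n = 0; n < nnodes; n++) {
        int master = n * cores_per_node;
        for (i = 1; i < cores_per_node; i++)
            values[master + i] = values[master];
    }
}
```

With the 16-node, 8-core example from the message, calling `hier_allreduce_sum(values, 128, 8)` involves only 16 masters in the inter-node phase, while each of the other 112 processes exchanges data only inside its own node.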



