[mpich-discuss] Algorithms used in implementation of collective operations like Allgather , Bcast etc.
Torquil Macdonald Sørensen
torquil at gmail.com
Sat Jun 4 04:53:35 CDT 2011
On 04/06/11 11:47, Mahesh Doijade wrote:
> Hi,
> I want to know which algorithm is used in implementation of Collective
> Operations like Allgather, Bcast , gather and Scatter. Is it tree based
> algorithm or Linear Ring algorithm or both with some threshold during which each
> is used, and if some other kind of algorithms are also been used. Please let me
> know in a elaborate manner.
> Best Regards
> --Mahesh Doijade
Hi!
I think you should probably check out the source code. I'm not sure if the
following is satisfactory info for you regarding AllGather, but this is what I
found near the top of the file mpich2/src/mpi/coll/allgather.c (The cource files
for the other functions you mention might also contain detailed descriptions.
Comments might also be scattered throughout the code):
Algorithm: MPI_Allgather
For short messages and non-power-of-two no. of processes, we use
the algorithm from the Jehoshua Bruck et al IEEE TPDS Nov 97
paper. It is a variant of the disemmination algorithm for
barrier. It takes ceiling(lg p) steps.
Cost = lgp.alpha + n.((p-1)/p).beta
where n is total size of data gathered on each process.
For short or medium-size messages and power-of-two no. of
processes, we use the recursive doubling algorithm.
Cost = lgp.alpha + n.((p-1)/p).beta
TODO: On TCP, we may want to use recursive doubling instead of the Bruck
algorithm in all cases because of the pairwise-exchange property of
recursive doubling (see Benson et al paper in Euro PVM/MPI
2003).
It is interesting to note that either of the above algorithms for
MPI_Allgather has the same cost as the tree algorithm for MPI_Gather!
For long messages or medium-size messages and non-power-of-two
no. of processes, we use a ring algorithm. In the first step, each
process i sends its contribution to process i+1 and receives
the contribution from process i-1 (with wrap-around). From the
second step onwards, each process i forwards to process i+1 the
data it received from process i-1 in the previous step. This takes
a total of p-1 steps.
Cost = (p-1).alpha + n.((p-1)/p).beta
We use this algorithm instead of recursive doubling for long
messages because we find that this communication pattern (nearest
neighbor) performs twice as fast as recursive doubling for long
messages (on Myrinet and IBM SP).
More information about the mpich-discuss
mailing list