[mpich-discuss] Algorithms used in implementation of collective operations like Allgather , Bcast etc.

Torquil Macdonald Sørensen torquil at gmail.com
Sat Jun 4 04:53:35 CDT 2011


On 04/06/11 11:47, Mahesh Doijade wrote:
> Hi,
>       I want to know which algorithm is used in implementation of  Collective
> Operations like Allgather, Bcast , gather and Scatter. Is it tree based
> algorithm or Linear Ring algorithm or both with some threshold during which each
> is used, and if some other kind of algorithms are also been used. Please let me
> know in a elaborate manner.
> Best Regards
> --Mahesh Doijade

Hi!

I think you should probably check out the source code. I'm not sure if the 
following is satisfactory info for you regarding AllGather, but this is what I 
found near the top of the file mpich2/src/mpi/coll/allgather.c (The cource files 
for the other functions you mention might also contain detailed descriptions. 
Comments might also be scattered throughout the code):

    Algorithm: MPI_Allgather

    For short messages and non-power-of-two no. of processes, we use
    the algorithm from the Jehoshua Bruck et al IEEE TPDS Nov 97
    paper. It is a variant of the disemmination algorithm for
    barrier. It takes ceiling(lg p) steps.

    Cost = lgp.alpha + n.((p-1)/p).beta
    where n is total size of data gathered on each process.

    For short or medium-size messages and power-of-two no. of
    processes, we use the recursive doubling algorithm.

    Cost = lgp.alpha + n.((p-1)/p).beta

    TODO: On TCP, we may want to use recursive doubling instead of the Bruck
    algorithm in all cases because of the pairwise-exchange property of
    recursive doubling (see Benson et al paper in Euro PVM/MPI
    2003).

    It is interesting to note that either of the above algorithms for
    MPI_Allgather has the same cost as the tree algorithm for MPI_Gather!

    For long messages or medium-size messages and non-power-of-two
    no. of processes, we use a ring algorithm. In the first step, each
    process i sends its contribution to process i+1 and receives
    the contribution from process i-1 (with wrap-around). From the
    second step onwards, each process i forwards to process i+1 the
    data it received from process i-1 in the previous step. This takes
    a total of p-1 steps.

    Cost = (p-1).alpha + n.((p-1)/p).beta

    We use this algorithm instead of recursive doubling for long
    messages because we find that this communication pattern (nearest
    neighbor) performs twice as fast as recursive doubling for long
    messages (on Myrinet and IBM SP).


More information about the mpich-discuss mailing list