[MPICH2-dev] Broadcast on non-power of two nodes

Thu Dec 1 17:02:55 CST 2005

Ron,
    The cutoff points were set on some Myrinet cluster. You will need to
experiment and use appropriate ones for Red Storm.

>   For example, a 16-node broadcast of 16834 bytes, the tree algorithm
>   requires 30 operations (2n-2) of size 16384, of which most 
> of them overlap
>   so the run time is actually O(log2(n)).  The equivalent 
> gather requires
>   241 operations of size 1024 bytes each in addition to the 
> 30 operations
>   in the tree to scatter the little bits of the message.

The ring algorithm for allgather takes nprocs-1 steps. In each step, each
process sends to rank+1 and receives from rank-1. So, a lot of the
communication should happen concurrently, unless it is getting serialized by
the network for some reason. Can that happen on your network?

Also, you may want to consider adding an extra condition (nprocs) in if
statement that selects the allgather algorithm. For nprocs beyond some value
you may want to use recursive doubling even for non powers of two.

Rajeev

> -----Original Message-----
> From: owner-mpich2-dev at mcs.anl.gov 
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of Ron Brightwell
> Sent: Thursday, December 01, 2005 4:24 PM
> To: mpich2-dev at mcs.anl.gov
> Cc: ssdev at sandia.gov
> Subject: [MPICH2-dev] Broadcast on non-power of two nodes
> 
> Greetings:
> 
> We've been doing some analysis of the performance of the 
> MPICH2 collective
> operations on Red Storm, and we noticed a huge performance issue with
> broadcasts on non-powers-of-two numbers of nodes.  Here's the 
> analysis from
> one of our developers:
> 
>   There is a crossover at MPIR_BCAST_SHORT_MSG=12288 that 
> switches to use a
>   long message algorithm "binomial tree scatter followed by 
> an allgather".
>   The tree scatter appears to be roughly the same algorithm 
> as the fan out
>   tree used by the short message protocol, but doesn't send 
> all of the data,
>   just pieces of it.  This is (almost) equally efficient for 
> powers of two
>   as non-powers of two.
> 
>   The allgather, however, switches to a ring algorithm if the 
> number of
>   nodes is a non-power-of-two.  This algorithm seems to be 
> very inefficient
>   in comparison to the tree algorithm -- it also breaks the 
> message down
>   into smaller bits.
> 
>   For example, a 16-node broadcast of 16834 bytes, the tree algorithm
>   requires 30 operations (2n-2) of size 16384, of which most 
> of them overlap
>   so the run time is actually O(log2(n)).  The equivalent 
> gather requires
>   241 operations of size 1024 bytes each in addition to the 
> 30 operations
>   in the tree to scatter the little bits of the message.
> 
>   For a 8192-node broadcast of the same message, the 
> transaction size is now
>   2 bytes.  Yes, 2 bytes.  This results in 67,100,672 
> operations, or (n^2-n)
>   operations in addition to the (2n-2) for the tree scatter.  
> I do not know
>   what the run time big O is for this since I do not know how 
> much overlap
>   there is.
> 
>   There is another crossover at MPIR_BCAST_LONG_MSG=524288 
> that switches
>   powers-of-two to the ring algorithm.  You can see the result of that
>   as a large spike in the time to complete all Bcasts larger 
> than 0.5 MB.
>   At that size it still takes forever.
> 
> So, now a couple of questions.  First, is the above analysis 
> correct, or
> are we missing something?  Second, what's the rationale for 
> switching to a
> protocol that results in sending 67 million 2-byte messages?  
> And finally,
> what's the best way for us to address the problem?  Is just 
> fiddling with
> the crossover points the easiest thing to do?
> 
> -Ron
> 
> 
>