[MPICH2-dev] Broadcast on non-power of two nodes

Thu Dec 1 16:23:42 CST 2005

Greetings:

We've been doing some analysis of the performance of the MPICH2 collective
operations on Red Storm, and we noticed a huge performance issue with
broadcasts on non-powers-of-two numbers of nodes.  Here's the analysis from
one of our developers:

  There is a crossover at MPIR_BCAST_SHORT_MSG=12288 that switches to use a
  long message algorithm "binomial tree scatter followed by an allgather".
  The tree scatter appears to be roughly the same algorithm as the fan out
  tree used by the short message protocol, but doesn't send all of the data,
  just pieces of it.  This is (almost) equally efficient for powers of two
  as non-powers of two.

  The allgather, however, switches to a ring algorithm if the number of
  nodes is a non-power-of-two.  This algorithm seems to be very inefficient
  in comparison to the tree algorithm -- it also breaks the message down
  into smaller bits.

  For example, a 16-node broadcast of 16834 bytes, the tree algorithm
  requires 30 operations (2n-2) of size 16384, of which most of them overlap
  so the run time is actually O(log2(n)).  The equivalent gather requires
  241 operations of size 1024 bytes each in addition to the 30 operations
  in the tree to scatter the little bits of the message.

  For a 8192-node broadcast of the same message, the transaction size is now
  2 bytes.  Yes, 2 bytes.  This results in 67,100,672 operations, or (n^2-n)
  operations in addition to the (2n-2) for the tree scatter.  I do not know
  what the run time big O is for this since I do not know how much overlap
  there is.

  There is another crossover at MPIR_BCAST_LONG_MSG=524288 that switches
  powers-of-two to the ring algorithm.  You can see the result of that
  as a large spike in the time to complete all Bcasts larger than 0.5 MB.
  At that size it still takes forever.

So, now a couple of questions.  First, is the above analysis correct, or
are we missing something?  Second, what's the rationale for switching to a
protocol that results in sending 67 million 2-byte messages?  And finally,
what's the best way for us to address the problem?  Is just fiddling with
the crossover points the easiest thing to do?

-Ron