[MPICH2-dev] Broadcast on non-power of two nodes
Ron Brightwell
rbbrigh at sandia.gov
Thu Dec 1 16:23:42 CST 2005
Greetings:
We've been doing some analysis of the performance of the MPICH2 collective
operations on Red Storm, and we noticed a huge performance issue with
broadcasts on non-powers-of-two numbers of nodes. Here's the analysis from
one of our developers:
There is a crossover at MPIR_BCAST_SHORT_MSG=12288 that switches to use a
long message algorithm "binomial tree scatter followed by an allgather".
The tree scatter appears to be roughly the same algorithm as the fan out
tree used by the short message protocol, but doesn't send all of the data,
just pieces of it. This is (almost) equally efficient for powers of two
as non-powers of two.
The allgather, however, switches to a ring algorithm if the number of
nodes is a non-power-of-two. This algorithm seems to be very inefficient
in comparison to the tree algorithm -- it also breaks the message down
into smaller bits.
For example, a 16-node broadcast of 16834 bytes, the tree algorithm
requires 30 operations (2n-2) of size 16384, of which most of them overlap
so the run time is actually O(log2(n)). The equivalent gather requires
241 operations of size 1024 bytes each in addition to the 30 operations
in the tree to scatter the little bits of the message.
For a 8192-node broadcast of the same message, the transaction size is now
2 bytes. Yes, 2 bytes. This results in 67,100,672 operations, or (n^2-n)
operations in addition to the (2n-2) for the tree scatter. I do not know
what the run time big O is for this since I do not know how much overlap
there is.
There is another crossover at MPIR_BCAST_LONG_MSG=524288 that switches
powers-of-two to the ring algorithm. You can see the result of that
as a large spike in the time to complete all Bcasts larger than 0.5 MB.
At that size it still takes forever.
So, now a couple of questions. First, is the above analysis correct, or
are we missing something? Second, what's the rationale for switching to a
protocol that results in sending 67 million 2-byte messages? And finally,
what's the best way for us to address the problem? Is just fiddling with
the crossover points the easiest thing to do?
-Ron
More information about the mpich2-dev
mailing list