[MPICH] MPICH2 performance issue with dual core

Thu Jan 3 11:19:27 CST 2008

On Thu, 2007-12-27 at 13:37 -0500, Tony Ladd wrote:
> 
> 2) In the Alltoall I noticed that MPICH1 posts all the receives and
> then uses non-blocking sends + Msg_Wait. This can lead to
> oversubscription and packet loss in my experience. Last year, when
> working with Guiseppe Ciaccio to test MPIGAMMA on our cluster I found
> truly horrible performance on Alltoall for large numbers of processes
> (hours instead of seconds for M ~ 100). The problem was exacerbated by
> GAMMA's rudimentary flow control which does not expect
> oversubscription. However in my opinion a scalable collective
> algorithm should not oversubscribe if at all possible. In the case of
> Alltoall an additional loop enables an arbitrary number of receives to
> be posted at once. I found about 4 was optimum for MPIGAMMA.

The naive implementation of this won't work: you find you need a barrier
at the end of each loop iteration to ensure all processes proceed in
lockstep.
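
As a rough illustration of the windowed loop being discussed, here is a
small pure-Python model of the schedule (not MPI code, and not taken from
MPICH or MPIGAMMA). The function names, the ring-offset ordering of
partners and the choice of 16 processes are my own assumptions. It shows
that if every process stays within the same round, which is what the
barrier enforces, no process is the target of more than `window` messages
at once.

```python
def windowed_schedule(nprocs, window):
    """Yield one round at a time as {rank: (recv_from, send_to)}.

    Hypothetical ordering: partners are taken by ring offset, `window`
    offsets per round. Offset 0 (the local copy) is skipped.
    """
    offsets = list(range(1, nprocs))
    for start in range(0, len(offsets), window):
        chunk = offsets[start:start + window]
        yield {
            rank: ([(rank - off) % nprocs for off in chunk],
                   [(rank + off) % nprocs for off in chunk])
            for rank in range(nprocs)
        }


def max_incoming_per_round(nprocs, window):
    """Largest number of messages aimed at any one rank within a round.

    This bound only holds if no process runs ahead into the next round,
    i.e. if the rounds are separated by a barrier.
    """
    worst = 0
    for round_plan in windowed_schedule(nprocs, window):
        incoming = [0] * nprocs
        for _recv_from, send_to in round_plan.values():
            for dest in send_to:
                incoming[dest] += 1
        worst = max(worst, max(incoming))
    return worst


if __name__ == "__main__":
    print(max_incoming_per_round(nprocs=16, window=4))  # prints 4
```

Without the barrier nothing stops a fast process from starting its next
round of sends early, so the per-round bound no longer limits what a slow
process sees.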

The best way to do it is to pipeline the inner loop with itself. Have
each iteration of the loop consist of four receives, ideally a barrier,
four sends and a final barrier.  If you have asynchronous barriers then
you can pipeline multiple iterations of the loop to keep the network
busy and hence achieve maximum performance.  A loop size of four and a
pipeline depth of two would give a maximum of eight active receives at
any given process at any one time and, barring any imbalances in the
network, a minimum of four.  Any imbalances would, however, be corrected
on each iteration of the loop.
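
The eight/four figures can be checked with a toy model of one process. This
is only a sketch under idealised assumptions of mine: all receives of a
round retire together when that round's final barrier completes, and the
process posts the next round as soon as it has fewer than `depth` rounds in
flight. The function name and the round count of 10 are hypothetical.

```python
from collections import deque


def active_receive_bounds(total_rounds, window, depth):
    """Return (peak, steady_low) active receives for one process.

    steady_low is the count just after the oldest round retires and
    before the next round is posted; it is None if the pipeline never
    fills (total_rounds <= depth).
    """
    in_flight = deque()  # receives posted per round, oldest first
    active = 0
    peak = 0
    steady_low = None
    for _ in range(total_rounds):
        if len(in_flight) == depth:
            # Oldest round's final barrier has completed: retire it.
            active -= in_flight.popleft()
            steady_low = active if steady_low is None else min(steady_low, active)
        in_flight.append(window)  # post this round's receives
        active += window
        peak = max(peak, active)
    return peak, steady_low


if __name__ == "__main__":
    # Loop size four, pipeline depth two, as in the text above.
    print(active_receive_bounds(total_rounds=10, window=4, depth=2))  # prints (8, 4)
```

In this model the peak is window * depth and the steady-state low is
window * (depth - 1); network imbalances are not modelled.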

To do this you need a low-overhead asynchronous barrier.  If the barrier
is slow and the process count allows it, you can simply increase the
pipeline depth to maintain full utilisation; however, it does help if
the barrier has low CPU overhead.

Ashley,