[MPICH] MPICH2 performance issue with dual core

Ashley Pittman apittman at concurrent-thinking.com
Fri Jan 4 09:48:48 CST 2008


On Thu, 2008-01-03 at 18:33 -0500, Tony Ladd wrote:
> Ashley
> 
> Do you mean that without barrier calls it will give the wrong answer? I
> don't think this is the case.

No, I think adding the barriers will make the performance more
consistent and faster.

> I agree that barrier calls could help control oversubscription in
> erratic or heterogeneous networks. We had a flat and low-latency (for
> Gigabit Ethernet) setup when we ran these tests (Intel PRO 1000 +
> Extreme Networks x450a + MPIGAMMA -- total latency including MPI was
> 11 microseconds, but CPU is 100% due to constant polling).

Unfortunately all nodes are erratic, and hence all networks are too.  If
one process is slow to receive a message on a timestep for some reason,
then on the next timestep it will have more work to do and hence be slow
again.  This positive feedback causes hotspotting, where one node
experiences incoming network contention and becomes a bottleneck for the
whole operation.  Adding barriers breaks the feedback cycle but adds
latency; that latency can, however, be hidden by pipelining multiple
timesteps together.
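
Something along these lines would do it (purely an illustrative sketch,
not MPICH code; BLOCK, msglen and the buffer names are made up, and the
shifted peer ordering is just one way of spreading the load):

#include <mpi.h>

#define BLOCK 4   /* peers exchanged with per block */

static void blocked_exchange(char *sendbuf, char *recvbuf, int msglen,
                             MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int base = 0; base < size; base += BLOCK) {
        MPI_Request req[2 * BLOCK];
        int nreq = 0;

        /* Post the receives for this block first: one per shifted peer
         * (rank - off) mod size. */
        for (int off = base; off < base + BLOCK && off < size; off++) {
            int from = (rank - off + size) % size;
            MPI_Irecv(recvbuf + from * msglen, msglen, MPI_CHAR,
                      from, 0, comm, &req[nreq++]);
        }

        /* ...then the sends, to (rank + off) mod size.  (A barrier between
         * these two loops would additionally guarantee that every message
         * finds a posted receive and never lands in the unexpected queue.) */
        for (int off = base; off < base + BLOCK && off < size; off++) {
            int to = (rank + off) % size;
            MPI_Isend(sendbuf + to * msglen, msglen, MPI_CHAR,
                      to, 0, comm, &req[nreq++]);
        }

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        /* No process starts the next block until everyone has finished this
         * one, so a slow process cannot fall further and further behind. */
        MPI_Barrier(comm);
    }
}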

> We actually found 2 blocks was optimum, not 4 as I said earlier, but 1
> is almost as good. I think you would have to send just 1 block at a time
> to ensure no oversubscription - perhaps this would be best. My point was
> that deliberately opening a lot of receive buffers at once was not a good
> thing. In MPICH1 you would have as many open buffers as the size of the
> MPI_COMM. I have not yet spent enough time navigating the MPICH2 source
> to know if it's the same.

You should aim to have a small number of receive buffers posted in
advance; however, avoiding large numbers of unexpected messages should
take priority over minimising the number of posted receive buffers.
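
For example, a bounded window of pre-posted receives, refilled as
messages complete (again only a sketch; WINDOW, peers, nmsgs and buf are
illustrative):

#include <mpi.h>

#define WINDOW 4   /* receives kept posted in advance */

static void windowed_receive(char *buf, int msglen, const int *peers,
                             int nmsgs, MPI_Comm comm)
{
    MPI_Request req[WINDOW];
    int posted = 0, done = 0;

    /* Pre-post the first WINDOW receives so early arrivals find a buffer. */
    while (posted < nmsgs && posted < WINDOW) {
        MPI_Irecv(buf + posted * msglen, msglen, MPI_CHAR,
                  peers[posted], 0, comm, &req[posted]);
        posted++;
    }

    /* Each time a receive completes, reuse its slot for the next one, so
     * the number of posted receives never exceeds WINDOW. */
    while (done < nmsgs) {
        int idx;
        MPI_Waitany(posted < WINDOW ? posted : WINDOW, req, &idx,
                    MPI_STATUS_IGNORE);
        done++;
        if (posted < nmsgs) {
            MPI_Irecv(buf + posted * msglen, msglen, MPI_CHAR,
                      peers[posted], 0, comm, &req[idx]);
            posted++;
        }
    }
}

Note the window only bounds what the receiver has posted; unless the
senders are throttled as well (by the barriers above, for instance),
anything beyond the window still arrives unexpected, which is why
avoiding unexpected messages has to take priority.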


Ashley.

> Ashley Pittman wrote:
> > On Thu, 2007-12-27 at 13:37 -0500, Tony Ladd wrote:
> >   
> >> 2) In the Alltoall I noticed that MPICH1 posts all the receives and
> >> then uses non-blocking sends + Msg_Wait. This can lead to
> >> oversubscription and packet loss in my experience. Last year, when
> >> working with Giuseppe Ciaccio to test MPIGAMMA on our cluster, I found
> >> truly horrible performance on Alltoall for large numbers of processes
> >> (hours instead of seconds for M ~ 100). The problem was exacerbated by
> >> GAMMA's rudimentary flow control, which does not expect
> >> oversubscription. However, in my opinion a scalable collective
> >> algorithm should not oversubscribe if at all possible. In the case of
> >> Alltoall, an additional loop enables an arbitrary number of receives
> >> to be posted at once. I found about 4 was optimum for MPIGAMMA.
> >>
> >
> > The naive implementation of this won't work; you find you need a barrier
> > at the end of each loop to ensure all processes proceed in lockstep.
> >
> > The best way to do it is to pipeline the inner loop with itself: have
> > each iteration of the loop consist of four receives, ideally a barrier,
> > four sends and a final barrier.  If you have asynchronous barriers then
> > you can pipeline multiple iterations of the loop to keep the network
> > busy and hence achieve maximum performance.  A loop size of four and a
> > pipeline depth of two would have a maximum of eight active receives to
> > any given process at any one time and, barring any imbalances in the
> > network, a minimum of four.  Any imbalances would, however, be
> > corrected on each iteration of the loop.
> >
> > To do this you need a low-overhead asynchronous barrier.  If the barrier
> > is slow and the process count allows it, you can simply increase the
> > pipeline depth to maintain full utilisation, although it does help if
> > the barrier has a low CPU overhead.
> >
> > Ashley,
> >
> >   
> 
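
For what it's worth, a sketch of the pipelined loop described in the
quoted message above might look like the following.  It assumes a
non-blocking barrier in the style of MPI-3's MPI_Ibarrier, which MPICH2
does not provide today, so treat it purely as an illustration; BLOCK,
DEPTH, msglen and the buffer names are all made up.

#include <mpi.h>

#define BLOCK 4   /* receives and sends posted per loop iteration */
#define DEPTH 2   /* iterations allowed in flight at once         */

static void pipelined_alltoall(char *sendbuf, char *recvbuf, int msglen,
                               MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int nblocks = (size + BLOCK - 1) / BLOCK;
    MPI_Request msgreq[DEPTH][2 * BLOCK];   /* sends + receives per slot */
    MPI_Request barreq[DEPTH];              /* trailing barrier per slot */
    int nreq[DEPTH];

    for (int blk = 0; blk < nblocks; blk++) {
        int slot = blk % DEPTH;

        /* Reusing a pipeline slot: first drain the iteration that last used
         * it, including its trailing barrier.  This is what limits any one
         * process to at most BLOCK * DEPTH outstanding receives. */
        if (blk >= DEPTH) {
            MPI_Waitall(nreq[slot], msgreq[slot], MPI_STATUSES_IGNORE);
            MPI_Wait(&barreq[slot], MPI_STATUS_IGNORE);
        }

        nreq[slot] = 0;

        /* BLOCK receives for this iteration, posted before the sends. */
        for (int i = 0; i < BLOCK && blk * BLOCK + i < size; i++) {
            int off  = blk * BLOCK + i;
            int peer = (rank - off + size) % size;
            MPI_Irecv(recvbuf + peer * msglen, msglen, MPI_CHAR, peer,
                      blk, comm, &msgreq[slot][nreq[slot]++]);
        }

        /* BLOCK sends to the shifted peers for this iteration. */
        for (int i = 0; i < BLOCK && blk * BLOCK + i < size; i++) {
            int off  = blk * BLOCK + i;
            int peer = (rank + off) % size;
            MPI_Isend(sendbuf + peer * msglen, msglen, MPI_CHAR, peer,
                      blk, comm, &msgreq[slot][nreq[slot]++]);
        }

        /* Non-blocking barrier closing the iteration; it is only waited on
         * when its slot is reused, so up to DEPTH iterations overlap. */
        MPI_Ibarrier(comm, &barreq[slot]);
    }

    /* Drain whatever is still in flight. */
    for (int slot = 0; slot < DEPTH && slot < nblocks; slot++) {
        MPI_Waitall(nreq[slot], msgreq[slot], MPI_STATUSES_IGNORE);
        MPI_Wait(&barreq[slot], MPI_STATUS_IGNORE);
    }
}

With DEPTH iterations in flight, any one process sees at most
BLOCK * DEPTH outstanding receives and, barring imbalances, at least
BLOCK, which is the behaviour described above.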




More information about the mpich-discuss mailing list