[MPICH] MPICH2 performance issue with dual core
Tony Ladd
tladd at che.ufl.edu
Thu Jan 3 17:33:01 CST 2008
Ashley
Do you mean that without barrier calls it will give the wrong answer?
I don't think this is the case.

I agree that barrier calls could help control oversubscription on
erratic or heterogeneous networks. We had a flat, low-latency (for
Gigabit Ethernet) setup when we ran these tests: Intel PRO/1000 NICs +
Extreme Networks x450a switches + MPIGAMMA; total latency including
MPI was 11 microseconds, but CPU usage is 100% due to constant
polling. We actually found 2 blocks was optimum, not 4 as I said
earlier, but 1 is almost as good. I think you would have to send just
1 block at a time to ensure no oversubscription; perhaps this would be
best. My point was that deliberately opening a lot of receive buffers
at once was not a good thing. In MPICH1 you would have as many open
receive buffers as the size of the MPI communicator. I have not yet
spent enough time navigating the MPICH2 source to know if it is the
same.
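
Concretely, the windowed scheme I am describing might look something
like this (a hypothetical sketch, not the MPICH or MPIGAMMA source;
error checking omitted):

/* Windowed all-to-all sketch: at most `block` receives are
 * outstanding at any time, so no process can be oversubscribed. */
#include <mpi.h>
#include <string.h>

void alltoall_windowed(char *sendbuf, char *recvbuf, int bytes,
                       int block, MPI_Comm comm)
{
    int rank, nprocs, done = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* copy the local block directly */
    memcpy(recvbuf + rank * bytes, sendbuf + rank * bytes, bytes);

    while (done < nprocs - 1) {
        int n = nprocs - 1 - done;
        if (n > block) n = block;
        MPI_Request req[2 * n];              /* C99 variable-length array */

        for (int i = 0; i < n; i++) {
            int off = done + i + 1;
            int src = (rank - off + nprocs) % nprocs;
            int dst = (rank + off) % nprocs;
            /* pairwise exchange: receive from rank-off, send to rank+off */
            MPI_Irecv(recvbuf + src * bytes, bytes, MPI_BYTE,
                      src, 0, comm, &req[i]);
            MPI_Isend(sendbuf + dst * bytes, bytes, MPI_BYTE,
                      dst, 0, comm, &req[n + i]);
        }
        /* complete the whole window before opening the next one */
        MPI_Waitall(2 * n, req, MPI_STATUSES_IGNORE);
        done += n;
    }
}

With block = 1 this reduces to a strictly pairwise exchange, which is
what you would need to rule out oversubscription entirely.
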
Tony
Ashley Pittman wrote:
> On Thu, 2007-12-27 at 13:37 -0500, Tony Ladd wrote:
>
>> 2) In the Alltoall I noticed that MPICH1 posts all the receives and
>> then uses non-blocking sends + Msg_Wait. This can lead to
>> oversubscription and packet loss in my experience. Last year, when
>> working with Giuseppe Ciaccio to test MPIGAMMA on our cluster, I
>> found truly horrible performance on Alltoall for large numbers of
>> processes (hours instead of seconds for M ~ 100). The problem was
>> exacerbated by GAMMA's rudimentary flow control, which does not
>> expect oversubscription. However, in my opinion a scalable
>> collective algorithm should not oversubscribe if at all possible. In
>> the case of Alltoall, an additional loop enables an arbitrary number
>> of receives to be posted at once. I found about 4 was optimum for
>> MPIGAMMA.
>>
>
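For reference, the post-everything-up-front pattern described above
would look roughly like the following (an illustrative paraphrase of
the idea, not the actual MPICH1 source); every rank has nprocs - 1
receive buffers open at once:

#include <mpi.h>

/* All receives posted before any send, then one wait at the end:
 * nprocs - 1 open receive buffers per rank.  Illustrative sketch
 * only; the self block would be copied locally as well. */
void alltoall_post_all(char *sendbuf, char *recvbuf, int bytes,
                       MPI_Comm comm)
{
    int rank, nprocs, k = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    MPI_Request req[2 * (nprocs - 1)];        /* C99 variable-length array */

    for (int i = 0; i < nprocs; i++)          /* post every receive first */
        if (i != rank)
            MPI_Irecv(recvbuf + i * bytes, bytes, MPI_BYTE,
                      i, 0, comm, &req[k++]);
    for (int i = 0; i < nprocs; i++)          /* then every send */
        if (i != rank)
            MPI_Isend(sendbuf + i * bytes, bytes, MPI_BYTE,
                      i, 0, comm, &req[k++]);
    MPI_Waitall(k, req, MPI_STATUSES_IGNORE); /* single wait at the end */
}
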
> The naive implementation of this won't work; you find you need a
> barrier at the end of each loop to ensure all processes proceed in
> lockstep.
>
> The best way to do it is to pipeline the inner loop with itself: have
> each iteration of the loop consist of four receives, ideally a
> barrier, four sends, and a final barrier. If you have asynchronous
> barriers then you can pipeline multiple iterations of the loop to
> keep the network busy and hence achieve maximum performance. A loop
> size of four and a pipeline depth of two would have a maximum of
> eight active receives to any given process at any one time and,
> barring any imbalances in the network, a minimum of four. Any
> imbalances would be corrected on each iteration of the loop, however.
>
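A sketch of that pipelined loop, assuming an asynchronous barrier is
available (MPI-3 later standardised one as MPI_Ibarrier); this is the
rough shape of the idea, not Ashley's actual code, and for brevity it
uses a single barrier per iteration rather than two:

#include <mpi.h>

/* Each iteration posts W receives, W sends, and an asynchronous
 * barrier; up to `depth` iterations are in flight at once, bounding
 * the outstanding receives per process at W * depth.  Hypothetical
 * sketch; error checking omitted. */
void alltoall_pipelined(char *sendbuf, char *recvbuf, int bytes,
                        int W, int depth, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int niter = (nprocs - 2) / W + 1;         /* ceil((nprocs-1)/W) */
    MPI_Request req[depth][2 * W + 1];        /* per slot: msgs + barrier */

    for (int it = 0; it < niter + depth; it++) {
        int slot = it % depth;
        if (it >= depth)                      /* retire iteration it-depth */
            MPI_Waitall(2 * W + 1, req[slot], MPI_STATUSES_IGNORE);
        if (it >= niter) continue;            /* drain remaining iterations */

        int n = 0;
        for (int j = 0; j < W && it * W + j + 1 < nprocs; j++) {
            int off = it * W + j + 1;
            int src = (rank - off + nprocs) % nprocs;
            MPI_Irecv(recvbuf + src * bytes, bytes, MPI_BYTE,
                      src, 0, comm, &req[slot][n++]);
        }
        for (int j = 0; j < W && it * W + j + 1 < nprocs; j++) {
            int off = it * W + j + 1;
            int dst = (rank + off) % nprocs;
            MPI_Isend(sendbuf + dst * bytes, bytes, MPI_BYTE,
                      dst, 0, comm, &req[slot][n++]);
        }
        MPI_Ibarrier(comm, &req[slot][n++]);  /* asynchronous barrier */
        while (n < 2 * W + 1)                 /* pad the unused slots */
            req[slot][n++] = MPI_REQUEST_NULL;
    }
}
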
> To do this you need a low-overhead asynchronous barrier. If the
> barrier is slow and the process count allows it, you can simply
> increase the pipeline depth to maintain full utilisation; it does
> help, however, if the barrier has low CPU overhead.
>
> Ashley,
>
>
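
In terms of the sketch above, that trade-off is just the two
parameters: a hypothetical call matching Ashley's example of loop size
four and pipeline depth two would be

    alltoall_pipelined(sbuf, rbuf, bytes, 4, 2, comm);

and a slow barrier could be hidden by raising the depth, at the cost
of more simultaneously open receive buffers.
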
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514