[MPICH] MPICH2 performance issue with dual core
Tony Ladd
tladd at che.ufl.edu
Thu Jan 3 17:33:01 CST 2008
Ashley
Do you mean that without barrier calls it will give the wrong answer?
I don't think this is the case.

I agree that barrier calls could help control oversubscription on
erratic or heterogeneous networks. We had a flat, low-latency (for
Gigabit Ethernet) setup when we ran these tests: Intel PRO/1000 NICs +
Extreme Networks x450a switches + MPIGAMMA; total latency including
MPI was 11 microseconds, but CPU usage is 100% due to constant
polling. We actually found 2 blocks was optimum, not 4 as I said
earlier, but 1 is almost as good. I think you would have to send just
1 block at a time to ensure no oversubscription; perhaps this would be
best. My point was that deliberately opening a lot of receive buffers
at once was not a good thing. In MPICH1 you would have as many open
receive buffers as the size of the MPI communicator. I have not yet
spent enough time navigating the MPICH2 source to know if it is the
same.
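
Concretely, the windowed scheme I am describing might look something
like this (a hypothetical sketch, not the MPICH or MPIGAMMA source;
error checking omitted):

/* Windowed all-to-all sketch: at most `block` receives are
 * outstanding at any time, so no process can be oversubscribed. */
#include <mpi.h>
#include <string.h>

void alltoall_windowed(char *sendbuf, char *recvbuf, int bytes,
                       int block, MPI_Comm comm)
{
    int rank, nprocs, done = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* copy the local block directly */
    memcpy(recvbuf + rank * bytes, sendbuf + rank * bytes, bytes);

    while (done < nprocs - 1) {
        int n = nprocs - 1 - done;
        if (n > block) n = block;
        MPI_Request req[2 * n];              /* C99 variable-length array */

        for (int i = 0; i < n; i++) {
            int off = done + i + 1;
            int src = (rank - off + nprocs) % nprocs;
            int dst = (rank + off) % nprocs;
            /* pairwise exchange: receive from rank-off, send to rank+off */
            MPI_Irecv(recvbuf + src * bytes, bytes, MPI_BYTE,
                      src, 0, comm, &req[i]);
            MPI_Isend(sendbuf + dst * bytes, bytes, MPI_BYTE,
                      dst, 0, comm, &req[n + i]);
        }
        /* complete the whole window before opening the next one */
        MPI_Waitall(2 * n, req, MPI_STATUSES_IGNORE);
        done += n;
    }
}

With block = 1 this reduces to a strictly pairwise exchange, which is
what you would need to rule out oversubscription entirely.
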
Tony
Ashley Pittman wrote:
> On Thu, 2007-12-27 at 13:37 -0500, Tony Ladd wrote:
>
>> 2) In the Alltoall I noticed that MPICH1 posts all the receives and
>> then uses non-blocking sends + Msg_Wait. This can lead to
>> oversubscription and packet loss in my experience. Last year, when
>> working with Giuseppe Ciaccio to test MPIGAMMA on our cluster, I
>> found truly horrible performance on Alltoall for large numbers of
>> processes (hours instead of seconds for M ~ 100). The problem was
>> exacerbated by GAMMA's rudimentary flow control, which does not
>> expect oversubscription. However, in my opinion a scalable
>> collective algorithm should not oversubscribe if at all possible. In
>> the case of Alltoall, an additional loop enables an arbitrary number
>> of receives to be posted at once. I found about 4 was optimum for
>> MPIGAMMA.
>>
>
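For reference, the post-everything-up-front pattern described above
would look roughly like the following (an illustrative paraphrase of
the idea, not the actual MPICH1 source); every rank has nprocs - 1
receive buffers open at once:

#include <mpi.h>

/* All receives posted before any send, then one wait at the end:
 * nprocs - 1 open receive buffers per rank.  Illustrative sketch
 * only; the self block would be copied locally as well. */
void alltoall_post_all(char *sendbuf, char *recvbuf, int bytes,
                       MPI_Comm comm)
{
    int rank, nprocs, k = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    MPI_Request req[2 * (nprocs - 1)];        /* C99 variable-length array */

    for (int i = 0; i < nprocs; i++)          /* post every receive first */
        if (i != rank)
            MPI_Irecv(recvbuf + i * bytes, bytes, MPI_BYTE,
                      i, 0, comm, &req[k++]);
    for (int i = 0; i < nprocs; i++)          /* then every send */
        if (i != rank)
            MPI_Isend(sendbuf + i * bytes, bytes, MPI_BYTE,
                      i, 0, comm, &req[k++]);
    MPI_Waitall(k, req, MPI_STATUSES_IGNORE); /* single wait at the end */
}
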
> The naive implementation of this won't work; you find you need a
> barrier at the end of each loop to ensure all processes proceed in
> lockstep.
>
> The best way to do it is to pipeline the inner loop with itself: have
> each iteration of the loop consist of four receives, ideally a
> barrier, four sends, and a final barrier. If you have asynchronous
> barriers then you can pipeline multiple iterations of the loop to
> keep the network busy and hence achieve maximum performance. A loop
> size of four and a pipeline depth of two would have a maximum of
> eight active receives to any given process at any one time and,
> barring any imbalances in the network, a minimum of four. Any
> imbalances would be corrected on each iteration of the loop, however.
>
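A sketch of that pipelined loop, assuming an asynchronous barrier is
available (MPI-3 later standardised one as MPI_Ibarrier); this is the
rough shape of the idea, not Ashley's actual code, and for brevity it
uses a single barrier per iteration rather than two:

#include <mpi.h>

/* Each iteration posts W receives, W sends, and an asynchronous
 * barrier; up to `depth` iterations are in flight at once, bounding
 * the outstanding receives per process at W * depth.  Hypothetical
 * sketch; error checking omitted. */
void alltoall_pipelined(char *sendbuf, char *recvbuf, int bytes,
                        int W, int depth, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int niter = (nprocs - 2) / W + 1;         /* ceil((nprocs-1)/W) */
    MPI_Request req[depth][2 * W + 1];        /* per slot: msgs + barrier */

    for (int it = 0; it < niter + depth; it++) {
        int slot = it % depth;
        if (it >= depth)                      /* retire iteration it-depth */
            MPI_Waitall(2 * W + 1, req[slot], MPI_STATUSES_IGNORE);
        if (it >= niter) continue;            /* drain remaining iterations */

        int n = 0;
        for (int j = 0; j < W && it * W + j + 1 < nprocs; j++) {
            int off = it * W + j + 1;
            int src = (rank - off + nprocs) % nprocs;
            MPI_Irecv(recvbuf + src * bytes, bytes, MPI_BYTE,
                      src, 0, comm, &req[slot][n++]);
        }
        for (int j = 0; j < W && it * W + j + 1 < nprocs; j++) {
            int off = it * W + j + 1;
            int dst = (rank + off) % nprocs;
            MPI_Isend(sendbuf + dst * bytes, bytes, MPI_BYTE,
                      dst, 0, comm, &req[slot][n++]);
        }
        MPI_Ibarrier(comm, &req[slot][n++]);  /* asynchronous barrier */
        while (n < 2 * W + 1)                 /* pad the unused slots */
            req[slot][n++] = MPI_REQUEST_NULL;
    }
}
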
> To do this you need a low-overhead asynchronous barrier. If the
> barrier is slow and the process count allows it, you can simply
> increase the pipeline depth to maintain full utilisation; it does
> help, however, if the barrier has low CPU overhead.
>
> Ashley,
>
>
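
In terms of the sketch above, that trade-off is just the two
parameters: a hypothetical call matching Ashley's example of loop size
four and pipeline depth two would be

    alltoall_pipelined(sbuf, rbuf, bytes, 4, 2, comm);

and a slow barrier could be hidden by raising the depth, at the cost
of more simultaneously open receive buffers.
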
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514