[MPICH] MPICH2 performance issue with dual core

chong tan chong_guan_tan at yahoo.com
Thu Dec 27 13:25:00 CST 2007


Tony, you are good at looking at the code. When I ran into MPICH2 performance
issues, I changed my algorithm to work around them; there are no MPI collective calls in my code anymore.

tan



----- Original Message ----
From: Tony Ladd <tladd at che.ufl.edu>
To: Rajeev Thakur <thakur at mcs.anl.gov>
Cc: mpich-discuss at mcs.anl.gov
Sent: Thursday, December 27, 2007 10:37:35 AM
Subject: Re: [MPICH] MPICH2 performance issue with dual core

Rajeev Thakur wrote:
> The collectives in MPICH2 are not optimized for multicore, but it's at the
> top of our list to do.
>
> Rajeev 
>
>
>  
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov 
>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Ladd
>> Sent: Wednesday, December 26, 2007 12:00 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [MPICH] MPICH2 performance issue with dual core
>>
>> I am using MPICH2 over Gigabit ethernet (Intel PRO 1000 + Extreme 
>> Networks x450a-s48t switches). For a single process per node 
>> MPICH2 is 
>> very fast; typical throughput on edge exchange is ~100MBytes/sec both 
>> ways. MPICH2 has more uniform throughput than LAM, is much 
>> faster than 
>> OpenMPI and almost as good throughput as MPIGAMMA (using 1MB TCP 
>> buffers). Latency is 24 microsecs with tuned NIC drivers. So far so 
>> (very) good.
>>
>> Collective communications are excellent for 1 process as well, but 
>> terrible with 2 processes per node. For example, an AlltoAll with 16 
>> processes has average 1-way throughput of 56MBytes/sec when 
>> distributed 
>> over 16 nodes but only 6MBytes per sec when using 8 nodes and 2 
>> processes per node. This is of course the reverse of what one would 
>> expect. I also see the latency goes up more with 2 processes 
>> per node. 
>> So a 4 process Barrier call takes about 58 microsecs on 4 
>> nodes and 68 
>> microsecs on 2 nodes. I checked with a single node and two 
>> processes and 
>> that was very fast (over 400MBytes/sec) so perhaps the issue is the 
>> interaction of shared memory and TCP. I compiled ch3:ssm and ch3:nemesis 
>> with the same result, and tried with and without --enable-fast; this 
>> also did little.
>>
>> Finally I notice the cpu utilization is 100%; can this be part of the 
>> problem?
>>
>> I apologize if this has been gone over before, but I am new to MPICH2.
>>
>> Thanks
>>
>> Tony
>>
>> -- 
>> Tony Ladd
>>
>> Chemical Engineering Department
>> University of Florida
>> Gainesville, Florida 32611-6005
>> USA
>>
>> Email: tladd-"(AT)"-che.ufl.edu
>> Web:  http://ladd.che.ufl.edu
>>
>> Tel:  (352)-392-6509
>> FAX:  (352)-392-9514
>>
>>
>>    
>
>  
Rajeev

Is there a time frame for the release of optimized multi-core collectives?

Since there is going to be development work, could I also make a couple 
of observations about where the collectives in MPICH1 (and perhaps 
MPICH2) could be improved?

1) I think Rabenseifner's AllReduce can be done in half the time by 
avoiding the copy to a single head node. For a message of length N, the 
reduce distributed over the nodes takes order N (M messages of length 
N/M), and the copy to the head node is also order N (M * N/M); similarly 
for the B'cast. If you use all the nodes as sources for the B'cast, you 
avoid the copies to and from the head node, so the time is 2N rather 
than 4N. My hand-coded AllReduce is typically about twice as fast as 
MPICH1/MPIGAMMA. Of course you need a special AllReduce function rather 
than a combination of Reduce + Bcast, but wouldn't the extra speed be 
worth it in such an important collective?
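
To make this concrete, here is a rough sketch of the idea (not my 
hand-coded routine; built from standard MPI calls, with SUM on doubles 
and the helper name my_allreduce_sum chosen just for illustration):

#include <mpi.h>
#include <stdlib.h>

/* Allreduce as reduce-scatter + allgather: every rank ends up owning one
 * fully reduced block (cost ~N), then every rank sources its own block
 * (cost ~N), so there is no gather to, or broadcast from, a head node. */
static int my_allreduce_sum(double *sendbuf, double *recvbuf,
                            int count, MPI_Comm comm)
{
    int rank, size, i, err;
    int *counts, *displs, base, rem, off;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Block sizes: the first (count % size) ranks get one extra element. */
    counts = malloc(size * sizeof(int));
    displs = malloc(size * sizeof(int));
    base = count / size; rem = count % size; off = 0;
    for (i = 0; i < size; i++) {
        counts[i] = base + (i < rem ? 1 : 0);
        displs[i] = off;
        off += counts[i];
    }

    /* Phase 1: each rank receives the reduced block it owns, placed at
     * its final position in recvbuf. */
    err = MPI_Reduce_scatter(sendbuf, recvbuf + displs[rank], counts,
                             MPI_DOUBLE, MPI_SUM, comm);
    if (err == MPI_SUCCESS)
        /* Phase 2: all ranks act as sources; no head node involved. */
        err = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                             recvbuf, counts, displs, MPI_DOUBLE, comm);

    free(counts);
    free(displs);
    return err;
}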

2) In the Alltoall I noticed that MPICH1 posts all the receives and then 
uses non-blocking sends + Msg_Wait. In my experience this can lead to 
oversubscription and packet loss. Last year, when working with Giuseppe 
Ciaccio to test MPIGAMMA on our cluster, I found truly horrible 
performance on Alltoall for large numbers of processes (hours instead of 
seconds for M ~ 100). The problem was exacerbated by GAMMA's rudimentary 
flow control, which does not expect oversubscription. However, in my 
opinion a scalable collective algorithm should not oversubscribe if at 
all possible. In the case of Alltoall, an extra loop makes it possible to 
post only a limited number of receives at a time; I found about 4 was 
optimum for MPIGAMMA.
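
For instance, something along these lines (a rough sketch, not the actual 
MPIGAMMA fix; the window of 4 and the plain MPI_BYTE blocks are just for 
illustration) never has more than a handful of receives and sends 
outstanding at once:

#include <mpi.h>
#include <stddef.h>

#define WINDOW 4   /* ~4 outstanding exchanges worked best with MPIGAMMA */

/* Alltoall done in windows: at each step k the pattern is a cyclic shift
 * (rank r sends to r-k and receives from r+k), so no receiver is ever
 * flooded, and at most WINDOW exchanges are in flight per rank. */
static int windowed_alltoall(void *sendbuf, void *recvbuf,
                             int blockbytes, MPI_Comm comm)
{
    int rank, size, i, j, nb, nreq, err = MPI_SUCCESS;
    MPI_Request req[2 * WINDOW];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (i = 0; i < size; i += WINDOW) {
        nb = (size - i < WINDOW) ? size - i : WINDOW;
        nreq = 0;

        /* Post at most WINDOW receives ... */
        for (j = 0; j < nb; j++) {
            int src = (rank + i + j) % size;
            MPI_Irecv((char *)recvbuf + (size_t)src * blockbytes, blockbytes,
                      MPI_BYTE, src, 0, comm, &req[nreq++]);
        }
        /* ... and the matching sends (the partner of step i+j receives
         * from us in the same window), then wait before moving on. */
        for (j = 0; j < nb; j++) {
            int dst = (rank - i - j + 2 * size) % size;
            MPI_Isend((char *)sendbuf + (size_t)dst * blockbytes, blockbytes,
                      MPI_BYTE, dst, 0, comm, &req[nreq++]);
        }
        err = MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        if (err != MPI_SUCCESS)
            break;
    }
    return err;
}

With WINDOW equal to the number of processes this reduces to the 
post-everything approach; with a small WINDOW the receiver never has to 
cope with more than a few simultaneous senders.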

One other thing: is it possible to tune the eager-rendezvous transition? 
It's good for pairwise exchanges, but for rings I think something a bit 
larger would be best for our setup.

I really like what I have seen of MPICH2 so far (apart from the 
multicore collectives). I think it will easily be the best MPI for TCP.

Tony

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:  http://ladd.che.ufl.edu

Tel:  (352)-392-6509
FAX:  (352)-392-9514

