[MPICH] MPICH2 performance issue with dual core
Tony Ladd
tladd at che.ufl.edu
Thu Dec 27 12:37:35 CST 2007
Rajeev Thakur wrote:
> The collectives in MPICH2 are not optimized for multicore, but it's at the
> top of our list to do.
>
> Rajeev
>
>
>
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov
>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Ladd
>> Sent: Wednesday, December 26, 2007 12:00 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [MPICH] MPICH2 performance issue with dual core
>>
>> I am using MPICH2 over Gigabit Ethernet (Intel PRO 1000 + Extreme
>> Networks x450a-s48t switches). For a single process per node MPICH2
>> is very fast; typical throughput on an edge exchange is ~100 MBytes/sec
>> both ways. MPICH2 has more uniform throughput than LAM, is much
>> faster than OpenMPI, and has almost as good throughput as MPIGAMMA
>> (using 1 MB TCP buffers). Latency is 24 microsecs with tuned NIC
>> drivers. So far so (very) good.
>>
>> Collective communications are excellent with one process per node as
>> well, but terrible with two processes per node. For example, an
>> AlltoAll with 16 processes has an average 1-way throughput of
>> 56 MBytes/sec when distributed over 16 nodes, but only 6 MBytes/sec
>> when using 8 nodes and 2 processes per node. This is of course the
>> reverse of what one would expect. I also see that latency goes up
>> more with 2 processes per node: a 4-process Barrier call takes about
>> 58 microsecs on 4 nodes and 68 microsecs on 2 nodes. I checked with a
>> single node and two processes, and that was very fast (over
>> 400 MBytes/sec), so perhaps the issue is the interaction of shared
>> memory and TCP. I compiled both ch3:ssm and ch3:nemesis with the same
>> result, and building with and without --enable-fast also made little
>> difference.
>>
>> Finally, I notice the CPU utilization is 100%; can this be part of
>> the problem?
>>
>> I apologize if this has been gone over before, but I am new to MPICH2.
>>
>> Thanks
>>
>> Tony
>>
>> --
>> Tony Ladd
>>
>> Chemical Engineering Department
>> University of Florida
>> Gainesville, Florida 32611-6005
>> USA
>>
>> Email: tladd-"(AT)"-che.ufl.edu
>> Web: http://ladd.che.ufl.edu
>>
>> Tel: (352)-392-6509
>> FAX: (352)-392-9514
>>
>>
>>
>
>
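(For anyone wanting to reproduce the kind of AlltoAll figure quoted
above, a timing loop along the following lines suffices. This is an
illustrative harness rather than the benchmark actually used; it takes
"1-way throughput" to mean the bytes each process sends divided by the
time per operation, and the message size is arbitrary.)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int count = 128 * 1024;              /* doubles per destination (arbitrary) */
    int iters = 20;
    double *sendbuf = malloc((size_t)count * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)count * nprocs * sizeof(double));
    for (int i = 0; i < count * nprocs; i++) sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / iters;          /* time per AlltoAll */

    if (rank == 0) {
        double bytes = (double)count * nprocs * sizeof(double);  /* sent per process */
        printf("AlltoAll, %d procs: %.1f MBytes/sec one-way per process\n",
               nprocs, bytes / t / 1.0e6);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}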
Rajeev,

Is there a time frame for the release of optimized multi-core
collectives? Since there is going to be development work, may I also
make a couple of observations about where the collectives in MPICH1
(and perhaps MPICH2) could be improved?
1) I think Rabenseifner's AllReduce can be done in half the time by
avoiding the copy to a single head node. For a message of length N, the
reduce distributed over the M nodes moves order N data (M messages of
length N/M), and gathering the result onto the head node moves another
N (M * N/M); the B'cast costs the same again. If every node keeps its
reduced chunk and acts as a source for the B'cast, the copies to and
from the head node disappear, so the time is 2N rather than 4N (a
sketch of this reduce-scatter + allgather form follows after point 2).
My hand-coded AllReduce is typically about twice as fast as
MPICH1/MPIGAMMA. Of course you need a dedicated AllReduce function
rather than a combination of Reduce + Bcast, but would not the extra
speed be worth it in such an important collective?
2) In the Alltoall I noticed that MPICH1 posts all the receives and
then issues the non-blocking sends followed by a Waitall. In my
experience this can lead to oversubscription and packet loss. Last
year, while working with Giuseppe Ciaccio to test MPIGAMMA on our
cluster, I found truly horrible Alltoall performance for large numbers
of processes (hours instead of seconds for M ~ 100). The problem was
exacerbated by GAMMA's rudimentary flow control, which does not expect
oversubscription, but in my opinion a scalable collective algorithm
should not oversubscribe if at all possible. In the case of Alltoall,
an additional loop caps the number of receives posted at any one time;
I found a window of about 4 was optimum for MPIGAMMA (sketched below).
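To make point 1 concrete, the reduce-scatter + allgather form I have in
mind looks schematically like this. It is an illustrative sketch only,
not my actual implementation; it assumes MPI_SUM on doubles and a count
divisible by the number of processes, just to keep it short.

/* Illustrative sketch: an AllReduce built from a reduce-scatter followed
 * by an allgather, so every node sources part of the "broadcast" phase
 * and nothing is funnelled through a head node. */
#include <mpi.h>
#include <stdlib.h>

void allreduce_rsag(double *sendbuf, double *recvbuf, int count, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int chunk = count / nprocs;                  /* N/M elements per process */
    int *counts = malloc(nprocs * sizeof(int));
    double *partial = malloc(chunk * sizeof(double));
    for (int i = 0; i < nprocs; i++) counts[i] = chunk;

    /* Phase 1, ~N data moved: each process ends up holding the fully
     * reduced result for its own chunk. */
    MPI_Reduce_scatter(sendbuf, partial, counts, MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 2, ~N data moved: every process broadcasts its chunk, so the
     * complete result is assembled everywhere with no head-node copy. */
    MPI_Allgather(partial, chunk, MPI_DOUBLE, recvbuf, chunk, MPI_DOUBLE, comm);

    free(partial);
    free(counts);
}

The two phases each move about N data, giving the 2N of point 1, versus
4N for a head-node Reduce followed by a Bcast.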
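And to make point 2 concrete, here is a schematic Alltoall that never
has more than a fixed window of receives outstanding. Again this is
illustrative (it is neither MPICH nor GAMMA code); the pairing scheme
and the window size are just placeholders for whatever a real
implementation would choose.

/* Illustrative sketch: an Alltoall that walks through its partners in
 * blocks, so at most WINDOW receives (and WINDOW sends) are in flight
 * at any moment. */
#include <mpi.h>

#define WINDOW 4    /* ~4 outstanding exchanges was the optimum for MPIGAMMA */

void alltoall_windowed(double *sendbuf, double *recvbuf, int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    MPI_Request reqs[2 * WINDOW];

    for (int base = 0; base < nprocs; base += WINDOW) {
        int limit = (base + WINDOW < nprocs) ? base + WINDOW : nprocs;
        int nreq = 0;

        for (int i = base; i < limit; i++) {
            int src = (rank + i) % nprocs;           /* receive block i from src */
            MPI_Irecv(recvbuf + src * count, count, MPI_DOUBLE,
                      src, 0, comm, &reqs[nreq++]);
        }
        for (int i = base; i < limit; i++) {
            int dst = (rank - i + nprocs) % nprocs;  /* dst posted its matching recv */
            MPI_Isend(sendbuf + dst * count, count, MPI_DOUBLE,
                      dst, 0, comm, &reqs[nreq++]);
        }
        /* Drain this window before opening the next one. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }
}

MPICH1's current behaviour is the special case where the window spans
all the partners at once, which is exactly the regime where I saw the
packet loss.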
One other thing: is it possible to tune the eager-rendezvous
transition? The default is good for pairwise exchanges, but for ring
patterns I think a somewhat larger threshold would suit our setup
better.
I really like what I have seen of MPICH2 so far (apart from the
multicore collectives). I think it will easily be the best MPI for TCP.
Tony
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web: http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514