[Fwd: Re: [MPICH] MPICH2 performance issue with dual core]

Tony Ladd tladd at che.ufl.edu
Wed Dec 26 13:31:45 CST 2007



-------- Original Message --------
Subject: 	Re: [MPICH] MPICH2 performance issue with dual core
Date: 	Wed, 26 Dec 2007 14:30:48 -0500
From: 	Tony Ladd <tladd at che.ufl.edu>
To: 	chong tan <chong_guan_tan at yahoo.com>
References: 	<277590.4440.qm at web52006.mail.re2.yahoo.com>



chong tan wrote:
> There are a few things to be expected when running on a multi-core CPU.
> In general, one dual-core CPU does not perform as well as two uni-core
> CPUs of the same caliber. The shared cache, memory I/O, and other shared
> interfaces are the key reasons why. Cache is one example: the cost of a
> cache eviction can be 2x-3x higher on a dual-core CPU, and on top of
> that you have two processes contributing to the rate of evictions.
>  
> tan
>  
>  
>
>  
> ----- Original Message ----
> From: Tony Ladd <tladd at che.ufl.edu>
> To: mpich-discuss at mcs.anl.gov
> Sent: Wednesday, December 26, 2007 10:00:03 AM
> Subject: [MPICH] MPICH2 performance issue with dual core
>
> I am using MPICH2 over Gigabit Ethernet (Intel PRO 1000 + Extreme
> Networks x450a-s48t switches). For a single process per node MPICH2 is
> very fast; typical throughput on an edge exchange is ~100 MBytes/sec in
> both directions. MPICH2 has more uniform throughput than LAM, is much
> faster than OpenMPI, and has almost as good throughput as MPIGAMMA
> (using 1 MB TCP buffers). Latency is 24 microsecs with tuned NIC
> drivers. So far so (very) good.
>
> Collective communications are excellent with 1 process per node as well,
> but terrible with 2 processes per node. For example, an AlltoAll with 16
> processes has an average one-way throughput of 56 MBytes/sec when
> distributed over 16 nodes, but only 6 MBytes/sec when using 8 nodes and
> 2 processes per node. This is of course the reverse of what one would
> expect. I also see that latency goes up more with 2 processes per node:
> a 4-process Barrier call takes about 58 microsecs on 4 nodes and 68
> microsecs on 2 nodes. I checked with a single node and two processes and
> that was very fast (over 400 MBytes/sec), so perhaps the issue is the
> interaction of shared memory and TCP. I compiled both ch3:ssm and
> ch3:nemesis with the same result, and tried with and without
> --enable-fast; that also made little difference.
>
> Finally, I notice the CPU utilization is 100%; could this be part of the
> problem?
>
> I apologize if this has been gone over before, but I am new to MPICH2.
>
> Thanks
>
> Tony
>
> -- 
> Tony Ladd
>
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
>
> Email: tladd-"(AT)"-che.ufl.edu
> Web:   http://ladd.che.ufl.edu
>
> Tel:  (352)-392-6509
> FAX:  (352)-392-9514
>
>
>

Tan

Thanks for the quick response. A couple of additional details. I have 
run extensive MPI benchmarks on this (and other) systems over the years 
and I have not seen this behavior before (see 
http://ladd.che.ufl.edu/research/beoclus/beoclus.htm if you are 
interested). GigE has huge latency and very small bandwidth compared 
with the internal memory bus (even if shared). And the drop in 
performance is huge: a factor of 10 for the alltoall and a factor of 4 
for the allreduce. No other MPI I have tried does this; that includes 
MPICH-1, LAM, OpenMPI, and MPIGAMMA. Typically dual-core MPI performance 
is the same as single-core for the same number of processes, and usually 
2 cores per node is a bit quicker. It can vary depending on exactly what 
you are doing, but by no more than 10-20%.
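
For reference, this is roughly the kind of alltoall timing loop I mean 
(a minimal sketch only; the message size, iteration count, and the 
one-way throughput definition below are placeholders, not the exact 
parameters of my benchmarks):

/* alltoall.c - minimal MPI_Alltoall timing sketch (illustrative only) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 65536;   /* assumed per-destination message size */
    const int iters     = 100;     /* assumed iteration count */
    int rank, nprocs, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)msg_bytes * nprocs);
    char *recvbuf = malloc((size_t)msg_bytes * nprocs);

    /* warm up, then time a block of iterations */
    MPI_Alltoall(sendbuf, msg_bytes, MPI_BYTE,
                 recvbuf, msg_bytes, MPI_BYTE, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, msg_bytes, MPI_BYTE,
                     recvbuf, msg_bytes, MPI_BYTE, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / iters;

    if (rank == 0) {
        /* one-way throughput per process: bytes sent to the other
           (nprocs - 1) ranks divided by the time per alltoall */
        double mb = (double)msg_bytes * (nprocs - 1) / 1.0e6;
        printf("%d procs: %.1f usec/call, %.1f MBytes/sec per process\n",
               nprocs, t * 1.0e6, mb / t);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}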

You may be right that it's a cache issue. MPICH2 uses 100% CPU (polling?) 
while LAM is at < 50% CPU utilization. However, MPIGAMMA (based on 
MPICH-1) uses 100% CPU too, and it typically gets the same throughput 
with 1 or 2 cores per node.
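
One quick way to see the polling difference (just a sketch; run with 2 
processes and watch the receiving rank in top while it waits - the 30 
second sleep is arbitrary):

/* poll_test.c - with a polling progress engine, rank 1 burns a full
   core inside MPI_Recv while it waits; a blocking implementation
   would sit near idle. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(30);   /* give time to observe rank 1's CPU usage */
        msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}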

The pity is that MPICH2 is excellent over TCP: the best I have tried so 
far. But such poor dual-core performance will leave LAM as the workhorse 
MPI in our group for now, even though it's no longer supported. Perhaps 
it argues for multithreaded codes.
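
By multithreaded I mean something along these lines (a rough sketch 
only, not code we run): one MPI process per node with OpenMP threads 
inside it, so the two cores share memory directly instead of both going 
through the MPI/TCP stack; the loop is just placeholder work.

/* hybrid.c - one MPI rank per node, OpenMP threads within the rank */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, i;
    double local = 0.0, total;

    /* only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#pragma omp parallel for reduction(+:local)
    for (i = 0; i < 1000000; i++)
        local += 1.0;              /* placeholder per-node work */

    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}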

Tony

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514







