[mpich-discuss] Why is my quad core slower than cluster

Mon Jul 14 11:42:59 CDT 2008

The reason for your not-so-good performance gain (X) on the QuadX2 can be summarized as follows:

There are 2 major bottlenecks on a quad-core CPU:
- Memory bandwidth is shared by all the cores on the physical CPU.  For an application of significant size and activity, the more processes you run on a quad-core CPU, the more memory contention you can expect, which limits X.

- The shared cache and other shared resources put yet another limit on X.  The shared cache can become so 'bloated' that each process is trying to evict the others' data from the cache.  The cost of shared cache eviction is extremely pronounced on some quad-core CPUs.  If you dig deep enough, you can find that one of the quad-core CPUs out there takes 120 cycles to do a cache eviction, which is extremely expensive.

On a box with a single uni-core CPU, by contrast, there is neither memory bandwidth contention nor cache contention when the application runs, and that is why you are seeing the good X.
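
To make the memory bandwidth point concrete, here is a toy model in Python. It is only a sketch: the function name speedup and the figures used (a 10 GB/s memory bus, 4 GB/s wanted by each process) are made-up assumptions for illustration, not measurements of any real CPU, and the model ignores cache effects completely.

```python
# Toy model of speedup when all processes share one memory bus.
# All numbers below are hypothetical, chosen only for illustration.

def speedup(n_procs, bus_bandwidth, demand_per_proc):
    """Speedup of n_procs processes sharing one memory bus.

    Each process would like demand_per_proc (GB/s) of memory traffic.
    Once the total demand exceeds bus_bandwidth (GB/s), every process is
    slowed down by the same factor, so the speedup stops growing.
    """
    total_demand = n_procs * demand_per_proc
    if total_demand <= bus_bandwidth:
        return float(n_procs)               # no contention: ideal scaling
    return bus_bandwidth / demand_per_proc  # bus saturated: flat plateau


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, speedup(n, bus_bandwidth=10.0, demand_per_proc=4.0))
```

With these assumed figures the speedup grows while the bus still has headroom and then stays flat, no matter how many more cores are added.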

If you prefer something that gives you good and consistent X, try SUN NIAGARA.  But that is a much slower box.

tan

--- On Mon, 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:

From: Gaetano Bellanca <gaetano.bellanca at unife.it>
Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
To: mpich-discuss at mcs.anl.gov
Date: Monday, July 14, 2008, 5:01 AM

Hello to everybody,

we have more or less the same problems. We are developing an FDTD code for electromagnetic simulation in FORTRAN. The code is mainly based on 3 loops used to compute the electric field components, and 3 identical loops to compute the magnetic field components.

We are using a small PC cluster, built some years ago, made of 10 PIV 3GHz machines connected by a 1Gbit/s ethernet LAN, and an Intel Vernonia with 2 processors / 4 cores each (8 cores in total). The processors are Intel Xeon E5345 @ 2.33GHz.
We are using the Intel 10.1 Fortran compiler (compiler options as indicated in the manual for machine optimization, with -O3) and Ubuntu 7.10 (kernel 2.6.22-14 generic on the cluster, kernel 2.6.22-14 server on the multiprocessor machine).
mpich2 is compiled with nemesis, and we are still on 2.1.06p1 (no time yet to upgrade to the latest version).

Testing the code on a simulation (kept not too big, to limit the overall run time: 85184 variables, 44x44x44 cells, 51000 temporal iterations), we had good scaling on the cluster. On the total simulation time (with parallel and sequential operations mixed) we have a speed-up of 8.5 using 10 PEs (6.2 with 9, 8.2 with 8, 5 with 7, 5.8 with 6, etc.).
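
As a quick check of these figures, the short Python sketch below divides each reported speed-up by the number of PEs to get the parallel efficiency, and confirms the cell count. The speed-ups are the ones quoted above; the names cluster_speedup and efficiency are only illustrative.

```python
# Speed-ups reported above for the cluster, keyed by number of PEs.
cluster_speedup = {10: 8.5, 9: 6.2, 8: 8.2, 7: 5.0, 6: 5.8}

def efficiency(speedup, n_pes):
    """Parallel efficiency: fraction of the ideal speed-up actually obtained."""
    return speedup / n_pes

# 44x44x44 cells, which matches the 85184 variables quoted above.
cells = 44 ** 3

if __name__ == "__main__":
    print("cells:", cells)
    for n_pes in sorted(cluster_speedup):
        eff = efficiency(cluster_speedup[n_pes], n_pes)
        print(f"{n_pes:2d} PEs: speed-up {cluster_speedup[n_pes]:.1f}, efficiency {eff:.2f}")
```

The even PE counts (6, 8, 10) come out at 0.85 or better, while 7 and 9 PEs are close to 0.7. The value for 8 PEs is slightly above 1; the figures alone do not say why.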

The same simulation has been run on the 2 PEs/quad-core machine, but we didn't get good performance.
The speed-up is 2 if we run mpiexec -n 2 ...., as the domain is divided between the two processors, which seem to work independently. But when we increase the number of processors (cores) used, running the simulation with -n 3, -n 4, etc., we have a speed-up of 2.48 with 4 cores (2 on each PE), but only 2.6 with 8 cores.
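
One way to read these numbers is the Karp-Flatt metric, which turns a measured speed-up S on p processes into an experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p). The Python sketch below applies it to the three speed-ups quoted above; the names karp_flatt and smp_speedup are only illustrative, and the metric is a diagnostic, not a description of what the hardware is actually doing.

```python
# Speed-ups reported above for the 2 x quad-core machine, keyed by mpiexec -n.
smp_speedup = {2: 2.0, 4: 2.48, 8: 2.6}

def karp_flatt(speedup, n_procs):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)

if __name__ == "__main__":
    for n_procs in sorted(smp_speedup):
        e = karp_flatt(smp_speedup[n_procs], n_procs)
        print(f"-n {n_procs}: speed-up {smp_speedup[n_procs]:.2f}, e = {e:.3f}")
```

If the code simply had a fixed sequential part, e would stay roughly constant as processes are added. Here it grows from 0 with -n 2 to about 0.20 with -n 4 and about 0.30 with -n 8, which points to an overhead that increases with the number of cores in use, such as contention for a shared resource.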

We also tried to use -parallel or -openmp (limiting the OpenMP directives to the loops of the field computations), without obtaining significant changes in performance, whether running with mpiexec -n 1 or mpiexec -n 2 (trying to mix MPI and OpenMP).

Our idea is that we have serious problems in managing the shared resources for memory access, but we have no expertise in that, and we could be totally wrong.

Regards.

Gaetano

Gaetano Bellanca - Department of Engineering - University of Ferrara  
Via Saragat, 1 - 44100 - Ferrara - ITALY             
Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
mailto:gaetano.bellanca at unife.it 
