[mpich-discuss] Why is my quad core slower than cluster

Gus Correa gus at ldeo.columbia.edu
Mon Jul 14 13:22:19 CDT 2008


Hello Zach and list

Zach, I think it is worth trying to recompile mpich2 with icc (and
ifort) instead of gcc (and gfortran?),
then rebuild your code with the resulting mpicc wrapper
and run it again on your home PC.
Note that Gaetano and Sami didn't say which compiler they used to build 
mpich itself (icc, gcc or other).
You can get icc and ifort trial licenses from Intel for a test, if you wish.
If you skip the mpich Fortran interface you don't need ifort.
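Something along these lines could serve as a quick check after the
rebuild. It is just a rough two-process ping-pong sketch of mine, not
code from anyone on this thread, and the file name, message size, and
iteration count are arbitrary. Compile it once with the gcc-built
mpicc and once with the icc-built mpicc, run each with mpiexec -n 2
on the quad core, and compare the reported rates before rerunning the
full application:

/* pingpong.c - rough intra-node ping-pong sketch (not from this thread).
   Two ranks bounce a 1 MB message back and forth and report the rate. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int nbytes = 1 << 20;   /* 1 MB messages, arbitrary choice */
    const int iters  = 1000;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = (char *) malloc(nbytes);
    memset(buf, 0, nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("round trip: %.1f us   bandwidth: %.1f MB/s\n",
               1e6 * (t1 - t0) / iters,
               2.0 * nbytes * iters / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Something like "mpicc -O3 pingpong.c -o pingpong" followed by
"mpiexec -n 2 ./pingpong" should be enough to get a number out of
each build.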

Recompiling and comparing would verify whether icc really does a better
job than gcc when it comes to compiling the "memcpy" code on the latest
multicore processors.
Memory copies may underlie most of the mpich message passing work
inside a single computer,
as a couple of experts suggested in short messages on this thread last week.
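To look at the compiler effect on memcpy itself, outside of MPI, a
plain single-process copy benchmark along these lines (again only a
sketch of mine, with an arbitrary buffer size and iteration count)
can be built once with gcc -O3 and once with icc -O3, and the two
rates compared on the quad core:

/* memcpy_bw.c - rough memcpy bandwidth check (a sketch, not from this
   thread).  Build the same file with gcc -O3 and with icc -O3. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const size_t nbytes = 64 * 1024 * 1024;   /* 64 MB buffers, arbitrary */
    const int iters = 50;
    char *src = malloc(nbytes);
    char *dst = malloc(nbytes);
    double t0, t1;
    int i;

    memset(src, 1, nbytes);
    memset(dst, 0, nbytes);

    t0 = now();
    for (i = 0; i < iters; i++)
        memcpy(dst, src, nbytes);
    t1 = now();

    /* print a byte of dst so the copies cannot be optimized away */
    printf("memcpy: %.0f MB/s (check byte %d)\n",
           nbytes * (double) iters / (t1 - t0) / 1e6, (int) dst[nbytes - 1]);

    free(src);
    free(dst);
    return 0;
}

If the icc binary is clearly faster here but the MPI runs are not,
the bottleneck is probably not in the compiled memcpy at all.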

As a bonus, compiling mpich with icc would also give a fairer
comparison between
your home PC and your USC cluster,
which also uses icc (although I don't know if icc was also used to 
compile mpich there).

In any case, recompiling with icc may not make a difference,
and the often-mentioned memory bandwidth limitation of multi-core machines
may be the dominant factor in the poor performance and speedups that
you, Gaetano, Sami, Matthew, I, and so many other people have experienced
and reported on this and other mailing lists.
(FYI, I saw the same type of complaint on climate and ocean,
computational biology, and cluster administration lists, and there is
probably much more out there.)
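One rough way to see whether the shared memory bus is the limit, rather
than anything in mpich, is to run the same kind of copy loop in 1, 2, 4,
and 8 processes at once and watch the per-process rate. The sketch
below (mine, not from the thread, with an arbitrary buffer size) does
that with fork(). If the per-process numbers drop sharply as processes
are added, the bus is saturating, and no MPI library or compiler will
give a 4x or 8x speedup on copy-bound loops.

/* copy_scale.c - rough sketch (not from this thread) of memory bus
   saturation: run the same copy loop in N processes at once.
   Usage: ./copy_scale N      (N = number of concurrent processes) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(int argc, char **argv)
{
    int nprocs = (argc > 1) ? atoi(argv[1]) : 1;
    const size_t nbytes = 64 * 1024 * 1024;   /* 64 MB per process */
    const int iters = 20;
    int p, i;

    for (p = 0; p < nprocs; p++) {
        if (fork() == 0) {            /* child: run the timed copy loop */
            char *src = malloc(nbytes);
            char *dst = malloc(nbytes);
            double t0, t1;
            memset(src, 1, nbytes);
            memset(dst, 0, nbytes);
            t0 = now();
            for (i = 0; i < iters; i++)
                memcpy(dst, src, nbytes);
            t1 = now();
            printf("process %d: %.0f MB/s (check byte %d)\n", p,
                   nbytes * (double) iters / (t1 - t0) / 1e6,
                   (int) dst[nbytes - 1]);
            _exit(0);
        }
    }
    for (p = 0; p < nprocs; p++)      /* parent: wait for all children */
        wait(NULL);
    return 0;
}

Comparing ./copy_scale 1 with ./copy_scale 4 and ./copy_scale 8 on the
quad core should be telling; on ten separate PIV nodes each process has
its own memory bus, which may be one reason the old cluster scales better.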

Icc versus gcc may be a last-ditch effort, but it may be worth trying.

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


zach wrote:

>This is starting to sound like a limitation of using MPI on multicore
>processors and not necessarily an issue with the install or
>configuration of mpich.
>Can we expect improvements to mpich in the near future to deal with this?
>Is it just that the quad-core CPUs are newer and have not been
>rigorously tested with mpich yet to find all the issues?
>-not great news for me, since I just built a quad core box thinking
>I would get a near-4x speed-up... :(
>
>Zach
>
>On 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:
>  
>
>>Hello to everybody,
>>
>>we have more or less the same problems. We are developing an FDTD code for
>>electromagnetic simulation in FORTRAN. The code is mainly based on 3 loops
>>used to compute the electric field components, and 3 identical loops to
>>compute the magnetic field components.
>>
>>We are using a small PC cluster built some years ago with 10 PIV 3GHz
>>nodes connected by a 1Gbit/s ethernet LAN, and an Intel Vernonia machine
>>with 2 processors / 4 cores each (8 cores total). The processors are
>>Intel Xeon E5345 @ 2.33GHz.
>>We are using the Intel 10.1 Fortran compiler (compiler options as indicated
>>in the manual for machine optimization, with -O3) and ubuntu 7.10 (kernel
>>2.6.22-14 generic on the cluster, kernel 2.6.22-14 server on the
>>multiprocessor machine).
>>mpich2 is compiled with nemesis, and we are still on the 2.1.06p1 (still
>>no time to upgrade to the latest version).
>>
>>Testing the code on a simulation kept small to limit the overall time
>>(85184 variables, 44x44x44 cells, 51000 temporal iterations), we had good
>>scaling on the cluster. On the total simulation time (with parallel and
>>sequential operations mixed) we have a speed-up of 8.5 using 10 PEs
>>(6.2 with 9, 8.2 with 8, 5 with 7, 5.8 with 6, etc.).
>>
>>The same simulation has been run on the 2-PE / quad-core machine, but we
>>didn't get good performance.
>>The speed-up is 2 if we run mpiexec -n 2 ...., as the domain is divided
>>between the two processors, which seem to work independently. But when we
>>increase the number of processors (cores) used, running the simulation with
>>-n 3, -n 4, etc., we get a speed-up of 2.48 with 4 cores (2 on each PE),
>>but only 2.6 with all 8 cores.
>>
>>We also tried to use -parallel or -openmp (limiting the OpenMP directives
>>to the field-computation loops only), without obtaining significant
>>changes in performance, running with either mpiexec -n 1 or mpiexec -n 2
>>(trying to mix MPI and OpenMP).
>>
>>Our idea is that we have serious problems in managing the shared resources
>>for memory access, but we have no expertise in that area, and we could be
>>totally wrong.
>>
>>Regards.
>>
>>Gaetano
>>
>>
>>________________________________
>>Gaetano Bellanca - Department of Engineering - University of Ferrara
>>Via Saragat, 1 - 44100 - Ferrara - ITALY
>>Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
>>mailto:gaetano.bellanca at unife.it
>>________________________________
>>



