[mpich-discuss] Why is my quad core slower than cluster

H. Sami Sozuer hss at photon.iyte.edu.tr
Mon Jul 14 12:13:12 CDT 2008


It could be an mpich problem or more generally an OS problem.

Suppose we have a 4-CPU system with 2 cores per CPU.
Let's label the cores c1a, c1b, c2a, c2b, c3a, c3b, and c4a, c4b,
where the number identifies the CPU and a, b label the
two cores on that CPU.

The question is: which set of cores will be used when one
issues a command like
mpiexec -n 4 ....

Will it be
A) c1a, c1b, c2a, c2b
or will it be
B) c1a, c2a, c3a, c4a
or some other combination?

Can mpich or the OS tell that the two cores on each CPU share that
CPU's memory path, and so avoid assigning jobs to cores on the same
CPU, or does it treat all 8 cores as completely symmetrical? The two
cases A) and B) make a huge difference in performance: on a NUMA
system such as an Opteron, with memory-bandwidth-limited problems,
case A) will take nearly twice as long as case B). With multicore,
multi-CPU systems so common these days, I think this is a problem
well worth looking into.
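
One way to test A) against B) directly is to have each rank pin itself
to an explicit core before the main computation, e.g. with
sched_setaffinity() on Linux. This is only a rough sketch; the core
numbers in the two layouts are illustrative and have to be matched
against the physical id / core id fields in /proc/cpuinfo on the real
machine:

/* pin_by_rank.c - each MPI rank pins itself to one core so that
   layout A (fill both cores of a CPU first) and layout B (one core
   per CPU) can be compared directly. Core numbers are hypothetical. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int layout_a[4] = { 0, 1, 2, 3 };   /* c1a,c1b,c2a,c2b */
    int layout_b[4] = { 0, 2, 4, 6 };   /* c1a,c2a,c3a,c4a */
    int *layout = layout_b;             /* switch to layout_a to compare */
    cpu_set_t mask;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    CPU_SET(layout[rank % 4], &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
    printf("rank %d pinned to core %d\n", rank, layout[rank % 4]);

    /* ... memory-bandwidth-limited work goes here ... */

    MPI_Finalize();
    return 0;
}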

Sami


Rajeev Thakur wrote:
> Not sure if it's an MPICH problem or a memory bandwidth problem on
> multicore.
>
> One way to check the memory bandwidth is to run the STREAM benchmark on 1
> core and multiple cores. http://www.cs.virginia.edu/stream/ref.html
>
> Rajeev 
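
If STREAM itself is not at hand, even a crude triad loop gives the
same picture when run as 1, 2, 4 and 8 simultaneous copies (e.g. via
mpiexec -n 1/2/4/8) and the per-copy MB/s is compared. This is only a
rough stand-in for STREAM; array size and repeat count are
illustrative:

/* triad.c - crude stand-in for the STREAM triad kernel.
   If the per-copy bandwidth drops as more copies run at once, the
   cores are sharing a saturated memory bus. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N    (5 * 1000 * 1000)   /* ~120 MB total for the three arrays */
#define REPS 20

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    struct timeval t0, t1;
    long i;
    int r;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    gettimeofday(&t0, NULL);
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];    /* 2 loads + 1 store per element */
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    printf("triad: %.0f MB/s (a[1] = %f)\n",
           REPS * 3.0 * N * sizeof(double) / 1e6 / secs, a[1]);

    free(a); free(b); free(c);
    return 0;
}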
>
>   
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov 
>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of zach
>> Sent: Monday, July 14, 2008 10:45 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>>
>> This is starting to sound like a limitation of using MPI on multicore
>> processors and not necessarily an issue with the install or
>> configuration of mpich.
>> Can we expect improvements to mpich in the near future to 
>> deal with this?
>> Is it just that the quad multicore cpus are newer and have not been
>> rigorously tested with mpich yet to find all the issues?
>> -not great news for me since I just built a quad core cpu box thinking
>> I would get near x4 speed-up... :(
>>
>> Zach
>>
>> On 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:
>>     
>>> Hello to everybody,
>>>
>>> we have more or less the same problems. We are developing an FDTD
>>> code for electromagnetic simulation in FORTRAN. The code is mainly
>>> based on 3 loops used to compute the electric field components, and 3
>>> identical loops to compute the magnetic field components.
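
For reference, each of those loops is essentially the standard Yee
update sketched below (a generic C sketch, not the actual Fortran
code; array layout and coefficient are illustrative). The point is
that it does only a few flops per cell against roughly five double
loads and one store, so it is limited by memory bandwidth rather than
by the CPU:

/* Generic single-component FDTD (Yee) update, illustrative only. */
void update_ex(int nx, int ny, int nz, double cb,
               double *ex, const double *hy, const double *hz)
{
    int i, j, k;
#define IDX(i, j, k) (((i) * ny + (j)) * nz + (k))

    for (i = 0; i < nx; i++)
        for (j = 1; j < ny; j++)
            for (k = 1; k < nz; k++)
                ex[IDX(i, j, k)] += cb *
                    ((hz[IDX(i, j, k)] - hz[IDX(i, j - 1, k)])
                   - (hy[IDX(i, j, k)] - hy[IDX(i, j, k - 1)]));
#undef IDX
}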
>>>
>>> We are using a small PC cluster, built some years ago, made of 10 PIV
>>> 3GHz machines connected with a 1 Gbit/s ethernet LAN, and an Intel
>>> Vernonia machine with 2 processors / 4 cores each (8 cores in total).
>>> The processors are Intel Xeon E5345 @ 2.33GHz.
>>> We are using the Intel 10.1 fortran compiler (compiler options as
>>> indicated in the manual for machine optimization, with -O3), ubuntu
>>> 7.10 (kernel 2.6.22-14 generic on the cluster, kernel 2.6.22-14 server
>>> on the multiprocessor machine).
>>> mpich2 is compiled with nemesis, and we are still with 2.1.06p1 (still
>>> no time to upgrade to the latest version).
>>>
>>> Testing the code on a (not too big, to keep the overall time limited)
>>> simulation (85184 variables, 44x44x44 cells, 51000 temporal iterations)
>>> we had good scaling on the cluster. On the total simulation time (with
>>> parallel and sequential operations mixed) we have a speed-up of 8.5
>>> using 10 PEs (6.2 with 9, 8.2 with 8, 5 with 7, 5.8 with 6, etc.).
>>>
>>> The same simulation has been run on the 2 PEs / quad-core machine, but
>>> we didn't get good performance. The speed-up is 2 if we run
>>> mpiexec -n 2 ...., as the domain is divided between the two processors,
>>> which seem to work independently. But, increasing the number of
>>> processors (cores) used, running the simulation with -n 3, -n 4 etc.,
>>> we have a speed-up of 2.48 with 4 cores (2 on each PE), but only 2.6
>>> with 8 PEs.
>>>
>>> We also tried to use -parallel or -openmp (limiting the openmp
>>> directives to the loops of the field computations), without obtaining
>>> significant changes in performance, both running with mpiexec -n 1 and
>>> with mpiexec -n 2 (trying to mix mpi and openmp).
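
One arrangement worth trying there (only a sketch, not necessarily
what was attempted) is one MPI rank per processor with one OpenMP
thread per core inside the field loops, i.e. mpiexec -n 2 with
OMP_NUM_THREADS=4 on the 2-processor quad-core machine, built with the
compiler's openmp flag:

/* Same generic update as above with the outer loop threaded.
   Whether the 4 threads actually land on the 4 cores of "their"
   processor still depends on how the OS schedules them. */
void update_ex_omp(int nx, int ny, int nz, double cb,
                   double *ex, const double *hy, const double *hz)
{
    int i, j, k;
#define IDX(i, j, k) (((i) * ny + (j)) * nz + (k))

    /* each thread gets a contiguous slab of planes in i */
    #pragma omp parallel for private(j, k) schedule(static)
    for (i = 0; i < nx; i++)
        for (j = 1; j < ny; j++)
            for (k = 1; k < nz; k++)
                ex[IDX(i, j, k)] += cb *
                    ((hz[IDX(i, j, k)] - hz[IDX(i, j - 1, k)])
                   - (hy[IDX(i, j, k)] - hy[IDX(i, j, k - 1)]));
#undef IDX
}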
>>>
>>> Our idea is that we have serious problems in managing the shared
>>> resources for memory access, but we have no expertise on that, and we
>>> could be totally wrong.
>>>
>>> Regards.
>>>
>>> Gaetano
>>>
>>>
>>> ________________________________
>>> Gaetano Bellanca - Department of Engineering - University of Ferrara
>>> Via Saragat, 1 - 44100 - Ferrara - ITALY
>>> Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
>>> mailto:gaetano.bellanca at unife.it
>>> ________________________________
>>>





More information about the mpich-discuss mailing list