[mpich-discuss] Why is my quad core slower than cluster

Mon Jul 14 21:48:06 CDT 2008

Hello Sami and list

Oh, well, as you see, an expert who claims to know the answers to these 
problems
seems not to be willing to share these answers with less knowledgeable 
MPI users like us.
So, maybe we can find the answers ourselves, not by individual 
"homework" brainstorming,
but through community collaboration and generous information sharing,
which is the hallmark of this mailing list.

I Googled around today to find out how to assign MPI processes to 
specific processors,
and I found some interesting information on how to do it.

Below is a link to a posting from the computational fluid dynamics (CFD) 
community that may be of interest.
Not surprisingly, they are struggling with the same type of problems all 
of us have,
including how to tie MPI processes to specific processors:

http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006

I would summarize these problems as related to three types of bottleneck:

1) Multicore processor bottlenecks (standalone machines and clusters)
2) Network fabric bottlenecks (clusters)
3) File system bottlenecks (clusters)

All three types of problems are due to contention for some type of 
system resource
by the MPI processes that take part in a computation/program.

Our focus in this thread, started by Zach, has been on problem 1),
although most of us may need to look into problems 2) and 3) sooner or
later.
(I already have all three of them!)

The CFD folks use MPI as we do.
They seem to use another MPI flavor, but the same problems are there.
The problems are not caused by MPI itself, but they become apparent when 
you run MPI programs.
That has been my experience too.

As for how to map the MPI processes to specific processors (or cores),
the key command seems to be "taskset", as my googling afternoon showed.
Try "man taskset" for more info.

For a standalone machine like yours, something like the command line 
below should work to
force execution on "processors" 0 and 2 (which in my case are two 
different physical CPUs):

mpiexec -n 2 taskset -c 0,2  my_mpi_program

You need to check on your computer ("more /proc/cpuinfo")
which "processor" numbers correspond to separate physical CPUs.
Most likely they are the even-numbered processors only, or the
odd-numbered only,
since you have dual-core CPUs (integers modulo 2), with "processors" 0,1
being the two
cores of the first physical CPU, "processors" 2,3 the cores of the
second physical CPU, and so on.
At least, this is what I see on my dual-core dual-processor machine.
I would guess that for quad-cores the separate physical CPUs would be
processors 0,4,8, etc,
or 1,5,9, etc, and so on (integers modulo 4), with "processors" 0,1,2,3
being the four cores
in the first physical CPU, and so on.

In /proc/cpuinfo look for the keyword "processor".
These are the numbers you need to use in "taskset -c".
However, other helpful information comes in the keywords "physical id",
"core id", "siblings", and "cpu cores".
They will allow you to map cores and physical CPUs to
the "processor" number.
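
The mapping described above can be sketched in a few lines of Python.
This is only a minimal sketch: the function name "cores_by_socket" and the
SAMPLE text are made up for illustration, and the sample mimics the
/proc/cpuinfo layout of a machine with two dual-core CPUs rather than
quoting any real machine.

```python
from collections import defaultdict

def cores_by_socket(cpuinfo_text):
    """Map each "physical id" to the sorted "processor" numbers it contains."""
    sockets = defaultdict(list)
    processor = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between two processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id" and processor is not None:
            sockets[int(value)].append(processor)
    return {sock: sorted(procs) for sock, procs in sockets.items()}

# Made-up excerpt in the /proc/cpuinfo layout (assumption: 2 CPUs x 2 cores).
SAMPLE = """\
processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 1
physical id\t: 0
core id\t: 1

processor\t: 2
physical id\t: 1
core id\t: 0

processor\t: 3
physical id\t: 1
core id\t: 1
"""

for sock, procs in sorted(cores_by_socket(SAMPLE).items()):
    print("physical id", sock, "-> processors", procs)
```

On a real Linux machine you would pass the contents of /proc/cpuinfo
instead of SAMPLE, and then pick one "processor" number from each
"physical id" for "taskset -c".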

The "taskset" command line above worked on one of my standalone
multicore machines,
and I hope a variant of it will work on your machine also.
It works with the "mpiexec" that comes with the MPICH distribution, and
also with
the "mpiexec" associated with the Torque/PBS batch system, which is nice
for clusters as well.
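
Note that the command line above gives both MPI processes the same
allowed set (processors 0 and 2), so the scheduler may still move them
between those two. If you want exactly one core per process, a small
wrapper could choose the core from the process rank. The sketch below is
hypothetical: the helper names are made up, and reading the rank from the
PMI_RANK environment variable is an assumption that you should check
against your own mpiexec.

```python
import os

def cpu_for_rank(rank, cpus):
    """Pick one entry of `cpus` for this rank, wrapping around if needed."""
    return cpus[rank % len(cpus)]

def taskset_argv(rank, cpus, program_argv):
    """Build the argument list for running program_argv under taskset -c."""
    return ["taskset", "-c", str(cpu_for_rank(rank, cpus))] + list(program_argv)

# Assumption: the process manager exports the rank as PMI_RANK (default 0).
rank = int(os.environ.get("PMI_RANK", "0"))
print(" ".join(taskset_argv(rank, [0, 2], ["my_mpi_program"])))
```

A real wrapper would end with os.execvp(argv[0], argv) instead of the
print, so that the pinned program replaces the wrapper process.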

"Taskset" can change the default behavior of the Linux scheduler, which 
is to allow processes to
be moved from one core/CPU to another during execution.
The scheduler does this to ensure optimal CPU use (i.e. load balance).
With taskset you can force execution to happen on the cores you specify 
on the command line,
i.e. you can force the so-called "CPU affinity" you wish.
Note that the "taskset" man page uses both the terms "CPU" and
"processor", and doesn't use the term "core",
which may be a bit confusing.
Make no mistake, "processor" and "CPU" there stand for what we've been 
calling "core" here.
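
For those who prefer to set the affinity from inside a program,
"taskset" is a front end to the Linux sched_setaffinity system call,
which Python exposes as os.sched_setaffinity on Linux. The following is a
minimal sketch, assuming a Linux machine; the helper name "pin_to" is
made up for illustration.

```python
import os

def pin_to(cpus):
    """Restrict the calling process to `cpus` and return the resulting set."""
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

if hasattr(os, "sched_setaffinity"):  # these calls exist on Linux only
    allowed = os.sched_getaffinity(0)
    first = min(allowed)
    print("allowed before:", sorted(allowed))
    print("allowed after: ", sorted(pin_to({first})))
    pin_to(allowed)  # restore the original set
```

The numbers in these sets are the same "processor" numbers that
"taskset -c" takes.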

Other postings that you may find useful on closely related topics are:

http://www.ibm.com/developerworks/linux/library/l-scheduler/
http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html

I hope this helps,

Still, we have a long way to go to sort out how much of the multicore
bottleneck can
be ascribed to lack of memory bandwidth, how much may perhaps be
associated with how
memcpy is compiled by different compilers,
and whether there are other components of this problem that we don't see now.

Maybe our community won't find a solution to Zach's problem: "Why is my 
quad core slower than cluster?"
However, I hope that through collaboration, and by sharing information,
we may be able to nail down the root of the problem,
and perhaps to find ways to improve the alarmingly bad performance
some of us have reported on multicore machines.
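
To put a number on "alarmingly bad", one can divide each reported
speed-up by the number of processes used, which gives the parallel
efficiency. The sketch below uses the speed-ups Gaetano reported for his
8-core machine (quoted further down); the function name is mine.

```python
def efficiency(speedup, n_procs):
    """Parallel efficiency: the fraction of ideal linear speed-up achieved."""
    return speedup / n_procs

# (number of processes, speed-up) as reported for the 2 x quad-core machine
reported = [(2, 2.0), (4, 2.48), (8, 2.6)]

for n, s in reported:
    print(f"{n} processes: speed-up {s:.2f}, efficiency {efficiency(s, n):.1%}")
```

That works out to about 100%, 62% and 32.5% efficiency for 2, 4 and 8
processes respectively.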

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

chong tan wrote:

> My 2 cents:
>
> -parallel programming requires that one study not only his/her
> programming problem, but also the underlying HW on which the solution
> to the problem is to be executed.  You can't expect magic to fall
> from the sky.  Those who don't put in the effort are expecting pie
> from the sky.
>
>  
>
> The problem you described is not an MPICH problem, it is a problem
> the application designer needs to address.  There is more than one way
> to achieve what you want, and the solutions are extremely simple.
>
>  
>
> I did my homework; all the problems mentioned in this thread are
> covered in my experiments and have good explanations.  Sorry that I
> can't share too much of that due to my employment, but I'd like to
> assure you that the answers are there awaiting you.
>
>  
>
> tan
>
>
>
> --- On *Mon, 7/14/08, H. Sami Sozuer /<hss at photon.iyte.edu.tr>/* wrote:
>
>     From: H. Sami Sozuer <hss at photon.iyte.edu.tr>
>     Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>     To: mpich-discuss at mcs.anl.gov
>     Date: Monday, July 14, 2008, 10:13 AM
>
>It could be an mpich problem or more generally an OS problem.
>
>Suppose we have a 4 CPU system with 2 cores per CPU.
>Let's label the cores as c1a, c1b,  c2a,c2b,  c3a,c3b, and  c4a,c4b,
>so the number is the number of the CPU and a, b label the
>two cores on each CPU.
>
>The question is, which set of cores will be used when one
>issues a command like
>mpich -n 4 ....
>
>Will it be 
>A) c1a,c1b,c2a,c2b
>or will it be 
>B) c1a,c2a,c3a,c4a
>
>or some other combination?
>
>Can mpich or the OS distinguish the cores on each CPU
>as being on the same CPU and thus avoid assigning jobs to
>cores on the same CPU, or does it treat all 8 cores as completely
>symmetrical? Because the two cases A) or B) make a huge difference
>in terms of performance. In a NUMA system such as an Opteron system,
>with memory bandwidth limited problems, case A) will run nearly
>twice as long as case B). With multicore, multiCPU systems
>so common these days, I think this is a problem well worth
>looking into.
>
>Sami
>
>
>Rajeev Thakur wrote:
>> Not sure if it's an MPICH problem or a memory bandwidth problem on
>> multicore.
>>
>> One way to check the memory bandwidth is to run the STREAM benchmark on 1
>> core and multiple cores. http://www.cs.virginia.edu/stream/ref.html
>>
>> Rajeev 
>>
>>   
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov 
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of zach
>>> Sent: Monday, July 14, 2008 10:45 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>>>
>>> This is starting to sound like a limitation of using MPI on multicore
>>> processors and not necessarily an issue with the install or
>>> configuration of mpich.
>>> Can we expect improvements to mpich in the near future to 
>>> deal with this?
>>> Is it just that the quad multicore cpus are newer and have not been
>>> rigorously tested with mpich yet to find all the issues?
>>> -not great news for me since I just built a quad core cpu box thinking
>>> I would get near x4 speed -up... :(
>>>
>>> Zach
>>>
>>> On 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:
>>>     
>>>> Hello to everybody,
>>>>
>>>> we have more or less the same problems. We are developing an FDTD code for
>>>> electromagnetic simulation in FORTRAN. The code is mainly based on 3 loops
>>>> used to compute the electric field components, and 3 identical loops to
>>>> compute the magnetic field components.
>>>>
>>>> We are using a small PC cluster made with 10 PIV 3GHz connected with a
>>>> 1Gbit/s ethernet LAN built some years ago, and an Intel Vernonia 2 processors
>>>> / 4 cores each (total 8 cores). The processors are Intel Xeon E5345 @
>>>> 2.33GHz.
>>>> We are using the Intel 10.1 fortran compiler (compiler options as indicated
>>>> in the manual for machine optimization, with -O3), ubuntu 7.10 (kernel
>>>> 2.6.22-14 generic on the cluster, kernel 2.6.22-14 server on the
>>>> multiprocessor machine).
>>>> mpich2 is compiled with nemesis, and we are still with the 2.1.06p1 (still
>>>> no time to upgrade to the last version)
>>>>
>>>> Testing the code for a (not too big, to keep the overall time limited)
>>>> simulation (85184 variables 44x44x44 cells, 51000 temporal iterations) we
>>>> had a good scaling on the cluster. On the total simulation time (with
>>>> parallel and sequential operations mixed) we have a speed-up of 8.5 using
>>>> 10PEs ( 6.2 with 9, 8.2 with 8, 5 with 7, 5.8 with 6 etc ...).
>>>>
>>>> The same simulation has been run on the 2PEs/quad core machine but we didn't
>>>> have good performance.
>>>> The speed up is 2 if we run mpiexec -n 2 .... as the domain is divided
>>>> between the two processors which seems to work independently. But, by
>>>> increasing the number of processors (cores) used, running the simulation with
>>>> -n 3, -n 4 etc ... we have a speed-up of 2.48 with 4 cores (2 on each PE),
>>>> but only 2.6 with 8 PEs.
>>>>
>>>> We also tried to use -parallel or -openmp (limiting the openmp directives
>>>> only in the loops of field computations), without obtaining significant
>>>> changes in the performance, both running with mpiexec -n 1 or mpiexec -n 2
>>>> (trying to mix mpi and openmp).
>>>>
>>>> Our idea is that we have serious problems in managing the shared resources
>>>> for memory access, but we have no expertise on that, and we could be
>>>> totally wrong.
>>>>
>>>> Regards.
>>>>
>>>> Gaetano
>>>>
>>>>
>>>> ________________________________
>>>> Gaetano Bellanca - Department of Engineering - University of Ferrara
>>>> Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>> Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
>>>> mailto:gaetano.bellanca at unife.it
>>>> ________________________________