[mpich-discuss] Why is my quad core slower than cluster

Rajeev Thakur thakur at mcs.anl.gov
Mon Jul 14 12:19:11 CDT 2008


mpiexec leaves it up to the OS scheduler to place the processes on the
cores.
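
If you need a specific placement, you have to set the CPU affinity yourself,
e.g. with sched_setaffinity() inside the program (or with the taskset/numactl
utilities from outside, which then requires a per-rank wrapper script so each
rank gets a different mask). A rough sketch, assuming Linux and the naive
mapping "rank i -> core i" (which core IDs share a physical CPU is machine
specific; see /proc/cpuinfo):

/* Rough sketch, not an MPICH feature: pin each rank to one core from
 * inside the program.  Assumes Linux; the rank->core mapping below is
 * naive -- check /proc/cpuinfo to see which core IDs share a CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    CPU_SET(rank, &mask);               /* naive mapping: rank i -> core i */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}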

Rajeev

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of H. Sami Sozuer
> Sent: Monday, July 14, 2008 12:13 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> 
> It could be an mpich problem or more generally an OS problem.
> 
> Suppose we have a 4-CPU system with 2 cores per CPU.
> Let's label the cores c1a, c1b, c2a, c2b, c3a, c3b, and c4a, c4b,
> where the number identifies the CPU and a, b label the
> two cores on each CPU.
> 
> The question is, which set of cores will be used when one
> issues a command like
> mpiexec -n 4 ....
> 
> Will it be 
> A) c1a,c1b,c2a,c2b
> or will it be 
> B) c1a,c2a,c3a,c4a
> 
> or some other combination?
> 
> Can mpich or the OS distinguish the cores on each CPU
> as being on the same CPU, and thus avoid assigning jobs to
> cores on the same CPU, or does it treat all 8 cores as completely
> symmetrical? The two cases A) and B) make a huge difference
> in terms of performance. In a NUMA system such as an Opteron system,
> with memory-bandwidth-limited problems, case A) will run nearly
> twice as long as case B). With multicore, multi-CPU systems
> so common these days, I think this is a problem well worth
> looking into.
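> 
> One quick way to see which case you actually get is to have each rank
> print the core it is running on (a rough sketch, assuming Linux and
> glibc's sched_getcpu(); /proc/cpuinfo shows which core IDs belong to
> which physical CPU):
> 
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
> 
> /* Report the core each rank is currently running on.  Note that an
>    unpinned process is free to migrate later. */
> int main(int argc, char **argv)
> {
>     int rank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf("rank %d running on core %d\n", rank, sched_getcpu());
>     MPI_Finalize();
>     return 0;
> }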
> 
> Sami
> 
> 
> Rajeev Thakur wrote:
> > Not sure if it's an MPICH problem or a memory bandwidth problem on
> > multicore.
> >
> > One way to check the memory bandwidth is to run the STREAM 
> benchmark on 1
> > core and multiple cores. http://www.cs.virginia.edu/stream/ref.html
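> >
> > (The real benchmark is at that URL; the following is only a rough MPI
> > triad sketch of the same idea, with made-up sizes: every rank streams
> > three large arrays at once, so comparing the per-rank MB/s printed
> > with -n 1 and with -n 8 shows how far the memory system scales.)
> >
> > /* Rough triad sketch (NOT the real STREAM code). */
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> >
> > #define N      5000000        /* three double arrays, ~120 MB per rank */
> > #define NTRIES 10
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, nprocs, i, k;
> >     double *a = malloc(N * sizeof(double));
> >     double *b = malloc(N * sizeof(double));
> >     double *c = malloc(N * sizeof(double));
> >     double t, best = 1e30;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >
> >     for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
> >
> >     for (k = 0; k < NTRIES; k++) {
> >         MPI_Barrier(MPI_COMM_WORLD);   /* all ranks hit memory together */
> >         t = MPI_Wtime();
> >         for (i = 0; i < N; i++)
> >             a[i] = b[i] + 3.0 * c[i];  /* ~24 bytes of traffic per element */
> >         t = MPI_Wtime() - t;
> >         if (t < best) best = t;
> >     }
> >
> >     printf("rank %d of %d: %.1f MB/s (a[0]=%g)\n",
> >            rank, nprocs, 24.0 * N / best / 1e6, a[0]);
> >
> >     free(a); free(b); free(c);
> >     MPI_Finalize();
> >     return 0;
> > }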
> >
> > Rajeev 
> >
> >   
> >> -----Original Message-----
> >> From: owner-mpich-discuss at mcs.anl.gov 
> >> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of zach
> >> Sent: Monday, July 14, 2008 10:45 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> >>
> >> This is starting to sound like a limitation of using MPI on
> >> multicore processors and not necessarily an issue with the install
> >> or configuration of mpich.
> >> Can we expect improvements to mpich in the near future to
> >> deal with this?
> >> Is it just that the quad multicore cpus are newer and have not been
> >> rigorously tested with mpich yet to find all the issues?
> >> - not great news for me, since I just built a quad core cpu box
> >> thinking I would get a near x4 speed-up... :(
> >>
> >> Zach
> >>
> >> On 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:
> >>     
> >>> Hello to everybody,
> >>>
> >>> we have more or less the same problems. We are developing an FDTD
> >>> code for electromagnetic simulation in FORTRAN. The code is mainly
> >>> based on 3 loops used to compute the electric field components, and
> >>> 3 identical loops to compute the magnetic field components.
> >>>
> >>> We are using a small PC cluster built some years ago, made of 10
> >>> PIV 3GHz nodes connected with a 1Gbit/s ethernet LAN, and an Intel
> >>> Vernonia machine with 2 processors / 4 cores each (8 cores total).
> >>> The processors are Intel Xeon E5345 @ 2.33GHz.
> >>> We are using the Intel 10.1 fortran compiler (compiler options as
> >>> indicated in the manual for machine optimization, with -O3), Ubuntu
> >>> 7.10 (kernel 2.6.22-14 generic on the cluster, kernel 2.6.22-14
> >>> server on the multiprocessor machine).
> >>> mpich2 is compiled with nemesis, and we are still on 2.1.06p1 (no
> >>> time yet to upgrade to the latest version).
> >>>
> >>> Testing the code on a simulation that is not too big (85184
> >>> variables, 44x44x44 cells, 51000 temporal iterations), to keep the
> >>> overall time limited, we had good scaling on the cluster. On the
> >>> total simulation time (with parallel and sequential operations
> >>> mixed) we have a speed-up of 8.5 using 10 PEs (6.2 with 9, 8.2 with
> >>> 8, 5 with 7, 5.8 with 6, etc.).
> >>>
> >>> The same simulation has been run on the 2 PEs / quad core machine,
> >>> but we did not get good performance.
> >>> The speed-up is 2 if we run mpiexec -n 2 ...., as the domain is
> >>> divided between the two processors, which seem to work
> >>> independently. But, increasing the number of processors (cores)
> >>> used, running the simulation with -n 3, -n 4, etc., we have a
> >>> speed-up of 2.48 with 4 cores (2 on each PE), but only 2.6 with
> >>> 8 PEs.
> >>>
> >>> We also tried to use -parallel or -openmp (limiting the openmp
> >>> directives to the loops of the field computations), without
> >>> obtaining significant changes in performance, running either with
> >>> mpiexec -n 1 or mpiexec -n 2 (trying to mix mpi and openmp).
> >>>
> >>> Our idea is that we have serious problems in managing the shared
> >>> resources for memory access, but we have no expertise on that, and
> >>> we could be totally wrong.
> >>>
> >>> Regards.
> >>>
> >>> Gaetano
> >>>
> >>>
> >>> ________________________________
> >>> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >>> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>> Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
> >>> mailto:gaetano.bellanca at unife.it
> >>> ________________________________
> >>>
> >>>
> >>>
> 
> 
> 



