[mpich-discuss] Why is my quad core slower than cluster
Rajeev Thakur
thakur at mcs.anl.gov
Mon Jul 14 12:19:11 CDT 2008
mpiexec leaves it up to the OS scheduler to place the processes on the
cores.
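
If explicit placement is needed, one possibility (only a sketch, assuming
Linux and glibc's CPU-affinity interface; the identity rank-to-core mapping
below is just an example and should be adapted to the machine's topology)
is for each rank to bind itself right after MPI_Init:

#define _GNU_SOURCE   /* needed for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bind rank r to core r. How core numbers map to physical CPUs
       and cores is system dependent -- check /proc/cpuinfo first. */
    CPU_ZERO(&mask);
    CPU_SET(rank, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}
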
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of H. Sami Sozuer
> Sent: Monday, July 14, 2008 12:13 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>
> It could be an mpich problem or more generally an OS problem.
>
> Suppose we have a 4 CPU system with 2 cores per CPU.
> Let's label the cores as c1a, c1b, c2a, c2b, c3a, c3b, and c4a, c4b,
> where the number identifies the CPU and a, b label the
> two cores on each CPU.
>
> The question is, which set of cores will be used when one
> issues a command like
> mpiexec -n 4 ....
>
> Will it be
> A) c1a,c1b,c2a,c2b
> or will it be
> B) c1a,c2a,c3a,c4a
>
> or some other combination?
>
> Can mpich or the OS recognize that two cores are on the same CPU
> and thus avoid assigning jobs to cores that share a CPU, or does it
> treat all 8 cores as completely symmetrical? The two cases A) and B)
> make a huge difference in performance. On a NUMA system such as an
> Opteron system, with memory-bandwidth-limited problems, case A) will
> take nearly twice as long to run as case B). With multicore, multi-CPU
> systems so common these days, I think this is a problem well worth
> looking into.
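>
> One quick way to check which case you actually get (a minimal sketch,
> assuming Linux with a glibc recent enough to provide sched_getcpu();
> nothing mpich-specific is required) is to have every rank report the
> core it is currently running on:
>
> #define _GNU_SOURCE            /* needed for sched_getcpu() */
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* Report which core the OS scheduler put this rank on. */
>     printf("rank %d is running on core %d\n", rank, sched_getcpu());
>
>     MPI_Finalize();
>     return 0;
> }
>
> Mapping the printed core numbers back to physical CPUs (via /proc/cpuinfo)
> shows whether the placement looks like case A) or case B); keep in mind
> that the scheduler may still migrate processes later, so a single snapshot
> is only indicative.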
>
> Sami
>
>
> Rajeev Thakur wrote:
> > Not sure if it's an MPICH problem or a memory bandwidth problem on
> > multicore.
> >
> > One way to check the memory bandwidth is to run the STREAM benchmark on
> > 1 core and on multiple cores: http://www.cs.virginia.edu/stream/ref.html
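> >
> > If compiling STREAM is inconvenient, a rough stand-in in the same spirit
> > (only a sketch, not the STREAM code itself; the array size and iteration
> > count are arbitrary examples) is a timed triad loop started once per core
> > with mpiexec. If the per-process figure drops sharply going from -n 1 to
> > -n 4, memory bandwidth is the bottleneck:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> >
> > #define N 10000000    /* 10 million doubles per array, well beyond cache */
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, i, iter;
> >     double *a = malloc(N * sizeof(double));
> >     double *b = malloc(N * sizeof(double));
> >     double *c = malloc(N * sizeof(double));
> >     double t;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
> >
> >     t = MPI_Wtime();
> >     for (iter = 0; iter < 10; iter++)          /* STREAM-style triad */
> >         for (i = 0; i < N; i++)
> >             a[i] = b[i] + 3.0 * c[i];
> >     t = MPI_Wtime() - t;
> >
> >     /* 3 arrays touched, 8 bytes per element, 10 repetitions;
> >        printing a[0] keeps the compiler from discarding the loop. */
> >     printf("rank %d: %.1f MB/s (a[0]=%g)\n",
> >            rank, 3.0 * 8.0 * N * 10 / t / 1e6, a[0]);
> >
> >     MPI_Finalize();
> >     return 0;
> > }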
> >
> > Rajeev
> >
> >
> >> -----Original Message-----
> >> From: owner-mpich-discuss at mcs.anl.gov
> >> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of zach
> >> Sent: Monday, July 14, 2008 10:45 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> >>
> >> This is starting to sound like a limitation of using MPI on multicore
> >> processors and not necessarily an issue with the install or
> >> configuration of mpich.
> >> Can we expect improvements to mpich in the near future to deal with this?
> >> Is it just that the quad-core CPUs are newer and have not been
> >> rigorously tested with mpich yet to find all the issues?
> >> Not great news for me, since I just built a quad core CPU box thinking
> >> I would get a near 4x speed-up... :(
> >>
> >> Zach
> >>
> >> On 7/14/08, Gaetano Bellanca <gaetano.bellanca at unife.it> wrote:
> >>
> >>> Hello to everybody,
> >>>
> >>> we have more or less the same problems. We are developing an FDTD code
> >>> for electromagnetic simulation in FORTRAN. The code is mainly based on
> >>> 3 loops used to compute the electric field components, and 3 identical
> >>> loops to compute the magnetic field components.
> >>>
> >>> We are using a small PC cluster built some years ago with 10 PIV 3GHz
> >>> nodes connected by a 1Gbit/s ethernet LAN, and an Intel Vernonia machine
> >>> with 2 processors / 4 cores each (8 cores in total). The processors are
> >>> Intel Xeon E5345 @ 2.33GHz.
> >>> We are using the Intel 10.1 Fortran compiler (compiler options as
> >>> indicated in the manual for machine optimization, with -O3), Ubuntu 7.10
> >>> (kernel 2.6.22-14 generic on the cluster, kernel 2.6.22-14 server on the
> >>> multiprocessor machine).
> >>> mpich2 is compiled with nemesis, and we are still on 2.1.06p1 (still no
> >>> time to upgrade to the latest version).
> >>>
> >>> Testing the code on a simulation kept small so that the overall run time
> >>> stays limited (85184 variables, 44x44x44 cells, 51000 temporal
> >>> iterations), we had good scaling on the cluster. On the total simulation
> >>> time (with parallel and sequential operations mixed) we get a speed-up
> >>> of 8.5 using 10 PEs (6.2 with 9, 8.2 with 8, 5 with 7, 5.8 with 6, etc.).
> >>>
> >>> The same simulation has been run on the 2-PE / quad-core machine, but
> >>> there we did not get good performance.
> >>> The speed-up is 2 if we run mpiexec -n 2 ...., as the domain is divided
> >>> between the two processors, which seem to work independently. But when
> >>> we increase the number of processors (cores) used, running the
> >>> simulation with -n 3, -n 4, etc., we get a speed-up of 2.48 with 4 cores
> >>> (2 on each PE), but only 2.6 with 8 PEs.
> >>>
> >>> We also tried to use -parallel or -openmp (limiting the OpenMP
> >>> directives to the loops of the field computations), without obtaining
> >>> significant changes in performance, whether running with mpiexec -n 1
> >>> or mpiexec -n 2 (trying to mix MPI and OpenMP).
> >>>
> >>> Our idea is that we have serious problems in managing the shared
> >>> resources for memory access, but we have no expertise in that area, and
> >>> we could be totally wrong.
> >>>
> >>> Regards.
> >>>
> >>> Gaetano
> >>>
> >>>
> >>> ________________________________
> >>> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >>> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >>> mailto:gaetano.bellanca at unife.it
> >>> ________________________________
> >>>
> >>>
> >>
>
>
>