scaling on a 4-core machine

Matthew Knepley knepley at gmail.com
Wed Nov 18 14:18:27 CST 2009


There is also the paper by Barry, Bill, David, and Dinesh about SpMV. It's
very good. That is what I base my slides on.
You can see the punchline in the tutorial slides.

  Matt

On Wed, Nov 18, 2009 at 9:26 AM, Aron Ahmadia <aron.ahmadia at kaust.edu.sa> wrote:

> Does anybody have good references in the literature analyzing the memory
> access patterns for sparse solvers and how they scale?  I remember seeing
> Barry's talk about multigrid memory access patterns, but I'm not sure if
> I've ever seen a good paper reference.
>
> Cheers,
> Aron
>
>
> On Wed, Nov 18, 2009 at 6:14 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
>> Just want to add one more point to this.
>>
>> Most multicore machines do not provide scalable hardware. [yeah - the
>> FPU cores are scalable - but the memory subsystem is not]. So one
>> should not expect scalable performance out of them. You should take
>> the 'max' performance you can get out of them - and then look for
>> scalability with multiple nodes.
>>
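>> A minimal sketch, in plain C, of the kind of bandwidth-bound loop
>> involved (a STREAM-style triad; names are illustrative, not PETSc
>> code): run one copy of this per core and the shared memory bus
>> saturates long before the FPUs are busy, which is why per-node
>> speedup flattens out.
>>
>>     #include <stddef.h>
>>
>>     /* STREAM-style triad: a[i] = b[i] + s*c[i].  Each iteration
>>      * moves three doubles through memory for only two flops, so
>>      * the loop is limited by memory bandwidth, which all cores on
>>      * a node share. */
>>     void triad(size_t n, double *a, const double *b, const double *c,
>>                double s)
>>     {
>>         for (size_t i = 0; i < n; i++)
>>             a[i] = b[i] + s * c[i];
>>     }
>>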
>> Satish
>>
>> On Wed, 18 Nov 2009, Jed Brown wrote:
>>
>> > jarunan at ascomp.ch wrote:
>> > >
>> > > Hello,
>> > >
>> > > I have read the thread about performance of a machine with 2 dual-core
>> > > chips, where it was written that -np 2 should scale best. I would
>> > > like to ask about a 4-core machine.
>> > >
>> > > I ran the test on a quad-core machine with mpiexec -n 1, 2, and 4 to
>> > > see the parallel scaling. The CPU times of the test are:
>> > >
>> > > Solver/Precond/Sub_Precond
>> > >
>> > > gmres/bjacobi/ilu
>> > >
>> > > -n 1, 1917.5730 sec,
>> > > -n 2, 1699.9490 sec, efficiency = 56.40%
>> > > -n 4, 1661.6810 sec, efficiency = 28.86%
>> > >
>> > > bicgstab/asm/ilu
>> > >
>> > > -n 1, 1800.8380 sec,
>> > > -n 2, 1415.0170 sec, efficiency = 63.63%
>> > > -n 4, 1119.3480 sec, efficiency = 40.22%
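>> > >
>> > > (Parallel efficiency here is computed as T(1) / (n * T(n)); e.g.
>> > > 1917.573 / (2 * 1699.949) = 56.40%.)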
>> >
>> > These numbers are worthless without at least knowing iteration counts.
>> >
>> > > Why is the scaling so low, especially with -n 4?
>> > > Would it be expected to be better running on 4 real CPUs instead of
>> > > a quad-core chip?
>> >
>> > 4 sockets using a single core each (4x1) will generally do better than
>> > 2x2 or 1x4, but 4x4 costs about the same as 4x1 these days.  This is a
>> > very common question; the answer is that a single floating point unit
>> > is about 10 times faster than memory for the sort of operations we do
>> > when solving PDEs.  You don't get another memory bus every time you
>> > add a core, so the ratio becomes worse.  More cores are not a complete
>> > loss because at least you get an extra L1 cache for each core, but
>> > sparse matrix and vector kernels are atrocious at reusing cache
>> > (there's not much to reuse because most values are needed for only
>> > one operation).
>> >
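>> > To make that concrete, here is a minimal CSR sparse matrix-vector
>> > product sketch (generic C, not PETSc's actual implementation): every
>> > entry of val[] and col[] is loaded exactly once and never reused, so
>> > the kernel streams through memory and is bound by bandwidth, not
>> > flops.
>> >
>> >     /* y = A*x for an n-row CSR matrix with row_ptr[n+1], col[nnz],
>> >      * val[nnz].  Each matrix entry participates in a single
>> >      * multiply-add and is never touched again, so there is almost
>> >      * no cache reuse to exploit. */
>> >     void spmv_csr(int n, const int *row_ptr, const int *col,
>> >                   const double *val, const double *x, double *y)
>> >     {
>> >         for (int i = 0; i < n; i++) {
>> >             double sum = 0.0;
>> >             for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
>> >                 sum += val[j] * x[col[j]];
>> >             y[i] = sum;
>> >         }
>> >     }
>> >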
>> > Getting better multicore performance requires changing the algorithms to
>> > better reuse L1 cache.  This means moving away from assembled matrices
>> > where possible and of course finding good preconditioners.  High-order
>> > and fast multipole methods are good for this.  But it's very much an
>> > open problem, and unless you want to do research in the field, you
>> > have to live with poor multicore performance.
>> >
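>> > As one example of moving away from assembled matrices, a toy
>> > matrix-free application of a 1D Laplacian (a sketch under simple
>> > zero-boundary assumptions, not PETSc code) loads no matrix entries
>> > at all; only the vectors stream through memory, so the flop-to-byte
>> > ratio is far better than assembled SpMV:
>> >
>> >     /* y = A*x for the 1D Laplacian stencil [-1 2 -1], applied
>> >      * without ever storing A.  The stencil coefficients live in
>> >      * registers, so memory traffic drops to the vectors alone. */
>> >     void laplacian_1d_apply(int n, const double *x, double *y)
>> >     {
>> >         for (int i = 0; i < n; i++) {
>> >             double left  = (i > 0)     ? x[i-1] : 0.0;
>> >             double right = (i < n - 1) ? x[i+1] : 0.0;
>> >             y[i] = 2.0 * x[i] - left - right;
>> >         }
>> >     }
>> >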
>> > When buying hardware, remember that you are buying memory bandwidth (and
>> > a low-latency network) instead of floating point units.
>> >
>> > Jed
>> >
>> >
>>
>>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener