scaling on a 4-core machine

Satish Balay balay at mcs.anl.gov
Wed Nov 18 09:14:19 CST 2009


Just want to add one more point to this.

Most multicore machines do not provide scalable hardware. [Yes, the
FPU cores scale, but the memory subsystem does not.] So one should
not expect scalable performance out of them. You should take the
'max' performance you can get out of them, and then look for
scalability across multiple nodes.

Satish

On Wed, 18 Nov 2009, Jed Brown wrote:

> jarunan at ascomp.ch wrote:
> > 
> > Hello,
> > 
> > I have read the topic about performance of a machine with 2 dual-core
> > chips, and it is written that with -np 2 it should scale the best. I
> > would like to ask about 4-core machine.
> > 
> > I run the test on a quad core machine with mpiexec -n 1, 2 and 4 to see
> > the parallel scaling. The cpu times of the test are:
> > 
> > Solver/Precond/Sub_Precond
> > 
> > gmres/bjacobi/ilu
> > 
> > -n 1, 1917.5730 sec,
> > -n 2, 1699.9490 sec, efficiency = 56.40%
> > -n 4, 1661.6810 sec, efficiency = 28.86%
> > 
> > bicgstab/asm/ilu
> > 
> > -n 1, 1800.8380 sec,
> > -n 2, 1415.0170 sec, efficiency = 63.63%
> > -n 4, 1119.3480 sec, efficiency = 40.22%
> 
> These numbers are worthless without at least knowing iteration counts.
> 
> > Why is the scaling so low, especially with -n 4?
> > Would it be better to run on 4 separate CPUs instead of one
> > quad-core chip?
> 
> 4 sockets with a single core each (4x1) will generally do better than
> 2x2 or 1x4, but 4x4 costs about the same as 4x1 these days.  This is a
> very common question; the answer is that a single floating point unit
> is about 10 times faster than memory for the sort of operations we do
> when solving PDEs.  You don't get another memory bus every time you
> add a core, so the ratio becomes worse.  More cores are not a complete
> loss, because at least you get an extra L1 cache per core, but sparse
> matrix and vector kernels are atrocious at reusing cache (there is not
> much to reuse, because most values are needed for only one
> operation).
> 
> Getting better multicore performance requires changing the algorithms
> to reuse L1 cache better.  This means moving away from assembled
> matrices where possible, and of course finding good preconditioners.
> High-order and fast multipole methods are good for this.  But it is
> very much an open problem, and unless you want to do research in the
> field, you have to live with poor multicore performance.
> 
> When buying hardware, remember that you are buying memory bandwidth (and
> a low-latency network) instead of floating point units.
> 
> Jed
> 
> 
