scaling in 4-core machine

Jed Brown jed at 59A2.org
Wed Nov 18 04:13:05 CST 2009


jarunan at ascomp.ch wrote:
> 
> Hello,
> 
> I have read the topic about performance of a machine with 2 dual-core
> chips, and it is written that with -np 2 it should scale the best. I
> would like to ask about a 4-core machine.
> 
> I run the test on a quad core machine with mpiexec -n 1, 2 and 4 to see
> the parallel scaling. The cpu times of the test are:
> 
> Solver/Precond/Sub_Precond
> 
> gmres/bjacobi/ilu
> 
> -n 1, 1917.5730 sec,
> -n 2, 1699.9490 sec, efficiency = 56.40%
> -n 4, 1661.6810 sec, efficiency = 28.86%
> 
> bicgstab/asm/ilu
> 
> -n 1, 1800.8380 sec,
> -n 2, 1415.0170 sec, efficiency = 63.63%
> -n 4, 1119.3480 sec, efficiency = 40.22%

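These numbers are worthless without at least knowing iteration counts.
Block Jacobi and ASM both change with the number of subdomains, so the
iteration count usually changes with -n as well.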
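
For reference, the efficiencies quoted above are consistent with
E(n) = T(1) / (n * T(n)). A minimal Python sketch of that arithmetic is
below; the helper names are made up for illustration. The second helper
shows how the comparison could be normalized per iteration once iteration
counts are available (none were posted, so it is not applied to the
quoted data).

```python
# Parallel efficiency from the wall times quoted in this thread.
timings = {
    "gmres/bjacobi/ilu": {1: 1917.5730, 2: 1699.9490, 4: 1661.6810},
    "bicgstab/asm/ilu": {1: 1800.8380, 2: 1415.0170, 4: 1119.3480},
}


def efficiency(t1, tn, n):
    """Efficiency of total run time: T(1) / (n * T(n))."""
    return t1 / (n * tn)


def per_iteration_efficiency(t1, its1, tn, itsn, n):
    """Same ratio, but using time per Krylov iteration.

    its1 and itsn are iteration counts, which the original post did not
    report; they must come from the solver output.
    """
    return (t1 / its1) / (n * (tn / itsn))


def efficiency_table(times):
    """Map each process count to its total-time efficiency."""
    t1 = times[1]
    return {n: efficiency(t1, tn, n) for n, tn in sorted(times.items())}


if __name__ == "__main__":
    for name, times in timings.items():
        for n, e in efficiency_table(times).items():
            speedup = times[1] / times[n]
            print(f"{name}  -n {n}: speedup {speedup:.2f}, efficiency {100 * e:.2f}%")
```

If the iteration count grows with -n, the per-iteration efficiency will
be higher than the total-time figure, which is why the two need to be
separated before drawing conclusions about the hardware.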

> Why is the scaling so low, especially with option -n 4?
> Would it be expected to be better running with real 4 CPUs instead of a
> quad core chip?

4 sockets using a single core each (4x1) will generally do better than
2x2 or 1x4, but 4x4 costs about the same as 4x1 these days.  This is a
very common question; the answer is that a single floating point unit is
about 10 times faster than memory for the sort of operations that we do
when solving PDEs.  You don't get another memory bus every time you add a
core, so the ratio becomes worse.  More cores are not a complete loss
because at least you get an extra L1 cache for each core, but sparse
matrix and vector kernels are atrocious at reusing cache (there's not
much to reuse because most values are only needed to perform one
operation).
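
The bandwidth argument can be sketched as a roofline-style estimate for
a sparse matrix-vector product. This is only a back-of-the-envelope
model: it assumes CSR storage with 8-byte values and 4-byte column
indices, ignores vector traffic, and the hardware numbers (4 GFlop/s per
core, 10 GB/s for the shared bus) are hypothetical placeholders, not
measurements of any machine in this thread.

```python
# Roofline-style bound for a CSR sparse matrix-vector product.
# All hardware numbers used below are hypothetical placeholders.

BYTES_PER_NONZERO = 8 + 4  # one double-precision value + one 32-bit column index
FLOPS_PER_NONZERO = 2      # one multiply + one add


def spmv_intensity():
    """Flops per byte of matrix data streamed from memory (vector traffic ignored)."""
    return FLOPS_PER_NONZERO / BYTES_PER_NONZERO


def attainable_gflops(cores, gflops_per_core, bandwidth_gb_s):
    """Upper bound on the flop rate: the smaller of the compute and memory limits."""
    compute_bound = cores * gflops_per_core
    memory_bound = bandwidth_gb_s * spmv_intensity()
    return min(compute_bound, memory_bound)


if __name__ == "__main__":
    # Adding cores raises the compute limit, but the shared bus does not get faster.
    for cores in (1, 2, 4):
        rate = attainable_gflops(cores, gflops_per_core=4.0, bandwidth_gb_s=10.0)
        print(f"{cores} core(s): at most {rate:.2f} GFlop/s")
```

With these placeholder numbers the bound is set by memory for 1, 2, and
4 cores alike, so the extra floating point units have nothing to do. In
practice a single core may not saturate the bus, which is one reason
some speedup is still observed.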

Getting better multicore performance requires changing the algorithms to
better reuse L1 cache.  This means moving away from assembled matrices
where possible and of course finding good preconditioners.  High-order
and fast multipole methods are good for this.  But it's very much an
open problem, and unless you want to do research in the field, you have
to live with poor multicore performance.
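
To make "moving away from assembled matrices" concrete, here is a small
Python sketch that applies the same operator two ways: from an assembled
CSR matrix and matrix-free from its stencil. The operator (a 1D
Laplacian with stencil [-1, 2, -1] and zero boundary values) and all
function names are chosen purely for illustration. Pure Python says
nothing about performance, and a low-order stencil like this is not one
of the high-order methods mentioned above; the point is only that the
matrix-free version reads no stored matrix entries or column indices.

```python
# Illustrative only: one operator, applied from stored entries and from its stencil.


def assemble_csr(n):
    """Assemble the n x n tridiagonal [-1, 2, -1] matrix as (row_ptr, col_idx, values)."""
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        if i > 0:
            col_idx.append(i - 1)
            values.append(-1.0)
        col_idx.append(i)
        values.append(2.0)
        if i < n - 1:
            col_idx.append(i + 1)
            values.append(-1.0)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def apply_assembled(csr, x):
    """y = A x, loading every stored value and column index from memory."""
    row_ptr, col_idx, values = csr
    y = []
    for i in range(len(x)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


def apply_matrix_free(x):
    """y = A x computed directly from the stencil; only the vector is read."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


if __name__ == "__main__":
    x = [float(i * i) for i in range(8)]
    ya = apply_assembled(assemble_csr(len(x)), x)
    yf = apply_matrix_free(x)
    print(max(abs(a - b) for a, b in zip(ya, yf)))
```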

When buying hardware, remember that you are buying memory bandwidth (and
a low-latency network) instead of floating point units.

Jed



More information about the petsc-users mailing list