[petsc-users] performance surprise

Jed Brown jedbrown at mcs.anl.gov
Fri Jan 20 15:36:33 CST 2012


On Fri, Jan 20, 2012 at 15:27, Dominik Szczerba <dominik at itis.ethz.ch> wrote:

> I am running some performance tests on a distributed Cray cluster with
> 16 cores per node.
> I am very surprised to find that my benchmark jobs are about 3x slower when
> running on N nodes using all 16 cores than when running on N*16 nodes
> using only one core each.
>

Yes, this is normal. Memory bandwidth is the overwhelming bottleneck for
most sparse linear algebra. One core can almost saturate the bandwidth of a
socket, so you see little benefit from the extra cores.

Pay attention to memory bandwidth when you buy computers, and try to make
your algorithms do many flops per memory access if you want to utilize the
floating-point hardware you have lying around.
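
To make that concrete, here is a minimal back-of-the-envelope sketch of why
sparse matrix-vector products are bandwidth-limited. It assumes a CSR matrix
with double-precision values and 32-bit column indices; the bandwidth and
peak-flop numbers are illustrative guesses, not measurements from your Cray
(run a STREAM benchmark on one node to get real figures):

#include <stdio.h>

int main(void)
{
  /* Sparse matrix-vector product y = A*x in CSR format: per nonzero we
     stream one 8-byte double (the value) and one 4-byte int (the column
     index), and perform 2 flops (one multiply, one add). */
  double bytes_per_nonzero = 8.0 + 4.0;
  double flops_per_nonzero = 2.0;
  double intensity = flops_per_nonzero / bytes_per_nonzero; /* ~0.17 flop/byte */

  /* Assumed, illustrative socket numbers; substitute your own. */
  double stream_bandwidth = 20e9;  /* bytes/s the socket can sustain */
  double peak_flops       = 100e9; /* flop/s with all cores busy */

  double spmv_limit = intensity * stream_bandwidth; /* ~3.3 GF/s */
  printf("SpMV is limited to about %.1f of %.0f GF/s peak (%.0f%%),\n"
         "no matter how many cores share the socket.\n",
         spmv_limit / 1e9, peak_flops / 1e9, 100.0 * spmv_limit / peak_flops);
  return 0;
}

One core streaming at full rate already gets close to that limit, which is
why the other 15 cores on the node buy you so little for this kind of kernel.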


> I find this using two independent PETSc builds, and they both exhibit
> the same behavior: my own GNU build and the system module PETSc, both 3.2.
> So far I have been unable to build my own PETSc version with the Cray
> compilers to compare.
>
> The scheme is relatively complex: a shell matrix with block
> preconditioners for a transient non-linear problem. I am using BoomerAMG
> from hypre.
>
> Where do you think this unexpected performance comes from? Is it
> possible that the node interconnect is faster than the shared memory
> bus on the node? I was expecting the exact opposite.
>

