[petsc-users] performance surprise
Barry Smith
bsmith at mcs.anl.gov
Fri Jan 20 16:33:49 CST 2012
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
Likely you will do best if you use perhaps half the cores per node. You should experiment by starting with 1 core per node, then 2, and so on until you see the performance peak; that will tell you the sweet spot.
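For example, on a Cray the aprun launcher's -N option controls how many MPI ranks are placed on each node. A sketch of such a sweep (the executable name ./myapp is just a placeholder; check your system's aprun man page for the exact flags):

    # 64 MPI ranks total, varying how many ranks share each node's memory bus
    aprun -n 64 -N 16 ./myapp    # 4 nodes, all 16 cores per node
    aprun -n 64 -N 8  ./myapp    # 8 nodes, half the cores per node
    aprun -n 64 -N 1  ./myapp    # 64 nodes, 1 core per node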
Barry
On Jan 20, 2012, at 3:36 PM, Jed Brown wrote:
> On Fri, Jan 20, 2012 at 15:27, Dominik Szczerba <dominik at itis.ethz.ch> wrote:
> I am running some performance tests on a distributed (Cray) cluster
> with 16 cores per node.
> I am very surprised to find that my benchmark jobs are about 3x slower when
> running on N nodes using all 16 cores than when running on N*16 nodes
> using only one core.
>
> Yes, this is normal. Memory bandwidth is the overwhelming bottleneck for most sparse linear algebra. One core can almost saturate the bandwidth of a socket, so you see little benefit from the extra cores.
>
> Pay attention to memory bandwidth when you buy computers and try to make your algorithms use a lot of flops per memory access if you want to utilize the floating point hardware you have lying around.
>
> I see this with two independent PETSc builds, both 3.2, and they
> exhibit the same behavior: my own GNU build and the system module
> petsc. So far I have been unable to build my own PETSc with the Cray
> compilers to compare.
>
> The scheme is relatively complex: a shell matrix with block
> preconditioners for a transient non-linear problem. I am using
> BoomerAMG from hypre.
>
> Where do you think this unexpected performance comes from? Is it
> possible that the node interconnect is faster than the shared-memory
> bus within a node? I was expecting the exact opposite.
>
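To put the flops-per-memory-access point in concrete terms, here is a minimal sketch (not taken from PETSc, just an illustration) of a CSR sparse matrix-vector product, annotated with a rough per-nonzero traffic estimate:

    #include <stddef.h>

    /* y = A*x for a CSR matrix.  Per nonzero, roughly 8 bytes (value) plus
     * 4 bytes (column index) are read from memory, in addition to streaming
     * parts of x and y, for only 2 flops (one multiply, one add).  That is
     * on the order of 1/6 flop per byte, far below what is needed to keep
     * the floating point units busy, so the kernel is limited by memory
     * bandwidth rather than by core count. */
    void spmv_csr(size_t nrows, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
      for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
          sum += val[j] * x[colind[j]];  /* 2 flops, ~12 bytes of matrix data */
        y[i] = sum;
      }
    }

Adding cores on a node multiplies the available flops but not the memory bandwidth, which is why spreading the same number of ranks across more nodes (and therefore more memory controllers) finishes faster.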