[petsc-users] speedup for TS solver using DMDA

Barry Smith bsmith at mcs.anl.gov
Mon Sep 15 14:08:01 CDT 2014


  Based on the streams speedups below, it looks like a single core can utilize roughly 1/2 of the memory bandwidth, leaving only the other 1/2 for all the remaining cores, so you can expect at best a speedup of roughly 2 on this machine with traditional PETSc sparse solvers, which are limited by memory bandwidth rather than by flop rate. 

  To add insult to injury, it appears that the MPI processes are not being assigned to physical cores very well either. Under the best circumstances on this system one would like to see a speedup of about 2 when running with two processes, but it actually delivers only 1.23, and the speedup of 2 only occurs with 5 processes. I attribute this to the MPI implementation or the OS not assigning the second MPI process to the “best” core for memory bandwidth. Likely it should assign the second MPI process to the 2nd CPU, but instead it is also assigning it to the first CPU, and only when it gets to the 5th MPI process does the second CPU get utilized. 

   You can look at the documentation for your MPI implementation’s process affinity (binding) options to see if you can force the 2nd MPI process onto the second CPU.
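
   For example (a rough sketch only; the exact flags depend on which MPI implementation and version you have, and "./your_app" below is just a placeholder for your executable):

      mpiexec --map-by socket --bind-to core -n 2 ./your_app      (Open MPI)
      mpiexec -bind-to socket -n 2 ./your_app                     (MPICH with the Hydra launcher)

   Mapping/binding by socket should place rank 1 on the second CPU, so each of the first two ranks gets a memory controller to itself; check “man mpiexec” for your MPI for the exact option names.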

   Barry


np  speedup
1 1.0
2 1.23
3 1.3
4 1.75
5 2.18
6 1.22
7 2.3
8 1.22
9 2.01
10 1.19
11 1.93
12 1.93
13 1.73
14 2.17
15 1.99
16 2.08
17 2.16
18 1.47
19 1.95
20 2.09
21 1.9
22 1.96
23 1.92
24 2.02
25 1.96
26 1.89
27 1.93
28 1.97
29 1.96
30 1.93
31 2.16
32 2.12
Estimation of possible speedup of MPI programs based on Streams benchmark.

On Sep 15, 2014, at 1:42 PM, Katy Ghantous <katyghantous at gmail.com> wrote:

> Matt, thanks! I will look into that and find other ways to make the computation faster.
> 
> Barry, the benchmark reports a speedup of up to 2, but says 1 node at the end. Either way, I was expecting a higher speedup. Is 2 the limit for two CPUs despite the multiple cores?
> 
> Please let me know if the attached file is what you are asking for. 
> Thank you!
> 
> 
> On Mon, Sep 15, 2014 at 8:23 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    Please send the output from running
> 
>     make streams NPMAX=32
> 
>     in the PETSc root directory.
> 
> 
>    Barry
> 
>   My guess is that it reports “one node” just because it uses the “hostname” to distinguish nodes; though your machine has two CPUs, from the point of view of the OS it has only a single hostname, and hence it reports just one “node”.
> 
> 
> On Sep 15, 2014, at 12:45 PM, Katy Ghantous <katyghantous at gmail.com> wrote:
> 
> > Hi,
> > I am using DMDA to run TS in parallel to solve a set of N equations. I am using DMDAGetCorners in the RHS function, with the stencil width set to 2, to solve a set of coupled ODEs on 30 cores (a rough sketch of this kind of setup appears after the quoted thread below).
> > The machine has 32 cores (2 physical CPUs with 2x8 cores each, at 3.4 GHz per core).
> > However, running mpiexec with more than one core shows no speedup.
> > Also, at the configure/test stage for PETSc on that machine, there was no speedup and it reported only one node.
> > Is there something wrong with how I configured PETSc, or is the approach inappropriate for the machine?
> > I am not sure what files (or sections of the code) you would need to be able to answer my question.
> >
> > Thank you!
> 
> 
> <scaling.log>
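
A minimal sketch of the kind of TS + DMDA setup described in the original question, using the current PETSc API (some call names differ in older releases). The grid size, the coupling term in the RHS, and the boundary handling below are made up purely for illustration, since the actual code is not shown in the thread, and error checking is omitted for brevity:

#include <petscts.h>
#include <petscdmda.h>

/* RHS function: loops over the locally owned points obtained from
   DMDAGetCorners(); ghost values for u[i-2] and u[i+2] come from a
   local (ghosted) vector filled by DMGlobalToLocal(). */
static PetscErrorCode RHSFunction(TS ts, PetscReal t, Vec U, Vec F, void *ctx)
{
  DM                 da;
  Vec                Ulocal;
  PetscInt           i, xs, xm, N;
  const PetscScalar *u;
  PetscScalar       *f;

  TSGetDM(ts, &da);
  DMGetLocalVector(da, &Ulocal);
  DMGlobalToLocalBegin(da, U, INSERT_VALUES, Ulocal);
  DMGlobalToLocalEnd(da, U, INSERT_VALUES, Ulocal);
  VecGetSize(U, &N);
  DMDAGetCorners(da, &xs, NULL, NULL, &xm, NULL, NULL);
  DMDAVecGetArrayRead(da, Ulocal, &u);
  DMDAVecGetArray(da, F, &f);
  for (i = xs; i < xs + xm; i++) {
    PetscScalar um2 = (i - 2 >= 0)     ? u[i - 2] : 0.0;  /* crude boundary handling */
    PetscScalar up2 = (i + 2 <= N - 1) ? u[i + 2] : 0.0;
    f[i] = um2 - 2.0*u[i] + up2;                          /* made-up coupling term */
  }
  DMDAVecRestoreArrayRead(da, Ulocal, &u);
  DMDAVecRestoreArray(da, F, &f);
  DMRestoreLocalVector(da, &Ulocal);
  return 0;
}

int main(int argc, char **argv)
{
  DM  da;
  TS  ts;
  Vec u;

  PetscInitialize(&argc, &argv, NULL, NULL);
  /* 1-D distributed array: 1000 unknowns, 1 dof per point, stencil width 2 */
  DMDACreate1d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, 1000, 1, 2, NULL, &da);
  DMSetFromOptions(da);
  DMSetUp(da);
  DMCreateGlobalVector(da, &u);
  VecSet(u, 1.0);

  TSCreate(PETSC_COMM_WORLD, &ts);
  TSSetDM(ts, da);
  TSSetRHSFunction(ts, NULL, RHSFunction, NULL);
  TSSetTimeStep(ts, 0.01);
  TSSetMaxTime(ts, 1.0);
  TSSetExactFinalTime(ts, TS_EXACTFINALTIME_STEPOVER);
  TSSetFromOptions(ts);
  TSSolve(ts, u);

  VecDestroy(&u);
  TSDestroy(&ts);
  DMDestroy(&da);
  PetscFinalize();
  return 0;
}

Run with, e.g., "mpiexec -n 30 ./your_app" (again a placeholder name); as discussed above, how much such a run speeds up is governed mostly by the machine's memory bandwidth rather than by the number of cores.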


