[petsc-users] speedup for TS solver using DMDA

Barry Smith bsmith at mcs.anl.gov
Tue Sep 16 14:04:02 CDT 2014


  The tool hwloc can be useful for understanding the organization of cores and memory on a machine. For example, I run

lstopo --no-icaches --no-io --ignore PU

(along with make streams in the root PETSc directory) on my laptop and it shows

np  speedup
1 1.0
2 1.43
3 1.47
4 1.45
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
Machine (16GB) + NUMANode L#0 (P#0 16GB) + L3 L#0 (6144KB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
  L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
  L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
  L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3

This system has one “memory bank” (a single NUMA node), one CPU, and four cores. Note that when two cores are running the streams benchmark they are essentially utilizing all of the memory bandwidth, hence you get no further speedup beyond two cores.
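
For reference, here is a minimal sketch of the commands behind the numbers above (assuming hwloc is installed and you start in the PETSc root directory; NPMAX=4 is simply matched to the four cores on this laptop):

cd $PETSC_DIR
make streams NPMAX=4                      # run the streams benchmark with 1 through 4 MPI processes
lstopo --no-icaches --no-io --ignore PU   # display the core, cache, and memory layout (from hwloc)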

Next I run on a “server” class workstation with two “memory banks”, each associated with a CPU that has 8 cores.

np  speedup
1 1.0
2 1.8
3 2.21
4 2.35
5 2.4
6 2.41
7 3.3
8 2.4
9 2.66
10 2.22
11 2.28
12 4.04
13 2.46
14 2.61
15 4.11
16 3.01
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
Machine (128GB)
  NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
    L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
    L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
    L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3
    L2 L#4 (256KB) + L1d L#4 (32KB) + Core L#4
    L2 L#5 (256KB) + L1d L#5 (32KB) + Core L#5
    L2 L#6 (256KB) + L1d L#6 (32KB) + Core L#6
    L2 L#7 (256KB) + L1d L#7 (32KB) + Core L#7
  NUMANode L#1 (P#1 64GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + Core L#8
    L2 L#9 (256KB) + L1d L#9 (32KB) + Core L#9
    L2 L#10 (256KB) + L1d L#10 (32KB) + Core L#10
    L2 L#11 (256KB) + L1d L#11 (32KB) + Core L#11
    L2 L#12 (256KB) + L1d L#12 (32KB) + Core L#12
    L2 L#13 (256KB) + L1d L#13 (32KB) + Core L#13
    L2 L#14 (256KB) + L1d L#14 (32KB) + Core L#14
    L2 L#15 (256KB) + L1d L#15 (32KB) + Core L#15


Note the speedup gets as high as 4, meaning that the memory is fast enough to fully serve at least four cores. But the speedup jumps all over the place as the number of processes goes from 1 to 16. I am guessing that is because the MPI processes are not being well mapped to cores. So I run with the additional MPICH mpiexec options -bind-to socket -map-by hwthread and get

np  speedup
1 1.0
2 2.26
3 2.79
4 2.93
5 2.99
6 3.0
7 3.01
8 2.99
9 2.81
10 2.81
11 2.9
12 2.94
13 2.94
14 2.94
15 2.93
16 2.93
Estimation of possible speedup of MPI programs based on Streams benchmark.

Then I run with just -bind-to socket and get much better numbers:

np  speedup
1 1.0
2 2.41
3 3.36
4 4.45
5 4.51
6 5.45
7 5.07
8 5.81
9 5.27
10 5.93
11 5.42
12 5.95
13 5.49
14 5.94
15 5.56
16 5.88

Using this option I get roughly a speedup of 6. 
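
For example, these binding options go directly on the mpiexec command line of whatever PETSc program you run; a minimal sketch (the executable name ./myapp is only a placeholder, and the flags assume MPICH's Hydra mpiexec):

mpiexec -n 16 -bind-to socket -map-by hwthread ./myapp   # bind ranks to sockets, map by hardware thread
mpiexec -n 16 -bind-to socket ./myapp                    # bind to sockets only; this gave the best numbers above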

See http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Process-core_Binding for more information on these options.

 Barry




On Sep 16, 2014, at 10:08 AM, Katy Ghantous <katyghantous at gmail.com> wrote:

> Thank you! This has been extremely useful in figuring out a plan of action.
> 
> 
> On Mon, Sep 15, 2014 at 9:08 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>   Based on the streams speedups below, it looks like a single core can utilize roughly 1/2 of the memory bandwidth, leaving only the remaining 1/2 for all the other cores, so you can expect at best a speedup of roughly 2 on this machine with traditional PETSc sparse solvers.
> 
>   To add insult to injury, it appears that the MPI processes are not being assigned to physical cores very well either. Under the best circumstances on this system one would like to see a speedup of about 2 when running with two processes, but it actually delivers only 1.23, and a speedup of 2 only occurs with 5 processes. I attribute this to the MPI or the OS not assigning the second MPI process to the “best” core for memory bandwidth. Likely it should assign the second MPI process to the 2nd CPU, but instead it also assigns it to the first CPU, and only when it gets to the 5th MPI process does the second CPU get utilized.
> 
>    You can look at the documentation for your MPI’s process affinity to see if you can force the 2nd MPI process onto the second CPU.
> 
>    Barry
> 
> 
> np  speedup
> 1 1.0
> 2 1.23
> 3 1.3
> 4 1.75
> 5 2.18
> 6 1.22
> 7 2.3
> 8 1.22
> 9 2.01
> 10 1.19
> 11 1.93
> 12 1.93
> 13 1.73
> 14 2.17
> 15 1.99
> 16 2.08
> 17 2.16
> 18 1.47
> 19 1.95
> 20 2.09
> 21 1.9
> 22 1.96
> 23 1.92
> 24 2.02
> 25 1.96
> 26 1.89
> 27 1.93
> 28 1.97
> 29 1.96
> 30 1.93
> 31 2.16
> 32 2.12
> Estimation of possible speedup of MPI programs based on Streams benchmark.
> 
> On Sep 15, 2014, at 1:42 PM, Katy Ghantous <katyghantous at gmail.com> wrote:
> 
> > Matt, thanks! I will look into that and find other ways to make the computation faster.
> >
> > Barry, the benchmark reports a speedup of up to 2, but says 1 node at the end. Either way, I was expecting a higher speedup. Is 2 the limit for two CPUs despite the multiple cores?
> >
> > Please let me know if the attached file is what you are asking for.
> > Thank you!
> >
> >
> > On Mon, Sep 15, 2014 at 8:23 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >    Please send the output from running
> >
> >     make streams NPMAX=32
> >
> >     in the PETSc root directory.
> >
> >
> >    Barry
> >
> >   My guess is that it reports “one node” just because it uses the “hostname” to distinguish nodes, and though your machine has two CPUs, from the point of view of the OS it has only a single hostname and hence it reports just one “node”.
> >
> >
> > On Sep 15, 2014, at 12:45 PM, Katy Ghantous <katyghantous at gmail.com> wrote:
> >
> > > Hi,
> > > I am using DMDA to run TS in parallel to solve a set of N equations. I am using DMDAGetCorners in the RHS function, with the stencil width set to 2, to solve a set of coupled ODEs on 30 cores.
> > > The machine has 32 cores (2 physical CPUs, each with 2x8 cores, at 3.4 GHz per core).
> > > However, mpiexec with more than one core is showing no speedup.
> > > Also, at the configure/testing stage for PETSc on that machine, there was no speedup and it reported only one node.
> > > Is there something wrong with how I configured PETSc, or is the approach inappropriate for the machine?
> > > I am not sure what files (or sections of the code) you would need to be able to answer my question.
> > >
> > > Thank you!
> >
> >
> > <scaling.log>
> 
> 


