<div dir="ltr">Thanks Matt,<div><br></div><div>I just reran the streams tests on the machine and got the following table. My question is: is this the maximum speedup I can get on my machine, and should I therefore measure the efficiency and scaling tests against these figures instead?</div><div><br></div><div>I have 20-core nodes, so this was run over 4 nodes.</div><div><br></div><div>Thanks,</div><div><br></div><div><pre style="box-sizing:border-box;overflow:auto;font-family:monospace;font-size:14px;display:block;padding:0px;margin:0px;line-height:inherit;word-break:break-all;word-wrap:break-word;color:rgb(0,0,0);background-color:rgb(255,255,255);border:0px;border-radius:0px;white-space:pre-wrap;vertical-align:baseline;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial">np  speedup

1 1.0
2 1.82
3 2.43
4 2.79
5 2.99
6 3.13
7 3.13
8 3.19
9 3.17
10 3.17
11 3.44
12 3.81
13 4.13
14 4.43
15 4.72
16 5.05
17 5.4
18 5.69
19 5.99
20 6.29
21 6.66
22 6.96
23 7.26
24 7.6
25 7.86
26 8.25
27 8.54
28 8.88
29 9.2
30 9.44
31 9.84
32 10.06
33 10.43
34 10.72
35 11.11
36 11.42
37 11.75
38 12.07
39 12.27
40 12.65
41 12.94
42 13.34
43 13.6
44 13.83
45 14.27
46 14.56
47 14.84
48 15.24
49 15.49
50 15.85
51 15.87
52 16.35
53 16.76
54 17.02
55 17.17
56 17.7
57 17.9
58 18.28
59 18.56
60 18.82
61 19.37
62 19.62
63 19.88
64 20.21</pre><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 2, 2018 at 1:24 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <span dir="ltr"><<a href="mailto:mvalera-w@sdsu.edu" target="_blank">mvalera-w@sdsu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hello guys,<div><br></div><div>We are working on a paper about the parallelization of our model using PETSc, which is very exciting since it is the first time we have seen our model scale, but so far I feel my results for the Laplacian solver could be much better.</div><div><br></div><div>For example, using CG/Multigrid I get less than 20% efficiency beyond 16 cores, dropping to only 8% efficiency at 64 cores.</div><div><br></div><div>I am defining efficiency as speedup over the number of cores, and speedup as twall_1/twall_n, where n is the number of cores; I think that's pretty standard.</div></div></blockquote><div><br></div></span><div>This is the first big problem. Not all "cores" are created equal. First, you need to run streams in the exact same configuration, so that you can see</div><div>how much speedup to expect. The program is here</div><div><br></div><div>  cd src/benchmarks/streams</div><div><br></div><div>and</div><div><br></div><div>  make streams</div><div><br></div><div>will run it. 
You will probably need to submit the program yourself to the batch system to get the same configuration as your solver.</div><div><br></div><div>This really matters because 16 cores on one node probably have the potential for only a 5x speedup, so your 20% figure is misleading.</div><div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>The ksp_view for a distributed solve looks like this:</div><div><br></div><div><div>KSP Object: 16 MPI processes</div><div>  type: cg</div><div>  maximum iterations=10000, initial guess is zero</div><div>  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.</div><div>  left preconditioning</div><div>  using PRECONDITIONED norm type for convergence test</div><div>PC Object: 16 MPI processes</div><div>  type: hypre</div><div>    HYPRE BoomerAMG preconditioning</div><div>      Cycle type V</div><div>      Maximum number of levels 25</div><div>      Maximum number of iterations PER hypre call 1</div><div>      Convergence tolerance PER hypre call 0.</div><div>      Threshold for strong coupling 0.25</div><div>      Interpolation truncation factor 0.</div><div>      Interpolation: max elements per row 0</div><div>      Number of levels of aggressive coarsening 0</div><div>      Number of paths for aggressive coarsening 1</div><div>      Maximum row sums 0.9</div><div>      Sweeps down         1</div><div>      Sweeps up           1</div><div>      Sweeps on coarse    1</div><div>      Relax down          symmetric-SOR/Jacobi</div><div>      Relax up            symmetric-SOR/Jacobi</div><div>      Relax on coarse     Gaussian-elimination</div><div>      Relax weight  (all)      1.</div><div>      Outer relax weight (all) 1.</div><div>      Using CF-relaxation</div><div>      Not using more complex smoothers.</div><div>      Measure type        local</div><div>      
Coarsen type        Falgout</div><div>      Interpolation type  classical</div><div>      Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1</div><div>      HYPRE_BoomerAMGSetInterpVecVar<wbr>iant() 1</div><div>  linear system matrix = precond matrix:</div><div>  Mat Object: 16 MPI processes</div><div>    type: mpiaij</div><div>    rows=213120, cols=213120</div><div>    total: nonzeros=3934732, allocated nonzeros=8098560</div><div>    total number of mallocs used during MatSetValues calls =0</div><div>      has attached near null space</div></div><div><br></div><div><br></div><div>And the log_view for the same case would be:</div><div><br></div><div><div>******************************<wbr>******************************<wbr>******************************<wbr>******************************</div><div>***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***</div><div>******************************<wbr>******************************<wbr>******************************<wbr>******************************</div><div><br></div><div>------------------------------<wbr>---------------- PETSc Performance Summary: ------------------------------<wbr>----------------</div><div><br></div><div>./gcmSeamount on a timings named ocean with 16 processors, by valera Wed May  2 13:18:21 2018</div><div>Using Petsc Development GIT revision: v3.9-163-gbe3efd4  GIT Date: 2018-04-16 10:45:40 -0500</div><div><br></div><div>                         Max       Max/Min        Avg      Total </div><div>Time (sec):           1.355e+00      1.00004   1.355e+00</div><div>Objects:              4.140e+02      1.00000   4.140e+02</div><div>Flop:                 7.582e+05      1.09916   7.397e+05  1.183e+07</div><div>Flop/sec:            5.594e+05      1.09918   5.458e+05  8.732e+06</div><div>MPI Messages:         1.588e+03      1.19167   1.468e+03  2.348e+04</div><div>MPI Message Lengths:  7.112e+07      1.37899   4.462e+04  
1.048e+09</div><div>MPI Reductions:       4.760e+02      1.00000</div><div><br></div><div>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)</div><div>                            e.g., VecAXPY() for real vectors of length N --> 2N flop</div><div>                            and VecAXPY() for complex vectors of length N --> 8N flop</div><div><br></div><div>Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --</div><div>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total </div><div> 0:      Main Stage: 1.3553e+00 100.0%  1.1835e+07 100.0%  2.348e+04 100.0%  4.462e+04      100.0%  4.670e+02  98.1% </div><div><br></div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div>See the 'Profiling' chapter of the users' manual for details on interpreting output.</div><div>Phase summary info:</div><div>   Count: number of times phase was executed</div><div>   Time and Flop: Max - maximum over all processors</div><div>                   Ratio - ratio of maximum to minimum over all processors</div><div>   Mess: number of messages sent</div><div>   Avg. len: average message length (bytes)</div><div>   Reduct: number of global reductions</div><div>   Global: entire computation</div><div>   Stage: stages of a computation. 
Set stages with PetscLogStagePush() and PetscLogStagePop().</div><div>      %T - percent time in this phase         %F - percent flop in this phase</div><div>      %M - percent messages in this phase     %L - percent message lengths in this phase</div><div>      %R - percent reductions in this phase</div><div>   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div>Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total</div><div>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div><br></div><div>--- Event Stage 0: Main Stage</div><div><br></div><div>BuildTwoSidedF         2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  1  0  2  6  0   1  0  2  6  0     0</div><div>VecTDot                1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0  6646</div><div>VecNorm                1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0   292</div><div>VecScale              14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0 62  0  0  0   0 62  0  0  0 20346</div><div>VecCopy                7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0</div><div>VecSet                83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0</div><div>VecPointwiseMult      12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0 13153</div><div>VecScatterBegin      111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 
3.4e+04 1.9e+01  1  0 40 31  4   1  0 40 31  4     0</div><div>VecScatterEnd         92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0</div><div>VecNormalize           1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  7911</div><div>MatConvert             1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  1  0  0  0  1   1  0  0  0  1     0</div><div>MatAssemblyBegin       3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  2  0  2  6  0   2  0  2  6  0     0</div><div>MatAssemblyEnd         3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04 2.0e+01  2  0  0  0  4   2  0  0  0  4     0</div><div>MatGetRowIJ            2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0</div><div>MatView                1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0</div><div>KSPSetUp               1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0</div><div>KSPSolve               1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00 2.0e+00  0  7  0  0  0   0  7  0  0  0   343</div><div>PCSetUp                2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00 6.0e+00  2  2  0  0  1   2  2  0  0  1     9</div><div>PCApply                1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div><br></div><div>Memory usage is given in bytes:</div><div><br></div><div>Object Type          Creations   Destructions     Memory  Descendants' Mem.</div><div>Reports information only for process 0.</div><div><br></div><div>--- Event Stage 0: Main Stage</div><div><br></div><div>              Vector   172            170     70736264     0.</div><div>              Matrix     5       
       5      7125104     0.</div><div>   Matrix Null Space     1              1          608     0.</div><div>    Distributed Mesh    18             16        84096     0.</div><div>           Index Set    73             73     10022204     0.</div><div>   IS L to G Mapping    18             16      1180828     0.</div><div>   Star Forest Graph    36             32        27968     0.</div><div>     Discrete System    18             16        15040     0.</div><div>         Vec Scatter    67             64     38240520     0.</div><div>       Krylov Solver     2              2         2504     0.</div><div>      Preconditioner     2              2         2528     0.</div><div>              Viewer     2              1          848     0.</div><div>==============================<wbr>==============================<wbr>==============================<wbr>==============================</div><div>Average time to get PetscTime(): 0.</div><div>Average time for MPI_Barrier(): 2.38419e-06</div><div>Average time for zero size MPI_Send(): 2.11596e-06</div><div>#PETSc Option Table entries:</div><div>-da_processors_z 1</div><div>-ksp_type cg</div><div>-ksp_view</div><div>-log_view</div><div>-pc_hypre_boomeramg_nodal_coar<wbr>sen 1</div><div>-pc_hypre_boomeramg_vec_interp<wbr>_variant 1</div><div>#End of PETSc Option Table entries</div><div>Compiled without FORTRAN kernels</div><div>Compiled with full precision matrices (default)</div><div>sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4</div><div>Configure options: --known-level1-dcache-size=327<wbr>68 --known-level1-dcache-linesize<wbr>=64 --known-level1-dcache-assoc=8 --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4 
--known-mpi-long-double=1 --known-mpi-int64_t=1 --known-mpi-c-double-complex=1 --known-has-attribute-aligned=<wbr>1 PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/open<wbr>mpi --with-blaslapack-dir=/usr/lib<wbr>64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre --with-debugging=no --with-batch -known-mpi-shared-libraries=0 --known-64-bit-blas-indices=0</div><div>------------------------------<wbr>-----------</div><div>Libraries compiled on 2018-04-27 21:13:11 on ocean </div><div>Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_<wbr>64-x86_64-with-centos-7.2.1511<wbr>-Core</div><div>Using PETSc directory: /home/valera/petsc</div><div>Using PETSc arch: timings</div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using C compiler: /usr/lib64/openmpi/bin/mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3  </div><div>Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -O3    </div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using include paths: -I/home/valera/petsc/include -I/home/valera/petsc/timings/i<wbr>nclude -I/usr/lib64/openmpi/include</div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using C linker: /usr/lib64/openmpi/bin/mpicc</div><div>Using Fortran linker: /usr/lib64/openmpi/bin/mpif90</div><div>Using libraries: -Wl,-rpath,/home/valera/petsc/<wbr>timings/lib -L/home/valera/petsc/timings/l<wbr>ib -lpetsc -Wl,-rpath,/home/valera/petsc/<wbr>timings/lib -L/home/valera/petsc/timings/l<wbr>ib -Wl,-rpath,/usr/lib64 -L/usr/lib64 -Wl,-rpath,/usr/lib64/openmpi/<wbr>lib -L/usr/lib64/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64<wbr>-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-l<wbr>inux/4.8.5 -lHYPRE -llapack -lblas -lm -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s 
-lquadmath -lpthread -lstdc++ -ldl</div></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>What do you see wrong here? what options could i try to improve my solver scaling? </div><div><br></div><div>Thanks so much,</div><div><br></div><div>Manuel</div><div><br></div><div><br></div><div><br></div></div>

</blockquote></div></div></div><span class="HOEnZb"><font color="#888888"><br><br clear="all"><div><br></div>-- <br><div class="m_-1615080337341068005gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.caam.rice.edu/~mk51/" target="_blank">https://www.cse.buffalo.edu/~<wbr>knepley/</a><br></div></div></div></div></div>

</font></span></div></div>

</blockquote></div><br></div>
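<div><br></div><div>For reference, here is a minimal sketch of how the solver numbers could be read against the streams table above, assuming the streams speedup at a given process count is treated as the achievable ceiling for a memory-bandwidth-bound solver. The function names and the back-computed solver speedup are illustrative assumptions, not PETSc output, and the streams values are only meaningful if the solver ran with the same process placement.</div>

```python
# STREAMS speedups copied from the table in this message (np: speedup).
STREAMS_SPEEDUP = {1: 1.0, 16: 5.05, 32: 10.06, 64: 20.21}


def parallel_efficiency(speedup, nprocs):
    """Classical efficiency: speedup divided by the number of processes."""
    return speedup / nprocs


def bandwidth_relative_efficiency(solver_speedup, nprocs, streams=STREAMS_SPEEDUP):
    """Solver speedup as a fraction of the STREAMS speedup at the same nprocs."""
    return solver_speedup / streams[nprocs]


# Hypothetical solver speedup at 64 processes, back-computed from the
# "only 8% efficiency" figure quoted in the thread (0.08 * 64).
solver_speedup_64 = 0.08 * 64

print(f"classical efficiency: {parallel_efficiency(solver_speedup_64, 64):.1%}")
print(f"relative to STREAMS:  {bandwidth_relative_efficiency(solver_speedup_64, 64):.1%}")
```

<div>Under these assumptions, the same run that looks like 8% efficiency against the core count comes out at roughly a quarter of what streams says the memory system can deliver at 64 processes.</div>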