<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <span dir="ltr">&lt;<a href="mailto:mvalera-w@sdsu.edu" target="_blank">mvalera-w@sdsu.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hello guys,<div><br></div><div>We are writing a paper about the parallelization of our model using PETSc, which is very exciting since it is the first time we have seen our model scale, but so far I feel my results for the Laplacian solver could be much better.</div><div><br></div><div>For example, using CG/Multigrid I get less than 20% efficiency beyond 16 cores, dropping to only 8% efficiency at 64 cores.</div><div><br></div><div>I am defining efficiency as speedup over number of cores, and speedup as twall_1/twall_n where n is the number of cores. I think that's pretty standard.</div></div></blockquote><div><br></div><div>This is the first big problem. Not all "cores" are created equal. First, you need to run streams in exactly the same configuration as your solver, so that you can see how much speedup to expect. The program is here:</div><div><br></div><div>  cd src/benchmarks/streams</div><div><br></div><div>and</div><div><br></div><div>  make streams</div><div><br></div><div>will run it. 
You will probably need to submit the program yourself to the batch system to get the same configuration as your solver.</div><div><br></div><div>This really matters because 16 cores on one node probably only have the potential for a 5x speedup, so your 20% figure is misleading.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>The ksp_view for a distributed solve looks like this:</div><div><br></div><div><div>KSP Object: 16 MPI processes</div><div> type: cg</div><div> maximum iterations=10000, initial guess is zero</div><div> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.</div><div> left preconditioning</div><div> using PRECONDITIONED norm type for convergence test</div><div>PC Object: 16 MPI processes</div><div> type: hypre</div><div> HYPRE BoomerAMG preconditioning</div><div> Cycle type V</div><div> Maximum number of levels 25</div><div> Maximum number of iterations PER hypre call 1</div><div> Convergence tolerance PER hypre call 0.</div><div> Threshold for strong coupling 0.25</div><div> Interpolation truncation factor 0.</div><div> Interpolation: max elements per row 0</div><div> Number of levels of aggressive coarsening 0</div><div> Number of paths for aggressive coarsening 1</div><div> Maximum row sums 0.9</div><div> Sweeps down 1</div><div> Sweeps up 1</div><div> Sweeps on coarse 1</div><div> Relax down symmetric-SOR/Jacobi</div><div> Relax up symmetric-SOR/Jacobi</div><div> Relax on coarse Gaussian-elimination</div><div> Relax weight (all) 1.</div><div> Outer relax weight (all) 1.</div><div> Using CF-relaxation</div><div> Not using more complex smoothers.</div><div> Measure type local</div><div> Coarsen type Falgout</div><div> Interpolation type classical</div><div> Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1</div><div> HYPRE_<wbr>BoomerAMGSetInterpVecVariant() 1</div><div> linear 
system matrix = precond matrix:</div><div> Mat Object: 16 MPI processes</div><div> type: mpiaij</div><div> rows=213120, cols=213120</div><div> total: nonzeros=3934732, allocated nonzeros=8098560</div><div> total number of mallocs used during MatSetValues calls =0</div><div> has attached near null space</div></div><div><br></div><div><br></div><div>And the log_view for the same case would be:</div><div><br></div><div><div>******************************<wbr>******************************<wbr>******************************<wbr>******************************</div><div>*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***</div><div>******************************<wbr>******************************<wbr>******************************<wbr>******************************</div><div><br></div><div>------------------------------<wbr>---------------- PETSc Performance Summary: ------------------------------<wbr>----------------</div><div><br></div><div>./gcmSeamount on a timings named ocean with 16 processors, by valera Wed May 2 13:18:21 2018</div><div>Using Petsc Development GIT revision: v3.9-163-gbe3efd4 GIT Date: 2018-04-16 10:45:40 -0500</div><div><br></div><div> Max Max/Min Avg Total </div><div>Time (sec): 1.355e+00 1.00004 1.355e+00</div><div>Objects: 4.140e+02 1.00000 4.140e+02</div><div>Flop: 7.582e+05 1.09916 7.397e+05 1.183e+07</div><div>Flop/sec: 5.594e+05 1.09918 5.458e+05 8.732e+06</div><div>MPI Messages: 1.588e+03 1.19167 1.468e+03 2.348e+04</div><div>MPI Message Lengths: 7.112e+07 1.37899 4.462e+04 1.048e+09</div><div>MPI Reductions: 4.760e+02 1.00000</div><div><br></div><div>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)</div><div> e.g., VecAXPY() for real vectors of length N --> 2N flop</div><div> and VecAXPY() for complex vectors of length N --> 8N flop</div><div><br></div><div>Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages --- -- Message Lengths 
-- -- Reductions --</div><div> Avg %Total Avg %Total counts %Total Avg %Total counts %Total </div><div> 0: Main Stage: 1.3553e+00 100.0% 1.1835e+07 100.0% 2.348e+04 100.0% 4.462e+04 100.0% 4.670e+02 98.1% </div><div><br></div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div>See the 'Profiling' chapter of the users' manual for details on interpreting output.</div><div>Phase summary info:</div><div> Count: number of times phase was executed</div><div> Time and Flop: Max - maximum over all processors</div><div> Ratio - ratio of maximum to minimum over all processors</div><div> Mess: number of messages sent</div><div> Avg. len: average message length (bytes)</div><div> Reduct: number of global reductions</div><div> Global: entire computation</div><div> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().</div><div> %T - percent time in this phase %F - percent flop in this phase</div><div> %M - percent messages in this phase %L - percent message lengths in this phase</div><div> %R - percent reductions in this phase</div><div> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div>Event Count Time (sec) Flop --- Global --- --- Stage --- Total</div><div> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div><br></div><div>--- Event Stage 0: Main Stage</div><div><br></div><div>BuildTwoSidedF 2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00 1 0 2 6 0 1 0 2 6 0 0</div><div>VecTDot 1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00 0 4 0 0 0 0 4 0 0 
0 6646</div><div>VecNorm 1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00 0 4 0 0 0 0 4 0 0 0 292</div><div>VecScale 14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00 0.0e+00 0 62 0 0 0 0 62 0 0 0 20346</div><div>VecCopy 7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>VecSet 83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0</div><div>VecPointwiseMult 12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00 0.0e+00 0 31 0 0 0 0 31 0 0 0 13153</div><div>VecScatterBegin 111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04 1.9e+01 1 0 40 31 4 1 0 40 31 4 0</div><div>VecScatterEnd 92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0</div><div>VecNormalize 1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 7911</div><div>MatConvert 1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 1 0 0 0 1 1 0 0 0 1 0</div><div>MatAssemblyBegin 3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00 2 0 2 6 0 2 0 2 6 0 0</div><div>MatAssemblyEnd 3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04 2.0e+01 2 0 0 0 4 2 0 0 0 4 0</div><div>MatGetRowIJ 2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>MatView 1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>KSPSetUp 1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>KSPSolve 1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00 2.0e+00 0 7 0 0 0 0 7 0 0 0 343</div><div>PCSetUp 2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00 6.0e+00 2 2 0 0 1 2 2 0 0 1 9</div><div>PCApply 1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>------------------------------</div><div><br></div><div>Memory usage is given in bytes:</div><div><br></div><div>Object Type 
Creations Destructions Memory Descendants' Mem.</div><div>Reports information only for process 0.</div><div><br></div><div>--- Event Stage 0: Main Stage</div><div><br></div><div> Vector 172 170 70736264 0.</div><div> Matrix 5 5 7125104 0.</div><div> Matrix Null Space 1 1 608 0.</div><div> Distributed Mesh 18 16 84096 0.</div><div> Index Set 73 73 10022204 0.</div><div> IS L to G Mapping 18 16 1180828 0.</div><div> Star Forest Graph 36 32 27968 0.</div><div> Discrete System 18 16 15040 0.</div><div> Vec Scatter 67 64 38240520 0.</div><div> Krylov Solver 2 2 2504 0.</div><div> Preconditioner 2 2 2528 0.</div><div> Viewer 2 1 848 0.</div><div>==============================<wbr>==============================<wbr>==============================<wbr>==============================</div><div>Average time to get PetscTime(): 0.</div><div>Average time for MPI_Barrier(): 2.38419e-06</div><div>Average time for zero size MPI_Send(): 2.11596e-06</div><div>#PETSc Option Table entries:</div><div>-da_processors_z 1</div><div>-ksp_type cg</div><div>-ksp_view</div><div>-log_view</div><div>-pc_hypre_boomeramg_nodal_<wbr>coarsen 1</div><div>-pc_hypre_boomeramg_vec_<wbr>interp_variant 1</div><div>#End of PETSc Option Table entries</div><div>Compiled without FORTRAN kernels</div><div>Compiled with full precision matrices (default)</div><div>sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4</div><div>Configure options: --known-level1-dcache-size=<wbr>32768 --known-level1-dcache-<wbr>linesize=64 --known-level1-dcache-assoc=8 --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1 --known-mpi-c-double-complex=1 --known-has-attribute-aligned=<wbr>1 
PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/<wbr>openmpi --with-blaslapack-dir=/usr/<wbr>lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre --with-debugging=no --with-batch -known-mpi-shared-libraries=0 --known-64-bit-blas-indices=0</div><div>------------------------------<wbr>-----------</div><div>Libraries compiled on 2018-04-27 21:13:11 on ocean </div><div>Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_<wbr>64-x86_64-with-centos-7.2.<wbr>1511-Core</div><div>Using PETSc directory: /home/valera/petsc</div><div>Using PETSc arch: timings</div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using C compiler: /usr/lib64/openmpi/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3 </div><div>Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90 -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -O3 </div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using include paths: -I/home/valera/petsc/include -I/home/valera/petsc/timings/<wbr>include -I/usr/lib64/openmpi/include</div><div>------------------------------<wbr>-----------</div><div><br></div><div>Using C linker: /usr/lib64/openmpi/bin/mpicc</div><div>Using Fortran linker: /usr/lib64/openmpi/bin/mpif90</div><div>Using libraries: -Wl,-rpath,/home/valera/petsc/<wbr>timings/lib -L/home/valera/petsc/timings/<wbr>lib -lpetsc -Wl,-rpath,/home/valera/petsc/<wbr>timings/lib -L/home/valera/petsc/timings/<wbr>lib -Wl,-rpath,/usr/lib64 -L/usr/lib64 -Wl,-rpath,/usr/lib64/openmpi/<wbr>lib -L/usr/lib64/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_<wbr>64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-<wbr>linux/4.8.5 -lHYPRE -llapack -lblas -lm -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lstdc++ 
-ldl</div></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>What do you see wrong here? what options could i try to improve my solver scaling? </div><div><br></div><div>Thanks so much,</div><div><br></div><div>Manuel</div><div><br></div><div><br></div><div><br></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.caam.rice.edu/~mk51/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div>
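<div><br></div><div>As a rough illustration of the point about achievable speedup in the reply above, here is a minimal sketch in Python. It uses the conventional definitions (speedup = t_1/t_n, efficiency = speedup/n) and compares efficiency measured against the core count with efficiency measured against what the hardware can deliver. The wall times are hypothetical, and the 5x bound is only the rough guess from the reply; the real number has to come from your own streams run.</div><div><br></div>
<pre>
```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: serial wall time over parallel wall time."""
    return t_serial / t_parallel


def efficiency_vs_cores(t_serial, t_parallel, n_cores):
    """Efficiency measured against ideal linear scaling in core count."""
    return speedup(t_serial, t_parallel) / n_cores


def efficiency_vs_achievable(t_serial, t_parallel, achievable_speedup):
    """Efficiency measured against the speedup the hardware can deliver,
    e.g. the memory-bandwidth speedup reported by a streams run."""
    return speedup(t_serial, t_parallel) / achievable_speedup


# Hypothetical wall times, chosen so that 16 cores give 20% efficiency.
t1, t16 = 1.0, 0.3125
# Assumed achievable speedup on 16 cores (the rough 5x guess from the
# reply above); replace it with the number your own streams run reports.
achievable = 5.0

print(f"speedup:                  {speedup(t1, t16):.2f}x")
print(f"efficiency vs 16 cores:   {efficiency_vs_cores(t1, t16, 16):.0%}")
print(f"efficiency vs achievable: {efficiency_vs_achievable(t1, t16, achievable):.0%}")
```
</pre>
<div>With these assumed numbers, a run that looks like 20% efficiency against 16 cores sits at about 64% of the assumed bandwidth limit, which is why the streams result is needed before judging the solver.</div>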
</div></div>