[petsc-users] Help my solver scale

Matthew Knepley knepley at gmail.com
Wed May 2 15:24:32 CDT 2018


On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:

> Hello guys,
>
> We are working on writing a paper about the parallelization of our model
> using PETSc, which is very exciting since it is the first time we see our
> model scaling, but so far I feel my results for the Laplacian solver could
> be much better.
>
> For example, using CG/Multigrid I get less than 20% efficiency beyond 16
> cores, and at 64 cores I get only 8% efficiency.
>
> I am defining efficiency as speedup over number of cores, and speedup as
> twall_1/twall_n where n is the number of cores; I think that's pretty
> standard.
>

This is the first big problem. Not all "cores" are created equal. First,
you need to run streams in the exact same configuration, so that you can see
how much speedup to expect. The program is here

  cd src/benchmarks/streams

and

  make streams

will run it. You will probably need to submit the program yourself to the
batch system to get the same configuration as your solver.
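
For example, if your cluster uses SLURM, a minimal submission sketch might
look like the following (the scheduler, its directives, and the MPIVersion
executable name are assumptions about your setup; adjust the core count and
launcher to match exactly how you run the solver):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks=16
  # run the streams benchmark built by 'make streams' with the same MPI layout as the solver
  cd $PETSC_DIR/src/benchmarks/streams
  mpiexec -n 16 ./MPIVersion

Run it at each core count you benchmark so the streams speedup curve matches
your solver runs.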

This really matters because 16 cores on one node probably only have the
potential for about 5x speedup, so your 20% figure is misleading.
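
As a concrete (illustrative) example: if streams reports a 5x memory-bandwidth
speedup on 16 cores and your solver runs 3.2x faster on 16 cores than on 1,
then you are at 3.2/5 = 64% of what the hardware can deliver, even though the
naive calculation 3.2/16 gives only 20%.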

  Thanks,

     Matt


> The ksp_view for a distributed solve looks like this:
>
> KSP Object: 16 MPI processes
>   type: cg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object: 16 MPI processes
>   type: hypre
>     HYPRE BoomerAMG preconditioning
>       Cycle type V
>       Maximum number of levels 25
>       Maximum number of iterations PER hypre call 1
>       Convergence tolerance PER hypre call 0.
>       Threshold for strong coupling 0.25
>       Interpolation truncation factor 0.
>       Interpolation: max elements per row 0
>       Number of levels of aggressive coarsening 0
>       Number of paths for aggressive coarsening 1
>       Maximum row sums 0.9
>       Sweeps down         1
>       Sweeps up           1
>       Sweeps on coarse    1
>       Relax down          symmetric-SOR/Jacobi
>       Relax up            symmetric-SOR/Jacobi
>       Relax on coarse     Gaussian-elimination
>       Relax weight  (all)      1.
>       Outer relax weight (all) 1.
>       Using CF-relaxation
>       Not using more complex smoothers.
>       Measure type        local
>       Coarsen type        Falgout
>       Interpolation type  classical
>       Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
>       HYPRE_BoomerAMGSetInterpVecVariant() 1
>   linear system matrix = precond matrix:
>   Mat Object: 16 MPI processes
>     type: mpiaij
>     rows=213120, cols=213120
>     total: nonzeros=3934732, allocated nonzeros=8098560
>     total number of mallocs used during MatSetValues calls =0
>       has attached near null space
>
>
> And the log_view for the same case would be:
>
> ************************************************************
> ************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
> ************************************************************
> ************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./gcmSeamount on a timings named ocean with 16 processors, by valera Wed
> May  2 13:18:21 2018
> Using Petsc Development GIT revision: v3.9-163-gbe3efd4  GIT Date:
> 2018-04-16 10:45:40 -0500
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           1.355e+00      1.00004   1.355e+00
> Objects:              4.140e+02      1.00000   4.140e+02
> Flop:                 7.582e+05      1.09916   7.397e+05  1.183e+07
> Flop/sec:            5.594e+05      1.09918   5.458e+05  8.732e+06
> MPI Messages:         1.588e+03      1.19167   1.468e+03  2.348e+04
> MPI Message Lengths:  7.112e+07      1.37899   4.462e+04  1.048e+09
> MPI Reductions:       4.760e+02      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flop
>                             and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
>  %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 1.3553e+00 100.0%  1.1835e+07 100.0%  2.348e+04
> 100.0%  4.462e+04      100.0%  4.670e+02  98.1%
>
> ------------------------------------------------------------
> ------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------
> ------------------------------------------------------------
> Event                Count      Time (sec)     Flop
>      --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------
> ------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSidedF         2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05
> 0.0e+00  1  0  2  6  0   1  0  2  6  0     0
> VecTDot                1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00
> 1.0e+00  0  4  0  0  0   0  4  0  0  0  6646
> VecNorm                1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00
> 1.0e+00  0  4  0  0  0   0  4  0  0  0   292
> VecScale              14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00
> 0.0e+00  0 62  0  0  0   0 62  0  0  0 20346
> VecCopy                7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecPointwiseMult      12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00
> 0.0e+00  0 31  0  0  0   0 31  0  0  0 13153
> VecScatterBegin      111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04
> 1.9e+01  1  0 40 31  4   1  0 40 31  4     0
> VecScatterEnd         92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
> VecNormalize           1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  2  0  0  0   0  2  0  0  0  7911
> MatConvert             1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 4.0e+00  1  0  0  0  1   1  0  0  0  1     0
> MatAssemblyBegin       3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05
> 0.0e+00  2  0  2  6  0   2  0  2  6  0     0
> MatAssemblyEnd         3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04
> 2.0e+01  2  0  0  0  4   2  0  0  0  4     0
> MatGetRowIJ            2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00
> 2.0e+00  0  7  0  0  0   0  7  0  0  0   343
> PCSetUp                2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00
> 6.0e+00  2  2  0  0  1   2  2  0  0  1     9
> PCApply                1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------
> ------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Vector   172            170     70736264     0.
>               Matrix     5              5      7125104     0.
>    Matrix Null Space     1              1          608     0.
>     Distributed Mesh    18             16        84096     0.
>            Index Set    73             73     10022204     0.
>    IS L to G Mapping    18             16      1180828     0.
>    Star Forest Graph    36             32        27968     0.
>      Discrete System    18             16        15040     0.
>          Vec Scatter    67             64     38240520     0.
>        Krylov Solver     2              2         2504     0.
>       Preconditioner     2              2         2528     0.
>               Viewer     2              1          848     0.
> ============================================================
> ============================================================
> Average time to get PetscTime(): 0.
> Average time for MPI_Barrier(): 2.38419e-06
> Average time for zero size MPI_Send(): 2.11596e-06
> #PETSc Option Table entries:
> -da_processors_z 1
> -ksp_type cg
> -ksp_view
> -log_view
> -pc_hypre_boomeramg_nodal_coarsen 1
> -pc_hypre_boomeramg_vec_interp_variant 1
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --known-level1-dcache-size=32768 --known-level1-dcache-linesize=64
> --known-level1-dcache-assoc=8 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1
> --known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4
> --known-mpi-long-double=1 --known-mpi-int64_t=1
> --known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
> PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
> --with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
> FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
> --with-debugging=no --with-batch -known-mpi-shared-libraries=0
> --known-64-bit-blas-indices=0
> -----------------------------------------
> Libraries compiled on 2018-04-27 21:13:11 on ocean
> Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_
> 64-x86_64-with-centos-7.2.1511-Core
> Using PETSc directory: /home/valera/petsc
> Using PETSc arch: timings
> -----------------------------------------
>
> Using C compiler: /usr/lib64/openmpi/bin/mpicc  -fPIC  -Wall
> -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
> -fvisibility=hidden -O3
> Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90  -fPIC -Wall
> -ffree-line-length-0 -Wno-unused-dummy-argument -O3
> -----------------------------------------
>
> Using include paths: -I/home/valera/petsc/include
> -I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include
> -----------------------------------------
>
> Using C linker: /usr/lib64/openmpi/bin/mpicc
> Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
> Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
> -L/home/valera/petsc/timings/lib -lpetsc -Wl,-rpath,/home/valera/petsc/timings/lib
> -L/home/valera/petsc/timings/lib -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -Wl,-rpath,/usr/lib64/openmpi/lib -L/usr/lib64/openmpi/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
> -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
> -lgcc_s -lquadmath -lpthread -lstdc++ -ldl
>
>
>
>
>
> What do you see wrong here? What options could I try to improve my solver
> scaling?
>
> Thanks so much,
>
> Manuel
>
>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/