[petsc-users] Help my solver scale

Manuel Valera mvalera-w at sdsu.edu
Wed May 2 15:19:11 CDT 2018

Hello guys,

We are writing a paper about the parallelization of our model using PETSc,
which is very exciting since it is the first time we have seen our model
scale. So far, though, I feel my results for the Laplacian solver could be
much better.

For example, using CG/multigrid I get less than 20% efficiency beyond 16
cores, dropping to only 8% efficiency at 64 cores.

I am defining efficiency as speedup divided by the number of cores, and
speedup as twall_1/twall_n, where n is the number of cores; I think that's
pretty standard.
The ksp_view for a distributed solve looks like this:

KSP Object: 16 MPI processes
  type: cg
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object: 16 MPI processes
  type: hypre
    HYPRE BoomerAMG preconditioning
      Cycle type V
      Maximum number of levels 25
      Maximum number of iterations PER hypre call 1
      Convergence tolerance PER hypre call 0.
      Threshold for strong coupling 0.25
      Interpolation truncation factor 0.
      Interpolation: max elements per row 0
      Number of levels of aggressive coarsening 0
      Number of paths for aggressive coarsening 1
      Maximum row sums 0.9
      Sweeps down         1
      Sweeps up           1
      Sweeps on coarse    1
      Relax down          symmetric-SOR/Jacobi
      Relax up            symmetric-SOR/Jacobi
      Relax on coarse     Gaussian-elimination
      Relax weight  (all)      1.
      Outer relax weight (all) 1.
      Using CF-relaxation
      Not using more complex smoothers.
      Measure type        local
      Coarsen type        Falgout
      Interpolation type  classical
      Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
      HYPRE_BoomerAMGSetInterpVecVariant() 1
  linear system matrix = precond matrix:
  Mat Object: 16 MPI processes
    type: mpiaij
    rows=213120, cols=213120
    total: nonzeros=3934732, allocated nonzeros=8098560
    total number of mallocs used during MatSetValues calls =0
      has attached near null space

And the log_view for the same case would be:

***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***

---------------------------------------------- PETSc Performance Summary:

./gcmSeamount on a timings named ocean with 16 processors, by valera Wed May  2 13:18:21 2018
Using Petsc Development GIT revision: v3.9-163-gbe3efd4  GIT Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min        Avg      Total
Time (sec):           1.355e+00      1.00004   1.355e+00
Objects:              4.140e+02      1.00000   4.140e+02
Flop:                 7.582e+05      1.09916   7.397e+05  1.183e+07
Flop/sec:            5.594e+05      1.09918   5.458e+05  8.732e+06
MPI Messages:         1.588e+03      1.19167   1.468e+03  2.348e+04
MPI Message Lengths:  7.112e+07      1.37899   4.462e+04  1.048e+09
MPI Reductions:       4.760e+02      1.00000

Flop counting convention: 1 flop = 1 real number operation of type
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.3553e+00 100.0%  1.1835e+07 100.0%  2.348e+04 100.0%  4.462e+04      100.0%  4.670e+02  98.1%

See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

--- Event Stage 0: Main Stage

BuildTwoSidedF         2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  1  0  2  6  0   1  0  2  6  0     0
VecTDot                1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0  6646
VecNorm                1 1.0 1.4589e-03 47.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0   292
VecScale              14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0 62  0  0  0   0 62  0  0  0 20346
VecCopy                7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecPointwiseMult      12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0 13153
VecScatterBegin      111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04 1.9e+01  1  0 40 31  4   1  0 40 31  4     0
VecScatterEnd         92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
VecNormalize           1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  7911
MatConvert             1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  1  0  0  0  1   1  0  0  0  1     0
MatAssemblyBegin       3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  2  0  2  6  0   2  0  2  6  0     0
MatAssemblyEnd         3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04 2.0e+01  2  0  0  0  4   2  0  0  0  4     0
MatGetRowIJ            2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSetUp               1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00 2.0e+00  0  7  0  0  0   0  7  0  0  0   343
PCSetUp                2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00 6.0e+00  2  2  0  0  1   2  2  0  0  1     9
PCApply                1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector   172            170     70736264     0.
              Matrix     5              5      7125104     0.
   Matrix Null Space     1              1          608     0.
    Distributed Mesh    18             16        84096     0.
           Index Set    73             73     10022204     0.
   IS L to G Mapping    18             16      1180828     0.
   Star Forest Graph    36             32        27968     0.
     Discrete System    18             16        15040     0.
         Vec Scatter    67             64     38240520     0.
       Krylov Solver     2              2         2504     0.
      Preconditioner     2              2         2528     0.
              Viewer     2              1          848     0.
Average time to get PetscTime(): 0.
Average time for MPI_Barrier(): 2.38419e-06
Average time for zero size MPI_Send(): 2.11596e-06
#PETSc Option Table entries:
-da_processors_z 1
-ksp_type cg
-pc_hypre_boomeramg_nodal_coarsen 1
-pc_hypre_boomeramg_vec_interp_variant 1
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --known-level1-dcache-size=32768
--known-level1-dcache-linesize=64 --known-level1-dcache-assoc=8
--known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2
--known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8
--known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8
--known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8
--known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1
--known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
--with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
--with-debugging=no --with-batch -known-mpi-shared-libraries=0
Libraries compiled on 2018-04-27 21:13:11 on ocean
Machine characteristics:
Using PETSc directory: /home/valera/petsc
Using PETSc arch: timings

Using C compiler: /usr/lib64/openmpi/bin/mpicc  -fPIC  -Wall
-Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
-fvisibility=hidden -O3
Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90  -fPIC -Wall
-ffree-line-length-0 -Wno-unused-dummy-argument -O3

Using include paths: -I/home/valera/petsc/include
-I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include

Using C linker: /usr/lib64/openmpi/bin/mpicc
Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
-L/home/valera/petsc/timings/lib -lpetsc
-Wl,-rpath,/home/valera/petsc/timings/lib -L/home/valera/petsc/timings/lib
-Wl,-rpath,/usr/lib64 -L/usr/lib64 -Wl,-rpath,/usr/lib64/openmpi/lib
-L/usr/lib64/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
-L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
-lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
-lgcc_s -lquadmath -lpthread -lstdc++ -ldl
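
As a rough aid for reading the event table in the log above, here is a small sketch that pulls the maximum time out of a few event rows and expresses it as a share of the total run time. The rows are copied from the log with the mail-wrapped lines rejoined, and 1.355 s is the "Time (sec)" maximum from the summary. The helper names (parse_events, time_shares) and the regular expression are my own illustration, not part of PETSc, and the parser assumes the column layout shown here.

```python
import re

# "Time (sec)" Max from the performance summary in the log above.
TOTAL_TIME = 1.355

# A few rows copied from the event table (wrapped lines rejoined).
ROWS = """\
VecSet                83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecScatterEnd         92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatAssemblyEnd         3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04 2.0e+01  2  0  0  0  4   2  0  0  0  4     0
KSPSolve               1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00 2.0e+00  0  7  0  0  0   0  7  0  0  0   343
PCSetUp                2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00 6.0e+00  2  2  0  0  1   2  2  0  0  1     9
"""

# Event name, call count, count ratio (skipped), max time in seconds.
# The exponent is limited to two digits because the time and its ratio
# can run together in the log (e.g. "1.4589e-0347.1" in the VecNorm row).
ROW_RE = re.compile(r"^(\S+)\s+(\d+)\s+\S+\s+(\d\.\d+e[+-]\d{2})")


def parse_events(text):
    """Return {event name: (call count, max time in seconds)}."""
    events = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            events[match.group(1)] = (int(match.group(2)), float(match.group(3)))
    return events


def time_shares(events, total_time):
    """Return (name, fraction of total_time) pairs, largest first."""
    shares = [(name, t / total_time) for name, (_, t) in events.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


for name, share in time_shares(parse_events(ROWS), TOTAL_TIME):
    print(f"{name:<16s} {share:6.2%}")
```

Events can nest inside one another, so these shares are not expected to add up to 100%; the sketch only ranks the rows it is given.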

What do you see wrong here? What options could I try to improve my solver?

Thanks so much,

