[petsc-users] Help my solver scale
Matthew Knepley
knepley at gmail.com
Wed May 2 15:24:32 CDT 2018
On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:
> Hello guys,
>
> We are working on writing a paper about the parallelization of our model
> using PETSc, which is very exciting since it is the first time we see our
> model scaling, but so far I feel my results for the Laplacian solver could
> be much better.
>
> For example, using CG/Multigrid I get less than 20% efficiency beyond 16
> cores, and at 64 cores I get only 8% efficiency.
>
> I am defining efficiency as speedup divided by the number of cores, and
> speedup as twall_1/twall_n where n is the number of cores; I think that's
> pretty standard.
>
This is the first big problem. Not all "cores" are created equal. First,
you need to run streams in exactly the same configuration, so that you can
see how much speedup to expect. The program is in src/benchmarks/streams;
running

  cd src/benchmarks/streams
  make streams

will run it. You will probably need to submit the program yourself to the
batch system to get the same configuration as your solver.
This really matters because 16 cores on one node probably only have the
potential for about 5x speedup, so your 20% figure is misleading.
Thanks,
Matt
> The ksp_view for a distributed solve looks like this:
>
> KSP Object: 16 MPI processes
> type: cg
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> PC Object: 16 MPI processes
> type: hypre
> HYPRE BoomerAMG preconditioning
> Cycle type V
> Maximum number of levels 25
> Maximum number of iterations PER hypre call 1
> Convergence tolerance PER hypre call 0.
> Threshold for strong coupling 0.25
> Interpolation truncation factor 0.
> Interpolation: max elements per row 0
> Number of levels of aggressive coarsening 0
> Number of paths for aggressive coarsening 1
> Maximum row sums 0.9
> Sweeps down 1
> Sweeps up 1
> Sweeps on coarse 1
> Relax down symmetric-SOR/Jacobi
> Relax up symmetric-SOR/Jacobi
> Relax on coarse Gaussian-elimination
> Relax weight (all) 1.
> Outer relax weight (all) 1.
> Using CF-relaxation
> Not using more complex smoothers.
> Measure type local
> Coarsen type Falgout
> Interpolation type classical
> Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
> HYPRE_BoomerAMGSetInterpVecVariant() 1
> linear system matrix = precond matrix:
> Mat Object: 16 MPI processes
> type: mpiaij
> rows=213120, cols=213120
> total: nonzeros=3934732, allocated nonzeros=8098560
> total number of mallocs used during MatSetValues calls =0
> has attached near null space
>
>
> And the log_view for the same case would be:
>
> ************************************************************
> ************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
> ************************************************************
> ************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./gcmSeamount on a timings named ocean with 16 processors, by valera Wed
> May 2 13:18:21 2018
> Using Petsc Development GIT revision: v3.9-163-gbe3efd4 GIT Date:
> 2018-04-16 10:45:40 -0500
>
> Max Max/Min Avg Total
> Time (sec): 1.355e+00 1.00004 1.355e+00
> Objects: 4.140e+02 1.00000 4.140e+02
> Flop: 7.582e+05 1.09916 7.397e+05 1.183e+07
> Flop/sec: 5.594e+05 1.09918 5.458e+05 8.732e+06
> MPI Messages: 1.588e+03 1.19167 1.468e+03 2.348e+04
> MPI Message Lengths: 7.112e+07 1.37899 4.462e+04 1.048e+09
> MPI Reductions: 4.760e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N
> --> 2N flop
> and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages
> --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts
> %Total Avg %Total counts %Total
> 0: Main Stage: 1.3553e+00 100.0% 1.1835e+07 100.0% 2.348e+04
> 100.0% 4.462e+04 100.0% 4.670e+02 98.1%
>
> ------------------------------------------------------------
> ------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this
> phase
> %M - percent messages in this phase %L - percent message lengths
> in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------
> ------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------
> ------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSidedF 2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05
> 0.0e+00 1 0 2 6 0 1 0 2 6 0 0
> VecTDot 1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00
> 1.0e+00 0 4 0 0 0 0 4 0 0 0 6646
> VecNorm 1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00
> 1.0e+00 0 4 0 0 0 0 4 0 0 0 292
> VecScale 14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00
> 0.0e+00 0 62 0 0 0 0 62 0 0 0 20346
> VecCopy 7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecPointwiseMult 12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00
> 0.0e+00 0 31 0 0 0 0 31 0 0 0 13153
> VecScatterBegin 111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04
> 1.9e+01 1 0 40 31 4 1 0 40 31 4 0
> VecScatterEnd 92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
> VecNormalize 1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 2 0 0 0 0 2 0 0 0 7911
> MatConvert 1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 4.0e+00 1 0 0 0 1 1 0 0 0 1 0
> MatAssemblyBegin 3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05
> 0.0e+00 2 0 2 6 0 2 0 2 6 0 0
> MatAssemblyEnd 3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04
> 2.0e+01 2 0 0 0 4 2 0 0 0 4 0
> MatGetRowIJ 2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00
> 2.0e+00 0 7 0 0 0 0 7 0 0 0 343
> PCSetUp 2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00
> 6.0e+00 2 2 0 0 1 2 2 0 0 1 9
> PCApply 1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> ------------------------------------------------------------
> ------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 172 170 70736264 0.
> Matrix 5 5 7125104 0.
> Matrix Null Space 1 1 608 0.
> Distributed Mesh 18 16 84096 0.
> Index Set 73 73 10022204 0.
> IS L to G Mapping 18 16 1180828 0.
> Star Forest Graph 36 32 27968 0.
> Discrete System 18 16 15040 0.
> Vec Scatter 67 64 38240520 0.
> Krylov Solver 2 2 2504 0.
> Preconditioner 2 2 2528 0.
> Viewer 2 1 848 0.
> ============================================================
> ============================================================
> Average time to get PetscTime(): 0.
> Average time for MPI_Barrier(): 2.38419e-06
> Average time for zero size MPI_Send(): 2.11596e-06
> #PETSc Option Table entries:
> -da_processors_z 1
> -ksp_type cg
> -ksp_view
> -log_view
> -pc_hypre_boomeramg_nodal_coarsen 1
> -pc_hypre_boomeramg_vec_interp_variant 1
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --known-level1-dcache-size=32768 --known-level1-dcache-linesize=64
> --known-level1-dcache-assoc=8 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1
> --known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4
> --known-mpi-long-double=1 --known-mpi-int64_t=1
> --known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
> PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
> --with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
> FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
> --with-debugging=no --with-batch -known-mpi-shared-libraries=0
> --known-64-bit-blas-indices=0
> -----------------------------------------
> Libraries compiled on 2018-04-27 21:13:11 on ocean
> Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_64-x86_64-with-centos-7.2.1511-Core
> Using PETSc directory: /home/valera/petsc
> Using PETSc arch: timings
> -----------------------------------------
>
> Using C compiler: /usr/lib64/openmpi/bin/mpicc -fPIC -Wall
> -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
> -fvisibility=hidden -O3
> Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90 -fPIC -Wall
> -ffree-line-length-0 -Wno-unused-dummy-argument -O3
> -----------------------------------------
>
> Using include paths: -I/home/valera/petsc/include
> -I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include
> -----------------------------------------
>
> Using C linker: /usr/lib64/openmpi/bin/mpicc
> Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
> Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
> -L/home/valera/petsc/timings/lib -lpetsc -Wl,-rpath,/home/valera/petsc/timings/lib
> -L/home/valera/petsc/timings/lib -Wl,-rpath,/usr/lib64 -L/usr/lib64
> -Wl,-rpath,/usr/lib64/openmpi/lib -L/usr/lib64/openmpi/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
> -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
> -lgcc_s -lquadmath -lpthread -lstdc++ -ldl
>
>
>
>
>
> What do you see wrong here? What options could I try to improve my solver's
> scaling?
>
> Thanks so much,
>
> Manuel
>
>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>