[petsc-users] Help my solver scale
Matthew Knepley
knepley at gmail.com
Wed May 2 17:40:24 CDT 2018
On Wed, May 2, 2018 at 5:59 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:
> Thanks Matt,
>
> I just reran the STREAMS test on the machine and got the table below. My
> question: is this the maximum speedup I can get on this machine, and should
> I therefore measure my efficiency and scaling tests against this figure
> instead?
>
> I have 20-core nodes
>
Are you sure they are not 16 core nodes?
> so this was run over 4 nodes,
>
Okay, you get a speedup of 20 using all 4 nodes (64 processes). This means
the maximum efficiency is about 30% in your terminology (20/64 ~ 0.31).
We can see that this is consistent scaling, since for 16 processes (I
assume 1 node) we get a speedup of 5, which is also about 30%.
Using the bandwidth limit as the peak instead of the core count, your strong
scaling is about 70% efficient for 16 processes (okay, not great), and
30% for 64 processes. That is believable, but could probably be improved.
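As a sanity check, both efficiency figures above can be reproduced with a short Python sketch. The STREAMS speedups are taken from the table below; the solver speedup of 3.5 is an illustrative placeholder, not a measured number:

```python
def efficiency(speedup, peak_speedup):
    """Parallel efficiency: achieved speedup as a fraction of some peak."""
    return speedup / peak_speedup

# Measured STREAMS speedups from the table below
streams = {16: 5.05, 64: 20.21}

# Peak = process count: both points sit near 30%, i.e. consistent scaling
print(f"{efficiency(streams[16], 16):.0%}")   # 32%
print(f"{efficiency(streams[64], 64):.0%}")   # 32%

# Peak = bandwidth limit: a hypothetical solver speedup of 3.5 on 16
# processes would be about 70% of what the memory bandwidth allows
solver_speedup = 3.5  # placeholder value, for illustration only
print(f"{efficiency(solver_speedup, streams[16]):.0%}")  # 69%
```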
The next things to look at are: How big are the problem sizes per process?
Are the iteration counts increasing? What do you get looking
only at solve time? Only at setup time? Do you really care about strong
scaling rather than weak scaling?
For anything else we would need to see the output from
-ksp_view -ksp_converged_reason -log_view
Thanks,
Matt
> Thanks,
>
> np speedup
> 1 1.0
> 2 1.82
> 3 2.43
> 4 2.79
> 5 2.99
> 6 3.13
> 7 3.13
> 8 3.19
> 9 3.17
> 10 3.17
> 11 3.44
> 12 3.81
> 13 4.13
> 14 4.43
> 15 4.72
> 16 5.05
> 17 5.4
> 18 5.69
> 19 5.99
> 20 6.29
> 21 6.66
> 22 6.96
> 23 7.26
> 24 7.6
> 25 7.86
> 26 8.25
> 27 8.54
> 28 8.88
> 29 9.2
> 30 9.44
> 31 9.84
> 32 10.06
> 33 10.43
> 34 10.72
> 35 11.11
> 36 11.42
> 37 11.75
> 38 12.07
> 39 12.27
> 40 12.65
> 41 12.94
> 42 13.34
> 43 13.6
> 44 13.83
> 45 14.27
> 46 14.56
> 47 14.84
> 48 15.24
> 49 15.49
> 50 15.85
> 51 15.87
> 52 16.35
> 53 16.76
> 54 17.02
> 55 17.17
> 56 17.7
> 57 17.9
> 58 18.28
> 59 18.56
> 60 18.82
> 61 19.37
> 62 19.62
> 63 19.88
> 64 20.21
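The efficiency implied by this table (speedup divided by process count) can be computed directly from a few representative rows:

```python
# (np, speedup) pairs taken from the STREAMS table above
data = [(1, 1.0), (4, 2.79), (16, 5.05), (32, 10.06), (64, 20.21)]

for nprocs, speedup in data:
    # Efficiency flattens near 32% once memory bandwidth saturates
    print(f"np={nprocs:3d}  speedup={speedup:6.2f}  "
          f"efficiency={speedup / nprocs:.0%}")
```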
>
>
>
> On Wed, May 2, 2018 at 1:24 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:
>>
>>> Hello guys,
>>>
>>> We are working on a paper about the parallelization of our model
>>> using PETSc, which is very exciting since it is the first time we see our
>>> model scale, but so far I feel my results for the Laplacian solver could
>>> be much better.
>>>
>>> For example, using CG/multigrid I get less than 20% efficiency beyond
>>> 16 cores, dropping to only 8% efficiency at 64 cores.
>>>
>>> I am defining efficiency as speedup over number of cores, and speedup as
>>> twall_1/twall_n, where n is the number of cores; I think that's pretty
>>> standard.
>>>
>>
>> This is the first big problem. Not all "cores" are created equal. First,
>> you need to run streams in the exact same configuration, so that you can see
>> how much speedup to expect. The program is here
>>
>> cd src/benchmarks/streams
>>
>> and
>>
>> make streams
>>
>> will run it. You will probably need to submit the program yourself to the
>> batch system to get the same configuration as your solver.
>>
>> This really matters because 16 cores on one node probably only have the
>> potential for a 5x speedup, so your 20% figure is misleading.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> The ksp_view for a distributed solve looks like this:
>>>
>>> KSP Object: 16 MPI processes
>>> type: cg
>>> maximum iterations=10000, initial guess is zero
>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>>> left preconditioning
>>> using PRECONDITIONED norm type for convergence test
>>> PC Object: 16 MPI processes
>>> type: hypre
>>> HYPRE BoomerAMG preconditioning
>>> Cycle type V
>>> Maximum number of levels 25
>>> Maximum number of iterations PER hypre call 1
>>> Convergence tolerance PER hypre call 0.
>>> Threshold for strong coupling 0.25
>>> Interpolation truncation factor 0.
>>> Interpolation: max elements per row 0
>>> Number of levels of aggressive coarsening 0
>>> Number of paths for aggressive coarsening 1
>>> Maximum row sums 0.9
>>> Sweeps down 1
>>> Sweeps up 1
>>> Sweeps on coarse 1
>>> Relax down symmetric-SOR/Jacobi
>>> Relax up symmetric-SOR/Jacobi
>>> Relax on coarse Gaussian-elimination
>>> Relax weight (all) 1.
>>> Outer relax weight (all) 1.
>>> Using CF-relaxation
>>> Not using more complex smoothers.
>>> Measure type local
>>> Coarsen type Falgout
>>> Interpolation type classical
>>> Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
>>> HYPRE_BoomerAMGSetInterpVecVariant() 1
>>> linear system matrix = precond matrix:
>>> Mat Object: 16 MPI processes
>>> type: mpiaij
>>> rows=213120, cols=213120
>>> total: nonzeros=3934732, allocated nonzeros=8098560
>>> total number of mallocs used during MatSetValues calls =0
>>> has attached near null space
>>>
>>>
>>> And the log_view for the same case would be:
>>>
>>> ************************************************************
>>> ************************************************************
>>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
>>> -fCourier9' to print this document ***
>>> ************************************************************
>>> ************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance
>>> Summary: ----------------------------------------------
>>>
>>> ./gcmSeamount on a timings named ocean with 16 processors, by valera Wed
>>> May 2 13:18:21 2018
>>> Using Petsc Development GIT revision: v3.9-163-gbe3efd4 GIT Date:
>>> 2018-04-16 10:45:40 -0500
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 1.355e+00 1.00004 1.355e+00
>>> Objects: 4.140e+02 1.00000 4.140e+02
>>> Flop: 7.582e+05 1.09916 7.397e+05 1.183e+07
>>> Flop/sec: 5.594e+05 1.09918 5.458e+05 8.732e+06
>>> MPI Messages: 1.588e+03 1.19167 1.468e+03 2.348e+04
>>> MPI Message Lengths: 7.112e+07 1.37899 4.462e+04 1.048e+09
>>> MPI Reductions: 4.760e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N
>>> --> 2N flop
>>> and VecAXPY() for complex vectors of length
>>> N --> 8N flop
>>>
>>> Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages
>>> --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 1.3553e+00 100.0% 1.1835e+07 100.0% 2.348e+04
>>> 100.0% 4.462e+04 100.0% 4.670e+02 98.1%
>>>
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flop: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length (bytes)
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flop in this
>>> phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>>> over all processors)
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> Event Count Time (sec) Flop
>>> --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg len
>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSidedF 2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05
>>> 0.0e+00 1 0 2 6 0 1 0 2 6 0 0
>>> VecTDot 1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00
>>> 1.0e+00 0 4 0 0 0 0 4 0 0 0 6646
>>> VecNorm 1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00
>>> 1.0e+00 0 4 0 0 0 0 4 0 0 0 292
>>> VecScale 14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00
>>> 0.0e+00 0 62 0 0 0 0 62 0 0 0 20346
>>> VecCopy 7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> VecPointwiseMult 12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00
>>> 0.0e+00 0 31 0 0 0 0 31 0 0 0 13153
>>> VecScatterBegin 111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04
>>> 1.9e+01 1 0 40 31 4 1 0 40 31 4 0
>>> VecScatterEnd 92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
>>> VecNormalize 1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 2 0 0 0 0 2 0 0 0 7911
>>> MatConvert 1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 4.0e+00 1 0 0 0 1 1 0 0 0 1 0
>>> MatAssemblyBegin 3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05
>>> 0.0e+00 2 0 2 6 0 2 0 2 6 0 0
>>> MatAssemblyEnd 3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04
>>> 2.0e+01 2 0 0 0 4 2 0 0 0 4 0
>>> MatGetRowIJ 2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatView 1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPSetUp 1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPSolve 1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00
>>> 2.0e+00 0 7 0 0 0 0 7 0 0 0 343
>>> PCSetUp 2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00
>>> 6.0e+00 2 2 0 0 1 2 2 0 0 1 9
>>> PCApply 1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory Descendants'
>>> Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Vector 172 170 70736264 0.
>>> Matrix 5 5 7125104 0.
>>> Matrix Null Space 1 1 608 0.
>>> Distributed Mesh 18 16 84096 0.
>>> Index Set 73 73 10022204 0.
>>> IS L to G Mapping 18 16 1180828 0.
>>> Star Forest Graph 36 32 27968 0.
>>> Discrete System 18 16 15040 0.
>>> Vec Scatter 67 64 38240520 0.
>>> Krylov Solver 2 2 2504 0.
>>> Preconditioner 2 2 2528 0.
>>> Viewer 2 1 848 0.
>>> ============================================================
>>> ============================================================
>>> Average time to get PetscTime(): 0.
>>> Average time for MPI_Barrier(): 2.38419e-06
>>> Average time for zero size MPI_Send(): 2.11596e-06
>>> #PETSc Option Table entries:
>>> -da_processors_z 1
>>> -ksp_type cg
>>> -ksp_view
>>> -log_view
>>> -pc_hypre_boomeramg_nodal_coarsen 1
>>> -pc_hypre_boomeramg_vec_interp_variant 1
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --known-level1-dcache-size=32768
>>> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=8
>>> --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2
>>> --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8
>>> --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8
>>> --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8
>>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1
>>> --known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
>>> PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
>>> --with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>> FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
>>> --with-debugging=no --with-batch -known-mpi-shared-libraries=0
>>> --known-64-bit-blas-indices=0
>>> -----------------------------------------
>>> Libraries compiled on 2018-04-27 21:13:11 on ocean
>>> Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_64-x86_64-with-centos-7.2.1511-Core
>>> Using PETSc directory: /home/valera/petsc
>>> Using PETSc arch: timings
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/lib64/openmpi/bin/mpicc -fPIC -Wall
>>> -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
>>> -fvisibility=hidden -O3
>>> Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90 -fPIC -Wall
>>> -ffree-line-length-0 -Wno-unused-dummy-argument -O3
>>> -----------------------------------------
>>>
>>> Using include paths: -I/home/valera/petsc/include
>>> -I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/lib64/openmpi/bin/mpicc
>>> Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
>>> Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
>>> -L/home/valera/petsc/timings/lib -lpetsc -Wl,-rpath,/home/valera/petsc/timings/lib
>>> -L/home/valera/petsc/timings/lib -Wl,-rpath,/usr/lib64 -L/usr/lib64
>>> -Wl,-rpath,/usr/lib64/openmpi/lib -L/usr/lib64/openmpi/lib
>>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
>>> -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
>>> -lgcc_s -lquadmath -lpthread -lstdc++ -ldl
>>>
>>>
>>>
>>>
>>>
>>> What do you see wrong here? What options could I try to improve my
>>> solver scaling?
>>>
>>> Thanks so much,
>>>
>>> Manuel
>>>
>>>
>>>
>>>
>>
>>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>