[petsc-users] Help my solver scale

Matthew Knepley knepley at gmail.com
Wed May 2 17:40:24 CDT 2018


On Wed, May 2, 2018 at 5:59 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:

> Thanks Matt,
>
> I just reran the STREAMS tests on the machine and got the following
> table. My question is: is this the maximum speedup I can get on my
> machine, and should I therefore compare the efficiency and scaling tests
> against this figure instead?
>
> I have 20-core nodes
>

Are you sure they are not 16 core nodes?


> so this was run over 4 nodes,
>

Okay, you get a speedup of 20 using all 4 nodes (64 processes). This means
that the maximum efficiency is about 30% in your terminology.
We can see that this is consistent scaling, since for 16 processes (I
assume 1 node) we get a speedup of 5, which is also about 30%.

If we use the bandwidth limit as the peak instead of the core count, your
strong scaling efficiency is about 70% for 16 processes (okay, not great),
and 30% for 64 processes. That is believable, but could probably be improved.
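
As a rough check of the arithmetic above, here is a minimal Python sketch. It
assumes the STREAMS speedups from the table quoted below (5.05 at 16 processes,
20.21 at 64) and takes the solver efficiencies as the approximate figures from
the original message ("less than 20%" at 16 cores, 8% at 64). The function name
is purely illustrative. Since 20% is an upper bound, the result only lands in
the same ballpark as the rounded 70% and 30%.

```python
# Figures quoted in this thread. The solver efficiencies are approximate:
# "less than 20%" at 16 cores and "8%" at 64 cores.
streams_speedup = {16: 5.05, 64: 20.21}
solver_efficiency = {16: 0.20, 64: 0.08}


def bandwidth_relative_scaling(nprocs, solver_eff, streams):
    """Solver speedup divided by the STREAMS (bandwidth-limited) speedup."""
    solver_speedup = solver_eff * nprocs  # efficiency = speedup / nprocs
    return solver_speedup / streams


results = {}
for nprocs in (16, 64):
    # Efficiency of STREAMS itself when measured against the core count
    streams_eff = streams_speedup[nprocs] / nprocs
    rel = bandwidth_relative_scaling(
        nprocs, solver_efficiency[nprocs], streams_speedup[nprocs]
    )
    results[nprocs] = (streams_eff, rel)
    print(f"np={nprocs:3d}  STREAMS efficiency={streams_eff:.2f}  "
          f"solver vs. STREAMS={rel:.2f}")
```

With these inputs the STREAMS efficiency is about 0.32 at both process counts,
and the solver reaches roughly 0.6 and 0.25 of the bandwidth-limited speedup.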

The next things to look at are: How big are the problem sizes per process?
Are the iteration counts increasing? What do you get looking
only at solve time? Only at setup time? Do you really care about strong
scaling rather than weak scaling?
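
For the first question, a small Python sketch of the per-process problem size,
using the global matrix size reported in the -ksp_view output quoted below
(rows=213120, nonzeros=3934732). The 10,000-unknowns-per-process threshold is
an assumption here: it is a commonly quoted rule of thumb, not something
measured in this thread, and the helper name is illustrative only.

```python
# Global sizes taken from the quoted -ksp_view output.
GLOBAL_ROWS = 213120
GLOBAL_NONZEROS = 3934732

# ASSUMPTION: rule-of-thumb minimum number of unknowns per process below
# which communication tends to dominate. Not measured in this thread.
MIN_ROWS_PER_PROCESS = 10_000


def rows_per_process(global_rows, nprocs):
    """Average number of matrix rows owned by each MPI process."""
    return global_rows / nprocs


local_rows = {n: rows_per_process(GLOBAL_ROWS, n) for n in (1, 16, 64)}
for nprocs, rows in local_rows.items():
    verdict = "ok" if rows >= MIN_ROWS_PER_PROCESS else "probably too small"
    print(f"np={nprocs:3d}  rows/process={rows:9.0f}  ({verdict})")

# Average row density, useful for judging how much work each row carries
avg_nnz_per_row = GLOBAL_NONZEROS / GLOBAL_ROWS
print(f"average nonzeros per row: {avg_nnz_per_row:.1f}")
```

At 64 processes each rank owns only 3330 rows, so a strong-scaling test on this
fixed problem leaves very little local work per process. A weak-scaling test
would instead grow the global problem with the process count.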

For anything else we would need to see the output from

  -ksp_view -ksp_converged_reason -log_view

  Thanks,

    Matt


> Thanks,
>
> np  speedup
> 1 1.0
> 2 1.82
> 3 2.43
> 4 2.79
> 5 2.99
> 6 3.13
> 7 3.13
> 8 3.19
> 9 3.17
> 10 3.17
> 11 3.44
> 12 3.81
> 13 4.13
> 14 4.43
> 15 4.72
> 16 5.05
> 17 5.4
> 18 5.69
> 19 5.99
> 20 6.29
> 21 6.66
> 22 6.96
> 23 7.26
> 24 7.6
> 25 7.86
> 26 8.25
> 27 8.54
> 28 8.88
> 29 9.2
> 30 9.44
> 31 9.84
> 32 10.06
> 33 10.43
> 34 10.72
> 35 11.11
> 36 11.42
> 37 11.75
> 38 12.07
> 39 12.27
> 40 12.65
> 41 12.94
> 42 13.34
> 43 13.6
> 44 13.83
> 45 14.27
> 46 14.56
> 47 14.84
> 48 15.24
> 49 15.49
> 50 15.85
> 51 15.87
> 52 16.35
> 53 16.76
> 54 17.02
> 55 17.17
> 56 17.7
> 57 17.9
> 58 18.28
> 59 18.56
> 60 18.82
> 61 19.37
> 62 19.62
> 63 19.88
> 64 20.21
>
>
>
> On Wed, May 2, 2018 at 1:24 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:
>>
>>> Hello guys,
>>>
>>> We are working on writing a paper about the parallelization of our model
>>> using PETSc, which is very exciting since it is the first time we see our
>>> model scaling, but so far I feel my results for the Laplacian solver could
>>> be much better.
>>>
>>> For example, using CG/Multigrid I get less than 20% efficiency beyond
>>> 16 cores, down to only 8% efficiency at 64 cores.
>>>
>>> I am defining efficiency as speedup over number of cores, and speedup as
>>> twall_1/twall_n where n is the number of cores. I think that's pretty
>>> standard.
>>>
>>
>> This is the first big problem. Not all "cores" are created equal. First,
>> you need to run streams in the exact same configuration, so that you can see
>> how much speedup to expect. The program is here
>>
>>   cd src/benchmarks/streams
>>
>> and
>>
>>   make streams
>>
>> will run it. You will probably need to submit the program yourself to the
>> batch system to get the same configuration as your solver.
>>
>> This really matters because 16 cores on one node probably only have the
>> potential for 5x speedup, so that your 20% is misguided.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> The ksp_view for a distributed solve looks like this:
>>>
>>> KSP Object: 16 MPI processes
>>>   type: cg
>>>   maximum iterations=10000, initial guess is zero
>>>   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>   left preconditioning
>>>   using PRECONDITIONED norm type for convergence test
>>> PC Object: 16 MPI processes
>>>   type: hypre
>>>     HYPRE BoomerAMG preconditioning
>>>       Cycle type V
>>>       Maximum number of levels 25
>>>       Maximum number of iterations PER hypre call 1
>>>       Convergence tolerance PER hypre call 0.
>>>       Threshold for strong coupling 0.25
>>>       Interpolation truncation factor 0.
>>>       Interpolation: max elements per row 0
>>>       Number of levels of aggressive coarsening 0
>>>       Number of paths for aggressive coarsening 1
>>>       Maximum row sums 0.9
>>>       Sweeps down         1
>>>       Sweeps up           1
>>>       Sweeps on coarse    1
>>>       Relax down          symmetric-SOR/Jacobi
>>>       Relax up            symmetric-SOR/Jacobi
>>>       Relax on coarse     Gaussian-elimination
>>>       Relax weight  (all)      1.
>>>       Outer relax weight (all) 1.
>>>       Using CF-relaxation
>>>       Not using more complex smoothers.
>>>       Measure type        local
>>>       Coarsen type        Falgout
>>>       Interpolation type  classical
>>>       Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
>>>       HYPRE_BoomerAMGSetInterpVecVariant() 1
>>>   linear system matrix = precond matrix:
>>>   Mat Object: 16 MPI processes
>>>     type: mpiaij
>>>     rows=213120, cols=213120
>>>     total: nonzeros=3934732, allocated nonzeros=8098560
>>>     total number of mallocs used during MatSetValues calls =0
>>>       has attached near null space
>>>
>>>
>>> And the log_view for the same case would be:
>>>
>>> ************************************************************************************************************************
>>> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> ./gcmSeamount on a timings named ocean with 16 processors, by valera Wed May  2 13:18:21 2018
>>> Using Petsc Development GIT revision: v3.9-163-gbe3efd4  GIT Date: 2018-04-16 10:45:40 -0500
>>>
>>>                          Max       Max/Min        Avg      Total
>>> Time (sec):           1.355e+00      1.00004   1.355e+00
>>> Objects:              4.140e+02      1.00000   4.140e+02
>>> Flop:                 7.582e+05      1.09916   7.397e+05  1.183e+07
>>> Flop/sec:            5.594e+05      1.09918   5.458e+05  8.732e+06
>>> MPI Messages:         1.588e+03      1.19167   1.468e+03  2.348e+04
>>> MPI Message Lengths:  7.112e+07      1.37899   4.462e+04  1.048e+09
>>> MPI Reductions:       4.760e+02      1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>>>  0:      Main Stage: 1.3553e+00 100.0%  1.1835e+07 100.0%  2.348e+04 100.0%  4.462e+04      100.0%  4.670e+02  98.1%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                    Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    Avg. len: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSidedF         2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  1  0  2  6  0   1  0  2  6  0     0
>>> VecTDot                1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0  6646
>>> VecNorm                1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00 1.0e+00  0  4  0  0  0   0  4  0  0  0   292
>>> VecScale              14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0 62  0  0  0   0 62  0  0  0 20346
>>> VecCopy                7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet                83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecPointwiseMult      12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0 13153
>>> VecScatterBegin      111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04 1.9e+01  1  0 40 31  4   1  0 40 31  4     0
>>> VecScatterEnd         92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
>>> VecNormalize           1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  7911
>>> MatConvert             1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  1  0  0  0  1   1  0  0  0  1     0
>>> MatAssemblyBegin       3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05 0.0e+00  2  0  2  6  0   2  0  2  6  0     0
>>> MatAssemblyEnd         3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04 2.0e+01  2  0  0  0  4   2  0  0  0  4     0
>>> MatGetRowIJ            2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSetUp               1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00 2.0e+00  0  7  0  0  0   0  7  0  0  0   343
>>> PCSetUp                2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00 6.0e+00  2  2  0  0  1   2  2  0  0  1     9
>>> PCApply                1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>               Vector   172            170     70736264     0.
>>>               Matrix     5              5      7125104     0.
>>>    Matrix Null Space     1              1          608     0.
>>>     Distributed Mesh    18             16        84096     0.
>>>            Index Set    73             73     10022204     0.
>>>    IS L to G Mapping    18             16      1180828     0.
>>>    Star Forest Graph    36             32        27968     0.
>>>      Discrete System    18             16        15040     0.
>>>          Vec Scatter    67             64     38240520     0.
>>>        Krylov Solver     2              2         2504     0.
>>>       Preconditioner     2              2         2528     0.
>>>               Viewer     2              1          848     0.
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 0.
>>> Average time for MPI_Barrier(): 2.38419e-06
>>> Average time for zero size MPI_Send(): 2.11596e-06
>>> #PETSc Option Table entries:
>>> -da_processors_z 1
>>> -ksp_type cg
>>> -ksp_view
>>> -log_view
>>> -pc_hypre_boomeramg_nodal_coarsen 1
>>> -pc_hypre_boomeramg_vec_interp_variant 1
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --known-level1-dcache-size=32768
>>> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=8
>>> --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2
>>> --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8
>>> --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8
>>> --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8
>>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1
>>> --known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
>>> PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
>>> --with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>> FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
>>> --with-debugging=no --with-batch -known-mpi-shared-libraries=0
>>> --known-64-bit-blas-indices=0
>>> -----------------------------------------
>>> Libraries compiled on 2018-04-27 21:13:11 on ocean
>>> Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_64-x86_64-with-centos-7.2.1511-Core
>>> Using PETSc directory: /home/valera/petsc
>>> Using PETSc arch: timings
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/lib64/openmpi/bin/mpicc  -fPIC  -Wall
>>> -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
>>> -fvisibility=hidden -O3
>>> Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90  -fPIC -Wall
>>> -ffree-line-length-0 -Wno-unused-dummy-argument -O3
>>> -----------------------------------------
>>>
>>> Using include paths: -I/home/valera/petsc/include
>>> -I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/lib64/openmpi/bin/mpicc
>>> Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
>>> Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
>>> -L/home/valera/petsc/timings/lib -lpetsc -Wl,-rpath,/home/valera/petsc/timings/lib
>>> -L/home/valera/petsc/timings/lib -Wl,-rpath,/usr/lib64 -L/usr/lib64
>>> -Wl,-rpath,/usr/lib64/openmpi/lib -L/usr/lib64/openmpi/lib
>>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
>>> -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
>>> -lgcc_s -lquadmath -lpthread -lstdc++ -ldl
>>>
>>>
>>>
>>>
>>>
>>> What do you see wrong here? What options could I try to improve my
>>> solver scaling?
>>>
>>> Thanks so much,
>>>
>>> Manuel
>>>
>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>