[petsc-users] Help my solver scale
Manuel Valera
mvalera-w at sdsu.edu
Wed May 2 16:59:46 CDT 2018
Thanks, Matt,
I just reran the streams test on the machine and got the following table.
My question: is this the maximum speedup I can get on my machine, and
should I therefore compare the efficiency and scaling results against this
figure instead?
The nodes have 20 cores each, so this run spans 4 nodes.
Thanks,
np speedup
1 1.0
2 1.82
3 2.43
4 2.79
5 2.99
6 3.13
7 3.13
8 3.19
9 3.17
10 3.17
11 3.44
12 3.81
13 4.13
14 4.43
15 4.72
16 5.05
17 5.4
18 5.69
19 5.99
20 6.29
21 6.66
22 6.96
23 7.26
24 7.6
25 7.86
26 8.25
27 8.54
28 8.88
29 9.2
30 9.44
31 9.84
32 10.06
33 10.43
34 10.72
35 11.11
36 11.42
37 11.75
38 12.07
39 12.27
40 12.65
41 12.94
42 13.34
43 13.6
44 13.83
45 14.27
46 14.56
47 14.84
48 15.24
49 15.49
50 15.85
51 15.87
52 16.35
53 16.76
54 17.02
55 17.17
56 17.7
57 17.9
58 18.28
59 18.56
60 18.82
61 19.37
62 19.62
63 19.88
64 20.21
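To make that comparison concrete: if the solver is memory-bandwidth bound, the achievable ceiling at n processes is the streams speedup at n, not n itself, so efficiency can be measured against that ceiling. A minimal sketch of the arithmetic follows; the streams speedups are taken from the table above, but the solver wall times (t1, t16) are hypothetical placeholders for illustration only:

```python
# Compare solver speedup against the streams speedup ceiling
# instead of against ideal linear speedup.

# Streams speedups from the table above (selected process counts).
streams_speedup = {1: 1.0, 16: 5.05, 20: 6.29, 32: 10.06, 64: 20.21}

def efficiency(t1, tn, n):
    """Classic parallel efficiency: speedup (t1/tn) divided by process count."""
    return (t1 / tn) / n

def bandwidth_limited_efficiency(t1, tn, n):
    """Efficiency measured against the streams ceiling instead of n."""
    return (t1 / tn) / streams_speedup[n]

# Hypothetical solver wall times (seconds), purely for illustration:
# a 4x speedup on 16 processes.
t1, t16 = 10.0, 2.5
print(efficiency(t1, t16, 16))                    # 0.25 vs. ideal linear
print(bandwidth_limited_efficiency(t1, t16, 16))  # ~0.79 vs. streams ceiling
```

The point of the second number is that a solver reaching 79% of the bandwidth-limited ceiling is doing well even though naive efficiency says 25%.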
On Wed, May 2, 2018 at 1:24 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Wed, May 2, 2018 at 4:19 PM, Manuel Valera <mvalera-w at sdsu.edu> wrote:
>
>> Hello guys,
>>
>> We are working on a paper about the parallelization of our model using
>> PETSc, which is very exciting since it is the first time we see our
>> model scale, but so far I feel my results for the Laplacian solver could
>> be much better.
>>
>> For example, using CG/multigrid I get less than 20% efficiency beyond
>> 16 cores, and only 8% efficiency at 64 cores.
>>
>> I am defining efficiency as speedup divided by the number of cores, and
>> speedup as twall_1/twall_n, where n is the number of cores; I think
>> that's pretty standard.
>>
>
> This is the first big problem. Not all "cores" are created equal. First,
> you need to run streams in the exact same configuration, so that you can see
> how much speedup to expect. The program is here
>
> cd src/benchmarks/streams
>
> and
>
> make streams
>
> will run it. You will probably need to submit the program yourself to the
> batch system to get the same configuration as your solver.
>
> This really matters because 16 cores on one node probably only have the
> potential for a 5x speedup, so your 20% figure is misleading.
>
> Thanks,
>
> Matt
>
>
>> The ksp_view for a distributed solve looks like this:
>>
>> KSP Object: 16 MPI processes
>> type: cg
>> maximum iterations=10000, initial guess is zero
>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> left preconditioning
>> using PRECONDITIONED norm type for convergence test
>> PC Object: 16 MPI processes
>> type: hypre
>> HYPRE BoomerAMG preconditioning
>> Cycle type V
>> Maximum number of levels 25
>> Maximum number of iterations PER hypre call 1
>> Convergence tolerance PER hypre call 0.
>> Threshold for strong coupling 0.25
>> Interpolation truncation factor 0.
>> Interpolation: max elements per row 0
>> Number of levels of aggressive coarsening 0
>> Number of paths for aggressive coarsening 1
>> Maximum row sums 0.9
>> Sweeps down 1
>> Sweeps up 1
>> Sweeps on coarse 1
>> Relax down symmetric-SOR/Jacobi
>> Relax up symmetric-SOR/Jacobi
>> Relax on coarse Gaussian-elimination
>> Relax weight (all) 1.
>> Outer relax weight (all) 1.
>> Using CF-relaxation
>> Not using more complex smoothers.
>> Measure type local
>> Coarsen type Falgout
>> Interpolation type classical
>> Using nodal coarsening (with HYPRE_BOOMERAMGSetNodal() 1
>> HYPRE_BoomerAMGSetInterpVecVariant() 1
>> linear system matrix = precond matrix:
>> Mat Object: 16 MPI processes
>> type: mpiaij
>> rows=213120, cols=213120
>> total: nonzeros=3934732, allocated nonzeros=8098560
>> total number of mallocs used during MatSetValues calls =0
>> has attached near null space
>>
>>
>> And the log_view for the same case would be:
>>
>> ************************************************************
>> ************************************************************
>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
>> -fCourier9' to print this document ***
>> ************************************************************
>> ************************************************************
>>
>> ---------------------------------------------- PETSc Performance
>> Summary: ----------------------------------------------
>>
>> ./gcmSeamount on a timings named ocean with 16 processors, by valera Wed
>> May 2 13:18:21 2018
>> Using Petsc Development GIT revision: v3.9-163-gbe3efd4 GIT Date:
>> 2018-04-16 10:45:40 -0500
>>
>> Max Max/Min Avg Total
>> Time (sec): 1.355e+00 1.00004 1.355e+00
>> Objects: 4.140e+02 1.00000 4.140e+02
>> Flop: 7.582e+05 1.09916 7.397e+05 1.183e+07
>> Flop/sec: 5.594e+05 1.09918 5.458e+05 8.732e+06
>> MPI Messages: 1.588e+03 1.19167 1.468e+03 2.348e+04
>> MPI Message Lengths: 7.112e+07 1.37899 4.462e+04 1.048e+09
>> MPI Reductions: 4.760e+02 1.00000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>> e.g., VecAXPY() for real vectors of length N
>> --> 2N flop
>> and VecAXPY() for complex vectors of length N
>> --> 8N flop
>>
>> Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages
>> --- -- Message Lengths -- -- Reductions --
>> Avg %Total Avg %Total counts
>> %Total Avg %Total counts %Total
>> 0: Main Stage: 1.3553e+00 100.0% 1.1835e+07 100.0% 2.348e+04
>> 100.0% 4.462e+04 100.0% 4.670e+02 98.1%
>>
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> Phase summary info:
>> Count: number of times phase was executed
>> Time and Flop: Max - maximum over all processors
>> Ratio - ratio of maximum to minimum over all processors
>> Mess: number of messages sent
>> Avg. len: average message length (bytes)
>> Reduct: number of global reductions
>> Global: entire computation
>> Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>> %T - percent time in this phase %F - percent flop in this
>> phase
>> %M - percent messages in this phase %L - percent message
>> lengths in this phase
>> %R - percent reductions in this phase
>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>> over all processors)
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>> Event Count Time (sec) Flop
>> --- Global --- --- Stage --- Total
>> Max Ratio Max Ratio Max Ratio Mess Avg len
>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSidedF 2 1.0 9.1908e-03 2.2 0.00e+00 0.0 3.6e+02 1.6e+05
>> 0.0e+00 1 0 2 6 0 1 0 2 6 0 0
>> VecTDot 1 1.0 6.4135e-05 1.1 2.66e+04 1.0 0.0e+00 0.0e+00
>> 1.0e+00 0 4 0 0 0 0 4 0 0 0 6646
>> VecNorm 1 1.0 1.4589e-0347.1 2.66e+04 1.0 0.0e+00 0.0e+00
>> 1.0e+00 0 4 0 0 0 0 4 0 0 0 292
>> VecScale 14 1.0 3.6144e-04 1.3 4.80e+05 1.1 0.0e+00 0.0e+00
>> 0.0e+00 0 62 0 0 0 0 62 0 0 0 20346
>> VecCopy 7 1.0 1.0152e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecSet 83 1.0 3.0013e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> VecPointwiseMult 12 1.0 2.7585e-04 1.4 2.43e+05 1.2 0.0e+00 0.0e+00
>> 0.0e+00 0 31 0 0 0 0 31 0 0 0 13153
>> VecScatterBegin 111 1.0 2.5293e-02 1.8 0.00e+00 0.0 9.5e+03 3.4e+04
>> 1.9e+01 1 0 40 31 4 1 0 40 31 4 0
>> VecScatterEnd 92 1.0 4.8771e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
>> VecNormalize 1 1.0 2.6941e-05 2.3 1.33e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 2 0 0 0 0 2 0 0 0 7911
>> MatConvert 1 1.0 1.1009e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 4.0e+00 1 0 0 0 1 1 0 0 0 1 0
>> MatAssemblyBegin 3 1.0 2.8401e-02 1.0 0.00e+00 0.0 3.6e+02 1.6e+05
>> 0.0e+00 2 0 2 6 0 2 0 2 6 0 0
>> MatAssemblyEnd 3 1.0 2.9033e-02 1.0 0.00e+00 0.0 6.0e+01 1.2e+04
>> 2.0e+01 2 0 0 0 4 2 0 0 0 4 0
>> MatGetRowIJ 2 1.0 1.9073e-06 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatView 1 1.0 3.0398e-04 5.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> KSPSetUp 1 1.0 4.7994e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> KSPSolve 1 1.0 2.4850e-03 2.0 5.33e+04 1.0 0.0e+00 0.0e+00
>> 2.0e+00 0 7 0 0 0 0 7 0 0 0 343
>> PCSetUp 2 1.0 2.2953e-02 1.0 1.33e+04 1.0 0.0e+00 0.0e+00
>> 6.0e+00 2 2 0 0 1 2 2 0 0 1 9
>> PCApply 1 1.0 1.3151e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type Creations Destructions Memory Descendants'
>> Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>> Vector 172 170 70736264 0.
>> Matrix 5 5 7125104 0.
>> Matrix Null Space 1 1 608 0.
>> Distributed Mesh 18 16 84096 0.
>> Index Set 73 73 10022204 0.
>> IS L to G Mapping 18 16 1180828 0.
>> Star Forest Graph 36 32 27968 0.
>> Discrete System 18 16 15040 0.
>> Vec Scatter 67 64 38240520 0.
>> Krylov Solver 2 2 2504 0.
>> Preconditioner 2 2 2528 0.
>> Viewer 2 1 848 0.
>> ============================================================
>> ============================================================
>> Average time to get PetscTime(): 0.
>> Average time for MPI_Barrier(): 2.38419e-06
>> Average time for zero size MPI_Send(): 2.11596e-06
>> #PETSc Option Table entries:
>> -da_processors_z 1
>> -ksp_type cg
>> -ksp_view
>> -log_view
>> -pc_hypre_boomeramg_nodal_coarsen 1
>> -pc_hypre_boomeramg_vec_interp_variant 1
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>> Configure options: --known-level1-dcache-size=32768
>> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=8
>> --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2
>> --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8
>> --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8
>> --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=8
>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1
>> --known-mpi-c-double-complex=1 --known-has-attribute-aligned=1
>> PETSC_ARCH=timings --with-mpi-dir=/usr/lib64/openmpi
>> --with-blaslapack-dir=/usr/lib64 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>> FOPTFLAGS=-O3 --with-shared-libraries=1 --download-hypre
>> --with-debugging=no --with-batch -known-mpi-shared-libraries=0
>> --known-64-bit-blas-indices=0
>> -----------------------------------------
>> Libraries compiled on 2018-04-27 21:13:11 on ocean
>> Machine characteristics: Linux-3.10.0-327.36.3.el7.x86_
>> 64-x86_64-with-centos-7.2.1511-Core
>> Using PETSc directory: /home/valera/petsc
>> Using PETSc arch: timings
>> -----------------------------------------
>>
>> Using C compiler: /usr/lib64/openmpi/bin/mpicc -fPIC -Wall
>> -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
>> -fvisibility=hidden -O3
>> Using Fortran compiler: /usr/lib64/openmpi/bin/mpif90 -fPIC -Wall
>> -ffree-line-length-0 -Wno-unused-dummy-argument -O3
>> -----------------------------------------
>>
>> Using include paths: -I/home/valera/petsc/include
>> -I/home/valera/petsc/timings/include -I/usr/lib64/openmpi/include
>> -----------------------------------------
>>
>> Using C linker: /usr/lib64/openmpi/bin/mpicc
>> Using Fortran linker: /usr/lib64/openmpi/bin/mpif90
>> Using libraries: -Wl,-rpath,/home/valera/petsc/timings/lib
>> -L/home/valera/petsc/timings/lib -lpetsc -Wl,-rpath,/home/valera/petsc/timings/lib
>> -L/home/valera/petsc/timings/lib -Wl,-rpath,/usr/lib64 -L/usr/lib64
>> -Wl,-rpath,/usr/lib64/openmpi/lib -L/usr/lib64/openmpi/lib
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lHYPRE -llapack -lblas -lm
>> -lstdc++ -ldl -lmpi_usempi -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm
>> -lgcc_s -lquadmath -lpthread -lstdc++ -ldl
>>
>>
>>
>>
>>
>> What do you see wrong here? What options could I try to improve my
>> solver's scaling?
>>
>> Thanks so much,
>>
>> Manuel
>>
>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>