[petsc-users] Preconditioner for Helmholtz-like problem

Matthew Knepley knepley at gmail.com
Sat Oct 17 07:41:51 CDT 2020


On Sat, Oct 17, 2020 at 5:21 AM Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
wrote:

> Matt,
>
> Thank you for your reply!
> My system has 8 NUMA nodes, so the memory bandwidth can increase up to 8
> times when doing parallel computations. In other words, each node of the
> big computer cluster works as a small cluster consisting of 8 nodes. Of
> course, this works only if the contribution of communications between the
> NUMA nodes is small. The total amount of memory on a single cluster node is
> 128 GB, which is enough to fit my application.
>
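>
> (For reference, the NUMA layout on a node can be checked with standard
> Linux tools, e.g.
>   numactl --hardware
>   lscpu | grep -i numa
> and the MPI launcher's rank-binding options, which differ between MPI
> implementations, can be used to pin ranks to NUMA domains.)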

Barry is right, of course. We can see that the PETSc LU, using the natural
ordering, is doing 10,000x the flops of MUMPS. Using the same ordering,
MUMPS might still benefit from blocking, but the gap would be much smaller.
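
For example, you could rerun the PETSc LU with a fill-reducing ordering (a
sketch, with the option name from memory) and compare the factorization flops
in -log_view:

  -ksp_type preonly -pc_type lu -pc_factor_mat_ordering_type nd -log_view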

I misunderstood your description of the parallelism. Yes, using the 8 NUMA
domains you could see 8x on one node. I think Pierre is correct that
something size-related is happening, since the numeric factorization in the
parallel MUMPS case is running at 30x the flop rate of the serial case. It's
possible that MUMPS uses a different ordering in parallel that does more
flops but is more amenable to vectorization. It is hard to know without
reporting all the MUMPS options.
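
For example (a sketch, using the standard MUMPS ICNTL parameters as exposed
through PETSc), these options would make the runs report what MUMPS is doing:

  -ksp_view                     (prints the ICNTL/CNTL values passed to MUMPS)
  -mat_mumps_icntl_4 2          (raise the MUMPS print level)
  -mat_mumps_icntl_7 <n>        (sequential ordering choice)
  -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 <n>   (parallel analysis/ordering)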

  Thanks,

    Matt


> Below is the output of -log_view for three cases:
> (1) BUILT-IN PETSC LU SOLVER
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./caat on a arch-linux-c-opt named d24cepyc110.crc.nd.edu with 1
> processor, by akozlov Sat Oct 17 03:58:23 2020
> Using 0 OpenMP threads
> Using Petsc Release Version 3.13.6, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           5.551e+03     1.000   5.551e+03
> Objects:              1.000e+01     1.000   1.000e+01
> Flop:                 1.255e+13     1.000   1.255e+13  1.255e+13
> Flop/sec:             2.261e+09     1.000   2.261e+09  2.261e+09
> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flop
>                             and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count
> %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 5.5509e+03 100.0%  1.2551e+13 100.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop
>        --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatSolve               1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  6246
> MatLUFactorSym         1 1.0 1.0673e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum         1 1.0 5.5350e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00100100  0  0  0 100100  0  0  0  2267
> MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 1.0247e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.4306e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.2596e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 4 1.0 9.3985e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin       2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 7.3284e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  6245
> PCSetUp                1 1.0 5.5458e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00100100  0  0  0 100100  0  0  0  2262
> PCApply                1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  6246
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     2              2  11501999992     0.
>               Vector     2              2      3761520     0.
>        Krylov Solver     1              1         1408     0.
>       Preconditioner     1              1         1184     0.
>            Index Set     3              3      1412088     0.
>               Viewer     1              0            0     0.
>
> ========================================================================================================================
> Average time to get PetscTime(): 7.15256e-08
> #PETSc Option Table entries:
> -ksp_type preonly
> -log_view
> -pc_type lu
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
> Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
> --with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
> --with-scalar-type=complex --with-clanguage=c --with-openmp
> --with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
> -axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
> -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
> --download-mumps --download-scalapack --download-metis --download-cmake
> --download-parmetis --download-ptscotch
> -----------------------------------------
> Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
> Machine characteristics:
> Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
> Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
> Using PETSc arch: arch-linux-c-opt
> -----------------------------------------
>
> Using C compiler: mpicc  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2 -fopenmp
> Using Fortran compiler: mpif90  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2  -fopenmp
> -----------------------------------------
>
> Using include paths: -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/include -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
> -I/opt/crc/v/valgrind/3.14/ompi/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries: -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
> -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
> -Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64
> -Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
> -L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
> -L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -L/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
> -lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
> -lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
> -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
> -lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
> -lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
> -lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
> -----------------------------------------
>
>
> (2) EXTERNAL PACKAGE MUMPS, 1 MPI PROCESS
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./caat on a arch-linux-c-opt named d24cepyc068.crc.nd.edu with 1
> processor, by akozlov Sat Oct 17 01:55:20 2020
> Using 0 OpenMP threads
> Using Petsc Release Version 3.13.6, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           1.075e+02     1.000   1.075e+02
> Objects:              9.000e+00     1.000   9.000e+00
> Flop:                 1.959e+12     1.000   1.959e+12  1.959e+12
> Flop/sec:             1.823e+10     1.000   1.823e+10  1.823e+10
> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flop
>                             and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count
> %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 1.0747e+02 100.0%  1.9594e+12 100.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop
>        --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatSolve               1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
> 0.0e+00  0100  0  0  0   0100  0  0  0 6126201
> MatLUFactorSym         1 1.0 2.3141e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatLUFactorNum         1 1.0 1.0001e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 93  0  0  0  0  93  0  0  0  0    12
> MatAssemblyBegin       1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 1.0067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.8650e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.3029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCopy                1 1.0 1.0943e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 4 1.0 9.2626e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin       2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 3.1981e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
> 0.0e+00  0100  0  0  0   0100  0  0  0 6123146
> PCSetUp                1 1.0 1.0251e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 95  0  0  0  0  95  0  0  0  0    11
> PCApply                1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
> 0.0e+00  0100  0  0  0   0100  0  0  0 6126096
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     2              2     59441612     0.
>               Vector     2              2      3761520     0.
>        Krylov Solver     1              1         1408     0.
>       Preconditioner     1              1         1184     0.
>            Index Set     2              2       941392     0.
>               Viewer     1              0            0     0.
>
> ========================================================================================================================
> Average time to get PetscTime(): 4.76837e-08
> #PETSc Option Table entries:
> -ksp_type preonly
> -log_view
> -pc_factor_mat_solver_type mumps
> -pc_type lu
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
> Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
> --with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
> --with-scalar-type=complex --with-clanguage=c --with-openmp
> --with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
> -axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
> -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
> --download-mumps --download-scalapack --download-metis --download-cmake
> --download-parmetis --download-ptscotch
> -----------------------------------------
> Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
> Machine characteristics:
> Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
> Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
> Using PETSc arch: arch-linux-c-opt
> -----------------------------------------
>
> Using C compiler: mpicc  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2 -fopenmp
> Using Fortran compiler: mpif90  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2  -fopenmp
> -----------------------------------------
>
> Using include paths: -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/include -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
> -I/opt/crc/v/valgrind/3.14/ompi/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries: -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
> -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
> -Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64
> -Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
> -L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
> -L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -L/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
> -lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
> -lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
> -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
> -lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
> -lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
> -lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
> -----------------------------------------
>
>
> (3) EXTERNAL PACKAGE MUMPS, 48 MPI PROCESSES ON A SINGLE CLUSTER NODE
> WITH 8 NUMA NODES
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./caat on a arch-linux-c-opt named d24cepyc069.crc.nd.edu with 48
> processors, by akozlov Sat Oct 17 04:40:25 2020
> Using 0 OpenMP threads
> Using Petsc Release Version 3.13.6, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           1.415e+01     1.000   1.415e+01
> Objects:              3.000e+01     1.000   3.000e+01
> Flop:                 4.855e+10     1.637   4.084e+10  1.960e+12
> Flop/sec:             3.431e+09     1.637   2.886e+09  1.385e+11
> MPI Messages:         1.180e+02     2.682   8.169e+01  3.921e+03
> MPI Message Lengths:  1.559e+05     5.589   1.238e+03  4.855e+06
> MPI Reductions:       4.000e+01     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flop
>                             and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count
> %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 1.4150e+01 100.0%  1.9602e+12 100.0%  3.921e+03
> 100.0%  1.238e+03      100.0%  3.100e+01  77.5%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop
>        --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided          5 1.0 1.0707e-02 3.3 0.00e+00 0.0 7.8e+02 4.0e+00
> 5.0e+00  0  0 20  0 12   0  0 20  0 16     0
> BuildTwoSidedF         3 1.0 8.6837e-03 7.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  8   0  0  0  0 10     0
> MatSolve               1 1.0 6.6314e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
> 6.0e+00  0100 90 87 15   0100 90 87 19 29529617
> MatLUFactorSym         1 1.0 2.4322e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 4.0e+00 17  0  0  0 10  17  0  0  0 13     0
> MatLUFactorNum         1 1.0 5.8816e+00 1.0 5.08e+07 1.8 0.0e+00 0.0e+00
> 0.0e+00 42  0  0  0  0  42  0  0  0  0   332
> MatAssemblyBegin       1 1.0 7.3917e-0357.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  2   0  0  0  0  3     0
> MatAssemblyEnd         1 1.0 2.5823e-02 1.0 0.00e+00 0.0 3.8e+02 1.6e+03
> 5.0e+00  0  0 10 13 12   0  0 10 13 16     0
> MatGetRowIJ            1 1.0 3.5763e-06 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 9.2506e-05 3.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 4 1.0 5.3000e-0460.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin       2 1.0 2.2390e-0319.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  5   0  0  0  0  6     0
> VecAssemblyEnd         2 1.0 9.7752e-06 2.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin        2 1.0 1.6036e-0312.8 0.00e+00 0.0 5.9e+02 4.8e+03
> 1.0e+00  0  0 15 58  2   0  0 15 58  3     0
> VecScatterEnd          2 1.0 2.0087e-0338.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetGraph             2 1.0 1.5259e-05 5.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetUp                3 1.0 3.3023e-03 2.9 0.00e+00 0.0 1.6e+03 7.0e+02
> 2.0e+00  0  0 40 23  5   0  0 40 23  6     0
> SFBcastOpBegin         2 1.0 1.5953e-0313.7 0.00e+00 0.0 5.9e+02 4.8e+03
> 1.0e+00  0  0 15 58  2   0  0 15 58  3     0
> SFBcastOpEnd           2 1.0 2.0008e-0345.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFPack                 2 1.0 1.4646e-03361.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFUnpack               2 1.0 4.1723e-0529.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               1 1.0 3.0994e-06 3.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 6.6350e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
> 6.0e+00  0100 90 87 15   0100 90 87 19 29513594
> PCSetUp                1 1.0 8.4679e+00 1.0 5.08e+07 1.8 0.0e+00 0.0e+00
> 1.0e+01 60  0  0  0 25  60  0  0  0 32   230
> PCApply                1 1.0 6.6319e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
> 6.0e+00  0100 90 87 15   0100 90 87 19 29527282
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     4              4      1224428     0.
>          Vec Scatter     3              3         2400     0.
>               Vector     8              8      1923424     0.
>            Index Set     9              9        32392     0.
>    Star Forest Graph     3              3         3376     0.
>        Krylov Solver     1              1         1408     0.
>       Preconditioner     1              1         1160     0.
>               Viewer     1              0            0     0.
>
> ========================================================================================================================
> Average time to get PetscTime(): 7.15256e-08
> Average time for MPI_Barrier(): 3.48091e-06
> Average time for zero size MPI_Send(): 2.49843e-06
> #PETSc Option Table entries:
> -ksp_type preonly
> -log_view
> -pc_factor_mat_solver_type mumps
> -pc_type lu
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
> Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
> --with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
> --with-scalar-type=complex --with-clanguage=c --with-openmp
> --with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
> -axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
> -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
> --download-mumps --download-scalapack --download-metis --download-cmake
> --download-parmetis --download-ptscotch
> -----------------------------------------
> Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
> Machine characteristics:
> Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
> Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
> Using PETSc arch: arch-linux-c-opt
> -----------------------------------------
>
> Using C compiler: mpicc  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2 -fopenmp
> Using Fortran compiler: mpif90  -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
> -no-prec-div -fp-model fast=2  -fopenmp
> -----------------------------------------
>
> Using include paths: -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/include -I/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
> -I/opt/crc/v/valgrind/3.14/ompi/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries: -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
> -Wl,-rpath,/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
> crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
> -Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
> -Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64
> -Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
> -L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
> -L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -L/afs/
> crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
> -Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
> -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
> -lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
> -lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
> -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
> -lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
> -lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
> -lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
> -----------------------------------------
>
>
>
> On Sat, Oct 17, 2020 at 12:33 AM Matthew Knepley <knepley at gmail.com>
> wrote:
>
>> On Fri, Oct 16, 2020 at 11:48 PM Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
>> wrote:
>>
>>> Thank you for your advice! My sparse matrix seems to be very stiff, so I
>>> have decided to concentrate on the direct solvers. I have very good results
>>> with MUMPS. Due to a lack of time I haven’t got a good result with
>>> SuperLU_DIST and haven’t compiled PETSc with Pastix yet, but I have a
>>> feeling that MUMPS is the best. I have run a sequential test case with the
>>> built-in PETSc LU (-pc_type lu -ksp_type preonly) and MUMPS (-pc_type lu
>>> -ksp_type preonly -pc_factor_mat_solver_type mumps) with default settings
>>> and found that MUMPS was about 50 times faster than the built-in LU and
>>> used about 3 times less RAM. Do you have any idea why that could be?
>>>
>> The numbers do not sound realistic, but of course we do not have your
>> particular problem. In particular, the memory figure seems impossible.
>>
>>> My test case has about 100,000 complex equations with about 3,000,000
>>> non-zeros. PETSc was compiled with the following options: ./configure
>>> --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --enable-g
>>> --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
>>> --with-scalar-type=complex --with-clanguage=c --with-openmp
>>> --with-debugging=0 COPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2
>>> -no-prec-div -fp-model fast=2' FOPTFLAGS='-mkl=parallel -O2 -mavx
>>> -axCORE-AVX2 -no-prec-div -fp-model fast=2' CXXOPTFLAGS='-mkl=parallel -O2
>>> -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' --download-superlu_dist
>>> --download-mumps --download-scalapack --download-metis --download-cmake
>>> --download-parmetis --download-ptscotch.
>>>
>>> Running MUMPS in parallel using MPI also gave me a significant gain in
>>> performance (about 10 times on a single cluster node).
>>>
>> Again, this does not appear to make sense. The performance should be
>> limited by memory bandwidth, and a single cluster node will not usually have
>> 10x the bandwidth of a CPU, although it might be possible with a very old
>> CPU.
>>
>> It would help to understand the performance if you would send the output
>> of -log_view.
>>
>>   Thanks,
>>
>>     Matt
>>
>>> Could you, please, advise me whether I can adjust some options for the
>>> direct solvers to improve performance? Should I try MUMPS in OpenMP mode?
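>>>
>>> (My understanding, to be double-checked, is that the PETSc/MUMPS interface
>>> exposes the OpenMP mode roughly as in this sketch, with PETSc built
>>> --with-openmp:
>>>   export OMP_NUM_THREADS=<t>
>>>   -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads <t> )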
>>>
>>> On Sat, Sep 19, 2020 at 7:40 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> As Jed said, high frequency is hard. AMG, as-is, can be adapted (
>>>> https://link.springer.com/article/10.1007/s00466-006-0047-8) with
>>>> parameters.
>>>> AMG for convection: use richardson/sor rather than chebyshev smoothers, and
>>>> in smoothed aggregation (gamg) don't smooth (-pc_gamg_agg_nsmooths 0).
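>>>>
>>>> As a rough sketch (untested for this problem), that corresponds to
>>>> options like:
>>>>   -pc_type gamg -pc_gamg_agg_nsmooths 0
>>>>   -mg_levels_ksp_type richardson -mg_levels_pc_type sor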
>>>> Mark
>>>>
>>>> On Sat, Sep 19, 2020 at 2:11 AM Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
>>>> wrote:
>>>>
>>>>> Thanks a lot! I'll check them out.
>>>>>
>>>>> On Sat, Sep 19, 2020 at 1:41 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>>   These are small enough that likely sparse direct solvers are the
>>>>>> best use of your time and for general efficiency.
>>>>>>
>>>>>>   PETSc supports three parallel direct solvers: SuperLU_DIST, MUMPS, and
>>>>>> Pastix. I recommend configuring PETSc for all three of them and then
>>>>>> comparing them for problems of interest to you.
>>>>>>
>>>>>>    --download-superlu_dist --download-mumps --download-pastix
>>>>>> --download-scalapack (used by MUMPS) --download-metis --download-parmetis
>>>>>> --download-ptscotch
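>>>>>>
>>>>>>    At run time you can then pick one of them with, e.g. (a sketch):
>>>>>>
>>>>>>    -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps
>>>>>>    -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist
>>>>>>    -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type pastix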
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>
>>>>>> On Sep 18, 2020, at 11:28 PM, Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks for the tips! My matrix is complex and unsymmetric. My typical
>>>>>> test case has of the order of one million equations. I use a 2nd-order
>>>>>> finite-difference scheme with 19-point stencil, so my typical test case
>>>>>> uses several GB of RAM.
>>>>>>
>>>>>> On Fri, Sep 18, 2020 at 11:52 PM Jed Brown <jed at jedbrown.org> wrote:
>>>>>>
>>>>>>> Unfortunately, those are hard problems in which the "good" methods
>>>>>>> are technical and hard to make black-box.  There are "sweeping" methods
>>>>>>> that solve on 2D "slabs" with PML boundary conditions, H-matrix based
>>>>>>> methods, and fancy multigrid methods.  Attempting to solve with STRUMPACK
>>>>>>> is probably the easiest thing to try (--download-strumpack).
>>>>>>>
>>>>>>>
>>>>>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MATSOLVERSSTRUMPACK.html
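>>>>>>>
>>>>>>> (Run-time selection would look roughly like the following sketch, assuming
>>>>>>> PETSc was configured with --download-strumpack:
>>>>>>>   -pc_type lu -pc_factor_mat_solver_type strumpack )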
>>>>>>>
>>>>>>> Is the matrix complex symmetric?
>>>>>>>
>>>>>>> Note that you can use a direct solver (MUMPS, STRUMPACK, etc.) for a
>>>>>>> 3D problem like this if you have enough memory.  I'm assuming the memory or
>>>>>>> time is unacceptable and you want an iterative method with much lower setup
>>>>>>> costs.
>>>>>>>
>>>>>>> Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu> writes:
>>>>>>>
>>>>>>> > Dear all,
>>>>>>> >
>>>>>>> > I am solving a convected wave equation in a frequency domain. This
>>>>>>> equation
>>>>>>> > is a 3D Helmholtz equation with added first-order derivatives and
>>>>>>> mixed
>>>>>>> > derivatives, and with complex coefficients. The discretized PDE
>>>>>>> results in
>>>>>>> > a sparse linear system (about 10^6 equations) which is solved in
>>>>>>> PETSc. I
>>>>>>> > am having difficulty with the code convergence at high frequency,
>>>>>>> skewed
>>>>>>> > grid, and high Mach number. I suspect it may be due to the
>>>>>>> preconditioner I
>>>>>>> > use. I am currently using the ILU preconditioner with the number
>>>>>>> of fill
>>>>>>> > levels 2 or 3, and BCGS or GMRES solvers. I suspect the state of
>>>>>>> the art
>>>>>>> > has evolved and there are better preconditioners for Helmholtz-like
>>>>>>> > problems. Could you, please, advise me on a better preconditioner?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Alexey
>>>>>>> >
>>>>>>> > --
>>>>>>> > Alexey V. Kozlov
>>>>>>> >
>>>>>>> > Research Scientist
>>>>>>> > Department of Aerospace and Mechanical Engineering
>>>>>>> > University of Notre Dame
>>>>>>> >
>>>>>>> > 117 Hessert Center
>>>>>>> > Notre Dame, IN 46556-5684
>>>>>>> > Phone: (574) 631-4335
>>>>>>> > Fax: (574) 631-8355
>>>>>>> > Email: akozlov at nd.edu
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alexey V. Kozlov
>>>>>>
>>>>>> Research Scientist
>>>>>> Department of Aerospace and Mechanical Engineering
>>>>>> University of Notre Dame
>>>>>>
>>>>>> 117 Hessert Center
>>>>>> Notre Dame, IN 46556-5684
>>>>>> Phone: (574) 631-4335
>>>>>> Fax: (574) 631-8355
>>>>>> Email: akozlov at nd.edu
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Alexey V. Kozlov
>>>>>
>>>>> Research Scientist
>>>>> Department of Aerospace and Mechanical Engineering
>>>>> University of Notre Dame
>>>>>
>>>>> 117 Hessert Center
>>>>> Notre Dame, IN 46556-5684
>>>>> Phone: (574) 631-4335
>>>>> Fax: (574) 631-8355
>>>>> Email: akozlov at nd.edu
>>>>>
>>>>
>>>
>>> --
>>> Alexey V. Kozlov
>>>
>>> Research Scientist
>>> Department of Aerospace and Mechanical Engineering
>>> University of Notre Dame
>>>
>>> 117 Hessert Center
>>> Notre Dame, IN 46556-5684
>>> Phone: (574) 631-4335
>>> Fax: (574) 631-8355
>>> Email: akozlov at nd.edu
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>
>
> --
> Alexey V. Kozlov
>
> Research Scientist
> Department of Aerospace and Mechanical Engineering
> University of Notre Dame
>
> 117 Hessert Center
> Notre Dame, IN 46556-5684
> Phone: (574) 631-4335
> Fax: (574) 631-8355
> Email: akozlov at nd.edu
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/