[petsc-users] Preconditioner for Helmholtz-like problem
Alexey Kozlov
Alexey.V.Kozlov.2 at nd.edu
Sat Oct 17 04:20:32 CDT 2020
Matt,
Thank you for your reply!
My system has 8 NUMA nodes, so the aggregate memory bandwidth can scale up
to 8 times in parallel computations. In other words, each node of the big
computer cluster works as a small cluster consisting of 8 nodes. Of course,
this holds only if the cost of communication between the NUMA nodes is
small. The total amount of memory on a single cluster node is 128 GB, which
is enough for my application.
Below is the output of -log_view for three cases:
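The option sets for each case are listed in the #PETSc Option Table entries
inside the logs; the launch commands would look roughly like the following
sketch. The executable name ./caat is taken from the logs, while the mpirun
launcher is an assumption (the exact MVAPICH2 launch line is not shown):

```shell
# (1) Built-in PETSc LU, 1 process
./caat -ksp_type preonly -pc_type lu -log_view

# (2) MUMPS as the factorization backend, 1 MPI process
./caat -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view

# (3) MUMPS, 48 MPI processes on one cluster node (8 NUMA domains)
mpirun -n 48 ./caat -ksp_type preonly -pc_type lu \
    -pc_factor_mat_solver_type mumps -log_view
```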
(1) BUILT-IN PETSC LU SOLVER
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./caat on a arch-linux-c-opt named d24cepyc110.crc.nd.edu with 1 processor,
by akozlov Sat Oct 17 03:58:23 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown
Max Max/Min Avg Total
Time (sec): 5.551e+03 1.000 5.551e+03
Objects: 1.000e+01 1.000 1.000e+01
Flop: 1.255e+13 1.000 1.255e+13 1.255e+13
Flop/sec: 2.261e+09 1.000 2.261e+09 2.261e+09
MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flop
and VecAXPY() for complex vectors of length N
--> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total
Avg %Total Count %Total
0: Main Stage: 5.5509e+03 100.0% 1.2551e+13 100.0% 0.000e+00 0.0%
0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop
--- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatSolve 1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 6246
MatLUFactorSym 1 1.0 1.0673e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 1 1.0 5.5350e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00
0.0e+00100100 0 0 0 100100 0 0 0 2267
MatAssemblyBegin 1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 1.0247e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.4306e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.2596e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 4 1.0 9.3985e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyBegin 2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSetUp 1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 7.3284e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 6245
PCSetUp 1 1.0 5.5458e+03 1.0 1.25e+13 1.0 0.0e+00 0.0e+00
0.0e+00100100 0 0 0 100100 0 0 0 2262
PCApply 1 1.0 7.3267e-01 1.0 4.58e+09 1.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 6246
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 2 2 11501999992 0.
Vector 2 2 3761520 0.
Krylov Solver 1 1 1408 0.
Preconditioner 1 1 1184 0.
Index Set 3 3 1412088 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
--with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
--with-scalar-type=complex --with-clanguage=c --with-openmp
--with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
-axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
-mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
--download-mumps --download-scalapack --download-metis --download-cmake
--download-parmetis --download-ptscotch
-----------------------------------------
Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
Machine characteristics:
Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
Using PETSc arch: arch-linux-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
Using Fortran compiler: mpif90 -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
-----------------------------------------
Using include paths: -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/include
-I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
-I/opt/crc/v/valgrind/3.14/ompi/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
-Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
-Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
-L/opt/crc/i/intel/19.0/mkl/lib/intel64
-Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
-L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
-L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-L/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
-L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
-lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
-lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
-lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
-lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
-lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
-----------------------------------------
(2) EXTERNAL PACKAGE MUMPS, 1 MPI PROCESS
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./caat on a arch-linux-c-opt named d24cepyc068.crc.nd.edu with 1 processor,
by akozlov Sat Oct 17 01:55:20 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown
Max Max/Min Avg Total
Time (sec): 1.075e+02 1.000 1.075e+02
Objects: 9.000e+00 1.000 9.000e+00
Flop: 1.959e+12 1.000 1.959e+12 1.959e+12
Flop/sec: 1.823e+10 1.000 1.823e+10 1.823e+10
MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flop
and VecAXPY() for complex vectors of length N
--> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total
Avg %Total Count %Total
0: Main Stage: 1.0747e+02 100.0% 1.9594e+12 100.0% 0.000e+00 0.0%
0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop
--- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatSolve 1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
0.0e+00 0100 0 0 0 0100 0 0 0 6126201
MatLUFactorSym 1 1.0 2.3141e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatLUFactorNum 1 1.0 1.0001e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00
0.0e+00 93 0 0 0 0 93 0 0 0 0 12
MatAssemblyBegin 1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 1.0067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.8650e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.3029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecCopy 1 1.0 1.0943e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 4 1.0 9.2626e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyBegin 2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.7684e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSetUp 1 1.0 1.6689e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 3.1981e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
0.0e+00 0100 0 0 0 0100 0 0 0 6123146
PCSetUp 1 1.0 1.0251e+02 1.0 1.16e+09 1.0 0.0e+00 0.0e+00
0.0e+00 95 0 0 0 0 95 0 0 0 0 11
PCApply 1 1.0 3.1965e-01 1.0 1.96e+12 1.0 0.0e+00 0.0e+00
0.0e+00 0100 0 0 0 0100 0 0 0 6126096
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 2 2 59441612 0.
Vector 2 2 3761520 0.
Krylov Solver 1 1 1408 0.
Preconditioner 1 1 1184 0.
Index Set 2 2 941392 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 4.76837e-08
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
--with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
--with-scalar-type=complex --with-clanguage=c --with-openmp
--with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
-axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
-mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
--download-mumps --download-scalapack --download-metis --download-cmake
--download-parmetis --download-ptscotch
-----------------------------------------
Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
Machine characteristics:
Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
Using PETSc arch: arch-linux-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
Using Fortran compiler: mpif90 -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
-----------------------------------------
Using include paths: -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/include
-I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
-I/opt/crc/v/valgrind/3.14/ompi/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
-Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
-Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
-L/opt/crc/i/intel/19.0/mkl/lib/intel64
-Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
-L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
-L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-L/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
-L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
-lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
-lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
-lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
-lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
-lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
-----------------------------------------
(3) EXTERNAL PACKAGE MUMPS, 48 MPI PROCESSES ON A SINGLE CLUSTER NODE WITH
8 NUMA NODES
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./caat on a arch-linux-c-opt named d24cepyc069.crc.nd.edu with 48
processors, by akozlov Sat Oct 17 04:40:25 2020
Using 0 OpenMP threads
Using Petsc Release Version 3.13.6, unknown
Max Max/Min Avg Total
Time (sec): 1.415e+01 1.000 1.415e+01
Objects: 3.000e+01 1.000 3.000e+01
Flop: 4.855e+10 1.637 4.084e+10 1.960e+12
Flop/sec: 3.431e+09 1.637 2.886e+09 1.385e+11
MPI Messages: 1.180e+02 2.682 8.169e+01 3.921e+03
MPI Message Lengths: 1.559e+05 5.589 1.238e+03 4.855e+06
MPI Reductions: 4.000e+01 1.000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flop
and VecAXPY() for complex vectors of length N
--> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total
Avg %Total Count %Total
0: Main Stage: 1.4150e+01 100.0% 1.9602e+12 100.0% 3.921e+03 100.0%
1.238e+03 100.0% 3.100e+01 77.5%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop
--- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSided 5 1.0 1.0707e-02 3.3 0.00e+00 0.0 7.8e+02 4.0e+00
5.0e+00 0 0 20 0 12 0 0 20 0 16 0
BuildTwoSidedF 3 1.0 8.6837e-03 7.8 0.00e+00 0.0 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 8 0 0 0 0 10 0
MatSolve 1 1.0 6.6314e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
6.0e+00 0100 90 87 15 0100 90 87 19 29529617
MatLUFactorSym 1 1.0 2.4322e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
4.0e+00 17 0 0 0 10 17 0 0 0 13 0
MatLUFactorNum 1 1.0 5.8816e+00 1.0 5.08e+07 1.8 0.0e+00 0.0e+00
0.0e+00 42 0 0 0 0 42 0 0 0 0 332
MatAssemblyBegin 1 1.0 7.3917e-0357.6 0.00e+00 0.0 0.0e+00 0.0e+00
1.0e+00 0 0 0 0 2 0 0 0 0 3 0
MatAssemblyEnd 1 1.0 2.5823e-02 1.0 0.00e+00 0.0 3.8e+02 1.6e+03
5.0e+00 0 0 10 13 12 0 0 10 13 16 0
MatGetRowIJ 1 1.0 3.5763e-06 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 9.2506e-05 3.4 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 4 1.0 5.3000e-0460.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyBegin 2 1.0 2.2390e-0319.1 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 5 0 0 0 0 6 0
VecAssemblyEnd 2 1.0 9.7752e-06 2.4 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 2 1.0 1.6036e-0312.8 0.00e+00 0.0 5.9e+02 4.8e+03
1.0e+00 0 0 15 58 2 0 0 15 58 3 0
VecScatterEnd 2 1.0 2.0087e-0338.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetGraph 2 1.0 1.5259e-05 5.8 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetUp 3 1.0 3.3023e-03 2.9 0.00e+00 0.0 1.6e+03 7.0e+02
2.0e+00 0 0 40 23 5 0 0 40 23 6 0
SFBcastOpBegin 2 1.0 1.5953e-0313.7 0.00e+00 0.0 5.9e+02 4.8e+03
1.0e+00 0 0 15 58 2 0 0 15 58 3 0
SFBcastOpEnd 2 1.0 2.0008e-0345.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFPack 2 1.0 1.4646e-03361.4 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFUnpack 2 1.0 4.1723e-0529.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSetUp 1 1.0 3.0994e-06 3.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 6.6350e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
6.0e+00 0100 90 87 15 0100 90 87 19 29513594
PCSetUp 1 1.0 8.4679e+00 1.0 5.08e+07 1.8 0.0e+00 0.0e+00
1.0e+01 60 0 0 0 25 60 0 0 0 32 230
PCApply 1 1.0 6.6319e-02 1.0 4.85e+10 1.6 3.5e+03 1.2e+03
6.0e+00 0100 90 87 15 0100 90 87 19 29527282
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 4 4 1224428 0.
Vec Scatter 3 3 2400 0.
Vector 8 8 1923424 0.
Index Set 9 9 32392 0.
Star Forest Graph 3 3 3376 0.
Krylov Solver 1 1 1408 0.
Preconditioner 1 1 1160 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 7.15256e-08
Average time for MPI_Barrier(): 3.48091e-06
Average time for zero size MPI_Send(): 2.49843e-06
#PETSc Option Table entries:
-ksp_type preonly
-log_view
-pc_factor_mat_solver_type mumps
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl
--with-g=1 --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
--with-scalar-type=complex --with-clanguage=c --with-openmp
--with-debugging=0 COPTFLAGS="-mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2" FOPTFLAGS="-mkl=parallel -O2 -mavx
-axCORE-AVX2 -no-prec-div -fp-model fast=2" CXXOPTFLAGS="-mkl=parallel -O2
-mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2" --download-superlu_dist
--download-mumps --download-scalapack --download-metis --download-cmake
--download-parmetis --download-ptscotch
-----------------------------------------
Libraries compiled on 2020-10-14 10:52:17 on epycfe.crc.nd.edu
Machine characteristics:
Linux-3.10.0-1160.2.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Using PETSc directory: /afs/crc.nd.edu/user/a/akozlov/Private/petsc
Using PETSc arch: arch-linux-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
Using Fortran compiler: mpif90 -fPIC -mkl=parallel -O2 -mavx -axCORE-AVX2
-no-prec-div -fp-model fast=2 -fopenmp
-----------------------------------------
Using include paths: -I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/include
-I/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/include
-I/opt/crc/v/valgrind/3.14/ompi/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -L/afs/
crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib -lpetsc
-Wl,-rpath,/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-L/afs/crc.nd.edu/user/a/akozlov/Private/petsc/arch-linux-c-opt/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl -L/opt/crc/i/intel/19.0/mkl
-Wl,-rpath,/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-L/opt/crc/m/mvapich2/2.3.1/intel/19.0/lib
-Wl,-rpath,/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-L/opt/crc/i/intel/19.0/tbb/lib/intel64_lin/gcc4.7
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64
-L/opt/crc/i/intel/19.0/mkl/lib/intel64
-Wl,-rpath,/opt/crc/i/intel/19.0/lib/intel64
-L/opt/crc/i/intel/19.0/lib/intel64 -Wl,-rpath,/opt/crc/i/intel/19.0/lib64
-L/opt/crc/i/intel/19.0/lib64 -Wl,-rpath,/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-L/afs/
crc.nd.edu/x86_64_linux/i/intel/19.0/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin
-Wl,-rpath,/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-L/opt/crc/i/intel/19.0/mkl/lib/intel64_lin
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5
-L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lcmumps -ldmumps -lsmumps
-lzmumps -lmumps_common -lpord -lscalapack -lsuperlu_dist -lmkl_intel_lp64
-lmkl_core -lmkl_intel_thread -lpthread -lptesmumps -lptscotchparmetis
-lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lX11 -lparmetis
-lmetis -lstdc++ -ldl -lmpifort -lmpi -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -liomp5 -lifport -lifcoremt_pic -limf -lsvml -lm -lipgo -lirc
-lpthread -lgcc_s -lirc_s -lrt -lquadmath -lstdc++ -ldl
-----------------------------------------
On Sat, Oct 17, 2020 at 12:33 AM Matthew Knepley <knepley at gmail.com> wrote:
> On Fri, Oct 16, 2020 at 11:48 PM Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
> wrote:
>
>> Thank you for your advice! My sparse system seems to be very
>> ill-conditioned, so I have decided to concentrate on direct solvers. I
>> have very good results with MUMPS. For lack of time I have not yet
>> obtained a good result with SuperLU_DIST and have not compiled PETSc with
>> PaStiX, but my impression is that MUMPS is the best. I ran a sequential
>> test case with the built-in PETSc LU (-pc_type lu -ksp_type preonly) and
>> MUMPS (-pc_type lu -ksp_type preonly -pc_factor_mat_solver_type mumps)
>> with default settings and found that MUMPS was about 50 times faster than
>> the built-in LU and used about 3 times less RAM. Do you have any idea why
>> that might be?
>>
> The numbers do not sound realistic, but of course we do not have your
> particular problem. In particular, the memory figure seems impossible.
>
>> My test case has about 100,000 complex equations with about 3,000,000
>> non-zeros. PETSc was compiled with the following options: ./configure
>> --with-blaslapack-dir=/opt/crc/i/intel/19.0/mkl --enable-g
>> --with-valgrind-dir=/opt/crc/v/valgrind/3.14/ompi
>> --with-scalar-type=complex --with-clanguage=c --with-openmp
>> --with-debugging=0 COPTFLAGS='-mkl=parallel -O2 -mavx -axCORE-AVX2
>> -no-prec-div -fp-model fast=2' FOPTFLAGS='-mkl=parallel -O2 -mavx
>> -axCORE-AVX2 -no-prec-div -fp-model fast=2' CXXOPTFLAGS='-mkl=parallel -O2
>> -mavx -axCORE-AVX2 -no-prec-div -fp-model fast=2' --download-superlu_dist
>> --download-mumps --download-scalapack --download-metis --download-cmake
>> --download-parmetis --download-ptscotch.
>>
>> Running MUMPS in parallel using MPI also gave me a significant gain in
>> performance (about 10 times on a single cluster node).
>>
> Again, this does not appear to make sense. The performance should be
> limited by memory bandwidth, and a single cluster node will not usually have
> 10x the bandwidth of a CPU, although it might be possible with a very old
> CPU.
>
> It would help to understand the performance if you would send the output
> of -log_view.
>
> Thanks,
>
> Matt
>
>> Could you, please, advise me whether I can adjust some options for the
>> direct solvers to improve performance? Should I try MUMPS in OpenMP mode?
>>
>> On Sat, Sep 19, 2020 at 7:40 AM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> As Jed said, high frequency is hard. AMG, as-is, can be adapted (
>>> https://link.springer.com/article/10.1007/s00466-006-0047-8) with
>>> parameters.
>>> AMG for convection: use richardson/sor rather than chebyshev smoothers,
>>> and in smoothed aggregation (gamg) don't smooth (-pc_gamg_agg_nsmooths 0).
>>> Mark
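Collecting Mark's suggestions into a single option set, as a sketch (these
are standard PETSc GAMG/multigrid options; suitability for this problem is
exactly what he says must be tuned):

```shell
# Smoothed-aggregation AMG with unsmoothed aggregation and
# Richardson/SOR level smoothers instead of Chebyshev
-pc_type gamg -pc_gamg_agg_nsmooths 0 \
-mg_levels_ksp_type richardson -mg_levels_pc_type sor
```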
>>>
>>> On Sat, Sep 19, 2020 at 2:11 AM Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
>>> wrote:
>>>
>>>> Thanks a lot! I'll check them out.
>>>>
>>>> On Sat, Sep 19, 2020 at 1:41 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Problems of this size are small enough that sparse direct solvers
>>>>> are likely the best use of your time and the most efficient choice.
>>>>>
>>>>> PETSc supports 3 parallel direct solvers: SuperLU_DIST, MUMPS, and
>>>>> PaStiX. I recommend configuring PETSc for all three of them and then
>>>>> comparing them on problems of interest to you.
>>>>>
>>>>> --download-superlu_dist --download-mumps --download-pastix
>>>>> --download-scalapack (used by MUMPS) --download-metis --download-parmetis
>>>>> --download-ptscotch
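Once PETSc is configured with those packages, the three solvers Barry lists
are selected at run time through the factorization backend option, roughly:

```shell
# Direct solve: pick the sparse direct backend at run time
-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps
# or: -pc_factor_mat_solver_type superlu_dist
# or: -pc_factor_mat_solver_type pastix
```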
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> On Sep 18, 2020, at 11:28 PM, Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu>
>>>>> wrote:
>>>>>
>>>>> Thanks for the tips! My matrix is complex and unsymmetric. My typical
>>>>> test case has on the order of one million equations. I use a 2nd-order
>>>>> finite-difference scheme with a 19-point stencil, so my typical test
>>>>> case uses several GB of RAM.
>>>>>
>>>>> On Fri, Sep 18, 2020 at 11:52 PM Jed Brown <jed at jedbrown.org> wrote:
>>>>>
>>>>>> Unfortunately, those are hard problems in which the "good" methods
>>>>>> are technical and hard to make black-box. There are "sweeping" methods
>>>>>> that solve on 2D "slabs" with PML boundary conditions, H-matrix based
>>>>>> methods, and fancy multigrid methods. Attempting to solve with STRUMPACK
>>>>>> is probably the easiest thing to try (--download-strumpack).
>>>>>>
>>>>>>
>>>>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MATSOLVERSSTRUMPACK.html
>>>>>>
>>>>>> Is the matrix complex symmetric?
>>>>>>
>>>>>> Note that you can use a direct solver (MUMPS, STRUMPACK, etc.) for a
>>>>>> 3D problem like this if you have enough memory. I'm assuming the memory or
>>>>>> time is unacceptable and you want an iterative method with much lower setup
>>>>>> costs.
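Per the MATSOLVERSTRUMPACK page linked above, after configuring PETSc with
--download-strumpack the solver is selected like any other factorization
backend, e.g. (a sketch):

```shell
# Use STRUMPACK as the LU factorization backend
-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type strumpack
```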
>>>>>>
>>>>>> Alexey Kozlov <Alexey.V.Kozlov.2 at nd.edu> writes:
>>>>>>
>>>>>> > Dear all,
>>>>>> >
>>>>>> > I am solving a convected wave equation in a frequency domain. This
>>>>>> equation
>>>>>> > is a 3D Helmholtz equation with added first-order derivatives and
>>>>>> mixed
>>>>>> > derivatives, and with complex coefficients. The discretized PDE
>>>>>> results in
>>>>>> > a sparse linear system (about 10^6 equations) which is solved in
>>>>>> PETSc. I
>>>>>> > am having difficulty with the code convergence at high frequency,
>>>>>> skewed
>>>>>> > grid, and high Mach number. I suspect it may be due to the
>>>>>> preconditioner I
>>>>>> > use. I am currently using the ILU preconditioner with the number of
>>>>>> fill
>>>>>> > levels 2 or 3, and BCGS or GMRES solvers. I suspect the state of
>>>>>> the art
>>>>>> > has evolved and there are better preconditioners for Helmholtz-like
>>>>>> > problems. Could you, please, advise me on a better preconditioner?
>>>>>> >
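For reference, the baseline setup described above (ILU with fill level 2 or
3, BCGS or GMRES) corresponds to run-time options roughly like the
following; -pc_factor_levels sets the ILU fill level:

```shell
# Baseline sketch: ILU(2) preconditioner with BiCGStab
-ksp_type bcgs -pc_type ilu -pc_factor_levels 2
# or GMRES with fill level 3:
# -ksp_type gmres -pc_type ilu -pc_factor_levels 3
```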
>>>>>> > Thanks,
>>>>>> > Alexey
>>>>>> >
>>>>>> > --
>>>>>> > Alexey V. Kozlov
>>>>>> >
>>>>>> > Research Scientist
>>>>>> > Department of Aerospace and Mechanical Engineering
>>>>>> > University of Notre Dame
>>>>>> >
>>>>>> > 117 Hessert Center
>>>>>> > Notre Dame, IN 46556-5684
>>>>>> > Phone: (574) 631-4335
>>>>>> > Fax: (574) 631-8355
>>>>>> > Email: akozlov at nd.edu
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
--
Alexey V. Kozlov
Research Scientist
Department of Aerospace and Mechanical Engineering
University of Notre Dame
117 Hessert Center
Notre Dame, IN 46556-5684
Phone: (574) 631-4335
Fax: (574) 631-8355
Email: akozlov at nd.edu