[petsc-users] Help with fieldsplit performance
Edoardo alinovi
edoardo.alinovi at gmail.com
Sun Feb 5 12:26:17 CST 2023
Hello PETSc crew,
I would like to ask for some support in setting up the fieldsplit
preconditioner to obtain better performance. I have already found
some posts on the topic and have kept experimenting, but I would like to hear
your opinion as experts :)
I have my fancy pressure-based coupled CFD solver already validated on
some basic problems, so I am confident the matrix is OK. However, I am
struggling a bit to get good performance. In my experiments I have found
that *Schur* gives the lowest overall iteration count, but it takes ages
in wall-clock time! Using additive or multiplicative looks like a better
call, but in some cases I need a very high number of iterations to converge
(500+).
I attach the logs (ksp_view and log_view) for an example case: the flow past
a 90-degree T-junction, 285k cells on 4 procs.
GMRES + fieldsplit with Schur takes 90 s to converge in 4 outer iterations. Do
you see anything strange in the way the KSP is set up?
Thank you for the support as always!
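
For reference, below is a minimal sketch, in C rather than the Fortran of the
actual solver, of a fieldsplit setup consistent with the attached ksp_view.
The routine name and calling context are illustrative; only the "UPeqn_"
prefix, the block size of 3, the split names "u"/"p" with fields 0,1 and 2,
and the Schur settings are taken from the logs.

/* Sketch of a fieldsplit setup matching the attached ksp_view.
 * The routine name and calling context are illustrative. */
#include <petscksp.h>

PetscErrorCode SetupUPeqnSolver(KSP ksp)
{
  PC       pc;
  PetscInt ufields[2] = {0, 1};  /* velocity components -> split "u" */
  PetscInt pfields[1] = {2};     /* pressure            -> split "p" */

  PetscFunctionBeginUser;
  PetscCall(KSPSetOptionsPrefix(ksp, "UPeqn_"));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCFIELDSPLIT));
  PetscCall(PCFieldSplitSetBlockSize(pc, 3));                   /* u, v, p per cell */
  PetscCall(PCFieldSplitSetFields(pc, "u", 2, ufields, ufields));
  PetscCall(PCFieldSplitSetFields(pc, "p", 1, pfields, pfields));
  PetscCall(PCFieldSplitSetType(pc, PC_COMPOSITE_SCHUR));
  PetscCall(PCFieldSplitSetSchurFactType(pc, PC_FIELDSPLIT_SCHUR_FACT_FULL));
  PetscCall(PCFieldSplitSetSchurPre(pc, PC_FIELDSPLIT_SCHUR_PRE_A11, NULL));
  PetscCall(KSPSetFromOptions(ksp));  /* picks up -UPeqn_pc_fieldsplit_type,
                                         -UPeqn_fieldsplit_u_pc_type, ... */
  PetscFunctionReturn(0);
}

With this in place, -UPeqn_pc_fieldsplit_type schur / multiplicative / additive
switches the variant at runtime, and -UPeqn_fieldsplit_u_pc_type /
-UPeqn_fieldsplit_p_pc_type select the per-block preconditioners, as in the
option table at the end of the log_view.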
-------------- next part --------------
KSP Object: (UPeqn_) 4 MPI processes
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10000, nonzero initial guess
tolerances: relative=0., absolute=1e-06, divergence=10000.
right preconditioning
using UNPRECONDITIONED norm type for convergence test
PC Object: (UPeqn_) 4 MPI processes
type: fieldsplit
FieldSplit with Schur preconditioner, blocksize = 3, factorization FULL
Preconditioner for the Schur complement formed from A11
Split info:
Split number 0 Fields 0, 1
Split number 1 Fields 2
KSP solver for A00 block
KSP Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
PC Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: bjacobi
number of blocks = 4
Local solver information for first block is in the following KSP and PC objects on rank 0:
Use -UPeqn_fieldsplit_u_ksp_view ::ascii_info_detail to display information for all blocks
KSP Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: ilu
out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: seqaij
rows=48586, cols=48586, bs=2
package used to perform factorization: petsc
total: nonzeros=483068, allocated nonzeros=483068
using I-node routines: found 24293 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: seqaij
rows=48586, cols=48586, bs=2
total: nonzeros=483068, allocated nonzeros=483068
total number of mallocs used during MatSetValues calls=0
using I-node routines: found 24293 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: mpiaij
rows=190000, cols=190000, bs=2
total: nonzeros=1891600, allocated nonzeros=1891600
total number of mallocs used during MatSetValues calls=0
using I-node (on process 0) routines: found 24293 nodes, limit used is 5
KSP solver for S = A11 - A10 inv(A00) A01
KSP Object: (UPeqn_fieldsplit_p_) 4 MPI processes
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
PC Object: (UPeqn_fieldsplit_p_) 4 MPI processes
type: hypre
HYPRE BoomerAMG preconditioning
Cycle type V
Maximum number of levels 25
Maximum number of iterations PER hypre call 1
Convergence tolerance PER hypre call 0.
Threshold for strong coupling 0.25
Interpolation truncation factor 0.
Interpolation: max elements per row 0
Number of levels of aggressive coarsening 0
Number of paths for aggressive coarsening 1
Maximum row sums 0.9
Sweeps down 1
Sweeps up 1
Sweeps on coarse 1
Relax down symmetric-SOR/Jacobi
Relax up symmetric-SOR/Jacobi
Relax on coarse Gaussian-elimination
Relax weight (all) 1.
Outer relax weight (all) 1.
Using CF-relaxation
Not using more complex smoothers.
Measure type local
Coarsen type Falgout
Interpolation type classical
SpGEMM type cusparse
linear system matrix followed by preconditioner matrix:
Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
type: schurcomplement
rows=95000, cols=95000
Schur complement A11 - A10 inv(A00) A01
A11
Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
type: mpiaij
rows=95000, cols=95000
total: nonzeros=472900, allocated nonzeros=472900
total number of mallocs used during MatSetValues calls=0
not using I-node (on process 0) routines
A10
Mat Object: 4 MPI processes
type: mpiaij
rows=95000, cols=190000
total: nonzeros=945800, allocated nonzeros=945800
total number of mallocs used during MatSetValues calls=0
not using I-node (on process 0) routines
KSP of A00
KSP Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
PC Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: bjacobi
number of blocks = 4
Local solver information for first block is in the following KSP and PC objects on rank 0:
Use -UPeqn_fieldsplit_u_ksp_view ::ascii_info_detail to display information for all blocks
KSP Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: ilu
out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: seqaij
rows=48586, cols=48586, bs=2
package used to perform factorization: petsc
total: nonzeros=483068, allocated nonzeros=483068
using I-node routines: found 24293 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
type: seqaij
rows=48586, cols=48586, bs=2
total: nonzeros=483068, allocated nonzeros=483068
total number of mallocs used during MatSetValues calls=0
using I-node routines: found 24293 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: (UPeqn_fieldsplit_u_) 4 MPI processes
type: mpiaij
rows=190000, cols=190000, bs=2
total: nonzeros=1891600, allocated nonzeros=1891600
total number of mallocs used during MatSetValues calls=0
using I-node (on process 0) routines: found 24293 nodes, limit used is 5
A01
Mat Object: 4 MPI processes
type: mpiaij
rows=190000, cols=95000, rbs=2, cbs=1
total: nonzeros=945800, allocated nonzeros=945800
total number of mallocs used during MatSetValues calls=0
using I-node (on process 0) routines: found 24293 nodes, limit used is 5
Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
type: mpiaij
rows=95000, cols=95000
total: nonzeros=472900, allocated nonzeros=472900
total number of mallocs used during MatSetValues calls=0
not using I-node (on process 0) routines
linear system matrix = precond matrix:
Mat Object: 4 MPI processes
type: mpiaij
rows=285000, cols=285000, bs=3
total: nonzeros=4256100, allocated nonzeros=4256100
total number of mallocs used during MatSetValues calls=0
using I-node (on process 0) routines: found 24293 nodes, limit used is 5
-------------- next part --------------
flubio_coupled on a linux_x86_64 named localhost.localdomain with 4 processors, by edo Sun Feb 5 19:21:59 2023
Using Petsc Release Version 3.18.3, unknown
Max Max/Min Avg Total
Time (sec): 9.234e+01 1.000 9.234e+01
Objects: 2.190e+02 1.000 2.190e+02
Flops: 8.885e+10 1.083 8.686e+10 3.475e+11
Flops/sec: 9.623e+08 1.083 9.407e+08 3.763e+09
MPI Msg Count: 5.765e+04 2.999 3.844e+04 1.537e+05
MPI Msg Len (bytes): 9.189e+07 2.972 1.548e+03 2.380e+08
MPI Reductions: 3.723e+04 1.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 9.2336e+01 100.0% 3.4746e+11 100.0% 1.537e+05 100.0% 1.548e+03 100.0% 3.721e+04 99.9%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSided 13 1.0 1.6536e-01 2.7 0.00e+00 0.0 4.0e+01 4.0e+00 1.3e+01 0 0 0 0 0 0 0 0 0 0 0
BuildTwoSidedF 6 1.0 1.4933e-01 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 906 1.0 8.5608e+01 1.0 8.60e+10 1.1 1.5e+05 1.5e+03 3.5e+04 93 97 100 99 95 93 97 100 99 95 3928
MatMultAdd 254 1.0 8.2637e-02 1.1 6.15e+07 1.1 2.0e+03 7.7e+02 1.0e+00 0 0 1 1 0 0 0 1 1 0 2907
MatSolve 18680 1.0 2.7633e+01 1.1 1.71e+10 1.1 0.0e+00 0.0e+00 0.0e+00 29 19 0 0 0 29 19 0 0 0 2424
MatLUFactorNum 1 1.0 7.5099e-03 1.6 1.97e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1026
MatILUFactorSym 1 1.0 6.3318e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatConvert 1 1.0 2.8405e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 19 1.0 9.5880e-02 1154.5 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 19 1.0 9.6739e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+01 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 3 1.0 1.4920e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCreateSubMat 4 1.0 2.0991e-01 1.1 0.00e+00 0.0 6.4e+01 2.3e+03 4.4e+01 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.2688e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 1 1.0 4.9500e-03 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSetUp 4 1.0 2.9473e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 9.0082e+01 1.0 8.88e+10 1.1 1.5e+05 1.5e+03 3.7e+04 98 100 100 99 100 98 100 100 99 100 3857
KSPGMRESOrthog 18157 1.0 3.4719e+01 1.1 4.99e+10 1.1 0.0e+00 0.0e+00 1.8e+04 37 56 0 0 49 37 56 0 0 49 5616
PCSetUp 4 1.0 3.7452e-01 1.0 1.97e+06 1.1 8.8e+01 2.8e+04 7.6e+01 0 0 0 1 0 0 0 0 1 0 21
PCSetUpOnBlocks 264 1.0 1.5411e-02 1.3 1.97e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 500
PCApply 5 1.0 9.0060e+01 1.0 8.88e+10 1.1 1.5e+05 1.5e+03 3.7e+04 98 100 100 99 100 98 100 100 99 100 3857
KSPSolve_FS_0 5 1.0 1.6113e+00 1.0 1.68e+09 1.1 2.8e+03 1.6e+03 7.0e+02 2 2 2 2 2 2 2 2 2 2 4077
KSPSolve_FS_Schu 5 1.0 8.7013e+01 1.0 8.58e+10 1.1 1.5e+05 1.5e+03 3.6e+04 94 97 97 96 96 94 97 97 96 96 3854
KSPSolve_FS_Low 5 1.0 1.4239e+00 1.0 1.38e+09 1.1 2.3e+03 1.5e+03 5.7e+02 2 2 1 1 2 2 2 1 1 2 3793
VecMDot 18157 1.0 2.0707e+01 1.1 2.49e+10 1.1 0.0e+00 0.0e+00 1.8e+04 21 28 0 0 49 21 28 0 0 49 4708
VecNorm 18949 1.0 3.8294e+00 1.2 1.83e+09 1.1 0.0e+00 0.0e+00 1.9e+04 4 2 0 0 51 4 2 0 0 51 1868
VecScale 19203 1.0 4.4065e-01 1.1 9.21e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 8169
VecCopy 789 1.0 5.7539e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 20007 1.0 4.2773e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 1314 1.0 8.6258e-02 1.0 1.27e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 5764
VecMAXPY 18944 1.0 1.5599e+01 1.0 2.67e+10 1.1 0.0e+00 0.0e+00 0.0e+00 17 30 0 0 0 17 30 0 0 0 6689
VecAssemblyBegin 5 1.0 5.3558e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 5 1.0 3.8580e-05 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 19225 1.0 2.2305e-01 1.3 0.00e+00 0.0 1.5e+05 1.5e+03 7.0e+00 0 0 100 99 0 0 0 100 99 0 0
VecScatterEnd 19225 1.0 2.7400e+00 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
VecNormalize 18944 1.0 4.2351e+00 1.1 2.74e+09 1.1 0.0e+00 0.0e+00 1.9e+04 4 3 0 0 51 4 3 0 0 51 2533
SFSetGraph 7 1.0 2.2973e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetUp 7 1.0 2.1191e-02 1.3 0.00e+00 0.0 8.0e+01 3.5e+02 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFPack 19225 1.0 4.0084e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFUnpack 19225 1.0 5.2803e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
--- Event Stage 1: Unknown
------------------------------------------------------------------------------------------------------------------------
Object Type Creations Destructions. Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 23 5
Krylov Solver 6 1
Preconditioner 6 1
Vector 126 28
Index Set 35 16
Star Forest Graph 13 0
Distributed Mesh 3 0
Discrete System 3 0
Weak Form 3 0
Viewer 1 0
--- Event Stage 1: Unknown
========================================================================================================================
Average time to get PetscTime(): 2.77e-08
Average time for MPI_Barrier(): 0.000138606
Average time for zero size MPI_Send(): 5.6264e-05
#PETSc Option Table entries:
-log_view
-UPeqn_fieldsplit_p_pc_type hypre
-UPeqn_fieldsplit_u_pc_type bjacobi
-UPeqn_pc_fieldsplit_type schur
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: PETSC_ARCH=linux_x86_64 FOPTFLAGS=-O3 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 -with-debugging=no -download-fblaslapack=1 -download-superlu_dist -download-mumps -download-hypre -download-metis -download-parmetis -download-scalapack -download-ml -download-slepc -download-hpddm -download-cmake -with-mpi-dir=/home/edo/software_repo/openmpi-4.1.4/build/
-----------------------------------------
Libraries compiled on 2023-01-08 17:23:02 on localhost.localdomain
Machine characteristics: Linux-5.14.0-162.6.1.el9_1.x86_64-x86_64-with-glibc2.34
Using PETSc directory: /home/edo/software_repo/petsc
Using PETSc arch: linux_x86_64
-----------------------------------------
Using C compiler: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
Using Fortran compiler: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
-----------------------------------------
Using include paths: -I/home/edo/software_repo/petsc/include -I/home/edo/software_repo/petsc/linux_x86_64/include -I/home/edo/software_repo/openmpi-4.1.4/build/include
-----------------------------------------
Using C linker: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpicc
Using Fortran linker: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/edo/software_repo/petsc/linux_x86_64/lib -L/home/edo/software_repo/petsc/linux_x86_64/lib -lpetsc -Wl,-rpath,/home/edo/software_repo/petsc/linux_x86_64/lib -L/home/edo/software_repo/petsc/linux_x86_64/lib -Wl,-rpath,/home/edo/software_repo/openmpi-4.1.4/build/lib -L/home/edo/software_repo/openmpi-4.1.4/build/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/11 -L/usr/lib/gcc/x86_64-redhat-linux/11 -lHYPRE -ldmumps -lmumps_common -lpord -lpthread -lscalapack -lsuperlu_dist -lml -lflapack -lfblas -lparmetis -lmetis -lm -lstdc++ -ldl -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------