[petsc-users] Help with fieldsplit performance

Edoardo alinovi edoardo.alinovi at gmail.com
Sun Feb 5 12:26:17 CST 2023


Hello PETSc crew,

I would like to ask for some support in setting up the fieldsplit
preconditioner to get better performance. I have already found some posts
on the topic and have kept experimenting, but I would like to hear your
opinion as experts :)

I have my fancy CFD pressure-based coupled solver already validated on
some basic problems, so I am confident the matrix is OK. However, I am
struggling a bit to get good performance. In my experiments I have found
that schur gives the lowest overall iteration count, but it takes ages to
converge! Additive or multiplicative look like a better call, but in some
cases they need a very high number of iterations to converge (500+).
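
For reference, the setup in the attached ksp_view boils down to run-time
options along these lines (some of them are just the PETSc defaults made
explicit, and the split type is the knob I keep changing between schur,
additive and multiplicative, so take this as a sketch of what I am trying
rather than a fixed recipe):

-UPeqn_pc_fieldsplit_type schur
-UPeqn_pc_fieldsplit_schur_fact_type full
-UPeqn_pc_fieldsplit_schur_precondition a11
-UPeqn_fieldsplit_u_pc_type bjacobi
-UPeqn_fieldsplit_u_sub_pc_type ilu
-UPeqn_fieldsplit_p_pc_type hypre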

I attach here the logs (ksp_view and log_view) for an example case of the
flow past a 90deg T-junction, 285k cells on 4 procs.

GMRES + fieldsplit with schur takes 90 s to converge in 4 outer
iterations. Do you see anything strange in the way the KSP is set up?
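
From the log_view, almost all of the 90 s goes into the Schur complement
solve (KSPSolve_FS_Schu is about 87 s), and the inner solvers do roughly
18000 GMRES iterations in total (see KSPGMRESOrthog and MatSolve), so the
4 outer iterations hide a lot of inner work. If it helps, I can rerun with
the inner solves monitored, for example with standard monitoring options
on my UPeqn_ prefix such as:

-UPeqn_ksp_monitor_true_residual
-UPeqn_fieldsplit_u_ksp_converged_reason
-UPeqn_fieldsplit_p_ksp_monitor_true_residual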

Thank you for the support as always!
-------------- next part --------------

KSP Object: (UPeqn_) 4 MPI processes
  type: gmres
    restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=10000, nonzero initial guess
  tolerances:  relative=0., absolute=1e-06, divergence=10000.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: (UPeqn_) 4 MPI processes
  type: fieldsplit
    FieldSplit with Schur preconditioner, blocksize = 3, factorization FULL
    Preconditioner for the Schur complement formed from A11
    Split info:
    Split number 0 Fields  0, 1
    Split number 1 Fields  2
    KSP solver for A00 block
      KSP Object: (UPeqn_fieldsplit_u_) 4 MPI processes
        type: gmres
          restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
          happy breakdown tolerance 1e-30
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using PRECONDITIONED norm type for convergence test
      PC Object: (UPeqn_fieldsplit_u_) 4 MPI processes
        type: bjacobi
          number of blocks = 4
          Local solver information for first block is in the following KSP and PC objects on rank 0:
          Use -UPeqn_fieldsplit_u_ksp_view ::ascii_info_detail to display information for all blocks
        KSP Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
          type: preonly
          maximum iterations=10000, initial guess is zero
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
          left preconditioning
          using NONE norm type for convergence test
        PC Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
          type: ilu
            out-of-place factorization
            0 levels of fill
            tolerance for zero pivot 2.22045e-14
            matrix ordering: natural
            factor fill ratio given 1., needed 1.
              Factored matrix follows:
                Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
                  type: seqaij
                  rows=48586, cols=48586, bs=2
                  package used to perform factorization: petsc
                  total: nonzeros=483068, allocated nonzeros=483068
                    using I-node routines: found 24293 nodes, limit used is 5
          linear system matrix = precond matrix:
          Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
            type: seqaij
            rows=48586, cols=48586, bs=2
            total: nonzeros=483068, allocated nonzeros=483068
            total number of mallocs used during MatSetValues calls=0
              using I-node routines: found 24293 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: (UPeqn_fieldsplit_u_) 4 MPI processes
          type: mpiaij
          rows=190000, cols=190000, bs=2
          total: nonzeros=1891600, allocated nonzeros=1891600
          total number of mallocs used during MatSetValues calls=0
            using I-node (on process 0) routines: found 24293 nodes, limit used is 5
    KSP solver for S = A11 - A10 inv(A00) A01 
      KSP Object: (UPeqn_fieldsplit_p_) 4 MPI processes
        type: gmres
          restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
          happy breakdown tolerance 1e-30
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using PRECONDITIONED norm type for convergence test
      PC Object: (UPeqn_fieldsplit_p_) 4 MPI processes
        type: hypre
          HYPRE BoomerAMG preconditioning
            Cycle type V
            Maximum number of levels 25
            Maximum number of iterations PER hypre call 1
            Convergence tolerance PER hypre call 0.
            Threshold for strong coupling 0.25
            Interpolation truncation factor 0.
            Interpolation: max elements per row 0
            Number of levels of aggressive coarsening 0
            Number of paths for aggressive coarsening 1
            Maximum row sums 0.9
            Sweeps down         1
            Sweeps up           1
            Sweeps on coarse    1
            Relax down          symmetric-SOR/Jacobi
            Relax up            symmetric-SOR/Jacobi
            Relax on coarse     Gaussian-elimination
            Relax weight  (all)      1.
            Outer relax weight (all) 1.
            Using CF-relaxation
            Not using more complex smoothers.
            Measure type        local
            Coarsen type        Falgout
            Interpolation type  classical
            SpGEMM type         cusparse
        linear system matrix followed by preconditioner matrix:
        Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
          type: schurcomplement
          rows=95000, cols=95000
            Schur complement A11 - A10 inv(A00) A01
            A11
              Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
                type: mpiaij
                rows=95000, cols=95000
                total: nonzeros=472900, allocated nonzeros=472900
                total number of mallocs used during MatSetValues calls=0
                  not using I-node (on process 0) routines
            A10
              Mat Object: 4 MPI processes
                type: mpiaij
                rows=95000, cols=190000
                total: nonzeros=945800, allocated nonzeros=945800
                total number of mallocs used during MatSetValues calls=0
                  not using I-node (on process 0) routines
            KSP of A00
              KSP Object: (UPeqn_fieldsplit_u_) 4 MPI processes
                type: gmres
                  restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
                  happy breakdown tolerance 1e-30
                maximum iterations=10000, initial guess is zero
                tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
                left preconditioning
                using PRECONDITIONED norm type for convergence test
              PC Object: (UPeqn_fieldsplit_u_) 4 MPI processes
                type: bjacobi
                  number of blocks = 4
                  Local solver information for first block is in the following KSP and PC objects on rank 0:
                  Use -UPeqn_fieldsplit_u_ksp_view ::ascii_info_detail to display information for all blocks
                KSP Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
                  type: preonly
                  maximum iterations=10000, initial guess is zero
                  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
                  left preconditioning
                  using NONE norm type for convergence test
                PC Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
                  type: ilu
                    out-of-place factorization
                    0 levels of fill
                    tolerance for zero pivot 2.22045e-14
                    matrix ordering: natural
                    factor fill ratio given 1., needed 1.
                      Factored matrix follows:
                        Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
                          type: seqaij
                          rows=48586, cols=48586, bs=2
                          package used to perform factorization: petsc
                          total: nonzeros=483068, allocated nonzeros=483068
                            using I-node routines: found 24293 nodes, limit used is 5
                  linear system matrix = precond matrix:
                  Mat Object: (UPeqn_fieldsplit_u_sub_) 1 MPI process
                    type: seqaij
                    rows=48586, cols=48586, bs=2
                    total: nonzeros=483068, allocated nonzeros=483068
                    total number of mallocs used during MatSetValues calls=0
                      using I-node routines: found 24293 nodes, limit used is 5
                linear system matrix = precond matrix:
                Mat Object: (UPeqn_fieldsplit_u_) 4 MPI processes
                  type: mpiaij
                  rows=190000, cols=190000, bs=2
                  total: nonzeros=1891600, allocated nonzeros=1891600
                  total number of mallocs used during MatSetValues calls=0
                    using I-node (on process 0) routines: found 24293 nodes, limit used is 5
            A01
              Mat Object: 4 MPI processes
                type: mpiaij
                rows=190000, cols=95000, rbs=2, cbs=1
                total: nonzeros=945800, allocated nonzeros=945800
                total number of mallocs used during MatSetValues calls=0
                  using I-node (on process 0) routines: found 24293 nodes, limit used is 5
        Mat Object: (UPeqn_fieldsplit_p_) 4 MPI processes
          type: mpiaij
          rows=95000, cols=95000
          total: nonzeros=472900, allocated nonzeros=472900
          total number of mallocs used during MatSetValues calls=0
            not using I-node (on process 0) routines
  linear system matrix = precond matrix:
  Mat Object: 4 MPI processes
    type: mpiaij
    rows=285000, cols=285000, bs=3
    total: nonzeros=4256100, allocated nonzeros=4256100
    total number of mallocs used during MatSetValues calls=0
      using I-node (on process 0) routines: found 24293 nodes, limit used is 5
-------------- next part --------------
flubio_coupled on a linux_x86_64 named localhost.localdomain with 4 processors, by edo Sun Feb  5 19:21:59 2023
Using Petsc Release Version 3.18.3, unknown 

                         Max       Max/Min     Avg       Total
Time (sec):           9.234e+01     1.000   9.234e+01
Objects:              2.190e+02     1.000   2.190e+02
Flops:                8.885e+10     1.083   8.686e+10  3.475e+11
Flops/sec:            9.623e+08     1.083   9.407e+08  3.763e+09
MPI Msg Count:        5.765e+04     2.999   3.844e+04  1.537e+05
MPI Msg Len (bytes):  9.189e+07     2.972   1.548e+03  2.380e+08
MPI Reductions:       3.723e+04     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 9.2336e+01 100.0%  3.4746e+11 100.0%  1.537e+05 100.0%  1.548e+03      100.0%  3.721e+04  99.9%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided         13 1.0 1.6536e-01 2.7 0.00e+00 0.0 4.0e+01 4.0e+00 1.3e+01  0  0  0  0  0   0  0  0  0  0     0
BuildTwoSidedF         6 1.0 1.4933e-01 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult              906 1.0 8.5608e+01 1.0 8.60e+10 1.1 1.5e+05 1.5e+03 3.5e+04 93 97 100 99 95  93 97 100 99 95  3928
MatMultAdd           254 1.0 8.2637e-02 1.1 6.15e+07 1.1 2.0e+03 7.7e+02 1.0e+00  0  0  1  1  0   0  0  1  1  0  2907
MatSolve           18680 1.0 2.7633e+01 1.1 1.71e+10 1.1 0.0e+00 0.0e+00 0.0e+00 29 19  0  0  0  29 19  0  0  0  2424
MatLUFactorNum         1 1.0 7.5099e-03 1.6 1.97e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1026
MatILUFactorSym        1 1.0 6.3318e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatConvert             1 1.0 2.8405e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin      19 1.0 9.5880e-02 1154.5 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd        19 1.0 9.6739e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+01  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            3 1.0 1.4920e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCreateSubMat        4 1.0 2.0991e-01 1.1 0.00e+00 0.0 6.4e+01 2.3e+03 4.4e+01  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 1.2688e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         1 1.0 4.9500e-03 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSetUp               4 1.0 2.9473e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 9.0082e+01 1.0 8.88e+10 1.1 1.5e+05 1.5e+03 3.7e+04 98 100 100 99 100  98 100 100 99 100  3857
KSPGMRESOrthog     18157 1.0 3.4719e+01 1.1 4.99e+10 1.1 0.0e+00 0.0e+00 1.8e+04 37 56  0  0 49  37 56  0  0 49  5616
PCSetUp                4 1.0 3.7452e-01 1.0 1.97e+06 1.1 8.8e+01 2.8e+04 7.6e+01  0  0  0  1  0   0  0  0  1  0    21
PCSetUpOnBlocks      264 1.0 1.5411e-02 1.3 1.97e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   500
PCApply                5 1.0 9.0060e+01 1.0 8.88e+10 1.1 1.5e+05 1.5e+03 3.7e+04 98 100 100 99 100  98 100 100 99 100  3857
KSPSolve_FS_0          5 1.0 1.6113e+00 1.0 1.68e+09 1.1 2.8e+03 1.6e+03 7.0e+02  2  2  2  2  2   2  2  2  2  2  4077
KSPSolve_FS_Schu       5 1.0 8.7013e+01 1.0 8.58e+10 1.1 1.5e+05 1.5e+03 3.6e+04 94 97 97 96 96  94 97 97 96 96  3854
KSPSolve_FS_Low        5 1.0 1.4239e+00 1.0 1.38e+09 1.1 2.3e+03 1.5e+03 5.7e+02  2  2  1  1  2   2  2  1  1  2  3793
VecMDot            18157 1.0 2.0707e+01 1.1 2.49e+10 1.1 0.0e+00 0.0e+00 1.8e+04 21 28  0  0 49  21 28  0  0 49  4708
VecNorm            18949 1.0 3.8294e+00 1.2 1.83e+09 1.1 0.0e+00 0.0e+00 1.9e+04  4  2  0  0 51   4  2  0  0 51  1868
VecScale           19203 1.0 4.4065e-01 1.1 9.21e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  8169
VecCopy              789 1.0 5.7539e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet             20007 1.0 4.2773e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY             1314 1.0 8.6258e-02 1.0 1.27e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  5764
VecMAXPY           18944 1.0 1.5599e+01 1.0 2.67e+10 1.1 0.0e+00 0.0e+00 0.0e+00 17 30  0  0  0  17 30  0  0  0  6689
VecAssemblyBegin       5 1.0 5.3558e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         5 1.0 3.8580e-05 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin    19225 1.0 2.2305e-01 1.3 0.00e+00 0.0 1.5e+05 1.5e+03 7.0e+00  0  0 100 99  0   0  0 100 99  0     0
VecScatterEnd      19225 1.0 2.7400e+00 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecNormalize       18944 1.0 4.2351e+00 1.1 2.74e+09 1.1 0.0e+00 0.0e+00 1.9e+04  4  3  0  0 51   4  3  0  0 51  2533
SFSetGraph             7 1.0 2.2973e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetUp                7 1.0 2.1191e-02 1.3 0.00e+00 0.0 8.0e+01 3.5e+02 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFPack             19225 1.0 4.0084e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFUnpack           19225 1.0 5.2803e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: Unknown

------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix    23              5
       Krylov Solver     6              1
      Preconditioner     6              1
              Vector   126             28
           Index Set    35             16
   Star Forest Graph    13              0
    Distributed Mesh     3              0
     Discrete System     3              0
           Weak Form     3              0
              Viewer     1              0

--- Event Stage 1: Unknown

========================================================================================================================
Average time to get PetscTime(): 2.77e-08
Average time for MPI_Barrier(): 0.000138606
Average time for zero size MPI_Send(): 5.6264e-05
#PETSc Option Table entries:
-log_view
-UPeqn_fieldsplit_p_pc_type hypre
-UPeqn_fieldsplit_u_pc_type bjacobi
-UPeqn_pc_fieldsplit_type schur
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: PETSC_ARCH=linux_x86_64 FOPTFLAGS=-O3 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 -with-debugging=no -download-fblaslapack=1 -download-superlu_dist -download-mumps -download-hypre -download-metis -download-parmetis -download-scalapack -download-ml -download-slepc -download-hpddm -download-cmake -with-mpi-dir=/home/edo/software_repo/openmpi-4.1.4/build/
-----------------------------------------
Libraries compiled on 2023-01-08 17:23:02 on localhost.localdomain 
Machine characteristics: Linux-5.14.0-162.6.1.el9_1.x86_64-x86_64-with-glibc2.34
Using PETSc directory: /home/edo/software_repo/petsc
Using PETSc arch: linux_x86_64
-----------------------------------------

Using C compiler: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpicc  -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3   
Using Fortran compiler: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpif90  -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3     
-----------------------------------------

Using include paths: -I/home/edo/software_repo/petsc/include -I/home/edo/software_repo/petsc/linux_x86_64/include -I/home/edo/software_repo/openmpi-4.1.4/build/include
-----------------------------------------

Using C linker: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpicc
Using Fortran linker: /home/edo/software_repo/openmpi-4.1.4/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/edo/software_repo/petsc/linux_x86_64/lib -L/home/edo/software_repo/petsc/linux_x86_64/lib -lpetsc -Wl,-rpath,/home/edo/software_repo/petsc/linux_x86_64/lib -L/home/edo/software_repo/petsc/linux_x86_64/lib -Wl,-rpath,/home/edo/software_repo/openmpi-4.1.4/build/lib -L/home/edo/software_repo/openmpi-4.1.4/build/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/11 -L/usr/lib/gcc/x86_64-redhat-linux/11 -lHYPRE -ldmumps -lmumps_common -lpord -lpthread -lscalapack -lsuperlu_dist -lml -lflapack -lfblas -lparmetis -lmetis -lm -lstdc++ -ldl -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------

