[petsc-users] Slow convergence while parallel computations.

Matthew Knepley knepley at gmail.com
Fri Sep 3 10:07:05 CDT 2021


It is a RAR since this is Windows :)

Viktor, your system looks singular. Is it possible that you somehow have a
zero on the diagonal? That might make SOR a problem. You could replace it
with Jacobi using

  -mg_levels_pc_type jacobi

  0 KSP Residual norm 2.980664994991e+02
  0 KSP preconditioned resid norm 2.980664994991e+02 true resid norm 7.983356882620e+11 ||r(i)||/||b|| 1.000000000000e+00
  1 KSP Residual norm 1.650358505966e+01
  1 KSP preconditioned resid norm 1.650358505966e+01 true resid norm 4.601793132543e+12 ||r(i)||/||b|| 5.764233267037e+00
  2 KSP Residual norm 2.086911345353e+01
  2 KSP preconditioned resid norm 2.086911345353e+01 true resid norm 1.258153657657e+12 ||r(i)||/||b|| 1.575970705250e+00
  3 KSP Residual norm 1.909137523120e+01
  3 KSP preconditioned resid norm 1.909137523120e+01 true resid norm 2.179275269000e+12 ||r(i)||/||b|| 2.729773077969e+00
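
A quick way to test the zero-diagonal hypothesis above is to look at the smallest
diagonal entry of the assembled matrix. A minimal sketch (variable names are
illustrative, and it only catches exact zeros):

  Vec            Vec_Diag
  PetscInt       imin
  PetscReal      dmin
  PetscErrorCode ierr

  call MatCreateVecs(Mat_K,Vec_Diag,PETSC_NULL_VEC,ierr) ! vector compatible with Mat_K
  call MatGetDiagonal(Mat_K,Vec_Diag,ierr)               ! extract the diagonal
  call VecMin(Vec_Diag,imin,dmin,ierr)                   ! smallest entry and its global index
  if (dmin == 0.0) print *, 'zero diagonal entry at row ', imin
  call VecDestroy(Vec_Diag,ierr)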

Mark, here is the solver

KSP Object: 1 MPI processes
  type: cg
  maximum iterations=100000, initial guess is zero
  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000.
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: gamg
    type is MULTIPLICATIVE, levels=4 cycles=v
      Cycles per PCApply=1
      Using externally computed Galerkin coarse grid matrices
      GAMG specific options
        Threshold for dropping small values in graph on each level = 0. 0. 0. 0.
        Threshold scaling factor for each level not specified = 1.
        AGG specific options
          Symmetric graph false
          Number of levels to square graph 1
          Number smoothing steps 1
        Complexity:    grid = 1.0042
  Coarse grid solver -- level -------------------------------
    KSP Object: (mg_coarse_) 1 MPI processes
      type: preonly
      maximum iterations=10000, initial guess is zero
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_coarse_) 1 MPI processes
      type: bjacobi
        number of blocks = 1
        Local solver information for first block is in the following KSP and PC objects on rank 0:
        Use -mg_coarse_ksp_view ::ascii_info_detail to display information for all blocks
      KSP Object: (mg_coarse_sub_) 1 MPI processes
        type: preonly
        maximum iterations=1, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_coarse_sub_) 1 MPI processes
        type: lu
          out-of-place factorization
          tolerance for zero pivot 2.22045e-14
          using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
          matrix ordering: nd
          factor fill ratio given 5., needed 1.19444
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=36, cols=36
                package used to perform factorization: petsc
                total: nonzeros=774, allocated nonzeros=774
                  using I-node routines: found 22 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: (mg_coarse_sub_) 1 MPI processes
          type: seqaij
          rows=36, cols=36
          total: nonzeros=648, allocated nonzeros=648
          total number of mallocs used during MatSetValues calls=0
            not using I-node routines
      linear system matrix = precond matrix:
      Mat Object: (mg_coarse_sub_) 1 MPI processes
        type: seqaij
        rows=36, cols=36
        total: nonzeros=648, allocated nonzeros=648
        total number of mallocs used during MatSetValues calls=0
          not using I-node routines
  Down solver (pre-smoother) on level 1 -------------------------------
    KSP Object: (mg_levels_1_) 1 MPI processes
      type: chebyshev
        eigenvalue estimates used:  min = 0.0997354, max = 1.09709
        eigenvalues estimate via gmres min 0.00372245, max 0.997354
        eigenvalues estimated using gmres with translations  [0. 0.1; 0. 1.1]
        KSP Object: (mg_levels_1_esteig_) 1 MPI processes
          type: gmres
            restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            happy breakdown tolerance 1e-30
          maximum iterations=10, initial guess is zero
          tolerances:  relative=1e-12, absolute=1e-50, divergence=10000.
          left preconditioning
          using PRECONDITIONED norm type for convergence test
        estimating eigenvalues using noisy right hand side
      maximum iterations=2, nonzero initial guess
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_1_) 1 MPI processes
      type: sor
        type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 1 MPI processes
        type: seqaij
        rows=902, cols=902
        total: nonzeros=66660, allocated nonzeros=66660
        total number of mallocs used during MatSetValues calls=0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 2 -------------------------------
    KSP Object: (mg_levels_2_) 1 MPI processes
      type: chebyshev
        eigenvalue estimates used:  min = 0.0994525, max = 1.09398
        eigenvalues estimate via gmres min 0.0303095, max 0.994525
        eigenvalues estimated using gmres with translations  [0. 0.1; 0. 1.1]
        KSP Object: (mg_levels_2_esteig_) 1 MPI processes
          type: gmres
            restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            happy breakdown tolerance 1e-30
          maximum iterations=10, initial guess is zero
          tolerances:  relative=1e-12, absolute=1e-50, divergence=10000.
          left preconditioning
          using PRECONDITIONED norm type for convergence test
        estimating eigenvalues using noisy right hand side
      maximum iterations=2, nonzero initial guess
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_2_) 1 MPI processes
      type: sor
        type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 1 MPI processes
        type: seqaij
        rows=12043, cols=12043
        total: nonzeros=455611, allocated nonzeros=455611
        total number of mallocs used during MatSetValues calls=0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 3 -------------------------------
    KSP Object: (mg_levels_3_) 1 MPI processes
      type: chebyshev
        eigenvalue estimates used:  min = 0.0992144, max = 1.09136
        eigenvalues estimate via gmres min 0.0222691, max 0.992144
        eigenvalues estimated using gmres with translations  [0. 0.1; 0. 1.1]
        KSP Object: (mg_levels_3_esteig_) 1 MPI processes
          type: gmres
            restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            happy breakdown tolerance 1e-30
          maximum iterations=10, initial guess is zero
          tolerances:  relative=1e-12, absolute=1e-50, divergence=10000.
          left preconditioning
          using PRECONDITIONED norm type for convergence test
        estimating eigenvalues using noisy right hand side
      maximum iterations=2, nonzero initial guess
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_3_) 1 MPI processes
      type: sor
        type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 1 MPI processes
        type: seqaij
        rows=1600200, cols=1600200
        total: nonzeros=124439742, allocated nonzeros=129616200
        total number of mallocs used during MatSetValues calls=0
          using I-node routines: found 533400 nodes, limit used is 5
  Up solver (post-smoother) same as down solver (pre-smoother)
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
    type: seqaij
    rows=1600200, cols=1600200
    total: nonzeros=124439742, allocated nonzeros=129616200
    total number of mallocs used during MatSetValues calls=0
      using I-node routines: found 533400 nodes, limit used is 5

  Thanks,

     Matt

On Fri, Sep 3, 2021 at 10:56 AM Mark Adams <mfadams at lbl.gov> wrote:

> That does not seem to be an ASCII file.
>
> On Fri, Sep 3, 2021 at 10:48 AM Viktor Nazdrachev <numbersixvs at gmail.com>
> wrote:
>
>> Hello Mark and Matthew!
>>
>>
>>
>> I attached log files for the serial and parallel cases and the corresponding information about the GAMG preconditioner (obtained with grep).
>>
>> I should note that assembly of the global stiffness matrix in the code was performed with the MatSetValues subroutine (not MatSetValuesBlocked).
>>
>> !nnds – number of nodes
>>
>> !dmn=3
>>
>> call MatCreate(Petsc_Comm_World,Mat_K,ierr)
>>
>> call MatSetFromOptions(Mat_K,ierr)
>>
>> call MatSetSizes(Mat_K,Petsc_Decide,Petsc_Decide,n,n,ierr_m)
>>
>>>>
>> call MatMPIAIJSetPreallocation(Mat_K,0,dbw,0,obw,ierr)
>>
>>>>
>> call MatSetOption(Mat_K,Mat_New_Nonzero_Allocation_Err,Petsc_False,ierr)
>>
>>>>
>> do i=1,nels
>>
>>     call FormLocalK(i,k,indx,"Kp") ! find local stiffness matrix
>>
>>      indx=indxmap(indx,2) !find global indices for DOFs
>>
>>      call MatSetValues(Mat_K,ef_eldof,indx,ef_eldof,indx,k,Add_Values,ierr)
>>
>> end do
>>
>>
>>
>> However, the nullspace vector was created using the VecSetBlockSize subroutine.
>>
>>
>>
>> call VecCreate(Petsc_Comm_World,Vec_NullSpace,ierr)
>>
>> call VecSetBlockSize(Vec_NullSpace,dmn,ierr)
>>
>> call VecSetSizes(Vec_NullSpace,nnds*dmn,Petsc_Decide,ierr)
>>
>> call VecSetUp(Vec_NullSpace,ierr)
>>
>> call VecGetArrayF90(Vec_NullSpace,null_space,ierr)
>>
>>>>
>> call VecRestoreArrayF90(Vec_NullSpace,null_space,ierr)
>>
>> call MatNullSpaceCreateRigidBody(Vec_NullSpace,matnull,ierr)
>>
>> call MatSetNearNullSpace(Mat_K,matnull,ierr)
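>>
>> Note that the code above never sets a block size on Mat_K itself, so GAMG and
>> the near nullspace may not see the 3-DOF-per-node block structure. A minimal
>> sketch of what could be added before preallocation (dmn=3 as above; this line
>> is not in the actual code):
>>
>> call MatSetBlockSize(Mat_K,dmn,ierr) ! declare dmn DOFs per node on the matrix
>>
>> Alternatively, the assembly loop could use MatSetValuesBlocked with node
>> (block) indices.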
>>
>>
>>
>> I suppose this could be one of the reasons for the slow GAMG convergence.
>>
>> So I attached log files for a parallel run with a "pure" GAMG preconditioner.
>>
>>
>>
>>
>>
>> Kind regards,
>>
>>
>>
>> Viktor Nazdrachev
>>
>>
>>
>> R&D senior researcher
>>
>>
>>
>> Geosteering Technologies LLC
>>
>> On Fri, Sep 3, 2021 at 3:11 PM Matthew Knepley <knepley at gmail.com> wrote:
>>
>>> On Fri, Sep 3, 2021 at 8:02 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Sep 3, 2021 at 1:57 AM Viktor Nazdrachev <numbersixvs at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello, Lawrence!
>>>>> Thank you for your response!
>>>>>
>>>>> I attached log files (txt files with the convergence behavior, and the RAM
>>>>> usage logs in separate txt files) and the resulting table with the convergence
>>>>> investigation data (xls). Data for the main irregular grid with 500K cells and
>>>>> heterogeneous properties are in the 500K folder, whereas data for the simple
>>>>> uniform 125K-cell grid with constant properties are in the 125K folder.
>>>>>
>>>>>
>>>>> >> On 1 Sep 2021, at 09:42, Наздрачёв Виктор <numbersixvs at gmail.com> wrote:
>>>>>
>>>>> >>
>>>>>
>>>>> >> I have a 3D elasticity problem with heterogeneous properties.
>>>>>
>>>>> >
>>>>>
>>>>> >What does your coefficient variation look like? How large is the contrast?
>>>>>
>>>>>
>>>>>
>>>>> Young's modulus varies from 1 to 10 GPa, Poisson's ratio from 0.3 to 0.44,
>>>>> and density from 1700 to 2600 kg/m^3.
>>>>>
>>>>
>>>> That is not too bad. Poorly shaped elements are the next thing to worry
>>>> about. Try to keep the aspect ratio below 10 if possible.
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >> There is unstructured grid with aspect ratio varied from 4 to 25. Zero Dirichlet BCs are imposed on bottom face of mesh. Also, Neumann (traction) BCs are imposed on side faces. Gravity load is also accounted for. The grid I use consists of 500k cells (which is approximately 1.6M of DOFs).
>>>>>
>>>>> >>
>>>>>
>>>>> >> The best performance and memory usage for single MPI process was obtained with HPDDM(BFBCG) solver and bjacobian + ICC (1) in subdomains as preconditioner, it took 1 m 45 s and RAM 5.0 GB. Parallel computation with 4 MPI processes took 2 m 46 s when using 5.6 GB of RAM. This because of number of iterations required to achieve the same tolerance is significantly increased.
>>>>>
>>>>> >
>>>>>
>>>>> >How many iterations do you have in serial (and then in parallel)?
>>>>>
>>>>>
>>>>>
>>>>> The serial run required 112 iterations to reach convergence (log_hpddm(bfbcg)_bjacobian_icc_1_mpi.txt); the parallel run with 4 MPI processes required 680 iterations.
>>>>>
>>>>>
>>>>>
>>>>> I attached log files for all simulations (txt files with the convergence
>>>>> behavior, and the RAM usage logs in separate txt files) and the resulting
>>>>> table with convergence/memory usage data (xls). Data for the main irregular
>>>>> grid with 500K cells and heterogeneous properties are in the 500K folder,
>>>>> whereas data for the simple uniform 125K-cell grid with constant properties
>>>>> are in the 125K folder.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >> I've also tried PCGAMG (agg) preconditioner with ICC (1) sub-preconditioner. For single MPI process, the calculation took 10 min and 3.4 GB of RAM. To improve the convergence rate, the nullspace was attached using MatNullSpaceCreateRigidBody and MatSetNearNullSpace subroutines. This has reduced calculation time to 3 m 58 s when using 4.3 GB of RAM. Also, there is peak memory usage with 14.1 GB, which appears just before the start of the iterations. Parallel computation with 4 MPI processes took 2 m 53 s when using 8.4 GB of RAM. In that case the peak memory usage is about 22 GB.
>>>>>
>>>>> >
>>>>>
>>>>> >Does the number of iterates increase in parallel? Again, how many iterations do you have?
>>>>>
>>>>>
>>>>>
>>>>> For the case with 4 MPI processes and the attached nullspace, 177 iterations are required to reach convergence (see the detailed log in log_hpddm(bfbcg)_gamg_nearnullspace_4_mpi.txt). For comparison, 90 iterations are required for the sequential run (log_hpddm(bfbcg)_gamg_nearnullspace_1_mpi.txt).
>>>>>
>>>>>
>>>> Again, do not use ICC. I am surprised to see such a large jump in
>>>> iteration count, but get ICC off the table.
>>>>
>>>> You will see variability in the iteration count with processor count
>>>> with GAMG. As much as ±10%, maybe more (random) variability, but usually
>>>> less.
>>>>
>>>> You can decrease the memory a little, and the setup time a lot, by
>>>> aggressively coarsening, at the expense of higher iteration counts. It's a
>>>> balancing act.
>>>>
>>>> You can run with the defaults, add '-info', grep on GAMG and send the
>>>> ~30 lines of output if you want advice on parameters.
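>>>>
>>>> For reference, the coarsening knobs usually involved are these PCGAMG options
>>>> (placeholders only, not recommendations for this problem):
>>>>
>>>>   -pc_gamg_threshold <t>        (threshold for dropping small values in the graph)
>>>>   -pc_gamg_square_graph <n>     (number of levels to square the graph)
>>>>   -pc_gamg_coarse_eq_limit <n>  (limit on the number of equations on the coarsest grid)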
>>>>
>>>
>>> Can you send the output of
>>>
>>>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>>>
>>>   Thanks,
>>>
>>>       Matt
>>>
>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >> Are there ways to avoid decreasing of the convergence rate for bjacobi preconditioner in parallel mode? Does it make sense to use hierarchical or nested krylov methods with a local gmres solver (sub_pc_type gmres) and some sub-preconditioner (for example, sub_pc_type bjacobi)?
>>>>>
>>>>> >
>>>>>
>>>>> >bjacobi is only a one-level method, so you would not expect process-independent convergence rate for this kind of problem. If the coefficient variation is not too extreme, then I would expect GAMG (or some other smoothed aggregation package, perhaps -pc_type ml (you need --download-ml)) would work well with some tuning.
>>>>>
>>>>>
>>>>>
>>>>> Thanks for the idea, but unfortunately ML cannot be compiled with 64-bit
>>>>> integers (and it is essential for us to perform computations on meshes with
>>>>> more than 10M cells).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> >If you have extremely high contrast coefficients you might need something with stronger coarse grids. If you can assemble so-called Neumann matrices (https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS) then you could try the geneo scheme offered by PCHPDDM.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I found strange convergence behavior for the HPDDM preconditioner. For 1
>>>>> MPI process the BFBCG solver did not converge
>>>>> (log_hpddm(bfbcg)_pchpddm_1_mpi.txt), while for 4 MPI processes the
>>>>> computation was successful (1018 iterations to reach convergence,
>>>>> log_hpddm(bfbcg)_pchpddm_4_mpi.txt).
>>>>>
>>>>> But it should be mentioned that the stiffness matrix was created in AIJ
>>>>> format (our default matrix format in the program).
>>>>>
>>>>> Converting the matrix to MATIS format via the MatConvert subroutine resulted
>>>>> in loss of convergence for both the serial and the parallel run.
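>>>>>
>>>>> A conversion of that kind would look roughly like this (a sketch only; Mat_IS
>>>>> is an illustrative name, not taken from the actual code):
>>>>>
>>>>> call MatConvert(Mat_K,MATIS,MAT_INITIAL_MATRIX,Mat_IS,ierr)
>>>>>
>>>>> Note that converting an already assembled AIJ matrix does not recover the
>>>>> unassembled local Neumann matrices that the GenEO suggestion refers to, which
>>>>> may be part of why convergence was lost after the conversion.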
>>>>>
>>>>>
>>>>> >> Is this peak memory usage expected for gamg preconditioner? is there any way to reduce it?
>>>>>
>>>>> >
>>>>>
>>>>> >I think that peak memory usage comes from building the coarse grids. Can you run with `-info` and grep for GAMG, this will provide some output that more expert GAMG users can interpret.
>>>>>
>>>>>
>>>>>
>>>>> Thanks, I'll try to use a strong threshold only for coarse grids.
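>>>>>
>>>>> Per-level thresholds should be accepted by -pc_gamg_threshold as a
>>>>> comma-separated list, one value per level. A purely illustrative example,
>>>>> not a tested setting:
>>>>>
>>>>>   -pc_gamg_threshold 0.0,0.01,0.05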
>>>>>
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>>
>>>>>
>>>>> Viktor Nazdrachev
>>>>>
>>>>>
>>>>>
>>>>> R&D senior researcher
>>>>>
>>>>>
>>>>>
>>>>> Geosteering Technologies LLC
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 1, 2021 at 12:02 PM Lawrence Mitchell <wence at gmx.li> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> > On 1 Sep 2021, at 09:42, Наздрачёв Виктор <numbersixvs at gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > I have a 3D elasticity problem with heterogeneous properties.
>>>>>>
>>>>>> What does your coefficient variation look like? How large is the
>>>>>> contrast?
>>>>>>
>>>>>> > There is unstructured grid with aspect ratio varied from 4 to 25.
>>>>>> Zero Dirichlet BCs  are imposed on bottom face of mesh. Also, Neumann
>>>>>> (traction) BCs are imposed on side faces. Gravity load is also accounted
>>>>>> for. The grid I use consists of 500k cells (which is approximately 1.6M of
>>>>>> DOFs).
>>>>>> >
>>>>>> > The best performance and memory usage for single MPI process was
>>>>>> obtained with HPDDM(BFBCG) solver and bjacobian + ICC (1) in subdomains as
>>>>>> preconditioner, it took 1 m 45 s and RAM 5.0 GB. Parallel computation with
>>>>>> 4 MPI processes took 2 m 46 s when using 5.6 GB of RAM. This because of
>>>>>> number of iterations required to achieve the same tolerance is
>>>>>> significantly increased.
>>>>>>
>>>>>> How many iterations do you have in serial (and then in parallel)?
>>>>>>
>>>>>> > I've also tried PCGAMG (agg) preconditioner with ICC (1)
>>>>>> sub-preconditioner. For single MPI process, the calculation took 10 min and
>>>>>> 3.4 GB of RAM. To improve the convergence rate, the nullspace was attached
>>>>>> using MatNullSpaceCreateRigidBody and MatSetNearNullSpace subroutines.
>>>>>> This has reduced calculation time to 3 m 58 s when using 4.3 GB of RAM.
>>>>>> Also, there is peak memory usage with 14.1 GB, which appears just before
>>>>>> the start of the iterations. Parallel computation with 4 MPI processes took
>>>>>> 2 m 53 s when using 8.4 GB of RAM. In that case the peak memory usage is
>>>>>> about 22 GB.
>>>>>>
>>>>>> Does the number of iterates increase in parallel? Again, how many
>>>>>> iterations do you have?
>>>>>>
>>>>>> > Are there ways to avoid decreasing of the convergence rate for
>>>>>> bjacobi preconditioner in parallel mode? Does it make sense to use
>>>>>> hierarchical or nested krylov methods with a local gmres solver
>>>>>> (sub_pc_type gmres) and some sub-preconditioner (for example, sub_pc_type
>>>>>> bjacobi)?
>>>>>>
>>>>>> bjacobi is only a one-level method, so you would not expect
>>>>>> process-independent convergence rate for this kind of problem. If the
>>>>>> coefficient variation is not too extreme, then I would expect GAMG (or some
>>>>>> other smoothed aggregation package, perhaps -pc_type ml (you need
>>>>>> --download-ml)) would work well with some tuning.
>>>>>>
>>>>>> If you have extremely high contrast coefficients you might need
>>>>>> something with stronger coarse grids. If you can assemble so-called Neumann
>>>>>> matrices (
>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS)
>>>>>> then you could try the geneo scheme offered by PCHPDDM.
>>>>>>
>>>>>> > Is this peak memory usage expected for gamg preconditioner? is
>>>>>> there any way to reduce it?
>>>>>>
>>>>>> I think that peak memory usage comes from building the coarse grids.
>>>>>> Can you run with `-info` and grep for GAMG, this will provide some output
>>>>>> that more expert GAMG users can interpret.
>>>>>>
>>>>>> Lawrence
>>>>>>
>>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

