[petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Fri Feb 25 10:06:01 CST 2022

Dear PETSc users,

I'm working on an in-house code that solves the Navier-Stokes equations in a
Lagrangian fashion for free-surface flows. Because of the large distortions
and pressure gradients, the iterative solvers often struggle at some time
steps, so I implemented a function that changes the solver type based on
the KSPConvergedReason. If the reason is negative after a call to KSPSolve,
I solve the same linear system again using a direct method.

The problem is that KSP sometimes reports convergence even though the true
residual is NaN. As a result, I can't detect the failure and switch
solvers, the solution vector ends up equal to INF, and the code crashes.
Is this behaviour expected?
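
For illustration, one way to catch this case is to accept a solve only when the converged reason is positive and every entry of the solution is finite. The snippet below is a minimal sketch in plain C++, not PETSc code: `allFinite` and `solveAccepted` are hypothetical helper names, and the solution is assumed to be available as a `std::vector<double>`. In a PETSc code the same test could presumably be done on the result of `VecNorm` with `PetscIsInfOrNanReal`.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Returns true when every entry is a finite number (neither NaN nor +/-Inf).
bool allFinite(const std::vector<double>& v)
{
	for (double x : v)
	{
		if (!std::isfinite(x))
			return false;
	}
	return true;
}

// Hypothetical acceptance test: a positive reason alone is not enough,
// the solution itself must also be free of NaN and Inf entries.
bool solveAccepted(int reason, const std::vector<double>& solution)
{
	return reason > 0 && allFinite(solution);
}
```

With this guard, a call such as `solveAccepted(2, x)` returns false when `x` contains an INF entry, so the fallback to the direct solver would still be triggered even though the reason is positive.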

Please find attached the log produced with the options
-ksp_monitor_lg_residualnorm -ksp_log -ksp_view -ksp_monitor_true_residual
-ksp_converged_reason and the function that changes the solver. I'm
currently using FGMRES with a BJACOBI preconditioner and LU on each block.
The problem also occurs with ILU, for example. The log shows that at time
step 921 the true residual is NaN and the solver fails immediately
(0 iterations) with the reason DIVERGED_PC_FAILED. I then switched the
solver to MUMPS and it converged for that time step. At time step 922,
however, FGMRES converges while the true residual is NaN. How is that
possible? I would appreciate it if someone could clarify this issue.
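
As a point of reference, the "true residual" reported by the monitor is the norm of b - A x. A check that does not rely on the solver's own residual estimate can recompute this quantity after the solve and reject the result when it is not finite. The sketch below does this for a small dense system in plain C++; `Matrix`, `relativeTrueResidual` and `residualAccepted` are hypothetical names, and the dense storage is an assumption made only to keep the example self-contained. With PETSc objects the same quantity could presumably be assembled from `MatMult`, `VecAYPX` and `VecNorm`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Computes ||b - A x||_2 / ||b||_2 for a small dense system.
// A NaN or Inf anywhere in A, x or b that enters the sums makes the result non-finite.
double relativeTrueResidual(const Matrix& A, const std::vector<double>& x, const std::vector<double>& b)
{
	double rnorm2 = 0.0, bnorm2 = 0.0;
	for (std::size_t i = 0; i < A.size(); ++i)
	{
		double ri = b[i];
		for (std::size_t j = 0; j < x.size(); ++j)
			ri -= A[i][j] * x[j];
		rnorm2 += ri * ri;
		bnorm2 += b[i] * b[i];
	}
	return std::sqrt(rnorm2) / std::sqrt(bnorm2);
}

// Accepts the solve only if the recomputed residual is finite and within tol.
// A NaN residual fails std::isfinite, so it can never be accepted.
bool residualAccepted(double relres, double tol)
{
	return std::isfinite(relres) && relres <= tol;
}
```

Note that a plain comparison such as `relres > tol` is false for NaN, which is why the explicit `std::isfinite` test is needed before the tolerance check.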

Kind regards,
Giovane

-- 
Giovane Avancini
Doutorando em Engenharia de Estruturas - Escola de Engenharia de São
Carlos, USP

PhD researcher in Structural Engineering - School of Engineering of São
Carlos, USP
-------------- next part --------------
void FluidDomain::solveLinearSystem(KSP& ksp, Mat& mat, Vec& rhs, Vec& solution)
{
	auto start_timer = std::chrono::high_resolution_clock::now();

	KSPReset(ksp);
	KSPSetOperators(ksp, mat, mat);
	PC pc;
	KSPGetPC(ksp, &pc);
	PetscBool isbjacobi;
	PetscObjectTypeCompare((PetscObject)pc, PCBJACOBI, &isbjacobi);
	if (isbjacobi)
	{
		PetscInt nlocal;
		KSP *subksp;
		PC subpc;
		KSPSetUp(ksp);
		PCBJacobiGetSubKSP(pc, &nlocal, NULL, &subksp);
		for (int i = 0; i < nlocal; i++)
		{
			KSPGetPC(subksp[i], &subpc);
			PCSetType(subpc, PCLU);
			// PCFactorReorderForNonzeroDiagonal(subpc, 1.0e-10);
			// PCFactorSetShiftType(subpc, MAT_SHIFT_NONZERO);
			//PCFactorSetShiftAmount(subpc, 1.0e-10);
		}
	}
	else
	{
		//PCFactorReorderForNonzeroDiagonal(pc, 1.0e-10);
		//PCFactorSetShiftType(pc, MAT_SHIFT_NONZERO);
		// PCFactorSetShiftAmount(pc, 1.0e-10);
	}
	PetscReal matnorm, vecnorm;
	VecNorm(rhs, NORM_INFINITY, &vecnorm);
	MatNorm(mat, NORM_INFINITY, &matnorm);
	PetscPrintf(PETSC_COMM_WORLD, "MatNorm: %g    VecNorm: %g\n", (double)matnorm, (double)vecnorm);
	KSPSolve(ksp, rhs, solution);
	KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);
	KSPConvergedReason reason;
	KSPGetConvergedReason(ksp,&reason);
	PetscInt nit;
	KSPGetIterationNumber(ksp, &nit);
	PetscReal norm;
	KSPGetResidualNorm(ksp, &norm);

	auto end_timer = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double> elapsed = end_timer - start_timer;

	if (reason > 0)
	{
		// PetscInt may be 64-bit, so cast before printing with %d
		PetscPrintf(PETSC_COMM_WORLD, "Solver converged within %d iterations. Elapsed time: %f\n", (int)nit, elapsed.count());
	}
	else
	{
		if (reason == KSP_DIVERGED_ITS) // maximum number of iterations reached (value -3)
			PetscPrintf(PETSC_COMM_WORLD, "Solver convergence is very slow. Modifying the solver in order to improve the convergence...\n");
		else
			PetscPrintf(PETSC_COMM_WORLD, "Solver diverged, reason %d. Modifying the solver in order to improve the convergence...\n", (int)reason);
		KSP ksp2;
		KSPCreate(PETSC_COMM_WORLD, &ksp2);
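		// KSPPREONLY applies the preconditioner (here LU) exactly once, so the
		// tolerances and the GMRES restart set below have no effect on this solve
		KSPSetType(ksp2, KSPPREONLY);
		KSPSetTolerances(ksp2, 1.0e-8, PETSC_DEFAULT, PETSC_DEFAULT, 5000);
		KSPGMRESSetRestart(ksp2, 30);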
		PC pc2;
		KSPGetPC(ksp2, &pc2);
		PCSetType(pc2, PCLU);
		KSPSetOperators(ksp2, mat, mat);
		// pc2 was just set to PCLU, so this comparison is false and the BJACOBI branch below is not taken
		PetscObjectTypeCompare((PetscObject)pc2, PCBJACOBI, &isbjacobi);
		if (isbjacobi)
		{
			PetscInt nlocal;
			KSP *subksp;
			PC subpc;
			KSPSetUp(ksp2);
			PCBJacobiGetSubKSP(pc2, &nlocal, NULL, &subksp);
			for (int i = 0; i < nlocal; i++)
			{
				KSPGetPC(subksp[i], &subpc);
				//PCFactorSetShiftType(subpc, MAT_SHIFT_NONZERO);
			}
		}
		else
		{
			//PCFactorSetShiftType(pc2, MAT_SHIFT_NONZERO);
		}
		VecNorm(rhs, NORM_INFINITY, &vecnorm);
		MatNorm(mat, NORM_INFINITY, &matnorm);
		PetscPrintf(PETSC_COMM_WORLD, "MatNorm: %g    VecNorm: %g\n", (double)matnorm, (double)vecnorm);
		KSPSolve(ksp2, rhs, solution);
		KSPGetConvergedReason(ksp2, &reason);
		KSPGetIterationNumber(ksp2, &nit);
		if (reason > 0)
		{
			PetscPrintf(PETSC_COMM_WORLD, "Solver converged within %d iterations.\n", (int)nit);
		}
		else
		{
			PetscPrintf(PETSC_COMM_WORLD, "Changing the solver did not improve the convergence.\n");
		}
		KSPDestroy(&ksp2);
	}
}
-------------- next part --------------
----------------------- TIME STEP = 921, time = 0.184200  -----------------------

Mesh Regenerated. Elapsed time: 0.011534
Isolated nodes: 0
Assemble Linear System. Elapsed time: 0.023297
MatNorm: 3.04644e+06    VecNorm: 1305.
  0 KSP unpreconditioned resid norm 1.466259843490e+04 true resid norm           -nan ||r(i)||/||b||           -nan
Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0
               PC failed due to SUBPC_ERROR 
KSP Object: 4 MPI processes
  type: fgmres
    restart=100, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=500, initial guess is zero
  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: bjacobi
    number of blocks = 4
    Local solver information for first block is in the following KSP and PC objects on rank 0:
    Use -ksp_view ::ascii_info_detail to display information for all blocks
  KSP Object: (sub_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (sub_) 1 MPI processes
    type: lu
      out-of-place factorization
      tolerance for zero pivot 2.22045e-14
      matrix ordering: nd
      factor fill ratio given 5., needed 2.90995
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=1091, cols=1091
            package used to perform factorization: petsc
            total: nonzeros=58039, allocated nonzeros=58039
              using I-node routines: found 364 nodes, limit used is 5
    linear system matrix = precond matrix:
    Mat Object: (sub_) 1 MPI processes
      type: seqaij
      rows=1091, cols=1091
      total: nonzeros=19945, allocated nonzeros=19945
      total number of mallocs used during MatSetValues calls=0
        using I-node routines: found 364 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object: 4 MPI processes
    type: mpiaij
    rows=4362, cols=4362
    total: nonzeros=88470, allocated nonzeros=88470
    total number of mallocs used during MatSetValues calls=0
      using I-node (on process 0) routines: found 364 nodes, limit used is 5
Solver diverged, reason -11. Modifying the solver in order to improve the convergence...
MatNorm: 3.04644e+06    VecNorm: 1305.
Linear solve converged due to CONVERGED_ITS iterations 1
Solver converged within 1 iterations.
Newton iteration: 0 - L2 Position Norm: 1.203626E-03 - L2 Pressure Norm: 2.537266E-01
Memory used by each processor: 36.636719 Mb
Assemble Linear System. Elapsed time: 0.016010
MatNorm: 3.04644e+06    VecNorm: 0.0239994
  0 KSP unpreconditioned resid norm 6.218477255232e-02 true resid norm           -nan ||r(i)||/||b||           -nan
Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0
               PC failed due to SUBPC_ERROR 
KSP Object: 4 MPI processes
  type: fgmres
    restart=100, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=500, initial guess is zero
  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: bjacobi
    number of blocks = 4
    Local solver information for first block is in the following KSP and PC objects on rank 0:
    Use -ksp_view ::ascii_info_detail to display information for all blocks
  KSP Object: (sub_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (sub_) 1 MPI processes
    type: lu
      out-of-place factorization
      tolerance for zero pivot 2.22045e-14
      matrix ordering: nd
      factor fill ratio given 5., needed 2.90995
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=1091, cols=1091
            package used to perform factorization: petsc
            total: nonzeros=58039, allocated nonzeros=58039
              using I-node routines: found 364 nodes, limit used is 5
    linear system matrix = precond matrix:
    Mat Object: (sub_) 1 MPI processes
      type: seqaij
      rows=1091, cols=1091
      total: nonzeros=19945, allocated nonzeros=19945
      total number of mallocs used during MatSetValues calls=0
        using I-node routines: found 364 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object: 4 MPI processes
    type: mpiaij
    rows=4362, cols=4362
    total: nonzeros=88470, allocated nonzeros=88470
    total number of mallocs used during MatSetValues calls=0
      using I-node (on process 0) routines: found 364 nodes, limit used is 5
Solver diverged, reason -11. Modifying the solver in order to improve the convergence...
MatNorm: 3.04644e+06    VecNorm: 0.0239994
Linear solve converged due to CONVERGED_ITS iterations 1
Solver converged within 1 iterations.
Newton iteration: 1 - L2 Position Norm: 1.796085E-07 - L2 Pressure Norm: 9.187252E-02
Memory used by each processor: 36.695312 Mb
Assemble Linear System. Elapsed time: 0.020556
MatNorm: 3.04644e+06    VecNorm: 2.81116e-06
  0 KSP unpreconditioned resid norm 1.136884066004e-05 true resid norm           -nan ||r(i)||/||b||           -nan
Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0
               PC failed due to SUBPC_ERROR 
KSP Object: 4 MPI processes
  type: fgmres
    restart=100, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=500, initial guess is zero
  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: bjacobi
    number of blocks = 4
    Local solver information for first block is in the following KSP and PC objects on rank 0:
    Use -ksp_view ::ascii_info_detail to display information for all blocks
  KSP Object: (sub_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (sub_) 1 MPI processes
    type: lu
      out-of-place factorization
      tolerance for zero pivot 2.22045e-14
      matrix ordering: nd
      factor fill ratio given 5., needed 2.90995
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=1091, cols=1091
            package used to perform factorization: petsc
            total: nonzeros=58039, allocated nonzeros=58039
              using I-node routines: found 364 nodes, limit used is 5
    linear system matrix = precond matrix:
    Mat Object: (sub_) 1 MPI processes
      type: seqaij
      rows=1091, cols=1091
      total: nonzeros=19945, allocated nonzeros=19945
      total number of mallocs used during MatSetValues calls=0
        using I-node routines: found 364 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object: 4 MPI processes
    type: mpiaij
    rows=4362, cols=4362
    total: nonzeros=88470, allocated nonzeros=88470
    total number of mallocs used during MatSetValues calls=0
      using I-node (on process 0) routines: found 364 nodes, limit used is 5
Solver diverged, reason -11. Modifying the solver in order to improve the convergence...
MatNorm: 3.04644e+06    VecNorm: 2.81116e-06
Linear solve converged due to CONVERGED_ITS iterations 1
Solver converged within 1 iterations.
Newton iteration: 2 - L2 Position Norm: 1.868159E-12 - L2 Pressure Norm: 2.037029E-07
Memory used by each processor: 36.808594 Mb

----------------------- TIME STEP = 922, time = 0.184400  -----------------------

Mesh Regenerated. Elapsed time: 0.019474
Isolated nodes: 1
Assemble Linear System. Elapsed time: 0.030308
MatNorm: 3.04642e+06    VecNorm: 1305.09
  0 KSP unpreconditioned resid norm 1.466597558465e+04 true resid norm           -nan ||r(i)||/||b||           -nan
  1 KSP unpreconditioned resid norm 3.992657613692e+02 true resid norm           -nan ||r(i)||/||b||           -nan
  2 KSP unpreconditioned resid norm 6.865492930467e+01 true resid norm           -nan ||r(i)||/||b||           -nan
  3 KSP unpreconditioned resid norm 1.488490448891e+01 true resid norm           -nan ||r(i)||/||b||           -nan
  4 KSP unpreconditioned resid norm 6.459160528254e+00 true resid norm           -nan ||r(i)||/||b||           -nan
  5 KSP unpreconditioned resid norm 2.684190657780e+00 true resid norm           -nan ||r(i)||/||b||           -nan
  6 KSP unpreconditioned resid norm 1.583730558735e+00 true resid norm           -nan ||r(i)||/||b||           -nan
  7 KSP unpreconditioned resid norm 7.857636392042e-01 true resid norm           -nan ||r(i)||/||b||           -nan
  8 KSP unpreconditioned resid norm 5.609287021479e-01 true resid norm           -nan ||r(i)||/||b||           -nan
  9 KSP unpreconditioned resid norm 4.240869629805e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 10 KSP unpreconditioned resid norm 3.545861070917e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 11 KSP unpreconditioned resid norm 2.796829041968e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 12 KSP unpreconditioned resid norm 2.415853017221e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 13 KSP unpreconditioned resid norm 1.933876557197e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 14 KSP unpreconditioned resid norm 1.820288353613e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 15 KSP unpreconditioned resid norm 1.657259644747e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 16 KSP unpreconditioned resid norm 1.563463788745e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 17 KSP unpreconditioned resid norm 1.272726963049e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 18 KSP unpreconditioned resid norm 1.137797079759e-01 true resid norm           -nan ||r(i)||/||b||           -nan
 19 KSP unpreconditioned resid norm 8.582335118209e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 20 KSP unpreconditioned resid norm 7.628931493998e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 21 KSP unpreconditioned resid norm 5.901409359786e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 22 KSP unpreconditioned resid norm 5.496262106550e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 23 KSP unpreconditioned resid norm 4.367683601600e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 24 KSP unpreconditioned resid norm 3.767769610963e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 25 KSP unpreconditioned resid norm 2.758466841864e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 26 KSP unpreconditioned resid norm 2.401068925144e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 27 KSP unpreconditioned resid norm 1.918366114227e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 28 KSP unpreconditioned resid norm 1.796891532704e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 29 KSP unpreconditioned resid norm 1.646774691070e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 30 KSP unpreconditioned resid norm 1.581043087339e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 31 KSP unpreconditioned resid norm 1.451402784393e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 32 KSP unpreconditioned resid norm 1.365719226793e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 33 KSP unpreconditioned resid norm 1.221815466293e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 34 KSP unpreconditioned resid norm 1.170507483612e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 35 KSP unpreconditioned resid norm 1.112121419983e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 36 KSP unpreconditioned resid norm 1.041368299534e-02 true resid norm           -nan ||r(i)||/||b||           -nan
 37 KSP unpreconditioned resid norm 8.898468360233e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 38 KSP unpreconditioned resid norm 7.828540090048e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 39 KSP unpreconditioned resid norm 6.804894322652e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 40 KSP unpreconditioned resid norm 5.932441731922e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 41 KSP unpreconditioned resid norm 5.038590720204e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 42 KSP unpreconditioned resid norm 4.352003569050e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 43 KSP unpreconditioned resid norm 3.340851172402e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 44 KSP unpreconditioned resid norm 2.489084471832e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 45 KSP unpreconditioned resid norm 1.982062096221e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 46 KSP unpreconditioned resid norm 1.543532665899e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 47 KSP unpreconditioned resid norm 1.041250067680e-03 true resid norm           -nan ||r(i)||/||b||           -nan
 48 KSP unpreconditioned resid norm 7.072998665082e-04 true resid norm           -nan ||r(i)||/||b||           -nan
 49 KSP unpreconditioned resid norm 4.326826499956e-04 true resid norm           -nan ||r(i)||/||b||           -nan
 50 KSP unpreconditioned resid norm 3.114665876716e-04 true resid norm           -nan ||r(i)||/||b||           -nan
 51 KSP unpreconditioned resid norm 1.971230239174e-04 true resid norm           -nan ||r(i)||/||b||           -nan
 52 KSP unpreconditioned resid norm 1.513573312329e-04 true resid norm           -nan ||r(i)||/||b||           -nan
 53 KSP unpreconditioned resid norm 8.825285013709e-05 true resid norm           -nan ||r(i)||/||b||           -nan
Linear solve converged due to CONVERGED_RTOL iterations 53
KSP Object: 4 MPI processes
  type: fgmres
    restart=100, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=500, initial guess is zero
  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: bjacobi
    number of blocks = 4
    Local solver information for first block is in the following KSP and PC objects on rank 0:
    Use -ksp_view ::ascii_info_detail to display information for all blocks
  KSP Object: (sub_) 1 MPI processes
    type: preonly
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using NONE norm type for convergence test
  PC Object: (sub_) 1 MPI processes
    type: lu
      out-of-place factorization
      tolerance for zero pivot 2.22045e-14
      matrix ordering: nd
      factor fill ratio given 5., needed 3.8053
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=1089, cols=1089
            package used to perform factorization: petsc
            total: nonzeros=77571, allocated nonzeros=77571
              using I-node routines: found 363 nodes, limit used is 5
    linear system matrix = precond matrix:
    Mat Object: (sub_) 1 MPI processes
      type: seqaij
      rows=1089, cols=1089
      total: nonzeros=20385, allocated nonzeros=20385
      total number of mallocs used during MatSetValues calls=0
        using I-node routines: found 363 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object: 4 MPI processes
    type: mpiaij
    rows=4353, cols=4353
    total: nonzeros=88389, allocated nonzeros=88389
    total number of mallocs used during MatSetValues calls=0
      using I-node (on process 0) routines: found 363 nodes, limit used is 5
Solver converged within 53 iterations. Elapsed time: 0.112512
Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF