[petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Barry Smith bsmith at petsc.dev
Fri Feb 25 10:48:22 CST 2022


  Giovane,

    Thanks for the complete report. It looks like we may be missing a check in our FGMRES implementation that allows the iteration to continue after a NaN/Inf. 

    I will explain how we handle the checking and then attach a patch that you can apply to see if it resolves the problem. Whenever our KSP solvers compute a norm, we check after that calculation that the norm is not an Inf or NaN. This is an inexpensive global check across all MPI ranks, because immediately after the norm computation all ranks that share the KSP have the same value. If the norm is an Inf or NaN, we "short-circuit" the KSP solve and return immediately with an appropriate not-converged code. A quick eyeball inspection of the FGMRES code found a missing check.

   You can apply the attached patch file in the PETSC_DIR with 

patch -p1 < fgmres.patch
make libs

then rerun your code and see if it now handles the Inf/NaN correctly. If so, we'll patch our release branch with the fix.

  Barry




> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> Dear PETSc users,
> 
> I'm working on an in-house code that solves the Navier-Stokes equations in a Lagrangian fashion for free-surface flows. Because of the large distortions and pressure gradients, it is quite common to encounter issues with iterative solvers at some time steps, so I implemented a function that changes the solver type based on the flag KSPConvergedReason. If this flag is negative after a call to KSPSolve, I solve the same linear system again using a direct method.
> 
> The problem is that, sometimes, KSP keeps converging even though the residual is NaN. Because of that, I'm not able to detect the problem and change the solver, which leads to a solution vector equal to Inf and the code obviously ends up crashing. Is it normal to observe this kind of behaviour?
> 
> Please find attached the log produced with the options -ksp_monitor_lg_residualnorm -ksp_log -ksp_view -ksp_monitor_true_residual -ksp_converged_reason, and the function that changes the solver. I'm currently using FGMRES and a BJACOBI preconditioner with LU for each block. The problem still happens with ILU, for example. We can see in the log file that for time step 921, the true residual is NaN and within just one iteration the solver fails with the reason DIVERGED_PC_FAILED. I simply changed the solver to MUMPS and it converged for that time step. However, when solving time step 922, we can see that FGMRES converges while the true residual is NaN. Why is that possible? I would appreciate it if someone could clarify this issue for me.
> 
> Kind regards,
> Giovane
> 
> 
> 
> -- 
> Giovane Avancini
> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, USP
> 
> PhD researcher in Structural Engineering - School of Engineering of São Carlos, USP
> <function.txt><log.txt>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fgmres.patch
Type: application/octet-stream
Size: 538 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220225/9e068d25/attachment-0001.obj>
