[petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Barry Smith bsmith at petsc.dev
Fri Apr 1 17:27:00 CDT 2022


  I'll take a look at it this weekend. The computed preconditioned residual norm is a real number so I am not sure where PETSc will be able to detect the problem appropriately before it is too late.
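
  In the meantime, one possible guard on the application side is to re-check the true residual and the solution after each solve, whatever convergence reason was reported. The sketch below is plain Python for illustration only, not PETSc code. The function names, the arguments, and the convention that a negative `reason` means divergence (mirroring how KSPConvergedReason is used in this thread) are assumptions made up for the example.

```python
import math


def all_finite(values):
    """Return True only if every entry is finite (no NaN, no Inf)."""
    return all(math.isfinite(v) for v in values)


def accept_solve(reason, true_residual_norm, solution):
    """Decide whether the result of a linear solve can be trusted.

    reason: integer code; negative is assumed to mean "diverged".
    true_residual_norm: ||b - A x|| computed by the caller after the solve.
    solution: iterable of floats holding the computed x.
    """
    if reason < 0:
        return False  # the solver itself reported a failure
    if not math.isfinite(true_residual_norm):
        return False  # NaN/Inf true residual: reject even if "converged"
    return all_finite(solution)
```

  With a check like this, a solve that reports convergence while the true residual is NaN would be rejected, and the caller could switch to its fallback solver.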


> On Apr 1, 2022, at 6:14 PM, Giovane Avancini <giavancini at usp.br> wrote:
> 
> Hi Barry, it's me again.
> 
> Sorry to bother you with this issue, but the problem is still happening, now when using KSPIBCGS. As you can see below, even when a NaN pops up in the residual, the solver still converges to an INF solution.
> 
> ----------------------- TIME STEP = 3318, time = 0.663600  -----------------------
> 
> Mesh Regenerated. Elapsed time: 0.018536
> Isolated nodes: 14
> Assemble Linear System. Elapsed time: 0.030077
>   0 KSP preconditioned resid norm 4.087133454416e+04 true resid norm           -nan ||r(i)||/||b||           -nan
>   1 KSP preconditioned resid norm 8.670288259109e+03 true resid norm           -nan ||r(i)||/||b||           -nan
>   2 KSP preconditioned resid norm 4.875596419197e+03 true resid norm           -nan ||r(i)||/||b||           -nan
>   3 KSP preconditioned resid norm 1.226640070761e+03 true resid norm           -nan ||r(i)||/||b||           -nan
>   4 KSP preconditioned resid norm 7.121904546851e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>   5 KSP preconditioned resid norm 5.990560906831e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>   6 KSP preconditioned resid norm 4.256157374933e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>   7 KSP preconditioned resid norm 3.274351035311e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>   8 KSP preconditioned resid norm 2.436138522439e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>   9 KSP preconditioned resid norm 1.268089193578e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  10 KSP preconditioned resid norm 1.093950736015e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  11 KSP preconditioned resid norm 9.950531836062e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  12 KSP preconditioned resid norm 1.066841140901e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  13 KSP preconditioned resid norm 1.003475554456e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  14 KSP preconditioned resid norm 1.073513486989e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  15 KSP preconditioned resid norm 8.724609972930e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  16 KSP preconditioned resid norm 1.445166180332e+02 true resid norm           -nan ||r(i)||/||b||           -nan
>  17 KSP preconditioned resid norm 3.767376396291e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  18 KSP preconditioned resid norm 7.597770355737e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  19 KSP preconditioned resid norm 3.208030402538e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  20 KSP preconditioned resid norm 3.477715841173e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  21 KSP preconditioned resid norm 2.880337856055e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  22 KSP preconditioned resid norm 2.730108581171e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  23 KSP preconditioned resid norm 2.111131168298e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  24 KSP preconditioned resid norm 1.635560497545e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  25 KSP preconditioned resid norm 1.550914551701e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  26 KSP preconditioned resid norm 1.409066040669e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  27 KSP preconditioned resid norm 1.032086999081e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  28 KSP preconditioned resid norm 1.111168488798e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  29 KSP preconditioned resid norm 9.898696915473e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  30 KSP preconditioned resid norm 1.234283818664e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  31 KSP preconditioned resid norm 2.735222111838e+01 true resid norm           -nan ||r(i)||/||b||           -nan
>  32 KSP preconditioned resid norm 6.431272223321e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  33 KSP preconditioned resid norm 6.320133000091e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  34 KSP preconditioned resid norm 6.568217058049e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  35 KSP preconditioned resid norm 6.483075335206e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  36 KSP preconditioned resid norm 6.419074566626e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  37 KSP preconditioned resid norm 6.372749647101e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  38 KSP preconditioned resid norm 5.920214853455e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  39 KSP preconditioned resid norm 5.953698988377e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  40 KSP preconditioned resid norm 4.009279521077e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  41 KSP preconditioned resid norm 8.407438130288e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  42 KSP preconditioned resid norm 1.924008529878e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  43 KSP preconditioned resid norm 9.126618449455e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  44 KSP preconditioned resid norm 2.747853629308e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  45 KSP preconditioned resid norm 2.556706051040e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  46 KSP preconditioned resid norm 2.427212844835e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  47 KSP preconditioned resid norm 7.630151877379e+00 true resid norm           -nan ||r(i)||/||b||           -nan
>  48 KSP preconditioned resid norm 5.895961768741e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  49 KSP preconditioned resid norm 2.271378954392e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  50 KSP preconditioned resid norm 1.779670755839e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  51 KSP preconditioned resid norm 1.488459722777e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  52 KSP preconditioned resid norm 1.479802491212e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  53 KSP preconditioned resid norm 1.316523287251e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  54 KSP preconditioned resid norm 1.347849424457e-01 true resid norm           -nan ||r(i)||/||b||           -nan
>  55 KSP preconditioned resid norm 6.739405576032e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  56 KSP preconditioned resid norm 6.699633313335e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  57 KSP preconditioned resid norm 8.064741830609e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  58 KSP preconditioned resid norm 6.744985187452e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  59 KSP preconditioned resid norm 6.981071339163e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  60 KSP preconditioned resid norm 4.410819986572e-02 true resid norm           -nan ||r(i)||/||b||           -nan
>  61 KSP preconditioned resid norm 4.062281042354e-02 true resid norm           -nan ||r(i)||/||b||           -nan
> Linear solve converged due to CONVERGED_RTOL iterations 61
> Solver converged within 61 iterations. Elapsed time: 0.117009
> Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF
> Memory used by each processor: 47.843750 Mb
> 
> Could you please check if the issue can be fixed the same way as you did for the GMRES family solvers? Thanks in advance,
> 
> Kind regards,
> 
> Giovane
> 
> On Tue, Mar 8, 2022 at 01:05, Barry Smith <bsmith at petsc.dev> wrote:
> 
>   I ran with -info and get repeated 
> 
> MatPivotCheck_none(): Detected zero pivot in factorization in row 2547 value 0. tolerance 2.22045e-14
> 
> after the first linear solve failure. The values are always slightly different. My conclusion is that from this point on the default factorization is truly failing each time which is why it is always switching the linear solver. 
> 
>   Barry
> 
> 
>> On Mar 7, 2022, at 7:01 PM, Giovane Avancini <giavancini at usp.br> wrote:
>> 
>> Sorry, I forgot to attach the file.
>> 
>> On Mon, Mar 7, 2022 at 21:01, Giovane Avancini <giavancini at usp.br> wrote:
>> Thanks Barry! I included the piece of code you sent and now it seems to be working pretty well. It has completed all the 5000 time steps and the solver is indeed triggering the failure when a NaN/Inf is found.
>> 
>> I just noticed a strange behaviour in my code after the patch that was not happening before, so I was wondering whether it could be related to the way you fixed the bug or whether it is a coincidence. Please find attached the log file.
>> 
>> At time step 913 the first failure occurs, and it doesn't print the norms of iteration 0, for instance (before, even when the PC ended up failing during the first KSP iteration, the norms were printed, indicating the NaN). OK, maybe now it detects that a NaN appeared before the norms are actually computed.
>> 
>> What is strange to me is that, after the first failure, all the remaining calls to FGMRES have failed as well, which is unlikely to be the case in my view. Would it be possible that some error flags of FGMRES are not being reset from one call to another, so that after the first iteration of step 913 FGMRES is being called with an error flag already set to true?
>> 
>> Anyway, I really appreciate your efforts in finding the bug and trying to help me, thank you very much!
>> 
>> On Mon, Mar 7, 2022 at 18:08, Barry Smith <bsmith at petsc.dev> wrote:
>> 
>>    The fix for the problem Giovane encountered is in https://gitlab.com/petsc/petsc/-/merge_requests/4934
>> 
>> 
>>> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <giavancini at usp.br> wrote:
>>> 
>>> Sorry for my late reply Barry,
>>> 
>>> Sure, I can share the code with you, but unfortunately I don't know how to make Docker images. If you don't mind, you can clone the code from GitHub through this link: git at github.com:giavancini/runPFEM.git
>>> It can be easily compiled with cmake, and you can see the dependencies in README.md. Please let me know if you need any other information.
>>> 
>>> Kind regards,
>>> 
>>> Giovane
>>> 
>>> On Fri, Feb 25, 2022 at 18:22, Barry Smith <bsmith at petsc.dev> wrote:
>>> 
>>>      Hmm, it is going to be tricky to debug why the Inf/NaN is not found when it should be. 
>>> 
>>>      In a debugger you can catch/trap floating point exceptions (how to do this depends on your debugger) and then step through the code after that to see why PETSc KSP is not properly noting the Inf/NaN and returning. This may be cumbersome to do if you don't know PETSc well. Is your code easy to build, and would you be willing to share it with me so I can run it and debug directly? If you know how to make Docker images or something similar, you might be able to give it to me easily.
>>> 
>>>   Barry
>>> 
>>> 
>>>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <giavancini at usp.br> wrote:
>>>> 
>>>> Mark, Matthew and Barry,
>>>> 
>>>> Thank you all for the quick responses.
>>>> 
>>>> Others might have a better idea, but you could run with '-info :ksp' and see if you see any messages like "Linear solver has created a not a number (NaN) as the residual norm, declaring divergence \n"
>>>> You could also run with -log_trace and see if it is using KSPConvergedDefault. I'm not sure if this is the method used given your parameters, but I think it is.
>>>> Mark, I ran with both options. I didn't get any messages like "linear solver has created a not a number..." when using -info :ksp. When turning on -log_trace, I could verify that it is using KSPConvergedDefault, but what does that mean exactly? When FGMRES converges with the true residual being NaN, I get the following message: [0] KSPConvergedDefault(): Linear solver has converged. Residual norm 8.897908325511e-05 is less than relative tolerance 1.000000000000e-08 times initial right hand side norm 1.466597558465e+04 at iteration 53. No information about NaN whatsoever.
>>>> 
>>>> We check for NaN or Inf, for example, in KSPCheckDot(). If you have the KSP set to error (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html)
>>>> then we throw an error, but the return codes do not seem to be checked in your implementation. If not, then we set the flag for divergence.
>>>> Matthew, I do not check the return code in this case because I don't want PETSc to stop if an error occurs during the solving step. I just want to know that it didn't converge and treat this error inside my code. The problem is that the flag for divergence is not always being set when FGMRES is not converging. I was just wondering why it was set during time step 921 and why not for time step 922 as well.
>>>> 
>>>> Thanks for the complete report. It looks like we may be missing a check in our FGMRES implementation that allows the iteration to continue after a NaN/Inf. 
>>>> 
>>>>     I will explain how we handle the checking and then attach a patch that you can apply to see if it resolves the problem.  Whenever our KSP solvers compute a norm we
>>>> check after that calculation to verify that the norm is not an Inf or NaN. This is an inexpensive global check across all MPI ranks because immediately after the norm computation all ranks that share the KSP have the same value. If the norm is an Inf or NaN we "short-circuit" the KSP solve and return immediately with an appropriate not-converged code. A quick eyeball inspection of the FGMRES code found a missing check. 
>>>> 
>>>>    You can apply the attached patch file in the PETSC_DIR with 
>>>> 
>>>> patch -p1 < fgmres.patch
>>>> make libs
>>>> 
>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If so we'll patch our release branch with the fix.
>>>> Thank you for checking this, Barry. I applied the patch exactly the way you instructed, however, the problem is still happening. Is there a way to check if the patch was in fact applied? You can see in the attached screenshot the terminal information.
>>>> 
>>>> Kind regards,
>>>> 
>>>> Giovane
>>>> 
>>>> On Fri, Feb 25, 2022 at 13:48, Barry Smith <bsmith at petsc.dev> wrote:
>>>> 
>>>>   Giovane,
>>>> 
>>>>     Thanks for the complete report. It looks like we may be missing a check in our FGMRES implementation that allows the iteration to continue after a NaN/Inf. 
>>>> 
>>>>     I will explain how we handle the checking and then attach a patch that you can apply to see if it resolves the problem.  Whenever our KSP solvers compute a norm we
>>>> check after that calculation to verify that the norm is not an Inf or NaN. This is an inexpensive global check across all MPI ranks because immediately after the norm computation all ranks that share the KSP have the same value. If the norm is an Inf or NaN we "short-circuit" the KSP solve and return immediately with an appropriate not-converged code. A quick eyeball inspection of the FGMRES code found a missing check. 
>>>> 
>>>>    You can apply the attached patch file in the PETSC_DIR with 
>>>> 
>>>> patch -p1 < fgmres.patch
>>>> make libs
>>>> 
>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If so we'll patch our release branch with the fix.
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>> 
>>>>> Giovane
>>>>   
>>>> 
>>>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>> 
>>>>> Dear PETSc users,
>>>>> 
>>>>> I'm working on an in-house code that solves the Navier-Stokes equations in a Lagrangian fashion for free surface flows. Because of the large distortions and pressure gradients, it is quite common to encounter issues with iterative solvers at some time steps, so I implemented a function that changes the solver type based on the KSPConvergedReason flag. If this flag is negative after a call to KSPSolve, I solve the same linear system again using a direct method.
>>>>> 
>>>>> The problem is that, sometimes, KSP keeps converging even though the residual is NaN, so I'm not able to identify the problem and change the solver. This leads to a solution vector equal to INF, and obviously the code ends up crashing. Is it normal to observe this kind of behaviour?
>>>>> 
>>>>> Please find attached the log produced with the options -ksp_monitor_lg_residualnorm -ksp_log -ksp_view -ksp_monitor_true_residual -ksp_converged_reason and the function that changes the solver. I'm currently using FGMRES and BJACOBI preconditioner with LU for each block. The problem still happens with ILU for example. We can see in the log file that for the time step 921, the true residual is NaN and within just one iteration, the solver fails and it gives the reason DIVERGED_PC_FAILED. I simply changed the solver to MUMPS and it converged for that time step. However, when solving time step 922 we can see that FGMRES converges while the true residual is NaN. Why is that possible? I would appreciate it if someone could clarify this issue to me.
>>>>> 
>>>>> Kind regards,
>>>>> Giovane
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Giovane Avancini
>>>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, USP
>>>>> 
>>>>> PhD researcher in Structural Engineering - School of Engineering of São Carlos. USP
>>>>> <function.txt><log.txt>
>>>> 
>>>> <log.txt><patch.png>
>>> 
>> <log.txt>
> 
> 
> 
> -- 
> Giovane Avancini
> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, USP
> 
> PhD researcher in Structural Engineering - School of Engineering of São Carlos. USP
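
The norm check described in the quoted thread (test every freshly computed norm for NaN/Inf and return at once with a not-converged code) can be sketched on a toy solver. This is plain Python for illustration, not the PETSc implementation: the Richardson iteration, the helper names, and the reason codes below are all hypothetical placeholders chosen for the example.

```python
import math

# Hypothetical reason codes for this sketch only (positive = converged).
CONVERGED_RTOL = 2
DIVERGED_ITS = -3
DIVERGED_NANORINF = -9


def norm2(v):
    """Euclidean norm; NaN or Inf entries propagate into the result."""
    return math.sqrt(sum(x * x for x in v))


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def richardson(A, b, omega=0.1, rtol=1e-8, max_it=1000):
    """Toy Richardson iteration x <- x + omega * (b - A x).

    Returns (x, reason, iterations). Every computed norm is checked
    immediately, so a NaN/Inf stops the solve with a negative reason.
    """
    x = [0.0] * len(b)
    bnorm = norm2(b)
    if not math.isfinite(bnorm):
        return x, DIVERGED_NANORINF, 0
    for it in range(1, max_it + 1):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        rnorm = norm2(r)
        if not math.isfinite(rnorm):
            return x, DIVERGED_NANORINF, it  # short-circuit on NaN/Inf
        if rnorm <= rtol * bnorm:
            return x, CONVERGED_RTOL, it
        x = [xi + omega * ri for xi, ri in zip(x, r)]
    return x, DIVERGED_ITS, max_it
```

A caller that switches solvers on a negative reason, as in the original report, would then see the NaN case as a failure rather than as a converged solve. Note that this toy checks the true residual norm directly, so it does not reproduce the situation in the thread where the preconditioned norm stays finite while the true residual is NaN.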
