[petsc-users] error when solving a linear system with gmres + pilut/euclid

Barry Smith bsmith at petsc.dev
Tue Aug 25 16:46:37 CDT 2020


  I have submitted a merge request https://gitlab.com/petsc/petsc/-/merge_requests/3096 that will make the error handling and message clearer in the future.

  Barry


> On Aug 25, 2020, at 8:55 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
> 
> In fact, on my machine the code is compiled with gnu, and on the cluster it is compiled with Intel (2015) compilers. I just ran the program with "-fp_trap" and got:
> 
> ===============================================================
>    |> Assembling interface problem. Unk # 56
>    |> Solving interface problem
>   Residual norms for interp_ solve.
>   0 KSP Residual norm 3.642615470862e+03 
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> [0]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3f)
> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [0]PETSC ERROR: Try option -start_in_debugger
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [1]PETSC ERROR: *** unknown floating point error occurred ***
> [... ranks 1-7 repeat the same message, debugger hint, and flag legend as rank 0 ...]
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/sys/error/fp.c
> [0]PETSC ERROR: [0] VecMDot line 1154 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: [0] KSPGMRESClassicalGramSchmidtOrthogonalization line 44 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/borthog2.c
> [0]PETSC ERROR: [0] KSPGMRESCycle line 122 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: [0] KSPSolve_GMRES line 225 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: [0] KSPSolve_Private line 590 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> ===============================================================
> 
> So it seems that in fact a division by 0 is taking place. I will try to run this in debug mode. 
> 
> thanks
> Alfredo
> 
> On Tue, Aug 25, 2020 at 10:23 AM Barry Smith <bsmith at petsc.dev> wrote:
> 
>   Sounds like it might be a compiler problem generating bad code. 
> 
>   On the machine where it fails you can run with -fp_trap to have it error out as soon as a NaN or Inf appears. If you can use the debugger on that machine, you can tell the debugger to catch floating-point exceptions and see the exact line and the values of variables where a NaN or Inf appears.
> 
>    As Matt conjectured, it is likely there is a divide by zero before PETSc detects it, and it may be helpful to find out exactly where that happens.
> 
>   Barry
> 
> 
>> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
>> 
>> Yes, Barry, that is correct.
>> 
>> 
>> 
>> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <bsmith at petsc.dev> wrote:
>> 
>>   On one system you get this error, on another system with the identical code and test case you do not get the error?
>> 
>>   You get it with three iterative methods but not with MUMPS?
>> 
>> Barry
>> 
>> 
>>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
>>> 
>>> Hello Barry, Matthew, thanks for the replies !
>>> 
>>> Yes, it is our custom code, and it also happens when setting -pc_type bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>>> 
>>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any problem.
>>> 
>>> On my computer, I was able to reproduce the error for a small case with -pc_type bjacobi. For that particular case, when running on the cluster the error appears at the very last iteration:
>>> 
>>> =====
>>> 27 KSP Residual norm 8.230378644666e-06 
>>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [0]PETSC ERROR: Invalid argument
>>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>>> ====
>>> 
>>> whereas running on my computer the error is not raised and convergence is reached instead:
>>> 
>>> ====
>>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>>> ====
>>> 
>>> I will run valgrind to check for possible memory corruption.
>>> 
>>> thank you
>>> Alfredo
>>> 
>>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>>> 
>>>    Oh yes, it could happen with a NaN. 
>>> 
>>>    KSPGMRESClassicalGramSchmidtOrthogonalization() calls KSPCheckDot(ksp,lhh[j]), so it should detect any NaN that appears and set ksp->convergedreason, but the call to VecMAXPY() is still made before returning, hence producing the error message.
>>> 
>>>    We should short-circuit the orthogonalization as soon as it sees a NaN/Inf and return immediately, so GMRES can clean up and produce a much more useful error message. 
>>> 
>>>   Alfredo,
>>> 
>>>     It is also possible that the hypre preconditioners are producing a NaN because your matrix is too difficult for them to handle, but it would be odd for that to happen after many iterations.
>>> 
>>>    As I suggested before, run with -pc_type bjacobi to see if you get the same problem.
>>> 
>>>   Barry
>>> 
>>> 
>>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>> 
>>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>> 
>>>>    Alfredo,
>>>> 
>>>>       This should never happen. The input to the VecMAXPY in GMRES is computed via VecMDot, which produces the same result on all processes.
>>>> 
>>>>        If you run with -pc_type bjacobi does it also happen?
>>>> 
>>>>        Is this your custom code or does it happen in PETSc examples also? Like src/snes/tutorials/ex19 -da_refine 5 
>>>> 
>>>>       Could be memory corruption, can you run under valgrind?
>>>> 
>>>> Couldn't it happen if something generates a NaN? That also should not happen, but I was allowing that pilut might do it.
>>>> 
>>>>   Thanks,
>>>> 
>>>>     Matt
>>>>  
>>>>     Barry
>>>> 
>>>> 
>>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
>>>> > 
>>>> > Dear PETSc developers,
>>>> > 
>>>> > I'm trying to solve a linear problem with GMRES preconditioned with pilut from HYPRE. For this I'm using the options:
>>>> > 
>>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
>>>> > 
>>>> > If I use a single core, GMRES (+ pilut or euclid) converges. However, when using multiple cores the following error appears after some number of iterations:
>>>> > 
>>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>>>> > 
>>>> > relative to the function VecMAXPY. I attached a screenshot with more detailed output. The same happens when using euclid. Can you please give me some insight on this?
>>>> > 
>>>> > best regards
>>>> > Alfredo
>>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>> -- Norbert Wiener
>>>> 
>>>> https://www.cse.buffalo.edu/~knepley/
>>> 
>> 
> 


