[petsc-users] error when solving a linear system with gmres + pilut/euclid

Alfredo Jaramillo ajaramillopalma at gmail.com
Tue Aug 25 08:55:01 CDT 2020


In fact, on my machine the code is compiled with GNU compilers, and on the
cluster it is compiled with Intel (2015) compilers. I just ran the program
with "-fp_trap" and got:

===============================================================
   |> Assembling interface problem. Unk # 56
   |> Solving interface problem
  Residual norms for interp_ solve.
  0 KSP Residual norm 3.642615470862e+03
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with
fetestexcept(0x3f)
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
FE_UNDERFLOW=0x10 FE_INEXACT=0x20
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
------------------------------------
[1-7]PETSC ERROR: *** unknown floating point error occurred ***
[... the same message, interleaved, was repeated by ranks 1 through 7 ...]
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/sys/error/fp.c
[0]PETSC ERROR: [0] VecMDot line 1154
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: [0] KSPGMRESClassicalGramSchmidtOrthogonalization line 44
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/borthog2.c
[0]PETSC ERROR: [0] KSPGMRESCycle line 122
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: [0] KSPSolve_GMRES line 225
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: [0] KSPSolve_Private line 590
/mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: *** unknown floating point error occurred ***
===============================================================

So it seems that a division by 0 is indeed taking place. I will try to run
this in debug mode.

thanks
Alfredo

On Tue, Aug 25, 2020 at 10:23 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Sounds like it might be a compiler problem generating bad code.
>
>   On the machine where it fails you can run with -fp_trap to have it error
> out as soon as a NaN or Inf appears. If you can use the debugger on that
> machine, you can tell the debugger to catch floating point exceptions and
> see the exact line and the values of variables where a NaN or Inf appears.
>
>    As Matt conjectured, it is likely there is a divide by zero before PETSc
> detects it, and it may be helpful to find out exactly where that happens.
>
>   Barry
>
>
> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com>
> wrote:
>
> Yes, Barry, that is correct.
>
>
>
> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   On one system you get this error, on another system with the identical
>> code and test case you do not get the error?
>>
>>   You get it with three iterative methods but not with MUMPS?
>>
>> Barry
>>
>>
>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <ajaramillopalma at gmail.com>
>> wrote:
>>
>> Hello Barry, Matthew, thanks for the replies !
>>
>> Yes, it is our custom code, and it also happens when setting -pc_type
>> bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type
>> preonly -pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>>
>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any
>> problems.
>>
>> On my computer, I was able to reproduce the error for a small case with
>> -pc_type bjacobi. For that particular case, when running on the cluster the
>> error appears at the very last iteration:
>>
>> =====
>> 27 KSP Residual norm 8.230378644666e-06
>> [0]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [0]PETSC ERROR: Invalid argument
>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>> ====
>>
>> whereas on my computer the error is not raised and convergence is reached
>> instead:
>>
>> ====
>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>> ====
>>
>> I will run valgrind to check for possible memory corruption.
>>
>> thank you
>> Alfredo
>>
>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>    Oh yes, it could happen with a NaN.
>>>
>>>    KSPGMRESClassicalGramSchmidtOrthogonalization()
>>> calls  KSPCheckDot(ksp,lhh[j]); so it should detect any NaN that appears
>>> and set ksp->convergedreason, but the call to MAXPY() is still made before
>>> returning, hence producing the error message.
>>>
>>>    We should short-circuit the orthogonalization as soon as it sees a
>>> NaN/Inf and return immediately so that GMRES can clean up and produce a
>>> very useful error message.
>>>
>>>   Alfredo,
>>>
>>>     It is also possible that the hypre preconditioners are producing a
>>> NaN because your matrix is too difficult for them to handle, but it would
>>> be odd for that to happen after many iterations.
>>>
>>>    As I suggested before run with -pc_type bjacobi to see if you get the
>>> same problem.
>>>
>>>   Barry
>>>
>>>
>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>    Alfredo,
>>>>
>>>>       This should never happen. The input to the VecMAXPY in gmres is
>>>> computed via VecMDot, which produces the same result on all processes.
>>>>
>>>>        If you run with -pc_type bjacobi does it also happen?
>>>>
>>>>        Is this your custom code or does it happen in PETSc examples
>>>> also? Like src/snes/tutorials/ex19 -da_refine 5
>>>>
>>>>       Could be memory corruption, can you run under valgrind?
>>>>
>>>
>>> Couldn't it happen if something generates a NaN? That also should not
>>> happen, but I was allowing that pilut might do it.
>>>
>>>   Thanks,
>>>
>>>     Matt
>>>
>>>
>>>>     Barry
>>>>
>>>>
>>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <
>>>> ajaramillopalma at gmail.com> wrote:
>>>> >
>>>> > Dear PETSc developers,
>>>> >
>>>> > I'm trying to solve a linear problem with GMRES preconditioned with
>>>> pilut from HYPRE. For this I'm using the options:
>>>> >
>>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
>>>> >
>>>> > If I use a single core, GMRES (+ pilut or euclid) converges. However,
>>>> when using multiple cores the following error appears after some number of
>>>> iterations:
>>>> >
>>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument
>>>> # 3
>>>> >
>>>> > coming from the function VecMAXPY. I attached a screenshot with more
>>>> detailed output. The same happens when using euclid. Can you please give
>>>> me some insight on this?
>>>> >
>>>> > best regards
>>>> > Alfredo
>>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>>>
>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>>
>>>
>>
>

