[petsc-users] error when solving a linear system with gmres + pilut/euclid

Alfredo Jaramillo ajaramillopalma at gmail.com
Tue Aug 25 16:54:37 CDT 2020


Thank you, Barry.

I wasn't able to reproduce the error on my computer or on a second
cluster. On the first cluster, I have requested that X11 be activated on a
node so that I can attach a debugger; that activation (if it is possible at
all) may take some time.
I will keep you informed of any news on that.

kind regards
Alfredo



On Tue, Aug 25, 2020 at 6:46 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   I have submitted a merge request
> https://gitlab.com/petsc/petsc/-/merge_requests/3096 that will make the
> error handling and message clearer in the future.
>
>   Barry
>
>
> On Aug 25, 2020, at 8:55 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com>
> wrote:
>
> In fact, on my machine the code is compiled with the GNU compilers, and on
> the cluster it is compiled with the Intel (2015) compilers. I just ran the
> program with "-fp_trap" and got:
>
> ===============================================================
>    |> Assembling interface problem. Unk # 56
>    |> Solving interface problem
>   Residual norms for interp_ solve.
>   0 KSP Residual norm 3.642615470862e+03
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> [0]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [0]PETSC ERROR: debugger traps the signal, the exception can be found with
> fetestexcept(0x3f)
> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [0]PETSC ERROR: Try option -start_in_debugger
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [1]PETSC ERROR: [2]PETSC ERROR: *** unknown floating point error occurred
> ***
> [3]PETSC ERROR: *** unknown floating point error occurred ***
> [3]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [4]PETSC ERROR: *** unknown floating point error occurred ***
> [4]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [4]PETSC ERROR: [5]PETSC ERROR: *** unknown floating point error occurred
> ***
> [5]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [5]PETSC ERROR: debugger traps the signal, the exception can be found with
> fetestexcept(0x3f)
> [5]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [6]PETSC ERROR: *** unknown floating point error occurred ***
> [6]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [6]PETSC ERROR: debugger traps the signal, the exception can be found with
> fetestexcept(0x3f)
> [6]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [6]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [7]PETSC ERROR: *** unknown floating point error occurred ***
> [7]PETSC ERROR: The specific exception can be determined by running in a
> debugger.  When the
> [7]PETSC ERROR: debugger traps the signal, the exception can be found with
> fetestexcept(0x3f)
> [7]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [7]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [7]PETSC ERROR: Try option -start_in_debugger
> [7]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/sys/error/fp.c
> [0]PETSC ERROR: [0] VecMDot line 1154
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: [0] KSPGMRESClassicalGramSchmidtOrthogonalization line 44
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/borthog2.c
> [0]PETSC ERROR: [0] KSPGMRESCycle line 122
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: [0] KSPSolve_GMRES line 225
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: [0] KSPSolve_Private line 590
> /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> ===============================================================
>
> So it seems that a division by zero is indeed taking place. I will try to
> run this in debug mode.
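>
> For reference, here is a minimal sketch (plain C99, nothing PETSc-specific,
> written only to illustrate the mechanism) of how a flag mask such as the
> fetestexcept() value mentioned above can be decoded:
>
> #include <fenv.h>
> #include <stdio.h>
>
> int main(void)
> {
>   volatile double zero = 0.0;
>   volatile double x = 1.0 / zero;   /* raise FE_DIVBYZERO on purpose */
>   (void)x;
>
>   /* test exactly the flags listed in the PETSc message */
>   int raised = fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW |
>                             FE_UNDERFLOW | FE_INEXACT);
>   if (raised & FE_INVALID)   printf("FE_INVALID\n");
>   if (raised & FE_DIVBYZERO) printf("FE_DIVBYZERO\n");
>   if (raised & FE_OVERFLOW)  printf("FE_OVERFLOW\n");
>   if (raised & FE_UNDERFLOW) printf("FE_UNDERFLOW\n");
>   if (raised & FE_INEXACT)   printf("FE_INEXACT\n");
>   return 0;
> }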
>
> thanks
> Alfredo
>
> On Tue, Aug 25, 2020 at 10:23 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Sounds like it might be a compiler problem generating bad code.
>>
>>   On the machine where it fails you can run with -fp_trap to have it
>> error out as soon as a Nan or Inf appears. If you can use a debugger on
>> that machine, you can tell the debugger to catch floating point exceptions
>> and see the exact line, and the values of the variables, where a Nan or Inf
>> first appears.
>>
>>    As Matt conjectured, it is likely there is a divide by zero before PETSc
>> detects it, and it may be helpful to find out exactly where that happens.
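>>
>>    For example, here is a minimal sketch (assuming a glibc system; this is
>> not a PETSc API) of the mechanism: turning selected floating point
>> exceptions into SIGFPE so that a debugger, or a core dump, stops at the
>> exact line where the Nan/Inf is produced. As far as I know this is
>> essentially what -fp_trap arranges inside PETSc.
>>
>> #define _GNU_SOURCE
>> #include <fenv.h>
>>
>> int main(void)
>> {
>>   /* glibc extension: raise SIGFPE when one of these exceptions occurs,
>>      so the debugger stops at the offending instruction */
>>   feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
>>
>>   volatile double zero = 0.0;
>>   volatile double x = 1.0 / zero;   /* SIGFPE is raised here */
>>   (void)x;
>>   return 0;
>> }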
>>
>>   Barry
>>
>>
>> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com>
>> wrote:
>>
>> Yes, Barry, that is correct.
>>
>>
>>
>> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   On one system you get this error, on another system with the identical
>>> code and test case you do not get the error?
>>>
>>>   You get it with three iterative methods but not with MUMPS?
>>>
>>> Barry
>>>
>>>
>>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <
>>> ajaramillopalma at gmail.com> wrote:
>>>
>>> Hello Barry, Matthew, thanks for the replies !
>>>
>>> Yes, it is our custom code, and it also happens when setting -pc_type
>>> bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type
>>> preonly -ksp_pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>>>
>>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any
>>> problems.
>>>
>>> In trying to reproduce the situation on my computer, I was able to
>>> reproduce the error with a small case and -pc_type bjacobi. For that
>>> particular case, when running on the cluster the error appears at the very
>>> last iteration:
>>>
>>> =====
>>> 27 KSP Residual norm 8.230378644666e-06
>>> [0]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> [0]PETSC ERROR: Invalid argument
>>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>>> ====
>>>
>>> whereas on my computer the error does not appear and convergence is
>>> reached instead:
>>>
>>> ====
>>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>>> ====
>>>
>>> I will run valgrind to look for possible memory corruption.
>>>
>>> thank you
>>> Alfredo
>>>
>>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>    Oh yes, it could happen with a Nan.
>>>>
>>>>    KSPGMRESClassicalGramSchmidtOrthogonalization() calls
>>>> KSPCheckDot(ksp,lhh[j]), so it should detect any Nan that appears and set
>>>> the KSP converged reason, but the call to VecMAXPY() is still made before
>>>> returning, hence the error message.
>>>>
>>>>    We should short-circuit the orthogonalization as soon as it sees a
>>>> Nan/Inf and return immediately, so that GMRES can clean up and produce a
>>>> much more useful error message.
>>>>
>>>>   Alfredo,
>>>>
>>>>     It is also possible that the hypre preconditioners are producing a
>>>> Nan because your matrix is too difficult for them to handle, but it would
>>>> be odd for that to happen after so many iterations.
>>>>
>>>>    As I suggested before, run with -pc_type bjacobi to see if you get
>>>> the same problem.
>>>>
>>>>   Barry
>>>>
>>>>
>>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>
>>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>    Alfredo,
>>>>>
>>>>>       This should never happen. The input to the VecMAXPY in GMRES is
>>>>> computed via VecMDot, which produces the same result on all processes.
>>>>>
>>>>>        If you run with -pc_type bjacobi does it also happen?
>>>>>
>>>>>        Is this your custom code or does it happen in PETSc examples
>>>>> also? Like src/snes/tutorials/ex19 -da_refine 5
>>>>>
>>>>>       Could be memory corruption, can you run under valgrind?
>>>>>
>>>>
>>>> Couldn't it happen if something generates a NaN? That also should not
>>>> happen, but I was allowing that pilut might do it.
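>>>>
>>>> (A small self-contained illustration, my own and not the actual PETSc
>>>> check, of why a Nan would trip a "same on all processes" argument test:
>>>> a Nan compares unequal to everything, including the reduced max of the
>>>> value itself, so such a test fails even when every rank holds the same
>>>> Nan.)
>>>>
>>>> #include <math.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(void)
>>>> {
>>>>   double alpha  = nan("");   /* stand-in for a coefficient fed to VecMAXPY */
>>>>   double maxval = alpha;     /* stand-in for the value reduced across ranks */
>>>>
>>>>   /* a consistency check of this kind compares the local value with the
>>>>      reduction result; with a Nan the comparison fails, so the argument is
>>>>      reported as inconsistent even though all ranks hold the same thing */
>>>>   if (alpha != maxval) printf("flagged: scalar not the same on all processes\n");
>>>>   return 0;
>>>> }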
>>>>
>>>>   Thanks,
>>>>
>>>>     Matt
>>>>
>>>>
>>>>>     Barry
>>>>>
>>>>>
>>>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <
>>>>> ajaramillopalma at gmail.com> wrote:
>>>>> >
>>>>> > Dear PETSc developers,
>>>>> >
>>>>> > I'm trying to solve a linear problem with GMRES preconditioned with
>>>>> pilut from HYPRE. For this I'm using the options:
>>>>> >
>>>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
>>>>> >
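>>>>> > (For completeness, here is a minimal standalone sketch of the same solver
>>>>> > configuration set up programmatically, on a toy 1D Laplacian; this is a
>>>>> > reduced example written only for illustration, not our application code,
>>>>> > and it assumes PETSc was configured with hypre:)
>>>>> >
>>>>> > static char help[] = "1D Laplacian solved with GMRES + hypre pilut.\n";
>>>>> > #include <petscksp.h>
>>>>> >
>>>>> > int main(int argc, char **argv)
>>>>> > {
>>>>> >   Mat            A;
>>>>> >   Vec            x, b;
>>>>> >   KSP            ksp;
>>>>> >   PC             pc;
>>>>> >   PetscInt       i, rstart, rend, n = 100;
>>>>> >   PetscErrorCode ierr;
>>>>> >
>>>>> >   ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
>>>>> >
>>>>> >   /* assemble a toy tridiagonal system in parallel */
>>>>> >   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>> >   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>>>>> >   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>>>> >   ierr = MatSetUp(A);CHKERRQ(ierr);
>>>>> >   ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>>>>> >   for (i = rstart; i < rend; i++) {
>>>>> >     if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>> >     if (i < n-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>> >     ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>> >   }
>>>>> >   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>> >   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>> >   ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
>>>>> >   ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>>>>> >
>>>>> >   /* programmatic equivalent of
>>>>> >      -ksp_type gmres -pc_type hypre -pc_hypre_type pilut */
>>>>> >   ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>>>>> >   ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
>>>>> >   ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
>>>>> >   ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>>>>> >   ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);
>>>>> >   ierr = PCHYPRESetType(pc, "pilut");CHKERRQ(ierr);
>>>>> >   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* -ksp_monitor etc. still apply */
>>>>> >   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>>>>> >
>>>>> >   ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>>>>> >   ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>> >   ierr = VecDestroy(&x);CHKERRQ(ierr);
>>>>> >   ierr = VecDestroy(&b);CHKERRQ(ierr);
>>>>> >   ierr = PetscFinalize();
>>>>> >   return ierr;
>>>>> > }
>>>>> >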
>>>>> > If I use a single core, GMRES (+ pilut or euclid) converges.
>>>>> However, when using multiple cores the following error appears after some
>>>>> number of iterations:
>>>>> >
>>>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument
>>>>> # 3
>>>>> >
>>>>> > related to the function VecMAXPY. I have attached a screenshot with
>>>>> more detailed output. The same happens when using euclid. Could you please
>>>>> give me some insight into this?
>>>>> >
>>>>> > best regards
>>>>> > Alfredo
>>>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>>
>>>>
>>>>
>>>
>>
>