[petsc-users] Floating point exception
Barry Smith
bsmith at mcs.anl.gov
Sat Apr 25 13:44:06 CDT 2015
The FPE is occurring inside the hypre code. Run with the additional command-line option -start_in_debugger and enter cont in each xterm that comes up with the debugger. When it crashes, type where to see where it crashed, and print the variables to see whether it is a divide by zero, etc.
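Once the debugger has stopped at the trap, the PETSc message quoted below suggests evaluating fetestexcept(0x3d); the mask 0x3d is simply the bitwise OR of the five flags listed in that message. As a rough sketch only (this is not PETSc code, and the helper name describe_fp_flags is made up for illustration), the returned value could be decoded as follows, using the standard <fenv.h> macros rather than hard-coded numbers:

```c
#include <assert.h>
#include <fenv.h>
#include <stdio.h>
#include <string.h>

/* Decode a fetestexcept() result into a space-separated list of flag names.
 * Writes into buf (capacity len) and returns buf.
 * Hypothetical helper for illustration; not part of PETSc or hypre. */
const char *describe_fp_flags(int flags, char *buf, size_t len)
{
    static const struct { int bit; const char *name; } table[] = {
        { FE_INVALID,   "FE_INVALID"   },
        { FE_DIVBYZERO, "FE_DIVBYZERO" },
        { FE_OVERFLOW,  "FE_OVERFLOW"  },
        { FE_UNDERFLOW, "FE_UNDERFLOW" },
        { FE_INEXACT,   "FE_INEXACT"   },
    };
    size_t used = 0;

    assert(buf != NULL);
    if (len == 0) {
        return buf;
    }
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if ((flags & table[i].bit) == 0) {
            continue;               /* this exception was not raised */
        }
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", table[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            break;                  /* stop if the buffer is too small */
        }
        used += (size_t)n;
    }
    return buf;
}
```

For example, with char buf[64], describe_fp_flags(FE_DIVBYZERO | FE_INEXACT, buf, sizeof buf) fills buf with "FE_DIVBYZERO FE_INEXACT". FE_INEXACT is raised by almost any rounded operation, so the other four flags are usually the informative ones.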
Since it crashes in hypre, it is unlikely that the matrix values or the right-hand side had Inf or NaN in them.
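If one still wants to rule out bad input just before the solve, a single linear pass over the local entries is enough. A minimal sketch in plain C, assuming real double-precision scalars and assuming the local values are already available as a double array (for a PETSc Vec, for instance, via VecGetArray); first_nonfinite is a hypothetical helper, not a PETSc routine:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Return the index of the first NaN or Inf entry in values[0..n),
 * or -1 if every entry is finite.
 * Hypothetical helper for illustration; not part of PETSc. */
long first_nonfinite(const double *values, size_t n)
{
    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(values[i])) {   /* true for NaN, +Inf and -Inf */
            return (long)i;
        }
    }
    return -1;
}
```

Calling this on each rank's local part of the right-hand side and printing the rank together with the returned index would show whether, and where, a bad value enters before hypre is applied.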
Barry
> On Apr 25, 2015, at 1:13 PM, Danyang Su <danyang.su at gmail.com> wrote:
>
> Hi Barry,
>
> With -fp_trap and -start_in_debugger options, the code crashed with the following error.
>
> The code at #21 0x41C49A in __solver_dd_MOD_solver_dd_snes_solve_react at solver_ddmethod.F90:2850 is "call KSPSolve(ksp_react,b_react,x_react,ierr)"
>
> I ran this case with 4 processors, and the preconditioner type is HYPRE. Does this mean something is wrong in the matrix of ksp_react or the RHS b_react?
>
> Thanks,
>
> Danyang
>
>
> timestep: 1846 time: 3.392E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
> Reduce time step for reactive transport
> timestep: 1846 time: 3.387E+00 years delt: 5.000E-03 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
> Reduce time step for reactive transport
> timestep: 1846 time: 3.385E+00 years delt: 2.500E-03 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> [0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [0]PETSC ERROR: Try option -start_in_debugger
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR: INSTEAD the line number of the start of the function
> [0]PETSC ERROR: is given.
> [1]PETSC ERROR: *** unknown floating point error occurred ***
> [1]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
> [1]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [1]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [1]PETSC ERROR: Try option -start_in_debugger
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [1]PETSC ERROR: INSTEAD the line number of the start of the function
> [1]PETSC ERROR: is given.
> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
> [1]PETSC ERROR: [1] Hypre solve line 174 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [1]PETSC ERROR: [1] PCApply_HYPRE line 161 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [1]PETSC ERROR: [2]PETSC ERROR: *** unknown floating point error occurred ***
> [2]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
> [2]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [2]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [2]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [2]PETSC ERROR: Try option -start_in_debugger
> [2]PETSC ERROR: likely location of problem given in stack below
> [2]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [2]PETSC ERROR: INSTEAD the line number of the start of the function
> [2]PETSC ERROR: is given.
> [2]PETSC ERROR: [2] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
> [2]PETSC ERROR: [2] Hypre solve line 174 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [2]PETSC ERROR: [2] PCApply_HYPRE line 161 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [2]PETSC ERROR: [2] KSP_PCApply line 228 /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
> [2]PETSC ERROR: [2] KSPInitialResidual line 44 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
> [0]PETSC ERROR: [0] Hypre solve line 174 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [0]PETSC ERROR: [0] PCApply_HYPRE line 161 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/impls/hypre/hypre.c
> [0]PETSC ERROR: [0] KSP_PCApply line 228 /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
> [0]PETSC ERROR: [0] KSPInitialResidual line 44 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
> [0]PETSC ERROR: [0] KSPSolve_GMRES line 224 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
> [1] KSP_PCApply line 228 /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
> [1]PETSC ERROR: [1] KSPInitialResidual line 44 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
> [1]PETSC ERROR: [1] KSPSolve_GMRES line 224 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
> [2]PETSC ERROR: [2] KSPSolve_GMRES line 224 /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
> [2]PETSC ERROR: [0]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
> User provided function() line 0 in Unknown file trapped floating point error
> [1]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x7FDC76F307D7
> #0 0x7FA04C1207D7
> #1 0x7FA04C120DDE
> #1 0x7FDC76F30DDE
> #2 0x7FA04B41ED3F
> #3 0x7FA04B41ECC9
> #2 0x7FDC7622ED3F
> #4 0x7FA04B4220D7
> #0 0x7F622A92F7D7
> #3 0x7FDC7622ECC9
> #5 0x7FA04C6BADCB
> #1 0x7F622A92FDDE
> #4 0x7FDC762320D7
> #6 0x7FA04C6B5825
> #2 0x7F6229C2DD3F
> #7 0x7FA04C6BC17F
> #5 0x7FDC774CADCB
> #8 0x7FA04B41ED3F
> #3 0x7F6229C2DCC9
> #6 0x7FDC774C5825
> #4 0x7F6229C310D7
> #9 0x7FA04D9EF449
> #7 0x7FDC774CC17F
> #10 0x7FA04D9EF055
> #5 0x7F622AEC9DCB
> #8 0x7FDC7622ED3F
> #11 0x7FA04D99D2DD
> #6 0x7F622AEC4825
> #9 0x7FDC787FF449
> #12 0x7FA04D984ACD
> #7 0x7F622AECB17F
> #10 0x7FDC787FF055
> #13 0x7FA04D973E63
> #8 0x7F6229C2DD3F
> #11 0x7FDC787AD2DD
> #14 0x7FA04D27E8E3
> #9 0x7F622C1FE449
> #12 0x7FDC78794ACD
> #15 0x7FA04D2BEB04
> #10 0x7F622C1FE055
> #13 0x7FDC78783E63
> #16 0x7FA04D3CABFA
> #11 0x7F622C1AC2DD
> #17 0x7FA04D3CB927
> #14 0x7FDC7808E8E3
> #12 0x7F622C193ACD
> #18 0x7FA04D361DE8
> #15 0x7FDC780CEB04
> #13 0x7F622C182E63
> #16 0x7FDC781DABFA
> #19 0x7FA04D3A0E1D
> #20 0x7FA04D3DC121
> #14 0x7F622BA8D8E3
> #15 0x7F622BACDB04
> #17 0x7FDC781DB927
> #18 0x7FDC78171DE8
> #16 0x7F622BBD9BFA
> #19 0x7FDC781B0E1D
> #17 0x7F622BBDA927
> #20 0x7FDC781EC121
> #18 0x7F622BB70DE8
> #19 0x7F622BBAFE1D
> #20 0x7F622BBEB121
> #21 0x41C49A in __solver_dd_MOD_solver_dd_snes_solve_react at solver_ddmethod.F90:2850
> #21 0x41C49A in __solver_dd_MOD_solver_dd_snes_solve_react at solver_ddmethod.F90:2850
> #21 0x41C49A in __solver_dd_MOD_solver_dd_snes_solve_react at solver_ddmethod.F90:2850
> #22 0x6A25A5 in reactran_ at reactran.F90:954
> #22 0x6A25A5 in reactran_ at reactran.F90:954
> #22 0x6A25A5 in reactran_ at reactran.F90:954
> #23 0x574836 in timeloop_ at timeloop.F90:1194
> #23 0x574836 in timeloop_ at timeloop.F90:1194
> #23 0x574836 in timeloop_ at timeloop.F90:1194
> #24 0x5ABFD7 in driver_pc at driver_pc.F90:599
> #24 0x5ABFD7 in driver_pc at driver_pc.F90:599
> #24 0x5ABFD7 in driver_pc at driver_pc.F90:599
>
> On 15-04-24 11:12 AM, Barry Smith wrote:
>>> On Apr 24, 2015, at 1:05 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> One of my cases crashes with a floating point exception when using 4 processors, as shown below. If I run this case with 1 processor, it works fine. I have tested the code with around 100 cases on up to 768 processors, and all other cases work fine. Could this kind of error be caused by NaN in the Jacobian matrix, RHS or preconditioner?
>> Yes, almost for sure it is one of these places.
>>
>> First run the bad case with -fp_trap; if all goes well you'll see the function where the FPE is generated. Then also run with -start_in_debugger and
>> type cont in all four debugger windows. When the FPE happens, the debugger should stop, showing exactly where it happens.
>>
>> Barry
>>
>>> I can check all the entries of the Jacobian matrix to see whether the values are valid, but this does not seem like a good idea, as it takes a long time to reach this point. If I restart the simulation from a specified time (e.g., 7.685 in this case), the error does not occur.
>>>
>>> Would you please give me any suggestion on debugging this case?
>>>
>>> Thanks and Regards,
>>>
>>> Danyang
>>>
>>>
>>> timestep: 2730 time: 7.665E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> timestep: 2731 time: 7.675E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> timestep: 2732 time: 7.685E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> timestep: 2733 time: 7.695E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> timestep: 2734 time: 7.705E+00 years delt: 1.000E-02 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> Reduce time step for reactive transport
>>> timestep: 2734 time: 7.700E+00 years delt: 5.000E-03 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> Reduce time step for reactive transport
>>> timestep: 2734 time: 7.697E+00 years delt: 2.500E-03 years iter: 1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [1]PETSC ERROR: Floating point exception
>>> [2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [2]PETSC ERROR: Floating point exception
>>> [2]PETSC ERROR: Vec entry at local location 0 is not-a-number or infinite at end of function: Parameter number 3
>>> [2]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> [2]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>> [2]PETSC ERROR: [1]PETSC ERROR: Vec entry at local location 0 is not-a-number or infinite at end of function: Parameter number 3
>>> [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> [1]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>> [1]PETSC ERROR: ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu Apr 23 15:38:52 2015
>>> [1]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-mumps --download-hypre --download-superlu_dist --download-metis --download-parmetis --download-scalapack
>>> [1]PETSC ERROR: #1 VecValidValues() line 34 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>> ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu Apr 23 15:38:52 2015
>>> [2]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-mumps --download-hypre --download-superlu_dist --download-metis --download-parmetis --download-scalapack
>>> [2]PETSC ERROR: #1 VecValidValues() line 34 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>> [2]PETSC ERROR: [1]PETSC ERROR: #2 PCApply() line 442 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>> [1]PETSC ERROR: #2 PCApply() line 442 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>> [2]PETSC ERROR: #3 KSP_PCApply() line 230 in /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>> #3 KSP_PCApply() line 230 in /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>> [1]PETSC ERROR: #4 KSPInitialResidual() line 63 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>> [2]PETSC ERROR: #4 KSPInitialResidual() line 63 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>> [1]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>> [2]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>> [2]PETSC ERROR: #6 KSPSolve() line 459 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>> [1]PETSC ERROR: #6 KSPSolve() line 459 in /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>> ^C[mpiexec at nwmop] Sending Ctrl-C to processes as requested
>>> [mpiexec at nwmop] Press Ctrl-C again to force abort
>