[petsc-users] Floating point exception
Barry Smith
bsmith at mcs.anl.gov
Sat Apr 25 20:03:25 CDT 2015
If this is what you got in your last run
> at ../../gas_advection/velocity_g.F90:1344
> 1344 cinfrt = cinfrt_dg(i1) * diff(ic,idim) !diff is a very small value, e.g., 1.0d-316
then it is still trapping floating point underflow, which we do not want. This means either the change I suggested you make in fp.c did not take effect, or your build actually uses a different floating point trap than that one. BTW: absurd numbers like 1.0d-316 are often a symptom of uninitialized data; could it be that diff is not filled correctly for all the ic, idim you are using?
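For reference, here is a minimal sketch in C of why the two values quoted in this thread can trip an underflow trap. It is not PETSc code, and the function names is_subnormal, runtime_product and report_product are made up for illustration. The value of diff is below DBL_MIN, so it is a subnormal double, and its product with cinfrt_dg is too small to represent at all, so under default IEEE rounding it should come out as zero. A tiny, inexact result of that kind is what the underflow exception reports.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 if x is nonzero but smaller in magnitude than DBL_MIN,
   the smallest normalized double (about 2.2e-308). */
int is_subnormal(double x)
{
    return x != 0.0 && x > -DBL_MIN && x < DBL_MIN;
}

/* Multiplies through volatile operands so the compiler must do the
   multiplication at run time instead of folding it away. */
double runtime_product(double a, double b)
{
    volatile double va = a;
    volatile double vb = b;
    volatile double result = va * vb;
    return result;
}

/* Prints both operands, whether each is subnormal, and their product. */
void report_product(double cinfrt_dg, double diff)
{
    double cinfrt = runtime_product(cinfrt_dg, diff);
    printf("cinfrt_dg = %.17g (subnormal: %d)\n", cinfrt_dg, is_subnormal(cinfrt_dg));
    printf("diff      = %.17g (subnormal: %d)\n", diff, is_subnormal(diff));
    printf("cinfrt    = %.17g\n", cinfrt);
}
```

Calling report_product(1.0250235986806329e-8, 8.6178408169776945e-317) from a small main() should flag diff as subnormal, assuming default IEEE rounding and no flush-to-zero compiler options. The same check is a cheap way to scan an array such as diff for suspicious entries before the multiplication is reached.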
This going round and round is very frustrating and a waste of time. You need to be more proactive yourself and explore the code and poke around to figure out how to solve the problem.
Please email $PETSC_DIR/$PETSC_ARCH/include/petscvariables.h so I can see what FP trap is being used on your machine.
Barry
> On Apr 25, 2015, at 2:24 PM, Danyang Su <danyang.su at gmail.com> wrote:
>
>
>
> On 15-04-25 11:55 AM, Barry Smith wrote:
>>> On Apr 25, 2015, at 1:51 PM, Danyang Su <danyang.su at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On 15-04-25 11:32 AM, Barry Smith wrote:
>>>
>>>> I told you this yesterday.
>>>>
>>>> It is probably stopping here on a harmless underflow. You need to edit the PETSc code to not worry about underflow.
>>>>
>>>> Edit the file /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c and locate
>>>>
>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>> _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT);
>>>> #else
>>>>
>>>> change it to
>>>>
>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>> _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT | _MM_MASK_UNDERFLOW);
>>>> #else
>>>>
>>>> Then run make gnumake in the PETSc directory to compile the new version. Now link and run the program again with -fp_trap and see where it gets stuck this time.
>>>>
>>>> Did you do this?
>>>>
>>>> Barry
>>>>
>>> Yes, I did change the code in fp.c and ran 'make gnumake' in the PETSc directory. I just double-checked and ran make gnumake again and got the following information this time.
>>>
>>>
>>> dsu at nwmop:~/Soft/PETSc/petsc-3.5.2$
>>> make gnumake
>>> Building PETSc using GNU Make with 10 build threads
>>> ==========================================
>>> make[1]: Entering directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>> make[1]: Nothing to be done for `all'.
>>> make[1]: Leaving directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>> =========================================
>>>
>>>
>>> Then I recompiled the codes, ran with -fp_trap and still got the following error
>>>
>>> Backtrace for this error:
>>> Note: The EXACT line numbers in the stack are not available,
>>> [2]PETSC ERROR: INSTEAD the line number of the start of the function
>>> [2]PETSC ERROR: is given.
>>> [2]PETSC ERROR: [2] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>> INSTEAD the line number of the start of the function
>>> [3]PETSC ERROR: is given.
>>> [3]PETSC ERROR: [3] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>> [2]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>> [3]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>>
>>>
>> This is different than what you sent a few minutes ago, where it crashed in hypre.
>>
>> Anyways you need to use the -start_in_debugger business I sent in the previous email to see the exact place the problem occurs.
>>
> Here is the information shown on gdb screen
>
> Program received signal SIGFPE, Arithmetic exception.
> 0x00000000006c2bef in velocity_g (l_sufx=1, suffix=..., nmax=12, njamxc=34,
> cinfradx=..., radial_coordx=.FALSE., _suffix=3)
> at ../../gas_advection/velocity_g.F90:1344
> 1344 cinfrt = cinfrt_dg(i1) * diff(ic,idim) !diff is a very small value, e.g., 1.0d-316
> (gdb)
>
> After typing cont at the gdb prompt, I got the error information below
>
> [1]PETSC ERROR: *** unknown floating point error occurred ***
> [1]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
> [1]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [1]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [1]PETSC ERROR: Try option -start_in_debugger
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [1]PETSC ERROR: INSTEAD the line number of the start of the function
> [1]PETSC ERROR: is given.
> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
> [1]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
> [0]PETSC ERROR: *** unknown floating point error occurred ***
> [0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [0]PETSC ERROR: Try option -start_in_debugger
> [0]PETSC ERROR: likely location of problem given in stack below
>
> Thanks,
>
> Danyang
>>
>>> Thanks,
>>>
>>> Danyang
>>>
>>>>> On Apr 25, 2015, at 1:05 AM, Danyang Su <danyang.su at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Barry and Satish,
>>>>>
>>>>> How can I get rid of the unknown floating point error that occurs when a very small value is multiplied?
>>>>>
>>>>> e.g.,
>>>>> cinfrt_dg(i1) and diff(ic,idim) are 1.0250235986806329E-008 8.6178408169776945E-317 respectively,
>>>>>
>>>>> cinfrt = cinfrt_dg(i1) * diff(ic,idim)
>>>>>
>>>>> I get the following error when I run with "-fp_trap -start_in_debugger".
>>>>>
>>>>> Backtrace for this error:
>>>>> *** unknown floating point error occurred ***
>>>>> [2]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
>>>>> [2]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>>>>> [2]PETSC ERROR: cinfrt_dg(i1),diff(ic,idim) 1.0250235986806329E-008 8.6178408169776945E-317
>>>>> where the result is a bitwise OR of the following flags:
>>>>> [2]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>> [2]PETSC ERROR: Try option -start_in_debugger
>>>>> [2]PETSC ERROR: likely location of problem given in stack below
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Danyang
>>>>>
>>>>> On 15-04-24 01:54 PM, Danyang Su wrote:
>>>>>
>>>>>> On 15-04-24 01:23 PM, Satish Balay wrote:
>>>>>>
>>>>>>> c 4 1.0976214263087059E-067
>>>>>>>
>>>>>>> I don't think this number can be stored in a real*4.
>>>>>>>
>>>>>>> Satish
>>>>>>>
>>>>>> Thanks, Satish. It is caused by this number.
>>>>>>
>>>>>>> On Fri, 24 Apr 2015, Danyang Su wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 15-04-24 11:12 AM, Barry Smith wrote:
>>>>>>>>
>>>>>>>>>> On Apr 24, 2015, at 1:05 PM, Danyang Su <danyang.su at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> One of my cases crashes with a floating point exception when using 4
>>>>>>>>>> processors, as shown below. If I run this case with 1 processor, it
>>>>>>>>>> works fine. I have tested the code with around 100 cases on up to 768
>>>>>>>>>> processors, and all other cases work fine. Could this kind of error
>>>>>>>>>> be caused by a NaN in the Jacobian matrix, RHS or preconditioner?
>>>>>>>>>>
>>>>>>>>> Yes, almost for sure it is one of these places.
>>>>>>>>>
>>>>>>>>> First run the bad case with -fp_trap; if all goes well you'll see the
>>>>>>>>> function where the FPE is generated. Then also run with -start_in_debugger
>>>>>>>>> and type cont in all four debugger windows. When the FPE happens the
>>>>>>>>> debugger should stop, showing exactly where the FPE happens.
>>>>>>>>>
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>> Hi Barry,
>>>>>>>>
>>>>>>>> If I run with -fp_trap -start_in_debugger, I get the following error
>>>>>>>>
>>>>>>>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>>>>>>>> [0]PETSC ERROR: The specific exception can be determined by running in a
>>>>>>>> debugger. When the
>>>>>>>> [0]PETSC ERROR: debugger traps the signal, the exception can be found with
>>>>>>>> fetestexcept(0x3d)
>>>>>>>> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>>>>>>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>>>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger
>>>>>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>>>>> [0]PETSC ERROR: --------------------- Stack Frames
>>>>>>>> ------------------------------------
>>>>>>>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>>>>> [0]PETSC ERROR: INSTEAD the line number of the start of the function
>>>>>>>> [0]PETSC ERROR: is given.
>>>>>>>> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 379
>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>>>>>> [0]PETSC ERROR: User provided function() line 0 in Unknown file trapped
>>>>>>>> floating point error
>>>>>>>>
>>>>>>>> Program received signal SIGABRT: Process abort signal.
>>>>>>>>
>>>>>>>> Backtrace for this error:
>>>>>>>> #0 0x7F4FEAB1C7D7
>>>>>>>> #1 0x7F4FEAB1CDDE
>>>>>>>> #2 0x7F4FE9E1AD3F
>>>>>>>> #3 0x7F4FE9E1ACC9
>>>>>>>> #4 0x7F4FE9E1E0D7
>>>>>>>> #5 0x7F4FEB0B6DCB
>>>>>>>> #6 0x7F4FEB0B1825
>>>>>>>> #7 0x7F4FEB0B817F
>>>>>>>> #8 0x7F4FE9E1AD3F
>>>>>>>> #9 0x6972C8 in tprfrtlc_ at tprfrtlc.F90:2393 (discriminator 3)
>>>>>>>> #10 0x4C6C87 in gcreact_ at gcreact.F90:678
>>>>>>>> #11 0x707E19 in initicrt_ at initicrt.F90:589
>>>>>>>> #12 0x4F42D0 in initprob_ at initprob.F90:430
>>>>>>>> #13 0x5AAF72 in driver_pc at driver_pc.F90:438
>>>>>>>>
>>>>>>>> I checked the code at tprfrtlc.F90:2393,
>>>>>>>>
>>>>>>>> realbuffer_gb(1:nvars) = (/time,(c(ic),ic=1,nc-1), &
>>>>>>>> (cx(ix),ix=1,nxout)/)
>>>>>>>>
>>>>>>>> All the values (time, c, cx) are reasonable, as shown below. The only
>>>>>>>> possibility is that realbuffer_gb is declared as real*4 when using single
>>>>>>>> precision output while time, c, cx are declared as real*8. I have a lot of
>>>>>>>> similar data conversions from real*8 to real*4 for output, and the other
>>>>>>>> code does not return an error.
>>>>>>>>
>>>>>>>> time 0.0000000000000000
>>>>>>>> c 1 9.9999999999999995E-008
>>>>>>>> c 2 3.1555251077549618E-003
>>>>>>>> c 3 7.1657814842179362E-008
>>>>>>>> c 4 1.0976214263087059E-067
>>>>>>>> c 5 5.2879822292305797E-004
>>>>>>>> c 6 9.9999999999999964E-005
>>>>>>>> c 7 6.4055731968811337E-005
>>>>>>>> c 8 3.4607572892578404E-020
>>>>>>>> cx 1 3.4376650636008101E-005
>>>>>>>> cx 2 7.3989678854017763E-012
>>>>>>>> cx 3 9.5317170613607207E-012
>>>>>>>> cx 4 2.2344525794718353E-015
>>>>>>>> cx 5 3.0624685689695889E-008
>>>>>>>> cx 6 1.0046157902783967E-007
>>>>>>>> cx 7 1.5320169154914984E-004
>>>>>>>> cx 8 8.6930292776346176E-014
>>>>>>>> cx 9 3.5944267559348721E-005
>>>>>>>> cx 10 3.0072645866951157E-018
>>>>>>>> cx 11 2.3592486321095017E-013
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Danyang
>>>>>>>>
>>>>>>>>
>>>>>>>>>> I can check all the entries of the Jacobian matrix to see if the values are
>>>>>>>>>> valid, but this does not seem like a good idea as it takes a long time to
>>>>>>>>>> reach this point. If I restart the simulation from a specified time (e.g.,
>>>>>>>>>> 7.685 in this case), then the error does not occur.
>>>>>>>>>>
>>>>>>>>>> Would you please give me any suggestion on debugging this case?
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>
>>>>>>>>>> Danyang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> timestep: 2730 time: 7.665E+00 years delt: 1.000E-02 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> timestep: 2731 time: 7.675E+00 years delt: 1.000E-02 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> timestep: 2732 time: 7.685E+00 years delt: 1.000E-02 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> timestep: 2733 time: 7.695E+00 years delt: 1.000E-02 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> timestep: 2734 time: 7.705E+00 years delt: 1.000E-02 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> Reduce time step for reactive transport
>>>>>>>>>> timestep: 2734 time: 7.700E+00 years delt: 5.000E-03 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> Reduce time step for reactive transport
>>>>>>>>>> timestep: 2734 time: 7.697E+00 years delt: 2.500E-03 years iter: 1
>>>>>>>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>> [1]PETSC ERROR: --------------------- Error Message
>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>> [1]PETSC ERROR: Floating point exception
>>>>>>>>>> [2]PETSC ERROR: --------------------- Error Message
>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>> [2]PETSC ERROR: Floating point exception
>>>>>>>>>> [2]PETSC ERROR: Vec entry at local location 0 is not-a-number or infinite
>>>>>>>>>> at end of function: Parameter number 3
>>>>>>>>>> [2]PETSC ERROR: See
>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>>>>>>>
>>>>>>>>>> for trouble shooting.
>>>>>>>>>> [2]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>>>>>>>> [2]PETSC ERROR: [1]PETSC ERROR: Vec entry at local location 0 is
>>>>>>>>>> not-a-number or infinite at end of function: Parameter number 3
>>>>>>>>>> [1]PETSC ERROR: See
>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>>>>>>>
>>>>>>>>>> for trouble shooting.
>>>>>>>>>> [1]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>>>>>>>> [1]PETSC ERROR: ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by
>>>>>>>>>> dsu Thu Apr 23 15:38:52 2015
>>>>>>>>>> [1]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>>>>>>>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>>>>>>>>>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>>>>>>>>>> --download-parmetis --download-scalapack
>>>>>>>>>> [1]PETSC ERROR: #1 VecValidValues() line 34 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>>>>>>>> ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu Apr 23
>>>>>>>>>> 15:38:52 2015
>>>>>>>>>> [2]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>>>>>>>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>>>>>>>>>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>>>>>>>>>> --download-parmetis --download-scalapack
>>>>>>>>>> [2]PETSC ERROR: #1 VecValidValues() line 34 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>>>>>>>> [2]PETSC ERROR: [1]PETSC ERROR: #2 PCApply() line 442 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>>>>>>>> [1]PETSC ERROR: #2 PCApply() line 442 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>>>>>>>> [2]PETSC ERROR: #3 KSP_PCApply() line 230 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>>>>>>>> #3 KSP_PCApply() line 230 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>>>>>>>> [1]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>>>>>>>> [2]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>>>>>>>> [1]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>>>>>>>> [2]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>>>>>>>> [2]PETSC ERROR: #6 KSPSolve() line 459 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> [1]PETSC ERROR: #6 KSPSolve() line 459 in
>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> ^C[mpiexec at nwmop] Sending Ctrl-C to processes as requested
>>>>>>>>>> [mpiexec at nwmop] Press Ctrl-C again to force abort
>>>>>>>>>>
>