[petsc-users] Floating point exception
Danyang Su
danyang.su at gmail.com
Mon Apr 27 11:44:32 CDT 2015
Hi Barry,
I got the following arithmetic exception after the previous bug was fixed.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
0x00007f3b23b98f20 in __nanosleep_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) cont
Continuing.
Program received signal SIGFPE, Arithmetic exception.
0x00007f3b260df449 in gselim (A=0x1e4c580, x=0x1d30c40, n=9)
at par_relax.c:3442
3442 x[k] /= A[k*n+k];
(gdb)
I tried both PETSc 3.5.2 and 3.5.3, and they return the same error as
shown above. For 3.5.3, I edited the fp.c file and then ran configure and make.
Thanks,
Danyang
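
The faulting statement in the gdb output above, x[k] /= A[k*n+k], divides by a diagonal entry of a small dense block (n=9), so a zero or NaN pivot is one plausible cause of the SIGFPE. The following is a minimal sketch, not hypre's code, of Gaussian elimination without pivoting that checks each pivot before dividing. The function name gselim_checked and its return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of hypre or PETSc): solve A x = b in place
 * for a dense row-major n-by-n matrix, using Gaussian elimination without
 * pivoting. On entry x holds b; on success x holds the solution.
 * Returns 0 on success, or k+1 if the pivot A[k*n+k] is zero or NaN, so the
 * caller can report the offending row instead of trapping on the division.
 * On failure A and x are left partially modified. */
int gselim_checked(double *A, double *x, int n)
{
    assert(A != NULL && x != NULL && n > 0);

    /* Forward elimination; every pivot is checked before it is used. */
    for (int k = 0; k < n; k++) {
        double pivot = A[k * n + k];
        if (pivot == 0.0 || pivot != pivot) {   /* zero or NaN pivot */
            return k + 1;
        }
        for (int j = k + 1; j < n; j++) {
            double factor = A[j * n + k] / pivot;
            if (factor != 0.0) {
                for (int m = k + 1; m < n; m++) {
                    A[j * n + m] -= factor * A[k * n + m];
                }
                x[j] -= factor * x[k];
            }
        }
    }

    /* Back substitution; this is the kind of division that trapped above. */
    for (int k = n - 1; k >= 0; k--) {
        for (int j = k + 1; j < n; j++) {
            x[k] -= A[k * n + j] * x[j];
        }
        x[k] /= A[k * n + k];
    }
    return 0;
}
```

A nonzero return value identifies the first row whose pivot cannot be divided by, which may help decide whether the block passed to the smoother is singular or already contains invalid entries.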
On 15-04-25 07:34 PM, Danyang Su wrote:
> Hi All,
>
> The "floating point underflow" is caused by a small value divided by a
> very large value. This result is forced to zero and then it does not
> report any underflow problem. I am rerunning this bad case to see if it
> still gets stuck later. This will take a while.
>
> Thanks for all your kind replies,
>
> Danyang
>
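
The operands quoted later in this thread show how such an underflow arises: 8.6178408169776945E-317 is already below the smallest normal double, and its product with 1.0250235986806329E-008 is smaller than any representable double, so it rounds to zero. The earlier real*4 case is the same effect at single precision, since 1.0976214263087059E-067 is far below the smallest normal float. The helpers below are a small sketch for checking values like these; their names are hypothetical and they are not part of PETSc.

```c
#include <float.h>

/* Hypothetical helpers for illustration only (not PETSc API). */

/* Returns 1 if v is nonzero but smaller in magnitude than the smallest
 * normal double (DBL_MIN), i.e. v is a subnormal number. */
int is_subnormal_double(double v)
{
    double mag = v < 0.0 ? -v : v;
    return mag > 0.0 && mag < DBL_MIN;
}

/* Returns 1 if a * b rounds to zero although both factors are nonzero.
 * The volatile store forces the product to be rounded to double. */
int product_underflows_to_zero(double a, double b)
{
    volatile double p = a * b;
    return a != 0.0 && b != 0.0 && p == 0.0;
}

/* Returns 1 if the nonzero double v is too small in magnitude to be stored
 * as a normal single precision (real*4) value. */
int underflows_in_float(double v)
{
    double mag = v < 0.0 ? -v : v;
    return mag > 0.0 && mag < (double)FLT_MIN;
}
```

Checks of this kind only locate the tiny values; whether they are harmless or a sign of uninitialized data still has to be decided from the model itself.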
> On 15-04-25 07:02 PM, Barry Smith wrote:
>> Ok, you do have
>>
>> #ifndef PETSC_HAVE_XMMINTRIN_H
>> #define PETSC_HAVE_XMMINTRIN_H 1
>> #endif
>>
>> so the change you made should cause it to stop trapping underflow
>> exceptions.
>>
>> Now in one email you reported a FPE within hypre, then I asked you
>> to run with -start_in_debugger to determine where it happened exactly
>> and then you reported the FPE happened in user code (what seemed to
>> be an underflow issue). Why is this? Can you not run it where it
>> generated the FPE in hypre using the -start_in_debugger option?
>>
>> Barry
>>
>> Perhaps you have multiple PETSC_ARCH or multiple PETSc installs to
>> explain why you reported two different places where the exception
>> occurred.
>>
>>> On Apr 25, 2015, at 8:31 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>
>>>
>>>
>>> On 15-04-25 06:26 PM, Matthew Knepley wrote:
>>>> On Sat, Apr 25, 2015 at 8:23 PM, Danyang Su <danyang.su at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> On 15-04-25 06:03 PM, Barry Smith wrote:
>>>> If this is what you got in your last run
>>>>
>>>> at ../../gas_advection/velocity_g.F90:1344
>>>> 1344 cinfrt = cinfrt_dg(i1) * diff(ic,idim) !diff is a very small value, e.g., 1.0d-316
>>>> then it is still catching floating point underflow, which we do
>>>> not want. This means either the change I suggested you make in the
>>>> fp.c code didn't work or it actually uses a different floating
>>>> point trap than that one. BTW: absurd numbers like 1.0d-316 are
>>>> often a symptom of uninitialized data; could that be a problem that
>>>> diff is not filled correctly for all the ic, idim you are using?
>>>>
>>>> This going round and round is very frustrating and a waste of
>>>> time. You need to be more proactive yourself and explore the code
>>>> and poke around to figure out how to solve the problem.
>>>>
>>>> Please email $PETSC_DIR/$PETSC_ARCH/include/petscvariables.h so
>>>> I can see what FP trap is being used on your machine.
>>>>
>>>> Barry
>>>> Do you mean $PETSC_DIR/$PETSC_ARCH/conf/petscvariables? Otherwise I
>>>> cannot find this file.
>>>>
>>>> It's include/petscconf.h
>>>> Do I need to reconfigure PETSc after changing the code you
>>>> mentioned?
>>>>
>>>> No, but you need to rebuild.
>>> Yes, I have done 'gnumake'.
>>>> Matt
>>>> Danyang
>>>>
>>>>
>>>>
>>>> On Apr 25, 2015, at 2:24 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 15-04-25 11:55 AM, Barry Smith wrote:
>>>> On Apr 25, 2015, at 1:51 PM, Danyang Su <danyang.su at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On 15-04-25 11:32 AM, Barry Smith wrote:
>>>>
>>>> I told you this yesterday.
>>>>
>>>> It is probably stopping here on a harmless underflow. You need
>>>> to edit the PETSc code to not worry about underflow.
>>>>
>>>> Edit the file /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>> and locate
>>>>
>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>> _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT);
>>>> #else
>>>>
>>>> change it to
>>>>
>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>> _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT | _MM_MASK_UNDERFLOW);
>>>> #else
>>>>
>>>> Then run make gnumake in the PETSc directory to compile the new
>>>> version. Now link and run the program again with -fp_trap and see
>>>> where it gets stuck this time.
>>>>
>>>> Did you do this?
>>>>
>>>> Barry
>>>>
>>>> Yes, I did change the code in fp.c and run 'make gnumake' in the
>>>> PETSc directory. I just did a double check and ran make gnumake
>>>> again and got the following information this time.
>>>>
>>>>
>>>> dsu at nwmop:~/Soft/PETSc/petsc-3.5.2$
>>>> make gnumake
>>>> Building PETSc using GNU Make with 10 build threads
>>>> ==========================================
>>>> make[1]: Entering directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>>> make[1]: Nothing to be done for `all'.
>>>> make[1]: Leaving directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>>> =========================================
>>>>
>>>>
>>>> Then I recompiled the codes, ran with -fp_trap and still got the
>>>> following error
>>>>
>>>> Backtrace for this error:
>>>> Note: The EXACT line numbers in the stack are not available,
>>>> [2]PETSC ERROR: INSTEAD the line number of the start of the
>>>> function
>>>> [2]PETSC ERROR: is given.
>>>> [2]PETSC ERROR: [2] PetscDefaultFPTrap line 379
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>> INSTEAD the line number of the start of the function
>>>> [3]PETSC ERROR: is given.
>>>> [3]PETSC ERROR: [3] PetscDefaultFPTrap line 379
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>> [2]PETSC ERROR: User provided function() line 0 in Unknown file
>>>> trapped floating point error
>>>> [3]PETSC ERROR: User provided function() line 0 in Unknown file
>>>> trapped floating point error
>>>>
>>>>
>>>> This is different than what you sent a few minutes ago where
>>>> it crashed in hypre.
>>>>
>>>> Anyways you need to use the -start_in_debugger business I sent
>>>> in the previous email to see the exact place the problem occurs.
>>>>
>>>> Here is the information shown on gdb screen
>>>>
>>>> Program received signal SIGFPE, Arithmetic exception.
>>>> 0x00000000006c2bef in velocity_g (l_sufx=1, suffix=..., nmax=12,
>>>> njamxc=34,
>>>> cinfradx=..., radial_coordx=.FALSE., _suffix=3)
>>>> at ../../gas_advection/velocity_g.F90:1344
>>>> 1344 cinfrt = cinfrt_dg(i1) * diff(ic,idim) !diff is a very small value, e.g., 1.0d-316
>>>> (gdb)
>>>>
>>>> After typing cont on the gdb screen, I got the error information below
>>>>
>>>> [1]PETSC ERROR: *** unknown floating point error occurred ***
>>>> [1]PETSC ERROR: The specific exception can be determined by running
>>>> in a debugger. When the
>>>> [1]PETSC ERROR: debugger traps the signal, the exception can be
>>>> found with fetestexcept(0x3d)
>>>> [1]PETSC ERROR: where the result is a bitwise OR of the following
>>>> flags:
>>>> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>> [1]PETSC ERROR: Try option -start_in_debugger
>>>> [1]PETSC ERROR: likely location of problem given in stack below
>>>> [1]PETSC ERROR: --------------------- Stack Frames
>>>> ------------------------------------
>>>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>> available,
>>>> [1]PETSC ERROR: INSTEAD the line number of the start of the
>>>> function
>>>> [1]PETSC ERROR: is given.
>>>> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>> [1]PETSC ERROR: User provided function() line 0 in Unknown file
>>>> trapped floating point error
>>>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>>>> [0]PETSC ERROR: The specific exception can be determined by running
>>>> in a debugger. When the
>>>> [0]PETSC ERROR: debugger traps the signal, the exception can be
>>>> found with fetestexcept(0x3d)
>>>> [0]PETSC ERROR: where the result is a bitwise OR of the following
>>>> flags:
>>>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>> [0]PETSC ERROR: Try option -start_in_debugger
>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>
>>>> Thanks,
>>>>
>>>> Danyang
>>>> Thanks,
>>>>
>>>> Danyang
>>>>
>>>> On Apr 25, 2015, at 1:05 AM, Danyang Su <danyang.su at gmail.com>
>>>> wrote:
>>>>
>>>> Hi Barry and Satish,
>>>>
>>>> How can I get rid of the unknown floating point error when a very small
>>>> value is multiplied?
>>>>
>>>> e.g.,
>>>> cinfrt_dg(i1) and diff(ic,idim) are 1.0250235986806329E-008
>>>> 8.6178408169776945E-317 respectively,
>>>>
>>>> cinfrt = cinfrt_dg(i1) * diff(ic,idim)
>>>>
>>>> I get the following error when run with "-fp_trap -start_in_debugger".
>>>>
>>>> Backtrace for this error:
>>>> *** unknown floating point error occurred ***
>>>> [2]PETSC ERROR: The specific exception can be determined by running
>>>> in a debugger. When the
>>>> [2]PETSC ERROR: debugger traps the signal, the exception can be
>>>> found with fetestexcept(0x3d)
>>>> [2]PETSC ERROR: cinfrt_dg(i1),diff(ic,idim)
>>>> 1.0250235986806329E-008 8.6178408169776945E-317
>>>> where the result is a bitwise OR of the following flags:
>>>> [2]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>> [2]PETSC ERROR: Try option -start_in_debugger
>>>> [2]PETSC ERROR: likely location of problem given in stack below
>>>>
>>>> Thanks,
>>>>
>>>> Danyang
>>>>
>>>> On 15-04-24 01:54 PM, Danyang Su wrote:
>>>>
>>>> On 15-04-24 01:23 PM, Satish Balay wrote:
>>>>
>>>> c 4 1.0976214263087059E-067
>>>>
>>>> I don't think this number can be stored in a real*4.
>>>>
>>>> Satish
>>>>
>>>> Thanks, Satish. It is caused by this number.
>>>>
>>>> On Fri, 24 Apr 2015, Danyang Su wrote:
>>>>
>>>>
>>>> On 15-04-24 11:12 AM, Barry Smith wrote:
>>>>
>>>> On Apr 24, 2015, at 1:05 PM, Danyang Su <danyang.su at gmail.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> One of my cases crashes because of a floating point exception when
>>>> using 4 processors, as shown below. But if I run this case with 1
>>>> processor, it works fine. I have tested the code with around 100
>>>> cases on up to 768 processors, and all other cases work fine. I just
>>>> wonder if this kind of error is caused by a NaN in the Jacobian
>>>> matrix, RHS or preconditioner?
>>>>
>>>> Yes, almost for sure it is one of these places.
>>>>
>>>> First run the bad case with -fp_trap; if all goes well you'll see
>>>> the function where the FPE is generated. Then run also with
>>>> -start_in_debugger and type cont in all four debugger windows. When
>>>> the FPE happens the debugger should stop, showing exactly where the
>>>> FPE happens.
>>>>
>>>> Barry
>>>>
>>>> Hi Barry,
>>>>
>>>> If run with -fp_trap -start_in_debugger, I got the following error
>>>>
>>>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>>>> [0]PETSC ERROR: The specific exception can be determined by running
>>>> in a
>>>> debugger. When the
>>>> [0]PETSC ERROR: debugger traps the signal, the exception can be
>>>> found with
>>>> fetestexcept(0x3d)
>>>> [0]PETSC ERROR: where the result is a bitwise OR of the following
>>>> flags:
>>>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>> [0]PETSC ERROR: Try option -start_in_debugger
>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>> [0]PETSC ERROR: --------------------- Stack Frames
>>>> ------------------------------------
>>>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>> available,
>>>> [0]PETSC ERROR: INSTEAD the line number of the start of the
>>>> function
>>>> [0]PETSC ERROR: is given.
>>>> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 379
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>> [0]PETSC ERROR: User provided function() line 0 in Unknown file
>>>> trapped
>>>> floating point error
>>>>
>>>> Program received signal SIGABRT: Process abort signal.
>>>>
>>>> Backtrace for this error:
>>>> #0 0x7F4FEAB1C7D7
>>>> #1 0x7F4FEAB1CDDE
>>>> #2 0x7F4FE9E1AD3F
>>>> #3 0x7F4FE9E1ACC9
>>>> #4 0x7F4FE9E1E0D7
>>>> #5 0x7F4FEB0B6DCB
>>>> #6 0x7F4FEB0B1825
>>>> #7 0x7F4FEB0B817F
>>>> #8 0x7F4FE9E1AD3F
>>>> #9 0x6972C8 in tprfrtlc_ at tprfrtlc.F90:2393 (discriminator 3)
>>>> #10 0x4C6C87 in gcreact_ at gcreact.F90:678
>>>> #11 0x707E19 in initicrt_ at initicrt.F90:589
>>>> #12 0x4F42D0 in initprob_ at initprob.F90:430
>>>> #13 0x5AAF72 in driver_pc at driver_pc.F90:438
>>>>
>>>> I checked the code at tprfrtlc.F90:2393,
>>>>
>>>> realbuffer_gb(1:nvars) = (/time,(c(ic),ic=1,nc-1), &
>>>> (cx(ix),ix=1,nxout)/)
>>>>
>>>> All the values (time, c, cx) are reasonable, as shown below. The only
>>>> possibility is that realbuffer_gb is declared as real*4 when using
>>>> single precision output, while time, c, cx are declared as real*8. I
>>>> have a lot of similar data conversions from real*8 to real*4 output,
>>>> and the other code does not return an error.
>>>>
>>>> time 0.0000000000000000
>>>> c 1 9.9999999999999995E-008
>>>> c 2 3.1555251077549618E-003
>>>> c 3 7.1657814842179362E-008
>>>> c 4 1.0976214263087059E-067
>>>> c 5 5.2879822292305797E-004
>>>> c 6 9.9999999999999964E-005
>>>> c 7 6.4055731968811337E-005
>>>> c 8 3.4607572892578404E-020
>>>> cx 1 3.4376650636008101E-005
>>>> cx 2 7.3989678854017763E-012
>>>> cx 3 9.5317170613607207E-012
>>>> cx 4 2.2344525794718353E-015
>>>> cx 5 3.0624685689695889E-008
>>>> cx 6 1.0046157902783967E-007
>>>> cx 7 1.5320169154914984E-004
>>>> cx 8 8.6930292776346176E-014
>>>> cx 9 3.5944267559348721E-005
>>>> cx 10 3.0072645866951157E-018
>>>> cx 11 2.3592486321095017E-013
>>>>
>>>> Thanks,
>>>>
>>>> Danyang
>>>>
>>>>
>>>> I can check all the entries of the Jacobian matrix to see if the
>>>> values are valid, but this does not seem a good idea as it takes a
>>>> long time to reach this point. If I restart the simulation from a
>>>> specified time (e.g., 7.685 in this case), then the error does not
>>>> occur.
>>>>
>>>> Would you please give me any suggestion on debugging this case?
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Danyang
>>>>
>>>>
>>>> timestep: 2730 time: 7.665E+00 years delt: 1.000E-02 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> timestep: 2731 time: 7.675E+00 years delt: 1.000E-02 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> timestep: 2732 time: 7.685E+00 years delt: 1.000E-02 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> timestep: 2733 time: 7.695E+00 years delt: 1.000E-02 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> timestep: 2734 time: 7.705E+00 years delt: 1.000E-02 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> Reduce time step for reactive transport
>>>> timestep: 2734 time: 7.700E+00 years delt: 5.000E-03 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> Reduce time step for reactive transport
>>>> timestep: 2734 time: 7.697E+00 years delt: 2.500E-03 years
>>>> iter: 1
>>>> timestep: max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>> [1]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [1]PETSC ERROR: Floating point exception
>>>> [2]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [2]PETSC ERROR: Floating point exception
>>>> [2]PETSC ERROR: Vec entry at local location 0 is not-a-number or
>>>> infinite
>>>> at end of function: Parameter number 3
>>>> [2]PETSC ERROR: See
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>
>>>> for trouble shooting.
>>>> [2]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>> [2]PETSC ERROR: [1]PETSC ERROR: Vec entry at local location 0 is
>>>> not-a-number or infinite at end of function: Parameter number 3
>>>> [1]PETSC ERROR: See
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>
>>>> for trouble shooting.
>>>> [1]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>> [1]PETSC ERROR: ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named
>>>> nwmop by
>>>> dsu Thu Apr 23 15:38:52 2015
>>>> [1]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg
>>>> --with-cc=gcc
>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack
>>>> --download-mpich
>>>> --download-mumps --download-hypre --download-superlu_dist
>>>> --download-metis
>>>> --download-parmetis --download-scalapack
>>>> [1]PETSC ERROR: #1 VecValidValues() line 34 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>> ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu
>>>> Apr 23
>>>> 15:38:52 2015
>>>> [2]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg
>>>> --with-cc=gcc
>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack
>>>> --download-mpich
>>>> --download-mumps --download-hypre --download-superlu_dist
>>>> --download-metis
>>>> --download-parmetis --download-scalapack
>>>> [2]PETSC ERROR: #1 VecValidValues() line 34 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>> [2]PETSC ERROR: [1]PETSC ERROR: #2 PCApply() line 442 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>> [1]PETSC ERROR: #2 PCApply() line 442 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>> [2]PETSC ERROR: #3 KSP_PCApply() line 230 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>> #3 KSP_PCApply() line 230 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>> [1]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>> [2]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>> [1]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>> [2]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>> [2]PETSC ERROR: #6 KSPSolve() line 459 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>> [1]PETSC ERROR: #6 KSPSolve() line 459 in
>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>> ^C[mpiexec at nwmop] Sending Ctrl-C to processes as requested
>>>> [mpiexec at nwmop] Press Ctrl-C again to force abort
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to
>>>> which their experiments lead.
>>>> -- Norbert Wiener
>>> <petscconf.h>
>
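
The PETSc messages quoted in this thread suggest calling fetestexcept(0x3d) once the debugger has stopped and list the flag values to decode the result. The mask 0x3d is the bitwise OR of all five listed flags (0x1 | 0x4 | 0x8 | 0x10 | 0x20). The sketch below decodes such a result into flag names. It hard-codes the values exactly as printed in the log, which may differ on other platforms, and the helper name describe_fp_flags is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Flag values copied from the PETSc messages in this thread; they are the
 * values printed on that machine and may differ on other platforms. */
enum {
    LOG_FE_INVALID   = 0x01,
    LOG_FE_DIVBYZERO = 0x04,
    LOG_FE_OVERFLOW  = 0x08,
    LOG_FE_UNDERFLOW = 0x10,
    LOG_FE_INEXACT   = 0x20
};

/* Hypothetical helper: write the names of the flags set in `flags` into buf
 * (space separated) and return how many names were written. Writing stops
 * early if buf is too small. */
int describe_fp_flags(int flags, char *buf, size_t len)
{
    static const struct { int bit; const char *name; } table[] = {
        { LOG_FE_INVALID,   "FE_INVALID"   },
        { LOG_FE_DIVBYZERO, "FE_DIVBYZERO" },
        { LOG_FE_OVERFLOW,  "FE_OVERFLOW"  },
        { LOG_FE_UNDERFLOW, "FE_UNDERFLOW" },
        { LOG_FE_INEXACT,   "FE_INEXACT"   }
    };
    int count = 0;
    size_t used = 0;

    if (len > 0) {
        buf[0] = '\0';
    }
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (flags & table[i].bit) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             count ? " " : "", table[i].name);
            if (n < 0 || (size_t)n >= len - used) {
                break;              /* buffer too small for this name */
            }
            used += (size_t)n;
            count++;
        }
    }
    return count;
}
```

For example, a result of 0x30 would decode as FE_UNDERFLOW plus FE_INEXACT, the combination expected from a harmless underflow, while a result containing 0x4 would point to a division by zero.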