[petsc-users] Floating point exception

Barry Smith bsmith at mcs.anl.gov
Sat Apr 25 21:02:18 CDT 2015


  Ok, you do have 

#ifndef PETSC_HAVE_XMMINTRIN_H
#define PETSC_HAVE_XMMINTRIN_H 1
#endif

  so the change you made should cause it to stop trapping underflow exceptions. 
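
  To make the effect of that change concrete, here is a minimal sketch (not the PETSc source) of how the SSE exception mask behaves on x86: a set mask bit suppresses that exception, a clear bit lets it be delivered as SIGFPE. The helper names `sse_exception_is_masked` and `trap_all_but_inexact_and_underflow` are made up for this illustration; the `_MM_*` macros come from `<xmmintrin.h>`.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: nonzero if the given _MM_MASK_* bit is currently set,
   meaning that exception is masked and will NOT be delivered as SIGFPE. */
static int sse_exception_is_masked(unsigned int mask_bit)
{
    /* The argument must lie inside the MXCSR exception-mask field. */
    assert(mask_bit != 0 && (mask_bit & ~(unsigned int)_MM_MASK_MASK) == 0);
    return (_MM_GET_EXCEPTION_MASK() & mask_bit) != 0;
}

/* Hypothetical helper mirroring the edited line in fp.c: mask inexact and
   underflow, leave every other SSE exception unmasked (trapping). */
static void trap_all_but_inexact_and_underflow(void)
{
    _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT | _MM_MASK_UNDERFLOW);
}
```

  After calling `trap_all_but_inexact_and_underflow()`, `sse_exception_is_masked(_MM_MASK_UNDERFLOW)` is nonzero, while the same check for `_MM_MASK_INVALID` is zero. This mask also leaves `_MM_MASK_DENORM` unmasked; whether that plays any role in this report has not been established.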

  Now, in one email you reported an FPE within hypre. I then asked you to run with -start_in_debugger to determine exactly where it happened, and you reported that the FPE happened in user code (what seemed to be an underflow issue). Why is this? Can you not run the case that generated the FPE in hypre using the -start_in_debugger option?

   Barry

Perhaps you have multiple PETSC_ARCH builds or multiple PETSc installs, which would explain why you reported two different places where the exception occurred.

> On Apr 25, 2015, at 8:31 PM, Danyang Su <danyang.su at gmail.com> wrote:
> 
> 
> 
> On 15-04-25 06:26 PM, Matthew Knepley wrote:
>> On Sat, Apr 25, 2015 at 8:23 PM, Danyang Su <danyang.su at gmail.com> wrote:
>> 
>> 
>> On 15-04-25 06:03 PM, Barry Smith wrote:
>>    If this is what you got in your last run
>> 
>>      at ../../gas_advection/velocity_g.F90:1344
>> 1344                                        cinfrt = cinfrt_dg(i1) * diff(ic,idim)                                      !diff is a very small value, e.g., 1.0d-316
>>    then it is still catching floating point underflow, which we do not want. This means either the change I suggested you make in the fp.c code didn't work or it actually uses a different floating point trap than that one.  BTW: absurd numbers like 1.0d-316 are often a symptom of uninitialized data; could that be a problem that diff is not filled correctly for all the ic, idim you are using?
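
One way to act on the uninitialized-data suspicion is to scan the array for subnormal entries before the multiplication. The following is a minimal C sketch, assuming IEEE double precision and default (non flush-to-zero) floating-point settings. `is_subnormal` and `count_subnormal` are hypothetical names, not PETSc or application routines; the same test could be written in Fortran by comparing against `tiny(1.0d0)`.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Nonzero if x is nonzero but smaller in magnitude than the smallest
   normal double (DBL_MIN), i.e. x is a subnormal value such as 1.0e-316. */
static int is_subnormal(double x)
{
    double a = (x < 0.0) ? -x : x;
    return a > 0.0 && a < DBL_MIN;
}

/* Count the subnormal entries in v[0..n-1]. A nonzero count in an array
   that should hold ordinary physical values is a hint, not proof, that
   some entries were never filled in. */
static size_t count_subnormal(const double *v, size_t n)
{
    size_t i, count = 0;
    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (is_subnormal(v[i]))
            count++;
    return count;
}
```

With the value reported later in this thread, `is_subnormal(8.6178408169776945e-317)` is nonzero, while `is_subnormal(1.0250235986806329e-8)` is zero.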
>> 
>>     This going round and round is very frustrating and a waste of time. You need to be more proactive yourself and explore the code and poke around to figure out how to solve the problem.
>> 
>>    Please email $PETSC_DIR/$PETSC_ARCH/include/petscvariables.h so I can see what FP trap is being used on your machine.
>> 
>> Barry
>> Do you mean $PETSC_DIR/$PETSC_ARCH/conf/petscvariables? Otherwise I cannot find this file.
>> 
>> It's include/petscconf.h
>>  
>> Do I need to reconfigure PETSc after changing the code you mentioned?
>> 
>> No, but you need to rebuild.
> Yes, I have done 'gnumake'.
>> 
>>    Matt
>>  
>> Danyang
>> 
>> 
>> 
>> On Apr 25, 2015, at 2:24 PM, Danyang Su <danyang.su at gmail.com> wrote:
>> 
>> 
>> 
>> On 15-04-25 11:55 AM, Barry Smith wrote:
>> On Apr 25, 2015, at 1:51 PM, Danyang Su <danyang.su at gmail.com> wrote:
>> 
>> 
>> 
>> On 15-04-25 11:32 AM, Barry Smith wrote:
>> 
>>    I told you this yesterday.
>> 
>>    It is probably stopping here on a harmless underflow. You need to edit the PETSc code to not worry about underflow.
>> 
>> Edit the file /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c and locate
>> 
>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>     _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT);
>> #else
>> 
>> change it to
>> 
>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>     _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT | _MM_MASK_UNDERFLOW);
>> #else
>> 
>>   Then run make gnumake in the PETSc directory to compile the new version. Now link and run the program again with -fp_trap and see where it gets stuck this time.
>> 
>>    Did you do this?
>> 
>>   Barry
>> 
>> Yes, I did change the code in fp.c and ran 'make gnumake' in the PETSc directory. I just double-checked by running make gnumake again and got the following output this time.
>> 
>> 
>> dsu at nwmop:~/Soft/PETSc/petsc-3.5.2$ make gnumake
>> Building PETSc using GNU Make with 10 build threads
>> ==========================================
>> make[1]: Entering directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>> make[1]: Nothing to be done for `all'.
>> make[1]: Leaving directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>> =========================================
>> 
>> 
>> Then I recompiled the code, ran with -fp_trap, and still got the following error
>> 
>> Backtrace for this error:
>> Note: The EXACT line numbers in the stack are not available,
>> [2]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [2]PETSC ERROR:       is given.
>> [2]PETSC ERROR: [2] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>       INSTEAD the line number of the start of the function
>> [3]PETSC ERROR:       is given.
>> [3]PETSC ERROR: [3] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>> [2]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>> [3]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>> 
>> 
>>     This is different than what you sent a few minutes ago where it crashed in hypre.
>> 
>>     Anyways you need to use the -start_in_debugger business I sent in the previous email to see the exact place the problem occurs.
>> 
>> Here is the information shown on gdb screen
>> 
>> Program received signal SIGFPE, Arithmetic exception.
>> 0x00000000006c2bef in velocity_g (l_sufx=1, suffix=..., nmax=12, njamxc=34,
>>      cinfradx=..., radial_coordx=.FALSE., _suffix=3)
>>      at ../../gas_advection/velocity_g.F90:1344
>> 1344                                        cinfrt = cinfrt_dg(i1) * diff(ic,idim)                                      !diff is a very small value, e.g., 1.0d-316
>> (gdb)
>> 
>> After typing cont at the gdb prompt, I got the error information below
>> 
>> [1]PETSC ERROR: *** unknown floating point error occurred ***
>> [1]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>> [1]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>> [1]PETSC ERROR: where the result is a bitwise OR of the following flags:
>> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>> [1]PETSC ERROR: Try option -start_in_debugger
>> [1]PETSC ERROR: likely location of problem given in stack below
>> [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [1]PETSC ERROR:       is given.
>> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>> [1]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>> [0]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>> [0]PETSC ERROR: Try option -start_in_debugger
>> [0]PETSC ERROR: likely location of problem given in stack below
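
The message above reports the exception as a bit mask. A small helper along these lines can turn the value returned by `fetestexcept` into readable names. This is a sketch using only the standard `<fenv.h>` macros; `describe_fp_flags` is a hypothetical name, and the code assumes the platform defines all five standard `FE_*` macros.

```c
#include <assert.h>
#include <fenv.h>
#include <string.h>

/* Hypothetical helper: translate a fetestexcept() result into a string such
   as "FE_UNDERFLOW|FE_INEXACT". Returns a pointer to a static buffer, so it
   is not thread-safe and the result is overwritten by the next call. */
static const char *describe_fp_flags(int flags)
{
    static const struct { int bit; const char *name; } table[] = {
        { FE_INVALID,   "FE_INVALID"   },
        { FE_DIVBYZERO, "FE_DIVBYZERO" },
        { FE_OVERFLOW,  "FE_OVERFLOW"  },
        { FE_UNDERFLOW, "FE_UNDERFLOW" },
        { FE_INEXACT,   "FE_INEXACT"   },
    };
    static char buf[96];
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (flags & table[i].bit) {
            /* Room for an optional separator, the name, and the final NUL. */
            assert(strlen(buf) + strlen(table[i].name) + 2 <= sizeof buf);
            if (buf[0] != '\0')
                strcat(buf, "|");
            strcat(buf, table[i].name);
        }
    }
    return buf;
}
```

For example, `describe_fp_flags(FE_UNDERFLOW | FE_INEXACT)` yields `FE_UNDERFLOW|FE_INEXACT`, and a flag word of zero yields an empty string.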
>> 
>> Thanks,
>> 
>> Danyang
>> 
>> On Apr 25, 2015, at 1:05 AM, Danyang Su <danyang.su at gmail.com> wrote:
>> 
>> Hi Barry and Satish,
>> 
>> How can I get rid of the unknown floating point error that occurs when multiplying by a very small value?
>> 
>> e.g.,
>> cinfrt_dg(i1) and diff(ic,idim) are 1.0250235986806329E-008 and 8.6178408169776945E-317, respectively,
>> 
>> cinfrt = cinfrt_dg(i1) * diff(ic,idim)
>> 
>> I get the following error when running with "-fp_trap -start_in_debugger".
>> 
>> Backtrace for this error:
>> *** unknown floating point error occurred ***
>> [2]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>> [2]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>> [2]PETSC ERROR:  cinfrt_dg(i1),diff(ic,idim) 1.0250235986806329E-008   8.6178408169776945E-317
>> where the result is a bitwise OR of the following flags:
>> [2]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>> [2]PETSC ERROR: Try option -start_in_debugger
>> [2]PETSC ERROR: likely location of problem given in stack below
>> 
>> Thanks,
>> 
>> Danyang
>> 
>> On 15-04-24 01:54 PM, Danyang Su wrote:
>> 
>> On 15-04-24 01:23 PM, Satish Balay wrote:
>> 
>>   c           4   1.0976214263087059E-067
>> 
>> I don't think this number can be stored in a real*4.
>> 
>> Satish
>> 
>> Thanks, Satish. It is caused by this number.
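
To make the range issue concrete: the smallest normal single-precision value is about 1.18E-38, so 1.0976214263087059E-067 falls far below it. Below is a minimal C sketch of such a range check, with `float` and `double` standing in for real*4 and real*8. The names `classify_for_real4` and `R4_*` are hypothetical, and the overflow test is deliberately conservative.

```c
#include <assert.h>
#include <float.h>

typedef enum { R4_OK, R4_UNDERFLOW, R4_OVERFLOW } r4_status;

/* Hypothetical helper: report whether a double can be stored in a float
   as a normal number. Nonzero magnitudes below FLT_MIN would become
   subnormal or zero; magnitudes above FLT_MAX are treated as overflow. */
static r4_status classify_for_real4(double x)
{
    double a = (x < 0.0) ? -x : x;
    assert(x == x); /* NaN input is outside the scope of this sketch */
    if (a > FLT_MAX)
        return R4_OVERFLOW;
    if (a > 0.0 && a < FLT_MIN)
        return R4_UNDERFLOW;
    return R4_OK;
}
```

Here `classify_for_real4(1.0976214263087059e-67)` reports `R4_UNDERFLOW`. Converting that value to `float` gives zero and signals underflow, which a run with underflow trapping enabled would presumably turn into SIGFPE.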
>> 
>> On Fri, 24 Apr 2015, Danyang Su wrote:
>> 
>> 
>> On 15-04-24 11:12 AM, Barry Smith wrote:
>> 
>> On Apr 24, 2015, at 1:05 PM, Danyang Su <danyang.su at gmail.com> wrote:
>> 
>> Hi All,
>> 
>> One of my cases crashes with a floating point exception when using 4
>> processors, as shown below. But if I run this case with 1 processor, it
>> works fine. I have tested the code with around 100 cases on up to 768
>> processors, and all other cases work fine. Could this kind of error be
>> caused by a NaN in the Jacobian matrix, RHS, or preconditioner?
>> 
>>      Yes, almost for sure it is one of these places.
>> 
>>      First run the bad case with -fp_trap; if all goes well you'll see the
>> function where the FPE is generated. Then also run with -start_in_debugger
>> and type cont in all four debugger windows. When the FPE happens, the debugger
>> should stop, showing exactly where the FPE happens.
>> 
>>     Barry
>> 
>> Hi Barry,
>> 
>> If I run with -fp_trap -start_in_debugger, I get the following error
>> 
>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>> [0]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>> [0]PETSC ERROR: Try option -start_in_debugger
>> [0]PETSC ERROR: likely location of problem given in stack below
>> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [0]PETSC ERROR:       is given.
>> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>> [0]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>> 
>> Program received signal SIGABRT: Process abort signal.
>> 
>> Backtrace for this error:
>> #0  0x7F4FEAB1C7D7
>> #1  0x7F4FEAB1CDDE
>> #2  0x7F4FE9E1AD3F
>> #3  0x7F4FE9E1ACC9
>> #4  0x7F4FE9E1E0D7
>> #5  0x7F4FEB0B6DCB
>> #6  0x7F4FEB0B1825
>> #7  0x7F4FEB0B817F
>> #8  0x7F4FE9E1AD3F
>> #9  0x6972C8 in tprfrtlc_ at tprfrtlc.F90:2393 (discriminator 3)
>> #10  0x4C6C87 in gcreact_ at gcreact.F90:678
>> #11  0x707E19 in initicrt_ at initicrt.F90:589
>> #12  0x4F42D0 in initprob_ at initprob.F90:430
>> #13  0x5AAF72 in driver_pc at driver_pc.F90:438
>> 
>> I checked the code at  tprfrtlc.F90:2393,
>> 
>>          realbuffer_gb(1:nvars) = (/time,(c(ic),ic=1,nc-1),     &
>>                                     (cx(ix),ix=1,nxout)/)
>> 
>> All the values (time, c, cx) are reasonable, as shown below. The only
>> possibility I can see is that realbuffer_gb is declared as real*4 when using
>> single-precision output, while time, c, and cx are declared as real*8. I have
>> many similar conversions from real*8 to real*4 for output, and the other code
>> does not return an error.
>> 
>>   time   0.0000000000000000
>>   c           1   9.9999999999999995E-008
>>   c           2   3.1555251077549618E-003
>>   c           3   7.1657814842179362E-008
>>   c           4   1.0976214263087059E-067
>>   c           5   5.2879822292305797E-004
>>   c           6   9.9999999999999964E-005
>>   c           7   6.4055731968811337E-005
>>   c           8   3.4607572892578404E-020
>>   cx           1   3.4376650636008101E-005
>>   cx           2   7.3989678854017763E-012
>>   cx           3   9.5317170613607207E-012
>>   cx           4   2.2344525794718353E-015
>>   cx           5   3.0624685689695889E-008
>>   cx           6   1.0046157902783967E-007
>>   cx           7   1.5320169154914984E-004
>>   cx           8   8.6930292776346176E-014
>>   cx           9   3.5944267559348721E-005
>>   cx          10   3.0072645866951157E-018
>>   cx          11   2.3592486321095017E-013
>> 
>> Thanks,
>> 
>> Danyang
>> 
>> 
>> I can check all the entries of the Jacobian matrix to see whether the values
>> are valid, but this does not seem like a good idea, as it takes a long time to
>> reach this point. If I restart the simulation from a specified time (e.g.,
>> 7.685 in this case), then the error does not occur.
>> 
>> Would you please give me any suggestions for debugging this case?
>> 
>> Thanks and Regards,
>> 
>> Danyang
>> 
>> 
>> timestep:    2730 time: 7.665E+00 years   delt: 1.000E-02 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> timestep:    2731 time: 7.675E+00 years   delt: 1.000E-02 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> timestep:    2732 time: 7.685E+00 years   delt: 1.000E-02 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> timestep:    2733 time: 7.695E+00 years   delt: 1.000E-02 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> timestep:    2734 time: 7.705E+00 years   delt: 1.000E-02 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> Reduce time step for reactive transport
>> timestep:    2734 time: 7.700E+00 years   delt: 5.000E-03 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> Reduce time step for reactive transport
>> timestep:    2734 time: 7.697E+00 years   delt: 2.500E-03 years iter:  1
>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>> [1]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [1]PETSC ERROR: Floating point exception
>> [2]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [2]PETSC ERROR: Floating point exception
>> [2]PETSC ERROR: Vec entry at local location 0 is not-a-number or infinite
>> at end of function: Parameter number 3
>> [2]PETSC ERROR: See
>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>> 
>> for trouble shooting.
>> [2]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>> [2]PETSC ERROR: [1]PETSC ERROR: Vec entry at local location 0 is
>> not-a-number or infinite at end of function: Parameter number 3
>> [1]PETSC ERROR: See
>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>> 
>> for trouble shooting.
>> [1]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>> [1]PETSC ERROR: ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by
>> dsu Thu Apr 23 15:38:52 2015
>> [1]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>> --download-parmetis --download-scalapack
>> [1]PETSC ERROR: #1 VecValidValues() line 34 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>> ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu Apr 23
>> 15:38:52 2015
>> [2]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>> --download-parmetis --download-scalapack
>> [2]PETSC ERROR: #1 VecValidValues() line 34 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>> [2]PETSC ERROR: [1]PETSC ERROR: #2 PCApply() line 442 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>> [1]PETSC ERROR: #2 PCApply() line 442 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>> [2]PETSC ERROR: #3 KSP_PCApply() line 230 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>> #3 KSP_PCApply() line 230 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>> [1]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>> [2]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>> [1]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>> [2]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>> [2]PETSC ERROR: #6 KSPSolve() line 459 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>> [1]PETSC ERROR: #6 KSPSolve() line 459 in
>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>> ^C[mpiexec at nwmop] Sending Ctrl-C to processes as requested
>> [mpiexec at nwmop] Press Ctrl-C again to force abort
>> 
>> 
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
> 
> <petscconf.h>


