[petsc-users] Floating point exception in hypre BoomerAMG

Barry Smith bsmith at mcs.anl.gov
Wed Apr 29 13:30:58 CDT 2015


> On Apr 29, 2015, at 12:15 PM, Danyang Su <danyang.su at gmail.com> wrote:
> 
> On 15-04-28 06:50 PM, Barry Smith wrote:
>>   We started enforcing more checks on writing to vectors that you should not write to.  Where are you calling DMLocalToGlobalBegin() ? It looks like you are trying to copy values into a global array that you should not because it is a read only input to a function, for example it is the right hand side of a linear system or the input into a SNESFormFunction(). So check the output to that call.
>> 
>>   Barry
>> 
> This is used when local RHS vec is calculated and copied to global RHS.

   Huhh, why are you copying anything into the RHS? Is this before you call the linear system solve? Send the code that calls the DMLocalToGlobalBegin()

  Barry

> It is strange why this error does not occur at the first timestep.
> 
> Thanks,
> 
> Danyang
>>> On Apr 28, 2015, at 7:30 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>> 
>>> Hi Barry,
>>> 
>>> There seems another bug (not pretty sure) in PETSc-dev, as shown below. The case I used is similar with the one I mentioned recently. I have no problem running this case using PETSc 3.5.2. But it give out the following error using PETSc-dev, even with only one processor.
>>> 
>>> Thanks,
>>> 
>>> Danyang
>>> 
>>> timestep:    2048 time: 4.615E+00 years   delt: 1.460E-20 years iter:  1 max.sia: 0.000E+00 tol.sia: 0.000E+00
>>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [0]PETSC ERROR: Object is in wrong state
>>> [0]PETSC ERROR:  Vec is locked read only, argument # 1
>>> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2796-g23b6c49  GIT Date: 2015-04-27 11:32:44 -0500
>>> [0]PETSC ERROR: ../min3p_thcm on a linux-gnu-dbg named nwmop by dsu Tue Apr 28 15:56:41 2015
>>> [0]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-mumps --download-hypre --download-superlu_dist --download-metis --download-parmetis --download-scalapack
>>> [0]PETSC ERROR: #349 VecGetArray() line 1646 in /home/dsu/Soft/PETSc/petsc-dev/src/vec/vec/interface/rvector.c
>>> [0]PETSC ERROR: #350 VecGetArrayPair() line 434 in /home/dsu/Soft/PETSc/petsc-dev/include/petscvec.h
>>> [0]PETSC ERROR: #351 VecScatterBegin_SSToSS() line 649 in /home/dsu/Soft/PETSc/petsc-dev/src/vec/vec/utils/vscat.c
>>> [0]PETSC ERROR: #352 VecScatterBegin() line 1694 in /home/dsu/Soft/PETSc/petsc-dev/src/vec/vec/utils/vscat.c
>>> [0]PETSC ERROR: #353 DMLocalToGlobalBegin_DA() line 56 in /home/dsu/Soft/PETSc/petsc-dev/src/dm/impls/da/dagtol.c
>>> [0]PETSC ERROR: #354 DMLocalToGlobalBegin() line 1944 in /home/dsu/Soft/PETSc/petsc-dev/src/dm/interface/dm.c
>>> Reduce time step for reactive transport
>>> no further time step reduction possible
>>> Optimal time increment computed by MIN3P   2.6644769214899893E-018
>>> Minimum time increment specified by user   3.6499999999999998E-018
>>> Please, reduce the MINIMUM TIME INCREMENT
>>> stop signal in time step reduction failed
>>> 
>>> On 15-04-28 11:15 AM, Barry Smith wrote:
>>>>    I am forwarding this on to the hypre support email list; hopefully they can have some advice on how to proceed.
>>>> 
>>>>    hypre folks, this is with version hypre-2.10.0b (the previous version had the same behavior). We are getting a divide by zero in gselim() line 4363 (this happens in a time dependent problem after many successful solves with time dependent matrix).  I looked at the code, below, and note that there are some checks for the diagonal of A not being zero but not for the line that causes the divide by zero; is this perhaps an oversight in the hypre code?  Any advice appreciated.
>>>> 
>>>>    Thanks
>>>> 
>>>>    Barry
>>>> 
>>>> 
>>>> 
>>>> 
>>>> HYPRE_Int gselim(A,x,n)
>>>> HYPRE_Real *A;
>>>> HYPRE_Real *x;
>>>> HYPRE_Int n;
>>>> {
>>>>    HYPRE_Int    err_flag = 0;
>>>>    HYPRE_Int    j,k,m;
>>>>    HYPRE_Real factor;
>>>>        if (n==1)                           /* A is 1x1 */
>>>>    {
>>>>       if (A[0] != 0.0)
>>>>       {
>>>>          x[0] = x[0]/A[0];
>>>>          return(err_flag);
>>>>       }
>>>>       else
>>>>       {
>>>>          err_flag = 1;
>>>>          return(err_flag);
>>>>       }
>>>>    }
>>>>    else                               /* A is nxn.  Forward elimination */
>>>>    {
>>>>       for (k = 0; k < n-1; k++)
>>>>       {
>>>>           if (A[k*n+k] != 0.0)
>>>>           {
>>>>              for (j = k+1; j < n; j++)
>>>>              {
>>>>                  if (A[j*n+k] != 0.0)
>>>>                  {
>>>>                     factor = A[j*n+k]/A[k*n+k];
>>>>                     for (m = k+1; m < n; m++)
>>>>                     {
>>>>                         A[j*n+m]  -= factor * A[k*n+m];
>>>>                     }
>>>>                                      /* Elimination step for rhs */
>>>>                     x[j] -= factor * x[k];
>>>>                  }
>>>>              }
>>>>           }
>>>>        }
>>>>                                     /* Back Substitution  */
>>>>        for (k = n-1; k > 0; --k)
>>>>        {
>>>>            x[k] /= A[k*n+k];
>>>>            for (j = 0; j < k; j++)
>>>>            {
>>>>                if (A[j*n+k] != 0.0)
>>>>                {
>>>>                   x[j] -= x[k] * A[j*n+k];
>>>>                }
>>>>            }
>>>>        }
>>>>        x[0] /= A[0];
>>>>        return(err_flag);
>>>>     }
>>>> }
>>>> 
>>>> 
>>>>> On Apr 28, 2015, at 12:55 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>> 
>>>>> Hi Barry,
>>>>> 
>>>>> The development version of PETSc does not help to solve my problem. It still crashed due to the same error information.
>>>>> 
>>>>> As Matthew mentioned, I checked the matrix and non of the diagonal entry is zero. But the diagonal values for different rows range from -1.0d-18 to 1.0d33 and the matrix is not strict diagonal dominant. When using 1 processor, the matrix is also not strict diagonal dominant, but the diagonal values have a much smaller range and the codes can run without error. Maybe there is something wrong in the matrix for this case and I need to check the matrix first.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Danyang
>>>>> 
>>>>> Program received signal SIGFPE, Arithmetic exception.
>>>>> 0x00007f668c1299d5 in gselim (A=0x2fe3330, x=0x2210680, n=9)
>>>>>    at par_relax.c:4363
>>>>> 4363               x[k] /= A[k*n+k];
>>>>> (gdb)
>>>>> 
>>>>> [1]PETSC ERROR: *** unknown floating point error occurred ***
>>>>> [1]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>>>>> [1]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>>>>> [1]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>>>> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>> [1]PETSC ERROR: Try option -start_in_debugger
>>>>> [1]PETSC ERROR: likely location of problem given in stack below
>>>>> [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>>>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>>>>> [1]PETSC ERROR:       is given.
>>>>> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-dev/src/sys/error/fp.c
>>>>> [1]PETSC ERROR: [1] Hypre solve line 216 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/pc/impls/hypre/hypre.c
>>>>> [1]PETSC ERROR: [1] PCApply_HYPRE line 203 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/pc/impls/hypre/hypre.c
>>>>> [1]PETSC ERROR: [1] PCApply line 430 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/pc/interface/precon.c
>>>>> [1]PETSC ERROR: [1] KSP_PCApply line 234 /home/dsu/Soft/PETSc/petsc-dev/include/petsc/private/kspimpl.h
>>>>> [1]PETSC ERROR: [1] KSPInitialResidual line 44 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/ksp/interface/itres.c
>>>>> [1]PETSC ERROR: [1] KSPSolve_GMRES line 224 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/ksp/impls/gmres/gmres.c
>>>>> [1]PETSC ERROR: [1] KSPSolve line 506 /home/dsu/Soft/PETSc/petsc-dev/src/ksp/ksp/interface/itfunc.c
>>>>> [1]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>>>> 
>>>>> 
>>>>> 
>>>>> On 15-04-27 11:47 AM, Barry Smith wrote:
>>>>>>    Danyang,
>>>>>> 
>>>>>>     I'm glad you were finally able to get to the problematic point; sorry it took so long (I want to strangle whoever turned on catching of underflows in PETSc).
>>>>>> 
>>>>>>     The hypre folks have done a great deal of work on their relaxation code since the PETSc 3.5.3 release. There is a good chance they have fixed the divide by zero problem you are hitting here. You will need to upgrade to the development version of PETSc (that uses the latest version of hypre), here are the instructions on how to obtain it http://www.mcs.anl.gov/petsc/developers/index.html
>>>>>> 
>>>>>>    Please let us know if this resolves the problem with hypre failing.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Apr 27, 2015, at 11:44 AM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Barry,
>>>>>>> 
>>>>>>> I got the following arithemetic exception after the previous bug is fixed.
>>>>>>> 
>>>>>>> Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
>>>>>>> 0x00007f3b23b98f20 in __nanosleep_nocancel ()
>>>>>>>    at ../sysdeps/unix/syscall-template.S:81
>>>>>>> 81      ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>> (gdb) cont
>>>>>>> Continuing.
>>>>>>> 
>>>>>>> Program received signal SIGFPE, Arithmetic exception.
>>>>>>> 0x00007f3b260df449 in gselim (A=0x1e4c580, x=0x1d30c40, n=9)
>>>>>>>    at par_relax.c:3442
>>>>>>> 3442               x[k] /= A[k*n+k];
>>>>>>> (gdb)
>>>>>>> 
>>>>>>> I tried both PETSc 3.5.2 and 3.5.3 and they return the same error as shown above. For 3.5.3, i edited fp.c file and then configure and make.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Danyang
>>>>>>> 
>>>>>>> On 15-04-25 07:34 PM, Danyang Su wrote:
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> The "floating point underflow" is caused by a small value divided by a very large value. This result is forced to zero and then it does not report any underflow problem. I just rerun this bad case to see if it still get stuck later. This will take a while.
>>>>>>>> 
>>>>>>>> Thanks for all your kindly reply,
>>>>>>>> 
>>>>>>>> Danyang
>>>>>>>> 
>>>>>>>> On 15-04-25 07:02 PM, Barry Smith wrote:
>>>>>>>>>   Ok, you do have
>>>>>>>>> 
>>>>>>>>> #ifndef PETSC_HAVE_XMMINTRIN_H
>>>>>>>>> #define PETSC_HAVE_XMMINTRIN_H 1
>>>>>>>>> #endif
>>>>>>>>> 
>>>>>>>>>   so the change you made should cause it to stop trapping underflow exceptions.
>>>>>>>>> 
>>>>>>>>>   Now in one email you reported a FPE within hypre, then I asked you to run with -start_in_debugger to determine where it happened exactly and then you reported the FPE happened in  user code (what seemed to be an underflow issue). Why is this? Can you not run it where it generated the FPE in hypre using the -start_in_debugger option?
>>>>>>>>> 
>>>>>>>>>    Barry
>>>>>>>>> 
>>>>>>>>> Perhaps you have multiple PETSC_ARCH or multiple PETSc installs to explain why you reported two different places where the exception occurred.
>>>>>>>>> 
>>>>>>>>>> On Apr 25, 2015, at 8:31 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 15-04-25 06:26 PM, Matthew Knepley wrote:
>>>>>>>>>>> On Sat, Apr 25, 2015 at 8:23 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-25 06:03 PM, Barry Smith wrote:
>>>>>>>>>>>    If this is what you got in your last run
>>>>>>>>>>> 
>>>>>>>>>>>      at ../../gas_advection/velocity_g.F90:1344
>>>>>>>>>>> 1344                                        cinfrt = cinfrt_dg(i1) * diff(ic,idim)                                      !diff is a very small value, e.g., 1.0d-316
>>>>>>>>>>>    then it is still catching floating point underflow, which we do not want. This means either the change I suggested you make in the fp.c code didn't work or it actually uses a different floating point trap than that one.  BTW: absurd numbers like 1.0d-316 are often a symptom of uninitialized data; could that be a problem that diff is not filled correctly for all the ic, idim you are using?
>>>>>>>>>>> 
>>>>>>>>>>>     This going round and round is very frustrating and a waste of time. You need to be more proactive yourself and explore the code and poke around to figure out how to solve the problem.
>>>>>>>>>>> 
>>>>>>>>>>>    Please email $PETSC_DIR/$PETSC_ARCH/include/petscvariables.h so I can see what FP trap is being used on your machine.
>>>>>>>>>>> 
>>>>>>>>>>> Barry
>>>>>>>>>>> Do you mean $PETSC_DIR/$PETSC_ARCH/conf/petscvariables? Otherwise I cannot find this file.
>>>>>>>>>>> 
>>>>>>>>>>> Its include/petscconf.h
>>>>>>>>>>>  Do I need to reconfigure PETSc after changing the code you mentioned?
>>>>>>>>>>> 
>>>>>>>>>>> No, but you need to rebuild.
>>>>>>>>>> Yes, I have done 'gnumake'.
>>>>>>>>>>>    Matt
>>>>>>>>>>>  Danyang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 25, 2015, at 2:24 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-25 11:55 AM, Barry Smith wrote:
>>>>>>>>>>> On Apr 25, 2015, at 1:51 PM, Danyang Su <danyang.su at gmail.com>
>>>>>>>>>>>   wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-25 11:32 AM, Barry Smith wrote:
>>>>>>>>>>> 
>>>>>>>>>>>    I told you this yesterday.
>>>>>>>>>>> 
>>>>>>>>>>>    It is probably stopping here on a harmless underflow. You need to edit the PETSc code to not worry about underflow.
>>>>>>>>>>> 
>>>>>>>>>>> Edit the file /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c and locate
>>>>>>>>>>> 
>>>>>>>>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>>>>>>>>>     _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT);
>>>>>>>>>>> #else
>>>>>>>>>>> 
>>>>>>>>>>> change it to
>>>>>>>>>>> 
>>>>>>>>>>> #elif defined PETSC_HAVE_XMMINTRIN_H
>>>>>>>>>>>     _MM_SET_EXCEPTION_MASK(_MM_MASK_INEXACT | _MM_MASK_UNDERFLOW);
>>>>>>>>>>> #else
>>>>>>>>>>> 
>>>>>>>>>>>   Then run make gnumake in the PETSc directory to compile the new version. Now link and run the program again with -fp_trap and see where it gets stuck this time.
>>>>>>>>>>> 
>>>>>>>>>>>    Did you do this?
>>>>>>>>>>> 
>>>>>>>>>>>   Barry
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I did change the code in fp.c and run 'make gnumake' in the PETSc directory. I just did a double check and ran make gnumake again and got the following information this time.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> dsu at nwmop:~/Soft/PETSc/petsc-3.5.2$
>>>>>>>>>>>   make gnumake
>>>>>>>>>>> Building PETSc using GNU Make with 10 build threads
>>>>>>>>>>> ==========================================
>>>>>>>>>>> make[1]: Entering directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>>>>>>>>>> make[1]: Nothing to be done for `all'.
>>>>>>>>>>> make[1]: Leaving directory `/home/dsu/Soft/PETSc/petsc-3.5.2'
>>>>>>>>>>> =========================================
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Then I recompiled the codes, ran with -fp_trap and still got the following error
>>>>>>>>>>> 
>>>>>>>>>>> Backtrace for this error:
>>>>>>>>>>> Note: The EXACT line numbers in the stack are not available,
>>>>>>>>>>> [2]PETSC ERROR:       INSTEAD the line number of the start of the function
>>>>>>>>>>> [2]PETSC ERROR:       is given.
>>>>>>>>>>> [2]PETSC ERROR: [2] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>>>>>>>>>       INSTEAD the line number of the start of the function
>>>>>>>>>>> [3]PETSC ERROR:       is given.
>>>>>>>>>>> [3]PETSC ERROR: [3] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>>>>>>>>> [2]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>>>>>>>>>> [3]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>     This is different then what you sent a few minutes ago where it crashed in hypre.
>>>>>>>>>>> 
>>>>>>>>>>>     Anyways you need to use the -start_in_debugger business I sent in the previous email to see the exact place the problem occurs.
>>>>>>>>>>> 
>>>>>>>>>>> Here is the information shown on gdb screen
>>>>>>>>>>> 
>>>>>>>>>>> Program received signal SIGFPE, Arithmetic exception.
>>>>>>>>>>> 0x00000000006c2bef in velocity_g (l_sufx=1, suffix=..., nmax=12, njamxc=34,
>>>>>>>>>>>      cinfradx=..., radial_coordx=.FALSE., _suffix=3)
>>>>>>>>>>>      at ../../gas_advection/velocity_g.F90:1344
>>>>>>>>>>> 1344                                        cinfrt = cinfrt_dg(i1) * diff(ic,idim)                                      !diff is a very small value, e.g., 1.0d-316
>>>>>>>>>>> (gdb)
>>>>>>>>>>> 
>>>>>>>>>>> After type cont on gdb screen, I got error information as below
>>>>>>>>>>> 
>>>>>>>>>>> [1]PETSC ERROR: *** unknown floating point error occurred ***
>>>>>>>>>>> [1]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>>>>>>>>>>> [1]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>>>>>>>>>>> [1]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>>>>>>>>>> [1]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>>>>>>>> [1]PETSC ERROR: Try option -start_in_debugger
>>>>>>>>>>> [1]PETSC ERROR: likely location of problem given in stack below
>>>>>>>>>>> [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>>>>>>>>>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>>>>>>>> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>>>>>>>>>>> [1]PETSC ERROR:       is given.
>>>>>>>>>>> [1]PETSC ERROR: [1] PetscDefaultFPTrap line 379 /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>>>>>>>>> [1]PETSC ERROR: User provided function() line 0 in Unknown file trapped floating point error
>>>>>>>>>>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>>>>>>>>>>> [0]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>>>>>>>>>>> [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>>>>>>>>>>> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>>>>>>>>>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger
>>>>>>>>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Danyang
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Danyang
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 25, 2015, at 1:05 AM, Danyang Su <danyang.su at gmail.com>
>>>>>>>>>>>   wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Barry and Satish,
>>>>>>>>>>> 
>>>>>>>>>>> How can I get rid of unknown floating point error when a very small value is multiplied.
>>>>>>>>>>> 
>>>>>>>>>>> e.g.,
>>>>>>>>>>> cinfrt_dg(i1) and diff(ic,idim) are  1.0250235986806329E-008 8.6178408169776945E-317 respectively,
>>>>>>>>>>> 
>>>>>>>>>>> cinfrt = cinfrt_dg(i1) * diff(ic,idim)
>>>>>>>>>>> 
>>>>>>>>>>> I get the following error when run with "-fp_trap -start_in_debugger".
>>>>>>>>>>> 
>>>>>>>>>>> Backtrace for this error:
>>>>>>>>>>> *** unknown floating point error occurred ***
>>>>>>>>>>> [2]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
>>>>>>>>>>> [2]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
>>>>>>>>>>> [2]PETSC ERROR:  cinfrt_dg(i1),diff(ic,idim) 1.0250235986806329E-008   8.6178408169776945E-317
>>>>>>>>>>> where the result is a bitwise OR of the following flags:
>>>>>>>>>>> [2]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>>>>>>>> [2]PETSC ERROR: Try option -start_in_debugger
>>>>>>>>>>> [2]PETSC ERROR: likely location of problem given in stack below
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Danyang
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-24 01:54 PM, Danyang Su wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-24 01:23 PM, Satish Balay wrote:
>>>>>>>>>>> 
>>>>>>>>>>>   c           4   1.0976214263087059E-067
>>>>>>>>>>> 
>>>>>>>>>>> I don't think this number can be stored in a real*4.
>>>>>>>>>>> 
>>>>>>>>>>> Satish
>>>>>>>>>>> 
>>>>>>>>>>> Thanks, Satish. It is caused by this number.
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, 24 Apr 2015, Danyang Su wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 15-04-24 11:12 AM, Barry Smith wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 24, 2015, at 1:05 PM, Danyang Su <danyang.su at gmail.com>
>>>>>>>>>>>   wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi All,
>>>>>>>>>>> 
>>>>>>>>>>> One of my case crashes because of floating point exception when using 4
>>>>>>>>>>> processors, as shown below. But if I run this case with 1 processor, it
>>>>>>>>>>> works fine. I have tested the codes with around 100 cases up to 768
>>>>>>>>>>> processors, all other cases work fine. I just wonder if this kind of error
>>>>>>>>>>> is caused because of NaN in jacobi matrix, RHS or preconditioner?
>>>>>>>>>>> 
>>>>>>>>>>>      Yes, almost for sure it is one of these places.
>>>>>>>>>>> 
>>>>>>>>>>>      First run the bad case with -fp_trap  if all goes well you'll see the
>>>>>>>>>>> function where the FPE is generated. Then run also with -start_in_debugger
>>>>>>>>>>> and
>>>>>>>>>>> type cont in all four debugger windows. When the FPE happens the debugger
>>>>>>>>>>> should stop showing exactly where the FPE happens.
>>>>>>>>>>> 
>>>>>>>>>>>     Barry
>>>>>>>>>>> 
>>>>>>>>>>> Hi Barry,
>>>>>>>>>>> 
>>>>>>>>>>> If run with -fp_trap -start_in_debugger, I got the following error
>>>>>>>>>>> 
>>>>>>>>>>> [0]PETSC ERROR: *** unknown floating point error occurred ***
>>>>>>>>>>> [0]PETSC ERROR: The specific exception can be determined by running in a
>>>>>>>>>>> debugger.  When the
>>>>>>>>>>> [0]PETSC ERROR: debugger traps the signal, the exception can be found with
>>>>>>>>>>> fetestexcept(0x3d)
>>>>>>>>>>> [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
>>>>>>>>>>> [0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8
>>>>>>>>>>> FE_UNDERFLOW=0x10 FE_INEXACT=0x20
>>>>>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger
>>>>>>>>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>>>>>>>> [0]PETSC ERROR: ---------------------  Stack Frames
>>>>>>>>>>> ------------------------------------
>>>>>>>>>>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>>>>>>>> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
>>>>>>>>>>> [0]PETSC ERROR:       is given.
>>>>>>>>>>> [0]PETSC ERROR: [0] PetscDefaultFPTrap line 379
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/sys/error/fp.c
>>>>>>>>>>> [0]PETSC ERROR: User provided function() line 0 in Unknown file trapped
>>>>>>>>>>> floating point error
>>>>>>>>>>> 
>>>>>>>>>>> Program received signal SIGABRT: Process abort signal.
>>>>>>>>>>> 
>>>>>>>>>>> Backtrace for this error:
>>>>>>>>>>> #0  0x7F4FEAB1C7D7
>>>>>>>>>>> #1  0x7F4FEAB1CDDE
>>>>>>>>>>> #2  0x7F4FE9E1AD3F
>>>>>>>>>>> #3  0x7F4FE9E1ACC9
>>>>>>>>>>> #4  0x7F4FE9E1E0D7
>>>>>>>>>>> #5  0x7F4FEB0B6DCB
>>>>>>>>>>> #6  0x7F4FEB0B1825
>>>>>>>>>>> #7  0x7F4FEB0B817F
>>>>>>>>>>> #8  0x7F4FE9E1AD3F
>>>>>>>>>>> #9  0x6972C8 in tprfrtlc_ at tprfrtlc.F90:2393 (discriminator 3)
>>>>>>>>>>> #10  0x4C6C87 in gcreact_ at gcreact.F90:678
>>>>>>>>>>> #11  0x707E19 in initicrt_ at initicrt.F90:589
>>>>>>>>>>> #12  0x4F42D0 in initprob_ at initprob.F90:430
>>>>>>>>>>> #13  0x5AAF72 in driver_pc at driver_pc.F90:438
>>>>>>>>>>> 
>>>>>>>>>>> I checked the code at  tprfrtlc.F90:2393,
>>>>>>>>>>> 
>>>>>>>>>>>          realbuffer_gb(1:nvars) = (/time,(c(ic),ic=1,nc-1),     &
>>>>>>>>>>>                                     (cx(ix),ix=1,nxout)/)
>>>>>>>>>>> 
>>>>>>>>>>> All the values (time, c, cx) are reasonable, as shown below. The only
>>>>>>>>>>> possibility is that realbuffer_gb is in declared as real*4 if using sing
>>>>>>>>>>> precision output while time, c, cx are declared in real*8. I have a lot of
>>>>>>>>>>> similar data conversion from real*8 to real*4 output, other code does not
>>>>>>>>>>> return error.
>>>>>>>>>>> 
>>>>>>>>>>>   time   0.0000000000000000
>>>>>>>>>>>   c           1   9.9999999999999995E-008
>>>>>>>>>>>   c           2   3.1555251077549618E-003
>>>>>>>>>>>   c           3   7.1657814842179362E-008
>>>>>>>>>>>   c           4   1.0976214263087059E-067
>>>>>>>>>>>   c           5   5.2879822292305797E-004
>>>>>>>>>>>   c           6   9.9999999999999964E-005
>>>>>>>>>>>   c           7   6.4055731968811337E-005
>>>>>>>>>>>   c           8   3.4607572892578404E-020
>>>>>>>>>>>   cx           1   3.4376650636008101E-005
>>>>>>>>>>>   cx           2   7.3989678854017763E-012
>>>>>>>>>>>   cx           3   9.5317170613607207E-012
>>>>>>>>>>>   cx           4   2.2344525794718353E-015
>>>>>>>>>>>   cx           5   3.0624685689695889E-008
>>>>>>>>>>>   cx           6   1.0046157902783967E-007
>>>>>>>>>>>   cx           7   1.5320169154914984E-004
>>>>>>>>>>>   cx           8   8.6930292776346176E-014
>>>>>>>>>>>   cx           9   3.5944267559348721E-005
>>>>>>>>>>>   cx          10   3.0072645866951157E-018
>>>>>>>>>>>   cx          11   2.3592486321095017E-013
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Danyang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I can check all the entries of jacobi matrix to see if the value is valid,
>>>>>>>>>>> but this seems not a good idea as it takes a long time to reach this
>>>>>>>>>>> point. If I restart the simulation from a specified time (e.g., 7.685 in
>>>>>>>>>>> this case), then the error does not occur.
>>>>>>>>>>> 
>>>>>>>>>>> Would you please give me any suggestion on debugging this case?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> 
>>>>>>>>>>> Danyang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> timestep:    2730 time: 7.665E+00 years   delt: 1.000E-02 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> timestep:    2731 time: 7.675E+00 years   delt: 1.000E-02 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> timestep:    2732 time: 7.685E+00 years   delt: 1.000E-02 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> timestep:    2733 time: 7.695E+00 years   delt: 1.000E-02 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> timestep:    2734 time: 7.705E+00 years   delt: 1.000E-02 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> Reduce time step for reactive transport
>>>>>>>>>>> timestep:    2734 time: 7.700E+00 years   delt: 5.000E-03 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> Reduce time step for reactive transport
>>>>>>>>>>> timestep:    2734 time: 7.697E+00 years   delt: 2.500E-03 years iter:  1
>>>>>>>>>>> timestep:    max.sia: 0.000E+00 tol.sia: 0.000E+00
>>>>>>>>>>> [1]PETSC ERROR: --------------------- Error Message
>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>> [1]PETSC ERROR: Floating point exception
>>>>>>>>>>> [2]PETSC ERROR: --------------------- Error Message
>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>> [2]PETSC ERROR: Floating point exception
>>>>>>>>>>> [2]PETSC ERROR: Vec entry at local location 0 is not-a-number or infinite
>>>>>>>>>>> at end of function: Parameter number 3
>>>>>>>>>>> [2]PETSC ERROR: See
>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>>>>>>>> 
>>>>>>>>>>> for trouble shooting.
>>>>>>>>>>> [2]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>>>>>>>>> [2]PETSC ERROR: [1]PETSC ERROR: Vec entry at local location 0 is
>>>>>>>>>>> not-a-number or infinite at end of function: Parameter number 3
>>>>>>>>>>> [1]PETSC ERROR: See
>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>>>>>>>>> 
>>>>>>>>>>> for trouble shooting.
>>>>>>>>>>> [1]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
>>>>>>>>>>> [1]PETSC ERROR: ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by
>>>>>>>>>>> dsu Thu Apr 23 15:38:52 2015
>>>>>>>>>>> [1]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>>>>>>>>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>>>>>>>>>>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>>>>>>>>>>> --download-parmetis --download-scalapack
>>>>>>>>>>> [1]PETSC ERROR: #1 VecValidValues() line 34 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>>>>>>>>> ../min3p_thcm_petsc_dbg on a linux-gnu-dbg named nwmop by dsu Thu Apr 23
>>>>>>>>>>> 15:38:52 2015
>>>>>>>>>>> [2]PETSC ERROR: Configure options PETSC_ARCH=linux-gnu-dbg --with-cc=gcc
>>>>>>>>>>> --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich
>>>>>>>>>>> --download-mumps --download-hypre --download-superlu_dist --download-metis
>>>>>>>>>>> --download-parmetis --download-scalapack
>>>>>>>>>>> [2]PETSC ERROR: #1 VecValidValues() line 34 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/vec/vec/interface/rvector.c
>>>>>>>>>>> [2]PETSC ERROR: [1]PETSC ERROR: #2 PCApply() line 442 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>>>>>>>>> [1]PETSC ERROR: #2 PCApply() line 442 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/pc/interface/precon.c
>>>>>>>>>>> [2]PETSC ERROR: #3 KSP_PCApply() line 230 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>>>>>>>>> #3 KSP_PCApply() line 230 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/include/petsc-private/kspimpl.h
>>>>>>>>>>> [1]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>>>>>>>>> [2]PETSC ERROR: #4 KSPInitialResidual() line 63 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itres.c
>>>>>>>>>>> [1]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>>>>>>>>> [2]PETSC ERROR: #5 KSPSolve_GMRES() line 234 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/impls/gmres/gmres.c
>>>>>>>>>>> [2]PETSC ERROR: #6 KSPSolve() line 459 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>> [1]PETSC ERROR: #6 KSPSolve() line 459 in
>>>>>>>>>>> /home/dsu/Soft/PETSc/petsc-3.5.2/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>> ^C[mpiexec at nwmop] Sending Ctrl-C to processes as requested
>>>>>>>>>>> [mpiexec at nwmop] Press Ctrl-C again to force abort
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -- 
>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>> <petscconf.h>
> 



More information about the petsc-users mailing list