[petsc-users] floating point exception… but only when >4 cores are used...
Matthew Knepley
knepley at gmail.com
Sat Apr 28 20:10:45 CDT 2012
On Sat, Apr 28, 2012 at 8:59 PM, Andrew Spott <andrew.spott at gmail.com>wrote:
> What makes it easier if autotools makes it hard?
>
This is a joke, but I do think autotools makes everything hard.
> On Apr 28, 2012 6:43 PM, "Matthew Knepley" <knepley at gmail.com> wrote:
>
>> On Sat, Apr 28, 2012 at 8:36 PM, Andrew Spott <andrew.spott at gmail.com>wrote:
>>
>>> When I attach debugger on error on the local machine, I get a bunch of
>>> lines like this one:
>>>
>>> warning: Could not find object file
>>> "/private/tmp/homebrew-gcc-4.6.2-HNPr/gcc-4.6.2/build/x86_64-apple-darwin11.3.0/libstdc++-v3/src/../libsupc++/.libs/libsupc++convenience.a(cp-demangle.o)"
>>> - no debug information available for "cp-demangle.c".
>>>
>>
>> It looks like you built with autotools. That just makes things hard :)
>>
>>
>>> then ". done" and then nothing. It looks like the program exits before
>>> the debugger can attach. After a while I get this:
>>>
>>
>> You can use -debugger_pause 10 to make it wait 10s before continuing
>> after spawning the debugger. Make the pause long enough to attach.
>>
>
Did this work?
Matt
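
For reference, the suggestion quoted above could be put together roughly as
follows. This is only a sketch: the helper name attach_on_error_cmd, the
process count, the 30 second pause, and ./executable are placeholders I made
up, not anything from Andrew's setup. The helper prints the command line
instead of running it, so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an mpiexec command line that asks
# PETSc to attach a debugger on error and pause so the debugger can attach.
# Arguments: number of processes, pause in seconds, path to the executable.
attach_on_error_cmd() {
    nprocs=$1
    pause_s=$2
    exe=$3
    printf 'mpiexec -n %s %s -on_error_attach_debugger -debugger_pause %s\n' \
        "$nprocs" "$exe" "$pause_s"
}

# Example: 4 local processes, 30 s pause (both values are assumptions).
attach_on_error_cmd 4 30 ./executable
```

If the debugger still cannot attach in time, increasing the pause is the
first thing to try.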
> Matt
>>
>>
>>> /Users/spott/Documents/Code/EnergyBasisSchrodingerSolver/data/ebss-input/basis_rmax1.00e+02_rmin1.00e-06_dr1.00e-01/76151:
>>> No such file or directory
>>> Unable to access task for process-id 76151: (os/kern) failure.
>>>
>>> in the gdb window. In the terminal window, I get
>>>
>>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 3
>>> [cli_3]: aborting job:
>>>
>>> If I just use "-start_in_debugger", I don't get the "MPI_Abort" thing,
>>> but everything else is the same.
>>>
>>> Any ideas?
>>>
>>> -Andrew
>>>
>>> On Apr 28, 2012, at 6:11 PM, Matthew Knepley wrote:
>>>
>>> On Sat, Apr 28, 2012 at 8:07 PM, Andrew Spott <andrew.spott at gmail.com>wrote:
>>>
>>>> Are there any tricks to doing this across ssh?
>>>>
>>>> I've attempted it using the method given, but I can't get it to start
>>>> in the debugger or to attach the debugger; the program just exits or hangs
>>>> after telling me the error.
>>>>
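
One approach that needs neither X11 nor an interactive terminal, and so may
be easier over ssh, is to run every rank under gdb in batch mode and let it
print a backtrace when the signal arrives. This is a sketch of a general gdb
technique, not something proposed elsewhere in this thread; the helper name
gdb_batch_cmd, the process count, and ./executable are placeholders. The
helper only prints the command line.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an mpiexec command line that starts
# each rank under gdb in batch mode. gdb runs the program, stops when a
# signal such as SIGFPE is delivered, prints a backtrace, and exits.
# Arguments: number of processes, path to the executable.
gdb_batch_cmd() {
    nprocs=$1
    exe=$2
    printf 'mpiexec -n %s gdb -batch -ex run -ex bt --args %s\n' \
        "$nprocs" "$exe"
}

# Example: 8 processes (an assumed count, chosen to be more than 4).
gdb_batch_cmd 8 ./executable
```

The backtraces from different ranks will be interleaved in the output, and
they are only useful if the executable was built with debug symbols.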
>>>
>>> Is there a reason you cannot run this problem on your local machine with
>>> 4 processes?
>>>
>>> Matt
>>>
>>>
>>>> -Andrew
>>>>
>>>> On Apr 28, 2012, at 4:45 PM, Matthew Knepley wrote:
>>>>
>>>> On Sat, Apr 28, 2012 at 6:39 PM, Andrew Spott <andrew.spott at gmail.com>wrote:
>>>>
>>>>> >-start_in_debugger noxterm -debugger_nodes 14
>>>>>
>>>>> All my cores are on the same machine. Is this supposed to start a
>>>>> debugger on processor 14, or on computer 14?
>>>>>
>>>>
>>>> Neither. This spawns a gdb process on the same node as the process with
>>>> MPI rank 14, then attaches gdb to that process.
>>>>
>>>> Matt
>>>>
>>>>
>>>>> I don't think I have X11 set up properly for the compute nodes, so X11
>>>>> isn't really an option.
>>>>>
>>>>> Thanks for the help.
>>>>>
>>>>> -Andrew
>>>>>
>>>>>
>>>>> On Apr 27, 2012, at 7:26 PM, Satish Balay wrote:
>>>>>
>>>>> > On Fri, 27 Apr 2012, Andrew Spott wrote:
>>>>> >
>>>>> >> I'm honestly stumped.
>>>>> >>
>>>>> >> I have some PETSc code that essentially just populates a matrix in
>>>>> >> parallel, then puts it in a file. All my code that uses floating point
>>>>> >> computations is checked for NaNs and infinities, and none seem to
>>>>> >> show up. However, when I run it on more than 4 cores, I get floating point
>>>>> >> exceptions that kill the program. I tried turning off the exceptions from
>>>>> >> PETSc, but the program still dies from them, just without the PETSc error
>>>>> >> message.
>>>>> >>
>>>>> >> I honestly don't know where to go. I suppose I should attach a
>>>>> >> debugger, but I'm not sure how to do that for multi-processor code.
>>>>> >
>>>>> > Assuming you have X11 set up properly from the compute nodes, you can run
>>>>> > with the extra option '-start_in_debugger'.
>>>>> >
>>>>> > If X11 is not properly set up, and you'd like to run gdb on one of the
>>>>> > nodes [say node 14 where you see SEGV], you can do:
>>>>> >
>>>>> > -start_in_debugger noxterm -debugger_nodes 14
>>>>> >
>>>>> > Or try valgrind
>>>>> >
>>>>> > mpiexec -n 16 valgrind --tool=memcheck -q ./executable
>>>>> >
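
With 16 ranks writing to one terminal, the valgrind output above gets
interleaved. A possible variant, sketched here, sends each process to its own
log file. The --log-file and --num-callers options are standard valgrind
options but are not part of the command quoted above; the helper name
valgrind_cmd and the log file name valgrind.log.%p are placeholders. The
helper only prints the command line.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an mpiexec command line that runs
# each rank under valgrind memcheck with a separate log file per process.
# Arguments: number of processes, path to the executable.
valgrind_cmd() {
    nprocs=$1
    exe=$2
    # valgrind itself expands %p in --log-file to the process ID; the doubled
    # %% below is only there to get a literal % through printf.
    printf 'mpiexec -n %s valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%%p %s\n' \
        "$nprocs" "$exe"
}

# Example with the same process count as the quoted command.
valgrind_cmd 16 ./executable
```

After the run, the log belonging to the failing rank can be read on its own
instead of being picked out of mixed terminal output.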
>>>>> >
>>>>> > For debugging, it's best to install with --download-mpich [so that it's
>>>>> > valgrind clean] and run all MPI stuff on a single machine
>>>>> > [usually X11 works well from a single machine].
>>>>> >
>>>>> > Satish
>>>>> >
>>>>> >>
>>>>> >> any ideas? (long error message below):
>>>>> >>
>>>>> >> -Andrew
>>>>> >>
>>>>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [14]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>>>>> >> [14]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>> >> [14]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
>>>>> >> [14]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>> >> [14]PETSC ERROR: likely location of problem given in stack below
>>>>> >> [14]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>> >> [14]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>> >> [14]PETSC ERROR: INSTEAD the line number of the start of the function
>>>>> >> [14]PETSC ERROR: is given.
>>>>> >> [14]PETSC ERROR: --------------------- Error Message ------------------------------------
>>>>> >> [14]PETSC ERROR: Signal received!
>>>>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [14]PE[15]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [15]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>>>>> >> [15]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>> >> [15]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
>>>>> >> [15]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>> >> [15]PETSC ERROR: likely location of problem given in stack below
>>>>> >> [15]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>> >> [15]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>> >> [15]PETSC ERROR: INSTEAD the line number of the start of the function
>>>>> >> [15]PETSC ERROR: is given.
>>>>> >> [15]PETSC ERROR: --------------------- Error Message ------------------------------------
>>>>> >> [15]PETSC ERROR: Signal received!
>>>>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [15]PETSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>>>>> >> [14]PETSC ERROR: See docs/changes/index.html for recent updates.
>>>>> >> [14]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>>>>> >> [14]PETSC ERROR: See docs/index.html for manual pages.
>>>>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [14]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>>>>> >> [14]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>>>>> >> [14]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>>>>> >> [14]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>>>>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [14]TSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>>>>> >> [15]PETSC ERROR: See docs/changes/index.html for recent updates.
>>>>> >> [15]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>>>>> >> [15]PETSC ERROR: See docs/index.html for manual pages.
>>>>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [15]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>>>>> >> [15]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>>>>> >> [15]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>>>>> >> [15]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>>>>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>>>>> >> [15]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>>>>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 14PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>>>>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 15[0]0:Return code = 0, signaled with Interrupt
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>