[petsc-users] floating point exception… but only when >4 cores are used...

Matthew Knepley knepley at gmail.com
Sat Apr 28 19:11:37 CDT 2012


On Sat, Apr 28, 2012 at 8:07 PM, Andrew Spott <andrew.spott at gmail.com> wrote:

> Are there any tricks to doing this across ssh?
>
> I've attempted it using the method given, but I can't get the program to
> start in the debugger or get the debugger to attach; the program just exits
> or hangs after printing the error.
>

Is there a reason you cannot run this problem on your local machine with 4
processes?
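
If running locally is not possible, one common trick for attaching over ssh without X11 is to make the chosen rank print its host and PID and then wait until a debugger releases it. A minimal sketch, assuming POSIX `gethostname`, `getpid`, and `sleep`; the helper name `wait_for_debugger` and its parameters are hypothetical and not part of PETSc:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not a PETSc function).
   Prints where this process runs, then polls until *hold becomes 0.
   Gives up after max_polls polls so a forgotten call cannot hang a job forever.
   Returns the number of polls performed. */
int wait_for_debugger(int rank, volatile int *hold,
                      int max_polls, unsigned poll_seconds)
{
    char host[256];
    int polls = 0;

    if (gethostname(host, sizeof host) != 0) {
        snprintf(host, sizeof host, "unknown-host");
    }
    host[sizeof host - 1] = '\0';  /* gethostname may not terminate on truncation */

    fprintf(stderr, "[%d] pid %ld on %s waiting for debugger\n",
            rank, (long)getpid(), host);
    fflush(stderr);

    while (*hold != 0 && polls < max_polls) {
        sleep(poll_seconds);
        polls++;
    }
    return polls;
}
```

In a real run this would be called only on the failing rank (14 in the log), right after MPI is initialized. You would then ssh to the printed host, run `gdb -p <pid>`, select the `wait_for_debugger` frame, and use `set var *hold = 0` followed by `continue` to let the rank proceed under the debugger.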

   Matt


> -Andrew
>
> On Apr 28, 2012, at 4:45 PM, Matthew Knepley wrote:
>
> On Sat, Apr 28, 2012 at 6:39 PM, Andrew Spott <andrew.spott at gmail.com> wrote:
>
>> > -start_in_debugger noxterm -debugger_nodes 14
>>
>> All my cores are on the same machine. Is this supposed to start a
>> debugger on processor 14 or on computer 14?
>>
>
> Neither. This spawns a gdb process on the same node as the process with
> MPI rank 14, then attaches that gdb to rank 14. The argument to
> -debugger_nodes is an MPI rank, not a core or machine number.
>
>     Matt
>
>
>> I don't think I have X11 set up properly for the compute nodes, so X11
>> isn't really an option.
>>
>> Thanks for the help.
>>
>> -Andrew
>>
>>
>> On Apr 27, 2012, at 7:26 PM, Satish Balay wrote:
>>
>> > On Fri, 27 Apr 2012, Andrew Spott wrote:
>> >
>> >> I'm honestly stumped.
>> >>
>> >> I have some PETSc code that essentially just populates a matrix in
>> >> parallel, then writes it to a file.  All my code that uses floating point
>> >> computations is checked for NaNs and infinities, and none seem to show
>> >> up.  However, when I run it on more than 4 cores, I get floating point
>> >> exceptions that kill the program.  I tried turning off the exceptions from
>> >> PETSc, but the program still dies from them, just without the PETSc error
>> >> message.
>> >>
>> >> I honestly don't know where to go. I suppose I should attach a
>> >> debugger, but I'm not sure how to do that for multi-processor code.
>> >
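
One thing that checks for NaN and infinity cannot catch: on Linux/x86, an integer division or modulo by zero is also delivered as SIGFPE (signal 8), and no floating point value is ever produced to inspect. Nothing in this thread confirms that this is the cause here; it is only one candidate that would fit a failure that depends on the number of processes, for example if some rank ends up with a zero count. A minimal sketch, assuming hypothetical helper names `all_finite` and `rows_per_block` that are not part of PETSc:

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every entry of v is finite (no NaN, no infinity), else 0.
   This is the kind of check described above; it only sees floating point data. */
int all_finite(const double *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(v[i])) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: rows per block on one rank.
   Returns -1 rather than dividing when the rank owns no blocks,
   because an integer divide by zero is undefined behavior in C and
   typically raises SIGFPE on x86 instead of yielding a NaN. */
int rows_per_block(int local_rows, int local_blocks)
{
    if (local_blocks == 0) {
        return -1;
    }
    return local_rows / local_blocks;
}
```

Auditing integer divisions and modulo operations whose divisor depends on the rank or on the number of processes is cheap to do before reaching for a debugger.
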
>> > Assuming you have X11 set up properly from the compute nodes, you can run
>> > with the extra option '-start_in_debugger'.
>> >
>> > If X11 is not properly set up and you'd like to run gdb on one of the
>> > ranks [say rank 14, where you see the FPE], you can do:
>> >
>> > -start_in_debugger noxterm -debugger_nodes 14
>> >
>> > Or try valgrind
>> >
>> > mpiexec -n 16 valgrind --tool=memcheck -q ./executable
>> >
>> >
>> > For debugging - it's best to install with --download-mpich [so that it's
>> > valgrind clean] - and run all MPI stuff on a single machine - [usually
>> > X11 works well from a single machine.]
>> >
>> > Satish
>> >
>> >>
>> >> any ideas?  (long error message below):
>> >>
>> >> -Andrew
>> >>
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>> >> [14]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> >> [14]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
>> >> [14]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> >> [14]PETSC ERROR: likely location of problem given in stack below
>> >> [14]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> >> [14]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> >> [14]PETSC ERROR:       INSTEAD the line number of the start of the function
>> >> [14]PETSC ERROR:       is given.
>> >> [14]PETSC ERROR: --------------------- Error Message ------------------------------------
>> >> [14]PETSC ERROR: Signal received!
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PE[15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>> >> [15]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> >> [15]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
>> >> [15]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> >> [15]PETSC ERROR: likely location of problem given in stack below
>> >> [15]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> >> [15]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> >> [15]PETSC ERROR:       INSTEAD the line number of the start of the function
>> >> [15]PETSC ERROR:       is given.
>> >> [15]PETSC ERROR: --------------------- Error Message ------------------------------------
>> >> [15]PETSC ERROR: Signal received!
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>> >> [14]PETSC ERROR: See docs/changes/index.html for recent updates.
>> >> [14]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> >> [14]PETSC ERROR: See docs/index.html for manual pages.
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>> >> [14]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>> >> [14]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>> >> [14]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]TSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>> >> [15]PETSC ERROR: See docs/changes/index.html for recent updates.
>> >> [15]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> >> [15]PETSC ERROR: See docs/index.html for manual pages.
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>> >> [15]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>> >> [15]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>> >> [15]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 14PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 15[0]0:Return code = 0, signaled with Interrupt
>> >
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>

