[petsc-users] Solves with valgrind, not without

Barry Smith bsmith at mcs.anl.gov
Tue Oct 12 13:36:54 CDT 2010


   You'll need to build a debug version of the libraries and run in the debugger, with the debugger set to catch floating point exceptions so that it stops where the problem occurs (how to do that is debugger dependent). Based on the difference in behavior, I am guessing the problem is memory corruption.
On my Mac I run valgrind with the options -q --tool=memcheck --dsymutil=yes --num-callers=20

  Barry

On Oct 12, 2010, at 1:12 PM, Arturo Fountain wrote:

> I am using petsc-3.1-p5 to solve a system of equations at multiple timesteps. The LHS matrix is the same at each timestep, although the RHS changes. The strange thing is that the system solves when run under valgrind but not without it.
> 
> This is the case with every solver I have tried and on multiple machines. I am most interested in cg/cgne, although I have used gmres in the past when the condition of the matrix was questionable (it no longer is). In either case the system converges for one and only one timestep when calling:
> 
> mpiexec.uni -n 1 ./MONO -indir InputAniso -outdir OutputAniso -ksp_type gmres -pc_type jacobi -info -ksp_rtol 1e-8 -ksp_initial_guess_nonzero true
> 
> however, calling the same program with valgrind:
> 
> G_SLICE=always-malloc G_DEBUG=gc-friendly mpiexec.uni -n 1 valgrind -v --leak-check=full --show-reachable=yes --track-origins=yes ./MONO -indir InputAniso -outdir OutputAniso -ksp_type gmres -pc_type jacobi -info -ksp_rtol 1e-8 -ksp_initial_guess_nonzero true
> 
> it solves many timesteps (at least 5). The error I receive (again, only without valgrind) is:
> 
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374783
> [0] PetscCommDuplicate():   returning tag 2147483571
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Signal received!
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [0]PETSC ERROR: See docs/index.html for manual pages.
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: ./MONO on a linux-gnu named duality by xerxez Tue Oct 12 10:26:18 2010
> [0]PETSC ERROR: Libraries linked from /home/xerxez/lib/petsc-3.1-p5/linux-gnu-c-debug/lib
> [0]PETSC ERROR: Configure run at Sat Oct  9 19:15:29 2010
> [0]PETSC ERROR: Configure options --download-f-blas-lapack=1 --download-mpich=1 --download-blacs=1 --download-hypre=1
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0[unset]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> 
> I have read http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal and don't see anything wrong or anything different in the calls. After receiving this error I tried using valgrind, but the error goes away every time I run under valgrind.
> 
> Has anyone seen such a problem before?
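The solver setup in the question (the same matrix at every timestep, a changing right-hand side, `-ksp_type cg -pc_type jacobi -ksp_initial_guess_nonzero true -ksp_rtol 1e-8`) can be pictured with the small dense sketch below. It is not PETSc's KSPCG code: `pcg_jacobi` is a hypothetical name, the matrix is assumed symmetric positive definite and stored row-major, and the stopping test uses the unpreconditioned residual, whereas PETSc's default test may use a different norm. The sketch marks the places where Jacobi-preconditioned CG divides (by the diagonal entries and by p^T A p) and returns -1 there instead of producing inf/NaN. For a matrix known to be SPD these checks should not fire, so if an equivalent check fires in a real run, corrupted input data is a likely suspect.

```c
#include <stdlib.h>
#include <assert.h>

/* Hypothetical helper (not PETSc's KSPCG): Jacobi-preconditioned CG for a
 * dense symmetric positive definite n-by-n matrix A stored row-major.
 * On entry x holds the initial guess (cf. -ksp_initial_guess_nonzero); on
 * exit it holds the approximate solution. Stops when
 * ||b - A x||_2 <= rtol * ||b||_2 (unpreconditioned residual).
 * Returns the iteration count, or -1 on a non-positive diagonal entry,
 * a breakdown (p^T A p <= 0 or NaN), allocation failure, or no
 * convergence within maxit iterations. */
int pcg_jacobi(int n, const double *A, const double *b, double *x,
               double rtol, int maxit)
{
    assert(n > 0 && maxit >= 0);
    double *r = malloc(4 * (size_t)n * sizeof *r);
    if (!r) return -1;
    double *z = r + n, *p = z + n, *Ap = p + n;
    double bb = 0.0, rr = 0.0, rz = 0.0;
    int result = -1;

    /* r = b - A x, z = D^{-1} r (Jacobi), p = z */
    for (int i = 0; i < n; i++) {
        double d = A[i * n + i];
        if (!(d > 0.0)) goto done;   /* SPD needs d > 0; avoids dividing by 0 */
        double s = b[i];
        for (int j = 0; j < n; j++) s -= A[i * n + j] * x[j];
        r[i] = s;
        z[i] = p[i] = s / d;
        rz += s * z[i];
        rr += s * s;
        bb += b[i] * b[i];
    }

    for (int k = 0; ; k++) {
        if (rr <= rtol * rtol * bb) { result = k; break; }  /* converged */
        if (k == maxit) break;
        double pAp = 0.0;
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++) s += A[i * n + j] * p[j];
            Ap[i] = s;
            pAp += p[i] * s;
        }
        if (!(pAp > 0.0)) break;     /* breakdown (also catches NaN) */
        double alpha = rz / pAp, rz_new = 0.0;
        rr = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            z[i] = r[i] / A[i * n + i];
            rz_new += r[i] * z[i];
            rr += r[i] * r[i];
        }
        double beta = rz_new / rz;   /* rz > 0 here since r != 0 and D > 0 */
        rz = rz_new;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
done:
    free(r);
    return result;
}
```

For the timestepping loop described in the question, one would keep `x` from the previous timestep and call `pcg_jacobi(n, A, b_k, x, 1e-8, maxit)` with each new right-hand side `b_k`; a step whose initial guess already satisfies the tolerance returns 0 iterations.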


