[petsc-users] SEGV on KSPSolve with multiple processors

Brendan C Lyons bclyons at princeton.edu
Wed Jun 26 10:15:18 CDT 2013


Dear all,

Sorry for the delayed response.  It took me some time to get a debug version
available on the clusters I used.  I reran my code compiled with debugging
turned on and ran it under valgrind.  No error was caught before the call to
KSPSolve(), and I've now received the error message below (I include it only
for one of the four processors here, but I believe it is the same for all of
them).  Any further advice that could help me track down the cause of this
segfault would be appreciated.
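
For reference, here is a minimal sketch of how the solver is set up and called
on my end, written against the PETSc 3.3 C API for brevity (the actual code is
Fortran 90, and the names solve_direct, A, b, x are placeholders):

#include <petscksp.h>

/* Minimal sketch of the solver setup (PETSc 3.3 C API).  A is an assembled
   MPIAIJ (or SeqAIJ) matrix and b, x are matching vectors; the real code is
   Fortran 90 and these names are placeholders. */
PetscErrorCode solve_direct(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);   /* direct solve only */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
  /* SuperLU_DIST for the parallel case; plain SuperLU is used sequentially */
  ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

  ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);  /* view before */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);                      /* SEGV occurs here */
  ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);  /* view after  */

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}

(The equivalent can also be selected at run time with
-ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist.)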

Thanks,

~Brendan

[3]PETSC ERROR:
------------------------------------------------------------------------
[3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[3]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[3]PETSC ERROR: likely location of problem given in stack below
[3]PETSC ERROR: ---------------------  Stack Frames
------------------------------------
[3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[3]PETSC ERROR:       INSTEAD the line number of the start of the function
[3]PETSC ERROR:       is given.
[3]PETSC ERROR: [3] MatLUFactorNumeric_SuperLU_DIST line 284
src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[3]PETSC ERROR: [3] MatLUFactorNumeric line 2791 src/mat/interface/matrix.c
[3]PETSC ERROR: [3] PCSetUp_LU line 108 src/ksp/pc/impls/factor/lu/lu.c
[3]PETSC ERROR: [3] PCSetUp line 810 src/ksp/pc/interface/precon.c
[3]PETSC ERROR: [3] KSPSetUp line 182 src/ksp/ksp/interface/itfunc.c
[3]PETSC ERROR: [3] KSPSolve line 351 src/ksp/ksp/interface/itfunc.c
[3]PETSC ERROR: --------------------- Error Message
------------------------------------
[3]PETSC ERROR: Signal received!
[3]PETSC ERROR:
------------------------------------------------------------------------
[3]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11 12:26:34
CST 2013
[3]PETSC ERROR: See docs/changes/index.html for recent updates.
[3]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[3]PETSC ERROR: See docs/index.html for manual pages.
[3]PETSC ERROR:
------------------------------------------------------------------------
[3]PETSC ERROR: N4D_d.exe on a path-ompi named [REDACTED] by blyons Tue Jun
25 13:50:35 2013
[3]PETSC ERROR: Libraries linked from ${PETSC_DIR}/path-ompi/lib
[3]PETSC ERROR: Configure run at Mon Jun 24 13:11:36 2013
[3]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
--PETSC_DIR=${PATHSCALE_DIR}/openmpi-1.6-pkgs/petsc-3.3-p6-debug
--CFLAGS="-fPIC
-O -mp" --CXXFLAGS="-fPIC -O -mp" --FFLAGS="-fPIC -O -mp"
--with-debugging=1 --with-dynamic-loading=no --with-mpi=1
--with-mpi-dir=${PATHSCALE_DIR}/openmpi-1.6.4 --with-superlu=1
--with-superlu-dir=${PATHSCALE_DIR}/superlu-4.3
--with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]" --with-metis=1
--with-metis-dir=${PATHSCALE_DIR}/metis-5.0.3 --with-parmetis=1
--with-parmetis-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/parmetis-4.0.2
--with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]"
--with-blacs=1
--with-blacs-lib="[${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsCinit_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsF77init_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacs_MPI-LINUX-0.a]"
--with-blacs-include=${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/include
--with-hypre=1 --download-hypre=1 --with-scalapack=1
--with-scalapack-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/scalapack-2.0.2
--with-superlu_dist=1
--with-superlu_dist-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/superlu_dist-3.2
[3]PETSC ERROR:
------------------------------------------------------------------------
[3]PETSC ERROR: User provided function() line 0 in unknown directory
unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==421== Thread 16:
==421== Invalid free() / delete / delete[] / realloc()
==421==    at 0x4C2333A: free (vg_replace_malloc.c:446)
==421==    by 0x9CB25AA: free_mem (in /lib64/libc-2.5.so)
==421==    by 0x9CB21A1: __libc_freeres (in /lib64/libc-2.5.so)
==421==    by 0x4A1E669: _vgnU_freeres (vg_preloaded.c:62)
==421==  Address 0x4165e78 is not stack'd, malloc'd or (recently) free'd
==421==



On Tue, Jun 18, 2013 at 4:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>    If possible, you would also benefit from running the debug version under
> valgrind (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind).
> It is possible that memory corruption took place before the point where the
> code crashes; valgrind will help identify any memory corruption as soon as
> it takes place.
>
>     Barry
>
> On Jun 18, 2013, at 2:15 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
>
> > You should recompile your code using a debug build of PETSc so you get
> some meaningful information from the stack trace when the SEGV occurs.
> >
> > Dave
> >
> >
> > On Tuesday, 18 June 2013, Brendan C Lyons wrote:
> > Hi everyone,
> >
> > I've run into a strange problem in my Fortran 90 code where it runs fine
> with 1 processor, but then throws a segmentation fault on KSPSolve() when I
> try to run it in parallel.  I'm using PETSc 3.3 with the SuperLU direct
> solver for the sequential case and SuperLU_DIST for the parallel case.
>  I've called KSPView before and after KSPSolve.  I'll put the KSPView
> output for the sequential and parallel cases and the crash info for the
> parallel case below (with some details of my system redacted).  Any help
> would be appreciated.  If you need any other information, I'm happy to
> provide it.
> >
> > Thank you,
> >
> > ~Brendan
> > ------------------------------
> >
> > KSPView() before sequential solve:
> >
> > KSP Object: 1 MPI processes
> >   type: preonly
> >   maximum iterations=10000, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> >   left preconditioning
> >   using DEFAULT norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: lu
> >     LU: out-of-place factorization
> >     tolerance for zero pivot 2.22045e-14
> >     matrix ordering: nd
> >   linear system matrix = precond matrix:
> >   Matrix Object:   1 MPI processes
> >     type: seqaij
> >     rows=11760, cols=11760
> >     total: nonzeros=506586, allocated nonzeros=509061
> >     total number of mallocs used during MatSetValues calls =0
> >       not using I-node routines
> >
> > KSPView() after sequential solve:
> >
> > KSP Object: 1 MPI processes
> >   type: preonly
> >   maximum iterations=10000, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> >   left preconditioning
> >   using NONE norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: lu
> >     LU: out-of-place factorization
> >     tolerance for zero pivot 2.22045e-14
> >     matrix ordering: nd
> >     factor fill ratio given 0, needed 0
> >       Factored matrix follows:
> >         Matrix Object:         1 MPI processes
> >           type: seqaij
> >           rows=11760, cols=11760
> >           package used to perform factorization: superlu
> >           total: nonzeros=0, allocated nonzeros=0
> >           total number of mallocs used during MatSetValues calls =0
> >             SuperLU run parameters:
> >               Equil: NO
> >               ColPerm: 3
> >               IterRefine: 0
> >               SymmetricMode: NO
> >               DiagPivotThresh: 1
> >               PivotGrowth: NO
> >               ConditionNumber: NO
> >               RowPerm: 0
> >               ReplaceTinyPivot: NO
> >               PrintStat: NO
> >               lwork: 0
> >   linear system matrix = precond matrix:
> >   Matrix Object:   1 MPI processes
> >     type: seqaij
> >     rows=11760, cols=11760
> >     total: nonzeros=506586, allocated nonzeros=509061
> >     total number of mallocs used during MatSetValues calls =0
> >       not using I-node routines
> >
> >
> > KSPView() before parallel solve:
> >
> > KSP Object: 2 MPI processes
> >   type: preonly
> >   maximum iterations=10000, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> >   left preconditioning
> >   using DEFAULT norm type for convergence test
> > PC Object: 2 MPI processes
> >   type: lu
> >     LU: out-of-place factorization
> >     tolerance for zero pivot 2.22045e-14
> >     matrix ordering: natural
> >   linear system matrix = precond matrix:
> >       Solving Electron Matrix Equation
> >   Matrix Object:   2 MPI processes
> >     type: mpiaij
> >     rows=11760, cols=11760
> >     total: nonzeros=506586, allocated nonzeros=520821
> >     total number of mallocs used during MatSetValues calls =0
> >       not using I-node (on process 0) routines
> >
> > Crash info for parallel solve:
> >
> > [1]PETSC ERROR:
> ------------------------------------------------------------------------
> > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> > [1]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> > [1]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
> to find memory corruption errors
> > [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> > [1]PETSC ERROR: to get more information on the crash.
> > [1]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> > [1]PETSC ERROR: Signal received!
> > [1]PETSC ERROR:
> ------------------------------------------------------------------------
> > [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11
> 12:26:34 CST 2013
> > [1]PETSC ERROR: See docs/changes/index.html for recent updates.
> > [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> > [1]PETSC ERROR: See docs/index.html for manual pages.
> > [1]PETSC ERROR:
> ------------------------------------------------------------------------
> > [1]PETSC ERROR: <redacted> on a path-ompi named <redacted>
> > [1]PETSC ERROR: Libraries linked from <redacted>
> > [1]PETSC ERROR: Configure run at Thu Mar 21 14:19:42 2013
> > [1]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
> --PETSC_DIR=<redacted> --CFLAGS="-fPIC -O -mp" --CXXFLAGS="-fPIC -O -mp"
> --FFLAGS="-fPIC -O -mp" --with-debugging=0 --with-dynamic-loadin=no
> --with-mpi=1 --with-mpi-dir=<redacted> --with-superlu=1
> --with-superlu-dir=<redacted> --with-blas-lapack-lib="<redacted>"
> --with-scalapack=1 --with-scalapack-dir=<redacted> --with-superlu_dist=1
> --with-superlu_dist-dir=<redacted> --with-metis=1
> --with-metis-dir=<redacted> --with-parmetis=1
> --with-parmetis-dir=<redacted> --with-blacs-lib="<redacted>"
> --with-blacs-include=<redacted> --with-hypre=1 --download-hypre=1
> > [1]PETSC ERROR:
> ------------------------------------------------------------------------
> > [1]PETSC ERROR: User provided function() line 0 in unknown directory
> unknown file
> >
> >
> >
>
>