[petsc-users] SEGV on KSPSolve with multiple processors

Matthew Knepley knepley at gmail.com
Wed Jun 26 10:39:05 CDT 2013


On Wed, Jun 26, 2013 at 5:15 PM, Brendan C Lyons <bclyons at princeton.edu>wrote:

> Dear all,
>
> Sorry for the delayed response.  It took me some time to get a debug
> version available on the clusters I use.  I reran my code compiled with
> debugging turned on and ran it under valgrind.  No error was caught before
> the call to KSPSolve(), and I've now received the error message below (I show
> it only for one of the four processors here, but I believe it's the same
> for all of them).  Any further advice that could help me track down the
> cause of this segfault would be appreciated.
>

I see at least three scenarios:

1) SuperLU_Dist has a bug

     You should try MUMPS; a sketch of switching the factorization package follows below.

2) Our interface to SuperLU_Dist has a bug

    Can you run in the debugger and get a stack trace for this?

3) You have another bug somewhere writing over memory

    You have run valgrind, so this is unlikely

    Matt
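
A minimal C sketch of scenario 1 (not code from this thread): switching the
LU factorization package to MUMPS, assuming PETSc was built with MUMPS
support and that the KSP object called "ksp" here is created and given its
operators elsewhere.  At run time the same switch can usually be made with
-pc_type lu -pc_factor_mat_solver_package mumps, and a stack trace for
scenario 2 can be obtained with the -start_in_debugger option mentioned in
the error output.

    #include <petscksp.h>

    /* Sketch only: select MUMPS for the LU factorization used by a
       "preonly" (pure direct-solve) KSP. */
    PetscErrorCode UseMumpsLU(KSP ksp)
    {
      PC             pc;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
      ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
      ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }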


> Thanks,
>
> ~Brendan
>
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [3]PETSC ERROR: or try
> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
> corruption errors
> [3]PETSC ERROR: likely location of problem given in stack below
> [3]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [3]PETSC ERROR:       INSTEAD the line number of the start of the function
> [3]PETSC ERROR:       is given.
> [3]PETSC ERROR: [3] MatLUFactorNumeric_SuperLU_DIST line 284
> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [3]PETSC ERROR: [3] MatLUFactorNumeric line 2791 src/mat/interface/matrix.c
> [3]PETSC ERROR: [3] PCSetUp_LU line 108 src/ksp/pc/impls/factor/lu/lu.c
> [3]PETSC ERROR: [3] PCSetUp line 810 src/ksp/pc/interface/precon.c
> [3]PETSC ERROR: [3] KSPSetUp line 182 src/ksp/ksp/interface/itfunc.c
> [3]PETSC ERROR: [3] KSPSolve line 351 src/ksp/ksp/interface/itfunc.c
> [3]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [3]PETSC ERROR: Signal received!
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11 12:26:34
> CST 2013
> [3]PETSC ERROR: See docs/changes/index.html for recent updates.
> [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [3]PETSC ERROR: See docs/index.html for manual pages.
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: N4D_d.exe on a path-ompi named [REDACTED] by blyons Tue
> Jun 25 13:50:35 2013
> [3]PETSC ERROR: Libraries linked from ${PETSC_DIR}/path-ompi/lib
> [3]PETSC ERROR: Configure run at Mon Jun 24 13:11:36 2013
> [3]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
> --PETSC_DIR=${PATHSCALE_DIR}/openmpi-1.6-pkgs/petsc-3.3-p6-debug --CFLAGS="-fPIC
> -O -mp" --CXXFLAGS="-fPIC -O -mp" --FFLAGS="-fPIC -O -mp"
> --with-debugging=1 --with-dynamic-loading=no --with-mpi=1
> --with-mpi-dir=${PATHSCALE_DIR}/openmpi-1.6.4 --with-superlu=1
> --with-superlu-dir=${PATHSCALE_DIR}/superlu-4.3
> --with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]" --with-metis=1
> --with-metis-dir=${PATHSCALE_DIR}/metis-5.0.3 --with-parmetis=1
> --with-parmetis-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/parmetis-4.0.2
> --with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]"
> --with-blacs=1
> --with-blacs-lib="[${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsCinit_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsF77init_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacs_MPI-LINUX-0.a]"
> --with-blacs-include=${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/include
> --with-hypre=1 --download-hypre=1 --with-scalapack=1
> --with-scalapack-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/scalapack-2.0.2
> --with-superlu_dist=1
> --with-superlu_dist-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/superlu_dist-3.2
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: User provided function() line 0 in unknown directory
> unknown file
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> ==421== Thread 16:
> ==421== Invalid free() / delete / delete[] / realloc()
> ==421==    at 0x4C2333A: free (vg_replace_malloc.c:446)
> ==421==    by 0x9CB25AA: free_mem (in /lib64/libc-2.5.so)
> ==421==    by 0x9CB21A1: __libc_freeres (in /lib64/libc-2.5.so)
> ==421==    by 0x4A1E669: _vgnU_freeres (vg_preloaded.c:62)
> ==421==  Address 0x4165e78 is not stack'd, malloc'd or (recently) free'd
> ==421==
>
>
>
> On Tue, Jun 18, 2013 at 4:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>>    If possible, you would also benefit from running the debug version
>> under valgrind
>> (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind).  It is
>> possible that memory corruption has taken place before the point where the
>> code crashes; valgrind will help identify any memory corruption as soon as
>> it takes place.
>>
>>     Barry
>>
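
In addition to valgrind (typically run under MPI as something like
mpiexec -n <np> valgrind --tool=memcheck ./executable), a debug build of
PETSc also lets the CHKMEMQ macro be sprinkled through the user code; each
occurrence checks all PetscMalloc'd memory for corruption at that point,
which can localize damage done long before KSPSolve().  A minimal sketch
follows, assuming the KSP, Mat, and Vec objects are created elsewhere; it is
an illustration, not code from this thread.

    #include <petscksp.h>

    /* Sketch only: bracket assembly and the solve with CHKMEMQ so that
       corruption of PETSc-allocated memory is reported at the first check
       after it occurs (requires a debug build or -malloc_debug). */
    PetscErrorCode CheckedSolve(KSP ksp, Mat A, Vec b, Vec x)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      CHKMEMQ;                                 /* heap intact before assembly?   */
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      CHKMEMQ;                                 /* ... and just before the solve? */
      ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      CHKMEMQ;                                 /* ... and after it?              */
      PetscFunctionReturn(0);
    }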
>> On Jun 18, 2013, at 2:15 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
>>
>> > You should recompile your code using a debug build of PETSc so you get
>> some meaningful information from the stack trace when the SEGV occurs.
>> >
>> > Dave
>> >
>> >
>> > On Tuesday, 18 June 2013, Brendan C Lyons wrote:
>> > Hi everyone,
>> >
>> > I've run into a strange problem in my Fortran 90 code where it runs
>> fine with 1 processor, but then throws a segmentation fault on KSPSolve()
>> when I try to run it in parallel.  I'm using PETSc 3.3 with the SuperLU
>> direct solver for the sequential case and SuperLU_dist for the parallel
>> case.  I've called KSPView before and after KSPSolve.  I'll put the KSPView
>> output for the sequential and parallel cases and the crash info for the
>> parallel case below (with some details of my system redacted).  Any help
>> would be appreciated.  If you need any other information, I'm happy to
>> provide it.
>> >
>> > Thank you,
>> >
>> > ~Brendan
>> > ------------------------------
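
A minimal C reconstruction of the workflow described above (a sketch only;
the actual code in question is Fortran 90 and does not appear in the
thread): a preonly KSP with an LU preconditioner, using SuperLU on one
process and SuperLU_Dist on several, with KSPView() before and after the
solve.  The KSP, Mat, and Vec objects are assumed to be created and
assembled elsewhere.

    #include <petscksp.h>

    /* Sketch only: direct solve via LU, viewing the KSP before and after. */
    PetscErrorCode DirectSolveWithView(KSP ksp, Mat A, Vec b, Vec x)
    {
      PC             pc;
      PetscMPIInt    size;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
      ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
      /* SuperLU for the sequential case, SuperLU_Dist in parallel */
      ierr = PCFactorSetMatSolverPackage(pc, size == 1 ? MATSOLVERSUPERLU
                                                       : MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

      ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* "before" view */
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* "after" view  */
      PetscFunctionReturn(0);
    }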
>> >
>> > KSPView() before sequential solve:
>> >
>> > KSP Object: 1 MPI processes
>> >   type: preonly
>> >   maximum iterations=10000, initial guess is zero
>> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>> >   left preconditioning
>> >   using DEFAULT norm type for convergence test
>> > PC Object: 1 MPI processes
>> >   type: lu
>> >     LU: out-of-place factorization
>> >     tolerance for zero pivot 2.22045e-14
>> >     matrix ordering: nd
>> >   linear system matrix = precond matrix:
>> >   Matrix Object:   1 MPI processes
>> >     type: seqaij
>> >     rows=11760, cols=11760
>> >     total: nonzeros=506586, allocated nonzeros=509061
>> >     total number of mallocs used during MatSetValues calls =0
>> >       not using I-node routines
>> >
>> > KSPView() after sequential solve:
>> >
>> > KSP Object: 1 MPI processes
>> >   type: preonly
>> >   maximum iterations=10000, initial guess is zero
>> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>> >   left preconditioning
>> >   using NONE norm type for convergence test
>> > PC Object: 1 MPI processes
>> >   type: lu
>> >     LU: out-of-place factorization
>> >     tolerance for zero pivot 2.22045e-14
>> >     matrix ordering: nd
>> >     factor fill ratio given 0, needed 0
>> >       Factored matrix follows:
>> >         Matrix Object:         1 MPI processes
>> >           type: seqaij
>> >           rows=11760, cols=11760
>> >           package used to perform factorization: superlu
>> >           total: nonzeros=0, allocated nonzeros=0
>> >           total number of mallocs used during MatSetValues calls =0
>> >             SuperLU run parameters:
>> >               Equil: NO
>> >               ColPerm: 3
>> >               IterRefine: 0
>> >               SymmetricMode: NO
>> >               DiagPivotThresh: 1
>> >               PivotGrowth: NO
>> >               ConditionNumber: NO
>> >               RowPerm: 0
>> >               ReplaceTinyPivot: NO
>> >               PrintStat: NO
>> >               lwork: 0
>> >   linear system matrix = precond matrix:
>> >   Matrix Object:   1 MPI processes
>> >     type: seqaij
>> >     rows=11760, cols=11760
>> >     total: nonzeros=506586, allocated nonzeros=509061
>> >     total number of mallocs used during MatSetValues calls =0
>> >       not using I-node routines
>> >
>> >
>> > KSPView() before parallel solve:
>> >
>> > KSP Object: 2 MPI processes
>> >   type: preonly
>> >   maximum iterations=10000, initial guess is zero
>> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>> >   left preconditioning
>> >   using DEFAULT norm type for convergence test
>> > PC Object: 2 MPI processes
>> >   type: lu
>> >     LU: out-of-place factorization
>> >     tolerance for zero pivot 2.22045e-14
>> >     matrix ordering: natural
>> >   linear system matrix = precond matrix:
>> >       Solving Electron Matrix Equation
>> >   Matrix Object:   2 MPI processes
>> >     type: mpiaij
>> >     rows=11760, cols=11760
>> >     total: nonzeros=506586, allocated nonzeros=520821
>> >     total number of mallocs used during MatSetValues calls =0
>> >       not using I-node (on process 0) routines
>> >
>> > Crash info for parallel solve:
>> >
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> > [1]PETSC ERROR: Try option -start_in_debugger or
>> -on_error_attach_debugger
>> > [1]PETSC ERROR: or see
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [1]PETSC ERROR: or try
>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>> corruption errors
>> > [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>> and run
>> > [1]PETSC ERROR: to get more information on the crash.
>> > [1]PETSC ERROR: --------------------- Error Message
>> ------------------------------------
>> > [1]PETSC ERROR: Signal received!
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11
>> 12:26:34 CST 2013
>> > [1]PETSC ERROR: See docs/changes/index.html for recent updates.
>> > [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> > [1]PETSC ERROR: See docs/index.html for manual pages.
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: <redacted> on a path-ompi named <redacted>
>> > [1]PETSC ERROR: Libraries linked from <redacted>
>> > [1]PETSC ERROR: Configure run at Thu Mar 21 14:19:42 2013
>> > [1]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
>> --PETSC_DIR=<redacted> --CFLAGS="-fPIC -O -mp" --CXXFLAGS="-fPIC -O -mp"
>> --FFLAGS="-fPIC -O -mp" --with-debugging=0 --with-dynamic-loadin=no
>> --with-mpi=1 --with-mpi-dir=<redacted> --with-superlu=1
>> --with-superlu-dir=<redacted> --with-blas-lapack-lib="<redacted>"
>> --with-scalapack=1 --with-scalapack-dir=<redacted> --with-superlu_dist=1
>> --with-superlu_dist-dir=<redacted> --with-metis=1
>> --with-metis-dir=<redacted> --with-parmetis=1
>> --with-parmetis-dir=<redacted> --with-blacs-lib="<redacted>"
>> --with-blacs-include=<redacted> --with-hypre=1 --download-hypre=1
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: User provided function() line 0 in unknown directory
>> unknown file
>> >
>> >
>> >
>>
>>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener