[petsc-users] SEGV on KSPSolve with multiple processors
Matthew Knepley
knepley at gmail.com
Wed Jun 26 10:39:05 CDT 2013
On Wed, Jun 26, 2013 at 5:15 PM, Brendan C Lyons <bclyons at princeton.edu> wrote:
> Dear all,
>
> Sorry for the delayed response. It took me some time to get a debug
> version available on the clusters I use. I reran my code compiled with
> debugging turned on and ran it under valgrind. No error was caught before
> the call to KSPSolve(), and I've now received the error message below (I
> put it only for one of the four processors here, but I believe it's the
> same for all of them). Any further advice that could help me track down
> the cause of this segfault would be appreciated.
>
I see at least three scenarios:

1) SuperLU_Dist has a bug

   You should try MUMPS (see the sketch after this list)

2) Our interface to SuperLU_Dist has a bug

   Can you run in the debugger and get a stack trace for this?

3) You have another bug somewhere writing over memory

   You have run valgrind, so this is unlikely
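
For reference, switching the factorization to MUMPS usually needs no code
changes beyond the runtime options

   -pc_type lu -pc_factor_mat_solver_package mumps

or, in code, a call such as the following (a minimal C sketch against the
PETSc 3.3 API, assuming PETSc was configured with MUMPS; the helper name
use_mumps is hypothetical, not taken from this thread):

   #include <petscksp.h>

   /* Select MUMPS (instead of SuperLU_DIST) for the LU factorization. */
   PetscErrorCode use_mumps(KSP ksp)
   {
     PC             pc;
     PetscErrorCode ierr;

     ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
     ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
     ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);CHKERRQ(ierr);
     return 0;
   }

For scenario 2, running with -on_error_attach_debugger (as the error
message itself suggests) should now give a stack trace with exact line
numbers, since the build has debugging enabled.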
Matt
> Thanks,
>
> ~Brendan
>
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [3]PETSC ERROR: or try
> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
> corruption errors
> [3]PETSC ERROR: likely location of problem given in stack below
> [3]PETSC ERROR: --------------------- Stack Frames
> ------------------------------------
> [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [3]PETSC ERROR: INSTEAD the line number of the start of the function
> [3]PETSC ERROR: is given.
> [3]PETSC ERROR: [3] MatLUFactorNumeric_SuperLU_DIST line 284
> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [3]PETSC ERROR: [3] MatLUFactorNumeric line 2791 src/mat/interface/matrix.c
> [3]PETSC ERROR: [3] PCSetUp_LU line 108 src/ksp/pc/impls/factor/lu/lu.c
> [3]PETSC ERROR: [3] PCSetUp line 810 src/ksp/pc/interface/precon.c
> [3]PETSC ERROR: [3] KSPSetUp line 182 src/ksp/ksp/interface/itfunc.c
> [3]PETSC ERROR: [3] KSPSolve line 351 src/ksp/ksp/interface/itfunc.c
> [3]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [3]PETSC ERROR: Signal received!
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11 12:26:34
> CST 2013
> [3]PETSC ERROR: See docs/changes/index.html for recent updates.
> [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [3]PETSC ERROR: See docs/index.html for manual pages.
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: N4D_d.exe on a path-ompi named [REDACTED] by blyons Tue
> Jun 25 13:50:35 2013
> [3]PETSC ERROR: Libraries linked from ${PETSC_DIR}/path-ompi/lib
> [3]PETSC ERROR: Configure run at Mon Jun 24 13:11:36 2013
> [3]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
> --PETSC_DIR=${PATHSCALE_DIR}/openmpi-1.6-pkgs/petsc-3.3-p6-debug --CFLAGS="-fPIC
> -O -mp" --CXXFLAGS="-fPIC -O -mp" --FFLAGS="-fPIC -O -mp"
> --with-debugging=1 --with-dynamic-loading=no --with-mpi=1
> --with-mpi-dir=${PATHSCALE_DIR}/openmpi-1.6.4 --with-superlu=1
> --with-superlu-dir=${PATHSCALE_DIR}/superlu-4.3
> --with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]"
> --with-metis=1
> --with-metis-dir=${PATHSCALE_DIR}/metis-5.0.3 --with-parmetis=1
> --with-parmetis-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/parmetis-4.0.2
> --with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]"
> --with-blacs=1
> --with-blacs-lib="[${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsCinit_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsF77init_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacs_MPI-LINUX-0.a]"
> --with-blacs-include=${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/include
> --with-hypre=1 --download-hypre=1 --with-scalapack=1
> --with-scalapack-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/scalapack-2.0.2
> --with-superlu_dist=1
> --with-superlu_dist-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/superlu_dist-3.2
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: User provided function() line 0 in unknown directory
> unknown file
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> ==421== Thread 16:
> ==421== Invalid free() / delete / delete[] / realloc()
> ==421== at 0x4C2333A: free (vg_replace_malloc.c:446)
> ==421== by 0x9CB25AA: free_mem (in /lib64/libc-2.5.so)
> ==421== by 0x9CB21A1: __libc_freeres (in /lib64/libc-2.5.so)
> ==421== by 0x4A1E669: _vgnU_freeres (vg_preloaded.c:62)
> ==421== Address 0x4165e78 is not stack'd, malloc'd or (recently) free'd
> ==421==
>
>
>
> On Tue, Jun 18, 2013 at 4:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> If possible, you would also benefit from running the debug version
>> under valgrind
>> (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind). It is
>> possible that memory corruption has taken place before the point where
>> the code crashes; valgrind will help identify any memory corruption as
>> soon as it takes place.
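
(For reference, the FAQ linked above recommends an invocation roughly along
these lines; the process count and extra options here are illustrative, with
the executable name taken from the error log at the top of this message:)

   mpiexec -n 4 valgrind --tool=memcheck -q --num-callers=20 \
       --log-file=valgrind.log.%p ./N4D_d.exe -malloc off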
>>
>> Barry
>>
>> On Jun 18, 2013, at 2:15 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
>>
>> > You should recompile your code using a debug build of PETSc so you get
>> some meaningful info from the stack trace when the SEGV occurs.
>> >
>> > Dave
>> >
>> >
>> > On Tuesday, 18 June 2013, Brendan C Lyons wrote:
>> > Hi everyone,
>> >
>> > I've run into a strange problem in my Fortran 90 code: it runs fine
>> with 1 processor, but throws a segmentation fault on KSPSolve() when I
>> try to run it in parallel. I'm using PETSc 3.3 with the SuperLU direct
>> solver for the sequential case and SuperLU_DIST for the parallel case.
>> I've called KSPView() before and after KSPSolve(). I'll put the KSPView()
>> output for the sequential and parallel cases and the crash info for the
>> parallel case below (with some details of my system redacted). Any help
>> would be appreciated. If you need any other information, I'm happy to
>> provide it.
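
(For context, a rough C sketch of the setup described above follows: a
"preonly" KSP with an LU preconditioner whose factorization is delegated to
SuperLU in serial and SuperLU_DIST in parallel, with KSPView() called around
KSPSolve(). This is a minimal sketch against the PETSc 3.3 API, not
Brendan's actual Fortran 90 code; the matrix A and the vectors b and x are
assumed to be assembled already.)

   #include <petscksp.h>

   /* Direct LU solve: SuperLU on 1 process, SuperLU_DIST otherwise. */
   PetscErrorCode direct_solve(MPI_Comm comm, Mat A, Vec b, Vec x)
   {
     KSP            ksp;
     PC             pc;
     PetscMPIInt    size;
     PetscErrorCode ierr;

     ierr = MPI_Comm_size(comm, &size);CHKERRQ(ierr);
     ierr = KSPCreate(comm, &ksp);CHKERRQ(ierr);
     /* PETSc 3.3 signature; the MatStructure flag was removed in 3.5. */
     ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
     ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
     ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
     ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
     ierr = PCFactorSetMatSolverPackage(pc, size == 1 ? MATSOLVERSUPERLU
                                                      : MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

     ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* before */
     ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
     ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* after  */

     ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
     return 0;
   }

Because KSPSetFromOptions() is called after PCFactorSetMatSolverPackage(),
runtime options such as -pc_factor_mat_solver_package mumps can still
override the package chosen in the code.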
>> >
>> > Thank you,
>> >
>> > ~Brendan
>> > ------------------------------
>> >
>> > KSPView() before sequential solve:
>> >
>> > KSP Object: 1 MPI processes
>> > type: preonly
>> > maximum iterations=10000, initial guess is zero
>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>> > left preconditioning
>> > using DEFAULT norm type for convergence test
>> > PC Object: 1 MPI processes
>> > type: lu
>> > LU: out-of-place factorization
>> > tolerance for zero pivot 2.22045e-14
>> > matrix ordering: nd
>> > linear system matrix = precond matrix:
>> > Matrix Object: 1 MPI processes
>> > type: seqaij
>> > rows=11760, cols=11760
>> > total: nonzeros=506586, allocated nonzeros=509061
>> > total number of mallocs used during MatSetValues calls =0
>> > not using I-node routines
>> >
>> > KSPView() after sequential solve:
>> >
>> > KSP Object: 1 MPI processes
>> > type: preonly
>> > maximum iterations=10000, initial guess is zero
>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>> > left preconditioning
>> > using NONE norm type for convergence test
>> > PC Object: 1 MPI processes
>> > type: lu
>> > LU: out-of-place factorization
>> > tolerance for zero pivot 2.22045e-14
>> > matrix ordering: nd
>> > factor fill ratio given 0, needed 0
>> > Factored matrix follows:
>> > Matrix Object: 1 MPI processes
>> > type: seqaij
>> > rows=11760, cols=11760
>> > package used to perform factorization: superlu
>> > total: nonzeros=0, allocated nonzeros=0
>> > total number of mallocs used during MatSetValues calls =0
>> > SuperLU run parameters:
>> > Equil: NO
>> > ColPerm: 3
>> > IterRefine: 0
>> > SymmetricMode: NO
>> > DiagPivotThresh: 1
>> > PivotGrowth: NO
>> > ConditionNumber: NO
>> > RowPerm: 0
>> > ReplaceTinyPivot: NO
>> > PrintStat: NO
>> > lwork: 0
>> > linear system matrix = precond matrix:
>> > Matrix Object: 1 MPI processes
>> > type: seqaij
>> > rows=11760, cols=11760
>> > total: nonzeros=506586, allocated nonzeros=509061
>> > total number of mallocs used during MatSetValues calls =0
>> > not using I-node routines
>> >
>> >
>> > KSPView() before parallel solve:
>> >
>> > KSP Object: 2 MPI processes
>> > type: preonly
>> > maximum iterations=10000, initial guess is zero
>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>> > left preconditioning
>> > using DEFAULT norm type for convergence test
>> > PC Object: 2 MPI processes
>> > type: lu
>> > LU: out-of-place factorization
>> > tolerance for zero pivot 2.22045e-14
>> > matrix ordering: natural
>> > linear system matrix = precond matrix:
>> > Solving Electron Matrix Equation
>> > Matrix Object: 2 MPI processes
>> > type: mpiaij
>> > rows=11760, cols=11760
>> > total: nonzeros=506586, allocated nonzeros=520821
>> > total number of mallocs used during MatSetValues calls =0
>> > not using I-node (on process 0) routines
>> >
>> > Crash info for parallel solve:
>> >
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> > [1]PETSC ERROR: Try option -start_in_debugger or
>> -on_error_attach_debugger
>> > [1]PETSC ERROR: or see
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [1]PETSC ERROR: or try
>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>> corruption errors
>> > [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>> and run
>> > [1]PETSC ERROR: to get more information on the crash.
>> > [1]PETSC ERROR: --------------------- Error Message
>> ------------------------------------
>> > [1]PETSC ERROR: Signal received!
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11
>> 12:26:34 CST 2013
>> > [1]PETSC ERROR: See docs/changes/index.html for recent updates.
>> > [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> > [1]PETSC ERROR: See docs/index.html for manual pages.
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: <redacted> on a path-ompi named <redacted>
>> > [1]PETSC ERROR: Libraries linked from <redacted>
>> > [1]PETSC ERROR: Configure run at Thu Mar 21 14:19:42 2013
>> > [1]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi
>> --PETSC_DIR=<redacted> --CFLAGS="-fPIC -O -mp" --CXXFLAGS="-fPIC -O -mp"
>> --FFLAGS="-fPIC -O -mp" --with-debugging=0 --with-dynamic-loadin=no
>> --with-mpi=1 --with-mpi-dir=<redacted> --with-superlu=1
>> --with-superlu-dir=<redacted> --with-blas-lapack-lib="<redacted>"
>> --with-scalapack=1 --with-scalapack-dir=<redacted> --with-superlu_dist=1
>> --with-superlu_dist-dir=<redacted> --with-metis=1
>> --with-metis-dir=<redacted> --with-parmetis=1
>> --with-parmetis-dir=<redacted> --with-blacs-lib="<redacted>"
>> --with-blacs-include=<redacted> --with-hypre=1 --download-hypre=1
>> > [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> > [1]PETSC ERROR: User provided function() line 0 in unknown directory
>> unknown file
>> >
>> >
>> >
>>
>>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener