<div dir="ltr">On Wed, Jun 26, 2013 at 5:15 PM, Brendan C Lyons <span dir="ltr"><<a href="mailto:bclyons@princeton.edu" target="_blank">bclyons@princeton.edu</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Dear all,<div><br></div><div>Sorry for the delayed response. It took me sometime to get a debug version available on the clusters I used. I reran my code compiled with debugging turned on and ran it with valgrind. No error was caught before the call to KSPSolve() and I've now received the error message below (I put it only for one of the four processors here, but I believe it's the same for all of them). Any further advice that could help me track down the cause for this segfault would be appreciated.</div>
I see at least three scenarios:

1) SuperLU_Dist has a bug

   You should try MUMPS

2) Our interface to SuperLU_Dist has a bug

   Can you run in the debugger and get a stack trace for this?

3) You have another bug somewhere writing over memory

   You have run valgrind, so this is unlikely

   Matt
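For concreteness, the first two scenarios can be tried straight from the command line. A minimal sketch, assuming MUMPS was actually built into this PETSc install (it is not in the configure line shown below) and reusing the executable name from the newer log:

    # Scenario 1: hand the LU factorization to MUMPS instead of SuperLU_Dist
    mpiexec -n 4 ./N4D_d.exe -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps

    # Scenario 2: attach a debugger to the faulting rank to capture an exact stack trace
    mpiexec -n 4 ./N4D_d.exe -on_error_attach_debugger

With -on_error_attach_debugger, PETSc attempts to attach a debugger to the rank that hits the SEGV, so a backtrace there should show the exact frame inside SuperLU_Dist rather than only the function-level stack printed below.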
<div dir="ltr"><div>Thanks,</div><div><br></div><div>~Brendan</div><div><br></div><div><div>[3]PETSC ERROR: ------------------------------------------------------------------------</div>
<div>[3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range</div><div>[3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger</div><div>[3]PETSC ERROR: or see <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[3]PETSC" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[3]PETSC</a> ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors</div>
<div>[3]PETSC ERROR: likely location of problem given in stack below</div><div>[3]PETSC ERROR: --------------------- Stack Frames ------------------------------------</div><div>[3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,</div>
<div>[3]PETSC ERROR: INSTEAD the line number of the start of the function</div><div>[3]PETSC ERROR: is given.</div><div>[3]PETSC ERROR: [3] MatLUFactorNumeric_SuperLU_DIST line 284 src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c</div>
<div>[3]PETSC ERROR: [3] MatLUFactorNumeric line 2791 src/mat/interface/matrix.c</div><div>[3]PETSC ERROR: [3] PCSetUp_LU line 108 src/ksp/pc/impls/factor/lu/lu.c</div><div>[3]PETSC ERROR: [3] PCSetUp line 810 src/ksp/pc/interface/precon.c</div>
<div>[3]PETSC ERROR: [3] KSPSetUp line 182 src/ksp/ksp/interface/itfunc.c</div><div>[3]PETSC ERROR: [3] KSPSolve line 351 src/ksp/ksp/interface/itfunc.c</div><div>[3]PETSC ERROR: --------------------- Error Message ------------------------------------</div>
<div>[3]PETSC ERROR: Signal received!</div><div>[3]PETSC ERROR: ------------------------------------------------------------------------</div><div>[3]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11 12:26:34 CST 2013 </div>
<div>[3]PETSC ERROR: See docs/changes/index.html for recent updates.</div><div>[3]PETSC ERROR: See docs/faq.html for hints about trouble shooting.</div><div>[3]PETSC ERROR: See docs/index.html for manual pages.</div><div>
[3]PETSC ERROR: ------------------------------------------------------------------------</div><div>[3]PETSC ERROR: N4D_d.exe on a path-ompi named [REDACTED] by blyons Tue Jun 25 13:50:35 2013</div><div>[3]PETSC ERROR: Libraries linked from ${PETSC_DIR}/path-ompi/lib</div>
<div>[3]PETSC ERROR: Configure run at Mon Jun 24 13:11:36 2013</div><div>[3]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi --PETSC_DIR=${PATHSCALE_DIR}/openmpi-1.6-pkgs/petsc-3.3-p6-debug --CFLAGS="-fPIC -O -mp" --CXXFLAGS="-fPIC -O -mp" --FFLAGS="-fPIC -O -mp" --with-debugging=1 --with-dynamic-loading=no --with-mpi=1 --with-mpi-dir=${PATHSCALE_DIR}/openmpi-1.6.4 --with-superlu=1 --with-superlu-dir=${PATHSCALE_DIR}/superlu-4.3 --with-blas-lapack-lib="[${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]--with-metis=1" --with-metis-dir=${PATHSCALE_DIR}/metis-5.0.3 --with-parmetis=1 --with-parmetis-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/parmetis-4.0.2 --with-blas-lapack-lib="${PATHSCALE_DIR}/acml-5.3.0/open64_64/lib/libacml.a]" --with-blacs=1 --with-blacs-lib="[${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsCinit_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacsF77init_MPI-LINUX-0.a,${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/lib/blacs_MPI-LINUX-0.a]" --with-blacs-include=${PATHSCALE_DIR}/openmpi-1.6-pkgs/blacs-1.1p3/include --with-hypre=1 --download-hypre=1 --with-scalapack=1 --with-scalapack-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/scalapack-2.0.2 --with-superlu_dist=1 --with-superlu_dist-dir=${PATHSCALE_DIR}/openmpi-1.6-pkgs/superlu_dist-3.2</div>
<div>[3]PETSC ERROR: ------------------------------------------------------------------------</div><div>[3]PETSC ERROR: User provided function() line 0 in unknown directory unknown file</div><div>--------------------------------------------------------------------------</div>
<div>MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD </div><div>with errorcode 59.</div><div><br></div><div>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.</div><div>You may or may not see output from other processes, depending on</div>
<div>exactly when Open MPI kills them.</div><div>--------------------------------------------------------------------------</div><div>==421== Thread 16:</div><div>==421== Invalid free() / delete / delete[] / realloc()</div>
<div>==421== at 0x4C2333A: free (vg_replace_malloc.c:446)</div><div>==421== by 0x9CB25AA: free_mem (in /lib64/<a href="http://libc-2.5.so" target="_blank">libc-2.5.so</a>)</div><div>==421== by 0x9CB21A1: __libc_freeres (in /lib64/<a href="http://libc-2.5.so" target="_blank">libc-2.5.so</a>)</div>
<div>==421== by 0x4A1E669: _vgnU_freeres (vg_preloaded.c:62)</div><div>==421== Address 0x4165e78 is not stack'd, malloc'd or (recently) free'd</div><div>==421== </div><div><br></div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">On Tue, Jun 18, 2013 at 4:52 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
If possible you would also benefit from running the debug version under valgrind (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind); it is possible that memory corruption has taken place before the point where the code crashes, and valgrind will help identify any memory corruption as soon as it takes place.
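(For reference, a typical MPI-under-valgrind invocation along the lines of that FAQ entry looks something like the following, where the executable name and trailing options are placeholders:

    mpiexec -n 4 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./your_code <your options>

Each rank then writes its own valgrind.log.<pid>, and any invalid read or write reported before the crash is worth chasing first.)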
<span><font color="#888888"><br>
Barry<br>
</font></span><div><div><br>
On Jun 18, 2013, at 2:15 PM, Dave May <<a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>> wrote:<br>
<br>
> You should recompile your code using a debug build of PETSc so you get some meaningful info from the stack trace when the SEGV occurs.
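(A debug build here just means reconfiguring PETSc with --with-debugging=1, in a separate PETSC_ARCH or install directory, and rebuilding; roughly:

    ./configure --PETSC_ARCH=path-ompi-debug --with-debugging=1 <same external-package options as before>
    make all

where the arch name is a placeholder. The configure line in the newer error log above is exactly such a debug build.)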
>
> Dave
>
>
> On Tuesday, 18 June 2013, Brendan C Lyons wrote:
> Hi everyone,
>
> I've run into a strange problem in my Fortran 90 code where it runs fine with 1 processor, but then throws a segmentation fault on KSPSolve() when I try to run it in parallel. I'm using PETSc 3.3 with the SuperLU direct solver for the sequential case and SuperLU_dist for the parallel case. I've called KSPView before and after KSPSolve. I'll put the KSPView output for the sequential and parallel cases and the crash info for the parallel case below (with some details of my system redacted). Any help would be appreciated. If you need any other information, I'm happy to provide it.
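(In runtime-option terms — with the executable name as a placeholder, since it is redacted below — this setup corresponds roughly to:

    # sequential: direct LU via SuperLU
    ./your_code -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu -ksp_view

    # parallel: direct LU via SuperLU_dist
    mpiexec -n 2 ./your_code -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist -ksp_view

assuming the solver package is chosen via options rather than hard-coded in the Fortran source; -ksp_view prints essentially the same information as the explicit KSPView() calls.)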
>
> Thank you,
>
> ~Brendan
> ------------------------------
>
> KSPView() before sequential solve:
>
> KSP Object: 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000
> left preconditioning
> using DEFAULT norm type for convergence test
> PC Object: 1 MPI processes
> type: lu
> LU: out-of-place factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: nd
> linear system matrix = precond matrix:
> Matrix Object: 1 MPI processes
> type: seqaij
> rows=11760, cols=11760
> total: nonzeros=506586, allocated nonzeros=509061
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
>
> KSPView() after sequential solve:
>
> KSP Object: 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000
> left preconditioning
> using NONE norm type for convergence test
> PC Object: 1 MPI processes
> type: lu
> LU: out-of-place factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: nd
> factor fill ratio given 0, needed 0
> Factored matrix follows:
> Matrix Object: 1 MPI processes
> type: seqaij
> rows=11760, cols=11760
> package used to perform factorization: superlu
> total: nonzeros=0, allocated nonzeros=0
> total number of mallocs used during MatSetValues calls =0
> SuperLU run parameters:
> Equil: NO
> ColPerm: 3
> IterRefine: 0
> SymmetricMode: NO
> DiagPivotThresh: 1
> PivotGrowth: NO
> ConditionNumber: NO
> RowPerm: 0
> ReplaceTinyPivot: NO
> PrintStat: NO
> lwork: 0
> linear system matrix = precond matrix:
> Matrix Object: 1 MPI processes
> type: seqaij
> rows=11760, cols=11760
> total: nonzeros=506586, allocated nonzeros=509061
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
>
>
> KSPView() before parallel solve:
>
> KSP Object: 2 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000
> left preconditioning
> using DEFAULT norm type for convergence test
> PC Object: 2 MPI processes
> type: lu
> LU: out-of-place factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: natural
> linear system matrix = precond matrix:
> Solving Electron Matrix Equation
> Matrix Object: 2 MPI processes
> type: mpiaij
> rows=11760, cols=11760
> total: nonzeros=506586, allocated nonzeros=520821
> total number of mallocs used during MatSetValues calls =0
> not using I-node (on process 0) routines
>
> Crash info for parallel solve:
>
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> [1]PETSC ERROR: to get more information on the crash.
> [1]PETSC ERROR: --------------------- Error Message ------------------------------------
> [1]PETSC ERROR: Signal received!
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 6, Mon Feb 11 12:26:34 CST 2013
> [1]PETSC ERROR: See docs/changes/index.html for recent updates.
> [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [1]PETSC ERROR: See docs/index.html for manual pages.
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: <redacted> on a path-ompi named <redacted>
> [1]PETSC ERROR: Libraries linked from <redacted>
> [1]PETSC ERROR: Configure run at Thu Mar 21 14:19:42 2013
> [1]PETSC ERROR: Configure options --PETSC_ARCH=path-ompi --PETSC_DIR=<redacted> --CFLAGS="-fPIC -O -mp" --CXXFLAGS="-fPIC -O -mp" --FFLAGS="-fPIC -O -mp" --with-debugging=0 --with-dynamic-loadin=no --with-mpi=1 --with-mpi-dir=<redacted> --with-superlu=1 --with-superlu-dir=<redacted> --with-blas-lapack-lib="<redacted>" --with-scalapack=1 --with-scalapack-dir=<redacted> --with-superlu_dist=1 --with-superlu_dist-dir=<redacted> --with-metis=1 --with-metis-dir=<redacted> --with-parmetis=1 --with-parmetis-dir=<redacted> --with-blacs-lib="<redacted>" --with-blacs-include=<redacted> --with-hypre=1 --download-hypre=1
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>
>
>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener