[petsc-users] sudden SEGV error

Matthew Knepley knepley at gmail.com
Wed Jul 22 08:44:09 CDT 2015


On Wed, Jul 22, 2015 at 8:39 AM, Michael Augspurger <
michaelaugspurger at gmail.com> wrote:

> Hello:
>
> I'm having trouble diagnosing a problem.  My CFD simulation code will run
> for a long time, sometimes up to 10K steps, and then suddenly fail with a
> SEGV error.  If I rerun the same simulation, I get the same error, but
> always at a different time step, sometimes thousands of steps apart.
> Nothing obvious is going wrong in the simulation at the time.  Valgrind
> points to various internal PETSc operations, but the trail doesn't lead
> back to any part of my code, so I'm not sure where to go next.
>
> Any advice or experience about where I can continue my investigation into
> this failure?  Thanks for any help,
>
> Mike Augspurger
>
>
>
> Here's part of the error output under valgrind:
>
>     Residual norms for pres_redistribute_ solve.
>     0 KSP Residual norm 2.343992292214e+00
>     1 KSP Residual norm 3.714369184378e-01
>     2 KSP Residual norm 5.045817070946e-02
> [2]PETSC ERROR:
> ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [2]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
> run
> [2]PETSC ERROR: to get more information on the crash.
> [2]PETSC ERROR: User provided function() line 0 in  unknown file
> ==11381==
> ==11381== Process terminating with default action of signal 11 (SIGSEGV)
> ==11381==  General Protection Fault
> ==11381==    at 0x926047B: __intel_sse2_strcat (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x817475E: opal_os_path (os_path.c:99)
> ==11381==    by 0x817B9B0: opal_show_help_vstring (show_help.c:153)
> ==11381==    by 0x80F7878: orte_show_help (show_help.c:566)
> ==11381==    by 0x80A7FFC: warn_fork_cb (ompi_mpi_init.c:139)
> ==11381==    by 0x3E4549A285: fork (in /lib64/libc-2.5.so)
> ==11381==    by 0x4D6A793: PetscAttachDebugger (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6B93E: PetscAttachDebuggerErrorHandler (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6E5BC: PetscError (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D70024: PetscSignalHandlerDefault (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6F9F3: PetscSignalHandler_Private (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
>

Can you rerun this without -on_error_attach_debugger? And send ALL the
output. We need to see what valgrind thinks is the real problem.
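
In the trace quoted above, the frames under PetscAttachDebugger (fork,
warn_fork_cb, opal_show_help_vstring) come from the debugger-attach path
rather than from the original fault, which is why the option gets in the
way. A rerun might look roughly like the sketch below. It follows the
general form of the valgrind invocation in the PETSc FAQ; NPROC and the
executable path are placeholders, not values taken from the report, and
the "-malloc off" spelling is assumed from PETSc of this era and may differ
in other versions.

```shell
#!/bin/sh
# Sketch only: NPROC and EXE are placeholders (assumptions), adjust to the job.
NPROC=${NPROC:-4}
EXE=${EXE:-./PELAFINT3D_EXE}

# One valgrind log per process (%p expands to the process id).
# Deliberately no -on_error_attach_debugger or -start_in_debugger here,
# so the signal handler does not fork a debugger under valgrind.
CMD="mpiexec -n $NPROC valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p $EXE -malloc off"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

The first "Invalid read" or "Invalid write" entry in the per-process logs is
usually more informative than the final SIGSEGV report. Building PETSc with
--with-debugging=yes, as the error message itself suggests, should also give
file and line information for the PETSc frames.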

  Thanks,

     Matt


> ==11381==    by 0x3E4543002F: ??? (in /lib64/libc-2.5.so)
> ==11381==
> ==11381== HEAP SUMMARY:
> ==11381==     in use at exit: 50,117,187 bytes in 90,510 blocks
> ==11381==   total heap usage: 136,533,481 allocs, 136,442,971 frees,
> 85,265,006,726 bytes allocated
> ==11381==
> ==11381== 2 bytes in 1 blocks are definitely lost in loss record 5 of 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x926098D: __intel_sse2_strdup (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x5F6574726F5F4142: ???
> ==11381==    by 0x747365725F6D756D: ???
> ==11381==    by 0x4F00303D73747260: ???
> ==11381==    by 0x5054554F5F4C414F: ???
> ==11381==    by 0x52454454535F5454: ???
> ==11381==    by 0x32333D44465F51: ???
> ==11381==    by 0x41434D5F49504D4E: ???
> ==11381==    by 0x696E69666661705E: ???
> ==11381==    by 0x5F657361625F7973: ???
> ==11381==    by 0x313D646E756F61: ???
> ==11381==
> ==11381== 9 bytes in 1 blocks are definitely lost in loss record 430 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x926098D: __intel_sse2_strdup (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x2020200A3E746364: ???
> ==11381==    by 0x3C2020202020201F: ???
> ==11381==    by 0x3E7463656A626F2E: ???
> ==11381==    by 0x2020202020202009: ???
> ==11381==    by 0xF3: ???
> ==11381==    by 0xF3: ???
> ==11381==    by 0x3: ???
> ==11381==    by 0x3: ???
> ==11381==    by 0xE4: ???
> ==11381==    by 0xE5: ???
> ==11381==
> ==11381== 11 bytes in 1 blocks are definitely lost in loss record 472 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x812B188: opal_argv_join (argv.c:269)
> ==11381==    by 0xD4F5370: ompi_btl_openib_connect_base_register
> (btl_openib_connect_base.c:72)
> ==11381==    by 0xD4F0CB0: btl_openib_register_mca_params
> (btl_openib_mca.c:652)
> ==11381==    by 0xD4E24B5: btl_openib_component_register
> (btl_openib_component.c:166)
> ==11381==    by 0x815DCC5: mca_base_components_open
> (mca_base_components_open.c:387)
> ==11381==    by 0x80D7140: mca_btl_base_open (btl_base_open.c:115)
> ==11381==    by 0xC4612C6: ???
> ==11381==    by 0x815DD37: mca_base_components_open
> (mca_base_components_open.c:427)
> ==11381==    by 0x80E4CCA: mca_pml_base_open (pml_base_open.c:126)
> ==11381==    by 0x80A7594: ompi_mpi_init (ompi_mpi_init.c:485)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==
> ==11381== 16 bytes in 1 blocks are definitely lost in loss record 710 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> ==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> ==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> ==11381==    by 0x81021EC: orte_util_nidmap_init (nidmap.c:117)
> ==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
> ==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
> ==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
> ==11381== 16 bytes in 1 blocks are definitely lost in loss record 711 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> ==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> ==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> ==11381==    by 0x810222C: orte_util_nidmap_init (nidmap.c:130)
> ==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
> ==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
> ==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
> ==11381== 16 bytes in 16 blocks are definitely lost in loss record 712 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x810F37B: orte_grpcomm_base_get_proc_attr
> (grpcomm_base_modex.c:801)
> ==11381==    by 0x8098A44: ompi_comm_cid_init (comm_cid.c:139)
> ==11381==    by 0x80A7C52: ompi_mpi_init (ompi_mpi_init.c:846)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==    by 0x60A052: MAIN__ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==    by 0x42412B: main (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener