[petsc-users] sudden SEGV error
Michael Augspurger
michaelaugspurger at gmail.com
Wed Jul 22 08:39:44 CDT 2015
Hello:
I'm having trouble diagnosing a problem. My CFD simulation code runs for a
long time, sometimes up to 10K steps, and then suddenly fails with a SEGV
error. If I rerun the same simulation, I get the same error, but always at
a different time step, sometimes thousands of steps apart. Nothing obvious
is going wrong in the simulation at the time. Valgrind points to various
internal PETSc operations, but the trail doesn't lead back to any part of
my code, so I'm not sure where to go next.
Does anyone have advice or experience on where to continue investigating
this failure? Thanks for any help,
Mike Augspurger
Here's part of the error output from a run under Valgrind:
Residual norms for pres_redistribute_ solve.
0 KSP Residual norm 2.343992292214e+00
1 KSP Residual norm 3.714369184378e-01
2 KSP Residual norm 5.045817070946e-02
[2]PETSC ERROR: ------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[2]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[2]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[2]PETSC ERROR: to get more information on the crash.
[2]PETSC ERROR: User provided function() line 0 in unknown file
==11381==
==11381== Process terminating with default action of signal 11 (SIGSEGV)
==11381== General Protection Fault
==11381==    at 0x926047B: __intel_sse2_strcat (in /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
==11381==    by 0x817475E: opal_os_path (os_path.c:99)
==11381==    by 0x817B9B0: opal_show_help_vstring (show_help.c:153)
==11381==    by 0x80F7878: orte_show_help (show_help.c:566)
==11381==    by 0x80A7FFC: warn_fork_cb (ompi_mpi_init.c:139)
==11381==    by 0x3E4549A285: fork (in /lib64/libc-2.5.so)
==11381==    by 0x4D6A793: PetscAttachDebugger (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x4D6B93E: PetscAttachDebuggerErrorHandler (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x4D6E5BC: PetscError (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x4D70024: PetscSignalHandlerDefault (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x4D6F9F3: PetscSignalHandler_Private (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x3E4543002F: ??? (in /lib64/libc-2.5.so)
==11381==
==11381== HEAP SUMMARY:
==11381==     in use at exit: 50,117,187 bytes in 90,510 blocks
==11381==   total heap usage: 136,533,481 allocs, 136,442,971 frees, 85,265,006,726 bytes allocated
==11381==
==11381== 2 bytes in 1 blocks are definitely lost in loss record 5 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x926098D: __intel_sse2_strdup (in /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
==11381==    by 0x5F6574726F5F4142: ???
==11381==    by 0x747365725F6D756D: ???
==11381==    by 0x4F00303D73747260: ???
==11381==    by 0x5054554F5F4C414F: ???
==11381==    by 0x52454454535F5454: ???
==11381==    by 0x32333D44465F51: ???
==11381==    by 0x41434D5F49504D4E: ???
==11381==    by 0x696E69666661705E: ???
==11381==    by 0x5F657361625F7973: ???
==11381==    by 0x313D646E756F61: ???
==11381==
==11381== 9 bytes in 1 blocks are definitely lost in loss record 430 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x926098D: __intel_sse2_strdup (in /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
==11381==    by 0x2020200A3E746364: ???
==11381==    by 0x3C2020202020201F: ???
==11381==    by 0x3E7463656A626F2E: ???
==11381==    by 0x2020202020202009: ???
==11381==    by 0xF3: ???
==11381==    by 0xF3: ???
==11381==    by 0x3: ???
==11381==    by 0x3: ???
==11381==    by 0xE4: ???
==11381==    by 0xE5: ???
==11381==
==11381== 11 bytes in 1 blocks are definitely lost in loss record 472 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x812B188: opal_argv_join (argv.c:269)
==11381==    by 0xD4F5370: ompi_btl_openib_connect_base_register (btl_openib_connect_base.c:72)
==11381==    by 0xD4F0CB0: btl_openib_register_mca_params (btl_openib_mca.c:652)
==11381==    by 0xD4E24B5: btl_openib_component_register (btl_openib_component.c:166)
==11381==    by 0x815DCC5: mca_base_components_open (mca_base_components_open.c:387)
==11381==    by 0x80D7140: mca_btl_base_open (btl_base_open.c:115)
==11381==    by 0xC4612C6: ???
==11381==    by 0x815DD37: mca_base_components_open (mca_base_components_open.c:427)
==11381==    by 0x80E4CCA: mca_pml_base_open (pml_base_open.c:126)
==11381==    by 0x80A7594: ompi_mpi_init (ompi_mpi_init.c:485)
==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
==11381==
==11381== 16 bytes in 1 blocks are definitely lost in loss record 710 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
==11381==    by 0x81021EC: orte_util_nidmap_init (nidmap.c:117)
==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
==11381==    by 0x4D0F99F: petscinitialize_ (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x60B1AB: elafintstartmpi_ (in /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
==11381==
==11381== 16 bytes in 1 blocks are definitely lost in loss record 711 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
==11381==    by 0x810222C: orte_util_nidmap_init (nidmap.c:130)
==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
==11381==    by 0x4D0F99F: petscinitialize_ (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x60B1AB: elafintstartmpi_ (in /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
==11381==
==11381== 16 bytes in 16 blocks are definitely lost in loss record 712 of 4,925
==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
==11381==    by 0x810F37B: orte_grpcomm_base_get_proc_attr (grpcomm_base_modex.c:801)
==11381==    by 0x8098A44: ompi_comm_cid_init (comm_cid.c:139)
==11381==    by 0x80A7C52: ompi_mpi_init (ompi_mpi_init.c:846)
==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
==11381==    by 0x4D0F99F: petscinitialize_ (in /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
==11381==    by 0x60B1AB: elafintstartmpi_ (in /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
==11381==    by 0x60A052: MAIN__ (in /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
==11381==    by 0x42412B: main (in /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
==11381==
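One note on reading the trace: the `???` frames under __intel_sse2_strdup in loss record 5 have "addresses" such as 0x5F6574726F5F4142 that look like ASCII text rather than code locations. The sketch below is only a way to inspect them, not a diagnosis. It rests on two assumptions that are mine and not stated in the output: the machine is little-endian x86-64, and Valgrind prints caller ("by") frames as the return address minus one. The helper name decode_frame is made up for illustration.

```python
# Frame values copied from loss record 5 in the Valgrind output above.
FRAMES = [
    0x5F6574726F5F4142, 0x747365725F6D756D, 0x4F00303D73747260,
    0x5054554F5F4C414F, 0x52454454535F5454, 0x32333D44465F51,
    0x41434D5F49504D4E, 0x696E69666661705E, 0x5F657361625F7973,
    0x313D646E756F61,
]

def decode_frame(addr: int) -> bytes:
    """Read one 64-bit frame value as the raw bytes it would occupy in memory.

    Assumption: the displayed value is (stack word - 1), so add one back,
    then lay the word out in little-endian byte order.
    """
    return (addr + 1).to_bytes(8, "little")

# Concatenate the frames in the order Valgrind listed them.
decoded = b"".join(decode_frame(a) for a in FRAMES)

# NUL bytes separate C strings; print each recovered piece on its own line.
for piece in decoded.split(b"\x00"):
    if piece:
        print(piece.decode("ascii", errors="replace"))
```

Under those assumptions the frames decode to fragments of environment-style strings, including OPAL_OUTPUT_STDERR_FD=32 and OMPI_MCA_paffinity_base_bound=1. That suggests Valgrind walked over string data on the stack while trying to unwind through the Intel strdup routine, so those `???` frames probably say nothing about where the SEGV comes from.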