[petsc-users] sudden SEGV error
Satish Balay
balay at mcs.anl.gov
Wed Jul 22 11:54:40 CDT 2015
I suggest doing a separate PETSc build, with GNU compilers and
--download-mpich, for the valgrind run.
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Satish
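
A rough sketch of what such a build and run might look like. The
PETSC_ARCH name, the default PETSC_DIR, and the process count are
assumptions for illustration, not values from this thread; the executable
name is the one that appears in the quoted log, and the valgrind flags
follow the FAQ entry linked above as I recall it. By default the script
only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Assumed values: adjust to your own installation.
PETSC_DIR=${PETSC_DIR:-$HOME/petsc}   # assumed location of the PETSc source tree
PETSC_ARCH=arch-gnu-valgrind          # assumed name for the separate build
NPROCS=4                              # assumed process count
APP=./PELAFINT3D_EXE                  # executable name as seen in the quoted log

# With RUN unset, print each command instead of executing it,
# so nothing is configured or built by accident.
run() {
    if [ -n "${RUN:-}" ]; then "$@"; else echo "$*"; fi
}

# 1. Separate debug build with GNU compilers and an MPICH that PETSc
#    downloads and builds itself (configure runs from the PETSc tree).
run cd "$PETSC_DIR"
run ./configure PETSC_ARCH="$PETSC_ARCH" \
    --with-cc=gcc --with-cxx=g++ --with-fc=gfortran \
    --download-mpich --with-debugging=yes
run make PETSC_DIR="$PETSC_DIR" PETSC_ARCH="$PETSC_ARCH" all

# 2. After relinking the application against that build, run it under
#    valgrind from the application's directory, one log file per process,
#    and without -on_error_attach_debugger.
run "$PETSC_DIR/$PETSC_ARCH/bin/mpiexec" -n "$NPROCS" \
    valgrind --tool=memcheck -q --num-callers=20 \
    --log-file=valgrind.log.%p "$APP" -malloc off
```

Using the mpiexec from the same PETSC_ARCH keeps the MPI library and the
launcher consistent, and the per-process log files make it easier to see
which rank reports the first invalid read or write.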
On Wed, 22 Jul 2015, Matthew Knepley wrote:
> On Wed, Jul 22, 2015 at 8:39 AM, Michael Augspurger <
> michaelaugspurger at gmail.com> wrote:
>
> > Hello:
> >
> > I'm having trouble diagnosing a problem. My CFD simulation code runs
> > for a long time, sometimes up to 10K steps, and then suddenly fails
> > with a SEGV error. If I run the same simulation again, I get the same
> > error, but always at a different time step, sometimes thousands of
> > steps apart. Nothing obvious is going wrong in the simulation at the
> > time. Valgrind points to various internal PETSc operations, but the
> > trail doesn't lead back to any part of my code, so I'm not sure where
> > to go next.
> >
> > Any advice or experience about where I can continue my investigation into
> > this failure? Thanks for any help,
> >
> > Mike Augspurger
> >
> >
> >
> > Here's part of the error output with valgrind:
> >
> > Residual norms for pres_redistribute_ solve.
> > 0 KSP Residual norm 2.343992292214e+00
> > 1 KSP Residual norm 3.714369184378e-01
> > 2 KSP Residual norm 5.045817070946e-02
> > [2]PETSC ERROR:
> > ------------------------------------------------------------------------
> > [2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> > probably memory access out of range
> > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [2]PETSC ERROR: or see
> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> > X to find memory corruption errors
> > [2]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
> > run
> > [2]PETSC ERROR: to get more information on the crash.
> > [2]PETSC ERROR: User provided function() line 0 in unknown file
> > ==11381==
> > ==11381== Process terminating with default action of signal 11 (SIGSEGV)
> > ==11381== General Protection Fault
> > ==11381== at 0x926047B: __intel_sse2_strcat (in
> > /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> > ==11381== by 0x817475E: opal_os_path (os_path.c:99)
> > ==11381== by 0x817B9B0: opal_show_help_vstring (show_help.c:153)
> > ==11381== by 0x80F7878: orte_show_help (show_help.c:566)
> > ==11381== by 0x80A7FFC: warn_fork_cb (ompi_mpi_init.c:139)
> > ==11381== by 0x3E4549A285: fork (in /lib64/libc-2.5.so)
> > ==11381== by 0x4D6A793: PetscAttachDebugger (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x4D6B93E: PetscAttachDebuggerErrorHandler (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x4D6E5BC: PetscError (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x4D70024: PetscSignalHandlerDefault (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x4D6F9F3: PetscSignalHandler_Private (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> >
>
> Can you rerun this without -on_error_attach_debugger, and send ALL the
> output? We need to see what valgrind thinks is the real problem.
>
> Thanks,
>
> Matt
>
>
> > ==11381== by 0x3E4543002F: ??? (in /lib64/libc-2.5.so)
> > ==11381==
> > ==11381== HEAP SUMMARY:
> > ==11381== in use at exit: 50,117,187 bytes in 90,510 blocks
> > ==11381== total heap usage: 136,533,481 allocs, 136,442,971 frees,
> > 85,265,006,726 bytes allocated
> > ==11381==
> > ==11381== 2 bytes in 1 blocks are definitely lost in loss record 5 of 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x926098D: __intel_sse2_strdup (in
> > /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> > ==11381== by 0x5F6574726F5F4142: ???
> > ==11381== by 0x747365725F6D756D: ???
> > ==11381== by 0x4F00303D73747260: ???
> > ==11381== by 0x5054554F5F4C414F: ???
> > ==11381== by 0x52454454535F5454: ???
> > ==11381== by 0x32333D44465F51: ???
> > ==11381== by 0x41434D5F49504D4E: ???
> > ==11381== by 0x696E69666661705E: ???
> > ==11381== by 0x5F657361625F7973: ???
> > ==11381== by 0x313D646E756F61: ???
> > ==11381==
> > ==11381== 9 bytes in 1 blocks are definitely lost in loss record 430 of
> > 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x926098D: __intel_sse2_strdup (in
> > /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> > ==11381== by 0x2020200A3E746364: ???
> > ==11381== by 0x3C2020202020201F: ???
> > ==11381== by 0x3E7463656A626F2E: ???
> > ==11381== by 0x2020202020202009: ???
> > ==11381== by 0xF3: ???
> > ==11381== by 0xF3: ???
> > ==11381== by 0x3: ???
> > ==11381== by 0x3: ???
> > ==11381== by 0xE4: ???
> > ==11381== by 0xE5: ???
> > ==11381==
> > ==11381== 11 bytes in 1 blocks are definitely lost in loss record 472 of
> > 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x812B188: opal_argv_join (argv.c:269)
> > ==11381== by 0xD4F5370: ompi_btl_openib_connect_base_register
> > (btl_openib_connect_base.c:72)
> > ==11381== by 0xD4F0CB0: btl_openib_register_mca_params
> > (btl_openib_mca.c:652)
> > ==11381== by 0xD4E24B5: btl_openib_component_register
> > (btl_openib_component.c:166)
> > ==11381== by 0x815DCC5: mca_base_components_open
> > (mca_base_components_open.c:387)
> > ==11381== by 0x80D7140: mca_btl_base_open (btl_base_open.c:115)
> > ==11381== by 0xC4612C6: ???
> > ==11381== by 0x815DD37: mca_base_components_open
> > (mca_base_components_open.c:427)
> > ==11381== by 0x80E4CCA: mca_pml_base_open (pml_base_open.c:126)
> > ==11381== by 0x80A7594: ompi_mpi_init (ompi_mpi_init.c:485)
> > ==11381== by 0x80BF902: PMPI_Init (pinit.c:84)
> > ==11381==
> > ==11381== 16 bytes in 1 blocks are definitely lost in loss record 710 of
> > 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> > ==11381== by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> > ==11381== by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> > ==11381== by 0x81021EC: orte_util_nidmap_init (nidmap.c:117)
> > ==11381== by 0xAA17573: rte_init (ess_env_module.c:173)
> > ==11381== by 0x80E75CA: orte_init (orte_init.c:127)
> > ==11381== by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> > ==11381== by 0x80BF902: PMPI_Init (pinit.c:84)
> > ==11381== by 0x71F3956: MPI_INIT (pinit_f.c:75)
> > ==11381== by 0x4D0F99F: petscinitialize_ (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x60B1AB: elafintstartmpi_ (in
> > /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> > ==11381==
> > ==11381== 16 bytes in 1 blocks are definitely lost in loss record 711 of
> > 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> > ==11381== by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> > ==11381== by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> > ==11381== by 0x810222C: orte_util_nidmap_init (nidmap.c:130)
> > ==11381== by 0xAA17573: rte_init (ess_env_module.c:173)
> > ==11381== by 0x80E75CA: orte_init (orte_init.c:127)
> > ==11381== by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> > ==11381== by 0x80BF902: PMPI_Init (pinit.c:84)
> > ==11381== by 0x71F3956: MPI_INIT (pinit_f.c:75)
> > ==11381== by 0x4D0F99F: petscinitialize_ (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x60B1AB: elafintstartmpi_ (in
> > /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> > ==11381==
> > ==11381== 16 bytes in 16 blocks are definitely lost in loss record 712 of
> > 4,925
> > ==11381== at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> > ==11381== by 0x810F37B: orte_grpcomm_base_get_proc_attr
> > (grpcomm_base_modex.c:801)
> > ==11381== by 0x8098A44: ompi_comm_cid_init (comm_cid.c:139)
> > ==11381== by 0x80A7C52: ompi_mpi_init (ompi_mpi_init.c:846)
> > ==11381== by 0x80BF902: PMPI_Init (pinit.c:84)
> > ==11381== by 0x71F3956: MPI_INIT (pinit_f.c:75)
> > ==11381== by 0x4D0F99F: petscinitialize_ (in
> > /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> > ==11381== by 0x60B1AB: elafintstartmpi_ (in
> > /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> > ==11381== by 0x60A052: MAIN__ (in
> > /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> > ==11381== by 0x42412B: main (in
> > /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> > ==11381==
> >
> >
>
>
>