[petsc-dev] SuperLU failure with valgrind
Mark Adams
mfadams at lbl.gov
Mon Oct 16 20:11:20 CDT 2017
Yep, there was a long thread on segvs with pdgssvx in SuperLU_dist almost
exactly a year ago. Good catch, Matt: "SuperLU_dist issue in 3.7.4".
It's not clear that it was resolved, but the code works; it just barfs under
valgrind on my OS X laptop. Valgrind is also barfing on the options database
methods on Cori.
Again, the code runs fine though. Probably false positives.
On Mon, Oct 16, 2017 at 12:31 PM, Matthew Knepley <knepley at gmail.com> wrote:
> We had a previous error with pdgssvx in SuperLU I think. Maybe searching
> petsc-maint would get it?
>
> Matt
>
> On Mon, Oct 16, 2017 at 12:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
>> I just ran this and have a little bit of a stack trace. This is on my
>> laptop and MPI can be a little flaky here (e.g., IBarrier does not work). I
>> am going to move to Cori soon and I will try to reproduce this.
>> Thanks,
>>
>> ==68941== at 0x103A66AA8: MPIR_Process_status (mpiimpl.h:4394)
>> ==68941== by 0x103A6852F: MPIC_Waitall (helper_fns.c:774)
>> ==68941== by 0x1038ECE88: MPIR_Alltoallv_intra (alltoallv.c:194)
>> ==68941== by 0x1038ED7F9: MPIR_Alltoallv (alltoallv.c:339)
>> ==68941== by 0x1038EDA53: MPIR_Alltoallv_impl (alltoallv.c:376)
>> ==68940== at 0x103A66AA8: MPIR_Process_status (mpiimpl.h:4394)
>> ==68940== by 0x103A6852F: MPIC_Waitall (helper_fns.c:774)
>> ==68940== by 0x1038ECE88: MPIR_Alltoallv_intra (alltoallv.c:194)
>> ==68940== by 0x1038ED7F9: MPIR_Alltoallv (alltoallv.c:339)
>> ==68940== by 0x1038EDA53: MPIR_Alltoallv_impl (alltoallv.c:376)
>> ==68940== by 0x103719112: MPI_Alltoallv (alltoallv.c:527)
>> ==68940== by 0x10238B87D: pdCompRow_loc_to_CompCol_global (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>> ==68940== by 0x1023800CB: pdgssvx (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>> ==68940== by 0x100AB92DB: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:429)
>>
>>
>> On Mon, Oct 16, 2017 at 12:05 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>
>>> Mark,
>>> Is it possible to get the line number?
>>> For example, the first failure is
>>>
>>> ==63582== Conditional jump or move depends on uninitialised value(s)
>>> ==63582== at 0x103A5FAA8: MPIR_Process_status (mpiimpl.h:4394)
>>> ==63582== by 0x103A6152F: MPIC_Waitall (helper_fns.c:774)
>>> ==63582== by 0x1038E2A34: MPIR_Alltoall_intra (alltoall.c:369)
>>> ==63582== by 0x1038E35E1: MPIR_Alltoall (alltoall.c:564)
>>> ==63582== by 0x1038E37E6: MPIR_Alltoall_impl (alltoall.c:599)
>>> ==63582== by 0x1037106AD: MPI_Alltoall (alltoall.c:722)
>>> ==63582== by 0x10236EA7C: static_schedule (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>
>>> I checked all the MPI_Alltoall calls in the static_schedule() routine, and
>>> I don't see any problem.
>>>
>>> Sherry
>>>
>>>
>>>
>>> On Mon, Oct 16, 2017 at 7:21 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> FYI, I get this error on one processor with SuperLU under valgrind.
>>>> Could this just be a valgrind issue?
>>>>
>>>> Mark
>>>>
>>>> /Users/markadams/Codes/petsc/arch-macosx-gnu-g/bin/mpiexec -n 1
>>>> valgrind --dsymutil=yes --leak-check=no --gen-suppressions=no
>>>> --num-callers=20 --error-limit=no ./ex48 -debug 2 -dim 2 -dm_refine 3
>>>> -ts_monitor -implicit true -ts_type beuler -pc_type lu
>>>> -pc_factor_mat_solver_package superlu_dist -ksp_type preonly -snes_monitor
>>>> -snes_rtol 1.e-10 -snes_stol 1.e-10 -snes_converged_reason -snes_atol
>>>> 1.e-18 -snes_converged_reason -petscspace_order 2 -petscspace_poly_tensor
>>>> -ts_max_steps 1 -ts_dt 1.e-3 -eps 1.e-12 -eta 0.001 -ves 0.005 -beta 0.01
>>>> -mu 0.0002 -dm_view hdf5:sol.h5 -vec_view hdf5:sol.h5::append
>>>> -dm_plex_periodic_cut -y_periodicity PERIODIC -cells 2,4 -Jop 4.99
>>>> -line_dir 1,1 -line_coord 3.14159265359,1.57079632679 -real_view
>>>> :u.m:ascii_matlab -fft_view :spectra.m:ascii_matlab
>>>> ==63582== Memcheck, a memory error detector
>>>> ==63582== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et
>>>> al.
>>>> ==63582== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
>>>> info
>>>> ==63582== Command: ./ex48 -debug 2 -dim 2 -dm_refine 3 -ts_monitor
>>>> -implicit true -ts_type beuler -pc_type lu -pc_factor_mat_solver_package
>>>> superlu_dist -ksp_type preonly -snes_monitor -snes_rtol 1.e-10 -snes_stol
>>>> 1.e-10 -snes_converged_reason -snes_atol 1.e-18 -snes_converged_reason
>>>> -petscspace_order 2 -petscspace_poly_tensor -ts_max_steps 1 -ts_dt 1.e-3
>>>> -eps 1.e-12 -eta 0.001 -ves 0.005 -beta 0.01 -mu 0.0002 -dm_view
>>>> hdf5:sol.h5 -vec_view hdf5:sol.h5::append -dm_plex_periodic_cut
>>>> -y_periodicity PERIODIC -cells 2,4 -Jop 4.99 -line_dir 1,1 -line_coord
>>>> 3.14159265359,1.57079632679 -real_view :u.m:ascii_matlab -fft_view
>>>> :spectra.m:ascii_matlab
>>>> ==63582==
>>>> ==63582== Syscall param msg->desc.port.name points to uninitialised
>>>> byte(s)
>>>> ==63582== at 0x103FE134A: mach_msg_trap (in
>>>> /usr/lib/system/libsystem_kernel.dylib)
>>>> ==63582== by 0x103FE0796: mach_msg (in /usr/lib/system/libsystem_kernel.dylib)
>>>> ==63582== by 0x103FDA485: task_set_special_port (in
>>>> /usr/lib/system/libsystem_kernel.dylib)
>>>> ==63582== by 0x10817810E: _os_trace_create_debug_control_port (in
>>>> /usr/lib/system/libsystem_trace.dylib)
>>>> ==63582== by 0x108178458: _libtrace_init (in
>>>> /usr/lib/system/libsystem_trace.dylib)
>>>> ==63582== by 0x1036119DF: libSystem_initializer (in
>>>> /usr/lib/libSystem.B.dylib)
>>>> ==63582== by 0x100034A1A: ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
>>>> ==63582== by 0x100034C1D: ImageLoaderMachO::doInitialization(ImageLoader::LinkContext
>>>> const&) (in /usr/lib/dyld)
>>>> ==63582== by 0x1000304A9: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
>>>> ==63582== by 0x100030440: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
>>>> ==63582== by 0x10002F523: ImageLoader::processInitializers(ImageLoader::LinkContext
>>>> const&, unsigned int, ImageLoader::InitializerTimingList&,
>>>> ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
>>>> ==63582== by 0x10002F5B8: ImageLoader::runInitializers(ImageLoader::LinkContext
>>>> const&, ImageLoader::InitializerTimingList&) (in /usr/lib/dyld)
>>>> ==63582== by 0x100021433: dyld::initializeMainExecutable() (in
>>>> /usr/lib/dyld)
>>>> ==63582== by 0x1000258C5: dyld::_main(macho_header const*, unsigned
>>>> long, int, char const**, char const**, char const**, unsigned long*) (in
>>>> /usr/lib/dyld)
>>>> ==63582== by 0x100020248: dyldbootstrap::start(macho_header const*,
>>>> int, char const**, long, macho_header const*, unsigned long*) (in
>>>> /usr/lib/dyld)
>>>> ==63582== by 0x100020035: _dyld_start (in /usr/lib/dyld)
>>>> ==63582== by 0x3E: ???
>>>> ==63582== by 0x1080A84C2: ???
>>>> ==63582== by 0x1080A84C9: ???
>>>> ==63582== by 0x1080A84D0: ???
>>>> ==63582== Address 0x1080a60fc is on thread 1's stack
>>>> ==63582== in frame #2, created by task_set_special_port (???:)
>>>> ==63582==
>>>> --63582-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
>>>> --63582-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated
>>>> 2 times)
>>>> --63582-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated
>>>> 4 times)
>>>> Jop=4.99
>>>> DeltaPrime=1.81627
>>>> eta=0.001
>>>> beta=0.01
>>>> mu=0.0002
>>>> ves=0.005
>>>> ==63582== Warning: invalid file descriptor -1 in syscall read()
>>>> 0) total perturbed mass = 0.
>>>> 0 TS dt 0.001 time 0.
>>>> 0 SNES Function norm 5.917661770415e-01
>>>> ==63582== Conditional jump or move depends on uninitialised value(s)
>>>> ==63582== at 0x103A5FAA8: MPIR_Process_status (mpiimpl.h:4394)
>>>> ==63582== by 0x103A6152F: MPIC_Waitall (helper_fns.c:774)
>>>> ==63582== by 0x1038E2A34: MPIR_Alltoall_intra (alltoall.c:369)
>>>> ==63582== by 0x1038E35E1: MPIR_Alltoall (alltoall.c:564)
>>>> ==63582== by 0x1038E37E6: MPIR_Alltoall_impl (alltoall.c:599)
>>>> ==63582== by 0x1037106AD: MPI_Alltoall (alltoall.c:722)
>>>> ==63582== by 0x10236EA7C: static_schedule (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10239923C: pdgstrf (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10237D696: pdgssvx_ABglobal (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x100AB1F02: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:423)
>>>> ==63582== by 0x10053AD98: MatLUFactorNumeric (matrix.c:3039)
>>>> ==63582== by 0x1012075CD: PCSetUp_LU (lu.c:131)
>>>> ==63582== by 0x10134D65B: PCSetUp (precon.c:924)
>>>> ==63582== by 0x101496E11: KSPSetUp (itfunc.c:378)
>>>> ==63582== by 0x101499143: KSPSolve (itfunc.c:609)
>>>> ==63582== by 0x1015F9410: SNESSolve_NEWTONLS (ls.c:224)
>>>> ==63582== by 0x101574290: SNESSolve (snes.c:4106)
>>>> ==63582== by 0x10179B43C: TS_SNESSolve (theta.c:176)
>>>> ==63582== by 0x10178F7CE: TSStep_Theta (theta.c:216)
>>>> ==63582== by 0x1016C1D62: TSStep (ts.c:4120)
>>>> ==63582==
>>>> ==63582== Conditional jump or move depends on uninitialised value(s)
>>>> ==63582== at 0x103A5FAA8: MPIR_Process_status (mpiimpl.h:4394)
>>>> ==63582== by 0x103A6152F: MPIC_Waitall (helper_fns.c:774)
>>>> ==63582== by 0x1038E5E88: MPIR_Alltoallv_intra (alltoallv.c:194)
>>>> ==63582== by 0x1038E67F9: MPIR_Alltoallv (alltoallv.c:339)
>>>> ==63582== by 0x1038E6A53: MPIR_Alltoallv_impl (alltoallv.c:376)
>>>> ==63582== by 0x103712112: MPI_Alltoallv (alltoallv.c:527)
>>>> ==63582== by 0x10236ECF1: static_schedule (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10239923C: pdgstrf (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10237D696: pdgssvx_ABglobal (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x100AB1F02: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:423)
>>>> ==63582== by 0x10053AD98: MatLUFactorNumeric (matrix.c:3039)
>>>> ==63582== by 0x1012075CD: PCSetUp_LU (lu.c:131)
>>>> ==63582== by 0x10134D65B: PCSetUp (precon.c:924)
>>>> ==63582== by 0x101496E11: KSPSetUp (itfunc.c:378)
>>>> ==63582== by 0x101499143: KSPSolve (itfunc.c:609)
>>>> ==63582== by 0x1015F9410: SNESSolve_NEWTONLS (ls.c:224)
>>>> ==63582== by 0x101574290: SNESSolve (snes.c:4106)
>>>> ==63582== by 0x10179B43C: TS_SNESSolve (theta.c:176)
>>>> ==63582== by 0x10178F7CE: TSStep_Theta (theta.c:216)
>>>> ==63582== by 0x1016C1D62: TSStep (ts.c:4120)
>>>> ==63582==
>>>> ==63582== Thread 2:
>>>> ==63582== Invalid read of size 4
>>>> ==63582== at 0x10814A2B1: _pthread_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== by 0x10814A07C: start_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== Address 0x18 is not stack'd, malloc'd or (recently) free'd
>>>> ==63582==
>>>> ==63582== Invalid read of size 8
>>>> ==63582== at 0x1081489D6: pthread_getspecific (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== by 0x100286A5B: PetscVSNPrintf (mprint.c:132)
>>>> ==63582== by 0x1002871A3: PetscVFPrintfDefault (mprint.c:241)
>>>> ==63582== by 0x10028A1E6: PetscFPrintf (mprint.c:546)
>>>> ==63582== by 0x1002A1BE9: PetscErrorPrintfDefault (errtrace.c:114)
>>>> ==63582== by 0x1002A3C5D: PetscSignalHandlerDefault (signal.c:135)
>>>> ==63582== by 0x1002A4A79: PetscSignalHandler_Private (signal.c:47)
>>>> ==63582== by 0x25805BDBD: ???
>>>> ==63582== by 0x10814A07C: start_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== Address 0x50 is not stack'd, malloc'd or (recently) free'd
>>>> ==63582==
>>>> ==63582==
>>>> ==63582== Process terminating with default action of signal 11 (SIGSEGV)
>>>> ==63582== Access not within mapped region at address 0x50
>>>> ==63582== at 0x1081489D6: pthread_getspecific (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== by 0x100286A5B: PetscVSNPrintf (mprint.c:132)
>>>> ==63582== by 0x1002871A3: PetscVFPrintfDefault (mprint.c:241)
>>>> ==63582== by 0x10028A1E6: PetscFPrintf (mprint.c:546)
>>>> ==63582== by 0x1002A1BE9: PetscErrorPrintfDefault (errtrace.c:114)
>>>> ==63582== by 0x1002A3C5D: PetscSignalHandlerDefault (signal.c:135)
>>>> ==63582== by 0x1002A4A79: PetscSignalHandler_Private (signal.c:47)
>>>> ==63582== by 0x25805BDBD: ???
>>>> ==63582== by 0x10814A07C: start_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== If you believe this happened as a result of a stack
>>>> ==63582== overflow in your program's main thread (unlikely but
>>>> ==63582== possible), you can try to increase the size of the
>>>> ==63582== main thread stack using the --main-stacksize= flag.
>>>> ==63582== The main thread stack size used in this run was 67104768.
>>>>
>>>> valgrind: m_scheduler/scheduler.c:881 (void
>>>> run_thread_for_a_while(HWord *, Int *, ThreadId, HWord, Bool)): Assertion
>>>> 'VG_(stats__n_xindirs_32) == 0' failed.
>>>>
>>>> host stacktrace:
>>>> ==63582== at 0x25804121C: ???
>>>> ==63582== by 0x258041587: ???
>>>> ==63582== by 0x25804156A: ???
>>>> ==63582== by 0x2580BB25F: ???
>>>> ==63582== by 0x2580B95EA: ???
>>>> ==63582== by 0x2580CA83B: ???
>>>> ==63582== by 0x2580CAAF8: ???
>>>>
>>>> sched status:
>>>> running_tid=3
>>>>
>>>> Thread 1: status = VgTs_Yielding (lwpid 771)
>>>> ==63582== at 0x10239F9DE: dscatter_u (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10239EF4F: pdgstrf (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x10237D696: pdgssvx_ABglobal (in /Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib/libsuperlu_dist.5.1.3.dylib)
>>>> ==63582== by 0x100AB1F02: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:423)
>>>> ==63582== by 0x10053AD98: MatLUFactorNumeric (matrix.c:3039)
>>>> ==63582== by 0x1012075CD: PCSetUp_LU (lu.c:131)
>>>> ==63582== by 0x10134D65B: PCSetUp (precon.c:924)
>>>> ==63582== by 0x101496E11: KSPSetUp (itfunc.c:378)
>>>> ==63582== by 0x101499143: KSPSolve (itfunc.c:609)
>>>> ==63582== by 0x1015F9410: SNESSolve_NEWTONLS (ls.c:224)
>>>> ==63582== by 0x101574290: SNESSolve (snes.c:4106)
>>>> ==63582== by 0x10179B43C: TS_SNESSolve (theta.c:176)
>>>> ==63582== by 0x10178F7CE: TSStep_Theta (theta.c:216)
>>>> ==63582== by 0x1016C1D62: TSStep (ts.c:4120)
>>>> ==63582== by 0x1016C56A3: TSSolve (ts.c:4374)
>>>> ==63582== by 0x100004E0E: main (ex48.c:1061)
>>>>
>>>> Thread 2: status = VgTs_Yielding (lwpid 4099)
>>>> ==63582== at 0x1081489D6: pthread_getspecific (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>> ==63582== by 0x100286A5B: PetscVSNPrintf (mprint.c:132)
>>>> ==63582== by 0x1002871A3: PetscVFPrintfDefault (mprint.c:241)
>>>> ==63582== by 0x10028A1E6: PetscFPrintf (mprint.c:546)
>>>> ==63582== by 0x1002A1BE9: PetscErrorPrintfDefault (errtrace.c:114)
>>>> ==63582== by 0x1002A3C5D: PetscSignalHandlerDefault (signal.c:135)
>>>> ==63582== by 0x1002A4A79: PetscSignalHandler_Private (signal.c:47)
>>>> ==63582== by 0x25805BDBD: ???
>>>> ==63582== by 0x10814A07C: start_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>>
>>>> Thread 3: status = VgTs_Runnable (lwpid 3843)
>>>> ==63582== at 0x10814A070: start_wqthread (in
>>>> /usr/lib/system/libsystem_pthread.dylib)
>>>>
>>>>
>>>> Note: see also the FAQ in the source distribution.
>>>> It contains workarounds to several common problems.
>>>> In particular, if Valgrind aborted or crashed after
>>>> identifying problems in your program, there's a good chance
>>>> that fixing those problems will prevent Valgrind aborting or
>>>> crashing, especially if it happened in m_mallocfree.c.
>>>>
>>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
>