[petsc-dev] [petsc-users] strange segv
Mark Adams
mfadams at lbl.gov
Sun May 30 15:18:44 CDT 2021
Oh right. I had forgotten about cuda-memcheck. Thanks for reminding me.
It has never saved me, yet, so it has not been etched in my brain like
valgrind :)
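
For reference, the failure diagnosed in this thread (a device pointer dereferenced in host code) can be guarded against by keeping the memory space next to the pointer. What follows is only a minimal C++ sketch under stated assumptions: MemSpace, TaggedPtr, host_read and device_read_is_rejected are hypothetical names, not part of CUDA, Kokkos or PETSc, and the "device" buffer is ordinary host storage that is merely labeled as device memory so the sketch builds without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical tag recording where a buffer lives; not a CUDA or Kokkos type.
enum class MemSpace { Host, Device };

// Hypothetical wrapper: a raw pointer, its length, and its memory space.
template <typename T>
struct TaggedPtr {
  T*          ptr;
  std::size_t size;
  MemSpace    space;
};

// Host-side element read that refuses to touch memory tagged as Device.
template <typename T>
T host_read(const TaggedPtr<T>& p, std::size_t i) {
  if (p.space != MemSpace::Host)
    throw std::logic_error("device pointer dereferenced on host");
  if (i >= p.size) throw std::out_of_range("index out of range");
  return p.ptr[i];
}

// Usage sketch: the same storage tagged two ways. Returns true when the
// Device-tagged read is rejected with an exception instead of being performed.
inline bool device_read_is_rejected() {
  std::vector<double> data{1.0, 2.0, 3.0};
  TaggedPtr<double> on_host{data.data(), data.size(), MemSpace::Host};
  assert(host_read(on_host, 2) == 3.0);  // Host-tagged read succeeds

  TaggedPtr<double> on_device{data.data(), data.size(), MemSpace::Device};
  try {
    host_read(on_device, 0);
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

The check here happens at run time and turns the mistake into an error message rather than a SEGV with no source line. Kokkos Views carry the memory space in their type instead, so the same class of mistake can be rejected at compile time when Views are used rather than raw pointers.
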
On Sun, May 30, 2021 at 11:53 AM Jacob Faibussowitsch <jacob.fai at gmail.com>
wrote:
> The problem was that I was accessing a device pointer on the host.
>
> Maybe the fact that valgrind did not print a source code line (it was in
> host code) is a hint that you are accessing a device pointer?
>
> ==77820== Invalid read of size 4
> ==77820== at 0x7E69068: LandauKokkosJacobian (in
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>
>
> When in doubt, use cuda-memcheck whenever you do any debugging with GPUs;
> it's the CUDA version of valgrind and I cannot recommend it enough. Not
> directly related, but it also comes with a suite of other useful GPU-related
> tools that catch race conditions, uninitialized memory accesses and
> deadlocks.
>
> https://docs.nvidia.com/cuda/cuda-memcheck/index.html
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On May 30, 2021, at 09:06, Mark Adams <mfadams at lbl.gov> wrote:
>
> The problem was that I was accessing a device pointer on the host.
>
> Maybe the fact that valgrind did not print a source code line (it was in
> host code) is a hint that you are accessing a device pointer?
>
> ==77820== Invalid read of size 4
> ==77820== at 0x7E69068: LandauKokkosJacobian (in
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>
> This access is in landau.kokkos.cxx but no source line number.
>
> Thanks,
>
>
> On Sun, May 30, 2021 at 12:48 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Sun, May 30, 2021 at 12:08 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>> Try without Valgrind, put a CHKMEMQ; just before the call to
>>> LandauKokkosJacobian and as its first line. And run with -malloc_debug.
>>> This is a less optimal way to find memory corruption but may be more useful
>>> in this case.
>>>
>>
>> I don't seem to get anything with this, but I now see that the segv is on
>> the 2nd call to LandauKokkosJacobian, which adds the mass matrix, with
>> shift. I am working on the mass matrix part now. Let me try adding print
>> statements in LandauKokkosJacobian. (DDT failed to trace into that method,
>> but let's see).
>>
>> Thanks,
>>
>> CHKMEMQ;
>> PetscPrintf(PETSC_COMM_SELF,"call LandauKokkosJacobian\n");
>> ierr =
>> LandauKokkosJacobian(ctx->plex,Nq,Eq_m,IPf,N,xdata,ctx->SData_d,ctx->subThreadBlockSize,shift,ctx->events,JacP);CHKERRQ(ierr);
>>
>> 00:37 adams/landau-mass-opt *=
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/tutorials$
>> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 -f mymake tiny
>> EXTRA='-dm_mat_type aijkokkos -dm_vec_type kokkos -malloc_debug'
>> DEVICE=kokkos
>> jsrun -n 1 -c 1 -g 1 ./ex2 -dim 2 -ex2_test_type none -dm_landau_Ez 0
>> -petscspace_degree 3 -dm_preallocate_only -dm_landau_type p4est
>> -dm_landau_ion_masses 1 -dm_landau_ion_charges 1 -dm_landau_thermal_temps
>> 4,4 -dm_landau_n 1,1 -ts_monitorx -snes_rtol 1.e-14 -snes_stol 1.e-14
>> -snes_monitor -snes_converged_reason -snes_max_it 14 -ts_type beuler
>> -ts_exact_final_time stepover -ts_max_snes_failures 1 -ts_rtol 5e-1 -ts_dt
>> .5 -ts_max_steps 1 -pc_type lu -ksp_type preonly -dm_landau_amr_levels_max
>> 13 -dm_landau_device_type kokkos -dm_mat_type aijkokkos -dm_vec_type kokkos
>> -malloc_debug
>>
>>
>> [0]FormLandau: 1280 IPs, 80 cells, totDim=32, Nb=16, Nq=16,
>> elemMatSize=1024, dim=2, Tab: Nb=16 Nf=2 Np=16 cdim=2 N=1406 shift=0.
>>
>> *call LandauKokkosJacobian* 0 SNES Function norm 4.974994975313e-03
>>
>> *call LandauKokkosJacobian*[0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see
>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
>> X to find memory corruption errors
>> [0]PETSC ERROR: likely location of problem given in stack below
>> [0]PETSC ERROR: --------------------- Stack Frames
>> ------------------------------------
>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not
>> available.
>> [0]PETSC ERROR: instead the line number of the start of the function is
>> given.
>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>> [0]PETSC ERROR: #2 LandauFormJacobian_Internal() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:66
>> [0]PETSC ERROR: #3 LandauIJacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:2093
>> [0]PETSC ERROR: #4 TS user implicit Jacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:933
>> [0]PETSC ERROR: #5 TSComputeIJacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:916
>> [0]PETSC ERROR: #6 SNESTSFormJacobian_Theta() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:1000
>> [0]PETSC ERROR: #7 SNESTSFormJacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:4407
>> [0]PETSC ERROR: #8 SNES user Jacobian function() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2823
>> [0]PETSC ERROR: #9 SNESComputeJacobian() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2782
>> [0]PETSC ERROR: #10 SNESSolve() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4653
>> [0]PETSC ERROR: #11 TSTheta_SNESSolve() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:184
>> [0]PETSC ERROR: #12 TSStep_Theta() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:200
>> [0]PETSC ERROR: #13 TSStep() at
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:3548
>> [0]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>>
>>
>>>
>>> On May 29, 2021, at 10:46 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>> try gcc/6.4.0
>>> --Junchao Zhang
>>>
>>>
>>> On Sat, May 29, 2021 at 9:50 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> And I get grief using gcc-8.1.1, with this error:
>>>>
>>>> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347):
>>>> error: identifier "__ieee128" is undefined
>>>>
>>>> Any ideas?
>>>>
>>>> On Sat, May 29, 2021 at 10:39 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> And valgrind sees this. I think the jump to the function is going to
>>>>> the wrong place.
>>>>> I'm giving up on PGI but can try newer versions of GCC. (what is the
>>>>> deal with the range of major releases, 4-10?)
>>>>> (as I said this looks like an error that a user is getting so I'd like
>>>>> to figure it out).
>>>>>
>>>>> 0 SNES Function norm 4.974994975313e-03
>>>>> ==77820== Invalid read of size 4
>>>>> ==77820== at 0x7E69068: LandauKokkosJacobian (in
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
>>>>> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>>>>> ==77820== by 0x7C728D3: LandauIJacobian (plexland.c:2107)
>>>>> ==77820== by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
>>>>> ==77820== by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
>>>>> ==77820== by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
>>>>> ==77820== by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
>>>>> ==77820== by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
>>>>> ==77820== by 0x7AF336F: SNESSolve (snes.c:4769)
>>>>> ==77820== by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
>>>>> ==77820== by 0x7E1A8B7: TSStep_Theta (theta.c:223)
>>>>> ==77820== by 0x7CB093F: TSStep (ts.c:3571)
>>>>> ==77820== Address 0x96fff690 is in a --- anonymous segment
>>>>> ==77820==
>>>>> [0]PETSC ERROR:
>>>>> ------------------------------------------------------------------------
>>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>>>> probably memory access out of range
>>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>>> -on_error_attach_debugger
>>>>> [0]PETSC ERROR: or see
>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>>>> OS X to find memory corruption errors
>>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>> [0]PETSC ERROR: --------------------- Stack Frames
>>>>> ------------------------------------
>>>>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not
>>>>> available.
>>>>> [0]PETSC ERROR: instead the line number of the start of the function
>>>>> is given.
>>>>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>>>>>
>>>>> On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>
>>>>>>>
>>>>>>> I don't see why it is not running the Kokkos check. Here is the
>>>>>>> rule right below the CUDA rule that is apparently running.
>>>>>>>
>>>>>>> check_build:
>>>>>>> -@echo "Running check examples to verify correct installation"
>>>>>>> -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>>>>>>> +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} clean-legacy
>>>>>>> +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} testex19
>>>>>>> +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>>> cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>>>>>> fi;
>>>>>>> +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>>> cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>>>>>> fi;
>>>>>>> +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>>>>>> cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>>>>>> fi;
>>>>>>>
>>>>>>> Regarding the debugging: with GDB on one MPI rank (or even more), the
>>>>>>> debugger will trap the error and show the exact line of source code
>>>>>>> where it occurred, and you can poke around at variables to see if
>>>>>>> they look corrupt or wrong (for example, a crazy address in a pointer). I
>>>>>>> don't know why your debugger is not giving more useful information.
>>>>>>>
>>>>>>>
>>>>>> This is what I did (in DDT). It stopped at the function call and the
>>>>>> data looked fine. I stepped into the call but never got into the
>>>>>> function: the signal handler was called and I was dead.
>>>>>> Maybe I did something in my branch. I can't see what, but I will keep
>>>>>> probing.
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>
>>>>>>> > On May 29, 2021, at 2:16 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>> >
>>>>>>> > I am running on Summit with Kokkos-CUDA and I am getting a segv
>>>>>>> that looks like some sort of a compile/link mismatch. I also have a user
>>>>>>> with a C++ code that is getting strange segvs when calling MatSetValues
>>>>>>> with CUDA (I know MatSetValues is not a cuSPARSE method, but that is the
>>>>>>> report that I have). I have no idea if these are related but they both
>>>>>>> involve C -- C++ calls ...
>>>>>>> >
>>>>>>> > I started with a clean build (attached) and I ran in DDT. DDT
>>>>>>> stopped at the call in plexland.c to the Kokkos Landau operator. I stepped
>>>>>>> into this function and then took this screenshot of the stack, with the
>>>>>>> Kokkos call and PETSc signal handler.
>>>>>>> >
>>>>>>> > Make check does not seem to be running Kokkos tests:
>>>>>>> >
>>>>>>> > 15:02 adams/landau-mass-opt *=
>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc$ make
>>>>>>> PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc
>>>>>>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>>>>>>> > Running check examples to verify correct installation
>>>>>>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and
>>>>>>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI
>>>>>>> process
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
>>>>>>> processes
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>>>>>> > Completed test examples
>>>>>>> >
>>>>>>> > Also, I ran this AM with another branch that had not been rebased
>>>>>>> with main as recently as this branch (adams/landau-mass-opt).
>>>>>>> >
>>>>>>> > Any ideas?
>>>>>>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
>>>>>>>
>>>>>>>
>>>
>
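
On the CHKMEMQ and -malloc_debug suggestion above: one plausible reason it reported nothing is that this kind of check looks for heap blocks that have been overwritten, while the fault here was an invalid read. The following is a minimal C++ sketch of guard-word checking in general, assuming a simplified layout with one guard word on each side of the user region. The names guarded_alloc, guards_intact, guarded_free and the constant kGuard are hypothetical, and PETSc's actual allocator layout and constants are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical guard value chosen for this sketch only.
static const std::uint32_t kGuard = 0xDEADBEEFu;

// Allocate n user bytes with one guard word before and one after them.
inline unsigned char* guarded_alloc(std::size_t n) {
  unsigned char* raw =
      static_cast<unsigned char*>(std::malloc(n + 2 * sizeof kGuard));
  if (!raw) return nullptr;
  std::memcpy(raw, &kGuard, sizeof kGuard);                      // leading guard
  std::memcpy(raw + sizeof kGuard + n, &kGuard, sizeof kGuard);  // trailing guard
  std::memset(raw + sizeof kGuard, 0, n);                        // zero user region
  return raw + sizeof kGuard;
}

// Return true if both guard words still hold the expected value.
inline bool guards_intact(const unsigned char* user, std::size_t n) {
  std::uint32_t head = 0, tail = 0;
  std::memcpy(&head, user - sizeof kGuard, sizeof kGuard);
  std::memcpy(&tail, user + n, sizeof kGuard);
  return head == kGuard && tail == kGuard;
}

// Release a block obtained from guarded_alloc.
inline void guarded_free(unsigned char* user) {
  std::free(user - sizeof kGuard);
}
```

A write one byte past the user region changes the trailing guard, so the next guards_intact call fails. A read through a bad pointer changes nothing, so a check of this kind stays silent, and a tool that instruments loads (valgrind on the host, cuda-memcheck on the device) is needed instead.
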