[petsc-dev] [petsc-users] strange segv

Mark Adams mfadams at lbl.gov
Sat May 29 21:39:03 CDT 2021


And valgrind sees this. I think the jump to the function is going to the
wrong place.
I'm giving up on PGI but can try newer versions of GCC. (What is the deal
with the range of major releases, 4-10?)
(As I said, this looks like an error that a user is getting, so I'd like to
figure it out.)

    0 SNES Function norm 4.974994975313e-03
==77820== Invalid read of size 4
==77820==    at 0x7E69068: LandauKokkosJacobian (in
/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
==77820==    by 0x7C728D3: LandauIJacobian (plexland.c:2107)
==77820==    by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
==77820==    by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
==77820==    by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
==77820==    by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
==77820==    by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
==77820==    by 0x7AF336F: SNESSolve (snes.c:4769)
==77820==    by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
==77820==    by 0x7E1A8B7: TSStep_Theta (theta.c:223)
==77820==    by 0x7CB093F: TSStep (ts.c:3571)
==77820==  Address 0x96fff690 is in a --- anonymous segment
==77820==
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
------------------------------------
[0]PETSC ERROR: The EXACT line numbers in the error traceback are not
available.
[0]PETSC ERROR: instead the line number of the start of the function is
given.
[0]PETSC ERROR: #1 LandauKokkosJacobian() at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272

On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfadams at lbl.gov> wrote:

>
>
> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>    I don't see why it is not running the Kokkos check. Here is the rule
>> right below the CUDA rule that is apparently running.
>>
>> check_build:
>>         -@echo "Running check examples to verify correct installation"
>>         -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} clean-legacy
>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} testex19
>>         +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] &&  [ "${PETSC_SCALAR}" = "real" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>          fi;
>>         +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] &&  [ "${PETSC_SCALAR}" = "real" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>          fi;
>>         +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] &&  [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>          fi;
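
[Editor's sketch] The Kokkos branch of the rule quoted above is guarded by five conditions, so one way to see why the Kokkos check is skipped is to evaluate each condition separately. The helper below is hypothetical (the name kokkos_check_blockers is not part of PETSc); it only mirrors the tests in the quoted rule and assumes the five make variables have been copied into the shell environment by hand from the build's configuration.

```shell
#!/bin/sh
# Hypothetical helper, not part of PETSc: mirrors the guard on the
# runex3k_kokkos check in the quoted check_build rule and reports which
# of the five conditions is not met. The variables are assumed to have
# been exported by hand with the values make would see.
kokkos_check_blockers() {
    blockers=""
    [ "${KOKKOS_KERNELS_LIB}" != "" ]   || blockers="${blockers} KOKKOS_KERNELS_LIB-is-empty"
    [ "${PETSC_WITH_BATCH}" = "" ]      || blockers="${blockers} PETSC_WITH_BATCH-is-set"
    [ "${PETSC_SCALAR}" = "real" ]      || blockers="${blockers} PETSC_SCALAR-not-real"
    [ "${PETSC_PRECISION}" = "double" ] || blockers="${blockers} PETSC_PRECISION-not-double"
    [ "${MPI_IS_MPIUNI}" = "0" ]        || blockers="${blockers} MPI_IS_MPIUNI-not-0"
    if [ -z "${blockers}" ]; then
        echo "kokkos check would run"
    else
        echo "kokkos check skipped:${blockers}"
    fi
}
```

Calling kokkos_check_blockers after setting the variables prints either that the check would run or the list of unmet conditions. Note that the guard tests KOKKOS_KERNELS_LIB, so under this rule a build with Kokkos but an empty KOKKOS_KERNELS_LIB would skip the check.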
>>
>>   Regarding the debugging: with one MPI rank (or even more), GDB will
>> trap the error and show the exact line of source code where the error
>> occurred, and you can poke around at variables to see if they look corrupt
>> or wrong (for example, a crazy address in a pointer). I don't know why
>> your debugger is not giving more useful information.
>>
>>
> This is what I did (in DDT). It stopped at the function call and the data
> looked fine. I stepped into the call, but never reached the function body.
> The signal handler was called and I was dead.
> Maybe I did something in my branch. I can't see what, but I'll keep probing.
> Thanks,
>
>
>>   Barry
>>
>>
>> > On May 29, 2021, at 2:16 PM, Mark Adams <mfadams at lbl.gov> wrote:
>> >
>> > I am running on Summit with Kokkos-CUDA and I am getting a segv that
>> looks like some sort of a compile/link mismatch. I also have a user with a
>> C++ code that is getting strange segvs when calling MatSetValues with CUDA
>> (I know MatSetValues is not a cuSPARSE method, but that is the report that
>> I have). I have no idea if these are related but they both involve C -- C++
>> calls ...
>> >
>> > I started with a clean build (attached) and I ran in DDT. DDT stopped
>> at the call in plexland.c to the Kokkos Landau operator. I stepped into this
>> function and then took this screenshot of the stack, with the Kokkos call
>> and PETSc signal handler.
>> >
>> > Make check does not seem to be running Kokkos tests:
>> >
>> > 15:02 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc$
>> make PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc
>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>> > Running check examples to verify correct installation
>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and
>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI
>> process
>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
>> processes
>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> > Completed test examples
>> >
>> > Also, I ran this AM with another branch that had not been rebased with
>> main as recently as this branch (adams/landau-mass-opt).
>> >
>> > Any ideas?
>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
>>
>>