[petsc-dev] [petsc-users] strange segv

Mark Adams mfadams at lbl.gov
Sat May 29 23:08:32 CDT 2021


On Sat, May 29, 2021 at 11:46 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> try gcc/6.4.0
>

6.4.0 is the default and what I've been using.
6.4.0 builds and has worked, but I am now getting this segv (valgrind
trace below) in adams/landau-mass-opt.
My thinking is to try other compiler versions.


> --Junchao Zhang
>
>
> On Sat, May 29, 2021 at 9:50 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> And I tried using gcc-8.1.1 and get this error:
>>
>> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347):
>> error: identifier "__ieee128" is undefined
>>
>> Any ideas?
>>
>> On Sat, May 29, 2021 at 10:39 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> And valgrind sees this. I think the jump to the function is going to
>>> the wrong place.
>>> I'm giving up on PGI but can try newer versions of GCC. (What is the
>>> deal with the range of major releases, 4-10?)
>>> (As I said, this looks like an error that a user is getting, so I'd like
>>> to figure it out.)
>>>
>>>     0 SNES Function norm 4.974994975313e-03
>>> ==77820== Invalid read of size 4
>>> ==77820==    at 0x7E69068: LandauKokkosJacobian (in
>>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
>>> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>>> ==77820==    by 0x7C728D3: LandauIJacobian (plexland.c:2107)
>>> ==77820==    by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
>>> ==77820==    by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
>>> ==77820==    by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
>>> ==77820==    by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
>>> ==77820==    by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
>>> ==77820==    by 0x7AF336F: SNESSolve (snes.c:4769)
>>> ==77820==    by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
>>> ==77820==    by 0x7E1A8B7: TSStep_Theta (theta.c:223)
>>> ==77820==    by 0x7CB093F: TSStep (ts.c:3571)
>>> ==77820==  Address 0x96fff690 is in a --- anonymous segment
>>> ==77820==
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [0]PETSC ERROR: or see
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>> OS X to find memory corruption errors
>>> [0]PETSC ERROR: likely location of problem given in stack below
>>> [0]PETSC ERROR: ---------------------  Stack Frames
>>> ------------------------------------
>>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not
>>> available.
>>> [0]PETSC ERROR: instead the line number of the start of the function is
>>> given.
>>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>>>
>>> On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>    I don't see why it is not running the Kokkos check. Here is the
>>>>> rule right below the CUDA rule that is apparently running.
>>>>>
>>>>> check_build:
>>>>>         -@echo "Running check examples to verify correct installation"
>>>>>         -@echo "Using PETSC_DIR=${PETSC_DIR} and
>>>>> PETSC_ARCH=${PETSC_ARCH}"
>>>>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF}
>>>>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} clean-legacy
>>>>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF}
>>>>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} testex19
>>>>>         +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = ""
>>>>> ] &&  [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF}
>>>>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR}
>>>>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>>>>          fi;
>>>>>         +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ]
>>>>> &&  [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF}
>>>>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR}
>>>>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>>>>          fi;
>>>>>         +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [
>>>>> "${PETSC_WITH_BATCH}" = "" ] &&  [ "${PETSC_SCALAR}" = "real" ] && [
>>>>> "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF}
>>>>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR}
>>>>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>>>>          fi;
>>>>>
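
One way to see which guard keeps runex3k_kokkos from running would be to
replay the five tests from the Kokkos rule quoted above by hand. The
following is only a sketch under assumptions: the variable names are the
ones in the rule, but the function name kokkos_check_blockers is made up
for illustration, and the values have to be supplied manually (make
expands them from its own variables, so they are normally not present in
the shell environment).

```shell
#!/bin/sh
# Sketch: mirror the five tests that guard runex3k_kokkos in check_build
# and report which of them fail. kokkos_check_blockers is a hypothetical
# helper name, not part of PETSc.
kokkos_check_blockers() {
  blockers=""
  # Same comparisons as in the quoted rule, one per line.
  [ "${KOKKOS_KERNELS_LIB}" != "" ]   || blockers="${blockers} KOKKOS_KERNELS_LIB-empty"
  [ "${PETSC_WITH_BATCH}" = "" ]      || blockers="${blockers} PETSC_WITH_BATCH-set"
  [ "${PETSC_SCALAR}" = "real" ]      || blockers="${blockers} PETSC_SCALAR-not-real"
  [ "${PETSC_PRECISION}" = "double" ] || blockers="${blockers} PETSC_PRECISION-not-double"
  [ "${MPI_IS_MPIUNI}" = "0" ]        || blockers="${blockers} MPI_IS_MPIUNI-not-0"

  if [ -z "${blockers}" ]; then
    echo "kokkos check would run"
  else
    echo "kokkos check skipped:${blockers}"
  fi
}
```

After setting the five variables to whatever the build actually uses,
calling kokkos_check_blockers prints either "kokkos check would run" or
the list of tests that failed. Note that the rule keys on
KOKKOS_KERNELS_LIB, not on a Kokkos variable, so a build with Kokkos but
without Kokkos Kernels would skip the test by design.
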
>>>>>   Regarding the debugging: with GDB on just one MPI rank (or even
>>>>> more), it will trap the error and show the exact line of source code
>>>>> where the error occurred, and you can poke around at variables to see if
>>>>> they look corrupt or wrong (for example, a crazy address in a pointer). I
>>>>> don't know why your debugger is not giving more useful information.
>>>>>
>>>>>
>>>> This is what I did (in DDT). It stopped at the function call and the
>>>> data looked fine. I stepped into the call, but never got there. The signal
>>>> handler was called and I was dead.
>>>> Maybe I did something in my branch. I can't see what, but I will keep
>>>> probing.
>>>> Thanks,
>>>>
>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>> > On May 29, 2021, at 2:16 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>> >
>>>>> > I am running on Summit with Kokkos-CUDA and I am getting a segv that
>>>>> looks like some sort of a compile/link mismatch. I also have a user with a
>>>>> C++ code that is getting strange segvs when calling MatSetValues with CUDA
>>>>> (I know MatSetValues is not a cuSPARSE method, but that is the report that
>>>>> I have). I have no idea if these are related but they both involve C -- C++
>>>>> calls ...
>>>>> >
>>>>> > I started with a clean build (attached) and I ran in DDT. DDT
>>>>> stopped at the call in plexland.c to the KokkosLandau operator. I stepped
>>>>> into this function and then took this screenshot of the stack, with the
>>>>> Kokkos call and PETSc signal handler.
>>>>> >
>>>>> > Make check does not seem to be running Kokkos tests:
>>>>> >
>>>>> > 15:02 adams/landau-mass-opt *=
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc$ make
>>>>> PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc
>>>>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>>>>> > Running check examples to verify correct installation
>>>>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and
>>>>> PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI
>>>>> process
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
>>>>> processes
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>>>> > Completed test examples
>>>>> >
>>>>> > Also, I ran this AM with another branch that had not been rebased
>>>>> with main as recently as this branch (adams/landau-mass-opt).
>>>>> >
>>>>> > Any ideas?
>>>>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
>>>>>
>>>>>