[petsc-users] CUDA error
Mark Adams
mfadams at lbl.gov
Tue Jun 23 10:16:49 CDT 2020
I run in an interactive shell on SUMMIT and don't need --smpiargs=-gpu.
I have been having a lot of problems getting the CUDA test to be
reproducible in the pipeline, and the test was also failing on SUMMIT.
-use_gpu_aware_mpi 0 fixed the test on SUMMIT, in that the test ran and the
output looked OK, so I updated the test. I am hoping this also fixes the
pipeline ex56/cuda diffs.
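
For reference, a launch line consistent with this thread might look like the sketch below. It is a dry run that only prints the command. The rank count (2) and `jsrun -g1` are taken from the report and the configure options quoted further down; `-dm_vec_type cuda` is what Jed assumed rather than something confirmed here, and the variable names and the omission of the other ex56 options are assumptions made for illustration.

```shell
#!/bin/sh
# Dry run: build and print a launch line, do not execute it.
# Assumption: 2 ranks with 1 GPU each, matching the failing 2-process run.
NP=2
LAUNCH="jsrun -n ${NP} -g 1"

# -use_gpu_aware_mpi 0 tells PETSc not to hand device pointers to MPI.
# -dm_vec_type cuda is assumed; the usual ex56 test options are omitted.
PETSC_OPTS="-dm_vec_type cuda -use_gpu_aware_mpi 0"

CMD="${LAUNCH} ./ex56 ${PETSC_OPTS}"
echo "${CMD}"
```

Running without --smpiargs=-gpu, as described above, is the case where turning off GPU-aware MPI on the PETSc side is relevant.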
On Tue, Jun 23, 2020 at 11:07 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:
> What did it? How are you running now to have everything working? Can you
> post smpiargs and petsc options?
>
> On Tue, Jun 23, 2020 at 17:51, Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Tue, Jun 23, 2020 at 9:54 AM Jed Brown <jed at jedbrown.org> wrote:
>>
>>> Did you use --smpiargs=-gpu, and have you tried if the error is still
>>> there with -use_gpu_aware_mpi 0?
>>
>>
>> That did it. Thanks,
>>
>>
>>> I assume you're using -dm_vec_type cuda?
>>>
>>> Mark Adams <mfadams at lbl.gov> writes:
>>>
>>> > My code runs OK on SUMMIT but ex56 does not. ex56 runs in serial but I get
>>> > this segv in parallel. I also see these memcheck messages
>>> > from PetscSFBcastAndOpBegin in my code and ex56.
>>> >
>>> > I ran this in DDT and was able to get a stack trace and look at variables.
>>> > The segv is on sfbasic.c:148:
>>> >
>>> > ierr = MPI_Startall_isend(bas->rootbuflen[PETSCSF_REMOTE],unit,bas->nrootreqs,rootreqs);CHKERRQ(ierr);
>>> >
>>> > I did not see anything wrong with the variables here. The segv is on
>>> > processor 1 of 2 (so the last process).
>>> >
>>> > Any ideas?
>>> > Thanks,
>>> > Mark
>>> >
>>> > #18 main (argc=<optimized out>, args=<optimized out>) at /ccs/home/adams/petsc-old/src/snes/tutorials/ex56.c:477 (at 0x0000000010006224)
>>> > #17 SNESSolve (snes=0x31064ac0, b=0x3241ac40, x=<optimized out>) at /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/interface/snes.c:4515 (at 0x0000200000de3498)
>>> > #16 SNESSolve_NEWTONLS (snes=0x31064ac0) at /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/impls/ls/ls.c:175 (at 0x0000200000e1b344)
>>> > #15 SNESComputeFunction (snes=0x31064ac0, x=0x312a3170, y=0x4a464bd0) at /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/interface/snes.c:2378 (at 0x0000200000dd5024)
>>> > #14 SNESComputeFunction_DMLocal (snes=0x31064ac0, X=0x312a3170, F=0x4a464bd0, ctx=0x4a3a7ed0) at /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/utils/dmlocalsnes.c:71 (at 0x0000200000dab058)
>>> > #13 DMGlobalToLocalBegin (dm=0x30fac020, g=0x312a3170, mode=<optimized out>, l=0x4a444c40) at /autofs/nccs-svm1_home1/adams/petsc-old/src/dm/interface/dm.c:2407 (at 0x0000200000b28a38)
>>> > #12 PetscSFBcastBegin (leafdata=0x200073c00a00, rootdata=0x200073a00000, unit=<optimized out>, sf=0x30f28980) at /ccs/home/adams/petsc-old/include/petscsf.h:189 (at 0x0000200000b28a38)
>>> > #11 PetscSFBcastAndOpBegin (sf=0x30f28980, unit=0x200021879ed0, rootdata=0x200073a00000, leafdata=0x200073c00a00, op=0x200021889c70) at /autofs/nccs-svm1_home1/adams/petsc-old/src/vec/is/sf/interface/sf.c:1337 (at 0x000020000045a230)
>>> > #10 PetscSFBcastAndOpBegin_Basic (sf=0x30f28980, unit=0x200021879ed0, rootmtype=<optimized out>, rootdata=0x200073a00000, leafmtype=<optimized out>, leafdata=0x200073c00a00, op=0x200021889c70) at /autofs/nccs-svm1_home1/adams/petsc-old/src/vec/is/sf/impls/basic/sfbasic.c:148 (at 0x00002000003b1d9c)
>>> > #9 PMPI_Startall () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libmpi_ibm.so.3 (at 0x00002000217e3d98)
>>> > #8 mca_pml_pami_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x000020002555e6e0)
>>> > #7 pml_pami_persis_send_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x000020002555e29c)
>>> > #6 pml_pami_send () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x000020002555f69c)
>>> > #5 PAMI_Send_immediate () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200025725814)
>>> > #4 PAMI::Protocol::Send::Eager<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, PAMI::Device::IBV::PacketModel<PAMI::Device::IBV::Device, true> >::EagerImpl<(PAMI::Protocol::Send::configuration_t)5, true>::immediate(pami_send_immediate_t*) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x00002000257e7bac)
>>> > #3 PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, (PAMI::Protocol::Send::configuration_t)5>::immediate_impl(pami_send_immediate_t*) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x00002000257e7824)
>>> > #2 bool PAMI::Device::Interface::PacketModel<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> > >::postPacket<2u>(unsigned long, unsigned long, void*, unsigned long, iovec (&) [2u]) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x00002000257e6c18)
>>> > #1 PAMI::Device::Shmem::Packet<PAMI::Fifo::FifoPacket<64u, 4096u> >::writePayload(PAMI::Fifo::FifoPacket<64u, 4096u>&, iovec*, unsigned long) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x00002000257c5a7c)
>>> > #0 __memcpy_power7 () from /lib64/libc.so.6 (at 0x00002000219eb84c)
>>> >
>>> >
>>> > ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
>>> > ========= Saved host backtrace up to driver entry point at error
>>> > ========= Host Frame:/lib64/libcuda.so.1 (cuPointerGetAttribute + 0x178) [0x2d14a8]
>>> > ========= Host Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013 (PetscSFBcastAndOpBegin + 0xd8) [0x3ba1a0]
>>> > ========= Host Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013 (PetscSectionCreateGlobalSection + 0x948) [0x3c5b24]
>>> > ========= Host Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013 (DMGetGlobalSection + 0x98) [0xa85e5c]
>>> > ========= Host Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013 (DMPlexCreateRigidBody + 0xc4) [0x9a9ce4]
>>> > ========= Host Frame:./ex56 [0x5aec]
>>> > ========= Host Frame:/lib64/libc.so.6 [0x25200]
>>> > ========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xc4) [0x253f4]
>>> > =========
>>> > ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
>>> > ========= Saved host backtrace up to driver entry point at error
>>> > ========= Host Frame:/lib64/libcuda.so.1 (cuPointerGetAttribute + 0x178) [0x2d14a8]
>>> > ========= Host Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc[1]PETSC ERROR: ------------------------------------------------------------------------
>>> > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> > [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> > [1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> > [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> > [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>> > [1]PETSC ERROR: to get more information on the crash.
>>> > [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> > [1]PETSC ERROR: Signal received
>>> > [1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> > [1]PETSC ERROR: Petsc Development GIT revision: v3.13.2-421-gab8fa13 GIT Date: 2020-06-22 13:25:32 -0400
>>> > [1]PETSC ERROR: ./ex56 on a arch-summit-opt-gnu-cuda-omp named h35n05 by adams Tue Jun 23 09:06:20 2020
>>> > [1]PETSC ERROR: Configure options --with-fc=0 --COPTFLAGS="-g -O -fPIC -DFP_DIM=2" --CXXOPTFLAGS="-g -O -fPIC " --FOPTFLAGS="-g -O -fPIC " --CUDAOPTFLAGS="-g -O -Xcompiler -rdynamic -lineinfo--with-ssl=0" --with-batch=0 --with-cxx=mpicxx --with-mpiexec="jsrun -g1" --with-cuda=1 --with-cudac=nvcc --download-p4est=1 --download-zlib --download-hdf5=1 --download-metis --download-parmetis --download-triangle --with-blaslapack-lib="-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/netlib-lapack-3.8.0-wcabdyqhdi5rooxbkqa6x5d7hxyxwdkm/lib64 -lblas -llapack" --with-cc=mpicc --with-shared-libraries=1 --with-x=0 --with-64-bit-indices=0 --with-debugging=0 PETSC_ARCH=arch-summit-opt-gnu-cuda-omp --with-openmp=1 --with-threadsaftey=1 --with-log=1 PETSC_DIR=/ccs/home/adams/petsc-old --force
>>> > [1]PETSC ERROR: #1 User provided function() line 0 in unknown file
>>> > --------------------------------------------------------------------------
>>>
>>
>
> --
> Stefano
>