<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 23, 2020 at 9:51 AM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 23, 2020 at 9:54 AM Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Did you use --smpiargs=-gpu, and have you tried if the error is still<br></blockquote><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
there with -use_gpu_aware_mpi 0? </blockquote><div><br></div><div>That did it. Thanks,</div></div></div></blockquote><div>Weird. It should work with -use_gpu_aware_mpi 1 if you launch with jsrun --smpiargs=-gpu.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I assume you're using -dm_vec_type cuda?<br>
<br>
Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> writes:<br>
<br>
> My code runs OK on SUMMIT but ex56 does not. ex56 runs in serial but I get<br>
> this segv in parallel. I also see these memcheck messages<br>
> from PetscSFBcastAndOpBegin in my code and ex56.<br>
><br>
> I ran this in DDT and was able to get a stack trace and look at variables.<br>
> The segv is on sfbasic.c:148:<br>
><br>
> ierr =<br>
> MPI_Startall_isend(bas->rootbuflen[PETSCSF_REMOTE],unit,bas->nrootreqs,rootreqs);CHKERRQ(ierr);<br>
><br>
> I did not see anything wrong with the variables here. The segv is on<br>
> processor 1 of 2 (so the last process).<br>
><br>
> Any ideas?<br>
> Thanks,<br>
> Mark<br>
><br>
> #18 main (argc=<optimized out>, args=<optimized out>) at<br>
> /ccs/home/adams/petsc-old/src/snes/tutorials/ex56.c:477 (at<br>
> 0x0000000010006224)<br>
> #17 SNESSolve (snes=0x31064ac0, b=0x3241ac40, x=<optimized out>) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/interface/snes.c:4515 (at<br>
> 0x0000200000de3498)<br>
> #16 SNESSolve_NEWTONLS (snes=0x31064ac0) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/impls/ls/ls.c:175 (at<br>
> 0x0000200000e1b344)<br>
> #15 SNESComputeFunction (snes=0x31064ac0, x=0x312a3170, y=0x4a464bd0) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/interface/snes.c:2378 (at<br>
> 0x0000200000dd5024)<br>
> #14 SNESComputeFunction_DMLocal (snes=0x31064ac0, X=0x312a3170,<br>
> F=0x4a464bd0, ctx=0x4a3a7ed0) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/snes/utils/dmlocalsnes.c:71 (at<br>
> 0x0000200000dab058)<br>
> #13 DMGlobalToLocalBegin (dm=0x30fac020, g=0x312a3170, mode=<optimized<br>
> out>, l=0x4a444c40) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/dm/interface/dm.c:2407 (at<br>
> 0x0000200000b28a38)<br>
> #12 PetscSFBcastBegin (leafdata=0x200073c00a00, rootdata=0x200073a00000,<br>
> unit=<optimized out>, sf=0x30f28980) at<br>
> /ccs/home/adams/petsc-old/include/petscsf.h:189 (at 0x0000200000b28a38)<br>
> #11 PetscSFBcastAndOpBegin (sf=0x30f28980, unit=0x200021879ed0,<br>
> rootdata=0x200073a00000, leafdata=0x200073c00a00, op=0x200021889c70) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/vec/is/sf/interface/sf.c:1337<br>
> (at 0x000020000045a230)<br>
> #10 PetscSFBcastAndOpBegin_Basic (sf=0x30f28980, unit=0x200021879ed0,<br>
> rootmtype=<optimized out>, rootdata=0x200073a00000, leafmtype=<optimized<br>
> out>, leafdata=0x200073c00a00, op=0x200021889c70) at<br>
> /autofs/nccs-svm1_home1/adams/petsc-old/src/vec/is/sf/impls/basic/sfbasic.c:148<br>
> (at 0x00002000003b1d9c)<br>
> #9 PMPI_Startall () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libmpi_ibm.so.3<br>
> (at 0x00002000217e3d98)<br>
> #8 mca_pml_pami_start () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so<br>
> (at 0x000020002555e6e0)<br>
> #7 pml_pami_persis_send_start () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so<br>
> (at 0x000020002555e29c)<br>
> #6 pml_pami_send () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so<br>
> (at 0x000020002555f69c)<br>
> #5 PAMI_Send_immediate () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3<br>
> (at 0x0000200025725814)<br>
> #4<br>
> PAMI::Protocol::Send::Eager<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,<br>
> 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,<br>
> PAMI::Counter::Indirect<PAMI::Counter::Native>,<br>
> PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >,<br>
> PAMI::Device::IBV::PacketModel<PAMI::Device::IBV::Device, true><br>
>>::EagerImpl<(PAMI::Protocol::Send::configuration_t)5,<br>
> true>::immediate(pami_send_immediate_t*) () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3<br>
> (at 0x00002000257e7bac)<br>
> #3<br>
> PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,<br>
> 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,<br>
> PAMI::Counter::Indirect<PAMI::Counter::Native>,<br>
> PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >,<br>
> (PAMI::Protocol::Send::configuration_t)5>::immediate_impl(pami_send_immediate_t*)<br>
> () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3<br>
> (at 0x00002000257e7824)<br>
> #2 bool<br>
> PAMI::Device::Interface::PacketModel<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,<br>
> 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,<br>
> PAMI::Counter::Indirect<PAMI::Counter::Native>,<br>
> PAMI::Device::Shmem::CMAShaddr, 256u, 512u> > >::postPacket<2u>(unsigned<br>
> long, unsigned long, void*, unsigned long, iovec (&) [2u]) () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3<br>
> (at 0x00002000257e6c18)<br>
> #1 PAMI::Device::Shmem::Packet<PAMI::Fifo::FifoPacket<64u, 4096u><br>
>>::writePayload(PAMI::Fifo::FifoPacket<64u, 4096u>&, iovec*, unsigned long)<br>
> () from<br>
> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3<br>
> (at 0x00002000257c5a7c)<br>
> #0 __memcpy_power7 () from /lib64/libc.so.6 (at 0x00002000219eb84c)<br>
><br>
><br>
> ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid<br>
> argument" on CUDA API call to cuPointerGetAttribute.<br>
> ========= Saved host backtrace up to driver entry point at error<br>
> ========= Host Frame:/lib64/libcuda.so.1 (cuPointerGetAttribute +<br>
> 0x178) [0x2d14a8]<br>
> ========= Host<br>
> Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013<br>
> (PetscSFBcastAndOpBegin + 0xd8) [0x3ba1a0]<br>
> ========= Host<br>
> Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013<br>
> (PetscSectionCreateGlobalSection + 0x948) [0x3c5b24]<br>
> ========= Host<br>
> Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013<br>
> (DMGetGlobalSection + 0x98) [0xa85e5c]<br>
> ========= Host<br>
> Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.013<br>
> (DMPlexCreateRigidBody + 0xc4) [0x9a9ce4]<br>
> ========= Host Frame:./ex56 [0x5aec]<br>
> ========= Host Frame:/lib64/libc.so.6 [0x25200]<br>
> ========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xc4)<br>
> [0x253f4]<br>
> =========<br>
> ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid<br>
> argument" on CUDA API call to cuPointerGetAttribute.<br>
> ========= Saved host backtrace up to driver entry point at error<br>
> ========= Host Frame:/lib64/libcuda.so.1 (cuPointerGetAttribute +<br>
> 0x178) [0x2d14a8]<br>
> ========= Host<br>
> Frame:/ccs/home/adams/petsc-old/arch-summit-opt-gnu-cuda-omp/lib/libpetsc[1]PETSC<br>
> ERROR:<br>
> ------------------------------------------------------------------------<br>
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,<br>
> probably memory access out of range<br>
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>
> [1]PETSC ERROR: or see<br>
> <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" rel="noreferrer" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
> [1]PETSC ERROR: or try <a href="http://valgrind.org" rel="noreferrer" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X<br>
> to find memory corruption errors<br>
> [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and<br>
> run<br>
> [1]PETSC ERROR: to get more information on the crash.<br>
> [1]PETSC ERROR: --------------------- Error Message<br>
> --------------------------------------------------------------<br>
> [1]PETSC ERROR: Signal received<br>
> [1]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" rel="noreferrer" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html</a><br>
> for trouble shooting.<br>
> [1]PETSC ERROR: Petsc Development GIT revision: v3.13.2-421-gab8fa13 GIT<br>
> Date: 2020-06-22 13:25:32 -0400<br>
> [1]PETSC ERROR: ./ex56 on a arch-summit-opt-gnu-cuda-omp named h35n05 by<br>
> adams Tue Jun 23 09:06:20 2020<br>
> [1]PETSC ERROR: Configure options --with-fc=0 --COPTFLAGS="-g -O -fPIC<br>
> -DFP_DIM=2" --CXXOPTFLAGS="-g -O -fPIC " --FOPTFLAGS="-g -O -fPIC "<br>
> --CUDAOPTFLAGS="-g -O -Xcompiler -rdynamic -lineinfo--with-ssl=0"<br>
> --with-batch=0 --with-cxx=mpicxx --with-mpiexec="jsrun -g1" --with-cuda=1<br>
> --with-cudac=nvcc --download-p4est=1 --download-zlib --download-hdf5=1<br>
> --download-metis --download-parmetis --download-triangle<br>
> --with-blaslapack-lib="-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/netlib-lapack-3.8.0-wcabdyqhdi5rooxbkqa6x5d7hxyxwdkm/lib64<br>
> -lblas -llapack" --with-cc=mpicc --with-shared-libraries=1 --with-x=0<br>
> --with-64-bit-indices=0 --with-debugging=0<br>
> PETSC_ARCH=arch-summit-opt-gnu-cuda-omp --with-openmp=1<br>
> --with-threadsaftey=1 --with-log=1 PETSC_DIR=/ccs/home/adams/petsc-old<br>
> --force<br>
> [1]PETSC ERROR: #1 User provided function() line 0 in unknown file<br>
> --------------------------------------------------------------------------<br>
</blockquote></div></div>
</blockquote></div></div>
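<div dir="ltr">For reference, the two launch variants discussed in this thread could be written roughly as below. This is only a dry-run sketch: the resource-set layout (-n 2 -a 1 -c 1 -g 1) and running ./ex56 with no further options are assumptions, not taken from the thread, and the script only prints the command lines rather than running them.</div>

```shell
#!/bin/sh
# Dry run: print the two jsrun command lines instead of executing them.
# ASSUMPTIONS (not from the thread): 2 resource sets, each with 1 task,
# 1 core and 1 GPU; ./ex56 is launched without any other options.
NRS=2
APP=./ex56
LAYOUT="-n $NRS -a 1 -c 1 -g 1"

# GPU-aware path: device pointers are handed to MPI directly, so
# Spectrum MPI has to be started with --smpiargs=-gpu.
gpu_aware_cmd="jsrun $LAYOUT --smpiargs=-gpu $APP -dm_vec_type cuda -use_gpu_aware_mpi 1"

# Workaround reported to work in this thread: tell PETSc not to pass
# device pointers to MPI.
host_staged_cmd="jsrun $LAYOUT $APP -dm_vec_type cuda -use_gpu_aware_mpi 0"

echo "$gpu_aware_cmd"
echo "$host_staged_cmd"
```

<div dir="ltr">If the first form still crashes inside PMPI_Startall while the second runs cleanly, that would suggest the problem is in how device buffers reach MPI rather than in ex56 itself.</div>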