<div dir="ltr">It looks like MatAssemblyEnd is not setting up correctly in parallel . Segvv here. I'll take a look at what Stefano did.<div><br></div><div>#18 main () (at 0x00000000100019a8)<br>#17 MatMult (mat=0x2155a750, x=0x56dc29c0, y=0x5937b190) at /autofs/nccs-svm1_home1/adams/petsc/src/mat/interface/matrix.c:2448 (at 0x00002000005f4858)<br>#16 MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*, _p_Vec*) () from /ccs/home/adams/petsc/arch-summit-opt64-gnu-cuda/lib/libpetsc.so.3.15 (at 0x000020000095e298)<br>#15 VecScatterBegin (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20, addv=<optimized out>, mode=<optimized out>) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:1345 (at 0x00002000003a44fc)<br>#14 VecScatterBegin_Internal (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20, addv=INSERT_VALUES, mode=SCATTER_FORWARD) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:72 (at 0x000020000039e9cc)<br>#13 PetscSFBcastWithMemTypeBegin (sf=0x5937fbf0, unit=0x200024529ed0, rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized out>, leafdata=0x200076ea2200, op=0x200024539c70) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/sf.c:1493 (at 0x0000200000396f04)<br>#12 PetscSFBcastBegin_Basic (sf=0x5937fbf0, unit=<optimized out>, rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized out>, leafdata=0x200076ea2200, op=0x200024539c70) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:191 (at 0x00002000002de188)<br>#11 PetscSFLinkStartCommunication (direction=PETSCSF_ROOT2LEAF, link=<optimized out>, sf=0x5937fbf0) at /ccs/home/adams/petsc/include/../src/vec/is/sf/impls/basic/sfpack.h:267 (at 0x00002000002de188)<br>#10 PetscSFLinkStartRequests_MPI (sf=<optimized out>, link=0x5937f080, direction=<optimized out>) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfmpi.c:41 (at 0x00002000003850dc)<br>#9 PMPI_Startall () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libmpi_ibm.so.3 (at 0x0000200024493d98)<br>#8 mca_pml_pami_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301ce6e0)<br>#7 pml_pami_persis_send_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301ce29c)<br>#6 pml_pami_send () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301cf69c)<br>#5 PAMI_Send_immediate () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030395814)<br>#4 PAMI::Protocol::Send::Eager<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, 
On Fri, May 28, 2021 at 12:45 PM Barry Smith <bsmith@petsc.dev> wrote:

> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
> $ ./ex5cu
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
> Aborted (core dumped)
>
>         requires: cuda !define(PETSC_USE_CTABLE)
>
>   CI does not test the combination of CUDA with ctable disabled, so the code is still broken, just as it was six months ago in the discussion Stefano pointed to. It is clear why: no one has had the time to clean things up.
>
>   Barry
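The "requires:" line Barry quotes is PETSc test-harness syntax from the /*TEST ... TEST*/ comment block at the bottom of the example's source (here, ex5cu). A sketch of the kind of stanza involved, not necessarily ex5cu's exact one:

  /*TEST

     build:
       requires: cuda

     test:
       suffix: 0
       requires: cuda !define(PETSC_USE_CTABLE)

  TEST*/

Since no CI configuration builds with CUDA while PETSC_USE_CTABLE is undefined, a test guarded this way never actually runs in CI, which is Barry's point.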
>
>> On May 28, 2021, at 11:13 AM, Mark Adams <mfadams@lbl.gov> wrote:
>>
>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <stefano.zampini@gmail.com> wrote:
>>
>>> If you are referring to your device set values, I guess it is not currently tested.
>>
>> No, there is a test for that (ex5cu).
>> I have a user who is getting a segv in MatSetValues with aijcusparse. I suspect memory corruption, but I'm trying to cover all the bases.
>> I have added a CUDA test to ksp/ex56 that works. I can do an MR for it if such a test does not exist.
>>
>>> See the discussions here: https://gitlab.com/petsc/petsc/-/merge_requests/3411
>>> I started cleaning up the code to prepare for testing, but we never finished it: https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>
>>>> On May 28, 2021, at 6:53 PM, Mark Adams <mfadams@lbl.gov> wrote:
>>>>
>>>> Is there a test with MatSetValues and CUDA?
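For the ksp/ex56 CUDA test mentioned above, a hypothetical invocation would be something like the following (the option names are standard PETSc runtime options, but the exact arguments in the added test may differ):

  cd src/ksp/ksp/tutorials
  make ex56
  mpiexec -n 2 ./ex56 -mat_type aijcusparse -vec_type cuda

With -mat_type aijcusparse the assembled operator lives on the GPU, so such a run exercises the same MatSetValues-plus-CUDA combination under discussion.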