[petsc-users] CUDA MatSetValues test

Mark Adams mfadams at lbl.gov
Fri May 28 12:16:47 CDT 2021


It looks like MatAssemblyEnd is not setting things up correctly in parallel.
Segv here; backtrace below. I'll take a look at what Stefano did.

#18 main () (at 0x00000000100019a8)
#17 MatMult (mat=0x2155a750, x=0x56dc29c0, y=0x5937b190) at
/autofs/nccs-svm1_home1/adams/petsc/src/mat/interface/matrix.c:2448 (at
0x00002000005f4858)
#16 MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*, _p_Vec*) () from
/ccs/home/adams/petsc/arch-summit-opt64-gnu-cuda/lib/libpetsc.so.3.15 (at
0x000020000095e298)
#15 VecScatterBegin (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20,
addv=<optimized out>, mode=<optimized out>) at
/autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:1345
(at 0x00002000003a44fc)
#14 VecScatterBegin_Internal (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20,
addv=INSERT_VALUES, mode=SCATTER_FORWARD) at
/autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:72 (at
0x000020000039e9cc)
#13 PetscSFBcastWithMemTypeBegin (sf=0x5937fbf0, unit=0x200024529ed0,
rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized
out>, leafdata=0x200076ea2200, op=0x200024539c70) at
/autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/sf.c:1493 (at
0x0000200000396f04)
#12 PetscSFBcastBegin_Basic (sf=0x5937fbf0, unit=<optimized out>,
rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized
out>, leafdata=0x200076ea2200, op=0x200024539c70) at
/autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:191
(at 0x00002000002de188)
#11 PetscSFLinkStartCommunication (direction=PETSCSF_ROOT2LEAF,
link=<optimized out>, sf=0x5937fbf0) at
/ccs/home/adams/petsc/include/../src/vec/is/sf/impls/basic/sfpack.h:267 (at
0x00002000002de188)
#10 PetscSFLinkStartRequests_MPI (sf=<optimized out>, link=0x5937f080,
direction=<optimized out>) at
/autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfmpi.c:41
(at 0x00002000003850dc)
#9 PMPI_Startall () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libmpi_ibm.so.3
(at 0x0000200024493d98)
#8 mca_pml_pami_start () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so
(at 0x00002000301ce6e0)
#7 pml_pami_persis_send_start () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so
(at 0x00002000301ce29c)
#6 pml_pami_send () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so
(at 0x00002000301cf69c)
#5 PAMI_Send_immediate () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3
(at 0x0000200030395814)
#4
PAMI::Protocol::Send::Eager<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,
4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,
PAMI::Counter::Indirect<PAMI::Counter::Native>,
PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >,
PAMI::Device::IBV::PacketModel<PAMI::Device::IBV::Device, true>
>::EagerImpl<(PAMI::Protocol::Send::configuration_t)5,
true>::immediate(pami_send_immediate_t*) () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3
(at 0x0000200030457bac)
#3
PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,
4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,
PAMI::Counter::Indirect<PAMI::Counter::Native>,
PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >,
(PAMI::Protocol::Send::configuration_t)5>::immediate_impl(pami_send_immediate_t*)
() from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3
(at 0x0000200030457824)
#2 bool
PAMI::Device::Interface::PacketModel<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,
4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>,
PAMI::Counter::Indirect<PAMI::Counter::Native>,
PAMI::Device::Shmem::CMAShaddr, 256u, 512u> > >::postPacket<2u>(unsigned
long, unsigned long, void*, unsigned long, iovec (&) [2u]) () from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3
(at 0x0000200030456c18)
#1 PAMI::Device::Shmem::Packet<PAMI::Fifo::FifoPacket<64u, 4096u>
>::writePayload(PAMI::Fifo::FifoPacket<64u, 4096u>&, iovec*, unsigned long)
() from
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3
(at 0x0000200030435a7c)
#0 __memcpy_power7 () from /lib64/libc.so.6 (at 0x000020002463b804)
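
For context, a minimal sketch (not the user's code; the 1D Laplacian stencil
and sizes are only illustrative) of the assemble-then-multiply path that hits
this trace: MatSetValues with some off-process entries, MatAssemblyBegin/End,
and then the scatter started inside MatMult on an aijcusparse matrix.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscInt       i, rstart, rend, N = 128;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    PetscScalar v[3]    = {-1.0, 2.0, -1.0};
    PetscInt    cols[3] = {i - 1, i, i + 1};
    PetscInt    *col = cols, ncols = 3;
    PetscScalar *val = v;
    if (i == 0)     { col++; val++; ncols--; } /* drop the i-1 entry in the first row */
    if (i == N - 1) { ncols--; }               /* drop the i+1 entry in the last row  */
    ierr = MatSetValues(A, 1, &i, ncols, col, val, INSERT_VALUES);CHKERRQ(ierr);
  }
  /* off-process entries are communicated here; this is where the parallel setup happens */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr); /* CUDA vecs to match the matrix type */
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = MatMult(A, x, y);CHKERRQ(ierr);         /* the segv above is in the scatter started here */
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}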

On Fri, May 28, 2021 at 12:45 PM Barry Smith <bsmith at petsc.dev> wrote:

>
> ~/petsc/src/mat/tutorials*
> (barry/2021-05-28/robustify-cuda-gencodearch-check=)*
> arch-robustify-cuda-gencodearch-check
> $ ./ex5cu
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
> Aborted (core dumped)
>
>         requires: cuda !define(PETSC_USE_CTABLE)
>
>   CI does not test with CUDA and ctable disabled. The code is still broken,
> as it was six months ago in the discussion Stefano pointed to. It is clear
> why; it is just that no one has had the time to clean things up.
>
>   Barry
>
>
> On May 28, 2021, at 11:13 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
>
> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <
> stefano.zampini at gmail.com> wrote:
>
>> If you are referring to your device set values, I guess it is not
>> currently tested.
>>
>
> No. There is a test for that (ex5cu).
> I have a user who is getting a segv in MatSetValues with aijcusparse. I
> suspect memory corruption, but I'm trying to cover all the bases.
> I have added a CUDA test to ksp/ex56 that works; I can do an MR for it if
> such a test does not already exist.
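> Roughly, the kind of test stanza I have in mind for ex56 (a sketch only; the
> suffix and args here are illustrative and assume the example picks up the
> matrix and vector types from the options database):
>
>   test:
>     suffix: cuda
>     requires: cuda
>     args: -mat_type aijcusparse -vec_type cuda
>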
>
>
>> See the discussions here
>> https://gitlab.com/petsc/petsc/-/merge_requests/3411
>> I started cleaning up the code to prepare for testing, but we never
>> finished it:
>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>
>>
>> On May 28, 2021, at 6:53 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Is there a test with MatSetValues and CUDA?
>>
>>
>>
>