[petsc-users] HashMap Error when populating AIJCUSPARSE matrix

Matthew Knepley knepley at gmail.com
Thu Jan 18 15:28:14 CST 2024


On Thu, Jan 18, 2024 at 4:18 PM Yesypenko, Anna <anna at oden.utexas.edu>
wrote:

> Hi Matt, Barry,
>
> Apologies for the extra dependency on scipy. I can replicate the error by
> calling setValue(i, j, v) in a loop as well.
> In roughly half of 10 runs, the following script fails because of an error
> in hashmapijv – the same as my original post.
> It successfully runs without error the other times.
>
> Barry is right that it's CUDA specific. The script runs fine on the CPU.
> Do you have any suggestions or example scripts on assigning entries to an
> AIJCUSPARSE matrix?
>

Oh, you definitely do not want to be doing this. I believe you would rather
do one of the following:

1) Make the CPU matrix and then convert to AIJCUSPARSE. This is efficient.

2) Produce the values on the GPU and call

  https://petsc.org/main/manualpages/Mat/MatSetPreallocationCOO/
  https://petsc.org/main/manualpages/Mat/MatSetValuesCOO/

  This is what most people do who are forming matrices directly on the GPU.
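As a rough illustration of option 2 for the tridiagonal example in this thread, here is a minimal sketch using the petsc4py wrappers setPreallocationCOO() and setValuesCOO(). The helper names tridiag_coo and assemble_coo are made up for this sketch. It assumes a single process (COMM_SELF) and, for simplicity, builds the triplets with NumPy on the host rather than on the GPU, so treat it as a starting point rather than a tuned implementation.

```python
import numpy as np


def tridiag_coo(n):
    """Return (rows, cols, vals) COO triplets for the n x n tridiagonal
    matrix with 2 on the diagonal and -1 on the two off-diagonals."""
    i = np.arange(n, dtype=np.int32)
    # Order: main diagonal, then sub-diagonal, then super-diagonal.
    rows = np.concatenate([i, i[1:], i[:-1]])
    cols = np.concatenate([i, i[:-1], i[1:]])
    vals = np.concatenate([2.0 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)])
    return rows, cols, vals


def assemble_coo(n, mat_type="aijcusparse"):
    """Hypothetical helper: create a matrix of the given type and fill it
    from COO triplets in two calls instead of one setValue() per entry."""
    # Imported here so that tridiag_coo() can be used without PETSc installed.
    from petsc4py import PETSc

    rows, cols, vals = tridiag_coo(n)
    A = PETSc.Mat().create(comm=PETSc.COMM_SELF)  # assumption: one process
    A.setSizes([n, n])
    A.setType(mat_type)
    A.setPreallocationCOO(rows, cols)  # describe the sparsity pattern once
    A.setValuesCOO(vals)               # supply all values in a single call
    return A
```

Calling assemble_coo(int(5e5)) would build the same matrix as the setValue() loop. As I read the MatSetValuesCOO manual page, the matrix is assembled by that call, so a separate assemble() should not be needed.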

What you are currently doing is incredibly inefficient, and I think it
accounts for you running out of memory.
Inserting entries one at a time this way talks back and forth between the
CPU and GPU.
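Option 1 could look roughly like the following sketch. It reuses the createAIJ(), setValuesCSR() and assemble() calls from the original script but leaves the matrix as plain AIJ until it is assembled, and only then calls convert(). The helper names tridiag_csr and build_on_cpu_then_convert are invented for this sketch, and a single process (COMM_SELF) is assumed.

```python
import numpy as np


def tridiag_csr(n):
    """Return CSR arrays (indptr, indices, data) for the n x n tridiagonal
    matrix with 2 on the diagonal and -1 on the two off-diagonals."""
    rows = np.repeat(np.arange(n), 3)
    cols = rows + np.tile([-1, 0, 1], n)
    vals = np.tile([-1.0, 2.0, -1.0], n)
    keep = (cols >= 0) & (cols < n)  # drop the out-of-range corner entries
    counts = np.bincount(rows[keep], minlength=n)
    indptr = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
    return indptr, cols[keep].astype(np.int32), vals[keep]


def build_on_cpu_then_convert(n, gpu_type="aijcusparse"):
    """Hypothetical helper: assemble as a CPU AIJ matrix, then convert."""
    # Imported here so that tridiag_csr() can be used without PETSc installed.
    from petsc4py import PETSc

    indptr, indices, data = tridiag_csr(n)
    nnz = np.diff(indptr).astype(np.int32)  # nonzeros per row
    A = PETSc.Mat().createAIJ(size=[n, n], nnz=nnz, comm=PETSc.COMM_SELF)
    A.setValuesCSR(indptr, indices, data)   # all insertion happens on the CPU
    A.assemble()
    A.convert(gpu_type)                     # in-place conversion to the GPU type
    return A
```

The point of this ordering is that every insertion goes through the CPU AIJ code path, and the GPU format is only involved once, after assembly.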

  Thanks,

     Matt

> Here is a minimum snippet that doesn't depend on scipy.
> ```
> from petsc4py import PETSc
> import numpy as np
>
> n = int(5e5);
> nnz = 3 * np.ones(n, dtype=np.int32)
> nnz[0] = nnz[-1] = 2
> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
> A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
> A.setType('aijcusparse')
>
> A.setValue(0, 0, 2)
> A.setValue(0, 1, -1)
> A.setValue(n-1, n-2, -1)
> A.setValue(n-1, n-1, 2)
>
> for index in range(1, n - 1):
>          A.setValue(index, index - 1, -1)
>          A.setValue(index, index, 2)
>          A.setValue(index, index + 1, -1)
> A.assemble()
> ```
> If it means anything to you, when the hash error occurs, it is for index
> 67283 after filling 201851 nonzero values.
>
> Thank you for your help and suggestions!
> Anna
>
> ------------------------------
> *From:* Barry Smith <bsmith at petsc.dev>
> *Sent:* Thursday, January 18, 2024 2:35 PM
> *To:* Yesypenko, Anna <anna at oden.utexas.edu>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] HashMap Error when populating AIJCUSPARSE
> matrix
>
>
>    Do you ever get a problem with 'aij'? Can you run in a loop with
> 'aij' to confirm it doesn't fail then?
>
>
>
>    Barry
>
>
> On Jan 17, 2024, at 4:51 PM, Yesypenko, Anna <anna at oden.utexas.edu> wrote:
>
> Dear PETSc users/developers,
>
> I'm experiencing a bug when using petsc4py with GPU support. It may be my
> mistake in how I set up an AIJCUSPARSE matrix.
> For larger matrices, I sometimes encounter an error in assigning matrix
> values; the error is thrown in PetscHMapIJVQuerySet().
> Here is a minimum snippet that populates a sparse tridiagonal matrix.
>
> ```
> from petsc4py import PETSc
> from scipy.sparse import diags
> import numpy as np
>
> n = int(5e5);
>
> nnz = 3 * np.ones(n, dtype=np.int32); nnz[0] = nnz[-1] = 2
> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
> A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
> A.setType('aijcusparse')
> tmp = diags([-1,2,-1],[-1,0,+1],shape=(n,n)).tocsr()
> A.setValuesCSR(tmp.indptr,tmp.indices,tmp.data)
> # The setValuesCSR call above is the line where the error is thrown.
> A.assemble()
> ```
>
> The error trace is below:
> ```
> File "petsc4py/PETSc/Mat.pyx", line 2603, in
> petsc4py.PETSc.Mat.setValuesCSR
>   File "petsc4py/PETSc/petscmat.pxi", line 1039, in
> petsc4py.PETSc.matsetvalues_csr
>   File "petsc4py/PETSc/petscmat.pxi", line 1032, in
> petsc4py.PETSc.matsetvalues_ijv
> petsc4py.PETSc.Error: error code 76
> [0] MatSetValues() at
> /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:1497
> [0] MatSetValues_Seq_Hash() at
> /work/06368/annayesy/ls6/petsc/include/../src/mat/impls/aij/seq/seqhashmatsetvalues.h:52
> [0] PetscHMapIJVQuerySet() at
> /work/06368/annayesy/ls6/petsc/include/petsc/private/hashmapijv.h:10
> [0] Error in external library
> [0] [khash] Assertion: `ret >= 0' failed.
> ```
>
> If I run the same script a handful of times, it will run without errors
> eventually.
> Does anyone have insight on why it is behaving this way? I'm running on a
> node with 3x NVIDIA A100 PCIE 40GB.
>
> Thank you!
> Anna
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>