[petsc-users] HashMap Error when populating AIJCUSPARSE matrix
Yesypenko, Anna
anna at oden.utexas.edu
Thu Jan 18 15:47:55 CST 2024
Hi all,
Matt's suggestions worked great! The script works consistently now.
What I was doing is a poor way to populate sparse matrices on the GPU. I'm not sure why it fails, but luckily we found a fix.
Thank you all for your help and suggestions!
Best,
Anna
________________________________
From: Barry Smith <bsmith at petsc.dev>
Sent: Thursday, January 18, 2024 3:38 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Victor Eijkhout <eijkhout at tacc.utexas.edu>
Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
It is using the hash map system for inserting values, which inserts only on the CPU, not on the GPU. So I don't see that it would be moving any data to the GPU until the matrix assembly is done, which it never gets to. Hence I have trouble understanding why the GPU has anything to do with the crash.
I guess I need to try to reproduce it on a GPU system.
Barry
On Jan 18, 2024, at 4:28 PM, Matthew Knepley <knepley at gmail.com> wrote:
On Thu, Jan 18, 2024 at 4:18 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Hi Matt, Barry,
Apologies for the extra dependency on scipy. I can replicate the error by calling setValue(i, j, v) in a loop as well.
In roughly half of 10 runs, the following script fails because of an error in hashmapijv – the same as my original post.
It successfully runs without error the other times.
Barry is right that it's CUDA specific. The script runs fine on the CPU.
Do you have any suggestions or example scripts on assigning entries to an AIJCUSPARSE matrix?
Oh, you definitely do not want to be doing this. I believe you would rather
1) Make the CPU matrix and then convert to AIJCUSPARSE. This is efficient.
2) Produce the values on the GPU and call
https://petsc.org/main/manualpages/Mat/MatSetPreallocationCOO/
https://petsc.org/main/manualpages/Mat/MatSetValuesCOO/
This is what most people do who are forming matrices directly on the GPU.
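A rough sketch of option 2 for the tridiagonal matrix discussed in this thread. It assumes a petsc4py version recent enough to expose Mat.setPreallocationCOO() and Mat.setValuesCOO(), and it generates the triplets on the host with NumPy rather than on the GPU. The helper names tridiagonal_coo and build_aijcusparse_coo are illustrative, not part of PETSc, and COMM_SELF is used so that a single process owns every entry.

```python
import numpy as np


def tridiagonal_coo(n):
    """Return (rows, cols, vals) for the n x n matrix tridiag(-1, 2, -1)."""
    idx = np.arange(n, dtype=np.int32)
    # Order of the three groups: main diagonal, sub-diagonal, super-diagonal.
    rows = np.concatenate([idx, idx[1:], idx[:-1]])
    cols = np.concatenate([idx, idx[:-1], idx[1:]])
    vals = np.concatenate([np.full(n, 2.0),
                           np.full(n - 1, -1.0),
                           np.full(n - 1, -1.0)])
    return rows, cols, vals


def build_aijcusparse_coo(n):
    """Fill an AIJCUSPARSE matrix with one COO call (needs a CUDA-enabled PETSc)."""
    # Imported here so that tridiagonal_coo() can be used without PETSc installed.
    from petsc4py import PETSc

    rows, cols, vals = tridiagonal_coo(n)
    A = PETSc.Mat().create(comm=PETSc.COMM_SELF)
    A.setSizes([n, n])
    A.setType('aijcusparse')
    A.setPreallocationCOO(rows, cols)  # declare the sparsity pattern once
    A.setValuesCOO(vals)               # supply all values in a single call
    return A
```

Calling build_aijcusparse_coo(int(5e5)) would replace the per-entry setValue() loop with two calls, so no entry-by-entry insertion takes place.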
What you are currently doing is incredibly inefficient, and I think accounts for you running out of memory.
It talks back and forth between the CPU and GPU.
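Option 1 might look roughly like the sketch below: fill a plain 'aij' matrix on the CPU, assemble it, and only then convert it to 'aijcusparse'. It reuses the createAIJ(), setValuesCSR() and assemble() calls that appear elsewhere in this thread and adds Mat.convert(). The helper names tridiagonal_csr and build_on_cpu_then_convert are illustrative, and COMM_SELF is an assumption made to keep the example single-process.

```python
import numpy as np


def tridiagonal_csr(n):
    """Return (indptr, indices, data) in CSR form for tridiag(-1, 2, -1)."""
    i = np.arange(n, dtype=np.int32)
    cols = np.stack([i - 1, i, i + 1], axis=1)           # candidate columns per row
    vals = np.tile(np.array([-1.0, 2.0, -1.0]), (n, 1))  # matching values per row
    keep = (cols >= 0) & (cols < n)                      # drop out-of-range columns
    indptr = np.zeros(n + 1, dtype=np.int32)
    indptr[1:] = np.cumsum(keep.sum(axis=1))
    return indptr, cols[keep], vals[keep]


def build_on_cpu_then_convert(n):
    """Assemble on the CPU, then convert (needs a CUDA-enabled PETSc)."""
    # Imported here so that tridiagonal_csr() can be used without PETSc installed.
    from petsc4py import PETSc

    indptr, indices, data = tridiagonal_csr(n)
    nnz = np.diff(indptr)  # nonzeros per row, used for preallocation
    A = PETSc.Mat().createAIJ(size=[n, n], nnz=nnz, comm=PETSc.COMM_SELF)
    A.setValuesCSR(indptr, indices, data)
    A.assemble()              # the matrix is still a CPU 'aij' matrix here
    A.convert('aijcusparse')  # convert in place after assembly
    return A
```

The point of this ordering is that the matrix type is switched only after assembly, so all insertion happens on an ordinary CPU matrix.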
Thanks,
Matt
Here is a minimum snippet that doesn't depend on scipy.
```
from petsc4py import PETSc
import numpy as np
n = int(5e5)
nnz = 3 * np.ones(n, dtype=np.int32)
nnz[0] = nnz[-1] = 2
A = PETSc.Mat(comm=PETSc.COMM_WORLD)
A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
A.setType('aijcusparse')
A.setValue(0, 0, 2)
A.setValue(0, 1, -1)
A.setValue(n-1, n-2, -1)
A.setValue(n-1, n-1, 2)
for index in range(1, n - 1):
    A.setValue(index, index - 1, -1)
    A.setValue(index, index, 2)
    A.setValue(index, index + 1, -1)
A.assemble()
```
If it means anything to you, when the hash error occurs, it is for index 67283 after filling 201851 nonzero values.
Thank you for your help and suggestions!
Anna
________________________________
From: Barry Smith <bsmith at petsc.dev>
Sent: Thursday, January 18, 2024 2:35 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
Do you ever get a problem with 'aij'? Can you run in a loop with 'aij' to confirm it doesn't fail then?
Barry
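One way to carry out this kind of repeated check is to launch the reproducer several times as a separate process and count the runs that exit with an error. The sketch below assumes the reproducer is saved as a script that reads the matrix type from its first command-line argument; the file name fill_matrix.py and that argument convention are hypothetical, as is the helper name count_failed_runs.

```python
import subprocess
import sys


def count_failed_runs(script_path, script_args=(), trials=10):
    """Run a Python script `trials` times; return how many runs exited non-zero."""
    failed = 0
    for _ in range(trials):
        result = subprocess.run(
            [sys.executable, script_path, *script_args],
            capture_output=True,  # keep the child's output off the terminal
        )
        if result.returncode != 0:
            failed += 1
    return failed


# Hypothetical usage, comparing the CPU and GPU matrix types:
#   count_failed_runs("fill_matrix.py", ["aij"])
#   count_failed_runs("fill_matrix.py", ["aijcusparse"])
```

Running each trial in a fresh process mirrors how the script was rerun by hand, so one failed run cannot affect the next.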
On Jan 17, 2024, at 4:51 PM, Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Dear PETSc users/developers,
I'm experiencing a bug when using petsc4py with GPU support. It may be my mistake in how I set up an AIJCUSPARSE matrix.
For larger matrices, I sometimes encounter an error in assigning matrix values; the error is thrown in PetscHMapIJVQuerySet().
Here is a minimum snippet that populates a sparse tridiagonal matrix.
```
from petsc4py import PETSc
from scipy.sparse import diags
import numpy as np
n = int(5e5)
nnz = 3 * np.ones(n, dtype=np.int32)
nnz[0] = nnz[-1] = 2
A = PETSc.Mat(comm=PETSc.COMM_WORLD)
A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
A.setType('aijcusparse')
tmp = diags([-1,2,-1],[-1,0,+1],shape=(n,n)).tocsr()
A.setValuesCSR(tmp.indptr,tmp.indices,tmp.data)  # this is the line where the error is thrown
A.assemble()
```
The error trace is below:
```
File "petsc4py/PETSc/Mat.pyx", line 2603, in petsc4py.PETSc.Mat.setValuesCSR
File "petsc4py/PETSc/petscmat.pxi", line 1039, in petsc4py.PETSc.matsetvalues_csr
File "petsc4py/PETSc/petscmat.pxi", line 1032, in petsc4py.PETSc.matsetvalues_ijv
petsc4py.PETSc.Error: error code 76
[0] MatSetValues() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:1497
[0] MatSetValues_Seq_Hash() at /work/06368/annayesy/ls6/petsc/include/../src/mat/impls/aij/seq/seqhashmatsetvalues.h:52
[0] PetscHMapIJVQuerySet() at /work/06368/annayesy/ls6/petsc/include/petsc/private/hashmapijv.h:10
[0] Error in external library
[0] [khash] Assertion: `ret >= 0' failed.
```
If I run the same script a handful of times, it will run without errors eventually.
Does anyone have insight on why it is behaving this way? I'm running on a node with 3x NVIDIA A100 PCIE 40GB.
Thank you!
Anna
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/