[petsc-users] HashMap Error when populating AIJCUSPARSE matrix

Barry Smith bsmith at petsc.dev
Thu Jan 18 18:10:46 CST 2024


   Thanks. Same version I tried. 


> On Jan 18, 2024, at 6:09 PM, Yesypenko, Anna <anna at oden.utexas.edu> wrote:
> 
> Hi Barry,
> 
> I'm using PETSc version 3.20.3. The TACC system is Lonestar6.
> 
> Best,
> Anna
> From: Barry Smith <bsmith at petsc.dev>
> Sent: Thursday, January 18, 2024 4:43 PM
> To: Yesypenko, Anna <anna at oden.utexas.edu>
> Cc: petsc-users at mcs.anl.gov; Victor Eijkhout <eijkhout at tacc.utexas.edu>
> Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
>  
> 
>    OK, I ran both versions of the Python code on an ANL machine with CUDA, and they worked fine over many runs. I even increased the problem size without producing any problems.
> 
>    Anna,
> 
>    What version of PETSc are you using?
> 
>    Victor,
> 
>    Does anyone at ANL have access to this TACC system to try to reproduce?
> 
> 
>   Barry
> 
>    
> 
>> On Jan 18, 2024, at 4:38 PM, Barry Smith <bsmith at petsc.dev> wrote:
>> 
>> 
>>    It is using the hash map system for inserting values, which only inserts on the CPU, not on the GPU. So I don't see that it would be moving any data to the GPU until the matrix assembly (A.assemble()) is done, which it never gets to. Hence I have trouble understanding why the GPU has anything to do with the crash. 
>> 
>>    I guess I need to try to reproduce it on a GPU system.
>> 
>>    Barry
>> 
>> 
>> 
>> 
>>> On Jan 18, 2024, at 4:28 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>> 
>>> On Thu, Jan 18, 2024 at 4:18 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
>>> Hi Matt, Barry,
>>> 
>>> Apologies for the extra dependency on scipy. I can replicate the error by calling setValue(i, j, v) in a loop as well.
>>> In roughly half of 10 runs, the following script fails because of an error in hashmapijv, the same as in my original post.
>>> The other runs complete without error.
>>> 
>>> Barry is right that it's CUDA specific. The script runs fine on the CPU.
>>> Do you have any suggestions or example scripts on assigning entries to an AIJCUSPARSE matrix?
>>> 
>>> Oh, you definitely do not want to be doing this. I believe you would rather
>>> 
>>> 1) Make the CPU matrix and then convert to AIJCUSPARSE. This is efficient.
>>> 
>>> 2) Produce the values on the GPU and call
>>> 
>>>   https://petsc.org/main/manualpages/Mat/MatSetPreallocationCOO/
>>>   https://petsc.org/main/manualpages/Mat/MatSetValuesCOO/
>>> 
>>>   This is what most people do who are forming matrices directly on the GPU.
>>> 
>>> What you are currently doing is incredibly inefficient, and I think accounts for you running out of memory.
>>> It talks back and forth between the CPU and GPU.
>>> 
>>>   Thanks,
>>> 
>>>      Matt
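
A minimal petsc4py sketch of option 2 above, for the same tridiagonal matrix used in this thread. It assumes a single process (PETSc.COMM_SELF) and a petsc4py build that exposes Mat.setPreallocationCOO and Mat.setValuesCOO; the helper names tridiagonal_coo and assemble_coo are made up for illustration. The triplets here are built with NumPy on the host rather than on the GPU, so this only shows the call pattern, not a GPU-resident assembly.

```python
import numpy as np


def tridiagonal_coo(n):
    """Return (rows, cols, vals) COO triplets for the n x n [-1, 2, -1] matrix."""
    i = np.arange(n, dtype=np.int32)
    # Order of the three pieces: main diagonal, sub-diagonal, super-diagonal.
    rows = np.concatenate([i, i[1:], i[:-1]])
    cols = np.concatenate([i, i[1:] - 1, i[:-1] + 1])
    vals = np.concatenate([2.0 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)])
    return rows, cols, vals


def assemble_coo(n, mat_type="aijcusparse"):
    """Assemble the tridiagonal matrix through the COO interface (hypothetical helper)."""
    # Imported lazily so that tridiagonal_coo() needs only NumPy.
    from petsc4py import PETSc

    rows, cols, vals = tridiagonal_coo(n)
    A = PETSc.Mat().create(comm=PETSc.COMM_SELF)  # assumption: one process
    A.setSizes([n, n])
    A.setType(mat_type)
    A.setPreallocationCOO(rows, cols)  # declare the full sparsity pattern once
    A.setValuesCOO(vals)               # supply all values in the same order
    A.assemble()
    return A
```

Calling assemble_coo(int(5e5)) would replace the setValue loop with two bulk calls. Passing mat_type="aij" gives a CPU matrix through the same code path, which may help when comparing the two back ends.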
>>> 
>>> Here is a minimum snippet that doesn't depend on scipy.
>>> ```
>>> from petsc4py import PETSc
>>> import numpy as np
>>> 
>>> n = int(5e5)
>>> 
>>> # Nonzeros per row of the tridiagonal matrix: 3 in the interior, 2 in the first and last rows.
>>> nnz = 3 * np.ones(n, dtype=np.int32)
>>> nnz[0] = nnz[-1] = 2
>>> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
>>> A.createAIJ(size=[n, n], comm=PETSc.COMM_WORLD, nnz=nnz)
>>> A.setType('aijcusparse')
>>> 
>>> # First and last rows.
>>> A.setValue(0, 0, 2)
>>> A.setValue(0, 1, -1)
>>> A.setValue(n - 1, n - 2, -1)
>>> A.setValue(n - 1, n - 1, 2)
>>> 
>>> # Interior rows, one entry at a time; the hash map error, when it occurs, is raised inside this loop.
>>> for index in range(1, n - 1):
>>>     A.setValue(index, index - 1, -1)
>>>     A.setValue(index, index, 2)
>>>     A.setValue(index, index + 1, -1)
>>> A.assemble()
>>> ```
>>> If it means anything to you, when the hash error occurs, it is for index 67283 after filling 201851 nonzero values.
>>> 
>>> Thank you for your help and suggestions!
>>> Anna
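
A minimal sketch of option 1 (make the CPU matrix, then convert), written without scipy. It assumes a single process (PETSc.COMM_SELF), n >= 2, and a CUDA-enabled PETSc build for the conversion step; tridiagonal_csr and assemble_on_cpu_then_convert are hypothetical helper names, not part of petsc4py.

```python
import numpy as np


def tridiagonal_csr(n):
    """Return (indptr, indices, data) CSR arrays for the n x n [-1, 2, -1] matrix, n >= 2."""
    counts = np.full(n, 3, dtype=np.int32)
    counts[0] = counts[-1] = 2  # first and last rows hold two entries each
    indptr = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
    # Candidate columns i-1, i, i+1 for every row i, with matching values.
    cols = np.arange(n)[:, None] + np.array([-1, 0, 1])
    data = np.tile(np.array([-1.0, 2.0, -1.0]), (n, 1))
    # Drop the two out-of-range candidates; the mask flattens in row-major (CSR) order.
    keep = (cols >= 0) & (cols < n)
    return indptr, cols[keep].astype(np.int32), data[keep]


def assemble_on_cpu_then_convert(n, gpu_type="aijcusparse"):
    """Build a CPU AIJ matrix from CSR arrays, then convert it in place (hypothetical helper)."""
    # Imported lazily so that tridiagonal_csr() needs only NumPy.
    from petsc4py import PETSc

    indptr, indices, data = tridiagonal_csr(n)
    A = PETSc.Mat().createAIJ(size=[n, n], csr=(indptr, indices, data),
                              comm=PETSc.COMM_SELF)  # assumption: one process
    A.assemble()
    A.convert(gpu_type)  # in-place conversion of the assembled CPU matrix
    return A
```

With this layout all insertion happens on a plain 'aij' matrix, and the GPU type only enters after assembly.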
>>> 
>>> From: Barry Smith <bsmith at petsc.dev>
>>> Sent: Thursday, January 18, 2024 2:35 PM
>>> To: Yesypenko, Anna <anna at oden.utexas.edu>
>>> Cc: petsc-users at mcs.anl.gov
>>> Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
>>>  
>>> 
>>>    Do you ever get a problem with 'aij'? Can you run in a loop with 'aij' to confirm it doesn't fail then?
>>> 
>>>    
>>> 
>>>    Barry
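
One way to run such a check is sketched below: it launches a script repeatedly, each time in a fresh Python process, and counts the runs that exit with a nonzero status (an uncaught petsc4py error ends the process that way). The function name count_failures and the script name used in the note after the block are hypothetical.

```python
import subprocess
import sys


def count_failures(script_args, runs=10):
    """Run `python <script_args>` `runs` times; return how many runs exited nonzero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, *script_args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, if the reproducer were saved as repro.py (a made-up name) and took the matrix type as its first argument, count_failures(["repro.py", "aij"]) and count_failures(["repro.py", "aijcusparse"]) could be compared directly.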
>>> 
>>> 
>>>> On Jan 17, 2024, at 4:51 PM, Yesypenko, Anna <anna at oden.utexas.edu> wrote:
>>>> 
>>>> Dear PETSc users/developers,
>>>> 
>>>> I'm experiencing a bug when using petsc4py with GPU support. It may be my mistake in how I set up an AIJCUSPARSE matrix.
>>>> For larger matrices, I sometimes encounter an error when assigning matrix values; the error is thrown in PetscHMapIJVQuerySet().
>>>> Here is a minimal snippet that populates a sparse tridiagonal matrix. 
>>>> 
>>>> ```
>>>> from petsc4py import PETSc
>>>> from scipy.sparse import diags
>>>> import numpy as np
>>>> 
>>>> n = int(5e5)
>>>> 
>>>> # Nonzeros per row: 3 in the interior, 2 in the first and last rows.
>>>> nnz = 3 * np.ones(n, dtype=np.int32)
>>>> nnz[0] = nnz[-1] = 2
>>>> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
>>>> A.createAIJ(size=[n, n], comm=PETSc.COMM_WORLD, nnz=nnz)
>>>> A.setType('aijcusparse')
>>>> tmp = diags([-1, 2, -1], [-1, 0, +1], shape=(n, n)).tocsr()
>>>> A.setValuesCSR(tmp.indptr, tmp.indices, tmp.data)  # this is the line where the error is thrown
>>>> A.assemble()
>>>> ```
>>>> 
>>>> The error trace is below:
>>>> ```
>>>> File "petsc4py/PETSc/Mat.pyx", line 2603, in petsc4py.PETSc.Mat.setValuesCSR
>>>>   File "petsc4py/PETSc/petscmat.pxi", line 1039, in petsc4py.PETSc.matsetvalues_csr
>>>>   File "petsc4py/PETSc/petscmat.pxi", line 1032, in petsc4py.PETSc.matsetvalues_ijv
>>>> petsc4py.PETSc.Error: error code 76
>>>> [0] MatSetValues() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:1497
>>>> [0] MatSetValues_Seq_Hash() at /work/06368/annayesy/ls6/petsc/include/../src/mat/impls/aij/seq/seqhashmatsetvalues.h:52
>>>> [0] PetscHMapIJVQuerySet() at /work/06368/annayesy/ls6/petsc/include/petsc/private/hashmapijv.h:10
>>>> [0] Error in external library
>>>> [0] [khash] Assertion: `ret >= 0' failed.
>>>> ```
>>>> 
>>>> If I run the same script a handful of times, it will run without errors eventually.
>>>> Does anyone have insight on why it is behaving this way? I'm running on a node with 3x NVIDIA A100 PCIE 40GB.
>>>> 
>>>> Thank you!
>>>> Anna
>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/