[petsc-dev] MatMPIAIJGetLocalMat problem with GPUs

Mark Adams mfadams at lbl.gov
Fri Jun 24 12:38:18 CDT 2022


I am rearranging the code from the repo here for clarity, but I have:

  PetscBool is_dev_ptrs;
  PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
  PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
  PetscPrintf(PETSC_COMM_SELF,"checking against mataijcusparse amgx->localA = %d\n",is_dev_ptrs);
  PetscCall(PetscObjectTypeCompareAny((PetscObject)Pmat, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
  PetscPrintf(PETSC_COMM_SELF,"checking against mataijcusparse Pmat = %d\n",is_dev_ptrs);

And it seems to show that MatMPIAIJGetLocalMat takes an mpiaijcusparse Mat
and returns a seqaij Mat (see below):

AMGX version 2.2.0.132-opensource
Built on Jun 24 2022, 09:21:43
Compiled with CUDA Runtime 11.5, using CUDA driver 11.5

checking against mataijcusparse amgx->localA = 0
checking against mataijcusparse Pmat = 1
localA_name seqaij
Pmat_name mpiaijcusparse

Matt's existing testing code (below), plus the prints that I added, then
shows the matrix types, which are consistent with these checks.

  // XXX DEBUG REMOVE
  const char* localA_name;
  PetscObjectGetType((PetscObject)amgx->localA, &localA_name);
  PetscPrintf(PETSC_COMM_SELF,"localA_name %s\n", localA_name);
  const char* Pmat_name;
  PetscObjectGetType((PetscObject)Pmat, &Pmat_name);
  PetscPrintf(PETSC_COMM_SELF,"Pmat_name %s\n", Pmat_name);






On Fri, Jun 24, 2022 at 10:00 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Jun 24, 2022, at 8:58 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
> And before we move to the MR, I think Matt found a clear problem:
>
> * PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
> returns "localA seqaij"
>
> * And, oddly, PetscCall(MatSeqAIJGetArrayRead(amgx->localA,
> &amgx->values)); returns:
> "it seems to detect that the pointer is a device mapped pointer but that
> it is invalid"
>
>
>   It does not return a device-mapped pointer; it returns a valid host
> pointer only. MatSeqAIJGetArrayRead() is intended to return only a host
> pointer; it cannot return a device pointer. MatSeqAIJCUSPARSEGetArrayRead()
> returns device pointers and should be used for this purpose.
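>
>   For illustration, a minimal sketch of the difference (assuming localA
> really is a MATSEQAIJCUSPARSE matrix):
>
>   const PetscScalar *host_vals, *dev_vals;
>   PetscCall(MatSeqAIJGetArrayRead(localA, &host_vals));          /* host pointer */
>   PetscCall(MatSeqAIJRestoreArrayRead(localA, &host_vals));
>   PetscCall(MatSeqAIJCUSPARSEGetArrayRead(localA, &dev_vals));   /* device pointer */
>   PetscCall(MatSeqAIJCUSPARSERestoreArrayRead(localA, &dev_vals));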
>
>
> Matt, let's just comment out the REUSE line and add another INITIAL line
> (destroying the old Mat of course), and let's press on.
>
>
>   Looking at the code there is no way that simply using INITIAL instead of
> REUSE will make a code that does not work on the GPU run on the GPU. The
> MatMPIAIJGetLocalMat() returns only a MATSEQAIJ matrix regardless of
> the INITIAL versus REUSE and one can never get a device pointer from a
> non-GPU matrix.
>
>    As noted by Stefano, the code either needs to use
> MatMPIAIJGetLocalMatMerge(), which does return a CUSPARSE matrix (but the
> columns are not supported), or MatMPIAIJGetLocalMat() needs to be updated
> to return a CUSPARSE matrix when the input MPI matrix is a CUSPARSE matrix.
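>
>    For reference, the first option would look something like this (an
> untested sketch; I believe the IS argument can be passed as NULL when the
> column index set is not needed):
>
>   Mat localA;
>   const PetscScalar *values;
>   /* Sketch: merge the diagonal and off-diagonal blocks into one sequential
>      matrix that keeps the CUSPARSE type of Pmat (with the caveat about the
>      columns noted above). */
>   PetscCall(MatMPIAIJGetLocalMatMerge(Pmat, MAT_INITIAL_MATRIX, NULL, &localA));
>   PetscCall(MatSeqAIJCUSPARSEGetArrayRead(localA, &values));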
>
>
>
>
> We can keep the debugging code for now.
>
> We (PETSc) can work on this independently,
>
> Thanks,
> Mark
>
> On Fri, Jun 24, 2022 at 8:51 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> I am not seeing this response, I see my "hstack" comment last.
>> https://gitlab.com/petsc/petsc/-/merge_requests/4323
>>
>> On Thu, Jun 23, 2022 at 4:37 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>  I have responded in the MR, which has all the context and the code.
>>> Please move this conversation from petsc-dev to the MR. Note you can use
>>> the little cartoon cloud symbol (upper right of the sub window with my
>>> text) to reply to my post and keep everything in a thread for clarity.
>>>
>>>   We are confused because it seems you are trying a variety of things
>>> and we don't know how the different things you tried resulted in the
>>> multiple errors you reported.
>>>
>>>
>>>
>>> On Jun 23, 2022, at 3:59 PM, Matthew Martineau <mmartineau at nvidia.com>
>>> wrote:
>>>
>>> I checked in the changes and some debugging statements.
>>>
>>> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
>>> PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA,
>>> &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
>>>
>>> Then the call returns false. If we instead call
>>> PetscObjectTypeCompareAny on Pmat then it returns true. If you print the
>>> type of the matrices:
>>>
>>>
>>> localA seqaij
>>>
>>> Pmat mpiaijcusparse
>>>
>>> If you subsequently call MatSeqAIJCUSPARSEGetArrayRead on localA then it
>>> errors (presumably because of the type mismatch).
>>>
>>> If we call MatSeqAIJGetArrayRead on localA and then pass the `values` to
>>> AmgX, it seems to detect that the pointer is a device-mapped pointer but
>>> that it is invalid.
>>>
>>> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
>>> PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Seems
>>> to return invalid pointer, but I’ll investigate more
>>>
>>> This doesn’t reproduce if we call:
>>>
>>> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
>>> PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); //
>>> Pointer appears to be valid and we converge
>>>
>>> Essentially all I want to achieve is that when we are parallel, we fetch
>>> the local part of A and the device pointer to the matrix values from that
>>> structure so that we can pass it to AmgX, preferring whichever API calls
>>> are the most efficient.
>>>
>>>
>>> From: Stefano Zampini <stefano.zampini at gmail.com>
>>> Sent: 23 June 2022 20:55
>>> To: Mark Adams <mfadams at lbl.gov>
>>> Cc: Barry Smith <bsmith at petsc.dev>; For users of the development
>>> version of PETSc <petsc-dev at mcs.anl.gov>; Matthew Martineau
>>> <mmartineau at nvidia.com>
>>> Subject: Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs
>>>
>>>
>>> The logic is wrong. It should check for MATSEQAIJCUSPARSE.
>>>
>>> On Thu, Jun 23, 2022, 21:36 Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>
>>>
>>> On Thu, Jun 23, 2022 at 3:02 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>
>>>   It looks like the current code copies the nonzero values to the CPU
>>> from the MPI matrix (with the calls
>>> PetscCall(MatSeqAIJGetArrayRead(mpimat->A,&aav)); and
>>> PetscCall(MatSeqAIJGetArrayRead(mpimat->B,&bav));), then copies them
>>> into the CPU memory of the Seq matrix. When the matrix entries are next
>>> accessed on the GPU it should automatically copy them down to the GPU. So
>>> the code looks ok even for GPUs. We'll need to see the full error message
>>> with what the "invalid pointer" is.
>>>
>>>
>>> I showed Matt how to peek into offloadmask and he found that it is in the
>>> host state, but this is not the issue. The access method should do the copy
>>> to the device.
>>>
>>> I am thinking the logic here might be wrong. (Matt fixed "VEC" --> "MAT"
>>> in the comparison below).
>>>
>>> Matt, is the issue that you are calling MatSeqAIJCUSPARSEGetArrayRead and
>>> getting a host pointer?
>>>
>>> I think the state of amgx->localA after the call to
>>> MatSeqAIJCUSPARSEGetArrayRead should be "BOTH", because that call copies
>>> the data to the device, so both copies are valid and you should have
>>> device data.
>>>
>>>   PetscBool is_dev_ptrs;
>>>   PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, VECCUDA, VECMPICUDA, VECSEQCUDA, ""));
>>>
>>>   if (is_dev_ptrs) {
>>>     PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
>>>   } else {
>>>     PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
>>>   }
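>>>
>>> If that comparison is the problem, I would have expected it to check the
>>> AIJCUSPARSE Mat types instead of the CUDA Vec types, i.e. something like
>>> this (untested sketch):
>>>
>>>   /* Sketch: decide host vs. device access based on the Mat type of localA */
>>>   PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, MATAIJCUSPARSE, ""));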
>>>
>>>
>>>
>>>   Barry
>>>
>>>
>>>  Yes, this routine is terribly inefficient for GPU matrices; it needs to
>>> be specialized so it does not go through CPU memory, but that is a separate
>>> issue from there being bugs in the current code.
>>>
>>>   The code also seems to implicitly assume the parallel matrix has the
>>> same nonzero pattern with a reuse. This should be checked with each use by
>>> stashing the nonzero state of the matrix into the sequential matrix and
>>> making sure the parallel matrix has that same stashed value each time.
>>> Currently, if one changes the nonzero pattern of the parallel matrix, one
>>> is likely to get random confusing crashes due to memory corruption. But
>>> that is likely not the problem here.
>>>
>>>
>>> On Jun 23, 2022, at 2:23 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> We have a bug in the AMGx test snes_tests-ex13_amgx in parallel.
>>> Matt Martineau found that MatMPIAIJGetLocalMat worked in the first pass
>>> in the code below, where the local matrix is created (INITIAL), but in the
>>> next pass, when "REUSE" is used, he sees an invalid pointer.
>>> Matt found that it does have offloadmask == CPU.
>>> Maybe it is missing logic to put the output in the same state as the input?
>>>
>>> Any ideas on this or should I just dig into it?
>>>
>>> Thanks,
>>>
>>> bool partial_setup_allowed = (pc->setupcalled && pc->flag != DIFFERENT_NONZERO_PATTERN);
>>>
>>>   if (amgx->nranks > 1) {
>>>     if (partial_setup_allowed) {
>>>       PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA)); // This path doesn't seem to work by the time we reach the AmgX API
>>>     } else {
>>>       PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA)); // This path works
>>>     }
>>>   } else {
>>>     amgx->localA = Pmat;
>>>   }
>>>
>>>
>>>
>

