[petsc-dev] MatMPIAIJGetLocalMat problem with GPUs

Mark Adams mfadams at lbl.gov
Fri Jun 24 07:51:02 CDT 2022


I am not seeing this response; I see my "hstack" comment last.
https://gitlab.com/petsc/petsc/-/merge_requests/4323

On Thu, Jun 23, 2022 at 4:37 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>  I have responded in the MR, which has all the context and the code.
> Please move this conversation from petsc-dev to the MR. Note you can use
> the little cartoon cloud symbol (upper right of the sub-window with my
> text) to reply to my post and keep everything in a thread for clarity.
>
>   We are confused because it seems you are trying a variety of things and
> we don't know how the different things you tried resulted in the multiple
> errors you reported.
>
>
>
> On Jun 23, 2022, at 3:59 PM, Matthew Martineau <mmartineau at nvidia.com>
> wrote:
>
> I checked in the changes and some debugging statements.
>
> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
> PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
>
> The comparison then returns false. If we instead call PetscObjectTypeCompareAny
> on Pmat, it returns true. If you print the types of the matrices:
>
>
> localA seqaij
>
> Pmat mpiaijcusparse
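>
> For reference, one way to print those types (a debugging sketch, not
> necessarily the code I used):
>
>   MatType ltype, ptype;
>   PetscCall(MatGetType(amgx->localA, &ltype));
>   PetscCall(MatGetType(Pmat, &ptype));
>   PetscCall(PetscPrintf(PETSC_COMM_WORLD, "localA %s, Pmat %s\n", ltype, ptype));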
>
> If you subsequently call MatSeqAIJCUSPARSEGetArrayRead on localA then it
> errors (presumably because of the type mismatch).
>
> If we call MatSeqAIJGetArrayRead on localA and then pass the `values` to
> AmgX, AmgX seems to detect that the pointer is a device-mapped pointer but
> that it is invalid.
>
> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
> PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Seems to return invalid pointer, but I’ll investigate more
>
> This doesn’t reproduce if we call:
>
> PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
> PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Pointer appears to be valid and we converge
>
> Essentially, all I want to achieve is that when we are parallel, we fetch
> the local part of A and the device pointer to the matrix values from that
> structure so that we can pass them to AmgX, preferring whichever API calls
> are the most efficient.
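>
> Roughly, the flow I have in mind is something like this (just a sketch; the
> type check and the host fallback are my assumptions, not the final plugin
> code):
>
>   const PetscScalar *values;
>   PetscBool          is_dev;
>
>   PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
>   /* branch on the type of the *local* matrix, not Pmat */
>   PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev, MATSEQAIJCUSPARSE, ""));
>   if (is_dev) {
>     /* device pointer, can be handed straight to AmgX */
>     PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &values));
>   } else {
>     /* host pointer; PETSc mirrors it to the device on the next GPU access */
>     PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &values));
>   }
>   /* plus the matching MatSeqAIJCUSPARSERestoreArrayRead /
>      MatSeqAIJRestoreArrayRead once AmgX is done with the values */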
>
>
> *From:* Stefano Zampini <stefano.zampini at gmail.com>
> *Sent:* 23 June 2022 20:55
> *To:* Mark Adams <mfadams at lbl.gov>
> *Cc:* Barry Smith <bsmith at petsc.dev>; For users of the development
> version of PETSc <petsc-dev at mcs.anl.gov>; Matthew Martineau <
> mmartineau at nvidia.com>
> *Subject:* Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs
>
> The logic is wrong. It should check for MATSEQAIJCUSPARSE.
>
> On Thu, Jun 23, 2022, 21:36 Mark Adams <mfadams at lbl.gov> wrote:
>
>
>
> On Thu, Jun 23, 2022 at 3:02 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>
>   It looks like the current code copies the nonzero values to the CPU from
> the MPI matrix (with the calls
> PetscCall(MatSeqAIJGetArrayRead(mpimat->A,&aav)); and
> PetscCall(MatSeqAIJGetArrayRead(mpimat->B,&bav));), then copies them into
> the CPU memory of the Seq matrix. When the matrix entries are next accessed
> on the GPU, it should automatically copy them down to the GPU. So the code
> looks ok even for GPUs. We'll need to see the full error message with what
> the "invalid pointer" is.
>
>
> I showed Matt how to peek into offloadmask and he found that it is in the
> host (CPU) state, but this is not the issue. The access method should do the
> copy to the device.
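>
> For reference, the peek is just reading the private field (debugging only;
> it needs the private header and is only a sketch, not something to keep in
> the plugin):
>
>   /* requires #include <petsc/private/matimpl.h> */
>   PetscCall(PetscPrintf(PETSC_COMM_SELF, "localA offloadmask = %d\n", (int)amgx->localA->offloadmask));
>   /* PETSC_OFFLOAD_CPU means only the host copy is current; a device accessor
>      such as MatSeqAIJCUSPARSEGetArrayRead should then trigger the copy to the GPU */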
>
> I am thinking the logic here might be wrong. (Matt fixed "VEC" --> "MAT"
> in the comparison below).
>
> Matt, is the issue that you are calling  *MatSeqAIJCUSPARSEGetArrayRead* and
> getting a host pointer?
>
> I think the state of amgx->localA after the call to
> MatSeqAIJCUSPARSEGetArrayRead should be "BOTH" because this copies the data
> to the device, so both copies are valid and you should have device data.
>
> PetscBool is_dev_ptrs;
> PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, VECCUDA, VECMPICUDA, VECSEQCUDA, ""));
>
> if (is_dev_ptrs) {
>   PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
> } else {
>   PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
> }
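>
> A sketch of what the fixed comparison would look like with the matrix type
> names (just a sketch; matching only the sequential CUSPARSE type, since
> localA is the local sequential block):
>
>   PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATSEQAIJCUSPARSE, ""));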
>
>
>
>   Barry
>
>
>  Yes, this routine is terribly inefficient for GPU matrices; it needs to be
> specialized so it does not stage everything through CPU memory, but that is
> a separate issue from there being bugs in the current code.
>
>   The code also seems to implicitly assume the parallel matrix has the
> same nonzero pattern on a reuse. This should be checked on each use by
> stashing the nonzero state of the parallel matrix into the sequential matrix
> and making sure the parallel matrix still has that same stashed value each
> time. Currently, if one changes the nonzero pattern of the parallel matrix,
> one is likely to get random, confusing crashes due to memory corruption. But
> that is likely not the problem here.
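>
> A sketch of that kind of guard (the localA_nzstate field here is
> hypothetical, for illustration only; the real fix would stash the state
> inside the sequential matrix itself):
>
>   PetscObjectState nzstate;
>
>   PetscCall(MatGetNonzeroState(Pmat, &nzstate));
>   if (amgx->localA && nzstate == amgx->localA_nzstate) {
>     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
>   } else {
>     /* nonzero pattern changed (or first pass): rebuild the local matrix */
>     PetscCall(MatDestroy(&amgx->localA));
>     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
>     amgx->localA_nzstate = nzstate;
>   }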
>
>
> On Jun 23, 2022, at 2:23 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> We have a bug in the AMGx test snes_tests-ex13_amgx in parallel.
> Matt Martineau found that MatMPIAIJGetLocalMat worked in the first pass in
> the code below, where the local matrix is created (INITIAL), but in the
> next pass, when "REUSE" is used, he sees an invalid pointer.
> Matt found that it does have offloadmask == CPU.
> Maybe it is missing logic to put the output in the same state as the input?
>
> Any ideas on this or should I just dig into it?
>
> Thanks,
>
> bool partial_setup_allowed = (pc->setupcalled && pc->flag != DIFFERENT_NONZERO_PATTERN);
>
> if (amgx->nranks > 1) {
>   if (partial_setup_allowed) {
>     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA)); // This path doesn't seem to work by the time we reach the AmgX API
>   } else {
>     PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA)); // This path works
>   }
> } else {
>   amgx->localA = Pmat;
> }
>
>
>