[petsc-users] Strange CUDA failure with a second petscfinalize with PETSc 3.16

Junchao Zhang junchao.zhang at gmail.com
Fri Jan 14 19:58:28 CST 2022


Jacob,
   Could you have a look? It seems the "invalid device context" error comes
from your newly added module.
   Thanks
--Junchao Zhang


On Fri, Jan 14, 2022 at 12:49 AM Hao DONG <dong-hao at outlook.com> wrote:

> Dear All,
>
>
>
> I have encountered a peculiar problem when fiddling with a code with PETSc
> 3.16.3 (which worked fine with PETSc 3.15). It is a very straightforward
> PDE-based optimization code which repeatedly solves a linearized PDE
> problem with KSP in a subroutine (the rest of the code does not contain any
> PETSc-related content). The main program provides the subroutine with an
> MPI comm. I then set that comm as PETSC_COMM_WORLD to tell PETSc to attach
> to it (and detach from it each time the solve is finished).
>
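> For reference, the attach/detach pattern in the subroutine is roughly the
> following (a simplified sketch with made-up names, not my actual code):
>
>          subroutine solve_pde(comm, ierr)
> #include <petsc/finclude/petscksp.h>
>            use petscksp
>            implicit none
>            integer        :: comm   ! MPI communicator handle from the main program
>            PetscErrorCode :: ierr
>
>            PETSC_COMM_WORLD = comm                           ! attach to the caller's comm
>            call PetscInitialize(PETSC_NULL_CHARACTER, ierr)
>            ! ... set up the linearized system and solve it with KSP ...
>            call PetscFinalize(ierr)                          ! detach when the solve is done
>          end subroutine solve_pde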
>
>
> Strangely, I observe a CUDA failure whenever PetscFinalize is called for a
> *second* time. In other words, the first and second PDE calculations with
> the GPU are fine (with correct solutions); the PETSc code only fails after
> the SECOND PetscFinalize call. You can also see the PETSc configuration in
> the error message:
>
>
>
> [1]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
>
> [1]PETSC ERROR: GPU error
>
> [1]PETSC ERROR: cuda error 201 (cudaErrorDeviceUninitialized) : invalid
> device context
>
> [1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>
> [1]PETSC ERROR: Petsc Release Version 3.16.3, unknown
>
> [1]PETSC ERROR: maxwell.gpu on a  named stratosphere by hao Fri Jan 14
> 10:21:05 2022
>
> [1]PETSC ERROR: Configure options
> --prefix=/opt/petsc/complex-double-with-cuda --with-cc=mpicc
> --with-cxx=mpicxx --with-fc=mpif90 COPTFLAGS="-O3 -mavx2" CXXOPTFLAGS="-O3
> -mavx2" FOPTFLAGS="-O3 -ffree-line-length-none -mavx2" CUDAOPTFLAGS=-O3
> --with-cxx-dialect=cxx14 --with-cuda-dialect=cxx14
> --with-scalar-type=complex --with-precision=double
> --with-cuda-dir=/usr/local/cuda --with-debugging=1
>
> [1]PETSC ERROR: #1 PetscFinalize() at
> /home/hao/packages/petsc-current/src/sys/objects/pinit.c:1638
>
> You might have forgotten to call PetscInitialize().
>
> The EXACT line numbers in the error traceback are not available.
>
> Instead the line number of the start of the function is given.
>
> [1] #1 PetscAbortFindSourceFile_Private() at
> /home/hao/packages/petsc-current/src/sys/error/err.c:35
>
> [1] #2 PetscLogGetStageLog() at
> /home/hao/packages/petsc-current/src/sys/logging/utils/stagelog.c:29
>
> [1] #3 PetscClassIdRegister() at
> /home/hao/packages/petsc-current/src/sys/logging/plog.c:2376
>
> [1] #4 MatMFFDInitializePackage() at
> /home/hao/packages/petsc-current/src/mat/impls/mffd/mffd.c:45
>
> [1] #5 MatInitializePackage() at
> /home/hao/packages/petsc-current/src/mat/interface/dlregismat.c:163
>
> [1] #6 MatCreate() at
> /home/hao/packages/petsc-current/src/mat/utils/gcreate.c:77
>
>
>
> However, it doesn’t seem to affect the other parts of my code, so the code
> can continue running until it reaches the PETSc part again (the **third**
> time). Unfortunately, it doesn’t give me any further information even
> though debugging is enabled (--with-debugging=1) in the configure options.
> It is also worth noting that PETSc without CUDA (i.e. with plain MATMPIAIJ)
> works perfectly fine.
>
>
>
> I am able to reproduce the problem with a toy code modified from ex11f.
> Please see the attached file (ex11fc.F90) for details. Essentially the code
> does the same thing as ex11f, but three times in a do loop. To make that
> work I added an extra MPI_INIT/MPI_FINALIZE pair so that the MPI
> communicator is not destroyed when PETSC_FINALIZE is called. I used the
> PetscOptionsHasName utility to check whether “-usecuda” is in the options,
> so running the code with and without that option gives a comparison with
> and without CUDA. The toy code likewise fails after the second loop of the
> KSP operation. Could you kindly shed some light on this problem?
>
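> Schematically, the driver loop looks like the sketch below (simplified;
> the actual assembly and solve are as in ex11f):
>
>    program ex11fc_sketch
> #include <petsc/finclude/petscksp.h>
>      use petscksp
>      implicit none
>      PetscErrorCode :: ierr
>      PetscBool      :: usecuda
>      PetscInt       :: i
>
>      call MPI_Init(ierr)                   ! keep MPI alive across the loop
>      do i = 1, 3
>         call PetscInitialize(PETSC_NULL_CHARACTER, ierr)
>         call PetscOptionsHasName(PETSC_NULL_OPTIONS, PETSC_NULL_CHARACTER, &
>              '-usecuda', usecuda, ierr)
>         ! ... assemble the system and run the KSP solve, as in ex11f ...
>         call PetscFinalize(ierr)           ! the error appears after the 2nd call
>      end do
>      call MPI_Finalize(ierr)
>    end program ex11fc_sketch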
>
>
> I should say that I am not even sure the problem is from PETSc, as I also
> accidentally updated the NVIDIA driver (it is now 510.06 with CUDA 11.6),
> and it is well known that NVIDIA updates can bring surprises (yes, I know I
> shouldn’t have touched it if it wasn’t broken). But my CUDA code without
> PETSc (which does basically the same PDE work, but calls cusparse/cublas
> directly) seems to work just fine after the update. It is also possible
> that my PETSc code related to CUDA is not quite “legitimate”; I just use:
>
>           call MatSetType(A, MATMPIAIJCUSPARSE, ierr)
>
> and
>
>           call MatCreateVecs(A, u, PETSC_NULL_VEC, ierr)
>
> to move the data onto the GPU. I would very much appreciate it if you could
> show me the “right” way to do that.
>
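> For context, the surrounding setup is roughly the following (a simplified
> sketch; sizes, assembly, and error checks are omitted):
>
>           call MatCreate(PETSC_COMM_WORLD, A, ierr)
>           call MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n, ierr)
>           call MatSetType(A, MATMPIAIJCUSPARSE, ierr)      ! CUSPARSE storage
>           call MatSetUp(A, ierr)
>           ! ... MatSetValues / MatAssemblyBegin / MatAssemblyEnd ...
>           call MatCreateVecs(A, u, PETSC_NULL_VEC, ierr)   ! u inherits the GPU type from A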
>
>
> Thanks a lot in advance, and all the best,
>
> Hao
>

