[petsc-dev] Running with CUDA and CUDA_VISIBLE_DEVICES=-1

Barry Smith bsmith at petsc.dev
Mon Nov 1 11:00:51 CDT 2021


   PETSc could check for the environment variable CUDA_VISIBLE_DEVICES=-1, if that makes sense as a way to resolve the situation.
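
   Something along these lines, perhaps (untested sketch; the helper name is just a placeholder, and where exactly it would hook into device initialization is an open question):

     #include <petscsys.h>
     #include <stdlib.h>
     #include <string.h>

     /* Untested sketch: returns PETSC_TRUE when the user has explicitly hidden
        all CUDA devices via CUDA_VISIBLE_DEVICES=-1 (or an empty value).
        Device initialization could then record zero devices and stay on the
        CPU instead of erroring out. */
     static PetscBool CUDADevicesExplicitlyDisabled(void)
     {
       const char *env = getenv("CUDA_VISIBLE_DEVICES");

       if (!env) return PETSC_FALSE;                  /* not set: normal path           */
       if (env[0] == '\0') return PETSC_TRUE;         /* empty string hides all devices */
       if (strcmp(env, "-1") == 0) return PETSC_TRUE; /* -1 hides all devices           */
       return PETSC_FALSE;
     }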



> On Nov 1, 2021, at 11:43 AM, Jacob Faibussowitsch <jacob.fai at gmail.com> wrote:
> 
> Looks like you are tripping over the following check:
> 
> cerr = cupmGetDeviceCount(&ndev);
> if (PetscUnlikely(cerr == cupmErrorStubLibrary)) {
>   … // handle missing driver or stub library
> } else {CHKERRCUPM(cerr);} // your error here
> 
> Is it an error if a user configures with cuda (i.e. signals intent to use cuda) but then disables all the devices? On the one hand, yes, this could be considered an error if the user inadvertently disables the devices via this environment variable without realizing it; on the other hand, they should be able to set this variable freely without petsc crashing. Should we warn users? Handle this silently?
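
   For what it is worth, a rough sketch of the non-fatal alternative, mirroring the snippet above (I am assuming here that the CUPM layer has, or could be given, a cupmErrorNoDevice alias for cudaErrorNoDevice / hipErrorNoDevice; not tested):

     cerr = cupmGetDeviceCount(&ndev);
     if (PetscUnlikely(cerr == cupmErrorStubLibrary || cerr == cupmErrorNoDevice)) {
       ndev = 0; /* no usable devices: stub library, missing driver, or everything
                    hidden via CUDA_VISIBLE_DEVICES; fall back to CPU-only */
     } else {CHKERRCUPM(cerr);}

   That would make CUDA_VISIBLE_DEVICES=-1 behave like a machine with no GPUs rather than like an error.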
> 
> Note that petsc does provide the '-device_enable none' option to disable all devices, or, if you only want to disable cuda devices, '-device_enable_cuda none', which should achieve the same effect as CUDA_VISIBLE_DEVICES=-1. But maybe it is too obscure to ask users to know about and use these flags instead of setting the cuda env variables. (Btw, can you test that '-device_enable_cuda none' does not crash when CUDA_VISIBLE_DEVICES=-1 is also set?)
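
   Stefano, for Jacob's question the test would be something like (run from src/snes/tutorials; untested on my end):

     CUDA_VISIBLE_DEVICES=-1 ./ex19 -device_enable_cuda none

   i.e. the environment variable and the option set at the same time.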
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Nov 1, 2021, at 10:09, Stefano Zampini <stefano.zampini at gmail.com> wrote:
>> 
>> Just found out that if we configure with cuda and then want to run on CPU only using CUDA_VISIBLE_DEVICES=-1, PETSc errors out. Is this the intended behavior? I assumed it should work.
>> This is with main.
>> 
>> (ecrcml-cuda) zampins at qaysar:~/miniforge/Devel/petsc$ make check
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>> 
>> (ecrcml-cuda) zampins at qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> Completed test examples
>> 
>> (ecrcml-cuda) zampins at qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=-1
>> Running check examples to verify correct installation
>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and PETSC_ARCH=arch-ecrcml-cuda-double
>> Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>> See http://www.mcs.anl.gov/petsc/documentation/faq.html
>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [0]PETSC ERROR: GPU error 
>> [0]PETSC ERROR: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device is detected
>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-368-g72b201b202  GIT Date: 2021-10-29 14:48:19 +0300
>> [0]PETSC ERROR: ./ex19 on a arch-ecrcml-cuda-double named qaysar.kaust.edu.sa by zampins Mon Nov  1 18:06:12 2021
>> [0]PETSC ERROR: Configure options --with-blaslapack-include=/home/zampins/miniforge/envs/ecrcml-cuda/include --with-blaslapack-lib=/home/zampins/miniforge/envs/ecrcml-cuda/lib/libmkl_rt.so --download-h2opus --with-cuda --with-kblas-dir=/home/zampins/miniforge/envs/ecrcml-cuda --with-magma-dir=/home/zampins/miniforge/envs/ecrcml-cuda --LDFLAGS=/usr/lib/x86_64-linux-gnu/libcuda.so --with-debugging=1 --with-openmp --with-precision=double --with-fc=0 PETSC_ARCH=arch-ecrcml-cuda-double PETSC_DIR=/home/zampins/miniforge/Devel/petsc
>> [0]PETSC ERROR: #1 initialize() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:302
>> [0]PETSC ERROR: #2 PetscDeviceInitializeTypeFromOptions_Private() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:292
>> [0]PETSC ERROR: #3 PetscDeviceInitializeFromOptions_Internal() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:417
>> [0]PETSC ERROR: #4 PetscInitialize_Common() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:956
>> [0]PETSC ERROR: #5 PetscInitialize() at /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:1231
>> --------------------------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> 
> 


