[petsc-users] cuda gpu eager initialization error cudaErrorNotSupported

Mark Lohry mlohry at gmail.com
Fri Jan 6 08:55:33 CST 2023


These cards indeed do not support cudaDeviceGetMemPool --
cudaDeviceGetAttribute on cudaDevAttrMemoryPoolsSupported returns false,
meaning they don't support cudaMallocAsync, so the first point of failure
is the call to cudaDeviceGetMemPool in the initialization.

Would a workaround be to replace the cudaMallocAsync call with cudaMalloc and
skip the mempool, or is that a bad idea?
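
A minimal sketch of the fallback I have in mind, not PETSc's actual code: query
memory-pool support once at initialization, then route every allocation through
either the stream-ordered path or plain cudaMalloc. To keep it compilable
without the CUDA toolkit, the runtime calls are passed in as callables; the
names DeviceApi, Allocator, make_allocator and device_alloc are hypothetical,
and the comments say which real CUDA calls each callable would wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-ins for the CUDA runtime (assumption: in real code these wrap
//   pools_supported -> cudaDeviceGetAttribute(&v, cudaDevAttrMemoryPoolsSupported, device)
//   alloc_async     -> cudaMallocAsync(ptr, bytes, stream)
//   alloc_sync      -> cudaMalloc(ptr, bytes)
// and return 0 where the runtime would return cudaSuccess).
struct DeviceApi {
  std::function<bool(int device)>                   pools_supported;
  std::function<int(void **ptr, std::size_t bytes)> alloc_async;
  std::function<int(void **ptr, std::size_t bytes)> alloc_sync;
};

enum class AllocPath { Async, Sync };

struct Allocator {
  DeviceApi api;
  AllocPath path;
};

// Decide the allocation path once, at context initialization. On a device
// without memory pools this never touches the mempool API at all, which is
// where the cudaErrorNotSupported above comes from.
inline Allocator make_allocator(const DeviceApi &api, int device)
{
  const AllocPath path = api.pools_supported(device) ? AllocPath::Async : AllocPath::Sync;
  return Allocator{api, path};
}

// Allocate through whichever path was selected at initialization.
inline int device_alloc(const Allocator &a, void **ptr, std::size_t bytes)
{
  assert(ptr != nullptr);
  return a.path == AllocPath::Async ? a.api.alloc_async(ptr, bytes) : a.api.alloc_sync(ptr, bytes);
}
```

One caveat with this approach: cudaMallocAsync is stream-ordered while
cudaMalloc is not, and as far as I know cudaFree can synchronize the device, so
the fallback may cost some performance, and the matching free has to follow the
same path as the allocation (cudaFreeAsync vs. cudaFree).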

On Fri, Jan 6, 2023 at 9:17 AM Mark Lohry <mlohry at gmail.com> wrote:

> It built+ran fine on a different system with an sm75 arch. Is there a
> documented minimum version if that indeed is the cause?
>
> One minor hiccup FYI -- compilation of hypre fails with cuda toolkit 12,
> due to cusparse removing csrsv2Info_t (although it's still referenced in
> their docs...) in favor of bsrsv2Info_t. Rolling back to cuda toolkit 11.8
> worked.
>
>
> On Thu, Jan 5, 2023 at 6:37 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Jacob, is it because the cuda arch is too old?
>>
>> --Junchao Zhang
>>
>>
>> On Thu, Jan 5, 2023 at 4:30 PM Mark Lohry <mlohry at gmail.com> wrote:
>>
>>> I'm seeing the same thing on latest main with a different machine and
>>> -sm52 card, cuda 11.8. make check fails with the below, where the indicated
>>> line 249 corresponds to PetscCallCUPM(cupmDeviceGetMemPool(&mempool,
>>> static_cast<int>(device->deviceId)));   in the initialize function.
>>>
>>>
>>> Running check examples to verify correct installation
>>> Using PETSC_DIR=/home/mlohry/dev/petsc and PETSC_ARCH=arch-linux-c-debug
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
>>> processes
>>> 2,17c2,46
>>> <   0 SNES Function norm 2.391552133017e-01
>>> <     0 KSP Residual norm 2.928487269734e-01
>>> <     1 KSP Residual norm 1.876489580142e-02
>>> <     2 KSP Residual norm 3.291394847944e-03
>>> <     3 KSP Residual norm 2.456493072124e-04
>>> <     4 KSP Residual norm 1.161647147715e-05
>>> <     5 KSP Residual norm 1.285648407621e-06
>>> <   1 SNES Function norm 6.846805706142e-05
>>> <     0 KSP Residual norm 2.292783790384e-05
>>> <     1 KSP Residual norm 2.100673631699e-06
>>> <     2 KSP Residual norm 2.121341386147e-07
>>> <     3 KSP Residual norm 2.455932678957e-08
>>> <     4 KSP Residual norm 1.753095730744e-09
>>> <     5 KSP Residual norm 7.489214418904e-11
>>> <   2 SNES Function norm 2.103908447865e-10
>>> < Number of SNES iterations = 2
>>> ---
>>> > [0]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> > [0]PETSC ERROR: GPU error
>>> > [0]PETSC ERROR: cuda error 801 (cudaErrorNotSupported) : operation not
>>> supported
>>> > [0]PETSC ERROR: WARNING! There are option(s) set that were not used!
>>> Could be the program crashed before they were used or a spelling mistake,
>>> etc!
>>> > [0]PETSC ERROR: Option left: name:-mg_levels_ksp_max_it value: 3
>>> source: command line
>>> > [0]PETSC ERROR: Option left: name:-nox (no value) source: environment
>>> > [0]PETSC ERROR: Option left: name:-nox_warning (no value) source:
>>> environment
>>> > [0]PETSC ERROR: Option left: name:-pc_gamg_esteig_ksp_max_it value: 10
>>> source: command line
>>> > [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.18.3-352-g91c56366cb  GIT Date: 2023-01-05 17:22:48 +0000
>>> > [0]PETSC ERROR: ./ex19 on a arch-linux-c-debug named osprey by mlohry
>>> Thu Jan  5 17:25:17 2023
>>> > [0]PETSC ERROR: Configure options --with-cuda --with-mpi=1
>>> > [0]PETSC ERROR: #1 initialize() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/impls/cupm/cuda/../cupmcontext.hpp:249
>>> > [0]PETSC ERROR: #2 PetscDeviceContextCreate_CUDA() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/impls/cupm/cuda/cupmcontext.cu:10
>>> > [0]PETSC ERROR: #3 PetscDeviceContextSetDevice_Private() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/interface/dcontext.cxx:247
>>> > [0]PETSC ERROR: #4
>>> PetscDeviceContextSetDefaultDeviceForType_Internal() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/interface/dcontext.cxx:260
>>> > [0]PETSC ERROR: #5 PetscDeviceContextSetupGlobalContext_Private() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/interface/global_dcontext.cxx:52
>>> > [0]PETSC ERROR: #6 PetscDeviceContextGetCurrentContext() at
>>> /home/mlohry/dev/petsc/src/sys/objects/device/interface/global_dcontext.cxx:84
>>> > [0]PETSC ERROR: #7 GetHandleDispatch_() at
>>> /home/mlohry/dev/petsc/include/petsc/private/veccupmimpl.h:499
>>> > [0]PETSC ERROR: #8 create() at
>>> /home/mlohry/dev/petsc/include/petsc/private/veccupmimpl.h:1069
>>> > [0]PETSC ERROR: #9 VecCreate_SeqCUDA() at
>>> /home/mlohry/dev/petsc/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.cu:10
>>> > [0]PETSC ERROR: #10 VecSetType() at
>>> /home/mlohry/dev/petsc/src/vec/vec/interface/vecreg.c:89
>>> > [0]PETSC ERROR: #11 DMCreateGlobalVector_DA() at
>>> /home/mlohry/dev/petsc/src/dm/impls/da/dadist.c:31
>>> > [0]PETSC ERROR: #12 DMCreateGlobalVector() at
>>> /home/mlohry/dev/petsc/src/dm/interface/dm.c:1023
>>> > [0]PETSC ERROR: #13 main() at ex19.c:149
>>>
>>>
>>> On Thu, Jan 5, 2023 at 3:42 PM Mark Lohry <mlohry at gmail.com> wrote:
>>>
>>>> I'm trying to compile the cuda example
>>>>
>>>> ./config/examples/arch-ci-linux-cuda-double-64idx.py
>>>> --with-cudac=/usr/local/cuda-11.5/bin/nvcc
>>>>
>>>> and running make test passes the test ok
>>>> diff-sys_objects_device_tests-ex1_host_with_device+nsize-1device_enable-lazy
>>>> but the eager variant fails, pasted below.
>>>>
>>>> I get a similar error running my client code, pasted after. There, when
>>>> running with -info, it seems that some lazy initialization happens first,
>>>> and I also call VecCreateSeqCUDA, which seems to have no issue.
>>>>
>>>> Any idea? This happens to be with an -sm 3.5 device if it matters,
>>>> otherwise it's a recent cuda compiler+driver.
>>>>
>>>>
>>>> petsc test code output:
>>>>
>>>>
>>>>
>>>> not ok
>>>> sys_objects_device_tests-ex1_host_with_device+nsize-1device_enable-eager #
>>>> Error code: 97
>>>> # [0]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> # [0]PETSC ERROR: GPU error
>>>> # [0]PETSC ERROR: cuda error 801 (cudaErrorNotSupported) : operation
>>>> not supported
>>>> # [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>>> shooting.
>>>> # [0]PETSC ERROR: Petsc Release Version 3.18.3, Dec 28, 2022
>>>> # [0]PETSC ERROR: ../ex1 on a  named lancer by mlohry Thu Jan  5
>>>> 15:22:33 2023
>>>> # [0]PETSC ERROR: Configure options
>>>> --package-prefix-hash=/home/mlohry/petsc-hash-pkgs --with-make-test-np=2
>>>> --download-openmpi=1 --download-hypre=1 --download-hwloc=1 COPTFLAGS="-g
>>>> -O" FOPTFLAGS="-g -O" CXXOPTFLAGS="-g -O" --with-64-bit-indices=1
>>>> --with-cuda=1 --with-precision=double --with-clanguage=c
>>>> --with-cudac=/usr/local/cuda-11.5/bin/nvcc
>>>> PETSC_ARCH=arch-ci-linux-cuda-double-64idx
>>>> # [0]PETSC ERROR: #1 CUPMAwareMPI_() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:194
>>>> # [0]PETSC ERROR: #2 initialize() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:71
>>>> # [0]PETSC ERROR: #3 init_device_id_() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:290
>>>> # [0]PETSC ERROR: #4 getDevice() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/../impls/host/../impldevicebase.hpp:99
>>>> # [0]PETSC ERROR: #5 PetscDeviceCreate() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/device.cxx:104
>>>> # [0]PETSC ERROR: #6 PetscDeviceInitializeDefaultDevice_Internal() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/device.cxx:375
>>>> # [0]PETSC ERROR: #7 PetscDeviceInitializeTypeFromOptions_Private() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/device.cxx:499
>>>> # [0]PETSC ERROR: #8 PetscDeviceInitializeFromOptions_Internal() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/device.cxx:634
>>>> # [0]PETSC ERROR: #9 PetscInitialize_Common() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/pinit.c:1001
>>>> # [0]PETSC ERROR: #10 PetscInitialize() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/pinit.c:1267
>>>> # [0]PETSC ERROR: #11 main() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/tests/ex1.c:12
>>>> # [0]PETSC ERROR: PETSc Option Table entries:
>>>> # [0]PETSC ERROR: -default_device_type host
>>>> # [0]PETSC ERROR: -device_enable eager
>>>> # [0]PETSC ERROR: ----------------End of Error Message -------send
>>>> entire error message to petsc-maint at mcs.anl.gov----------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> solver code output:
>>>>
>>>>
>>>>
>>>> [0] <sys> PetscDetermineInitialFPTrap(): Floating point trapping is off
>>>> by default 0
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private():
>>>> PetscDeviceType host available, initializing
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private(): PetscDevice
>>>> host initialized, default device id 0, view FALSE, init type lazy
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private():
>>>> PetscDeviceType cuda available, initializing
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private(): PetscDevice
>>>> cuda initialized, default device id 0, view FALSE, init type lazy
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private():
>>>> PetscDeviceType hip not available
>>>> [0] <sys> PetscDeviceInitializeTypeFromOptions_Private():
>>>> PetscDeviceType sycl not available
>>>> [0] <sys> PetscInitialize_Common(): PETSc successfully started: number
>>>> of processors = 1
>>>> [0] <sys> PetscGetHostName(): Rejecting domainname, likely is NIS
>>>> lancer.(none)
>>>> [0] <sys> PetscInitialize_Common(): Running on machine: lancer
>>>> # [Info] Petsc initialization complete.
>>>> # [Trace] Timing: Starting solver...
>>>> # [Info] RNG initial conditions have mean 0.000004, renormalizing.
>>>> # [Trace] Timing: PetscTimeIntegrator initialization...
>>>> # [Trace] Timing: Allocating Petsc CUDA arrays...
>>>> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 2 3 max tags
>>>> = 100000000
>>>> [0] <sys> configure(): Configured device 0
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 2 3
>>>> # [Trace] Timing: Allocating Petsc CUDA arrays finished in 0.015439
>>>> seconds.
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 2 3
>>>> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 1 4 max tags
>>>> = 100000000
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 1 4
>>>> [0] <dm> DMGetDMTS(): Creating new DMTS
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 1 4
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 1 4
>>>> [0] <dm> DMGetDMSNES(): Creating new DMSNES
>>>> [0] <dm> DMGetDMSNESWrite(): Copying DMSNES due to write
>>>> # [Info] Initializing petsc with ode23 integrator
>>>> # [Trace] Timing: PetscTimeIntegrator initialization finished in
>>>> 0.016754 seconds.
>>>>
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 1 4
>>>> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 1 4
>>>> [0] <device> PetscDeviceContextSetupGlobalContext_Private():
>>>> Initializing global PetscDeviceContext with device type cuda
>>>> [0]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [0]PETSC ERROR: GPU error
>>>> [0]PETSC ERROR: cuda error 801 (cudaErrorNotSupported) : operation not
>>>> supported
>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>>> shooting.
>>>> [0]PETSC ERROR: Petsc Release Version 3.18.3, Dec 28, 2022
>>>> [0]PETSC ERROR: maDG on a arch-linux2-c-opt named lancer by mlohry Thu
>>>> Jan  5 15:39:14 2023
>>>> [0]PETSC ERROR: Configure options
>>>> PETSC_DIR=/home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc
>>>> PETSC_ARCH=arch-linux2-c-opt --with-cc=/usr/bin/cc --with-cxx=/usr/bin/c++
>>>> --with-fc=0 --with-pic=1 --with-cxx-dialect=C++11 MAKEFLAGS=$MAKEFLAGS
>>>> COPTFLAGS="-O3 -march=native" CXXOPTFLAGS="-O3 -march=native" --with-mpi=0
>>>> --with-debugging=no --with-cudac=/usr/local/cuda-11.5/bin/nvcc
>>>> --with-cuda-arch=35 --with-cuda --with-cuda-dir=/usr/local/cuda-11.5/
>>>> --download-hwloc=1
>>>> [0]PETSC ERROR: #1 initialize() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cuda/../cupmcontext.hpp:255
>>>> [0]PETSC ERROR: #2 PetscDeviceContextCreate_CUDA() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cuda/cupmcontext.cu:10
>>>> [0]PETSC ERROR: #3 PetscDeviceContextSetDevice_Private() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/dcontext.cxx:244
>>>> [0]PETSC ERROR: #4 PetscDeviceContextSetDefaultDeviceForType_Internal()
>>>> at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/dcontext.cxx:259
>>>> [0]PETSC ERROR: #5 PetscDeviceContextSetupGlobalContext_Private() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/global_dcontext.cxx:52
>>>> [0]PETSC ERROR: #6 PetscDeviceContextGetCurrentContext() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/interface/global_dcontext.cxx:84
>>>> [0]PETSC ERROR: #7
>>>> PetscDeviceContextGetCurrentContextAssertType_Internal() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/include/petsc/private/deviceimpl.h:371
>>>> [0]PETSC ERROR: #8 PetscCUBLASGetHandle() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/sys/objects/device/impls/cupm/cuda/cupmcontext.cu:23
>>>> [0]PETSC ERROR: #9 VecMAXPY_SeqCUDA() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:261
>>>> [0]PETSC ERROR: #10 VecMAXPY() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/vec/vec/interface/rvector.c:1221
>>>> [0]PETSC ERROR: #11 TSStep_RK() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/ts/impls/explicit/rk/rk.c:814
>>>> [0]PETSC ERROR: #12 TSStep() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/ts/interface/ts.c:3424
>>>> [0]PETSC ERROR: #13 TSSolve() at
>>>> /home/mlohry/dev/maDGiCart-cmake-build-cuda-release/external/petsc/src/ts/interface/ts.c:3814
>>>>
>>>>
>>>>