[petsc-users] Cannot eagerly initialize cuda, as doing so results in cuda error 35 (cudaErrorInsufficientDriver) : CUDA driver version is insufficient for CUDA runtime version
Jed Brown
jed at jedbrown.org
Thu Jan 20 17:34:06 CST 2022
You can't create CUDA or Kokkos Vecs if you're running on a node without a GPU. The point of lazy initialization is to make it possible to run a solve that doesn't use a GPU in a PETSC_ARCH that supports GPUs, regardless of whether a GPU is actually present.
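One way to act on this from a job script, as a minimal sketch rather than anything PETSc prescribes: keep lazy initialization on everywhere, and request CUDA Vecs only when the node actually exposes a GPU. The has_gpu and petsc_options_for helpers are hypothetical names, and treating a successful `nvidia-smi -L` as proof of a usable GPU and driver is an assumption.

```shell
#!/bin/sh
# Sketch: choose PETSC_OPTIONS based on whether this node exposes a GPU.
# Assumption: `nvidia-smi -L` succeeding means a usable NVIDIA GPU and driver.

# has_gpu (hypothetical helper): succeed only if nvidia-smi exists and can list devices.
has_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

# petsc_options_for <yes|no> (hypothetical helper): print the option string
# for a node with or without a GPU.
petsc_options_for() {
    if [ "$1" = yes ]; then
        # GPU present: lazy init, and ask for CUDA vectors.
        printf '%s\n' '-device_enable lazy -vec_type cuda'
    else
        # No GPU: lazy init only, so no device is touched and Vecs stay on the host.
        printf '%s\n' '-device_enable lazy'
    fi
}

if has_gpu; then gpu=yes; else gpu=no; fi
PETSC_OPTIONS=$(petsc_options_for "$gpu")
export PETSC_OPTIONS
echo "PETSC_OPTIONS=$PETSC_OPTIONS"
```

With this, the same executable can run on a GPU-less node, because nothing requests a CUDA Vec there, while compute nodes with GPUs still pick up `-vec_type cuda`.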
Fande Kong <fdkong.jd at gmail.com> writes:
> I spoke too soon. It seems that we now have trouble creating cuda/kokkos
> vecs. I got a segmentation fault.
>
> Thanks,
>
> Fande
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002aaab5558b11 in
> Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize
> (this=0x1) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
> 54 PetscErrorCode CUPMDevice<T>::CUPMDeviceInternal::initialize() noexcept
> Missing separate debuginfos, use: debuginfo-install
> bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64
> elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64
> libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64
> libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64
> libibmad-5.4.0.MLNX20190423.1d917ae-0.1.49224.x86_64
> libibumad-43.1.1.MLNX20200211.078947f-0.1.49224.x86_64
> libibverbs-41mlnx1-OFED.4.9.0.0.7.49224.x86_64
> libmlx4-41mlnx1-OFED.4.7.3.0.3.49224.x86_64
> libmlx5-41mlnx1-OFED.4.9.0.1.2.49224.x86_64 libnl3-3.2.28-4.el7.x86_64
> librdmacm-41mlnx1-OFED.4.7.3.0.6.49224.x86_64
> librxe-41mlnx1-OFED.4.4.2.4.6.49224.x86_64 libxcb-1.13-1.el7.x86_64
> libxml2-2.9.1-6.el7_9.6.x86_64 numactl-libs-2.0.12-5.el7.x86_64
> systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64
> zlib-1.2.7-19.el7_9.x86_64
> (gdb) bt
> #0 0x00002aaab5558b11 in
> Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize
> (this=0x1) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
> #1 0x00002aaab5558db7 in
> Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::getDevice
> (this=this@entry=0x2aaab7f37b70
> <CUDADevice>, device=0x115da00, id=-35, id@entry=-1) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:344
> #2 0x00002aaab55577de in PetscDeviceCreate (type=type@entry=PETSC_DEVICE_CUDA,
> devid=devid@entry=-1, device=device@entry=0x2aaab7f37b48
> <defaultDevices+8>) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:107
> #3 0x00002aaab5557b3a in PetscDeviceInitializeDefaultDevice_Internal
> (type=type@entry=PETSC_DEVICE_CUDA, defaultDeviceId=defaultDeviceId@entry=-1)
> at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:273
> #4 0x00002aaab5557bf6 in PetscDeviceInitialize
> (type=type@entry=PETSC_DEVICE_CUDA)
> at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:234
> #5 0x00002aaab5661fcd in VecCreate_SeqCUDA (V=0x115d150) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:244
> #6 0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150,
> method=method@entry=0x2aaab70b45b8 "seqcuda") at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
> #7 0x00002aaab579c33f in VecCreate_CUDA (v=0x115d150) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu:214
> #8 0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150,
> method=method@entry=0x7fffffff9260 "cuda") at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
> #9 0x00002aaab5648bf1 in VecSetTypeFromOptions_Private (vec=0x115d150,
> PetscOptionsObject=0x7fffffff9210) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1263
> #10 VecSetFromOptions (vec=0x115d150) at
> /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1297
> #11 0x00002aaab02ef227 in libMesh::PetscVector<double>::init
> (this=0x11cd1a0, n=441, n_local=441, fast=false, ptype=libMesh::PARALLEL)
> at
> /home/kongf/workhome/sawtooth/moosegpu/scripts/../libmesh/installed/include/libmesh/petsc_vector.h:693
>
> On Thu, Jan 20, 2022 at 1:09 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>
>> Thanks, Jed,
>>
>> This worked!
>>
>> Fande
>>
>> On Wed, Jan 19, 2022 at 11:03 PM Jed Brown <jed at jedbrown.org> wrote:
>>
>>> Fande Kong <fdkong.jd at gmail.com> writes:
>>>
>>> > On Wed, Jan 19, 2022 at 11:39 AM Jacob Faibussowitsch <jacob.fai at gmail.com> wrote:
>>> >
>>> >> Are you running on login nodes or compute nodes (I can’t seem to tell from the configure.log)?
>>> >>
>>> >
>>> > I was compiling codes on login nodes, and running codes on compute nodes.
>>> > Login nodes do not have GPUs, but compute nodes do have GPUs.
>>> >
>>> > Just to be clear, the same thing (code, machine) with PETSc-3.16.1 worked
>>> > perfectly. I have this trouble with PETSc-main.
>>>
>>> I assume you can
>>>
>>> export PETSC_OPTIONS='-device_enable lazy'
>>>
>>> and it'll work.
>>>
>>> I think this should be the default. The main complaint is that timing the
>>> first GPU-using event isn't accurate if it includes initialization, but I
>>> think this is mostly hypothetical: you can't trust any timing that
>>> doesn't preload in some form, and the first GPU-using event will almost
>>> always be something uninteresting, so it will rarely lead to
>>> confusion. Meanwhile, eager initialization is viscerally disruptive for
>>> lots of people.
>>>
>>