[petsc-users] Cannot eagerly initialize cuda, as doing so results in cuda error 35 (cudaErrorInsufficientDriver) : CUDA driver version is insufficient for CUDA runtime version
Fande Kong
fdkong.jd at gmail.com
Wed Jan 26 14:49:45 CST 2022
Yes, please see the attached file.
Fande
On Wed, Jan 26, 2022 at 11:49 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:
> Do you have the configure.log with main?
>
> --Junchao Zhang
>
>
> On Wed, Jan 26, 2022 at 12:26 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>
>> I am on petsc-main:
>>
>> commit 1390d3a27d88add7d79c9b38bf1a895ae5e67af6
>> Merge: 96c919c d5f3255
>> Author: Satish Balay <balay at mcs.anl.gov>
>> Date:   Wed Jan 26 10:28:32 2022 -0600
>>
>>     Merge remote-tracking branch 'origin/release'
>>
>>
>> It is still broken.
>>
>> Thanks,
>>
>>
>> Fande
>>
>> On Wed, Jan 26, 2022 at 7:40 AM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>> The good build uses the compiler's default library/header paths. The bad
>>> one searches the CUDA toolkit path and uses rpath linking.
>>> Though the paths look the same on the login node, they could behave
>>> differently on a compute node, depending on its environment.
>>> I think we fixed the issue in cuda.py (i.e., first try the compiler's
>>> defaults, then the toolkit). That's why I wanted Fande to use petsc/main.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Tue, Jan 25, 2022 at 11:59 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>> The bad build has an extra
>>>>
>>>> -L/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs
>>>> -lcuda
>>>>
>>>> that the good one does not.
>>>>
>>>> Try removing the stubs directory and -lcuda from the bad
>>>> $PETSC_ARCH/lib/petsc/conf/variables and likely the bad will start working.
>>>>
>>>> Barry
>>>>
>>>> I never liked the stubs stuff.
>>>>
>>>> On Jan 25, 2022, at 11:29 PM, Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>
>>>> Hi Junchao,
>>>>
>>>> I attached a "bad" configure log and a "good" configure log.
>>>>
>>>> The "bad" one was on produced
>>>> at 246ba74192519a5f34fb6e227d1c64364e19ce2c
>>>>
>>>> and the "good" one at 384645a00975869a1aacbd3169de62ba40cad683
>>>>
>>>> This good hash is the last good hash that is just the right before the
>>>> bad one.
>>>>
>>>> I think you could do a comparison between these two logs, and check
>>>> what the differences were.
>>>>
>>>> Thanks,
>>>>
>>>> Fande
>>>>
>>>> On Tue, Jan 25, 2022 at 8:21 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> Fande, could you send the configure.log that works (i.e., before this
>>>>> offending commit)?
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Tue, Jan 25, 2022 at 8:21 PM Fande Kong <fdkong.jd at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Not sure if this is helpful. I did "git bisect", and here was the
>>>>>> result:
>>>>>>
>>>>>> [kongf at sawtooth2 petsc]$ git bisect bad
>>>>>> 246ba74192519a5f34fb6e227d1c64364e19ce2c is the first bad commit
>>>>>> commit 246ba74192519a5f34fb6e227d1c64364e19ce2c
>>>>>> Author: Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>> Date: Wed Oct 13 05:32:43 2021 +0000
>>>>>>
>>>>>> Config: fix CUDA library and header dirs
>>>>>>
>>>>>> :040000 040000 187c86055adb80f53c1d0565a8888704fec43a96 ea1efd7f594fd5e8df54170bc1bc7b00f35e4d5f M config
>>>>>>
>>>>>>
>>>>>> Starting from this commit, GPUs did not work for me on our HPC system.
>>>>>>
>>>>>> Thanks,
>>>>>> Fande
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 7:18 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 25, 2022 at 9:04 AM Jacob Faibussowitsch <
>>>>>>> jacob.fai at gmail.com> wrote:
>>>>>>>
>>>>>>>> Configure should not have an impact here, I think. The reason I had
>>>>>>>> you run `cudaGetDeviceCount()` is that this is the CUDA call (and in
>>>>>>>> fact the only CUDA call) in the initialization sequence that returns the
>>>>>>>> error code. There should be no prior CUDA calls. Maybe this is a problem
>>>>>>>> with oversubscribing GPUs? In the runs that crash, how many ranks are
>>>>>>>> using any given GPU at once? Maybe MPS is required.
>>>>>>>>
>>>>>>>
>>>>>>> I used one MPI rank.
>>>>>>>
>>>>>>> Fande
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Jacob Faibussowitsch
>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>
>>>>>>>> On Jan 21, 2022, at 12:01, Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks Jacob,
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2022 at 6:25 PM Jacob Faibussowitsch <
>>>>>>>> jacob.fai at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Segfault is caused by the following check at
>>>>>>>>> src/sys/objects/device/impls/cupm/cupmdevice.cxx:349 being a
>>>>>>>>> PetscUnlikelyDebug() rather than just PetscUnlikely():
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> if (PetscUnlikelyDebug(_defaultDevice < 0)) { // _defaultDevice is in fact < 0 here and uncaught
>>>>>>>>> ```
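>>>>>>>>>
>>>>>>>>> (A minimal sketch of why the debug variant matters; these are not
>>>>>>>>> PETSc's actual macro definitions, just the usual convention that a
>>>>>>>>> debug-only check compiles away in optimized builds:)
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> #include <stdio.h>
>>>>>>>>>
>>>>>>>>> /* Sketch only: in optimized builds the debug check vanishes. */
>>>>>>>>> #if defined(PETSC_USE_DEBUG)
>>>>>>>>> #define UNLIKELY_DEBUG(cond) (cond)
>>>>>>>>> #else
>>>>>>>>> #define UNLIKELY_DEBUG(cond) (0) /* never taken when optimized */
>>>>>>>>> #endif
>>>>>>>>>
>>>>>>>>> int main(void)
>>>>>>>>> {
>>>>>>>>>   int defaultDevice = -35; /* stored -cudaErrorInsufficientDriver */
>>>>>>>>>   if (UNLIKELY_DEBUG(defaultDevice < 0)) {
>>>>>>>>>     fprintf(stderr, "caught invalid device id %d\n", defaultDevice);
>>>>>>>>>     return 1;
>>>>>>>>>   }
>>>>>>>>>   /* Optimized build: we fall through with the bad value, as in the
>>>>>>>>>      reported segfault. */
>>>>>>>>>   printf("proceeding with device %d\n", defaultDevice);
>>>>>>>>>   return 0;
>>>>>>>>> }
>>>>>>>>> ```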
>>>>>>>>>
>>>>>>>>> To clarify:
>>>>>>>>>
>>>>>>>>> “lazy” initialization is not that lazy after all; it still does
>>>>>>>>> some 50% of the initialization that “eager” initialization does. It stops
>>>>>>>>> short of initializing the CUDA runtime, checking CUDA-aware MPI, gathering
>>>>>>>>> device data, and initializing cuBLAS and friends. Importantly, lazy
>>>>>>>>> initialization also swallows any errors that crop up along the way,
>>>>>>>>> storing the resulting error code for later (specifically _defaultDevice =
>>>>>>>>> -init_error_value;).
>>>>>>>>>
>>>>>>>>> So whether you initialize lazily or eagerly makes no difference
>>>>>>>>> here, as _defaultDevice will always contain -35.
>>>>>>>>>
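>>>>>>>>> (Schematically, the lazy path looks like the following minimal
>>>>>>>>> sketch; this is not PETSc's actual code — only the _defaultDevice
>>>>>>>>> name and the -init_error_value convention come from the source:)
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>
>>>>>>>>> static int _defaultDevice = -1; /* < 0: uninitialized or failed */
>>>>>>>>>
>>>>>>>>> /* Lazy initialization: swallow the error, remember it for later. */
>>>>>>>>> static void lazyInitialize(void)
>>>>>>>>> {
>>>>>>>>>   int         ndev = 0;
>>>>>>>>>   cudaError_t err  = cudaGetDeviceCount(&ndev);
>>>>>>>>>   /* On failure this stores e.g. -35 (cudaErrorInsufficientDriver),
>>>>>>>>>      which every later check then sees, lazy or eager. */
>>>>>>>>>   _defaultDevice = (err == cudaSuccess) ? 0 : -(int)err;
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>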
>>>>>>>>> The bigger question is why cudaGetDeviceCount() is returning
>>>>>>>>> cudaErrorInsufficientDriver. Can you compile and run
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>
>>>>>>>>> int main()
>>>>>>>>> {
>>>>>>>>> int ndev;
>>>>>>>>> return cudaGetDeviceCount(&ndev);
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Then show the output of "echo $?"?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I modified your code a little to get more information:
>>>>>>>>
>>>>>>>> #include <cuda_runtime.h>
>>>>>>>> #include <cstdio>
>>>>>>>>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>   int ndev;
>>>>>>>>   int error = cudaGetDeviceCount(&ndev);
>>>>>>>>   printf("ndev %d \n", ndev);
>>>>>>>>   printf("error %d \n", error);
>>>>>>>>   return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Results:
>>>>>>>>
>>>>>>>> $ ./a.out
>>>>>>>> ndev 4
>>>>>>>> error 0
>>>>>>>>
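>>>>>>>> (Side note: since error 35 means the driver is older than the
>>>>>>>> runtime the binary was built against, the two versions can be printed
>>>>>>>> directly; a small sketch using the standard CUDA runtime calls:)
>>>>>>>>
>>>>>>>> #include <cuda_runtime.h>
>>>>>>>> #include <cstdio>
>>>>>>>>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>   int driverVersion = 0, runtimeVersion = 0;
>>>>>>>>   cudaDriverGetVersion(&driverVersion);   // installed driver, e.g. 10010
>>>>>>>>   cudaRuntimeGetVersion(&runtimeVersion); // runtime the binary links
>>>>>>>>   printf("driver %d runtime %d\n", driverVersion, runtimeVersion);
>>>>>>>>   // cudaErrorInsufficientDriver is returned when driver < runtime.
>>>>>>>>   return 0;
>>>>>>>> }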
>>>>>>>>
>>>>>>>> I have not read the PETSc cuda initialization code yet. If I had to
>>>>>>>> guess at what was happening, I would naively think that PETSc did not get
>>>>>>>> correct GPU information during configuration because the compile node does
>>>>>>>> not have GPUs, so there was no way to get any GPU device information.
>>>>>>>>
>>>>>>>>
>>>>>>>> At runtime on the GPU nodes, PETSc might then use the incorrect
>>>>>>>> information grabbed during configuration and produce this kind of false
>>>>>>>> error message.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Fande
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>
>>>>>>>>> On Jan 20, 2022, at 17:47, Matthew Knepley <knepley at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2022 at 6:44 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Jed
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 20, 2022 at 4:34 PM Jed Brown <jed at jedbrown.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> You can't create CUDA or Kokkos Vecs if you're running on a node
>>>>>>>>>>> without a GPU.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am running the code on compute nodes that do have GPUs.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you are actually running on GPUs, why would you need lazy
>>>>>>>>> initialization? It would not break with GPUs present.
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> With PETSc-3.16.1, I got good speedups by running GAMG on GPUs.
>>>>>>>>>> This might be a bug in PETSc-main.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Fande
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> KSPSetUp              13 1.0 6.4400e-01 1.0 2.02e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 5 0 0 0 0 5 0 0 0 3140 64630 15 1.05e+02 5 3.49e+01 100
>>>>>>>>>> KSPSolve               1 1.0 1.0109e+00 1.0 3.49e+10 1.0 0.0e+00 0.0e+00 0.0e+00 0 87 0 0 0 0 87 0 0 0 34522 69556 4 4.35e-03 1 2.38e-03 100
>>>>>>>>>> KSPGMRESOrthog       142 1.0 1.2674e-01 1.0 1.06e+10 1.0 0.0e+00 0.0e+00 0.0e+00 0 27 0 0 0 0 27 0 0 0 83755 87801 0 0.00e+00 0 0.00e+00 100
>>>>>>>>>> SNESSolve              1 1.0 4.4402e+01 1.0 4.00e+10 1.0 0.0e+00 0.0e+00 0.0e+00 21 100 0 0 0 21 100 0 0 0 901 51365 57 1.10e+03 52 8.78e+02 100
>>>>>>>>>> SNESSetUp              1 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> SNESFunctionEval       2 1.0 1.7097e+01 1.0 1.60e+07 1.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 1 0 0 0.00e+00 6 1.92e+02 0
>>>>>>>>>> SNESJacobianEval       1 1.0 1.6213e+01 1.0 2.80e+07 1.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 2 0 0 0.00e+00 1 3.20e+01 0
>>>>>>>>>> SNESLineSearch         1 1.0 8.5582e+00 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 14 64153 1 3.20e+01 3 9.61e+01 94
>>>>>>>>>> PCGAMGGraph_AGG        5 1.0 3.0509e+00 1.0 8.19e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 27 0 5 3.49e+01 9 7.43e+01 0
>>>>>>>>>> PCGAMGCoarse_AGG       5 1.0 3.8711e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> PCGAMGProl_AGG         5 1.0 7.0748e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> PCGAMGPOpt_AGG         5 1.0 1.2904e+00 1.0 2.14e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 1661 29807 26 7.15e+02 20 2.90e+02 99
>>>>>>>>>> GAMG: createProl       5 1.0 8.9489e+00 1.0 2.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 4 6 0 0 0 4 6 0 0 0 249 29666 31 7.50e+02 29 3.64e+02 96
>>>>>>>>>> Graph                 10 1.0 3.0478e+00 1.0 8.19e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 27 0 5 3.49e+01 9 7.43e+01 0
>>>>>>>>>> MIS/Agg                5 1.0 4.1290e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> SA: col data           5 1.0 1.9127e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> SA: frmProl0           5 1.0 6.2662e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> SA: smooth             5 1.0 4.9595e-01 1.0 1.21e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 244 2709 15 1.97e+02 15 2.55e+02 90
>>>>>>>>>> GAMG: partLevel        5 1.0 4.7330e-01 1.0 6.98e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 1475 4120 5 1.78e+02 10 2.55e+02 100
>>>>>>>>>> PCGAMG Squ l00         1 1.0 2.6027e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>>>>>>>> PCGAMG Gal l00         1 1.0 3.8406e-01 1.0 5.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1426 4270 1 1.48e+02 2 2.11e+02 100
>>>>>>>>>> PCGAMG Opt l00         1 1.0 2.4932e-01 1.0 7.20e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 289 2653 1 6.41e+01 1 1.13e+02 100
>>>>>>>>>> PCGAMG Gal l01         1 1.0 6.6279e-02 1.0 1.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1645 3851 1 2.40e+01 2 3.64e+01 100
>>>>>>>>>> PCGAMG Opt l01         1 1.0 2.9544e-02 1.0 7.15e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 242 1671 1 4.84e+00 1 1.23e+01 100
>>>>>>>>>> PCGAMG Gal l02         1 1.0 1.8874e-02 1.0 3.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1974 3636 1 5.04e+00 2 6.58e+00 100
>>>>>>>>>> PCGAMG Opt l02         1 1.0 7.4353e-03 1.0 2.40e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 323 1457 1 7.71e-01 1 2.30e+00 100
>>>>>>>>>> PCGAMG Gal l03         1 1.0 2.8479e-03 1.0 4.10e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1440 2266 1 4.44e-01 2 5.51e-01 100
>>>>>>>>>> PCGAMG Opt l03         1 1.0 8.2684e-04 1.0 2.80e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 339 1667 1 6.72e-02 1 2.03e-01 100
>>>>>>>>>> PCGAMG Gal l04         1 1.0 1.2238e-03 1.0 2.09e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 170 244 1 2.05e-02 2 2.53e-02 100
>>>>>>>>>> PCGAMG Opt l04         1 1.0 4.1008e-04 1.0 1.77e+04 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 165 1 4.49e-03 1 1.19e-02 100
>>>>>>>>>> PCSetUp                2 1.0 9.9632e+00 1.0 4.95e+09 1.0 0.0e+00 0.0e+00 0.0e+00 5 12 0 0 0 5 12 0 0 0 496 17826 55 1.03e+03 45 6.54e+02 98
>>>>>>>>>> PCSetUpOnBlocks       44 1.0 9.9087e-04 1.0 2.88e+03 1.0
>>>>>>>>>>
>>>>>>>>>>> The point of lazy initialization is to make it possible to run a
>>>>>>>>>>> solve that doesn't use a GPU in PETSC_ARCH that supports GPUs, regardless
>>>>>>>>>>> of whether a GPU is actually present.
>>>>>>>>>>>
>>>>>>>>>>> Fande Kong <fdkong.jd at gmail.com> writes:
>>>>>>>>>>>
>>>>>>>>>>> > I spoke too soon. It seems that we have trouble creating cuda/kokkos vecs now. I got a segmentation fault.
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> >
>>>>>>>>>>> > Fande
>>>>>>>>>>> >
>>>>>>>>>>> > Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>> > 0x00002aaab5558b11 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize (this=0x1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
>>>>>>>>>>> > 54 PetscErrorCode CUPMDevice<T>::CUPMDeviceInternal::initialize() noexcept
>>>>>>>>>>> > Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libibmad-5.4.0.MLNX20190423.1d917ae-0.1.49224.x86_64 libibumad-43.1.1.MLNX20200211.078947f-0.1.49224.x86_64 libibverbs-41mlnx1-OFED.4.9.0.0.7.49224.x86_64 libmlx4-41mlnx1-OFED.4.7.3.0.3.49224.x86_64 libmlx5-41mlnx1-OFED.4.9.0.1.2.49224.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-41mlnx1-OFED.4.7.3.0.6.49224.x86_64 librxe-41mlnx1-OFED.4.4.2.4.6.49224.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
>>>>>>>>>>> > (gdb) bt
>>>>>>>>>>> > #0  0x00002aaab5558b11 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize (this=0x1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
>>>>>>>>>>> > #1  0x00002aaab5558db7 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::getDevice (this=this@entry=0x2aaab7f37b70 <CUDADevice>, device=0x115da00, id=-35, id@entry=-1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:344
>>>>>>>>>>> > #2  0x00002aaab55577de in PetscDeviceCreate (type=type@entry=PETSC_DEVICE_CUDA, devid=devid@entry=-1, device=device@entry=0x2aaab7f37b48 <defaultDevices+8>) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:107
>>>>>>>>>>> > #3  0x00002aaab5557b3a in PetscDeviceInitializeDefaultDevice_Internal (type=type@entry=PETSC_DEVICE_CUDA, defaultDeviceId=defaultDeviceId@entry=-1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:273
>>>>>>>>>>> > #4  0x00002aaab5557bf6 in PetscDeviceInitialize (type=type@entry=PETSC_DEVICE_CUDA) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:234
>>>>>>>>>>> > #5  0x00002aaab5661fcd in VecCreate_SeqCUDA (V=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:244
>>>>>>>>>>> > #6  0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150, method=method@entry=0x2aaab70b45b8 "seqcuda") at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
>>>>>>>>>>> > #7  0x00002aaab579c33f in VecCreate_CUDA (v=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu:214
>>>>>>>>>>> > #8  0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150, method=method@entry=0x7fffffff9260 "cuda") at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
>>>>>>>>>>> > #9  0x00002aaab5648bf1 in VecSetTypeFromOptions_Private (vec=0x115d150, PetscOptionsObject=0x7fffffff9210) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1263
>>>>>>>>>>> > #10 VecSetFromOptions (vec=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1297
>>>>>>>>>>> > #11 0x00002aaab02ef227 in libMesh::PetscVector<double>::init (this=0x11cd1a0, n=441, n_local=441, fast=false, ptype=libMesh::PARALLEL) at /home/kongf/workhome/sawtooth/moosegpu/scripts/../libmesh/installed/include/libmesh/petsc_vector.h:693
>>>>>>>>>>> >
>>>>>>>>>>> > On Thu, Jan 20, 2022 at 1:09 PM Fande Kong <
>>>>>>>>>>> fdkong.jd at gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> >> Thanks, Jed,
>>>>>>>>>>> >>
>>>>>>>>>>> >> This worked!
>>>>>>>>>>> >>
>>>>>>>>>>> >> Fande
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Jan 19, 2022 at 11:03 PM Jed Brown <jed at jedbrown.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >>> Fande Kong <fdkong.jd at gmail.com> writes:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> > On Wed, Jan 19, 2022 at 11:39 AM Jacob Faibussowitsch <
>>>>>>>>>>> >>> jacob.fai at gmail.com>
>>>>>>>>>>> >>> > wrote:
>>>>>>>>>>> >>> >
>>>>>>>>>>> >>> >> Are you running on login nodes or compute nodes (I can’t
>>>>>>>>>>> seem to tell
>>>>>>>>>>> >>> from
>>>>>>>>>>> >>> >> the configure.log)?
>>>>>>>>>>> >>> >>
>>>>>>>>>>> >>> >
>>>>>>>>>>> >>> > I was compiling codes on login nodes, and running codes on
>>>>>>>>>>> compute
>>>>>>>>>>> >>> nodes.
>>>>>>>>>>> >>> > Login nodes do not have GPUs, but compute nodes do have
>>>>>>>>>>> GPUs.
>>>>>>>>>>> >>> >
>>>>>>>>>>> >>> > Just to be clear, the same thing (code, machine) with
>>>>>>>>>>> PETSc-3.16.1
>>>>>>>>>>> >>> worked
>>>>>>>>>>> >>> > perfectly. I have this trouble with PETSc-main.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I assume you can
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> export PETSC_OPTIONS='-device_enable lazy'
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> and it'll work.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I think this should be the default. The main complaint is
>>>>>>>>>>> >>> that timing the first GPU-using event isn't accurate if it
>>>>>>>>>>> >>> includes initialization, but I think this is mostly
>>>>>>>>>>> >>> hypothetical: you can't trust any timing that doesn't preload
>>>>>>>>>>> >>> in some form, and the first GPU-using event will almost always
>>>>>>>>>>> >>> be something uninteresting, so it will rarely lead to
>>>>>>>>>>> >>> confusion. Meanwhile, eager initialization is viscerally
>>>>>>>>>>> >>> disruptive for lots of people.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>> experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>>
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>>
>>>>>>>>>
>>>>
>>>>
>>>>