[petsc-users] Did CUDA break again?

Junchao Zhang junchao.zhang at gmail.com
Fri May 28 11:30:23 CDT 2021


On Fri, May 28, 2021 at 10:40 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>    Thanks. On machines such as this one, where you have to use $MPIEXEC to
> run code, you will still need to provide the generation with
> -with-cuda-gencodearch=70. On systems where configure can directly query the
> GPU without MPIEXEC, it will automatically produce the correct result.
> Otherwise it guesses by compiling for different generations, but this can
> produce an incorrect answer.
>
Yes, on Summit with CUDA-11 the script guesses sm_80, but it should actually
be sm_70. We could probably test the hostname and set the correct CUDA arch
for common machines, but that seems like overkill.


>
>    Barry
>
>
> On May 28, 2021, at 7:59 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
>
> On Thu, May 27, 2021 at 11:50 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Mark,
>>
>>
>>
>>     Where did you run the little test program I sent you?
>>
>> 1) When it produced the 1120 and the negative number (was this on the
>> compile server or on a compute node?)
>>
>
> This is fine now. Look at my last email; I was not using srun.
>
>
>> 2) When it produced the correct answer (compile server or compute node?)
>>
>> Do you run configure on a compile server (that has no GPUs) or on a compute
>> server that has GPUs?
>>
>
> You have to do everything on the compute nodes on Cori/gpu.
>
>
>>  Don't spend your time bisecting PETSc; we know exactly where the problem
>> is, we just don't see how it happens.
>>
>
>>    cuda.py, if it cannot find deviceQuery and if you did not provide a
>> generation arch with -with-cuda-gencodearch=70,
>>
>
> I thought I was not supposed to use that anymore. It sounds like it is
> optional.
>
>
>> runs a version of the little code I sent you to get the number, but it is
>> ??apparently?? producing garbage or not running on the compile server, and
>> gives the wrong number 1120.
>>
>
> Does PETSc use MPIEXEC to run this?
>
> Note, I have not been able to get 'make check' to work on Cori/gpu. I use
> '-with-mpiexec=srun -G1 [-c 20]' and it fails to execute the tests.
>
> OK, putting -with-cuda-gencodearch=70 back in has fixed this problem. It
> is running now.
>
> Thanks,
>
>
>>
>>    Just use the option -with-cuda-gencodearch=70 (you no longer need to
>> pass this information through any compiler flags; give just this option and
>> configure will use it).
>>
>>   Barry
>>
>> Ideally we want configure to figure this out automatically, and this little
>> test program in configure is supposed to do that, but since it is not always
>> working yet you should just use -with-cuda-gencodearch=70.
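>>
>> For example, the invocation would look something like this (just a sketch;
>> "<your usual options>" is a placeholder for whatever else you normally pass
>> to configure):
>>
>>    ./configure -with-cuda=1 -with-cuda-gencodearch=70 <your usual options>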
>>
>>
>>
>> On May 27, 2021, at 5:45 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> FYI, I was running the test incorrectly:
>> 03:38 cgpu12  ~/petsc_install$ srun -n 1 -G 1 ./a.out
>> 70
>> 70
>>
>> On Wed, May 26, 2021 at 10:21 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> I had git bisect working and was 4 steps away when I got a new crash.
>>> configure.log is empty.
>>>
>>> 19:15 1 cgpu02 (a531cba26b...)|BISECTING ~/petsc$ git bisect bad
>>> Bisecting: 19 revisions left to test after this (roughly 4 steps)
>>> [149e269f455574fbe8ce3ebaf42121ae7fdf0635] Merge branch
>>> 'tisaac/feature-spqr' into 'main'
>>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$
>>> ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>>
>>> ===============================================================================
>>>              Configuring PETSc to compile on your system
>>>
>>>
>>> ===============================================================================
>>>
>>> *******************************************************************************
>>>         CONFIGURATION CRASH  (Please send configure.log to
>>> petsc-maint at mcs.anl.gov)
>>>
>>> *******************************************************************************
>>>
>>> EOL while scanning string literal (cuda.py, line 176)
>>>   File "/global/u2/m/madams/petsc/config/configure.py", line 455, in
>>> petsc_configure
>>>     framework =
>>> config.framework.Framework(['--configModules=PETSc.Configure','--optionsModule=config.compilerOptions']+sys.argv[1:],
>>> loadArgDB = 0)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 107, in __init__
>>>     self.createChildren()
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 344, in createChildren
>>>     self.getChild(moduleName)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 329, in getChild
>>>     config.setupDependencies(self)
>>>   File "/global/u2/m/madams/petsc/config/PETSc/Configure.py", line 80,
>>> in setupDependencies
>>>     self.blasLapack    =
>>> framework.require('config.packages.BlasLapack',self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 329, in getChild
>>>     config.setupDependencies(self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/BlasLapack.py",
>>> line 21, in setupDependencies
>>>     config.package.Package.setupDependencies(self, framework)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/package.py",
>>> line 151, in setupDependencies
>>>     self.mpi         = framework.require('config.packages.MPI',self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 329, in getChild
>>>     config.setupDependencies(self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPI.py", line
>>> 73, in setupDependencies
>>>     self.mpich   = framework.require('config.packages.MPICH', self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 329, in getChild
>>>     config.setupDependencies(self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPICH.py",
>>> line 16, in setupDependencies
>>>     self.cuda            = framework.require('config.packages.cuda',self)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File
>>> "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line
>>> 302, in getChild
>>>     type   = __import__(moduleName, globals(), locals(),
>>> ['Configure']).Configure
>>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$
>>> ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>>
>>> On Wed, May 26, 2021 at 10:10 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 26, 2021 at 6:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>   What is HOST=cori09?  Does it have GPUs?
>>>>>
>>>>>
>>>>> https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_164490976c8e07e028a8f1ce1f5cd42d6
>>>>>
>>>>>   Seems to clearly state:
>>>>>
>>>>> int cudaDeviceProp::major [inherited]
>>>>>
>>>>> Major compute capability
>>>>>
>>>>>
>>>>> Mark, please compile and run this program on the machine you are
>>>>> running configure on
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <cuda.h>
>>>>> #include <cuda_runtime.h>
>>>>> #include <cuda_runtime_api.h>
>>>>> #include <cuda_device_runtime_api.h>
>>>>> int main(int arg,char **args)
>>>>> {
>>>>>   struct cudaDeviceProp dp;
>>>>>   cudaGetDeviceProperties(&dp, 0);
>>>>>   printf("%d\n",10*dp.major+dp.minor);
>>>>>
>>>>>   int major,minor;
>>>>>   cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>>>>   cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
>>>>>   printf("%d\n",10*major+minor);
>>>>>   return(0);
>>>>>
>>>> Probably you need to check the return codes of these two function calls to
>>>> make sure they succeeded; a sketch of what I mean follows the program below.
>>>>
>>>>
>>>>> }
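>>>>
>>>> Something along these lines (only a sketch of the kind of check I mean,
>>>> with made-up variable names rerr/derr; it reuses dp and major from the
>>>> program above and is not code taken from configure):
>>>>
>>>>   /* runtime API call: fail loudly instead of printing garbage */
>>>>   cudaError_t rerr = cudaGetDeviceProperties(&dp, 0);
>>>>   if (rerr != cudaSuccess) {printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(rerr)); return 1;}
>>>>   /* driver API call: returns a CUresult rather than a cudaError_t */
>>>>   CUresult derr = cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>>>   if (derr != CUDA_SUCCESS) {printf("cuDeviceGetAttribute failed with error %d\n", (int)derr); return 1;}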
>>>>>
>>>>> This is what I get
>>>>>
>>>>> $ nvcc mytest.c -lcuda
>>>>> ~/petsc* (main=)* arch-main
>>>>> $ ./a.out
>>>>> 70
>>>>> 70
>>>>>
>>>>> Which is exactly what it is supposed to do.
>>>>>
>>>>> Barry
>>>>>
>>>>> On May 26, 2021, at 5:31 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>
>>>>>   Yes, this code which I guess never got hit before
>>>>>
>>>>> cudaDeviceProp dp;
>>>>> cudaGetDeviceProperties(&dp, 0);
>>>>> printf("%d\n",10*dp.major+dp.minor);
>>>>> return(0);
>>>>>
>>>>> is using the wrong property for the generation.
>>>>>
>>>>>  Back to the CUDA documentation for the correct information.
>>>>>
>>>>>
>>>>>
>>>>> On May 26, 2021, at 3:47 PM, Jacob Faibussowitsch <jacob.fai at gmail.com>
>>>>> wrote:
>>>>>
>>>>> 1120 sounds suspiciously like some CUDA version rather than
>>>>> architecture or compute capability…
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Jacob Faibussowitsch
>>>>> (Jacob Fai - booss - oh - vitch)
>>>>> Cell: +1 (312) 694-3391
>>>>>
>>>>> On May 26, 2021, at 22:29, Mark Adams <mfadams at lbl.gov> wrote:
>>>>> 
>>>>> I started to get this error today on Cori.
>>>>>
>>>>> nvcc fatal   : Unsupported gpu architecture 'compute_1120'
>>>>>
>>>>> I am pretty sure I had a clean build but I can redo it if you don't
>>>>> know where this is from.
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>> <configure.log>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>