[petsc-users] Did CUDA break again?

Mark Adams mfadams at lbl.gov
Fri May 28 07:59:54 CDT 2021


On Thu, May 27, 2021 at 11:50 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Mark,
>
>
>
>     Where did you run the little test program I sent you
>
> 1) when it produced
>
>    The 1120 and the negative number (was this on the compile server or
> on a compute node?)
>

This is fine now; see my last email. I was not using srun.


> 2) when it produced the correct answer? (compile server or compute node?)
>
> Do you run configure on a compile server (that has no GPUs) or a compute
> server that has GPUs
>

You have to do everything on the compute nodes on Cori/gpu.


>  Don't spend your time bisecting PETSc; we know exactly where the problem
> is, we just don't see how it happens.
>

>    cuda.py, if it cannot find deviceQuery and if you did not provide a
> generation arch with -with-cuda-gencodearch=70,
>

I thought I was not supposed to use that anymore. It sounds like it is
optional.


> runs a version of the little code I sent you to get the number, but it is
> ??apparently?? producing garbage or not running on the compile server, and
> gives the wrong number 1120.
>

Does PETSc use MPIEXEC to run this?

Note, I have not been able to get 'make check' to work on Cori/gpu. I use
'-with-mpiexec=srun -G1 [-c 20]' and it fails to execute the tests.

OK, putting -with-cuda-gencodearch=70 back in has fixed this problem. It is
running now.
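For anyone hitting the same thing, a minimal sketch of the configure invocation being discussed. Only -with-cuda-gencodearch=70 and the srun mpiexec setting come from this thread; the other flags are placeholders and will differ per site (Cori's V100s are compute capability 7.0, hence 70):

```shell
# Run configure on a compute node (not the compile server) so any
# GPU-detection probe can actually see a GPU.
./configure \
  -with-cuda=1 \
  -with-cuda-gencodearch=70 \
  '-with-mpiexec=srun -G1'
```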

Thanks,


>
>    Just use the option -with-cuda-gencodearch=70 (you no longer need to
> pass this information to any other flags; give just this option and
> configure will use it).
>
>   Barry
>
> Ideally we want it to figure this out automatically, and this little test
> program in configure is supposed to do that, but since it is not always
> working yet you should just use -with-cuda-gencodearch=70.
>
>
>
> On May 27, 2021, at 5:45 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
> FYI, I was running the test incorrectly:
> 03:38 cgpu12  ~/petsc_install$ srun -n 1 -G 1 ./a.out
> 70
> 70
>
> On Wed, May 26, 2021 at 10:21 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> I had git bisect working and was 4 steps away when I got a new crash.
>> configure.log is empty.
>>
>> 19:15 1 cgpu02 (a531cba26b...)|BISECTING ~/petsc$ git bisect bad
>> Bisecting: 19 revisions left to test after this (roughly 4 steps)
>> [149e269f455574fbe8ce3ebaf42121ae7fdf0635] Merge branch 'tisaac/feature-spqr' into 'main'
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>
>> ===============================================================================
>>              Configuring PETSc to compile on your system
>> ===============================================================================
>> *******************************************************************************
>>         CONFIGURATION CRASH  (Please send configure.log to petsc-maint at mcs.anl.gov)
>> *******************************************************************************
>>
>> EOL while scanning string literal (cuda.py, line 176)
>>   File "/global/u2/m/madams/petsc/config/configure.py", line 455, in petsc_configure
>>     framework = config.framework.Framework(['--configModules=PETSc.Configure','--optionsModule=config.compilerOptions']+sys.argv[1:], loadArgDB = 0)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 107, in __init__
>>     self.createChildren()
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 344, in createChildren
>>     self.getChild(moduleName)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/PETSc/Configure.py", line 80, in setupDependencies
>>     self.blasLapack    = framework.require('config.packages.BlasLapack',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/BlasLapack.py", line 21, in setupDependencies
>>     config.package.Package.setupDependencies(self, framework)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/package.py", line 151, in setupDependencies
>>     self.mpi         = framework.require('config.packages.MPI',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPI.py", line 73, in setupDependencies
>>     self.mpich   = framework.require('config.packages.MPICH', self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPICH.py", line 16, in setupDependencies
>>     self.cuda            = framework.require('config.packages.cuda',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 302, in getChild
>>     type   = __import__(moduleName, globals(), locals(), ['Configure']).Configure
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>
>> On Wed, May 26, 2021 at 10:10 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Wed, May 26, 2021 at 6:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>   What is HOST=cori09?  Does it have GPUs?
>>>>
>>>>
>>>> https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_164490976c8e07e028a8f1ce1f5cd42d6
>>>>
>>>>   Seems to clearly state:
>>>>
>>>> int cudaDeviceProp::major  [inherited]
>>>>     Major compute capability
>>>>
>>>>
>>>> Mark, please compile and run this program on the machine you are
>>>> running configure on
>>>>
>>>> #include <stdio.h>
>>>> #include <cuda.h>
>>>> #include <cuda_runtime.h>
>>>> #include <cuda_runtime_api.h>
>>>> #include <cuda_device_runtime_api.h>
>>>> int main(int argc,char **args)
>>>> {
>>>>   struct cudaDeviceProp dp;
>>>>   cudaGetDeviceProperties(&dp, 0);
>>>>   printf("%d\n",10*dp.major+dp.minor);
>>>>
>>>>   int major,minor;
>>>>   cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>>>   cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
>>>>   printf("%d\n",10*major+minor);
>>>>   return(0);
You probably need to check the return codes of these two function calls
>>> to make sure they succeeded.
>>>
>>>> }
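Following Junchao's suggestion, a sketch of the same probe with the return codes checked (not part of the original email; the error-handling style is illustrative, using only the standard CUDA runtime/driver error APIs):

```c
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(void)
{
  /* Runtime API: fails with an error code (rather than printing garbage)
     if no GPU is visible, e.g. when run on the compile server. */
  struct cudaDeviceProp dp;
  cudaError_t rerr = cudaGetDeviceProperties(&dp, 0);
  if (rerr != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(rerr));
    return 1;
  }
  printf("%d\n", 10*dp.major + dp.minor);

  /* Driver API: same capability via device attributes; the earlier
     runtime call has already initialized the driver context. */
  int major, minor;
  CUresult derr = cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
  if (derr == CUDA_SUCCESS)
    derr = cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
  if (derr != CUDA_SUCCESS) {
    fprintf(stderr, "cuDeviceGetAttribute failed with CUresult %d\n", (int)derr);
    return 1;
  }
  printf("%d\n", 10*major + minor);
  return 0;
}
```

Compile the same way as Barry's version (nvcc probe.c -lcuda); on a GPU-less compile server it now exits nonzero with a diagnostic instead of printing a bogus number like 1120.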
>>>>
>>>> This is what I get
>>>>
>>>> $ nvcc mytest.c -lcuda
>>>> ~/petsc* (main=)* arch-main
>>>> $ ./a.out
>>>> 70
>>>> 70
>>>>
>>>> Which is exactly what it is supposed to do.
>>>>
>>>> Barry
>>>>
>>>> On May 26, 2021, at 5:31 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>
>>>>   Yes, this code which I guess never got hit before
>>>>
>>>> cudaDeviceProp dp;
>>>> cudaGetDeviceProperties(&dp, 0);
>>>> printf("%d\n",10*dp.major+dp.minor);
>>>> return(0);
>>>>
>>>> is using the wrong property for the generation.
>>>>
>>>>  Back to the CUDA documentation for the correct information.
>>>>
>>>>
>>>>
>>>> On May 26, 2021, at 3:47 PM, Jacob Faibussowitsch <jacob.fai at gmail.com>
>>>> wrote:
>>>>
>>>> 1120 sounds suspiciously like some CUDA version rather than
>>>> architecture or compute capability…
>>>>
>>>> Best regards,
>>>>
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>> Cell: +1 (312) 694-3391
>>>>
>>>> On May 26, 2021, at 22:29, Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>> I started to get this error today on Cori.
>>>>
>>>> nvcc fatal   : Unsupported gpu architecture 'compute_1120'
>>>>
>>>> I am pretty sure I had a clean build but I can redo it if you don't
>>>> know where this is from.
>>>>
>>>> Thanks,
>>>> Mark
>>>> <configure.log>
>>>>
>>>>
>>>>
>>>>
>