[petsc-users] Did CUDA break again?

Fri May 28 10:40:02 CDT 2021

   Thanks. On machines such as this one where you have to use $MPIEXEC to run code you will still need to provide the generation with -with-cuda-gencodearch=70. On systems where it can directly query the GPU without MPIEXEC it will automatically produce the correct result. Otherwise it will guess by compiling for different generations but this can produce an incorrect answer. 

   Barry
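
The order of precedence described above (explicit option first, then a direct query of the GPU, then a guess from trial compiles) can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical and none of this is taken from PETSc's cuda.py.

```python
def choose_gencodearch(explicit=None, query_gpu=None, guess=None):
    """Return (value, source) using the precedence described in the message.

    explicit  -- value supplied by the user, e.g. "70", or None
    query_gpu -- callable returning the capability, or None when the GPU
                 cannot be queried directly (e.g. MPIEXEC is required)
    guess     -- callable returning a best guess from trial compiles
    All names here are hypothetical, chosen for illustration only.
    """
    if explicit is not None:
        return str(explicit), "explicit"
    if query_gpu is not None:
        value = query_gpu()
        if value is not None:
            return str(value), "queried"
    if guess is not None:
        # A guess can be wrong, which is why the explicit option is preferred.
        return str(guess()), "guessed"
    raise RuntimeError("cannot determine the CUDA generation")


# The explicit value wins even when a (possibly wrong) guess is available.
print(choose_gencodearch(explicit="70", query_gpu=lambda: None, guess=lambda: "60"))
```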

> On May 28, 2021, at 7:59 AM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> 
> 
> On Thu, May 27, 2021 at 11:50 PM Barry Smith <bsmith at petsc.dev> wrote:
> 
>   Mark,
> 
>    
> 
>     Where did you run the little test program I sent you 
> 
> 1) when it produced 
> 
>    the 1120 and the negative number (was this on the compile server or on a compute node?)
> 
> This is fine now. Look at my last email. I was not using srun.
> 
> 
> 2) when it produced the correct answer? (compile server or compute node?)
> 
> Do you run configure on a compile server (that has no GPUs) or a compute server that has GPUs?
> 
> You have to do everything on the compute nodes on Cori/gpu.
> 
> 
>  Don't spend your time bisecting PETSc; we know exactly where the problem is, we just don't see how it happens. 
> 
>    cuda.py, if it cannot find deviceQuery and if you did not provide a generation arch with -with-cuda-gencodearch=70,
> 
> I thought I was not supposed to use that anymore. It sounds like it is optional. 
>  
> runs a version of the little code I sent you to get the number, but it is ??apparently?? producing garbage or not running on the compile server, and gives the wrong number 1120. 
> 
> Does PETSc use MPIEXEC to run this?
> 
> Note, I have not been able to get 'make check' to work on Cori/gpu. I use '-with-mpiexec=srun -G1 [-c 20]' and it fails to execute the tests.
> 
> OK, putting -with-cuda-gencodearch=70 back in has fixed this problem. It is running now.
> 
> Thanks,
>  
> 
>    Just use the option -with-cuda-gencodearch=70  (you do not need to pass this information in any other flags any more; just give this option and it will be used). 
> 
>   Barry
> 
> Ideally we want it to figure it out automatically, and this little test program in configure is supposed to do this, but since that is not always working yet you should just use -with-cuda-gencodearch=70
> 
> 
> 
>> On May 27, 2021, at 5:45 AM, Mark Adams <mfadams at lbl.gov> wrote:
>> 
>> FYI, I was running the test incorrectly:
>> 03:38 cgpu12  ~/petsc_install$ srun -n 1 -G 1 ./a.out 
>> 70
>> 70
>> 
>> On Wed, May 26, 2021 at 10:21 PM Mark Adams <mfadams at lbl.gov> wrote:
>> I had git bisect working and was 4 steps away when I got a new crash.
>> configure.log is empty.
>> 
>> 19:15 1 cgpu02 (a531cba26b...)|BISECTING ~/petsc$ git bisect bad
>> Bisecting: 19 revisions left to test after this (roughly 4 steps)
>> [149e269f455574fbe8ce3ebaf42121ae7fdf0635] Merge branch 'tisaac/feature-spqr' into 'main'
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>> ===============================================================================
>>              Configuring PETSc to compile on your system                       
>> ===============================================================================
>> *******************************************************************************
>>         CONFIGURATION CRASH  (Please send configure.log to petsc-maint at mcs.anl.gov)
>> *******************************************************************************
>> 
>> EOL while scanning string literal (cuda.py, line 176)
>>   File "/global/u2/m/madams/petsc/config/configure.py", line 455, in petsc_configure
>>     framework = config.framework.Framework(['--configModules=PETSc.Configure','--optionsModule=config.compilerOptions']+sys.argv[1:], loadArgDB = 0)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 107, in __init__
>>     self.createChildren()
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 344, in createChildren
>>     self.getChild(moduleName)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/PETSc/Configure.py", line 80, in setupDependencies
>>     self.blasLapack    = framework.require('config.packages.BlasLapack',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/BlasLapack.py", line 21, in setupDependencies
>>     config.package.Package.setupDependencies(self, framework)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/package.py", line 151, in setupDependencies
>>     self.mpi         = framework.require('config.packages.MPI',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPI.py", line 73, in setupDependencies
>>     self.mpich   = framework.require('config.packages.MPICH', self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>     config.setupDependencies(self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPICH.py", line 16, in setupDependencies
>>     self.cuda            = framework.require('config.packages.cuda',self)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>     config = self.getChild(moduleName, keywordArgs)
>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 302, in getChild
>>     type   = __import__(moduleName, globals(), locals(), ['Configure']).Configure
>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>> 
>> On Wed, May 26, 2021 at 10:10 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>> 
>> 
>> 
>> On Wed, May 26, 2021 at 6:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>> 
>>   What is HOST=cori09? Does it have GPUs?
>> 
>>   https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_164490976c8e07e028a8f1ce1f5cd42d6
>> 
>>   Seems to clearly state
>> 
>> int cudaDeviceProp::major [inherited]
>> Major compute capability 
>> 
>> 
>> 
>> Mark, please compile and run this program on the machine you are running configure on
>> 
>> #include <stdio.h>
>> #include <cuda.h>
>> #include <cuda_runtime.h>
>> #include <cuda_runtime_api.h>
>> #include <cuda_device_runtime_api.h>
>> int main(int arg,char **args)
>> {
>>   struct cudaDeviceProp dp;
>>   cudaGetDeviceProperties(&dp, 0);
>>   printf("%d\n",10*dp.major+dp.minor);
>> 
>>   int major,minor;
>>   cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>   cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
>>   printf("%d\n",10*major+minor);
>>   return(0);
>> Probably, you need to check the return code of these two function calls to make sure they are correct.
>>  
>> }
>> 
>> This is what I get 
>> 
>> $ nvcc mytest.c -lcuda
>> ~/petsc (main=) arch-main
>> $ ./a.out
>> 70
>> 70
>> 
>> Which is exactly what it is supposed to do.
>> 
>> Barry
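
The remark about checking return codes, together with the garbage values seen earlier (1120 and a negative number), suggests validating whatever the test program prints before trusting it. A minimal sketch, assuming the two printed lines should agree and that a plausible value is a positive two- or three-digit number; both rules and the function name are assumptions made for illustration, not PETSc's behaviour. The test program as posted always returns 0, so a non-zero status here would come from the launcher or from a version of the program that propagates CUDA errors.

```python
def parse_capability(returncode, stdout):
    """Return the compute capability printed by the test program, or None.

    returncode -- exit status of the test run (0 means success)
    stdout     -- text printed by the program, one integer per line
    """
    if returncode != 0:
        return None
    values = set()
    for token in stdout.split():
        try:
            values.add(int(token))
        except ValueError:
            return None
    # Assumption: the runtime and driver API values must agree.
    if len(values) != 1:
        return None
    cc = values.pop()
    # Assumption: plausible values are positive with two or three digits,
    # which rejects results such as 1120 or a negative number.
    if not 10 <= cc <= 999:
        return None
    return cc


print(parse_capability(0, "70\n70\n"))    # 70
print(parse_capability(0, "1120\n-5\n"))  # None
```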
>> 
>>> On May 26, 2021, at 5:31 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>> 
>>> 
>>>   Yes, this code which I guess never got hit before 
>>> 
>>> cudaDeviceProp dp;
>>> cudaGetDeviceProperties(&dp, 0);
>>> printf("%d\n",10*dp.major+dp.minor);
>>> return(0);
>>> 
>>> is using the wrong property for the generation. 
>>> 
>>>  Back to the CUDA documentation for the correct information. 
>>> 
>>> 
>>> 
>>>> On May 26, 2021, at 3:47 PM, Jacob Faibussowitsch <jacob.fai at gmail.com> wrote:
>>>> 
>>>> 1120 sounds suspiciously like some CUDA version rather than architecture or compute capability…
>>>> 
>>>> Best regards,
>>>> 
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>> Cell: +1 (312) 694-3391
>>>> 
>>>>> On May 26, 2021, at 22:29, Mark Adams <mfadams at lbl.gov> wrote:
>>>>> 
>>>>> I started to get this error today on Cori. 
>>>>> 
>>>>> nvcc fatal   : Unsupported gpu architecture 'compute_1120'
>>>>> 
>>>>> I am pretty sure I had a clean build but I can redo it if you don't know where this is from.
>>>>> 
>>>>> Thanks,
>>>>> Mark
>>>>> <configure.log>
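
The error message is consistent with the detected number being substituted directly into the nvcc architecture flag. A minimal sketch of that substitution, assuming the flag has the usual nvcc form -gencode arch=compute_NN,code=sm_NN; the helper name is hypothetical and the exact flags PETSc passes are not shown in this thread.

```python
def gencode_flag(arch):
    """Build an nvcc -gencode flag from a value such as 70 or "70"."""
    arch = str(arch)
    return "-gencode arch=compute_{0},code=sm_{0}".format(arch)


print(gencode_flag(70))    # -gencode arch=compute_70,code=sm_70
print(gencode_flag(1120))  # -gencode arch=compute_1120,code=sm_1120
```

With a bad value such as 1120 the resulting flag names an architecture that nvcc does not know, which matches the 'compute_1120' in the error.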
>>> 
>> 
> 
