[petsc-dev] PetscCUDAInitialize

Smith, Barry F. bsmith at mcs.anl.gov
Thu Sep 19 21:05:55 CDT 2019


   This should be reported on GitLab, not in email.

   Anyway, my interpretation is that the machine runs low on swap space, so the OS is killing processes. Once, Satish and I sat down and checked the system logs on a machine that had little swap, and we saw system messages about low swap at exactly the time the tests were killed. Satish is resistant to increasing swap; I don't know why. Other times we see these kills that may not be due to swap, but then they are a mystery.

   You can rerun the particular job by clicking on the little circle after the job name and see what happens the next time.

   Barry

   It may be that the -j and -l options for some systems need to be adjusted down slightly, and this will prevent these kills. Satish, can that be done in the examples/arch-ci* files with configure options, or in the runner files, or in the yaml file?



> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
> 
> All failed tests just said "application called MPI_Abort" and had no stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related initialization when it is not needed. Let's see the new test result.
> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> #	application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> 
> 
> --Junchao Zhang
> 
> 
> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> 
>  Failed? That means nothing; send a link or cut and paste the error.
> 
>  It could be that, since we have multiple separate tests running at the same time, they overload the GPU or cause some inconsistent behavior that doesn't appear every time the tests are run.
> 
>    Barry
> 
> Maybe we need to serialize all the tests that use the GPUs. We currently just trust gnumake for the parallelism; maybe you could somehow add dependencies to get gnumake to achieve this?
> 
> 
> 
> 
> > On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
> > 
> > On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> > 
> > 
> > > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
> > > 
> > > I saw your update. In PetscCUDAInitialize we have
> > > 
> > >       /* First get the device count */
> > >       err   = cudaGetDeviceCount(&devCount);
> > > 
> > >       /* next determine the rank and then set the device via a mod */
> > >       ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> > >       device = rank % devCount;
> > >     }
> > >     err = cudaSetDevice(device);
> > > 
> > > If we rely on the first CUDA call to do the initialization, how could CUDA know about this MPI stuff?
> > 
> >   It doesn't, so it does whatever it does (which may be dumb).
> > 
> >   Are you proposing something?
> > 
> > No. My test failed in CI with -cuda_initialize 0 on frog, but I could not reproduce it. I'm investigating.
> > 
> >   Barry
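
A minimal sketch, not PETSc's actual code path, of how a deferred ("lazy") initialization could still pick the device by MPI rank: the first wrapper that needs the GPU calls a check routine that applies the same mod rule as the PetscCUDAInitialize excerpt above. The helper name LazyCUDAInitializeCheck is hypothetical.

    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    static int cudaInitialized = 0;   /* set once the device has been selected */

    /* Hypothetical helper: called before the first real CUDA operation */
    static int LazyCUDAInitializeCheck(MPI_Comm comm)
    {
      if (!cudaInitialized) {
        int         rank, devCount = 0;
        cudaError_t err;

        err = cudaGetDeviceCount(&devCount);
        if (err != cudaSuccess || devCount == 0) return 1;
        MPI_Comm_rank(comm, &rank);
        /* same mod rule as in the quoted PetscCUDAInitialize excerpt */
        err = cudaSetDevice(rank % devCount);
        if (err != cudaSuccess) return 1;
        cudaInitialized = 1;
      }
      return 0;
    }

    int main(int argc, char **argv)
    {
      int dev;
      MPI_Init(&argc, &argv);
      if (!LazyCUDAInitializeCheck(MPI_COMM_WORLD)) {
        cudaGetDevice(&dev);
        printf("this rank is using device %d\n", dev);
      }
      MPI_Finalize();
      return 0;
    }
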
> > 
> > > 
> > > --Junchao Zhang
> > > 
> > > 
> > > 
> > > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> > > 
> > >   Fixed the docs. Thanks for pointing out the lack of clarity.
> > > 
> > > 
> > > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> > > > 
> > > > Barry,
> > > > 
> > > > I saw you added these in init.c:
> > > > 
> > > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > > > 
> > > > Notes:
> > > >    Initializing cuBLAS takes about 1/2 second, therefore it is done by default in PetscInitialize() before logging begins
> > > > 
> > > > But I did not get it: otherwise, i.e. with -cuda_initialize 0, when will CUDA be initialized?
> > > > --Junchao Zhang
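
For the question just above about -cuda_initialize 0, here is a hedged sketch of the logic the option name and the quoted docs suggest; what PetscInitialize() actually does internally is not shown, and the comments about where the deferred cost lands are an assumption, not a statement of PETSc's implementation.

    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscBool      cuda_init = PETSC_TRUE;  /* documented default: initialize in PetscInitialize() */

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = PetscOptionsGetBool(NULL, NULL, "-cuda_initialize", &cuda_init, NULL);CHKERRQ(ierr);
      if (cuda_init) {
        /* eager path: the device is selected and cuBLAS is warmed up inside
           PetscInitialize(), before logging begins, so the roughly 1/2 second
           cuBLAS startup is not charged to any logged event */
        ierr = PetscPrintf(PETSC_COMM_WORLD, "CUDA initialized eagerly\n");CHKERRQ(ierr);
      } else {
        /* deferred path: with -cuda_initialize 0 the cost presumably moves to
           whatever first touches the GPU, e.g. the first CUDA vector or matrix
           operation, and shows up in that event's timing instead */
        ierr = PetscPrintf(PETSC_COMM_WORLD, "CUDA initialization deferred\n");CHKERRQ(ierr);
      }
      ierr = PetscFinalize();
      return ierr;
    }
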
> > > 
> 


