[petsc-dev] PetscCUDAInitialize

Zhang, Junchao jczhang at mcs.anl.gov
Thu Sep 19 17:00:05 CDT 2019


All failed tests just said "application called MPI_Abort" and had no stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related initialization when it is not needed. Let's see the new test results.

not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
#       application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

--Junchao Zhang


On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

 Failed? That means nothing; send a link or cut and paste the error.

 It could be that since we have multiple separate tests running at the same time they overload the GPU or cause some inconsistent behavior that doesn't appear every time the tests are run.

   Barry

Maybe we need to serialize all the tests that use the GPUs. We just trust gnumake for the parallelism; maybe you could somehow add dependencies to get GNU make to achieve this?




> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>
> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
>
> > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
> >
> > I saw your update. In PetscCUDAInitialize we have
> >
> >       /* First get the device count */
> >       err   = cudaGetDeviceCount(&devCount);
> >
> >       /* next determine the rank and then set the device via a mod */
> >       ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> >       device = rank % devCount;
> >     }
> >     err = cudaSetDevice(device);
> >
> > If we rely on the first CUDA call to do the initialization, how could CUDA know about this MPI stuff?
>
>   It doesn't, so it does whatever it does (which may be dumb).
>
>   Are you proposing something?
>
> No. My test failed in CI with -cuda_initialize 0 on frog, but I could not reproduce it. I'm investigating.
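
For illustration only, here is a minimal standalone sketch (not PETSc code) of the pattern in the excerpt above and of Barry's point: if nothing calls cudaSetDevice(), the runtime's lazy initialization simply picks its default device (typically device 0) at the first CUDA call, so it is the explicit rank % devCount assignment that ties the MPI layout to the GPUs.

  /* Minimal sketch, not PETSc code: bind each MPI rank to a GPU by a simple mod,
     as in the PetscCUDAInitialize excerpt quoted above. Without the cudaSetDevice()
     call, lazy initialization would leave every rank on the runtime's default
     device (typically device 0). */
  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    int rank, devCount, device;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (cudaGetDeviceCount(&devCount) != cudaSuccess || devCount == 0) {
      fprintf(stderr, "[%d] no CUDA devices visible\n", rank);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Explicit assignment: spread ranks over the visible GPUs. */
    if (cudaSetDevice(rank % devCount) != cudaSuccess) {
      fprintf(stderr, "[%d] cudaSetDevice failed\n", rank);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Report which device this rank will actually use. */
    cudaGetDevice(&device);
    printf("rank %d: %d device(s) visible, using device %d\n", rank, devCount, device);

    MPI_Finalize();
    return 0;
  }

(Built with something like "mpicc devbind.c -o devbind -lcudart" plus the CUDA include/library paths, and run under mpiexec; the file and executable names here are made up.)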
>
>   Barry
>
> >
> > --Junchao Zhang
> >
> > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >
> >   Fixed the docs. Thanks for pointing out the lack of clarity
> >
> >
> > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> > >
> > > Barry,
> > >
> > > I saw you added these in init.c
> > >
> > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > >
> > > Notes:
> > >    Initializing cuBLAS takes about 1/2 second, therefore it is done by default in PetscInitialize() before logging begins
> > >
> > > But I did not get it: otherwise, with -cuda_initialize 0, when will CUDA be initialized?
> > > --Junchao Zhang
> >
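
As a rough way to see the initialization cost mentioned in the note quoted above (the roughly half-second cuBLAS setup), here is a standalone timing sketch, not PETSc code, that times a first cublasCreate() against a second one. Presumably, with -cuda_initialize 0 that one-time cost is simply deferred from PetscInitialize() to whatever first touches the GPU; that reading of the option is an assumption based on the quoted note, not something stated elsewhere in this thread.

  /* Rough standalone sketch (not PETSc code): the first cublasCreate() in a
     process pays the one-time CUDA/cuBLAS context setup cost; a second create
     is comparatively cheap. */
  #include <cublas_v2.h>
  #include <stdio.h>
  #include <time.h>

  static double wtime(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  int main(void)
  {
    cublasHandle_t h1, h2;
    double         t0, t1, t2;

    t0 = wtime();
    if (cublasCreate(&h1) != CUBLAS_STATUS_SUCCESS) { fprintf(stderr, "first cublasCreate failed\n"); return 1; }
    t1 = wtime();
    if (cublasCreate(&h2) != CUBLAS_STATUS_SUCCESS) { fprintf(stderr, "second cublasCreate failed\n"); return 1; }
    t2 = wtime();

    printf("first  cublasCreate: %.3f s (includes CUDA/cuBLAS initialization)\n", t1 - t0);
    printf("second cublasCreate: %.3f s\n", t2 - t1);

    cublasDestroy(h2);
    cublasDestroy(h1);
    return 0;
  }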
