[petsc-dev] [petsc-maint] running CUDA on SUMMIT
Mark Adams
mfadams at lbl.gov
Wed Jul 10 07:54:24 CDT 2019
On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
> ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>
> So the xx on the GPU appears ok?
The norm is correct and ...
> The a->B appears ok?
yes
> But on process 1 the result a->lvec is wrong?
>
yes
> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>
I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented for CUDA, so
I should copy it to the CPU. Maybe I should make a CUDA version of these methods?
>
> ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>
> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
This is where I have been digging around and printing stuff.
>
> Are you sure the problem isn't related to the "stream business"?
>
I don't know what that is, but I have played around with adding
cudaDeviceSynchronize.
>
> /* This multiplication sequence is different sequence
> than the CPU version. In particular, the diagonal block
> multiplication kernel is launched in one stream. Then,
> in a separate stream, the data transfers from DeviceToHost
> (with MPI messaging in between), then HostToDevice are
> launched. Once the data transfer stream is synchronized,
> to ensure messaging is complete, the MatMultAdd kernel
> is launched in the original (MatMult) stream to protect
> against race conditions.
>
> This sequence should only be called for GPU computation. */
>
> Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>
Yes, I noticed this. It is the same comment as in MatMult and not correct here.
>
> Anyway to "turn off the stream business" and see if the result is then
> correct?
How do you do that? I'm looking at docs on streams but am not sure how it's
used here.
> Perhaps the stream business was done correctly for MatMult() but was never
> right for MatMultTranspose()?
>
> Barry
>
> BTW: Unrelated comment, the code
>
> ierr = VecSet(yy,0);CHKERRQ(ierr);
> ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
>
> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> them all yourself so setting them to zero before calling
> VecCUDAGetArrayWrite() does nothing except waste time.
>
>
OK, I'll get rid of it.
>
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
> >
> > I am stumped with this GPU bug(s). Maybe someone has an idea.
> >
> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> detected, but I still have differences between the GPU and CPU transpose
> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
> with two processors. It works on one processor or with cg/none. So it is
> the transpose mat-vec.
> >
> > I see that the result of the off-diagonal (a->lvec) is different only on
> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
> and vec and printed out matlab vectors. Below is the CPU output and then
> the GPU with a view of the scatter object, which is identical as you can
> see.
> >
> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
> wrong ...
> >
> > The only/first difference between CPU and GPU is a->lvec (the
> off-diagonal contribution) on processor 1 (you can see the norms are
> different). Here is the diff on the process 1 a->lvec vector (all
> values are off).
> >
> > Any thoughts would be appreciated,
> > Mark
> >
> > 15:30 1 /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > 2,12c2,12
> > < % type: seqcuda
> > < Vec_0x53738630_0 = [
> > < 9.5702137431412879e+00
> > < 2.1970298791152253e+01
> > < 4.5422290209190646e+00
> > < 2.0185031807270226e+00
> > < 4.2627312508573375e+01
> > < 1.0889191983882025e+01
> > < 1.6038202417695462e+01
> > < 2.7155672033607665e+01
> > < 6.2540357853223556e+00
> > ---
> > > % type: seq
> > > Vec_0x3a546440_0 = [
> > > 4.5565851251714653e+00
> > > 1.0460532998971189e+01
> > > 2.1626531807270220e+00
> > > 9.6105288923182408e-01
> > > 2.0295782656035659e+01
> > > 5.1845791066529463e+00
> > > 7.6361340020576058e+00
> > > 1.2929401011659799e+01
> > > 2.9776812928669392e+00
> >
> > 15:15 130 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
> ./ex56 -cells 2,2,1
> > [0] 27 global equations, 9 vertices
> > [0] 27 equations in vector, 9 vertices
> > 0 SNES Function norm 1.223958326481e+02
> > 0 KSP Residual norm 1.223958326481e+02
> > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|=
> 1.424708937136e+00
> > [1] |x|= 1.223958326481e+02 |a->lvec|= 2.844171413778e+01 |B|=
> 1.424708937136e+00
> > [1] 1) |yy|= 2.007423334680e+02
> > [0] 1) |yy|= 2.007423334680e+02
> > [0] 2) |yy|= 1.957605719265e+02
> > [1] 2) |yy|= 1.957605719265e+02
> > [1] Number sends = 1; Number to self = 0
> > [1] 0 length = 9 to whom 0
> > Now the indices for all remote sends (in order by process sent to)
> > [1] 9
> > [1] 10
> > [1] 11
> > [1] 12
> > [1] 13
> > [1] 14
> > [1] 15
> > [1] 16
> > [1] 17
> > [1] Number receives = 1; Number from self = 0
> > [1] 0 length 9 from whom 0
> > Now the indices for all remote receives (in order by process received
> from)
> > [1] 0
> > [1] 1
> > [1] 2
> > [1] 3
> > [1] 4
> > [1] 5
> > [1] 6
> > [1] 7
> > [1] 8
> > 1 KSP Residual norm 8.199932342150e+01
> > Linear solve did not converge due to DIVERGED_ITS iterations 1
> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations
> 0
> >
> >
> > 15:19 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
> ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> > [0] 27 global equations, 9 vertices
> > [0] 27 equations in vector, 9 vertices
> > 0 SNES Function norm 1.223958326481e+02
> > 0 KSP Residual norm 1.223958326481e+02
> > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|=
> 1.424708937136e+00
> > [1] |x|= 1.223958326481e+02 |a->lvec|= 5.973624458725e+01 |B|=
> 1.424708937136e+00
> > [0] 1) |yy|= 2.007423334680e+02
> > [1] 1) |yy|= 2.007423334680e+02
> > [0] 2) |yy|= 1.953571867633e+02
> > [1] 2) |yy|= 1.953571867633e+02
> > [1] Number sends = 1; Number to self = 0
> > [1] 0 length = 9 to whom 0
> > Now the indices for all remote sends (in order by process sent to)
> > [1] 9
> > [1] 10
> > [1] 11
> > [1] 12
> > [1] 13
> > [1] 14
> > [1] 15
> > [1] 16
> > [1] 17
> > [1] Number receives = 1; Number from self = 0
> > [1] 0 length 9 from whom 0
> > Now the indices for all remote receives (in order by process received
> from)
> > [1] 0
> > [1] 1
> > [1] 2
> > [1] 3
> > [1] 4
> > [1] 5
> > [1] 6
> > [1] 7
> > [1] 8
> > 1 KSP Residual norm 8.199932342150e+01
> >
>
>