[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Stefano Zampini stefano.zampini at gmail.com
Wed Jul 10 11:01:59 CDT 2019


Barry,

I think having a single code path instead of three different, quasi-similar
versions is less fragile (I admit, once you get the logic correct...).
Also, it conforms with the standard SpMV interface, which implements y = alpha * A *
x + beta * y.
The easiest fix is the following:

Rename MatMultAdd_ to MatMultKernel_Private and add an extra boolean to
control the transpose operation. Then you can reuse the same complicated code I
have written, just by selecting the proper cusparse object (matstructT or
matstruct).
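
A rough sketch of what I mean (the struct member names and exact signatures
here are assumptions about aijcusparse.cu, not a tested patch):

static PetscErrorCode MatMultKernel_Private(Mat A,Vec xx,Vec yy,Vec zz,PetscBool trans)
{
  Mat_SeqAIJCUSPARSE           *cusparsestruct = (Mat_SeqAIJCUSPARSE*)A->spptr;
  Mat_SeqAIJCUSPARSEMultStruct *matstruct;

  PetscFunctionBegin;
  /* the only transpose-specific decision: pick the plain or transposed CUSPARSE object */
  matstruct = trans ? cusparsestruct->matTranspose : cusparsestruct->mat;
  /* ... the existing shared zz = A*xx (+ yy) logic goes here, including the
     workVector / compressed-row handling and the yy == NULL case ... */
  PetscFunctionReturn(0);
}

/* the public routines then become thin wrappers */
PetscErrorCode MatMult_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatMultKernel_Private(A,xx,NULL,yy,PETSC_FALSE);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

PetscErrorCode MatMultTransposeAdd_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy,Vec zz)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatMultKernel_Private(A,xx,yy,zz,PETSC_TRUE);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}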


On Wed, Jul 10, 2019 at 6:16 PM Smith, Barry F. <bsmith at mcs.anl.gov>
wrote:

>
>    In the long run I would like to see smaller specialized chunks of code
> (with a bit of duplication between them) instead of highly overloaded
> routines like MatMultAdd_AIJCUSPARSE. Better three routines: one for multiply
> alone, one for multiply-add alone, and one for multiply-add with the compressed
> sparse format. Trying to get all the cases right (performance and correctness)
> for everything at once is unnecessary and risky. Having possibly zero-size
> objects (and hence null pointers) doesn't help the complex logic.
>
>
>    Barry
>
>
> > On Jul 10, 2019, at 10:06 AM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > Thanks, you made several changes here, including switches on the
> workVector size. I guess I should import this logic into the transpose
> method(s), except for the yy==NULL branches ...
> >
> > MatMult_ calls MatMultAdd with yy=0, but the transpose versions have
> their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.
> >
> > Thanks again,
> > Mark
> >
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <
> stefano.zampini at gmail.com> wrote:
> > Mark,
> >
> > If the difference is on lvec, I suspect the bug has to do with
> compressed row storage. I have fixed a similar bug in MatMult.
> > You want to check cusparsestruct->workVector->size() against A->cmap->n.
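> >
> > A rough sketch of that check (the surrounding code is my assumption, not a
> > verbatim excerpt from aijcusparse.cu):
> >
> >   if ((PetscInt)cusparsestruct->workVector->size() != A->cmap->n) {
> >     /* compressed row storage is in play: the kernel's output lives in the
> >        shorter workVector and has to be scattered into the full-length vector */
> >   } else {
> >     /* sizes match: the kernel can write the full-length result directly */
> >   }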
> >
> > Stefano
> >
> > On Wed, Jul 10, 2019 at 3:54 PM Mark Adams via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
> >
> >
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> >
> >   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> >   if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
> >   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> >
> >     So the xx on the GPU appears ok?
> >
> > The norm is correct and ...
> >
> > The a->B appears ok?
> >
> > yes
> >
> > But on process 1 the result a->lvec is wrong?
> >
> > yes
> >
> >
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> >
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
> should copy it. Maybe I should make a CUDA version of these methods?
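> >
> > In the meantime, a plain-PETSc way to inspect a->lvec on the host (VecGetArrayRead
> > on a CUDA vector triggers the device-to-host copy) might be:
> >
> >   const PetscScalar *larray;
> >   PetscInt          i,n;
> >   ierr = VecGetLocalSize(a->lvec,&n);CHKERRQ(ierr);
> >   ierr = VecGetArrayRead(a->lvec,&larray);CHKERRQ(ierr);  /* copies the GPU data to the host if needed */
> >   for (i=0; i<n; i++) {
> >     ierr = PetscPrintf(PETSC_COMM_SELF,"lvec[%D] = %g\n",i,(double)PetscRealPart(larray[i]));CHKERRQ(ierr);
> >   }
> >   ierr = VecRestoreArrayRead(a->lvec,&larray);CHKERRQ(ierr);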
> >
> >
> >   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >   ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> >
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
> >
> > This is where I have been digging around and printing stuff.
> >
> >
> > Are you sure the problem isn't related to the "stream business"?
> >
> > I don't know what that is, but I have played around with adding
> cudaDeviceSynchronize.
> >
> >
> > /* This multiplication sequence is different sequence
> >      than the CPU version. In particular, the diagonal block
> >      multiplication kernel is launched in one stream. Then,
> >      in a separate stream, the data transfers from DeviceToHost
> >      (with MPI messaging in between), then HostToDevice are
> >      launched. Once the data transfer stream is synchronized,
> >      to ensure messaging is complete, the MatMultAdd kernel
> >      is launched in the original (MatMult) stream to protect
> >      against race conditions.
> >
> >      This sequence should only be called for GPU computation. */
> >
> > Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
> >
> > Yes, I noticed this. Same as MatMult and not correct here.
> >
> >
> > Any way to "turn off the stream business" and see if the result is then
> correct?
> >
> > How do you do that? I'm looking at the docs on streams but not sure how it's
> used here.
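> >
> > As a brute-force experiment (this is an assumption about how to disable it,
> > not a known switch), one could synchronize the whole device around the
> > scatter so the kernels and the transfers cannot overlap:
> >
> >   cudaError_t cerr;
> >   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);  /* make sure the local multiply has finished */
> >   /* ... VecScatterBegin/VecScatterEnd (device-to-host, MPI, host-to-device) ... */
> >   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);  /* make sure the transfers are done before the next kernel */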
> >
> > Perhaps the stream business was done correctly for MatMult() but was
> never right for MatMultTranspose()?
> >
> > Barry
> >
> > BTW: Unrelated comment, the code
> >
> >   ierr = VecSet(yy,0);CHKERRQ(ierr);
> >   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
> >
> > has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> them all yourself, so setting them to zero before calling
> VecCUDAGetArrayWrite() does nothing except waste time.
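> >
> > A minimal corrected sketch of that spot (assuming the subsequent kernel
> > writes every entry of yarray):
> >
> >   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);  /* previous contents of yy are treated as undefined */
> >   /* ... CUSPARSE call / kernel that fills all of yarray ... */
> >   ierr = VecCUDARestoreArrayWrite(yy,&yarray);CHKERRQ(ierr);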
> >
> >
> > OK, I'll get rid of it.
> >
> >
> > > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
> > >
> > > I am stumped with this GPU bug(s). Maybe someone has an idea.
> > >
> > > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> detected, but I still have differences between the GPU and CPU transpose
> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
> with two processors. It works on one processor or with cg/none. So it is
> the transpose mat-vec.
> > >
> > > I see that the result of the off-diagonal (a->lvec) is different only on
> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of the mat
> and vec and printed out MATLAB vectors. Below is the CPU output and then
> the GPU with a view of the scatter object, which is identical as you can
> see.
> > >
> > > The MATLAB B matrix and xx vector are identical. Maybe the GPU copy is
> wrong ...
> > >
> > > The only/first difference between CPU and GPU is a->lvec (the
> off-diagonal contribution) on processor 1 (you can see the norms are
> different). Here is the diff on the process 1 a->lvec vector (all values
> are off).
> > >
> > > Any thoughts would be appreciated,
> > > Mark
> > >
> > > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > > 2,12c2,12
> > > < %  type: seqcuda
> > > < Vec_0x53738630_0 = [
> > > < 9.5702137431412879e+00
> > > < 2.1970298791152253e+01
> > > < 4.5422290209190646e+00
> > > < 2.0185031807270226e+00
> > > < 4.2627312508573375e+01
> > > < 1.0889191983882025e+01
> > > < 1.6038202417695462e+01
> > > < 2.7155672033607665e+01
> > > < 6.2540357853223556e+00
> > > ---
> > > > %  type: seq
> > > > Vec_0x3a546440_0 = [
> > > > 4.5565851251714653e+00
> > > > 1.0460532998971189e+01
> > > > 2.1626531807270220e+00
> > > > 9.6105288923182408e-01
> > > > 2.0295782656035659e+01
> > > > 5.1845791066529463e+00
> > > > 7.6361340020576058e+00
> > > > 1.2929401011659799e+01
> > > > 2.9776812928669392e+00
> > >
> > > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g
> 1 ./ex56 -cells 2,2,1
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02
> > >     0 KSP Residual norm 1.223958326481e+02
> > > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
> 1.424708937136e+00
> > > [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=
> 1.424708937136e+00
> > > [1] 1) |yy|=  2.007423334680e+02
> > > [0] 1) |yy|=  2.007423334680e+02
> > > [0] 2) |yy|=  1.957605719265e+02
> > > [1] 2) |yy|=  1.957605719265e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9
> > > [1] 10
> > > [1] 11
> > > [1] 12
> > > [1] 13
> > > [1] 14
> > > [1] 15
> > > [1] 16
> > > [1] 17
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received
> from)
> > > [1] 0
> > > [1] 1
> > > [1] 2
> > > [1] 3
> > > [1] 4
> > > [1] 5
> > > [1] 6
> > > [1] 7
> > > [1] 8
> > >     1 KSP Residual norm 8.199932342150e+01
> > >   Linear solve did not converge due to DIVERGED_ITS iterations 1
> > > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
> iterations 0
> > >
> > >
> > > 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
> ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02
> > >     0 KSP Residual norm 1.223958326481e+02
> > > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
> 1.424708937136e+00
> > > [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=
> 1.424708937136e+00
> > > [0] 1) |yy|=  2.007423334680e+02
> > > [1] 1) |yy|=  2.007423334680e+02
> > > [0] 2) |yy|=  1.953571867633e+02
> > > [1] 2) |yy|=  1.953571867633e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9
> > > [1] 10
> > > [1] 11
> > > [1] 12
> > > [1] 13
> > > [1] 14
> > > [1] 15
> > > [1] 16
> > > [1] 17
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received
> from)
> > > [1] 0
> > > [1] 1
> > > [1] 2
> > > [1] 3
> > > [1] 4
> > > [1] 5
> > > [1] 6
> > > [1] 7
> > > [1] 8
> > >     1 KSP Residual norm 8.199932342150e+01
> > >
> >
> >
> >
> > --
> > Stefano
>
>

-- 
Stefano