[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Smith, Barry F. bsmith at mcs.anl.gov
Wed Jul 10 15:09:24 CDT 2019


  My concerns are:

1) Is it actually optimally efficient for all cases? This kind of stuff, IMHO,

    if (yy) {
      if (dptr != zarray) {
        ierr = VecCopy_SeqCUDA(yy,zz);CHKERRQ(ierr);
      } else if (zz != yy) {
        ierr = VecAXPY_SeqCUDA(zz,1.0,yy);CHKERRQ(ierr);
      }
    } else if (dptr != zarray) {
      ierr = VecSet_SeqCUDA(zz,0);CHKERRQ(ierr);
    }

means it is not. It launches additional kernels and loops over arrays more times than if each form were optimized for its own case.

2) Is it utilizing VecCUDAGetArrayWrite() when possible? No, it uses VecCUDAGetArray(), which for certain configurations means copying data from the CPU that will immediately be overwritten. Sometimes it can use VecCUDAGetArrayWrite() and sometimes it can't; the code has to handle each case properly.
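For illustration, here is a minimal sketch of the write-only pattern I mean (not the actual MatMult code; it assumes the usual PETSc CUDA vector headers and elides the kernel launch): when the output vector is entirely overwritten, VecCUDAGetArrayWrite() avoids copying stale data to the GPU.

    /* Sketch only: zz is fully overwritten, so its old contents never need to
       be copied to the GPU; xx is accessed read-only. */
    PetscErrorCode WriteOnlyMultSketch(Vec xx,Vec zz)
    {
      PetscErrorCode    ierr;
      const PetscScalar *xarray;
      PetscScalar       *zarray;

      PetscFunctionBegin;
      ierr = VecCUDAGetArrayRead(xx,&xarray);CHKERRQ(ierr);
      ierr = VecCUDAGetArrayWrite(zz,&zarray);CHKERRQ(ierr);  /* old zz values are ignored */
      /* ... launch the kernel that writes every entry of zarray ... */
      ierr = VecCUDARestoreArrayRead(xx,&xarray);CHKERRQ(ierr);
      ierr = VecCUDARestoreArrayWrite(zz,&zarray);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }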

3) Is comparison between pointers appropriate? For example, if (dptr != zarray) { is scary: if some arrays are zero length, how do we know what the pointer value will be?


  I am not saying it is totally impossible to have a single routine that handles all cases (MatMult, yy == zz, etc.) with optimal efficiency, but the resulting code will be very complex, with lots of if()s, and difficult to understand and maintain; just tracing through all the cases and ensuring each is optimal is nontrivial.
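  As a concrete sketch, signatures only (the first two names already exist in PETSc; the third is hypothetical, just to show the split), each case gets its own entry point and can pick the cheapest accessors and a single kernel launch for its own situation:

    PetscErrorCode MatMult_SeqAIJCUSPARSE(Mat A,Vec xx,Vec zz);                      /* zz = A*xx             */
    PetscErrorCode MatMultAdd_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy,Vec zz);            /* zz = A*xx + yy        */
    PetscErrorCode MatMultAddCompressed_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy,Vec zz);  /* compressed-row case   */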

   Barry

> On Jul 10, 2019, at 11:01 AM, Stefano Zampini <stefano.zampini at gmail.com> wrote:
> 
> Barry,
> 
> I think having a single code path instead of three different, quasi-similar versions is less fragile (I admit, once you get the logic correct...).
> Also, it conforms to the standard SpMV interface, which implements alpha * A * x + beta * b.
> The easiest fix is the following: 
> 
> Rename MatMultAdd_ to MatMultKernel_Private and add an extra boolean to control the transpose operation;
> then you can reuse the same complicated code I wrote, just by selecting the proper cusparse object (matstructT or matstruct).
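> For instance (just a sketch of the idea, not the actual code; the kernel body and the yy handling are elided):
> 
>     static PetscErrorCode MatMultKernel_Private(Mat A,Vec xx,Vec yy,Vec zz,PetscBool trans)
>     {
>       PetscFunctionBegin;
>       /* select matstructT when trans is PETSC_TRUE, matstruct otherwise,
>          then compute zz = op(A)*xx, adding yy when yy is non-NULL */
>       PetscFunctionReturn(0);
>     }
> 
>     PetscErrorCode MatMultAdd_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy,Vec zz)
>     {
>       PetscErrorCode ierr;
> 
>       PetscFunctionBegin;
>       ierr = MatMultKernel_Private(A,xx,yy,zz,PETSC_FALSE);CHKERRQ(ierr);
>       PetscFunctionReturn(0);
>     }
> 
>     PetscErrorCode MatMultTransposeAdd_SeqAIJCUSPARSE(Mat A,Vec xx,Vec yy,Vec zz)
>     {
>       PetscErrorCode ierr;
> 
>       PetscFunctionBegin;
>       ierr = MatMultKernel_Private(A,xx,yy,zz,PETSC_TRUE);CHKERRQ(ierr);
>       PetscFunctionReturn(0);
>     }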
> 
> 
> Il giorno mer 10 lug 2019 alle ore 18:16 Smith, Barry F. <bsmith at mcs.anl.gov> ha scritto:
> 
>    In the long run I would like to see smaller specialized chunks of code (with a bit of duplication between them) instead of highly overloaded routines like MatMultAdd_AIJCUSPARSE. Better three routines: one for multiply alone, one for multiply-add alone, and one for multiply-add with the compressed-row format. Trying to get all the cases right (performance and correctness) for everything at once is unnecessary and risky. Having possibly zero-size objects (and hence null pointers) doesn't help the complex logic.
> 
> 
>    Barry
> 
> 
> > On Jul 10, 2019, at 10:06 AM, Mark Adams <mfadams at lbl.gov> wrote:
> > 
> > Thanks, you made several changes here, including switches on the work-vector size. I guess I should import this logic into the transpose method(s), except for the yy==NULL branches ...
> > 
> > MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.
> > 
> > Thanks again,
> > Mark
> > 
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <stefano.zampini at gmail.com> wrote:
> > Mark,
> > 
> > If the difference is in lvec, I suspect the bug has to do with compressed row storage; I fixed a similar bug in MatMult.
> > You want to check cusparsestruct->workVector->size() against A->cmap->n.
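> > Something like this, just as a sketch of the guard I mean (member names as above):
> > 
> >     if ((PetscInt)cusparsestruct->workVector->size() == A->cmap->n) {
> >       /* the work vector covers all the columns: use the result directly */
> >     } else {
> >       /* compressed-row case: the result must be scattered to the proper entries */
> >     }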
> > 
> > Stefano 
> > 
> > Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> ha scritto:
> > 
> > 
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> > 
> >   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> >   if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
> >   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> > 
> >     So the xx on the GPU appears ok?
> > 
> > The norm is correct and ...
> >  
> > The a->B appears ok?
> > 
> > yes
> >  
> > But on process 1 the result a->lvec is wrong? 
> > 
> > yes
> > 
> > 
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> > 
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented for CUDA, so I should copy it to the CPU. Maybe I should make a CUDA version of these methods?
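> > Something like this for now, I guess (just a sketch; for a CUDA vec, VecGetArrayRead pulls the values back to the host):
> > 
> >     const PetscScalar *v;
> >     PetscInt          n,i;
> > 
> >     ierr = VecGetLocalSize(a->lvec,&n);CHKERRQ(ierr);
> >     ierr = VecGetArrayRead(a->lvec,&v);CHKERRQ(ierr);  /* device-to-host copy happens here if needed */
> >     for (i=0; i<n; i++) {
> >       ierr = PetscPrintf(PETSC_COMM_SELF,"%g\n",(double)PetscRealPart(v[i]));CHKERRQ(ierr);
> >     }
> >     ierr = VecRestoreArrayRead(a->lvec,&v);CHKERRQ(ierr);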
> >  
> > 
> >   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >   ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> > 
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? 
> > 
> > This is where I have been digging around and printing stuff.
> >  
> > 
> > Are you sure the problem isn't related to the "stream business"? 
> > 
> > I don't know what that is, but I have played around with adding cudaDeviceSynchronize.
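> > Roughly like this, just to serialize everything on the device while debugging (a blunt way to rule out stream-ordering problems):
> > 
> >     cudaError_t cerr;
> > 
> >     cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);  /* wait for all previously launched GPU work */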
> >  
> > 
> > /* This multiplication sequence is different sequence
> >      than the CPU version. In particular, the diagonal block
> >      multiplication kernel is launched in one stream. Then,
> >      in a separate stream, the data transfers from DeviceToHost
> >      (with MPI messaging in between), then HostToDevice are
> >      launched. Once the data transfer stream is synchronized,
> >      to ensure messaging is complete, the MatMultAdd kernel
> >      is launched in the original (MatMult) stream to protect
> >      against race conditions.
> > 
> >      This sequence should only be called for GPU computation. */
> > 
> > Note this comment isn't right and appears to have been cut and pasted from somewhere else, since there is no MatMult() nor MatMultAdd kernel here.
> > 
> > Yes, I noticed this. Same as MatMult and not correct here.
> >  
> > 
> > Anyway to "turn off the stream business" and see if the result is then correct? 
> > 
> > How do you do that? I'm looking at the docs on streams but am not sure how they're used here.
> >  
> > Perhaps the stream business was done correctly for MatMult() but was never right for MatMultTranspose()?
> > 
> > Barry
> > 
> > BTW: Unrelated comment, the code
> > 
> >   ierr = VecSet(yy,0);CHKERRQ(ierr);
> >   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
> > 
> > has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. VecCUDAGetArrayWrite() requires that you ignore the existing values in yy and set them all yourself, so setting them to zero before calling VecCUDAGetArrayWrite() does nothing except waste time.
> > 
> > 
> > OK, I'll get rid of it.
> >  
> > 
> > > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> > > 
> > > I am stumped with this GPU bug(s). Maybe someone has an idea.
> > > 
> > > I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors. It works on one processor or with cg/none. So it is the transpose mat-vec.
> > > 
> > > I see that the result of the off-diagonal part (a->lvec) is different only on proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of the mat and vec and printed out MATLAB vectors. Below is the CPU output and then the GPU output, with a view of the scatter object, which is identical as you can see.
> > > 
> > > The matlab B matrix and xx vector are identical. Maybe the GPU copy is wrong ...
> > > 
> > > The only/first difference between CPU and GPU is a->lvec (the off-diagonal contribution) on processor 1 (you can see the norms are different). Here is the diff on the process 1 a->lvec vector (all values are off).
> > > 
> > > Any thoughts would be appreciated,
> > > Mark
> > > 
> > > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > > 2,12c2,12
> > > < %  type: seqcuda
> > > < Vec_0x53738630_0 = [
> > > < 9.5702137431412879e+00
> > > < 2.1970298791152253e+01
> > > < 4.5422290209190646e+00
> > > < 2.0185031807270226e+00
> > > < 4.2627312508573375e+01
> > > < 1.0889191983882025e+01
> > > < 1.6038202417695462e+01
> > > < 2.7155672033607665e+01
> > > < 6.2540357853223556e+00
> > > ---
> > > > %  type: seq
> > > > Vec_0x3a546440_0 = [
> > > > 4.5565851251714653e+00
> > > > 1.0460532998971189e+01
> > > > 2.1626531807270220e+00
> > > > 9.6105288923182408e-01
> > > > 2.0295782656035659e+01
> > > > 5.1845791066529463e+00
> > > > 7.6361340020576058e+00
> > > > 1.2929401011659799e+01
> > > > 2.9776812928669392e+00
> > > 
> > > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02 
> > >     0 KSP Residual norm 1.223958326481e+02 
> > > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> > > [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=  1.424708937136e+00
> > > [1] 1) |yy|=  2.007423334680e+02
> > > [0] 1) |yy|=  2.007423334680e+02
> > > [0] 2) |yy|=  1.957605719265e+02
> > > [1] 2) |yy|=  1.957605719265e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9 
> > > [1] 10 
> > > [1] 11 
> > > [1] 12 
> > > [1] 13 
> > > [1] 14 
> > > [1] 15 
> > > [1] 16 
> > > [1] 17 
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received from)
> > > [1] 0 
> > > [1] 1 
> > > [1] 2 
> > > [1] 3 
> > > [1] 4 
> > > [1] 5 
> > > [1] 6 
> > > [1] 7 
> > > [1] 8 
> > >     1 KSP Residual norm 8.199932342150e+01 
> > >   Linear solve did not converge due to DIVERGED_ITS iterations 1
> > > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> > > 
> > > 
> > > 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02 
> > >     0 KSP Residual norm 1.223958326481e+02 
> > > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> > > [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=  1.424708937136e+00
> > > [0] 1) |yy|=  2.007423334680e+02
> > > [1] 1) |yy|=  2.007423334680e+02
> > > [0] 2) |yy|=  1.953571867633e+02
> > > [1] 2) |yy|=  1.953571867633e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9 
> > > [1] 10 
> > > [1] 11 
> > > [1] 12 
> > > [1] 13 
> > > [1] 14 
> > > [1] 15 
> > > [1] 16 
> > > [1] 17 
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received from)
> > > [1] 0 
> > > [1] 1 
> > > [1] 2 
> > > [1] 3 
> > > [1] 4 
> > > [1] 5 
> > > [1] 6 
> > > [1] 7 
> > > [1] 8 
> > >     1 KSP Residual norm 8.199932342150e+01 
> > >  
> > 
> > 
> > 
> > -- 
> > Stefano
> 
> 
> 
> -- 
> Stefano


