[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Smith, Barry F. bsmith at mcs.anl.gov
Wed Jul 10 10:16:37 CDT 2019


   In the long run I would like to see smaller, specialized chunks of code (with a bit of duplication between them) instead of highly overloaded routines like MatMultAdd_AIJCUSPARSE. Better to have three routines: one for multiply alone, one for multiply-add alone, and one for multiply-add with the compressed sparse format. Trying to get all the cases right (performance and correctness) for everything at once is unnecessary and risky, and having possibly zero-size objects (and hence null pointers) doesn't help the complex logic.
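
   A rough sketch of the kind of split I have in mind (the routine names below are
   hypothetical, not the existing code):

   #include <petscmat.h>

   /* y = A*x: plain multiply, no add, no compressed-row handling */
   PetscErrorCode MatMult_SeqAIJCUSPARSE_Plain(Mat A,Vec xx,Vec yy);
   /* z = y + A*x: multiply-add with ordinary (uncompressed) storage */
   PetscErrorCode MatMultAdd_SeqAIJCUSPARSE_Plain(Mat A,Vec xx,Vec yy,Vec zz);
   /* z = y + A*x: multiply-add with compressed row storage; only this routine
      has to worry about the work vector and possibly empty (null) objects */
   PetscErrorCode MatMultAdd_SeqAIJCUSPARSE_CompressedRow(Mat A,Vec xx,Vec yy,Vec zz);

   Each routine then handles exactly one case, so the special-case logic stays local.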


   Barry


> On Jul 10, 2019, at 10:06 AM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> Thanks, you made several changes here, including switches on the workVector size. I guess I should port this logic to the transpose method(s), except for the yy==NULL branches ...
> 
> MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple. 
> 
> Thanks again,
> Mark
> 
> On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <stefano.zampini at gmail.com> wrote:
> Mark,
> 
> If the difference is in lvec, I suspect the bug has to do with compressed row storage. I have fixed a similar bug in MatMult.
> You want to check cusparsestruct->workVector->size() against A->cmap->n.
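> 
> Roughly, the check I mean looks like this (just a sketch, inside the .cu code where
> cusparsestruct is already in scope; the branch bodies are placeholders, not the exact fix):
> 
>   /* a work vector shorter than the expected output length signals the
>      compressed-row code path, which needs its own handling */
>   if ((PetscInt)cusparsestruct->workVector->size() != A->cmap->n) {
>     /* compressed-row case */
>   } else {
>     /* ordinary case */
>   }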
> 
> Stefano 
> 
> On Wed, Jul 10, 2019 at 15:54 Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> 
> 
> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> 
>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>   if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> 
>     So the xx on the GPU appears ok?
> 
> The norm is correct and ...
>  
> The a->B appears ok?
> 
> yes
>  
> But on process 1 the result a->lvec is wrong? 
> 
> yes
> 
> 
> How do you look at the a->lvec? Do you copy it to the CPU and print it?
> 
> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I should copy it. Maybe I should make a CUDA version of these methods?
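> 
> For a quick look, copying it by hand would be something like this (a sketch; it relies
> on VecGetArrayRead() pulling the CUDA values back to the host):
> 
>   const PetscScalar *lv;
>   PetscInt          i,n;
>   ierr = VecGetLocalSize(a->lvec,&n);CHKERRQ(ierr);
>   ierr = VecGetArrayRead(a->lvec,&lv);CHKERRQ(ierr); /* copies device -> host if needed */
>   for (i=0; i<n; i++) {
>     ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,"[lvec] %D %g\n",i,(double)PetscRealPart(lv[i]));CHKERRQ(ierr);
>   }
>   ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD,PETSC_STDOUT);CHKERRQ(ierr);
>   ierr = VecRestoreArrayRead(a->lvec,&lv);CHKERRQ(ierr);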
>  
> 
>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> 
> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? 
> 
> This is where I have been digging around and printing stuff.
>  
> 
> Are you sure the problem isn't related to the "stream business"? 
> 
> I don't know what that is, but I have played around with adding cudaDeviceSynchronize.
>  
> 
> /* This multiplication sequence is different sequence
>      than the CPU version. In particular, the diagonal block
>      multiplication kernel is launched in one stream. Then,
>      in a separate stream, the data transfers from DeviceToHost
>      (with MPI messaging in between), then HostToDevice are
>      launched. Once the data transfer stream is synchronized,
>      to ensure messaging is complete, the MatMultAdd kernel
>      is launched in the original (MatMult) stream to protect
>      against race conditions.
> 
>      This sequence should only be called for GPU computation. */
> 
> Note this comment isn't right and appears to be cut and pasted from somewhere else, since there is no MatMult() or MatMultAdd() kernel here?
> 
> Yes, I noticed this. Same as MatMult and not correct here.
>  
> 
> Any way to "turn off the stream business" and see if the result is then correct? 
> 
> How do you do that? I'm looking at docs on streams but not sure how it's used here.
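> 
> So far I have only tried brute-force synchronization; I guess the other option is to put
> everything back on the default stream, something like this (a sketch; cusparsestruct->handle
> is the field name I am assuming the cuSPARSE handle lives in):
> 
>   cusparseStatus_t stat;
>   cudaError_t      cerr;
>   /* run all cuSPARSE calls on the default (NULL) stream so kernels and
>      transfers no longer overlap in separate streams */
>   stat = cusparseSetStream(cusparsestruct->handle,NULL);
>   if (stat != CUSPARSE_STATUS_SUCCESS) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"cusparseSetStream() failed");
>   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* heavy-handed barrier while debugging */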
>  
> Perhaps the stream business was done correctly for MatMult() but was never right for MatMultTranspose()?
> 
> Barry
> 
> BTW: Unrelated comment, the code
> 
>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
> 
> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. VecCUDAGetArrayWrite() requires that you ignore the values in yy and set them all yourself, so setting them to zero before calling VecCUDAGetArrayWrite() does nothing except waste time.
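> 
> In other words (assuming the kernel really does write every entry of yarray) all that is
> needed is
> 
>   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
>   /* ... kernel fills every entry of yarray ... */
>   ierr = VecCUDARestoreArrayWrite(yy,&yarray);CHKERRQ(ierr);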
> 
> 
> OK, I'll get rid of it.
>  
> 
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> > 
> > I am stumped by this GPU bug (or bugs). Maybe someone has an idea.
> > 
> > I did find a bug in the CUDA transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors. It works on one processor or with cg/none, so it is the transpose mat-vec.
> > 
> > I see that the result of the off-diagonal part (a->lvec) is different only on proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of the mat and vec and printed out MATLAB vectors. Below is the CPU output and then the GPU output, with a view of the scatter object, which is identical, as you can see.
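> > 
> > The instrumentation is nothing fancy, roughly this (a sketch of what I added, not the
> > exact lines):
> > 
> >   PetscMPIInt rank;
> >   PetscReal   xnorm,lnorm,Bnorm;
> >   ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
> >   ierr = VecNorm(xx,NORM_2,&xnorm);CHKERRQ(ierr);
> >   ierr = VecNorm(a->lvec,NORM_2,&lnorm);CHKERRQ(ierr);
> >   ierr = MatNorm(a->B,NORM_FROBENIUS,&Bnorm);CHKERRQ(ierr);
> >   ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,"[%d] |x|= %e |a->lvec|= %e |B|= %e\n",rank,(double)xnorm,(double)lnorm,(double)Bnorm);CHKERRQ(ierr);
> >   ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD,PETSC_STDOUT);CHKERRQ(ierr);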
> > 
> > The MATLAB B matrix and xx vector are identical. Maybe the GPU copy is wrong ...
> > 
> > The only/first difference between the CPU and GPU is a->lvec (the off-diagonal contribution) on processor 1 (you can see the norms are different). Here is the diff of the process 1 a->lvec vector (all values are off).
> > 
> > Any thoughts would be appreciated,
> > Mark
> > 
> > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > 2,12c2,12
> > < %  type: seqcuda
> > < Vec_0x53738630_0 = [
> > < 9.5702137431412879e+00
> > < 2.1970298791152253e+01
> > < 4.5422290209190646e+00
> > < 2.0185031807270226e+00
> > < 4.2627312508573375e+01
> > < 1.0889191983882025e+01
> > < 1.6038202417695462e+01
> > < 2.7155672033607665e+01
> > < 6.2540357853223556e+00
> > ---
> > > %  type: seq
> > > Vec_0x3a546440_0 = [
> > > 4.5565851251714653e+00
> > > 1.0460532998971189e+01
> > > 2.1626531807270220e+00
> > > 9.6105288923182408e-01
> > > 2.0295782656035659e+01
> > > 5.1845791066529463e+00
> > > 7.6361340020576058e+00
> > > 1.2929401011659799e+01
> > > 2.9776812928669392e+00
> > 
> > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 
> > [0] 27 global equations, 9 vertices
> > [0] 27 equations in vector, 9 vertices
> >   0 SNES Function norm 1.223958326481e+02 
> >     0 KSP Residual norm 1.223958326481e+02 
> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> > [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=  1.424708937136e+00
> > [1] 1) |yy|=  2.007423334680e+02
> > [0] 1) |yy|=  2.007423334680e+02
> > [0] 2) |yy|=  1.957605719265e+02
> > [1] 2) |yy|=  1.957605719265e+02
> > [1] Number sends = 1; Number to self = 0
> > [1]   0 length = 9 to whom 0
> > Now the indices for all remote sends (in order by process sent to)
> > [1] 9 
> > [1] 10 
> > [1] 11 
> > [1] 12 
> > [1] 13 
> > [1] 14 
> > [1] 15 
> > [1] 16 
> > [1] 17 
> > [1] Number receives = 1; Number from self = 0
> > [1] 0 length 9 from whom 0
> > Now the indices for all remote receives (in order by process received from)
> > [1] 0 
> > [1] 1 
> > [1] 2 
> > [1] 3 
> > [1] 4 
> > [1] 5 
> > [1] 6 
> > [1] 7 
> > [1] 8 
> >     1 KSP Residual norm 8.199932342150e+01 
> >   Linear solve did not converge due to DIVERGED_ITS iterations 1
> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> > 
> > 
> > 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> > [0] 27 global equations, 9 vertices
> > [0] 27 equations in vector, 9 vertices
> >   0 SNES Function norm 1.223958326481e+02 
> >     0 KSP Residual norm 1.223958326481e+02 
> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> > [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=  1.424708937136e+00
> > [0] 1) |yy|=  2.007423334680e+02
> > [1] 1) |yy|=  2.007423334680e+02
> > [0] 2) |yy|=  1.953571867633e+02
> > [1] 2) |yy|=  1.953571867633e+02
> > [1] Number sends = 1; Number to self = 0
> > [1]   0 length = 9 to whom 0
> > Now the indices for all remote sends (in order by process sent to)
> > [1] 9 
> > [1] 10 
> > [1] 11 
> > [1] 12 
> > [1] 13 
> > [1] 14 
> > [1] 15 
> > [1] 16 
> > [1] 17 
> > [1] Number receives = 1; Number from self = 0
> > [1] 0 length 9 from whom 0
> > Now the indices for all remote receives (in order by process received from)
> > [1] 0 
> > [1] 1 
> > [1] 2 
> > [1] 3 
> > [1] 4 
> > [1] 5 
> > [1] 6 
> > [1] 7 
> > [1] 8 
> >     1 KSP Residual norm 8.199932342150e+01 
> >  
> 
> 
> 
> -- 
> Stefano


