[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Smith, Barry F. bsmith at mcs.anl.gov
Wed Jul 10 00:13:35 CDT 2019


  ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
  if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
  ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
  ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);

    So the xx on the GPU appears ok? The a->B appears ok? But on process 1 the result a->lvec is wrong? 

How do you look at the a->lvec? Do you copy it to the CPU and print it?
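
For example, a minimal sketch of one way to do that: VecView() on a VECSEQCUDA copies the values back to the host before printing, so it shows what the GPU actually holds (the rank variable here is just to label the output; it is not in the original code).

  PetscMPIInt rank;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF,"[%d] a->lvec:\n",rank);CHKERRQ(ierr);
  ierr = VecView(a->lvec,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);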

  ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);

Does digging around in MatMultTranspose_SeqAIJCUSPARSE help?
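
If not, one hypothetical cross-check (assuming MatConvert() from AIJCUSPARSE back to plain AIJ works for this matrix; the Bhost/lhost names are just for this sketch) is to redo the off-diagonal product with the CPU kernel and diff the two results:

  Mat       Bhost;
  Vec       lhost;
  PetscReal nrm;
  ierr = MatConvert(a->B,MATSEQAIJ,MAT_INITIAL_MATRIX,&Bhost);CHKERRQ(ierr);
  ierr = VecDuplicate(a->lvec,&lhost);CHKERRQ(ierr);
  ierr = MatMultTranspose(Bhost,xx,lhost);CHKERRQ(ierr);  /* CPU kernel, same data */
  ierr = VecAXPY(lhost,-1.0,a->lvec);CHKERRQ(ierr);       /* lhost = CPU result - GPU result */
  ierr = VecNorm(lhost,NORM_INFINITY,&nrm);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF,"max CPU/GPU lvec difference %g\n",(double)nrm);CHKERRQ(ierr);
  ierr = VecDestroy(&lhost);CHKERRQ(ierr);
  ierr = MatDestroy(&Bhost);CHKERRQ(ierr);

A nonzero difference here would pin the bug inside the cuSPARSE transpose multiply rather than in the scatter.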

Are you sure the problem isn't related to the "stream business"? 

/* This multiplication sequence is different sequence
     than the CPU version. In particular, the diagonal block
     multiplication kernel is launched in one stream. Then,
     in a separate stream, the data transfers from DeviceToHost
     (with MPI messaging in between), then HostToDevice are
     launched. Once the data transfer stream is synchronized,
     to ensure messaging is complete, the MatMultAdd kernel
     is launched in the original (MatMult) stream to protect
     against race conditions.

     This sequence should only be called for GPU computation. */

Note this comment isn't right and appears to have been cut and pasted from somewhere else, since there is no MatMult() nor MatMultAdd() kernel here.

Is there any way to "turn off the stream business" and see if the result is then correct? Perhaps the stream business was done correctly for MatMult() but was never right for MatMultTranspose()?
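
For example (purely as a debugging experiment, not a fix), forcing a full device synchronization around the pieces serializes everything no matter which streams the kernels were launched on:

  cudaError_t cerr;
  ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);  /* make sure B^T x is finished before scattering */
  ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);  /* make sure the scatter's transfers are finished */

If the answers become correct with the synchronizations in place, that points to a missing stream synchronization in the current code.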

Barry

BTW, an unrelated comment: the code

  ierr = VecSet(yy,0);CHKERRQ(ierr);
  ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);

has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr);. VecCUDAGetArrayWrite() requires that you ignore the current values in yy and set all of them yourself, so setting them to zero before calling VecCUDAGetArrayWrite() does nothing except waste time.
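
That is, the intended pattern is just (a sketch; the kernel must write every entry of yarray):

  ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
  /* ... launch kernel(s) that set ALL entries of yarray ... */
  ierr = VecCUDARestoreArrayWrite(yy,&yarray);CHKERRQ(ierr);

If the kernel only adds into some of the entries, then VecCUDAGetArray() (which preserves the current values, including the zeros from VecSet()) is the right call instead of the Write variant.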


> On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> 
> I am stumped with this GPU bug(s). Maybe someone has an idea.
> 
> I did find a bug in the CUDA transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors. It works on one processor or with cg/none. So it is the transpose mat-vec.
> 
> I see that the result of the off-diagonal part (a->lvec) is different only on proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of the mat and vec and printed out MATLAB vectors. Below is the CPU output and then the GPU output, with a view of the scatter object, which is identical as you can see.
> 
> The MATLAB B matrix and xx vector are identical. Maybe the GPU copy is wrong ...
> 
> The only/first difference between CPU and GPU is a->lvec (the off-diagonal contribution) on processor 1 (you can see the norms are different). Here is the diff of the process 1 a->lvec vector (all values are off).
> 
> Any thoughts would be appreciated,
> Mark
> 
> 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> 2,12c2,12
> < %  type: seqcuda
> < Vec_0x53738630_0 = [
> < 9.5702137431412879e+00
> < 2.1970298791152253e+01
> < 4.5422290209190646e+00
> < 2.0185031807270226e+00
> < 4.2627312508573375e+01
> < 1.0889191983882025e+01
> < 1.6038202417695462e+01
> < 2.7155672033607665e+01
> < 6.2540357853223556e+00
> ---
> > %  type: seq
> > Vec_0x3a546440_0 = [
> > 4.5565851251714653e+00
> > 1.0460532998971189e+01
> > 2.1626531807270220e+00
> > 9.6105288923182408e-01
> > 2.0295782656035659e+01
> > 5.1845791066529463e+00
> > 7.6361340020576058e+00
> > 1.2929401011659799e+01
> > 2.9776812928669392e+00
> 
> 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 
> [0] 27 global equations, 9 vertices
> [0] 27 equations in vector, 9 vertices
>   0 SNES Function norm 1.223958326481e+02 
>     0 KSP Residual norm 1.223958326481e+02 
> [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=  1.424708937136e+00
> [1] 1) |yy|=  2.007423334680e+02
> [0] 1) |yy|=  2.007423334680e+02
> [0] 2) |yy|=  1.957605719265e+02
> [1] 2) |yy|=  1.957605719265e+02
> [1] Number sends = 1; Number to self = 0
> [1]   0 length = 9 to whom 0
> Now the indices for all remote sends (in order by process sent to)
> [1] 9 
> [1] 10 
> [1] 11 
> [1] 12 
> [1] 13 
> [1] 14 
> [1] 15 
> [1] 16 
> [1] 17 
> [1] Number receives = 1; Number from self = 0
> [1] 0 length 9 from whom 0
> Now the indices for all remote receives (in order by process received from)
> [1] 0 
> [1] 1 
> [1] 2 
> [1] 3 
> [1] 4 
> [1] 5 
> [1] 6 
> [1] 7 
> [1] 8 
>     1 KSP Residual norm 8.199932342150e+01 
>   Linear solve did not converge due to DIVERGED_ITS iterations 1
> Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> 
> 
> 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> [0] 27 global equations, 9 vertices
> [0] 27 equations in vector, 9 vertices
>   0 SNES Function norm 1.223958326481e+02 
>     0 KSP Residual norm 1.223958326481e+02 
> [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=  1.424708937136e+00
> [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=  1.424708937136e+00
> [0] 1) |yy|=  2.007423334680e+02
> [1] 1) |yy|=  2.007423334680e+02
> [0] 2) |yy|=  1.953571867633e+02
> [1] 2) |yy|=  1.953571867633e+02
> [1] Number sends = 1; Number to self = 0
> [1]   0 length = 9 to whom 0
> Now the indices for all remote sends (in order by process sent to)
> [1] 9 
> [1] 10 
> [1] 11 
> [1] 12 
> [1] 13 
> [1] 14 
> [1] 15 
> [1] 16 
> [1] 17 
> [1] Number receives = 1; Number from self = 0
> [1] 0 length 9 from whom 0
> Now the indices for all remote receives (in order by process received from)
> [1] 0 
> [1] 1 
> [1] 2 
> [1] 3 
> [1] 4 
> [1] 5 
> [1] 6 
> [1] 7 
> [1] 8 
>     1 KSP Residual norm 8.199932342150e+01 
>  


