[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Wed Jul 10 10:06:08 CDT 2019

Thanks, you made several changes here, including switches with the
workvector size. I guess I should import this logic to the transpose
method(s), except for the yy==NULL branches ...

MatMult_ calls MatMultAdd with yy=0, but the transpose version have their
own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.

Thanks again,
Mark

On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:

> Mark,
>
> if the difference is on lvec, I suspect the bug has to do with compressed
> row storage. I have fixed a similar bug in MatMult.
> you want to check cusparsestruct->workVector->size() against A->cmap->n.
>
> Stefano
>
> Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev <
> petsc-dev at mcs.anl.gov> ha scritto:
>
>>
>>
>> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>>
>>>
>>>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>>>   if (nt != A->rmap->n)
>>> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
>>> (%D) and xx (%D)",A->rmap->n,nt);
>>>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>>>
>>>     So the xx on the GPU appears ok?
>>
>>
>> The norm is correct and ...
>>
>>
>>> The a->B appears ok?
>>
>>
>> yes
>>
>>
>>> But on process 1 the result a->lvec is wrong?
>>>
>>
>> yes
>>
>>
>>> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>>>
>>
>> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
>> should copy it. Maybe I should make a CUDA version of these methods?
>>
>>
>>>
>>>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>>>
>>> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
>>
>>
>> This is where I have been digging around an printing stuff.
>>
>>
>>>
>>> Are you sure the problem isn't related to the "stream business"?
>>>
>>
>> I don't know what that is but I have played around with adding
>> cudaDeviceSynchronize
>>
>>
>>>
>>> /* This multiplication sequence is different sequence
>>>      than the CPU version. In particular, the diagonal block
>>>      multiplication kernel is launched in one stream. Then,
>>>      in a separate stream, the data transfers from DeviceToHost
>>>      (with MPI messaging in between), then HostToDevice are
>>>      launched. Once the data transfer stream is synchronized,
>>>      to ensure messaging is complete, the MatMultAdd kernel
>>>      is launched in the original (MatMult) stream to protect
>>>      against race conditions.
>>>
>>>      This sequence should only be called for GPU computation. */
>>>
>>> Note this comment isn't right and appears to be cut and paste from
>>> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>>>
>>
>> Yes, I noticed this. Same as MatMult and not correct here.
>>
>>
>>>
>>> Anyway to "turn off the stream business" and see if the result is then
>>> correct?
>>
>>
>> How do you do that? I'm looking at docs on streams but not sure how its
>> used here.
>>
>>
>>> Perhaps the stream business was done correctly for MatMult() but was
>>> never right for MatMultTranspose()?
>>>
>>> Barry
>>>
>>> BTW: Unrelated comment, the code
>>>
>>>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>>>   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
>>>
>>> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
>>> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
>>> them all yourself so setting them to zero before calling
>>> VecCUDAGetArrayWrite() does nothing except waste time.
>>>
>>>
>> OK, I'll get rid of it.
>>
>>
>>>
>>> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
>>> petsc-dev at mcs.anl.gov> wrote:
>>> >
>>> > I am stumped with this GPU bug(s). Maybe someone has an idea.
>>> >
>>> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
>>> detected, but I still have differences between the GPU and CPU transpose
>>> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
>>> with two processors. It works on one processor or with cg/none. So it is
>>> the transpose mat-vec.
>>> >
>>> > I see that the result of the off-diagonal  (a->lvec) is different only
>>> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
>>> and vec and printed out matlab vectors. Below is the CPU output and then
>>> the GPU with a view of the scatter object, which is identical as you can
>>> see.
>>> >
>>> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
>>> wrong ...
>>> >
>>> > The only/first difference between CPU and GPU is a->lvec (the off
>>> diagonal contribution)on processor 1. (you can see the norms are
>>> different). Here is the diff on the process 1 a->lvec vector (all values
>>> are off).
>>> >
>>> > Any thoughts would be appreciated,
>>> > Mark
>>> >
>>> > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
>>> > 2,12c2,12
>>> > < %  type: seqcuda
>>> > < Vec_0x53738630_0 = [
>>> > < 9.5702137431412879e+00
>>> > < 2.1970298791152253e+01
>>> > < 4.5422290209190646e+00
>>> > < 2.0185031807270226e+00
>>> > < 4.2627312508573375e+01
>>> > < 1.0889191983882025e+01
>>> > < 1.6038202417695462e+01
>>> > < 2.7155672033607665e+01
>>> > < 6.2540357853223556e+00
>>> > ---
>>> > > %  type: seq
>>> > > Vec_0x3a546440_0 = [
>>> > > 4.5565851251714653e+00
>>> > > 1.0460532998971189e+01
>>> > > 2.1626531807270220e+00
>>> > > 9.6105288923182408e-01
>>> > > 2.0295782656035659e+01
>>> > > 5.1845791066529463e+00
>>> > > 7.6361340020576058e+00
>>> > > 1.2929401011659799e+01
>>> > > 2.9776812928669392e+00
>>> >
>>> > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g
>>> 1 ./ex56 -cells 2,2,1
>>> > [0] 27 global equations, 9 vertices
>>> > [0] 27 equations in vector, 9 vertices
>>> >   0 SNES Function norm 1.223958326481e+02
>>> >     0 KSP Residual norm 1.223958326481e+02
>>> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
>>> 1.424708937136e+00
>>> > [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=
>>> 1.424708937136e+00
>>> > [1] 1) |yy|=  2.007423334680e+02
>>> > [0] 1) |yy|=  2.007423334680e+02
>>> > [0] 2) |yy|=  1.957605719265e+02
>>> > [1] 2) |yy|=  1.957605719265e+02
>>> > [1] Number sends = 1; Number to self = 0
>>> > [1]   0 length = 9 to whom 0
>>> > Now the indices for all remote sends (in order by process sent to)
>>> > [1] 9
>>> > [1] 10
>>> > [1] 11
>>> > [1] 12
>>> > [1] 13
>>> > [1] 14
>>> > [1] 15
>>> > [1] 16
>>> > [1] 17
>>> > [1] Number receives = 1; Number from self = 0
>>> > [1] 0 length 9 from whom 0
>>> > Now the indices for all remote receives (in order by process received
>>> from)
>>> > [1] 0
>>> > [1] 1
>>> > [1] 2
>>> > [1] 3
>>> > [1] 4
>>> > [1] 5
>>> > [1] 6
>>> > [1] 7
>>> > [1] 8
>>> >     1 KSP Residual norm 8.199932342150e+01
>>> >   Linear solve did not converge due to DIVERGED_ITS iterations 1
>>> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> iterations 0
>>> >
>>> >
>>> > 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
>>> ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
>>> > [0] 27 global equations, 9 vertices
>>> > [0] 27 equations in vector, 9 vertices
>>> >   0 SNES Function norm 1.223958326481e+02
>>> >     0 KSP Residual norm 1.223958326481e+02
>>> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
>>> 1.424708937136e+00
>>> > [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=
>>> 1.424708937136e+00
>>> > [0] 1) |yy|=  2.007423334680e+02
>>> > [1] 1) |yy|=  2.007423334680e+02
>>> > [0] 2) |yy|=  1.953571867633e+02
>>> > [1] 2) |yy|=  1.953571867633e+02
>>> > [1] Number sends = 1; Number to self = 0
>>> > [1]   0 length = 9 to whom 0
>>> > Now the indices for all remote sends (in order by process sent to)
>>> > [1] 9
>>> > [1] 10
>>> > [1] 11
>>> > [1] 12
>>> > [1] 13
>>> > [1] 14
>>> > [1] 15
>>> > [1] 16
>>> > [1] 17
>>> > [1] Number receives = 1; Number from self = 0
>>> > [1] 0 length 9 from whom 0
>>> > Now the indices for all remote receives (in order by process received
>>> from)
>>> > [1] 0
>>> > [1] 1
>>> > [1] 2
>>> > [1] 3
>>> > [1] 4
>>> > [1] 5
>>> > [1] 6
>>> > [1] 7
>>> > [1] 8
>>> >     1 KSP Residual norm 8.199932342150e+01
>>> >
>>>
>>>
>
> --
> Stefano
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20190710/550ff50b/attachment-0001.html>