[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Wed Jul 10 08:22:11 CDT 2019

Mark,

if the difference is on lvec, I suspect the bug has to do with compressed
row storage. I have fixed a similar bug in MatMult.
you want to check cusparsestruct->workVector->size() against A->cmap->n.

Stefano

Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev <
petsc-dev at mcs.anl.gov> ha scritto:

>
>
> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>>   ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
>>   if (nt != A->rmap->n)
>> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
>> (%D) and xx (%D)",A->rmap->n,nt);
>>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>>
>>     So the xx on the GPU appears ok?
>
>
> The norm is correct and ...
>
>
>> The a->B appears ok?
>
>
> yes
>
>
>> But on process 1 the result a->lvec is wrong?
>>
>
> yes
>
>
>> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>>
>
> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
> should copy it. Maybe I should make a CUDA version of these methods?
>
>
>>
>>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>>   ierr =
>> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>   ierr =
>> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>>
>> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
>
>
> This is where I have been digging around an printing stuff.
>
>
>>
>> Are you sure the problem isn't related to the "stream business"?
>>
>
> I don't know what that is but I have played around with adding
> cudaDeviceSynchronize
>
>
>>
>> /* This multiplication sequence is different sequence
>>      than the CPU version. In particular, the diagonal block
>>      multiplication kernel is launched in one stream. Then,
>>      in a separate stream, the data transfers from DeviceToHost
>>      (with MPI messaging in between), then HostToDevice are
>>      launched. Once the data transfer stream is synchronized,
>>      to ensure messaging is complete, the MatMultAdd kernel
>>      is launched in the original (MatMult) stream to protect
>>      against race conditions.
>>
>>      This sequence should only be called for GPU computation. */
>>
>> Note this comment isn't right and appears to be cut and paste from
>> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>>
>
> Yes, I noticed this. Same as MatMult and not correct here.
>
>
>>
>> Anyway to "turn off the stream business" and see if the result is then
>> correct?
>
>
> How do you do that? I'm looking at docs on streams but not sure how its
> used here.
>
>
>> Perhaps the stream business was done correctly for MatMult() but was
>> never right for MatMultTranspose()?
>>
>> Barry
>>
>> BTW: Unrelated comment, the code
>>
>>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>>   ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
>>
>> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
>> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
>> them all yourself so setting them to zero before calling
>> VecCUDAGetArrayWrite() does nothing except waste time.
>>
>>
> OK, I'll get rid of it.
>
>
>>
>> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>> >
>> > I am stumped with this GPU bug(s). Maybe someone has an idea.
>> >
>> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
>> detected, but I still have differences between the GPU and CPU transpose
>> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
>> with two processors. It works on one processor or with cg/none. So it is
>> the transpose mat-vec.
>> >
>> > I see that the result of the off-diagonal  (a->lvec) is different only
>> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
>> and vec and printed out matlab vectors. Below is the CPU output and then
>> the GPU with a view of the scatter object, which is identical as you can
>> see.
>> >
>> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
>> wrong ...
>> >
>> > The only/first difference between CPU and GPU is a->lvec (the off
>> diagonal contribution)on processor 1. (you can see the norms are
>> different). Here is the diff on the process 1 a->lvec vector (all values
>> are off).
>> >
>> > Any thoughts would be appreciated,
>> > Mark
>> >
>> > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
>> > 2,12c2,12
>> > < %  type: seqcuda
>> > < Vec_0x53738630_0 = [
>> > < 9.5702137431412879e+00
>> > < 2.1970298791152253e+01
>> > < 4.5422290209190646e+00
>> > < 2.0185031807270226e+00
>> > < 4.2627312508573375e+01
>> > < 1.0889191983882025e+01
>> > < 1.6038202417695462e+01
>> > < 2.7155672033607665e+01
>> > < 6.2540357853223556e+00
>> > ---
>> > > %  type: seq
>> > > Vec_0x3a546440_0 = [
>> > > 4.5565851251714653e+00
>> > > 1.0460532998971189e+01
>> > > 2.1626531807270220e+00
>> > > 9.6105288923182408e-01
>> > > 2.0295782656035659e+01
>> > > 5.1845791066529463e+00
>> > > 7.6361340020576058e+00
>> > > 1.2929401011659799e+01
>> > > 2.9776812928669392e+00
>> >
>> > 15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
>> ./ex56 -cells 2,2,1
>> > [0] 27 global equations, 9 vertices
>> > [0] 27 equations in vector, 9 vertices
>> >   0 SNES Function norm 1.223958326481e+02
>> >     0 KSP Residual norm 1.223958326481e+02
>> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
>> 1.424708937136e+00
>> > [1] |x|=  1.223958326481e+02 |a->lvec|=  2.844171413778e+01 |B|=
>> 1.424708937136e+00
>> > [1] 1) |yy|=  2.007423334680e+02
>> > [0] 1) |yy|=  2.007423334680e+02
>> > [0] 2) |yy|=  1.957605719265e+02
>> > [1] 2) |yy|=  1.957605719265e+02
>> > [1] Number sends = 1; Number to self = 0
>> > [1]   0 length = 9 to whom 0
>> > Now the indices for all remote sends (in order by process sent to)
>> > [1] 9
>> > [1] 10
>> > [1] 11
>> > [1] 12
>> > [1] 13
>> > [1] 14
>> > [1] 15
>> > [1] 16
>> > [1] 17
>> > [1] Number receives = 1; Number from self = 0
>> > [1] 0 length 9 from whom 0
>> > Now the indices for all remote receives (in order by process received
>> from)
>> > [1] 0
>> > [1] 1
>> > [1] 2
>> > [1] 3
>> > [1] 4
>> > [1] 5
>> > [1] 6
>> > [1] 7
>> > [1] 8
>> >     1 KSP Residual norm 8.199932342150e+01
>> >   Linear solve did not converge due to DIVERGED_ITS iterations 1
>> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>> iterations 0
>> >
>> >
>> > 15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
>> ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
>> > [0] 27 global equations, 9 vertices
>> > [0] 27 equations in vector, 9 vertices
>> >   0 SNES Function norm 1.223958326481e+02
>> >     0 KSP Residual norm 1.223958326481e+02
>> > [0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
>> 1.424708937136e+00
>> > [1] |x|=  1.223958326481e+02 |a->lvec|=  5.973624458725e+01 |B|=
>> 1.424708937136e+00
>> > [0] 1) |yy|=  2.007423334680e+02
>> > [1] 1) |yy|=  2.007423334680e+02
>> > [0] 2) |yy|=  1.953571867633e+02
>> > [1] 2) |yy|=  1.953571867633e+02
>> > [1] Number sends = 1; Number to self = 0
>> > [1]   0 length = 9 to whom 0
>> > Now the indices for all remote sends (in order by process sent to)
>> > [1] 9
>> > [1] 10
>> > [1] 11
>> > [1] 12
>> > [1] 13
>> > [1] 14
>> > [1] 15
>> > [1] 16
>> > [1] 17
>> > [1] Number receives = 1; Number from self = 0
>> > [1] 0 length 9 from whom 0
>> > Now the indices for all remote receives (in order by process received
>> from)
>> > [1] 0
>> > [1] 1
>> > [1] 2
>> > [1] 3
>> > [1] 4
>> > [1] 5
>> > [1] 6
>> > [1] 7
>> > [1] 8
>> >     1 KSP Residual norm 8.199932342150e+01
>> >
>>
>>

-- 
Stefano
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20190710/41fd95c0/attachment.html>