[petsc-dev] MatMult on Summit
Mark Adams
mfadams at lbl.gov
Tue Sep 24 04:05:08 CDT 2019
Yes, please, thank you.
On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev <
petsc-dev at mcs.anl.gov> wrote:
> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>
> Hi,
>
> `git grep cudaStreamCreate` reports that vectors, matrices and scatters
> create their own streams. This will almost inevitably create races (there
> is no synchronization mechanism implemented), unless one calls WaitForGPU()
> after each operation. Some of the non-deterministic tests can likely be
> explained by this.
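>
> As a concrete illustration (a minimal standalone CUDA sketch, not PETSc
> code; the kernels, sizes, and names are made up), two kernels launched on
> two different streams are not ordered with respect to each other, so the
> second may read data the first has not finished writing unless we
> synchronize in between:
>
>   #include <cuda_runtime.h>
>
>   __global__ void scale(double *x, int n)
>   { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0; }
>   __global__ void accum(const double *x, double *y, int n)
>   { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] += x[i]; }
>
>   int main(void)
>   {
>     const int    n = 1<<20;
>     double       *x, *y;
>     cudaStream_t sA, sB;
>     cudaMalloc(&x, n*sizeof(double));
>     cudaMalloc(&y, n*sizeof(double));
>     cudaMemset(x, 0, n*sizeof(double));
>     cudaMemset(y, 0, n*sizeof(double));
>     cudaStreamCreate(&sA);
>     cudaStreamCreate(&sB);
>
>     scale<<<(n+255)/256, 256, 0, sA>>>(x, n);    /* writes x on stream A */
>     /* without cudaStreamSynchronize(sA) (or a device-wide sync, as
>        WaitForGPU() does) here, the next launch races with the one above */
>     accum<<<(n+255)/256, 256, 0, sB>>>(x, y, n); /* reads x on stream B */
>
>     cudaDeviceSynchronize();
>     cudaStreamDestroy(sA); cudaStreamDestroy(sB);
>     cudaFree(x); cudaFree(y);
>     return 0;
>   }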
>
> I'll clean this up in the next few hours if there are no objections.
>
> Best regards,
> Karli
>
>
>
> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU() to
> ensure that the calls actually complete. I will make a pass through the
> aijcusparse and aijviennacl code looking for these.
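>
> The pattern I am looking for is roughly the following (just a sketch;
> AsyncGpuLaunch() is a placeholder for whatever asynchronous call sits
> inside the timed region, not a real function):
>
>   ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
>   AsyncGpuLaunch();                      /* work is only enqueued here; the host returns immediately */
>   ierr = WaitForGPU();CHKERRCUDA(ierr);  /* without this we time the launch, not the execution */
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);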
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>
> It looks like cusparsestruct->stream is always created (not NULL), so I
> don't understand the logic behind the "if (!cusparsestruct->stream)" check.
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
>
> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
> the end of the function it had
>
>     if (!yy) { /* MatMult */
>       if (!cusparsestruct->stream) {
>         ierr = WaitForGPU();CHKERRCUDA(ierr);
>       }
>     }
>
> I assume we don't need the logic that does this only in the MatMult()
> (no add) case and should just do it all the time, for the purposes of
> timing if for no other reason. Is there some reason NOT to do this,
> because of worries about the effects that these WaitForGPU() invocations
> might have on performance?
>
> I notice other problems in aijcusparse.cu, now that I look closer. In
> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing
> calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the
> timed region). I believe this is another place where we get a
> meaningless timing. It looks like we need a WaitForGPU() there, and
> then maybe another one inside the timed region that handles the
> scatter. (I don't know whether that part happens asynchronously or
> not.) But do we really want two WaitForGPU() calls in one function,
> just to help with getting timings? I don't have a good idea of how much
> overhead this adds.
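>
> Concretely, I am imagining something with the following shape (a rough
> sketch only; CsrSpMVLaunch() and ScatterLaunch() are placeholders, not
> the actual calls in MatMultTransposeAdd_SeqAIJCUSPARSE):
>
>   /* timed region 1: the cuSPARSE SpMV */
>   ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
>   CsrSpMVLaunch();
>   ierr = WaitForGPU();CHKERRCUDA(ierr);   /* first sync, so the timer sees the kernel finish */
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>
>   /* timed region 2: the scatter that follows */
>   ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
>   ScatterLaunch();
>   ierr = WaitForGPU();CHKERRCUDA(ierr);   /* second sync, the one whose overhead I am unsure about */
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);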
>
> --Richard
>
> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>
> I made the following changes:
> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:
>      ierr = WaitForGPU();CHKERRCUDA(ierr);
>      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>      ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>      PetscFunctionReturn(0);
> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old
> code swapped the first two lines. Since MatMultAdd_SeqAIJCUSPARSE is
> blocking when -log_view is used, I changed the order to get better
> overlap of communication and computation:
>      ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>      ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>      ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>      ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
> 3) Log the time directly in the test code so that we can also see the
> execution time without -log_view (and hence without the CUDA
> synchronization it triggers). I manually calculated the Total Mflop/s
> for these cases for easy comparison (a sketch of the timing loop follows).
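>
> (Roughly, the timing in the test looks like this; this is a sketch of
> the idea rather than the exact code, and A, x, y stand for the matrix
> and vectors already set up in the test:)
>
>   PetscLogDouble t0, t1;
>   cudaError_t    cerr;
>   PetscInt       i;
>
>   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* drain previously enqueued GPU work */
>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>   for (i = 0; i < 100; i++) {
>     ierr = MatMult(A,x,y);CHKERRQ(ierr);
>   }
>   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* make sure the GPU finished before reading the clock */
>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>   ierr = PetscPrintf(PETSC_COMM_WORLD,"MatMult: 100 iterations took %g s\n",(double)(t1-t0));CHKERRQ(ierr);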
>
> <<Note the CPU versions are copied from yesterday's results>>
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event  Count  Time (sec)  Flop  --- Global ---  --- Stage ----  Total  GPU  - CpuToGpu -  - GpuToCpu -  GPU
>        Max Ratio  Max Ratio  Max Ratio  Mess  AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s  Mflop/s  Count Size  Count Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 6 MPI ranks
> MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 24 MPI ranks
> MatMult          100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 8 99 97 25 0 100100100100 0 17948 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 19 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 42 MPI ranks
> MatMult          100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30 0 100100100100 0 27493 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00 0 0 97 30 0 1 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 24 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 6 MPI ranks + 6 GPUs + regular SF + log_view
> MatMult          100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 335743 629278 100 1.02e+02 100 2.69e+02 100
> VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 24 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0
> VecScatterEnd    100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 20 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecCUDACopyTo    100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 4 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
> VecCopyFromSome  100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 14 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0
>
> 6 MPI ranks + 6 GPUs + regular SF + No log_view
> MatMult:         100 1.0 1.4180e-01  (399268 Mflop/s)
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
> MatMult          100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 512224 642075 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin  100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 6 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 16 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
> MatMult:         100 1.0 9.8344e-02  (575717 Mflop/s)
>
> 24 MPI ranks + 6 GPUs + regular SF + log_view
> MatMult          100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 0 99 97 25 0 100100100100 0 489223 708601 100 4.61e+01 100 6.72e+01 100
> VecScatterBegin  100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 13 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0
> VecScatterEnd    100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 38 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecCUDACopyTo    100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
> VecCopyFromSome  100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 7 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0
>
> 24 MPI ranks + 6 GPUs + regular SF + No log_view
> MatMult:         100 1.0 9.8254e-02  (576201 Mflop/s)
>
> 24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
> MatMult          100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 487956 707524 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin  100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 8 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 52 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
> MatMult:         100 1.0 1.0397e-01  (544510 Mflop/s)
>