<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body>


<div dir="ltr">No objection. Thanks.<br clear="all">


<div>


<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">


<div dir="ltr">--Junchao Zhang</div>


</div>


</div>


<br>


</div>


<br>


<div class="gmail_quote">


<div dir="ltr" class="gmail_attr">On Mon, Sep 23, 2019 at 10:09 PM Karl Rupp &lt;<a href="mailto:rupp@iue.tuwien.ac.at">rupp@iue.tuwien.ac.at</a>&gt; wrote:<br>


</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


Hi,<br>


<br>


`git grep cudaStreamCreate` reports that vectors, matrices and scatters <br>


create their own streams. This will almost inevitably create races <br>


(there is no synchronization mechanism implemented), unless one calls <br>


WaitForGPU() after each operation. Some of the non-deterministic tests <br>


can likely be explained by this.<br>


<br>


I'll clean this up in the next few hours if there are no objections.<br>


<br>


Best regards,<br>


Karli<br>


<br>


<br>


<br>


On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:<br>


> I'm no CUDA expert (not yet, anyway), but, from what I've read, the <br>


> default stream (stream 0) is (mostly) synchronous to host and device, so <br>


> WaitForGPU() is not needed in that case. I don't know if there is any <br>


> performance penalty in explicitly calling it in that case, anyway.<br>


> <br>


> In any case, it looks like there are still some cases where potentially <br>


> asynchronous CUDA library calls are being "timed" without a WaitForGPU() <br>


> to ensure that the calls actually complete. I will make a pass through <br>


> the aijcusparse and aijviennacl code looking for these.<br>


> <br>


> --Richard<br>


> <br>


> On 9/23/19 3:28 PM, Zhang, Junchao wrote:<br>


>> It looks like cusparsestruct->stream is always created (not NULL).  I don't <br>


>> know the logic behind the "if (!cusparsestruct->stream)" check.<br>


>> --Junchao Zhang<br>


>><br>


>><br>


>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <br>


>> &lt;<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>&gt; wrote:<br>


>><br>


>>     In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards<br>


>>     the end of the function it had<br>


>><br>


>>       if (!yy) { /* MatMult */<br>


>>         if (!cusparsestruct->stream) {<br>


>>           ierr = WaitForGPU();CHKERRCUDA(ierr);<br>


>>         }<br>


>>       }<br>


>><br>


>>     I assume we don't need the logic that does this only in the MatMult()<br>


>>     (no-add) case and should just do it all the time, for the<br>


>>     purposes of timing if for no other reason. Is there some reason NOT to<br>


>>     do this because of worries about the effects that these<br>


>>     WaitForGPU() invocations might have on performance?<br>


>><br>


>>     I notice other problems in aijcusparse.cu,<br>


>>     now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I<br>


>>     see that we have GPU timing calls around the cusparse_csr_spmv()<br>


>>     (but no WaitForGPU() inside the timed region). I believe this is<br>


>>     another area in which we get a meaningless timing. It looks like<br>


>>     we need a WaitForGPU() there, and then maybe inside the timed<br>


>>     region handling the scatter. (I don't know if this stuff happens<br>


>>     asynchronously or not.) But do we potentially want two<br>


>>     WaitForGPU() calls in one function, just to help with getting<br>


>>     timings? I don't have a good idea of how much overhead this adds.<br>


>><br>


>>     --Richard<br>


>><br>


>>     On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:<br>


>>>     I made the following changes:<br>


>>>     1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end<br>


>>>       ierr = WaitForGPU();CHKERRCUDA(ierr);<br>


>>>       ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);<br>


>>>       ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);<br>


>>>       PetscFunctionReturn(0);<br>


>>>     2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.<br>


>>>     The old code swapped the first two lines. Since with<br>


>>>     -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the<br>


>>>     order to have better overlap.<br>


>>>       ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);<br>


>>>       ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);<br>


>>>       ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);<br>


>>>       ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);<br>


>>>     3) Log time directly in the test code so we can also know the<br>


>>>     execution time without -log_view (and hence without CUDA synchronization). I<br>


>>>     manually calculated the Total Mflop/s for these cases for easy<br>


>>>     comparison.<br>


>>><br>


>>>     &lt;&lt;Note the CPU versions are copied from yesterday's results&gt;&gt;<br>


>>><br>


>>>     ------------------------------------------------------------------------------------------------------------------------<br>


>>>     Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU<br>


>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F<br>


>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------<br>


>>>     6 MPI ranks,<br>


>>>     MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100 100 100 100  0  4743       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>><br>


>>>     24 MPI ranks<br>


>>>     MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100 100 100 100  0 17948       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>><br>


>>>     42 MPI ranks<br>


>>>     MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100 100 100 100  0 27493       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>><br>


>>>     6 MPI ranks + 6 GPUs + regular SF + log_view<br>


>>>     MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100 100 100 100  0 335743   629278    100 1.02e+02  100 2.69e+02 100<br>


>>>     VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0 100 100  0     0       0      0 0.00e+00  100 2.69e+02  0<br>


>>>     VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0<br>


>>>     VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0<br>


>>><br>


>>>     6 MPI ranks + 6 GPUs + regular SF  + No log_view<br>


>>>     MatMult:             100 1.0 1.4180e-01                                                                            399268<br>


>>><br>


>>>     6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view<br>


>>>     MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100 100 100 100  0 512224   642075      0 0.00e+00    0 0.00e+00 100<br>


>>>     VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>><br>


>>>     6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view<br>


>>>     MatMult:             100 1.0 9.8344e-02                                                                            575717<br>


>>><br>


>>>     24 MPI ranks + 6 GPUs + regular SF + log_view<br>


>>>     MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100 100 100 100  0 489223   708601    100 4.61e+01  100 6.72e+01 100<br>


>>>     VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0 100 100  0     0       0      0 0.00e+00  100 6.72e+01  0<br>


>>>     VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0<br>


>>>     VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0<br>


>>><br>


>>>     24 MPI ranks + 6 GPUs + regular SF + No log_view<br>


>>>     MatMult:             100 1.0 9.8254e-02                                                                            576201<br>


>>><br>


>>>     24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view<br>


>>>     MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100 100 100 100  0 487956   707524      0 0.00e+00    0 0.00e+00 100<br>


>>>     VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>>     VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>


>>><br>


>>>     24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view<br>


>>>     MatMult:             100 1.0 1.0397e-01                                                                            544510<br>


>>><br>


>>><br>


>>><br>


>>><br>


>>><br>


>><br>


> <br>


</blockquote>


</div>


</body>


</html>