[petsc-dev] MatMult on Summit
Karl Rupp
rupp at iue.tuwien.ac.at
Tue Sep 24 07:37:55 CDT 2019
Hi Mark, Richard, Junchao, et al.,
here we go:
https://gitlab.com/petsc/petsc/merge_requests/2091
This indeed fixes all the inconsistencies in test results for SNES ex19
and even ex56. A priori I wasn't sure about the latter, but it looks
like this was the only missing piece.
Mark, this should allow you to move forward with GPUs.
Best regards,
Karli
On 9/24/19 11:05 AM, Mark Adams wrote:
> Yes, please, thank you.
>
> On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev
> <petsc-dev at mcs.anl.gov> wrote:
>
> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>> Hi,
>>
>> `git grep cudaStreamCreate` reports that vectors, matrices and
>> scatters create their own streams. This will almost inevitably
>> create races (there is no synchronization mechanism implemented),
>> unless one calls WaitForGPU() after each operation. Some of the
>> non-deterministic tests can likely be explained by this.
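>>
>> To make the kind of race concrete, here is a minimal standalone sketch
>> (the kernels are just placeholders, not PETSc code): two streams, a
>> producer on one, a consumer on the other, and nothing ordering them.
>>
>>   #include <cuda_runtime.h>
>>
>>   /* Placeholder kernels standing in for, e.g., an SpMV and a vector update. */
>>   __global__ void produce(double *y, int n)
>>   {
>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>     if (i < n) y[i] = 2.0*i;
>>   }
>>   __global__ void consume(const double *y, double *z, int n)
>>   {
>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>     if (i < n) z[i] = y[i] + 1.0;  /* reads y written by produce() */
>>   }
>>
>>   int main(void)
>>   {
>>     const int n = 1<<20;
>>     double *y, *z;
>>     cudaStream_t matStream, vecStream;  /* separate streams, as Mat and Vec create now */
>>
>>     cudaMalloc(&y, n*sizeof(double));
>>     cudaMalloc(&z, n*sizeof(double));
>>     cudaStreamCreate(&matStream);
>>     cudaStreamCreate(&vecStream);
>>
>>     produce<<<(n+255)/256, 256, 0, matStream>>>(y, n);
>>     /* Without the following synchronization, consume() may read y before
>>        produce() has finished, because the two streams are not ordered with
>>        respect to each other. Calling WaitForGPU() (essentially
>>        cudaDeviceSynchronize()) after each operation papers over this, at a cost. */
>>     cudaStreamSynchronize(matStream);
>>     consume<<<(n+255)/256, 256, 0, vecStream>>>(y, z, n);
>>
>>     cudaDeviceSynchronize();
>>     cudaStreamDestroy(matStream); cudaStreamDestroy(vecStream);
>>     cudaFree(y); cudaFree(z);
>>     return 0;
>>   }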
>>
>> I'll clean this up in the next few hours if there are no objections.
>>
>> Best regards,
>> Karli
>>
>>
>>
>> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>>> I'm no CUDA expert (not yet, anyway), but from what I've read, the
>>> default stream (stream 0) is (mostly) synchronous with respect to the
>>> host and device, so WaitForGPU() is not needed in that case. I don't
>>> know whether explicitly calling it in that case carries any
>>> performance penalty, though.
>>>
>>> In any case, it looks like there are still some cases where
>>> potentially asynchronous CUDA library calls are being "timed"
>>> without a WaitForGPU() to ensure that the calls actually
>>> complete. I will make a pass through the aijcusparse and
>>> aijviennacl code looking for these.
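>>>
>>> To illustrate the problem concretely, here is a small standalone
>>> sketch (placeholder kernel, not the aijcusparse code): a host timer
>>> around an asynchronous kernel launch reports essentially nothing
>>> unless we synchronize before stopping the clock, which is what
>>> WaitForGPU() is for.
>>>
>>>   #include <cuda_runtime.h>
>>>   #include <stdio.h>
>>>   #include <sys/time.h>
>>>
>>>   /* Placeholder kernel standing in for an asynchronous cusparse SpMV. */
>>>   __global__ void spmv_like(double *y, int n)
>>>   {
>>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>>     if (i < n) {
>>>       double s = 0.0;
>>>       for (int k = 0; k < 200; k++) s += 1e-3*k;
>>>       y[i] = s;
>>>     }
>>>   }
>>>
>>>   static double now(void)
>>>   {
>>>     struct timeval tv;
>>>     gettimeofday(&tv, NULL);
>>>     return tv.tv_sec + 1e-6*tv.tv_usec;
>>>   }
>>>
>>>   int main(void)
>>>   {
>>>     const int n = 1<<22;
>>>     double *y, t0, t1, t2;
>>>
>>>     cudaMalloc(&y, n*sizeof(double));
>>>     t0 = now();
>>>     spmv_like<<<(n+255)/256, 256>>>(y, n);  /* launch returns immediately */
>>>     t1 = now();               /* kernel likely still running: t1-t0 is meaningless */
>>>     cudaDeviceSynchronize();  /* what WaitForGPU() amounts to */
>>>     t2 = now();               /* t2-t0 now covers the actual work */
>>>     printf("without sync: %g s, with sync: %g s\n", t1 - t0, t2 - t0);
>>>     cudaFree(y);
>>>     return 0;
>>>   }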
>>>
>>> --Richard
>>>
>>> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>>>> It looks like cusparsestruct->stream is always created (i.e., it is
>>>> never NULL), so I don't understand the logic of the
>>>> "if (!cusparsestruct->stream)" check.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via
>>>> petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>
>>>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
>>>> the end of the function it had
>>>>
>>>>     if (!yy) { /* MatMult */
>>>>       if (!cusparsestruct->stream) {
>>>>         ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>>       }
>>>>     }
>>>>
>>>> I assume we don't need the logic to do this only in the MatMult()
>>>> (no-add) case and should just do this all the time, for timing
>>>> purposes if no other reason. Is there some reason NOT to do this
>>>> because of worries about the effects that these WaitForGPU()
>>>> invocations might have on performance?
>>>>
>>>> I notice other problems in aijcusparse.cu, now that I look closer.
>>>> In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU
>>>> timing calls around the cusparse_csr_spmv() (but no WaitForGPU()
>>>> inside the timed region). I believe this is another area in which we
>>>> get a meaningless timing. It looks like we need a WaitForGPU() there,
>>>> and then maybe another one inside the timed region that handles the
>>>> scatter. (I don't know whether this stuff happens asynchronously or
>>>> not.) But do we potentially want two WaitForGPU() calls in one
>>>> function, just to help with getting timings? I don't have a good idea
>>>> of how much overhead this adds.
>>>>
>>>> --Richard
>>>>
>>>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>>>> I made the following changes:
>>>>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at
>>>>> the end
>>>>> ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>>> ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>>> ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>> PetscFunctionReturn(0);
>>>>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The
>>>>> old code swapped the first two lines. Since with -log_view,
>>>>> MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the order to have
>>>>> better overlap.
>>>>>   ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>>>>   ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>>>> 3) Log time directly in the test code so we can also know the
>>>>> execution time without -log_view (and hence without the CUDA
>>>>> synchronization it implies). I manually calculated the Total Mflop/s
>>>>> for these cases for easy comparison.
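>>>>>
>>>>> Roughly, the timing loop looks like the sketch below. This is a
>>>>> simplified stand-in for the actual test driver, not the real code:
>>>>> the tridiagonal matrix is just a placeholder, and the VecNorm is
>>>>> only there to force pending GPU work to finish before the timer
>>>>> stops.
>>>>>
>>>>>   #include <petscmat.h>
>>>>>   #include <petsctime.h>
>>>>>
>>>>>   int main(int argc, char **argv)
>>>>>   {
>>>>>     Mat            A;
>>>>>     Vec            x, y;
>>>>>     PetscInt       i, Istart, Iend, N = 1000000, its = 100;
>>>>>     PetscLogDouble t0, t1;
>>>>>     PetscReal      norm;
>>>>>     PetscErrorCode ierr;
>>>>>
>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>>>     /* Placeholder matrix; run with -mat_type aijcusparse -vec_type cuda
>>>>>        to exercise the GPU path. */
>>>>>     ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
>>>>>     ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>>>>     ierr = MatSetUp(A);CHKERRQ(ierr);
>>>>>     ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>>>>     for (i = Istart; i < Iend; i++) {
>>>>>       if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>       if (i < N-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>>     }
>>>>>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>     ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
>>>>>     ierr = VecSet(x, 1.0);CHKERRQ(ierr);
>>>>>
>>>>>     ierr = MatMult(A, x, y);CHKERRQ(ierr);           /* warm-up; triggers the CpuToGpu copy */
>>>>>     ierr = VecNorm(y, NORM_2, &norm);CHKERRQ(ierr);  /* drain the warm-up GPU work */
>>>>>     ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>     ierr = PetscTime(&t0);CHKERRQ(ierr);
>>>>>     for (i = 0; i < its; i++) {
>>>>>       ierr = MatMult(A, x, y);CHKERRQ(ierr);
>>>>>     }
>>>>>     ierr = VecNorm(y, NORM_2, &norm);CHKERRQ(ierr);  /* forces pending GPU work to complete */
>>>>>     ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>     ierr = PetscTime(&t1);CHKERRQ(ierr);
>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "MatMult: %D iterations in %g seconds\n", its, (double)(t1-t0));CHKERRQ(ierr);
>>>>>
>>>>>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>>>>>     ierr = VecDestroy(&y);CHKERRQ(ierr);
>>>>>     ierr = PetscFinalize();
>>>>>     return ierr;
>>>>>   }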
>>>>>
>>>>> <<Note the CPU versions are copied from yesterday's results>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>> 6 MPI ranks
>>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 24 MPI ranks
>>>>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 42 MPI ranks
>>>>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + regular SF + log_view
>>>>> MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
>>>>> VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
>>>>> VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + regular SF + No log_view
>>>>> MatMult:             100 1.0 1.4180e-01   399268
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>>>>> MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224  642075      0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>>>>> MatMult:             100 1.0 9.8344e-02   575717
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + regular SF + log_view
>>>>> MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223  708601    100 4.61e+01  100 6.72e+01 100
>>>>> VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
>>>>> VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + regular SF + No log_view
>>>>> MatMult:             100 1.0 9.8254e-02   576201
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>>>>> MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956  707524      0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>>>>> MatMult:             100 1.0 1.0397e-01   544510
>>>>>
>>>>
>>>
>