[petsc-users] Using distributed dense matrix/vector operations on a GPU

Roland Richter roland.richter at ntnu.no
Tue Feb 16 07:30:06 CST 2021


For MatMatMult the sizes of the involved matrices are 8k x 8k and 8k x
32k. I am not sure where MatScale is called; I never call it explicitly.
If MatDiagonalScale calls MatScale, then the involved matrices have a
size of 8k x 32k.
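For context, a minimal sketch (not my actual code) of how such a distributed dense matrix is set up; with MatSetFromOptions() the matrix type could be switched at run time via -mat_type, e.g. to a CUDA-backed dense type, assuming the installed PETSc provides one:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       M = 8192, N = 32768;   /* global size of the 8k x 32k operand */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, M, N);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);     /* CPU dense by default */
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);        /* allow -mat_type to override */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /* ... fill A, then the hot operations, e.g. MatScale(A, alpha) ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If such a type is available, comparing CPU and GPU would then only require rerunning with a different -mat_type and looking at the new -log_view output.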

Regards,

Roland

On 16.02.21 at 14:25, Stefano Zampini wrote:
>
>
>      
>
>     the usual size of those matrices (cumulative, not distributed) is
>     at least [8192x8192] x [8192x32768] complex entries. Does it still
>     make sense to test CUDA for a speedup?
>
> I don't understand your notation. Are you saying your matrices are 8K
> x 8K? or 8K*32K? or what?
>  
>
>     Thank you,
>
>     regards,
>
>     Roland
>
>     On 16.02.21 at 14:14, Stefano Zampini wrote:
>>
>>
>>     On Tue, 16 Feb 2021 at 11:43, Roland Richter
>>     <roland.richter at ntnu.no> wrote:
>>
>>         Hei,
>>
>>         after profiling my program using -log_view, I got the
>>         following output (all matrices are dense):
>>
>> Using 8 OpenMP threads
>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           5.074e+03     1.000   5.074e+03
>> Objects:              2.158e+03     1.000   2.158e+03
>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Vector    37             34      1634064     0.
>>               Matrix  2120           2120  52734663456     0.
>>               Viewer     1              0            0     0.
>> ========================================================================================================================
>>
>>         Apparently, MatMatMultNum and MatScale take the most time (by
>>         far) during execution. Therefore, I was wondering if it is
>>         possible to move those operations/all matrices and vectors to
>>         a GPU or another accelerator. According to
>>         https://www.mcs.anl.gov/petsc/features/gpus.html CUDA is
>>         only supported for distributed vectors, but not for dense
>>         distributed matrices. Are there any updates related to that,
>>         or other ways to speed up the involved operations?
>>
>>
>>     You should compute the timings associated with each call, and not
>>     consider the lump sum. For example, each MatScale takes
>>     6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you
>>     can get any reasonable speedup with CUDA. What are the sizes of
>>     these matrices?
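>>
>>     For illustration only (not from the original run), a minimal sketch of
>>     wrapping the hot call in a user-defined log event, so that -log_view
>>     reports its count and total time separately and the per-call cost can
>>     be read off directly:
>>
>> #include <petscmat.h>
>>
>> /* Sketch: time every MatScale call under its own -log_view event. */
>> PetscErrorCode MatScaleTimed(Mat A, PetscScalar alpha)
>> {
>>   static PetscBool     registered = PETSC_FALSE;
>>   static PetscLogEvent USER_MatScale;
>>   PetscErrorCode       ierr;
>>
>>   PetscFunctionBeginUser;
>>   if (!registered) {
>>     ierr = PetscLogEventRegister("UserMatScale", MAT_CLASSID, &USER_MatScale);CHKERRQ(ierr);
>>     registered = PETSC_TRUE;
>>   }
>>   ierr = PetscLogEventBegin(USER_MatScale, A, 0, 0, 0);CHKERRQ(ierr);
>>   ierr = MatScale(A, alpha);CHKERRQ(ierr);
>>   ierr = PetscLogEventEnd(USER_MatScale, A, 0, 0, 0);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> }
>>
>>     The average per-call time is then simply the event time divided by its
>>     count, as computed above for MatScale.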
>>      
>>
>>         Thanks!
>>
>>         Regards,
>>
>>         Roland
>>
>>
>>
>>     -- 
>>     Stefano
>
>
>
> -- 
> Stefano