[petsc-users] Using distributed dense matrix/vector operations on a GPU
Stefano Zampini
stefano.zampini at gmail.com
Tue Feb 16 07:25:53 CST 2021
>
> the usual size of those matrices is (cumulative, not distributed) at least
> [8192x8192] x [8192x32768] complex entries as lower boundary. Does it still
> make sense to test CUDA for speedup?
>

I don't understand your notation. Are you saying your matrices are 8K x 8K, or 8K x 32K, or something else?

> Thank you,
>
> regards,
>
> Roland
> On 16.02.21 at 14:14, Stefano Zampini wrote:
>
>
>
> On Tue, 16 Feb 2021 at 11:43, Roland Richter <roland.richter at ntnu.no> wrote:
>
>> Hi,
>>
>> after profiling my program using -log_view, I got the following output
>> (all matrices are dense):
>>
>> Using 8 OpenMP threads
>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           5.074e+03     1.000   5.074e+03
>> Objects:              2.158e+03     1.000   2.158e+03
>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0   2303       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   4557       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Vector    37             34        1634064      0.
>>               Matrix  2120           2120    52734663456      0.
>>               Viewer     1              0              0      0.
>> ========================================================================================================================
>>
>> Apparently, MatMatMultNum and MatScale take by far the most time during
>> execution. Therefore, I was wondering if it is possible to move those
>> operations (or all matrices and vectors) to a GPU or another accelerator.
>> According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is
>> only supported for distributed vectors, but not for dense distributed
>> matrices. Are there any updates on that, or other ways to speed up the
>> involved operations?
>>
>
> You should compute the timings associated with each call, not the lump
> sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851
> seconds on average, so I doubt you can get any reasonable speedup with
> CUDA. What are the sizes of these matrices?
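
If it helps to read those per-call numbers off directly, here is a minimal sketch (the function name, loop body, iteration count and scaling factor below are placeholders, not taken from your code) of wrapping the hot loop in its own logging stage with PetscLogStageRegister()/PetscLogStagePush()/PetscLogStagePop(), so that -log_view reports the MatScale/MatMatMult work of that region as a separate stage:

  #include <petscmat.h>

  /* Placeholder sketch: put the hot loop in a user-defined log stage so that
     -log_view reports counts and timings for this region separately. */
  static PetscErrorCode RunHotLoop(Mat A, Mat B)
  {
    PetscErrorCode ierr;
    PetscLogStage  hotstage;
    Mat            C = NULL;
    PetscInt       it;

    PetscFunctionBeginUser;
    ierr = PetscLogStageRegister("Hot loop", &hotstage);CHKERRQ(ierr);
    ierr = PetscLogStagePush(hotstage);CHKERRQ(ierr);
    for (it = 0; it < 100; ++it) {                 /* placeholder iteration count */
      ierr = MatScale(A, 0.5);CHKERRQ(ierr);       /* placeholder scaling factor  */
      if (!C) {
        ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
      } else {
        ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
      }
    }
    ierr = PetscLogStagePop();CHKERRQ(ierr);
    ierr = MatDestroy(&C);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

With a stage like that, the per-call averages (stage time divided by the Count column) are easier to read than from the single Main Stage.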
>
>
>> Thanks!
>>
>> Regards,
>>
>> Roland
>>
>
>
> --
> Stefano
>
>
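
Regarding the GPU part of the question, a minimal sketch, assuming a CUDA-enabled PETSc build in which the dense CUDA matrix type (MATDENSECUDA) is available; the global size is only illustrative and taken from the numbers quoted above. Leaving the final type to the options database means the same code can be run with -mat_type dense on the CPU or -mat_type densecuda on the GPU:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    /* Illustrative global size, taken from the sizes mentioned in this thread */
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192);CHKERRQ(ierr);
    ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);   /* host dense as the default                    */
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);      /* allows -mat_type densecuda, where available   */
    ierr = MatSetUp(A);CHKERRQ(ierr);
    /* ... fill A; subsequent MatScale()/MatMatMult() then run on the GPU
       whenever a CUDA matrix type was selected at runtime ... */
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

Whether that pays off still depends on the per-call granularity discussed above; the CpuToGpu/GpuToCpu columns of -log_view would show whether the data actually stays resident on the device between operations.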
--
Stefano