[petsc-users] Using distributed dense matrix/vector operations on a GPU

Stefano Zampini stefano.zampini at gmail.com
Tue Feb 16 07:46:37 CST 2021


On Tue, Feb 16, 2021 at 16:30 Roland Richter <roland.richter at ntnu.no> wrote:

> For MatMatMult, the sizes of the involved matrices are 8k x 8k and 8k x 32k.
>
OK, so you have 32k columns to multiply against, so maybe you can get some
speedup there. However, if you keep updating the matrix entries on the CPU,
then using CUDA will make little sense, since every update has to be copied
back to the GPU. In any case, you can try it and see whether you get any
speedup.
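If you want to experiment, a minimal sketch could look like the following. It
assumes a PETSc build configured with CUDA and the dense CUDA matrix type
available in recent development versions (selectable at run time with
-mat_type densecuda); error checking is omitted and the fill code is only
indicated by a comment:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A, B, C;
  PetscInt M = 8192, K = 8192, N = 32768;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* A: M x K, B: K x N; the type is left to the options database so the
     same code runs with -mat_type dense or -mat_type densecuda */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, M, K);
  MatSetType(A, MATDENSE);
  MatSetFromOptions(A);
  MatSetUp(A);
  /* ... fill A (MatSetValues or MatDenseGetArray), then assemble ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreate(PETSC_COMM_WORLD, &B);
  MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, K, N);
  MatSetType(B, MATDENSE);
  MatSetFromOptions(B);
  MatSetUp(B);
  MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);

  /* C = A*B; the product only pays off on the GPU if A and B are not
     rewritten on the CPU between calls, which forces host-device copies */
  MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);

  MatDestroy(&A); MatDestroy(&B); MatDestroy(&C);
  PetscFinalize();
  return 0;
}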

> I am not sure where MatScale is called; I never call it explicitly. If
> MatDiagonalScale calls MatScale, then the involved matrices have a size of
> 8k x 32k.
>
>
No, it does not. Are you calling MatAYPX?
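MatDiagonalScale(A,l,r) scales rows and columns with vectors and does not log
a MatScale event, while MatAYPX (Y = a*Y + X) is, as far as I recall,
implemented as a MatScale of Y followed by a MatAXPY, so it would show up as
MatScale in your log. A small fragment for reference, assuming the matrices A,
X, Y and the vectors l, r already exist and are assembled (values are
placeholders):

PetscScalar a = 2.0;

MatScale(A, a);                          /* A = a*A, logged as MatScale      */
MatDiagonalScale(A, l, r);               /* A = diag(l)*A*diag(r)            */
MatAYPX(Y, a, X, SAME_NONZERO_PATTERN);  /* Y = a*Y + X; typically scales Y
                                            first and then adds X, so it can
                                            also increment MatScale          */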



> Regards,
>
> Roland
> On 16.02.21 at 14:25, Stefano Zampini wrote:
>
>
>>
>>
>> the usual size of those matrices (cumulative, not distributed) is at least
>> [8192 x 8192] x [8192 x 32768] complex entries. Does it still make sense to
>> test CUDA for speedup?
>>
> I don't understand your notation. Are you saying your matrices are 8K x
> 8K, or 8K x 32K, or something else?
>
>
>> Thank you,
>>
>> regards,
>>
>> Roland
>> On 16.02.21 at 14:14, Stefano Zampini wrote:
>>
>>
>>
>> On Tue, Feb 16, 2021 at 11:43 Roland Richter <roland.richter at ntnu.no> wrote:
>>
>>> Hi,
>>>
>>> after profiling my program using -log_view, I got the following output
>>> (all matrices are dense):
>>>
>>> Using 8 OpenMP threads
>>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           5.074e+03     1.000   5.074e+03
>>> Objects:              2.158e+03     1.000   2.158e+03
>>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>    GPU %F: percent flops on GPU in this event
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>               Vector    37             34      1634064     0.
>>>               Matrix  2120           2120  52734663456     0.
>>>               Viewer     1              0            0     0.
>>>
>>> ========================================================================================================================
>>>
>>> Apparently, MatMatMultNum and MatScale take by far the most time during
>>> execution. Therefore, I was wondering whether it is possible to move those
>>> operations (and all matrices and vectors) to a GPU or another accelerator.
>>> According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only
>>> supported for distributed vectors, but not for dense distributed matrices.
>>> Are there any updates related to that, or other ways to speed up the
>>> involved operations?
>>>
>>
>> You should compute the timings associated with each call, not just the
>> lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851
>> seconds on average; for an operation that short I doubt you can get any
>> reasonable speedup with CUDA. What are the sizes of these matrices?
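To make that kind of per-phase accounting easier, the hot loop can also be
wrapped in its own logging stage, using the PetscLogStagePush()/
PetscLogStagePop() routines mentioned in the -log_view header, so the report
splits those counts out separately. A rough fragment, with the loop body and
all names (nsteps, alpha, A, B, C) as placeholders for the application's own
data:

/* inside the routine that drives the repeated updates */
PetscLogStage hot;
PetscLogStageRegister("HotLoop", &hot);

PetscLogStagePush(hot);
for (PetscInt i = 0; i < nsteps; i++) {
  MatScale(A, alpha);
  MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);  /* C created earlier
                                                             with MAT_INITIAL_MATRIX */
}
PetscLogStagePop();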
>>
>>
>>> Thanks!
>>>
>>> Regards,
>>>
>>> Roland
>>>
>>
>>
>> --
>> Stefano
>>
>>
>
> --
> Stefano
>
>

-- 
Stefano