[petsc-users] Using distributed dense matrix/vector operations on a GPU
Stefano Zampini
stefano.zampini at gmail.com
Tue Feb 16 07:14:32 CST 2021
On Tue, Feb 16, 2021 at 11:43 Roland Richter <roland.richter at ntnu.no> wrote:
> Hi,
>
> after profiling my program using -log_view, I got the following output
> (all matrices are dense):
>
> Using 8 OpenMP threads
> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           5.074e+03     1.000   5.074e+03
> Objects:              2.158e+03     1.000   2.158e+03
> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s  Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0   2303       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   4557       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Vector    37             34        1634064     0.
>               Matrix  2120           2120    52734663456     0.
>               Viewer     1              0              0     0.
>
> ========================================================================================================================
>
> Apparently, MatMatMultNum and MatScale take by far the most time during
> execution. Therefore, I was wondering whether it is possible to move those
> operations, i.e. all matrices and vectors, to a GPU or another accelerator.
> According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is
> only supported for distributed vectors, but not for distributed dense
> matrices. Are there any updates on that, or other ways to speed up the
> involved operations?
>
>
You should look at the timing of each individual call, not the lump sum. For
example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average;
I doubt you can get any reasonable speedup with CUDA for calls that short. What
are the sizes of these matrices?
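
As a rough aside, the same per-call averages can also be pulled out of PETSc's
logging programmatically instead of by hand from the -log_view table. The sketch
below is untested and assumes logging is active (e.g. the run was started with
-log_view); it only uses the public calls PetscLogEventGetId() and
PetscLogEventGetPerfInfo():

/* Rough, untested sketch: query per-event count and time through the
   public logging API and print the average time per call. Assumes
   logging is enabled, e.g. the program was run with -log_view. */
#include <petsc.h>

static PetscErrorCode ReportPerCall(const char *eventname)
{
  PetscLogEvent      ev;
  PetscEventPerfInfo info;
  PetscErrorCode     ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogEventGetId(eventname, &ev);CHKERRQ(ierr);
  ierr = PetscLogEventGetPerfInfo(0 /* main stage */, ev, &info);CHKERRQ(ierr);
  if (info.count > 0) {
    ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: %d calls, %g s total, %g s per call\n",
                       eventname, info.count, (double)info.time,
                       (double)info.time/info.count);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

/* called near the end of the run, before PetscFinalize():
     ReportPerCall("MatScale");        6.9348e+02/56162 ~ 0.012 s per call
     ReportPerCall("MatMatMultNum");   4.0706e+02/4161  ~ 0.098 s per call  */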
> Thanks!
>
> Regards,
>
> Roland
>
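Should the matrices turn out to be large enough for a GPU to pay off, a minimal
sketch of selecting a GPU dense matrix type at runtime is below. It assumes a
CUDA-enabled PETSc build in which the densecuda matrix type is available (please
check that against your development version); the type is left overridable from
the command line, so the same code runs on CPU or GPU.

/* Minimal sketch (assumption: CUDA-enabled PETSc build providing the
   densecuda matrix type). The type can be overridden at runtime, e.g.
   with -mat_type densecuda. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       M = 8192, N = 8192;   /* hypothetical global sizes */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, M, N);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);   /* CPU dense by default */
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);      /* e.g. -mat_type densecuda */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /* ... fill and assemble A as before; MatScale()/MatMatMult() then run
     through whichever backend was selected ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}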
--
Stefano