[petsc-users] Using distributed dense matrix/vector operations on a GPU
Roland Richter
roland.richter at ntnu.no
Tue Feb 16 07:16:57 CST 2021
Hei,
the usual size of those matrices (the cumulative global size, not the
per-process share) is at least [8192x8192] x [8192x32768] complex
entries. Does it still make sense to test CUDA for a speedup?
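For scale, a back-of-envelope estimate (assuming 16-byte double-precision
complex entries and the 8N complex-flop convention quoted in the -log_view
output below):

   an 8192 x 8192  complex matrix:  8192*8192*16 B  = 1 GiB
   an 8192 x 32768 complex matrix:  8192*32768*16 B = 4 GiB
   one [8192x8192] x [8192x32768] product: 8*8192*8192*32768 ≈ 1.8e13 flop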
Thank you,
regards,
Roland
On 16.02.2021 at 14:14, Stefano Zampini wrote:
>
>
> On Tue, Feb 16, 2021 at 11:43, Roland Richter
> <roland.richter at ntnu.no> wrote:
>
> Hei,
>
> after profiling my program using -log_view, I got the following
> output (all matrices are dense):
>
> Using 8 OpenMP threads
> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           5.074e+03     1.000   5.074e+03
> Objects:              2.158e+03     1.000   2.158e+03
> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Vector    37             34        1634064     0.
>               Matrix  2120           2120    52734663456     0.
>               Viewer     1              0              0     0.
> ========================================================================================================================
>
> Apparently, MatMatMultNum and MatScale take by far the most time
> during execution. Therefore, I was wondering whether it is possible to
> move those operations, or all matrices and vectors, to a GPU or
> another accelerator. According to
> https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only
> supported for distributed vectors, but not for dense distributed
> matrices. Are there any updates on that, or other ways to speed up
> the involved operations?
>
>
> You should compute the timing associated with each individual call,
> not just the lump sum. For example, each MatScale takes
> 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can
> get any reasonable speedup with CUDA. What are the sizes of these
> matrices?
>
>
> Thanks!
>
> Regards,
>
> Roland
>
>
>
> --
> Stefano
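Regarding the question above about running the dense operations on the GPU:
by the same per-call arithmetic, each MatMatMultNum takes about
4.0706e+02/4161 ≈ 0.098 s at roughly 123 Gflop/s in aggregate. The following
is a minimal sketch of how one might try PETSc's CUDA dense matrix type. It
assumes a CUDA-enabled build (configured with --with-cuda and, for complex
entries, --with-scalar-type=complex) that is recent enough to provide the
densecuda matrix type; the global sizes and the scale factor are placeholders
and not taken from this thread.

static char help[] = "Sketch: dense matrix product and scaling, optionally on the GPU.\n";

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr;

  /* A: 8192 x 8192 dense matrix; passing -mat_type densecuda at run time
     (if that type is available in this build) keeps the entries on the GPU. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192); CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE); CHKERRQ(ierr);   /* default; overridden by -mat_type */
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);

  /* B: 8192 x 32768 dense matrix, created the same way. */
  ierr = MatCreate(PETSC_COMM_WORLD, &B); CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768); CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(B); CHKERRQ(ierr);
  ierr = MatSetUp(B); CHKERRQ(ierr);

  /* ... fill A and B, e.g. with MatSetValues() or MatDenseGetArray() ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* C = A*B (MatMatMultNum in the log above) followed by a scaling
     (MatScale); with a GPU dense type these should run on the device. */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C); CHKERRQ(ierr);
  ierr = MatScale(C, 2.0); CHKERRQ(ierr);           /* placeholder factor */

  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = MatDestroy(&B); CHKERRQ(ierr);
  ierr = MatDestroy(&C); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this with, e.g., "-mat_type densecuda -log_view" should show whether
the MatMatMultNum and MatScale flops end up in the GPU Mflop/s and GPU %F
columns of the log; if those stay at zero, the work is still done on the CPU.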