[petsc-users] Using distributed dense matrix/vector operations on a GPU
Roland Richter
roland.richter at ntnu.no
Tue Feb 16 07:30:06 CST 2021
For MatMatMult, the sizes of the involved matrices are 8k x 8k and 8k x
32k. I am not sure where MatScale is called, as I never call it explicitly.
If MatDiagonalScale calls MatScale internally, then the involved matrix
has a size of 8k x 32k.
Regards,
Roland
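
For reference, a minimal sketch of the MatDiagonalScale path mentioned
above, assuming a standard PETSc build; the matrix size follows the
thread, while the fill values and variable names are purely illustrative
and error checking is omitted:

/* Sketch (not from the thread): diagonal scaling of a distributed
 * dense 8192 x 32768 matrix, A <- diag(l) * A * diag(r). */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat A;
  Vec l, r;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Distributed dense matrix with the size discussed above */
  MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768, NULL, &A);
  /* ... fill A ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  /* Left vector matches the rows (8192), right vector the columns (32768) */
  MatCreateVecs(A, &r, &l);
  VecSet(l, 2.0);
  VecSet(r, 0.5);

  MatDiagonalScale(A, l, r);   /* A <- diag(l) * A * diag(r) */

  VecDestroy(&l);
  VecDestroy(&r);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}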
On 16.02.21 at 14:25, Stefano Zampini wrote:
>
>
>
>
>     the usual size of those matrices (cumulative, not distributed) is
>     at least [8192x8192] x [8192x32768] complex entries as a lower
>     bound. Does it still make sense to test CUDA for speedup?
>
> I don't understand your notation. Are you saying your matrices are
> 8K x 8K? Or 8K x 32K? Or what?
>
>
> Thank you,
>
> regards,
>
> Roland
>
> On 16.02.21 at 14:14, Stefano Zampini wrote:
>>
>>
>> On Tue, Feb 16, 2021 at 11:43, Roland Richter
>> <roland.richter at ntnu.no> wrote:
>>
>> Hi,
>>
>> after profiling my program using -log_view, I got the
>> following output (all matrices are dense):
>>
>> Using 8 OpenMP threads
>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           5.074e+03     1.000   5.074e+03
>> Objects:              2.158e+03     1.000   2.158e+03
>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331      0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Vector    37             34       1634064     0.
>>               Matrix  2120           2120   52734663456     0.
>>               Viewer     1              0             0     0.
>> ========================================================================================================================
>>
>> Apparently, MatMatMultNum and MatScale take by far the most time
>> during execution. Therefore, I was wondering if it is possible to
>> move those operations (or all matrices and vectors) to a GPU or
>> another accelerator. According to
>> https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only
>> supported for distributed vectors, but not for dense distributed
>> matrices. Are there any updates related to that, or other ways to
>> speed up the involved operations?
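
A minimal sketch of what one could try here, assuming a CUDA-enabled
PETSc build in which the GPU-backed dense matrix type MATDENSECUDA (and
VECCUDA for vectors) is available and covers the MPI-parallel case;
whether that holds for this particular version is exactly the question
above, so treat this as illustrative only. The sizes follow the thread,
everything else (fill, scaling factor) is made up and error checking is
omitted:

/* Sketch: GPU-backed dense matrices for the dominant operations in the log. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat A, B, C;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192);
  MatSetType(A, MATDENSECUDA);   /* or MatSetFromOptions() + -mat_type densecuda */
  MatSetUp(A);

  MatCreate(PETSC_COMM_WORLD, &B);
  MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768);
  MatSetType(B, MATDENSECUDA);
  MatSetUp(B);

  /* ... fill A and B ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);

  /* The expensive operations from the log, which would then be eligible
   * to run on the GPU if the type supports them. */
  MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);
  MatScale(C, 0.5);

  MatDestroy(&C); MatDestroy(&B); MatDestroy(&A);
  PetscFinalize();
  return 0;
}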
>>
>>
>> You should compute the timings associated with each call, and not
>> consider the lump sum. For example, each MatScale takes
>> 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can
>> get any reasonable speedup with CUDA. What are the sizes of
>> these matrices?
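
For comparison (computed from the log above; this arithmetic is not part
of the original exchange), the same per-call averages for the other two
expensive events are:

\[
\text{MatMatMultNum: } \frac{4.0706\times 10^{2}\ \mathrm{s}}{4161\ \text{calls}} \approx 9.8\times 10^{-2}\ \mathrm{s/call},
\qquad
\text{MatCopy: } \frac{3.7437\times 10^{2}\ \mathrm{s}}{49928\ \text{calls}} \approx 7.5\times 10^{-3}\ \mathrm{s/call}.
\]

Each MatMatMultNum also performs roughly \(5.02\times 10^{13}/4161 \approx
1.2\times 10^{10}\) flop, versus about \(1.60\times 10^{12}/56162 \approx
2.8\times 10^{7}\) flop per MatScale.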
>>
>>
>> Thanks!
>>
>> Regards,
>>
>> Roland
>>
>>
>>
>> --
>> Stefano
>
>
>
> --
> Stefano