[petsc-users] Using distributed dense matrix/vector operations on a GPU
Roland Richter
roland.richter at ntnu.no
Tue Feb 16 07:55:51 CST 2021
Yes, I call MatAXPY, but the matrix size stays the same.
Regards,
Roland
On 16.02.21 at 14:46, Stefano Zampini wrote:
>
> On Tue, 16 Feb 2021 at 16:30, Roland Richter
> <roland.richter at ntnu.no> wrote:
>
>     For MatMatMult, the sizes of the involved matrices are 8k x 8k
>     and 8k x 32k.
>
> Ok, so you have 32k columns to multiply against. Maybe you can get
> some speedup.
> However, if you keep updating the matrix entries on the CPU, then
> using CUDA will make little sense.
> In any case, you can try it and see if you get any speedup.
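
A minimal sketch of such an experiment, assuming a CUDA-enabled PETSc build
in which the dense matrices can be switched to the densecuda type via
-mat_type on the command line (the global sizes below are the ones quoted in
this thread; the program itself is hypothetical, not Roland's code):

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A, B, C;
      PetscInt       m = 8192, k = 8192, n = 32768; /* sizes from this thread */
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* A is m x k; run with -mat_type densecuda -log_view to keep data on the GPU */
      ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
      ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, k);CHKERRQ(ierr);
      ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);
      ierr = MatSetFromOptions(A);CHKERRQ(ierr);
      ierr = MatSetUp(A);CHKERRQ(ierr);

      /* B is k x n */
      ierr = MatCreate(PETSC_COMM_WORLD, &B);CHKERRQ(ierr);
      ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, k, n);CHKERRQ(ierr);
      ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);
      ierr = MatSetFromOptions(B);CHKERRQ(ierr);
      ierr = MatSetUp(B);CHKERRQ(ierr);

      /* ... fill A and B with MatSetValues, ideally only once ... */
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      /* C = A*B followed by a scaling, the two dominant events in the -log_view output */
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
      ierr = MatScale(C, 2.0);CHKERRQ(ierr);

      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = MatDestroy(&B);CHKERRQ(ierr);
      ierr = MatDestroy(&C);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

With such a setup, -log_view also reports the GPU Mflop/s and the
CpuToGpu/GpuToCpu copy columns, which makes it easy to see whether the data
keeps bouncing back to the host.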
>
>     I am not sure where MatScale is called; I never call it
>     explicitly. If MatDiagonalScale calls MatScale, then the involved
>     matrices have a size of 8k x 32k.
>
> No, it does not. Are you calling MatAYPX?
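
For reference, the two routines differ only in which operand gets scaled.
A minimal sketch of the distinction using the PETSc Mat interface (Y and X
are placeholder dense matrices of equal size, a is a scalar; the helper name
is made up for illustration):

    #include <petscmat.h>

    /* Hypothetical helper contrasting the two updates discussed above */
    static PetscErrorCode axpy_vs_aypx(Mat Y, Mat X, PetscScalar a)
    {
      PetscErrorCode ierr;

      /* MatAXPY: Y <- a*X + Y (Y keeps its own size, as Roland observes) */
      ierr = MatAXPY(Y, a, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      /* MatAYPX: Y <- a*Y + X (the routine Stefano asks about) */
      ierr = MatAYPX(Y, a, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      return 0;
    }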
>
>
>
> Regards,
>
> Roland
>
> On 16.02.21 at 14:25, Stefano Zampini wrote:
>>
>>     The usual size of those matrices (cumulative, not distributed)
>>     is at least [8192 x 8192] x [8192 x 32768] complex entries.
>>     Does it still make sense to test CUDA for speedup?
>>
>> I don't understand your notation. Are you saying your matrices
>> are 8K x 8K? or 8K*32K? or what?
>>
>>
>> Thank you,
>>
>> regards,
>>
>> Roland
>>
>> On 16.02.21 at 14:14, Stefano Zampini wrote:
>>>
>>>
>>> On Tue, 16 Feb 2021 at 11:43, Roland Richter
>>> <roland.richter at ntnu.no> wrote:
>>>
>>> Hei,
>>>
>>> after profiling my program using -log_view, I got the
>>> following output (all matrices are dense):
>>>
>>>     Using 8 OpenMP threads
>>>     Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>
>>>                              Max       Max/Min     Avg       Total
>>>     Time (sec):           5.074e+03     1.000   5.074e+03
>>>     Objects:              2.158e+03     1.000   2.158e+03
>>>     Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>>     Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>>     MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Reductions:       0.000e+00     0.000
>>>
>>>     Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                                 e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                                 and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>>     Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                             Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>>      0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>>
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>     Phase summary info:
>>>        Count: number of times phase was executed
>>>        Time and Flop: Max - maximum over all processors
>>>                       Ratio - ratio of maximum to minimum over all processors
>>>        Mess: number of messages sent
>>>        AvgLen: average message length (bytes)
>>>        Reduct: number of global reductions
>>>        Global: entire computation
>>>        Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>           %T - percent time in this phase         %F - percent flop in this phase
>>>           %M - percent messages in this phase     %L - percent message lengths in this phase
>>>           %R - percent reductions in this phase
>>>        Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>        GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>        CpuToGpu Count: total number of CPU to GPU copies per processor
>>>        CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>        GpuToCpu Count: total number of GPU to CPU copies per processor
>>>        GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>        GPU %F: percent flops on GPU in this event
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>     VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     Memory usage is given in bytes:
>>>
>>>     Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>     Reports information only for process 0.
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>                   Vector    37             34      1634064     0.
>>>                   Matrix  2120           2120  52734663456     0.
>>>                   Viewer     1              0            0     0.
>>>     ========================================================================================================================
>>>
>>>     Apparently, MatMatMultNum and MatScale take by far the most
>>>     time during execution. Therefore, I was wondering whether it is
>>>     possible to move those operations, or all matrices and vectors,
>>>     to a GPU or another accelerator. According to
>>>     https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only
>>>     supported for distributed vectors, but not for dense
>>>     distributed matrices. Are there any updates related to that, or
>>>     other ways to speed up the involved operations?
>>>
>>>
>>> You should look at the timing of each individual call, not at the
>>> lump sum. For example, each MatScale takes 6.9348e+02/56162 =
>>> 0.012347851 seconds on average; I doubt you can get any reasonable
>>> speedup with CUDA. What are the sizes of these matrices?
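
One way to get such per-phase numbers directly from -log_view, rather than
dividing totals by counts by hand, is to wrap the section of interest in its
own logging stage with PetscLogStagePush()/PetscLogStagePop(), as the log
header above already hints. A minimal sketch (the helper and stage name are
made up for illustration):

    #include <petscmat.h>

    /* Hypothetical helper: time one C = A*B product in its own -log_view stage */
    static PetscErrorCode timed_product(Mat A, Mat B, Mat *C)
    {
      PetscLogStage  stage;
      PetscErrorCode ierr;

      ierr = PetscLogStageRegister("Single MatMatMult", &stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      return 0;
    }

The -log_view summary then reports time, flop, and Mflop/s for that stage
separately from the Main Stage.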
>>>
>>>
>>> Thanks!
>>>
>>> Regards,
>>>
>>> Roland
>>>
>>>
>>>
>>> --
>>> Stefano
>>
>>
>>
>> --
>> Stefano
>
>
>
> --
> Stefano