<div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>The usual size of those matrices (cumulative, not distributed) is
      at least [8192x8192] x [8192x32768] complex entries as a lower
      bound. Does it still make sense to test CUDA for a speedup?</p>
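<p>For a rough sense of scale, here is a minimal sketch of the memory and flop cost of one such product. It assumes the notation means a dense 8192 x 8192 matrix multiplied by a dense 8192 x 32768 matrix, that entries are double-precision complex (16 bytes each), and that one complex multiply-add counts as 8 real flops, in line with the flop counting convention printed in the -log_view output. The helper names are hypothetical and not part of PETSc.</p>
<pre>
```python
# Assumption: double-precision complex entries, i.e. two 8-byte reals each.
BYTES_PER_COMPLEX = 16


def dense_bytes(rows, cols):
    """Storage needed for a dense rows x cols complex matrix, in bytes."""
    return rows * cols * BYTES_PER_COMPLEX


def matmat_flops(m, k, n):
    """Real flops for an (m x k) times (k x n) complex product.

    Assumes 8 real flops per complex multiply-add
    (6 for the multiply, 2 for the add).
    """
    return 8 * m * k * n


# Assumed reading of "[8192x8192] x [8192x32768]": C = A * B.
m, k, n = 8192, 8192, 32768

a_gib = dense_bytes(m, k) / 2**30  # left factor A
b_gib = dense_bytes(k, n) / 2**30  # right factor B
c_gib = dense_bytes(m, n) / 2**30  # result C
flops = matmat_flops(m, k, n)

print(f"A: {a_gib:.1f} GiB, B: {b_gib:.1f} GiB, C: {c_gib:.1f} GiB")
print(f"flops per product: {flops:.3e}")
```
</pre>
<p>Under these assumptions the three matrices together need about 9 GiB, which is worth comparing against the memory of the accelerator being considered.</p>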

    </div></blockquote><div>I don't understand your notation. Are you saying your matrices are 8K x 8K, or 8K x 32K, or something else?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>Thank you,</p>

    <p>regards,</p>

    <p>Roland<br>

    </p>

    <div>On 16.02.21 at 14:14, Stefano Zampini wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Tue, 16 Feb 2021 at 11:43, Roland Richter &lt;<a href="mailto:roland.richter@ntnu.no" target="_blank">roland.richter@ntnu.no</a>&gt; wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p>Hi,</p>

              <p>After profiling my program with -log_view, I got the
                following output (all matrices are dense):</p>

              <p><i>Using 8 OpenMP threads</i><i><br>

                </i><i>Using Petsc Development GIT revision:

                  v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41

                  -0600</i><i><br>

                </i><i><br>

                </i><i>                         Max       Max/Min    

                  Avg       Total</i><i><br>

                </i><i>Time (sec):           5.074e+03     1.000  

                  5.074e+03</i><i><br>

                </i><i>Objects:              2.158e+03     1.000  

                  2.158e+03</i><i><br>

                </i><i>Flop:                 5.236e+13     1.000  

                  5.236e+13  5.236e+13</i><i><br>

                </i><i>Flop/sec:             1.032e+10     1.000  

                  1.032e+10  1.032e+10</i><i><br>

                </i><i>MPI Messages:         0.000e+00     0.000  

                  0.000e+00  0.000e+00</i><i><br>

                </i><i>MPI Message Lengths:  0.000e+00     0.000  

                  0.000e+00  0.000e+00</i><i><br>

                </i><i>MPI Reductions:       0.000e+00     0.000</i><i><br>

                </i><i><br>

                </i><i>Flop counting convention: 1 flop = 1 real number

                  operation of type (multiply/divide/add/subtract)</i><i><br>

                </i><i>                            e.g., VecAXPY() for

                  real vectors of length N --> 2N flop</i><i><br>

                </i><i>                            and VecAXPY() for

                  complex vectors of length N --> 8N flop</i><i><br>

                </i><i><br>

                </i><i>Summary of Stages:   ----- Time ------  -----

                  Flop ------  --- Messages ---  -- Message Lengths -- 

                  -- Reductions --</i><i><br>

                </i><i>                        Avg     %Total    

                  Avg     %Total    Count   %Total     Avg        

                  %Total    Count   %Total</i><i><br>

                </i><i> 0:      Main Stage: 5.0744e+03 100.0% 

                  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       

                  0.0%  0.000e+00   0.0%</i><i><br>

                </i><i><br>

                </i><i>------------------------------------------------------------------------------------------------------------------------</i><i><br>

                </i><i>See the 'Profiling' chapter of the users' manual

                  for details on interpreting output.</i><i><br>

                </i><i>Phase summary info:</i><i><br>

                </i><i>   Count: number of times phase was executed</i><i><br>

                </i><i>   Time and Flop: Max - maximum over all

                  processors</i><i><br>

                </i><i>                  Ratio - ratio of maximum to

                  minimum over all processors</i><i><br>

                </i><i>   Mess: number of messages sent</i><i><br>

                </i><i>   AvgLen: average message length (bytes)</i><i><br>

                </i><i>   Reduct: number of global reductions</i><i><br>

                </i><i>   Global: entire computation</i><i><br>

                </i><i>   Stage: stages of a computation. Set stages

                  with PetscLogStagePush() and PetscLogStagePop().</i><i><br>

                </i><i>      %T - percent time in this phase         %F

                  - percent flop in this phase</i><i><br>

                </i><i>      %M - percent messages in this phase     %L

                  - percent message lengths in this phase</i><i><br>

                </i><i>      %R - percent reductions in this phase</i><i><br>

                </i><i>   Total Mflop/s: 10e-6 * (sum of flop over all

                  processors)/(max time over all processors)</i><i><br>

                </i><i>   GPU Mflop/s: 10e-6 * (sum of flop on GPU over

                  all processors)/(max GPU time over all processors)</i><i><br>

                </i><i>   CpuToGpu Count: total number of CPU to GPU

                  copies per processor</i><i><br>

                </i><i>   CpuToGpu Size (Mbytes): 10e-6 * (total size of

                  CPU to GPU copies per processor)</i><i><br>

                </i><i>   GpuToCpu Count: total number of GPU to CPU

                  copies per processor</i><i><br>

                </i><i>   GpuToCpu Size (Mbytes): 10e-6 * (total size of

                  GPU to CPU copies per processor)</i><i><br>

                </i><i>   GPU %F: percent flops on GPU in this event</i><i><br>

                </i><i>------------------------------------------------------------------------------------------------------------------------</i><i><br>

                </i><i>Event                Count      Time (sec)    

                  Flop                              --- Global ---  ---

                  Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu -

                  GPU</i><i><br>

                </i><i>                   Max Ratio  Max     Ratio  

                  Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T

                  %F %M %L %R Mflop/s Mflop/s Count   Size   Count  

                  Size  %F</i><i><br>

                </i><i>---------------------------------------------------------------------------------------------------------------------------------------------------------------</i><i><br>

                </i><i><br>

                </i><i>--- Event Stage 0: Main Stage</i><i><br>

                </i><i><br>

                </i><i>VecSet                37 1.0 1.0354e-04 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>VecAssemblyBegin      31 1.0 2.9080e-06 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>VecAssemblyEnd        31 1.0 2.3270e-06 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatCopy            49928 1.0 3.7437e+02 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0  

                  7  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatConvert          2080 1.0 5.8492e+00 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatScale           56162 1.0 6.9348e+02 1.0

                  1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0 

                  14  3  0  0  0  2303       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatAssemblyBegin   56222 1.0 1.7370e-02 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatAssemblyEnd     56222 1.0 8.8713e-03 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatZeroEntries     60363 1.0 3.1011e+02 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  

                  6  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatAXPY             8320 1.0 1.2254e+02 1.0

                  5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0  

                  2  1  0  0  0  4557       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatMatMultSym       4161 1.0 7.1613e-03 1.0

                  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  

                  0  0  0  0  0     0       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>MatMatMultNum       4161 1.0 4.0706e+02 1.0

                  5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0  

                  8 96  0  0  0 123331       0      0 0.00e+00    0

                  0.00e+00  0</i><i><br>

                </i><i>---------------------------------------------------------------------------------------------------------------------------------------------------------------</i><i><br>

                </i><i><br>

                </i><i>Memory usage is given in bytes:</i><i><br>

                </i><i><br>

                </i><i>Object Type          Creations   Destructions    

                  Memory  Descendants' Mem.</i><i><br>

                </i><i>Reports information only for process 0.</i><i><br>

                </i><i><br>

                </i><i>--- Event Stage 0: Main Stage</i><i><br>

                </i><i><br>

                </i><i>              Vector    37             34     

                  1634064     0.</i><i><br>

                </i><i>              Matrix  2120           2120 

                  52734663456     0.</i><i><br>

                </i><i>              Viewer     1             

                  0            0     0.</i><i><br>

                </i><i>========================================================================================================================</i></p>

              <p>Apparently, MatMatMultNum and MatScale take the most
                time (by far) during execution. Therefore, I was
                wondering whether it is possible to move those operations,
                or all matrices and vectors, to a GPU or another accelerator.
                According to <a href="https://www.mcs.anl.gov/petsc/features/gpus.html" target="_blank">https://www.mcs.anl.gov/petsc/features/gpus.html</a>,
                CUDA is only supported for distributed vectors, but not
                for dense distributed matrices. Are there any updates
                related to that, or other ways to speed up these
                operations?</p>

            </div>

          </blockquote>

          <div><br>

          </div>

          <div>You should compute the timings associated with each call
            rather than consider the lump sum. For example, each MatScale
            takes 6.9348e+02/56162 = 0.012347851 seconds on average, so I
            doubt you can get any reasonable speedup with CUDA. What are
            the sizes of these matrices? </div>
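          <div>The same per-call arithmetic can be applied to the other events. This is a minimal sketch; the counts and total times are copied from the -log_view table quoted above, and the helper name per_call_ms is hypothetical, not a PETSc function.</div>
<pre>
```python
# Call counts and total times in seconds, copied from the quoted
# -log_view table (Main Stage). Events with negligible time are omitted.
events = {
    "MatCopy": (49928, 3.7437e+02),
    "MatScale": (56162, 6.9348e+02),
    "MatZeroEntries": (60363, 3.1011e+02),
    "MatAXPY": (8320, 1.2254e+02),
    "MatMatMultNum": (4161, 4.0706e+02),
}


def per_call_ms(count, total_seconds):
    """Average wall time of a single call, in milliseconds."""
    return 1e3 * total_seconds / count


per_call = {name: per_call_ms(c, t) for name, (c, t) in events.items()}

# Print the events from slowest to fastest average call.
for name, ms in sorted(per_call.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {ms:8.2f} ms per call")
```
</pre>
          <div>By this estimate only MatMatMultNum comes close to 0.1 seconds per call; the other events average roughly 5 to 15 milliseconds each.</div>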

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p>Thanks!</p>

              <p>Regards,</p>

              <p>Roland<br>

              </p>

            </div>

          </blockquote>

        </div>

        <br clear="all">

        <div><br>

        </div>

        -- <br>

        <div dir="ltr">Stefano</div>

      </div>

    </blockquote>

  </div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Stefano</div></div>