<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Yes, I call MatAXPY, but the matrix size stays the same.</p>
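    <p>For reference, a minimal sketch of the kind of update being
      discussed (the function and matrix names are illustrative, not taken
      from my actual code; both operands are dense and keep the same
      size):</p>
    <pre>#include &lt;petscmat.h&gt;

/* Minimal sketch: Y and X are dense matrices of identical, fixed size,
 * so the nonzero-structure hint does not affect the result and
 * SAME_NONZERO_PATTERN can be passed. */
static PetscErrorCode update(Mat Y, Mat X, PetscScalar a)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatAXPY(Y, a, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* Y = a*X + Y */
  /* MatAYPX(Y, a, X, SAME_NONZERO_PATTERN) would instead compute Y = a*Y + X */
  PetscFunctionReturn(0);
}
</pre>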
    <p>Regards,</p>
    <p>Roland<br>
    </p>
    <div class="moz-cite-prefix">Am 16.02.21 um 14:46 schrieb Stefano
      Zampini:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGPUishYV+H4gK8ATahAqu7aRuZtdt3JiaZdLy13SE+f6DihyQ@mail.gmail.com">
      <div dir="ltr"><br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">Il giorno mar 16 feb 2021
            alle ore 16:30 Roland Richter <<a
              href="mailto:roland.richter@ntnu.no"
              moz-do-not-send="true">roland.richter@ntnu.no</a>> ha
            scritto:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>For MatMatMult the sizes of the involved matrices are
                8k x 8k and 8k x 32k.</p>
            </div>
          </blockquote>
          <div>Ok, so you have 32k columns to multiply against. Maybe
            you can get some speedup.</div>
          <div>However, if you keep updating the matrix entries on the
            CPU, then using CUDA will make little sense.</div>
          <div>In any case, you can try and see if you get any speedup.</div>
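          <div>
            <p>A minimal sketch of what such a test could look like. The
              CUDA dense matrix type mentioned below is an assumption based
              on the development version shown in the log, and it requires a
              PETSc build configured with CUDA support:</p>
            <pre>#include &lt;petscmat.h&gt;

/* Sketch only: create a dense matrix whose storage type can be switched
 * at run time, e.g. with -mat_type densecuda (assumed to be available in
 * a --with-cuda build of this development version). */
int main(int argc, char **argv)
{
  Mat            A;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);   /* CPU dense by default */
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);      /* lets -mat_type override it */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /* ... fill A, run the MatMatMult/MatScale loop, compare -log_view timings ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
</pre>
            <p>Running once with and once without -mat_type densecuda and
              comparing the GPU columns of -log_view should show whether the
              host-device copies eat up the gain.</p>
          </div>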
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>I am not sure where MatScale is called; I never call
                it explicitly. If MatDiagonalScale calls MatScale, then
                the involved matrices have a size of 8k x 32k.</p>
            </div>
          </blockquote>
          <div>No, it does not. Are you calling MatAYPX?</div>
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>Regards,</p>
              <p>Roland<br>
              </p>
              <div>On 16.02.21 at 14:25, Stefano Zampini wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <p><br>
                         </p>
                    </blockquote>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>The usual size of those matrices (cumulative,
                          not distributed) is at least
                          [8192x8192] x [8192x32768] complex entries.
                          Does it still make sense to test CUDA for
                          speedup?</p>
                      </div>
                    </blockquote>
                    <div>I don't understand your notation. Are you
                      saying your matrices are 8K x 8K? or 8K*32K? or
                      what?</div>
                    <div> </div>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>Thank you,</p>
                        <p>regards,</p>
                        <p>Roland<br>
                        </p>
                        <div>On 16.02.21 at 14:14, Stefano
                          Zampini wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div dir="ltr"><br>
                            </div>
                            <br>
                            <div class="gmail_quote">
                              <div dir="ltr" class="gmail_attr">Il
                                giorno mar 16 feb 2021 alle ore 11:43
                                Roland Richter <<a
                                  href="mailto:roland.richter@ntnu.no"
                                  target="_blank" moz-do-not-send="true">roland.richter@ntnu.no</a>>
                                ha scritto:<br>
                              </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
                                0.8ex;border-left:1px solid
                                rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>Hi,</p>
                                  <p>after profiling my program using
                                    -log_view, I got the following
                                    output (all matrices are dense):</p>
                                  <pre>
Using 8 OpenMP threads
Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600

                         Max       Max/Min     Avg       Total
Time (sec):           5.074e+03     1.000   5.074e+03
Objects:              2.158e+03     1.000   2.158e+03
Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector    37             34      1634064     0.
              Matrix  2120           2120  52734663456     0.
              Viewer     1              0            0     0.
========================================================================================================================
</pre>
                                  <p>Apparently, MatMatMultNum and
                                    MatScale take the most time (by far)
                                    during execution. Therefore, I was
                                    wondering if it is possible to move
                                    those operations (and all matrices
                                    and vectors) to a GPU or another
                                    accelerator. According to <a
                                      href="https://www.mcs.anl.gov/petsc/features/gpus.html"
                                      target="_blank"
                                      moz-do-not-send="true">https://www.mcs.anl.gov/petsc/features/gpus.html</a>,
                                    CUDA is only supported for
                                    distributed vectors, but not for
                                    dense distributed matrices. Are
                                    there any updates related to that,
                                    or other ways to speed up the
                                    involved operations?</p>
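                                  <p>For reference, the dominant product is
                                    a repeated dense-dense multiply of the
                                    kind sketched below (the names and the
                                    reuse of the result matrix are
                                    illustrative assumptions, not my actual
                                    code):</p>
                                  <pre>#include &lt;petscmat.h&gt;

/* Sketch only: A is 8192 x 8192, B is 8192 x 32768, both dense. C is
 * created on the first pass and reused afterwards, so the product is
 * not set up from scratch on every call. */
static PetscErrorCode repeated_product(Mat A, Mat B, PetscInt nsteps)
{
  Mat            C = NULL;
  PetscInt       it;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  for (it = 0; it &lt; nsteps; ++it) {
    /* ... update the entries of A and B here ... */
    if (!C) {
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    } else {
      ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    }
  }
  ierr = MatDestroy(&C);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
</pre>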
                                </div>
                              </blockquote>
                              <div><br>
                              </div>
                              <div>You should compute the timings
                                associated with each call, and not
                                consider the lump sum. For example, each
                                MatScale takes 6.9348e+02/56162 =
                                0.012347851 seconds on average; I doubt
                                you can get any reasonable speedup with
                                CUDA. What are the sizes of these
                                matrices?</div>
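                              <div>
                                <p>A sketch of how those per-call averages
                                  could also be extracted programmatically,
                                  instead of dividing the -log_view totals
                                  by hand (the event names are the ones
                                  printed in the log; stage 0 is the main
                                  stage):</p>
                                <pre>#include &lt;petscsys.h&gt;

/* Sketch only: query the accumulated profiling data of one event on the
 * main stage (stage 0) and print the average time per call. Requires the
 * same logging setup as the -log_view run above. */
static PetscErrorCode report_event(const char *name)
{
  PetscLogEvent      event;
  PetscEventPerfInfo info;
  PetscErrorCode     ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogEventGetId(name, &event);CHKERRQ(ierr);
  ierr = PetscLogEventGetPerfInfo(0, event, &info);CHKERRQ(ierr);
  if (info.count) {
    ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: %d calls, %g s/call\n", name,
                       info.count, (double)(info.time/info.count));CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

/* e.g. report_event("MatScale"); report_event("MatMatMultNum"); */
</pre>
                              </div>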
                              <div> </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
                                0.8ex;border-left:1px solid
                                rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>Thanks!</p>
                                  <p>Regards,</p>
                                  <p>Roland<br>
                                  </p>
                                </div>
                              </blockquote>
                            </div>
                            <br clear="all">
                            <div><br>
                            </div>
                            -- <br>
                            <div dir="ltr">Stefano</div>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                  <br clear="all">
                  <div><br>
                  </div>
                  -- <br>
                  <div dir="ltr">Stefano</div>
                </div>
              </blockquote>
            </div>
          </blockquote>
        </div>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div dir="ltr" class="gmail_signature">Stefano</div>
      </div>
    </blockquote>
  </body>
</html>