On Tue, 16 Feb 2021 at 16:30, Roland Richter <roland.richter@ntnu.no> wrote:

> For MatMatMult the size of the involved matrices is 8k x 8k and 8k x 32k.

Ok, so you have 32k columns to multiply against; maybe you can get some speedup. However, if you keep updating the matrix entries on the CPU, then using CUDA will make little sense. In any case, you can try it and see whether you get any speedup.
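Something like the following minimal sketch could serve as such a test. It assumes a CUDA-enabled PETSc build in which a dense CUDA matrix type can be selected with -mat_type densecuda (adjust the option value if your version differs); the file name, matrix names and the fill step are placeholders, not your actual code, and only the sizes are taken from the numbers above.

/* Hypothetical driver: C = A * B with dense 8192 x 8192 and 8192 x 32768 matrices.
 * Run e.g. as:  ./matmult_test -mat_type densecuda -log_view
 * and compare against a plain -mat_type dense run. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       m = 8192, k = 8192, n = 32768;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, k);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);   /* honors -mat_type at run time */
  ierr = MatSetUp(A);CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B);CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, k, n);CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(B);CHKERRQ(ierr);
  ierr = MatSetUp(B);CHKERRQ(ierr);

  /* ... fill A and B here (ideally once, so the data can stay on the GPU) ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* the operation dominating the log below */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}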
> I am not sure where MatScale is called; I never call it explicitly. If MatDiagonalScale calls MatScale, then the involved matrices have a size of 8k x 32k.

No, it does not. Are you calling MatAYPX?
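For reference, MatAYPX(Y, a, X, str) computes Y = a*Y + X, and (to my knowledge) the generic fallback performs the scaling and the addition as separate operations, so it can show up in -log_view as MatScale and MatAXPY events even though MatScale is never called directly. A minimal sketch of such an update; Y, X, alpha and the function name are placeholders:

#include <petscmat.h>

/* Hedged sketch: update of the form Y = alpha*Y + X.  If MatAYPX falls back to
 * MatScale followed by MatAXPY for this matrix type (an assumption), every call
 * would be logged as one MatScale and one MatAXPY event. */
static PetscErrorCode UpdateMatrix(Mat Y, PetscScalar alpha, Mat X)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatAYPX(Y, alpha, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}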
> Regards,
> Roland
>
> On 16.02.21 at 14:25, Stefano Zampini wrote:
>>
>>> the usual size of those matrices is (cumulative, not distributed) at least [8192 x 8192] x [8192 x 32768] complex entries as a lower bound. Does it still make sense to test CUDA for speedup?
>>
>> I don't understand your notation. Are you saying your matrices are 8K x 8K, or 8K x 32K, or what?
>>
>>> Thank you,
>>> regards,
>>> Roland
>>>
>>> On 16.02.21 at 14:14, Stefano Zampini wrote:
>>>
>>>> On Tue, 16 Feb 2021 at 11:43, Roland Richter <roland.richter@ntnu.no> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> after profiling my program using -log_view, I got the following output (all matrices are dense):
>>>>>
>>>>> Using 8 OpenMP threads
>>>>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>>>
>>>>>                          Max       Max/Min     Avg       Total
>>>>> Time (sec):           5.074e+03     1.000   5.074e+03
>>>>> Objects:              2.158e+03     1.000   2.158e+03
>>>>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>>>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>>>> MPI Reductions:       0.000e+00     0.000
>>>>>
>>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>>>
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>>> Phase summary info:
>>>>>    Count: number of times phase was executed
>>>>>    Time and Flop: Max - maximum over all processors
>>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>>    Mess: number of messages sent
>>>>>    AvgLen: average message length (bytes)
>>>>>    Reduct: number of global reductions
>>>>>    Global: entire computation
>>>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>>>       %R - percent reductions in this phase
>>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>>>    GPU %F: percent flops on GPU in this event
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> Memory usage is given in bytes:
>>>>>
>>>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>>> Reports information only for process 0.
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>>               Vector    37             34      1634064     0.
>>>>>               Matrix  2120           2120  52734663456     0.
>>>>>               Viewer     1              0            0     0.
>>>>> ========================================================================================================================
>>>>>
>>>>> Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering if it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>>>>
>>>> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average, so I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
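(An aside on the per-call numbers quoted above: one way to get timings for just the hot part of the code, rather than lump sums over the whole run, is to wrap that part in a user-defined log stage so that -log_view reports it separately, as the log header itself suggests with PetscLogStagePush()/PetscLogStagePop(). A minimal sketch, with a made-up stage name and placeholder body:)

#include <petscsys.h>

/* Hedged sketch: isolate the hot loop in its own -log_view stage. */
static PetscErrorCode ProfileHotLoop(void)
{
  PetscLogStage  stage;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogStageRegister("Hot loop", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  /* ... the MatScale / MatAXPY / MatMatMult calls to be timed go here ... */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}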
>>>>
>>>>> Thanks!
>>>>>
>>>>> Regards,
>>>>> Roland
>>>>
>>>> --
>>>> Stefano
>>
>> --
>> Stefano

--
Stefano