<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Yes, I call MatAXPY, but the matrix size stays the same.</p>
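    <p>For reference, a minimal sketch of the kind of update being
      discussed (the function and matrix names are illustrative, not taken
      from my actual code; both operands are dense and keep the same
      size):</p>
    <pre>#include &lt;petscmat.h&gt;

/* Minimal sketch: Y and X are dense matrices of identical, fixed size,
 * so the nonzero-structure hint does not affect the result and
 * SAME_NONZERO_PATTERN can be passed. */
static PetscErrorCode update(Mat Y, Mat X, PetscScalar a)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatAXPY(Y, a, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* Y = a*X + Y */
  /* MatAYPX(Y, a, X, SAME_NONZERO_PATTERN) would instead compute Y = a*Y + X */
  PetscFunctionReturn(0);
}
</pre>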
    <p>Regards,</p>
    <p>Roland<br>
    </p>
    <div class="moz-cite-prefix">Am 16.02.21 um 14:46 schrieb Stefano
      Zampini:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGPUishYV+H4gK8ATahAqu7aRuZtdt3JiaZdLy13SE+f6DihyQ@mail.gmail.com">
      <div dir="ltr"><br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">Il giorno mar 16 feb 2021
            alle ore 16:30 Roland Richter <<a
              href="mailto:roland.richter@ntnu.no"
              moz-do-not-send="true">roland.richter@ntnu.no</a>> ha
            scritto:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>For MatMatMult the sizes of the involved matrices are
                8k x 8k and 8k x 32k.</p>
            </div>
          </blockquote>
          <div>Ok, so you have 32k columns to multiply against. Maybe
            you can get some speedup.</div>
          <div>However, if you keep updating the matrix entries on the
            CPU, then using CUDA will make little sense.</div>
          <div>In any case, you can try and see if you get any speedup.</div>
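          <div>
            <p>A minimal sketch of what such a test could look like. The
              CUDA dense matrix type mentioned below is an assumption based
              on the development version shown in the log, and it requires a
              PETSc build configured with CUDA support:</p>
            <pre>#include &lt;petscmat.h&gt;

/* Sketch only: create a dense matrix whose storage type can be switched
 * at run time, e.g. with -mat_type densecuda (assumed to be available in
 * a --with-cuda build of this development version). */
int main(int argc, char **argv)
{
  Mat            A;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);   /* CPU dense by default */
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);      /* lets -mat_type override it */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /* ... fill A, run the MatMatMult/MatScale loop, compare -log_view timings ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
</pre>
            <p>Running once with and once without -mat_type densecuda and
              comparing the GPU columns of -log_view should show whether the
              host-device copies eat up the gain.</p>
          </div>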
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>I am not sure where MatScale is called; I never call
                it explicitly. If MatDiagonalScale calls MatScale, then
                the involved matrices have a size of 8k x 32k.</p>
            </div>
          </blockquote>
          <div>No, it does not. Are you calling MatAYPX?</div>
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <p>Regards,</p>
              <p>Roland<br>
              </p>
              <div>On 16.02.21 at 14:25, Stefano Zampini wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <p><br>
                         </p>
                    </blockquote>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>The usual size of those matrices (cumulative,
                          not distributed) is at least
                          [8192x8192] x [8192x32768] complex entries.
                          Does it still make sense to test CUDA for
                          speedup?</p>
                      </div>
                    </blockquote>
                    <div>I don't understand your notation. Are you
                      saying your matrices are 8K x 8K? or 8K*32K? or
                      what?</div>
                    <div> </div>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>Thank you,</p>
                        <p>regards,</p>
                        <p>Roland<br>
                        </p>
                        <div>On 16.02.21 at 14:14, Stefano
                          Zampini wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div dir="ltr"><br>
                            </div>
                            <br>
                            <div class="gmail_quote">
                              <div dir="ltr" class="gmail_attr">Il
                                giorno mar 16 feb 2021 alle ore 11:43
                                Roland Richter <<a
                                  href="mailto:roland.richter@ntnu.no"
                                  target="_blank" moz-do-not-send="true">roland.richter@ntnu.no</a>>
                                ha scritto:<br>
                              </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
                                0.8ex;border-left:1px solid
                                rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>Hi,</p>
                                  <p>after profiling my program using
                                    -log_view, I got the following
                                    output (all matrices are dense):</p>
                                  <pre>
Using 8 OpenMP threads
Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600

                         Max       Max/Min     Avg       Total
Time (sec):           5.074e+03     1.000   5.074e+03
Objects:              2.158e+03     1.000   2.158e+03
Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector    37             34      1634064     0.
              Matrix  2120           2120  52734663456     0.
              Viewer     1              0            0     0.
========================================================================================================================
</pre>
                                  <p>Apparently, MatMatMultNum and
                                    MatScale take the most time (by far)
                                    during execution. Therefore, I was
                                    wondering if it is possible to move
                                    those operations (and all matrices
                                    and vectors) to a GPU or another
                                    accelerator. According to <a
                                      href="https://www.mcs.anl.gov/petsc/features/gpus.html"
                                      target="_blank"
                                      moz-do-not-send="true">https://www.mcs.anl.gov/petsc/features/gpus.html</a>,
                                    CUDA is only supported for
                                    distributed vectors, but not for
                                    dense distributed matrices. Are
                                    there any updates related to that,
                                    or other ways to speed up the
                                    involved operations?</p>
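                                  <p>For reference, the dominant product is
                                    a repeated dense-dense multiply of the
                                    kind sketched below (the names and the
                                    reuse of the result matrix are
                                    illustrative assumptions, not my actual
                                    code):</p>
                                  <pre>#include &lt;petscmat.h&gt;

/* Sketch only: A is 8192 x 8192, B is 8192 x 32768, both dense. C is
 * created on the first pass and reused afterwards, so the product is
 * not set up from scratch on every call. */
static PetscErrorCode repeated_product(Mat A, Mat B, PetscInt nsteps)
{
  Mat            C = NULL;
  PetscInt       it;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  for (it = 0; it &lt; nsteps; ++it) {
    /* ... update the entries of A and B here ... */
    if (!C) {
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    } else {
      ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    }
  }
  ierr = MatDestroy(&C);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
</pre>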
                                </div>
                              </blockquote>
                              <div><br>
                              </div>
                              <div>You should compute the timings
                                associated with each call, and not
                                consider the lump sum. For example, each
                                MatScale takes 6.9348e+02/56162 =
                                0.012347851 seconds on average; I doubt
                                you can get any reasonable speedup with
                                CUDA. What are the sizes of these
                                matrices?</div>
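                              <div>
                                <p>A sketch of how those per-call averages
                                  could also be extracted programmatically,
                                  instead of dividing the -log_view totals
                                  by hand (the event names are the ones
                                  printed in the log; stage 0 is the main
                                  stage):</p>
                                <pre>#include &lt;petscsys.h&gt;

/* Sketch only: query the accumulated profiling data of one event on the
 * main stage (stage 0) and print the average time per call. Requires the
 * same logging setup as the -log_view run above. */
static PetscErrorCode report_event(const char *name)
{
  PetscLogEvent      event;
  PetscEventPerfInfo info;
  PetscErrorCode     ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogEventGetId(name, &event);CHKERRQ(ierr);
  ierr = PetscLogEventGetPerfInfo(0, event, &info);CHKERRQ(ierr);
  if (info.count) {
    ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: %d calls, %g s/call\n", name,
                       info.count, (double)(info.time/info.count));CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

/* e.g. report_event("MatScale"); report_event("MatMatMultNum"); */
</pre>
                              </div>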
                              <div> </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
                                0.8ex;border-left:1px solid
                                rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>Thanks!</p>
                                  <p>Regards,</p>
                                  <p>Roland<br>
                                  </p>
                                </div>
                              </blockquote>
                            </div>
                            <br clear="all">
                            <div><br>
                            </div>
                            -- <br>
                            <div dir="ltr">Stefano</div>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                  <br clear="all">
                  <div><br>
                  </div>
                  -- <br>
                  <div dir="ltr">Stefano</div>
                </div>
              </blockquote>
            </div>
          </blockquote>
        </div>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div dir="ltr" class="gmail_signature">Stefano</div>
      </div>
    </blockquote>
  </body>
</html>