<div dir="ltr"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 16 Feb 2021 at 16:30, Roland Richter <<a href="mailto:roland.richter@ntnu.no">roland.richter@ntnu.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>For MatMatMult the size of the involved matrices is 8k x 8k and
8k x 32k.</p></div></blockquote><div>OK, so you have 32k columns to multiply against. Maybe you can get some speedup.</div><div>However, if you keep updating the matrix entries on the CPU, then using CUDA will make little sense.</div><div>In any case, you can try it and see whether you get any speedup.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p> I am not sure where MatScale is called; I never call it
explicitly. If MatDiagonalScale calls MatScale, then the involved
matrices have a size of 8k x 32k.</p></div></blockquote><div>No, it does not. Are you calling MatAYPX? </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
<p>Regards,</p>
<p>Roland<br>
</p>
<div>On 16.02.21 at 14:25, Stefano
Zampini wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<p><br>
</p>
</blockquote>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>The usual size of those matrices (cumulative, not
distributed) is at least [8192x8192] x [8192x32768] complex
entries, as a lower bound. Does it still make sense to
test CUDA for speedup?</p>
</div>
</blockquote>
<div>I don't understand your notation. Are you saying your
matrices are 8K x 8K, or 8K x 32K, or something else?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thank you,</p>
<p>regards,</p>
<p>Roland<br>
</p>
<div>On 16.02.21 at 14:14, Stefano Zampini wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, 16
Feb 2021 at 11:43, Roland Richter <<a href="mailto:roland.richter@ntnu.no" target="_blank">roland.richter@ntnu.no</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi,</p>
<p>after profiling my program using -log_view, I
got the following output (all matrices are
dense):</p>
<p><i>Using 8 OpenMP threads</i><i><br>
</i><i>Using Petsc Development GIT revision:
v3.14.3-583-g5464005aea GIT Date:
2021-01-25 16:01:41 -0600</i><i><br>
</i><i><br>
</i><i> Max
Max/Min Avg Total</i><i><br>
</i><i>Time (sec): 5.074e+03
1.000 5.074e+03</i><i><br>
</i><i>Objects: 2.158e+03
1.000 2.158e+03</i><i><br>
</i><i>Flop: 5.236e+13
1.000 5.236e+13 5.236e+13</i><i><br>
</i><i>Flop/sec: 1.032e+10
1.000 1.032e+10 1.032e+10</i><i><br>
</i><i>MPI Messages: 0.000e+00
0.000 0.000e+00 0.000e+00</i><i><br>
</i><i>MPI Message Lengths: 0.000e+00
0.000 0.000e+00 0.000e+00</i><i><br>
</i><i>MPI Reductions: 0.000e+00
0.000</i><i><br>
</i><i><br>
</i><i>Flop counting convention: 1 flop = 1
real number operation of type
(multiply/divide/add/subtract)</i><i><br>
</i><i> e.g.,
VecAXPY() for real vectors of length N
--> 2N flop</i><i><br>
</i><i> and
VecAXPY() for complex vectors of length N
--> 8N flop</i><i><br>
</i><i><br>
</i><i>Summary of Stages: ----- Time ------
----- Flop ------ --- Messages --- --
Message Lengths -- -- Reductions --</i><i><br>
</i><i> Avg
%Total Avg %Total Count
%Total Avg %Total Count
%Total</i><i><br>
</i><i> 0: Main Stage: 5.0744e+03 100.0%
5.2359e+13 100.0% 0.000e+00 0.0%
0.000e+00 0.0% 0.000e+00 0.0%</i><i><br>
</i><i><br>
</i><i>------------------------------------------------------------------------------------------------------------------------</i><i><br>
</i><i>See the 'Profiling' chapter of the
users' manual for details on interpreting
output.</i><i><br>
</i><i>Phase summary info:</i><i><br>
</i><i> Count: number of times phase was
executed</i><i><br>
</i><i> Time and Flop: Max - maximum over
all processors</i><i><br>
</i><i> Ratio - ratio of
maximum to minimum over all processors</i><i><br>
</i><i> Mess: number of messages sent</i><i><br>
</i><i> AvgLen: average message length
(bytes)</i><i><br>
</i><i> Reduct: number of global reductions</i><i><br>
</i><i> Global: entire computation</i><i><br>
</i><i> Stage: stages of a computation. Set
stages with PetscLogStagePush() and
PetscLogStagePop().</i><i><br>
</i><i> %T - percent time in this
phase %F - percent flop in this
phase</i><i><br>
</i><i> %M - percent messages in this
phase %L - percent message lengths in
this phase</i><i><br>
</i><i> %R - percent reductions in this
phase</i><i><br>
</i><i> Total Mflop/s: 10e-6 * (sum of flop
over all processors)/(max time over all
processors)</i><i><br>
</i><i> GPU Mflop/s: 10e-6 * (sum of flop on
GPU over all processors)/(max GPU time over
all processors)</i><i><br>
</i><i> CpuToGpu Count: total number of CPU
to GPU copies per processor</i><i><br>
</i><i> CpuToGpu Size (Mbytes): 10e-6 *
(total size of CPU to GPU copies per
processor)</i><i><br>
</i><i> GpuToCpu Count: total number of GPU
to CPU copies per processor</i><i><br>
</i><i> GpuToCpu Size (Mbytes): 10e-6 *
(total size of GPU to CPU copies per
processor)</i><i><br>
</i><i> GPU %F: percent flops on GPU in this
event</i><i><br>
</i><i>------------------------------------------------------------------------------------------------------------------------</i><i><br>
</i><i>Event Count Time
(sec) Flop
--- Global --- --- Stage ---- Total
GPU - CpuToGpu - - GpuToCpu - GPU</i><i><br>
</i><i> Max Ratio Max
Ratio Max Ratio Mess AvgLen Reduct
%T %F %M %L %R %T %F %M %L %R Mflop/s
Mflop/s Count Size Count Size %F</i><i><br>
</i><i>---------------------------------------------------------------------------------------------------------------------------------------------------------------</i><i><br>
</i><i><br>
</i><i>--- Event Stage 0: Main Stage</i><i><br>
</i><i><br>
</i><i>VecSet 37 1.0 1.0354e-04
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>VecAssemblyBegin 31 1.0 2.9080e-06
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>VecAssemblyEnd 31 1.0 2.3270e-06
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatCopy 49928 1.0 3.7437e+02
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7
0 0 0 0 7 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatConvert 2080 1.0 5.8492e+00
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatScale 56162 1.0 6.9348e+02
1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14
3 0 0 0 14 3 0 0 0 2303
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatAssemblyBegin 56222 1.0 1.7370e-02
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatAssemblyEnd 56222 1.0 8.8713e-03
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatZeroEntries 60363 1.0 3.1011e+02
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6
0 0 0 0 6 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatAXPY 8320 1.0 1.2254e+02
1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00 2
1 0 0 0 2 1 0 0 0 4557
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatMatMultSym 4161 1.0 7.1613e-03
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0
0 0 0 0 0 0 0 0 0 0
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>MatMatMultNum 4161 1.0 4.0706e+02
1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8
96 0 0 0 8 96 0 0 0 123331
0 0 0.00e+00 0 0.00e+00 0</i><i><br>
</i><i>---------------------------------------------------------------------------------------------------------------------------------------------------------------</i><i><br>
</i><i><br>
</i><i>Memory usage is given in bytes:</i><i><br>
</i><i><br>
</i><i>Object Type Creations
Destructions Memory Descendants' Mem.</i><i><br>
</i><i>Reports information only for process 0.</i><i><br>
</i><i><br>
</i><i>--- Event Stage 0: Main Stage</i><i><br>
</i><i><br>
</i><i> Vector 37
34 1634064 0.</i><i><br>
</i><i> Matrix 2120
2120 52734663456 0.</i><i><br>
</i><i> Viewer 1
0 0 0.</i><i><br>
</i><i>========================================================================================================================</i></p>
<p>Apparently, MatMatMultNum and MatScale take
the most time (by far) during execution.
Therefore, I was wondering if it is possible
to move those operations/all matrices and
vectors to a GPU or another accelerator.
According to <a href="https://www.mcs.anl.gov/petsc/features/gpus.html" target="_blank">https://www.mcs.anl.gov/petsc/features/gpus.html</a>
CUDA is only supported for distributed
vectors, but not for dense distributed
matrices. Are there any updates related to
that, or other ways to speed up the involved
operations?</p>
</div>
</blockquote>
<div><br>
</div>
<div>You should compute the timings associated with
each call rather than consider the lump sum. For
example, each MatScale takes 6.9348e+02/56162 =
0.012347851 seconds on average, so I doubt you can
get any reasonable speedup with CUDA. What are the
sizes of these matrices? </div>
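<div>A minimal sketch of that per-call arithmetic, applied to every event in the quoted -log_view table that reports a non-negligible time. The counts and total times are copied from the table; the dictionary layout and the helper name per_call_seconds are illustrative assumptions and not part of PETSc.</div>

```python
# Event name -> (call count, total time in seconds),
# copied from the -log_view event table quoted above.
events = {
    "MatCopy": (49928, 3.7437e+02),
    "MatScale": (56162, 6.9348e+02),
    "MatZeroEntries": (60363, 3.1011e+02),
    "MatAXPY": (8320, 1.2254e+02),
    "MatMatMultNum": (4161, 4.0706e+02),
}


def per_call_seconds(count, total_seconds):
    """Average wall time of one call (hypothetical helper, not a PETSc API)."""
    return total_seconds / count


per_call = {
    name: per_call_seconds(count, total)
    for name, (count, total) in events.items()
}

# List the events from the most to the least expensive single call.
for name, seconds in sorted(per_call.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16s} {seconds:.6f} s per call")
```

<div>Sorting by per-call time rather than by total time shows which single operations are large enough that offloading them could plausibly pay off.</div>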
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks!</p>
<p>Regards,</p>
<p>Roland<br>
</p>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">Stefano</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">Stefano</div>
</div>
</blockquote>
</div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Stefano</div></div>