<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Yes, I call MatAXPY, but the matrix size stays the same.</p>
<p>Regards,</p>
<p>Roland<br>
</p>
<div class="moz-cite-prefix">Am 16.02.21 um 14:46 schrieb Stefano
Zampini:<br>
</div>
<blockquote type="cite"
cite="mid:CAGPUishYV+H4gK8ATahAqu7aRuZtdt3JiaZdLy13SE+f6DihyQ@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr"><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Il giorno mar 16 feb 2021
alle ore 16:30 Roland Richter <<a
href="mailto:roland.richter@ntnu.no"
moz-do-not-send="true">roland.richter@ntnu.no</a>> ha
scritto:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>For MatMatMult the sizes of the involved matrices are 8k
x 8k and 8k x 32k.</p>
</div>
</blockquote>
<div>Ok, so you have 32k columns to multiply against, so maybe
you can get some speedup.</div>
<div>However, if you keep updating the matrix entries on the CPU,
then using CUDA will make little sense.</div>
<div>In any case, you can try and see if you get any speedup.</div>
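<div><br>
</div>
<div>For reference, below is a minimal sketch of what such a test could
look like. This is only an illustration, not your code: it assumes a
PETSc 3.14-era build configured with --with-cuda and with the dense CUDA
matrix type (MATDENSECUDA) available, and it uses the 8k x 8k and 8k x
32k sizes you mentioned. Running it with -mat_type densecuda should place
both matrices on the GPU; without that option it stays on the CPU, so the
two runs can be compared with -log_view.</div>
<pre>
#include &lt;petscmat.h&gt;
#include &lt;petsctime.h&gt;

/* Minimal sketch (illustrative only): time C = A*B for an 8192 x 8192 and an
   8192 x 32768 dense matrix. With a CUDA-enabled PETSc build, running with
   -mat_type densecuda switches both matrices to the GPU dense format. */
int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscLogDouble t0, t1;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192); CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE); CHKERRQ(ierr);   /* overridden by -mat_type densecuda */
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B); CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768); CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(B); CHKERRQ(ierr);
  ierr = MatSetUp(B); CHKERRQ(ierr);

  /* ... fill A and B (e.g. via MatDenseGetArray), then assemble ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* Time a single dense matrix-matrix product */
  ierr = PetscTime(&t0); CHKERRQ(ierr);
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C); CHKERRQ(ierr);
  ierr = PetscTime(&t1); CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "MatMatMult: %g s\n", (double)(t1 - t0)); CHKERRQ(ierr);

  ierr = MatDestroy(&C); CHKERRQ(ierr);
  ierr = MatDestroy(&B); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
</pre>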
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>I am not sure where MatScale is called; I never call it
explicitly. If MatDiagonalScale calls MatScale, then the
involved matrices have a size of 8k x 32k.</p>
</div>
</blockquote>
<div>No, it does not. Are you calling MatAYPX?</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Regards,</p>
<p>Roland<br>
</p>
<div>On 16.02.21 at 14:25, Stefano Zampini wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<p><br>
</p>
</blockquote>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>the usual size of those matrices is
(cumulative, not distributed) at least
[8192x8192] x [8192x32768] complex entries as
lower boundary. Does it still make sense to
test CUDA for speedup?</p>
</div>
</blockquote>
<div>I don't understand your notation. Are you saying your
matrices are 8K x 8K, or 8K x 32K, or what?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Thank you,</p>
<p>regards,</p>
<p>Roland<br>
</p>
<div>On 16.02.21 at 14:14, Stefano Zampini wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Il
giorno mar 16 feb 2021 alle ore 11:43
Roland Richter <<a
href="mailto:roland.richter@ntnu.no"
target="_blank" moz-do-not-send="true">roland.richter@ntnu.no</a>>
ha scritto:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Hei,</p>
<p>after profiling my program using
-log_view, I got the following
output (all matrices are dense):</p>
<pre>
Using 8 OpenMP threads
Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600

                         Max       Max/Min     Avg       Total
Time (sec):           5.074e+03     1.000   5.074e+03
Objects:              2.158e+03     1.000   2.158e+03
Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector    37             34      1634064     0.
              Matrix  2120           2120  52734663456     0.
              Viewer     1              0            0     0.
========================================================================================================================
</pre>
<p>Apparently, MatMatMultNum and
MatScale take the most time (by far)
during execution. Therefore, I was
wondering if it is possible to move
those operations/all matrices and
vectors to a GPU or another
accelerator. According to <a
href="https://www.mcs.anl.gov/petsc/features/gpus.html"
target="_blank"
moz-do-not-send="true">https://www.mcs.anl.gov/petsc/features/gpus.html</a>
CUDA is only supported for
distributed vectors, but not for
dense distributed matrices. Are
there any updates related to that,
or other ways to speed up the
involved operations?</p>
</div>
</blockquote>
<div><br>
</div>
<div>You should compute the timings associated with each call,
and not consider the lump sum. For example, each MatScale takes
6.9348e+02/56162 = 0.012347851 seconds on average, so I doubt
you can get any reasonable speedup with CUDA. What are the
sizes of these matrices?</div>
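<div><br>
</div>
<div>For reference, the same per-call averages (total time divided by
call count) for the other dominant events in the log above are roughly:</div>
<pre>
MatScale        6.9348e+02 / 56162  ~ 0.0123 s per call
MatMatMultNum   4.0706e+02 /  4161  ~ 0.0978 s per call
MatCopy         3.7437e+02 / 49928  ~ 0.0075 s per call
MatZeroEntries  3.1011e+02 / 60363  ~ 0.0051 s per call
MatAXPY         1.2254e+02 /  8320  ~ 0.0147 s per call
</pre>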
<div> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks!</p>
<p>Regards,</p>
<p>Roland<br>
</p>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">Stefano</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">Stefano</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr" class="gmail_signature">Stefano</div>
</div>
</blockquote>
</body>
</html>