<div dir="ltr">Your timing data in the first plot seems to have random integers (2,1,1) added to random iterations (0,2,12).<div>Perhaps there is a bug in your test setup?</div><div><br></div><div>Mark</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jun 3, 2022 at 6:42 AM Lidia <<a href="mailto:lidia.varsh@mail.ioffe.ru">lidia.varsh@mail.ioffe.ru</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Dear Matt, Barry,</p>
<p>Thank you for the information about OpenMP!</p>
<p>Now all processes are loaded well. But we see strange behaviour
in the running times of different iterations (see the description below).
Could you please explain the reason and how we can improve it?<br>
</p>
<p>We need to quickly solve a linear system with a big (about 1e6 rows) square sparse
non-symmetric matrix many times (about 1e5 times) consecutively.
The matrix is constant at every iteration, and the right-hand-side vector B
changes slowly (we expect its change at every iteration
to be less than 0.001 %). So we use each previous solution
vector X as the initial guess for the next iteration. An AMG
preconditioner and the GMRES solver are used.<br>
</p>
<p>We have tested the code using a matrix with 631 000 rows over
15 consecutive iterations, using the vector X from the previous
iteration as the initial guess. The right-hand-side vector B and the matrix A are constant during
the whole run. The time of the first iteration is large (about
2 seconds) and quickly decreases in the following iterations
(the average time of the last iterations was about 0.00008 s). But some
iterations in the middle (# 2 and # 12) take a huge time: 0.999063
seconds (see the attached figure with the time dynamics). This time of
0.999 seconds does not depend on the size of the matrix or on the
number of MPI processes, and these time jumps also appear if we vary
the vector B. Why do these time jumps appear and how can we avoid them?</p>
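<p>One way to check whether such jumps come from the solver or from the measurement itself is to time every solve with a monotonic clock and flag the outliers. The following is a minimal Python sketch under stated assumptions, not the code used for the attached plot: <code>solve_once</code> is a hypothetical stand-in for the real solve call, and the threshold factor of 100 is an arbitrary choice.</p>
<pre>
```python
import statistics
import time


def time_calls(solve_once, n_iter):
    """Return wall-clock durations (seconds) of n_iter consecutive calls."""
    durations = []
    for _ in range(n_iter):
        start = time.perf_counter()  # monotonic, high-resolution timer
        solve_once()                 # hypothetical stand-in for the solve
        durations.append(time.perf_counter() - start)
    return durations


def flag_jumps(durations, factor=100.0, skip_first=1):
    """Return indices whose duration exceeds factor * median.

    The first skip_first samples are left out of both the median and the
    result, because the first solve is expected to be slow.
    """
    tail = durations[skip_first:]
    if not tail:
        return []
    baseline = statistics.median(tail)
    return [i for i, t in enumerate(durations)
            if i >= skip_first and t > factor * baseline]
```
</pre>
<p>With the timings described above (about 2 s for the first solve, about 0.00008 s afterwards, and 0.999063 s at iterations 2 and 12), <code>flag_jumps</code> returns <code>[2, 12]</code>.</p>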
<p>The ksp_monitor output for this run (15 iterations)
using 36 MPI processes and a file with the memory bandwidth
information (testSpeed) are also attached. We can provide our C++
code if needed.<br>
</p>
<p>Thanks a lot!<br>
</p>
Best,<br>
Lidiia<br>
<p><br>
</p>
<p><br>
</p>
<div>On 01.06.2022 21:14, Matthew Knepley
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">On Wed, Jun 1, 2022 at 1:43 PM Lidia <<a href="mailto:lidia.varsh@mail.ioffe.ru" target="_blank">lidia.varsh@mail.ioffe.ru</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Dear Matt,</p>
<p>Thank you for the rule of 10,000 variables per process!
We have run ex.5 with a 1e4 x 1e4 matrix on our cluster
and got good performance scaling (see the figure
"performance.png": dependence of the solve time in
seconds on the number of cores). We used the GAMG
preconditioner (multithreaded: we added the option "<span style="color:rgb(29,28,29);font-family:Slack-Lato,Slack-Fractions,appleLogo,sans-serif;font-size:15px;font-style:normal;font-variant-ligatures:common-ligatures;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">-pc_gamg_use_parallel_coarse_grid_solver"</span>)
and the GMRES solver, and we assigned one OpenMP thread to
every MPI process. Now ex.5 works well on many
MPI processes! But the run uses about 100 GB of RAM.<br>
</p>
<p>How can we run ex.5 using many OpenMP threads without
MPI? If we just change the run command, the cores
are not loaded properly: usually just one core is loaded
at 100 % and the others are idle. Sometimes all cores
work at 100 % for 1 second but then become
idle again for about 30 seconds. Can the preconditioner use many
threads, and how do we activate this option?</p>
</div>
</blockquote>
<div><br>
</div>
<div>Maybe you could describe what you are trying to
accomplish? Threads and processes are not really different,
except for memory sharing. However, sharing large complex
data structures rarely works. That is why they get
partitioned and operate effectively as distributed memory.
You would not really save memory by using
threads in this instance, if that is your goal. This is
detailed in the talks in this session (see 2016 PP
Minisymposium on this page <a href="https://cse.buffalo.edu/~knepley/relacs.html" target="_blank">https://cse.buffalo.edu/~knepley/relacs.html</a>).</div>
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>The solve time (the time spent in the solver) using
60 OpenMP threads is now 511 seconds, while with 60
MPI processes it is 13.19 seconds.</p>
<p>The ksp_monitor outputs for both cases (many OpenMP threads or
many MPI processes) are attached.</p>
<p><br>
</p>
<p>Thank you!</p>
Best,<br>
Lidia<br>
<div><br>
</div>
<div>On 31.05.2022 15:21, Matthew Knepley wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">I have looked at the local logs. First,
you have run problems of size 12 and 24. As a rule of
thumb, you need 10,000
<div>variables per process in order to see good
speedup.</div>
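<div>As a rough illustration of this rule of thumb, the helper below estimates the largest process count that still leaves about 10,000 rows per process. It is only a sketch: the function name is made up for this example, and the threshold is the rule-of-thumb value, not a hard limit.</div>
<pre>
```python
def max_useful_processes(n_rows, rows_per_process=10_000):
    """Largest process count that keeps about rows_per_process rows each.

    Always returns at least 1, since a problem smaller than the threshold
    is still best run on a single process.
    """
    if n_rows > 0 and rows_per_process > 0:
        return max(1, n_rows // rows_per_process)
    raise ValueError("n_rows and rows_per_process must be positive")
```
</pre>
<div>For problems of size 12 or 24 this gives 1 process, while matrices with 100 000 to 1e6 rows would support roughly 10 to 100 processes.</div>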
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, May 31, 2022
at 8:19 AM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">On Tue, May 31, 2022 at 7:39 AM
Lidia <<a href="mailto:lidia.varsh@mail.ioffe.ru" target="_blank">lidia.varsh@mail.ioffe.ru</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Matt, Mark, thank you very much for your
answers!</p>
<p><br>
</p>
<p>We have now run example # 5 on our
computer cluster and on the local server
and again have not seen any performance
increase, but for some unclear reason the running
times on the local server are much better
than on the cluster.</p>
</div>
</blockquote>
<div>I suspect that you are trying to get
speedup without increasing the memory
bandwidth:</div>
<div><br>
</div>
<div> <a href="https://petsc.org/main/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup" target="_blank">https://petsc.org/main/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup</a></div>
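<div>To make the bandwidth argument concrete, here is a deliberately simplified model, an assumption for illustration rather than anything taken from the FAQ text: a memory-bound kernel speeds up with the number of processes only until the node's total memory bandwidth is saturated. Both bandwidth arguments are hypothetical inputs that would come from a STREAM-type measurement on the actual machine.</div>
<pre>
```python
def bandwidth_bound_speedup(n_procs, per_core_gbs, node_gbs):
    """Estimated speedup of a memory-bound kernel on one node.

    Each process can stream at most per_core_gbs (GB/s); all processes
    together cannot exceed node_gbs (GB/s). The speedup is the achieved
    aggregate bandwidth relative to a single process.
    """
    if n_procs >= 1 and per_core_gbs > 0 and node_gbs > 0:
        achieved = min(n_procs * per_core_gbs, node_gbs)
        return achieved / per_core_gbs
    raise ValueError("all arguments must be positive")
```
</pre>
<div>In this model, with a hypothetical 10 GB/s per core and 100 GB/s per node, 4 processes give a speedup of 4, but 36 processes give only 10.</div>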
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Now we will try to run PETSc example # 5
inside a Docker container on our server
and see if the problem is in our
environment. I will send you the results of
this test as soon as we get them.</p>
<p>The ksp_monitor outputs for example 5 with
the current local server configuration
(for 2 and 4 MPI processes) and for the
cluster (for 1 and 3 MPI processes) are
attached.</p>
<p><br>
</p>
<p>And one more question. Potentially we can
use 10 nodes with 96 threads on each node
of our cluster. Which
combination of the numbers of MPI processes
and OpenMP threads do you think would be best for
example 5?<br>
</p>
<p>Thank you!<br>
</p>
<p><br>
</p>
Best,<br>
Lidiia<br>
<div><br>
</div>
<div>On 31.05.2022 05:42, Mark Adams wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">And if you see "NO" change
in performance I suspect the
solver/matrix is all on one processor.
<div>(PETSc does not use threads by
default so threads should not change
anything).</div>
<div><br>
</div>
<div>As Matt said, it is best to start
with a PETSc example that does
something like what you want (parallel
linear solve, see
src/ksp/ksp/tutorials for examples),
and then add your code to it.</div>
<div>That way you get the basic
infrastructure in place for you, which
is pretty obscure to the uninitiated.</div>
<div><br>
</div>
<div>Mark</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Mon, May 30, 2022 at 10:18 PM Matthew
Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">On Mon, May 30, 2022
at 10:12 PM Lidia <<a href="mailto:lidia.varsh@mail.ioffe.ru" target="_blank">lidia.varsh@mail.ioffe.ru</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear
colleagues,<br>
<br>
Has anyone here solved
large sparse linear systems using
PETSc?<br>
</blockquote>
<div><br>
</div>
<div>There are lots of
publications with this kind of
data. Here is one recent one: <a href="https://arxiv.org/abs/2204.01722" target="_blank">https://arxiv.org/abs/2204.01722</a></div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
We have found NO performance
improvement when using more
MPI <br>
processes (1, 2, 3) or more OpenMP
threads (from 1 to 72 threads).
Has anyone <br>
faced this problem? Does
anyone know possible reasons
for such <br>
behaviour?<br>
</blockquote>
<div><br>
</div>
<div>Solver behavior is dependent
on the input matrix. The only
general-purpose solvers</div>
<div>are direct, but they do not
scale linearly and have high
memory requirements.</div>
<div><br>
</div>
<div>Thus, in order to make
progress you will have to be
specific about your matrices.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
We use an AMG preconditioner and the
GMRES solver from the KSP package,
as our <br>
matrix is large (from 100 000 to
1e+6 rows and columns), sparse,
<br>
non-symmetric and includes both
positive and negative values.
But <br>
performance problems also exist
when using CG solvers with
symmetric <br>
matrices.<br>
</blockquote>
<div><br>
</div>
<div>There are many PETSc
examples, such as example 5 for
the Laplacian, that exhibit</div>
<div>good scaling with both AMG
and GMG.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Could anyone help us to set
appropriate options for the
preconditioner <br>
and solver? We currently use the default
parameters; maybe they are not
the best, <br>
but we do not know a good
combination. Or maybe you could
suggest <br>
other preconditioner+solver pairs for such
tasks?<br>
<br>
I can provide more information:
the matrices that we solve, the C++
code <br>
that runs the solve using PETSc, and
any statistics obtained from our
runs.<br>
</blockquote>
<div><br>
</div>
<div>First, please provide a
description of the linear
system, and the output of</div>
<div><br>
</div>
<div> -ksp_view
-ksp_monitor_true_residual
-ksp_converged_reason -log_view</div>
<div><br>
</div>
<div>for each test case.</div>
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Thank you in advance!<br>
<br>
Best regards,<br>
Lidiia Varshavchik,<br>
Ioffe Institute, St. Petersburg,
Russia<br>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>What most
experimenters take for
granted before they
begin their
experiments is
infinitely more
interesting than any
results to which their
experiments lead.<br>
-- Norbert Wiener</div>
<div><br>
</div>
<div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote></div>