# [petsc-users] Sparse linear system solving

Lidia lidia.varsh at mail.ioffe.ru
Mon Jun 6 06:19:37 CDT 2022

```
Dear colleagues,

Thank you very much for the help!

Now the code seems to be working well!

Best,
Lidiia

On 03.06.2022 15:19, Matthew Knepley wrote:
> On Fri, Jun 3, 2022 at 6:42 AM Lidia <lidia.varsh at mail.ioffe.ru> wrote:
>
>     Dear Matt, Barry,
>
>     thank you for the information about openMP!
>
>     Now all processes are loaded well. But we see strange behaviour
>     of the running times at different iterations; see the description
>     below. Could you please explain the reason to us and how we can
>     improve it?
>
>     We need to quickly solve a large (about 1e6 rows) square, sparse,
>     non-symmetric linear system many times (about 1e5 times)
>     consecutively. The matrix is constant at every iteration, and the
>     right-hand-side vector B changes slowly (we think that its change
>     at every iteration should be less than 0.001 %). So we use each
>     previous solution vector X as the initial guess for the next
>     iteration. An AMG preconditioner and the GMRES solver are used.
>
>     We have tested the code using a matrix with 631 000 rows over
>     15 consecutive iterations, using vector X from the previous
>     iteration. The right-hand-side vector B and matrix A are constant
>     during the whole run. The time of the first iteration is large
>     (about 2 seconds) and quickly decreases in the following
>     iterations (the average time of the last iterations was about
>     0.00008 s). But some iterations in the middle (# 2 and # 12) take
>     a huge time: 0.999063 second (see the attached figure with the
>     time dynamics). This time of 0.999 second does not depend on the
>     size of the matrix or on the number of MPI processes, and these
>     time jumps also exist if we vary vector B. Why do these time
>     jumps appear and how can we avoid them?
>
>
> PETSc is not taking this time. It must come from somewhere else in
> your code. Notice that no iterations are taken for any subsequent
> solves, so no operations other than the residual norm check (and
> preconditioner application) are being performed.
>
>   Thanks,
>
>      Matt
>
>     The ksp_monitor output for this run (15 iterations included)
>     using 36 MPI processes and a file with the memory bandwidth
>     information (testSpeed) are also attached. We can provide our C++
>     script if it is needed.
>
>     Thanks a lot!
>
>     Best,
>     Lidiia
>
>
>
>     On 01.06.2022 21:14, Matthew Knepley wrote:
>>     On Wed, Jun 1, 2022 at 1:43 PM Lidia <lidia.varsh at mail.ioffe.ru>
>>     wrote:
>>
>>         Dear Matt,
>>
>>         Thank you for the rule of 10,000 variables per process! We
>>         have run ex.5 with a 1e4 x 1e4 matrix on our cluster and got
>>         good performance scaling (see the figure "performance.png":
>>         dependence of the solving time in seconds on the number of
>>         cores). We have used the GAMG preconditioner (multithread: we
>>         use "-pc_gamg_use_parallel_coarse_grid_solver") and the GMRES
>>         solver. And we have assigned one OpenMP thread to every MPI
>>         process. Now ex.5 works well on many MPI processes! But the
>>         run uses about 100 GB of RAM.
>>
>>         How can we run ex.5 using many OpenMP threads without MPI? If
>>         we just change the run command, the cores are not loaded
>>         properly: usually just one core is loaded at 100 % and the
>>         others are idle. Sometimes all cores work at 100 % for 1
>>         second but then become idle again for about 30 seconds. Can
>>         the preconditioner use many threads, and how do we activate
>>         this option?
>>
>>
>>     Maybe you could describe what you are trying to accomplish?
>>     Threads and processes are not really different, except for memory
>>     sharing. However, sharing large complex data structures rarely
>>     works. That is why they get partitioned and operate effectively
>>     as distributed memory. You would not really save memory by using
>>     threads in this instance, if that is your goal. This is detailed
>>     in the talks in this session (see 2016 PP Minisymposium on this
>>     page https://cse.buffalo.edu/~knepley/relacs.html).
>>
>>       Thanks,
>>
>>          Matt
>>
>>         The solving time (the time spent in the solver) using 60
>>         OpenMP threads is now 511 seconds, while using 60 MPI
>>         processes it is 13.19 seconds.
>>
>>         The ksp_monitor outputs for both cases (many OpenMP threads
>>         or many MPI processes) are attached.
>>
>>
>>         Thank you!
>>
>>         Best,
>>         Lidia
>>
>>         On 31.05.2022 15:21, Matthew Knepley wrote:
>>>         I have looked at the local logs. First, you have run
>>>         problems of size 12 and 24. As a rule of thumb, you need
>>>         10,000 variables per process in order to see good speedup.
>>>
>>>           Thanks,
>>>
>>>              Matt
>>>
>>>         On Tue, May 31, 2022 at 8:19 AM Matthew Knepley
>>>         <knepley at gmail.com> wrote:
>>>
>>>             On Tue, May 31, 2022 at 7:39 AM Lidia
>>>             <lidia.varsh at mail.ioffe.ru> wrote:
>>>
>>>
>>>
>>>                 Now we have run example # 5 on our computer cluster
>>>                 and on the local server and also have not seen any
>>>                 performance increase, but for some unclear reason
>>>                 the running times on the local server are much
>>>                 better than on the cluster.
>>>
>>>             I suspect that you are trying to get speedup without
>>>             increasing the memory bandwidth:
>>>
>>>             https://petsc.org/main/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>>>
>>>               Thanks,
>>>
>>>                  Matt
>>>
>>>                 Now we will try to run PETSc example # 5 inside a
>>>                 docker container on our server and see if the
>>>                 problem is in our environment. I'll send you the
>>>                 results of this test as soon as we get them.
>>>
>>>                 The ksp_monitor outputs for the 5th example with the
>>>                 current local server configuration (for 2 and 4 MPI
>>>                 processes) and for the cluster (for 1 and 3 MPI
>>>                 processes) are attached.
>>>
>>>
>>>                 And one more question. Potentially we can use 10
>>>                 nodes and 96 threads on each node of our cluster.
>>>                 Which combination of the numbers of MPI processes
>>>                 and OpenMP threads do you think would be best for
>>>                 the 5th example?
>>>
>>>                 Thank you!
>>>
>>>
>>>                 Best,
>>>                 Lidiia
>>>
>>>                 On 31.05.2022 05:42, Mark Adams wrote:
>>>>                 And if you see "NO" change in performance I suspect
>>>>                 the solver/matrix is all on one processor.
>>>>                 should not change anything).
>>>>
>>>>                 As Matt said, it is best to start with a PETSc
>>>>                 example that does something like what you want
>>>>                 (parallel linear solve, see src/ksp/ksp/tutorials
>>>>                 That way you get the basic infrastructure in place
>>>>                 for you, which is pretty obscure to the uninitiated.
>>>>
>>>>                 Mark
>>>>
>>>>                 On Mon, May 30, 2022 at 10:18 PM Matthew Knepley
>>>>                 <knepley at gmail.com> wrote:
>>>>
>>>>                     On Mon, May 30, 2022 at 10:12 PM Lidia
>>>>                     <lidia.varsh at mail.ioffe.ru> wrote:
>>>>
>>>>                         Dear colleagues,
>>>>
>>>>                         Is there anyone here who has solved large
>>>>                         sparse linear systems using PETSc?
>>>>
>>>>
>>>>                     There are lots of publications with this kind
>>>>                     of data. Here is one recent one:
>>>>                     https://arxiv.org/abs/2204.01722
>>>>
>>>>                         We have found NO performance improvement
>>>>                         while using more and more MPI
>>>>                         processes (1-2-3) and OpenMP threads (from
>>>>                         1 to 72 threads). Has anyone
>>>>                         faced this problem? Does anyone know any
>>>>                         possible reasons for such
>>>>                         behaviour?
>>>>
>>>>
>>>>                     Solver behavior is dependent on the input
>>>>                     matrix. The only general-purpose solvers
>>>>                     are direct, but they do not scale linearly and
>>>>                     have high memory requirements.
>>>>
>>>>                     Thus, in order to make progress you will have
>>>>
>>>>                         We use an AMG preconditioner and the GMRES
>>>>                         solver from the KSP package, as our
>>>>                         matrix is large (from 100 000 to 1e+6 rows
>>>>                         and columns), sparse,
>>>>                         non-symmetric and includes both positive
>>>>                         and negative values. But the
>>>>                         performance problems also exist when using
>>>>                         CG solvers with symmetric
>>>>                         matrices.
>>>>
>>>>
>>>>                     There are many PETSc examples, such as example
>>>>                     5 for the Laplacian, that exhibit
>>>>                     good scaling with both AMG and GMG.
>>>>
>>>>                         Could anyone help us set appropriate
>>>>                         options for the preconditioner
>>>>                         and solver? Currently we use the default
>>>>                         parameters; maybe they are not the best,
>>>>                         but we do not know a good combination. Or
>>>>                         maybe you could suggest any
>>>>                         other pairs of preconditioner+solver for
>>>>
>>>>                         matrices that we solve, c++ script
>>>>                         to run solving using petsc and any
>>>>                         statistics obtained by our runs.
>>>>
>>>>
>>>>                     First, please provide a description of the
>>>>                     linear system, and the output of
>>>>
>>>>                       -ksp_view -ksp_monitor_true_residual
>>>>                       -ksp_converged_reason -log_view
>>>>
>>>>                     for each test case.
>>>>
>>>>                       Thanks,
>>>>
>>>>                          Matt
>>>>
>>>>
>>>>                         Best regards,
>>>>                         Lidiia Varshavchik,
>>>>                         Ioffe Institute, St. Petersburg, Russia
>>>>
>>>>
>>>>
>>>>                     --
>>>>                     What most experimenters take for granted before
>>>>                     they begin their experiments is infinitely more
>>>>                     interesting than any results to which their
>>>>                     experiments lead.
>>>>                     -- Norbert Wiener
>>>>
>>>>                     https://www.cse.buffalo.edu/~knepley/
>>>>                     <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which