The user code is the same regardless of the solver.

We provide support for dense matrices with the direct-solver formats elemental and scalapack (in the master branch); MUMPS is for sparse matrices.

With iterative solvers one can use almost any of the preconditioners with MPIDENSE.

Using random matrices will tell you nothing about the behavior of direct vs. iterative methods; you have to test with the actual matrices. But since switching between direct and iterative methods is just a command-line option, you can write your code to handle your actual matrices and then determine whether direct or iterative methods are best. The condition number of the Schur complement of a Laplacian operator grows like the square root of the condition number of the Laplacian operator itself, so not terribly fast; iterative methods with some modest preconditioner will likely easily beat direct methods for problems of your size.
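For example, with your ksps test driver the same executable can be timed both ways from the command line, e.g.

   mpirun -n 4 ./ksps -N 4000 -mat_type mpidense -ksp_type gmres -pc_type jacobi -log_view
   mpirun -n 4 ./ksps -N 4000 -mat_type elemental -ksp_type preonly -pc_type lu -log_view

(-log_view prints timing summaries so the two runs can be compared directly.)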
   Barry

On Aug 8, 2020, at 4:55 PM, Nidish <nb25@rice.edu> wrote:

> On 8/7/20 12:55 PM, Barry Smith wrote:
>
<div class="">On Aug 7, 2020, at 12:26 PM, Nidish <<a href="mailto:nb25@rice.edu" class="" moz-do-not-send="true">nb25@rice.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252" class="">
<div class=""><p class=""><br class="">
</p>
<div class="moz-cite-prefix">On 8/7/20 8:52 AM, Barry
Smith wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev" class="">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252" class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 7, 2020, at 1:25 AM, Nidish
<<a href="mailto:nb25@rice.edu" class="" moz-do-not-send="true">nb25@rice.edu</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" class="">
<div class=""><p class="">Indeed - I was just using the
default solver (GMRES with ILU).</p><p class="">Using just standard LU (direct solve
with "-pc_type lu -ksp_type preonly"), I find
elemental to be extremely slow even for a
1000x1000 matrix. </p>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
What about on one process? <br class="">
</div>
</blockquote>
<font class="" color="#0d29db">On just one process the
performance is comparable.</font><br class="">
<blockquote type="cite" cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev" class="">
<div class=""><br class="">
</div>
<div class="">Elemental generally won't be competitive
for such tiny matrices. <br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><p class="">For MPIaij it's throwing me an error
if I tried "-pc_type lu".</p>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
>>>> Yes, there is no PETSc code for a sparse parallel direct solver; this is expected.
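>>>>
>>>> (If PETSc has been configured with an external package such as MUMPS, a parallel sparse direct solve can still be selected from the command line with something like
>>>>
>>>>    mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps
>>>>
>>>> assuming the MUMPS bindings were built.)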
<div class=""><br class="">
</div>
<div class=""> What about ?</div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div class="">
<blockquote class=""><p class="">mpirun -n 1 ./ksps -N 1000 -mat_type
mpidense -pc_type jacobi</p>
<div class="">mpirun -n 4 ./ksps -N 1000
-mat_type mpidense -pc_type jacobi</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
<font class="" color="#0d29db">Same results - the
elemental version is MUCH slower (for 1000x1000). </font><br class="">
<blockquote type="cite" cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev" class="">
<div class="">Where will your dense matrices be coming
from and how big will they be in practice? This will
help determine if an iterative solver is appropriate.
If they will be 100,000 for example then testing with
1000 will tell you nothing useful, you need to test
with the problem size you care about.</div>
</blockquote><p class=""><font class="" color="#0d29db">The matrices in
my application arise from substructuring/Component
Mode Synthesis conducted on a system that is linear
"almost everywhere", for example jointed systems. The
procedure we follow is: build a mesh & identify
the nodes corresponding to the interfaces, reduce the
model using component mode synthesis to obtain a
representation of the system using just the interface
degrees-of-freedom along with some (~10s) generalized
"modal coordinates". We conduct the non-linear
analyses (transient, steady state harmonic, etc.)
using this matrices. <br class="">
</font></p><p class=""><font class="" color="#0d29db">I am interested
in conducting non-linear mesh convergence for a
particular system of interest wherein the interface
DoFs are, approx, 4000, 8000, 12000, 16000. I'm fairly
certain the dense matrices will not be larger. The <br class="">
</font></p>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
>> Ok, so it is not clear how well conditioned these dense matrices will be.
>>
>> There are three questions that need to be answered.
>>
>> 1) For your problem, can iterative methods be used, and will they require less work than direct solvers?
>>
>> For direct LU the work is order N^3 to do the factorization, with a relatively small constant. Because of smart organization inside dense LU the flops can be done very efficiently.
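>>
>> (This is the path your earlier direct run
>>
>>    mpirun -n 4 ./ksps -N 1000 -mat_type elemental -pc_type lu -ksp_type preonly
>>
>> exercises.)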
<div class=""><br class="">
</div>
<div class=""> For GMRES with Jacobi preconditioning the work is
order N^2 (the time for a dense matrix-vector product) for each
iteration. If the number of iterations small than the total work
is much less than a direct solver. In the worst case the number
of iterations is order N so the total work is order N^3, the
same order as a direct method. But the efficiency of a dense
matrix-vector product is much lower than the efficiency of a LU
factorization so even if the work is the same order it can take
longer. One should use mpidense as the matrix format for
iterative.</div>
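>>
>> For example, something along the lines of
>>
>>    mpirun -n 4 ./ksps -N 4000 -mat_type mpidense -ksp_type gmres -pc_type jacobi -ksp_monitor
>>
>> will print the residual norm at each iteration, so you can see how many iterations are actually needed for your matrices.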
<div class=""><br class="">
</div>
<div class=""> With iterative methods YOU get to decide how accurate
you need your solution, you do this by setting how small you
want the residual to be (since you can't directly control the
error). By default PETSc uses a relative decrease in the
residual of 1e-5. </div>
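>>
>> That tolerance can be changed on the command line with something like
>>
>>    -ksp_rtol 1e-8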
<div class=""><br class="">
</div>
<div class="">2) for your size problems can parallelism help? </div>
<div class=""><br class="">
</div>
<div class=""> I think it should but elemental since it requires a
different data layout has additional overhead cost to get the
data into the optimal format for parallelism. </div>
<div class=""><br class="">
</div>
<div class="">3) can parallelism help on YOUR machine. Just because a
machine has multiple cores it may not be able to utilize them
efficiently for solvers if the total machine memory bandwidth is
limited. </div>
<div class=""><br class="">
</div>
<div class=""> So the first thing to do is on the machine you plan to use
for your computations run the streams benchmark discussed in <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#computers" class="" moz-do-not-send="true">https://www.mcs.anl.gov/petsc/documentation/faq.html#computers</a>
this will give us some general idea of how much parallelism you
can take advantage of. Is the machine a parallel cluster or just
a single node? </div>
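>>
>> (From a PETSc source tree this is run with something like
>>
>>    make streams NPMAX=4
>>
>> with PETSC_DIR and PETSC_ARCH set appropriately.)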
<div class=""><br class="">
</div>
<div class=""> After this I'll give you a few specific cases to run to
get a feeling for what approach would be best for your problems,</div>
<div class=""><br class="">
</div>
<div class=""> Barry</div>
<div class=""><br class="">
</div>
</blockquote><p class="">Thank you for the responses. Here's a pointwise response to your
queries:</p><p class="">1) I am presently working with random matrices (with a large
constant value in the diagonals to ensure diagonal dominance)
before I start working with the matrices from my system. At the
end of the day the matrices I expect to be using can be thought of
to be Schur complements of a Laplacian operator. <br class="">
</p><p class="">2) Since my application is joint dynamics, I have a non-linear
function that has to be evaluated at quadrature locations on a 2D
mesh and integrated to form the residue vector as well as the
Jacobian matrices. There is thus potential speedup I expect for
the function evaluations definitely. <br class="">
</p><p class="">Since the matrices I will end up with will be dense (at least for
static simulations), I wanted directions to find the best solver
options for my problem.</p><p class="">3) I am presently on an octa-core (4 physical cores) machine with
16 Gigs of RAM. I plan to conduct code development and
benchmarking on this machine before I start running larger models
on a cluster I have access to. <br class="">
</p><p class="">I was unable to run the streams benchmark on the cluster (PETSc
3.11.1 is installed there, and the benchmarks in the git directory
was giving issues), but I was able to do this in my local machine
- here's the output: <br class="">
</p>
<blockquote class=""><tt class="">scaling.log</tt><br class="">
1 13697.5004 Rate (MB/s)<br class="">
2 13021.9505 Rate (MB/s) 0.950681 <br class="">
3 12402.6925 Rate (MB/s) 0.905471 <br class="">
4 12309.1712 Rate (MB/s) 0.898644<br class="">
</blockquote><p class="">Could you point me to the part in the documentation that speaks
about the different options available for dealing with dense
matrices? I just realized that bindings for MUMPS are available in
PETSc.</p><p class="">Thank you very much,<br class="">
Nidish<br class="">
</p>
<blockquote type="cite" cite="mid:FD894999-788A-4204-8D31-F45AAD8B3D00@petsc.dev" class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><div class=""><font class="" color="#0d29db"> </font><br class="webkit-block-placeholder"></div><p class=""><font class="" color="#0d29db">However for
frequency domain simulations, we use matrices that are
about 10 times the size of the original matrices
(whose meshes have been shown to be convergent in
static test cases). <br class="">
</font></p><p class=""><font class="" color="#0d29db">Thank you,<br class="">
Nidish</font><br class="">
</p>
<blockquote type="cite" cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev" class="">
<div class=""><br class="">
</div>
<div class="">Barry</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><p class=""> I'm attaching the code here, in
case you'd like to have a look at what I've
been trying to do. <br class="">
</p><p class="">The two configurations of interest
are,</p>
<blockquote class=""><p class="">$> mpirun -n 4 ./ksps -N 1000
-mat_type mpiaij<br class="">
$> mpirun -n 4 ./ksps -N 1000 -mat_type
elemental</p>
</blockquote><p class="">(for the GMRES with ILU) and,</p>
<blockquote class=""><p class="">$> mpirun -n 4 ./ksps -N 1000
-mat_type mpiaij -pc_type lu -ksp_type
preonly<br class="">
$> mpirun -n 4 ./ksps -N 1000 -mat_type
elemental -pc_type lu -ksp_type preonly</p>
</blockquote><p class="">elemental seems to perform poorly in
both cases.</p><p class="">Nidish<br class="">
</p>
<div class="moz-cite-prefix">On 8/7/20 12:50 AM,
Barry Smith wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:85F9F817-2754-4F55-9222-3E23003E79FD@petsc.dev" class="">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" class="">
<div class=""><br class="">
</div>
<div class=""> What is the output of
-ksp_view for the two case?</div>
<div class=""><br class="">
</div>
<div class=""> It is not only the matrix
format but also the matrix solver that
matters. For example if you are using an
iterative solver the elemental format won't
be faster, you should use the PETSc MPIDENSE
format. The elemental format is really
intended when you use a direct LU solver for
the matrix. For tiny matrices like this an
iterative solver could easily be faster than
the direct solver, it depends on the
conditioning (eigenstructure) of the dense
matrix. Also the default PETSc solver uses
block Jacobi with ILU on each process if
using a sparse format, ILU applied to a
dense matrix is actually LU so your solver
is probably different also between the
MPIAIJ and the elemental. </div>
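>>>>>>
>>>>>> For example, appending -ksp_view to your existing runs, e.g.
>>>>>>
>>>>>>    mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -ksp_view
>>>>>>    mpirun -n 4 ./ksps -N 1000 -mat_type elemental -pc_type lu -ksp_type preonly -ksp_view
>>>>>>
>>>>>> will print exactly which KSP and PC are actually being used in each case.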
<div class=""><br class="">
</div>
<div class=""> Barry</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""> <br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 7, 2020, at 12:30
AM, Nidish <<a href="mailto:nb25@rice.edu" class="" moz-do-not-send="true">nb25@rice.edu</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="zoom: 0%;" class="">
<div dir="auto" class="">Thank you
for the response.<br class="">
<br class="">
</div>
<div dir="auto" class="">I've just
been running some tests with
matrices up to 2e4 dimensions
(dense). When I compared the
solution times for "-mat_type
elemental" and "-mat_type mpiaij"
running with 4 cores, I found the
mpidense versions running way
faster than elemental. I have not
been able to make the elemental
version finish up for 2e4 so far
(my patience runs out faster). <br class="">
<br class="">
</div>
<div dir="auto" class="">What's
going on here? I thought elemental
was supposed to be superior for
dense matrices.<br class="">
<br class="">
</div>
<div dir="auto" class="">I can share
the code if that's appropriate for
this forum (sorry, I'm new here).
<br class="">
<br class="">
</div>
<div dir="auto" class="">Nidish</div>
<div class="gmail_quote">On Aug 6,
2020, at 23:01, Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="" moz-do-not-send="true">bsmith@petsc.dev</a>>
wrote:
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid
rgb(204, 204, 204);
padding-left: 1ex;">
<pre class="blue"><blockquote class="gmail_quote" style="margin: 0pt 0pt 1ex 0.8ex; border-left: 1px solid #729fcf; padding-left: 1ex;"> On Aug 6, 2020, at 7:32 PM, Nidish <<a href="mailto:nb25@rice.edu" class="" moz-do-not-send="true">nb25@rice.edu</a>> wrote:
I'm relatively new to PETSc, and my applications involve (for the most part) dense matrix solves.
I read in the documentation that this is an area PETSc does not specialize in but instead recommends external libraries such as Elemental. I'm wondering if there are any "best" practices in this regard. Some questions I'd like answered are:
1. Can I just declare my dense matrix as a sparse one and fill the whole matrix up? Do any of the others go this route? What're possible pitfalls/unfavorable outcomes for this? I understand the memory overhead probably shoots up.
</blockquote>
No, this isn't practical, the performance will be terrible.
<blockquote class="gmail_quote" style="margin: 0pt 0pt 1ex 0.8ex; border-left: 1px solid #729fcf; padding-left: 1ex;"> 2. Are there any specific guidelines on when I can expect elemental to perform better in parallel than in serial?
</blockquote>
Because the computation to communication ratio for dense matrices is higher than for sparse you will see better parallel performance for dense problems of a given size than sparse problems of a similar size. In other words parallelism can help for dense matrices for relatively small problems, of course the specifics of your machine hardware and software also play a role.
Barry
<blockquote class="gmail_quote" style="margin: 0pt 0pt 1ex 0.8ex; border-left: 1px solid #729fcf; padding-left: 1ex;">
Of course, I'm interesting in any other details that may be important in this regard.
Thank you,
Nidish
</blockquote>
</pre>
</blockquote>
>>>>>
>>>>> --
>>>>> Nidish
>>>>>
>>>>> <ksps.cpp>
>>>
>>> --
>>> Nidish
>
> --
> Nidish