<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<div class="moz-cite-prefix">On 8/7/20 12:55 PM, Barry Smith wrote:<br>
</div>
<blockquote type="cite"
cite="mid:FD894999-788A-4204-8D31-F45AAD8B3D00@petsc.dev">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 7, 2020, at 12:26 PM, Nidish <<a
href="mailto:nb25@rice.edu" class=""
moz-do-not-send="true">nb25@rice.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252" class="">
<div class="">
<p class=""><br class="">
</p>
<div class="moz-cite-prefix">On 8/7/20 8:52 AM, Barry
Smith wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev"
class="">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252" class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 7, 2020, at 1:25 AM, Nidish
<<a href="mailto:nb25@rice.edu" class=""
moz-do-not-send="true">nb25@rice.edu</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type"
content="text/html; charset=windows-1252"
class="">
<div class="">
<p class="">Indeed - I was just using the
default solver (GMRES with ILU).</p>
<p class="">Using just standard LU (direct solve
with "-pc_type lu -ksp_type preonly"), I find
elemental to be extremely slow even for a
1000x1000 matrix. </p>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
What about on one process? <br class="">
</div>
</blockquote>
<font class="" color="#0d29db">On just one process the
performance is comparable.</font><br class="">
<blockquote type="cite"
cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev"
class="">
<div class=""><br class="">
</div>
<div class="">Elemental generally won't be competitive
for such tiny matrices. <br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class="">For MPIaij it's throwing me an error
if I tried "-pc_type lu".</p>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Yes, there is no PETSc code for sparse parallel
direct solver, this is expected.</div>
<div class=""><br class="">
</div>
<div class=""> What about ?</div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div class="">
<blockquote class="">
<p class="">mpirun -n 1 ./ksps -N 1000 -mat_type
mpidense -pc_type jacobi</p>
<div class="">mpirun -n 4 ./ksps -N 1000
-mat_type mpidense -pc_type jacobi</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
<font class="" color="#0d29db">Same results - the
elemental version is MUCH slower (for 1000x1000). </font><br
class="">
<blockquote type="cite"
cite="mid:6A7D902E-FACE-4778-B89A-90B043ED31C0@petsc.dev"
class="">
<div class="">Where will your dense matrices be coming
from and how big will they be in practice? This will
help determine if an iterative solver is appropriate.
If they will be 100,000 for example then testing with
1000 will tell you nothing useful, you need to test
with the problem size you care about.</div>
</blockquote>
<p class=""><font class="" color="#0d29db">The matrices in
my application arise from substructuring/Component
Mode Synthesis conducted on a system that is linear
"almost everywhere", for example jointed systems. The
procedure we follow is: build a mesh & identify
the nodes corresponding to the interfaces, reduce the
model using component mode synthesis to obtain a
representation of the system using just the interface
degrees-of-freedom along with some (~10s) generalized
"modal coordinates". We conduct the non-linear
analyses (transient, steady state harmonic, etc.)
using this matrices. <br class="">
</font></p>
<p class=""><font class="" color="#0d29db">I am interested
in conducting non-linear mesh convergence for a
particular system of interest wherein the interface
DoFs are, approx, 4000, 8000, 12000, 16000. I'm fairly
certain the dense matrices will not be larger. The <br
class="">
</font></p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
Ok, so it is not clear how well conditioned these dense
matrices will be. </div>
<div><br class="">
</div>
<div> There are three questions that need to be answered.</div>
<div><br class="">
</div>
<div>1) for your problem can iterative methods be used and will
they require less work than direct solvers.</div>
<div><br class="">
</div>
<div> For direct LU the work is order N^3 to do the
factorization with a relatively small constant. Because of smart
organization inside dense LU the flops can be done very
efficiently. </div>
<div><br class="">
</div>
<div> For GMRES with Jacobi preconditioning the work is
order N^2 (the time for a dense matrix-vector product) for each
iteration. If the number of iterations small than the total work
is much less than a direct solver. In the worst case the number
of iterations is order N so the total work is order N^3, the
same order as a direct method. But the efficiency of a dense
matrix-vector product is much lower than the efficiency of a LU
factorization so even if the work is the same order it can take
longer. One should use mpidense as the matrix format for
iterative.</div>
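  <p>For example, iteration counts and timings for such a run can be checked with standard PETSc options (shown here against the same ./ksps test driver used earlier in this thread; -ksp_monitor prints the residual norm at each iteration and -log_view prints a timing/flop summary at the end of the run):</p>
  <blockquote>
    <p>mpirun -n 4 ./ksps -N 1000 -mat_type mpidense -pc_type jacobi -ksp_monitor -log_view</p>
  </blockquote>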
<div><br class="">
</div>
<div> With iterative methods YOU get to decide how accurate
you need your solution, you do this by setting how small you
want the residual to be (since you can't directly control the
error). By default PETSc uses a relative decrease in the
residual of 1e-5. </div>
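  <p>That default can be changed at run time; for example (the tolerance value here is only illustrative):</p>
  <blockquote>
    <p>mpirun -n 4 ./ksps -N 1000 -mat_type mpidense -pc_type jacobi -ksp_rtol 1e-8 -ksp_converged_reason</p>
  </blockquote>
  <p>-ksp_rtol sets the relative residual tolerance and -ksp_converged_reason reports why (and after how many iterations) the solve stopped.</p>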
<div><br class="">
</div>
<div>2) for your size problems can parallelism help? </div>
<div><br class="">
</div>
<div> I think it should but elemental since it requires a
different data layout has additional overhead cost to get the
data into the optimal format for parallelism. </div>
<div><br class="">
</div>
<div>3) can parallelism help on YOUR machine. Just because a
machine has multiple cores it may not be able to utilize them
efficiently for solvers if the total machine memory bandwidth is
limited. </div>
<div><br class="">
</div>
<div> So the first thing to do is on the machine you plan to use
for your computations run the streams benchmark discussed in <a
href="https://www.mcs.anl.gov/petsc/documentation/faq.html#computers"
class="" moz-do-not-send="true">https://www.mcs.anl.gov/petsc/documentation/faq.html#computers</a>
this will give us some general idea of how much parallelism you
can take advantage of. Is the machine a parallel cluster or just
a single node? </div>
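  <p>If PETSc was built from source, the benchmark can typically be run from the top of the source tree along these lines (NPMAX is the largest number of MPI ranks to test):</p>
  <blockquote>
    <p>cd $PETSC_DIR<br>
    make streams NPMAX=4</p>
  </blockquote>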
<div><br class="">
</div>
<div> After this I'll give you a few specific cases to run to
get a feeling for what approach would be best for your problems,</div>
<div><br class="">
</div>
<div> Barry</div>
<div><br class="">
</div>
</blockquote>
<p>Thank you for the responses. Here's a pointwise response to your
queries:</p>
<p>1) I am presently working with random matrices (with a large
constant value in the diagonals to ensure diagonal dominance)
before I start working with the matrices from my system. At the
end of the day the matrices I expect to be using can be thought of
to be Schur complements of a Laplacian operator. <br>
</p>
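<p>As a reference for what I mean, here is a minimal sketch (untested) of how such a test matrix could be assembled through PETSc's C API; the -N option name and the diagonal shift are just illustrative choices:</p>
<pre>
#include &lt;petscmat.h&gt;

int main(int argc, char **argv)
{
  Mat         A;
  PetscRandom rnd;
  PetscInt    N = 1000;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetInt(NULL, NULL, "-N", &N, NULL);

  /* N x N parallel dense matrix with PETSc-chosen row distribution */
  MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, NULL, &A);

  /* Fill with random entries in [0,1), then add N to the diagonal so each
     diagonal entry exceeds the sum of the off-diagonal entries in its row */
  PetscRandomCreate(PETSC_COMM_WORLD, &rnd);
  MatSetRandom(A, rnd);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  MatShift(A, (PetscScalar)N);

  PetscRandomDestroy(&rnd);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}
</pre>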
<p>2) Since my application is joint dynamics, I have a non-linear function that has to be evaluated at quadrature locations on a 2D mesh and integrated to form the residue vector as well as the Jacobian matrices. So I definitely expect some potential speedup for the function evaluations.</p>
<p>Since the matrices I will end up with will be dense (at least for static simulations), I wanted directions for finding the best solver options for my problem.</p>
<p>3) I am presently on an octa-core machine (4 physical cores) with 16 GB of RAM. I plan to conduct code development and benchmarking on this machine before I start running larger models on a cluster I have access to.</p>
<p>I was unable to run the streams benchmark on the cluster (PETSc 3.11.1 is installed there, and the benchmarks in the git directory were giving issues), but I was able to run it on my local machine - here's the output:</p>
<blockquote><pre>
scaling.log
1  13697.5004  Rate (MB/s)
2  13021.9505  Rate (MB/s)  0.950681
3  12402.6925  Rate (MB/s)  0.905471
4  12309.1712  Rate (MB/s)  0.898644
</pre></blockquote>
<p>Could you point me to the part in the documentation that discusses the different options available for dealing with dense matrices? I just realized that bindings for MUMPS are available in PETSc.</p>
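<p>In case it's useful to note here: if PETSc is configured with MUMPS (e.g. via --download-mumps), I understand a sparse parallel direct solve can be selected at run time without code changes, along the lines of</p>
<blockquote>
  <p>$> mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps</p>
</blockquote>
<p>(MUMPS being a sparse direct solver, this applies to the MPIAIJ format rather than to the dense formats.)</p>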
<p>Thank you very much,<br>
Nidish</p>
<blockquote type="cite">
  <blockquote type="cite">
    <p>However, for frequency domain simulations we use matrices that are about 10 times the size of the original matrices (whose meshes have been shown to be convergent in static test cases).</p>
    <p>Thank you,<br>
    Nidish</p>
    <blockquote type="cite">
      <div>Barry</div>
      <blockquote type="cite">
        <p>I'm attaching the code here, in case you'd like to have a look at what I've been trying to do.</p>
        <p>The two configurations of interest are,</p>
        <blockquote>
          <p>$> mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij<br>
          $> mpirun -n 4 ./ksps -N 1000 -mat_type elemental</p>
        </blockquote>
        <p>(for the GMRES with ILU) and,</p>
        <blockquote>
          <p>$> mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -pc_type lu -ksp_type preonly<br>
          $> mpirun -n 4 ./ksps -N 1000 -mat_type elemental -pc_type lu -ksp_type preonly</p>
        </blockquote>
        <p>Elemental seems to perform poorly in both cases.</p>
        <p>Nidish</p>
        <div class="moz-cite-prefix">On 8/7/20 12:50 AM, Barry Smith wrote:</div>
        <blockquote type="cite">
          <div>What is the output of -ksp_view for the two cases?</div>
          <div>It is not only the matrix format but also the matrix solver that matters. For example, if you are using an iterative solver the elemental format won't be faster; you should use the PETSc MPIDENSE format. The elemental format is really intended for when you use a direct LU solver on the matrix. For tiny matrices like this an iterative solver could easily be faster than the direct solver; it depends on the conditioning (eigenstructure) of the dense matrix. Also, the default PETSc solver uses block Jacobi with ILU on each process when using a sparse format, and ILU applied to a dense matrix is actually LU, so your solver is probably different between the MPIAIJ and the elemental runs as well.</div>
          <div>Barry</div>
          <blockquote type="cite">
            <div>On Aug 7, 2020, at 12:30 AM, Nidish <<a href="mailto:nb25@rice.edu">nb25@rice.edu</a>> wrote:</div>
            <div>Thank you for the response.</div>
            <div>I've just been running some tests with matrices up to 2e4 dimensions (dense). When I compared the solution times for "-mat_type elemental" and "-mat_type mpiaij" running with 4 cores, I found the mpidense versions running way faster than elemental. I have not been able to make the elemental version finish for 2e4 so far (my patience runs out first).</div>
            <div>What's going on here? I thought elemental was supposed to be superior for dense matrices.</div>
            <div>I can share the code if that's appropriate for this forum (sorry, I'm new here).</div>
            <div>Nidish</div>
            <div class="gmail_quote">On Aug 6, 2020, at 23:01, Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:
              <blockquote class="gmail_quote">
                <pre class="blue">
> On Aug 6, 2020, at 7:32 PM, Nidish <<a href="mailto:nb25@rice.edu">nb25@rice.edu</a>> wrote:
>
> I'm relatively new to PETSc, and my applications involve (for the most part) dense matrix solves.
>
> I read in the documentation that this is an area PETSc does not specialize in, but instead recommends external libraries such as Elemental. I'm wondering if there are any "best" practices in this regard. Some questions I'd like answered are:
>
> 1. Can I just declare my dense matrix as a sparse one and fill the whole matrix up? Do any of the others go this route? What are possible pitfalls/unfavorable outcomes of this? I understand the memory overhead probably shoots up.

   No, this isn't practical; the performance will be terrible.

> 2. Are there any specific guidelines on when I can expect Elemental to perform better in parallel than in serial?

   Because the computation-to-communication ratio for dense matrices is higher than for sparse ones, you will see better parallel performance for dense problems of a given size than for sparse problems of a similar size. In other words, parallelism can help for dense matrices on relatively small problems; of course, the specifics of your machine's hardware and software also play a role.

   Barry

> Of course, I'm interested in any other details that may be important in this regard.
>
> Thank you,
> Nidish
                </pre>
              </blockquote>
            </div>
          </blockquote>
        </blockquote>
        <div class="moz-signature">-- <br>
        Nidish</div>
        <span>&lt;ksps.cpp&gt;</span>
      </blockquote>
    </blockquote>
    <div class="moz-signature">-- <br>
    Nidish</div>
  </blockquote>
</blockquote>
<div class="moz-signature">-- <br>
Nidish</div>
</body>
</html>