<div dir="ltr">I repeated your experiment on one node of TACC Frontera:<div>1 rank: 85.0s</div><div>16 ranks: 8.2s, 10x speedup</div><div>32 ranks: 5.7s, 15x speedup<br><div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Also, a better test is to see where streams pretty much saturates, then run that many processes per node and repeat the same test while increasing the number of nodes. This will tell you how well your network communication is doing.<div><br></div><div>But this result has a lot of stuff in "network communication" that can be further evaluated. The worst thing about this, I would think, is that the partitioning is blind to the memory hierarchy of inter- and intra-node communication. The next thing to do is run with an initial grid that puts one cell per node and then do uniform refinement until you have one cell per process (e.g., one refinement step using 8 processes per node), partition to get one cell per process, then do uniform refinement to get a reasonably sized local problem. Alas, this is not easy to do, but it is doable.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I would guess that you are saturating the memory bandwidth. 
After you make PETSc (make all), it will suggest that you test it (make test) and suggest that you run streams (make streams).<div><br></div><div>I see Matt answered, but let me add that when you make streams you will see the memory rate for 1, 2, 3, ... NP processes. If your machine is decent, you should see very good speedup at the beginning, and then it will start to saturate. You are seeing about 50% of perfect speedup at 16 processes. I would expect that you will see something similar with streams. Without knowing your machine, your results look typical.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <<a href="mailto:aminthefresh@gmail.com" target="_blank">aminthefresh@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Hi,</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 16 and 32 MPI processes. 
Here's a comparison of the time spent during KSP.solve:</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 1 MPI process: ~98 sec, speedup: 1X</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 16 MPI processes: ~12 sec, speedup: ~8X</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 32 MPI processes: ~11 sec, speedup: ~9X<br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Since the problem size is large enough (8M unknowns), I expected a speedup much closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">I've attached three log files for more details. </div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Sincerely,</div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Amin</div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>