<div dir="ltr"><div dir="ltr">On Wed, Mar 25, 2020 at 1:01 PM Amin Sadeghi <<a href="mailto:aminthefresh@gmail.com">aminthefresh@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Hi,</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 16 and 32 MPI processes. Here's a comparison of the time spent during KSP.solve:</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 1 MPI process: ~98 sec, speedup: 1X</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 16 MPI processes: ~12 sec, speedup: ~8X</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">- 32 MPI processes: ~11 sec, speedup: ~9X<br></div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Since the problem size is large enough (8M unknowns), I expected a speedup much closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">I've attached three log files for more details. </div></div></blockquote><div><br></div><div>We have answered this here: <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#computers">https://www.mcs.anl.gov/petsc/documentation/faq.html#computers</a></div><div><br></div><div>However, I can briefly summarize it. 
The bottleneck here is not computing power but memory bandwidth. The node</div><div>you are running on has enough bandwidth for about 8 processes, not 32. It probably takes 12-16 processes to saturate</div><div>the memory bandwidth. That is why you see no speedup after 16. There is no way to improve this by optimization.</div><div>The only thing to do is change the algorithm you are using. This behavior has been extensively documented and discussed</div><div>for two decades. See, for example, the Roofline Performance Model.</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Sincerely,</div><div style="font-family:tahoma,sans-serif;font-size:small;color:rgb(0,0,0)">Amin</div></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>