<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:DengXian;

        panose-1:2 1 6 0 3 1 1 1 1 1;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:"\@DengXian";

        panose-1:2 1 6 0 3 1 1 1 1 1;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

span.EmailStyle18

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style>

</head>

<body lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">MPI rank distribution (e.g., 8 ranks per node or 16 ranks per node) is usually managed by workload managers such as Slurm or PBS through your job scripts, which is outside PETSc’s control.<o:p></o:p></p>
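<p class="MsoNormal">As a rough illustration of requesting a given rank layout from Slurm, here is a minimal Python sketch that writes out a batch script. The helper name <b>slurm_script</b> and the executable path <b>./ex45</b> are assumptions made for this example; only the <b>--nodes</b> and <b>--ntasks-per-node</b> options and <b>srun -n</b> are standard Slurm, and your site may need extra directives (account, time limit, modules).</p>

<pre>
```python
def slurm_script(nodes: int, ranks_per_node: int, exe: str = "./ex45") -> str:
    """Build a minimal Slurm batch script placing ranks_per_node MPI ranks on each node.

    The executable path is a placeholder; site-specific directives are omitted.
    """
    total_ranks = nodes * ranks_per_node
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",                      # number of nodes to allocate
        f"#SBATCH --ntasks-per-node={ranks_per_node}",   # MPI ranks placed on each node
        f"srun -n {total_ranks} {exe}",                  # launch all ranks
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: 8 nodes with 8 ranks each (64 ranks in total).
    print(slurm_script(8, 8))
```
</pre>

<p class="MsoNormal">Changing only the two arguments switches between a spread-out layout (8 nodes, 8 ranks each) and a packed one (2 nodes, 32 ranks each) without touching the PETSc program itself.</p>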

<p class="MsoNormal"><o:p> </o:p></p>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Amin Sadeghi &lt;aminthefresh@gmail.com&gt;<br>

<b>Date: </b>Wednesday, March 25, 2020 at 4:40 PM<br>

<b>To: </b>Junchao Zhang &lt;junchao.zhang@gmail.com&gt;<br>

<b>Cc: </b>Mark Adams &lt;mfadams@lbl.gov&gt;, PETSc users list &lt;petsc-users@mcs.anl.gov&gt;<br>

<b>Subject: </b>Re: [petsc-users] Poor speed up for KSP example 45<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Junchao, thank you for running the experiment. I guess TACC Frontera nodes have higher memory bandwidth than Compute Canada's Graham (maybe a more modern CPU architecture, although I'm not sure which hardware characteristics affect memory bandwidth). <o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Mark, I did as you suggested. As you suspected, running make streams showed the same trend as the KSP timings, indicating that the memory bandwidth saturates at around 8 MPI processes. I then ran the experiment on multiple nodes, requesting only 8 cores per node, and here is the result:<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">1 node (8 cores total): 17.5s, 6X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">2 nodes (16 cores total): 13.5s, 7X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">3 nodes (24 cores total): 9.4s, 10X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">4 nodes (32 cores total): 8.3s, 12X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">5 nodes (40 cores total): 7.0s, 14X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:red">6 nodes (48 cores total): 61.4s, 2X speedup [!!!]</span><span style="font-size:12.0pt;font-family:"tahoma",sans-serif"><o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">7 nodes (56 cores total): 4.3s, 23X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">8 nodes (64 cores total): 3.7s, 27X speedup<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><b><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Note:</span></b><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"> as you can see, the experiment with 6 nodes showed extremely poor

 scaling, which I guess was an outlier, maybe due to some connection problem?<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and here's the result:<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">So, it turns out that given a fixed number of cores, i.e. 64 in our case, much better speedups (27X vs. 16X in our case) can be achieved if they are distributed

 among separate nodes.<o:p></o:p></span></p>

</div>
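<p class="MsoNormal">To make the comparison above easier to reproduce, here is a small Python sketch that recomputes speedup and parallel efficiency from the reported wall times. It assumes the single-process baseline is the approximately 98 s figure from the first message in this thread, so the computed values can differ from the rounded speedups quoted above by about 1X.</p>

<pre>
```python
# Assumed baseline: the ~98 s single-process KSP solve time from the first message.
BASELINE_S = 98.0

# (nodes, ranks per node, wall time in seconds), copied from the results above.
RUNS = [
    (1, 8, 17.5), (2, 8, 13.5), (3, 8, 9.4), (4, 8, 8.3),
    (5, 8, 7.0), (6, 8, 61.4), (7, 8, 4.3), (8, 8, 3.7),
    (2, 32, 6.0),
]


def speedup(wall_time_s: float) -> float:
    """Speedup relative to the assumed single-process baseline."""
    return BASELINE_S / wall_time_s


def efficiency(wall_time_s: float, ranks: int) -> float:
    """Parallel efficiency: speedup divided by the number of MPI ranks."""
    return speedup(wall_time_s) / ranks


for nodes, ranks_per_node, wall_time_s in RUNS:
    ranks = nodes * ranks_per_node
    print(
        f"{nodes} node(s) x {ranks_per_node} ranks = {ranks:2d} ranks: "
        f"speedup {speedup(wall_time_s):5.1f}X, "
        f"efficiency {efficiency(wall_time_s, ranks):.0%}"
    )
```
</pre>

<p class="MsoNormal">Looking at efficiency rather than raw speedup puts the two 64-rank runs on the same scale, since both use the same number of ranks and differ only in how those ranks are spread across nodes.</p>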

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Anyway, I really appreciate all your input.<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><b><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">One final question:</span></b><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"> From what I understand of Mark's comment, PETSc's partitioning is currently blind to the memory hierarchy. Is it feasible to make PETSc aware of inter-node and intra-node communication so that partitioning is done to maximize performance? Or, to put it differently, is this something that the PETSc developers have their eyes on for the future?<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Sincerely,<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Amin<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal">On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang &lt;<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>&gt; wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal">I repeated your experiment on one node of TACC Frontera,<o:p></o:p></p>

<div>

<p class="MsoNormal">1 rank: 85.0s<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">16 ranks: 8.2s, 10x speedup<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">32 ranks: 5.7s, 15x speedup<o:p></o:p></p>

<div>

<p class="MsoNormal"><br clear="all">

<o:p></o:p></p>

<div>

<div>

<div>

<p class="MsoNormal">--Junchao Zhang<o:p></o:p></p>

</div>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal">On Wed, Mar 25, 2020 at 1:18 PM Mark Adams &lt;<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>&gt; wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal">Also, a better test is to see where streams pretty much saturates, then run that many processes per node and repeat the same test while increasing the number of nodes. This will tell you how well your network communication is doing.<o:p></o:p></p>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">But this result has a lot of stuff in "network communication" that can be further evaluated. The worst thing about this, I would think, is that the partitioning is blind to the memory hierarchy of inter-node and intra-node communication. The next thing to do is run with an initial grid that puts one cell per node and then do uniform refinement until you have one cell per process (e.g., one refinement step using 8 processes per node), partition to get one cell per process, then do uniform refinement to get a reasonably sized local problem. Alas, this is not easy to do, but it is doable.<o:p></o:p></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal">On Wed, Mar 25, 2020 at 2:04 PM Mark Adams &lt;<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>&gt; wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal">I would guess that you are saturating the memory bandwidth. After you make PETSc (make all) it will suggest that you test it (make test) and suggest that you run streams (make streams).<o:p></o:p></p>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">I see Matt answered, but let me add that when you make streams you will see the memory rate for 1, 2, 3, ... NP processes. If your machine is decent you should see very good speedup at the beginning, and then it will start to saturate. You are seeing about 50% of perfect speedup at 16 processes. I would expect that you will see something similar with streams. Without knowing your machine, your results look typical.<o:p></o:p></p>

</div>
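<p class="MsoNormal">The saturation behaviour described above can be sketched with a deliberately crude model, written here in Python. It assumes a purely memory-bound solve whose per-node speedup grows linearly until the memory bandwidth saturates and is flat afterwards, and it ignores network cost entirely. The function name and the saturation point of 8 ranks per node are assumptions for illustration (8 is the figure reported elsewhere in this thread for one particular machine); this is an idealized upper bound, not a prediction.</p>

<pre>
```python
def bandwidth_bound_speedup(nodes: int, ranks_per_node: int, saturation_ranks: int) -> int:
    """Idealized upper bound on speedup for a memory-bound solve.

    Each node scales linearly with its rank count until memory bandwidth
    saturates at saturation_ranks; extra ranks on that node add nothing.
    Network communication is ignored.
    """
    return nodes * min(ranks_per_node, saturation_ranks)


if __name__ == "__main__":
    # saturation_ranks=8 is an assumed, machine-specific value.
    for nodes, ranks_per_node in [(1, 16), (1, 32), (2, 32), (8, 8)]:
        bound = bandwidth_bound_speedup(nodes, ranks_per_node, saturation_ranks=8)
        print(f"{nodes} node(s) x {ranks_per_node} ranks: at most {bound}X")
```
</pre>

<p class="MsoNormal">Under this model, adding ranks to an already saturated node does not raise the bound, while adding nodes does, which is why the per-node saturation point from streams is a useful number to measure first.</p>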

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal">On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi &lt;<a href="mailto:aminthefresh@gmail.com" target="_blank">aminthefresh@gmail.com</a>&gt; wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Hi,<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 16 and 32 MPI processes. Here's a comparison of the time spent during KSP.solve:<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">- 1 MPI process: ~98 sec, speedup: 1X<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">- 16 MPI processes: ~12 sec, speedup: ~8X<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">- 32 MPI processes: ~11 sec, speedup: ~9X<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Since the problem size is large enough (8M unknowns), I expected a speedup much closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">I've attached three log files for more details. <o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black"><o:p> </o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Sincerely,<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"tahoma",sans-serif;color:black">Amin<o:p></o:p></span></p>

</div>

</div>

</blockquote>

</div>

</blockquote>

</div>

</blockquote>

</div>

</blockquote>

</div>

</div>

</body>

</html>