The streams numbers

 1   8291.4887  Rate (MB/s)
 2   8739.3219  Rate (MB/s)  1.05401
 3  24769.5868  Rate (MB/s)  2.98735
 4  31962.0242  Rate (MB/s)  3.8548
 5  39603.8828  Rate (MB/s)  4.77645
 6  47777.7385  Rate (MB/s)  5.76226
 7  54557.5363  Rate (MB/s)  6.57994
 8  62769.3910  Rate (MB/s)  7.57034
 9  38649.9160  Rate (MB/s)  4.6614

indicate that the MPI launcher is doing a poor job of binding MPI ranks to cores; you should read up on the binding options for your particular mpiexec and choose good ones. Unfortunately, there is no standard for setting bindings, and each MPI implementation changes its options constantly, so you need to determine them exactly for your machine and MPI implementation. Basically, you want to place each MPI rank on a node "as far away as possible, in terms of memory domains, from the other ranks." Note that going from 1 to 2 ranks gives essentially no speedup, which can be interpreted to mean that the first two ranks are placed very close together (and thus each shares all its memory resources with its partner).
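For example, with OpenMPI something along these lines is a reasonable starting point (illustrative only: the option names differ between MPI implementations and even between versions, and ./your_app is just a placeholder for your executable):

    # spread consecutive ranks across the NUMA (memory) domains of the node,
    # pin each rank to a core, and print the resulting bindings for verification
    mpiexec -n 8 --map-by numa --bind-to core --report-bindings ./your_app

The --report-bindings output lets you confirm that the ranks really do land in different memory domains.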
A side note: the raw numbers are very good (you get a speedup of 7.57 on 8 ranks, and the speedup goes up to about 10 at 32 ranks). This means that with proper binding you should get really good speedup from PETSc code out to at least 8 cores per node.

  Barry


On Jul 12, 2022, at 11:32 AM, Ce Qin <qince168@gmail.com> wrote:

> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1        1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946    4.65272   6.101214
>           8       2             4  10.584189  17.519901    8.59046  16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795   37.04392  18.135278    34.5448  18.575953  34.483058
>          16       4             4   22.00331  39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779   41.38289  18.354949  36.388144  18.828393   36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608   35.62153  71.471289  37.166421  71.533865
>
> and the streams results on the compute node:
>
>  1   8291.4887  Rate (MB/s)
>  2   8739.3219  Rate (MB/s)  1.05401
>  3  24769.5868  Rate (MB/s)  2.98735
>  4  31962.0242  Rate (MB/s)  3.8548
>  5  39603.8828  Rate (MB/s)  4.77645
>  6  47777.7385  Rate (MB/s)  5.76226
>  7  54557.5363  Rate (MB/s)  6.57994
>  8  62769.3910  Rate (MB/s)  7.57034
>  9  38649.9160  Rate (MB/s)  4.6614
> 10  58976.9536  Rate (MB/s)  7.11295
> 11  48108.7801  Rate (MB/s)  5.80219
> 12  49506.8213  Rate (MB/s)  5.9708
> 13  54810.5266  Rate (MB/s)  6.61046
> 14  62471.5234  Rate (MB/s)  7.53441
> 15  63968.0218  Rate (MB/s)  7.7149
> 16  69644.8615  Rate (MB/s)  8.39956
> 17  60791.9544  Rate (MB/s)  7.33185
> 18  65476.5162  Rate (MB/s)  7.89683
> 19  60127.0683  Rate (MB/s)  7.25166
> 20  72052.5175  Rate (MB/s)  8.68994
> 21  62045.7745  Rate (MB/s)  7.48307
> 22  64517.7771  Rate (MB/s)  7.7812
> 23  69570.2935  Rate (MB/s)  8.39057
> 24  69673.8328  Rate (MB/s)  8.40305
> 25  75196.7514  Rate (MB/s)  9.06915
> 26  72304.2685  Rate (MB/s)  8.7203
> 27  73234.1616  Rate (MB/s)  8.83245
> 28  74041.3842  Rate (MB/s)  8.9298
> 29  77117.3751  Rate (MB/s)  9.30079
> 30  78293.8496  Rate (MB/s)  9.44268
> 31  81377.0870  Rate (MB/s)  9.81453
> 32  84097.0813  Rate (MB/s)  10.1426
>
> Best,
> Ce
>
> On Tue, Jul 12, 2022 at 22:11 Mark Adams <mfadams@lbl.gov> wrote:
>
>> You may get more memory bandwidth with 32 processors vs. 1, as Ce mentioned.
>> It depends on the architecture.
>> Do you get the whole memory bandwidth on one processor on this machine?
>>
>> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley@gmail.com> wrote:
>>
>>> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168@gmail.com> wrote:
>>>
>>>>>> The linear system is complex-valued. We rewrite it into its real form
>>>>>> and solve it using FGMRES and an optimal block-diagonal preconditioner.
>>>>>> We use CG and the AMS preconditioner implemented in HYPRE to solve the
>>>>>> smaller real linear system arising from applying the block preconditioner.
>>>>>> The iteration counts of FGMRES and CG stay almost constant in all the runs.
>>>>>
>>>>> So those blocks decrease in size as you add more processes?
>>>>
>>>> I am sorry for the unclear description of the block-diagonal preconditioner.
>>>> Let K be the original complex system matrix, with real part Kr and imaginary
>>>> part Ki; A = [Kr, -Ki; -Ki, -Kr] is the equivalent real form of K. Let
>>>> P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal preconditioner
>>>> for A. In our implementation, only Kr, Ki and Kr+Ki are explicitly stored as
>>>> MATMPIAIJ. We use MATSHELL to represent A and P. We use FGMRES + P to solve
>>>> Ax=b, and CG + AMS to solve (Kr+Ki)y=c. So the block size is never changed.
>>>
>>> Then we have to break down the timings further. I suspect AMS is not taking as
>>> long, since all other operations scale like N.
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>> Best,
>>>> Ce
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is
>>> infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
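For readers following the thread, here is a rough sketch (plain PETSc C, not Ce's actual code) of the MATSHELL + PCSHELL arrangement described above: FGMRES on the real-form operator A = [Kr, -Ki; -Ki, -Kr], preconditioned by two independent inner CG + Hypre AMS solves with Kr + Ki. All names (RealFormCtx, BlockPCCtx, MatMult_RealForm, PCApply_Block, KrpKi, is_r/is_i, actx, pctx, outer) are invented for illustration; it assumes Kr, Ki and Kr + Ki are already assembled as MATMPIAIJ, that the index sets and work vectors match the parallel layout of the stacked vectors, and it omits the discrete-gradient and coordinate data that AMS additionally requires.

/* Illustrative sketch only; see the assumptions stated above. */
#include <petscksp.h>

typedef struct {
  Mat Kr, Ki;      /* real and imaginary parts of the complex matrix K */
  IS  is_r, is_i;  /* index sets selecting the [xr] and [xi] halves    */
  Vec t1, t2;      /* work vectors with the row layout of Kr           */
} RealFormCtx;

/* y = A x with A = [Kr, -Ki; -Ki, -Kr] applied to the stacked vector x = [xr; xi] */
static PetscErrorCode MatMult_RealForm(Mat A, Vec x, Vec y)
{
  RealFormCtx *ctx;
  Vec          xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &ctx));
  PetscCall(VecGetSubVector(x, ctx->is_r, &xr));
  PetscCall(VecGetSubVector(x, ctx->is_i, &xi));
  PetscCall(VecGetSubVector(y, ctx->is_r, &yr));
  PetscCall(VecGetSubVector(y, ctx->is_i, &yi));
  PetscCall(MatMult(ctx->Kr, xr, ctx->t1));        /* t1 = Kr xr          */
  PetscCall(MatMult(ctx->Ki, xi, ctx->t2));        /* t2 = Ki xi          */
  PetscCall(VecWAXPY(yr, -1.0, ctx->t2, ctx->t1)); /* yr = Kr xr - Ki xi  */
  PetscCall(MatMult(ctx->Ki, xr, ctx->t1));        /* t1 = Ki xr          */
  PetscCall(MatMult(ctx->Kr, xi, ctx->t2));        /* t2 = Kr xi          */
  PetscCall(VecWAXPY(yi, 1.0, ctx->t1, ctx->t2));  /* yi = Ki xr + Kr xi  */
  PetscCall(VecScale(yi, -1.0));                   /* yi = -Ki xr - Kr xi */
  PetscCall(VecRestoreSubVector(y, ctx->is_i, &yi));
  PetscCall(VecRestoreSubVector(y, ctx->is_r, &yr));
  PetscCall(VecRestoreSubVector(x, ctx->is_i, &xi));
  PetscCall(VecRestoreSubVector(x, ctx->is_r, &xr));
  PetscFunctionReturn(PETSC_SUCCESS);
}

typedef struct {
  KSP inner;       /* CG + Hypre AMS solver for Kr + Ki */
  IS  is_r, is_i;
} BlockPCCtx;

/* z = P^{-1} r with P = [Kr+Ki, 0; 0, Kr+Ki]: two independent inner solves */
static PetscErrorCode PCApply_Block(PC pc, Vec r, Vec z)
{
  BlockPCCtx *ctx;
  Vec         rr, ri, zr, zi;

  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(VecGetSubVector(r, ctx->is_r, &rr));
  PetscCall(VecGetSubVector(r, ctx->is_i, &ri));
  PetscCall(VecGetSubVector(z, ctx->is_r, &zr));
  PetscCall(VecGetSubVector(z, ctx->is_i, &zi));
  PetscCall(KSPSolve(ctx->inner, rr, zr));         /* (Kr+Ki) zr = rr */
  PetscCall(KSPSolve(ctx->inner, ri, zi));         /* (Kr+Ki) zi = ri */
  PetscCall(VecRestoreSubVector(z, ctx->is_i, &zi));
  PetscCall(VecRestoreSubVector(z, ctx->is_r, &zr));
  PetscCall(VecRestoreSubVector(r, ctx->is_i, &ri));
  PetscCall(VecRestoreSubVector(r, ctx->is_r, &rr));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Inside the solver setup (m,n local and M,N global sizes of Kr; actx, pctx filled in): */
  PetscCall(MatCreateShell(comm, 2*m, 2*n, 2*M, 2*N, &actx, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MatMult_RealForm));

  PetscCall(KSPCreate(comm, &outer));              /* outer FGMRES on the shell matrix A */
  PetscCall(KSPSetType(outer, KSPFGMRES));
  PetscCall(KSPSetOperators(outer, A, A));
  PetscCall(KSPGetPC(outer, &pc));
  PetscCall(PCSetType(pc, PCSHELL));               /* P^{-1} applied through the shell PC */
  PetscCall(PCShellSetContext(pc, &pctx));
  PetscCall(PCShellSetApply(pc, PCApply_Block));

  PetscCall(KSPCreate(comm, &pctx.inner));         /* inner CG + AMS on KrpKi = Kr + Ki */
  PetscCall(KSPSetType(pctx.inner, KSPCG));
  PetscCall(KSPSetOperators(pctx.inner, KrpKi, KrpKi));
  PetscCall(KSPGetPC(pctx.inner, &ipc));
  PetscCall(PCSetType(ipc, PCHYPRE));
  PetscCall(PCHYPRESetType(ipc, "ams"));           /* AMS gradient/coordinates not shown */

  PetscCall(KSPSolve(outer, b, x));                /* solve the real-form system Ax=b */

Because the inner solves always act on Kr + Ki at the original problem size, the preconditioner blocks do not shrink as ranks are added, which is consistent with the roughly constant FGMRES/CG iteration counts reported above.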