The streams numbers

 1   8291.4887  Rate (MB/s)
 2   8739.3219  Rate (MB/s)  1.05401
 3  24769.5868  Rate (MB/s)  2.98735
 4  31962.0242  Rate (MB/s)  3.8548
 5  39603.8828  Rate (MB/s)  4.77645
 6  47777.7385  Rate (MB/s)  5.76226
 7  54557.5363  Rate (MB/s)  6.57994
 8  62769.3910  Rate (MB/s)  7.57034
 9  38649.9160  Rate (MB/s)  4.6614

indicate that the MPI launcher is doing a poor job of binding MPI ranks to cores; you should read up on the binding options for your particular mpiexec and choose good ones. Unfortunately, there is no standard for setting bindings, and each MPI implementation changes its options constantly, so you need to determine them exactly for your machine and MPI implementation. Basically, you want to place each MPI rank on a node "as far away as possible, in terms of memory domains, from the other ranks." Note that going from 1 to 2 ranks gives essentially no speedup, which can be interpreted to mean that the first two ranks are placed very close together (and thus each shares all its memory resources with its partner).
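For example, with OpenMPI something along these lines is a reasonable starting point (illustrative only: the option names differ between MPI implementations and even between versions, and ./your_app is just a placeholder for your executable):

    # spread consecutive ranks across the NUMA (memory) domains of the node,
    # pin each rank to a core, and print the resulting bindings for verification
    mpiexec -n 8 --map-by numa --bind-to core --report-bindings ./your_app

The --report-bindings output lets you confirm that the ranks really do land in different memory domains.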
A side note: the raw numbers are very good (you get a speedup of 7.57 on 8 ranks, and the speedup goes up to about 10 at 32 ranks). This means that with proper binding you should get really good speedup from PETSc code out to at least 8 cores per node.

  Barry


On Jul 12, 2022, at 11:32 AM, Ce Qin <qince168@gmail.com> wrote:

> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1        1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946    4.65272   6.101214
>           8       2             4  10.584189  17.519901    8.59046  16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795   37.04392  18.135278    34.5448  18.575953  34.483058
>          16       4             4   22.00331  39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779   41.38289  18.354949  36.388144  18.828393   36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608   35.62153  71.471289  37.166421  71.533865
>
> and the streams results on the compute node:
>
>  1   8291.4887  Rate (MB/s)
>  2   8739.3219  Rate (MB/s)  1.05401
>  3  24769.5868  Rate (MB/s)  2.98735
>  4  31962.0242  Rate (MB/s)  3.8548
>  5  39603.8828  Rate (MB/s)  4.77645
>  6  47777.7385  Rate (MB/s)  5.76226
>  7  54557.5363  Rate (MB/s)  6.57994
>  8  62769.3910  Rate (MB/s)  7.57034
>  9  38649.9160  Rate (MB/s)  4.6614
> 10  58976.9536  Rate (MB/s)  7.11295
> 11  48108.7801  Rate (MB/s)  5.80219
> 12  49506.8213  Rate (MB/s)  5.9708
> 13  54810.5266  Rate (MB/s)  6.61046
> 14  62471.5234  Rate (MB/s)  7.53441
> 15  63968.0218  Rate (MB/s)  7.7149
> 16  69644.8615  Rate (MB/s)  8.39956
> 17  60791.9544  Rate (MB/s)  7.33185
> 18  65476.5162  Rate (MB/s)  7.89683
> 19  60127.0683  Rate (MB/s)  7.25166
> 20  72052.5175  Rate (MB/s)  8.68994
> 21  62045.7745  Rate (MB/s)  7.48307
> 22  64517.7771  Rate (MB/s)  7.7812
> 23  69570.2935  Rate (MB/s)  8.39057
> 24  69673.8328  Rate (MB/s)  8.40305
> 25  75196.7514  Rate (MB/s)  9.06915
> 26  72304.2685  Rate (MB/s)  8.7203
> 27  73234.1616  Rate (MB/s)  8.83245
> 28  74041.3842  Rate (MB/s)  8.9298
> 29  77117.3751  Rate (MB/s)  9.30079
> 30  78293.8496  Rate (MB/s)  9.44268
> 31  81377.0870  Rate (MB/s)  9.81453
> 32  84097.0813  Rate (MB/s)  10.1426
>
> Best,
> Ce
>
> On Tue, Jul 12, 2022 at 22:11 Mark Adams <mfadams@lbl.gov> wrote:
>
>> You may get more memory bandwidth with 32 processors vs. 1, as Ce mentioned.
>> It depends on the architecture.
>> Do you get the whole memory bandwidth on one processor on this machine?
>>
>> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley@gmail.com> wrote:
>>
>>> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168@gmail.com> wrote:
>>>
>>>>>> The linear system is complex-valued. We rewrite it into its real form
>>>>>> and solve it using FGMRES and an optimal block-diagonal preconditioner.
>>>>>> We use CG and the AMS preconditioner implemented in HYPRE to solve the
>>>>>> smaller real linear system arising from applying the block preconditioner.
>>>>>> The iteration counts of FGMRES and CG stay almost constant in all the runs.
>>>>>
>>>>> So those blocks decrease in size as you add more processes?
>>>>
>>>> I am sorry for the unclear description of the block-diagonal preconditioner.
>>>> Let K be the original complex system matrix, with real part Kr and imaginary
>>>> part Ki; A = [Kr, -Ki; -Ki, -Kr] is the equivalent real form of K. Let
>>>> P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal preconditioner
>>>> for A. In our implementation, only Kr, Ki and Kr+Ki are explicitly stored as
>>>> MATMPIAIJ. We use MATSHELL to represent A and P. We use FGMRES + P to solve
>>>> Ax=b, and CG + AMS to solve (Kr+Ki)y=c. So the block size is never changed.
>>>
>>> Then we have to break down the timings further. I suspect AMS is not taking as
>>> long, since all other operations scale like N.
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>> Best,
>>>> Ce
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is
>>> infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
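For readers following the thread, here is a rough sketch (plain PETSc C, not Ce's actual code) of the MATSHELL + PCSHELL arrangement described above: FGMRES on the real-form operator A = [Kr, -Ki; -Ki, -Kr], preconditioned by two independent inner CG + Hypre AMS solves with Kr + Ki. All names (RealFormCtx, BlockPCCtx, MatMult_RealForm, PCApply_Block, KrpKi, is_r/is_i, actx, pctx, outer) are invented for illustration; it assumes Kr, Ki and Kr + Ki are already assembled as MATMPIAIJ, that the index sets and work vectors match the parallel layout of the stacked vectors, and it omits the discrete-gradient and coordinate data that AMS additionally requires.

/* Illustrative sketch only; see the assumptions stated above. */
#include <petscksp.h>

typedef struct {
  Mat Kr, Ki;      /* real and imaginary parts of the complex matrix K */
  IS  is_r, is_i;  /* index sets selecting the [xr] and [xi] halves    */
  Vec t1, t2;      /* work vectors with the row layout of Kr           */
} RealFormCtx;

/* y = A x with A = [Kr, -Ki; -Ki, -Kr] applied to the stacked vector x = [xr; xi] */
static PetscErrorCode MatMult_RealForm(Mat A, Vec x, Vec y)
{
  RealFormCtx *ctx;
  Vec          xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &ctx));
  PetscCall(VecGetSubVector(x, ctx->is_r, &xr));
  PetscCall(VecGetSubVector(x, ctx->is_i, &xi));
  PetscCall(VecGetSubVector(y, ctx->is_r, &yr));
  PetscCall(VecGetSubVector(y, ctx->is_i, &yi));
  PetscCall(MatMult(ctx->Kr, xr, ctx->t1));        /* t1 = Kr xr          */
  PetscCall(MatMult(ctx->Ki, xi, ctx->t2));        /* t2 = Ki xi          */
  PetscCall(VecWAXPY(yr, -1.0, ctx->t2, ctx->t1)); /* yr = Kr xr - Ki xi  */
  PetscCall(MatMult(ctx->Ki, xr, ctx->t1));        /* t1 = Ki xr          */
  PetscCall(MatMult(ctx->Kr, xi, ctx->t2));        /* t2 = Kr xi          */
  PetscCall(VecWAXPY(yi, 1.0, ctx->t1, ctx->t2));  /* yi = Ki xr + Kr xi  */
  PetscCall(VecScale(yi, -1.0));                   /* yi = -Ki xr - Kr xi */
  PetscCall(VecRestoreSubVector(y, ctx->is_i, &yi));
  PetscCall(VecRestoreSubVector(y, ctx->is_r, &yr));
  PetscCall(VecRestoreSubVector(x, ctx->is_i, &xi));
  PetscCall(VecRestoreSubVector(x, ctx->is_r, &xr));
  PetscFunctionReturn(PETSC_SUCCESS);
}

typedef struct {
  KSP inner;       /* CG + Hypre AMS solver for Kr + Ki */
  IS  is_r, is_i;
} BlockPCCtx;

/* z = P^{-1} r with P = [Kr+Ki, 0; 0, Kr+Ki]: two independent inner solves */
static PetscErrorCode PCApply_Block(PC pc, Vec r, Vec z)
{
  BlockPCCtx *ctx;
  Vec         rr, ri, zr, zi;

  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(VecGetSubVector(r, ctx->is_r, &rr));
  PetscCall(VecGetSubVector(r, ctx->is_i, &ri));
  PetscCall(VecGetSubVector(z, ctx->is_r, &zr));
  PetscCall(VecGetSubVector(z, ctx->is_i, &zi));
  PetscCall(KSPSolve(ctx->inner, rr, zr));         /* (Kr+Ki) zr = rr */
  PetscCall(KSPSolve(ctx->inner, ri, zi));         /* (Kr+Ki) zi = ri */
  PetscCall(VecRestoreSubVector(z, ctx->is_i, &zi));
  PetscCall(VecRestoreSubVector(z, ctx->is_r, &zr));
  PetscCall(VecRestoreSubVector(r, ctx->is_i, &ri));
  PetscCall(VecRestoreSubVector(r, ctx->is_r, &rr));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Inside the solver setup (m,n local and M,N global sizes of Kr; actx, pctx filled in): */
  PetscCall(MatCreateShell(comm, 2*m, 2*n, 2*M, 2*N, &actx, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MatMult_RealForm));

  PetscCall(KSPCreate(comm, &outer));              /* outer FGMRES on the shell matrix A */
  PetscCall(KSPSetType(outer, KSPFGMRES));
  PetscCall(KSPSetOperators(outer, A, A));
  PetscCall(KSPGetPC(outer, &pc));
  PetscCall(PCSetType(pc, PCSHELL));               /* P^{-1} applied through the shell PC */
  PetscCall(PCShellSetContext(pc, &pctx));
  PetscCall(PCShellSetApply(pc, PCApply_Block));

  PetscCall(KSPCreate(comm, &pctx.inner));         /* inner CG + AMS on KrpKi = Kr + Ki */
  PetscCall(KSPSetType(pctx.inner, KSPCG));
  PetscCall(KSPSetOperators(pctx.inner, KrpKi, KrpKi));
  PetscCall(KSPGetPC(pctx.inner, &ipc));
  PetscCall(PCSetType(ipc, PCHYPRE));
  PetscCall(PCHYPRESetType(ipc, "ams"));           /* AMS gradient/coordinates not shown */

  PetscCall(KSPSolve(outer, b, x));                /* solve the real-form system Ax=b */

Because the inner solves always act on Kr + Ki at the original problem size, the preconditioner blocks do not shrink as ranks are added, which is consistent with the roughly constant FGMRES/CG iteration counts reported above.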