<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">For your reference, I also calculated the speedups for other procedures:<div><br></div><div><div><font face="monospace"> VecAXPY MatMult SetupAMS PCApply Assembly Solving</font></div><div><font face="monospace">NProcessors NNodes CoresPerNode </font></div><div><font face="monospace">1 1 1 1.0 1.0 1.0 1.0 1.0 1.0</font></div><div><font face="monospace">2 1 2 1.640502 1.945753 1.418709 1.898884 1.995246 1.898756</font></div><div><font face="monospace"> 2 1 2.297125 2.614508 1.600718 2.419798 2.121401 2.436149</font></div><div><font face="monospace">4 1 4 4.456256 6.821532 3.614451 5.991256 4.658187 6.004539</font></div><div><font face="monospace"> 2 2 4.539748 6.779151 3.619661 5.926112 4.666667 5.942085</font></div><div><font face="monospace"> 4 1 4.480902 7.210629 3.471541 6.082946 4.65272 6.101214</font></div><div><font face="monospace">8 2 4 10.584189 17.519901 8.59046 16.615395 9.380985 16.581135</font></div><div><font face="monospace"> 4 2 10.980687 18.674113 8.612347 17.273229 9.308575 17.258891</font></div><div><font face="monospace"> 8 1 11.096298 18.210245 8.456557 17.430586 9.314449 17.380612</font></div><div><font face="monospace">16 2 8 21.929795 37.04392 18.135278 34.5448 18.575953 34.483058</font></div><div><font face="monospace"> 4 4 22.00331 39.581504 18.011148 34.793732 18.745129 34.854409</font></div><div><font face="monospace"> 8 2 22.692779 41.38289 18.354949 36.388144 18.828393 36.45509</font></div><div><font face="monospace">32 4 8 43.935774 80.003087 34.963997 70.085728 37.140626 70.175879</font></div><div><font face="monospace"> 8 4 44.387091 80.807608 35.62153 71.471289 37.166421 71.533865</font></div></div><div><font face="monospace"><br></font></div><div><font face="arial, sans-serif">and the streams result on the computation node:</font></div><div><font face="arial, sans-serif"><br></font></div><div><div><font face="monospace"><div>1 8291.4887 Rate (MB/s)</div><div>2 8739.3219 Rate (MB/s) 1.05401</div><div>3 24769.5868 Rate (MB/s) 2.98735</div><div>4 31962.0242 Rate (MB/s) 3.8548</div><div>5 39603.8828 Rate (MB/s) 4.77645</div><div>6 47777.7385 Rate (MB/s) 5.76226</div><div>7 54557.5363 Rate (MB/s) 6.57994</div><div>8 62769.3910 Rate (MB/s) 7.57034</div><div>9 38649.9160 Rate (MB/s) 4.6614</div><div>10 58976.9536 Rate (MB/s) 7.11295</div><div>11 48108.7801 Rate (MB/s) 5.80219</div><div>12 49506.8213 Rate (MB/s) 5.9708</div><div>13 54810.5266 Rate (MB/s) 6.61046</div><div>14 62471.5234 Rate (MB/s) 7.53441</div><div>15 63968.0218 Rate (MB/s) 7.7149</div><div>16 69644.8615 Rate (MB/s) 8.39956</div><div>17 60791.9544 Rate (MB/s) 7.33185</div><div>18 65476.5162 Rate (MB/s) 7.89683</div><div>19 60127.0683 Rate (MB/s) 7.25166</div><div>20 72052.5175 Rate (MB/s) 8.68994</div><div>21 62045.7745 Rate (MB/s) 7.48307</div><div>22 64517.7771 Rate (MB/s) 7.7812</div><div>23 69570.2935 Rate (MB/s) 8.39057</div><div>24 69673.8328 Rate (MB/s) 8.40305</div><div>25 75196.7514 Rate (MB/s) 9.06915</div><div>26 72304.2685 Rate (MB/s) 8.7203</div><div>27 73234.1616 Rate (MB/s) 8.83245</div><div>28 74041.3842 Rate (MB/s) 8.9298</div><div>29 77117.3751 Rate (MB/s) 9.30079</div><div>30 78293.8496 Rate (MB/s) 9.44268</div><div>31 81377.0870 Rate (MB/s) 9.81453</div><div>32 84097.0813 Rate (MB/s) 10.1426</div></font></div><div><br></div></div><div><font face="monospace"><br></font></div>Best,</div><div>Ce</div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Mark Adams <<a 
href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> 于2022年7月12日周二 22:11写道:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">You may get more memory bandwidth with 32 processors vs 1, as Ce mentioned.<div>Depends on the architecture.</div><div>Do you get the whole memory bandwidth on one processor on this machine?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <<a href="mailto:qince168@gmail.com" target="_blank">qince168@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>The linear system is complex-valued. We rewrite it into its real form</div><div>and solve it using FGMRES and an optimal block-diagonal preconditioner. </div><div>We use CG and the AMS preconditioner implemented in HYPRE to solve the</div><div>smaller real linear system arised from applying the block preconditioner.</div><div>The iteration number of FGMRES and CG keep almost constant in all the runs.</div></div></blockquote><div><br></div><div>So those blocks decrease in size as you add more processes?</div><div> </div></div></div></blockquote></div></blockquote><div><br></div><div>I am sorry for the unclear description of the block-diagonal preconditioner.</div><div>Let K be the original complex system matrix, A = [Kr, -Ki; -Ki, -Kr] is the equivalent</div><div>real form of K. Let P = [Kr+Ki, 0; 0, Kr+Ki], it can beproved that P is an optimal</div><div>preconditioner for A. In our implementation, only Kr, Ki and Kr+Ki</div><div>are explicitly stored as MATMPIAIJ. We use MATSHELL to represent A and P.</div><div>We use FGMRES + P to solve Ax=b, and CG + AMS to</div><div>solve (Kr+Ki)y=c. So the block size is never changed.</div></div></div></blockquote><div><br></div><div>Then we have to break down the timings further. 
Then we have to break down the timings further. I suspect AMS is not taking as long, since all other operations scale like N.

  Thanks,

     Matt

> Best,
> Ce

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/