<div dir="ltr"><div dir="ltr">On Tue, Aug 20, 2024 at 1:36 PM neil liu <<a href="mailto:liufield@gmail.com">liufield@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi, Matt, <div>I think the time listed here represents the maximum total time across different processors.</div><div><br></div><div>Thanks a lot. </div><div> 2 cpus 4 cpus 8 cpus </div><div><div>Event Count Time (sec) Count Time (sec) Count Time (sec) </div><div> Max Ratio Max Ratio Max Ratio Max Ratio Max Ratio Max Ratio </div></div><div>VecMDot 530 1.0 7.8320e+01 1.0 530 1.0 4.3285e+01 1.1 530 1.0 3.0476e+01 1.1<br></div><div>VecMAXPY 534 1.0 9.2954e+01 1.0 534 1.0 4.8378e+01 1.1 534 1.0 3.0798e+01 1.1</div><div>MatMult 8055 1.0 2.4608e+02 1.0 8103 1.0 1.2663e+02 1.0 8367 1.0 8.2942e+01 1.1</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></blockquote><div><br></div><div>For the number of calls listed.</div><div><br></div><div>1) The number of MatMults goes up, so you should normalize for that, but you still have about 1.6 speedup. However, this is</div><div> all multiplications. Are we sure they have the same size and sparsity?</div><div><br></div><div>2) MAXPY is also 1.6</div><div><br></div><div>3) MDot probably does not see the latency of one node, so again it is not speeding up as you might want.</div><div><br></div><div>This looks like you are using a single node with 2, 4, and 8 procs. The memory bandwidth is exhausted sometime before 8 procs</div><div>(maybe 6), so you cease to see speedup. 
You can check this by running `make streams` on the node.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Tue, Aug 20, 2024 at 1:10 PM neil liu <<a href="mailto:liufield@gmail.com" target="_blank">liufield@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Thanks a lot for your explanation, Stefano. Very helpful. <div>Yes. I am using DMPlex to read a tetrahedral mesh from Gmsh. With ParMETIS, the scaling performance is improved a lot. <br></div><div>I will read your paper about how to change the basis for Nedelec elements. </div><div><br></div><div>cpu # time for 500 ksp steps (s) parallel efficiency</div><div>2 546</div><div>4 224 120%</div><div>8 170 80% </div><div>These results are much better than the previous attempt. Then I checked the time spent in several PETSc built-in functions for the ksp solver. </div><div><br></div><div>Functions time(2 cpus) time(4 cpus) time(8 cpus)</div><div>VecMDot 78.32 43.28 30.47<br><div>VecMAXPY 92.95 48.37 30.798 </div></div><div>MatMult 246.08 126.63 82.94</div><div><br></div><div>It seems that the scaling from 4 to 8 cpus is not as good as from 2 to 4 cpus.</div><div>Am I missing something? 
</div></div></div></blockquote><div><br></div><div>Did you normalize by the number of calls?</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>Thanks a lot,</div><div><br></div><div>Xiaodong </div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">It seems you are using DMPLEX to handle the mesh, correct?<div>If so, you should configure using --download-parmetis to have a better domain decomposition since the default one just splits the cells in chunks as they are ordered.</div><div>This results in a large number of primal dofs on average (191, from the output of ksp_view)</div><div>...</div><div>Primal dofs : 176 204 191<br></div><div>...</div><div>that slows down the solver setup.</div><div><br></div><div>Again, you should not use approximate local solvers with BDDC unless you know what you are doing.</div><div>The theory for approximate solvers for BDDC is limited and covers only SPD problems.</div><div>Looking at the output of log_view, the coarse problem setup (PCBDDCCSet) and the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.</div><div>Also, the 500 applications of the GAMG preconditioner for the Neumann solver (PCBDDCNeuS) take 129 seconds out of the 400 seconds of the total solve time.</div><div><br></div><div>PCBDDCTopo 1 1.0 3.1563e-01 1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01 0 0 1 0 2 0 0 1 0 2 19<br>PCBDDCLKSP 2 1.0 2.0423e+00 1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 3378<br>PCBDDCLWor 1 1.0 3.9178e-02 13.4 
0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>PCBDDCCorr 1 1.0 6.3981e+01 2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11 0 0 0 11 11 0 0 0 8900<br>PCBDDCCSet 1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01 2 0 1 0 3 2 0 1 0 3 0<br>PCBDDCCKSP 1 1.0 6.3266e-01 1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01 0 0 0 0 1 0 0 0 0 1 0<br>PCBDDCScal 1 1.0 6.8274e-03 1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00 0 0 0 0 0 0 0 0 0 0 894<br>PCBDDCDirS 1000 1.0 6.0420e+00 3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 2995<br>PCBDDCNeuS 500 1.0 1.2901e+02 2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12 0 0 0 22 12 0 0 0 4828<br>PCBDDCCoaS 500 1.0 5.8757e-01 1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02 0 0 17 0 28 0 0 17 0 31 14901<br></div><div><br></div><div>Finally, if I look at the residual history, I see a sharp decrease and a very long plateau. This indicates a bad coarse space; as I said before, there's no hope of finding a suitable coarse space without first changing the basis of the Nedelec elements, which is done automatically if you prescribe the discrete gradient operator (see the paper I have linked to in my previous communication).</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Aug 18, 2024 at 12:37 AM neil liu <<a href="mailto:liufield@gmail.com" target="_blank">liufield@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi, Stefano, <div>Please see the attachment for the information with 4 and 8 CPUs for the complex matrix.</div><div>I am solving Maxwell's equations (attached) using 2nd-order Nedelec elements (two dofs per edge and two dofs per face).</div><div>The computational domain consists of different media, e.g., vacuum and substrate (different permittivity).</div><div>The PML is used to truncate the computational domain, absorbing the 
outgoing wave and introducing complex numbers for the matrix.</div><div><br></div><div>Thanks a lot for your suggestions. I will try MUMPS. </div><div>For now, I just want to fiddle with PETSc's built-in features to learn more about it. </div><div>Yes, 5000 is large, but a smaller value, e.g., 30, converges very slowly. </div><div><br></div><div>Thanks a lot. </div><div><br></div><div>Have a good weekend. </div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Please include the output of -log_view -ksp_view -ksp_monitor to understand what's happening.<div><br></div><div><div>Can you please share the equations you are solving so we can provide suggestions on the solver configuration?</div><div>As I said, solving for Nedelec-type discretizations is challenging, and not for off-the-shelf, black-box solvers.</div><div><br></div><div>Below are some comments:</div><div><br></div><div><ul><li>You use a redundant SVD approach for the coarse solve, which can be inefficient if your coarse space grows. You can use a parallel direct solver like MUMPS (reconfigure with --download-mumps and use -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps).</li><li>Why use ILU for the Dirichlet problem and GAMG for the Neumann problem? With 8 processes and 300K total dofs, you will have around 40K dofs per process, which is ok for a direct solver like MUMPS (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann). With Nedelec dofs and the sparsity pattern they induce, I believe you can push to 80K dofs per process with good performance.</li><li>Why use a restart of 5000 for GMRES? 
It is highly inefficient to re-orthogonalize such a large set of vectors.</li></ul></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 16, 2024 at 12:04 AM neil liu <<a href="mailto:liufield@gmail.com" target="_blank">liufield@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Dear PETSc developers, <div><br></div><div>Thanks for your previous help. Now PCBDDC converges to 1e-8 with the following options: </div><div><br></div><div><div>petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type bddc -pc_bddc_coarse_redundant_pc_type svd -ksp_error_if_not_converged -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view -pc_bddc_use_local_mat_graph 0 -pc_bddc_dirichlet_pc_type ilu -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10 -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view</div></div><div><br></div><div>Then I ran a strong scaling test on two cases. The first case involves only real numbers (tetra #: 49,152; dof #: 324,224) for the matrix and rhs. The second case involves complex numbers (tetra #: 95,336; dof #: 611,432) due to the PML. </div><div><br></div><div>Case 1: </div><div>cpu # Time for 500 ksp steps (s) Parallel efficiency PCsetup time(s)</div><div> 2 234.7 3.12</div><div> 4 126.6 0.92 1.62</div><div> 8 84.97 0.69 1.26</div><div>However, for Case 2: </div><div><div>cpu # Time for 500 ksp steps (s) Parallel efficiency PCsetup time(s)</div><div> 2 584.5 8.61</div><div> 4 376.8 0.77 6.56</div><div> 8 459.6 0.31 66.47</div></div><div>For these 2 cases, I checked the time for PCsetup as an example. It seems that Case 2 with 8 cpus spends too much time in PCsetup.</div><div>Do you have any ideas about what is going on here? 
</div><div><br></div><div>Thanks,</div><div>Xiaodong </div><div><br></div><div><br></div></div></div></div></div></div>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature">Stefano</div>
</blockquote></div>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature">Stefano</div>
</blockquote></div>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eQty5R8qGgZBZNodHW90OVmUU1tsyjzmP4NkXVvtCk8QMzIM2XIAQEx4RrA_F814zU_1P_RsayqlJ6chGK_E$" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
</blockquote></div>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eQty5R8qGgZBZNodHW90OVmUU1tsyjzmP4NkXVvtCk8QMzIM2XIAQEx4RrA_F814zU_1P_RsayqlJ6chGK_E$" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
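<div><br></div><div>The per-call normalization discussed in this thread can be sketched as follows. This is a minimal illustration, assuming the call counts and maximum times from the -log_view excerpt quoted above; the LOG dictionary and the function names are hypothetical helpers and not part of PETSc.</div>
<pre>
```python
# Call counts and max times (seconds) copied from the -log_view excerpt in this thread.
# Layout: event name maps to {number of processes: (call count, max time in seconds)}.
LOG = {
    "VecMDot": {2: (530, 78.320), 4: (530, 43.285), 8: (530, 30.476)},
    "VecMAXPY": {2: (534, 92.954), 4: (534, 48.378), 8: (534, 30.798)},
    "MatMult": {2: (8055, 246.08), 4: (8103, 126.63), 8: (8367, 82.942)},
}


def time_per_call(event, nprocs):
    """Max time divided by the number of calls for one event."""
    count, seconds = LOG[event][nprocs]
    return seconds / count


def speedup(event, p_from, p_to):
    """Ratio of per-call times when going from p_from to p_to processes."""
    return time_per_call(event, p_from) / time_per_call(event, p_to)


def relative_efficiency(event, p_from, p_to):
    """Speedup divided by the increase in process count (1.0 is ideal)."""
    return speedup(event, p_from, p_to) * p_from / p_to


# Print the normalized speedup for each doubling of the process count.
for event in LOG:
    for p_from, p_to in [(2, 4), (4, 8)]:
        s = speedup(event, p_from, p_to)
        e = relative_efficiency(event, p_from, p_to)
        print(f"{event:9s} {p_from} to {p_to}: speedup {s:.2f}, efficiency {e:.2f}")
```
</pre>
<div>With these numbers, the speedup from 4 to 8 processes comes out at roughly 1.6 for MatMult and VecMAXPY and lower for VecMDot, in line with the estimates given earlier in the thread.</div>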