[petsc-users] Strong scaling concerns for PCBDDC with Vector FEM
neil liu
liufield at gmail.com
Tue Aug 20 13:31:21 CDT 2024
Thanks a lot for this explanation, Matt. I will explore whether the matrix
has the same size and sparsity.
On Tue, Aug 20, 2024 at 1:45 PM Matthew Knepley <knepley at gmail.com> wrote:
> On Tue, Aug 20, 2024 at 1:36 PM neil liu <liufield at gmail.com> wrote:
>
>> Hi, Matt,
>> I think the time listed here represents the maximum total time across
>> different processors.
>>
>> Thanks a lot.
>>                    2 cpus                                 4 cpus                                 8 cpus
>> Event       Count (Max/Ratio)  Time, sec (Max/Ratio)  Count (Max/Ratio)  Time, sec (Max/Ratio)  Count (Max/Ratio)  Time, sec (Max/Ratio)
>> VecMDot        530 / 1.0       7.8320e+01 / 1.0          530 / 1.0       4.3285e+01 / 1.1          530 / 1.0       3.0476e+01 / 1.1
>> VecMAXPY       534 / 1.0       9.2954e+01 / 1.0          534 / 1.0       4.8378e+01 / 1.1          534 / 1.0       3.0798e+01 / 1.1
>> MatMult       8055 / 1.0       2.4608e+02 / 1.0         8103 / 1.0       1.2663e+02 / 1.0         8367 / 1.0       8.2942e+01 / 1.1
>>
>
> Regarding the number of calls listed:
>
> 1) The number of MatMult calls goes up, so you should normalize for that,
> but you still only have about a 1.6x speedup (see the quick check below). However,
> this is all of the multiplications. Are we sure they have the same size and sparsity?
>
> 2) MAXPY is also 1.6
>
> 3) MDot probably does not see the latency of one node, so again it is not
> speeding up as you might want.
>
> This looks like you are using a single node with 2, 4, and 8 procs. The
> memory bandwidth is exhausted sometime before 8 procs
> (maybe 6), so you cease to see speedup. You can check this by running
> `make streams` on the node.
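>
> As a quick check (using the Max times and call counts from the table
> above), the per-call MatMult time is
>
>   246.08 / 8055 ~ 3.06e-2 s  (2 procs)
>   126.63 / 8103 ~ 1.56e-2 s  (4 procs)
>    82.94 / 8367 ~ 9.91e-3 s  (8 procs)
>
> so going from 2 to 4 procs gives roughly a 2.0x per-call speedup, while
> going from 4 to 8 gives only about 1.6x.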
>
> Thanks,
>
> Matt
>
>
>> On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufield at gmail.com> wrote:
>>>
>>>> Thanks a lot for your explanation, Stefano. Very helpful.
>>>> Yes, I am using DMPlex to read a tetrahedral mesh from Gmsh. With
>>>> ParMETIS, the scaling performance is improved a lot.
>>>> I will read your paper about how to change the basis for Nedelec
>>>> elements.
>>>>
>>>> cpu #   Time for 500 KSP steps (s)   Parallel efficiency
>>>> 2       546                          --
>>>> 4       224                          120%
>>>> 8       170                          80%
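>>>>
>>>> (The parallel efficiency here is relative to the 2-CPU run, i.e.,
>>>> E(p) = 2*T(2)/(p*T(p)); e.g., 2*546/(4*224) ~ 1.22 and
>>>> 2*546/(8*170) ~ 0.80.)
>>>>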
>>>> These results are much better than the previous attempt. Then I checked
>>>> the time spent in several PETSc built-in functions during the KSP solve.
>>>>
>>>> Function    Time (s), 2 cpus   Time (s), 4 cpus   Time (s), 8 cpus
>>>> VecMDot     78.32              43.28              30.47
>>>> VecMAXPY    92.95              48.37              30.798
>>>> MatMult     246.08             126.63             82.94
>>>>
>>>> It seems that from 4 CPUs to 8 CPUs the scaling is not as good as from
>>>> 2 to 4 CPUs.
>>>> Am I missing something?
>>>>
>>>
>>> Did you normalize by the number of calls?
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>
>>>> Thanks a lot,
>>>>
>>>> Xiaodong
>>>>
>>>>
>>>> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <
>>>> stefano.zampini at gmail.com> wrote:
>>>>
>>>>> It seems you are using DMPLEX to handle the mesh, correct?
>>>>> If so, you should configure using --download-parmetis to have a better
>>>>> domain decomposition since the default one just splits the cells in chunks
>>>>> as they are ordered.
>>>>> This results in a large number of primal dofs (191 on average, from
>>>>> the -ksp_view output)
>>>>> ...
>>>>> Primal dofs : 176 204 191
>>>>> ...
>>>>> that slows down the solver setup.
>>>>>
>>>>> Again, you should not use approximate local solvers with BDDC unless
>>>>> you know what you are doing.
>>>>> The theory for approximate solvers in BDDC is limited and covers only
>>>>> SPD problems.
>>>>> Looking at the output of -log_view, the coarse problem setup (PCBDDCCSet)
>>>>> and the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds,
>>>>> respectively.
>>>>> Also, the 500 applications of the GAMG preconditioner for the Neumann
>>>>> solver (PCBDDCNeuS) take 129 seconds out of the 400 seconds of total
>>>>> solve time.
>>>>>
>>>>> PCBDDCTopo    1 1.0 3.1563e-01 1.0    1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>>>>> PCBDDCLKSP    2 1.0 2.0423e+00 1.7    9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>>>>> PCBDDCLWor    1 1.0 3.9178e-02 13.4   0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> PCBDDCCorr    1 1.0 6.3981e+01 2.2    8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>>>>> PCBDDCCSet    1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>>>>> PCBDDCCKSP    1 1.0 6.3266e-01 1.3    0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>> PCBDDCScal    1 1.0 6.8274e-03 1.3    1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>>>>> PCBDDCDirS 1000 1.0 6.0420e+00 3.5    6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>>>>> PCBDDCNeuS  500 1.0 1.2901e+02 2.1    8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>>>>> PCBDDCCoaS  500 1.0 5.8757e-01 1.8    1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>>>>
>>>>> Finally, if I look at the residual history, I see a sharp decrease and
>>>>> a very long plateau. This indicates a bad coarse space; as I said before,
>>>>> there's no hope of finding a suitable coarse space without first changing
>>>>> the basis of the Nedelec elements, which is done automatically if you
>>>>> prescribe the discrete gradient operator (see the paper I have linked to in
>>>>> my previous communication).
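>>>>>
>>>>> For reference, here is a minimal sketch of how a discrete gradient can
>>>>> be passed to PCBDDC. It assumes you assemble the gradient matrix G
>>>>> yourself (rows indexed by the Nedelec edge/face dofs, columns by the
>>>>> nodal H1 dofs); the helper name is just illustrative, and the trailing
>>>>> arguments (Nedelec order, field id, global ordering, conforming mesh)
>>>>> should be checked against the PCBDDCSetDiscreteGradient manual page:
>>>>>
>>>>> #include <petscksp.h>
>>>>>
>>>>> /* Hypothetical helper: attach a user-assembled discrete gradient G to
>>>>>    the BDDC preconditioner of an existing KSP, so PCBDDC can change the
>>>>>    Nedelec basis internally. */
>>>>> static PetscErrorCode AttachDiscreteGradient(KSP ksp, Mat G)
>>>>> {
>>>>>   PC pc;
>>>>>
>>>>>   PetscFunctionBeginUser;
>>>>>   PetscCall(KSPGetPC(ksp, &pc));
>>>>>   PetscCall(PCSetType(pc, PCBDDC));
>>>>>   /* G, Nedelec order (2 here), field id, global row ordering, conforming mesh */
>>>>>   PetscCall(PCBDDCSetDiscreteGradient(pc, G, 2, 0, PETSC_TRUE, PETSC_TRUE));
>>>>>   PetscFunctionReturn(PETSC_SUCCESS);
>>>>> }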
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 18, 2024 at 12:37 AM neil liu <liufield at gmail.com> wrote:
>>>>>
>>>>>> Hi, Stefano,
>>>>>> Please see the attached for the information with 4 and 8 CPUs for the
>>>>>> complex matrix.
>>>>>> I am solving the Maxwell equations (attached) using 2nd-order Nedelec
>>>>>> elements (two dofs on each edge and two dofs on each face).
>>>>>> The computational domain consists of different media, e.g., vacuum and
>>>>>> substrate (with different permittivities).
>>>>>> A PML is used to truncate the computational domain; it absorbs the
>>>>>> outgoing waves and introduces complex numbers into the matrix.
>>>>>>
>>>>>> Thanks a lot for your suggestions. I will try MUMPS.
>>>>>> For now, I just want to experiment with PETSc's built-in features to
>>>>>> learn more about them.
>>>>>> Yes, 5000 is large; a smaller value, e.g., 30, converges very slowly.
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>> Have a good weekend.
>>>>>>
>>>>>>
>>>>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <
>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>
>>>>>>> Please include the output of -log_view -ksp_view -ksp_monitor to
>>>>>>> understand what's happening.
>>>>>>>
>>>>>>> Can you please share the equations you are solving so we can provide
>>>>>>> suggestions on the solver configuration?
>>>>>>> As I said, solving Nedelec-type discretizations is challenging and not
>>>>>>> a job for off-the-shelf, black-box solvers.
>>>>>>>
>>>>>>> Below are some comments:
>>>>>>>
>>>>>>>
>>>>>>> - You use a redundant SVD approach for the coarse solve, which
>>>>>>> can be inefficient if your coarse space grows. You can use a parallel
>>>>>>> direct solver like MUMPS (reconfigure with --download-mumps and use
>>>>>>> -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps)
>>>>>>> - Why use ILU for the Dirichlet problem and GAMG for the Neumann
>>>>>>> problem? With 8 processes and 300K total dofs, you will have around 40K
>>>>>>> dofs per process, which is ok for a direct solver like MUMPS
>>>>>>> (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann).
>>>>>>> With Nedelec dofs and the sparsity pattern they induce, I believe you can
>>>>>>> push to 80K dofs per process with good performance (see the sketch after this list).
>>>>>>> - Why a restart of 5000 for GMRES? It is highly inefficient to
>>>>>>> re-orthogonalize such a large set of vectors.
>>>>>>>
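>>>>>>> Putting the MUMPS suggestions above together, here is a sketch of how
>>>>>>> they could be hard-wired in code before KSPSetFromOptions() is called
>>>>>>> (it assumes PETSc was configured with --download-mumps, and the helper
>>>>>>> name is just illustrative; passing the same options on the command line
>>>>>>> works equally well):
>>>>>>>
>>>>>>> #include <petscksp.h>
>>>>>>>
>>>>>>> /* Sketch: request MUMPS LU for the BDDC coarse, Dirichlet, and Neumann
>>>>>>>    solvers by populating the options database programmatically. */
>>>>>>> static PetscErrorCode UseMumpsForBDDCSolvers(void)
>>>>>>> {
>>>>>>>   PetscFunctionBeginUser;
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_coarse_pc_type", "lu"));
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_coarse_pc_factor_mat_solver_type", "mumps"));
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_dirichlet_pc_type", "lu"));
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_dirichlet_pc_factor_mat_solver_type", "mumps"));
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_neumann_pc_type", "lu"));
>>>>>>>   PetscCall(PetscOptionsSetValue(NULL, "-pc_bddc_neumann_pc_factor_mat_solver_type", "mumps"));
>>>>>>>   PetscFunctionReturn(PETSC_SUCCESS);
>>>>>>> }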
>>>>>>>
>>>>>>> On Fri, Aug 16, 2024 at 12:04 AM neil liu <liufield at gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Petsc developers,
>>>>>>>>
>>>>>>>> Thanks for your previous help. Now, PCBDDC can converge to 1e-8 with:
>>>>>>>>
>>>>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type
>>>>>>>> bddc -pc_bddc_coarse_redundant_pc_type svd -ksp_error_if_not_converged
>>>>>>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>>>>>> -pc_bddc_use_local_mat_graph 0 -pc_bddc_dirichlet_pc_type ilu
>>>>>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>>>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>>>>
>>>>>>>> Then I used 2 cases for a strong scaling test. One case involves only
>>>>>>>> real numbers (tetra #: 49,152; dof #: 324,224) in the matrix and rhs.
>>>>>>>> The 2nd case involves complex numbers (tetra #: 95,336; dof #: 611,432)
>>>>>>>> due to the PML.
>>>>>>>>
>>>>>>>> Case 1:
>>>>>>>> cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>>> 2       234.7                        --                    3.12
>>>>>>>> 4       126.6                        0.92                  1.62
>>>>>>>> 8       84.97                        0.69                  1.26
>>>>>>>>
>>>>>>>> However, for Case 2:
>>>>>>>> cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>>> 2       584.5                        --                    8.61
>>>>>>>> 4       376.8                        0.77                  6.56
>>>>>>>> 8       459.6                        0.31                  66.47
>>>>>>>> For these 2 cases, I checked the time for PCSetUp as an example. It
>>>>>>>> seems that with 8 CPUs, Case 2 spends too much time in PCSetUp.
>>>>>>>> Do you have any ideas about what is going on here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Xiaodong
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Stefano
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Stefano
>>>>>
>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>