[petsc-users] Strong scaling concerns for PCBDDC with Vector FEM

Tue Aug 20 12:01:22 CDT 2024

Thanks a lot for your explanation, Stefano. Very helpful.
Yes. I am using dmplex to read a tetrahdra mesh from gmsh. With parmetis,
the scaling performance is improved a lot.
I will read your paper about how to change the basis for Nedelec elements.

cpu #    time for 500 ksp steps  (s)           parallel efficiency
2           546
4           224                                               120%
8           170                                               80%
This results are much better than previous attempt. Then I checked the time
spent by several Petsc built-in functions for the ksp solver.

Functions          time(2 cpus)     time(4 cpus)      time(8 cpus)
VecMDot           78.32                43.28                30.47
VecMAXPY       92.95                48.37                30.798
MatMult          246.08               126.63                82.94

It seems from cpu 4 to cpu 8, the scaling is not as good as from cpu 2 to
cpu 4.
Am I  missing something?

Thanks a lot,

Xiaodong

On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:

> It seems you are using DMPLEX to handle the mesh, correct?
> If so, you should configure using --download-parmetis to have a better
> domain decomposition since the default one just splits the cells in chunks
> as they are ordered.
> This results in a large number of primal dofs on average (191, from the
> output of ksp_view)
> ...
> Primal    dofs   : 176 204 191
> ...
> that slows down the solver setup.
>
> Again, you should not use approximate local solvers with BDDC unless you
> know what you are doing.
> The theory for approximate solvers for BDDC is small and only for SPD
> problems.
> Looking at the output of log_view, coarse problem setup (PCBDDCCSet), and
> primal functions setup (PCBDDCCorr) costs 35 + 63 seconds, respectively.
> Also, the 500 application of the GAMG preconditioner for the Neumann
> solver (PCBDDCNeuS) takes 129 seconds out of the 400 seconds of the total
> solve time.
>
> PCBDDCTopo             1 1.0 3.1563e-01 1.0 1.11e+06 3.4 1.6e+03 3.9e+04
> 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
> PCBDDCLKSP             2 1.0 2.0423e+00 1.7 9.31e+08 1.2 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
> PCBDDCLWor             1 1.0 3.9178e-02 13.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCBDDCCorr             1 1.0 6.3981e+01 2.2 8.16e+10 1.6 0.0e+00 0.0e+00
> 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
> PCBDDCCSet             1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03
> 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
> PCBDDCCKSP             1 1.0 6.3266e-01 1.3 0.00e+00 0.0 3.3e+02 1.1e+02
> 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
> PCBDDCScal             1 1.0 6.8274e-03 1.3 1.11e+06 3.4 5.6e+01 3.2e+05
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
> PCBDDCDirS          1000 1.0 6.0420e+00 3.5 6.64e+09 5.4 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
> PCBDDCNeuS           500 1.0 1.2901e+02 2.1 8.28e+10 1.2 0.0e+00 0.0e+00
> 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
> PCBDDCCoaS           500 1.0 5.8757e-01 1.8 1.09e+09 1.0 2.8e+04 7.4e+02
> 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>
> Finally, if I look at the residual history, I see a sharp decrease and a
> very long plateau. This indicates a bad coarse space; as I said before,
> there's no hope of finding a suitable coarse space without first changing
> the basis of the Nedelec elements, which is done automatically if you
> prescribe the discrete gradient operator (see the paper I have linked to in
> my previous communication).
>
>
>
> Il giorno dom 18 ago 2024 alle ore 00:37 neil liu <liufield at gmail.com> ha
> scritto:
>
>> Hi, Stefano,
>> Please see the attached for the information with 4 and 8 CPUs for the
>> complex matrix.
>> I am solving Maxwell equations (Attahced) using 2nd-order Nedelec
>> elements (two dofs each edge, and two dofs each face).
>> The computational domain consists of different mediums, e.g., vacuum and
>> substrate (different permitivity).
>> The PML is used to truncate the computational domain, absorbing the
>> outgoing wave and introducing complex numbers for the matrix.
>>
>> Thanks a lot for your suggestions. I will try MUMPS.
>> For now, I just want to fiddle with Petsc's built-in features to know
>> more about it.
>> Yes. 5000 is larger. Smaller value. e.g., 30, converges very slowly.
>>
>> Thanks a lot.
>>
>> Have a good weekend.
>>
>>
>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <
>> stefano.zampini at gmail.com> wrote:
>>
>>> Please include the output of -log_view -ksp_view -ksp_monitor to
>>> understand what's happening.
>>>
>>> Can you please share the equations you are solving so we can provide
>>> suggestions on the solver configuration?
>>> As I said, solving for Nedelec-type discretizations is challenging, and
>>> not for off-the-shelf, black box solvers
>>>
>>> Below are some comments:
>>>
>>>
>>>    - You use a redundant SVD approach for the coarse solve, which can
>>>    be inefficient if your coarse space grows. You can use a parallel direct
>>>    solver like MUMPS (reconfigure with --download-mumps and use
>>>    -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps)
>>>    - Why use ILU for the Dirichlet problem and GAMG for the Neumann
>>>    problem? With 8 processes and 300K total dofs, you will have around 40K
>>>    dofs per process, which is ok for a direct solver like MUMPS
>>>    (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann).
>>>    With Nedelec dofs and the sparsity pattern they induce,  I believe you can
>>>    push to 80K dofs per process with good performance.
>>>    - Why 5000 of restart for GMRES? It is highly inefficient to
>>>    re-orthogonalize such a large set of vectors.
>>>
>>>
>>> Il giorno ven 16 ago 2024 alle ore 00:04 neil liu <liufield at gmail.com>
>>> ha scritto:
>>>
>>>> Dear Petsc developers,
>>>>
>>>> Thanks for your previous help. Now, the PCBDDC can converge to 1e-8
>>>> with,
>>>>
>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type bddc
>>>> -pc_bddc_coarse_redundant_pc_type svd   -ksp_error_if_not_converged
>>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>> -pc_bddc_use_local_mat_graph 0  -pc_bddc_dirichlet_pc_type ilu
>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>
>>>> Then I used 2 cases for strong scaling test. One case only involves
>>>> real numbers (tetra #: 49,152; dof #: 324, 224 ) for matrix and rhs. The
>>>> 2nd case involves complex numbers  (tetra #: 95,336; dof #: 611,432)  due
>>>> to PML.
>>>>
>>>> Case 1:
>>>> cpu #                Time for 500 ksp steps (s)    Parallel efficiency
>>>>    PCsetup time(s)
>>>>           2              234.7
>>>>                         3.12
>>>>           4              126.6
>>>>  0.92                      1.62
>>>>           8              84.97
>>>>  0.69                      1.26
>>>> However for Case 2,
>>>> cpu #                Time for 500 ksp steps (s)    Parallel efficiency
>>>>  PCsetup time(s)
>>>>           2              584.5
>>>>                             8.61
>>>>           4              376.8                                    0.77
>>>>                          6.56
>>>>           8              459.6                                    0.31
>>>>                        66.47
>>>> For these 2 cases, I checked the time for PCsetup as an example. It
>>>> seems 8 cpus for case 2 used too much time on PCsetup.
>>>> Do you have any ideas about what is going on here?
>>>>
>>>> Thanks,
>>>> Xiaodong
>>>>
>>>>
>>>>
>>>
>>> --
>>> Stefano
>>>
>>
>
> --
> Stefano
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240820/bc85790d/attachment.html>