[petsc-users] Strong scaling concerns for PCBDDC with Vector FEM

Stefano Zampini stefano.zampini at gmail.com
Mon Aug 19 03:15:28 CDT 2024


It seems you are using DMPLEX to handle the mesh, correct?
If so, you should configure with --download-parmetis to get a better
domain decomposition, since the default partitioner just splits the cells
into chunks in the order they are stored.
This results in a large number of primal dofs on average (191, from the
output of -ksp_view)
...
Primal    dofs   : 176 204 191
...
which slows down the solver setup.
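
As a rough sketch (assuming your mesh lives in a DM called dm and has not been
distributed yet), selecting the ParMETIS partitioner before distribution looks
like the code below; with PetscPartitionerSetFromOptions() you can also pick it
at run time via -petscpartitioner_type parmetis:

#include <petscdmplex.h>

/* Distribute a DMPlex mesh using the ParMETIS partitioner
   (requires a PETSc build configured with --download-parmetis). */
static PetscErrorCode DistributeWithParMetis(DM dm, DM *dmDist)
{
  PetscPartitioner part;

  PetscFunctionBeginUser;
  PetscCall(DMPlexGetPartitioner(dm, &part));
  PetscCall(PetscPartitionerSetType(part, PETSCPARTITIONERPARMETIS));
  PetscCall(PetscPartitionerSetFromOptions(part)); /* honor -petscpartitioner_* options */
  PetscCall(DMPlexDistribute(dm, 0, NULL, dmDist)); /* overlap = 0; *dmDist is NULL on a single rank */
  PetscFunctionReturn(PETSC_SUCCESS);
}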

Again, you should not use approximate local solvers with BDDC unless you
know what you are doing.
The theory for approximate solvers in BDDC is limited and covers only SPD
problems.
Looking at the output of -log_view, the coarse problem setup (PCBDDCCSet) and
the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.
Also, the 500 applications of the GAMG preconditioner for the Neumann solver
(PCBDDCNeuS) take 129 seconds out of the 400 seconds of total solve time.

PCBDDCTopo             1 1.0 3.1563e-01 1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
PCBDDCLKSP             2 1.0 2.0423e+00 1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
PCBDDCLWor             1 1.0 3.9178e-02 13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCBDDCCorr             1 1.0 6.3981e+01 2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
PCBDDCCSet             1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
PCBDDCCKSP             1 1.0 6.3266e-01 1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
PCBDDCScal             1 1.0 6.8274e-03 1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
PCBDDCDirS          1000 1.0 6.0420e+00 3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
PCBDDCNeuS           500 1.0 1.2901e+02 2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
PCBDDCCoaS           500 1.0 5.8757e-01 1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
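
If you want to try exact local (and coarse) solvers instead, the quickest way
is to replace the ILU/GAMG options with LU plus MUMPS. The sketch below
(assuming a MUMPS-enabled build) just pushes into the options database, before
KSPSetFromOptions() is called, the same strings you would otherwise append to
the mpirun command line:

#include <petscksp.h>

/* Sketch: switch the BDDC Dirichlet, Neumann and coarse solvers to exact
   LU factorizations with MUMPS (the same options can simply be passed on
   the command line instead). */
static PetscErrorCode UseExactLocalSolvers(void)
{
  PetscFunctionBeginUser;
  PetscCall(PetscOptionsInsertString(NULL,
            "-pc_bddc_dirichlet_pc_type lu -pc_bddc_dirichlet_pc_factor_mat_solver_type mumps "
            "-pc_bddc_neumann_pc_type lu -pc_bddc_neumann_pc_factor_mat_solver_type mumps "
            "-pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps"));
  PetscFunctionReturn(PETSC_SUCCESS);
}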

Finally, if I look at the residual history, I see a sharp initial decrease
followed by a very long plateau. This indicates a bad coarse space; as I said
before, there is no hope of finding a suitable coarse space without first
changing the basis of the Nedelec elements, which is done automatically if you
prescribe the discrete gradient operator (see the paper I linked in my
previous message).
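
In PETSc this is done through PCBDDCSetDiscreteGradient(). A minimal sketch
is below; it assumes you have already assembled the discrete gradient matrix G
(an AIJ matrix whose rows correspond to the Nedelec edge dofs and whose columns
to the nodal dofs of the associated H1 space), and the extra arguments (Nedelec
order, field id, ordering/conformity flags) should be checked against the
manual page for your PETSc version:

#include <petscksp.h>

/* Sketch: attach a user-assembled discrete gradient G to PCBDDC so it can
   change the basis of the Nedelec dofs and build a meaningful coarse space.
   Assembling G is discretization-specific and assumed to be done elsewhere. */
static PetscErrorCode AttachDiscreteGradient(KSP ksp, Mat G)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCBDDC));
  PetscCall(PCBDDCSetDiscreteGradient(pc, G,
            2,          /* order of the Nedelec space (2nd order here) */
            0,          /* field id of the Nedelec dofs */
            PETSC_TRUE, /* G given in global ordering */
            PETSC_TRUE  /* conforming mesh */));
  PetscFunctionReturn(PETSC_SUCCESS);
}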



On Sun, Aug 18, 2024 at 00:37 neil liu <liufield at gmail.com> wrote:

> Hi, Stefano,
> Please see the attached for the information with 4 and 8 CPUs for the
> complex matrix.
> I am solving the Maxwell equations (attached) using 2nd-order Nedelec elements
> (two dofs per edge and two dofs per face).
> The computational domain consists of different media, e.g., vacuum and
> substrate (different permittivities).
> A PML is used to truncate the computational domain, absorbing the
> outgoing wave and introducing complex numbers into the matrix.
>
> Thanks a lot for your suggestions. I will try MUMPS.
> For now, I just want to fiddle with Petsc's built-in features to know more
> about it.
> Yes, 5000 is large; with a smaller value, e.g., 30, it converges very slowly.
>
> Thanks a lot.
>
> Have a good weekend.
>
>
> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <stefano.zampini at gmail.com>
> wrote:
>
>> Please include the output of -log_view -ksp_view -ksp_monitor to
>> understand what's happening.
>>
>> Can you please share the equations you are solving so we can provide
>> suggestions on the solver configuration?
>> As I said, solving Nedelec-type discretizations is challenging and not a
>> job for off-the-shelf, black-box solvers.
>>
>> Below are some comments:
>>
>>
>>    - You use a redundant SVD approach for the coarse solve, which can be
>>    inefficient if your coarse space grows. You can use a parallel direct
>>    solver like MUMPS (reconfigure with --download-mumps and use
>>    -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps)
>>    - Why use ILU for the Dirichlet problem and GAMG for the Neumann
>>    problem? With 8 processes and 300K total dofs, you will have around 40K
>>    dofs per process, which is ok for a direct solver like MUMPS
>>    (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann).
>>    With Nedelec dofs and the sparsity pattern they induce,  I believe you can
>>    push to 80K dofs per process with good performance.
>>    - Why a restart of 5000 for GMRES? It is highly inefficient to
>>    orthogonalize such a large set of vectors.
>>
>>
>> On Fri, Aug 16, 2024 at 00:04 neil liu <liufield at gmail.com> wrote:
>>
>>> Dear Petsc developers,
>>>
>>> Thanks for your previous help. Now, the PCBDDC can converge to 1e-8
>>> with,
>>>
>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type bddc
>>> -pc_bddc_coarse_redundant_pc_type svd   -ksp_error_if_not_converged
>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>> -pc_bddc_use_local_mat_graph 0  -pc_bddc_dirichlet_pc_type ilu
>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>
>>> Then I used 2 cases for a strong scaling test. The first case involves only
>>> real numbers (tetra #: 49,152; dof #: 324,224) for the matrix and RHS. The
>>> second case involves complex numbers (tetra #: 95,336; dof #: 611,432) due
>>> to the PML.
>>>
>>> Case 1:
>>> cpu #    Time for 500 KSP steps (s)    Parallel efficiency    PCSetup time (s)
>>>   2               234.7                       --                    3.12
>>>   4               126.6                       0.92                  1.62
>>>   8                84.97                      0.69                  1.26
>>>
>>> However, for Case 2:
>>> cpu #    Time for 500 KSP steps (s)    Parallel efficiency    PCSetup time (s)
>>>   2               584.5                       --                    8.61
>>>   4               376.8                       0.77                  6.56
>>>   8               459.6                       0.31                 66.47
>>> For these 2 cases, I checked the time for PCSetup as an example. It seems
>>> that case 2 with 8 CPUs spends far too much time in PCSetup.
>>> Do you have any ideas about what is going on here?
>>>
>>> Thanks,
>>> Xiaodong
>>>
>>>
>>>
>>
>> --
>> Stefano
>>
>

-- 
Stefano

