[petsc-users] Strong scaling concerns for PCBDDC with Vector FEM

Matthew Knepley knepley at gmail.com
Tue Aug 20 12:45:36 CDT 2024


On Tue, Aug 20, 2024 at 1:36 PM neil liu <liufield at gmail.com> wrote:

> Hi, Matt,
> I think the time listed here represents the maximum total time across
> different processors.
>
> Thanks a lot.
> Event       Count (2/4/8 cpus)       Max time, sec (2/4/8 cpus)
> VecMDot     530  / 530  / 530        7.8320e+01 / 4.3285e+01 / 3.0476e+01
> VecMAXPY    534  / 534  / 534        9.2954e+01 / 4.8378e+01 / 3.0798e+01
> MatMult     8055 / 8103 / 8367       2.4608e+02 / 1.2663e+02 / 8.2942e+01
> (max/min ratios across ranks are 1.0-1.1 in every case)
>

Looking at the call counts and times listed:

1) The number of MatMults goes up, so you should normalize for that, but even
    then you only get about a 1.6x speedup from 4 to 8 procs (see the quick
    normalization below). However, this counts all multiplications. Are we sure
    they all have the same size and sparsity?

2) VecMAXPY also shows only about a 1.6x speedup.

3) VecMDot probably does not see the latency of a single node, so again it is
not speeding up as much as you might want.
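
For point 1, a quick normalization using the numbers above (a rough
back-of-the-envelope check, taking the max times as reported):

  time per MatMult:  2 procs: 2.4608e+02 / 8055 ~ 3.06e-02 s
                     4 procs: 1.2663e+02 / 8103 ~ 1.56e-02 s
                     8 procs: 8.2942e+01 / 8367 ~ 9.91e-03 s

so the per-call speedup is about 1.96x from 2 to 4 procs, but only about 1.58x
from 4 to 8 procs.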

This looks like you are using a single node with 2, 4, and 8 procs. The memory
bandwidth is exhausted sometime before 8 procs (maybe 6), so you cease to see
speedup. You can check this by running `make streams` on the node.
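
For example (assuming a source build of PETSc; the arch name below is taken
from the path in your command line, and NPMAX sets the largest number of MPI
processes to benchmark):

  cd $PETSC_DIR
  make PETSC_ARCH=arch-linux-c-opt streams NPMAX=8

If the reported bandwidth flattens out before 8 processes, that explains the
loss of speedup above.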

  Thanks,

     Matt


> On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufield at gmail.com> wrote:
>>
>>> Thanks a lot for your explanation, Stefano. Very helpful.
>>> Yes, I am using DMPlex to read a tetrahedral mesh from Gmsh. With
>>> ParMETIS, the scaling performance is improved a lot.
>>> I will read your paper about how to change the basis for Nedelec
>>> elements.
>>>
>>> cpu #   time for 500 KSP steps (s)   parallel efficiency
>>> 2       546
>>> 4       224                          120%
>>> 8       170                          80%
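>>> (Parallel efficiency here is relative to the 2-CPU run: E_p = 2*T_2 / (p*T_p);
>>> e.g., E_4 = 2*546 / (4*224) ~ 1.22, i.e., about 120%.)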
>>> These results are much better than the previous attempt. Then I checked the
>>> time spent in several PETSc built-in functions during the KSP solve.
>>>
>>> Function      time, 2 cpus (s)   time, 4 cpus (s)   time, 8 cpus (s)
>>> VecMDot       78.32              43.28              30.47
>>> VecMAXPY      92.95              48.37              30.798
>>> MatMult       246.08             126.63             82.94
>>>
>>> It seems that the scaling from 4 CPUs to 8 CPUs is not as good as from 2 to 4.
>>> Am I missing something?
>>>
>>
>> Did you normalize by the number of calls?
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thanks a lot,
>>>
>>> Xiaodong
>>>
>>>
>>> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <
>>> stefano.zampini at gmail.com> wrote:
>>>
>>>> It seems you are using DMPlex to handle the mesh, correct?
>>>> If so, you should configure using --download-parmetis to get a better
>>>> domain decomposition, since the default one just splits the cells into
>>>> chunks as they are ordered.
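>>>>
>>>> For example, something like this (reusing whatever configure options you
>>>> already pass; note that ParMETIS also needs METIS):
>>>>
>>>>   ./configure <your existing options> --download-metis --download-parmetis
>>>>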
>>>> That chunk-wise decomposition results in a large number of primal dofs on
>>>> average (191, from the -ksp_view output),
>>>> ...
>>>> Primal    dofs   : 176 204 191
>>>> ...
>>>> which slows down the solver setup.
>>>>
>>>> Again, you should not use approximate local solvers with BDDC unless
>>>> you know what you are doing: the theory for approximate solvers in BDDC is
>>>> limited and covers only SPD problems.
>>>> Looking at the -log_view output, the coarse problem setup (PCBDDCCSet) and
>>>> the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.
>>>> Also, the 500 applications of the GAMG preconditioner for the Neumann
>>>> solver (PCBDDCNeuS) take 129 seconds out of the 400 seconds of total solve
>>>> time.
>>>>
>>>> PCBDDCTopo             1 1.0 3.1563e-01 1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>>>> PCBDDCLKSP             2 1.0 2.0423e+00 1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>>>> PCBDDCLWor             1 1.0 3.9178e-02 13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> PCBDDCCorr             1 1.0 6.3981e+01 2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>>>> PCBDDCCSet             1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>>>> PCBDDCCKSP             1 1.0 6.3266e-01 1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>>>> PCBDDCScal             1 1.0 6.8274e-03 1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>>>> PCBDDCDirS          1000 1.0 6.0420e+00 3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>>>> PCBDDCNeuS           500 1.0 1.2901e+02 2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>>>> PCBDDCCoaS           500 1.0 5.8757e-01 1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>>>
>>>> Finally, if I look at the residual history, I see a sharp decrease and
>>>> a very long plateau. This indicates a bad coarse space; as I said before,
>>>> there's no hope of finding a suitable coarse space without first changing
>>>> the basis of the Nedelec elements, which is done automatically if you
>>>> prescribe the discrete gradient operator (see the paper I have linked to in
>>>> my previous communication).
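>>>>
>>>> For reference, the relevant call on the PETSc side is
>>>> PCBDDCSetDiscreteGradient(). A minimal sketch in C (illustration only: it
>>>> assumes ksp is your existing solver, that the discrete gradient G has
>>>> already been assembled as an AIJ matrix with rows following the global
>>>> Nedelec/edge numbering and columns the global nodal numbering, and that the
>>>> argument values below match your discretization):
>>>>
>>>>   Mat G;  /* discrete gradient: rows = Nedelec (edge) dofs, cols = nodal dofs */
>>>>   PC  pc;
>>>>   /* ... assemble G ... */
>>>>   PetscCall(KSPGetPC(ksp, &pc));
>>>>   PetscCall(PCSetType(pc, PCBDDC));
>>>>   /* order = 2 for second-order Nedelec, field = 0 for a single-field problem,
>>>>      global = PETSC_TRUE since G is given in global numberings,
>>>>      conforming = PETSC_TRUE for a conforming tetrahedral mesh */
>>>>   PetscCall(PCBDDCSetDiscreteGradient(pc, G, 2, 0, PETSC_TRUE, PETSC_TRUE));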
>>>>
>>>>
>>>>
>>>> On Sun, Aug 18, 2024 at 00:37 neil liu <liufield at gmail.com> wrote:
>>>>
>>>>> Hi, Stefano,
>>>>> Please see the attached for the information with 4 and 8 CPUs for the
>>>>> complex matrix.
>>>>> I am solving Maxwell's equations (attached) using 2nd-order Nedelec
>>>>> elements (two dofs per edge and two dofs per face).
>>>>> The computational domain consists of different media, e.g.,
>>>>> vacuum and substrate (different permittivity).
>>>>> The PML is used to truncate the computational domain, absorbing the
>>>>> outgoing wave and introducing complex numbers into the matrix.
>>>>>
>>>>> Thanks a lot for your suggestions. I will try MUMPS.
>>>>> For now, I just want to fiddle with PETSc's built-in features to learn
>>>>> more about them.
>>>>> Yes, 5000 is large; smaller values, e.g., 30, converge very slowly.
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>> Have a good weekend.
>>>>>
>>>>>
>>>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <
>>>>> stefano.zampini at gmail.com> wrote:
>>>>>
>>>>>> Please include the output of -log_view -ksp_view -ksp_monitor to
>>>>>> understand what's happening.
>>>>>>
>>>>>> Can you please share the equations you are solving so we can provide
>>>>>> suggestions on the solver configuration?
>>>>>> As I said, solving Nedelec-type discretizations is challenging and not a
>>>>>> job for off-the-shelf, black-box solvers.
>>>>>>
>>>>>> Below are some comments:
>>>>>>
>>>>>>
>>>>>>    - You use a redundant SVD approach for the coarse solve, which
>>>>>>    can be inefficient if your coarse space grows. You can use a parallel
>>>>>>    direct solver like MUMPS (reconfigure with --download-mumps and use
>>>>>>    -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps)
>>>>>>    - Why use ILU for the Dirichlet problem and GAMG for the Neumann
>>>>>>    problem? With 8 processes and 300K total dofs, you will have around 40K
>>>>>>    dofs per process, which is fine for a direct solver like MUMPS
>>>>>>    (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann).
>>>>>>    With Nedelec dofs and the sparsity pattern they induce, I believe you can
>>>>>>    push to 80K dofs per process with good performance.
>>>>>>    - Why a GMRES restart of 5000? It is highly inefficient to
>>>>>>    re-orthogonalize against such a large set of vectors.
>>>>>>
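>>>>>> Putting these suggestions together, one possible option set to try
>>>>>> (illustrative only; it reuses options already mentioned above and has not
>>>>>> been tested on your problem) would be:
>>>>>>
>>>>>>   -pc_type bddc -mat_type is \
>>>>>>   -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps \
>>>>>>   -pc_bddc_dirichlet_pc_type lu -pc_bddc_dirichlet_pc_factor_mat_solver_type mumps \
>>>>>>   -pc_bddc_neumann_pc_type lu -pc_bddc_neumann_pc_factor_mat_solver_type mumps \
>>>>>>   -ksp_gmres_restart 30
>>>>>>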
>>>>>>
>>>>>> On Fri, Aug 16, 2024 at 00:04 neil liu <liufield at gmail.com> wrote:
>>>>>>
>>>>>>> Dear Petsc developers,
>>>>>>>
>>>>>>> Thanks for your previous help. Now PCBDDC can converge to 1e-8 with:
>>>>>>>
>>>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type
>>>>>>> bddc -pc_bddc_coarse_redundant_pc_type svd   -ksp_error_if_not_converged
>>>>>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>>>>> -pc_bddc_use_local_mat_graph 0  -pc_bddc_dirichlet_pc_type ilu
>>>>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>>>
>>>>>>> Then I ran two cases for a strong scaling test. One case involves only
>>>>>>> real numbers (tetra #: 49,152; dof #: 324,224) in the matrix and RHS. The
>>>>>>> second case involves complex numbers (tetra #: 95,336; dof #: 611,432) due
>>>>>>> to the PML.
>>>>>>>
>>>>>>> Case 1:
>>>>>>> cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>> 2       234.7                                              3.12
>>>>>>> 4       126.6                        0.92                  1.62
>>>>>>> 8       84.97                        0.69                  1.26
>>>>>>>
>>>>>>> However, for Case 2:
>>>>>>> cpu #   Time for 500 KSP steps (s)   Parallel efficiency   PCSetUp time (s)
>>>>>>> 2       584.5                                              8.61
>>>>>>> 4       376.8                        0.77                  6.56
>>>>>>> 8       459.6                        0.31                  66.47
>>>>>>> For these two cases, I checked the PCSetUp time as an example. It seems
>>>>>>> that with 8 CPUs, Case 2 spends far too much time in PCSetUp.
>>>>>>> Do you have any ideas about what is going on here?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Xiaodong
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefano
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Stefano
>>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

