[petsc-users] Strong scaling concerns for PCBDDC with Vector FEM

Tue Aug 20 16:53:20 CDT 2024

On Tue, Aug 20, 2024 at 2:31 PM neil liu <liufield at gmail.com> wrote:

> Thanks a lot for this explanation, Matt. I will explore whether the matrix
> has the same size and sparsity.
>

I think it is much more likely that you just exhausted bandwidth on the
node.
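
A rough way to check this from the numbers quoted below is to normalize the Max times by the number of calls and compare the 2-to-4 and 4-to-8 steps. The following is a minimal Python sketch, not PETSc code: the counts and times are copied from the -log_view excerpt in this thread, while the dictionary layout and the helper name per_call_speedup are only illustrative.

```python
# Call counts and Max times (s) copied from the -log_view excerpt quoted below.
# Outer keys are event names, inner keys are MPI process counts.
events = {
    "VecMDot":  {2: (530, 78.320), 4: (530, 43.285), 8: (530, 30.476)},
    "VecMAXPY": {2: (534, 92.954), 4: (534, 48.378), 8: (534, 30.798)},
    "MatMult":  {2: (8055, 246.08), 4: (8103, 126.63), 8: (8367, 82.942)},
}


def per_call_speedup(event, p_from, p_to):
    """Ratio of time per call at p_from processes to time per call at p_to."""
    calls_from, t_from = events[event][p_from]
    calls_to, t_to = events[event][p_to]
    return (t_from / calls_from) / (t_to / calls_to)


for name in events:
    s24 = per_call_speedup(name, 2, 4)
    s48 = per_call_speedup(name, 4, 8)
    print(f"{name:9s} 2->4: {s24:.2f}x   4->8: {s48:.2f}x")
```

With these numbers the per-call speedup is roughly 1.8 to 2.0 going from 2 to 4 processes and roughly 1.4 to 1.6 going from 4 to 8. All three kernels are bandwidth-bound, so a simultaneous drop like this is consistent with a saturated memory bus, although `make streams` remains the direct test.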

  Thanks,

    Matt

> On Tue, Aug 20, 2024 at 1:45 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Tue, Aug 20, 2024 at 1:36 PM neil liu <liufield at gmail.com> wrote:
>>
>>> Hi, Matt,
>>> I think the time listed here represents the maximum total time across
>>> different processors.
>>>
>>> Thanks a lot.
>>>           2 cpus                           4 cpus                           8 cpus
>>> Event     Count         Time (sec)         Count         Time (sec)         Count         Time (sec)
>>>             Max  Ratio  Max         Ratio    Max  Ratio  Max         Ratio    Max  Ratio  Max         Ratio
>>> VecMDot     530  1.0    7.8320e+01  1.0      530  1.0    4.3285e+01  1.1      530  1.0    3.0476e+01  1.1
>>> VecMAXPY    534  1.0    9.2954e+01  1.0      534  1.0    4.8378e+01  1.1      534  1.0    3.0798e+01  1.1
>>> MatMult    8055  1.0    2.4608e+02  1.0     8103  1.0    1.2663e+02  1.0     8367  1.0    8.2942e+01  1.1
>>>
>>
>> For the number of calls listed.
>>
>> 1) The number of MatMults goes up, so you should normalize for that, but
>> you still have about 1.6 speedup. However, this is
>>     all multiplications. Are we sure they have the same size and sparsity?
>>
>> 2) MAXPY is also 1.6
>>
>> 3) MDot probably does not see the latency of one node, so again it is not
>> speeding up as you might want.
>>
>> This looks like you are using a single node with 2, 4, and 8 procs. The
>> memory bandwidth is exhausted sometime before 8 procs
>> (maybe 6), so you cease to see speedup. You can check this by running
>> `make streams` on the node.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> On Tue, Aug 20, 2024 at 1:16 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Tue, Aug 20, 2024 at 1:10 PM neil liu <liufield at gmail.com> wrote:
>>>>
>>>>> Thanks a lot for your explanation, Stefano. Very helpful.
>>>>> Yes. I am using dmplex to read a tetrahedral mesh from gmsh. With
>>>>> parmetis, the scaling performance is improved a lot.
>>>>> I will read your paper about how to change the basis for Nedelec
>>>>> elements.
>>>>>
>>>>> cpu #    time for 500 ksp steps  (s)           parallel efficiency
>>>>> 2           546
>>>>> 4           224                                               120%
>>>>> 8           170                                               80%
>>>>> These results are much better than the previous attempt. Then I checked the
>>>>> time spent by several PETSc built-in functions for the ksp solver.
>>>>>
>>>>> Functions          time(2 cpus)     time(4 cpus)      time(8 cpus)
>>>>> VecMDot           78.32                43.28                30.47
>>>>> VecMAXPY       92.95                48.37                30.798
>>>>> MatMult          246.08               126.63                82.94
>>>>>
>>>>> It seems that from 4 to 8 cpus, the scaling is not as good as from 2
>>>>> to 4 cpus.
>>>>> Am I missing something?
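
The parallel efficiency column in the table above appears to be measured relative to the 2-process run, i.e. E(p) = (T_2 * 2) / (T_p * p). That reading is an assumption, but it reproduces the efficiencies reported in this thread. A minimal Python sketch follows; the function name efficiency is illustrative.

```python
def efficiency(times):
    """Strong-scaling efficiency relative to the smallest process count.

    times maps process count -> wall time in seconds.
    """
    base = min(times)
    t_base = times[base]
    return {p: (t_base * base) / (t * p) for p, t in sorted(times.items())}


# Time for 500 KSP steps with the ParMETIS partitioning, from the table above.
times = {2: 546.0, 4: 224.0, 8: 170.0}
eff = efficiency(times)
for p, e in eff.items():
    print(f"{p} procs: {e:.0%}")
# prints:
# 2 procs: 100%
# 4 procs: 122%
# 8 procs: 80%
```

The value above 100% at 4 processes means the 2-process baseline is itself slow per process, so efficiencies measured against it look better than they would against a 1-process run.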
>>>>>
>>>>
>>>> Did you normalize by the number of calls?
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> Xiaodong
>>>>>
>>>>>
>>>>> On Mon, Aug 19, 2024 at 4:15 AM Stefano Zampini <
>>>>> stefano.zampini at gmail.com> wrote:
>>>>>
>>>>>> It seems you are using DMPLEX to handle the mesh, correct?
>>>>>> If so, you should configure using --download-parmetis to have a
>>>>>> better domain decomposition since the default one just splits the cells in
>>>>>> chunks as they are ordered.
>>>>>> This results in a large number of primal dofs on average (191, from
>>>>>> the output of ksp_view)
>>>>>> ...
>>>>>> Primal    dofs   : 176 204 191
>>>>>> ...
>>>>>> that slows down the solver setup.
>>>>>>
>>>>>> Again, you should not use approximate local solvers with BDDC unless
>>>>>> you know what you are doing.
>>>>>> The theory for approximate solvers for BDDC is limited and covers only SPD
>>>>>> problems.
>>>>>> Looking at the output of log_view, the coarse problem setup (PCBDDCCSet)
>>>>>> and the primal functions setup (PCBDDCCorr) cost 35 and 63 seconds, respectively.
>>>>>> Also, the 500 applications of the GAMG preconditioner for the Neumann
>>>>>> solver (PCBDDCNeuS) take 129 seconds out of the 400 seconds of total
>>>>>> solve time.
>>>>>>
>>>>>> PCBDDCTopo             1 1.0 3.1563e-01 1.0 1.11e+06 3.4 1.6e+03 3.9e+04 3.8e+01  0  0  1  0  2   0  0  1  0  2    19
>>>>>> PCBDDCLKSP             2 1.0 2.0423e+00 1.7 9.31e+08 1.2 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0  3378
>>>>>> PCBDDCLWor             1 1.0 3.9178e-02 13.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>> PCBDDCCorr             1 1.0 6.3981e+01 2.2 8.16e+10 1.6 0.0e+00 0.0e+00 0.0e+00 11 11  0  0  0  11 11  0  0  0  8900
>>>>>> PCBDDCCSet             1 1.0 3.5453e+01 4564.9 1.06e+05 1.7 1.2e+03 5.3e+03 5.0e+01  2  0  1  0  3   2  0  1  0  3     0
>>>>>> PCBDDCCKSP             1 1.0 6.3266e-01 1.3 0.00e+00 0.0 3.3e+02 1.1e+02 2.2e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>>> PCBDDCScal             1 1.0 6.8274e-03 1.3 1.11e+06 3.4 5.6e+01 3.2e+05 0.0e+00  0  0  0  0  0   0  0  0  0  0   894
>>>>>> PCBDDCDirS          1000 1.0 6.0420e+00 3.5 6.64e+09 5.4 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  2995
>>>>>> PCBDDCNeuS           500 1.0 1.2901e+02 2.1 8.28e+10 1.2 0.0e+00 0.0e+00 0.0e+00 22 12  0  0  0  22 12  0  0  0  4828
>>>>>> PCBDDCCoaS           500 1.0 5.8757e-01 1.8 1.09e+09 1.0 2.8e+04 7.4e+02 5.0e+02  0  0 17  0 28   0  0 17  0 31 14901
>>>>>>
>>>>>> Finally, if I look at the residual history, I see a sharp decrease
>>>>>> and a very long plateau. This indicates a bad coarse space; as I said
>>>>>> before, there's no hope of finding a suitable coarse space without first
>>>>>> changing the basis of the Nedelec elements, which is done automatically if
>>>>>> you prescribe the discrete gradient operator (see the paper I have linked
>>>>>> to in my previous communication).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 18, 2024 at 00:37 neil liu <liufield at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, Stefano,
>>>>>>> Please see the attached for the information with 4 and 8 CPUs for
>>>>>>> the complex matrix.
>>>>>>> I am solving the Maxwell equations (attached) using 2nd-order Nedelec
>>>>>>> elements (two dofs on each edge and two dofs on each face).
>>>>>>> The computational domain consists of different media, e.g.,
>>>>>>> vacuum and substrate (different permittivity).
>>>>>>> The PML is used to truncate the computational domain, absorbing the
>>>>>>> outgoing wave and introducing complex numbers for the matrix.
>>>>>>>
>>>>>>> Thanks a lot for your suggestions. I will try MUMPS.
>>>>>>> For now, I just want to experiment with PETSc's built-in features to
>>>>>>> learn more about it.
>>>>>>> Yes, 5000 is large. A smaller value, e.g., 30, converges very slowly.
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>> Have a good weekend.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024 at 9:23 AM Stefano Zampini <
>>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>>
>>>>>>>> Please include the output of -log_view -ksp_view -ksp_monitor to
>>>>>>>> understand what's happening.
>>>>>>>>
>>>>>>>> Can you please share the equations you are solving so we can
>>>>>>>> provide suggestions on the solver configuration?
>>>>>>>> As I said, solving Nedelec-type discretizations is challenging
>>>>>>>> and not a job for off-the-shelf, black-box solvers.
>>>>>>>>
>>>>>>>> Below are some comments:
>>>>>>>>
>>>>>>>>
>>>>>>>>    - You use a redundant SVD approach for the coarse solve, which
>>>>>>>>    can be inefficient if your coarse space grows. You can use a parallel
>>>>>>>>    direct solver like MUMPS (reconfigure with --download-mumps and use
>>>>>>>>    -pc_bddc_coarse_pc_type lu -pc_bddc_coarse_pc_factor_mat_solver_type mumps)
>>>>>>>>    - Why use ILU for the Dirichlet problem and GAMG for the
>>>>>>>>    Neumann problem? With 8 processes and 300K total dofs, you will have around
>>>>>>>>    40K dofs per process, which is ok for a direct solver like MUMPS
>>>>>>>>    (-pc_bddc_dirichlet_pc_factor_mat_solver_type mumps, same for Neumann).
>>>>>>>>    With Nedelec dofs and the sparsity pattern they induce,  I believe you can
>>>>>>>>    push to 80K dofs per process with good performance.
>>>>>>>>    - Why a restart of 5000 for GMRES? It is highly inefficient to
>>>>>>>>    re-orthogonalize such a large set of vectors.
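
Put together, the suggestions in the list above amount to a command line along the following lines. This is a sketch under assumptions: it presumes PETSc was reconfigured with --download-mumps, the -pc_bddc_*_pc_type lu settings for the Dirichlet and Neumann solves are inferred from "same for Neumann", and the GMRES restart of 100 is a placeholder rather than a value recommended in this thread. The helper name bddc_direct_options is illustrative.

```python
import shlex


def bddc_direct_options(gmres_restart=100):
    """Return PETSc options that use MUMPS for all three BDDC sub-solves.

    gmres_restart=100 is a placeholder assumption, not a value from the thread.
    """
    opts = ["-pc_type", "bddc", "-mat_type", "is",
            "-ksp_rtol", "1e-8", "-ksp_max_it", "500",
            "-ksp_gmres_restart", str(gmres_restart)]
    # Coarse, Dirichlet, and Neumann problems all use LU through MUMPS.
    # -pc_bddc_neumann_approximate is left out because the solve is now exact.
    for sub in ("coarse", "dirichlet", "neumann"):
        opts += [f"-pc_bddc_{sub}_pc_type", "lu",
                 f"-pc_bddc_{sub}_pc_factor_mat_solver_type", "mumps"]
    opts += ["-ksp_monitor", "-ksp_view", "-log_view"]
    return opts


opts = bddc_direct_options()
# Build the full launch line; ./app is the application name used in this thread.
print(shlex.join(["mpirun", "-n", "8", "./app"] + opts))
```

Generating the option list in one place makes it easy to rerun the same configuration at 2, 4, and 8 processes and compare the -log_view output.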
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 16, 2024 at 00:04 neil liu <
>>>>>>>> liufield at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear PETSc developers,
>>>>>>>>>
>>>>>>>>> Thanks for your previous help. Now, the PCBDDC can converge to
>>>>>>>>> 1e-8 with,
>>>>>>>>>
>>>>>>>>> petsc-3.21.1/petsc/arch-linux-c-opt/bin/mpirun -n 8 ./app -pc_type
>>>>>>>>> bddc -pc_bddc_coarse_redundant_pc_type svd   -ksp_error_if_not_converged
>>>>>>>>> -mat_type is -ksp_monitor -ksp_rtol 1e-8 -ksp_gmres_restart 5000 -ksp_view
>>>>>>>>> -pc_bddc_use_local_mat_graph 0  -pc_bddc_dirichlet_pc_type ilu
>>>>>>>>> -pc_bddc_neumann_pc_type gamg -pc_bddc_neumann_pc_gamg_esteig_ksp_max_it 10
>>>>>>>>> -ksp_converged_reason -pc_bddc_neumann_approximate -ksp_max_it 500 -log_view
>>>>>>>>>
>>>>>>>>> Then I used 2 cases for a strong scaling test. One case only
>>>>>>>>> involves real numbers (tetra #: 49,152; dof #: 324,224) for the matrix and
>>>>>>>>> rhs. The 2nd case involves complex numbers (tetra #: 95,336; dof #:
>>>>>>>>> 611,432) due to the PML.
>>>>>>>>>
>>>>>>>>> Case 1:
>>>>>>>>> cpu #    Time for 500 ksp steps (s)    Parallel efficiency    PCsetup time (s)
>>>>>>>>> 2        234.7                                                3.12
>>>>>>>>> 4        126.6                         0.92                   1.62
>>>>>>>>> 8        84.97                         0.69                   1.26
>>>>>>>>> However, for Case 2,
>>>>>>>>> cpu #    Time for 500 ksp steps (s)    Parallel efficiency    PCsetup time (s)
>>>>>>>>> 2        584.5                                                8.61
>>>>>>>>> 4        376.8                         0.77                   6.56
>>>>>>>>> 8        459.6                         0.31                   66.47
>>>>>>>>> For these 2 cases, I checked the time for PCsetup as an example.
>>>>>>>>> It seems 8 cpus for case 2 used too much time on PCsetup.
>>>>>>>>> Do you have any ideas about what is going on here?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Xiaodong
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefano
>>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!c1-7PTlMFjRSGEtUBfqX0W9JQed5UTJTHCsmwhm4whuZoTMIll340dHxiKyGvIedaFLp4VcuBIrnBKMFP6GD$ 
>>>> <https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!c1-7PTlMFjRSGEtUBfqX0W9JQed5UTJTHCsmwhm4whuZoTMIll340dHxiKyGvIedaFLp4VcuBIrnBMwGiak0$ >
>>>>
>>>
>>
>>
>
