[petsc-users] Strange strong scaling result
Mark Adams
mfadams at lbl.gov
Tue Jul 12 07:09:58 CDT 2022
As Matt alluded to, the blocks get smaller and cheaper. That and cache
effects could account for all of this superlinear speedup.
If the convergence rate does not deteriorate with an increased number of
subdomains, then you want to decouple the number of solver subdomains from
your domain decomposition.
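
For what it's worth, a minimal sketch of that idea with a plain block-Jacobi
preconditioner in PETSc (not necessarily the preconditioner used here; ksp is
an assumed, already-created KSP):

    /* Fix the number of subdomains (blocks) regardless of the MPI size, so the
       preconditioner stays the same as processes are added.
       Runtime equivalent: -pc_type bjacobi -pc_bjacobi_blocks 32 */
    PC pc;
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);
    PCBJacobiSetTotalBlocks(pc, 32, NULL);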
On Tue, Jul 12, 2022 at 7:08 AM Matthew Knepley <knepley at gmail.com> wrote:
> On Tue, Jul 12, 2022 at 1:50 AM Ce Qin <qince168 at gmail.com> wrote:
>
>> Thanks for your quick response.
>>
>> The linear system is complex-valued. We rewrite it into its real form
>> and solve it using FGMRES and an optimal block-diagonal preconditioner.
>> We use CG with the AMS preconditioner implemented in HYPRE to solve the
>> smaller real linear systems arising from applying the block preconditioner.
>> The iteration counts of FGMRES and CG remain almost constant across all the
>> runs.
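
For reference, a nested solve along those lines might be set up in PETSc
roughly as follows (a sketch under assumptions: A is the real-form matrix,
the real/imaginary splits are defined elsewhere, and the actual code used
here is not shown):

    KSP      ksp, *subksp;
    PC       pc, subpc;
    PetscInt i, nsplits;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);             /* A: real-form system matrix (assumed) */
    KSPSetType(ksp, KSPFGMRES);             /* flexible outer Krylov method */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCFIELDSPLIT);            /* block-diagonal preconditioner */
    PCFieldSplitSetType(pc, PC_COMPOSITE_ADDITIVE);
    /* ... PCFieldSplitSetIS() calls defining the two blocks go here ... */
    KSPSetFromOptions(ksp);
    KSPSetUp(ksp);
    PCFieldSplitGetSubKSP(pc, &nsplits, &subksp);
    for (i = 0; i < nsplits; ++i) {         /* CG + hypre AMS on each block */
      KSPSetType(subksp[i], KSPCG);
      KSPGetPC(subksp[i], &subpc);
      PCSetType(subpc, PCHYPRE);
      PCHYPRESetType(subpc, "ams");
      /* AMS also needs PCHYPRESetDiscreteGradient() with the gradient matrix */
    }
    PetscFree(subksp);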
>>
>
> So those blocks decrease in size as you add more processes?
>
>
>> Each node is equipped with a 64-core CPU and 128 GB of memory.
>> The matrix-vector product is memory-bandwidth limited. Is this strange
>> behavior related to memory bandwidth?
>>
>
> I don't see how.
>
> Thanks,
>
> Matt
>
>
>> Best,
>> Ce
>>
>> Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 04:04:
>>
>>> Also, cache effects. 20M DoFs on one core/thread is huge.
>>> 37x on assembly is probably cache effects.
>>>
>>> On Mon, Jul 11, 2022 at 1:09 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Mon, Jul 11, 2022 at 10:34 AM Ce Qin <qince168 at gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I want to analyze the strong scaling of our in-house FEM code.
>>>>> The test problem has about 20M DoFs. I ran the problem using
>>>>> various settings. The speedups for the assembly and solving
>>>>> procedures are as follows:
>>>>> NProcessors  NNodes  CoresPerNode   Assembly    Solving
>>>>>           1       1             1   1.0         1.0
>>>>>           2       1             2   1.995246    1.898756
>>>>>           2       2             1   2.121401    2.436149
>>>>>           4       1             4   4.658187    6.004539
>>>>>           4       2             2   4.666667    5.942085
>>>>>           4       4             1   4.65272     6.101214
>>>>>           8       2             4   9.380985    16.581135
>>>>>           8       4             2   9.308575    17.258891
>>>>>           8       8             1   9.314449    17.380612
>>>>>          16       2             8   18.575953   34.483058
>>>>>          16       4             4   18.745129   34.854409
>>>>>          16       8             2   18.828393   36.45509
>>>>>          32       4             8   37.140626   70.175879
>>>>>          32       8             4   37.166421   71.533865
>>>>>
>>>>> I don't quite understand this result. Why can we achieve a speedup of
>>>>> about 70+ using 32 processors? Could you please help me explain this?
>>>>>
>>>>
>>>> We need more data. I would start with the number of iterations that the
>>>> solver executes. I suspect this is changing. However, it can be more
>>>> complicated. For example, a block-Jacobi preconditioner gets cheaper as
>>>> the number of subdomains increases. Thus we need to know exactly what the
>>>> solver is doing.
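
One easy way to check, for what it's worth: -ksp_monitor and -log_view report
this at runtime, or the iteration count can be recorded in code. A minimal
sketch (ksp, b, x are assumed to already exist):

    /* Print the outer iteration count after each solve so that runs with
       different process counts can be compared directly. */
    PetscInt its;
    KSPSolve(ksp, b, x);
    KSPGetIterationNumber(ksp, &its);
    PetscPrintf(PETSC_COMM_WORLD, "outer iterations: %d\n", (int)its);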
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Best,
>>>>> Ce
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>