[petsc-users] Strange strong scaling result

Ce Qin qince168 at gmail.com
Tue Jul 12 01:49:32 CDT 2022


Thanks for your quick response.

The linear system is complex-valued. We rewrite it in its equivalent real form
and solve it using FGMRES with an optimal block-diagonal preconditioner.
We use CG with the AMS preconditioner implemented in HYPRE to solve the
smaller real linear systems arising from applying the block preconditioner.
The iteration counts of FGMRES and CG remain almost constant in all the runs.
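
For reference, the setup corresponds roughly to runtime options of the
following form (a sketch only: the fieldsplit prefixes are placeholders for
how our code actually wires things up, and AMS additionally needs the
discrete gradient, which is supplied separately, e.g. via
PCHYPRESetDiscreteGradient):

    # outer solve: FGMRES with an additive (block-diagonal) fieldsplit
    -ksp_type fgmres
    -pc_type fieldsplit
    -pc_fieldsplit_type additive
    # inner solve of the smaller real system: CG + HYPRE AMS
    # (the other split is configured the same way)
    -fieldsplit_0_ksp_type cg
    -fieldsplit_0_pc_type hypre
    -fieldsplit_0_pc_hypre_type ams
    # report outer and inner iteration counts for each run
    -ksp_converged_reason
    -fieldsplit_0_ksp_converged_reason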

Each node is equipped with a 64-core CPU and 128 GB of memory.
The matrix-vector product is memory-bandwidth limited. Could this strange
behavior be related to memory bandwidth?
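
One way to check would be to measure how the achievable memory bandwidth
scales with the number of MPI ranks per node, e.g. with the STREAMS benchmark
shipped with PETSc, and to compare the -log_view timings of the runs
(a sketch; the exact make target may differ between PETSc versions, and
"app" stands in for our executable):

    # from the PETSc source tree: run the MPI STREAMS benchmark,
    # increasing the number of ranks on a single node
    cd $PETSC_DIR
    make streams NPMAX=64

    # collect per-event timings and flop rates for each run
    mpiexec -n 32 ./app -log_view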

Best,
Ce

Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 04:04:

> Also, cache effects. 20M DoFs on one core/thread is huge.
> 37x on assembly is probably cache effects.
>
> On Mon, Jul 11, 2022 at 1:09 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Mon, Jul 11, 2022 at 10:34 AM Ce Qin <qince168 at gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I want to analyze the strong scaling of our in-house FEM code.
>>> The test problem has about 20M DoFs. I ran the problem using
>>> various settings. The speedups for the assembly and solving
>>> procedures are as follows:
>>>                                      Assembly      Solving
>>> NProcessors  NNodes  CoresPerNode
>>> 1            1       1               1.000000     1.000000
>>> 2            1       2               1.995246     1.898756
>>>              2       1               2.121401     2.436149
>>> 4            1       4               4.658187     6.004539
>>>              2       2               4.666667     5.942085
>>>              4       1               4.652720     6.101214
>>> 8            2       4               9.380985    16.581135
>>>              4       2               9.308575    17.258891
>>>              8       1               9.314449    17.380612
>>> 16           2       8              18.575953    34.483058
>>>              4       4              18.745129    34.854409
>>>              8       2              18.828393    36.455090
>>> 32           4       8              37.140626    70.175879
>>>              8       4              37.166421    71.533865
>>>
>>> I don't quite understand this result. Why can we achieve a speedup of
>>> about 70+ using 32 processors? Could you please help me explain this?
>>>
>>
>> We need more data. I would start with the number of iterates that the
>> solver
>> executes. I suspect this is changing. However, it can be more complicated.
>> For example, a Block-Jacobi preconditioner gets cheaper as the number of
>> subdomains increases. Thus we need to know exactly what the solver is
>> doing.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thank you in advance.
>>>
>>> Best,
>>> Ce
>>>
>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>