[petsc-users] Strange strong scaling result
Matthew Knepley
knepley at gmail.com
Tue Jul 12 06:05:52 CDT 2022
On Tue, Jul 12, 2022 at 1:50 AM Ce Qin <qince168 at gmail.com> wrote:
> Thanks for your quick response.
>
> The linear system is complex-valued. We rewrite it in its real form
> and solve it using FGMRES with an optimal block-diagonal preconditioner.
> We use CG with the AMS preconditioner implemented in HYPRE to solve the
> smaller real linear systems arising from applying the block preconditioner.
> The iteration counts of FGMRES and CG remain almost constant across all the runs.
>
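For context, the standard real-equivalent formulation of a complex linear system, with one block-diagonal preconditioner known from the literature for complex symmetric problems, is sketched below (the symbols K, M, u, v, f, g are illustrative; the thread does not say which splitting or preconditioner blocks the code actually uses). Writing the system as (K + iM)(u + iv) = f + ig, the real form and preconditioner read

    \begin{pmatrix} K & -M \\ M & K \end{pmatrix}
    \begin{pmatrix} u \\ v \end{pmatrix}
    =
    \begin{pmatrix} f \\ g \end{pmatrix},
    \qquad
    P = \begin{pmatrix} K + M & 0 \\ 0 & K + M \end{pmatrix}.

Each application of P then reduces to two real solves with K + M, matching the description above of CG/AMS applied to a smaller real system inside FGMRES.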
So those blocks decrease in size as you add more processes?
> Each node is equipped with a 64-core CPU and 128 GB of memory.
> The matrix-vector product is memory-bandwidth limited. Is this strange
> behavior related to memory bandwidth?
>
I don't see how.
Thanks,
Matt
> Best,
> Ce
>
> Mark Adams <mfadams at lbl.gov> 于2022年7月12日周二 04:04写道:
>
>> Also, cache effects. 20M DoFs on one core/thread is huge.
>> 37x on assembly is probably cache effects.
>>
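A back-of-envelope estimate supports the cache argument (double precision is assumed, and the 50 nonzeros per row is a guess, since the thread gives no sparsity data):

    20 \times 10^{6}\,\text{DoF} \times 8\,\text{B} \approx 160\,\text{MB per vector},
    \qquad
    \frac{160\,\text{MB}}{32\ \text{processes}} \approx 5\,\text{MB per process}.

On one core every vector access streams from main memory; at 32 processes each local vector fits in a typical L2/L3 cache slice. The matrix itself (about 20e6 rows x 50 nnz x 12 B ≈ 12 GB in AIJ format under the assumed sparsity) never fits in cache at any process count, so the superlinear gain would come mainly from vector and workspace reuse.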
>> On Mon, Jul 11, 2022 at 1:09 PM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Mon, Jul 11, 2022 at 10:34 AM Ce Qin <qince168 at gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to analyze the strong scaling of our in-house FEM code.
>>>> The test problem has about 20M DoFs. I ran the problem using
>>>> various settings. The speedups for the assembly and solving
>>>> procedures are as follows:
>>>> NProcessors  NNodes  CoresPerNode   Assembly    Solving
>>>>           1       1             1   1.0         1.0
>>>>           2       1             2   1.995246    1.898756
>>>>           2       2             1   2.121401    2.436149
>>>>           4       1             4   4.658187    6.004539
>>>>           4       2             2   4.666667    5.942085
>>>>           4       4             1   4.65272     6.101214
>>>>           8       2             4   9.380985    16.581135
>>>>           8       4             2   9.308575    17.258891
>>>>           8       8             1   9.314449    17.380612
>>>>          16       2             8   18.575953   34.483058
>>>>          16       4             4   18.745129   34.854409
>>>>          16       8             2   18.828393   36.45509
>>>>          32       4             8   37.140626   70.175879
>>>>          32       8             4   37.166421   71.533865
>>>>
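Reading the table as the usual strong-scaling speedup (an assumption; the post does not define it),

    S(P) = \frac{T(1)}{T(P)},
    \qquad
    E(P) = \frac{S(P)}{P},
    \qquad
    E(32) \approx \frac{71.5}{32} \approx 2.2,

the solve phase is over 200% parallel-efficient at 32 processes, which is what makes the result look strange.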
>>>> I don't quite understand this result. Why can we achieve a speedup of
>>>> 70+ using only 32 processors? Could you please help me explain this?
>>>>
>>>
>>> We need more data. I would start with the number of iterations the
>>> solver executes. I suspect this is changing. However, it can be more
>>> complicated. For example, a Block-Jacobi preconditioner gets cheaper
>>> as the number of subdomains increases. Thus we need to know exactly
>>> what the solver is doing.
>>>
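A minimal way to test the iteration-count hypothesis is to query the KSP after each solve. The sketch below uses standard PETSc calls on a toy 1-D Laplacian (the matrix is a stand-in, not the poster's FEM operator):

/* Minimal sketch: solve a 1-D Laplacian and report the KSP iteration
 * count, to check whether it stays constant as the process count grows. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 100, Istart, Iend, its;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Assemble the standard tridiagonal 1-D Laplacian */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  /* Solver configurable from the command line (-ksp_type, -pc_type, ...) */
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  /* The number to compare across runs with different process counts */
  PetscCall(KSPGetIterationNumber(ksp, &its));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "KSP iterations: %" PetscInt_FMT "\n", its));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(MatDestroy(&A));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(PetscFinalize());
  return 0;
}

Running this at several process counts with -ksp_monitor or -log_view shows directly whether it is the iteration count or the per-iteration cost that changes.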
>>> Thanks,
>>>
>>> Matt
>>>
>>>
>>>> Thank you in advance.
>>>>
>>>> Best,
>>>> Ce
>>>>
>>>>
>>>>
>>>
>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/