[petsc-users] Strange strong scaling result
Matthew Knepley
knepley at gmail.com
Tue Jul 12 10:55:18 CDT 2022
I do not understand these results at all. Let's just look at the simplest
piece:
NProcessors  NNodes  CoresPerNode  VecAXPY speedup
1            1       1             1.0
2            1       2             1.640502
4            1       4             4.456256
This is incredibly strange. Is it possible that other people are using the
nodes that you are running on?
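If it helps to isolate this, below is a minimal standalone VecAXPY timing
check (only a sketch, assuming a recent PETSc with PetscCall(); the vector
size and repeat count are arbitrary choices). Running it with -log_view at
each process count and comparing the VecAXPY event line would show whether
the effect appears independently of the rest of the solver.

/* axpy.c: time VecAXPY in isolation; run with
     mpiexec -n <p> ./axpy -n 10000000 -log_view
   and compare the VecAXPY event across process counts. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec      x, y;
  PetscInt i, n = 10000000;   /* global vector length (arbitrary) */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecDuplicate(x, &y));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecSet(y, 2.0));
  for (i = 0; i < 100; i++) PetscCall(VecAXPY(y, 3.14, x)); /* y <- y + 3.14 x */
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(PetscFinalize());
  return 0;
}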
Thanks,
Matt
On Tue, Jul 12, 2022 at 10:32 AM Ce Qin <qince168 at gmail.com> wrote:
> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1   1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946   4.65272    6.101214
>           8       2             4  10.584189  17.519901   8.59046   16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795  37.04392   18.135278  34.5448    18.575953  34.483058
>          16       4             4  22.00331   39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779  41.38289   18.354949  36.388144  18.828393  36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608  35.62153   71.471289  37.166421  71.533865
>
> and the STREAMS benchmark results on the compute node:
>
> Processes   Rate (MB/s)   Speedup
>  1           8291.4887
>  2           8739.3219    1.05401
>  3          24769.5868    2.98735
>  4          31962.0242    3.8548
>  5          39603.8828    4.77645
>  6          47777.7385    5.76226
>  7          54557.5363    6.57994
>  8          62769.3910    7.57034
>  9          38649.9160    4.6614
> 10          58976.9536    7.11295
> 11          48108.7801    5.80219
> 12          49506.8213    5.9708
> 13          54810.5266    6.61046
> 14          62471.5234    7.53441
> 15          63968.0218    7.7149
> 16          69644.8615    8.39956
> 17          60791.9544    7.33185
> 18          65476.5162    7.89683
> 19          60127.0683    7.25166
> 20          72052.5175    8.68994
> 21          62045.7745    7.48307
> 22          64517.7771    7.7812
> 23          69570.2935    8.39057
> 24          69673.8328    8.40305
> 25          75196.7514    9.06915
> 26          72304.2685    8.7203
> 27          73234.1616    8.83245
> 28          74041.3842    8.9298
> 29          77117.3751    9.30079
> 30          78293.8496    9.44268
> 31          81377.0870    9.81453
> 32          84097.0813   10.1426
>
>
> Best,
> Ce
>
> Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 22:11:
>
>> You may get more memory bandwidth with 32 processors vs 1, as Ce
>> mentioned.
>> Depends on the architecture.
>> Do you get the whole memory bandwidth on one processor on this machine?
>>
>> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168 at gmail.com> wrote:
>>>
>>>>
>>>>
>>>>>>> The linear system is complex-valued. We rewrite it into its real form
>>>>>>> and solve it using FGMRES with an optimal block-diagonal
>>>>>>> preconditioner.
>>>>>>> We use CG with the AMS preconditioner implemented in HYPRE to solve
>>>>>>> the smaller real linear systems arising from applying the block
>>>>>>> preconditioner.
>>>>>>> The iteration counts of FGMRES and CG remain almost constant across
>>>>>>> all the runs.
>>>>>>>
>>>>>>
>>>>>> So those blocks decrease in size as you add more processes?
>>>>>>
>>>>>>
>>>>>
>>>> I am sorry for the unclear description of the block-diagonal
>>>> preconditioner.
>>>> Let K = Kr + i*Ki be the original complex system matrix, and let
>>>> A = [Kr, -Ki; -Ki, -Kr] be its equivalent real form. Let
>>>> P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal
>>>> preconditioner for A. In our implementation, only Kr, Ki, and Kr+Ki
>>>> are explicitly stored as MATMPIAIJ; we use MATSHELL to represent A
>>>> and P. We use FGMRES + P to solve Ax = b, and CG + AMS to
>>>> solve (Kr+Ki)y = c, so the block size never changes.
>>>>
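For readers following the thread, here is a minimal sketch of how such a
MATSHELL/PCSHELL arrangement might be wired up in PETSc. The context struct,
the index sets isr/isi that split the real-form vector into its two blocks,
and all function names are assumptions for illustration, not code from this
thread; the shell matrix A itself (MatCreateShell() plus
MatShellSetOperation(..., MATOP_MULT, ...)) and the AMS-specific setup
(discrete gradient, coordinates) are omitted.

#include <petscksp.h>

/* Hypothetical context for the shell preconditioner P = diag(Kr+Ki, Kr+Ki). */
typedef struct {
  KSP inner;     /* CG + hypre AMS on the assembled Kr+Ki (MATMPIAIJ) */
  IS  isr, isi;  /* rows of the two blocks of the real-form vector */
} PCtx;

/* Apply P^{-1}: one inner solve with (Kr+Ki) per block */
PetscErrorCode ApplyP(PC pc, Vec x, Vec y)
{
  PCtx *ctx;
  Vec   xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(VecGetSubVector(x, ctx->isr, &xr));
  PetscCall(VecGetSubVector(x, ctx->isi, &xi));
  PetscCall(VecGetSubVector(y, ctx->isr, &yr));
  PetscCall(VecGetSubVector(y, ctx->isi, &yi));
  PetscCall(KSPSolve(ctx->inner, xr, yr)); /* (Kr+Ki) yr = xr */
  PetscCall(KSPSolve(ctx->inner, xi, yi)); /* (Kr+Ki) yi = xi */
  PetscCall(VecRestoreSubVector(y, ctx->isi, &yi));
  PetscCall(VecRestoreSubVector(y, ctx->isr, &yr));
  PetscCall(VecRestoreSubVector(x, ctx->isi, &xi));
  PetscCall(VecRestoreSubVector(x, ctx->isr, &xr));
  PetscFunctionReturn(0);
}

/* Inner solver: CG + hypre AMS on Kr+Ki.  AMS additionally needs the
   discrete gradient (PCHYPRESetDiscreteGradient) and coordinates, omitted. */
PetscErrorCode SetupInner(Mat KrPlusKi, KSP *inner)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, inner));
  PetscCall(KSPSetOperators(*inner, KrPlusKi, KrPlusKi));
  PetscCall(KSPSetType(*inner, KSPCG));
  PetscCall(KSPGetPC(*inner, &pc));
  PetscCall(PCSetType(pc, PCHYPRE));
  PetscCall(PCHYPRESetType(pc, "ams"));
  PetscFunctionReturn(0);
}

/* Outer solver: FGMRES on the MATSHELL A with the shell preconditioner P */
PetscErrorCode SetupOuter(Mat A, PCtx *ctx, KSP *outer)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, outer));
  PetscCall(KSPSetOperators(*outer, A, A));
  PetscCall(KSPSetType(*outer, KSPFGMRES));
  PetscCall(KSPGetPC(*outer, &pc));
  PetscCall(PCSetType(pc, PCSHELL));
  PetscCall(PCShellSetContext(pc, ctx));
  PetscCall(PCShellSetApply(pc, ApplyP));
  PetscCall(KSPSetFromOptions(*outer));
  PetscFunctionReturn(0);
}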
>>>
>>> Then we have to break down the timings further. I suspect AMS is not
>>> taking as long, since
>>> all other operations scale like N.
>>>
>>> Thanks,
>>>
>>> Matt
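One way to get that breakdown is to wrap the assembly, the AMS setup, and
the outer solve in user-defined logging stages, so that -log_view reports
the events (VecAXPY, MatMult, PCApply, ...) per phase. A minimal sketch,
with placeholder stage names and comments standing in for the actual calls:

#include <petscksp.h>

/* Sketch: register one logging stage per phase so -log_view breaks the
   timings down.  The stage names and commented-out calls are placeholders. */
PetscErrorCode RunWithStages(void)
{
  PetscLogStage assembly, setup, solve;

  PetscFunctionBeginUser;
  PetscCall(PetscLogStageRegister("Assembly",  &assembly));
  PetscCall(PetscLogStageRegister("AMS setup", &setup));
  PetscCall(PetscLogStageRegister("Solve",     &solve));

  PetscCall(PetscLogStagePush(assembly));
  /* ... assemble Kr, Ki and Kr+Ki here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(setup));
  /* ... KSPSetUp() of the inner CG + AMS solver here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(solve));
  /* ... outer FGMRES KSPSolve() here ... */
  PetscCall(PetscLogStagePop());
  PetscFunctionReturn(0);
}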
>>>
>>>
>>>
>>>> Best,
>>>> Ce
>>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/