[petsc-users] Strange strong scaling result
Matthew Knepley
knepley at gmail.com
Tue Jul 12 10:55:18 CDT 2022
I do not understand these results at all. Let's just look at the simplest
piece:
NProcessors  NNodes  CoresPerNode  VecAXPY speedup
1            1       1             1.0
2            1       2             1.640502
4            1       4             4.456256
This is incredibly strange. Is it possible that other people are using the
nodes that you are running on?
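If it helps to isolate this, below is a minimal standalone VecAXPY timing
check (only a sketch, assuming a recent PETSc with PetscCall(); the vector
size and repeat count are arbitrary choices). Running it with -log_view at
each process count and comparing the VecAXPY event line would show whether
the effect appears independently of the rest of the solver.

/* axpy.c: time VecAXPY in isolation; run with
     mpiexec -n <p> ./axpy -n 10000000 -log_view
   and compare the VecAXPY event across process counts. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec      x, y;
  PetscInt i, n = 10000000;   /* global vector length (arbitrary) */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecDuplicate(x, &y));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecSet(y, 2.0));
  for (i = 0; i < 100; i++) PetscCall(VecAXPY(y, 3.14, x)); /* y <- y + 3.14 x */
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(PetscFinalize());
  return 0;
}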
Thanks,
Matt
On Tue, Jul 12, 2022 at 10:32 AM Ce Qin <qince168 at gmail.com> wrote:
> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1   1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946   4.65272    6.101214
>           8       2             4  10.584189  17.519901   8.59046   16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795  37.04392   18.135278  34.5448    18.575953  34.483058
>          16       4             4  22.00331   39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779  41.38289   18.354949  36.388144  18.828393  36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608  35.62153   71.471289  37.166421  71.533865
>
> and the STREAMS benchmark results on the compute node:
>
> Processes   Rate (MB/s)   Speedup
>  1           8291.4887
>  2           8739.3219    1.05401
>  3          24769.5868    2.98735
>  4          31962.0242    3.8548
>  5          39603.8828    4.77645
>  6          47777.7385    5.76226
>  7          54557.5363    6.57994
>  8          62769.3910    7.57034
>  9          38649.9160    4.6614
> 10          58976.9536    7.11295
> 11          48108.7801    5.80219
> 12          49506.8213    5.9708
> 13          54810.5266    6.61046
> 14          62471.5234    7.53441
> 15          63968.0218    7.7149
> 16          69644.8615    8.39956
> 17          60791.9544    7.33185
> 18          65476.5162    7.89683
> 19          60127.0683    7.25166
> 20          72052.5175    8.68994
> 21          62045.7745    7.48307
> 22          64517.7771    7.7812
> 23          69570.2935    8.39057
> 24          69673.8328    8.40305
> 25          75196.7514    9.06915
> 26          72304.2685    8.7203
> 27          73234.1616    8.83245
> 28          74041.3842    8.9298
> 29          77117.3751    9.30079
> 30          78293.8496    9.44268
> 31          81377.0870    9.81453
> 32          84097.0813   10.1426
>
>
> Best,
> Ce
>
> Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 22:11:
>
>> You may get more memory bandwidth with 32 processors vs 1, as Ce
>> mentioned.
>> Depends on the architecture.
>> Do you get the whole memory bandwidth on one processor on this machine?
>>
>> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168 at gmail.com> wrote:
>>>
>>>>
>>>>
>>>>>>> The linear system is complex-valued. We rewrite it into its real form
>>>>>>> and solve it using FGMRES with an optimal block-diagonal
>>>>>>> preconditioner.
>>>>>>> We use CG with the AMS preconditioner implemented in HYPRE to solve
>>>>>>> the smaller real linear systems arising from applying the block
>>>>>>> preconditioner.
>>>>>>> The iteration counts of FGMRES and CG remain almost constant across
>>>>>>> all the runs.
>>>>>>>
>>>>>>
>>>>>> So those blocks decrease in size as you add more processes?
>>>>>>
>>>>>>
>>>>>
>>>> I am sorry for the unclear description of the block-diagonal
>>>> preconditioner.
>>>> Let K = Kr + i*Ki be the original complex system matrix, and let
>>>> A = [Kr, -Ki; -Ki, -Kr] be its equivalent real form. Let
>>>> P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal
>>>> preconditioner for A. In our implementation, only Kr, Ki, and Kr+Ki
>>>> are explicitly stored as MATMPIAIJ; we use MATSHELL to represent A
>>>> and P. We use FGMRES + P to solve Ax = b, and CG + AMS to
>>>> solve (Kr+Ki)y = c, so the block size never changes.
>>>>
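For readers following the thread, here is a minimal sketch of how such a
MATSHELL/PCSHELL arrangement might be wired up in PETSc. The context struct,
the index sets isr/isi that split the real-form vector into its two blocks,
and all function names are assumptions for illustration, not code from this
thread; the shell matrix A itself (MatCreateShell() plus
MatShellSetOperation(..., MATOP_MULT, ...)) and the AMS-specific setup
(discrete gradient, coordinates) are omitted.

#include <petscksp.h>

/* Hypothetical context for the shell preconditioner P = diag(Kr+Ki, Kr+Ki). */
typedef struct {
  KSP inner;     /* CG + hypre AMS on the assembled Kr+Ki (MATMPIAIJ) */
  IS  isr, isi;  /* rows of the two blocks of the real-form vector */
} PCtx;

/* Apply P^{-1}: one inner solve with (Kr+Ki) per block */
PetscErrorCode ApplyP(PC pc, Vec x, Vec y)
{
  PCtx *ctx;
  Vec   xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(VecGetSubVector(x, ctx->isr, &xr));
  PetscCall(VecGetSubVector(x, ctx->isi, &xi));
  PetscCall(VecGetSubVector(y, ctx->isr, &yr));
  PetscCall(VecGetSubVector(y, ctx->isi, &yi));
  PetscCall(KSPSolve(ctx->inner, xr, yr)); /* (Kr+Ki) yr = xr */
  PetscCall(KSPSolve(ctx->inner, xi, yi)); /* (Kr+Ki) yi = xi */
  PetscCall(VecRestoreSubVector(y, ctx->isi, &yi));
  PetscCall(VecRestoreSubVector(y, ctx->isr, &yr));
  PetscCall(VecRestoreSubVector(x, ctx->isi, &xi));
  PetscCall(VecRestoreSubVector(x, ctx->isr, &xr));
  PetscFunctionReturn(0);
}

/* Inner solver: CG + hypre AMS on Kr+Ki.  AMS additionally needs the
   discrete gradient (PCHYPRESetDiscreteGradient) and coordinates, omitted. */
PetscErrorCode SetupInner(Mat KrPlusKi, KSP *inner)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, inner));
  PetscCall(KSPSetOperators(*inner, KrPlusKi, KrPlusKi));
  PetscCall(KSPSetType(*inner, KSPCG));
  PetscCall(KSPGetPC(*inner, &pc));
  PetscCall(PCSetType(pc, PCHYPRE));
  PetscCall(PCHYPRESetType(pc, "ams"));
  PetscFunctionReturn(0);
}

/* Outer solver: FGMRES on the MATSHELL A with the shell preconditioner P */
PetscErrorCode SetupOuter(Mat A, PCtx *ctx, KSP *outer)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, outer));
  PetscCall(KSPSetOperators(*outer, A, A));
  PetscCall(KSPSetType(*outer, KSPFGMRES));
  PetscCall(KSPGetPC(*outer, &pc));
  PetscCall(PCSetType(pc, PCSHELL));
  PetscCall(PCShellSetContext(pc, ctx));
  PetscCall(PCShellSetApply(pc, ApplyP));
  PetscCall(KSPSetFromOptions(*outer));
  PetscFunctionReturn(0);
}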
>>>
>>> Then we have to break down the timings further. I suspect AMS is not
>>> taking as long, since
>>> all other operations scale like N.
>>>
>>> Thanks,
>>>
>>> Matt
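One way to get that breakdown is to wrap the assembly, the AMS setup, and
the outer solve in user-defined logging stages, so that -log_view reports
the events (VecAXPY, MatMult, PCApply, ...) per phase. A minimal sketch,
with placeholder stage names and comments standing in for the actual calls:

#include <petscksp.h>

/* Sketch: register one logging stage per phase so -log_view breaks the
   timings down.  The stage names and commented-out calls are placeholders. */
PetscErrorCode RunWithStages(void)
{
  PetscLogStage assembly, setup, solve;

  PetscFunctionBeginUser;
  PetscCall(PetscLogStageRegister("Assembly",  &assembly));
  PetscCall(PetscLogStageRegister("AMS setup", &setup));
  PetscCall(PetscLogStageRegister("Solve",     &solve));

  PetscCall(PetscLogStagePush(assembly));
  /* ... assemble Kr, Ki and Kr+Ki here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(setup));
  /* ... KSPSetUp() of the inner CG + AMS solver here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(solve));
  /* ... outer FGMRES KSPSolve() here ... */
  PetscCall(PetscLogStagePop());
  PetscFunctionReturn(0);
}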
>>>
>>>
>>>
>>>> Best,
>>>> Ce
>>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/