[petsc-users] Strange strong scaling result

Ce Qin qince168 at gmail.com
Tue Jul 12 10:32:11 CDT 2022


For your reference, I also calculated the speedups for other procedures:

NProcessors  NNodes  CoresPerNode     VecAXPY     MatMult    SetupAMS     PCApply    Assembly     Solving
          1       1             1         1.0         1.0         1.0         1.0         1.0         1.0
          2       1             2    1.640502    1.945753    1.418709    1.898884    1.995246    1.898756
          2       2             1    2.297125    2.614508    1.600718    2.419798    2.121401    2.436149
          4       1             4    4.456256    6.821532    3.614451    5.991256    4.658187    6.004539
          4       2             2    4.539748    6.779151    3.619661    5.926112    4.666667    5.942085
          4       4             1    4.480902    7.210629    3.471541    6.082946     4.65272    6.101214
          8       2             4   10.584189   17.519901     8.59046   16.615395    9.380985   16.581135
          8       4             2   10.980687   18.674113    8.612347   17.273229    9.308575   17.258891
          8       8             1   11.096298   18.210245    8.456557   17.430586    9.314449   17.380612
         16       2             8   21.929795    37.04392   18.135278     34.5448   18.575953   34.483058
         16       4             4    22.00331   39.581504   18.011148   34.793732   18.745129   34.854409
         16       8             2   22.692779    41.38289   18.354949   36.388144   18.828393    36.45509
         32       4             8   43.935774   80.003087   34.963997   70.085728   37.140626   70.175879
         32       8             4   44.387091   80.807608    35.62153   71.471289   37.166421   71.533865
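For readers who want to see how far these numbers are from ideal scaling, the speedups can be turned into parallel efficiencies (speedup divided by process count). The following is only a small sketch: the MatMult and Solving values are copied from a few rows of the table, while the dictionary layout and the function names are my own choices for illustration.

```python
# Speedups copied from the table above for one configuration per process count.
# Key: (NProcessors, NNodes, CoresPerNode); value: (MatMult, Solving).
speedups = {
    (1, 1, 1): (1.0, 1.0),
    (2, 1, 2): (1.945753, 1.898756),
    (4, 1, 4): (6.821532, 6.004539),
    (8, 2, 4): (17.519901, 16.581135),
    (16, 2, 8): (37.04392, 34.483058),
    (32, 4, 8): (80.003087, 70.175879),
}


def efficiency(speedup, nprocs):
    """Parallel efficiency: 1.0 is ideal, above 1.0 is superlinear."""
    return speedup / nprocs


def superlinear(table):
    """Configurations whose Solving efficiency exceeds 1.0."""
    return [cfg for cfg, (_, solving) in table.items()
            if efficiency(solving, cfg[0]) > 1.0]


for cfg, (matmult, solving) in speedups.items():
    n = cfg[0]
    print(f"{n:>3} procs: MatMult eff = {efficiency(matmult, n):.2f}, "
          f"Solving eff = {efficiency(solving, n):.2f}")
```

With these rows, every configuration from 4 processes upward comes out above 1.0 for Solving, which is the "strange" part of the result.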

And here are the STREAMS results on a compute node (number of MPI processes, rate, and speedup relative to one process):

1   8291.4887   Rate (MB/s)
2   8739.3219   Rate (MB/s) 1.05401
3  24769.5868   Rate (MB/s) 2.98735
4  31962.0242   Rate (MB/s) 3.8548
5  39603.8828   Rate (MB/s) 4.77645
6  47777.7385   Rate (MB/s) 5.76226
7  54557.5363   Rate (MB/s) 6.57994
8  62769.3910   Rate (MB/s) 7.57034
9  38649.9160   Rate (MB/s) 4.6614
10  58976.9536   Rate (MB/s) 7.11295
11  48108.7801   Rate (MB/s) 5.80219
12  49506.8213   Rate (MB/s) 5.9708
13  54810.5266   Rate (MB/s) 6.61046
14  62471.5234   Rate (MB/s) 7.53441
15  63968.0218   Rate (MB/s) 7.7149
16  69644.8615   Rate (MB/s) 8.39956
17  60791.9544   Rate (MB/s) 7.33185
18  65476.5162   Rate (MB/s) 7.89683
19  60127.0683   Rate (MB/s) 7.25166
20  72052.5175   Rate (MB/s) 8.68994
21  62045.7745   Rate (MB/s) 7.48307
22  64517.7771   Rate (MB/s) 7.7812
23  69570.2935   Rate (MB/s) 8.39057
24  69673.8328   Rate (MB/s) 8.40305
25  75196.7514   Rate (MB/s) 9.06915
26  72304.2685   Rate (MB/s) 8.7203
27  73234.1616   Rate (MB/s) 8.83245
28  74041.3842   Rate (MB/s) 8.9298
29  77117.3751   Rate (MB/s) 9.30079
30  78293.8496   Rate (MB/s) 9.44268
31  81377.0870   Rate (MB/s) 9.81453
32  84097.0813   Rate (MB/s) 10.1426
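One way to read the STREAMS output against the question in the quoted thread below (whether a single process gets the whole memory bandwidth) is to compare the one-process rate with the highest rate measured on the node. This is a rough sketch using a subset of the rates listed above; the function names are hypothetical, and it assumes the largest measured rate is a fair stand-in for the node's achievable bandwidth.

```python
# STREAMS rates in MB/s copied from the listing above (subset of process counts).
stream_rate_mb_s = {
    1: 8291.4887,
    2: 8739.3219,
    4: 31962.0242,
    8: 62769.3910,
    16: 69644.8615,
    32: 84097.0813,
}


def bandwidth_speedup(rates, nprocs):
    """Bandwidth relative to the one-process run (the last column of the listing)."""
    return rates[nprocs] / rates[1]


def single_process_fraction(rates):
    """Share of the best measured rate that one process reaches on its own.

    Assumption: the maximum measured rate approximates the node's achievable bandwidth.
    """
    return rates[1] / max(rates.values())


for n in stream_rate_mb_s:
    print(f"{n:>2} procs: {bandwidth_speedup(stream_rate_mb_s, n):.2f}x")
print(f"one process reaches {single_process_fraction(stream_rate_mb_s):.1%} of the best rate")
```

On these numbers a single process reaches roughly a tenth of the best measured rate, so a one-process baseline is heavily bandwidth-starved compared with a fully populated node.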


Best,
Ce

On Tue, Jul 12, 2022 at 22:11, Mark Adams <mfadams at lbl.gov> wrote:

> You may get more memory bandwidth with 32 processors vs 1, as Ce mentioned.
> Depends on the architecture.
> Do you get the whole memory bandwidth on one processor on this machine?
>
> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168 at gmail.com> wrote:
>>
>>>
>>>
>>>>>> The linear system is complex-valued. We rewrite it into its real form
>>>>>> and solve it using FGMRES and an optimal block-diagonal
>>>>>> preconditioner.
>>>>>> We use CG and the AMS preconditioner implemented in HYPRE to solve the
>>>>>> smaller real linear system arising from applying the block
>>>>>> preconditioner.
>>>>>> The iteration counts of FGMRES and CG stay almost constant in all the
>>>>>> runs.
>>>>>>
>>>>>
>>>>> So those blocks decrease in size as you add more processes?
>>>>>
>>>>>
>>>>
>>> I am sorry for the unclear description of the block-diagonal
>>> preconditioner.
>>> Let K be the original complex system matrix, and let
>>> A = [Kr, -Ki; -Ki, -Kr] be the equivalent real form of K.
>>> Let P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal
>>> preconditioner for A. In our implementation, only Kr, Ki and Kr+Ki
>>> are explicitly stored as MATMPIAIJ. We use MATSHELL to represent A and P.
>>> We use FGMRES + P to solve Ax=b, and CG + AMS to
>>> solve (Kr+Ki)y=c. So the block size never changes.
>>>
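To make the real form concrete, here is a minimal pure-Python check that A = [Kr, -Ki; -Ki, -Kr] acting on [xr; xi] reproduces the complex product Kx, with the right-hand side taken as [br; -bi]. The 2x2 matrices and the vector are made-up values chosen only for illustration, and the helper names are hypothetical; nothing here reflects the actual PETSc MATSHELL implementation.

```python
def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def real_form(Kr, Ki):
    """Assemble A = [Kr, -Ki; -Ki, -Kr] as a dense nested list."""
    n = len(Kr)
    top = [Kr[i] + [-k for k in Ki[i]] for i in range(n)]
    bottom = [[-k for k in Ki[i]] + [-k for k in Kr[i]] for i in range(n)]
    return top + bottom


# Made-up 2x2 real and imaginary parts (illustration only).
Kr = [[4.0, 1.0], [1.0, 3.0]]
Ki = [[2.0, 0.5], [0.5, 1.0]]
x = [1.0 + 2.0j, -1.0 + 0.5j]

# Complex product b = K x with K = Kr + i*Ki.
K = [[complex(Kr[i][j], Ki[i][j]) for j in range(2)] for i in range(2)]
b = matvec(K, x)

# Real form: A [xr; xi] should equal [br; -bi].
A = real_form(Kr, Ki)
lhs = matvec(A, [z.real for z in x] + [z.imag for z in x])
rhs = [z.real for z in b] + [-z.imag for z in b]
residual = max(abs(l - r) for l, r in zip(lhs, rhs))
print("max |A [xr; xi] - [br; -bi]| =", residual)
```

The sign flip on the imaginary part of the right-hand side is what allows the second block row to be written as [-Ki, -Kr]. Applying P then amounts to two independent solves with Kr+Ki, one per block.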
>>
>> Then we have to break down the timings further. I suspect AMS is not
>> taking as long, since
>> all other operations scale like N.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>
>>> Best,
>>> Ce
>>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>