[petsc-users] Strange strong scaling result
Barry Smith
bsmith at petsc.dev
Tue Jul 12 11:18:49 CDT 2022
The streams numbers
1 8291.4887 Rate (MB/s)
2 8739.3219 Rate (MB/s) 1.05401
3 24769.5868 Rate (MB/s) 2.98735
4 31962.0242 Rate (MB/s) 3.8548
5 39603.8828 Rate (MB/s) 4.77645
6 47777.7385 Rate (MB/s) 5.76226
7 54557.5363 Rate (MB/s) 6.57994
8 62769.3910 Rate (MB/s) 7.57034
9 38649.9160 Rate (MB/s) 4.6614
indicate that the MPI launcher is doing a poor job of binding MPI ranks to cores (the last number on each line is the speedup relative to a single rank); you should read up on the binding options of your particular mpiexec and select good ones. Unfortunately, there is no standard for setting bindings, and each MPI implementation changes its options constantly, so you need to determine them exactly for your machine and MPI implementation. Basically, you want to place each MPI rank on a node as far away as possible, in terms of memory domains, from the other ranks. Note that going from 1 to 2 ranks gives essentially no speedup, which suggests the first two ranks are being placed very close together (and thus share all of their memory resources).
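For example, with Open MPI you might try something like the lines below, and with MPICH's hydra launcher the analogous single-dash options. These are illustrative only; the exact spellings depend on your MPI implementation and version, so check the mpiexec man page (the application name ./your_app is just a placeholder):

    # Open MPI: spread ranks across sockets/NUMA domains and pin each rank to a core
    mpiexec -n 8 --map-by socket --bind-to core --report-bindings ./your_app
    # MPICH (hydra launcher):
    mpiexec -n 8 -map-by socket -bind-to core ./your_app

Re-running the streams benchmark (make streams from $PETSC_DIR) with the chosen binding is a quick way to check whether the speedup curve improves.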
As a side note, the raw numbers are very good (you get a speedup of 7.57 on 8 ranks, and the speedup climbs to about 10 at 32 ranks). This means that, with proper binding, you should get really good speedup for PETSc code on at least 8 cores per node.
Barry
> On Jul 12, 2022, at 11:32 AM, Ce Qin <qince168 at gmail.com> wrote:
>
> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1        1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946   4.65272    6.101214
>           8       2             4  10.584189  17.519901   8.59046   16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795  37.04392   18.135278  34.5448    18.575953  34.483058
>          16       4             4  22.00331   39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779  41.38289   18.354949  36.388144  18.828393  36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608  35.62153   71.471289  37.166421  71.533865
>
> and the streams results on the compute node:
>
> 1 8291.4887 Rate (MB/s)
> 2 8739.3219 Rate (MB/s) 1.05401
> 3 24769.5868 Rate (MB/s) 2.98735
> 4 31962.0242 Rate (MB/s) 3.8548
> 5 39603.8828 Rate (MB/s) 4.77645
> 6 47777.7385 Rate (MB/s) 5.76226
> 7 54557.5363 Rate (MB/s) 6.57994
> 8 62769.3910 Rate (MB/s) 7.57034
> 9 38649.9160 Rate (MB/s) 4.6614
> 10 58976.9536 Rate (MB/s) 7.11295
> 11 48108.7801 Rate (MB/s) 5.80219
> 12 49506.8213 Rate (MB/s) 5.9708
> 13 54810.5266 Rate (MB/s) 6.61046
> 14 62471.5234 Rate (MB/s) 7.53441
> 15 63968.0218 Rate (MB/s) 7.7149
> 16 69644.8615 Rate (MB/s) 8.39956
> 17 60791.9544 Rate (MB/s) 7.33185
> 18 65476.5162 Rate (MB/s) 7.89683
> 19 60127.0683 Rate (MB/s) 7.25166
> 20 72052.5175 Rate (MB/s) 8.68994
> 21 62045.7745 Rate (MB/s) 7.48307
> 22 64517.7771 Rate (MB/s) 7.7812
> 23 69570.2935 Rate (MB/s) 8.39057
> 24 69673.8328 Rate (MB/s) 8.40305
> 25 75196.7514 Rate (MB/s) 9.06915
> 26 72304.2685 Rate (MB/s) 8.7203
> 27 73234.1616 Rate (MB/s) 8.83245
> 28 74041.3842 Rate (MB/s) 8.9298
> 29 77117.3751 Rate (MB/s) 9.30079
> 30 78293.8496 Rate (MB/s) 9.44268
> 31 81377.0870 Rate (MB/s) 9.81453
> 32 84097.0813 Rate (MB/s) 10.1426
>
>
> Best,
> Ce
>
> Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 22:11:
> You may get more memory bandwidth with 32 processors than with 1, as Ce mentioned.
> It depends on the architecture.
> Do you get the whole memory bandwidth with a single process on this machine?
>
> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com> wrote:
> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168 at gmail.com> wrote:
>
>
> The linear system is complex-valued. We rewrite it in its equivalent real form
> and solve it using FGMRES with an optimal block-diagonal preconditioner.
> We use CG with the AMS preconditioner implemented in HYPRE to solve the
> smaller real linear system that arises from applying the block preconditioner.
> The iteration counts of FGMRES and CG stay almost constant across all the runs.
>
> So those blocks decrease in size as you add more processes?
>
>
> I am sorry for the unclear description of the block-diagonal preconditioner.
> Let K be the original complex system matrix; A = [Kr, -Ki; -Ki, -Kr] is the equivalent
> real form of K. Let P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal
> preconditioner for A. In our implementation, only Kr, Ki and Kr+Ki
> are explicitly stored as MATMPIAIJ; we use MATSHELL to represent A and P.
> We use FGMRES + P to solve Ax = b, and CG + AMS to
> solve (Kr+Ki)y = c. So the block size never changes.
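For illustration, here is a minimal sketch of what such a MATSHELL wrapper for A = [Kr, -Ki; -Ki, -Kr] can look like. It is not the code discussed above; the context struct and names (ShellCtx, isr, isi, w) are hypothetical, and error handling is reduced to PetscCall.

#include <petscksp.h>

/* Hypothetical context for the shell matrix A = [Kr, -Ki; -Ki, -Kr];
   Kr and Ki are the real and imaginary parts of K, stored as MATMPIAIJ. */
typedef struct {
  Mat Kr, Ki;   /* real and imaginary blocks                      */
  IS  isr, isi; /* index sets selecting the two halves of x and y */
  Vec w;        /* work vector with the row layout of Kr          */
} ShellCtx;

/* y = A x with A = [Kr, -Ki; -Ki, -Kr], using only Kr and Ki */
static PetscErrorCode MatMult_A(Mat A, Vec x, Vec y)
{
  ShellCtx *ctx;
  Vec       xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &ctx));
  PetscCall(VecGetSubVector(x, ctx->isr, &xr));
  PetscCall(VecGetSubVector(x, ctx->isi, &xi));
  PetscCall(VecGetSubVector(y, ctx->isr, &yr));
  PetscCall(VecGetSubVector(y, ctx->isi, &yi));
  /* yr = Kr*xr - Ki*xi */
  PetscCall(MatMult(ctx->Kr, xr, yr));
  PetscCall(MatMult(ctx->Ki, xi, ctx->w));
  PetscCall(VecAXPY(yr, -1.0, ctx->w));
  /* yi = -Ki*xr - Kr*xi */
  PetscCall(MatMult(ctx->Ki, xr, yi));
  PetscCall(VecScale(yi, -1.0));
  PetscCall(MatMult(ctx->Kr, xi, ctx->w));
  PetscCall(VecAXPY(yi, -1.0, ctx->w));
  PetscCall(VecRestoreSubVector(y, ctx->isr, &yr));
  PetscCall(VecRestoreSubVector(y, ctx->isi, &yi));
  PetscCall(VecRestoreSubVector(x, ctx->isr, &xr));
  PetscCall(VecRestoreSubVector(x, ctx->isi, &xi));
  PetscFunctionReturn(0);
}

/* Wrap the action in a shell matrix; m/M are the local/global sizes of A. */
static PetscErrorCode CreateShellA(MPI_Comm comm, PetscInt m, PetscInt M, ShellCtx *ctx, Mat *A)
{
  PetscFunctionBeginUser;
  PetscCall(MatCreateShell(comm, m, m, M, M, ctx, A));
  PetscCall(MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))MatMult_A));
  PetscFunctionReturn(0);
}

The preconditioner P = [Kr+Ki, 0; 0, Kr+Ki] can be applied analogously through a PCSHELL whose apply routine runs the CG + AMS (PCHYPRE) solve with Kr+Ki on each half of the vector, matching the description above; since only the shell's action is distributed, the block structure is indeed independent of the number of processes.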
>
> Then we have to break down the timings further. I suspect AMS is not taking as long, since
> all other operations scale like N.
>
> Thanks,
>
> Matt
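One way to act on the suggestion above to break the timings down, sketched here for illustration only (the stage names mirror the columns of the table above; this is not code from the thread): register separate logging stages around assembly, the AMS setup, and the solve, and run with -log_view so PETSc reports the time, flop rate, and message counts of each event per stage.

#include <petscsys.h>

/* Illustrative only: wrap the phases that appear in the speedup table in
   separate logging stages so -log_view reports them individually. */
static PetscErrorCode ProfilePhases(void)
{
  PetscLogStage assembly, setup, solve;

  PetscFunctionBeginUser;
  PetscCall(PetscLogStageRegister("Assembly", &assembly));
  PetscCall(PetscLogStageRegister("SetupAMS", &setup));
  PetscCall(PetscLogStageRegister("Solve", &solve));

  PetscCall(PetscLogStagePush(assembly));
  /* ... assemble Kr, Ki and Kr+Ki here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(setup));
  /* ... KSPSetUp()/PCSetUp() for the CG + AMS solver here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(solve));
  /* ... KSPSolve() here ... */
  PetscCall(PetscLogStagePop());
  PetscFunctionReturn(0);
}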
>
>
> Best,
> Ce
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/