[petsc-users] Strange strong scaling result
Barry Smith
bsmith at petsc.dev
Tue Jul 12 11:18:49 CDT 2022
The streams numbers
1 8291.4887 Rate (MB/s)
2 8739.3219 Rate (MB/s) 1.05401
3 24769.5868 Rate (MB/s) 2.98735
4 31962.0242 Rate (MB/s) 3.8548
5 39603.8828 Rate (MB/s) 4.77645
6 47777.7385 Rate (MB/s) 5.76226
7 54557.5363 Rate (MB/s) 6.57994
8 62769.3910 Rate (MB/s) 7.57034
9 38649.9160 Rate (MB/s) 4.6614
indicate that the MPI launcher is doing a poor job of binding MPI ranks to cores (the last number on each line is the speedup relative to a single rank); you should read up on the binding options of your particular mpiexec and select good ones. Unfortunately, there is no standard for setting bindings, and each MPI implementation changes its options constantly, so you need to determine them exactly for your machine and MPI implementation. Basically, you want to place each MPI rank on a node as far away as possible, in terms of memory domains, from the other ranks. Note that going from 1 to 2 ranks gives essentially no speedup, which suggests the first two ranks are being placed very close together (and thus share all of their memory resources).
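For example, with Open MPI you might try something like the lines below, and with MPICH's hydra launcher the analogous single-dash options. These are illustrative only; the exact spellings depend on your MPI implementation and version, so check the mpiexec man page (the application name ./your_app is just a placeholder):

    # Open MPI: spread ranks across sockets/NUMA domains and pin each rank to a core
    mpiexec -n 8 --map-by socket --bind-to core --report-bindings ./your_app
    # MPICH (hydra launcher):
    mpiexec -n 8 -map-by socket -bind-to core ./your_app

Re-running the streams benchmark (make streams from $PETSC_DIR) with the chosen binding is a quick way to check whether the speedup curve improves.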
As a side note, the raw numbers are very good (you get a speedup of 7.57 on 8 ranks, and the speedup climbs to about 10 at 32 ranks). This means that, with proper binding, you should get really good speedup for PETSc code on at least 8 cores per node.
Barry
> On Jul 12, 2022, at 11:32 AM, Ce Qin <qince168 at gmail.com> wrote:
>
> For your reference, I also calculated the speedups for other procedures:
>
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
>           1       1             1        1.0        1.0        1.0        1.0        1.0        1.0
>           2       1             2   1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>           2       2             1   2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
>           4       1             4   4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>           4       2             2   4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>           4       4             1   4.480902   7.210629   3.471541   6.082946   4.65272    6.101214
>           8       2             4  10.584189  17.519901   8.59046   16.615395   9.380985  16.581135
>           8       4             2  10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>           8       8             1  11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
>          16       2             8  21.929795  37.04392   18.135278  34.5448    18.575953  34.483058
>          16       4             4  22.00331   39.581504  18.011148  34.793732  18.745129  34.854409
>          16       8             2  22.692779  41.38289   18.354949  36.388144  18.828393  36.45509
>          32       4             8  43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>          32       8             4  44.387091  80.807608  35.62153   71.471289  37.166421  71.533865
>
> and the streams results on the compute node:
>
> 1 8291.4887 Rate (MB/s)
> 2 8739.3219 Rate (MB/s) 1.05401
> 3 24769.5868 Rate (MB/s) 2.98735
> 4 31962.0242 Rate (MB/s) 3.8548
> 5 39603.8828 Rate (MB/s) 4.77645
> 6 47777.7385 Rate (MB/s) 5.76226
> 7 54557.5363 Rate (MB/s) 6.57994
> 8 62769.3910 Rate (MB/s) 7.57034
> 9 38649.9160 Rate (MB/s) 4.6614
> 10 58976.9536 Rate (MB/s) 7.11295
> 11 48108.7801 Rate (MB/s) 5.80219
> 12 49506.8213 Rate (MB/s) 5.9708
> 13 54810.5266 Rate (MB/s) 6.61046
> 14 62471.5234 Rate (MB/s) 7.53441
> 15 63968.0218 Rate (MB/s) 7.7149
> 16 69644.8615 Rate (MB/s) 8.39956
> 17 60791.9544 Rate (MB/s) 7.33185
> 18 65476.5162 Rate (MB/s) 7.89683
> 19 60127.0683 Rate (MB/s) 7.25166
> 20 72052.5175 Rate (MB/s) 8.68994
> 21 62045.7745 Rate (MB/s) 7.48307
> 22 64517.7771 Rate (MB/s) 7.7812
> 23 69570.2935 Rate (MB/s) 8.39057
> 24 69673.8328 Rate (MB/s) 8.40305
> 25 75196.7514 Rate (MB/s) 9.06915
> 26 72304.2685 Rate (MB/s) 8.7203
> 27 73234.1616 Rate (MB/s) 8.83245
> 28 74041.3842 Rate (MB/s) 8.9298
> 29 77117.3751 Rate (MB/s) 9.30079
> 30 78293.8496 Rate (MB/s) 9.44268
> 31 81377.0870 Rate (MB/s) 9.81453
> 32 84097.0813 Rate (MB/s) 10.1426
>
>
> Best,
> Ce
>
> Mark Adams <mfadams at lbl.gov> wrote on Tue, Jul 12, 2022 at 22:11:
> You may get more memory bandwidth with 32 processors than with 1, as Ce mentioned.
> It depends on the architecture.
> Do you get the whole memory bandwidth with a single process on this machine?
>
> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com> wrote:
> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <qince168 at gmail.com> wrote:
>
>
> The linear system is complex-valued. We rewrite it in its equivalent real form
> and solve it using FGMRES with an optimal block-diagonal preconditioner.
> We use CG with the AMS preconditioner implemented in HYPRE to solve the
> smaller real linear system that arises from applying the block preconditioner.
> The iteration counts of FGMRES and CG stay almost constant across all the runs.
>
> So those blocks decrease in size as you add more processes?
>
>
> I am sorry for the unclear description of the block-diagonal preconditioner.
> Let K be the original complex system matrix; A = [Kr, -Ki; -Ki, -Kr] is the equivalent
> real form of K. Let P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved that P is an optimal
> preconditioner for A. In our implementation, only Kr, Ki and Kr+Ki
> are explicitly stored as MATMPIAIJ; we use MATSHELL to represent A and P.
> We use FGMRES + P to solve Ax = b, and CG + AMS to
> solve (Kr+Ki)y = c. So the block size never changes.
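For illustration, here is a minimal sketch of what such a MATSHELL wrapper for A = [Kr, -Ki; -Ki, -Kr] can look like. It is not the code discussed above; the context struct and names (ShellCtx, isr, isi, w) are hypothetical, and error handling is reduced to PetscCall.

#include <petscksp.h>

/* Hypothetical context for the shell matrix A = [Kr, -Ki; -Ki, -Kr];
   Kr and Ki are the real and imaginary parts of K, stored as MATMPIAIJ. */
typedef struct {
  Mat Kr, Ki;   /* real and imaginary blocks                      */
  IS  isr, isi; /* index sets selecting the two halves of x and y */
  Vec w;        /* work vector with the row layout of Kr          */
} ShellCtx;

/* y = A x with A = [Kr, -Ki; -Ki, -Kr], using only Kr and Ki */
static PetscErrorCode MatMult_A(Mat A, Vec x, Vec y)
{
  ShellCtx *ctx;
  Vec       xr, xi, yr, yi;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &ctx));
  PetscCall(VecGetSubVector(x, ctx->isr, &xr));
  PetscCall(VecGetSubVector(x, ctx->isi, &xi));
  PetscCall(VecGetSubVector(y, ctx->isr, &yr));
  PetscCall(VecGetSubVector(y, ctx->isi, &yi));
  /* yr = Kr*xr - Ki*xi */
  PetscCall(MatMult(ctx->Kr, xr, yr));
  PetscCall(MatMult(ctx->Ki, xi, ctx->w));
  PetscCall(VecAXPY(yr, -1.0, ctx->w));
  /* yi = -Ki*xr - Kr*xi */
  PetscCall(MatMult(ctx->Ki, xr, yi));
  PetscCall(VecScale(yi, -1.0));
  PetscCall(MatMult(ctx->Kr, xi, ctx->w));
  PetscCall(VecAXPY(yi, -1.0, ctx->w));
  PetscCall(VecRestoreSubVector(y, ctx->isr, &yr));
  PetscCall(VecRestoreSubVector(y, ctx->isi, &yi));
  PetscCall(VecRestoreSubVector(x, ctx->isr, &xr));
  PetscCall(VecRestoreSubVector(x, ctx->isi, &xi));
  PetscFunctionReturn(0);
}

/* Wrap the action in a shell matrix; m/M are the local/global sizes of A. */
static PetscErrorCode CreateShellA(MPI_Comm comm, PetscInt m, PetscInt M, ShellCtx *ctx, Mat *A)
{
  PetscFunctionBeginUser;
  PetscCall(MatCreateShell(comm, m, m, M, M, ctx, A));
  PetscCall(MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))MatMult_A));
  PetscFunctionReturn(0);
}

The preconditioner P = [Kr+Ki, 0; 0, Kr+Ki] can be applied analogously through a PCSHELL whose apply routine runs the CG + AMS (PCHYPRE) solve with Kr+Ki on each half of the vector, matching the description above; since only the shell's action is distributed, the block structure is indeed independent of the number of processes.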
>
> Then we have to break down the timings further. I suspect AMS is not taking as long, since
> all other operations scale like N.
>
> Thanks,
>
> Matt
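One way to act on the suggestion above to break the timings down, sketched here for illustration only (the stage names mirror the columns of the table above; this is not code from the thread): register separate logging stages around assembly, the AMS setup, and the solve, and run with -log_view so PETSc reports the time, flop rate, and message counts of each event per stage.

#include <petscsys.h>

/* Illustrative only: wrap the phases that appear in the speedup table in
   separate logging stages so -log_view reports them individually. */
static PetscErrorCode ProfilePhases(void)
{
  PetscLogStage assembly, setup, solve;

  PetscFunctionBeginUser;
  PetscCall(PetscLogStageRegister("Assembly", &assembly));
  PetscCall(PetscLogStageRegister("SetupAMS", &setup));
  PetscCall(PetscLogStageRegister("Solve", &solve));

  PetscCall(PetscLogStagePush(assembly));
  /* ... assemble Kr, Ki and Kr+Ki here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(setup));
  /* ... KSPSetUp()/PCSetUp() for the CG + AMS solver here ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(solve));
  /* ... KSPSolve() here ... */
  PetscCall(PetscLogStagePop());
  PetscFunctionReturn(0);
}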
>
>
> Best,
> Ce
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/