[petsc-users] Poor weak scaling when solving successive linear systems

Junchao Zhang jczhang at mcs.anl.gov
Mon Jun 4 14:56:10 CDT 2018


Michael, I can compile and run your test. I am now profiling it. Thanks.

--Junchao Zhang

On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
Michael.Becker at physik.uni-giessen.de> wrote:

> Hello again,
> this took me longer than I anticipated, but here we go.
> I did reruns of the cases where only half the processes per node were used
> (without -log_sync):
>
>                    125 procs, 1st       125 procs, 2nd       1000 procs, 1st      1000 procs, 2nd
>                    Max        Ratio     Max        Ratio     Max        Ratio     Max        Ratio
> KSPSolve           1.203E+02  1.0       1.210E+02  1.0       1.399E+02  1.1       1.365E+02  1.0
> VecTDot            6.376E+00  3.7       6.551E+00  4.0       7.885E+00  2.9       7.175E+00  3.4
> VecNorm            4.579E+00  7.1       5.803E+00  10.2      8.534E+00  6.9       6.026E+00  4.9
> VecScale           1.070E-01  2.1       1.129E-01  2.2       1.301E-01  2.5       1.270E-01  2.4
> VecCopy            1.123E-01  1.3       1.149E-01  1.3       1.301E-01  1.6       1.359E-01  1.6
> VecSet             7.063E-01  1.7       6.968E-01  1.7       7.432E-01  1.8       7.425E-01  1.8
> VecAXPY            1.166E+00  1.4       1.167E+00  1.4       1.221E+00  1.5       1.279E+00  1.6
> VecAYPX            1.317E+00  1.6       1.290E+00  1.6       1.536E+00  1.9       1.499E+00  2.0
> VecScatterBegin    6.142E+00  3.2       5.974E+00  2.8       6.448E+00  3.0       6.472E+00  2.9
> VecScatterEnd      3.606E+01  4.2       3.551E+01  4.0       5.244E+01  2.7       4.995E+01  2.7
> MatMult            3.561E+01  1.6       3.403E+01  1.5       3.435E+01  1.4       3.332E+01  1.4
> MatMultAdd         1.124E+01  2.0       1.130E+01  2.1       2.093E+01  2.9       1.995E+01  2.7
> MatMultTranspose   1.372E+01  2.5       1.388E+01  2.6       1.477E+01  2.2       1.381E+01  2.1
> MatSolve           1.949E-02  0.0       1.653E-02  0.0       4.789E-02  0.0       4.466E-02  0.0
> MatSOR             6.610E+01  1.3       6.673E+01  1.3       7.111E+01  1.3       7.105E+01  1.3
> MatResidual        2.647E+01  1.7       2.667E+01  1.7       2.446E+01  1.4       2.467E+01  1.5
> PCSetUpOnBlocks    5.266E-03  1.4       5.295E-03  1.4       5.427E-03  1.5       5.289E-03  1.4
> PCApply            1.031E+02  1.0       1.035E+02  1.0       1.180E+02  1.0       1.164E+02  1.0
>
> I also slimmed down my code and wrote a simple weak scaling test
> (source files attached) so you can profile it yourself. I appreciate the
> offer, Junchao, thank you.
> You can adjust the system size per processor at runtime via
> "-nodes_per_proc 30" and the number of repeated calls to the function
> containing KSPSolve() via "-iterations 1000". The physical problem is
> simply calculating the electric potential from a homogeneous charge
> distribution, solved repeatedly to accumulate time in KSPSolve().
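>
> In rough outline the driver does the following (a condensed sketch, not the
> attached sources; the 3D Poisson assembly is replaced by a small 1D Laplacian
> and error checking is omitted just to keep it short):
>
>   #include <petscksp.h>
>
>   int main(int argc, char **argv)
>   {
>     Mat           A;
>     Vec           x, b;
>     KSP           ksp;
>     PetscInt      i, n = 1000, iterations = 1000, Istart, Iend;
>     PetscLogStage stage1, stage2;
>
>     PetscInitialize(&argc, &argv, NULL, NULL);
>     PetscOptionsGetInt(NULL, NULL, "-iterations", &iterations, NULL);
>
>     /* stand-in operator; the real test assembles the 3D Poisson matrix */
>     MatCreate(PETSC_COMM_WORLD, &A);
>     MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
>     MatSetFromOptions(A);
>     MatSetUp(A);
>     MatGetOwnershipRange(A, &Istart, &Iend);
>     for (i = Istart; i < Iend; i++) {
>       if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
>       if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
>       MatSetValue(A, i, i, 2.0, INSERT_VALUES);
>     }
>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>     MatCreateVecs(A, &x, &b);
>     VecSet(b, 1.0);               /* "homogeneous charge" right-hand side */
>
>     KSPCreate(PETSC_COMM_WORLD, &ksp);
>     KSPSetOperators(ksp, A, A);
>     KSPSetFromOptions(ksp);       /* picks up -ksp_type cg -pc_type gamg ... */
>
>     PetscLogStageRegister("First Solve", &stage1);
>     PetscLogStageRegister("Remaining Solves", &stage2);
>
>     PetscLogStagePush(stage1);
>     KSPSolve(ksp, b, x);          /* setup costs land in this stage */
>     PetscLogStagePop();
>
>     PetscLogStagePush(stage2);
>     for (i = 1; i < iterations; i++) KSPSolve(ksp, b, x);  /* time accumulates here */
>     PetscLogStagePop();
>
>     KSPDestroy(&ksp);
>     MatDestroy(&A);
>     VecDestroy(&x);
>     VecDestroy(&b);
>     PetscFinalize();
>     return 0;
>   }
>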
> A job would be started using something like
>
> mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 \
>  -iterations 1000 \
>  -ksp_rtol 1E-6 \
>  -log_view -log_sync \
>  -pc_type gamg -pc_gamg_type classical \
>  -ksp_type cg \
>  -ksp_norm_type unpreconditioned \
>  -mg_levels_ksp_type richardson \
>  -mg_levels_ksp_norm_type none \
>  -mg_levels_pc_type sor \
>  -mg_levels_ksp_max_it 1 \
>  -mg_levels_pc_sor_its 1 \
>  -mg_levels_esteig_ksp_type cg \
>  -mg_levels_esteig_ksp_max_it 10 \
>  -gamg_est_ksp_type cg
>
> ideally on a cube number of processes so that the process grid is cubical.
> Using 125 processes and 10,000 iterations I get the output in
> "log_view_125_new.txt", which shows the same imbalance for me.
>
> Michael
>
>
> On 02.06.2018 at 13:40, Mark Adams wrote:
>
>
>
> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
>
>> Hi, Michael,
>>   You can add -log_sync besides -log_view, which adds barriers to certain
>> events but measures barrier time separately from the events. I find this
>> option makes it easier to interpret log_view output.
>>
>
> That is great (good to know).
>
> This should give us a better idea whether your large VecScatter costs come
> from slow communication or whether they are catching some sort of load imbalance.
>
>
>>
>> --Junchao Zhang
>>
>> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de> wrote:
>>
>>> Barry: On its way. Could take a couple of days again.
>>>
>>> Junchao: I unfortunately don't have access to a cluster with a faster
>>> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree
>>> network, which I realize causes parallel slowdown if the nodes are not
>>> connected to the same switch. Each node has 24 cores (two sockets with 12
>>> cores each) and four NUMA domains (two per socket).
>>> The ranks are usually not distributed perfectly evenly; e.g., for 125
>>> processes on the six required nodes, five nodes would use 21 cores and one
>>> would use 20.
>>> Would using another CPU type make a difference communication-wise? I
>>> could switch to faster ones (on the same network), but I always assumed
>>> this would only improve performance of the stuff that is unrelated to
>>> communication.
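>>>
>>> (If it helps, how the ranks actually land on the nodes can be checked with
>>> a few lines of plain MPI; this is just a generic check, not part of the
>>> attached test:)
>>>
>>>   #include <mpi.h>
>>>   #include <stdio.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>     int  rank, size, len;
>>>     char node[MPI_MAX_PROCESSOR_NAME];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     MPI_Get_processor_name(node, &len);
>>>     /* one line per rank: which node it ended up on */
>>>     printf("rank %d of %d on %s\n", rank, size, node);
>>>     MPI_Finalize();
>>>     return 0;
>>>   }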
>>>
>>> Michael
>>>
>>>
>>>
>>> The log files have something like "Average time for zero size
>>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very
>>> slow network. A typical machine should give less than 1/10 of the latency
>>> you have. An easy way to check is to just run the code on a machine with a
>>> faster network and see what happens.
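>>>
>>> (That figure is essentially a zero-size message latency. A generic
>>> ping-pong like the sketch below, which is not the exact code behind that
>>> log line, measures something comparable; run it with at least two ranks,
>>> ideally placed on different nodes:)
>>>
>>>   #include <mpi.h>
>>>   #include <stdio.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>     int    rank, i, reps = 10000;
>>>     double t0, t1;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>     t0 = MPI_Wtime();
>>>     for (i = 0; i < reps; i++) {
>>>       if (rank == 0) {        /* rank 0 sends, waits for the echo */
>>>         MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
>>>         MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>       } else if (rank == 1) { /* rank 1 echoes the empty message */
>>>         MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
>>>       }
>>>     }
>>>     t1 = MPI_Wtime();
>>>     if (rank == 0)
>>>       printf("average one-way zero-size latency: %g s\n", (t1 - t0) / (2.0 * reps));
>>>     MPI_Finalize();
>>>     return 0;
>>>   }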
>>>
>>> Also, how many cores & NUMA domains does a compute node have? I could
>>> not figure out how you distributed the 125 MPI ranks evenly.
>>>
>>> --Junchao Zhang
>>>
>>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>
>>>> Hello again,
>>>>
>>>> here are the updated log_view files for 125 and 1000 processors. I ran
>>>> both problems twice, the first time with all processors per node allocated
>>>> ("-1.txt"), the second with only half on twice the number of nodes
>>>> ("-2.txt").
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker <Michael.Becker at physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>>    Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(); could you please point us to the code that makes you think they are being created and destroyed?   We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
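>>>>
>>>>    (In code the intended pattern looks roughly like the sketch below; the
>>>> matrix/vector setup is assumed to happen elsewhere and the names are
>>>> illustrative:)
>>>>
>>>>    #include <petscksp.h>
>>>>
>>>>    /* sketch: reuse one KSP across many solves; A, b, x assembled elsewhere */
>>>>    PetscErrorCode SolveManyTimes(Mat A, Vec b, Vec x, PetscInt nsteps)
>>>>    {
>>>>      KSP      ksp;
>>>>      PetscInt step;
>>>>
>>>>      KSPCreate(PETSC_COMM_WORLD, &ksp);
>>>>      KSPSetOperators(ksp, A, A);
>>>>      KSPSetFromOptions(ksp);
>>>>      KSPSetUp(ksp);            /* work vectors created here (or in the first solve) */
>>>>      for (step = 0; step < nsteps; step++) {
>>>>        KSPSolve(ksp, b, x);    /* no per-iteration vector creation expected here */
>>>>      }
>>>>      KSPDestroy(&ksp);         /* work vectors destroyed here (or by KSPReset()) */
>>>>      return 0;
>>>>    }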
>>>>
>>>>
>>>> I mean this, right in the log_view output:
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type        Creations   Destructions   Memory         Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> ...
>>>>
>>>> --- Event Stage 1: First Solve
>>>>
>>>> ...
>>>>
>>>> --- Event Stage 2: Remaining Solves
>>>>
>>>> Vector             23904       23904          1295501184     0.
>>>>
>>>> I logged the exact number of KSP iterations over the 999 timesteps and
>>>> it's exactly 23904/6 = 3984.
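>>>>
>>>> (One simple way to get that count is to sum KSPGetIterationNumber() after
>>>> each solve; a sketch, with ksp, b, x and the timestep loop assumed from the
>>>> surrounding code:)
>>>>
>>>>   PetscInt its, total_its = 0;
>>>>   for (step = 0; step < nsteps; step++) {
>>>>     KSPSolve(ksp, b, x);
>>>>     KSPGetIterationNumber(ksp, &its);   /* iterations of the last solve */
>>>>     total_its += its;
>>>>   }
>>>>   PetscPrintf(PETSC_COMM_WORLD, "total KSP iterations: %D\n", total_its);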
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>>>>
>>>>   Please send the log file for 1000 with cg as the solver.
>>>>
>>>>    You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell from the two logs you sent us, since they are for different solvers).
>>>>
>>>>
>>>>
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker <Michael.Becker at physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>>    Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(); could you please point us to the code that makes you think they are being created and destroyed?   We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
>>>>
>>>>
>>>>
>>>>
>>>> This seems kind of wasteful; is it supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).
>>>>
>>>>
>>>> Thanks in advance.
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> <log_view_125procs.txt><log_view_1000procs.txt>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>