[petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

Mark Adams mfadams at lbl.gov
Sat Jun 9 08:29:14 CDT 2018


-pc_gamg_type classical


FYI, we only support smoothed aggregation "agg" (the default). (This thread
started by saying you were using GAMG.)

It is not clear how much this will make a difference for you, but you don't
want to use classical because we do not support it. It is meant as a
reference implementation for developers.

First, how did you get the idea to use classical? If the documentation led
you to believe this was a good thing to do, then we need to fix that!

Anyway, here is a generic input for GAMG:

-pc_type gamg
-pc_gamg_type agg
-pc_gamg_agg_nsmooths 1
-pc_gamg_coarse_eq_limit 1000
-pc_gamg_reuse_interpolation true
-pc_gamg_square_graph 1
-pc_gamg_threshold 0.05
-pc_gamg_threshold_scale .0
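
If you prefer to set these in source code instead of on the command line, a
minimal sketch (assuming a KSP named ksp has already been created for your
system; this is not from Michael's code) would be:

    #include <petscksp.h>

    /* Push the recommended GAMG options into the options database and let the
       KSP pick them up; equivalent to passing the flags above on the command line. */
    PetscErrorCode SetGAMGDefaults(KSP ksp)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscOptionsSetValue(NULL,"-pc_type","gamg");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_type","agg");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_agg_nsmooths","1");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_coarse_eq_limit","1000");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_reuse_interpolation","true");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_square_graph","1");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_threshold","0.05");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL,"-pc_gamg_threshold_scale","0.0");CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* applies the options set above */
      PetscFunctionReturn(0);
    }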




On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:

> OK, I had thought that space was a typo. BTW, this option does not show
> up in -h.
> I changed the number of ranks to use all cores on each node to avoid
> misleading ratios in -log_view. Since one node has 36 cores, I ran with
> 6^3=216 ranks and 12^3=1728 ranks. I also found that the call counts of MatSOR
> etc. in the two tests were different, so they are not strict weak scaling tests.
> I tried adding -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
> two have the same MatSOR count. Anyway, I attached the load balance output.
>
> I find that PCApply_MG calls PCMGMCycle_Private, which is recursive and
> indirectly calls MatSOR_MPIAIJ. I believe the following code in
> MatSOR_MPIAIJ effectively syncs {MatSOR, MatMultAdd}_SeqAIJ between
> processors through VecScatter at each MG level. If SOR and MatMultAdd are
> imbalanced, the cost accumulates across the MG levels and shows up as a
> large VecScatter cost.
>
> while (its--) {
>   VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>   VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>
>   /* update rhs: bb1 = bb - B*x */
>   VecScale(mat->lvec,-1.0);
>   (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);
>
>   /* local sweep */
>   (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
> }
>
>
>
>
> --Junchao Zhang
>
> On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>>
>> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczhang at mcs.anl.gov>
>> wrote:
>> >
>> > Searched but could not find this option, -mat_view::load_balance
>>
>>    There is a space between the view and the ":". load_balance is a
>> particular viewer format that causes the printing of load balance
>> information about the number of nonzeros in the matrix.
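>>
>>    For example (the run command here is only illustrative), you would
>> append the option, with that space, to the existing run line:
>>
>>       mpirun -n 125 ./ws_test -nodes_per_proc 30 ... -mat_view ::load_balance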
>>
>>    Barry
>>
>> >
>> > --Junchao Zhang
>> >
>> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>> >  So the only surprise in the results is the SOR. It is embarrassingly
>> parallel and normally one would not see a jump.
>> >
>> >  The load balance ratio for SOR is better at 1000 processes (1.5) than at
>> > 125 processes (2.1), not worse, so this number doesn't easily explain it.
>> >
>> >  Could you run the 125 and 1000 with -mat_view ::load_balance and see
>> what you get out?
>> >
>> >    Thanks
>> >
>> >      Barry
>> >
>> >  Notice that the MatSOR time jumps a lot, about 5 secs, when
>> > -log_sync is on. My only guess is that the MatSOR is sharing memory
>> > bandwidth (or some other resource? cores?) with the VecScatter, and for
>> > some reason this is worse for 1000 cores, but I don't know why.
>> >
>> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>> wrote:
>> > >
>> > > Hi, PETSc developers,
>> > >  I tested Michael Becker's code. The code calls the same KSPSolve
>> 1000 times in the second stage and needs a cubic number of processes to run.
>> I ran with 125 ranks and 1000 ranks, with and without the -log_sync option. I
>> attach the log view output files and a scaling-loss Excel file.
>> > >  I profiled the code with 125 processors. It looks like {MatSOR, MatMult,
>> MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took
>> ~50% of the time; the other half was spent waiting in MPI.
>> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
>> > >  I tested it on a machine with 36 cores per node. I found that 32 ranks/node
>> gave better performance (about 10%) than 36 ranks/node in the 125-rank
>> test. I guess it is because the processors in the former case had more balanced
>> memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and
>> PAPI_TOT_CYC (total cycles) for the 125-rank case (see the attached files).
>> It looks like ranks at the two ends have fewer DP_OPS and TOT_CYC.
>> > >  Does anyone familiar with the algorithm have a quick explanation?
>> > >
>> > > --Junchao Zhang
>> > >
>> > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de> wrote:
>> > > Hello again,
>> > >
>> > > this took me longer than I anticipated, but here we go.
>> > > I did reruns of the cases where only half the processes per node were
>> used (without -log_sync):
>> > >
>> > >                    125 procs, 1st      125 procs, 2nd      1000 procs, 1st     1000 procs, 2nd
>> > >                    Max        Ratio    Max        Ratio    Max        Ratio    Max        Ratio
>> > > KSPSolve           1.203E+02  1.0      1.210E+02  1.0      1.399E+02  1.1      1.365E+02  1.0
>> > > VecTDot            6.376E+00  3.7      6.551E+00  4.0      7.885E+00  2.9      7.175E+00  3.4
>> > > VecNorm            4.579E+00  7.1      5.803E+00  10.2     8.534E+00  6.9      6.026E+00  4.9
>> > > VecScale           1.070E-01  2.1      1.129E-01  2.2      1.301E-01  2.5      1.270E-01  2.4
>> > > VecCopy            1.123E-01  1.3      1.149E-01  1.3      1.301E-01  1.6      1.359E-01  1.6
>> > > VecSet             7.063E-01  1.7      6.968E-01  1.7      7.432E-01  1.8      7.425E-01  1.8
>> > > VecAXPY            1.166E+00  1.4      1.167E+00  1.4      1.221E+00  1.5      1.279E+00  1.6
>> > > VecAYPX            1.317E+00  1.6      1.290E+00  1.6      1.536E+00  1.9      1.499E+00  2.0
>> > > VecScatterBegin    6.142E+00  3.2      5.974E+00  2.8      6.448E+00  3.0      6.472E+00  2.9
>> > > VecScatterEnd      3.606E+01  4.2      3.551E+01  4.0      5.244E+01  2.7      4.995E+01  2.7
>> > > MatMult            3.561E+01  1.6      3.403E+01  1.5      3.435E+01  1.4      3.332E+01  1.4
>> > > MatMultAdd         1.124E+01  2.0      1.130E+01  2.1      2.093E+01  2.9      1.995E+01  2.7
>> > > MatMultTranspose   1.372E+01  2.5      1.388E+01  2.6      1.477E+01  2.2      1.381E+01  2.1
>> > > MatSolve           1.949E-02  0.0      1.653E-02  0.0      4.789E-02  0.0      4.466E-02  0.0
>> > > MatSOR             6.610E+01  1.3      6.673E+01  1.3      7.111E+01  1.3      7.105E+01  1.3
>> > > MatResidual        2.647E+01  1.7      2.667E+01  1.7      2.446E+01  1.4      2.467E+01  1.5
>> > > PCSetUpOnBlocks    5.266E-03  1.4      5.295E-03  1.4      5.427E-03  1.5      5.289E-03  1.4
>> > > PCApply            1.031E+02  1.0      1.035E+02  1.0      1.180E+02  1.0      1.164E+02  1.0
>> > >
>> > > I also slimmed down my code and basically wrote a simple weak scaling
>> test (source files attached) so you can profile it yourself. I appreciate
>> the offer, Junchao, thank you.
>> > > You can adjust the system size per processor at runtime via
>> "-nodes_per_proc 30" and the number of repeated calls to the function
>> containing KSPSolve() via "-iterations 1000". The physical problem is
>> simply calculating the electric potential from a homogeneous charge
>> distribution, done multiple times to accumulate time in KSPSolve().
>> > > A job would be started using something like
>> > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \
>> > > -ksp_rtol 1E-6 \
>> > > -log_view -log_sync \
>> > > -pc_type gamg -pc_gamg_type classical \
>> > > -ksp_type cg \
>> > > -ksp_norm_type unpreconditioned \
>> > > -mg_levels_ksp_type richardson \
>> > > -mg_levels_ksp_norm_type none \
>> > > -mg_levels_pc_type sor \
>> > > -mg_levels_ksp_max_it 1 \
>> > > -mg_levels_pc_sor_its 1 \
>> > > -mg_levels_esteig_ksp_type cg \
>> > > -mg_levels_esteig_ksp_max_it 10 \
>> > > -gamg_est_ksp_type cg
>> > > This would ideally be started on a cube number of processes, for a
>> cubical process grid.
>> > > Using 125 processes and 10,000 iterations I get the output in
>> "log_view_125_new.txt", which shows the same imbalance for me.
>> > > Michael
>> > >
>> > >
>> > > On 02.06.2018 at 13:40, Mark Adams wrote:
>> > >>
>> > >>
>> > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>> wrote:
>> > >> Hi, Michael,
>> > >>  You can add -log_sync besides -log_view; it adds barriers to
>> certain events but measures the barrier time separately from the events. I find
>> this option makes it easier to interpret -log_view output.
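>> > >>
>> > >>  For example (the run line here is only illustrative), just append
>> both options to your usual command:
>> > >>
>> > >>     mpirun -n 125 ./ws_test <usual solver options> -log_view -log_sync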
>> > >>
>> > >> That is great (good to know).
>> > >>
>> > >> This should give us a better idea of whether your large VecScatter costs
>> are from slow communication or whether it is catching some sort of load imbalance.
>> > >>
>> > >>
>> > >> --Junchao Zhang
>> > >>
>> > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de> wrote:
>> > >> Barry: On its way. Could take a couple days again.
>> > >>
>> > >> Junchao: I unfortunately don't have access to a cluster with a
>> faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking
>> fat-tree network, which I realize causes parallel slowdown if the nodes are
>> not connected to the same switch. Each node has 24 processors (2x12/socket)
>> and four NUMA domains (two for each socket).
>> > >> The ranks are usually not distributed perfectly evenly, i.e. for 125
>> processes, of the six required nodes, five would use 21 cores and one would use 20.
>> > >> Would using another CPU type make a difference communication-wise? I
>> could switch to faster ones (on the same network), but I always assumed
>> this would only improve performance of the stuff that is unrelated to
>> communication.
>> > >>
>> > >> Michael
>> > >>
>> > >>
>> > >>
>> > >>> The log files have something like "Average time for zero size
>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very slow
>> network. A typical machine should give less than 1/10 of the latency you
>> have. An easy thing to try is just running the code on a machine with a
>> faster network and seeing what happens.
>> > >>>
>> > >>> Also, how many cores & numa domains does a compute node have? I
>> could not figure out how you distributed the 125 MPI ranks evenly.
>> > >>>
>> > >>> --Junchao Zhang
>> > >>>
>> > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de> wrote:
>> > >>> Hello again,
>> > >>>
>> > >>> here are the updated log_view files for 125 and 1000 processors. I
>> ran both problems twice, the first time with all processors per node
>> allocated ("-1.txt"), the second with only half on twice the number of
>> nodes ("-2.txt").
>> > >>>
>> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>> I noticed that for every individual KSP iteration, six vector
>> objects are created and destroyed (with CG, more with e.g. GMRES).
>> > >>>>>
>> > >>>>   Hmm, it is certainly not intended that vectors be created and
>> destroyed within each KSPSolve(). Could you please point us to the code that
>> makes you think they are being created and destroyed?   We create all the
>> work vectors in KSPSetUp() and destroy them in KSPReset(), not during the
>> solve. Not that this would be a measurable difference.
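>> > >>>>
>> > >>>>   A minimal sketch of the intended pattern (the loop and names here
>> are only illustrative, not taken from your code):
>> > >>>>
>> > >>>>     ierr = KSPSetUp(ksp);CHKERRQ(ierr);        /* work vectors created here   */
>> > >>>>     for (i=0; i<nsteps; i++) {
>> > >>>>       ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);  /* reuses the same work vectors */
>> > >>>>     }
>> > >>>>     ierr = KSPReset(ksp);CHKERRQ(ierr);        /* work vectors destroyed here */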
>> > >>>>
>> > >>>
>> > >>> I mean this, right in the log_view output:
>> > >>>
>> > >>>> Memory usage is given in bytes:
>> > >>>>
>> > >>>> Object Type Creations Destructions Memory Descendants' Mem.
>> > >>>> Reports information only for process 0.
>> > >>>>
>> > >>>> --- Event Stage 0: Main Stage
>> > >>>>
>> > >>>> ...
>> > >>>>
>> > >>>> --- Event Stage 1: First Solve
>> > >>>>
>> > >>>> ...
>> > >>>>
>> > >>>> --- Event Stage 2: Remaining Solves
>> > >>>>
>> > >>>> Vector 23904 23904 1295501184 0.
>> > >>> I logged the exact number of KSP iterations over the 999 timesteps
>> and it's exactly 23904/6 = 3984.
>> > >>> Michael
>> > >>>
>> > >>>
>> > >>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>> > >>>>
>> > >>>>  Please send the log file for 1000 with cg as the solver.
>> > >>>>
>> > >>>>   You should make a bar chart of each event for the two cases to
>> see which ones are taking more time and which are taking less (we cannot
>> tell with the two logs you sent us since they are for different solvers.)
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>> Michael.Becker at physik.uni-giessen.de>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>> I noticed that for every individual KSP iteration, six vector
>> objects are created and destroyed (with CG, more with e.g. GMRES).
>> > >>>>>
>> > >>>>   Hmm, it is certainly not intended that vectors be created and
>> destroyed within each KSPSolve(). Could you please point us to the code that
>> makes you think they are being created and destroyed?   We create all the
>> work vectors in KSPSetUp() and destroy them in KSPReset(), not during the
>> solve. Not that this would be a measurable difference.
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>> This seems kind of wasteful; is it supposed to be like this? Is
>> this even the reason for my problems? Apart from that, everything seems
>> quite normal to me (but I'm not the expert here).
>> > >>>>>
>> > >>>>>
>> > >>>>> Thanks in advance.
>> > >>>>>
>> > >>>>> Michael
>> > >>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> <log_view_125procs.txt><log_view_1000procs.txt>
>> > >>>>>
>> > >>>
>> > >>>
>> > >>
>> > >>
>> > >>
>> > >
>> > >
>> > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
>> >
>> >
>>
>>
>