[petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems
Mark Adams
mfadams at lbl.gov
Tue Jun 12 15:17:04 CDT 2018
This all looks reasonable to me. The VecScatters times are a little high
but these are fast little solves (.2 seconds each).
The RAP times are very low, suggesting we could optimize parameters a bit
and reduce the iteration count. These are 7 point stencils as I recall. You
could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
'-pc_gamg_threshold 0.0'. You could also test hypre.
And you should be able to improve coarse grid load imbalance with
-pc_gamg_repartition.
Mark
On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> Mark,
> I tried "-pc_gamg_type agg ..." options you mentioned, and also
> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
> execution time I called KSPSolve 100 times instead of 1000, and also
> used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
> -log_sync. From there you can see a lot of time is spent on VecNormBarrier,
> which implies load imbalance. Note VecScatterBarrie time is misleading,
> since it barriers ALL ranks, but in reality VecScatter sort of syncs in a
> small neighborhood.
> Barry suggested trying periodic boundary condition so that the nonzeros
> are perfectly balanced across processes. I will try that to see what
> happens.
>
> --Junchao Zhang
>
> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang <jczhang at mcs.anl.gov>
>> wrote:
>>
>>> I used an LCRC machine named Bebop. I tested on its Intel Broadwell
>>> nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
>>> 36 cores in a node or 18 cores in a node. As you can see, 18 cores/node
>>> gave much better performance, which is reasonable as routines like MatSOR,
>>> MatMult, MatMultAdd are all bandwidth bound.
>>>
>>> The code uses a DMDA 3D grid, 7-point stencil, and defines
>>> nodes(vertices) at the surface or second to the surface as boundary nodes.
>>> Boundary nodes only have a diagonal one in their row in the matrix.
>>> Interior nodes have 7 nonzeros in their row. Boundary processors in the
>>> processor grid has less nonzero. This is one source of load-imbalance. Will
>>> load-imbalance get severer at coarser grids of an MG level?
>>>
>>
>> Yes.
>>
>> You can use a simple Jacobi solver to see the basic performance of your
>> operator and machine. Do you see as much time spent in Vec Scatters?
>> VecAXPY? etc.
>>
>>
>>>
>>> I attach a trace view figure that show activity of each ranks along the
>>> time axis in one KSPSove. White color means MPI wait. You can see white
>>> takes a large space.
>>>
>>> I don't have a good explanation why at large scale (1728 cores),
>>> processors wait longer time, as the communication pattern is still 7-point
>>> stencil in a cubic processor gird.
>>>
>>> --Junchao Zhang
>>>
>>> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
>>> wrote:
>>>
>>>>
>>>> Junchao,
>>>>
>>>> Thanks, the load balance of matrix entries is remarkably similar
>>>> for the two runs so it can't be a matter of worse work load imbalance for
>>>> SOR for the larger case explaining why the SOR takes more time.
>>>>
>>>> Here is my guess (and I know no way to confirm it). In the
>>>> smaller case the overlap of different processes on the same node running
>>>> SOR at the same time is lower than the larger case hence the larger case is
>>>> slower because there are more SOR processes fighting over the same memory
>>>> bandwidth at the same time than in the smaller case. Ahh, here is
>>>> something you can try, lets undersubscribe the memory bandwidth needs, run
>>>> on say 16 processes per node with 8 nodes and 16 processes per node with 64
>>>> nodes and send the two -log_view output files. I assume this is an LCRC
>>>> machine and NOT a KNL system?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>> > On Jun 9, 2018, at 8:29 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>>> >
>>>> > -pc_gamg_type classical
>>>> >
>>>> > FYI, we only support smoothed aggregation "agg" (the default). (This
>>>> thread started by saying you were using GAMG.)
>>>> >
>>>> > It is not clear how much this will make a difference for you, but you
>>>> don't want to use classical because we do not support it. It is meant as a
>>>> reference implementation for developers.
>>>> >
>>>> > First, how did you get the idea to use classical? If the
>>>> documentation lead you to believe this was a good thing to do then we need
>>>> to fix that!
>>>> >
>>>> > Anyway, here is a generic input for GAMG:
>>>> >
>>>> > -pc_type gamg
>>>> > -pc_gamg_type agg
>>>> > -pc_gamg_agg_nsmooths 1
>>>> > -pc_gamg_coarse_eq_limit 1000
>>>> > -pc_gamg_reuse_interpolation true
>>>> > -pc_gamg_square_graph 1
>>>> > -pc_gamg_threshold 0.05
>>>> > -pc_gamg_threshold_scale .0
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>> wrote:
>>>> > OK, I have thought that space was a typo. btw, this option does not
>>>> show up in -h.
>>>> > I changed number of ranks to use all cores on each node to avoid
>>>> misleading ratio in -log_view. Since one node has 36 cores, I ran with
>>>> 6^3=216 ranks, and 12^3=1728 ranks. I also found call counts of MatSOR etc
>>>> in the two tests were different. So they are not strict weak scaling tests.
>>>> I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
>>>> two have the same MatSOR count. Anyway, I attached the load balance output.
>>>> >
>>>> > I find PCApply_MG calls PCMGMCycle_Private, which is recursive and
>>>> indirectly calls MatSOR_MPIAIJ. I believe the following code in
>>>> MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ between
>>>> processors through VecScatter at each MG level. If SOR and MatMultAdd are
>>>> imbalanced, the cost is accumulated along MG levels and shows up as large
>>>> VecScatter cost.
>>>> > 1460: while
>>>> > (its--) {
>>>> >
>>>> > 1461: VecScatterBegin(mat->Mvctx,xx
>>>> ,mat->lvec,INSERT_VALUES,SCATTER_FORWARD
>>>> > );
>>>> >
>>>> > 1462: VecScatterEnd(mat->Mvctx,xx,m
>>>> at->lvec,INSERT_VALUES,SCATTER_FORWARD
>>>> > );
>>>> >
>>>> >
>>>> > 1464: /* update rhs: bb1 = bb - B*x */
>>>> > 1465: VecScale
>>>> > (mat->lvec,-1.0);
>>>> >
>>>> > 1466: (*mat->B->ops->multadd)(mat->
>>>> > B,mat->lvec,bb,bb1);
>>>> >
>>>> >
>>>> > 1468: /* local sweep */
>>>> > 1469: (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,
>>>> > fshift,lits,1,xx);
>>>> >
>>>> > 1470: }
>>>> >
>>>> >
>>>> >
>>>> > --Junchao Zhang
>>>> >
>>>> > On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
>>>> wrote:
>>>> >
>>>> >
>>>> > > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczhang at mcs.anl.gov>
>>>> wrote:
>>>> > >
>>>> > > Searched but could not find this option, -mat_view::load_balance
>>>> >
>>>> > There is a space between the view and the : load_balance is a
>>>> particular viewer format that causes the printing of load balance
>>>> information about number of nonzeros in the matrix.
>>>> >
>>>> > Barry
>>>> >
>>>> > >
>>>> > > --Junchao Zhang
>>>> > >
>>>> > > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <
>>>> bsmith at mcs.anl.gov> wrote:
>>>> > > So the only surprise in the results is the SOR. It is
>>>> embarrassingly parallel and normally one would not see a jump.
>>>> > >
>>>> > > The load balance for SOR time 1.5 is better at 1000 processes than
>>>> for 125 processes of 2.1 not worse so this number doesn't easily explain
>>>> it.
>>>> > >
>>>> > > Could you run the 125 and 1000 with -mat_view ::load_balance and
>>>> see what you get out?
>>>> > >
>>>> > > Thanks
>>>> > >
>>>> > > Barry
>>>> > >
>>>> > > Notice that the MatSOR time jumps a lot about 5 secs when the
>>>> -log_sync is on. My only guess is that the MatSOR is sharing memory
>>>> bandwidth (or some other resource? cores?) with the VecScatter and for some
>>>> reason this is worse for 1000 cores but I don't know why.
>>>> > >
>>>> > > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>> wrote:
>>>> > > >
>>>> > > > Hi, PETSc developers,
>>>> > > > I tested Michael Becker's code. The code calls the same KSPSolve
>>>> 1000 times in the second stage and needs cubic number of processors to run.
>>>> I ran with 125 ranks and 1000 ranks, with or without -log_sync option. I
>>>> attach the log view output files and a scaling loss excel file.
>>>> > > > I profiled the code with 125 processors. It looks {MatSOR,
>>>> MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c
>>>> took ~50% of the time, The other half time was spent on waiting in MPI.
>>>> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
>>>> > > > I tested it on a 36 cores/node machine. I found 32 ranks/node
>>>> gave better performance (about 10%) than 36 ranks/node in the 125 ranks
>>>> testing. I guess it is because processors in the former had more balanced
>>>> memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and
>>>> PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files).
>>>> It looks ranks at the two ends have less DP_OPS and TOT_CYC.
>>>> > > > Does anyone familiar with the algorithm have quick explanations?
>>>> > > >
>>>> > > > --Junchao Zhang
>>>> > > >
>>>> > > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>> > > > Hello again,
>>>> > > >
>>>> > > > this took me longer than I anticipated, but here we go.
>>>> > > > I did reruns of the cases where only half the processes per node
>>>> were used (without -log_sync):
>>>> > > >
>>>> > > > 125 procs,1st 125 procs,2nd
>>>> 1000 procs,1st 1000 procs,2nd
>>>> > > > Max Ratio Max Ratio
>>>> Max Ratio Max Ratio
>>>> > > > KSPSolve 1.203E+02 1.0 1.210E+02 1.0
>>>> 1.399E+02 1.1 1.365E+02 1.0
>>>> > > > VecTDot 6.376E+00 3.7 6.551E+00 4.0
>>>> 7.885E+00 2.9 7.175E+00 3.4
>>>> > > > VecNorm 4.579E+00 7.1 5.803E+00 10.2
>>>> 8.534E+00 6.9 6.026E+00 4.9
>>>> > > > VecScale 1.070E-01 2.1 1.129E-01 2.2
>>>> 1.301E-01 2.5 1.270E-01 2.4
>>>> > > > VecCopy 1.123E-01 1.3 1.149E-01 1.3
>>>> 1.301E-01 1.6 1.359E-01 1.6
>>>> > > > VecSet 7.063E-01 1.7 6.968E-01 1.7
>>>> 7.432E-01 1.8 7.425E-01 1.8
>>>> > > > VecAXPY 1.166E+00 1.4 1.167E+00 1.4
>>>> 1.221E+00 1.5 1.279E+00 1.6
>>>> > > > VecAYPX 1.317E+00 1.6 1.290E+00 1.6
>>>> 1.536E+00 1.9 1.499E+00 2.0
>>>> > > > VecScatterBegin 6.142E+00 3.2 5.974E+00 2.8
>>>> 6.448E+00 3.0 6.472E+00 2.9
>>>> > > > VecScatterEnd 3.606E+01 4.2 3.551E+01 4.0
>>>> 5.244E+01 2.7 4.995E+01 2.7
>>>> > > > MatMult 3.561E+01 1.6 3.403E+01 1.5
>>>> 3.435E+01 1.4 3.332E+01 1.4
>>>> > > > MatMultAdd 1.124E+01 2.0 1.130E+01 2.1
>>>> 2.093E+01 2.9 1.995E+01 2.7
>>>> > > > MatMultTranspose 1.372E+01 2.5 1.388E+01 2.6
>>>> 1.477E+01 2.2 1.381E+01 2.1
>>>> > > > MatSolve 1.949E-02 0.0 1.653E-02 0.0
>>>> 4.789E-02 0.0 4.466E-02 0.0
>>>> > > > MatSOR 6.610E+01 1.3 6.673E+01 1.3
>>>> 7.111E+01 1.3 7.105E+01 1.3
>>>> > > > MatResidual 2.647E+01 1.7 2.667E+01 1.7
>>>> 2.446E+01 1.4 2.467E+01 1.5
>>>> > > > PCSetUpOnBlocks 5.266E-03 1.4 5.295E-03 1.4
>>>> 5.427E-03 1.5 5.289E-03 1.4
>>>> > > > PCApply 1.031E+02 1.0 1.035E+02 1.0
>>>> 1.180E+02 1.0 1.164E+02 1.0
>>>> > > >
>>>> > > > I also slimmed down my code and basically wrote a simple weak
>>>> scaling test (source files attached) so you can profile it yourself. I
>>>> appreciate the offer Junchao, thank you.
>>>> > > > You can adjust the system size per processor at runtime via
>>>> "-nodes_per_proc 30" and the number of repeated calls to the function
>>>> containing KSPsolve() via "-iterations 1000". The physical problem is
>>>> simply calculating the electric potential from a homogeneous charge
>>>> distribution, done multiple times to accumulate time in KSPsolve().
>>>> > > > A job would be started using something like
>>>> > > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size
>>>> 1E-4 -iterations 1000 \\
>>>> > > > -ksp_rtol 1E-6 \
>>>> > > > -log_view -log_sync\
>>>> > > > -pc_type gamg -pc_gamg_type classical\
>>>> > > > -ksp_type cg \
>>>> > > > -ksp_norm_type unpreconditioned \
>>>> > > > -mg_levels_ksp_type richardson \
>>>> > > > -mg_levels_ksp_norm_type none \
>>>> > > > -mg_levels_pc_type sor \
>>>> > > > -mg_levels_ksp_max_it 1 \
>>>> > > > -mg_levels_pc_sor_its 1 \
>>>> > > > -mg_levels_esteig_ksp_type cg \
>>>> > > > -mg_levels_esteig_ksp_max_it 10 \
>>>> > > > -gamg_est_ksp_type cg
>>>> > > > , ideally started on a cube number of processes for a cubical
>>>> process grid.
>>>> > > > Using 125 processes and 10.000 iterations I get the output in
>>>> "log_view_125_new.txt", which shows the same imbalance for me.
>>>> > > > Michael
>>>> > > >
>>>> > > >
>>>> > > > Am 02.06.2018 um 13:40 schrieb Mark Adams:
>>>> > > >>
>>>> > > >>
>>>> > > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <
>>>> jczhang at mcs.anl.gov> wrote:
>>>> > > >> Hi,Michael,
>>>> > > >> You can add -log_sync besides -log_view, which adds barriers to
>>>> certain events but measures barrier time separately from the events. I find
>>>> this option makes it easier to interpret log_view output.
>>>> > > >>
>>>> > > >> That is great (good to know).
>>>> > > >>
>>>> > > >> This should give us a better idea if your large VecScatter costs
>>>> are from slow communication or if it catching some sort of load imbalance.
>>>> > > >>
>>>> > > >>
>>>> > > >> --Junchao Zhang
>>>> > > >>
>>>> > > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>> > > >> Barry: On its way. Could take a couple days again.
>>>> > > >>
>>>> > > >> Junchao: I unfortunately don't have access to a cluster with a
>>>> faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking
>>>> fat-tree network, which I realize causes parallel slowdown if the nodes are
>>>> not connected to the same switch. Each node has 24 processors (2x12/socket)
>>>> and four NUMA domains (two for each socket).
>>>> > > >> The ranks are usually not distributed perfectly even, i.e. for
>>>> 125 processes, of the six required nodes, five would use 21 cores and one
>>>> 20.
>>>> > > >> Would using another CPU type make a difference
>>>> communication-wise? I could switch to faster ones (on the same network),
>>>> but I always assumed this would only improve performance of the stuff that
>>>> is unrelated to communication.
>>>> > > >>
>>>> > > >> Michael
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >>> The log files have something like "Average time for zero size
>>>> MPI_Send(): 1.84231e-05". It looks you ran on a cluster with a very slow
>>>> network. A typical machine should give less than 1/10 of the latency you
>>>> have. An easy way to try is just running the code on a machine with a
>>>> faster network and see what happens.
>>>> > > >>>
>>>> > > >>> Also, how many cores & numa domains does a compute node have? I
>>>> could not figure out how you distributed the 125 MPI ranks evenly.
>>>> > > >>>
>>>> > > >>> --Junchao Zhang
>>>> > > >>>
>>>> > > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>> > > >>> Hello again,
>>>> > > >>>
>>>> > > >>> here are the updated log_view files for 125 and 1000
>>>> processors. I ran both problems twice, the first time with all processors
>>>> per node allocated ("-1.txt"), the second with only half on twice the
>>>> number of nodes ("-2.txt").
>>>> > > >>>
>>>> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>>>> Michael.Becker at physik.uni-giessen.de>
>>>> > > >>>>> wrote:
>>>> > > >>>>>
>>>> > > >>>>> I noticed that for every individual KSP iteration, six vector
>>>> objects are created and destroyed (with CG, more with e.g. GMRES).
>>>> > > >>>>>
>>>> > > >>>> Hmm, it is certainly not intended at vectors be created and
>>>> destroyed within each KSPSolve() could you please point us to the code that
>>>> makes you think they are being created and destroyed? We create all the
>>>> work vectors at KSPSetUp() and destroy them in KSPReset() not during the
>>>> solve. Not that this would be a measurable distance.
>>>> > > >>>>
>>>> > > >>>
>>>> > > >>> I mean this, right in the log_view output:
>>>> > > >>>
>>>> > > >>>> Memory usage is given in bytes:
>>>> > > >>>>
>>>> > > >>>> Object Type Creations Destructions Memory Descendants' Mem.
>>>> > > >>>> Reports information only for process 0.
>>>> > > >>>>
>>>> > > >>>> --- Event Stage 0: Main Stage
>>>> > > >>>>
>>>> > > >>>> ...
>>>> > > >>>>
>>>> > > >>>> --- Event Stage 1: First Solve
>>>> > > >>>>
>>>> > > >>>> ...
>>>> > > >>>>
>>>> > > >>>> --- Event Stage 2: Remaining Solves
>>>> > > >>>>
>>>> > > >>>> Vector 23904 23904 1295501184 0.
>>>> > > >>> I logged the exact number of KSP iterations over the 999
>>>> timesteps and its exactly 23904/6 = 3984.
>>>> > > >>> Michael
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> Am 24.05.2018 um 19:50 schrieb Smith, Barry F.:
>>>> > > >>>>
>>>> > > >>>> Please send the log file for 1000 with cg as the solver.
>>>> > > >>>>
>>>> > > >>>> You should make a bar chart of each event for the two cases
>>>> to see which ones are taking more time and which are taking less (we cannot
>>>> tell with the two logs you sent us since they are for different solvers.)
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>>>> Michael.Becker at physik.uni-giessen.de>
>>>> > > >>>>> wrote:
>>>> > > >>>>>
>>>> > > >>>>> I noticed that for every individual KSP iteration, six vector
>>>> objects are created and destroyed (with CG, more with e.g. GMRES).
>>>> > > >>>>>
>>>> > > >>>> Hmm, it is certainly not intended at vectors be created and
>>>> destroyed within each KSPSolve() could you please point us to the code that
>>>> makes you think they are being created and destroyed? We create all the
>>>> work vectors at KSPSetUp() and destroy them in KSPReset() not during the
>>>> solve. Not that this would be a measurable distance.
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>> This seems kind of wasteful, is this supposed to be like
>>>> this? Is this even the reason for my problems? Apart from that, everything
>>>> seems quite normal to me (but I'm not the expert here).
>>>> > > >>>>>
>>>> > > >>>>>
>>>> > > >>>>> Thanks in advance.
>>>> > > >>>>>
>>>> > > >>>>> Michael
>>>> > > >>>>>
>>>> > > >>>>>
>>>> > > >>>>>
>>>> > > >>>>> <log_view_125procs.txt><log_vi
>>>> > > >>>>> ew_1000procs.txt>
>>>> > > >>>>>
>>>> > > >>>
>>>> > > >>>
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >
>>>> > > >
>>>> > > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-ws
>>>> test-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png
>>>> ><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
>>>> > >
>>>> > >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20180612/fb195d06/attachment-0001.html>
More information about the petsc-dev
mailing list