[petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems
Junchao Zhang
jczhang at mcs.anl.gov
Thu Jun 7 12:27:41 CDT 2018
I searched but could not find this option: -mat_view ::load_balance
--Junchao Zhang
On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> So the only surprise in the results is the SOR. It is embarrassingly
> parallel and normally one would not see a jump.
>
> The load balance for the SOR time is better at 1000 processes (1.5) than at
> 125 processes (2.1), not worse, so this number doesn't easily explain it.
>
> Could you run the 125 and 1000 with -mat_view ::load_balance and see what
> you get out?
>
> Thanks
>
> Barry
>
> Notice that the MatSOR time jumps a lot (about 5 seconds) when -log_sync
> is on. My only guess is that MatSOR is sharing memory bandwidth (or
> some other resource? cores?) with the VecScatter, and for some reason this
> is worse for 1000 cores, but I don't know why.
>
> > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> >
> > Hi, PETSc developers,
> > I tested Michael Becker's code. The code calls the same KSPSolve 1000 times in the second stage and needs a cubic number of processes to run. I ran with 125 ranks and 1000 ranks, with and without the -log_sync option. I attach the log_view output files and a scaling-loss Excel file.
> > I profiled the code with 125 processes. It looks like {MatSOR, MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took ~50% of the time; the other half was spent waiting in MPI. MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
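> > The hot kernel there is essentially a per-row sparse dot product over the CSR arrays; a minimal sketch of the pattern (illustrative name and signature, not the actual PETSc macro) is:
> >
> >   /* One row of a forward SOR/Gauss-Seidel sweep: sum -= v[j]*x[idx[j]] over the row. */
> >   /* It streams the row values v[] and column indices idx[] and gathers from x[], so  */
> >   /* it is limited by memory bandwidth rather than by floating-point throughput.      */
> >   static inline double RowMinusDot(double sum, const double *x, const double *v,
> >                                    const int *idx, int nnz)
> >   {
> >     int j;
> >     for (j = 0; j < nnz; j++) sum -= v[j] * x[idx[j]];
> >     return sum;
> >   }
> >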
> > I tested it on a machine with 36 cores/node. I found 32 ranks/node gave better performance (about 10%) than 36 ranks/node in the 125-rank test. I guess it is because the processors in the former case had more balanced memory bandwidth. I collected PAPI_DP_OPS (double-precision operations) and PAPI_TOT_CYC (total cycles) for the 125-rank case (see the attached files). It looks like ranks at the two ends have fewer DP_OPS and TOT_CYC.
> > Does anyone familiar with the algorithm have a quick explanation?
> >
> > --Junchao Zhang
> >
> > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> > Hello again,
> >
> > this took me longer than I anticipated, but here we go.
> > I did reruns of the cases where only half the processes per node were
> used (without -log_sync):
> >
> >                       125 procs, 1st       125 procs, 2nd       1000 procs, 1st      1000 procs, 2nd
> >                       Max        Ratio     Max        Ratio     Max        Ratio     Max        Ratio
> > KSPSolve              1.203E+02  1.0       1.210E+02  1.0       1.399E+02  1.1       1.365E+02  1.0
> > VecTDot               6.376E+00  3.7       6.551E+00  4.0       7.885E+00  2.9       7.175E+00  3.4
> > VecNorm               4.579E+00  7.1       5.803E+00  10.2      8.534E+00  6.9       6.026E+00  4.9
> > VecScale              1.070E-01  2.1       1.129E-01  2.2       1.301E-01  2.5       1.270E-01  2.4
> > VecCopy               1.123E-01  1.3       1.149E-01  1.3       1.301E-01  1.6       1.359E-01  1.6
> > VecSet                7.063E-01  1.7       6.968E-01  1.7       7.432E-01  1.8       7.425E-01  1.8
> > VecAXPY               1.166E+00  1.4       1.167E+00  1.4       1.221E+00  1.5       1.279E+00  1.6
> > VecAYPX               1.317E+00  1.6       1.290E+00  1.6       1.536E+00  1.9       1.499E+00  2.0
> > VecScatterBegin       6.142E+00  3.2       5.974E+00  2.8       6.448E+00  3.0       6.472E+00  2.9
> > VecScatterEnd         3.606E+01  4.2       3.551E+01  4.0       5.244E+01  2.7       4.995E+01  2.7
> > MatMult               3.561E+01  1.6       3.403E+01  1.5       3.435E+01  1.4       3.332E+01  1.4
> > MatMultAdd            1.124E+01  2.0       1.130E+01  2.1       2.093E+01  2.9       1.995E+01  2.7
> > MatMultTranspose      1.372E+01  2.5       1.388E+01  2.6       1.477E+01  2.2       1.381E+01  2.1
> > MatSolve              1.949E-02  0.0       1.653E-02  0.0       4.789E-02  0.0       4.466E-02  0.0
> > MatSOR                6.610E+01  1.3       6.673E+01  1.3       7.111E+01  1.3       7.105E+01  1.3
> > MatResidual           2.647E+01  1.7       2.667E+01  1.7       2.446E+01  1.4       2.467E+01  1.5
> > PCSetUpOnBlocks       5.266E-03  1.4       5.295E-03  1.4       5.427E-03  1.5       5.289E-03  1.4
> > PCApply               1.031E+02  1.0       1.035E+02  1.0       1.180E+02  1.0       1.164E+02  1.0
> >
> > I also slimmed down my code and basically wrote a simple weak scaling test (source files attached) so you can profile it yourself. I appreciate the offer, Junchao; thank you.
> > You can adjust the system size per processor at runtime via "-nodes_per_proc 30" and the number of repeated calls to the function containing KSPSolve() via "-iterations 1000". The physical problem is simply calculating the electric potential from a homogeneous charge distribution, done multiple times to accumulate time in KSPSolve().
> > A job would be started using something like:
> >
> >   mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \
> >     -ksp_rtol 1E-6 \
> >     -log_view -log_sync \
> >     -pc_type gamg -pc_gamg_type classical \
> >     -ksp_type cg \
> >     -ksp_norm_type unpreconditioned \
> >     -mg_levels_ksp_type richardson \
> >     -mg_levels_ksp_norm_type none \
> >     -mg_levels_pc_type sor \
> >     -mg_levels_ksp_max_it 1 \
> >     -mg_levels_pc_sor_its 1 \
> >     -mg_levels_esteig_ksp_type cg \
> >     -mg_levels_esteig_ksp_max_it 10 \
> >     -gamg_est_ksp_type cg
> >
> > ideally started on a cube number of processes for a cubical process grid.
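> > The relevant part of ws_test is just the usual assemble-once, solve-many PETSc pattern, roughly like the sketch below (not the attached source; names are illustrative, error handling is abbreviated, and A, b, x are assumed to be already assembled, e.g. a 7-point Poisson stencil on a DMDA):
> >
> >   #include <petscksp.h>
> >
> >   PetscErrorCode RunSolves(Mat A, Vec b, Vec x, PetscInt iterations)
> >   {
> >     KSP            ksp;
> >     PetscLogStage  stage;
> >     PetscErrorCode ierr;
> >     PetscInt       i;
> >
> >     PetscFunctionBeginUser;
> >     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
> >     ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
> >     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* picks up -ksp_type cg, -pc_type gamg, ... */
> >     ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);      /* setup-heavy first solve ("First Solve" stage)  */
> >
> >     ierr = PetscLogStageRegister("Remaining Solves",&stage);CHKERRQ(ierr);
> >     ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
> >     for (i = 1; i < iterations; i++) {
> >       ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);    /* the repeated solves that accumulate the timings */
> >     }
> >     ierr = PetscLogStagePop();CHKERRQ(ierr);
> >     ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
> >     PetscFunctionReturn(0);
> >   }
> >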
> > Using 125 processes and 10,000 iterations I get the output in "log_view_125_new.txt", which shows the same imbalance for me.
> > Michael
> >
> >
> > Am 02.06.2018 um 13:40 schrieb Mark Adams:
> >>
> >>
> >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
> >> Hi, Michael,
> >> You can add -log_sync besides -log_view; it adds barriers to certain events but measures the barrier time separately from the events. I find this option makes it easier to interpret the log_view output.
> >>
> >> That is great (good to know).
> >>
> >> This should give us a better idea of whether your large VecScatter costs are from slow communication or whether it is catching some sort of load imbalance.
> >>
> >>
> >> --Junchao Zhang
> >>
> >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> >> Barry: On its way. Could take a couple days again.
> >>
> >> Junchao: I unfortunately don't have access to a cluster with a faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree network, which I realize causes parallel slowdown if the nodes are not connected to the same switch. Each node has 24 processors (2 sockets x 12 cores) and four NUMA domains (two per socket).
> >> The ranks are usually not distributed perfectly evenly, i.e. for 125 processes, of the six required nodes, five would use 21 cores and one would use 20.
> >> Would using another CPU type make a difference communication-wise? I
> could switch to faster ones (on the same network), but I always assumed
> this would only improve performance of the stuff that is unrelated to
> communication.
> >>
> >> Michael
> >>
> >>
> >>
> >>> The log files have something like "Average time for zero size MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very slow network. A typical machine should give less than 1/10 of the latency you have. An easy test is to just run the code on a machine with a faster network and see what happens.
> >>>
> >>> Also, how many cores & NUMA domains does a compute node have? I could not figure out how you distributed the 125 MPI ranks evenly.
> >>>
> >>> --Junchao Zhang
> >>>
> >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> >>> Hello again,
> >>>
> >>> here are the updated log_view files for 125 and 1000 processors. I ran
> both problems twice, the first time with all processors per node allocated
> ("-1.txt"), the second with only half on twice the number of nodes
> ("-2.txt").
> >>>
> >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de>
> >>>>> wrote:
> >>>>>
> >>>>> I noticed that for every individual KSP iteration, six vector
> objects are created and destroyed (with CG, more with e.g. GMRES).
> >>>>>
> >>>> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(); could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
> >>>>
> >>>
> >>> I mean this, right in the log_view output:
> >>>
> >>>> Memory usage is given in bytes:
> >>>>
> >>>> Object Type Creations Destructions Memory Descendants' Mem.
> >>>> Reports information only for process 0.
> >>>>
> >>>> --- Event Stage 0: Main Stage
> >>>>
> >>>> ...
> >>>>
> >>>> --- Event Stage 1: First Solve
> >>>>
> >>>> ...
> >>>>
> >>>> --- Event Stage 2: Remaining Solves
> >>>>
> >>>> Vector 23904 23904 1295501184 0.
> >>> I logged the exact number of KSP iterations over the 999 timesteps, and it's exactly 23904/6 = 3984.
> >>> Michael
> >>>
> >>>
> >>> Am 24.05.2018 um 19:50 schrieb Smith, Barry F.:
> >>>>
> >>>> Please send the log file for 1000 with cg as the solver.
> >>>>
> >>>> You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell from the two logs you sent us since they are for different solvers).
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de>
> >>>>> wrote:
> >>>>>
> >>>>> I noticed that for every individual KSP iteration, six vector
> objects are created and destroyed (with CG, more with e.g. GMRES).
> >>>>>
> >>>> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(); could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> This seems kind of wasteful; is it supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).
> >>>>>
> >>>>>
> >>>>> Thanks in advance.
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>>
> >>>>>
> >>>>> <log_view_125procs.txt><log_view_1000procs.txt>
> >>>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
>
>