[petsc-dev] Bad scaling of GAMG in FieldSplit

Thu Jul 26 13:50:50 CDT 2018

On Thu, Jul 26, 2018 at 2:43 PM Jed Brown <jed at jedbrown.org> wrote:

> Matthew Knepley <knepley at gmail.com> writes:
>
> > On Thu, Jul 26, 2018 at 12:56 PM Fande Kong <fdkong.jd at gmail.com> wrote:
> >
> >>
> >>
> >> On Thu, Jul 26, 2018 at 10:35 AM, Junchao Zhang <jczhang at mcs.anl.gov>
> >> wrote:
> >>
> >>> On Thu, Jul 26, 2018 at 11:15 AM, Fande Kong <fdkong.jd at gmail.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On Thu, Jul 26, 2018 at 9:51 AM, Junchao Zhang <jczhang at mcs.anl.gov>
> >>>> wrote:
> >>>>
> >>>>> Hi, Pierre,
> >>>>>   From your log_view files, I see you did strong scaling. You used 4X
> >>>>> more cores, but the execution time only dropped from 3.9143e+04
> >>>>> to 1.6910e+04.
> >>>>>   From my previous analysis of a GAMG weak scaling test, it looks like
> >>>>> communication is one of the reasons for the poor scaling. In your
> >>>>> case, VecScatterEnd time doubled from 1.5575e+03 to 3.2413e+03, and
> >>>>> its share of the total time jumped from 1% to 17%. This time can
> >>>>> contribute to the big time ratios in MatMultAdd and
> >>>>> MatMultTranspose, misleading you into thinking there was a
> >>>>> computational load imbalance.
> >>>>>   The reason is that, in the interpolation and restriction phases of
> >>>>> GAMG, I found the communication pattern to be very bad: a few
> >>>>> processes communicate with hundreds of neighbors, with message sizes
> >>>>> of only a few bytes.
> >>>>>
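
To put numbers on the strong-scaling observation quoted above, here is a small Python sketch that computes speedup and parallel efficiency from the two total run times, plus the growth in VecScatterEnd time. The 4x ratio assumes the 512 and 2048 process counts mentioned elsewhere in this thread; the variable names are illustrative only.

```python
# Total run times (seconds) quoted above from the two -log_view files.
time_512 = 3.9143e+04
time_2048 = 1.6910e+04
core_ratio = 2048 / 512  # 4x more processes (assumed from the thread)

speedup = time_512 / time_2048
efficiency = speedup / core_ratio

# Same treatment for VecScatterEnd, whose time went up rather than down.
scatter_512 = 1.5575e+03
scatter_2048 = 3.2413e+03
scatter_growth = scatter_2048 / scatter_512

print(f"speedup: {speedup:.2f}x on {core_ratio:.0f}x the processes")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"VecScatterEnd time grew by {scatter_growth:.2f}x")
```

Perfect strong scaling would give a speedup equal to the process ratio, so an efficiency well below 100% together with a communication event whose absolute time grows is consistent with the communication-bound explanation above.
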
> >>>>
> >>>> We may need to truncate the interpolation/restriction operators and
> >>>> also do some aggressive coarsening. Unfortunately, GAMG currently
> >>>> does not support this.
> >>>>
> >>>
> >>>  Are these GAMG options the truncation you had in mind?
> >>>
> >>
> >>> -pc_gamg_threshold[] <thresh,default=0> - Before aggregating the graph
> >>> GAMG will remove small values from the graph on each level
> >>> -pc_gamg_threshold_scale <scale,default=1> - Scaling of threshold on
> >>> each coarser grid if not specified
> >>>
> >>
> >> Nope.  Totally different things.
> >>
> >
> > Well, you could use _threshold to do more aggressive coarsening, but not
> > for thinning out
> > the interpolation.
>
> Increasing the threshold results in slower coarsening.
>

Hmm, I think we have to change the webpage then:

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PC/PCGAMGSetThreshold.html

I read it the opposite way.
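
For what it is worth, here is a toy illustration of Jed's point in plain Python. It is not GAMG's aggregation code: the 9-point grid graph, the edge weights, and the greedy independent-set selection are all assumptions made up for this sketch. The only thing it borrows from the option text above is the idea that edges with small values are removed from the graph before aggregating.

```python
def build_edges(nx, ny, strong=1.0, weak=0.1):
    """9-point stencil graph: x-direction edges strong, all others weak."""
    edges = {}
    for y in range(ny):
        for x in range(nx):
            for dx, dy in ((1, 0), (0, 1), (1, 1), (-1, 1)):
                u, v = x + dx, y + dy
                if 0 <= u < nx and v < ny:
                    edges[((x, y), (u, v))] = strong if dy == 0 else weak
    return edges


def count_aggregates(nx, ny, edges, threshold):
    """Drop edges with weight <= threshold, then count greedy MIS roots."""
    neighbors = {(x, y): set() for y in range(ny) for x in range(nx)}
    for (a, b), weight in edges.items():
        if weight > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    covered = set()
    roots = 0
    # Visit nodes in row-major order; each uncovered node becomes a root
    # and absorbs its remaining neighbors in the filtered graph.
    for y in range(ny):
        for x in range(nx):
            node = (x, y)
            if node not in covered:
                roots += 1
                covered.add(node)
                covered |= neighbors[node]
    return roots


nx = ny = 8
edges = build_edges(nx, ny)
for threshold in (0.0, 0.5):
    n_agg = count_aggregates(nx, ny, edges, threshold)
    print(f"threshold={threshold}: {nx * ny} nodes -> {n_agg} aggregates")
```

On this 8x8 toy grid the unfiltered graph coarsens 64 nodes to 16 aggregates, while the filtered graph only gets down to 32: a larger threshold removes more edges, the aggregates get smaller, and the coarsening is slower.
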

  Matt

> Note that square_graph 10 is very unusual.
>
> > There are some simple filters we might be able to use (Luke Olson
> > talked about it today), but Mark is the expert.
> >
> >    Matt
> >
> >
> >> Fande
> >>
> >>
> >>>
> >>>
> >>>> Fande,
> >>>>
> >>>>
> >>>>> If we can avoid this pattern algorithmically (which I don't know how
> >>>>> to do), or find faster ways to communicate (which I am working on),
> >>>>> then we can get better scalability.
> >>>>>
> >>>>> --Junchao Zhang
> >>>>>
> >>>>> On Thu, Jul 26, 2018 at 10:02 AM, Pierre Jolivet <
> >>>>> pierre.jolivet at enseeiht.fr> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> > On 26 Jul 2018, at 4:24 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >>>>>> >
> >>>>>> > Hi Pierre,
> >>>>>> >
> >>>>>> >> I’m using GAMG on a shifted Laplacian with these options:
> >>>>>> >> -st_fieldsplit_pressure_ksp_type preonly
> >>>>>> >> -st_fieldsplit_pressure_pc_composite_type additive
> >>>>>> >> -st_fieldsplit_pressure_pc_type composite
> >>>>>> >> -st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
> >>>>>> >> -st_fieldsplit_pressure_sub_0_pc_type ksp
> >>>>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
> >>>>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
> >>>>>> >> -st_fieldsplit_pressure_sub_1_pc_type ksp
> >>>>>> >> and I end up with the following logs on 512 (top) and 2048 (bottom) processes:
> >>>>>> >> MatMult          1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09 5.6e+03 0.0e+00  7 71 75 63  0   7 71 75 63  0 650501
> >>>>>> >> MatMultAdd        204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 50762
> >>>>>> >> MatMultTranspose  204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 145505
> >>>>>> >> [..]
> >>>>>> >> KSPSolve_FS_3       7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09 1.5e+03 2.6e+05  2 14 71 16 34   2 14 71 16 34 539009
> >>>>>> >> MatMult          1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10 2.4e+03 0.0e+00  7 66 75 61  0   7 66 75 61  0 728371
> >>>>>> >> MatMultAdd        222360 1.0 2.5904e+03 48.0 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 14  0  4  0  0  14  0  4  0  0  2872
> >>>>>> >> MatMultTranspose  222360 1.0 1.8736e+03 421.8 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00  0  0  4  0  0   0  0  4  0  0  3970
> >>>>>> >> [..]
> >>>>>> >> KSPSolve_FS_3       7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10 6.1e+02 2.7e+05 17 11 67 14 28  17 11 67 14 28 148175
> >>>>>> >> MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin
> >>>>>> >> the scalability of the overall solver. The pressure space “only”
> >>>>>> >> has 3M unknowns so I’m guessing that’s why GAMG is having a hard
> >>>>>> >> time strong scaling.
> >>>>>> >
> >>>>>> > 3M unknowns divided by 512 processes implies fewer than 10k unknowns
> >>>>>> > per process. It is not unusual to see strong scaling roll off at
> >>>>>> > this size. Also note that the time per call(!) for "MatMult" is
> >>>>>> > the same for both cases, indicating that you are running into a
> >>>>>> > latency-limited regime.
> >>>>>> >
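
The time-per-call remark can be checked directly from the counts and max times in the log lines quoted above. A minimal Python sketch follows; the dictionary layout and the helper name are just for illustration, and MatMultAdd is included for contrast.

```python
# (call count, max time in seconds) taken from the quoted log lines.
events = {
    "MatMult":    {512: (1577790, 3.1967e+03), 2048: (1778795, 3.5511e+03)},
    "MatMultAdd": {512: (204786, 1.3412e+02), 2048: (222360, 2.5904e+03)},
}


def time_per_call(event, nprocs):
    """Max time divided by call count for one event at one process count."""
    count, seconds = events[event][nprocs]
    return seconds / count


for name in events:
    t512 = time_per_call(name, 512)
    t2048 = time_per_call(name, 2048)
    print(f"{name:12s} {t512:.3e} s/call on 512, {t2048:.3e} s/call on 2048 "
          f"(ratio {t2048 / t512:.2f})")
```

MatMult stays at roughly 2 ms per call on both process counts, while the MatMultAdd time per call grows by more than an order of magnitude when going to 2048 processes.
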
> >>>>>> > Also, have a look at the time ratios: With 2048 processes,
> >>>>>> > MatMultAdd and MatMultTranspose show a time ratio of 48 and 421,
> >>>>>> > respectively. Maybe one of your MPI ranks is getting a huge workload?
> >>>>>>
> >>>>>> Maybe inside GAMG itself (how could I check this?), but since the
> >>>>>> timing and ratio of the MatMult look OK and the distribution of the
> >>>>>> pressure space is the same as the other three fields, I’m guessing
> >>>>>> this does not come from my global Mat, but I may be wrong.
> >>>>>>
> >>>>>> >> For the other fields, the matrix is somehow distributed nicely,
> >>>>>> >> i.e., I don’t want to change the overall distribution of the matrix.
> >>>>>> >> Do you have any suggestion to improve the performance of GAMG in
> >>>>>> >> that scenario? I had two ideas in mind but please correct me if
> >>>>>> >> I’m wrong or if this is not doable:
> >>>>>> >> 1) before setting up GAMG, first use a PCTELESCOPE to avoid having
> >>>>>> >> too many processes work on this small problem
> >>>>>> >> 2) have the sub_0_ and the sub_1_ work on two different
> >>>>>> >> nonoverlapping communicators of size PETSC_COMM_WORLD/2, do the
> >>>>>> >> solve concurrently, and then sum the solutions (only worth doing
> >>>>>> >> because of -pc_composite_type additive). I have no idea if this
> >>>>>> >> is easily doable with PETSc command line arguments.
> >>>>>> >
> >>>>>> > 1) is the more flexible approach, as you have better control over
> >>>>>> > the system sizes after 'telescoping'.
> >>>>>>
> >>>>>> Right, but the advantage of 2) is that I wouldn't have half or more
> >>>>>> of the processes idling and I could overlap the solves of both
> >>>>>> subpc in the PCCOMPOSITE.
> >>>>>>
> >>>>>> I’m attaching the -log_view for both runs (I trimmed some options).
> >>>>>>
> >>>>>> Thanks for your help,
> >>>>>> Pierre
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> > Best regards,
> >>>>>> > Karli
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> their
> > experiments lead.
> > -- Norbert Wiener
> >
> > https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
>
