[petsc-dev] Bad scaling of GAMG in FieldSplit

Thu Jul 26 10:51:44 CDT 2018

Hi, Pierre,
  From your log_view files, I see you did strong scaling. You used 4X more
cores, but the execution time only dropped from 3.9143e+04 to 1.6910e+04.
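
As a rough check of what those two timings mean, here is a minimal sketch that computes the strong-scaling speedup and parallel efficiency. It uses only the two execution times and the 4X core ratio quoted above; the variable names are illustrative.

```python
# Execution times quoted above from the two -log_view files.
t_base = 3.9143e+04    # seconds on the smaller run
t_scaled = 1.6910e+04  # seconds with 4X more cores
core_ratio = 4.0

speedup = t_base / t_scaled        # ideal strong scaling would give 4.0
efficiency = speedup / core_ratio  # ideal strong scaling would give 1.0

print(f"speedup    = {speedup:.2f}")
print(f"efficiency = {efficiency:.0%}")
```

This gives a speedup of about 2.3 and a parallel efficiency of about 58%.
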
  From my previous analysis of a GAMG weak scaling test, it looks like
communication is one of the reasons for the poor scaling.  In your
case, VecScatterEnd time doubled from 1.5575e+03 to 3.2413e+03, and its
share of the total time jumped from 1% to 17%. This time can contribute to
the big time ratios in MatMultAdd and MatMultTranspose, misleading you into
thinking there was a computational load imbalance.
  The reason, I found, is that in the interpolation and restriction phases
of GAMG the communication pattern is very bad: a few processes communicate
with hundreds of neighbors, with message sizes of a few bytes.  If we can
avoid this pattern algorithmically (which I don't know how to do), or find
faster ways to communicate (which I am working on), then we can get better
scalability.
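
To illustrate why hundreds of tiny messages hurt, here is a minimal sketch using the standard latency-bandwidth (alpha-beta) cost model. The network parameters, the neighbor count of 300, and the 8-byte payload are all assumed values chosen for illustration; none of them is measured from these runs.

```python
def message_time(n_messages, bytes_per_message, alpha, beta):
    """Alpha-beta (latency-bandwidth) estimate for sending n equal messages."""
    return n_messages * (alpha + bytes_per_message * beta)

# Hypothetical network parameters, chosen only for illustration.
ALPHA = 1.0e-6   # per-message latency in seconds (assumed)
BETA = 1.0e-10   # transfer time per byte in seconds (assumed, ~10 GB/s)

# "Hundreds of neighbors" and "a few bytes" made concrete (assumed values).
n_neighbors = 300
payload = 8  # bytes per message

many_small = message_time(n_neighbors, payload, ALPHA, BETA)
latency_share = n_neighbors * ALPHA / many_small

print(f"estimated time for {n_neighbors} tiny messages: {many_small:.3e} s")
print(f"fraction of that time spent on latency: {latency_share:.1%}")
```

Under these assumptions nearly all of the estimated time is per-message latency, so the cost grows with the number of neighbors rather than with the amount of data moved.
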

--Junchao Zhang

On Thu, Jul 26, 2018 at 10:02 AM, Pierre Jolivet <pierre.jolivet at enseeiht.fr
> wrote:

>
>
> > On 26 Jul 2018, at 4:24 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >
> > Hi Pierre,
> >
> >> I’m using GAMG on a shifted Laplacian with these options:
> >> -st_fieldsplit_pressure_ksp_type preonly
> >> -st_fieldsplit_pressure_pc_composite_type additive
> >> -st_fieldsplit_pressure_pc_type composite
> >> -st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
> >> -st_fieldsplit_pressure_sub_0_pc_type ksp
> >> -st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
> >> -st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
> >> -st_fieldsplit_pressure_sub_1_pc_type ksp
> >> and I end up with the following logs on 512 (top) and 2048 (bottom)
> processes:
> >> MatMult          1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09 5.6e+03 0.0e+00  7 71 75 63  0   7 71 75 63  0 650501
> >> MatMultAdd        204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 50762
> >> MatMultTranspose  204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 145505
> >> [..]
> >> KSPSolve_FS_3       7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09 1.5e+03 2.6e+05  2 14 71 16 34   2 14 71 16 34 539009
> >> MatMult          1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10 2.4e+03 0.0e+00  7 66 75 61  0   7 66 75 61  0 728371
> >> MatMultAdd        222360 1.0 2.5904e+03 48.0 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 14  0  4  0  0  14  0  4  0  0  2872
> >> MatMultTranspose  222360 1.0 1.8736e+03 421.8 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00  0  0  4  0  0   0  0  4  0  0  3970
> >> [..]
> >> KSPSolve_FS_3       7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10 6.1e+02 2.7e+05 17 11 67 14 28  17 11 67 14 28 148175
> >> MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin the
> scalability of the overall solver. The pressure space “only” has 3M
> unknowns so I’m guessing that’s why GAMG is having a hard time strong
> scaling.
> >
> > 3M unknowns divided by 512 processes implies less than 10k unknowns per
> process. It is not unusual to see strong scaling roll off at this size.
> Also note that the time per call(!) for "MatMult" is the same for both
> cases, indicating that you run into a latency-limited regime.
> >
> > Also, have a look at the time ratios: With 2048 processes, MatMultAdd
> and MatMultTranspose show a time ratio of 48 and 421, respectively. Maybe
> one of your MPI ranks is getting a huge workload?
>
> Maybe inside GAMG itself (how could I check this?), but since the timing
> and ratio of the MatMult look OK and the distribution of the pressure space
> is the same as the other three fields, I’m guessing this does not come from
> my global Mat, but I may be wrong.
>
> >> For the other fields, the matrix is somehow distributed nicely, i.e., I
> don’t want to change the overall distribution of the matrix.
> >> Do you have any suggestion to improve the performance of GAMG in that
> scenario? I had two ideas in mind but please correct me if I’m wrong or if
> this is not doable:
> >> 1) before setting up GAMG, first use a PCTELESCOPE to avoid having too
> many processes work on this small problem
> >> 2) have the sub_0_ and the sub_1_ work on two different nonoverlapping
> communicators of size PETSC_COMM_WORLD/2, do the solve concurrently, and
> then sum the solutions (only worth doing because of -pc_composite_type
> additive). I have no idea if this is easily doable with PETSc command line
> arguments.
> >
> > 1) is the more flexible approach, as you have better control over the
> system sizes after 'telescoping'.
>
> Right, but the advantage of 2) is that I wouldn't have half or more of the
> processes idling and I could overlap the solves of both subpc in the
> PCCOMPOSITE.
>
> I’m attaching the -log_view for both runs (I trimmed some options).
>
> Thanks for your help,
> Pierre
>
>
>
> > Best regards,
> > Karli
>
>
>
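
Karl's observations in the quoted message (fewer than 10k unknowns per process, and the same time per MatMult call in both runs) can be checked from the quoted log lines. The sketch below is a minimal check that copies the call counts and times from those lines and treats "3M unknowns" as exactly 3,000,000, which is an approximation.

```python
# (number of calls, time in seconds) copied from the quoted -log_view lines.
log = {
    512:  {"MatMult": (1577790, 3.1967e+03), "MatMultAdd": (204786, 1.3412e+02)},
    2048: {"MatMult": (1778795, 3.5511e+03), "MatMultAdd": (222360, 2.5904e+03)},
}
n_unknowns = 3_000_000  # "3M unknowns" in the pressure space (approximate)

per_call = {}  # seconds per call, keyed by (process count, event name)
per_rank = {}  # average unknowns per process, keyed by process count
for nprocs, events in log.items():
    per_rank[nprocs] = n_unknowns / nprocs
    print(f"{nprocs} processes: about {per_rank[nprocs]:.0f} unknowns per process")
    for name, (calls, seconds) in events.items():
        per_call[(nprocs, name)] = seconds / calls
        print(f"  {name:<11} {1e3 * per_call[(nprocs, name)]:.3f} ms per call")
```

The time per MatMult call comes out at about 2 ms on both 512 and 2048 processes, while the time per MatMultAdd call is more than ten times larger on 2048 processes than on 512.
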