[petsc-dev] Bad scaling of GAMG in FieldSplit
Junchao Zhang
jczhang at mcs.anl.gov
Thu Jul 26 11:35:52 CDT 2018
On Thu, Jul 26, 2018 at 11:15 AM, Fande Kong <fdkong.jd at gmail.com> wrote:
>
>
> On Thu, Jul 26, 2018 at 9:51 AM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
>
>> Hi, Pierre,
>> From your log_view files, I see you did strong scaling. You used 4X
>> more cores, but the execution time only dropped from 3.9143e+04
>> to 1.6910e+04.
>> From my previous analysis of a GAMG weak scaling test, it looks like
>> communication is one of the reasons for the poor scaling. In your
>> case, VecScatterEnd time doubled from 1.5575e+03 to 3.2413e+03. Its
>> time percent jumped from 1% to 17%. This time can contribute to the big
>> time ratio in MatMultAdd and MatMultTranspose, misleading you into thinking
>> there was a computational load imbalance.
>> The reason is that, in the interpolation and restriction phases
>> of gamg, I found the communication pattern to be very bad: a few processes
>> communicate with hundreds of neighbors with message sizes of a few bytes.
>>
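The strong-scaling numbers quoted above can be turned into a speedup and a parallel efficiency with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the process counts (512 and 2048) are the ones given for the two runs further down the thread.

```python
# Timings quoted above, in seconds.
t_small, t_large = 3.9143e+04, 1.6910e+04   # total execution time
p_small, p_large = 512, 2048                # process counts of the two runs

speedup = t_small / t_large
efficiency = speedup / (p_large / p_small)

# VecScatterEnd times quoted above, in seconds.
scatter_small, scatter_large = 1.5575e+03, 3.2413e+03
scatter_growth = scatter_large / scatter_small

print(f"speedup {speedup:.2f}x on {p_large // p_small}x the processes")
print(f"parallel efficiency {efficiency:.0%}")
print(f"VecScatterEnd grew by {scatter_growth:.2f}x")
```

With these inputs the speedup is about 2.3x on 4x the processes, i.e. roughly 58% parallel efficiency, while the VecScatterEnd time itself grows by about 2.1x.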
>
> We may need to truncate the interpolation/restriction operators and also do
> some aggressive coarsening. Unfortunately, GAMG currently does not support this.
>
Are these gamg options the truncation you were thinking of?
-pc_gamg_threshold[] <thresh,default=0> - Before aggregating the graph GAMG
will remove small values from the graph on each level
-pc_gamg_threshold_scale <scale,default=1> - Scaling of threshold on each
coarser grid if not specified
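
To illustrate how these two options might interact, here is a minimal Python sketch. It assumes, following the help text above, that any level without an explicitly given threshold takes the previous level's threshold multiplied by the scale. The function name `level_thresholds` and its `nlevels` argument are made up for illustration and are not PETSc calls.

```python
def level_thresholds(given, scale=1.0, nlevels=4):
    """Return one drop threshold per level, finest level first.

    Assumption: levels beyond those listed in `given` reuse the previous
    level's threshold multiplied by `scale`.
    """
    if not given:
        given = [0.0]  # default threshold according to the help text
    thresholds = list(given[:nlevels])
    while len(thresholds) < nlevels:
        thresholds.append(thresholds[-1] * scale)
    return thresholds


# Example: one threshold given for the finest level, halved on each coarser level.
print(level_thresholds([0.01], scale=0.5, nlevels=4))
```

With the defaults (threshold 0, scale 1) every level ends up with a threshold of 0 under this assumption, so no values would be dropped from the graph.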
> Fande,
>
>
>> If we can avoid this pattern algorithmically (I don't know how), or
>> find ways to communicate faster (which I am working on), then we can get
>> better scalability.
>>
>> --Junchao Zhang
>>
>> On Thu, Jul 26, 2018 at 10:02 AM, Pierre Jolivet <
>> pierre.jolivet at enseeiht.fr> wrote:
>>
>>>
>>>
>>> > On 26 Jul 2018, at 4:24 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>>> >
>>> > Hi Pierre,
>>> >
>>> >> I’m using GAMG on a shifted Laplacian with these options:
>>> >> -st_fieldsplit_pressure_ksp_type preonly
>>> >> -st_fieldsplit_pressure_pc_composite_type additive
>>> >> -st_fieldsplit_pressure_pc_type composite
>>> >> -st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
>>> >> -st_fieldsplit_pressure_sub_0_pc_type ksp
>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
>>> >> -st_fieldsplit_pressure_sub_1_pc_type ksp
>>> >> and I end up with the following logs on 512 (top) and 2048 (bottom)
>>> processes:
>>> >> MatMult 1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09
>>> 5.6e+03 0.0e+00 7 71 75 63 0 7 71 75 63 0 650501
>>> >> MatMultAdd 204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08
>>> 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 50762
>>> >> MatMultTranspose 204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08
>>> 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 145505
>>> >> [..]
>>> >> KSPSolve_FS_3 7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09
>>> 1.5e+03 2.6e+05 2 14 71 16 34 2 14 71 16 34 539009
>>> >> MatMult 1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10
>>> 2.4e+03 0.0e+00 7 66 75 61 0 7 66 75 61 0 728371
>>> >> MatMultAdd 222360 1.0 2.5904e+03 48.0 4.31e+09 1.9 2.4e+09
>>> 1.3e+02 0.0e+00 14 0 4 0 0 14 0 4 0 0 2872
>>> >> MatMultTranspose 222360 1.0 1.8736e+03 421.8 4.31e+09 1.9 2.4e+09
>>> 1.3e+02 0.0e+00 0 0 4 0 0 0 0 4 0 0 3970
>>> >> [..]
>>> >> KSPSolve_FS_3 7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10
>>> 6.1e+02 2.7e+05 17 11 67 14 28 17 11 67 14 28 148175
>>> >> MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin the
>>> scalability of the overall solver. The pressure space “only” has 3M
>>> unknowns so I’m guessing that’s why GAMG is having a hard time strong
>>> scaling.
>>> >
>>> > 3M unknowns divided by 512 processes implies less than 10k unknowns
>>> per process. It is not unusual to see strong scaling roll off at this size.
>>> Also note that the time per call(!) for "MatMult" is the same for both
>>> cases, indicating that you run into a latency-limited regime.
>>> >
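The two observations above (unknowns per process and time per MatMult call) can be checked directly against the log excerpts. The sketch below is a rough check under stated assumptions: it treats "3M" as exactly 3,000,000 unknowns, and it uses the call count and time columns of the MatMult rows as quoted, which aggregate every MatMult in the run rather than only those inside GAMG. The dictionary layout is illustrative.

```python
# Figures taken from the two -log_view excerpts above.
unknowns = 3_000_000  # "3M unknowns" in the pressure space (approximate)
runs = {
    512:  {"calls": 1577790, "time": 3.1967e+03},  # MatMult row, time in s
    2048: {"calls": 1778795, "time": 3.5511e+03},
}

per_rank = {p: unknowns / p for p in runs}
per_call = {p: r["time"] / r["calls"] for p, r in runs.items()}

for p in runs:
    print(f"{p:5d} ranks: {per_rank[p]:8.0f} unknowns/rank, "
          f"{per_call[p] * 1e3:.2f} ms per MatMult call")
```

Both runs come out at about 2 ms per MatMult call, even though the unknowns per rank drop by a factor of four, which is consistent with a latency-limited regime.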
>>> > Also, have a look at the time ratios: With 2048 processes, MatMultAdd
>>> and MatMultTranspose show a time ratio of 48 and 421, respectively. Maybe
>>> one of your MPI ranks is getting a huge workload?
>>>
>>> Maybe inside GAMG itself (how could I check this?), but since the timing
>>> and ratio of the MatMult look OK and the distribution of the pressure space
>>> is the same as the other three fields, I’m guessing this does not come from
>>> my global Mat, but I may be wrong.
>>>
>>> >> For the other fields, the matrix is somehow distributed nicely, i.e.,
>>> I don’t want to change the overall distribution of the matrix.
>>> >> Do you have any suggestion to improve the performance of GAMG in that
>>> scenario? I had two ideas in mind but please correct me if I’m wrong or if
>>> this is not doable:
>>> >> 1) before setting up GAMG, first use a PCTELESCOPE to avoid having
>>> too many processes work on this small problem
>>> >> 2) have the sub_0_ and the sub_1_ work on two different
>>> nonoverlapping communicators of size PETSC_COMM_WORLD/2, do the solve
>>> concurrently, and then sum the solutions (only worth doing because of
>>> -pc_composite_type additive). I have no idea if this is easily doable with
>>> PETSc command line arguments.
>>> >
>>> > 1) is the more flexible approach, as you have better control over the
>>> system sizes after 'telescoping'.
>>>
>>> Right, but the advantage of 2) is that I wouldn't have one half or more
>>> of processes idling and I could overlap the solves of both subpc in the
>>> PCCOMPOSITE.
>>>
>>> I’m attaching the -log_view for both runs (I trimmed some options).
>>>
>>> Thanks for your help,
>>> Pierre
>>>
>>>
>>>
>>> > Best regards,
>>> > Karli
>>>
>>>
>>>
>>
>