<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jul 26, 2018 at 10:35 AM, Junchao Zhang <span dir="ltr"><<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Thu, Jul 26, 2018 at 11:15 AM, Fande Kong <span dir="ltr"><<a href="mailto:fdkong.jd@gmail.com" target="_blank">fdkong.jd@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="m_-8072712635775950846gmail-">On Thu, Jul 26, 2018 at 9:51 AM, Junchao Zhang <span dir="ltr"><<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi, Pierre,<div> From your log_view files, I see you did strong scaling. You used 4X more cores, but the execution time only dropped from 3.9143e+04 to 1.6910e+04.</div><div> From my previous analysis of a GAMG weak scaling test, it looks communication is one of the reasons that caused poor scaling. In your case, VecScatterEnd time was doubled from 1.5575e+03 to 3.2413e+03. Its time percent jumped from 1% to 17%. This time can contribute to the big time ratio in MatMultAdd ant MatMultTranspose, misleading you guys thinking there was load-imbalance computation-wise. </div><div> The reason is that I found in the interpolation and restriction phases of gamg, the communication pattern is very bad. Few processes communicate with hundreds of neighbors with message sizes of a few bytes. </div></div></blockquote><div><br></div></span><div>We may need to truncate interpolation/restriction operators. Also do some aggressive coarsening. Unfortunately, GAMG currently does not support.</div></div></div></div></blockquote><div><br></div></span> Are these gamg options the <span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">truncation you thought?</span></div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><br>-pc_gamg_threshold[] <thresh,default=0> - Before aggregating the graph GAMG will remove small values from the graph on each level<br>-pc_gamg_threshold_scale <scale,default=1> - Scaling of threshold on each coarser grid if not specified</div></div></div></blockquote><div><br></div><div>Nope. Totally different things. 

Fande

>> Fande,
>>
>>> If we can avoid this pattern algorithmically (which I don't know how to do), or find ways to make the communication faster (which I am working on), then we can get better scalability.
>>>
>>> --Junchao Zhang
>>>
<br><div class="gmail_quote">On Thu, Jul 26, 2018 at 10:02 AM, Pierre Jolivet <span dir="ltr"><<a href="mailto:pierre.jolivet@enseeiht.fr" target="_blank">pierre.jolivet@enseeiht.fr</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_-8072712635775950846gmail-m_-650146499340825009gmail-m_3677229278573272207HOEnZb"><div class="m_-8072712635775950846gmail-m_-650146499340825009gmail-m_3677229278573272207h5"><br>

> On 26 Jul 2018, at 4:24 PM, Karl Rupp <rupp@iue.tuwien.ac.at> wrote:
>
> Hi Pierre,
>
>> I’m using GAMG on a shifted Laplacian with these options:
>> -st_fieldsplit_pressure_ksp_type preonly
>> -st_fieldsplit_pressure_pc_composite_type additive
>> -st_fieldsplit_pressure_pc_type composite
>> -st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
>> -st_fieldsplit_pressure_sub_0_pc_type ksp
>> -st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
>> -st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
>> -st_fieldsplit_pressure_sub_1_pc_type ksp
>> and I end up with the following logs on 512 (top) and 2048 (bottom) processes:
>> MatMult 1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09 5.6e+03 0.0e+00 7 71 75 63 0 7 71 75 63 0 650501
>> MatMultAdd 204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 50762
>> MatMultTranspose 204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 145505
>> [..]
>> KSPSolve_FS_3 7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09 1.5e+03 2.6e+05 2 14 71 16 34 2 14 71 16 34 539009
>> MatMult 1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10 2.4e+03 0.0e+00 7 66 75 61 0 7 66 75 61 0 728371
>> MatMultAdd 222360 1.0 2.5904e+03 48.0 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 14 0 4 0 0 14 0 4 0 0 2872
>> MatMultTranspose 222360 1.0 1.8736e+03 421.8 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 0 0 4 0 0 0 0 4 0 0 3970
>> [..]
>> KSPSolve_FS_3 7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10 6.1e+02 2.7e+05 17 11 67 14 28 17 11 67 14 28 148175
>> MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin the scalability of the overall solver. The pressure space “only” has 3M unknowns so I’m guessing that’s why GAMG is having a hard time strong scaling.
>
> 3M unknowns divided by 512 processes implies less than 10k unknowns per process. It is not unusual to see strong scaling roll off at this size. Also note that the time per call(!) for "MatMult" is the same for both cases, indicating that you run into a latency-limited regime.
>
> Also, have a look at the time ratios: With 2048 processes, MatMultAdd and MatMultTranspose show a time ratio of 48 and 421, respectively. Maybe one of your MPI ranks is getting a huge workload?
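Regarding the first point: indeed, 3,000,000 unknowns over 512 ranks is only about 5,900 rows per rank, and about 1,500 rows per rank on 2048 ranks, so I agree this part of the solve is deep in the latency-limited regime.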
Maybe inside GAMG itself (how could I check this?), but since the timing and ratio of the MatMult look OK and the distribution of the pressure space is the same as for the other three fields, I’m guessing this does not come from my global Mat, but I may be wrong.
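One thing I could try (a rough, untested sketch; I assume I can grab the set-up GAMG PC from the inner KSP of the sub_1_ PCKSP with KSPGetPC(), and the helper name below is made up) is to print how many rows each rank owns on every level of the hierarchy and look for outliers on the coarse grids:

#include <petscksp.h>

/* Hypothetical helper: print how many rows each rank owns on every level of a
   set-up GAMG hierarchy, to spot a rank that gets a huge share of a coarse grid.
   "pc" is assumed to be the PCGAMG object. */
static PetscErrorCode CheckGAMGBalance(PC pc)
{
  MPI_Comm       comm;
  PetscMPIInt    rank;
  PetscInt       nlevels, l, nlocal, nglobal;
  KSP            smoother;
  Mat            A;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscObjectGetComm((PetscObject)pc, &comm);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(comm, &rank);CHKERRQ(ierr);
  ierr = PCMGGetLevels(pc, &nlevels);CHKERRQ(ierr); /* GAMG is built on top of PCMG */
  for (l = 0; l < nlevels; ++l) {
    ierr = PCMGGetSmoother(pc, l, &smoother);CHKERRQ(ierr);
    ierr = KSPGetOperators(smoother, &A, NULL);CHKERRQ(ierr);
    ierr = MatGetLocalSize(A, &nlocal, NULL);CHKERRQ(ierr);
    ierr = MatGetSize(A, &nglobal, NULL);CHKERRQ(ierr);
    ierr = PetscSynchronizedPrintf(comm, "[%d] level %D: %D of %D rows\n", rank, l, nlocal, nglobal);CHKERRQ(ierr);
  }
  ierr = PetscSynchronizedFlush(comm, PETSC_STDOUT);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

If one rank ends up owning most of a coarse level, that alone could explain the huge MatMultAdd/MatMultTranspose ratios.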

>> For the other fields, the matrix is somehow distributed nicely, i.e., I don’t want to change the overall distribution of the matrix.
>> Do you have any suggestion to improve the performance of GAMG in that scenario? I had two ideas in mind, but please correct me if I’m wrong or if this is not doable:
>> 1) before setting up GAMG, first use a PCTELESCOPE to avoid having too many processes work on this small problem
>> 2) have the sub_0_ and the sub_1_ work on two different nonoverlapping communicators of size PETSC_COMM_WORLD/2, do the solves concurrently, and then sum the solutions (only worth doing because of -pc_composite_type additive). I have no idea if this is easily doable with PETSc command line arguments
>
> 1) is the more flexible approach, as you have better control over the system sizes after 'telescoping'.

Right, but the advantage of 2) is that I wouldn't have half or more of the processes idling, and I could overlap the solves of both sub-PCs in the PCCOMPOSITE.
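For 1), I assume the options would look something along these lines (untested; the reduction factor 16 is just a placeholder I would have to tune, and I have not double-checked how the prefixes nest, which -ksp_view should confirm), replacing the current sub_1_ksp GAMG options:

-st_fieldsplit_pressure_sub_1_ksp_pc_type telescope
-st_fieldsplit_pressure_sub_1_ksp_pc_telescope_reduction_factor 16
-st_fieldsplit_pressure_sub_1_ksp_telescope_pc_type gamg
-st_fieldsplit_pressure_sub_1_ksp_telescope_pc_gamg_square_graph 10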

I’m attaching the -log_view for both runs (I trimmed some options).

Thanks for your help,
Pierre


> Best regards,
> Karli