<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <div class="moz-cite-prefix">On 08/11/18 13:01, Matthew Knepley

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAMYG4Gk6G=gGsHiC=RoNipjusrs_+inavbK0KRP25388tP_Lrw@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <div dir="ltr">On Thu, Nov 8, 2018 at 6:41 AM "Alberto F.

            Martín" via petsc-users &lt;<a moz-do-not-send="true"

              href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>&gt;

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> Dear Mark,<br>

              <br>

              thanks for your quick and comprehensive reply. <br>

              <br>

              Before moving to the results of the experiments that you

              suggested, let me clarify two points<br>

              about my original e-mail and your answer: <br>

              <br>

              (1) The raw timings and #iters. provided in my first

              e-mail were actually <br>

                    obtained with "-pc_gamg_square_graph 1" (and not 0);

              sorry about that, my mistake. <br>

                    (the logs, though, were consistent with the solver

              configuration provided).<br>

                    The raw figures with "-pc_gamg_square_graph 0" are

              actually as follows:<br>

              <br>

                    (load3): [0.25074561, 0.3650926566, 0.6251466936,

              0.8709517661, 15.52180776] <br>

                    (load3): [0.148803731, 0.325266364, 0.5538515123,

              0.7537377281, 1.475100923]<br>

                    (load3): [8, 9, 11, 12, 12]<br>

              <br>

                    Bottom line: significant improvement of absolute

              times for the first four problems, marginal improvement for

              <br>

                                         the largest problem (compared

              to "-pc_gamg_square_graph 1")<br>

                                         <br>

              (2) &lt;&lt;<i>The PC setup times are large (I see 48

                seconds at 16K but you report 16). </i><i><br>

              </i><i>-pc_gamg_square_graph 10 should help

                that.</i>&gt;&gt;<br>

              <br>

              This discrepancy is explained by the following note

              in my original e-mail:<br>

              <br>

              &lt;&lt;<i>Please note that within each run,

                I execute these two stages up to</i><i><br>

              </i><i>three times, and this influences

                absolute timings given in -log_view.</i>&gt;&gt;<br>

              <br>

              I tried new configurations based on your suggestions. Find

              attached the results.<br>

              (legends indicate changes with respect to the solver

              configuration provided <br>

              in my first e-mail).<br>

              <br>

              Bottom lines: (1) the configuration provided in my

              original e-mail leads to the fastest execution<br>

              and the fewest iterations for the first four problems.

              (2) <b>The (new) parameter-value combinations</b><b><br>

              </b><b>suggested seem to have almost no impact on the

                preconditioner setup time of the last problem.</b></div>

          </blockquote>

          <div><br>

          </div>

          <div>Mark, could this bad setup just be non-scalability in

            ParMetis? How do we see the ParMetis time?</div>

          <div><br>

          </div>

          <div>  Thanks,</div>

          <div><br>

          </div>

          <div>    Matt</div>

        </div>

      </div>

    </blockquote>

    <br>

    Dear Matt,<br>

    <br>

    I did not configure PETSc with ParMetis support. Should I? <br>

    <br>

    I found this out when I tried to use "-pc_gamg_repartition". PETSc

    complained that it was not compiled with ParMetis support.<br>

    <br>

    Thanks!<br>

    Best regards,<br>

     Alberto.<br>

    <br>

    <blockquote

cite="mid:CAMYG4Gk6G=gGsHiC=RoNipjusrs_+inavbK0KRP25388tP_Lrw@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"><b><br>

              </b>I also tried HYPRE-BoomerAMG as suggested, with two

              different configurations. <br>

              <br>

              *** SYMMETRIC CONFIGURATION ***<br>

              -ksp_type cg<br>

              -ksp_monitor<br>

              -ksp_rtol 1.0e-6<br>

              -ksp_converged_reason<br>

              -ksp_max_it 500<br>

              -ksp_norm_type unpreconditioned<br>

              -ksp_view<br>

              -log_view<br>

              <br>

              -pc_type hypre<br>

              -pc_hypre_type boomeramg<br>

              -pc_hypre_boomeramg_print_statistics 1<br>

              -pc_hypre_boomeramg_strong_threshold 0.25<br>

              -pc_hypre_boomeramg_coarsen_type HMIS<br>

              -pc_hypre_boomeramg_relax_type_down symmetric-SOR/Jacobi<br>

              -pc_hypre_boomeramg_relax_type_up symmetric-SOR/Jacobi<br>

              -pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination<br>

              <br>

              *** UNSYMMETRIC CONFIGURATION ***<br>

              -ksp_type gmres<br>

              -ksp_gmres_restart 500<br>

              -ksp_monitor<br>

              -ksp_rtol 1.0e-6<br>

              -ksp_converged_reason<br>

              -ksp_max_it 500<br>

              -ksp_pc_side right<br>

              -ksp_norm_type unpreconditioned<br>

              <br>

              -pc_type hypre<br>

              -pc_hypre_type boomeramg<br>

              -pc_hypre_boomeramg_print_statistics 1<br>

              -pc_hypre_boomeramg_strong_threshold 0.25<br>

              -pc_hypre_boomeramg_coarsen_type HMIS<br>

              -pc_hypre_boomeramg_relax_type_down SOR/Jacobi<br>

              -pc_hypre_boomeramg_relax_type_up SOR/Jacobi<br>

              -pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination<br>

              <br>

              The raw results were:<br>

              <br>

              *** SYMMETRIC CONFIGURATION ***<br>

              <br>

              (load3):  [0.1828534687, 0.3055133289, 0.3582984209,

              0.4280304033, 1.343549139]<br>

              (load3):  [0.2102472978, 0.4572948301, 0.7153297188,

              0.9989531627, N/A]<br>

              (load3):  [19, 23, 26, 28, 'DIVERGED_INDEFINITE_PC']<br>

              <br>

              *** UNSYMMETRIC CONFIGURATION ***<br>

              <br>

              (load3): [0.1841227429, 0.3082743008, 0.3652294828,

              0.4654760892, 1.331299786]<br>

              (load3): [0.1194557019, 0.2830136018, 0.5046830242,

              1.363314636, N/A]<br>

              (load3): [15, 19, 24, 48, DIVERGED_ITS]<br>

              <br>

              Thus, the largest problem also seems to cause (even more

              severe) issues for HYPRE, in particular,<br>

              an INDEFINITE PRECONDITIONER with CG, and no convergence

              within 500 iterations for GMRES. <br>

              The preconditioner setup time, though, scales

              reasonably well with the same data distribution<br>

              that we used to feed GAMG (although the preconditioner

              computed for the largest problem seems to be <br>

              totally useless). <br>

              <br>

              I have logs for all these results if required.<br>

              <br>

              Thanks for your help!<br>

              Best regards,<br>

               Alberto.<br>

              <br>

              <br>

              <br>

              <div class="m_302126106648730347moz-cite-prefix">On

                07/11/18 19:46, Mark Adams wrote:<br>

              </div>

              <blockquote type="cite">

                <div dir="ltr">

                  <div dir="ltr">

                    <div dir="ltr">First I would add -gamg_est_ksp_type

                      cg</div>

                    <div dir="ltr"><br>

                    </div>

                    <div>You seem to be converging well so I assume you

                      are setting the null space for GAMG.</div>

                    <div><br>

                    </div>

                    <div>Note, you should test hypre also.</div>

                    <div><br>

                    </div>

                    <div>You probably want a bigger value than

                      "-pc_gamg_process_eq_limit 50": 200 at least, but

                      you should test a range on your machine with the largest

                      problem. This is a parameter for reducing the

                      number of active processors (on coarse grids).</div>

                    <div><br>

                    </div>

                    <div>I would only worry about "load3". This has 16K

                      equations per process, which is where you start

                      noticing "strong scaling" problems, depending on

                      the machine.</div>

                    <div><br>

                    </div>

                    <div>An important parameter is

                      "-pc_gamg_square_graph 0". I would probably start

                      with infinity (e.g., 10).</div>

                    <div><br>

                    </div>

                    <div>Now, I'm not sure about your domain, problem

                      sizes, and thus the weak scaling design. You seem

                      to be scaling on the background mesh, but that may

                      not be a good proxy for complexity. </div>

                    <div><br>

                    </div>

                    <div>You can look at the number of flops and scale

                      it appropriately by the number of solver

                      iterations to get a relative size of the problem.

                      I would recommend scaling the number of processors

                      with this. For instance here the MatMult line for

                      the 4 proc and 16K proc run:</div>

                    <div>

                      <div><font face="monospace, monospace"><br>

                        </font></div>

                      <div><font face="monospace, monospace">------------------------------------------------------------------------------------------------------------------------</font></div>

                      <div><font face="monospace, monospace">Event     

                                    Count      Time (sec)     Flop     

                                                 --- Global ---  ---

                          Stage ---   Total</font></div>

                      <div><font face="monospace, monospace">           

                                 Max Ratio  Max     Ratio   Max  Ratio 

                          Mess   Avg len Reduct  %T %F %M %L %R  %T %F

                          %M %L %R Mflop/s</font></div>

                      <div><font face="monospace, monospace">------------------------------------------------------------------------------------------------------------------------</font></div>

                      <div><font face="monospace, monospace">MatMult   

                                    636 1.0 1.9035e-01 1.0 3.12e+08 1.1

                          7.6e+03 3.0e+03 0.0e+00  0 47 62 44  0   0 47

                          62 44  0  6275 [2 procs]</font></div>

                      <div><font face="monospace, monospace">MatMult   

                                   1416 1.0 1.9601e+00<font

                            color="#ff0000">2744.6</font> 4.82e+08 <font

                            color="#00ff00">0.0</font> 4.3e+08 7.2e+02

                          0.0e+00  0 48 50 48  0   0 48 50 48  0 2757975

                          [16K procs]</font></div>

                    </div>

                    <div><br>

                    </div>

                    <div>Now, you have empty processors. See the massive

                      load <font color="#ff0000">imbalance</font> on

                      time and the <font color="#00ff00">zero</font> on

                      Flops. The "Ratio" is max/min and clearly min=0 so

                      PETSc reports a ratio of 0 (it is infinity

                      really).</div>
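                    <div>The "Ratio" remark above can be illustrated with a
                      small Python sketch. The helper name load_ratio and the
                      sample flop counts are hypothetical, and this is not
                      PETSc's actual logging code; it only shows why a max/min
                      ratio is unbounded when one rank does no work.</div>
<pre>
```python
import math

def load_ratio(per_rank_values):
    """Return max/min over ranks; infinity when some rank did no work."""
    largest = max(per_rank_values)
    smallest = min(per_rank_values)
    if smallest == 0:
        # Division by zero: the imbalance is unbounded (unless all are zero).
        return math.inf if largest != 0 else 1.0
    return largest / smallest

# Hypothetical flop counts per rank: the last rank is empty.
balanced = [3.0e8, 3.1e8, 2.9e8]
with_empty_rank = [4.8e8, 4.5e8, 0.0]

print(load_ratio(balanced))
print(load_ratio(with_empty_rank))
```
</pre>
                    <div>A reported ratio of 0 in the Flop column is
                      therefore best read as a sign of empty processes rather
                      than as perfect balance.</div>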

                    <div><br>

                    </div>

                    <div>Also, weak scaling on a thin body (I don't know

                      your domain) is a little funny because as the

                      problem scales up the mesh becomes more 3D and

                      this causes the cost per equation to go up. That

                      is why I prefer to use the number of non-zeros as

                      the processor scaling function but number of

                      equations is easier ...</div>

                    <div><br>

                    </div>

                    <div>

                      <div>The PC setup times are large (I see 48

                        seconds at 16K but you report 16).

                        -pc_gamg_square_graph 10 should help that.</div>

                      <br

                        class="m_302126106648730347gmail-Apple-interchange-newline">

                    </div>

                    <div>The max number of flops per processor in

                      MatMult goes up by 50% and the max time goes up by

                      10x and the number of iterations goes up by 13/8.

                      If I put all of this together I get that 75% of

                      the time at 16K is in communication. I

                      think that and the absolute time can be improved

                      some by optimizing parameters as I've suggested.</div>
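                    <div>One reading of the arithmetic behind the 75% figure,
                      as a Python sketch: if the work per process grows by the
                      flop factor times the iteration factor, the part of the
                      time growth not explained by that work is attributed to
                      communication. The function name and this particular way
                      of combining the three factors are assumptions, not
                      necessarily the exact computation used above.</div>
<pre>
```python
def communication_fraction(flop_growth, iteration_growth, time_growth):
    """Share of the time growth that extra computation does not explain."""
    # Expected slowdown if time scaled only with the amount of work.
    expected_growth = flop_growth * iteration_growth
    return 1.0 - expected_growth / time_growth

# Factors quoted above: flops up by 50%, iterations up by 13/8, time up by 10x.
fraction = communication_fraction(1.5, 13 / 8, 10.0)
print(f"{fraction:.2f}")
```
</pre>
                    <div>With the quoted factors this gives about 0.76, in
                      line with the roughly 75% mentioned above.</div>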

                    <div><br>

                    </div>

                    <div>Mark</div>

                    <div><font face="monospace, monospace"><br>

                      </font></div>

                    <div><font face="monospace, monospace"><br>

                      </font></div>

                    <div><font face="monospace, monospace"><br>

                      </font></div>

                    <div><br>

                    </div>

                  </div>

                </div>

                <br>

                <div class="gmail_quote">

                  <div dir="ltr">On Wed, Nov 7, 2018 at 11:03 AM

                    "Alberto F. Martín" via petsc-users &lt;<a

                      moz-do-not-send="true"

                      href="mailto:petsc-users@mcs.anl.gov"

                      target="_blank">petsc-users@mcs.anl.gov</a>&gt;

                    wrote:<br>

                  </div>

                  <blockquote class="gmail_quote" style="margin:0 0 0

                    .8ex;border-left:1px #ccc solid;padding-left:1ex">

                    <div text="#000000" bgcolor="#FFFFFF"> Dear All,<br>

                      <br>

                      we are performing a weak scaling test of the PETSc

                      (v3.9.0) GAMG preconditioner when applied to the

                      linear system arising<br>

                      from the <b>conforming unfitted FE discretization

                      </b>(using Q1 Lagrangian FEs) of a 3D PDE Poisson

                      problem, where <br>

                      the boundary of the domain (a popcorn flake)  is

                      described as a zero-level-set embedded within a

                      uniform background <br>

                      (Cartesian-like) hexahedral mesh. Details

                      underlying the FEM formulation can be made

                      available on demand if you <br>

                      believe that this might be helpful, but let me

                      just point out that it is designed such that it

                      addresses the well-known<br>

                      ill-conditioning issues of unfitted FE

                      discretizations due to the small cut cell problem.

                      <br>

                      <br>

                      The weak scaling test is set up as follows. We

                      start from a single cube background mesh, and

                      refine it uniformly several<br>

                      steps, until we have approximately either 10**3

                      (load1), 20**3 (load2), or 40**3 (load3)

                      hexahedra/MPI task when <br>

                      distributing it over 4 MPI tasks. The benchmark is

                      scaled such that the next larger scale problem to

                      be tested is obtained<br>

                      by uniformly refining the mesh from the previous

                      scale and running it on 8 times the number of MPI

                      tasks that we used<br>

                      in the previous scale.  As a result, we obtain

                      three weak scaling curves for each of the three

                      fixed loads per MPI task<br>

                      above, on the following total number of MPI tasks:

                      4, 32, 262, 2097, 16777. The underlying mesh is

                      not partitioned among <br>

                      MPI tasks using ParMETIS (unstructured multilevel

                      graph partitioning)  nor optimally by hand, but

                      following the so-called <br>

                      z-shape space-filling curves provided by an

                      underlying octree-like mesh handler (i.e., p4est

                      library).<br>

                      <br>

                      I configured the preconditioned linear solver as

                      follows:<br>

                      <br>

                      -ksp_type cg<br>

                      -ksp_monitor<br>

                      -ksp_rtol 1.0e-6<br>

                      -ksp_converged_reason<br>

                      -ksp_max_it 500<br>

                      -ksp_norm_type unpreconditioned<br>

                      -ksp_view<br>

                      -log_view<br>

                      <br>

                      -pc_type gamg<br>

                      -pc_gamg_type agg<br>

                      -mg_levels_esteig_ksp_type cg<br>

                      -mg_coarse_sub_pc_type cholesky<br>

                      -mg_coarse_sub_pc_factor_mat_ordering_type nd<br>

                      -pc_gamg_process_eq_limit 50<br>

                      -pc_gamg_square_graph 0<br>

                      -pc_gamg_agg_nsmooths 1<br>

                      <br>

                      Raw timings (in seconds) of the preconditioner set

                      up and PCG iterative solution stage, and number of

                      iterations are as follows:<br>

                      <br>

                      **preconditioner set up**<br>

                      (load1): [0.02542160451, 0.05169247743,

                      0.09266782179, 0.2426272957, 13.64161944]<br>

                      (load2): [0.1239175797  , 0.1885528499  ,

                      0.2719282564  , 0.4783878336, 13.37947339]<br>

                      (load3): [0.6565349903  , 0.9435049873  ,

                      1.299908397    , 1.916243652  , 16.02904088]<br>

                      <br>

                      **PCG stage**<br>

                      (load1): [0.003287350759, 0.008163803257,

                      0.03565631993, 0.08343045413, 0.6937994603]<br>

                      (load2): [0.0205939794    , 0.03594723623  ,

                      0.07593298424, 0.1212046621  , 0.6780373845]<br>

                      (load3): [0.1310882876    , 0.3214917686    ,

                      0.5532023879  , 0.766881627    , 1.485446003]<br>

                      <br>

                      **number of PCG iterations**<br>

                      (load1): [5, 8, 11, 13, 13]<br>

                      (load2): [7, 10, 12, 13, 13]<br>

                      (load3): [8, 10, 12, 13, 13]<br>

                      <br>

                      It can be observed that both the number of linear

                      solver iterations and the PCG stage timings

                      weak-scale <br>

                      remarkably well, but <b>there is a significant

                        time increase when scaling the problem from 2097

                        to 16777 MPI tasks </b><b><br>

                      </b><b>for the preconditioner setup stage</b>

                      (e.g., 1.916243652 vs 16.02904088 sec. with 40**3

                      cells per MPI task).<br>

                      I gathered the combined output of -ksp_view and

                      -log_view (only) for all the points involving the

                      load3 weak scaling<br>

                      test (find them attached to this message). Please

                      note that within each run, I execute these two

                      stages up to<br>

                      three times, and this influences absolute timings

                      given in  -log_view.<br>

                      <br>
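                      The jump in setup time can be quantified with a short
                      Python sketch that computes the growth factor between
                      consecutive scales from the load3 setup timings listed
                      above (perfect weak scaling would give factors close to
                      1). The function name growth_factors is hypothetical.<br>
<pre>
```python
def growth_factors(timings):
    """Ratio of each timing to the previous one along the weak scaling curve."""
    return [later / earlier for earlier, later in zip(timings, timings[1:])]

# load3 preconditioner setup times (seconds) reported above.
load3_setup = [0.6565349903, 0.9435049873, 1.299908397, 1.916243652, 16.02904088]

for factor in growth_factors(load3_setup):
    print(f"{factor:.2f}")
```
</pre>
                      The last factor, about 8.4, stands out against the first
                      three, which all stay below 1.5.<br>
                      <br>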

                      Looking at the output of -log_view, it is very

                      strange to me, e.g., that the stage labelled as

                      "Graph" <br>

                      does not scale properly as it is just a call to

                      MatDuplicate if the block size of the matrix is 1

                      (our case), and<br>

                      I guess that it is just a local operation that

                      does not require any communication.<br>

                      What am I missing here? The load does not seem to

                      be unbalanced looking at the "Ratio" column.<br>

                      <br>

                      I wonder whether the observed behaviour is as

                      expected, or whether this is a misconfiguration of the

                      solver on our side.<br>

                      I played (quite a lot) with several

                      parameter-value combinations, and the

                      configuration above is the one that led to the fastest

                      <br>

                      execution (among the ones tested, which might be

                      incomplete; I can also provide further feedback if

                      helpful).<br>

                      Any feedback that we can get from your experience

                      in order to find the cause(s) of this issue and a

                      mitigating solution<br>

                      will be of high added value.<br>

                      <br>

                      Thanks very much in advance!<br>

                      Best regards,<br>

                       Alberto.<br>

                      <pre class="m_302126106648730347m_1687720227499487021moz-signature" cols="72">-- 

Alberto F. Martín-Huertas

Senior Researcher, PhD. Computational Science

Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)

Parc Mediterrani de la Tecnologia, UPC

Esteve Terradas 5, Building C3, Office 215,

08860 Castelldefels (Barcelona, Spain)

Tel.: (+34) 9341 34223

<a moz-do-not-send="true" class="m_302126106648730347m_1687720227499487021moz-txt-link-abbreviated" href="mailto:e-mail:amartin@cimne.upc.edu" target="_blank">e-mail:amartin@cimne.upc.edu</a>

FEMPAR project co-founder

web: <a moz-do-not-send="true" class="m_302126106648730347m_1687720227499487021moz-txt-link-freetext" href="http://www.fempar.org" target="_blank">http://www.fempar.org</a> 

________________

IMPORTANT NOTICE

All personal data contained on this mail will be processed confidentially and registered in a file property of CIMNE in

order to manage corporate communications. You may exercise the rights of access, rectification, erasure and object by

letter sent to Ed. C1 Campus Norte UPC. Gran Capitán s/n Barcelona.

</pre>

                    </div>

                  </blockquote>

                </div>

              </blockquote>

              <br>


            </div>

          </blockquote>

        </div>

        <br clear="all">

        <div><br>

        </div>

        -- <br>

        <div dir="ltr" class="gmail_signature"

          data-smartmail="gmail_signature">

          <div dir="ltr">

            <div>

              <div dir="ltr">

                <div>

                  <div dir="ltr">

                    <div>What most experimenters take for granted before

                      they begin their experiments is infinitely more

                      interesting than any results to which their

                      experiments lead.<br>

                      -- Norbert Wiener</div>

                    <div><br>

                    </div>

                    <div><a moz-do-not-send="true"

                        href="http://www.cse.buffalo.edu/%7Eknepley/"

                        target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>

                    </div>

                  </div>

                </div>

              </div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <br>


  </body>

</html>