[petsc-users] PETSc (3.9.0) GAMG weak scaling test issue

Matthew Knepley knepley at gmail.com
Thu Nov 8 06:01:13 CST 2018


On Thu, Nov 8, 2018 at 6:41 AM "Alberto F. Martín" via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Dear Mark,
>
> thanks for your quick and comprehensive reply.
>
> Before moving to the results of the experiments that you suggested, let me
> clarify two points
> about my original e-mail and your answer:
>
> (1) The raw timings and #iters. provided in my first e-mail were actually
>       obtained with "-pc_gamg_square_graph 1" (and not 0); sorry about
> that, my mistake.
>       (the logs, though, were consistent with the solver configuration
> provided).
>       The raw figures with "-pc_gamg_square_graph 0" are actually as
> follows:
>
>       (load3): [0.25074561, 0.3650926566, 0.6251466936, 0.8709517661, 15.52180776]   (preconditioner set up)
>       (load3): [0.148803731, 0.325266364, 0.5538515123, 0.7537377281, 1.475100923]   (PCG stage)
>       (load3): [8, 9, 11, 12, 12]   (number of iterations)
>
>       Bottom line: significant improvement in absolute times for the first
>       four problems, but only a marginal improvement for the largest problem
>       (compared to "-pc_gamg_square_graph 1").
>
> (2) <<The PC setup times are large (I see 48 seconds at 16K but you
> report 16).
>           -pc_gamg_square_graph 10 should help that.>>
>
>      This disagreement is explained by the following note in my original
> e-mail:
>
>              <<Please note that within each run, I execute these two stages
>              up to three times, and this influences the absolute timings
>              given in -log_view.>>
>
> I tried new configurations based on your suggestions. Find attached the
> results.
> (legends indicate changes with respect to the solver configuration
> provided
> in my first e-mail).
>
> Bottom lines: (1) the configuration provided in my original e-mail leads
> to the fastest execution
> and the fewest iterations for the first four problems. (2) The (new)
> parameter-value combinations
> suggested seem to have almost no impact on the preconditioner set-up
> time of the largest problem.
>

Mark, could this bad setup just be non-scalability in ParMetis? How do we
see the ParMetis time?

  Thanks,

    Matt


>
> I also tried HYPRE-BoomerAMG as suggested, with two different
> configurations.
>
> *** SYMMETRIC CONFIGURATION ***
> -ksp_type cg
> -ksp_monitor
> -ksp_rtol 1.0e-6
> -ksp_converged_reason
> -ksp_max_it 500
> -ksp_norm_type unpreconditioned
> -ksp_view
> -log_view
>
> -pc_type hypre
> -pc_hypre_type boomeramg
> -pc_hypre_boomeramg_print_statistics 1
> -pc_hypre_boomeramg_strong_threshold 0.25
> -pc_hypre_boomeramg_coarsen_type HMIS
> -pc_hypre_boomeramg_relax_type_down symmetric-SOR/Jacobi
> -pc_hypre_boomeramg_relax_type_up symmetric-SOR/Jacobi
> -pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination
>
> *** UNSYMMETRIC CONFIGURATION ***
> -ksp_type gmres
> -ksp_gmres_restart 500
> -ksp_monitor
> -ksp_rtol 1.0e-6
> -ksp_converged_reason
> -ksp_max_it 500
> -ksp_pc_side right
> -ksp_norm_type unpreconditioned
>
> -pc_type hypre
> -pc_hypre_type boomeramg
> -pc_hypre_boomeramg_print_statistics 1
> -pc_hypre_boomeramg_strong_threshold 0.25
> -pc_hypre_boomeramg_coarsen_type HMIS
> -pc_hypre_boomeramg_relax_type_down SOR/Jacobi
> -pc_hypre_boomeramg_relax_type_up SOR/Jacobi
> -pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination
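>
> (As a side note, such an option block can also be kept in a plain-text file
> and loaded at run time. A minimal sketch of that, in C, where the file name
> "hypre_symmetric.opts" is hypothetical and A, b, x are assumed to be
> assembled elsewhere:
>
>   #include <petscksp.h>
>   KSP ksp;
>   /* load one of the option blocks above, one option per line */
>   PetscOptionsInsertFile(PETSC_COMM_WORLD, NULL, "hypre_symmetric.opts", PETSC_TRUE);
>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>   KSPSetOperators(ksp, A, A);
>   KSPSetFromOptions(ksp);  /* picks up -ksp_type, -pc_type hypre, ... */
>   KSPSolve(ksp, b, x);
>
> so switching between the two configurations only requires changing the file.)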
>
> The raw results were:
>
> *** SYMMETRIC CONFIGURATION ***
>
> (load3):  [0.1828534687, 0.3055133289, 0.3582984209, 0.4280304033, 1.343549139]   (set up)
> (load3):  [0.2102472978, 0.4572948301, 0.7153297188, 0.9989531627, N/A]   (solve)
> (load3):  [19, 23, 26, 28, 'DIVERGED_INDEFINITE_PC']   (iterations)
>
> *** UNSYMMETRIC CONFIGURATION ***
>
> (load3): [0.1841227429, 0.3082743008, 0.3652294828, 0.4654760892, 1.331299786]   (set up)
> (load3): [0.1194557019, 0.2830136018, 0.5046830242, 1.363314636, N/A]   (solve)
> (load3): [15, 19, 24, 48, DIVERGED_ITS]   (iterations)
>
> Thus, the largest problem also seems to cause (even more severe) issues for
> HYPRE: in particular, an
> INDEFINITE PRECONDITIONER with CG, and no convergence within 500
> iterations with GMRES.
> The preconditioner set-up time, though, scales reasonably well with
> the same data distribution
> that we used to feed GAMG (although the preconditioner computed for the
> largest problem seems to be
> totally useless).
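>
> (A minimal sketch of how such failures can be caught programmatically, so
> the driver can react instead of only logging them; it assumes a KSP named
> ksp and vectors b, x:
>
>   KSPConvergedReason reason;
>   KSPSolve(ksp, b, x);
>   KSPGetConvergedReason(ksp, &reason);
>   if (reason == KSP_DIVERGED_INDEFINITE_PC) {
>     /* CG detected an indefinite preconditioner, as in the symmetric run above */
>     PetscPrintf(PETSC_COMM_WORLD, "indefinite preconditioner detected\n");
>   } else if (reason == KSP_DIVERGED_ITS) {
>     /* hit -ksp_max_it before reaching -ksp_rtol, as in the GMRES run above */
>     PetscPrintf(PETSC_COMM_WORLD, "no convergence within the iteration limit\n");
>   } else if (reason < 0) {
>     PetscPrintf(PETSC_COMM_WORLD, "solve diverged (reason %d)\n", (int)reason);
>   }
>
> but the -ksp_converged_reason output above already tells the same story.)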
>
> I have logs for all these results if required.
>
> Thanks for your help!
> Best regards,
>  Alberto.
>
>
>
> On 07/11/18 19:46, Mark Adams wrote:
>
> First I would add -gamg_est_ksp_type cg
>
> You seem to be converging well so I assume you are setting the null space
> for GAMG.
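>
> (If not: for a scalar Poisson problem the near-null space is just the
> constant vector, so an untested sketch of what I mean, assuming your
> assembled matrix is called A, is:
>
>   MatNullSpace nullsp;
>   /* has_cnst = PETSC_TRUE: the constant vector spans the near-null space */
>   MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
>   MatSetNearNullSpace(A, nullsp);  /* GAMG uses this when building prolongators */
>   MatNullSpaceDestroy(&nullsp);
>
> set before the preconditioner is set up.)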
>
> Note, you should test hypre also.
>
> You probably want a bigger "-pc_gamg_process_eq_limit" than 50: at least 200,
> but you should test your machine with a range of values on the largest
> problem. This is the parameter that reduces the number of active processes
> on coarse grids.
>
> I would only worry about "load3". This has 16K equations per process,
> which is where you start noticing "strong scaling" problems, depending on
> the machine.
>
> An important parameter is "-pc_gamg_square_graph 0". I would probably
> start with infinity (e.g., 10).
>
> Now, I'm not sure about your domain, problem sizes, and thus the weak
> scaling design. You seem to be scaling on the background mesh, but that may
> not be a good proxy for complexity.
>
> You can look at the number of flops and scale it appropriately by the
> number of solver iterations to get a relative size of the problem. I would
> recommend scaling the number of processors with this. For instance, here are
> the MatMult lines for the 4-process and the 16K-process runs:
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> MatMult              636 1.0 1.9035e-01    1.0 3.12e+08 1.1 7.6e+03 3.0e+03 0.0e+00  0 47 62 44  0   0 47 62 44  0    6275  [2 procs]
> MatMult             1416 1.0 1.9601e+00 2744.6 4.82e+08 0.0 4.3e+08 7.2e+02 0.0e+00  0 48 50 48  0   0 48 50 48  0 2757975  [16K procs]
>
> Now, you have empty processors. See the massive load imbalance in time
> and the zero on flops. The "Ratio" is max/min, and clearly min=0 here, so
> PETSc reports a ratio of 0 (it is really infinity).
>
> Also, weak scaling on a thin body (I don't know your domain) is a little
> funny, because as the problem scales up the mesh becomes more 3D, and this
> causes the cost per equation to go up. That is why I prefer to use the
> number of non-zeros as the processor scaling function, but the number of
> equations is easier ...
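>
> (A sketch of getting that number, if you want to try it, assuming A is the
> assembled matrix: MatGetInfo with MAT_GLOBAL_SUM returns the global number
> of stored nonzeros,
>
>   MatInfo info;
>   MatGetInfo(A, MAT_GLOBAL_SUM, &info);  /* sums over all processes */
>   PetscPrintf(PETSC_COMM_WORLD, "global nonzeros: %g\n", (double)info.nz_used);
>
> and you can scale the process count with that instead of the equation count.)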
>
> The PC setup times are large (I see 48 seconds at 16K but you report 16).
> -pc_gamg_square_graph 10 should help that.
>
> The max number of flops per processor in MatMult goes up by 50%, the
> max time goes up by 10x, and the number of iterations goes up by 13/8. If I
> put all of this together, I get that roughly 75% of the time at 16K is in
> communication. I think that, and the absolute time, can be improved
> somewhat by optimizing parameters as I've suggested.
>
> Mark
>
>
>
>
>
> On Wed, Nov 7, 2018 at 11:03 AM "Alberto F. Martín" via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
>> Dear All,
>>
>> we are performing a weak scaling test of the PETSc (v3.9.0) GAMG
>> preconditioner when applied to the linear system arising
>> from the conforming unfitted FE discretization (using Q1 Lagrangian
>> FEs) of a 3D Poisson problem, where
>> the boundary of the domain (a popcorn flake) is described as a
>> zero level set embedded within a uniform background
>> (Cartesian-like) hexahedral mesh. Details of the underlying FEM
>> formulation can be made available on demand if you
>> believe that this might be helpful, but let me just point out that it is
>> designed to address the well-known
>> ill-conditioning issues of unfitted FE discretizations due to the small
>> cut cell problem.
>>
>> The weak scaling test is set up as follows. We start from a single cube
>> background mesh, and refine it uniformly several
>> steps, until we have approximately either 10**3 (load1), 20**3 (load2),
>> or 40**3 (load3) hexahedra/MPI task when
>> distributing it over 4 MPI tasks. The benchmark is scaled such that the
>> next larger scale problem to be tested is obtained
>> by uniformly refining the mesh from the previous scale and running it on
>> 8x the number of MPI tasks that we used
>> in the previous scale. As a result, we obtain three weak scaling curves,
>> one for each of the three fixed loads per MPI task
>> above, on the following total number of MPI tasks: 4, 32, 262, 2097,
>> 16777. The underlying mesh is not partitioned among
>> MPI tasks using ParMETIS (unstructured multilevel graph partitioning)
>> nor optimally by hand, but following the so-called
>> z-shape space-filling curves provided by an underlying octree-like mesh
>> handler (i.e., p4est library).
>>
>> I configured the preconditioned linear solver as follows:
>>
>> -ksp_type cg
>> -ksp_monitor
>> -ksp_rtol 1.0e-6
>> -ksp_converged_reason
>> -ksp_max_it 500
>> -ksp_norm_type unpreconditioned
>> -ksp_view
>> -log_view
>>
>> -pc_type gamg
>> -pc_gamg_type agg
>> -mg_levels_esteig_ksp_type cg
>> -mg_coarse_sub_pc_type cholesky
>> -mg_coarse_sub_pc_factor_mat_ordering_type nd
>> -pc_gamg_process_eq_limit 50
>> -pc_gamg_square_graph 0
>> -pc_gamg_agg_nsmooths 1
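>>
>> (For reference, the two stages timed below roughly map onto KSPSetUp --
>> which triggers PCSetUp and hence the construction of the GAMG hierarchy --
>> and KSPSolve. A minimal sketch of such a driver, in C for concreteness,
>> with the options above taken from the command line and A, b, x assembled
>> elsewhere:
>>
>>   #include <petscksp.h>
>>   KSP ksp;
>>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>>   KSPSetOperators(ksp, A, A);
>>   KSPSetFromOptions(ksp);  /* reads -ksp_type cg, -pc_type gamg, ... */
>>   KSPSetUp(ksp);           /* "preconditioner set up" stage below */
>>   KSPSolve(ksp, b, x);     /* "PCG stage" below */
>>   KSPDestroy(&ksp);
>>
>> in case that helps interpret the numbers.)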
>>
>> Raw timings (in seconds) of the preconditioner set up and PCG iterative
>> solution stage, and number of iterations are as follows:
>>
>> **preconditioner set up**
>> (load1): [0.02542160451, 0.05169247743, 0.09266782179, 0.2426272957, 13.64161944]
>> (load2): [0.1239175797, 0.1885528499, 0.2719282564, 0.4783878336, 13.37947339]
>> (load3): [0.6565349903, 0.9435049873, 1.299908397, 1.916243652, 16.02904088]
>>
>> **PCG stage**
>> (load1): [0.003287350759, 0.008163803257, 0.03565631993, 0.08343045413, 0.6937994603]
>> (load2): [0.0205939794, 0.03594723623, 0.07593298424, 0.1212046621, 0.6780373845]
>> (load3): [0.1310882876, 0.3214917686, 0.5532023879, 0.766881627, 1.485446003]
>>
>> **number of PCG iterations**
>> (load1): [5, 8, 11, 13, 13]
>> (load2): [7, 10, 12, 13, 13]
>> (load3): [8, 10, 12, 13, 13]
>>
>> It can be observed that both the number of linear solver iterations and
>> the PCG stage timings (weakly) scale remarkably well, but there is a
>> significant time increase in the preconditioner set-up stage when scaling
>> the problem from 2097 to 16777 MPI tasks
>> (e.g., 1.916243652 vs 16.02904088 sec. with 40**3 cells per MPI task).
>> I gathered the combined output of -ksp_view and -log_view (only) for all
>> the points involving the load3 weak scaling
>> test (find them attached to this message). Please note that within each
>> run, I execute these two stages up to
>> three times, and this influences the absolute timings given in -log_view.
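>>
>> (One way to disentangle these repetitions in -log_view would be to wrap
>> each phase in its own logging stage; a minimal sketch, in C for
>> concreteness, assuming a KSP ksp and vectors b, x:
>>
>>   PetscLogStage setup_stage, solve_stage;
>>   PetscLogStageRegister("MyPCSetUp", &setup_stage);
>>   PetscLogStageRegister("MyKSPSolve", &solve_stage);
>>
>>   PetscLogStagePush(setup_stage);
>>   KSPSetUp(ksp);               /* preconditioner set up */
>>   PetscLogStagePop();
>>
>>   PetscLogStagePush(solve_stage);
>>   KSPSolve(ksp, b, x);         /* PCG stage */
>>   PetscLogStagePop();
>>
>> so that -log_view reports each registered stage separately; with one stage
>> per repetition, the repetitions themselves could be separated too.)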
>>
>> Looking at the output of -log_view, it is very strange to me, e.g., that
>> the stage labelled "Graph"
>> does not scale properly, as it is just a call to MatDuplicate if the block
>> size of the matrix is 1 (our case), and
>> I guess that it is just a local operation that does not require any
>> communication.
>> What am I missing here? The load does not seem to be unbalanced, judging
>> from the "Ratio" column.
>>
>> I wonder whether the observed behaviour is as expected, or whether this is
>> a misconfiguration of the solver on our side.
>> I played (quite a lot) with several parameter-value combinations, and the
>> configuration above is the one that led to the fastest
>> execution (among the ones tested, which might be an incomplete set; I can
>> also provide further feedback if helpful).
>> Any feedback that we can get from your experience in order to find the
>> cause(s) of this issue and a mitigating solution
>> will be of great value.
>>
>> Thanks very much in advance!
>> Best regards,
>>  Alberto.
>>
>> --
>> Alberto F. Martín-Huertas
>> Senior Researcher, PhD. Computational Science
>> Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)
>> Parc Mediterrani de la Tecnologia, UPC
>> Esteve Terradas 5, Building C3, Office 215,
>> 08860 Castelldefels (Barcelona, Spain)
>> Tel.: (+34) 9341 34223   e-mail: amartin at cimne.upc.edu
>>
>> FEMPAR project co-founder
>> web: http://www.fempar.org
>>
>>
>>
> --
> Alberto F. Martín-Huertas
> Senior Researcher, PhD. Computational Science
> Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)
> Parc Mediterrani de la Tecnologia, UPC
> Esteve Terradas 5, Building C3, Office 215,
> 08860 Castelldefels (Barcelona, Spain)
> Tel.: (+34) 9341 34223   e-mail: amartin at cimne.upc.edu
>
> FEMPAR project co-founder
> web: http://www.fempar.org
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

