[petsc-users] PETSc (3.9.0) GAMG weak scaling test issue

Mark Adams mfadams at lbl.gov
Wed Nov 7 12:46:30 CST 2018


First I would add -gamg_est_ksp_type cg

You seem to be converging well, so I assume you are setting the null space
for GAMG.
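
For a scalar Poisson problem the near-null space is just the constant
vector, which GAMG assumes by default, so nothing extra should be needed
there; for vector PDEs such as elasticity it has to be set explicitly. A
minimal sketch of doing that (SetNearNullSpace and A are placeholder names,
not from your code):

    #include <petscmat.h>
    /* Sketch: attach a near-null space to the assembled system matrix A so
       GAMG can build good coarse spaces.  The constant vector used here is
       also GAMG's default for scalar problems; for elasticity one would
       build it from nodal coordinates with MatNullSpaceCreateRigidBody(). */
    PetscErrorCode SetNearNullSpace(Mat A)
    {
      MatNullSpace   nullsp;
      PetscErrorCode ierr;
      ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE /* constants */, 0, NULL, &nullsp);CHKERRQ(ierr);
      ierr = MatSetNearNullSpace(A, nullsp);CHKERRQ(ierr);
      ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
      return 0;
    }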

Note, you should test hypre also.

You probably want a bigger "-pc_gamg_process_eq_limit" than 50: at least
200, but you should test your machine with a range of values on the largest
problem. This is the parameter for reducing the number of active processors
(on coarse grids).

I would only worry about "load3". This has 16K equations per process, which
is where you start noticing "strong scaling" problems, depending on the
machine.

An important parameter is "-pc_gamg_square_graph 0". I would probably start
with infinity (e.g., 10).
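
Putting those suggestions together, a reasonable starting point to sweep
from on the largest runs (the exact values are things to experiment with)
would be something like:

-gamg_est_ksp_type cg
-pc_gamg_process_eq_limit 200
-pc_gamg_square_graph 10

and, if your PETSc build includes hypre, a BoomerAMG run for comparison:

-pc_type hypre
-pc_hypre_type boomeramg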

Now, I'm not sure about your domain, problem sizes, and thus the weak
scaling design. You seem to be scaling on the background mesh, but that may
not be a good proxy for complexity.

You can look at the number of flops, scaled appropriately by the number of
solver iterations, to get a relative size of the problem. I would recommend
scaling the number of processors with this. For instance, here are the
MatMult lines for the 4 proc and the 16K proc runs:

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
MatMult              636 1.0 1.9035e-01 1.0    3.12e+08 1.1 7.6e+03 3.0e+03 0.0e+00  0 47 62 44  0   0 47 62 44  0    6275  [2 procs]
MatMult             1416 1.0 1.9601e+00 2744.6 4.82e+08 0.0 4.3e+08 7.2e+02 0.0e+00  0 48 50 48  0   0 48 50 48  0 2757975  [16K procs]

Now, you have empty processors. See the massive load imbalance on time and
the zero on flops. The "Ratio" is max/min and clearly min=0, so PETSc
reports a ratio of 0 (it is really infinity).

Also, weak scaling on a thin body (I don't know your domain) is a little
funny because, as the problem scales up, the mesh becomes more 3D and this
causes the cost per equation to go up. That is why I prefer to use the
number of non-zeros as the processor scaling function, but the number of
equations is easier ...
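
If you want to try scaling on non-zeros, the global count is cheap to query
(a sketch; PrintGlobalNonzeros is a placeholder name and A is the assembled
system matrix):

    #include <petscmat.h>
    /* Sketch: report the global number of stored non-zeros, to be used
       (instead of the number of equations) as the weak-scaling measure. */
    PetscErrorCode PrintGlobalNonzeros(Mat A)
    {
      MatInfo        info;
      PetscErrorCode ierr;
      ierr = MatGetInfo(A, MAT_GLOBAL_SUM, &info);CHKERRQ(ierr);
      ierr = PetscPrintf(PetscObjectComm((PetscObject)A), "global non-zeros: %g\n", (double)info.nz_used);CHKERRQ(ierr);
      return 0;
    }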

The PC setup times are large (I see 48 seconds at 16K but you report 16).
-pc_gamg_square_graph 10 should help that.

The max number of flops per process in MatMult goes up by 50%, the max time
goes up by 10x, and the number of iterations goes up by 13/8. Putting all
of this together, I get that about 75% of the time at 16K processes is in
communication. I think that, and the absolute time, can be improved
somewhat by optimizing the parameters as I've suggested.
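
(To make that arithmetic explicit, as a rough back-of-envelope estimate:
the max per-process MatMult flops grow by only ~1.5x while the max time
grows by ~10x, so computation can explain only on the order of 1.5/10 ~ 15%
of the MatMult time on 16K processes; the remaining ~75-85%, depending on
how you fold in the flop-rate and iteration-count changes, is communication
and load imbalance.)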

Mark





On Wed, Nov 7, 2018 at 11:03 AM "Alberto F. Martín" via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Dear All,
>
> we are performing a weak scaling test of the PETSc (v3.9.0) GAMG
> preconditioner when applied to the linear system arising
> from the conforming unfitted FE discretization (using Q1 Lagrangian
> FEs) of a 3D Poisson problem, where
> the boundary of the domain (a popcorn flake) is described as a
> zero-level-set embedded within a uniform background
> (Cartesian-like) hexahedral mesh. Details underlying the FEM formulation
> can be made available on demand if you
> believe that this might be helpful, but let me just point out that it is
> designed such that it addresses the well-known
> ill-conditioning issues of unfitted FE discretizations due to the small
> cut cell problem.
>
> The weak scaling test is set up as follows. We start from a single cube
> background mesh, and refine it uniformly several
> steps, until we have approximately either 10**3 (load1), 20**3 (load2), or
> 40**3 (load3) hexahedra/MPI task when
> distributing it over 4 MPI tasks. The benchmark is scaled such that the
> next larger scale problem to be tested is obtained
> by uniformly refining the mesh from the previous scale and running it on
> 8x times the number of MPI tasks that we used
> in the previous scale.  As a result, we obtain three weak scaling curves
> for each of the three fixed loads per MPI task
> above, on the following total number of MPI tasks: 4, 32, 262, 2097,
> 16777. The underlying mesh is not partitioned among
> MPI tasks using ParMETIS (unstructured multilevel graph partitioning)  nor
> optimally by hand, but following the so-called
> z-shape space-filling curves provided by an underlying octree-like mesh
> handler (i.e., p4est library).
>
> I configured the preconditioned linear solver as follows:
>
> -ksp_type cg
> -ksp_monitor
> -ksp_rtol 1.0e-6
> -ksp_converged_reason
> -ksp_max_it 500
> -ksp_norm_type unpreconditioned
> -ksp_view
> -log_view
>
> -pc_type gamg
> -pc_gamg_type agg
> -mg_levels_esteig_ksp_type cg
> -mg_coarse_sub_pc_type cholesky
> -mg_coarse_sub_pc_factor_mat_ordering_type nd
> -pc_gamg_process_eq_limit 50
> -pc_gamg_square_graph 0
> -pc_gamg_agg_nsmooths 1
>
> Raw timings (in seconds) of the preconditioner set up and PCG iterative
> solution stage, and number of iterations are as follows:
>
> **preconditioner set up**
> (load1): [0.02542160451, 0.05169247743, 0.09266782179, 0.2426272957, 13.64161944]
> (load2): [0.1239175797, 0.1885528499, 0.2719282564, 0.4783878336, 13.37947339]
> (load3): [0.6565349903, 0.9435049873, 1.299908397, 1.916243652, 16.02904088]
>
> **PCG stage**
> (load1): [0.003287350759, 0.008163803257, 0.03565631993, 0.08343045413, 0.6937994603]
> (load2): [0.0205939794, 0.03594723623, 0.07593298424, 0.1212046621, 0.6780373845]
> (load3): [0.1310882876, 0.3214917686, 0.5532023879, 0.766881627, 1.485446003]
>
> **number of PCG iterations**
> (load1): [5, 8, 11, 13, 13]
> (load2): [7, 10, 12, 13, 13]
> (load3): [8, 10, 12, 13, 13]
>
> It can be observed that both the number of linear solver iterations and
> the PCG stage timings (weakly) scale remarkably well, but there is a
> significant time increase in the preconditioner setup stage when scaling
> the problem from 2097 to 16777 MPI tasks (e.g., 1.916243652 vs
> 16.02904088 sec. with 40**3 cells per MPI task).
> I gathered the combined output of -ksp_view and -log_view (only) for all
> the points involving the load3 weak scaling test (find them attached to
> this message). Please note that, within each run, I execute these two
> stages up to three times, and this influences the absolute timings given
> in -log_view.
>
> Looking at the output of -log_view, it is very strange to me, e.g., that
> the stage labelled "Graph" does not scale properly, as it is just a call
> to MatDuplicate if the block size of the matrix is 1 (our case), and I
> guess that it is just a local operation that does not require any
> communication.
> What am I missing here? The load does not seem to be unbalanced looking
> at the "Ratio" column.
>
> I wonder whether the observed behaviour is as expected, or whether this
> is a misconfiguration of the solver on our side.
> I played (quite a lot) with several parameter-value combinations, and the
> configuration above is the one that led to the fastest execution (among
> those tested, which might be incomplete; I can also provide further
> feedback if helpful).
> Any feedback from your experience that helps us find the cause(s) of this
> issue and a mitigating solution would be highly valuable.
>
> Thanks very much in advance!
> Best regards,
>  Alberto.
>
> --
> Alberto F. Martín-Huertas
> Senior Researcher, PhD. Computational Science
> Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)
> Parc Mediterrani de la Tecnologia, UPC
> Esteve Terradas 5, Building C3, Office 215,
> 08860 Castelldefels (Barcelona, Spain)
> Tel.: (+34) 9341 34223    e-mail: amartin at cimne.upc.edu
>
> FEMPAR project co-founder
> web: http://www.fempar.org
>