[petsc-users] PETSc (3.9.0) GAMG weak scaling test issue

Mark Adams mfadams at lbl.gov
Mon Nov 19 08:04:35 CST 2018


> Mark would have better comments on the scalability of the setup stage.

The first thing to verify is that the algorithm is scaling. If you coarsen
too slowly then the coarse grids get large, with many non-zeros per row,
and the cost of the matrix triple product can explode. You can check this
by running with -info, grepping for GAMG, and then looking for the line
that reports the grid "Complexity". It should be well below 1.5. With a
newer version of PETSc you can see this with -ksp_view. You can also look
at the total number of flops in matrix-matrix kernels like MatPtAPNumeric
below; that should increase slowly as you scale the problem size up.

Next, the setup has two basic types of processes: 1) custom graph
processing and other kernels and 2) matrix-matrix products (like the matrix
triple product). (2) consists of normal, but complex, numerical kernels and
you can use standard performance debugging techniques to figure out where
the time is going; there are ordinary numerical loops and MPI
communication. (1) is made up of custom methods that do a lot of different
types of processing, but parallel graph processing is a big chunk of it.
These processes are very unstructured and it is hard to predict what
performance to expect from them. You really just have to dig in and look
for where the time is spent. I have added timers to help with this:

PCGAMGGraph_AGG       12 1.0 3.3467e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6 22144
PCGAMGCoarse_AGG      12 1.0 8.9895e+00 1.0 1.35e+08 1.2 7.7e+07 6.2e+03 4.7e+02 14 10 16 51 18  14 10 16 51 18 233748
PCGAMGProl_AGG        12 1.0 3.9328e+00 1.0 0.00e+00 0.0 9.1e+06 1.5e+03 1.9e+02  6  0  2  1  7   6  0  2  1  7     0
PCGAMGPOpt_AGG        12 1.0 6.3192e+00 1.0 7.43e+07 1.1 3.9e+07 9.0e+02 5.0e+02  9  6  8  4 19   9  6  8  4 19 190048
GAMG: createProl      12 1.0 2.2585e+01 1.0 2.13e+08 1.2 1.3e+08 4.0e+03 1.3e+03 34 16 28 57 50  34 16 28 57 50 149493
  Graph               24 1.0 3.3388e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6 22196
  MIS/Agg             12 1.0 5.6411e-01 1.2 0.00e+00 0.0 4.7e+07 8.7e+02 2.3e+02  1  0 10  4  9   1  0 10  4  9     0
  SA: col data        12 1.0 1.2982e+00 1.1 0.00e+00 0.0 5.6e+06 2.0e+03 4.8e+01  2  0  1  1  2   2  0  1  1  2     0
  SA: frmProl0        12 1.0 1.6284e+00 1.0 0.00e+00 0.0 3.6e+06 5.5e+02 9.6e+01  2  0  1  0  4   2  0  1  0  4     0
  SA: smooth          12 1.0 4.1778e+00 1.0 5.28e+06 1.2 1.4e+07 7.3e+02 1.7e+02  6  0  3  1  7   6  0  3  1  7 20483
GAMG: partLevel       12 1.0 1.3577e+01 1.0 2.90e+07 1.2 2.3e+07 2.1e+03 6.4e+02 20  2  5  5 25  20  2  5  5 25 34470
  repartition          9 1.0 1.5048e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.4e+01  2  0  0  0  2   2  0  0  0  2     0
  Invert-Sort          9 1.0 1.2282e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+01  2  0  0  0  1   2  0  0  0  1     0
  Move A               9 1.0 2.8930e+00 1.0 0.00e+00 0.0 5.7e+05 8.4e+02 1.5e+02  4  0  0  0  6   4  0  0  0  6     0
  Move P               9 1.0 3.0317e+00 1.0 0.00e+00 0.0 9.4e+05 2.5e+01 1.5e+02  5  0  0  0  6   5  0  0  0  6     0


There is nothing that pops out here, but you could look at the scaling of
the parts. There are no alternative implementations of this, so there is
not much that can be done about it. I have not done a deep dive into the
performance of this (1) stuff for a very long time. Note, most of this work
is amortized because you do it just once for each mesh, so this cost will
shrink in relative terms as you do more solves, like in a real production
run.
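
If you want to pull these setup events out of your own runs, the breakdown
above comes from the -log_view summary; a sketch (again, the executable,
process count, and file names are placeholders):

  # Save the log summary for a run at a given process count
  mpiexec -n 1024 ./myapp -pc_type gamg -log_view > log_1024.txt

  # Extract the GAMG setup events and the matrix-matrix kernels
  grep -E 'PCGAMG|GAMG:|MIS/Agg|SA:|MatMatMult|MatPtAP|MatTrnMatMult' log_1024.txt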

The matrix-matrix (2) stuff is here:

MatMatMult            12 1.0 3.5538e+00 1.0 4.58e+06 1.2 1.4e+07 7.3e+02 1.5e+02  5  0  3  1  6   5  0  3  1  6 20853
MatMatMultSym         12 1.0 3.3264e+00 1.0 0.00e+00 0.0 1.2e+07 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6     0
MatMatMultNum         12 1.0 9.7088e-02 1.1 4.58e+06 1.2 2.5e+06 1.0e+03 0.0e+00  0  0  1  0  0   0  0  1  0  0 763319
MatPtAP               12 1.0 4.0859e+00 1.0 2.90e+07 1.2 2.2e+07 2.2e+03 1.9e+02  6  2  5  5  7   6  2  5  5  7 114537
MatPtAPSymbolic       12 1.0 2.4298e+00 1.1 0.00e+00 0.0 1.4e+07 2.4e+03 8.4e+01  4  0  3  4  3   4  0  3  4  3     0
MatPtAPNumeric        12 1.0 1.7467e+00 1.1 2.90e+07 1.2 8.2e+06 1.8e+03 9.6e+01  3  2  2  2  4   3  2  2  2  4 267927
MatTrnMatMult         12 1.0 7.1406e+00 1.0 1.35e+08 1.2 1.6e+07 2.6e+04 1.9e+02 11 10  3 44  7  11 10  3 44  7 294270
MatTrnMatMultSym      12 1.0 5.6756e+00 1.0 0.00e+00 0.0 1.3e+07 1.6e+04 1.6e+02  9  0  3 23  6   9  0  3 23  6     0
MatTrnMatMultNum      12 1.0 1.5121e+00 1.0 1.35e+08 1.2 2.5e+06 7.9e+04 2.4e+01  2 10  1 21  1   2 10  1 21  1 1389611


Note, the symbolic ("Sym") times (like the (1) stuff above) are pretty
large, but again they are amortized away, and they tend to be slow compared
to nice numerical kernel loops.
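
One way to see whether these phases are what stops scaling is to compare
them across runs at different process counts (a sketch, assuming one
-log_view file per run as above):

  # Compare the symbolic and numeric phases of the matrix-matrix kernels
  grep -E 'MatPtAP(Symbolic|Numeric)|Mat(MatMult|TrnMatMult)(Sym|Num)' log_*.txt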

Mark