[petsc-users] PETSc (3.9.0) GAMG weak scaling test issue
Mark Adams
mfadams at lbl.gov
Mon Nov 19 08:04:35 CST 2018
> Mark would have better comments on the scalability of the setup stage.
The first thing to verify is that the algorithm is scaling. If you coarsen
too slowly then the coarse grids get large, with many non-zeros per row,
and the cost of the matrix triple product can explode. You can check this
by running with -info, grepping for GAMG, and then looking for the line that
reports the grid "Complexity". It should be well below 1.5. With a newer
version of PETSc you can see this with -ksp_view. You can also look at the
total number of flops in matrix-matrix kernels like MatPtAPNumeric below;
that should increase slowly as you scale the problem size up.
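For example, a quick way to check this from the command line (the executable
name and process count here are placeholders, not from your runs):

  # run with -info, keep the GAMG lines, and look for the reported grid "Complexity"
  mpiexec -n 64 ./myapp -pc_type gamg -info 2>&1 | grep GAMG | grep -i complexity
  # with a newer PETSc the same number is printed as part of the solver view
  mpiexec -n 64 ./myapp -pc_type gamg -ksp_view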
Next, the setup has two basic types of processing: 1) custom graph
processing and other kernels, and 2) matrix-matrix products (like the matrix
triple product). (2) consists of normal, but complex, numerical kernels, so
you can use standard performance debugging techniques to figure out where
the time is going; there are ordinary numerical loops and MPI communication.
(1) is made up of custom methods that do many different types of processing,
with parallel graph processing making up a big chunk of it. These methods are
very unstructured and it is hard to predict what performance to expect from
them. You really just have to dig in and look for where the time is spent.
I have added timers to help with this:
PCGAMGGraph_AGG   12 1.0 3.3467e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6  22144
PCGAMGCoarse_AGG  12 1.0 8.9895e+00 1.0 1.35e+08 1.2 7.7e+07 6.2e+03 4.7e+02 14 10 16 51 18  14 10 16 51 18 233748
PCGAMGProl_AGG    12 1.0 3.9328e+00 1.0 0.00e+00 0.0 9.1e+06 1.5e+03 1.9e+02  6  0  2  1  7   6  0  2  1  7      0
PCGAMGPOpt_AGG    12 1.0 6.3192e+00 1.0 7.43e+07 1.1 3.9e+07 9.0e+02 5.0e+02  9  6  8  4 19   9  6  8  4 19 190048
GAMG: createProl  12 1.0 2.2585e+01 1.0 2.13e+08 1.2 1.3e+08 4.0e+03 1.3e+03 34 16 28 57 50  34 16 28 57 50 149493
Graph             24 1.0 3.3388e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6  22196
MIS/Agg           12 1.0 5.6411e-01 1.2 0.00e+00 0.0 4.7e+07 8.7e+02 2.3e+02  1  0 10  4  9   1  0 10  4  9      0
SA: col data      12 1.0 1.2982e+00 1.1 0.00e+00 0.0 5.6e+06 2.0e+03 4.8e+01  2  0  1  1  2   2  0  1  1  2      0
SA: frmProl0      12 1.0 1.6284e+00 1.0 0.00e+00 0.0 3.6e+06 5.5e+02 9.6e+01  2  0  1  0  4   2  0  1  0  4      0
SA: smooth        12 1.0 4.1778e+00 1.0 5.28e+06 1.2 1.4e+07 7.3e+02 1.7e+02  6  0  3  1  7   6  0  3  1  7  20483
GAMG: partLevel   12 1.0 1.3577e+01 1.0 2.90e+07 1.2 2.3e+07 2.1e+03 6.4e+02 20  2  5  5 25  20  2  5  5 25  34470
repartition        9 1.0 1.5048e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.4e+01  2  0  0  0  2   2  0  0  0  2      0
Invert-Sort        9 1.0 1.2282e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+01  2  0  0  0  1   2  0  0  0  1      0
Move A             9 1.0 2.8930e+00 1.0 0.00e+00 0.0 5.7e+05 8.4e+02 1.5e+02  4  0  0  0  6   4  0  0  0  6      0
Move P             9 1.0 3.0317e+00 1.0 0.00e+00 0.0 9.4e+05 2.5e+01 1.5e+02  5  0  0  0  6   5  0  0  0  6      0
There is nothing that pops out here, but you could look at the scaling of
the parts. There are no alternative implementations of this, so there is
not much that can be done about it, and I have not done a deep dive into the
performance of this (1) stuff in a very long time. Note, most of this work
is amortized because you do it just once for each mesh, so this cost shrinks,
relatively, as you do more solves, as you would in a real production run.
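If you do want to look at the scaling of the parts, a simple way (sketched
here with placeholder executable and file names) is to save the -log_view
output from each run in the weak-scaling study and compare the event lines:

  # placeholders: one -log_view output file per process count
  mpiexec -n 128  ./myapp -pc_type gamg -log_view > log.128
  mpiexec -n 1024 ./myapp -pc_type gamg -log_view > log.1024
  # compare the time column of the top-level GAMG setup events across runs
  grep -E "PCGAMG|GAMG:" log.128 log.1024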
The matrix-matrix (2) stuff is here:
MatMatMult        12 1.0 3.5538e+00 1.0 4.58e+06 1.2 1.4e+07 7.3e+02 1.5e+02  5  0  3  1  6   5  0  3  1  6   20853
MatMatMultSym     12 1.0 3.3264e+00 1.0 0.00e+00 0.0 1.2e+07 6.7e+02 1.4e+02  5  0  2  1  6   5  0  2  1  6       0
MatMatMultNum     12 1.0 9.7088e-02 1.1 4.58e+06 1.2 2.5e+06 1.0e+03 0.0e+00  0  0  1  0  0   0  0  1  0  0  763319
MatPtAP           12 1.0 4.0859e+00 1.0 2.90e+07 1.2 2.2e+07 2.2e+03 1.9e+02  6  2  5  5  7   6  2  5  5  7  114537
MatPtAPSymbolic   12 1.0 2.4298e+00 1.1 0.00e+00 0.0 1.4e+07 2.4e+03 8.4e+01  4  0  3  4  3   4  0  3  4  3       0
MatPtAPNumeric    12 1.0 1.7467e+00 1.1 2.90e+07 1.2 8.2e+06 1.8e+03 9.6e+01  3  2  2  2  4   3  2  2  2  4  267927
MatTrnMatMult     12 1.0 7.1406e+00 1.0 1.35e+08 1.2 1.6e+07 2.6e+04 1.9e+02 11 10  3 44  7  11 10  3 44  7  294270
MatTrnMatMultSym  12 1.0 5.6756e+00 1.0 0.00e+00 0.0 1.3e+07 1.6e+04 1.6e+02  9  0  3 23  6   9  0  3 23  6       0
MatTrnMatMultNum  12 1.0 1.5121e+00 1.0 1.35e+08 1.2 2.5e+06 7.9e+04 2.4e+01  2 10  1 21  1   2 10  1 21  1 1389611
Note, the symbolic ("Sym") times (which are like the (1) work above) are
pretty large, but again they are amortized away, and they tend to be slow
compared to nice numerical kernel loops.
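As a rough check on the earlier point about flop growth in the matrix-matrix
kernels (again using the placeholder log file names from above), you can pull
the numeric events out of the -log_view output for two problem sizes and
compare the flop column:

  # the flop totals in these kernels should grow roughly with the problem size, not explode
  grep -E "MatPtAPNumeric|MatTrnMatMultNum|MatMatMultNum" log.128 log.1024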
Mark