> Mark would have better comments on the scalability of the setup stage.

The first thing to verify is that the algorithm is scaling. If you coarsen too slowly, the coarse grids get large, with many non-zeros per row, and the cost of the matrix triple product can explode. You can check this by running with -info, grepping for GAMG, and looking for the line that reports the grid "Complexity". It should be well below 1.5. With a new version of PETSc you can see this with -ksp_view. You can also look at the total number of flops in matrix-matrix kernels like MatPtAPNumeric below; that should grow slowly as you scale the problem size up.

Next, the setup has two basic types of processes: 1) custom graph processing and other kernels, and 2) matrix-matrix products (like the matrix triple product). The kernels in (2) are normal, if complex, numerical kernels (ordinary numerical loops plus MPI communication), and you can use standard performance-debugging techniques to figure out where the time is going. (1) is made up of custom methods that do many different kinds of processing, with parallel graph processing being a big chunk of it. These processes are very unstructured and it is hard to predict what performance to expect from them; you really just have to dig in and look for where the time is spent. I have added timers to help with this:

PCGAMGGraph_AGG 12 1.0 3.3467e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02 5 0 2 1 6 5 0 2 1 6 22144
PCGAMGCoarse_AGG 12 1.0 8.9895e+00 1.0 1.35e+08 1.2 7.7e+07 6.2e+03 4.7e+02 14 10 16 51 18 14 10 16 51 18 233748
PCGAMGProl_AGG 12 1.0 3.9328e+00 1.0 0.00e+00 0.0 9.1e+06 1.5e+03 1.9e+02 6 0 2 1 7 6 0 2 1 7 0
PCGAMGPOpt_AGG 12 1.0 6.3192e+00 1.0 7.43e+07 1.1 3.9e+07 9.0e+02 5.0e+02 9 6 8 4 19 9 6 8 4 19 190048
GAMG: createProl 12 1.0 2.2585e+01 1.0 2.13e+08 1.2 1.3e+08 4.0e+03 1.3e+03 34 16 28 57 50 34 16 28 57 50 149493
Graph 24 1.0 3.3388e+00 1.0 4.58e+06 1.2 7.6e+06 6.7e+02 1.4e+02 5 0 2 1 6 5 0 2 1 6 22196
MIS/Agg 12 1.0 5.6411e-01 1.2 0.00e+00 0.0 4.7e+07 8.7e+02 2.3e+02 1 0 10 4 9 1 0 10 4 9 0
SA: col data 12 1.0 1.2982e+00 1.1 0.00e+00 0.0 5.6e+06 2.0e+03 4.8e+01 2 0 1 1 2 2 0 1 1 2 0
SA: frmProl0 12 1.0 1.6284e+00 1.0 0.00e+00 0.0 3.6e+06 5.5e+02 9.6e+01 2 0 1 0 4 2 0 1 0 4 0
SA: smooth 12 1.0 4.1778e+00 1.0 5.28e+06 1.2 1.4e+07 7.3e+02 1.7e+02 6 0 3 1 7 6 0 3 1 7 20483
GAMG: partLevel 12 1.0 1.3577e+01 1.0 2.90e+07 1.2 2.3e+07 2.1e+03 6.4e+02 20 2 5 5 25 20 2 5 5 25 34470
repartition 9 1.0 1.5048e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.4e+01 2 0 0 0 2 2 0 0 0 2 0
Invert-Sort 9 1.0 1.2282e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+01 2 0 0 0 1 2 0 0 0 1 0
Move A 9 1.0 2.8930e+00 1.0 0.00e+00 0.0 5.7e+05 8.4e+02 1.5e+02 4 0 0 0 6 4 0 0 0 6 0
Move P 9 1.0 3.0317e+00 1.0 0.00e+00 0.0 9.4e+05 2.5e+01 1.5e+02 5 0 0 0 6 5 0 0 0 6 0

Nothing pops out here, but you could look at the scaling of the individual parts. There are no alternative implementations of this, so there is not much that can be done about it, and I have not done a deep dive into the performance of this (1) stuff for a very long time. Note that most of this work is amortized because you do it just once for each mesh, so its relative cost will decrease as you do more solves, as in a real production run.

The matrix-matrix (2) stuff is here:

MatMatMult 12 1.0 3.5538e+00 1.0 4.58e+06 1.2 1.4e+07 7.3e+02 1.5e+02 5 0 3 1 6 5 0 3 1 6 20853
MatMatMultSym 12 1.0 3.3264e+00 1.0 0.00e+00 0.0 1.2e+07 6.7e+02 1.4e+02 5 0 2 1 6 5 0 2 1 6 0
MatMatMultNum 12 1.0 9.7088e-02 1.1 4.58e+06 1.2 2.5e+06 1.0e+03 0.0e+00 0 0 1 0 0 0 0 1 0 0 763319
MatPtAP 12 1.0 4.0859e+00 1.0 2.90e+07 1.2 2.2e+07 2.2e+03 1.9e+02 6 2 5 5 7 6 2 5 5 7 114537
MatPtAPSymbolic 12 1.0 2.4298e+00 1.1 0.00e+00 0.0 1.4e+07 2.4e+03 8.4e+01 4 0 3 4 3 4 0 3 4 3 0
MatPtAPNumeric 12 1.0 1.7467e+00 1.1 2.90e+07 1.2 8.2e+06 1.8e+03 9.6e+01 3 2 2 2 4 3 2 2 2 4 267927
MatTrnMatMult 12 1.0 7.1406e+00 1.0 1.35e+08 1.2 1.6e+07 2.6e+04 1.9e+02 11 10 3 44 7 11 10 3 44 7 294270
MatTrnMatMultSym 12 1.0 5.6756e+00 1.0 0.00e+00 0.0 1.3e+07 1.6e+04 1.6e+02 9 0 3 23 6 9 0 3 23 6 0
MatTrnMatMultNum 12 1.0 1.5121e+00 1.0 1.35e+08 1.2 2.5e+06 7.9e+04 2.4e+01 2 10 1 21 1 2 10 1 21 1 1389611

Note that the symbolic ("Sym") times, like the (1) work above, are pretty large, but again they are amortized away and tend to be slow compared to nice numerical kernel loops.

Mark
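P.S. In case it helps, here is a minimal sketch of driving a solve through PCGAMG with a recent PETSc. It is not from the run above: the function name SolveWithGAMG and the already-assembled A, b, x are assumptions for illustration. Running the resulting program with -info (grepping for GAMG), -ksp_view, and -log_view produces the "Complexity" line and the event timers discussed above.

/* Minimal sketch, assuming A is an assembled matrix and b, x are assembled
 * vectors; SolveWithGAMG is an illustrative name, not a PETSc routine. */
#include <petscksp.h>

PetscErrorCode SolveWithGAMG(Mat A, Vec b, Vec x)
{
  KSP ksp;
  PC  pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCGAMG));   /* smoothed-aggregation AMG */
  PetscCall(KSPSetFromOptions(ksp));  /* honors -ksp_view, -ksp_*, -pc_gamg_* options */
  PetscCall(KSPSetUp(ksp));           /* setup stage: graph work + matrix-matrix products */
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  PetscFunctionReturn(PETSC_SUCCESS);
}

The -info and -log_view options are processed by PetscInitialize()/PetscFinalize() in the calling program, so no further code changes are needed to get the complexity report and the timers.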