[petsc-dev] sor smoothers

Barry Smith bsmith at mcs.anl.gov
Wed Aug 28 12:58:32 CDT 2013


On Aug 28, 2013, at 11:35 AM, "Mark F. Adams" <mfadams at lbl.gov> wrote:

> Barry,
> 
> The next step is caching the upper part of the Mat-Vec for residuals.
> 
> How should this be engineered?  
> 
> 1) Should we add an array like ssor_work or hack and reuse it?

  It would be ssor_work, but we could perhaps give it a more indicative name, one that indicates it holds the partial row "sums".

> 
> 2) Can we just assume that if there is zero initial guess that we want to cache it?

   I would just always cache the partial sums when doing ssor smoothing; I don't think the cost is that high. So basically all the kernels would have a line like

       t[i] = sum;
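
   For instance, in the zero-initial-guess forward sweep that line would land right after the lower-triangular dot product. Just a sketch (here n, idx, v are assumed to already point at the lower-triangular part of row i, and t stands for the cached-sums array, whatever we end up calling it):

       for (i=0; i<m; i++) {
         sum = b[i];
         PetscSparseDenseMinusDot(sum,x,v,idx,n);  /* sum = b[i] - (L x)_i                       */
         t[i] = sum;                               /* cache the partial row sum for the up sweep */
         x[i] = sum*idiag[i];                      /* then the usual zero-guess SOR update       */
       }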

> 
> 3) How are we going to intercept the residual call?  We only use this for MG so we could modify the V-cycle to call a special residual method or check the matrix to see if has the upper part stashed…

    PCMGSetResidual() already exists, so that is how you would provide the "special residual method".

> 
> 4) And I made a pull request for the first step of SOR optimization.  I will need to build on that … should I remove my pull request and keep working in the same branch or can this be pushed (to next?) so that I can create a new branch from this?

   I'd like to see your stuff merged into next and then you make a new branch at that point to optimize the residual part.

   Barry

> 
> Mark
> 
> On Aug 27, 2013, at 11:04 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>> 
>> So about a 4.5 percent decrease in time with a 9 percent decrease in flops used, pretty much on target with what was predicted. Note that if we just counted the decreased number of memory loads (and ignored the worse memory access patterns introduced) we would have gotten the same over-optimistic prediction that we get with the flop reduction. A nice tiny little case study of worse memory access patterns eating into the improvement. Presumably you'd see a roughly similar decrease with the "improved" residual computation.
>> 
>> 
>>  Thanks
>> 
>>  Barry
>> 
>> A few percent here, a few percent there, pretty soon you're beating hypre boomeramg :-)
>> 
>> 
>> On Aug 27, 2013, at 9:01 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:
>> 
>>> [new data: I had an error in the old ex54 timings]
>>> 
>>> OK, well the blocked version of this optimization was a fun ride. I still get about 2% variance in timings, but here are the best of a few runs of each solver with the new optimized code. I do watch the Activity Monitor, and these are one-core runs on a 4-core machine.
>>> 
>>> Scalar (ex54):
>>> KSPSolve               1 1.0 1.3231e+00 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 5.0e+00 29 72  0  0  3  33 72  0  0  3  1057
>>> 2-vector (ex55)
>>> KSPSolve               1 1.0 2.1459e+00 1.0 3.15e+09 1.0 0.0e+00 0.0e+00 7.0e+00 50 88  0  0  4 100100  0  0100  1468
>>> 3-vector (ex56)
>>> KSPSolve               1 1.0 9.3956e-01 1.0 1.52e+09 1.0 0.0e+00 0.0e+00 7.0e+00 26 65  0  0  5 100100  0  0100  1619
>>> 
>>> and the old version:
>>> Scalar (ex54):
>>> KSPSolve               1 1.0 1.3830e+00 1.0 1.53e+09 1.0 0.0e+00 0.0e+00 5.0e+00 29 74  0  0  3  33 74  0  0  3  1103
>>> 2-vector (ex55)
>>> KSPSolve               1 1.0 2.1489e+00 1.0 3.45e+09 1.0 0.0e+00 0.0e+00 7.0e+00 50 89  0  0  4 100100  0  0100  1606
>>> 3-vector (ex56)
>>> KSPSolve               1 1.0 9.8354e-01 1.0 1.67e+09 1.0 0.0e+00 0.0e+00 7.0e+00 27 67  0  0  5 100100  0  0100  1700
>>> 
>>> Mark
>>> 
>>> On Aug 23, 2013, at 3:21 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> 
>>>> 
>>>> Run a bit bigger problem. The results shouldn't be that noisy; unless you are playing some game in another window :-).
>>>> 
>>>> Barry
>>>> 
>>>> On Aug 23, 2013, at 11:52 AM, Mark F. Adams <mfadams at lbl.gov> wrote:
>>>> 
>>>>> This code looks funny in MatSOR_SeqAIJ:
>>>>> 
>>>>>     x[i] = (1-omega)*x[i] + sum*idiag[i];
>>>>> 
>>>>> shouldn't this be:
>>>>> 
>>>>>     x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
>>>>> 
>>>>> and I've done the first optimization (for the non-blocked version) and get this (running each many times because my Mac is a little noisy). Flop rates go down a bit and total flops go down a lot (perhaps I have a flop counting bug?). 6 old runs, then 7 new ones:
>>>>> 
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1105
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1097
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1109
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  15 34  0  0  0  1095
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29  0  0  0  14 29  0  0  0   910
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   908
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   911
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   886
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   909
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   901
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
>>>>> MatSOR                60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   902
>>>>> 
>>>>> 
>>>>> On Aug 20, 2013, at 6:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>> 
>>>>>> 
>>>>>> On Aug 20, 2013, at 4:39 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>>> 
>>>>>>> Hmm, less clear flops benefit. Half of these are nonzero initial guess, for which triangular storage would be worse
>>>>>> 
>>>>>> Is it clear that triangular storage would be particularly worse? I think it is only a small percentage worse for a multiply. The problem with CSR for triangular solves is that for each new row one needs to jump some amount in the index and double arrays, wasting cache lines and limiting streaming. With a multiply on triangular storage there is no jumping, just streaming two different index and double arrays, and current CPUs have no problem managing 4 streams.
>>>>>> 
>>>>>> In other words, CSR is good for multiply, bad for triangular solve; triangular storage is good for triangular solve and pretty good for multiply.
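>>>>>> 
>>>>>> A rough sketch of what the multiply looks like with split storage (strictly lower / diagonal / strictly upper); the array names are made up for illustration, not an existing PETSc data structure:
>>>>>> 
>>>>>>   for (i=0; i<m; i++) {
>>>>>>     sum = diag[i]*x[i];
>>>>>>     for (k=li[i]; k<li[i+1]; k++) sum += la[k]*x[lj[k]];  /* stream the lower index + value arrays */
>>>>>>     for (k=ui[i]; k<ui[i+1]; k++) sum += ua[k]*x[uj[k]];  /* stream the upper index + value arrays */
>>>>>>     y[i] = sum;
>>>>>>   }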
>>>>>> 
>>>>>> Barry
>>>>>> 
>>>>>> 
>>>>>>> (unless you add an Eisenstat optimization).
>>>>>>> 
>>>>>>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:
>>>>>>> Barry,
>>>>>>> 
>>>>>>> These are tests using the well load balanced problems in KSP on my Mac:
>>>>>>> 
>>>>>>> 3D Vector (ex56)
>>>>>>> 
>>>>>>> 8 proc:
>>>>>>> 
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve               1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16  8 100100100100100  2047
>>>>>>> 
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve               1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19  8 100100100100100  2053
>>>>>>> 
>>>>>>> 1 proc:
>>>>>>> 
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve               1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22  0  0  0 100100  0  0  0  1727
>>>>>>> 
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve               1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24  0  0  0 100100  0  0  0  1870
>>>>>>> 
>>>>>>> 2D Scalar (ex54) 4 procs
>>>>>>> 
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve               1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100   592
>>>>>>> 
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve               1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100   453
>>>>>>> 
>>>>>>> I'm not sure if this matters but I wanted to get off of the funny test problem that I was using.
>>>>>>> 
>>>>>>> Mark
>>>>>>> 
>>>>>>> On Aug 17, 2013, at 5:30 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Mark,
>>>>>>>> 
>>>>>>>> Why rich/eisenstat(2)? With cheb/eisenstat you get the acceleration of Chebyshev on top of the SOR, but rich/eisenstat(2) just seems like some strange implementation of SOR.
>>>>>>>> 
>>>>>>>> I am guessing you are getting your 15% potential improvement from
>>>>>>>> 
>>>>>>>>>>> 7.63/11.3
>>>>>>>> 0.6752212389380531
>>>>>>>>>>> .6752212389380531*5.9800e+00
>>>>>>>> 4.037823008849558
>>>>>>>>>>> .63/4.63
>>>>>>>> 0.13606911447084233
>>>>>>>> 
>>>>>>>> Anyway, I would focus on rich/ssor() smoothing.
>>>>>>>> 
>>>>>>>> Before thinking about implementing a new data structure, I think it is relatively easy to further reduce the total number of flops done by the ssor(1) smoother [and this despite the current conventional wisdom that doing extra flops is a good thing in HPC :-)].
>>>>>>>> 
>>>>>>>> Presmoothing: Note that since you are running ssor(1) with a zero initial guess, the current implementation is already good in terms of reducing total flops: it does one down solve (saving the intermediate values) and then one up solve (using those saved values).
>>>>>>>> 
>>>>>>>> Postsmoothing: Here it is running ssor(1) with a nonzero initial guess, and the current implementation in MatSOR_SeqAIJ() is bad because it applies the entire matrix in both the down solve and the up solve:
>>>>>>>> 
>>>>>>>> while (its--) {
>>>>>>>>   if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>>>>>>     for (i=0; i<m; i++) {
>>>>>>>>       n   = a->i[i+1] - a->i[i];
>>>>>>>>       idx = a->j + a->i[i];
>>>>>>>>       v   = a->a + a->i[i];
>>>>>>>>       sum = b[i];
>>>>>>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>>     }
>>>>>>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>>>>   }
>>>>>>>>   if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>>>>>>     for (i=m-1; i>=0; i--) {
>>>>>>>>       n   = a->i[i+1] - a->i[i];
>>>>>>>>       idx = a->j + a->i[i];
>>>>>>>>       v   = a->a + a->i[i];
>>>>>>>>       sum = b[i];
>>>>>>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>>     }
>>>>>>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>>>>   }
>>>>>>>> 
>>>>>>>> (Note also that this partially explains why the flop rate in the table for MatSOR() is so good; two of the three ssor triangular solves are actually full matrix-vector products.) What it should do is apply the entire matrix in the down solve (but save the partial sums of the lower triangular part) and then do the up solve applying only 1/2 the matrix, using the accumulated values. So currently the combined pre and post smooth does 3 complete applications of the matrix (.5 and .5 from the pre smooth and 1 and 1 from the post smooth). Adding the optimization will reduce it to 2.5 applications of the matrix, giving a reduction of 1/6 = 17 percent of the smooth flops, or
>>>>>>>> 
>>>>>>>>>>> 17.*14/33
>>>>>>>> 7.212121212121212
>>>>>>>> 
>>>>>>>> 7 percent decrease in the total KSPSolve flops
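>>>>>>>> 
>>>>>>>> In code, the optimized nonzero-initial-guess sweeps would look something like the sketch below; t is the cached-partial-sums array, diag stands for a->diag (the position of the diagonal entry in each row), and the rest follows the MatSOR_SeqAIJ() excerpt above (details would differ in the real thing):
>>>>>>>> 
>>>>>>>>   /* down sweep: apply the whole matrix, but remember the lower-triangular partial sums */
>>>>>>>>   for (i=0; i<m; i++) {
>>>>>>>>     sum = b[i];
>>>>>>>>     n   = diag[i] - a->i[i];                  /* entries strictly below the diagonal    */
>>>>>>>>     idx = a->j + a->i[i];
>>>>>>>>     v   = a->a + a->i[i];
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n);  /* x[j], j<i already hold the new values  */
>>>>>>>>     t[i] = sum;                               /* cache b[i] - (L x)_i for the up sweep  */
>>>>>>>>     n   = a->i[i+1] - diag[i] - 1;            /* entries strictly above the diagonal    */
>>>>>>>>     idx = a->j + diag[i] + 1;
>>>>>>>>     v   = a->a + diag[i] + 1;
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n);  /* x[j], j>i still hold the old values    */
>>>>>>>>     x[i] = (1. - omega)*x[i] + sum*idiag[i];  /* diagonal excluded, so no mdiag term    */
>>>>>>>>   }
>>>>>>>>   /* up sweep: x[j], j<i have not changed since the down sweep, so t[i] is still valid
>>>>>>>>      and only the strictly upper half of the matrix needs to be applied                 */
>>>>>>>>   for (i=m-1; i>=0; i--) {
>>>>>>>>     sum = t[i];
>>>>>>>>     n   = a->i[i+1] - diag[i] - 1;
>>>>>>>>     idx = a->j + diag[i] + 1;
>>>>>>>>     v   = a->a + diag[i] + 1;
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n);  /* subtract (U x)_i with the new x[j], j>i */
>>>>>>>>     x[i] = (1. - omega)*x[i] + sum*idiag[i];
>>>>>>>>   }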
>>>>>>>> 
>>>>>>>> Next we can eliminate 1/2 of the application of the matrix in the residual computation after the smoothing if we have saved the partial sums of the upper triangular solve in the smoother. We can estimate the flops of the matrix multiply in the residual computation as one application of the matrix and hence equal to 1/3 of the flops in the ssor computations. Eliminating 1/2 of that means eliminating the equivalent of 1/6 of the flops in the MatSOR computation, which is
>>>>>>>> 
>>>>>>>>>>> 14./(33*6)
>>>>>>>> 0.0707070707070707  (actually the same computation as above :-(
>>>>>>>> 
>>>>>>>> so another 7 percent of the total flops in the KSPSolve. So in total we could save 14 percent of the flops (and lots of memory access) in the KSPSolve at the cost of editing a couple of functions.
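>>>>>>>> 
>>>>>>>> The residual reuse is the same idea in the other direction. A sketch, where u is a hypothetical array holding the upper partial sums (U x)_i cached during the final up sweep (those were computed with x[j], j>i, values that never change afterwards, so they already correspond to the final x):
>>>>>>>> 
>>>>>>>>   for (i=0; i<m; i++) {
>>>>>>>>     sum = b[i] - u[i] - mdiag[i]*x[i];        /* b - U x - D x, with U x from the cache  */
>>>>>>>>     n   = diag[i] - a->i[i];                  /* strictly lower part of row i            */
>>>>>>>>     idx = a->j + a->i[i];
>>>>>>>>     v   = a->a + a->i[i];
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n);  /* subtract (L x)_i: only half the matrix  */
>>>>>>>>     r[i] = sum;
>>>>>>>>   }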
>>>>>>>> 
>>>>>>>> At that point we could compute the flop rate of all the SSOR triangular solves and the percentage of the KSPSolve() time spent in them to determine if a new data structure is warranted. Note that in reducing the total number of flops we are replacing two full matrix-vector products with 2 triangular solves, hence the flop rate will be lower, indicating that a new data structure would actually help even more.
>>>>>>>> 
>>>>>>>> Notes on the implementation:
>>>>>>>> 
>>>>>>>> * modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is straightforward; edit one function
>>>>>>>> 
>>>>>>>> * modifying the residual computation in the multigrid to reuse the upper triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ() would have an internal pointer to the accumulated sums for the upper triangular part, and we would need to either (1) have a MatMult_SeqAIJ() implementation that would retrieve those values and use them appropriately (how would it know when the values are valid?) or (2) use PCMGSetResidual() to provide a custom residual routine that again pulled out the accumulated sums and used them. Of course we will eventually figure out how to organize it cleanly for users.
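>>>>>>>> 
>>>>>>>> Option (2) would look roughly like the sketch below; the retrieval routine is made up for illustration, only PCMGSetResidual() is real, and I am writing its calling sequence from memory:
>>>>>>>> 
>>>>>>>>   static PetscErrorCode MyResidual(Mat A,Vec b,Vec x,Vec r)
>>>>>>>>   {
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>> 
>>>>>>>>     PetscFunctionBegin;
>>>>>>>>     /* pull the cached upper partial sums out of A and do the half-matrix residual;
>>>>>>>>        fall back to r = b - A x when the cache is not valid */
>>>>>>>>     ierr = MatResidualWithCachedSums_SeqAIJ(A,b,x,r);CHKERRQ(ierr);  /* hypothetical */
>>>>>>>>     PetscFunctionReturn(0);
>>>>>>>>   }
>>>>>>>> 
>>>>>>>>   ierr = PCMGSetResidual(pc,level,MyResidual,Amat);CHKERRQ(ierr);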
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Note also that even if a new custom data structure is introduced the work outlined above is not lost, since the new data structure still needs those flop-reduction optimizations. So step one, fix MatSOR_SeqAIJ() for nonzero initial guess, get a roughly 6 percent improvement; step two, optimize the residual computation, get another 6 percent improvement; step three, introduce a new data structure and get another roughly 5 percent improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
>>>>>>>> 
>>>>>>>> I would be very interested in seeing new numbers.
>>>>>>>> 
>>>>>>>> Final note: the very uneven load balance of this problem will limit how much these optimizations help. For a perfectly load balanced problem I think these optimizations would give a lot more in percentage terms (maybe twice as much? more?).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Barry
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:
>>>>>>>> 
>>>>>>>>> I would like to get an idea of how much benefit there would be from using a special matrix type for SOR. Here is an experiment on 128 cores of Hopper (Cray XE6), 7-point stencil, with some embedded BCs that look like higher-order stencils at the BCs. 32^3 subdomain on each core:
>>>>>>>>> 
>>>>>>>>> cheb/jacobi(2)
>>>>>>>>> KSPSolve              15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02  7 29 67 46  7  26100100100 76 18765
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> rich/eisenstat(2)
>>>>>>>>> KSPSolve              15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44  7  38100100100 76 11659
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> rich/sor
>>>>>>>>> KSPSolve              15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31  8  33100100100 77 15708
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Complete log files attached.  The "projection" solve is the solver of interest.
>>>>>>>>> 
>>>>>>>>> I use 2 Jacobi iterations so that it has about the same amount of work as one (s)sor. There are voids in the domain, which I believe accounts for the large differences in the number of flops per core. These were run with the same processor group (i.e., all runs done in the same qsub script).
>>>>>>>>> 
>>>>>>>>> This shows only about 15% potential gain.  Should we conclude that there is not much to gain from an optimized data structure?
>>>>>>>>> 
>>>>>>>>> Mark
>>>>>>>>> <log_eis><log_jac><log_sor>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>>> "Mark F. Adams" <mfadams at lbl.gov> writes:
>>>>>>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>>>>>>> problems but for me robustness trumps this for default solver
>>>>>>>>>>> parameters in PETSc.
>>>>>>>>>> 
>>>>>>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>>>>>>> local work, low/no setup cost, and reliable convergence.  Cheby/Jacobi
>>>>>>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>>>>>>> anything on a CPU.
>>>>>>>>>> 
>>>>>>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>>>>>>> but needs a specialized matrix to get good flop rates.  Something to
>>>>>>>>>>> think about doing …
>>>>>>>>>> 
>>>>>>>>>> Mine was too sloppy, Barry got it right.  Eisenstat is for Cheby/SOR,
>>>>>>>>>> however, and doesn't do anything for Richardson.
>>>>>>>>>> 
>>>>>>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>>>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>>>>>>> that the back-sweep can use it, and vice-versa.  With full caching and
>>>>>>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>>>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>>>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>>>>>>> (nonzero initial guess).  (This is a strong smoother, and frequently one
>>>>>>>>>> SSOR would be enough.)
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 



