[petsc-dev] sor smoothers

Barry Smith bsmith at mcs.anl.gov
Sat Aug 17 16:30:01 CDT 2013


   Mark,

    Why rich/eisenstat(2)?  With cheb/eisenstat you get the acceleration of Chebyshev on top of the SOR, but rich/eisenstat(2) just seems like a strange implementation of SOR?

    I am guessing you are getting your 15% potential improvement from the following: scale the cheb/jacobi time by the ratio of the rich/sor flops to the cheb/jacobi flops and compare that with the measured rich/sor time

>>> 7.63/11.3
0.6752212389380531
>>> .6752212389380531*5.9800e+00
4.037823008849558
>>> .63/4.63
0.13606911447084233

  Anyway, I would focus on rich/ssor smoothing.

Before thinking about implementing a new data structure I think it is relatively easy to further reduce the total number of flops done with the ssor(1) smoother (despite the current conventional wisdom that doing extra flops is a good thing in HPC :-)).

Presmoothing: Note that since you are running ssor(1) with a zero initial guess, the current implementation is already good in terms of reducing total flops; it does one down solve (saving the intermediate values) and then one up solve (using those saved values).

Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess; the current implementation in MatSOR_SeqAIJ() is bad because it applies the entire matrix in both the down solve and the up solve.

  while (its--) {
    if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
      for (i=0; i<m; i++) {
        n   = a->i[i+1] - a->i[i];    /* the whole of row i: lower, diagonal and upper */
        idx = a->j + a->i[i];
        v   = a->a + a->i[i];
        sum = b[i];
        PetscSparseDenseMinusDot(sum,x,v,idx,n);
        x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
      }
      ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);   /* a full application of the matrix */
    }
    if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
      for (i=m-1; i>=0; i--) {
        n   = a->i[i+1] - a->i[i];    /* again the whole row, not just the upper part */
        idx = a->j + a->i[i];
        v   = a->a + a->i[i];
        sum = b[i];
        PetscSparseDenseMinusDot(sum,x,v,idx,n);
        x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
      }
      ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);   /* another full application of the matrix */
    }
  }

(Note also that this partially explains why the flop rate in the table for MatSOR() is so good: two of the SSOR triangular sweeps, both in the post smooth, are actually full matrix-vector products.) What it should do is apply the entire matrix in the down solve (but save the partial sums of the lower triangular part) and then do the up solve applying only 1/2 of the matrix, using the accumulated values.  So currently the combined pre and post smooth has 3 complete applications of the matrix (.5 and .5 from the pre smooth and 1 and 1 from the post smooth). Adding the optimization will reduce it to 2.5 applications of the matrix, giving a reduction of 1/6 = 17 percent of the smooth flops, or

>>> 17.*14/33
7.212121212121212

a 7 percent decrease in the total KSPSolve flops.
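
Here is a rough (untested) sketch of what the optimized nonzero-initial-guess sweeps could look like, shown for a single iteration with the flop logging left out. It assumes the diagonal positions a->diag[i] are available, and the work array t and the names d, sum1, sum2 are just for illustration:

  if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
    for (i=0; i<m; i++) {
      d    = a->diag[i];
      n    = d - a->i[i];                        /* strictly lower triangular part of row i */
      idx  = a->j + a->i[i];
      v    = a->a + a->i[i];
      sum1 = 0.0;
      PetscSparseDenseMinusDot(sum1,x,v,idx,n);  /* -L(i,:)*x, x already updated for j < i */
      t[i] = sum1;                               /* cache for the up sweep */
      n    = a->i[i+1] - d - 1;                  /* strictly upper triangular part of row i */
      idx  = a->j + d + 1;
      v    = a->a + d + 1;
      sum2 = 0.0;
      PetscSparseDenseMinusDot(sum2,x,v,idx,n);  /* -U(i,:)*x, old values for j > i */
      x[i] = (1. - omega)*x[i] + (b[i] + sum1 + sum2)*idiag[i];
    }
  }
  if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
    for (i=m-1; i>=0; i--) {
      d   = a->diag[i];
      n   = a->i[i+1] - d - 1;                   /* only the strictly upper part is reapplied */
      idx = a->j + d + 1;
      v   = a->a + d + 1;
      sum = b[i] + t[i];                         /* reuse the cached lower triangular sum */
      PetscSparseDenseMinusDot(sum,x,v,idx,n);
      /* the -U(i,:)*x computed here could also be saved for reuse in the residual (see below) */
      x[i] = (1. - omega)*x[i] + sum*idiag[i];
    }
  }

The down sweep still traverses the whole matrix (as it must with a nonzero initial guess) but the up sweep now touches only the upper triangular half, which is where the 3 to 2.5 reduction above comes from.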

Next we can eliminate 1/2 of the application of the matrix in the residual computation after the smoothing if we have saved the partial sums of the upper triangular solve in the smoother (with u_i = sum_{j>i} a_ij x_j saved from the up sweep, the residual needs only the lower triangular part plus the diagonal: r_i = b_i - sum_{j<=i} a_ij x_j - u_i). We can estimate the flops of the matrix multiply in the residual computation as one application of the matrix and hence equal to 1/3 of the flops in the ssor computations. Eliminating 1/2 of that means eliminating the equivalent of 1/6 of the flops in the MatSOR computation, which is

>>> 14./(33*6)
0.0707070707070707  (actually the same computation as above :-(

so another 7 percent of the total flops in the KSPSolve.  So in total we could save 14 percent of the flops (and lots of memory access) in the KSPSolve for the work of editing a couple of functions.

At that point we could compute the flop rate of all the SSOR triangular solves and the percent of the KSPSolve() time spent in them to determine if a new data structure is warranted. Note that in reducing the total number of flops we are replacing two full matrix-vector products with two triangular solves, hence the flop rate will be lower, indicating that using a new data structure will actually help more.

Notes on the implementation:

* modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is straightforward; edit one function

* modifying the residual computation in the multigrid to reuse the upper triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ() would have an internal pointer to the accumulated sums for the upper triangular part and we would need to either (1) have a MatMult_SeqAIJ() implementation that would retrieve those values and use them appropriately (how would it know when the values are valid?) or (2) use PCMGSetResidual() to provide a custom residual routine that again pulled out the accumulated sums and used them. Of course we will eventually figure out how to organize it cleanly for users.
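
Here is a very rough sketch of option (2), just to fix ideas. It assumes the smoother has stashed usum[i] = -U(i,:)*x from its last up sweep in a hypothetical field a->ssor_upper_sums, and it would live next to MatSOR_SeqAIJ() so the Mat_SeqAIJ internals are visible:

  PetscErrorCode MGResidual_SSORCached(Mat A,Vec b,Vec x,Vec r)
  {
    Mat_SeqAIJ        *a = (Mat_SeqAIJ*)A->data;
    const PetscScalar *bb,*xx,*usum = a->ssor_upper_sums;   /* hypothetical cached -U(i,:)*x */
    PetscScalar       *rr,sum;
    const PetscInt    *ai = a->i,*aj = a->j,*idx;
    const MatScalar   *aa = a->a,*v;
    PetscInt          i,m = A->rmap->n,d,n;
    PetscErrorCode    ierr;

    PetscFunctionBegin;
    ierr = VecGetArrayRead(b,&bb);CHKERRQ(ierr);
    ierr = VecGetArrayRead(x,&xx);CHKERRQ(ierr);
    ierr = VecGetArray(r,&rr);CHKERRQ(ierr);
    for (i=0; i<m; i++) {
      d   = a->diag[i];               /* index of the diagonal entry in row i */
      n   = d - ai[i] + 1;            /* lower triangular part plus the diagonal */
      idx = aj + ai[i];
      v   = aa + ai[i];
      sum = bb[i] + usum[i];          /* usum[i] already accounts for the upper part */
      PetscSparseDenseMinusDot(sum,xx,v,idx,n);
      rr[i] = sum;                    /* r_i = b_i - (L+D)(i,:)*x - U(i,:)*x */
    }
    ierr = VecRestoreArrayRead(b,&bb);CHKERRQ(ierr);
    ierr = VecRestoreArrayRead(x,&xx);CHKERRQ(ierr);
    ierr = VecRestoreArray(r,&rr);CHKERRQ(ierr);
    ierr = PetscLogFlops(1.0*a->nz);CHKERRQ(ierr);   /* roughly half the usual 2*nz */
    PetscFunctionReturn(0);
  }

It would be registered on each level with something like PCMGSetResidual(pc,level,MGResidual_SSORCached,Amat), and the question above (how do we know when the cached values are valid?) still has to be answered before anything like this could be turned on.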


Note also that even if a new custom data structure is introduced the work outlined above is not lost, since the new data structure still needs those flop reduction optimizations. So step 1, fix MatSOR_SeqAIJ() for the nonzero initial guess and get a roughly 6 percent improvement; step 2, optimize the residual computation and get another 6 percent improvement; step 3, introduce a new data structure and get another roughly 5 percent improvement in KSPSolve() (the 6, 6, and 5 are my guesstimates).

I would be very interested in seeing new numbers.

Final note: the very poor load balance of this problem will limit how much these optimizations will help. For a perfectly load balanced problem I think these optimizations would be a lot higher in percentage (maybe twice as much? more?).


   Barry






On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:

> I would like to get an idea of how much benefit there would be with using a special matrix type for SOR.  Here is an experiment on 128 cores of Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look like higher order stencils at BCs.  32^3 subdomain on each core:
> 
> cheb/jacobi(2)
> KSPSolve              15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02  7 29 67 46  7  26100100100 76 18765
> 
> 
> rich/eisenstat(2)
> KSPSolve              15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44  7  38100100100 76 11659
> 
> 
> rich/sor
> KSPSolve              15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31  8  33100100100 77 15708
> 
> 
> Complete log files attached.  The "projection" solve is the solver of interest.
> 
> I have 2 Jacobi so that it has about the same amount of work as one (s)sor.  There are voids in the domain which I believe accounts for the large differences in the number of flops per core.  These were run with the same processor group (i.e., all runs done in the same qsub script).
> 
> This shows only about 15% potential gain.  Should we conclude that there is not much to gain from an optimized data structure?
> 
> Mark
> <log_eis><log_jac><log_sor>
> 
> 
> On Aug 16, 2013, at 7:53 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> 
>> "Mark F. Adams" <mfadams at lbl.gov> writes:
>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>> problems but for me robustness trumps this for default solver
>>> parameters in PETSc.
>> 
>> Richardson/SOR is about the best thing out there in terms of reasonable
>> local work, low/no setup cost, and reliable convergence.  Cheby/Jacobi
>> just has trivial fine-grained parallelism, but it's not clear that buys
>> anything on a CPU.
>> 
>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>> but needs a specialized matrix to get good flop rates.  Something to
>>> think about doing … 
>> 
>> Mine was too sloppy, Barry got it right.  Eisenstat is for Cheby/SOR,
>> however, and doesn't do anything for Richardson.
>> 
>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>> cache the action of the lower triangular part in the forward sweep so
>> that the back-sweep can use it, and vice-versa.  With full caching and
>> triangular residual optimization, I think this brings 2 SSOR iterations
>> of the down-smoother plus a residual to 2.5 work units in the
>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>> (nonzero initial guess).  (This is a strong smoother and frequently, one
>> SSOR would be enough.)
> 



