[petsc-dev] sor smoothers

Fri Aug 23 11:52:20 CDT 2013

This code looks funny in MatSOR_SeqAIJ:

          x[i] = (1-omega)*x[i] + sum*idiag[i];

shouldn't this be:

          x[i] = (1-omega)*x[i] + omega*sum*idiag[i];

and I've done the first optimization (for the non-blocked version) and get this (running each many time because my Mac is a little noisy).  Flop rates go down a bit and total flops got down a lot (perhaps I have a flop counting bug?).  6 old runs then7 new ones:

12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1105
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1097
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1109
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  15 34  0  0  0  1095
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29  0  0  0  14 29  0  0  0   910
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   908
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   911
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   886
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   909
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   901
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   902

On Aug 20, 2013, at 6:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

> 
> On Aug 20, 2013, at 4:39 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> 
>> Hmm, less clear flops benefit. Half of these are nonzero initial guess, for which triangular storage would be worse
> 
>   Is it clear that triangular storage would be particularly worse? I think only a small percentage worse for a multiply. The problem with CSR for triangular solves is for each new row one needs to jump some amount in the index and double array wasting cache lines and limiting streaming. With multiply with triangular storage there is no jumping, just streaming two different index and double arrays and current CPUs have no problem managing 4 streams. 
> 
>   In other words CSR good for multiply, bad for triangular solve; triangular storage good for triangular solve and pretty good for multiply.
> 
>   Barry
> 
> 
>   Barry
> 
>> (unless you add an Eisenstat optimization).
>> 
>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:
>> Barry,
>> 
>> These are tests using the well load balanced problems in KSP on my Mac:
>> 
>> 3D Vector (ex56)
>> 
>> 8 proc:
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16  8 100100100100100  2047
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19  8 100100100100100  2053
>> 
>> 1 proc:
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22  0  0  0 100100  0  0  0  1727
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24  0  0  0 100100  0  0  0  1870
>> 
>> 2D Scalar (ex54) 4 procs
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100   592
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100   453
>> 
>> I'm not sure if this matters but I wanted to get off of the funny test problem that I was using.
>> 
>> Mark
>> 
>> On Aug 17, 2013, at 5:30 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>>> 
>>>  Mark,
>>> 
>>>   Why rich/eisenstat(2) ?   With Cheb/eisenstat you get the acceleration of Cheby on top of the SOR but rich/eisenstat(2) just seems like some strange implementation of sor?
>>> 
>>>   I am guessing you are getting your 15% potential improvement from
>>> 
>>>>>> 7.63/11.3
>>> 0.6752212389380531
>>>>>> .6752212389380531*5.9800e+00
>>> 4.037823008849558
>>>>>> .63/4.63
>>> 0.13606911447084233
>>> 
>>> Anyways I would focus on rich/ssor()  smoothing.
>>> 
>>> Before thinking about implementing a new data structure I think it is relatively easy to reduce more the total number of flops done with ssor(1) smoother [and despite the current conventional wisdom that doing extra flops is in a good thing in HPC :-)).
>>> 
>>> Presmoothing: Note that since you are running ssor(1) and zero initial guess the current implementation is already good in terms of reducing total flops; it does one down solve (saving the intermediate values) and then one up solve (using those saved values).
>>> 
>>> Postsmoothing: Here it is running SSOR(1) with nonzero initial guess, the current implementation in MatSOR_SeqAIJ() is bad because it applies the entire matrix in both the down solve and the up solve.
>>> 
>>> while (its--) {
>>>   if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>     for (i=0; i<m; i++) {
>>>       n   = a->i[i+1] - a->i[i];
>>>       idx = a->j + a->i[i];
>>>       v   = a->a + a->i[i];
>>>       sum = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>>   if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>     for (i=m-1; i>=0; i--) {
>>>       n   = a->i[i+1] - a->i[i];
>>>       idx = a->j + a->i[i];
>>>       v   = a->a + a->i[i];
>>>       sum = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>> 
>>> (Note also that this partially explains why the flop rate in the table for MatSOR() s so good, two of the three ssor triangular solves are actually full matrix vector products). What it should do is apply the entire matrix in the down solve (but save the partial sums of the lower triangular part) and then it can do the upper solve applying only 1/2 the matrix and using the accumulated values.  So currently the combined pre and post smooth has 3 complete applications of the matrix (.5 and .5 from the pre smooth and 1 and 1 from the up smooth. Adding the optimization will reduce it to 2.5 applications of the matrix giving a reduction of 1/6  = 17 percent of the smooth flops or
>>> 
>>>>>> 17.*14/33
>>> 7.212121212121212
>>> 
>>> 7 percent decrease in the total KSPSolve flops
>>> 
>>> Next we can eliminate 1/2 of the application of the matrix in the residual computation after the smoothing if we have saved the partial sums of the upper triangular solve in the smoother. We can estimate the flops of the matrix-multiple in the residual computation as one application of the matrix and hence equal to 1/3 of the flops in the ssor computations. Eliminating 1/2 of that means eliminating the equivalent of eliminating 1/6 of the flops in the MatSOR computation which is
>>> 
>>>>>> 14./(33*6)
>>> 0.0707070707070707  (actually the same computation as above :-(
>>> 
>>> so another 7 percent of the total flops in the KSPSolve.  So in total we could save 14 percent of the flops (and lots of memory access) in the KSPSolve at the work of editing a couple of functions.
>>> 
>>> At that point we could compute the flop rate of all the SSOR triangular solves and the percent of the time of the KSPSolves() in the them to determine if a new data structure is warranted. Note that in reducing the total number of flops we are replacing two full matrix-vector products with 2 triangular solves hence the flop rate will be lower indicating that using a new data structure will actually help more.
>>> 
>>> Notes on the implementation:
>>> 
>>> * modifying the MatSolve_SeqAIJ() to better handle the nonzero initial guess is straightforward; edit one functions
>>> 
>>> * modifying the residual computation in the multigrid to reuse the upper triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ() would have an internal pointer to the accumulated solves for the upper triangular part and we would need to either (1) have a MatMult_SeqAIJ() implementation that would retrieve those values and use them appropriately (how would it know when the values are valid?) or (2) use the PCMGSetResidual() to provide a custom residual routine that again pulled out the accumulated sums and used them. Of course we will eventually figure out how to organize it cleanly for users.
>>> 
>>> 
>>> Note also the even if a new custom data structure is introduced the work outlined above is not lost since the new data structure still needs those flop reduction optimizations. So step one, fix the MatSOR_SeqAIJ() for nonzero initial guess, get a roughly 6 percent improvement, step 2, optimize the residual computation, get another 6 percent improvement, step 3,  introduce a new data structure and get another roughly 5 percent improvement in KSPSolve() (the 6, 6  and 5 are my guesstimates).
>>> 
>>> I would be very interested in seeing new numbers.
>>> 
>>> Final note: the very unload balancing of this problem will limit how much these optimizations will help. For a perfectly load balanced problem I think these optimizations would a lot higher in percentage (maybe twice as much? more?).
>>> 
>>> 
>>>  Barry
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <mfadams at lbl.gov> wrote:
>>> 
>>>> I would like to get an idea of how much benefit there would be with using a special matrix type for SOR.  Here is an experiment on 128 cores of Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look like higher order stencils at BCs.  32^3 subdomian on each core:
>>>> 
>>>> cheb/jacobi(2)
>>>> KSPSolve              15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02  7 29 67 46  7  26100100100 76 18765
>>>> 
>>>> 
>>>> rich/eisenstat(2)
>>>> KSPSolve              15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44  7  38100100100 76 11659
>>>> 
>>>> 
>>>> rich/sor
>>>> KSPSolve              15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31  8  33100100100 77 15708
>>>> 
>>>> 
>>>> Complete log files attached.  The "projection" solve is the solver of interest.
>>>> 
>>>> I have 2 Jacobi so that it has about the same amount of work a one (s)sor.  There are voids in the domain which I believe accounts for the large differences in the number of flops per core.  These were run with the same processor group (i.e., all runs done in the same qsub script)
>>>> 
>>>> This shows only about 15% potential gain.  Should we conclude that there is not much to gain from an optimized data structure?
>>>> 
>>>> Mark
>>>> <log_eis><log_jac><log_sor>
>>>> 
>>>> 
>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>> 
>>>>> "Mark F. Adams" <mfadams at lbl.gov> writes:
>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>> problems but for me robustness trumps this for default solver
>>>>>> parameters in PETSc.
>>>>> 
>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>> local work, low/no setup cost, and reliable convergence.  Cheby/Jacobi
>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>> anything on a CPU.
>>>>> 
>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>> but needs a specialized matrix to get good flop rates.  Something to
>>>>>> think about doing …
>>>>> 
>>>>> Mine was too sloppy, Barry got it right.  Eisenstat is for Cheby/SOR,
>>>>> however, and doesn't do anything for Richardson.
>>>>> 
>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>> that the back-sweep can use it, and vice-versa.  With full caching and
>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>> (nonzero initial guess).  (This is a strong smoother and frequently, one
>>>>> SSOR would be enough.)
>>>> 
>>> 
>> 
>