[petsc-dev] Kernel fusion for vector operations
Jed Brown
jedbrown at mcs.anl.gov
Thu May 30 18:37:06 CDT 2013
Karl Rupp <rupp at mcs.anl.gov> writes:
> Hi guys,
>
> I went through the iterative solvers today, looking for places where we
> can "save" memory bandwidth. For example, the two operations
> v <- v - alpha * u
> v <- v - beta * w
> are currently performed using two BLAS-like VecAXPY() calls, requiring v
> to be read and written twice. Ideally, this can be fused into a single
> operation
> v <- v - alpha * u - beta * w
> with less pressure on the memory link.
VecAXPBYPCZ(v,-alpha,-beta,1.0,u,w);
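As Jed's one-liner shows, PETSc already exposes this fusion through VecAXPBYPCZ. To make the bandwidth argument concrete, here is a minimal plain-C sketch of what such a fused kernel does (illustrative only, not PETSc's actual implementation): v is read and written once instead of twice.

```c
#include <stddef.h>

/* Fused update v <- v - alpha*u - beta*w in a single sweep:
 * one load and one store of v, versus two of each for
 * back-to-back AXPY-style calls. */
static void fused_axpbypcz(size_t n, double *v,
                           double alpha, const double *u,
                           double beta, const double *w)
{
    for (size_t i = 0; i < n; ++i)
        v[i] -= alpha * u[i] + beta * w[i];
}
```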
> For GPUs and threads there is an additional benefit of lower
> startup-overhead, as only one kernel needs to be executed in the above
> example instead of two.
Yup.
> Now, looking *only* at cases where the pressure on the memory link is
> reduced as well, I found the optimization capabilities attached. Each
> block of statements allows for a fusion of statements by reusing cached
> data. In most solvers one can save one or two loads and stores from
> memory per iteration, which is small compared to the matrix-vector
> multiplication and the other vector operations going on. Thus, hardly
> any unpreconditioned solver would gain more than 5-10 percent from
> manually fusing these operations. As soon as preconditioners or a
> system matrix with more than, say, 20 entries per row are involved,
> savings will most likely drop below one percent.
>
> The following operations occur in several solvers, so we might want to
> consider adding dedicated MatXXX/VecYYY routines:
>
> - Matrix-Vector-product followed by an inner product:
> v <- A x
> alpha <- (v, w) or (w, v)
> where w can either be x or a different vector.
>
> - Application of a preconditioner followed by an inner product:
> z <- M^-1 r
> alpha <- (r, z)
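Both candidate fusions above come down to accumulating the inner-product contribution while each freshly produced entry is still in a register, so the result vector is written once and never re-read. A plain-C sketch for CSR storage and a pointwise (Jacobi) preconditioner, with hypothetical helper names, not PETSc code:

```c
#include <stddef.h>

/* Fused sparse matvec and inner product: v <- A*x, returns (v, x).
 * Each v[i] is dotted with x[i] while still in a register. */
static double matvec_dot(size_t n, const size_t *rowptr, const size_t *col,
                         const double *val, const double *x, double *v)
{
    double alpha = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double vi = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            vi += val[j] * x[col[j]];
        v[i] = vi;           /* written once, not re-read */
        alpha += vi * x[i];  /* fused inner-product contribution */
    }
    return alpha;
}

/* Fused Jacobi application and inner product:
 * z <- D^{-1} r with d[i] = D[i][i], returns (r, z). */
static double jacobi_dot(size_t n, const double *d, const double *r, double *z)
{
    double alpha = 0.0;
    for (size_t i = 0; i < n; ++i) {
        z[i] = r[i] / d[i];
        alpha += r[i] * z[i];
    }
    return alpha;
}
```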
There is also the combination
D^{-1} A x
where D is either the diagonal (Jacobi) or point-block diagonal. This
is important in GAMG, which uses polynomial smoothers.
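For the D^{-1} A x combination, the diagonal scaling can likewise be applied while each row sum is still in a register, so it costs no extra memory traffic. A sketch for CSR and a pointwise diagonal with precomputed reciprocals (illustrative only; the point-block case would scale small dense blocks instead):

```c
#include <stddef.h>

/* Fused y <- D^{-1} A x for CSR A, with dinv[i] = 1/D[i][i]
 * precomputed. y is written exactly once per entry. */
static void dinv_matvec(size_t n, const size_t *rowptr, const size_t *col,
                        const double *val, const double *dinv,
                        const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double yi = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            yi += val[j] * x[col[j]];
        y[i] = dinv[i] * yi;  /* scale while the row sum is in a register */
    }
}
```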
> Interfering with the preconditioner implementations is probably not
> worth it, so the only reasonable candidate is fusing the matrix-vector
> product with a subsequent inner product (and a fall-back to two
> operations when using matrix-free stuff).
>
> Other operations are rather specific to the respective solver. FBCGSR
> already optimizes for data reuse by computing the vector operations
> directly inside the solver without calling VecXYZ() routines, which
> makes it fairly slow when using GPUs.
Yeah, those methods have really custom operations that would involve a
lot more traversals otherwise. It can be abstracted to allow a GPU
implementation, but we'd want to move the whole custom operation.