[petsc-dev] Kernel fusion for vector operations
Jed Brown
jedbrown at mcs.anl.gov
Thu May 30 18:37:06 CDT 2013
Karl Rupp <rupp at mcs.anl.gov> writes:
> Hi guys,
>
> I went through the iterative solvers today, looking for places where we
> can "save" memory bandwidth. For example, the two operations
> v <- v - alpha * u
> v <- v - beta * w
> are currently performed using two BLAS-like VecAXPY() calls, requiring v
> to be read and written twice. Ideally, this can be fused into a single
> operation
> v <- v - alpha * u - beta * w
> with less pressure on the memory link.
VecAXPBYPCZ(v,-alpha,-beta,1.0,u,w);
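As Jed's one-liner shows, PETSc already exposes this fusion through VecAXPBYPCZ. To make the bandwidth argument concrete, here is a minimal plain-C sketch of what such a fused kernel does (illustrative only, not PETSc's actual implementation): v is read and written once instead of twice.

```c
#include <stddef.h>

/* Fused update v <- v - alpha*u - beta*w in a single sweep:
 * one load and one store of v, versus two of each for
 * back-to-back AXPY-style calls. */
static void fused_axpbypcz(size_t n, double *v,
                           double alpha, const double *u,
                           double beta, const double *w)
{
    for (size_t i = 0; i < n; ++i)
        v[i] -= alpha * u[i] + beta * w[i];
}
```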
> For GPUs and threads there is an additional benefit of lower
> startup-overhead, as only one kernel needs to be executed in the above
> example instead of two.
Yup.
> Now, looking *only* at cases where the pressure on the memory link is
> reduced as well, I found the optimization capabilities attached. Each
> block of statements allows for a fusion of statements by reusing cached
> data. In most solvers one can save one or two loads and stores from
> memory per iteration, which is small compared to the matrix-vector
> multiplication and the other vector operations going on. Thus, hardly
> any unpreconditioned solver would gain more than 5-10 percent from
> manually fusing these operations. As soon as preconditioners or a
> system matrix with more than, say, 20 entries per row are involved,
> savings will most likely drop below one percent.
>
> The following operations occur in several solvers, so we might want to
> consider adding dedicated MatXXX/VecYYY routines:
>
> - Matrix-Vector-product followed by an inner product:
> v <- A x
> alpha <- (v, w) or (w, v)
> where w can either be x or a different vector.
>
> - Application of a preconditioner followed by an inner product:
> z <- M^-1 r
> alpha <- (r, z)
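Both candidate fusions above come down to accumulating the inner-product contribution while each freshly produced entry is still in a register, so the result vector is written once and never re-read. A plain-C sketch for CSR storage and a pointwise (Jacobi) preconditioner, with hypothetical helper names, not PETSc code:

```c
#include <stddef.h>

/* Fused sparse matvec and inner product: v <- A*x, returns (v, x).
 * Each v[i] is dotted with x[i] while still in a register. */
static double matvec_dot(size_t n, const size_t *rowptr, const size_t *col,
                         const double *val, const double *x, double *v)
{
    double alpha = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double vi = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            vi += val[j] * x[col[j]];
        v[i] = vi;           /* written once, not re-read */
        alpha += vi * x[i];  /* fused inner-product contribution */
    }
    return alpha;
}

/* Fused Jacobi application and inner product:
 * z <- D^{-1} r with d[i] = D[i][i], returns (r, z). */
static double jacobi_dot(size_t n, const double *d, const double *r, double *z)
{
    double alpha = 0.0;
    for (size_t i = 0; i < n; ++i) {
        z[i] = r[i] / d[i];
        alpha += r[i] * z[i];
    }
    return alpha;
}
```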
There is also the combination
D^{-1} A x
where D is either the diagonal (Jacobi) or point-block diagonal. This
is important in GAMG, which uses polynomial smoothers.
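For the D^{-1} A x combination, the diagonal scaling can likewise be applied while each row sum is still in a register, so it costs no extra memory traffic. A sketch for CSR and a pointwise diagonal with precomputed reciprocals (illustrative only; the point-block case would scale small dense blocks instead):

```c
#include <stddef.h>

/* Fused y <- D^{-1} A x for CSR A, with dinv[i] = 1/D[i][i]
 * precomputed. y is written exactly once per entry. */
static void dinv_matvec(size_t n, const size_t *rowptr, const size_t *col,
                        const double *val, const double *dinv,
                        const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double yi = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            yi += val[j] * x[col[j]];
        y[i] = dinv[i] * yi;  /* scale while the row sum is in a register */
    }
}
```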
> Interfering with the preconditioner implementations is probably not
> worth it, so the only reasonable candidate is fusing the matrix-vector
> product with a subsequent inner product (and a fall-back to two
> operations when using matrix-free stuff).
>
> Other operations are rather specific to the respective solver. FBCGSR
> already optimizes for data reuse by computing the vector operations
> directly inside the solver without calling VecXYZ() routines, which
> makes it fairly slow when using GPUs.
Yeah, those methods have really custom operations that would involve a
lot more traversals otherwise. It can be abstracted to allow a GPU
implementation, but we'd want to move the whole custom operation.