[petsc-dev] Kernel fusion for vector operations
Karl Rupp
rupp at mcs.anl.gov
Thu May 30 15:37:57 CDT 2013
Hi guys,
I went through the iterative solvers today, looking for places where we
can "save" memory bandwidth. For example, the two operations
v <- v - alpha * u
v <- v - beta * w
are currently performed using two BLAS-like VecAXPY() calls, requiring v
to be read and written twice. Ideally, this can be fused into a single
operation
v <- v - alpha * u - beta * w
with less pressure on the memory link. For GPUs and threads there is the
additional benefit of lower startup overhead, as only one kernel needs
to be launched in the above example instead of two.
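For this particular two-term pattern the fused update can already be
expressed through the existing interface; a minimal sketch (the wrapper
function is only for illustration):

#include <petscvec.h>

/* Minimal sketch: the two calls
     VecAXPY(v, -alpha, u);
     VecAXPY(v, -beta,  w);
   replaced by one fused update, so v is read and written only once.
   VecAXPBYPCZ(z,a,b,c,x,y) computes z <- a*x + b*y + c*z. */
static PetscErrorCode FusedTwoTermUpdate(Vec v, Vec u, Vec w, PetscScalar alpha, PetscScalar beta)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* v <- v - alpha*u - beta*w with a single sweep over v */
  ierr = VecAXPBYPCZ(v, -alpha, -beta, 1.0, u, w);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

VecMAXPY() covers the same idea for an arbitrary number of input
vectors, so longer AXPY chains that update the same vector can be
collapsed in the same way.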
Now, looking *only* at cases where the pressure on the memory link is
reduced as well, I found the optimization opportunities listed in the
attachment. Each block of statements allows the statements to be fused
into a single kernel by reusing cached data. In most solvers this saves
one or two vector loads and stores from memory per iteration, which is
small compared to the traffic caused by the matrix-vector product and
the remaining vector operations. (Rough estimate, assuming double
precision and 32-bit column indices: eliminating one vector load and one
store saves 16 bytes per row and iteration, while a sparse matrix-vector
product with 20 nonzeros per row already streams about 240 bytes of
matrix data per row.) Thus, hardly any unpreconditioned solver would
gain more than 5-10 percent from manually fusing these operations. As
soon as preconditioners or a system matrix with more than, say, 20
entries per row are involved, the savings will most likely drop below
one percent.
The following operations occur in several solvers, so we might want to
consider adding dedicated MatXXX/VecYYY routines:
- Matrix-Vector-product followed by an inner product:
v <- A x
alpha <- (v, w) or (w, v)
where w can either be x or a different vector.
- Application of a preconditioner followed by an inner product:
z <- M^-1 r
alpha <- (r, z)
Interfering with the preconditioner implementations is probably not
worth it, so the only reasonable candidate is fusing the matrix-vector
product with a subsequent inner product (with a fallback to the two
separate operations for matrix-free operators).
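To make the idea concrete, such a routine could look roughly like the
sketch below. MatMultThenDot() is a made-up name for this sketch, not an
existing PETSc routine, and the body shown is just the generic fallback
of two separate operations that matrix-free operators would take; a
fused implementation would accumulate the local part of the inner
product while the freshly computed entries of v are still in cache or
registers.

#include <petscmat.h>

/* Hypothetical routine (name and signature invented for this sketch):
   v <- A x followed by alpha <- (v, w). Only the generic fallback of
   two separate operations is shown here. */
static PetscErrorCode MatMultThenDot(Mat A, Vec x, Vec w, Vec v, PetscScalar *alpha)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatMult(A, x, v);CHKERRQ(ierr);      /* v <- A x        */
  ierr = VecDot(v, w, alpha);CHKERRQ(ierr);   /* alpha <- (v, w) */
  PetscFunctionReturn(0);
}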
Other operations are rather specific to the respective solver. FBCGSR
already optimizes for data reuse by computing the vector operations
directly inside the solver rather than going through VecXYZ() calls,
which makes it fairly slow when using GPUs.
Overall, if we want to fuse operations, the main motivation should be
reducing kernel startup overhead (threads, GPU) rather than reducing
pressure on the memory link.
Best regards,
Karli
-------------- next part --------------
PETSc vector operations in iterative solvers
------------------------------------------------
Greek letters: scalars
lower-case letters: vectors
upper-case letters: linear operators (matrices; preconditioners when applied as an inverse, e.g. M^-1)
(each block of statements represents a sequence of operations that can be fused into a single kernel)
BCGS:
- BCGS:
v <- K p
alpha <- rho / (v, rp)
x <- alpha * p + omega * s + x
r <- s - omega * t
- FBCGS:
v <- A p2
alpha <- rho / (v, rp)
- FBCGSR:
// all inner products are local:
v <- A p2
tau <- (r, rp)
sigma <- (v, rp)
t <- A s2
xi1 <- (s,s)
xi2 <- (t,s)
xi3 <- (t,t)
xi4 <- (t,rp)
r <- s - omega t
p <- r + beta * (p - omega v)
BICG:
z <- K p
dpi <- (z, p)
CG:
- GLTR
z <- M^-1 r
beta <- (r, z)
alpha <- ||r||_2
z <- Q p
kappa <- (p, z)
- NASH
// same as GLTR
- PIPECG (see the fused-sweep sketch after this CG block)
z <- n + beta * z
q <- m + beta * q
p <- u + beta * p
s <- w + beta * s
x <- x + alpha * p
u <- u - alpha * q
w <- w - alpha * z
r <- r - alpha * s
- STCG
// same as GLTR
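The PIPECG block above is the most extreme case: eight AXPY-type updates
over the same index range, where the second group of four immediately
re-reads the vectors just written or read by the first group. Below is a
minimal sketch of a hand-fused sweep on raw host arrays (the function
name is made up, and all vectors are assumed to share the same local
layout, as they do inside the solver):

#include <petscvec.h>

/* Sketch of a single fused sweep for the PIPECG update block
     z <- n + beta*z,  q <- m + beta*q,  p <- u + beta*p,  s <- w + beta*s,
     x <- x + alpha*p, u <- u - alpha*q, w <- w - alpha*z, r <- r - alpha*s
   so that p, q, z, s, u and w are streamed through memory only once. */
static PetscErrorCode PIPECGFusedUpdate(Vec x, Vec r, Vec u, Vec w, Vec p, Vec q, Vec s, Vec z,
                                        Vec m, Vec n, PetscScalar alpha, PetscScalar beta)
{
  PetscErrorCode     ierr;
  PetscInt           i, nloc;
  PetscScalar       *xa, *ra, *ua, *wa, *pa, *qa, *sa, *za;
  const PetscScalar *ma, *na;

  PetscFunctionBegin;
  ierr = VecGetLocalSize(x, &nloc);CHKERRQ(ierr);
  ierr = VecGetArray(x, &xa);CHKERRQ(ierr);
  ierr = VecGetArray(r, &ra);CHKERRQ(ierr);
  ierr = VecGetArray(u, &ua);CHKERRQ(ierr);
  ierr = VecGetArray(w, &wa);CHKERRQ(ierr);
  ierr = VecGetArray(p, &pa);CHKERRQ(ierr);
  ierr = VecGetArray(q, &qa);CHKERRQ(ierr);
  ierr = VecGetArray(s, &sa);CHKERRQ(ierr);
  ierr = VecGetArray(z, &za);CHKERRQ(ierr);
  ierr = VecGetArrayRead(m, &ma);CHKERRQ(ierr);
  ierr = VecGetArrayRead(n, &na);CHKERRQ(ierr);
  for (i = 0; i < nloc; i++) {
    /* first group of updates */
    za[i] = na[i] + beta * za[i];
    qa[i] = ma[i] + beta * qa[i];
    pa[i] = ua[i] + beta * pa[i];
    sa[i] = wa[i] + beta * sa[i];
    /* second group, reusing the freshly updated entries */
    xa[i] += alpha * pa[i];
    ua[i] -= alpha * qa[i];
    wa[i] -= alpha * za[i];
    ra[i] -= alpha * sa[i];
  }
  ierr = VecRestoreArrayRead(n, &na);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(m, &ma);CHKERRQ(ierr);
  ierr = VecRestoreArray(z, &za);CHKERRQ(ierr);
  ierr = VecRestoreArray(s, &sa);CHKERRQ(ierr);
  ierr = VecRestoreArray(q, &qa);CHKERRQ(ierr);
  ierr = VecRestoreArray(p, &pa);CHKERRQ(ierr);
  ierr = VecRestoreArray(w, &wa);CHKERRQ(ierr);
  ierr = VecRestoreArray(u, &ua);CHKERRQ(ierr);
  ierr = VecRestoreArray(r, &ra);CHKERRQ(ierr);
  ierr = VecRestoreArray(x, &xa);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

On a GPU backend the same fusion would be a single custom kernel rather
than a host loop; either way, eight kernel launches collapse into one,
which matters more than the saved bandwidth.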
CGS:
q <- u - alpha * v
t <- u + q
x <- x + alpha * t
r <- r - alpha * K * t
rho <- (r, rp)
q <- q + beta * p
p <- u + beta * q
CHEBY:
r <- A * x
r <- b - r
p <- B^-1 r
p <- alpha * p + x
p <- omega * p
p <- p + (1-omega) * u
p <- p + omega * v
CR:
q <- Z^-1 p
alpha <- (p, q)
r <- r - alpha * q
z <- A r
beta <- (r, z)
GCR:
v <- v * alpha
s <- s * beta
x <- x + gamma * s
r <- r - gamma * v
IBCGS:
// operations are already local ('hand-tuned')
z <- alpha * r + mu * z - eta * v
v <- u + beta * v - delta * q
s <- r - alpha * v
// not yet 'tuned'
q <- A v
t <- u - alpha * q
// again hand-tuned:
phi <- (r, s)
pi <- (r, q)
gamma <- (f, s)
eta <- (f, t)
theta <- (s, t)
kappa <- (t, t)
// tuned already:
r <- s - omega * t
alpha <- ||r||_2
x <- x + z + omega * sn
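The six inner products in the IBCGS block above can at least share a
single global reduction while still going through the Vec interface (so
that GPU backends are not bypassed), using the existing split-phase
VecDotBegin()/VecDotEnd() calls. This does not fuse the local passes
over the vectors, but it keeps the reduction count at one; a minimal
sketch with the names from the listing:

#include <petscvec.h>

/* Sketch: grouping the six inner products of the IBCGS block so that
   only one global reduction is performed. Vector and scalar names
   follow the listing above. */
static PetscErrorCode IBCGSGroupedDots(Vec r, Vec s, Vec q, Vec t, Vec f,
                                       PetscScalar *phi, PetscScalar *pi, PetscScalar *gamma,
                                       PetscScalar *eta, PetscScalar *theta, PetscScalar *kappa)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecDotBegin(r, s, phi);CHKERRQ(ierr);
  ierr = VecDotBegin(r, q, pi);CHKERRQ(ierr);
  ierr = VecDotBegin(f, s, gamma);CHKERRQ(ierr);
  ierr = VecDotBegin(f, t, eta);CHKERRQ(ierr);
  ierr = VecDotBegin(s, t, theta);CHKERRQ(ierr);
  ierr = VecDotBegin(t, t, kappa);CHKERRQ(ierr);
  /* independent local work could be overlapped here */
  ierr = VecDotEnd(r, s, phi);CHKERRQ(ierr);
  ierr = VecDotEnd(r, q, pi);CHKERRQ(ierr);
  ierr = VecDotEnd(f, s, gamma);CHKERRQ(ierr);
  ierr = VecDotEnd(f, t, eta);CHKERRQ(ierr);
  ierr = VecDotEnd(s, t, theta);CHKERRQ(ierr);
  ierr = VecDotEnd(t, t, kappa);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

VecNormBegin()/VecNormEnd() exist as well, so independent dot products
and norms can share the same reduction.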
LCD:
alpha <- (p, r)
beta <- (p, q)
r <- r - alpha * q
gamma <- ||r||_2
LSQR:
u <- u - alpha * v
beta <- ||u||_2
v <- A^T u
w <- w - beta * v
x <- x + gamma * w
w <- u - delta * w
MINRES:
r <- A u
alpha <- (r, u)
r <- r - beta * v
z <- z - beta * u
gamma <- (r, z)
w <- (u - rho * w - eta * w_old) / delta
x <- x + nu * w
// the following occurs twice:
r <- u
u <- u / beta
QCG:
x <- w
x <- x + alpha * p
alpha <- ||x||_2
SYMMLQ:
// the following occurs twice:
v_old <- v
v <- r
v <- v * beta
v <- w
v <- v * alpha
w <- v + eta * u
v <- v * -beta
v <- v + nu * u
r <- A u
alpha <- (u, r)
r <- r - alpha * v
r <- r - beta * w
(could be merged to r <- r - alpha * v - beta * w)
alpha <- (r, z) // z computed similarly
TCQMR:
y <- A u
alpha <- (v0, u)
beta <- (v0, y)
rho <- (v0, u) // This is redundant, see alpha above
v <- v - alpha * u
v <- v - beta * w
alpha <- ||v||_2
// twice:
u <- v
v <- w
TFQMR:
q <- u - alpha * v
t <- u + q
u <- r + beta * q
p <- u + beta * (q + beta * p)