[petsc-dev] Kernel fusion for vector operations
Karl Rupp
rupp at mcs.anl.gov
Thu May 30 15:37:57 CDT 2013
Hi guys,
I went through the iterative solvers today, looking for places where we
can "save" memory bandwidth. For example, the two operations
v <- v - alpha * u
v <- v - beta * w
are currently performed using two BLAS-like VecAXPY() calls, requiring v
to be read and written twice. Ideally, this can be fused into a single
operation
v <- v - alpha * u - beta * w
with less pressure on the memory link. For GPUs and threads there is the
additional benefit of lower startup overhead, as only one kernel needs
to be launched in the above example instead of two.
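For this particular two-term pattern the fused update can already be
expressed through the existing interface; a minimal sketch (the wrapper
function is only for illustration):

#include <petscvec.h>

/* Minimal sketch: the two calls
     VecAXPY(v, -alpha, u);
     VecAXPY(v, -beta,  w);
   replaced by one fused update, so v is read and written only once.
   VecAXPBYPCZ(z,a,b,c,x,y) computes z <- a*x + b*y + c*z. */
static PetscErrorCode FusedTwoTermUpdate(Vec v, Vec u, Vec w, PetscScalar alpha, PetscScalar beta)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* v <- v - alpha*u - beta*w with a single sweep over v */
  ierr = VecAXPBYPCZ(v, -alpha, -beta, 1.0, u, w);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

VecMAXPY() covers the same idea for an arbitrary number of input
vectors, so longer AXPY chains that update the same vector can be
collapsed in the same way.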
Now, looking *only* at cases where the pressure on the memory link is
reduced as well, I found the optimization opportunities listed in the
attachment. Each block of statements allows the statements to be fused
into a single kernel by reusing cached data. In most solvers this saves
one or two vector loads and stores from memory per iteration, which is
small compared to the traffic caused by the matrix-vector product and
the remaining vector operations. (Rough estimate, assuming double
precision and 32-bit column indices: eliminating one vector load and one
store saves 16 bytes per row and iteration, while a sparse matrix-vector
product with 20 nonzeros per row already streams about 240 bytes of
matrix data per row.) Thus, hardly any unpreconditioned solver would
gain more than 5-10 percent from manually fusing these operations. As
soon as preconditioners or a system matrix with more than, say, 20
entries per row are involved, the savings will most likely drop below
one percent.
The following operations occur in several solvers, so we might want to
consider adding dedicated MatXXX/VecYYY routines:
- Matrix-Vector-product followed by an inner product:
v <- A x
alpha <- (v, w) or (w, v)
where w can either be x or a different vector.
- Application of a preconditioner followed by an inner product:
z <- M^-1 r
alpha <- (r, z)
Interfering with the preconditioner implementations is probably not
worth it, so the only reasonable candidate is fusing the matrix-vector
product with a subsequent inner product (with a fallback to the two
separate operations for matrix-free operators).
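To make the idea concrete, such a routine could look roughly like the
sketch below. MatMultThenDot() is a made-up name for this sketch, not an
existing PETSc routine, and the body shown is just the generic fallback
of two separate operations that matrix-free operators would take; a
fused implementation would accumulate the local part of the inner
product while the freshly computed entries of v are still in cache or
registers.

#include <petscmat.h>

/* Hypothetical routine (name and signature invented for this sketch):
   v <- A x followed by alpha <- (v, w). Only the generic fallback of
   two separate operations is shown here. */
static PetscErrorCode MatMultThenDot(Mat A, Vec x, Vec w, Vec v, PetscScalar *alpha)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatMult(A, x, v);CHKERRQ(ierr);      /* v <- A x        */
  ierr = VecDot(v, w, alpha);CHKERRQ(ierr);   /* alpha <- (v, w) */
  PetscFunctionReturn(0);
}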
Other operations are rather specific to the respective solver. FBCGSR
already optimizes for data reuse by computing the vector operations
directly inside the solver rather than going through VecXYZ() calls,
which makes it fairly slow when using GPUs.
Overall, if we want to fuse operations, the main motivation should be
reducing kernel startup overhead (threads, GPU) rather than reducing
pressure on the memory link.
Best regards,
Karli
-------------- next part --------------
PETSc vector operations in iterative solvers
------------------------------------------------
Greek letters: scalars
lower-case letters: vectors
upper-case letters: linear operators (matrices; preconditioners when applied as an inverse, e.g. M^-1)
(each block of statements represents a sequence of operations that can be fused into a single kernel)
BCGS:
- BCGS:
v <- K p
alpha <- rho / (v, rp)
x <- alpha * p + omega * s + x
r <- s - omega * t
- FBCGS:
v <- A p2
alpha <- rho / (v, rp)
- FBCGSR:
// all inner products are local:
v <- A p2
tau <- (r, rp)
sigma <- (v, rp)
t <- A s2
xi1 <- (s,s)
xi2 <- (t,s)
xi3 <- (t,t)
xi4 <- (t,rp)
r <- s - omega t
p <- r + beta * (p - omega v)
BICG:
z <- K p
dpi <- (z, p)
CG:
- GLTR
z <- M^-1 r
beta <- (r, z)
alpha <- ||r||_2
z <- Q p
kappa <- (p, z)
- NASH
// same as GLTR
- PIPECG (see the fused-sweep sketch after this CG block)
z <- n + beta * z
q <- m + beta * q
p <- u + beta * p
s <- w + beta * s
x <- x + alpha * p
u <- u - alpha * q
w <- w - alpha * z
r <- r - alpha * s
- STCG
// same as GLTR
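The PIPECG block above is the most extreme case: eight AXPY-type updates
over the same index range, where the second group of four immediately
re-reads the vectors just written or read by the first group. Below is a
minimal sketch of a hand-fused sweep on raw host arrays (the function
name is made up, and all vectors are assumed to share the same local
layout, as they do inside the solver):

#include <petscvec.h>

/* Sketch of a single fused sweep for the PIPECG update block
     z <- n + beta*z,  q <- m + beta*q,  p <- u + beta*p,  s <- w + beta*s,
     x <- x + alpha*p, u <- u - alpha*q, w <- w - alpha*z, r <- r - alpha*s
   so that p, q, z, s, u and w are streamed through memory only once. */
static PetscErrorCode PIPECGFusedUpdate(Vec x, Vec r, Vec u, Vec w, Vec p, Vec q, Vec s, Vec z,
                                        Vec m, Vec n, PetscScalar alpha, PetscScalar beta)
{
  PetscErrorCode     ierr;
  PetscInt           i, nloc;
  PetscScalar       *xa, *ra, *ua, *wa, *pa, *qa, *sa, *za;
  const PetscScalar *ma, *na;

  PetscFunctionBegin;
  ierr = VecGetLocalSize(x, &nloc);CHKERRQ(ierr);
  ierr = VecGetArray(x, &xa);CHKERRQ(ierr);
  ierr = VecGetArray(r, &ra);CHKERRQ(ierr);
  ierr = VecGetArray(u, &ua);CHKERRQ(ierr);
  ierr = VecGetArray(w, &wa);CHKERRQ(ierr);
  ierr = VecGetArray(p, &pa);CHKERRQ(ierr);
  ierr = VecGetArray(q, &qa);CHKERRQ(ierr);
  ierr = VecGetArray(s, &sa);CHKERRQ(ierr);
  ierr = VecGetArray(z, &za);CHKERRQ(ierr);
  ierr = VecGetArrayRead(m, &ma);CHKERRQ(ierr);
  ierr = VecGetArrayRead(n, &na);CHKERRQ(ierr);
  for (i = 0; i < nloc; i++) {
    /* first group of updates */
    za[i] = na[i] + beta * za[i];
    qa[i] = ma[i] + beta * qa[i];
    pa[i] = ua[i] + beta * pa[i];
    sa[i] = wa[i] + beta * sa[i];
    /* second group, reusing the freshly updated entries */
    xa[i] += alpha * pa[i];
    ua[i] -= alpha * qa[i];
    wa[i] -= alpha * za[i];
    ra[i] -= alpha * sa[i];
  }
  ierr = VecRestoreArrayRead(n, &na);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(m, &ma);CHKERRQ(ierr);
  ierr = VecRestoreArray(z, &za);CHKERRQ(ierr);
  ierr = VecRestoreArray(s, &sa);CHKERRQ(ierr);
  ierr = VecRestoreArray(q, &qa);CHKERRQ(ierr);
  ierr = VecRestoreArray(p, &pa);CHKERRQ(ierr);
  ierr = VecRestoreArray(w, &wa);CHKERRQ(ierr);
  ierr = VecRestoreArray(u, &ua);CHKERRQ(ierr);
  ierr = VecRestoreArray(r, &ra);CHKERRQ(ierr);
  ierr = VecRestoreArray(x, &xa);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

On a GPU backend the same fusion would be a single custom kernel rather
than a host loop; either way, eight kernel launches collapse into one,
which matters more than the saved bandwidth.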
CGS:
q <- u - alpha * v
t <- u + q
x <- x + alpha * t
r <- r - alpha * K * t
rho <- (r, rp)
q <- q + beta * p
p <- u + beta * q
CHEBY:
r <- A * x
r <- b - r
p <- B^-1 r
p <- alpha * p + x
p <- omega * p
p <- p + (1-omega) * u
p <- p + omega * v
CR:
q <- Z^-1 p
alpha <- (p, q)
r <- r - alpha * q
z <- A r
beta <- (r, z)
GCR:
v <- v * alpha
s <- s * beta
x <- x + gamma * s
r <- r - gamma * v
IBCGS:
// operations are already local ('hand-tuned')
z <- alpha * r + mu * z - eta * v
v <- u + beta * v - delta * q
s <- r - alpha * v
// not yet 'tuned'
q <- A v
t <- u - alpha * q
// again hand-tuned:
phi <- (r, s)
pi <- (r, q)
gamma <- (f, s)
eta <- (f, t)
theta <- (s, t)
kappa <- (t, t)
// tuned already:
r <- s - omega * t
alpha <- ||r||_2
x <- x + z + omega * sn
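The six inner products in the IBCGS block above can at least share a
single global reduction while still going through the Vec interface (so
that GPU backends are not bypassed), using the existing split-phase
VecDotBegin()/VecDotEnd() calls. This does not fuse the local passes
over the vectors, but it keeps the reduction count at one; a minimal
sketch with the names from the listing:

#include <petscvec.h>

/* Sketch: grouping the six inner products of the IBCGS block so that
   only one global reduction is performed. Vector and scalar names
   follow the listing above. */
static PetscErrorCode IBCGSGroupedDots(Vec r, Vec s, Vec q, Vec t, Vec f,
                                       PetscScalar *phi, PetscScalar *pi, PetscScalar *gamma,
                                       PetscScalar *eta, PetscScalar *theta, PetscScalar *kappa)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecDotBegin(r, s, phi);CHKERRQ(ierr);
  ierr = VecDotBegin(r, q, pi);CHKERRQ(ierr);
  ierr = VecDotBegin(f, s, gamma);CHKERRQ(ierr);
  ierr = VecDotBegin(f, t, eta);CHKERRQ(ierr);
  ierr = VecDotBegin(s, t, theta);CHKERRQ(ierr);
  ierr = VecDotBegin(t, t, kappa);CHKERRQ(ierr);
  /* independent local work could be overlapped here */
  ierr = VecDotEnd(r, s, phi);CHKERRQ(ierr);
  ierr = VecDotEnd(r, q, pi);CHKERRQ(ierr);
  ierr = VecDotEnd(f, s, gamma);CHKERRQ(ierr);
  ierr = VecDotEnd(f, t, eta);CHKERRQ(ierr);
  ierr = VecDotEnd(s, t, theta);CHKERRQ(ierr);
  ierr = VecDotEnd(t, t, kappa);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

VecNormBegin()/VecNormEnd() exist as well, so independent dot products
and norms can share the same reduction.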
LCD:
alpha <- (p, r)
beta <- (p, q)
r <- r - alpha * q
gamma <- ||r||_2
LSQR:
u <- u - alpha * v
beta <- ||u||_2
v <- A^T u
w <- w - beta * v
x <- x + gamma * w
w <- u - delta * w
MINRES:
r <- A u
alpha <- (r, u)
r <- r - beta * v
z <- z - beta * u
gamma <- (r, z)
w <- (u - rho * w - eta * w_old) / delta
x <- x + nu * w
// the following occurs twice:
r <- u
u <- u / beta
QCG:
x <- w
x <- x + alpha * p
alpha <- ||x||_2
SYMMLQ:
// the following occurs twice:
v_old <- v
v <- r
v <- v * beta
v <- w
v <- v * alpha
w <- v + eta * u
v <- v * -beta
v <- v + nu * u
r <- A u
alpha <- (u, r)
r <- r - alpha * v
r <- r - beta * w
(could be merged to r <- r - alpha * v - beta * w)
alpha <- (r, z) // z computed similarly
TCQMR:
y <- A u
alpha <- (v0, u)
beta <- (v0, y)
rho <- (v0, u) // This is redundant, see alpha above
v <- v - alpha * u
v <- v - beta * w
alpha <- ||v||_2
// twice:
u <- v
v <- w
TFQMR:
q <- u - alpha * v
t <- u + q
u <- r + beta * q
p <- u + beta * (q + beta * p)