[petsc-dev] Questions around benchmarking and data loading with PETSc

Jed Brown jed at jedbrown.org
Sat Dec 11 22:44:29 CST 2021


Junchao Zhang <junchao.zhang at gmail.com> writes:

> I expected TACO was better since its website says "It uses novel compiler
> techniques to get performance competitive with hand-optimized kernels"

The TACO-generated code is basically the same CSR kernel, modulo cosmetic partitioning into groups of 32 rows:

  #pragma omp parallel for schedule(runtime)
  for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
    for (int32_t i1 = 0; i1 < 32; i1++) {
      int32_t i = i0 * 32 + i1;
      if (i >= A1_dimension)
        continue;

      double tjy_val = 0.0;
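      /* standard CSR dot product: A2_pos[i]..A2_pos[i+1] spans row i's nonzeros */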
      for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
        int32_t j = A2_crd[jA];
        tjy_val += A_vals[jA] * x_vals[j];
      }
      y_vals[i] = tjy_val;
    }
  }

As Matt says, this is mostly a bandwidth test. You can do somewhat better for matrices with similar row lengths using SELL (sliced ELLPACK, which enables vector instructions), but that gain is secondary to bandwidth for the matrix values and cache locality for the vector.
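For concreteness, here is a minimal sketch of the SELL idea (slice height 8; nslices, slice_ptr, sell_vals, sell_colidx, and the zero-valued padding are assumptions of this sketch, not PETSc's actual MATSELL layout):

  /* Hypothetical SELL SpMV sketch (slice height 8).  Rows are grouped into
     slices of 8; each slice is padded to its longest row and stored
     column-major, so each k-step touches 8 rows and vectorizes cleanly.
     Padding entries carry value 0.0 and any valid column index. */
  #pragma omp parallel for schedule(runtime)
  for (int32_t s = 0; s < nslices; s++) {  /* nslices = (A1_dimension + 7) / 8 */
    double acc[8] = {0.0};
    int32_t width = (slice_ptr[s + 1] - slice_ptr[s]) / 8;  /* padded row length */
    for (int32_t k = 0; k < width; k++) {
      #pragma omp simd
      for (int32_t lane = 0; lane < 8; lane++) {
        int32_t idx = slice_ptr[s] + k * 8 + lane;
        acc[lane] += sell_vals[idx] * x_vals[sell_colidx[idx]];
      }
    }
    for (int32_t lane = 0; lane < 8; lane++) {
      int32_t i = s * 8 + lane;
      if (i < A1_dimension)
        y_vals[i] = acc[lane];
    }
  }

The point is that the inner loop is unit-stride across a slice and maps directly onto SIMD lanes, whereas the CSR inner loop length varies per row.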

With a graph and a row-based algorithm, I would partition it using METIS or similar, then apply an RCM ordering within each block. This will give better locality of access to the vector x, which has been known (since at least the PETSc-FUN3D work in the late 90s) to be a major factor in SpMV performance.
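In PETSc terms, that pipeline can be sketched with the MatPartitioning and MatGetOrdering interfaces (a sketch only: error checking is elided, and it applies RCM to the whole matrix rather than within each block, which would take per-block index sets):

  #include <petscmat.h>

  /* Sketch (error checking elided): permute an assembled AIJ matrix into
     RCM order before timing MatMult(). */
  static Mat PermuteRCM(Mat A)
  {
    IS  rperm, cperm;
    Mat Aperm;
    MatGetOrdering(A, MATORDERINGRCM, &rperm, &cperm); /* reverse Cuthill-McKee */
    MatPermute(A, rperm, cperm, &Aperm);               /* Aperm = P A P^T */
    ISDestroy(&rperm);
    ISDestroy(&cperm);
    return Aperm;
  }

  /* The partitioning step can use MatPartitioning, e.g. with ParMETIS: */
  MatPartitioning part;
  IS              rows_per_part;  /* maps each local row to a partition */
  MatPartitioningCreate(PETSC_COMM_WORLD, &part);
  MatPartitioningSetAdjacency(part, A);
  MatPartitioningSetFromOptions(part); /* -mat_partitioning_type parmetis */
  MatPartitioningApply(part, &rows_per_part);
  MatPartitioningDestroy(&part);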

Benchmarking in naive orderings is one of the reasons the OSKI work was able to claim significant performance benefits while having virtually no impact on well-optimized applications (which choose a good ordering before forming their matrices).

