[MOAB-dev] GPU progress?

Vijay S. Mahadevan vijay.m at gmail.com
Thu Nov 4 22:17:44 CDT 2021


Hi Tim,

Over the summer, we participated in the ALCF GPU hackathon. We
originally went in with the idea of making all of MOAB GPU compatible,
but the short duration of the effort, and the requirement that we come
up with specific goals for the one-week hackathon, meant that we had
to pick and choose mini-apps focused on immediate needs.

Given the considerable effort already invested in the MOAB-E3SM
space, we picked two specific workflows to tackle in a portable way.

1) How best to perform point-in-element searches, since these are
fundamental to all of our mesh intersection computations that drive
the solution remapping process. After some experiments, we gave up on
making MOAB's AdaptiveKdTree GPU compatible: it required too much
effort and we didn't have the time. So we chose to write a mini-app
with ArborX [1], which is built on top of a performance-portable
Kokkos backend to support both OpenMP and CUDA. The results were
quite interesting. Even with OpenMP on a single node, compared
against the MOAB Kd-tree implementation with MPI, ArborX computed
element queries orders of magnitude faster for the sample problems
tested. On GPUs, the results were of course even better. So our
conclusion from that mini-app was to start switching to ArborX as the
query data structure in our next code iteration.
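
For readers less familiar with the primitive involved: a
point-in-element query asks which mesh element contains a given point.
The sketch below is NOT the MOAB or ArborX API; it is a minimal,
self-contained illustration (hypothetical Tri/locate names, 2D
triangles only) of the two stages a search tree accelerates: a cheap
bounding-box prefilter followed by an exact containment test.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// A 2D triangular element given by its three vertices (illustrative).
struct Tri { std::array<double, 2> a, b, c; };

// Axis-aligned bounding-box prefilter: the cheap test that a Kd-tree
// or ArborX's BVH performs hierarchically over many elements at once.
bool inBox(const Tri& t, double x, double y) {
    double xmin = std::min({t.a[0], t.b[0], t.c[0]});
    double xmax = std::max({t.a[0], t.b[0], t.c[0]});
    double ymin = std::min({t.a[1], t.b[1], t.c[1]});
    double ymax = std::max({t.a[1], t.b[1], t.c[1]});
    return x >= xmin && x <= xmax && y >= ymin && y <= ymax;
}

// Exact point-in-triangle test via barycentric coordinates: the point
// is inside iff all three barycentric weights are non-negative.
bool inTri(const Tri& t, double x, double y) {
    double d  = (t.b[1] - t.c[1]) * (t.a[0] - t.c[0]) +
                (t.c[0] - t.b[0]) * (t.a[1] - t.c[1]);
    double l1 = ((t.b[1] - t.c[1]) * (x - t.c[0]) +
                 (t.c[0] - t.b[0]) * (y - t.c[1])) / d;
    double l2 = ((t.c[1] - t.a[1]) * (x - t.c[0]) +
                 (t.a[0] - t.c[0]) * (y - t.c[1])) / d;
    double l3 = 1.0 - l1 - l2;
    return l1 >= 0 && l2 >= 0 && l3 >= 0;
}

// Brute-force scan: returns the index of the containing element, or
// -1 if none. A search tree replaces this O(n) loop with O(log n)
// candidate lookup, which is where ArborX's speedups come from.
int locate(const std::vector<Tri>& mesh, double x, double y) {
    for (std::size_t i = 0; i < mesh.size(); ++i)
        if (inBox(mesh[i], x, y) && inTri(mesh[i], x, y))
            return static_cast<int>(i);
    return -1;
}
```

The prefilter-then-exact-test structure is the same regardless of the
acceleration structure; only the candidate enumeration changes.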

2) How best to squeeze optimal performance out of the linear SpMV
behind the remap operators that project the solution between
components. This is somewhat of a "solved", or at least much more
researched, problem. There are many ways to store a sparse operator,
the traditional choice being CSR (or CSC). You could also use COO
(row, col, value) tuples with slightly increased memory requirements,
and there are other interesting approaches like SELL-C [2]. We
currently use Eigen3 on each task to perform SpMV when we need to
remap a solution. Eigen3 also supports OpenMP parallelism, but the
speedup wasn't significant in my experiments. Our mini-app used
Eigen3, Kokkos-Kernels (CSR), and Ginkgo [3] (CSR, COO, SELL-C,
etc.). Ginkgo actually showed the best performance initially, but
after some optimizations we were able to get Kokkos-Kernels on par as
well; both work the same on multi-core and GPU architectures.
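
To make the CSR format concrete: the sketch below is a textbook serial
SpMV, not code from any of the libraries mentioned (the CsrMatrix/spmv
names are illustrative). It shows the row-pointer indexing that CSR is
built on, and the row loop that Eigen3 (via OpenMP), Kokkos-Kernels,
and Ginkgo parallelize on multi-core and GPU backends.

```cpp
#include <vector>

// Minimal CSR container: rowPtr has nRows + 1 entries, and the
// nonzeros of row i occupy positions [rowPtr[i], rowPtr[i+1]) of
// colIdx/value.
struct CsrMatrix {
    int nRows;
    std::vector<int> rowPtr;    // size nRows + 1
    std::vector<int> colIdx;    // size nnz
    std::vector<double> value;  // size nnz
};

// y = A * x. Each row is independent, so the outer loop is
// embarrassingly parallel -- the property the portable SpMV backends
// exploit.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.nRows, 0.0);
    for (int i = 0; i < A.nRows; ++i) {
        double sum = 0.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            sum += A.value[k] * x[A.colIdx[k]];
        y[i] = sum;
    }
    return y;
}
```

Formats like SELL-C reorder and pad these rows into fixed-size slices
so that adjacent GPU threads read contiguous memory, which is where
their advantage over plain CSR comes from.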

So the overall conclusion is that Kokkos-based libraries could deliver
a better solution if we can redraft some of the guts of MOAB. I have
several thoughts on this but have not had enough time to write them
down. Soon. All thoughts and comments are welcome.

Vijay

[1] ArborX: A Performance Portable Geometric Search Library:
https://doi.org/10.1145/3412558
[2] Anzt, Hartwig, Stanimire Tomov, and Jack Dongarra. "Implementing a
Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA
GPUs." University of Tennessee, Tech. Rep. ut-eecs-14-727 (2014).
[3] Anzt, H., Cojean, T., Chen, Y.C., Flegar, G., Göbel, F.,
Grützmacher, T., Nayak, P., Ribizel, T. and Tsai, Y.H., 2020. Ginkgo:
A high performance numerical linear algebra library. Journal of Open
Source Software, 5(52), p.2260.

On Thu, Nov 4, 2021 at 10:45 AM Tim Tautges <ttautges at divergent3d.com> wrote:
>
> Hi gang,
>   I'm interested in hearing more about the progress made at this last summer's GPU hackathon. Did you accomplish anything regarding general MOAB architecture-level capabilities for doing stuff down on the GPU?
>
> Thanks.
>
> - tim