[MOAB-dev] GPU progress?

Fri Nov 5 09:42:16 CDT 2021

Thanks for the note, Vijay. A couple of thoughts below:

1) How to best perform point-in-element type searches, as this is
fundamental for all of our mesh intersection computations that drive
the solution remapping process. After some experiments, we gave up the
idea of making AdaptiveKdTree in MOAB GPU compatible. It
required too much effort and we didn't have the time. So we chose to
write a mini-app with ArborX [1], which is built on top of a
performance-portable Kokkos backend to support both OpenMP and CUDA.
The results were quite interesting. Even with OpenMP on a single node,
compared against the MOAB Kd-tree implementation with MPI, ArborX computed
element queries orders of magnitude faster for the sample problems tested.
On GPUs, the results were of course even better. So our conclusion out
of that mini-app was to start switching to ArborX as the query data
structure in our next code iteration.
Did you do any profiling of the kdtree code in MOAB? I did, and quickly concluded that storing/accessing it through the normal MOAB set/tag mechanism was responsible for that speed problem (which I also observed). A simple workaround would be to grab some in-memory representation of a kdtree and use that instead. It would be simple to implement code to move a MOAB-stored kdtree object into and out of that representation. That was my first thought on solving that problem, anyway (but that would itself probably take a week of effort).
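
To make the "in-memory representation" idea concrete, here is a rough sketch of a kd-tree over element bounding boxes whose nodes live in plain flat arrays rather than in sets and tags. It is only an illustration under my own assumptions: the names build and candidates are hypothetical, nothing here is MOAB or ArborX API, and the final exact point-in-element test is left out (the query only returns elements whose bounding box contains the point).

```python
# Hypothetical flat kd-tree sketch; not MOAB or ArborX API.
# Each node is one index into parallel lists, so the whole tree is a
# handful of contiguous arrays that are cheap to traverse or copy.

def build(boxes, leaf_size=2, max_depth=16):
    """boxes: non-empty list of (lo, hi) pairs, each a coordinate tuple."""
    tree = {"dim": [], "val": [], "left": [], "right": [], "items": []}

    def add_node(ids, depth):
        node = len(tree["dim"])
        for key in tree:
            tree[key].append(None)
        dim = depth % len(boxes[0][0])  # cycle through the axes
        centers = sorted((boxes[i][0][dim] + boxes[i][1][dim]) / 2 for i in ids)
        val = centers[len(centers) // 2]  # median split plane
        # Boxes straddling the plane are kept on both sides.
        lo_ids = [i for i in ids if boxes[i][0][dim] <= val]
        hi_ids = [i for i in ids if boxes[i][1][dim] >= val]
        if (len(ids) <= leaf_size or depth >= max_depth
                or len(lo_ids) == len(ids) or len(hi_ids) == len(ids)):
            tree["items"][node] = list(ids)  # leaf: store element ids
            return node
        tree["dim"][node], tree["val"][node] = dim, val
        tree["left"][node] = add_node(lo_ids, depth + 1)
        tree["right"][node] = add_node(hi_ids, depth + 1)
        return node

    add_node(list(range(len(boxes))), 0)
    return tree


def candidates(tree, boxes, point):
    """Return ids of elements whose bounding box contains the point."""
    node = 0
    while tree["items"][node] is None:  # descend until a leaf is reached
        dim, val = tree["dim"][node], tree["val"][node]
        node = tree["left"][node] if point[dim] <= val else tree["right"][node]
    return [i for i in tree["items"][node]
            if all(lo <= p <= hi
                   for lo, p, hi in zip(boxes[i][0], point, boxes[i][1]))]
```

With a layout like this, converting to and from a stored tree would amount to reading and writing five arrays, which is the part I would expect to be straightforward.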

2) How to best squeeze out optimal performance for linear SpMV related
to remap operators for projecting the solution between components.
This is somewhat of a "solved" or at least much more researched
problem. There are many ways to store a sparse operator, the
traditional being CSR (or CSC). You could also use COO (row,col,value)
with slightly increased memory requirements and other interesting
approaches like SELL-C [2] etc. We currently use Eigen3 on each task
to perform SpMV when we need to remap a solution. Eigen3 also supports
OpenMP parallelism, but the speedup isn't significant in my experiments.
Our mini-app used Eigen3, Kokkos-Kernels (CSR), and Ginkgo [3] (CSR,
COO, SELL-C etc). Ginkgo actually showed the best performance
initially, but after some optimizations, we were able to get
Kokkos-Kernels on par as well, which again works the same on
multi-core and GPU architectures.
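
For readers less familiar with the storage formats named in point 2, the following is a minimal pure-Python sketch of COO and CSR and the SpMV loop for each. It assumes the remap is applied as target = W * source, with one row of W per target degree of freedom. The function names are illustrative only and are not taken from Eigen3, Kokkos-Kernels, or Ginkgo.

```python
# Illustrative only: plain-Python COO and CSR SpMV, no library API implied.

def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A stored as (row, col, value) triplets."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert triplets to CSR: row_ptr (n_rows + 1), col_idx and data (nnz)."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:                  # count the entries in each row
        row_ptr[r + 1] += 1
    for r in range(n_rows):         # prefix sum gives the row start offsets
        row_ptr[r + 1] += row_ptr[r]
    next_slot = row_ptr[:-1]        # slice copy: next free slot per row
    col_idx = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k], data[k] = c, v
        next_slot[r] += 1
    return row_ptr, col_idx, data


def spmv_csr(row_ptr, col_idx, data, x):
    """y = A @ x with A in CSR; each row is an independent dot product."""
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(data[k] * x[col_idx[k]]
                     for k in range(row_ptr[r], row_ptr[r + 1])))
    return y
```

COO keeps two index arrays of length nnz, while CSR replaces the row array with n_rows + 1 offsets, which is where the slightly higher memory use of COO comes from. The CSR loop writes each output row exactly once, so rows can be processed in parallel without atomic updates.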
Are you implementing the point-in-element search as spmv, or conservation/normalization, or both? In both cases, you're stuck back at the local discretization issue and whether/how to make that as generic/flexible as possible while preserving speed. Just curious how you tackled that.

So the overall conclusion is that Kokkos derivatives could deliver a
better solution if we can redraft some of the guts of MOAB. I have
several thoughts on this but have not had enough time to write them
down. Soon. All thoughts and comments are welcome.

I looked at Kokkos and came away wondering whether all the extra abstractions concerning views would be worth the trouble. About the time Kokkos was being designed we were also working on parallel abstractions to design into iMeshP. There, we chose to differentiate parallelism between processors and processes. In the end, I think that was a big mistake, as those two types of models had far-reaching influence on the memory model under parallel mesh handling. In effect, those two forms of parallelism never really interacted, so trying to solve them together in the same parallel mesh interface introduced all sorts of complexity that never solved any real problem. I wonder whether the views concept in Kokkos is the same sort of issue: lots of added complexity in order to solve a much simpler problem (handling a combination of shared and local memory and SIMD-parallel code blocks on threads and GPUs) that could be solved more simply.

- tim