[MOAB-dev] GPU progress?

Robert Jacob jacob at anl.gov
Fri Nov 5 16:45:38 CDT 2021


For an alternative to Kokkos, you should look at YAKL:
https://github.com/mrnorman/YAKL

It's being used in E3SM in addition to Kokkos.  The radiation scheme was 
converted to C++ with YAKL.
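
A minimal flavor of YAKL, for comparison (sketched from memory, so
treat the exact names as approximate; see the repo for the current
API):

  #include "YAKL.h"

  int main() {
    yakl::init();
    {
      int n = 1024;
      // Device-resident array; "x" is a debug label, C-style indexing.
      yakl::Array<double,1,yakl::memDevice,yakl::styleC> x("x",n);
      // Runs on CUDA, HIP, or serial CPU depending on the build config.
      yakl::c::parallel_for( yakl::c::Bounds<1>(n) ,
                             YAKL_LAMBDA (int i) { x(i) = 2.0 * i; } );
      yakl::fence();
    }
    yakl::finalize();
    return 0;
  }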

Rob



On 11/5/21 9:42 AM, Tim Tautges wrote:
> Thanks for the note, Vijay. A couple of thoughts below:
> 
> 1) How best to perform point-in-element searches, since these are
> fundamental to all of our mesh intersection computations, which drive
> the solution remapping process. After some experiments, we gave up on
> making AdaptiveKdTree in MOAB GPU-compatible. It required too much
> effort and we didn't have the time. So we chose to write a mini-app
> with ArborX [1], which is built on top of a performance-portable
> Kokkos backend to support both OpenMP and CUDA. The results were
> quite interesting. Even with OpenMP on a single node, compared
> against the MOAB kd-tree implementation with MPI, ArborX computed
> element queries orders of magnitude faster on the sample problems we
> tested. On GPUs, the results were of course even better. Our
> conclusion from that mini-app was to switch to ArborX as the query
> data structure in our next code iteration.
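> 
> Roughly, the ArborX usage pattern looks like this (a sketch assuming
> the ArborX 1.x BVH interface; the box/query setup and function name
> are placeholders):
> 
>   #include <ArborX.hpp>
>   #include <Kokkos_Core.hpp>
> 
>   using ExecSpace = Kokkos::DefaultExecutionSpace;
>   using MemSpace  = ExecSpace::memory_space;
>   using Query     = decltype(ArborX::intersects(ArborX::Point{}));
> 
>   // Returns, in CRS form, the candidate elements for each query point:
>   // candidates for point q are indices(offsets(q)) .. indices(offsets(q+1)-1).
>   void candidateElements(Kokkos::View<ArborX::Box*, MemSpace> elemBoxes,
>                          Kokkos::View<Query*, MemSpace> pointQueries,
>                          Kokkos::View<int*, MemSpace> &indices,
>                          Kokkos::View<int*, MemSpace> &offsets) {
>     ExecSpace space;
>     ArborX::BVH<MemSpace> bvh(space, elemBoxes);  // build tree on device
>     bvh.query(space, pointQueries, indices, offsets);
>     // An exact point-in-element test still has to filter these
>     // box-level candidates afterwards.
>   }
> 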
> Did you do any profiling of the kd-tree code in MOAB? I did, and
> quickly concluded that storing/accessing it through the normal MOAB
> set/tag mechanism was responsible for the speed problem (which I also
> observed). A simple workaround would be to grab some in-memory
> representation of a kd-tree and use that instead. It would be
> straightforward to write code that moves a MOAB-stored kd-tree object
> into and out of that representation. That was my first thought on
> solving the problem, anyway (though it would itself probably take a
> week of effort).
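> 
> Something like the following flat layout is what I mean (hypothetical
> struct names; the MOAB <-> flat conversion code is the week-of-effort
> part):
> 
>   #include <vector>
> 
>   // Flat, pointer-free kd-tree: all nodes in one contiguous array,
>   // children addressed by index, so traversal never touches the MOAB
>   // set/tag machinery.
>   struct KdNode {
>     double split;        // splitting-plane coordinate
>     int    dim;          // splitting dimension 0/1/2, or -1 for a leaf
>     int    left, right;  // child indices (interior nodes only)
>     int    first, count; // leaf's range into the elements array
>   };
> 
>   struct FlatKdTree {
>     std::vector<KdNode> nodes;     // nodes[0] is the root
>     std::vector<long>   elements;  // element handles, grouped by leaf
>   };
> 
>   // Descend to the leaf whose region contains point p.
>   inline const KdNode &findLeaf(const FlatKdTree &t, const double p[3]) {
>     int n = 0;
>     while (t.nodes[n].dim >= 0)
>       n = (p[t.nodes[n].dim] < t.nodes[n].split) ? t.nodes[n].left
>                                                  : t.nodes[n].right;
>     return t.nodes[n];
>   }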
> 
> 2) How best to squeeze optimal performance out of the linear SpMV
> behind the remap operators that project the solution between
> components. This is somewhat of a "solved", or at least much more
> thoroughly researched, problem. There are many ways to store a sparse
> operator, the traditional ones being CSR (or CSC). You could also use
> COO (row, col, value) with slightly increased memory requirements, or
> other interesting approaches like SELL-C [2]. We currently use Eigen3
> on each task to perform the SpMV when we need to remap a solution.
> Eigen3 also supports OpenMP parallelism, but the speedup wasn't
> significant in my experiments. Our mini-app used Eigen3,
> Kokkos-Kernels (CSR), and Ginkgo [3] (CSR, COO, SELL-C, etc.). Ginkgo
> actually showed the best performance initially, but after some
> optimizations we were able to get Kokkos-Kernels on par as well,
> which again works the same on multi-core and GPU architectures.
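> 
> As a concrete reference point, the per-task Eigen3 path amounts to
> something like this (a sketch; the COO triplets stand in for the real
> remap weights):
> 
>   #include <Eigen/Dense>
>   #include <Eigen/Sparse>
>   #include <vector>
> 
>   // Assemble the remap operator W (n_tgt x n_src) from COO-style
>   // (row, col, value) triplets; RowMajor storage makes it CSR-like.
>   Eigen::SparseMatrix<double, Eigen::RowMajor>
>   buildRemapOperator(int n_tgt, int n_src,
>                      const std::vector<Eigen::Triplet<double>> &coo) {
>     Eigen::SparseMatrix<double, Eigen::RowMajor> W(n_tgt, n_src);
>     W.setFromTriplets(coo.begin(), coo.end());  // compresses to CSR form
>     return W;
>   }
> 
>   // The remap itself is then a single SpMV: tgt = W * src.
>   // (Eigen can multithread this product with OpenMP if enabled.)
>   Eigen::VectorXd applyRemap(
>       const Eigen::SparseMatrix<double, Eigen::RowMajor> &W,
>       const Eigen::VectorXd &src) {
>     return W * src;
>   }
> 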
> Are you implementing the point-in-element search as SpMV, or the
> conservation/normalization, or both? In both cases you're stuck back
> at the local discretization issue, and whether/how to make that as
> generic/flexible as possible while preserving speed. Just curious how
> you tackled that.
> 
> So the overall conclusion is that Kokkos derivatives could deliver a
> better solution if we can rework some of the guts of MOAB. I have
> several thoughts on this but have not had enough time to write them
> down. Soon. All thoughts and comments are welcome.
> 
> I looked at Kokkos and came away wondering whether all the extra
> abstractions concerning views would be worth the trouble. Around the
> time Kokkos was being designed, we were also working on parallel
> abstractions to design into iMeshP. There, we chose to differentiate
> parallelism between processors and processes. In the end, I think
> that was a big mistake, as those two types of models had far-reaching
> influence on the memory model underneath parallel mesh handling. In
> effect, those two forms of parallelism never really interacted, so
> trying to address them together in the same parallel mesh interface
> introduced all sorts of complexity that never solved any real
> problem. I wonder whether the views concept in Kokkos is the same
> sort of issue: lots of added complexity in order to solve a much
> simpler problem (handling a combination of shared and local memory
> and SIMD-parallel code blocks on threads and GPUs) that could be
> solved more simply.
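> 
> For anyone who hasn't seen it, a view at its simplest (a minimal
> Kokkos sketch):
> 
>   #include <Kokkos_Core.hpp>
> 
>   int main(int argc, char *argv[]) {
>     Kokkos::initialize(argc, argv);
>     {
>       const int n = 1024;
>       // A View is a multidimensional array whose memory space and
>       // data layout are template parameters; defaults follow the
>       // enabled backend (host for serial/OpenMP, device for CUDA).
>       Kokkos::View<double *> x("x", n);
>       Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
>         x(i) = 2.0 * i;
>       });
>       Kokkos::fence();
>     }
>     Kokkos::finalize();
>     return 0;
>   }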
> 
> - tim
> 

