[MOAB-dev] GPU progress?

Vijay S. Mahadevan vijay.m at gmail.com
Fri Nov 5 23:09:51 CDT 2021


Thanks for the comments Rob and Mark. We absolutely understand that
there are far too many programming-model choices here for augmenting
MOAB's MPI capabilities with portable on-node code paradigms. That
was one of the reasons we were delaying an actual buy-in to any one
model. But we now prefer Kokkos and its derivative libraries because
it seems to be well supported (at least currently, even if that
wasn't the case before) on all LCF machines, and our experiments even
on workstations and laptops have yielded good speedups. I still think
optimal configuration of the TPLs could be a bottleneck, but that is
probably true of almost all programming models except maybe OpenMP. I
am somewhat familiar with RAJA, Legion, Thrust and Argobots but have
not played with the other abstractions.

Tim, some specific comments below.

> Did you do any profiling of the kdtree code in MOAB? I did and quickly concluded that storing/accessing it through the normal MOAB set/tag mechanism was responsible for that speed problem (which I also observed). A simple workaround would be to grab some in-memory representation of a kdtree and use that instead. It would be simple to implement code to take a MOAB-stored kdtree object into and out of that representation. That was my first thought on solving that problem, anyway (but that would itself prob take a week of effort).

Tag access is one of the bottlenecks, yes. We could cache the
underlying data in the class as needed to simplify the
implementation, but that still wouldn't get us all the way to
multicore- and GPU-compatible code. That is decidedly at least a
couple of months of effort to implement, profile and optimize fully.
On the other hand, ArborX has shown good speedups with its
out-of-the-box setup, with easy extension to hybrid architectures. I
have already compared our kd-tree with nanoflann and found that the
nearest-neighbor queries used for point-in-element searches were
faster with the latter (the example code is still in some outdated
branch). ArborX, according to the paper I linked earlier, is supposed
to be faster than nanoflann (it's one of their comparison
benchmarks). So for the hackathon, this definitely made the most
sense. Iulian has been actively looking into their code as well to
see how best to integrate it with MOAB.
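
To make concrete what Tim's "grab an in-memory representation" idea
could look like, here is a minimal sketch (the names and structure
are hypothetical illustration, not MOAB code): copy the coordinates
out of the tag storage once into a flat array, then build and query a
simple implicit kd-tree on that contiguous copy, so the hot query
path never touches the set/tag machinery.

```cpp
#include <algorithm>
#include <array>
#include <limits>
#include <vector>

// Hypothetical sketch: cache point coordinates from MOAB tag storage
// into a contiguous array, then run an implicit kd-tree on the copy.
using Point = std::array<double, 3>;

struct FlatKdTree {
  std::vector<Point> pts;  // contiguous in-memory copy of the tag data
  std::vector<int> order;  // pts indices arranged as an implicit kd-tree

  explicit FlatKdTree(std::vector<Point> p) : pts(std::move(p)) {
    order.resize(pts.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    build(0, (int)order.size(), 0);
  }

  // Median-split along cycling axes; the median sits at the midpoint.
  void build(int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    std::nth_element(order.begin() + lo, order.begin() + mid,
                     order.begin() + hi,
                     [&](int a, int b) { return pts[a][axis] < pts[b][axis]; });
    build(lo, mid, (axis + 1) % 3);
    build(mid + 1, hi, (axis + 1) % 3);
  }

  static double dist2(const Point& a, const Point& b) {
    double s = 0;
    for (int d = 0; d < 3; ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return s;
  }

  // Nearest-neighbor query: returns an index into pts.
  int nearest(const Point& q) const {
    int best = -1;
    double bestD = std::numeric_limits<double>::max();
    search(q, 0, (int)order.size(), 0, best, bestD);
    return best;
  }

  void search(const Point& q, int lo, int hi, int axis,
              int& best, double& bestD) const {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    double d = dist2(q, pts[order[mid]]);
    if (d < bestD) { bestD = d; best = order[mid]; }
    double delta = q[axis] - pts[order[mid]][axis];
    int nearLo = delta < 0 ? lo : mid + 1, nearHi = delta < 0 ? mid : hi;
    int farLo  = delta < 0 ? mid + 1 : lo, farHi  = delta < 0 ? hi : mid;
    search(q, nearLo, nearHi, (axis + 1) % 3, best, bestD);
    if (delta * delta < bestD)  // a closer point may sit across the plane
      search(q, farLo, farHi, (axis + 1) % 3, best, bestD);
  }
};
```

The same flat layout is also roughly what ArborX wants as input, so a
one-time copy like this is a natural integration point either way.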

> Are you implementing the point-in-element search as spmv, or conservation/normalization, or both? In both cases, you're stuck back at the local discretization issue and whether/how to make that as generic/flexible as possible while preserving speed. Just curious how you tackled that.

No, the point-in-element searches are still based on mesh queries.
However, the actual projection we perform to transfer solution data
involves computing a linear operator that applies the transformation.
All consistency and conservation constraints are computed and added
as part of this operator, which is then either stored to disk when
the process is run through the standalone MOAB tool, or computed
online with E3SM-MOAB and used directly at runtime.
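
Once the operator is built, applying it at runtime is just a sparse
matrix-vector product. A minimal sketch of that application step (the
struct and function names are hypothetical, not our actual code):
consistency means each row of the weight matrix sums to 1, so a
constant source field is reproduced exactly on the target mesh.

```cpp
#include <vector>

// Hypothetical sketch of applying a stored remap operator R in CSR
// form: target = R * source.  Consistency (rows summing to 1) makes
// constant fields map to the same constant.
struct CsrMatrix {
  std::vector<int> rowPtr;   // size nRows + 1
  std::vector<int> col;      // source column index per nonzero
  std::vector<double> val;   // remap weight per nonzero
};

std::vector<double> applyRemap(const CsrMatrix& R,
                               const std::vector<double>& src) {
  std::vector<double> tgt(R.rowPtr.size() - 1, 0.0);
  for (std::size_t i = 0; i + 1 < R.rowPtr.size(); ++i)
    for (int k = R.rowPtr[i]; k < R.rowPtr[i + 1]; ++k)
      tgt[i] += R.val[k] * src[R.col[k]];
  return tgt;
}
```

This is also why the GPU question for remap is mostly a question of
fast SpMV plus fast search, rather than anything discretization-heavy
at runtime.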

> I looked at Kokkos and came away wondering whether all the extra abstractions concerning views would be worth the trouble.

My initial reaction was similar. But after writing a couple of
mini-apps and playing with it enough, I find the abstraction has its
advantages. RAJA has a similar concept of views as well, by the way,
so the paradigm has already been tried out in many settings. There
are still a couple of quirks that took some time to wrap my head
around, but that is probably to be expected with any new programming
model.
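
The core of what a view buys you can be shown without Kokkos at all.
This is a plain-C++ caricature (not the Kokkos API; all names here
are made up for illustration): the kernel indexes a(i,j) and a layout
policy decides the memory order, so the same loop body can target
row-major storage (CPU cache-friendly) or column-major storage
(GPU-coalescing-friendly) without being rewritten.

```cpp
#include <cstddef>
#include <vector>

// Illustration only: a layout-parameterized 2D "view", mimicking the
// idea behind Kokkos::View's LayoutRight/LayoutLeft.
struct LayoutRight {  // row-major
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t /*ni*/, std::size_t nj) {
    return i * nj + j;
  }
};
struct LayoutLeft {   // column-major
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t ni, std::size_t /*nj*/) {
    return j * ni + i;
  }
};

template <class Layout>
struct View2D {
  std::vector<double> data;
  std::size_t ni, nj;
  View2D(std::size_t ni_, std::size_t nj_)
      : data(ni_ * nj_, 0.0), ni(ni_), nj(nj_) {}
  double& operator()(std::size_t i, std::size_t j) {
    return data[Layout::index(i, j, ni, nj)];
  }
};

// One kernel body, any layout: fill a(i,j) = i * 10 + j.
template <class Layout>
void fill(View2D<Layout>& a) {
  for (std::size_t i = 0; i < a.ni; ++i)
    for (std::size_t j = 0; j < a.nj; ++j)
      a(i, j) = double(i * 10 + j);
}
```

Kokkos layers execution spaces, memory spaces and parallel dispatch
on top of this, which is where the extra complexity Tim mentions
comes from.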

> I wonder whether the views concept in Kokkos is the same sort of issue - lots of added complexity, in order to solve a much simpler problem (handling a combination of shared and local memory and SIMD-parallel code blocks on threads and GPUs) that could be solved more simply.

There are definitely other approaches to this besides the concept of
views. OCCA uses higher-level code transformations on kernels to
translate them to architecture-specific code. It was another
contender in my experiments, given my familiarity with it from the
CESAR days. I know it is being actively developed for some exascale
apps, but I have not looked at it in depth recently.

At the end of the day, transforming a behemoth code like MOAB
directly into any one model doesn't make much sense, I think. I want
to start a broader discussion on how best to rewrite just the guts of
MOAB to make it portable across architectures, which would let us run
higher-level mesh querying/manipulation algorithms on either the host
or the device. But it's really a question of the time and effort
needed to get that done.

Best,
Vijay

On Fri, Nov 5, 2021 at 8:17 PM Miller, Mark C. <miller86 at llnl.gov> wrote:
>
> FWIW, while I have little experience with many of these technologies (Kokkos, Charm++, Legion, VTK-m, Thrust, etc.) my impression is that RAJA (https://github.com/LLNL/RAJA) is designed with minimizing complexity in mind and with an emphasis on what you are suggesting here...
>
> handling a combination of shared and local memory and SIMD-parallel code blocks on threads and GPUs)


More information about the moab-dev mailing list