[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Tue Oct 9 10:21:32 CDT 2012
Hi guys,
as our discussion of memory is drifting more and more towards
runtime and scheduling aspects, I'll try to wrap up the key points of
the memory part of the discussion and postpone all runtime/execution
aspects to 'Part 2' of the series.
* The proposed unification of memory handles (CPU and GPU) within *data
of Vec did not find any support; rather, the GPU handles should remain
in GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet
clear whether we want to stick with library-specific names such as
Vec_CUSP, or whether we want to go with runtime-specific names such as
Vec_CUDA and Vec_OpenCL and probably dispatch into library-specific
routines from there. Jed pointed out that Vec_OpenCL is probably too
fuzzy, suggesting that Vec_LIBRARYNAME is the better option.
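
To make the layout in this point concrete, here is a minimal sketch of
keeping the host buffer and the GPU handle separate. All names
(SketchVec, SketchGPUArray, sketch_vec_*) are hypothetical and only
illustrate the idea; they are not the actual PETSc structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical library-specific GPU handle, standing in for GPUarray. */
typedef struct {
  void  *device_ptr; /* opaque device buffer owned by the GPU library */
  size_t n;          /* number of entries on the device */
} SketchGPUArray;

/* Hypothetical host-side vector: data holds CPU memory only,
 * the GPU handle lives in a separate, library-specific member. */
typedef struct {
  double         *data; /* host buffer */
  size_t          n;    /* number of entries on the host */
  SketchGPUArray *gpu;  /* NULL for a pure CPU vector */
} SketchVec;

/* Creates a CPU-only vector with zeroed entries; returns 0 on success. */
int sketch_vec_create(size_t n, SketchVec *v) {
  assert(v != NULL);
  v->data = calloc(n, sizeof(double));
  if (v->data == NULL && n > 0) return 1;
  v->n   = n;
  v->gpu = NULL;
  return 0;
}

/* Reports whether a GPU handle is attached to the vector. */
int sketch_vec_has_gpu(const SketchVec *v) {
  return v->gpu != NULL;
}

/* Releases the host buffer; the GPU handle is assumed to be freed
 * by the library-specific code that created it. */
void sketch_vec_destroy(SketchVec *v) {
  free(v->data);
  v->data = NULL;
  v->n    = 0;
  v->gpu  = NULL;
}
```

With such a split, a library-specific type in the spirit of
Vec_LIBRARYNAME would only have to define what sits behind the gpu
member, while the host part stays untouched.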
* Barry backs my suggestion to have multi-GPU support for a single
process, whereas Jed and Matt suggest mapping one GPU to one MPI process
for reasons of simplicity. As the usual application of multi-GPU is
within sparse matrix-vector products and block-based preconditioners, I
note the following:
- Such implementations are basically available out-of-the-box with MPI.
- According to the manual, block-based preconditioners can also be
configured on a per-process basis, which allows the individual
streaming processors on a GPU to be used efficiently (there is no native
synchronization possible between streaming processors within a single
kernel!).
- The current multi-GPU support using txpetscgpu focuses on sparse
matrix-vector products only (there are some hints in
src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU
preconditioners on GPUs may also be available, yet I haven't found any
actual code/kernels for that).
Consequently, from the available functionality it seems that we can live
with a one-GPU-per-process option.
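
A minimal sketch of how a one-GPU-per-process mapping could be chosen,
assuming each process knows its node-local rank and the number of GPUs
on its node. The helper name sketch_device_for_rank is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: picks the GPU for a process from its node-local
 * rank. Assumes one process per GPU is launched on each node; if there
 * are more processes than devices, they wrap around and share a GPU. */
int sketch_device_for_rank(int local_rank, int num_devices) {
  assert(local_rank >= 0 && num_devices > 0);
  return local_rank % num_devices;
}
```

In a CUDA setting, the device count would typically come from
cudaGetDeviceCount() and the result would be passed to cudaSetDevice();
how the node-local rank is obtained is left open here.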
* Adding a bit of meta information to arrays in main RAM (without
splitting up the actual buffer) for increased cache-awareness requires a
demonstration of significant performance benefits before it is
considered any further.
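
For illustration only, one way such meta information could look is a
table of chunk boundaries stored next to a single contiguous buffer.
This is a sketch under my own assumptions (equal-sized chunks, a fixed
maximum number of chunks); the names are hypothetical and nothing like
this exists in the code base.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CHUNKS 8

/* Hypothetical descriptor: one unsplit buffer plus chunk boundaries. */
typedef struct {
  double *data;                         /* single contiguous buffer */
  size_t  n;                            /* total number of entries */
  size_t  nchunks;                      /* number of logical chunks */
  size_t  start[SKETCH_MAX_CHUNKS + 1]; /* chunk c is [start[c], start[c+1]) */
} SketchChunkedArray;

/* Attaches chunk boundaries to an existing buffer without copying it. */
void sketch_set_chunks(SketchChunkedArray *a, double *data,
                       size_t n, size_t nchunks) {
  assert(a != NULL && nchunks > 0 && nchunks <= SKETCH_MAX_CHUNKS);
  a->data    = data;
  a->n       = n;
  a->nchunks = nchunks;
  for (size_t c = 0; c <= nchunks; ++c) a->start[c] = (c * n) / nchunks;
}

/* Returns the index of the chunk that contains entry i. */
size_t sketch_chunk_of(const SketchChunkedArray *a, size_t i) {
  assert(i < a->n);
  size_t c = 0;
  while (a->start[c + 1] <= i) ++c;
  return c;
}
```

A thread could then restrict itself to the entries of its own chunk,
while the buffer itself remains a plain array for everybody else.
Whether this pays off is exactly what would have to be demonstrated.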
If my wrap-up missed some part of the discussion, please let me/us know.
I'll now move on to the actual runtime and come up with more concrete
ideas in 'Part 2' :-)
Best regards,
Karli