[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Tue Oct 9 10:21:32 CDT 2012

Hi guys,

as our discussion of memory is drifting more and more towards 
runtime and scheduling aspects, I'll try to wrap up the key points of 
the memory part of the discussion and postpone all runtime/execution 
aspects to 'Part 2' of the series.

* The proposed unification of memory handles (CPU and GPU) within *data 
of Vec did not find any support; instead, the GPU handles should remain 
in GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet 
clear whether we want to stick with library-specific names such as 
Vec_CUSP, or whether we want to go with runtime-specific names such as 
Vec_CUDA and Vec_OpenCL and probably dispatch into library-specific 
routines from there. Jed pointed out that Vec_OpenCL is probably too 
fuzzy, suggesting that Vec_LIBRARYNAME is the better option.

* Barry backs my suggestion to have multi-GPU support for a single 
process, whereas Jed and Matt suggest mapping one GPU to one MPI process 
for reasons of simplicity. As the usual application of multi-GPU is 
within sparse matrix-vector products and block-based preconditioners, I 
note the following:
  - Such implementations are basically available out-of-the-box with MPI.
  - According to the manual, block-based preconditioners can also be 
configured on a per-process basis, which makes it possible to use the 
individual streaming processors on a GPU efficiently (there is no native 
synchronization possible between streaming processors within a single 
kernel!).
  - The current multi-GPU support using txpetscgpu focuses on sparse 
matrix-vector products only (there are some hints in 
src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU 
preconditioners on GPUs may also be available, yet I haven't found any 
actual code/kernels for that).
Consequently, from the available functionality it seems that we can live 
with a one-GPU-per-process option.

* Adding a bit of meta information to arrays in main RAM (without 
splitting up the actual buffer) for increased cache-awareness will only 
be considered further once significant performance benefits have been 
demonstrated.
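
To make the first point a bit more concrete, here is a minimal sketch of 
what "keep the GPU handle out of *data" could look like. All names 
(SketchVec, SketchGPUArray, the sketch_* functions) are hypothetical and 
only illustrate the separation; this is not the actual PETSc layout, and 
the library-specific part is reduced to an opaque pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is NOT the PETSc data layout. */

/* Which copy of the data is currently up to date. */
typedef enum {
    SKETCH_VALID_NONE,    /* nothing allocated yet */
    SKETCH_VALID_HOST,    /* host copy is current, device copy (if any) is stale */
    SKETCH_VALID_DEVICE,  /* device copy is current, host copy is stale */
    SKETCH_VALID_BOTH     /* both copies agree */
} SketchValid;

/* Library-specific part: an opaque handle owned by the GPU library. */
typedef struct {
    void *device_buffer;  /* e.g. a CUDA pointer or an OpenCL memory object */
} SketchGPUArray;

typedef struct {
    double         *data;      /* host array, stays where it always was */
    SketchGPUArray *gpuarray;  /* kept separately; NULL for CPU-only vectors */
    size_t          n;
    SketchValid     valid;
} SketchVec;

/* Create a zero-initialized, host-only vector. Returns 0 on success. */
static int sketch_vec_create(SketchVec *v, size_t n)
{
    assert(v != NULL);
    v->data = (double *)calloc(n ? n : 1, sizeof(double));
    if (v->data == NULL) return 1;
    v->gpuarray = NULL;
    v->n        = n;
    v->valid    = SKETCH_VALID_HOST;
    return 0;
}

/* Record a write to the host array: any device copy becomes stale. */
static void sketch_vec_host_written(SketchVec *v)
{
    assert(v != NULL);
    v->valid = SKETCH_VALID_HOST;
}

/* Nonzero if a host-to-device copy is needed before launching a kernel. */
static int sketch_vec_needs_upload(const SketchVec *v)
{
    assert(v != NULL);
    return v->gpuarray != NULL && v->valid == SKETCH_VALID_HOST;
}

/* Free the host array; the device handle belongs to the GPU library. */
static void sketch_vec_destroy(SketchVec *v)
{
    assert(v != NULL);
    free(v->data);
    v->data  = NULL;
    v->n     = 0;
    v->valid = SKETCH_VALID_NONE;
}
```

The point of the sketch is only that CPU-only code never has to look at 
gpuarray, while a Vec_LIBRARYNAME implementation can hang whatever 
handle type it needs behind that pointer.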
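
For the one-GPU-per-process option from the second point, the mapping 
itself is tiny. The following is a sketch under the assumption that each 
process knows its rank within the node and the number of GPUs on that 
node; the function name is hypothetical, and how these two numbers are 
obtained (MPI launcher, node-local communicator, ...) is deliberately 
left out.

```c
#include <assert.h>

/* Hypothetical helper: pick the GPU for a process from its node-local rank.
 * Returns the device index, or -1 if the node has no GPU. */
int sketch_device_for_rank(int local_rank, int num_devices)
{
    assert(local_rank >= 0);
    if (num_devices <= 0) return -1;  /* no GPU: caller falls back to the CPU */
    /* Round-robin; one-to-one whenever ranks per node == GPUs per node. */
    return local_rank % num_devices;
}
```

In a CUDA build the result would then be handed to cudaSetDevice() 
before any vector is allocated. If more processes than GPUs run on a 
node, several processes share a device with this scheme, which may or 
may not be what one wants.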

If my wrap-up missed some part of the discussion, please let me/us know. 
I'll now move on to the actual runtime and come up with more concrete 
ideas in 'Part 2' :-)

Best regards,
Karli