[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Tue Oct 9 15:48:18 CDT 2012

I think the current vector class should be Vec_Thrust with

-vec_type thrust (not cusp)

First, most of the vector functions are computed from kernels in the 
Thrust library (although there may be an occasional CUSP or CUBLAS 
function call). Second, it is not clear how long CUSP is going to 
survive ... and I think Nvidia puts more energy into CUSPARSE and Thrust.

I think a Vec_CUDA would be very useful ... there is a lot you could do 
with this that you can't currently do with Thrust.

I think separating the Mat types into CUSP and CUSPARSE is sensible.

-Paul

>> Hi guys,
>>
>> as our discussion of memory is drifting more and more towards runtime and scheduling aspects, I'll try to wrap up the key points of the memory part of the discussion and postpone all runtime/execution aspects to 'Part 2' of the series.
>>
>> * The proposed unification of memory handles (CPU and GPU) within *data of Vec did not find any support; instead, the GPU handles should remain in GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet clear whether we want to stick with library-specific names such as Vec_CUSP, or whether we want to go with runtime-specific names such as Vec_CUDA and Vec_OpenCL and probably dispatch into library-specific routines from there. Jed pointed out that Vec_OpenCL is probably too fuzzy, suggesting that Vec_LIBRARYNAME is the better option.
>
>       The Vec_CUSP is most definitely built on top of CUSP and is not built around generic CUDA, hence going to Vec_CUDA from Vec_CUSP doesn't make sense to me. If we had (have? as an alternative) a Vec class that was built directly on CUDA, then it could be called Vec_CUDA. Similarly, if Vec_OpenCL is built directly on generic OpenCL then that name is fine; if it is built on top of something like ViennaCL, then Vec_ViennaCL would be the way to go.
>
>      Barry
>
>
> Paul has put in some code based on cusparse; I haven't had the energy to see how that works. Perhaps there should be a Vec_CUSparse for that.
>
>> * Barry backs my suggestion to have multi-GPU support for a single process, whereas Jed and Matt suggest mapping one GPU to one MPI process for reasons of simplicity. As the usual application of multi-GPU is within sparse matrix-vector products and block-based preconditioners, I note the following:
>> - Such implementations are basically available out-of-the-box with MPI.
>> - According to the manual, block-based preconditioners can also be configured on a per-process basis, which allows the individual streaming processors on a GPU to be used efficiently (there is no native synchronization possible between streaming processors within a single kernel!).
>> - The current multi-GPU support using txpetscgpu focuses on sparse matrix-vector products only (there are some hints in src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU preconditioners on GPUs may also be available, yet I haven't found any actual code/kernels for that).
>> Consequently, from the available functionality it seems that we can live with a one-GPU-per-process option.
>>
>> * Adding a bit of meta information to arrays in main RAM (without splitting up the actual buffer) for increased cache-awareness requires a demonstration of significant performance benefits for any further consideration.
>>
>> If my wrap-up missed some part of the discussion, please let me/us know. I'll now move on to the actual runtime and come up with more concrete ideas in 'Part 2' :-)
>>
>> Best regards,
>> Karli
>>
>>
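
For illustration only, the one-GPU-per-process option discussed in the quoted summary could be sketched as below. This is a minimal sketch under stated assumptions, not code from PETSc or txpetscgpu: the helper name device_for_local_rank is hypothetical, and it assumes each process already knows its rank among the processes on its node and the number of GPUs visible there.

```c
/* Hypothetical helper (not part of PETSc): choose a device index for a
 * process from its node-local rank and the number of GPUs on the node.
 * Processes are assigned round-robin, so with as many processes per node
 * as GPUs, each process gets its own device.
 * Returns -1 if the inputs do not describe a usable device. */
static int device_for_local_rank(int local_rank, int num_devices)
{
    if (local_rank < 0 || num_devices <= 0) {
        return -1; /* no GPU available, or invalid rank */
    }
    return local_rank % num_devices;
}
```

With the CUDA runtime, num_devices would typically come from cudaGetDeviceCount() and the result would be passed to cudaSetDevice(); the node-local rank could be obtained, for example, by splitting MPI_COMM_WORLD with MPI_Comm_split_type() and MPI_COMM_TYPE_SHARED where MPI-3 is available. If more processes than GPUs run on a node, the round-robin rule makes several processes share a device.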