[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Paul Mullowney paulm at txcorp.com
Wed Oct 10 09:47:08 CDT 2012


No problem. I have some code that will be helpful for these classes. In 
particular, in txpetscgpu I have code that figures out which data to 
transfer to/from the GPU in the parallel SpMV. It is done using CUDA 
streams, which allows the communication to be overlapped with the 
computation kernel.
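
For reference, here is a minimal sketch of that overlap pattern (this is 
not the actual txpetscgpu code; the function and kernel names and the 
buffers are placeholders, and the host buffers are assumed to be pinned 
via cudaMallocHost so the asynchronous copies can really overlap):

/* Sketch only: overlap the ghost-value transfer for the parallel SpMV
   with the purely local multiply, using two CUDA streams. */
__global__ void spmv_local_kernel(const double *Alocal, const double *x, double *y);
__global__ void spmv_offproc_kernel(const double *Aoffproc, const double *xghost, double *y);

void spmv_overlapped(const double *d_Alocal, const double *d_Aoffproc,
                     const double *d_x, double *d_y,
                     const double *d_sendbuf, double *h_sendbuf, int nsend,
                     double *d_recvbuf, double *h_recvbuf, int nrecv)
{
  cudaStream_t s_comm, s_comp;
  cudaStreamCreate(&s_comm);
  cudaStreamCreate(&s_comp);

  /* start moving the boundary values needed by neighboring processes
     to the host on the communication stream */
  cudaMemcpyAsync(h_sendbuf, d_sendbuf, nsend*sizeof(double),
                  cudaMemcpyDeviceToHost, s_comm);

  /* the purely local part of y = A*x does not depend on that transfer,
     so it runs concurrently on the compute stream */
  spmv_local_kernel<<<256, 256, 0, s_comp>>>(d_Alocal, d_x, d_y);

  /* finish the transfer, do the MPI exchange on the host, push the
     received ghost values back, then apply the off-process part */
  cudaStreamSynchronize(s_comm);
  /* ... MPI_Isend / MPI_Irecv / MPI_Waitall on h_sendbuf, h_recvbuf ... */
  cudaMemcpyAsync(d_recvbuf, h_recvbuf, nrecv*sizeof(double),
                  cudaMemcpyHostToDevice, s_comm);
  cudaStreamSynchronize(s_comm);
  spmv_offproc_kernel<<<256, 256, 0, s_comp>>>(d_Aoffproc, d_recvbuf, d_y);

  cudaStreamSynchronize(s_comp);
  cudaStreamDestroy(s_comm);
  cudaStreamDestroy(s_comp);
}

The matrix storage is elided here; PETSc's MPIAIJ format already keeps 
the local ("diagonal") and off-process ("off-diagonal") blocks separate, 
which is exactly the split the two placeholder kernels above key off.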

If you make the hierarchy as described below, it would be natural to 
move the txpetscgpu code into the Vec_CUDA class.

-Paul
> Hi Paul,
>
> thanks for the comments. I'll have a look at whether we can have an 
> intermediate layer for CUDA and OpenCL, e.g.
>  Vec_Seq -> Vec_CUDA -> Vec_Thrust.
> This should allow us to define a broader set of operations on Vec_CUDA 
> (and similarly for matrices), particularly those not covered by 
> CUSPARSE and Thrust.
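
To make that layering concrete, here is a rough sketch of what the data 
side could look like (hypothetical struct and member names, not actual 
PETSc code):

#include <petscsys.h>   /* PetscScalar, PetscBool */

/* Runtime-level container: owns the raw device memory and the
   host/device coherency flag; nothing library-specific lives here. */
typedef struct {
  PetscScalar *GPUarray;   /* raw cudaMalloc'ed device pointer */
  PetscBool    GPUvalid;   /* is the device copy up to date?   */
} Vec_CUDA_Data;

/* A Vec_Thrust (or Vec_CUSP / Vec_CUSparse) implementation would wrap
   the very same buffer, e.g. through thrust::device_ptr<PetscScalar>,
   and add only the handles its library needs.  Operations that need
   nothing beyond raw CUDA (streams, cudaMemcpyAsync, hand-written
   kernels) can then live at the Vec_CUDA level and be shared by all
   back-ends. */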
>
> Best regards,
> Karli
>
>
> On 10/09/2012 03:48 PM, Paul Mullowney wrote:
>> I think the current vector class should be Vec_Thrust with
>>
>> -vec_type thrust (not cusp)
>>
>> First, most of the vector functions are computed from kernels in the
>> Thrust library (although there may be an occasional CUSP or CUBLAS
>> function call). Second, it is not clear how long CUSP is going to
>> survive ... and I think Nvidia puts more energy into CUSPARSE and 
>> Thrust.
>>
>> I think a Vec_CUDA would be very useful ... there is a lot you could do
>> with this that you can't currently do with Thrust.
>>
>> I think separating the Mat types into CUSP and CUSPARSE is sensible.
>>
>> -Paul
>>
>>
>>
>>>> Hi guys,
>>>>
>>>> as our discussion of memory is drifting more and more towards
>>>> runtime and scheduling aspects, I'll try to wrap up the key points of
>>>> the memory part of the discussion and postpone all runtime/execution
>>>> aspects to 'Part 2' of the series.
>>>>
>>>> * The proposed unification of memory handles (CPU and GPU) within
>>>> *data of Vec did not find any support; rather, the GPU handles should
>>>> remain in GPUarray (or an equivalent for OpenCL/CUDA). However, it
>>>> is not yet clear whether we want to stick with library-specific names
>>>> such as Vec_CUSP, or whether we want to go with runtime-specific
>>>> names such as Vec_CUDA and Vec_OpenCL and probably dispatch into
>>>> library-specific routines from there. Jed pointed out that Vec_OpenCL
>>>> is probably too fuzzy, suggesting that Vec_LIBRARYNAME is the better
>>>> option.
>>>
>>>       The Vec_CUSP is most definitely built on top of CUSP and is not
>>> built around generic CUDA, hence going from Vec_CUSP to Vec_CUDA
>>> doesn't make sense to me. If we had (have? as an alternative) a Vec
>>> class that was built directly on CUDA, then it could be called
>>> Vec_CUDA. Similarly, if Vec_OpenCL is built directly on generic OpenCL
>>> then that name is fine; if it is built on top of something like
>>> ViennaCL, then Vec_ViennaCL would be the way to go.
>>>
>>>      Barry
>>>
>>>
>>> Paul has put in some code based on CUSPARSE; I haven't had the energy
>>> to see how that works. Perhaps there should be a Vec_CUSparse for that.
>>>
>>>> * Barry backs up my suggestion to have multi-GPU support within a
>>>> single process, whereas Jed and Matt suggest mapping one GPU to one
>>>> MPI process for reasons of simplicity. As the usual applications of
>>>> multi-GPU are sparse matrix-vector products and block-based
>>>> preconditioners, I note the following:
>>>> - Such implementations are basically available out-of-the-box with
>>>> MPI.
>>>> - According to the manual, block-based preconditioners can also be
>>>> configured on a per-process basis, which allows the individual
>>>> streaming multiprocessors on a GPU to be used efficiently (there is
>>>> no native synchronization possible between streaming multiprocessors
>>>> within a single kernel!).
>>>> - The current multi-GPU support in txpetscgpu focuses on sparse
>>>> matrix-vector products only (there are some hints in
>>>> src/ksp/pc/impls/factor/ilu that forward-backward substitutions for
>>>> ILU preconditioners on GPUs may also be available, yet I haven't
>>>> found any actual code/kernels for that).
>>>> Consequently, from the available functionality it seems that we can
>>>> live with a one-GPU-per-process option.
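
For what it's worth, a one-GPU-per-process mapping is easy to set up at 
startup; here is a minimal sketch (plain MPI-3 plus the CUDA runtime, 
not existing PETSc code; the function name is made up):

#include <mpi.h>
#include <cuda_runtime.h>

/* Bind each MPI rank to one GPU of its node, round-robin over the
   devices visible on that node.  Sketch only. */
static void bind_rank_to_gpu(MPI_Comm comm)
{
  MPI_Comm nodecomm;
  int      noderank, ndevices;

  /* ranks that share a node get consecutive node-local ranks */
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &nodecomm);
  MPI_Comm_rank(nodecomm, &noderank);

  cudaGetDeviceCount(&ndevices);
  cudaSetDevice(noderank % ndevices);  /* one GPU per process */

  MPI_Comm_free(&nodecomm);
}

Combined with the existing per-process block preconditioner options 
(e.g. -pc_type bjacobi -sub_pc_type ilu), each block then naturally 
lives on the GPU of the process that owns it.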
>>>>
>>>> * Adding a bit of meta information to arrays in main RAM (without
>>>> splitting up the actual buffer) for increased cache-awareness
>>>> requires a demonstration of significant performance benefits before
>>>> it gets any further consideration.
>>>>
>>>> If my wrap-up missed some part of the discussion, please let me/us
>>>> know. I'll now move on to the actual runtime and come up with more
>>>> concrete ideas in 'Part 2' :-)
>>>>
>>>> Best regards,
>>>> Karli
>>>>
>>>>
>>



