[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Karl Rupp rupp at mcs.anl.gov
Wed Oct 10 10:08:04 CDT 2012


Hi Paul,

> No problem. I have some code that will be helpful for these classes. In
> particular, in txpetscgpu I have code that figures out which data to
> transfer to/from the GPU in the parallel SpMV. It's done using CUDA
> streams, which allows the communication to be overlapped with the
> computation kernel.
>
> If you make the hierarchy as described below, it would be natural to
> move the txpetscgpu code into the Vec_CUDA class.

Great, that should lower the barrier for interfacing with other 
CUDA-based libraries significantly, and also simplify the maintenance of 
txpetscgpu.

Best regards,
Karli


>> Hi Paul,
>>
>> thanks for the comments. I'll have a look at whether we can have an
>> intermediate layer for CUDA and OpenCL, e.g.
>>   Vec_Seq -> Vec_CUDA -> Vec_Thrust.
>> This should allow us to define a broader set of operations on Vec_CUDA
>> (and similarly for matrices), particularly those not covered by
>> CUSPARSE and Thrust.
>>
>> Best regards,
>> Karli
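
To make the above more concrete, the layering I have in mind could look
roughly like the following sketch (hypothetical struct names and fields,
not the actual PETSc data structures): the generic CUDA part owns the raw
device handle, so Thrust, CUSPARSE, or txpetscgpu code can all work on the
same buffer, while the Thrust layer only adds library-specific state.

  #include <cuda_runtime.h>

  typedef struct {
    double *d_array;    /* raw device pointer, usable by any CUDA-based library */
    int     n;
    int     valid_gpu;  /* is the device copy up to date? */
  } Vec_CUDA;

  typedef struct {
    Vec_CUDA base;      /* generic CUDA part, shared with other backends */
    void *thrust_state; /* e.g. cached Thrust wrappers or library handles */
  } Vec_Thrust;

  /* operations not covered by Thrust/CUSP can be implemented once
     against the generic CUDA layer and reused by every CUDA backend */
  void VecZero_CUDA(Vec_CUDA *v)
  {
    cudaMemset(v->d_array, 0, v->n * sizeof(double));
    v->valid_gpu = 1;
  }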
>>
>>
>> On 10/09/2012 03:48 PM, Paul Mullowney wrote:
>>> I think the current vector class should be Vec_Thrust with
>>>
>>> -vec_type thrust (not cusp)
>>>
>>> First, most of the vector functions are computed from kernels in the
>>> Thrust library (although there may be an occasional CUSP or CUBLAS
>>> function call). Second, it is not clear how long CUSP is going to
>>> survive ... and I think Nvidia puts more energy into CUSPARSE and
>>> Thrust.
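
For concreteness, the kind of Thrust-backed vector operation Paul refers
to looks roughly like the following sketch (hypothetical names, not the
actual PETSc/Thrust code); Thrust generates and launches the CUDA kernel
behind the scenes.

  #include <thrust/device_vector.h>
  #include <thrust/transform.h>

  /* y = alpha*x + y expressed as a Thrust transform */
  struct axpy_op {
    double alpha;
    axpy_op(double a) : alpha(a) {}
    __host__ __device__ double operator()(double xi, double yi) const
    { return alpha * xi + yi; }
  };

  void vec_axpy(double alpha,
                const thrust::device_vector<double> &x,
                thrust::device_vector<double>       &y)
  {
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), axpy_op(alpha));
  }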
>>>
>>> I think a Vec_CUDA would be very useful ... there is a lot you could do
>>> with this that you can't currently do with Thrust.
>>>
>>> I think separating the Mat types into CUSP and CUSPARSE is sensible.
>>>
>>> -Paul
>>>
>>>
>>>
>>>>> Hi guys,
>>>>>
>>>>> as our discussion of memory is drifting more and more towards
>>>>> runtime and scheduling aspects, I'll try to wrap up the key points of
>>>>> the memory part of the discussion and postpone all runtime/execution
>>>>> aspects to 'Part 2' of the series.
>>>>>
>>>>> * The proposed unification of memory handles (CPU and GPU) within
>>>>> *data of Vec did not find any support; rather, the GPU handles should
>>>>> remain in GPUarray (or any equivalent for OpenCL/CUDA). However, it
>>>>> is not yet clear whether we want to stick with library-specific names
>>>>> such as Vec_CUSP, or whether we want to go with runtime-specific
>>>>> names such as Vec_CUDA and Vec_OpenCL and then dispatch into
>>>>> library-specific routines from there. Jed pointed out that Vec_OpenCL
>>>>> is probably too fuzzy, suggesting that Vec_LIBRARYNAME is the better
>>>>> option.
>>>>
>>>>       The Vec_CUSP is most definitely built on top of CUSP and is not
>>>> built around generic CUDA, hence going from Vec_CUSP to Vec_CUDA
>>>> doesn't make sense to me. If we had (have? as an alternative) a Vec
>>>> class that was built directly on CUDA, then it could be called
>>>> Vec_CUDA. Similarly, if Vec_OpenCL is built directly on generic OpenCL,
>>>> then that name is fine; if it is built on top of something like
>>>> ViennaCL, then Vec_ViennaCL would be the way to go.
>>>>
>>>>      Barry
>>>>
>>>>
>>>> Paul has put in some code based on CUSPARSE; I haven't had the energy
>>>> to see how that works. Perhaps there should be a Vec_CUSparse for that.
>>>>
>>>>> * Barry backs up my suggestion to have multi-GPU support for a single
>>>>> process, whereas Jed and Matt suggest mapping one GPU to one
>>>>> MPI process for reasons of simplicity. As the usual application of
>>>>> multi-GPU is within sparse matrix-vector products and block-based
>>>>> preconditioners, I note the following:
>>>>> - Such implementations are basically available out-of-the-box with
>>>>> MPI.
>>>>> - According to the manual, block-based preconditioners can also be
>>>>> configured on a per-process basis (see the example options below),
>>>>> thus allowing the individual streaming processors on a GPU to be
>>>>> used efficiently (there is no native synchronization between
>>>>> streaming processors within a single kernel!).
>>>>> - The current multi-GPU support using txpetscgpu focuses on sparse
>>>>> matrix-vector products only (there are some hints in
>>>>> src/ksp/pc/impls/factor/ilu that forward-backward substitutions for
>>>>> ILU preconditioners on GPUs may also be available, yet I haven't
>>>>> found any actual code/kernels for that).
>>>>> Consequently, from the available functionality it seems that we can
>>>>> live with a one-GPU-per-process option.
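
The per-process configuration mentioned above would be along the lines of
the usual block Jacobi options; the executable name and block counts below
are just placeholders:

  mpiexec -n 4 ./app -pc_type bjacobi -pc_bjacobi_blocks 16 \
          -sub_ksp_type preonly -sub_pc_type ilu

That is, one MPI process per GPU, with several preconditioner blocks per
process to keep the GPU busy.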
>>>>>
>>>>> * Adding a bit of metadata to arrays in main RAM (without
>>>>> splitting up the actual buffer) for increased cache awareness
>>>>> requires a demonstration of significant performance benefits before
>>>>> any further consideration.
>>>>>
>>>>> If my wrap-up missed any part of the discussion, please let me/us
>>>>> know. I'll now move on to the actual runtime aspects and come up with
>>>>> more concrete ideas in 'Part 2' :-)
>>>>>
>>>>> Best regards,
>>>>> Karli
>>>>>
>>>>>
>>>
>



