[petsc-dev] Improving and stabilizing GPU support

Steven Dalton sdalton1 at gmail.com
Sun Jul 21 09:45:51 CDT 2013


Hey Jed,

Just subscribed. Interesting thread.

It seems reasonable to provide a CUDA backend. An interface around
CUBLAS and/or CUSPARSE should be sufficient for the base functionality.
Since CUSP supports wrapping generic GPU memory into CUSP vectors, it
shouldn't be difficult to interface with the CUSP preconditioners either
way. Thrust seems most appropriate for easing the development of
nonstandard operations and/or custom data types on the GPU. I'm still
somewhat new here, so it's not clear to me how far PETSc will expand
into those areas on the GPU.
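
For what it's worth, the wrapping looks roughly like this; a minimal
sketch, assuming a raw device pointer d_x of length n that PETSc (or
anything else) already owns, with error checking omitted:

    #include <cuda_runtime.h>
    #include <thrust/device_ptr.h>
    #include <cusp/array1d.h>
    #include <cusp/blas.h>

    int main()
    {
      const int n = 100;
      double *d_x;  /* raw device memory, hypothetically owned by PETSc */
      cudaMalloc(&d_x, n * sizeof(double));
      cudaMemset(d_x, 0, n * sizeof(double));

      /* wrap the raw pointer; the view aliases the memory, no copy */
      thrust::device_ptr<double> p = thrust::device_pointer_cast(d_x);
      cusp::array1d_view< thrust::device_ptr<double> > x(p, p + n);

      double nrm = cusp::blas::nrm2(x);  /* use it like any CUSP vector */
      (void)nrm;

      cudaFree(d_x);
      return 0;
    }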

Steve 

On 7/20/13 1:40 PM, "Jed Brown" <jedbrown at mcs.anl.gov> wrote:

>Steve, are you subscribed to the petsc-dev mailing list?
>
>Karl Rupp <rupp at mcs.anl.gov> writes:
>
>> Hi Paul,
>>
>>>>> * Reduce CUSP dependency: The current elementary operations are
>>>>> mainly realized via CUSP. With better support via CUSPARSE and
>>>>> CUBLAS, I'd add a separate 'native' CUDA backend so that we can
>>>>> provide a full set of vector and sparse matrix operations out of
>>>>> the default NVIDIA toolchain. We would still keep CUSP for its
>>>>> preconditioners, but we would no longer depend on it.
>>> Agreed. In the past, I've suggested a -vec_type cuda (not cusp). All
>>> the CUSP operations can be done with Thrust algorithms, and since
>>> Thrust ships with CUDA, that leaves only a CUDA dependency.
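
As an aside, a rough Thrust sketch of one such operation, a
VecAXPY-style update; the functor and sizes here are illustrative only,
not PETSc's actual kernels:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    /* y = a*x + y, the core of VecAXPY, written as a Thrust algorithm */
    struct axpy {
      double a;
      axpy(double a_) : a(a_) {}
      __host__ __device__ double operator()(double x, double y) const {
        return a * x + y;
      }
    };

    int main()
    {
      thrust::device_vector<double> x(100, 1.0), y(100, 2.0);
      thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                        axpy(3.0));
      return 0;
    }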
>>
>> Yes, I opt for
>>   -vec_type cuda
>> if everything needed ships with the CUDA toolkit. I even tend to
>> avoid Thrust as much as possible and go with CUBLAS/CUSPARSE, because
>> that way we get faster compilation and fewer compiler warnings, but
>> that's an implementation detail :-)
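
For contrast, the same AXPY through CUBLAS needs no device-code
compilation at all, which is where the faster-compilation point comes
from; a minimal sketch with error checking omitted:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
      const int n = 100;
      double a = 3.0, *d_x, *d_y;
      cudaMalloc(&d_x, n * sizeof(double));
      cudaMalloc(&d_y, n * sizeof(double));
      /* ... fill d_x and d_y ... */

      cublasHandle_t handle;
      cublasCreate(&handle);
      cublasDaxpy(handle, n, &a, d_x, 1, d_y, 1);  /* y = a*x + y */
      cublasDestroy(handle);

      cudaFree(d_x);
      cudaFree(d_y);
      return 0;
    }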
>>
>>
>>>>> * Integrate the last bits of the txpetscgpu package. I assume Paul
>>>>> will provide a helping hand here.
>>> Of course. This will go much faster since much of the hard work is
>>> already done. Do people want support for different matrix formats in
>>> the CUSP classes, e.g. diagonal, ELL, hybrid? I think the CUSP
>>> preconditioners can be derived from matrices stored in non-CSR
>>> formats (although they're likely just doing a convert under the
>>> hood).
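
The convert-under-the-hood step Paul suspects is essentially a
one-liner in CUSP, since format conversion happens in the constructors;
a small sketch using a gallery matrix as a stand-in:

    #include <cusp/csr_matrix.h>
    #include <cusp/hyb_matrix.h>
    #include <cusp/gallery/poisson.h>

    int main()
    {
      /* a 2D Poisson test matrix in CSR format on the device */
      cusp::csr_matrix<int, double, cusp::device_memory> A;
      cusp::gallery::poisson5pt(A, 64, 64);

      /* converting to another format (here HYB) is just construction */
      cusp::hyb_matrix<int, double, cusp::device_memory> B(A);
      return 0;
    }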
>>
>> Since people keep asking for fast SpMV, we should provide these other
>> formats as well (actually, they are already partially provided by
>> your update to the CUSPARSE bindings). The main reason for keeping
>> CUSP is the SA (smoothed aggregation) preconditioner, for which SpMV
>> performance doesn't really matter.
>
>Well, SpMV affects the cycle time, but the setup cost is dominated by
>sparse matrix-matrix products.
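
Reading Jed's point concretely: in smoothed aggregation the setup is
dominated by the Galerkin triple product. A rough CUSP sketch follows,
where A and P stand for the fine operator and prolongator; the square P
built from the gallery is only a stand-in to make the sketch run, since
a real prolongator is rectangular:

    #include <cusp/csr_matrix.h>
    #include <cusp/multiply.h>
    #include <cusp/transpose.h>
    #include <cusp/gallery/poisson.h>

    typedef cusp::csr_matrix<int, double, cusp::device_memory> Matrix;

    /* coarse operator Ac = P^T * A * P: two sparse matrix-matrix
       products plus a transpose, the bulk of the SA setup work */
    void galerkin(const Matrix& A, const Matrix& P, Matrix& Ac)
    {
      Matrix Pt, AP;
      cusp::transpose(P, Pt);
      cusp::multiply(A, P, AP);
      cusp::multiply(Pt, AP, Ac);
    }

    int main()
    {
      Matrix A, P, Ac;
      cusp::gallery::poisson5pt(A, 32, 32);
      cusp::gallery::poisson5pt(P, 32, 32);  /* stand-in prolongator */
      galerkin(A, P, Ac);
      return 0;
    }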
>
>>>>> * Documentation: Add a chapter on GPUs to the manual, particularly
>>>>> on what to expect and what not to expect. Update the installation
>>>>> documentation on the webpage.
>>> I will help with the manual.
>>
>> Cheers :-)
>>
>>
>>>>> * Integration of FEM quadrature from SNES ex52. The CUDA part
>>>>> requiring code generation is not very elegant, while the OpenCL
>>>>> approach is better suited for library integration thanks to JIT
>>>>> compilation. However, this requires user code to be provided as a
>>>>> string (again, not very elegant) or loaded from a file (more
>>>>> reasonable). How much FEM functionality do we want to provide via
>>>>> PETSc?
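
The JIT route mentioned here is the standard OpenCL pattern of handing
kernel source to the driver as a string at run time; a minimal sketch,
using single precision to avoid the fp64 extension, with error checking
mostly omitted:

    #include <CL/cl.h>
    #include <stdio.h>

    /* user code provided as a string -- the part deemed inelegant */
    static const char *src =
      "__kernel void axpy(float a, __global const float *x,\n"
      "                   __global float *y) {\n"
      "  int i = get_global_id(0);\n"
      "  y[i] = a * x[i] + y[i];\n"
      "}\n";

    int main(void)
    {
      cl_platform_id plat;
      cl_device_id dev;
      cl_int err;
      clGetPlatformIDs(1, &plat, NULL);
      clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);

      /* JIT: compile the string for the selected device at run time */
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL,
                                                  &err);
      err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
      printf("build %s\n", err == CL_SUCCESS ? "ok" : "failed");
      cl_kernel k = clCreateKernel(prog, "axpy", &err);

      clReleaseKernel(k);
      clReleaseProgram(prog);
      clReleaseContext(ctx);
      return 0;
    }
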
>>> Multi-GPU support is a highly pressing need, IMO. We need to figure
>>> out how to make block Jacobi and ASM run efficiently.
>>
>> The tricky part here is to balance processes vs. threads vs. GPUs. If
>> we use more than one GPU per process, we will duplicate more and more
>> of the current MPI logic over time just to move data between GPUs.
>> However, if we just use one GPU per process, we will under-utilize
>> the CPU unless we have a good interaction with threadcomm.
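
The one-GPU-per-process option is typically wired up by binding each
MPI rank to a device; one common sketch, noting that the shared-memory
communicator split is just one of several ways to get a node-local
rank:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);

      /* ranks on the same node get consecutive local ranks */
      MPI_Comm local;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &local);
      int lrank, ngpus;
      MPI_Comm_rank(local, &lrank);
      cudaGetDeviceCount(&ngpus);

      /* bind this process to one GPU; round-robin if oversubscribed */
      cudaSetDevice(lrank % ngpus);

      MPI_Comm_free(&local);
      MPI_Finalize();
      return 0;
    }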
>>
>> Best regards,
>> Karli
