[petsc-dev] Improving and stabilizing GPU support

Jed Brown jedbrown at mcs.anl.gov
Sat Jul 20 13:40:25 CDT 2013


Steve, are you subscribed to the petsc-dev mailing list?

Karl Rupp <rupp at mcs.anl.gov> writes:

> Hi Paul,
>
>>>> * Reduce CUSP dependency: The current elementary operations are
>>>> mainly realized via CUSP. With better support via CUSPARSE and
>>>> CUBLAS, I'd add a separate 'native' CUDA backend so that we can
>>>> provide a full set of vector and sparse matrix operations out of the
>>>> default NVIDIA toolchain. We will still keep CUSP for its
>>>> preconditioners, yet we no longer depend on it.
>> Agreed. In the past, I've suggested a -vec_type cuda (not cusp). All the
>> CUSP operations can be done with Thrust algorithms, and since Thrust
>> ships with CUDA by default, one can have only a CUDA dependency.
>
> Yes, I'd opt for
>   -vec_type cuda
> if everything needed ships with the CUDA toolkit. I even tend to
> avoid Thrust as much as possible and go with CUBLAS/CUSPARSE, because we
> get faster compilation and fewer compiler warnings that way, but that's
> an implementation detail :-)
>
>
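
For concreteness, here is a minimal sketch of what a toolkit-only vector
backend could look like; the wrapper name axpy_cuda is made up, but
cublasDaxpy ships with every CUDA toolkit, so nothing beyond CUDA itself
is required:

  #include <cublas_v2.h>

  /* y = alpha*x + y on device arrays d_x, d_y, using only CUBLAS
     (no Thrust, no CUSP). */
  int axpy_cuda(cublasHandle_t handle, int n, double alpha,
                const double *d_x, double *d_y)
  {
    cublasStatus_t stat = cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    return (stat == CUBLAS_STATUS_SUCCESS) ? 0 : 1;
  }
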
>>>> * Integrate last bits of txpetscgpu package. I assume Paul will
>>>> provide a helping hand here.
>> Of course. This will go much faster since much of the hard work is done.
>> Do people want support for different matrix formats in the CUSP classes,
>> e.g. diagonal, ELLPACK, hybrid? I think the CUSP preconditioners can be
>> derived from matrices stored in non-CSR formats (although they likely
>> just do a conversion under the hood).
>
> Since people keep asking for fast SpMV, we should provide these other
> formats as well (actually, they are already partially provided by your
> update to the CUSPARSE bindings). The main reason for keeping CUSP is its
> smoothed aggregation (SA) preconditioner, for which SpMV performance
> doesn't really matter.

Well, SpMV affects cycle time, but setup cost is dominated by sparse
matrix-matrix products.
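
For reference, a sketch of the MatMult kernel using the CUSPARSE CSR
routine from the plain CUDA toolkit (device array names are illustrative;
this is the CUDA 5-era API):

  #include <cusparse_v2.h>

  /* y = alpha*A*x + beta*y for a CSR matrix already on the device.
     CUSPARSE also provides ELL/HYB variants (e.g. cusparseDhybmv),
     which is how the other storage formats would be exposed. */
  int spmv_csr(cusparseHandle_t handle, cusparseMatDescr_t descr,
               int m, int n, int nnz, double alpha, double beta,
               const double *d_val, const int *d_rowptr,
               const int *d_colind, const double *d_x, double *d_y)
  {
    cusparseStatus_t stat =
      cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
                     &alpha, descr, d_val, d_rowptr, d_colind,
                     d_x, &beta, d_y);
    return (stat == CUSPARSE_STATUS_SUCCESS) ? 0 : 1;
  }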

>>>> * Documentation: Add a chapter on GPUs to the manual, particularly on
>>>> what to expect and what not to expect. Update documentation on
>>>> webpage regarding installation.
>> I will help with the manual.
>
> Cheers :-)
>
>
>>>> * Integration of FEM quadrature from SNES ex52. The CUDA part,
>>>> which requires code generation, is not very elegant, while the
>>>> OpenCL approach is better suited for library integration thanks to
>>>> JIT compilation. However, this requires user code to be provided as
>>>> a string (again not very elegant) or loaded from a file (more
>>>> reasonable). How much FEM functionality do we want to provide via
>>>> PETSc?
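
To illustrate the JIT point: with OpenCL the user-provided kernel is just
a string that the library compiles at run time. A minimal sketch (the
residual kernel here is a trivial placeholder, not the ex52 code):

  #include <CL/cl.h>

  /* A user-supplied kernel arrives as a plain string ... */
  static const char *src =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void residual(__global const double *u,\n"
    "                       __global double *f, int n) {\n"
    "  int i = get_global_id(0);\n"
    "  if (i < n) f[i] = u[i]; /* placeholder physics */\n"
    "}\n";

  /* ... and is JIT-compiled for the target device at run time,
     so no offline code-generation step is needed. */
  cl_program build_user_kernel(cl_context ctx, cl_device_id dev)
  {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return (err == CL_SUCCESS) ? prog : NULL;
  }
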
>> Multi-GPU is a highly pressing need, IMO. We need to figure out how to
>> make block Jacobi and ASM run efficiently.
>
> The tricky part here is to balance processes vs. threads vs. GPUs. If we 
> use more than one GPU per process, we will duplicate more and more of 
> the current MPI logic over time just to move data between GPUs. However, 
> if we just use one GPU per process, we will under-utilize the CPU unless 
> we have a good interaction with threadcomm.
>
> Best regards,
> Karli
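
Assuming the one-GPU-per-process model Karl describes, the binding itself
is simple; a sketch using the MPI-3 shared-memory split (older MPIs would
need a hostname-based grouping instead):

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Bind each MPI rank to one of its node's GPUs, round-robin by the
     rank's position within the node, so block Jacobi/ASM subdomain
     solves land on distinct devices. */
  int bind_gpu_to_rank(MPI_Comm comm)
  {
    MPI_Comm node;
    int      noderank, ngpu;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &noderank);
    cudaGetDeviceCount(&ngpu);
    cudaSetDevice(noderank % ngpu);
    MPI_Comm_free(&node);
    return 0;
  }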