[petsc-dev] Improving and stabilizing GPU support

Karl Rupp rupp at mcs.anl.gov
Fri Jul 19 17:13:20 CDT 2013


Hi Paul,

>>> * Reduce CUSP dependency: The current elementary operations are
>>> mainly realized via CUSP. With better support via CUSPARSE and
>>> CUBLAS, I'd add a separate 'native' CUDA backend so that we can
>>> provide a full set of vector and sparse matrix operations out of the
>>> default NVIDIA toolchain. We will still keep CUSP for its
>>> preconditioners, yet we no longer depend on it.
> Agreed. In the past, I've suggested a -vec_type cuda (not cusp). All the
> CUSP operations can be done with Thrust algorithms. Since Thrust comes
> default with CUDA, one can have only a CUDA dependency.

Yes, I opt for
  -vec_type cuda
if everything needed is shipped with the CUDA toolkit. I even tend to 
avoid Thrust as much as possible and go with CUBLAS/CUSPARSE, because 
that gives us faster compilation and fewer compiler warnings, but 
that's an implementation detail :-)
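
Just to make the direction concrete (a rough sketch only, not the 
actual PETSc code; the function name is made up): a native 'cuda' 
backend would map vector kernels directly onto CUBLAS, e.g. VecAXPY 
onto cublasDaxpy:

  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  /* y <- alpha*x + y on the device; the core of a VecAXPY-like kernel.
     d_x and d_y are device pointers set up via cudaMalloc()/cudaMemcpy(). */
  static int axpy_cuda(cublasHandle_t handle, int n, double alpha,
                       const double *d_x, double *d_y)
  {
    cublasStatus_t stat = cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    return (stat == CUBLAS_STATUS_SUCCESS) ? 0 : -1;
  }

The CUBLAS handle would presumably live with the rest of the Vec's GPU 
state, so neither Thrust nor CUSP headers are needed for the plain 
vector operations.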


>>> * Integrate last bits of txpetscgpu package. I assume Paul will
>>> provide a helping hand here.
> Of course. This will go much faster as much of the hard work is done. Do
> people want support for different matrix formats in the CUSP classes :
> i.e. diagonal, ellpack, hybrid? I think the CUSP preconditioners can be
> derived from matrices stored in non-csr format (although they're likely
> just doing a convert under the hood).

Since people keep asking for fast SpMV, we should provide these other 
formats as well (in fact, they are already partially available through 
your update to the CUSPARSE bindings). The main reason for keeping CUSP 
is the SA preconditioner, for which SpMV performance doesn't really 
matter.
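
To make the HYB case concrete: with CUSPARSE it essentially boils down 
to a CSR-to-HYB conversion plus the corresponding SpMV call. A rough 
sketch (error checks omitted, device pointers assumed to be set up; in 
a real Mat implementation one would of course do the conversion once 
at assembly time and cache the hyb handle rather than rebuild it per 
multiply):

  #include <cusparse_v2.h>

  /* y <- A*x with A converted from CSR to CUSPARSE's hybrid ELL/COO format */
  static void spmv_hyb(cusparseHandle_t handle, int m, int n,
                       const double *csrVal, const int *csrRowPtr,
                       const int *csrColInd, const double *x, double *y)
  {
    cusparseMatDescr_t descr;
    cusparseHybMat_t   hyb;
    double alpha = 1.0, beta = 0.0;

    cusparseCreateMatDescr(&descr);
    cusparseCreateHybMat(&hyb);
    cusparseDcsr2hyb(handle, m, n, descr, csrVal, csrRowPtr, csrColInd,
                     hyb, 0, CUSPARSE_HYB_PARTITION_AUTO);
    cusparseDhybmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                   descr, hyb, x, &beta, y);
    cusparseDestroyHybMat(hyb);
    cusparseDestroyMatDescr(descr);
  }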


>>> * Documentation: Add a chapter on GPUs to the manual, particularly on
>>> what to expect and what not to expect. Update documentation on
>>> webpage regarding installation.
> I will help with the manual.

Cheers :-)


>>> * Integration of FEM quadrature from SNES ex52. The CUDA part
>>> requiring code generation is not very elegant, while the OpenCL
>>> approach is better suited for a library integration thanks to JIT.
>>> However, this requires user code to be provided as a string (again
>>> not very elegant) or loaded from file (more reasonable). How much FEM
>>> functionality do we want to provide via PETSc?
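
Just to make the JIT point concrete: on the OpenCL side, "user code as 
a string" amounts to a handful of host calls, roughly like this 
(context and device setup as well as error handling omitted):

  #include <CL/cl.h>

  /* Build a kernel from a user-supplied source string at run time (JIT). */
  static cl_kernel build_from_string(cl_context ctx, cl_device_id dev,
                                     const char *src, const char *name)
  {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return clCreateKernel(prog, name, &err);
  }

So the real question is the user interface (string vs. file), not the 
mechanics of the compilation itself.
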
> Multi-GPU is a highly pressing need, IMO. Need to figure out how to make
> Block Jacobi and ASM run efficiently.

The tricky part here is balancing processes vs. threads vs. GPUs. If 
we use more than one GPU per process, we will over time duplicate more 
and more of the current MPI logic just to move data between GPUs. If 
we use only one GPU per process, however, we will under-utilize the 
CPU unless we get good interaction with threadcomm.
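
For the one-GPU-per-process variant, the device selection itself is 
simple; the open question is really the CPU utilization. A minimal 
sketch of the usual rank-to-device mapping (assuming MPI ranks on a 
node are numbered consecutively):

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
    int rank, ndev;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);  /* one GPU per rank, wrap around if needed */
    /* ... PetscInitialize() and the rest of the application ... */
    MPI_Finalize();
    return 0;
  }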

Best regards,
Karli



