[petsc-dev] code review request : txpetscgpu package removal

Karl Rupp rupp at mcs.anl.gov
Tue Jun 25 16:46:25 CDT 2013


Hi Paul,

> Not too heavy. I've already converted much of this code to remove this
> package while supporting existing features, though I haven't pushed it
> into the fork. The real question is whether we want to go down this path
> or not.

I see two options: either txpetscgpu is a self-contained package that 
brings its own set of implementation files along, or it gets properly 
integrated. The current model of injected PETSC_HAVE_TXPETSCGPU 
preprocessor switches will not win any code beauty contest... ;-) 
Either way, there is presumably also a licensing issue involved, so you 
guys need to agree to have txpetscgpu integrated (or not).
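
To make that concrete, here is a purely hypothetical sketch of the 
injected-switch pattern (made-up names, not actual PETSc source): a 
generic routine whose body branches on the package macro, so every such 
routine has to carry package-specific knowledge.

/* Hypothetical sketch, not actual PETSc source: the "injected switch"
 * pattern branches on the package macro inside an otherwise generic
 * routine, so package details leak into every implementation file. */
#include <stdio.h>

static int spmv_default(void)    { printf("default SpMV path\n");    return 0; }
#if defined(PETSC_HAVE_TXPETSCGPU)
static int spmv_txpetscgpu(void) { printf("txpetscgpu SpMV path\n"); return 0; }
#endif

int mat_mult(void)
{
#if defined(PETSC_HAVE_TXPETSCGPU)
  return spmv_txpetscgpu();  /* package-specific path injected here */
#else
  return spmv_default();     /* plain path */
#endif
}

int main(void) { return mat_mult(); }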


> Right now, I think CUSP does not support SpMVs in streams. Thus, in
> order to get an effective multi-GPU SpMV for all the different storage
> formats, one has to rewrite all the SpMV kernels to use streams. This
> adds a lot of additional code to support. I would prefer to just call
> some CUSP API with a stream as an input argument, but I don't think
> that exists at the moment. I'm not sure what to do here. Once the
> other code is accepted, perhaps we can address this problem then?

The CUSP API needs to provide streams for that, yes.
As I noted in my comments on your commits on Bitbucket, I'd prefer to 
see CUSP separated from CUSPARSE and to use a CUSPARSE-native matrix 
data structure (a simple collection of handles) instead. That way the 
CUSPARSE interface is already usable when only the CUDA SDK is 
installed, and CUSP can be hooked in later for preconditioners, etc.
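
For concreteness, a minimal sketch of the kind of CUSPARSE-native 
structure I have in mind; the type and function names are hypothetical, 
and it only relies on the plain cusparse*csrmv call plus 
cusparseSetStream, so each SpMV can be issued on its own stream without 
going through CUSP:

/* Minimal sketch of a CUSPARSE-native CSR matrix: just a bundle of
 * handles and device pointers, no CUSP types involved. Hypothetical
 * names, not the actual PETSc data structures. */
#include <cuda_runtime.h>
#include <cusparse_v2.h>

typedef struct {
  cusparseHandle_t   handle;  /* cuSPARSE library context          */
  cusparseMatDescr_t descr;   /* general matrix, 0-based indexing  */
  int                m, n, nnz;
  double            *val;     /* device CSR arrays                 */
  int               *rowptr;
  int               *colind;
} CsrGpuMatrix;

/* y = alpha*A*x + beta*y, issued on the given stream so that several
 * SpMVs (e.g. diagonal and off-diagonal block) can overlap. */
static cusparseStatus_t csr_spmv(CsrGpuMatrix *A, cudaStream_t stream,
                                 double alpha, const double *x,
                                 double beta, double *y)
{
  cusparseSetStream(A->handle, stream);  /* bind following calls to stream */
  return cusparseDcsrmv(A->handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        A->m, A->n, A->nnz, &alpha, A->descr,
                        A->val, A->rowptr, A->colind, x, &beta, y);
}

With something along these lines, the diagonal and off-diagonal parts 
of the MPI matrix could be multiplied on separate streams, which is the 
overlap you are after, while CUSP only enters the picture once 
preconditioners are needed.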


> It works across nodes, but you have to know what you're doing. This is
> a tough problem to solve universally because it's (almost) impossible
> to determine the number of MPI ranks per node in an MPI run. I've
> never seen an MPI function that returns this information.
>
> Right now, a 1-1 pairing between CPU core and GPU will work across any
> system with any number of nodes. I've tested this on a system with 2
> nodes and 4 GPUs per node (so "mpirun -n 8 -npernode 4" works).

Thanks, I see. Apparently I'm not the only one struggling with this 
abstraction issue...
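
For what it's worth, MPI-3 implementations do provide a way to recover 
exactly that information: MPI_Comm_split_type() with 
MPI_COMM_TYPE_SHARED yields a communicator of the ranks sharing a node. 
A minimal sketch of the rank-to-GPU binding this enables (assuming an 
MPI-3 library and the CUDA runtime; not actual PETSc code):

/* Sketch: determine the rank's position within its node via MPI-3's
 * MPI_Comm_split_type and bind it to a GPU round-robin. */
#include <mpi.h>
#include <cuda_runtime.h>

int bind_rank_to_gpu(MPI_Comm comm)
{
  MPI_Comm node_comm;
  int      local_rank, ndevices;

  /* All ranks that can share memory (i.e. live on the same node)
   * end up in the same sub-communicator. */
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_free(&node_comm);

  cudaGetDeviceCount(&ndevices);
  cudaSetDevice(local_rank % ndevices);  /* 1-1 pairing when ranks == GPUs */
  return local_rank % ndevices;
}

That sidesteps the need to know -npernode up front, at the price of 
requiring an MPI-3 library.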

Best regards,
Karli



