[petsc-dev] PETSc multi-GPU assembly - current status
Karl Rupp
rupp at mcs.anl.gov
Wed May 1 20:52:43 CDT 2013
Hi Florian,
> This is loosely a follow up to [1]. In this thread a few potential ways
> for making GPU assembly work with PETSc were discussed and to me the two
> most promising appeared to be:
> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> 2) Preallocate a PETSc matrix and get the handle to pass the row
> pointer, column indices and values array to a custom assembly routine.
I still consider these two to be the most promising (and general)
approaches. On the other hand, to my knowledge the infrastructure hasn't
changed much since then. Some additional functionality from CUSPARSE
was added, and I added ViennaCL bindings to branch 'next' (i.e. there
are still a few corners to polish). This means that you could
technically use the much more JIT-friendly OpenCL (and, as a follow-up,
complain to NVIDIA and AMD about the higher latencies compared to CUDA).
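Just to make option 1) a bit more concrete: since you mention petsc4py
further down, a pre-assembled CSR structure can be handed to PETSc
roughly as in the sketch below. The arrays are made-up toy data, it's
the sequential case only, and note the caveat that passing NumPy arrays
this way means the CSR data goes through host memory, which is exactly
the transfer you want to avoid, so treat it as a fallback/reference
path rather than the end goal:

    import numpy as np
    from petsc4py import PETSc

    # hypothetical CSR data for a small 3x3 test matrix
    row_ptr = np.asarray([0, 2, 4, 6], dtype=PETSc.IntType)
    col_idx = np.asarray([0, 1, 0, 1, 1, 2], dtype=PETSc.IntType)
    values  = np.asarray([4.0, -1.0, -1.0, 4.0, -1.0, 4.0],
                         dtype=PETSc.ScalarType)

    # sequential case for brevity; in parallel each rank passes its
    # local rows and local/global sizes instead
    A = PETSc.Mat().createAIJ(size=(3, 3),
                              csr=(row_ptr, col_idx, values),
                              comm=PETSc.COMM_SELF)
    A.assemble()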
> We compute
> local assembly matrices on the GPU, and a crucial requirement is that
> the matrix *only* lives in device memory; we want to avoid any
> host <-> device data transfers.
One of the reasons why - despite its attractiveness - this hasn't taken
off is that good preconditioners are typically still required in such
a setting. Other than the smoothed aggregation in CUSP, there is not
much available which does *not* require a copy to the host. Particularly
when thinking about multi-GPU setups you're entering the regime where a
good preconditioner on the CPU will still outperform a GPU assembly
paired with a poor preconditioner.
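To illustrate where this bites in practice, here is a rough petsc4py
sketch of how one would select a GPU matrix/vector type together with a
preconditioner. A simple preconditioner like Jacobi can stay entirely
on the device, whereas the stronger ones mentioned above generally pull
data back to the host. The type names refer to the CUSP backend and
depend on your PETSc build/version, so treat them as placeholders:

    from petsc4py import PETSc

    opts = PETSc.Options()
    opts['mat_type'] = 'aijcusp'   # GPU matrix storage (CUSP backend)
    opts['vec_type'] = 'cusp'      # GPU vectors
    opts['ksp_type'] = 'cg'
    opts['pc_type']  = 'jacobi'    # simple PC that can stay on the device

    # these take effect for objects that call setFromOptions(), or
    # equivalently via the command line:
    #   -mat_type aijcusp -vec_type cusp -ksp_type cg -pc_type jacobi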
> So far we have been using CUSP with a custom (generated) assembly into
> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
> doesn't give us multi-GPU solvers out of the box, we'd rather use
> existing infrastructure that works than roll our own.
I guess this is good news for you: Steve Dalton will work with us during
the summer to extend the CUSP-SA-AMG to distributed memory. Other than
that, I think there's currently only the functionality from CUSPARSE and
the polynomial preconditioners available through the txpetscgpu package.
Aside from that, I have a couple of ideas on that front spinning in my
head, but I haven't found the time to implement them yet.
> At the time of [1] supporting GPU assembly in one form or the other was
> on the roadmap, but the implementation direction did not seem to have
> been decided yet. Has there been any progress since then, or is there
> anything to add to the discussion? Is there even (experimental) code we
> might be able to use? Note that we're using petsc4py to interface to
> PETSc.
Did you have a look at snes/examples/tutorials/ex52? I'm currently
converting/extending this to OpenCL, so it serves as a playground for a
future interface. Matt might have some additional comments on this.
Best regards,
Karli