[petsc-dev] PETSc multi-GPU assembly - current status
Karl Rupp
rupp at mcs.anl.gov
Wed May 1 20:52:43 CDT 2013
Hi Florian,
> This is loosely a follow up to [1]. In this thread a few potential ways
> for making GPU assembly work with PETSc were discussed and to me the two
> most promising appeared to be:
> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> 2) Preallocate a PETSc matrix and get the handle to pass the row
> pointer, column indices and values array to a custom assembly routine.
I still consider these two to be the most promising (and general)
approaches. On the other hand, to my knowledge the infrastructure hasn't
changed much since then. Some additional functionality from CUSPARSE
was added, and I added ViennaCL bindings to branch 'next' (i.e. there
are still a few corners to polish). This means that you could
technically use the much more JIT-friendly OpenCL (and, as a follow-up,
complain to NVIDIA and AMD about the higher latencies compared to CUDA).
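Just to make option 1) a bit more concrete: since you mention petsc4py
further down, a pre-assembled CSR structure can be handed to PETSc
roughly as in the sketch below. The arrays are made-up toy data, it's
the sequential case only, and note the caveat that passing NumPy arrays
this way means the CSR data goes through host memory, which is exactly
the transfer you want to avoid, so treat it as a fallback/reference
path rather than the end goal:

    import numpy as np
    from petsc4py import PETSc

    # hypothetical CSR data for a small 3x3 test matrix
    row_ptr = np.asarray([0, 2, 4, 6], dtype=PETSc.IntType)
    col_idx = np.asarray([0, 1, 0, 1, 1, 2], dtype=PETSc.IntType)
    values  = np.asarray([4.0, -1.0, -1.0, 4.0, -1.0, 4.0],
                         dtype=PETSc.ScalarType)

    # sequential case for brevity; in parallel each rank passes its
    # local rows and local/global sizes instead
    A = PETSc.Mat().createAIJ(size=(3, 3),
                              csr=(row_ptr, col_idx, values),
                              comm=PETSc.COMM_SELF)
    A.assemble()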
> We compute
> local assembly matrices on the GPU, and a crucial requirement is that
> the matrix *only* lives in device memory; we want to avoid any
> host <-> device data transfers.
One of the reasons why - despite its attractiveness - this hasn't taken
off is that good preconditioners are typically still required in such
a setting. Other than the smoothed aggregation in CUSP, there is not
much available which does *not* require a copy to the host. Particularly
when thinking about multi-GPU setups you're entering the regime where a
good preconditioner on the CPU will still outperform a GPU assembly
paired with a poor preconditioner.
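To illustrate where this bites in practice, here is a rough petsc4py
sketch of how one would select a GPU matrix/vector type together with a
preconditioner. A simple preconditioner like Jacobi can stay entirely
on the device, whereas the stronger ones mentioned above generally pull
data back to the host. The type names refer to the CUSP backend and
depend on your PETSc build/version, so treat them as placeholders:

    from petsc4py import PETSc

    opts = PETSc.Options()
    opts['mat_type'] = 'aijcusp'   # GPU matrix storage (CUSP backend)
    opts['vec_type'] = 'cusp'      # GPU vectors
    opts['ksp_type'] = 'cg'
    opts['pc_type']  = 'jacobi'    # simple PC that can stay on the device

    # these take effect for objects that call setFromOptions(), or
    # equivalently via the command line:
    #   -mat_type aijcusp -vec_type cusp -ksp_type cg -pc_type jacobi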
> So far we have been using CUSP with a custom (generated) assembly into
> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
> doesn't give us multi-GPU solvers out of the box, we'd rather use
> existing infrastructure that works than roll our own.
I guess this is good news for you: Steve Dalton will work with us during
the summer to extend the CUSP-SA-AMG to distributed memory. Other than
that, I think there's currently only the functionality from CUSPARSE and
the polynomial preconditioners available through the txpetscgpu package.
Aside from that, I have a couple of ideas on that front spinning in my
head, but I haven't found the time to implement them yet.
> At the time of [1] supporting GPU assembly in one form or the other was
> on the roadmap, but the implementation direction did not seem to have
> been decided yet. Has there been any progress since then, or is there
> anything to add to the discussion? Is there even (experimental) code we
> might be able to use? Note that we're using petsc4py to interface to
> PETSc.
Did you have a look at snes/examples/tutorials/ex52? I'm currently
converting/extending this to OpenCL, so it serves as a playground for a
future interface. Matt might have some additional comments on this.
Best regards,
Karli