[petsc-dev] PETSc multi-GPU assembly - current status

Nystrom, William D wdn at lanl.gov
Thu May 2 09:31:57 CDT 2013


Karli,

I'm not aware of any polynomial preconditioners for the GPU being available in PETSc,
with or without the txpetscgpu package.  I'd love to hear that I'm wrong, though, and
would love to try them out if they exist.

Dave

________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Karl Rupp [rupp at mcs.anl.gov]
Sent: Wednesday, May 01, 2013 7:52 PM
To: petsc-dev at mcs.anl.gov
Subject: Re: [petsc-dev] PETSc multi-GPU assembly - current status

Hi Florian,

> This is loosely a follow-up to [1]. In that thread a few potential ways
> of making GPU assembly work with PETSc were discussed, and to me the two
> most promising appeared to be:
> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> 2) Preallocate a PETSc matrix and get a handle to pass the row
> pointer, column indices and values arrays to a custom assembly routine.

I still consider these two to be the most promising (and most general)
approaches. On the other hand, to my knowledge the infrastructure hasn't
changed much since then. Some additional functionality from CUSPARSE was
added, and I added ViennaCL bindings to the 'next' branch (i.e. there are
still a few corners to polish). This means that you could technically use
the much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA
and AMD about the higher latencies compared to CUDA).
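
For illustration, a minimal petsc4py sketch of approach 1) could look like
the following. The CSR arrays are toy placeholders standing in for whatever
your assembly produces, and note that this call wraps host-side arrays, which
is precisely the limitation discussed further below:

  # Minimal sketch of approach 1): hand a pre-assembled CSR structure to PETSc.
  # The arrays are placeholders; in your setting they would come from the
  # GPU assembly, mirrored in host memory.
  import numpy as np
  from petsc4py import PETSc

  n = 4
  indptr  = np.array([0, 2, 4, 6, 8], dtype=PETSc.IntType)           # row pointers
  indices = np.array([0, 1, 0, 1, 2, 3, 2, 3], dtype=PETSc.IntType)  # column indices
  values  = np.ones(8, dtype=PETSc.ScalarType)                       # nonzero entries

  # Wrap the arrays (MatCreateSeqAIJWithArrays underneath, no copy of the arrays).
  A = PETSc.Mat().createAIJWithArrays((n, n), (indptr, indices, values))
  A.assemble()

  # Quick sanity check: y = A * x
  x = PETSc.Vec().createSeq(n); x.set(1.0)
  y = PETSc.Vec().createSeq(n)
  A.mult(x, y)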

> We compute
> local assembly matrices on the GPU, and a crucial requirement is that the
> matrix lives *only* in device memory; we want to avoid any host <->
> device data transfers.

One of the reasons why this hasn't taken off, despite its attractiveness,
is that good preconditioners are typically still required in such a
setting. Other than the smoothed aggregation in CUSP, there is not much
that does *not* require a copy to the host. Particularly when thinking
about multi-GPU you're entering the regime where a good preconditioner on
the CPU will still outperform a GPU assembly with a poor preconditioner.
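
To make that concrete: with a CUSP-enabled PETSc build you would typically
steer the solve toward the GPU-resident pieces through the options database.
A rough petsc4py sketch follows; the type names 'aijcusp', 'cusp' and
'sacusp' (CUSP smoothed aggregation) are my assumption for the CUSP-era
builds rather than something fixed by this thread:

  # Rough sketch: select CUSP-backed matrix/vector types and the CUSP
  # smoothed-aggregation preconditioner via the options database.
  # Type names are assumed for a CUSP-enabled build of this era.
  from petsc4py import PETSc

  opts = PETSc.Options()
  opts['mat_type'] = 'aijcusp'   # matrix storage on the GPU via CUSP
  opts['vec_type'] = 'cusp'      # GPU vectors
  opts['pc_type']  = 'sacusp'    # CUSP smoothed aggregation (stays on the GPU)

  n = 16
  A = PETSc.Mat().create()
  A.setSizes((n, n))
  A.setFromOptions()             # picks up -mat_type
  A.setPreallocationNNZ(3)
  for i in range(n):             # 1-D Laplacian, purely for illustration
      if i > 0:
          A.setValue(i, i - 1, -1.0)
      A.setValue(i, i, 2.0)
      if i < n - 1:
          A.setValue(i, i + 1, -1.0)
  A.assemble()

  b = PETSc.Vec().create()
  b.setSizes(n)
  b.setFromOptions()             # picks up -vec_type
  b.set(1.0)
  x = b.duplicate()

  ksp = PETSc.KSP().create()
  ksp.setOperators(A)
  ksp.setFromOptions()           # picks up -pc_type (and any -ksp_* options)
  ksp.solve(b, x)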


> So far we have been using CUSP with a custom (generated) assembly into
> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
> doesn't give us multi-GPU solvers out of the box, we'd prefer to use
> existing infrastructure that works rather than rolling our own.

I guess this is good news for you: Steve Dalton will work with us during
the summer to extend the CUSP smoothed-aggregation AMG to distributed
memory. Other than that, I think there is currently only the functionality
from CUSPARSE and the polynomial preconditioners available through the
txpetscgpu package.

Aside from that, I also have a couple of plans on that front spinning in
my head, but I haven't yet found the time to implement them.


> At the time of [1], supporting GPU assembly in one form or another was
> on the roadmap, but the implementation direction did not seem to have
> been finally decided. Has there been any progress since then, or anything
> to add to the discussion? Is there even (experimental) code we might be
> able to use? Note that we're using petsc4py to interface to PETSc.

Have you had a look at snes/examples/tutorials/ex52? I'm currently
converting/extending it to OpenCL, so it serves as a playground for a
future interface. Matt might have some additional comments on this.

Best regards,
Karli

