[petsc-dev] PETSc multi-GPU assembly - current status

Thu May 2 09:39:49 CDT 2013

Hi Dave,

> I'm not aware of any polynomial preconditioners for the GPU available in PETSc, with
> or without the txpetscgpu package.  I'd love to try them out if they exist, though,
> and would love to hear that I am wrong.

Hmm, Paul mentioned the following paper a couple of weeks back:

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6319205&contentType=Conference+Publications

from which I concluded that this is already part of the txpetscgpu 
package. Paul, this is the case, isn't it?

Best regards,
Karli

>
> ________________________________________
> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Karl Rupp [rupp at mcs.anl.gov]
> Sent: Wednesday, May 01, 2013 7:52 PM
> To: petsc-dev at mcs.anl.gov
> Subject: Re: [petsc-dev] PETSc multi-GPU assembly - current status
>
> Hi Florian,
>
>> This is loosely a follow-up to [1]. In that thread a few potential ways
>> for making GPU assembly work with PETSc were discussed and to me the two
>> most promising appeared to be:
>> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
>> 2) Preallocate a PETSc matrix and get the handle to pass the row
>> pointer, column indices and values array to a custom assembly routine.
>
> I still consider these two to be the most promising (and general)
> approaches. On the other hand, to my knowledge the infrastructure hasn't
> changed a lot since then. Some additional functionality from CUSPARSE
> was added, and I added ViennaCL bindings to branch 'next' (there are still
> a few corners to polish). This means that you could technically use the
> much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA
> and AMD about the higher latencies compared to CUDA).
>
>> We compute
>> local assembly matrices on the GPU and a crucial requirement is that the
>> matrix *only* lives in device memory; we want to avoid any host <->
>> device data transfers.
>
> One of the reasons why this hasn't taken off, despite its attractiveness,
> is that good preconditioners are typically still required in such
> a setting. Other than the smoothed aggregation in CUSP, there is not
> much which does *not* require a copy to the host. Particularly when
> thinking about multi-GPU, you're entering the regime where a good
> preconditioner on the CPU will still outperform a GPU assembly with a poor
> preconditioner.
>
>
>> So far we have been using CUSP with a custom (generated) assembly into
>> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
>> doesn't give us multi-GPU solvers out of the box we'd rather use
>> existing infrastructure that works rather than rolling our own.
>
> I guess this is good news for you: Steve Dalton will work with us during
> the summer to extend the CUSP-SA-AMG to distributed memory. Other than
> that, I think there's currently only the functionality from CUSPARSE and
> polynomial preconditioners, available through the txpetscgpu package.
>
> Aside from that, I also have a couple of plans on that front spinning in
> my head, but I haven't found the time to implement them yet.
>
>
>> At the time of [1], supporting GPU assembly in one form or another was
>> on the roadmap, but the implementation direction did not seem to have been
>> finally decided. Has there been any progress since then, or is there anything
>> to add to the discussion? Is there even (experimental) code we might be able to
>> use? Note that we're using petsc4py to interface to PETSc.
>
> Did you have a look at snes/examples/tutorials/ex52? I'm currently
> converting/extending this to OpenCL, so it serves as a playground for a
> future interface. Matt might have some additional comments on this.
>
> Best regards,
> Karli
>
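
For reference, the CSR triple named in option 2 of the quoted message (row
pointer, column indices, values) can be sketched in plain Python. This is only
a host-side illustration of the data layout and the "preallocate, then
assemble" split, not PETSc, CUSP, or GPU code. The function names
build_csr_pattern and assemble_values, the 1D mesh, and the 2x2 local matrix
are hypothetical and chosen for the example.

```python
from bisect import bisect_left


def build_csr_pattern(num_rows, elements):
    """Preallocation step (hypothetical helper): derive the row pointer
    and the sorted column indices from the element connectivity."""
    cols_per_row = [set() for _ in range(num_rows)]
    for dofs in elements:
        for r in dofs:
            cols_per_row[r].update(dofs)
    row_ptr = [0]
    col_idx = []
    for cols in cols_per_row:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def assemble_values(row_ptr, col_idx, elements, local_matrix):
    """Assembly step (hypothetical helper): add each local matrix into
    the preallocated values array without changing the pattern."""
    values = [0.0] * len(col_idx)
    for dofs in elements:
        for a, r in enumerate(dofs):
            start, end = row_ptr[r], row_ptr[r + 1]
            for b, c in enumerate(dofs):
                # Column indices are sorted within each row, so a binary
                # search restricted to the row finds the target slot.
                k = bisect_left(col_idx, c, start, end)
                values[k] += local_matrix[a][b]
    return values


# Assumed toy problem: a 1D chain of three two-node elements on four nodes.
elements = [(0, 1), (1, 2), (2, 3)]
local_matrix = [[1.0, -1.0], [-1.0, 1.0]]

row_ptr, col_idx = build_csr_pattern(4, elements)
values = assemble_values(row_ptr, col_idx, elements, local_matrix)
```

In the setting discussed above, the second function corresponds to the custom
assembly routine that would run on the device, while the three arrays are what
option 1 would hand over as a finished structure. As far as I know, petsc4py's
Mat.createAIJ accepts such a triple through its csr argument, but that path
goes through host arrays and so does not by itself meet the device-only
requirement.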