[petsc-dev] PETSc multi-GPU assembly - current status

Matthew Knepley knepley at gmail.com
Wed May 1 21:12:54 CDT 2013


On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi Florian,
>
>
>> This is loosely a follow-up to [1]. In that thread a few potential ways
>> for making GPU assembly work with PETSc were discussed, and to me the two
>> most promising appeared to be:
>> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
>> 2) Preallocate a PETSc matrix and get the handle to pass the row
>> pointer, column indices and values array to a custom assembly routine.
>>
>
> I still consider these two to be the most promising (and general)
> approaches. On the other hand, to my knowledge the infrastructure hasn't
> changed a lot since then. Some additional functionality from CUSPARSE was
> added, while I added ViennaCL bindings to branch 'next' (i.e. there are
> still a few corners to polish). This means that you could technically use
> the much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA
> and AMD about the higher latencies than with CUDA).
>
>
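For reference, option (1) above can be sketched with petsc4py roughly as
follows. This is a minimal illustration with made-up array contents, and the
CSR arrays here live in host memory; keeping everything device-resident would
additionally require one of the GPU matrix types discussed in this thread.

    import numpy as np
    from petsc4py import PETSc

    # Toy 4x4 CSR data standing in for a pre-assembled structure; in the
    # real use case these arrays would come from the GPU assembly code.
    n = 4
    indptr  = np.asarray([0, 2, 4, 6, 8], dtype=PETSc.IntType)
    indices = np.asarray([0, 1, 0, 1, 2, 3, 2, 3], dtype=PETSc.IntType)
    values  = np.asarray([4., -1., -1., 4., 4., -1., -1., 4.],
                         dtype=PETSc.ScalarType)

    # Wrap the CSR arrays in a PETSc AIJ matrix (option 1).
    A = PETSc.Mat().createAIJ(size=(n, n), csr=(indptr, indices, values))
    A.assemble()
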
>> We compute
>> local assembly matrices on the GPU and a crucial requirement is that the
>> matrix *only* lives in device memory; we want to avoid any host <->
>> device data transfers.
>>
>
> One of the reasons why - despite its attractiveness - this hasn't taken
> off is that good preconditioners are typically still required in such a
> setting. Other than the smoothed aggregation in CUSP, there is not much
> which does *not* require a copy to the host. Particularly when thinking
> about multi-GPU you're entering the regime where a good preconditioner on
> the CPU will still outperform a GPU assembly with a poor preconditioner.
>
>
>
>> So far we have been using CUSP with a custom (generated) assembly into
>> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
>> doesn't give us multi-GPU solvers out of the box, we'd rather use
>> existing infrastructure that works than roll our own.
>>
>
> I guess this is good news for you: Steve Dalton will work with us during
> the summer to extend the CUSP-SA-AMG to distributed memory. Other than
> that, I think there's currently only the functionality from CUSPARSE and
> polynomial preconditioners, available through the txpetscgpu package.
>
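As far as I can tell, the CUSP/CUSPARSE pieces mentioned above are selected
at run time through the usual options mechanism; a hedged petsc4py sketch,
assuming a CUDA/CUSP-enabled PETSc build of that vintage (the type and
preconditioner names are taken from the PETSc 3.3/3.4 documentation and may
differ for other versions):

    import sys
    import petsc4py
    # Ask for GPU-backed vector/matrix types and the CUSP smoothed-aggregation
    # preconditioner via runtime options. These option values assume a PETSc
    # build configured with CUDA/CUSP; double-check the names against your
    # installed version.
    petsc4py.init(sys.argv + ['-vec_type', 'cusp',
                              '-mat_type', 'aijcusp',
                              '-pc_type', 'sacusp'])
    from petsc4py import PETSc
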
> Aside from that I also have a couple of plans on that front spinning in my
> head, but I haven't found the time to implement them yet.
>
>
>
>> At the time of [1], supporting GPU assembly in one form or another was
>> on the roadmap, but the implementation direction did not seem to have been
>> finally decided. Has there been any progress since then, or anything to add to
>> the discussion? Is there even (experimental) code we might be able to
>> use? Note that we're using petsc4py to interface to PETSc.
>>
>
> Did you have a look at snes/examples/tutorials/ex52? I'm currently
> converting/extending this to OpenCL, so it serves as a playground for a
> future interface. Matt might have some additional comments on this.
>

I like to be very precise in the terminology. Doing the cell integrals on
the GPU (integration) is worthwhile, whereas inserting the element matrices
into a global representation like CSR (assembly) takes no time and can be
done almost any way, including on the CPU. I stopped working on assembly
because it made no difference.
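
To make that distinction concrete, the insertion ("assembly") step is
essentially the loop below. This is a self-contained petsc4py sketch with toy
data standing in for what a GPU integration kernel would hand back (two cells,
2x2 element matrices); it illustrates the step, not any particular code.

    import numpy as np
    from petsc4py import PETSc

    # Hypothetical outputs of the integration step: the global dof numbers of
    # each cell and one (already integrated) element matrix per cell.
    cell_dofs    = [np.asarray([0, 1], dtype=PETSc.IntType),
                    np.asarray([1, 2], dtype=PETSc.IntType)]
    element_mats = [np.asarray([[1., -1.], [-1., 1.]], dtype=PETSc.ScalarType)
                    for _ in range(2)]

    A = PETSc.Mat().createAIJ(size=(3, 3), nnz=3)
    for dofs, Ae in zip(cell_dofs, element_mats):
        # The 'assembly' step: add each element matrix into the global matrix.
        A.setValues(dofs, dofs, Ae, addv=PETSc.InsertMode.ADD_VALUES)
    A.assemble()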

  Thanks,

     Matt


> Best regards,
> Karli
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener