[petsc-dev] PETSc multi-GPU assembly - current status
Matthew Knepley
knepley at gmail.com
Thu May 2 15:35:24 CDT 2013
On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber <florian.rathgeber at gmail.com> wrote:
> On 02/05/13 03:12, Matthew Knepley wrote:
> > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> >
> > Hi Florian,
> >
> > > This is loosely a follow-up to [1]. In this thread a few potential
> > > ways for making GPU assembly work with PETSc were discussed and to
> > > me the two most promising appeared to be:
> > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > 2) Preallocate a PETSc matrix and get the handle to pass the row
> > > pointer, column indices and values array to a custom assembly
> > > routine.
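For concreteness, a minimal petsc4py sketch of option (1) for the sequential
case, assuming the CSR arrays already sit in host memory. The sizes and toy
data below are purely illustrative; a GPU-resident CSR would first have to be
exposed to PETSc, which is exactly the transfer Florian wants to avoid.

    import numpy as np
    from petsc4py import PETSc

    n = 4                                                    # hypothetical size
    ia = np.array([0, 1, 2, 3, 4], dtype=PETSc.IntType)      # row pointers
    ja = np.array([0, 1, 2, 3], dtype=PETSc.IntType)         # column indices
    a  = np.ones(4, dtype=PETSc.ScalarType)                  # values (identity)

    # Wrap the existing (host-side) CSR arrays in a sequential AIJ matrix.
    A = PETSc.Mat().createAIJWithArrays((n, n), (ia, ja, a))
    A.assemble()   # harmless if the matrix is already assembled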
> >
> > I still consider these two to be the most promising (and general)
> > approaches. On the other hand, to my knowledge the infrastructure
> > hasn't changed a lot since then. Some additional functionality from
> > CUSPARSE was added, while I added ViennaCL-bindings to branch 'next'
> > (i.e. still a few corners to polish). This means that you could
> > technically use the much more JIT-friendly OpenCL (and, as a
> > follow-up, complain to NVIDIA and AMD about the higher latencies
> > compared with CUDA).
> >
> > > We compute local assembly matrices on the GPU and a crucial
> > > requirement is that the matrix *only* lives in device memory; we
> > > want to avoid any host <-> device data transfers.
> >
> > One of the reasons why - despite its attractiveness - this hasn't
> > taken off is because good preconditioners are typically still
> > required in such a setting. Other than the smoothed aggregation in
> > CUSP, there is not much which does *not* require a copy to the host.
> > Particularly when thinking about multi-GPU, you're entering the
> > regime where a good preconditioner on the CPU will still outperform
> > a GPU assembly with a poor preconditioner.
> >
> > > So far we have been using CUSP with a custom (generated) assembly
> > > into our own CUSP-compatible CSR data structure for a single GPU.
> > > Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> > > rather use existing infrastructure that works than roll our own.
> >
> > I guess this is good news for you: Steve Dalton will work with us
> > during the summer to extend the CUSP-SA-AMG to distributed memory.
> > Other than that, I think there's currently only the functionality
> > from CUSPARSE and polynomial preconditioners, available through the
> > txpetscgpu package.
> >
> > Aside from that, I also have a couple of plans on that front spinning
> > in my head, but I haven't yet found the time to implement them.
> >
> > > At the time of [1] supporting GPU assembly in one form or the other
> > > was on the roadmap, but the implementation direction did not seem
> > > to have been finally decided. Has there been any progress since
> > > then, or is there anything to add to the discussion? Is there even
> > > (experimental) code we might be able to use? Note that we're using
> > > petsc4py to interface to PETSc.
> >
> > Did you have a look at snes/examples/tutorials/ex52? I'm currently
> > converting/extending this to OpenCL, so it serves as a playground
> > for a future interface. Matt might have some additional comments on
> > this.
> >
> > I like to be very precise in the terminology. Doing the cell integrals
> > on the GPU (integration) is worthwhile, whereas
> > inserting the element matrices into a global representation like CSR
> > (assembly) takes no time and can be done
> > almost any way, including on the CPU. I stopped working on assembly
> > because it made no difference.
>
> The actual insertion (as in MatSetValues) may not take up much time on
> either the CPU or the GPU, provided it is done where the integration was
> done. As I mentioned before, we do both the integration and the solve on
> the GPU. We don't even allocate data in host memory. Therefore it
> wouldn't make much sense to do the addto on the host since it would
> require device -> host data transfer of all the cell integrals and host
> -> device of the CSR, which would make it quite expensive.
>
> One option we considered was creating a MatShell and providing an SPMV
> callback, probably calling a CUSP kernel on each MPI rank. That
> restricts the available preconditioners, but as mentioned, without doing
> any data transfers we'd be restricted to GPU-only preconditioners
> anyway. Any thoughts on this compared to the strategies mentioned above?
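A rough petsc4py sketch of that MatShell idea: a Python "shell" matrix whose
mult() callback stands in for the GPU SpMV. The GPUOperator class and the
identity action below are illustrative assumptions (the CUSP call is only
indicated by a comment), not existing code.

    from petsc4py import PETSc

    class GPUOperator(object):
        """Context providing the matrix action y = A*x."""
        def mult(self, mat, x, y):
            # Here one would hand x's device data to a CUSP/CUDA SpMV
            # kernel and write the result into y. As a stand-in, apply
            # the identity so the sketch actually runs.
            x.copy(y)

    n = 100                                       # hypothetical local size
    A = PETSc.Mat().createPython((n, n), context=GPUOperator())
    A.setUp()

    # Shell matrices restrict the preconditioner choices, as noted above.
    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType('cg')
    ksp.getPC().setType('none')

    x = PETSc.Vec().createSeq(n)
    b = x.duplicate()
    b.set(1.0)
    ksp.solve(b, x)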
>
What about just creating your CUSP matrix and then shoving it into a
MATAIJCUSP? That is what I did for my assembly tests.

For GPU-only preconditioners, I would focus on the CUSP AMG using
Chebyshev for the smoothers.
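A hedged sketch of how such a GPU-only solver/preconditioner combination
might be selected from petsc4py through the options database. The type and
option names reflect the petsc-3.3/3.4-era GPU support (aijcusp matrices,
cusp vectors, the sacusp preconditioner) and are assumptions that should be
checked against the installed PETSc version.

    from petsc4py import PETSc

    # Objects must call setFromOptions() for these settings to take effect.
    opts = PETSc.Options()
    opts['mat_type'] = 'aijcusp'   # keep the assembled matrix on the GPU
    opts['vec_type'] = 'cusp'      # keep vectors on the GPU as well
    opts['ksp_type'] = 'cg'
    opts['pc_type']  = 'sacusp'    # CUSP smoothed-aggregation preconditioner
    # A GAMG variant with an explicitly GPU-friendly smoother might look like:
    #   opts['pc_type'] = 'gamg'
    #   opts['mg_levels_ksp_type'] = 'chebyshev'
    #   opts['mg_levels_pc_type'] = 'jacobi'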
Matt
> Thanks,
> Florian
>
> > Thanks,
> >
> > Matt
> >
> >
> > Best regards,
> > Karli
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener