[petsc-dev] PETSc multi-GPU assembly - current status

Matthew Knepley knepley at gmail.com
Fri Jun 7 04:34:14 CDT 2013


On Thu, Jun 6, 2013 at 12:17 PM, Florian Rathgeber
<florian.rathgeber at gmail.com> wrote:

> On 02/05/13 21:35, Matthew Knepley wrote:
> > On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber
> > <florian.rathgeber at gmail.com> wrote:
> >
> >     On 02/05/13 03:12, Matthew Knepley wrote:
> >     > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> >     >
> >     >     Hi Florian,
> >     >
> >     >         This is loosely a follow-up to [1]. In that thread a few
> >     >         potential ways for making GPU assembly work with PETSc
> >     >         were discussed, and to me the two most promising appeared
> >     >         to be:
> >     >         1) Create a PETSc matrix from a pre-assembled CSR
> >     >            structure, or
> >     >         2) Preallocate a PETSc matrix and get the handle to pass
> >     >            the row pointer, column indices and values array to a
> >     >            custom assembly routine.
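
For concreteness, option 1) has a direct host-pointer analogue in the existing
API; a minimal sketch (the function and array names are placeholders, and a
GPU-resident CSR would instead need the aijcusp/cusparse types discussed
further down):

#include <petscmat.h>

/* Sketch of option 1) with host pointers: wrap an already assembled CSR
   structure in a PETSc matrix without copying.  nrows, ncols, rowptr,
   colidx and vals stand for data the assembly code already owns. */
PetscErrorCode WrapCSR(PetscInt nrows, PetscInt ncols, PetscInt *rowptr,
                       PetscInt *colidx, PetscScalar *vals, Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, nrows, ncols,
                                   rowptr, colidx, vals, A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}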
> >     >
> >     >     I still consider these two to be the most promising (and
> general)
> >     >     approaches. On the other hand, to my knowledge the
> infrastructure
> >     >     hasn't changed a lot since then. Some additional functionality
> >     from
> >     >     CUSPARSE was added, while I added ViennaCL-bindings to branch
> >     'next'
> >     >     (i.e. still a few corners to polish). This means that you could
> >     >     technically use the much more JIT-friendly OpenCL (and, as a
> >     >     follow-up, complain to NVIDIA and AMD about the higher
> >     >     latencies than with CUDA).
> >     >
> >     >         We compute local assembly matrices on the GPU and a
> >     >         crucial requirement is that the matrix *only* lives in
> >     >         device memory; we want to avoid any host <-> device
> >     >         data transfers.
> >     >
> >     >     One of the reasons why - despite its attractiveness - this
> hasn't
> >     >     taken off is because good preconditioners are typically still
> >     >     required in such a setting. Other than the smoothed
> aggregation in
> >     >     CUSP, there is not much which does *not* require a copy to the
> >     host.
> >     >     Particularly when thinking about multi-GPU you're entering the
> >     >     regime where a good preconditioner on the CPU will still
> >     outperform
> >     >     a GPU assembly with a poor preconditioner.
> >     >
> >     >         So far we have been using CUSP with a custom (generated)
> >     >         assembly into
> >     >         our own CUSP-compatible CSR data structure for a single
> GPU.
> >     >         Since CUSP
> >     >         doesn't give us multi-GPU solvers out of the box, we'd
> >     >         rather use existing infrastructure that works than roll
> >     >         our own.
> >     >
> >     >     I guess this is good news for you: Steve Dalton will work with
> us
> >     >     during the summer to extend the CUSP-SA-AMG to distributed
> memory.
> >     >     Other than that, I think there's currently only the
> functionality
> >     >     from CUSPARSE and polynomial preconditioners, available
> >     through the
> >     >     txpetscgpu package.
> >     >
> >     >     Aside from that I also have a couple of plans on that front
> >     >     spinning in my head, but I haven't found the time to implement
> >     >     them yet.
> >     >
> >     >         At the time of [1], supporting GPU assembly in one form
> >     >         or another was on the roadmap, but the implementation
> >     >         direction did not seem to have been finally decided. Was
> >     >         there any progress since then, or anything to add to the
> >     >         discussion? Is there even (experimental) code we might be
> >     >         able to use? Note that we're using petsc4py to interface
> >     >         to PETSc.
> >     >
> >     >     Did you have a look at snes/examples/tutorials/ex52? I'm
> currently
> >     >     converting/extending this to OpenCL, so it serves as a
> playground
> >     >     for a future interface. Matt might have some additional
> >     comments on
> >     >     this.
> >     >
> >     > I like to be very precise in the terminology. Doing the cell
> >     > integrals on the GPU (integration) is worthwhile, whereas inserting
> >     > the element matrices into a global representation like CSR
> >     > (assembly) takes no time and can be done almost any way, including
> >     > on the CPU. I stopped working on assembly because it made no
> >     > difference.
> >
> >     The actual insertion (as in MatSetValues) may not take up much time
> on
> >     either the CPU or the GPU, provided it is done where the integration
> was
> >     done. As I mentioned before we do both the integration and the solve
> on
> >     the GPU. We don't even allocate data in host memory. Therefore it
> >     wouldn't make much sense to do the addto on the host since it would
> >     require device -> host data transfer of all the cell integrals and
> host
> >     -> device of the CSR, which would make it quite expensive.
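
For reference, the addto being discussed is just the usual element-matrix
insertion, roughly as in the host-side sketch below (nloc, rows, cols and Ke
are placeholders for the output of the integration kernel):

#include <petscmat.h>

/* The addto in question: insert one cell's local matrix into the global
   matrix.  nloc, rows, cols and Ke are placeholders for the output of the
   integration kernel (assumed here to be in host memory). */
PetscErrorCode InsertElementMatrix(Mat A, PetscInt nloc, const PetscInt rows[],
                                   const PetscInt cols[], const PetscScalar Ke[])
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatSetValues(A, nloc, rows, nloc, cols, Ke, ADD_VALUES);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}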
> >
> >     One option we considered was creating a MatShell and providing an
> SPMV
> >     callback, probably calling a CUSP kernel on each MPI rank. That
> >     restricts the available preconditioners, but as mentioned, without
> doing
> >     any data transfers we'd be restricted to GPU-only preconditioners
> >     anyway. Any thoughts on this compared to the strategies mentioned
> above?
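
A minimal sketch of that MatShell route (MyCtx, its contents and the device
SpMV call are hypothetical placeholders, not existing API; only
MatCreateShell/MatShellSetOperation are real):

#include <petscmat.h>

/* Sketch of the MatShell route: a matrix-free operator whose MatMult calls
   back into user code (e.g. a Cusp SpMV on the process-local part). */
typedef struct {
  void *dev_csr;   /* handle to the device-resident CSR data (placeholder) */
} MyCtx;

static PetscErrorCode MyMult(Mat A, Vec x, Vec y)
{
  MyCtx         *ctx;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatShellGetContext(A, (void **)&ctx);CHKERRQ(ierr);
  /* ... launch the device SpMV on ctx->dev_csr, reading x and writing y ... */
  PetscFunctionReturn(0);
}

PetscErrorCode CreateShellOperator(MPI_Comm comm, PetscInt mloc, PetscInt nloc,
                                   PetscInt M, PetscInt N, MyCtx *ctx, Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreateShell(comm, mloc, nloc, M, N, (void *)ctx, A);CHKERRQ(ierr);
  ierr = MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))MyMult);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With only MATOP_MULT provided, Krylov methods and polynomial-type smoothers
work, but anything that needs access to matrix entries (ILU, AMG setup) will
not.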
> >
> >
> > What about just creating your CUSP matrix and then shoving it into a
> > MATAIJCUSP?
> > That is what I did for my assembly tests.
>
> That'd be the ideal solution. Does this work with MPIAIJ? We're only
> really interested in multi-GPU with MPI. In the sequential case we can
> just call Cusp directly, but for the MPI distributed case we'd rather
> rely on PETSc to help us out.
>

You would have to create the diagonal and off-diagonal matrices yourself.
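
With host pointers, that split (columns owned by this rank versus the rest)
can be handed to PETSc directly; a sketch, with all array and size arguments
as placeholders:

#include <petscmat.h>

/* Sketch with host pointers: hand PETSc the two CSR blocks of the locally
   owned rows directly.  See the MatCreateMPIAIJWithSplitArrays man page for
   the exact column-index numbering expected in the off-diagonal block. */
PetscErrorCode WrapSplitCSR(MPI_Comm comm, PetscInt mloc, PetscInt nloc,
                            PetscInt M, PetscInt N,
                            PetscInt *di, PetscInt *dj, PetscScalar *da,
                            PetscInt *oi, PetscInt *oj, PetscScalar *oa,
                            Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreateMPIAIJWithSplitArrays(comm, mloc, nloc, M, N,
                                        di, dj, da, oi, oj, oa, A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}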


> Presumably you're referring to the experiments you did for the TOMS
> paper? Is that code available somewhere?


No, it's for the TOMS paper I did not write because I thought the result
was not interesting enough. The code is in PETSc.


> > For GPU-only preconditioners, I would focus on the Cusp AMG using
> > Chebyshev for the smoothers.
>
> OK. Again we'd have to create our own PCShell for this when using a
> MatShell, if I understand correctly?


I don't think so since Cheby just uses a matrix action.
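
Something along these lines should already work with a shell matrix (a sketch
only; emin/emax are eigenvalue estimates the caller has to supply, the function
name is made up, and inside the Cusp AMG itself the smoother is configured by
Cusp rather than through a PETSc KSP):

#include <petscksp.h>

/* Sketch: Chebyshev only needs the action of the operator (MatMult), so it
   also works with a shell matrix.  emin/emax are rough estimates of the
   extreme eigenvalues that the caller must supply. */
PetscErrorCode SetupChebySmoother(MPI_Comm comm, Mat A, PetscReal emin,
                                  PetscReal emax, KSP *smoother)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(comm, smoother);CHKERRQ(ierr);
  ierr = KSPSetType(*smoother, KSPCHEBYSHEV);CHKERRQ(ierr);
  ierr = KSPChebyshevSetEigenvalues(*smoother, emax, emin);CHKERRQ(ierr);
  ierr = KSPGetPC(*smoother, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCNONE);CHKERRQ(ierr);  /* nothing beyond MatMult needed */
  /* note: some PETSc versions take an extra MatStructure argument here */
  ierr = KSPSetOperators(*smoother, A, A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}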

   Matt


>
> Florian
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener