[petsc-dev] PETSc multi-GPU assembly - current status
Matthew Knepley
knepley at gmail.com
Fri Jun 7 04:34:14 CDT 2013
On Thu, Jun 6, 2013 at 12:17 PM, Florian Rathgeber
<florian.rathgeber at gmail.com> wrote:
> On 02/05/13 21:35, Matthew Knepley wrote:
> > On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber
> > <florian.rathgeber at gmail.com> wrote:
> >
> > > On 02/05/13 03:12, Matthew Knepley wrote:
> > > > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp
> > > > <rupp at mcs.anl.gov> wrote:
> > > >
> > > > > Hi Florian,
> > > > >
> > > > > > This is loosely a follow-up to [1]. In this thread a few
> > > > > > potential ways for making GPU assembly work with PETSc were
> > > > > > discussed, and to me the two most promising appeared to be:
> > > > > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > > > > 2) Preallocate a PETSc matrix and get the handle to pass the
> > > > > > row pointer, column indices and values array to a custom
> > > > > > assembly routine.
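For concreteness, option 1 can be sketched with the host-side public API:
MatCreateSeqAIJWithArrays wraps existing CSR arrays without copying them. The
2x2 matrix below is made up purely for illustration, and a device-resident
matrix would of course need a different path:

    #include <petscmat.h>

    /* Sketch of option 1: wrap a pre-assembled CSR structure in a
       (sequential, host-side) PETSc matrix without copying the arrays. */
    int main(int argc, char **argv)
    {
      Mat         A;
      PetscInt    rowptr[] = {0, 2, 3};        /* CSR row pointer    */
      PetscInt    colidx[] = {0, 1, 1};        /* CSR column indices */
      PetscScalar values[] = {2.0, -1.0, 3.0}; /* CSR values         */

      PetscInitialize(&argc, &argv, NULL, NULL);
      /* PETSc uses the arrays in place; they must outlive the Mat. */
      MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, 2, 2, rowptr, colidx,
                                values, &A);
      MatView(A, PETSC_VIEWER_STDOUT_SELF);
      MatDestroy(&A);
      PetscFinalize();
      return 0;
    }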
> > > > >
> > > > > I still consider these two to be the most promising (and general)
> > > > > approaches. On the other hand, to my knowledge the infrastructure
> > > > > hasn't changed a lot since then. Some additional functionality from
> > > > > CUSPARSE was added, while I added ViennaCL bindings to branch
> > > > > 'next' (i.e. still a few corners to polish). This means that you
> > > > > could technically use the much more JIT-friendly OpenCL (and, as a
> > > > > follow-up, complain at NVIDIA and AMD over the higher latencies
> > > > > than with CUDA).
> > > > >
> > > > > > We compute local assembly matrices on the GPU, and a crucial
> > > > > > requirement is that the matrix *only* lives in device memory:
> > > > > > we want to avoid any host <-> device data transfers.
> > > > >
> > > > > One of the reasons why - despite its attractiveness - this hasn't
> > > > > taken off is because good preconditioners are typically still
> > > > > required in such a setting. Other than the smoothed aggregation in
> > > > > CUSP, there is not much which does *not* require a copy to the
> > > > > host. Particularly when thinking about multi-GPU you're entering
> > > > > the regime where a good preconditioner on the CPU will still
> > > > > outperform a GPU assembly with a poor preconditioner.
> > > > >
> > > > > > So far we have been using CUSP with a custom (generated)
> > > > > > assembly into our own CUSP-compatible CSR data structure for a
> > > > > > single GPU. Since CUSP doesn't give us multi-GPU solvers out of
> > > > > > the box, we'd rather use existing infrastructure that works than
> > > > > > roll our own.
> > > > >
> > > > > I guess this is good news for you: Steve Dalton will work with us
> > > > > during the summer to extend the CUSP-SA-AMG to distributed memory.
> > > > > Other than that, I think there's currently only the functionality
> > > > > from CUSPARSE and polynomial preconditioners, available through
> > > > > the txpetscgpu package.
> > > > >
> > > > > Aside from that I also have a couple of plans on that front
> > > > > spinning in my head, but I couldn't find the time to implement
> > > > > them yet.
> > > > >
> > > > > > At the time of [1], supporting GPU assembly in one form or the
> > > > > > other was on the roadmap, but the implementation direction
> > > > > > seemed not to have been finally decided. Was there any progress
> > > > > > since then, or anything to add to the discussion? Is there even
> > > > > > (experimental) code we might be able to use? Note that we're
> > > > > > using petsc4py to interface to PETSc.
> > > > >
> > > > > Did you have a look at snes/examples/tutorials/ex52? I'm currently
> > > > > converting/extending this to OpenCL, so it serves as a playground
> > > > > for a future interface. Matt might have some additional comments
> > > > > on this.
> > > >
> > > > I like to be very precise in the terminology. Doing the cell
> > > > integrals on the GPU (integration) is worthwhile, whereas inserting
> > > > the element matrices into a global representation like CSR
> > > > (assembly) takes no time and can be done almost any way, including
> > > > on the CPU. I stopped working on assembly because it made no
> > > > difference.
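To make the terminology concrete, "assembly" in this sense is just an
insertion loop like the sketch below, where the cell connectivity and dense
element matrices are hypothetical placeholders:

    /* Sketch of "assembly": insert per-cell dense element matrices into a
       global AIJ matrix. ncells, ndof, cell_dofs, and elem_mat are
       hypothetical placeholders for mesh data. */
    PetscErrorCode AssembleElementMatrices(Mat A, PetscInt ncells,
                                           PetscInt ndof,
                                           const PetscInt *cell_dofs,
                                           const PetscScalar *elem_mat)
    {
      PetscInt c;
      for (c = 0; c < ncells; ++c) {
        const PetscInt    *dofs = &cell_dofs[c * ndof]; /* global dof ids */
        const PetscScalar *vals = &elem_mat[c * ndof * ndof]; /* ndof^2   */
        MatSetValues(A, ndof, dofs, ndof, dofs, vals, ADD_VALUES);
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
      return 0;
    }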
> > >
> > > The actual insertion (as in MatSetValues) may not take up much time on
> > > either the CPU or the GPU, provided it is done where the integration
> > > was done. As I mentioned before, we do both the integration and the
> > > solve on the GPU. We don't even allocate data in host memory.
> > > Therefore it wouldn't make much sense to do the addto on the host,
> > > since it would require device -> host data transfer of all the cell
> > > integrals and host -> device of the CSR, which would make it quite
> > > expensive.
> > >
> > > One option we considered was creating a MatShell and providing an SpMV
> > > callback, probably calling a CUSP kernel on each MPI rank. That
> > > restricts the available preconditioners, but as mentioned, without
> > > doing any data transfers we'd be restricted to GPU-only
> > > preconditioners anyway. Any thoughts on this compared to the
> > > strategies mentioned above?
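For reference, the MatShell route needs only a multiply callback. A minimal
sketch, where the ShellCtx context and the DeviceSpMV kernel are hypothetical
stand-ins for a per-rank CUSP SpMV:

    #include <petscmat.h>

    /* Hypothetical user context holding the device-side CSR matrix. */
    typedef struct { void *dev_csr; } ShellCtx;

    /* MatMult callback: y = A*x, delegating to a device SpMV kernel. */
    static PetscErrorCode ShellMult(Mat A, Vec x, Vec y)
    {
      ShellCtx *ctx;
      MatShellGetContext(A, (void **)&ctx);
      /* DeviceSpMV(ctx->dev_csr, x, y);  -- user-provided GPU kernel */
      return 0;
    }

    /* Wire it up: local size m, global size M, one context per rank. */
    PetscErrorCode CreateShell(MPI_Comm comm, PetscInt m, PetscInt M,
                               ShellCtx *ctx, Mat *A)
    {
      MatCreateShell(comm, m, m, M, M, ctx, A);
      MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))ShellMult);
      return 0;
    }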
>
> > What about just creating your CUSP matrix and then shoving it into a
> > MATAIJCUSP? That is what I did for my assembly tests.
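A sketch of that route through the public API, assuming the matrix is
assembled through PETSc rather than injected from an already-built cusp
object (which would mean reaching into Mat internals, not shown here). Type
names are as in the CUSP-enabled 3.x builds:

    #include <petscmat.h>

    /* Sketch: create an AIJ matrix stored on the GPU via the CUSP backend.
       Assembly goes through the usual MatSetValues path; MatMult then runs
       on the device. */
    PetscErrorCode CreateGPUMatrix(MPI_Comm comm, PetscInt M, PetscInt N,
                                   Mat *A)
    {
      MatCreate(comm, A);
      MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, M, N);
      MatSetType(*A, MATAIJCUSP); /* seqaijcusp / mpiaijcusp by comm size */
      MatSetUp(*A);
      return 0;
    }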
>
> That'd be the ideal solution. Does this work with MPIAIJ? We're only
> really interested in multi-GPU with MPI. In the sequential case we can
> just call Cusp directly, but for the MPI distributed case we'd rather
> rely on PETSc to help us out.
>
You would have to create the diagonal and off-diagonal matrices yourself.
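Concretely, that split is what MatCreateMPIAIJWithSplitArrays expects. A
rough sketch with placeholder arrays; see the man page for the exact
local/global index conventions of the two blocks:

    #include <petscmat.h>

    /* Sketch: build an MPIAIJ matrix from separately assembled "diagonal"
       (columns in this rank's ownership range) and "off-diagonal" CSR
       blocks. All array arguments are placeholders supplied by the caller. */
    PetscErrorCode CreateSplitMPIAIJ(MPI_Comm comm, PetscInt m, PetscInt n,
                                     PetscInt *di, PetscInt *dj,
                                     PetscScalar *da, PetscInt *oi,
                                     PetscInt *oj, PetscScalar *oa, Mat *A)
    {
      /* m, n: local sizes; global sizes left to PETSC_DETERMINE. */
      return MatCreateMPIAIJWithSplitArrays(comm, m, n, PETSC_DETERMINE,
                                            PETSC_DETERMINE, di, dj, da,
                                            oi, oj, oa, A);
    }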
> Presumably you're referring to the experiments you did for the TOMS
> paper? Is that code available somewhere?
No, it's for the TOMS paper I did not write, because I thought the result
was not interesting enough. The code is in PETSc.
> > For GPU-only preconditioners, I would focus on the Cusp AMG using
> > Chebyshev for the smoothers.
>
> OK. Again we'd have to create our own PCShell for this when using a
> MatShell, if I understand correctly?

I don't think so, since Chebyshev just uses the matrix action.
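To illustrate: a Chebyshev smoother only ever applies the operator, so a
shell matrix is enough. A sketch using the 3.4-era KSPSetOperators
signature, with made-up eigenvalue bounds:

    #include <petscksp.h>

    /* Sketch: Chebyshev smoothing needs only MatMult, so it works with a
       MatShell. The bounds (0.1, 1.1) are placeholders; in practice they
       come from an estimate of the operator's spectrum. */
    PetscErrorCode SmoothChebyshev(Mat Ashell, Vec b, Vec x)
    {
      KSP ksp;
      PC  pc;
      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetType(ksp, KSPCHEBYSHEV);
      KSPChebyshevSetEigenvalues(ksp, 1.1, 0.1); /* emax, emin placeholders */
      KSPSetOperators(ksp, Ashell, Ashell, SAME_NONZERO_PATTERN);
      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCNONE); /* no factorization: only y = A*x is needed */
      KSPSolve(ksp, b, x);
      KSPDestroy(&ksp);
      return 0;
    }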
Matt
>
> Florian
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener