[petsc-dev] PETSc multi-GPU assembly - current status
Matthew Knepley
knepley at gmail.com
Thu May 2 15:35:24 CDT 2013
On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber <florian.rathgeber at gmail.com> wrote:
> On 02/05/13 03:12, Matthew Knepley wrote:
> > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> >
> > Hi Florian,
> >
> > > This is loosely a follow-up to [1]. In this thread a few potential
> > > ways for making GPU assembly work with PETSc were discussed and to
> > > me the two most promising appeared to be:
> > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > 2) Preallocate a PETSc matrix and get the handle to pass the row
> > > pointer, column indices and values array to a custom assembly
> > > routine.
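For concreteness, a minimal petsc4py sketch of option (1) for the sequential
case, assuming the CSR arrays already sit in host memory. The sizes and toy
data below are purely illustrative; a GPU-resident CSR would first have to be
exposed to PETSc, which is exactly the transfer Florian wants to avoid.

    import numpy as np
    from petsc4py import PETSc

    n = 4                                                    # hypothetical size
    ia = np.array([0, 1, 2, 3, 4], dtype=PETSc.IntType)      # row pointers
    ja = np.array([0, 1, 2, 3], dtype=PETSc.IntType)         # column indices
    a  = np.ones(4, dtype=PETSc.ScalarType)                  # values (identity)

    # Wrap the existing (host-side) CSR arrays in a sequential AIJ matrix.
    A = PETSc.Mat().createAIJWithArrays((n, n), (ia, ja, a))
    A.assemble()   # harmless if the matrix is already assembled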
> >
> > I still consider these two to be the most promising (and general)
> > approaches. On the other hand, to my knowledge the infrastructure
> > hasn't changed a lot since then. Some additional functionality from
> > CUSPARSE was added, while I added ViennaCL-bindings to branch 'next'
> > (i.e. still a few corners to polish). This means that you could
> > technically use the much more JIT-friendly OpenCL (and, as a
> > follow-up, complain to NVIDIA and AMD about the higher latencies
> > compared with CUDA).
> >
> > > We compute local assembly matrices on the GPU and a crucial
> > > requirement is that the matrix *only* lives in device memory; we
> > > want to avoid any host <-> device data transfers.
> >
> > One of the reasons why - despite its attractiveness - this hasn't
> > taken off is because good preconditioners are typically still
> > required in such a setting. Other than the smoothed aggregation in
> > CUSP, there is not much which does *not* require a copy to the host.
> > Particularly when thinking about multi-GPU, you're entering the
> > regime where a good preconditioner on the CPU will still outperform
> > a GPU assembly with a poor preconditioner.
> >
> > > So far we have been using CUSP with a custom (generated) assembly
> > > into our own CUSP-compatible CSR data structure for a single GPU.
> > > Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> > > rather use existing infrastructure that works than roll our own.
> >
> > I guess this is good news for you: Steve Dalton will work with us
> > during the summer to extend the CUSP-SA-AMG to distributed memory.
> > Other than that, I think there's currently only the functionality
> > from CUSPARSE and polynomial preconditioners, available through the
> > txpetscgpu package.
> >
> > Aside from that, I also have a couple of plans on that front spinning
> > in my head, but I haven't yet found the time to implement them.
> >
> > > At the time of [1] supporting GPU assembly in one form or the other
> > > was on the roadmap, but the implementation direction did not seem
> > > to have been finally decided. Has there been any progress since
> > > then, or is there anything to add to the discussion? Is there even
> > > (experimental) code we might be able to use? Note that we're using
> > > petsc4py to interface to PETSc.
> >
> > Did you have a look at snes/examples/tutorials/ex52? I'm currently
> > converting/extending this to OpenCL, so it serves as a playground
> > for a future interface. Matt might have some additional comments on
> > this.
> >
> > I like to be very precise in the terminology. Doing the cell integrals
> > on the GPU (integration) is worthwhile, whereas
> > inserting the element matrices into a global representation like CSR
> > (assembly) takes no time and can be done
> > almost any way, including on the CPU. I stopped working on assembly
> > because it made no difference.
>
> The actual insertion (as in MatSetValues) may not take up much time on
> either the CPU or the GPU, provided it is done where the integration was
> done. As I mentioned before, we do both the integration and the solve on
> the GPU. We don't even allocate data in host memory. Therefore it
> wouldn't make much sense to do the addto on the host since it would
> require device -> host data transfer of all the cell integrals and host
> -> device of the CSR, which would make it quite expensive.
>
> One option we considered was creating a MatShell and providing an SPMV
> callback, probably calling a CUSP kernel on each MPI rank. That
> restricts the available preconditioners, but as mentioned, without doing
> any data transfers we'd be restricted to GPU-only preconditioners
> anyway. Any thoughts on this compared to the strategies mentioned above?
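A rough petsc4py sketch of that MatShell idea: a Python "shell" matrix whose
mult() callback stands in for the GPU SpMV. The GPUOperator class and the
identity action below are illustrative assumptions (the CUSP call is only
indicated by a comment), not existing code.

    from petsc4py import PETSc

    class GPUOperator(object):
        """Context providing the matrix action y = A*x."""
        def mult(self, mat, x, y):
            # Here one would hand x's device data to a CUSP/CUDA SpMV
            # kernel and write the result into y. As a stand-in, apply
            # the identity so the sketch actually runs.
            x.copy(y)

    n = 100                                       # hypothetical local size
    A = PETSc.Mat().createPython((n, n), context=GPUOperator())
    A.setUp()

    # Shell matrices restrict the preconditioner choices, as noted above.
    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType('cg')
    ksp.getPC().setType('none')

    x = PETSc.Vec().createSeq(n)
    b = x.duplicate()
    b.set(1.0)
    ksp.solve(b, x)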
>
What about just creating your CUSP matrix and then shoving it into a
MATAIJCUSP? That is what I did for my assembly tests.

For GPU-only preconditioners, I would focus on the CUSP AMG using
Chebyshev for the smoothers.
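A hedged sketch of how such a GPU-only solver/preconditioner combination
might be selected from petsc4py through the options database. The type and
option names reflect the petsc-3.3/3.4-era GPU support (aijcusp matrices,
cusp vectors, the sacusp preconditioner) and are assumptions that should be
checked against the installed PETSc version.

    from petsc4py import PETSc

    # Objects must call setFromOptions() for these settings to take effect.
    opts = PETSc.Options()
    opts['mat_type'] = 'aijcusp'   # keep the assembled matrix on the GPU
    opts['vec_type'] = 'cusp'      # keep vectors on the GPU as well
    opts['ksp_type'] = 'cg'
    opts['pc_type']  = 'sacusp'    # CUSP smoothed-aggregation preconditioner
    # A GAMG variant with an explicitly GPU-friendly smoother might look like:
    #   opts['pc_type'] = 'gamg'
    #   opts['mg_levels_ksp_type'] = 'chebyshev'
    #   opts['mg_levels_pc_type'] = 'jacobi'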
Matt
> Thanks,
> Florian
>
> > Thanks,
> >
> > Matt
> >
> >
> > Best regards,
> > Karli
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener