[petsc-dev] PETSc multi-GPU assembly - current status

Florian Rathgeber florian.rathgeber at gmail.com
Thu Jun 6 11:17:09 CDT 2013


On 02/05/13 21:35, Matthew Knepley wrote:
> On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber
> <florian.rathgeber at gmail.com> wrote:
> 
>     On 02/05/13 03:12, Matthew Knepley wrote:
>     > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
>     >
>     >     Hi Florian,
>     >
>     >         This is loosely a follow up to [1]. In this thread a few
>     >         potential ways for making GPU assembly work with PETSc were
>     >         discussed and to me the two most promising appeared to be:
>     >         1) Create a PETSc matrix from a pre-assembled CSR structure, or
>     >         2) Preallocate a PETSc matrix and get the handle to pass the
>     >         row pointer, column indices and values array to a custom
>     >         assembly routine.
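
For reference, option 1) maps quite directly onto petsc4py. A minimal,
untested sketch for the sequential case with made-up array contents; as
far as I can tell this particular constructor takes host-side arrays,
which is part of the problem discussed further down:

    import numpy as np
    from petsc4py import PETSc

    n = 4  # toy problem size
    # pre-assembled CSR structure (placeholder contents; in our case this
    # would come from the GPU assembly and be copied back for this call)
    indptr  = np.asarray([0, 2, 4, 6, 8], dtype=PETSc.IntType)
    indices = np.asarray([0, 1, 1, 2, 2, 3, 0, 3], dtype=PETSc.IntType)
    values  = np.ones(8, dtype=PETSc.ScalarType)

    A = PETSc.Mat().createAIJ(size=(n, n),
                              csr=(indptr, indices, values),
                              comm=PETSc.COMM_SELF)
    A.assemble()

For the distributed case the same call should accept the local rows per
rank, if I read the sources correctly.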
>     >
>     >     I still consider these two to be the most promising (and general)
>     >     approaches. On the other hand, to my knowledge the infrastructure
>     >     hasn't changed a lot since then. Some additional functionality
>     >     from CUSPARSE was added, while I added ViennaCL bindings to branch
>     >     'next' (i.e. still a few corners to polish). This means that you
>     >     could technically use the much more jit-friendly OpenCL (and, as
>     >     a follow-up, complain to NVIDIA and AMD about the higher latencies
>     >     compared to CUDA).
>     >
>     >         We compute local assembly matrices on the GPU and a crucial
>     >         requirement is that the matrix *only* lives in device memory;
>     >         we want to avoid any host <-> device data transfers.
>     >
>     >     One of the reasons why - despite its attractiveness - this hasn't
>     >     taken off is because good preconditioners are typically still
>     >     required in such a setting. Other than the smoothed aggregation in
>     >     CUSP, there is not much which does *not* require a copy to the
>     >     host. Particularly when thinking about multi-GPU you're entering
>     >     the regime where a good preconditioner on the CPU will still
>     >     outperform a GPU assembly with poor preconditioner.
>     >
>     >         So far we have been using CUSP with a custom (generated)
>     >         assembly into our own CUSP-compatible CSR data structure for
>     >         a single GPU. Since CUSP doesn't give us multi-GPU solvers out
>     >         of the box we'd rather use existing infrastructure that works
>     >         than roll our own.
>     >
>     >     I guess this is good news for you: Steve Dalton will work with us
>     >     during the summer to extend the CUSP-SA-AMG to distributed memory.
>     >     Other than that, I think there's currently only the functionality
>     >     from CUSPARSE and polynomial preconditioners, available through
>     >     the txpetscgpu package.
>     >
>     >     Aside from that I also have a couple of plans on that front
>     >     spinning in my head, but I haven't found the time to implement
>     >     them yet.
>     >
>     >         At the time of [1] supporting GPU assembly in one form or the
>     >         other was on the roadmap, but the implementation direction
>     >         seemed to not have been finally decided. Was there any
>     >         progress since then or anything to add to the discussion? Is
>     >         there even (experimental) code we might be able to use? Note
>     >         that we're using petsc4py to interface to PETSc.
>     >
>     >     Did you have a look at snes/examples/tutorials/ex52? I'm currently
>     >     converting/extending this to OpenCL, so it serves as a playground
>     >     for a future interface. Matt might have some additional comments
>     >     on this.
>     >
>     > I like to be very precise in the terminology. Doing the cell integrals
>     > on the GPU (integration) is worthwhile, whereas inserting the element
>     > matrices into a global representation like CSR (assembly) takes no
>     > time and can be done almost any way, including on the CPU. I stopped
>     > working on assembly because it made no difference.
> 
>     The actual insertion (as in MatSetValues) may not take up much time on
>     either the CPU or the GPU, provided it is done where the integration was
>     done. As I mentioned before we do both the integration and the solve on
>     the GPU. We don't even allocate data in host memory. Therefore it
>     wouldn't make much sense to do the addto on the host since it would
>     require device -> host data transfer of all the cell integrals and host
>     -> device of the CSR, which would make it quite expensive.
> 
>     One option we considered was creating a MatShell and providing an SPMV
>     callback, probably calling a CUSP kernel on each MPI rank. That
>     restricts the available preconditioners, but as mentioned, without doing
>     any data transfers we'd be restricted to GPU-only preconditioners
>     anyway. Any thoughts on this compared to the strategies mentioned above?
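
To make that concrete, this is roughly what we have in mind, as an
untested petsc4py sketch; the mult() body is only a placeholder for our
CUSP SpMV kernel and the sizes are made up:

    from petsc4py import PETSc

    class GPUMatCtx(object):
        """Shell-matrix context: y = A*x computed entirely on the device."""
        def __init__(self, dev_csr):
            self.dev_csr = dev_csr  # handle to the device-resident CSR (placeholder)

        def mult(self, mat, x, y):
            # placeholder: launch the CUSP SpMV on the local block, reading
            # from x and writing into y without touching host memory
            raise NotImplementedError("CUSP SpMV kernel goes here")

    N = 1000  # global size, placeholder
    A = PETSc.Mat().createPython(N, context=GPUMatCtx(dev_csr=None),
                                 comm=PETSc.COMM_WORLD)
    A.setUp()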
> 
> 
> What about just creating your CUSP matrix and then shoving it into a
> MATAIJCUSP? That is what I did for my assembly tests.

That'd be the ideal solution. Does this work with MPIAIJ? We're only
really interested in multi-GPU with MPI. In the sequential case we can
just call CUSP directly, but for the MPI-distributed case we'd rather
rely on PETSc to help us out.
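
If that works, I imagine it would look roughly like the following from
petsc4py (untested; the 'mpiaijcusp' type name is what I gather from the
current sources and needs a CUSP-enabled build, and the sizes and
preallocation numbers are placeholders):

    from petsc4py import PETSc

    N = 1000                 # global size, placeholder
    d_nnz, o_nnz = 9, 3      # per-row diag/off-diag nonzero estimates, placeholders

    A = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
    A.setSizes(N)                       # square, global size N, local sizes decided
    A.setType('mpiaijcusp')             # or 'seqaijcusp' in the sequential case
    A.setPreallocationNNZ((d_nnz, o_nnz))
    # values would still go in through MatSetValues on the host here; whether
    # our device-assembled CSR can be handed over without that copy is exactly
    # the open question
    A.assemble()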

Presumably you're referring to the experiments you did for the TOMS
paper? Is that code available somewhere?

> For GPU-only preconditioners, I would focus on the CUSP AMG using
> Chebyshev for the smoothers.

OK. If I understand correctly, when using a MatShell we'd again have to
provide our own PCShell for this?
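
For the record, what I have in mind is something like the following
(untested petsc4py sketch; apply() is just an identity placeholder where
a device-side preconditioner would go):

    from petsc4py import PETSc

    class DevicePC(object):
        """User preconditioner (PCShell analogue via the 'python' PC type)."""
        def setUp(self, pc):
            pass  # build / transfer the preconditioner on the device here

        def apply(self, pc, x, y):
            x.copy(y)  # placeholder for y = M^{-1} x, identity for now

    ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
    ksp.setType('cg')
    pc = ksp.getPC()
    pc.setType(PETSc.PC.Type.PYTHON)
    pc.setPythonContext(DevicePC())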

Florian
