[petsc-dev] PETSc multi-GPU assembly - current status

Florian Rathgeber florian.rathgeber at gmail.com
Thu May 2 15:29:33 CDT 2013


On 02/05/13 03:12, Matthew Knepley wrote:
> On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> 
>     Hi Florian,
> 
>         This is loosely a follow-up to [1]. In this thread a few
>         potential ways for making GPU assembly work with PETSc were
>         discussed, and to me the two most promising appeared to be:
>         1) Create a PETSc matrix from a pre-assembled CSR structure, or
>         2) Preallocate a PETSc matrix and get the handle to pass the row
>         pointer, column indices and values array to a custom assembly
>         routine.
> 
>     I still consider these two to be the most promising (and general)
>     approaches. On the other hand, to my knowledge the infrastructure
>     hasn't changed a lot since then. Some additional functionality from
>     CUSPARSE was added, while I added ViennaCL-bindings to branch 'next'
>     (i.e. still a few corners to polish). This means that you could
>     technically use the much more JIT-friendly OpenCL (and, as a
>     follow-up, complain to NVIDIA and AMD about the higher latencies
>     than with CUDA).
> 
>         We compute local assembly matrices on the GPU, and a crucial
>         requirement is that the matrix *only* lives in device memory;
>         we want to avoid any host <-> device data transfers.
> 
>     One of the reasons why - despite its attractiveness - this hasn't
>     taken off is because good preconditioners are typically still
>     required in such a setting. Other than the smoothed aggregation in
>     CUSP, there is not much which does *not* require a copy to the host.
>     Particularly when thinking about multi-GPU you're entering the
>     regime where a good preconditioner on the CPU will still outperform
>     a GPU assembly with a poor preconditioner.
> 
>         So far we have been using CUSP with a custom (generated)
>         assembly into
>         our own CUSP-compatible CSR data structure for a single GPU.
>         Since CUSP doesn't give us multi-GPU solvers out of the box,
>         we'd prefer to use existing infrastructure that works rather
>         than rolling our own.
> 
>     I guess this is good news for you: Steve Dalton will work with us
>     during the summer to extend the CUSP-SA-AMG to distributed memory.
>     Other than that, I think there's currently only the functionality
>     from CUSPARSE and polynomial preconditioners, available through the
>     txpetscgpu package.
> 
>     Aside from that I also have a couple of plans on that front spinning
>     in my head, but I haven't found the time to implement them yet.
> 
>         At the time of [1], supporting GPU assembly in one form or the
>         other was on the roadmap, but the implementation direction did
>         not yet seem to have been finally decided. Has there been any
>         progress since then, or is there anything to add to the
>         discussion? Is there even (experimental) code we might be able
>         to use? Note that we're using petsc4py to interface to PETSc.
> 
>     Did you have a look at snes/examples/tutorials/ex52? I'm currently
>     converting/extending this to OpenCL, so it serves as a playground
>     for a future interface. Matt might have some additional comments on
>     this.
> 
> I like to be very precise in the terminology. Doing the cell integrals
> on the GPU (integration) is worthwhile, whereas inserting the element
> matrices into a global representation like CSR (assembly) takes no time
> and can be done almost any way, including on the CPU. I stopped working
> on assembly because it made no difference.

The actual insertion (as in MatSetValues) may not take up much time on
either the CPU or the GPU, provided it is done where the integration was
done. As I mentioned before, we do both the integration and the solve on
the GPU and don't even allocate data in host memory. It therefore
wouldn't make much sense to do the addto on the host, since that would
require a device -> host transfer of all the cell integrals and a host
-> device transfer of the assembled CSR, which would make it quite
expensive.
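For context, option (1) from above currently looks roughly like the
petsc4py sketch below. This is a minimal sketch only: with the existing
API the CSR arrays have to be host (numpy) arrays, which is exactly the
device -> host copy we'd like to avoid, and the array contents are made
up for illustration.

    from petsc4py import PETSc
    import numpy as np

    # Toy CSR data standing in for what our GPU assembly produces; with
    # the current API these arrays must live in host memory.
    indptr  = np.array([0, 2, 4], dtype=PETSc.IntType)
    indices = np.array([0, 1, 0, 1], dtype=PETSc.IntType)
    values  = np.array([4.0, -1.0, -1.0, 4.0], dtype=PETSc.ScalarType)

    # Wrap the pre-assembled CSR structure in a PETSc matrix.
    A = PETSc.Mat().createAIJ(size=(2, 2), csr=(indptr, indices, values))
    A.assemble()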

One option we considered was creating a MatShell and providing an SPMV
callback, probably calling a CUSP kernel on each MPI rank. That
restricts the available preconditioners, but as mentioned, without doing
any data transfers we'd be restricted to GPU-only preconditioners
anyway. Any thoughts on this compared to the strategies mentioned above?
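To make the MatShell idea a bit more concrete, here is a rough petsc4py
sketch using the 'python' matrix type (petsc4py's shell mechanism). The
names spmv_on_gpu and my_gpu_csr are purely hypothetical placeholders
for whatever CUSP/CUSPARSE kernel we would call and for our
device-resident CSR structure; only matrix-free-compatible
preconditioners would be usable with such a matrix.

    from petsc4py import PETSc

    class GPUMatContext(object):
        """Shell-matrix context; mult() is called for every SpMV."""
        def __init__(self, gpu_csr):
            self.gpu_csr = gpu_csr   # device-resident CSR (hypothetical)

        def mult(self, mat, x, y):
            # x and y are PETSc Vecs; ideally their data would already
            # live on the device, so this launches a CUSP kernel without
            # any host <-> device copies. spmv_on_gpu is hypothetical.
            spmv_on_gpu(self.gpu_csr, x, y)

    n = 1000  # local problem size, illustrative only
    A = PETSc.Mat().createPython(((n, None), (n, None)),
                                 context=GPUMatContext(my_gpu_csr),
                                 comm=PETSc.COMM_WORLD)
    A.setUp()

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType('cg')
    ksp.getPC().setType('none')  # or another matrix-free-friendly PC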

Thanks,
Florian

>   Thanks,
> 
>      Matt
>  
> 
>     Best regards,
>     Karli
