[petsc-dev] PetscSF and/or VecScatter with device pointers

Karl Rupp rupp at iue.tuwien.ac.at
Sat Jul 14 17:08:10 CDT 2018


Hi,


> we're starting to explore (with Andreas cc'd) residual assembly on
> GPUs.  The question naturally arises: how to do GlobalToLocal and
> LocalToGlobal.
> 
> I have:
> 
> A PetscSF describing the communication pattern.
> 
> A Vec holding the data to communicate.  This will have an up-to-date
> device pointer.
> 
> I would like:
> 
> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
> work with raw device pointers.  I am led to believe that modern MPIs
> can plug directly into device memory, so I would like to avoid copying
> data to the host, doing the communication there, and then going back
> up to the device.

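For reference, the host-pointer calling sequence in question looks
roughly like this (a sketch only; the exact signatures may differ
between PETSc releases):

#include <petscsf.h>

/* Sketch of the existing host-pointer usage: rootdata/leafdata are host
   arrays here; the request above is to allow raw device pointers instead. */
PetscErrorCode BcastRootsToLeaves(PetscSF sf, const PetscScalar *rootdata, PetscScalar *leafdata)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  /* ... overlap other work here ... */
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
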
I don't know how far the CUDA software stack has advanced recently, but
you usually want to do your best to avoid latency hits across PCI
Express. That is, packing the ghost data you want to communicate (as
described by the SF) on the GPU, sending only the packed data over, and
then unpacking on the host (one could optimize this step further if
needed) will most likely be much better, both in latency and in use of
the limited PCI-Express bandwidth, than what Unified Memory approaches
can provide.

If you want to use OpenCL, you'll have to do the above anyway.
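
To make the pack-then-transfer idea above concrete, here is a minimal
sketch in plain CUDA + MPI (outside of PETSc; all names such as
pack_ghosts, ghost_idx and nsend are made up for illustration):

#include <mpi.h>
#include <cuda_runtime.h>

/* Gather only the entries that need to be communicated into a contiguous buffer. */
__global__ void pack_ghosts(const double *x, const int *ghost_idx, double *buf, int nghost)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nghost) buf[i] = x[ghost_idx[i]];
}

void exchange_ghosts(const double *d_x, const int *d_ghost_idx, int nsend,
                     double *h_sendbuf, double *h_recvbuf, int nrecv,
                     int peer, MPI_Comm comm)
{
  double *d_buf;
  cudaMalloc(&d_buf, nsend * sizeof(double));
  pack_ghosts<<<(nsend + 255) / 256, 256>>>(d_x, d_ghost_idx, d_buf, nsend);

  /* Move only the packed ghost data across PCI Express ... */
  cudaMemcpy(h_sendbuf, d_buf, nsend * sizeof(double), cudaMemcpyDeviceToHost);

  /* ... and exchange it with the neighbour; the received data is then
     unpacked on the host (or copied back up if it is needed on the device). */
  MPI_Sendrecv(h_sendbuf, nsend, MPI_DOUBLE, peer, 0,
               h_recvbuf, nrecv, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);
  cudaFree(d_buf);
}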


> Given that the window implementation (which just delegates all the
> packing to MPI) is not considered ready for prime time (mostly due to
> MPI implementation bugs, I think), I think this means implementing a
> version of PetscSF_Basic that can handle the pack/unpack directly on
> the device, and then just hands off to MPI.
> 
> The next thing is how to put a higher-level interface on top of this.
> What suggestions, if any, are there for making the top-level API
> agnostic to whether the data are on the host or the device?
> 
> We had thought something like:
> 
> - Make PetscSF handle device pointers (possibly with new implementation?)
> 
> - Make VecScatter use SF.
> 
> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
> just uses the SF directly.

There are already CUDA-specific optimizations available for VecScatter.
I'm happy to help with adapting those to SF within the next week if
needed.
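
On the user-facing side the calling sequence stays the same: with a
CUDA-enabled PETSc build and the vector type set to VECCUDA, the data
stays on the device and the library picks its CUDA code paths
internally. A rough sketch (error handling abbreviated):

#include <petscvec.h>

/* Rough sketch: the scatter calls are identical whether the Vec data
   lives on the host or the device. */
PetscErrorCode GhostUpdate(VecScatter scatter, Vec xglobal, Vec xlocal)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecScatterBegin(scatter, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scatter, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}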


> Have there been any thoughts about how you want to do multi-GPU
> interaction?

Just use MPI with one GPU per MPI rank :-)
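
A common way to set that up is to pick the device from the node-local
rank, e.g. (a sketch; it assumes as many ranks per node as GPUs,
otherwise it simply cycles through the available devices):

#include <mpi.h>
#include <cuda_runtime.h>

/* Bind each MPI rank to one GPU, using the node-local rank as device number. */
void bind_rank_to_gpu(MPI_Comm comm)
{
  MPI_Comm local;
  int      local_rank, ndevices;

  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
  MPI_Comm_rank(local, &local_rank);
  cudaGetDeviceCount(&ndevices);
  cudaSetDevice(local_rank % ndevices);
  MPI_Comm_free(&local);
}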

Best regards,
Karli

