[petsc-dev] PetscSF and/or VecScatter with device pointers
Lawrence Mitchell
lawrence.mitchell at imperial.ac.uk
Fri Jul 13 03:15:34 CDT 2018
> On 12 Jul 2018, at 22:08, Jed Brown <jed at jedbrown.org> wrote:
>
...
>>> I have:
>>>
>>> A PetscSF describing the communication pattern.
>>>
>>> A Vec holding the data to communicate. This will have an up-to-date
>>> device pointer.
>>>
>>> I would like:
>>>
>>> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
>>> work with raw device pointers. I am led to believe that modern MPIs
>>> can plug directly into device memory, so I would like to avoid copying
>>> data to the host, doing the communication there, and then going back
>>> up to the device.
>>>
>>> Given that the window implementation (which just delegates all the
>>> packing to MPI) is not considered ready for prime time (mostly due to
>>> MPI implementation bugs, I think), I think this means implementing a
>>> version of PetscSF_Basic that handles the pack/unpack directly on the
>>> device and then just hands off to MPI.
>>>
>>
>> I think that is the case.
>
> I doubt GPU Direct can give high performance for the derived data types
> that the SF Window implementation uses (if it works at all).
MVAPICH claims to support datatypes (including non-contiguous ones) with GPUDirect, and one-sided DMA. But I'm willing to believe that this is all lies.
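In any case, to make the interface we are after concrete, the sort of call sequence we would like to be able to write is roughly the following. Only a sketch: VecCUDAGetArrayRead/Write and PetscSFBcastBegin/End exist today, but an SF implementation that accepts the raw device pointers is exactly the missing piece.

#include <petscsf.h>
#include <petscvec.h>   /* older PETSc may also need petsccuda.h for the VecCUDA* calls */

/* Sketch: broadcast root values to leaves without leaving the device.
   The SF device path assumed here does not exist yet. */
PetscErrorCode BcastOnDevice(PetscSF sf, Vec rootvec, Vec leafvec)
{
  const PetscScalar *rootdata;   /* raw device pointers from VECCUDA Vecs */
  PetscScalar       *leafdata;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  ierr = VecCUDAGetArrayRead(rootvec, &rootdata);CHKERRQ(ierr);
  ierr = VecCUDAGetArrayWrite(leafvec, &leafdata);CHKERRQ(ierr);
  /* hypothetical: SF packs on the GPU and hands buffers to a GPU-aware MPI */
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayWrite(leafvec, &leafdata);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayRead(rootvec, &rootdata);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}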
>>> The next thing is how to put a higher-level interface on top of this.
>>> What suggestions, if any, are there for doing something where the
>>> top-level API is agnostic to whether the data are on the host or the
>>> device?
>>>
>>> We had thought something like:
>>>
>>> - Make PetscSF handle device pointers (possibly with a new implementation?)
>>>
>>> - Make VecScatter use SF.
>> Yep, this is what I would do.
>
> Agreed.
OK. We'll have a look at getting this done.
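The shape we have in mind for the SF_Basic device path is roughly the following (a hypothetical sketch, not PETSc code; PackRoots and SFPackAndSend are made-up names): gather the root entries selected by the (device-resident) index list into a contiguous device buffer with a small kernel, then hand that buffer straight to a GPU-aware MPI instead of staging through the host.

#include <petscsys.h>

/* hypothetical device pack: gather the selected root entries into a
   contiguous send buffer, all in device memory */
__global__ void PackRoots(PetscInt n, const PetscInt *idx,
                          const PetscScalar *rootdata, PetscScalar *sendbuf)
{
  PetscInt i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) sendbuf[i] = rootdata[idx[i]];
}

/* host side, per destination rank: idx, rootdata and sendbuf all live on
   the device, so a GPU-aware MPI can read sendbuf directly (GPUDirect) */
static PetscErrorCode SFPackAndSend(PetscInt n, const PetscInt *idx,
                                    const PetscScalar *rootdata, PetscScalar *sendbuf,
                                    PetscMPIInt dest, MPI_Comm comm, MPI_Request *req)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  PackRoots<<<(n + 255)/256, 256>>>(n, idx, rootdata, sendbuf);
  cudaDeviceSynchronize();   /* send buffer must be complete before MPI reads it */
  ierr = MPI_Isend(sendbuf, (PetscMPIInt)n, MPIU_SCALAR, dest, 0, comm, req);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}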
>>> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
>>> just uses the SF directly.
>>>
>>> Have there been any thoughts about how you want to do multi-GPU
>>> interaction?
>
> With MPI-parallel code, I don't see a compelling reason to support
> multiple devices per MPI process.
Miscommunication: by multi-GPU I mean one device per MPI process. I just meant: if there is existing PETSc effort going towards supporting computation on the device, are there thoughts, above and beyond what I just described, on how you want to hide device-device transfers behind the API?
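Concretely, by "hiding it behind the API" I mean that the calling code should not change at all between host and device vectors. A sketch (VecScatterBegin/End are the existing interface; a device-capable path underneath them is the part that does not exist yet):

#include <petscvec.h>

/* The scatter call itself is type-agnostic: x and y may be VECSTANDARD or
   VECCUDA.  With a device-capable SF underneath, the VECCUDA case would
   move data device-to-device without ever touching host memory. */
PetscErrorCode ScatterAgnostic(VecScatter ctx, Vec x, Vec y)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecScatterBegin(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}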
Cheers,
Lawrence