[petsc-dev] PetscSF and/or VecScatter with device pointers
Lawrence Mitchell
lawrence.mitchell at imperial.ac.uk
Fri Jul 13 03:15:34 CDT 2018
> On 12 Jul 2018, at 22:08, Jed Brown <jed at jedbrown.org> wrote:
>
...
>>> I have:
>>>
>>> A PetscSF describing the communication pattern.
>>>
>>> A Vec holding the data to communicate. This will have an up-to-date
>>> device pointer.
>>>
>>> I would like:
>>>
>>> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
>>> work with raw device pointers. I am led to believe that modern MPIs
>>> can plug directly into device memory, so I would like to avoid copying
>>> data to the host, doing the communication there, and then going back
>>> up to the device.
>>>
>>> Given that the window implementation (which just delegates all the
>>> packing to MPI) is not considered ready for prime time (mostly due to
>>> MPI implementation bugs, I think), I think this means implementing a
>>> version of PetscSF_Basic that handles the pack/unpack directly on the
>>> device and then just hands off to MPI.
>>>
>>
>> I think that is the case.
>
> I doubt GPU Direct can give high performance for the derived data types
> that the SF Window implementation uses (if it works at all).
MVAPICH claims to support datatypes (including non-contiguous ones) with GPUDirect, and one-sided DMA. But I'm willing to believe that this is all lies.
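In any case, to make the interface we are after concrete, the sort of call sequence we would like to be able to write is roughly the following. Only a sketch: VecCUDAGetArrayRead/Write and PetscSFBcastBegin/End exist today, but an SF implementation that accepts the raw device pointers is exactly the missing piece.

#include <petscsf.h>
#include <petscvec.h>   /* older PETSc may also need petsccuda.h for the VecCUDA* calls */

/* Sketch: broadcast root values to leaves without leaving the device.
   The SF device path assumed here does not exist yet. */
PetscErrorCode BcastOnDevice(PetscSF sf, Vec rootvec, Vec leafvec)
{
  const PetscScalar *rootdata;   /* raw device pointers from VECCUDA Vecs */
  PetscScalar       *leafdata;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  ierr = VecCUDAGetArrayRead(rootvec, &rootdata);CHKERRQ(ierr);
  ierr = VecCUDAGetArrayWrite(leafvec, &leafdata);CHKERRQ(ierr);
  /* hypothetical: SF packs on the GPU and hands buffers to a GPU-aware MPI */
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayWrite(leafvec, &leafdata);CHKERRQ(ierr);
  ierr = VecCUDARestoreArrayRead(rootvec, &rootdata);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}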
>>> The next thing is how to put a higher-level interface on top of this.
>>> What suggestions, if any, are there for doing something where the
>>> top-level API is agnostic to whether the data are on the host or the
>>> device?
>>>
>>> We had thought something like:
>>>
>>> - Make PetscSF handle device pointers (possibly with a new implementation?)
>>>
>>> - Make VecScatter use SF.
>> Yep, this is what I would do.
>
> Agreed.
OK. We'll have a look at getting this done.
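The shape we have in mind for the SF_Basic device path is roughly the following (a hypothetical sketch, not PETSc code; PackRoots and SFPackAndSend are made-up names): gather the root entries selected by the (device-resident) index list into a contiguous device buffer with a small kernel, then hand that buffer straight to a GPU-aware MPI instead of staging through the host.

#include <petscsys.h>

/* hypothetical device pack: gather the selected root entries into a
   contiguous send buffer, all in device memory */
__global__ void PackRoots(PetscInt n, const PetscInt *idx,
                          const PetscScalar *rootdata, PetscScalar *sendbuf)
{
  PetscInt i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) sendbuf[i] = rootdata[idx[i]];
}

/* host side, per destination rank: idx, rootdata and sendbuf all live on
   the device, so a GPU-aware MPI can read sendbuf directly (GPUDirect) */
static PetscErrorCode SFPackAndSend(PetscInt n, const PetscInt *idx,
                                    const PetscScalar *rootdata, PetscScalar *sendbuf,
                                    PetscMPIInt dest, MPI_Comm comm, MPI_Request *req)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  PackRoots<<<(n + 255)/256, 256>>>(n, idx, rootdata, sendbuf);
  cudaDeviceSynchronize();   /* send buffer must be complete before MPI reads it */
  ierr = MPI_Isend(sendbuf, (PetscMPIInt)n, MPIU_SCALAR, dest, 0, comm, req);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}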
>>> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
>>> just uses the SF directly.
>>>
>>> Have there been any thoughts about how you want to do multi-GPU
>>> interaction?
>
> With MPI-parallel code, I don't see a compelling reason to support
> multiple devices per MPI process.
Miscommunication: by multi-GPU I mean one device per MPI process. I just meant: if there is existing PETSc effort going towards supporting computation on the device, are there thoughts, above and beyond what I just described, on how you want to hide device-device transfers behind the API?
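Concretely, by "hiding it behind the API" I mean that the calling code should not change at all between host and device vectors. A sketch (VecScatterBegin/End are the existing interface; a device-capable path underneath them is the part that does not exist yet):

#include <petscvec.h>

/* The scatter call itself is type-agnostic: x and y may be VECSTANDARD or
   VECCUDA.  With a device-capable SF underneath, the VECCUDA case would
   move data device-to-device without ever touching host memory. */
PetscErrorCode ScatterAgnostic(VecScatter ctx, Vec x, Vec y)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecScatterBegin(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}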
Cheers,
Lawrence