[petsc-dev] PetscSF and/or VecScatter with device pointers

Jed Brown jed at jedbrown.org
Thu Jul 12 16:08:28 CDT 2018


Matthew Knepley <knepley at gmail.com> writes:

> On Thu, Jul 12, 2018 at 6:47 AM Lawrence Mitchell <lawrence.mitchell at imperial.ac.uk> wrote:
>
>> Dear petsc-dev,
>>
>> we're starting to explore (with Andreas cc'd) residual assembly on
>> GPUs.  The question naturally arises: how to do GlobalToLocal and
>> LocalToGlobal.
>>
>
> There is not a lot of memory bandwidth difference between a GPU and a
> Skylake, but I assume this is to use hardware already purchased by some
> center.

Skylake Xeon is around 100 GB/s per socket, versus a V100 at about 750
GB/s.  That's nothing to sneeze at, but moving the entire vector to the
host just to pack messages is a much bigger hit for large subdomains
because the entire volume needs to move over PCI-Express.
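
To put rough numbers on it (illustrative figures, not measurements): a
subdomain with 10^6 dofs is 8 MB of doubles, so staging the whole vector
across a ~12 GB/s PCI-Express 3.0 x16 link costs on the order of 0.7 ms in
each direction, every time you communicate.  The ghost region you actually
need to pack is typically a few percent of that, a few hundred kilobytes,
which is essentially noise by comparison even if it still crosses
PCI-Express, and free at device bandwidth.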

>> I have:
>>
>> A PetscSF describing the communication pattern.
>>
>> A Vec holding the data to communicate.  This will have an up-to-date
>> device pointer.
>>
>> I would like:
>>
>> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
>> work with raw device pointers.  I am led to believe that modern MPIs
>> can plug directly into device memory, so I would like to avoid copying
>> data to the host, doing the communication there, and then going back
>> up to the device.
>>
>> Given that the window implementation (which just delegates all the
>> packing to MPI) is not considered ready for prime time (mostly due to
>> MPI implementation bugs, I think), I think this means implementing a
>> version of PetscSF_Basic that can handle the pack/unpack directly on
>> the device and then just hands off to MPI.
>>
>
> I think that is the case.

I doubt GPU Direct can give high performance for the derived data types
that the SF Window implementation uses (if it works at all).
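
For concreteness, what an SF_Basic-style device path needs is roughly the
following (a sketch, not anything in PETSc today; packIndices, sendbuf,
neighbor, tag, and the launch parameters are all illustrative): pack the
roots into a contiguous device buffer, then hand that device pointer
straight to a CUDA-aware MPI.

#include <petscsf.h>

/* Sketch only: gather root values into a contiguous send buffer on the
   device, then pass the device pointer directly to a CUDA-aware MPI. */
__global__ void PackKernel(PetscInt n, const PetscInt *packIndices,
                           const PetscScalar *rootdata, PetscScalar *sendbuf)
{
  PetscInt i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) sendbuf[i] = rootdata[packIndices[i]];
}

static PetscErrorCode PackAndSend_Device(PetscInt n, const PetscInt *packIndices,
                                         const PetscScalar *rootdata, PetscScalar *sendbuf,
                                         PetscMPIInt neighbor, PetscMPIInt tag,
                                         MPI_Comm comm, MPI_Request *req)
{
  int nblocks = (int)((n + 255)/256);

  PackKernel<<<nblocks, 256>>>(n, packIndices, rootdata, sendbuf);
  cudaStreamSynchronize(0);  /* packed buffer must be complete before MPI touches it */
  MPI_Isend(sendbuf, (PetscMPIInt)n, MPIU_SCALAR, neighbor, tag, comm, req);  /* device pointer */
  return 0;
}

The receive/unpack side is the mirror image; whether the transfer then goes
through GPU Direct RDMA or an internal staging copy becomes the MPI
implementation's problem rather than ours.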

>> The next thing is how to put a higher-level interface on top of this.
>> What suggestions, if any, are there for an approach where the top-level
>> API is agnostic to whether the data are on the host or the device?
>>
>> We had thought something like:
>>
>> - Make PetscSF handle device pointers (possibly with a new implementation?)
>>
>> - Make VecScatter use SF.
>>
>
> Yep, this is what I would do.

Agreed.
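
To make the intended calling sequence concrete, a minimal sketch (assuming
CUDA Vecs with VecCUDAGetArrayRead/Write-style accessors for the raw device
pointers; error checking omitted, and the exact accessor names/headers
depend on the PETSc version):

#include <petscsf.h>
#include <petscvec.h>

/* Sketch of the desired GlobalToLocal: hand raw device pointers to the SF
   and let a device-aware SF implementation do the pack/unpack on the GPU. */
PetscErrorCode GlobalToLocalDevice(PetscSF sf, Vec global, Vec local)
{
  const PetscScalar *garray;  /* device pointer into the global Vec */
  PetscScalar       *larray;  /* device pointer into the local Vec  */

  VecCUDAGetArrayRead(global, &garray);
  VecCUDAGetArrayWrite(local, &larray);
  PetscSFBcastBegin(sf, MPIU_SCALAR, garray, larray);
  PetscSFBcastEnd(sf, MPIU_SCALAR, garray, larray);
  VecCUDARestoreArrayWrite(local, &larray);
  VecCUDARestoreArrayRead(global, &garray);
  return 0;
}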

>> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
>> just uses the SF directly.
>>
>> Have there been any thoughts about how you want to do multi-GPU
>> interaction?

With MPI-parallel code, I don't see a compelling reason to support
multiple devices per MPI process.

> I don't think so, but Karl could reply if there have been.
>
> How are you doing local assembly?
>
>    Matt
>
>
>> Cheers,
>>
>> Lawrence
>>
>
>
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/

