[petsc-dev] model for parallel ASM

Mark Adams mfadams at lbl.gov
Mon Jan 11 14:07:29 CST 2021


I've managed to read through your MR posts today.

As I said there, getting parallel fieldsplit/additive solves is what I'm
after. Maybe start with Jacobi. And I am MPI serial, so my scatters/gathers
are local in PCApply_FieldSplit.

FYI, Christian Trott responded with this regarding the future Kokkos
interface for streams:
Trott: So basically we are still fixing some kinks in the whole thing (you
can imagine there are a million funky corner cases if you suddenly start
throwing around even more asynchronous stuff, in particular if you
potentially do that from multiple host threads, and if it's not even clear
whether a stream object is permanently associated with one thread or passed
back and forth). But in the future you will be able to do something like

  auto instances =
      Kokkos::partition_exec_space(Kokkos::DefaultExecutionSpace(), 4);

and get four semantically independent execution spaces back. With some
backends they will still serialize (like CPU OpenMP), since you just get
back the default instance four times; on stuff like CUDA you get
independent streams, and on other thread-based backends we may partition
the thread pool for real.
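
For concreteness, here is a rough sketch of how we might use those
partitioned instances for independent subdomain solves once this lands.
The partition_exec_space call is the prospective API quoted above and the
kernel body is only a stand-in, so treat this purely as illustration:

  #include <Kokkos_Core.hpp>

  // Launch one (stand-in) subdomain "solve" per execution space instance so
  // the kernels can overlap instead of serializing on the default stream.
  void solve_subdomains(int nblocks, int block_size,
                        Kokkos::View<double **> x, Kokkos::View<double **> b)
  {
    // Prospective API from the quote above: split the default execution
    // space into nblocks semantically independent instances.
    auto instances =
        Kokkos::partition_exec_space(Kokkos::DefaultExecutionSpace(), nblocks);

    for (int d = 0; d < nblocks; ++d) {
      Kokkos::parallel_for(
          Kokkos::RangePolicy<>(instances[d], 0, block_size),
          KOKKOS_LAMBDA(int i) {
            x(d, i) = b(d, i); // stand-in for the real per-subdomain solve
          });
    }
    Kokkos::fence(); // wait on all instances before using the results
  }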



On Mon, Jan 11, 2021 at 11:52 AM Jacob Faibussowitsch <jacob.fai at gmail.com>
wrote:

> Hmm, I suppose this means Kokkos should accept a stream like we expect it
> to? According to this somewhat recent merged PR:
> https://github.com/kokkos/kokkos/pull/1919 you can now make a
> "Kokkos::Cuda" object and pass it as the first argument to range policies
> as an execution space. Here's what I found on it (the cuda-specific one is
> useless):
>
> https://github.com/kokkos/kokkos/wiki/ExecutionSpaceConcept
> https://github.com/kokkos/kokkos/wiki/Kokkos::ExecutionSpaceConcept
> https://github.com/kokkos/kokkos/wiki/Kokkos::Cuda  <—— cuda specific
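>
> Something like this, I think (an untested sketch on my end; the stream is
> whatever we already own on the PETSc side):
>
>   #include <Kokkos_Core.hpp>
>   #include <cuda_runtime.h>
>
>   void axpy_on_stream(cudaStream_t stream, int n, double alpha,
>                       Kokkos::View<double *, Kokkos::CudaSpace> x,
>                       Kokkos::View<double *, Kokkos::CudaSpace> y)
>   {
>     // Wrap an existing CUDA stream in a Kokkos execution space instance.
>     Kokkos::Cuda exec(stream);
>
>     // Passing the instance as the first argument of the policy launches
>     // the kernel on that stream instead of the default one.
>     Kokkos::parallel_for(
>         Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, n),
>         KOKKOS_LAMBDA(int i) { y(i) += alpha * x(i); });
>   }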
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> Cell: (312) 694-3391
>
> On Jan 11, 2021, at 10:35, Mark Adams <mfadams at lbl.gov> wrote:
>
> Jacob, I'm not sure I understand this response. I could not find you on
> the Kokkos slack channel.
>
> Me: My colleague in PETSc, Jacob Faibussowitsch, has talked to you about
> Kokkos taking a Cuda, Hip, etc., stream. This is something that would make
> it easier to deal with asynchronous GPU solvers in PETSc. We just wanted to
> check on this.
>
> Trott: Kokkos itself can do it for practically every operation
>
> Maybe you want to talk with him at some point, but we can worry about
> getting Cuda to work for now.
>
> On Sun, Jan 10, 2021 at 2:28 PM Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
>> I would like, as much as possible, to pass the cuda and hip streams to
>> Kokkos, since I can directly handle much of the annoyance of wrangling
>> multiple streams and stream objects externally. Last I checked on this,
>> Kokkos was moving towards allowing streams to be associated with
>> functions, but admittedly that was a while back.
>>
>> Best regards,
>>
>> Jacob Faibussowitsch
>> (Jacob Fai - booss - oh - vitch)
>> Cell: (312) 694-3391
>>
>> On Jan 10, 2021, at 13:10, Mark Adams <mfadams at lbl.gov> wrote:
>>
>>
>>
>> On Sat, Jan 9, 2021 at 7:37 PM Jacob Faibussowitsch <jacob.fai at gmail.com>
>> wrote:
>>
>>> It is a single object that holds a pointer to every stream
>>> implementation and a toggleable type, so it can be passed around
>>> universally. Currently it has a cudaStream and a hipStream, but this is
>>> easily extendable to any other stream implementation.
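>>>
>>> A minimal sketch of that layout, with hypothetical names (not the actual
>>> code in the MR):
>>>
>>>   typedef enum {PETSC_STREAM_CUDA, PETSC_STREAM_HIP} PetscStreamType;
>>>
>>>   typedef struct {
>>>     PetscStreamType type;       /* which backend is currently active */
>>>     void           *cudastream; /* holds a cudaStream_t              */
>>>     void           *hipstream;  /* holds a hipStream_t               */
>>>   } PetscStreamObject;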
>>>
>>
>> Do you have any thoughts on how this would work with Kokkos?
>>
>> Would you want to feed Kokkos your Cuda/Hip, etc, stream or add a Kokkos
>> backend to your object?
>>
>> Junchao might be the person to ask. I would guess Kokkos View (vector)
>> objects carry a stream, because a "deep_copy" that moves data to/from the
>> GPU is blocking.
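>>
>> For what it's worth, I believe Kokkos also has a deep_copy overload that
>> takes an execution space instance and is then asynchronous with respect to
>> the host. A tiny sketch, not tied to any PETSc code:
>>
>>   void copy_to_device(int n) {
>>     Kokkos::View<double *> d_x("d_x", n);
>>     auto h_x = Kokkos::create_mirror_view(d_x);
>>
>>     Kokkos::DefaultExecutionSpace exec;
>>     Kokkos::deep_copy(exec, d_x, h_x); // enqueued on exec, host not blocked
>>     exec.fence();                      // wait only on this instance
>>   }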
>>
>> Thanks,
>> Mark
>>
>>
>>> Best regards,
>>>
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>> Cell: +1 (312) 694-3391
>>>
>>> On Jan 9, 2021, at 18:19, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> 
>>> Is this stream object going to have Cuda, Kokkos, etc., implementations?
>>>
>>> On Sat, Jan 9, 2021 at 4:09 PM Jacob Faibussowitsch <jacob.fai at gmail.com>
>>> wrote:
>>>
>>>> I’m currently working on an implementation of a general PetscStream
>>>> object. So far it only supports Vector ops and has a proof-of-concept
>>>> KSPCG, but it should be extensible to other objects when finished.
>>>> Junchao is also indirectly working on pipeline support in his NVSHMEM
>>>> MR. Take a look at either MR; it would be very useful to get your input,
>>>> as tailoring either of these approaches for pipelined algorithms is key.
>>>>
>>>> Best regards,
>>>>
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>> Cell: (312) 694-3391
>>>>
>>>> On Jan 9, 2021, at 15:01, Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>> I would like to put a non-overlapping ASM solve on the GPU. It's not
>>>> clear that we have a model for this.
>>>>
>>>> PCApply_ASM currently pipelines the scatter with the subdomain solves. I
>>>> think we would want to change this and do a 1) scatter-begin loop, 2)
>>>> scatter-end and non-blocking solve loop, 3) solve-wait and scatter-begin
>>>> loop, and 4) scatter-end loop.
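>>>>
>>>> In pseudocode, something like this (the member names are illustrative
>>>> rather than the actual PC_ASM internals, and KSPSolveBegin/KSPSolveEnd
>>>> stand in for the hypothetical non-blocking solve and solve-wait mentioned
>>>> below):
>>>>
>>>>   for (i = 0; i < n_local; ++i)   /* 1) start all restrictions */
>>>>     VecScatterBegin(osm->restriction[i], x, osm->x[i], INSERT_VALUES, SCATTER_FORWARD);
>>>>   for (i = 0; i < n_local; ++i) { /* 2) finish scatter, launch non-blocking solve */
>>>>     VecScatterEnd(osm->restriction[i], x, osm->x[i], INSERT_VALUES, SCATTER_FORWARD);
>>>>     KSPSolveBegin(osm->ksp[i], osm->x[i], osm->y[i]);
>>>>   }
>>>>   for (i = 0; i < n_local; ++i) { /* 3) solve-wait, start prolongation */
>>>>     KSPSolveEnd(osm->ksp[i], osm->x[i], osm->y[i]);
>>>>     VecScatterBegin(osm->prolongation[i], osm->y[i], y, ADD_VALUES, SCATTER_REVERSE);
>>>>   }
>>>>   for (i = 0; i < n_local; ++i)   /* 4) finish all prolongations */
>>>>     VecScatterEnd(osm->prolongation[i], osm->y[i], y, ADD_VALUES, SCATTER_REVERSE);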
>>>>
>>>> I'm not sure how to go about doing this.
>>>>  * Should we make a new PCApply_ASM_PARALLEL or dump this pipelining
>>>> algorithm and rewrite PCApply_ASM?
>>>>  * Add a solver-wait method to KSP?
>>>>
>>>> Thoughts?
>>>>
>>>> Mark
>>>>
>>>>
>>>>
>>
>