[petsc-dev] Current status of using streams within PETSc

Junchao Zhang junchao.zhang at gmail.com
Tue Feb 15 14:05:34 CST 2022


Besides the MPI synchronization issue, we also need new async APIs like
VecAXPYAsync() to pass scalars produced on the device.
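
For concreteness, a sketch of what such an API might look like. VecAXPYAsync()
is the name used above, but the signature, the VecDotAsync() companion, and the
variable names are assumptions for illustration, not existing PETSc functions
(error checking omitted):

  PetscScalar        *alpha_d;   /* scalar living in device memory           */
  PetscDeviceContext  dctx;      /* carries the stream the work is queued on */

  /* hypothetical: reduction result is written to device memory, not the host */
  VecDotAsync(x, y, alpha_d, dctx);
  /* hypothetical: consumes the device-resident scalar with no host round-trip */
  VecAXPYAsync(w, alpha_d, x, dctx);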

--Junchao Zhang


On Tue, Feb 15, 2022 at 10:11 AM Jed Brown <jed at jedbrown.org> wrote:

> Note that operations that don't have communication (like VecAXPY and
> VecPointwiseMult) are already non-blocking on streams. (A recent Thrust
> update helped us recover what had silently become blocking in a previous
> release.) For multi-rank, operations like MatMult require communication, and
> MPI doesn't provide a way to make that communication nonblocking with respect
> to streams. We've had some issues/bugs with using NVSHMEM to bypass MPI.
>
> MPI implementors have been really skeptical of placing MPI operations on
> streams (like NCCL/RCCL or NVSHMEM). Cray's MPI doesn't have anything to do
> with streams, device memory is cacheable on the host, and RDMA operations
> are initiated on the host without device logic being involved. I feel like
> it's going to take company investment or a very enterprising systems
> researcher to make the case for getting messaging to play well with
> streams. Perhaps it's a better use of time to focus on reducing the latency of
> notifying the host when RDMA completes and on reducing kernel launch time. In
> short, there are many unanswered questions regarding truly asynchronous
> Krylov solvers. But in the most obvious places for async, it works
> currently.
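
A rough illustration of which pieces are already asynchronous: in a CG-like
iteration body with GPU vector/matrix types (say -vec_type cuda -mat_type
aijcusparse), the local updates are enqueued on a stream and return
immediately, while the communication and the host-side reductions are where
synchronization currently enters. Variable names are illustrative and error
checking is omitted:

  MatMult(A, p, Ap);            /* needs MPI communication: not stream-placed    */
  VecDot(p, Ap, &pAp);          /* reduction returning a host scalar: must sync  */
  alpha = rr / pAp;             /* host arithmetic on the reduced value          */
  VecAXPY(x, alpha, p);         /* local kernel: already nonblocking on a stream */
  VecAXPY(r, -alpha, Ap);       /* local kernel: already nonblocking on a stream */
  VecPointwiseMult(z, Dinv, r); /* local kernel (Jacobi apply): nonblocking      */
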
>
> Jacob Faibussowitsch <jacob.fai at gmail.com> writes:
>
> > New code can (and absolutely should) use it right away;
> > PetscDeviceContext has been fully functional since it was merged. Remember,
> > though, that it works on a “principled parallelism” model; the caller is
> > responsible for proper serialization.
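
A minimal sketch of what "the caller is responsible for serialization" means in
practice, using the PetscDeviceContext API roughly as it exists in PETSc main
around this time; exact names (in particular the join mode) may differ, and
error checking is omitted:

  PetscDeviceContext parent, *sub;

  PetscDeviceContextCreate(&parent);
  PetscDeviceContextSetUp(parent);
  /* fork two independent streams; work queued on sub[0] and sub[1] may overlap */
  PetscDeviceContextFork(parent, 2, &sub);
  /* ... enqueue independent kernels on sub[0] and sub[1] here ... */
  /* the caller, not the library, decides when the streams must be ordered */
  PetscDeviceContextJoin(parent, 2, PETSC_DEVICE_CONTEXT_JOIN_DESTROY, &sub);
  PetscDeviceContextSynchronize(parent);  /* explicit host-side sync point */
  PetscDeviceContextDestroy(&parent);
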
> >
> > Existing code? Not so much. In broad strokes, the following sections need
> > support before parallelism can be achieved from user code:
> >
> > 1. Vec     - WIP (feature complete, now in bug-fixing stage)
> > 2. PetscSF - TODO
> > 3. Mat     - TODO
> > 4. KSP/PC  - TODO
> >
> > Seeing as each MR for this has taken me roughly 3-4 months to merge so far,
> > and with the later sections requiring enormous rewrites and API changes, I
> > don’t expect this to be finished for at least 2 years… Once the Vec MR is
> > merged you could theoretically run with -device_context_stream_type
> > default_blocking and achieve “asynchronous” compute, but nothing would work
> > properly, as every other part of PETSc expects to be synchronous.
> >
> > That being said, I would be happy to give a demo at the next developers
> > meeting on how people can integrate PetscDeviceContext into their code. It
> > would go a long way toward cutting down the timeline.
> >
> >> On Feb 15, 2022, at 02:02, Stefano Zampini <stefano.zampini at gmail.com>
> >> wrote:
> >>
> >> Jacob
> >>
> >> what is the current status of the async support in PETSc?
> >> Can you summarize here? Is there any documentation available?
> >>
> >> Thanks
> >> --
> >> Stefano
>