<div dir="ltr">Besides the MPI synchronization issue, we need new async APIs like VecAXPYAsync() to pass scalars produced on device.<div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 15, 2022 at 10:11 AM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Note that operations that don't have communication (like VecAXPY and VecPointwiseMult) are already non-blocking on streams. (A recent Thrust update helped us recover what had silently become blocking in a previous release.) For multi-rank, operations like MatMult require communication and MPI doesn't have a way to make it nonblocking. We've had some issues/bugs with NVSHMEM to bypass MPI.<br>
<br>
MPI implementors have been really skeptical of placing MPI operations on streams (like NCCL/RCCL or NVSHMEM). Cray's MPI doesn't have anything to do with streams, device memory is cacheable on the host, and RDMA operations are initiated on the host without device logic being involved. I feel like it's going to take company investment or a very enterprising systems researcher to make the case for getting messaging to play well with streams. Perhaps it's a better use of time to focus on reducing latency of notifying the host when RDMA completes and reducing kernel launch time. In short, there are many unanswered questions regarding truly asynchronous Krylov solvers. But in the most obvious places for async, it works currently.<br>
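To make "placing MPI operations on streams" concrete, here is a rough sketch of the host-initiated versus stream-ordered patterns (communicator setup, buffer allocation, and the producing kernel are elided; this is an illustration, not something PETSc does today):

  #include <mpi.h>
  #include <nccl.h>
  #include <cuda_runtime.h>

  /* Host-initiated: before posting a (CUDA-aware) MPI send of a device buffer,
     the host must know the kernel that filled it has finished, so it
     synchronizes the stream. */
  static void send_with_mpi(double *sendbuf_d, int n, int peer, MPI_Comm comm,
                            cudaStream_t stream, MPI_Request *req)
  {
    cudaStreamSynchronize(stream);                /* host waits on device work */
    MPI_Isend(sendbuf_d, n, MPI_DOUBLE, peer, 0, comm, req);
  }

  /* Stream-ordered: the send is enqueued on the same stream as the producing
     kernel and runs after it with no host synchronization. The peer must
     enqueue a matching ncclRecv (grouped with ncclGroupStart/End when several
     point-to-point operations are combined). */
  static void send_with_nccl(double *sendbuf_d, size_t n, int peer,
                             ncclComm_t comm, cudaStream_t stream)
  {
    ncclSend(sendbuf_d, n, ncclDouble, peer, comm, stream);
  }

The difference is who orders the communication: the host (MPI) or the stream itself (NCCL/NVSHMEM).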
<br>
Jacob Faibussowitsch <jacob.fai@gmail.com> writes:<br>
<br>
> New code can (and absolutely should) use it right away, PetscDeviceContext has been fully functional since its merger. Remember though that it works on a “principled parallelism” model; the caller is responsible for proper serialization.<br>
><br>
> Existing code? Not so much. In broad strokes the following sections need support before parallelism can be achieved from user-code:<br>
><br>
> 1. Vec - WIP (feature complete, now in bug-fixing stage)<br>
> 2. PetscSF - TODO<br>
> 3. Mat - TODO<br>
> 4. KSP/PC - TODO<br>
><br>
> Seeing as each MR for this has taken me roughly 3-4 months to merge so far, and the later sections require enormous rewrites and API changes, I don’t expect this to be finished for at least 2 years… Once the Vec MR is merged you could theoretically run with -device_context_stream_type default_blocking and achieve “asynchronous” compute, but nothing would work properly, since every other part of PETSc expects operations to be synchronous.<br>
><br>
> That being said I would be happy to give a demo to people on how they can integrate PetscDeviceContext into their code on the next developers meeting. It would go a long way to cutting down the timeline.<br>
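A minimal sketch of the caller-managed serialization Jacob describes, using the public PetscDeviceContext API (the stream-type enum spelling varies between PETSc versions and is an assumption here; "default_blocking" is the option value named above):

  #include <petscdevice.h>

  /* Two contexts (i.e., two streams). Work launched on ctx_b depends on work
     launched on ctx_a, and the caller states that dependency explicitly. */
  static PetscErrorCode TwoContextSketch(void)
  {
    PetscDeviceContext ctx_a, ctx_b;

    PetscFunctionBeginUser;
    PetscCall(PetscDeviceContextCreate(&ctx_a));
    PetscCall(PetscDeviceContextCreate(&ctx_b));
    PetscCall(PetscDeviceContextSetStreamType(ctx_a, PETSC_STREAM_DEFAULT_BLOCKING));
    PetscCall(PetscDeviceContextSetStreamType(ctx_b, PETSC_STREAM_DEFAULT_BLOCKING));
    PetscCall(PetscDeviceContextSetUp(ctx_a));
    PetscCall(PetscDeviceContextSetUp(ctx_b));

    /* ... launch independent work associated with ctx_a ... */

    /* ctx_b waits for everything recorded on ctx_a so far; this is an
       event wait on the stream, not a host synchronization. */
    PetscCall(PetscDeviceContextWaitForContext(ctx_b, ctx_a));

    /* ... launch dependent work associated with ctx_b ... */

    /* Synchronize with the host only when a result is actually needed there. */
    PetscCall(PetscDeviceContextSynchronize(ctx_b));

    PetscCall(PetscDeviceContextDestroy(&ctx_a));
    PetscCall(PetscDeviceContextDestroy(&ctx_b));
    PetscFunctionReturn(PETSC_SUCCESS);
  }

This is the contract Jacob describes: the caller, not the library, is responsible for ordering work between contexts.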
><br>
>> On Feb 15, 2022, at 02:02, Stefano Zampini <stefano.zampini@gmail.com> wrote:<br>
>> <br>
>> Jacob<br>
>> <br>
>> what is the current status of the async support in PETSc?<br>
>> Can you summarize here? Is there any documentation available?<br>
>> <br>
>> Thanks<br>
>> -- <br>
>> Stefano<br>