[petsc-dev] https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html

Mon Mar 18 11:44:12 CDT 2019

On Sun, Mar 17, 2019 at 6:55 PM Jed Brown <jed at jedbrown.org> wrote:

> Jeff Hammond <jeff.science at gmail.com> writes:
>
> > When this was written, I was convinced that Dursi was wrong about
> > everything because one of the key arguments against MPI was
> > fault-intolerance, which I was sure was going to be solved soon.
> > However, LLNL has done everything in their power to torpedo MPI
> > fault-tolerance in MPI-4 for the past 3+ years and I am no longer
> > optimistic about MPI's ability to grow outside of traditional HPC
> > because of the forum's inability to take fault-tolerance seriously.
> > It's also unclear that we can get by without it in a post-exascale
> > world.
>
> Have you seen any MPI FT proposals that would actually enable a
> collective library or application to meaningfully "recover"?
>
> Seems to me that in-memory checkpointing and process-based FT is more
> practical.  For example, you could have a Spark or Spark-like system
> that manages distributed in-memory data (perhaps standardizing on
> Arrow), launches MPI jobs to access that data in-place, and coordinates
> the distributed replication so that a fresh MPI job could restart on a
> (possibly) different group of nodes after the MPI job crashes.  In such
> a system, you would need to implement transactional updates to in-memory
> checkpoint data, but no special recovery from MPI (just that crashes be
> eventually collective, versus deadlock).
>

Please look at
https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2016/1610522.pdf.
This is similar to what you are proposing but without the expensive reboot
of the MPI runtime.
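
For concreteness, here is a minimal single-process sketch of the
"transactional updates to in-memory checkpoint data" idea quoted above.
It is not taken from the Sandia report or from any real Spark, Arrow, or
MPI API. CheckpointStore, JobCrash, run_job, and run_with_restarts are
hypothetical names, and a plain Python dict stands in for the replicated
distributed data. The only point it illustrates is that a job that
crashes mid-step leaves the last committed checkpoint untouched, so a
fresh job can resume from it.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory checkpoint store with all-or-nothing commits."""

    def __init__(self, initial_state):
        self._committed = copy.deepcopy(initial_state)
        self.version = 0

    def read(self):
        # Hand out a private copy so a job can never corrupt the
        # committed checkpoint by mutating its working state.
        return copy.deepcopy(self._committed)

    def commit(self, new_state):
        # Stage a full copy first; the committed data is replaced only by
        # a single reference swap, so readers see old or new, never a mix.
        staged = copy.deepcopy(new_state)
        self._committed = staged
        self.version += 1


class JobCrash(Exception):
    """Stands in for an MPI job that aborts collectively."""


def run_job(store, total_steps, crash_at=None):
    """Advance the state one step at a time, committing after each step."""
    state = store.read()
    while state["step"] < total_steps:
        state["step"] += 1
        if state["step"] == crash_at:
            # Crash in the middle of an update: 'step' has advanced but
            # 'value' has not. This partial state is never committed.
            raise JobCrash(f"simulated failure at step {crash_at}")
        state["value"] += state["step"]
        store.commit(state)
    return state


def run_with_restarts(store, total_steps, crash_at):
    """Launch a job; if it crashes, launch a fresh one from the checkpoint."""
    try:
        return run_job(store, total_steps, crash_at)
    except JobCrash:
        return run_job(store, total_steps)


if __name__ == "__main__":
    store = CheckpointStore({"step": 0, "value": 0})
    final = run_with_restarts(store, total_steps=5, crash_at=3)
    print(final, "commits:", store.version)
```

In a real system the commit would have to be coordinated across nodes
and replicas, which is where most of the difficulty lies; the sketch
only shows the contract the restarted job relies on.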

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/