[petsc-dev] https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html

Sun Mar 17 20:56:06 CDT 2019

Jeff Hammond <jeff.science at gmail.com> writes:

> When this was written, I was convinced that Dursi was wrong about
> everything because one of the key arguments against MPI was
> fault-intolerance, which I was sure was going to be solved soon.  However,
> LLNL has done everything in their power to torpedo MPI fault-tolerance in
> MPI-4 for the past 3+ years and I am no longer optimistic about MPI's
> ability to grow outside of traditional HPC because of the forum's inability
> to take fault-tolerance seriously.  It's also unclear that we can get by
> without it in a post-exascale world.

Have you seen any MPI FT proposals that would actually enable a
collective library or application to meaningfully "recover"?

Seems to me that in-memory checkpointing and process-based FT is more
practical.  For example, you could have a Spark or Spark-like system
that manages distributed in-memory data (perhaps standardizing on
Arrow), launches MPI jobs to access that data in-place, and coordinates
the distributed replication so that a fresh MPI job could restart on a
(possibly) different group of nodes after the MPI job crashes.  In such
a system, you would need to implement transactional updates to the
in-memory checkpoint data, but MPI itself would need no special recovery
support: only that a crash eventually becomes collective (every rank
terminates) rather than leaving the job deadlocked.
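
To make "transactional updates" concrete, here is a minimal
single-process sketch in Python.  It is only an illustration under
stated assumptions: CheckpointStore, Transaction, and run_job are
hypothetical names, a plain dict stands in for the replicated in-memory
data (no Arrow, no Spark, no real MPI), and the "crash" is just a raised
exception.  The point is that a job stages its changes on a private copy
and publishes them in one commit step, so a job that dies mid-update
leaves the last committed checkpoint intact for a fresh job to restart
from.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory checkpoint store with all-or-nothing commits."""

    def __init__(self, initial):
        self._committed = copy.deepcopy(initial)
        self.version = 0  # incremented on every successful commit

    def read(self):
        # Hand out a copy so callers cannot mutate committed data in place.
        return copy.deepcopy(self._committed)

    def begin(self):
        return Transaction(self)


class Transaction:
    """Stages updates on a private copy; nothing is visible until commit()."""

    def __init__(self, store):
        self._store = store
        self._base_version = store.version
        self.staged = store.read()

    def commit(self):
        # Refuse to commit if another job committed since we started.
        if self._store.version != self._base_version:
            raise RuntimeError("stale transaction: checkpoint has moved on")
        self._store._committed = self.staged
        self._store.version += 1


def run_job(store, step, crash=False):
    """Stand-in for one MPI job: update the state, then commit.

    crash=True simulates the job dying after it has modified its staged
    copy but before the commit, i.e. the case the store must tolerate.
    """
    txn = store.begin()
    txn.staged["iteration"] += 1
    txn.staged["state"] = step(txn.staged["state"])
    if crash:
        raise RuntimeError("simulated MPI job crash")
    txn.commit()


def double(values):
    return [2 * v for v in values]


if __name__ == "__main__":
    store = CheckpointStore({"iteration": 0, "state": [1.0, 2.0]})
    run_job(store, double)                  # commits iteration 1
    try:
        run_job(store, double, crash=True)  # dies before commit
    except RuntimeError:
        pass                                # the "launcher" notices the crash
    run_job(store, double)                  # fresh job restarts from iteration 1
    print(store.version, store.read())
```

In a real system the commit would also have to be coordinated across
replicas on different nodes, which is where most of the difficulty
lies; the sketch only shows the contract the MPI job would see.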