[mpich-discuss] will my cluster go tumbling down?

Nicolas Rosner nrosner at gmail.com
Sat Jan 29 12:06:50 CST 2011


Let's suppose you're running a simple MPI program, one communicator,
ten ranks or so. Now imagine rank 7 hits an off-by-one bug, runs past
the end of some array and segfaults.

If, by default, your whole program dies immediately, then what? You
look at the logs, think, insert a few printfs, then track down the
off-by-one in a couple of minutes.

If instead the rest just moves on with a dead rank 7, you end up with
a half-dead system that will eventually collapse anyway, only now with
misleading symptoms and a tenfold increase in debugging time. Worse, it
might not even collapse, hiding a bug that will be much harder to track
down and fix in the future, when you don't even remember writing that
code.

MPICH2 allows you to implement a certain level of runtime fault
tolerance; I hear future versions will allow a lot more.  But
remember: there is no free lunch -- if you want to write a robust
system, you'll need to write error handlers that actually handle
errors robustly.

Until you do so, keeping all local fatal errors globally fatal is
wise. My .02, at least.

(Try looking up MPI_ERRORS_ARE_FATAL.)

Regards,
Nicolás



On Sat, Jan 29, 2011 at 5:54 AM, Eugene N <neverov.biks.07.1 at gmail.com> wrote:
> Hi
> Is it true that if even my most humble mpich2 client node aborts, my
> whole cluster will go down? How can I cure it?
> Thanks,
> Eugene
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>

