[mpich-discuss] will my cluster go tumbling down?

Reuti reuti at staff.uni-marburg.de
Sat Jan 29 14:11:12 CST 2011


Am 29.01.2011 um 20:21 schrieb Eugene N:

> Hello
> 
> Pavan and Nicolas, thank you very much. I see the point in making even the smallest error fatal; I just have a little trouble understanding the error landscape by and large.
> 
> I mean, I doubt that an MPI error handler will handle errors that are not coming from an MPI context.
> 
> For example, I painstakingly debugged all my MPI calls, so let's assume it's all quiet on the MPI front, but I can't vouch for a certain library that does minor work that has nothing to do with message passing.
> 
> I don't want my node aborting all ranks if it segfaults. I mean, if the cause of the error has nothing to do with MPI, how can I safeguard myself from such troubles?

There was a similar discussion on the Open MPI list. If one rank crashes for whatever reason, you have to take action on your own, provided the MPI library you use supports some kind of fault tolerance:

http://www.open-mpi.org/community/lists/users/2011/01/15440.php
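
To make that concrete, here is a minimal sketch (not from the original thread) of switching a communicator from the default MPI_ERRORS_ARE_FATAL handler to MPI_ERRORS_RETURN, so a failed MPI call returns an error code instead of aborting all ranks. Note the caveat Eugene raises: this only covers errors raised by MPI itself; a segfault inside a non-MPI library is never routed through the MPI error handler.

```c
/* Sketch: let MPI calls return errors instead of aborting the job.
 * Only MPI-detected errors are affected; a crash in non-MPI code
 * still kills the process outright. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid send (rank -2 does not exist). */
    int payload = 42;
    int rc = MPI_Send(&payload, 1, MPI_INT, -2, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
        /* Decide locally: recover, or call MPI_Abort() yourself. */
    }

    MPI_Finalize();
    return 0;
}
```

Build and run with the usual `mpicc` / `mpiexec` tooling; without the `MPI_Comm_set_errhandler` line, the invalid send would abort every rank.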

-- Reuti


> Eugene
> 
> 
> 2011/1/29 Pavan Balaji <balaji at mcs.anl.gov>
> 
> There are also some notes in the README in 1.3.2rc1 describing how to use errors returned by MPI functions, what you can expect, and what you can't.
> 
>  -- Pavan
> 
> 
> On 01/29/2011 12:06 PM, Nicolas Rosner wrote:
> Let's suppose you're running a simple MPI program, one communicator,
> ten ranks or so. Now imagine rank 7 hits an off-by-one bug, trespasses
> the end of some array and segfaults.
> 
> If, by default, your whole program dies immediately, then what? You
> look at the logs, think, insert a few printfs, then track down the
> off-by-one in a couple of minutes.
> 
> If instead the rest just moves on with a dead rank 7, you end up with
> a half-dead system that will eventually collapse anyway, misleading
> symptoms and a tenfold increase in solution time. Worse, it might not
> even collapse, hiding a bug that will be much harder to track down and
> fix in the future when you don't even remember writing that code.
> 
> MPICH2 allows you to implement a certain level of runtime fault
> tolerance; I hear future versions will allow a lot more.  But
> remember: there is no free lunch -- if you want to write a robust
> system, you'll need to write error handlers that actually handle
> errors robustly.
> 
> Until you do so, keeping all local fatal errors globally fatal is
> wise. My .02, at least.
> 
> (Try looking up MPI_ERRORS_ARE_FATAL.)
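> 
> In the spirit of the above, a hedged sketch of what "an error handler that actually handles errors" might look like. The handler name and its body are hypothetical; the `MPI_Comm_create_errhandler` / `MPI_Comm_set_errhandler` calls and the handler's `(MPI_Comm *, int *, ...)` signature are standard MPI.
> 
> ```c
> /* Sketch: install a custom communicator error handler.
>  * MPI invokes it with the communicator and error code; the
>  * varargs tail is implementation-dependent. */
> #include <mpi.h>
> #include <stdio.h>
> 
> static void handle_comm_error(MPI_Comm *comm, int *errcode, ...)
> {
>     char msg[MPI_MAX_ERROR_STRING];
>     int len, rank;
>     MPI_Error_string(*errcode, msg, &len);
>     MPI_Comm_rank(*comm, &rank);
>     fprintf(stderr, "[rank %d] MPI error: %s\n", rank, msg);
>     /* Log, checkpoint, or clean up here -- then, unless the error
>      * is truly recoverable, keeping it globally fatal is the safe
>      * default, as argued above. */
>     MPI_Abort(*comm, *errcode);
> }
> 
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
> 
>     MPI_Errhandler handler;
>     MPI_Comm_create_errhandler(handle_comm_error, &handler);
>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
> 
>     /* ... application code; any MPI error on MPI_COMM_WORLD now
>      * goes through handle_comm_error first ... */
> 
>     MPI_Errhandler_free(&handler);
>     MPI_Finalize();
>     return 0;
> }
> ```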
> 
> Regards,
> Nicolás
> 
> 
> 
> On Sat, Jan 29, 2011 at 5:54 AM, Eugene N<neverov.biks.07.1 at gmail.com>  wrote:
> Hi
> is it true that if even my most humble MPICH2 client node aborts, all my
> cluster will go down? How can I cure it?
> Thanks,
> Eugene
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 



More information about the mpich-discuss mailing list