[mpich-discuss] will my cluster go tumbling down?

Eugene N neverov.biks.07.1 at gmail.com
Sun Jan 30 03:27:10 CST 2011


Thanks, Nicolas, for pointing out the three types of errors for me.
Thanks, Reuti, the link is very interesting and rich in data.

By the way, I did read the material you all recommended, and it seems that a bad
(and arguably undefined-behavior) way of failsafing the whole cluster is to install
the simple error-return handler and then ignore the return codes from the client nodes:

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
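
A minimal sketch of what that pattern looks like in a full program (using
MPI_Comm_set_errhandler, the MPI-2 replacement for the now-deprecated
MPI_Errhandler_set; the do_local_work() helper and the error-logging policy
are made up for illustration). Once MPI_ERRORS_RETURN is installed, MPI calls
hand an error code back to the caller instead of aborting the whole job, and
it is then up to the caller to decide what to do with it:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical local computation; stands in for whatever each rank does. */
static void do_local_work(int rank) { (void)rank; /* ... */ }

int main(int argc, char **argv)
{
    int rank, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask MPI to return error codes instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    do_local_work(rank);

    /* A collective call whose return code now has to be inspected by hand. */
    int value = rank, sum = 0;
    err = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: MPI_Allreduce failed: %s\n", rank, msg);
        /* "Ignoring" the error just means carrying on here instead of calling
           MPI_Abort -- which is exactly why it is fragile: the state of the
           communicator after a failure is not well defined. */
    }

    MPI_Finalize();
    return 0;
}

As Nicolas points out below, logging and carrying on like this only papers over
the problem; a genuinely robust setup needs error handlers that actually recover
or shut the job down cleanly.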


2011/1/29 Reuti <reuti at staff.uni-marburg.de>

> Am 29.01.2011 um 20:21 schrieb Eugene N:
>
> > Hello
> >
> > Pavan and Nicolas, thank you very much. I see the point in making even
> > the smallest error fatal; I just have a little trouble understanding the
> > error landscape by and large.
> >
> > I mean, I doubt that an MPI error handler will handle errors that are not
> > coming from the MPI context.
> >
> > For example, I painstakingly debugged all my MPI calls, so let's assume it's
> > all quiet on the MPI front, but I can't vouch for a certain library that does
> > minor work that has nothing to do with message passing.
> >
> > I don't want my node calling an all-rank abort if it segfaults. I mean, if
> > the cause of the error has nothing to do with MPI, how can I
> > safeguard myself from such troubles?
>
> There was a similar discussion on the Open MPI list. If one rank crashes
> for whatever reason, you have to take action on your own, provided the
> MPI library in use supports some kind of fault tolerance:
>
> http://www.open-mpi.org/community/lists/users/2011/01/15440.php
>
> -- Reuti
>
>
> > Eugene
> >
> >
> > 2011/1/29 Pavan Balaji <balaji at mcs.anl.gov>
> >
> > There are also some notes in the README in 1.3.2rc1 describing how to use
> errors returned by MPI functions, what you can expect, and what you can't.
> >
> >  -- Pavan
> >
> >
> > On 01/29/2011 12:06 PM, Nicolas Rosner wrote:
> > Let's suppose you're running a simple MPI program, one communicator,
> > ten ranks or so. Now imagine rank 7 hits an off-by-one bug, trespasses
> > the end of some array and segfaults.
> >
> > If, by default, your whole program dies immediately, then what? You
> > look at the logs, think, insert a few printfs, then track down the
> > off-by-one in a couple of minutes.
> >
> > If instead the rest just moves on with a dead rank 7, you end up with
> > a half-dead system that will eventually collapse anyway, misleading
> > symptoms and a tenfold increase in solution time. Worse, it might not
> > even collapse, hiding a bug that will be much harder to track down and
> > fix in the future, when you don't even remember writing that code.
> >
> > MPICH2 allows you to implement a certain level of runtime fault
> > tolerance; I hear future versions will allow a lot more.  But
> > remember: there is no free lunch -- if you want to write a robust
> > system, you'll need to write error handlers that actually handle
> > errors robustly.
> >
> > Until you do so, keeping all local fatal errors globally fatal is
> > wise. My .02, at least.
> >
> > (Try looking up MPI_ERRORS_ARE_FATAL.)
> >
> > Regards,
> > Nicolás
> >
> >
> >
> > On Sat, Jan 29, 2011 at 5:54 AM, Eugene N <neverov.biks.07.1 at gmail.com> wrote:
> > Hi
> > Is it true that even if my most humble mpich2 client node aborts, my whole
> > cluster will go down? How can I cure it?
> > Thanks,
> > Eugene
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >