[mpich-discuss] will my cluster go tumbling down?

Pavan Balaji balaji at mcs.anl.gov
Sat Jan 29 12:11:56 CST 2011


There are also some notes in the README in 1.3.2rc1 describing how to 
use errors returned by MPI functions, what you can expect, and what you 
can't.

  -- Pavan

On 01/29/2011 12:06 PM, Nicolas Rosner wrote:
> Let's suppose you're running a simple MPI program, one communicator,
> ten ranks or so. Now imagine rank 7 hits an off-by-one bug, trespasses
> the end of some array and segfaults.
>
> If, by default, your whole program dies immediately, then what? You
> look at the logs, think, insert a few printfs, then track down the
> off-by-one in a couple of minutes.
>
> If instead the rest just moves on with a dead rank 7, you end up with
> a half-dead system that will eventually collapse anyway, with misleading
> symptoms and a tenfold increase in debugging time. Worse, it might not
> even collapse, hiding a bug that will be much harder to track down and
> fix in the future when you don't even remember writing that code.
>
> MPICH2 allows you to implement a certain level of runtime fault
> tolerance; I hear future versions will allow a lot more.  But
> remember: there is no free lunch -- if you want to write a robust
> system, you'll need to write error handlers that actually handle
> errors robustly.
>
> Until you do so, keeping all local fatal errors globally fatal is
> wise. My .02, at least.
>
> (Try looking up MPI_ERRORS_ARE_FATAL.)
>
> Regards,
> Nicolás
>
>
>
> On Sat, Jan 29, 2011 at 5:54 AM, Eugene N<neverov.biks.07.1 at gmail.com>  wrote:
>> Hi
>> Is it true that if even my most humble mpich2 client node aborts, my
>> whole cluster will go down? How can I cure it?
>> Thanks,
>> Eugene
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

