[mpich-discuss] will my cluster go tumbling down?

Eugene N neverov.biks.07.1 at gmail.com
Sat Jan 29 13:21:14 CST 2011


Hello

Pavan and Nicolas, thank you very much. I see the point in making even the
smallest error fatal; I just have a little trouble understanding the error
landscape by and large.

I mean, I doubt that an MPI error handler will handle errors that do not
come from an MPI context.
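
As far as I understand it, a handler attached to a communicator only fires
when an MPI call itself reports a problem. Something along these lines is
what I have in mind (just a rough, untested sketch on my part):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Rough sketch (untested): ask MPI to return error codes instead of the
     * default MPI_ERRORS_ARE_FATAL behavior, then check them by hand.
     * This only covers errors that MPI itself detects and reports. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int payload = 42;
    /* Deliberately send to an out-of-range rank to provoke an MPI error. */
    int rc = MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI reported: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}

But a crash that never goes through an MPI call is a different story.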

For example, I have painstakingly debugged all my MPI calls, so let's assume
it's all quiet on the MPI front, but I can't vouch for a certain library that
does minor work that has nothing to do with message passing.

I don't want my node aborting all ranks if it segfaults. If the cause of the
error has nothing to do with MPI, how can I safeguard myself from such
trouble?
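
The only thing I can think of so far is to catch the signal myself, so that
the failure is at least reported cleanly instead of leaving the other ranks
hanging -- roughly like the sketch below (untested, and I realize that
neither stdio nor MPI_Abort is really safe to call from a signal handler):

#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int my_rank = -1;

/* Best-effort diagnostics only: after a segfault the process state is
 * suspect, so all this tries to do is say which rank died and tear the
 * job down instead of leaving it half-dead. */
static void on_fatal_signal(int sig)
{
    fprintf(stderr, "rank %d caught signal %d, aborting the whole job\n",
            my_rank, sig);
    MPI_Abort(MPI_COMM_WORLD, 1);
    _exit(1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    signal(SIGSEGV, on_fatal_signal);

    /* ... calls into the non-MPI library would go here ... */

    MPI_Finalize();
    return 0;
}

Is there anything better than that?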

Eugene


2011/1/29 Pavan Balaji <balaji at mcs.anl.gov>

>
> There are also some notes in the README in 1.3.2rc1 describing how to use
> errors returned by MPI functions, what you can expect, and what you can't.
>
>  -- Pavan
>
>
> On 01/29/2011 12:06 PM, Nicolas Rosner wrote:
>
>> Let's suppose you're running a simple MPI program, one communicator,
>> ten ranks or so. Now imagine rank 7 hits an off-by-one bug, trespasses
>> the end of some array and segfaults.
>>
>> If, by default, your whole program dies immediately, then what? You
>> look at the logs, think, insert a few printfs, then track down the
>> off-by-one in a couple of minutes.
>>
>> If instead the rest just moves on with a dead rank 7, you end up with
>> a half-dead system that will eventually collapse anyway, misleading
>> symptoms and a tenfold increase in solution time. Worse, it might not
>> even collapse, hiding a bug that will be much harder to track down and
>> fix in the future when you don't even remember writing that code.
>>
>> MPICH2 allows you to implement a certain level of runtime fault
>> tolerance; I hear future versions will allow a lot more.  But
>> remember: there is no free lunch -- if you want to write a robust
>> system, you'll need to write error handlers that actually handle
>> errors robustly.
>>
>> Until you do so, keeping all local fatal errors globally fatal is
>> wise. My .02, at least.
>>
>> (Try looking up MPI_ERRORS_ARE_FATAL.)
>>
>> Regards,
>> Nicolás
>>
>>
>>
>> On Sat, Jan 29, 2011 at 5:54 AM, Eugene N <neverov.biks.07.1 at gmail.com>
>> wrote:
>>
>>> Hi
>>> Is it true that even if my most humble MPICH2 client node aborts, all of
>>> my cluster will go down? How can I cure it?
>>> Thanks,
>>> Eugene
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>

