[mpich-discuss] will my cluster go tumbling down?

Nicolas Rosner nrosner at gmail.com
Sat Jan 29 14:06:58 CST 2011


Hello Eugene,

I was trying to make a general point, which I'm happy to hear you see
and agree with.

However, as you correctly point out, I sort of ended up mixing up
different kinds of errors: C programming errors (deref'ing a bad ptr)
vs. MPI programmer errors (misuse of MPI calls) vs. MPI errors due to
external factors (losing a message due to wire failure).

Indeed, those are not all equivalent. What I meant was, while your
program is still a prototype with little or limited error handling,
"all errors are fatal" is likely to be desirable -- generally
speaking. But yes indeed, once you do add more robust error handling,
you will need to handle each of the aforementioned subsets quite
differently.
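
To make the difference between those subsets concrete, here is a
minimal sketch in plain POSIX C, with no MPI involved. It only
illustrates that a crash such as a bad pointer dereference arrives as a
signal that kills the process, while an ordinary failure comes back as
a status the caller can inspect. MPI error handlers
(MPI_ERRORS_ARE_FATAL, MPI_ERRORS_RETURN) only apply to errors raised
inside MPI calls. The names run_isolated, task_outcome and the task_*
functions are made up for this illustration, and the crash is simulated
with raise(SIGSEGV). Whether calling fork() from inside an MPI rank is
safe depends on the MPI implementation and the interconnect, so please
don't read this as a recommendation for production code.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Outcome of running a task in a child process (hypothetical names). */
typedef enum {
    TASK_OK,       /* child exited with status 0 */
    TASK_FAILED,   /* child exited with a nonzero status */
    TASK_CRASHED,  /* child was killed by a signal such as SIGSEGV */
    TASK_NOT_RUN   /* fork() or waitpid() itself failed */
} task_outcome;

/* Stand-ins for a non-MPI library call: success, reported error, crash. */
int task_fine(void)  { return 0; }
int task_error(void) { return 3; }
int task_crash(void)
{
    signal(SIGSEGV, SIG_DFL);  /* make sure the default action applies */
    raise(SIGSEGV);            /* simulates dereferencing a bad pointer */
    return 0;                  /* not reached */
}

/*
 * Run task() in a child process and classify how it ended.
 * On return, *detail holds the exit status (TASK_OK, TASK_FAILED)
 * or the terminating signal number (TASK_CRASHED).
 */
task_outcome run_isolated(int (*task)(void), int *detail)
{
    assert(task != NULL && detail != NULL);
    *detail = 0;

    fflush(NULL);  /* avoid duplicating buffered output in the child */
    pid_t pid = fork();
    if (pid < 0)
        return TASK_NOT_RUN;

    if (pid == 0)
        _exit(task() & 0xff);  /* child: report the result as exit status */

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)    /* retry only if interrupted by a signal */
            return TASK_NOT_RUN;
    }

    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return TASK_CRASHED;
    }
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return *detail == 0 ? TASK_OK : TASK_FAILED;
    }
    return TASK_NOT_RUN;
}
```

For instance, run_isolated(task_crash, &d) returns TASK_CRASHED with d
set to SIGSEGV, whereas run_isolated(task_error, &d) returns
TASK_FAILED with d set to 3. The crash never shows up as a return
code in the process that crashed, which is why it has to be treated
separately from errors that an MPI call can report.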

I can't answer your last question off the top of my head, but the
MPICH2 experts in here will probably be able to help you out. In the
meantime, the "Error Handlers" section of the docs as well as the
release notes Pavan mentioned may be of use.

Regards,
Nicolás


On Sat, Jan 29, 2011 at 4:21 PM, Eugene N <neverov.biks.07.1 at gmail.com> wrote:
> Hello Pavan and Nicolas, thank you very much. I see the point in making
> even the smallest error fatal; I just have a little trouble understanding
> the error landscape by and large.
> I mean, I doubt that the MPI error handler will handle errors that do not
> come from the MPI context.
> For example, I painstakingly debugged all my MPI calls, so let's assume it's
> all quiet on the MPI front, but I can't vouch for a certain library that does
> minor work that has nothing to do with message passing.
> I don't want my node calling an all-rank abort if it segfaults. I mean, if
> the cause of the error has nothing to do with MPI, how can I safeguard
> myself from such troubles?
>
> Eugene
>
> 2011/1/29 Pavan Balaji <balaji at mcs.anl.gov>
>>
>> There are also some notes in the README in 1.3.2rc1 describing how to use
>> errors returned by MPI functions, what you can expect, and what you can't.
>>
>>  -- Pavan
>>
>> On 01/29/2011 12:06 PM, Nicolas Rosner wrote:
>>>
>>> Let's suppose you're running a simple MPI program, one communicator,
>>> ten ranks or so. Now imagine rank 7 hits an off-by-one bug, trespasses
>>> the end of some array and segfaults.
>>>
>>> If, by default, your whole program dies immediately, then what? You
>>> look at the logs, think, insert a few printfs, then track down the
>>> off-by-one in a couple of minutes.
>>>
>>> If instead the rest just moves on with a dead rank 7, you end up with
>>> a half-dead system that will eventually collapse anyway, misleading
>>> symptoms and a tenfold increase in solution time. Worse, it might not
>>> even collapse, hiding a bug that will be much harder to track down and
>>> fix in the future when you don't even remember writing that code.
>>>
>>> MPICH2 allows you to implement a certain level of runtime fault
>>> tolerance; I hear future versions will allow a lot more.  But
>>> remember: there is no free lunch -- if you want to write a robust
>>> system, you'll need to write error handlers that actually handle
>>> errors robustly.
>>>
>>> Until you do so, keeping all local fatal errors globally fatal is
>>> wise. My .02, at least.
>>>
>>> (Try looking up MPI_ERRORS_ARE_FATAL.)
>>>
>>> Regards,
>>> Nicolás
>>>
>>>
>>>
>>> On Sat, Jan 29, 2011 at 5:54 AM, Eugene N <neverov.biks.07.1 at gmail.com> wrote:
>>>>
>>>> Hi
>>>> Is it true that if even my most humble MPICH2 client node aborts, my
>>>> whole cluster will go down? How can I cure it?
>>>> Thanks,
>>>> Eugene
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>
>>>>
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji

