[mpich-discuss] SIGx13 Intermittent error

Gus Correa gus at ldeo.columbia.edu
Thu Jun 11 13:58:39 CDT 2009


Hi Marc, Anthony, list

There have been a number of reports of erratic MPICH-1 P4 errors
on several mailing lists I subscribe to (here, Rocks, Beowulf, etc.).
Even very simple programs such as cpi.c may fail.
The problem may be related not only to message size, but also to
the number of processes involved, and maybe to the Ethernet drivers
as well (e.g. forcedeth on NVidia MCP55).
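
Since the failures are intermittent, one way to get a feel for how
often they happen is to run the same job many times and count the
runs that fail. Below is a minimal Python sketch of that idea. The
helper name count_failures, the timeout, and the example command line
(mpirun -np 4 ./cpi) are illustrative assumptions, not something taken
from the reports above.

```python
import subprocess


def count_failures(cmd, runs=20, timeout=300):
    """Run `cmd` (a list of arguments) `runs` times.

    Returns the number of runs that failed. A run counts as failed
    if it exits with a non-zero status or does not finish within
    `timeout` seconds.
    """
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # A hung job is treated as a failure as well.
            failures += 1
    return failures


# Example (hypothetical command line, adjust to your installation):
#   failed = count_failures(["mpirun", "-np", "4", "./cpi"], runs=50)
#   print(failed, "of 50 runs failed")
```

Repeating this with different -np values, or with a test program that
sends larger messages, may give a hint as to whether the failure rate
depends on the number of processes or on the message size.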

See this long thread, for instance:
http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2

It has been suggested that P4 and the
more recent (and not so recent) Linux kernels
don't get along very well.

Upgrading to MPICH2 with the nemesis communication
channel solved the problem in all cases I have heard of.
Hence, Anthony's suggestion of upgrading to MPICH2 is the way to go.
At least one fellow also had problems with MPICH2 and the sockets
communication channel, so it is better to use nemesis.

My $0.02.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Marc wrote:
> I had already run a simple parallel Hello World program multiple times
> and the error never showed up. Same thing for cpi... Maybe that depends
> on the message size.
> 
> I also did a little bit of googling before asking on the mailing list
> and there's not a lot of information. However, I recently found this
> post
> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-November/022378.html)
> on the Rocks Cluster mailing list (Rocks is my cluster distro), and the
> problem seems to be related to uncleared shared memory, in the case of
> intermittent errors. To me it's the most plausible explanation.
> 
> I will try MPICH-2. Thanks for the suggestion.
> 
> Best regards,
> 
> Marc
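
The uncleared shared memory hypothesis Marc mentions above can be
checked on Linux by looking at the System V shared memory segments
that ipcs -m reports after a job has ended. Here is a minimal Python
sketch that parses that output and picks out segments owned by a given
user with no process attached. The function names are hypothetical,
the column layout assumed is the usual Linux one (key, shmid, owner,
perms, bytes, nattch, status), and an unattached segment is only a
candidate for a leftover, not proof of one.

```python
import subprocess


def parse_ipcs_shm(text):
    """Parse `ipcs -m` output (assumed Linux layout) into a list of dicts."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: key shmid owner perms bytes nattch [status]
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        if not (fields[1].isdigit() and fields[4].isdigit()
                and fields[5].isdigit()):
            continue
        segments.append({
            "shmid": int(fields[1]),
            "owner": fields[2],
            "bytes": int(fields[4]),
            "nattch": int(fields[5]),
        })
    return segments


def unattached_segments(text, owner):
    """Return segments owned by `owner` that no process is attached to."""
    return [s for s in parse_ipcs_shm(text)
            if s["owner"] == owner and s["nattch"] == 0]


def current_ipcs_output():
    """Return the output of `ipcs -m` on this machine."""
    return subprocess.run(["ipcs", "-m"], capture_output=True,
                          text=True, check=True).stdout


# Example:
#   for seg in unattached_segments(current_ipcs_output(), "marc"):
#       print(seg["shmid"], seg["bytes"])
```

A segment found this way can be removed by its owner with
ipcrm -m <shmid>, once it is certain that no running job still needs it.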


