[MPICH] problem migrating from MPICH1 to MPICH2
Anthony Chan
chan at mcs.anl.gov
Wed May 23 10:55:14 CDT 2007
On Tue, 22 May 2007, Christian Zemlin wrote:
> Dear MPICH experts,
>
> I just set up a Beowulf cluster (16 dual cores @ 2.4 GHz), and it is
> running, but I have a problem with my MPICH programs. My programs were
> developed on an older cluster that had MPICH1, and they run fine there.
> They also run for some time on the new cluster, but eventually they
> terminate with a message like:
>
> "rank 2 in job 120 master_4268 caused collective abort of all ranks exit
> status of rank 2: killed by signal 11"
Signal 11 (SIGSEGV) means your program is accessing invalid memory. It is
possible that your MPI program has a memory bug that simply wasn't exposed
under MPICH1. I would suggest you use gdb, ddd, valgrind, or another memory
checker to make sure that your program has no memory errors or leaks. You
may also want to rebuild mpich2 with the configure option
--enable-g=meminit,dbg so that mpich2 behaves properly under valgrind or gdb.
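
For example, a minimal sketch (the install prefix and the program name
./myprog are placeholders; adjust to your setup):

    # in the mpich2 source tree: rebuild with memory-checker support
    ./configure --enable-g=meminit,dbg --prefix=/usr/local/mpich2
    make && make install

    # then launch each rank under valgrind to catch the invalid access
    mpiexec -n 4 valgrind ./myprog

Running every rank under valgrind is slow but usually pinpoints the exact
line where the bad access happens, even when the crash point moves around
between runs.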
A.Chan
>
> This type of error occurs at a different point in the simulation every time
> I run it. There is no problem if I use only one master and one slave
> node.
>
> Do you have any suggestion what might be the problem?
>
> Thank you and best wishes,
>
> Christian