[MPICH] problem migrating from MPICH1 to MPICH2
Anthony Chan
chan at mcs.anl.gov
Wed May 23 10:55:14 CDT 2007
On Tue, 22 May 2007, Christian Zemlin wrote:
> Dear MPICH experts,
>
> I just set up a Beowulf cluster (16 dual cores @ 2.4 GHz), and it is
> running, but I have a problem with my MPICH programs. My programs were
> developed on an older cluster that had MPICH1, and they run fine there.
> They also run for some time on the new cluster, but eventually they
> terminate with a message like:
>
> "rank 2 in job 120 master_4268 caused collective abort of all ranks exit
> status of rank 2: killed by signal 11"
Signal 11 (SIGSEGV) means your program is accessing invalid memory. It is
possible that your MPI program has a memory bug that simply wasn't exposed
under MPICH1. I would suggest you use gdb, ddd, valgrind, or another memory
checker to make sure that your program has no memory errors or leaks. You
may also want to rebuild mpich2 with the configure option
--enable-g=meminit,dbg so that mpich2 behaves properly under valgrind or gdb.
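
For example, a minimal sketch (the install prefix and the program name
./myprog are placeholders; adjust to your setup):

    # in the mpich2 source tree: rebuild with memory-checker support
    ./configure --enable-g=meminit,dbg --prefix=/usr/local/mpich2
    make && make install

    # then launch each rank under valgrind to catch the invalid access
    mpiexec -n 4 valgrind ./myprog

Running every rank under valgrind is slow but usually pinpoints the exact
line where the bad access happens, even when the crash point moves around
between runs.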
A.Chan
>
> This type of error occurs at a different point in the simulation every time
> I run it. There is no problem if I use only one master and one slave
> node.
>
> Do you have any suggestion what might be the problem?
>
> Thank you and best wishes,
>
> Christian