[MPICH] problem migrating from MPICH1 to MPICH2
Christian Zemlin
zemlinc at upstate.edu
Wed May 23 14:04:13 CDT 2007
Anthony,
thanks to your help with gdb, I was able to run my program under it and get
more information about how it crashes. I still cannot find the cause and
would greatly appreciate your opinion. I have reduced the code to the
following minimal version that reproduces the error.
(The general purpose of the program is to compute a series of states of a
two-dimensional domain, with each processor working on one strip of this
domain.)
#define MCW MPI_COMM_WORLD

#include <mpi.h>
#include <fstream>
#include <iostream>

const int MASTER = 0;
const int TAG = 1;

// splits a rectangular domain into strips; each processor will work on one strip
void splitDomain(int* begin, int* end, int* slaveShare, int width, int numSlaves);
int main(int argc, char** argv)
{
    int myRank;
    int numSlaves;
    int numProc;
    int width = 200;
    int height = 5;
    int numFrames = 10;
    int frame;

    // set up MPI
    MPI::Init(argc, argv);
    myRank = MPI::COMM_WORLD.Get_rank();
    numProc = MPI::COMM_WORLD.Get_size();
    MPI_Status status;
    numSlaves = numProc - 1;

    // define intervals on which the individual slaves work
    int* begin = new int[numProc];
    int* end = new int[numProc];
    int* slaveShare = new int[numProc];
    splitDomain(begin, end, slaveShare, width, numSlaves);

    double* array = new double(height*width);

    if (myRank == MASTER)
    {
        int whichSlave;
        // collect data from slaves
        for (frame = 0; frame < numFrames; frame++)
            for (whichSlave = 1; whichSlave <= numSlaves; whichSlave++)
                MPI_Recv(&(array[0]) + begin[whichSlave]*height,
                         slaveShare[whichSlave]*height, MPI_DOUBLE,
                         whichSlave, TAG, MCW, &status);
    }
    else // not master
    {
        for (frame = 0; frame < numFrames; frame++)
            MPI_Send(&(array[0]) + begin[myRank]*height,
                     slaveShare[myRank]*height, MPI_DOUBLE,
                     0, TAG, MCW);
    }

    MPI::Finalize();
}
void splitDomain(int* begin, int* end, int* slaveShare, int width, int numSlaves)
{
    int i;
    begin[1] = 0;
    for (i = 1; i < numSlaves; i++)
    {
        end[i] = begin[i+1] = i*(width-1)/numSlaves;
        slaveShare[i] = end[i] - begin[i] + 1;
    }
    end[numSlaves] = width - 1;
    slaveShare[numSlaves] = end[numSlaves] - begin[numSlaves];
}
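(As an illustration of the decomposition, assuming width = 200 and three
slaves, splitDomain yields begin = {0, 66, 132}, end = {66, 132, 199}, and
slaveShare = {67, 67, 67} for slaves 1 through 3; index 0 is unused, and
since end[i] = begin[i+1], neighboring strips share their boundary column.)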
--------------------------------------
Here is the gdb output:
testm at master:~/test> mpiexec -gdb -n 2 ./RNC
0-1: (gdb) run
0-1: Continuing.
0:
0: Program received signal SIGSEGV, Segmentation fault.
0: 0x00002b393c2cf4b4 in _int_malloc () from /lib64/libc.so.6
0: (gdb) 0: (gdb) where
0: #0 0x00002b393c2cf4b4 in _int_malloc () from /lib64/libc.so.6
0: #1 0x00002b393c2d1386 in malloc () from /lib64/libc.so.6
0: #2 0x00000000004be742 in MPIDI_CH3I_BootstrapQ_attach ()
0: #3 0x00000000004ab823 in MPIDI_CH3I_Shm_connect ()
0: #4 0x00000000004abd0f in MPIDI_CH3I_VC_post_connect ()
0: #5 0x00000000004eb7a5 in MPIDI_CH3_iSend ()
0: #6 0x00000000004b65b0 in MPID_Isend ()
0: #7 0x0000000000437d7a in MPIC_Sendrecv ()
0: #8 0x0000000000411442 in PMPI_Barrier ()
0: #9 0x00000000004b4e24 in MPID_Finalize ()
0: #10 0x000000000047956e in PMPI_Finalize ()
0: #11 0x0000000000402af4 in main (argc=1, argv=0x7fff6f3d0078) at
RNC.cpp:52
0: (gdb)
---------------------------
Line 52, which gdb is referring to, is MPI::Finalize();
Can you see what the problem is?
Christian
----- Original Message -----
From: "Anthony Chan" <chan at mcs.anl.gov>
To: "Christian Zemlin" <zemlinc at upstate.edu>
Cc: <mpich-discuss at mcs.anl.gov>
Sent: Wednesday, May 23, 2007 11:55 AM
Subject: Re: [MPICH] problem migrating from MPICH1 to MPICH2
>
>
> On Tue, 22 May 2007, Christian Zemlin wrote:
>
>> Dear MPICH Experts,
>>
>> I have just set up a Beowulf cluster (16 dual cores @ 2.4 GHz), and it
>> is running, but I have a problem running my MPICH programs. My programs
>> were developed on an older cluster that had MPICH1, and they run fine
>> there. They also run for some time on the new cluster, but eventually
>> they terminate with a message like:
>>
>> "rank 2 in job 120 master_4268 caused collective abort of all ranks exit
>> status of rank 2: killed by signal 11"
>
> Signal 11 means your program is accessing invalid memory. It is possible
> that your MPI program has a memory problem that simply isn't visible with
> MPICH1. I would suggest you use gdb, ddd, valgrind, or another memory
> checker to make sure that your program has no memory errors. You may also
> want to rebuild mpich2 with the option --enable-g=meminit,dbg so that
> mpich2 behaves properly under valgrind or gdb.
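>
> For example (a rough sketch; adjust paths, the process count, and the
> program name, assumed here to be RNC, to your setup):
>
>     ./configure --enable-g=meminit,dbg
>     make && make install
>     mpiexec -n 2 valgrind ./RNC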
>
> A.Chan
>
>>
>> This type of error occurs at a different point in the simulation every
>> time I run it. There is no problem if I use only one master and one
>> slave node.
>>
>> Do you have any suggestions about what the problem might be?
>>
>> Thank you and best wishes,
>>
>> Christian
>