[MPICH] problem migrating from MPICH1 to MPICH2

Christian Zemlin zemlinc at upstate.edu
Wed May 23 14:04:13 CDT 2007


Anthony,

Thanks to your help with gdb, I was able to run my program under it and get
more information about how it crashes.  I still cannot find the cause and would
greatly appreciate your opinion.  I have reduced the code to the following
minimal version that reproduces the error:

(The general purpose of the program is to compute a series of states of a
two-dimensional domain, with each processor working on a strip of this
domain.)

#define MCW MPI_COMM_WORLD

#include <mpi.h>
#include <fstream>
#include <iostream>

const int MASTER = 0;
const int TAG = 1;

// splits a rectangular domain into strips; each processor will work on one strip
void splitDomain(int* begin, int* end, int* slaveShare, int width, int numSlaves);

int main(int argc, char** argv)
{
    int myRank;
    int numSlaves;
    int numProc;

    int width=200;
    int height=5;
    int numFrames=10;
    int frame;

    // set up MPI
    MPI::Init(argc, argv);
    myRank = MPI::COMM_WORLD.Get_rank();
    numProc  = MPI::COMM_WORLD.Get_size();
    MPI_Status status;
    numSlaves = numProc-1;

    // define intervals on which the individual slaves work
    int* begin = new int[numProc];
    int* end = new int[numProc];
    int* slaveShare = new int[numProc];
    splitDomain(begin, end, slaveShare, width, numSlaves);

    double* array = new double(height*width);

    if (myRank==MASTER)
    {
        int whichSlave;
        // collect data from slaves
        for (frame=0; frame < numFrames; frame++)
            for (whichSlave = 1; whichSlave <= numSlaves; whichSlave++)
                MPI_Recv(&(array[0]) + begin[whichSlave]*height,
                         slaveShare[whichSlave]*height,
                         MPI_DOUBLE, whichSlave, TAG, MCW, &status);
    }
    else // not master
    {
        for (frame = 0; frame<numFrames; frame++)
            MPI_Send(&(array[0]) + begin[myRank]*height,
                     slaveShare[myRank]*height,
                     MPI_DOUBLE, 0, TAG, MCW);
    }
    MPI::Finalize();
}

void splitDomain(int* begin, int* end, int* slaveShare, int width, int 
numSlaves)
{
  int i;
  begin[1] = 0;
  for (i=1; i<numSlaves; i++)
    {
      end[i] = begin[i+1] = i*(width-1)/numSlaves;
      slaveShare[i] = end[i] - begin[i]+1;
    }
  end[numSlaves] = width-1;
  slaveShare[numSlaves] = end[numSlaves] - begin[numSlaves];
}
--------------------------------------

Here is what gdb says to it:

testm at master:~/test> mpiexec -gdb -n 2 ./RNC
0-1:  (gdb) run
0-1:  Continuing.
0:
0:  Program received signal SIGSEGV, Segmentation fault.
0:  0x00002b393c2cf4b4 in _int_malloc () from /lib64/libc.so.6
0:  (gdb) 0:  (gdb) where
0:  #0  0x00002b393c2cf4b4 in _int_malloc () from /lib64/libc.so.6
0:  #1  0x00002b393c2d1386 in malloc () from /lib64/libc.so.6
0:  #2  0x00000000004be742 in MPIDI_CH3I_BootstrapQ_attach ()
0:  #3  0x00000000004ab823 in MPIDI_CH3I_Shm_connect ()
0:  #4  0x00000000004abd0f in MPIDI_CH3I_VC_post_connect ()
0:  #5  0x00000000004eb7a5 in MPIDI_CH3_iSend ()
0:  #6  0x00000000004b65b0 in MPID_Isend ()
0:  #7  0x0000000000437d7a in MPIC_Sendrecv ()
0:  #8  0x0000000000411442 in PMPI_Barrier ()
0:  #9  0x00000000004b4e24 in MPID_Finalize ()
0:  #10 0x000000000047956e in PMPI_Finalize ()
0:  #11 0x0000000000402af4 in main (argc=1, argv=0x7fff6f3d0078) at RNC.cpp:52
0:  (gdb)

---------------------------
Line 52, which gdb is referring to, is the call to MPI::Finalize();

Can you see what the problem is?

Christian






----- Original Message ----- 
From: "Anthony Chan" <chan at mcs.anl.gov>
To: "Christian Zemlin" <zemlinc at upstate.edu>
Cc: <mpich-discuss at mcs.anl.gov>
Sent: Wednesday, May 23, 2007 11:55 AM
Subject: Re: [MPICH] problem migrating from MPICH1 to MPICH2


>
>
> On Tue, 22 May 2007, Christian Zemlin wrote:
>
>> Dear MPICH - Experts
>>
>> I just set up a Beowulf cluster (16 dual cores @ 2.4 GHz), and it is
>> running, but I have a problem running my MPICH programs.  My programs were
>> developed on an older cluster that had MPICH1, and they run fine there.
>> They also run for some time on the new cluster, but eventually they
>> terminate with a message like:
>>
>> "rank 2 in job 120 master_4268  caused collective abort of all ranks exit
>> status of rank 2: killed by signal 11"
>
> Signal 11 means your program is accessing invalid memory.  It is possible
> that your MPI program has a memory problem that isn't visible with MPICH1.
> I would suggest you use gdb, ddd, valgrind, or any memory checker to make
> sure that your program has no memory errors.  You may also want to rebuild
> mpich2 with the option --enable-g=meminit,dbg to make mpich2 behave properly
> under valgrind or gdb.
>
> A.Chan
>
>>
>> This type of error occurs at a different point of the simulation every 
>> time
>> I run it.  There is no problem if I use only one master and one slave
>> node.
>>
>> Do you have any suggestion what might be the problem?
>>
>> Thank you and best wishes,
>>
>> Christian
> 
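
For reference, this is roughly how I understand the two suggestions above
(the install prefix is only a placeholder, and please correct me if the
invocations are wrong):

    # rebuild MPICH2 with the extra memory-debugging support
    ./configure --enable-g=meminit,dbg --prefix=/opt/mpich2
    make && make install

    # run the reduced test case under valgrind to look for invalid reads/writes
    mpiexec -n 2 valgrind --leak-check=full ./RNC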



