[MPICH] Simple MPI program crashes randomly

Yusong Wang ywang25 at aps.anl.gov
Tue May 29 20:22:48 CDT 2007


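There is a small mistake in the simplified version of the program you
provided before. The array allocation should be
double* array = new double[height*width];
instead of
double* array = new double(height*width);
The form with parentheses allocates a single double initialized to the
value height*width, not an array, so any access beyond array[0] is out
of bounds.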

After fixing this, there is no problem on my cluster.

For the application program, I checked the detailed implementation of
the splitDomain function. There are some differences between this one
and the one you provided before. This one has end[lastSlave] set to 201,
which is larger than the width you assigned, while the simplified one
has end[lastSlave] set to 199. Since you printed out everything from
this function, I assume you are aware of this and have allocated enough
memory for the data array (some applications do need extra points).

I ran your application program on my cluster and didn't see the problem.
My suggestion is to recompile everything with the compiler from MPICH2.
You can run `which mpicxx` to check that you are using the correct one.

Let me know if this helps.

Yusong


On Mon, 2007-05-28 at 06:24 -0400, Christian Zemlin wrote:
> The attached program ran literally thousands of times on our old
> cluster, which has MPICH1 libraries, without any problems.
> It is simple in terms of MPI (just the minimum MPI setup + a few Send
> and Recv commands), although it is fairly long because it implements a
> complicated model of cardiac tissue.
>  
> On our new cluster (with MPICH2) the program crashes with segmentation
> faults either on the first MPI_Send/Recv or at the end (MPI_Finalize).
> If you run it several times in a row, it seems to be random which of
> the two occurs.  I have checked carefully that the space for the
> passed data has been allocated.
>  
> I would greatly appreciate if someone with more MPI experience could
> have a look at the source code and give me his/her opinion.
>  
> Best,
>  
> Christian



