[MPICH] Simple MPI program crashes randomly

Christian Zemlin zemlinc at upstate.edu
Wed May 30 06:14:21 CDT 2007


Dear Yusong,

thank you very much for taking the time to look at my programs.  You are 
right about the mistake in the simplified version; fixing it also removes 
the problem on my computer.
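
For anyone else following the thread, the reason the fix quoted below 
matters: new double(height*width) allocates a *single* double whose initial 
value is height*width, while new double[height*width] allocates an array of 
height*width doubles.  A minimal sketch of the failure mode, with 
hypothetical sizes and tags (only the allocation line is taken from the 
quoted snippet):

    #include <mpi.h>

    // Run with at least two ranks.
    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int height = 100, width = 200;   // hypothetical sizes

        // Correct: an array of height*width doubles.
        double* array = new double[height*width];

        // Wrong: ONE double initialized to the value height*width.
        // Sending or receiving height*width doubles through such a pointer
        // reads/writes far past the allocation; the crash may appear at the
        // Send/Recv itself or only later, e.g. in MPI_Finalize.
        // double* array = new double(height*width);

        if (rank == 0)
            MPI_Send(array, height*width, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(array, height*width, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        delete[] array;
        MPI_Finalize();
        return 0;
    }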

Regarding the application, I found that there was a hardware problem with 
one of the nodes.  After I took that node out, this problem disappeared as 
well.

Now, however, I have a different problem: MPI "hangs" at random points in my 
program *after* working fine for quite some time (it hangs at a Send/Recv 
pair that it has previously executed successfully many times).  I described 
the problem in another post to the mailing list.
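
For reference (I do not know yet whether this is what happens in my code), 
one well-known pattern that behaves this way is two ranks that both call a 
blocking MPI_Send before the matching MPI_Recv: small messages go through 
MPICH's eager protocol and appear to work, but once a message is large 
enough MPI_Send blocks until the receive is posted and both ranks wait 
forever.  A minimal sketch with hypothetical sizes and tags, to be run with 
exactly two ranks:

    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const int n = 1 << 20;              // large enough to leave the eager path
        double* sendbuf = new double[n]();  // value-initialized to 0.0
        double* recvbuf = new double[n]();
        int other = 1 - rank;

        // Deadlock-prone version: both ranks block in MPI_Send.
        // MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        // MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
        //          MPI_STATUS_IGNORE);

        // Safe version: let MPI pair the send and the receive.
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, other, 0,
                     recvbuf, n, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        delete[] sendbuf;
        delete[] recvbuf;
        MPI_Finalize();
        return 0;
    }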

If you have any idea what might solve this new problem, I would greatly 
appreciate it.

Christian
----- Original Message ----- 
From: "Yusong Wang" <ywang25 at aps.anl.gov>
To: "Christian Zemlin" <zemlinc at upstate.edu>
Cc: <mpich-discuss at mcs.anl.gov>
Sent: Tuesday, May 29, 2007 9:22 PM
Subject: Re: [MPICH] Simple MPI program crashes randomly


> There is a small mistake in the simplified version of the program you
> provided before:
> Array allocation should be
> double* array = new double[height*width];
> instead of
> double* array = new double(height*width);
>
> After fixing this, there is no problem on my cluster.
>
> For the application program, I checked the detailed implementation of the
> splitDomain function. There are some differences between this one and the
> one you provided before. This one has end[lastSlave] set to 201, which
> is larger than the width you assigned, while the simplified one has
> end[lastSlave] set to 199. Since you printed out everything from this
> function, I assume you have made sure to allocate enough memory for the
> data array (some applications do need extra points).
>
> I ran your application program on my cluster and didn't see the problem.
> My suggestion is to recompile everything with the compiler from MPICH2.
> You can check with `which mpicxx` whether you are using the correct one.
>
> Let me know if this helps.
>
> Yusong
>
>
> On Mon, 2007-05-28 at 06:24 -0400, Christian Zemlin wrote:
>> The attached program ran literally thousands of times on our old
>> cluster, which has MPICH1 libraries, without any problems.
>> It is simple in terms of MPI (just the minimum MPI setup + a few Send
>> and Recv commands), although it is fairly long because it implements a
>> complicated model of cardiac tissue.
>>
>> On our new cluster (with MPICH2) the program crashes with segmentation
>> faults either on the first MPI_Send/Recv or at the end (MPI_Finalize).
>> If you run it several times in a row, it seems to be random which of
>> the two occurs.  I have checked carefully that the space for the
>> passed data has been allocated.
>>
>> I would greatly appreciate it if someone with more MPI experience could
>> have a look at the source code and give me his/her opinion.
>>
>> Best,
>>
>> Christian
> 
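
P.S. Regarding the suggestion above to rebuild everything with the MPICH2 
compiler wrappers: besides checking `which mpicxx`, the program itself can 
report which MPI it was built against.  A minimal sketch (MPI_Get_version is 
standard; the MPICH2_VERSION macro is printed only if mpi.h happens to 
define it):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, major, minor;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_version(&major, &minor);    // version of the MPI standard
        if (rank == 0) {
            std::printf("MPI standard version: %d.%d\n", major, minor);
    #ifdef MPICH2_VERSION
            std::printf("MPICH2 version: %s\n", MPICH2_VERSION);
    #endif
        }
        MPI_Finalize();
        return 0;
    }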



