[mpich-discuss] MPI_Recv crashes with mpd ring

Jain, Rohit Rohit_Jain at mentor.com
Tue Feb 15 16:44:08 CST 2011


Hi,

I am using MPICH2 version 2-1.06 to run a parallel application on
multiple Linux machines using an mpd ring.

I created the ring on 4 machines:

mpdtrace -l
mach1_55761
mach2_46635
mach3_34866
mach4_37727

Then I ran the application using mpiexec:

mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4

The application starts and runs for a while, then crashes in MPI_Recv
with the following error:

Fatal error in MPI_Recv: Error message texts are not available
rank 2 in job 1 mach1_55761   caused collective abort of all ranks
  exit status of rank 2: killed by signal 9

On a re-run, it crashes with the same error, but at a different point in
the run.

The same environment works fine when run on multiple cores of a single
SMP machine instead of on the mpd ring.

I tried TotalView, but it also exits without giving any useful
information.

How can I debug or fix this problem?

Regards,
Rohit
