[mpich-discuss] MPI_Recv crashes with mpd ring
Jain, Rohit
Rohit_Jain at mentor.com
Tue Feb 15 16:44:08 CST 2011
Hi,
I am using MPICH2 1.0.6 to run a parallel application on
multiple Linux machines using an mpd ring.
I created the ring on 4 machines:
mpdtrace -l
mach1_55761
mach2_46635
mach3_34866
mach4_37727
Then I ran the application using mpiexec:
mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec
arg3 : -np 1 a.exec arg4
The application starts and runs for a while, then crashes in MPI_Recv
with the following error:
Fatal error in MPI_Recv: Error message texts are not available
rank 2 in job 1 mach1_55761 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
On re-run it crashes with the same error, but at a different point in the run.
The same setup works fine when run on multiple cores of a single SMP
machine instead of across the mpd ring.
I tried TotalView, but it also exits without any useful information.
How can I debug or fix this problem?
Regards,
Rohit