[mpich-discuss] MPI_Recv crashes with mpd ring

Jain, Rohit Rohit_Jain at mentor.com
Tue Feb 15 17:52:32 CST 2011


Hi Dave,
 
I had 1.2.1p1 built locally, so I tried that. It gave me the same
fatal error. I will try a newer version, but I am not very hopeful.
 
 
I am trying to use hydra (mpiexec.hydra) with 1.2.1p1, but I am getting
some startup errors:
 
The authenticity of host 'XXX' can't be established.
RSA key fingerprint is ed:ce:ca:7b:08:b9:49:fd:f6:af:14.
Are you sure you want to continue connecting (yes/no)? 
The authenticity of host 'XXX2' can't be established.
RSA key fingerprint is fb:1b:7b:0c:bb:b1:a6:b1:7d:dc:05.
 
Any pointers on how to resolve them?
 
Regards,
Rohit
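
The prompts above are ssh asking to confirm host keys it has not seen
before. One possible way past them, sketched below, is to add each
host's key to ~/.ssh/known_hosts before launching. This assumes hydra
is starting the remote processes over ssh. The function name
add_host_keys, the DRY_RUN and KNOWN_HOSTS variables, and the host
names are made up for illustration; only ssh-keyscan and its -t option
are real.

```shell
#!/bin/sh
# Sketch: pre-populate known_hosts so ssh stops prompting at startup.
# add_host_keys, DRY_RUN and KNOWN_HOSTS are hypothetical names.

add_host_keys() {
    known_hosts="${KNOWN_HOSTS:-$HOME/.ssh/known_hosts}"
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print what would be run; no network or file access.
            echo "ssh-keyscan -t rsa $host >> $known_hosts"
        else
            # Fetch the host's RSA public key and append it.
            ssh-keyscan -t rsa "$host" >> "$known_hosts"
        fi
    done
}

# Example, dry run with placeholder host names:
#   DRY_RUN=1 add_host_keys mach1 mach2 mach3 mach4
```

Keys collected this way are not verified, so it is worth comparing the
fingerprints against ones obtained from the machines directly. Logging
in to each host once by hand and answering "yes" has the same effect.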
 
 
============================
 
 
Use a newer version of MPICH2; 1.0.6 is too old.  The current release is
1.3.2p1, and I would recommend that instead.
 
Also, after upgrading, use hydra instead of MPD:
http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
 
-Dave
 
On Feb 15, 2011, at 4:44 PM CST, Jain, Rohit wrote:
 
> Hi,
>  
> I am using MPICH2 version 1.0.6 to run a parallel application on
> multiple Linux machines using an mpd ring.
>  
> I created the ring on 4 machines:
> mpdtrace -l
> mach1_55761
> mach2_46635
> mach3_34866
> mach4_37727
>  
> Then I ran the application using mpiexec:
> mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4
>  
> The application starts and runs for a while, then crashes in
> MPI_Recv with the following error:
>  
> Fatal error in MPI_Recv: Error message texts are not available
> rank 2 in job 1 mach1_55761   caused collective abort of all ranks
>   exit status of rank 2: killed by signal 9
>  
> On re-run, it crashes with the same error, but at a different time.
>  
> The same environment works fine when run on multiple cores of a single
> SMP machine instead of an mpd ring.
>  
> I tried TotalView, but it also exits without any useful information.
>  
> How do I debug/cure this problem?
>  
> Regards,
> Rohit

 
