[mpich-discuss] MPI_Recv crashes with mpd ring
Jain, Rohit
Rohit_Jain at mentor.com
Tue Feb 15 17:52:32 CST 2011
Hi Dave,
I had 1.2.1p1 built locally, so I tried that. It also gave me the same
fatal error. I will try a newer version, but I am not very hopeful.
I am trying to use hydra (mpiexec.hydra) with 1.2.1p1, but I am getting some
startup errors:
The authenticity of host 'XXX' can't be established.
RSA key fingerprint is ed:ce:ca:7b:08:b9:49:fd:f6:af:14.
Are you sure you want to continue connecting (yes/no)?
The authenticity of host 'XXX2' can't be established.
RSA key fingerprint is fb:1b:7b:0c:bb:b1:a6:b1:7d:dc:05.
Any pointers on how to resolve them?
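One common way to avoid these interactive prompts is to record each host's key in known_hosts before launching, for example with ssh-keyscan. The following is only a sketch under assumptions: the helper name add_host_keys is hypothetical, the host names in the usage comment stand in for the real ones, and it assumes hydra is launching over ssh. Keys collected this way are not verified, so the fingerprints should still be checked separately.

```shell
#!/bin/sh
# Hypothetical helper: append the RSA host keys of the given hosts to a
# known_hosts file so that ssh no longer asks for confirmation.
# Usage: add_host_keys KNOWN_HOSTS_FILE host1 [host2 ...]
add_host_keys() {
    known_hosts=$1
    shift
    mkdir -p "$(dirname "$known_hosts")"
    touch "$known_hosts"
    for host in "$@"; do
        # -t rsa requests the same key type that the prompt reports.
        ssh-keyscan -t rsa "$host" >> "$known_hosts" 2>/dev/null
    done
    # Drop duplicate entries left by repeated runs.
    sort -u "$known_hosts" -o "$known_hosts"
}

# Example (commented out so that sourcing this sketch contacts no hosts):
# add_host_keys "$HOME/.ssh/known_hosts" host1 host2
```

Logging in to each host once by hand and answering "yes" has the same effect and lets the fingerprint be inspected at the same time.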
Regards,
Rohit
============================
Use a newer version of MPICH2; 1.0.6 is too old. The current release is
1.3.2p1, and I would recommend that instead.
Also, after upgrading, use hydra instead of MPD:
http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
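As a rough sketch of what the switch might look like, hydra reads the machine list from a host file instead of an mpd ring. The helper name write_hostfile and the file name hosts are hypothetical, the machine names are taken from the mpdtrace output quoted below, and the -f option is assumed to be available in the installed hydra version (check mpiexec.hydra -help to confirm).

```shell
#!/bin/sh
# Hypothetical helper: write one host name per line, which is the basic
# layout of a hydra host file.
# Usage: write_hostfile FILE host1 [host2 ...]
write_hostfile() {
    hostfile=$1
    shift
    printf '%s\n' "$@" > "$hostfile"
}

# Example (commented out so that sourcing this sketch launches nothing):
# write_hostfile hosts mach1 mach2 mach3 mach4
# mpiexec.hydra -f hosts -np 1 a.exec arg1 : -np 1 a.exec arg2 : \
#     -np 1 a.exec arg3 : -np 1 a.exec arg4
```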
-Dave
On Feb 15, 2011, at 4:44 PM CST, Jain, Rohit wrote:
> Hi,
>
> I am using MPICH2 version 1.0.6 to run a parallel application on
> multiple Linux machines using an mpd ring.
>
> I created the ring on 4 machines:
> mpdtrace -l
> mach1_55761
> mach2_46635
> mach3_34866
> mach4_37727
>
> Then I ran the application using mpiexec:
> mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4
>
> The application starts and runs for a while, then crashes in
> MPI_Recv with the following error:
>
> Fatal error in MPI_Recv: Error message texts are not available
> rank 2 in job 1 mach1_55761 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
>
> On re-run, it crashes with the same error, but at a different time.
>
> The same environment works fine when run on multiple cores of a single SMP
> machine instead of an mpd ring.
>
> I tried TotalView, but it also exits without giving any useful information.
>
> How do I debug/cure this problem?
>
> Regards,
> Rohit
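Signal 9 is SIGKILL, which a process cannot catch and which is typically sent from outside the process. One possible source, and this is only an assumption and not a diagnosis, is the Linux out-of-memory killer on the node that ran rank 2. A minimal sketch for checking the kernel log on that node follows; the helper name find_oom_kills is hypothetical, and the exact wording of the kernel messages varies between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: filter kernel log text on stdin for lines that the
# out-of-memory killer usually emits.
find_oom_kills() {
    grep -i -E 'out of memory|killed process'
}

# Example, run on the node that hosted the killed rank:
# dmesg | find_oom_kills
```

No output only means that no matching messages are present in the log that was searched; other causes of the kill would still need to be ruled out.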