[mpich-discuss] MPI_Recv crashes with mpd ring
Dave Goodell
goodell at mcs.anl.gov
Tue Feb 15 16:53:43 CST 2011
Use a newer version of MPICH2; 1.0.6 is too old. The current release is 1.3.2p1, and I would recommend that instead.
Also, after upgrading, use Hydra instead of MPD: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
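For example, with a recent release you can skip the mpd ring and give the hosts straight to Hydra's mpiexec; something along these lines (reusing the machine names, binary, and per-rank arguments from your mail) should work:

  mpiexec -hosts mach1,mach2,mach3,mach4 \
      -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4

You can also put the machine names in a file and pass it with -f <hostfile> instead of -hosts.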
-Dave
On Feb 15, 2011, at 4:44 PM CST, Jain, Rohit wrote:
> Hi,
>
> I am using MPICH2 1.0.6 to run a parallel application on multiple Linux machines using an mpd ring.
>
> I created the ring on 4 machines:
> mpdtrace -l
> mach1_55761
> mach2_46635
> mach3_34866
> mach4_37727
>
> Then I ran the application using mpiexec:
> mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4
>
> The application starts and runs for a while, then crashes in MPI_Recv with the following error:
>
> Fatal error in MPI_Recv: Error message texts are not available
> rank 2 in job 1 mach1_55761 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
>
> On re-run it crashes with the same error, but at a different time.
>
> The same environment works fine when run on multiple cores of a single SMP machine instead of over the mpd ring.
>
> I tried TotalView, but it also exits without any useful information.
>
> How do I debug/cure this problem?
>
> Regards,
> Rohit
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss