[mpich-discuss] MPI_Recv crashes with mpd ring

Dave Goodell goodell at mcs.anl.gov
Tue Feb 15 16:53:43 CST 2011


Use a newer version of MPICH2; 1.0.6 is too old.  The current release is 1.3.2p1, and I would recommend that instead.
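A build from source typically looks something like this (the tarball name and install prefix below are just examples, adjust them for your setup):

  tar xzf mpich2-1.3.2p1.tar.gz
  cd mpich2-1.3.2p1
  ./configure --prefix=/opt/mpich2-1.3.2p1
  make
  make install

Make sure the new bin directory is at the front of PATH on all of your machines so that mpiexec and the libraries it launches against actually match.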

Also, after upgrading, use hydra instead of MPD: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
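With hydra there is no ring to start beforehand; you hand mpiexec a host file instead.  Something along these lines should reproduce your 4-rank layout (the host file name and its contents are just placeholders for your machines):

  cat > hosts <<EOF
  mach1
  mach2
  mach3
  mach4
  EOF

  mpiexec -f hosts -n 1 a.exec arg1 : -n 1 a.exec arg2 : -n 1 a.exec arg3 : -n 1 a.exec arg4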

-Dave

On Feb 15, 2011, at 4:44 PM CST, Jain, Rohit wrote:

> Hi,
>  
> I am using MPICH2 version 1.0.6 to run a parallel application on multiple Linux machines using an mpd ring.
>  
> I created the ring on 4 machines:
> mpdtrace -l
> mach1_55761
> mach2_46635
> mach3_34866
> mach4_37727
>  
> Then I ran the application using mpiexec:
> mpiexec -np 1 a.exec arg1 : -np 1 a.exec arg2 : -np 1 a.exec arg3 : -np 1 a.exec arg4
>  
> The application starts and runs for a while, then crashes in MPI_Recv with the following error:
>  
> Fatal error in MPI_Recv: Error message texts are not available
> rank 2 in job 1 mach1_55761   caused collective abort of all ranks
>   exit status of rank 2: killed by signal 9
>  
> On re-run, it crashes with the same error, but at a different time.
>  
> The same environment works fine when run on multiple cores of a single SMP machine, instead of the mpd ring.
>  
> I tried TotalView, but it also exits without any useful information.
>  
> How do I debug/cure this problem?
>  
> Regards,
> Rohit
>  
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
