[mpich-discuss] halt after mpiexec

Dave Goodell goodell at mcs.anl.gov
Thu Jan 14 09:09:25 CST 2010


Hi yi,

I'm not sure exactly what's going on here, but it looks like rank 3 is  
trying to set up a shared memory region for communication with another  
rank even though it shouldn't be.

There's a chance that this is related to a bug in the way that mpd  
figures out which processes are on which machine.  Can you try setting  
the environment variable MPICH_NO_LOCAL=1 and let us know what  
happens?  For example:

MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi

In a similar vein, you can also try using hydra to rule out the mpd  
issue:

mpiexec.hydra -n 3 /path/to/cpi

There are other things we can look at, but let's start there and see  
if that works out for us.

-Dave

On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:

> Dear all,
>
> I'm new here and have encountered a problem at the very beginning of
> learning MPI.
>
> Basically, I find that
> mpiexec -n i /bin/hostname
> works for any i >= 1 I've tested,
>
> but
> mpiexec -n i /path-to-example-dir/cpi
> errors for any i >= 2.
>
> The details are:
>
> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1.
> One has two cores, and the other two have one core each.
> (machine names: rome, 2 cores;
>                 julia, 1 core;
>                 meg, 1 core)
>
>
> On this minimal test bed for learning MPI, I built mpich2-1.2.1 using
> the default configure options from the installation guide.
>
> Then on "rome", I put the mpd.hosts file in home dir with content:
> julia
> meg
>
> Then I ran
> mpdboot -n 3  # works
> mpdtrace -l # works, show the three machine names and port num
> mpiexec -l -n 3 /bin/hostname # works! show three machine names
>
> but
>
> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi # !!!!!!!! it
> halted there.
>
> Then I tried:
> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi # works, run on
> rome only and returns the result
>
> But -n greater than or equal to 2 causes it to halt, or produces
> errors like this (with -n 4):
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(394).................: Initialization failed
> MPID_Init(135)........................: channel initialization failed
> MPIDI_CH3_Init(43)....................:
> MPID_nem_init(202)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(358)........:
> MPIU_SHMW_Seg_open(897)...............:
> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file  
> or directory
> rank 3 in job 12  rome_39209   caused collective abort of all ranks
>  exit status of rank 3: return code 1
>
>
> Then I rebuilt mpich2 on rome (since it's an SMP machine) with
> --with-device=ch3:ssm
>
> But I got the same error.
>
> Could anyone give me some directions?
>
> Thanks in advance!
>
> Best,
> yi
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


