[mpich-discuss] halt after mpiexec
Dave Goodell
goodell at mcs.anl.gov
Thu Jan 14 09:09:25 CST 2010
Hi yi,
I'm not sure exactly what's going on here, but it looks like rank 3 is
trying to set up a shared memory region for communication with another
rank even though it shouldn't be.
There's a chance that this is related to a bug in the way that mpd
figures out which processes are on which machine. Can you try setting
the environment variable MPICH_NO_LOCAL=1 and let us know what
happens? For example:
MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
In a similar vein, you can also try using hydra to rule out the mpd
issue:
mpiexec.hydra -n 3 /path/to/cpi
There are other things we can look at, but let's start there and see
if that works out for us.
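If it helps, the two experiments above can be wrapped in a small script
so that a hang shows up as a timeout instead of a stuck terminal. This is
only a sketch: run_case and the 60-second limit are made up for
illustration, it assumes the coreutils timeout utility is installed, and
/path/to/cpi is the same placeholder as above.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit and report how it ended.
# Usage: run_case LABEL SECONDS COMMAND [ARGS...]
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
run_case() {
    label=$1
    limit=$2
    shift 2
    timeout "$limit" "$@"
    rc=$?
    case $rc in
        0)   echo "$label: ok" ;;
        124) echo "$label: timed out after ${limit}s (possible hang)" ;;
        *)   echo "$label: failed with exit status $rc" ;;
    esac
}

# Placeholder path; substitute the real location of the cpi example.
CPI=/path/to/cpi

# Experiment 1: mpd launcher with MPICH_NO_LOCAL=1 set in the environment.
if command -v mpiexec >/dev/null 2>&1; then
    run_case "mpd with MPICH_NO_LOCAL=1" 60 env MPICH_NO_LOCAL=1 mpiexec -n 3 "$CPI"
fi

# Experiment 2: hydra launcher, to rule out the mpd issue.
if command -v mpiexec.hydra >/dev/null 2>&1; then
    run_case "hydra" 60 mpiexec.hydra -n 3 "$CPI"
fi
```

The one-line summary printed for each case, together with any error stack
that appears before it, is the useful part to send back to the list.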
-Dave
On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
> Dear all,
>
> I'm new here and have run into a problem at the very beginning of
> learning MPI.
>
> Basically,
> mpiexec -n i /bin/hostname
> works for every i >= 1 I've tested,
>
> but
> mpiexec -n i /path-to-example-dir/cpi
> fails for any i >= 2.
>
> The details are:
>
> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1.
> One has two cores, and the other two have one core each.
> (machine names: rome, 2 cores;
> julia, 1 core;
> meg, 1 core)
>
>
> On this minimal test bed for learning MPI, I built mpich2-1.2.1
> using the default configure options from the "installation guide".
>
> Then on "rome", I put the mpd.hosts file in home dir with content:
> julia
> meg
>
> Then I ran
> mpdboot -n 3 # works
> mpdtrace -l # works, shows the three machine names and port numbers
> mpiexec -l -n 3 /bin/hostname # works! show three machine names
>
> but
>
> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi
> # !!! it hangs here
>
> Then I tried:
> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi
> # works: runs on rome only and returns the result
>
> But -n greater than or equal to 2 causes it to hang, or produces
> errors like this (with -n 4):
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(394).................: Initialization failed
> MPID_Init(135)........................: channel initialization failed
> MPIDI_CH3_Init(43)....................:
> MPID_nem_init(202)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(358)........:
> MPIU_SHMW_Seg_open(897)...............:
> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory
> rank 3 in job 12 rome_39209 caused collective abort of all ranks
> exit status of rank 3: return code 1
>
>
> Then I rebuilt mpich2 on rome (because it's SMP) with
> --with-device=ch3:ssm
>
> but got the same error.
>
> Could anyone give me some direction?
>
> Thanks in advance!
>
> Best,
> yi