[mpich-discuss] halt after mpiexec

Gao, Yi gaoyi.cn at gmail.com
Thu Jan 14 09:58:30 CST 2010


Hi Dave,

Thanks for the advice. I followed your two suggestions (MPICH_NO_LOCAL
and mpiexec.hydra) and tried the following:

1. use MPICH_NO_LOCAL=1 mpiexec -n i /path/to/cpi

i=1: no problem; it runs on rome (the 2-core machine) with one process and exits.

i=2: it halts with this output:
Process 1 of 2 is on meg // meg has 1 core

i=3: it halts like this:
Process 2 of 3 is on julia // julia has 1 core
Process 1 of 3 is on meg

i=4: it halts like this:
Process 1 of 4 is on meg
Process 2 of 4 is on julia

i=5: it halts like this:
Process 2 of 5 is on julia
Process 1 of 5 is on meg
Process 4 of 5 is on meg

In all of the halted cases above, running top on meg or julia shows the
CPU 100% used. In the i=5 case, two processes seem to be running on the
single-core machine (meg), each taking about 50% CPU according to top.
None of the outputs above mention rome, the 2-core machine.

So, compared with the runs without "MPICH_NO_LOCAL=1", this time it
simply stops there without printing any error message.
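
To narrow down where it hangs, my next test will be a minimal program
that only initializes, prints, synchronizes, and finalizes (just a
sketch I put together, not one of the MPICH examples), so I can tell
whether it is MPI_Init itself or the later communication that blocks:

/* barrier_test.c -- compile with: mpicc barrier_test.c -o barrier_test */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);          /* a hang here means startup is broken */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d alive on %s\n", rank, size, name);
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);     /* a hang here means the ranks cannot talk */
    printf("rank %d passed the barrier\n", rank);
    MPI_Finalize();
    return 0;
}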

2. mpiexec.hydra -n i /path/to/cpi

2.1

First, it asked me to add localhost to the known hosts, and I answered yes:

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3e:62:41:30:a8:40:33:7e:b4:34:8e:2c:f4:37:43:20.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Process 2 of 3 is on rome
Process 1 of 3 is on rome
Process 0 of 3 is on rome
pi is approximately 3.1415926544231323, Error is 0.0000000008333392
wall clock time = 0.008936

2.2

The output above is for i=3; it runs and exits without an error
message. However, it seems that all processes run on the machine
issuing the command (rome). This is the case for i=1, 2, ..., 10, and
I suspect it would stay that way for larger i.

2.3
Even when I ran mpiexec.hydra -n i /bin/hostname,
I got i copies of "rome", which is not what I expected.
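
If I understand hydra correctly (only my assumption, please correct
me), it does not consult the mpd ring at all, so without an explicit
host list it runs everything on the local machine. My next try will be
to point it at the same host list I use for mpd:

mpiexec.hydra -f ~/mpd.hosts -n 3 /path/to/cpi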


2.4
Using mpdtrace -l on rome, I get the normal output:

rome_53533 (128.61.134.31)
meg_46371 (128.61.134.44)
julia_53931 (128.61.135.30)


3.
As a first step, my objective is just to use one core per machine,
regardless of how many cores each one has. Once that works, exploiting
SMP on each node can be the next step.
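
From my reading of the hydra documentation (so the syntax below is my
assumption, not something I have verified), a host file that caps each
machine at one process should achieve this, e.g. a file hosts.txt
containing:

rome:1
julia:1
meg:1

and then:

mpiexec.hydra -f hosts.txt -n 3 /path/to/cpi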


Thank you for the suggestion!


Best,
yi




On Thu, Jan 14, 2010 at 10:09 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Hi yi,
>
> I'm not sure exactly what's going on here, but it looks like rank 3 is
> trying to set up a shared memory region for communication with another rank
> even though it shouldn't be.
>
> There's a chance that this is related to a bug in the way that mpd figures
> out which processes are on which machine.  Can you try setting the
> environment variable MPICH_NO_LOCAL=1 and let us know what happens?  For
> example:
>
> MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
>
> In a similar vein, you can also try using hydra to rule out the mpd issue:
>
> mpiexec.hydra -n 3 /path/to/cpi
>
> There are other things we can look at, but let's start there and see if that
> works out for us.
>
> -Dave
>
> On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
>
>> Dear all,
>>
>> I'm new here and have encountered a problem at the very beginning of
>> learning MPI.
>>
>> Basically,
>> mpiexec -n i /bin/hostname
>> works for any i >= 1 I've tested,
>>
>> but
>> mpiexec -n i /path-to-example-dir/cpi
>> errors out for any i >= 2.
>>
>> The details are:
>>
>> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1.
>> One has two cores, and the other two have one core each.
>> (machine names: rome, 2 cores;
>>                       julia, 1 core;
>>                       meg, 1 core )
>>
>>
>> On this minimal test bed for learning MPI, I built
>> mpich2-1.2.1 using the default configure options from the installation guide.
>>
>> Then on "rome", I put the mpd.hosts file in home dir with content:
>> julia
>> meg
>>
>> Then I ran
>> mpdboot -n 3  # works
>> mpdtrace -l # works, shows the three machine names and port numbers
>> mpiexec -l -n 3 /bin/hostname # works! shows the three machine names
>>
>> but
>>
>> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi # !!!!!!!! it
>> halted there.
>>
>> Then I tried:
>> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi # works, runs on
>> rome only and returns the result
>>
>> But -n greater than or equal to 2 causes it to halt, or gives errors
>> like this (with -n 4):
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(394).................: Initialization failed
>> MPID_Init(135)........................: channel initialization failed
>> MPIDI_CH3_Init(43)....................:
>> MPID_nem_init(202)....................:
>> MPIDI_CH3I_Seg_commit(366)............:
>> MPIU_SHMW_Hnd_deserialize(358)........:
>> MPIU_SHMW_Seg_open(897)...............:
>> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or
>> directory
>> rank 3 in job 12  rome_39209   caused collective abort of all ranks
>>  exit status of rank 3: return code 1
>>
>>
>> Then I rebuilt mpich2 on rome (since it's the SMP machine) with --with-device=ch3:ssm.
>>
>> But I got the same error.
>>
>> Could anyone give me some directions?
>>
>> Thanks in advance!
>>
>> Best,
>> yi
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>



--
Yi Gao
Graduate Student
Dept. Biomedical Engineering
Georgia Institute of Technology

