[mpich-discuss] halt after mpiexec

Gao, Yi gaoyi.cn at gmail.com
Thu Jan 14 10:33:46 CST 2010


Hi Dave,

I tried mpiexec.hydra -f hostfile -n 3 /path/to/cpi
with the hostfile being:
rome
meg
julia

but it hung with this output:
Process 1 of 3 is on julia
Process 2 of 3 is on meg
^CTerminated (signal 15)    // after I pressed ^C

Then I removed the rome line from the hostfile, leaving only
meg
julia

This time it ran; for -n 4, for example, I got the expected output:
Process 2 of 4 is on julia
Process 1 of 4 is on meg
Process 3 of 4 is on meg
Process 0 of 4 is on julia
pi is approximately 3.1415926544231243, Error is 0.0000000008333312
wall clock time = 0.001271

In fact, using only julia and meg, the two single-core machines,
mpiexec also worked under mpdboot.
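
Since hydra launches remote processes over ssh by default (as far as I
understand), I will also verify that passwordless ssh works from the
launching node to every host in the hostfile, rome included:

ssh rome hostname    # each should print the host's name, with no password prompt
ssh meg hostname
ssh julia hostname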

Anthony pointed out in a previous email that I might be using
different cpi binaries. However: 1. on all the machines, cpi
resides at the same path; 2. since the machines are different
(although the OS and compilers are freshly installed and identical),
would copying one binary to the others work?
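
(To rule out mismatched binaries, I can compare checksums across the
hosts with standard tools, e.g.:

md5sum /path/to/cpi
ssh meg md5sum /path/to/cpi
ssh julia md5sum /path/to/cpi

If the checksums match, and the OS, libraries, and architecture are the
same, running one copied binary everywhere should generally be fine.)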

Thanks!


Best,
yi

On Thu, Jan 14, 2010 at 11:12 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Sorry, I gave poor instructions on using hydra.  Because there is no mpdboot
> step in normal hydra usage, you need to specify a hostfile for hydra when
> running mpiexec:
>
> mpiexec.hydra -f hostfile -n 3 /path/to/cpi
>
> where hostfile, in your case, should contain:
>
> --------8<-------
> rome
> meg
> julia
> --------8<-------
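>
> (hydra's hostfile lines can also take an optional per-host process count,
> written host:count, e.g. rome:2 to place up to two processes on rome;
> check the docs for your version to confirm the exact syntax.)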
>
> -Dave
>
> On Jan 14, 2010, at 9:58 AM, Gao, Yi wrote:
>
>> Hi Dave,
>>
>> Thanks for the advice. I followed your two suggestions (MPICH_NO_LOCAL
>> and mpiexec.hydra) and tried the following:
>>
>> 1. Using MPICH_NO_LOCAL=1 mpiexec -n i /path/to/cpi:
>>
>> i=1: no problem; it runs on rome (the 2-core machine) with one process and exits.
>>
>> i=2: hangs, with the output:
>> Process 1 of 2 is on meg // meg has 1 core
>>
>> i=3: hangs like this:
>> Process 2 of 3 is on julia // julia has 1 core
>> Process 1 of 3 is on meg
>>
>> i=4: hangs like this:
>> Process 1 of 4 is on meg
>> Process 2 of 4 is on julia
>>
>> i=5: hangs like this:
>> Process 2 of 5 is on julia
>> Process 1 of 5 is on meg
>> Process 4 of 5 is on meg
>>
>> In all the hung cases above, running top on meg or julia shows the CPU
>> 100% used. For the i=5 case, two processes appear to be running on the
>> single-core machine (meg), each taking about 50% CPU according to top.
>> None of the runs above shows rome, the 2-core machine, in the output.
>>
>> So, compared with the situation without MPICH_NO_LOCAL=1, this time it
>> just hangs there without printing any error message.
>>
>> 2. mpiexec.hydra -n i /path/to/cpi
>>
>> 2.1
>>
>> First, it asked me to add the local host to the known hosts, and I answered yes:
>>
>> The authenticity of host 'localhost (::1)' can't be established.
>> RSA key fingerprint is 3e:62:41:30:a8:40:33:7e:b4:34:8e:2c:f4:37:43:20.
>> Are you sure you want to continue connecting (yes/no)? yes
>> Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
>> Process 2 of 3 is on rome
>> Process 1 of 3 is on rome
>> Process 0 of 3 is on rome
>> pi is approximately 3.1415926544231323, Error is 0.0000000008333392
>> wall clock time = 0.008936
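>>
>> (A side note: the interactive prompt can be avoided by pre-seeding the
>> known hosts with OpenSSH's ssh-keyscan, e.g.
>> ssh-keyscan rome meg julia >> ~/.ssh/known_hosts)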
>>
>> 2.2
>>
>> The output above (in 2.1) is for i=3; it ran and exited without any
>> error message. However, all processes seem to be on the machine issuing
>> the command (rome). This is the case for i=1, 2, ..., 10. (I expect it
>> would stay that way for larger i....)
>>
>> 2.3
>> Even when I ran mpiexec.hydra -n i /bin/hostname,
>> I got i copies of "rome", which is not what I expected....
>>
>>
>> 2.4
>> Using mpdtrace -l on rome, I get the normal output:
>>
>> rome_53533 (128.61.134.31)
>> meg_46371 (128.61.134.44)
>> julia_53931 (128.61.135.30)
>>
>>
>> 3.
>> As a first-step objective, I just want to use one core per machine,
>> no matter how many cores each may have (see the hostfile sketch below).
>> Once that works, exploiting SMP on each node can be the next step.
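>>
>> (A sketch of that, assuming hydra's host:count hostfile syntax; to be
>> verified against the docs:
>>
>> rome:1
>> meg:1
>> julia:1
>>
>> and then: mpiexec.hydra -f hostfile -n 3 /path/to/cpi )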
>>
>>
>> Thank you for the suggestion!
>>
>>
>> Best,
>> yi
>>
>>
>>
>>
>> On Thu, Jan 14, 2010 at 10:09 AM, Dave Goodell <goodell at mcs.anl.gov>
>> wrote:
>>>
>>> Hi yi,
>>>
>>> I'm not sure exactly what's going on here, but it looks like rank 3 is
>>> trying to set up a shared memory region for communication with another
>>> rank even though it shouldn't be.
>>>
>>> There's a chance that this is related to a bug in the way that mpd
>>> figures
>>> out which processes are on which machine.  Can you try setting the
>>> environment variable MPICH_NO_LOCAL=1 and let us know what happens?  For
>>> example:
>>>
>>> MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
>>>
>>> In a similar vein, you can also try using hydra to rule out the mpd
>>> issue:
>>>
>>> mpiexec.hydra -n 3 /path/to/cpi
>>>
>>> There are other things we can look at, but let's start there and see if
>>> that
>>> works out for us.
>>>
>>> -Dave
>>>
>>> On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
>>>
>>>> Dear all,
>>>>
>>>> I'm new here and have encountered a problem at the very beginning of
>>>> learning MPI.
>>>>
>>>> Basically,
>>>> mpiexec -n i /bin/hostname
>>>> works for any i >= 1 I've tested,
>>>>
>>>> but
>>>> mpiexec -n i /path-to-example-dir/cpi
>>>> fails for any i >= 2.
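>>>>
>>>> (Note that /bin/hostname is not an MPI program, so that test only
>>>> exercises process launch; cpi calls MPI_Init and collective
>>>> operations, which is where the failure shows up.)
>>>>
>>>> For reference, here is a minimal program in the spirit of the cpi
>>>> example (not the exact shipped source), compiled with mpicc:
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     int rank, size, namelen, i, n = 10000;
>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>     double h, x, sum = 0.0, local, pi = 0.0;
>>>>
>>>>     MPI_Init(&argc, &argv);               /* my runs fail in here */
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Get_processor_name(name, &namelen);
>>>>     printf("Process %d of %d is on %s\n", rank, size, name);
>>>>
>>>>     /* midpoint rule for the integral of 4/(1+x^2) over [0,1] = pi */
>>>>     h = 1.0 / n;
>>>>     for (i = rank; i < n; i += size) {
>>>>         x = h * (i + 0.5);
>>>>         sum += 4.0 / (1.0 + x * x);
>>>>     }
>>>>     local = h * sum;
>>>>     MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>>>>     if (rank == 0)
>>>>         printf("pi is approximately %.16f\n", pi);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }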
>>>>
>>>> The details are:
>>>>
>>>> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1;
>>>> one has two cores, and the other two have one core each.
>>>> (machine names: rome, 2 cores;
>>>>                      julia, 1 core;
>>>>                      meg, 1 core )
>>>>
>>>>
>>>> On this minimal test bed for learning MPI, I built mpich2-1.2.1
>>>> using the default configure options from the installation guide.
>>>>
>>>> Then, on rome, I put an mpd.hosts file in my home directory with the content:
>>>> julia
>>>> meg
>>>>
>>>> Then I ran
>>>> mpdboot -n 3  # works
>>>> mpdtrace -l # works, shows the three machine names and port numbers
>>>> mpiexec -l -n 3 /bin/hostname # works! shows the three machine names
>>>>
>>>> but
>>>>
>>>> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi # !!!!!!!! it
>>>> hung there.
>>>>
>>>> Then I tried:
>>>> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi # works, runs on
>>>> rome only and returns the result
>>>>
>>>> But -n greater than or equal to 2 causes it to hang, or to produce
>>>> errors like this (with -n 4):
>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>> MPIR_Init_thread(394).................: Initialization failed
>>>> MPID_Init(135)........................: channel initialization failed
>>>> MPIDI_CH3_Init(43)....................:
>>>> MPID_nem_init(202)....................:
>>>> MPIDI_CH3I_Seg_commit(366)............:
>>>> MPIU_SHMW_Hnd_deserialize(358)........:
>>>> MPIU_SHMW_Seg_open(897)...............:
>>>> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or
>>>> directory
>>>> rank 3 in job 12  rome_39209   caused collective abort of all ranks
>>>>  exit status of rank 3: return code 1
>>>>
>>>>
>>>> Then I rebuilt mpich2 on rome (since it's an SMP machine) with
>>>> --with-device=ch3:ssm,
>>>>
>>>> but got the same error.
>>>>
>>>> Could anyone give me some direction?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Best,
>>>> yi
>>>
>>>
>>
>>
>>
>> --
>> Yi Gao
>> Graduate Student
>> Dept. Biomedical Engineering
>> Georgia Institute of Technology
>
>

