[mpich-discuss] halt after mpiexec

Thu Jan 14 12:18:31 CST 2010

Hi Dave,

I reinstalled mpich2 via apt-get on Ubuntu 9.10 instead of using my own
compiled version, and I think it works now:

I compiled cpi.c on all 3 machines, put the binary in the same location on
each, and ran it with mpiexec.hydra. It works.
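
In case it helps someone else, the steps were roughly as in the sketch
below. write_hostfile is just a helper name I made up for illustration, and
the mpiexec.hydra line is the one Dave suggested earlier in this thread,
left commented out so the snippet can be sourced without MPI installed.

```shell
#!/bin/sh
# Hypothetical helper: write one host name per line to the given file.
write_hostfile() {
    file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Example use (not run here):
#   write_hostfile hostfile rome meg julia
#   mpiexec.hydra -f hostfile -n 3 /path/to/cpi
```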

Although this does not work if I add a 64-bit Ubuntu 9.10 machine to the
mix, it's enough for me to do some prototyping. :)
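
Given Dave's note that all machines need the same processor type, a quick
check like the sketch below might catch a mixed 32/64-bit setup before
launching. check_same_arch is a name I made up, and the ssh loop in the
comment assumes passwordless ssh to every host, so treat it as a rough idea
rather than a tested recipe.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every argument (an architecture
# string such as the output of `uname -m`) is identical to the first one.
check_same_arch() {
    first="$1"
    for arch in "$@"; do
        if [ "$arch" != "$first" ]; then
            echo "architecture mismatch: $first vs $arch" >&2
            return 1
        fi
    done
    return 0
}

# Example use (assumes passwordless ssh to each host; not run here):
#   check_same_arch $(for h in rome meg julia; do ssh "$h" uname -m; done)
```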

Thanks!

Best,
yi

On Thu, Jan 14, 2010 at 12:09 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Copying the cpi binary from one machine to the others (in the same path)
> should work fine, as long as all machines are running the same OS and
> processor type (x86 versus x86_64 versus PPC, etc).
>
> -Dave
>
> On Jan 14, 2010, at 10:33 AM, Gao, Yi wrote:
>
>> Hi Dave,
>>
>> I tried the mpiexec.hydra -f hostfile -n 3 /path/to/cpi
>> with the hostfile being:
>> rome
>> meg
>> julia
>>
>> but it halts with the output:
>> Process 1 of 3 is on julia
>> Process 2 of 3 is on meg
>> ^CTerminated (signal 15)    // after I press ^C
>>
>> Then I removed the rome line from the hostfile, leaving only
>> meg
>> julia
>>
>> This time it runs. For example, with -n 4 I get the expected output:
>> Process 2 of 4 is on julia
>> Process 1 of 4 is on meg
>> Process 3 of 4 is on meg
>> Process 0 of 4 is on julia
>> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
>> wall clock time = 0.001271
>>
>> In fact, using only julia and meg (two single-core machines), mpdboot
>> and mpiexec also worked.
>>
>> Anthony pointed out in a previous email that I might be using
>> different cpi binaries. However, on all the machines cpi resides in
>> the same path. Also, since the machines are different (although the OS
>> and compilers are freshly installed and identical), does copying one
>> binary to the others work?
>>
>> Thanks!
>>
>>
>> Best,
>> yi
>>
>> On Thu, Jan 14, 2010 at 11:12 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>
>>> Sorry, I gave poor instructions on using hydra.  Because there is no
>>> mpdboot step in normal hydra usage, you need to specify a hostfile for
>>> hydra when running mpiexec:
>>>
>>> mpiexec.hydra -f hostfile -n 3 /path/to/cpi
>>>
>>> Where hostfile should contain in your case:
>>>
>>> --------8<-------
>>> rome
>>> meg
>>> julia
>>> --------8<-------
>>>
>>> -Dave
>>>
>>> On Jan 14, 2010, at 9:58 AM, Gao, Yi wrote:
>>>
>>>> Hi Dave,
>>>>
>>>> Thanks for the advice. I followed your two suggestions (MPICH_NO_LOCAL
>>>> and mpiexec.hydra) and tried as follows:
>>>>
>>>> 1. use MPICH_NO_LOCAL=1 mpiexec -n i /path/to/cpi
>>>>
>>>> i=1, no problem: it runs on rome (the 2-core machine) with one process
>>>> and exits.
>>>>
>>>> i=2, halts with the output:
>>>> Process 1 of 2 is on meg // meg has 1 core
>>>>
>>>> i=3, halts like this:
>>>> Process 2 of 3 is on julia // julia has 1 core
>>>> Process 1 of 3 is on meg
>>>>
>>>> i=4, halts like this:
>>>> Process 1 of 4 is on meg
>>>> Process 2 of 4 is on julia
>>>>
>>>> i=5, halts like this:
>>>> Process 2 of 5 is on julia
>>>> Process 1 of 5 is on meg
>>>> Process 4 of 5 is on meg
>>>>
>>>> In all the halting cases above, running top on meg or julia shows the
>>>> CPU at 100%. In the i=5 case, it seems that 2 processes are running on
>>>> a single-core machine (meg), each taking about 50% CPU according to
>>>> top. None of the above shows rome, the 2-core machine, in the output.
>>>>
>>>> So, compared with the run without "MPICH_NO_LOCAL=1", this time it
>>>> just stops there without giving any error message.
>>>>
>>>> 2. mpiexec.hydra -n i /path/to/cpi
>>>>
>>>> 2.1
>>>>
>>>> First, it asked me to add localhost to the known hosts, and I answered yes.
>>>>
>>>> The authenticity of host 'localhost (::1)' can't be established.
>>>> RSA key fingerprint is 3e:62:41:30:a8:40:33:7e:b4:34:8e:2c:f4:37:43:20.
>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>> Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
>>>> Process 2 of 3 is on rome
>>>> Process 1 of 3 is on rome
>>>> Process 0 of 3 is on rome
>>>> pi is approximately 3.1415926544231323, Error is 0.0000000008333392
>>>> wall clock time = 0.008936
>>>>
>>>> 2.2
>>>>
>>>> The case above is i=3: it runs and exits without an error message.
>>>> However, it seems that all processes are on the machine issuing the
>>>> command (rome). This is the case for i=1, 2, ..., 10. (I think if I
>>>> kept going, it would still be the case.)
>>>>
>>>> 2.3
>>>> Even when I ran: mpiexec.hydra -n i /bin/hostname
>>>> I got "rome" i times, which is not what I expected.
>>>>
>>>>
>>>> 2.4
>>>> Using mpdtrace -l on rome, I get normal output.
>>>>
>>>> rome_53533 (128.61.134.31)
>>>> meg_46371 (128.61.134.44)
>>>> julia_53931 (128.61.135.30)
>>>>
>>>>
>>>> 3.
>>>> My first objective is just to use one core per machine, no matter how
>>>> many cores it has. Once that works, exploiting SMP on each node might
>>>> be the next step.
>>>>
>>>>
>>>> Thank you for the suggestion!
>>>>
>>>>
>>>> Best,
>>>> yi
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 14, 2010 at 10:09 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>>
>>>>> Hi yi,
>>>>>
>>>>> I'm not sure exactly what's going on here, but it looks like rank 3 is
>>>>> trying to set up a shared memory region for communication with another
>>>>> rank even though it shouldn't be.
>>>>>
>>>>> There's a chance that this is related to a bug in the way that mpd
>>>>> figures out which processes are on which machine.  Can you try setting
>>>>> the environment variable MPICH_NO_LOCAL=1 and let us know what happens?
>>>>> For example:
>>>>>
>>>>> MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
>>>>>
>>>>> In a similar vein, you can also try using hydra to rule out the mpd issue:
>>>>>
>>>>> mpiexec.hydra -n 3 /path/to/cpi
>>>>>
>>>>> There are other things we can look at, but let's start there and see
>>>>> if that works out for us.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I'm new here and have run into a problem at the very beginning of
>>>>>> learning MPI.
>>>>>>
>>>>>> Basically,
>>>>>> mpiexec -n i /bin/hostname
>>>>>> works for every i >= 1 I've tested,
>>>>>>
>>>>>> but
>>>>>> mpiexec -n i /path-to-example-dir/cpi
>>>>>> fails for every i >= 2.
>>>>>>
>>>>>> The details are:
>>>>>>
>>>>>> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1.
>>>>>> One has two cores, and the other two have one core each.
>>>>>> (machine names: rome, 2 cores;
>>>>>>                 julia, 1 core;
>>>>>>                 meg, 1 core)
>>>>>>
>>>>>>
>>>>>> On this minimal test bed for learning MPI, I built mpich2-1.2.1
>>>>>> using the default configure from the "installation guide".
>>>>>>
>>>>>> Then on "rome", I put the mpd.hosts file in home dir with content:
>>>>>> julia
>>>>>> meg
>>>>>>
>>>>>> Then I ran
>>>>>> mpdboot -n 3  # works
>>>>>> mpdtrace -l # works, shows the three machine names and port numbers
>>>>>> mpiexec -l -n 3 /bin/hostname # works! shows three machine names
>>>>>>
>>>>>> but
>>>>>>
>>>>>> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi # !!!!!!!! it halted there.
>>>>>>
>>>>>> Then I tried:
>>>>>> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi # works, runs on rome only and returns the result
>>>>>>
>>>>>> But -n greater than or equal to 2 causes it to halt or to produce
>>>>>> errors like this (with -n 4):
>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>> MPIR_Init_thread(394).................: Initialization failed
>>>>>> MPID_Init(135)........................: channel initialization failed
>>>>>> MPIDI_CH3_Init(43)....................:
>>>>>> MPID_nem_init(202)....................:
>>>>>> MPIDI_CH3I_Seg_commit(366)............:
>>>>>> MPIU_SHMW_Hnd_deserialize(358)........:
>>>>>> MPIU_SHMW_Seg_open(897)...............:
>>>>>> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory
>>>>>> rank 3 in job 12  rome_39209   caused collective abort of all ranks
>>>>>>  exit status of rank 3: return code 1
>>>>>>
>>>>>>
>>>>>> Then I rebuilt mpich2 on rome (because it's SMP) with
>>>>>> --with-device=ch3:ssm
>>>>>>
>>>>>> but got the same error.
>>>>>>
>>>>>> Could anyone point me in the right direction?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Best,
>>>>>> yi
>>>>>> _______________________________________________
>>>>>> mpich-discuss mailing list
>>>>>> mpich-discuss at mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Yi Gao
Graduate Student
Dept. Biomedical Engineering
Georgia Institute of Technology