[mpich-discuss] halt after mpiexec

Dave Goodell goodell at mcs.anl.gov
Thu Jan 14 11:09:21 CST 2010


Copying the cpi binary from one machine to the others (in the same  
path) should work fine, as long as all machines are running the same  
OS and processor type (x86 versus x86_64 versus PPC, etc).
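
For example, a quick sanity check (using standard tools, nothing specific
to MPICH) is to compare the architectures and the binaries on each machine:

   uname -m            # should print the same value everywhere, e.g. x86_64
   file /path/to/cpi   # should report the same ELF class/architecture

If those agree, copying the binary with scp (e.g. "scp /path/to/cpi
meg:/path/to/cpi") is enough.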

-Dave

On Jan 14, 2010, at 10:33 AM, Gao, Yi wrote:

> Hi Dave,
>
> I tried the mpiexec.hydra -f hostfile -n 3 /path/to/cpi
> with the hostfile being:
> rome
> meg
> julia
>
> but it halts with this output:
> Process 1 of 3 is on julia
> Process 2 of 3 is on meg
> ^CTerminated (signal 15)    // after I pressed ^C
>
> Then I removed the rome line from the hostfile, leaving only
> meg
> julia
>
> This time it runs; for -n 4, for example, I get the expected output:
> Process 2 of 4 is on julia
> Process 1 of 4 is on meg
> Process 3 of 4 is on meg
> Process 0 of 4 is on julia
> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> wall clock time = 0.001271
>
> In fact, using only julia and meg (the two single-core machines) with
> mpdboot, mpiexec also worked.
>
> Anthony pointed out in a previous email that I might be using
> different cpi binaries. However: 1. on all the machines, cpi
> resides at the same path; 2. since the machines themselves are different
> (although the OS and compilers are freshly installed and identical),
> does copying one binary to the others work?
>
> Thanks!
>
>
> Best,
> yi
>
> On Thu, Jan 14, 2010 at 11:12 AM, Dave Goodell <goodell at mcs.anl.gov>  
> wrote:
>> Sorry, I gave poor instructions on using hydra.  Because there is no
>> mpdboot step in normal hydra usage, you need to specify a hostfile for
>> hydra when running mpiexec:
>>
>> mpiexec.hydra -f hostfile -n 3 /path/to/cpi
>>
>> where hostfile should contain, in your case:
>>
>> --------8<-------
>> rome
>> meg
>> julia
>> --------8<-------
>>
>> -Dave
>>
>> On Jan 14, 2010, at 9:58 AM, Gao, Yi wrote:
>>
>>> Hi Dave,
>>>
>>> Thanks for the advice. I followed your two suggestions (MPICH_NO_LOCAL
>>> and mpiexec.hydra) and tried as follows:
>>>
>>> 1. use MPICH_NO_LOCAL=1 mpiexec -n i /path/to/cpi
>>>
>>> i=1: no problem; it runs on rome (the 2-core machine) with one process
>>> and exits.
>>>
>>> i=2: it halts with this output:
>>> Process 1 of 2 is on meg // meg has 1 core
>>>
>>> i=3: it halts like this:
>>> Process 2 of 3 is on julia // julia has 1 core
>>> Process 1 of 3 is on meg
>>>
>>> i=4: it halts like this:
>>> Process 1 of 4 is on meg
>>> Process 2 of 4 is on julia
>>>
>>> i=5: it halts like this:
>>> Process 2 of 5 is on julia
>>> Process 1 of 5 is on meg
>>> Process 4 of 5 is on meg
>>>
>>> In all the halted cases above, running top on meg or julia shows the
>>> CPU at 100%. For the i=5 case, it seems that 2 of the processes are
>>> running on a single-core machine (meg), each taking about 50% CPU
>>> according to top. None of the cases above shows rome, the 2-core
>>> machine, in the output.
>>>
>>> So, compared with the situation without "MPICH_NO_LOCAL=1", this time
>>> it just stops there without printing any error message.
>>>
>>> 2. mpiexec.hydra -n i /path/to/cpi
>>>
>>> 2.1
>>>
>>> First, it asked me to add localhost to the known hosts, and I answered
>>> yes (see my note after the output below).
>>>
>>> The authenticity of host 'localhost (::1)' can't be established.
>>> RSA key fingerprint is 3e:62:41:30:a8:40:33:7e:b4:34:8e:2c:f4:37:43:20.
>>> Are you sure you want to continue connecting (yes/no)? yes
>>> Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
>>> Process 2 of 3 is on rome
>>> Process 1 of 3 is on rome
>>> Process 0 of 3 is on rome
>>> pi is approximately 3.1415926544231323, Error is 0.0000000008333392
>>> wall clock time = 0.008936
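>>>
>>> (Side note: I guess I could avoid that prompt in the future by
>>> pre-populating ~/.ssh/known_hosts on each machine with something like
>>>
>>>   ssh-keyscan rome meg julia localhost >> ~/.ssh/known_hosts
>>>
>>> since hydra appears to launch the processes over ssh here.)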
>>>
>>> 2.2
>>>
>>> The above case is for i=3; it runs and exits without an error message.
>>> However, it seems that all processes are on the machine issuing the
>>> command (rome). And this is the case for i=1, 2, ..., 10. (I think if I
>>> kept going, it would still be the case....)
>>>
>>> 2.3
>>> Even when I ran mpiexec.hydra -n i /bin/hostname,
>>> I got i "rome"s, which is not what I expected....
>>>
>>>
>>> 2.4
>>> Using mpdtrace -l on rome, I get normal output.
>>>
>>> rome_53533 (128.61.134.31)
>>> meg_46371 (128.61.134.44)
>>> julia_53931 (128.61.135.30)
>>>
>>>
>>> 3.
>>> I have set my first objective as just using one core per machine, no
>>> matter how many cores each one has (see the hostfile sketch below).
>>> Once that works, further exploiting SMP on each node would be the next step.
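>>>
>>> (If I understand the hydra hostfile format correctly, that could be
>>> expressed with an explicit per-host process count, something like:
>>>
>>> --------8<-------
>>> rome:1
>>> meg:1
>>> julia:1
>>> --------8<-------
>>>
>>> but please correct me if that syntax is off.)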
>>>
>>>
>>> Thank you for the suggestion!
>>>
>>>
>>> Best,
>>> yi
>>>
>>>
>>>
>>>
>>> On Thu, Jan 14, 2010 at 10:09 AM, Dave Goodell <goodell at mcs.anl.gov>
>>> wrote:
>>>>
>>>> Hi yi,
>>>>
>>>> I'm not sure exactly what's going on here, but it looks like rank 3 is
>>>> trying to set up a shared memory region for communication with another
>>>> rank even though it shouldn't be.
>>>>
>>>> There's a chance that this is related to a bug in the way that mpd
>>>> figures out which processes are on which machine.  Can you try setting
>>>> the environment variable MPICH_NO_LOCAL=1 and let us know what happens?
>>>> For example:
>>>>
>>>> MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
>>>>
>>>> In a similar vein, you can also try using hydra to rule out the mpd
>>>> issue:
>>>>
>>>> mpiexec.hydra -n 3 /path/to/cpi
>>>>
>>>> There are other things we can look at, but let's start there and see
>>>> if that works out for us.
>>>>
>>>> -Dave
>>>>
>>>> On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I'm new here and have run into a problem at the very beginning of
>>>>> learning MPI.
>>>>>
>>>>> Basically, I find that
>>>>> mpiexec -n i /bin/hostname
>>>>> works for any i >= 1 I've tested,
>>>>>
>>>>> but
>>>>> mpiexec -n i /path-to-example-dir/cpi
>>>>> fails for any i >= 2.
>>>>>
>>>>> The details are:
>>>>>
>>>>> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1;
>>>>> one has two cores, and the other two have one core each.
>>>>> (machine names: rome, 2 cores;
>>>>>                       julia, 1 core;
>>>>>                       meg, 1 core)
>>>>>
>>>>>
>>>>> On this minimal test bed for learning MPI, I built mpich2-1.2.1
>>>>> using the default configure options from the "installation guide".
>>>>>
>>>>> Then, on "rome", I put an mpd.hosts file in my home dir with the content:
>>>>> julia
>>>>> meg
>>>>>
>>>>> Then I ran
>>>>> mpdboot -n 3  # works
>>>>> mpdtrace -l   # works, shows the three machine names and port numbers
>>>>> mpiexec -l -n 3 /bin/hostname  # works! shows the three machine names
>>>>>
>>>>> but
>>>>>
>>>>> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi  # !!!!!!!!
>>>>> it halted there.
>>>>>
>>>>> Then I tried:
>>>>> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi  # works,
>>>>> runs on rome only and returns the result
>>>>>
>>>>> But -n greater than or equal to 2 causes it to halt, or to produce
>>>>> errors like these (with -n 4):
>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(394).................: Initialization failed
>>>>> MPID_Init(135)........................: channel initialization failed
>>>>> MPIDI_CH3_Init(43)....................:
>>>>> MPID_nem_init(202)....................:
>>>>> MPIDI_CH3I_Seg_commit(366)............:
>>>>> MPIU_SHMW_Hnd_deserialize(358)........:
>>>>> MPIU_SHMW_Seg_open(897)...............:
>>>>> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory
>>>>> rank 3 in job 12  rome_39209   caused collective abort of all ranks
>>>>>  exit status of rank 3: return code 1
>>>>>
>>>>>
>>>>> Then I rebuilt mpich2 on rome (since it's an SMP machine) with
>>>>> --with-device=ch3:ssm, roughly along these lines (the prefix below is
>>>>> just a placeholder):
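>>>>>
>>>>> ./configure --with-device=ch3:ssm --prefix=/path/to/mpich2-ssm
>>>>> make && make install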
>>>>>
>>>>> But I got the same error.
>>>>>
>>>>> Could anyone give me some directions to try next?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Best,
>>>>> yi
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Yi Gao
>>> Graduate Student
>>> Dept. Biomedical Engineering
>>> Georgia Institute of Technology
>>
>>


