[mpich-discuss] halt after mpiexec
Dave Goodell
goodell at mcs.anl.gov
Thu Jan 14 10:12:24 CST 2010
Sorry, I gave poor instructions on using hydra. Because there is no
mpdboot step in normal hydra usage, you need to specify a hostfile for
hydra when running mpiexec:
mpiexec.hydra -f hostfile -n 3 /path/to/cpi
where hostfile, in your case, should contain:
--------8<-------
rome
meg
julia
--------8<-------
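As a quick sanity check, you can first run hostname through hydra with
the hostfile:
mpiexec.hydra -f hostfile -n 3 /bin/hostname
which should now print rome, meg, and julia (in some order) instead of
three copies of rome.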
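If you later want to put more than one process on rome (to use both of
its cores), I believe hydra's hostfile also accepts an optional process
count per host, along these lines (from memory, so please double-check
the exact syntax against the hydra documentation for 1.2.1):
--------8<-------
rome:2
meg:1
julia:1
--------8<-------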
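Also, the host-key prompt you mention in 2.1 below comes from ssh
itself, not from MPICH. Assuming you are using stock OpenSSH, you can
pre-accept the keys once so that launches are non-interactive, e.g.:
ssh-keyscan -t rsa localhost rome meg julia >> ~/.ssh/known_hosts
(or just ssh to each host once by hand and answer "yes").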
-Dave
On Jan 14, 2010, at 9:58 AM, Gao, Yi wrote:
> Hi Dave,
>
> Thanks for the advice. I followed your two suggestions (MPICH_NO_LOCAL
> and mpiexec.hydra) and tried as follows:
>
> 1. use MPICH_NO_LOCAL=1 mpiexec -n i /path/to/cpi
>
> i=1: no problem, runs on rome (the 2-core machine) with one process
> and exits.
>
> i=2: hangs with this output:
> Process 1 of 2 is on meg // meg has 1 core
>
> i=3: hangs like this:
> Process 2 of 3 is on julia // julia has 1 core
> Process 1 of 3 is on meg
>
> i=4: hangs like this:
> Process 1 of 4 is on meg
> Process 2 of 4 is on julia
>
> i=5: hangs like this:
> Process 2 of 5 is on julia
> Process 1 of 5 is on meg
> Process 4 of 5 is on meg
>
> In all of the hung cases above, running top on meg or julia shows the
> CPU at 100%. For the i=5 case, it seems that two processes are running
> on the single-core machine (meg), each taking about 50% CPU according
> to top. None of the runs above shows rome, the 2-core machine, in the
> output.
>
> So, compared with the runs without "MPICH_NO_LOCAL=1", this time it
> just hangs there without giving any error message.
>
> 2. mpiexec.hydra -n i /path/to/cpi
>
> 2.1
>
> First, it asked me to add localhost to the known hosts, and I answered
> yes:
>
> The authenticity of host 'localhost (::1)' can't be established.
> RSA key fingerprint is 3e:62:41:30:a8:40:33:7e:b4:34:8e:2c:f4:37:43:20.
> Are you sure you want to continue connecting (yes/no)? yes
> Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
> Process 2 of 3 is on rome
> Process 1 of 3 is on rome
> Process 0 of 3 is on rome
> pi is approximately 3.1415926544231323, Error is 0.0000000008333392
> wall clock time = 0.008936
>
> 2.2
>
> The above is the i=3 case; it runs and exits without an error message.
> However, it seems that all the processes are on the machine issuing the
> command (rome). And this is the case for i=1, 2, ..., 10. (I think if I
> kept going, it would still be the case.)
>
> 2.3
> Even when I ran: mpiexec.hydra -n i /bin/hostname
> I got i copies of "rome", which is not what I expected.
>
>
> 2.4
> Using mpdtrace -l on rome, I get the normal output:
>
> rome_53533 (128.61.134.31)
> meg_46371 (128.61.134.44)
> julia_53931 (128.61.135.30)
>
>
> 3.
> My first objective is just to use one core per machine, no matter how
> many cores each machine has. Once that works, further exploiting SMP
> on each node can be the next step.
>
>
> Thank you for the suggestion!
>
>
> Best,
> yi
>
>
>
>
> On Thu, Jan 14, 2010 at 10:09 AM, Dave Goodell <goodell at mcs.anl.gov>
> wrote:
>> Hi yi,
>>
>> I'm not sure exactly what's going on here, but it looks like rank 3 is
>> trying to set up a shared memory region for communication with another
>> rank even though it shouldn't be.
>>
>> There's a chance that this is related to a bug in the way that mpd
>> figures out which processes are on which machine. Can you try setting
>> the environment variable MPICH_NO_LOCAL=1 and let us know what happens?
>> For example:
>>
>> MPICH_NO_LOCAL=1 mpiexec -n 3 /path/to/cpi
>>
>> In a similar vein, you can also try using hydra to rule out the mpd
>> issue:
>>
>> mpiexec.hydra -n 3 /path/to/cpi
>>
>> There are other things we can look at, but let's start there and see
>> if that works out for us.
>>
>> -Dave
>>
>> On Jan 14, 2010, at 12:00 AM, Gao, Yi wrote:
>>
>>> Dear all,
>>>
>>> I'm new here and have run into a problem at the very beginning of
>>> learning MPI.
>>>
>>> Basically,
>>> mpiexec -n i /bin/hostname
>>> works for any i >= 1 I've tested,
>>>
>>> but
>>> mpiexec -n i /path-to-example-dir/cpi
>>> fails for any i >= 2.
>>>
>>> The details are:
>>>
>>> I have 3 machines, all running Ubuntu 9.10 with gcc/g++ 4.4.1.
>>> One has two cores, and the other two have one core each.
>>> (machine names: rome, 2 cores;
>>> julia, 1 core;
>>> meg, 1 core)
>>>
>>>
>>> On this minimal test bed for learning MPI, I built mpich2-1.2.1
>>> using the default configure options from the installation guide.
>>>
>>> Then on "rome", I put the mpd.hosts file in home dir with content:
>>> julia
>>> meg
>>>
>>> Then I ran
>>> mpdboot -n 3 # works
>>> mpdtrace -l # works, shows the three machine names and port numbers
>>> mpiexec -l -n 3 /bin/hostname # works! shows the three machine names
>>>
>>> but
>>>
>>> mpiexec -l -n 3 /tmp/gth818n/mpich2-1.2.1/example/cpi # !!!!!!!! it
>>> just hangs there.
>>>
>>> Then I tried:
>>> mpiexec -l -n 1 /tmp/gth818n/mpich2-1.2.1/example/cpi # works, runs
>>> on rome only and returns the result
>>>
>>> But -n greater than or equal to 2 causes it to hang, or gives errors
>>> like these (with -n 4):
>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>> MPIR_Init_thread(394).................: Initialization failed
>>> MPID_Init(135)........................: channel initialization failed
>>> MPIDI_CH3_Init(43)....................:
>>> MPID_nem_init(202)....................:
>>> MPIDI_CH3I_Seg_commit(366)............:
>>> MPIU_SHMW_Hnd_deserialize(358)........:
>>> MPIU_SHMW_Seg_open(897)...............:
>>> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory
>>> rank 3 in job 12 rome_39209 caused collective abort of all ranks
>>> exit status of rank 3: return code 1
>>>
>>>
>>> Then I rebuilt mpich2 on rome (since it's an SMP machine) with
>>> --with-device=ch3:ssm
>>>
>>> but got the same error.
>>>
>>> Could anyone give me some direction on where to go from here?
>>>
>>> Thanks in advance!
>>>
>>> Best,
>>> yi
>>
>>
>
>
>
> --
> Yi Gao
> Graduate Student
> Dept. Biomedical Engineering
> Georgia Institute of Technology