[mpich-discuss] FW: problems with mpdboot
bjday
bjday at cse.usf.edu
Tue Apr 7 13:05:24 CDT 2009
Rajeev,
Yes, you are correct: I can build a ring by hand but not by using
mpdboot. Once I build a ring by hand I can run mpiexec hostname and it
works; see below. I installed using the latest download that is on the
website. In my research before contacting the forums I found the page
below. I don't know if it helps.
http://ubuntuforums.org/showthread.php?t=1016984
It has to do with setting LD_LIBRARY_PATH and Python, but I used CentOS's
Add/Remove Programs, so I never touched the package. I will try
reinstalling MPICH2 on both computers in case different versions were
somehow installed. Any other suggestions or help would be great.
Thank you,
Brian
% mpiexec -l -n 30 /bin/hostname
2: c4labpc19.csee.usf.edu
3: c4labpc12.csee.usf.edu
1: c4labpc12.csee.usf.edu
4: c4labpc19.csee.usf.edu
5: c4labpc12.csee.usf.edu
6: c4labpc19.csee.usf.edu
7: c4labpc12.csee.usf.edu
9: c4labpc12.csee.usf.edu
8: c4labpc19.csee.usf.edu
11: c4labpc12.csee.usf.edu
12: c4labpc19.csee.usf.edu
13: c4labpc12.csee.usf.edu
15: c4labpc12.csee.usf.edu
14: c4labpc19.csee.usf.edu
10: c4labpc19.csee.usf.edu
17: c4labpc12.csee.usf.edu
16: c4labpc19.csee.usf.edu
19: c4labpc12.csee.usf.edu
21: c4labpc12.csee.usf.edu
22: c4labpc19.csee.usf.edu
23: c4labpc12.csee.usf.edu
20: c4labpc19.csee.usf.edu
18: c4labpc19.csee.usf.edu
24: c4labpc19.csee.usf.edu
25: c4labpc12.csee.usf.edu
27: c4labpc12.csee.usf.edu
29: c4labpc12.csee.usf.edu
0: c4labpc19.csee.usf.edu
28: c4labpc19.csee.usf.edu
26: c4labpc19.csee.usf.edu
%
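
As a quick check of the listing above, the rank labels added by -l can be
tallied per host to confirm that both machines in the hand-built ring are
actually running processes. This is only a minimal Python 3 sketch: the
function name ranks_per_host and the sample string are made up for
illustration, and it assumes each output line has the form
"rank: hostname" as shown above.

```python
from collections import Counter


def ranks_per_host(output):
    """Count how many ranks reported each hostname.

    Expects text in the "rank: hostname" form printed by
    `mpiexec -l ... /bin/hostname`; any other line is skipped.
    """
    counts = Counter()
    for line in output.splitlines():
        rank, sep, host = line.partition(":")
        # Skip shell prompts, blank lines, and anything without a numeric rank.
        if not sep or not rank.strip().isdigit():
            continue
        counts[host.strip()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample: a few lines copied from the listing above.
    sample = (
        "2: c4labpc19.csee.usf.edu\n"
        "3: c4labpc12.csee.usf.edu\n"
        "1: c4labpc12.csee.usf.edu\n"
        "%\n"
    )
    for host, n in sorted(ranks_per_host(sample).items()):
        print(host, n)
```

Applied to the 30 lines above, this counts 15 ranks on each of the two
hosts.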
Rajeev Thakur wrote:
>
>
> -----Original Message-----
> From: Ralph Butler [mailto:rbutler at mtsu.edu]
> Sent: Tuesday, April 07, 2009 12:06 PM
> To: Rajeev Thakur
> Subject: Re: [mpich-discuss] problems with mpdboot
>
> I cannot reproduce it, of course. He seems to indicate that he can build a
> ring by hand, but does not say that it is usable with mpiexec to run
> something like hostname. If he can do that and it still fails, I am at a
> loss as to what the problem can be. I ran into this one time when the
> mpd.py and mpdboot.py happened to be from different releases of mpich2, but
> seriously doubt that is his problem.
>
> On Tue Apr 7, at 11:28 AM, Rajeev Thakur wrote:
>
>
>> Ralph, any comments?
>>
>> Rajeev
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of bjday
>> Sent: Tuesday, April 07, 2009 10:25 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] problems with mpdboot
>>
>> I tried mpdcheck again as instructed in the Troubleshooting section of
>> the installation guide, and the client (pc12) successfully received the
>> ack from the server. The server (pc19) got a connection from the client
>> and successfully received the message from the client.
>>
>> I also tried the ssh command and received "c4labpc19.csee.usf.edu" as
>> a response.
>>
>> Once mpd is started on the master I can connect the slaves once I get
>> the port number from the server. I can also run mpdboot -n 1, and the
>> master node is the only output when mpdtrace is run. The error occurs
>> when n>1, when trying to remotely start the slave nodes.
>>
>> thank you,
>> Brian
>>
>> Pavan Balaji wrote:
>>
>>> Can you try mpdcheck to make sure there are no network infrastructure
>>> issues (e.g., firewalls or errors in /etc/hosts)?
>>>
>>> Another quick check is to make sure each host can ssh to another host
>>> with the name given in the host file. For example, try:
>>>
>>> $ ssh c4labpc12.csee.usf.edu -t "ssh c4labpc19.csee.usf.edu hostname"
>>>
>>> -- Pavan
>>>
>>> bjday wrote:
>>>
>>>> Pavan,
>>>>
>>>> Yes the names returned by "hostname" and the names in mpd.hosts are
>>>> the fully qualified names.
>>>>
>>>> Thank you,
>>>> Brian
>>>>
>>>>
>>>> Pavan Balaji wrote:
>>>>
>>>>> Check if your host file contains the same name as what is returned
>>>>> by the "hostname" command (e.g., "foo" is different from
>>>>> "foo.domain.edu"). Otherwise, mpd can't find the local hostname in
>>>>> your host file.
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> bjday wrote:
>>>>>
>>>>>> Hello MPICH2 Gurus
>>>>>>
>>>>>> I am installing MPICH2 on some lab computers at the request of a
>>>>>> professor. I have run into a problem during testing. When I run
>>>>>> mpdboot I receive this error:
>>>>>>
>>>>>> mpdboot -n 2 -f mpd.hosts -v -d
>>>>>> debug: starting
>>>>>> running mpdallexit on c4labpc19.csee.usf.edu
>>>>>> LAUNCHED mpd on c4labpc19.csee.usf.edu via
>>>>>> debug: launch cmd= /usr/local/mpich2/bin/mpd.py --ncpus=1 -e -d
>>>>>> debug: mpd on c4labpc19.csee.usf.edu on port 37116
>>>>>> RUNNING: mpd on c4labpc19.csee.usf.edu
>>>>>> debug: info for running mpd: {'ncpus': 1, 'list_port': 37116, 'entry_port': '', 'host': 'c4labpc19.csee.usf.edu', 'entry_host': '', 'ifhn': ''}
>>>>>> LAUNCHED mpd on c4labpc12.csee.usf.edu via c4labpc19.csee.usf.edu
>>>>>> debug: launch cmd= ssh -x -n -q c4labpc12.csee.usf.edu '/usr/local/mpich2/bin/mpd.py -h c4labpc19.csee.usf.edu -p 37116 --ncpus=1 -e -d'
>>>>>> debug: mpd on c4labpc12.csee.usf.edu on port no_port
>>>>>> mpdboot_c4labpc19.csee.usf.edu (handle_mpd_output 406): from mpd on c4labpc12.csee.usf.edu, invalid port info:
>>>>>> no_port
>>>>>>
>>>>>> I have seen this in the forums, but no resolution was posted. I
>>>>>> have gone through the troubleshooting section in the install guide
>>>>>> and I can complete it up to step 7, where mpdboot is used. I can
>>>>>> start mpd on the master, get the port, then connect the slave
>>>>>> computers by specifying the master name and port number. Any
>>>>>> ideas why pc12 is reporting no port?
>>>>>>
>>>>>> Thank you,
>>>>>> Brian
>>>>>>
>>
>
>
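
The failing step in the debug output quoted above is the ssh launch of
mpd.py on c4labpc12, and that command passes -q, which makes ssh suppress
its warnings and diagnostic messages. One way to look for the underlying
error is to run the same command by hand without -q. The Python 3 sketch
below only rebuilds that command from the values in the debug output; the
helper names build_launch_cmd and run_launch_cmd are hypothetical, and the
mpd.py flags are copied from the log rather than taken from the MPICH2
documentation.

```python
import shlex
import subprocess


def build_launch_cmd(remote_host, entry_host, entry_port,
                     mpd_path="/usr/local/mpich2/bin/mpd.py", ncpus=1):
    """Rebuild the ssh launch command from the mpdboot debug output.

    Same as the logged command except that ssh's -q (quiet) flag is
    dropped so that ssh warnings and errors are not suppressed.
    """
    remote_cmd = "%s -h %s -p %d --ncpus=%d -e -d" % (
        mpd_path, entry_host, entry_port, ncpus)
    return ["ssh", "-x", "-n", remote_host, remote_cmd]


def run_launch_cmd(cmd, timeout=30):
    """Run the command and return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Host names and port are the ones from the debug output in this
    # thread; the port changes every time mpd is started on the master.
    cmd = build_launch_cmd("c4labpc12.csee.usf.edu",
                           "c4labpc19.csee.usf.edu", 37116)
    # Print the command for inspection; call run_launch_cmd(cmd) to run it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Anything ssh or the remote mpd.py writes to stderr when the command is run
this way (for example, a problem locating or starting Python on the remote
host) would be the first place to look for the cause of no_port.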