[mpich-discuss] FW: problems with mpdboot
bjday
bjday at cse.usf.edu
Wed Apr 8 07:42:27 CDT 2009
Thank you, everyone, for your help. Looking at the version numbers, pc12 had
mpich2-1.0.8 installed and pc19 had a fresh copy of mpich2-1.0.8p1.
Once I reinstalled mpich2-1.0.8p1 on both pc19 and pc12, mpdboot worked. I am
not sure what the difference between the two releases is, but reinstalling helped.
Thank you again,
Brian
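
The mismatch described above (mpich2-1.0.8 on pc12, mpich2-1.0.8p1 on pc19) can be caught before running mpdboot by asking every host for its installed version. The following is a minimal Python sketch, not part of MPICH2. It assumes passwordless ssh to every host listed in mpd.hosts and that the mpich2version command shipped with MPICH2 is on the remote PATH (otherwise pass a full path such as /usr/local/mpich2/bin/mpich2version). All function names are illustrative.

```python
import subprocess
import sys
from collections import defaultdict


def read_hosts(path):
    """Return host names from an mpd.hosts-style file, skipping blanks and comments."""
    hosts = []
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if line:
                # Entries may look like "host:ncpus ifhn=..."; keep only the host part.
                hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def remote_version(host, command="mpich2version"):
    """Run the version command on host over ssh and return its first output line.

    Assumption: the command exists on the remote host and prints the
    release on its first line.
    """
    result = subprocess.run(
        ["ssh", "-x", "-n", "-q", host, command],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        return "ERROR: exit status %d" % result.returncode
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else "ERROR: no output"


def group_by_version(versions):
    """Map each distinct version string to the sorted list of hosts reporting it."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def main(hosts_file):
    """Print the hosts grouped by reported version and flag any mismatch."""
    versions = {host: remote_version(host) for host in read_hosts(hosts_file)}
    groups = group_by_version(versions)
    for version, hosts in sorted(groups.items()):
        print("%s: %s" % (version, ", ".join(hosts)))
    if len(groups) > 1:
        print("Version mismatch: install the same release on every host.")
    return groups


# Only contact remote hosts when a hosts file is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a hypothetical name such as check_versions.py, it would be run as "python check_versions.py mpd.hosts" from the master node. More than one group in the output means the hosts are not running the same release.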
Ralph Butler wrote:
> You might also try this option to mpdboot:
> --maxbranch=1
>
> On Apr 7, 2009, at 1:05 PM, bjday wrote:
>
>> Rajeev,
>>
>> Yes, you are correct: I can build a ring by hand but not by using
>> mpdboot. Once I build a ring by hand, I can run mpiexec hostname and
>> it works, see below. I installed using the latest download that is
>> on the website. In my research before contacting the forums I found
>> this thread, though I don't know if it helps:
>> http://ubuntuforums.org/showthread.php?t=1016984
>> It has to do with setting LD_LIBRARY_PATH and Python, but I used
>> CentOS's Add/Remove Programs, so I never touched the package. I will
>> try to reinstall MPICH2 on both computers in case different versions
>> were somehow installed. Any other suggestions or help would be great.
>>
>> Thank you,
>> Brian
>>
>>
>>
>> Rajeev Thakur wrote:
>>>
>>> -----Original Message-----
>>> From: Ralph Butler [mailto:rbutler at mtsu.edu]
>>> Sent: Tuesday, April 07, 2009 12:06 PM
>>> To: Rajeev Thakur
>>> Subject: Re: [mpich-discuss] problems with mpdboot
>>>
>>> I cannot reproduce it, of course. He seems to indicate that he can
>>> build a ring by hand, but does not say that it is usable with mpiexec
>>> to run something like hostname. If he can do that and it still fails,
>>> I am at a loss as to what the problem can be. I ran into this one time
>>> when the mpd.py and mpdboot.py happened to be from different releases
>>> of mpich2, but seriously doubt that is his problem.
>>>
>>> On Tue, Apr 7, 2009, at 11:28 AM, Rajeev Thakur wrote:
>>>
>>>
>>>> Ralph, any comments?
>>>>
>>>> Rajeev
>>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of bjday
>>>> Sent: Tuesday, April 07, 2009 10:25 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] problems with mpdboot
>>>>
>>>> I tried mpdcheck again as instructed in the Troubleshooting section
>>>> of the installation guide, and the client (pc12) successfully recvd
>>>> ack from server. The server (pc19) has conn from the client and
>>>> successfully recvd msg from client.
>>>>
>>>> I also tried the ssh command and received "c4labpc19.csee.usf.edu"
>>>> as a response.
>>>>
>>>> Once mpd is started on the master, I can connect the slaves once I
>>>> get the port number from the server. I can also run $mpdboot -n 1,
>>>> and the master node is the only output when mpdtrace is run.
>>>> The error occurs when n>1, when trying to remotely start the slave nodes.
>>>>
>>>> Thank you,
>>>> Brian
>>>>
>>>> Pavan Balaji wrote:
>>>>
>>>>> Can you try mpdcheck to make sure there are no network
>>>>> infrastructure issues (e.g., firewalls or errors in /etc/hosts)?
>>>>>
>>>>> Another quick check is to make sure each host can ssh to another
>>>>> host with the name given in the host file. For example, try:
>>>>>
>>>>> $ ssh c4labpc12.csee.usf.edu -t "ssh c4labpc19.csee.usf.edu hostname"
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> bjday wrote:
>>>>>
>>>>>> Pavan,
>>>>>>
>>>>>> Yes the names returned by "hostname" and the names in mpd.hosts
>>>>>> are the fully qualified names.
>>>>>>
>>>>>> Thank you,
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> Pavan Balaji wrote:
>>>>>>
>>>>>>> Check if your host file contains the same name as what is
>>>>>>> returned by the "hostname" command (e.g., "foo" is different
>>>>>>> from "foo.domain.edu"). Otherwise, mpd can't find the local
>>>>>>> hostname in your host file.
>>>>>>>
>>>>>>> -- Pavan
>>>>>>>
>>>>>>> bjday wrote:
>>>>>>>
>>>>>>>> Hello MPICH2 Gurus
>>>>>>>>
>>>>>>>> I am installing MPICH2 on some lab computers at the request of
>>>>>>>> a professor. I have run into a problem during testing. When I run
>>>>>>>> mpdboot I receive this error:
>>>>>>>>
>>>>>>>> mpdboot -n 2 -f mpd.hosts -v -d
>>>>>>>> debug: starting
>>>>>>>> running mpdallexit on c4labpc19.csee.usf.edu
>>>>>>>> LAUNCHED mpd on c4labpc19.csee.usf.edu via
>>>>>>>> debug: launch cmd= /usr/local/mpich2/bin/mpd.py --ncpus=1 -e -d
>>>>>>>> debug: mpd on c4labpc19.csee.usf.edu on port 37116
>>>>>>>> RUNNING: mpd on c4labpc19.csee.usf.edu
>>>>>>>> debug: info for running mpd: {'ncpus': 1, 'list_port': 37116,
>>>>>>>> 'entry_port': '', 'host': 'c4labpc19.csee.usf.edu', 'entry_host':
>>>>>>>> '', 'ifhn': ''}
>>>>>>>> LAUNCHED mpd on c4labpc12.csee.usf.edu via c4labpc19.csee.usf.edu
>>>>>>>> debug: launch cmd= ssh -x -n -q c4labpc12.csee.usf.edu '/usr/local/mpich2/bin/mpd.py -h c4labpc19.csee.usf.edu -p 37116 --ncpus=1 -e -d'
>>>>>>>> debug: mpd on c4labpc12.csee.usf.edu on port no_port
>>>>>>>> mpdboot_c4labpc19.csee.usf.edu (handle_mpd_output 406): from mpd on c4labpc12.csee.usf.edu, invalid port info:
>>>>>>>> no_port
>>>>>>>>
>>>>>>>> I have seen this in the forums, but no resolution was
>>>>>>>> posted. I have gone through the troubleshooting steps in the
>>>>>>>> install guide and I can complete them up to step 7, where mpdboot
>>>>>>>> is used. I can start mpd on the master, get the port, then
>>>>>>>> connect the slave computers by specifying the master name and
>>>>>>>> port number. Any ideas why pc12 is reporting no port?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Brian
>>>>>>>>
>>>>
>>>
>>>
>>