[mpich-discuss] FW: problems with mpdboot

bjday bjday at cse.usf.edu
Wed Apr 8 07:42:27 CDT 2009


Thank you, everyone, for your help.  Looking at the version numbers, pc12 
had mpich2-1.0.8 installed and pc19 had a fresh copy of mpich2-1.0.8p1.  
Once I reinstalled both pc19 and pc12 with mpich2-1.0.8p1, it worked.  I am 
not sure what the difference between the two releases is, but reinstalling 
helped. 

Thank you again,
Brian
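
For anyone hitting the same no_port error, the version mismatch described 
above can be checked before running mpdboot.  The following is only a 
minimal sketch, not part of MPICH2.  It assumes passwordless ssh, that the 
mpich2version program is installed under /usr/local/mpich2/bin (the prefix 
shown in the debug output quoted below), and that the first line of its 
output identifies the release.  The function names are hypothetical.

```python
import subprocess
from collections import defaultdict

# Install prefix taken from the debug output in this thread; adjust as needed.
MPICH2_BIN = "/usr/local/mpich2/bin"


def remote_version(host, timeout=30):
    """Return the first line printed by mpich2version on `host` (via ssh).

    Assumption: passwordless ssh works and mpich2version lives in MPICH2_BIN.
    """
    result = subprocess.run(
        ["ssh", "-x", "-n", host, MPICH2_BIN + "/mpich2version"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else ""


def group_by_version(versions):
    """Group hosts by reported version string: {version: [hosts, ...]}."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def versions_consistent(versions):
    """True when every host reports the same version string."""
    return len(set(versions.values())) <= 1


# Example use (requires working ssh to each host):
#   hosts = ["c4labpc19.csee.usf.edu", "c4labpc12.csee.usf.edu"]
#   versions = {h: remote_version(h) for h in hosts}
#   if not versions_consistent(versions):
#       print(group_by_version(versions))
```

If the hosts fall into more than one group, reinstalling the same release 
everywhere, as was done here, is the obvious next step.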

Ralph Butler wrote:
> You might also try this option to mpdboot:
>         --maxbranch=1
>
> On Apr 7, 2009, at 1:05 PM, bjday wrote:
>
>> Rajeev,
>>
>> Yes, you are correct: I can build a ring by hand but not by using 
>> mpdboot.  Once I build a ring by hand, I can run mpiexec hostname and 
>> it works, see below.  I installed using the latest download that is 
>> on the website.  In my research before contacting the forums I found 
>> this website; I don't know if it helps:  
>> http://ubuntuforums.org/showthread.php?t=1016984   It has to do with 
>> setting LD_LIBRARY_PATH and Python, but I used CentOS's Add/Remove 
>> Programs, so I never touched the package.  I will try to reinstall 
>> MPICH2 on both computers in case different versions were somehow 
>> installed.  Any other suggestions or help would be great.
>>
>> Thank you,
>> Brian
>>
>>
>>
>> Rajeev Thakur wrote:
>>>
>>> -----Original Message-----
>>> From: Ralph Butler [mailto:rbutler at mtsu.edu]
>>> Sent: Tuesday, April 07, 2009 12:06 PM
>>> To: Rajeev Thakur
>>> Subject: Re: [mpich-discuss] problems with mpdboot
>>>
>>> I cannot reproduce it, of course.  He seems to indicate that he can
>>> build a ring by hand, but does not say that it is usable with mpiexec
>>> to run something like hostname.  If he can do that and it still fails,
>>> I am at a loss as to what the problem can be.  I ran into this one
>>> time when the mpd.py and mpdboot.py happened to be from different
>>> releases of mpich2, but seriously doubt that is his problem.
>>>
>>> On Tue Apr 7, at 11:28 AM, Rajeev Thakur wrote:
>>>
>>>
>>>> Ralph, any comments?
>>>>
>>>> Rajeev
>>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of bjday
>>>> Sent: Tuesday, April 07, 2009 10:25 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] problems with mpdboot
>>>>
>>>> I tried mpdcheck again, as instructed in the Troubleshooting section 
>>>> of the installation guide, and the client (pc12) successfully recvd 
>>>> ack from server.  The server (pc19) has conn from the client and 
>>>> successfully recvd msg from client.
>>>>
>>>> I also tried the ssh command and received "c4labpc19.csee.usf.edu" 
>>>> as a response.
>>>>
>>>> Once mpd is started on the master, I can connect the slaves as soon 
>>>> as I get the port number from the server.  I can also run 
>>>> mpdboot -n 1, and the master node is the only output when mpdtrace 
>>>> is run.  The error occurs when n>1, when trying to remotely start 
>>>> the slave nodes.
>>>>
>>>> Thank you,
>>>> Brian
>>>>
>>>> Pavan Balaji wrote:
>>>>
>>>>> Can you try mpdcheck to make sure there are no network 
>>>>> infrastructure issues (e.g., firewalls or errors in /etc/hosts)?
>>>>>
>>>>> Another quick check is to make sure each host can ssh to another 
>>>>> host with the name given in the host file. For example, try:
>>>>>
>>>>> $ ssh c4labpc12.csee.usf.edu -t "ssh c4labpc19.csee.usf.edu hostname"
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> bjday wrote:
>>>>>
>>>>>> Pavan,
>>>>>>
>>>>>> Yes, the names returned by "hostname" and the names in mpd.hosts 
>>>>>> are the fully qualified names.
>>>>>>
>>>>>> Thank you,
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> Pavan Balaji wrote:
>>>>>>
>>>>>>> Check if your host file contains the same name as what is 
>>>>>>> returned by the "hostname" command (e.g., "foo" is different 
>>>>>>> from "foo.domain.edu"). Otherwise, mpd can't find the local 
>>>>>>> hostname in your host file.
>>>>>>>
>>>>>>> -- Pavan
>>>>>>>
>>>>>>> bjday wrote:
>>>>>>>
>>>>>>>> Hello MPICH2 Gurus
>>>>>>>>
>>>>>>>> I am installing MPICH2 on some lab computers at the request of 
>>>>>>>> a professor.  I have run into a problem during testing.  When I 
>>>>>>>> run mpdboot I receive this error:
>>>>>>>>
>>>>>>>> mpdboot -n 2 -f mpd.hosts -v -d
>>>>>>>> debug: starting
>>>>>>>> running mpdallexit on c4labpc19.csee.usf.edu
>>>>>>>> LAUNCHED mpd on c4labpc19.csee.usf.edu  via
>>>>>>>> debug: launch cmd= /usr/local/mpich2/bin/mpd.py   --ncpus=1 -e -d
>>>>>>>> debug: mpd on c4labpc19.csee.usf.edu  on port 37116
>>>>>>>> RUNNING: mpd on c4labpc19.csee.usf.edu
>>>>>>>> debug: info for running mpd: {'ncpus': 1, 'list_port': 37116, 'entry_port': '', 'host': 'c4labpc19.csee.usf.edu', 'entry_host': '', 'ifhn': ''}
>>>>>>>> LAUNCHED mpd on c4labpc12.csee.usf.edu  via   c4labpc19.csee.usf.edu
>>>>>>>> debug: launch cmd= ssh -x -n -q c4labpc12.csee.usf.edu '/usr/local/mpich2/bin/mpd.py  -h c4labpc19.csee.usf.edu -p 37116 --ncpus=1 -e -d'
>>>>>>>> debug: mpd on c4labpc12.csee.usf.edu  on port no_port
>>>>>>>> mpdboot_c4labpc19.csee.usf.edu (handle_mpd_output 406): from mpd on c4labpc12.csee.usf.edu, invalid port info:
>>>>>>>> no_port
>>>>>>>>
>>>>>>>> I have seen this in the forums, but no resolution was posted.  
>>>>>>>> I have gone through the troubleshooting section of the install 
>>>>>>>> guide and I can complete everything up to step 7, where mpdboot 
>>>>>>>> is used.  I can start mpd on the master, get the port, then 
>>>>>>>> connect the slave computers by specifying the master name and 
>>>>>>>> port number.  Any ideas why pc12 is reporting no port?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Brian
>>>>>>>>
>>>>
>>>
>>>
>>
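
The host-file check suggested in the quoted thread above (the name in 
mpd.hosts must match what "hostname" returns, so "foo" and "foo.domain.edu" 
do not match) can be sketched as follows.  This is an illustration only, 
not how mpd itself performs the lookup.  It assumes one host per line in 
mpd.hosts, optionally followed by ":ncpus" and further fields, and it uses 
socket.gethostname() as a stand-in for the "hostname" command.  The 
function names are hypothetical.

```python
import socket


def hosts_from_lines(lines):
    """Extract host names from mpd.hosts-style lines.

    Assumption: one host per line, optionally followed by ':ncpus' and
    further fields; blank lines and '#' comments are skipped.
    """
    hosts = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        first_field = line.split()[0]
        hosts.append(first_field.split(":", 1)[0])
    return hosts


def local_name_listed(lines, local_name=None):
    """True if the local host name appears verbatim in the host list."""
    if local_name is None:
        # Stand-in for the output of the "hostname" command.
        local_name = socket.gethostname()
    return local_name in hosts_from_lines(lines)


# Example use:
#   with open("mpd.hosts") as f:
#       print(local_name_listed(f.readlines()))
```

The comparison is deliberately an exact string match, so a short name in 
the file does not count as a match for a fully qualified local name.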


