[mpich-discuss] FW: problems with mpdboot

Ralph Butler rbutler at mtsu.edu
Tue Apr 7 13:08:49 CDT 2009


You might also try this option to mpdboot:
         --maxbranch=1
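
For concreteness, here is a rough sketch of what that might look like,
reusing the ring size and hosts file from Brian's original command
(-n 2, mpd.hosts). As I understand it, --maxbranch caps how many mpds
each mpd launches beneath it, so a value of 1 should bring the hosts up
one at a time. The variable names and the "command -v" guard are only
illustrative, not part of mpdboot:

```shell
# Illustrative values taken from the original report; adjust for your site.
HOSTS_FILE=mpd.hosts
NUM_MPDS=2

# Same invocation as in the report, with --maxbranch=1 added.
BOOT_CMD="mpdboot -n $NUM_MPDS -f $HOSTS_FILE --maxbranch=1 -v -d"
echo "$BOOT_CMD"

# Only attempt the boot where MPICH2's mpd tools are on PATH.
if command -v mpdboot >/dev/null 2>&1; then
    # If the ring comes up, list its members.
    $BOOT_CMD && mpdtrace -l
fi
```

If the ring boots this way but not with the default branching, that would
at least narrow down where the remote launch goes wrong.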

On Apr 7, 2009, at 1:05 PM, bjday wrote:

> Rajeev,
>
> Yes, you are correct: I can build a ring by hand but not by using
> mpdboot.  Once I build a ring by hand I can run mpiexec hostname and
> it works; see below.  I installed from the latest download on the
> website.  While researching before contacting the list I found this
> thread; I don't know if it helps:
> http://ubuntuforums.org/showthread.php?t=1016984
> It has to do with setting LD_LIBRARY_PATH and python, but I used
> CentOS's Add/Remove Programs, so I never touched the package.  I will
> try to reinstall MPICH2 on both computers in case different versions
> were somehow installed.  Any other suggestions or help
> would be great.
>
> Thank you,
> Brian
> % mpiexec -l -n 30 /bin/hostname
> 2: c4labpc19.csee.usf.edu
> 3: c4labpc12.csee.usf.edu
> 1: c4labpc12.csee.usf.edu
> 4: c4labpc19.csee.usf.edu
> 5: c4labpc12.csee.usf.edu
> 6: c4labpc19.csee.usf.edu
> 7: c4labpc12.csee.usf.edu
> 9: c4labpc12.csee.usf.edu
> 8: c4labpc19.csee.usf.edu
> 11: c4labpc12.csee.usf.edu
> 12: c4labpc19.csee.usf.edu
> 13: c4labpc12.csee.usf.edu
> 15: c4labpc12.csee.usf.edu
> 14: c4labpc19.csee.usf.edu
> 10: c4labpc19.csee.usf.edu
> 17: c4labpc12.csee.usf.edu
> 16: c4labpc19.csee.usf.edu
> 19: c4labpc12.csee.usf.edu
> 21: c4labpc12.csee.usf.edu
> 22: c4labpc19.csee.usf.edu
> 23: c4labpc12.csee.usf.edu
> 20: c4labpc19.csee.usf.edu
> 18: c4labpc19.csee.usf.edu
> 24: c4labpc19.csee.usf.edu
> 25: c4labpc12.csee.usf.edu
> 27: c4labpc12.csee.usf.edu
> 29: c4labpc12.csee.usf.edu
> 0: c4labpc19.csee.usf.edu
> 28: c4labpc19.csee.usf.edu
> 26: c4labpc19.csee.usf.edu
> %
>
>
> Rajeev Thakur wrote:
>>
>> -----Original Message-----
>> From: Ralph Butler [mailto:rbutler at mtsu.edu]
>> Sent: Tuesday, April 07, 2009 12:06 PM
>> To: Rajeev Thakur
>> Subject: Re: [mpich-discuss] problems with mpdboot
>>
>> I cannot reproduce it, of course.  He seems to indicate that he can
>> build a ring by hand, but does not say that it is usable with mpiexec
>> to run something like hostname.  If he can do that and it still fails,
>> I am at a loss as to what the problem can be.  I ran into this one time
>> when the mpd.py and mpdboot.py happened to be from different releases
>> of mpich2, but seriously doubt that is his problem.
>>
>> On Tue Apr 7, at 11:28 AM, Rajeev Thakur wrote:
>>
>>
>>> Ralph, any comments?
>>>
>>> Rajeev
>>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of bjday
>>> Sent: Tuesday, April 07, 2009 10:25 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] problems with mpdboot
>>>
>>> I tried mpdcheck again, as instructed in the Troubleshooting section
>>> of the installation guide, and the client (pc12) successfully
>>> received the ack from the server.  The server (pc19) has a connection
>>> from the client and successfully received the message from the client.
>>>
>>> I also tried the ssh command and received
>>> "c4labpc19.csee.usf.edu" as a
>>> response.
>>>
>>> Once mpd is started on the master, I can connect the slaves once I
>>> get the port number from the server.  I can also run mpdboot -n 1,
>>> and the master node is the only output when mpdtrace is run.
>>> The error occurs when n>1, when trying to remotely start the slave nodes.
>>>
>>> thank you,
>>> Brian
>>>
>>> Pavan Balaji wrote:
>>>
>>>> Can you try mpdcheck to make sure there are no network  
>>>> infrastructure issues (e.g., firewalls or errors in /etc/hosts)?
>>>>
>>>> Another quick check is to make sure each host can ssh to another  
>>>> host with the name given in the host file. For example, try:
>>>>
>>>> $ ssh c4labpc12.csee.usf.edu -t "ssh c4labpc19.csee.usf.edu hostname"
>>>>
>>>> -- Pavan
>>>>
>>>> bjday wrote:
>>>>
>>>>> Pavan,
>>>>>
>>>>> Yes the names returned by "hostname" and the names in mpd.hosts  
>>>>> are the fully qualified names.
>>>>>
>>>>> Thank you,
>>>>> Brian
>>>>>
>>>>>
>>>>> Pavan Balaji wrote:
>>>>>
>>>>>> Check if your host file contains the same name as what is  
>>>>>> returned by the "hostname" command (e.g., "foo" is different  
>>>>>> from "foo.domain.edu"). Otherwise, mpd can't find the local  
>>>>>> hostname in your host file.
>>>>>>
>>>>>> -- Pavan
>>>>>>
>>>>>> bjday wrote:
>>>>>>
>>>>>>> Hello MPICH2 Gurus,
>>>>>>>
>>>>>>> I am installing MPICH2 on some lab computers at the request of
>>>>>>> a professor.  I have run into a problem during testing.  When I
>>>>>>> run mpdboot I receive this error:
>>>>>>>
>>>>>>> mpdboot -n 2 -f mpd.hosts -v -d
>>>>>>> debug: starting
>>>>>>> running mpdallexit on c4labpc19.csee.usf.edu
>>>>>>> LAUNCHED mpd on c4labpc19.csee.usf.edu  via
>>>>>>> debug: launch cmd= /usr/local/mpich2/bin/mpd.py   --ncpus=1 -e -d
>>>>>>> debug: mpd on c4labpc19.csee.usf.edu  on port 37116
>>>>>>> RUNNING: mpd on c4labpc19.csee.usf.edu
>>>>>>> debug: info for running mpd: {'ncpus': 1, 'list_port': 37116, 'entry_port': '', 'host': 'c4labpc19.csee.usf.edu', 'entry_host': '', 'ifhn': ''}
>>>>>>> LAUNCHED mpd on c4labpc12.csee.usf.edu  via  c4labpc19.csee.usf.edu
>>>>>>> debug: launch cmd= ssh -x -n -q c4labpc12.csee.usf.edu '/usr/local/mpich2/bin/mpd.py  -h c4labpc19.csee.usf.edu -p 37116 --ncpus=1 -e -d'
>>>>>>> debug: mpd on c4labpc12.csee.usf.edu  on port no_port
>>>>>>> mpdboot_c4labpc19.csee.usf.edu (handle_mpd_output 406): from mpd on c4labpc12.csee.usf.edu, invalid port info:
>>>>>>> no_port
>>>>>>>
>>>>>>> I have seen this in the forums, but no resolution was
>>>>>>> posted.  I have gone through the troubleshooting steps in the
>>>>>>> install guide, and I can complete everything up to step 7, where
>>>>>>> mpdboot is used.  I can start mpd on the master, get the port, then
>>>>>>> connect the slave computers by specifying the master name and
>>>>>>> port number.  Any ideas why pc12 is reporting no port?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Brian
>>>>>>>
>>>
>>
>>
>


