[mpich-discuss] FW: problems with mpdboot

bjday bjday at cse.usf.edu
Tue Apr 7 13:05:24 CDT 2009


Rajeev,

Yes, you are correct: I can build a ring by hand but not by using
mpdboot.  Once I build a ring by hand I can run mpiexec hostname and it
works; see below.  I installed using the latest download that is on the
website.  In my research before contacting the forums I found this
website.  I don't know if this helps.
http://ubuntuforums.org/showthread.php?t=1016984   It has to do with
setting LD_LIBRARY_PATH and python, but I used CentOS's Add/Remove
Programs so I never touched the package.  I will try to reinstall MPICH2
on both computers just in case somehow different versions were
installed.  Any other suggestions or help would be great.
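
One way to see what the remote mpd does during startup is to rerun the launch command that mpdboot prints in its debug output (quoted further down) by hand, without ssh's -q flag, so that ssh warnings are not suppressed. The following is only a rough sketch: the helper name remote_mpd_command is hypothetical, and the mpd.py path and flags are copied from that debug line rather than taken from any documented interface.

```python
import shlex


def remote_mpd_command(remote_host, entry_host, entry_port,
                       mpd_path="/usr/local/mpich2/bin/mpd.py", quiet=False):
    """Build an ssh argv that mirrors mpdboot's 'launch cmd' debug line.

    Hypothetical helper: the default path and the mpd flags are copied
    from the debug output in this thread, not from a documented API.
    """
    remote = "%s -h %s -p %d --ncpus=1 -e -d" % (mpd_path, entry_host, entry_port)
    argv = ["ssh", "-x", "-n"]
    if quiet:
        # mpdboot passes -q, which suppresses most ssh warnings and diagnostics
        argv.append("-q")
    argv += [remote_host, remote]
    return argv


if __name__ == "__main__":
    argv = remote_mpd_command("c4labpc12.csee.usf.edu",
                              "c4labpc19.csee.usf.edu", 37116)
    # Print a copy-and-paste form of the command for an interactive shell
    print(" ".join(shlex.quote(a) for a in argv))
```

Running the printed command from the master while an mpd is listening on that port should show either a port number from the remote mpd or whatever error keeps it from starting. The port 37116 is just the value from the earlier run and will differ each time.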

Thank you,
Brian
% mpiexec -l -n 30 /bin/hostname
2: c4labpc19.csee.usf.edu
3: c4labpc12.csee.usf.edu
1: c4labpc12.csee.usf.edu
4: c4labpc19.csee.usf.edu
5: c4labpc12.csee.usf.edu
6: c4labpc19.csee.usf.edu
7: c4labpc12.csee.usf.edu
9: c4labpc12.csee.usf.edu
8: c4labpc19.csee.usf.edu
11: c4labpc12.csee.usf.edu
12: c4labpc19.csee.usf.edu
13: c4labpc12.csee.usf.edu
15: c4labpc12.csee.usf.edu
14: c4labpc19.csee.usf.edu
10: c4labpc19.csee.usf.edu
17: c4labpc12.csee.usf.edu
16: c4labpc19.csee.usf.edu
19: c4labpc12.csee.usf.edu
21: c4labpc12.csee.usf.edu
22: c4labpc19.csee.usf.edu
23: c4labpc12.csee.usf.edu
20: c4labpc19.csee.usf.edu
18: c4labpc19.csee.usf.edu
24: c4labpc19.csee.usf.edu
25: c4labpc12.csee.usf.edu
27: c4labpc12.csee.usf.edu
29: c4labpc12.csee.usf.edu
0: c4labpc19.csee.usf.edu
28: c4labpc19.csee.usf.edu
26: c4labpc19.csee.usf.edu
%
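
To check a listing like this without reading it line by line, a small script can tally the ranks per host and report any rank that never printed. This is a minimal sketch, assuming the "rank: hostname" line format that mpiexec -l produces above; the function names are made up for illustration.

```python
from collections import Counter


def tally_ranks(output):
    """Return (ranks per host, set of ranks seen) from 'mpiexec -l' output."""
    per_host = Counter()
    seen = set()
    for line in output.splitlines():
        rank, sep, host = line.partition(":")
        if not sep or not rank.strip().isdigit():
            continue  # skip shell prompts, blank lines, anything unlabeled
        seen.add(int(rank))
        per_host[host.strip()] += 1
    return per_host, seen


def missing_ranks(seen, nprocs):
    """List the ranks in 0..nprocs-1 that never appeared in the output."""
    return sorted(set(range(nprocs)) - seen)
```

For the listing above this gives 15 ranks on each of the two hosts, with none of ranks 0 through 29 missing, so the hand-built ring is distributing processes across both machines.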


Rajeev Thakur wrote:
>  
>
> -----Original Message-----
> From: Ralph Butler [mailto:rbutler at mtsu.edu] 
> Sent: Tuesday, April 07, 2009 12:06 PM
> To: Rajeev Thakur
> Subject: Re: [mpich-discuss] problems with mpdboot
>
> I can not reproduce it of course.  He seems to indicate that he can build a
> ring by hand, but does not say that it is usable with mpiexec to run
> something like hostname.  If he can do that and it still fails, I am at a
> loss as to what the problem can be.  I ran into this one time when the
> mpd.py and mpdboot.py happened to be from different releases of mpich2, but
> seriously doubt that is his problem.
>
> On Tue Apr 7, at 11:28 AM, Rajeev Thakur wrote:
>
>   
>> Ralph, any comments?
>>
>> Rajeev
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of bjday
>> Sent: Tuesday, April 07, 2009 10:25 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] problems with mpdboot
>>
>> I tried mpdcheck again, as instructed in the Troubleshooting section of
>> the installation guide, and the client (pc12) successfully received the
>> ack from the server.  The server (pc19) has the connection from the
>> client and successfully received the message from the client.
>>
>> I also tried the ssh command and received "c4labpc19.csee.usf.edu" as a
>> response.
>>
>> Once mpd is started on the master I can connect the slaves once I get
>> the port number from the server. I can also run $mpdboot -n 1 and the
>> master node will be the only output when mpdtrace is run.  The error
>> occurs when n>1, when trying to remotely start the slave nodes.
>>
>> thank you,
>> Brian
>>
>> Pavan Balaji wrote:
>>     
>>> Can you try mpdcheck to make sure there are no network infrastructure 
>>> issues (e.g., firewalls or errors in /etc/hosts)?
>>>
>>> Another quick check is to make sure each host can ssh to another host 
>>> with the name given in the host file. For example, try:
>>>
>>> $ ssh c4labpc12.csee.usf.edu -t "ssh c4labpc19.csee.usf.edu hostname"
>>>
>>> -- Pavan
>>>
>>> bjday wrote:
>>>       
>>>> Pavan,
>>>>
>>>> Yes the names returned by "hostname" and the names in mpd.hosts are 
>>>> the fully qualified names.
>>>>
>>>> Thank you,
>>>> Brian
>>>>
>>>>
>>>> Pavan Balaji wrote:
>>>>         
>>>>> Check if your host file contains the same name as what is returned 
>>>>> by the "hostname" command (e.g., "foo" is different from 
>>>>> "foo.domain.edu"). Otherwise, mpd can't find the local hostname in 
>>>>> your host file.
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> bjday wrote:
>>>>>           
>>>>>> Hello MPICH2 Gurus
>>>>>>
>>>>>> I am installing MPICH2 on some lab computers at the request of a
>>>>>> professor.  I have run into a problem during testing.  When I run
>>>>>> mpdboot I receive this error:
>>>>>>
>>>>>> mpdboot -n 2 -f mpd.hosts -v -d
>>>>>> debug: starting
>>>>>> running mpdallexit on c4labpc19.csee.usf.edu
>>>>>> LAUNCHED mpd on c4labpc19.csee.usf.edu  via
>>>>>> debug: launch cmd= /usr/local/mpich2/bin/mpd.py   --ncpus=1 -e -d
>>>>>> debug: mpd on c4labpc19.csee.usf.edu  on port 37116
>>>>>> RUNNING: mpd on c4labpc19.csee.usf.edu
>>>>>> debug: info for running mpd: {'ncpus': 1, 'list_port': 37116, 'entry_port': '', 'host': 'c4labpc19.csee.usf.edu', 'entry_host': '', 'ifhn': ''}
>>>>>> LAUNCHED mpd on c4labpc12.csee.usf.edu  via  c4labpc19.csee.usf.edu
>>>>>> debug: launch cmd= ssh -x -n -q c4labpc12.csee.usf.edu '/usr/local/mpich2/bin/mpd.py  -h c4labpc19.csee.usf.edu -p 37116 --ncpus=1 -e -d'
>>>>>> debug: mpd on c4labpc12.csee.usf.edu  on port no_port
>>>>>> mpdboot_c4labpc19.csee.usf.edu (handle_mpd_output 406): from mpd on c4labpc12.csee.usf.edu, invalid port info:
>>>>>> no_port
>>>>>>
>>>>>> I have seen this in the forums but there was no resolution
>>>>>> posted.  I have gone through the troubleshooting in the install
>>>>>> guide and I can complete it up to step 7, where mpdboot is used.  I
>>>>>> can start mpd on the master, get the port, then connect the slave
>>>>>> computers by specifying the master name and port number.  Any
>>>>>> ideas why pc12 is reporting no port?
>>>>>>
>>>>>> Thank you,
>>>>>> Brian
>>>>>>             
>>     
>
>   


