[mpich-discuss] Trouble with new installation -- failed to connect to mpd

Ralph Butler rbutler at mtsu.edu
Tue Dec 2 06:53:36 CST 2008


I have been thinking I should remove the use of the -f option from the
install manual.  While that option is somewhat useful in other contexts,
it often just leads to confusion when debugging.  It is much better to
stick with pairs of -s and -c.
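
For readers unfamiliar with what such a pair exercises: as described
further down this thread, the check amounts to a simple server on one
machine and a simple client on another.  A minimal Python sketch of that
idea follows.  It is not the mpdcheck source; the function names
(run_server, run_client, loopback_check) are hypothetical, and the
loopback address is used only so the example runs on a single machine.

```python
import socket
import threading


def run_server(ready, result, host="127.0.0.1"):
    """Listen on an OS-chosen port, accept one client, record its message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, 0))                 # port 0: let the OS pick a free port
        srv.listen(1)
        result["port"] = srv.getsockname()[1]
        ready.set()                         # the port is now known to the caller
        conn, _addr = srv.accept()
        with conn:
            result["received"] = conn.recv(1024)
            conn.sendall(b"ack")


def run_client(host, port, timeout=5.0):
    """Connect to (host, port), send a greeting, return the server's reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"hello")
        return sock.recv(1024)


def loopback_check():
    """Run the server in a thread and the client in the caller (one machine)."""
    ready = threading.Event()
    result = {}
    server = threading.Thread(target=run_server, args=(ready, result))
    server.start()
    if not ready.wait(5.0):
        raise RuntimeError("server did not start listening")
    reply = run_client("127.0.0.1", result["port"])
    server.join()
    return result["received"], reply
```

On a real cluster one would run the server half on one node and call
run_client from another node with the first node's hostname and port.  A
failure there points at name resolution, routing, or a firewall rather
than at MPICH.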

Rajeev:  I may remove mention of it from the install guide and alter  
the comments on -s and -c slightly.
Also, I may alter the usage message from mpdcheck slightly.
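
The two tracebacks quoted below fail differently: one raises
socket.gaierror (the hostname could not be resolved), the other
socket.error with 'No route to host' (the name resolved, but the address
was unreachable).  One way to separate out the first class of failure is
to resolve every name in the hosts file before attempting any connection.
The sketch below assumes one hostname per line, optionally followed by
':' and a count; parse_hosts and check_hosts are illustrative names, not
part of MPD.

```python
import socket


def parse_hosts(text):
    """Return the hostnames in a hosts file, skipping blanks and # comments."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Assumed format: "name" or "name:count"; keep only the name.
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def check_hosts(hostnames):
    """Map each hostname to its IPv4 addresses, or to the resolver's error."""
    report = {}
    for name in hostnames:
        try:
            infos = socket.getaddrinfo(name, None, socket.AF_INET,
                                       socket.SOCK_STREAM)
            report[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            report[name] = "resolution failed: %s" % exc
    return report
```

Running check_hosts(parse_hosts(open(path).read())) on each node, with
path pointing at that node's hosts file, would show whether every node
can resolve every other node's name.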

--ralph

On Dec 2, 2008, at 2:12 AM, Benjamin Svetitsky wrote:

> Thanks Rajeev, we are rechecking network settings.  But I don't
> trust mpdcheck.  On my old cluster (running MPICH for a year now)
> when I run
> mpdcheck -f ~/hosts.mpd -ssh
> it says -
>
> client on nodeA failed to access the server
> here is the output:
> Traceback (most recent call last):
>  File "/usr/local/bin/mpdcheck.py", line 103, in ?
>    sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>  File "<string>", line 1, in connect
> socket.gaierror: (-3, 'Temporary failure in name resolution')
>
> ** but when I run mpdcheck -s (and -c) between the two nodes there  
> is no problem in either direction.  (Does it matter that mpd was  
> running jobs at the time of this test?)
>
> 		Ben
>
> Rajeev Thakur wrote:
>> This means a simple client on one machine was not able to connect to a
>> simple server on another machine in the cluster (independent of MPICH or
>> MPD). Can you check the networking settings on the machines? Is there a
>> firewall preventing access?
>> Rajeev
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Benjamin Svetitsky
>>> Sent: Monday, December 01, 2008 9:42 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] Trouble with new installation -- failed to connect to mpd
>>>
>>> Thanks, Dave.  mpdcheck indeed points to a problem.  But the  
>>> message is not very illuminating, apart from pointing out which  
>>> links are giving trouble.  What really has me worried is that  
>>> mpdcheck gives me the *same* error message on my old cluster --  
>>> where MPICH has been working fine for a year!  The message:
>>>
>>> [root at nodeF ~]# mpdcheck -f mpd.hosts -ssh
>>> client on nodeE failed to access the server
>>> here is the output:
>>> Traceback (most recent call last):
>>>   File "/usr/local/bin/mpdcheck.py", line 103, in ?
>>>     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>>>   File "<string>", line 1, in connect
>>> socket.error: (113, 'No route to host')
>>>
>>>
>>>
>>> Dave Goodell wrote:
>>>> Hi Ben,
>>>>
>>>> Please try the MPD troubleshooting steps listed in appendix A of the
>>>> install guide:
>>>> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
>>>>
>>>> In particular, the mpdcheck utility should give you a better clue about
>>>> where the problem is.
>>>>
>>>> -Dave
>>>>
>>>> On Dec 1, 2008, at 4:11 AM, Benjamin Svetitsky wrote:
>>>>
>>>>> Dear MPI community,
>>>>>
>>>>> I already have MPICH 1.0.8 running well on a cluster of four
>>>>> Linux quad cores.  But now I can't get it running on a new cluster.  I think
>>>>> I installed everything exactly like the first system.  But when I try
>>>>> to mpdboot as root I get a minimal error message:
>>>>>
>>>>> [root at nodeE ~]# mpdboot -n 4 -f /root/mpd.hosts
>>>>> mpdboot_nodeE (handle_mpd_output 401): failed to connect to mpd on nodeF
>>>>> The /root/mpd.hosts contains:
>>>>> nodeE
>>>>> nodeF
>>>>> nodeG
>>>>> nodeH
>>>>>
>>>>> Oddly enough, after the failure of mpdboot as above I find:
>>>>> [root at nodeE ~]# mpdtrace
>>>>> nodeE
>>>>> nodeF
>>>>>
>>>>> If I do mpdallexit and log into nodeF, the result is:
>>>>> [root at nodeF ~]# mpdboot -n 4 -f /root/mpd.hosts
>>>>> mpdboot_nodeF (handle_mpd_output 392): failed to handshake with mpd on nodeE; recvd output={}
>>>>>
>>>>> Do I have a network problem or is it an MPICH problem?
>>>>>
>>>>> Thanks,
>>>>>    Ben
>>>>>
>>>>> -- 
>>>>> Prof. Benjamin Svetitsky         Phone:            +972-3-640 8870
>>>>> School of Physics and Astronomy  Fax:              +972-3-640 7932
>>>>> Tel Aviv University              E-mail:      bqs at julian.tau.ac.il
>>>>> 69978 Tel Aviv, Israel           WWW: http://julian.tau.ac.il/~bqs
>>>
>



