[mpich-discuss] Trouble with new installation -- failed to connect to mpd
Benjamin Svetitsky
bqs at julian.tau.ac.il
Tue Dec 2 07:15:23 CST 2008
Actually, I find that mpdcheck -f is convenient for a quick check around
the cluster, and I have found why it gave results inconsistent with the
direct -s/-c checks on our old cluster.
mpdcheck -f runs a script that finds out its own hostname and then sets
up -s/-c checks on other hosts. In my case, the hostname is nodeA.
mpdcheck somehow grabbed the fully qualified hostname, nodeA.tau.ac.il.
Then it ran mpdcheck -s, and asked nodeB to do mpdcheck -c with the
fully qualified name. But we use fixed host tables, and nodeB knows
only about the short name nodeA. So it couldn't find nodeA.tau.ac.il,
and the test failed. When I did the -s/-c checks I used the short name
only. This is also apparently good enough for mpd, which works fine on
our cluster.
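For reference, the failing step is easy to reproduce with a few lines of
Python, independent of mpdcheck (just a sketch; the host names are the
examples from above):

   import socket

   # With fixed host tables that list only the short name "nodeA",
   # the second lookup fails with the same gaierror that mpdcheck -f
   # reported ('Temporary failure in name resolution').
   for name in ("nodeA", "nodeA.tau.ac.il"):
       try:
           print("%s -> %s" % (name, socket.gethostbyname(name)))
       except socket.gaierror as err:
           print("%s -> lookup failed: %s" % (name, err))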
Our new cluster works now, after fixing some hostname inconsistencies.
Thanks everybody for the help. -Ben
Ralph Butler wrote:
> I have been thinking I should remove the use of the -f option from the
> install manual. While that option is somewhat useful in other contexts,
> it often just leads to confusion when debugging. It is much better to
> stick with pairs of -s and -c.
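>
> For example, run the server end on one host; it prints the host name
> and the port it is listening on, something like this (the port below
> is just an example):
>
>    [root at nodeA ~]# mpdcheck -s
>    server listening at INADDR_ANY on: nodeA 4321
>
> Then run the matching client on the other host with that host and port:
>
>    [root at nodeB ~]# mpdcheck -c nodeA 4321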
>
> Rajeev: I may remove mention of it from the install guide and alter the
> comments on -s and -c slightly.
> Also, I may alter the usage message from mpdcheck slightly.
>
> --ralph
>
> On Tue, Dec 2, 2008, at 2:12 AM, Benjamin Svetitsky wrote:
>
>> Thanks, Rajeev, we are rechecking the network settings. But I don't trust
>> mpdcheck. On my old cluster (running MPICH for a year now), when I run
>> mpdcheck -f ~/hosts.mpd -ssh
>> it says:
>>
>> client on nodeA failed to access the server
>> here is the output:
>> Traceback (most recent call last):
>>   File "/usr/local/bin/mpdcheck.py", line 103, in ?
>>     sock.connect((argv[argidx+1],int(argv[argidx+2]))) # note double parens
>>   File "<string>", line 1, in connect
>> socket.gaierror: (-3, 'Temporary failure in name resolution')
>>
>> But when I run mpdcheck -s (and -c) between the two nodes, there is
>> no problem in either direction. (Does it matter that mpd was running
>> jobs at the time of this test?)
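>>
>> For what it's worth, as I understand it the -s/-c pair boils down to a
>> plain TCP listen and connect, roughly the following (a sketch, not
>> mpdcheck's actual code):
>>
>>    import socket, sys
>>
>>    # Usage: python pairtest.py -s            (on the server host)
>>    #        python pairtest.py -c HOST PORT  (on the client host)
>>    if sys.argv[1] == "-s":
>>        # Listen on an ephemeral port on all interfaces and report it.
>>        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>        s.bind(("", 0))
>>        s.listen(1)
>>        print("listening on port %d" % s.getsockname()[1])
>>        conn, addr = s.accept()
>>        print("connection from %s:%d" % addr)
>>        conn.close()
>>        s.close()
>>    else:
>>        # Connect to the given host and port; this is the connect()
>>        # that raised the gaierror in the traceback above.
>>        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>        c.connect((sys.argv[2], int(sys.argv[3])))
>>        print("connected OK")
>>        c.close()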
>>
>> Ben
>>
>> Rajeev Thakur wrote:
>>> This means a simple client on one machine was not able to connect to a
>>> simple server on another machine in the cluster (independent of MPICH
>>> or MPD). Can you check the networking settings on the machines? Is
>>> there a firewall preventing access?
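>>>
>>> A quick way to test this by hand, independent of mpdcheck (just a
>>> sketch), is a bare connect with a timeout. Note the two different
>>> failures: "Temporary failure in name resolution" means the host name
>>> never resolved, while "No route to host" means it resolved but there
>>> is no network path (often a firewall):
>>>
>>>    import socket, sys
>>>
>>>    # Usage: python conntest.py HOST PORT
>>>    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>>    s.settimeout(5)                     # don't hang forever
>>>    try:
>>>        s.connect((sys.argv[1], int(sys.argv[2])))
>>>        print("connected OK")
>>>    except socket.gaierror as err:      # name resolution failed
>>>        print("lookup failed: %s" % (err,))
>>>    except socket.error as err:         # e.g. (113, 'No route to host')
>>>        print("connect failed: %s" % (err,))
>>>    s.close()
>>>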
>>> Rajeev
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Benjamin
>>>> Svetitsky
>>>> Sent: Monday, December 01, 2008 9:42 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] Trouble with new installation -- failed
>>>> to connect to mpd
>>>>
>>>> Thanks, Dave. mpdcheck indeed points to a problem. But the message
>>>> is not very illuminating, apart from pointing out which links are
>>>> giving trouble. What really has me worried is that mpdcheck gives
>>>> me the *same* error message on my old cluster -- where MPICH has
>>>> been working fine for a year! The message:
>>>>
>>>> [root at nodeF ~]# mpdcheck -f mpd.hosts -ssh
>>>> client on nodeE failed to access the server
>>>> here is the output:
>>>> Traceback (most recent call last):
>>>>   File "/usr/local/bin/mpdcheck.py", line 103, in ?
>>>>     sock.connect((argv[argidx+1],int(argv[argidx+2]))) # note double parens
>>>>   File "<string>", line 1, in connect
>>>> socket.error: (113, 'No route to host')
>>>>
>>>> Dave Goodell wrote:
>>>>> Hi Ben,
>>>>>
>>>>> Please try the MPD troubleshooting steps listed in appendix A of
>>>>> the install guide:
>>>>> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
>>>>>
>>>>> In particular, the mpdcheck utility should give you a better clue
>>>>> about where the problem is.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Dec 1, 2008, at 4:11 AM, Benjamin Svetitsky wrote:
>>>>>
>>>>>> Dear MPI community,
>>>>>>
>>>>>> I already have MPICH 1.0.8 running well on a cluster of four Linux
>>>>>> quad-cores. But now I can't get it running on a new cluster. I
>>>>>> think I installed everything exactly like on the first system. But
>>>>>> when I try to mpdboot as root, I get a minimal error message:
>>>>>>
>>>>>> [root at nodeE ~]# mpdboot -n 4 -f /root/mpd.hosts
>>>>>> mpdboot_nodeE (handle_mpd_output 401): failed to connect to mpd on nodeF
>>>>>>
>>>>>> The /root/mpd.hosts contains:
>>>>>> nodeE
>>>>>> nodeF
>>>>>> nodeG
>>>>>> nodeH
>>>>>>
>>>>>> Oddly enough, after the failure of mpdboot as above I find:
>>>>>> [root at nodeE ~]# mpdtrace
>>>>>> nodeE
>>>>>> nodeF
>>>>>>
>>>>>> If I do mpdallexit and log into nodeF, the result is:
>>>>>> [root at nodeF ~]# mpdboot -n 4 -f /root/mpd.hosts
>>>>>> mpdboot_nodeF (handle_mpd_output 392): failed to handshake with
>>>>>> mpd on nodeE; recvd output={}
>>>>>>
>>>>>> Do I have a network problem or is it an MPICH problem?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
--
Prof. Benjamin Svetitsky         Phone: +972-3-640 8870
School of Physics and Astronomy  Fax: +972-3-640 7932
Tel Aviv University              E-mail: bqs at julian.tau.ac.il
69978 Tel Aviv, Israel           WWW: http://julian.tau.ac.il/~bqs