[mpich-discuss] mpdboot problem: unable to ping local mpd
Albert
zhuliting1986 at gmail.com
Wed Sep 29 06:47:40 CDT 2010
I have a problem with MPICH2 on a Lenovo cluster when I start more than three
nodes.
The error output is shown below.
Could anyone give me some advice? Thanks.
Albert
[root at c0107 ~]# mpdboot -n 2 -f mpd.hosts
mpdboot_c0107_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'c0104', 'ncpus': 1, 'ifhn': ''}; output:
mpdboot_c0104_1 (err_exit 415): mpd failed to start correctly on c0104
reason: 1: unable to ping local mpd;
invalid msg from mpd :{}:
** mpd may have disappeared, perhaps due to mismatched secretwords
** see msgs logged in syslog and /tmp/mpd2.logfile* on c0104
last printed output from mpd before becoming a daemon:
37857
mpdboot_c0104_1 (err_exit 421): contents of mpd logfile in /tmp:
logfile for mpd with pid 32501
c0104_37857: conn error in connect_lhs: No route to host
c0104_37857 (connect_lhs 542): failed to connect to lhs at c0107 46288
c0104_37857 (enter_ring 500): lhs connect failed
c0104_37857 (run 215): failed to enter ring
mpdboot_c0107_0 (err_exit 415): mpd failed to start correctly on c0107
[root at c0107 ~]# ssh c0104
Last login: Wed Sep 29 19:29:06 2010 from console
[root at c0104 ~]# mpdboot -n 2 -f mpd.hosts
[root at c0104 ~]# mpdtrace
c0104
c0107
[root at c0104 ~]# mpdboot -n 3 -f mpd.hosts
mpdboot_c0104_0 (mpdboot 406): error trying to start mpd(boot) at 2 {'host': 'c0108', 'ncpus': 1, 'ifhn': ''}; output:
mpdboot_c0108_2 (err_exit 415): mpd failed to start correctly on c0108
reason: 2: unable to ping local mpd;
invalid msg from mpd :{}:
** mpd may have disappeared, perhaps due to mismatched secretwords
** see msgs logged in syslog and /tmp/mpd2.logfile* on c0108
last printed output from mpd before becoming a daemon:
41819
mpdboot_c0108_2 (err_exit 421): contents of mpd logfile in /tmp:
logfile for mpd with pid 4894
c0108_41819: conn error in connect_rhs: No route to host
c0108_41819 (connect_rhs 602): failed to connect to rhs at 192.168.1.7 49518
c0108_41819 (enter_ring 513): rhs connect failed
c0108_41819 (run 215): failed to enter ring
mpdboot_c0104_0 (err_exit 415): mpd failed to start correctly on c0104
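
Both mpd logfiles above end in "No route to host" when an mpd tries to reach its ring neighbor on a high-numbered port (46288 and 49518), even though ssh between the same nodes works. One way to narrow this down might be to test plain TCP connectivity on an arbitrary high port between each pair of nodes, independently of mpd. The following is only a minimal sketch under assumptions: the helper names open_listener and can_connect are hypothetical and not part of MPICH2, and it assumes Python 3 is available on each node.

```python
import socket


def open_listener(port=0, backlog=1):
    """Open a TCP listener on all interfaces.

    port=0 lets the OS pick a free ephemeral port; read the chosen
    port back with srv.getsockname()[1].
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(backlog)
    return srv


def can_connect(host, port, timeout=3.0):
    """Attempt a TCP connection to host:port.

    Returns (True, "connected") on success, or (False, error text)
    on failure, so an error such as "No route to host" is visible.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, str(exc)
```

A possible way to use it: call open_listener() on one node (for example c0107), note the port it reports, and then call can_connect("c0107", port) from c0104 and c0108. If that also fails with "No route to host" while ssh succeeds, the cause is likely outside mpd, for example a packet filter rejecting ports other than 22, though that is a guess and not something the logs confirm.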