[mpich-discuss] mpdboot problem: unable to ping local mpd

Albert zhuliting1986 at gmail.com
Wed Sep 29 06:47:40 CDT 2010


I have a problem with MPICH2 on a Lenovo cluster when I start more than three
nodes.

The error output is shown below.
Could anyone give me some advice? Thanks.

Albert

[root at c0107 ~]# mpdboot -n 2 -f mpd.hosts
mpdboot_c0107_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'c0104', 'ncpus': 1, 'ifhn': ''}; output:
   mpdboot_c0104_1 (err_exit 415): mpd failed to start correctly on c0104
     reason: 1: unable to ping local mpd;
   invalid msg from mpd :{}:
   ** mpd may have disappeared, perhaps due to mismatched secretwords
   ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0104
   last printed output from mpd before becoming a daemon:
   37857

   mpdboot_c0104_1 (err_exit 421):   contents of mpd logfile in /tmp:
        logfile for mpd with pid 32501
        c0104_37857: conn error in connect_lhs: No route to host
        c0104_37857 (connect_lhs 542): failed to connect to lhs at c0107 46288
        c0104_37857 (enter_ring 500): lhs connect failed
        c0104_37857 (run 215): failed to enter ring
mpdboot_c0107_0 (err_exit 415): mpd failed to start correctly on c0107
[root at c0107 ~]# ssh c0104
Last login: Wed Sep 29 19:29:06 2010 from console
[root at c0104 ~]# mpdboot -n 2 -f mpd.hosts
[root at c0104 ~]# mpdtrace
c0104
c0107
[root at c0104 ~]# mpdboot -n 3 -f mpd.hosts
mpdboot_c0104_0 (mpdboot 406): error trying to start mpd(boot) at 2 {'host': 'c0108', 'ncpus': 1, 'ifhn': ''}; output:
   mpdboot_c0108_2 (err_exit 415): mpd failed to start correctly on c0108
     reason: 2: unable to ping local mpd;
   invalid msg from mpd :{}:
   ** mpd may have disappeared, perhaps due to mismatched secretwords
   ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0108
   last printed output from mpd before becoming a daemon:
   41819

   mpdboot_c0108_2 (err_exit 421):   contents of mpd logfile in /tmp:
        logfile for mpd with pid 4894
        c0108_41819: conn error in connect_rhs: No route to host
        c0108_41819 (connect_rhs 602): failed to connect to rhs at 192.168.1.7 49518
        c0108_41819 (enter_ring 513): rhs connect failed
        c0108_41819 (run 215): failed to enter ring
mpdboot_c0104_0 (err_exit 415): mpd failed to start correctly on c0104
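
A possible way to narrow this down, offered as a sketch and not a confirmed
diagnosis: both mpd logfiles above report "No route to host" when connecting to
a dynamically chosen mpd port (46288 and 49518), even though ssh logins between
the nodes succeed. The helper below, check_tcp, is hypothetical and not part of
MPICH2. It tries a plain TCP connection and returns the operating system's
error text. On Linux, "Connection refused" usually means the packet reached the
host but nothing was listening on that port, whereas "No route to host" on an
arbitrary port while port 22 works often points to a firewall rule, or to a
routing or hostname resolution problem. That reading is an assumption to verify
on the cluster.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection to (host, port).

    Returns (True, "") if the connection is established,
    otherwise (False, reason) where reason is the OS error text.
    """
    try:
        # create_connection resolves the host name and connects,
        # raising OSError (or a subclass) on failure or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, str(exc)
```

For example, running check_tcp("c0107", 22) and check_tcp("c0107", 50000) from
c0104 and comparing the two results may show whether only non-ssh ports are
blocked. Port 50000 is an arbitrary value chosen for illustration.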