[mpich-discuss] mpdboot problem: unable to ping local mpd
Dave Goodell
goodell at mcs.anl.gov
Wed Sep 29 08:05:51 CDT 2010
Running mpd as root is tricky. Don't do it unless you really need to and really know what you are doing.
Better yet, don't use mpd at all. Use Hydra instead; it's much more robust: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
-Dave
On Sep 29, 2010, at 6:47 AM CDT, Albert wrote:
> I have a problem with MPICH2 on a Lenovo cluster when I start more than three nodes.
>
> The error info is as follows.
> Could anyone give me some advice? Thanks.
>
> Albert
>
> [root@c0107 ~]# mpdboot -n 2 -f mpd.hosts
> mpdboot_c0107_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'c0104', 'ncpus': 1, 'ifhn': ''}; output:
> mpdboot_c0104_1 (err_exit 415): mpd failed to start correctly on c0104
> reason: 1: unable to ping local mpd;
> invalid msg from mpd :{}:
> ** mpd may have disappeared, perhaps due to mismatched secretwords
> ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0104
> last printed output from mpd before becoming a daemon:
> 37857
>
> mpdboot_c0104_1 (err_exit 421): contents of mpd logfile in /tmp:
> logfile for mpd with pid 32501
> c0104_37857: conn error in connect_lhs: No route to host
> c0104_37857 (connect_lhs 542): failed to connect to lhs at c0107 46288
> c0104_37857 (enter_ring 500): lhs connect failed
> c0104_37857 (run 215): failed to enter ring
> mpdboot_c0107_0 (err_exit 415): mpd failed to start correctly on c0107
> [root@c0107 ~]# ssh c0104
> Last login: Wed Sep 29 19:29:06 2010 from console
> [root@c0104 ~]# mpdboot -n 2 -f mpd.hosts
> [root@c0104 ~]# mpdtrace
> c0104
> c0107
> [root@c0104 ~]# mpdboot -n 3 -f mpd.hosts
> mpdboot_c0104_0 (mpdboot 406): error trying to start mpd(boot) at 2 {'host': 'c0108', 'ncpus': 1, 'ifhn': ''}; output:
> mpdboot_c0108_2 (err_exit 415): mpd failed to start correctly on c0108
> reason: 2: unable to ping local mpd;
> invalid msg from mpd :{}:
> ** mpd may have disappeared, perhaps due to mismatched secretwords
> ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0108
> last printed output from mpd before becoming a daemon:
> 41819
>
> mpdboot_c0108_2 (err_exit 421): contents of mpd logfile in /tmp:
> logfile for mpd with pid 4894
> c0108_41819: conn error in connect_rhs: No route to host
> c0108_41819 (connect_rhs 602): failed to connect to rhs at 192.168.1.7 49518
> c0108_41819 (enter_ring 513): rhs connect failed
> c0108_41819 (run 215): failed to enter ring
> mpdboot_c0104_0 (err_exit 415): mpd failed to start correctly on c0104
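
Both mpd logfiles above end in "No route to host", which is the operating system's EHOSTUNREACH error on a TCP connect. That usually points at the network path between the nodes (often a host firewall rejecting the connection, or a routing or interface problem) rather than at MPICH2 itself. The following is a minimal sketch, not part of MPICH2, for checking whether one node can open a TCP connection to a port on another. The function name check_tcp is hypothetical, and the host and port in the commented example are simply the ones printed in the log above. mpd chooses its listening port at startup, so take the port from the current logfile rather than reusing that number.

```python
import errno
import socket


def check_tcp(host, port, timeout=3.0):
    """Attempt a single TCP connection and return (ok, detail)."""
    try:
        # create_connection resolves the name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all within the timeout (packets silently dropped).
        return False, "timed out"
    except socket.gaierror as exc:
        # The hostname could not be resolved.
        return False, "name lookup failed: %s" % exc
    except OSError as exc:
        # Map the errno to its symbolic name, e.g. EHOSTUNREACH
        # ("No route to host") or ECONNREFUSED ("Connection refused").
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "%s: %s" % (name, exc.strerror)


# Example, run on c0104 while the mpd on c0107 is still listening:
# print(check_tcp("c0107", 46288))
```

If the check reports EHOSTUNREACH in one direction but connects in the other, that would be consistent with the asymmetric behaviour in the session above, where the ring formed when started from c0104 but not from c0107.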