[MPICH] No route to host problem
Patrick E. Kane
pekane at uiuc.edu
Wed Jun 29 05:27:51 CDT 2005
Hello,
I am experimenting with the latest version of MPICH2 (mpich2-1.0.2)
on two of my test machines. One is running Fedora Core 4 and the other
is running Knoppix; uname -a gives:
Linux Fc 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 athlon i386 GNU/Linux
Linux Gush 2.4.26 #1 SMP Sa Apr 17 19:33:42 CEST 2004 i686 GNU/Linux
Below are the results of two failed experiments. Is there a work-around for
these problems?
First I started mpd on the Knoppix system:
--------------------------------------------------------
[Gush] $ mpdboot -n 2
[Gush] $ mpdtrace
Gush
Fc
[Gush] $ mpdtrace -l
Gush_38615
Fc_33077
[Gush] $ mpdrun -l -n 2 hostname
0: Gush
1: Fc
[Gush] $ mpdrun -l -n 2 /usr/local/examples/cpi
0: Process 0 of 2 is on Gush
1: Process 1 of 2 is on Fc
0: aborting job:
0: Fatal error in MPI_Bcast: Other MPI error, error stack:
0: MPI_Bcast(827): MPI_Bcast(buf=0xbffff88c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
0: MPIR_Bcast(229):
0: MPIC_Send(48):
0: MPIC_Wait(321):
0: MPIDI_CH3_Progress_wait(209): an error occurred while handling an event returned by MPIDU_Sock_Wait()
0: MPIDI_CH3I_Progress_handle_sock_event(1120): [ch3:sock] failed to connnect to remote process kvs_Gush_38631_0:1
0: MPIDU_Socki_handle_connect(815): connection failure (set=0,sock=1,errno=113:No route to host)
rank 0 in job 2 Gush_38615 caused collective abort of all ranks
exit status of rank 0: return code 13
-------------------------------------------------------
After running mpdallexit I then tried to start mpd on the FC4 box:
-------------------------------------------------------
[Fc] $ mpdboot -n 2
mpdboot_Fc_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'Gush', 'ncpus': 1, 'ifhn': ''}; output:
mpdboot_Gush_1 (err_exit 415): mpd failed to start correctly on Gush
reason: 1: unable to ping local mpd;
invalid msg from mpd :{}:
** mpd may have disappeared, perhaps due to mismatched secretwords
** see msgs logged in syslog and /tmp/mpd2.logfile* on Gush
last printed output from mpd before becoming a daemon:
38643
mpdboot_Gush_1 (err_exit 421): contents of mpd logfile in /tmp:
logfile for mpd with pid 1438
Gush_38643: conn error in connect_lhs: No route to host
Gush_38643 (connect_lhs 542): failed to connect to lhs at Fc 33089
Gush_38643 (enter_ring 500): lhs connect failed
Gush_38643 (run 215): failed to enter ring
mpdboot_Fc_0 (err_exit 415): mpd failed to start correctly on Fc
The logfile on the other system (Gush) contains:
[Gush] $ cat /tmp/mpd2.logfile*
logfile for mpd with pid 1438
Gush_38643: conn error in connect_lhs: No route to host
Gush_38643 (connect_lhs 542): failed to connect to lhs at Fc 33089
Gush_38643 (enter_ring 500): lhs connect failed
Gush_38643 (run 215): failed to enter ring
-------------------------------------------------------
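Both failures end in the same socket-level error (errno 113, EHOSTUNREACH on Linux) even though mpdtrace and hostname work, so one way to narrow it down is to test plain TCP connects between the two hosts, outside of MPICH2. The following is only a minimal sketch under assumptions: the helper name probe is hypothetical, and the guess that some ports are being blocked between the machines is not confirmed by the logs above.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect and return a short result string.

    Returns "ok" on success, "timeout" if nothing answered in time,
    or the symbolic errno name (e.g. "EHOSTUNREACH", "ECONNREFUSED").
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name-resolution failures and errors without an errno fall
        # through to a generic description.
        if exc.errno is not None:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return type(exc).__name__
```

Run from Gush against a port that something on Fc is known to be listening on (and the reverse), a result of "ECONNREFUSED" would suggest the host is reachable and nothing is listening, while "EHOSTUNREACH" would match the errors shown in the transcripts. The ports mpd picks change on every start, so the port numbers in the logs above are not reusable as-is.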
Thanks,
Pat Kane