[MPICH] No route to host problem

Patrick E. Kane pekane at uiuc.edu
Wed Jun 29 05:27:51 CDT 2005


Hello,

I am experimenting with the latest version of MPICH2 (mpich2-1.0.2)
on two of my test machines.  One is running Fedora Core 4 and the other
is running Knoppix.  On the two machines, uname -a gives:

  Linux Fc 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 athlon i386 GNU/Linux
  Linux Gush 2.4.26 #1 SMP Sa Apr 17 19:33:42 CEST 2004 i686 GNU/Linux

Below are the results of two failed experiments.  Is there a work-around for
these problems?

First I started mpd on the Knoppix system:

  --------------------------------------------------------
  [Gush] $  mpdboot -n 2

  [Gush] $  mpdtrace
  Gush
  Fc

  [Gush] $  mpdtrace -l
  Gush_38615
  Fc_33077

  [Gush] $  mpdrun -l -n 2 hostname
  0: Gush
  1: Fc

  [Gush] $  mpdrun -l -n 2 /usr/local/examples/cpi
  0: Process 0 of 2 is on Gush
  1: Process 1 of 2 is on Fc
  0: aborting job:
  0: Fatal error in MPI_Bcast: Other MPI error, error stack:
  0: MPI_Bcast(827): MPI_Bcast(buf=0xbffff88c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
  0: MPIR_Bcast(229):
  0: MPIC_Send(48):
  0: MPIC_Wait(321):
  0: MPIDI_CH3_Progress_wait(209): an error occurred while handling an event returned by MPIDU_Sock_Wait()
  0: MPIDI_CH3I_Progress_handle_sock_event(1120): [ch3:sock] failed to connnect to remote process kvs_Gush_38631_0:1
  0: MPIDU_Socki_handle_connect(815): connection failure (set=0,sock=1,errno=113:No route to host)
  rank 0 in job 2  Gush_38615   caused collective abort of all ranks
    exit status of rank 0: return code 13

  -------------------------------------------------------

After running mpdallexit, I then tried to start mpd from the FC4 box:

  -------------------------------------------------------

  [Fc] $ mpdboot -n 2
  mpdboot_Fc_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'Gush', 'ncpus': 1, 'ifhn': ''}; output:
     mpdboot_Gush_1 (err_exit 415): mpd failed to start correctly on Gush
       reason: 1: unable to ping local mpd;
     invalid msg from mpd :{}:
     ** mpd may have disappeared, perhaps due to mismatched secretwords
     ** see msgs logged in syslog and /tmp/mpd2.logfile* on Gush
     last printed output from mpd before becoming a daemon:
     38643

     mpdboot_Gush_1 (err_exit 421):   contents of mpd logfile in /tmp:
          logfile for mpd with pid 1438
          Gush_38643: conn error in connect_lhs: No route to host
          Gush_38643 (connect_lhs 542): failed to connect to lhs at Fc 33089
          Gush_38643 (enter_ring 500): lhs connect failed
          Gush_38643 (run 215): failed to enter ring
  mpdboot_Fc_0 (err_exit 415): mpd failed to start correctly on Fc

The logfile on the other system (Gush) contains:

  [Gush] $  cat /tmp/mpd2.logfile*
  logfile for mpd with pid 1438
  Gush_38643: conn error in connect_lhs: No route to host
  Gush_38643 (connect_lhs 542): failed to connect to lhs at Fc 33089
  Gush_38643 (enter_ring 500): lhs connect failed
  Gush_38643 (run 215): failed to enter ring
  -------------------------------------------------------
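
Both failures end in the same "No route to host" error on a TCP connect, so one
way to narrow this down might be to try a plain TCP connection between the two
machines outside of MPICH.  Below is a minimal sketch of such a check, assuming
Python 3 is available; the probe() helper and the command-line usage are my own
illustration and are not part of MPICH2 or mpd.  The port has to be one where
something is actually listening on the remote side (for example the mpd port
printed by mpdtrace -l), since mpd picks its ports dynamically.

```python
import errno
import socket
import sys


def probe(host, port, timeout=3.0):
    """Attempt a plain TCP connect.

    Returns (True, None) on success, or (False, name) on failure, where
    name is the symbolic errno (e.g. EHOSTUNREACH, ECONNREFUSED) when one
    is available, otherwise the exception class name.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Timeouts and name-resolution errors may not carry a usable errno,
        # so fall back to the exception type in that case.
        name = errno.errorcode.get(exc.errno, type(exc).__name__)
        return False, name


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python probe.py Fc 33089
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    print(target_host, target_port, probe(target_host, target_port))
```

If this reports EHOSTUNREACH in the same direction that mpd and cpi fail, the
problem would seem to be in the network or packet-filter setup between the two
hosts rather than in MPICH2 itself.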

Thanks,

Pat Kane
--------



