[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

David Mathog mathog at caltech.edu
Tue Feb 9 15:34:10 CST 2010


> Please try hydra instead (mpiexec.hydra).  Hydra will be the default  
> process manager in the upcoming release and is generally faster and  
> more reliable than mpd:
> 
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager

Still some issues, but I found and fixed a problem with rsh on one node,
which resolved almost everything.  It started with this observation:

/usr/common/bin/rsh -f /usr/common/etc/machines.LINUX_INTEL \
  'rsh $HOSTNAME hostname'
monkey01.cluster
monkey02.cluster
...
monkey14.cluster
poll: protocol failure in circuit setup
monkey16.cluster
...
monkey20.cluster

Then:

monkey15> rsh monkey15 hostname
poll: protocol failure in circuit setup
monkey15> rsh monkey01 'rsh monkey15 hostname'
monkey15.cluster

Very strange, since the compute nodes are all built from the same image.
The firewall is off, so I kept looking... Found it: /etc/hosts was
corrupted, apparently by one of the Mandriva autoconfiguration tools,
which I probably turned on by accident on that one node, so that

127.0.0.1               localhost
192.168.1.15            monkey15.cluster monkey15 

became

127.0.0.1               monkey15.cluster monkey15 localhost
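To see why that one line breaks local rsh, here is a minimal sketch (a
hypothetical helper, not part of MPICH or rsh) of how the resolver walks
/etc/hosts: it returns the address on the first line that lists the name,
so the corrupted file maps monkey15 to the loopback address, which the
remote side cannot reach during circuit setup.

```python
# Hypothetical illustration of first-match /etc/hosts resolution.
def resolve(hosts_lines, name):
    """Return the address from the first line listing `name` as an alias."""
    for line in hosts_lines:
        fields = line.split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None

good = ["127.0.0.1  localhost",
        "192.168.1.15  monkey15.cluster monkey15"]
bad  = ["127.0.0.1  monkey15.cluster monkey15 localhost"]

print(resolve(good, "monkey15"))  # 192.168.1.15 -- a reachable address
print(resolve(bad,  "monkey15"))  # 127.0.0.1 -- loopback, so the remote
                                  # end's circuit setup cannot connect back
```

That matches the symptom above: from monkey01, monkey15 still resolves via
monkey01's own (correct) hosts file, so the indirect rsh works.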

Resolved that.  Now all compute nodes can successfully run this:

mpiexec.hydra -f /usr/common/etc/machines.LINUX_INTEL -bootstrap rsh \
  -n 20 /bin/hostname

Fixing that also resolved the mpdboot and mpiexec (no .hydra) issues, so
from the head node:

  mpdboot -f /usr/common/etc/mpich2.machines.LINUX_INTEL_Safserver \
      -n 21 -r rsh --ifhn=192.168.1.220 -v
  mpiexec -n 20 /opt/mpich2_121/examples/cpi

works.  Yeah!

Still problems on the head node with hydra, though.  Recall that it has
both eth0 and eth1, and the cluster is on eth1, where the host is
safserver.cluster.  Trying to run mpiexec.hydra this way:

mpiexec.hydra -f /usr/common/etc/machines.LINUX_INTEL -bootstrap rsh \
  -n 1 /bin/hostname
[proxy at monkey01.cluster] HYDU_sock_connect (./utils/sock/sock.c:141):
connect error (Network is unreachable)
[proxy at monkey01.cluster] main (./pm/pmiserv/pmi_proxy.c:108): unable to
connect to the main server
[proxy at monkey02.cluster] HYDU_sock_connect (./utils/sock/sock.c:141):
connect error (Network is unreachable)
[proxy at monkey02.cluster] main (./pm/pmiserv/pmi_proxy.c:108): unable to
connect to the main server
Stack fault (signal 16)

I think this is because it is trying to use eth0 where it must use eth1.
The log files on the two compute nodes do not show any incoming
connections.  How does one tell hydra which interface to use???

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
