FW: Problem where the master node cannot send tasks to slaves but NOT vice versa...

Capehart, William J William.Capehart at sdsmt.edu
Wed Sep 7 15:57:30 CDT 2005


I tried to send this to the mpi-users group but nothing ever posted.  Ideas would be appreciated....

================================================================ 
Bill Capehart <William.Capehart at sdsmt.edu>   Associate Professor 
Institute of Atmospheric Sciences         Land Surface Processes 
213 Mineral Industries Building                 Hydrometeorology 
South Dakota School of Mines and Technology Ph:  +1-605-394-1994 
501 East Saint Joseph Street                Fax: +1-605-394-6061 
Rapid City, SD 57701-3995                Mobile: +1-605-484-5692 
=================== http://capehart.sdsmt.edu ================== 



-----Original Message-----
From: Capehart, William J 
Sent: Friday, 02 September, 2005 13:12
To: 'mpi-users at mcs.anl.gov'
Subject: Problem where the master node cannot send tasks to slaves but NOT vice versa...

Friday, 02 September 2005

MPI-USERS group:

We have a strange problem that recently emerged on our cluster (a Microway). After a by-the-numbers restart (actually quite a few B-T-N restarts) and a reinstall of MPICH with PGI's compilers (both 1.2.6 and 1.2.7, using both RSH and SSH for process startup), we are unable to send distributed MPI-compiled tasks across the cluster from the master node to the slaves.  However, the slave nodes seem to work just fine and can distribute tasks from one slave node to the other nodes - including the master node.  We cannot ascertain what has changed to make this error emerge; all was working well before.

The offending error when we try to distribute tasks from the master to the slave nodes is:

p0_22406:  p4_error: Timeout in making connection to remote process on nox002: 0

... where nox002 is one of the slave nodes (the error emerges when the master tries to connect to any slave node, regardless of which node is listed first - including tests where a single slave node and the master node are the only participating machines).  The error pops up after several minutes of inactivity following the mpirun request, for as few as two processors (and takes about as long whether ssh or rsh is used to drive MPICH).
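For what it is worth, even a bare-bones program hangs the same way when launched from the master.  A minimal sketch of the kind of test we run is below (the file name, the mpicc wrapper invocation, and the "hosts" machinefile are illustrative, not our actual files):

/* hello_mpi.c -- minimal connectivity check (illustrative sketch,
 * not our actual application).
 *
 * Build:  mpicc -o hello_mpi hello_mpi.c
 * Run:    mpirun -np 2 -machinefile hosts ./hello_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                      /* hangs here from the master */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total process count */
    MPI_Get_processor_name(host, &len);          /* which machine we landed on */

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

When started from a slave node, all ranks print and exit; when started from the master with any slave in the machinefile, we get the p4_error timeout above before any rank prints.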

We even installed MPICH on my Linux office workstation (not part of the cluster, and running a different Linux kernel and distribution), and the cluster's master node was able to send tasks to my desktop as well as the other way around, from the office desktop to the master node (the slave nodes are local to the master node and invisible to the rest of our Unix/Linux fleet, so slave-to-noncluster-desktop tests could not be performed).

Meanwhile, all nodes on the cluster (slave and master alike) are quite happy to use ssh, rsh, rcp, and scp to log in to each other and to pass instructions (e.g., "ssh {machine} date") and data between each other with no warnings, hiccups, or other signs of problems.  We also have the same version of MPICH and the PGI compilers on all participating engines in this test.

Does anyone have any ideas?  We are fresh out of 'em...


Cheers and Thanks Much
================================================================ 
Bill Capehart <William.Capehart at sdsmt.edu>   Associate Professor 
Institute of Atmospheric Sciences         Land Surface Processes 
213 Mineral Industries Building                 Hydrometeorology 
South Dakota School of Mines and Technology Ph:  +1-605-394-1994 
501 East Saint Joseph Street                Fax: +1-605-394-6061 
Rapid City, SD 57701-3995                Mobile: +1-605-484-5692 
========== http://www.hpcnet.org/sdsmt/wcapehart/about ========= 



