[MPICH] no communication between master rank and slave ranks after /var was filled up

Christian Zemlin zemlinc at upstate.edu
Wed Jun 13 10:13:38 CDT 2007

Dear MPI-experts,

I am working on a Beowulf cluster of 16 dual-core machines with MPICH 1.2.6 and MPICH2 1.0.5p4, and it worked fine for some time.
Recently, my /var partition filled up completely because a program that I ran wrote a large number of messages to the /var/log directory.
After that, NFS stopped working, but it resumed once I freed up space on /var.

However, in both MPICH 1.2.6 and MPICH2 1.0.5p4, the master rank can no longer communicate with the slave ranks. Simple test programs like cpi, hello++, and fpi run without errors, but they only print responses from rank 0. Every program I tried that uses Send/Recv stalls.

To see what happens if I run hello++ on a non-master node, I created a file 
"machines" containing only one line:


and started hello++ like this:

mpirun -np 3 -machinefile ./machines ./hello++

Still, I get the output:

Hello World! I am 0 of 1
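
A reported size of 1 presumably means that the process sees only itself in its communicator, rather than the 3 processes requested with -np. To make that check less error-prone on longer outputs, here is a minimal sketch that summarizes which ranks and sizes appear. It assumes every rank prints one line in exactly the format above; the function name check_hello_output and its return keys are hypothetical, not part of MPICH.

```python
import re

# Line format assumed from the output shown above:
# "Hello World! I am <rank> of <size>"
HELLO_LINE = re.compile(r"Hello World! I am (\d+) of (\d+)")


def check_hello_output(text, expected_procs):
    """Summarize which ranks reported and whether the sizes match -np."""
    pairs = [(int(r), int(s)) for r, s in HELLO_LINE.findall(text)]
    ranks = sorted({r for r, _ in pairs})
    sizes = sorted({s for _, s in pairs})
    # Ranks 0..expected_procs-1 that never printed a line.
    missing = [r for r in range(expected_procs) if r not in ranks]
    return {
        "lines": len(pairs),
        "ranks_seen": ranks,
        "sizes_seen": sizes,
        "missing_ranks": missing,
        "size_matches_np": sizes == [expected_procs],
    }


if __name__ == "__main__":
    # The single line of output quoted above, checked against -np 3.
    sample = "Hello World! I am 0 of 1\n"
    print(check_hello_output(sample, expected_procs=3))
```

For the output above, the summary lists ranks 1 and 2 as missing and flags that the reported size does not match the -np value.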

It seems that the master rank cannot communicate with the slave ranks,
regardless of which nodes the master and slave processes run on.
Can anyone think of a possible reason, or suggest any further diagnostics?

Thank you,


