[MPICH] no communication between master rank and slave ranks after /var was filled up
Christian Zemlin
zemlinc at upstate.edu
Wed Jun 13 10:13:38 CDT 2007
Dear MPI-experts,
I am working on a 16 dual-core Beowulf cluster with MPICH 1.2.6 and MPICH2 1.0.5p4, and it worked fine for some time.
Recently, my /var partition filled up completely because a program I ran wrote a large number of messages to the /var/log directory.
After that, NFS stopped working, but it resumed once I freed space on /var.
However, in both MPICH 1.2.6 and MPICH2 1.0.5p4, the master rank can no longer communicate with the slave ranks. Simple test programs such as cpi, hello++, and fpi run without errors, but they only print responses from rank 0. Every program I tried that uses Send/Recv stalls.
To see what happens if I run hello++ on a non-master node, I created a file
"machines" containing only one line:
node2
and started hello++ like this:
mpirun -np 3 -machinefile ./machines ./hello++
Still, I get the output:
Hello World! I am 0 of 1
It seems that the master rank cannot communicate with the slaves,
regardless of which nodes the master and slave processes run on.
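
The mismatch between the three processes requested with -np and the single "0 of 1" line can be checked mechanically. The following is only a minimal sketch in Python, not part of either MPICH distribution; the function name check_hello_output is hypothetical, and it assumes every rank prints exactly one line in the "Hello World! I am R of N" format quoted above.

```python
import re

# Greeting format printed by hello++, as quoted in this message.
LINE_RE = re.compile(r"Hello World! I am (\d+) of (\d+)")


def check_hello_output(output, requested_np):
    """Compare hello++ output with the -np value given to mpirun.

    Hypothetical helper: reports which ranks answered and whether the
    communicator size they printed matches the requested process count.
    """
    ranks = set()
    sizes = set()
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match:
            ranks.add(int(match.group(1)))
            sizes.add(int(match.group(2)))
    missing = sorted(set(range(requested_np)) - ranks)
    return {
        "ranks_seen": sorted(ranks),
        "sizes_reported": sorted(sizes),
        "missing_ranks": missing,
        "size_matches_np": sizes == {requested_np},
    }


if __name__ == "__main__":
    # The single line of output obtained from the run with -np 3.
    report = check_hello_output("Hello World! I am 0 of 1\n", requested_np=3)
    for key, value in report.items():
        print(key, value)
```

For the output above, this reports that ranks 1 and 2 never answered and that the only size printed was 1 rather than 3.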
Can anyone think of a possible reason, or suggest any further diagnostics?
Thank you,
Christian