[mpich-discuss] MPI_File_open failed on NFS

Yusong Wang ywang25 at aps.anl.gov
Wed Mar 18 21:51:16 CDT 2009


Hi,

I sometimes get the following errors on an NFS file system when running
several MPI programs simultaneously. Each program opens several shared
files with parallel I/O (from MPICH2 1.0.5).

MPI_File_open failed: Other I/O error , error stack:
ADIO_OPEN(273): open failed on a remote node
rank 19 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 19: return code 1 
rank 13 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 13: return code 1 
rank 11 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 11: killed by signal 9 
rank 7 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 7: return code 1 
rank 6 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9 
rank 3 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 3: killed by signal 9 
rank 16 in job 2  weed3_47969   caused collective abort of all ranks
  exit status of rank 16: killed by signal 9 


If I run the program individually with a smaller number of nodes, there
is no problem. It looks like some limit of the NFS server was reached. I
monitored the file descriptors through /proc/sys/fs/file-nr and found
that the limit was not reached. What could cause this problem?
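
A minimal sketch of this kind of file-descriptor check, assuming a Linux
node where /proc/sys/fs/file-nr holds three integers (allocated handles,
allocated-but-unused handles, system-wide maximum). The names
parse_file_nr, usage_fraction and read_file_nr are hypothetical and are
not taken from any script used for the run above. The counters describe
only the kernel of the machine where the file is read, so the NFS server
and the compute nodes each report their own values.

```python
import time
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated by the kernel
    unused: int     # allocated handles that are not in use
    maximum: int    # system-wide limit on file handles


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated integers of file-nr."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    allocated, unused, maximum = (int(f) for f in fields)
    return FileNr(allocated, unused, maximum)


def usage_fraction(nr: FileNr) -> float:
    """Fraction of the system-wide limit that is currently in use."""
    return (nr.allocated - nr.unused) / nr.maximum


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse the counters on the local machine."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


if __name__ == "__main__":
    # Take a few samples one second apart while the MPI jobs are running.
    for _ in range(5):
        nr = read_file_nr()
        print(f"in use: {nr.allocated - nr.unused} of {nr.maximum} "
              f"({usage_fraction(nr):.1%})")
        time.sleep(1)
```

This only covers the system-wide file handle count; per-process limits
and NFS-specific resources are not visible in this file.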

Thanks,





