[mpich-discuss] MPI_File_open failed on NFS
Yusong Wang
ywang25 at aps.anl.gov
Wed Mar 18 21:51:16 CDT 2009
Hi,
I sometimes get the following errors on an NFS file system when running
several MPI programs simultaneously. Each program opens several shared
files with parallel I/O (from MPICH2 1.0.5).
MPI_File_open failed: Other I/O error , error stack:
ADIO_OPEN(273): open failed on a remote node
rank 19 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 19: return code 1
rank 13 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 13: return code 1
rank 11 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 11: killed by signal 9
rank 7 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 7: return code 1
rank 6 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 6: killed by signal 9
rank 3 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 3: killed by signal 9
rank 16 in job 2 weed3_47969 caused collective abort of all ranks
exit status of rank 16: killed by signal 9
If I run the program individually with a smaller number of nodes, there is
no problem. It looks like some limit on the NFS server was reached. I
monitored the file descriptors through /proc/sys/fs/file-nr and found that
the limit was not reached. What could cause this problem?
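For reference, the file-nr check amounts to something like the sketch
below. It assumes the usual Linux layout of /proc/sys/fs/file-nr, three
whitespace-separated integers (allocated handles, allocated-but-unused
handles, system-wide maximum). The names FileNr, parse_file_nr,
read_file_nr and usage_fraction are made up for illustration and do not
come from MPICH2 or any library.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated
    unused: int     # allocated but unused handles
    maximum: int    # system-wide limit on file handles


def parse_file_nr(text: str) -> FileNr:
    """Parse the contents of file-nr (assumed: three integers)."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    allocated, unused, maximum = (int(f) for f in fields)
    return FileNr(allocated, unused, maximum)


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse file-nr on the node where this runs."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


def usage_fraction(nr: FileNr) -> float:
    """Fraction of the system-wide limit currently in use."""
    if nr.maximum <= 0:
        raise ValueError("maximum must be positive")
    return (nr.allocated - nr.unused) / nr.maximum
```

Calling usage_fraction(read_file_nr()) periodically on a node while the
jobs run gives the fraction of that node's system-wide limit in use. This
only reflects the node it runs on, so it would probably have to be
repeated on the NFS server and on each compute node, and it says nothing
about per-process limits.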
Thanks,