[mpich-discuss] MPI_File_open failed on NFS

Rajeev Thakur thakur at mcs.anl.gov
Thu Mar 19 09:22:43 CDT 2009


The problem could be that one of the cluster nodes is having trouble
contacting the NFS server and perhaps needs a remount/reboot. Can you run
one job on all nodes to see if it fails on some node?
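
A minimal, untested sketch of such a test (not from the original thread; the
file path is a placeholder for your actual NFS mount): every rank opens its
own file handle with MPI_COMM_SELF and prints its host name on failure, so a
bad mount shows up as a single complaining node instead of a collective abort.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    char host[MPI_MAX_PROCESSOR_NAME];
    char msg[MPI_MAX_ERROR_STRING];
    int rank, len, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Have MPI-IO return error codes instead of aborting,
       so every rank can report its own result. */
    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);

    /* Each rank opens independently on the shared NFS mount;
       replace the path with your real mount point. */
    err = MPI_File_open(MPI_COMM_SELF, "/nfs/scratch/opentest.dat",
                        MPI_MODE_CREATE | MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        printf("rank %d on %s: MPI_File_open failed: %s\n", rank, host, msg);
    } else {
        printf("rank %d on %s: MPI_File_open succeeded\n", rank, host);
        MPI_File_close(&fh);
    }

    MPI_Finalize();
    return 0;
}

Running this with one process per node (for example, with a machine file that
lists each node once) should make a consistently failing host stand out as a
candidate for the remount/reboot mentioned above.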

Rajeev

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yusong Wang
> Sent: Wednesday, March 18, 2009 9:51 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: ywang25 at aps.anl.gov
> Subject: [mpich-discuss] MPI_File_open failed on NFS
> 
> Hi,
> 
> I sometimes get the following errors on an NFS file system when 
> running several MPI programs simultaneously. Each program opens 
> several shared files with parallel I/O (from MPICH2 1.0.5).
> 
> MPI_File_open failed: Other I/O error , error stack:
> ADIO_OPEN(273): open failed on a remote node
> rank 19 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 19: return code 1 
> rank 13 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 13: return code 1 
> rank 11 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 11: killed by signal 9 
> rank 7 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 7: return code 1 
> rank 6 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 6: killed by signal 9 
> rank 3 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 3: killed by signal 9 
> rank 16 in job 2  weed3_47969   caused collective abort of all ranks
>   exit status of rank 16: killed by signal 9 
> 
> 
> If I run the programs individually on a smaller number of 
> nodes, there is no problem. It looks like some limit of the NFS 
> server is being reached. I monitored the open file descriptors 
> through /proc/sys/fs/file-nr and found that the limit was not 
> reached. What could cause this problem?
> 
> Thanks,


