[mpich-discuss] MPI_File_open failed on NFS

Yusong Wang ywang25 at aps.anl.gov
Fri Mar 20 09:02:45 CDT 2009


I ran a single job on all the nodes and it succeeded, including on the
nodes where the program had failed when multiple programs ran concurrently.
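
A minimal version of the single-job check Rajeev suggests below, where
each rank reports its host and whether an ordinary open on the NFS mount
succeeds, could look like this (the path /nfs/scratch is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME], path[256];
        FILE *f;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Each rank tries an ordinary create on the NFS mount, so a
           node with a stale or broken mount shows up by host name. */
        snprintf(path, sizeof(path), "/nfs/scratch/probe.%d", rank);
        f = fopen(path, "w");
        if (f == NULL) {
            fprintf(stderr, "rank %d on %s: ", rank, host);
            perror(path);
        } else {
            printf("rank %d on %s: ok\n", rank, host);
            fclose(f);
            remove(path);
        }
        MPI_Finalize();
        return 0;
    }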

Yusong

On Thu, 2009-03-19 at 09:22 -0500, Rajeev Thakur wrote:
> The problem could be that one of the cluster nodes is having trouble
> contacting the NFS server and perhaps needs a remount/reboot. Can you run
> one job on all nodes to see if it fails on some node?
> 
> Rajeev
> 
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov 
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yusong Wang
> > Sent: Wednesday, March 18, 2009 9:51 PM
> > To: mpich-discuss at mcs.anl.gov
> > Cc: ywang25 at aps.anl.gov
> > Subject: [mpich-discuss] MPI_File_open failed on NFS
> > 
> > Hi,
> > 
> > I got the following errors on an NFS file system sometimes when
> > running several MPI programs simultaneously. Each program opens
> > several shared files with parallel I/O (MPICH2 1.0.5); a sketch of
> > this open pattern follows the error output below.
> > 
> > MPI_File_open failed: Other I/O error , error stack:
> > ADIO_OPEN(273): open failed on a remote node
> > rank 19 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 19: return code 1 
> > rank 13 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 13: return code 1 
> > rank 11 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 11: killed by signal 9 
> > rank 7 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 7: return code 1 
> > rank 6 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 6: killed by signal 9 
> > rank 3 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 3: killed by signal 9 
> > rank 16 in job 2  weed3_47969   caused collective abort of all ranks
> >   exit status of rank 16: killed by signal 9 
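
For reference, a sketch of the collective open pattern in question,
written so that a failed open reports the failing rank and host instead
of aborting (the file name /nfs/scratch/shared.dat is illustrative; MPI
file operations default to MPI_ERRORS_RETURN, but setting the handler
explicitly documents the intent):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, rc, len;
        char host[MPI_MAX_PROCESSOR_NAME], msg[MPI_MAX_ERROR_STRING];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Have MPI_File_open return an error code rather than abort. */
        MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);

        rc = MPI_File_open(MPI_COMM_WORLD, "/nfs/scratch/shared.dat",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
        if (rc != MPI_SUCCESS) {
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d on %s: %s\n", rank, host, msg);
        } else {
            MPI_File_close(&fh);
        }
        MPI_Finalize();
        return 0;
    }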
> > 
> > 
> > If I run the programs individually, on a smaller number of nodes,
> > there is no problem. It looks like some limit on the NFS server is
> > being reached. I monitored the file descriptor count through
> > /proc/sys/fs/file-nr and found that limit was not reached. What
> > could cause this problem?
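
For illustration, /proc/sys/fs/file-nr holds three counters: allocated
file handles, allocated-but-unused handles, and the system-wide maximum.
A small reader that can be run alongside the jobs:

    #include <stdio.h>

    int main(void)
    {
        unsigned long allocated, unused, max;
        FILE *f = fopen("/proc/sys/fs/file-nr", "r");

        if (f == NULL) {
            perror("/proc/sys/fs/file-nr");
            return 1;
        }
        /* Format: <allocated> <allocated-but-unused> <maximum> */
        if (fscanf(f, "%lu %lu %lu", &allocated, &unused, &max) == 3)
            printf("allocated=%lu unused=%lu max=%lu\n",
                   allocated, unused, max);
        fclose(f);
        return 0;
    }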
> > 
> > Thanks,
> > 


