[mpich-discuss] MPI_File_open failed on NFS
Yusong Wang
ywang25 at aps.anl.gov
Fri Mar 20 09:02:45 CDT 2009
I ran the program successfully on all of the nodes where it had failed
when multiple programs were running concurrently.
Yusong
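
For reference, the per-node check suggested in the quoted message below
could be done with a small probe along these lines. This is only a sketch
under some assumptions: it uses plain POSIX open() rather than
MPI_File_open, so it tests whether a node can reach the NFS mount at all,
not the MPI-IO path itself. The function name probe_open is made up for
illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to open `path` read-only and report the result together with the
 * host name, so that a failing node can be picked out of the combined
 * output. Returns 0 on success, or the errno value set by open(). */
int probe_open(const char *path)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown");
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate */

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        int err = errno;            /* save before any other call */
        fprintf(stderr, "%s: open(%s) failed: %s\n",
                host, path, strerror(err));
        return err;
    }

    close(fd);
    printf("%s: open(%s) ok\n", host, path);
    return 0;
}
```

Calling probe_open() on a file under the NFS mount from a one-line main()
and starting one copy per node gives one line of output per host. A node
that cannot contact the server should show up with an error such as a
stale file handle or a timeout.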
On Thu, 2009-03-19 at 09:22 -0500, Rajeev Thakur wrote:
> The problem could be that one of the cluster nodes is having trouble
> contacting the NFS server and perhaps needs a remount/reboot. Can you run
> one job on all nodes to see if it fails on some node?
>
> Rajeev
>
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yusong Wang
> > Sent: Wednesday, March 18, 2009 9:51 PM
> > To: mpich-discuss at mcs.anl.gov
> > Cc: ywang25 at aps.anl.gov
> > Subject: [mpich-discuss] MPI_File_open failed on NFS
> >
> > Hi,
> >
> > I sometimes get the following errors on an NFS file system when
> > running several MPI programs simultaneously. Each program
> > opens several shared
> > files with parallel I/O (from MPICH2 1.0.5).
> >
> > MPI_File_open failed: Other I/O error , error stack:
> > ADIO_OPEN(273): open failed on a remote node
> > rank 19 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 19: return code 1
> > rank 13 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 13: return code 1
> > rank 11 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 11: killed by signal 9
> > rank 7 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 7: return code 1
> > rank 6 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 6: killed by signal 9
> > rank 3 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 3: killed by signal 9
> > rank 16 in job 2 weed3_47969 caused collective abort of all ranks
> > exit status of rank 16: killed by signal 9
> >
> >
> > If I run the program individually with a smaller number of
> > nodes, there is no problem. It looks like some limit on the NFS
> > server was reached. I monitored the file descriptors through
> > /proc/sys/fs/file-nr and found that the limit was not reached.
> > What could cause this problem?
> >
> > Thanks,
>
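
The file-nr check mentioned in the original message can also be scripted.
The sketch below assumes a Linux system, where /proc/sys/fs/file-nr holds
three numbers: allocated file handles, allocated but unused handles, and
the system-wide maximum. The struct and function names are made up for
illustration.

```c
#include <stdio.h>

/* The three fields of /proc/sys/fs/file-nr on Linux. */
struct file_nr {
    unsigned long long allocated;  /* file handles currently allocated */
    unsigned long long unused;     /* allocated but unused handles */
    unsigned long long max;        /* system-wide limit (fs.file-max) */
};

/* Parse one line in file-nr format. Returns 0 on success, -1 on error. */
int parse_file_nr(const char *line, struct file_nr *out)
{
    if (sscanf(line, "%llu %llu %llu",
               &out->allocated, &out->unused, &out->max) != 3)
        return -1;
    return 0;
}

/* Read and parse /proc/sys/fs/file-nr on the local machine.
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_file_nr(struct file_nr *out)
{
    char line[128];
    FILE *f = fopen("/proc/sys/fs/file-nr", "r");

    if (f == NULL)
        return -1;
    char *got = fgets(line, sizeof line, f);
    fclose(f);
    return got != NULL ? parse_file_nr(line, out) : -1;
}
```

Note that these counters describe only the machine they are read on.
Reading them on a compute node says nothing about the NFS server, and the
per-process descriptor limit (RLIMIT_NOFILE) is a separate setting, so a
healthy file-nr does not rule out every descriptor limit.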