[mpich-discuss] MPI_File_open failed on NFS
Rajeev Thakur
thakur at mcs.anl.gov
Thu Mar 19 09:22:43 CDT 2009
The problem could be that one of the cluster nodes is having trouble
contacting the NFS server and perhaps needs a remount/reboot. Can you run
one job on all nodes to see if it fails on some node?
Rajeev
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yusong Wang
> Sent: Wednesday, March 18, 2009 9:51 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: ywang25 at aps.anl.gov
> Subject: [mpich-discuss] MPI_File_open failed on NFS
>
> Hi,
>
> I sometimes get the following errors on an NFS file system when
> running several MPI programs simultaneously. Each program
> opens several shared files with parallel I/O (from MPICH2 1.0.5).
>
> MPI_File_open failed: Other I/O error , error stack:
> ADIO_OPEN(273): open failed on a remote node
> rank 19 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 19: return code 1
> rank 13 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 13: return code 1
> rank 11 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 11: killed by signal 9
> rank 7 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 7: return code 1
> rank 6 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 6: killed by signal 9
> rank 3 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 3: killed by signal 9
> rank 16 in job 2 weed3_47969 caused collective abort of all ranks
> exit status of rank 16: killed by signal 9
>
>
> If I run each program individually with a smaller number of
> nodes, there is no problem. It looks like some limit of the NFS
> server was reached. I monitored the file descriptors through
> /proc/sys/fs/file-nr and found that the limit was not reached.
> What could cause this problem?
>
> Thanks,