[mpich-discuss] MPI_File_get_size error message

Rob Latham robl at mcs.anl.gov
Tue Nov 1 11:21:43 CDT 2011


On Thu, Oct 27, 2011 at 11:12:01PM -0700, Eugene Loh wrote:

> File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type
> F_RDLCK/0,whence 0) with return value
> FFFFFFFF and errno 5.
> - If the file system is NFS, you need to use NFS version 3, ensure
> that the lockd daemon is running
> on all the machines, and mount the directory with the 'noac' option
> (no attribute caching).
> - If the file system is LUSTRE, ensure that the directory is mounted
> with the 'flock' option.
> ADIOI_Set_lock:: Input/output error
> ADIOI_Set_lock:offset 0, length 1
> 
> If I take the error message at face value, I should check (in my
> case) NFS.  It's NFSv3 and it appears lockd is running.  I'm not
> real sure if noac is set, but I suspect it is not.  But is that
> really the problem here?  If I look at ADIOI_Set_lock, a fcntl()
> failed.  Is that necessarily an indication of the NFS/Lustre
> conditions discussed in the error message?  Incidentally, errno 5
> appears to be EIO, though I don't know if that's any help.
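
A quick way to confirm what an errno value means on a given machine is
to ask the C library. The sketch below is only an illustration:
describe_errno() is a hypothetical helper, not part of MPICH, and the
numeric value of EIO is platform-defined (it is 5 on Linux).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of MPICH): format an errno value and
 * the C library's description of it into out.  Returns what snprintf
 * returns: the length the full message needs, excluding the NUL. */
static int describe_errno(int err, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    return snprintf(out, outlen, "errno %d (%s)", err, strerror(err));
}
```

Calling describe_errno(EIO, buf, sizeof buf) and printing buf shows the
text your C library associates with EIO, which can be compared against
the last part of the ADIOI_Set_lock message.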

NFS is an awful file system for parallel access, primarily because
NFS clients are allowed to cache on the client side for arbitrarily
long amounts of time.  Turning off attribute caching helps a bit.  The
only thing we've found that even remotely works is to fcntl() lock the
file before every I/O operation (both reads and writes).
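
To make "lock before every I/O operation" concrete, here is a minimal
sketch of a read wrapped in an fcntl() byte-range lock. It is not the
ADIOI_Set_lock code; range_lock() and locked_pread() are hypothetical
names, and the retry on EINTR is my own assumption about reasonable
behavior. It uses the same F_SETLKW / F_RDLCK request that appears in
the error message.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose pread() under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set or clear a blocking advisory lock on the
 * byte range [offset, offset + len).  type is F_RDLCK, F_WRLCK or
 * F_UNLCK.  Returns 0 on success, -1 with errno set on failure. */
static int range_lock(int fd, short type, off_t offset, off_t len)
{
    struct flock lk = {0};
    int rc;

    lk.l_type = type;
    lk.l_whence = SEEK_SET;
    lk.l_start = offset;
    lk.l_len = len;          /* note: 0 would mean "to end of file" */

    do {
        rc = fcntl(fd, F_SETLKW, &lk);   /* wait until the lock is granted */
    } while (rc == -1 && errno == EINTR);
    return rc;
}

/* Hypothetical helper: take a read lock, read, then release the lock.
 * Returns the byte count from pread(), or -1 with errno set. */
static ssize_t locked_pread(int fd, void *buf, size_t count, off_t offset)
{
    ssize_t n;
    int saved;

    assert(buf != NULL);
    if (count == 0)
        return 0;            /* avoid a zero-length (whole-file) lock */
    if (range_lock(fd, F_RDLCK, offset, (off_t)count) == -1)
        return -1;           /* e.g. EIO from the NFS lock manager */

    n = pread(fd, buf, count, offset);
    saved = errno;           /* keep pread's errno across the unlock */
    range_lock(fd, F_UNLCK, offset, (off_t)count);
    errno = saved;
    return n;
}
```

A write path would look the same with F_WRLCK and pwrite(). On NFS each
of these fcntl() calls goes out to the lock manager, so a failure like
the EIO above surfaces at the range_lock() step, before any data is
read.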

Since you get this error message only occasionally, I'm not sure what
advice to offer you.  None of us spend any time on NFS: it's offered
only as a convenience.  Maybe in 2011 we should not even offer it.
PVFS and Lustre are both freely available parallel file systems.

> Anyhow, regardless of whether noac is set or not, that setting is
> never changed and yet the test usually passes for us and only
> occasionally fails.
> 
> Could the real issue be some other NFS hiccup, with the
> NFSv3/lockd/noac verbiage being a red herring?  Any other
> help/suggestions?

Yes, this is probably some NFS hiccup. Maybe if you change your NFS
server to communicate over TCP instead of UDP it will be more reliable
in the face of simultaneous fcntl lock requests?  That's just a guess. 

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

