[mpich-discuss] MPI_File_get_size error message

Eugene Loh eugene.loh at oracle.com
Tue Nov 1 11:46:20 CDT 2011


I appreciate the reply, and I get the general message.  Thanks.

Before I retire this issue completely, however, I wanted to clarify.  
The test in question has no "contention."  Each MPI process has its own 
file.  Each process opens its respective file, checks size, reads it, 
closes it.  No two processes are trying to access the same file.  Your 
comments still apply?

On 11/1/2011 9:21 AM, Rob Latham wrote:
> On Thu, Oct 27, 2011 at 11:12:01PM -0700, Eugene Loh wrote
>> File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type
>> F_RDLCK/0,whence 0) with return value
>> FFFFFFFF and errno 5.
>> - If the file system is NFS, you need to use NFS version 3, ensure
>> that the lockd daemon is running
>> on all the machines, and mount the directory with the 'noac' option
>> (no attribute caching).
>> - If the file system is LUSTRE, ensure that the directory is mounted
>> with the 'flock' option.
>> ADIOI_Set_lock:: Input/output error
>> ADIOI_Set_lock:offset 0, length 1
>>
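For reference, the two mount-time fixes the error message is pointing at look roughly like the following; the server names and paths here are placeholders, not anything from the original report:

```shell
# NFSv3 with attribute caching disabled (placeholder server/path):
mount -t nfs -o vers=3,noac server:/export /mnt/nfs

# Lustre with POSIX fcntl locks enabled (placeholder MGS/fsname):
mount -t lustre -o flock mgsnode:/fsname /mnt/lustre
```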
>> If I take the error message at face value, I should check (in my
>> case) NFS.  It's NFSv3 and it appears lockd is running.  I'm not
>> real sure if noac is set, but I suspect it is not.  But is that
>> really the problem here?  If I look at ADIOI_Set_lock, a fcntl()
>> failed.  Is that necessarily an indication of the NFS/Lustre
>> conditions discussed in the error message?  Incidentally, errno 5
>> appears to be EIO, though I don't know if that's any help.
> NFS is an awful file system for parallel access, primarily because
> NFS clients are allowed to cache on the client side for arbitrarily
> long amounts of time.  Turning off attribute caching helps a bit.
> The only thing we've found that even remotely works is to fcntl()
> lock the file before every I/O operation (both reads and writes).
>
> Since you get this error message only occasionally, I'm not sure what
> advice to offer you.  None of us spend any time on NFS: it's offered
> only as a convenience.  Maybe in 2011 we should not even offer it.
> PVFS and Lustre are both freely available parallel file systems.
>> Anyhow, regardless of whether noac is set or not, that setting is
>> never changed and yet the test usually passes for us and only
>> occasionally fails.
>>
>> Could the real issue be some other NFS hiccup, with the
>> NFSv3/lockd/noac verbiage being a red herring?  Any other
>> help/suggestions?
> Yes, this is probably some NFS hiccup. Maybe if you change your NFS
> server to communicate over TCP instead of UDP it will be more reliable
> in the face of simultaneous fcntl lock requests?  That's just a guess.

