[mpich-discuss] MPI_File_get_size error message

Rob Latham robl at mcs.anl.gov
Wed Nov 2 09:48:59 CDT 2011


On Tue, Nov 01, 2011 at 09:46:20AM -0700, Eugene Loh wrote:
> I appreciate the reply.  And, I get the general message.  Thanks.
> 
> Before I retire this issue completely, however, I wanted to clarify.
> The test in question has no "contention."  Each MPI process has its
> own file.  Each process opens its respective file, checks size,
> reads it, closes it.  No two processes are trying to access the same
> file.  Your comments still apply?

Hm.. for NFS, I'm not sure, but maybe?  There's only one server
dealing with fcntl operations.
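
For reference, what ROMIO does on NFS before each read or write is
essentially a plain POSIX record lock, along these lines (just an
illustration, not the actual ADIOI_Set_lock source; the constants are
the ones from your error message):

    #include <sys/types.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    /* Take a blocking record lock on [offset, offset+len) of fd. */
    static int lock_region(int fd, short type, off_t offset, off_t len)
    {
        struct flock lk;
        lk.l_type   = type;      /* F_RDLCK before a read, F_WRLCK before a write */
        lk.l_whence = SEEK_SET;  /* the "whence 0" in the error message */
        lk.l_start  = offset;    /* "offset 0" */
        lk.l_len    = len;       /* "length 1" */

        /* F_SETLKW waits until the NFS server grants the lock; a -1
         * return with errno 5 (EIO) is what produced your message. */
        if (fcntl(fd, F_SETLKW, &lk) == -1) {
            fprintf(stderr, "lock failed: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }

Every one of those lock requests has to be resolved by that single
server.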

Since you are using one file per process, you are in luck!

Prefix your file name with 'ufs:' (perhaps we should do this
automatically when the communicator is MPI_COMM_SELF?) and you will
avoid many of the locking calls.  As long as you do not need shared
files, we can skip the extra effort ROMIO takes to make NFS work
better.
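
Something like this minimal sketch (the per-process file name is just
a placeholder; substitute whatever your test actually opens):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        char path[64];
        MPI_File fh;
        MPI_Offset size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The "ufs:" prefix makes ROMIO treat this as a plain Unix
         * file and skip the NFS-specific fcntl locking. */
        snprintf(path, sizeof(path), "ufs:mydata.%d", rank);

        MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_get_size(fh, &size);
        /* ... MPI_File_read / process the data as before ... */
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }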

==rob

> On 11/1/2011 9:21 AM, Rob Latham wrote:
> >On Thu, Oct 27, 2011 at 11:12:01PM -0700, Eugene Loh wrote:
> >>File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type
> >>F_RDLCK/0,whence 0) with return value
> >>FFFFFFFF and errno 5.
> >>- If the file system is NFS, you need to use NFS version 3, ensure
> >>that the lockd daemon is running
> >>on all the machines, and mount the directory with the 'noac' option
> >>(no attribute caching).
> >>- If the file system is LUSTRE, ensure that the directory is mounted
> >>with the 'flock' option.
> >>ADIOI_Set_lock:: Input/output error
> >>ADIOI_Set_lock:offset 0, length 1
> >>
> >>If I take the error message at face value, I should check (in my
> >>case) NFS.  It's NFSv3 and it appears lockd is running.  I'm not
> >>real sure if noac is set, but I suspect it is not.  But is that
> >>really the problem here?  If I look at ADIOI_Set_lock, a fcntl()
> >>failed.  Is that necessarily an indication of the NFS/Lustre
> >>conditions discussed in the error message?  Incidentally, errno 5
> >>appears to be EIO, though I don't know if that's any help.
> >NFS is an awful file system for parallel access, primarily because
> >NFS clients are allowed to cache on the client side for arbitrarily
> >long amounts of
> >time.  Turning off attribute caching helps a bit.  The only thing
> >we've found that even remotely works is to fcntl() lock the file
> >before every I/O operation (both reads and writes).
> >
> >Since you get this error message only occasionally, I'm not sure what
> >advice to offer you.  None of us spend any time on NFS: it's offered
> >only as a convenience. Maybe in 2011 we should not even offer it.
> >PVFS and Lustre are both freely available parallel file systems.
> >>Anyhow, regardless of whether noac is set or not, that setting is
> >>never changed and yet the test usually passes for us and only
> >>occasionally fails.
> >>
> >>Could the real issue be some other NFS hiccup, with the
> >>NFSv3/lockd/noac verbiage being a red herring?  Any other
> >>help/suggestions?
> >Yes, this is probably some NFS hiccup. Maybe if you change your NFS
> >server to communicate over TCP instead of UDP it will be more reliable
> >in the face of simultaneous fcntl lock requests?  That's just a guess.

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

