PnetCDF, ROMIO & "noac" setting

Latham, Robert J. robl at mcs.anl.gov
Tue Apr 17 09:46:38 CDT 2018


On Tue, 2018-04-17 at 09:33 -0500, Carl Ponder wrote:
> On 04/17/2018 08:04 AM, Carl Ponder wrote:
> > > my sysadmins tell me that NFS is all that we're ever going to
> > > have on our cluster.
> > > I can live with substandard I/O performance, I just want the
> > > various I/O libraries to give the right data.
> > > We can try following these ROMIO instructions, but can you answer
> > > some questions about it?
> > > 1) It says that ROMIO is used in MPICH. Just to confirm, do
> > > MVAPICH2 & OpenMPI & Intel/MPI all use ROMIO?
> > >     And even if not, would the instructions still be relevant to
> > > these other MPIs?
> > > 2) It talks about NFS version 3. It looks like we have a
> > > combination of 3 & 4 on our system.
> > >      Are the instructions relevant to NFS version 4?
> >  
> > > 3) Given the reservations about performance, I suppose I could
> > > ask for a special directory to be created & mounted for
> > > "coherent" operations, and leave the other mounts as-is.
> > >     Do you see any problems with doing this?
>  On 04/17/2018 08:15 AM, Latham, Robert J. wrote: 
> > > I would do the following:
> > > 
> > > NFS is not a parallel file system.  Read-only from NFS might be
> > > OK.  Writing is trickier because NFS implements bizarre caching
> > > rules.  Sometimes the clients hold onto data for a long time,
> > > even after you ask every way possible to please write this data
> > > out to disk.
>  Robert -- You mean that setting "noac" doesn't help, or that it
> isn't a full fix? 

It's a start, but we've never been able to come up with a 100%
bulletproof way to make NFS work with MPI-IO consistency semantics.
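
For reference, "noac" here is the NFS client mount option that turns
off attribute caching.  It is set at mount time, so an /etc/fstab
entry would look roughly like the line below (the server path, mount
point, and other options are just placeholders for illustration):

    nfs-server:/export/scratch  /scratch  nfs  rw,noac,vers=3  0 0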

'noac' and "close before reading" can help, but one of ROMIO's big
optimizations is to do "data sieving" -- similar to a RAID device's
read-modify-write, it converts non-contiguous accesses into a single
larger access.  But if two processes each do a read-modify-write and
the sieved regions happen to overlap, then even with the fcntl locking
we put around every operation we still get scrambled bits in some
cases.
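
If data sieving itself turns out to be the troublemaker, ROMIO also
exposes hints to switch it off for reads and writes.  A minimal sketch
of passing those hints at file-open time follows; the file name is a
placeholder and error checking is omitted for brevity:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* Ask ROMIO not to perform data sieving on this file. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_ds_read",  "disable");
        MPI_Info_set(info, "romio_ds_write", "disable");

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

        /* ... MPI-IO reads and writes ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }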

Writing a test case to demonstrate this problem was my first task at
Argonne, and I don't think anything's changed in 16 years.  It's a race
condition, and you will probably be ok 9 times out of 10.


> > > I would use collective I/O, but then tell the MPI-IO library to
> > > carry out all I/O from just one rank.  For ROMIO if you set the
> > > hint "cb_nodes" to "1", then all clients will use a single
> > > aggregator to write, and you can hopefully sidestep any caching
> > > problems.
> > > 
> > > You should also set "romio_cb_write" to "enable" to force all
> > > writes, even ones ROMIO might think are nice friendly large
> > > contiguous requests, to take the two-phase collective buffering
> > > path.
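
As a concrete sketch of wiring those two hints through PnetCDF (the
file name and cmode flags here are placeholders, and error checking is
omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int ncid;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* cb_nodes=1 funnels collective writes through one aggregator;
         * romio_cb_write=enable forces the two-phase collective path. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "1");
        MPI_Info_set(info, "romio_cb_write", "enable");

        ncmpi_create(MPI_COMM_WORLD, "output.nc",
                     NC_CLOBBER | NC_64BIT_DATA, info, &ncid);

        /* ... define dimensions/variables ... */
        ncmpi_enddef(ncid);
        /* ... write with the collective (_all) APIs ... */

        ncmpi_close(ncid);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }
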
>  I'll try to find a way to set these in the MPI environment modules:
> > export cb_nodes=1
> > export romio_cb_write=enable

Great.  You might want to create a "system hints" file:

http://www.mcs.anl.gov/projects/romio/2008/09/26/system-hints-hints-via-config-file/
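
A system hints file is just plain text with one "hint value" pair per
line, and ROMIO looks for it at the path named by the ROMIO_HINTS
environment variable.  Assuming a file at /etc/romio-hints, the setup
might look something like:

    $ cat /etc/romio-hints
    cb_nodes 1
    romio_cb_write enable

    $ export ROMIO_HINTS=/etc/romio-hints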

==rob


