sync in ncmpi_end_indep_data()

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Sep 20 16:45:08 CDT 2007


Since pnetcdf is built on top of MPI-IO, it follows the MPI-IO consistency 
semantics. So, to answer your question, the same thing will happen in a pure 
MPI-IO program when an independent write is followed by a collective read.
 
MPI-IO defines its consistency semantics as "sequential consistency among 
all accesses using file handles created from a single collective open with 
atomic mode enabled". So, what happens when a collective write is followed 
by a collective read? Will the collective read always see the data 
written by the collective write if atomic mode is disabled? I don't think 
MPI-IO guarantees this.

Using sync in ncmpi_end_indep_data() is equivalent to enabling MPI atomic 
mode only inside ncmpi_end_indep_data() and nowhere else. This is 
strange.

I don't think it is wise to enforce the file sync blindly without 
considering the underlying file system. For example, on PVFS, where no 
client-side caching is done, file sync is completely unnecessary. On 
Lustre and GPFS, which have their own consistency control and are POSIX 
compliant, file sync is also not necessary. On NFS, where consistency is an 
issue, ROMIO already uses byte-range locks to disable the caching. So 
I don't see why we still need the sync in pnetcdf. If we want pnetcdf to have 
stricter semantics, we can just enable MPI-IO atomic mode.

On some file systems, file sync not only flushes data to the servers but 
also to disk before the call returns, which makes it a very expensive 
operation. For performance reasons, I suggest we leave stricter 
consistency as an option for users and keep the relaxed semantics as the 
default.

Our earlier discussion was under the assumption that pnetcdf might have its 
own caching layer in the future, and that ncmpi_end_indep_data() would be 
required for cache coherence at the pnetcdf level, not the file system's.

Wei-keng


On Thu, 20 Sep 2007, Robert Latham wrote:
> On Thu, Sep 20, 2007 at 02:46:47PM -0500, Wei-keng Liao wrote:
> > In file mpinetcdf.c, function ncmpi_end_indep_data(), MPI_File_sync() is 
> > called. I don't think this is necessary. Flushing dirty data may be needed 
> > if pnetcdf implemented a caching layer internally. However, that file sync 
> > should be flushing data from the pnetcdf caching layer (if it is implemented) 
> > to the file system, not from application clients to the file servers (or 
> > disks), as MPI_File_sync() does.
> > 
> > This file sync makes IOR performance bad for pnetcdf independent data 
> > mode.
> 
> What if an independent write was followed by a collective read?  I
> think we covered this earlier in the year.  The NetCDF semantic for
> this seems to be undefined.  If so, then I guess the MPI_File_sync is
> indeed unnecessary.  The sync in there might be to enforce conclusion
> of a "write sequence" and ensure changes made to the file by other
> processes are visible to that process.
> 
> pnetcdf could be clever when a file is opened NC_NOWRITE and disable
> that sync.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
> 