sync in ncmpi_end_indep_data()
Wei-keng Liao
wkliao at ece.northwestern.edu
Tue Oct 2 10:01:28 CDT 2007
Rob,
This discussion leads to the question of what consistency semantics pnetcdf
should support. Fsync provides strict consistency when switching between
collective and independent modes. (This is the only consistency guarantee
defined in pnetcdf that I can find.) However, does pnetcdf also enforce data
consistency within collective or independent mode? For example, will a
collective read following a collective write always see the written data?
MPI-IO does not guarantee this consistency unless atomicity is enabled.
Similarly, what consistency is enforced in independent mode? This makes me
wonder what consistency semantics HDF5 supports and how it implements them.
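To make the write-then-read question concrete, here is a minimal pure MPI-IO
sketch (the file name, counts, and offsets are just placeholders); without
the MPI_File_set_atomicity() call, the collective read below is not
guaranteed to see the data written by another rank:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File   fh;
        MPI_Status st;
        int        i, rank, nprocs;
        double     buf[10], chk[10];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        for (i = 0; i < 10; i++) buf[i] = rank;

        MPI_File_open(MPI_COMM_WORLD, "test.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* sequential consistency among these handles needs atomic mode */
        MPI_File_set_atomicity(fh, 1);

        /* each rank writes its own block ... */
        MPI_Offset off = (MPI_Offset)rank * 10 * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, 10, MPI_DOUBLE, &st);

        /* ... the barrier orders all writes before all reads ... */
        MPI_Barrier(MPI_COMM_WORLD);

        /* ... then each rank reads the block written by the next rank */
        off = (MPI_Offset)((rank + 1) % nprocs) * 10 * sizeof(double);
        MPI_File_read_at_all(fh, off, chk, 10, MPI_DOUBLE, &st);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }
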
The purpose of calling fsync is to flush out cached data, and we have been
talking about two levels of data caching. One is the underlying client-side
file system cache, and the other is caching in the MPI-IO or pnetcdf
library. For file system caching, since some file systems, like Lustre and
GPFS, implement a coherent cache themselves, and PVFS does no caching at
all, fsync is not necessary when switching modes in pnetcdf. Of course,
this holds only when MPI-IO and pnetcdf do no caching of their own. As far
as I understand, the ROMIO and pnetcdf libraries do not do caching yet, and
hence fsync is only needed on file systems without a coherent cache.
If MPI-IO and pnetcdf incorporate caching in the future, things can become
complicated. On file systems that keep a coherent cache or do no caching,
flushing the library-level cache should only involve writing the library's
cached data, not calling fsync. But fsync still cannot be avoided on file
systems with an incoherent cache.
Placing an fsync in pnetcdf looks to me as if pnetcdf always assumes a file
system with non-coherent caching. (That is why I said blindly ...) If
pnetcdf's consistency semantics completely followed MPI-IO's and provided
user hints to enable different levels of consistency, fsync would not be
necessary all the time. For performance reasons, this would be the better
choice, especially when a user knows the access patterns and the underlying
file system. Otherwise, pnetcdf will always suffer the sync-to-disk
overhead.
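To illustrate the hint idea, a user who knows the file system keeps a
coherent cache could pass something like the following at open time. The
hint name "pnetcdf_coherent_fs" and the helper are purely hypothetical;
pnetcdf defines no such hint today:

    #include <mpi.h>
    #include <pnetcdf.h>

    /* A sketch only, not existing pnetcdf behavior: the hint would let
     * the library skip the MPI_File_sync() when switching data modes. */
    int open_with_relaxed_sync(const char *path, int *ncidp)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* hypothetical hint: "the FS cache is coherent, do not fsync" */
        MPI_Info_set(info, "pnetcdf_coherent_fs", "true");
        int err = ncmpi_open(MPI_COMM_WORLD, path, NC_WRITE, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }
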
Wei-keng
On Mon, 1 Oct 2007, Rob Ross wrote:
> Hi Wei-keng,
>
> Using the sync at the termination of indep_data mode has no relation to
> enabling MPI atomic mode. Enabling MPI atomic mode says something about
> the coherence of views after *each* operation and also something about
> conflicting accesses. Syncing at the end of indep_data mode only ensures
> that data has been pushed out of local caches and to the file system
> (and unfortunately also to disk).
>
> Enabling atomic mode could be *very* expensive, especially for anyone
> performing multiple indep. I/O calls. Perhaps not more expensive than
> the file sync, depending on the underlying FS.
>
> I strongly disagree with your assertion that we've "enforced the file
> sync blindly without considering the underlying file systems." We've
> enforced the file sync on purpose. We have chosen not to consider the
> underlying file system, also on purpose, because we're trying to be
> portable and not to have a bunch of per-FS cases in our portable library
> that try to guess at additional guarantees from the underlying file
> system.
>
> As RobL mentioned, the reason we put this in place was to guarantee the
> end of an I/O sequence, so that subsequent accesses would see a
> consistent view. We've had numerous discussions internally about the
> challenges of maintaining consistent views with different possible
> optimizations within the MPI-IO implementation (e.g. incoherent caches).
> This was simply a way to provide a more consistent view of the variables,
> so that users would be less likely to see results that are valid under
> MPI-IO semantics but nonetheless confusing.
>
> All this said, and disagreements aside about whether we were thinking
> when we did this or not, I do agree that MPI_File_sync() is expensive.
> The problem is that there are only two ways to tell the MPI-IO
> implementation that you'd like a coherent view between processes: (1)
> call MPI_File_sync(), or (2) re-open the file (MPI_File_close() +
> MPI_File_open()).
>
> In (1), you get the "write to disk" as a side-effect, which as you say
> is expensive. In (2), well, that can be extremely expensive as well
> because for most systems it produces a lot of namespace traffic.
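>
> For reference, option (1) as the standard describes it is really a
> sync-barrier-sync sequence; fh, off, n, and the buffers below are just
> placeholders:
>
>     /* all ranks share fh from a single collective open */
>     MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE, &st); /* writers */
>     MPI_File_sync(fh);            /* collective: ends the write sequence  */
>     MPI_Barrier(MPI_COMM_WORLD);  /* orders the writes before the reads   */
>     MPI_File_sync(fh);            /* collective: starts the read sequence */
>     MPI_File_read_at(fh, off, chk, n, MPI_DOUBLE, &st);  /* readers */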
>
> So, what do we do? Certainly we don't need this call in the read-only
> case (as RobL mentioned), so we could drop it for that case without any
> further discussion. If we still think that generating these consistent
> views at that point is a good idea, we could try replacing it with a
> re-open; that would be faster in some places and slower in others. Or we
> could get rid of it altogether; that would mean users are more likely to
> need to synchronize explicitly, but maybe that's just fine. I'm having a
> hard time reconstructing the specifics of the argument for the call, so I
> lean toward removing it (we should have commented the call in the code
> with the argument).
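>
> (If we drop the call, the explicit route already exists: a user who
> needs the flush can ask for it through pnetcdf's own sync call, e.g.
>
>     ncmpi_end_indep_data(ncid);
>     ncmpi_sync(ncid);   /* caller opts into the flush when needed */
>
> where ncid is just the handle from ncmpi_open.)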
>
> Regards,
>
> Rob
>
> Wei-keng Liao wrote:
> > Since pnetcdf is built on top of MPI-IO, it follows the MPI-IO consistency
> > semantics. So, to answer your question, the same thing will happen in a pure
> > MPI-IO program when an independent write is followed by a collective read.
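> >
> > In pnetcdf terms the scenario is the following (the IDs and the
> > start/count arrays are placeholders):
> >
> >     ncmpi_begin_indep_data(ncid);
> >     ncmpi_put_vara_double(ncid, varid, start, count, buf);     /* indep write */
> >     ncmpi_end_indep_data(ncid);       /* <-- the MPI_File_sync() in question */
> >     ncmpi_get_vara_double_all(ncid, varid, start, count, chk); /* coll read */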
> >
> > MPI-IO defines its consistency semantics as "sequential consistency among
> > all accesses using file handles created from a single collective open with
> > atomic mode enabled". So, what happens when a collective write is followed
> > by a collective read? Will the collective read always read the data written
> > by the collective write if atomic mode is disabled? I don't think MPI-IO
> > guarantees this.
> >
> > Using sync in ncmpi_end_indep_data() is equivalent to enabling MPI atomic
> > mode only in ncmpi_end_indep_data(), but not anywhere else. This is strange.
> >
> > I don't think it is wise to enforce the file sync blindly without
> > considering the underlying file systems. For example, on PVFS, where no
> > client-side caching is done, file sync is completely unnecessary. On
> > Lustre and GPFS, which have their own consistency control and are POSIX
> > compliant, file sync is also unnecessary. On NFS, where consistency is an
> > issue, ROMIO already uses byte-range locks to disable the caching. So, I
> > don't see why we still need the sync in pnetcdf. If we want pnetcdf to
> > have stricter semantics, we can just enable MPI-IO atomic mode.
> >
> > On some file systems, file sync will not only flush data to the servers
> > but also to disks before the sync call returns. It is a very expensive
> > operation. For performance reasons, I suggest we leave stricter consistency
> > as an option to the users and keep relaxed semantics as the default.
> >
> > Our earlier discussion was under the assumption that pnetcdf may have its
> > own caching layer in the future and that ncmpi_end_indep_data() would be
> > required for cache coherence at the pnetcdf level, not the file system's.
> >
> > On Thu, 20 Sep 2007, Robert Latham wrote:
> > > On Thu, Sep 20, 2007 at 02:46:47PM -0500, Wei-keng Liao wrote:
> > > > In file mpinetcdf.c, function ncmpi_end_indep_data(), MPI_File_sync() is
> > > > called. I don't think this is necessary. Flushing dirty data may be
> > > > needed if pnetcdf implemented a caching layer internally. However, this
> > > > file sync should be flushing data from the pnetcdf caching layer (if it
> > > > is implemented) to the file system, not from the application clients to
> > > > the file servers (or disks), as MPI_File_sync() does.
> > > >
> > > > This file sync makes IOR performance poor in pnetcdf independent data
> > > > mode.
> > > What if an independent write was followed by a collective read? I
> > > think we covered this earlier in the year. The NetCDF semantics for
> > > this seem to be undefined. If so, then I guess the MPI_File_sync is
> > > indeed unnecessary. The sync in there might be to enforce the conclusion
> > > of a "write sequence" and to ensure that changes made to the file by
> > > other processes are visible to that process.
> > >
> > > pnetcdf could be clever when a file is opened NC_NOWRITE and disable
> > > that sync.
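> > >
> > > A sketch of that guard (the "writable" flag and the helper are
> > > hypothetical, not actual pnetcdf internals; whether the file is
> > > writable would come from the omode passed to ncmpi_open):
> > >
> > >     /* skip the expensive sync when the file cannot have dirty data */
> > >     static int end_indep_sync(MPI_File fh, int writable)
> > >     {
> > >         if (!writable)             /* opened NC_NOWRITE: nothing to flush */
> > >             return MPI_SUCCESS;
> > >         return MPI_File_sync(fh);  /* flush before leaving indep mode */
> > >     }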
>
>