sync in ncmpi_end_indep_data()

Rob Ross rross at mcs.anl.gov
Mon Oct 1 14:09:52 CDT 2007


Hi Wei-keng,

Using the sync at the termination of indep_data mode has no relation to 
enabling MPI atomic mode. Enabling MPI atomic mode says something about 
the coherence of views after *each* operation and also something about 
conflicting accesses. Syncing at the end of indep_data mode only ensures 
that data has been pushed out of local caches and to the file system 
(and unfortunately also to disk).

Enabling atomic mode could be *very* expensive, especially for anyone 
performing multiple independent I/O calls, though perhaps not more 
expensive than the file sync, depending on the underlying FS.
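
To make the distinction concrete, here's a rough plain-MPI-IO sketch
(not PnetCDF code; "out.dat", buf, count, and the loops are just
placeholders):

#include <mpi.h>

/* Sketch only: contrast per-operation atomic mode with one explicit sync
 * at the end of an independent phase.  "out.dat", buf, count, and the
 * loops are placeholders, not PnetCDF internals. */
void atomic_vs_sync(MPI_Comm comm, char *buf, int count)
{
    int rank;
    MPI_File fh;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "out.dat", MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    /* Option A: atomic mode.  Coherence is promised after *every*
     * read/write, so the implementation may pay for it on every call. */
    MPI_File_set_atomicity(fh, 1);
    for (int i = 0; i < 10; i++)
        MPI_File_write_at(fh, (MPI_Offset)(rank * 10 + i) * count, buf,
                          count, MPI_BYTE, MPI_STATUS_IGNORE);

    /* Option B: default (relaxed) semantics plus one flush at the end of
     * the phase, which is roughly what the sync at the end of indep_data
     * mode amounts to. */
    MPI_File_set_atomicity(fh, 0);
    for (int i = 0; i < 10; i++)
        MPI_File_write_at(fh, (MPI_Offset)(rank * 10 + i) * count, buf,
                          count, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_sync(fh);

    MPI_File_close(&fh);
}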

I strongly disagree with your assertion that we've "enforced the file 
sync blindly without considering the underlying file systems." We've 
enforced the file sync on purpose. We have chosen not to consider the 
underlying file system, also on purpose, because we're trying to be 
portable and to avoid a bunch of per-FS cases in the library that try to 
guess at additional guarantees from the underlying file system.

As RobL mentioned, the reason we put this in place was to guarantee the 
end of an I/O sequence, so that subsequent accesses would see a 
consistent view. We've had numerous discussions internally about the 
challenges of maintaining consistent views with different possible 
optimizations within the MPI-IO implementation (e.g. incoherent caches). 
This was simply a way to provide a more consistent view of the variables 
so that users would tend not to see valid but confusing results from the 
MPI-IO implementation.

All this said, and disagreements aside about whether we were thinking 
when we did this or not, I do agree that MPI_File_sync() is expensive. 
The problem is that there are only two ways to tell the MPI-IO 
implementation that you'd like a coherent view between processes:
(1) call MPI_File_sync()
(2) re-open the file (MPI_File_close() + MPI_File_open())

In (1), you get the "write to disk" as a side-effect, which as you say 
is expensive. In (2), well, that can be extremely expensive as well 
because for most systems it produces a lot of namespace traffic.
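
Roughly, the two options look like this in plain MPI-IO (a sketch only;
fh must come from a single collective open on comm, filename is a
placeholder, and atomic mode is assumed off, which is the default):

#include <mpi.h>

/* Sketch: the two ways to get a coherent view between processes when
 * atomic mode is off. */
void make_view_coherent(MPI_Comm comm, MPI_File *fh, const char *filename)
{
    /* Option (1): end the write sequence with sync-barrier-sync, so that
     * reads issued afterwards are guaranteed to see earlier writes. */
    MPI_File_sync(*fh);    /* writers: push out local changes            */
    MPI_Barrier(comm);     /* separate the write sequence from the reads */
    MPI_File_sync(*fh);    /* readers: pick up the new data              */

    /* Option (2): re-open for a fresh view.  Avoids the flush-to-disk
     * side effect but generates namespace traffic instead. */
    MPI_File_close(fh);
    MPI_File_open(comm, filename, MPI_MODE_RDWR, MPI_INFO_NULL, fh);
}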

So, what do we do? Certainly we don't need this call in the read-only 
case (as RobL mentioned), so we could drop it for that case without any 
further discussion. If we still think that generating these consistent 
views at that point is a good idea, we could try replacing it with a 
re-open; that would be faster in some places and slower in others. Or we could 
get rid of it altogether; it will mean that users are more likely to 
need to synchronize explicitly, but maybe that's just fine. I'm having a 
hard time reconstructing the specifics of the argument for the call, so 
I lean that way (we should have commented the call in the code with the 
argument).
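
If we keep the sync but drop it for read-only files, the change would be
something along these lines. This is only a sketch, not the actual
PnetCDF source: the struct and field names below are made up.

#include <mpi.h>

/* Placeholder type; the real PnetCDF "NC" structure looks different. */
struct sketch_NC {
    int      is_readonly;    /* set when the file was opened NC_NOWRITE */
    int      in_indep_mode;
    MPI_File independent_fh; /* handle used for independent I/O         */
};

/* Hypothetical version of ncmpi_end_indep_data(): skip the expensive
 * MPI_File_sync() when the dataset is read-only, since a reader has
 * nothing of its own to flush. */
int sketch_end_indep_data(struct sketch_NC *ncp)
{
    if (!ncp->is_readonly)
        MPI_File_sync(ncp->independent_fh); /* end the write sequence  */
    ncp->in_indep_mode = 0;                 /* back to collective mode */
    return 0;                               /* NC_NOERR in the real API */
}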

Regards,

Rob

Wei-keng Liao wrote:
> Since pnetcdf is built on top of MPI-IO, it follows the MPI-IO consistency 
> semantics. So, to answer your question, the same thing will happen in a pure 
> MPI-IO program when an independent write is followed by a collective read.
>  
> MPI-IO defines the consistency semantics as "sequential consistency among 
> all accesses using file handles created from a single collective open with 
> atomic mode enabled". So, what happens when a collective write is followed 
> by a collective read? Will the collective read always read the data 
> written by the collective write, if atomic mode is disabled? I don't think 
> MPI-IO will guarantee this.
> 
> Using sync in ncmpi_end_indep_data() is equivalent to enabling MPI atomic 
> mode only in ncmpi_end_indep_data(), but not anywhere else. This is 
> strange.
> 
> I don't think it is wise to enforce the file sync blindly without 
> considering the underlying file systems. For example, on PVFS where no 
> client-side caching is done, a file sync is completely unnecessary. On 
> Lustre and GPFS, which have their own consistency control and are POSIX 
> compliant, a file sync is also unnecessary. On NFS, where consistency is an 
> issue, ROMIO has already used byte-range locks to disable the caching. So, 
> I don't see why we still need sync in Pnetcdf. If we want pnetcdf to have 
> a stricter semantic, we can just enable MPI-IO atomic mode.
> 
> On some file systems, file sync will not only flush data to servers but 
> also to disks before the sync call returns. It is a very expensive 
> operation. For performance reasons, I suggest we should leave stricter 
> consistency as an option to the users and keep relaxed semantics as the 
> default.
> 
> Our earlier discussion was under the assumption that pnetcdf may have its 
> own caching layer in the future and that ncmpi_end_indep_data() would be 
> required for cache coherence at the pnetcdf level, not the file system's.
> 
> On Thu, 20 Sep 2007, Robert Latham wrote:
>> On Thu, Sep 20, 2007 at 02:46:47PM -0500, Wei-keng Liao wrote:
>>> In file mpinetcdf.c, function ncmpi_end_indep_data(), MPI_File_sync() is 
>>> called. I don't think this is necessary. Flushing dirty data may be needed 
>>> if pnetcdf implemented a caching layer internally. However, this file sync 
>>> should be flushing data from the pnetcdf caching layer (if it is implemented) 
>>> to the file system, not from application clients to the file servers (or disks) 
>>> as MPI_File_sync() will do.
>>>
>>> This file sync makes IOR performance bad for pnetcdf independent data 
>>> mode.
>> What if an independent write was followed by a collective read?  I
>> think we covered this earlier in the year.  The NetCDF semantic for
>> this seems to be undefined.  If so, then I guess the MPI_File_sync is
>> indeed unnecessary.  The sync in there might be to enforce conclusion
>> of a "write sequence" and ensure changes made to the file by other
>> processes are visible to that process.
>>
>> pnetcdf could be clever when a file is opened NC_NOWRITE and disable
>> that sync.



