sync in ncmpi_end_indep_data()
Rob Ross
rross at mcs.anl.gov
Wed Oct 3 13:21:59 CDT 2007
Hi Wei-keng,
I think the main thing to take out of this discussion is that we need to
specify, document, and implement a specific set of consistency semantics
for PnetCDF. I think the semantics we have now are OK, perhaps a little
too strict in this case, but either way we need to get this on "paper."
This issue of making assumptions about the file system underneath the
MPI-IO implementation is one that I am going to continue to disagree
with you about. The MPI-IO spec guarantees what it guarantees, and I am
firmly against making additional assumptions about what is and isn't
consistent based on what we think we know about the file system and
MPI-IO implementation underneath.
But I don't think that is too big a deal. We can define the PnetCDF
semantics with the MPI-IO semantics in mind, and I'm sure we'll end up
with something usable and performant. The netCDF data/define mode split
makes that pretty easy.
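For concreteness, here is a minimal sketch of the mode boundaries we keep
referring to (the file and variable names are made up and error checking
is omitted); the sync under discussion sits inside ncmpi_end_indep_data():

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int rank, nprocs, ncid, dimid, varid;
        MPI_Offset start[1], count[1];
        double buf[10] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* define mode: collective metadata calls */
        ncmpi_create(MPI_COMM_WORLD, "test.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", (MPI_Offset)10 * nprocs, &dimid);
        ncmpi_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);               /* enter collective data mode */

        /* independent data mode: each process does its own I/O */
        ncmpi_begin_indep_data(ncid);
        start[0] = rank * 10;  count[0] = 10;
        ncmpi_put_vara_double(ncid, varid, start, count, buf);
        ncmpi_end_indep_data(ncid);       /* the MPI_File_sync() lives here */

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }
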
Regards,
Rob
Wei-keng Liao wrote:
> Rob,
>
> This discussion leads to the issue of what consistency semantics pnetcdf
> should support. The fsync enforces strict consistency when switching
> between collective and independent modes. (This is the only consistency
> guarantee defined in pnetcdf that I can find.) However, does pnetcdf also
> enforce data consistency within collective or independent mode? For
> example, will a collective read following a collective write always see
> the written data? MPI-IO does not guarantee this consistency unless
> atomicity is enabled. Similarly, what consistency will be enforced in
> independent mode? This makes me wonder what consistency semantics HDF5
> supports and how it implements them.
>
> The purpose of calling fsync is to flush out the cached data, and we have
> been talking about two levels of data caching. One is the underlying
> client-side file system cache and the other is caching in the MPI-IO or
> pnetcdf library. For file system caching, since some file systems, like
> Lustre and GPFS, implement a coherent cache themselves, and PVFS does not
> do caching at all, fsync is not necessary when switching modes in
> pnetcdf. Of course, this holds only when MPI-IO and pnetcdf do no caching
> of their own. As far as I understand, the ROMIO and pnetcdf libraries do
> not do caching yet, and hence fsync is only needed on file systems
> without a coherent cache.
>
> If MPI-IO or pnetcdf incorporates caching in the future, things get more
> complicated. On file systems that keep a coherent cache or do no caching,
> flushing the library-level cache should only involve writing the
> library's cached data, not calling fsync. But fsync still cannot be
> avoided on file systems with an incoherent cache.
>
> Placing an fsync in pnetcdf looks to me as if pnetcdf always assumes a
> file system with a non-coherent cache. (That is why I said blindly ...)
> If pnetcdf's consistency semantics completely followed MPI-IO's and
> provided user hints to enable different levels of consistency, the fsync
> would not be necessary all the time. For performance, this would be a
> better choice, especially when a user knows the access patterns and the
> underlying file system. Otherwise, pnetcdf will always suffer the
> sync-to-disk overhead.
>
> Wei-keng
>
>
>
> On Mon, 1 Oct 2007, Rob Ross wrote:
>> Hi Wei-keng,
>>
>> Using the sync at the termination of indep_data mode has no relation to
>> enabling MPI atomic mode. Enabling MPI atomic mode says something about
>> the coherence of views after *each* operation and also something about
>> conflicting accesses. Syncing at the end of indep_data mode only ensures
>> that data has been pushed out of local caches and to the file system
>> (and unfortunately also to disk).
>>
>> Enabling atomic mode could be *very* expensive, especially for anyone
>> performing multiple indep. I/O calls. Perhaps not more expensive than
>> the file sync, depending on the underlying FS.
>>
>> I strongly disagree with your assertion that we've "enforced the file
>> sync blindly without considering the underlying file systems." We've
>> enforced the file sync on purpose. We have chosen not to consider the
>> underlying file system, also on purpose, because we're trying to be
>> portable and not to have a bunch of per-FS cases in our portable library
>> that try to guess at additional guarantees from the underlying file
>> system.
>>
>> As RobL mentioned, the reason we put this in place was to guarantee the
>> end of an I/O sequence, so that subsequent accesses would see a
>> consistent view. We've had numerous discussions internally about the
>> challenges of maintaining consistent views with different possible
>> optimizations within the MPI-IO implementation (e.g. incoherent caches).
>> This was simply a way to provide a more consistent view of the variables
>> so that users would tend not to see valid but confusing results from the
>> MPI-IO implementation.
>>
>> All this said, and disagreements aside about whether we were thinking
>> when we did this or not, I do agree that MPI_File_sync() is expensive.
>> The problem is that there are only two ways to tell the MPI-IO
>> implementation that you'd like a coherent view between processes: (1)
>> call MPI_File_sync() (2) re-open the file (MPI_File_close() +
>> MPI_File_open())
>>
>> In (1), you get the "write to disk" as a side-effect, which as you say
>> is expensive. In (2), well, that can be extremely expensive as well
>> because for most systems it produces a lot of namespace traffic.
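>>
>> For (1), the recipe the MPI standard gives looks roughly like this (just
>> a sketch; fh comes from a single collective MPI_File_open over
>> MPI_COMM_WORLD, and offset, buf, and n are placeholders):
>>
>>     #include <mpi.h>
>>
>>     /* End one write sequence and begin a new one so that data written
>>      * by one process becomes visible to the others (no atomic mode). */
>>     void hand_off(MPI_File fh, MPI_Offset offset, double *buf, int n,
>>                   int am_writer)
>>     {
>>         if (am_writer)
>>             MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE,
>>                               MPI_STATUS_IGNORE);
>>         MPI_File_sync(fh);              /* flush this process's writes */
>>         MPI_Barrier(MPI_COMM_WORLD);    /* order writers before readers */
>>         MPI_File_sync(fh);              /* start a new access sequence  */
>>         if (!am_writer)
>>             MPI_File_read_at(fh, offset, buf, n, MPI_DOUBLE,
>>                              MPI_STATUS_IGNORE);
>>     }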
>>
>> So, what do we do? Certainly we don't need this call in the read-only
>> case (as RobL mentioned), so we could drop it for that case without any
>> further discussion. If we still think that generating these consistent
>> views at that point is a good idea, we could try replacing it with a
>> re-open; that would be faster in some places and slower in others. Or we
>> could get rid of it altogether; it would mean that users are more likely
>> to need to synchronize explicitly, but maybe that's just fine. I'm having
>> a hard time reconstructing the specifics of the argument for the call, so
>> I lean that way (we should have commented the call in the code with the
>> argument).
>>
>> Regards,
>>
>> Rob
>>
>> Wei-keng Liao wrote:
>>> Since pnetcdf is built on top of MPI-IO, it follows the MPI-IO
>>> consistency semantics. So, to answer your question, the same thing will
>>> happen in a pure MPI-IO program when an independent write is followed by
>>> a collective read.
>>>
>>> MPI-IO defines its consistency semantics as "sequential consistency
>>> among all accesses using file handles created from a single collective
>>> open with atomic mode enabled". So, what happens when a collective write
>>> is followed by a collective read? Will the collective read always read
>>> the data written by the collective write if atomic mode is disabled? I
>>> don't think MPI-IO guarantees this.
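>>>
>>> For reference, atomic mode is requested with one collective call on the
>>> file handle (fh below is just a placeholder):
>>>
>>>     int flag;
>>>     MPI_File_set_atomicity(fh, 1);      /* request sequential consistency */
>>>     MPI_File_get_atomicity(fh, &flag);  /* query the current setting      */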
>>>
>>> Using sync in ncmpi_end_indep_data() is equivalent to enabling MPI atomic
>>> mode only in ncmpi_end_indep_data(), but not anywhere else. This is strange.
>>>
>>> I don't think it is wise to enforce the file sync blindly without
>>> considering the underlying file system. For example, on PVFS, where no
>>> client-side caching is done, a file sync is completely unnecessary. On
>>> Lustre and GPFS, which have their own consistency control and are POSIX
>>> compliant, a file sync is also unnecessary. On NFS, where consistency is
>>> an issue, ROMIO already uses byte-range locks to disable the caching.
>>> So, I don't see why we still need the sync in pnetcdf. If we want
>>> pnetcdf to have stricter semantics, we can just enable MPI-IO atomic
>>> mode.
>>>
>>> On some file systems, a file sync will not only flush data to the
>>> servers but also to disk before the call returns. It is a very expensive
>>> operation. For performance reasons, I suggest we leave stricter
>>> consistency as an option for users and keep the relaxed semantics as the
>>> default.
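>>>
>>> One way such an option could be exposed (just a sketch; the hint name
>>> "pnetcdf_relaxed_consistency" is made up here, not an existing hint):
>>>
>>>     int ncid;
>>>     MPI_Info info;
>>>     MPI_Info_create(&info);
>>>     MPI_Info_set(info, "pnetcdf_relaxed_consistency", "enable");
>>>     ncmpi_open(MPI_COMM_WORLD, "test.nc", NC_WRITE, info, &ncid);
>>>     MPI_Info_free(&info);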
>>>
>>> Our earlier discussion was under the assumption that pnetcdf may have
>>> its own caching layer in the future, and that the sync in
>>> ncmpi_end_indep_data() would be needed for cache coherence at the
>>> pnetcdf level, not the file system's.
>>>
>>> On Thu, 20 Sep 2007, Robert Latham wrote:
>>>> On Thu, Sep 20, 2007 at 02:46:47PM -0500, Wei-keng Liao wrote:
>>>>> In file mpinetcdf.c, the function ncmpi_end_indep_data() calls
>>>>> MPI_File_sync(). I don't think this is necessary. Flushing dirty data
>>>>> may be needed if pnetcdf implemented a caching layer internally.
>>>>> However, that flush should move data from the pnetcdf caching layer
>>>>> (if one is implemented) to the file system, not from the application
>>>>> clients to the file servers (or disks), as MPI_File_sync() does.
>>>>>
>>>>> This file sync makes IOR performance bad for pnetcdf in independent
>>>>> data mode.
>>>> What if an independent write is followed by a collective read? I
>>>> think we covered this earlier in the year. The netCDF semantics for
>>>> this seem to be undefined. If so, then I guess the MPI_File_sync is
>>>> indeed unnecessary. The sync in there might be to enforce the
>>>> conclusion of a "write sequence" and to ensure that changes made to
>>>> the file by other processes are visible to this process.
>>>>
>>>> pnetcdf could be clever and disable that sync when a file is opened
>>>> NC_NOWRITE.
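>>>>
>>>> A minimal sketch of that check (the names is_read_only and mpi_fh are
>>>> placeholders; I have not looked at how the library actually stores the
>>>> open mode and file handle):
>>>>
>>>>     /* skip the expensive flush when the file was opened read-only,
>>>>      * since no process can have dirty data to push out */
>>>>     if (!is_read_only)
>>>>         MPI_File_sync(mpi_fh);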
>>
>