sync in ncmpi_end_indep_data()
Rob Ross
rross at mcs.anl.gov
Wed Oct 3 13:21:59 CDT 2007
Hi Wei-keng,
I think the main thing to take out of this discussion is that we need to
specify, document, and implement a specific set of consistency semantics
for PnetCDF. I think the semantics we have now are OK, perhaps a little
too strict in this case, but either way we need to get this on "paper."
This issue of making assumptions about the file system underneath the
MPI-IO implementation is one that I am going to continue to disagree
with you about. The MPI-IO spec guarantees what it guarantees, and I am
firmly against making additional assumptions about what is and isn't
consistent based on what we think we know about the file system and
MPI-IO implementation underneath.
But I don't think that is too big a deal. We can define the PnetCDF
semantics with the MPI-IO semantics in mind, and I'm sure we'll end up
with something usable and performant. The netCDF data/define mode split
makes that pretty easy.
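For concreteness, here is a minimal sketch of the mode boundaries we keep
referring to (the file and variable names are made up and error checking
is omitted); the sync under discussion sits inside ncmpi_end_indep_data():

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int rank, nprocs, ncid, dimid, varid;
        MPI_Offset start[1], count[1];
        double buf[10] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* define mode: collective metadata calls */
        ncmpi_create(MPI_COMM_WORLD, "test.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", (MPI_Offset)10 * nprocs, &dimid);
        ncmpi_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);               /* enter collective data mode */

        /* independent data mode: each process does its own I/O */
        ncmpi_begin_indep_data(ncid);
        start[0] = rank * 10;  count[0] = 10;
        ncmpi_put_vara_double(ncid, varid, start, count, buf);
        ncmpi_end_indep_data(ncid);       /* the MPI_File_sync() lives here */

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }
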
Regards,
Rob
Wei-keng Liao wrote:
> Rob,
>
> This discussion leads to the issue of what consistency semantics pnetcdf
> should support. The fsync enforces strict consistency when switching
> between collective and independent modes. (This is the only consistency
> guarantee defined in pnetcdf that I can find.) However, does pnetcdf also
> enforce data consistency within collective or independent mode? For
> example, will a collective read following a collective write always see
> the written data? MPI-IO does not guarantee this consistency unless
> atomicity is enabled. Similarly, what consistency will be enforced in
> independent mode? This makes me wonder what consistency semantics HDF5
> supports and how it implements them.
>
> The purpose of calling fsync is to flush out the cached data, and we have
> been talking about two levels of data caching. One is the underlying
> client-side file system cache and the other is caching in the MPI-IO or
> pnetcdf library. For file system caching, since some file systems, like
> Lustre and GPFS, implement a coherent cache themselves, and PVFS does not
> do caching at all, fsync is not necessary when switching modes in
> pnetcdf. Of course, this holds only when MPI-IO and pnetcdf do no caching
> of their own. As far as I understand, the ROMIO and pnetcdf libraries do
> not do caching yet, and hence fsync is only needed on file systems
> without a coherent cache.
>
> If MPI-IO or pnetcdf incorporates caching in the future, things get more
> complicated. On file systems that keep a coherent cache or do no caching,
> flushing the library-level cache should only involve writing the
> library's cached data, not calling fsync. But fsync still cannot be
> avoided on file systems with an incoherent cache.
>
> Placing an fsync in pnetcdf looks to me as if pnetcdf always assumes a
> file system with a non-coherent cache. (That is why I said blindly ...)
> If pnetcdf's consistency semantics completely followed MPI-IO's and
> provided user hints to enable different levels of consistency, the fsync
> would not be necessary all the time. For performance, this would be a
> better choice, especially when a user knows the access patterns and the
> underlying file system. Otherwise, pnetcdf will always suffer the
> sync-to-disk overhead.
>
> Wei-keng
>
>
>
> On Mon, 1 Oct 2007, Rob Ross wrote:
>> Hi Wei-keng,
>>
>> Using the sync at the termination of indep_data mode has no relation to
>> enabling MPI atomic mode. Enabling MPI atomic mode says something about
>> the coherence of views after *each* operation and also something about
>> conflicting accesses. Syncing at the end of indep_data mode only ensures
>> that data has been pushed out of local caches and to the file system
>> (and unfortunately also to disk).
>>
>> Enabling atomic mode could be *very* expensive, especially for anyone
>> performing multiple indep. I/O calls. Perhaps not more expensive than
>> the file sync, depending on the underlying FS.
>>
>> I strongly disagree with your assertion that we've "enforced the file
>> sync blindly without considering the underlying file systems." We've
>> enforced the file sync on purpose. We have chosen not to consider the
>> underlying file system, also on purpose, because we're trying to be
>> portable and not to have a bunch of per-FS cases in our portable library
>> that try to guess at additional guarantees from the underlying file
>> system.
>>
>> As RobL mentioned, the reason we put this in place was to guarantee the
>> end of an I/O sequence, so that subsequent accesses would see a
>> consistent view. We've had numerous discussions internally about the
>> challenges of maintaining consistent views with different possible
>> optimizations within the MPI-IO implementation (e.g. incoherent caches).
>> This was simply a way to provide a more consistent view of the variables
>> so that users would tend not to see valid but confusing results from the
>> MPI-IO implementation.
>>
>> All this said, and disagreements aside about whether we were thinking
>> when we did this or not, I do agree that MPI_File_sync() is expensive.
>> The problem is that there are only two ways to tell the MPI-IO
>> implementation that you'd like a coherent view between processes: (1)
>> call MPI_File_sync() (2) re-open the file (MPI_File_close() +
>> MPI_File_open())
>>
>> In (1), you get the "write to disk" as a side-effect, which as you say
>> is expensive. In (2), well, that can be extremely expensive as well
>> because for most systems it produces a lot of namespace traffic.
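>>
>> For (1), the recipe the MPI standard gives looks roughly like this (just
>> a sketch; fh comes from a single collective MPI_File_open over
>> MPI_COMM_WORLD, and offset, buf, and n are placeholders):
>>
>>     #include <mpi.h>
>>
>>     /* End one write sequence and begin a new one so that data written
>>      * by one process becomes visible to the others (no atomic mode). */
>>     void hand_off(MPI_File fh, MPI_Offset offset, double *buf, int n,
>>                   int am_writer)
>>     {
>>         if (am_writer)
>>             MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE,
>>                               MPI_STATUS_IGNORE);
>>         MPI_File_sync(fh);              /* flush this process's writes */
>>         MPI_Barrier(MPI_COMM_WORLD);    /* order writers before readers */
>>         MPI_File_sync(fh);              /* start a new access sequence  */
>>         if (!am_writer)
>>             MPI_File_read_at(fh, offset, buf, n, MPI_DOUBLE,
>>                              MPI_STATUS_IGNORE);
>>     }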
>>
>> So, what do we do? Certainly we don't need this call in the read-only
>> case (as RobL mentioned), so we could drop it for that case without any
>> further discussion. If we still think that generating these consistent
>> views at that point is a good idea, we could try replacing it with a
>> re-open; that would be faster in some places and slower in others. Or we
>> could get rid of it altogether; it would mean that users are more likely
>> to need to synchronize explicitly, but maybe that's just fine. I'm having
>> a hard time reconstructing the specifics of the argument for the call, so
>> I lean that way (we should have commented the call in the code with the
>> argument).
>>
>> Regards,
>>
>> Rob
>>
>> Wei-keng Liao wrote:
>>> Since pnetcdf is built on top of MPI-IO, it follows the MPI-IO
>>> consistency semantics. So, to answer your question, the same thing will
>>> happen in a pure MPI-IO program when an independent write is followed by
>>> a collective read.
>>>
>>> MPI-IO defines its consistency semantics as "sequential consistency
>>> among all accesses using file handles created from a single collective
>>> open with atomic mode enabled". So, what happens when a collective write
>>> is followed by a collective read? Will the collective read always read
>>> the data written by the collective write if atomic mode is disabled? I
>>> don't think MPI-IO guarantees this.
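>>>
>>> For reference, atomic mode is requested with one collective call on the
>>> file handle (fh below is just a placeholder):
>>>
>>>     int flag;
>>>     MPI_File_set_atomicity(fh, 1);      /* request sequential consistency */
>>>     MPI_File_get_atomicity(fh, &flag);  /* query the current setting      */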
>>>
>>> Using sync in ncmpi_end_indep_data() is equivalent to enabling MPI atomic
>>> mode only in ncmpi_end_indep_data(), but not anywhere else. This is strange.
>>>
>>> I don't think it is wise to enforce the file sync blindly without
>>> considering the underlying file system. For example, on PVFS, where no
>>> client-side caching is done, a file sync is completely unnecessary. On
>>> Lustre and GPFS, which have their own consistency control and are POSIX
>>> compliant, a file sync is also unnecessary. On NFS, where consistency is
>>> an issue, ROMIO already uses byte-range locks to disable the caching.
>>> So, I don't see why we still need the sync in pnetcdf. If we want
>>> pnetcdf to have stricter semantics, we can just enable MPI-IO atomic
>>> mode.
>>>
>>> On some file systems, a file sync will not only flush data to the
>>> servers but also to disk before the call returns. It is a very expensive
>>> operation. For performance reasons, I suggest we leave stricter
>>> consistency as an option for users and keep the relaxed semantics as the
>>> default.
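>>>
>>> One way such an option could be exposed (just a sketch; the hint name
>>> "pnetcdf_relaxed_consistency" is made up here, not an existing hint):
>>>
>>>     int ncid;
>>>     MPI_Info info;
>>>     MPI_Info_create(&info);
>>>     MPI_Info_set(info, "pnetcdf_relaxed_consistency", "enable");
>>>     ncmpi_open(MPI_COMM_WORLD, "test.nc", NC_WRITE, info, &ncid);
>>>     MPI_Info_free(&info);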
>>>
>>> Our earlier discussion was under the assumption that pnetcdf may have
>>> its own caching layer in the future, and that the sync in
>>> ncmpi_end_indep_data() would be needed for cache coherence at the
>>> pnetcdf level, not the file system's.
>>>
>>> On Thu, 20 Sep 2007, Robert Latham wrote:
>>>> On Thu, Sep 20, 2007 at 02:46:47PM -0500, Wei-keng Liao wrote:
>>>>> In file mpinetcdf.c, the function ncmpi_end_indep_data() calls
>>>>> MPI_File_sync(). I don't think this is necessary. Flushing dirty data
>>>>> may be needed if pnetcdf implemented a caching layer internally.
>>>>> However, that flush should move data from the pnetcdf caching layer
>>>>> (if one is implemented) to the file system, not from the application
>>>>> clients to the file servers (or disks), as MPI_File_sync() does.
>>>>>
>>>>> This file sync makes IOR performance bad for pnetcdf in independent
>>>>> data mode.
>>>> What if an independent write is followed by a collective read? I
>>>> think we covered this earlier in the year. The netCDF semantics for
>>>> this seem to be undefined. If so, then I guess the MPI_File_sync is
>>>> indeed unnecessary. The sync in there might be to enforce the
>>>> conclusion of a "write sequence" and to ensure that changes made to
>>>> the file by other processes are visible to this process.
>>>>
>>>> pnetcdf could be clever and disable that sync when a file is opened
>>>> NC_NOWRITE.
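>>>>
>>>> A minimal sketch of that check (the names is_read_only and mpi_fh are
>>>> placeholders; I have not looked at how the library actually stores the
>>>> open mode and file handle):
>>>>
>>>>     /* skip the expensive flush when the file was opened read-only,
>>>>      * since no process can have dirty data to push out */
>>>>     if (!is_read_only)
>>>>         MPI_File_sync(mpi_fh);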
>>
>