Re: metadata consistency

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Jul 18 15:50:59 CDT 2013


Hi, Phil,

In PnetCDF, syncing the in-memory number of records and committing it
to the file is the last step of every ncmpi_put API call. So, for the most
common use case of writing one record at a time, the number of records
(stored in memory) will always reflect the last successful write, no
matter where in the run the crash happens. If NC_SHARE is set, the number
of records is also written to the file at the end of each API call.
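
For illustration, here is a minimal sketch of that common pattern, with
made-up file, dimension, and variable names: a 2D record variable written
one record per call, created with NC_SHARE so the record count reaches the
file header at the end of each call.

#include <mpi.h>
#include <pnetcdf.h>

#define NX 100   /* hypothetical per-record extent */

/* every rank writes the same record here, just to keep the sketch short */
int write_records(MPI_Comm comm, int nsteps, const double *buf)
{
    int ncid, dimids[2], varid, err;
    MPI_Offset start[2], count[2];

    err = ncmpi_create(comm, "out.nc", NC_CLOBBER | NC_SHARE,
                       MPI_INFO_NULL, &ncid);
    if (err != NC_NOERR) return err;

    ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
    ncmpi_def_dim(ncid, "x",    NX,           &dimids[1]);
    ncmpi_def_var(ncid, "var",  NC_DOUBLE, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    for (int t = 0; t < nsteps; t++) {
        start[0] = t;  count[0] = 1;    /* one record per call */
        start[1] = 0;  count[1] = NX;
        /* with NC_SHARE, the updated number of records is flushed to the
         * file header before this call returns */
        err = ncmpi_put_vara_double_all(ncid, varid, start, count,
                                        buf + (size_t)t * NX);
        if (err != NC_NOERR) break;
    }
    ncmpi_close(ncid);
    return err;
}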

So, my intention in making this the default in PnetCDF is that the number
of records read from the file header always indicates the number of
valid, fully committed records; any data beyond that boundary may be
only partially committed.
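
A reader that trusts only the header could then look like this sketch
(hypothetical helper, assuming the record dimension is the file's
unlimited dimension): it takes the record count from the header and
never touches data beyond it.

#include <mpi.h>
#include <pnetcdf.h>

/* report how many records the header says are valid */
int count_valid_records(MPI_Comm comm, const char *path, MPI_Offset *nrecs)
{
    int ncid, unlimid, err;

    err = ncmpi_open(comm, path, NC_NOWRITE, MPI_INFO_NULL, &ncid);
    if (err != NC_NOERR) return err;

    ncmpi_inq_unlimdim(ncid, &unlimid);       /* id of the record dimension */
    ncmpi_inq_dimlen(ncid, unlimid, nrecs);   /* record count from the header */

    return ncmpi_close(ncid);
}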


Wei-keng

On Jul 18, 2013, at 2:46 PM, Phil Miller wrote:

> On Thu, Jul 18, 2013 at 12:24 PM, Rob Latham <robl at mcs.anl.gov> wrote:
>> On Thu, Jul 18, 2013 at 01:17:00PM -0500, Wei-keng Liao wrote:
>>> A question to PnetCDF users. Say, if your program stops in the middle
>>> of PnetCDF I/O and files are not closed, is it acceptable to see the
>>> number of records in the file header smaller than the amount of data
>>> actually written in the file body? Or would you simply discard the files?
>>> The answer will determine what default settings should be used.
>> 
>> That's a good point.  I'd like to hear from our users, but our "worst
>> case" here is not a corrupt file as it might be in the HDF5 case, but
>> rather a large datafile with many (possibly all?) records unreachable.
>> 
>> The header would always declare "one record variable of the following
>> shape, with N records". It's only when crashing before closing that
>> the number of records reported could be less than the number of records
>> actually in the file.
>> 
>> We could probably write a recovery tool that, based on the size of the
>> file, can make a pretty good guess as to the number of records that
>> should exist.
> 
> This tool will need to be very carefully written, and might be
> impossible to implement correctly. If a write to a record variable
> gets committed far into a file, the file length will be at least the
> end point of that write. However, that doesn't mean that all of the
> data up to that point has also been committed. The filesystem
> could still have holes in the file, and reads will return zeros
> anywhere a write did not reach stable storage before execution
> halted. The protection against this is, of course, having all
> writers successfully call one of fdatasync()/fsync()/sync() and
> synchronize amongst themselves before the header gets updated.
> Even then, we still have to trust that the OS and parallel filesystem
> will behave themselves, and have actually made the data safe, when one
> of those calls returns.
> 
> So, it would be one thing to rewrite the record count field for
> visualization, while warning the user that the later records might
> be bogus. It would be another thing entirely to let some
> other process consume those potentially-invalid records. The latter
> possibility is probably going to bite some users really hard, even if
> they have been loudly warned.
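
For what it is worth, the guess such a recovery tool would make is
essentially arithmetic over the classic CDF layout (header, fixed-size
variables, then records of a fixed per-record size). Below is a rough,
hypothetical sketch, assuming begin_rec and recsize have been obtained by
parsing the header (PnetCDF can report the latter via ncmpi_inq_recsize);
as Phil notes above, the result is only an upper bound on what was
actually committed, since the tail of the file may contain holes that
read back as zeros.

#include <sys/stat.h>

/* rough sketch only, not a real recovery tool */
long long guess_num_records(const char *path,
                            long long begin_rec,  /* offset of record section */
                            long long recsize)    /* bytes per record, summed
                                                     over all record variables */
{
    struct stat sb;
    if (stat(path, &sb) != 0 || recsize <= 0) return -1;
    if ((long long)sb.st_size <= begin_rec) return 0;
    /* round up: a partially written last record still occupies file space */
    return ((long long)sb.st_size - begin_rec + recsize - 1) / recsize;
}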


