[netCDF #KLB-596506]: apparent bug in netcdf-4.2

Wei-keng Liao wkliao at ece.northwestern.edu
Tue Mar 5 12:49:36 CST 2013


Hi, Russ,

Here is what I found. When entering nc_redef and exiting with nc_enddef,
NC_begin in nc.c is called and it recalculates the file starting
offsets for all variables, regardless the offsets in the existing file
(i.e. ncp->begin_var and ncp->vars.value[*]->begin). Is it possible that
the "gaps" between variables due to PnetCDF's 512-byte alignment are
mistaken as part of the variables when the header size grows and all
variables must be "shifted" down? In file nc.c, line 604, the assertion
                assert(gnu_off > old_off);
check if the new begin of a variable is larger then the old one.
I wonder why the assertion did not fail.

It is interesting that the CDF-1 format has no such problem, but
PnetCDF does the same alignment for CDF-1 as well. Did you
use nccopy to convert the CDF-2 to CDF-1 and run the test? In that
case, I guess all alignments have been removed during the copy.

Now the question is whether netCDF would like to support this non-record
variable alignment that PnetCDF is using. If not, PnetCDF can simply
remove this feature to be compatible with netCDF (performance might
suffers in some cases, though). If yes, what are the good and bad
for other netCDF tools and backward compatibility?

One more thing, when you mentioned
> the library has made the additional assumption of 4-byte alignment
> within the fixed data section at least since version 3.6.0 in 2004
> (I've tested that and intervening versions have the same bug).

Do you mean "within the the fixed data section" that each fixed variable's
begin must aligned with 4 bytes? Is netCDF doing this alignment? In file
nc.c lines 209 and 210
                (*vpp)->begin = index;
                index += (*vpp)->len;
It does not appear to me the alignment is done.

PnetCDF's default alignment is 512 bytes which is already a multiple
of 4, and fulfills the requirement of 4-byte alignment.
I checked the file offsets of all variables in test_orig.nc and they
all seem to be aligned with 512 byte boundaries. Did you find otherwise?
Am I missing something here?


Wei-keng

On Mar 5, 2013, at 10:33 AM, Unidata netCDF Support wrote:

> Hi Wei-keng,
> 
>> This variable alignment is a PnetCDF behavior.
>> The default alignment value for each non-record variable is 512 bytes in PnetCDF.
>> 
>> According to CDF-1 and CDF-2 file format specifications, each variable has
>> a field named "begin" which is the variable's file starting location.
>> var := name  nelems  [dimid ...]  vatt_array  nc_type  vsize  begin
>> 
>> We believe PnetCDF's variable alignment does not violate the CDF spec. and hence
>> implemented this default alignment in hope to improve performance. This alignment
>> can be turned off by setting the two hints below.
>> MPI_Info_set(info, "nc_header_align_size", "1");
>> MPI_Info_set(info, "nc_var_align_size",    "1");
>> 
>> I wonder if you can send us the file and program to reproduce the corruption problem.
> 
> Yes, I got it from Jim Edwards, and it's a 1.4GB file, along with a
> Makefile and the original Fortran problem that demonstrated the bug:
> 
>  ftp://ftp.cgd.ucar.edu/pub/eaton/nfbug.tar
> 
> I converted to a C program demonstrating the bug, that's available
> here:
> 
>  https://bugtracking.unidata.ucar.edu/browse/NCF-234
> 
> Now that we know what caused the observed symptoms, it would probably
> be easy to create a smaller example by running ncgen linked with the
> pnetcdf library on a CDL file that just has the first few of 288
> variables.  It might even suffice to include just the first 1 or 2 of
> the scalar integer variables that are stored with 512-byte alignment:
> 
>  int timemgr_rst_nstep_rad_prev ;
>  int timemgr_rst_type ;
> 
> and the first vector variable 
> 
>  double grid1d_lon(gridcell) ;
> 
> but I'm attaching a truncated version of the CDL file for the file that
> has the whole header, but datavalues for only the first 11 variables.
> 
> The file can be read by our netCDF library, but when it's opened for
> writing and a change is made to the schema, calling nc_redef(ncid) and
> eventually nc_enddef(ncid), the variable offsets in the header are
> rewritten assuming 4-byte alignment, so subsequent reads get bad data
> values.  For the file in question, rewriting bad offsets only happens
> when the file is a "CDF2" 64-bit offset format file.  The bug demo
> program behaves correctly if run on a classic format version of the
> original file.
> 
> I think you're right that the file format specification permits the
> way pnetcdf is making use of the variable offsets in the header, but
> the library has made the additional assumption of 4-byte alignment
> within the fixed data section at least since version 3.6.0 in 2004
> (I've tested that and intervening versions have the same bug).
> 
> So we should take responsibility for fixing this ...
> 
> --Russ
> 
>> Wei-keng
>> 
>> On Mar 4, 2013, at 6:49 PM, Jim Edwards wrote:
>> 
>>> Hi Russ,
>>> 
>>> That turns out to have been the problem.   The original file was created with pnetcdf.
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>> On Mon, Mar 4, 2013 at 3:12 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>> Russ,
>>> 
>>> We think that the original file may have been written with pnetcdf.   We are going to try to recreate the file with netcdf and again with pnetcdf and see if that explains the issue.
>>> 
>>> Jim
>>> 
>>> 
>>> On Mon, Mar 4, 2013 at 2:31 PM, Samuel Levis <slevis at ucar.edu> wrote:
>>> Not exactly. I tried 2-degree to 2-degree, 2-degree to 0.5, 2-degree to 0.25, and others. All cases worked except the ones with the 0.5-degree file as output.
>>> 
>>> I also tried 0.5-degree to 0.5-degree (mapping the file into itself) and that failed. When I say failed, I mean that the output file ends up with junk in it.
>>> 
>>> Sam
>>> 
>>> 
>>> On 03/04/2013 02:26 PM, Jim Edwards wrote:
>>>> Hi Russ,
>>>> 
>>>> Another piece of information.   This program interpolates data from a file of one resolution (2 degree in this case) to another.  When the output file is low resolution, 1/2 degree or lower, the output file looks fine, no corruption that we can detect.   It's only when the output file is higher resolution (1/4 degree) that this problem comes about.
>>>> 
>>>> Jim
>>>> 
>>>> On Mon, Mar 4, 2013 at 2:04 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>> Hi Russ,
>>>> 
>>>> It looks like that file was originally created on bluefire on 11/21/11, I don't have any information about which netcdf library was used, but I think that some adjustment may have been made inside netcdf for performance on gpfs filesystems.
>>>> 
>>>> But doesn't your own
>>>> int nc__enddef(int ncid, size_t h_minfree, size_t v_align,
>>>>                    size_t v_minfree, size_t r_align);
>>>> 
>>>> 
>>>> allow for changing this alignment?   I don't know that that was done for this file, but it would seem to suggest that there is no assumption being violated about these alignments.  Or that one part of netcdf is assuming something which another part is not.
>>>> 
>>>> 
>>>> 
>>>> On Mon, Mar 4, 2013 at 12:53 PM, Unidata netCDF Support <support-netcdf at unidata.ucar.edu> wrote:
>>>> Hi Jim,
>>>> 
>>>> I'm curious how the original file you provided was created and perhaps
>>>> modified.  It has a peculiar alignment characteristic that I haven't
>>>> seen before, and if there are more netCDF files being created the same
>>>> way, we may nned to adapt.
>>>> 
>>>> Could you tell me the history of the file, what file system it was
>>>> written on, and whether the netCDF library with which it was written
>>>> was modified in any way?
>>>> 
>>>> The file has this characteristic, which would indicate a non-Posix
>>>> file system: it is using 512-byte alignment of data values rather than
>>>> the 4-byte alignment assumed by netCDF. So, for example, the data
>>>> block for fixed-size variables begins with 9 scalar integers that
>>>> should take 4 bytes each. The offsets computed for these values from
>>>> the beginning of the fixed-size data block are 0, 4, 8, 12, 16, 20,
>>>> 24, 28, 32, so there is no padding or wasted space. The offsets from
>>>> the beginning of the fixed-size data block that are actually stored in the
>>>> header for these variables are 0, 512, 1024, ... , 4096. If the file
>>>> system used to write the data originally could not write data on
>>>> 4-byte boundaries, I think that violates the assumption of netCDF and
>>>> POSIX I/O. Nevertheless, if the nc_endef() call pays attention to the
>>>> file offsets for each variable that are stored in the header (as the
>>>> netCDF library does when reading the file), rather than computing them
>>>> from assuming 4-byte alignment, perhaps this file can be modified
>>>> correctly.
>>>> 
>>>> The function where we might be able to adapt to this is
>>>> nc3internal.c:NC_begins(), which is called from
>>>> nc3internal.c:NC_enddef().  In any case it's a netCDF bug to write
>>>> something that can't be later read correctly, so if our unmodified
>>>> library wrote that file and we can't adapt to it, then it was a bug
>>>> to not emit an error message for trying to create a file on the original
>>>> non-POSIX file system.  Also, the data seems to all be there in the
>>>> "corrupted" file, which can be fixed by just restoring the variable
>>>> offsets in the file header to the peculiar values in the original ...
>>>> 
>>>> --Russ
>>>> 
>>>> Russ Rew                                         UCAR Unidata Program
>>>> russ at unidata.ucar.edu                      http://www.unidata.ucar.edu
>>>> 
>>>> 
>>>> 
>>>> Ticket Details
>>>> ===================
>>>> Ticket ID: KLB-596506
>>>> Department: Support netCDF
>>>> Priority: Normal
>>>> Status: Closed
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jim Edwards
>>>> 
>>>> CESM Software Engineering Group
>>>> National Center for Atmospheric Research
>>>> Boulder, CO
>>>> 303-497-1842
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jim Edwards
>>>> 
>>>> CESM Software Engineering Group
>>>> National Center for Atmospheric Research
>>>> Boulder, CO
>>>> 303-497-1842
>>> 
>>> --
>>> Samuel Levis -
>>> slevis at ucar.edu
>>> 
>>> National Center for Atmospheric Research
>>> PO Box 3000, Boulder CO 80307-3000      <- use for mail
>>> 3090 Center Green Dr., Boulder CO 80301 <- vs. shipping
>>> 
>>> tel
>>> 303 497-1627
>>> ; fax -1348; skype: samuellevis2
>>> 
>>> http://www.cgd.ucar.edu/tss
>>> 
>>> 
>>> Terrestrial Sciences Section in the
>>> Climate & Global Dynamics Division
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Jim Edwards
>>> 
>>> CESM Software Engineering Group
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>> 303-497-1842
>>> 
>>> 
>>> 
>>> --
>>> Jim Edwards
>>> 
>>> CESM Software Engineering Group
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>> 303-497-1842
>> 
>> 
> Russ Rew                                         UCAR Unidata Program
> russ at unidata.ucar.edu                      http://www.unidata.ucar.edu
> 
> 
> 
> Ticket Details
> ===================
> Ticket ID: KLB-596506
> Department: Support netCDF
> Priority: Normal
> Status: Closed<test_orig.cdl-trunc>



More information about the parallel-netcdf mailing list