pnetcdf bug?

Wei-keng Liao wkliao at eecs.northwestern.edu
Wed Oct 28 12:04:12 CDT 2015


Hi, Bill

The bug happens when offset alignment is enabled (as in most files created by
the PnetCDF library) and new variables are added to the file (by the netCDF library
when re-entering define mode after opening an existing file). I agree with your
suggestion that CESM users be cautious if they use a netCDF version older than 4.4.0.
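
To make "offset alignment" concrete, here is a minimal sketch of how variable
starting offsets could be laid out with and without alignment. It is a simplified
model, not code from PnetCDF or netCDF: the function names align_up and
layout_vars, the header size, and the variable sizes are all made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be >= 1). */
static long long align_up(long long offset, long long align)
{
    assert(align >= 1);
    return ((offset + align - 1) / align) * align;
}

/*
 * Hypothetical layout model: place nvars variables one after another,
 * starting right after the header, with each starting offset rounded up
 * to a multiple of align. Stores the starting offsets in begins[] and
 * returns the offset just past the last variable.
 */
static long long layout_vars(long long header_end, const long long *sizes,
                             size_t nvars, long long align, long long *begins)
{
    long long next = header_end;
    for (size_t i = 0; i < nvars; i++) {
        begins[i] = align_up(next, align);  /* align == 1 means no padding */
        next = begins[i] + sizes[i];
    }
    return next;
}
```

With a 300-byte header and variables of 100, 700 and 40 bytes, an alignment of
512 gives starting offsets 512, 1024 and 2048, while an alignment of 1 gives
300, 400 and 1100. Any code that appends a variable has to start after the real
end of the last existing variable, which differs between the two layouts.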

Because netCDF library does not do alignment at all, one solution is to disable
alignment in PnetCDF to produce non-aligned files. This can be done by passing
an MPI hint or setting a run-time environment variable.
    MPI_Info_set(info, "nc_var_align_size", "1");
    setenv PNETCDF_HINTS "nc_var_align_size=1"
See
http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html#align
http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html#PNETCDF_005fHINTS
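
If changing the job script is inconvenient, the environment variable could also
be set from inside the program. The sketch below is only an illustration: the
helper name set_pnetcdf_align_hint is made up, and it assumes the variable is set
before the file is created or opened so that the library sees it.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: export PNETCDF_HINTS with the given variable
 * alignment in bytes (1 disables alignment). Note that this overwrites
 * any hints already stored in PNETCDF_HINTS.
 * Returns 0 on success, -1 on failure.
 */
static int set_pnetcdf_align_hint(long align_bytes)
{
    char hints[64];
    const char *readback;
    int n;

    assert(align_bytes >= 1);
    n = snprintf(hints, sizeof hints, "nc_var_align_size=%ld", align_bytes);
    if (n < 0 || (size_t)n >= sizeof hints)
        return -1;                      /* formatting failed or truncated */
    if (setenv("PNETCDF_HINTS", hints, 1) != 0)
        return -1;

    /* Read the value back to confirm the environment was updated. */
    readback = getenv("PNETCDF_HINTS");
    return (readback != NULL && strcmp(readback, hints) == 0) ? 0 : -1;
}
```

Calling set_pnetcdf_align_hint(1) early in the program, before any file is
created, should have the same effect as the setenv line above.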

Please note that disabling alignment may have an impact on I/O performance.
However, the impact is less if you use PnetCDF nonblocking APIs to aggregate
multiple requests into a single one.

I thought adding new variables to an existing file happens rarely in netCDF applications
because of the high cost of moving (shifting) the record variables down.
Is CESM doing this?


Wei-keng

On Oct 28, 2015, at 7:29 AM, Bill Sacks wrote:

> Hi Wei-keng,
> 
> Do you have any sense of when this bug would apply? I am telling people to use caution when doing any manipulations of files written by pnetcdf, using tools built on top of the vanilla netcdf library (i.e., not pnetcdf-based tools). Would you agree?
> 
> Thanks,
> Bill
>  
>> On Oct 27, 2015, at 4:29 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>> 
>> Hi, Bill
>> 
>> I confirm this is a bug in netCDF. Please go ahead and submit a bug report to the netCDF group.
>> 
>> Below is the patch to fix this bug.
>> 
>> % diff wkliao/libsrc/nc3internal.c ../netcdf-4.3.3.1/libsrc/nc3internal.c
>> 213c213
>> < 		        if ((*vpp)->begin < ncp->old->vars.value[j]->begin) {
>> ---
>>> 		        if ((*vpp)->begin < ncp->old->vars.value[j]->begin)
>> 218,219d217
>> <                             index = (*vpp)->begin;
>> <                         }
>> 
>> 
>> I also wrote a short program (attached) that adds 2 new variables and tested
>> it on your file created by PnetCDF method. I have to add a printf statement in
>> netCDF library to print the variable offsets. See comments inside the test
>> program. You can also send the codes to netCDF support.
>> 
>> If you decide to apply the patch to your netCDF library, please let me know
>> if it works for you.
>> 
>> Wei-keng
>> 
>> <add_var.c>
>> On Oct 27, 2015, at 3:19 PM, Bill Sacks wrote:
>> 
>>> Hi Wei-keng,
>>> 
>>> Thanks very much for looking into this. I'm happy to submit a bug to the netCDF group if you think that's the best next step.
>>> 
>>> Superficially, this sure sounds similar to https://bugtracking.unidata.ucar.edu/browse/NCF-234 – but maybe there are details that make it differ.
>>> 
>>> Thanks,
>>> Bill
>>> 
>>>> On Oct 27, 2015, at 1:11 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>> 
>>>> Hi, Bill
>>>> 
>>>> I checked the file starting offsets for the two newly added variables.
>>>> It appears that ncks (netCDF underneath) does not respect the offset
>>>> alignment used in the files created by PnetCDF.
>>>> 
>>>> Your file created by netCDF has no alignment in between two adjacent variables.
>>>> The other file created by PnetCDF has an alignment of 512 bytes.
>>>> So, when ncks adds 2 new variables, I found the file offsets of the
>>>> two new variables overlap with the last variable of the existing file.
>>>> This indicates a bug in netCDF library, as ncks does not use PnetCDF library.
>>>> 
>>>> I will dig into netCDF library to see what happens internally.
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Oct 27, 2015, at 1:41 PM, Bill Sacks wrote:
>>>> 
>>>>> Looking back at my notes, it seems that this problem sometimes appears in differences in actual values – i.e., it doesn't appear to just be a difference in where there are fill values.
>>>>> 
>>>>> Thank you,
>>>>> Bill
>>>>> 
>>>>>> On Oct 27, 2015, at 12:30 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>>> 
>>>>>> Hi, Bill
>>>>>> 
>>>>>> I can reproduce what you are seeing.
>>>>>> 
>>>>>> If the differences happen only to those missing array elements (fill values),
>>>>>> then this is because PnetCDF supports the fill mode only starting with 1.6.1.
>>>>>> Please note the way fill mode is used differs from netCDF. See the release note
>>>>>> and example codes in
>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/ReleaseNotes-1.6.1
>>>>>> 
>>>>>> Please let me know if this is the case.
>>>>>> 
>>>>>> Wei-keng
>>>>>> 
>>>>>> On Oct 27, 2015, at 12:41 PM, Bill Sacks wrote:
>>>>>> 
>>>>>>> I have put the attachment on a public ftp server:
>>>>>>> 
>>>>>>> ftp ftp.cgd.ucar.edu
>>>>>>> 
>>>>>>> user name: anonymous
>>>>>>> password: (your email address)
>>>>>>> 
>>>>>>> cd pub/sacks
>>>>>>> get pnetcdf_bug.tar.gz
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Bill
>>>>>>> 
>>>>>>>> On Oct 27, 2015, at 11:11 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>>>>> 
>>>>>>>> Hi, Bill
>>>>>>>> 
>>>>>>>> Bug NCF-234 should not be the cause, as you are using netCDF 4.3.3.1.
>>>>>>>> The fix has been applied to 4.3.0. I will take a look and get back to you.
>>>>>>>> 
>>>>>>>> Somehow your attachment did not come through my mail system.
>>>>>>>> I checked the PnetCDF mail archive and it does not appear there either.
>>>>>>>> http://lists.mcs.anl.gov/pipermail/parallel-netcdf/2015-October/001746.html
>>>>>>>> 
>>>>>>>> Maybe the file is too big? If that is the case, please send it to me directly.
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Wei-keng
>>>>>>>> 
>>>>>>>> On Oct 27, 2015, at 10:36 AM, Bill Sacks wrote:
>>>>>>>> 
>>>>>>>>> I wonder if this could be related to this (fixed) bug:
>>>>>>>>> 
>>>>>>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-234
>>>>>>>>> 
>>>>>>>>> As with that one, it's possible that the problem is actually in netCDF and not in pnetcdf. Does anyone have an idea for how to determine if this is a pnetcdf problem or a netcdf problem? Or should I go ahead and post this to the netcdf bug list as well?
>>>>>>>>> 
>>>>>>>>> Charlie: I'm feeling more and more that NCO is probably off the hook here: sorry for dragging you into this initially :-)
>>>>>>>>> 
>>>>>>>>> Bill
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Oct 27, 2015, at 9:21 AM, Bill Sacks <wsacks at gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I have run into what appears to be a bug in pnetcdf: I have a file written by pnetcdf (via CESM). When I try to append a variable onto it using ncks -A, the new variable gets written properly, but a different variable on the file gets garbage values put into it. If the original file is written with standard netcdf rather than pnetcdf, the problem does not occur.
>>>>>>>>>> 
>>>>>>>>>> I am attaching a tar file that contains files needed to see the problem. It contains two restart files written by CESM (file names beginning check_ncks...): one written with pnetcdf and one with standard netcdf (the latter has "netcdf" in its name). It also contains a third file from which I was trying to copy variables onto this file.
>>>>>>>>>> 
>>>>>>>>>> To reproduce:
>>>>>>>>>> 
>>>>>>>>>> cp check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc test.nc
>>>>>>>>>> ncks -A -v COL_Z_p,LEVGRND_CLASS_p finidat_interp_dest.nc test.nc 
>>>>>>>>>> ncdump -v plant_nalloc check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc > dump1
>>>>>>>>>> ncdump -v plant_nalloc test.nc > dump2
>>>>>>>>>> diff dump1 dump2 | less
>>>>>>>>>> 
>>>>>>>>>> Notice that many points that were FillValue have been replaced by garbage. 
>>>>>>>>>> 
>>>>>>>>>> If you do the same thing, but using check_ncks_problem_noInterp_netcdf_1027.clm2.r.0001-01-01-01800.nc, then the dumps are identical.
>>>>>>>>>> 
>>>>>>>>>> I originally filed a bug report with NCO <https://sourceforge.net/p/nco/bugs/84/>, but Charlie Zender and Jim Edwards both feel that this is most likely a problem in the writing of the original file, which points to a possible pnetcdf problem.
>>>>>>>>>> 
>>>>>>>>>> CESM was built with
>>>>>>>>>> 
>>>>>>>>>>     module load netcdf-mpi/4.3.3.1
>>>>>>>>>>     module load pnetcdf/1.6.0
>>>>>>>>>> 
>>>>>>>>>> (on NCAR's yellowstone machine).
>>>>>>>>>> 
>>>>>>>>>> Thank you,
>>>>>>>>>> Bill
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Bill Sacks
>>>>>>>>>> CESM Software Engineering Group
>>>>>>>>>> National Center for Atmospheric Research
>>>>>>>>>> (303) 497-1762
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


