pnetcdf bug?

Bill Sacks wsacks at gmail.com
Wed Oct 28 12:11:09 CDT 2015


Hi Wei-keng,

Thanks a lot; this is very helpful. This problem occurred when I was adding new variables to a file after-the-fact using 'ncks -A ...'.

Thank you,
Bill

> On Oct 28, 2015, at 11:04 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> 
> Hi, Bill
> 
> The bug is triggered when offset alignment is enabled (as it is in most files created
> by the PnetCDF library) and new variables are then added to the file with the netCDF
> library, by re-entering define mode after opening the existing file. I agree with your
> suggestion that CESM users be cautious if they use a netCDF version older than 4.4.0.
> 
> Because the netCDF library does not do alignment at all, one solution is to disable
> alignment in PnetCDF so that it produces non-aligned files. This can be done by passing
> an MPI hint or by setting a run-time environment variable.
>    MPI_Info_set(info, "nc_var_align_size", "1");
>    setenv PNETCDF_HINTS "nc_var_align_size=1"
> See
> http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html#align
> http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html#PNETCDF_005fHINTS
> 
> Please note that disabling alignment may affect I/O performance.
> However, the impact is smaller if you use the PnetCDF nonblocking APIs to aggregate
> multiple requests into a single one.
> 
> I thought adding new variables to an existing file was rare in netCDF applications
> because of the high cost of moving (shifting) the record variables down.
> Is CESM doing this?
> 
> 
> Wei-keng
> 
> On Oct 28, 2015, at 7:29 AM, Bill Sacks wrote:
> 
>> Hi Wei-keng,
>> 
>> Do you have any sense of when this bug would apply? I am telling people to use caution when doing any manipulations of files written by pnetcdf, using tools built on top of the vanilla netcdf library (i.e., not pnetcdf-based tools). Would you agree?
>> 
>> Thanks,
>> Bill
>> 
>>> On Oct 27, 2015, at 4:29 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>> 
>>> Hi, Bill
>>> 
>>> I confirm this is a bug in netCDF. Please go ahead and submit a bug report to the netCDF group.
>>> 
>>> Below is the patch to fix this bug.
>>> 
>>> % diff wkliao/libsrc/nc3internal.c ../netcdf-4.3.3.1/libsrc/nc3internal.c
>>> 213c213
>>> < 		        if ((*vpp)->begin < ncp->old->vars.value[j]->begin) {
>>> ---
>>> > 		        if ((*vpp)->begin < ncp->old->vars.value[j]->begin)
>>> 218,219d217
>>> <                             index = (*vpp)->begin;
>>> <                         }
>>> 
>>> 
>>> I also wrote a short program (attached) that adds 2 new variables and tested
>>> it on your file created with the PnetCDF method. I had to add a printf statement to
>>> the netCDF library to print the variable offsets. See the comments inside the test
>>> program. You can also send the code to netCDF support.
>>> 
>>> If you decide to apply the patch to your netCDF library, please let me know
>>> if it works for you.
>>> 
>>> Wei-keng
>>> 
>>> <add_var.c>
>>> On Oct 27, 2015, at 3:19 PM, Bill Sacks wrote:
>>> 
>>>> Hi Wei-keng,
>>>> 
>>>> Thanks very much for looking into this. I'm happy to submit a bug to the netCDF group if you think that's the best next step.
>>>> 
>>>> Superficially, this sure sounds similar to https://bugtracking.unidata.ucar.edu/browse/NCF-234 – but maybe there are details that make it differ.
>>>> 
>>>> Thanks,
>>>> Bill
>>>> 
>>>>> On Oct 27, 2015, at 1:11 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>> 
>>>>> Hi, Bill
>>>>> 
>>>>> I checked the file starting offsets for the two newly added variables.
>>>>> It appears that ncks (netCDF underneath) does not respect the offset
>>>>> alignment used in the files created by PnetCDF.
>>>>> 
>>>>> Your file created by netCDF has no alignment padding between adjacent variables.
>>>>> The other file, created by PnetCDF, uses an alignment of 512 bytes.
>>>>> When ncks adds the 2 new variables, I found that the file offsets of the
>>>>> two new variables overlap with the last variable of the existing file.
>>>>> This indicates a bug in the netCDF library, as ncks does not use the PnetCDF library.
>>>>> 
>>>>> I will dig into netCDF library to see what happens internally.
>>>>> 
>>>>> Wei-keng
>>>>> 
>>>>> On Oct 27, 2015, at 1:41 PM, Bill Sacks wrote:
>>>>> 
>>>>>> Looking back at my notes, it seems that this problem sometimes appears in differences in actual values – i.e., it doesn't appear to just be a difference in where there are fill values.
>>>>>> 
>>>>>> Thank you,
>>>>>> Bill
>>>>>> 
>>>>>>> On Oct 27, 2015, at 12:30 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>>>> 
>>>>>>> Hi, Bill
>>>>>>> 
>>>>>>> I can reproduce what you are seeing.
>>>>>>> 
>>>>>>> If the differences occur only in the missing array elements (fill values),
>>>>>>> then this is because PnetCDF supports fill mode only as of version 1.6.1.
>>>>>>> Please note that the way fill mode is used differs from netCDF. See the release notes
>>>>>>> and example code in
>>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/ReleaseNotes-1.6.1
>>>>>>> 
>>>>>>> Please let me know if this is the case.
>>>>>>> 
>>>>>>> Wei-keng
>>>>>>> 
>>>>>>> On Oct 27, 2015, at 12:41 PM, Bill Sacks wrote:
>>>>>>> 
>>>>>>>> I have put the attachment on a public ftp server:
>>>>>>>> 
>>>>>>>> ftp ftp.cgd.ucar.edu
>>>>>>>> 
>>>>>>>> user name: anonymous
>>>>>>>> password: (your email address)
>>>>>>>> 
>>>>>>>> cd pub/sacks
>>>>>>>> get pnetcdf_bug.tar.gz
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Bill
>>>>>>>> 
>>>>>>>>> On Oct 27, 2015, at 11:11 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>>>>>> 
>>>>>>>>> Hi, Bill
>>>>>>>>> 
>>>>>>>>> Bug NCF-234 should not be the cause, as you are using netCDF 4.3.3.1.
>>>>>>>>> That fix was applied in 4.3.0. I will take a look and get back to you.
>>>>>>>>> 
>>>>>>>>> Somehow your attachment did not come through my mail system.
>>>>>>>>> I checked the PnetCDF mail archive and it does not appear there either.
>>>>>>>>> http://lists.mcs.anl.gov/pipermail/parallel-netcdf/2015-October/001746.html
>>>>>>>>> 
>>>>>>>>> Maybe the file is too big? If that is the case, please send it to me directly.
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> Wei-keng
>>>>>>>>> 
>>>>>>>>> On Oct 27, 2015, at 10:36 AM, Bill Sacks wrote:
>>>>>>>>> 
>>>>>>>>>> I wonder if this could be related to this (fixed) bug:
>>>>>>>>>> 
>>>>>>>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-234
>>>>>>>>>> 
>>>>>>>>>> As with that one, it's possible that the problem is actually in netCDF and not in pnetcdf. Does anyone have an idea for how to determine if this is a pnetcdf problem or a netcdf problem? Or should I go ahead and post this to the netcdf bug list as well?
>>>>>>>>>> 
>>>>>>>>>> Charlie: I'm feeling more and more that NCO is probably off the hook here: sorry for dragging you into this initially :-)
>>>>>>>>>> 
>>>>>>>>>> Bill
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Oct 27, 2015, at 9:21 AM, Bill Sacks <wsacks at gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have run into what appears to be a bug in pnetcdf: I have a file written by pnetcdf (via CESM). When I try to append a variable onto it using ncks -A, the new variable gets written properly, but a different variable on the file gets garbage values put into it. If the original file is written with standard netcdf rather than pnetcdf, the problem does not occur.
>>>>>>>>>>> 
>>>>>>>>>>> I am attaching a tar file that contains files needed to see the problem. It contains two restart files written by CESM (file names beginning check_ncks...): one written with pnetcdf and one with standard netcdf (the latter has "netcdf" in its name). It also contains a third file from which I was trying to copy variables onto this file.
>>>>>>>>>>> 
>>>>>>>>>>> To reproduce:
>>>>>>>>>>> 
>>>>>>>>>>> cp check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc test.nc
>>>>>>>>>>> ncks -A -v COL_Z_p,LEVGRND_CLASS_p finidat_interp_dest.nc test.nc 
>>>>>>>>>>> ncdump -v plant_nalloc check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc > dump1
>>>>>>>>>>> ncdump -v plant_nalloc test.nc > dump2
>>>>>>>>>>> diff dump1 dump2 | less
>>>>>>>>>>> 
>>>>>>>>>>> Notice that many points that were FillValue have been replaced by garbage. 
>>>>>>>>>>> 
>>>>>>>>>>> If you do the same thing, but using check_ncks_problem_noInterp_netcdf_1027.clm2.r.0001-01-01-01800.nc, then the dumps are identical.
>>>>>>>>>>> 
>>>>>>>>>>> I originally filed a bug report with NCO <https://sourceforge.net/p/nco/bugs/84/>, but Charlie Zender and Jim Edwards both feel that this is most likely a problem in the writing of the original file, which points to a possible pnetcdf problem.
>>>>>>>>>>> 
>>>>>>>>>>> CESM was built with
>>>>>>>>>>> 
>>>>>>>>>>>    module load netcdf-mpi/4.3.3.1
>>>>>>>>>>>    module load pnetcdf/1.6.0
>>>>>>>>>>> 
>>>>>>>>>>> (on NCAR's yellowstone machine).
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Bill
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Bill Sacks
>>>>>>>>>>> CESM Software Engineering Group
>>>>>>>>>>> National Center for Atmospheric Research
>>>>>>>>>>> (303) 497-1762
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


