pnetcdf bug?
Jim Edwards
jedwards at ucar.edu
Wed Oct 28 12:10:48 CDT 2015
CESM doesn't add new variables to a file. But in some cases we will create
a new initial file by copying an existing initial file and adding or
modifying some of the fields.
On Wed, Oct 28, 2015 at 11:04 AM, Wei-keng Liao <
wkliao at eecs.northwestern.edu> wrote:
> Hi, Bill
>
> The bug happens when the offset alignment is enabled (i.e. most files
> created by
> PnetCDF library) and new variables are added to the file (by using netCDF
> library
> when re-entering define mode after opening an existing file). I agree with
> your
> suggestion that CESM users be cautious if they use netCDF older than 4.4.0.
>
> Because netCDF library does not do alignment at all, one solution is to
> disable
> alignment in PnetCDF to produce non-aligned files. This can be done by
> passing
> an MPI hint or setting a run-time environment variable.
> MPI_Info_set(info, "nc_var_align_size", "1");
> setenv PNETCDF_HINTS "nc_var_align_size=1"
> See
> http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html#align
>
> http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html#PNETCDF_005fHINTS
>
> Please note that disabling alignment may affect I/O performance.
> However, the impact is less if you use PnetCDF nonblocking APIs to
> aggregate
> multiple requests into a single one.
>
> I thought adding new variables to an existing file happens rarely in
> netCDF applications
> because of the high penalty to move (shift) the record variables down.
> Is CESM doing this?
>
>
> Wei-keng
>
> On Oct 28, 2015, at 7:29 AM, Bill Sacks wrote:
>
> > Hi Wei-keng,
> >
> > Do you have any sense of when this bug would apply? I am telling people
> to use caution when doing any manipulations of files written by pnetcdf,
> using tools built on top of the vanilla netcdf library (i.e., not
> pnetcdf-based tools). Would you agree?
> >
> > Thanks,
> > Bill
> >
> >> On Oct 27, 2015, at 4:29 PM, Wei-keng Liao <
> wkliao at eecs.northwestern.edu> wrote:
> >>
> >> Hi, Bill
> >>
> >> I confirm this is a bug in netCDF. Please go ahead and submit a bug report to the
> netCDF group.
> >>
> >> Below is the patch to fix this bug.
> >>
> >> % diff wkliao/libsrc/nc3internal.c ../netcdf-4.3.3.1/libsrc/nc3internal.c
> >> 213c213
> >> < if ((*vpp)->begin < ncp->old->vars.value[j]->begin) {
> >> ---
> >> > if ((*vpp)->begin < ncp->old->vars.value[j]->begin)
> >> 218,219d217
> >> < index = (*vpp)->begin;
> >> < }
> >>
> >>
> >> I also wrote a short program (attached) that adds 2 new variables and
> tested
> >> it on your file created by the PnetCDF method. I had to add a printf
> statement to
> >> the netCDF library to print the variable offsets. See the comments inside the
> test
> >> program. You can also send the code to netCDF support.
> >>
> >> If you decide to apply the patch to your netCDF library, please let me
> know
> >> if it works for you.
> >>
> >> Wei-keng
> >>
> >> <add_var.c>
> >> On Oct 27, 2015, at 3:19 PM, Bill Sacks wrote:
> >>
> >>> Hi Wei-keng,
> >>>
> >>> Thanks very much for looking into this. I'm happy to submit a bug to
> the netCDF group if you think that's the best next step.
> >>>
> >>> Superficially, this sure sounds similar to
> https://bugtracking.unidata.ucar.edu/browse/NCF-234 – but maybe there are
> details that make it differ.
> >>>
> >>> Thanks,
> >>> Bill
> >>>
> >>>> On Oct 27, 2015, at 1:11 PM, Wei-keng Liao <
> wkliao at eecs.northwestern.edu> wrote:
> >>>>
> >>>> Hi, Bill
> >>>>
> >>>> I checked the file starting offsets for the two newly added variables.
> >>>> It appears that ncks (netCDF underneath) does not respect the offset
> >>>> alignment used in the files created by PnetCDF.
> >>>>
> >>>> Your file created by netCDF has no alignment in between two adjacent
> variables.
> >>>> The other file created by PnetCDF has an alignment of 512 bytes.
> >>>> So, when ncks adds 2 new variables, I found the file offsets of the
> >>>> two new variables overlap with the last variable of the existing file.
> >>>> This indicates a bug in netCDF library, as ncks does not use PnetCDF
> library.
> >>>>
> >>>> I will dig into netCDF library to see what happens internally.
> >>>>
> >>>> Wei-keng
> >>>>
> >>>> On Oct 27, 2015, at 1:41 PM, Bill Sacks wrote:
> >>>>
> >>>>> Looking back at my notes, it seems that this problem sometimes
> appears in differences in actual values – i.e., it doesn't appear to just
> be a difference in where there are fill values.
> >>>>>
> >>>>> Thank you,
> >>>>> Bill
> >>>>>
> >>>>>> On Oct 27, 2015, at 12:30 PM, Wei-keng Liao <
> wkliao at eecs.northwestern.edu> wrote:
> >>>>>>
> >>>>>> Hi, Bill
> >>>>>>
> >>>>>> I can reproduce what you are seeing.
> >>>>>>
> >>>>>> If the differences happen only to those missing array elements
> (fill values),
> >>>>>> then this is because PnetCDF supports fill mode only as of version 1.6.1.
> >>>>>> Please note that the way fill mode is used differs from netCDF. See the
> release notes
> >>>>>> and example code in
> >>>>>>
> http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/ReleaseNotes-1.6.1
> >>>>>>
> >>>>>> Please let me know if this is the case.
> >>>>>>
> >>>>>> Wei-keng
> >>>>>>
> >>>>>> On Oct 27, 2015, at 12:41 PM, Bill Sacks wrote:
> >>>>>>
> >>>>>>> I have put the attachment on a public ftp server:
> >>>>>>>
> >>>>>>> ftp ftp.cgd.ucar.edu
> >>>>>>>
> >>>>>>> user name: anonymous
> >>>>>>> password: (your email address)
> >>>>>>>
> >>>>>>> cd pub/sacks
> >>>>>>> get pnetcdf_bug.tar.gz
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Bill
> >>>>>>>
> >>>>>>>> On Oct 27, 2015, at 11:11 AM, Wei-keng Liao <
> wkliao at eecs.northwestern.edu> wrote:
> >>>>>>>>
> >>>>>>>> Hi, Bill
> >>>>>>>>
> >>>>>>>> Bug NCF-234 should not be the cause, as you are using netCDF
> 4.3.3.1.
> >>>>>>>> The fix has been applied to 4.3.0. I will take a look and get
> back to you.
> >>>>>>>>
> >>>>>>>> Somehow your attachment did not come through my mail system.
> >>>>>>>> I checked the PnetCDF mail archive and it does not appear there either.
> >>>>>>>>
> http://lists.mcs.anl.gov/pipermail/parallel-netcdf/2015-October/001746.html
> >>>>>>>>
> >>>>>>>> Maybe the file is too big? If that is the case, please send it to
> me directly.
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Wei-keng
> >>>>>>>>
> >>>>>>>> On Oct 27, 2015, at 10:36 AM, Bill Sacks wrote:
> >>>>>>>>
> >>>>>>>>> I wonder if this could be related to this (fixed) bug:
> >>>>>>>>>
> >>>>>>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-234
> >>>>>>>>>
> >>>>>>>>> As with that one, it's possible that the problem is actually in
> netCDF and not in pnetcdf. Does anyone have an idea for how to determine if
> this is a pnetcdf problem or a netcdf problem? Or should I go ahead and
> post this to the netcdf bug list as well?
> >>>>>>>>>
> >>>>>>>>> Charlie: I'm feeling more and more that NCO is probably off the
> hook here: sorry for dragging you into this initially :-)
> >>>>>>>>>
> >>>>>>>>> Bill
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Oct 27, 2015, at 9:21 AM, Bill Sacks <wsacks at gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I have run into what appears to be a bug in pnetcdf: I have a
> file written by pnetcdf (via CESM). When I try to append a variable onto it
> using ncks -A, the new variable gets written properly, but a different
> variable on the file gets garbage values put into it. If the original file
> is written with standard netcdf rather than pnetcdf, the problem does not
> occur.
> >>>>>>>>>>
> >>>>>>>>>> I am attaching a tar file that contains files needed to see the
> problem. It contains two restart files written by CESM (file names
> beginning check_ncks...): one written with pnetcdf and one with standard
> netcdf (the latter has "netcdf" in its name). It also contains a third file
> from which I was trying to copy variables onto this file.
> >>>>>>>>>>
> >>>>>>>>>> To reproduce:
> >>>>>>>>>>
> >>>>>>>>>> cp check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc
> test.nc
> >>>>>>>>>> ncks -A -v COL_Z_p,LEVGRND_CLASS_p finidat_interp_dest.nc
> test.nc
> >>>>>>>>>> ncdump -v plant_nalloc
> check_ncks_problem_noInterp_1027.clm2.r.0001-01-01-01800.nc > dump1
> >>>>>>>>>> ncdump -v plant_nalloc test.nc > dump2
> >>>>>>>>>> diff dump1 dump2 | less
> >>>>>>>>>>
> >>>>>>>>>> Notice that many points that were FillValue have been replaced
> by garbage.
> >>>>>>>>>>
> >>>>>>>>>> If you do the same thing, but using
> check_ncks_problem_noInterp_netcdf_1027.clm2.r.0001-01-01-01800.nc, then
> the dumps are identical.
> >>>>>>>>>>
> >>>>>>>>>> I originally filed a bug report with NCO <
> https://sourceforge.net/p/nco/bugs/84/>, but Charlie Zender and Jim
> Edwards both feel that this is most likely a problem in the writing of the
> original file, which points to a possible pnetcdf problem.
> >>>>>>>>>>
> >>>>>>>>>> CESM was built with
> >>>>>>>>>>
> >>>>>>>>>> module load netcdf-mpi/4.3.3.1
> >>>>>>>>>> module load pnetcdf/1.6.0
> >>>>>>>>>>
> >>>>>>>>>> (on NCAR's yellowstone machine).
> >>>>>>>>>>
> >>>>>>>>>> Thank you,
> >>>>>>>>>> Bill
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Bill Sacks
> >>>>>>>>>> CESM Software Engineering Group
> >>>>>>>>>> National Center for Atmospheric Research
> >>>>>>>>>> (303) 497-1762
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
--
Jim Edwards
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO