Intermittent error with pnetcdf 1.6.x: One or more variable sizes violate format constraints
Wei-keng Liao
wkliao at eecs.northwestern.edu
Tue Dec 1 00:28:58 CST 2015
Hi, Michael
NetCDF will officially support CDF-5 in its next release, 4.4.0. The
feature is already in place in the latest release candidate,
netCDF-4.4.0-rc4. See the release notes about CDF-5 at the URL below.
https://github.com/Unidata/netcdf-c/releases
Wei-keng
On Nov 30, 2015, at 11:37 PM, Schlottke-Lakemper, Michael wrote:
> By the way, can you please tell me the status of CDF-5 support in NetCDF (https://github.com/wkliao/netcdf-c/tree/CDF-5)? Are there any plans to promote the support to the upstream NetCDF repository in the foreseeable future?
>
>> On 01 Dec 2015, at 06:11 , Michael Schlottke-Lakemper <m.schlottke-lakemper at aia.rwth-aachen.de> wrote:
>>
>> Hi Wei-keng,
>>
>> I really should have read the documentation more carefully (for reference: https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/NetCDF-64-bit-Offset-Format-Limitations.html#NetCDF-64-bit-Offset-Format-Limitations and https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/FileLimits). The fact that PnetCDF 1.5.0 seemed to work without problems really threw me off here. Thanks a lot for the quick clarification!
>>
>> Regards,
>>
>> Michael
>>
>>> On 01 Dec 2015, at 01:52 , Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>
>>> Hi, Michael
>>>
>>> From the header, I can see each of 4 variables is of size 1207959552 x 8 bytes = 9 GiB.
>>> Defining variables larger than 4GiB is not allowed in CDF-2 format (i.e. NC_64BIT_OFFSET).
>>> There is one exception: a single fixed-size variable can be larger than 4GiB if it is the
>>> last variable defined and there are no record variables.
>>>
>>> PnetCDF 1.5.0 fails to detect this error, which is a bug in 1.5.0.
>>> Versions 1.6.0 and 1.6.1 should have already fixed this problem.
>>>
>>> If you would like to define large variables, please consider the
>>> CDF-5 format, selected by passing the NC_64BIT_DATA flag when creating a file.
>>>
>>>
>>> Wei-keng
>>>
>>> On Nov 30, 2015, at 4:39 PM, Schlottke-Lakemper, Michael wrote:
>>>
>>>> Hi Wei-keng,
>>>>
>>>> The config.log is not easily available for us, since we cannot reproduce the error on our department’s cluster but only on a production system where we do not have direct access to the build system. If it would be helpful, however, I can try to investigate if we can get hold of it.
>>>>
>>>> Setting PNETCDF_SAFE_MODE=1 did not produce any additional output.
>>>>
>>>> Below I have attached a header dump of the file that we are trying to write. It was created with PnetCDF 1.5.0 (which, as reported in my previous mail, works). Like the failed case, the file was created on two nodes with a total of 48 MPI ranks, using "NC_CLOBBER | NC_64BIT_OFFSET" as the file mode. Does this provide you with the information you were looking for? If you need anything else, please let me know!
>>>>
>>>> Regards,
>>>>
>>>> Michael
>>>>
>>>> P.S.: Dump of the header of the file as created with PnetCDF 1.5.0:
>>>>
>>>>
>>>> netcdf solution_00000000 {
>>>> dimensions:
>>>>     dim0 = 1207959552 ;
>>>> variables:
>>>>     double variables0(dim0) ;
>>>>         variables0:name = "u_U" ;
>>>>     double variables1(dim0) ;
>>>>         variables1:name = "v_U" ;
>>>>     double variables2(dim0) ;
>>>>         variables2:name = "w_U" ;
>>>>     double variables3(dim0) ;
>>>>         variables3:name = "p_U" ;
>>>>
>>>> // global attributes:
>>>>         :gridFile = "grid.Netcdf" ;
>>>>         :blockType = "DG" ;
>>>>         :timeStep = 0 ;
>>>>         :time = 0. ;
>>>>         :meta_creation_user = "xacmicha" ;
>>>>         :meta_creation_host = "nid07845" ;
>>>>         :meta_creation_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
>>>>         :meta_creation_date = "2015-11-30 20:27:09" ;
>>>>         :meta_creation_noDomains = 48 ;
>>>>         :meta_lastModified_user = "xacmicha" ;
>>>>         :meta_lastModified_host = "nid07845" ;
>>>>         :meta_lastModified_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
>>>>         :meta_lastModified_date = "2015-11-30 20:27:09" ;
>>>>         :meta_lastModified_noDomains = 48 ;
>>>> }
>>>>
>>>>> On 27 Nov 2015, at 08:19 , Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>>>
>>>>> Hi, Michael
>>>>>
>>>>> Could you please send me your config.log file (from building 1.6.1)?
>>>>>
>>>>> It would also help if you could describe the variables (how many
>>>>> there are, their dimensions and sizes). Also, is there any
>>>>> fixed-size variable larger than 2GB?
>>>>>
>>>>> You can set the run-time environment variable PNETCDF_SAFE_MODE to 1
>>>>> to enable metadata consistency checking in PnetCDF. That might
>>>>> print additional messages to stdout if an error is detected.
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Nov 26, 2015, at 11:04 PM, Schlottke-Lakemper, Michael wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> With versions 1.6.0 and 1.6.1 of Parallel netCDF, we get -62 errors (One or more variable sizes violate format constraints) under some conditions when working with NC_64BIT_OFFSET files in parallel. The error occurs mostly in parallel jobs with more than 16 MPI ranks (it has been seen with up to 4k ranks so far) and was reproduced on both GPFS and Lustre file systems. Beyond that, we could not find anything to narrow down the scope of the problem. Our current workaround is to use version 1.5.0 of Parallel netCDF, which has not yet produced this error, so from a user perspective this looks like a regression in the 1.6.x series.
>>>>>>
>>>>>> Any ideas what the problem could be or what we could do to narrow it down?
>>>>>>
>>>>>> Yours
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael Schlottke-Lakemper
>>>>>>
>>>>>> Chair of Fluid Mechanics and Institute of Aerodynamics
>>>>>> RWTH Aachen University
>>>>>> Wüllnerstraße 5a
>>>>>> 52062 Aachen
>>>>>> Germany
>>>>>>
>>>>>> Phone: +49 (241) 80 95188
>>>>>> Fax: +49 (241) 80 92257
>>>>>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>>>>>> Web: http://www.aia.rwth-aachen.de
>>>>>>
>>>>>
>>>>
>>>
>>
>
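
The size arithmetic and the CDF-2 rule described in the thread can be checked with a few lines of Python. This is only a minimal sketch under stated assumptions, not PnetCDF's actual check. The names cdf2_violations and CDF2_MAX_VAR_BYTES are hypothetical. The threshold of 2^32 - 4 bytes is assumed from the netCDF 64-bit offset format limitations page linked in the thread, not taken from the PnetCDF source. Only fixed-size variables are considered.

```python
GIB = 2**30

# Assumption: per-variable limit for fixed-size variables in CDF-2
# (NC_64BIT_OFFSET), as given in the netCDF format-limitations docs.
CDF2_MAX_VAR_BYTES = 2**32 - 4


def cdf2_violations(var_sizes, has_record_vars=False):
    """Return indices of fixed-size variables that break the CDF-2 size rule.

    var_sizes: sizes in bytes of the fixed-size variables, in definition order.
    has_record_vars: True if the file also defines record variables.
    """
    bad = []
    last = len(var_sizes) - 1
    for i, size in enumerate(var_sizes):
        if size <= CDF2_MAX_VAR_BYTES:
            continue
        # Exception from the thread: one oversized variable is allowed if it
        # is the last one defined and there are no record variables.
        if i == last and not has_record_vars:
            continue
        bad.append(i)
    return bad


# Values from the header dump: four double (8-byte) variables over dim0.
dim0 = 1207959552
var_bytes = dim0 * 8
sizes = [var_bytes] * 4

print(var_bytes / GIB)         # 9.0 (GiB per variable)
print(cdf2_violations(sizes))  # [0, 1, 2]
```

For the four 9 GiB variables in the header dump, the first three are flagged. The fourth is exempt only because it is the last variable defined and the file has no record variables.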