Intermittent error with pnetcdf 1.6.x: One or more variable sizes violate format constraints

Wei-keng Liao wkliao at eecs.northwestern.edu
Mon Nov 30 18:52:22 CST 2015


Hi, Michael

From the header, I can see that each of the 4 variables has a size of
1207959552 x 8 bytes = 9 GiB.
Defining variables larger than 4 GiB is not allowed in the CDF-2 format (i.e. NC_64BIT_OFFSET).
There is one exception: a single fixed-size variable can be larger than 4 GiB if it is the
last variable defined and there are no record variables.
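
As a rough illustration of the rule above, here is a minimal sketch in plain C
that checks a list of fixed-size variable sizes against it. This is not PnetCDF
code: the function name cdf2_fixed_sizes_ok and the macro name are hypothetical.
The exact threshold of 2^32 - 4 bytes is taken from the netCDF classic-format
documentation (the text above rounds it to 4 GiB), and the limits that apply to
record variables are not checked at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed per-variable limit for fixed-size variables in CDF-2:
 * 2^32 - 4 bytes (roughly 4 GiB). */
#define CDF2_MAX_FIXED_VAR_BYTES UINT64_C(4294967292)

/* Hypothetical helper, not part of the PnetCDF API.
 * var_bytes lists the sizes in bytes of the fixed-size variables in the
 * order they were defined. Returns 1 if the sizes satisfy the rule, 0 if
 * they violate it. Only the last variable may exceed the limit, and only
 * when the file has no record variables. */
int cdf2_fixed_sizes_ok(const uint64_t *var_bytes, size_t nvars,
                        int has_record_vars)
{
    for (size_t i = 0; i < nvars; i++) {
        if (var_bytes[i] <= CDF2_MAX_FIXED_VAR_BYTES)
            continue;                      /* within the limit */
        int is_last = (i + 1 == nvars);
        if (!is_last || has_record_vars)
            return 0;                      /* oversized and not exempt */
    }
    return 1;
}
```

With the sizes from the header dump (four variables of 1207959552 x 8 bytes
each and no record variables), the check returns 0, because the first three
variables exceed the limit and are not last.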

PnetCDF 1.5.0 fails to detect this error, which indicates a bug in that version.
Versions 1.6.0 and 1.6.1 should already have fixed this problem.

If you would like to define large variables, please consider the
CDF-5 format, which you select with the NC_64BIT_DATA flag when creating a file.


Wei-keng

On Nov 30, 2015, at 4:39 PM, Schlottke-Lakemper, Michael wrote:

> Hi Wei-keng,
> 
> The config.log is not easily available for us, since we cannot reproduce the error on our department’s cluster but only on a production system where we do not have direct access to the build system. If it would be helpful, however, I can try to investigate if we can get hold of it.
> 
> Setting PNETCDF_SAFE_MODE=1 did not produce any additional output.
> 
> Below I have attached a header dump of the file that we are trying to write, which was created using PnetCDF 1.5.0 (which, as reported in my previous mail, works). The file was created, as in the failed case, on two nodes with a total of 48 MPI ranks, using "NC_CLOBBER | NC_64BIT_OFFSET" as the file mode. Does this provide you with the information you were looking for? If you need anything else, please let me know!
> 
> Regards,
> 
> Michael
> 
> P.S.: Dump of the header of the file as created with Pnetcdf 1.5.0:
> 
> 
> netcdf solution_00000000 {
> dimensions:
> 	dim0 = 1207959552 ;
> variables:
> 	double variables0(dim0) ;
> 		variables0:name = "u_U" ;
> 	double variables1(dim0) ;
> 		variables1:name = "v_U" ;
> 	double variables2(dim0) ;
> 		variables2:name = "w_U" ;
> 	double variables3(dim0) ;
> 		variables3:name = "p_U" ;
> 
> // global attributes:
> 		:gridFile = "grid.Netcdf" ;
> 		:blockType = "DG" ;
> 		:timeStep = 0 ;
> 		:time = 0. ;
> 		:meta_creation_user = "xacmicha" ;
> 		:meta_creation_host = "nid07845" ;
> 		:meta_creation_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
> 		:meta_creation_date = "2015-11-30 20:27:09" ;
> 		:meta_creation_noDomains = 48 ;
> 		:meta_lastModified_user = "xacmicha" ;
> 		:meta_lastModified_host = "nid07845" ;
> 		:meta_lastModified_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
> 		:meta_lastModified_date = "2015-11-30 20:27:09" ;
> 		:meta_lastModified_noDomains = 48 ;
> }
> 
>> On 27 Nov 2015, at 08:19 , Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>> 
>> Hi, Michael
>> 
>> Could you please send me your config.log file (from building 1.6.1)?
>> 
>> If you describe the variables (number, their dimensions and sizes),
>> it can be helpful. Also, is there any fixed-size variable larger
>> than 2GB?
>> 
>> You can set the run-time environment variable PNETCDF_SAFE_MODE to 1
>> to enable the metadata consistency checking in PnetCDF. That might
>> print additional messages in stdout, if an error is detected.
>> 
>> Wei-keng
>> 
>> On Nov 26, 2015, at 11:04 PM, Schlottke-Lakemper, Michael wrote:
>> 
>>> Hi folks,
>>> 
>>> With the 1.6.0/1.6.1 versions of Parallel netCDF, under some conditions we get -62 errors (One or more variable sizes violate format constraints) when working with NC_64BIT_OFFSET files in parallel. It occurs mostly with parallel jobs > 16 MPI ranks (and was seen with up to 4k ranks so far) and was reproduced both on GPFS as well as Lustre file systems. Other than that, we could not find anything to narrow down the scope of the problem. Our current fix is to use the 1.5.0 version of Parallel netCDF, which has not yet produced this error, thus from a user perspective this seems like a regression in the 1.6.x series.
>>> 
>>> Any ideas what the problem could be or what we could do to narrow it down?
>>> 
>>> Yours
>>> 
>>> Michael
>>> 
>>> 
>>> --
>>> Michael Schlottke-Lakemper
>>> 
>>> Chair of Fluid Mechanics and Institute of Aerodynamics
>>> RWTH Aachen University
>>> Wüllnerstraße 5a
>>> 52062 Aachen
>>> Germany
>>> 
>>> Phone: +49 (241) 80 95188
>>> Fax: +49 (241) 80 92257
>>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>>> Web: http://www.aia.rwth-aachen.de
>>> 
>> 
> 


