Intermittent error with pnetcdf 1.6.x: One or more variable sizes violate format constraints

Schlottke-Lakemper, Michael m.schlottke-lakemper at aia.rwth-aachen.de
Mon Nov 30 23:11:27 CST 2015


Hi Wei-keng,

I really should have read the documentation more carefully (for reference: https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/NetCDF-64-bit-Offset-Format-Limitations.html#NetCDF-64-bit-Offset-Format-Limitations and https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/FileLimits). The fact that PnetCDF 1.5.0 seemed to work without problems really threw me off here. Thanks a lot for the quick clarification!

Regards,

Michael

On 01 Dec 2015, at 01:52, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:

Hi, Michael

From the header, I can see that each of the 4 variables is of size 1207959552 x 8 bytes = 9 GiB.
Defining variables larger than 4 GiB is not allowed in the CDF-2 format (i.e. NC_64BIT_OFFSET).
There is one exception: a single fixed-size variable can be larger than 4 GiB if it is the
last variable defined and there are no record variables.
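
The size arithmetic and the CDF-2 rule above can be sketched in plain C, without calling PnetCDF. The helper names cdf_var_bytes and fits_cdf2 are hypothetical and exist only for this illustration. The check uses the 4 GiB threshold as stated above; the format documentation gives the exact byte limit. This is a simplified sketch, not what PnetCDF does internally.

```c
#include <stdint.h>

/* Threshold taken from the message above: 4 GiB per fixed-size variable
 * in CDF-2 (NC_64BIT_OFFSET). Assumption: the exact limit in the format
 * documentation may differ by a few bytes. */
#define CDF2_MAX_VAR_BYTES (UINT64_C(4) << 30)

/* Hypothetical helper: storage needed by a fixed-size variable with
 * nelems elements of elem_size bytes each. */
uint64_t cdf_var_bytes(uint64_t nelems, uint64_t elem_size)
{
    return nelems * elem_size;
}

/* Hypothetical helper: returns 1 if the fixed-size variables, given in
 * definition order, satisfy the CDF-2 rule described above, 0 otherwise. */
int fits_cdf2(const uint64_t *var_bytes, int nvars, int has_record_vars)
{
    for (int i = 0; i < nvars; i++) {
        if (var_bytes[i] <= CDF2_MAX_VAR_BYTES)
            continue;                    /* small enough, always allowed */
        if (i == nvars - 1 && !has_record_vars)
            continue;                    /* exception: last variable, no record variables */
        return 0;                        /* too large and not covered by the exception */
    }
    return 1;
}
```

For the header dumped below (four double variables of 1207959552 elements each, no record variables), cdf_var_bytes(1207959552, 8) gives 9663676416 bytes, i.e. 9 GiB, and fits_cdf2 returns 0 because the first three variables exceed the threshold without being the last variable defined.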

PnetCDF 1.5.0 fails to detect this error, which indicates a bug in that version.
1.6.0 and 1.6.1 should already have fixed this problem.

If you would like to define large variables, please consider the
CDF-5 format, by using the NC_64BIT_DATA flag when creating the file.


Wei-keng

On Nov 30, 2015, at 4:39 PM, Schlottke-Lakemper, Michael wrote:

Hi Wei-keng,

The config.log is not easily available for us, since we cannot reproduce the error on our department’s cluster but only on a production system where we do not have direct access to the build system. If it would be helpful, however, I can try to investigate if we can get hold of it.

Setting PNETCDF_SAFE_MODE=1 did not produce any additional output.

Below I have attached a header dump of the file that we are trying to write, which was created using PnetCDF 1.5.0 (which, as reported in my previous mail, works). The file was created - like in the failed case - on two nodes with a total of 48 MPI ranks, and using "NC_CLOBBER | NC_64BIT_OFFSET" as the file mode. Does this provide you with the information you were looking for? If you need anything else, please let me know!

Regards,

Michael

P.S.: Dump of the header of the file as created with Pnetcdf 1.5.0:


netcdf solution_00000000 {
dimensions:
    dim0 = 1207959552 ;
variables:
    double variables0(dim0) ;
        variables0:name = "u_U" ;
    double variables1(dim0) ;
        variables1:name = "v_U" ;
    double variables2(dim0) ;
        variables2:name = "w_U" ;
    double variables3(dim0) ;
        variables3:name = "p_U" ;

// global attributes:
        :gridFile = "grid.Netcdf" ;
        :blockType = "DG" ;
        :timeStep = 0 ;
        :time = 0. ;
        :meta_creation_user = "xacmicha" ;
        :meta_creation_host = "nid07845" ;
        :meta_creation_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
        :meta_creation_date = "2015-11-30 20:27:09" ;
        :meta_creation_noDomains = 48 ;
        :meta_lastModified_user = "xacmicha" ;
        :meta_lastModified_host = "nid07845" ;
        :meta_lastModified_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
        :meta_lastModified_date = "2015-11-30 20:27:09" ;
        :meta_lastModified_noDomains = 48 ;
}

On 27 Nov 2015, at 08:19, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:

Hi, Michael

Could you please send me your config.log file (from building 1.6.1)?

It would also help if you could describe the variables (their number,
dimensions and sizes). Also, is there any fixed-size variable larger
than 2GB?

You can set the run-time environment variable PNETCDF_SAFE_MODE to 1
to enable the metadata consistency checking in PnetCDF. That might
print additional messages in stdout, if an error is detected.

Wei-keng

On Nov 26, 2015, at 11:04 PM, Schlottke-Lakemper, Michael wrote:

Hi folks,

With the 1.6.0/1.6.1 versions of Parallel netCDF, under some conditions we get -62 errors (One or more variable sizes violate format constraints) when working with NC_64BIT_OFFSET files in parallel. It occurs mostly with parallel jobs of more than 16 MPI ranks (and was seen with up to 4k ranks so far) and was reproduced on both GPFS and Lustre file systems. Other than that, we could not find anything to narrow down the scope of the problem. Our current workaround is to use the 1.5.0 version of Parallel netCDF, which has not yet produced this error. From a user perspective, this therefore seems like a regression in the 1.6.x series.

Any ideas what the problem could be or what we could do to narrow it down?

Yours

Michael


--
Michael Schlottke-Lakemper

Chair of Fluid Mechanics and Institute of Aerodynamics
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: m.schlottke-lakemper at aia.rwth-aachen.de
Web: http://www.aia.rwth-aachen.de




