From wkliao at eecs.northwestern.edu  Tue Dec  1 00:28:58 2015
From: wkliao at eecs.northwestern.edu (Wei-keng Liao)
Date: Tue, 1 Dec 2015 00:28:58 -0600
Subject: Intermittent error with pnetcdf 1.6.x: One or more variable sizes violate format constraints
In-Reply-To: <5BE6AD73-CBEB-4AC8-BAD7-CA07C05F6410@aia.rwth-aachen.de>
References: <5BE6AD73-CBEB-4AC8-BAD7-CA07C05F6410@aia.rwth-aachen.de>
Message-ID:

Hi, Michael

NetCDF will support CDF-5 officially in its next release, v4.4.0; in the latest
release candidate, netCDF-4.4.0-rc4, the CDF-5 feature is already in place.
See the release notes about CDF-5 at the URL below.
https://github.com/Unidata/netcdf-c/releases

Wei-keng

On Nov 30, 2015, at 11:37 PM, Schlottke-Lakemper, Michael wrote:

> By the way, can you please tell me the status of CDF-5 support in NetCDF (https://github.com/wkliao/netcdf-c/tree/CDF-5)? Are there any plans to promote the support to the upstream NetCDF repository in the foreseeable future?
>
>> On 01 Dec 2015, at 06:11, Michael Schlottke-Lakemper wrote:
>>
>> Hi Wei-keng,
>>
>> I really should've read the documentation more carefully (for reference: https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/NetCDF-64-bit-Offset-Format-Limitations.html#NetCDF-64-bit-Offset-Format-Limitations and https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/FileLimits). The fact that Pnetcdf 1.5.0 seemed to work without problems really threw me off here. Thanks a lot for the quick clarification!
>>
>> Regards,
>>
>> Michael
>>
>>> On 01 Dec 2015, at 01:52, Wei-keng Liao wrote:
>>>
>>> Hi, Michael
>>>
>>> From the header, I can see that each of the 4 variables is of size 1207959552 x 8 bytes = 9 GiB.
>>> Defining variables larger than 4 GiB is not allowed in the CDF-2 format (i.e. NC_64BIT_OFFSET).
>>> There is one exception: a single fixed-size variable can be larger than 4 GiB if it is the
>>> last variable defined and there are no record variables.
>>>
>>> PnetCDF 1.5.0 fails to detect this error, indicating a bug in PnetCDF.
>>> 1.6.0 and 1.6.1 should have already fixed this problem.
>>>
>>> If you would like to define large variables, please consider the
>>> CDF-5 format, by using the NC_64BIT_DATA flag when creating a file.
>>>
>>>
>>> Wei-keng
>>>
>>> On Nov 30, 2015, at 4:39 PM, Schlottke-Lakemper, Michael wrote:
>>>
>>>> Hi Wei-keng,
>>>>
>>>> The config.log is not easily available for us, since we cannot reproduce the error on our department's cluster but only on a production system where we do not have direct access to the build system. If it would be helpful, however, I can try to investigate whether we can get hold of it.
>>>>
>>>> Setting PNETCDF_SAFE_MODE=1 did not produce any additional output.
>>>>
>>>> Below I have attached a header dump of the file that we are trying to write, which was created using Pnetcdf 1.5.0 (which, as reported in my previous mail, works). The file was created - like in the failed case - on two nodes with a total of 48 MPI ranks, and using "NC_CLOBBER | NC_64BIT_OFFSET" as the file mode. Does this provide you with the information you were looking for? If you need anything else, please let me know!
>>>>
>>>> Regards,
>>>>
>>>> Michael
>>>>
>>>> P.S.: Dump of the header of the file as created with Pnetcdf 1.5.0:
>>>>
>>>>
>>>> netcdf solution_00000000 {
>>>> dimensions:
>>>>         dim0 = 1207959552 ;
>>>> variables:
>>>>         double variables0(dim0) ;
>>>>                 variables0:name = "u_U" ;
>>>>         double variables1(dim0) ;
>>>>                 variables1:name = "v_U" ;
>>>>         double variables2(dim0) ;
>>>>                 variables2:name = "w_U" ;
>>>>         double variables3(dim0) ;
>>>>                 variables3:name = "p_U" ;
>>>>
>>>> // global attributes:
>>>>                 :gridFile = "grid.Netcdf" ;
>>>>                 :blockType = "DG" ;
>>>>                 :timeStep = 0 ;
>>>>                 :time = 0. ;
>>>>                 :meta_creation_user = "xacmicha" ;
>>>>                 :meta_creation_host = "nid07845" ;
>>>>                 :meta_creation_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
>>>>                 :meta_creation_date = "2015-11-30 20:27:09" ;
>>>>                 :meta_creation_noDomains = 48 ;
>>>>                 :meta_lastModified_user = "xacmicha" ;
>>>>                 :meta_lastModified_host = "nid07845" ;
>>>>                 :meta_lastModified_directory = "/lustre/cray/ws7/ws/xacmicha-fabian-0/dg_scaling/testcase-hornet/logs/2015-11-30_20.25.11_01.00" ;
>>>>                 :meta_lastModified_date = "2015-11-30 20:27:09" ;
>>>>                 :meta_lastModified_noDomains = 48 ;
>>>> }
>>>>
>>>>> On 27 Nov 2015, at 08:19, Wei-keng Liao wrote:
>>>>>
>>>>> Hi, Michael
>>>>>
>>>>> Could you please send me your config.log file (from building 1.6.1)?
>>>>>
>>>>> If you describe the variables (their number, dimensions, and sizes),
>>>>> that would be helpful. Also, is there any fixed-size variable larger
>>>>> than 2 GB?
>>>>>
>>>>> You can set the run-time environment variable PNETCDF_SAFE_MODE to 1
>>>>> to enable the metadata consistency checking in PnetCDF. That might
>>>>> print additional messages to stdout if an error is detected.
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Nov 26, 2015, at 11:04 PM, Schlottke-Lakemper, Michael wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> With the 1.6.0/1.6.1 versions of Parallel netCDF, under some conditions we get -62 errors (One or more variable sizes violate format constraints) when working with NC_64BIT_OFFSET files in parallel. It occurs mostly with parallel jobs > 16 MPI ranks (and was seen with up to 4k ranks so far) and was reproduced on both GPFS and Lustre file systems. Other than that, we could not find anything to narrow down the scope of the problem. Our current fix is to use the 1.5.0 version of Parallel netCDF, which has not yet produced this error, so from a user perspective this looks like a regression in the 1.6.x series.
>>>>>>
>>>>>> Any ideas what the problem could be or what we could do to narrow it down?
>>>>>>
>>>>>> Yours
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael Schlottke-Lakemper
>>>>>>
>>>>>> Chair of Fluid Mechanics and Institute of Aerodynamics
>>>>>> RWTH Aachen University
>>>>>> Wüllnerstraße 5a
>>>>>> 52062 Aachen
>>>>>> Germany
>>>>>>
>>>>>> Phone: +49 (241) 80 95188
>>>>>> Fax: +49 (241) 80 92257
>>>>>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>>>>>> Web: http://www.aia.rwth-aachen.de
>>>>>>
>>>>>
>>>>
>>>
>>
>
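To illustrate the NC_64BIT_DATA advice in the thread above, here is a minimal,
self-contained sketch of creating a CDF-5 file holding a fixed-size variable
larger than 4 GiB (file and variable names are made up for illustration; error
checking omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int err, ncid, dimid, varid;

        MPI_Init(&argc, &argv);
        /* NC_64BIT_DATA selects CDF-5, which lifts the CDF-2 limit of
         * 4 GiB per fixed-size variable */
        err = ncmpi_create(MPI_COMM_WORLD, "solution.nc",
                           NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);
        err = ncmpi_def_dim(ncid, "dim0", 1207959552LL, &dimid);
        /* 1207959552 doubles = 9 GiB per variable, fine in CDF-5 */
        err = ncmpi_def_var(ncid, "variables0", NC_DOUBLE, 1, &dimid, &varid);
        err = ncmpi_enddef(ncid);
        /* ... collective writes go here ... */
        err = ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }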
From wkliao at eecs.northwestern.edu  Tue Dec  1 01:06:45 2015
From: wkliao at eecs.northwestern.edu (Wei-keng Liao)
Date: Tue, 1 Dec 2015 01:06:45 -0600
Subject: Performance when reading many small variables
In-Reply-To: <274970FE-F96D-4A92-8C93-A2CCB8A0D6E4@aia.rwth-aachen.de>
References: <274970FE-F96D-4A92-8C93-A2CCB8A0D6E4@aia.rwth-aachen.de>
Message-ID: <7ACF2BB2-3C99-4AB5-BD86-792344E0B342@eecs.northwestern.edu>

Hi, Michael

You can use PnetCDF nonblocking APIs to read. The code fragment below shows
how to use nonblocking reads.

int reqs[2000], statuses[2000];

err = ncmpi_open(MPI_COMM_WORLD, filename, omode, MPI_INFO_NULL, &ncid);
for (i=0; i<2000; i++)
    err = ncmpi_iget_vara_int(ncid, varid[i], start, count, &buf[i], &reqs[i]);

err = ncmpi_wait_all(ncid, 2000, reqs, statuses);


If there is only one entry per variable, then you can use the var APIs and skip
the arguments start and count. For example:

for (i=0; i<2000; i++)
    err = ncmpi_iget_var_int(ncid, varid[i], &buf[i], &reqs[i]);



PnetCDF nonblocking APIs defer the requests until ncmpi_wait_all, where all
requests are aggregated into one big, single MPI I/O call. There are many example
programs (in C and Fortran) available in all PnetCDF releases, under the examples
directory. http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples

In addition, I suggest opening the input file using MPI_COMM_WORLD, so the
program can take advantage of MPI collective I/O for better performance, even if
all processes read the same data.

If your input file is generated by a PnetCDF program, then I suggest disabling
file offset alignment for the fixed-size (non-record) variables, given that there
is only one entry per variable. To disable alignment, you can use an MPI info
object, set nc_var_align_size to 1, and pass the info object to the ncmpi_create
call. Or you can set the same hint at run time. Please see
https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/HintsForPnetcdf
and
http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html

For further information, please check the Q&A at
http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html
and
http://cucis.ece.northwestern.edu/projects/PnetCDF

Wei-keng

On Nov 30, 2015, at 11:53 PM, Schlottke-Lakemper, Michael wrote:

> Dear all,
>
> We recently converted all of our code to use the Parallel netCDF library instead of the NetCDF library (before we had a mix), also using Pnetcdf for non-parallel file access. We did not have any issues whatsoever, until one user notified us of a performance regression in a particular case.
>
> He is trying to read many (O(2000)) variables from a single file in a loop, each variable with just one entry. Since this is very old code and usually only a few variables are involved, each process reads the same data individually. Before, the NetCDF library was used for this task, and during refactoring it was replaced by Pnetcdf with MPI_COMM_SELF. When using the code on a moderate number of MPI ranks (~500), the user noticed a severe performance degradation since switching to Pnetcdf:
>
> Before, reading the 2000 variables cumulatively amounted to ~0.6s. After switching to Pnetcdf (using ncmpi_get_vara_int_all), this number increased to ~300s. Going from MPI_COMM_SELF to MPI_COMM_WORLD reduced this number to ~30s, which is still high in comparison.
>
> What, if anything, can we do to get similar performance when using Pnetcdf in this particular case? I know this is a rather degenerate case and that one possible fix would be to change the layout to 1 variable with 2000 entries, but I was hoping that someone here has a suggestion what we could try anyway.
>
> Thanks a lot in advance
>
> Michael
>
>
> --
> Michael Schlottke-Lakemper
>
> SimLab Highly Scalable Fluids & Solids Engineering
> Jülich Aachen Research Alliance (JARA-HPC)
> RWTH Aachen University
> Wüllnerstraße 5a
> 52062 Aachen
> Germany
>
> Phone: +49 (241) 80 95188
> Fax: +49 (241) 80 92257
> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
> Web: http://www.jara.org/jara-hpc
>
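A minimal sketch of the nc_var_align_size hint described above (the file name
is made up; err and ncid are assumed to be declared as in the fragments above):

    MPI_Info info;
    MPI_Info_create(&info);
    /* align fixed-size variables to 1 byte, i.e. no padding between them */
    MPI_Info_set(info, "nc_var_align_size", "1");
    err = ncmpi_create(MPI_COMM_WORLD, "file.nc", NC_CLOBBER, info, &ncid);
    MPI_Info_free(&info);   /* the hint is used during create; safe to free */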
From m.schlottke-lakemper at aia.rwth-aachen.de  Mon Dec  7 23:18:10 2015
From: m.schlottke-lakemper at aia.rwth-aachen.de (Schlottke-Lakemper, Michael)
Date: Tue, 8 Dec 2015 05:18:10 +0000
Subject: Performance when reading many small variables
In-Reply-To: <7ACF2BB2-3C99-4AB5-BD86-792344E0B342@eecs.northwestern.edu>
References: <274970FE-F96D-4A92-8C93-A2CCB8A0D6E4@aia.rwth-aachen.de> <7ACF2BB2-3C99-4AB5-BD86-792344E0B342@eecs.northwestern.edu>
Message-ID:

Hi Wei-keng,

Thanks a lot for your elaborate answer. It might take us a while to implement your suggestions, but it gives us a good idea of where to start.

Michael

> On 01 Dec 2015, at 08:06, Wei-keng Liao wrote:
>
> Hi, Michael
>
> You can use PnetCDF nonblocking APIs to read. The code fragment below shows
> how to use nonblocking reads.
>
> int reqs[2000], statuses[2000];
>
> err = ncmpi_open(MPI_COMM_WORLD, filename, omode, MPI_INFO_NULL, &ncid);
> for (i=0; i<2000; i++)
>     err = ncmpi_iget_vara_int(ncid, varid[i], start, count, &buf[i], &reqs[i]);
>
> err = ncmpi_wait_all(ncid, 2000, reqs, statuses);
>
>
> If there is only one entry per variable, then you can use the var APIs and skip
> the arguments start and count. For example:
>
> for (i=0; i<2000; i++)
>     err = ncmpi_iget_var_int(ncid, varid[i], &buf[i], &reqs[i]);
>
>
>
> PnetCDF nonblocking APIs defer the requests until ncmpi_wait_all, where all
> requests are aggregated into one big, single MPI I/O call. There are many example
> programs (in C and Fortran) available in all PnetCDF releases, under the examples
> directory. http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples
>
> In addition, I suggest opening the input file using MPI_COMM_WORLD, so the
> program can take advantage of MPI collective I/O for better performance, even if
> all processes read the same data.
>
> If your input file is generated by a PnetCDF program, then I suggest disabling
> file offset alignment for the fixed-size (non-record) variables, given that there
> is only one entry per variable. To disable alignment, you can use an MPI info
> object, set nc_var_align_size to 1, and pass the info object to the ncmpi_create
> call. Or you can set the same hint at run time. Please see
> https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/HintsForPnetcdf
> and
> http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html
>
> For further information, please check the Q&A at
> http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html
> and
> http://cucis.ece.northwestern.edu/projects/PnetCDF
>
> Wei-keng
>
> On Nov 30, 2015, at 11:53 PM, Schlottke-Lakemper, Michael wrote:
>
>> Dear all,
>>
>> We recently converted all of our code to use the Parallel netCDF library instead of the NetCDF library (before we had a mix), also using Pnetcdf for non-parallel file access. We did not have any issues whatsoever, until one user notified us of a performance regression in a particular case.
>>
>> He is trying to read many (O(2000)) variables from a single file in a loop, each variable with just one entry. Since this is very old code and usually only a few variables are involved, each process reads the same data individually. Before, the NetCDF library was used for this task, and during refactoring it was replaced by Pnetcdf with MPI_COMM_SELF. When using the code on a moderate number of MPI ranks (~500), the user noticed a severe performance degradation since switching to Pnetcdf:
>>
>> Before, reading the 2000 variables cumulatively amounted to ~0.6s. After switching to Pnetcdf (using ncmpi_get_vara_int_all), this number increased to ~300s. Going from MPI_COMM_SELF to MPI_COMM_WORLD reduced this number to ~30s, which is still high in comparison.
>>
>> What, if anything, can we do to get similar performance when using Pnetcdf in this particular case? I know this is a rather degenerate case and that one possible fix would be to change the layout to 1 variable with 2000 entries, but I was hoping that someone here has a suggestion what we could try anyway.
>>
>> Thanks a lot in advance
>>
>> Michael
>>
>>
>> --
>> Michael Schlottke-Lakemper
>>
>> SimLab Highly Scalable Fluids & Solids Engineering
>> Jülich Aachen Research Alliance (JARA-HPC)
>> RWTH Aachen University
>> Wüllnerstraße 5a
>> 52062 Aachen
>> Germany
>>
>> Phone: +49 (241) 80 95188
>> Fax: +49 (241) 80 92257
>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>> Web: http://www.jara.org/jara-hpc
>>
>
From wkliao at eecs.northwestern.edu  Mon Dec  7 23:36:47 2015
From: wkliao at eecs.northwestern.edu (Wei-keng Liao)
Date: Mon, 7 Dec 2015 23:36:47 -0600
Subject: Performance when reading many small variables
In-Reply-To:
References: <274970FE-F96D-4A92-8C93-A2CCB8A0D6E4@aia.rwth-aachen.de> <7ACF2BB2-3C99-4AB5-BD86-792344E0B342@eecs.northwestern.edu>
Message-ID: <3D7F51EF-D9D0-401A-8B62-0B690DA5848A@eecs.northwestern.edu>

Hi, Michael

Another way is to have one process read all the data from the file and broadcast
them to all. In your example, O(2000) single-entry variables take up only 16 KB
of space. Broadcasting 16 KB should not take long on today's parallel computers.

To further improve the performance, you can apply the nonblocking API approach
on the root process using MPI_COMM_SELF, so those 2000 single-entry "get"
requests can be aggregated into one MPI file read.

Wei-keng

On Dec 7, 2015, at 11:18 PM, Schlottke-Lakemper, Michael wrote:

> Hi Wei-keng,
>
> Thanks a lot for your elaborate answer. It might take us a while to implement your suggestions, but it gives us a good idea of where to start.
>
> Michael
>
>> On 01 Dec 2015, at 08:06, Wei-keng Liao wrote:
>>
>> Hi, Michael
>>
>> You can use PnetCDF nonblocking APIs to read. The code fragment below shows
>> how to use nonblocking reads.
>>
>> int reqs[2000], statuses[2000];
>>
>> err = ncmpi_open(MPI_COMM_WORLD, filename, omode, MPI_INFO_NULL, &ncid);
>> for (i=0; i<2000; i++)
>>     err = ncmpi_iget_vara_int(ncid, varid[i], start, count, &buf[i], &reqs[i]);
>>
>> err = ncmpi_wait_all(ncid, 2000, reqs, statuses);
>>
>>
>> If there is only one entry per variable, then you can use the var APIs and skip
>> the arguments start and count. For example:
>>
>> for (i=0; i<2000; i++)
>>     err = ncmpi_iget_var_int(ncid, varid[i], &buf[i], &reqs[i]);
>>
>>
>>
>> PnetCDF nonblocking APIs defer the requests until ncmpi_wait_all, where all
>> requests are aggregated into one big, single MPI I/O call. There are many example
>> programs (in C and Fortran) available in all PnetCDF releases, under the examples
>> directory. http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples
>>
>> In addition, I suggest opening the input file using MPI_COMM_WORLD, so the
>> program can take advantage of MPI collective I/O for better performance, even if
>> all processes read the same data.
>>
>> If your input file is generated by a PnetCDF program, then I suggest disabling
>> file offset alignment for the fixed-size (non-record) variables, given that there
>> is only one entry per variable. To disable alignment, you can use an MPI info
>> object, set nc_var_align_size to 1, and pass the info object to the ncmpi_create
>> call. Or you can set the same hint at run time. Please see
>> https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/HintsForPnetcdf
>> and
>> http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html
>>
>> For further information, please check the Q&A at
>> http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html
>> and
>> http://cucis.ece.northwestern.edu/projects/PnetCDF
>>
>> Wei-keng
>>
>> On Nov 30, 2015, at 11:53 PM, Schlottke-Lakemper, Michael wrote:
>>
>>> Dear all,
>>>
>>> We recently converted all of our code to use the Parallel netCDF library instead of the NetCDF library (before we had a mix), also using Pnetcdf for non-parallel file access. We did not have any issues whatsoever, until one user notified us of a performance regression in a particular case.
>>>
>>> He is trying to read many (O(2000)) variables from a single file in a loop, each variable with just one entry. Since this is very old code and usually only a few variables are involved, each process reads the same data individually. Before, the NetCDF library was used for this task, and during refactoring it was replaced by Pnetcdf with MPI_COMM_SELF. When using the code on a moderate number of MPI ranks (~500), the user noticed a severe performance degradation since switching to Pnetcdf:
>>>
>>> Before, reading the 2000 variables cumulatively amounted to ~0.6s. After switching to Pnetcdf (using ncmpi_get_vara_int_all), this number increased to ~300s. Going from MPI_COMM_SELF to MPI_COMM_WORLD reduced this number to ~30s, which is still high in comparison.
>>>
>>> What, if anything, can we do to get similar performance when using Pnetcdf in this particular case? I know this is a rather degenerate case and that one possible fix would be to change the layout to 1 variable with 2000 entries, but I was hoping that someone here has a suggestion what we could try anyway.
>>>
>>> Thanks a lot in advance
>>>
>>> Michael
>>>
>>>
>>> --
>>> Michael Schlottke-Lakemper
>>>
>>> SimLab Highly Scalable Fluids & Solids Engineering
>>> Jülich Aachen Research Alliance (JARA-HPC)
>>> RWTH Aachen University
>>> Wüllnerstraße 5a
>>> 52062 Aachen
>>> Germany
>>>
>>> Phone: +49 (241) 80 95188
>>> Fax: +49 (241) 80 92257
>>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>>> Web: http://www.jara.org/jara-hpc
>>>
>>
>
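A minimal sketch combining the two suggestions above — the root process reads
all 2000 single-entry variables with one aggregated nonblocking flush and then
broadcasts the buffer (varid[] and filename are assumed to be set up already;
error checking omitted):

    int i, err, ncid, rank;
    int buf[2000], reqs[2000], statuses[2000];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {   /* only the root process touches the file */
        err = ncmpi_open(MPI_COMM_SELF, filename, NC_NOWRITE,
                         MPI_INFO_NULL, &ncid);
        for (i=0; i<2000; i++)
            err = ncmpi_iget_var_int(ncid, varid[i], &buf[i], &reqs[i]);
        /* all 2000 requests are flushed in one MPI file read */
        err = ncmpi_wait_all(ncid, 2000, reqs, statuses);
        err = ncmpi_close(ncid);
    }
    /* ship the small result buffer (2000 ints = 8 KB) to everyone */
    MPI_Bcast(buf, 2000, MPI_INT, 0, MPI_COMM_WORLD);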
From HLung at us.fujitsu.com  Thu Dec 24 14:32:42 2015
From: HLung at us.fujitsu.com (HLung at us.fujitsu.com)
Date: Thu, 24 Dec 2015 20:32:42 +0000
Subject: PNetCDF problem
In-Reply-To: <92534F6D-D6E8-4877-9DA8-EB159B5ABEE7@eecs.northwestern.edu>
References: <92534F6D-D6E8-4877-9DA8-EB159B5ABEE7@eecs.northwestern.edu>
Message-ID:

Hi Wei-keng,

I tried pnetcdf 1.6.1 with WRF, as you advised. This time the WRF calculation itself completed, but the program died again at ncmpix_put_int64 with the same address-not-aligned error (BUS_ADRALN). The output file is empty. This function is also in ncx.c.

I am wondering if this error is related to the input file or the WRF run. Does pnetcdf have any restrictions on the WRF input data?

Thanks,

Han

-----Original Message-----
From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu]
Sent: Wednesday, November 11, 2015 5:39 PM
To: Lung, Han
Cc: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: PNetCDF problem

Hi, Han

Thanks for reporting the problem you are encountering.
Since PnetCDF 1.3.0 is quite old now (more than 3 years), I wonder if you can try 1.6.1, the latest stable release. ncx.c has had a significant revision since 1.3.0.

We do not have access to Fujitsu compilers, so I would not be surprised by problems in your environment. Actually, I was recently informed that the same BUS_ADRALN error was observed when using Fujitsu compilers. I am in the process of getting an account on a machine with Fujitsu compilers, so I can debug this issue. Before I can do anything, could you at least try 1.6.1? Thanks.

Wei-keng

On Nov 11, 2015, at 6:42 PM, hlung at us.fujitsu.com wrote:

> Hi,
>
> This is Han Lung, Director of HPC Group, Fujitsu America, Inc.
>
> I am working on WRF 3.7 + PNetcdf 1.3.0 and got some errors. The test case is conus12km.
>
> I was told by wrfhelp that this error is related to the internals of PNetCDF, so I am writing to you and hope I can get some help from you.
>
> I traced the error to ncmpix_put_size_t(void **xpp, const MPI_Offset lp, int sizeof_t) in ncx.c of parallel-netcdf-1.3.0. Here is the part of the code that caused the error:
>
>
> #ifdef WORDS_BIGENDIAN
>     MPI_Offset *ptr = (MPI_Offset*) (*xpp); /* typecast to 8-byte integer */
>     *ptr = lp;    <-- error here
> #else
> ..
>
> This is an operation on the 8-byte integer lp (sizeof_t = 8). However,
> *xpp is incremented each time by 4 or 8, depending on sizeof_t:
> *xpp = (void *)((char *)(*xpp) + sizeof_t);
>
> Now, when *xpp is not on an 8-byte boundary (due to a previous 4-byte increment), the operation "*ptr = lp;" causes the address-not-aligned error (BUS_ADRALN).
>
> In my case, most of the time sizeof_t is 4, and there is no problem. The first two times sizeof_t is 8, *xpp is 591725472 and 591725720, respectively, which are on 8-byte boundaries, so there is no problem either. The third time sizeof_t is 8, *xpp is 591725972, which is not on an 8-byte boundary, and that caused the problem.
>
> I modified ncx.c to pad a 4-byte space if it's not on an 8-byte boundary. It did solve this misalignment problem but died at the free() call later. I am also not sure if this padding is the right approach, since the padded parts are all garbage.
>
> Have you seen this kind of error before? Any advice on how to resolve it?
>
> Thanks,
>
> Han
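For reference, the usual portable workaround for an unaligned store like the
one analyzed above is a byte-wise memcpy, sketched here purely as an
illustration (this is not necessarily the fix that later went into the PnetCDF
repository):

    #include <string.h>

    /* Inside the WORDS_BIGENDIAN branch, for the sizeof_t == 8 case.
     * Instead of
     *     *(MPI_Offset*)(*xpp) = lp;
     * which traps (BUS_ADRALN) on strict-alignment CPUs when *xpp is not
     * 8-byte aligned, copy the value byte by byte: */
    memcpy(*xpp, &lp, 8);    /* byte-wise copy, no alignment requirement */
    *xpp = (void *)((char *)(*xpp) + sizeof_t);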
From wkliao at eecs.northwestern.edu  Thu Dec 24 16:39:40 2015
From: wkliao at eecs.northwestern.edu (Wei-keng Liao)
Date: Thu, 24 Dec 2015 16:39:40 -0600
Subject: PNetCDF problem
In-Reply-To:
References: <92534F6D-D6E8-4877-9DA8-EB159B5ABEE7@eecs.northwestern.edu>
Message-ID: <24DBACB7-B896-4C06-A082-694F194DF2C5@eecs.northwestern.edu>

Hi Han,

The bug related to BUS_ADRALN has recently been fixed in the PnetCDF repo and will be available in the next release. If you would like to try the latest unreleased PnetCDF, please run the following commands to download the code.

svn checkout https://svn.mcs.anl.gov/repos/parallel-netcdf/trunk
cd trunk
autoreconf

Then, please run the configure and make commands you used before. If you are using Fujitsu compilers, then please read the file README.Fujitsu. You might want to add LDFLAGS="-L." to your configure command line.

Let me know if you encounter any further problems.

Wei-keng

On Dec 24, 2015, at 2:32 PM, hlung at us.fujitsu.com wrote:

> Hi Wei-keng,
>
> I tried pnetcdf 1.6.1 with WRF, as you advised. This time the WRF calculation itself completed, but the program died again at ncmpix_put_int64 with the same address-not-aligned error (BUS_ADRALN). The output file is empty. This function is also in ncx.c.
>
> I am wondering if this error is related to the input file or the WRF run. Does pnetcdf have any restrictions on the WRF input data?
>
> Thanks,
>
> Han
>
> -----Original Message-----
> From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu]
> Sent: Wednesday, November 11, 2015 5:39 PM
> To: Lung, Han
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: PNetCDF problem
>
> Hi, Han
>
> Thanks for reporting the problem you are encountering.
> Since PnetCDF 1.3.0 is quite old now (more than 3 years), I wonder if you can try 1.6.1, the latest stable release. ncx.c has had a significant revision since 1.3.0.
>
> We do not have access to Fujitsu compilers, so I would not be surprised by problems in your environment. Actually, I was recently informed that the same BUS_ADRALN error was observed when using Fujitsu compilers. I am in the process of getting an account on a machine with Fujitsu compilers, so I can debug this issue. Before I can do anything, could you at least try 1.6.1? Thanks.
>
> Wei-keng
>
> On Nov 11, 2015, at 6:42 PM, hlung at us.fujitsu.com wrote:
>
>> Hi,
>>
>> This is Han Lung, Director of HPC Group, Fujitsu America, Inc.
>>
>> I am working on WRF 3.7 + PNetcdf 1.3.0 and got some errors. The test case is conus12km.
>>
>> I was told by wrfhelp that this error is related to the internals of PNetCDF, so I am writing to you and hope I can get some help from you.
>>
>> I traced the error to ncmpix_put_size_t(void **xpp, const MPI_Offset lp, int sizeof_t) in ncx.c of parallel-netcdf-1.3.0. Here is the part of the code that caused the error:
>>
>>
>> #ifdef WORDS_BIGENDIAN
>>     MPI_Offset *ptr = (MPI_Offset*) (*xpp); /* typecast to 8-byte integer */
>>     *ptr = lp;    <-- error here
>> #else
>> ..
>>
>> This is an operation on the 8-byte integer lp (sizeof_t = 8). However,
>> *xpp is incremented each time by 4 or 8, depending on sizeof_t:
>> *xpp = (void *)((char *)(*xpp) + sizeof_t);
>>
>> Now, when *xpp is not on an 8-byte boundary (due to a previous 4-byte increment), the operation "*ptr = lp;" causes the address-not-aligned error (BUS_ADRALN).
>>
>> In my case, most of the time sizeof_t is 4, and there is no problem. The first two times sizeof_t is 8, *xpp is 591725472 and 591725720, respectively, which are on 8-byte boundaries, so there is no problem either. The third time sizeof_t is 8, *xpp is 591725972, which is not on an 8-byte boundary, and that caused the problem.
>>
>> I modified ncx.c to pad a 4-byte space if it's not on an 8-byte boundary. It did solve this misalignment problem but died at the free() call later. I am also not sure if this padding is the right approach, since the padded parts are all garbage.
>>
>> Have you seen this kind of error before? Any advice on how to resolve it?
>>
>> Thanks,
>>
>> Han
>

From HLung at us.fujitsu.com  Thu Dec 24 17:24:17 2015
From: HLung at us.fujitsu.com (HLung at us.fujitsu.com)
Date: Thu, 24 Dec 2015 23:24:17 +0000
Subject: PNetCDF problem
In-Reply-To: <24DBACB7-B896-4C06-A082-694F194DF2C5@eecs.northwestern.edu>
References: <92534F6D-D6E8-4877-9DA8-EB159B5ABEE7@eecs.northwestern.edu> <24DBACB7-B896-4C06-A082-694F194DF2C5@eecs.northwestern.edu>
Message-ID:

Hi Wei-keng,

I really appreciate your help. I will try to download this unreleased version and see how it works.

Best regards, and wishing you a very merry Christmas,

Han

-----Original Message-----
From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu]
Sent: Thursday, December 24, 2015 2:40 PM
To: Lung, Han
Cc: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: PNetCDF problem

Hi Han,

The bug related to BUS_ADRALN has recently been fixed in the PnetCDF repo and will be available in the next release. If you would like to try the latest unreleased PnetCDF, please run the following commands to download the code.

svn checkout https://svn.mcs.anl.gov/repos/parallel-netcdf/trunk
cd trunk
autoreconf

Then, please run the configure and make commands you used before. If you are using Fujitsu compilers, then please read the file README.Fujitsu. You might want to add LDFLAGS="-L." to your configure command line.

Let me know if you encounter any further problems.

Wei-keng

On Dec 24, 2015, at 2:32 PM, hlung at us.fujitsu.com wrote:

> Hi Wei-keng,
>
> I tried pnetcdf 1.6.1 with WRF, as you advised. This time the WRF calculation itself completed, but the program died again at ncmpix_put_int64 with the same address-not-aligned error (BUS_ADRALN). The output file is empty. This function is also in ncx.c.
>
> I am wondering if this error is related to the input file or the WRF run. Does pnetcdf have any restrictions on the WRF input data?
>
> Thanks,
>
> Han
>
> -----Original Message-----
> From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu]
> Sent: Wednesday, November 11, 2015 5:39 PM
> To: Lung, Han
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: PNetCDF problem
>
> Hi, Han
>
> Thanks for reporting the problem you are encountering.
> Since PnetCDF 1.3.0 is quite old now (more than 3 years), I wonder if you can try 1.6.1, the latest stable release. ncx.c has had a significant revision since 1.3.0.
>
> We do not have access to Fujitsu compilers, so I would not be surprised by problems in your environment. Actually, I was recently informed that the same BUS_ADRALN error was observed when using Fujitsu compilers. I am in the process of getting an account on a machine with Fujitsu compilers, so I can debug this issue. Before I can do anything, could you at least try 1.6.1? Thanks.
>
> Wei-keng
>
> On Nov 11, 2015, at 6:42 PM, hlung at us.fujitsu.com wrote:
>
>> Hi,
>>
>> This is Han Lung, Director of HPC Group, Fujitsu America, Inc.
>>
>> I am working on WRF 3.7 + PNetcdf 1.3.0 and got some errors. The test case is conus12km.
>>
>> I was told by wrfhelp that this error is related to the internals of PNetCDF, so I am writing to you and hope I can get some help from you.
>>
>> I traced the error to ncmpix_put_size_t(void **xpp, const MPI_Offset lp, int sizeof_t) in ncx.c of parallel-netcdf-1.3.0. Here is the part of the code that caused the error:
>>
>>
>> #ifdef WORDS_BIGENDIAN
>>     MPI_Offset *ptr = (MPI_Offset*) (*xpp); /* typecast to 8-byte integer */
>>     *ptr = lp;    <-- error here
>> #else
>> ..
>>
>> This is an operation on the 8-byte integer lp (sizeof_t = 8).
>> However,
>> *xpp is incremented each time by 4 or 8, depending on sizeof_t:
>> *xpp = (void *)((char *)(*xpp) + sizeof_t);
>>
>> Now, when *xpp is not on an 8-byte boundary (due to a previous 4-byte increment), the operation "*ptr = lp;" causes the address-not-aligned error (BUS_ADRALN).
>>
>> In my case, most of the time sizeof_t is 4, and there is no problem. The first two times sizeof_t is 8, *xpp is 591725472 and 591725720, respectively, which are on 8-byte boundaries, so there is no problem either. The third time sizeof_t is 8, *xpp is 591725972, which is not on an 8-byte boundary, and that caused the problem.
>>
>> I modified ncx.c to pad a 4-byte space if it's not on an 8-byte boundary. It did solve this misalignment problem but died at the free() call later. I am also not sure if this padding is the right approach, since the padded parts are all garbage.
>>
>> Have you seen this kind of error before? Any advice on how to resolve it?
>>
>> Thanks,
>>
>> Han
>

From phanisri123 at gmail.com  Thu Dec 31 03:40:22 2015
From: phanisri123 at gmail.com (phani sri)
Date: Thu, 31 Dec 2015 15:10:22 +0530
Subject: error in installing Pnetcdf
Message-ID:

Sir/Madam,

I am installing Pnetcdf-1.6.1. During installation, it first showed a problem in configuring:

configure: error: F77 does not support "integer*8"

After searching and finding http://lists.mcs.anl.gov/pipermail/parallel-netcdf/2011-June/001196.html, I made some changes to the configure file. It then configured without errors, but "make install" and "make check" show the errors in the attached screenshots. Please help me in some way.

--
D.P.S.L. Kameswari, Ph.D. (Physics)
ACRHEM
University of Hyderabad, Hyderabad
e-mail: phanisri123 at gmail.com

[Attachments scrubbed from the archive: an HTML copy of this message and five JPEG screenshots of the build errors (pnet.jpeg, pnet1.jpeg, pnetcdf.jpeg, pnetcdf1.jpeg, pnetcdf2.jpeg).]