Problems with testing PNETCDF 1.6.1

Wei-keng Liao wkliao at eecs.northwestern.edu
Tue Feb 2 16:36:39 CST 2016


Hi, Eric

Unfortunately, I do not have access to an SGI machine. We usually
rely on our users to do some initial debugging in situations like this.
I know this can be too much to ask, but if you did not encounter any
problems when running your program, maybe you can ignore this test.

However, since the hanging occurs only with SGI MPT, I suspect it is
related to MPT.
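
If it helps to tell a real hang from a slow run, the test command could be
wrapped with a time limit. Below is a minimal Python sketch, not part of
PnetCDF: the helper name run_with_timeout and the time limit are made up
for illustration, while the command line and the PNETCDF_SAFE_MODE variable
are the ones already mentioned in this thread.

```python
import os
import subprocess


def run_with_timeout(cmd, timeout_s, safe_mode=True):
    """Run cmd and return its exit code, or None if it hit the time limit.

    Hypothetical helper for illustration only. With safe_mode=True the
    child process sees PNETCDF_SAFE_MODE=1 in its environment.
    """
    env = dict(os.environ)
    if safe_mode:
        env["PNETCDF_SAFE_MODE"] = "1"
    try:
        # subprocess.run kills the child once the timeout expires.
        return subprocess.run(cmd, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example call (the 600-second limit is an arbitrary guess):
#   rc = run_with_timeout(
#       ["mpiexec_mpt", "-n", "4", "./flash_benchmark_io", "./flash_io_test_"],
#       timeout_s=600)
#   rc is None  -> the job was still running at the limit (likely hung)
```

Only the launcher process is killed at the time limit. Whether the MPI
ranks are cleaned up as well depends on the launcher and the batch system,
so it may still be necessary to cancel the job by hand.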

Could you check one thing for me? After you kill the FLASH-IO job, could
you check if any netCDF files were created? The expected files and their
sizes are

-rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
-rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
-rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
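
The check on the files listed above could be scripted roughly as follows.
This is only a sketch: the helper name check_outputs is made up for
illustration, and it assumes the files are written to the directory the
benchmark was run from.

```python
import os

# Expected output files and their sizes in bytes, taken from the listing above.
EXPECTED = {
    "flash_io_test_ncmpi_chk_0000.nc": 254075392,
    "flash_io_test_ncmpi_plt_cnt_0000.nc": 21208576,
    "flash_io_test_ncmpi_plt_crn_0000.nc": 25431372,
}


def check_outputs(directory=".", expected=EXPECTED):
    """Return {filename: status}; status is 'missing', 'ok', or a size mismatch."""
    report = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            report[name] = "missing"
            continue
        got = os.path.getsize(path)
        report[name] = "ok" if got == want else f"size {got} != {want}"
    return report


# Example call, run from benchmarks/FLASH-IO after killing the job:
#   for name, status in check_outputs().items():
#       print(f"{name}: {status}")
```

A missing file or a size smaller than expected would hint at how far the
benchmark got before it stopped making progress.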


Wei-keng

On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:

> 
> Hi Wei-keng:
> 
> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.  FLASH-IO still hangs with SGI MPT (with no error message), but it works fine with Intel MPI.
> 
> -Eric
> 
> 
> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
> Date: Tuesday, February 2, 2016 12:39 PM
> To: Eric Kemp <eric.kemp at nasa.gov>
> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
> Subject: Re: Problems with testing PNETCDF 1.6.1
> 
> Hi, Eric
> 
> Sorry for sending the wrong file. The correct one is attached, in case you would like
> to use it.
> 
> I checked your config.log file but could not find anything fishy.
> I just now tested it with Intel compiler 16.0.0.109 without a problem.
> Could you try running FLASH-IO under the safe mode? i.e. set the environment
> variable PNETCDF_SAFE_MODE to 1. It will enable internal checking for
> data inconsistency.
> 
> Just want to make sure for 1.7.0.pre1 that your "make ptest" failed only on FLASH-IO.
> Because FLASH-IO is the last test program, this means all other tests have passed.
> Let me know. Thanks.
> 
> Wei-keng
> 
> 
> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
> 
> > 
> > Hi Wei-keng:
> > 
> > I think you sent me the wrong copy of that file — it was identical to what is in 1.7.0.pre1.  But I went ahead and added "cmd" as an argument to subroutine check_err, and that test code compiles and runs.
> > 
> > The large file tests pass in 1.7.0.pre1 as you indicated. However, FLASH-IO still hangs with SGI MPT.  I took your suggestion and tried running this test separately (cd benchmarks/FLASH-IO ; make ptest) but the code still hangs.
> > 
> > I've attached the (gzipped) config.log file from the 1.7.0.pre1 installations.
> > 
> > Thanks,
> > 
> > -Eric
> > 
> > Eric M. Kemp (SSAI) 
> > NASA/GSFC 
> > Mail Code: 606 
> > Greenbelt, MD 20771 
> > 301.286.9768 
> > eric.kemp at nasa.gov
> > eric.kemp at ssaihq.com
> > 
> > 
> > From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
> > Date: Monday, February 1, 2016 5:12 PM
> > To: Eric Kemp <eric.kemp at nasa.gov>
> > Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
> > Subject: Re: Problems with testing PNETCDF 1.6.1
> > 
> > Hi, Eric,
> > 
> > Thanks for reporting the error. This is another oversight, sorry.
> > The fixed file, bigrecords.f, is attached.
> > 
> > 
> > Wei-keng
> > 
> > 
> > 
> > On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
> > 
> > > 
> > > Hi Wei-keng:
> > > 
> > > Thanks for your quick response. I tried installing 1.7.0.pre1 but I get a
> > > different error when compiling the tests:
> > > 
> > > /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib
> > > -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model
> > > strict  -c bigrecords.f
> > > bigrecords.f(333): error #6514: A substring must be of type CHARACTER.
> > > [CMD]
> > >          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
> > > ------------------------------------^
> > > bigrecords.f(333): error #6054: A CHARACTER data type is required in this
> > > context.   [CMD]
> > >          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
> > > ------------------------------------^
> > > compilation aborted for bigrecords.f (code 1)
> > > 
> > > 
> > > 
> > > This appears to be a legitimate syntax error in the test program, in
> > > subroutine check_err.  "cmd" is not defined in that subroutine, nor is it
> > > a global variable.
> > > 
> > > I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
> > > 
> > > -Eric
> > > 
> > > 
> > > 
> > > 
> > > On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> > > 
> > >> Hi, Eric
> > >> 
> > >> For the large file tests, the error is caused by an oversight: the
> > >> wrong flag was used.
> > >> Line 81 of file large_files.c should have used NC_64BIT_DATA instead of
> > >> NC_64BIT_OFFSET.
> > >> This error has been fixed in the pre-release 1.7.0.pre1. Could you
> > >> give it a try?
> > >> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
> > >> 
> > >> As for the FLASH-IO test, could you try running it alone? I.e. cd to the
> > >> folder
> > >> benchmarks/FLASH-IO and run "make ptest" there. In the meantime, please
> > >> send me
> > >> the file config.log.
> > >> 
> > >> 
> > >> Wei-keng
> > >> 
> > >> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
> > >> 
> > >>> 
> > >>> Dear PNETCDF developers:
> > >>> 
> > >>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running SLES
> > >>> 11.3.  I'm using Intel 15 Fortran and C compilers (no C++), and I'm
> > >>> trying to install for two separate MPI implementations (SGI MPT 2.12 and
> > >>> Intel MPI 5.1.2).
> > >>> 
> > >>> I'm encountering two problems when I run 'make ptest'.
> > >>> 
> > >>> 1)  For both MPI implementations, the large file tests fail with an
> > >>> integer overflow.  The error message is:
> > >>> 
> > >>> *** Testing large files, slowly.
> > >>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
> > >>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018:
> > >>> task 0: Exited with exit code 1
> > >>> 
> > >>> I reviewed the README.large_files for guidance, and I can confirm that
> > >>> both 'MPI_Offset' and 'off_t' are 8 bytes.
> > >>> 
> > >>> 2) For SGI MPT only, if I disable support for large file tests, 'make
> > >>> ptest' hangs when testing FLASH-IO:
> > >>> 
> > >>> make -w -C FLASH-IO ptest
> > >>> make[2]: Entering directory
> > >>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
> > >>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
> > >>> srun.slurm: cluster configuration lacks support for cpu binding
> > >>> 
> > >>> The earlier tests with both single and multiple processes work for SGI
> > >>> MPT. And all tests (again, excluding large file tests) work for Intel
> > >>> MPI.
> > >>> 
> > >>> I can provide more information (e.g., output from the configure script)
> > >>> upon request.
> > >>> 
> > >>> Thanks,
> > >>> 
> > >>> -Eric
> > >>> 
> > >>> Eric M. Kemp (SSAI)
> > >>> NASA/GSFC 
> > >>> Mail Code: 606 
> > >>> Greenbelt, MD 20771
> > >>> 301.286.9768 
> > >>> eric.kemp at nasa.gov
> > >>> eric.kemp at ssaihq.com
> > >>> 
> > >> 
> > > 
> > 
> > <config.log.gz>
> 


