Problems with testing PNETCDF 1.6.1

Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] eric.kemp at nasa.gov
Wed Feb 3 07:08:01 CST 2016


Good morning Wei-keng:

I reran the FLASH-IO test from 1.7.0.pre1. The only output from the program is:

-rw-r--r-- 1 emkemp k3002 2964 Feb  3 07:53 flash_io_test_ncmpi_chk_0000.nc

I will tinker with the source code to see if I can identify where it hangs.
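One common way to find where an MPI program hangs is to print a checkpoint from every rank and flush it right away, so the last line each rank emits shows how far it got before blocking. The sketch below is a hypothetical C helper for that pattern. The names `format_trace`, `trace_point`, and `TRACE` are illustrative and are not part of PnetCDF or FLASH-IO. In a real MPI program the rank would come from `MPI_Comm_rank`, and the same print-and-flush idea carries over to Fortran sources.

```c
#include <stdio.h>

/* Hypothetical helper: format one checkpoint line for a given rank and
 * source location. Returns the length snprintf reports. */
int format_trace(char *buf, size_t len, int rank, const char *file, int line)
{
    return snprintf(buf, len, "[rank %d] reached %s:%d\n", rank, file, line);
}

/* Print the checkpoint to stderr and flush immediately, so the message is
 * not lost in a buffer if the process later hangs and is killed. */
void trace_point(int rank, const char *file, int line)
{
    char buf[256];
    format_trace(buf, sizeof buf, rank, file, line);
    fputs(buf, stderr);
    fflush(stderr);
}

/* Convenience macro: records the file and line where it is placed.
 * `rank` is assumed to be obtained elsewhere (e.g. from MPI_Comm_rank). */
#define TRACE(rank) trace_point((rank), __FILE__, __LINE__)
```

Placing `TRACE(rank);` before and after each suspect I/O call narrows the hang to the first call that some rank enters but never leaves.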


Thanks,

-Eric



On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:

>Hi, Eric
>
>Unfortunately, I do not have access to an SGI machine. We usually
>rely on our users to do some initial debugging in situations like
>this.
>I know this may be a lot to ask, but if you did not encounter any
>problem when running your program, maybe you can ignore this test.
>
>However, since the hang occurs only with SGI MPT, I suspect it is
>related to MPT.
>
>Could you check one thing for me? After you kill the FLASH-IO job,
>please check whether any netCDF files were created. The expected files
>and their sizes are:
>
>-rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
>-rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>-rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
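Checking for those files can be scripted. The following is a minimal C sketch, assuming the files are written to the current working directory. It compares each file's size with the values in the listing using POSIX `stat()`. The names `check_file` and `check_flash_io_outputs` are hypothetical and not part of the FLASH-IO benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Expected FLASH-IO outputs and their sizes in bytes, taken from the
 * directory listing in the message. */
struct expected_file {
    const char *name;
    off_t size;
};

static const struct expected_file flash_io_outputs[] = {
    {"flash_io_test_ncmpi_chk_0000.nc",     254075392},
    {"flash_io_test_ncmpi_plt_cnt_0000.nc",  21208576},
    {"flash_io_test_ncmpi_plt_crn_0000.nc",  25431372},
};

/* Returns 0 if `path` exists with the expected size, 1 if it is missing,
 * and 2 if it exists but has a different size. */
int check_file(const char *path, off_t expected)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        printf("%-40s missing\n", path);
        return 1;
    }
    if (st.st_size != expected) {
        printf("%-40s %lld bytes (expected %lld)\n", path,
               (long long)st.st_size, (long long)expected);
        return 2;
    }
    printf("%-40s OK\n", path);
    return 0;
}

/* Checks every expected output in the current directory and returns the
 * number of files that are missing or have an unexpected size. */
int check_flash_io_outputs(void)
{
    int bad = 0;
    size_t n = sizeof flash_io_outputs / sizeof flash_io_outputs[0];
    for (size_t i = 0; i < n; i++)
        if (check_file(flash_io_outputs[i].name, flash_io_outputs[i].size) != 0)
            bad++;
    return bad;
}
```

In the run reported at the top of this thread, only `flash_io_test_ncmpi_chk_0000.nc` existed, at 2964 bytes. For that run, `check_flash_io_outputs()` would report one size mismatch and two missing files, returning 3.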
>
>
>Wei-keng
>
>On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>AND APPLICATIONS INC] wrote:
>
>> 
>> Hi Wei-keng:
>> 
>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>> FLASH-IO still hangs with SGI MPT (with no error message), but it works
>> fine with Intel MPI.
>> 
>> -Eric
>> 
>> 
>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>> Date: Tuesday, February 2, 2016 12:39 PM
>> To: Eric Kemp <eric.kemp at nasa.gov>
>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>> Subject: Re: Problems with testing PNETCDF 1.6.1
>> 
>> Hi, Eric
>> 
>> Sorry for sending the wrong file. The correct one is attached, in case
>> you would like to use it.
>> 
>> I checked your config.log file but could not find anything fishy.
>> I just tested it with Intel compiler 16.0.0.109 without a problem.
>> Could you try running FLASH-IO in safe mode, i.e., with the environment
>> variable PNETCDF_SAFE_MODE set to 1? This enables internal checking for
>> data inconsistency.
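Safe mode is switched on only through the environment, with no code changes. As a hypothetical illustration of how an on/off switch like `PNETCDF_SAFE_MODE` can be read in C, the helpers below treat any nonzero integer value as enabled and an unset or empty variable as disabled. This is not PnetCDF's own code, and the library's parsing may differ. The names `flag_value_enabled` and `env_flag_enabled` are illustrative.

```c
#include <stdlib.h>

/* Hypothetical: interpret a switch value such as "1" or "0".
 * NULL (variable unset) or an empty string counts as disabled;
 * any nonzero integer counts as enabled. */
int flag_value_enabled(const char *val)
{
    if (val == NULL || *val == '\0')
        return 0;
    return strtol(val, NULL, 10) != 0;
}

/* Hypothetical: look up an environment variable such as PNETCDF_SAFE_MODE
 * and interpret its value with flag_value_enabled(). */
int env_flag_enabled(const char *name)
{
    return flag_value_enabled(getenv(name));
}
```

Because the value comes from the environment, it must be set in the shell (or job script) that launches the test for the setting to reach every MPI process.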
>> 
>> I just want to make sure that, for 1.7.0.pre1, your "make ptest" failed
>> only on FLASH-IO. Because FLASH-IO is the last test program, that would
>> mean all other tests have passed.
>> Let me know. Thanks.
>> 
>> Wei-keng
>> 
>> 
>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>> 
>> > 
>> > Hi Wei-keng:
>> > 
>> > I think you sent me the wrong copy of that file; it was identical to
>> > the one in 1.7.0.pre1. But I went ahead and added "cmd" as an argument
>> > to subroutine check_err, and that test code now compiles and runs.
>> > 
>> > The large file tests pass in 1.7.0.pre1, as you indicated. However,
>> > FLASH-IO still hangs with SGI MPT. I took your suggestion and ran
>> > this test separately (cd benchmarks/FLASH-IO ; make ptest), but the
>> > code still hangs.
>> > 
>> > I've attached the (gzipped) config.log file from the 1.7.0.pre1
>> > installations.
>> > 
>> > Thanks,
>> > 
>> > -Eric
>> > 
>> > Eric M. Kemp (SSAI)
>> > NASA/GSFC 
>> > Mail Code: 606
>> > Greenbelt, MD 20771
>> > 301.286.9768 
>> > eric.kemp at nasa.gov
>> > eric.kemp at ssaihq.com
>> > 
>> > 
>> > From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>> > Date: Monday, February 1, 2016 5:12 PM
>> > To: Eric Kemp <eric.kemp at nasa.gov>
>> > Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>> > Subject: Re: Problems with testing PNETCDF 1.6.1
>> > 
>> > Hi, Eric,
>> > 
>> > Thanks for reporting the error. This is another oversight; sorry.
>> > The fixed file, bigrecords.f, is attached.
>> > 
>> > 
>> > Wei-keng
>> > 
>> > 
>> > 
>> > On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>> > SYSTEMS AND APPLICATIONS INC] wrote:
>> > 
>> > > 
>> > > Hi Wei-keng:
>> > > 
>> > > Thanks for your quick response. I tried installing 1.7.0.pre1, but I
>> > > get a different error when compiling the tests:
>> > > 
>> > > /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model strict  -c bigrecords.f
>> > > bigrecords.f(333): error #6514: A substring must be of type CHARACTER.   [CMD]
>> > >          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>> > > ------------------------------------^
>> > > bigrecords.f(333): error #6054: A CHARACTER data type is required in this context.   [CMD]
>> > >          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>> > > ------------------------------------^
>> > > compilation aborted for bigrecords.f (code 1)
>> > > 
>> > > 
>> > > 
>> > > This appears to be a legitimate syntax error in the test program, in
>> > > subroutine check_err. "cmd" is not defined in that subroutine, nor is
>> > > it a global variable.
>> > > 
>> > > I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>> > > 
>> > > -Eric
>> > > 
>> > > 
>> > > 
>> > > 
>> > > On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>> > > wrote:
>> > > 
>> > >> Hi, Eric
>> > >> 
>> > >> For the large file tests, the error is caused by an oversight: the
>> > >> wrong flag was used. Line 81 of large_files.c should have used
>> > >> NC_64BIT_DATA instead of NC_64BIT_OFFSET.
>> > >> This error has been fixed in the 1.7.0.pre1 pre-release. Could you
>> > >> give it a try?
>> > >> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>> > >> 
>> > >> As for the FLASH-IO test, could you try running it alone? That is,
>> > >> cd to the folder benchmarks/FLASH-IO and run "make ptest" there. In
>> > >> the meantime, please send me the file config.log.
>> > >> 
>> > >> 
>> > >> Wei-keng
>> > >> 
>> > >> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>> > >> SYSTEMS AND APPLICATIONS INC] wrote:
>> > >> 
>> > >>> 
>> > >>> Dear PNETCDF developers:
>> > >>> 
>> > >>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running
>> > >>> SLES 11.3. I'm using the Intel 15 Fortran and C compilers (no C++),
>> > >>> and I'm trying to install for two separate MPI implementations (SGI
>> > >>> MPT 2.12 and Intel MPI 5.1.2).
>> > >>> 
>> > >>> I'm encountering two problems when I run 'make ptest'.
>> > >>> 
>> > >>> 1) For both MPI implementations, the large file tests fail with an
>> > >>> integer overflow. The error message is:
>> > >>>
>> > >>> *** Testing large files, slowly.
>> > >>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
>> > >>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018: task 0: Exited with exit code 1
>> > >>> 
>> > >>> I reviewed README.large_files for guidance, and I can confirm that
>> > >>> both 'MPI_Offset' and 'off_t' are 8 bytes.
>> > >>> 
>> > >>> 2) For SGI MPT only, if I disable support for large file tests, 'make
>> > >>> ptest' hangs when testing FLASH-IO:
>> > >>>
>> > >>> make -w -C FLASH-IO ptest
>> > >>> make[2]: Entering directory
>> > >>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>> > >>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>> > >>> srun.slurm: cluster configuration lacks support for cpu binding
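The message in item 1 suggests a guard that rejects a 64-bit value too large for a 4-byte int, instead of letting the cast silently wrap. The sketch below shows that kind of guard in C. It is an illustration under stated assumptions and is not the code at line 116 of large_files.c. `offset64` is a stand-in for `MPI_Offset`, assumed to be a signed 64-bit integer because the report confirms `MPI_Offset` is 8 bytes here. `fits_in_int` and `checked_narrow` are hypothetical names.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for MPI_Offset (assumed: signed, 8 bytes, as reported). */
typedef int64_t offset64;

/* Returns 1 if v can be cast to int (4 bytes on this platform) without
 * overflow, else 0. */
int fits_in_int(offset64 v)
{
    return v >= INT_MIN && v <= INT_MAX;
}

/* Narrow v to int only if it fits. On overflow, print a message modeled on
 * the one in the report and return -1; *out is left unchanged. */
int checked_narrow(offset64 v, int *out, int line)
{
    if (!fits_in_int(v)) {
        fprintf(stderr, "line %d: Overflow when type cast to 4-byte integer.\n",
                line);
        return -1;
    }
    *out = (int)v;
    return 0;
}
```

Any offset or count above `INT_MAX` (2147483647 with a 4-byte int) trips the guard. Having 8-byte `MPI_Offset` and `off_t` does not help once a value is narrowed to `int`, which is consistent with the failure appearing on both MPI implementations.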
>> > >>> 
>> > >>> The earlier tests with both single and multiple processes work with
>> > >>> SGI MPT, and all tests (again, excluding the large file tests) work
>> > >>> with Intel MPI.
>> > >>> 
>> > >>> I can provide more information (e.g., output from the configure
>> > >>> script) upon request.
>> > >>> 
>> > >>> Thanks,
>> > >>> 
>> > >>> -Eric
>> > >>> 
>> > >>> Eric M. Kemp (SSAI)
>> > >>> NASA/GSFC 
>> > >>> Mail Code: 606
>> > >>> Greenbelt, MD 20771
>> > >>> 301.286.9768
>> > >>> eric.kemp at nasa.gov
>> > >>> eric.kemp at ssaihq.com
>> > >>> 
>> > >> 
>> > > 
>> > 
>> > <config.log.gz>
>> 
>


