Problems with testing PNETCDF 1.6.1

Wei-keng Liao wkliao at eecs.northwestern.edu
Wed Feb 3 10:24:31 CST 2016


Hi, Eric

The file size of 2964 bytes is what the file header size is supposed to be.
This indicates the program has reached the call to nfmpi_enddef() at
line 155 of file checkpoint_ncmpi_parallel.F90.
The file size also tells us that no data has been written to the file yet.
In this case, there are two possible places where the program can hang:
line 155 and line 600.
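
As a rough illustration of that reasoning, the size check could be scripted
as below. This is only a sketch in Python, not part of PnetCDF or FLASH-IO:
the function names are made up for illustration, and the two sizes are the
ones quoted in this thread (2964 bytes for the header, 254075392 bytes for
the complete checkpoint file).

```python
import os

HEADER_SIZE = 2964          # header-only size reported in this thread
EXPECTED_SIZE = 254075392   # full size of the checkpoint file quoted in this thread


def classify_checkpoint(size, header_size=HEADER_SIZE, expected_size=EXPECTED_SIZE):
    """Map a file size in bytes (None if the file is missing) to a rough state."""
    if size is None:
        return "missing: the file was never created"
    if size < header_size:
        return "incomplete header"
    if size == header_size:
        return "header only: no variable data written yet"
    if size < expected_size:
        return "partial: some data written"
    return "complete: expected size reached"


def checkpoint_state(path):
    """Hypothetical helper: classify the file at 'path' by its size on disk."""
    size = os.path.getsize(path) if os.path.exists(path) else None
    return classify_checkpoint(size)


if __name__ == "__main__":
    print(checkpoint_state("flash_io_test_ncmpi_chk_0000.nc"))
```

A "header only" result matches the case described here; it narrows down
where to look but does not by itself say which of the two calls hangs.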

If you put a print statement after line 156, you can check whether
the program returns from nfmpi_enddef() correctly or hangs there.

The other possible hang is at line 600, the call to nfmpi_wait_all.

Wei-keng

On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:

> 
> Good morning Wei-keng:
> 
> I reran FLASH-IO from 1.7.0.pre1.  The only output from the program is:
> 
> -rw-r--r-- 1 emkemp k3002 2964 Feb  3 07:53 flash_io_test_ncmpi_chk_0000.nc
> 
> I will tinker with the source code to see if I can identify where it hangs.
> 
> 
> Thanks,
> 
> -Eric
> 
> 
> 
> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> 
>> Hi, Eric
>> 
>> Unfortunately, I do not have access to an SGI machine. We usually
>> rely on our users to do some initial debugging in situations like
>> this.
>> I know this can be too much to ask, but if you did not encounter any
>> problem when running your program, maybe you can ignore this test.
>> 
>> However, since the hang occurs only with SGI MPT, I suspect it is
>> related to MPT.
>> 
>> Could you check one thing for me? After you kill the FLASH-IO job, could
>> you check if any netCDF files were created? The expected files and their
>> sizes are
>> 
>> -rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
>> -rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>> -rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>> 
>> 
>> Wei-keng
>> 
>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>> 
>>> 
>>> Hi Wei-keng:
>>> 
>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>> FLASH-IO still hangs with SGI MPT (with no error message), but it works
>>> fine with Intel MPI.
>>> 
>>> -Eric
>>> 
>>> 
>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> Date: Tuesday, February 2, 2016 12:39 PM
>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> 
>>> Hi, Eric
>>> 
>>> Sorry for sending the wrong file. The correct one is attached, in case
>>> you would like to use it.
>>> 
>>> I checked your config.log file but could not find anything fishy.
>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>> Could you try running FLASH-IO in safe mode, i.e. with the environment
>>> variable PNETCDF_SAFE_MODE set to 1? It will enable internal checking for
>>> data inconsistency.
>>> 
>>> I just want to make sure that, for 1.7.0.pre1, your "make ptest" failed
>>> only on FLASH-IO.
>>> Because FLASH-IO is the last test program, this means all other tests
>>> have passed.
>>> Let me know. Thanks.
>>> 
>>> Wei-keng
>>> 
>>> 
>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>> AND APPLICATIONS INC] wrote:
>>> 
>>>> 
>>>> Hi Wei-keng:
>>>> 
>>>> I think you sent me the wrong copy of that file; it was identical to
>>>> what is in 1.7.0.pre1.  But I went ahead and added "cmd" as an argument
>>>> to subroutine check_err, and that test code compiles and runs.
>>>> 
>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>>> FLASH-IO still hangs with SGI MPT.  I took your suggestion and tried
>>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>>> the code still hangs.
>>>> 
>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>>> installations.
>>>> 
>>>> Thanks,
>>>> 
>>>> -Eric
>>>> 
>>>> Eric M. Kemp (SSAI)
>>>> NASA/GSFC 
>>>> Mail Code: 606
>>>> Greenbelt, MD 20771
>>>> 301.286.9768 
>>>> eric.kemp at nasa.gov
>>>> eric.kemp at ssaihq.com
>>>> 
>>>> 
>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>> Date: Monday, February 1, 2016 5:12 PM
>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>> 
>>>> Hi, Eric,
>>>> 
>>>> Thanks for reporting the error. This is another oversight, sorry.
>>>> The fixed file, bigrecords.f, is attached.
>>>> 
>>>> 
>>>> Wei-keng
>>>> 
>>>> 
>>>> 
>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>>> SYSTEMS AND APPLICATIONS INC] wrote:
>>>> 
>>>>> 
>>>>> Hi Wei-keng:
>>>>> 
>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I get a
>>>>> different error when compiling the tests:
>>>>> 
>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib
>>>>> -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model
>>>>> strict  -c bigrecords.f
>>>>> bigrecords.f(333): error #6514: A substring must be of type CHARACTER.   [CMD]
>>>>>         msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>> ------------------------------------^
>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required in this context.   [CMD]
>>>>>         msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>> ------------------------------------^
>>>>> compilation aborted for bigrecords.f (code 1)
>>>>> 
>>>>> 
>>>>> 
>>>>> This appears to be a legitimate syntax error in the test program, in
>>>>> subroutine check_err.  "cmd" is not defined in that subroutine, nor is it
>>>>> a global variable.
>>>>> 
>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>>>> 
>>>>> -Eric
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>>>> wrote:
>>>>> 
>>>>>> Hi, Eric
>>>>>> 
>>>>>> For the large file tests, the error is caused by an oversight: the
>>>>>> wrong flag was used.
>>>>>> Line 81 of file large_files.c should have used NC_64BIT_DATA instead of
>>>>>> NC_64BIT_OFFSET.
>>>>>> This error has been fixed in the pre-release 1.7.0.pre1. Could you
>>>>>> give it a try?
>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>>>>>> 
>>>>>> As for the FLASH-IO test, could you try running it alone? I.e. cd to the
>>>>>> folder benchmarks/FLASH-IO and run "make ptest" there. In the meantime,
>>>>>> please send me the file config.log.
>>>>>> 
>>>>>> 
>>>>>> Wei-keng
>>>>>> 
>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>>>> AND APPLICATIONS INC] wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Dear PNETCDF developers:
>>>>>>> 
>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running SLES
>>>>>>> 11.3.  I'm using Intel 15 Fortran and C compilers (no C++), and I'm
>>>>>>> trying to install for two separate MPI implementations (SGI MPT 2.12 and
>>>>>>> Intel MPI 5.1.2).
>>>>>>> 
>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>>>>>> 
>>>>>>> 1)  For both MPI implementations, the large file tests fail with an
>>>>>>> integer overflow.  The error message is:
>>>>>>> 
>>>>>>> *** Testing large files, slowly.
>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018:
>>>>>>> task 0: Exited with exit code 1
>>>>>>> 
>>>>>>> I reviewed the README.large_files for guidance, and I can confirm that
>>>>>>> both 'MPI_Offset' and 'off_t' are 8 bytes.
>>>>>>> 
>>>>>>> 2) For SGI MPT only, if I disable support for large file tests, 'make
>>>>>>> ptest' hangs when testing FLASH-IO:
>>>>>>> 
>>>>>>> make -w -C FLASH-IO ptest
>>>>>>> make[2]: Entering directory
>>>>>>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>>>>> 
>>>>>>> The earlier tests with both single and multiple processes work for SGI
>>>>>>> MPT. And all tests (again, excluding large file tests) work for Intel
>>>>>>> MPI.
>>>>>>> 
>>>>>>> I can provide more information (e.g., output from the configure script)
>>>>>>> upon request.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -Eric
>>>>>>> 
>>>>>>> Eric M. Kemp (SSAI)
>>>>>>> NASA/GSFC 
>>>>>>> Mail Code: 606
>>>>>>> Greenbelt, MD 20771
>>>>>>> 301.286.9768
>>>>>>> eric.kemp at nasa.gov
>>>>>>> eric.kemp at ssaihq.com
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> <config.log.gz>
>>> 
>> 
> 