Problems with testing PNETCDF 1.6.1
Wei-keng Liao
wkliao at eecs.northwestern.edu
Wed Feb 3 10:24:31 CST 2016
Hi, Eric
The file size of 2964 bytes is what the file header size is supposed to be.
This indicates the program has reached the call to nfmpi_enddef() at
line 155 of file checkpoint_ncmpi_parallel.F90.
The file size also tells us that no data has been written to the file yet.
In this case, there are two places where the program could hang:
line 155 and line 600.
If you put a print statement after line 156, you can check whether
the program returns from nfmpi_enddef() correctly or hangs there.
The other possible hang is at line 600, the call to nfmpi_wait_all.
Wei-keng
On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>
> Good morning Wei-keng:
>
> I reran 1.7.0.pre1 FLASH-IO again. The only output from the program is:
>
> rw-r--r-- 1 emkemp k3002 2964 Feb 3 07:53 flash_io_test_ncmpi_chk_0000.nc
>
> I will tinker with the source code to see if I can identify where it hangs.
>
>
> Thanks,
>
> -Eric
>
>
>
> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>
>> Hi, Eric
>>
>> Unfortunately, I do not have access to an SGI machine. We usually
>> rely on our users to do some initial debugging in situations like
>> this.
>> I know this can be too much to ask, but if you did not encounter any
>> problem when running your program, maybe you can ignore this test.
>>
>> However, since the hanging occurs only with SGI MPT, I suspect it is
>> related to MPT.
>>
>> Could you check one thing for me? After you kill the FLASH-IO job, could
>> you check if any netCDF files were created? The expected files and their
>> sizes are
>>
>> -rw------- 1 254075392 Feb 1 21:58 flash_io_test_ncmpi_chk_0000.nc
>> -rw------- 1 21208576 Feb 1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>> -rw------- 1 25431372 Feb 1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>>
>>
>> Wei-keng
>>
>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>>
>>>
>>> Hi Wei-keng:
>>>
>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>> FLASH-IO still hangs with SGI MPT (with no error message), but it works
>>> fine with Intel MPI.
>>>
>>> -Eric
>>>
>>>
>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> Date: Tuesday, February 2, 2016 12:39 PM
>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>
>>> Hi, Eric
>>>
>>> Sorry for sending the wrong file. The correct one is attached, in case
>>> you would like
>>> to use it.
>>>
>>> I checked your config.log file but could not find anything fishy.
>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>> Could you try running FLASH-IO under the safe mode? i.e. set the
>>> environment
>>> variable PNETCDF_SAFE_MODE to 1. It will enable internal checking for
>>> data inconsistency.
>>>
>>> Just want to make sure for 1.7.0.pre1 that your "make ptest" failed
>>> only on FLASH-IO.
>>> Because FLASH-IO is the last test program, this means all other tests
>>> have passed.
>>> Let me know. Thanks.
>>>
>>> Wei-keng
>>>
>>>
>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>> AND APPLICATIONS INC] wrote:
>>>
>>>>
>>>> Hi Wei-keng:
>>>>
>>>> I think you sent me the wrong copy of that file; it was identical to
>>> what is in 1.7.0.pre1. But I went ahead and added "cmd" as an argument
>>> to subroutine check_err, and that test code compiles and runs.
>>>>
>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>> FLASH-IO still hangs with SGI MPT. I took your suggestion and tried
>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>> the code still hangs.
>>>>
>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>> installations.
>>>>
>>>> Thanks,
>>>>
>>>> -Eric
>>>>
>>>> Eric M. Kemp (SSAI)
>>>> NASA/GSFC
>>>> Mail Code: 606
>>>> Greenbelt, MD 20771
>>>> 301.286.9768
>>>> eric.kemp at nasa.gov
>>>> eric.kemp at ssaihq.com
>>>>
>>>>
>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>> Date: Monday, February 1, 2016 5:12 PM
>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>>
>>>> Hi, Eric,
>>>>
>>>> Thanks for reporting the error. This is another oversight, sorry.
>>>> The fixed file, bigrecords.f, is attached.
>>>>
>>>>
>>>> Wei-keng
>>>>
>>>>
>>>>
>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>> SYSTEMS AND APPLICATIONS INC] wrote:
>>>>
>>>>>
>>>>> Hi Wei-keng:
>>>>>
>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I
>>> get a
>>>>> different error when compiling the tests:
>>>>>
>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90 -I../../src/lib
>>>>> -I./../common -I../../src/libf -I../../src/libf90 -fpic -O2
>>> -fp-model
>>>>> strict -c bigrecords.f
>>>>> bigrecords.f(333): error #6514: A substring must be of type
>>> CHARACTER.
>>>>> [CMD]
>>>>> msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>> ------------------------------------^
>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required
>>> in this
>>>>> context. [CMD]
>>>>> msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>> ------------------------------------^
>>>>> compilation aborted for bigrecords.f (code 1)
>>>>>
>>>>>
>>>>>
>>>>> This appears to be a legitimate syntax error in the test program, in
>>>>> subroutine check_err. "cmd" is not defined in that subroutine, nor
>>> is it
>>>>> a global variable.
>>>>>
>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>>>>
>>>>> -Eric
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>> wrote:
>>>>>
>>>>>> Hi, Eric
>>>>>>
>>>>>> For the large file tests, the error is caused by an oversight: the
>>>>>> wrong flag was used.
>>>>>> Line 81 of file large_files.c should have used NC_64BIT_DATA,
>>> instead of
>>>>>> NC_64BIT_OFFSET.
>>>>>> This error has been fixed in the pre-release of 1.7.0.pre1. Could
>>> you
>>>>>> give it a try?
>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>>>>>>
>>>>>> As for the FLASH-IO test, could you try running it alone? I.e. cd
>>> to the
>>>>>> folder
>>>>>> benchmarks/FLASH-IO and run "make ptest" there. In the meantime,
>>> please
>>>>>> send me
>>>>>> the file config.log.
>>>>>>
>>>>>>
>>>>>> Wei-keng
>>>>>>
>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>> SYSTEMS
>>>>>> AND APPLICATIONS INC] wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear PNETCDF developers:
>>>>>>>
>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running
>>> SLES
>>>>>>> 11.3. I'm using Intel 15 Fortran and C compilers (no C++), and
>>> I'm
>>>>>>> trying to install for two separate MPI implementations (SGI MPT
>>> 2.12 and
>>>>>>> Intel MPI 5.1.2).
>>>>>>>
>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>>>>>>
>>>>>>> 1) For both MPI implementations, the large file tests fail with
>>> an
>>>>>>> integer overflow. The error message is:
>>>>>>>
>>>>>>> *** Testing large files, slowly.
>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte
>>> integer.
>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error:
>>> borgo018:
>>>>>>> task 0: Exited with exit code 1
>>>>>>>
>>>>>>> I reviewed the README.large_files for guidance, and I can confirm
>>> that
>>>>>>> both 'MPI_Offset' and 'off_t' are 8 bytes.
>>>>>>>
>>>>>>> 2) For SGI MPT only, if I disable support for large file tests,
>>> 'make
>>>>>>> ptest' hangs when testing FLASH-IO:
>>>>>>>
>>>>>>> make -w -C FLASH-IO ptest
>>>>>>> make[2]: Entering directory
>>>>>>>
>>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benc
>>>>>>> hmarks/FLASH-IO'
>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>>>>>
>>>>>>> The earlier tests with both single and multiple processes work
>>> for SGI
>>>>>>> MPT. And all tests (again, excluding large file tests) work for
>>> Intel
>>>>>>> MPI.
>>>>>>>
>>>>>>> I can provide more information (e.g., output from the configure
>>> script)
>>>>>>> upon request.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Eric
>>>>>>>
>>>>>>> Eric M. Kemp (SSAI)
>>>>>>> NASA/GSFC
>>>>>>> Mail Code: 606
>>>>>>> Greenbelt, MD 20771
>>>>>>> 301.286.9768
>>>>>>> eric.kemp at nasa.gov
>>>>>>> eric.kemp at ssaihq.com
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> <config.log.gz>
>>>
>>
>