Problems with testing PNETCDF 1.6.1

Rob Latham robl at mcs.anl.gov
Wed Feb 3 13:08:52 CST 2016



On 02/03/2016 12:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] wrote:
>
> Hi Wei-keng:
>
> Good news: I went through the source code, adding print statements, and
> traced the hang to the nfmpi_wait_all invocation in
> checkpoint_ncmpi_parallel.F90.  After some further trial-and-error, I
> fixed the problem by commenting out this line in subroutine
> checkpoint_wr_ncmpi_par:
>
>       call MPI_Info_set(file_info, 'romio_no_indep_rw', 'true', err)
>
> I don't understand what is going on under the hood, but perhaps the
> combination of SGI MPT and GPFS (the file system on NASA's supercomputer)
> makes this setting undesirable?
>
> Anyway, after commenting out this line, both SGI MPT and Intel MPI tests
> work, and produce FLASH output files with the appropriate sizes. So I'm
> going to declare victory and install with this change.

That hint enables "deferred open": only the I/O aggregators open the file up
front, and the "deferred" part is there in case the user lied to ROMIO and
did independent I/O anyway.

https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/

This is the first I've heard of that hint causing any problems anywhere. 
I'll ask our SGI friends if they have any ideas.
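
Rather than commenting the line out, the hint could be made switchable at
run time. What follows is only a minimal sketch in C (the benchmark itself is
Fortran). The environment variable FLASH_IO_NO_INDEP_RW and the helper name
want_no_indep_rw are hypothetical; neither exists in FLASH-IO, PnetCDF, or
ROMIO. The MPI call is shown in a comment so the helper stays self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide whether to pass the romio_no_indep_rw hint.
 * env_value is meant to be the result of getenv("FLASH_IO_NO_INDEP_RW"),
 * an assumed variable name. Unset keeps the benchmark's default (hint on);
 * "0" or "false" turns the hint off; anything else leaves it on. */
int want_no_indep_rw(const char *env_value)
{
    if (env_value == NULL)
        return 1;
    if (strcmp(env_value, "0") == 0 || strcmp(env_value, "false") == 0)
        return 0;
    return 1;
}

/* Intended use, with an MPI_Info object named file_info as in the benchmark:
 *
 *   if (want_no_indep_rw(getenv("FLASH_IO_NO_INDEP_RW")))
 *       MPI_Info_set(file_info, "romio_no_indep_rw", "true");
 */
```

With something like this, the same binary could be run with and without the
hint on SGI MPT to confirm that the hint alone triggers the hang.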

==rob

>
> Thanks for your help!
>
> -Eric
>
> Eric M. Kemp (SSAI)
> NASA/GSFC
> Mail Code: 606
> Greenbelt, MD 20771
> 301.286.9768
> eric.kemp at nasa.gov
> eric.kemp at ssaihq.com
>
>
>
> On 2/3/16 11:24 AM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>
>> Hi, Eric
>>
>> The file size, 2964 bytes, is exactly what the file header size should be.
>> This indicates the program has reached the call to nfmpi_enddef() at
>> line 155 of file checkpoint_ncmpi_parallel.F90.
>> The file size also tells us that no data has been written to the file yet.
>> In this case, there are two places where the program could hang:
>> line 155 and line 600.
>>
>> If you put a print statement after line 156, you can check whether
>> the program returns from nfmpi_enddef() correctly or hangs there.
>>
>> The other possible hang is at line 600, the call to nfmpi_wait_all.
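
The file-size reasoning quoted above (a file exactly as large as its header
means the header was written but no variable data followed) can be sketched as
a small check. This is an illustrative C helper, not part of PnetCDF; the enum
and function names are hypothetical, and the expected header size (2964 bytes
in this thread) has to be supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical classification of a partially written netCDF file,
 * based only on its size relative to the expected header size. */
typedef enum {
    FILE_SHORTER_THAN_HEADER, /* header not fully on disk */
    FILE_HEADER_ONLY,         /* header written, no data after it */
    FILE_HAS_DATA             /* something was written past the header */
} file_state;

file_state classify_by_size(int64_t file_size, int64_t header_size)
{
    if (file_size < header_size)
        return FILE_SHORTER_THAN_HEADER;
    if (file_size == header_size)
        return FILE_HEADER_ONLY;
    return FILE_HAS_DATA;
}
```

For the sizes reported in this thread, classify_by_size(2964, 2964) gives
FILE_HEADER_ONLY, while the expected checkpoint size of 254075392 bytes gives
FILE_HAS_DATA.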
>>
>> Wei-keng
>>
>> On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>>
>>>
>>> Good morning Wei-keng:
>>>
>>> I reran 1.7.0.pre1 FLASH-IO again.  The only output from the program is:
>>>
>>> rw-r--r-- 1 emkemp k3002 2964 Feb  3 07:53 flash_io_test_ncmpi_chk_0000.nc
>>>
>>> I will tinker with the source code to see if I can identify where it
>>> hangs.
>>>
>>>
>>> Thanks,
>>>
>>> -Eric
>>>
>>>
>>>
>>> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>>
>>>> Hi, Eric
>>>>
>>>> Unfortunately, I do not have access to an SGI machine. We usually
>>>> rely on our users to do some initial debugging in situations like
>>>> this.
>>>> I know this can be too much to ask, but if you did not encounter any
>>>> problem when running your program, maybe you can ignore this test.
>>>>
>>>> However, since the hanging occurs only with SGI MPT, I suspect it is
>>>> related to MPT.
>>>>
>>>> Could you check one thing for me? After you kill the FLASH-IO job, could
>>>> you check if any netCDF files were created? The expected files and their
>>>> sizes are
>>>>
>>>> -rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
>>>> -rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>>>> -rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>>>>
>>>>
>>>> Wei-keng
>>>>
>>>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>> AND APPLICATIONS INC] wrote:
>>>>
>>>>>
>>>>> Hi Wei-keng:
>>>>>
>>>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>>>> FLASH-IO still hangs with SGI MPT (with no error message), but it works
>>>>> fine with Intel MPI.
>>>>>
>>>>> -Eric
>>>>>
>>>>>
>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>>> Date: Tuesday, February 2, 2016 12:39 PM
>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> Sorry for sending the wrong file. The correct one is attached, in case
>>>>> you would like to use it.
>>>>>
>>>>> I checked your config.log file but could not find anything fishy.
>>>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>>>> Could you try running FLASH-IO in safe mode, i.e. with the environment
>>>>> variable PNETCDF_SAFE_MODE set to 1? It will enable internal checking for
>>>>> data inconsistency.
>>>>>
>>>>> I just want to make sure that, for 1.7.0.pre1, your "make ptest" failed
>>>>> only on FLASH-IO.
>>>>> Because FLASH-IO is the last test program, this means all other tests
>>>>> have passed.
>>>>> Let me know. Thanks.
>>>>>
>>>>> Wei-keng
>>>>>
>>>>>
>>>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>>> AND APPLICATIONS INC] wrote:
>>>>>
>>>>>>
>>>>>> Hi Wei-keng:
>>>>>>
>>>>>> I think you sent me the wrong copy of that file; it was identical to
>>>>>> what is in 1.7.0.pre1.  But I went ahead and added "cmd" as an argument
>>>>>> to subroutine check_err, and that test code compiles and runs.
>>>>>>
>>>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>>>>> FLASH-IO still hangs with SGI MPT.  I took your suggestion and tried
>>>>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>>>>> the code still hangs.
>>>>>>
>>>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>>>>> installations.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>> Eric M. Kemp (SSAI)
>>>>>> NASA/GSFC
>>>>>> Mail Code: 606
>>>>>> Greenbelt, MD 20771
>>>>>> 301.286.9768
>>>>>> eric.kemp at nasa.gov
>>>>>> eric.kemp at ssaihq.com
>>>>>>
>>>>>>
>>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>>>> Date: Monday, February 1, 2016 5:12 PM
>>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>>>>
>>>>>> Hi, Eric,
>>>>>>
>>>>>> Thanks for reporting the error. This is another oversight; sorry.
>>>>>> The fixed file, bigrecords.f, is attached.
>>>>>>
>>>>>>
>>>>>> Wei-keng
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>>>>> SYSTEMS AND APPLICATIONS INC] wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Wei-keng:
>>>>>>>
>>>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I
>>>>>>> get a different error when compiling the tests:
>>>>>>>
>>>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib
>>>>>>> -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model
>>>>>>> strict  -c bigrecords.f
>>>>>>> bigrecords.f(333): error #6514: A substring must be of type CHARACTER.   [CMD]
>>>>>>>          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>>>> ------------------------------------^
>>>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required in this context.   [CMD]
>>>>>>>          msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>>>> ------------------------------------^
>>>>>>> compilation aborted for bigrecords.f (code 1)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This appears to be a legitimate syntax error in the test program, in
>>>>>>> subroutine check_err.  "cmd" is not defined in that subroutine, nor is
>>>>>>> it a global variable.
>>>>>>>
>>>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>>>>>>
>>>>>>> -Eric
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, Eric
>>>>>>>>
>>>>>>>> For the large file tests, the error is caused by an oversight: the
>>>>>>>> wrong flag was used.
>>>>>>>> Line 81 of file large_files.c should have used NC_64BIT_DATA instead of
>>>>>>>> NC_64BIT_OFFSET.
>>>>>>>> This error has been fixed in the pre-release 1.7.0.pre1. Could you
>>>>>>>> give it a try?
>>>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
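
The message "Overflow when type cast to 4-byte integer" is the kind of
failure a range check like the one below reports. This is a minimal C sketch,
not PnetCDF's actual code, and the function name is hypothetical. It only
illustrates why a value that fits in an 8-byte MPI_Offset can still be
rejected, on the assumption that the file format selected by the flag stores
that value in a 4-byte field.

```c
#include <stdint.h>

/* Returns 1 if value can be stored in a signed 4-byte integer, 0 otherwise. */
int fits_in_int32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

For example, fits_in_int32(2147483647LL) is 1, while
fits_in_int32(2147483648LL) is 0 because 2^31 exceeds INT32_MAX.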
>>>>>>>>
>>>>>>>> As for the FLASH-IO test, could you try running it alone? I.e., cd to
>>>>>>>> the folder benchmarks/FLASH-IO and run "make ptest" there. In the
>>>>>>>> meantime, please send me the file config.log.
>>>>>>>>
>>>>>>>>
>>>>>>>> Wei-keng
>>>>>>>>
>>>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>>>>>> AND APPLICATIONS INC] wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear PNETCDF developers:
>>>>>>>>>
>>>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running
>>>>>>>>> SLES 11.3.  I'm using Intel 15 Fortran and C compilers (no C++), and
>>>>>>>>> I'm trying to install for two separate MPI implementations (SGI MPT
>>>>>>>>> 2.12 and Intel MPI 5.1.2).
>>>>>>>>>
>>>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>>>>>>>>
>>>>>>>>> 1)  For both MPI implementations, the large file tests fail with an
>>>>>>>>> integer overflow.  The error message is:
>>>>>>>>>
>>>>>>>>> *** Testing large files, slowly.
>>>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
>>>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018:
>>>>>>>>> task 0: Exited with exit code 1
>>>>>>>>>
>>>>>>>>> I reviewed the README.large_files for guidance, and I can confirm that
>>>>>>>>> both 'MPI_Offset' and 'off_t' are 8 bytes.
>>>>>>>>>
>>>>>>>>> 2) For SGI MPT only, if I disable support for large file tests, 'make
>>>>>>>>> ptest' hangs when testing FLASH-IO:
>>>>>>>>>
>>>>>>>>> make -w -C FLASH-IO ptest
>>>>>>>>> make[2]: Entering directory
>>>>>>>>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>>>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>>>>>>>
>>>>>>>>> The earlier tests with both single and multiple processes work for
>>>>>>>>> SGI MPT. And all tests (again, excluding large file tests) work for
>>>>>>>>> Intel MPI.
>>>>>>>>>
>>>>>>>>> I can provide more information (e.g., output from the configure
>>>>>>>>> script) upon request.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -Eric
>>>>>>>>>
>>>>>>>>> Eric M. Kemp (SSAI)
>>>>>>>>> NASA/GSFC
>>>>>>>>> Mail Code: 606
>>>>>>>>> Greenbelt, MD 20771
>>>>>>>>> 301.286.9768
>>>>>>>>> eric.kemp at nasa.gov
>>>>>>>>> eric.kemp at ssaihq.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> <config.log.gz>
>>>>>
>>>>
>>>
>>
>

