Problems with testing PNETCDF 1.6.1

Wei-keng Liao wkliao at eecs.northwestern.edu
Wed Feb 3 13:32:20 CST 2016


Hi, Eric

Many thanks for digging into the problem. It is great to know the cause.

In PnetCDF, the file header is written by the root process using an
independent write, so this line is not necessary. However, setting
this hint should not cause the program to hang or break anything.

This indicates a bug in the MPI-IO component of the SGI MPT.

Attached is a small test program that should reproduce the hang
problem you encountered.

Thanks again for putting more effort into this!

Wei-keng

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_io_hint.f
Type: application/octet-stream
Size: 1654 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20160203/813094e0/attachment-0001.obj>
-------------- next part --------------

On Feb 3, 2016, at 12:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:

> 
> Hi Wei-keng:
> 
> Good news: I went through the source code, adding print statements, and
> traced the hang to the nfmpi_wait_all invocation in
> checkpoint_ncmpi_parallel.F90.  After some further trial-and-error, I
> fixed the problem by commenting out this line in subroutine
> checkpoint_wr_ncmpi_par:
> 
>     call MPI_Info_set(file_info, 'romio_no_indep_rw', 'true', err)
> 
> I don't understand what is going on under the hood, but perhaps the
> combination of SGI MPT and GPFS (the file system on NASA's supercomputer)
> makes this setting undesirable?
> 
> Anyway, after commenting out this line, both SGI MPT and Intel MPI tests
> work, and produce FLASH output files with the appropriate sizes. So I'm
> going to declare victory and install with this change.
> 
> Thanks for your help!
> 
> -Eric
> 
> Eric M. Kemp (SSAI)
> NASA/GSFC 
> Mail Code: 606 
> Greenbelt, MD 20771
> 301.286.9768 
> eric.kemp at nasa.gov
> eric.kemp at ssaihq.com
> 
> 
> 
> On 2/3/16 11:24 AM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> 
>> Hi, Eric
>> 
>> The file size, 2964, is exactly what the file header size should be. This
>> indicates the program has reached the call to nfmpi_enddef() at
>> line 155 of file checkpoint_ncmpi_parallel.F90.
>> The file size also tells us that no data has been written to the file yet.
>> In this case, there are two places where the program can hang:
>> line 155 and line 600.
>> 
>> If you put a print statement after line 156, you can check whether
>> the program returns from nfmpi_enddef() correctly or hangs there.
>> 
>> The other possible hang is at line 600, the call to nfmpi_wait_all.
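
The size-based reasoning above can be sketched as a small C helper. This is only an illustration, not code from PnetCDF or FLASH-IO: the names `chk_state`, `diagnose_checkpoint` and `chk_state_text` are hypothetical, and the two constants are simply the sizes quoted in this thread (2964 bytes for the header, 254075392 bytes for the complete flash_io_test_ncmpi_chk_0000.nc).

```c
#include <assert.h>

/* Sizes quoted in this thread. They are tied to this particular
 * FLASH-IO test configuration and are not general PnetCDF constants. */
#define CHK_HEADER_SIZE   2964LL       /* header only, no data yet */
#define CHK_EXPECTED_SIZE 254075392LL  /* complete checkpoint file */

/* Hypothetical classification of how far the checkpoint write got. */
typedef enum {
    CHK_NOT_CREATED,    /* no file on disk                              */
    CHK_HEADER_PARTIAL, /* smaller than the header                      */
    CHK_HEADER_ONLY,    /* header written, no data written yet          */
    CHK_DATA_PARTIAL,   /* some data written, file still incomplete     */
    CHK_COMPLETE        /* file has at least the expected size          */
} chk_state;

/* size is the file size in bytes as shown by "ls -l", or a negative
 * value if the file does not exist. */
chk_state diagnose_checkpoint(long long size)
{
    if (size < 0)                 return CHK_NOT_CREATED;
    if (size < CHK_HEADER_SIZE)   return CHK_HEADER_PARTIAL;
    if (size == CHK_HEADER_SIZE)  return CHK_HEADER_ONLY;
    if (size < CHK_EXPECTED_SIZE) return CHK_DATA_PARTIAL;
    return CHK_COMPLETE;
}

/* Human-readable hint for each state. The CHK_HEADER_ONLY text repeats
 * the two candidate locations named in the message above. */
const char *chk_state_text(chk_state s)
{
    static const char *text[] = {
        "file was never created",
        "header incomplete",
        "header only: suspect nfmpi_enddef (line 155) or nfmpi_wait_all (line 600)",
        "data partially written",
        "file complete"
    };
    assert((int)s >= 0 && (int)s < 5);  /* guard the table lookup */
    return text[(int)s];
}
```

With the 2964-byte file reported in this thread, `diagnose_checkpoint(2964)` yields `CHK_HEADER_ONLY`, which matches the two suspects named above.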
>> 
>> Wei-keng
>> 
>> On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>> 
>>> 
>>> Good morning Wei-keng:
>>> 
>>> I reran the 1.7.0.pre1 FLASH-IO test.  The only output from the program is:
>>> 
>>> rw-r--r-- 1 emkemp k3002 2964 Feb  3 07:53 flash_io_test_ncmpi_chk_0000.nc
>>> 
>>> I will tinker with the source code to see if I can identify where it
>>> hangs.
>>> 
>>> 
>>> Thanks,
>>> 
>>> -Eric
>>> 
>>> 
>>> 
>>> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> 
>>>> Hi, Eric
>>>> 
>>>> Unfortunately, I do not have access to an SGI machine. We usually
>>>> rely on our users to do some initial debugging for situations like
>>>> this.
>>>> I know this can be too much to ask, but if you did not encounter any
>>>> problem when running your program, maybe you can ignore this test.
>>>> 
>>>> However, since the hanging occurs only with SGI MPT, I suspect it is
>>>> related to MPT.
>>>> 
>>>> Could you check one thing for me? After you kill the FLASH-IO job,
>>>> could you check if any netCDF files were created? The expected files
>>>> and their sizes are
>>>> 
>>>> -rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
>>>> -rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>>>> -rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>>>> 
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>> AND APPLICATIONS INC] wrote:
>>>> 
>>>>> 
>>>>> Hi Wei-keng:
>>>>> 
>>>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>>>> FLASH-IO still hangs with SGI MPT (with no error message), but it
>>>>> works fine with Intel MPI.
>>>>> 
>>>>> -Eric
>>>>> 
>>>>> 
>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>>> Date: Tuesday, February 2, 2016 12:39 PM
>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>>> 
>>>>> Hi, Eric
>>>>> 
>>>>> Sorry for sending the wrong file. The correct one is attached, in case
>>>>> you would like to use it.
>>>>> 
>>>>> I checked your config.log file but could not find anything fishy.
>>>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>>>> Could you try running FLASH-IO under the safe mode, i.e. set the
>>>>> environment variable PNETCDF_SAFE_MODE to 1? It will enable internal
>>>>> checking for data inconsistency.
>>>>> 
>>>>> I just want to make sure that, for 1.7.0.pre1, your "make ptest" failed
>>>>> only on FLASH-IO.
>>>>> Because FLASH-IO is the last test program, this means all other tests
>>>>> have passed.
>>>>> Let me know. Thanks.
>>>>> 
>>>>> Wei-keng
>>>>> 
>>>>> 
>>>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>>>> AND APPLICATIONS INC] wrote:
>>>>> 
>>>>>> 
>>>>>> Hi Wei-keng:
>>>>>> 
>>>>>> I think you sent me the wrong copy of that file; it was identical to
>>>>>> what is in 1.7.0.pre1.  But I went ahead and added "cmd" as an
>>>>>> argument to subroutine check_err, and that test code compiles and runs.
>>>>>> 
>>>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>>>>> FLASH-IO still hangs with SGI MPT.  I took your suggestion and tried
>>>>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>>>>> the code still hangs.
>>>>>> 
>>>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>>>>> installations.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> -Eric
>>>>>> 
>>>>>> Eric M. Kemp (SSAI)
>>>>>> NASA/GSFC 
>>>>>> Mail Code: 606
>>>>>> Greenbelt, MD 20771
>>>>>> 301.286.9768 
>>>>>> eric.kemp at nasa.gov
>>>>>> eric.kemp at ssaihq.com
>>>>>> 
>>>>>> 
>>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>>>>> Date: Monday, February 1, 2016 5:12 PM
>>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>>>> 
>>>>>> Hi, Eric,
>>>>>> 
>>>>>> Thanks for reporting the error. This is another oversight, sorry.
>>>>>> The fixed file, bigrecords.f, is attached.
>>>>>> 
>>>>>> 
>>>>>> Wei-keng
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>>>>> SYSTEMS AND APPLICATIONS INC] wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi Wei-keng:
>>>>>>> 
>>>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I
>>>>>>> get a different error when compiling the tests:
>>>>>>> 
>>>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib
>>>>>>> -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2
>>>>>>> -fp-model strict  -c bigrecords.f
>>>>>>> bigrecords.f(333): error #6514: A substring must be of type
>>>>>>> CHARACTER.   [CMD]
>>>>>>>        msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>>>> ------------------------------------^
>>>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required
>>>>>>> in this context.   [CMD]
>>>>>>>        msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>>>>>> ------------------------------------^
>>>>>>> compilation aborted for bigrecords.f (code 1)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> This appears to be a legitimate syntax error in the test program, in
>>>>>>> subroutine check_err.  "cmd" is not defined in that subroutine, nor
>>>>>>> is it a global variable.
>>>>>>> 
>>>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>>>>>> 
>>>>>>> -Eric
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi, Eric
>>>>>>>> 
>>>>>>>> For the large file tests, the error is caused by an oversight: the
>>>>>>>> wrong flag was used.
>>>>>>>> Line 81 of file large_files.c should have used NC_64BIT_DATA instead
>>>>>>>> of NC_64BIT_OFFSET.
>>>>>>>> This error has been fixed in the pre-release 1.7.0.pre1. Could you
>>>>>>>> give it a try?
>>>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
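
The 4-byte overflow reported by the large file tests can be illustrated without PnetCDF. The following is a minimal sketch, assuming that a file created with NC_64BIT_OFFSET still stores dimension and variable lengths in 4-byte integers while NC_64BIT_DATA widens them to 8 bytes. `fits_in_int32` and `narrow_length` are hypothetical helpers, not PnetCDF functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of PnetCDF: returns 1 if a 64-bit length
 * can be represented as a non-negative 4-byte signed integer, else 0. */
int fits_in_int32(long long len)
{
    return len >= 0 && len <= (long long)INT32_MAX;
}

/* Hypothetical helper: narrows a length that the caller has already
 * checked with fits_in_int32(). The assert documents that precondition;
 * a real library would more likely return an error code than abort. */
int32_t narrow_length(long long len)
{
    assert(fits_in_int32(len));
    return (int32_t)len;
}
```

Under that assumption, an 8-byte 'MPI_Offset' or 'off_t' does not help once a length above 2147483647 has to be stored in a 4-byte field, which would be consistent with the test failing even on systems where both types are 8 bytes.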
>>>>>>>> 
>>>>>>>> As for the FLASH-IO test, could you try running it alone? I.e. cd
>>>>>>>> to the folder benchmarks/FLASH-IO and run "make ptest" there. In the
>>>>>>>> meantime, please send me the file config.log.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Wei-keng
>>>>>>>> 
>>>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE
>>>>>>>> SYSTEMS AND APPLICATIONS INC] wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Dear PNETCDF developers:
>>>>>>>>> 
>>>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running
>>>>>>>>> SLES 11.3.  I'm using Intel 15 Fortran and C compilers (no C++), and
>>>>>>>>> I'm trying to install for two separate MPI implementations (SGI MPT
>>>>>>>>> 2.12 and Intel MPI 5.1.2).
>>>>>>>>> 
>>>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>>>>>>>> 
>>>>>>>>> 1)  For both MPI implementations, the large file tests fail with
>>>>>>>>> an integer overflow.  The error message is:
>>>>>>>>> 
>>>>>>>>> *** Testing large files, slowly.
>>>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte
>>>>>>>>> integer.
>>>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error:
>>>>>>>>> borgo018: task 0: Exited with exit code 1
>>>>>>>>> 
>>>>>>>>> I reviewed the README.large_files for guidance, and I can confirm
>>>>>>>>> that both 'MPI_Offset' and 'off_t' are 8 bytes.
>>>>>>>>> 
>>>>>>>>> 2) For SGI MPT only, if I disable support for large file tests,
>>>>>>>>> 'make ptest' hangs when testing FLASH-IO:
>>>>>>>>> 
>>>>>>>>> make -w -C FLASH-IO ptest
>>>>>>>>> make[2]: Entering directory
>>>>>>>>> `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>>>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>>>>>>> 
>>>>>>>>> The earlier tests with both single and multiple processes work
>>>>>>>>> for SGI MPT. And all tests (again, excluding large file tests) work
>>>>>>>>> for Intel MPI.
>>>>>>>>> 
>>>>>>>>> I can provide more information (e.g., output from the configure
>>>>>>>>> script) upon request.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> -Eric
>>>>>>>>> 
>>>>>>>>> Eric M. Kemp (SSAI)
>>>>>>>>> NASA/GSFC 
>>>>>>>>> Mail Code: 606
>>>>>>>>> Greenbelt, MD 20771
>>>>>>>>> 301.286.9768
>>>>>>>>> eric.kemp at nasa.gov
>>>>>>>>> eric.kemp at ssaihq.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> <config.log.gz>
>>>>> 
>>>> 
>>> 
>> 
> 


