Problems with testing PNETCDF 1.6.1
Wei-keng Liao
wkliao at eecs.northwestern.edu
Thu Feb 4 10:31:34 CST 2016
Hi, Michael
That's a good catch.
However, PnetCDF does not make Fortran MPI calls directly.
All MPI calls are in C and all statuses are declared as MPI_Status.
Maybe try the same test program but in C to see if the problem
can be reproduced?
Wei-keng
On Feb 4, 2016, at 8:57 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>
> Hi Michael:
>
> I redefined mpistatus as an array as you suggested, and the mpi_io_hint test program now runs w/o issue for both SGI MPT and Intel MPI.
>
> -Eric
>
>
> From: Michael Raymond <mraymond at sgi.com>
> Date: Thursday, February 4, 2016 9:06 AM
> To: Eric Kemp <eric.kemp at nasa.gov>
> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
> Subject: Re: Problems with testing PNETCDF 1.6.1
>
> The test program has
>
> integer mpistatus
>
> It should be
>
> integer mpistatus(MPI_STATUS_SIZE)
>
> I don’t have a GPFS filesystem to test on, but it runs fine on NFS and Lustre for me.
>
> Michael A. Raymond
> SGI MPT Team Leader
> 1 (651) 683-7523
>
>
>
>> On Feb 4, 2016, at 07:48, Michael Raymond <mraymond at sgi.com> wrote:
>>
>> If I compile the test program with -g, it runs fine. Digging deeper.
>>
>> Michael A. Raymond
>> SGI MPT Team Leader
>> 1 (651) 683-7523
>>
>>
>>
>>> On Feb 3, 2016, at 14:26, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] <eric.kemp at nasa.gov> wrote:
>>>
>>>
>>> Hi Wei-keng:
>>>
>>> Your program (successfully?) triggers two errors when I run it with 'mpiexec_mpt -np 2 ./mpi_io_hint'
>>>
>>> (1) This error message:
>>>
>>> MPI_File_f2c: Invalid file handle
>>> Error at MPI_File_close Invalid file handle
>>>
>>> (2) The program then hangs and must be manually killed.
>>>
>>> The same behavior occurs when increasing the number of MPI processes. However, when I only run with 1 process (serially), I only see the error message (1) but the program stops gracefully. These errors are with SGI MPT 2.12.
>>>
>>> The program runs w/o issue if I use Intel MPI 15.0.3.187.
>>>
>>> Cheers,
>>>
>>> -Eric
>>>
>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> Date: Wednesday, February 3, 2016 2:32 PM
>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>>
>>> Hi, Eric
>>>
>>> Many thanks for digging into the problem. It is great to know the cause.
>>>
>>> In PnetCDF, writing the file header is done by the root process using an
>>> independent write. So, this line is not necessary. However, setting
>>> this hint should not cause the program to hang or break anything.
>>>
>>> This indicates a bug in the MPI-IO component of the SGI MPT.
>>>
>>> Attached is a small test program that should reproduce the hang
>>> problem you encountered.
>>>
>>> Thanks again for putting more effort into this!
>>>
>>> Wei-keng
>>>
>>>
>>> On Feb 3, 2016, at 12:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>>
>>> >
>>> > Hi Wei-keng:
>>> >
>>> > Good news: I went through the source code, adding print statements, and
>>> > traced the hang to the nfmpi_wait_all invocation in
>>> > checkpoint_ncmpi_parallel.F90. After some further trial-and-error, I
>>> > fixed the problem by commenting out this line in subroutine
>>> > checkpoint_wr_ncmpi_par:
>>> >
>>> > call MPI_Info_set(file_info, 'romio_no_indep_rw', 'true', err)
>>> >
>>> > I don't understand what is going on under the hood, but perhaps the
>>> > combination of SGI MPT and GPFS (the file system on NASA's supercomputer)
>>> > makes this setting undesirable?
>>> >
>>> > Anyway, after commenting out this line, both SGI MPT and Intel MPI tests
>>> > work, and produce FLASH output files with the appropriate sizes. So I'm
>>> > going to declare victory and install with this change.
>>> >
>>> > Thanks for your help!
>>> >
>>> > -Eric
>>> >
>>> > Eric M. Kemp (SSAI)
>>> > NASA/GSFC
>>> > Mail Code: 606
>>> > Greenbelt, MD 20771
>>> > 301.286.9768
>>> > eric.kemp at nasa.gov
>>> > eric.kemp at ssaihq.com
>>> >
>>> >
>>> >
>>> > On 2/3/16 11:24 AM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> >
>>> >> Hi, Eric
>>> >>
>>> >> The file size of 2964 bytes is what the file header size is supposed
>>> >> to be. This indicates the program has reached the call to
>>> >> nfmpi_enddef() at line 155 of file checkpoint_ncmpi_parallel.F90.
>>> >> The file size also tells us that no data has been written to the file
>>> >> yet. In this case, two possible places can cause the program to hang:
>>> >> line 155 and line 600.
>>> >>
>>> >> If you put a print statement after line 156, you can check whether
>>> >> the program returns from nfmpi_enddef() correctly or hangs there.
>>> >>
>>> >> The other possible hang is at line 600, the call to nfmpi_wait_all.
>>> >>
>>> >> Wei-keng
>>> >>
>>> >> On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>> >> AND APPLICATIONS INC] wrote:
>>> >>
>>> >>>
>>> >>> Good morning Wei-keng:
>>> >>>
>>> >>> I reran 1.7.0.pre1 FLASH-IO again. The only output from the program is:
>>> >>>
>>> >>> rw-r--r-- 1 emkemp k3002 2964 Feb 3 07:53 flash_io_test_ncmpi_chk_0000.nc
>>> >>>
>>> >>> I will tinker with the source code to see if I can identify where it
>>> >>> hangs.
>>> >>>
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> -Eric
>>> >>>
>>> >>>
>>> >>>
>>> >>> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> >>>
>>> >>>> Hi, Eric
>>> >>>>
>>> >>>> Unfortunately, I do not have access to an SGI machine. We usually
>>> >>>> rely on our users to do some initial debugging in situations like
>>> >>>> this.
>>> >>>> I know this can be too much to ask, but if you did not encounter any
>>> >>>> problem when running your program, maybe you can ignore this test.
>>> >>>>
>>> >>>> However, since the hanging occurs only with SGI MPT, I suspect it is
>>> >>>> related to MPT.
>>> >>>>
>>> >>>> Could you check one thing for me? After you kill the FLASH-IO job,
>>> >>>> could you check if any netCDF files were created? The expected files
>>> >>>> and their sizes are
>>> >>>>
>>> >>>> -rw------- 1 254075392 Feb 1 21:58 flash_io_test_ncmpi_chk_0000.nc
>>> >>>> -rw------- 1 21208576 Feb 1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>>> >>>> -rw------- 1 25431372 Feb 1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>>> >>>>
>>> >>>>
>>> >>>> Wei-keng
>>> >>>>
>>> >>>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>> >>>> AND APPLICATIONS INC] wrote:
>>> >>>>
>>> >>>>>
>>> >>>>> Hi Wei-keng:
>>> >>>>>
>>> >>>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>> >>>>> FLASH-IO still hangs with SGI MPT (with no error message), but it
>>> >>>>> works fine with Intel MPI.
>>> >>>>>
>>> >>>>> -Eric
>>> >>>>>
>>> >>>>>
>>> >>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> >>>>> Date: Tuesday, February 2, 2016 12:39 PM
>>> >>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> >>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> >>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> >>>>>
>>> >>>>> Hi, Eric
>>> >>>>>
>>> >>>>> Sorry for sending the wrong file. The correct one is attached, in case
>>> >>>>> you would like to use it.
>>> >>>>>
>>> >>>>> I checked your config.log file but could not find anything fishy.
>>> >>>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>> >>>>> Could you try running FLASH-IO under the safe mode? I.e., set the
>>> >>>>> environment variable PNETCDF_SAFE_MODE to 1. It will enable internal
>>> >>>>> checking for data inconsistency.
>>> >>>>>
>>> >>>>> Just want to make sure for 1.7.0.pre1 that your "make ptest" failed
>>> >>>>> only on FLASH-IO. Because FLASH-IO is the last test program, this
>>> >>>>> means all other tests have passed.
>>> >>>>> Let me know. Thanks.
>>> >>>>>
>>> >>>>> Wei-keng
>>> >>>>>
>>> >>>>>
>>> >>>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS
>>> >>>>> AND APPLICATIONS INC] wrote:
>>> >>>>>
>>> >>>>>>
>>> >>>>>> Hi Wei-keng:
>>> >>>>>>
>>> >>>>>> I think you sent me the wrong copy of that file; it was identical to
>>> >>>>>> what is in 1.7.0.pre1. But I went ahead and added "cmd" as an
>>> >>>>>> argument to subroutine check_err, and that test code compiles and
>>> >>>>>> runs.
>>> >>>>>>
>>> >>>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>> >>>>>> FLASH-IO still hangs with SGI MPT. I took your suggestion and tried
>>> >>>>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>> >>>>>> the code still hangs.
>>> >>>>>>
>>> >>>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>> >>>>>> installations.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> -Eric
>>> >>>>>>
>>> >>>>>> Eric M. Kemp (SSAI)
>>> >>>>>> NASA/GSFC
>>> >>>>>> Mail Code: 606
>>> >>>>>> Greenbelt, MD 20771
>>> >>>>>> 301.286.9768
>>> >>>>>> eric.kemp at nasa.gov
>>> >>>>>> eric.kemp at ssaihq.com
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> >>>>>> Date: Monday, February 1, 2016 5:12 PM
>>> >>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> >>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> >>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> >>>>>>
>>> >>>>>> Hi, Eric,
>>> >>>>>>
>>> >>>>>> Thanks for reporting the error. This is another oversight, sorry.
>>> >>>>>> The fixed file, bigrecords.f, is attached.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Wei-keng
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>> Hi Wei-keng:
>>> >>>>>>>
>>> >>>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I
>>> >>>>>>> get a different error when compiling the tests:
>>> >>>>>>>
>>> >>>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90 -I../../src/lib -I./../common -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model strict -c bigrecords.f
>>> >>>>>>> bigrecords.f(333): error #6514: A substring must be of type CHARACTER. [CMD]
>>> >>>>>>> msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>> >>>>>>> ------------------------------------^
>>> >>>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required in this context. [CMD]
>>> >>>>>>> msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>> >>>>>>> ------------------------------------^
>>> >>>>>>> compilation aborted for bigrecords.f (code 1)
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> This appears to be a legitimate syntax error in the test program, in
>>> >>>>>>> subroutine check_err. "cmd" is not defined in that subroutine, nor is
>>> >>>>>>> it a global variable.
>>> >>>>>>>
>>> >>>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>> >>>>>>>
>>> >>>>>>> -Eric
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi, Eric
>>> >>>>>>>>
>>> >>>>>>>> For the large file tests, the error is caused by an oversight: the
>>> >>>>>>>> wrong flag was used. Line 81 of file large_files.c should have used
>>> >>>>>>>> NC_64BIT_DATA instead of NC_64BIT_OFFSET.
>>> >>>>>>>> This error has been fixed in the pre-release 1.7.0.pre1. Could you
>>> >>>>>>>> give it a try?
>>> >>>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>>> >>>>>>>>
>>> >>>>>>>> As for the FLASH-IO test, could you try running it alone? I.e., cd to
>>> >>>>>>>> the folder benchmarks/FLASH-IO and run "make ptest" there. In the
>>> >>>>>>>> meantime, please send me the file config.log.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Wei-keng
>>> >>>>>>>>
>>> >>>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Dear PNETCDF developers:
>>> >>>>>>>>>
>>> >>>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running
>>> >>>>>>>>> SLES 11.3. I'm using Intel 15 Fortran and C compilers (no C++), and
>>> >>>>>>>>> I'm trying to install for two separate MPI implementations (SGI MPT
>>> >>>>>>>>> 2.12 and Intel MPI 5.1.2).
>>> >>>>>>>>>
>>> >>>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>> >>>>>>>>>
>>> >>>>>>>>> 1) For both MPI implementations, the large file tests fail with an
>>> >>>>>>>>> integer overflow. The error message is:
>>> >>>>>>>>>
>>> >>>>>>>>> *** Testing large files, slowly.
>>> >>>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
>>> >>>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018: task 0: Exited with exit code 1
>>> >>>>>>>>>
>>> >>>>>>>>> I reviewed the README.large_files for guidance, and I can confirm
>>> >>>>>>>>> that both 'MPI_Offset' and 'off_t' are 8 bytes.
>>> >>>>>>>>>
>>> >>>>>>>>> 2) For SGI MPT only, if I disable support for large file tests,
>>> >>>>>>>>> 'make ptest' hangs when testing FLASH-IO:
>>> >>>>>>>>>
>>> >>>>>>>>> make -w -C FLASH-IO ptest
>>> >>>>>>>>> make[2]: Entering directory `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>>> >>>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>> >>>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>> >>>>>>>>>
>>> >>>>>>>>> The earlier tests with both single and multiple processes work for
>>> >>>>>>>>> SGI MPT. And all tests (again, excluding large file tests) work for
>>> >>>>>>>>> Intel MPI.
>>> >>>>>>>>>
>>> >>>>>>>>> I can provide more information (e.g., output from the configure
>>> >>>>>>>>> script) upon request.
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>>
>>> >>>>>>>>> -Eric
>>> >>>>>>>>>
>>> >>>>>>>>> Eric M. Kemp (SSAI)
>>> >>>>>>>>> NASA/GSFC
>>> >>>>>>>>> Mail Code: 606
>>> >>>>>>>>> Greenbelt, MD 20771
>>> >>>>>>>>> 301.286.9768
>>> >>>>>>>>> eric.kemp at nasa.gov
>>> >>>>>>>>> eric.kemp at ssaihq.com
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>> <config.log.gz>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>
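
For readers wondering why a scalar mpistatus can matter: MPI routines that return a status fill in a full status object, which in Fortran is an integer array of MPI_STATUS_SIZE elements. If the caller reserves only one integer, the remaining writes land on whatever sits next to it in memory. The toy C model below sketches one plausible way this could clobber a neighboring file handle, which would be consistent with the "Invalid file handle" message and with the behavior changing under -g (a different memory layout). It is an illustration under assumptions, not a mechanism confirmed in this thread: STATUS_SIZE, FRAME_WORDS, HANDLE_MAGIC, fill_status, and handle_after_call are all hypothetical names, no MPI is involved, and the real layout depends on the compiler and its flags.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for MPI_STATUS_SIZE; the real value is
 * implementation-defined. */
#define STATUS_SIZE  5
/* Size of the toy "stack frame", in integer words (assumption). */
#define FRAME_WORDS  8
/* Arbitrary marker value representing a valid file handle (assumption). */
#define HANDLE_MAGIC 42

/* Models a callee that always writes a full status, i.e. STATUS_SIZE
 * integers, starting at frame[status_at], no matter how much room the
 * caller actually reserved for it. */
static void fill_status(int frame[FRAME_WORDS], int status_at)
{
    assert(status_at >= 0 && status_at + STATUS_SIZE <= FRAME_WORDS);
    for (int i = 0; i < STATUS_SIZE; i++)
        frame[status_at + i] = -1;
}

/* Lays out a toy frame: the status starts at word 0 and the file handle
 * is stored immediately after the words the caller reserved for the
 * status. Returns the handle value seen after the callee has run. */
int handle_after_call(int status_words_reserved)
{
    int frame[FRAME_WORDS];

    assert(status_words_reserved >= 1 && status_words_reserved < FRAME_WORDS);
    memset(frame, 0, sizeof frame);

    int handle_at = status_words_reserved;  /* handle sits right after status */
    frame[handle_at] = HANDLE_MAGIC;

    fill_status(frame, 0);                  /* callee writes the full status */
    return frame[handle_at];
}
```

Under this model, handle_after_call(1) (the scalar case) returns an overwritten value, while handle_after_call(STATUS_SIZE) (the array case) returns HANDLE_MAGIC unchanged.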