Problems with testing PNETCDF 1.6.1

Wei-keng Liao wkliao at eecs.northwestern.edu
Thu Feb 4 10:31:34 CST 2016


Hi, Michael

That's a good catch.

However, PnetCDF does not make Fortran MPI calls directly.
All MPI calls are in C, and every status is declared as MPI_Status.
Maybe try the same test program in C to see whether the problem
can be reproduced?
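
A C analogue of the test could look roughly like the sketch below. It is not the attached mpi_io_hint program: the file name testfile.dat, the helper check(), and the 6-byte payload are made up for illustration. It sets the romio_no_indep_rw hint, lets rank 0 issue one independent write, and closes the file. In C the status argument is an MPI_Status, so the undersized-status problem found in the Fortran test cannot occur here.

```c
/* Sketch only: build with an MPI C compiler wrapper such as mpicc. */
#include <stdio.h>
#include <mpi.h>

/* Hypothetical helper: print the MPI error string and abort on failure. */
static void check(int err, const char *what)
{
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "Error at %s: %s\n", what, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    int rank;
    char buf[] = "header";   /* illustrative payload */
    MPI_File fh;
    MPI_Info info;
    MPI_Status status;       /* a full MPI_Status, never a bare int */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The same hint that the FLASH-IO benchmark sets. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_no_indep_rw", "true");

    check(MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                        MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh),
          "MPI_File_open");
    MPI_Info_free(&info);

    /* Only rank 0 writes, with an independent call, which is how
       PnetCDF writes the file header. */
    if (rank == 0)
        check(MPI_File_write_at(fh, 0, buf, 6, MPI_CHAR, &status),
              "MPI_File_write_at");

    check(MPI_File_close(&fh), "MPI_File_close");
    MPI_Finalize();
    return 0;
}
```

Running it with 2 processes, the same way as the Fortran test, should show whether the hint alone triggers the error.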

Wei-keng

On Feb 4, 2016, at 8:57 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:

> 
> Hi Michael:
> 
> I redefined mpistatus as an array as you suggested, and the mpi_io_hint test program now runs w/o issue for both SGI MPT and Intel MPI.
> 
> -Eric
> 
> 
> From: Michael Raymond <mraymond at sgi.com>
> Date: Thursday, February 4, 2016 9:06 AM
> To: Eric Kemp <eric.kemp at nasa.gov>
> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
> Subject: Re: Problems with testing PNETCDF 1.6.1
> 
>   The test program has
> 
> integer mpistatus
> 
>   It should be
> 
> integer mpistatus(MPI_STATUS_SIZE)
> 
>   I don’t have a GPFS filesystem to test on, but it runs fine on NFS and Lustre for me.
> 
> Michael A. Raymond
> SGI MPT Team Leader
> 1 (651) 683-7523
> 
> 
> 
>> On Feb 4, 2016, at 07:48, Michael Raymond <mraymond at sgi.com> wrote:
>> 
>>   If I compile the test program with -g, it runs fine. Digging deeper.
>> 
>> Michael A. Raymond
>> SGI MPT Team Leader
>> 1 (651) 683-7523
>> 
>> 
>> 
>>> On Feb 3, 2016, at 14:26, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] <eric.kemp at nasa.gov> wrote:
>>> 
>>> 
>>> Hi Wei-keng:
>>> 
>>> Your program (successfully?) triggers two errors when I run it with 'mpiexec_mpt -np 2 ./mpi_io_hint'
>>> 
>>> (1) This error message:
>>> 
>>> MPI_File_f2c: Invalid file handle
>>>  Error at MPI_File_close Invalid file handle
>>> 
>>> (2) The program then hangs and must be manually killed.
>>> 
>>> The same behavior occurs when increasing the number of MPI processes. However, when I only run with 1 process (serially), I only see the error message (1) but the program stops gracefully. These errors are with SGI MPT 2.12.
>>> 
>>> The program runs w/o issue if I use Intel MPI 15.0.3.187. 
>>> 
>>> Cheers,
>>> 
>>> -Eric
>>> 
>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> Date: Wednesday, February 3, 2016 2:32 PM
>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> 
>>> Hi, Eric
>>> 
>>> Many thanks for digging into the problem. It is great to know the cause.
>>> 
>>> In PnetCDF, writing the file header is done by the root process using an
>>> independent write. So, this line is not necessary. However, setting
>>> this hint should not cause the program to hang or break anything.
>>> 
>>> This indicates a bug in the MPI-IO component of the SGI MPT.
>>> 
>>> Attached is a small test program that should reproduce the hang
>>> problem you encountered.
>>> 
>>> Thanks again for putting more effort into this!
>>> 
>>> Wei-keng
>>> 
>>> 
>>> On Feb 3, 2016, at 12:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> 
>>> > 
>>> > Hi Wei-keng:
>>> > 
>>> > Good news: I went through the source code, adding print statements, and
>>> > traced the hang to the nfmpi_wait_all invocation in
>>> > checkpoint_ncmpi_parallel.F90.  After some further trial-and-error, I
>>> > fixed the problem by commenting out this line in subroutine
>>> > checkpoint_wr_ncmpi_par:
>>> > 
>>> >     call MPI_Info_set(file_info, 'romio_no_indep_rw', 'true', err)
>>> > 
>>> > I don't understand what is going on under the hood, but perhaps the
>>> > combination of SGI MPT and GPFS (the file system on NASA's supercomputer)
>>> > makes this setting undesirable?
>>> > 
>>> > Anyway, after commenting out this line, both SGI MPT and Intel MPI tests
>>> > work, and produce FLASH output files with the appropriate sizes. So I'm
>>> > going to declare victory and install with this change.
>>> > 
>>> > Thanks for your help!
>>> > 
>>> > -Eric
>>> > 
>>> > Eric M. Kemp (SSAI)
>>> > NASA/GSFC 
>>> > Mail Code: 606 
>>> > Greenbelt, MD 20771
>>> > 301.286.9768 
>>> > eric.kemp at nasa.gov
>>> > eric.kemp at ssaihq.com
>>> > 
>>> > 
>>> > 
>>> > On 2/3/16 11:24 AM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> > 
>>> >> Hi, Eric
>>> >> 
>>> >> The file size of 2964 bytes is what the file header size is supposed
>>> >> to be. This indicates the program has reached the call to
>>> >> nfmpi_enddef() at line 155 of file checkpoint_ncmpi_parallel.F90.
>>> >> The file size also tells us that no data has been written to the file yet.
>>> >> In this case, there are two possible places where the program can hang:
>>> >> line 155 and line 600.
>>> >> 
>>> >> If you put a print statement after line 156, you can check whether
>>> >> the program returns from nfmpi_enddef() correctly or hangs there.
>>> >> 
>>> >> The other possible hang is at line 600, the call to nfmpi_wait_all.
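
The reasoning above amounts to comparing the size of the file left behind with the header size and with the expected final size. The sketch below is only an illustration of that comparison. The enum, the function name classify_output(), and the idea of passing the sizes in by hand are assumptions, not part of PnetCDF or FLASH-IO.

```c
#include <assert.h>

/* Hypothetical classification of an output file left behind by a hung run. */
enum output_state {
    OUTPUT_NO_HEADER,    /* header missing or incomplete            */
    OUTPUT_HEADER_ONLY,  /* header written, no variable data yet    */
    OUTPUT_PARTIAL,      /* some data written, file not complete    */
    OUTPUT_COMPLETE      /* file has reached its expected size      */
};

/* All sizes are in bytes; header_size and expected_size are supplied
   by the caller (they are not read from the file). */
enum output_state classify_output(long long file_size,
                                  long long header_size,
                                  long long expected_size)
{
    assert(header_size > 0 && expected_size >= header_size);

    if (file_size < header_size)
        return OUTPUT_NO_HEADER;
    if (file_size == header_size)
        return OUTPUT_HEADER_ONLY;   /* define mode finished, no data written */
    if (file_size < expected_size)
        return OUTPUT_PARTIAL;
    return OUTPUT_COMPLETE;
}
```

With the sizes from this thread (a 2964-byte file, a 2964-byte header, and an expected checkpoint size of 254075392 bytes), the function returns OUTPUT_HEADER_ONLY, which matches the diagnosis that the header was written but no data followed.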
>>> >> 
>>> >> Wei-keng
>>> >> 
>>> >> On Feb 3, 2016, at 7:08 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >> 
>>> >>> 
>>> >>> Good morning Wei-keng:
>>> >>> 
>>> >>> I reran the 1.7.0.pre1 FLASH-IO test.  The only output from the program is:
>>> >>> 
>>> >>> -rw-r--r-- 1 emkemp k3002 2964 Feb  3 07:53 flash_io_test_ncmpi_chk_0000.nc
>>> >>> 
>>> >>> I will tinker with the source code to see if I can identify where it
>>> >>> hangs.
>>> >>> 
>>> >>> 
>>> >>> Thanks,
>>> >>> 
>>> >>> -Eric
>>> >>> 
>>> >>> 
>>> >>> 
>>> >>> On 2/2/16 5:36 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> >>> 
>>> >>>> Hi, Eric
>>> >>>> 
>>> >>>> Unfortunately, I do not have access to an SGI machine. We usually
>>> >>>> rely on our users to do some initial debugging in situations like
>>> >>>> this.
>>> >>>> I know this can be too much to ask, but if you did not encounter any
>>> >>>> problem when running your program, maybe you can ignore this test.
>>> >>>> 
>>> >>>> However, since the hang occurs only with SGI MPT, I suspect it is
>>> >>>> related to MPT.
>>> >>>> 
>>> >>>> Could you check one thing for me? After you kill the FLASH-IO job,
>>> >>>> could you check whether any netCDF files were created? The expected
>>> >>>> files and their sizes are
>>> >>>> 
>>> >>>> -rw------- 1 254075392 Feb  1 21:58 flash_io_test_ncmpi_chk_0000.nc
>>> >>>> -rw------- 1  21208576 Feb  1 21:58 flash_io_test_ncmpi_plt_cnt_0000.nc
>>> >>>> -rw------- 1  25431372 Feb  1 21:58 flash_io_test_ncmpi_plt_crn_0000.nc
>>> >>>> 
>>> >>>> 
>>> >>>> Wei-keng
>>> >>>> 
>>> >>>> On Feb 2, 2016, at 3:53 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>> 
>>> >>>>> 
>>> >>>>> Hi Wei-keng:
>>> >>>>> 
>>> >>>>> I tried rerunning the entire installation with PNETCDF_SAFE_MODE=1.
>>> >>>>> FLASH-IO still hangs with SGI MPT (with no error message), but it
>>> >>>>> works fine with Intel MPI.
>>> >>>>> 
>>> >>>>> -Eric
>>> >>>>> 
>>> >>>>> 
>>> >>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> >>>>> Date: Tuesday, February 2, 2016 12:39 PM
>>> >>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> >>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> >>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> >>>>> 
>>> >>>>> Hi, Eric
>>> >>>>> 
>>> >>>>> Sorry for sending the wrong file. The correct one is attached, in case
>>> >>>>> you would like to use it.
>>> >>>>> 
>>> >>>>> I checked your config.log file but could not find anything fishy.
>>> >>>>> I just now tested it with Intel compiler 16.0.0.109 without a problem.
>>> >>>>> Could you try running FLASH-IO in safe mode, i.e., with the environment
>>> >>>>> variable PNETCDF_SAFE_MODE set to 1? It will enable internal checking
>>> >>>>> for data inconsistency.
>>> >>>>> 
>>> >>>>> I just want to make sure that, for 1.7.0.pre1, your "make ptest" failed
>>> >>>>> only on FLASH-IO.
>>> >>>>> Because FLASH-IO is the last test program, this would mean all other
>>> >>>>> tests have passed.
>>> >>>>> Let me know. Thanks.
>>> >>>>> 
>>> >>>>> Wei-keng
>>> >>>>> 
>>> >>>>> 
>>> >>>>> On Feb 2, 2016, at 8:27 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>>> 
>>> >>>>>> 
>>> >>>>>> Hi Wei-keng:
>>> >>>>>> 
>>> >>>>>> I think you sent me the wrong copy of that file; it was identical to
>>> >>>>>> what is in 1.7.0.pre1.  But I went ahead and added "cmd" as an argument
>>> >>>>>> to subroutine check_err, and that test code compiles and runs.
>>> >>>>>> 
>>> >>>>>> The large file tests pass in 1.7.0.pre1 as you indicated. However,
>>> >>>>>> FLASH-IO still hangs with SGI MPT.  I took your suggestion and tried
>>> >>>>>> running this test separately (cd benchmarks/FLASH-IO ; make ptest) but
>>> >>>>>> the code still hangs.
>>> >>>>>> 
>>> >>>>>> I've attached the (gzipped) config.log file from the 1.7.0.pre1
>>> >>>>>> installations.
>>> >>>>>> 
>>> >>>>>> Thanks,
>>> >>>>>> 
>>> >>>>>> -Eric
>>> >>>>>> 
>>> >>>>>> Eric M. Kemp (SSAI)
>>> >>>>>> NASA/GSFC 
>>> >>>>>> Mail Code: 606
>>> >>>>>> Greenbelt, MD 20771
>>> >>>>>> 301.286.9768 
>>> >>>>>> eric.kemp at nasa.gov
>>> >>>>>> eric.kemp at ssaihq.com
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> From: Wei-keng Liao <wkliao at eecs.northwestern.edu>
>>> >>>>>> Date: Monday, February 1, 2016 5:12 PM
>>> >>>>>> To: Eric Kemp <eric.kemp at nasa.gov>
>>> >>>>>> Cc: "parallel-netcdf at mcs.anl.gov" <parallel-netcdf at mcs.anl.gov>
>>> >>>>>> Subject: Re: Problems with testing PNETCDF 1.6.1
>>> >>>>>> 
>>> >>>>>> Hi, Eric,
>>> >>>>>> 
>>> >>>>>> Thanks for reporting the error. This is another oversight, sorry.
>>> >>>>>> The fixed file, bigrecords.f, is attached.
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> Wei-keng
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> On Feb 1, 2016, at 2:41 PM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>>>> 
>>> >>>>>>> 
>>> >>>>>>> Hi Wei-keng:
>>> >>>>>>> 
>>> >>>>>>> Thanks for your quick response. I tried installing 1.7.0.pre1 but I
>>> >>>>>>> get a different error when compiling the tests:
>>> >>>>>>> 
>>> >>>>>>> /usr/local/intel/2016/impi/5.1.2.150/bin64/mpif90    -I../../src/lib -I./../common   -I../../src/libf -I../../src/libf90 -fpic -O2 -fp-model strict  -c bigrecords.f
>>> >>>>>>> bigrecords.f(333): error #6514: A substring must be of type CHARACTER.   [CMD]
>>> >>>>>>>        msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>> >>>>>>> ------------------------------------^
>>> >>>>>>> bigrecords.f(333): error #6054: A CHARACTER data type is required in this context.   [CMD]
>>> >>>>>>>        msg = '*** TESTING F77 '//cmd(1:XTRIM(cmd))//
>>> >>>>>>> ------------------------------------^
>>> >>>>>>> compilation aborted for bigrecords.f (code 1)
>>> >>>>>>> 
>>> >>>>>>> 
>>> >>>>>>> 
>>> >>>>>>> This appears to be a legitimate syntax error in the test program, in
>>> >>>>>>> subroutine check_err.  "cmd" is not defined in that subroutine, nor is
>>> >>>>>>> it a global variable.
>>> >>>>>>> 
>>> >>>>>>> I will try patching 1.6.1 with the NC_64BIT_DATA constant instead.
>>> >>>>>>> 
>>> >>>>>>> -Eric
>>> >>>>>>> 
>>> >>>>>>> 
>>> >>>>>>> 
>>> >>>>>>> 
>>> >>>>>>> On 2/1/16 12:02 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>> >>>>>>> 
>>> >>>>>>>> Hi, Eric
>>> >>>>>>>> 
>>> >>>>>>>> For the large file tests, the error is caused by an oversight: the
>>> >>>>>>>> wrong flag was used.
>>> >>>>>>>> Line 81 of file large_files.c should have used NC_64BIT_DATA instead of
>>> >>>>>>>> NC_64BIT_OFFSET.
>>> >>>>>>>> This error has been fixed in the pre-release 1.7.0.pre1. Could you
>>> >>>>>>>> give it a try?
>>> >>>>>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>>> >>>>>>>> 
>>> >>>>>>>> As for the FLASH-IO test, could you try running it alone? I.e., cd to
>>> >>>>>>>> the folder benchmarks/FLASH-IO and run "make ptest" there. In the
>>> >>>>>>>> meantime, please send me the file config.log.
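
A rough sketch of why the flag matters: as far as the file formats go, NC_64BIT_OFFSET (CDF-2) widens only file offsets to 64 bits and still stores sizes such as dimension lengths as 4-byte integers, while NC_64BIT_DATA (CDF-5) stores them as 8-byte integers. A size that does not fit in 4 bytes can therefore produce the "Overflow when type cast to 4-byte integer" error under CDF-2. The helpers below are hypothetical and are not PnetCDF's actual check; they only show the kind of range test involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if n can be stored in a signed
   4-byte integer, 0 otherwise. */
int fits_in_int32(int64_t n)
{
    return n >= INT32_MIN && n <= INT32_MAX;
}

/* Hypothetical helper: total size in bytes of `count` elements of
   `elem_size` bytes each, computed in 64-bit arithmetic. */
int64_t total_bytes(int64_t count, int64_t elem_size)
{
    assert(count >= 0 && elem_size > 0);
    assert(count <= INT64_MAX / elem_size);   /* guard the multiplication */
    return count * elem_size;
}
```

For example, 2^30 four-byte values occupy 4294967296 bytes, which no longer fits in a signed 4-byte integer, so fits_in_int32() returns 0 for that size.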
>>> >>>>>>>> 
>>> >>>>>>>> 
>>> >>>>>>>> Wei-keng
>>> >>>>>>>> 
>>> >>>>>>>> On Feb 1, 2016, at 7:32 AM, Kemp, Eric M. (GSFC-606.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
>>> >>>>>>>> 
>>> >>>>>>>>> 
>>> >>>>>>>>> Dear PNETCDF developers:
>>> >>>>>>>>> 
>>> >>>>>>>>> I'm attempting to install PNETCDF 1.6.1 on a Linux cluster running SLES
>>> >>>>>>>>> 11.3.  I'm using Intel 15 Fortran and C compilers (no C++), and I'm
>>> >>>>>>>>> trying to install for two separate MPI implementations (SGI MPT 2.12 and
>>> >>>>>>>>> Intel MPI 5.1.2).
>>> >>>>>>>>> 
>>> >>>>>>>>> I'm encountering two problems when I run 'make ptest'.
>>> >>>>>>>>> 
>>> >>>>>>>>> 1)  For both MPI implementations, the large file tests fail with an
>>> >>>>>>>>> integer overflow.  The error message is:
>>> >>>>>>>>> 
>>> >>>>>>>>> *** Testing large files, slowly.
>>> >>>>>>>>> line 116 of large_files.c: Overflow when type cast to 4-byte integer.
>>> >>>>>>>>> *** Creating large file ./testfile.nc...srun.slurm: error: borgo018: task 0: Exited with exit code 1
>>> >>>>>>>>> 
>>> >>>>>>>>> I reviewed the README.large_files for guidance, and I can confirm that
>>> >>>>>>>>> both 'MPI_Offset' and 'off_t' are 8 bytes.
>>> >>>>>>>>> 
>>> >>>>>>>>> 2) For SGI MPT only, if I disable support for large file tests, 'make
>>> >>>>>>>>> ptest' hangs when testing FLASH-IO:
>>> >>>>>>>>> 
>>> >>>>>>>>> make -w -C FLASH-IO ptest
>>> >>>>>>>>> make[2]: Entering directory `/gpfsm/dnb32/emkemp/NUWRFLIB/svn/trunk/builds/parallel-netcdf-1.6.1/benchmarks/FLASH-IO'
>>> >>>>>>>>> mpiexec_mpt -n 4 ./flash_benchmark_io ./flash_io_test_
>>> >>>>>>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>> >>>>>>>>> 
>>> >>>>>>>>> The earlier tests with both single and multiple processes work for SGI
>>> >>>>>>>>> MPT. And all tests (again, excluding large file tests) work for Intel
>>> >>>>>>>>> MPI.
>>> >>>>>>>>> 
>>> >>>>>>>>> I can provide more information (e.g., output from the configure script)
>>> >>>>>>>>> upon request.
>>> >>>>>>>>> 
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> 
>>> >>>>>>>>> -Eric
>>> >>>>>>>>> 
>>> >>>>>>>>> Eric M. Kemp (SSAI)
>>> >>>>>>>>> NASA/GSFC 
>>> >>>>>>>>> Mail Code: 606
>>> >>>>>>>>> Greenbelt, MD 20771
>>> >>>>>>>>> 301.286.9768
>>> >>>>>>>>> eric.kemp at nasa.gov
>>> >>>>>>>>> eric.kemp at ssaihq.com
>>> >>>>>>>>> 
>>> >>>>>> <config.log.gz>