pnetcdf 1.2.0 create file issue

Wei-keng Liao wkliao at ece.northwestern.edu
Mon May 21 22:51:34 CDT 2012


Hi, David,

The pnetcdf tests you were running are designed to run on one process.
Those error messages are expected: the tests check whether pnetcdf
correctly reports each error. At the end of the run, the output reports
how many tests failed and how many failures were expected. The two
numbers should match. For example,

NOTE: parallel-netcdf expects to see 2 failures
Total number of failures: 2



Wei-keng


On May 21, 2012, at 4:58 PM, David Knaak wrote:

> Hi Jim,
> 
> I ran the pnetcdf tests using xt-mpich2/5.4.2 and it does have some problems
> that are fixed in xt-mpich2/5.4.3.  Did you try this
> 
>  module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> 
> for your application?
> 
> 
> 
> Rob,
> 
> Are the pnetcdf tests designed to run with more than one MPI process?
> When I run with just one process, the Cray library (that is, the 5.4.3
> and later versions) reports 2 errors, as expected:
> 
> ------------------------------------------------------------------------
> *** Testing ncmpi_open             ...
>    FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> 0: MPI_File_open : File does not exist, error stack:
> ADIOI_CRAY_OPEN(86): File tooth-fairy.nc does not exist
>    ### 1 FAILURES TESTING ncmpi_open! ###
> 
> *** Testing ncmpi_create           ...
>    FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> 0: MPI_File_open : Other I/O error , error stack:
> ADIOI_CRAY_OPEN(116): Other I/O error File exists
>    ### 1 FAILURES TESTING ncmpi_create! ###
> ------------------------------------------------------------------------
> 
> The test file line numbers are a little different from the base pnetcdf 
> tests because of some extra print statements I put in for debugging.
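
The two expected failures come from the nonexistent-file open check and the
noclobber create check. The following is only a minimal sketch of those two
operations against the public pnetcdf C API, not the actual test code; the
file names follow the output above.

    /* sketch_expected_failures.c -- illustrative only, not nc_test/nf_test */
    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int err, ncid;

        MPI_Init(&argc, &argv);

        /* expected failure 1: open of a nonexistent file must be rejected */
        err = ncmpi_open(MPI_COMM_WORLD, "tooth-fairy.nc", NC_NOWRITE,
                         MPI_INFO_NULL, &ncid);
        if (err != NC_NOERR)
            printf("open of missing file: %s\n", ncmpi_strerror(err));

        /* create scratch.nc once so that it exists ... */
        err = ncmpi_create(MPI_COMM_WORLD, "scratch.nc", NC_CLOBBER,
                           MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) ncmpi_close(ncid);

        /* expected failure 2: NC_NOCLOBBER create of an existing file */
        err = ncmpi_create(MPI_COMM_WORLD, "scratch.nc", NC_NOCLOBBER,
                           MPI_INFO_NULL, &ncid);
        if (err != NC_NOERR)
            printf("noclobber create: %s\n", ncmpi_strerror(err));

        MPI_Finalize();
        return 0;
    }

On one process, both calls should come back with nonzero error codes; the
real test additionally checks which error code is returned, which appears
to be what the FAILURE lines above are complaining about.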
> 
> When I run with the unmodified ANL MPICH2 code I get the same results:
> 
> ------------------------------------------------------------------------
> *** Testing ncmpi_open             ...
>    FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> 0: MPI_File_open : File does not exist, error stack:
> ADIOI_UFS_OPEN(70): File tooth-fairy.nc does not exist
>    ### 1 FAILURES TESTING ncmpi_open! ###
> 
> *** Testing ncmpi_create           ...
>    FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> 0: MPI_File_open : Other I/O error , error stack:
> ADIOI_UFS_OPEN(100): Other I/O error File exists
>    ### 1 FAILURES TESTING ncmpi_create! ###
> ------------------------------------------------------------------------
> 
> But when I run with more MPI tasks, the results get very messy, for both
> implementations.  For example, running with 4 tasks, the end of the
> output for the base ANL library is:
> 
> ------------------------------------------------------------------------
>    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> 
>    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> ok
> 
>    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> 
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 7
> 
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 10
> 
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 8
> 
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 10
> Application 7538532 exit codes: 1
> ------------------------------------------------------------------------
> 
> Is this what you expect? 
> 
> David
> 
> 
> On Sat, May 19, 2012 at 07:23:24AM -0600, Jim Edwards wrote:
>> Hi David,
>> 
>> I built the pnetcdf 1.3.0-pre1 release on bluewaters and
>> ran the nf_test provided with that package.  It exhibits the same issues with
>> nfmpi_create and nfmpi_open that I am seeing from the installed
>> parallel-netcdf-1.2.0 and my application. 
>> I have opened ticket BWDSPCH-298 on bluewaters to track this issue.  
>> 
>> On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak at cray.com> wrote:
>> 
>>    Jim,
>> 
>>    I'm not intimate with pnetcdf, but the stack trace entries look a
>>    little strange to me.  In particular:
>> 
>>    #10
>>     count=0x1527daa0, stride=0x0
>> 
>>    #13
>>     start=0x7fffffff3750, count=0x7fffffff3750
>> 
>>    Could it be that the high-level routine is being called with a bad
>>    argument?
>> 
>>    Just a thought.
>> 
>>    David
>> 
>>    On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
>>> Hi David,
>>> 
>>> I've updated to the 5.4.5 mpich2 on bluewaters and now get a different
>>> problem; it happens earlier in the code, so I think that this worked in
>>> the older version.  Here is a partial stack trace from the core file on
>>> 16000 tasks.  I'll try on Monday to reproduce this on a smaller number
>>> of tasks.  Can I send you more info when I get it?
>>> 
>>> #0  memcpy () at ../sysdeps/x86_64/memcpy.S:102
>>> #1  0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
>>> #2  0x0000000000bc91ae in MPIDI_Isend_self ()
>>> #3  0x0000000000bc5ef4 in MPID_Isend ()
>>> #4  0x0000000000be4b2b in PMPI_Isend ()
>>> #5  0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
>>> #6  0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
>>> #7  0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
>>> #8  0x0000000000b7ea69 in MPIOI_File_read_all ()
>>> #9  0x0000000000b7eab6 in PMPI_File_read_all ()
>>> #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0,
>>>     varp=0x109ee340, start=0x15269910, count=0x1527daa0, stride=0x0,
>>>     buf=0x155f5480, bufcount=245561, datatype=-871890885, rw_flag=1,
>>>     io_method=1) at ./getput_vars.c:741
>>> #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59,
>>>     start=0x15269910, count=0x1527daa0, buf=0x155f5480,
>>>     bufcount=245561, datatype=-871890885) at ./getput_vara.c:435
>>> #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8,
>>>     v2=0x33e89a0, v3=0x418d500, v4=0x418c500, v5=0x155f5480,
>>>     v6=0x1e72e10, v7=0x1e72e04) at ./get_vara_allf.c:57
>>> #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double (file=...,
>>>     iobuf=0x155f5480, vardesc=..., iodesc=..., start=0x7fffffff3750,
>>>     count=0x7fffffff3750)
>>> 
>>> 
>>> On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>> 
>>>    David,
>>> 
>>>    I will give this a try, thanks.
>>> 
>>> 
>>>    On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:
>>> 
>>>        Rob,
>>> 
>>>        I suggested taking the discussion off line to not bother those
>>>        not interested in the Cray specifics.  But if you think those on
>>>        the list are either interested, or don't consider it a bother, I
>>>        can certainly use the full list.
>>> 
>>>        All,
>>> 
>>>        In the MPT 5.4.0 release, I made some changes to MPI_File_open
>>>        to improve scalability.  Because of these changes and previous
>>>        changes I had made (for added functionality, not because of any
>>>        bugs), the code was getting very messy.  In fact, I introduced a
>>>        bug or 2 with these changes.  So in 5.4.3, I significantly
>>>        restructured the code for better maintainability, fixed the bugs
>>>        (that I knew of) and made more scalability changes.
>>> 
>>>        Jim,
>>> 
>>>        The NCSA's "ESS" has the 5.4.2 version of Cray's MPI
>>>        implementation as default.  The "module list" command output
>>>        that you included shows:
>>> 
>>>          3) xt-mpich2/5.4.2
>>> 
>>>        The "module avail xt-mpich2" command shows what other versions
>>>        are available:
>>> 
>>>        h2ologin2 25=>module avail xt-mpich2
>>>        --------------------- /opt/cray/modulefiles ---------------------
>>>        xt-mpich2/5.4.2(default)     xt-mpich2/5.4.4     xt-mpich2/5.4.5
>>> 
>>>        Would you switch to 5.4.5, relink, and try again?
>>> 
>>>        h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
>>> 
>>>        Thanks.
>>>        David
>>> 
>>> 
>>>        On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
>>>> On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
>>>>> Jim,
>>>>> 
>>>>> Since you are having this problem on a Cray system, please open a
>>>>> Cray bug report against MPI and I will look at it.  We can take
>>>>> further discussions off line.
>>>> 
>>>> Oh, howdy David!  Forgot you were on the list.  Thanks for keeping an
>>>> eye on things.
>>>> 
>>>> The pnetcdf list is pretty low-traffic these days, but we have an
>>>> awful lot of users in a Cray and Lustre environment.  If you'd rather
>>>> discuss Cray-specific stuff elsewhere, I'd understand, but please let
>>>> us know what you figure out.
>>>> 
>>>> ==rob
>>>> 
>>>>> Thanks.
>>>>> David
>>>>> 
>>>>> On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>>>>>> 
>>>>>>    On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
>>>>>>> This occurs on the NCSA machine bluewaters.  I am using
>>>>>>> pnetcdf 1.2.0 and pgi 11.10.0.
>>>>>> 
>>>>>>    Need one more bit of information: the version of MPT you are
>>>>>>    using.
>>>>>> 
>>>>>> 
>>>>>> Sorry, what's mpt?  MPI?
>>>>>> Currently Loaded Modulefiles:
>>>>>>   1) modules/3.2.6.6
>>>>>>   2) xtpe-network-gemini
>>>>>>   3) xt-mpich2/5.4.2
>>>>>>   4) xtpe-interlagos
>>>>>>   5) eswrap/1.0.12
>>>>>>   6) torque/2.5.10
>>>>>>   7) moab/6.1.5
>>>>>>   8) scripts
>>>>>>   9) user-paths
>>>>>>  10) pgi/11.10.0
>>>>>>  11) xt-libsci/11.0.04
>>>>>>  12) udreg/2.3.1-1.0400.4264.3.1.gem
>>>>>>  13) ugni/2.3-1.0400.4374.4.88.gem
>>>>>>  14) pmi/3.0.0-1.0000.8661.28.2807.gem
>>>>>>  15) dmapp/3.2.1-1.0400.4255.2.159.gem
>>>>>>  16) gni-headers/2.1-1.0400.4351.3.1.gem
>>>>>>  17) xpmem/0.1-2.0400.31280.3.1.gem
>>>>>>  18) xe-sysroot/4.0.46
>>>>>>  19) xt-asyncpe/5.07
>>>>>>  20) atp/1.4.1
>>>>>>  21) PrgEnv-pgi/4.0.46
>>>>>>  22) hdf5-parallel/1.8.7
>>>>>>  23) netcdf-hdf5parallel/4.1.3
>>>>>>  24) parallel-netcdf/1.2.0
>>>>>> 
>>>>>> 
>>>>>>> The issue is that calling nfmpi_createfile would sometimes result
>>>>>>> in an error:
>>>>>>> 
>>>>>>> MPI_File_open : Other I/O error , error stack:
>>>>>>> (unknown)(): Other I/O error
>>>>>>> 126: MPI_File_open : Other I/O error , error stack:
>>>>>>> (unknown)(): Other I/O error
>>>>>>>  Error on create :           502          -32
>>>>>>> 
>>>>>>> The error appears to be intermittent, and I could not get it to
>>>>>>> occur at all on a small number of tasks (160), but it occurs with
>>>>>>> high frequency when using a larger number of tasks (>=1600).  I
>>>>>>> traced the problem to the use of nf_clobber in the mode argument;
>>>>>>> removing the nf_clobber seems to have solved the problem, and I
>>>>>>> think that create implies clobber anyway, doesn't it?
>>>>>> 
>>>>>>> Can someone who knows what is going on under the covers enlighten
>>>>>>> me with some understanding of this issue?  I suspect that one task
>>>>>>> is trying to clobber the file that another has just created, or
>>>>>>> something of that nature.
>>>>>> 
>>>>>>    Unfortunately, "under the covers" here means "inside the MPI-IO
>>>>>>    library", which we don't have access to.
>>>>>> 
>>>>>>    In the create case we call MPI_File_open with "MPI_MODE_RDWR |
>>>>>>    MPI_MODE_CREATE", and if noclobber is set, we add MPI_MODE_EXCL.
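
A standalone sketch of that flag mapping at the MPI-IO level may make the
noclobber behavior concrete. This is not pnetcdf source; the file name is
arbitrary and the program is only illustrative.

    /* excl_create_sketch.c -- illustrative only, not pnetcdf source */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int err;

        MPI_Init(&argc, &argv);

        /* "clobber"-style create: MPI_MODE_RDWR | MPI_MODE_CREATE */
        err = MPI_File_open(MPI_COMM_WORLD, "example_create.nc",
                            MPI_MODE_RDWR | MPI_MODE_CREATE,
                            MPI_INFO_NULL, &fh);
        if (err == MPI_SUCCESS) MPI_File_close(&fh);

        /* "noclobber"-style create: adding MPI_MODE_EXCL makes the open
         * fail, because example_create.nc now exists */
        err = MPI_File_open(MPI_COMM_WORLD, "example_create.nc",
                            MPI_MODE_RDWR | MPI_MODE_CREATE | MPI_MODE_EXCL,
                            MPI_INFO_NULL, &fh);
        if (err != MPI_SUCCESS)
            printf("EXCL open of an existing file failed, as intended\n");
        else
            MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }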
>>>>>> 
>>>>>>    OK, so that's pnetcdf.  What's going on in MPI-IO?  Well, Cray's
>>>>>>    based their MPI-IO off of our ROMIO, but I'm not sure which
>>>>>>    version.
>>>>>> 
>>>>>>    Let me cook up a quick MPI-IO-only test case you can run to
>>>>>>    trigger this problem, and then you can beat Cray over the head
>>>>>>    with it.
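
Rob's quick test case never made it into this thread. The following is only
a guess at what such an MPI-IO-only reproducer might look like, assuming
that a clobbering create amounts to removing any existing file and then
opening it collectively with MPI_MODE_CREATE; the file name and iteration
count are arbitrary.

    /* clobber_loop_sketch.c -- a guess at a reproducer, not Rob's test case */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, err, i, len;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 20; i++) {
            /* rank 0 removes any old file; errors (file absent) are ignored */
            if (rank == 0)
                MPI_File_delete("clobber_test.nc", MPI_INFO_NULL);
            MPI_Barrier(MPI_COMM_WORLD);

            /* collective create with the flags pnetcdf uses */
            err = MPI_File_open(MPI_COMM_WORLD, "clobber_test.nc",
                                MPI_MODE_RDWR | MPI_MODE_CREATE,
                                MPI_INFO_NULL, &fh);
            if (err != MPI_SUCCESS) {
                MPI_Error_string(err, msg, &len);
                printf("rank %d, iteration %d: MPI_File_open failed: %s\n",
                       rank, i, msg);
            } else {
                MPI_File_close(&fh);
            }
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Run at a few thousand ranks, any failures would be expected to show up only
intermittently, matching the behavior Jim reports above.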
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Sounds good, thanks.
>>>>>> 
>>>>>> 
>>>>>>    ==rob
>>>>>> 
>>>>>>    --
>>>>>>    Rob Latham
>>>>>>    Mathematics and Computer Science Division
>>>>>>    Argonne National Lab, IL USA
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jim Edwards
>>>>>> 
>>>>>> CESM Software Engineering Group
>>>>>> National Center for Atmospheric Research
>>>>>> Boulder, CO
>>>>>> 303-497-1842
>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Rob Latham
>>>> Mathematics and Computer Science Division
>>>> Argonne National Lab, IL USA
>>> 
>>>    --
>>>    Jim Edwards
>>> 
>>>    CESM Software Engineering Group
>>>    National Center for Atmospheric Research
>>>    Boulder, CO
>>>    303-497-1842
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Jim Edwards
>>> 
>>> CESM Software Engineering Group
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>> 303-497-1842
>>> 
>> 
>> 
>> --
>> Jim Edwards
>> 
>> CESM Software Engineering Group
>> National Center for Atmospheric Research
>> Boulder, CO
>> 303-497-1842
>> 
> 


