pnetcdf 1.2.0 create file issue

David Knaak knaak at cray.com
Tue May 22 02:50:53 CDT 2012


Hi Wei-keng,

Thanks for the clarification.  When I run with just 1 process, I do get
the message at the end that 2 errors are expected and 2 are seen.  So
I believe that the Cray versions 5.4.3 and later are now doing the correct
thing, at least for 1 process.  I will have to wait until I get more
details of Jim's failure to figure out what is going wrong with multiple
processes.
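
(For anyone following along: the two "expected" failures boil down to checks
along these lines against the pnetcdf C API.  This is only an illustrative
sketch, not the actual test code; it assumes tooth-fairy.nc does not exist
and that scratch.nc already does.)

    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int err, ncid;
        MPI_Init(&argc, &argv);

        /* 1. Opening a file that does not exist must return an error. */
        err = ncmpi_open(MPI_COMM_WORLD, "tooth-fairy.nc", NC_NOWRITE,
                         MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) {
            printf("FAILURE: open of nonexistent file succeeded\n");
            ncmpi_close(ncid);
        }

        /* 2. Creating an existing file with NC_NOCLOBBER must return an
         *    error (the "status = -32" seen in the test output). */
        err = ncmpi_create(MPI_COMM_WORLD, "scratch.nc", NC_NOCLOBBER,
                           MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) {
            printf("FAILURE: create clobbered an existing file\n");
            ncmpi_close(ncid);
        }

        MPI_Finalize();
        return 0;
    }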

David

On Mon, May 21, 2012 at 10:51:34PM -0500, Wei-keng Liao wrote:
> Hi, David,
> 
> The pnetcdf tests you were running are meant to run on one process.
> Those error messages are expected: they test whether pnetcdf can
> correctly report the error. At the end of the test run, there is a
> line reporting how many tests failed and how many failures are
> expected. The two numbers should match. For example,
> 
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 2
> 
> 
> 
> Wei-keng
> 
> 
> On May 21, 2012, at 4:58 PM, David Knaak wrote:
> 
> > Hi Jim,
> > 
> > I ran the pnetcdf tests using xt-mpich2/5.4.2 and it does have some problems
> > that are fixed in xt-mpich2/5.4.3.  Did you try this
> > 
> >  module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> > 
> > for your application?
> > 
> > 
> > 
> > Rob,
> > 
> > Are the pnetcdf tests designed to run with more than one MPI process?
> > When I run with just one process, the Cray library (that is, the 5.4.3
> > and later versions) reports 2 errors, as expected:
> > 
> > ------------------------------------------------------------------------
> > *** Testing ncmpi_open             ...
> >    FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> > 0: MPI_File_open : File does not exist, error stack:
> > ADIOI_CRAY_OPEN(86): File tooth-fairy.nc does not exist
> >    ### 1 FAILURES TESTING ncmpi_open! ###
> > 
> > *** Testing ncmpi_create           ...
> >    FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> > 0: MPI_File_open : Other I/O error , error stack:
> > ADIOI_CRAY_OPEN(116): Other I/O error File exists
> >    ### 1 FAILURES TESTING ncmpi_create! ###
> > ------------------------------------------------------------------------
> > 
> > The test file line numbers are a little different from the base pnetcdf 
> > tests because of some extra print statements I put in for debugging.
> > 
> > When I run with the unmodified ANL MPICH2 code I get the same results:
> > 
> > ------------------------------------------------------------------------
> > *** Testing ncmpi_open             ...
> >    FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> > 0: MPI_File_open : File does not exist, error stack:
> > ADIOI_UFS_OPEN(70): File tooth-fairy.nc does not exist
> >    ### 1 FAILURES TESTING ncmpi_open! ###
> > 
> > *** Testing ncmpi_create           ...
> >    FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> > 0: MPI_File_open : Other I/O error , error stack:
> > ADIOI_UFS_OPEN(100): Other I/O error File exists
> >    ### 1 FAILURES TESTING ncmpi_create! ###
> > ------------------------------------------------------------------------
> > 
> > But when I run with more MPI tasks, the results get very messy, for both
> > implementations.  For example, running with 4 tasks, the end of the
> > output for the base ANL library is:
> > 
> > ------------------------------------------------------------------------
> >    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> >    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> > 
> >    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> >    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> > ok
> > 
> >    FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> >    ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> > 
> > NOTE: parallel-netcdf expects to see 2 failures
> > Total number of failures: 7
> > 
> > NOTE: parallel-netcdf expects to see 2 failures
> > Total number of failures: 10
> > 
> > NOTE: parallel-netcdf expects to see 2 failures
> > Total number of failures: 8
> > 
> > NOTE: parallel-netcdf expects to see 2 failures
> > Total number of failures: 10
> > Application 7538532 exit codes: 1
> > ------------------------------------------------------------------------
> > 
> > Is this what you expect? 
> > 
> > David
> > 
> > 
> > On Sat, May 19, 2012 at 07:23:24AM -0600, Jim Edwards wrote:
> >> Hi David,
> >> 
> >> I built the pnetcdf 1.3.0-pre1 release on bluewaters and
> >> ran the nf_test provided with that package.  It exhibits the same issues with
> >> nfmpi_create and nfmpi_open that I am seeing from the installed
> >> parallel-netcdf-1.2.0 and my application. 
> >> I have opened ticket BWDSPCH-298 on bluewaters to track this issue.  
> >> 
> >> On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak at cray.com> wrote:
> >> 
> >>    Jim,
> >> 
> >>    I'm not intimate with pnetcdf, but the stack trace entries look a little
> >>    strange to me.  In particular:
> >> 
> >>    #10
> >>     count=0x1527daa0, stride=0x0
> >> 
> >>    #13
> >>     start=0x7fffffff3750, count=0x7fffffff3750
> >> 
> >>    Could it be that the high level routine is being called with a bad
> >>    argument?
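> >> 
> >>    (For comparison, the C flexible API underneath is normally called with
> >>    distinct per-dimension start/count arrays, as frame #11 shows.  A minimal
> >>    sketch, with a made-up 2-D variable and block size, assuming mpi.h and
> >>    pnetcdf.h are included:
> >> 
> >>        /* read a 4 x 10 block of a hypothetical 2-D double variable */
> >>        static int read_block(int ncid, int varid, double buf[40])
> >>        {
> >>            MPI_Offset start[2] = {0, 0};   /* distinct start/count arrays, */
> >>            MPI_Offset count[2] = {4, 10};  /* one entry per dimension      */
> >>            return ncmpi_get_vara_all(ncid, varid, start, count,
> >>                                      buf, 40, MPI_DOUBLE);
> >>        }
> >> 
> >>    Frame #13, by contrast, shows start and count pointing at the same
> >>    address.)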
> >> 
> >>    Just a thought.
> >> 
> >>    David
> >> 
> >>    On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
> >>> Hi David,
> >>> 
> >>> I've updated to the 5.2.5 mpich2 on bluewaters and now get a different
> >>> problem; it happens earlier in the code, so I think that this worked in the
> >>> older version.  Here is a partial stack trace from the core file on 16000
> >>> tasks.  I'll try on Monday to reproduce this on a smaller number of tasks.
> >>> Can I send you more info when I get it?
> >>> 
> >>> #0  memcpy () at ../sysdeps/x86_64/memcpy.S:102
> >>> #1  0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
> >>> #2  0x0000000000bc91ae in MPIDI_Isend_self ()
> >>> #3  0x0000000000bc5ef4 in MPID_Isend ()
> >>> #4  0x0000000000be4b2b in PMPI_Isend ()
> >>> #5  0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
> >>> #6  0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
> >>> #7  0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
> >>> #8  0x0000000000b7ea69 in MPIOI_File_read_all ()
> >>> #9  0x0000000000b7eab6 in PMPI_File_read_all ()
> >>> #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0, varp=0x109ee340,
> >>>     start=0x15269910, count=0x1527daa0, stride=0x0, buf=0x155f5480,
> >>>     bufcount=245561, datatype=-871890885, rw_flag=1, io_method=1)
> >>>     at ./getput_vars.c:741
> >>> #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59, start=0x15269910,
> >>>     count=0x1527daa0, buf=0x155f5480, bufcount=245561, datatype=-871890885)
> >>>     at ./getput_vara.c:435
> >>> #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8, v2=0x33e89a0,
> >>>     v3=0x418d500, v4=0x418c500, v5=0x155f5480, v6=0x1e72e10, v7=0x1e72e04)
> >>>     at ./get_vara_allf.c:57
> >>> #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double (file=...,
> >>>     iobuf=0x155f5480, vardesc=..., iodesc=...,
> >>>     start=0x7fffffff3750, count=0x7fffffff3750)
> >>> 
> >>> 
> >>> On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
> >>> 
> >>>    David,
> >>> 
> >>>    I will give this a try, thanks.
> >>> 
> >>> 
> >>>    On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:
> >>> 
> >>>        Rob,
> >>> 
> >>>        I suggested taking the discussion off line to not bother those not
> >>>        interested in the Cray specifics.  But if you think those on the list
> >>>        are either interested, or don't consider it a bother, I can certainly
> >>>        use the full list.
> >>> 
> >>>        All,
> >>> 
> >>>        In the MPT 5.4.0 release, I made some changes to MPI_File_open to
> >>>        improve scalability.  Because of these changes and previous changes
> >>>        I had made (for added functionality, not because of any bugs), the
> >>>        code was getting very messy.  In fact, I introduced a bug or 2 with
> >>>        these changes.  So in 5.4.3, I significantly restructured the code
> >>>        for better maintainability, fixed the bugs (that I knew of) and made
> >>>        more scalability changes.
> >>> 
> >>>        Jim,
> >>> 
> >>>        The NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as
> >>>        default.  The "module list" command output that you included shows:
> >>> 
> >>>          3) xt-mpich2/5.4.2
> >>> 
> >>>        The "module avail xt-mpich2" command shows what other versions are
> >>>        available:
> >>> 
> >>>        h2ologin2 25=>module avail xt-mpich2
> >>>        --------------------- /opt/cray/modulefiles ---------------------
> >>>        xt-mpich2/5.4.2(default)     xt-mpich2/5.4.4     xt-mpich2/5.4.5
> >>> 
> >>>        Would you switch to 5.4.5, relink, and try again?
> >>> 
> >>>        h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> >>> 
> >>>        Thanks.
> >>>        David
> >>> 
> >>> 
> >>>        On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> >>>> On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> >>>>> Jim,
> >>>>> 
> >>>>> Since you are having this problem on a Cray system, please open a Cray
> >>>>> bug report against MPI and I will look at it.  We can take further
> >>>>> discussions off line.
> >>>> 
> >>>> Oh, howdy David! Forgot you were on the list.  Thanks for keeping an
> >>>> eye on things.
> >>>> 
> >>>> The pnetcdf list is pretty low-traffic these days, but we have an
> >>>> awful lot of users in a Cray and Lustre environment.  If you'd rather
> >>>> discuss Cray-specific stuff elsewhere, I'd understand, but please let
> >>>> us know what you figure out.
> >>>> 
> >>>> ==rob
> >>>> 
> >>>>> Thanks.
> >>>>> David
> >>>>> 
> >>>>> On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >>>>>> 
> >>>>>>    On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> >>>>>>> This occurs on the ncsa machine bluewaters.  I am using pnetcdf 1.2.0 and
> >>>>>>> pgi 11.10.0
> >>>>>> 
> >>>>>>    need one more bit of information: the version of MPT you are using.
> >>>>>> 
> >>>>>> 
> >>>>>> Sorry, what's mpt?  MPI?
> >>>>>> Currently Loaded Modulefiles:
> >>>>>>   1) modules/3.2.6.6         9) user-paths                             17) xpmem/0.1-2.0400.31280.3.1.gem
> >>>>>>   2) xtpe-network-gemini    10) pgi/11.10.0                            18) xe-sysroot/4.0.46
> >>>>>>   3) xt-mpich2/5.4.2        11) xt-libsci/11.0.04                      19) xt-asyncpe/5.07
> >>>>>>   4) xtpe-interlagos        12) udreg/2.3.1-1.0400.4264.3.1.gem        20) atp/1.4.1
> >>>>>>   5) eswrap/1.0.12          13) ugni/2.3-1.0400.4374.4.88.gem          21) PrgEnv-pgi/4.0.46
> >>>>>>   6) torque/2.5.10          14) pmi/3.0.0-1.0000.8661.28.2807.gem      22) hdf5-parallel/1.8.7
> >>>>>>   7) moab/6.1.5             15) dmapp/3.2.1-1.0400.4255.2.159.gem      23) netcdf-hdf5parallel/4.1.3
> >>>>>>   8) scripts                16) gni-headers/2.1-1.0400.4351.3.1.gem    24) parallel-netcdf/1.2.0
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>> The issue is that calling nfmpi_createfile would sometimes result in an
> >>>>>>> error:
> >>>>>>> 
> >>>>>>> MPI_File_open : Other I/O error , error stack:
> >>>>>>> (unknown)(): Other I/O error
> >>>>>>> 126: MPI_File_open : Other I/O error , error stack:
> >>>>>>> (unknown)(): Other I/O error
> >>>>>>>  Error on create :           502          -32
> >>>>>>> 
> >>>>>>> The error appears to be intermittent and I could not get it to occur at
> >>>>>>> all on a small number of tasks (160), but it occurs with high frequency
> >>>>>>> when using a larger number of tasks (>=1600).  I traced the problem to
> >>>>>>> the use of nf_clobber in the mode argument; removing the nf_clobber
> >>>>>>> seems to have solved the problem, and I think that create implies
> >>>>>>> clobber anyway, doesn't it?
> >>>>>> 
> >>>>>>> Can someone who knows what is going on under the covers enlighten me
> >>>>>>> with some understanding of this issue?  I suspect that one task is
> >>>>>>> trying to clobber the file that another has just created, or something
> >>>>>>> of that nature.
> >>>>>> 
> >>>>>>    Unfortunately, "under the covers" here means "inside the MPI-IO
> >>>>>>    library", which we don't have access to.
> >>>>>> 
> >>>>>>    In the create case we call MPI_File_open with "MPI_MODE_RDWR |
> >>>>>>    MPI_MODE_CREATE", and if noclobber is set, we add MPI_MODE_EXCL.
> >>>>>> 
> >>>>>>    OK, so that's pnetcdf.  What's going on in MPI-IO?  Well, Cray's
> >>>>>>    based their MPI-IO off of our ROMIO, but I'm not sure which version.
> >>>>>> 
> >>>>>>    Let me cook up a quick MPI-IO-only test case you can run to trigger
> >>>>>>    this problem and then you can beat Cray over the head with it.
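> >>>>>> 
> >>>>>>    A rough, untested sketch of the kind of MPI-IO-only loop meant here
> >>>>>>    (the file name, the iteration count, and the delete-then-create
> >>>>>>    sequence used to mimic a clobbering create are assumptions, not
> >>>>>>    necessarily what pnetcdf or the eventual test case does):
> >>>>>> 
> >>>>>>        #include <stdio.h>
> >>>>>>        #include <mpi.h>
> >>>>>> 
> >>>>>>        int main(int argc, char **argv) {
> >>>>>>            int rank, i, err;
> >>>>>>            MPI_File fh;
> >>>>>>            MPI_Init(&argc, &argv);
> >>>>>>            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>> 
> >>>>>>            for (i = 0; i < 100; i++) {
> >>>>>>                /* mimic a clobbering create: remove any old file on
> >>>>>>                 * rank 0, then create it collectively from all ranks */
> >>>>>>                if (rank == 0)
> >>>>>>                    MPI_File_delete("reopen-test.nc", MPI_INFO_NULL);
> >>>>>>                MPI_Barrier(MPI_COMM_WORLD);
> >>>>>> 
> >>>>>>                err = MPI_File_open(MPI_COMM_WORLD, "reopen-test.nc",
> >>>>>>                                    MPI_MODE_RDWR | MPI_MODE_CREATE,
> >>>>>>                                    MPI_INFO_NULL, &fh);
> >>>>>>                if (err != MPI_SUCCESS) {
> >>>>>>                    char msg[MPI_MAX_ERROR_STRING]; int len;
> >>>>>>                    MPI_Error_string(err, msg, &len);
> >>>>>>                    printf("rank %d, iteration %d: %s\n", rank, i, msg);
> >>>>>>                } else {
> >>>>>>                    MPI_File_close(&fh);
> >>>>>>                }
> >>>>>>                MPI_Barrier(MPI_COMM_WORLD);
> >>>>>>            }
> >>>>>>            MPI_Finalize();
> >>>>>>            return 0;
> >>>>>>        }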
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> Sounds good, thanks.
> >>>>>> 
> >>>>>> 
> >>>>>>    ==rob
> >>>>>> 
> >>>>>>    --
> >>>>>>    Rob Latham
> >>>>>>    Mathematics and Computer Science Division
> >>>>>>    Argonne National Lab, IL USA
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> --
> >>>>>> Jim Edwards
> >>>>>> 
> >>>>>> CESM Software Engineering Group
> >>>>>> National Center for Atmospheric Research
> >>>>>> Boulder, CO
> >>>>>> 303-497-1842
> >>>>>> 
> >>>>> 
> >>>> 
> >>>> --
> >>>> Rob Latham
> >>>> Mathematics and Computer Science Division
> >>>> Argonne National Lab, IL USA
> >>> 
> >>>        --
> >>> 
> >>> 
> >>> 
> >>> 
> >>>    --
> >>>    Jim Edwards
> >>> 
> >>>    CESM Software Engineering Group
> >>>    National Center for Atmospheric Research
> >>>    Boulder, CO
> >>>    303-497-1842
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> --
> >>> Jim Edwards
> >>> 
> >>> CESM Software Engineering Group
> >>> National Center for Atmospheric Research
> >>> Boulder, CO
> >>> 303-497-1842
> >>> 
> >> 
> >>    --
> >> 
> >> 
> >> 
> >> 
> >> --
> >> Jim Edwards
> >> 
> >> CESM Software Engineering Group
> >> National Center for Atmospheric Research
> >> Boulder, CO
> >> 303-497-1842
> >> 
> > 
> > -- 
> 

-- 

