pnetcdf 1.2.0 create file issue

David Knaak knaak at cray.com
Mon May 21 17:28:44 CDT 2012


Hi Jim,

Your NCSA ticket hasn't reached me yet in the form of a Cray bug report,
but I should see that soon.

I'll keep digging.

David

On Mon, May 21, 2012 at 04:05:57PM -0600, Jim Edwards wrote:
> Hi David,
> 
> I am using 5.4.5.  I reported the loaded modules in the ticket, but not
> in the mail to you; here they are:
> 
> Currently Loaded Modulefiles:
>   1) modules/3.2.6.6
>   2) xtpe-network-gemini
>   3) xt-mpich2/5.4.5
>   4) xtpe-interlagos
>   5) eswrap/1.0.12
>   6) torque/2.5.10
>   7) moab/6.1.5
>   8) scripts
>   9) user-paths
>  10) pgi/11.10.0
>  11) xt-libsci/11.0.04
>  12) udreg/2.3.1-1.0400.4264.3.1.gem
>  13) ugni/2.3-1.0400.4374.4.88.gem
>  14) pmi/3.0.0-1.0000.8661.28.2807.gem
>  15) dmapp/3.2.1-1.0400.4255.2.159.gem
>  16) gni-headers/2.1-1.0400.4351.3.1.gem
>  17) xpmem/0.1-2.0400.31280.3.1.gem
>  18) xe-sysroot/4.0.46
>  19) xt-asyncpe/5.07
>  20) atp/1.4.1
>  21) PrgEnv-pgi/4.0.46
>  22) hdf5-parallel/1.8.7
>  23) netcdf-hdf5parallel/4.1.3
> 
> 
> 
> On Mon, May 21, 2012 at 3:58 PM, David Knaak <knaak at cray.com> wrote:
> 
>     Hi Jim,
> 
>     I ran the pnetcdf tests using xt-mpich2/5.4.2 and it does have some
>     problems that are fixed in xt-mpich2/5.4.3.  Did you try this
> 
>      module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> 
>     for your application?
> 
> 
> 
>     Rob,
> 
>     Are the pnetcdf tests designed to run with more than one MPI process?
>     When I run with just one process, the Cray library (that is, the
>     5.4.3 and later versions) reports 2 errors, as expected:
> 
>     ------------------------------------------------------------------------
>     *** Testing ncmpi_open             ...
>        FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file
>        should have returned system error
>      0: MPI_File_open : File does not exist, error stack:
>     ADIOI_CRAY_OPEN(86): File tooth-fairy.nc does not exist
>        ### 1 FAILURES TESTING ncmpi_open! ###
> 
>     *** Testing ncmpi_create           ...
>        FAILURE at line 58 of test_write.c: attempt to overwrite file:
>        status = -32
>      0: MPI_File_open : Other I/O error , error stack:
>     ADIOI_CRAY_OPEN(116): Other I/O error File exists
>        ### 1 FAILURES TESTING ncmpi_create! ###
>     ------------------------------------------------------------------------
> 
>     The test file line numbers are a little different from the base pnetcdf
>     tests because of some extra print statements I put in for debugging.
> 
>     When I run with the unmodified ANL MPICH2 code I get the same results:
> 
>     ------------------------------------------------------------------------
>     *** Testing ncmpi_open             ...
>        FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file
>        should have returned system error
>      0: MPI_File_open : File does not exist, error stack:
>     ADIOI_UFS_OPEN(70): File tooth-fairy.nc does not exist
>        ### 1 FAILURES TESTING ncmpi_open! ###
> 
>     *** Testing ncmpi_create           ...
>        FAILURE at line 58 of test_write.c: attempt to overwrite file:
>        status = -32
>      0: MPI_File_open : Other I/O error , error stack:
>     ADIOI_UFS_OPEN(100): Other I/O error File exists
>        ### 1 FAILURES TESTING ncmpi_create! ###
>     ------------------------------------------------------------------------
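> 
>     For reference, the two expected failures come from checks along
>     these lines (a minimal, untested sketch against the pnetcdf C API;
>     the real checks live in test_read.c and test_write.c):
> 
>         #include <stdio.h>
>         #include <mpi.h>
>         #include <pnetcdf.h>
> 
>         int main(int argc, char **argv) {
>             int err, ncid;
>             MPI_Init(&argc, &argv);
> 
>             /* opening a nonexistent file must fail; the test wants a
>              * specific system error, but MPI-IO may surface a generic
>              * one, which is why this check "fails" as expected */
>             err = ncmpi_open(MPI_COMM_WORLD, "tooth-fairy.nc", NC_NOWRITE,
>                              MPI_INFO_NULL, &ncid);
>             if (err == NC_NOERR)
>                 printf("error: open of nonexistent file succeeded\n");
> 
>             /* creating with NC_NOCLOBBER over an existing file must
>              * also fail; a generic code (like the status -32 above)
>              * instead of the specific one counts as the second
>              * expected failure */
>             err = ncmpi_create(MPI_COMM_WORLD, "scratch.nc", NC_NOCLOBBER,
>                                MPI_INFO_NULL, &ncid);
>             if (err == NC_NOERR)
>                 printf("error: noclobber create overwrote file\n");
> 
>             MPI_Finalize();
>             return 0;
>         }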
> 
>     But when I run with more MPI tasks, the results get very messy, for both
>     implementations.  For example, running with 4 tasks, the end of the
>     output for the base ANL library is:
> 
>     ------------------------------------------------------------------------
>        FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>        ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> 
>        FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>        ### 1 FAILURES TESTING ncmpi_rename_dim! ###
>     ok
> 
>        FAILURE at line 551 of test_write.c: remove of scratch.nc failed
>        ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> 
>     NOTE: parallel-netcdf expects to see 2 failures
>     Total number of failures: 7
> 
>     NOTE: parallel-netcdf expects to see 2 failures
>     Total number of failures: 10
> 
>     NOTE: parallel-netcdf expects to see 2 failures
>     Total number of failures: 8
> 
>     NOTE: parallel-netcdf expects to see 2 failures
>     Total number of failures: 10
>     Application 7538532 exit codes: 1
>     ------------------------------------------------------------------------
> 
>     Is this what you expect?
> 
>     David
> 
> 
>     On Sat, May 19, 2012 at 07:23:24AM -0600, Jim Edwards wrote:
>     > Hi David,
>     >
>     > I built the pnetcdf 1.3.0-pre1 release on bluewaters and ran the
>     > nf_test provided with that package.  It exhibits the same issues
>     > with nfmpi_create and nfmpi_open that I am seeing from the
>     > installed parallel-netcdf-1.2.0 and my application.
>     > I have opened ticket BWDSPCH-298 on bluewaters to track this issue.
>     >
>     > On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak at cray.com> wrote:
>     >
>     >     Jim,
>     >
>     >     I'm not intimate with pnetcdf, but the stack trace entries
>     >     look a little strange to me.  In particular:
>     >
>     >     #10
>     >      count=0x1527daa0, stride=0x0
>     >
>     >     #13
>     >      start=0x7fffffff3750, count=0x7fffffff3750
>     >
>     >     Could it be that the high level routine is being called with a bad
>     >     argument?
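>     >
>     >     For what it's worth, stride=0x0 in frame #10 looks normal for
>     >     the vara path: frame #11 shows ncmpi_get_vara_all passing a
>     >     NULL stride down to ncmpii_getput_vars.  The odder part is
>     >     frame #13, where start and count hold the same address.  A
>     >     hypothetical sanity check along these lines (not part of
>     >     pnetcdf), placed just before the get_vara call, would catch
>     >     that kind of aliasing:
>     >
>     >         #include <stdio.h>
>     >         #include <mpi.h>
>     >
>     >         /* hypothetical debugging helper, not part of pnetcdf */
>     >         static void check_vara_args(int ndims,
>     >                                     const MPI_Offset *start,
>     >                                     const MPI_Offset *count)
>     >         {
>     >             int i;
>     >             if (start == count)
>     >                 fprintf(stderr, "warning: start and count alias "
>     >                                 "the same buffer\n");
>     >             for (i = 0; i < ndims; i++)
>     >                 if (count[i] < 0)
>     >                     fprintf(stderr, "warning: count[%d] = %lld is "
>     >                             "negative\n", i, (long long)count[i]);
>     >         }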
>     >
>     >     Just a thought.
>     >
>     >     David
>     >
>     >     On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
>     >     > Hi David,
>     >     >
>     >     > I've updated to the 5.2.5 mpich2 on bluewaters and now get a
>     >     > different problem; it happens earlier in the code, so I think
>     >     > that this worked in the older version.  Here is a partial
>     >     > stack trace from the core file on 16000 tasks.  I'll try on
>     >     > Monday to reproduce this on a smaller number of tasks.  Can I
>     >     > send you more info when I get it?
>     >     >
>     >     > #0  memcpy () at ../sysdeps/x86_64/memcpy.S:102
>     >     > #1  0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
>     >     > #2  0x0000000000bc91ae in MPIDI_Isend_self ()
>     >     > #3  0x0000000000bc5ef4 in MPID_Isend ()
>     >     > #4  0x0000000000be4b2b in PMPI_Isend ()
>     >     > #5  0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
>     >     > #6  0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
>     >     > #7  0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
>     >     > #8  0x0000000000b7ea69 in MPIOI_File_read_all ()
>     >     > #9  0x0000000000b7eab6 in PMPI_File_read_all ()
>     >     > #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0,
>     >     >     varp=0x109ee340, start=0x15269910, count=0x1527daa0,
>     >     >     stride=0x0, buf=0x155f5480, bufcount=245561,
>     >     >     datatype=-871890885, rw_flag=1, io_method=1)
>     >     >     at ./getput_vars.c:741
>     >     > #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59,
>     >     >     start=0x15269910, count=0x1527daa0, buf=0x155f5480,
>     >     >     bufcount=245561, datatype=-871890885) at ./getput_vara.c:435
>     >     > #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8,
>     >     >     v2=0x33e89a0, v3=0x418d500, v4=0x418c500, v5=0x155f5480,
>     >     >     v6=0x1e72e10, v7=0x1e72e04) at ./get_vara_allf.c:57
>     >     > #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double
>     >     >     (file=..., iobuf=0x155f5480, vardesc=..., iodesc=...,
>     >     >     start=0x7fffffff3750, count=0x7fffffff3750)
>     >     >
>     >     >
>     >     > On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>     >     >
>     >     >     David,
>     >     >
>     >     >     I will give this a try, thanks.
>     >     >
>     >     >
>     >     >     On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:
>     >     >
>     >     >         Rob,
>     >     >
>     >     >         I suggested taking the discussion off line so as not
>     >     >         to bother those not interested in the Cray specifics.
>     >     >         But if you think those on the list are either
>     >     >         interested or don't consider it a bother, I can
>     >     >         certainly use the full list.
>     >     >
>     >     >         All,
>     >     >
>     >     >         In the MPT 5.4.0 release, I made some changes to
>     >     >         MPI_File_open to improve scalability.  Because of
>     >     >         these changes and previous changes I had made (for
>     >     >         added functionality, not because of any bugs), the
>     >     >         code was getting very messy.  In fact, I introduced a
>     >     >         bug or two with these changes.  So in 5.4.3, I
>     >     >         significantly restructured the code for better
>     >     >         maintainability, fixed the bugs (that I knew of), and
>     >     >         made more scalability changes.
>     >     >
>     >     >         Jim,
>     >     >
>     >     >         NCSA's "ESS" has the 5.4.2 version of Cray's MPI
>     >     >         implementation as the default.  The "module list"
>     >     >         command output that you included shows:
>     >     >
>     >     >           3) xt-mpich2/5.4.2
>     >     >
>     >     >         The "module avail xt-mpch2" command shows what other
>     versions are
>     >     >         available:
>     >     >
>     >     >         h2ologin2 25=>module avail xt-mpich2
>     >     >         --------------------- /opt/cray/modulefiles
>     ---------------------
>     >     >         xt-mpich2/5.4.2(default)     xt-mpich2/5.4.4      
>     xt-mpich2/
>     >     5.4.5
>     >     >
>     >     >         Would you switch to 5.4.5, relink, and try again?
>     >     >
>     >     >         h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
>     >     >
>     >     >         Thanks.
>     >     >         David
>     >     >
>     >     >
>     >     >         On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
>     >     >         > On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
>     >     >         > > Jim,
>     >     >         > >
>     >     >         > > Since you are having this problem on a Cray system,
>     >     >         > > please open a Cray bug report against MPI and I will
>     >     >         > > look at it.  We can take further discussions off line.
>     >     >         >
>     >     >         > Oh, howdy David! Forgot you were on the list.  Thanks
>     >     >         > for keeping an eye on things.
>     >     >         >
>     >     >         > The pnetcdf list is pretty low-traffic these days,
>     >     >         > but we have an awful lot of users in Cray and Lustre
>     >     >         > environments.  If you'd rather discuss Cray-specific
>     >     >         > stuff elsewhere, I'd understand, but please let us
>     >     >         > know what you figure out.
>     >     >         >
>     >     >         > ==rob
>     >     >         >
>     >     >         > > Thanks.
>     >     >         > > David
>     >     >         > >
>     >     >         > > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
>     >     >         > > >
>     >     >         > > >
>     >     >         > > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>     >     >         > > >
>     >     >         > > >     On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
>     >     >         > > >     > This occurs on the ncsa machine bluewaters.
>     >     >         > > >     > I am using pnetcdf 1.2.0 and pgi 11.10.0
>     >     >         > > >
>     >     >         > > >     need one more bit of information: the version
>     >     >         > > >     of MPT you are using.
>     >     >         > > >
>     >     >         > > >
>     >     >         > > > Sorry, what's mpt?  MPI?
>     >     >         > > > Currently Loaded Modulefiles:
>     >     >         > > >   1) modules/3.2.6.6
>     >     >         > > >   2) xtpe-network-gemini
>     >     >         > > >   3) xt-mpich2/5.4.2
>     >     >         > > >   4) xtpe-interlagos
>     >     >         > > >   5) eswrap/1.0.12
>     >     >         > > >   6) torque/2.5.10
>     >     >         > > >   7) moab/6.1.5
>     >     >         > > >   8) scripts
>     >     >         > > >   9) user-paths
>     >     >         > > >  10) pgi/11.10.0
>     >     >         > > >  11) xt-libsci/11.0.04
>     >     >         > > >  12) udreg/2.3.1-1.0400.4264.3.1.gem
>     >     >         > > >  13) ugni/2.3-1.0400.4374.4.88.gem
>     >     >         > > >  14) pmi/3.0.0-1.0000.8661.28.2807.gem
>     >     >         > > >  15) dmapp/3.2.1-1.0400.4255.2.159.gem
>     >     >         > > >  16) gni-headers/2.1-1.0400.4351.3.1.gem
>     >     >         > > >  17) xpmem/0.1-2.0400.31280.3.1.gem
>     >     >         > > >  18) xe-sysroot/4.0.46
>     >     >         > > >  19) xt-asyncpe/5.07
>     >     >         > > >  20) atp/1.4.1
>     >     >         > > >  21) PrgEnv-pgi/4.0.46
>     >     >         > > >  22) hdf5-parallel/1.8.7
>     >     >         > > >  23) netcdf-hdf5parallel/4.1.3
>     >     >         > > >  24) parallel-netcdf/1.2.0
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >     > The issue is that calling nfmpi_createfile
>     >     >         > > >     > would sometimes result in an error:
>     >     >         > > >     >
>     >     >         > > >     > MPI_File_open : Other I/O error , error stack:
>     >     >         > > >     > (unknown)(): Other I/O error
>     >     >         > > >     > 126: MPI_File_open : Other I/O error , error stack:
>     >     >         > > >     > (unknown)(): Other I/O error
>     >     >         > > >     >   Error on create :           502          -32
>     >     >         > > >     >
>     >     >         > > >     > The error appears to be intermittent, and I
>     >     >         > > >     > could not get it to occur at all on a small
>     >     >         > > >     > number of tasks (160), but it occurs with
>     >     >         > > >     > high frequency when using a larger number
>     >     >         > > >     > of tasks (>=1600).  I traced the problem to
>     >     >         > > >     > the use of nf_clobber in the mode argument;
>     >     >         > > >     > removing the nf_clobber seems to have
>     >     >         > > >     > solved the problem.  And create implies
>     >     >         > > >     > clobber anyway, doesn't it?
>     >     >         > > >
>     >     >         > > >     > Can someone who knows what is going on
>     >     >         > > >     > under the covers enlighten me with some
>     >     >         > > >     > understanding of this issue?  I suspect
>     >     >         > > >     > that one task is trying to clobber the
>     >     >         > > >     > file that another has just created, or
>     >     >         > > >     > something of that nature.
>     >     >         > > >
>     >     >         > > >     Unfortunately, "under the covers" here means
>     >     >         > > >     "inside the MPI-IO library", which we don't
>     >     >         > > >     have access to.
>     >     >         > > >
>     >     >         > > >     In the create case we call MPI_File_open with
>     >     >         > > >     "MPI_MODE_RDWR | MPI_MODE_CREATE", and if
>     >     >         > > >     noclobber is set, we add MPI_MODE_EXCL.
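>     >     >         > > >
>     >     >         > > >     Schematically, that open looks something like
>     >     >         > > >     this (an untested sketch, not the actual
>     >     >         > > >     pnetcdf code):
>     >     >         > > >
>     >     >         > > >       #include <stdio.h>
>     >     >         > > >       #include <mpi.h>
>     >     >         > > >
>     >     >         > > >       int main(int argc, char **argv) {
>     >     >         > > >           MPI_File fh;
>     >     >         > > >           /* clobber case; for noclobber, add
>     >     >         > > >            * MPI_MODE_EXCL to amode */
>     >     >         > > >           int amode = MPI_MODE_RDWR | MPI_MODE_CREATE;
>     >     >         > > >           int err, len;
>     >     >         > > >           char msg[MPI_MAX_ERROR_STRING];
>     >     >         > > >
>     >     >         > > >           MPI_Init(&argc, &argv);
>     >     >         > > >           err = MPI_File_open(MPI_COMM_WORLD,
>     >     >         > > >                               "scratch.nc", amode,
>     >     >         > > >                               MPI_INFO_NULL, &fh);
>     >     >         > > >           if (err != MPI_SUCCESS) {
>     >     >         > > >               MPI_Error_string(err, msg, &len);
>     >     >         > > >               printf("MPI_File_open: %s\n", msg);
>     >     >         > > >           } else {
>     >     >         > > >               MPI_File_close(&fh);
>     >     >         > > >           }
>     >     >         > > >           MPI_Finalize();
>     >     >         > > >           return 0;
>     >     >         > > >       }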
>     >     >         > > >
>     >     >         > > >     OK, so that's pnetcdf.  What's going on in
>     >     >         > > >     MPI-IO?  Well, Cray based their MPI-IO on our
>     >     >         > > >     ROMIO, but I'm not sure which version.
>     >     >         > > >
>     >     >         > > >     Let me cook up a quick MPI-IO-only test case
>     >     >         > > >     you can run to trigger this problem, and then
>     >     >         > > >     you can beat Cray over the head with it.
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > > Sounds good, thanks.
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >     ==rob
>     >     >         > > >
>     >     >         > > >     --
>     >     >         > > >     Rob Latham
>     >     >         > > >     Mathematics and Computer Science Division
>     >     >         > > >     Argonne National Lab, IL USA
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > >
>     >     >         > > > --
>     >     >         > > > Jim Edwards
>     >     >         > > >
>     >     >         > > > CESM Software Engineering Group
>     >     >         > > > National Center for Atmospheric Research
>     >     >         > > > Boulder, CO
>     >     >         > > > 303-497-1842
>     >     >         > > >
>     >     >         > >
>     >     >         >
>     >     >         > --
>     >     >         > Rob Latham
>     >     >         > Mathematics and Computer Science Division
>     >     >         > Argonne National Lab, IL USA
>     >     >
>     >     >         --
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     --
>     >     >     Jim Edwards
>     >     >
>     >     >     CESM Software Engineering Group
>     >     >     National Center for Atmospheric Research
>     >     >     Boulder, CO
>     >     >     303-497-1842
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     > --
>     >     > Jim Edwards
>     >     >
>     >     > CESM Software Engineering Group
>     >     > National Center for Atmospheric Research
>     >     > Boulder, CO
>     >     > 303-497-1842
>     >     >
>     >
>     >     --
>     >
>     >
>     >
>     >
>     > --
>     > Jim Edwards
>     >
>     > CESM Software Engineering Group
>     > National Center for Atmospheric Research
>     > Boulder, CO
>     > 303-497-1842
>     >
> 
>     --
> 
> 
> 
> 
> --
> Jim Edwards
> 
> CESM Software Engineering Group
> National Center for Atmospheric Research
> Boulder, CO
> 303-497-1842
> 
