pnetcdf 1.2.0 create file issue
Rob Latham
robl at mcs.anl.gov
Wed May 23 09:15:26 CDT 2012
On Mon, May 21, 2012 at 04:58:59PM -0500, David Knaak wrote:
> Hi Jim,
>
> I ran the pnetcdf tests using xt-mpich2/5.4.2 and it does have some problems
> that are fixed in xt-mpich2/5.4.3. Did you try this
>
> module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
>
> for your application?
>
>
>
> Rob,
>
> Are the pnetcdf tests designed to run with more than one MPI process?
> When I run with just one process, the Cray library (that is, the 5.4.3
> and later versions) reports 2 errors, as expected:
Yeah, unfortunately we (still) lack aggressive parallel tests. Run
them with one process.
==rob
> ------------------------------------------------------------------------
> *** Testing ncmpi_open ...
> FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> 0: MPI_File_open : File does not exist, error stack:
> ADIOI_CRAY_OPEN(86): File tooth-fairy.nc does not exist
> ### 1 FAILURES TESTING ncmpi_open! ###
>
> *** Testing ncmpi_create ...
> FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> 0: MPI_File_open : Other I/O error , error stack:
> ADIOI_CRAY_OPEN(116): Other I/O error File exists
> ### 1 FAILURES TESTING ncmpi_create! ###
> ------------------------------------------------------------------------
>
> The test file line numbers are a little different from the base pnetcdf
> tests because of some extra print statements I put in for debugging.
>
> When I run with the unmodified ANL MPICH2 code I get the same results:
>
> ------------------------------------------------------------------------
> *** Testing ncmpi_open ...
> FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
> 0: MPI_File_open : File does not exist, error stack:
> ADIOI_UFS_OPEN(70): File tooth-fairy.nc does not exist
> ### 1 FAILURES TESTING ncmpi_open! ###
>
> *** Testing ncmpi_create ...
> FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
> 0: MPI_File_open : Other I/O error , error stack:
> ADIOI_UFS_OPEN(100): Other I/O error File exists
> ### 1 FAILURES TESTING ncmpi_create! ###
> ------------------------------------------------------------------------
>
> But when I run with more MPI tasks, the results get very messy, for both
> implementations. For example, running with 4 tasks, the end of the
> output for the base ANL library is:
>
> ------------------------------------------------------------------------
> FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> ### 1 FAILURES TESTING ncmpi_rename_dim! ###
>
> FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> ### 1 FAILURES TESTING ncmpi_rename_dim! ###
> ok
>
> FAILURE at line 551 of test_write.c: remove of scratch.nc failed
> ### 1 FAILURES TESTING ncmpi_rename_dim! ###
>
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 7
>
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 10
>
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 8
>
> NOTE: parallel-netcdf expects to see 2 failures
> Total number of failures: 10
> Application 7538532 exit codes: 1
> ------------------------------------------------------------------------
>
> Is this what you expect?
>
> David
>
>
> On Sat, May 19, 2012 at 07:23:24AM -0600, Jim Edwards wrote:
> > Hi David,
> >
> > I built the pnetcdf 1.3.0-pre1 release on bluewaters and
> > ran the nf_test provided with that package. It exhibits the same issues with
> > nfmpi_create and nfmpi_open that I am seeing from the installed
> > parallel-netcdf-1.2.0 and my application.
> > I have opened ticket BWDSPCH-298 on bluewaters to track this issue.
> >
> > On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak at cray.com> wrote:
> >
> > Jim,
> >
> > I'm not intimate with pnetcdf but the stack trace entries look a little
> > strange to me. In particular:
> >
> > #10
> > count=0x1527daa0, stride=0x0
> >
> > #13
> > start=0x7fffffff3750, count=0x7fffffff3750
> >
> > Could it be that the high level routine is being called with a bad
> > argument?
> >
> > Just a thought.
> >
> > David
> >
> > On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
> > > Hi David,
> > >
> > > > I've updated to the 5.2.5 mpich2 on bluewaters and now get a different
> > > > problem. It happens earlier in the code, so I think that this worked in
> > > > the older version. Here is a partial stack trace from the core file on
> > > > 16000 tasks. I'll try on Monday to reproduce this on a smaller number of
> > > > tasks. Can I send you more info when I get it?
> > >
> > > #0 memcpy () at ../sysdeps/x86_64/memcpy.S:102
> > > #1 0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
> > > #2 0x0000000000bc91ae in MPIDI_Isend_self ()
> > > #3 0x0000000000bc5ef4 in MPID_Isend ()
> > > #4 0x0000000000be4b2b in PMPI_Isend ()
> > > #5 0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
> > > #6 0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
> > > #7 0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
> > > #8 0x0000000000b7ea69 in MPIOI_File_read_all ()
> > > #9 0x0000000000b7eab6 in PMPI_File_read_all ()
> > > > #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0, varp=0x109ee340,
> > > >     start=0x15269910, count=0x1527daa0, stride=0x0,
> > > >     buf=0x155f5480, bufcount=245561, datatype=-871890885, rw_flag=1,
> > > >     io_method=1) at ./getput_vars.c:741
> > > > #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59,
> > > >     start=0x15269910, count=0x1527daa0, buf=0x155f5480,
> > > >     bufcount=245561, datatype=-871890885) at ./getput_vara.c:435
> > > > #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8, v2=0x33e89a0,
> > > >     v3=0x418d500, v4=0x418c500, v5=0x155f5480,
> > > >     v6=0x1e72e10, v7=0x1e72e04) at ./get_vara_allf.c:57
> > > > #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double (file=...,
> > > >     iobuf=0x155f5480, vardesc=..., iodesc=...,
> > > >     start=0x7fffffff3750, count=0x7fffffff3750)
> > >
> > >
> > > On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
> > >
> > > David,
> > >
> > > I will give this a try, thanks.
> > >
> > >
> > > On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:
> > >
> > > Rob,
> > >
> > > > I suggested taking the discussion off line to not bother those not
> > > > interested in the Cray specifics. But if you think those on the list
> > > > are either interested, or don't consider it a bother, I can certainly
> > > > use the full list.
> > > >
> > > > All,
> > > >
> > > > In the MPT 5.4.0 release, I made some changes to MPI_File_open to
> > > > improve scalability. Because of these changes and previous changes
> > > > I had made (for added functionality, not because of any bugs), the
> > > > code was getting very messy. In fact, I introduced a bug or 2 with
> > > > these changes. So in 5.4.3, I significantly restructured the code
> > > > for better maintainability, fixed the bugs (that I knew of) and made
> > > > more scalability changes.
> > >
> > > Jim,
> > >
> > > > The NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as
> > > > default. The "module list" command output that you included shows:
> > >
> > > 3) xt-mpich2/5.4.2
> > >
> > > > The "module avail xt-mpich2" command shows what other versions are
> > > available:
> > >
> > > h2ologin2 25=>module avail xt-mpich2
> > > --------------------- /opt/cray/modulefiles ---------------------
> > > > xt-mpich2/5.4.2(default) xt-mpich2/5.4.4 xt-mpich2/5.4.5
> > >
> > > Would you switch to 5.4.5, relink, and try again?
> > >
> > > h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> > >
> > > Thanks.
> > > David
> > >
> > >
> > > On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> > > > On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> > > > > Jim,
> > > > >
> > > > > > Since you are having this problem on a Cray system, please open a
> > > > > > Cray bug report against MPI and I will look at it. We can take
> > > > > > further discussions off line.
> > > > >
> > > > > Oh, howdy David! forgot you were on the list. Thanks for keeping an
> > > > > eye on things.
> > > > >
> > > > > the pnetcdf list is pretty low-traffic these days, but we have an
> > > > > awful lot of users in a cray and Lustre environment. If you'd rather
> > > > > discuss cray specific stuff elsewhere, I'd understand, but please let
> > > > > us know what you figure out.
> > > >
> > > > ==rob
> > > >
> > > > > Thanks.
> > > > > David
> > > > >
> > > > > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> > > > > >
> > > > > >
> > > > > > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> > > > > > >
> > > > > > > On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > > > > > > > This occurs on the ncsa machine bluewaters. I am using pnetcdf1.2.0 and
> > > > > > > > pgi 11.10.0
> > > > > > >
> > > > > > > need one more bit of information: the version of MPT you are using.
> > > > > >
> > > > > >
> > > > > > Sorry, what's mpt? MPI?
> > > > > > Currently Loaded Modulefiles:
> > > > > > >  1) modules/3.2.6.6
> > > > > > >  2) xtpe-network-gemini
> > > > > > >  3) xt-mpich2/5.4.2
> > > > > > >  4) xtpe-interlagos
> > > > > > >  5) eswrap/1.0.12
> > > > > > >  6) torque/2.5.10
> > > > > > >  7) moab/6.1.5
> > > > > > >  8) scripts
> > > > > > >  9) user-paths
> > > > > > > 10) pgi/11.10.0
> > > > > > > 11) xt-libsci/11.0.04
> > > > > > > 12) udreg/2.3.1-1.0400.4264.3.1.gem
> > > > > > > 13) ugni/2.3-1.0400.4374.4.88.gem
> > > > > > > 14) pmi/3.0.0-1.0000.8661.28.2807.gem
> > > > > > > 15) dmapp/3.2.1-1.0400.4255.2.159.gem
> > > > > > > 16) gni-headers/2.1-1.0400.4351.3.1.gem
> > > > > > > 17) xpmem/0.1-2.0400.31280.3.1.gem
> > > > > > > 18) xe-sysroot/4.0.46
> > > > > > > 19) xt-asyncpe/5.07
> > > > > > > 20) atp/1.4.1
> > > > > > > 21) PrgEnv-pgi/4.0.46
> > > > > > > 22) hdf5-parallel/1.8.7
> > > > > > > 23) netcdf-hdf5parallel/4.1.3
> > > > > > > 24) parallel-netcdf/1.2.0
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > The issue is that calling nfmpi_createfile would sometimes result in an
> > > > > > > > error:
> > > > > > > >
> > > > > > > > MPI_File_open : Other I/O error , error stack:
> > > > > > > > (unknown)(): Other I/O error
> > > > > > > > 126: MPI_File_open : Other I/O error , error stack:
> > > > > > > > (unknown)(): Other I/O error
> > > > > > > > Error on create : 502 -32
> > > > > > > >
> > > > > > > > The error appears to be intermittent and I could not get it to occur at all
> > > > > > > > on a small number of tasks (160) but it occurs with high frequency when
> > > > > > > > using a larger number of tasks (>=1600). I traced the problem to the use
> > > > > > > > of nf_clobber in the mode argument, removing the nf_clobber seems to have
> > > > > > > > solved the problem and I think that create implies clobber anyway doesn't
> > > > > > > > it?
> > > > > > >
> > > > > > > > Can someone who knows what is going on under the covers enlighten me
> > > > > > > > with some understanding of this issue? I suspect that one task is trying
> > > > > > > > to clobber the file that another has just created or something of that
> > > > > > > > nature.
> > > > > >
> > > > > > > Unfortunately, "under the covers" here means "inside the MPI-IO
> > > > > > > library", which we don't have access to.
> > > > > > >
> > > > > > > in the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > > > > > > MPI_MODE_CREATE", and if noclobber set, we add MPI_MODE_EXCL.
> > > > > > >
> > > > > > > OK, so that's pnetcdf. What's going on in MPI-IO? Well, cray's based
> > > > > > > their MPI-IO off of our ROMIO, but I'm not sure which version.
> > > > > > >
> > > > > > > Let me cook up a quick MPI-IO-only test case you can run to trigger
> > > > > > > this problem and then you can beat cray over the head with it.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sounds good, thanks.
> > > > > >
> > > > > >
> > > > > > ==rob
> > > > > >
> > > > > > --
> > > > > > Rob Latham
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Lab, IL USA
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jim Edwards
> > > > > >
> > > > > > CESM Software Engineering Group
> > > > > > National Center for Atmospheric Research
> > > > > > Boulder, CO
> > > > > > 303-497-1842
> > > > > >
> > > > >
> > > >
> > > > --
> > > > Rob Latham
> > > > Mathematics and Computer Science Division
> > > > Argonne National Lab, IL USA
> > >
> > > --
> > >
> > >
> > >
> > >
> > > --
> > > Jim Edwards
> > >
> > > CESM Software Engineering Group
> > > National Center for Atmospheric Research
> > > Boulder, CO
> > > 303-497-1842
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Jim Edwards
> > >
> > > CESM Software Engineering Group
> > > National Center for Atmospheric Research
> > > Boulder, CO
> > > 303-497-1842
> > >
> >
> >
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
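
The thread describes the create path as MPI_File_open with
"MPI_MODE_RDWR | MPI_MODE_CREATE", plus MPI_MODE_EXCL when noclobber is set.
As a rough single-process illustration of that flag choice, here is a sketch
that uses the POSIX open() flags as stand-ins (O_RDWR, O_CREAT, O_EXCL) rather
than MPI-IO itself. The helper name create_file() and the file mode 0644 are
assumptions made for this example; they are not part of pnetcdf, ROMIO, or
Cray's MPI. It does not reproduce the intermittent failure seen at >=1600
tasks, which involves many processes opening the file together.

```c
/* POSIX analogue of the create-mode choice described in the thread.
 * create_file() is a hypothetical helper, not a pnetcdf or MPI routine. */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 on success, or the errno value set by open() on failure. */
int create_file(const char *path, int noclobber)
{
    /* Stand-in for MPI_MODE_RDWR | MPI_MODE_CREATE. */
    int flags = O_RDWR | O_CREAT;

    /* Stand-in for adding MPI_MODE_EXCL when noclobber is set. */
    if (noclobber)
        flags |= O_EXCL;

    int fd = open(path, flags, 0644);
    if (fd < 0)
        return errno;   /* EEXIST if noclobber was set and the file exists */

    close(fd);
    return 0;
}
```

With noclobber set, a second create_file() call on the same path fails with
EEXIST, which matches the "File exists" text in the expected ncmpi_create
failure shown in the test output above. Without noclobber, the second call
succeeds.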