pnetcdf 1.2.0 create file issue

Jim Edwards jedwards at ucar.edu
Sat May 19 08:23:24 CDT 2012


Hi David,

I built the pnetcdf 1.3.0-pre1 release on bluewaters and ran the nf_test
program provided with that package.  It exhibits the same issues with
nfmpi_create and nfmpi_open that I am seeing from the installed
parallel-netcdf-1.2.0 and from my application.
I have opened ticket BWDSPCH-298
<https://ncsa-jira.ncsa.illinois.edu/browse/BWDSPCH-298> on bluewaters
to track this issue.

On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak at cray.com> wrote:

> Jim,
>
> I'm not intimately familiar with pnetcdf, but the stack trace entries look
> a little strange to me.  In particular:
>
> #10  count=0x1527daa0, stride=0x0
>
> #13  start=0x7fffffff3750, count=0x7fffffff3750
>
> Could it be that the high level routine is being called with a bad
> argument?
>
> Just a thought.
>
> David
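
For reference, the flexible interface visible in frames #10 and #11 below
takes separate start[] and count[] arrays, one element per dimension of the
variable.  A minimal sketch of a well-formed call from C (the file name,
variable name, and decomposition here are made up purely for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int err, ncid, varid, rank;
        MPI_Offset start[2], count[2];
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* hypothetical input file and variable */
        err = ncmpi_open(MPI_COMM_WORLD, "input.nc", NC_NOWRITE,
                         MPI_INFO_NULL, &ncid);
        if (err != NC_NOERR) MPI_Abort(MPI_COMM_WORLD, 1);
        err = ncmpi_inq_varid(ncid, "field", &varid);
        if (err != NC_NOERR) MPI_Abort(MPI_COMM_WORLD, 1);

        /* start and count are distinct arrays sized to the variable's rank;
         * each task reads its own 10 x 100 slab of a 2-D variable */
        start[0] = (MPI_Offset)rank * 10;  count[0] = 10;
        start[1] = 0;                      count[1] = 100;
        buf = (double *) malloc((size_t)(count[0] * count[1]) * sizeof(double));

        err = ncmpi_get_vara_all(ncid, varid, start, count,
                                 buf, count[0] * count[1], MPI_DOUBLE);
        if (err != NC_NOERR)
            fprintf(stderr, "read failed: %s\n", ncmpi_strerror(err));

        free(buf);
        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }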
>
> On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
> > Hi David,
> >
> > I've updated to the 5.2.5 mpich2 on bluewaters and now get a different
> > problem.  It happens earlier in the code, so I think this part worked in
> > the older version.  Here is a partial stack trace from the core file on
> > 16000 tasks; I'll try on Monday to reproduce this on a smaller number of
> > tasks.  Can I send you more info when I get it?
> >
> > #0  memcpy () at ../sysdeps/x86_64/memcpy.S:102
> > #1  0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
> > #2  0x0000000000bc91ae in MPIDI_Isend_self ()
> > #3  0x0000000000bc5ef4 in MPID_Isend ()
> > #4  0x0000000000be4b2b in PMPI_Isend ()
> > #5  0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
> > #6  0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
> > #7  0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
> > #8  0x0000000000b7ea69 in MPIOI_File_read_all ()
> > #9  0x0000000000b7eab6 in PMPI_File_read_all ()
> > #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0, varp=0x109ee340,
> >     start=0x15269910, count=0x1527daa0, stride=0x0, buf=0x155f5480,
> >     bufcount=245561, datatype=-871890885, rw_flag=1, io_method=1)
> >     at ./getput_vars.c:741
> > #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59,
> >     start=0x15269910, count=0x1527daa0, buf=0x155f5480,
> >     bufcount=245561, datatype=-871890885) at ./getput_vara.c:435
> > #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8, v2=0x33e89a0,
> >     v3=0x418d500, v4=0x418c500, v5=0x155f5480,
> >     v6=0x1e72e10, v7=0x1e72e04) at ./get_vara_allf.c:57
> > #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double (file=...,
> >     iobuf=0x155f5480, vardesc=..., iodesc=...,
> >     start=0x7fffffff3750, count=0x7fffffff3750)
> >
> >
> > On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
> >
> >     David,
> >
> >     I will give this a try, thanks.
> >
> >
> >     On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:
> >
> >         Rob,
> >
> >         I suggested taking the discussion off line to not bother those
> >         not interested in the Cray specifics.  But if you think those on
> >         the list are either interested, or don't consider it a bother, I
> >         can certainly use the full list.
> >
> >         All,
> >
> >         In the MPT 5.4.0 release, I made some changes to MPI_File_open
> >         to improve scalability.  Because of these changes and previous
> >         changes I had made (for added functionality, not because of any
> >         bugs), the code was getting very messy.  In fact, I introduced a
> >         bug or two with these changes.  So in 5.4.3, I significantly
> >         restructured the code for better maintainability, fixed the bugs
> >         (that I knew of), and made more scalability changes.
> >
> >         Jim,
> >
> >         The NCSA's "ESS" has the 5.4.2 version of Cray's MPI
> implementation as
> >         default.  The "module list" command output that you included
> shows:
> >
> >           3) xt-mpich2/5.4.2
> >
> >         The "module avail xt-mpch2" command shows what other versions are
> >         available:
> >
> >         h2ologin2 25=>module avail xt-mpich2
> >         --------------------- /opt/cray/modulefiles ---------------------
> >         xt-mpich2/5.4.2(default)   xt-mpich2/5.4.4   xt-mpich2/5.4.5
> >
> >         Would you switch to 5.4.5, relink, and try again?
> >
> >         h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> >
> >         Thanks.
> >         David
> >
> >
> >         On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> >         > On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> >         > > Jim,
> >         > >
> >         > > Since you are having this problem on a Cray system, please
> >         > > open a Cray bug report against MPI and I will look at it.
> >         > > We can take further discussions off line.
> >         >
> >         > Oh, howdy David!  Forgot you were on the list.  Thanks for
> >         > keeping an eye on things.
> >         >
> >         > The pnetcdf list is pretty low-traffic these days, but we have
> >         > an awful lot of users in a Cray and Lustre environment.  If
> >         > you'd rather discuss Cray-specific stuff elsewhere, I'd
> >         > understand, but please let us know what you figure out.
> >         >
> >         > ==rob
> >         >
> >         > > Thanks.
> >         > > David
> >         > >
> >         > > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> >         > > >
> >         > > >
> >         > > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >         > > >
> >         > > >     On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> >         > > >     > This occurs on the NCSA machine bluewaters.  I am
> >         > > >     > using pnetcdf 1.2.0 and pgi 11.10.0.
> >         > > >
> >         > > >     need one more bit of information: the version of MPT
> >         > > >     you are using.
> >         > > >
> >         > > >
> >         > > > Sorry, what's mpt?  MPI?
> >         > > > Currently Loaded Modulefiles:
> >         > > >    1) modules/3.2.6.6
> >         > > >    2) xtpe-network-gemini
> >         > > >    3) xt-mpich2/5.4.2
> >         > > >    4) xtpe-interlagos
> >         > > >    5) eswrap/1.0.12
> >         > > >    6) torque/2.5.10
> >         > > >    7) moab/6.1.5
> >         > > >    8) scripts
> >         > > >    9) user-paths
> >         > > >   10) pgi/11.10.0
> >         > > >   11) xt-libsci/11.0.04
> >         > > >   12) udreg/2.3.1-1.0400.4264.3.1.gem
> >         > > >   13) ugni/2.3-1.0400.4374.4.88.gem
> >         > > >   14) pmi/3.0.0-1.0000.8661.28.2807.gem
> >         > > >   15) dmapp/3.2.1-1.0400.4255.2.159.gem
> >         > > >   16) gni-headers/2.1-1.0400.4351.3.1.gem
> >         > > >   17) xpmem/0.1-2.0400.31280.3.1.gem
> >         > > >   18) xe-sysroot/4.0.46
> >         > > >   19) xt-asyncpe/5.07
> >         > > >   20) atp/1.4.1
> >         > > >   21) PrgEnv-pgi/4.0.46
> >         > > >   22) hdf5-parallel/1.8.7
> >         > > >   23) netcdf-hdf5parallel/4.1.3
> >         > > >   24) parallel-netcdf/1.2.0
> >         > > >
> >         > > >
> >         > > >
> >         > > >
> >         > > >
> >         > > >
> >         > > >     > The issue is that calling nfmpi_createfile would
> >         > > >     > sometimes result in an error:
> >         > > >     >
> >         > > >     > MPI_File_open : Other I/O error , error stack:
> >         > > >     > (unknown)(): Other I/O error
> >         > > >     > 126: MPI_File_open : Other I/O error , error stack:
> >         > > >     > (unknown)(): Other I/O error
> >         > > >     >   Error on create :           502          -32
> >         > > >     >
> >         > > >     > The error appears to be intermittent, and I could not
> >         > > >     > get it to occur at all on a small number of tasks
> >         > > >     > (160), but it occurs with high frequency when using a
> >         > > >     > larger number of tasks (>=1600).  I traced the problem
> >         > > >     > to the use of nf_clobber in the mode argument;
> >         > > >     > removing the nf_clobber seems to have solved the
> >         > > >     > problem, and I think that create implies clobber
> >         > > >     > anyway, doesn't it?
> >         > > >
> >         > > >     > Can someone who knows what is going on under the
> >         > > >     > covers enlighten me with some understanding of this
> >         > > >     > issue?  I suspect that one task is trying to clobber
> >         > > >     > the file that another has just created, or something
> >         > > >     > of that nature.
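
On the clobber question: in the netCDF API, clobber is the default behavior
(NC_CLOBBER is 0), and it is nf_noclobber/NC_NOCLOBBER that makes a create
fail when the file already exists.  A minimal C-level sketch of the two
modes (the file name and communicator are placeholders; the Fortran nf_*
constants map to the same NC_* values):

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Sketch only: "out.nc" is a placeholder path. */
    int create_example(MPI_Comm comm)
    {
        int ncid, err;

        /* Default create: NC_CLOBBER is 0, so this overwrites an existing
         * out.nc -- create implies clobber unless told otherwise. */
        err = ncmpi_create(comm, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) { ncmpi_enddef(ncid); ncmpi_close(ncid); }

        /* Noclobber create: returns an error if out.nc already exists. */
        err = ncmpi_create(comm, "out.nc", NC_NOCLOBBER, MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) { ncmpi_enddef(ncid); ncmpi_close(ncid); }

        return err;
    }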
> >         > > >
> >         > > >     Unfortunately, "under the covers" here means "inside
> the
> >         MPI-IO
> >         > > >     library", which we don't have access to.
> >         > > >
> >         > > >     in the create case we call MPI_File_open with
> "MPI_MODE_RDWR
> >         |
> >         > > >     MPI_MODE_CREATE", and  if noclobber set, we add
> >         MPI_MODE_EXCL.
> >         > > >
> >         > > >     OK, so that's pnetcdf.  What's going on in MPI-IO?
>  Well,
> >         cray's based
> >         > > >     their MPI-IO off of our ROMIO, but I'm not sure which
> >         version.
> >         > > >
> >         > > >     Let me cook up a quick MPI-IO-only test case you can
> run to
> >         trigger
> >         > > >     this problem and then you can beat cray over the head
> with
> >         it.
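
For anyone wanting to poke at the MPI-IO layer directly, a rough sketch of
the open pattern described above (not the test case Rob sent; the file name
and error reporting are illustrative) looks like this:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, err, len;
        char msg[MPI_MAX_ERROR_STRING];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Clobber-style create: the flags pnetcdf uses when nf_clobber is
         * in effect.  OR in MPI_MODE_EXCL to mimic the noclobber path. */
        err = MPI_File_open(MPI_COMM_WORLD, "testfile.nc",
                            MPI_MODE_CREATE | MPI_MODE_RDWR,
                            MPI_INFO_NULL, &fh);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "rank %d: MPI_File_open failed: %s\n", rank, msg);
        } else {
            MPI_File_close(&fh);
        }

        MPI_Finalize();
        return 0;
    }

Running something along these lines at the task counts where nfmpi_createfile
fails (>=1600) should show whether the intermittent "Other I/O error" can be
reproduced without pnetcdf in the picture.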
> >         > > >
> >         > > >
> >         > > >
> >         > > > Sounds good, thanks.
> >         > > >
> >         > > >
> >         > > >     ==rob
> >         > > >
> >         > > >     --
> >         > > >     Rob Latham
> >         > > >     Mathematics and Computer Science Division
> >         > > >     Argonne National Lab, IL USA
> >         > > >
> >         > > >
> >         > > >
> >         > > >
> >         > > > --
> >         > > > Jim Edwards
> >         > > >
> >         > > > CESM Software Engineering Group
> >         > > > National Center for Atmospheric Research
> >         > > > Boulder, CO
> >         > > > 303-497-1842
> >         > > >
> >         > >
> >         >
> >         > --
> >         > Rob Latham
> >         > Mathematics and Computer Science Division
> >         > Argonne National Lab, IL USA
> >
> >         --
> >
> >
> >
> >
> >     --
> >     Jim Edwards
> >
> >     CESM Software Engineering Group
> >     National Center for Atmospheric Research
> >     Boulder, CO
> >     303-497-1842
> >
> >
> >
> >
> >
> > --
> > Jim Edwards
> >
> > CESM Software Engineering Group
> > National Center for Atmospheric Research
> > Boulder, CO
> > 303-497-1842
> >
>
> --
>



-- 
Jim Edwards

CESM Software Engineering Group
National Center for Atmospheric Research
Boulder, CO
303-497-1842