pnetcdf 1.2.0 create file issue

Jim Edwards jedwards at ucar.edu
Fri May 11 18:25:08 CDT 2012


David,

I will give this a try, thanks.

On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak at cray.com> wrote:

> Rob,
>
> I suggested taking the discussion off line to not bother those not
> interested in the Cray specifics.  But if you think those on the list
> are either interested, or don't consider it a bother, I can certainly
> use the full list.
>
> All,
>
> In the MPT 5.4.0 release, I made some changes to MPI_File_open to
> improve scalability.  Because of these changes and previous changes
> I had made (for added functionality, not because of any bugs), the
> code was getting very messy.  In fact, I introduced a bug or 2 with
> these changes.  So in 5.4.3, I significantly restructured the code
> for better maintainability, fixed the bugs (that I knew of) and made
> more scalability changes.
>
> Jim,
>
> The NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as
> default.  The "module list" command output that you included shows:
>
>   3) xt-mpich2/5.4.2
>
> The "module avail xt-mpich2" command shows what other versions are
> available:
>
> h2ologin2 25=>module avail xt-mpich2
> --------------------- /opt/cray/modulefiles ---------------------
> xt-mpich2/5.4.2(default)     xt-mpich2/5.4.4       xt-mpich2/5.4.5
>
> Would you switch to 5.4.5, relink, and try again?
>
> h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
>
> Thanks.
> David
>
>
> On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> > On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> > > Jim,
> > >
> > > Since you are having this problem on a Cray system, please open a Cray
> > > bug report against MPI and I will look at it.  We can take further
> > > discussions off line.
> >
> > Oh, howdy David! forgot you were on the list.  Thanks for keeping an
> > eye on things.
> >
> > the pnetcdf list is pretty low-traffic these days, but we have an
> > awful lot of users in a cray and Lustre environment.   If you'd rather
> > discuss cray specific stuff elsewhere, I'd understand, but please let
> > us know what you figure out.
> >
> > ==rob
> >
> > > Thanks.
> > > David
> > >
> > > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> > > >
> > > >
> > > > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> > > >
> > > >     On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > > >     > This occurs on the ncsa machine bluewaters.   I am using
> > > >     > pnetcdf1.2.0 and pgi 11.10.0
> > > >
> > > >     need one more bit of information: the version of MPT you are
> > > >     using.
> > > >
> > > >
> > > > Sorry, what's mpt?  MPI?
> > > > Currently Loaded Modulefiles:
> > > >   1) modules/3.2.6.6         9) user-paths                           17) xpmem/0.1-2.0400.31280.3.1.gem
> > > >   2) xtpe-network-gemini    10) pgi/11.10.0                          18) xe-sysroot/4.0.46
> > > >   3) xt-mpich2/5.4.2        11) xt-libsci/11.0.04                    19) xt-asyncpe/5.07
> > > >   4) xtpe-interlagos        12) udreg/2.3.1-1.0400.4264.3.1.gem      20) atp/1.4.1
> > > >   5) eswrap/1.0.12          13) ugni/2.3-1.0400.4374.4.88.gem        21) PrgEnv-pgi/4.0.46
> > > >   6) torque/2.5.10          14) pmi/3.0.0-1.0000.8661.28.2807.gem    22) hdf5-parallel/1.8.7
> > > >   7) moab/6.1.5             15) dmapp/3.2.1-1.0400.4255.2.159.gem    23) netcdf-hdf5parallel/4.1.3
> > > >   8) scripts                16) gni-headers/2.1-1.0400.4351.3.1.gem  24) parallel-netcdf/1.2.0
> > > >
> > > >     > The issue is that calling nfmpi_createfile would sometimes
> > > >     > result in an error:
> > > >     >
> > > >     > MPI_File_open : Other I/O error , error stack:
> > > >     > (unknown)(): Other I/O error
> > > >     > 126: MPI_File_open : Other I/O error , error stack:
> > > >     > (unknown)(): Other I/O error
> > > >     >   Error on create :           502          -32
> > > >     >
> > > >     > The error appears to be intermittent and I could not get it to
> > > >     > occur at all on a small number of tasks (160) but it occurs
> > > >     > with high frequency when using a larger number of tasks
> > > >     > (>=1600).    I traced the problem to the use of nf_clobber in
> > > >     > the mode argument, removing the nf_clobber seems to have
> > > >     > solved the problem and I think that create implies clobber
> > > >     > anyway doesn't it?
> > > >
> > > >     > Can someone who knows what is going on under the covers
> > > >     > enlighten me with some understanding of this issue?   I suspect
> > > >     > that one task is trying to clobber the file that another has
> > > >     > just created or something of that nature.
> > > >
> > > >     Unfortunately, "under the covers" here means "inside the MPI-IO
> > > >     library", which we don't have access to.
> > > >
> > > >     in the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > > >     MPI_MODE_CREATE", and  if noclobber set, we add MPI_MODE_EXCL.
> > > >
> > > >     OK, so that's pnetcdf.  What's going on in MPI-IO?  Well, cray's
> > > >     based their MPI-IO off of our ROMIO, but I'm not sure which
> > > >     version.
> > > >
> > > >     Let me cook up a quick MPI-IO-only test case you can run to
> > > >     trigger this problem and then you can beat cray over the head
> > > >     with it.
> > > >
> > > >
> > > >
> > > > Sounds good, thanks.
> > > >
> > > >
> > > >     ==rob
> > > >
> > > >     --
> > > >     Rob Latham
> > > >     Mathematics and Computer Science Division
> > > >     Argonne National Lab, IL USA
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jim Edwards
> > > >
> > > > CESM Software Engineering Group
> > > > National Center for Atmospheric Research
> > > > Boulder, CO
> > > > 303-497-1842
> > > >
> > >
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
>
> --
>
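
For anyone following the create-mode discussion quoted above: Rob describes
the create path as opening with "MPI_MODE_RDWR | MPI_MODE_CREATE" and adding
MPI_MODE_EXCL when noclobber is set. The sketch below is only a serial
analogy, assuming that POSIX O_RDWR / O_CREAT / O_EXCL behave like their
MPI-IO namesakes for a single process. It does not call MPI-IO or pnetcdf,
it cannot reproduce the intermittent failure at scale, and the helper names
(create_flags, try_create, demo) are made up for illustration.

```python
import os
import tempfile


def create_flags(noclobber: bool) -> int:
    """Compose open flags the way the quoted text describes the create mode."""
    # Serial stand-in for MPI_MODE_RDWR | MPI_MODE_CREATE (assumed analogy).
    flags = os.O_RDWR | os.O_CREAT
    if noclobber:
        # Serial stand-in for MPI_MODE_EXCL: refuse to reuse an existing file.
        flags |= os.O_EXCL
    return flags


def try_create(path: str, noclobber: bool) -> bool:
    """Return True if the open succeeded, False if an existing file was refused."""
    try:
        fd = os.open(path, create_flags(noclobber), 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def demo() -> list:
    """Create the same path three times and record which attempts succeed."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.nc")
        return [
            try_create(path, noclobber=True),   # file absent: create succeeds
            try_create(path, noclobber=True),   # file present: exclusive create refused
            try_create(path, noclobber=False),  # file present: existing file is opened
        ]


if __name__ == "__main__":
    print(demo())
```

In this serial analogy only the exclusive flag changes the outcome when the
file already exists; whatever goes wrong with many tasks happens inside the
MPI library's MPI_File_open, which this sketch does not exercise.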



-- 
Jim Edwards

CESM Software Engineering Group
National Center for Atmospheric Research
Boulder, CO
303-497-1842