pnetcdf 1.2.0 create file issue
Rob Latham
robl at mcs.anl.gov
Fri May 11 12:54:39 CDT 2012
On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> Jim,
>
> Since you are having this problem on a Cray system, please open a Cray
> bug report against MPI and I will look at it. We can take further
> discussions off line.
Oh, howdy David! I forgot you were on the list. Thanks for keeping an
eye on things.

The pnetcdf list is pretty low-traffic these days, but we have an
awful lot of users in a Cray and Lustre environment. If you'd rather
discuss Cray-specific stuff elsewhere, I'd understand, but please let
us know what you figure out.
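
For anyone following along, the flag logic in the create path quoted
below (always read/write plus create, with exclusive added only for
noclobber) can be sketched with a POSIX analogue. This is a rough
sketch, not pnetcdf or ROMIO source: it uses open(2) with
O_RDWR | O_CREAT | O_EXCL in place of MPI_File_open and the MPI_MODE_*
flags, and SKETCH_NOCLOBBER, sketch_open_flags, and sketch_create are
made-up names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical bit for this sketch only; it stands in for a
 * "no clobber" create mode and is not a real library constant. */
#define SKETCH_NOCLOBBER 0x1

/* Compose open flags in the same shape the thread describes:
 * read/write + create always, exclusive only when no-clobber is set.
 * As with the flags quoted in the thread, no truncation flag is added. */
static int sketch_open_flags(int cmode)
{
    int flags = O_RDWR | O_CREAT;
    if (cmode & SKETCH_NOCLOBBER)
        flags |= O_EXCL;
    return flags;
}

/* Create (or open) the file at path.
 * Returns 0 on success, or the errno value reported by open(2). */
static int sketch_create(const char *path, int cmode)
{
    assert(path != NULL);
    int fd = open(path, sketch_open_flags(cmode), 0600);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

With these flags, a second create of an existing file succeeds unless
the no-clobber bit is set, in which case open(2) fails with EEXIST.
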
==rob
> Thanks.
> David
>
> On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> >
> >
> > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >
> > On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > > This occurs on the ncsa machine bluewaters. I am using pnetcdf1.2.0 and
> > > pgi 11.10.0
> >
> > need one more bit of information: the version of MPT you are using.
> >
> >
> > Sorry, what's mpt? MPI?
> > Currently Loaded Modulefiles:
> >   1) modules/3.2.6.6
> >   2) xtpe-network-gemini
> >   3) xt-mpich2/5.4.2
> >   4) xtpe-interlagos
> >   5) eswrap/1.0.12
> >   6) torque/2.5.10
> >   7) moab/6.1.5
> >   8) scripts
> >   9) user-paths
> >  10) pgi/11.10.0
> >  11) xt-libsci/11.0.04
> >  12) udreg/2.3.1-1.0400.4264.3.1.gem
> >  13) ugni/2.3-1.0400.4374.4.88.gem
> >  14) pmi/3.0.0-1.0000.8661.28.2807.gem
> >  15) dmapp/3.2.1-1.0400.4255.2.159.gem
> >  16) gni-headers/2.1-1.0400.4351.3.1.gem
> >  17) xpmem/0.1-2.0400.31280.3.1.gem
> >  18) xe-sysroot/4.0.46
> >  19) xt-asyncpe/5.07
> >  20) atp/1.4.1
> >  21) PrgEnv-pgi/4.0.46
> >  22) hdf5-parallel/1.8.7
> >  23) netcdf-hdf5parallel/4.1.3
> >  24) parallel-netcdf/1.2.0
> >
> > > The issue is that calling nfmpi_createfile would sometimes result in an
> > > error:
> > >
> > > MPI_File_open : Other I/O error , error stack:
> > > (unknown)(): Other I/O error
> > > 126: MPI_File_open : Other I/O error , error stack:
> > > (unknown)(): Other I/O error
> > > Error on create : 502 -32
> > >
> > > The error appears to be intermittent and I could not get it to occur
> > > at all on a small number of tasks (160) but it occurs with high
> > > frequency when using a larger number of tasks (>=1600). I traced the
> > > problem to the use of nf_clobber in the mode argument, removing the
> > > nf_clobber seems to have solved the problem and I think that create
> > > implies clobber anyway doesn't it?
> >
> > > Can someone who knows what is going on under the covers enlighten me
> > > with some understanding of this issue? I suspect that one task is
> > > trying to clobber the file that another has just created or something
> > > of that nature.
> >
> > Unfortunately, "under the covers" here means "inside the MPI-IO
> > library", which we don't have access to.
> >
> > In the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > MPI_MODE_CREATE", and if noclobber is set, we add MPI_MODE_EXCL.
> >
> > OK, so that's pnetcdf. What's going on in MPI-IO? Well, Cray has based
> > their MPI-IO off of our ROMIO, but I'm not sure which version.
> >
> > Let me cook up a quick MPI-IO-only test case you can run to trigger
> > this problem and then you can beat cray over the head with it.
> >
> >
> >
> > Sounds good, thanks.
> >
> >
> > ==rob
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >
> >
> >
> >
> > --
> > Jim Edwards
> >
> > CESM Software Engineering Group
> > National Center for Atmospheric Research
> > Boulder, CO
> > 303-497-1842
> >
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA