pnetcdf 1.2.0 create file issue

David Knaak knaak at cray.com
Fri May 11 18:15:32 CDT 2012


Rob,

I suggested taking the discussion offline so as not to bother those not
interested in the Cray specifics.  But if you think those on the list
are either interested or don't consider it a bother, I can certainly
use the full list.

All,

In the MPT 5.4.0 release, I made some changes to MPI_File_open to
improve scalability.  Because of these changes and previous changes
I had made (for added functionality, not because of any bugs), the
code was getting very messy.  In fact, I introduced a bug or two with
these changes.  So in 5.4.3, I significantly restructured the code
for better maintainability, fixed the bugs (that I knew of), and made
more scalability changes.

Jim, 

NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as the
default.  The "module list" command output that you included shows:

   3) xt-mpich2/5.4.2 

The "module avail xt-mpch2" command shows what other versions are
available:

h2ologin2 25=>module avail xt-mpich2
--------------------- /opt/cray/modulefiles ---------------------
xt-mpich2/5.4.2(default)     xt-mpich2/5.4.4       xt-mpich2/5.4.5

Would you switch to 5.4.5, relink, and try again?

h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5

Thanks.
David


On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> > Jim,
> > 
> > Since you are having this problem on a Cray system, please open a Cray
> > bug report against MPI and I will look at it.  We can take further
> > discussions off line.
> 
> Oh, howdy David!  Forgot you were on the list.  Thanks for keeping an
> eye on things.
> 
> The pnetcdf list is pretty low-traffic these days, but we have an
> awful lot of users in Cray and Lustre environments.  If you'd rather
> discuss Cray-specific stuff elsewhere, I'd understand, but please let
> us know what you figure out.
> 
> ==rob
> 
> > Thanks.
> > David
> > 
> > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> > > 
> > > 
> > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> > > 
> > >     On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > >     > This occurs on the NCSA machine Blue Waters.  I am using
> > >     > pnetcdf 1.2.0 and pgi 11.10.0.
> > > 
> > >     need one more bit of information: the version of MPT you are using.
> > > 
> > > 
> > > Sorry, what's MPT?  MPI?
> > > Currently Loaded Modulefiles:
> > >   1) modules/3.2.6.6        9) user-paths                            17) xpmem/0.1-2.0400.31280.3.1.gem
> > >   2) xtpe-network-gemini   10) pgi/11.10.0                           18) xe-sysroot/4.0.46
> > >   3) xt-mpich2/5.4.2       11) xt-libsci/11.0.04                     19) xt-asyncpe/5.07
> > >   4) xtpe-interlagos       12) udreg/2.3.1-1.0400.4264.3.1.gem       20) atp/1.4.1
> > >   5) eswrap/1.0.12         13) ugni/2.3-1.0400.4374.4.88.gem         21) PrgEnv-pgi/4.0.46
> > >   6) torque/2.5.10         14) pmi/3.0.0-1.0000.8661.28.2807.gem     22) hdf5-parallel/1.8.7
> > >   7) moab/6.1.5            15) dmapp/3.2.1-1.0400.4255.2.159.gem     23) netcdf-hdf5parallel/4.1.3
> > >   8) scripts               16) gni-headers/2.1-1.0400.4351.3.1.gem   24) parallel-netcdf/1.2.0
> > > 
> > >     > The issue is that calling nfmpi_createfile would sometimes result in an
> > >     > error:
> > >     >
> > >     > MPI_File_open : Other I/O error , error stack:
> > >     > (unknown)(): Other I/O error
> > >     > 126: MPI_File_open : Other I/O error , error stack:
> > >     > (unknown)(): Other I/O error
> > >     >   Error on create :           502          -32
> > >     >
> > >     > The error appears to be intermittent; I could not get it to occur at
> > >     > all on a small number of tasks (160), but it occurs with high frequency
> > >     > when using a larger number of tasks (>=1600).  I traced the problem to
> > >     > the use of nf_clobber in the mode argument; removing nf_clobber seems
> > >     > to have solved the problem, and I think create implies clobber anyway,
> > >     > doesn't it?
> > > 
> > >     > Can someone who knows what is going on under the covers enlighten me
> > >     > with some understanding of this issue?  I suspect that one task is
> > >     > trying to clobber the file that another has just created, or something
> > >     > of that nature.
> > > 
> > >     Unfortunately, "under the covers" here means "inside the MPI-IO
> > >     library", which we don't have access to.
> > > 
> > >     In the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > >     MPI_MODE_CREATE", and if noclobber is set, we add MPI_MODE_EXCL.
> > > 
> > >     OK, so that's pnetcdf.  What's going on in MPI-IO?  Well, Cray based
> > >     their MPI-IO on our ROMIO, but I'm not sure which version.
> > > 
> > >     Let me cook up a quick MPI-IO-only test case you can run to trigger
> > >     this problem, and then you can beat Cray over the head with it.
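
Until Rob's reproducer is ready, a minimal sketch of that kind of test
might look like the following (the file name and iteration count are
arbitrary placeholders):

    /* create_test.c: all ranks collectively create a file with the same
     * flags pnetcdf uses for create-with-clobber, then close and delete
     * it, repeatedly.  Run at scale, e.g. >= 1600 ranks. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        char msg[MPI_MAX_ERROR_STRING];
        int rank, err, i, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 10; i++) {
            err = MPI_File_open(MPI_COMM_WORLD, "testfile",
                                MPI_MODE_RDWR | MPI_MODE_CREATE,
                                MPI_INFO_NULL, &fh);
            if (err != MPI_SUCCESS) {
                MPI_Error_string(err, msg, &len);
                fprintf(stderr, "rank %d, iter %d: %s\n", rank, i, msg);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
            MPI_File_close(&fh);
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0)
                MPI_File_delete("testfile", MPI_INFO_NULL);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }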
> > > 
> > > Sounds good, thanks.
> > > 
> > >     ==rob
> > > 
> > >     --
> > >     Rob Latham
> > >     Mathematics and Computer Science Division
> > >     Argonne National Lab, IL USA
> > > 
> > > 
> > > --
> > > Jim Edwards
> > > 
> > > CESM Software Engineering Group
> > > National Center for Atmospheric Research
> > > Boulder, CO
> > > 303-497-1842
> > > 
> > 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
