David,

I will give this a try, thanks.

On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak@cray.com> wrote:

Rob,

I suggested taking the discussion offline so as not to bother those who
aren't interested in the Cray specifics. But if you think those on the
list are either interested or don't consider it a bother, I can
certainly use the full list.

All,

In the MPT 5.4.0 release, I made some changes to MPI_File_open to
improve scalability. Because of these changes and previous changes I
had made (for added functionality, not because of any bugs), the code
was getting very messy. In fact, I introduced a bug or two with these
changes. So in 5.4.3, I significantly restructured the code for better
maintainability, fixed the bugs (that I knew of), and made more
scalability changes.

Jim,

NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as the
default. The "module list" command output that you included shows:

  3) xt-mpich2/5.4.2

The "module avail xt-mpich2" command shows what other versions are
available:

  h2ologin2 25=>module avail xt-mpich2
  --------------------- /opt/cray/modulefiles ---------------------
  xt-mpich2/5.4.2(default)  xt-mpich2/5.4.4  xt-mpich2/5.4.5

Would you switch to 5.4.5, relink, and try again?

  h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5

Thanks.
David

On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> > Jim,
> >
> > Since you are having this problem on a Cray system, please open a Cray
> > bug report against MPI and I will look at it. We can take further
> > discussion offline.
>
> Oh, howdy David! Forgot you were on the list. Thanks for keeping an
> eye on things.
>
> The pnetcdf list is pretty low-traffic these days, but we have an
> awful lot of users in Cray and Lustre environments. If you'd rather
> discuss Cray-specific stuff elsewhere, I'd understand, but please let
> us know what you figure out.
>
> ==rob
>
> > Thanks.
> > David
> >
> > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> > >
> > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl@mcs.anl.gov> wrote:
> > >
> > > On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > > > This occurs on the NCSA machine Blue Waters. I am using pnetcdf 1.2.0
> > > > and pgi 11.10.0.
> > >
> > > Need one more bit of information: the version of MPT you are using.
> > >
> > > Sorry, what's MPT? MPI?
> > > Currently Loaded Modulefiles:
> > >   1) modules/3.2.6.6
> > >   2) xtpe-network-gemini
> > >   3) xt-mpich2/5.4.2
> > >   4) xtpe-interlagos
> > >   5) eswrap/1.0.12
> > >   6) torque/2.5.10
> > >   7) moab/6.1.5
> > >   8) scripts
> > >   9) user-paths
> > >  10) pgi/11.10.0
> > >  11) xt-libsci/11.0.04
> > >  12) udreg/2.3.1-1.0400.4264.3.1.gem
> > >  13) ugni/2.3-1.0400.4374.4.88.gem
> > >  14) pmi/3.0.0-1.0000.8661.28.2807.gem
> > >  15) dmapp/3.2.1-1.0400.4255.2.159.gem
> > >  16) gni-headers/2.1-1.0400.4351.3.1.gem
> > >  17) xpmem/0.1-2.0400.31280.3.1.gem
> > >  18) xe-sysroot/4.0.46
> > >  19) xt-asyncpe/5.07
> > >  20) atp/1.4.1
> > >  21) PrgEnv-pgi/4.0.46
> > >  22) hdf5-parallel/1.8.7
> > >  23) netcdf-hdf5parallel/4.1.3
> > >  24) parallel-netcdf/1.2.0
> > >
> > > > The issue is that calling nfmpi_createfile would sometimes result
> > > > in an error:
> > > >
> > > > MPI_File_open : Other I/O error , error stack:
> > > > (unknown)(): Other I/O error
> > > > 126: MPI_File_open : Other I/O error , error stack:
> > > > (unknown)(): Other I/O error
> > > > Error on create : 502 -32
> > > >
> > > > The error appears to be intermittent. I could not get it to occur at
> > > > all on a small number of tasks (160), but it occurs with high
> > > > frequency when using a larger number of tasks (>=1600). I traced the
> > > > problem to the use of nf_clobber in the mode argument; removing the
> > > > nf_clobber seems to have solved the problem, and I think that create
> > > > implies clobber anyway, doesn't it?
> > >
> > > > Can someone who knows what is going on under the covers enlighten me
> > > > with some understanding of this issue? I suspect that one task is
> > > > trying to clobber the file that another has just created, or
> > > > something of that nature.
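For reference, the two create modes Jim is comparing look like this in
PnetCDF's C interface (Jim is calling through the Fortran interface via a
wrapper, so this is only an illustrative equivalent; the file name is made
up):

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int ncid, err;

        MPI_Init(&argc, &argv);

        /* clobber create: overwrite the file if it already exists */
        err = ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER,
                           MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) ncmpi_close(ncid);

        /* noclobber create: return an error instead of overwriting
         * an existing file */
        err = ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_NOCLOBBER,
                           MPI_INFO_NULL, &ncid);
        if (err == NC_NOERR) ncmpi_close(ncid);

        MPI_Finalize();
        return 0;
    }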
> > >
> > > Unfortunately, "under the covers" here means "inside the MPI-IO
> > > library", which we don't have access to.
> > >
> > > In the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > > MPI_MODE_CREATE", and if noclobber is set, we add MPI_MODE_EXCL.
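For readers following along, that flag mapping looks roughly like the C
sketch below. This is not PnetCDF's actual internal code and the helper
name is made up; it only illustrates the amode combination described
above.

    #include <mpi.h>
    #include <pnetcdf.h>   /* for the NC_NOCLOBBER flag */

    /* Hypothetical helper: a create always requests RDWR|CREATE, and a
     * noclobber create additionally requests EXCL so the open fails if
     * the file already exists. */
    static int open_for_create(MPI_Comm comm, const char *path, int cmode,
                               MPI_Info info, MPI_File *fh)
    {
        int amode = MPI_MODE_RDWR | MPI_MODE_CREATE;
        if (cmode & NC_NOCLOBBER)
            amode |= MPI_MODE_EXCL;
        return MPI_File_open(comm, path, amode, info, fh);
    }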
> > >
> > > OK, so that's pnetcdf. What's going on in MPI-IO? Well, Cray based
> > > their MPI-IO on our ROMIO, but I'm not sure which version.
> > >
> > > Let me cook up a quick MPI-IO-only test case you can run to trigger
> > > this problem, and then you can beat Cray over the head with it.
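A test along those lines might look roughly like the sketch below: every
rank repeatedly performs a collective create-mode open of the same file,
which is what PnetCDF does underneath a clobbering create. The file name
and iteration count are arbitrary, and this is only a guess at the shape
of such a reproducer, not the actual test case:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, err, len, iter;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (iter = 0; iter < 10; iter++) {
            /* collective create-mode open, as in the clobber case */
            err = MPI_File_open(MPI_COMM_WORLD, "testfile",
                                MPI_MODE_RDWR | MPI_MODE_CREATE,
                                MPI_INFO_NULL, &fh);
            if (err != MPI_SUCCESS) {
                MPI_Error_string(err, msg, &len);
                fprintf(stderr, "rank %d iter %d: %s\n", rank, iter, msg);
            } else {
                MPI_File_close(&fh);
            }
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0)
                MPI_File_delete("testfile", MPI_INFO_NULL);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }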
> > >
> > > Sounds good, thanks.
> > >
> > > ==rob
> > >
> > > --
> > > Rob Latham
> > > Mathematics and Computer Science Division
> > > Argonne National Lab, IL USA
> > >
> > > --
> > > Jim Edwards
> > >
> > > CESM Software Engineering Group
> > > National Center for Atmospheric Research
> > > Boulder, CO
> > > 303-497-1842
> > >
> >
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA

--

--
Jim Edwards

CESM Software Engineering Group
National Center for Atmospheric Research
Boulder, CO
303-497-1842