Hi David,

I am using 5.4.5. I reported the loaded modules in the ticket, but not in the mail to you - here they are:

Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) xtpe-network-gemini
  3) xt-mpich2/5.4.5
  4) xtpe-interlagos
  5) eswrap/1.0.12
  6) torque/2.5.10
  7) moab/6.1.5
  8) scripts
  9) user-paths
 10) pgi/11.10.0
 11) xt-libsci/11.0.04
 12) udreg/2.3.1-1.0400.4264.3.1.gem
 13) ugni/2.3-1.0400.4374.4.88.gem
 14) pmi/3.0.0-1.0000.8661.28.2807.gem
 15) dmapp/3.2.1-1.0400.4255.2.159.gem
 16) gni-headers/2.1-1.0400.4351.3.1.gem
 17) xpmem/0.1-2.0400.31280.3.1.gem
 18) xe-sysroot/4.0.46
 19) xt-asyncpe/5.07
 20) atp/1.4.1
 21) PrgEnv-pgi/4.0.46
 22) hdf5-parallel/1.8.7
 23) netcdf-hdf5parallel/4.1.3

On Mon, May 21, 2012 at 3:58 PM, David Knaak <knaak@cray.com> wrote:
Hi Jim,

I ran the pnetcdf tests using xt-mpich2/5.4.2 and it does have some problems
that are fixed in xt-mpich2/5.4.3.  Did you try this

module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5

for your application?


Rob,

Are the pnetcdf tests designed to run with more than one MPI process?
When I run with just one process, the Cray library (that is, the 5.4.3
and later versions) reports 2 errors, as expected:

------------------------------------------------------------------------
*** Testing ncmpi_open ...
FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
0: MPI_File_open : File does not exist, error stack:
ADIOI_CRAY_OPEN(86): File tooth-fairy.nc does not exist
### 1 FAILURES TESTING ncmpi_open! ###

*** Testing ncmpi_create ...
FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
0: MPI_File_open : Other I/O error , error stack:
ADIOI_CRAY_OPEN(116): Other I/O error File exists
### 1 FAILURES TESTING ncmpi_create! ###
------------------------------------------------------------------------

The test file line numbers are a little different from the base pnetcdf
tests because of some extra print statements I put in for debugging.

When I run with the unmodified ANL MPICH2 code I get the same results:

------------------------------------------------------------------------
*** Testing ncmpi_open ...
FAILURE at line 99 of test_read.c: ncmpi_open of nonexistent file should have returned system error
0: MPI_File_open : File does not exist, error stack:
ADIOI_UFS_OPEN(70): File tooth-fairy.nc does not exist
### 1 FAILURES TESTING ncmpi_open! ###

*** Testing ncmpi_create ...
FAILURE at line 58 of test_write.c: attempt to overwrite file: status = -32
0: MPI_File_open : Other I/O error , error stack:
ADIOI_UFS_OPEN(100): Other I/O error File exists
### 1 FAILURES TESTING ncmpi_create! ###
------------------------------------------------------------------------

But when I run with more MPI tasks, the results get very messy, for both
implementations.  For example, running with 4 tasks, the end of the
output for the base ANL library is:

------------------------------------------------------------------------
FAILURE at line 551 of test_write.c: remove of scratch.nc failed
### 1 FAILURES TESTING ncmpi_rename_dim! ###

FAILURE at line 551 of test_write.c: remove of scratch.nc failed
### 1 FAILURES TESTING ncmpi_rename_dim! ###
ok

FAILURE at line 551 of test_write.c: remove of scratch.nc failed
### 1 FAILURES TESTING ncmpi_rename_dim! ###

NOTE: parallel-netcdf expects to see 2 failures
Total number of failures: 7

NOTE: parallel-netcdf expects to see 2 failures
Total number of failures: 10

NOTE: parallel-netcdf expects to see 2 failures
Total number of failures: 8

NOTE: parallel-netcdf expects to see 2 failures
Total number of failures: 10
Application 7538532 exit codes: 1
------------------------------------------------------------------------

Is this what you expect?

David

On Sat, May 19, 2012 at 07:23:24AM -0600, Jim Edwards wrote:
> Hi David,
>
> I built the pnetcdf 1.3.0-pre1 release on bluewaters and
> ran the nf_test provided with that package.  It exhibits the same issues with
> nfmpi_create and nfmpi_open that I am seeing from the installed
> parallel-netcdf-1.2.0 and my application.
> I have opened ticket BWDSPCH-298 on bluewaters to track this issue.
>
> On Fri, May 11, 2012 at 6:50 PM, David Knaak <knaak@cray.com> wrote:
>
> Jim,
>
> I'm not intimate with pnetcdf but the stack trace entries look a little
> strange to me.  In particular:
>
> #10
> count=0x1527daa0, stride=0x0
>
> #13
> start=0x7fffffff3750, count=0x7fffffff3750
>
> Could it be that the high level routine is being called with a bad
> argument?
>
> Just a thought.
>
> David
>
> On Fri, May 11, 2012 at 05:37:24PM -0600, Jim Edwards wrote:
> > Hi David,
> >
> > I've updated to the 5.4.5 mpich2 on bluewaters and now get a different problem;
> > it happens earlier in the code, so I think that this worked in the older
> > version.  Here is a partial stack trace from the core file on 16000 tasks.
> > I'll try on Monday to reproduce this on a smaller number of tasks.  Can I
> > send you more info when I get it?
> >
> > #0  memcpy () at ../sysdeps/x86_64/memcpy.S:102
> > #1  0x0000000000c25b03 in MPIDI_CH3U_Buffer_copy ()
> > #2  0x0000000000bc91ae in MPIDI_Isend_self ()
> > #3  0x0000000000bc5ef4 in MPID_Isend ()
> > #4  0x0000000000be4b2b in PMPI_Isend ()
> > #5  0x0000000000b9439a in ADIOI_CRAY_R_Exchange_data ()
> > #6  0x0000000000b9503d in ADIOI_CRAY_Read_and_exch ()
> > #7  0x0000000000b957ca in ADIOI_CRAY_ReadStridedColl ()
> > #8  0x0000000000b7ea69 in MPIOI_File_read_all ()
> > #9  0x0000000000b7eab6 in PMPI_File_read_all ()
> > #10 0x0000000000b3d7ce in ncmpii_getput_vars (ncp=0x107192b0, varp=0x109ee340,
> >     start=0x15269910, count=0x1527daa0, stride=0x0, buf=0x155f5480,
> >     bufcount=245561, datatype=-871890885, rw_flag=1, io_method=1)
> >     at ./getput_vars.c:741
> > #11 0x0000000000b39804 in ncmpi_get_vara_all (ncid=1, varid=59, start=0x15269910,
> >     count=0x1527daa0, buf=0x155f5480, bufcount=245561, datatype=-871890885)
> >     at ./getput_vara.c:435
> > #12 0x0000000000b092c6 in nfmpi_get_vara_all_ (v1=0x33e8dd8, v2=0x33e89a0,
> >     v3=0x418d500, v4=0x418c500, v5=0x155f5480, v6=0x1e72e10, v7=0x1e72e04)
> >     at ./get_vara_allf.c:57
> > #13 0x00000000007c0524 in pionfread_mod::read_nfdarray_double (file=...,
> >     iobuf=0x155f5480, vardesc=..., iodesc=..., start=0x7fffffff3750,
> >     count=0x7fffffff3750)
> >
> >
> > On Fri, May 11, 2012 at 5:25 PM, Jim Edwards <jedwards@ucar.edu> wrote:
> >
> > David,
> >
> > I will give this a try, thanks.
> >
> >
> > On Fri, May 11, 2012 at 5:15 PM, David Knaak <knaak@cray.com> wrote:
> >
> > Rob,
> >
> > I suggested taking the discussion off line to not bother those not
> > interested in the Cray specifics.  But if you think those on the list
> > are either interested, or don't consider it a bother, I can certainly
> > use the full list.
> >
> > All,
> >
> > In the MPT 5.4.0 release, I made some changes to MPI_File_open to
> > improve scalability.  Because of these changes and previous changes
> > I had made (for added functionality, not because of any bugs), the
> > code was getting very messy.  In fact, I introduced a bug or 2 with
> > these changes.  So in 5.4.3, I significantly restructured the code
> > for better maintainability, fixed the bugs (that I knew of), and made
> > more scalability changes.
> >
> > Jim,
> >
> > The NCSA's "ESS" has the 5.4.2 version of Cray's MPI implementation as
> > default.  The "module list" command output that you included shows:
> >
> > 3) xt-mpich2/5.4.2
> >
> > The "module avail xt-mpich2" command shows what other versions are
> > available:
> >
> > h2ologin2 25=>module avail xt-mpich2
> > --------------------- /opt/cray/modulefiles ---------------------
> > xt-mpich2/5.4.2(default)    xt-mpich2/5.4.4    xt-mpich2/5.4.5
> >
> > Would you switch to 5.4.5, relink, and try again?
> >
> > h2ologin2 25=>module swap xt-mpich2/5.4.2 xt-mpich2/5.4.5
> >
> > Thanks.
> > David
> >
> >
> > On Fri, May 11, 2012 at 12:54:39PM -0500, Rob Latham wrote:
> > > On Fri, May 11, 2012 at 11:46:25AM -0500, David Knaak wrote:
> > > > Jim,
> > > >
> > > > Since you are having this problem on a Cray system, please open a Cray
> > > > bug report against MPI and I will look at it.  We can take further
> > > > discussions off line.
> > >
> > > Oh, howdy David!  forgot you were on the list.  Thanks for keeping an
> > > eye on things.
> > >
> > > the pnetcdf list is pretty low-traffic these days, but we have an
> > > awful lot of users in a cray and Lustre environment.  If you'd rather
> > > discuss cray specific stuff elsewhere, I'd understand, but please let
> > > us know what you figure out.
> > >
> > > ==rob
> > >
> > > > Thanks.
> > > > David
> > > >
> > > > On Fri, May 11, 2012 at 10:03:28AM -0600, Jim Edwards wrote:
> > > > >
> > > > > On Fri, May 11, 2012 at 9:43 AM, Rob Latham <robl@mcs.anl.gov> wrote:
> > > > >
> > > > > On Thu, May 10, 2012 at 03:28:57PM -0600, Jim Edwards wrote:
> > > > > > This occurs on the ncsa machine bluewaters.  I am using pnetcdf1.2.0 and
> > > > > > pgi 11.10.0
> > > > >
> > > > > need one more bit of information: the version of MPT you are using.
> > > > >
> > > > >
> > > > > Sorry, what's mpt?  MPI?
> > > > > Currently Loaded Modulefiles:
> > > > >   1) modules/3.2.6.6
> > > > >   2) xtpe-network-gemini
> > > > >   3) xt-mpich2/5.4.2
> > > > >   4) xtpe-interlagos
> > > > >   5) eswrap/1.0.12
> > > > >   6) torque/2.5.10
> > > > >   7) moab/6.1.5
> > > > >   8) scripts
> > > > >   9) user-paths
> > > > >  10) pgi/11.10.0
> > > > >  11) xt-libsci/11.0.04
> > > > >  12) udreg/2.3.1-1.0400.4264.3.1.gem
> > > > >  13) ugni/2.3-1.0400.4374.4.88.gem
> > > > >  14) pmi/3.0.0-1.0000.8661.28.2807.gem
> > > > >  15) dmapp/3.2.1-1.0400.4255.2.159.gem
> > > > >  16) gni-headers/2.1-1.0400.4351.3.1.gem
> > > > >  17) xpmem/0.1-2.0400.31280.3.1.gem
> > > > >  18) xe-sysroot/4.0.46
> > > > >  19) xt-asyncpe/5.07
> > > > >  20) atp/1.4.1
> > > > >  21) PrgEnv-pgi/4.0.46
> > > > >  22) hdf5-parallel/1.8.7
> > > > >  23) netcdf-hdf5parallel/4.1.3
> > > > >  24) parallel-netcdf/1.2.0
> > > > >
> > > > >
> > > > > > The issue is that calling nfmpi_createfile would sometimes result in an
> > > > > > error:
> > > > > >
> > > > > > MPI_File_open : Other I/O error , error stack:
> > > > > > (unknown)(): Other I/O error
> > > > > > 126: MPI_File_open : Other I/O error , error stack:
> > > > > > (unknown)(): Other I/O error
> > > > > > Error on create : 502 -32
> > > > > >
> > > > > > The error appears to be intermittent and I could not get it to occur at all
> > > > > > on a small number of tasks (160) but it occurs with high frequency when
> > > > > > using a larger number of tasks (>=1600).  I traced the problem to the use
> > > > > > of nf_clobber in the mode argument, removing the nf_clobber seems to have
> > > > > > solved the problem and I think that create implies clobber anyway doesn't
> > > > > > it?
> > > > >
> > > > > > Can someone who knows what is going on under the covers enlighten me
> > > > > > with some understanding of this issue?  I suspect that one task is trying
> > > > > > to clobber the file that another has just created or something of that
> > > > > > nature.
> > > > >
> > > > > Unfortunately, "under the covers" here means "inside the MPI-IO
> > > > > library", which we don't have access to.
> > > > >
> > > > > in the create case we call MPI_File_open with "MPI_MODE_RDWR |
> > > > > MPI_MODE_CREATE", and if noclobber set, we add MPI_MODE_EXCL.
> > > > >
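[Illustrative aside: the mapping Rob describes above, sketched in C. This is only a sketch, not the actual pnetcdf source; the helper name open_for_create is made up, and NC_NOCLOBBER stands in for the "noclobber" flag he mentions.]

#include <mpi.h>
#include <pnetcdf.h>

/* Sketch of how a create maps to MPI_File_open flags, per the
 * description above: always RDWR|CREATE, plus EXCL when the caller
 * asked for noclobber.  Illustrative only, not the real pnetcdf code. */
static int open_for_create(MPI_Comm comm, const char *path, int cmode,
                           MPI_Info info, MPI_File *fh)
{
    int amode = MPI_MODE_RDWR | MPI_MODE_CREATE;
    if (cmode & NC_NOCLOBBER)
        amode |= MPI_MODE_EXCL;   /* fail if the file already exists */
    return MPI_File_open(comm, (char *)path, amode, info, fh);
}
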
> > > > > OK, so that's pnetcdf.  What's going on in MPI-IO?  Well, cray's based
> > > > > their MPI-IO off of our ROMIO, but I'm not sure which version.
> > > > >
> > > > > Let me cook up a quick MPI-IO-only test case you can run to trigger
> > > > > this problem and then you can beat cray over the head with it.
> > > > >
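[Illustrative aside: one guess at the shape of such an MPI-IO-only test, not the test Rob actually wrote. Every rank collectively re-creates the same file the way a clobbering create does (RDWR|CREATE, no EXCL); the filename and iteration count are arbitrary.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, err, iter, msglen;
    char msg[MPI_MAX_ERROR_STRING];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (iter = 0; iter < 10; iter++) {
        /* rank 0 removes any old file, ignoring "no such file" errors */
        if (rank == 0)
            MPI_File_delete("scratch_test.nc", MPI_INFO_NULL);
        MPI_Barrier(MPI_COMM_WORLD);

        /* collective create without MPI_MODE_EXCL, as in the clobber case */
        err = MPI_File_open(MPI_COMM_WORLD, "scratch_test.nc",
                            MPI_MODE_RDWR | MPI_MODE_CREATE,
                            MPI_INFO_NULL, &fh);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &msglen);
            printf("%d: iter %d: MPI_File_open failed: %s\n", rank, iter, msg);
        } else {
            MPI_File_close(&fh);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
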
> > > > > Sounds good, thanks.
> > > > >
> > > > > ==rob
> > > > >
> > > > > --
> > > > > Rob Latham
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Lab, IL USA
--
Jim Edwards

CESM Software Engineering Group
National Center for Atmospheric Research
Boulder, CO
303-497-1842