<div dir="ltr">Rob and Wei-keng,<div><br></div><div>Thanks for your help on this problem. Rob - The patch seems to work. I had to apply it by hand, but now the pnetcdf tests (mostly) complete successfully. The FLASH-IO benchmark is failing when Lustre is used. It completes successfully when Panasas is used. The error code that is returned by nfmpi_enddef is -262. The description for this error is:</div><div><br></div><div>#define NC_EMULTIDEFINE_VAR_BEGIN (-262) /**< inconsistent variable file begin offset (internal use) */<br></div><div><div><br></div><div><div>[root@Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_</div><div> Here: 0</div><div> Here: 0</div><div> Here: 0</div><div> Here: 0</div><div> number of guards : 4</div><div> number of blocks : 80</div><div> number of variables : 24</div><div> checkpoint time : 12.74 sec</div><div> max header : 0.88 sec</div><div> max unknown : 11.83 sec</div><div> max close : 0.53 sec</div><div> I/O amount : 242.30 MiB</div><div> plot no corner : 2.38 sec</div><div> max header : 0.59 sec</div><div> max unknown : 1.78 sec</div><div> max close : 0.22 sec</div><div> I/O amount : 20.22 MiB</div><div> plot corner : 2.52 sec</div><div> max header : 0.81 sec</div><div> max unknown : 1.51 sec</div><div> max close : 0.96 sec</div><div> I/O amount : 24.25 MiB</div><div> -------------------------------------------------------</div><div> File base name : /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_</div><div> file striping count : 0</div><div> file striping size : 301346992 bytes</div><div> Total I/O amount : 286.78 MiB</div><div> -------------------------------------------------------</div><div> nproc array size exec (sec) bandwidth (MiB/s)</div><div> 4 16 x 16 x 16 17.64 16.26</div><div><br></div><div><br></div><div>[root@Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/lfs_flash_io_test_</div><div> Here: -262</div><div> Here: -262</div><div> 
Here: -262</div><div> nfmpi_enddef</div><div> (Internal error) beginning file offset of this variable is inconsistent among p</div><div> r</div><div> nfmpi_enddef</div><div> (Internal error) beginning file offset of this variable is inconsistent among p</div><div> r</div><div> nfmpi_enddef</div><div> (Internal error) beginning file offset of this variable is inconsistent among p</div><div> r</div><div> Here: 0</div><div>[cli_1]: aborting job:</div><div>application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1</div><div>[cli_3]: [cli_2]: aborting job:</div><div>application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3</div><div>aborting job:</div><div>application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2</div><div><br></div><div>===================================================================================</div><div>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES</div><div>= PID 16702 RUNNING AT fe7</div><div>= EXIT CODE: 255</div><div>= CLEANING UP REMAINING PROCESSES</div><div>= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES</div><div>===================================================================================</div><div><br></div></div></div><div>Thanks,</div><div>Craig</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 21, 2015 at 8:30 AM, Rob Latham <span dir="ltr"><<a href="mailto:robl@mcs.anl.gov" target="_blank">robl@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
<br>
On 09/20/2015 03:44 PM, Craig Tierney - NOAA Affiliate wrote:<br>
</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Wei-keng,<br>
<br><span class="">
I tried your test code on a different system, and I found it worked with<br>
Intel+mvapich2 (2.1rc1). That system was using Panasas and I was<br>
testing on Lustre. I then tried Panasas on the original machine<br>
(supports both Panasas and Lustre) and I got the correct behavior.<br>
<br>
So the problem is somehow related to Lustre. We are using the 2.5.37.ddn<br>
client. Unless you have an obvious answer, I will open this with DDN<br>
tomorrow.<br>
<br>
</span></blockquote>
<br>
Ah, bet I know why this is!<br>
<br>
The Lustre driver and (some versions of) the Panasas driver set their fs-specific hints by opening the file, issuing some ioctls, then continuing on without deleting the file.<br>
<br>
In the common case, when we expect the file to show up, no one notices or cares, but with MPI_MODE_EXCL or other restrictive flags, the file gets created when we did not expect it to be -- and that's part of the reason this bug lived on so long.<br>
<br>
I fixed this by moving file manipulations out of the hint-parsing path and into the open path (after we check permissions and flags).<br>
<br>
Relevant commit: <a href="https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9" rel="noreferrer" target="_blank">https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9</a><br>
<br>
See more details from Darshan, OpenMPI, and MPICH here:<br>
- <a href="https://trac.mpich.org/projects/mpich/ticket/2261" rel="noreferrer" target="_blank">https://trac.mpich.org/projects/mpich/ticket/2261</a><br>
- <a href="https://github.com/open-mpi/ompi/issues/158" rel="noreferrer" target="_blank">https://github.com/open-mpi/ompi/issues/158</a><br>
- <a href="http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html" rel="noreferrer" target="_blank">http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html</a><br>
<br>
==rob<br>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
Thanks,<br>
Craig<br>
<br>
On Sun, Sep 20, 2015 at 2:36 PM, Craig Tierney - NOAA Affiliate<br></span><span class="">
<<a href="mailto:craig.tierney@noaa.gov" target="_blank">craig.tierney@noaa.gov</a>> wrote:<br>
<br>
Wei-keng,<br>
<br>
Thanks for the test case. Here is what I get using a set of<br>
compilers and MPI stacks. I was expecting that mvapich2 1.8 and 2.1<br>
would behave differently.<br>
<br>
What versions of MPI do you test internally?<br>
<br>
Craig<br>
<br>
Testing intel+impi<br>
<br>
Currently Loaded Modules:<br></span>
1) newdefaults 2) intel/15.0.3.187 3)<br>
impi/5.1.1.109<span class=""><br>
<br>
Error at line 22: File does not exist, error stack:<br>
ADIOI_NFS_OPEN(69): File /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc does not exist<br></span><span class="">
Testing intel+mvapich2 2.1<br>
<br>
Currently Loaded Modules:<br></span>
1) newdefaults 2) intel/15.0.3.187 3)<span class=""><br>
mvapich2/2.1<br>
<br>
file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc<br></span><span class="">
Testing intel+mvapich2 1.8<br>
<br>
Currently Loaded Modules:<br></span>
1) newdefaults 2) intel/15.0.3.187 3)<span class=""><br>
mvapich2/1.8<br>
<br>
file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc<br></span><span class="">
Testing pgi+mvapich2 2.1<br>
<br>
Currently Loaded Modules:<br>
1) newdefaults 2) pgi/15.3 3) mvapich2/2.1<br>
<br>
file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc<br></span><span class="">
Testing pgi+mvapich2 1.8<br>
<br>
Currently Loaded Modules:<br>
1) newdefaults 2) pgi/15.3 3) mvapich2/1.8<br>
<br>
file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc<br></span><span class="">
<br>
Craig<br>
<br>
On Sun, Sep 20, 2015 at 1:43 PM, Wei-keng Liao<br></span>
<<a href="mailto:wkliao@eecs.northwestern.edu" target="_blank">wkliao@eecs.northwestern.edu</a>><span class=""><br>
wrote:<br>
<br>
In that case, it is likely mvapich does not perform correctly.<br>
<br>
In PnetCDF, when NC_NOWRITE is used in a call to ncmpi_open,<br>
PnetCDF calls a MPI_File_open with the open flag set to<br>
MPI_MODE_RDONLY. See<br>
<a href="http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322" rel="noreferrer" target="_blank">http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322</a><br>
<br>
Maybe test this with a simple MPI-IO program below.<br>
It prints error messages like<br>
Error at line 15: File does not exist, error stack:<br>
ADIOI_UFS_OPEN(69): File tooth-fairy.nc does not exist<br></span><span class="">
<br>
But, no file should be created.<br>
<br>
<br>
#include <stdio.h><br>
#include <unistd.h> /* unlink() */<br>
#include <mpi.h><br>
<br>
int main(int argc, char **argv) {<br>
int err;<br>
MPI_File fh;<br>
<br>
MPI_Init(&argc, &argv);<br>
<br></span>
/* delete "tooth-fairy.nc" and<br>
ignore the error */<br>
unlink("tooth-fairy.nc");<br>
<br>
err = MPI_File_open(MPI_COMM_WORLD, "tooth-fairy.nc",<br>
MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);<span class=""><br>
if (err != MPI_SUCCESS) {<br>
int errorStringLen;<br>
char errorString[MPI_MAX_ERROR_STRING];<br>
MPI_Error_string(err, errorString, &errorStringLen);<br>
printf("Error at line %d: %s\n",__LINE__, errorString);<br>
}<br>
else<br>
MPI_File_close(&fh);<br>
<br>
MPI_Finalize();<br>
return 0;<br>
}<br>
<br>
<br>
Wei-keng<br>
<br>
On Sep 20, 2015, at 1:51 PM, Craig Tierney - NOAA Affiliate wrote:<br>
<br>
> Wei-keng,<br>
><br>
> I always run distclean before I try to build the code. The<br>
first test failing is nc_test. The problem seems to be in this<br>
test:<br>
><br>
> err = ncmpi_open(comm, "tooth-fairy.nc", NC_NOWRITE, info, &ncid);/* should fail */<br></span><span class="">
> IF (err == NC_NOERR)<br>
> error("ncmpi_open of nonexistent file should have<br>
failed");<br>
> IF (err != NC_ENOENT)<br>
> error("ncmpi_open of nonexistent file should have<br>
returned NC_ENOENT");<br>
> else {<br>
> /* printf("Expected error message complaining: \"File tooth-fairy.nc does not exist\"\n"); */<br></span>
> nok++;<br>
> }<br>
><br>
> A zero-length tooth-fairy.nc file is<span class=""><br>
being created, and I don't think that is supposed to happen.<br>
That would mean that the mode NC_NOWRITE is not being honored by<br>
MPI-IO. I will look at this more tomorrow and try to craft a<br>
short example.<br>
><br>
> Craig<br>
><br>
> On Sun, Sep 20, 2015 at 10:23 AM, Wei-keng Liao<br>
<<a href="mailto:wkliao@eecs.northwestern.edu" target="_blank">wkliao@eecs.northwestern.edu</a>><br></span><span class="">
wrote:<br>
> Hi, Craig<br>
><br>
> Your config.log looks fine to me.<br>
> Some of your error messages are supposed to report errors of<br>
opening<br>
> a non-existing file, but report a different error code,<br>
meaning the<br>
> file does exist. I suspect it may be because of residue files.<br>
><br>
> Could you do a clean rebuild with the following commands?<br>
> % make -s distclean<br>
> % ./configure --prefix=/apps/pnetcdf/1.6.1-intel-mvapich2<br>
> % make -s -j8<br>
> % make -s check<br>
><br>
> If the problem persists, then it might be because of mvapich.<br>
><br>
> Wei-keng<br>
><br>
<br>
<br>
<br>
</span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
-- <br>
Rob Latham<br>
Mathematics and Computer Science Division<br>
Argonne National Lab, IL USA<br>
</font></span></blockquote></div><br></div>