Unable to pass all the tests with pnetcdf 1.6.1, Intel 15.0.3.048 and Mvapich2 2.1

Rob Latham robl at mcs.anl.gov
Mon Sep 21 09:30:55 CDT 2015



On 09/20/2015 03:44 PM, Craig Tierney - NOAA Affiliate wrote:
> Wei-keng,
>
> I tried your test code on a different system, and I found it worked with
> Intel+mvapich2 (2.1rc1).  That system was using Panasas and I was
> testing on Lustre.  I then tried Panasas on the original machine
> (supports both Panasas and Lustre) and I got the correct behavior.
>
> So the problem somehow related to Lustre.  We are using the 2.5.37.ddn
> client.   Unless you have an obvious answer, I will open this with DDN
> tomorrow.
>

Ah, bet I know why this is!

The Lustre driver and (some versions of the) Panasas driver set their
fs-specific hints by opening the file, issuing some ioctls, and then
continuing on without deleting the file.

In the common case, when we expect the file to show up, no one notices
or cares, but with MPI_MODE_EXCL or other restrictive flags, the file
gets created when we did not expect it to -- and that's part of the
reason this bug lived on so long.

I fixed this by moving the file manipulations out of the hint-parsing path
and into the open path (after we check permissions and flags).
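
In rough outline, the old hint-processing path did something like the
sketch below (simplified and hypothetical, not the actual ADIO driver
code):

    #include <fcntl.h>
    #include <unistd.h>

    /* sketch: apply a filesystem striping hint by opening the file */
    static void set_fs_hint(const char *filename)
    {
        /* O_CREAT is the problem: even a read-only MPI_File_open on a
           nonexistent file leaves a zero-length file behind */
        int fd = open(filename, O_RDWR | O_CREAT, 0644);
        if (fd < 0) return;

        /* the fs-specific ioctl(fd, ...) that sets the hint went here */

        close(fd);  /* the file is never unlinked, so it stays on disk
                       even if the real open later fails its checks */
    }

    int main(void)
    {
        set_fs_hint("tooth-fairy.nc"); /* file now exists as a side effect */
        return 0;
    }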

Relevant commit: 
https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9

See more details from Darshan, OpenMPI, and MPICH here:
- https://trac.mpich.org/projects/mpich/ticket/2261
- https://github.com/open-mpi/ompi/issues/158
- http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html

==rob


> Thanks,
> Craig
>
> On Sun, Sep 20, 2015 at 2:36 PM, Craig Tierney - NOAA Affiliate
> <craig.tierney at noaa.gov> wrote:
>
>     Wei-keng,
>
>     Thanks for the test case.  Here is what I get using a set of
>     compilers and MPI stacks.  I was expecting that mvapich2 1.8 and 2.1
>     would behave differently.
>
>     What versions of MPI do you test internally?
>
>     Craig
>
>     Testing intel+impi
>
>     Currently Loaded Modules:
>        1) newdefaults   2) intel/15.0.3.187   3) impi/5.1.1.109
>
>     Error at line 22: File does not exist, error stack:
>     ADIOI_NFS_OPEN(69): File /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc does not exist
>     Testing intel+mvapich2 2.1
>
>     Currently Loaded Modules:
>        1) newdefaults   2) intel/15.0.3.187   3) mvapich2/2.1
>
>     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>     Testing intel+mvapich2 1.8
>
>     Currently Loaded Modules:
>        1) newdefaults   2) intel/15.0.3.187   3) mvapich2/1.8
>
>     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>     Testing pgi+mvapich2 2.1
>
>     Currently Loaded Modules:
>        1) newdefaults   2) pgi/15.3   3) mvapich2/2.1
>
>     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>     Testing pgi+mvapich2 1.8
>
>     Currently Loaded Modules:
>        1) newdefaults   2) pgi/15.3   3) mvapich2/1.8
>
>     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>
>     Craig
>
>     On Sun, Sep 20, 2015 at 1:43 PM, Wei-keng Liao
>     <wkliao at eecs.northwestern.edu> wrote:
>
>         In that case, it is likely mvapich does not perform correctly.
>
>         In that case, it is likely mvapich does not perform correctly.
>
>         In PnetCDF, when NC_NOWRITE is used in a call to ncmpi_open,
>         PnetCDF calls MPI_File_open with the open flag set to
>         MPI_MODE_RDONLY. See
>         http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322
>
>         Maybe test this with a simple MPI-IO program below.
>         It prints error messages like
>              Error at line 15: File does not exist, error stack:
>              ADIOI_UFS_OPEN(69): File tooth-fairy.nc does not exist
>
>         But, no file should be created.
>
>
>         #include <stdio.h>
>         #include <unistd.h> /* unlink() */
>         #include <mpi.h>
>
>         int main(int argc, char **argv) {
>              int err;
>              MPI_File fh;
>
>              MPI_Init(&argc, &argv);
>
>              /* delete "tooth-fairy.nc" and ignore the error */
>              unlink("tooth-fairy.nc");
>
>              err = MPI_File_open(MPI_COMM_WORLD, "tooth-fairy.nc",
>                                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
>              if (err != MPI_SUCCESS) {
>                  int errorStringLen;
>                  char errorString[MPI_MAX_ERROR_STRING];
>                  MPI_Error_string(err, errorString, &errorStringLen);
>                  printf("Error at line %d: %s\n",__LINE__, errorString);
>              }
>              else
>                  MPI_File_close(&fh);
>
>              MPI_Finalize();
>              return 0;
>         }
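>
>         For reference, a rough PnetCDF analog of the same check would
>         look like the following (an untested sketch; it just swaps in
>         ncmpi_open/ncmpi_strerror from pnetcdf.h):
>
>         #include <stdio.h>
>         #include <unistd.h> /* unlink() */
>         #include <mpi.h>
>         #include <pnetcdf.h>
>
>         int main(int argc, char **argv) {
>              int err, ncid;
>
>              MPI_Init(&argc, &argv);
>
>              /* delete "tooth-fairy.nc" and ignore the error */
>              unlink("tooth-fairy.nc");
>
>              /* NC_NOWRITE maps to MPI_MODE_RDONLY inside PnetCDF, so this
>                 open should fail with NC_ENOENT and create no file */
>              err = ncmpi_open(MPI_COMM_WORLD, "tooth-fairy.nc", NC_NOWRITE,
>                               MPI_INFO_NULL, &ncid);
>              if (err != NC_NOERR)
>                  printf("ncmpi_open error: %s\n", ncmpi_strerror(err));
>              else
>                  ncmpi_close(ncid);
>
>              MPI_Finalize();
>              return 0;
>         }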
>
>
>         Wei-keng
>
>         On Sep 20, 2015, at 1:51 PM, Craig Tierney - NOAA Affiliate wrote:
>
>          > Wei-keng,
>          >
>          > I always run distclean before I try to build the code.  The
>          > first test to fail is nc_test.  The problem seems to be in
>          > this test:
>          >
>          >    err = ncmpi_open(comm, "tooth-fairy.nc", NC_NOWRITE,
>          >                     info, &ncid); /* should fail */
>          >     IF (err == NC_NOERR)
>          >         error("ncmpi_open of nonexistent file should have failed");
>          >     IF (err != NC_ENOENT)
>          >         error("ncmpi_open of nonexistent file should have returned NC_ENOENT");
>          >     else {
>          >         /* printf("Expected error message complaining: \"File tooth-fairy.nc does not exist\"\n"); */
>          >         nok++;
>          >     }
>          >
>          > A zero-length tooth-fairy.nc file is being created, and I
>          > don't think that is supposed to happen.  That would mean that
>          > the NC_NOWRITE mode is not being honored by MPI-IO.  I will
>          > look at this more tomorrow and try to craft a short example.
>          >
>          > Craig
>          >
>          > On Sun, Sep 20, 2015 at 10:23 AM, Wei-keng Liao
>          > <wkliao at eecs.northwestern.edu> wrote:
>          > Hi, Craig
>          >
>          > Your config.log looks fine to me.
>          > Some of your error messages are supposed to report errors from
>          > opening a nonexistent file, but report a different error code,
>          > meaning the file does exist.  I suspect it may be because of
>          > residue files.
>          >
>          > Could you do a clean rebuild with the following commands?
>          >     % make -s distclean
>          >     % ./configure --prefix=/apps/pnetcdf/1.6.1-intel-mvapich2
>          >     % make -s -j8
>          >     % make -s check
>          >
>          > If the problem persists, then it might be because of mvapich.
>          >
>          > Wei-keng
>          >
>
>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

