Unable to pass all the tests with pnetcdf 1.6.1, Intel 15.0.3.048 and Mvapich2 2.1

Craig Tierney - NOAA Affiliate craig.tierney at noaa.gov
Tue Sep 22 14:34:11 CDT 2015


Wei-keng,

Here is the output from my run with PNETCDF_SAFE_MODE=1 on Lustre:

[root at Jet:fe8 FLASH-IO]# mpiexec.hydra -env PNETCDF_SAFE_MODE 1 -np 4 ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/d1//flash_io_test_
Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
 Here:        -250
 Here:        -262
 Here:        -262
 Here:        -262
 nfmpi_enddef
 File header is inconsistent among processes
 nfmpi_enddef
 (Internal error) beginning file offset of this variable is inconsistent among p
 r
 nfmpi_enddef
 (Internal error) beginning file offset of this variable is inconsistent among p
 r
 nfmpi_enddef
 (Internal error) beginning file offset of this variable is inconsistent among p
 r
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[cli_0]: [cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

Craig

On Mon, Sep 21, 2015 at 1:21 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:

>
> It is strange that the test failed for Lustre.
>
> The error message says some variables defined across MPI processes are not
> consistent.
> Could you run this benchmark with safe mode on, by setting the environment
> variable PNETCDF_SAFE_MODE to 1 before the run? Safe mode will print more
> detailed error messages, such as which variables are inconsistent and at
> what offsets.
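>
> For example, with mpiexec.hydra (any launcher's environment-forwarding
> option works; the output file base name below is a placeholder):
>
>     mpiexec.hydra -env PNETCDF_SAFE_MODE 1 -np 4 ./flash_benchmark_io <file base name>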
>
>
> Wei-keng
>
> On Sep 21, 2015, at 1:31 PM, Craig Tierney - NOAA Affiliate wrote:
>
> > Rob and Wei-keng,
> >
> > Thanks for your help on this problem.  Rob - the patch seems to work.  I
> > had to hand-apply it, but now the pnetcdf tests (mostly) complete
> > successfully.  The FLASH-IO benchmark fails when Lustre is used and
> > completes successfully when Panasas is used.  The error code returned by
> > nfmpi_enddef is -262.  The description for this error is:
> >
> > #define NC_EMULTIDEFINE_VAR_BEGIN       (-262) /**< inconsistent variable file begin offset (internal use) */
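> >
> > A minimal sketch of what this error means (not part of FLASH-IO; the file
> > and variable names here are made up): if each rank defines a
> > different-sized leading variable, the begin offset of the following
> > variable disagrees across processes, and ncmpi_enddef reports an
> > NC_EMULTIDEFINE_* error such as this one.
> >
> >     #include <stdio.h>
> >     #include <mpi.h>
> >     #include <pnetcdf.h>
> >
> >     int main(int argc, char **argv) {
> >         int rank, ncid, dimid1, dimid2, v1, v2, err;
> >
> >         MPI_Init(&argc, &argv);
> >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >         ncmpi_create(MPI_COMM_WORLD, "begin_test.nc", NC_CLOBBER,
> >                      MPI_INFO_NULL, &ncid);
> >
> >         /* rank-dependent dimension length => rank-dependent offsets */
> >         ncmpi_def_dim(ncid, "n", 100 * (rank + 1), &dimid1);
> >         ncmpi_def_dim(ncid, "m", 100, &dimid2);
> >         ncmpi_def_var(ncid, "v1", NC_INT, 1, &dimid1, &v1);
> >         ncmpi_def_var(ncid, "v2", NC_INT, 1, &dimid2, &v2);
> >
> >         /* v2 begins right after v1, whose size differs per rank */
> >         err = ncmpi_enddef(ncid);
> >         if (err != NC_NOERR)
> >             printf("rank %d: ncmpi_enddef: %s\n", rank,
> >                    ncmpi_strerror(err));
> >
> >         ncmpi_close(ncid);
> >         MPI_Finalize();
> >         return 0;
> >     }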
> >
> > [root at Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
> >  Here:           0
> >  Here:           0
> >  Here:           0
> >  Here:           0
> >  number of guards      :             4
> >  number of blocks      :            80
> >  number of variables   :            24
> >  checkpoint time       :            12.74  sec
> >         max header     :             0.88  sec
> >         max unknown    :            11.83  sec
> >         max close      :             0.53  sec
> >         I/O amount     :           242.30  MiB
> >  plot no corner        :             2.38  sec
> >         max header     :             0.59  sec
> >         max unknown    :             1.78  sec
> >         max close      :             0.22  sec
> >         I/O amount     :            20.22  MiB
> >  plot    corner        :             2.52  sec
> >         max header     :             0.81  sec
> >         max unknown    :             1.51  sec
> >         max close      :             0.96  sec
> >         I/O amount     :            24.25  MiB
> >  -------------------------------------------------------
> >  File base name        : /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
> >    file striping count :             0
> >    file striping size  :     301346992     bytes
> >  Total I/O amount      :           286.78  MiB
> >  -------------------------------------------------------
> >  nproc    array size      exec (sec)   bandwidth (MiB/s)
> >     4    16 x  16 x  16     17.64       16.26
> >
> >
> > [root at Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/lfs_flash_io_test_
> >  Here:        -262
> >  Here:        -262
> >  Here:        -262
> >  nfmpi_enddef
> >  (Internal error) beginning file offset of this variable is inconsistent among p
> >  r
> >  nfmpi_enddef
> >  (Internal error) beginning file offset of this variable is inconsistent among p
> >  r
> >  nfmpi_enddef
> >  (Internal error) beginning file offset of this variable is inconsistent among p
> >  r
> >  Here:           0
> > [cli_1]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
> > [cli_3]: [cli_2]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
> > aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
> >
> >
> > ===================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   PID 16702 RUNNING AT fe7
> > =   EXIT CODE: 255
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > ===================================================================================
> >
> > Thanks,
> > Craig
> >
> >
> > On Mon, Sep 21, 2015 at 8:30 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >
> >
> > On 09/20/2015 03:44 PM, Craig Tierney - NOAA Affiliate wrote:
> > Wei-keng,
> >
> > I tried your test code on a different system, and I found it worked with
> > Intel+mvapich2 (2.1rc1).  That system was using Panasas, and I had been
> > testing on Lustre.  I then tried Panasas on the original machine (which
> > supports both Panasas and Lustre) and got the correct behavior.
> >
> > So the problem is somehow related to Lustre.  We are using the 2.5.37.ddn
> > client.  Unless you have an obvious answer, I will open this with DDN
> > tomorrow.
> >
> >
> > Ah, I bet I know why this is!
> >
> > The Lustre driver and (some versions of) the Panasas driver set their
> > fs-specific hints by opening the file, setting some ioctls, and then
> > continuing on without deleting the file.
> >
> >  In the common case, when we expect the file to show up, no one notices
> > or cares; but with MPI_MODE_EXCL or other restrictive flags, the file
> > gets created when we did not expect it to -- and that's part of the
> > reason this bug lived on so long.
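> >
> > A minimal sketch of that failure mode (the file name here is made up):
> > the first open should fail and leave nothing behind, but a driver that
> > touches the file while parsing hints leaves a zero-length file that then
> > trips the exclusive create.
> >
> >     #include <stdio.h>
> >     #include <unistd.h>
> >     #include <mpi.h>
> >
> >     int main(int argc, char **argv) {
> >         int err;
> >         MPI_File fh;
> >
> >         MPI_Init(&argc, &argv);
> >         unlink("stray.nc");   /* start from a clean slate */
> >
> >         /* read-only open of a nonexistent file: expected to fail */
> >         err = MPI_File_open(MPI_COMM_WORLD, "stray.nc", MPI_MODE_RDONLY,
> >                             MPI_INFO_NULL, &fh);
> >         if (err == MPI_SUCCESS) MPI_File_close(&fh);
> >
> >         /* with the buggy hint path, the open above created "stray.nc",
> >          * so this exclusive create fails when it should succeed */
> >         err = MPI_File_open(MPI_COMM_WORLD, "stray.nc",
> >                             MPI_MODE_CREATE | MPI_MODE_EXCL | MPI_MODE_WRONLY,
> >                             MPI_INFO_NULL, &fh);
> >         if (err != MPI_SUCCESS)
> >             printf("exclusive create failed -- was a stray file left behind?\n");
> >         else
> >             MPI_File_close(&fh);
> >
> >         MPI_Finalize();
> >         return 0;
> >     }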
> >
> > I fixed this by moving file manipulations out of the hint-parsing path
> > and into the open path (after we check permissions and flags).
> >
> > Relevant commit:
> > https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9
> >
> > See more details from Darshan, OpenMPI, and MPICH here:
> > - https://trac.mpich.org/projects/mpich/ticket/2261
> > - https://github.com/open-mpi/ompi/issues/158
> > - http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html
> >
> > ==rob
> >
> >
> > Thanks,
> > Craig
> >
> > On Sun, Sep 20, 2015 at 2:36 PM, Craig Tierney - NOAA Affiliate
> > <craig.tierney at noaa.gov> wrote:
> >
> >     Wei-keng,
> >
> >     Thanks for the test case.  Here is what I get using a set of
> >     compilers and MPI stacks.  I was expecting that mvapich2 1.8 and 2.1
> >     would behave differently.
> >
> >     What versions of MPI do you test internally?
> >
> >     Craig
> >
> >     Testing intel+impi
> >
> >     Currently Loaded Modules:
> >        1) newdefaults   2) intel/15.0.3.187   3) impi/5.1.1.109
> >
> >     Error at line 22: File does not exist, error stack:
> >     ADIOI_NFS_OPEN(69): File /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc does not exist
> >     Testing intel+mvapich2 2.1
> >
> >     Currently Loaded Modules:
> >        1) newdefaults   2) intel/15.0.3.187   3) mvapich2/2.1
> >
> >     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> >     Testing intel+mvapich2 1.8
> >
> >     Currently Loaded Modules:
> >        1) newdefaults   2) intel/15.0.3.187   3) mvapich2/1.8
> >
> >     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> >     Testing pgi+mvapich2 2.1
> >
> >     Currently Loaded Modules:
> >        1) newdefaults   2) pgi/15.3   3) mvapich2/2.1
> >
> >     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> >     Testing pgi+mvapich2 1.8
> >
> >     Currently Loaded Modules:
> >        1) newdefaults   2) pgi/15.3   3) mvapich2/1.8
> >
> >     file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> >
> >     Craig
> >
> >     On Sun, Sep 20, 2015 at 1:43 PM, Wei-keng Liao
> >     <wkliao at eecs.northwestern.edu> wrote:
> >
> >         In that case, it is likely that mvapich is not behaving correctly.
> >
> >         In PnetCDF, when NC_NOWRITE is used in a call to ncmpi_open,
> >         PnetCDF calls MPI_File_open with the open flag set to
> >         MPI_MODE_RDONLY. See
> >         http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322
> >
> >         Maybe test this with the simple MPI-IO program below.
> >         It prints error messages like
> >              Error at line 15: File does not exist, error stack:
> >              ADIOI_UFS_OPEN(69): File tooth-fairy.nc does not exist
> >
> >         But no file should be created.
> >
> >
> >         #include <stdio.h>
> >         #include <unistd.h> /* unlink() */
> >         #include <mpi.h>
> >
> >         int main(int argc, char **argv) {
> >              int err;
> >              MPI_File fh;
> >
> >              MPI_Init(&argc, &argv);
> >
> >              /* delete "tooth-fairy.nc" and ignore the error */
> >              unlink("tooth-fairy.nc");
> >
> >              err = MPI_File_open(MPI_COMM_WORLD, "tooth-fairy.nc",
> >                                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
> >              if (err != MPI_SUCCESS) {
> >                  int errorStringLen;
> >                  char errorString[MPI_MAX_ERROR_STRING];
> >                  MPI_Error_string(err, errorString, &errorStringLen);
> >                  printf("Error at line %d: %s\n",__LINE__, errorString);
> >              }
> >              else
> >                  MPI_File_close(&fh);
> >
> >              MPI_Finalize();
> >              return 0;
> >         }
> >
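> >         For comparison, a PnetCDF-level sketch of the same check is
> >         below (it expects NC_ENOENT, per the nc_test snippet quoted
> >         later in this thread, and uses access() to confirm that no
> >         stray file is left behind):
> >
> >         #include <stdio.h>
> >         #include <unistd.h>
> >         #include <mpi.h>
> >         #include <pnetcdf.h>
> >
> >         int main(int argc, char **argv) {
> >             int err, ncid;
> >
> >             MPI_Init(&argc, &argv);
> >             unlink("tooth-fairy.nc");
> >
> >             err = ncmpi_open(MPI_COMM_WORLD, "tooth-fairy.nc",
> >                              NC_NOWRITE, MPI_INFO_NULL, &ncid);
> >             if (err == NC_NOERR) {
> >                 printf("open of nonexistent file should have failed\n");
> >                 ncmpi_close(ncid);
> >             }
> >             else
> >                 printf("got expected error: %s\n", ncmpi_strerror(err));
> >
> >             /* no file should have been created as a side effect */
> >             if (access("tooth-fairy.nc", F_OK) == 0)
> >                 printf("a stray tooth-fairy.nc was created!\n");
> >
> >             MPI_Finalize();
> >             return 0;
> >         }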
> >
> >         Wei-keng
> >
> >         On Sep 20, 2015, at 1:51 PM, Craig Tierney - NOAA Affiliate wrote:
> >
> >          > Wei-keng,
> >          >
> >          > I always run distclean before I try to build the code.  The
> >          > first test that fails is nc_test.  The problem seems to be in
> >          > this test:
> >          >
> >          >     err = ncmpi_open(comm, "tooth-fairy.nc", NC_NOWRITE,
> >          >                      info, &ncid); /* should fail */
> >          >     IF (err == NC_NOERR)
> >          >         error("ncmpi_open of nonexistent file should have failed");
> >          >     IF (err != NC_ENOENT)
> >          >         error("ncmpi_open of nonexistent file should have returned NC_ENOENT");
> >          >     else {
> >          >         /* printf("Expected error message complaining: \"File
> >          >            tooth-fairy.nc does not exist\"\n"); */
> >          >         nok++;
> >          >     }
> >          >
> >          > A zero-length tooth-fairy.nc file is being created, and I
> >          > don't think that is supposed to happen.  That would mean the
> >          > NC_NOWRITE mode is not being honored by MPI-IO.  I will look at
> >          > this more tomorrow and try to craft a short example.
> >          >
> >          > Craig
> >          >
> >          > On Sun, Sep 20, 2015 at 10:23 AM, Wei-keng Liao
> >          > <wkliao at eecs.northwestern.edu> wrote:
> >          > Hi, Craig
> >          >
> >          > Your config.log looks fine to me.
> >          > Some of your error messages are supposed to report errors from
> >          > opening a non-existent file, but they report a different error
> >          > code, meaning the file does exist. I suspect it may be because
> >          > of leftover files.
> >          >
> >          > Could you do a clean rebuild with the following commands?
> >          >     % make -s distclean
> >          >     % ./configure --prefix=/apps/pnetcdf/1.6.1-intel-mvapich2
> >          >     % make -s -j8
> >          >     % make -s check
> >          >
> >          > If the problem persists, then it might be because of mvapich.
> >          >
> >          > Wei-keng
> >          >
> >
> >
> >
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >
>
>