Unable to pass all the tests with pnetcdf 1.6.1, Intel 15.0.3.048 and Mvapich2 2.1

Craig Tierney - NOAA Affiliate craig.tierney at noaa.gov
Tue Sep 22 18:15:41 CDT 2015


Rob,

This patch did not fix it.  The code hangs when going through the Lustre
ADIO.  I am going to assume it is because not all the ranks enter this
code path.  I don't have debugging enabled in ROMIO, so I can't see the
specifics.

From a quick printf, only rank 0 enters the
function ADIOI_LUSTRE_Open.  This would imply that the other ranks never
see this Barrier and everything hangs.  I can try to trace the code
backwards and see where this function is called.

Craig

On Tue, Sep 22, 2015 at 3:44 PM, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 09/22/2015 04:37 PM, Wei-keng Liao wrote:
>
>> Another way to do this is to have the root process do the setting and
>> getting, and then broadcast the values to all processes.
>>
>> If you just want the root process to do the setting, then
>> you need a barrier to make sure no non-root processes run ahead.
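
A minimal sketch of that root-sets-then-everyone-waits pattern
(illustrative only; the file name and hint values below are made up,
not taken from this thread):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* root alone creates the file with the requested striping hints */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "4");
        MPI_Info_set(info, "striping_unit", "1048576");
        MPI_File_open(MPI_COMM_SELF, "striped_file",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_File_close(&fh);
        MPI_Info_free(&info);
    }

    /* barrier so no non-root process opens the file before root is done */
    MPI_Barrier(MPI_COMM_WORLD);

    /* now everyone opens the already-striped file */
    MPI_File_open(MPI_COMM_WORLD, "striped_file", MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}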
>>
>
> darnit, you're right.
>
> Craig, does this patch fix things for you?
>
>
> ==rob
>
>
>> Wei-keng
>>
>> On Sep 22, 2015, at 4:17 PM, Rob Latham wrote:
>>
>>
>>>
>>> On 09/22/2015 04:10 PM, Wei-keng Liao wrote:
>>>
>>>> Hi, Craig
>>>>
>>>> From these outputs, I think it is most likely due to MPI-IO failing
>>>> to return the same file striping unit and factor values among all
>>>> MPI processes. I guess only the root process gets the correct values.
>>>> Attached is a short MPI program to test this theory.
>>>> Could you test it using at least 2 processes on Lustre?
>>>>
>>>> To compile:
>>>>     mpicc -o check_mpi_striping check_mpi_striping.c
>>>> To run:
>>>>     mpiexec -n 2 check_mpi_striping
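
The attached check_mpi_striping.c is not included in this archive; a
rough sketch of such a consistency check (the hint names
striping_factor and striping_unit and the test file name are
assumptions) could look like:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Rough sketch: every rank opens the same file, reads back the striping
 * hints MPI-IO reports, and compares them against rank 0's values. */
static int get_hint(MPI_Info info, const char *key) {
    int flag;
    char val[MPI_MAX_INFO_VAL + 1];
    MPI_Info_get(info, key, MPI_MAX_INFO_VAL, val, &flag);
    return flag ? atoi(val) : -1;
}

int main(int argc, char **argv) {
    int rank, mine[2], root[2];
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* file name is a placeholder; put it on the Lustre mount under test */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_get_info(fh, &info);

    mine[0] = get_hint(info, "striping_factor");
    mine[1] = get_hint(info, "striping_unit");

    root[0] = mine[0];
    root[1] = mine[1];
    MPI_Bcast(root, 2, MPI_INT, 0, MPI_COMM_WORLD);

    if (mine[0] != root[0] || mine[1] != root[1])
        printf("rank %d: striping_factor=%d striping_unit=%d "
               "differ from root's %d/%d\n",
               rank, mine[0], mine[1], root[0], root[1]);

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}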
>>>>
>>>>
>>>>
>>> what is supposed to happen is that rank 0 sets the striping factor
>>> according to the hints all processes requested (it is erroneous to specify
>>> different values for an MPI-IO hint on different processes).
>>>
>>> Rank 0 opens the file and sets the ioctls; then everyone calls
>>>
>>> ioctl(fd->fd_sys, LL_IOC_LOV_GETSTRIPE, (void *)lum);
>>>
>>> It sounds like perhaps I need to learn more about Lustre's rules for
>>> when LL_IOC_LOV_SETSTRIPE is visible to all processes.
>>>
>>> ==rob
>>>
>>> Wei-keng
>>>>
>>>>
>>>>
>>>> On Sep 22, 2015, at 2:34 PM, Craig Tierney - NOAA Affiliate wrote:
>>>>
>>>> Wei-keng,
>>>>>
>>>>> Here is the output from my run with PNETCDF_SAFE_MODE=1 on Lustre:
>>>>>
>>>>> [root at Jet:fe8 FLASH-IO]# mpiexec.hydra -env PNETCDF_SAFE_MODE 1 -np 4
>>>>> ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/d1//flash_io_test_
>>>>> Warning (inconsistent metadata): variable lrefine's begin
>>>>> (root=1048576, local=3072)
>>>>> Warning (inconsistent metadata): variable nodetype's begin
>>>>> (root=2097152, local=4608)
>>>>> Warning (inconsistent metadata): variable gid's begin (root=3145728,
>>>>> local=6144)
>>>>> Warning (inconsistent metadata): variable coordinates's begin
>>>>> (root=4194304, local=25600)
>>>>> Warning (inconsistent metadata): variable blocksize's begin
>>>>> (root=5242880, local=33792)
>>>>> Warning (inconsistent metadata): variable bndbox's begin
>>>>> (root=6291456, local=41984)
>>>>> Warning (inconsistent metadata): variable dens's begin (root=7340032,
>>>>> local=57856)
>>>>> Warning (inconsistent metadata): variable velx's begin (root=18874368,
>>>>> local=10641920)
>>>>> Warning (inconsistent metadata): variable lrefine's begin
>>>>> (root=1048576, local=3072)
>>>>> Warning (inconsistent metadata): variable nodetype's begin
>>>>> (root=2097152, local=4608)
>>>>> Warning (inconsistent metadata): variable gid's begin (root=3145728,
>>>>> local=6144)
>>>>> Warning (inconsistent metadata): variable coordinates's begin
>>>>> (root=4194304, local=25600)
>>>>> Warning (inconsistent metadata): variable blocksize's begin
>>>>> (root=5242880, local=33792)
>>>>> Warning (inconsistent metadata): variable bndbox's begin
>>>>> (root=6291456, local=41984)
>>>>> Warning (inconsistent metadata): variable dens's begin (root=7340032,
>>>>> local=57856)
>>>>> Warning (inconsistent metadata): variable velx's begin (root=18874368,
>>>>> local=10641920)
>>>>> Warning (inconsistent metadata): variable vely's begin (root=30408704,
>>>>> local=21225984)
>>>>> Warning (inconsistent metadata): variable velz's begin (root=41943040,
>>>>> local=31810048)
>>>>> Warning (inconsistent metadata): variable pres's begin (root=53477376,
>>>>> local=42394112)
>>>>> Warning (inconsistent metadata): variable ener's begin (root=65011712,
>>>>> local=52978176)
>>>>> Warning (inconsistent metadata): variable temp's begin (root=76546048,
>>>>> local=63562240)
>>>>> Warning (inconsistent metadata): variable gamc's begin (root=88080384,
>>>>> local=74146304)
>>>>> Warning (inconsistent metadata): variable game's begin (root=99614720,
>>>>> local=84730368)
>>>>> Warning (inconsistent metadata): variable enuc's begin
>>>>> (root=111149056, local=95314432)
>>>>> Warning (inconsistent metadata): variable gpot's begin
>>>>> (root=122683392, local=105898496)
>>>>> Warning (inconsistent metadata): variable f1__'s begin
>>>>> (root=134217728, local=116482560)
>>>>> Warning (inconsistent metadata): variable f2__'s begin
>>>>> (root=145752064, local=127066624)
>>>>> Warning (inconsistent metadata): variable f3__'s begin
>>>>> (root=157286400, local=137650688)
>>>>> Warning (inconsistent metadata): variable lrefine's begin
>>>>> (root=1048576, local=3072)
>>>>> Warning (inconsistent metadata): variable nodetype's begin
>>>>> (root=2097152, local=4608)
>>>>> Warning (inconsistent metadata): variable gid's begin (root=3145728,
>>>>> local=6144)
>>>>> Warning (inconsistent metadata): variable coordinates's begin
>>>>> (root=4194304, local=25600)
>>>>> Warning (inconsistent metadata): variable blocksize's begin
>>>>> (root=5242880, local=33792)
>>>>> Warning (inconsistent metadata): variable bndbox's begin
>>>>> (root=6291456, local=41984)
>>>>> Warning (inconsistent metadata): variable dens's begin (root=7340032,
>>>>> local=57856)
>>>>> Warning (inconsistent metadata): variable velx's begin (root=18874368,
>>>>> local=10641920)
>>>>> Warning (inconsistent metadata): variable vely's begin (root=30408704,
>>>>> local=21225984)
>>>>> Warning (inconsistent metadata): variable velz's begin (root=41943040,
>>>>> local=31810048)
>>>>> Warning (inconsistent metadata): variable pres's begin (root=53477376,
>>>>> local=42394112)
>>>>> Warning (inconsistent metadata): variable ener's begin (root=65011712,
>>>>> local=52978176)
>>>>> Warning (inconsistent metadata): variable temp's begin (root=76546048,
>>>>> local=63562240)
>>>>> Warning (inconsistent metadata): variable gamc's begin (root=88080384,
>>>>> local=74146304)
>>>>> Warning (inconsistent metadata): variable game's begin (root=99614720,
>>>>> local=84730368)
>>>>> Warning (inconsistent metadata): variable enuc's begin
>>>>> (root=111149056, local=95314432)
>>>>> Warning (inconsistent metadata): variable gpot's begin
>>>>> (root=122683392, local=105898496)
>>>>> Warning (inconsistent metadata): variable f1__'s begin
>>>>> (root=134217728, local=116482560)
>>>>> Warning (inconsistent metadata): variable f2__'s begin
>>>>> (root=145752064, local=127066624)
>>>>> Warning (inconsistent metadata): variable f3__'s begin
>>>>> (root=157286400, local=137650688)
>>>>> Warning (inconsistent metadata): variable f4__'s begin
>>>>> (root=168820736, local=148234752)
>>>>> Warning (inconsistent metadata): variable f5__'s begin
>>>>> (root=180355072, local=158818816)
>>>>> Warning (inconsistent metadata): variable f6__'s begin
>>>>> (root=191889408, local=169402880)
>>>>> Warning (inconsistent metadata): variable vely's begin (root=30408704,
>>>>> local=21225984)
>>>>> Warning (inconsistent metadata): variable velz's begin (root=41943040,
>>>>> local=31810048)
>>>>> Warning (inconsistent metadata): variable pres's begin (root=53477376,
>>>>> local=42394112)
>>>>> Warning (inconsistent metadata): variable ener's begin (root=65011712,
>>>>> local=52978176)
>>>>> Warning (inconsistent metadata): variable temp's begin (root=76546048,
>>>>> local=63562240)
>>>>> Warning (inconsistent metadata): variable gamc's begin (root=88080384,
>>>>> local=74146304)
>>>>> Warning (inconsistent metadata): variable game's begin (root=99614720,
>>>>> local=84730368)
>>>>> Warning (inconsistent metadata): variable enuc's begin
>>>>> (root=111149056, local=95314432)
>>>>> Warning (inconsistent metadata): variable gpot's begin
>>>>> (root=122683392, local=105898496)
>>>>> Warning (inconsistent metadata): variable f1__'s begin
>>>>> (root=134217728, local=116482560)
>>>>> Warning (inconsistent metadata): variable f2__'s begin
>>>>> (root=145752064, local=127066624)
>>>>> Warning (inconsistent metadata): variable f3__'s begin
>>>>> (root=157286400, local=137650688)
>>>>> Warning (inconsistent metadata): variable f4__'s begin
>>>>> (root=168820736, local=148234752)
>>>>> Warning (inconsistent metadata): variable f5__'s begin
>>>>> (root=180355072, local=158818816)
>>>>> Warning (inconsistent metadata): variable f6__'s begin
>>>>> (root=191889408, local=169402880)
>>>>> Warning (inconsistent metadata): variable f7__'s begin
>>>>> (root=203423744, local=179986944)
>>>>> Warning (inconsistent metadata): variable f8__'s begin
>>>>> (root=214958080, local=190571008)
>>>>> Warning (inconsistent metadata): variable f9__'s begin
>>>>> (root=226492416, local=201155072)
>>>>> Warning (inconsistent metadata): variable f10_'s begin
>>>>> (root=238026752, local=211739136)
>>>>> Warning (inconsistent metadata): variable f11_'s begin
>>>>> (root=249561088, local=222323200)
>>>>> Warning (inconsistent metadata): variable f12_'s begin
>>>>> (root=261095424, local=232907264)
>>>>> Warning (inconsistent metadata): variable f13_'s begin
>>>>> (root=272629760, local=243491328)
>>>>> Warning (inconsistent metadata): variable f4__'s begin
>>>>> (root=168820736, local=148234752)
>>>>> Warning (inconsistent metadata): variable f5__'s begin
>>>>> (root=180355072, local=158818816)
>>>>> Warning (inconsistent metadata): variable f6__'s begin
>>>>> (root=191889408, local=169402880)
>>>>> Warning (inconsistent metadata): variable f7__'s begin
>>>>> (root=203423744, local=179986944)
>>>>> Warning (inconsistent metadata): variable f8__'s begin
>>>>> (root=214958080, local=190571008)
>>>>> Warning (inconsistent metadata): variable f9__'s begin
>>>>> (root=226492416, local=201155072)
>>>>> Warning (inconsistent metadata): variable f10_'s begin
>>>>> (root=238026752, local=211739136)
>>>>> Warning (inconsistent metadata): variable f11_'s begin
>>>>> (root=249561088, local=222323200)
>>>>> Warning (inconsistent metadata): variable f12_'s begin
>>>>> (root=261095424, local=232907264)
>>>>> Warning (inconsistent metadata): variable f13_'s begin
>>>>> (root=272629760, local=243491328)
>>>>> Warning (inconsistent metadata): variable f7__'s begin
>>>>> (root=203423744, local=179986944)
>>>>> Warning (inconsistent metadata): variable f8__'s begin
>>>>> (root=214958080, local=190571008)
>>>>> Warning (inconsistent metadata): variable f9__'s begin
>>>>> (root=226492416, local=201155072)
>>>>> Warning (inconsistent metadata): variable f10_'s begin
>>>>> (root=238026752, local=211739136)
>>>>> Warning (inconsistent metadata): variable f11_'s begin
>>>>> (root=249561088, local=222323200)
>>>>> Warning (inconsistent metadata): variable f12_'s begin
>>>>> (root=261095424, local=232907264)
>>>>> Warning (inconsistent metadata): variable f13_'s begin
>>>>> (root=272629760, local=243491328)
>>>>> Here:        -250
>>>>> Here:        -262
>>>>> Here:        -262
>>>>> Here:        -262
>>>>> nfmpi_enddef
>>>>> File header is inconsistent among processes
>>>>> nfmpi_enddef
>>>>> (Internal error) beginning file offset of this variable is
>>>>> inconsistent among p
>>>>> r
>>>>> nfmpi_enddef
>>>>> (Internal error) beginning file offset of this variable is
>>>>> inconsistent among p
>>>>> r
>>>>> nfmpi_enddef
>>>>> (Internal error) beginning file offset of this variable is
>>>>> inconsistent among p
>>>>> r
>>>>> [cli_1]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
>>>>> [cli_0]: [cli_2]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
>>>>> [cli_3]: aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
>>>>> aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
>>>>>
>>>>> Craig
>>>>>
>>>>> On Mon, Sep 21, 2015 at 1:21 PM, Wei-keng Liao <
>>>>> wkliao at eecs.northwestern.edu> wrote:
>>>>>
>>>>> It is strange that the test failed for Lustre.
>>>>>
>>>>> The error message says some variables defined across MPI processes are
>>>>> not consistent.
>>>>> Could you run this benchmark with safe mode on? by setting the
>>>>> environment variable
>>>>> PNETCDF_SAFE_MODE to 1 before the run. This will print more error
>>>>> messages, such as
>>>>> which variables are inconsistent and at what offsets.
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Sep 21, 2015, at 1:31 PM, Craig Tierney - NOAA Affiliate wrote:
>>>>>
>>>>> Rob and Wei-keng,
>>>>>>
>>>>>> Thanks for your help on this problem.  Rob - The patch seems to work.
>>>>>> I had to hand-apply it, but now the pnetcdf tests (mostly) complete
>>>>>> successfully.  The FLASH-IO benchmark is failing when Lustre is used.
>>>>>> It completes successfully when Panasas is used.  The error code that is
>>>>>> returned by nfmpi_enddef is -262.  The description for this error is:
>>>>>>
>>>>>> #define NC_EMULTIDEFINE_VAR_BEGIN       (-262) /**< inconsistent
>>>>>> variable file begin offset (internal use) */
>>>>>>
>>>>>> [root at Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io
>>>>>> /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
>>>>>>   Here:           0
>>>>>>   Here:           0
>>>>>>   Here:           0
>>>>>>   Here:           0
>>>>>>   number of guards      :             4
>>>>>>   number of blocks      :            80
>>>>>>   number of variables   :            24
>>>>>>   checkpoint time       :            12.74  sec
>>>>>>          max header     :             0.88  sec
>>>>>>          max unknown    :            11.83  sec
>>>>>>          max close      :             0.53  sec
>>>>>>          I/O amount     :           242.30  MiB
>>>>>>   plot no corner        :             2.38  sec
>>>>>>          max header     :             0.59  sec
>>>>>>          max unknown    :             1.78  sec
>>>>>>          max close      :             0.22  sec
>>>>>>          I/O amount     :            20.22  MiB
>>>>>>   plot    corner        :             2.52  sec
>>>>>>          max header     :             0.81  sec
>>>>>>          max unknown    :             1.51  sec
>>>>>>          max close      :             0.96  sec
>>>>>>          I/O amount     :            24.25  MiB
>>>>>>   -------------------------------------------------------
>>>>>>   File base name        :
>>>>>> /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
>>>>>>     file striping count :             0
>>>>>>     file striping size  :     301346992     bytes
>>>>>>   Total I/O amount      :           286.78  MiB
>>>>>>   -------------------------------------------------------
>>>>>>   nproc    array size      exec (sec)   bandwidth (MiB/s)
>>>>>>      4    16 x  16 x  16     17.64       16.26
>>>>>>
>>>>>>
>>>>>> [root at Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io
>>>>>> /lfs2/jetmgmt/Craig.Tierney/lfs_flash_io_test_
>>>>>>   Here:        -262
>>>>>>   Here:        -262
>>>>>>   Here:        -262
>>>>>>   nfmpi_enddef
>>>>>>   (Internal error) beginning file offset of this variable is
>>>>>> inconsistent among p
>>>>>>   r
>>>>>>   nfmpi_enddef
>>>>>>   (Internal error) beginning file offset of this variable is
>>>>>> inconsistent among p
>>>>>>   r
>>>>>>   nfmpi_enddef
>>>>>>   (Internal error) beginning file offset of this variable is
>>>>>> inconsistent among p
>>>>>>   r
>>>>>>   Here:           0
>>>>>> [cli_1]: aborting job:
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
>>>>>> [cli_3]: [cli_2]: aborting job:
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
>>>>>> aborting job:
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
>>>>>>
>>>>>>
>>>>>> ===================================================================================
>>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>> =   PID 16702 RUNNING AT fe7
>>>>>> =   EXIT CODE: 255
>>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>
>>>>>> ===================================================================================
>>>>>>
>>>>>> Thanks,
>>>>>> Craig
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 21, 2015 at 8:30 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>>>>>>
>>>>>>
>>>>>> On 09/20/2015 03:44 PM, Craig Tierney - NOAA Affiliate wrote:
>>>>>> Wei-keng,
>>>>>>
>>>>>> I tried your test code on a different system, and I found it worked
>>>>>> with
>>>>>> Intel+mvapich2 (2.1rc1).  That system was using Panasas and I was
>>>>>> testing on Lustre.  I then tried Panasas on the original machine
>>>>>> (supports both Panasas and Lustre) and I got the correct behavior.
>>>>>>
>>>>>> So the problem is somehow related to Lustre.  We are using the 2.5.37.ddn
>>>>>> client.   Unless you have an obvious answer, I will open this with DDN
>>>>>> tomorrow.
>>>>>>
>>>>>>
>>>>>> Ah, bet I know why this is!
>>>>>>
>>>>>> the Lustre driver and (some versions of the) Panasas driver set their
>>>>>> fs-specific hints by opening the file, setting some ioctls, then continuing
>>>>>> on without deleting the file.
>>>>>>
>>>>>> In the common case, when we expect the file to show up, no one
>>>>>> notices or cares, but with MPI_MODE_EXCL or other restrictive flags the
>>>>>> file gets created when we did not expect it to -- and that's part of the
>>>>>> reason this bug lived on so long.
>>>>>>
>>>>>> I fixed this by moving file manipulations out of the hint parsing
>>>>>> path and into the open path (after we check permissions and flags).
>>>>>>
>>>>>> Relevant commit:
>>>>>> https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9
>>>>>>
>>>>>> See more details from Darshan, OpenMPI, and MPICH here:
>>>>>> - https://trac.mpich.org/projects/mpich/ticket/2261
>>>>>> - https://github.com/open-mpi/ompi/issues/158
>>>>>> - http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html
>>>>>>
>>>>>> ==rob
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Craig
>>>>>>
>>>>>> On Sun, Sep 20, 2015 at 2:36 PM, Craig Tierney - NOAA Affiliate
>>>>>> <craig.tierney at noaa.gov> wrote:
>>>>>>
>>>>>>      Wei-keng,
>>>>>>
>>>>>>      Thanks for the test case.  Here is what I get using a set of
>>>>>>      compilers and MPI stacks.  I was expecting that mvapich2 1.8 and
>>>>>> 2.1
>>>>>>      would behave differently.
>>>>>>
>>>>>>      What versions of MPI do you test internally?
>>>>>>
>>>>>>      Craig
>>>>>>
>>>>>>      Testing intel+impi
>>>>>>
>>>>>>      Currently Loaded Modules:
>>>>>>         1) newdefaults   2) intel/15.0.3.187   3) impi/5.1.1.109
>>>>>>
>>>>>>      Error at line 22: File does not exist, error stack:
>>>>>>      ADIOI_NFS_OPEN(69): File /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc does not exist
>>>>>>      Testing intel+mvapich2 2.1
>>>>>>
>>>>>>      Currently Loaded Modules:
>>>>>>         1) newdefaults   2) intel/15.0.3.187   3) mvapich2/2.1
>>>>>>
>>>>>>      file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>>>>>>      Testing intel+mvapich2 1.8
>>>>>>
>>>>>>      Currently Loaded Modules:
>>>>>>         1) newdefaults   2) intel/15.0.3.187   3) mvapich2/1.8
>>>>>>
>>>>>>      file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>>>>>>      Testing pgi+mvapich2 2.1
>>>>>>
>>>>>>      Currently Loaded Modules:
>>>>>>         1) newdefaults   2) pgi/15.3   3) mvapich2/2.1
>>>>>>
>>>>>>      file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>>>>>>      Testing pgi+mvapich2 1.8
>>>>>>
>>>>>>      Currently Loaded Modules:
>>>>>>         1) newdefaults   2) pgi/15.3   3) mvapich2/1.8
>>>>>>
>>>>>>      file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
>>>>>>
>>>>>>      Craig
>>>>>>
>>>>>>      On Sun, Sep 20, 2015 at 1:43 PM, Wei-keng Liao
>>>>>>      <wkliao at eecs.northwestern.edu> wrote:
>>>>>>
>>>>>>          In that case, it is likely mvapich does not perform
>>>>>> correctly.
>>>>>>
>>>>>>          In PnetCDF, when NC_NOWRITE is used in a call to ncmpi_open,
>>>>>>          PnetCDF calls a MPI_File_open with the open flag set to
>>>>>>          MPI_MODE_RDONLY. See
>>>>>>
>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322
>>>>>>
>>>>>>          Maybe test this with a simple MPI-IO program below.
>>>>>>          It prints error messages like
>>>>>>               Error at line 15: File does not exist, error stack:
>>>>>>               ADIOI_UFS_OPEN(69): File tooth-fairy.nc does not exist
>>>>>>
>>>>>>          But, no file should be created.
>>>>>>
>>>>>>
>>>>>>          #include <stdio.h>
>>>>>>          #include <unistd.h> /* unlink() */
>>>>>>          #include <mpi.h>
>>>>>>
>>>>>>          int main(int argc, char **argv) {
>>>>>>               int err;
>>>>>>               MPI_File fh;
>>>>>>
>>>>>>               MPI_Init(&argc, &argv);
>>>>>>
>>>>>>               /* delete "tooth-fairy.nc" and ignore the error */
>>>>>>               unlink("tooth-fairy.nc");
>>>>>>
>>>>>>               err = MPI_File_open(MPI_COMM_WORLD, "tooth-fairy.nc",
>>>>>>                                   MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
>>>>>>               if (err != MPI_SUCCESS) {
>>>>>>                   int errorStringLen;
>>>>>>                   char errorString[MPI_MAX_ERROR_STRING];
>>>>>>                   MPI_Error_string(err, errorString, &errorStringLen);
>>>>>>                   printf("Error at line %d: %s\n",__LINE__,
>>>>>> errorString);
>>>>>>               }
>>>>>>               else
>>>>>>                   MPI_File_close(&fh);
>>>>>>
>>>>>>               MPI_Finalize();
>>>>>>               return 0;
>>>>>>          }
>>>>>>
>>>>>>
>>>>>>          Wei-keng
>>>>>>
>>>>>>          On Sep 20, 2015, at 1:51 PM, Craig Tierney - NOAA Affiliate
>>>>>> wrote:
>>>>>>
>>>>>>           > Wei-keng,
>>>>>>           >
>>>>>>           > I always run distclean before I try to build the code.  The
>>>>>>           > first test failing is nc_test.  The problem seems to be in
>>>>>>           > this test:
>>>>>>           >
>>>>>>           >    err = ncmpi_open(comm, "tooth-fairy.nc", NC_NOWRITE,
>>>>>>           >                     info, &ncid); /* should fail */
>>>>>>           >     IF (err == NC_NOERR)
>>>>>>           >         error("ncmpi_open of nonexistent file should have failed");
>>>>>>           >     IF (err != NC_ENOENT)
>>>>>>           >         error("ncmpi_open of nonexistent file should have returned NC_ENOENT");
>>>>>>           >     else {
>>>>>>           >         /* printf("Expected error message complaining: \"File tooth-fairy.nc does not exist\"\n"); */
>>>>>>           >         nok++;
>>>>>>           >     }
>>>>>>           >
>>>>>>           > A zero-length tooth-fairy.nc file is being created, and I
>>>>>>           > don't think that is supposed to happen.  That would mean that
>>>>>>           > the mode NC_NOWRITE is not being honored by MPI-IO.  I will
>>>>>>           > look at this more tomorrow and try to craft a short example.
>>>>>>           >
>>>>>>           > Craig
>>>>>>           >
>>>>>>           > On Sun, Sep 20, 2015 at 10:23 AM, Wei-keng Liao
>>>>>>           > <wkliao at eecs.northwestern.edu> wrote:
>>>>>>           > Hi, Craig
>>>>>>           >
>>>>>>           > Your config.log looks fine to me.
>>>>>>           > Some of your error messages are supposed to report errors of
>>>>>>           > opening a non-existing file, but report a different error code,
>>>>>>           > meaning the file does exist.  I suspect it may be because of
>>>>>>           > residue files.
>>>>>>           >
>>>>>>           > Could you do a clean rebuild with the following commands?
>>>>>>           >     % make -s distclean
>>>>>>           >     % ./configure --prefix=/apps/pnetcdf/1.6.1-intel-mvapich2
>>>>>>           >     % make -s -j8
>>>>>>           >     % make -s check
>>>>>>           >
>>>>>>           > If the problem persists, then it might be because mvapich.
>>>>>>           >
>>>>>>           > Wei-keng
>>>>>>           >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Rob Latham
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Lab, IL USA
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>>
>>
>>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>