Unable to pass all the tests with pnetcdf 1.6.1, Intel 15.0.3.048 and Mvapich2 2.1
Craig Tierney - NOAA Affiliate
craig.tierney at noaa.gov
Thu Sep 24 15:08:31 CDT 2015
Wei-keng,
Sorry I didn't get back to you on this. I will give your code a test with
my current MPI stack. If that doesn't turn up anything, I will try building
the latest nightly build of MPICH. After that, I will start digging into the
MPI-IO code and try to track down the issue.
Thanks,
Craig
On Tue, Sep 22, 2015 at 6:54 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> Hi, Craig
>
> I have to admit I ran out of ideas.
> Let me explain my suspicion about a possible fault involving inconsistent
> striping hints.
>
> One of the error messages shown here:
> Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
>
> says that rank 0 calculated variable lrefine's starting file offset as
> 1048576 while another process calculated 3072. If Lustre's file striping
> unit is known, PnetCDF aligns a variable's starting offset to the striping
> unit; if the striping unit is not available, PnetCDF falls back to aligning
> to 512 bytes. (Note that 3072 is a multiple of 512 but not of 1048576,
> which is consistent with the fallback being taken on the non-root ranks.)
> So, in your case, the file stripe unit is 1048576, meaning rank 0 did get
> the correct value from the MPI-IO hint, but the other process did not.
>
> PnetCDF calls MPI_Info_get() to obtain the striping_unit value and assumes
> all processes get the same value returned from the same MPI call.
>
> Do you have MPICH installed on the same machine? If this also happens with
> MPICH, then PnetCDF is at fault. If it happens only with mvapich, then
> mvapich is the culprit.
>
> I wonder if you would like to try another test program that is in PnetCDF
> (attached).
>
> Wei-keng
>
>
>
>
> On Sep 22, 2015, at 5:30 PM, Craig Tierney - NOAA Affiliate wrote:
>
> > Wei-keng,
> >
> > I wasn't able to trigger a problem. Here is the script I ran around
> > your test case:
> >
> > #!/bin/bash --login
> >
> > module load newdefaults
> > module load intel
> > export PATH=/home/admin/software/apps/mvapich2/2.1-intel/bin/:${PATH}
> >
> > PDIR=/lfs2/jetmgmt/Craig.Tierney/test
> >
> > if [ ! -d $PDIR/ ]; then
> >     mkdir $PDIR
> > fi
> >
> > for s in 1 4; do
> >     if [ ! -d $PDIR/d$s ]; then
> >         mkdir $PDIR/d$s
> >     fi
> >     lfs setstripe -c $s $PDIR/d$s
> >     lfs getstripe $PDIR/d$s
> >
> >     rm -f $PDIR/d$s/bigfile
> >     dd if=/dev/zero of=$PDIR/d$s/bigfile bs=1024k count=1
> >     lfs getstripe $PDIR/d$s/bigfile
> >
> >     echo "Checking d$s"
> >     mpiexec.hydra -np 2 ./check_mpi_striping $PDIR/d$s/bigfile
> > done
> >
> > Here are the results:
> >
> > $ ./doit
> > /lfs2/jetmgmt/Craig.Tierney/test/d1
> > stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
> > /lfs2/jetmgmt/Craig.Tierney/test/d1/bigfile
> > lmm_stripe_count: 1
> > lmm_stripe_size: 1048576
> > lmm_pattern: 1
> > lmm_layout_gen: 0
> > lmm_stripe_offset: 4
> > obdidx objid objid group
> > 4 15487258 0xec511a 0
> >
> > 1+0 records in
> > 1+0 records out
> > 1048576 bytes (1.0 MB) copied, 0.00228375 s, 459 MB/s
> > /lfs2/jetmgmt/Craig.Tierney/test/d1/bigfile
> > lmm_stripe_count: 1
> > lmm_stripe_size: 1048576
> > lmm_pattern: 1
> > lmm_layout_gen: 0
> > lmm_stripe_offset: 7
> > obdidx objid objid group
> > 7 15421630 0xeb50be 0
> >
> > Checking d1
> > Success: striping_unit=1048576 striping_factor=1
> > /lfs2/jetmgmt/Craig.Tierney/test/d4
> > stripe_count: 4 stripe_size: 1048576 stripe_offset: -1
> > /lfs2/jetmgmt/Craig.Tierney/test/d4/bigfile
> > lmm_stripe_count: 4
> > lmm_stripe_size: 1048576
> > lmm_pattern: 1
> > lmm_layout_gen: 0
> > lmm_stripe_offset: 17
> > obdidx objid objid group
> > 17 15361627 0xea665b 0
> > 42 15439375 0xeb960f 0
> > 0 15384104 0xeabe28 0
> > 2 15522060 0xecd90c 0
> >
> > 1+0 records in
> > 1+0 records out
> > 1048576 bytes (1.0 MB) copied, 0.00210304 s, 499 MB/s
> > /lfs2/jetmgmt/Craig.Tierney/test/d4/bigfile
> > lmm_stripe_count: 4
> > lmm_stripe_size: 1048576
> > lmm_pattern: 1
> > lmm_layout_gen: 0
> > lmm_stripe_offset: 12
> > obdidx objid objid group
> > 12 15345301 0xea2695 0
> > 37 15646009 0xeebd39 0
> > 41 15695216 0xef7d70 0
> > 18 15500412 0xec847c 0
> >
> > Checking d4
> > Success: striping_unit=1048576 striping_factor=4
> >
> >
> > Craig
> >
> >
> >
> > On Tue, Sep 22, 2015 at 3:10 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > Hi, Craig
> >
> > From these outputs, I think it is most likely that MPI-IO fails
> > to return the same file striping unit and factor values on all
> > MPI processes. I guess only the root process gets the correct values.
> > Attached is a short MPI program to test this theory.
> > Could you test it using at least 2 processes on Lustre?
> >
> > To compile:
> > mpicc -o check_mpi_striping check_mpi_striping.c
> > To run:
> > mpiexec -n 2 check_mpi_striping
> >
> >
> > Wei-keng
> >
> >
> >
> >
> > On Sep 22, 2015, at 2:34 PM, Craig Tierney - NOAA Affiliate wrote:
> >
> > > Wei-keng,
> > >
> > > Here is the output from my run with PNETCDF_SAFE_MODE=1 on Lustre:
> > >
> > > [root@Jet:fe8 FLASH-IO]# mpiexec.hydra -env PNETCDF_SAFE_MODE 1 -np 4 ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/d1//flash_io_test_
> > > Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
> > > Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
> > > Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
> > > Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
> > > Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
> > > Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
> > > Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
> > > Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
> > > Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
> > > Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
> > > Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
> > > Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
> > > Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
> > > Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
> > > Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
> > > Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
> > > Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
> > > Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
> > > Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
> > > Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
> > > Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
> > > Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
> > > Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
> > > Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
> > > Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
> > > Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
> > > Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
> > > Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
> > > Warning (inconsistent metadata): variable lrefine's begin (root=1048576, local=3072)
> > > Warning (inconsistent metadata): variable nodetype's begin (root=2097152, local=4608)
> > > Warning (inconsistent metadata): variable gid's begin (root=3145728, local=6144)
> > > Warning (inconsistent metadata): variable coordinates's begin (root=4194304, local=25600)
> > > Warning (inconsistent metadata): variable blocksize's begin (root=5242880, local=33792)
> > > Warning (inconsistent metadata): variable bndbox's begin (root=6291456, local=41984)
> > > Warning (inconsistent metadata): variable dens's begin (root=7340032, local=57856)
> > > Warning (inconsistent metadata): variable velx's begin (root=18874368, local=10641920)
> > > Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
> > > Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
> > > Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
> > > Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
> > > Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
> > > Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
> > > Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
> > > Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
> > > Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
> > > Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
> > > Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
> > > Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
> > > Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
> > > Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
> > > Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
> > > Warning (inconsistent metadata): variable vely's begin (root=30408704, local=21225984)
> > > Warning (inconsistent metadata): variable velz's begin (root=41943040, local=31810048)
> > > Warning (inconsistent metadata): variable pres's begin (root=53477376, local=42394112)
> > > Warning (inconsistent metadata): variable ener's begin (root=65011712, local=52978176)
> > > Warning (inconsistent metadata): variable temp's begin (root=76546048, local=63562240)
> > > Warning (inconsistent metadata): variable gamc's begin (root=88080384, local=74146304)
> > > Warning (inconsistent metadata): variable game's begin (root=99614720, local=84730368)
> > > Warning (inconsistent metadata): variable enuc's begin (root=111149056, local=95314432)
> > > Warning (inconsistent metadata): variable gpot's begin (root=122683392, local=105898496)
> > > Warning (inconsistent metadata): variable f1__'s begin (root=134217728, local=116482560)
> > > Warning (inconsistent metadata): variable f2__'s begin (root=145752064, local=127066624)
> > > Warning (inconsistent metadata): variable f3__'s begin (root=157286400, local=137650688)
> > > Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
> > > Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
> > > Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
> > > Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
> > > Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
> > > Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
> > > Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
> > > Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
> > > Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
> > > Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
> > > Warning (inconsistent metadata): variable f4__'s begin (root=168820736, local=148234752)
> > > Warning (inconsistent metadata): variable f5__'s begin (root=180355072, local=158818816)
> > > Warning (inconsistent metadata): variable f6__'s begin (root=191889408, local=169402880)
> > > Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
> > > Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
> > > Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
> > > Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
> > > Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
> > > Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
> > > Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
> > > Warning (inconsistent metadata): variable f7__'s begin (root=203423744, local=179986944)
> > > Warning (inconsistent metadata): variable f8__'s begin (root=214958080, local=190571008)
> > > Warning (inconsistent metadata): variable f9__'s begin (root=226492416, local=201155072)
> > > Warning (inconsistent metadata): variable f10_'s begin (root=238026752, local=211739136)
> > > Warning (inconsistent metadata): variable f11_'s begin (root=249561088, local=222323200)
> > > Warning (inconsistent metadata): variable f12_'s begin (root=261095424, local=232907264)
> > > Warning (inconsistent metadata): variable f13_'s begin (root=272629760, local=243491328)
> > > Here: -250
> > > Here: -262
> > > Here: -262
> > > Here: -262
> > > nfmpi_enddefFile header is inconsistent among processes
> > > nfmpi_enddef
> > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > nfmpi_enddef
> > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > nfmpi_enddef
> > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > [cli_1]: aborting job:
> > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
> > > [cli_0]: [cli_2]: aborting job:
> > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
> > > [cli_3]: aborting job:
> > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
> > > aborting job:
> > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
> > >
> > > Craig
> > >
> > > On Mon, Sep 21, 2015 at 1:21 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > >
> > > It is strange that the test failed on Lustre.
> > >
> > > The error message says some variables defined across MPI processes are
> > > not consistent. Could you run this benchmark with safe mode on, by
> > > setting the environment variable PNETCDF_SAFE_MODE to 1 before the run?
> > > Safe mode prints more error messages, such as which variables are
> > > inconsistent and at what offsets.
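> > >
> > > For example (the output path here is just a placeholder; with the
> > > hydra launcher the variable can also be passed via -env):
> > >
> > >     export PNETCDF_SAFE_MODE=1
> > >     mpiexec -n 4 ./flash_benchmark_io /path/to/flash_io_test_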
> > >
> > >
> > > Wei-keng
> > >
> > > On Sep 21, 2015, at 1:31 PM, Craig Tierney - NOAA Affiliate wrote:
> > >
> > > > Rob and Wei-keng,
> > > >
> > > > Thanks for your help on this problem. Rob - the patch seems to
> > > > work. I had to hand-apply it, but now the pnetcdf tests (mostly)
> > > > complete successfully. The FLASH-IO benchmark fails when Lustre is
> > > > used and completes successfully when Panasas is used. The error code
> > > > returned by nfmpi_enddef is -262. The description of this error is:
> > > >
> > > > #define NC_EMULTIDEFINE_VAR_BEGIN (-262) /**< inconsistent variable file begin offset (internal use) */
> > > >
> > > > [root@Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
> > > > Here: 0
> > > > Here: 0
> > > > Here: 0
> > > > Here: 0
> > > > number of guards : 4
> > > > number of blocks : 80
> > > > number of variables : 24
> > > > checkpoint time : 12.74 sec
> > > > max header : 0.88 sec
> > > > max unknown : 11.83 sec
> > > > max close : 0.53 sec
> > > > I/O amount : 242.30 MiB
> > > > plot no corner : 2.38 sec
> > > > max header : 0.59 sec
> > > > max unknown : 1.78 sec
> > > > max close : 0.22 sec
> > > > I/O amount : 20.22 MiB
> > > > plot corner : 2.52 sec
> > > > max header : 0.81 sec
> > > > max unknown : 1.51 sec
> > > > max close : 0.96 sec
> > > > I/O amount : 24.25 MiB
> > > > -------------------------------------------------------
> > > > File base name      : /pan2/jetmgmt/Craig.Tierney/pan_flash_io_test_
> > > > file striping count : 0
> > > > file striping size : 301346992 bytes
> > > > Total I/O amount : 286.78 MiB
> > > > -------------------------------------------------------
> > > > nproc array size exec (sec) bandwidth (MiB/s)
> > > > 4 16 x 16 x 16 17.64 16.26
> > > >
> > > >
> > > > [root@Jet:fe7 FLASH-IO]# mpiexec.hydra -n 4 ./flash_benchmark_io /lfs2/jetmgmt/Craig.Tierney/lfs_flash_io_test_
> > > > Here: -262
> > > > Here: -262
> > > > Here: -262
> > > > nfmpi_enddef
> > > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > > nfmpi_enddef
> > > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > > nfmpi_enddef
> > > > (Internal error) beginning file offset of this variable is inconsistent among pr
> > > > Here: 0
> > > > [cli_1]: aborting job:
> > > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
> > > > [cli_3]: [cli_2]: aborting job:
> > > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
> > > > aborting job:
> > > > application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
> > > >
> > > >
> > > > ===================================================================================
> > > > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > > > = PID 16702 RUNNING AT fe7
> > > > = EXIT CODE: 255
> > > > = CLEANING UP REMAINING PROCESSES
> > > > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > > >
> > > > ===================================================================================
> > > >
> > > > Thanks,
> > > > Craig
> > > >
> > > >
> > > > On Mon, Sep 21, 2015 at 8:30 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> > > >
> > > >
> > > > On 09/20/2015 03:44 PM, Craig Tierney - NOAA Affiliate wrote:
> > > > Wei-keng,
> > > >
> > > > I tried your test code on a different system, and I found it worked
> > > > with Intel+mvapich2 (2.1rc1). That system was using Panasas and I was
> > > > testing on Lustre. I then tried Panasas on the original machine
> > > > (supports both Panasas and Lustre) and I got the correct behavior.
> > > >
> > > > So the problem is somehow related to Lustre. We are using the
> > > > 2.5.37.ddn client. Unless you have an obvious answer, I will open
> > > > this with DDN tomorrow.
> > > >
> > > >
> > > > Ah, bet I know why this is!
> > > >
> > > > The Lustre driver and (some versions of) the Panasas driver set their
> > > > fs-specific hints by opening the file, setting some ioctls, then
> > > > continuing on without deleting the file.
> > > >
> > > > In the common case, when we expect the file to show up, no one notices
> > > > or cares; but with MPI_MODE_EXCL or some other restrictive flags, the
> > > > file gets created when we did not expect it to -- and that's part of
> > > > the reason this bug lived on so long.
> > > >
> > > > I fixed this by moving file manipulations out of the hint parsing path
> > > > and into the open path (after we check permissions and flags).
> > > >
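> > > > Schematically, the pre-fix control flow looked something like the
> > > > sketch below (a hypothetical illustration, not the actual ROMIO
> > > > source; the function name is made up):
> > > >
> > > > #include <fcntl.h>  /* open(), O_CREAT, O_RDWR */
> > > > #include <unistd.h> /* close() */
> > > >
> > > > /* hint-parsing path, which used to run before the open-mode
> > > >  * flags were enforced */
> > > > void set_fs_striping_hint(const char *filename)
> > > > {
> > > >     /* BUG: O_CREAT creates the file as a side effect, even when the
> > > >      * user passed MPI_MODE_RDONLY or MPI_MODE_EXCL and the file was
> > > >      * never supposed to exist */
> > > >     int fd = open(filename, O_CREAT | O_RDWR, 0644);
> > > >     if (fd >= 0) {
> > > >         /* apply striping via an fs-specific ioctl(), then... */
> > > >         close(fd);  /* ...return without unlinking the new file */
> > > >     }
> > > > }
> > > >
> > > > Moving that manipulation into the open path means it runs only after
> > > > the permission and flag checks decide the file really should exist.
> > > >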
> > > > Relevant commit:
> > > > https://trac.mpich.org/projects/mpich/changeset/92f1c69f0de87f9
> > > >
> > > > See more details from Darshan, OpenMPI, and MPICH here:
> > > > - https://trac.mpich.org/projects/mpich/ticket/2261
> > > > - https://github.com/open-mpi/ompi/issues/158
> > > > - http://lists.mcs.anl.gov/pipermail/darshan-users/2015-February/000256.html
> > > >
> > > > ==rob
> > > >
> > > >
> > > > Thanks,
> > > > Craig
> > > >
> > > > On Sun, Sep 20, 2015 at 2:36 PM, Craig Tierney - NOAA Affiliate
> > > > <craig.tierney at noaa.gov> wrote:
> > > >
> > > > Wei-keng,
> > > >
> > > > Thanks for the test case. Here is what I get using a set of
> > > > compilers and MPI stacks. I was expecting that mvapich2 1.8 and 2.1
> > > > would behave differently.
> > > >
> > > > What versions of MPI do you test internally?
> > > >
> > > > Craig
> > > >
> > > > Testing intel+impi
> > > >
> > > > Currently Loaded Modules:
> > > > 1) newdefaults 2) intel/15.0.3.187 3) impi/5.1.1.109
> > > >
> > > > Error at line 22: File does not exist, error stack:
> > > > ADIOI_NFS_OPEN(69): File /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc does not exist
> > > > Testing intel+mvapich2 2.1
> > > >
> > > > Currently Loaded Modules:
> > > > 1) newdefaults 2) intel/15.0.3.187 3) mvapich2/2.1
> > > >
> > > > file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> > > > Testing intel+mvapich2 1.8
> > > >
> > > > Currently Loaded Modules:
> > > > 1) newdefaults 2) intel/15.0.3.187 3) mvapich2/1.8
> > > >
> > > > file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> > > > Testing pgi+mvapich2 2.1
> > > >
> > > > Currently Loaded Modules:
> > > > 1) newdefaults 2) pgi/15.3 3) mvapich2/2.1
> > > >
> > > > file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> > > > Testing pgi+mvapich2 1.8
> > > >
> > > > Currently Loaded Modules:
> > > > 1) newdefaults 2) pgi/15.3 3) mvapich2/1.8
> > > >
> > > > file was opened: /lfs3/jetmgmt/Craig.Tierney/tooth-fairy.nc
> > > >
> > > > Craig
> > > >
> > > > On Sun, Sep 20, 2015 at 1:43 PM, Wei-keng Liao
> > > > <wkliao at eecs.northwestern.edu> wrote:
> > > >
> > > > In that case, it is likely that mvapich does not perform correctly.
> > > >
> > > > In PnetCDF, when NC_NOWRITE is used in a call to ncmpi_open,
> > > > PnetCDF calls MPI_File_open with the open flag set to
> > > > MPI_MODE_RDONLY. See
> > > > http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/tags/v1-6-1/src/lib/mpincio.c#L322
> > > >
> > > > Maybe test this with the simple MPI-IO program below.
> > > > It prints error messages like
> > > >     Error at line 15: File does not exist, error stack:
> > > >     ADIOI_UFS_OPEN(69): File tooth-fairy.nc does not exist
> > > >
> > > > but no file should be created.
> > > >
> > > >
> > > > #include <stdio.h>
> > > > #include <unistd.h> /* unlink() */
> > > > #include <mpi.h>
> > > >
> > > > int main(int argc, char **argv) {
> > > >     int err;
> > > >     MPI_File fh;
> > > >
> > > >     MPI_Init(&argc, &argv);
> > > >
> > > >     /* delete "tooth-fairy.nc" and ignore the error */
> > > >     unlink("tooth-fairy.nc");
> > > >
> > > >     err = MPI_File_open(MPI_COMM_WORLD, "tooth-fairy.nc",
> > > >                         MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
> > > >     if (err != MPI_SUCCESS) {
> > > >         int errorStringLen;
> > > >         char errorString[MPI_MAX_ERROR_STRING];
> > > >         MPI_Error_string(err, errorString, &errorStringLen);
> > > >         printf("Error at line %d: %s\n", __LINE__, errorString);
> > > >     }
> > > >     else
> > > >         MPI_File_close(&fh);
> > > >
> > > >     MPI_Finalize();
> > > >     return 0;
> > > > }
> > > >
> > > >
> > > > Wei-keng
> > > >
> > > > On Sep 20, 2015, at 1:51 PM, Craig Tierney - NOAA Affiliate wrote:
> > > >
> > > > > Wei-keng,
> > > > >
> > > > > I always run distclean before I try to build the code. The first
> > > > > failing test is nc_test. The problem seems to be in this test:
> > > > >
> > > > > err = ncmpi_open(comm, "tooth-fairy.nc", NC_NOWRITE, info, &ncid); /* should fail */
> > > > > IF (err == NC_NOERR)
> > > > >     error("ncmpi_open of nonexistent file should have failed");
> > > > > IF (err != NC_ENOENT)
> > > > >     error("ncmpi_open of nonexistent file should have returned NC_ENOENT");
> > > > > else {
> > > > >     /* printf("Expected error message complaining: \"File tooth-fairy.nc does not exist\"\n"); */
> > > > >     nok++;
> > > > > }
> > > > >
> > > > > A zero-length tooth-fairy.nc file is being created, and I don't
> > > > > think that is supposed to happen. That would mean that the mode
> > > > > NC_NOWRITE is not being honored by MPI-IO. I will look at this
> > > > > more tomorrow and try to craft a short example.
> > > > >
> > > > > Craig
> > > > >
> > > > > On Sun, Sep 20, 2015 at 10:23 AM, Wei-keng Liao
> > > > > <wkliao at eecs.northwestern.edu> wrote:
> > > > > Hi, Craig
> > > > >
> > > > > Your config.log looks fine to me.
> > > > > Some of your error messages are supposed to report the error of
> > > > > opening a non-existent file, but they report a different error
> > > > > code, meaning the file does exist. I suspect it may be because of
> > > > > leftover files.
> > > > >
> > > > > Could you do a clean rebuild with the following commands?
> > > > > % make -s distclean
> > > > > % ./configure --prefix=/apps/pnetcdf/1.6.1-intel-mvapich2
> > > > > % make -s -j8
> > > > > % make -s check
> > > > >
> > > > > If the problem persists, then it might be because of mvapich.
> > > > >
> > > > > Wei-keng
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Rob Latham
> > > > Mathematics and Computer Science Division
> > > > Argonne National Lab, IL USA
> > > >
> > >
> > >
> >
> >
> >
>
>
>