File header is inconsistent among processes

Jim Edwards jedwards at ucar.edu
Tue Sep 30 12:31:23 CDT 2014


Hi Sean,

I've had a similar problem in models that I've worked with. You need to make
WRFU_TimeIntervalGet() return a consistent time on every task, with something
like:

if(mytask==0)
   interval = get_time()
endif
MPI_BCAST(interval, comm)
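
A more complete sketch of the same idea in Fortran with MPI might look like
the following. It assumes the interval is carried as a default integer number
of seconds. The names mytask, comm and get_time are the placeholders from the
snippet above; get_time here is a hypothetical stand-in for the
WRFU_TimeIntervalGet() call, not a WRF routine.

```fortran
program bcast_interval
  use mpi
  implicit none
  integer :: comm, mytask, ierr
  integer :: seconds

  call MPI_Init(ierr)
  comm = MPI_COMM_WORLD
  call MPI_Comm_rank(comm, mytask, ierr)

  ! Only task 0 queries the timer; other tasks hold a placeholder value.
  seconds = 0
  if (mytask == 0) then
     seconds = get_time()
  end if

  ! Every task receives task 0's value, so an attribute written from
  ! this number is identical on all processes.
  call MPI_Bcast(seconds, 1, MPI_INTEGER, 0, comm, ierr)

  print '(a,i0,a,i0)', 'task ', mytask, ' seconds = ', seconds

  call MPI_Finalize(ierr)

contains

  ! Hypothetical stand-in for the model's timer query (assumption).
  integer function get_time()
    integer :: ticks, ticks_per_sec
    call system_clock(ticks, ticks_per_sec)
    get_time = ticks / max(ticks_per_sec, 1)
  end function get_time

end program bcast_interval
```

Run on several tasks, each task should print the same value. In WRF the
broadcast would presumably go between the WRFU_TimeIntervalGet() calls and the
wrf_put_dom_ti_integer() call quoted below.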


On Tue, Sep 30, 2014 at 9:42 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:

> Hi, Sean
>
> The warning message printed by PnetCDF indicates that the global attribute
> named "WRF_ALARM_SECS_TIL_NEXT_RING_52" has inconsistent values among MPI
> processes.
> The source code that causes the inconsistency is at line 366 of file
> share/output_wrf.F, where the value of this attribute is a timer (in
> seconds) set on each process independently by the call to
> WRFU_TimeIntervalGet(). The call to wrf_put_dom_ti_integer() at line 369
> or 371 then writes the global attribute to the netCDF file, and that is
> where PnetCDF catches the inconsistent attribute.
> Let me know if this helps.
>
>  361  IF ( i .LT. 10 ) THEN
>  362    write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_0",i1)')i
>  363  ELSE
>  364    write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_",i2)')i
>  365  ENDIF
>  366  CALL WRFU_TimeIntervalGet(interval,S=seconds)
>  367  CALL WRFU_TimeIntervalGet(tmpinterval,S=seconds2)
>  368  IF ( seconds .GE. 1700000000 .OR. seconds .LE. -1700000000 ) THEN   ! it is a forever value, do not change it
>  369    CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds, 1, ierr )
>  370  ELSE
>  371    CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds-seconds2, 1, ierr )
>  372  ENDIF
>
>
> Wei-keng
>
> On Sep 30, 2014, at 9:06 AM, Sean Byland wrote:
>
> > Thanks Wei-keng,
> > I’ve attached the ncdump output for one of the two failed domains. When
> > the user ran wrf, pnetcdf reported a problem when writing the wrf restart
> > file. When I ran it, it appears to have completed the restart files, but
> > the file that’s incomplete is the wrf output file for the second/third
> > domain.
> > I do see a lot of these warnings in the rsl output files:
> >
> > grep -r -i "inconsistent" rsl.out* | cut -d ':' -f2- | sort -u
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1459748865)
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1462916539)
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1471304891)
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != 1758243397)
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -389109179)
> > Warning (inconsistent metadata): attribute "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -397497531)
> >
> > Sean
> >
> >
> >
> > On 9/29/14, 4:59 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> >
> >> Hi, Sean
> >>
> >> My understanding of WRF (wrf_io.F90, around line 1360) is that when
> >> NFMPI_ENDDEF() returns an error code, the program returns without
> >> continuing to write any data. So I guess the file with a much smaller
> >> size is probably because of this (but you should see additional error
> >> messages printed on stdout).
> >>
> >> Could you show us the output of command "ncdump -h"?
> >>
> >> Wei-keng
> >>
> >> On Sep 29, 2014, at 4:25 PM, Sean Byland wrote:
> >>
> >>> By "problem with his application" I mean a problem unrelated to
> >>> (upstream of) their usage of parallel-netcdf.
> >>>
> >>> Sean B.
> >>>
> >>> On 9/29/14, 4:21 PM, "Sean Byland" <seanb at cray.com> wrote:
> >>>
> >>>> Thanks. When I ran his application, one of the output files was an
> >>>> order of magnitude too small, but parallel-netcdf didn’t report a
> >>>> problem, making PNETCDF_SAFE_MODE less useful. This makes me think
> >>>> there’s a problem with his application (i.e. a race condition).
> >>>>
> >>>> Sean B.
> >>>>
> >>>> On 9/29/14, 4:05 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On 09/29/2014 03:56 PM, Sean Byland wrote:
> >>>>>> Rob,
> >>>>>> Unfortunately I wasn’t able to reproduce the error, but he did have
> >>>>>> an ncfile lying around, so I can provide the ncdump -h output. Not
> >>>>>> to be dense, but what should I be looking for (I see lots of time
> >>>>>> values)? Knowing almost nothing about pnetcdf, I would think that
> >>>>>> if different processes had inconsistent data, they would fail on
> >>>>>> the write, and therefore I wouldn’t be able to observe which values
> >>>>>> were inconsistent?
> >>>>>
> >>>>> I think you're going to be better served by Wei-keng's environment
> >>>>> variable suggestion, but read on for a bit of background:
> >>>>>
> >>>>> Parallel-NetCDF expects the header to be identical on all N MPI
> >>>>> processes.  How can processes have different data and yet still
> >>>>> read the file?  Well, the header is pretty simple.  It's not too far
> >>>>> off to think of it as a big array of (now 64 bit) values.
> >>>>>
> >>>>> On one process you might have the attribute "timestamp" with a value
> >>>>> "2014-09-29-16:00:34 CST", followed by the information "the variable
> >>>>> Pressure starts at offset 20023423 bytes".
> >>>>>
> >>>>> On another process, you might have the exact same information, except
> >>>>> the time stamp is "2014-09-29-23:00:34 GMT".  The information about
> >>>>> the variable will still start at the same place and contain the same
> >>>>> information.  Rank 0 will broadcast its version of the header to all
> >>>>> the other processes.  If any of them differ in any byte, the library
> >>>>> will give an error.
> >>>>>
> >>>>> If the check were more involved, we could warn about attribute values
> >>>>> that differ slightly but press on if "important" values (which we
> >>>>> would have to define) were all consistent.
> >>>>>
> >>>>> ==rob
> >>>>>
> >>>>>>
> >>>>>> Thanks for any info.
> >>>>>>
> >>>>>> Sean
> >>>>>>
> >>>>>>
> >>>>>> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:
> >>>>>>>> Hello,
> >>>>>>>> A parallel-netcdf user gets this error:
> >>>>>>>>
> >>>>>>>>   NetCDF error: File header is inconsistent among processes
> >>>>>>>>   NetCDF error ( -250 ) from NFMPI_ENDDEF in ext_pnc_open_for_write_commit wrf_io.F90, line 1360
> >>>>>>>>  med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for writing
> >>>>>>>
> >>>>>>> The most common cause for this message is a timestamp attribute:
> >>>>>>> the processes are not 100% in lock step, and so create ever so
> >>>>>>> slightly different timestamps.
> >>>>>>>
> >>>>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h'
> >>>>>>> or serial netcdf's 'ncdump -h'.
> >>>>>>>
> >>>>>>> ==rob
> >>>>>>>
> >>>>>>>>
> >>>>>>>> when running software/parallel-netcdf (1.5.0) libraries that
> >>>>>>>> were built with CCE, but not with an application/library built
> >>>>>>>> with Intel’s compiler. I’m still waiting on the user for what I
> >>>>>>>> need to reproduce the error and start experimenting, but was
> >>>>>>>> hoping that someone on this mailing list might have some useful
> >>>>>>>> information or hints about what’s causing this error and how I
> >>>>>>>> might fix it (or where I might look). All of the “make check”
> >>>>>>>> tests pass. Thanks for any input.
> >>>>>>>>
> >>>>>>>> Sean B
> >>>>>>>
> >>>>>>> --
> >>>>>>> Rob Latham
> >>>>>>> Mathematics and Computer Science Division
> >>>>>>> Argonne National Lab, IL USA
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Rob Latham
> >>>>> Mathematics and Computer Science Division
> >>>>> Argonne National Lab, IL USA
> >>>>
> >>>
> >>
> >
> > <ncdump.wrfout_d02.out>
>
>


-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO