File header is inconsistent among processes

Sean Byland seanb at cray.com
Tue Sep 30 13:51:29 CDT 2014


Jim, Wei-keng,
Thanks for the info/suggestions. I’ll give it a shot.

Sean

From: Jim Edwards <jedwards at ucar.edu>
Date: Tuesday, September 30, 2014 at 12:31 PM
To: Wei-keng Liao <wkliao at eecs.northwestern.edu>
Cc: Sean Byland <seanb at cray.com>, "parallel-netcdf at lists.mcs.anl.gov" <parallel-netcdf at lists.mcs.anl.gov>
Subject: Re: File header is inconsistent among processes

Hi Sean,

I've had a similar problem in models that I've worked with. You need to make
WRFU_TimeIntervalGet() return a consistent time on every task, with something like:

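if (mytask == 0) then
   interval = get_time()   ! placeholder for the local timer query
endif
call MPI_BCAST(interval, 1, MPI_INTEGER, 0, comm, ierr)   ! assumes interval is a default integer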
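
The pattern above (one task reads the timer, then every task adopts that value) can be simulated in plain Python without MPI. This is only an illustrative sketch: `broadcast_from_root` is a hypothetical helper, each list entry stands in for one MPI rank, and a real code would use an actual MPI broadcast instead.

```python
# Simulation only: each list entry stands in for one MPI rank's local value.
def broadcast_from_root(per_rank_values, root=0):
    """Return the values every rank would hold after a broadcast from `root`."""
    root_value = per_rank_values[root]
    return [root_value for _ in per_rank_values]


# Per-rank timer readings, taken from the warnings quoted later in this thread.
local_intervals = [676375365, -1459748865, -1462916539, 1758243397]

# After the broadcast, every rank holds rank 0's reading, so an attribute
# built from it is identical everywhere.
shared_intervals = broadcast_from_root(local_intervals)
print(shared_intervals)
```

The point of the design is that only one task's clock is consulted, so small timing differences between tasks can no longer leak into the file header.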


On Tue, Sep 30, 2014 at 9:42 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
Hi, Sean

The warning message printed by PnetCDF indicates that the global attribute
named "WRF_ALARM_SECS_TIL_NEXT_RING_52" has inconsistent values among MPI processes.
The source code that causes the inconsistency is at line 366 of file share/output_wrf.F,
where the value of this attribute is a timer (in seconds) set on each
process independently when calling WRFU_TimeIntervalGet(). The call to
wrf_put_dom_ti_integer() at line 369 or 371 then writes the global attribute to the
netCDF file, which is where PnetCDF catches the inconsistency.
Let me know if this helps.

 361  IF ( i .LT. 10 ) THEN
 362    write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_0",i1)')i
 363  ELSE
 364    write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_",i2)')i
 365  ENDIF
 366  CALL WRFU_TimeIntervalGet(interval,S=seconds)
 367  CALL WRFU_TimeIntervalGet(tmpinterval,S=seconds2)
 368  IF ( seconds .GE. 1700000000 .OR. seconds .LE. -1700000000 ) THEN   ! it is a forever value, do not change it
 369    CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds, 1, ierr )
 370  ELSE
 371    CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds-seconds2, 1, ierr )
 372  ENDIF


Wei-keng
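
To make the diagnosis above concrete, here is a rough Python sketch of how per-process attribute values can be compared against rank 0's value. It is not PnetCDF's implementation: `alarm_name` and `find_inconsistencies` are hypothetical helpers, the message text only imitates the warnings seen in the rsl files, and printing the root's value on the left-hand side is an assumption.

```python
def alarm_name(i):
    # Mirrors lines 361-365 of the listing: a zero-padded, two-digit suffix.
    return f"WRF_ALARM_SECS_TIL_NEXT_RING_{i:02d}"


def find_inconsistencies(name, per_rank_values, root=0):
    """Compare every rank's value with the root's and describe each mismatch."""
    root_value = per_rank_values[root]
    return [
        f'Warning (inconsistent metadata): attribute "{name}" '
        f"INT ({root_value} != {value})"
        for rank, value in enumerate(per_rank_values)
        if rank != root and value != root_value
    ]


# Each entry stands in for the value one MPI process computed independently.
values = [676375365, 676375365, -1459748865]
for warning in find_inconsistencies(alarm_name(52), values):
    print(warning)
```

An empty result would mean all processes agree on the attribute, which is the state the header check expects.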

On Sep 30, 2014, at 9:06 AM, Sean Byland wrote:

> Thanks Wei-keng,
> I’ve attached the ncdump output for one of the two failed domains. When
> the user ran wrf, pnetcdf reported a problem when writing the wrf restart
> file. When I ran it, it appeared to complete the restart files, but the
> wrf output file for the second/third domain is incomplete.
> I do see a lot of these warnings in the rsl output files:
>
> grep -r -i "inconsistent" rsl.out* | cut -d ':' -f2- | sort -u
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1459748865)
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1462916539)
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1471304891)
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != 1758243397)
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -389109179)
> Warning (inconsistent metadata): attribute
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -397497531)
>
> Sean
>
>
>
> On 9/29/14, 4:59 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>
>> Hi, Sean
>>
>> My understanding of WRF (wrf_io.F90, around line 1360) is that when
>> NFMPI_ENDDEF() returns an error code, the program returns without
>> writing any data. So I guess that is probably why the file is much
>> smaller (but you should see additional error messages printed on stdout).
>>
>> Could you show us the output of command "ncdump -h"?
>>
>> Wei-keng
>>
>> On Sep 29, 2014, at 4:25 PM, Sean Byland wrote:
>>
>>> By "problem with his application" I mean a problem unrelated to
>>> (upstream of) their usage of parallel-netcdf.
>>>
>>> Sean B.
>>>
>>> On 9/29/14, 4:21 PM, "Sean Byland" <seanb at cray.com> wrote:
>>>
>>>> Thanks. When I ran his application, one of the output files was an
>>>> order of magnitude too small, but parallel-netcdf didn’t report a
>>>> problem, making PNETCDF_SAFE_MODE less useful. This makes me think
>>>> there’s a problem with his application (i.e. a race condition).
>>>>
>>>> Sean B.
>>>>
>>>> On 9/29/14, 4:05 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 09/29/2014 03:56 PM, Sean Byland wrote:
>>>>>> Rob,
>>>>>> Unfortunately I wasn’t able to reproduce the error, but he did have
>>>>>> an ncfile lying around, so I can provide the ncdump -h output. Not to
>>>>>> be dense, but what should I be looking for (I see lots of time values)?
>>>>>> Knowing almost nothing about pnetcdf, I would think that if different
>>>>>> processes had inconsistent data, they would fail on the write, and
>>>>>> therefore I wouldn’t be able to observe which values were inconsistent?
>>>>>
>>>>> I think you're going to be better served by Wei-keng's environment
>>>>> variable suggestion, but read on for a bit of background:
>>>>>
>>>>> Parallel-NetCDF expects the header to be identical on all N MPI
>>>>> processes. How can processes have different data and yet still read
>>>>> the file? Well, the header is pretty simple. It's not too far off to
>>>>> think of it as a big array of (now 64 bit) values.
>>>>>
>>>>> On one process you might have the attribute "timestamp" with a value
>>>>> "2014-09-29-16:00:34 CST", followed by the information "the variable
>>>>> Pressure starts at offset 20023423 bytes".
>>>>>
>>>>> On another process, you might have the exact same information, except
>>>>> the time stamp is "2014-09-29-23:00:34 GMT". The information about the
>>>>> variable will still start at the same place and contain the same
>>>>> information. Rank 0 will broadcast its version of the header to all the
>>>>> other processes. If any of them differ in any byte, the library will
>>>>> give an error.
>>>>>
>>>>> If the check was more involved, we could warn about attribute values
>>>>> that differ slightly but press on if "important" values (which we
>>>>> would
>>>>> have to define) were all consistent.
>>>>>
>>>>> ==rob
>>>>>
>>>>>>
>>>>>> Thanks for any info.
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>>>>>>> Hello,
>>>>>>>> A parallel-netcdf user gets this error:
>>>>>>>>
>>>>>>>>   NetCDF error: File header is inconsistent among processes
>>>>>>>>   NetCDF error ( -250 ) from NFMPI_ENDDEF in
>>>>>>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>>>>>>>  med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for
>>>>>>>> writing
>>>>>>>
>>>>>>> The most common cause for this message is when there's a timestamp
>>>>>>> attribute: the processes are not 100% in lock step, and so create
>>>>>>> ever
>>>>>>> so slightly different timestamps.
>>>>>>>
>>>>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h'
>>>>>>> or
>>>>>>> serial netcdf's 'ncdump -h'
>>>>>>>
>>>>>>> ==rob
>>>>>>>
>>>>>>>>
>>>>>>>> when running software/parallel-netcdf (1.5.0) libraries that were
>>>>>>>> built with CCE, but not with an application/library that was built
>>>>>>>> with Intel’s compiler. I’m still waiting on the user for something
>>>>>>>> that I need to reproduce the error and start experimenting, but was
>>>>>>>> hoping that someone on this mailing list might have some useful
>>>>>>>> information or hints about what’s causing this error and how I might
>>>>>>>> fix it (or where I might look). All of the “make check” tests pass.
>>>>>>>> Thanks for any input.
>>>>>>>>
>>>>>>>> Sean B
>>>>>>>
>>>>>>> --
>>>>>>> Rob Latham
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Lab, IL USA
>>>>>>
>>>>>
>>>>> --
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Lab, IL USA
>>>>
>>>
>>
>
> <ncdump.wrfout_d02.out>




--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO