File header is inconsistent among processes

Sean Byland seanb at cray.com
Tue Sep 30 09:06:09 CDT 2014


Thanks Wei-keng,
I’ve attached the ncdump output for one of the two failed domains. When
the user ran wrf, pnetcdf reported a problem when writing the wrf restart
file. When I ran it, the restart files appear to have been written
completely; the incomplete file is the wrf output file for the second/third
domain. I do see a lot of these warnings in the rsl output files:

grep -r -i "inconsistent" rsl.out* | cut -d ':' -f2- | sort -u
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1459748865)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1462916539)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1471304891)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != 1758243397)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -389109179)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -397497531)

Sean
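
For anyone wanting to condense the warnings listed above, here is a minimal Python sketch that groups the mismatching values by attribute name. It assumes the message layout shown in the grep output (with the line wrapping as it appears in this mail); the function name `summarize_warnings` and the keys `left`/`right` are made up for illustration and are not part of WRF or PnetCDF. Which side of `!=` belongs to which process is not assumed here.

```python
import re
from collections import defaultdict

# Matches one warning as shown in the grep output above; \s+ also spans
# the line break that the mail client inserted after "attribute".
WARNING_RE = re.compile(
    r'Warning \(inconsistent metadata\): attribute\s+"(?P<name>[^"]+)"\s+'
    r'(?P<type>\w+)\s+\((?P<left>-?\d+) != (?P<right>-?\d+)\)'
)

def summarize_warnings(text):
    """Collect the distinct values seen on each side of '!=' per attribute."""
    summary = defaultdict(lambda: {"left": set(), "right": set()})
    for match in WARNING_RE.finditer(text):
        entry = summary[match.group("name")]
        entry["left"].add(int(match.group("left")))
        entry["right"].add(int(match.group("right")))
    return dict(summary)

# Two of the warnings quoted in this message, used as sample input.
sample = '''Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1459748865)
Warning (inconsistent metadata): attribute
"WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != 1758243397)
'''
result = summarize_warnings(sample)
for name, values in result.items():
    print(name, sorted(values["left"]), sorted(values["right"]))
```

In practice the text would come from the concatenated rsl.out* files rather than from an inline sample.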



On 9/29/14, 4:59 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:

>Hi, Sean
>
>My understanding of WRF (wrf_io.F90, around line 1360) is that when
>NFMPI_ENDDEF() returns an error code, the program returns without
>continuing to write any data. So I guess that is probably why the file is
>much smaller (but you should see additional error messages printed on
>stdout).
>
>Could you show us the output of command "ncdump -h"?
>
>Wei-keng
>
>On Sep 29, 2014, at 4:25 PM, Sean Byland wrote:
>
>> By “problem with his application” I mean a problem unrelated to
>> (upstream of) their usage of parallel-netcdf.
>> 
>> Sean B.
>> 
>> On 9/29/14, 4:21 PM, "Sean Byland" <seanb at cray.com> wrote:
>> 
>>> Thanks. When I ran his application, one of the output files was an order
>>> of magnitude too small, but parallel-netcdf didn’t report a problem,
>>> making PNETCDF_SAFE_MODE less useful. This makes me think there’s a
>>> problem with his application (i.e. a race condition).
>>> 
>>> Sean B.
>>> 
>>> On 9/29/14, 4:05 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>> 
>>>> 
>>>> 
>>>> On 09/29/2014 03:56 PM, Sean Byland wrote:
>>>>> Rob,
>>>>> Unfortunately I wasn’t able to reproduce the error, but he did have an
>>>>> ncfile lying around, so I can provide the ncdump -h output. Not to be
>>>>> dense, but what should I be looking for (I see lots of time values)?
>>>>> Knowing almost nothing about pnetcdf, I would think that if different
>>>>> processes had inconsistent data, they would fail on the write, and
>>>>> therefore I wouldn’t be able to observe which values were inconsistent?
>>>> 
>>>> I think you're going to be better served by Wei-keng's environment
>>>> variable suggestion, but read on for a bit of background:
>>>> 
>>>> Parallel-NetCDF expects the header to be identical on all N MPI
>>>> processes.  How can processes have different data and yet still read
>>>> the file?  Well, the header is pretty simple.  It's not too far off to
>>>> think of it as a big array of (now 64 bit) values.
>>>> 
>>>> On one process you might have the attribute "timestamp" with a value
>>>> "2014-09-29-16:00:34 CST", followed by the information "the variable
>>>> Pressure starts at offset 20023423 bytes".
>>>> 
>>>> On another process, you might have the exact same information, except
>>>> the time stamp is "2014-09-29-23:00:34 GMT".  The information about
>>>> the variable will still start at the same place and contain the same
>>>> information.  Rank 0 will broadcast its version of the header to all
>>>> the other processes.  If any of them differ in any byte, the library
>>>> will give an error.
>>>> 
>>>> If the check was more involved, we could warn about attribute values
>>>> that differ slightly but press on if "important" values (which we
>>>> would have to define) were all consistent.
>>>> 
>>>> ==rob
>>>> 
>>>>> 
>>>>> Thanks for any info.
>>>>> 
>>>>> Sean
>>>>> 
>>>>> 
>>>>> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>>>>>> Hello,
>>>>>>> A parallel-netcdf user gets this error:
>>>>>>> 
>>>>>>>    NetCDF error: File header is inconsistent among processes
>>>>>>>    NetCDF error ( -250 ) from NFMPI_ENDDEF in
>>>>>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>>>>>>   med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for
>>>>>>> writing
>>>>>> 
>>>>>> The most common cause for this message is a timestamp attribute: the
>>>>>> processes are not 100% in lock step, and so create ever so slightly
>>>>>> different timestamps.
>>>>>> 
>>>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h'
>>>>>> or serial netcdf's 'ncdump -h'.
>>>>>> 
>>>>>> ==rob
>>>>>> 
>>>>>>> 
>>>>>>> when running software/parallel-netcdf (1.5.0) libraries that were
>>>>>>> built with CCE, but not with an application/library that was built
>>>>>>> with Intel’s compiler. I’m still waiting on the user for something
>>>>>>> that I need to reproduce the error and start experimenting, but was
>>>>>>> hoping that someone on this mailing list might have some useful
>>>>>>> information or hints about what’s causing this error and how I might
>>>>>>> fix it (or where I might look). All of the “make check” tests pass.
>>>>>>> Thanks for any input.
>>>>>>> 
>>>>>>> Sean B
>>>>>> 
>>>>>> --
>>>>>> Rob Latham
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Lab, IL USA
>>>>> 
>>>> 
>>>> -- 
>>>> Rob Latham
>>>> Mathematics and Computer Science Division
>>>> Argonne National Lab, IL USA
>>> 
>> 
>
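
As an illustration of Rob's explanation quoted above (rank 0 broadcasts its header and any byte-level difference is an error), here is a toy Python simulation. It is only a sketch under stated assumptions: the header layout in `encode_header` is invented and is not the real netCDF file format, no MPI is involved, and `check_consistency` is a hypothetical helper, not PnetCDF's implementation. The value -250 is simply the error code printed in the log earlier in this thread.

```python
import struct

ERR_HEADER_INCONSISTENT = -250  # code shown in the NFMPI_ENDDEF log in this thread

def encode_header(timestamp, pressure_offset):
    """Toy header: length-prefixed timestamp attribute, then a 64-bit offset.

    Illustrative layout only; the real netCDF header is more involved.
    """
    ts = timestamp.encode("ascii")
    return struct.pack(">q", len(ts)) + ts + struct.pack(">q", pressure_offset)

def check_consistency(headers):
    """Compare every rank's header byte for byte against rank 0's copy.

    Stands in for 'rank 0 broadcasts its header, the others compare'.
    Returns (status, ranks_that_differ).
    """
    root = headers[0]
    mismatched = [rank for rank, header in enumerate(headers) if header != root]
    status = ERR_HEADER_INCONSISTENT if mismatched else 0
    return status, mismatched

# Three simulated ranks; the last one stamped the time in a different zone.
headers = [
    encode_header("2014-09-29-16:00:34 CST", 20023423),
    encode_header("2014-09-29-16:00:34 CST", 20023423),
    encode_header("2014-09-29-23:00:34 GMT", 20023423),
]
status, mismatched = check_consistency(headers)
print(status, mismatched)  # prints: -250 [2]
```

Note that the two timestamps have the same length, so the variable offset sits at the same position in every header, yet the strict byte comparison still flags rank 2. That matches the point made in the thread: the variable information is identical, and only the attribute value differs.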

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ncdump.wrfout_d02.out
Type: application/octet-stream
Size: 50346 bytes
Desc: ncdump.wrfout_d02.out
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20140930/b9ccdc13/attachment-0001.obj>

