File header is inconsistent among processes

Wei-keng Liao wkliao at eecs.northwestern.edu
Mon Sep 29 16:59:45 CDT 2014


Hi, Sean

My understanding of WRF (wrf_io.F90, around line 1360) is that when NFMPI_ENDDEF() returns an error code,
the routine returns without writing any data. So I guess that is probably why the file is so much smaller
(but you should see additional error messages printed on stdout).

Could you show us the output of "ncdump -h" on that file?

Wei-keng

On Sep 29, 2014, at 4:25 PM, Sean Byland wrote:

> By "problem with his application" I mean a problem unrelated to (upstream
> of) their usage of parallel-netcdf.
> 
> Sean B.
> 
> On 9/29/14, 4:21 PM, "Sean Byland" <seanb at cray.com> wrote:
> 
>> Thanks. When I ran his application, one of the output files was an order of
>> magnitude too small, but parallel-netcdf didn’t report a problem, which makes
>> PNETCDF_SAFE_MODE less useful. This makes me think there’s a problem with
>> his application (e.g. a race condition).
>> 
>> Sean B.
>> 
>> On 9/29/14, 4:05 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>> 
>>> 
>>> 
>>> On 09/29/2014 03:56 PM, Sean Byland wrote:
>>>> Rob,
>>>> Unfortunately I wasn’t able to reproduce the error, but he did have an
>>>> ncfile lying around, so I can provide the ncdump -h output. Not to be
>>>> dense, but what should I be looking for (I see lots of time values)?
>>>> Knowing almost nothing about pnetcdf, I would think that if different
>>>> processes had inconsistent data, wouldn’t they fail on the write, so
>>>> that I couldn’t observe which values were inconsistent?
>>> 
>>> I think you're going to be better served by Wei-keng's environment
>>> variable suggestion, but read on for a bit of background:
>>> 
>>> Parallel-NetCDF expects the header to be identical on all N MPI
>>> processes.  How can processes have different data and yet still read the
>>> file?  Well, the header is pretty simple.  It's not too far off to
>>> think of it as a big array of (now 64 bit) values.
>>> 
>>> On one process you might have the attribute "timestamp" with a value
>>> "2014-09-29-16:00:34 CST", followed by the information "the variable
>>> Pressure starts at offset 20023423 bytes".
>>> 
>>> On another process, you might have the exact same information, except
>>> the time stamp is "2014-09-29-23:00:34 GMT".  The information about the
>>> variable will still start at the same place and contain the same
>>> information.  Rank 0 will broadcast its version of the header to all the
>>> other processes.  If any of them differ in any byte, the library will
>>> give an error.
>>> 
>>> If the check were more involved, we could warn about attribute values
>>> that differ slightly but press on if "important" values (which we would
>>> have to define) were all consistent.
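
The byte-for-byte comparison Rob describes can be illustrated with a
small Python sketch. This is not PnetCDF code and uses no MPI. The
32-byte timestamp field, the toy header layout, and the names
build_header, first_difference and inconsistent_ranks are all
assumptions made up for illustration; only the two timestamps and the
offset are taken from the message above.

```python
import struct


def build_header(timestamp, pressure_offset):
    """Pack a toy header: a fixed-width timestamp attribute followed by
    one 64-bit offset. The real file header layout is NOT reproduced here."""
    ts_field = timestamp.encode("ascii").ljust(32, b"\0")  # assumed width
    return ts_field + struct.pack(">q", pressure_offset)


def first_difference(root, local):
    """Return the index of the first differing byte, or None if identical."""
    for i, (a, b) in enumerate(zip(root, local)):
        if a != b:
            return i
    if len(root) != len(local):
        return min(len(root), len(local))
    return None


def inconsistent_ranks(headers):
    """Compare every rank's header against rank 0's copy (standing in for
    the broadcast) and map each mismatching rank to its first bad byte."""
    root = headers[0]
    result = {}
    for rank, local in enumerate(headers):
        diff = first_difference(root, local)
        if diff is not None:
            result[rank] = diff
    return result


if __name__ == "__main__":
    cst = build_header("2014-09-29-16:00:34 CST", 20023423)
    gmt = build_header("2014-09-29-23:00:34 GMT", 20023423)
    # Ranks 0 and 1 agree; rank 2 wrote a different timestamp string.
    print(inconsistent_ranks([cst, cst, gmt]))
    # The variable offset that follows the attribute is still identical.
    print(cst[32:] == gmt[32:])
```

In this toy layout the two headers have the same length and the same
variable offset, so a file written from either one would be readable,
yet a strict byte comparison against rank 0's copy still flags the
rank with the other timestamp.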
>>> 
>>> ==rob
>>> 
>>>> 
>>>> Thanks for any info.
>>>> 
>>>> Sean
>>>> 
>>>> 
>>>> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>>>>> Hello,
>>>>>> A parallel-netcdf user gets this error:
>>>>>> 
>>>>>>    NetCDF error: File header is inconsistent among processes
>>>>>>    NetCDF error ( -250 ) from NFMPI_ENDDEF in
>>>>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>>>>>   med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for
>>>>>> writing
>>>>> 
>>>>> The most common cause for this message is when there's a timestamp
>>>>> attribute: the processes are not 100% in lock step, and so create ever
>>>>> so slightly different timestamps.
>>>>> 
>>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h' or
>>>>> serial netcdf's 'ncdump -h'
>>>>> 
>>>>> ==rob
>>>>> 
>>>>>> 
>>>>>> when running software/parallel-netcdf (1.5.0) libraries that were
>>>>>> built with CCE, but not with an application/library built with
>>>>>> Intel’s compiler. I’m still waiting on the user for what I need to
>>>>>> reproduce the error and start experimenting, but was hoping that
>>>>>> someone on this mailing list might have some useful information or
>>>>>> hints about what’s causing this error and how I might fix it (or
>>>>>> where I might look). All of the “make check” tests pass. Thanks for
>>>>>> any input.
>>>>>> 
>>>>>> Sean B
>>>>> 
>>>>> --
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Lab, IL USA
>>>> 
>>> 
>>> -- 
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>> 
> 


