File header is inconsistent among processes

Mon Sep 29 16:05:08 CDT 2014

On 09/29/2014 03:56 PM, Sean Byland wrote:
> Rob,
> Unfortunately I wasn’t able to reproduce the error, but he did have an
> ncfile lying around, so I can provide the ncdump -h output. Not to be
> dense, but what should I be looking for (I see lots of time values)?
> Knowing almost nothing about pnetcdf, I would think that if different
> processes had inconsistent data, they would fail on the write and
> therefore I wouldn’t be able to observe which values were inconsistent?

I think you're going to be better served by Wei-keng's environment 
variable suggestion, but read on for a bit of background:

Parallel-NetCDF expects the header to be identical on all N MPI
processes.  How can processes have different data and yet still read the
file?  Well, the header is pretty simple.  It's not too far off to
think of it as a big array of (now 64-bit) values.

On one process you might have the attribute "timestamp" with a value 
"2014-09-29-16:00:34 CST", followed by the information "the variable 
Pressure starts at offset 20023423 bytes".

On another process, you might have the exact same information, except
the timestamp is "2014-09-29-23:00:34 GMT".  The information about the
variable will still start at the same place and contain the same
information.  Rank 0 will broadcast its version of the header to all the
other processes, and each of them compares it with its own copy.  If any
copy differs in any byte, the library will give an error.
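
To make the byte-for-byte comparison concrete, here is a rough sketch in
Python.  It is not the library's code and it does not use the real netCDF
header layout: build_header() and first_difference() are hypothetical
helpers, and the "header" is just a toy packing of one attribute plus one
variable offset.  It only shows how two ranks can agree on where Pressure
lives and still fail a strict comparison because of the timestamp.

```python
import struct

def build_header(timestamp, var_name, var_offset):
    """Pack a toy header: a 'timestamp' attribute, then a variable name
    and its 64-bit file offset.  (Hypothetical layout, not netCDF's.)"""
    def packed_string(text):
        raw = text.encode("ascii")
        # 64-bit big-endian length followed by the bytes themselves
        return struct.pack(">q", len(raw)) + raw
    return (packed_string("timestamp") + packed_string(timestamp)
            + packed_string(var_name) + struct.pack(">q", var_offset))

def first_difference(root_header, local_header):
    """Return the index of the first differing byte, or None if identical."""
    if root_header == local_header:
        return None
    for i, (a, b) in enumerate(zip(root_header, local_header)):
        if a != b:
            return i
    # One header is a prefix of the other
    return min(len(root_header), len(local_header))

# What rank 0 would broadcast, and what another rank built locally
root = build_header("2014-09-29-16:00:34 CST", "Pressure", 20023423)
other = build_header("2014-09-29-23:00:34 GMT", "Pressure", 20023423)

diff_at = first_difference(root, other)
if diff_at is not None:
    print("File header is inconsistent among processes "
          "(first differing byte at index %d)" % diff_at)
```

Both toy headers have the same length and the same variable offset, so
the data would land in the same place either way; the only disagreement
is inside the timestamp text.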

If the check were more involved, we could warn about attribute values
that differ slightly but press on if the "important" values (which we
would have to define) were all consistent.
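
As a sketch of what such a more involved check might look like (purely
hypothetical, nothing like this is claimed to exist in the library), one
could compare headers field by field instead of byte by byte.  Here I am
assuming "important" means the variable layout and the set of attribute
names, and that a differing attribute value only earns a warning.  The
dictionaries and compare_headers() are made up for illustration.

```python
def compare_headers(root, local):
    """Compare two toy headers; return (errors, warnings) as lists of strings."""
    errors, warnings = [], []

    # Variable names and offsets decide where data lands in the file,
    # so any mismatch here is treated as fatal.
    if root["variables"] != local["variables"]:
        errors.append("variable layout differs")

    root_attrs, local_attrs = root["attributes"], local["attributes"]

    # A missing or extra attribute changes the header's shape: also fatal.
    if set(root_attrs) != set(local_attrs):
        errors.append("attribute names differ")

    # Same attribute name but a different value: report it and carry on.
    for name in sorted(set(root_attrs) & set(local_attrs)):
        if root_attrs[name] != local_attrs[name]:
            warnings.append("attribute %r: %r != %r"
                            % (name, root_attrs[name], local_attrs[name]))
    return errors, warnings

rank0 = {"attributes": {"timestamp": "2014-09-29-16:00:34 CST"},
         "variables": {"Pressure": 20023423}}
rank1 = {"attributes": {"timestamp": "2014-09-29-23:00:34 GMT"},
         "variables": {"Pressure": 20023423}}

errors, warnings = compare_headers(rank0, rank1)
print("errors:", errors)
print("warnings:", warnings)
```

In a real header an attribute value of a different length would also
shift everything after it, so a production check would have to treat
that case as an error too; the sketch ignores it.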

==rob

>
> Thanks for any info.
>
> Sean
>
>
> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
>>
>>
>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>> Hello,
>>> A parallel-netcdf user gets this error:
>>>
>>>     NetCDF error: File header is inconsistent among processes
>>>     NetCDF error ( -250 ) from NFMPI_ENDDEF in
>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>>    med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for writing
>>
>> The most common cause for this message is when there's a timestamp
>> attribute: the processes are not 100% in lock step, and so create ever
>> so slightly different timestamps.
>>
>> Can you provide the header of a CCE run? You can use 'ncmpidump -h' or
>> serial netcdf's 'ncdump -h'
>>
>> ==rob
>>
>>>
>>> when running software/parallel-netcdf (1.5.0) libraries that were built
>>> with CCE, but not with an application/library that was built with
>>> Intel’s compiler. I’m still waiting on the user for what I need to
>>> reproduce the error and start experimenting, but I was hoping that
>>> someone on this mailing list might have some useful information or hints
>>> about what’s causing this error and how I might fix it (or where I might
>>> look). All of the “make check” tests pass. Thanks for any input.
>>>
>>> Sean B
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA