File header is inconsistent among processes
Rob Latham
robl at mcs.anl.gov
Mon Sep 29 16:05:08 CDT 2014
On 09/29/2014 03:56 PM, Sean Byland wrote:
> Rob,
> Unfortunately I wasn’t able to reproduce the error, but he did have an
> ncfile lying around so I can provide the ncdump -h output. Not to be
> dense, but what should I be looking for (I see lots of time values)?
> Knowing almost nothing about pnetcdf, I would think that if different
> processes had inconsistent data, wouldn’t they fail on the write and
> therefore I wouldn’t be able to observe what values were inconsistent?

I think you're going to be better served by Wei-keng's environment
variable suggestion, but read on for a bit of background:

Parallel-NetCDF expects the header to be identical on all N MPI
processes. How can processes have different data and yet still read the
file? Well, the header is pretty simple. It's not too far off to
think of it as a big array of (now 64-bit) values.

On one process you might have the attribute "timestamp" with a value
"2014-09-29-16:00:34 CST", followed by the information "the variable
Pressure starts at offset 20023423 bytes".

On another process, you might have the exact same information, except
the timestamp is "2014-09-29-23:00:34 GMT". The information about the
variable will still start at the same place and contain the same
information. Rank 0 will broadcast its version of the header to all the
other processes, and each process compares that copy against its own. If
any of them differ in any byte, the library will give an error.
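
To make that concrete, here is a toy sketch in Python. It is not the library's code and involves no MPI: a list of byte strings stands in for the per-rank headers, and element 0 plays the copy that rank 0 would broadcast. The header layout and the names build_header and find_inconsistent_ranks are made up for illustration only.

```python
import struct


def build_header(timestamp, var_offset):
    """Build a toy header: a 'timestamp' attribute followed by one
    64-bit variable offset. This layout is invented for illustration
    and is not the real netCDF header format."""
    return (
        b"timestamp\x00"
        + timestamp.encode("ascii")
        + b"\x00"
        + struct.pack(">q", var_offset)  # 64-bit big-endian offset
    )


def find_inconsistent_ranks(headers):
    """headers[i] is the header held by rank i. Return the ranks whose
    bytes differ from rank 0's copy (the one that would be broadcast)."""
    root = headers[0]
    return [rank for rank, local in enumerate(headers) if local != root]


# Same variable offset on both ranks, but the timestamps differ.
rank0 = build_header("2014-09-29-16:00:34 CST", 20023423)
rank1 = build_header("2014-09-29-23:00:34 GMT", 20023423)

# The offset field sits at the same place with the same bytes,
# yet the byte-for-byte comparison still flags rank 1.
print(rank0[-8:] == rank1[-8:])
print(find_inconsistent_ranks([rank0, rank1]))
```

The point of the sketch is only that a strict byte comparison cannot tell a harmless attribute difference from a real layout difference.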

If the check were more involved, we could warn about attribute values
that differ slightly but press on if "important" values (which we would
have to define) were all consistent.
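
A rough sketch of what such a check might look like, again as a Python toy and not anything the library does today. It assumes, purely for illustration, that "important" means the variable names and offsets plus the set of attribute names, and that a header can be modeled as a dict with "attrs" and "vars" entries. The function name compare_headers is hypothetical.

```python
def compare_headers(root, local):
    """Compare a local header against rank 0's copy.

    Returns (fatal, warnings). What counts as fatal here is an
    assumption made for this sketch: variable layout and the set of
    attribute names must match; attribute values may differ."""
    fatal = []
    warnings = []

    # Variable names and offsets decide where data lands in the file.
    if root["vars"] != local["vars"]:
        fatal.append("variable layout differs")

    # A missing or extra attribute is treated as fatal in this sketch.
    if set(root["attrs"]) != set(local["attrs"]):
        fatal.append("attribute names differ")
    else:
        for name in sorted(root["attrs"]):
            if root["attrs"][name] != local["attrs"][name]:
                warnings.append(f"attribute {name!r} differs")

    return fatal, warnings


rank0 = {
    "attrs": {"timestamp": "2014-09-29-16:00:34 CST"},
    "vars": {"Pressure": 20023423},
}
rank1 = {
    "attrs": {"timestamp": "2014-09-29-23:00:34 GMT"},
    "vars": {"Pressure": 20023423},
}

fatal, warnings = compare_headers(rank0, rank1)
print(fatal, warnings)
```

With a policy like this, the timestamp case would produce a warning and the write could proceed, while a differing variable offset would still stop it.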
==rob
>
> Thanks for any info.
>
> Sean
>
>
> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
>>
>>
>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>> Hello,
>>> A parallel-netcdf user gets this error:
>>>
>>> NetCDF error: File header is inconsistent among processes
>>> NetCDF error ( -250 ) from NFMPI_ENDDEF in
>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>> med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for writing
>>
>> The most common cause for this message is when there's a timestamp
>> attribute: the processes are not 100% in lock step, and so create ever
>> so slightly different timestamps.
>>
>> Can you provide the header of a CCE run? You can use 'ncmpidump -h' or
>> serial netcdf's 'ncdump -h'
>>
>> ==rob
>>
>>>
>>> when running software/parallel-netcdf (1.5.0) libraries that were built
>>> with CCE, but not with an application/library built with
>>> Intel’s compiler. I’m still waiting on the user for something that I
>>> need to reproduce the error and start experimenting but was hoping that
>>> someone on this mailing list might have some useful information or hints
>>> about what’s causing this error and how I might fix it (or where I might
>>> look). All of the “make check” tests pass. Thanks for any input.
>>>
>>> Sean B
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA