File header is inconsistent among processes

Sean Byland seanb at cray.com
Mon Sep 29 16:25:27 CDT 2014


By "problem with his application" I mean a problem unrelated to (upstream
of) their usage of parallel-netcdf.

Sean B.

On 9/29/14, 4:21 PM, "Sean Byland" <seanb at cray.com> wrote:

>Thanks. When I ran his application, one of the output files was an order of
>magnitude too small, but parallel-netcdf didn’t report a problem, which
>makes PNETCDF_SAFE_MODE less useful. This makes me think there’s a problem
>with his application (i.e. a race condition).
>
>Sean B.
>
>On 9/29/14, 4:05 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
>>
>>
>>On 09/29/2014 03:56 PM, Sean Byland wrote:
>>> Rob,
>>> Unfortunately I wasn’t able to reproduce the error, but he did have an
>>> ncfile lying around, so I can provide the ncdump -h output. Not to be
>>> dense, but what should I be looking for (I see lots of time values)?
>>> Knowing almost nothing about pnetcdf, I would think that if different
>>> processes had inconsistent data, they would fail on the write, and
>>> therefore I wouldn’t be able to observe which values were inconsistent?
>>
>>I think you're going to be better served by Wei-keng's environment
>>variable suggestion, but read on for a bit of background:
>>
>>Parallel-NetCDF expects the header to be identical on all N MPI
>>processes.  How can processes have different data and yet still read the
>>file?  Well, the header is pretty simple.  It's not too far off to
>>think of it as a big array of (now 64-bit) values.
>>
>>On one process you might have the attribute "timestamp" with a value
>>"2014-09-29-16:00:34 CST", followed by the information "the variable
>>Pressure starts at offset 20023423 bytes".
>>
>>On another process, you might have the exact same information, except
>>the time stamp is "2014-09-29-23:00:34 GMT".  The information about the
>>variable will still start at the same place and contain the same
>>information.  Rank 0 will broadcast its version of the header to all the
>>other processes.  If any of them differ in any byte, the library will
>>give an error.
>>
>>If the check were more involved, we could warn about attribute values
>>that differ slightly but press on if "important" values (which we would
>>have to define) were all consistent.
>>
>>==rob
>>
>>>
>>> Thanks for any info.
>>>
>>> Sean
>>>
>>>
>>> On 9/19/14, 1:42 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>
>>>>
>>>>
>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:
>>>>> Hello,
>>>>> A parallel-netcdf user gets this error:
>>>>>
>>>>>     NetCDF error: File header is inconsistent among processes
>>>>>     NetCDF error ( -250 ) from NFMPI_ENDDEF in ext_pnc_open_for_write_commit wrf_io.F90, line 1360
>>>>>     med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for writing
>>>>
>>>> The most common cause for this message is when there's a timestamp
>>>> attribute: the processes are not 100% in lock step, and so create ever
>>>> so slightly different timestamps.
>>>>
>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h' or
>>>> serial netcdf's 'ncdump -h'
>>>>
>>>> ==rob
>>>>
>>>>>
>>>>> when running software/parallel-netcdf (1.5.0) libraries that were built
>>>>> with CCE, but not with an application/library built with Intel’s
>>>>> compiler. I’m still waiting on the user for something that I need to
>>>>> reproduce the error and start experimenting, but was hoping that
>>>>> someone on this mailing list might have some useful information or hints
>>>>> about what’s causing this error and how I might fix it (or where I might
>>>>> look). All of the “make check” tests pass. Thanks for any input.
>>>>>
>>>>> Sean B
>>>>
>>>> --
>>>> Rob Latham
>>>> Mathematics and Computer Science Division
>>>> Argonne National Lab, IL USA
>>>
>>
>>-- 
>>Rob Latham
>>Mathematics and Computer Science Division
>>Argonne National Lab, IL USA
>
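
Illustration of the header check described in Rob's reply above. This is
only a toy model in plain Python, not PnetCDF's implementation or its real
header layout: the encode_header() and inconsistent_ranks() helpers and the
byte format are assumptions made up for this sketch. It shows why two ranks
whose "timestamp" attributes differ trip a byte-for-byte comparison against
rank 0 even though the variable offsets agree, and why sharing a single
timestamp across ranks avoids the error.

```python
import struct


def encode_header(attrs, var_offsets):
    """Serialize a toy header: attribute strings, then variable offsets.

    Lengths and offsets are packed as 64-bit integers. This layout is
    hypothetical and far simpler than the real file header.
    """
    out = bytearray()
    for name, value in attrs:
        for text in (name, value):
            data = text.encode("utf-8")
            out += struct.pack(">q", len(data)) + data
    for name, offset in var_offsets:
        data = name.encode("utf-8")
        out += struct.pack(">q", len(data)) + data + struct.pack(">q", offset)
    return bytes(out)


def inconsistent_ranks(headers):
    """Return the ranks whose header bytes differ from rank 0's copy.

    Stands in for "rank 0 broadcasts its header, everyone compares".
    """
    root = headers[0]
    return [rank for rank, header in enumerate(headers) if header != root]


# Every rank agrees on where the variable lives in the file...
offsets = [("Pressure", 20023423)]

# ...but each rank produced its own timestamp string (values from the thread).
stamps = [
    "2014-09-29-16:00:34 CST",
    "2014-09-29-23:00:34 GMT",
    "2014-09-29-16:00:34 CST",
]

per_rank = [encode_header([("timestamp", s)], offsets) for s in stamps]
bad = inconsistent_ranks(per_rank)

# Possible fix on the application side: create the timestamp once (here,
# rank 0's value) and use that same string on every rank.
shared = [encode_header([("timestamp", stamps[0])], offsets) for _ in stamps]
fixed = inconsistent_ranks(shared)

print("ranks differing from rank 0:", bad)
print("after sharing one timestamp:", fixed)
```

In a real MPI application the equivalent of the "shared" case would be to
build the timestamp on one rank and broadcast it before defining the
attribute, so that all processes write identical header bytes.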


