<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif;">
<div>Jim, Wei-keng,</div>
<div>Thanks for the info/suggestions. I’ll give it a shot.</div>
<div><br>
</div>
<div>Sean</div>
<div><br>
</div>
<span id="OLK_SRC_BODY_SECTION">
<div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">
<span style="font-weight:bold">From: </span>Jim Edwards <<a href="mailto:jedwards@ucar.edu">jedwards@ucar.edu</a>><br>
<span style="font-weight:bold">Date: </span>Tuesday, September 30, 2014 at 12:31 PM<br>
<span style="font-weight:bold">To: </span>Wei-keng Liao <<a href="mailto:wkliao@eecs.northwestern.edu">wkliao@eecs.northwestern.edu</a>><br>
<span style="font-weight:bold">Cc: </span>Sean Byland <<a href="mailto:seanb@cray.com">seanb@cray.com</a>>, "<a href="mailto:parallel-netcdf@lists.mcs.anl.gov">parallel-netcdf@lists.mcs.anl.gov</a>" <<a href="mailto:parallel-netcdf@lists.mcs.anl.gov">parallel-netcdf@lists.mcs.anl.gov</a>><br>
<span style="font-weight:bold">Subject: </span>Re: File header is inconsistent among processes<br>
</div>
<div><br>
</div>
<div>
<div>
<div dir="ltr">
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
Hi Sean,<br>
<br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
I've had a similar problem in models that I've worked with. You need to make <br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
WRFU_TimeIntervalGet() return a consistent time with something like:<br>
<br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
if (mytask == 0) then<br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
&nbsp;&nbsp;interval = get_time() &nbsp;! get_time() is a placeholder<br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
end if<br>
</div>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
call MPI_BCAST(interval, 1, MPI_INTEGER, 0, comm, ierr) &nbsp;! assumes interval is a default integer<br>
<br>
</div>
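<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
A rough sketch of why the broadcast helps, written as a single-process Python simulation rather than real MPI code: each list entry stands in for one rank's attribute value, the helper names are made up for illustration, and the sample integers are copied from the warnings quoted below (which rank produced which value is an assumption). After the simulated broadcast every rank holds rank 0's value, so a consistency check has nothing to flag.<br>
<br>
</div>
<pre>
```python
# Single-process simulation: each list entry stands in for one MPI rank.
# The helper names are illustrative and are not part of WRF, PnetCDF, or MPI.


def inconsistent_ranks(values, root=0):
    """Return the ranks whose value differs from the root rank's value."""
    return [rank for rank, v in enumerate(values) if v != values[root]]


def broadcast(values, root=0):
    """Mimic MPI_Bcast: every rank ends up with the root rank's value."""
    return [values[root]] * len(values)


# Per-rank attribute values; the integers come from the warnings in this
# thread, but their assignment to ranks 0..3 is an assumption.
per_rank_seconds = [676375365, -1459748865, -1462916539, 1758243397]

before = inconsistent_ranks(per_rank_seconds)
after = inconsistent_ranks(broadcast(per_rank_seconds))

print("ranks differing from rank 0 before broadcast:", before)
print("ranks differing from rank 0 after broadcast:", after)
```
</pre>
<div class="gmail_default" style="font-family:comic sans ms,sans-serif;color:rgb(56,118,29)">
In the model itself the same idea would presumably apply to the seconds value obtained at line 366 of the listing quoted below: compute it on one task, broadcast it, and only then write the attribute.<br>
<br>
</div>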
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, Sep 30, 2014 at 9:42 AM, Wei-keng Liao <span dir="ltr">
<<a href="mailto:wkliao@eecs.northwestern.edu" target="_blank">wkliao@eecs.northwestern.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi, Sean<br>
<br>
The warning message printed by PnetCDF indicates that the global attribute<br>
named "WRF_ALARM_SECS_TIL_NEXT_RING_52" has inconsistent values among MPI processes.<br>
The source code that causes the inconsistency is at line 366 of file share/output_wrf.F,<br>
where the value of this attribute is a timer (in seconds) set on each<br>
process independently by the call to WRFU_TimeIntervalGet(). The call to<br>
wrf_put_dom_ti_integer() at line 369 or 371 then writes the global attribute to the<br>
netCDF file, which is where PnetCDF detects the inconsistent attribute.<br>
Let me know if this helps.<br>
<br>
361 IF ( i .LT. 10 ) THEN<br>
362 write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_0",i1)')i<br>
363 ELSE<br>
364 write(alarmname,'("WRF_ALARM_SECS_TIL_NEXT_RING_",i2)')i<br>
365 ENDIF<br>
366 CALL WRFU_TimeIntervalGet(interval,S=seconds)<br>
367 CALL WRFU_TimeIntervalGet(tmpinterval,S=seconds2)<br>
368 IF ( seconds .GE. 1700000000 .OR. seconds .LE. -1700000000 ) THEN ! it is a forever value, do not change it<br>
369 CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds, 1, ierr )<br>
370 ELSE<br>
371 CALL wrf_put_dom_ti_integer( fid, TRIM(alarmname), seconds-seconds2, 1, ierr )<br>
372 ENDIF<br>
<br>
<br>
Wei-keng<br>
<div>
<div class="h5"><br>
On Sep 30, 2014, at 9:06 AM, Sean Byland wrote:<br>
<br>
> Thanks Wei-keng,<br>
> I’ve attached the ncdump output for one of the two failed domains. When<br>
> the user ran wrf, pnetcdf reported a problem when writing the wrf restart<br>
> file. When I ran it, the restart files appear to have completed, but the<br>
> wrf output file for the second/third domain is incomplete.<br>
> I do see a lot of these warnings in the rsl output files:<br>
><br>
> grep -r -i "inconsistent" rsl.out* | cut -d ':' -f2- | sort -u<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1459748865)<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1462916539)<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -1471304891)<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != 1758243397)<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -389109179)<br>
> Warning (inconsistent metadata): attribute<br>
> "WRF_ALARM_SECS_TIL_NEXT_RING_52" INT (676375365 != -397497531)<br>
><br>
> Sean<br>
><br>
><br>
><br>
> On 9/29/14, 4:59 PM, "Wei-keng Liao" <<a href="mailto:wkliao@eecs.northwestern.edu">wkliao@eecs.northwestern.edu</a>> wrote:<br>
><br>
>> Hi, Sean<br>
>><br>
>> My understanding of WRF (wrf_io.F90, around line 1360) is that when<br>
>> NFMPI_ENDDEF() returns an error code,<br>
>> the program returns without writing any data. So I guess that is<br>
>> probably why the file is much smaller<br>
>> (but you should see additional error messages<br>
>> printed on stdout).<br>
>><br>
>> Could you show us the output of command "ncdump -h"?<br>
>><br>
>> Wei-keng<br>
>><br>
>> On Sep 29, 2014, at 4:25 PM, Sean Byland wrote:<br>
>><br>
>>> By "problem with his application" I mean a problem unrelated to (upstream of) their<br>
>>> usage of parallel-netcdf.<br>
>>><br>
>>> Sean B.<br>
>>><br>
>>> On 9/29/14, 4:21 PM, "Sean Byland" <<a href="mailto:seanb@cray.com">seanb@cray.com</a>> wrote:<br>
>>><br>
>>>> Thanks. When I ran his application one of the output files was an<br>
>>>> order of<br>
>>>> magnitude too small, but parallel-netcdf didn’t report a problem, making<br>
>>>> PNETCDF_SAFE_MODE less useful. This makes me think there’s a problem<br>
>>>> with<br>
>>>> his application (i.e. a race condition).<br>
>>>><br>
>>>> Sean B.<br>
>>>><br>
>>>> On 9/29/14, 4:05 PM, "Rob Latham" <<a href="mailto:robl@mcs.anl.gov">robl@mcs.anl.gov</a>> wrote:<br>
>>>><br>
>>>>><br>
>>>>><br>
>>>>> On 09/29/2014 03:56 PM, Sean Byland wrote:<br>
>>>>>> Rob,<br>
>>>>>> Unfortunately I wasn’t able to reproduce the error, but he did have<br>
>>>>>> an<br>
>>>>>> ncfile lying around so I can provide the ncdump -h output. Not to be<br>
>>>>>> dense but what should I be looking for (I see lots of time values) ?<br>
>>>>>> Knowing almost nothing about pnetcdf, I would think that if different<br>
>>>>>> processes had inconsistent data, wouldn’t they fail on the write and<br>
>>>>>> therefore I wouldn’t be able to observe what values were<br>
>>>>>> inconsistent<br>
>>>>>> ?<br>
>>>>><br>
>>>>> I think you're going to be better served by Wei-keng's environment<br>
>>>>> variable suggestion, but read on for a bit of background:<br>
>>>>><br>
>>>>> Parallel-NetCDF expects the header to be identical on all N MPI<br>
>>>>> processes. How can processes have different data and yet still read<br>
>>>>> the<br>
>>>>> file? Well, the header is pretty simple. It's not too far off to<br>
>>>>> think of it as a big array of (now 64 bit) values.<br>
>>>>><br>
>>>>> On one process you might have the attribute "timestamp" with a value<br>
>>>>> "2014-09-29-16:00:34 CST", followed by the information "the variable<br>
>>>>> Pressure starts at offset 20023423 bytes".<br>
>>>>><br>
>>>>> On another process, you might have the exact same information, except<br>
>>>>> the time stamp is "2014-09-29-23:00:34 GMT". The information about<br>
>>>>> the<br>
>>>>> variable will still start at the same place and contain the same<br>
>>>>> information. Rank 0 will broadcast its version of the header to all<br>
>>>>> the<br>
>>>>> other processes. If any of them differ in any byte, the library will<br>
>>>>> give an error.<br>
>>>>><br>
>>>>> If the check were more involved, we could warn about attribute values<br>
>>>>> that differ slightly but press on if "important" values (which we<br>
>>>>> would<br>
>>>>> have to define) were all consistent.<br>
>>>>><br>
>>>>> ==rob<br>
>>>>><br>
>>>>>><br>
>>>>>> Thanks for any info.<br>
>>>>>><br>
>>>>>> Sean<br>
>>>>>><br>
>>>>>><br>
>>>>>> On 9/19/14, 1:42 PM, "Rob Latham" <<a href="mailto:robl@mcs.anl.gov">robl@mcs.anl.gov</a>> wrote:<br>
>>>>>><br>
>>>>>>><br>
>>>>>>><br>
>>>>>>> On 09/19/2014 12:54 PM, Sean Byland wrote:<br>
>>>>>>>> Hello,<br>
>>>>>>>> A parallel-netcdf user gets this error:<br>
>>>>>>>><br>
>>>>>>>> NetCDF error: File header is inconsistent among processes<br>
>>>>>>>> NetCDF error ( -250 ) from NFMPI_ENDDEF in<br>
>>>>>>>> ext_pnc_open_for_write_commit wrf_io.F90, line 1360<br>
>>>>>>>> med_restart_out: opening wrfrst_d01_2013-06-01_00_10_00 for<br>
>>>>>>>> writing<br>
>>>>>>><br>
>>>>>>> The most common cause for this message is when there's a timestamp<br>
>>>>>>> attribute: the processes are not 100% in lock step, and so create<br>
>>>>>>> ever<br>
>>>>>>> so slightly different timestamps.<br>
>>>>>>><br>
>>>>>>> Can you provide the header of a CCE run? You can use 'ncmpidump -h'<br>
>>>>>>> or<br>
>>>>>>> serial netcdf's 'ncdump -h'<br>
>>>>>>><br>
>>>>>>> ==rob<br>
>>>>>>><br>
>>>>>>>><br>
>>>>>>>> when running software/parallel-netcdf (1.5.0) libraries that were<br>
>>>>>>>> built<br>
>>>>>>>> with CCE, but not with an application/library that was built<br>
>>>>>>>> with<br>
>>>>>>>> Intel’s compiler. I’m still waiting on the user for something that<br>
>>>>>>>> I<br>
>>>>>>>> need to reproduce the error and start experimenting but was hoping<br>
>>>>>>>> that<br>
>>>>>>>> someone on this mailing list might have some useful information or<br>
>>>>>>>> hints<br>
>>>>>>>> about what’s causing this error and how I might fix it (or where I<br>
>>>>>>>> might<br>
>>>>>>>> look). All of the “make check” tests pass. Thanks for any input.<br>
>>>>>>>><br>
>>>>>>>> Sean B<br>
>>>>>>><br>
>>>>>>> --<br>
>>>>>>> Rob Latham<br>
>>>>>>> Mathematics and Computer Science Division<br>
>>>>>>> Argonne National Lab, IL USA<br>
>>>>>><br>
>>>>><br>
>>>>> --<br>
>>>>> Rob Latham<br>
>>>>> Mathematics and Computer Science Division<br>
>>>>> Argonne National Lab, IL USA<br>
>>>><br>
>>><br>
>><br>
><br>
</div>
</div>
> &lt;ncdump.wrfout_d02.out&gt;<br>
<br>
</blockquote>
</div>
<br>
<br clear="all">
<br>
-- <br>
<div dir="ltr">
<div>
<div>
<div>Jim Edwards<br>
<br>
</div>
<font size="1">CESM Software Engineer<br>
</font></div>
<font size="1">National Center for Atmospheric Research<br>
</font></div>
<font size="1">Boulder, CO</font> <br>
</div>
</div>
</div>
</div>
</span>
</body>
</html>