<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi, Brian,
<div class=""><br class="">
</div>
<div class="">I have a few questions.
<div class=""><br class="">
</div>
<div class="">* What PnetCDF version was used?</div>
<div class="">* What PnetCDF API is used to write the variable 'time'?</div>
<div class="">* Is variable 'time' written one element at a time?</div>
<div class="">* Does this corruption also happen when running one MPI process?</div>
<div class="">* Are all elements of variable 'time' written? i.e. none of them is skipped.</div>
<div class="">* When using NetCDF, did you run it sequentially, i.e. on one MPI process?</div>
<div class="">* Could you also try "ncmpidump"?</div>
<div class="">* A utility program named "ncvalidator" can be used to check the file header.</div>
<div class=""><br class="">
</div>
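<div class="">To locate the bad entries across many output files, a check along the following lines may help. It is only a minimal sketch, not CESM or PnetCDF code: it assumes the 'time' axis is uniformly spaced and that its first value is intact. The helper name find_suspect_times is hypothetical, and the 0.125 step and sample values are taken from the ncdump output quoted below. The values would first have to be read from the file with whatever NetCDF reader is available.</div>
<pre class="">
```python
import math


def find_suspect_times(times, step, rel_tol=1e-6):
    """Return (index, value) pairs that do not match times[0] + i * step.

    Assumption: the first element is intact and the spacing is uniform.
    """
    suspects = []
    for i, value in enumerate(times):
        expected = times[0] + i * step
        if not math.isclose(value, expected, rel_tol=rel_tol, abs_tol=1e-9):
            suspects.append((i, value))
    return suspects


# Last row of the quoted ncdump output; the value at index 6 is the corrupted one.
times = [17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375]
print(find_suspect_times(times, step=0.125))
```
</pre>
<div class="">Because the neighbouring values are intact in the quoted sample, the check flags only the single element at index 6, which is consistent with one small write being lost or overwritten rather than a whole block.</div>
<div class=""><br class="">
</div>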
<div class="">
<div class="">Wei-keng </div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Nov 10, 2021, at 11:29 AM, Brian Dobbins <<a href="mailto:bdobbins@ucar.edu" class="">bdobbins@ucar.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class=""><br class="">
<div class="">Hi all,</div>
<div class=""><br class="">
</div>
<div class=""> Here's a weird issue I figured I'd see if anyone else has advice on - we're running a climate model, CESM, on AWS, and recently encountered an issue where we'd get
<i class="">infrequent </i>and non-deterministic corruption of output values, <i class="">
seemingly</i> only happening when using PNetCDF. This obviously doesn't mean that PNetCDF is the cause, as it could be in MPI-IO or Lustre, but I was hoping to get some ideas from people on what to try to narrow this down.</div>
<div class=""><br class="">
</div>
<div class=""> A bit of background - the model ran successfully, and it was only when postprocessing output that we noticed the issue, since
<i class="">time</i> variables were corrupted, like below (via 'ncdump'):</div>
<div class=""><br class="">
</div>
<div class="">
<div style="margin: 0px; font-stretch: normal; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures:no-common-ligatures" class=""><i class=""><span class="gmail-Apple-converted-space"> </span>time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,<span class="gmail-Apple-converted-space"> </span></i></span></div>
<div style="margin: 0px; font-stretch: normal; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures:no-common-ligatures" class=""><i class=""><span class="gmail-Apple-converted-space">
</span>16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,<span class="gmail-Apple-converted-space"> </span></i></span></div>
<div style="margin: 0px; font-stretch: normal; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures:no-common-ligatures" class=""><i class=""><span class="gmail-Apple-converted-space">
</span>17.5, 17.625, 17.75, 17.875, 18, 18.125, <u class="">2.41784664780343e-20</u>, 18.375,<span class="gmail-Apple-converted-space"> </span></i></span></div>
</div>
<div class=""><br class="">
</div>
<div class=""> This happens infrequently; sometimes it's in the first month, sometimes not until the third, and reruns of the same configuration have it happen in different places. I tested a version with OpenMPI 4 and Intel MPI 2021.4, and both have the
issue, but I need to go back and see if OpenMPI is using ROMIO since that wouldn't really isolate out an MPI-IO issue if so. However, by just setting our I/O to use NetCDF, the problem
<i class="">seems</i> to go away (eg, we've run 8.5 months with no detectable corruption, whereas in the past 4 runs it always happened in the first three months of output). </div>
<div class=""><br class="">
</div>
<div class=""> I'm trying to also check on Lustre issues - our file system is running Lustre 2.10.8, vs a newer 2.12-series which apparently is an option now, so I'm going to try reconfiguring with that as well. I'll also try setting the stripe count to 1
(from just 2), as that might also help narrow things down.</div>
<div class=""><br class="">
</div>
<div class=""> Any other ideas? Has anyone seen something similar?</div>
<div class=""><br class="">
</div>
<div class=""> Thanks,</div>
<div class=""> - Brian</div>
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</body>
</html>