<div dir="ltr"><br><div>Hi all,</div><div><br></div><div> Here's a weird issue I figured I'd ask about in case anyone has advice - we're running a climate model, CESM, on AWS, and recently started seeing <i>infrequent </i>and non-deterministic corruption of output values, <i>seemingly</i> only when using PNetCDF. That obviously doesn't mean PNetCDF itself is the cause - the fault could just as well be in MPI-IO or Lustre - but I was hoping to get some ideas on what to try to narrow this down.</div><div><br></div><div> A bit of background - the model ran successfully, and it was only when postprocessing the output that we noticed the problem, since the <i>time</i> variables were corrupted, like below (via 'ncdump'):</div><div><br></div><div><p style="margin:0px;font-stretch:normal;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span style="font-variant-ligatures:no-common-ligatures"><i><span class="gmail-Apple-converted-space"> </span>time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,<span class="gmail-Apple-converted-space"> </span></i></span></p>
<p style="margin:0px;font-stretch:normal;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span style="font-variant-ligatures:no-common-ligatures"><i><span class="gmail-Apple-converted-space"> </span>16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,<span class="gmail-Apple-converted-space"> </span></i></span></p>
<p style="margin:0px;font-stretch:normal;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span style="font-variant-ligatures:no-common-ligatures"><i><span class="gmail-Apple-converted-space"> </span>17.5, 17.625, 17.75, 17.875, 18, 18.125, <u>2.41784664780343e-20</u>, 18.375,<span class="gmail-Apple-converted-space"> </span></i></span></p></div><div><br></div><div> This happens infrequently; sometimes it's in the first month of output, sometimes not until the third, and reruns of the same configuration hit it in different places. I tested builds with OpenMPI 4 and Intel MPI 2021.4, and both show the issue, but I need to go back and check whether OpenMPI is using ROMIO, since if so that wouldn't really rule out an MPI-IO bug. However, by just switching our I/O to NetCDF, the problem <i>seems</i> to go away (e.g., we've now run 8.5 months with no detectable corruption, whereas in the previous 4 runs it always appeared within the first three months of output).</div><div><br></div><div> I'm also trying to check for Lustre issues - our file system is running Lustre 2.10.8, and a newer 2.12-series release is apparently an option now, so I'm going to try reconfiguring with that as well. I'll also try setting the stripe count to 1 (down from our current 2), which might help narrow things down further.</div><div><br></div><div> Any other ideas? Has anyone seen something similar?</div><div><br></div><div> Thanks,</div><div> - Brian</div><div><br></div></div>
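P.S. In case it helps anyone reproduce or hunt for this, here's a sketch of the check I've been scripting to scan output for corrupted time values. It's plain Python over an already-extracted list of times; the function name, step size (0.125 days for our eighth-daily output), and tolerance are just assumptions for the example - in practice I dump the time variable from each history file first:

```python
# Sketch: flag entries in a nominally uniform time series that deviate
# from the arithmetic progression implied by the first value and the
# expected step. A wildly out-of-range value like 2.4e-20 stands out
# immediately against an expected step of 0.125 days.
def find_corrupt_times(times, step, tol=1e-6):
    """Return (index, value) pairs that break the expected uniform series."""
    return [(i, t) for i, t in enumerate(times)
            if abs(t - (times[0] + i * step)) > tol]

# Example using the corrupted series from the ncdump excerpt above:
series = [15 + 0.125 * i for i in range(28)]
series[26] = 2.41784664780343e-20   # the corrupted slot (should be 18.25)
print(find_corrupt_times(series, 0.125))
# -> [(26, 2.41784664780343e-20)]
```

Predicting each value from the first entry (rather than checking neighbor-to-neighbor deltas) means a single corrupted slot is flagged exactly once instead of twice.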