Q: PNetCDF 1.12.2 on FSX (Lustre 2.10.8) - infrequent data corruption issue?
Wei-Keng Liao
wkliao at northwestern.edu
Wed Nov 17 17:13:54 CST 2021
Hi, Brian,
I have a few questions.
* What PnetCDF version was used?
* What PnetCDF API was used to write the variable 'time'?
* Was variable 'time' written one element at a time? (See the sketch after this list.)
* Does this corruption also happen when running one MPI process?
* Are all elements of variable 'time' written, i.e., none of them skipped?
* When using NetCDF, did you run it sequentially, i.e., on one MPI process?
* Could you also try "ncmpidump"?
* A utility program named "ncvalidator" can be used to check the file header.
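For reference, below is a minimal, hedged sketch (not CESM code; the file name, record count, and values are invented to mimic the ncdump output later in this thread) of writing a record variable 'time' one element per record with the PnetCDF C API:

    /* Hedged sketch, not CESM code: write the record variable "time" one
     * element per record with the PnetCDF C API. File name, record count,
     * and values are invented to mimic the ncdump output in this thread. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define CHECK(err) if ((err) != NC_NOERR) { \
        fprintf(stderr, "line %d: %s\n", __LINE__, ncmpi_strerror(err)); \
        MPI_Abort(MPI_COMM_WORLD, -1); }

    int main(int argc, char **argv) {
        int err, ncid, dimid, varid;
        MPI_Init(&argc, &argv);

        err = ncmpi_create(MPI_COMM_WORLD, "time_test.nc", NC_CLOBBER,
                           MPI_INFO_NULL, &ncid);                         CHECK(err)
        err = ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimid);          CHECK(err)
        err = ncmpi_def_var(ncid, "time", NC_DOUBLE, 1, &dimid, &varid);  CHECK(err)
        err = ncmpi_enddef(ncid);                                         CHECK(err)

        /* the "one element at a time" pattern asked about above */
        for (MPI_Offset rec = 0; rec < 28; rec++) {
            double val = 15.0 + 0.125 * (double)rec;
            err = ncmpi_put_var1_double_all(ncid, varid, &rec, &val);     CHECK(err)
        }

        err = ncmpi_close(ncid);                                          CHECK(err)
        MPI_Finalize();
        return 0;
    }

A file written this way can then be dumped with ncmpidump and its header checked with ncvalidator, both of which are installed with PnetCDF.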
Wei-keng
On Nov 10, 2021, at 11:29 AM, Brian Dobbins <bdobbins at ucar.edu> wrote:
Hi all,
Here's a weird issue I figured I'd see if anyone else has advice on - we're running a climate model, CESM, on AWS, and recently encountered infrequent, non-deterministic corruption of output values, seemingly only when using PnetCDF. That obviously doesn't mean PnetCDF is the cause, since it could be MPI-IO or Lustre, but I was hoping to get some ideas on how to narrow this down.
A bit of background - the model ran successfully, and it was only when postprocessing the output that we noticed the issue, since the time variable was corrupted, as below (via 'ncdump'):
time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,
16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,
17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375,
This happens infrequently; sometimes it's in the first month, sometimes not until the third, and reruns of the same configuration hit it in different places. I tested with both OpenMPI 4 and Intel MPI 2021.4, and both show the issue, but I need to go back and check whether OpenMPI is using ROMIO, since if so that wouldn't really isolate out an MPI-IO issue. However, just by setting our I/O to use NetCDF, the problem seems to go away (e.g., we've run 8.5 months with no detectable corruption, whereas in the previous 4 runs it always appeared within the first three months of output).
I'm also trying to check on Lustre issues - our file system is running Lustre 2.10.8, versus the newer 2.12 series that is apparently an option now, so I'm going to try reconfiguring with that as well. I'll also try setting the stripe count to 1 (down from 2), as that might help narrow things down.
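In case it helps, here is a hedged sketch of one way to request a stripe count of 1 through MPI-IO hints at file-creation time, as an alternative to restriping the directory with lfs. The function name and file handling are illustrative, and whether the hints take effect depends on the MPI library's Lustre driver:

    /* Hedged sketch: ask for a Lustre stripe count of 1 via MPI-IO hints
     * when PnetCDF creates the file. "striping_factor"/"striping_unit" are
     * the hint names ROMIO's Lustre driver recognizes; they only apply to
     * newly created files and may be ignored by other MPI-IO layers. */
    #include <mpi.h>
    #include <pnetcdf.h>

    int create_with_stripe_count_1(MPI_Comm comm, const char *path, int *ncidp)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "1");        /* stripe count */
        /* MPI_Info_set(info, "striping_unit", "1048576");    stripe size  */

        int err = ncmpi_create(comm, path, NC_CLOBBER, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }

After the run, ncmpi_inq_file_info (or setting hints through the PNETCDF_HINTS environment variable) can help confirm which hints were actually applied.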
Any other ideas? Has anyone seen something similar?
Thanks,
- Brian