Q: PnetCDF 1.12.2 on FSx (Lustre 2.10.8) - infrequent data corruption issue?

Wei-Keng Liao wkliao at northwestern.edu
Wed Nov 17 17:13:54 CST 2021


Hi, Brian,

I have a few questions.

* What PnetCDF version was used?
* What PnetCDF API is used to write the variable 'time'?
* Is variable 'time' written one element at a time? (A minimal sketch of that pattern is shown after this list.)
* Does this corruption also happen when running one MPI process?
* Are all elements of variable 'time' written, i.e., is none of them skipped?
* When using NetCDF, did you run it sequentially, i.e. on one MPI process?
* Could you also try "ncmpidump"?
* A utility program named "ncvalidator" can be used to check the file header.
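
For reference, below is a minimal sketch (not CESM/PIO code; the file and variable names and the call sequence are only illustrative) of what appending 'time' one element at a time looks like with the PnetCDF C API, using the collective vara call:

  /* sketch: append one 'time' record per step with PnetCDF's collective API;
     error handling and the real CESM/PIO call sequence are omitted */
  #include <mpi.h>
  #include <pnetcdf.h>

  int main(int argc, char **argv)
  {
      int ncid, dimid, varid;
      MPI_Init(&argc, &argv);

      ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
      ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimid);
      ncmpi_def_var(ncid, "time", NC_DOUBLE, 1, &dimid, &varid);
      ncmpi_enddef(ncid);

      for (MPI_Offset rec = 0; rec < 10; rec++) {
          MPI_Offset start = rec, count = 1;
          double tval = 15.0 + 0.125 * (double)rec;   /* one record per step */
          ncmpi_put_vara_double_all(ncid, varid, &start, &count, &tval);
      }

      ncmpi_close(ncid);
      MPI_Finalize();
      return 0;
  }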

Wei-keng

On Nov 10, 2021, at 11:29 AM, Brian Dobbins <bdobbins at ucar.edu> wrote:


Hi all,

  Here's a weird issue I figured I'd see if anyone else has advice on: we're running a climate model, CESM, on AWS, and recently encountered infrequent, non-deterministic corruption of output values, seemingly only when using PnetCDF.  That obviously doesn't mean PnetCDF is the cause, since the problem could also be in MPI-IO or Lustre, but I was hoping to get some ideas on what to try in order to narrow this down.

  A bit of background - the model ran successfully, and it was only when postprocessing output that we noticed the issue, since time variables were corrupted, like below (via 'ncdump'):

 time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,
    16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,
    17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375,
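
Below is a rough sketch of how one could scan 'time' for out-of-place values programmatically instead of eyeballing ncdump output; the file name is a placeholder and error checks are omitted for brevity:

  /* sketch: read 'time' and flag records that break the expected monotonic
     increase; file name is a placeholder, error checks omitted */
  #include <mpi.h>
  #include <pnetcdf.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int ncid, varid, dimid;
      MPI_Offset nrec, i;
      double *t;

      MPI_Init(&argc, &argv);
      ncmpi_open(MPI_COMM_WORLD, "history.nc", NC_NOWRITE, MPI_INFO_NULL, &ncid);
      ncmpi_inq_varid(ncid, "time", &varid);
      ncmpi_inq_vardimid(ncid, varid, &dimid);     /* 'time' has one dimension */
      ncmpi_inq_dimlen(ncid, dimid, &nrec);

      t = (double*) malloc(nrec * sizeof(double));
      ncmpi_get_var_double_all(ncid, varid, t);    /* every rank reads all records */

      for (i = 1; i < nrec; i++)
          if (t[i] <= t[i-1])                      /* 'time' should strictly increase */
              printf("suspect record %lld: %g (previous %g)\n",
                     (long long)i, t[i], t[i-1]);

      free(t);
      ncmpi_close(ncid);
      MPI_Finalize();
      return 0;
  }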

  This happens infrequently; sometimes it's in the first month, sometimes not until the third, and reruns of the same configuration hit it in different places.  I tested builds with OpenMPI 4 and Intel MPI 2021.4, and both show the issue, but I need to go back and check whether OpenMPI is using ROMIO, since if it is, that wouldn't really rule out an MPI-IO issue.  However, just switching our I/O to NetCDF seems to make the problem go away (e.g., we've run 8.5 months with no detectable corruption, whereas in the previous 4 runs it always appeared within the first three months of output).

  I'm also trying to check on Lustre issues - our file system is running Lustre 2.10.8, versus the newer 2.12 series, which is apparently an option now, so I'm going to try reconfiguring with that as well.  I'll also try setting the stripe count to 1 (down from 2), as that might help narrow things down.
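
If it's of use to anyone else, stripe settings can also be requested as MPI-IO hints when the file is created; a minimal sketch is below (the hint names are ROMIO's, the file name and values are placeholders, and striping hints only take effect for newly created files):

  /* sketch: request a stripe count of 1 via MPI-IO hints at file creation;
     hint names are ROMIO's, file name and values are placeholders */
  #include <mpi.h>
  #include <pnetcdf.h>

  int main(int argc, char **argv)
  {
      int ncid;
      MPI_Info info;

      MPI_Init(&argc, &argv);
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "1");        /* Lustre stripe count */
      MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripe size   */

      ncmpi_create(MPI_COMM_WORLD, "history.nc", NC_CLOBBER, info, &ncid);
      /* ... define dimensions/variables and write as usual ... */
      ncmpi_close(ncid);

      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }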

  Any other ideas?  Has anyone seen something similar?

  Thanks,
  - Brian

