Q: PNetCDF 1.12.2 on FSX (Lustre 2.10.8) - infrequent data corruption issue?

Brian Dobbins bdobbins at ucar.edu
Wed Nov 10 11:29:02 CST 2021


Hi all,

  Here's a weird issue that I'm hoping someone has advice on: we're
running a climate model, CESM, on AWS, and recently encountered
*infrequent* and non-deterministic corruption of output values,
*seemingly* only when using PNetCDF.  This obviously doesn't mean that
PNetCDF is the cause, as the problem could be in MPI-IO or Lustre, but
I was hoping to get some ideas from people on what to try to narrow
this down.

  A bit of background - the model ran successfully, and it was only when
postprocessing output that we noticed the issue, since *time* variables
were corrupted, like below (via 'ncdump'):

 time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,
    16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,
    17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375,
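
  A corrupted entry like this can be flagged automatically rather than
spotted by eye.  The following is only a minimal sketch, not part of CESM
or PNetCDF: the function name `find_time_anomalies` is hypothetical, it
assumes the time axis is evenly spaced (as the excerpt above suggests, with
a step of 0.125), and it assumes that only a small minority of entries are
corrupted.  It works on a plain list of values, so reading them out of the
file is left to whatever NetCDF reader is at hand.

```python
from statistics import median


def find_time_anomalies(values, tol=1e-6):
    """Return (index, value, expected) for entries off a regular time axis.

    Assumes the axis is evenly spaced and that most entries are intact,
    so the median step and median origin are not skewed by bad values.
    """
    if len(values) < 3:
        return []
    # Median of consecutive differences: robust estimate of the step.
    step = median(b - a for a, b in zip(values, values[1:]))
    # Median of (value - index * step): robust estimate of the first value.
    origin = median(v - i * step for i, v in enumerate(values))
    anomalies = []
    for i, v in enumerate(values):
        expected = origin + i * step
        if abs(v - expected) > tol:
            anomalies.append((i, v, expected))
    return anomalies


# The values from the ncdump excerpt above.
times = [
    15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,
    16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,
    17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375,
]

for index, value, expected in find_time_anomalies(times):
    print(f"index {index}: got {value!r}, expected about {expected!r}")
```

  Running a check like this over every output file right after it is
written would at least pin down which write first went wrong in a given
run.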

  This happens infrequently; sometimes it's in the first month, sometimes
not until the third, and reruns of the same configuration have it happen in
different places.  I tested builds with both OpenMPI 4 and Intel MPI 2021.4,
and both show the issue, but I need to go back and check whether OpenMPI is
using ROMIO, since if so that test wouldn't really rule out an MPI-IO issue.
However, by just setting our I/O to use NetCDF, the problem *seems* to go
away (e.g., we've run 8.5 months with no detectable corruption, whereas in
the past 4 runs it always happened in the first three months of output).

  I'm also trying to check for Lustre issues - our file system is running
Lustre 2.10.8, versus a newer 2.12-series release which apparently is an
option now, so I'm going to try reconfiguring with that as well.  I'll also
try setting the stripe count to 1 (down from the current 2), as that might
also help narrow things down.

  Any other ideas?  Has anyone seen something similar?

  Thanks,
  - Brian