Q: PNetCDF 1.12.2 on FSX (Lustre 2.10.8) - infrequent data corruption issue?
Brian Dobbins
bdobbins at ucar.edu
Wed Nov 10 11:29:02 CST 2021
Hi all,
Here's a weird issue I figured I'd see if anyone else has advice on -
we're running a climate model, CESM, on AWS, and recently encountered an
issue where we'd get *infrequent* and non-deterministic corruption of
output values, *seemingly* only happening when using PNetCDF. This
obviously doesn't mean that PNetCDF is the cause, as it could be in MPI-IO
or Lustre, but I was hoping to get some ideas from people on what to try to
narrow this down.
A bit of background - the model ran successfully, and it was only when
postprocessing output that we noticed the issue, since *time* variables
were corrupted, like below (via 'ncdump'):
    time = 15, 15.125, 15.25, 15.375, 15.5, 15.625, 15.75, 15.875, 16, 16.125,
        16.25, 16.375, 16.5, 16.625, 16.75, 16.875, 17, 17.125, 17.25, 17.375,
        17.5, 17.625, 17.75, 17.875, 18, 18.125, 2.41784664780343e-20, 18.375,
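A minimal sketch of how a glitch like this could be flagged automatically, assuming the time axis has a uniform step (0.125 in the excerpt above) and that the first value is intact. The function name `find_suspect_times` and the tolerance are hypothetical, and in practice the values would be read from the output file rather than hard-coded:

```python
import math


def find_suspect_times(times, step, rel_tol=1e-6):
    """Return indices whose value differs from times[0] + i * step.

    Assumes a uniformly spaced time axis and an uncorrupted first entry;
    both are assumptions of this sketch, not properties of every dataset.
    """
    suspects = []
    start = times[0]
    for i, t in enumerate(times):
        expected = start + i * step
        # Non-finite values (NaN/inf) are always treated as suspect.
        if not math.isfinite(t):
            suspects.append(i)
            continue
        if abs(t - expected) > rel_tol * max(1.0, abs(expected)):
            suspects.append(i)
    return suspects


# Rebuild the 28 values from the excerpt, including the one corrupted entry.
sample_times = [15 + 0.125 * i for i in range(28)]
sample_times[26] = 2.41784664780343e-20

print(find_suspect_times(sample_times, 0.125))  # prints [26]
```

Comparing against an expected value rather than only checking monotonicity pins the report to the bad entry itself, instead of also flagging its neighbour.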
This happens infrequently; sometimes it's in the first month, sometimes
not until the third, and reruns of the same configuration have it happen in
different places. I tested builds with OpenMPI 4 and Intel MPI 2021.4,
and both have the issue, but I need to go back and check whether OpenMPI is
using ROMIO, since if so that comparison wouldn't really isolate an MPI-IO
issue.
However, by just setting our I/O to use NetCDF, the problem *seems* to go
away (e.g., we've run 8.5 months with no detectable corruption, whereas in
the previous 4 runs it always happened within the first three months of output).
I'm trying to also check on Lustre issues - our file system is running
Lustre 2.10.8, vs a newer 2.12-series which apparently is an option now, so
I'm going to try reconfiguring with that as well. I'll also try setting
the stripe count to 1 (from just 2), as that might also help narrow things
down.
Any other ideas? Has anyone seen something similar?
Thanks,
- Brian