Recent inconsistencies with Pnetcdf and MPI

Wei-Keng Liao wkliao at northwestern.edu
Tue Jul 15 11:25:43 CDT 2025


Are these the only error messages?
NC_EBADDIM and NC_ENOTVAR are normal: when defining a new
dimension or variable, PnetCDF first checks whether one of the
same name has already been defined, and that internal check
prints such messages in verbose debug mode.
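
For illustration, a minimal sketch of the calls that produce these
benign messages (the names "x" and "var" here are made up):

    /* ncmpi_def_dim first looks up the name (NC_finddim); for a new
     * name the lookup "fails" with NC_EBADDIM, which verbose debug
     * mode prints even though this is the normal path */
    err = ncmpi_def_dim(ncid, "x", len, &dimid);

    /* likewise, looking up a variable before it has been defined
     * reports NC_ENOTVAR from NC_findvar / ncmpio_inq_varid */
    err = ncmpi_inq_varid(ncid, "var", &varid);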

Are there any lines containing NC_EINVALCOORDS or NC_EEDGE?
Those correspond to the errors you are seeing:
the former is "NetCDF: Index exceeds dimension bound",
the latter is "NetCDF: Start+count exceeds dimension bound".
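
A sketch of when each is returned, for a variable defined over a
dimension of length dimlen (names here are illustrative):

    /* start[0] >= dimlen           -> NC_EINVALCOORDS
     * start[0] + count[0] > dimlen -> NC_EEDGE           */
    MPI_Offset start[1] = { dimlen };   /* one past the end */
    MPI_Offset count[1] = { 1 };
    err = ncmpi_get_vara_int_all(ncid, varid, start, count, buf);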

Wei-keng

On Jul 15, 2025, at 10:58 AM, Divyanshu Gola <divyans at umich.edu> wrote:

Hello,

Thank you for your email. I tried running the test script with the debug flags, and I get the following messages on all ranks, even though the code continues to execute afterwards:

---------------BEGIN ERROR MESSAGES--------------
Rank 144: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 146: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 147: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 148: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 149: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 150: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 151: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 152: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 154: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 156: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 157: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 158: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 116: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 152: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 152: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 81: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 81: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 89: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 89: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 102: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 102: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 103: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 103: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 112: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 112: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 114: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 114: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 119: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 119: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 122: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 122: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 124: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 124: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 136: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
-------------END ERROR MESSAGES---------------

Thanks for your help with this.

Divyanshu


On Mon, Jul 14, 2025 at 12:38 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Divyanshu

When PnetCDF is configured with option "--enable-debug" and
the environment variable PNETCDF_VERBOSE_DEBUG_MODE is set to 1,
additional error messages describing the bound violation will
be printed on screen. It may help you find the source code location
that produces the error.
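
For example, something along these lines (the install prefix and the
launcher are placeholders, not a prescription):

    ./configure --prefix=$HOME/pnetcdf-debug --enable-debug
    make -j8 install

    export PNETCDF_VERBOSE_DEBUG_MODE=1
    mpiexec -n 384 ./your_program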

Wei-keng

On Jul 14, 2025, at 7:56 AM, Divyanshu Gola <divyans at umich.edu> wrote:

Hi Jim,

Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I have tried PnetCDF versions 1.12, 1.13, and 1.14 but get the same error. I really think something goes wrong while writing the file, and as I said in the previous email, this only happens on one specific cluster.

Here is the result of a test script: using PnetCDF and MPI, I write a global variable named var that is an array of the process indices, i.e., each process writes only its own index. For example, with 96 processes, var is [1, 2, 3, 4, ..., 95, 96]. However, when I write the same variable using, say, 384 processes, some of the values in the middle remain zero (the default element value) instead of the respective process index.

Like you said, I am fairly sure this is NOT an issue with PnetCDF but some other inconsistency on the cluster; I just don't know how to identify it, and thought people on this mailing list might have encountered it before. Meanwhile, I got my code to work by switching to plain binary I/O instead of PnetCDF files, which suggests the MPI library itself is okay.
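
A minimal sketch of what the test script does (the file name, variable
layout, and error handling here are illustrative, not my actual code):

    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define CHECK(err) if ((err) != NC_NOERR) {                   \
        fprintf(stderr, "line %d: %s\n", __LINE__,                \
                ncmpi_strerror(err));                             \
        MPI_Abort(MPI_COMM_WORLD, 1); }

    int main(int argc, char **argv) {
        int rank, nprocs, ncid, dimid, varid, err, val;
        MPI_Offset start[1], count[1];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* collectively create the file and define var(n) */
        err = ncmpi_create(MPI_COMM_WORLD, "test.nc", NC_CLOBBER,
                           MPI_INFO_NULL, &ncid);            CHECK(err)
        err = ncmpi_def_dim(ncid, "n", (MPI_Offset)nprocs,
                            &dimid);                         CHECK(err)
        err = ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid,
                            &varid);                         CHECK(err)
        err = ncmpi_enddef(ncid);                            CHECK(err)

        /* each rank writes exactly one element: var[rank] = rank+1 */
        start[0] = rank;  count[0] = 1;  val = rank + 1;
        err = ncmpi_put_vara_int_all(ncid, varid, start, count,
                                     &val);                  CHECK(err)

        err = ncmpi_close(ncid);                             CHECK(err)
        MPI_Finalize();
        return 0;
    }

With 384 ranks every element of var should come back nonzero, so any
zeros left in the middle point at the write path, not the read.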

I appreciate your time with this.

Best

Divyanshu


On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
Hi Divyanshu,

From your description it sounds like the written file is corrupted? And when you say "The error also doesn't appear when I use fewer processes on a single node on the cluster," do you mean fewer processes to write the file, or to read it? It really sounds to me like an application problem and not a pnetcdf or mpi issue. I think you may need to provide an exact description of how the problem is reproduced, including the system you are using, the MPI library and version, the pnetcdf version, and the application.

Jim

On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu> wrote:

Hi,

This is a shot in the dark but I thought why not.

The exact same code that I had been using until a few weeks ago now gives me an error when reading restart files with PnetCDF. The error is "Index exceeds dimension bound" or "Start+count exceeds dimension bound". After days of debugging, I have narrowed it down to a problem during the writing of the restart files (not the reading itself). These errors seem to stem from how PnetCDF is built and from the MPI file system used on the cluster (the same code runs fine on a different cluster), but I can't identify the root cause. The error also does not appear when I use fewer processes on a single node of the cluster.
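
For reference, the failing reads are ordinary subarray requests; the
two messages mean the requested start/count no longer fit inside the
dimension recorded in the file (a sketch, with illustrative names):

    MPI_Offset dimlen;
    err = ncmpi_inq_dimlen(ncid, dimid, &dimlen);
    /* "Index exceeds dimension bound":       start[0] >= dimlen
     * "Start+count exceeds dimension bound": start[0] + count[0] > dimlen */
    err = ncmpi_get_vara_double_all(ncid, varid, start, count, buf);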

I know this is most likely not a bug in the PnetCDF library itself, but I was wondering whether people on this mailing list have encountered a similar issue.

Apologies for the long email and thanks

Divyanshu

Postdoctoral Researcher
University of Michigan


--
Jim Edwards
STAND UP FOR SCIENCE
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO



