Recent inconsistencies with Pnetcdf and MPI
Divyanshu Gola
divyans at umich.edu
Tue Jul 15 12:28:32 CDT 2025
Hello,
With the simple I/O test script (attached), I have observed four different
scenarios depending on the PnetCDF build and runtime environment, and would
like your comments on interpreting these results:
- If I test the program using a PnetCDF build configured with
*--enable-debug* and with *PNETCDF_VERBOSE_DEBUG_MODE=1* set, then a
*multi-node run* only triggers the NC_EBADDIM and NC_ENOTVAR errors at
startup, and the *write and read operations proceed correctly*.
- If I test the program using a PnetCDF build configured with
*--enable-debug* but do *not* set PNETCDF_VERBOSE_DEBUG_MODE, then a
*multi-node run* triggers no NC_EBADDIM or NC_ENOTVAR errors, and the
write and read operations produce *incorrect write results silently*.
- If I test the program using a PnetCDF build *not* configured with
--enable-debug, a *multi-node run* of the same write and read operations
also produces *incorrect write results silently*.
- A *single-node run* produces *correct write and read results* regardless
of the build configuration and environment settings.
In addition,
- With --enable-debug and PNETCDF_VERBOSE_DEBUG_MODE=1, PnetCDF *fails* the
*make ptests* check on both 1 node and multiple nodes.
- With a release build, PnetCDF *passes make ptests* on *1 node*, but on
multi-node it *issues a buffer mismatch* message during the mcoll_perf
test, yet still reports 'pass' for all the tests.
- Could these errors be due to the NFS v3.0 file system being used? (A
rough sketch of how one might test this with MPI-IO hints follows below.)
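As a rough sketch of how the NFS question could be probed (an assumption on
my part, not something I have verified on this cluster): pass explicit
MPI-IO hints when creating the file, so that the effect of data sieving and
collective buffering on the NFS mount can be toggled. The hint names below
are the standard ROMIO ones; OpenMPI typically defaults to its OMPIO
component, which may ignore them, so this is only a diagnostic idea:

/* Rough sketch: create the file with explicit MPI-IO hints to test whether
 * data sieving / collective buffering interacts badly with the NFS mount.
 * Hint names are standard ROMIO hints; their effect here is unverified. */
#include <mpi.h>
#include <pnetcdf.h>

int create_with_hints(const char *path, int *ncidp)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable"); /* no write data sieving */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* force collective buffering */

    err = ncmpi_create(MPI_COMM_WORLD, path, NC_CLOBBER | NC_64BIT_DATA,
                       info, ncidp);
    MPI_Info_free(&info);
    return err;   /* NC_NOERR on success */
}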
I really appreciate your time with this.
Best
Divyanshu
On Tue, Jul 15, 2025 at 12:25 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:
> Are these only the error messages?
> NC_EBADDIM and NC_ENOTVAR are normal: when creating a new
> dimension or variable, PnetCDF first checks whether it has already
> been defined, and that lookup prints such messages when nothing is found.
>
> Are there lines containing NC_EINVALCOORDS or NC_EEDGE?
> They correspond to the error messages you are seeing.
> The former is "NetCDF: Index exceeds dimension bound";
> the latter is "NetCDF: Start+count exceeds dimension bound".
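>
> As a rough illustration of what these two codes mean (a sketch of the
> documented semantics, not the actual check inside PnetCDF): for a
> fixed-size dimension of length dim_len, a vara-style start/count pair
> violates the bounds roughly as follows.
>
> #include <pnetcdf.h>  /* brings in mpi.h for MPI_Offset and defines the NC_* codes */
>
> static int check_bounds(MPI_Offset start, MPI_Offset count, MPI_Offset dim_len)
> {
>     if (start < 0 || start >= dim_len)  /* index outside the dimension */
>         return NC_EINVALCOORDS;         /* "Index exceeds dimension bound" */
>     if (start + count > dim_len)        /* access runs past the end */
>         return NC_EEDGE;                /* "Start+count exceeds dimension bound" */
>     return NC_NOERR;
> }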
>
> Wei-keng
>
> On Jul 15, 2025, at 10:58 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>
> Hello,
>
> Thank you for your email. I tried running the test script using the debug
> flags, and I get the following messages on all ranks, even though the code
> continues to execute afterwards:
>
> ---------------BEGIN ERROR MESSAGES--------------
> Rank 144: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 146: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 147: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 148: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 149: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 150: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 151: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 152: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 154: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 156: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 157: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 158: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 116: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
> Rank 152: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 152: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 81: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 81: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 89: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 89: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 102: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 102: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 103: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 103: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 112: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 112: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 114: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 114: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 119: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 119: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 122: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 122: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 124: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> Rank 124: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
> Rank 136: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
> -------------END ERROR MESSAGES---------------
>
> Thanks for your help with this.
>
> Divyanshu
>
>
> On Mon, Jul 14, 2025 at 12:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> Hi, Divyanshu
>>
>> When PnetCDF is configured with option "--enable-debug" and
>> the environment variable PNETCDF_VERBOSE_DEBUG_MODE is set to 1,
>> additional error messages describing the bound violation will
>> be printed on screen. It may help you find the source code location
>> that produces the error.
>>
>> Wei-keng
>>
>> On Jul 14, 2025, at 7:56 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>>
>> Hi Jim,
>>
>> Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I have
>> tried PnetCDF versions 1.12, 1.13, and 1.14 and get the same error with each.
>> I really think there is some inconsistency during the writing of the file,
>> and as I said in the previous email, this only happens on one specific
>> cluster. Here is the result of a test script: using PnetCDF and MPI, I write
>> a global variable named *var*, an array holding the index of each process,
>> i.e., each process writes only its own index. For example, with 96 processes
>> var is [1, 2, 3, 4, ..., 95, 96]. However, when I write the same variable
>> with, say, 384 processes, some of the values in the middle remain zero (the
>> default element value) instead of the respective process index. Like you
>> said, I am pretty sure this is *NOT* an issue with PnetCDF but some other
>> inconsistency within the cluster; I just don't know how to identify it and
>> thought people on the mailing list might have encountered this before.
>> Meanwhile, I got my code to work by doing simple binary I/O instead of
>> PnetCDF files, which suggests that MPI by itself is okay, I guess?
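>>
>> For reference, a minimal C sketch of what the test does (the attached
>> script is in Fortran; the file name, variable name, and CDF-5 flag below
>> are just illustrative). Each rank writes its 1-based index into one
>> element of a shared 1-D variable and then reads it back:
>>
>> #include <stdio.h>
>> #include <mpi.h>
>> #include <pnetcdf.h>
>>
>> #define CHECK(err, msg) do { if ((err) != NC_NOERR) {                     \
>>     fprintf(stderr, "Rank %d: %s: %s\n", rank, msg, ncmpi_strerror(err)); \
>>     MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, nprocs, err, ncid, dimid, varid, val;
>>     MPI_Offset start[1], count[1];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>
>>     err = ncmpi_create(MPI_COMM_WORLD, "test.nc", NC_CLOBBER | NC_64BIT_DATA,
>>                        MPI_INFO_NULL, &ncid);                  CHECK(err, "create");
>>     err = ncmpi_def_dim(ncid, "n", (MPI_Offset)nprocs, &dimid);  CHECK(err, "def_dim");
>>     err = ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); CHECK(err, "def_var");
>>     err = ncmpi_enddef(ncid);                                    CHECK(err, "enddef");
>>
>>     start[0] = rank;  count[0] = 1;  val = rank + 1;   /* var[rank] = rank+1 */
>>     err = ncmpi_put_vara_int_all(ncid, varid, start, count, &val);
>>     CHECK(err, "put_vara");
>>     err = ncmpi_close(ncid);                                     CHECK(err, "close");
>>
>>     /* reopen and verify the element this rank wrote */
>>     err = ncmpi_open(MPI_COMM_WORLD, "test.nc", NC_NOWRITE, MPI_INFO_NULL, &ncid);
>>     CHECK(err, "open");
>>     err = ncmpi_inq_varid(ncid, "var", &varid);                  CHECK(err, "inq_varid");
>>     val = -1;
>>     err = ncmpi_get_vara_int_all(ncid, varid, start, count, &val);
>>     CHECK(err, "get_vara");
>>     if (val != rank + 1)
>>         fprintf(stderr, "Rank %d: read %d, expected %d\n", rank, val, rank + 1);
>>     err = ncmpi_close(ncid);                                     CHECK(err, "close");
>>
>>     MPI_Finalize();
>>     return 0;
>> }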
>>
>> I appreciate your time with this.
>>
>> Best
>>
>> Divyanshu
>>
>>
>> On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
>>
>>> Hi Divyanshu,
>>>
>>> From your description it sounds like the file written is corrupted? And
>>> when you say "The error also doesn't appear when I use fewer processes on a
>>> single node on the cluster." Do you
>>> mean use fewer processes to write the file or to read the file? It
>>> really sounds to me like an application problem and not a pnetcdf or mpi
>>> issue. I think that you may need to provide
>>> an exact description of how the problem is created including the system
>>> you are using, the mpi library and version as well as the pnetcdf version
>>> and the application.
>>>
>>> Jim
>>>
>>> On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu>
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> This is a shot in the dark but I thought why not.
>>>>
>>>> The exact same code that I had been using until a few weeks ago now gives
>>>> me an error when I try to read restart files using PnetCDF. The error is
>>>> *Index exceeds dimension bounds* or *Start+Count exceeds dimension bounds*.
>>>> Based on days of debugging, I have narrowed it down to some problem during
>>>> the writing of the restart files (not the reading itself). All of these
>>>> errors seem to originate from the way PnetCDF is built and the MPI file
>>>> system used on the cluster (the same code runs fine on a different
>>>> cluster), but I can't seem to identify the root cause. The error also does
>>>> not appear when I use fewer processes on a single node of the cluster.
>>>>
>>>> I know this is most likely not a bug in the PnetCDF library itself but
>>>> something else; still, I was wondering whether people on this mailing list
>>>> have encountered a similar issue.
>>>>
>>>> Apologies for the long email and thanks
>>>>
>>>> Divyanshu
>>>>
>>>> Postdoctoral Researcher
>>>> University of Michigan
>>>>
>>>
>>>
>>> --
>>> Jim Edwards
>>> STAND UP FOR SCIENCE
>>> CESM Software Engineer
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>>
>>
>>
>
[Attachment: test_pnetcdf_io.f90 (text/x-fortran, 3169 bytes):
<http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20250715/6b4d283d/attachment-0001.bin>]