Recent inconsistencies with Pnetcdf and MPI

Divyanshu Gola divyans at umich.edu
Tue Jul 15 10:58:46 CDT 2025


Hello,

Thank you for your email. I ran the test script with the debug flags enabled
and get the following messages on all ranks, even though the code is still
able to execute after that:

---------------BEGIN ERROR MESSAGES--------------
Rank 144: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 146: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 147: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 148: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 149: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 150: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 151: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 152: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 154: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 156: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 157: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 158: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 116: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
Rank 152: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 152: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 81: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 81: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 89: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 89: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 102: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 102: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 103: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 103: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 112: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 112: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 114: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 114: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 119: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 119: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 122: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 122: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 124: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
Rank 124: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
Rank 136: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
-------------END ERROR MESSAGES---------------

Thanks for your help with this.

Divyanshu


On Mon, Jul 14, 2025 at 12:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:

> Hi, Divyanshu
>
> When PnetCDF is configured with option "--enable-debug" and
> the environment variable PNETCDF_VERBOSE_DEBUG_MODE is set to 1,
> additional error messages describing the bound violation will
> be printed on screen. It may help you find the source code location
> that produces the error.
>
> Wei-keng
>
> On Jul 14, 2025, at 7:56 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>
> Hi Jim,
>
> Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I have
> tried PnetCDF versions 1.12, 1.13, and 1.14 but get the same error. I really
> think there is some inconsistency while the file is being written, and, as I
> said in the previous email, this only happens on one specific cluster.
> Here is the result of a test script: using PnetCDF and MPI, I write a global
> variable named *var*, an array holding the index of each process, i.e., each
> process writes only its own index. For example, with 96 processes, var is
> [1, 2, 3, 4, ..., 95, 96]. However, when I write the same variable using,
> say, 384 processes, some of the values in the middle remain zero (the
> default element value) instead of the respective process index. Like you
> said, I am pretty sure this is *NOT* an issue with PnetCDF but some other
> inconsistency within the cluster; I just don't know how to identify it and
> thought people on the mailing list might have encountered this before.
> Meanwhile, I got my code to work by doing simple binary I/O instead of
> PnetCDF files, which suggests the MPI installation itself is also okay,
> I guess?
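>
> For concreteness, here is a minimal sketch of the kind of test program
> described above, written against the PnetCDF C API. The file name
> "test_var.nc", the dimension name "n", the CDF-5 format flag, and writing
> rank+1 as the value are illustrative choices, not the exact code I run:
>
> #include <stdio.h>
> #include <mpi.h>
> #include <pnetcdf.h>
>
> /* abort with a readable message if a PnetCDF call fails */
> #define CHECK(err, msg) do { \
>     if ((err) != NC_NOERR) { \
>         fprintf(stderr, "%s: %s\n", (msg), ncmpi_strerror(err)); \
>         MPI_Abort(MPI_COMM_WORLD, 1); \
>     } \
> } while (0)
>
> int main(int argc, char **argv) {
>     int rank, nprocs, ncid, dimid, varid, err, val;
>     MPI_Offset start[1], count[1];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>     /* create the file collectively */
>     err = ncmpi_create(MPI_COMM_WORLD, "test_var.nc",
>                        NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);
>     CHECK(err, "ncmpi_create");
>
>     /* one global dimension of length nprocs, one integer variable "var" */
>     err = ncmpi_def_dim(ncid, "n", (MPI_Offset)nprocs, &dimid);
>     CHECK(err, "ncmpi_def_dim");
>     err = ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid);
>     CHECK(err, "ncmpi_def_var");
>     err = ncmpi_enddef(ncid);
>     CHECK(err, "ncmpi_enddef");
>
>     /* each rank writes its own index (rank+1) at offset rank, collectively */
>     val      = rank + 1;
>     start[0] = rank;
>     count[0] = 1;
>     err = ncmpi_put_vara_int_all(ncid, varid, start, count, &val);
>     CHECK(err, "ncmpi_put_vara_int_all");
>
>     err = ncmpi_close(ncid);
>     CHECK(err, "ncmpi_close");
>
>     MPI_Finalize();
>     return 0;
> }
>
> Dumping var afterwards (e.g. with ncmpidump) should show 1..nprocs; the
> zeros I see with 384 processes indicate that those ranks' writes never
> made it into the file.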
>
> I appreciate your time with this.
>
> Best
>
> Divyanshu
>
>
> On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
>
>> Hi Divyanshu,
>>
>> From your description it sounds like the written file is corrupted? And
>> when you say "The error also doesn't appear when I use fewer processes on
>> a single node on the cluster," do you mean fewer processes to write the
>> file or to read it? It really sounds to me like an application problem and
>> not a PnetCDF or MPI issue. I think you may need to provide an exact
>> description of how the problem arises, including the system you are using,
>> the MPI library and version, the PnetCDF version, and the application.
>>
>> Jim
>>
>> On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu> wrote:
>>
>>>
>>> Hi,
>>>
>>> This is a shot in the dark but I thought why not.
>>>
>>> The exact same code that I had been using until a few weeks ago now gives
>>> me an error when I try to read restart files with PnetCDF. The error is
>>> *Index exceeds dimension bounds, or Start+Count exceeds dimension bounds*.
>>> Based on days of debugging, I have narrowed it down to some problem during
>>> the writing of the restart files (and not the reading itself). The errors
>>> seem to originate from the way PnetCDF is built and the MPI file system
>>> used on the cluster (the same code runs fine on a different cluster), but
>>> I can't identify the root cause. The error also doesn't appear when I use
>>> fewer processes on a single node on the cluster.
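>>>
>>> For reference, those messages are what PnetCDF returns when a read
>>> request's start (or start+count) is larger than the dimension length
>>> recorded in the file header. Below is a sketch of the kind of check one
>>> can add on the read side to see which ranks would violate the bound; the
>>> file name "restart.nc" and variable name "var" are placeholders, not my
>>> actual code:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> #include <pnetcdf.h>
>>>
>>> int main(int argc, char **argv) {
>>>     int rank, ncid, varid, dimid, err, val = 0;
>>>     MPI_Offset dimlen, start[1], count[1];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     err = ncmpi_open(MPI_COMM_WORLD, "restart.nc", NC_NOWRITE,
>>>                      MPI_INFO_NULL, &ncid);
>>>     if (err != NC_NOERR) {
>>>         fprintf(stderr, "ncmpi_open: %s\n", ncmpi_strerror(err));
>>>         MPI_Abort(MPI_COMM_WORLD, 1);
>>>     }
>>>
>>>     /* look up the variable and the length of its (single) dimension
>>>      * as recorded in the file header */
>>>     err = ncmpi_inq_varid(ncid, "var", &varid);
>>>     if (err == NC_NOERR) err = ncmpi_inq_vardimid(ncid, varid, &dimid);
>>>     if (err == NC_NOERR) err = ncmpi_inq_dimlen(ncid, dimid, &dimlen);
>>>     if (err != NC_NOERR) {
>>>         fprintf(stderr, "rank %d: header lookup: %s\n",
>>>                 rank, ncmpi_strerror(err));
>>>         MPI_Abort(MPI_COMM_WORLD, 1);
>>>     }
>>>
>>>     start[0] = rank;   /* this rank's slot in the variable */
>>>     count[0] = 1;
>>>
>>>     /* the bound PnetCDF enforces: start+count must not exceed dimlen */
>>>     if (start[0] + count[0] > dimlen)
>>>         fprintf(stderr, "rank %d: start+count=%lld exceeds dimlen=%lld\n",
>>>                 rank, (long long)(start[0] + count[0]), (long long)dimlen);
>>>
>>>     err = ncmpi_get_vara_int_all(ncid, varid, start, count, &val);
>>>     if (err != NC_NOERR)
>>>         fprintf(stderr, "rank %d: ncmpi_get_vara_int_all: %s\n",
>>>                 rank, ncmpi_strerror(err));
>>>
>>>     ncmpi_close(ncid);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> If the restart file had been written with a wrong or truncated dimension
>>> length, the higher ranks would fail this check, which would be consistent
>>> with the writing-side problem described above.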
>>>
>>> I know this is most likely not a bug in the PnetCDF library itself but in
>>> something else; still, I was wondering whether people on this mailing list
>>> have encountered a similar issue.
>>>
>>> Apologies for the long email, and thanks.
>>>
>>> Divyanshu
>>>
>>> Postdoctoral Researcher
>>> University of Michigan
>>>
>>
>>
>> --
>> Jim Edwards
>> STAND UP FOR SCIENCE
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>>
>
>