Recent inconsistencies with Pnetcdf and MPI

Jim Edwards jedwards at ucar.edu
Wed Jul 16 12:09:29 CDT 2025


Divyanshu,

You might consider using the ParallelIO project
<https://github.com/NCAR/ParallelIO>, which can output through PnetCDF when
a parallel file system is available and through NetCDF when only NFS is
available.

Jim

On Wed, Jul 16, 2025 at 12:59 PM Divyanshu Gola <divyans at umich.edu> wrote:

> Hello,
>
> Thanks a lot for your response. I think you are correct. However, on that
> specific cluster I only have NFS available. Is there a workaround for
> NFS-based systems? I have tried using nfmpi_sync along with MPI_BARRIER,
> but it doesn't help.
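>
> In case it matters, what I tried looks roughly like the sketch below
> (ncid, varid, start, count and the buffers are placeholders, not the
> names from my actual script):
>
>       ! write phase (collective)
>       ierr = nfmpi_put_vara_int_all(ncid, varid, start, count, wbuf)
>       ierr = nfmpi_sync(ncid)                 ! flush this rank's writes
>       call MPI_Barrier(MPI_COMM_WORLD, ierr)  ! wait until all ranks have synced
>       ierr = nfmpi_sync(ncid)                 ! sync again before reading back
>       ! read phase (collective)
>       ierr = nfmpi_get_vara_int_all(ncid, varid, start, count, rbuf)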
>
> Thanks
>
> Divyanshu
>
> On Tue, Jul 15, 2025 at 1:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> It is most likely because you are using NFS.
>>
>> Using NFS is not recommended when running MPI-IO. Written data may be
>> cached in the system buffer, even beyond the exit of the user
>> application, which often leads to incorrect contents in the output file.
>>
>> Please try a different file system, preferably a parallel file system.
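>>
>> If you have to stay on NFS for now, one experiment (not a guaranteed
>> fix, and it assumes your MPI uses ROMIO for its MPI-IO layer) is to
>> disable data sieving for writes through an MPI_Info object passed at
>> file creation, e.g. in Fortran:
>>
>>       integer info, ncid, ierr
>>       call MPI_Info_create(info, ierr)
>>       ! data sieving performs read-modify-write, which can make NFS
>>       ! client-cache inconsistencies worse
>>       call MPI_Info_set(info, 'romio_ds_write', 'disable', ierr)
>>       ierr = nfmpi_create(MPI_COMM_WORLD, 'out.nc', NF_CLOBBER, info, ncid)
>>       call MPI_Info_free(info, ierr)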
>>
>> Wei-keng
>>
>> On Jul 15, 2025, at 12:28 PM, Divyanshu Gola <divyans at umich.edu> wrote:
>>
>> Hello,
>>
>> With the simple I/O test script (attached), I have observed four different
>> scenarios depending on the pnetcdf build and runtime environment, and would
>> like your comments on interpreting these results:
>>
>>    - If I test the program using a PnetCDF build configured with
>>    --enable-debug and with PNETCDF_VERBOSE_DEBUG_MODE=1, a multi-node run
>>    only triggers the NC_EBADDIM and NC_ENOTVAR errors at startup, and the
>>    write and read operations proceed correctly.
>>    - If I test the program using a PnetCDF build configured with
>>    --enable-debug but with PNETCDF_VERBOSE_DEBUG_MODE not set, a
>>    multi-node run triggers no NC_EBADDIM or NC_ENOTVAR errors, and the
>>    write operations silently produce incorrect results.
>>    - If I test the program using a PnetCDF build not configured with
>>    --enable-debug, a multi-node run also silently produces incorrect
>>    write results.
>>    - A single-node run produces correct write and read results regardless
>>    of the build configuration and environment settings.
>>
>> In addition,
>>
>>    - With --enable-debug and PNETCDF_VERBOSE_DEBUG_MODE=1, PnetCDF fails
>>    the 'make ptests' check on either 1 node or multiple nodes.
>>    - With a release build, PnetCDF passes 'make ptests' on 1 node; on
>>    multiple nodes it reports a buffer mismatch during the mcoll_perf
>>    test, yet still prints 'pass' for all the tests.
>>    - Could these errors be due to the NFS v3 file system being used?
>>
>> I really appreciate your time with this.
>>
>> Best
>>
>> Divyanshu
>>
>> On Tue, Jul 15, 2025 at 12:25 PM Wei-Keng Liao <wkliao at northwestern.edu>
>> wrote:
>>
>>> Are these the only error messages?
>>> NC_EBADDIM and NC_ENOTVAR are normal: when creating a new dimension or
>>> variable, PnetCDF first checks whether it has already been defined
>>> (hence those messages).
>>>
>>> Are there any lines containing NC_EINVALCOORDS or NC_EEDGE?
>>> Those correspond to the errors you are seeing:
>>> NC_EINVALCOORDS means "NetCDF: Index exceeds dimension bound" and
>>> NC_EEDGE means "NetCDF: Start+count exceeds dimension bound".
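>>>
>>> In Fortran's 1-based indexing, the checks amount to roughly this (a
>>> sketch with assumed names, for a 1-D variable defined over a dimension
>>> of length dimlen, where each rank writes nlocal values):
>>>
>>>       ! NC_EINVALCOORDS:  start(1) > dimlen
>>>       ! NC_EEDGE:         start(1) + count(1) - 1 > dimlen
>>>       start(1) = rank * nlocal + 1
>>>       count(1) = nlocal
>>>       ierr = nfmpi_put_vara_int_all(ncid, varid, start, count, buf)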
>>>
>>> Wei-keng
>>>
>>> On Jul 15, 2025, at 10:58 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>>>
>>> Hello,
>>>
>>> Thank you for your email. I tried running the test script with the debug
>>> options enabled and got the following messages on all ranks, even though
>>> the code continues to run afterwards:
>>>
>>> ---------------BEGIN ERROR MESSAGES--------------
>>> Rank 144: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 146: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 147: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 148: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 149: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 150: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 151: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 152: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 154: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 156: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 157: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 158: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 116: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>>> Rank 152: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 152: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 81: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 81: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 89: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 89: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 102: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 102: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 103: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 103: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 112: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 112: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 114: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 114: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 119: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 119: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 122: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 122: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 124: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> Rank 124: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>>> Rank 136: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>>> -------------END ERROR MESSAGES---------------
>>>
>>> Thanks for your help with this.
>>>
>>> Divyanshu
>>>
>>>
>>> On Mon, Jul 14, 2025 at 12:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
>>> wrote:
>>>
>>>> Hi, Divyanshu
>>>>
>>>> When PnetCDF is configured with option "--enable-debug" and
>>>> the environment variable PNETCDF_VERBOSE_DEBUG_MODE is set to 1,
>>>> additional error messages describing the bound violation will
>>>> be printed on screen. It may help you find the source code location
>>>> that produces the error.
>>>>
>>>> Wei-keng
>>>>
>>>> On Jul 14, 2025, at 7:56 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>>>>
>>>> Hi Jim,
>>>>
>>>> Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I have
>>>> tried PnetCDF versions 1.12, 1.13, and 1.14 but get the same error. I
>>>> really think there is some inconsistency during the writing of the file,
>>>> and as I said in the previous email, this only happens on one specific
>>>> cluster. Here is the result of a test script: using PnetCDF and MPI, I
>>>> write a global variable named var that is an array consisting of the
>>>> indices of the processes, i.e., each process writes only its own index.
>>>> For example, with 96 processes var is [1, 2, 3, 4, ..., 95, 96]. However,
>>>> when I try writing the same variable using, say, 384 processes, some of
>>>> the values in the middle remain zero (the default element value) instead
>>>> of the respective process index. Like you said, I am pretty sure this is
>>>> NOT an issue with PnetCDF but some other inconsistency within the
>>>> cluster; I just don't know how to identify it and thought people on the
>>>> mailing list might have encountered this before. Meanwhile, I got my code
>>>> to work by doing plain binary I/O instead of PnetCDF files, which
>>>> suggests the MPI itself is probably okay?
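>>>>
>>>> A rough free-form sketch of what the test does (file and variable names
>>>> here are made up; my actual script also reads the data back):
>>>>
>>>> program write_rank
>>>>   use mpi
>>>>   implicit none
>>>>   include 'pnetcdf.inc'                 ! F77 PnetCDF API (nfmpi_*)
>>>>   integer :: ierr, rank, nprocs, ncid, varid, dimids(1), buf(1)
>>>>   integer(kind=MPI_OFFSET_KIND) :: glen, start(1), count(1)
>>>>
>>>>   call MPI_Init(ierr)
>>>>   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>>>   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>>>>
>>>>   ierr = nfmpi_create(MPI_COMM_WORLD, 'var.nc', NF_CLOBBER, MPI_INFO_NULL, ncid)
>>>>   glen = nprocs
>>>>   ierr = nfmpi_def_dim(ncid, 'n', glen, dimids(1))
>>>>   ierr = nfmpi_def_var(ncid, 'var', NF_INT, 1, dimids, varid)
>>>>   ierr = nfmpi_enddef(ncid)
>>>>
>>>>   start(1) = rank + 1                   ! 1-based: rank r owns element r+1
>>>>   count(1) = 1
>>>>   buf(1)   = rank + 1                   ! each rank writes its own index
>>>>   ierr = nfmpi_put_vara_int_all(ncid, varid, start, count, buf)
>>>>   ierr = nfmpi_close(ncid)
>>>>   call MPI_Finalize(ierr)
>>>> end program write_rank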
>>>>
>>>> I appreciate your time with this.
>>>>
>>>> Best
>>>>
>>>> Divyanshu
>>>>
>>>>
>>>> On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
>>>>
>>>>> Hi Divyanshu,
>>>>>
>>>>> From your description it sounds like the written file is corrupted.
>>>>> And when you say "The error also doesn't appear when I use fewer
>>>>> processes on a single node on the cluster," do you mean fewer processes
>>>>> to write the file or to read it? This really sounds to me like an
>>>>> application problem rather than a PnetCDF or MPI issue. I think you may
>>>>> need to provide an exact description of how the problem is reproduced,
>>>>> including the system you are using, the MPI library and version, the
>>>>> PnetCDF version, and the application.
>>>>>
>>>>> Jim
>>>>>
>>>>> On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is a shot in the dark but I thought why not.
>>>>>>
>>>>>> The exact same code that I had been using until a few weeks ago now gives
>>>>>> me an error when I try to read restart files using PnetCDF. The error is
>>>>>> "Index exceeds dimension bounds" or "Start+Count exceeds dimension
>>>>>> bounds". Based on days of debugging, I have narrowed it down to some
>>>>>> problem during the writing of the restart files (not the reading itself).
>>>>>> All of these errors seem to originate from the way PnetCDF is built and
>>>>>> the MPI file system used on the cluster (because I can run the same code
>>>>>> on a different cluster), but I can't seem to identify the root cause. The
>>>>>> error also doesn't appear when I use fewer processes on a single node on
>>>>>> the cluster.
>>>>>>
>>>>>> I know this is most likely not a bug in the PnetCDF library itself, but I
>>>>>> was wondering whether people on this mailing list have encountered a
>>>>>> similar issue.
>>>>>>
>>>>>> Apologies for the long email and thanks
>>>>>>
>>>>>> Divyanshu
>>>>>>
>>>>>> Postdoctoral Researcher
>>>>>> University of Michigan
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Edwards
>>>>> STAND UP FOR SCIENCE
>>>>> CESM Software Engineer
>>>>> National Center for Atmospheric Research
>>>>> Boulder, CO
>>>>>
>>>>
>>>>
>>> <test_pnetcdf_io.f90>
>>
>>
>>

-- 
Jim Edwards
STAND UP FOR SCIENCE
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO