Recent inconsistencies with Pnetcdf and MPI
Divyanshu Gola
divyans at umich.edu
Wed Jul 16 11:59:04 CDT 2025
Hello,
Thanks a lot for your response. I think you are correct. However, on that
specific cluster, I only have NFS available. Is there a workaround for
NFS-based systems? I have tried calling nfmpi_sync together with MPI_BARRIER,
but it doesn't help.
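
For reference, the flush attempt is essentially the two calls below (an
illustrative fragment, not the exact code from the attached script; ncid is
the handle of the open file and error checking is trimmed):

    ! flush PnetCDF / MPI-IO buffered writes for the open file 'ncid',
    ! then make sure every rank has passed the sync before continuing
    ierr = nfmpi_sync(ncid)
    call MPI_Barrier(MPI_COMM_WORLD, ierr)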
Thanks
Divyanshu
On Tue, Jul 15, 2025 at 1:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:
> It is most likely because you are using NFS.
>
> Using NFS is not recommended when running MPI-IO. Write data may be
> cached in the system buffer even beyond the exit of the user application.
> This often results in incorrect contents in the output file.
>
> Please try a different file system, preferably a parallel file system.
>
> Wei-keng
>
> On Jul 15, 2025, at 12:28 PM, Divyanshu Gola <divyans at umich.edu> wrote:
>
> Hello,
>
> With the simple I/O test script (attached), I have observed four different
> scenarios depending on the pnetcdf build and runtime environment, and would
> like your comments on interpreting these results:
>
> - If I test the program using a pnetcdf build configured with
> --enable-debug and with PNETCDF_VERBOSE_DEBUG_MODE=1, then a multi-node
> run only triggers the NC_EBADDIM and NC_ENOTVAR errors upon startup,
> and the write and read operations proceed correctly.
> - If I test the program using a pnetcdf build configured with
> --enable-debug but with PNETCDF_VERBOSE_DEBUG_MODE not set, then a
> multi-node run triggers no NC_EBADDIM or NC_ENOTVAR errors, and the
> write and read operations silently produce incorrect write results
> (see the read-back sketch below).
> - If I test the program using a pnetcdf build not configured with
> --enable-debug, a multi-node run using the write and read operations
> also silently produces incorrect write results.
> - A single-node run produces correct write and read results regardless
> of the configuration and environment setup.
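>
> A comparison of the data read back against the expected values is what
> makes the silent corruption visible; a minimal sketch of such a check is
> below (illustrative only: the subroutine name check_readback, the file name
> 'test.nc', and the error handling are placeholders, not the exact attached
> script):
>
>   subroutine check_readback(filename, nprocs, rank)
>     use mpi
>     implicit none
>     include 'pnetcdf.inc'
>     character(len=*), intent(in) :: filename
>     integer, intent(in) :: nprocs, rank
>     integer ierr, ncid, varid, i, nbad
>     integer(kind=MPI_OFFSET_KIND) start1(1), count1(1)
>     integer, allocatable :: buf(:)
>
>     allocate(buf(nprocs))
>     ! open the just-written file read-only and fetch the whole 'var'
>     ! array collectively on every rank
>     ierr = nfmpi_open(MPI_COMM_WORLD, filename, NF_NOWRITE, &
>                       MPI_INFO_NULL, ncid)
>     ierr = nfmpi_inq_varid(ncid, 'var', varid)
>     start1(1) = 1
>     count1(1) = nprocs
>     ierr = nfmpi_get_vara_int_all(ncid, varid, start1, count1, buf)
>     ierr = nfmpi_close(ncid)
>
>     ! expected contents: buf(i) == i for i = 1..nprocs
>     nbad = 0
>     do i = 1, nprocs
>        if (buf(i) .ne. i) nbad = nbad + 1
>     end do
>     if (nbad .gt. 0) print *, 'rank', rank, ':', nbad, 'mismatched entries'
>     deallocate(buf)
>   end subroutine check_readback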
>
> In addition,
>
> - With --enable-debug and PNETCDF_VERBOSE_DEBUG_MODE=1, pnetcdf fails
> the "make ptests" check using either 1 node or multiple nodes.
> - With a release build, pnetcdf passes "make ptests" on 1 node, but on
> multi-node it issues a buffer mismatch message during the mcoll_perf
> test, while still reporting 'pass' results for all the tests.
> - Could these errors be due to the NFS v3.0 file system being used?
>
> I really appreciate your time with this.
>
> Best
>
> Divyanshu
>
> On Tue, Jul 15, 2025 at 12:25 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> Are these the only error messages?
>> NC_EBADDIM and NC_ENOTVAR are normal: when creating a new
>> dimension or variable, PnetCDF first checks whether it has already
>> been defined, and such messages are printed when it has not.
>>
>> Are there lines containing NC_EINVALCOORDS or NC_EEDGE?
>> Those correspond to the error messages you are seeing:
>> the former is "NetCDF: Index exceeds dimension bound", and
>> the latter is "NetCDF: Start+count exceeds dimension bound".
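>>
>> For a fixed-size dimension of length N (1-based indexing in Fortran), a
>> start larger than N gives the former, and a valid start with
>> start+count-1 greater than N gives the latter. One way to locate the
>> offending call is to check every return status, for example with a small
>> helper like the sketch below (illustrative only; it assumes the nfmpi_/NF_
>> names from pnetcdf.inc, and the helper name check_err is a placeholder):
>>
>>   subroutine check_err(err, location)
>>     implicit none
>>     include 'pnetcdf.inc'
>>     integer, intent(in) :: err
>>     character(len=*), intent(in) :: location
>>     ! print PnetCDF's message for any non-zero status; this is where
>>     ! strings such as "Index exceeds dimension bound" would show up
>>     if (err .ne. NF_NOERR) then
>>        print *, location, ': ', trim(nfmpi_strerror(err))
>>     end if
>>   end subroutine check_err
>>
>> used after each call, e.g.
>>
>>   err = nfmpi_put_vara_int_all(ncid, varid, start, count, buf)
>>   call check_err(err, 'nfmpi_put_vara_int_all')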
>>
>> Wei-keng
>>
>> On Jul 15, 2025, at 10:58 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>>
>> Hello,
>>
>> Thank you for your email. I tried running the test script with the debug
>> flags and I get the following messages on all ranks, even though the code
>> continues to execute after that:
>>
>> ---------------BEGIN ERROR MESSAGES--------------
>> Rank 144: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 146: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 147: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 148: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 149: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 150: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 151: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 152: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 154: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 156: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 157: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 158: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 116: NC_EBADDIM error at line 95 of NC_finddim in ncmpio_dim.c
>> Rank 152: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 152: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 81: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 81: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 89: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 89: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 102: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 102: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 103: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 103: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 112: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 112: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 114: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 114: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 119: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 119: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 122: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 122: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 124: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> Rank 124: NC_ENOTVAR error at line 480 of ncmpio_inq_varid in ncmpio_var.c
>> Rank 136: NC_ENOTVAR error at line 259 of NC_findvar in ncmpio_var.c
>> -------------END ERROR MESSAGES---------------
>>
>> Thanks for your help with this.
>>
>> Divyanshu
>>
>>
>> On Mon, Jul 14, 2025 at 12:38 PM Wei-Keng Liao <wkliao at northwestern.edu>
>> wrote:
>>
>>> Hi, Divyanshu
>>>
>>> When PnetCDF is configured with option "--enable-debug" and
>>> the environment variable PNETCDF_VERBOSE_DEBUG_MODE is set to 1,
>>> additional error messages describing the bound violation will
>>> be printed on screen. It may help you find the source code location
>>> that produces the error.
>>>
>>> Wei-keng
>>>
>>> On Jul 14, 2025, at 7:56 AM, Divyanshu Gola <divyans at umich.edu> wrote:
>>>
>>> Hi Jim,
>>>
>>> Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I
>>> have tried pnetcdf versions 1.12, 1.13, and 1.14 but get the same error. I
>>> really think there is some inconsistency during the writing of the file
>>> and, as I said in the previous email, this only happens on one specific
>>> cluster.
>>> Here's the result of a test script: using pnetcdf and MPI, I write a global
>>> variable named var, an array consisting of the indices of the processes,
>>> i.e., each process writes only its own index. For example, if I use 96
>>> processes, var is [1, 2, 3, 4, ..., 95, 96]. However, when I write the same
>>> variable using, say, 384 processes, some of the values in the middle remain
>>> zero (the default element value) instead of the respective process index.
>>> Like you said, I am pretty sure this is NOT an issue with pnetcdf but some
>>> other inconsistency within the cluster; I just don't know how to identify
>>> it and thought people on the mailing list might have encountered this
>>> before. Meanwhile, I got my code to work by doing simple binary I/O instead
>>> of pnetcdf files, which suggests that MPI itself is okay.
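>>>
>>> The write pattern is roughly the sketch below (an illustrative
>>> reconstruction, not the exact attached script; the file name 'test.nc' is
>>> a placeholder and error checking is omitted for brevity):
>>>
>>>   program write_rank_index
>>>     use mpi
>>>     implicit none
>>>     include 'pnetcdf.inc'
>>>     integer ierr, rank, nprocs, ncid, dimid, varid
>>>     integer(kind=MPI_OFFSET_KIND) gsize, start1(1), count1(1)
>>>     integer myval(1)
>>>
>>>     call MPI_Init(ierr)
>>>     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>>     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>>>
>>>     ! one global 1-D integer variable of length nprocs;
>>>     ! rank r writes the single value r+1 at position r+1
>>>     ierr = nfmpi_create(MPI_COMM_WORLD, 'test.nc', NF_CLOBBER, &
>>>                         MPI_INFO_NULL, ncid)
>>>     gsize = nprocs
>>>     ierr = nfmpi_def_dim(ncid, 'n', gsize, dimid)
>>>     ierr = nfmpi_def_var(ncid, 'var', NF_INT, 1, (/dimid/), varid)
>>>     ierr = nfmpi_enddef(ncid)
>>>
>>>     start1(1) = rank + 1
>>>     count1(1) = 1
>>>     myval(1)  = rank + 1
>>>     ierr = nfmpi_put_vara_int_all(ncid, varid, start1, count1, myval)
>>>
>>>     ierr = nfmpi_close(ncid)
>>>     call MPI_Finalize(ierr)
>>>   end program write_rank_index
>>>
>>> With 96 ranks the expected contents are [1, 2, ..., 96]; in the failing
>>> multi-node runs, some of those entries come back as 0.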
>>>
>>> I appreciate your time with this.
>>>
>>> Best
>>>
>>> Divyanshu
>>>
>>>
>>> On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
>>>
>>>> Hi Divyanshu,
>>>>
>>>> From your description it sounds like the written file is corrupted.
>>>> And when you say "The error also doesn't appear when I use fewer processes
>>>> on a single node on the cluster," do you mean fewer processes to write the
>>>> file or to read it? It really sounds to me like an application problem and
>>>> not a pnetcdf or MPI issue. I think you may need to provide an exact
>>>> description of how the problem is reproduced, including the system you are
>>>> using, the MPI library and version, the pnetcdf version, and the
>>>> application.
>>>>
>>>> Jim
>>>>
>>>> On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> This is a shot in the dark but I thought why not.
>>>>>
>>>>> The exact same code that I had been using until a few weeks ago gives
>>>>> me an error when I try to read restart files using PNetcdf. The error is
>>>>> "Index exceeds dimension bounds" or "Start+Count exceeds dimension
>>>>> bounds". Based on days of debugging, I have narrowed it down to some
>>>>> problem during the writing of the restart files (and not the reading
>>>>> itself). The errors seem to originate from the way PNetcdf is built and
>>>>> the MPI file system used on the cluster (the same code runs fine on a
>>>>> different cluster), but I can't seem to identify the root cause. The
>>>>> error also doesn't appear when I use fewer processes on a single node on
>>>>> the cluster.
>>>>>
>>>>> I know this is most likely not a bug in the PNetcdf library but
>>>>> something else, but I was wondering if people on this mailing list have
>>>>> encountered a similar issue.
>>>>>
>>>>> Apologies for the long email and thanks
>>>>>
>>>>> Divyanshu
>>>>>
>>>>> Postdoctoral Researcher
>>>>> University of Michigan
>>>>>
>>>>
>>>>
>>>> --
>>>> Jim Edwards
>>>> STAND UP FOR SCIENCE
>>>> CESM Software Engineer
>>>> National Center for Atmospheric Research
>>>> Boulder, CO
>>>>
>>>
>>>
>> <test_pnetcdf_io.f90>
>
>
>