Recent inconsistencies with Pnetcdf and MPI
Divyanshu Gola
divyans at umich.edu
Mon Jul 14 07:56:10 CDT 2025
Hi Jim,
Thank you for your reply. I use openmpi/5.0.3 under intel/2022.1.2. I have
tried PnetCDF versions 1.12, 1.13, and 1.14 but get the same error. I
really think some inconsistency arises while writing the file, and, as I
said in the previous email, this only happens on one specific cluster.
Here is the result of a test script: using PnetCDF and MPI, I write a global
variable named *var*, an array holding the index of each process, i.e., each
process writes only its own index. For example, with 96 processes, var is
[1, 2, 3, 4, ..., 95, 96]. However, when I write the same variable using,
say, 384 processes, some of the values in the middle remain zero (the
default value) instead of the respective process index. Like you said, I am
pretty sure this is *NOT* an issue with PnetCDF but some other inconsistency
within the cluster; I just don't know how to identify it and thought people
on the mailing list might have encountered this before. Meanwhile, I got my
code to work by doing simple binary I/O instead of PnetCDF files, which
suggests the MPI library itself is okay.
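
A minimal sketch of that test (not my exact code; the file name
"var_test.nc", the dimension name, and the error-check macro are just
placeholders to illustrate the pattern) looks like this in C with PnetCDF:

/* Each rank writes the single value rank+1 at offset rank into a
 * global 1-D integer variable "var" of length nprocs.
 * Build with something like: mpicc var_test.c -o var_test -lpnetcdf */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

#define CHECK(err) do { if ((err) != NC_NOERR) {                     \
        fprintf(stderr, "error at line %d: %s\n",                    \
                __LINE__, ncmpi_strerror(err));                      \
        MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid, err;
    MPI_Offset start[1], count[1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* create the file; NC_64BIT_DATA selects the CDF-5 format */
    err = ncmpi_create(MPI_COMM_WORLD, "var_test.nc",
                       NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);
    CHECK(err);

    err = ncmpi_def_dim(ncid, "n", (MPI_Offset)nprocs, &dimid); CHECK(err);
    err = ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); CHECK(err);
    err = ncmpi_enddef(ncid); CHECK(err);

    /* each rank writes exactly one element: its 1-based index */
    int val = rank + 1;
    start[0] = rank;
    count[0] = 1;
    err = ncmpi_put_vara_int_all(ncid, varid, start, count, &val); CHECK(err);

    err = ncmpi_close(ncid); CHECK(err);
    MPI_Finalize();
    return 0;
}

With N ranks, dumping the variable afterwards (e.g. ncmpidump -v var
var_test.nc) should print 1 through N in order; what I see at 384 processes
in my code is that some of those entries are zero.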
I appreciate your time with this.
Best
Divyanshu
On Fri, Jul 11, 2025 at 4:23 PM Jim Edwards <jedwards at ucar.edu> wrote:
> Hi Divyanshu,
>
> From your description it sounds like the written file is corrupted. And
> when you say "The error also doesn't appear when I use fewer processes on a
> single node on the cluster," do you mean fewer processes to write the file
> or to read it? It really sounds to me like an application problem and not a
> PnetCDF or MPI issue. I think you may need to provide an exact description
> of how the problem arises, including the system you are using, the MPI
> library and version, the PnetCDF version, and the application.
>
> Jim
>
> On Fri, Jul 11, 2025 at 4:13 PM Divyanshu Gola <divyans at umich.edu> wrote:
>
>> Hi,
>>
>> This is a shot in the dark but I thought why not.
>>
>> The exact same code that I had been using until a few weeks ago gives me
>> an error when I try to read restart files using PnetCDF. The error is *Index
>> exceeds dimension bounds, or Start+Count exceeds dimension bounds.* Based
>> on days of debugging, I have narrowed it down to some problem during the
>> writing of the restart files (not the reading itself). The errors seem to
>> originate from the way PnetCDF is built and the MPI file system used on the
>> cluster (the same code runs fine on a different cluster), but I can't
>> identify the root cause. The error also doesn't appear when I use fewer
>> processes on a single node of the cluster.
>>
>> I know this is most likely not a bug in the PnetCDF library but something
>> else; I was wondering whether people on this mailing list have encountered
>> a similar issue.
>>
>> Apologies for the long email, and thanks.
>>
>> Divyanshu
>>
>> Postdoctoral Researcher
>> University of Michigan
>>
>
>
> --
> Jim Edwards
> STAND UP FOR SCIENCE
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
>