possible bug in prerelease

Jim Edwards jedwards at ucar.edu
Mon Dec 4 18:03:21 CST 2017


Hi Wei-keng,

I am also adding 2 GiB as a limit on the local size of aggregated arrays in
PIO.
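
A minimal sketch of what such a per-process check could look like. The
helper name fits_int_limit is hypothetical and this is not the actual PIO
or PnetCDF code. It also assumes the limit is INT_MAX bytes, since the
aggregated size has to fit in a C int:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of PIO or PnetCDF: returns 1 when the
 * sum of nreqs request sizes (in bytes) fits in a C int, 0 otherwise.
 * Each size is checked against the remaining headroom before it is
 * added, so the running total itself never overflows. */
static int fits_int_limit(const long long *req_bytes, size_t nreqs)
{
    long long total = 0;
    for (size_t i = 0; i < nreqs; i++) {
        if (req_bytes[i] < 0 || req_bytes[i] > (long long)INT_MAX - total)
            return 0;   /* adding this request would exceed INT_MAX */
        total += req_bytes[i];
    }
    return 1;
}
```

A caller would run this on the local request list before aggregating and
return an error (for example the proposed NC_EMAX_REQ) when it yields 0.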

Thanks,

On Mon, Dec 4, 2017 at 4:26 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu>
wrote:

> Hi, Jim
>
> I will create a new error code, named NC_EMAX_REQ, in PnetCDF.
> If the request size of an individual call to a get/put API or
> wait API exceeds 2 GiB, NC_EMAX_REQ will be returned. Note that the
> new size limit is per MPI process, not across all MPI processes.
> I guess this behavior will affect PIO. Feedback is welcome.
>
> Wei-keng
>
> On Dec 4, 2017, at 10:10 AM, Jim Edwards wrote:
>
> > Yes I can confirm that the aggregated size exceeds 2GiB.
> >
> > On Sun, Dec 3, 2017 at 3:35 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > Hi, Jim
> >
> > I think the error you encountered is most likely due to the
> > aggregated size of nonblocking requests in the ncmpi_wait_all
> > call being larger than 2 GiB. Can you confirm this is your case?
> > ROMIO does not appear to work well for such cases. I am
> > thinking of making PnetCDF bail out if this condition is
> > detected.
> >
> > Wei-keng
> >
> > On Dec 3, 2017, at 7:55 AM, Jim Edwards wrote:
> >
> > > I see you already put up the PR to ROMIO - thanks.
> > >
> > > On Fri, Dec 1, 2017 at 10:52 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > > Hi, Jim
> > >
> > > > After taking another look at your assertion error from ad_gpfs_aggrs.c,
> > > > I believe you were hit by a ROMIO bug. I wrote a short test program that
> > > > can cause a similar integer overflow error in ROMIO. The program's URL:
> > > > https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/largefile/large_coalesce.c
> > >
> > > > It looks like the bug was anticipated, judging by the following comment at line 463
> > > > in file ad_gpfs_aggrs.c:
> > > >   /* Possibly reconsider if buf_idx's are ok as int's, or should they be aints/offsets?
> > > >      They are used as memory buffer indices so it seems like the 2G limit is in effect */
> > >
> > > > After I rebuilt MPICH with the data type of buf_idx changed from int to MPI_Aint,
> > > > my test program ran fine. Would you like to create a GitHub issue in the MPICH repo?
> > >
> > >
> > > Wei-keng
> > >
> > > On Dec 1, 2017, at 8:07 PM, Wei-keng Liao wrote:
> > >
> > > > Hi, Jim,
> > > >
> > > > > Yes, that is a bug. I have developed a fix. Please check out the
> > > > > latest commit from the PnetCDF SVN repo and let me know if it works for you.
> > > > Thanks for reporting.
> > > >
> > > > Wei-keng
> > > >
> > > > On Dec 1, 2017, at 4:43 PM, Jim Edwards wrote:
> > > >
> > > > >> I think that I've found a bug in the prerelease, in file ncmpio_wait.c.
> > > > >>
> > > > >> In coalescing blocklengths at line 2095:
> > > > >>
> > > >>            if (ai - a_last_contig == blocklengths[last_contig_req])
> > > >>                /* user buffer of request j is contiguous from j-1
> > > >>                 * we coalesce j to j-1 */
> > > >>                blocklengths[last_contig_req] += blocklengths[j];
> > > >>
> > > >> ​It's possible that ​blocklengths[last_contig_req] +
> blocklengths[j]; overflows the integer datatype.
> > > >> I tried to fix that by avoiding the coalescing:
> > > >>
> > > >>            if ((ai - a_last_contig ==
> blocklengths[last_contig_req]) &&
> > > >>              (blocklengths[last_contig_req] + blocklengths[j] > 0))
> > > >>                /* user buffer of request j is contiguous from j-1
> > > >>                 * we coalesce j to j-1 */
> > > >>                blocklengths[last_contig_req] += blocklengths[j];
> > > >>
> > > >> ​but that leads to another overflow problem ​:
> > > >> ad_gpfs_aggrs.c:572: ADIOI_GPFS_Calc_my_req: Assertion `curr_idx ==
> (int) curr_idx' failed.
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Jim Edwards
> > > >>
> > > >> CESM Software Engineer
> > > >> National Center for Atmospheric Research
> > > >> Boulder, CO
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Jim Edwards
> > >
> > > CESM Software Engineer
> > > National Center for Atmospheric Research
> > > Boulder, CO
> >
> >
> >
> >
> > --
> > Jim Edwards
> >
> > CESM Software Engineer
> > National Center for Atmospheric Research
> > Boulder, CO
>
>
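
For reference, the workaround quoted at the bottom of this thread tests
blocklengths[last_contig_req] + blocklengths[j] > 0, which depends on
signed integer overflow, and that is undefined behavior in C. Below is a
minimal sketch of a coalescing loop that checks the headroom before adding
instead. The function name coalesce_blocks and its signature are
hypothetical, and this is not the fix that was committed to PnetCDF. It
assumes all block lengths are non-negative:

```c
#include <limits.h>

/* Hypothetical helper: merge blocks that are contiguous in memory, in
 * place. addrs[i] is the byte address of block i, lens[i] its length
 * (assumed non-negative). Returns the number of blocks after merging. */
static int coalesce_blocks(long long *addrs, int *lens, int n)
{
    if (n <= 0) return 0;
    int last = 0;                       /* index of the last kept block */
    for (int j = 1; j < n; j++) {
        int contiguous = (addrs[j] - addrs[last] == lens[last]);
        /* Compare against INT_MAX - lens[last] rather than computing
         * the sum, so no signed overflow can occur. */
        int fits = (lens[j] <= INT_MAX - lens[last]);
        if (contiguous && fits) {
            lens[last] += lens[j];      /* safe: sum is at most INT_MAX */
        } else {
            last++;                     /* start a new block */
            addrs[last] = addrs[j];
            lens[last]  = lens[j];
        }
    }
    return last + 1;
}
```

Two contiguous blocks of 2000000000 bytes each are left as two separate
blocks, because their combined length would not fit in an int.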


-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO