possible bug in prerelease

Latham, Robert J. robl at mcs.anl.gov
Tue Dec 5 08:38:13 CST 2017


On Mon, 2017-12-04 at 17:26 -0600, Wei-keng Liao wrote:
> Hi, Jim
> 
> I will create a new error code, named NC_EMAX_REQ, in PnetCDF.
> So, if the request size of an individual call to a get/put API or
> wait API is > 2GiB, then NC_EMAX_REQ will be thrown. Note the
> new size limit is per MPI process, not across all MPI processes.
> I guess this behavior will affect PIO. Feedback is welcome.

I hate to see pnetcdf go to such lengths to paper over a bug in
the underlying MPI implementation.   Will it be hard to turn off this
error check?

Still, it takes time to develop these fixes and even longer to get
vendors to deploy them.  One would hope that a gigabyte of data would
be large enough to perform well!

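For illustration only, the per-process check proposed in the quoted text could be as small as the sketch below. The function name, the skip_check switch, and the numeric value given to NC_EMAX_REQ are all made up here and are not taken from PnetCDF; the sketch also assumes the limit is INT_MAX bytes, which is one byte short of 2 GiB.

```c
#include <limits.h>
#include <stddef.h>

#define NC_NOERR 0
/* Placeholder value: the real error code number is not given in this thread. */
#define NC_EMAX_REQ (-1000)

/* Hypothetical helper, not PnetCDF code.
 * nbytes[i] is the size in bytes of the i-th request handled by a single
 * get/put or wait call on THIS process (the limit is per process).
 * Returns NC_EMAX_REQ when the running total would exceed INT_MAX.
 * skip_check != 0 turns the check off entirely. */
static int check_request_size(const long long *nbytes, size_t nreqs,
                              int skip_check)
{
    long long total = 0;

    if (skip_check)
        return NC_NOERR;

    for (size_t i = 0; i < nreqs; i++) {
        /* Compare before adding so the total never passes INT_MAX. */
        if (nbytes[i] < 0 || nbytes[i] > INT_MAX - total)
            return NC_EMAX_REQ;
        total += nbytes[i];
    }
    return NC_NOERR;
}
```

A switch like skip_check is one way the check could be disabled once the underlying MPI library handles large requests; whether PnetCDF would expose such a switch is not stated in this thread.
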
==rob



> 
> Wei-keng
> 
> On Dec 4, 2017, at 10:10 AM, Jim Edwards wrote:
> 
> > Yes I can confirm that the aggregated size exceeds 2GiB. 
> > 
> > On Sun, Dec 3, 2017 at 3:35 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > Hi, Jim
> > 
> > I think the error you encountered is most likely due to the
> > aggregated size of nonblocking requests in the ncmpi_wait_all
> > call being larger than 2 GiB. Can you confirm this is your case?
> > ROMIO does not appear to work well for such cases. I am
> > thinking of making PnetCDF bail out if this condition is
> > detected.
> > 
> > Wei-keng
> > 
> > On Dec 3, 2017, at 7:55 AM, Jim Edwards wrote:
> > 
> > > I see you already put up the PR to ROMIO - thanks.
> > > 
> > > On Fri, Dec 1, 2017 at 10:52 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> > > Hi, Jim
> > > 
> > > After taking another look at your assertion error from
> > > ad_gpfs_aggrs.c, I believe you were hit by a ROMIO bug. I wrote
> > > a short test program that can cause a similar integer overflow
> > > error in ROMIO. The program's URL:
> > > https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/largefile/large_coalesce.c
> > > 
> > > Looks like the bug was anticipated, judging by the following
> > > comment at line 463 in file ad_gpfs_aggrs.c:
> > >   /* Possibly reconsider if buf_idx's are ok as int's, or should they be aints/offsets?
> > >      They are used as memory buffer indices so it seems like the 2G limit is in effect */
> > > 
> > > After I rebuilt MPICH with the data type of buf_idx changed from
> > > int to MPI_Aint, my test program ran fine. Would you like to
> > > create a GitHub issue in the MPICH repo?
> > > 
> > > 
> > > Wei-keng
> > > 
> > > On Dec 1, 2017, at 8:07 PM, Wei-keng Liao wrote:
> > > 
> > > > Hi, Jim,
> > > > 
> > > > > Yes, that is a bug. I have developed a fix. Please check out
> > > > > the latest commit from the PnetCDF SVN repo and let me know if
> > > > > it works for you.
> > > > > Thanks for reporting.
> > > > 
> > > > Wei-keng
> > > > 
> > > > On Dec 1, 2017, at 4:43 PM, Jim Edwards wrote:
> > > > 
> > > > > > I think that I've found a bug in the prerelease in file ncmpio_wait.c
> > > > > 
> > > > > In coalescing blocklengths at line 2095
> > > > > 
> > > > > >            if (ai - a_last_contig == blocklengths[last_contig_req])
> > > > > >                /* user buffer of request j is contiguous from j-1
> > > > > >                 * we coalesce j to j-1 */
> > > > > >                blocklengths[last_contig_req] += blocklengths[j];
> > > > > 
> > > > > > It's possible that blocklengths[last_contig_req] + blocklengths[j]
> > > > > > overflows the integer datatype.
> > > > > > I tried to fix that by avoiding the coalescing:
> > > > > 
> > > > > >            if ((ai - a_last_contig == blocklengths[last_contig_req]) &&
> > > > > >              (blocklengths[last_contig_req] + blocklengths[j] > 0))
> > > > > >                /* user buffer of request j is contiguous from j-1
> > > > > >                 * we coalesce j to j-1 */
> > > > > >                blocklengths[last_contig_req] += blocklengths[j];
> > > > > 
> > > > > > but that leads to another overflow problem:
> > > > > > ad_gpfs_aggrs.c:572: ADIOI_GPFS_Calc_my_req: Assertion `curr_idx == (int) curr_idx' failed.
> > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > Jim Edwards
> > > > > 
> > > > > CESM Software Engineer
> > > > > National Center for Atmospheric Research
> > > > > Boulder, CO
> 
> 

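The coalescing overflow discussed in the quoted thread can be sketched in isolation. The test `a + b > 0` used in the attempted fix relies on signed integer overflow, which is undefined behavior in C, so a compiler is free to optimize it away. The sketch below does the range check before the addition instead. The names add_fits_int and coalesce_blocks are hypothetical, and the loop is a simplified stand-in for the one in ncmpio_wait.c, not a copy of it.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 and stores a + b in *sum when the sum of
 * two non-negative ints fits in an int; otherwise returns 0 and leaves
 * *sum untouched. The comparison happens before the addition, so no
 * signed overflow ever occurs. */
static int add_fits_int(int a, int b, int *sum)
{
    if (a < 0 || b < 0 || a > INT_MAX - b)
        return 0;
    *sum = a + b;
    return 1;
}

/* Simplified sketch: merge blocks whose user buffers are contiguous,
 * in place. A merge is skipped when the combined length would not fit
 * in an int. Returns the number of blocks left after coalescing. */
static int coalesce_blocks(long long *addrs, int *blocklengths, int n)
{
    int last = 0; /* index of the last block kept so far */

    if (n <= 0)
        return 0;

    for (int j = 1; j < n; j++) {
        int merged;
        if (addrs[j] - addrs[last] == blocklengths[last] &&
            add_fits_int(blocklengths[last], blocklengths[j], &merged)) {
            /* block j starts where the kept block ends: coalesce */
            blocklengths[last] = merged;
        } else {
            /* not contiguous, or too large to merge: keep it separate */
            last++;
            addrs[last] = addrs[j];
            blocklengths[last] = blocklengths[j];
        }
    }
    return last + 1;
}
```

Skipping the merge only keeps each individual block length within int range. As the thread shows, the aggregate size can still overflow int-typed indices further down in ROMIO, so this guards one layer and not the whole path.
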
