possible bug in prerelease

Wei-keng Liao wkliao at eecs.northwestern.edu
Tue Dec 5 13:57:35 CST 2017


I can add a configure-time option to turn off this check.
Another approach is to implement multi-round MPI-IO inside
PnetCDF, with each round handling up to 2 GiB of I/O. The
mechanism would be similar to the one ROMIO uses for two-phase
I/O, and could be a new feature in a future release.
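
To make the multi-round idea concrete, here is a minimal sketch of the
round arithmetic only. It is not PnetCDF or ROMIO code: the helper names
num_rounds and round_len are hypothetical, and the cap of INT_MAX bytes
per round is an assumption based on the 2 GiB limit discussed in this
thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: how many rounds are needed to move `total` bytes
 * when a single round may carry at most `cap` bytes. */
static int64_t num_rounds(int64_t total, int64_t cap)
{
    assert(total >= 0 && cap > 0);
    return total / cap + (total % cap != 0);
}

/* Hypothetical helper: number of bytes moved in round `r` (0-based).
 * Every round is full except possibly the last one. */
static int64_t round_len(int64_t total, int64_t cap, int64_t r)
{
    assert(r >= 0 && r < num_rounds(total, cap));
    int64_t left = total - r * cap;   /* bytes not yet moved */
    return left < cap ? left : cap;
}
```

With a collective MPI-IO call in each round, every process would have to
take part in the same number of rounds, so a real implementation would
presumably agree on the maximum round count across processes first and
let processes that run out of data participate with zero-length requests.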

Wei-keng

On Dec 5, 2017, at 8:38 AM, Latham, Robert J. wrote:

> On Mon, 2017-12-04 at 17:26 -0600, Wei-keng Liao wrote:
>> Hi, Jim
>> 
>> I will create a new error code, named NC_EMAX_REQ, in PnetCDF.
>> So, if the request size of an individual call to a get/put API or
>> wait API is > 2GiB, then NC_EMAX_REQ will be thrown. Note the
>> new size limit is per MPI process, not across all MPI processes.
>> I guess this behavior will affect PIO. Feedback is welcome.
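
A minimal sketch of the kind of per-process check described above. It is
an illustration, not the PnetCDF implementation: the function name
check_request_total and the constants EX_NOERR and EX_EMAX_REQ are
hypothetical stand-ins (the real NC_EMAX_REQ value is defined by
PnetCDF), and treating INT_MAX bytes as the limit is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for library error codes. */
#define EX_NOERR      0
#define EX_EMAX_REQ (-1)

/* Sum the byte counts of all requests pending on THIS process using
 * 64-bit arithmetic, and report an error as soon as the running total
 * would exceed INT_MAX (assumed limit). */
static int check_request_total(const int64_t *nbytes, size_t nreqs)
{
    int64_t total = 0;
    for (size_t i = 0; i < nreqs; i++) {
        assert(nbytes[i] >= 0);
        /* total never exceeds INT_MAX here, so this cannot overflow */
        if (nbytes[i] > INT_MAX - total)
            return EX_EMAX_REQ;
        total += nbytes[i];
    }
    return EX_NOERR;
}
```

Because the sum is local, two processes that each stay under the limit
pass the check even if their combined size is far larger.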
> 
> I hate to see pnetcdf go to such lengths to paper over a bug in
> the underlying MPI implementation.   Will it be hard to turn off this
> error check?
> 
> Still, it takes time to develop these fixes and even longer to get
> vendors to deploy them.  One would hope that a gigabyte of data would
> be large enough to perform well!
> 
> ==rob
> 
> 
> 
>> 
>> Wei-keng
>> 
>> On Dec 4, 2017, at 10:10 AM, Jim Edwards wrote:
>> 
>>> Yes I can confirm that the aggregated size exceeds 2GiB. 
>>> 
>>> On Sun, Dec 3, 2017 at 3:35 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>> Hi, Jim
>>> 
>>> I think the error you encountered is most likely due to the
>>> aggregated size of nonblocking requests in the ncmpi_wait_all
>>> call being larger than 2 GiB. Can you confirm this is your case?
>>> ROMIO does not appear to work well for such cases. I am
>>> thinking of making PnetCDF bail out if this condition is
>>> detected.
>>> 
>>> Wei-keng
>>> 
>>> On Dec 3, 2017, at 7:55 AM, Jim Edwards wrote:
>>> 
>>>> I see you already put up the PR to ROMIO - thanks.
>>>> 
>>>> On Fri, Dec 1, 2017 at 10:52 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>> Hi, Jim
>>>> 
>>>> After taking another look at your assertion error from ad_gpfs_aggrs.c,
>>>> I believe you were hit by a ROMIO bug. I wrote a short test program that
>>>> can cause a similar integer overflow error in ROMIO. The program's URL:
>>>> https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/largefile/large_coalesce.c
>>>> 
>>>> It looks like the bug was anticipated by the following comment at line 463
>>>> of ad_gpfs_aggrs.c:
>>>>  /* Possibly reconsider if buf_idx's are ok as int's, or should they be aints/offsets?
>>>>     They are used as memory buffer indices so it seems like the 2G limit is in effect */
>>>> 
>>>> After I rebuilt MPICH with the data type of buf_idx changed from int to MPI_Aint,
>>>> my test program ran fine. Would you like to create a GitHub issue at the MPICH repo?
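
The effect of the index type can be sketched without MPI. The example
below is not ROMIO code: prefix_offsets and fits_in_int are hypothetical
names, and int64_t stands in for MPI_Aint as an assumption. It shows
that the starting offsets of requests packed back to back leave the
range of int once the lengths before them add up to more than INT_MAX,
which is the condition the `curr_idx == (int) curr_idx` assertion guards
against.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Store in buf_idx[i] the starting offset of request i inside one
 * packed buffer, accumulating in a 64-bit type. Returns the total. */
static int64_t prefix_offsets(const int *lens, size_t n, int64_t *buf_idx)
{
    int64_t off = 0;
    for (size_t i = 0; i < n; i++) {
        assert(lens[i] >= 0);
        buf_idx[i] = off;
        off += lens[i];
    }
    return off;
}

/* True when the value survives a round trip through int. */
static int fits_in_int(int64_t v)
{
    return v >= INT_MIN && v <= INT_MAX;
}
```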
>>>> 
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Dec 1, 2017, at 8:07 PM, Wei-keng Liao wrote:
>>>> 
>>>>> Hi, Jim,
>>>>> 
>>>>> Yes, that is a bug. I have developed a fix. Please check out the
>>>>> latest commit from the PnetCDF SVN repo and let me know if it works
>>>>> for you. Thanks for reporting.
>>>>> 
>>>>> Wei-keng
>>>>> 
>>>>> On Dec 1, 2017, at 4:43 PM, Jim Edwards wrote:
>>>>> 
>>>>>> I think that I've found a bug in the prerelease in file ncmpio_wait.c
>>>>>> 
>>>>>> In coalescing blocklengths at line 2095
>>>>>> 
>>>>>>           if (ai - a_last_contig == blocklengths[last_contig_req])
>>>>>>               /* user buffer of request j is contiguous from j-1
>>>>>>                * we coalesce j to j-1 */
>>>>>>               blocklengths[last_contig_req] += blocklengths[j];
>>>>>> 
>>>>>> It's possible that blocklengths[last_contig_req] + blocklengths[j]
>>>>>> overflows the integer datatype.
>>>>>> I tried to fix that by avoiding the coalescing:
>>>>>> 
>>>>>>           if ((ai - a_last_contig == blocklengths[last_contig_req]) &&
>>>>>>               (blocklengths[last_contig_req] + blocklengths[j] > 0))
>>>>>>               /* user buffer of request j is contiguous from j-1
>>>>>>                * we coalesce j to j-1 */
>>>>>>               blocklengths[last_contig_req] += blocklengths[j];
>>>>>> 
>>>>>> but that leads to another overflow problem:
>>>>>> ad_gpfs_aggrs.c:572: ADIOI_GPFS_Calc_my_req: Assertion `curr_idx == (int) curr_idx' failed.
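
Testing whether the sum is positive relies on signed overflow wrapping
around, which is undefined behavior in C. One possible alternative,
sketched below, is to compare against INT_MAX before adding. This is not
the fix that went into PnetCDF: can_add_int and coalesce_blocks are
hypothetical names, and long long stands in for MPI_Aint addresses as an
assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* For non-negative a and b, true when a + b is representable as int.
 * The test itself never overflows. */
static int can_add_int(int a, int b)
{
    assert(a >= 0 && b >= 0);
    return b <= INT_MAX - a;
}

/* Coalesce n blocks in place. Block j is merged into the previous kept
 * block only when it starts exactly where that block ends AND the
 * merged length still fits in an int. Returns the new block count. */
static size_t coalesce_blocks(long long *addrs, int *lens, size_t n)
{
    if (n == 0)
        return 0;
    size_t last = 0;   /* index of the last kept block */
    for (size_t j = 1; j < n; j++) {
        if (addrs[j] - addrs[last] == lens[last] &&
            can_add_int(lens[last], lens[j])) {
            lens[last] += lens[j];       /* contiguous and safe: merge */
        } else {
            last++;                      /* start a new block */
            addrs[last] = addrs[j];
            lens[last] = lens[j];
        }
    }
    return last + 1;
}
```

Skipping the merge keeps each block length within int, but as the
assertion failure above shows, the layers below may still overflow when
the total request size exceeds 2 GiB.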
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jim Edwards
>>>>>> 
>>>>>> CESM Software Engineer
>>>>>> National Center for Atmospheric Research
>>>>>> Boulder, CO
>> 
>> 


