possible bug in prerelease
Wei-keng Liao
wkliao at eecs.northwestern.edu
Tue Dec 5 13:57:35 CST 2017
I can add a configure-time option to turn off this check.
Another way is to implement multi-round MPI-IO inside PnetCDF,
with each round handling up to 2 GiB of I/O. The mechanism would
be similar to the one ROMIO uses for two-phase I/O, and could
become a new feature in a future release.
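
A minimal sketch of the multi-round idea, assuming each round is capped at INT_MAX bytes (just under 2 GiB, the largest value an int count can hold). The helper names num_rounds and round_length are hypothetical and are not PnetCDF or ROMIO functions; the MPI-IO call that would be issued in each round is left out.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: number of rounds needed to move `total` bytes
 * when each round carries at most `max_round` bytes. */
long long num_rounds(long long total, long long max_round)
{
    assert(total >= 0 && max_round > 0);
    return total / max_round + (total % max_round != 0);
}

/* Hypothetical helper: byte count of round `r` (0-based).
 * Every round except possibly the last one is full. */
long long round_length(long long total, long long max_round, long long r)
{
    assert(r >= 0 && r < num_rounds(total, max_round));
    long long done = r * max_round;      /* bytes moved by earlier rounds */
    long long left = total - done;       /* bytes still to move */
    return left < max_round ? left : max_round;
}
```

With these helpers, a 5 GiB request capped at INT_MAX bytes per round splits into three rounds: two full ones and a shorter final one.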
Wei-keng
On Dec 5, 2017, at 8:38 AM, Latham, Robert J. wrote:
> On Mon, 2017-12-04 at 17:26 -0600, Wei-keng Liao wrote:
>> Hi, Jim
>>
>> I will create a new error code, named NC_EMAX_REQ, in PnetCDF.
>> So, if the request size of an individual call to a get/put API or
>> wait API is > 2GiB, then NC_EMAX_REQ will be thrown. Note the
>> new size limit is per MPI process, not across all MPI processes.
>> I guess this behavior will affect PIO. Feedback is welcome.
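
A minimal sketch of such a per-process size check, for illustration only. It assumes the limit is INT_MAX bytes; the threshold PnetCDF actually uses may differ. The status codes SKETCH_OK and SKETCH_EMAX_REQ and the function check_request_total are placeholders, since the real value of NC_EMAX_REQ is defined by PnetCDF and is not reproduced here.

```c
#include <assert.h>
#include <limits.h>

/* Placeholder status codes for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_EMAX_REQ = -1 };

/* Sum the byte sizes of the requests that one process passes to a single
 * call, and reject the call once the total would exceed INT_MAX. */
int check_request_total(const long long *req_bytes, int nreqs)
{
    long long total = 0;   /* 64-bit accumulator, always <= INT_MAX */
    assert(nreqs >= 0);
    for (int i = 0; i < nreqs; i++) {
        assert(req_bytes[i] >= 0);
        /* Compare against the remaining headroom before adding. */
        if (req_bytes[i] > INT_MAX - total)
            return SKETCH_EMAX_REQ;
        total += req_bytes[i];
    }
    return SKETCH_OK;
}
```

The check runs on each process independently, which matches the note that the limit is per MPI process rather than across all processes.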
>
> I hate to see pnetcdf go to such lengths to paper over a bug in
> the underlying MPI implementation. Will it be hard to turn off this
> error check?
>
> Still, it takes time to develop these fixes and even longer to get
> vendors to deploy them. One would hope that a gigabyte of data would
> be large enough to perform well!
>
> ==rob
>
>
>
>>
>> Wei-keng
>>
>> On Dec 4, 2017, at 10:10 AM, Jim Edwards wrote:
>>
>>> Yes I can confirm that the aggregated size exceeds 2GiB.
>>>
>>> On Sun, Dec 3, 2017 at 3:35 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>> Hi, Jim
>>>
>>> I think the error you encountered is most likely due to the
>>> aggregated size of nonblocking requests in the ncmpi_wait_all
>>> call being larger than 2 GiB. Can you confirm this is your case?
>>> ROMIO does not appear to work well for such cases. I am
>>> thinking of making PnetCDF bail out if this condition is
>>> detected.
>>>
>>> Wei-keng
>>>
>>> On Dec 3, 2017, at 7:55 AM, Jim Edwards wrote:
>>>
>>>> I see you already put up the PR to ROMIO - thanks.
>>>>
>>>> On Fri, Dec 1, 2017 at 10:52 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>>>> Hi, Jim
>>>>
>>>> After taking another look at your assertion error from ad_gpfs_aggrs.c,
>>>> I believe you were hit by a ROMIO bug. I wrote a short test program that
>>>> can cause a similar integer overflow error in ROMIO. The program's URL:
>>>> https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/largefile/large_coalesce.c
>>>>
>>>> It looks like the bug was anticipated by the following comment at line 463
>>>> in file ad_gpfs_aggrs.c:
>>>> /* Possibly reconsider if buf_idx's are ok as int's, or should they be aints/offsets?
>>>> They are used as memory buffer indices so it seems like the 2G limit is in effect */
>>>>
>>>> After I rebuilt MPICH with the data type of buf_idx changed from int to MPI_Aint,
>>>> my test program ran fine. Would you like to create a GitHub issue in the MPICH repo?
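
To illustrate why widening an index type matters here, the sketch below keeps a running buffer index in a 64-bit type and reports the first request whose starting index no longer fits in an int. It is hypothetical and is not ROMIO's code: wide_idx_t stands in for MPI_Aint so that no MPI header is needed, and fits_in_int and first_index_overflow are made-up names.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for a wide index type such as MPI_Aint (assumption: 64-bit). */
typedef long long wide_idx_t;

/* Returns 1 if `v` can be stored in an int without loss, which is the
 * condition an assertion of the form `v == (int) v` verifies. */
int fits_in_int(wide_idx_t v)
{
    return v >= INT_MIN && v <= INT_MAX;
}

/* Fill buf_idx[i] with the starting buffer index of request i, computed
 * as a running sum of the request lengths in the wide type. Returns the
 * first i whose starting index would not fit in an int, or -1 if all fit. */
int first_index_overflow(const int *lens, int n, wide_idx_t *buf_idx)
{
    wide_idx_t running = 0;
    int first_bad = -1;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        assert(lens[i] >= 0);
        buf_idx[i] = running;
        if (first_bad < 0 && !fits_in_int(running))
            first_bad = i;
        running += lens[i];
    }
    return first_bad;
}
```

With an int index, the entry flagged by first_index_overflow is where the stored value would stop matching the true offset; with the wide type the offsets stay exact.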
>>>>
>>>>
>>>> Wei-keng
>>>>
>>>> On Dec 1, 2017, at 8:07 PM, Wei-keng Liao wrote:
>>>>
>>>>> Hi, Jim,
>>>>>
>>>>> Yes, that is a bug. I have developed a fix. Please check out the
>>>>> latest commit from the PnetCDF SVN repo and let me know if it works for you.
>>>>> Thanks for reporting.
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Dec 1, 2017, at 4:43 PM, Jim Edwards wrote:
>>>>>
>>>>>> I think that I've found a bug in the prerelease in file
>>>>>> ncmpio_wait.c
>>>>>>
>>>>>> In coalescing blocklengths at line 2095
>>>>>>
>>>>>> if (ai - a_last_contig == blocklengths[last_contig_req])
>>>>>>     /* user buffer of request j is contiguous from j-1
>>>>>>      * we coalesce j to j-1 */
>>>>>>     blocklengths[last_contig_req] += blocklengths[j];
>>>>>>
>>>>>> It's possible that blocklengths[last_contig_req] + blocklengths[j]
>>>>>> overflows the integer datatype.
>>>>>> I tried to fix that by avoiding the coalescing:
>>>>>>
>>>>>> if ((ai - a_last_contig == blocklengths[last_contig_req]) &&
>>>>>>     (blocklengths[last_contig_req] + blocklengths[j] > 0))
>>>>>>     /* user buffer of request j is contiguous from j-1
>>>>>>      * we coalesce j to j-1 */
>>>>>>     blocklengths[last_contig_req] += blocklengths[j];
>>>>>>
>>>>>> but that leads to another overflow problem:
>>>>>> ad_gpfs_aggrs.c:572: ADIOI_GPFS_Calc_my_req: Assertion `curr_idx == (int) curr_idx' failed.
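
The guard quoted above tests the sum after computing it. In C, signed integer overflow is undefined behavior, so a check of that form is not guaranteed to catch the overflow. A minimal sketch of an alternative that compares against the remaining headroom before adding is shown below. The function coalesce_if_fits is hypothetical and is not the fix that went into PnetCDF; the contiguity test on the buffer addresses is assumed to be done by the caller.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: merge block j into block `last` only when the
 * combined length still fits in an int.
 * Returns 1 if the blocks were merged, 0 if they were left separate. */
int coalesce_if_fits(int *blocklengths, int last, int j)
{
    assert(blocklengths[last] >= 0 && blocklengths[j] >= 0);
    /* INT_MAX - blocklengths[last] cannot overflow because
     * blocklengths[last] is non-negative. */
    if (blocklengths[j] > INT_MAX - blocklengths[last])
        return 0;
    blocklengths[last] += blocklengths[j];
    return 1;
}
```

As the thread notes, skipping the coalescing only avoids the overflow in this one array; a request larger than 2 GiB can still trip an int index further down in the MPI-IO layer.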
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jim Edwards
>>>>>>
>>>>>> CESM Software Engineer
>>>>>> National Center for Atmospheric Research
>>>>>> Boulder, CO