possible bug in prerelease

Wei-keng Liao wkliao at eecs.northwestern.edu
Fri Dec 1 23:52:46 CST 2017


Hi, Jim

After taking another look at your assertion error from ad_gpfs_aggrs.c,
I believe you were hit by a ROMIO bug. I wrote a short test program that
can cause a similar integer overflow error in ROMIO. The program's URL:
https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/largefile/large_coalesce.c

It looks like the bug was anticipated, judging from the following comment at line 463
of ad_gpfs_aggrs.c:
  /* Possibly reconsider if buf_idx's are ok as int's, or should they be aints/offsets?
     They are used as memory buffer indices so it seems like the 2G limit is in effect */
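
To illustrate the 2G limit that comment refers to, here is a minimal sketch
of a range check with the same intent as the failing assertion
`curr_idx == (int) curr_idx`. The helper name index_fits_in_int is
hypothetical and is not part of ROMIO; int64_t is assumed here as a stand-in
for a wide index type.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not ROMIO code): returns 1 when a 64-bit buffer
 * index can be stored in an int without loss, 0 when it would be
 * truncated. A range comparison is used instead of a cast so the result
 * does not depend on implementation-defined narrowing behavior. */
static int index_fits_in_int(int64_t curr_idx)
{
    return curr_idx >= INT_MIN && curr_idx <= INT_MAX;
}
```

With a 32-bit int, any index of 2 GiB or more fails this check, which is
the situation the assertion reports.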

After I rebuilt MPICH with the data type of buf_idx changed from int to MPI_Aint,
my test program ran fine. Would you like to create a GitHub issue in the MPICH repo?
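
The effect of widening buf_idx can be sketched as follows. This is only an
illustration under stated assumptions, not the actual ROMIO change: the
function calc_buf_idx and its parameters are hypothetical, and int64_t
stands in for MPI_Aint so that the example builds without MPI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch (not ROMIO code): record the starting buffer index
 * of each request. The running index is kept in a 64-bit type (a stand-in
 * for MPI_Aint), so it stays exact even when the request lengths, each of
 * which fits in an int, add up to 2 GiB or more. */
static void calc_buf_idx(const int *lens, size_t n, int64_t *buf_idx)
{
    int64_t curr_idx = 0;
    for (size_t i = 0; i < n; i++) {
        buf_idx[i] = curr_idx;   /* where request i starts in the buffer */
        curr_idx += lens[i];     /* widened before the addition */
    }
}
```

For three requests of 1 GiB each, the third request starts at 2147483648,
a value that a 32-bit int cannot hold.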


Wei-keng

On Dec 1, 2017, at 8:07 PM, Wei-keng Liao wrote:

> Hi, Jim,
> 
> Yes, that is a bug. I have developed a fix. Please check out the
> latest commit from PnetCDF SVN repo and let me know if it works for you.
> Thanks for reporting.
> 
> Wei-keng
> 
> On Dec 1, 2017, at 4:43 PM, Jim Edwards wrote:
> 
>> I think that I've found a bug in the prerelease in file ncmpio_wait.c
>> 
>> In coalescing blocklengths at line 2095 
>>            if (ai - a_last_contig == blocklengths[last_contig_req])
>>                /* user buffer of request j is contiguous from j-1
>>                 * we coalesce j to j-1 */
>>                blocklengths[last_contig_req] += blocklengths[j];
>> 
>> It's possible that blocklengths[last_contig_req] + blocklengths[j] overflows the integer datatype.
>> I tried to fix that by avoiding the coalescing:
>> 
>>            if ((ai - a_last_contig == blocklengths[last_contig_req]) &&
>>                (blocklengths[last_contig_req] + blocklengths[j] > 0))
>>                /* user buffer of request j is contiguous from j-1
>>                 * we coalesce j to j-1 */
>>                blocklengths[last_contig_req] += blocklengths[j];
>> 
>> but that leads to another overflow problem:
>> ad_gpfs_aggrs.c:572: ADIOI_GPFS_Calc_my_req: Assertion `curr_idx == (int) curr_idx' failed.
>> 
>> 
>> 
>> -- 
>> Jim Edwards
>> 
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO 
> 


