[EXTERNAL] Re: Problem in variable bounds checking

Wei-keng Liao wkliao at eecs.northwestern.edu
Fri Apr 1 13:57:43 CDT 2016


Hi, Greg

Thanks. We will see how this request is taken by netCDF group.

Just would like to know more of your view to this problem.
Technically, the start value of the last process in your case
is out of bound. Because the request size is zero, it makes
sense to skip the request. However, what happens to other
arguments, should they be ignored as well? Not just arguments
of stride and imap, but also ncid and varid. In other words, should
an API ignore or report an error if an invalid ncid or varid is
used? I wonder if there is any application relying on netCDF
error report of NC_EINVALCOORDS for their internal checking
for any particular purpose.

Wei-keng

On Mar 31, 2016, at 1:53 PM, Sjaardema, Gregory D wrote:

> I have a similar patch submitted to NetCDF team.  The same test occurs three times in the nc4hdf.c file; two times without the count[]==0 check, one time with it.  If they accept the pull request, I will then resubmit this to the pnetcdf group.
> 
> Thanks,
> ..Greg
> 
> -- 
> "A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
> 
> 
> 
> 
> 
> 
> 
> On 3/31/16, 12:35 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> 
>> Hi, Greg
>> 
>> This out-of-bound check is unfortunately enforced by netCDF even if count is zero.
>> There is a comment mentioned at the top of that function. PnetCDF tries to conform
>> with netCDF on returning error codes, and hence enforces the same check.
>> 
>> A while ago when I revised this part of codes, I tried to skip this check for
>> zero-length request, but found that will cause many fails for the internal test
>> programs. Unless netCDF changes its error checking on this, PnetCDF will keep
>> this the same. Here is a small program I tested against netCDF that can generate
>> the same error.
>> 
>> 
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <netcdf.h>
>> 
>> #define ERR {if(err!=NC_NOERR){printf("Error at line=%d: %s\n", __LINE__, nc_strerror(err));}}
>> 
>> int main(int argc, char** argv) {
>>   int err, ncid, varid, dimid, buf[10];
>>   size_t start, count;
>> 
>>   err = nc_create("testfile.nc", NC_WRITE, &ncid); ERR
>>   err = nc_def_dim(ncid, "dim", 10, &dimid); ERR
>>   err = nc_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); ERR
>>   err = nc_enddef(ncid); ERR
>> 
>>   start = 10;
>>   count = 0;
>>   err = nc_put_vara_int(ncid, varid, &start, &count, buf); ERR
>>   err = nc_close(ncid); ERR
>>   return 0;
>> }
>> 
>> 
>> 
>> Wei-keng
>> 
>> On Mar 31, 2016, at 12:43 PM, Sjaardema, Gregory D wrote:
>> 
>>> The following code is in filetype.c, function `NC_start_count_stride_ck`
>>> 
>>>>    for (; i<varp->ndims; i++) {
>>>>        if (start[i] < 0 || start[i] >= varp->shape[i])
>>>>            DEBUG_RETURN_ERROR(NC_EINVALCOORDS)
>>>> 
>>>>        if (varp->shape[i] < 0) DEBUG_RETURN_ERROR(NC_EEDGE)
>>>> 
>>>>        if (count != NULL) {
>>>>            if (count[i] < 0) /* no negative count[] */
>>>>                DEBUG_RETURN_ERROR(NC_ENEGATIVECNT)
>>>> 
>>>>            if (stride == NULL) { /* for vara APIs */
>>>>                if (count[i] > varp->shape[i] ||
>>>>                    start[i] + count[i] > varp->shape[i])
>>>>                    DEBUG_RETURN_ERROR(NC_EEDGE)
>>>>            }
>>>>            else { /* for vars APIs */
>>>>                if (count[i] > 0 &&
>>>>                    start[i] + (count[i]-1) * stride[i] >= varp->shape[i])
>>>>                    DEBUG_RETURN_ERROR(NC_EEDGE)
>>>>                if (stride[i] == 0) DEBUG_RETURN_ERROR(NC_ESTRIDE)
>>>>            }
>>>>        }
>>>>        /* else is for var1 APIs */
>>>> 
>>> There is an issue when the process with the highest rank has zero items to output.  As an example, if I have 4 mpi processes which are each writing the following amount of data:
>>> * rank 0: 0 items
>>> * rank 1: 2548 items
>>> * rank 2: 4352 items
>>> * rank 3: 0 items.
>>> 
>>> I will define the variable to have a length of 6900 items (0 + 2548 + 4352 + 0).  When I am outputting data to the variable, each rank will call nc_put_vara_longlong with the following start and count values:
>>> * rank 0: start = 0, count = 0
>>> * rank 1: start = 0, count = 2548
>>> * rank 2: start = 2548, count = 4352
>>> * rank 3: start = 6900, count = 0.
>>> 
>>> In each case, the `start` for rank N is equal to `start` for rank N-1 + `count` for rank N-1.  This all works ok until the highest rank is writing 0 items.  In that case, the `start` value for that rank is equal to the total size of the variable and the check in the code fragment shown above fails since `start[i] == varp->shape[i]`.
>>> 
>>> This could be fixed in the application code by checking whether the `count` is zero and if so, then set `start` to 0 also, but I think that is a kluge that should not be required. 
>>> 
>>> My suggestion is to make the test be:
>>> ```
>>>  if (start[i] < 0 || (start[i] >= varp->shape[i] && count[i] > 0))
>>> ```
>>> This is in version 1.7.0.  It also appears in 1.6.1 in the function Nccoordck.
>>> 
>>> ..Greg
>>> 
>>> -- 
>>> "A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
>> 



More information about the parallel-netcdf mailing list