[EXTERNAL] Re: Problem in variable bounds checking

Mon Apr 4 11:49:45 CDT 2016

Hi, Greg,

My understanding of netCDF error code precedence is the following:
NC_EBADID, NC_ENOTVAR, NC_ECHAR, NC_EINVALCOORDS, NC_EEDGE, and NC_ESTRIDE
(NC_EBADID being the highest and NC_ESTRIDE lowest).

In this logics, start argument is checked for out of boundary before checking
whether count is zero-size or not. In your example, a 1-D variable is defined of
size 40 elements. The valid start values are 0, 1, 2, ... 39. Strictly speaking,
40 is out-of-bound. If we followed the above error reporting precedence, getting
an NC_EINVALCOORDS makes sense. 

However, if users prefer checking zero-length request over out-of-bound (a library
I believe should be adjustable based on users demands), then as long as the user
document clarifies this rule, we should be fine. The only thing I am worrying
about is breaking the netCDF convention might cause incompatibility for legacy
applications. Or maybe no application ever cares about this tiny difference :)

Wei-keng

On Apr 4, 2016, at 8:55 AM, Sjaardema, Gregory D wrote:

> I don’t think the other arguments should be skipped or ignored — specifying an invalid varid or dimid or ncid should be an error no matter what the request size is.
> 
> I think that the start value is a special case (and not really that special depending on how you look at it). In a parallel case, each rank will typically be handling a certain number of values in the “field” or variable.  The total size of the variable is the sum of the number of values handled by each rank and the “start” value for each rank is the sum of the number of values handled by previous ranks.
> 
> As an example, assume there are four ranks and each handles 10 values.  Then have that the total size of the variable is 40 and the start values are 0, 10, 20, 30 with request sizes of 10, 10, 10, 10. 
> 
> Now, if  we have a similar variable which exists on ranks 0 and 3 with request sizes of 20, 0, 0, 20. The total size is still 40 and the start values are 0, 20, 20, 20 and everything works fine.
> 
> If we now have a similar variable which only exists on ranks 1 and 2, and request sizes of 0, 20, 20, 0.  The total size is still 40 and the start values are 0, 0, 20, 40.  
> 
> This uses consistent logic and no special cases in the users application code.  However, this causes a failure in the nc_put_vara call due to the error check that “error if start >= size”.  There should be no issue if the error check were instead changed to “start > size” since the following check would still catch the overflow case — “start + count > size”
> 
>> if (start[d2] >= (hssize_t)fdims[d2])
>>    BAIL_QUIET(NC_EINVALCOORDS);
>> if (start[d2] + count[d2] > fdims[d2])
>>    BAIL_QUIET(NC_EEDGE);
>> 
> I don’t see a similar argument for ignoring invalid ncid and varid.  I would also still throw an error if start > size even if request size is 0; I’m just suggesting that it should not be an error for “start == size”.
> 
> The discussion on the netcdf list is https://github.com/Unidata/netcdf-c/pull/243
> 
> ..Greg
> --
> "A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
> 
> 
> 
> On 4/1/16, 12:57 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
> 
>> Hi, Greg
>> 
>> Thanks. We will see how this request is taken by netCDF group.
>> 
>> Just would like to know more of your view to this problem.
>> Technically, the start value of the last process in your case
>> is out of bound. Because the request size is zero, it makes
>> sense to skip the request. However, what happens to other
>> arguments, should they be ignored as well? Not just arguments
>> of stride and imap, but also ncid and varid. In other words, should
>> an API ignore or report an error if an invalid ncid or varid is
>> used? I wonder if there is any application relying on netCDF
>> error report of NC_EINVALCOORDS for their internal checking
>> for any particular purpose.
>> 
>> Wei-keng
>> 
>> On Mar 31, 2016, at 1:53 PM, Sjaardema, Gregory D wrote:
>> 
>>> I have a similar patch submitted to NetCDF team.  The same test occurs three times in the nc4hdf.c file; two times without the count[]==0 check, one time with it.  If they accept the pull request, I will then resubmit this to the pnetcdf group.
>>> Thanks,
>>> ..Greg
>>> --
>>> "A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
>>> On 3/31/16, 12:35 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu> wrote:
>>>> Hi, Greg
>>>> This out-of-bound check is unfortunately enforced by netCDF even if count is zero.
>>>> There is a comment mentioned at the top of that function. PnetCDF tries to conform
>>>> with netCDF on returning error codes, and hence enforces the same check.
>>>> A while ago when I revised this part of codes, I tried to skip this check for
>>>> zero-length request, but found that will cause many fails for the internal test
>>>> programs. Unless netCDF changes its error checking on this, PnetCDF will keep
>>>> this the same. Here is a small program I tested against netCDF that can generate
>>>> the same error.
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <netcdf.h>
>>>> #define ERR {if(err!=NC_NOERR){printf("Error at line=%d: %s\n", __LINE__, nc_strerror(err));}}
>>>> int main(int argc, char** argv) {
>>>>    int err, ncid, varid, dimid, buf[10];
>>>>    size_t start, count;
>>>>    err = nc_create("testfile.nc", NC_WRITE, &ncid); ERR
>>>>    err = nc_def_dim(ncid, "dim", 10, &dimid); ERR
>>>>    err = nc_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); ERR
>>>>    err = nc_enddef(ncid); ERR
>>>>    start = 10;
>>>>    count = 0;
>>>>    err = nc_put_vara_int(ncid, varid, &start, &count, buf); ERR
>>>>    err = nc_close(ncid); ERR
>>>>    return 0;
>>>> }
>>>> Wei-keng
>>>> On Mar 31, 2016, at 12:43 PM, Sjaardema, Gregory D wrote:
>>>>> The following code is in filetype.c, function `NC_start_count_stride_ck`
>>>>>>     for (; i<varp->ndims; i++) {
>>>>>>         if (start[i] < 0 || start[i] >= varp->shape[i])
>>>>>>             DEBUG_RETURN_ERROR(NC_EINVALCOORDS)
>>>>>>         if (varp->shape[i] < 0) DEBUG_RETURN_ERROR(NC_EEDGE)
>>>>>>         if (count != NULL) {
>>>>>>             if (count[i] < 0) /* no negative count[] */
>>>>>>                 DEBUG_RETURN_ERROR(NC_ENEGATIVECNT)
>>>>>>             if (stride == NULL) { /* for vara APIs */
>>>>>>                 if (count[i] > varp->shape[i] ||
>>>>>>                     start[i] + count[i] > varp->shape[i])
>>>>>>                     DEBUG_RETURN_ERROR(NC_EEDGE)
>>>>>>             }
>>>>>>             else { /* for vars APIs */
>>>>>>                 if (count[i] > 0 &&
>>>>>>                     start[i] + (count[i]-1) * stride[i] >= varp->shape[i])
>>>>>>                     DEBUG_RETURN_ERROR(NC_EEDGE)
>>>>>>                 if (stride[i] == 0) DEBUG_RETURN_ERROR(NC_ESTRIDE)
>>>>>>             }
>>>>>>         }
>>>>>>         /* else is for var1 APIs */
>>>>> There is an issue when the process with the highest rank has zero items to output.  As an example, if I have 4 mpi processes which are each writing the following amount of data:
>>>>> * rank 0: 0 items
>>>>> * rank 1: 2548 items
>>>>> * rank 2: 4352 items
>>>>> * rank 3: 0 items.
>>>>> I will define the variable to have a length of 6900 items (0 + 2548 + 4352 + 0).  When I am outputting data to the variable, each rank will call nc_put_vara_longlong with the following start and count values:
>>>>> * rank 0: start = 0, count = 0
>>>>> * rank 1: start = 0, count = 2548
>>>>> * rank 2: start = 2548, count = 4352
>>>>> * rank 3: start = 6900, count = 0.
>>>>> In each case, the `start` for rank N is equal to `start` for rank N-1 + `count` for rank N-1.  This all works ok until the highest rank is writing 0 items.  In that case, the `start` value for that rank is equal to the total size of the variable and the check in the code fragment shown above fails since `start[i] == varp->shape[i]`.
>>>>> This could be fixed in the application code by checking whether the `count` is zero and if so, then set `start` to 0 also, but I think that is a kluge that should not be required.
>>>>> My suggestion is to make the test be:
>>>>> ```
>>>>>   if (start[i] < 0 || (start[i] >= varp->shape[i] && count[i] > 0))
>>>>> ```
>>>>> This is in version 1.7.0.  It also appears in 1.6.1 in the function Nccoordck.
>>>>> ..Greg
>>>>> --
>>>>> "A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
>> 
>>