[EXTERNAL] Re: Problem in variable bounds checking

Sjaardema, Gregory D gdsjaar at sandia.gov
Mon Apr 4 08:55:51 CDT 2016


I don’t think the other arguments should be skipped or ignored — specifying an invalid varid or dimid or ncid should be an error no matter what the request size is.

I think that the start value is a special case (and not really that special depending on how you look at it). In a parallel case, each rank will typically be handling a certain number of values in the “field” or variable.  The total size of the variable is the sum of the number of values handled by each rank and the “start” value for each rank is the sum of the number of values handled by previous ranks.

As an example, assume there are four ranks and each handles 10 values.  Then have that the total size of the variable is 40 and the start values are 0, 10, 20, 30 with request sizes of 10, 10, 10, 10.

Now, if  we have a similar variable which exists on ranks 0 and 3 with request sizes of 20, 0, 0, 20. The total size is still 40 and the start values are 0, 20, 20, 20 and everything works fine.

If we now have a similar variable which only exists on ranks 1 and 2, and request sizes of 0, 20, 20, 0.  The total size is still 40 and the start values are 0, 0, 20, 40.

This uses consistent logic and no special cases in the users application code.  However, this causes a failure in the nc_put_vara call due to the error check that “error if start >= size”.  There should be no issue if the error check were instead changed to “start > size” since the following check would still catch the overflow case — “start + count > size”

if (start[d2] >= (hssize_t)fdims[d2])
   BAIL_QUIET(NC_EINVALCOORDS);
if (start[d2] + count[d2] > fdims[d2])
   BAIL_QUIET(NC_EEDGE);

I don’t see a similar argument for ignoring invalid ncid and varid.  I would also still throw an error if start > size even if request size is 0; I’m just suggesting that it should not be an error for “start == size”.

The discussion on the netcdf list is https://github.com/Unidata/netcdf-c/pull/243

..Greg
--
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems”



On 4/1/16, 12:57 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu<mailto:wkliao at eecs.northwestern.edu>> wrote:

Hi, Greg

Thanks. We will see how this request is taken by netCDF group.

Just would like to know more of your view to this problem.
Technically, the start value of the last process in your case
is out of bound. Because the request size is zero, it makes
sense to skip the request. However, what happens to other
arguments, should they be ignored as well? Not just arguments
of stride and imap, but also ncid and varid. In other words, should
an API ignore or report an error if an invalid ncid or varid is
used? I wonder if there is any application relying on netCDF
error report of NC_EINVALCOORDS for their internal checking
for any particular purpose.

Wei-keng

On Mar 31, 2016, at 1:53 PM, Sjaardema, Gregory D wrote:

I have a similar patch submitted to NetCDF team.  The same test occurs three times in the nc4hdf.c file; two times without the count[]==0 check, one time with it.  If they accept the pull request, I will then resubmit this to the pnetcdf group.
Thanks,
..Greg
--
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems”
On 3/31/16, 12:35 PM, "Wei-keng Liao" <wkliao at eecs.northwestern.edu<mailto:wkliao at eecs.northwestern.edu>> wrote:
Hi, Greg
This out-of-bound check is unfortunately enforced by netCDF even if count is zero.
There is a comment mentioned at the top of that function. PnetCDF tries to conform
with netCDF on returning error codes, and hence enforces the same check.
A while ago when I revised this part of codes, I tried to skip this check for
zero-length request, but found that will cause many fails for the internal test
programs. Unless netCDF changes its error checking on this, PnetCDF will keep
this the same. Here is a small program I tested against netCDF that can generate
the same error.
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>
#define ERR {if(err!=NC_NOERR){printf("Error at line=%d: %s\n", __LINE__, nc_strerror(err));}}
int main(int argc, char** argv) {
   int err, ncid, varid, dimid, buf[10];
   size_t start, count;
   err = nc_create("testfile.nc", NC_WRITE, &ncid); ERR
   err = nc_def_dim(ncid, "dim", 10, &dimid); ERR
   err = nc_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); ERR
   err = nc_enddef(ncid); ERR
   start = 10;
   count = 0;
   err = nc_put_vara_int(ncid, varid, &start, &count, buf); ERR
   err = nc_close(ncid); ERR
   return 0;
}
Wei-keng
On Mar 31, 2016, at 12:43 PM, Sjaardema, Gregory D wrote:
The following code is in filetype.c, function `NC_start_count_stride_ck`
    for (; i<varp->ndims; i++) {
        if (start[i] < 0 || start[i] >= varp->shape[i])
            DEBUG_RETURN_ERROR(NC_EINVALCOORDS)
        if (varp->shape[i] < 0) DEBUG_RETURN_ERROR(NC_EEDGE)
        if (count != NULL) {
            if (count[i] < 0) /* no negative count[] */
                DEBUG_RETURN_ERROR(NC_ENEGATIVECNT)
            if (stride == NULL) { /* for vara APIs */
                if (count[i] > varp->shape[i] ||
                    start[i] + count[i] > varp->shape[i])
                    DEBUG_RETURN_ERROR(NC_EEDGE)
            }
            else { /* for vars APIs */
                if (count[i] > 0 &&
                    start[i] + (count[i]-1) * stride[i] >= varp->shape[i])
                    DEBUG_RETURN_ERROR(NC_EEDGE)
                if (stride[i] == 0) DEBUG_RETURN_ERROR(NC_ESTRIDE)
            }
        }
        /* else is for var1 APIs */
There is an issue when the process with the highest rank has zero items to output.  As an example, if I have 4 mpi processes which are each writing the following amount of data:
* rank 0: 0 items
* rank 1: 2548 items
* rank 2: 4352 items
* rank 3: 0 items.
I will define the variable to have a length of 6900 items (0 + 2548 + 4352 + 0).  When I am outputting data to the variable, each rank will call nc_put_vara_longlong with the following start and count values:
* rank 0: start = 0, count = 0
* rank 1: start = 0, count = 2548
* rank 2: start = 2548, count = 4352
* rank 3: start = 6900, count = 0.
In each case, the `start` for rank N is equal to `start` for rank N-1 + `count` for rank N-1.  This all works ok until the highest rank is writing 0 items.  In that case, the `start` value for that rank is equal to the total size of the variable and the check in the code fragment shown above fails since `start[i] == varp->shape[i]`.
This could be fixed in the application code by checking whether the `count` is zero and if so, then set `start` to 0 also, but I think that is a kluge that should not be required.
My suggestion is to make the test be:
```
  if (start[i] < 0 || (start[i] >= varp->shape[i] && count[i] > 0))
```
This is in version 1.7.0.  It also appears in 1.6.1 in the function Nccoordck.
..Greg
--
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems”


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20160404/4f95758b/attachment.html>


More information about the parallel-netcdf mailing list