possible bug in pnetcdf: cdf5 issue

Wei-keng Liao wkliao at ece.northwestern.edu
Sun Feb 17 23:10:14 CST 2013


Hi, Jim,

In your test program, each process writes 322437120 or 322437202 doubles.
322437120 * sizeof(double) = 2,579,496,960 bytes, which is larger than 2^31-1,
the maximum for a signed 4-byte integer. This did cause a 4-byte integer
overflow in PnetCDF. But even MPI-IO has a problem with a request of this size.
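
To make the arithmetic concrete, here is a minimal sketch of the check. The
helper names are made up for illustration and are not part of PnetCDF or MPI.
It computes the byte count in 64 bits and works out what a 32-bit
two's-complement int would hold after wrapping.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not PnetCDF or MPI functions. */

/* Byte count of a request, computed in 64 bits so it cannot wrap here. */
int64_t request_bytes(int64_t nelems, int64_t elem_size)
{
    return nelems * elem_size;
}

/* Returns 1 if the byte count does not fit in a signed 4-byte integer. */
int overflows_int32(int64_t nbytes)
{
    return nbytes > INT32_MAX || nbytes < INT32_MIN;
}

/* Value a 32-bit two's-complement int would hold after the count wraps.
 * Done with explicit arithmetic to avoid an implementation-defined cast. */
int64_t wrapped_int32(int64_t nbytes)
{
    const int64_t two32 = (int64_t)1 << 32;
    int64_t r = nbytes % two32;      /* remainder takes the sign of nbytes */
    if (r < 0) r += two32;           /* bring into [0, 2^32) */
    if (r > INT32_MAX) r -= two32;   /* reinterpret the upper half as negative */
    return r;
}
```

For 322437120 doubles, wrapped_int32(request_bytes(322437120, 8)) gives
-1715470336, the same value as the blocklen reported by
PMPI_Type_create_struct in the quoted message below.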

If you try the code fragment below, ROMIO will throw an error class
MPI_ERR_ARG, and error string "Invalid count argument".

    int len = 322437120;
    double *buf = (double*) malloc(len * sizeof(double));

    int err = MPI_File_write(fh, buf, len, MPI_DOUBLE, &status);
    if (err != MPI_SUCCESS) {
        int errorStringLen;
        char errorString[MPI_MAX_ERROR_STRING];
        MPI_Error_string(err, errorString, &errorStringLen);
        printf("Error: MPI_File_write() (%s)\n", errorString);
    }

A possible PnetCDF solution is to detect the overflow and divide a large request
into multiple smaller ones, each with an upper bound of 2^31-1 bytes.
Alternatively, PnetCDF can simply return an error, as MPI-IO does.
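
A rough sketch of the first option, assuming the request is a contiguous run of
fixed-size elements and elem_size is greater than zero. All names here
(write_fn, write_in_chunks, chunk_log, record_chunk) are hypothetical; in a
real implementation the callback would wrap something like MPI_File_write, and
this is not how PnetCDF is actually written.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical callback: write `count` elements starting at element index
 * `start`. Returns 0 on success, nonzero on error. */
typedef int (*write_fn)(size_t start, int count, void *ctx);

/* Largest element count whose size in bytes still fits in an int. */
int max_elems_per_call(size_t elem_size)
{
    return (int)((size_t)INT_MAX / elem_size);
}

/* Issue one call per chunk so that no single call exceeds INT_MAX bytes. */
int write_in_chunks(size_t nelems, size_t elem_size, write_fn fn, void *ctx)
{
    size_t max_count = (size_t)max_elems_per_call(elem_size);
    size_t done = 0;

    while (done < nelems) {
        size_t left = nelems - done;
        int count = (int)(left < max_count ? left : max_count);
        int err = fn(done, count, ctx);
        if (err != 0)
            return err;              /* stop at the first failed chunk */
        done += (size_t)count;
    }
    return 0;
}

/* Stand-in callback that only records what would have been written. */
struct chunk_log {
    int    ncalls;   /* number of chunks issued */
    int    first;    /* element count of the first chunk */
    int    last;     /* element count of the most recent chunk */
    size_t total;    /* total elements across all chunks */
};

int record_chunk(size_t start, int count, void *ctx)
{
    struct chunk_log *log = (struct chunk_log *)ctx;
    (void)start;                     /* unused in this stand-in */
    if (log->ncalls == 0)
        log->first = count;
    log->last = count;
    log->ncalls++;
    log->total += (size_t)count;
    return 0;
}
```

With 8-byte doubles and a 4-byte int, a request of 322437120 elements would be
split into two calls of 268435455 and 54001665 elements. Chunking by whole
elements rather than raw bytes keeps every piece aligned to an element
boundary.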

Any suggestion?

Wei-keng

On Feb 17, 2013, at 1:34 PM, Jim Edwards wrote:

> Found the problem in the test program, a corrected program is attached.   This reminds me of another issue - the interface to nfmpi_iput_vara is not defined in pnetcdf.mod
> 
> - Jim
> 
> On Sun, Feb 17, 2013 at 11:43 AM, Jim Edwards <jedwards at ucar.edu> wrote:
> In my larger program I am getting an error:
> 
> PMPI_Type_create_struct(139): Invalid value for blocklen, must be non-negative but is -1715470336
> 
> I see a note about this in nonblocking.c:
> 
>     for (j=0; j<reqs[i].varp->ndims; j++)
>                 blocklens[i] *= reqs[i].count[j];
>             /* Warning! blocklens[i] might overflow */
> 
> 
> But I tried to distill this into a small test case and I'm getting a different error. I've attached the test program anyway because I can't spot any error there and think it must be in pnetcdf.    Also, it seems like instead of
> calling mpi_type_create_struct you should be calling mpi_type_create_subarray, which will avoid the problem of blocklens overflowing.   
> 
> This test program is written for 8 mpi tasks, but it uses a lot of memory so you may need more than one node to run it.   
> 
> -- 
> Jim Edwards
> 
> CESM Software Engineering Group
> National Center for Atmospheric Research
> Boulder, CO 
> 303-497-1842
> 
> 
> 
> <testpnetcdf5.F90>


