possible bug in pnetcdf: cdf5 issue
Wei-keng Liao
wkliao at ece.northwestern.edu
Mon Feb 18 11:52:32 CST 2013
Hi, Jim,
I tested your code with 4 MPI processes and got the error below.
MPI_FILE_WRITE_ALL(105): Invalid count argument
Maybe you are using IBM's MPI-IO? (I am using MPICH.)
Can you try the attached Fortran program (run on 1 process)?
I got the errors below.
Error: MPI_File_write MPI_DOUBLE Invalid argument, error stack:
MPI_FILE_WRITE(102): Invalid count argument
Error: MPI_File_write (ddtype) Invalid argument, error stack:
MPI_FILE_WRITE(102): Invalid count argument
-------------- next part --------------
A non-text attachment was scrubbed...
Name: write_large_f.F90
Type: application/octet-stream
Size: 1616 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20130218/e099b9b7/attachment.obj>
-------------- next part --------------
Wei-keng
On Feb 18, 2013, at 9:00 AM, Jim Edwards wrote:
> Hi Wei-keng,
>
> This is just an interface problem and not a hard limit of MPI-IO. For example,
> if I run the same case on 4 tasks instead of 8, it works just fine (example attached).
>
> If I create an MPI derived type, for example with MPI_Type_contiguous, I can make the same call as below successfully:
>
> int len = 322437120;
> double *buf = (double*) malloc(len * sizeof(double));
> MPI_Datatype elemtype;
> int err;
>
> /* Wrap the whole buffer in one contiguous derived type and write a
>  * count of 1, so the int count argument never sees the big number. */
> err = MPI_Type_contiguous(len, MPI_DOUBLE, &elemtype);
> err = MPI_Type_commit(&elemtype);
> err = MPI_File_write(fh, buf, 1, elemtype, &status);
> if (err != MPI_SUCCESS) {
>     int errorStringLen;
>     char errorString[MPI_MAX_ERROR_STRING];
>     MPI_Error_string(err, errorString, &errorStringLen);
>     printf("Error: MPI_File_write (%s)\n", errorString);
> }
>
> It seems to me that every operation PnetCDF can do using start and count can be described as an MPI_Type_create_subarray, which would both allow PnetCDF to avoid this interface limit and save a potentially considerable amount of memory.
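>
> A rough sketch of what I mean (untested; the 2-D shape and the start/count
> values are made up for illustration, and fh and buf are assumed to be an
> open file handle and the user's buffer):
>
> MPI_Datatype filetype;
> MPI_Status status;
> int err;
> int gsizes[2]   = {1024, 1024};   /* global variable shape */
> int subsizes[2] = { 512,  512};   /* count[] */
> int starts[2]   = {   0,  512};   /* start[] */
>
> /* No explicit blocklen/offset arrays are built, so there is no 4-byte
>  * product to overflow, and the datatype itself stays tiny. */
> err = MPI_Type_create_subarray(2, gsizes, subsizes, starts,
>                                MPI_ORDER_C, MPI_DOUBLE, &filetype);
> err = MPI_Type_commit(&filetype);
> err = MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
>                         MPI_INFO_NULL);
> err = MPI_File_write_all(fh, buf, 512*512, MPI_DOUBLE, &status);
> err = MPI_Type_free(&filetype);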
>
> - Jim
>
> On Sun, Feb 17, 2013 at 10:10 PM, Wei-keng Liao <wkliao at ece.northwestern.edu> wrote:
> Hi, Jim,
>
> In your test program, each process writes 322437120 or 322437202 doubles.
> So 322437120 * sizeof(double) = 2,579,496,960 bytes, which is larger than
> 2^31-1, the maximum for a signed 4-byte integer. This did cause a 4-byte
> integer overflow in PnetCDF. But even MPI-IO has a problem with a request
> of this size.
>
> If you try the code fragment below, ROMIO will throw an error class
> MPI_ERR_ARG, and error string "Invalid count argument".
>
> int len = 322437120;
> double *buf = (double*) malloc(len * sizeof(double));
>
> /* len fits in an int, but len * sizeof(double) = 2,579,496,960 bytes
>  * exceeds 2^31-1, so ROMIO rejects the count. */
> int err = MPI_File_write(fh, buf, len, MPI_DOUBLE, &status);
> if (err != MPI_SUCCESS) {
>     int errorStringLen;
>     char errorString[MPI_MAX_ERROR_STRING];
>     MPI_Error_string(err, errorString, &errorStringLen);
>     printf("Error: MPI_File_write (%s)\n", errorString);
> }
>
> A possible PnetCDF solution is to detect the overflow and divide a large request
> into multiple smaller ones, each with an upper bound of 2^31-1 bytes.
> Or PnetCDF could simply throw an error, like MPI-IO.
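>
> Roughly, an untested sketch of the chunked approach (nelems, buf, fh, and
> status stand in for the original request):
>
> /* largest double count that stays under 2^31-1 bytes per call;
>  * INT_MAX is from <limits.h> */
> MPI_Offset max_nelems = INT_MAX / sizeof(double);
> MPI_Offset remaining  = nelems;   /* total doubles requested */
> double *ptr = buf;
> int err = MPI_SUCCESS;
>
> while (remaining > 0 && err == MPI_SUCCESS) {
>     int n = (int)(remaining < max_nelems ? remaining : max_nelems);
>     err = MPI_File_write(fh, ptr, n, MPI_DOUBLE, &status);
>     ptr += n;
>     remaining -= n;
> }
>
> The collective case is trickier, because every process would then have to
> issue the same number of MPI_File_write_all calls.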
>
> Any suggestion?
>
> Wei-keng
>
> On Feb 17, 2013, at 1:34 PM, Jim Edwards wrote:
>
> > Found the problem in the test program; a corrected program is attached. This reminds me of another issue: the interface to nfmpi_iput_vara is not defined in pnetcdf.mod.
> >
> > - Jim
> >
> > On Sun, Feb 17, 2013 at 11:43 AM, Jim Edwards <jedwards at ucar.edu> wrote:
> > In my larger program I am getting an error:
> >
> > PMPI_Type_create_struct(139): Invalid value for blocklen, must be non-negative but is -1715470336
> >
> > I see a note about this in nonblocking.c:
> >
> > for (j=0; j<reqs[i].varp->ndims; j++)
> >     blocklens[i] *= reqs[i].count[j];
> > /* Warning! blocklens[i] might overflow */
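> >
> > (That matches the error above: 322437120 * 8 = 2,579,496,960, which wraps
> > to 2,579,496,960 - 2^32 = -1,715,470,336 in a 32-bit signed int, exactly
> > the blocklen value in the PMPI_Type_create_struct message.)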
> >
> >
> > But when I tried to distill this into a small test case I got a different error. I've attached the test program anyway, because I can't spot any error in it and think the problem must be in PnetCDF. Also, it seems that instead of calling MPI_Type_create_struct you should be calling MPI_Type_create_subarray, which would avoid the problem of blocklens overflowing.
> >
> > This test program is written for 8 MPI tasks, but it uses a lot of memory, so you may need more than one node to run it.
> >
> > --
> > Jim Edwards
> >
> > CESM Software Engineering Group
> > National Center for Atmospheric Research
> > Boulder, CO
> > 303-497-1842
> >
> > <testpnetcdf5.F90>
>
>
> --
> Jim Edwards
>
>
> <testpnetcdf5.F90>
More information about the parallel-netcdf mailing list