pnetcdf and large transfers

Rob Latham robl at mcs.anl.gov
Tue Jul 2 08:20:17 CDT 2013


On Mon, Jul 01, 2013 at 05:48:47PM -0500, Wei-keng Liao wrote:
> > int MPI_File_write_all(MPI_File fh, const void *buf, int count,
> >                       MPI_Datatype datatype, MPI_Status *status)
> > 
> > So you probably see the problem: 'int count' -- integers are still 32
> > bits on many systems (linux x86_64, blue gene, ppc64): how do we
> > describe more than 2 GiB of data?
> 
> I thought this was a ROMIO problem, which restricts the number of
> bytes in a single read/write call. In romio/adio/common/ad_write.c, the
> lines below trigger an assertion.
> 
>     MPI_Type_size(datatype, &datatype_size);
>     len = (ADIO_Offset)datatype_size * (ADIO_Offset)count;
>     ADIOI_Assert(len == (unsigned int) len); /* read takes an unsigned int parm */
> 
> The limit is 4GiB in ROMIO, instead of 2GiB in PnetCDF. 

The limit in ROMIO is an artifact of MPI-2's decision to use a plain
'int' to express the count of datatypes.  The MPI_File routines are not
the only ones afflicted: MPI_Type_size, for example, also reports its
result through an 'int'.
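
To make the arithmetic concrete, here is a minimal sketch in plain C
(no MPI needed) of the check quoted above.  The names sketch_offset,
total_bytes and fits_unsigned_int are made up for this illustration
and are not ROMIO code; int64_t stands in for ADIO_Offset, and a
32-bit 'unsigned int' is assumed, as on the platforms mentioned
earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ADIO_Offset: a 64-bit signed offset. */
typedef int64_t sketch_offset;

/* Total bytes described by 'count' elements of 'datatype_size' bytes.
 * Both operands are widened first, so the product of two non-negative
 * ints cannot wrap. */
sketch_offset total_bytes(int datatype_size, int count)
{
    assert(datatype_size >= 0 && count >= 0);
    return (sketch_offset)datatype_size * (sketch_offset)count;
}

/* Mirrors the quoted test: nonzero only when len survives a round trip
 * through unsigned int, i.e. 0 <= len <= UINT_MAX. */
int fits_unsigned_int(sketch_offset len)
{
    return len == (sketch_offset)(unsigned int)len;
}
```

With a 4-byte element type, a count near INT_MAX already describes
close to 8 GiB, so fits_unsigned_int() returns 0 and the real
ADIOI_Assert would fire, even though the 'int count' itself is legal.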

MPI-3, though, has MPI_Type_size_x, MPI_Get_elements_x, and a few other
routines that use an MPI_Count rather than an 'int'.

> If this assertion is removed/resolved, your datatype approach will
> make sense for > 4GB I/O.  In POSIX read/write, the count argument
> is of size_t type, which is 8 bytes on 64-bit machines. ROMIO should
> check the size of size_t at configure time to avoid the above
> assertion.

I've started some work in this area on a git branch called
"ticket-1742-bigio": 

http://git.mpich.org/mpich-dev.git/shortlog/refs/heads/ticket-1742-bigio

In this branch, I've used the MPI-3 _x routines to get larger
datatype counts, and I've wrapped the read() and write() system calls
so that they process as much as they can in a loop:

http://git.mpich.org/mpich-dev.git/commitdiff/d358b4369a9422aeff9d281fafa3afc53fd553b9
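
As a rough illustration of the "process as much as they can in a loop"
idea, a wrapper around write() might look like the sketch below.  This
is not the code from the commit above: write_fully is a hypothetical
name, and the error handling (retry on EINTR, give up on anything
else) is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all 'len' bytes have
 * been written or a real error occurs.  Returns 0 on success and -1 on
 * failure, leaving errno as set by write(). */
int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    assert(buf != NULL || len == 0);

    while (left > 0) {
        /* write() may accept fewer bytes than requested. */
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data moved: retry */
            return -1;      /* caller inspects errno */
        }
        p += n;             /* advance past the bytes already written */
        left -= (size_t)n;
    }
    return 0;
}
```

Because the length is a size_t and the loop consumes whatever each
call accepts, the caller no longer has to care how many bytes a single
write() is willing to take.  A read() wrapper would follow the same
pattern, with the extra case that a return of 0 means end of file.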


> Note that the nonblocking APIs might end up with noncontiguous buffer
> type when calling MPI-IO APIs. Breaking it apart for chunking I/O needs
> some work, maybe requiring more temp memory space.

I haven't considered the nonblocking case yet, it's true.  Maybe this
"simple fix" is getting ever less simple the more we think about it...

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

