pnetcdf and large transfers

Rob Latham robl at mcs.anl.gov
Mon Jul 1 16:51:12 CDT 2013


I'm working on fixing a long-standing bug in the ROMIO MPI-IO
implementation: requests too large for a signed 32-bit count (2 GiB
or more) are not supported.

Some background:  The MPI_File read and write routines take an
MPI-typical "buffer, count, datatype" tuple to describe accesses.
The pnetcdf library takes a get or put call and converts the
multi-dimensional array description into the simpler MPI-IO file
model: a linear stream of bytes.

So, for example, "ncmpi_get_vara_double_all" will set up the file view
accordingly, but describe the memory region as some number of MPI_BYTE
items. 

This is the prototype for MPI_File_write_all:

int MPI_File_write_all(MPI_File fh, const void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)

So you probably see the problem: 'int count' -- integers are still 32
bits on many systems (Linux x86_64, Blue Gene, ppc64).  How do we
describe 2 GiB or more of data?
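
To make the limit concrete, here is a minimal sketch of the check a
library could run before handing a byte count to MPI-IO.  The helper
name count_fits_in_int is hypothetical (it is not part of pnetcdf or
MPI), and the sketch assumes the request is described as a count of
MPI_BYTE items:

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper, not a pnetcdf or MPI function.
 * Returns 1 if nbytes can be passed as the 'int count' argument of
 * MPI_File_write_all with MPI_BYTE as the datatype, 0 otherwise. */
int count_fits_in_int(int64_t nbytes)
{
    return nbytes >= 0 && nbytes <= INT_MAX;
}
```

On a system with 32-bit int, the largest count that passes is
2 GiB minus one byte; a 3 GiB request does not.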

One way is to punt: if we detect that the number of bytes won't fit
into an integer, pnetcdf returns an error.  I think I can do better,
but my scheme is growing crazier by the moment:

RobL's crazy type scheme:
- Given N, the number of bytes in the request
- we pick a chunk size (call it 1 MiB now, to buy us some time, but
  one could select this chunk at run-time)
- We make a contiguous type of chunk_size bytes and use M of them to
  describe the first M*chunk_size bytes of the request
- We have "remainder" bytes for the rest of the request.

- Now we have two regions: a primary region described with a count of
  contiguous types (built with MPI_Type_contiguous), and a remainder
  region described with MPI_BYTE types

- We make a struct type describing those two pieces, and pass that to
  MPI-IO
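
The arithmetic behind the scheme above can be sketched without any MPI
calls.  This is only an illustration under stated assumptions: the
struct big_layout, the function split_request, and the fixed
CHUNK_SIZE of 1 MiB are names made up for the example, not taken from
the diff.  The three computed values are what one would feed to the
contiguous type (its count) and to the struct type (the remainder
block length and its displacement):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define CHUNK_SIZE (1 << 20)   /* 1 MiB, the chunk size suggested above */

/* Hypothetical description of the two-region layout. */
struct big_layout {
    int     nchunks;         /* M: number of CHUNK_SIZE-byte chunks      */
    int     remainder;       /* leftover bytes, always < CHUNK_SIZE      */
    int64_t remainder_disp;  /* byte offset where the remainder begins   */
};

/* Split nbytes into M whole chunks plus a remainder.
 * Returns 0 on success, -1 if M itself would not fit in an int. */
int split_request(int64_t nbytes, struct big_layout *out)
{
    assert(nbytes >= 0);

    int64_t m = nbytes / CHUNK_SIZE;
    if (m > INT_MAX)
        return -1;           /* even the chunk count overflows int */

    out->nchunks        = (int)m;
    out->remainder      = (int)(nbytes % CHUNK_SIZE);
    out->remainder_disp = m * CHUNK_SIZE;   /* 64-bit multiplication */
    return 0;
}
```

For a request of 3 GiB plus 5 bytes this yields M = 3072, a remainder
of 5, and a remainder displacement of exactly 3 GiB.  With a 32-bit
int and 1 MiB chunks the chunk count only overflows at roughly 2 PiB,
which is the "buy us some time" part.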

MPI_Type_struct takes its displacements as MPI_Aint values.  Now on
some old systems (like my primary development machine up until a year
ago), MPI_Aint is 32 bits.  Well, on those systems the caller is out
of luck: how are they going to address the, e.g., 3 GiB of data we
toss their way?
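
A small sketch of that constraint, with the address width passed in as
a parameter so both cases can be shown on one machine.  The function
fits_in_signed_width is a hypothetical name for illustration; a real
check would compare against the range of MPI_Aint on the system at
hand:

```c
#include <stdint.h>

/* Hypothetical helper: can a non-negative byte offset be represented
 * in a signed integer that is 'bits' wide (2 <= bits <= 64)?
 * Returns 1 if it fits, 0 otherwise. */
int fits_in_signed_width(int64_t nbytes, int bits)
{
    if (bits < 2 || nbytes < 0)
        return 0;
    if (bits >= 64)
        return 1;            /* any non-negative int64_t fits */

    int64_t max = (INT64_C(1) << (bits - 1)) - 1;
    return nbytes <= max;
}
```

A 3 GiB displacement fits when the width is 64 bits and does not fit
when it is 32 bits, which is the out-of-luck case.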


The attached diff demonstrates what I'm trying to do.  Creating these
types fails on MPICH, so I cannot test the scheme yet.  Does it look
goofy to any of you?

thanks
==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rjl_bigtype_changes.diff
Type: text/x-diff
Size: 5819 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20130701/4cad3545/attachment.diff>

