pnetcdf and large transfers
Rob Latham
robl at mcs.anl.gov
Mon Jul 1 16:51:12 CDT 2013
I'm working on fixing a long-standing bug in the ROMIO MPI-IO
implementation: requests too large to count in a signed 32-bit
integer (2 GiB or more of data) are not supported.
Some background: The MPI_File read and write routines take an
MPI-typical "buffer, count, datatype" tuple to describe accesses.
The pnetcdf library takes a get or put call and processes the
multi-dimensional array description into the simpler MPI-IO file
model: a linear stream of bytes.
So, for example, "ncmpi_get_vara_double_all" will set up the file view
accordingly, but describe the memory region as some number of MPI_BYTE
items.
This is the prototype for MPI_File_write_all:
int MPI_File_write_all(MPI_File fh, const void *buf, int count,
MPI_Datatype datatype, MPI_Status *status)
So you probably see the problem: 'int count' -- integers are still 32
bits on many systems (linux x86_64, blue gene, ppc64): how do we
describe 2 GiB or more of data?
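
To make the limit concrete, here is a minimal sketch of the check a
library would need before passing a byte count as 'int count' with
MPI_BYTE. The helper name count_fits_in_int is hypothetical; it is not
part of pnetcdf or MPI, and the sketch assumes the byte count is held
in a 64-bit long long.

```c
#include <limits.h>

/* Hypothetical helper (not part of pnetcdf or MPI): returns 1 if a byte
 * count can be passed as the 'int count' argument together with MPI_BYTE,
 * and 0 if it is negative or too large for an int. */
int count_fits_in_int(long long nbytes)
{
    return nbytes >= 0 && nbytes <= (long long)INT_MAX;
}
```

On a system with 32-bit int, a 3 GiB request fails this check, while
anything up to 2 GiB minus one byte passes.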
One way is to punt: if we detect that the number of bytes won't fit
into an integer, pnetcdf returns an error. I think I can do better,
though, but my scheme is growing crazier by the moment:
RobL's crazy type scheme:
- Given N, the number of bytes in the request
- we pick a chunk size (call it 1 MiB now, to buy us some time, but
one could select this chunk at run-time)
- We make M contig types to describe the first M*chunk_size bytes of
the request
- We have "remainder" bytes for the rest of the request.
- Now we have two regions: one primary region described with a count of
M contig types, and a second remainder region described with
MPI_BYTE types
- We make a struct type describing those two pieces, and pass that to
MPI-IO
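
The arithmetic behind the steps above can be sketched without MPI.
This is only an illustration under stated assumptions, not the
contents of the attached diff: the struct big_split and the function
split_request are hypothetical names, and N is assumed to be held in a
64-bit long long.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical result type, for illustration only. */
struct big_split {
    int nchunks;                /* M: count of chunk-sized contig types */
    int remainder;              /* leftover bytes, described as MPI_BYTE */
    long long remainder_offset; /* byte displacement of the remainder */
};

/* Split nbytes into M whole chunks plus a remainder.
 * Returns 0 on success, -1 if M itself would not fit in an int. */
int split_request(long long nbytes, int chunk_size, struct big_split *out)
{
    assert(nbytes >= 0 && chunk_size > 0);

    long long m = nbytes / chunk_size;
    if (m > INT_MAX)
        return -1;

    out->nchunks = (int)m;
    out->remainder = (int)(nbytes % chunk_size); /* always < chunk_size */
    out->remainder_offset = m * chunk_size;      /* where the bytes start */
    return 0;
}
```

With a 1 MiB chunk, a request of 3 GiB plus 5 bytes becomes 3072
chunks and a 5-byte remainder. Both counts fit comfortably in an int,
and M only overflows once a request reaches 2 PiB. In an MPI program
the two block lengths would be nchunks and remainder, and the two
displacements would be 0 and remainder_offset.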
MPI_Type_struct takes its displacements as MPI_Aint values. Now on
some old systems (like my primary development machine up until a year
ago), MPI_Aint is 32 bits. Well, on those systems the caller is out of
luck: how are they going to address the e.g. 3 GiB of data we toss
their way?
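
The address-width concern can be illustrated with a small check. This
is a sketch only: displacement_fits is a hypothetical helper, and its
aint_bits parameter stands in for 8 * sizeof(MPI_Aint) on the target
system so the example does not need an MPI installation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical check: would a byte displacement fit in a signed address
 * integer that is aint_bits wide? aint_bits stands in for
 * 8 * sizeof(MPI_Aint) on the target system. */
int displacement_fits(long long disp, int aint_bits)
{
    assert(aint_bits >= 2 && aint_bits <= 64);

    long long max = (aint_bits == 64)
        ? LLONG_MAX
        : (1LL << (aint_bits - 1)) - 1;
    return disp >= 0 && disp <= max;
}
```

A 3 GiB displacement does not fit when aint_bits is 32, but does when
it is 64, which is why the struct type can only describe the remainder
region on systems with a 64-bit MPI_Aint.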
The attached diff demonstrates what I'm trying to do. The
creation of these types fails on MPICH so I cannot test this scheme
yet. Does it look goofy to any of you?
thanks
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rjl_bigtype_changes.diff
Type: text/x-diff
Size: 5819 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20130701/4cad3545/attachment.diff>