pnetcdf and large transfers

Rob Ross rross at mcs.anl.gov
Mon Jul 1 16:57:45 CDT 2013


You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
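
A rough sketch of what that could look like (the function name
write_in_pieces is made up, and this assumes the request has already
been flattened to bytes).  The one wrinkle with collective I/O is that
every process must make the same number of MPI_File_write_all calls,
so the iteration count has to be agreed on first, e.g. with an
allreduce:

    #include <limits.h>
    #include <mpi.h>

    static int write_in_pieces(MPI_File fh, const char *buf,
                               MPI_Offset nbytes, MPI_Comm comm)
    {
        /* calls this process needs: ceil(nbytes / INT_MAX) */
        MPI_Offset mine = (nbytes + INT_MAX - 1) / INT_MAX;
        MPI_Offset niters;

        /* everyone must participate in every collective call */
        MPI_Allreduce(&mine, &niters, 1, MPI_OFFSET, MPI_MAX, comm);

        MPI_Offset remaining = nbytes;
        for (MPI_Offset i = 0; i < niters; i++) {
            /* processes that run out of data write zero bytes */
            int count = (remaining > INT_MAX) ? INT_MAX
                                              : (int)remaining;
            int err = MPI_File_write_all(fh, buf, count, MPI_BYTE,
                                         MPI_STATUS_IGNORE);
            if (err != MPI_SUCCESS)
                return err;
            buf += count;
            remaining -= count;
        }
        return MPI_SUCCESS;
    }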

Rob

On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:

> I'm working on fixing a long-standing bug in the ROMIO MPI-IO
> implementation where requests of more than 32 bits' worth of data (2
> GiB or more) are not supported.
> 
> Some background:  The MPI_File read and write routines take an
> MPI-typical "buffer, count, datatype" tuple to describe accesses.
> The pnetcdf library takes a get or put call and translates the
> multi-dimensional array description into the simpler MPI-IO file
> model: a linear stream of bytes.
> 
> So, for example, "ncmpi_get_vara_double_all" will set up the file view
> accordingly, but describe the memory region as some number of MPI_BYTE
> items. 
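> 
> Roughly, the pattern is (a simplified sketch: error handling and the
> construction of 'filetype' from the variable's file layout are
> elided, and 'nbytes' is the size of the flattened request):
> 
>   /* describe the variable's layout in the file... */
>   MPI_File_set_view(fh, disp, MPI_BYTE, filetype, "native",
>                     MPI_INFO_NULL);
>   /* ...and the memory buffer as a flat run of bytes */
>   MPI_File_read_all(fh, buf, (int)nbytes, MPI_BYTE, &status);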
> 
> This is the prototype for MPI_File_write_all:
> 
> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
>                       MPI_Datatype datatype, MPI_Status *status)
> 
> So you probably see the problem: 'int count' -- integers are still 32
> bits on many systems (Linux x86_64, Blue Gene, ppc64): how do we
> describe more than 2 GiB of data?
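> 
> In pnetcdf terms, the check itself is simple (a sketch; 'nelems' and
> 'el_size' stand in for the flattened request size, and
> NC_EINTOVERFLOW is pnetcdf's integer-overflow error code):
> 
>   MPI_Offset nbytes = nelems * el_size;
>   if (nbytes > INT_MAX)             /* won't fit in 'int count' */
>       return NC_EINTOVERFLOW;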
> 
> One way is to punt: if we detect that the number of bytes won't fit
> into an integer, pnetcdf returns an error.  I think I can do better,
> though my scheme is growing crazier by the moment:
> 
> RobL's crazy type scheme:
> - given N, the number of bytes in the request
> - we pick a chunk size (call it 1 MiB for now, to buy us some time,
>  though one could select this chunk size at run time)
> - We make a contiguous type of chunk_size bytes; a count of M of
>  these describes the first M*chunk_size bytes of the request
> - We have "remainder" bytes for the rest of the request.
> 
> - Now we have two regions: one primary region described by a count of
>  M contiguous-chunk types, and a second remainder region described
>  with MPI_BYTE types
> 
> - We make a struct type describing those two pieces and pass that to
>  MPI-IO (see the sketch below)
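> 
> Putting the pieces together, the whole scheme looks roughly like this
> (untested, as noted below; the function name and the fixed 1 MiB
> chunk are just for illustration, and I use the MPI-2 name
> MPI_Type_create_struct):
> 
>   #define CHUNK (1024*1024)            /* 1 MiB chunk size */
> 
>   int make_big_type(MPI_Offset nbytes, MPI_Datatype *newtype)
>   {
>       MPI_Offset M  = nbytes / CHUNK;  /* whole chunks */
>       int remainder = (int)(nbytes % CHUNK);
> 
>       MPI_Datatype chunktype;
>       MPI_Type_contiguous(CHUNK, MPI_BYTE, &chunktype);
> 
>       int          blocklens[2] = { (int)M, remainder };
>       MPI_Datatype types[2]     = { chunktype, MPI_BYTE };
>       /* the remainder starts M*CHUNK bytes in; this displacement
>        * is where a 32-bit MPI_Aint runs out of room */
>       MPI_Aint     disps[2]     = { 0, (MPI_Aint)(M * CHUNK) };
> 
>       MPI_Type_create_struct(2, blocklens, disps, types, newtype);
>       MPI_Type_commit(newtype);
>       MPI_Type_free(&chunktype);
>       return MPI_SUCCESS;
>   }
> 
> The caller then passes exactly one of these to MPI-IO, e.g.
> MPI_File_write_all(fh, buf, 1, bigtype, &status).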
> 
> MPI_Type_struct takes MPI_Aint displacements.  Now on some old
> systems (like my primary development machine up until a year ago),
> MPI_Aint is 32 bits.  Well, on those systems the caller is out of
> luck: how are they going to address, say, the 3 GiB of data we toss
> their way?
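> 
> So on those systems the best we can probably do is detect the
> situation and fail (a sketch):
> 
>   if (sizeof(MPI_Aint) == 4 && nbytes > INT_MAX)
>       return NC_EINTOVERFLOW;  /* can't even express the offsets */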
> 
> 
> The attached diff demonstrates what I'm trying to do.  The creation
> of these types fails on MPICH, so I cannot test this scheme yet.
> Does it look goofy to any of you?
> 
> thanks
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
> <rjl_bigtype_changes.diff>


