Re: pnetcdf and large transfers

Lofstead, Gerald F II gflofst at sandia.gov
Tue Jul 2 08:48:23 CDT 2013


I know I am not speaking with precision here, but how about this for an idea: keep it one call, but use an MPI_Info hint or some similar mechanism to specify a larger size and type parameter at an earlier point. Then the call could stay single and exceed 4 GiB, using the extra hints to work around the API. That seems to have been a bit of the spirit of hints anyway. I know there are issues with returning the extra info, but a custom error for > 4 GiB might make it work. This is a bit ugly too, but the core thought might offer an alternative.
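A hedged sketch of what such a hint mechanism might look like. The hint key and helper name below are invented for illustration; no such hint exists in MPI or pnetcdf. The one real constraint it shows is that MPI_Info values are strings, so a 64-bit length would have to be encoded as text:

```c
#include <stdio.h>

/* MPI_Info values are strings, so a 64-bit request length would have
 * to be encoded as text.  The surrounding MPI calls would look
 * roughly like:
 *   MPI_Info_create(&info);
 *   MPI_Info_set(info, "pnetcdf_large_req_len", val);  // invented key
 *   MPI_File_set_info(fh, info);
 * after which the undersized 'int count' in the write call could be
 * ignored in favor of the hinted length. */
static void encode_len_hint(long long nbytes, char *val, size_t valsz)
{
    snprintf(val, valsz, "%lld", nbytes);
}
```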

Jay

On Jul 1, 2013, at 4:01 PM, "Rob Ross" <rross at mcs.anl.gov> wrote:

> You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
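A minimal sketch of this splitting approach (the helper name and the per-call bound are mine, not from the thread):

```c
#include <limits.h>

/* Number of MPI-IO calls needed to move nbytes when each call's
 * 'int count' argument can carry at most max_per_call bytes. */
static int calls_needed(long long nbytes, int max_per_call)
{
    return (int)((nbytes + max_per_call - 1) / max_per_call);
}

/* Each call i would then write
 *   min(max_per_call, nbytes - (long long)i * max_per_call)
 * bytes starting at buffer offset i * max_per_call, e.g. via
 * MPI_File_write_all(fh, (char*)buf + off, this_count, MPI_BYTE, &st).
 * Note: collective callers must all agree on the trip count, so
 * processes with smaller requests still have to participate in
 * every round. */
```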
> 
> Rob
> 
> On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
> 
>> I'm working on fixing a long-standing bug with the ROMIO MPI-IO
>> implementation where requests of more than 32 bits worth of data (2
>> GiB or more) would not be supported.
>> 
>> Some background:  The MPI_File read and write routines take an
>> MPI-typical "buffer, count, datatype" tuple to describe accesses.
>> The pnetcdf library takes a get or put call and processes the
>> multi-dimensional array description into the simpler MPI-IO file
>> model: a linear stream of bytes.
>> 
>> So, for example, "ncmpi_get_vara_double_all" will set up the file view
>> accordingly, but describe the memory region as some number of MPI_BYTE
>> items. 
>> 
>> This is the prototype for MPI_File_write_all:
>> 
>> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
>>                      MPI_Datatype datatype, MPI_Status *status)
>> 
>> So you probably see the problem: 'int count' -- integers are still
>> 32 bits on many systems (Linux x86_64, Blue Gene, ppc64): how do we
>> describe more than 2 GiB of data?
>> 
>> One way is to punt: if we detect that the number of bytes won't fit
>> into an integer, pnetcdf returns an error.  I think I can do better,
>> though my scheme is growing crazier by the moment:
>> 
>> RobL's crazy type scheme:
>> - given N, a count of number of bytes
>> - we pick a chunk size (call it 1 MiB for now, to buy us some time,
>> but one could select this chunk size at run-time)
>> - We make M contig types to describe the first M*chunk_size bytes of
>> the request
>> - We have "remainder" bytes for the rest of the request.
>> 
>> - Now we have two regions: one primary region described with a count
>> of the contiguous type, and a second remainder region described with
>> MPI_BYTE types
>> 
>> - We make a struct type describing those two pieces, and pass that to
>> MPI-IO
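A sketch of the two-region decomposition (the arithmetic helper is mine, and the type construction is shown in comments; names are illustrative rather than taken from the actual patch):

```c
/* Split an nbytes request into m full chunks of chunk_size bytes plus
 * rem leftover bytes; remainder_disp is the byte offset at which the
 * remainder region begins. */
static void split_request(long long nbytes, int chunk_size,
                          int *m, int *rem, long long *remainder_disp)
{
    *m = (int)(nbytes / chunk_size);
    *rem = (int)(nbytes % chunk_size);
    *remainder_disp = (long long)(*m) * chunk_size;
}

/* The two regions then become one datatype, roughly:
 *   MPI_Type_contiguous(chunk_size, MPI_BYTE, &chunk);
 *   MPI_Datatype types[2] = { chunk, MPI_BYTE };
 *   int blocklens[2]      = { m, rem };
 *   MPI_Aint displs[2]    = { 0, remainder_disp };  // overflows if
 *                                                   // MPI_Aint is 32-bit
 *   MPI_Type_create_struct(2, blocklens, displs, types, &newtype);
 *   MPI_Type_commit(&newtype);
 * and the single write becomes
 *   MPI_File_write_all(fh, buf, 1, newtype, &status); */
```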
>> 
>> MPI_Type_create_struct takes MPI_Aint displacements.  Now on some
>> old systems (like my primary development machine up until a year
>> ago), MPI_Aint is 32 bits.  Well, on those systems the caller is out
>> of luck: how are they going to address the e.g. 3 GiB of data we
>> toss their way?
>> 
>> 
>> The attached diff demonstrates what I'm trying to do. The
>> creation of these types fails on MPICH so I cannot test this scheme
>> yet.  Does it look goofy to any of you?
>> 
>> thanks
>> ==rob
>> 
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> <rjl_bigtype_changes.diff>
> 



More information about the parallel-netcdf mailing list