pnetcdf and large transfers

Rob Ross rross at mcs.anl.gov
Tue Jul 2 08:21:13 CDT 2013


The allreduce may be a little expensive to do all the time, so I was wondering if there were clever ways to avoid it (e.g., this variable is too small to need multiple passes, that sort of thing)?

-- Rob

On Jul 2, 2013, at 8:11 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:

> On Mon, Jul 01, 2013 at 04:57:45PM -0500, Rob Ross wrote:
>> You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
> 
> That's a pragmatic if unsatisfying solution, sure.  
> 
> Determining how many calls are needed is not hard.  We already compute
> "nbytes":
> 
> Algorithm:
> 
> - ceiling(nbytes/(1*GiB)): number of transfers
> 
> - MPI_Allreduce to find the max transfers
> 
> - carry out that many MPI_File_{write/read}_all, relying on the
>  implicit file pointer to help us keep track of where we are in the
>  file view.
> 
> I think, given the two-phase overhead, we'd want N transfers of mostly
> the same size, instead of some gigabyte-sized transfers and then one
> final "remainder" transfer of possibly a handful of bytes.
> When transferring multiple gigabytes, the point is probably academic
> anyway.
> 
> Given the state of large datatype descriptions in "just pulled it from
> git" MPICH (they will need work), I'm no longer optimistic that
> we'll see widespread support for large datatypes any time soon.
> 
> ==rob
> 
>> Rob
>> 
>> On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
>> 
>>> I'm working on fixing a long-standing bug with the ROMIO MPI-IO
>>> implementation where requests of more than 32 bits worth of data (2
>>> GiB or more) would not be supported.
>>> 
>>> Some background:  The MPI_File read and write routines take an
>>> MPI-typical "buffer, count, datatype" tuple to describe accesses.
>>> The pnetcdf library takes a get or put call and processes the
>>> multi-dimensional array description into the simpler MPI-IO file
>>> model: a linear stream of bytes.
>>> 
>>> So, for example, "ncmpi_get_vara_double_all" will set up the file view
>>> accordingly, but describe the memory region as some number of MPI_BYTE
>>> items. 
>>> 
>>> This is the prototype for MPI_File_write_all:
>>> 
>>> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
>>>                      MPI_Datatype datatype, MPI_Status *status)
>>> 
>>> So you probably see the problem: 'int count' -- integers are still 32
>>> bits on many systems (Linux x86_64, Blue Gene, ppc64): how do we
>>> describe more than 2 GiB of data?
>>> 
>>> One way is to punt: if we detect that the number of bytes won't fit
>>> into an integer, pnetcdf returns an error.  I think I can do better,
>>> though, even if my scheme is growing crazier by the moment:
>>> 
>>> RobL's crazy type scheme:
>>> - given N, a count of number of bytes
>>> - we pick a chunk size (call it 1 MiB for now, to buy us some time, but
>>> one could select this chunk size at run time)
>>> - We make M contig types to describe the first M*chunk_size bytes of
>>> the request
>>> - We have "remainder" bytes for the rest of the request.
>>> 
>>> - Now we have two regions: one primary region described with a count
>>> of contiguous (MPI_Type_contiguous) types, and a second remainder
>>> region described with MPI_BYTE types
>>> 
>>> - We make a struct type describing those two pieces, and pass that to
>>> MPI-IO
>>> 
>>> MPI_Type_struct takes MPI_Aint displacements.  Now on some old
>>> systems (like my primary development machine up until a year ago),
>>> MPI_Aint is 32 bits.  Well, on those systems the caller is out of
>>> luck: how are they going to address the e.g. 3 GiB of data we toss
>>> their way?
>>> 
>>> 
>>> The attached diff demonstrates what I'm trying to do. The
>>> creation of these types fails on MPICH so I cannot test this scheme
>>> yet.  Does it look goofy to any of you?
>>> 
>>> thanks
>>> ==rob
>>> 
>>> -- 
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>> <rjl_bigtype_changes.diff>
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA


More information about the parallel-netcdf mailing list