pnetcdf and large transfers

Rob Latham robl at mcs.anl.gov
Tue Jul 2 09:05:01 CDT 2013


On Tue, Jul 02, 2013 at 08:21:13AM -0500, Rob Ross wrote:
> The allreduce may be a little expensive to do all the time, so I was
> wondering if there were clever ways to avoid it (e.g., this variable
> is too small to need multiple passes, that sort of thing)?

OK, here's what we know:

- the size of the variable, as you say.  If 
  product(dimensions[])*sizeof(type) is less than 2 GiB (actually, 1
  GiB if the file is of type integer and the memory is of type
  double), we'll never transfer too much data.

  Checking this first ensures we never call allreduce when the
  variable is small, so the extra latency only shows up when we are
  already paying a huge i/o cost (see the sketch below).

- Each process knows the size of its own request, but that information
  is local to the process.

And that's it, right?
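
A minimal sketch of that early exit (untested; 'ndims', 'shape', and
'el_size' are illustrative stand-ins, not the actual pnetcdf
internals):

    int i;
    /* upper bound on any one process's transfer: the whole variable,
     * counted in in-memory bytes */
    MPI_Offset varsize = el_size;        /* sizeof the in-memory type */
    for (i = 0; i < ndims; i++)
        varsize *= shape[i];

    if (varsize < INT_MAX) {
        /* even the full variable fits in 'int count': single
         * collective call, no allreduce, no added latency */
    } else {
        /* only now do we pay for the allreduce and multiple passes */
    }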

I think I'd be happier leaning on the MPI library to do the right
thing, but then I'd be signing up for 3 years of bug reporting against
various vendor MPIs.

==rob

> -- Rob
> 
> On Jul 2, 2013, at 8:11 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
> 
> > On Mon, Jul 01, 2013 at 04:57:45PM -0500, Rob Ross wrote:
> >> You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
> > 
> > That's a pragmatic if unsatisfying solution, sure.  
> > 
> > Determining how many calls are needed is not hard.  We already compute
> > "nbytes":
> > 
> > Algorithm (rough code sketch below):
> > 
> > - ceiling(nbytes/(1*GiB)): number of transfers
> > 
> > - MPI_Allreduce to find the max transfers
> > 
> > - carry out that many MPI_File_{write/read}_all, relying on the
> >  implicit file pointer to help us keep track of where we are in the
> >  file view.
> > 
> > I think, given the two-phase overhead, we'd want N transfers of roughly
> > the same size, instead of several gigabyte-sized transfers and then one
> > final "remainder" transfer of possibly a handful of bytes.
> > When transferring multiple gigabytes, the point is probably academic
> > anyway.
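> >
> > Roughly, in code (untested sketch; 'buf', 'nbytes', 'fh', and 'comm'
> > are whatever we already have in hand at that point):
> >
> >     MPI_Offset maxpercall = 1024*1024*1024;  /* 1 GiB cap per call */
> >     int i, ntimes, my_ntimes;
> >
> >     my_ntimes = (int)((nbytes + maxpercall - 1) / maxpercall);
> >     MPI_Allreduce(&my_ntimes, &ntimes, 1, MPI_INT, MPI_MAX, comm);
> >
> >     /* everyone makes the same number of collective calls; a process
> >      * that runs out of data just writes zero bytes */
> >     MPI_Offset chunk = (nbytes + ntimes - 1) / ntimes;
> >     MPI_Offset off = 0;
> >     for (i = 0; i < ntimes; i++) {
> >         MPI_Offset rem = nbytes - off;
> >         int count = (int)(rem < chunk ? rem : chunk);
> >         MPI_File_write_all(fh, (char *)buf + off, count, MPI_BYTE,
> >                            MPI_STATUS_IGNORE);
> >         off += count;
> >     }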
> > 
> > Given the state of large datatype descriptions in "just pulled it from
> > git" MPICH (they will need work), I'm no longer optimistic that
> > we'll see widespread support for large datatypes anytime soon.
> > 
> > ==rob
> > 
> >> Rob
> >> 
> >> On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
> >> 
> >>> I'm working on fixing a long-standing bug with the ROMIO MPI-IO
> >>> implementation where requests of more than 32 bits worth of data (2
> >>> GiB or more) would not be supported.
> >>> 
> >>> Some background:  The MPI_File read and write routines take an
> >>> MPI-typical "buffer, count, datatype" tuple to describe accesses.
> >>> The pnetcdf library takes a get or put call and translates the
> >>> multi-dimensional array description into the simpler MPI-IO file
> >>> model: a linear stream of bytes.
> >>> 
> >>> So, for example, "ncmpi_get_vara_double_all" will set up the file view
> >>> accordingly, but describe the memory region as some number of MPI_BYTE
> >>> items. 
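> >>>
> >>> (Schematically, the MPI-IO calls underneath end up looking like
> >>>
> >>>     MPI_File_set_view(fh, disp, MPI_BYTE, filetype, "native", info);
> >>>     MPI_File_read_all(fh, buf, (int)nbytes, MPI_BYTE, &status);
> >>>
> >>> where 'filetype' encodes the subarray selection and 'nbytes' is the
> >>> flattened size of the in-memory region.)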
> >>> 
> >>> This is the prototype for MPI_File_write_all:
> >>> 
> >>> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
> >>>                      MPI_Datatype datatype, MPI_Status *status)
> >>> 
> >>> So you probably see the problem: 'int count' -- integers are still 32
> >>> bits on many systems (Linux x86_64, Blue Gene, ppc64): how do we
> >>> describe more than 2 GiB of data?
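> >>>
> >>> (Detecting the case is the easy part -- with nbytes held in an
> >>> MPI_Offset it is just
> >>>
> >>>     if (nbytes > INT_MAX) {
> >>>         /* more than 'int count' can describe: now what? */
> >>>     }
> >>>
> >>> -- the hard part is deciding what happens next.)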
> >>> 
> >>> One way is to punt: if we detect that the number of bytes won't fit
> >>> into an integer, pnetcdf returns an error.  I think I can do better,
> >>> though my scheme is growing crazier by the moment:
> >>> 
> >>> RobL's crazy type scheme (sketched in code below):
> >>> - given N, a count of number of bytes
> >>> - we pick a chunk size (call it 1 MiB now, to buy us some time, but
> >>> one could select this chunk at run-time)
> >>> - We make M contig types to describe the first M*chunk_size bytes of
> >>> the request
> >>> - We have "remainder" bytes for the rest of the request.
> >>> 
> >>> - Now we have two regions: one primary region described with a count of
> >>> MPI_CONTIG types, and a second remainder region described with
> >>> MPI_BYTE types
> >>> 
> >>> - We make a struct type describing those two pieces, and pass that to
> >>> MPI-IO
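> >>>
> >>> In code, roughly (untested sketch; it uses MPI_Type_create_struct
> >>> rather than the deprecated MPI_Type_struct, and the 1 MiB chunk is
> >>> hard-coded just for illustration):
> >>>
> >>>     MPI_Offset chunk_size = 1024*1024;           /* 1 MiB, for now */
> >>>     MPI_Offset nchunks    = nbytes / chunk_size; /* the "M" above */
> >>>     MPI_Offset remainder  = nbytes % chunk_size;
> >>>
> >>>     MPI_Datatype chunktype, newtype, types[2];
> >>>     int blocklens[2];
> >>>     MPI_Aint disps[2];
> >>>
> >>>     MPI_Type_contiguous((int)chunk_size, MPI_BYTE, &chunktype);
> >>>
> >>>     blocklens[0] = (int)nchunks;    types[0] = chunktype;
> >>>     blocklens[1] = (int)remainder;  types[1] = MPI_BYTE;
> >>>     disps[0] = 0;
> >>>     disps[1] = (MPI_Aint)(nchunks * chunk_size);  /* MPI_Aint! */
> >>>
> >>>     MPI_Type_create_struct(2, blocklens, disps, types, &newtype);
> >>>     MPI_Type_commit(&newtype);
> >>>     MPI_Type_free(&chunktype);
> >>>
> >>>     /* then pass (buf, 1, newtype) to MPI_File_write_all et al. */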
> >>> 
> >>> MPI_Type_struct takes MPI_Aint displacements.  Now on some old
> >>> systems (like my primary development machine up until a year ago),
> >>> MPI_Aint is 32 bits.  Well, on those systems the caller is out of
> >>> luck: how are they going to address the, say, 3 GiB of data we toss
> >>> their way?
> >>> 
> >>> 
> >>> The attached diff demonstrates what I'm trying to do.  The
> >>> creation of these types fails on MPICH, so I cannot test this scheme
> >>> yet.  Does it look goofy to any of you?
> >>> 
> >>> thanks
> >>> ==rob
> >>> 
> >>> -- 
> >>> Rob Latham
> >>> Mathematics and Computer Science Division
> >>> Argonne National Lab, IL USA
> >>> <rjl_bigtype_changes.diff>
> > 
> > -- 
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

