pnetcdf and large transfers
Rob Latham
robl at mcs.anl.gov
Fri Sep 13 09:41:13 CDT 2013
Resurrecting an old thread now that I've had a chance to take a crack
at this.
Remember, if we want to relax the restriction that transfers must be
less than 2 gigs in size, we have to be a bit clever with the way we
handle the memory side of things, due to various defects in MPI:
- counts are specified as ints.
- sophisticated datatypes describing more than 2 GiB of data break
  MPICH and MPICH-derived implementations.
I implemented the "Rob Ross" approach: if a variable is larger than
2 GiB, we break the transfer up into some number of calls, each less
than 2 GiB in size.
That takes care of the memory side of the transfers.
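(Roughly like the sketch below -- not the actual PnetCDF code;
'write_in_chunks', 'xbuf', and 'offset' are made-up names for the
packed, contiguous byte buffer and its starting file offset, and the
read path is symmetric. For collective I/O, every rank additionally
has to make the same number of calls.)

    #include <limits.h>
    #include <mpi.h>

    /* write 'nbytes' bytes from a contiguous buffer, at most INT_MAX
     * bytes per MPI call so the int 'count' argument never overflows */
    static int write_in_chunks(MPI_File fh, MPI_Offset offset,
                               char *xbuf, MPI_Offset nbytes)
    {
        MPI_Status status;
        while (nbytes > 0) {
            int count = (nbytes > INT_MAX) ? INT_MAX : (int)nbytes;
            int err = MPI_File_write_at(fh, offset, xbuf, count,
                                        MPI_BYTE, &status);
            if (err != MPI_SUCCESS) return err;
            offset += count;
            xbuf   += count;
            nbytes -= count;
        }
        return MPI_SUCCESS;
    }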
Now we've got a problem with the file side. Consider a record variable:
dimensions:
        time = UNLIMITED ; // (624 currently)
        plev = 17 ;
        lat = 192 ;
        bound = 2 ;
        lon = 288 ;
variables:
        double time(time) ;
        double plev(plev) ;
        double lat(lat) ;
        double bounds_lat(lat, bound) ;
        double lon(lon) ;
        double bounds_lon(lon, bound) ;
        float va(time, plev, lat, lon) ;
When reading this record variable 'va', we can end up passing a
datatype to the MPI_File_set_view call that describes more than 2 GiB
of file data. On the memory side we can pretty easily partition the
buffer: it's a contiguous array of MPI_BYTE. But the file view for a
record variable is not so easily split up, and we'd have to re-set the
file view for every split-up I/O request. What a headache!
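To put numbers on it: one record of 'va' is 17*192*288 floats, about
3.6 MiB, so 624 records is about 2.2 GiB. The file view for 'va' looks
roughly like the sketch below (an illustration only, not PnetCDF's
actual filetype code; 'va_reclen', 'recsize', and 'nrecords' are
made-up names, and the starting displacement passed to
MPI_File_set_view is left out):

    MPI_Offset va_reclen = 17 * 192 * 288 * sizeof(float); /* ~3.6 MiB per record */
    MPI_Offset recsize   = va_reclen + sizeof(double);     /* plus time(t)        */
    int        nrecords  = 624;
    MPI_Datatype filetype;

    /* one block of va's data per record, strided by the full record size */
    MPI_Type_create_hvector(nrecords, (int)va_reclen, (MPI_Aint)recsize,
                            MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    /* 624 records * ~3.6 MiB is ~2.2 GiB: MPI_Type_size()'s int output
     * overflows, and splitting the transfer means building a new
     * filetype and calling MPI_File_set_view() for every piece. */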
So, now I'm back in the camp of checking for MPI-3 features and then
assuming we can pass in large datatypes to that library. For older
MPI libraries, we can still report "request too large".
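Concretely, the check could be as simple as the sketch below (again,
not the actual PnetCDF code): the MPI_VERSION macro and the _x
routines such as MPI_Type_size_x are standard MPI-3, while 'buftype',
'nbytes', and the error name are placeholders. A configure-time probe
for MPI_Type_size_x would work just as well.

    #if defined(MPI_VERSION) && (MPI_VERSION >= 3)
        /* the _x routines exist: trust the MPI library with datatypes
         * that describe more than 2 GiB */
        MPI_Count type_size;
        MPI_Type_size_x(buftype, &type_size);
    #else
        /* older MPI: refuse requests whose size does not fit in an int */
        if (nbytes > INT_MAX)
            return NC_EREQ_TOO_BIG;   /* placeholder error name */
    #endif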
Wei-keng: you've done a lot of datatype work for pnetcdf recently.
Anything I'm missing in this analysis?
==rob
On Tue, Jul 02, 2013 at 04:44:25PM -0500, Rob Latham wrote:
> On Tue, Jul 02, 2013 at 04:29:38PM -0500, Wei-keng Liao wrote:
> > > Perhaps I misunderstand, but I think that in the case that the I/O is to a single variable and the variable size is such that the access cannot be too large, we can safely avoid the allreduce. Right?
> >
> > If the variable is fixed-size (non-record) and its defined size is < 2 GiB, then you are right:
> > we can avoid the allreduce (for blocking APIs only). Otherwise, I think the allreduce is still
> > necessary for RobL's approach.
>
> Right. That won't catch all cases, but it will catch the important
> ones: if the amount of data stored is small, then there's not a lot of
> I/O over which to amortize the cost.
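(For the record, the agreement step being discussed looks roughly like
the sketch below, reusing the names from the memory-side sketch above;
'my_pieces' and 'comm' are illustrative, not the actual PnetCDF code.)

    /* each rank computes how many < 2 GiB pieces its own request needs,
     * then all ranks take the max so everyone makes the same number of
     * collective calls */
    int my_pieces = (int)((nbytes + INT_MAX - 1) / INT_MAX);
    int max_pieces, i;
    MPI_Allreduce(&my_pieces, &max_pieces, 1, MPI_INT, MPI_MAX, comm);

    for (i = 0; i < max_pieces; i++) {
        int count = 0;   /* ranks with no data left still participate */
        if (nbytes > 0)
            count = (nbytes > INT_MAX) ? INT_MAX : (int)nbytes;
        MPI_File_write_at_all(fh, offset, xbuf, count, MPI_BYTE, &status);
        offset += count;
        xbuf   += count;
        nbytes -= count;
    }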
>
> > > Is there something additional that we could learn as an artifact of the collective (currently proposed as an allreduce) that might help us in optimizing I/O generally?
> >
> > I am not sure about optimization; I was just thinking about making it work.
> > I wonder, when the request size is that big, whether we should worry about the cost
> > of that one additional allreduce.
>
> > I just remembered that if the default collective I/O buffer size is used (cb_buffer_size=16 MiB),
> > then the maximum size of any individual read/write (made by the aggregators) is 16 MiB. Thus,
> > I don't think we will have a problem for collective I/O (where two-phase I/O is actually
> > involved). The problem is with independent I/O. Is my understanding correct?
>
> We still need to "feed" the MPI-IO routine, though. MPI_File_read_all
> takes an integer 'count' parameter: presently we pass a count of N
> elements of type MPI_BYTE.
>
> Yes, once we get the request down into MPI-IO we're in good shape.
>
> > > I would like to have a solution in ROMIO also, but prefer a
> > > solution that is available soonest to our users, and a PnetCDF fix
> > > is superior with respect to that metric (as RobL says)...
> >
> > In this case, we can use RobL's approach on blocking (independent?)
> > APIs and for big variables. For other cases, return errors?
>
> I'm going to do two things (and not do one thing):
>
> - check for the _x variants of the type routines. If those exist, I
> will presume the implementation has given some thought to datatypes
> describing large amounts of memory.
>
> - Implement the RobR "only for large variables" optimization
>
> - leave non-blocking alone for now, on the assumption that
> non-blocking's primary use case is to combine many small I/O
> requests into larger ones.
>
> ==rob
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA