pnetcdf and large transfers

Rob Latham robl at mcs.anl.gov
Fri Sep 13 09:41:13 CDT 2013


Resurrecting an old thread now that I've had a chance to take a crack
at this.

Remember, if we want to relax the restriction that transfers must be
less than 2 GiB in size, we have to be a bit clever about how we
handle the memory side of things, due to a couple of defects in MPI
(see the prototype after this list):

 - Counts are specified as ints.
 - Sophisticated datatypes describing more than 2 GiB of data break
   MPICH and MPICH-derived implementations.
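
For reference, the binding that forces the first issue (the prototype
below is just the existing MPI-2 one, reproduced as a reminder):

    /* 'count' is a plain int, so a single call can describe at most
     * INT_MAX elements -- with MPI_BYTE, just under 2 GiB of memory. */
    int MPI_File_read_all(MPI_File fh, void *buf, int count,
                          MPI_Datatype datatype, MPI_Status *status);

    /* The obvious dodge -- wrap the whole buffer in one big derived
     * datatype and pass count = 1 -- runs into the second defect:
     * such a type describes more than 2 GiB and breaks MPICH. */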

I implemented the "Rob Ross" approach: if a variable is larger than
2 GiB, we break the transfer up into some number of calls, each less
than 2 GiB in size.

That takes care of the memory side of the transfers (rough sketch below).
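
In rough outline (a sketch only -- the helper name and the INT_MAX cap
are illustrative, not the actual pnetcdf internals):

    #include <limits.h>
    #include <mpi.h>

    /* Sketch: split one big read into pieces whose counts fit in an int. */
    static int big_read_all(MPI_File fh, void *buf, MPI_Offset total_bytes)
    {
        MPI_Offset done = 0;
        while (done < total_bytes) {
            MPI_Offset left = total_bytes - done;
            int chunk = (left > INT_MAX) ? INT_MAX : (int)left;
            /* memory side is easy: the buffer is one contiguous run of
             * MPI_BYTE, so pointer arithmetic partitions it */
            int err = MPI_File_read_all(fh, (char *)buf + done, chunk,
                                        MPI_BYTE, MPI_STATUS_IGNORE);
            if (err != MPI_SUCCESS) return err;
            done += chunk;
        }
        /* caveat: read_all is collective, so every rank has to make the
         * same number of calls -- hence the talk of an extra allreduce
         * in the quoted thread below */
        return MPI_SUCCESS;
    }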

Now we've got a problem with the file side.  Consider a record variable:

dimensions:
        time = UNLIMITED ; // (624 currently)
        plev = 17 ;
        lat = 192 ;
        bound = 2 ;
        lon = 288 ;
variables:
        double time(time) ;
        double plev(plev) ;
        double lat(lat) ;
        double bounds_lat(lat, bound) ;
        double lon(lon) ;
        double bounds_lon(lon, bound) ;
        float va(time, plev, lat, lon) ;

When reading this record variable 'va', we can end up passing a
datatype to the MPI_File_set_view call that is larger than 2 GiB.  The
memory side of things allows us to pretty easily partition the buffer:
it's a contiguous array of MPI_BYTE.  But the file view for record
variables is not so easily split up.  Furthermore, we'd have to re-set
the file view for every split-up I/O request.  What a headache!
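
To make that concrete, here is roughly the shape of the filetype
involved (a sketch built from the header above, not pnetcdf's code;
the record size and the variable's starting offset are assumed):

    #include <mpi.h>

    /* Illustrative only: a view selecting every record of 'va'. */
    static void set_va_view(MPI_File fh, MPI_Offset va_begin_offset)
    {
        MPI_Datatype filetype;
        const int      rec_elems = 17 * 192 * 288; /* floats per record of va */
        const int      nrecs     = 624;
        const MPI_Aint rec_bytes = 3760136;        /* assumed record stride:
                                                      time + va per record */

        MPI_Type_create_hvector(nrecs, rec_elems, rec_bytes,
                                MPI_FLOAT, &filetype);
        MPI_Type_commit(&filetype);

        /* the committed type spans 624 * 940032 * 4 bytes (~2.3 GB), so
         * splitting the transfer means rebuilding and re-setting this
         * view for every piece */
        MPI_File_set_view(fh, va_begin_offset, MPI_FLOAT, filetype,
                          "native", MPI_INFO_NULL);
        MPI_Type_free(&filetype);
    }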

So, now I'm back in the camp of checking for MPI-3 features and then
assuming we can pass in large datatypes to that library.  For older
MPI libraries, we can still report "request too large". 
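
Something like the following sketch is the idea (HAVE_MPI_TYPE_SIZE_X
stands in for whatever configure-time probe we'd add, and the error
value is just a placeholder, not a real pnetcdf error code):

    #include <limits.h>
    #include <mpi.h>

    /* Only trust the MPI library with a > 2 GiB request if the MPI-3
     * large-count ("_x") routines are available. */
    static int check_large_request(MPI_Offset nbytes)
    {
    #if defined(HAVE_MPI_TYPE_SIZE_X) || (MPI_VERSION >= 3)
        /* the _x variants (MPI_Type_size_x and friends, taking
         * MPI_Count) exist, so presume big datatypes work and pass the
         * request through unchanged */
        (void)nbytes;
        return MPI_SUCCESS;
    #else
        /* older MPI: keep today's behavior and report it back */
        if (nbytes > INT_MAX)
            return -1;   /* placeholder for "request too large" */
        return MPI_SUCCESS;
    #endif
    }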

Wei-keng: you've done a lot of datatype work for pnetcdf recently.
Anything I'm missing in this analysis?  

==rob

On Tue, Jul 02, 2013 at 04:44:25PM -0500, Rob Latham wrote:
> On Tue, Jul 02, 2013 at 04:29:38PM -0500, Wei-keng Liao wrote:
> > > Perhaps I misunderstand, but I think that in the case that the I/O is to a single variable and the variable size is such that the access cannot be too large, we can safely avoid the allreduce. Right?
> > 
> > If the variables are fixed-size (non-record) and their defined size is < 2 GiB,
> > then you are right: we can avoid the allreduce (for blocking APIs only).
> > Otherwise, I think the allreduce is still necessary for RobL's approach.
> 
> Right. That won't catch all cases, but it will catch the important
> ones: if the amount of data stored is small, then there's not a lot of
> I/O over which to amortize the cost of the extra allreduce.
> 
> > > Is there something additional that we could learn as an artifact of the collective (currently proposed as an allreduce) that might help us in optimizing I/O generally? 
> > 
> > I am not sure about optimization; I was just thinking about making it work.
> > I wonder, when the request size is that big, whether we should worry about
> > the cost of that one additional allreduce.
> 
> > I just remembered that if the default collective I/O buffer size is used
> > (cb_buffer_size=16MiB), then the maximum size of any individual read/write
> > (made by the aggregators) is 16 MiB.  Thus, I don't think we will have a
> > problem for collective I/O (where two-phase I/O is actually involved); the
> > problem is with independent I/O.  Is my understanding correct?
> 
> We still need to "feed" the MPI-IO routine, though.  MPI_File_read_all
> takes an integer 'count' parameter: presently we count off N elements
> of MPI_BYTE.
> 
> Yes, once we get the request down into MPI-IO we're in good shape. 
> 
> > > I would like to have a solution in ROMIO also, but prefer a
> > > solution that is available soonest to our users, and a PnetCDF fix
> > > is superior with respect to that metric (as RobL says)...
> > 
> > In this case, we can use RobL's approach on blocking (independent?)
> > APIs and for big variables.  For other cases, return errors?
> 
> I'm going to do two things (and not do one thing):
> 
> - check for the _x variants of the type routines.  If those exist, I
>   will presume the implementation has given some thought to datatypes
>   describing large amounts of memory.   
> 
> - Implement the RobR "only for large variables" optimization
> 
> - leave non-blocking alone for now, on the assumption that
>   non-blocking's primary use case is to combine many small I/O
>   requests into larger ones.
> 
> ==rob
> 

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

