Re: pnetcdf and large transfers

Rob Latham robl at mcs.anl.gov
Tue Jul 2 09:16:24 CDT 2013


On Tue, Jul 02, 2013 at 01:48:23PM +0000, Lofstead, Gerald F II wrote:
> I know I am not speaking with precision here, but how about this for
> an idea: keep it one call, but use an MPI_Info or some similar
> mechanism to specify a larger size and type parameter at some
> earlier juncture. Then the call could stay single and exceed 4 GiB.
> Use the extra hints to work around the API; that seems to have been
> a bit of the spirit behind them. I know there are issues with
> returning the extra info, but a custom error for > 4 GiB might make
> it work. This is a bit ugly too, but the core thought might offer an
> alternative.

Hi Jay! Happy to see you've been lurking around.

The issue is not exactly with the count and type parameters: for the
simplest example, instead of transferring 3 GiB worth of MPI_BYTE
items, we can transfer three 1 GiB-sized MPI_CONTIG types.  (My scheme
gets a little more complicated because, to be fully general, one has
to deal with remainders that don't fit exactly into whatever blocking
factor one selects.)
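
For that simplest case, a minimal sketch (the function name and the
fixed sizes here are mine, purely for illustration):

    /* 3 GiB expressed as three 1 GiB contig types, rather than as
     * 3*2^30 MPI_BYTE items -- a count that overflows a 32-bit int */
    #include <mpi.h>

    void write_3gib(MPI_File fh, const void *buf)
    {
        MPI_Datatype gib;
        MPI_Status status;

        /* describe 1 GiB of bytes as a single datatype */
        MPI_Type_contiguous(1 << 30, MPI_BYTE, &gib);
        MPI_Type_commit(&gib);

        /* count=3 now means "3 GiB" and fits easily in an int */
        MPI_File_write_all(fh, buf, 3, gib, &status);

        MPI_Type_free(&gib);
    }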

The issue, then, is how the MPI implementation handles that.  Right
now, not well.  I haven't tried OpenMPI, but MPICH still has a few
places where it stuffs 64 bits' worth of information into 32-bit
quantities.

I think we can fix this in MPICH-3.0.next, but then we have to beat on
OpenMPI, and then beat on the vendors to incorporate these upstream
(and probably ABI-breaking) changes.

Maybe that's possible, though just last week we had a user trying to
use a 2+ year-old OpenMPI-1.5.something, so old MPI versions have a
way of sticking around for a long time.

So, we can either describe large requests as they are, possibly
(likely) hitting MPI implementation bugs, or we can split up I/O into
several smaller requests, introducing coordination where there
previously was none.
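
To make the coordination cost of the second option concrete, here is
a rough sketch (none of this is ROMIO or pnetcdf code; the chunk size,
the function name, and the use of MPI_COMM_WORLD are all illustrative):
every process must make the same number of collective calls, so ranks
that finish early issue zero-byte calls.

    #include <mpi.h>

    #define CHUNK (1 << 30)  /* 1 GiB per round: an arbitrary choice */

    void write_big_split(MPI_File fh, const char *buf, MPI_Offset nbytes)
    {
        MPI_Offset my_rounds = (nbytes + CHUNK - 1) / CHUNK;
        MPI_Offset max_rounds, i;
        MPI_Status status;

        /* the new coordination: all processes must agree on how many
         * collective calls will be made */
        MPI_Allreduce(&my_rounds, &max_rounds, 1, MPI_OFFSET, MPI_MAX,
                      MPI_COMM_WORLD);

        for (i = 0; i < max_rounds; i++) {
            MPI_Offset off = i * (MPI_Offset) CHUNK;
            int len = 0;      /* finished ranks still participate */
            if (off < nbytes)
                len = (int) (nbytes - off < CHUNK ? nbytes - off : CHUNK);
            MPI_File_write_all(fh, buf + off, len, MPI_BYTE, &status);
        }
    }

This leans on the individual file pointer advancing through the file
view between calls, and that MPI_Allreduce is exactly the coordination
a single large request avoids.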

I'd gladly entertain a third way.

==rob


> 
> On Jul 1, 2013, at 4:01 PM, "Rob Ross" <rross at mcs.anl.gov> wrote:
> 
> > You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
> > 
> > Rob
> > 
> > On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
> > 
> >> I'm working on fixing a long-standing bug in the ROMIO MPI-IO
> >> implementation where requests of more than 32 bits' worth of data
> >> (2 GiB or more) are not supported.
> >> 
> >> Some background:  The MPI_File read and write routines take an
> >> MPI-typical "buffer, count, datatype" tuple to describe accesses.
> >> The pnetcdf library takes a get or put call and processes the
> >> multi-dimensional array description into the simpler MPI-IO file
> >> model: a linear stream of bytes.
> >> 
> >> So, for example, "ncmpi_get_vara_double_all" will set up the file view
> >> accordingly, but describe the memory region as some number of MPI_BYTE
> >> items. 
> >> 
> >> This is the prototype for MPI_File_write_all:
> >> 
> >> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
> >>                      MPI_Datatype datatype, MPI_Status *status)
> >> 
> >> So you probably see the problem: 'int count' -- integers are still
> >> 32 bits on many systems (Linux x86_64, Blue Gene, ppc64): how do
> >> we describe more than 2 GiB of data?
> >> 
> >> One way is to punt: if we detect that the number of bytes won't fit
> >> into an integer, pnetcdf returns an error.  I think I can do
> >> better, though my scheme is growing crazier by the moment:
> >> 
> >> RobL's crazy type scheme:
> >> - Given N, the number of bytes to transfer
> >> - We pick a chunk size (call it 1 MiB for now, to buy us some time,
> >>   but one could select this chunk size at run time)
> >> - We make M contig types to describe the first M*chunk_size bytes
> >>   of the request
> >> - We have "remainder" bytes for the rest of the request
> >> 
> >> - Now we have two regions: one primary region described with a
> >>   count of MPI_CONTIG types, and a second remainder region
> >>   described with MPI_BYTE types
> >> 
> >> - We make a struct type describing those two pieces (sketched
> >>   below), and pass that to MPI-IO
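> >> 
> >> A rough sketch of those last two steps, given N in an MPI_Offset
> >> variable "nbytes" (I'm writing MPI_Type_create_struct, the
> >> non-deprecated spelling of MPI_Type_struct):
> >> 
> >>     #define CHUNK (1024*1024)          /* 1 MiB blocking factor */
> >> 
> >>     MPI_Datatype chunk_type, types[2], bigtype;
> >>     int blocklens[2];
> >>     MPI_Aint disps[2];
> >>     MPI_Offset M   = nbytes / CHUNK;   /* count of full chunks  */
> >>     MPI_Offset rem = nbytes % CHUNK;   /* leftover bytes        */
> >> 
> >>     MPI_Type_contiguous(CHUNK, MPI_BYTE, &chunk_type);
> >> 
> >>     blocklens[0] = (int) M;            /* M contig chunks...    */
> >>     blocklens[1] = (int) rem;          /* ...plus the remainder */
> >>     types[0] = chunk_type;
> >>     types[1] = MPI_BYTE;
> >>     disps[0] = 0;
> >>     disps[1] = (MPI_Aint) M * CHUNK;   /* needs a 64-bit MPI_Aint */
> >> 
> >>     MPI_Type_create_struct(2, blocklens, disps, types, &bigtype);
> >>     MPI_Type_commit(&bigtype);
> >>     MPI_Type_free(&chunk_type);
> >> 
> >>     /* then one call: MPI_File_write_all(fh, buf, 1, bigtype, &st) */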
> >> 
> >> MPI_Type_struct takes displacements of type MPI_Aint.  Now on some
> >> old systems (like my primary development machine up until a year
> >> ago), MPI_Aint is 32 bits.  Well, on those systems the caller is
> >> out of luck: how are they going to address the, e.g., 3 GiB of
> >> data we toss their way?
> >> 
> >> 
> >> The attached diff demonstrates what I'm trying to do.  The
> >> creation of these types fails on MPICH, so I cannot test this
> >> scheme yet.  Does it look goofy to any of you?
> >> 
> >> thanks
> >> ==rob
> >> 
> >> -- 
> >> Rob Latham
> >> Mathematics and Computer Science Division
> >> Argonne National Lab, IL USA
> >> <rjl_bigtype_changes.diff>
> > 
> 

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

