[EXTERNAL] Re: pnetcdf and large transfers

Lofstead, Gerald F II gflofst at sandia.gov
Tue Jul 2 11:10:10 CDT 2013


That is clearer. I misunderstood where the problem lies. I cannot think of a better approach than the one you described. I did this in ADIOS to get past the 32-bit limits, but we did not have to work around MPI problems to make it work.

Jay

On Jul 2, 2013, at 8:20 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:

> On Tue, Jul 02, 2013 at 01:48:23PM +0000, Lofstead, Gerald F II wrote:
>> I know I am not speaking with precision here, but how about this for
>> an idea: keep it one call, but use an MPI_Info or some similar
>> mechanism to specify larger size and type parameters at some earlier
>> junction. Then the call could stay a single call and exceed 4 GiB,
>> using the extra hints to work around the API. That seems to have been
>> a bit of the spirit of hints anyway. I know there are issues with
>> returning the extra info, but a custom error code for the > 4 GiB
>> case might make it work. This is a bit ugly too, but the core thought
>> might offer an alternative.
> 
> Hi Jay! happy to see you've been lurking around.
> 
> The issue is not exactly with the count and type parameters: for the
> simplest example, instead of transferring 3 GiB of data as a count of
> MPI_BYTE items, we can transfer it as a handful of large contiguous
> types.  (My scheme gets a little more complicated because, to be
> fully general, one has to deal with remainders that don't fit exactly
> into whatever blocking factor one selects.)
> 
> The issue then is how the MPI implementation handles that.  Right now,
> not great.  I haven't tried OpenMPI, but MPICH still has a few places
> where it stuffs 64 bits' worth of information into 32-bit variables.
> 
> I think we can fix MPICH-3.0.next, but then we have to beat on OpenMPI
> and then beat on the vendors to incorporate these upstream (and
> probably ABI-breaking) changes.   
> 
> Maybe that's possible, though it was only last week we had a user
> trying to use 2+ year old OpenMPI-1.5.something, so old MPI versions
> have a way of sticking around a long time.
> 
> So, we can either describe large requests as they are, possibly
> (likely) hitting MPI implementation bugs, or we can split up I/O into
> several smaller requests, introducing coordination where there
> previously was none.
> 
> I'd gladly entertain a third way.
> 
> ==rob
> 
> 
>> 
>> On Jul 1, 2013, at 4:01 PM, "Rob Ross" <rross at mcs.anl.gov> wrote:
>> 
>>> You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
>>> 
>>> Rob
>>> 
>>> On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
>>> 
>>>> I'm working on fixing a long-standing bug with the ROMIO MPI-IO
>>>> implementation where requests of more than 32 bits worth of data (2
>>>> GiB or more) would not be supported.
>>>> 
>>>> Some background:  The MPI_File read and write routines take an
>>>> MPI-typical "buffer, count, datatype" tuple to describe accesses.
>>>> The pnetcdf library takes a get or put call and translates the
>>>> multi-dimensional array description into the simpler MPI-IO file
>>>> model: a linear stream of bytes.
>>>> 
>>>> So, for example, "ncmpi_get_vara_double_all" will set up the file view
>>>> accordingly, but describe the memory region as some number of MPI_BYTE
>>>> items. 
>>>> 
>>>> This is the prototype for MPI_File_write_all:
>>>> 
>>>> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
>>>>                     MPI_Datatype datatype, MPI_Status *status)
>>>> 
>>>> So you probably see the problem with 'int count': integers are
>>>> still 32 bits on many systems (Linux x86_64, Blue Gene, ppc64), so
>>>> how do we describe more than 2 GiB of data?
>>>> 
>>>> One way is to punt: if we detect that the number of bytes won't fit
>>>> into an integer, pnetcdf returns an error.  I think I can do
>>>> better, though my scheme is growing crazier by the moment:
>>>> 
>>>> RobL's crazy type scheme:
>>>> - Given N, a count of bytes
>>>> - We pick a chunk size (call it 1 MiB for now, to buy us some time,
>>>> though one could select this chunk size at run time)
>>>> - We make M contig types to describe the first M*chunk_size bytes
>>>> of the request
>>>> - We have "remainder" bytes for the rest of the request.
>>>> 
>>>> - Now we have two regions: one primary region described with a count of
>>>> MPI_CONTIG types, and a second remainder region described with
>>>> MPI_BYTE types
>>>> 
>>>> - We make a struct type describing those two pieces, and pass that to
>>>> MPI-IO
>>>> 
>>>> MPI_Type_struct takes MPI_Aint displacements.  Now on some old
>>>> systems (like my primary development machine up until a year ago),
>>>> MPI_Aint is 32 bits.  On those systems the caller is out of luck:
>>>> how are they going to address the e.g. 3 GiB of data we toss their
>>>> way?
>>>> 
>>>> 
>>>> The attached diff demonstrates what I'm trying to do. The
>>>> creation of these types fails on MPICH so I cannot test this scheme
>>>> yet.  Does it look goofy to any of you?
>>>> 
>>>> thanks
>>>> ==rob
>>>> 
>>>> -- 
>>>> Rob Latham
>>>> Mathematics and Computer Science Division
>>>> Argonne National Lab, IL USA
>>>> <rjl_bigtype_changes.diff>
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA



More information about the parallel-netcdf mailing list