pnetcdf and large transfers

Wei-keng Liao wkliao at ece.northwestern.edu
Tue Jul 2 16:29:38 CDT 2013


> Perhaps I misunderstand, but I think that in the case that the I/O is to a single variable and the variable size is such that the access cannot be too large, we can safely avoid the allreduce. Right?

If the variables are fixed-size (non-record) and the defined size is less than 2 GiB, then you are right:
we can avoid the allreduce (for blocking APIs only). Otherwise, I think the allreduce is still
necessary for RobL's approach.
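
To make that condition concrete, here is a minimal sketch of the test I have in mind; the
ndims/shape/element-size arguments are only placeholders, not the actual PnetCDF internals:

    #include <limits.h>
    #include <mpi.h>

    /* Return 1 if a fixed-size variable is small enough that no process can
     * ever request 2 GiB or more from it, so the allreduce can be skipped
     * (blocking APIs only). */
    static int can_skip_allreduce(int ndims, const MPI_Offset *shape,
                                  MPI_Offset elem_size)
    {
        MPI_Offset nbytes = elem_size;
        for (int i = 0; i < ndims; i++)
            nbytes *= shape[i];
        return nbytes < INT_MAX;   /* whole variable fits in a 32-bit count */
    }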


> Is there something additional that we could learn as an artifact of the collective (currently proposed as an allreduce) that might help us in optimizing I/O generally? 

I am not sure about optimization; I am just thinking about making it work.
I wonder, when the request size is that big, whether we should worry about the cost
of that one additional allreduce.


I just remembered that if the default collective I/O buffer size is used (cb_buffer_size = 16 MiB),
then the maximum size of any individual read/write made by the aggregators is 16 MiB. Thus,
I don't think we will have a problem for collective I/O (where two-phase I/O is actually
involved). The problem is only with independent I/O. Is my understanding correct?
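
For reference, the buffer in question is controlled by the standard ROMIO "cb_buffer_size"
info key, which PnetCDF passes straight through to MPI-IO. A small sketch of setting it
explicitly (the file name is only a placeholder, and error checking is omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    static int open_with_cb_hint(int *ncidp)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* each aggregator's two-phase buffer; the largest individual
         * read/write an aggregator issues is bounded by this value
         * (16 MiB is the ROMIO default) */
        MPI_Info_set(info, "cb_buffer_size", "16777216");
        int err = ncmpi_open(MPI_COMM_WORLD, "foo.nc", NC_NOWRITE, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }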


> I would like to have a solution in ROMIO also, but prefer a solution that is available soonest to our users, and a PnetCDF fix is superior with respect to that metric (as RobL says)...

In that case, we can use RobL's approach for blocking (independent?) APIs and for big variables.
For the other cases, return errors?

Wei-keng

> 
> Rob
> 
> On Jul 2, 2013, at 10:47 AM, Wei-keng Liao wrote:
> 
>> On Jul 2, 2013, at 9:05 AM, Rob Latham wrote:
>> 
>>> On Tue, Jul 02, 2013 at 08:21:13AM -0500, Rob Ross wrote:
>>>> The allreduce may be a little expensive to do all the time, so I was
>>>> wondering if there were clever ways to avoid it (e.g., this variable
>>>> is too small to need multiple passes, that sort of thing)?
>>> 
>>> OK, here's what we know:
>>> 
>>> - the size of the variable, as you say.  If 
>>> product(dimensions[])*sizeof(type) is less than 2 GiB (actually, 1
>>> GiB if the file is of type integer and the memory is of type
>>> double), we'll never transfer too much data.
>>> 
>>> That check ensures we never call the allreduce when the dataset is small, and
>>> only introduces its latency when we are already paying a huge I/O cost.
>>> 
>>> - Each process has a request, but that information is local.
>>> 
>>> And that's it, right?
>>> 
>>> I think I'd be happier leaning on the MPI library to do the right
>>> thing, but then I'd be signing up for 3 years of bug reporting against
>>> various vendor MPIs.
>>> 
>> 
>> The above two facts clearly say the allreduce cannot be avoided.
>> One possible way to do this in ROMIO is to see if it can be combined
>> with other allreduce. FYI, in PnetCDF wait_all API, there is one
>> allreduce required to separate reads from writes.
>> 
>> I also prefer the solution in ROMIO, as the "high-level intention"
>> is then made clear to the "low-level" library.
>> 
>> Wei-keng
>> 
>> 
>> 
>>> ==rob
>>> 
>>>> -- Rob
>>>> 
>>>> On Jul 2, 2013, at 8:11 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>>> 
>>>>> On Mon, Jul 01, 2013 at 04:57:45PM -0500, Rob Ross wrote:
>>>>>> You could break the operation into multiple calls on the premise that a process moving GBs of data is doing a "big enough" I/O already. Then you would only need a method for determining how many calls are needed…
>>>>> 
>>>>> That's a pragmatic if unsatisfying solution, sure.  
>>>>> 
>>>>> Determining how many calls are needed is not hard.  We already compute
>>>>> "nbytes":
>>>>> 
>>>>> Algorithm:
>>>>> 
>>>>> - ceiling(nbytes/(1*GiB)): number of transfers
>>>>> 
>>>>> - MPI_Allreduce to find the max transfers
>>>>> 
>>>>> - carry out that many MPI_File_{write/read}_all, relying on the
>>>>> implicit file pointer to help us keep track of where we are in the
>>>>> file view.
>>>>> 
>>>>> I think, given the two phase overhead, we'd want N transfers of mostly
>>>>> the same size, instead of some gigabyte sized transfers and then one
>>>>> final "remainder" transfer of possibly a handful of bytes.
>>>>> When transferring multiple gigabytes, the point is probably academic
>>>>> anyway.
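
For illustration, here is a rough sketch of that multi-pass write; this is not RobL's actual
patch, the 1 GiB cap and the function name are made up, and error checking is omitted:

    #include <mpi.h>

    /* Write "nbytes" bytes from "buf" collectively, split into roughly equal
     * passes so no single call needs a count larger than a 32-bit int.
     * Every process must make the same number of collective calls, hence the
     * allreduce on the pass count. */
    static int write_in_passes(MPI_File fh, MPI_Comm comm, char *buf,
                               MPI_Offset nbytes)
    {
        const MPI_Offset chunk = 1024*1024*1024;             /* 1 GiB cap */
        long long my_passes = (nbytes + chunk - 1) / chunk;
        long long max_passes;
        MPI_Allreduce(&my_passes, &max_passes, 1, MPI_LONG_LONG, MPI_MAX, comm);

        /* spread the request evenly over the passes rather than doing N full
         * chunks plus a tiny remainder */
        MPI_Offset per_pass = (max_passes > 0)
                            ? (nbytes + max_passes - 1) / max_passes : 0;

        MPI_Offset done = 0;
        for (long long p = 0; p < max_passes; p++) {
            MPI_Offset len = nbytes - done;
            if (len > per_pass) len = per_pass;
            MPI_Status status;
            /* the implicit individual file pointer keeps track of where we
             * are in the file view between calls */
            MPI_File_write_all(fh, buf + done, (int)len, MPI_BYTE, &status);
            done += len;
        }
        return MPI_SUCCESS;
    }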
>>>>> 
>>>>> Given the state of large datatype descriptions in "just pulled it from
>>>>> git" MPICH (they will need work), I'm now no longer optimistic that
>>>>> we'll see widespread support for large datatypes any time soon.
>>>>> 
>>>>> ==rob
>>>>> 
>>>>>> Rob
>>>>>> 
>>>>>> On Jul 1, 2013, at 4:51 PM, Rob Latham wrote:
>>>>>> 
>>>>>>> I'm working on fixing a long-standing bug with the ROMIO MPI-IO
>>>>>>> implementation where requests of more than 32 bits worth of data (2
>>>>>>> GiB or more) would not be supported.
>>>>>>> 
>>>>>>> Some background:  The MPI_File read and write routines take an
>>>>>>> MPI-typical "buffer, count, datatype" tuple to describe accesses.
>>>>>>> The pnetcdf library takes a get or put call and processes the
>>>>>>> multi-dimensional array description into the simpler MPI-IO file
>>>>>>> model: a linear stream of bytes.
>>>>>>> 
>>>>>>> So, for example, "ncmpi_get_vara_double_all" will set up the file view
>>>>>>> accordingly, but describe the memory region as some number of MPI_BYTE
>>>>>>> items. 
>>>>>>> 
>>>>>>> This is the prototype for MPI_File_write_all:
>>>>>>> 
>>>>>>> int MPI_File_write_all(MPI_File fh, const void *buf, int count,
>>>>>>>                   MPI_Datatype datatype, MPI_Status *status)
>>>>>>> 
>>>>>>> So you probably see the problem: 'int count' -- integers are still 32
>>>>>>> bits on many systems (linux x86_64, blue gene, ppc64): how do we
>>>>>>> describe more than 2 GiB of data?
>>>>>>> 
>>>>>>> One way is to punt: if we detect that the number of bytes won't fit
>>>>>>> into an integer, pnetcdf returns an error.  I think I can do better,
>>>>>>> though my scheme is growing crazier by the moment:
>>>>>>> 
>>>>>>> RobL's crazy type scheme:
>>>>>>> - given N, a count of number of bytes
>>>>>>> - we pick a chunk size (call it 1 MiB now, to buy us some time, but
>>>>>>> one could select this chunk at run-time)
>>>>>>> - We make M contig types to describe the first M*chunk_size bytes of
>>>>>>> the request
>>>>>>> - We have "remainder" bytes for the rest of the request.
>>>>>>> 
>>>>>>> - Now we have two regions: one primary region described with a count of
>>>>>>> MPI_CONTIG types, and a second remainder region described with
>>>>>>> MPI_BYTE types
>>>>>>> 
>>>>>>> - We make a struct type describing those two pieces, and pass that to
>>>>>>> MPI-IO
>>>>>>> 
>>>>>>> MPI_Type_struct takes an MPI_Aint type.  Now on some old systems
>>>>>>> (like my primary development machine up until a year ago),
>>>>>>> MPI_AINT is 32 bits.  Well, on those systems the caller is out of
>>>>>>> luck: how are they going to address the e.g. 3 GiB of data we toss
>>>>>>> their way?
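
Here is how I read that scheme, as a sketch only: a fixed 1 MiB chunk, no error checking,
and not taken from the attached diff. The resulting datatype would then be passed to
MPI_File_write_all with a count of 1.

    #include <mpi.h>

    /* Describe "nbytes" of contiguous memory with one datatype whose count
     * fits in an int: M chunks of chunk_size bytes plus a byte remainder,
     * glued together with a struct type. */
    static int build_big_type(MPI_Offset nbytes, MPI_Datatype *newtype)
    {
        const MPI_Offset chunk_size = 1024*1024;              /* 1 MiB */
        int M         = (int)(nbytes / chunk_size);
        int remainder = (int)(nbytes % chunk_size);

        MPI_Datatype chunk_type;
        MPI_Type_contiguous((int)chunk_size, MPI_BYTE, &chunk_type);

        int          blocklens[2] = { M, remainder };
        /* on systems where MPI_Aint is 32 bits this displacement can
         * overflow -- the caveat discussed above */
        MPI_Aint     disps[2]     = { 0, (MPI_Aint)(M * chunk_size) };
        MPI_Datatype types[2]     = { chunk_type, MPI_BYTE };

        MPI_Type_create_struct(2, blocklens, disps, types, newtype);
        MPI_Type_commit(newtype);
        MPI_Type_free(&chunk_type);
        return MPI_SUCCESS;
    }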
>>>>>>> 
>>>>>>> 
>>>>>>> The attached diff demonstrates what I'm trying to do. The
>>>>>>> creation of these types fails on MPICH so I cannot test this scheme
>>>>>>> yet.  Does it look goofy to any of you?
>>>>>>> 
>>>>>>> thanks
>>>>>>> ==rob
>>>>>>> 
>>>>>>> -- 
>>>>>>> Rob Latham
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Lab, IL USA
>>>>>>> <rjl_bigtype_changes.diff>
>>>>> 
>>>>> -- 
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Lab, IL USA
>>> 
>>> -- 
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>> 
> 


