parallel-netcdf buffered I/O interface

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Aug 16 06:20:46 CDT 2012


ncmpi_inq_buffer_usage and its Fortran API have now been added in r1087.
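
For illustration, here is a minimal C sketch of one way this query could be
combined with an MPI_Allreduce to decide collectively when to flush, along
the lines Jim describes below (the threshold, the request arrays, and the
function name are hypothetical):

    #include <mpi.h>
    #include <pnetcdf.h>

    /* check the attached buffer's usage on every process and flush
     * collectively once any process gets close to its limit */
    static void maybe_flush(int ncid, MPI_Offset bufsize,
                            int *nreqs, int reqs[], int statuses[])
    {
        MPI_Offset usage, max_usage;

        /* independent call: each process queries its own attached buffer */
        ncmpi_inq_buffer_usage(ncid, &usage);

        /* agree on the high-water mark so all processes decide the same
         * way (MPI_OFFSET assumes an MPI-2.2 or newer MPI library) */
        MPI_Allreduce(&usage, &max_usage, 1, MPI_OFFSET, MPI_MAX,
                      MPI_COMM_WORLD);

        if (max_usage > bufsize / 2) {   /* arbitrary threshold */
            /* collective flush of everything queued so far */
            ncmpi_wait_all(ncid, *nreqs, reqs, statuses);
            *nreqs = 0;                  /* buffer is now marked empty */
        }
    }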

Wei-keng

On Aug 15, 2012, at 11:27 AM, Rob Latham wrote:

> On Wed, Aug 15, 2012 at 10:10:02AM -0600, Jim Edwards wrote:
>> Okay, so when do you need to call nfmpi_begin_indep_mode/
>> nfmpi_end_indep_mode?  It doesn't seem to be entirely consistent
>> anymore - is it?
> 
> nfmpi_begin_indep_mode and nfmpi_end_indep_mode should continue to
> wrap the blocking and independent nfmpi_put_ and nfmpi_get routines
> (those that do not end in _all).
> 
> begin/end should also bracket the independent nfmpi_wait, I think.
> 
> If you are interested, I think the reason for all this flipping around
> is essentially so we can keep the number of records in a record
> variable consistent among processors.
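
As a concrete illustration of that bracketing, a minimal C sketch (the
thread uses the Fortran names; ncmpi_begin_indep_data/ncmpi_end_indep_data
are the C equivalents, and the variable, start/count, and request arrays
here are hypothetical):

    #include <pnetcdf.h>

    static void independent_io(int ncid, int varid,
                               const MPI_Offset start[],
                               const MPI_Offset count[],
                               const float *buf,
                               int nreqs, int reqs[], int statuses[])
    {
        ncmpi_begin_indep_data(ncid);

        /* blocking independent put (no _all suffix) */
        ncmpi_put_vara_float(ncid, varid, start, count, buf);

        /* independent wait for previously posted nonblocking requests */
        ncmpi_wait(ncid, nreqs, reqs, statuses);

        ncmpi_end_indep_data(ncid);
    }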
> 
> ==rob
> 
>> 
>> On Wed, Aug 15, 2012 at 10:01 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>> 
>>> On Wed, Aug 15, 2012 at 09:32:56AM -0600, Jim Edwards wrote:
>>>> Hi Wei-keng,
>>>> 
>>>> Yes, that looks like what I would need.  I have to think about the
>>>> independent aspect - currently I am using collective operations in
>>>> almost all cases.  The performance trade-offs of independent vs.
>>>> collective operations are not really clear to me.  Why no collective
>>>> bputs?
>>> 
>>> Aw, Wei-keng already replied.   Well, here's my answer, which says the
>>> same thing as Wei-keng but emphasises the "put it on a list" and
>>> "execute this list" aspects of these APIs.
>>> 
>>> The 'buffered put' routines are a variant of the non-blocking
>>> routines.  These routines defer all I/O to the wait or wait_all
>>> routine, where all pending I/O requests for a given process are
>>> stitched together into one bigger request.
>>> 
>>> So, issuing an I/O operation under these interfaces is essentially
>>> "put it on a list".  Then, "execute this list" can be done either
>>> independently (ncmpi_wait) or collectively (ncmpi_wait_all).
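
In C, the "list" model looks roughly like this (a sketch with hypothetical
variable ids and buffers):

    #include <pnetcdf.h>

    static void queued_writes(int ncid, int varid1, int varid2,
                              const MPI_Offset start[],
                              const MPI_Offset count[],
                              const float *buf1, const float *buf2)
    {
        int reqs[2], statuses[2];

        /* "put it on a list": nothing is written yet */
        ncmpi_iput_vara_float(ncid, varid1, start, count, buf1, &reqs[0]);
        ncmpi_iput_vara_float(ncid, varid2, start, count, buf2, &reqs[1]);

        /* "execute this list" collectively; ncmpi_wait would do the same
         * thing independently */
        ncmpi_wait_all(ncid, 2, reqs, statuses);
    }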
>>> 
>>> A very early instance of these routines did the "put it on a list"
>>> collectively.  This approach did not work out so well for applications
>>> (for example, Chombo) where processes make a bunch of small
>>> uncoordinated I/O requests, but still have a clear part of their code
>>> where "collectively wait for everyone to finish" made sense.
>>> 
>>> I hope you have enjoyed today's episode of Parallel-NetCDF history
>>> theater.
>>> 
>>> ==rob
>>> 
>>>> On Wed, Aug 15, 2012 at 9:18 AM, Wei-keng Liao
>>>> <wkliao at ece.northwestern.edu> wrote:
>>>> 
>>>>>> The  NC_EINSUFFBUF error code is returned from the bput call?
>>>>> 
>>>>> I found a bug where 1.3.0 fails to return this error code. r1086
>>>>> fixes this bug.
>>>>> 
>>>>> 
>>>>>> If you get that error, will you need to make that same bput call
>>>>>> again after flushing?  But the other tasks involved in the same bput
>>>>>> call who didn't have full buffers would do what?
>>>>> 
>>>>> My idea is to skip the bput request when NC_EINSUFFBUF is returned.
>>>>> Flushing at the wait call will only flush those successful bput calls,
>>>>> so yes you need to make the same failed bput call again after flushing.
>>>>> 
>>>>> Please note that bput APIs are independent. There is no "other tasks in
>>>>> the same bput call" issue.
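
A minimal C sketch of that retry pattern (the request arrays are
hypothetical, and a real code would still have to coordinate the collective
wait_all across processes, e.g. with the allreduce idea discussed below):

    #include <pnetcdf.h>

    static int bput_with_retry(int ncid, int varid,
                               const MPI_Offset start[],
                               const MPI_Offset count[],
                               const float *buf,
                               int *nreqs, int reqs[], int statuses[])
    {
        int err = ncmpi_bput_vara_float(ncid, varid, start, count, buf,
                                        &reqs[*nreqs]);
        if (err == NC_EINSUFFBUF) {
            /* attached buffer is full: flush what has been queued so far */
            ncmpi_wait_all(ncid, *nreqs, reqs, statuses);
            *nreqs = 0;
            /* the failed request was skipped, so issue it again */
            err = ncmpi_bput_vara_float(ncid, varid, start, count, buf,
                                        &reqs[*nreqs]);
        }
        if (err == NC_NOERR) (*nreqs)++;
        return err;
    }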
>>>>> 
>>>>> 
>>>>>> I could use a query function and, to avoid the independent write
>>>>>> calls, would do an mpi_allreduce on the max memory used before
>>>>>> calling the mpi_waitall.  If the max is approaching the buffer size
>>>>>> I would flush all io tasks.  This is basically what I have
>>>>>> implemented in pio with iput - I have a user-determined limit on the
>>>>>> size of the buffer and grow the buffer with each iput call; when the
>>>>>> buffer meets (or exceeds) the limit on any task I call waitall on
>>>>>> all tasks.
>>>>> 
>>>>> This is a nice idea.
>>>>> 
>>>>> 
>>>>> Please let me know if the new query API below will be sufficient for
>>>>> you.
>>>>> 
>>>>>  int ncmpi_inq_buffer_usage(int ncid, MPI_Offset *usage);
>>>>> 
>>>>>  * "usage" will be returned with the current buffer usage in bytes.
>>>>>  * Error codes may be invalid ncid or no attached buffer found.
>>>>> 
>>>>> 
>>>>> 
>>>>> Wei-keng
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 14, 2012 at 10:07 PM, Wei-keng Liao
>>>>>> <wkliao at ece.northwestern.edu> wrote:
>>>>>> Hi, Jim,
>>>>>> 
>>>>>> The usage of the bput APIs is very similar to iput, except for the
>>>>>> following:
>>>>>> 1. Users must tell pnetcdf the size of the buffer to be used by
>>>>>>    pnetcdf internally (the attach and detach calls).
>>>>>> 2. Once a bput API returns, the user's buffer can be reused or freed
>>>>>>    (because the write data has been copied to the internal buffer).
>>>>>> 
>>>>>> The internal buffer is per file (as the attach API requires an ncid
>>>>>> argument). It can be used to aggregate requests to multiple variables
>>>>>> defined in the file.
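
Putting those two points together, a minimal sketch of the
attach/bput/wait_all/detach flow in C (buffer size, variable, and data are
hypothetical; error checking omitted):

    #include <pnetcdf.h>

    static void buffered_write(int ncid, int varid,
                               const MPI_Offset start[],
                               const MPI_Offset count[],
                               const float *data, MPI_Offset nelems)
    {
        int req, status;

        /* tell pnetcdf how much internal buffer space to use for this file */
        ncmpi_buffer_attach(ncid, nelems * sizeof(float));

        /* bput copies "data" into the internal buffer, so the user buffer
         * may be reused or freed as soon as this call returns */
        ncmpi_bput_vara_float(ncid, varid, start, count, data, &req);

        /* all file I/O happens here; afterwards the buffer is empty again */
        ncmpi_wait_all(ncid, 1, &req, &status);

        ncmpi_buffer_detach(ncid);
    }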
>>>>>> 
>>>>>> I did not implement a query API to check the current usage of the
>>>>>> buffer. If this query is useful, we can implement it. Let me know.
>>>>>> But please note this query would be an independent call, so you
>>>>>> would have to call the independent wait (nfmpi_wait). Independent
>>>>>> wait uses MPI independent I/O, which causes poor performance and is
>>>>>> not recommended. Otherwise, you need an MPI reduce to ensure all
>>>>>> processes know when to call the collective wait_all.
>>>>>> 
>>>>>> You are right about flushing. The buffer will not be flushed
>>>>>> automatically, and all file I/O happens in wait_all. If the attached
>>>>>> buffer runs out of space, the NC_EINSUFFBUF error code (non-fatal)
>>>>>> will be returned. It can be used to decide when to call the wait
>>>>>> API, as described above. However, automatic flushing would require
>>>>>> MPI independent I/O, again meaning poor performance. So, I recommend
>>>>>> making sure the buffer size is sufficiently large. In addition, if
>>>>>> you let pnetcdf do type conversion between two types of different
>>>>>> sizes (e.g. short to int), you must calculate the size of the attach
>>>>>> buffer using the larger type.
>>>>>> 
>>>>>> If automatic flushing is highly desired, we can add it later.
>>>>>> 
>>>>>> Once the call to wait/wait_all returns, the internal buffer is marked
>>>>>> empty.
>>>>>> 
>>>>>> Let me know if the above answers your questions.
>>>>>> 
>>>>>> Wei-keng
>>>>>> 
>>>>>> On Aug 14, 2012, at 2:04 PM, Jim Edwards wrote:
>>>>>> 
>>>>>>> No, the flush must happen in the nfmpi_wait_all.
>>>>>>> But does that call mark the buffer as empty?  I'll wait and bug
>>>>>>> Wei-keng.
>>>>>>> 
>>>>>>> On Tue, Aug 14, 2012 at 12:56 PM, Rob Latham <robl at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>> On Tue, Aug 14, 2012 at 12:52:46PM -0600, Jim Edwards wrote:
>>>>>>>> Hi Rob,
>>>>>>>> 
>>>>>>>> I assume that the same buffer can be used for multiple variables
>>>>>>>> (as long as they are associated with the same file).  Is there a
>>>>>>>> query function so that you know when you've used the entire buffer
>>>>>>>> and it's time to flush?
>>>>>>> 
>>>>>>> It does not appear to be so.  The only non-data-movement routines
>>>>>>> in the API are these:
>>>>>>> 
>>>>>>> int ncmpi_buffer_attach(int ncid, MPI_Offset bufsize);
>>>>>>> int ncmpi_buffer_detach(int ncid);
>>>>>>> 
>>>>>>> The end-user doesn't flush, I don't think.  I had the impression
>>>>>>> that once the buffer filled up, the library did the flush, then
>>>>>>> started filling up the buffer again.  This one I'll need Wei-keng
>>>>>>> to confirm.
>>>>>>> 
>>>>>>> ==rob
>>>>>>> 
>>>>>>>> Jim
>>>>>>>> 
>>>>>>>> On Tue, Aug 14, 2012 at 11:41 AM, Rob Latham <robl at mcs.anl.gov>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> On Tue, Aug 14, 2012 at 10:50:15AM -0600, Jim Edwards wrote:
>>>>>>>>>> No, I'm using iput and blocking get.  I'm doing my own buffering
>>>>>>>>>> layer in pio.  I might consider using the bput functions - can
>>>>>>>>>> you point me to some documentation/examples?
>>>>>>>>> 
>>>>>>>>> Sure.  It's too bad Wei-keng is on vacation this month, as he's
>>>>>>>>> the one who designed and implemented this new feature for pnetcdf
>>>>>>>>> 1.3.0.
>>>>>>>>> Wei-keng: I'm not expecting you to reply while on vacation.  I'm
>>>>>>>>> just CCing you so you know I'm talking about your work :>
>>>>>>>>> 
>>>>>>>>> I think this might be the entire contents of our documentation:
>>>>>>>>> 
>>>>>>>>> "A new set of buffered put APIs (eg. ncmpi_bput_vara_float) is
>>> added.
>>>>>>>>> They make a copy of the user's buffer internally, so the user's
>>>>> buffer
>>>>>>>>> can be reused when the call returns. Their usage are similar to
>>> the
>>>>>>>>> iput APIs. "
>>>>>>>>> 
>>>>>>>>> Hey, check that out: Wei-keng wrote up a fortran example:
>>>>>>>>> 
>>>>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-bufferedf.F
>>>>>>>>> 
>>>>>>>>> There's also the C version:
>>>>>>>>> 
>>>>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-buffered.c
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ==rob
>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 14, 2012 at 10:16 AM, Rob Latham <robl at mcs.anl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Jim
>>>>>>>>>>> 
>>>>>>>>>>> You've been using the new 'bput/bget' routines, right?  Can you
>>>>>>>>>>> tell me a bit about what you are using them for, and what -- if
>>>>>>>>>>> any -- benefit they've provided?
>>>>>>>>>>> 
>>>>>>>>>>> (Rationale: our program management likes to see papers and
>>>>>>>>>>> presentations, but the most valued contribution is 'science
>>>>>>>>>>> impact').
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> ==rob
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Rob Latham
>>>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>>>> Argonne National Lab, IL USA
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Rob Latham
>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>> Argonne National Lab, IL USA
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Rob Latham
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Lab, IL USA
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Jim Edwards
>>>>>>> 
>>>>>>> CESM Software Engineering Group
>>>>>>> National Center for Atmospheric Research
>>>>>>> Boulder, CO
>>>>>>> 303-497-1842
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jim Edwards
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>> 
>> 
>> 
>> 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA


