parallel-netcdf buffered I/O interface

Jim Edwards edwards.jim at gmail.com
Wed Sep 12 10:38:12 CDT 2012


I need to make a request to all of the Cray sites to update
parallel-netcdf 1.2.0 due to a bug in opening files (pnetcdf won't clobber
correctly).  I would also like this feature (ncmpi_inq_buffer_usage)
included in Cray's new build.  Can you please make a 1.3.1 distribution so
that I can make this request?

Thanks,

Jim

On Thu, Aug 16, 2012 at 5:20 AM, Wei-keng Liao
<wkliao at ece.northwestern.edu> wrote:

> ncmpi_inq_buffer_usage and its fortran API are now added in r1087
>
> Wei-keng
>
> On Aug 15, 2012, at 11:27 AM, Rob Latham wrote:
>
> > On Wed, Aug 15, 2012 at 10:10:02AM -0600, Jim Edwards wrote:
> >> Okay, so when do you need to call nfmpi_begin_indep_mode/
> >> nfmpi_end_indep_mode?    It doesn't seem to
> >> be entirely consistent anymore - is it?
> >
> > nfmpi_begin_indep_mode and nfmpi_end_indep_mode should continue to
> > wrap the blocking and independent nfmpi_put_ and nfmpi_get routines
> > (those that do not end in _all).
> >
> > begin/end should also bracket the independent nfmpi_wait, I think.
> >
> > If you are interested, I think the reason for all this flipping around
> > is essentially to keep the number of records in a record variable
> > consistent among processes.
> >
> > ==rob
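
A minimal C sketch of that bracketing, assuming ncid, varid, start, count,
and buf are set up elsewhere (the C entry points
ncmpi_begin_indep_data/ncmpi_end_indep_data are used here; this is an
illustration, not code from the thread):

    int err, req, status;

    /* nonblocking calls only queue a request; they can be posted in
     * either data mode */
    err = ncmpi_iput_vara_float(ncid, varid, start, count, buf, &req);

    err = ncmpi_begin_indep_data(ncid);   /* enter independent data mode */

    /* blocking, independent put (no _all suffix) must be bracketed */
    err = ncmpi_put_vara_float(ncid, varid, start, count, buf);

    /* the independent wait is bracketed as well */
    err = ncmpi_wait(ncid, 1, &req, &status);

    err = ncmpi_end_indep_data(ncid);     /* back to collective data mode */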
> >
> >>
> >> On Wed, Aug 15, 2012 at 10:01 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >>
> >>> On Wed, Aug 15, 2012 at 09:32:56AM -0600, Jim Edwards wrote:
> >>>> Hi Wei-keng,
> >>>>
> >>>> Yes, that looks like what I would need.  I have to think about the
> >>>> independent aspect - currently I am using collective operations in
> >>>> almost all cases.  The performance trade-offs of independent vs.
> >>>> collective operations are not really clear to me.  Why no collective
> >>>> bputs?
> >>>
> >>> Aw, Wei-keng already replied.   Well, here's my answer, which says the
> >>> same thing as Wei-keng but emphasises the "put it on a list" and
> >>> "execute this list" aspects of these APIs.
> >>>
> >>> The 'buffered put' routines are a variant of the non-blocking
> >>> routines.  These routines defer all I/O to the wait or wait_all
> >>> routine, where all pending I/O requests for a given process are
> >>> stitched together into one bigger request.
> >>>
> >>> So, issuing an I/O operation under these interfaces is essentially
> >>> "put it on a list".  Then, "execute this list" can be done either
> >>> independently (ncmpi_wait) or collectively (ncmpi_wait_all).
> >>>
> >>> A very early instance of these routines did the "put it on a list"
> >>> collectively.  This approach did not work out so well for applications
> >>> (for example, Chombo) where processes make a bunch of small,
> >>> uncoordinated I/O requests but still have a clear part of their code
> >>> where "collectively wait for everyone to finish" made sense.
> >>>
> >>> I hope you have enjoyed today's episode of Parallel-NetCDF history
> >>> theater.
> >>>
> >>> ==rob
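
A sketch of that pattern in C (NREQS and the varid, start, count, and buf
arrays are illustrative assumptions, not from the thread): each iput call
only appends to the list, and a single collective wait_all executes it.

    enum { NREQS = 4 };
    int err, reqs[NREQS], statuses[NREQS];

    for (int i = 0; i < NREQS; i++) {
        /* "put it on a list": nothing touches the file here; with iput
         * (unlike bput) buf[i] must stay valid until the wait */
        err = ncmpi_iput_vara_float(ncid, varid[i], start[i], count[i],
                                    buf[i], &reqs[i]);
    }

    /* "execute this list": the queued requests are stitched into one
     * larger request.  ncmpi_wait_all is collective; ncmpi_wait is the
     * independent flavor. */
    err = ncmpi_wait_all(ncid, NREQS, reqs, statuses);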
> >>>
> >>>> On Wed, Aug 15, 2012 at 9:18 AM, Wei-keng Liao
> >>>> <wkliao at ece.northwestern.edu> wrote:
> >>>>
> >>>>>> The NC_EINSUFFBUF error code is returned from the bput call?
> >>>>>
> >>>>> I found a bug that 1.3.0 fails to return this error code. r1086
> >>>>> fixes this bug.
> >>>>>
> >>>>>
> >>>>>> If you get that error will you need to make that same bput call
> >>>>>> again after flushing?  But the other tasks involved in the same
> >>>>>> bput call who didn't have full buffers would do what?
> >>>>>
> >>>>> My idea is to skip the bput request when NC_EINSUFFBUF is returned.
> >>>>> Flushing at the wait call will only flush the successful bput calls,
> >>>>> so yes, you need to make the same failed bput call again after
> >>>>> flushing.
> >>>>>
> >>>>> Please note that the bput APIs are independent. There is no "other
> >>>>> tasks in the same bput call" issue.
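
In other words, a failed bput is simply not queued. A hedged sketch of the
skip-and-retry pattern in C (queue_for_retry, add_request, and
retry_pending are hypothetical application helpers, not pnetcdf APIs):

    int err, req;

    err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
    if (err == NC_EINSUFFBUF) {
        /* non-fatal: this request did NOT make it into the buffer */
        queue_for_retry(varid, start, count, buf);
    } else if (err == NC_NOERR) {
        add_request(req);           /* remember the id for the wait call */
    }

    /* at a point every process reaches: flush the bputs that succeeded */
    err = ncmpi_wait_all(ncid, num_reqs, req_ids, statuses);

    /* the internal buffer is now empty; re-issue the failed bput calls */
    retry_pending(ncid);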
> >>>>>
> >>>>>
> >>>>>> I could use a query function and, to avoid the independent write
> >>>>>> calls, would do an mpi_allreduce on the max memory used before
> >>>>>> calling mpi_waitall.  If the max is approaching the buffer size, I
> >>>>>> would flush all IO tasks.  This is basically what I have
> >>>>>> implemented in pio with iput: I have a user-determined limit on
> >>>>>> the size of the buffer and grow the buffer with each iput call;
> >>>>>> when the buffer meets (or exceeds) the limit on any task, I call
> >>>>>> waitall on all tasks.
> >>>>>
> >>>>> This is a nice idea.
> >>>>>
> >>>>>
> >>>>> Please let me know if the new query API below will be sufficient
> >>>>> for you.
> >>>>>
> >>>>>  int ncmpi_inq_buffer_usage(int ncid, MPI_Offset *usage);
> >>>>>
> >>>>>  * "usage" will be returned with the current buffer usage in bytes.
> >>>>>  * Error codes may be invalid ncid or no attached buffer found.
> >>>>>
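
A sketch of how this query could drive the collective flush Jim describes
above, assuming a pnetcdf build that includes the query API; the half-full
threshold and the names comm, bufsize, nreqs, reqs, and statuses are
illustrative assumptions:

    MPI_Offset usage = 0;
    long long local_usage, max_usage;
    int err;

    /* independent, cheap query of this process's attached buffer */
    err = ncmpi_inq_buffer_usage(ncid, &usage);
    local_usage = (long long)usage;

    /* let every process learn the largest buffer usage */
    MPI_Allreduce(&local_usage, &max_usage, 1, MPI_LONG_LONG, MPI_MAX, comm);

    /* if any process is getting close to its attached buffer size,
     * flush the queued bput requests everywhere with the collective wait */
    if (max_usage > bufsize / 2) {
        err = ncmpi_wait_all(ncid, nreqs, reqs, statuses);
        nreqs = 0;   /* the internal buffer is marked empty after wait_all */
    }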
> >>>>>
> >>>>>
> >>>>> Wei-keng
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Aug 14, 2012 at 10:07 PM, Wei-keng Liao
> >>>>>> <wkliao at ece.northwestern.edu> wrote:
> >>>>>> Hi, Jim,
> >>>>>>
> >>>>>> The usage of the bput APIs is very similar to iput, except for the
> >>>>>> following:
> >>>>>> 1. Users must tell pnetcdf the size of the buffer to be used by
> >>>>>>    pnetcdf internally (the attach and detach calls).
> >>>>>> 2. Once a bput API returns, the user's buffer can be reused or
> >>>>>>    freed (because the write data has been copied to the internal
> >>>>>>    buffer).
> >>>>>>
> >>>>>> The internal buffer is per file (as the attach API requires an ncid
> >>>>>> argument). It can be used to aggregate requests to multiple
> >>>>>> variables defined in the file.
> >>>>>>
> >>>>>> I did not implement a query API to check the current usage of the
> >>>>>> buffer. If this query is useful, we can implement it. Let me know.
> >>>>>> But please note this query will be an independent call, so you
> >>>>>> will have to call the independent wait (nfmpi_wait). The
> >>>>>> independent wait uses MPI independent I/O, causing poor
> >>>>>> performance, so it is not recommended. Otherwise, you need an MPI
> >>>>>> reduce to ensure all processes know when to call the collective
> >>>>>> wait_all.
> >>>>>>
> >>>>>> You are right about flushing. The buffer will not be flushed
> >>>>>> automatically, and all file I/O happens in wait_all. If the
> >>>>>> attached buffer runs out of space, the NC_EINSUFFBUF error code
> >>>>>> (non-fatal) will be returned. It can be used to decide when to
> >>>>>> call the wait API, as described above. However, automatic flushing
> >>>>>> would require MPI independent I/O, again meaning poor performance.
> >>>>>> So I recommend making sure the buffer size is sufficiently large.
> >>>>>> In addition, if you let pnetcdf do type conversion between two
> >>>>>> types of different size (e.g. short to int), you must calculate
> >>>>>> the size of the attached buffer using the larger type.
> >>>>>>
> >>>>>> If automatic flushing is highly desired, we can add it later.
> >>>>>>
> >>>>>> Once the call to wait/wait_all returns, the internal buffer is
> >>>>>> marked empty.
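
Pulling those points together, a hedged end-to-end sketch in C: the file
name, dimensions, and four-rank decomposition are made up for
illustration; the key points are the attached-buffer size computed with
the larger element size (the float data is converted to an int variable)
and the single collective flush at wait_all.

    #include <mpi.h>
    #include <pnetcdf.h>

    #define NY 4
    #define NX 10

    /* attach -> bput -> wait_all -> detach, on a communicator of 4 ranks */
    int write_buffered(MPI_Comm comm, int rank)
    {
        int ncid, dimid[2], varid, req, st, err;
        MPI_Offset start[2], count[2];
        float data[NY][NX];

        for (int j = 0; j < NY; j++)          /* some illustrative data */
            for (int i = 0; i < NX; i++)
                data[j][i] = (float)(rank * 100 + j * NX + i);

        err = ncmpi_create(comm, "demo.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
        err = ncmpi_def_dim(ncid, "Y", NY, &dimid[0]);
        err = ncmpi_def_dim(ncid, "X", NX * 4, &dimid[1]);
        err = ncmpi_def_var(ncid, "var", NC_INT, 2, dimid, &varid);
        err = ncmpi_enddef(ncid);

        /* the variable is NC_INT but the memory buffer is float, so size
         * the internal buffer using the larger of the two element sizes */
        err = ncmpi_buffer_attach(ncid, (MPI_Offset)NY * NX * sizeof(int));

        start[0] = 0;   start[1] = (MPI_Offset)rank * NX;
        count[0] = NY;  count[1] = NX;

        /* data[] is copied internally; it could be reused or freed now */
        err = ncmpi_bput_vara_float(ncid, varid, start, count,
                                    &data[0][0], &req);

        err = ncmpi_wait_all(ncid, 1, &req, &st);   /* all file I/O here */

        err = ncmpi_buffer_detach(ncid);
        return ncmpi_close(ncid);
    }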
> >>>>>>
> >>>>>> Let me know if the above answers your questions.
> >>>>>>
> >>>>>> Wei-keng
> >>>>>>
> >>>>>> On Aug 14, 2012, at 2:04 PM, Jim Edwards wrote:
> >>>>>>
> >>>>>>> No, the flush must happen in the nfmpi_wait_all.
> >>>>>>> But does that call mark the buffer as empty?  I'll wait and bug
> >>>>>>> Wei-keng.
> >>>>>>>
> >>>>>>> On Tue, Aug 14, 2012 at 12:56 PM, Rob Latham <robl at mcs.anl.gov>
> >>>>>>> wrote:
> >>>>>>> On Tue, Aug 14, 2012 at 12:52:46PM -0600, Jim Edwards wrote:
> >>>>>>>> Hi Rob,
> >>>>>>>>
> >>>>>>>> I assume that the same buffer can be used for multiple variables
> >>>>>>>> (as long as they are associated with the same file).  Is there a
> >>>>>>>> query function so that you know when you've used the entire
> >>>>>>>> buffer and it's time to flush?
> >>>>>>>
> >>>>>>> It does not appear to be so.  The only non-data-movement routines
> >>>>>>> in the API are these:
> >>>>>>>
> >>>>>>> int ncmpi_buffer_attach(int ncid, MPI_Offset bufsize);
> >>>>>>> int ncmpi_buffer_detach(int ncid);
> >>>>>>>
> >>>>>>> The end-user doesn't flush, I don't think.  I had the impression
> >>>>>>> that once the buffer filled up, the library did the flush, then
> >>>>>>> started filling up the buffer again.  This one I'll need Wei-keng
> >>>>>>> to confirm.
> >>>>>>>
> >>>>>>> ==rob
> >>>>>>>
> >>>>>>>> Jim
> >>>>>>>>
> >>>>>>>> On Tue, Aug 14, 2012 at 11:41 AM, Rob Latham <robl at mcs.anl.gov>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Aug 14, 2012 at 10:50:15AM -0600, Jim Edwards wrote:
> >>>>>>>>>> No, I'm using iput and blocking get.  I'm doing my own
> >>>>>>>>>> buffering layer in pio.  I might consider using the bput
> >>>>>>>>>> functions - can you point me to some documentation/examples?
> >>>>>>>>>
> >>>>>>>>> Sure.  It's too bad Wei-keng is on vacation this month, as he's
> >>>>>>>>> the one who designed and implemented this new feature for
> >>>>>>>>> pnetcdf 1.3.0.
> >>>>>>>>> Wei-keng: I'm not expecting you to reply while on vacation.  I'm
> >>>>>>>>> just CCing you so you know I'm talking about your work :>
> >>>>>>>>>
> >>>>>>>>> I think this might be the entire contents of our documentation:
> >>>>>>>>>
> >>>>>>>>> "A new set of buffered put APIs (eg. ncmpi_bput_vara_float) is
> >>> added.
> >>>>>>>>> They make a copy of the user's buffer internally, so the user's
> >>>>> buffer
> >>>>>>>>> can be reused when the call returns. Their usage are similar to
> >>> the
> >>>>>>>>> iput APIs. "
> >>>>>>>>>
> >>>>>>>>> Hey, check that out: Wei-keng wrote up a Fortran example:
> >>>>>>>>>
> >>>>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-bufferedf.F
> >>>>>>>>>
> >>>>>>>>> There's also the C version:
> >>>>>>>>>
> >>>>>>>>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-buffered.c
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ==rob
> >>>>>>>>>
> >>>>>>>>>> On Tue, Aug 14, 2012 at 10:16 AM, Rob Latham <robl at mcs.anl.gov>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Jim
> >>>>>>>>>>>
> >>>>>>>>>>> You've been using the new 'bput/bget' routines, right?  Can
> >>>>>>>>>>> you tell me a bit about what you are using them for, and what
> >>>>>>>>>>> -- if any -- benefit they've provided?
> >>>>>>>>>>>
> >>>>>>>>>>> (Rationale: our program management likes to see papers and
> >>>>>>>>>>> presentations, but the most valued contribution is 'science
> >>>>>>>>>>> impact'.)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> ==rob
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Rob Latham
> >>>>>>>>>>> Mathematics and Computer Science Division
> >>>>>>>>>>> Argonne National Lab, IL USA
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Rob Latham
> >>>>>>>>> Mathematics and Computer Science Division
> >>>>>>>>> Argonne National Lab, IL USA
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Rob Latham
> >>>>>>> Mathematics and Computer Science Division
> >>>>>>> Argonne National Lab, IL USA
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Jim Edwards
> >>>>>>>
> >>>>>>> CESM Software Engineering Group
> >>>>>>> National Center for Atmospheric Research
> >>>>>>> Boulder, CO
> >>>>>>> 303-497-1842
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Jim Edwards
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Rob Latham
> >>> Mathematics and Computer Science Division
> >>> Argonne National Lab, IL USA
> >>>
> >>
> >>
> >>
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
>
>


-- 

Jim Edwards