parallel-netcdf buffered I/O interface
Rob Latham
robl at mcs.anl.gov
Wed Aug 15 11:27:23 CDT 2012
On Wed, Aug 15, 2012 at 10:10:02AM -0600, Jim Edwards wrote:
> Okay, so when do you need to call nfmpi_begin_indep_mode/
> nfmpi_end_indep_mode? It doesn't seem to
> be entirely consistent anymore - is it?
nfmpi_begin_indep_mode and nfmpi_end_indep_mode should continue to
wrap the blocking, independent nfmpi_put_* and nfmpi_get_* routines
(those that do not end in _all).

begin/end should also bracket the independent nfmpi_wait, I think.

If you are interested, I think the reason for all this flipping around
is essentially to keep the number of records in a record variable
consistent across processors.
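
For what it's worth, here is a minimal C sketch of that call sequence
(hedged: ncmpi_begin_indep_data/ncmpi_end_indep_data are the C-side
calls, and the ncid/varid/start/count/buf/reqs arguments are
placeholders for values taken from an already-open file):

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Sketch only: bracket blocking independent puts/gets and the
     * independent wait with begin/end independent data mode. */
    void independent_io(int ncid, int varid,
                        const MPI_Offset *start, const MPI_Offset *count,
                        const float *buf, int nreqs, int *reqs, int *stats)
    {
        ncmpi_begin_indep_data(ncid);                         /* leave collective data mode */
        ncmpi_put_vara_float(ncid, varid, start, count, buf); /* blocking, independent put */
        ncmpi_wait(ncid, nreqs, reqs, stats);                 /* independent wait on pending requests */
        ncmpi_end_indep_data(ncid);                           /* back to collective data mode */
    }
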
==rob
>
> On Wed, Aug 15, 2012 at 10:01 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>
> > On Wed, Aug 15, 2012 at 09:32:56AM -0600, Jim Edwards wrote:
> > > Hi Wei-keng,
> > >
> > > Yes, that looks like what I would need. I have to think about the
> > > independent aspect - currently I am using collective operations in
> > > almost all cases. The performance trade-offs of independent vs
> > > collective operations are not really clear to me. Why no collective
> > > bputs?
> >
> > Aw, Wei-keng already replied. Well, here's my answer, which says the
> > same thing as Wei-keng but emphasises the "put it on a list" and
> > "execute this list" aspects of these APIs.
> >
> > The 'buffered put' routines are a variant of the non-blocking
> > routines. These routines defer all I/O to the wait or wait_all
> > routine, where all pending I/O requests for a given process are
> > stitched together into one bigger request.
> >
> > So, issuing an I/O operation under these interfaces is essentially
> > "put it on a list". Then, "execute this list" can be done either
> > independently (ncmpi_wait) or collectively (ncmpi_wait_all).
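> >
> > To make that concrete, here is a rough, untested C sketch (ncid,
> > varid_a and varid_b are placeholders for an open file with two defined
> > variables, and a buffer must already be attached for the bput call):
> >
> >     int reqs[2], stats[2];
> >     float a[10], b[10];
> >     MPI_Offset start[1] = {0}, count[1] = {10};
> >     /* ... fill a and b ... */
> >
> >     /* "put it on a list": these only queue requests, no file I/O yet */
> >     ncmpi_iput_vara_float(ncid, varid_a, start, count, a, &reqs[0]);
> >     ncmpi_bput_vara_float(ncid, varid_b, start, count, b, &reqs[1]);
> >
> >     /* "execute this list": collectively ... */
> >     ncmpi_wait_all(ncid, 2, reqs, stats);
> >     /* ... or independently, inside begin/end independent data mode:
> >      *     ncmpi_wait(ncid, 2, reqs, stats);                          */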
> >
> > A very early instance of these routines did the "put it on a list"
> > step collectively. This approach did not work out so well for
> > applications (Chombo, for example) where processes make a bunch of
> > small, uncoordinated I/O requests but still have a clear point in
> > their code where "collectively wait for everyone to finish" makes
> > sense.
> >
> > I hope you have enjoyed today's episode of Parallel-NetCDF history
> > theater.
> >
> > ==rob
> >
> > > On Wed, Aug 15, 2012 at 9:18 AM, Wei-keng Liao
> > > <wkliao at ece.northwestern.edu> wrote:
> > >
> > > > > The NC_EINSUFFBUF error code is returned from the bput call?
> > > >
> > > > I found a bug that 1.3.0 fails to return this error code. r1086
> > > > fixes this bug.
> > > >
> > > >
> > > > > If you get that error will you need to make that same bput call
> > > > > again after flushing? But the other tasks involved in the same
> > > > > bput call who didn't have full buffers would do what?
> > > >
> > > > My idea is to skip the bput request when NC_EINSUFFBUF is returned.
> > > > Flushing at the wait call will only flush those successful bput
> > > > calls, so yes, you need to make the same failed bput call again
> > > > after flushing.
> > > >
> > > > Please note that bput APIs are independent. There is no "other tasks in
> > > > the same bput call" issue.
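> > > >
> > > > A rough sketch of that skip-flush-retry pattern (untested; ncid,
> > > > varid, start, count and buf are placeholders, and pending[],
> > > > npending, statuses track request ids from earlier bput calls):
> > > >
> > > >     int req, err;
> > > >     err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
> > > >     if (err == NC_EINSUFFBUF) {
> > > >         /* attached buffer is full: the request above was skipped,
> > > >          * so flush what has been queued so far, then issue it
> > > >          * again. (Collective wait shown; it assumes every process
> > > >          * reaches this point. The uncoordinated alternative is
> > > >          * ncmpi_wait inside begin/end independent data mode.) */
> > > >         ncmpi_wait_all(ncid, npending, pending, statuses);
> > > >         npending = 0;
> > > >         err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
> > > >     }
> > > >     if (err == NC_NOERR)
> > > >         pending[npending++] = req;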
> > > >
> > > >
> > > > > I could use a query function and, to avoid the independent write
> > > > > calls, would do an mpi_allreduce on the max memory used before
> > > > > calling the mpi_waitall. If the max is approaching the buffer
> > > > > size I would flush all io tasks. This is basically what I have
> > > > > implemented in pio with iput - I have a user-determined limit on
> > > > > the size of the buffer and grow the buffer with each iput call;
> > > > > when the buffer meets (or exceeds) the limit on any task I call
> > > > > waitall on all tasks.
> > > >
> > > > This is a nice idea.
> > > >
> > > >
> > > > Please let me know if the new query API below will be sufficient
> > > > for you.
> > > >
> > > > int ncmpi_inq_buffer_usage(int ncid, MPI_Offset *usage);
> > > >
> > > > * "usage" will be returned with the current buffer usage in bytes.
> > > > * Possible error codes: invalid ncid, or no attached buffer found.
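> > > >
> > > > A sketch of how it could plug into the scheme you describe (hedged:
> > > > the query call is only a proposal at this point, and comm,
> > > > threshold, nreqs, reqs and stats are placeholders):
> > > >
> > > >     MPI_Offset usage, max_usage;
> > > >     ncmpi_inq_buffer_usage(ncid, &usage);         /* local, independent query */
> > > >     MPI_Allreduce(&usage, &max_usage, 1, MPI_OFFSET, MPI_MAX, comm);
> > > >     if (max_usage >= threshold)                   /* e.g. most of the attached bufsize */
> > > >         ncmpi_wait_all(ncid, nreqs, reqs, stats); /* collective flush on all io tasks */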
> > > >
> > > >
> > > >
> > > > Wei-keng
> > > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 14, 2012 at 10:07 PM, Wei-keng Liao
> > > > > <wkliao at ece.northwestern.edu> wrote:
> > > > > Hi, Jim,
> > > > >
> > > > > The usage of the bput APIs is very similar to iput, except for
> > > > > the following.
> > > > > 1. Users must tell pnetcdf the size of the buffer to be used by
> > > > >    pnetcdf internally (the attach and detach calls).
> > > > > 2. Once a bput API returns, the user's buffer can be reused or
> > > > >    freed (because the write data has been copied to the internal
> > > > >    buffer).
> > > > >
> > > > > The internal buffer is per file (as the attach API requires an
> > > > > ncid argument). It can be used to aggregate requests to multiple
> > > > > variables defined in the file.
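> > > > >
> > > > > A minimal end-to-end sketch of the attach/bput/wait/detach
> > > > > lifecycle (untested; ncid, varid1, varid2, start and count are
> > > > > placeholders from an open file):
> > > > >
> > > > >     MPI_Offset bufsize = 2 * 10 * sizeof(float); /* two 10-float requests */
> > > > >     ncmpi_buffer_attach(ncid, bufsize);          /* per-file internal buffer */
> > > > >
> > > > >     float tmp[10];
> > > > >     int reqs[2], stats[2];
> > > > >     /* ... fill tmp for the first variable ... */
> > > > >     ncmpi_bput_vara_float(ncid, varid1, start, count, tmp, &reqs[0]);
> > > > >     /* tmp may be reused right away: the data was copied internally */
> > > > >     /* ... refill tmp for the second variable ... */
> > > > >     ncmpi_bput_vara_float(ncid, varid2, start, count, tmp, &reqs[1]);
> > > > >
> > > > >     ncmpi_wait_all(ncid, 2, reqs, stats);        /* actual file I/O happens here */
> > > > >     ncmpi_buffer_detach(ncid);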
> > > > >
> > > > > I did not implement a query API to check the current usage of
> > > > > the buffer. If this query is useful, we can implement it. Let me
> > > > > know. But please note this query will be an independent call, so
> > > > > you will have to call the independent wait (nfmpi_wait). The
> > > > > independent wait uses MPI independent I/O, which gives poor
> > > > > performance and is not recommended. Otherwise, you need an MPI
> > > > > reduce to ensure all processes know when to call the collective
> > > > > wait_all.
> > > > >
> > > > > You are right about flushing. The buffer will not be flushed
> > > > > automatically, and all file I/O happens in wait_all.
> > > > > If the attached buffer runs out of space, the NC_EINSUFFBUF error
> > > > > code (non-fatal) will be returned. It can be used to decide when
> > > > > to call the wait API, as described above. However, automatic
> > > > > flushing would require MPI independent I/O, again meaning poor
> > > > > performance. So I recommend making sure the buffer size is
> > > > > sufficiently large. In addition, if you let pnetcdf do type
> > > > > conversion between two types of different sizes (e.g. short to
> > > > > int), you must calculate the size of the attached buffer using
> > > > > the larger type.
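> > > > >
> > > > > For example (numbers made up): writing 1000 values from a short
> > > > > buffer into an int variable needs the buffer sized for the larger
> > > > > type:
> > > > >
> > > > >     MPI_Offset bufsize = 1000 * sizeof(int); /* 4000 bytes, not 1000 * sizeof(short) */
> > > > >     ncmpi_buffer_attach(ncid, bufsize);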
> > > > >
> > > > > If automatic flushing is highly desired, we can add it later.
> > > > >
> > > > > Once the call to wait/wait_all returns, the internal buffer is
> > > > > marked empty.
> > > > >
> > > > > Let me know if the above answers your questions.
> > > > >
> > > > > Wei-keng
> > > > >
> > > > > On Aug 14, 2012, at 2:04 PM, Jim Edwards wrote:
> > > > >
> > > > > > No, the flush must happen in the nfmpi_wait_all.
> > > > > > But does that call mark the buffer as empty? I'll wait and bug
> > > > > > Wei-keng.
> > > > > >
> > > > > > On Tue, Aug 14, 2012 at 12:56 PM, Rob Latham <robl at mcs.anl.gov>
> > > > > > wrote:
> > > > > > On Tue, Aug 14, 2012 at 12:52:46PM -0600, Jim Edwards wrote:
> > > > > >> Hi Rob,
> > > > > >>
> > > > > >> I assume that the same buffer can be used for multiple
> > > > > >> variables (as long as they are associated with the same file).
> > > > > >> Is there a query function so that you know when you've used
> > > > > >> the entire buffer and it's time to flush?
> > > > > >
> > > > > > It does not appear to be so. The only non-data-movement
> > > > > > routines in the API are these:
> > > > > >
> > > > > > int ncmpi_buffer_attach(int ncid, MPI_Offset bufsize);
> > > > > > int ncmpi_buffer_detach(int ncid);
> > > > > >
> > > > > > The end-user doesn't flush, I don't think. I had the impression
> > > > > > that once the buffer filled up, the library did the flush, then
> > > > > > started filling up the buffer again. This one I'll need Wei-keng
> > > > > > to confirm.
> > > > > >
> > > > > > ==rob
> > > > > >
> > > > > >> Jim
> > > > > >>
> > > > > >> On Tue, Aug 14, 2012 at 11:41 AM, Rob Latham <robl at mcs.anl.gov>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> On Tue, Aug 14, 2012 at 10:50:15AM -0600, Jim Edwards wrote:
> > > > > >>>> No, I'm using iput and blocking get. I'm doing my own
> > > > > >>>> buffering layer in pio. I might consider using the bput
> > > > > >>>> functions - can you point me to some documentation/examples?
> > > > > >>>
> > > > > >>> Sure. It's too bad Wei-keng is on vacation this month, as
> > > > > >>> he's the one who designed and implemented this new feature
> > > > > >>> for pnetcdf 1.3.0.
> > > > > >>> Wei-keng: I'm not expecting you to reply while on vacation.
> > > > > >>> I'm just CCing you so you know I'm talking about your work :>
> > > > > >>>
> > > > > >>> I think this might be the entire contents of our documentation:
> > > > > >>>
> > > > > >>> "A new set of buffered put APIs (eg. ncmpi_bput_vara_float) is
> > added.
> > > > > >>> They make a copy of the user's buffer internally, so the user's
> > > > buffer
> > > > > >>> can be reused when the call returns. Their usage are similar to
> > the
> > > > > >>> iput APIs. "
> > > > > >>>
> > > > > >>> Hey, check that out: Wei-keng wrote up a fortran example:
> > > > > >>>
> > > > > >>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-bufferedf.F
> > > > > >>>
> > > > > >>> There's also the C version:
> > > > > >>>
> > > > > >>> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples/tutorial/pnetcdf-write-buffered.c
> > > > > >>>
> > > > > >>>
> > > > > >>> ==rob
> > > > > >>>
> > > > > >>>> On Tue, Aug 14, 2012 at 10:16 AM, Rob Latham <robl at mcs.anl.gov>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi Jim
> > > > > >>>>>
> > > > > >>>>> You've been using the new 'bput/bget' routines, right? Can
> > > > > >>>>> you tell me a bit about what you are using them for, and
> > > > > >>>>> what -- if any -- benefit they've provided?
> > > > > >>>>>
> > > > > >>>>> (Rationale: our program management likes to see papers and
> > > > > >>>>> presentations, but the most valued contribution is 'science
> > > > > >>>>> impact').
> > > > > >>>>>
> > > > > >>>>> Thanks
> > > > > >>>>> ==rob
> > > > > >>>>>
> > > > > >>>>> --
> > > > > >>>>> Rob Latham
> > > > > >>>>> Mathematics and Computer Science Division
> > > > > >>>>> Argonne National Lab, IL USA
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Rob Latham
> > > > > >>> Mathematics and Computer Science Division
> > > > > >>> Argonne National Lab, IL USA
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > > > --
> > > > > > Rob Latham
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Lab, IL USA
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jim Edwards
> > > > > >
> > > > > > CESM Software Engineering Group
> > > > > > National Center for Atmospheric Research
> > > > > > Boulder, CO
> > > > > > 303-497-1842
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jim Edwards
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >
>
>
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA