Independent write

Rob Ross rross at mcs.anl.gov
Sun Apr 4 20:52:43 CDT 2004


Hi Roger,

It wouldn't really be *parallel* I/O if you had to serialize writes from 
all your processes!

It is valid in PnetCDF for multiple processes to simultaneously and 
*independently* write to different positions in the same file.  What is 
not guaranteed is that, if you start reading from the file at the same 
time, you'll get the final values without synchronizing the file first 
(but I don't think that this is an issue for you).

What you lose in using independent mode over the collective mode is 
optimizations that can be applied underneath the calls.  The collective 
mode tells the implementation "Hey, all these processes are doing I/O 
right now in a related way", and that allows the MPI-IO implementation to 
do tricky stuff that makes things go faster.  In independent mode those 
things aren't done.
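
To make that concrete, here is a minimal sketch in C of the two flavors 
of a PnetCDF write.  The variable layout and names are made up for 
illustration, error checking is omitted, and ncid/varid would come from 
the usual ncmpi_open / ncmpi_inq_varid calls:

  #include <mpi.h>
  #include <pnetcdf.h>

  /* Each rank writes one record of length reclen into a 2-D double
     variable; rank r owns record r.  Collective flavor: every process
     calls this together, so the MPI-IO layer can merge and reorganize
     the requests. */
  void write_record_collective(int ncid, int varid, int rank,
                               MPI_Offset reclen, const double *buf)
  {
      MPI_Offset start[2] = { rank, 0 };
      MPI_Offset count[2] = { 1, reclen };
      ncmpi_put_vara_double_all(ncid, varid, start, count, buf);
  }

  /* Independent flavor: processes may call this at different times, but
     none of the collective optimizations can happen.  The call has to be
     bracketed by begin/end_indep_data. */
  void write_record_independent(int ncid, int varid, int rank,
                                MPI_Offset reclen, const double *buf)
  {
      MPI_Offset start[2] = { rank, 0 };
      MPI_Offset count[2] = { 1, reclen };
      ncmpi_begin_indep_data(ncid);
      ncmpi_put_vara_double(ncid, varid, start, count, buf);
      ncmpi_end_indep_data(ncid);
  }

Either way, as long as the (start, count) regions on different processes 
don't overlap, the writes are valid.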

Does that help?

Rob

On 28 Mar 2004, Roger Ting wrote:

> Hi 
> 	After rereading your email, I find the discussion about overlapping I/O
> confusing. Your explanation seems to indicate that multiple processors
> can write to a single file independently, given that the positions they
> are writing to are different. This is assuming we are in independent
> mode.
> 	My application returns an MPI_File_write error when I do this. It
> seems that in independent mode multiple processors should not be able
> to write to a single file even if they are writing to different
> positions. Some references indicate that if you set the file view to be
> MPI_Comm_Self, that file is exclusively for that processor.
> 	Please correct me if I am wrong. I am new to parallel I/O.
> 
> Regards,
> 
> Roger 
> 
> On Sat, 2004-03-13 at 02:10, Rob Ross wrote:
> > On 12 Mar 2004, Roger Ting wrote:
> > 
> > > Does the independent writing coordinate all the processors?
> > 
> > Nope, by definition independent writing is not coordinated.
> > 
> > > I mean I have a netCDF file to which each processor will append a new
> > > entry at the end of the file. For the append operation I use the
> > > independent-mode write operation. It seems that if processor 1 appends
> > > an entry at position i and processor 2 also wants to append another
> > > entry to the file, it will overwrite the entry at position i because it
> > > doesn't realise that a processor has already appended an entry there.
> > 
> > To implement an append mode, there would have to be some sort of 
> > communication between processes that kept everyone up to date about what 
> > entry was the "last" one, and some mechanism for ensuring that only one 
> > process got to write to that position.
> > 
> > That is a generally difficult thing to implement in a low-overhead, 
> > scalable way, regardless of the API.
> > 
> > Luckily the netCDF API doesn't include this sort of thing, so we don't
> > have to worry about it.  What functions were you using to try to do this?
> > 
> > [ Goes and looks at next email. ]
> > 
> > Ok.  Even *if* the nfmpi_inq_dimlen were returning the accurate value 
> > (which it may or may not be), there would be a race condition that is 
> > unavoidable.  Any one of your processes can see the same value and decide 
> > to write there.  That's not a PnetCDF interface deficiency; you're just 
> > trying to do something that you shouldn't try to do without external 
> > synchronization.
> > 
> > > Is there a way around this? Ideally, each processor could just append
> > > to the file at position i without worrying that another processor has
> > > already written to that position.
> > 
> > Again, even if there were a way to do this in PnetCDF (which I do not 
> > think that there is), it would not be high performance.
> > 
> > I would have to know a little more about your application to know how you 
> > could better perform this operation, but here are some possible solutions.
> > 
> > If your application has clear I/O and computation phases, I would suggest 
> > using collective I/O rather than independent I/O.  You could have your 
> > processes communicate with each other regarding the # of records that they 
> > want to write, partition up the space, and perform a collective write of 
> > all records without trouble.  MPI_Allgather would be a nice way to do the 
> > communication of # of records in a scalable way.
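> >
> > A rough sketch of that approach in C, with made-up names (e.g. "nrec"
> > for the number of records this process wants to append and "old_len"
> > for the current length of the record dimension) and no error checking:
> >
> >   #include <stdlib.h>
> >   #include <mpi.h>
> >   #include <pnetcdf.h>
> >
> >   /* Append nrec records of length reclen to variable varid.  old_len
> >      is the current length of the record dimension, known to all ranks. */
> >   void append_collectively(int ncid, int varid, MPI_Offset old_len,
> >                            int nrec, MPI_Offset reclen, const double *buf)
> >   {
> >       int nprocs, rank, i;
> >       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >       /* everyone learns how many records everyone else will write */
> >       int *counts = malloc(nprocs * sizeof(int));
> >       MPI_Allgather(&nrec, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);
> >
> >       /* my slice starts after the existing records plus everything that
> >          the lower-ranked processes will write */
> >       MPI_Offset my_start = old_len;
> >       for (i = 0; i < rank; i++)
> >           my_start += counts[i];
> >       free(counts);
> >
> >       MPI_Offset start[2] = { my_start, 0 };
> >       MPI_Offset count[2] = { nrec, reclen };
> >
> >       /* one collective write of all of this process's records */
> >       ncmpi_put_vara_double_all(ncid, varid, start, count, buf);
> >   }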
> > 
> > If your application truly has lots of independent processes writing to the
> > file, I suggest using MPI to pass a token between processes that specifies
> > that a given process is done performing I/O and includes the next entry
> > number to write to.  Then processes could cache values to write, use 
> > MPI_Irecv to post a nonblocking recv for the token, and MPI_Test to see if 
> > they've gotten it when they hit convenient points.  Not trivial, but it 
> > would turn the pattern into something deterministic, and you would end up 
> > with better overall performance from aggregating the writes of the 
> > records.  To get overlap of I/O, a process could immediately pass the 
> > token on, after adding in the # of records that it has to write, and 
> > then perform its own writes, so the next process doesn't have to wait.
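> >
> > A very rough sketch of that token idea in C; TOKEN_TAG, my_nrec, and the
> > function name are made up, error handling is omitted, and in real code
> > the MPI_Test check would be interleaved with computation rather than
> > done in a tight loop:
> >
> >   #include <mpi.h>
> >
> >   #define TOKEN_TAG 17    /* any otherwise unused tag */
> >
> >   /* The token is just the next free record number.  Rank 0 starts out
> >      holding it (initialized to the current file length) and each process
> >      hands it to rank+1 as soon as it has claimed its own slots. */
> >   void claim_records(MPI_Offset cur_len, MPI_Offset my_nrec,
> >                      MPI_Offset *my_start)
> >   {
> >       int rank, nprocs, flag = 0;
> >       MPI_Offset next_rec = cur_len;
> >       MPI_Request req;
> >
> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >
> >       if (rank > 0) {
> >           /* post a nonblocking recv for the token from the previous rank */
> >           MPI_Irecv(&next_rec, 1, MPI_OFFSET, rank - 1, TOKEN_TAG,
> >                     MPI_COMM_WORLD, &req);
> >           /* check for the token at convenient points in the computation */
> >           do {
> >               MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
> >           } while (!flag);
> >       }
> >
> >       *my_start = next_rec;          /* first record this process owns */
> >       next_rec += my_nrec;           /* account for the cached records  */
> >
> >       /* pass the token on right away so the next process isn't held up */
> >       if (rank < nprocs - 1)
> >           MPI_Send(&next_rec, 1, MPI_OFFSET, rank + 1, TOKEN_TAG,
> >                    MPI_COMM_WORLD);
> >
> >       /* now write the cached records starting at *my_start, e.g. with an
> >          independent ncmpi_put_vara_double call */
> >   }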
> > 
> > The collective I/O approach is going to get better performance, especially 
> > at scale.
> > 
> > There is no way to accurately get the dimension of a variable during
> > independent mode because of the rules for use of that function (which I'll
> > discuss in response to your next email).
> > 
> > I am happy to further discuss this with you if it would help.  I realize 
> > that the solutions that I have proposed require additional work, and that 
> > it would be nice if the I/O API just did this stuff for you, but it's just 
> > not as easy as that.  I do think that we can come up with a good solution.
> > 
> > Regards,
> > 
> > Rob
> > 
> 
> 



