Independent write

Rob Ross rross at mcs.anl.gov
Fri Mar 12 09:10:11 CST 2004


On 12 Mar 2004, Roger Ting wrote:

> Does the independent writing coordinate all the processors?

Nope, by definition independent writing is not coordinated.

> I mean I have a netCDF file to which each processor will append a new
> entry at the end of the file. For the append operation I use
> independent-mode writes. It seems like if processor 1 appends an entry
> at position i and processor 2 also wants to append another entry to the
> file, it will overwrite the entry at position i, because it doesn't
> realise that another processor has already appended an entry there.

To implement an append mode, there would have to be some sort of 
communication between processes that kept everyone up to date about what 
entry was the "last" one, and some mechanism for ensuring that only one 
process got to write to that position.

That is a generally difficult thing to implement in a low-overhead, 
scalable way, regardless of the API.

Luckily the netCDF API doesn't include this sort of thing, so we don't
have to worry about it.  What functions were you using to try to do this?

[ Goes and looks at next email. ]

Ok.  Even *if* nfmpi_inq_dimlen were returning an accurate value (which
it may or may not be), there would be an unavoidable race condition: two
or more of your processes can see the same value and each decide to write
at that position.  That's not a PnetCDF interface deficiency; you're just
trying to do something that you shouldn't try to do without external
synchronization.
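
To make the race concrete, here is roughly what that pattern looks like
through the C interface (the nfmpi_ Fortran calls behave the same way).
The handles and names below are made up, error checking is omitted, and I
assume the file is already in independent data mode:

#include <mpi.h>
#include <pnetcdf.h>

/* Every process asks for the current length of the record dimension and
 * writes "one past" it.  With no coordination, two processes can see the
 * same length and pick the same record. */
void racy_append(int ncid, int varid, int rec_dimid, double value)
{
    MPI_Offset nrecs, start[1], count[1];

    ncmpi_inq_dimlen(ncid, rec_dimid, &nrecs);   /* both ranks may see 10 */

    start[0] = nrecs;                            /* ...so both pick record 10 */
    count[0] = 1;
    ncmpi_put_vara_double(ncid, varid, start, count, &value);
    /* one of the two writes silently clobbers the other */
}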

> Is there a way around this? Ideally, each processor can just append to
> the file at position i without worrying that another processor has
> already written to that position.

Again, even if there were a way to do this in PnetCDF (which I do not 
think that there is), it would not be high performance.

I would have to know a little more about your application to say how you 
could best perform this operation, but here are some possible approaches.

If your application has clear I/O and computation phases, I would suggest 
using collective I/O rather than independent I/O.  Your processes could 
tell each other how many records each wants to write, partition up the 
space, and perform a collective write of all records without trouble.  
MPI_Allgather would be a nice, scalable way to communicate the # of 
records.
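
A rough sketch of what I have in mind, assuming a 1-D record variable of
doubles along the unlimited dimension; old_nrecs is the record count
before this round of writes, the names are mine and not from your code,
and error checking is left out:

#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

/* Each process brings my_count records; everyone learns everyone else's
 * count, computes its own starting record, and then all write together. */
void append_collectively(int ncid, int varid, MPI_Offset old_nrecs,
                         const double *my_recs, int my_count, MPI_Comm comm)
{
    int i, rank, nprocs, *counts;
    MPI_Offset start[1], count[1];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    counts = (int *) malloc(nprocs * sizeof(int));
    MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

    /* my starting record = old length + records of all lower ranks */
    start[0] = old_nrecs;
    for (i = 0; i < rank; i++)
        start[0] += counts[i];
    count[0] = my_count;

    /* collective write; each process fills its own, disjoint slab */
    ncmpi_put_vara_double_all(ncid, varid, start, count, my_recs);

    free(counts);
}

A nice side effect is that every process ends up knowing the new total
number of records, so the nfmpi_inq_dimlen question goes away entirely.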

If your application truly has lots of independent processes writing to the
file, I suggest using MPI to pass a token between processes that says a
given process is done performing I/O and carries the next entry number to
write to.  Processes could then cache values to write, post a nonblocking
receive for the token with MPI_Irecv, and check for it with MPI_Test when
they hit convenient points.  Not trivial, but it would turn the pattern
into something deterministic, and you would end up with better overall
performance from aggregating the writes of the records.  To get overlap of
I/O, a process could pass the token on immediately (bumped by the # of
records it has to write) and then perform its writes, so the next process
doesn't have to wait.
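
Here is a sketch of that, with a long long token carrying the next free
record index around a ring of ranks.  Rank 0 would inject the initial
token holding the current record count; the tag and names are made up,
error checking is omitted, and I assume the file is already in
independent data mode:

#include <mpi.h>
#include <pnetcdf.h>

#define TOKEN_TAG 77

/* Post this early; prev is (rank + nprocs - 1) % nprocs. */
void post_token_recv(long long *token, int prev, MPI_Comm comm,
                     MPI_Request *req)
{
    MPI_Irecv(token, 1, MPI_LONG_LONG, prev, TOKEN_TAG, comm, req);
}

/* Call at convenient points; returns 1 once the cached records have
 * been written, 0 if the token has not arrived yet. */
int try_flush(int ncid, int varid, const double *cached, int ncached,
              long long *token, MPI_Request *req, int next_rank,
              MPI_Comm comm)
{
    int arrived = 0;
    MPI_Offset start[1], count[1];

    MPI_Test(req, &arrived, MPI_STATUS_IGNORE);
    if (!arrived) return 0;

    /* Forward the token right away, bumped by our record count, so the
     * next process doesn't have to wait for our I/O to finish. */
    long long next = *token + ncached;
    MPI_Send(&next, 1, MPI_LONG_LONG, next_rank, TOKEN_TAG, comm);

    /* Independent write into the slots we just reserved. */
    start[0] = (MPI_Offset) *token;
    count[0] = ncached;
    ncmpi_put_vara_double(ncid, varid, start, count, cached);
    return 1;
}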

The collective I/O approach is going to get better performance, especially 
at scale.

There is no way to accurately get the length of a dimension while in
independent data mode, because of the rules for use of that function
(which I'll discuss in response to your next email).

I am happy to further discuss this with you if it would help.  I realize 
that the solutions that I have proposed require additional work, and that 
it would be nice if the I/O API just did this stuff for you, but it's just 
not as easy as that.  I do think that we can come up with a good solution.

Regards,

Rob



