collective i/o on zero-dimensional data

Wei-keng Liao wkliao at ece.northwestern.edu
Tue Oct 5 12:07:47 CDT 2010


Hi, Max,

If fsync is too expensive (it usually is), I would suggest using your option (2),
collective mode only. Another option is to use the non-blocking APIs for the 0D
variables. Non-blocking APIs can be called in either collective or independent data
mode. If you choose this option, please call ncmpi_wait_all(), which uses MPI
collective I/O to write all 0D variables at once.
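
For example, here is a rough sketch in C of the non-blocking approach (the variable
IDs and values below are placeholders for your own 0D variables; error checking is
omitted):

    #include <pnetcdf.h>

    /* Rough sketch of the non-blocking approach.  varid_a and varid_b stand
     * in for your own 0D variables; error checking is omitted. */
    void write_0d_scalars(int ncid, int varid_a, int varid_b, double a, double b)
    {
        int req[2], st[2];

        /* post the writes; nothing reaches the file yet */
        ncmpi_iput_var_double(ncid, varid_a, &a, &req[0]);
        ncmpi_iput_var_double(ncid, varid_b, &b, &req[1]);

        /* a single collective call flushes all pending requests at once */
        ncmpi_wait_all(ncid, 2, req, st);
    }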

As for declaring a 0D variable vs. a 1D array of length 1, the 1D-of-length-1
approach ensures that the desired data is committed to the file (from the right
process). With the 0D-variable approach, a "competition" among processes can occur
(depending on the MPI-IO implementation); in ROMIO, as shipped with mpich2-1.2.1p1,
it does occur.
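
For example, a sketch of the 1D-of-length-1 approach (the dimension and variable
names below are only illustrative; error checking is omitted), assuming the variable
was defined in define mode as a 1-D array over a dimension of length 1:

    #include <mpi.h>
    #include <pnetcdf.h>

    /* In define mode (illustrative names):
     *   ncmpi_def_dim(ncid, "one", 1, &dimid);
     *   ncmpi_def_var(ncid, "my_scalar", NC_DOUBLE, 1, &dimid, &varid);
     */
    void write_scalar_collectively(int ncid, int varid, double value)
    {
        MPI_Offset start[1] = {0};
        MPI_Offset count[1];
        int rank;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* only rank 0 contributes one element; the other ranks pass a
         * zero-length count but still make the collective call */
        count[0] = (rank == 0) ? 1 : 0;

        ncmpi_put_vara_double_all(ncid, varid, start, count, &value);
    }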

For performance, there should not be much difference between the two approaches,
because the write request is so small either way.


Wei-keng

On Oct 5, 2010, at 10:13 AM, Maxwell Kelley wrote:

> 
> Hi Wei-keng,
> 
> Thanks for your guidance.
> 
> To answer your question, the number of 0D quantities in my checkpoint file is small compared to the number of distributed and nondistributed arrays, but the fsync is so fatally expensive that I either have to (1) group the 0D independent-mode writes or (2) do 0D writes in collective mode.  Since the 0D writes are made by a number of different components of the code, option 1 would be too disruptive to the program design.
> 
> Do you think that declaring 0D variables as 1D arrays of length 1 and setting the count vector will be more efficient than leaving them as 0D? Is there a "competition" among processes in the 0D collective-write case?
> 
> -Max
> 
> On Mon, 4 Oct 2010, Wei-keng Liao wrote:
> 
>> Hi, Max,
>> 
>> Switching between collective and independent data modes is expensive, because fsync will be called each time the mode switches. So, grouping writes together to reduce the number of switches is a good strategy.
>> 
>> As for writing 0D variables, pnetcdf will ignore both arguments start[] and count[] and always let the calling process write one value (1 element) to the variable.
>> 
>> So, if the call is collective and the writing processes have different values to write, then the outcome in the file is undefined (usually the last process wins, but there is no way to know which process that is). One solution is to define the variable as a 1-D array of length 1 and set argument count[0] to zero on every process except the one whose data you want written to the file.
>> 
>> As for recommending collective or independent I/O for 0D variables, it depends on your I/O pattern. Do you have a lot of 0D variables? Are they overwritten frequently and by different processes? Please note that good I/O performance usually comes from requests that are large and contiguous.
>> 
>> Using independent mode for all data can hurt performance for the "distributed" arrays, as the independent APIs may produce many small, noncontiguous requests to the file system.
>> 
>> Wei-keng
>> 
>> On Oct 4, 2010, at 6:42 PM, Maxwell Kelley wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> Some code I ported from a GPFS to a Lustre machine was hit by the performance effects of switching back and forth between collective mode for distributed data and independent mode for non-distributed data. Converting the writes of non-distributed data like zero-dimensional (0D) variables to collective mode was straightforward, but with a small wrinkle. Since the start/count vectors passed to put_vara_double_all cannot be used to indicate which process possesses the definitive value of a 0D variable, I could only get correct results by ensuring that this datum is identical on all processes. Can I count on put_vara_double_all always behaving this way, or could future library versions refuse to write 0D data in collective mode? BTW the return code did not indicate an error when process-varying 0D data was passed to put_vara_double_all.
>>> 
>>> Grouping independent-mode writes could reduce the number of switches between collective and independent mode, but would require significant code reorganization, so I tried the all-collective option first. I could also declare 0D variables as 1D arrays of length 1.
>>> 
>>> Before going any further, I should also ask about the recommended method for writing a 0D variable.  Collective I/O?  Or independent I/O with system-specific MPI hints (I haven't explored the MPI hints)?  Or should I use independent mode for all data, including the distributed arrays?
>>> 
>>> -Max
>>> 
>>> 
>> 
>> 
>> 
> 


