Collective I/O on zero-dimensional data
Rob Ross
rross at mcs.anl.gov
Tue Oct 5 12:58:44 CDT 2010
I'll second the non-blocking API as a good option. -- Rob
On Oct 5, 2010, at 12:07 PM, Wei-keng Liao wrote:
> Hi, Max,
>
> If fsync is too expensive (it usually is), I would suggest using your
> option (2), collective mode only. Another option is to use the
> non-blocking APIs for the 0D variables. The non-blocking APIs can be
> called in either collective or independent mode. If you choose this
> option, please call ncmpi_wait_all(), which uses MPI collective I/O to
> write all 0D variables at once.
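>
> A minimal sketch of that non-blocking approach in C might look like the
> following (ncid, the varids, rank, and the scalar values are just
> placeholders, and error checking is omitted):
>
>     /* Rank 0 posts iput requests for its 0D variables; every rank then
>      * calls the collective ncmpi_wait_all() to flush them together. */
>     int nreq = 0, reqs[2], statuses[2];
>     double t0 = 0.0, dt = 1800.0;   /* example scalar values */
>     if (rank == 0) {
>         ncmpi_iput_var_double(ncid, varid1, &t0, &reqs[nreq++]);
>         ncmpi_iput_var_double(ncid, varid2, &dt, &reqs[nreq++]);
>     }
>     ncmpi_wait_all(ncid, nreq, reqs, statuses); /* collective on all ranks */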
>
> As for declaring a 0D variable vs. a 1D variable of length 1, the
> 1D-of-length-1 approach ensures that the desired data is committed to
> the file (from the right process). If you use the 0D-variable approach,
> the "competition" can occur (depending on the MPI-IO implementation).
> In the ROMIO shipped with mpich2-1.2.1p1, it does occur.
>
> For performance, there should not be much difference between the two
> approaches, because the write requests are so small.
>
>
> Wei-keng
>
> On Oct 5, 2010, at 10:13 AM, Maxwell Kelley wrote:
>
>>
>> Hi Wei-keng,
>>
>> Thanks for your guidance.
>>
>> To answer your question, the number of 0D quantities in my
>> checkpoint file is small compared to the number of distributed and
>> non-distributed arrays, but the fsync is so expensive that I must
>> either (1) group the 0D independent-mode writes or (2) do the 0D
>> writes in collective mode. Since the 0D writes are made by a number
>> of different components of the code, option (1) would be too
>> disruptive to the program design.
>>
>> Do you think that declaring 0D variables as 1D arrays of length 1
>> and setting the count vector will be more efficient than leaving
>> them as 0D? Is there a "competition" among processes in the 0D
>> collective-write case?
>>
>> -Max
>>
>> On Mon, 4 Oct 2010, Wei-keng Liao wrote:
>>
>>> Hi, Max,
>>>
>>> Switching between collective and independent data modes is
>>> expensive, because fsync will be called each time the mode
>>> switches. So, grouping writes together to reduce the number of
>>> switches is a good strategy.
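>>>
>>> For example, a sketch of that grouping in C (ncid, the varids, and
>>> the values are placeholders; error checking is omitted):
>>>
>>>     /* One mode switch for all independent writes, instead of one
>>>      * switch per variable; each begin/end pair triggers a file sync. */
>>>     double a = 0.0, b = 0.0;                 /* placeholder values    */
>>>     ncmpi_begin_indep_data(ncid);            /* leave collective mode */
>>>     if (rank == 0) {
>>>         ncmpi_put_var_double(ncid, varid_a, &a);   /* 0D writes */
>>>         ncmpi_put_var_double(ncid, varid_b, &b);
>>>     }
>>>     ncmpi_end_indep_data(ncid);              /* back to collective mode */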
>>>
>>> As for writing 0D variables, pnetcdf will ignore both the start[]
>>> and count[] arguments and always let the calling process write one
>>> value (1 element) to the variable.
>>>
>>> So, if the call is collective and the writing processes have
>>> different values to write, then the outcome in the file will be
>>> undefined (usually the last process wins, but there is no way to
>>> know which one that is). One solution is to define the variable as
>>> a 1-D array of length 1 and set count[0] to zero on all processes
>>> except the one whose data you want written to the file.
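>>>
>>> A minimal sketch of that workaround in C (ncid, rank, the names, and
>>> the value are placeholders; error checking is omitted):
>>>
>>>     /* Define phase: a 1-D variable of length 1 instead of a scalar. */
>>>     int dimid, varid;
>>>     ncmpi_def_dim(ncid, "one", 1, &dimid);
>>>     ncmpi_def_var(ncid, "my_scalar", NC_DOUBLE, 1, &dimid, &varid);
>>>
>>>     /* Data phase (after ncmpi_enddef): only rank 0 contributes one
>>>      * element; every other rank sets count[0] = 0, so the collective
>>>      * call writes a single, well-defined value. */
>>>     double value = 0.0;                       /* placeholder scalar */
>>>     MPI_Offset start[1] = {0};
>>>     MPI_Offset count[1] = {(rank == 0) ? 1 : 0};
>>>     ncmpi_put_vara_double_all(ncid, varid, start, count, &value);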
>>>
>>> As for recommending collective or independent I/O for 0D
>>> variables, it depends on your I/O pattern. Do you have a lot of 0D
>>> variables? Are they overwritten frequently and by different
>>> processes? Please note that good I/O performance usually comes
>>> from requests that are large and contiguous.
>>>
>>> Using independent mode for all data can hurt performance for the
>>> "distributed" arrays, as the independent APIs may produce many
>>> small, noncontiguous requests to the file system.
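>>>
>>> For contrast, a sketch of a collective write of a block-distributed
>>> 1-D array (dist_varid, local_n, and local_data are placeholders):
>>>
>>>     /* Each rank writes its own contiguous block; the MPI-IO layer
>>>      * can then aggregate the pieces into large requests. */
>>>     int local_n = 100;
>>>     double local_data[100];                  /* this rank's block */
>>>     MPI_Offset start[1] = {(MPI_Offset)rank * local_n};
>>>     MPI_Offset count[1] = {local_n};
>>>     ncmpi_put_vara_double_all(ncid, dist_varid, start, count, local_data);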
>>>
>>> Wei-keng
>>>
>>> On Oct 4, 2010, at 6:42 PM, Maxwell Kelley wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> Some code I ported from a GPFS to a Lustre machine was hit by the
>>>> performance effects of switching back and forth between
>>>> collective mode for distributed data and independent mode for non-
>>>> distributed data. Converting the writes of non-distributed data
>>>> like zero-dimensional (0D) variables to collective mode was
>>>> straightforward, but with a small wrinkle. Since the start/count
>>>> vectors passed to put_vara_double_all cannot be used to indicate
>>>> which process possesses the definitive value of a 0D variable, I
>>>> could only get correct results by ensuring that this datum is
>>>> identical on all processes. Can I count on put_vara_double_all
>>>> always behaving this way, or could future library versions refuse
>>>> to write 0D data in collective mode? BTW the return code did not
>>>> indicate an error when process-varying 0D data was passed to
>>>> put_vara_double_all.
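>>>>
>>>> The call in question looks roughly like this (a sketch; the names
>>>> are placeholders); since the variable is 0D, start/count carry no
>>>> information about which rank's value should win:
>>>>
>>>>     double value = 0.0;   /* each rank's local copy of the scalar */
>>>>     MPI_Offset start[1] = {0}, count[1] = {1};  /* dummies for a 0D var */
>>>>     ncmpi_put_vara_double_all(ncid, scalar_varid, start, count, &value);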
>>>>
>>>> Grouping independent-mode writes could reduce the number of
>>>> switches between collective and independent mode, but would
>>>> require significant code reorganization, so I tried the all-
>>>> collective option first. I could also declare 0D variables as 1D
>>>> arrays of length 1.
>>>>
>>>> Before going any further, I should also ask about the recommended
>>>> method for writing a 0D variable. Collective I/O? Or
>>>> independent I/O with system-specific MPI hints (I haven't
>>>> explored the MPI hints)? Or should I use independent mode for
>>>> all data, including the distributed arrays?
>>>>
>>>> -Max