collective i/o on zero-dimensional data

Rob Ross rross at mcs.anl.gov
Tue Oct 5 12:58:44 CDT 2010


I'll second the non-blocking API as a good option. -- Rob

On Oct 5, 2010, at 12:07 PM, Wei-keng Liao wrote:

> Hi, Max,
>
> If fsync is too expensive (it usually is), I would suggest using your
> option (2), collective mode only. Another option is to use the
> non-blocking APIs for the 0D variables. The non-blocking APIs can be
> called in either collective or independent mode. If you choose this
> option, please call ncmpi_wait_all(), which calls MPI collective I/O
> to write all the 0D variables at once.
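>
> A minimal sketch of that non-blocking approach (the names scalar1_id,
> scalar2_id, and the values are placeholders here, and error checking
> is omitted) could look like:
>
>     /* post non-blocking writes for the 0D variables; these calls can
>        be made from anywhere, in either collective or independent mode */
>     int reqs[2], statuses[2];
>     double val1 = 1.0, val2 = 2.0;   /* the scalar values to write */
>     ncmpi_iput_var_double(ncid, scalar1_id, &val1, &reqs[0]);
>     ncmpi_iput_var_double(ncid, scalar2_id, &val2, &reqs[1]);
>
>     /* later, flush all pending requests with one collective call */
>     ncmpi_wait_all(ncid, 2, reqs, statuses);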
>
> As for declaring a 0D variable vs. a 1D variable of length 1, the
> 1D-of-length-1 approach can make sure the desired data (from the right
> process) is committed to the file. If you use the 0D-variable approach,
> the "competition" can occur, depending on the MPI-IO implementation; in
> the ROMIO shipped with mpich2-1.2.1p1, it does occur.
>
> For performance, there should not be much difference between the two
> approaches, because the write request size is so small.
>
>
> Wei-keng
>
> On Oct 5, 2010, at 10:13 AM, Maxwell Kelley wrote:
>
>>
>> Hi Wei-keng,
>>
>> Thanks for your guidance.
>>
>> To answer your question, the number of 0D quantities in my
>> checkpoint file is small compared to the number of distributed and
>> nondistributed arrays, but the fsync is so prohibitively expensive
>> that I either have to (1) group the 0D independent-mode writes or
>> (2) do the 0D writes in collective mode.  Since the 0D writes are
>> made by a number of different components of the code, option (1)
>> would be too disruptive to the program design.
>>
>> Do you think that declaring 0D variables as 1D arrays of length 1  
>> and setting the count vector will be more efficient than leaving  
>> them as 0D? Is there a "competition" among processes in the 0D  
>> collective-write case?
>>
>> -Max
>>
>> On Mon, 4 Oct 2010, Wei-keng Liao wrote:
>>
>>> Hi, Max,
>>>
>>> Switching between collective and independent data modes is  
>>> expensive, because fsync will be called each time the mode  
>>> switches. So, grouping writes together to reduce the number of  
>>> switches is a good strategy.
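>>>
>>> For example, grouping could look roughly like this (a sketch only:
>>> ncid, rank, and the variable ids are assumed to be set up elsewhere,
>>> and error checking is omitted):
>>>
>>>     /* switch into independent data mode once ... */
>>>     ncmpi_begin_indep_data(ncid);
>>>
>>>     /* ... do all the independent-mode writes together ... */
>>>     if (rank == 0) {
>>>         ncmpi_put_var_double(ncid, scalar1_id, &val1);
>>>         ncmpi_put_var_double(ncid, scalar2_id, &val2);
>>>     }
>>>
>>>     /* ... then switch back to collective mode once */
>>>     ncmpi_end_indep_data(ncid);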
>>>
>>> As for writing 0D variables, pnetcdf will ignore both arguments  
>>> start[] and count[] and always let the calling process write one  
>>> value (1 element) to the variable.
>>>
>>> So, if the call is collective and the writing processes have
>>> different values to write, then the outcome in the file will be
>>> undefined (usually the last process wins, but there is no way to
>>> know which one is last). One solution is to define the variable as
>>> a 1-D array of length 1 and set argument count[0] to zero on all
>>> processes except the one whose data you want written to the file.
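>>>
>>> A minimal sketch of that workaround (the names dimid, scalar_id, and
>>> value are placeholders, and error checking is omitted) might be:
>>>
>>>     /* in define mode, declare the "0D" quantity as 1-D of length 1 */
>>>     ncmpi_def_dim(ncid, "one", 1, &dimid);
>>>     ncmpi_def_var(ncid, "scalar", NC_DOUBLE, 1, &dimid, &scalar_id);
>>>
>>>     /* in collective data mode, every process makes the call, but only
>>>        rank 0 has count[0] = 1, so only its value reaches the file */
>>>     MPI_Offset start[1] = {0};
>>>     MPI_Offset count[1] = {(rank == 0) ? 1 : 0};
>>>     ncmpi_put_vara_double_all(ncid, scalar_id, start, count, &value);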
>>>
>>> As for recommending collective or independent I/O for 0D
>>> variables, it depends on your I/O pattern. Do you have a lot of 0D
>>> variables? Are they being overwritten frequently and by different
>>> processes? Please note that good I/O performance usually comes
>>> from large, contiguous requests.
>>>
>>> Using independent mode for all data can hurt performance for the
>>> "distributed" arrays, as the independent APIs may produce many
>>> small, noncontiguous requests to the file system.
>>>
>>> Wei-keng
>>>
>>> On Oct 4, 2010, at 6:42 PM, Maxwell Kelley wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> Some code I ported from a GPFS to a Lustre machine was hit by the  
>>>> performance effects of switching back and forth between  
>>>> collective mode for distributed data and independent mode for non- 
>>>> distributed data. Converting the writes of non-distributed data  
>>>> like zero-dimensional (0D) variables to collective mode was  
>>>> straightforward, but with a small wrinkle. Since the start/count  
>>>> vectors passed to put_vara_double_all cannot be used to indicate  
>>>> which process possesses the definitive value of a 0D variable, I  
>>>> could only get correct results by ensuring that this datum is  
>>>> identical on all processes. Can I count on put_vara_double_all  
>>>> always behaving this way, or could future library versions refuse  
>>>> to write 0D data in collective mode? BTW the return code did not  
>>>> indicate an error when process-varying 0D data was passed to  
>>>> put_vara_double_all.
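>>>>
>>>> In rough outline (variable names here are placeholders, and error
>>>> checking is omitted), the call in question looks like:
>>>>
>>>>     /* make the datum identical on all processes, e.g. by broadcast */
>>>>     MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>>>>
>>>>     /* collective write of the 0D variable; start/count appear to be
>>>>        ignored, so every process writes the same replicated value */
>>>>     MPI_Offset start[1] = {0}, count[1] = {1};
>>>>     ncmpi_put_vara_double_all(ncid, scalar_id, start, count, &value);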
>>>>
>>>> Grouping independent-mode writes could reduce the number of  
>>>> switches between collective and independent mode but would  
>>>> require significant code reorganization so I tried the all- 
>>>> collective option first. I could also declare 0D variables as 1D  
>>>> arrays of length 1.
>>>>
>>>> Before going any further, I should also ask about the recommended  
>>>> method for writing a 0D variable.  Collective I/O?  Or  
>>>> independent I/O with system-specific MPI hints (I haven't  
>>>> explored the MPI hints)?  Or should I use independent mode for  
>>>> all data, including the distributed arrays?
>>>>
>>>> -Max
>>>>
>>>>
>>>
>>>
>>>
>>
>


