[mpich-discuss] MPI-IO and (potentially) unnecessary synchronization

burlen burlen.loring at gmail.com
Thu Sep 2 13:53:50 CDT 2010


Thanks for the clarifications. A snarky response is probably deserved; 
my familiarity with the material is limited. The long answer is much 
appreciated.

I don't think I can use MPI_COMM_SELF. MPI-IO is attractive in this case 
because it produces a single shared file. As I understand it, even for 
non-overlapping writes to the same file, to be sure the data in the file 
is correct I will have to make the writes non-concurrent with each other 
using sync-barrier-sync, fully serializing the writes.
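
To be concrete, the kind of thing I have in mind looks roughly like this 
(just a sketch; the file name, buffer size, and offsets are made-up 
placeholders):

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        double buf[N] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one shared file; each rank writes its own non-overlapping region */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf),
                          buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

        /* sync-barrier-sync: flush local writes, wait for every rank,
           then sync again so the file contents are consistent before
           any rank reads data written by another rank */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }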

If I understand you: MPI_File_close is collective to give a chance for 
special optimizations. Just because it's collective doesn't mean it will 
block. But in the current implementation I observed that it does block 
even when collective buffering is disabled. I don't know whether, in the 
case where collective buffering is not used, blocking in close is 
necessary, due to some optimization, or "just the way it is." My 
conjecture is that if sync doesn't need to block then close might not 
need to either. To clarify my interest: on a Lustre system, during a 
large-concurrency, large-size write, I observed a large spread between 
the slowest and fastest writers with collective buffering disabled. For 
the faster writers it would be nice not to lose time waiting at the 
close for the slowest.
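
For concreteness, the sort of per-rank timing I mean looks roughly like 
this (only a sketch: the romio_cb_write hint is one way to disable 
collective buffering in ROMIO, and the file name and sizes are 
placeholders):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        MPI_Info info;
        double t_write, t_close;
        size_t n = 1 << 20;                 /* illustrative element count */
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = calloc(n, sizeof(double));

        /* disable collective buffering for this benchmark */
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "disable");

        MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        t_write = MPI_Wtime();
        MPI_File_write_at(fh, (MPI_Offset)rank * n * sizeof(double),
                          buf, (int)n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        t_write = MPI_Wtime() - t_write;

        t_close = MPI_Wtime();
        MPI_File_close(&fh);                /* fast writers wait here */
        t_close = MPI_Wtime() - t_close;

        printf("rank %d: write %.3f s, close %.3f s\n",
               rank, t_write, t_close);

        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }

The spread then shows up as ranks that report short write times but long 
close times.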

Thanks again
Burlen

Rob Ross wrote:
> Also, for clarification, your interpretation of a collective call as 
> one that must "block the progress of each process until at least all 
> of the processes have entered the call" is incorrect. There is no such 
> constraint.
>
> Rob
>
> On Sep 2, 2010, at 12:41 PM, Rob Ross wrote:
>
>> Hi,
>>
>> The short (perhaps snarky) answer is that that is how the standard is 
>> defined.
>>
>> The longer answer is that this provides an opportunity for caches to 
>> be flushed and data to be aggregated and written prior to closing. 
>> This opportunity isn't taken advantage of very much in current 
>> implementations; however, it might be (for example) the place at 
>> which final cache flushing is performed in an implementation that 
>> performs coordinated caching of write data, even if collective 
>> buffering weren't involved (see A. Nisar's recent work in the area 
>> for an example).
>>
>> If you really don't want any collective behavior, open with 
>> MPI_COMM_SELF.
>>
>> Rob
>>
>> On Sep 2, 2010, at 12:12 PM, burlen wrote:
>>
>>> Could anyone explain why MPI_File_close must be a collective call 
>>> when collective buffering is not used? By collective I mean block 
>>> the progress of each process until at least all of the processes 
>>> have entered the call?
>>>
>>> I realize my first post misunderstands the situation in a number of 
>>> ways. To attempt to correct myself: each process that touches the 
>>> disk must have its own file descriptor somewhere. When collective 
>>> buffering isn't used, to close the file each process would have to 
>>> close its local descriptor. I have noticed that MPI_File_sync is 
>>> documented as a collective function but does not behave like one 
>>> when collective buffering is not used. By this I mean that it 
>>> completes before all processes have entered the call. If 
>>> MPI_File_sync can behave this way, why wouldn't MPI_File_close do 
>>> the same?
>>>
>>> burlen wrote:
>>>> In benchmarks of very large concurrent writes on Lustre, using both 
>>>> the cb and non-cb APIs, I have observed that for the non-cb API, 
>>>> asynchronism during the write can be advantageous, as it tends to 
>>>> reduce congestion and contention. This can increase the throughput. 
>>>> However, in this case the synchronization time that occurs at 
>>>> MPI_File_close is significant for many of the processes, as none 
>>>> return until the slowest process enters the call. This 
>>>> synchronization at close in net effect negates any advantage gained. 
>>>> So I wonder: does MPI_File_close really require a collective 
>>>> implementation? For instance, I could imagine a reference-counting 
>>>> scheme where one process is designated to manage the close 
>>>> operation; the others call MPI_File_sync (which I've observed 
>>>> doesn't block unless it has to), post a non-blocking 0-byte message 
>>>> to the manager rank, and then continue unimpeded. You could perhaps 
>>>> remove the all-to-one communication using some sort of hierarchical 
>>>> structured communication pattern. If I understand correctly, such a 
>>>> scheme wouldn't violate consistency, because if one cares about 
>>>> consistency then a barrier is required anyway.
>>>>
>>>> Have I misunderstood the situation?
>>>>
>>>> Thanks
>>>> Burlen
>>>>
>>>


