[mpich-discuss] MPI-IO and (potentially) unnecessary synchronization

Thu Sep 2 17:40:56 CDT 2010

Thank you for pointing this out, and shame on me for not digging into 
the sources a little.

Would your team consider a feature request to remove the synchronization?

Below I'm including some data from a 48k proc, 11.7 TB write (typical of our 
restart dumps) to justify the request. In this case 75% of the processes 
spend more than 700 sec, and 50% of the processes spend more than 1000 
sec, in the MPI_File_close barrier.

Burlen

MPI-IO without collective buffering
-----------------------------------------------------------
operation         min         lq        avg        med         uq        max
Open(sec)     25.5357    26.3682    26.4883    26.3686    26.3694    27.8542
Write(sec)     2.5544   496.9244  1018.1524   959.6142  1512.3722  2246.9659
BW(GB/sec)     5.3401     7.9339    24.1726    12.5040    24.1465  4697.4345
Close(sec)     0.0046   734.5990  1228.8317  1287.3787  1750.1132  2244.4148
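
A quick sanity check on the table, as a minimal sketch rather than anything taken from the benchmark code: if the time in Close is mostly time spent waiting in the barrier for the slowest writer, then each process's Write time plus its Close time should come out near the slowest Write time. At the level of summary statistics that means pairing min with max and lq with uq, while avg and med pair with themselves. The function names below are made up for illustration, and the process count is a parameter because "48k" could mean 48,000 or 49,152.

```c
#include <assert.h>
#include <stdio.h>

/* Summary statistics copied from the table above, in seconds. */
enum { NSTATS = 6 };
static const char *const stat_name[NSTATS] =
    {"min", "lq", "avg", "med", "uq", "max"};
static const double write_sec[NSTATS] =
    {2.5544, 496.9244, 1018.1524, 959.6142, 1512.3722, 2246.9659};
static const double close_sec[NSTATS] =
    {0.0046, 734.5990, 1228.8317, 1287.3787, 1750.1132, 2244.4148};

/* Complementary statistic: min<->max, lq<->uq; avg and med map to
   themselves. This pairing assumes close = (slowest write) - (own write). */
static int complement(int i)
{
    assert(i >= 0 && i < NSTATS);
    switch (i) {
    case 0: return 5;
    case 1: return 4;
    case 4: return 1;
    case 5: return 0;
    default: return i;
    }
}

/* Write time at statistic i plus Close time at the complementary one. */
double write_plus_close(int i)
{
    return write_sec[i] + close_sec[complement(i)];
}

/* Aggregate time spent in MPI_File_close, in core-hours.
   nprocs is an assumption supplied by the caller, not a reported value. */
double close_core_hours(long nprocs)
{
    assert(nprocs > 0);
    return (double)nprocs * close_sec[2] / 3600.0; /* index 2 is avg */
}

/* Print one line per statistic, then the aggregate cost of the close. */
void print_report(long nprocs)
{
    for (int i = 0; i < NSTATS; ++i)
        printf("write %-3s + close %-3s = %10.4f sec\n",
               stat_name[i], stat_name[complement(i)], write_plus_close(i));
    printf("close, all procs: %.1f core-hours\n", close_core_hours(nprocs));
}
```

Calling print_report(48000) from a main() gives sums of roughly 2247 sec on every line, which is the slowest Write time, so the table is consistent with the Close time being almost entirely barrier wait. With 48,000 processes assumed, the average Close time works out to a little over 16,000 core-hours per dump.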

Rajeev Thakur wrote:
> Burlen,
> MPI_File_close is defined to be collective. It doesn't need to be synchronizing, but the implementation uses a barrier internally for the following reason given in romio/mpi-io/close.c.
>
> /* need a barrier because the file containing the shared file
>         pointer is opened with COMM_SELF. We don't want it to be
>         deleted while others are still accessing it. */
>
> Rajeev
>  
>
>
> On Sep 2, 2010, at 1:53 PM, burlen wrote:
>
>   
>> Thanks for the clarifications. A snarky response is probably deserved. My familiarity with the material is limited. The long answer is much appreciated.
>>
>> I don't think I can use MPI_COMM_SELF. MPI-IO is attractive in this case because it produces a single shared file. As I understand it, even for non-overlapping writes to the same file, to be sure that the data in the file is correct I would have to make the writes non-concurrent with one another using sync-barrier-sync, fully serializing the write.
>>
>> If I understand you: MPI_File_close is collective to give a chance for special optimizations. Just because it's collective doesn't mean it will block. But in the current implementation I observed that it does block, even when collective buffering is disabled. I don't know whether, when collective buffering is not used, blocking in close is necessary, is due to some optimization, or is "just the way it is". My conjecture is that if sync doesn't need to block then close might not need to either. To clarify my interest: on a Lustre system, during large writes at high concurrency, I observed a large spread between the slowest and fastest writers with collective buffering disabled. For the faster writers it would be nice not to lose time waiting at the close for the slowest.
>>
>> Thanks again
>> Burlen
>>
>> Rob Ross wrote:
>>     
>>> Also, for clarification, your interpretation of a collective call as one that must "block the progress of each process until at least all of the processes have entered the call" is incorrect. There is no such constraint.
>>>
>>> Rob
>>>
>>> On Sep 2, 2010, at 12:41 PM, Rob Ross wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> The short (perhaps snarky) answer is that that is how the standard is defined.
>>>>
>>>> The longer answer is that this provides an opportunity for caches to be flushed and data to be aggregated and written prior to closing. This opportunity isn't taken advantage of very much in current implementations; however, it might be (for example) the place at which final cache flushing is performed in an implementation that performs coordinated caching of write data, even if collective buffering weren't involved (see A. Nisar's recent work in the area for an example).
>>>>
>>>> If you really don't want any collective behavior, open with MPI_COMM_SELF.
>>>>
>>>> Rob
>>>>
>>>> On Sep 2, 2010, at 12:12 PM, burlen wrote:
>>>>
>>>>         
>>>>> Could anyone explain why MPI_File_close must be a collective call when collective buffering is not used? By collective I mean that it blocks the progress of each process until at least all of the processes have entered the call.
>>>>>
>>>>> I realize my first post misunderstands the situation in a number of ways. To attempt to correct myself: each process that touches the disk must have its own file descriptor somewhere. When collective buffering isn't used, to close the file each process would have to close its local descriptor. I have noticed that MPI_File_sync is documented as a collective function but does not behave like one when collective buffering is not used. By this I mean that it completes before all processes have entered the call. If MPI_File_sync can behave this way, why wouldn't MPI_File_close do the same?
>>>>>
>>>>> burlen wrote:
>>>>>           
>>>>>> In benchmarks of very large concurrent writes on Lustre using both the cb and non-cb APIs, I have observed that with the non-cb API asynchronism during the write can be advantageous, as it tends to reduce congestion and contention. This can increase the throughput. However, in this case the synchronization time that occurs at MPI_File_close is significant for many of the processes, as none return until the slowest process enters the call. In net effect, this synchronization at close ruins any advantage gained. So I wonder, does MPI_File_close really require a collective implementation? For instance, I could imagine a reference counting scheme where one process is designated to manage the close operation; the others call MPI_File_sync (which I've observed doesn't block unless it has to) and post a non-blocking 0-byte message to the manager rank, then continue unimpeded. You could perhaps remove the all-to-one communication using some sort of hierarchically structured communication pattern. If I understand correctly, such a scheme wouldn't violate consistency, because if one cares about it then a barrier is required anyway.
>>>>>>
>>>>>> Have I misunderstood the situation?
>>>>>>
>>>>>> Thanks
>>>>>> Burlen
>>>>>>
>>>>>>             
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss