[mpich2-dev] More ROMIO performance questions

William Gropp wgropp at uiuc.edu
Wed Sep 16 06:49:54 CDT 2009


The underlying issue here is that the file system is often best placed
to make decisions about scheduling access to the underlying
objects/disks/whatever, and in a perfect world it would properly handle
concurrent requests, providing the necessary back-pressure as required.
(There are some things that the application, in this case the MPI-IO
library, can do as well, of course.)  The world isn't perfect, and so we
need to deal with a number of issues.

So one thing we might consider is to create a taxonomy of file system
foibles and try to create a systematic implementation approach.  It
doesn't mean that we (or anyone) would commit to creating the
implementations, but it could help identify the file system performance
bugs and provide a more general way to address them, rather than
building completely system-specific fixes.

A common misfeature in some file systems (including some that are
improperly described as supporting parallel I/O) is poor performance,
or even incorrect behavior, unless a specific access pattern is
followed (e.g., due to NFS client caching or whole-file lock
contention).  Providing a clear taxonomy, along with tests that expose
the correctness and performance bugs in these systems, would go a long
way toward improving the state of I/O on them.
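
As a concrete illustration of the kind of test such a taxonomy might
include, here is a minimal sketch (not taken from any existing suite;
the file name and block size are arbitrary) that probes the classic
misbehavior: each rank writes a disjoint, interleaved block, then every
rank reads the whole file back and verifies it.  Systems with broken
client-side caching fail the verification; systems with whole-file
lock contention pass it but serialize badly, which shows up in the
timing.

    /* Hypothetical consistency probe; file name and BLOCK are
     * illustrative choices, not part of any test suite. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK 65536

    int main(int argc, char **argv)
    {
        int rank, nprocs, errs = 0;
        char *wbuf, *rbuf;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        wbuf = malloc(BLOCK);
        rbuf = malloc((size_t)BLOCK * nprocs);
        memset(wbuf, 'a' + (rank % 26), BLOCK);

        MPI_File_open(MPI_COMM_WORLD, "probe.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR,
                      MPI_INFO_NULL, &fh);

        /* Disjoint, interleaved writes: rank i owns bytes
         * [i*BLOCK, (i+1)*BLOCK). */
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, wbuf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        /* Standard MPI-IO sync-barrier-sync so the writes are
         * visible to every process before the read-back. */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);

        /* Every rank reads and verifies the entire file. */
        MPI_File_read_at(fh, 0, rbuf, BLOCK * nprocs, MPI_BYTE,
                         MPI_STATUS_IGNORE);
        for (int i = 0; i < nprocs; i++)
            for (int j = 0; j < BLOCK; j++)
                if (rbuf[(size_t)i * BLOCK + j] != 'a' + (i % 26))
                    errs++;

        if (errs)
            printf("rank %d: %d bad bytes\n", rank, errs);

        MPI_File_close(&fh);
        free(wbuf);
        free(rbuf);
        MPI_Finalize();
        return errs != 0;
    }

Timing the write phase separately would expose the lock-contention
case even when the data comes back correct.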

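As a footnote to the hint discussion quoted below: all of those knobs
are passed through the standard MPI_Info mechanism at open time.  A
minimal sketch (the values shown are placeholders, not tuning advice):

    /* Sketch: passing the ROMIO/BG hints from the thread below.
     * The values are illustrative, not recommendations. */
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "bgl_nodes_pset", "8");       /* aggregators per pset */
    MPI_Info_set(info, "cb_buffer_size", "4194304"); /* collective buffer */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
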
Bill

Rob Ross wrote:
> On Sep 15, 2009, at 2:58 PM, Bob Cernohous wrote:
>
>   
>> Rob Latham wrote on 09/15/2009 02:39:30 PM:
>>
>>     
>>> Isn't that what 'bgl_nodes_pset' is supposed to address?  If you are
>>> i/o rich or i/o poor, aggregate down to 'bgl_nodes_pset' aggregators
>>> per io node.  There are tons of things ROMIO can do in collective I/O.
>>
>> Yes, exactly.  When we finally got bgl_nodes_pset and cb_buffer_size  
>> hinted right, they got reasonable performance.  But not as good as  
>> their customized testcase.  I'm still looking into this a bit.
>>
>>     
>>> If you pass around a token in the MPI layer, you can easily starve
>>> processes.  Independent I/O means there's no guarantee when any
>>> process will be in that call.  So, do you have rank 1 give up the
>>> token after exiting MPI_FILE_WRITE?  Who does he pass to?  Will they
>>> be in an MPI call and able to make progress on the receive? Do you
>>> have rank 4 take the token from someone when he's ready to do
>>> I/O?
>>>       
>> Our thought was to do this within collective i/o.  At some point,  
>> instead of collecting/moving large contiguous buffers and writing at  
>> the aggregator -- pass around the token and write at each node in  
>> the set.  Either way, data is written cb_block_size at a time.  It  
>> saves passing cb_buffer_size worth of data around.  This is different  
>> from romio_cb_write=automatic because I don't want large contiguous  
>> buffers to switch back completely to independent writes.  Maybe  
>> romio_cb_write=coordinated :)
>>     
>
> This (coordinated) seems like a nice way to get the advantages of  
> aggregation without communication overheads...
>
> Rob
>   
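
To make the "coordinated" idea above concrete, here is one shape the
token pass could take inside the collective write path.  This is purely
a sketch: the ring order, the message tag, and the assumption that each
rank writes its own cb_block_size-sized chunk are illustrative choices,
not a description of anything implemented.

    /* Sketch of a coordinated collective write: instead of shipping
     * data to an aggregator, a token circulates through the ranks and
     * each rank writes its own chunk while it holds the token, so at
     * most one request per group is outstanding at a time. */
    static void coordinated_write(MPI_Comm comm, MPI_File fh,
                                  MPI_Offset my_off,
                                  const void *buf, int count)
    {
        int rank, nprocs, token = 0;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        /* Wait for the token from the previous rank in the ring. */
        if (rank > 0)
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);

        /* Only the token holder touches the file system. */
        MPI_File_write_at(fh, my_off, buf, count, MPI_BYTE,
                          MPI_STATUS_IGNORE);

        /* Pass the token on. */
        if (rank < nprocs - 1)
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, comm);

        /* Everyone leaves together, keeping collective semantics. */
        MPI_Barrier(comm);
    }

Because every rank is already inside the collective call, the
starvation concern raised above for independent I/O doesn't arise; the
price is the serialization itself.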
