[mpich2-dev] More ROMIO performance questions
William Gropp
wgropp at uiuc.edu
Wed Sep 16 06:49:54 CDT 2009
The underlying issue here is that the file system is often best-placed
to make decisions about scheduling access to the underlying
objects/disks/whatever, and in a perfect world, would properly handle
concurrent requests, providing the necessary back-pressure as required.
(There are some things that the application, in this case the MPI-IO
library, can do as well, of course.) The world isn't perfect, and so we
need to deal with a number of issues. One thing we might consider is
to create a taxonomy of file system foibles and try to develop a
systematic implementation approach. It doesn't mean that we (or anyone)
would commit to creating the implementations, but it could help identify
the file system performance bugs and provide a more general way to
address them, rather than building completely system-specific fixes. A
common misfeature in some file systems (including some that are
improperly described as supporting parallel I/O) is poor performance, or
even outright incorrect behavior, unless a specific access pattern is
followed (e.g., due to NFS client caching or whole-file lock contention).
Providing a clear taxonomy, along with tests that would expose the
correctness and performance bugs in these systems, would go a long way
toward improving the state of I/O on these systems.
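To make that concrete, here is a rough sketch of the kind of probe such a
test suite might contain (the file name, block size, and iteration count
are arbitrary choices, not anything we've agreed on): every rank writes an
interleaved, block-cyclic pattern with independent I/O and then reads it
back. File systems that only behave well for contiguous, non-overlapping
accesses, for instance because of client-side caching or whole-file locks,
tend to show either poor throughput or verification failures here.

/* Sketch of an access-pattern probe: interleaved independent writes,
 * then read-back verification. Illustrative values only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 65536   /* bytes per rank per iteration (arbitrary) */
#define ITERS 16

int main(int argc, char **argv)
{
    int rank, nprocs, i, errs = 0;
    char *wbuf, *rbuf;
    MPI_File fh;
    MPI_Offset off;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    wbuf = malloc(BLOCK);
    rbuf = malloc(BLOCK);
    memset(wbuf, 'a' + (rank % 26), BLOCK);

    MPI_File_open(MPI_COMM_WORLD, "probe.out",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* interleaved writes: rank r owns blocks r, r+nprocs, r+2*nprocs, ... */
    for (i = 0; i < ITERS; i++) {
        off = (MPI_Offset)(i * nprocs + rank) * BLOCK;
        MPI_File_write_at(fh, off, wbuf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    }
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);

    /* read back and verify what this rank wrote */
    for (i = 0; i < ITERS; i++) {
        off = (MPI_Offset)(i * nprocs + rank) * BLOCK;
        MPI_File_read_at(fh, off, rbuf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
        if (memcmp(wbuf, rbuf, BLOCK) != 0)
            errs++;
    }
    if (errs)
        printf("rank %d: %d corrupted blocks\n", rank, errs);

    MPI_File_close(&fh);
    free(wbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
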
Bill
Rob Ross wrote:
> On Sep 15, 2009, at 2:58 PM, Bob Cernohous wrote:
>
>
>> Rob Latham wrote on 09/15/2009 02:39:30 PM:
>>
>>
>>> Isn't that what 'bgl_nodes_pset' is supposed to address? If you are
>>> i/o rich or i/o poor, aggregate down to 'bgl_nodes_pset' aggregators
>>> per io node. There are tons of things ROMIO can do in collective
>>> I/O.
>>
>> Yes, exactly. When we finally got bgl_nodes_pset and cb_buffer_size
>> hinted right, they got reasonable performance. But not as good as
>> their customized testcase. I'm still looking into this a bit.
>>
>>
>>> If you pass around a token in the MPI layer, you can easily starve
>>> processes. Independent I/O means there's no guarantee when any
>>> process will be in that call. So, do you have rank 1 give up the
>>> token after exiting MPI_FILE_WRITE? Who does he pass to? Will they
>>> be in an MPI call and able to make progress on the receive? Do you
>>> have rank 4 take the token from someone when he's ready to do
>>> I/O?
>>>
>> Our thought was to do this within collective I/O. At some point,
>> instead of collecting/moving large contiguous buffers and writing at
>> the aggregator -- pass around the token and write at each node in
>> the set. Either way, data is written cb_block_size at a time. It
>> saves passing cb_buffer_size around. This is different than
>> romio_cb_write=automatic because I don't want large contiguous
>> buffers to switch back completely to independent writes. Maybe
>> romio_cb_write=coordinated :)
>>
>
> This (coordinated) seems like a nice way to get the advantages of
> aggregation without communication overheads...
>
> Rob
>
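To sketch the "coordinated" idea from the discussion above at user level
(to be clear, this is not ROMIO code, a romio_cb_write=coordinated hint
does not exist today, and the hint values, buffer size, and file name
below are purely illustrative): each rank waits for a token from its
predecessor, writes its own block in place, and passes the token on, so
only one writer is active at a time.

/* User-level sketch of token-passing ("coordinated") writes.
 * In the real proposal the token would circulate per I/O-node group
 * (pset) inside the collective write path, not across COMM_WORLD. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE (4 * 1024 * 1024)  /* stands in for cb_buffer_size */

int main(int argc, char **argv)
{
    int rank, nprocs, token = 0;
    char *buf;
    MPI_File fh;
    MPI_Info info;
    MPI_Offset off;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* the hints discussed above; values are placeholders */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");
    MPI_Info_set(info, "bgl_nodes_pset", "8");  /* BG-specific, ignored elsewhere */

    MPI_File_open(MPI_COMM_WORLD, "coord.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);

    buf = malloc(BUFSIZE);
    memset(buf, 'a' + (rank % 26), BUFSIZE);
    off = (MPI_Offset)rank * BUFSIZE;

    /* wait for the token from the previous rank, write, pass it on */
    if (rank > 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_File_write_at(fh, off, buf, BUFSIZE, MPI_BYTE, MPI_STATUS_IGNORE);

    if (rank < nprocs - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

Doing this inside the collective write path, as Bob suggests, would keep
the collective call's synchronization and so avoid the starvation concerns
that come with passing a token around independent I/O.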