[mpich2-dev] More ROMIO performance questions

Bob Cernohous bobc at us.ibm.com
Mon Sep 14 17:04:28 CDT 2009


I'm not sure my explanation was very good.  Here's a note I just received 
on the topic.
----------------
What we have shown is that, to obtain high-performance I/O, the writers
need to be scheduled.  This is perhaps of more importance on Blue Gene than
on many other platforms, but I'd expect there are other set-ups where it is
important to limit the number of concurrent data streams to/from the I/O
system.  The MPI-IO collective mode is the "right" interface for this,
since it is the API call that has the necessary amount of information.

For BG (but really for any GPFS-based cluster) we'd like to point out that
the collective write can also provide the scheduling that is explicitly
coded in our example code.  It should be possible to use the bgl_nodes_pset
hint in this context as well, to set the number of simultaneous writers per
pset (when romio_cb_{read,write} is set to automatic).  By limiting the
number of concurrent I/O streams this way, I'd guess that you may be able
to utilize the cache hierarchy (GPFS client buffers, NSD server buffers and
backend storage caches) more efficiently.

But this is more of a suggestion for future improvements and may be better 
addressed to the ROMIO community.
----------------

I'm wondering if the ROMIO community has already considered this in some 
way.
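
For concreteness, the configuration the note suggests would presumably be
expressed through the standard MPI info-hint mechanism, roughly as in the
sketch below.  This is only a sketch: the hint names are taken from the
note, bgl_nodes_pset is Blue Gene-specific and its exact value semantics
should be checked against the ROMIO driver in use, and the file name,
block size and the value "8" are made up.

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        static char buf[1 << 20];   /* one contiguous 1 MiB block per rank */
        MPI_File fh;
        MPI_Info info;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank & 0xff, sizeof(buf));

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "automatic");
        MPI_Info_set(info, "romio_cb_read", "automatic");
        MPI_Info_set(info, "bgl_nodes_pset", "8"); /* writers per pset (assumed) */

        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Non-interleaved contiguous blocks: rank i writes block i.  The
         * collective call is what gives ROMIO the global information it
         * needs to schedule (or skip) aggregation. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf), buf,
                              sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }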

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.




Bob Cernohous/Rochester/IBM at IBMUS 
Sent by: mpich2-dev-bounces at mcs.anl.gov
09/14/2009 04:57 PM
Please respond to
mpich2-dev at mcs.anl.gov


To
mpich2-dev at mcs.anl.gov
cc

Subject
[mpich2-dev] More ROMIO performance questions







We have another i/o scenario with interesting performance issues. 

Once again, it's large non-interleaved contiguous blocks being written/read
(checkpointing software).  We ran into the same problems with data sieving
and romio_cb_write/read = enable as we discussed a couple of weeks ago.

We tried to tune it with hints for cb_block_size and get ok performance 
when we can avoid read/write data sieving. 
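
Roughly, that kind of hint setup looks like the sketch below.  A sketch
only: cb_block_size is simply the name used here and may be
platform-specific (unrecognized hints are ignored), romio_ds_{read,write}
are the standard ROMIO data-sieving controls and whether they cover the
sieving we hit depends on the access pattern, and the 16 MiB value is
only illustrative.

    #include <mpi.h>

    /* Build an info object with the kind of checkpoint-write hints
     * discussed in this thread; pass it to MPI_File_open or
     * MPI_File_set_info. */
    static MPI_Info make_ckpt_hints(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");
        MPI_Info_set(info, "romio_cb_read", "enable");
        MPI_Info_set(info, "cb_block_size", "16777216");
        MPI_Info_set(info, "romio_ds_write", "disable"); /* avoid data sieving */
        MPI_Info_set(info, "romio_ds_read", "disable");
        return info;
    }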

Trying romio_cb_write/read = automatic gets very poor performance.
Similarly, pure non-collective writes get very poor performance.  It seems
like having too many concurrent writers/readers performs poorly on their
configuration ... so

They customized the testcase to coordinate/flow-control the non-collective
I/O and they get great performance.  They only have N simultaneous
writers/readers active: they pass a small token around and take turns
writing the large, non-interleaved contiguous data blocks.  It's almost
like having N aggregators, but without the collective I/O overhead of
passing the data around.
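
Roughly, that kind of token-passing scheme could look like the sketch
below.  This is only a sketch, not their actual testcase: the chain
layout, block size and writer count are made up, and the writes are plain
independent MPI_File_write_at calls at per-rank offsets.

    #include <mpi.h>

    #define NWRITERS 8          /* at most N ranks writing at any time */
    #define BLOCK    (1 << 20)  /* size of each contiguous block       */

    /* Ranks are grouped into NWRITERS chains; within a chain each rank
     * waits for a token from its predecessor, does its independent write,
     * then hands the token to its successor. */
    static void token_write(MPI_File fh, char *buf, int rank, int nprocs)
    {
        int chain = rank % NWRITERS;   /* chain index, also used as the tag */
        int prev  = rank - NWRITERS;   /* predecessor in the same chain     */
        int next  = rank + NWRITERS;   /* successor in the same chain       */
        int token = 0;

        if (prev >= 0)                 /* wait for my turn */
            MPI_Recv(&token, 1, MPI_INT, prev, chain, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        /* Large, non-interleaved contiguous block at a per-rank offset. */
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        if (next < nprocs)             /* hand the token on */
            MPI_Send(&token, 1, MPI_INT, next, chain, MPI_COMM_WORLD);
    }

Ranks 0 through NWRITERS-1 start immediately, so at most NWRITERS writers
are ever active, and no file data moves between ranks, only the small
token.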

I'm not aware of anything in MPI-IO or ROMIO that would do this.  Has this
been explored by the experts (meaning you guys)?



Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.