<br><font size=2 face="sans-serif">I'm not sure my explanation was very

good. &nbsp;Here's a note I just received on the topic.</font>

<br><font size=2 face="sans-serif">----------------</font>

<br><font size=2 face="sans-serif">What we have shown is that obtaining
high-performance I/O requires scheduling the writers. This is perhaps more
important on Blue Gene than on many other platforms, but I'd expect there are
other setups where it is important to limit the number of concurrent data
streams to/from the IO system. The MPIIO collective mode is the
&quot;right&quot; interface for this, since it is the API call that has the
necessary information.</font>

<br>

<br><font size=2 face="sans-serif">For BG (but really for any GPFS-based
cluster) we'd like to point out that the collective write can also provide
the scheduling that is explicitly coded in our example code. It should be
possible to use the bgl_nodes_pset hint in this context too, to set the number
of simultaneous writers per pset (when romio_cb_{read,write} is set to
automatic). By limiting the number of concurrent IO streams this way, I'd
guess that you may be able to utilize the cache hierarchy (GPFS client
buffers, NSD server buffers and backend storage caches) more efficiently.</font>

<br>

<br><font size=2 face="sans-serif">But this is more of a suggestion for

future improvements and may be better addressed to the ROMIO community.</font>

<br><font size=2 face="sans-serif">----------------</font>

<br>

<br><font size=2 face="sans-serif">I'm wondering if the ROMIO community

has already considered this in some way.</font>

<br><font size=2 face="sans-serif"><br>

Bob Cernohous: &nbsp;(T/L 553) 507-253-6093<br>

<br>

BobC@us.ibm.com<br>

IBM Rochester, Building 030-2(C335), Department 61L<br>

3605 Hwy 52 North, Rochester, &nbsp;MN 55901-7829<br>

<br>

&gt; Chaos reigns within.<br>

&gt; Reflect, repent, and reboot.<br>

&gt; Order shall return.<br>

</font>

<br>

<br>

<br>

<table width=100%>

<tr valign=top>

<td width=40%><font size=1 face="sans-serif"><b>Bob Cernohous/Rochester/IBM@IBMUS</b>

</font>

<br><font size=1 face="sans-serif">Sent by: mpich2-dev-bounces@mcs.anl.gov</font>

<p><font size=1 face="sans-serif">09/14/2009 04:57 PM</font>

<table border>

<tr valign=top>

<td bgcolor=white>

<div align=center><font size=1 face="sans-serif">Please respond to<br>

mpich2-dev@mcs.anl.gov</font></div></table>

<br>

<td width=59%>

<table width=100%>

<tr valign=top>

<td>

<div align=right><font size=1 face="sans-serif">To</font></div>

<td><font size=1 face="sans-serif">mpich2-dev@mcs.anl.gov</font>

<tr valign=top>

<td>

<div align=right><font size=1 face="sans-serif">cc</font></div>

<td>

<tr valign=top>

<td>

<div align=right><font size=1 face="sans-serif">Subject</font></div>

<td><font size=1 face="sans-serif">[mpich2-dev] More ROMIO performance

questions</font></table>

<br>


<br></table>

<br>

<br>

<br><font size=2 face="sans-serif"><br>

We have another i/o scenario with interesting performance issues.</font><font size=3>

<br>

</font><font size=2 face="sans-serif"><br>

Once again, it's large non-interleaved contiguous blocks being written/read

(checkpointing software). &nbsp;We ran into the same problems with data

sieving and romio_cb_write/read = enable as we discussed a couple weeks

ago.</font><font size=3> <br>

</font><font size=2 face="sans-serif"><br>

We tried to tune it with hints for cb_block_size and get OK performance

when we can avoid read/write data sieving.</font><font size=3> <br>

</font><font size=2 face="sans-serif"><br>

Trying romio_cb_write/read = automatic gets very poor performance. &nbsp;Similarly,

pure non-collective writes get very poor performance. &nbsp;It seems like

having too many writers/readers performs poorly on their configuration

... so</font><font size=3> <br>

</font><font size=2 face="sans-serif"><br>

They customized the testcase to coordinate/flow-control the non-collective

i/o and they get great performance. &nbsp; They only have N simultaneous

writers/readers active. &nbsp;They pass a token around and take turns.

&nbsp;It's almost like having N aggregators but without the collective

i/o overhead to pass the data around. &nbsp;Instead they pass a small token

and take turns writing the large, non-interleaved contiguous data blocks.</font><font size=3>

<br>
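<font size=2 face="sans-serif">The token-passing scheme described above can be sketched as a small simulation. This is not the customer's testcase, which is not shown here. It is a minimal Python sketch in which threads stand in for MPI ranks, a queue per rank stands in for the small token message, and a shared byte buffer stands in for the file. The names run_token_ring, num_ranks, max_writers and block_size are hypothetical. The assumed rule is that the first N ranks start with a token and each rank, after writing, hands its token to rank + N, so no more than N writers are ever active.</font>

```python
import queue
import threading


def run_token_ring(num_ranks, max_writers, block_size):
    """Simulate flow-controlled independent writes.

    Threads play the role of MPI ranks (an assumption for illustration).
    Returns (file_bytes, peak), where peak is the largest number of
    writers seen inside the write section at the same time.
    """
    shared_file = bytearray(num_ranks * block_size)
    # One inbox per rank; a rank may write only after a token arrives.
    inbox = [queue.Queue() for _ in range(num_ranks)]
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def rank_main(rank):
        inbox[rank].get()  # block until this rank holds a token
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        # Write one large, contiguous, non-interleaved block at this
        # rank's own offset (stand-in for an independent file write).
        offset = rank * block_size
        shared_file[offset:offset + block_size] = bytes([rank % 256]) * block_size
        with lock:
            state["active"] -= 1
        # Hand the token to the next rank in this token's chain, if any.
        successor = rank + max_writers
        if successor in range(num_ranks):
            inbox[successor].put("token")

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    # Seed one token for each of the first max_writers ranks.
    for r in range(min(max_writers, num_ranks)):
        inbox[r].put("token")
    for t in threads:
        t.join()
    return bytes(shared_file), state["peak"]
```

<font size=2 face="sans-serif">Calling run_token_ring(8, 2, 4) fills all eight blocks while never allowing more than two writers at once. In a real MPI program the token would presumably be a tiny point-to-point message (MPI_Send / MPI_Recv) and the write an independent call such as MPI_File_write_at, but that mapping is an assumption, not something taken from the testcase.</font>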

</font><font size=2 face="sans-serif"><br>

I'm not aware of anything in MPIIO or ROMIO that would do this. &nbsp;
Has this been explored by the experts (meaning you guys)?</font><font size=3>

<br>

<br>

</font><font size=2 face="sans-serif"><br>

<br>

</font>

<br>