Rob Latham wrote on 09/15/2009 02:39:30 PM:

> Isn't that what 'bgl_nodes_pset' is supposed to address? If you are
> i/o rich or i/o poor, aggregate down to 'bgl_nodes_pset' aggregators
> per io node. There are tons of things ROMIO can do in collective I/O.

Yes, exactly. When we finally got bgl_nodes_pset and cb_buffer_size
hinted right, they got reasonable performance, but not as good as their
customized test case. I'm still looking into this a bit.
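
For reference, this is roughly how those hints get passed in. A minimal
sketch only: the values here are placeholders, not the ones we actually
tuned to.

#include <mpi.h>

int open_with_hints(MPI_Comm comm, const char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* aggregators per I/O node on BG/L (placeholder value) */
    MPI_Info_set(info, "bgl_nodes_pset", "8");
    /* collective buffer size per aggregator (placeholder: 4 MB) */
    MPI_Info_set(info, "cb_buffer_size", "4194304");

    int rc = MPI_File_open(comm, (char *)path,
                           MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
    return rc;
}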

> If you pass around a token in the MPI layer, you can easily starve
> processes. Independent I/O means there's no guarantee when any
> process will be in that call. so, do you have rank 1 give up the
> token after exiting MPI_FILE_WRITE? Who does he pass to? Will they
> be in an MPI call and able to make progress on the receive? Do you
> have rank 4 take the token from someone when he's ready to do
> I/O?

Our thought was to do this within collective I/O. At some point, instead
of collecting/moving large contiguous buffers and writing at the
aggregator, pass the token around and write at each node in the set.
Either way, data is written cb_block_size at a time; it saves passing
cb_buffer_size around. This is different from romio_cb_write=automatic
because I don't want large contiguous buffers to switch back completely
to independent writes. Maybe romio_cb_write=coordinated :)
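
To make that concrete, here is a rough sketch of the shape I have in
mind. It is not ROMIO code; set_comm (the ranks in one aggregation set),
fh, offset, buf, and count are assumed to come from the usual collective
bookkeeping.

#include <mpi.h>

void coordinated_write(MPI_Comm set_comm, MPI_File fh,
                       MPI_Offset offset, void *buf, int count)
{
    int rank, size, token = 0;
    MPI_Comm_rank(set_comm, &rank);
    MPI_Comm_size(set_comm, &size);

    /* wait for the token from the previous rank in the set */
    if (rank > 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, set_comm,
                 MPI_STATUS_IGNORE);

    /* write this node's own chunk in place, instead of shipping
     * it to an aggregator first */
    MPI_File_write_at(fh, offset, buf, count, MPI_BYTE,
                      MPI_STATUS_IGNORE);

    /* hand the token to the next rank in the set */
    if (rank < size - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, set_comm);
}

Since every rank is already inside the collective call, nobody starves
waiting for the token; it just serializes the writes within the set
instead of funneling all the data through one aggregator.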

Anyway, I think my question's been answered. It isn't possible now in
MPI-IO. Obviously customized apps can do whatever they like. Meanwhile I
need to pursue the config and look for the underlying problem or
limitation.