Same customer noticed that the file system was doing much more i/o than
MPIIO was requesting on collective reads. I see:

When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
cb_block_size=4M, the aggregator reads 4M@offset 0, 4M@offset 16M,
4M@offset 32M, etc. Looks fine. The aggregator doesn't read offsets
that aren't being referenced (4M, 8M, 12M). We should see about 1:1
bandwidth between MPIIO and the file system (ignoring collective
overhead).
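
For concreteness, the pattern I'm describing looks roughly like this
(a minimal sketch, not the customer's actual code; the filename, block
sizes, and hint values are just placeholders):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collective buffering hints: 4M exchange buffer, 4M blocks. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_buffer_size", "4194304");
        MPI_Info_set(info, "cb_block_size",  "4194304");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                      info, &fh);

        /* Each rank reads one 4M block at a 16M stride:
           rank 0 -> offset 0, rank 1 -> 16M, rank 2 -> 32M, ... */
        const int blk = 4 * 1024 * 1024;
        MPI_Offset off = (MPI_Offset)rank * 16 * 1024 * 1024;
        char *buf = malloc(blk);

        MPI_File_read_at_all(fh, off, buf, blk, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }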

When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M &
cb_block_size=4M, the aggregator reads 16M@offset 0, 16M@offset 16M,
16M@offset 32M, etc. The aggregator doesn't use cb_block_size or any
other heuristic to avoid unnecessarily large blocks. Actual file system
i/o bandwidth is 4x the application i/o bandwidth.
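
To make the 4x concrete: the only change from the sketch above is the
buffer-size hint (the helper name below is mine, just for illustration):

    #include <mpi.h>

    /* With a 16M collective buffer and the same 4M-every-16M pattern,
       the aggregator reads whole 16M extents (16M@0, 16M@16M,
       16M@32M, ...) even though each extent contains only one 4M
       block the application asked for: 16M read / 4M delivered = 4x. */
    void set_hints_16m_buffer(MPI_Info info)
    {
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16M */
        MPI_Info_set(info, "cb_block_size",  "4194304");   /*  4M */
    }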

Why is cb_buffer_size used instead of cb_block_size? Looking at the
code, it seems "that's the way it is - read the buffer size that we're
exchanging". I think in "normal" collective i/o patterns the data would
have been scattered across the whole cb_buffer_size, which is why it
might be designed this way.

Any further comments from the experts? I think that this is similar to
the leading/trailing data sieving holes that I asked about a couple of
weeks ago. But on read, there isn't any hole detection or tuning. It
just reads the whole cb_buffer_size if any data in that block is
needed.

Bob Cernohous: (T/L 553) 507-253-6093

BobC@us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester, MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.