[mpich2-dev] And yet another ROMIO performance question
Bob Cernohous
bobc at us.ibm.com
Wed Sep 16 14:57:45 CDT 2009
Same customer noticed that the file system was doing much more I/O than
MPI-IO requested on collective reads. Here's what I see:
When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
cb_block_size=4M, the aggregator reads 4M at offset 0, 4M at offset 16M,
4M at offset 32M, etc. Looks fine. The aggregator doesn't read offsets that
aren't being referenced (4M, 8M, 12M). We should see about 1:1 bandwidth
between MPI-IO and the file system (ignoring collective overhead).
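
For reference, here's a minimal sketch of the kind of access that produces
the first trace (my approximation, not the customer's actual code; the file
name "datafile" and one-block-per-rank layout are placeholders): each rank
reads one 4M block at a 16M stride, with the hints set as above.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const MPI_Offset block  = 4  * 1024 * 1024;   /* 4M read per rank  */
        const MPI_Offset stride = 16 * 1024 * 1024;   /* 16M between ranks */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_buffer_size", "4194304");   /* 4M */
        MPI_Info_set(info, "cb_block_size",  "4194304");   /* 4M */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY, info, &fh);

        char *buf = malloc(block);
        /* Rank r's 4M block starts at r * 16M, so only the first 4M of each
         * 16M region is referenced by anyone. */
        MPI_File_read_at_all(fh, rank * stride, buf, (int)block, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }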
When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M &
cb_block_size=4M, the aggregator reads 16M at offset 0, 16M at offset 16M,
16M at offset 32M, etc. The aggregator doesn't use cb_block_size or any other
heuristic to avoid unnecessarily large reads. Actual file system I/O
bandwidth is 4x the application I/O bandwidth.
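
The second case differs only in the hint values; in the sketch above that
would be (again just my approximation of the customer's settings):

    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16M      */
    MPI_Info_set(info, "cb_block_size",  "4194304");   /* still 4M */
    /* Each aggregator read now covers a full 16M region even though only
     * the leading 4M of it is referenced, i.e. ~4x the requested bytes. */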
Why is cb_buffer_size used instead of cb_block_size? Looking at the code,
it seems "that's just the way it is - read the buffer size that we're
exchanging". I think that in "normal" collective I/O patterns the data
would have been scattered across the whole cb_buffer_size, which is
probably why it was designed this way.
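
To make the question concrete, here's how I read the aggregator's behavior,
as a grossly simplified sketch (my paraphrase, not ROMIO's actual code;
requested_span() is a hypothetical helper):

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: returns nonzero and sets [*lo, *hi] to the span
     * of bytes actually requested inside [off, off+len). Not a ROMIO call. */
    int requested_span(off_t off, size_t len, off_t *lo, off_t *hi);

    static void aggregator_read_phase(int fd, char *cb_buf,
                                      off_t domain_start, off_t domain_end,
                                      size_t cb_buffer_size)
    {
        for (off_t off = domain_start; off < domain_end;
             off += (off_t)cb_buffer_size) {
            size_t chunk = cb_buffer_size;
            if (off + (off_t)chunk > domain_end)
                chunk = (size_t)(domain_end - off);

            off_t req_lo, req_hi;
            if (requested_span(off, chunk, &req_lo, &req_hi)) {
                /* Reads the WHOLE chunk, not just [req_lo, req_hi]. */
                pread(fd, cb_buf, chunk, off);
                /* ...data is then exchanged with the ranks that asked for it. */
            }
        }
    }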
Any further comments from the experts? I think this is similar to the
leading/trailing data sieving holes that I asked about a couple of weeks
ago. But on the read side there isn't any hole detection or tuning; it
just reads the whole cb_buffer_size if any data in that block is needed.
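
By "tuning" I mean something like the following drop-in change to the pread
in the sketch above (just an illustration of the idea, not a proposed
patch): trim the file-system read to the span that is actually referenced.

    /* Trim the read to the referenced span instead of the whole chunk.
     * req_lo/req_hi come from the (hypothetical) requested_span() above. */
    size_t need = (size_t)(req_hi - req_lo + 1);
    pread(fd, cb_buf + (req_lo - off), need, req_lo);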
Bob Cernohous: (T/L 553) 507-253-6093
BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester, MN 55901-7829
> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.