[mpich2-dev] And yet another ROMIO performance question

Bob Cernohous bobc at us.ibm.com
Wed Sep 16 14:57:45 CDT 2009


The same customer noticed that the file system was doing much more I/O than 
MPIIO itself on collective reads.  Here's what I see:

When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M & 
cb_block_size=4M, the aggregator reads 4M at offset 0, 4M at offset 16M, 
4M at offset 32M, etc.  That looks fine.  The aggregator doesn't read the offsets 
that aren't being referenced (4M, 8M, 12M).  We should see roughly 1:1 bandwidth 
between MPIIO and the file system (ignoring collective overhead).
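For reference, the access pattern we're testing looks roughly like the sketch 
below.  The file name, the rank-to-offset mapping, and the hint values are 
placeholders standing in for the customer's actual code; cb_block_size is 
passed through simply as the hint named in this thread.

#include <stdlib.h>
#include <mpi.h>

#define BLOCK  (4 * 1024 * 1024)            /* 4M read per rank   */
#define STRIDE (16LL * 1024 * 1024)         /* 16M offset stride  */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(BLOCK);
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* 4M collective buffer */
    MPI_Info_set(info, "cb_block_size",  "4194304");  /* 4M block hint        */

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY, info, &fh);

    /* each rank collectively reads a 4M block at a 16M-strided offset */
    MPI_File_read_at_all(fh, (MPI_Offset)rank * STRIDE, buf, BLOCK,
                         MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}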

When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M & 
cb_block_size=4M, the aggregator reads 16M at offset 0, 16M at offset 16M, 
16M at offset 32M, etc.  The aggregator doesn't use cb_block_size or any other 
heuristic to avoid unnecessarily large reads.  Actual file system I/O 
bandwidth is 4x the application I/O bandwidth.
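Just to spell out the arithmetic behind that 4x (numbers taken from the case 
above, nothing measured here):

#include <stdio.h>

int main(void)
{
    /* per 16M stripe: the application wants 4M, the aggregator reads 16M */
    const long requested = 4L  * 1024 * 1024;
    const long read_back = 16L * 1024 * 1024;

    printf("read amplification = %ldx\n", read_back / requested);  /* 4x */
    return 0;
}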

Why is cb_buffer_size used instead of cb_block_size?  Looking at the code, 
the answer seems to be simply "that's the way it is - read the buffer size 
that we're exchanging".  I think that in "normal" collective I/O patterns the 
data would have been scattered across the whole cb_buffer_size, which is 
probably why it was designed this way.
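To make the point concrete, here is a much-simplified sketch of how I read the 
aggregator's read phase.  This is not ROMIO source and the helper names are 
made up; it only captures the behavior I'm describing: one contiguous read per 
cb_buffer_size-sized chunk of the file domain whenever anything in that chunk 
is requested, with no attempt to trim the read down to cb_block_size or to 
skip interior holes.

#define CB_BUFFER_SIZE (16L * 1024 * 1024)

/* hypothetical helpers, assumed to exist elsewhere for this sketch */
extern int  chunk_has_requested_data(long chunk_off, long chunk_len);
extern void read_contig(int fd, char *buf, long off, long len);
extern void scatter_to_requesters(const char *buf, long off, long len);

void aggregator_read_phase(int fd, long domain_off, long domain_len)
{
    static char cb_buf[CB_BUFFER_SIZE];
    long off, len;

    for (off = domain_off; off < domain_off + domain_len; off += CB_BUFFER_SIZE) {
        len = CB_BUFFER_SIZE;
        if (off + len > domain_off + domain_len)
            len = domain_off + domain_len - off;

        if (!chunk_has_requested_data(off, len))
            continue;                 /* chunks nobody asked for are skipped */

        /* the whole chunk is read, even if only 4M of it is actually wanted */
        read_contig(fd, cb_buf, off, len);
        scatter_to_requesters(cb_buf, off, len);
    }
}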

Any further comments from the experts?  I think this is similar to the 
leading/trailing data sieving holes that I asked about a couple of weeks 
ago.  But on the read path there isn't any hole detection or tuning.  It just 
reads the whole cb_buffer_size if any data in that block is needed.
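In the meantime, a couple of hint settings the customer could try (untested on 
my side, just based on the standard ROMIO hints): shrink cb_buffer_size to 
match the 4M access size so the chunk the aggregator reads equals the block 
actually wanted, or disable collective buffering on reads entirely with 
romio_cb_read.  The file name below is a placeholder.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");      /* match the 4M access size */
    /* or: MPI_Info_set(info, "romio_cb_read", "disable");   skip collective buffering */

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY, info, &fh);
    /* ... collective reads as before ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}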

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.