[mpich2-dev] And yet another ROMIO performance question

Rob Latham robl at mcs.anl.gov
Wed Sep 16 17:33:03 CDT 2009


Bob Cernohous wrote:
> Same customer noticed that the file system was doing much more i/o 
> than MPIIO on collective reads.  I see:
>
> When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M & 
> cb_block_size=4M, the aggregator reads 4M at offset 0, 4M at offset 16M, 
> 4M at offset 32M, etc.  Looks fine.  The aggregator doesn't read offsets 
> that aren't being referenced (4M,8M,12M).  We should see about 1-1 
> bandwidth between MPIIO and the file system (ignoring collective 
> overhead).

If your customer is truly setting cb_block_size, he is never going to 
see a change.  MPI-IO ignores hints it does not understand, and 
'cb_block_size' is definitely one of those.

> When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M 
> & cb_block_size=4M, the aggregator reads 16M at offset 0, 16M at offset 16M, 
> 16M at offset 3M, etc.  The aggregator doesn't use cb_block_size or any 
> other heuristic to avoid unnecessarily large blocks.  Actual file 
> system i/o bandwidth is 4x the application i/o bandwidth.  

Is that a typo?  Should 3M be 32M?

The aggregator uses one algorithm: read all the data in its file domain 
from the starting byte to the ending byte.  We can shrink the file 
domain with cb_buffer_size.  There are some more sophisticated 
approaches we can take, but you commented them out in V1R4.
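
To make the arithmetic concrete, here is a rough Python model of that 
behaviour.  It is only a sketch, not ROMIO code: it assumes a single 
aggregator whose file domain spans every request, and that the domain is 
walked in cb_buffer_size windows, with a window read in full whenever any 
request touches it.  The names (aggregator_reads, bytes_read) are made up 
for illustration.

```python
MiB = 1024 * 1024

def aggregator_reads(requests, cb_buffer_size):
    """Model the reads one aggregator issues for (offset, length) requests.

    Assumption: the file domain runs from the first requested byte to the
    last, and is processed in windows of at most cb_buffer_size bytes.
    """
    st_loc = min(off for off, _ in requests)
    end_loc = max(off + length for off, length in requests) - 1
    reads = []
    window = st_loc
    while window <= end_loc:
        # The final window is truncated at the last requested byte.
        size = min(cb_buffer_size, end_loc - window + 1)
        # Read the whole window if any request overlaps it, holes included.
        if any(off < window + size and off + length > window
               for off, length in requests):
            reads.append((window, size))
        window += size
    return reads

def bytes_read(reads):
    return sum(size for _, size in reads)

# Four 4M blocks at 16M offsets, as in the report above.
requests = [(i * 16 * MiB, 4 * MiB) for i in range(4)]
needed = sum(length for _, length in requests)
for cb in (4 * MiB, 16 * MiB):
    total = bytes_read(aggregator_reads(requests, cb))
    print(f"cb_buffer_size={cb // MiB}M: read {total // MiB}M "
          f"for {needed // MiB}M requested")
```

Under these assumptions the 4M setting reads exactly the 16M requested, 
while the 16M setting reads 52M (three full windows plus a truncated last 
one).  With more blocks the ratio moves toward the 4x reported above.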

> Why is cb_buffer_size used instead of cb_block_size?  Looking at the 
> code, it seems "that's the way it is - read the buffer size that we're 
> exchanging".  I think in "normal" collective i/o patterns the data 
> would have been scattered across the whole cb_buffer_size, which is 
> why it might be designed this way.
>
> Any further comments from the experts?  I think that this is similar 
> to the leading/trailing data sieving holes that I asked about a couple 
> weeks ago.   But on read, there isn't any hole detection or tuning. 
>  It just does the whole cb_buffer_size if any data in that block is 
> needed..

Not exactly.  Take a look at ad_bgl_rdcoll.c around line 550.  After 
collecting everyone's requests, it knows the 'st_loc' and 'end_loc' 
(starting and ending location).  If that range is smaller than the 
cb_buffer_size, only that much data is read.

It's true that ROMIO does not do strided I/O here and so could very 
well be reading in holes, but if the unneeded data resides entirely at 
the end of the block, only the needed data will be read.
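
A small Python sketch of that distinction, again as a simplified model 
rather than the code in ad_bgl_rdcoll.c.  It assumes all requests fall 
inside one cb_buffer_size window, and the function name single_read_extent 
is hypothetical.

```python
MiB = 1024 * 1024

def single_read_extent(requests, cb_buffer_size):
    """Return (start, nbytes) of the one contiguous read for a window.

    Assumption: the read runs from st_loc to end_loc, capped at
    cb_buffer_size, with no attempt to skip holes in between.
    """
    st_loc = min(off for off, _ in requests)
    end_loc = max(off + length for off, length in requests) - 1
    nbytes = min(cb_buffer_size, end_loc - st_loc + 1)
    return st_loc, nbytes

# Unneeded data only at the end of the block: nothing extra is read.
trailing = single_read_extent([(0, 4 * MiB)], 16 * MiB)

# A hole between two requests: the hole is read along with the data.
interior = single_read_extent([(0, 1 * MiB), (3 * MiB, 1 * MiB)], 16 * MiB)

print("trailing hole:", trailing[1] // MiB, "M read")
print("interior hole:", interior[1] // MiB, "M read")
```

In the first case the read stops at 4M even though the buffer is 16M.  In 
the second, 4M is read to satisfy 2M of requests because the interior hole 
comes along.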

Do you think your customers could just send a test program?   I think we 
might be missing important details in the relaying of messages back and 
forth.

==rob

