[mpich2-dev] And yet another ROMIO performance question
Bob Cernohous
bobc at us.ibm.com
Wed Sep 16 17:01:54 CDT 2009
Rob Latham wrote on 09/16/2009 05:33:03 PM:
> Bob Cernohous wrote:
> > Same customer noticed that the file system was doing much more i/o
> > than MPIIO on collective reads. I see:
> >
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
> > cb_block_size=4M, the aggregator reads 4M at offset 0, 4M at offset 16M,
> > 4M at offset 32M, etc. Looks fine. The aggregator doesn't read offsets
> > that aren't being referenced (4M,8M,12M). We should see about 1-1
> > bandwidth between MPIIO and the file system (ignoring collective
> > overhead).
>
> if your customer is truly setting cb_block_size, he's not ever going to
> see a change. MPI-IO ignores hints it does not understand, and
> 'cb_block_size' is definitely one of those.
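A quick way to check which hints the implementation actually kept is to
read them back after the open. A minimal sketch; the helper name is mine:

#include <mpi.h>
#include <stdio.h>

/* Print one hint from the info object the implementation actually kept.
 * Unrecognized hints, like cb_block_size here, simply won't be present. */
static void print_hint(MPI_File fh, const char *key)
{
    MPI_Info info_used;
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;

    MPI_File_get_info(fh, &info_used);
    MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, &flag);
    printf("%s = %s\n", key, flag ? value : "(not set / not understood)");
    MPI_Info_free(&info_used);
}

Calling print_hint(fh, "cb_block_size") right after MPI_File_open would
show whether the hint survived.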
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M
> > & cb_block_size=4M, the aggregator reads 16M at offset 0, 16M at offset 16M,
> > 16M at offset 3M, etc. The aggregator doesn't use cb_block_size or any
> > other heuristic to avoid unnecessarily large blocks. Actual file
> > system i/o bandwidth is 4x the application i/o bandwidth.
> Is that a typo? 3M should be 32M?
Yes
>
> The aggregator uses one algorithm: read all the data in my file domain
> from the starting byte to the ending byte. We can shrink the file
> domain with cb_buffer_size. There are some more sophisticated
> approaches we can take, but you commented them out in V1R4.
I ran this through ad_read_coll.c and saw the same behavior. I don't
think I commented anything out of common/.
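To spell out what that single algorithm does with this access pattern,
here's a toy model of the aggregator's read loop with the numbers from
the traces below hard-coded (my paraphrase, not the actual ROMIO source):

#include <stdio.h>

/* Model: the aggregator walks its file domain [st_loc, end_loc] in
 * coll_bufsize chunks and reads each chunk whole, holes included. */
int main(void)
{
    long long st_loc = 0, end_loc = 255852543; /* from the trace below */
    long long coll_bufsize = 16777216;         /* cb_buffer_size = 16M */

    long long ntimes = (end_loc - st_loc + coll_bufsize) / coll_bufsize;
    long long off = st_loc;
    for (long long i = 0; i < ntimes; i++) {
        long long size = coll_bufsize;
        if (off + size - 1 > end_loc)
            size = end_loc - off + 1;          /* last round may be short */
        printf("round %lld: read %lld bytes at offset %lld\n", i, size, off);
        off += size;
    }
    return 0;
}

Every round is a full coll_bufsize read no matter how little of the chunk
any rank actually asked for; only the final round shrinks, and only
because the domain itself ends there.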
>
> > Why is cb_buffer_size used instead of cb_block_size? Looking at the
> > code, it seems "that's the way it is - read the buffer size that we're
> > exchanging". I think in "normal" collective i/o patterns the data
> > would have been scattered across the whole cb_buffer_size, which is
> > why it might be designed this way.
> >
> > Any further comments from the experts? I think that this is similar
> > to the leading/trailing data sieving holes that I asked about a couple
> > weeks ago. But on read, there isn't any hole detection or tuning.
> > It just does the whole cb_buffer_size if any data in that block is
> > needed.
>
> Not exactly. Take a look at ad_bgl_rdcoll.c around line 550. After
> collecting everyone's requests, it knows the 'st_loc' and 'end_loc'
> (starting and ending location). If that's smaller than the
> cb_buffer_size, that much data is read.
I'm running a rack in VN mode: 4K nodes, 8 I/O nodes, 32 bgl_nodes_pset.
In this case the aggregator's file domain in the 64G file is much larger
than cb_buffer_size, so it always reads full cb_buffer_size blocks. It
doesn't trim cb_buffer_size any further when the hole is at the end of the
block.
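Doing the arithmetic on the trace below: the domain is end_loc - st_loc +
1 = 255852544 bytes = 244 MiB, covered in ntimes = 16 rounds (15 full
16 MiB reads plus a final 4 MiB). In this collective each rank only wants
a single 4 MiB block at a 16 MiB stride, so just 16 x 4 MiB = 64 MiB of
that domain is useful data, and 244/64 is the roughly 4x file system
amplification the customer measured.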
ad_bgl - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 0
off 0 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 16,
st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:635: Read conting
offset 0, size 16777216, buffer offset 0
ADIOI_BGL_ReadContig:91:read buflen 16777216, offset 0
ad_bgl - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 1
off 16777216 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 0,
st_loc -1, end_loc -1, coll_bufsize 16777216
-------------------------------------
common - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 0
off 0 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 16,
st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:727: Read conting
offset 0, size 16777216, buff offset 0
..romio/adio/common/ad_read.c:ADIOI_GEN_ReadContig:61: off 0 len 16777216
common - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 1
off 16777216 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 0,
st_loc -1, end_loc -1, coll_bufsize 16777216
>
> it's true that ROMIO does not do strided i/o here and instead could very
> well be reading in holes, but if the unneeded data resides entirely at
> the end of the block, only the needed data will be read.
That's what I was hoping, but that's not what I see with either ad_bgl or
common.
>
> Do you think your customers could just send a test program? I think we
> might be missing important details in the relaying of messages back and
> forth.
>
> ==rob
It's pretty simple if you ignore all the optional compilation (MPI-IO vs.
POSIX); basically:
#define PROC_DATA_SIZE 16
static unsigned g_data_size = PROC_DATA_SIZE; /* Data per process in MiB (2^20) */
#define MB (1000*1000)   /* MB = megabyte = 10^6 bytes */
#define MIB (1024*1024)  /* MiB = mebibyte = 2^20 bytes */
/* Size of one written chunk in MiB */
#define BUFFER_SIZE 4
static char g_buffer[BUFFER_SIZE*MIB];
...
/* Create a derived datatype for one unit of data transfer */
ret = MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
...
/* Compute the node's first byte position in the file */
FileOffset offset = rank*(FileOffset)(g_data_size)*MIB;
int mpi_ret = MPI_File_set_view(fh, offset, g_etype, g_etype, "native",
                                MPI_INFO_NULL);
...
int mpi_ret = MPI_File_write_all(fh, buffer, 1, g_etype, MPI_STATUS_IGNORE);

and it loops writing BUFFER_SIZE-sized chunks until it has written
PROC_DATA_SIZE.
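The read side that shows the 4x presumably just mirrors that loop with
MPI_File_read_all. For reference, a self-contained sketch; the hint
setup, the "testfile" name, and the cleanup are my reconstruction, not
the customer's code:

#include <mpi.h>
#include <stdio.h>

#define MIB (1024*1024)    /* MiB = mebibyte = 2^20 bytes */
#define BUFFER_SIZE 4      /* one transfer unit, in MiB */
#define PROC_DATA_SIZE 16  /* data per process, in MiB */

static char g_buffer[BUFFER_SIZE*MIB];

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    MPI_Datatype g_etype;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* cb_buffer_size is the hint that actually controls the aggregator */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY, info, &fh);

    /* one unit of transfer = one 4 MiB contiguous chunk */
    MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
    MPI_Type_commit(&g_etype);

    /* each rank's data starts at rank * 16 MiB */
    MPI_Offset offset = (MPI_Offset)rank * PROC_DATA_SIZE * MIB;
    MPI_File_set_view(fh, offset, g_etype, g_etype, "native", MPI_INFO_NULL);

    /* read 4 MiB units collectively until 16 MiB has been read; the
     * first call is the "4M blocks at 16M offsets" pattern above */
    for (int i = 0; i < PROC_DATA_SIZE/BUFFER_SIZE; i++)
        MPI_File_read_all(fh, g_buffer, 1, g_etype, MPI_STATUS_IGNORE);

    MPI_Type_free(&g_etype);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}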