Rob Latham wrote on 09/16/2009 05:33:03 PM:

> Bob Cernohous wrote:
> > Same customer noticed that the file system was doing much more i/o
> > than MPIIO on collective reads.  I see:
> >
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
> > cb_block_size=4M, the aggregator reads 4M@offset 0, 4M@offset 16M,
> > 4M@offset 32M, etc.  Looks fine.  The aggregator doesn't read offsets
> > that aren't being referenced (4M, 8M, 12M).  We should see about 1-1
> > bandwidth between MPIIO and the file system (ignoring collective
> > overhead).
>
> If your customer is truly setting cb_block_size, he's not ever going to
> see a change.  MPI-IO ignores hints it does not understand, and
> 'cb_block_size' is definitely one of those.
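
For anyone following along: the hints would be getting in through an MPI_Info
object, presumably something like the sketch below (not the customer's actual
code; the file name and values are placeholders).  A hint ROMIO doesn't
recognize, like cb_block_size, is silently dropped, and MPI_File_get_info on
the open file shows which hints actually took effect:

    MPI_Info info;
    MPI_File fh;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* recognized ROMIO hint */
    MPI_Info_set(info, "cb_block_size",  "4194304");  /* not recognized; dropped */
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY, info, &fh);
    /* MPI_File_get_info(fh, &info_used) afterwards lists the hints in effect */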

> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M
> > & cb_block_size=4M, the aggregator reads 16M@offset 0, 16M@offset 16M,
> > 16M@offset 3M, etc.  The aggregator doesn't use cb_block_size or any
> > other heuristic to avoid unnecessarily large blocks.  Actual file
> > system i/o bandwidth is 4x the application i/o bandwidth.
>
> Is that a typo?  3M should be 32M?

Yes.

> The aggregator uses one algorithm: read all the data in my file domain
> from the starting byte to the ending byte.  We can shrink the file
> domain with cb_buffer_size.  There are some more sophisticated
> approaches we can take but you commented them out in V1R4.

I ran this through ad_read_coll.c and saw the same behavior.  I don't
think I commented anything out of common/.

> > Why is cb_buffer_size used instead of cb_block_size?  Looking at the
> > code, it seems "that's the way it is - read the buffer size that we're
> > exchanging".  I think in "normal" collective i/o patterns the data
> > would have been scattered across the whole cb_buffer_size, which is
> > why it might be designed this way.
> >
> > Any further comments from the experts?  I think that this is similar
> > to the leading/trailing data sieving holes that I asked about a couple
> > of weeks ago.  But on read, there isn't any hole detection or tuning.
> > It just does the whole cb_buffer_size if any data in that block is
> > needed.
>
> Not exactly.  Take a look at ad_bgl_rdcoll.c around line 550.  After
> collecting everyone's requests, it knows the 'st_loc' and 'end_loc'
> (starting and ending location).  If that's smaller than the
> cb_buffer_size, that much data is read.

I'm running a rack in VN mode, so 4K nodes, 8 i/o nodes, 32
bgl_nodes_pset.  In this case the aggregator's file domain out of the
64G file is much larger than cb_buffer_size, so it always uses
cb_buffer_size.  It doesn't trim cb_buffer_size any further when the
hole is at the end of the block.
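
In other words, what it looks like it's doing per pass is roughly this - my
paraphrase, not the actual ROMIO source; the helper names are made up for
illustration:

    /* paraphrase of the ADIOI_Read_and_exch loop as observed, not the real code */
    for (m = 0; m < ntimes; m++) {
        ADIO_Offset off  = st_loc + (ADIO_Offset)m * coll_bufsize;
        ADIO_Offset size = MIN(coll_bufsize, end_loc - off + 1);
        if (chunk_contains_requested_data(off, size))   /* made-up helper */
            read_contiguous(fd, read_buf, off, size);   /* whole chunk, holes included */
        /* ...then scatter the requested pieces back to the other ranks... */
    }
    /* So with coll_bufsize=16M and only 4M wanted per 16M stride, every chunk
     * still contains *some* requested data and the full 16M is read. */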

ad_bgl - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 0 off 0 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 16, st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:635: Read conting offset 0, size 16777216, buffer offset 0
ADIOI_BGL_ReadContig:91:read buflen 16777216, offset 0

ad_bgl - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 1 off 16777216 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 0, st_loc -1, end_loc -1, coll_bufsize 16777216

-------------------------------------
common - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 0 off 0 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 16, st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:727: Read conting offset 0, size 16777216, buff offset 0
..romio/adio/common/ad_read.c:ADIOI_GEN_ReadContig:61: off 0 len 16777216

common - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 1 off 16777216 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 0, st_loc -1, end_loc -1, coll_bufsize 16777216
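
For what it's worth, the numbers in those traces hang together:

    end_loc - st_loc + 1 = 255852544 bytes = 244 MiB   (rank 0's file domain)
    244 MiB / 16 MiB coll_bufsize = 15.25  ->  the ntimes 16 passes shown
    each pass: one contiguous 16 MiB read, of which only 4 MiB was requested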

> It's true that ROMIO does not do strided i/o here and instead could very
> well be reading in holes, but if the unneeded data resides entirely at
> the end of the block, only the needed data will be read.

That's what I was hoping, but that's not what I see on either ad_bgl or
common.

> Do you think your customers could just send a test program?  I think we
> might be missing important details in the relaying of messages back and
> forth.
>
> ==rob

It's pretty simple if you ignore all the optional compiling (MPIIO vs.
posix); basically ...

#define PROC_DATA_SIZE 16
static unsigned g_data_size = PROC_DATA_SIZE;  /* Data per process in MiB (2^20) */

#define MB  (1000*1000)   /* MB  = megabyte = 10^6 bytes */
#define MIB (1024*1024)   /* MiB = mebibyte = 2^20 bytes */

/* Size of one written chunk in MiB */
#define BUFFER_SIZE 4
static char g_buffer[BUFFER_SIZE*MIB];
...
    /* Create a derived datatype for one unit of data transfer */
    ret = MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
...
    /* compute the node's first byte position in the file */
    FileOffset offset = rank * (FileOffset)(g_data_size) * MIB;
    int mpi_ret = MPI_File_set_view(fh, offset, g_etype, g_etype, "native",
                                    MPI_INFO_NULL);
...
    int mpi_ret = MPI_File_write_all(fh, buffer, 1, g_etype, MPI_STATUS_IGNORE);

and it loops writing BUFFER_SIZE until it's written PROC_DATA_SIZE.
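
If it saves a round trip, here's a self-contained sketch of the read side of
that same pattern - my reconstruction, not the customer's actual program; the
file name and the lack of error handling are placeholders (with 4K ranks at
16 MiB each it covers the 64G file):

#include <mpi.h>

#define MIB            (1024*1024)  /* 2^20 bytes */
#define BUFFER_SIZE    4            /* one 4 MiB chunk per collective read */
#define PROC_DATA_SIZE 16           /* 16 MiB of data per rank */

static char g_buffer[BUFFER_SIZE*MIB];

int main(int argc, char **argv)
{
    int rank, i;
    MPI_File fh;
    MPI_Datatype etype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 4 MiB unit of transfer, same as on the write side */
    MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &etype);
    MPI_Type_commit(&etype);

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank's data starts at rank * 16 MiB, so any one collective
     * call touches 4M blocks at 16M offsets across the ranks */
    MPI_Offset disp = (MPI_Offset)rank * PROC_DATA_SIZE * MIB;
    MPI_File_set_view(fh, disp, etype, etype, "native", MPI_INFO_NULL);

    /* loop reading 4 MiB chunks until 16 MiB has been read */
    for (i = 0; i < PROC_DATA_SIZE/BUFFER_SIZE; i++)
        MPI_File_read_all(fh, g_buffer, 1, etype, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&etype);
    MPI_Finalize();
    return 0;
}

With MPI_INFO_NULL on the open it picks up the system default hints; setting
cb_buffer_size as in the earlier sketch should reproduce the 16 MiB
ReadContig calls in the traces above.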