Rob Latham wrote on 09/16/2009 05:33:03 PM:

> Bob Cernohous wrote:
> > Same customer noticed that the file system was doing much more i/o
> > than MPIIO on collective reads.  I see:
> >
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
> > cb_block_size=4M, the aggregator reads 4M@offset 0, 4M@offset 16M,
> > 4M@offset 32M, etc.  Looks fine.  The aggregator doesn't read offsets
> > that aren't being referenced (4M, 8M, 12M).  We should see about 1-1
> > bandwidth between MPIIO and the file system (ignoring collective
> > overhead).
>
> If your customer is truly setting cb_block_size, he's not ever going to
> see a change.  MPI-IO ignores hints it does not understand, and
> 'cb_block_size' is definitely one of those.
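
For anyone following along: the hints would be getting in through an MPI_Info
object, presumably something like the sketch below (not the customer's actual
code; the file name and values are placeholders).  A hint ROMIO doesn't
recognize, like cb_block_size, is silently dropped, and MPI_File_get_info on
the open file shows which hints actually took effect:

    MPI_Info info;
    MPI_File fh;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* recognized ROMIO hint */
    MPI_Info_set(info, "cb_block_size",  "4194304");  /* not recognized; dropped */
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY, info, &fh);
    /* MPI_File_get_info(fh, &info_used) afterwards lists the hints in effect */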

> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M
> > & cb_block_size=4M, the aggregator reads 16M@offset 0, 16M@offset 16M,
> > 16M@offset 3M, etc.  The aggregator doesn't use cb_block_size or any
> > other heuristic to avoid unnecessarily large blocks.  Actual file
> > system i/o bandwidth is 4x the application i/o bandwidth.
>
> Is that a typo?  3M should be 32M?

Yes.

> The aggregator uses one algorithm: read all the data in my file domain
> from the starting byte to the ending byte.  We can shrink the file
> domain with cb_buffer_size.  There are some more sophisticated
> approaches we can take but you commented them out in V1R4.

I ran this through ad_read_coll.c and saw the same behavior.  I don't
think I commented anything out of common/.

> > Why is cb_buffer_size used instead of cb_block_size?  Looking at the
> > code, it seems "that's the way it is - read the buffer size that we're
> > exchanging".  I think in "normal" collective i/o patterns the data
> > would have been scattered across the whole cb_buffer_size, which is
> > why it might be designed this way.
> >
> > Any further comments from the experts?  I think that this is similar
> > to the leading/trailing data sieving holes that I asked about a couple
> > of weeks ago.  But on read, there isn't any hole detection or tuning.
> > It just does the whole cb_buffer_size if any data in that block is
> > needed.
>
> Not exactly.  Take a look at ad_bgl_rdcoll.c around line 550.  After
> collecting everyone's requests, it knows the 'st_loc' and 'end_loc'
> (starting and ending location).  If that's smaller than the
> cb_buffer_size, that much data is read.

I'm running a rack in VN mode, so 4K nodes, 8 i/o nodes, 32
bgl_nodes_pset.  In this case the aggregator's file domain out of the
64G file is much larger than cb_buffer_size, so it always uses
cb_buffer_size.  It doesn't trim cb_buffer_size any further when the
hole is at the end of the block.
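
In other words, what it looks like it's doing per pass is roughly this - my
paraphrase, not the actual ROMIO source; the helper names are made up for
illustration:

    /* paraphrase of the ADIOI_Read_and_exch loop as observed, not the real code */
    for (m = 0; m < ntimes; m++) {
        ADIO_Offset off  = st_loc + (ADIO_Offset)m * coll_bufsize;
        ADIO_Offset size = MIN(coll_bufsize, end_loc - off + 1);
        if (chunk_contains_requested_data(off, size))   /* made-up helper */
            read_contiguous(fd, read_buf, off, size);   /* whole chunk, holes included */
        /* ...then scatter the requested pieces back to the other ranks... */
    }
    /* So with coll_bufsize=16M and only 4M wanted per 16M stride, every chunk
     * still contains *some* requested data and the full 16M is read. */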

ad_bgl - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 0 off 0 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 16, st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:635: Read conting offset 0, size 16777216, buffer offset 0
ADIOI_BGL_ReadContig:91:read buflen 16777216, offset 0

ad_bgl - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 1 off 16777216 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 0, st_loc -1, end_loc -1, coll_bufsize 16777216

-------------------------------------
common - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 0 off 0 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 16, st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:727: Read conting offset 0, size 16777216, buff offset 0
..romio/adio/common/ad_read.c:ADIOI_GEN_ReadContig:61: off 0 len 16777216

common - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 1 off 16777216 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 0, st_loc -1, end_loc -1, coll_bufsize 16777216
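
For what it's worth, the numbers in those traces hang together:

    end_loc - st_loc + 1 = 255852544 bytes = 244 MiB   (rank 0's file domain)
    244 MiB / 16 MiB coll_bufsize = 15.25  ->  the ntimes 16 passes shown
    each pass: one contiguous 16 MiB read, of which only 4 MiB was requested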

> It's true that ROMIO does not do strided i/o here and instead could very
> well be reading in holes, but if the unneeded data resides entirely at
> the end of the block, only the needed data will be read.

That's what I was hoping, but that's not what I see on either ad_bgl or
common.

> Do you think your customers could just send a test program?  I think we
> might be missing important details in the relaying of messages back and
> forth.
>
> ==rob

It's pretty simple if you ignore all the optional compiling (MPIIO vs.
posix); basically ...

#define PROC_DATA_SIZE 16
static unsigned g_data_size = PROC_DATA_SIZE;  /* Data per process in MiB (2^20) */

#define MB  (1000*1000)   /* MB  = megabyte = 10^6 bytes */
#define MIB (1024*1024)   /* MiB = mebibyte = 2^20 bytes */

/* Size of one written chunk in MiB */
#define BUFFER_SIZE 4
static char g_buffer[BUFFER_SIZE*MIB];
...
    /* Create a derived datatype for one unit of data transfer */
    ret = MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
...
    /* compute the node's first byte position in the file */
    FileOffset offset = rank * (FileOffset)(g_data_size) * MIB;
    int mpi_ret = MPI_File_set_view(fh, offset, g_etype, g_etype, "native",
                                    MPI_INFO_NULL);
...
    int mpi_ret = MPI_File_write_all(fh, buffer, 1, g_etype, MPI_STATUS_IGNORE);

and it loops writing BUFFER_SIZE until it's written PROC_DATA_SIZE.
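
If it saves a round trip, here's a self-contained sketch of the read side of
that same pattern - my reconstruction, not the customer's actual program; the
file name and the lack of error handling are placeholders (with 4K ranks at
16 MiB each it covers the 64G file):

#include <mpi.h>

#define MIB            (1024*1024)  /* 2^20 bytes */
#define BUFFER_SIZE    4            /* one 4 MiB chunk per collective read */
#define PROC_DATA_SIZE 16           /* 16 MiB of data per rank */

static char g_buffer[BUFFER_SIZE*MIB];

int main(int argc, char **argv)
{
    int rank, i;
    MPI_File fh;
    MPI_Datatype etype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 4 MiB unit of transfer, same as on the write side */
    MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &etype);
    MPI_Type_commit(&etype);

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank's data starts at rank * 16 MiB, so any one collective
     * call touches 4M blocks at 16M offsets across the ranks */
    MPI_Offset disp = (MPI_Offset)rank * PROC_DATA_SIZE * MIB;
    MPI_File_set_view(fh, disp, etype, etype, "native", MPI_INFO_NULL);

    /* loop reading 4 MiB chunks until 16 MiB has been read */
    for (i = 0; i < PROC_DATA_SIZE/BUFFER_SIZE; i++)
        MPI_File_read_all(fh, g_buffer, 1, etype, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&etype);
    MPI_Finalize();
    return 0;
}

With MPI_INFO_NULL on the open it picks up the system default hints; setting
cb_buffer_size as in the earlier sketch should reproduce the 16 MiB
ReadContig calls in the traces above.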