[mpich2-dev] And yet another ROMIO performance question
Bob Cernohous
bobc at us.ibm.com
Wed Sep 16 17:01:54 CDT 2009
Rob Latham wrote on 09/16/2009 05:33:03 PM:
> Bob Cernohous wrote:
> > Same customer noticed that the file system was doing much more i/o
> > than MPIIO on collective reads. I see:
> >
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=4M &
> > cb_block_size=4M, the aggregator reads 4M at offset 0, 4M at offset 16M,
> > 4M at offset 32M, etc. Looks fine. The aggregator doesn't read offsets
> > that aren't being referenced (4M,8M,12M). We should see about 1-1
> > bandwidth between MPIIO and the file system (ignoring collective
> > overhead).
>
> if your customer is truly setting cb_block_size, he's not ever going to
> see a change. MPI-IO ignores hints it does not understand, and
> 'cb_block_size' is definitely one of those.
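A quick way to check which hints the implementation actually kept is to
read them back after the open. A minimal sketch; the helper name is mine:

#include <mpi.h>
#include <stdio.h>

/* Print one hint from the info object the implementation actually kept.
 * Unrecognized hints, like cb_block_size here, simply won't be present. */
static void print_hint(MPI_File fh, const char *key)
{
    MPI_Info info_used;
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;

    MPI_File_get_info(fh, &info_used);
    MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, &flag);
    printf("%s = %s\n", key, flag ? value : "(not set / not understood)");
    MPI_Info_free(&info_used);
}

Calling print_hint(fh, "cb_block_size") right after MPI_File_open would
show whether the hint survived.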
> > When the ranks read 4M blocks at 16M offsets using cb_buffer_size=16M
> > & cb_block_size=4M, the aggregator reads 16M at offset 0, 16M at offset 16M,
> > 16M at offset 3M, etc. The aggregator doesn't use cb_block_size or any
> > other heuristic to avoid unnecessarily large blocks. Actual file
> > system i/o bandwidth is 4x the application i/o bandwidth.
> Is that a typo? 3M should be 32M?
Yes
>
> The aggregator uses one algorithm: read all the data in my file domain
> from the starting byte to the ending byte. We can shrink the file
> domain with cb_buffer_size. There are some more sophisticated
> approaches we can take, but you commented them out in V1R4.
I ran this through ad_read_coll.c and saw the same behavior. I don't
think I commented anything out of common/.
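To spell out what that single algorithm does with this access pattern,
here's a toy model of the aggregator's read loop with the numbers from
the traces below hard-coded (my paraphrase, not the actual ROMIO source):

#include <stdio.h>

/* Model: the aggregator walks its file domain [st_loc, end_loc] in
 * coll_bufsize chunks and reads each chunk whole, holes included. */
int main(void)
{
    long long st_loc = 0, end_loc = 255852543; /* from the trace below */
    long long coll_bufsize = 16777216;         /* cb_buffer_size = 16M */

    long long ntimes = (end_loc - st_loc + coll_bufsize) / coll_bufsize;
    long long off = st_loc;
    for (long long i = 0; i < ntimes; i++) {
        long long size = coll_bufsize;
        if (off + size - 1 > end_loc)
            size = end_loc - off + 1;          /* last round may be short */
        printf("round %lld: read %lld bytes at offset %lld\n", i, size, off);
        off += size;
    }
    return 0;
}

Every round is a full coll_bufsize read no matter how little of the chunk
any rank actually asked for; only the final round shrinks, and only
because the domain itself ends there.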
>
> > Why is cb_buffer_size used instead of cb_block_size? Looking at the
> > code, it seems "that's the way it is - read the buffer size that we're
> > exchanging". I think in "normal" collective i/o patterns the data
> > would have been scattered across the whole cb_buffer_size, which is
> > why it might be designed this way.
> >
> > Any further comments from the experts? I think that this is similar
> > to the leading/trailing data sieving holes that I asked about a couple
> > weeks ago. But on read, there isn't any hole detection or tuning.
> > It just does the whole cb_buffer_size if any data in that block is
> > needed.
>
> Not exactly. Take a look at ad_bgl_rdcoll.c around line 550. After
> collecting everyone's requests, it knows the 'st_loc' and 'end_loc'
> (starting and ending location). If that's smaller than the
> cb_buffer_size, that much data is read.
I'm running a rack in VN mode: 4K nodes, 8 I/O nodes, 32 bgl_nodes_pset.
In this case the aggregator's file domain in the 64G file is much larger
than cb_buffer_size, so it always reads full cb_buffer_size blocks. It
doesn't trim cb_buffer_size any further when the hole is at the end of the
block.
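Doing the arithmetic on the trace below: the domain is end_loc - st_loc +
1 = 255852544 bytes = 244 MiB, covered in ntimes = 16 rounds (15 full
16 MiB reads plus a final 4 MiB). In this collective each rank only wants
a single 4 MiB block at a 16 MiB stride, so just 16 x 4 MiB = 64 MiB of
that domain is useful data, and 244/64 is the roughly 4x file system
amplification the customer measured.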
ad_bgl - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 0
off 0 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 16,
st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:635: Read conting
offset 0, size 16777216, buffer offset 0
ADIOI_BGL_ReadContig:91:read buflen 16777216, offset 0
ad_bgl - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_BGL_ReadStridedColl:169: rank 1
off 16777216 len 4194304
..romio/adio/ad_bgl/ad_bgl_rdcoll.c:ADIOI_Read_and_exch:475: ntimes 0,
st_loc -1, end_loc -1, coll_bufsize 16777216
-------------------------------------
common - rank 0 aggregator
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 0
off 0 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 16,
st_loc 0, end_loc 255852543, coll_bufsize 16777216
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:727: Read conting
offset 0, size 16777216, buff offset 0
..romio/adio/common/ad_read.c:ADIOI_GEN_ReadContig:61: off 0 len 16777216
common - rank 1
..romio/mpi-io/read_all.c:106: offset 0
..romio/adio/common/ad_read_coll.c:ADIOI_GEN_ReadStridedColl:116: rank 1
off 16777216 len 4194304
..romio/adio/common/ad_read_coll.c:ADIOI_Read_and_exch:578: ntimes 0,
st_loc -1, end_loc -1, coll_bufsize 16777216
>
> it's true that ROMIO does not do strided i/o here and instead could very
> well be reading in holes, but if the unneeded data resides entirely at
> the end of the block, only the needed data will be read.
That's what I was hoping, but that's not what I see with either ad_bgl or
common.
>
> Do you think your customers could just send a test program? I think we
> might be missing important details in the relaying of messages back and
> forth.
>
> ==rob
It's pretty simple if you ignore all the optional compilation (MPI-IO vs.
POSIX); basically:
#define PROC_DATA_SIZE 16
static unsigned g_data_size = PROC_DATA_SIZE; /* Data per process in MiB (2^20) */
#define MB (1000*1000)   /* MB = megabyte = 10^6 bytes */
#define MIB (1024*1024)  /* MiB = mebibyte = 2^20 bytes */
/* Size of one written chunk in MiB */
#define BUFFER_SIZE 4
static char g_buffer[BUFFER_SIZE*MIB];
...
/* Create a derived datatype for one unit of data transfer */
ret = MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
...
/* Compute the node's first byte position in the file */
FileOffset offset = rank*(FileOffset)(g_data_size)*MIB;
int mpi_ret = MPI_File_set_view(fh, offset, g_etype, g_etype, "native",
                                MPI_INFO_NULL);
...
int mpi_ret = MPI_File_write_all(fh, buffer, 1, g_etype, MPI_STATUS_IGNORE);

and it loops writing BUFFER_SIZE-sized chunks until it has written
PROC_DATA_SIZE.
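The read side that shows the 4x presumably just mirrors that loop with
MPI_File_read_all. For reference, a self-contained sketch; the hint
setup, the "testfile" name, and the cleanup are my reconstruction, not
the customer's code:

#include <mpi.h>
#include <stdio.h>

#define MIB (1024*1024)    /* MiB = mebibyte = 2^20 bytes */
#define BUFFER_SIZE 4      /* one transfer unit, in MiB */
#define PROC_DATA_SIZE 16  /* data per process, in MiB */

static char g_buffer[BUFFER_SIZE*MIB];

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    MPI_Datatype g_etype;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* cb_buffer_size is the hint that actually controls the aggregator */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY, info, &fh);

    /* one unit of transfer = one 4 MiB contiguous chunk */
    MPI_Type_contiguous(BUFFER_SIZE*MIB, MPI_CHAR, &g_etype);
    MPI_Type_commit(&g_etype);

    /* each rank's data starts at rank * 16 MiB */
    MPI_Offset offset = (MPI_Offset)rank * PROC_DATA_SIZE * MIB;
    MPI_File_set_view(fh, offset, g_etype, g_etype, "native", MPI_INFO_NULL);

    /* read 4 MiB units collectively until 16 MiB has been read; the
     * first call is the "4M blocks at 16M offsets" pattern above */
    for (int i = 0; i < PROC_DATA_SIZE/BUFFER_SIZE; i++)
        MPI_File_read_all(fh, g_buffer, 1, g_etype, MPI_STATUS_IGNORE);

    MPI_Type_free(&g_etype);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}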