[MPICH] slow IOR when using fileview
Weikuan Yu
wyu at ornl.gov
Fri Jul 13 08:53:14 CDT 2007
Attached is an updated patch for the data sieving issue on Cray XT. It
adds a check to ensure the filetype exists before checking a range of it.
It compiles cleanly on Linux platforms and should work fine there.
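
For readers without the attachment, the added check can be pictured roughly as follows. This is only a minimal sketch under stated assumptions: `flat_node`, `find_flattened`, and `can_check_range` are hypothetical stand-ins, not ROMIO's actual types and not the code in the attached patch. The idea is that the walk over the list of flattened types stops at the end of the list rather than assuming the filetype is present (a filetype may be absent, for instance if it was never flattened).

```c
#include <stddef.h>

/* Hypothetical stand-in for a flattened-datatype list node;
 * only the fields needed for the lookup are modeled. */
typedef struct flat_node {
    int type;                 /* stand-in for an MPI_Datatype handle */
    struct flat_node *next;
} flat_node;

/* Return the node describing `type`, or NULL if the list holds no
 * entry for it. The NULL test is the guard: the loop cannot run off
 * the end of the list. */
static const flat_node *find_flattened(const flat_node *head, int type)
{
    const flat_node *n = head;
    while (n != NULL && n->type != type)
        n = n->next;
    return n;
}

/* Hypothetical wrapper: 1 if a range check may proceed, 0 otherwise. */
static int can_check_range(const flat_node *head, int type)
{
    return find_flattened(head, type) != NULL;
}
```

A caller would fall back to the existing strided path whenever `can_check_range()` returns 0.
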
Sample performance results on Cray XT are shown below. Runs with and
without a file view are very comparable (within run-to-run deviations).
[wyu at jaguar11 IOR-2.9.1]$ yod -SN -sz 8 ./src/C/IOR -a MPIIO -b 10m -t 10m
-w -c -o ufs:/tmp/work/wyu/ior/fileview
IOR-2.9.1: MPI Coordinated Test of Parallel I/O
Run began: Fri Jul 13 09:32:28 2007
Command line used: ./src/C/IOR -a MPIIO -b 10m -t 10m -w -c -o
ufs:/tmp/work/wyu/ior/fileview
Machine: catamount jaguar11
Summary:
rank=0 time=0.062344
rank=1 time=0.062449
rank=5 time=0.065238
rank=7 time=0.065882
rank=6 time=0.066196
rank=4 time=0.066532
rank=3 time=0.067314
rank=2 time=0.067595
access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ----------  ---------  --------   --------   --------   ----
write     820.76     10240       10240      0.022472   0.074596   0.006469   0
Max Write: 820.76 MiB/sec (860.63 MB/sec)
Run finished: Fri Jul 13 09:32:28 2007
[wyu at jaguar11 IOR-2.9.1]$ yod -SN -sz 8 ./src/C/IOR -a MPIIO -b 10m -t 10m
-w -c -V -o ufs:/tmp/work/wyu/ior/fileview
IOR-2.9.1: MPI Coordinated Test of Parallel I/O
Run began: Fri Jul 13 09:32:37 2007
Command line used: ./src/C/IOR -a MPIIO -b 10m -t 10m -w -c -V -o
ufs:/tmp/work/wyu/ior/fileview
Machine: catamount jaguar11
Summary:
rank=0 time=0.056000
rank=1 time=0.058486
rank=5 time=0.059965
rank=7 time=0.061172
rank=6 time=0.061861
rank=4 time=0.062538
rank=2 time=0.063949
rank=3 time=0.064659
access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ----------  ---------  --------   --------   --------   ----
write     838.44     10240       10240      0.022537   0.072470   0.010412   0
Max Write: 838.44 MiB/sec (879.17 MB/sec)
Run finished: Fri Jul 13 09:32:37 2007
Weikuan Yu wrote:
> Hi,
>
> As I commented on Wei-keng's fix in another context of our discussion, I
> think this issue of avoiding read-modify-write (RMW) can be taken care
> of in a slightly different way. Attached is the fix I cooked up along the
> direction I suggested earlier. Basically, it exposes two
> additional APIs in adioi.h:
>
> 1. void ADIOI_Filetype_range_iscontig()
> 2. void ADIOI_Filetype_range_start()
>
> The first one is an API that tests the contiguity of the target file
> range for a data input of _count_ datatypes.
>
> The second one is an API that finds the starting parameters in the file
> that correspond to the beginning of a data input. Here, I agree with
> Wei-keng's recommendation of simplifying the while loop that determines
> these starting parameters. It can be incorporated easily into this API
> if so desired.
>
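
The two calls described above can be pictured with a small standalone model. This is a sketch under assumptions, not the patch itself: `flat_type`, `range_start`, and `range_is_contig` are hypothetical names, the filetype is modeled as plain arrays of block offsets and lengths within one filetype extent, and only the explicit-offset case is covered. Note also that the sketch treats a request that exactly fills the rest of a block as contiguous (`>=`), whereas the quoted patch compares with a strict `>`; which one is intended is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened filetype: block i covers bytes
 * [indices[i], indices[i] + blocklens[i]) of one filetype extent. */
typedef struct {
    size_t count;
    const long long *indices;
    const long long *blocklens;
} flat_type;

/* Find the block that holds data byte `pos` of the filetype (holes are
 * not counted). Stores the block index and the absolute byte offset
 * inside the extent, and returns the bytes left in that block from
 * `pos` onward. Returns -1 if `pos` lies beyond the filetype's data. */
static long long range_start(const flat_type *ft, long long pos,
                             size_t *block, long long *abs_off)
{
    long long sum = 0;
    assert(ft != NULL && pos >= 0);
    for (size_t i = 0; i < ft->count; i++) {
        sum += ft->blocklens[i];
        if (sum > pos) {
            if (block != NULL)
                *block = i;
            if (abs_off != NULL)
                *abs_off = ft->indices[i] + pos - (sum - ft->blocklens[i]);
            return sum - pos;
        }
    }
    return -1;
}

/* A request of `len` bytes starting at data byte `pos` is contiguous in
 * the file if it fits in what is left of the starting block. */
static int range_is_contig(const flat_type *ft, long long pos, long long len)
{
    long long left = range_start(ft, pos, NULL, NULL);
    return left >= 0 && left >= len;
}
```

For a filetype with a single 10 MB block per process, a 10 MB request starting at the beginning of that block passes this test, so the contiguous path could be taken.
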
> There are a number of benefits with these additional calls.
>
> -1- By calling API #1, for IO with a simple file view composed of
> contiguous data from each proc, ADIOI_GEN_{Write,Read}StridedColl will
> no longer trigger ADIO_{Write,Read}Strided().
>
> -2- That means no more need to chunk data into 512KB pieces, and no
> associated processing overhead.
>
> -3- On Cray XT, this also means a much reduced number of fcntl calls
> for locking during RMW for data sieving. There is no need to disable
> data sieving or to increase ds buffer sizes on XT.
>
> -4- API #2 can be used to replace about 15 blocks of identical code in
> files such as ad_{write,read}_str.c and others, thereby reducing code
> maintenance effort and improving modularization. For this discussion,
> the cleanup is not included in the patch yet, but it can be done quickly
> if these APIs are to be adopted.
>
> BTW, this is also a fix I am suggesting to Cray for their incorporation.
> Please consider it for upstream integration.
>
> Thanks,
> Weikuan
>
> Yu, Weikuan wrote:
>> The concept of buffertype is implicitly linked with a concrete memory
>> buffer, so it is valid to report its contiguity. However, the filetype is
>> a more abstract feature describing a process's view of a file and its
>> own segments, so its contiguity needs to be reflected more accurately
>> with the associated process and the intended file range. In addition, the
>> buffertype describes the data source, while the filetype describes
>> the data sink. So they really do not intersect.
>> However, I think your idea points in the right direction. Something
>> like the following is what I have in mind for a process to test the
>> contiguity of a file within a range:
>> ADIOI_Filetype_iscontig(filetype, offset, len, &filetype_is_contig);
>> This may avoid sharing the contiguity checking routine between datatype
>> and filetype.
>> ADIOI_Datatype_iscontig(filetype, &filetype_is_contig);
>>
>> Comments?
>> --Weikuan
>>
>>> In fact, this I/O pattern should trigger ADIO_WriteContig() for best
>>> result. I suggest one more test should be given here for checking if
>>> the intersection of the buffertype and filetype is contiguous. If yes,
>>> ADIO_WriteContig() is called. Here, the intersection operation will
>>> involve the current file position. I don't know how complicated
>>> this implementation can be.
>>
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
>>> Sent: Tuesday, July 03, 2007 1:04 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: RE: [MPICH] slow IOR when using fileview
>>>
>>>
>>> I checked the ROMIO source for this particular access pattern.
>>> At first, a few words about the access pattern.
>>> 1) MPI_Type_create_subarray() creates the file access regions like
>>> file: |----------|----------|----------| .... |----------|
>>> P0 P1 P2 P7
>>> Each segment is of size 10MB.
>>> 2) There is no overlapped, interleaved, or non-contiguous access across
>>> all processes. Every file access is a single contiguous write
>>> request.
>>> 3) Write buffer is also contiguous. The write amount is 10 MB, same
>>> across all MPI processes.
>>> 4) The effect of using this file type should be the same as using
>>> explicit file offset without file type.
>>>
>>> In ROMIO source file ad_write_coll.c, in function
>>> ADIOI_GEN_WriteStridedColl(), ADIOI_Datatype_iscontig() is called in
>>> line 141 to check if the file type is contiguous and it returns 0. That
>>> means the file type is not contiguous. In general, this is true,
>>> since the file type is applied to the entire file space repeatedly.
>>> Therefore, in line 153, ADIO_WriteStrided() is called, instead of
>>> ADIO_WriteContig() in line 150. So, data sieving is performed by
>>> default in ADIO_WriteStrided() which chops the 10 MB write into 20
>>> 512KB chunks. For each chunk, a read-modify-write is carried out.
>>>
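
The chunk count in the paragraph above follows from the sizes alone: 10 MB divided by the 512 KB data sieving buffer gives 20 pieces. A minimal sketch of that arithmetic is below. The function names are hypothetical, and the figure of two lock-related fcntl() calls per chunk (one lock, one unlock) is an assumption, chosen because it matches the 40 fcntl() calls reported elsewhere in this thread.

```c
#include <assert.h>

/* Number of buffer-sized pieces needed to cover `total` bytes. */
static long long sieve_chunks(long long total, long long bufsize)
{
    assert(total >= 0 && bufsize > 0);
    return (total + bufsize - 1) / bufsize;   /* ceiling division */
}

/* Lock-related calls, ASSUMING one lock and one unlock per chunk. */
static long long sieve_lock_calls(long long total, long long bufsize)
{
    return 2 * sieve_chunks(total, bufsize);
}
```

With `total = 10 * 1024 * 1024` and `bufsize = 512 * 1024`, `sieve_chunks()` yields 20 and `sieve_lock_calls()` yields 40 per process.
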
>>> In fact, this I/O pattern should trigger ADIO_WriteContig() for best
>>> result. I suggest one more test should be given here for checking if
>>> the intersection of the buffertype and filetype is contiguous. If yes,
>>> ADIO_WriteContig() is called. Here, the intersection operation will
>>> involve the current file position. I don't know how complicated
>>> this implementation can be.
>>>
>>> Wei-keng
>>>
>>>
>>>
>>>
>>> On Mon, 2 Jul 2007, Yu, Weikuan wrote:
>>>
>>>>> If the independent
>>>>> access is used instead, I don't know why each write is divided into
>>>>> 512 KB chunks and locking is ever needed to guarantee the atomic
>>>>> access of the 10 MB contiguous file range. For this particular access
>>>>> pattern, ROMIO should not do read-modify-write at all.
>>>> 512KB is the default buffer size for data sieving. So with a
>>>> 512KB buffer size, each process is only able to write out 512KB of data
>>>> in each call of ADIOI_GEN_WriteStrided. For 10MB, this results in 20
>>>> iterations of write_all(), 40 fcntl() total. CrayPat indicates that
>>>> fcntl() takes 88% of the total wall clock time with fileview, 0% w/o
>>>> fileview.
>>>> --Weikuan
>>>>
>>>
>>
>
> ------------------------------------------------------------------------
>
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_read_coll.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_read_coll.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_read_coll.c 2005-07-09 15:05:52.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_read_coll.c 2007-07-11 12:54:21.000000000 -0400
> @@ -133,6 +133,8 @@ void ADIOI_GEN_ReadStridedColl(ADIO_File
> if (fd->hints->cb_read == ADIOI_HINT_DISABLE
> || (!interleave_count && (fd->hints->cb_read == ADIOI_HINT_AUTO)))
> {
> + int filerange_is_contig;
> +
> /* don't do aggregation */
> if (fd->hints->cb_read != ADIOI_HINT_DISABLE) {
> ADIOI_Free(offset_list);
> @@ -143,8 +145,12 @@ void ADIOI_GEN_ReadStridedColl(ADIO_File
>
> fd->fp_ind = orig_fp;
> ADIOI_Datatype_iscontig(fd->filetype, &filetype_is_contig);
> + ADIOI_Filetype_range_iscontig(fd, offset, file_ptr_type,
> + datatype, count, &filerange_is_contig);
> +
> + if (buftype_is_contig && (filetype_is_contig ||
> + filerange_is_contig)) {
>
> - if (buftype_is_contig && filetype_is_contig) {
> if (file_ptr_type == ADIO_EXPLICIT_OFFSET) {
> off = fd->disp + (fd->etype_size) * offset;
> ADIO_ReadContig(fd, buf, count, datatype, ADIO_EXPLICIT_OFFSET,
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_write_coll.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_write_coll.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_write_coll.c 2006-10-30 16:11:36.000000000 -0500
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_write_coll.c 2007-07-11 12:53:52.000000000 -0400
> @@ -129,6 +129,8 @@ void ADIOI_GEN_WriteStridedColl(ADIO_Fil
> if (fd->hints->cb_write == ADIOI_HINT_DISABLE ||
> (!interleave_count && (fd->hints->cb_write == ADIOI_HINT_AUTO)))
> {
> + int filerange_is_contig;
> +
> /* use independent accesses */
> if (fd->hints->cb_write != ADIOI_HINT_DISABLE) {
> ADIOI_Free(offset_list);
> @@ -139,8 +141,11 @@ void ADIOI_GEN_WriteStridedColl(ADIO_Fil
>
> fd->fp_ind = orig_fp;
> ADIOI_Datatype_iscontig(fd->filetype, &filetype_is_contig);
> + ADIOI_Filetype_range_iscontig(fd, offset, file_ptr_type,
> + datatype, count, &filerange_is_contig);
>
> - if (buftype_is_contig && filetype_is_contig) {
> + if (buftype_is_contig && (filetype_is_contig ||
> + filerange_is_contig)) {
> if (file_ptr_type == ADIO_EXPLICIT_OFFSET) {
> off = fd->disp + (fd->etype_size) * offset;
> ADIO_WriteContig(fd, buf, count, datatype,
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/iscontig.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/iscontig.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/iscontig.c 2007-07-11 13:48:06.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/iscontig.c 2007-07-11 12:47:49.000000000 -0400
> @@ -5,7 +5,6 @@
> */
>
> #include "adio.h"
> -#include "adio_extern.h"
> /* #ifdef MPISGI
> #include "mpisgi2.h"
> #endif */
> @@ -102,3 +101,98 @@ void ADIOI_Datatype_iscontig(MPI_Datatyp
> in other cases as well.*/
> }
> #endif
> +
> +void ADIOI_Filetype_range_start(ADIO_File fd, ADIO_Offset offset, int file_ptr_type,
> + int *start_index, int *start_ftype, int *start_offset, int *start_io_size)
> +{
> + ADIOI_Flatlist_node *flat_file;
> + ADIO_Offset disp, abs_off_in_filetype=0;
> + MPI_Aint filetype_extent;
> +
> + int i, st_io_size=0, st_index=0;
> + int sum, n_etypes_in_filetype, size_in_filetype;
> + int n_filetypes, etype_in_filetype;
> + int flag, filetype_size, etype_size;
> +
> + flat_file = ADIOI_Flatlist;
> + while (flat_file->type != fd->filetype) flat_file = flat_file->next;
> + disp = fd->disp;
> +
> + MPI_Type_size(fd->filetype, &filetype_size);
> + MPI_Type_extent(fd->filetype, &filetype_extent);
> + etype_size = fd->etype_size;
> +
> + if (file_ptr_type == ADIO_INDIVIDUAL) {
> + offset = fd->fp_ind; /* in bytes */
> + n_filetypes = -1;
> + flag = 0;
> + while (!flag) {
> + n_filetypes++;
> + for (i=0; i<flat_file->count; i++) {
> + if (disp + flat_file->indices[i] +
> + (ADIO_Offset) n_filetypes*filetype_extent + flat_file->blocklens[i]
> + >= offset) {
> + st_index = i;
> + st_io_size = (int) (disp + flat_file->indices[i] +
> + (ADIO_Offset) n_filetypes*filetype_extent
> + + flat_file->blocklens[i] - offset);
> + flag = 1;
> + break;
> + }
> + }
> + }
> + } else {
> + n_etypes_in_filetype = filetype_size/etype_size;
> + n_filetypes = (int) (offset / n_etypes_in_filetype);
> + etype_in_filetype = (int) (offset % n_etypes_in_filetype);
> + size_in_filetype = etype_in_filetype * etype_size;
> +
> + sum = 0;
> + for (i=0; i<flat_file->count; i++) {
> + sum += flat_file->blocklens[i];
> + if (sum > size_in_filetype) {
> + st_index = i;
> + st_io_size = sum - size_in_filetype;
> + abs_off_in_filetype = flat_file->indices[i] +
> + size_in_filetype - (sum - flat_file->blocklens[i]);
> + break;
> + }
> + }
> +
> + /* abs. offset in bytes in the file */
> + offset = disp + (ADIO_Offset) n_filetypes*filetype_extent + abs_off_in_filetype;
> + }
> +
> + *start_index = st_index;
> + *start_io_size = st_io_size;
> + *start_offset = offset;
> + *start_ftype = n_filetypes;
> +}
> +
> +void ADIOI_Filetype_range_iscontig(ADIO_File fd, ADIO_Offset offset,
> + int file_ptr_type, MPI_Datatype datatype, int count, int *flag)
> +{
> + int srclen, datatype_size;
> + int st_index, st_ftype, st_offset, st_io_size;
> +
> + MPI_Type_size(datatype, &datatype_size);
> + srclen = datatype_size * count;
> +
> + ADIOI_Filetype_range_start(fd, offset, file_ptr_type,
> + &st_index, &st_ftype, &st_offset, &st_io_size);
> + *flag = st_io_size > srclen ? 1 : 0;
> +}
> +
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/include/adioi.h mpich2-1.0.5p4-new/src/mpi/romio/adio/include/adioi.h
> --- mpich2-1.0.5p4/src/mpi/romio/adio/include/adioi.h 2005-08-12 14:56:56.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/include/adioi.h 2007-07-11 12:46:32.000000000 -0400
> @@ -304,6 +304,10 @@ void *ADIOI_Calloc_fn(size_t nelem, size
> void *ADIOI_Realloc_fn(void *ptr, size_t size, int lineno, char *fname);
> void ADIOI_Free_fn(void *ptr, int lineno, char *fname);
> void ADIOI_Datatype_iscontig(MPI_Datatype datatype, int *flag);
> +void ADIOI_Filetype_range_iscontig(ADIO_File fd, ADIO_Offset offset,
> + int file_ptr_type, MPI_Datatype datatype, int count, int *flag);
> +void ADIOI_Filetype_range_start(ADIO_File fd, ADIO_Offset offset, int file_ptr_type,
> + int *start_index, int *start_ftype, int *start_offset, int *start_io_size);
> void ADIOI_Get_position(ADIO_File fd, ADIO_Offset *offset);
> void ADIOI_Get_eof_offset(ADIO_File fd, ADIO_Offset *eof_offset);
> void ADIOI_Get_byte_offset(ADIO_File fd, ADIO_Offset offset,
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: data-sieving-jaguar.patch
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070713/8f0482be/attachment.diff>