[MPICH] slow IOR when using fileview

Weikuan Yu wyu at ornl.gov
Fri Jul 13 08:53:14 CDT 2007


Attached is an updated patch for the data sieving issue over Cray XT. It 
adds a check to ensure that the filetype exists before testing a range of 
it. It compiles cleanly on Linux platforms and should work fine there.
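
The existence check described above can be sketched roughly as follows.
This is only a minimal illustration, not the code in the attached patch:
flat_node, find_flat() and range_is_contig() are hypothetical stand-ins
for ROMIO's flattened-type list and for the range test, the datatype
handle is modeled as a plain int, and offsets are taken within a single
instance of the filetype (disp and filetype tiling are ignored).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a node of ROMIO's flattened-datatype list. */
typedef struct flat_node {
    int type;                   /* stand-in for an MPI_Datatype handle */
    int count;                  /* number of contiguous blocks */
    const long long *indices;   /* byte offset of each block in the type */
    const long long *blocklens; /* byte length of each block */
    struct flat_node *next;
} flat_node;

/* Look up a type; return NULL instead of walking past the end of the
 * list when the type has not been flattened. */
const flat_node *find_flat(const flat_node *head, int type)
{
    while (head != NULL && head->type != type)
        head = head->next;
    return head;
}

/* Return 1 if the byte range [off, off + len) lies inside one block of
 * the flattened type, 0 otherwise (including when the type is unknown). */
int range_is_contig(const flat_node *head, int type,
                    long long off, long long len)
{
    const flat_node *f = find_flat(head, type);
    int i;

    assert(len >= 0);
    if (f == NULL)
        return 0;               /* filetype does not exist: not contiguous */

    for (i = 0; i < f->count; i++) {
        long long start = f->indices[i];
        long long end = start + f->blocklens[i];
        if (off >= start && off < end)
            return off + len <= end; /* an exact fit counts as contiguous */
    }
    return 0;                   /* offset falls in a hole of the filetype */
}
```

With a single 10 MiB block starting at byte offset 10 MiB (the segment of
P1 in the IOR test), a 10 MiB request at that offset reports contiguous, a
request one byte longer does not, and an unknown type reports
non-contiguous rather than dereferencing a NULL list node.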

Sample performance results over Cray XT are shown here. The cases with and 
without a file view are comparable (the difference is within deviations).

[wyu at jaguar11 IOR-2.9.1]$ yod -SN -sz 8 ./src/C/IOR -a MPIIO -b 10m -t 10m 
-w -c -o ufs:/tmp/work/wyu/ior/fileview
IOR-2.9.1: MPI Coordinated Test of Parallel I/O

Run began: Fri Jul 13 09:32:28 2007
Command line used: ./src/C/IOR -a MPIIO -b 10m -t 10m -w -c -o 
ufs:/tmp/work/wyu/ior/fileview
Machine: catamount jaguar11

Summary:

rank=0 time=0.062344
rank=1 time=0.062449
rank=5 time=0.065238
rank=7 time=0.065882
rank=6 time=0.066196
rank=4 time=0.066532
rank=3 time=0.067314
rank=2 time=0.067595
access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   ----
write     820.76     10240      10240      0.022472   0.074596   0.006469   0

Max Write: 820.76 MiB/sec (860.63 MB/sec)

Run finished: Fri Jul 13 09:32:28 2007
[wyu at jaguar11 IOR-2.9.1]$ yod -SN -sz 8 ./src/C/IOR -a MPIIO -b 10m -t 10m 
-w -c -V -o ufs:/tmp/work/wyu/ior/fileview
IOR-2.9.1: MPI Coordinated Test of Parallel I/O

Run began: Fri Jul 13 09:32:37 2007
Command line used: ./src/C/IOR -a MPIIO -b 10m -t 10m -w -c -V -o 
ufs:/tmp/work/wyu/ior/fileview
Machine: catamount jaguar11

Summary:

rank=0 time=0.056000
rank=1 time=0.058486
rank=5 time=0.059965
rank=7 time=0.061172
rank=6 time=0.061861
rank=4 time=0.062538
rank=2 time=0.063949
rank=3 time=0.064659
access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   ----
write     838.44     10240      10240      0.022537   0.072470   0.010412   0

Max Write: 838.44 MiB/sec (879.17 MB/sec)

Run finished: Fri Jul 13 09:32:37 2007
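
As a sanity check on the two summaries above, the arithmetic can be
reproduced with a few lines of C. This is just a sketch with hypothetical
helper names (mib_to_mb, rel_diff_pct, sieve_chunks); none of them belong
to IOR or ROMIO. It converts MiB/s to MB/s, computes the relative
difference between the two runs, and counts how many 512 KB data sieving
buffers a 10 MB transfer would be split into on the strided path.

```c
#include <stdio.h>

/* Convert MiB/s (2^20 bytes) to MB/s (10^6 bytes). */
double mib_to_mb(double mib_per_s)
{
    return mib_per_s * 1048576.0 / 1000000.0;
}

/* Relative difference of b with respect to a, in percent. */
double rel_diff_pct(double a, double b)
{
    return (b - a) / a * 100.0;
}

/* Number of sieving buffers needed to cover a transfer, rounded up. */
long sieve_chunks(long xfer_bytes, long buf_bytes)
{
    return (xfer_bytes + buf_bytes - 1) / buf_bytes;
}

/* Print the derived numbers for the two runs shown above. */
void print_summary(void)
{
    double no_view = 820.76;    /* MiB/s, run without -V */
    double with_view = 838.44;  /* MiB/s, run with -V */

    printf("no view:   %.2f MB/s\n", mib_to_mb(no_view));
    printf("with view: %.2f MB/s\n", mib_to_mb(with_view));
    printf("difference: %.2f%%\n", rel_diff_pct(no_view, with_view));
    printf("chunks per 10 MB write: %ld\n",
           sieve_chunks(10L * 1024 * 1024, 512L * 1024));
}
```

The conversions agree with the MB/sec figures IOR prints, the two runs
differ by roughly 2%, and the chunk count matches the 20 iterations
mentioned in the quoted discussion below.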


Weikuan Yu wrote:
> Hi,
> 
> As I commented on Wei-keng's fix in another context of our discussion, I 
> think this issue of avoiding read-modify-write (RMW) can be taken care 
> of in a slightly different way. Attached is the fix I put together along 
> the direction I suggested earlier. Basically, it exposes two additional 
> APIs in adioi.h:
> 
> 1. void ADIOI_Filetype_range_iscontig()
> 2. void ADIOI_Filetype_range_start()
> 
> The first one tests the contiguity of the target file range for a data 
> input consisting of _count_ datatypes.
> 
> The second one finds the starting parameters in the file for the region 
> targeted by the beginning of a data input. Here, I agree with Wei-keng's 
> recommendation of simplifying the while loop that determines these 
> starting parameters. It can be incorporated easily into this API if so 
> desired.
> 
> There are a number of benefits with these additional calls.
> 
> -1- By calling API #1, for IO with a simple file view composed of 
> contiguous data from each proc, ADIOI_GEN_{Write,Read}StridedColl will 
> no longer trigger ADIO_{Write,Read}Strided().
> 
> -2- That means no more need to chunk data into 512KB pieces and 
> associated processing overhead.
> 
> -3- Over Cray XT, this also means a much reduced number of fcntl calls 
> for locking during RMW for data sieving. There is no need to disable 
> data sieving or to increase the ds buffer size over XT.
> 
> -4- API #2 can be used to replace about 15 blocks of identical code in 
> files such as ad_{write,read}_str.c and others, thereby reducing code 
> maintenance effort and improving modularization. For this discussion, 
> the cleanup is not included in the patch yet, but it can be done quickly 
> if these APIs are adopted.
> 
> BTW, this is also a fix I am suggesting to Cray for their incorporation. 
> Please consider it for upstream integration.
> 
> Thanks,
> Weikuan
> 
> Yu, Weikuan wrote:
>> The concept of buffertype is implicitly linked with a concrete memory
>> buffer, so it is valid to report its contiguity. However, the filetype
>> is a more abstract feature describing a process's view of a file and its
>> own segments, so its contiguity needs to be reflected more accurately
>> with the associated process and the intended file range. In addition, the
>> buffertype describes the data source, while the filetype describes
>> the data sink. So they really do not intersect.
>> However, I think your idea points in the right direction. Something
>> like the following is what I have in mind for a process to test the
>> contiguity of a file within a range:
>> ADIOI_Filetype_iscontig(filetype, offset, len, &filetype_is_contig);
>> This may avoid sharing the contiguity checking routine between datatype
>> and filetype.
>> ADIOI_Datatype_iscontig(filetype, &filetype_is_contig);
>>
>> Comments?
>> --Weikuan
>>
>>> In fact, this I/O pattern should trigger ADIO_WriteContig() for best 
>>> result. I suggest adding one more test here to check whether 
>>> the intersection of the buffertype and filetype is contiguous. If yes,
>>> ADIO_WriteContig() is called. Here, the intersection operation will 
>>> involve the current file position. I don't know how complicated 
>>> this implementation would be. 
>>
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov 
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
>>> Sent: Tuesday, July 03, 2007 1:04 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: RE: [MPICH] slow IOR when using fileview
>>>
>>>
>>> I checked the ROMIO source for this particular access pattern.
>>> At first, a few words about the access pattern.
>>> 1) MPI_Type_create_subarray() creates the file access regions like
>>>     file: |----------|----------|----------| .... |----------|
>>>               P0         P1          P2               P7
>>>     Each segment is of size 10MB.
>>> 2) There is no overlapped, interleaved, or non-contiguous access across
>>>     all processes. Every file access is a single contiguous write
>>>     request.
>>> 3) Write buffer is also contiguous. The write amount is 10 MB, same
>>>     across all MPI processes.
>>> 4) The effect of using this file type should be the same as using
>>>     explicit file offset without file type.
>>>
>>> In ROMIO source file ad_write_coll.c, in function 
>>> ADIOI_GEN_WriteStridedColl(), ADIOI_Datatype_iscontig() is called in 
>>> line 141 to check if the file type is contiguous and it returns 0. That 
>>> means the file type is not contiguous. In general, this is true, 
>>> since the file type is applied to the entire file space repeatedly. 
>>> Therefore, in line 153, ADIO_WriteStrided() is called, instead of 
>>> ADIO_WriteContig() in line 150. So, data sieving is performed by 
>>> default in ADIO_WriteStrided() which chops the 10 MB write into 20 
>>> 512KB chunks. For each chunk, a read-modify-write is carried out.
>>>
>>> In fact, this I/O pattern should trigger ADIO_WriteContig() for best 
>>> result. I suggest adding one more test here to check whether 
>>> the intersection of the buffertype and filetype is contiguous. If yes,
>>> ADIO_WriteContig() is called. Here, the intersection operation will 
>>> involve the current file position. I don't know how complicated 
>>> this implementation would be.
>>>
>>> Wei-keng
>>>
>>>
>>>
>>>
>>> On Mon, 2 Jul 2007, Yu, Weikuan wrote:
>>>
>>>>> If the independent
>>>>> access is used instead, I don't know why each write is divided into
>>>>> 512 KB chunks and locking is ever needed to guarantee the atomic 
>>>>> access of the 10 MB contiguous file range. For this particular access
>>>>> pattern, ROMIO should not do read-modify-write at all.
>>>> 512KB is the default buffer size for data sieving. So with a 512KB
>>>> buffer size, each process is only able to write out 512KB of data in
>>>> each call of ADIOI_GEN_WriteStrided. For 10MB, this results in 20
>>>> iterations of write_all(), 40 fcntl() total. crayPat indicates that
>>>> fcntl() takes 88% of the total wall clock time with fileview, 0% w/o
>>>> fileview.
>>>> --Weikuan
>>>>
>>>
>>
> 
> ------------------------------------------------------------------------
> 
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_read_coll.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_read_coll.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_read_coll.c	2005-07-09 15:05:52.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_read_coll.c	2007-07-11 12:54:21.000000000 -0400
> @@ -133,6 +133,8 @@ void ADIOI_GEN_ReadStridedColl(ADIO_File
>      if (fd->hints->cb_read == ADIOI_HINT_DISABLE
>  	|| (!interleave_count && (fd->hints->cb_read == ADIOI_HINT_AUTO))) 
>      {
> +	int filerange_is_contig; 
> +
>  	/* don't do aggregation */
>  	if (fd->hints->cb_read != ADIOI_HINT_DISABLE) {
>  	    ADIOI_Free(offset_list);
> @@ -143,8 +145,12 @@ void ADIOI_GEN_ReadStridedColl(ADIO_File
>  
>  	fd->fp_ind = orig_fp;
>  	ADIOI_Datatype_iscontig(fd->filetype, &filetype_is_contig);
> +	ADIOI_Filetype_range_iscontig(fd, offset, file_ptr_type, 
> +		datatype, count, &filerange_is_contig);
> +
> +	if (buftype_is_contig && (filetype_is_contig ||
> +		    filerange_is_contig)) {
>  
> -	if (buftype_is_contig && filetype_is_contig) {
>  	    if (file_ptr_type == ADIO_EXPLICIT_OFFSET) {
>  		off = fd->disp + (fd->etype_size) * offset;
>  		ADIO_ReadContig(fd, buf, count, datatype, ADIO_EXPLICIT_OFFSET,
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_write_coll.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_write_coll.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/ad_write_coll.c	2006-10-30 16:11:36.000000000 -0500
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/ad_write_coll.c	2007-07-11 12:53:52.000000000 -0400
> @@ -129,6 +129,8 @@ void ADIOI_GEN_WriteStridedColl(ADIO_Fil
>      if (fd->hints->cb_write == ADIOI_HINT_DISABLE ||
>  	(!interleave_count && (fd->hints->cb_write == ADIOI_HINT_AUTO)))
>      {
> +	int filerange_is_contig; 
> +
>  	/* use independent accesses */
>  	if (fd->hints->cb_write != ADIOI_HINT_DISABLE) {
>  	    ADIOI_Free(offset_list);
> @@ -139,8 +141,11 @@ void ADIOI_GEN_WriteStridedColl(ADIO_Fil
>  
>  	fd->fp_ind = orig_fp;
>          ADIOI_Datatype_iscontig(fd->filetype, &filetype_is_contig);
> +	ADIOI_Filetype_range_iscontig(fd, offset, file_ptr_type, 
> +		datatype, count, &filerange_is_contig);
>  
> -        if (buftype_is_contig && filetype_is_contig) {
> +	if (buftype_is_contig && (filetype_is_contig ||
> +		    filerange_is_contig)) {
>              if (file_ptr_type == ADIO_EXPLICIT_OFFSET) {
>                  off = fd->disp + (fd->etype_size) * offset;
>                  ADIO_WriteContig(fd, buf, count, datatype,
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/common/iscontig.c mpich2-1.0.5p4-new/src/mpi/romio/adio/common/iscontig.c
> --- mpich2-1.0.5p4/src/mpi/romio/adio/common/iscontig.c	2007-07-11 13:48:06.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/common/iscontig.c	2007-07-11 12:47:49.000000000 -0400
> @@ -5,7 +5,6 @@
>   */
>  
>  #include "adio.h"
> -#include "adio_extern.h"
>  /* #ifdef MPISGI
>  #include "mpisgi2.h"
>  #endif */
> @@ -102,3 +101,98 @@ void ADIOI_Datatype_iscontig(MPI_Datatyp
>         in other cases as well.*/
>  }
>  #endif
> +
> +void ADIOI_Filetype_range_start(ADIO_File fd, ADIO_Offset offset, int file_ptr_type,
> +	int *start_index, int *start_ftype, int *start_offset, int *start_io_size)
> +{
> +    ADIOI_Flatlist_node *flat_file;
> +    ADIO_Offset disp, abs_off_in_filetype=0;
> +    MPI_Aint filetype_extent; 
> +
> +    int i, st_io_size=0, st_index=0;
> +    int sum, n_etypes_in_filetype, size_in_filetype;
> +    int n_filetypes, etype_in_filetype;
> +    int flag, filetype_size, etype_size;
> +
> +    flat_file = ADIOI_Flatlist;
> +    while (flat_file->type != fd->filetype) flat_file = flat_file->next;
> +    disp = fd->disp;
> +
> +    MPI_Type_size(fd->filetype, &filetype_size);
> +    MPI_Type_extent(fd->filetype, &filetype_extent);
> +    etype_size = fd->etype_size;
> +
> +    if (file_ptr_type == ADIO_INDIVIDUAL) {
> +	offset = fd->fp_ind; /* in bytes */
> +	n_filetypes = -1;
> +	flag = 0;
> +	while (!flag) {
> +	    n_filetypes++;
> +	    for (i=0; i<flat_file->count; i++) {
> +		if (disp + flat_file->indices[i] + 
> +		    (ADIO_Offset) n_filetypes*filetype_extent + flat_file->blocklens[i] 
> +			>= offset) {
> +		    st_index = i;
> +		    st_io_size = (int) (disp + flat_file->indices[i] + 
> +			    (ADIO_Offset) n_filetypes*filetype_extent
> +			     + flat_file->blocklens[i] - offset);
> +		    flag = 1;
> +		    break;
> +		}
> +	    }
> +	}
> +    } else {
> +	n_etypes_in_filetype = filetype_size/etype_size;
> +	n_filetypes = (int) (offset / n_etypes_in_filetype);
> +	etype_in_filetype = (int) (offset % n_etypes_in_filetype);
> +	size_in_filetype = etype_in_filetype * etype_size;
> +
> +	sum = 0;
> +	for (i=0; i<flat_file->count; i++) {
> +	    sum += flat_file->blocklens[i];
> +	    if (sum > size_in_filetype) {
> +		st_index = i;
> +		st_io_size = sum - size_in_filetype;
> +		abs_off_in_filetype = flat_file->indices[i] +
> +		    size_in_filetype - (sum - flat_file->blocklens[i]);
> +		break;
> +	    }
> +	}
> +
> +	/* abs. offset in bytes in the file */
> +	offset = disp + (ADIO_Offset) n_filetypes*filetype_extent + abs_off_in_filetype;
> +    }
> +
> +    *start_index   = st_index;
> +    *start_io_size = st_io_size;
> +    *start_offset  = offset;
> +    *start_ftype   = n_filetypes;
> +}
> +
> +void ADIOI_Filetype_range_iscontig(ADIO_File fd, ADIO_Offset offset, 
> +	int file_ptr_type, MPI_Datatype datatype, int count, int *flag)
> +{
> +    int srclen, datatype_size;
> +    int st_index, st_ftype, st_offset, st_io_size;
> +
> +    MPI_Type_size(datatype, &datatype_size);
> +    srclen = datatype_size * count;
> +
> +    ADIOI_Filetype_range_start(fd, offset, file_ptr_type,
> +	    &st_index, &st_ftype, &st_offset, &st_io_size);
> +    *flag = st_io_size > srclen ? 1 : 0;
> +}
> +
> diff -ruNp mpich2-1.0.5p4/src/mpi/romio/adio/include/adioi.h mpich2-1.0.5p4-new/src/mpi/romio/adio/include/adioi.h
> --- mpich2-1.0.5p4/src/mpi/romio/adio/include/adioi.h	2005-08-12 14:56:56.000000000 -0400
> +++ mpich2-1.0.5p4-new/src/mpi/romio/adio/include/adioi.h	2007-07-11 12:46:32.000000000 -0400
> @@ -304,6 +304,10 @@ void *ADIOI_Calloc_fn(size_t nelem, size
>  void *ADIOI_Realloc_fn(void *ptr, size_t size, int lineno, char *fname);
>  void ADIOI_Free_fn(void *ptr, int lineno, char *fname);
>  void ADIOI_Datatype_iscontig(MPI_Datatype datatype, int *flag);
> +void ADIOI_Filetype_range_iscontig(ADIO_File fd, ADIO_Offset offset, 
> +	int file_ptr_type, MPI_Datatype datatype, int count, int *flag);
> +void ADIOI_Filetype_range_start(ADIO_File fd, ADIO_Offset offset, int file_ptr_type,
> +	int *start_index, int *start_ftype, int *start_offset, int *start_io_size);
>  void ADIOI_Get_position(ADIO_File fd, ADIO_Offset *offset);
>  void ADIOI_Get_eof_offset(ADIO_File fd, ADIO_Offset *eof_offset);
>  void ADIOI_Get_byte_offset(ADIO_File fd, ADIO_Offset offset,
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: data-sieving-jaguar.patch
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070713/8f0482be/attachment.diff>

