[mpich2-dev] ROMIO: Interleaving test
Pascal Deveze
Pascal.Deveze at bull.net
Thu Sep 2 07:16:21 CDT 2010
Hi Rob,
Thanks for your clear explanation. I undestand now the second part of
the if statement
(st_offsets[i] <= end_offsets[i]) that just tests if the length is not null.
The test is not strictly "<", but "<=" because equality means length=1.
I still have problems with following cases:
1) After an element with len=0, the interleave is not detected
2) With element of len=1, interleave is not detected.
I do not know if this could really be a problem. Here are just some
explanations
on how to see the problem. I also give a possible solution.
I modify your program to test these 2 cases and add a trace in
common/ad_write_coll.c:if (!myrank)
for (i=0; i<nprocs; i++) {
if (!myrank) printf("st_offsets[%d]=%d
end_offsets[%d]=%d\n", i, st_offsets[i], i, end_offsets[i]);
}
for (i=1; i<nprocs; i++)
if ((st_offsets[i] < end_offsets[i-1]) &&
(st_offsets[i] <= end_offsets[i]))
interleave_count++;
/* This is a rudimentary check for interleaving, but should suffice
for the moment. */
if (!myrank) printf("%d: interleave_count=%d\n", interleave_count);
For case 1:
st_offsets[] = {0, 1,2,3}
end_offsets[] = {1023, 0, 5, 2}
For case 2:
st_offsets[] = {1, 1,1,1}
end_offsets[] = {1, 1, 1, 1}
================ The test to see the problem ==========
#include "mpi.h"
#include <stdio.h>
#define LEN 1024
int main(int argc, char **argv)
{
MPI_File fh;
MPI_Status status;
MPI_Offset offset;
int length, nprocs, rank, i;
char buffer[LEN];
for(i=0; i<LEN; i++) buffer[i] = i;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_File_open(MPI_COMM_WORLD, argv[1],
MPI_MODE_CREATE|MPI_MODE_RDWR,
MPI_INFO_NULL, &fh);
// Interleaved data is not detected after an element with null length
/*
* e.g. P0 ( off_0 = 0, len_0 = LEN )
* P1 ( off_1 = 1, len_1 = 0 ) ===> null length
* P2 ( off_2 = 2, len_2 = 4 ) ===> interleaved not
detected
* P3 ( off_3 = 3, len_3 = 0 )
* ....... ........
* */
if ((rank % 2) ==0) length=4;
else length=0;
if (rank == 0) length=LEN;
offset = rank;
MPI_File_write_at_all(fh, offset, buffer, length, MPI_BYTE,
&status);
// Interleaved data is not detected if only one byte is common (len = 1)
/*
* e.g. P0 ( off_0 = 1, len_0 = 1 )
* P1 ( off_1 = 1, len_1 = 1 ) ===> interleave not
detected
* P2 ( off_2 = 1, len_2 = 1 ) ===> interleaved not
detected
* P3 ( off_3 = 1, len_3 = 1 )
* ....... ........
* */
length=1;
offset=1;
MPI_File_seek(fh, 0, MPI_SEEK_SET);
MPI_File_write_at_all(fh, offset, buffer, length, MPI_BYTE,
&status);
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
================ The results =====================
st_offsets[0]=0 end_offsets[0]=1023
st_offsets[1]=1 end_offsets[1]=0
st_offsets[2]=2 end_offsets[2]=5
st_offsets[3]=3 end_offsets[3]=2
0: interleave_count=0
st_offsets[0]=1 end_offsets[0]=1
st_offsets[1]=1 end_offsets[1]=1
st_offsets[2]=1 end_offsets[2]=1
st_offsets[3]=1 end_offsets[3]=1
0: interleave_count=0
The interleaves are not detected.
To detect case 1, the end offset of the last element with non null
length should be
stored and tested with the start_offset of element i.
To detect case 2, the first comparison of the if statement should be
"<=", not "<".
This could be done in the following loop:
ADIO_Offset last_end=end_offsets[0];
for (i=1; i<nprocs; i++) {
if (st_offsets[i] <= last_end) { // Possible interleave for
offset i
if (st_offsets[i] <= end_offsets[i]) // length is not
null, so there is an interleave
interleave_count++;
}
if (st_offsets[i] <= end_offsets[i]) // length is not null,
so change last_end
last_end=end_offsets[i];
/* This is a rudimentary check for interleaving, but should suffice
for the moment. */
}
Pascal
Rob Latham a écrit :
> On Wed, Sep 01, 2010 at 03:37:30PM +0200, Pascal Deveze wrote:
>
>> There is one test that I do not understand. This test is used
>> in the collective read/write to detect if the data are interleaved:
>>
>> /* are the accesses of different processes interleaved? */
>> for (i=1; i<nprocs; i++)
>> if ((st_offsets[i] < end_offsets[i-1]) &&
>> (st_offsets[i] <= end_offsets[i]))
>> interleave_count++;
>> /* This is a rudimentary check for interleaving, but should suffice
>> for the moment. */
>> }
>>
>> The second member of the if statement (st_offsets[i] <=
>> end_offsets[i]) is always verified.
>> I think this should be (st_offsets[i-1] <= end_offsets[i]).
>>
>
> That addition happened 6 years ago, but I can't find the original bug
> report (it's in the old req system, if someone can find "MPICH2 req
> #1174" that might tell us more).
>
> for (i=1; i<nprocs; i++)
> - if (st_offsets[i] < end_offsets[i-1]) interleave_count++;
> + if ((st_offsets[i] < end_offsets[i-1]) &&
> + (st_offsets[i] <= end_offsets[i]))
> + interleave_count++;
> /* This is a rudimentary check for interleaving, but should suffice
> for the moment. */
>
>
> ah, here we go. Back in 2004 Jianwei Li found a bug when some
> processes had zero elements.
>
> "When counting the "interleave_count", segments with length == 0
> should not be counted in even if their starting offsets fall
> within previous segment range."
>
> I'm not sure why the check is for "<=" instead of strictly "<",
> though. Wish I had a test case attached to this old bug report.
>
> Ok, now I do. Attached, and I'll add this to the repository.
>
>
>
>> Do I miss something ?
>>
>
> Yes, but it's not hard to miss this subtle thing: the comment a few
> lines earlier sheds some light on this matter:
>
> /* Note: end_offset points to the last byte-offset that will be accessed.
> e.g., if start_offset=0 and 100 bytes to be read, end_offset=99*/
>
> So, in the test case I attached, if you run it with four procs your st_offsets array and end_offsets array look like this:
>
> st_offsets[] = {0, 1,2,3}
> end_offsets[] = {3, 0, 1, 2}
>
> See, if i do a zero-byte write at offset 3, my start is 3 and my end
> is actually 2. So, st_offsets[i] is not always less than or equal to
> end_offsets[i]. specifically, it won't be if the region was a request
> for zero bytes.
>
>
>> And as the interleave_count is always tested with 0, it should be
>> possible to break the loop
>> after the incrementation of interleave_count.
>>
>
> I suppose we could do something clever like "optimize harder" if the
> interleave count is higher... well, we don't do that :>
>
>
>> In my point of view, the test could be something like:
>> /* are the accesses of different processes interleaved? */
>> for (i=1; i<nprocs; i++)
>> if ((st_offsets[i] < end_offsets[i-1]) &&
>> (st_offsets[i-1] <= end_offsets[i])) {
>> interleave_count=1;
>> break;
>> }
>> /* This is a rudimentary check for interleaving, but should suffice
>> for the moment. */
>>
>
> If I could justify burning a million cpu hours it would be great to
> profile ROMIO on a full rack of Intrepid. I'm sure breaking early
> from loops like this helps scalability a little bit when these arrays
> are 160k elements long.
>
> I think I will leave the st_offsets[i] <= end_offsets[i] as is, but
> put in a better comment. I will, though, break as soon as we find
> something interleaved.
>
> Thanks for the report, though. I am extremely happy you are taking
> such a close look at ROMIO.
>
> ==rob
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/mpich2-dev/attachments/20100902/b85dc726/attachment.htm>
More information about the mpich2-dev
mailing list