[mpich2-dev] ROMIO: Interleaving test

Pascal Deveze Pascal.Deveze at bull.net
Thu Sep 2 07:16:21 CDT 2010


Hi Rob,

Thanks for your clear explanation. I undestand now the second part of 
the if statement
(st_offsets[i] <= end_offsets[i]) that just tests if the length is not null.

The test is not strictly "<", but "<=" because equality means length=1.

I still have problems with following cases:
1) After an element with len=0, the interleave is not detected
2) With element of len=1, interleave is not detected.

I do not know if this could really be a problem. Here are just some 
explanations
on how to see the problem. I also give a possible solution.

I modify your program to test these 2 cases and add a trace in 
common/ad_write_coll.c:if (!myrank)
        for (i=0; i<nprocs; i++) {
                if (!myrank) printf("st_offsets[%d]=%d 
end_offsets[%d]=%d\n", i, st_offsets[i], i, end_offsets[i]);
        }
        for (i=1; i<nprocs; i++)
            if ((st_offsets[i] < end_offsets[i-1]) &&
                (st_offsets[i] <= end_offsets[i]))
                interleave_count++;
        /* This is a rudimentary check for interleaving, but should suffice
           for the moment. */
if (!myrank) printf("%d: interleave_count=%d\n", interleave_count);

For case 1:
          st_offsets[]   = {0, 1,2,3}
         end_offsets[] = {1023, 0, 5, 2}

For case 2:
          st_offsets[]   = {1, 1,1,1}
         end_offsets[] = {1, 1, 1, 1}

================ The test to see the problem ==========
#include "mpi.h"
#include <stdio.h>

#define LEN 1024

int main(int argc, char **argv)
{

        MPI_File fh;
        MPI_Status status;
        MPI_Offset offset;
        int length, nprocs, rank, i;
        char buffer[LEN];

        for(i=0; i<LEN; i++) buffer[i] = i;

        MPI_Init(&argc, &argv);

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, argv[1], 
MPI_MODE_CREATE|MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);

// Interleaved data is not detected after an element with null length
        /*
 *           e.g. P0 ( off_0 = 0,    len_0 = LEN )
 *                P1 ( off_1 = 1,    len_1 = 0 ) ===> null length
 *                P2 ( off_2 = 2,    len_2 = 4 ) ===> interleaved not 
detected
 *                P3 ( off_3 = 3,    len_3 = 0 )
 *                    .......        ........
 *      */
        if ((rank % 2) ==0) length=4;
        else length=0;
        if (rank == 0) length=LEN;

        offset = rank;

        MPI_File_write_at_all(fh, offset, buffer, length, MPI_BYTE, 
&status);

// Interleaved data is not detected if only one byte is common (len = 1)
        /*
 *           e.g. P0 ( off_0 = 1,    len_0 = 1 )
 *                P1 ( off_1 = 1,    len_1 = 1 ) ===> interleave not 
detected
 *                P2 ( off_2 = 1,    len_2 = 1 ) ===> interleaved not 
detected
 *                P3 ( off_3 = 1,    len_3 = 1 )
 *                    .......        ........
 *      */
        length=1;
        offset=1;

        MPI_File_seek(fh, 0, MPI_SEEK_SET);
        MPI_File_write_at_all(fh, offset, buffer, length, MPI_BYTE, 
&status);
        MPI_File_close(&fh);
        MPI_Finalize();

        return 0;
}
================ The results =====================
st_offsets[0]=0 end_offsets[0]=1023
st_offsets[1]=1 end_offsets[1]=0
st_offsets[2]=2 end_offsets[2]=5
st_offsets[3]=3 end_offsets[3]=2
0: interleave_count=0
st_offsets[0]=1 end_offsets[0]=1
st_offsets[1]=1 end_offsets[1]=1
st_offsets[2]=1 end_offsets[2]=1
st_offsets[3]=1 end_offsets[3]=1
0: interleave_count=0

The interleaves are not detected.
To detect case 1, the end offset of the last element with non null 
length should  be
stored and tested with the start_offset of element i.
To detect case 2, the first comparison of the if statement should be 
"<=", not "<".

This could be done in the following loop:

        ADIO_Offset last_end=end_offsets[0];

        for (i=1; i<nprocs; i++) {
            if (st_offsets[i] <= last_end) { // Possible interleave for 
offset i
                if (st_offsets[i] <= end_offsets[i]) // length is not 
null, so there is an interleave
                        interleave_count++;
            }
            if (st_offsets[i] <= end_offsets[i]) // length is not null, 
so change last_end
                last_end=end_offsets[i];
        /* This is a rudimentary check for interleaving, but should suffice
           for the moment. */
        }

Pascal

Rob Latham a écrit :
> On Wed, Sep 01, 2010 at 03:37:30PM +0200, Pascal Deveze wrote:
>   
>> There is one test that I do not understand. This test is used
>> in the collective read/write to detect if the data are interleaved:
>>
>>       /* are the accesses of different processes interleaved? */
>>        for (i=1; i<nprocs; i++)
>>            if ((st_offsets[i] < end_offsets[i-1]) &&
>>                (st_offsets[i] <= end_offsets[i]))
>>                interleave_count++;
>>        /* This is a rudimentary check for interleaving, but should suffice
>>           for the moment. */
>>    }
>>
>> The second member of the if statement (st_offsets[i] <=
>> end_offsets[i]) is always verified.
>> I think this should be (st_offsets[i-1] <= end_offsets[i]).
>>     
>
> That addition happened 6 years ago, but I can't find the original bug
> report (it's in the old req system, if someone can find "MPICH2 req
> #1174" that might tell us more).
>
>         for (i=1; i<nprocs; i++)
> -           if (st_offsets[i] < end_offsets[i-1]) interleave_count++;
> +           if ((st_offsets[i] < end_offsets[i-1]) && 
> +                (st_offsets[i] <= end_offsets[i]))
> +                interleave_count++;
>         /* This is a rudimentary check for interleaving, but should suffice
>            for the moment. */
>
>
> ah, here we go. Back in 2004 Jianwei Li found a bug when some
> processes had zero elements.  
>
>     "When counting the "interleave_count", segments with length == 0
>     should not be counted in even if their starting offsets fall
>     within previous segment range."
>
> I'm not sure why the check is for "<=" instead of strictly "<",
> though.  Wish I had a test case attached to this old bug report.  
>
> Ok, now I do.  Attached, and I'll add this to the repository. 
>
>
>   
>> Do I miss something ?
>>     
>
> Yes, but it's not hard to miss this subtle thing: the comment a few
> lines earlier sheds some light on this matter:
>
>        /* Note: end_offset points to the last byte-offset that will be accessed.
>            e.g., if start_offset=0 and 100 bytes to be read, end_offset=99*/
>
> So, in the test case I attached, if you run it with four procs your st_offsets array and end_offsets array look like this:
>
> st_offsets[] = {0, 1,2,3}
> end_offsets[] = {3, 0, 1, 2}
>
> See, if i do a zero-byte write at offset 3, my start is 3 and my end
> is actually 2.  So, st_offsets[i] is not always less than or equal to
> end_offsets[i]. specifically, it won't be if the region was a request
> for zero bytes.
>
>   
>> And as the interleave_count is always tested with 0, it should be
>> possible to break the loop
>> after the incrementation of interleave_count.
>>     
>
> I suppose we could do something clever like "optimize harder" if the
> interleave count is higher... well, we don't do that :>
>
>   
>> In my point of view, the test could be something like:
>>       /* are the accesses of different processes interleaved? */
>>        for (i=1; i<nprocs; i++)
>>            if ((st_offsets[i] < end_offsets[i-1]) &&
>>                (st_offsets[i-1] <= end_offsets[i])) {
>>                          interleave_count=1;
>>                          break;
>>            }
>>        /* This is a rudimentary check for interleaving, but should suffice
>>           for the moment. */
>>     
>
> If I could justify burning a million cpu hours it would be great to
> profile ROMIO on a full rack of Intrepid.  I'm sure breaking early
> from loops like this helps scalability a little bit when these arrays
> are 160k elements long.
>
> I think I will leave the st_offsets[i] <= end_offsets[i] as is, but
> put in a better comment.  I will, though, break as soon as we find
> something interleaved.
>
> Thanks for the report, though.  I am extremely happy you are taking
> such a close look at ROMIO.  
>
> ==rob
>
>   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/mpich2-dev/attachments/20100902/b85dc726/attachment.htm>


More information about the mpich2-dev mailing list