[mpich2-dev] Problem with MPI_Type_commit() and assert in segment_ops.c
Rob Ross
rross at mcs.anl.gov
Tue Jun 30 16:15:31 CDT 2009
Hi all,
This should be fixed in SVN now.
Rob
On Jun 30, 2009, at 3:38 PM, Rob Ross wrote:
> Hi all,
>
> The bug is somewhere in DLOOP_Segment_index_mpi_flatten(). A quick
> fix is to change the reference to DLOOP_Segment_index_mpi_flatten
> around line 1007 of segment_ops.c to NULL. This makes the test pass;
> however, certain types will be flattened considerably more slowly
> than before.
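>
> For the curious, a hypothetical before/after of that one-line
> workaround -- the surrounding call-site context here is assumed, not
> quoted verbatim from segment_ops.c:
>
> -        DLOOP_Segment_index_mpi_flatten,  /* buggy indexed leaf fn */
> +        NULL,  /* fall back to element-at-a-time processing */
>
> With a NULL leaf function the segment code has to take a more
> generic path, which is why it's correct but slower.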
>
> I will look into a more permanent fix.
>
> Rob
>
> On Jun 30, 2009, at 2:47 PM, Rob Ross wrote:
>
>> Hi all,
>>
>> Ok, so interestingly it appears that MPID_Segment_mpi_flatten() is
>> incorrectly flattening the 256 instances of the MPI_SHORT_INT type.
>> As a reminder, the MPI_SHORT_INT (at least on my Mac) is a 2-byte
>> integer followed by a skip of two bytes and then a 4-byte integer
>> (size = 6, extent = 8). A contiguous region of these, flattened,
>> should have block lengths (in bytes) of 2, 6, 6, ..., 6, 4. In
>> other words the 4-byte and 2-byte regions in the middle should all
>> be combined into 6-byte regions.
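>>
>> For anyone who wants to check those numbers on their own platform,
>> here's a quick standalone test (MPI_Type_size() and
>> MPI_Type_get_extent() are standard MPI; the 6/8 values are just
>> what I see here):
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(void)
>> {
>>     int size;
>>     MPI_Aint lb, extent;
>>
>>     MPI_Init(NULL, NULL);
>>     MPI_Type_size(MPI_SHORT_INT, &size);
>>     MPI_Type_get_extent(MPI_SHORT_INT, &lb, &extent);
>>     /* on my Mac: size=6 extent=8 -- a 2-byte short, 2 bytes of
>>      * padding, then a 4-byte int */
>>     printf("size=%d extent=%ld\n", size, (long)extent);
>>     MPI_Finalize();
>>     return 0;
>> }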
>>
>> Looking at the resulting block lengths, I see something like this:
>>
>> (gdb) print *params.blklens@256
>> $25 = {2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
>> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4,
>> 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2,
>> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
>> 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, ...
>>
>> Weird! Anyway, because the flattening code is unable to correctly
>> combine those regions, it uses up all the indices we previously
>> calculated we would need while processing the first type (the
>> MPI_SHORT_INT), leaving nothing to hold the one last region that
>> the 256 MPI_BYTE types would consume. So when we make the second
>> call to MPID_Segment_mpi_flatten() to process the MPI_BYTE type,
>> we hit that assert.
>>
>> I'm looking into the problem with the flatten code now.
>>
>> Rob
>>
>> On Jun 30, 2009, at 1:40 PM, Rob Ross wrote:
>>
>>> Just a note that I've gotten myself up to speed, have replicated
>>> the bug on my laptop, and will try to figure out what is going on
>>> today.
>>>
>>> Rob
>>>
>>> On Jun 9, 2009, at 4:11 PM, Joe Ratterman wrote:
>>>
>>>> I know it's not that important--and clearly not relevant--but
>>>> BG/P will generate a compiler warning if I use an MPI_Aint cast
>>>> there. We want to avoid any ambiguity that such a cast would
>>>> involve (i.e. is it sign-extended?), so I use a cast that works
>>>> correctly for this sort of micro-test. It is also correct on
>>>> PPC32. This small example shows the warnings:
>>>>
>>>> $ cat -n size.c
>>>> 1 #include <mpi.h>
>>>> 2
>>>> 3 extern void bar(MPI_Aint a, MPI_Aint b, MPI_Aint c);
>>>> 4
>>>> 5 void foo(void* p)
>>>> 6 {
>>>> 7 MPI_Aint aint = (MPI_Aint)p;
>>>> 8
>>>> 9 MPI_Aint one = (long long int)p;
>>>> 10 MPI_Aint two = (int)p;
>>>> 11
>>>> 12 bar(aint, one, two);
>>>> 13 }
>>>>
>>>> $ /bgsys/drivers/ppcfloor/comm/bin/mpicc -Wall -g -c size.c
>>>> size.c: In function 'foo':
>>>> size.c:7: warning: cast from pointer to integer of different size
>>>> size.c:9: warning: cast from pointer to integer of different size
>>>>
>>>>
>>>> Thanks,
>>>> Joe Ratterman
>>>> jratt at us.ibm.com
>>>>
>>>> On Tue, Jun 9, 2009 at 3:50 PM, Rob Ross <rross at mcs.anl.gov> wrote:
>>>> Hi,
>>>>
>>>> Those type casts to (size_t) should be to (MPI_Aint).
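>>>>
>>>> Or skip the cast entirely -- the standard way to turn an address
>>>> into a displacement is MPI_Get_address():
>>>>
>>>> MPI_Aint disp;
>>>> MPI_Get_address(sendbuf, &disp);  /* portable pointer -> MPI_Aint */
>>>> struct_displs[0] = disp;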
>>>>
>>>> That assertion is checking that a parameter being passed to
>>>> Segment_mpi_flatten is > 0. The parameter is the length of the
>>>> list of regions being passed in by reference to be filled in (the
>>>> destination of the list of regions). So for some reason we're
>>>> getting a zero (or possibly negative) value passed in as the
>>>> length of the arrays.
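>>>>
>>>> In other words the length argument is in/out: the caller passes
>>>> in how many region slots are available, and judging by the
>>>> failure mode the remaining count is what gets checked on the next
>>>> call. Roughly (argument names here are illustrative, not the
>>>> exact signature):
>>>>
>>>> int lengthp = nregions;  /* slots available, from a counting pass */
>>>> MPID_Segment_mpi_flatten(&segment, first, &last,
>>>>                          blklens, disps, &lengthp);
>>>> /* the "*lengthp > 0" assert fires if an earlier call already
>>>>  * consumed every slot */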
>>>>
>>>> There's only one place in the struct creation where
>>>> Segment_mpi_flatten() is called; it's line 666 (evil!) of
>>>> dataloop_create_struct.c. This is in
>>>> DLOOP_Dataloop_create_flattened_struct(), which is a function
>>>> used to make a struct into an indexed type.
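>>>>
>>>> "Flattened" here just means reduced to a list of byte regions --
>>>> roughly the moral equivalent of building an hindexed type of
>>>> MPI_BYTE by hand. Illustrative only, for a single MPI_SHORT_INT:
>>>>
>>>> MPI_Datatype flat;
>>>> int blklens[2] = { 2, 4 };     /* the short, then the int */
>>>> MPI_Aint disps[2] = { 0, 4 };  /* the int follows 2 bytes of pad */
>>>> MPI_Type_create_hindexed(2, blklens, disps, MPI_BYTE, &flat);
>>>> MPI_Type_commit(&flat);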
>>>>
>>>> The "pairtypes", such as MPI_SHORT_INT, are special cases in MPI
>>>> in that some of them have more than one "element type" (e.g.
>>>> MPI_INT, MPI_SHORT_INT) in them. My guess is that there's an
>>>> assumption in the DLOOP_Dataloop_create_flattened_struct() code
>>>> path that is having trouble with the pairtype.
>>>>
>>>> I'm surprised that we might have introduced something between
>>>> 1.0.7 and 1.1; I can't recall anything in particular that has
>>>> changed in this code path. Someone should check the repo logs and
>>>> see if something snuck in?
>>>>
>>>> Rob
>>>>
>>>>
>>>> On Jun 9, 2009, at 3:13 PM, Joe Ratterman wrote:
>>>>
>>>> The specifics of this test come from an MPI exerciser that
>>>> gathered (using MPIR_Gather) a variety of types, including
>>>> MPI_SHORT_INT. The way that gather is implemented, it created
>>>> and then sent a struct datatype of the tmp-data from the software
>>>> tree and the local-data. I pulled out the important bits, and
>>>> got this test-case. It asserts on PPC32 Linux 1.1 and BGP
>>>> 1.1rc0, but runs fine on 1.0.7. The addresses/displacements are
>>>> fake, but were originally based on the actual values used inside
>>>> MPIR_Gather. It does the type-create on the first two types just
>>>> to show that it doesn't always fail.
>>>>
>>>>
>>>> Error message:
>>>>
>>>> Creating addr=[0x1,0x2] types=[8c000003,4c00010d]
>>>> struct_displs=[1,2] blocks=[256,256] MPI_BOTTOM=(nil)
>>>> foo:25
>>>> Assertion failed in file segment_ops.c at line 994: *lengthp > 0
>>>> internal ABORT - process 0
>>>>
>>>>
>>>> Code
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <unistd.h>
>>>> #include <mpi.h>
>>>>
>>>> void foo(void *sendbuf,
>>>>          MPI_Datatype sendtype,
>>>>          void *recvbuf,
>>>>          MPI_Datatype recvtype)
>>>> {
>>>>     int blocks[2];
>>>>     MPI_Aint struct_displs[2];
>>>>     MPI_Datatype types[2], tmp_type;
>>>>
>>>>     blocks[0] = 256;
>>>>     struct_displs[0] = (size_t)sendbuf;
>>>>     types[0] = sendtype;
>>>>     blocks[1] = 256;
>>>>     struct_displs[1] = (size_t)recvbuf;
>>>>     types[1] = MPI_BYTE;
>>>>
>>>>     printf("Creating addr=[%p,%p] types=[%x,%x] "
>>>>            "struct_displs=[%x,%x] blocks=[%d,%d] MPI_BOTTOM=%p\n",
>>>>            sendbuf, recvbuf, types[0], types[1],
>>>>            struct_displs[0], struct_displs[1],
>>>>            blocks[0], blocks[1], MPI_BOTTOM);
>>>>     MPI_Type_create_struct(2, blocks, struct_displs, types, &tmp_type);
>>>>     printf("%s:%d\n", __func__, __LINE__);
>>>>     MPI_Type_commit(&tmp_type);
>>>>     printf("%s:%d\n", __func__, __LINE__);
>>>>     MPI_Type_free(&tmp_type);
>>>>     puts("Done");
>>>> }
>>>>
>>>>
>>>> int main()
>>>> {
>>>>     MPI_Init(NULL, NULL);
>>>>
>>>>     foo((void*)0x1,
>>>>         MPI_FLOAT_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>     sleep(1);
>>>>     foo((void*)0x1,
>>>>         MPI_DOUBLE_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>     sleep(1);
>>>>     foo((void*)0x1,
>>>>         MPI_SHORT_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>>
>>>> I don't know anything about how this might be fixed, but we are
>>>> looking into it as well.
>>>>
>>>> Thanks,
>>>> Joe Ratterman
>>>> jratt at us.ibm.com
>>>>
>>>>
>>>
>>
>