[mpich2-dev] Problem with MPI_Type_commit() and assert in segment_ops.c
Rob Ross
rross at mcs.anl.gov
Tue Jun 30 15:38:10 CDT 2009
Hi all,
The bug is somewhere in DLOOP_Segment_index_mpi_flatten(). A quick fix
is to change the reference to DLOOP_Segment_index_mpi_flatten around
line 1007 of segment_ops.c to NULL. This makes the test pass; however,
certain types will be flattened considerably more slowly than before.
I will look into a more permanent fix.
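
For reference, the workaround amounts to something like the change below
(a sketch only; the exact call and argument list around line 1007 are
from memory, not verbatim from segment_ops.c):

    -    DLOOP_Segment_index_mpi_flatten,   /* index-leaf handler */
    +    NULL,                              /* fall back to contig handling */

Presumably the traversal then processes index loops through the contig
path instead, which would explain why some types are flattened more
slowly with this change.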
Rob
On Jun 30, 2009, at 2:47 PM, Rob Ross wrote:
> Hi all,
>
> Ok, so interestingly it appears that MPID_Segment_mpi_flatten() is
> incorrectly flattening the 256 instances of the MPI_SHORT_INT type.
> As a reminder, the MPI_SHORT_INT (at least on my Mac) is a 2-byte
> integer followed by a skip of two bytes and then a 4-byte integer
> (size = 6, extent = 8). A contiguous region of these, flattened,
> should have block lengths (in bytes) of 2, 6, 6, ..., 6, 4. In other
> words the 4-byte and 2-byte regions in the middle should all be
> combined into 6-byte regions.
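>
> (As a side note, not from the original report: a quick standalone way
> to confirm that size/extent on a given platform would be something
> like the following.)
>
>     #include <stdio.h>
>     #include <mpi.h>
>
>     /* print the size and extent of MPI_SHORT_INT;
>        on the layout described above we expect size=6, extent=8 */
>     int main(int argc, char **argv)
>     {
>         int size;
>         MPI_Aint lb, extent;
>
>         MPI_Init(&argc, &argv);
>         MPI_Type_size(MPI_SHORT_INT, &size);
>         MPI_Type_get_extent(MPI_SHORT_INT, &lb, &extent);
>         printf("MPI_SHORT_INT: size=%d lb=%ld extent=%ld\n",
>                size, (long)lb, (long)extent);
>         MPI_Finalize();
>         return 0;
>     }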
>
> Looking at the resulting block lengths, I see something like this:
>
> (gdb) print *params.blklens@256
> $25 = {2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2,
> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2, 4, 2,
> 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, ...
>
> Weird! Anyway, because the flattening code is unable to correctly
> combine those regions, it uses up all of the index slots that we had
> previously calculated we would need while processing the first type
> (the MPI_SHORT_INT), leaving nothing to hold the one last region that
> the 256 MPI_BYTE types would consume. So when we make the second call
> to MPID_Segment_mpi_flatten() to process the MPI_BYTE type, we hit
> that assert.
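>
> (Back-of-the-envelope numbers, for concreteness: if the combining
> worked, 256 SHORT_INTs would flatten to 257 blocks (one 2, then 255
> 6s, then one 4), so the pre-computed count is presumably 258 once the
> single contiguous MPI_BYTE region is added. A flattening that fails
> to combine needs close to two blocks per SHORT_INT, which exhausts
> those slots before the MPI_BYTE region is ever reached.)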
>
> I'm looking into the problem with the flatten code now.
>
> Rob
>
> On Jun 30, 2009, at 1:40 PM, Rob Ross wrote:
>
>> Just a note that I've gotten myself up to speed, have replicated
>> the bug on my laptop, and will try to figure out what is going on
>> today.
>>
>> Rob
>>
>> On Jun 9, 2009, at 4:11 PM, Joe Ratterman wrote:
>>
>>> I know it's not that important--and clearly not relevant--but BG/P
>>> will generate a compiler warning if I use an MPI_Aint cast there.
>>> We want to avoid any ambiguity that such a cast would involve
>>> (i.e., is it sign-extended?), so I use a cast that works correctly
>>> for this sort of micro-test. It is also correct on PPC32. This
>>> small example shows the warnings:
>>>
>>> $ cat -n size.c
>>>      1  #include <mpi.h>
>>>      2
>>>      3  extern void bar(MPI_Aint a, MPI_Aint b, MPI_Aint c);
>>>      4
>>>      5  void foo(void* p)
>>>      6  {
>>>      7      MPI_Aint aint = (MPI_Aint)p;
>>>      8
>>>      9      MPI_Aint one = (long long int)p;
>>>     10      MPI_Aint two = (int)p;
>>>     11
>>>     12      bar(aint, one, two);
>>>     13  }
>>>
>>> $ /bgsys/drivers/ppcfloor/comm/bin/mpicc -Wall -g -c size.c
>>> size.c: In function 'foo':
>>> size.c:7: warning: cast from pointer to integer of different size
>>> size.c:9: warning: cast from pointer to integer of different size
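>>>
>>> (Not what the test uses, but as an aside: routing the cast through
>>> uintptr_t, assuming <stdint.h> is available, e.g.
>>>
>>>     MPI_Aint aint = (MPI_Aint)(uintptr_t)p;  /* zero-extends on 32-bit */
>>>
>>> should avoid the different-size warning while making the extension
>>> behavior explicit.)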
>>>
>>>
>>> Thanks,
>>> Joe Ratterman
>>> jratt at us.ibm.com
>>>
>>> On Tue, Jun 9, 2009 at 3:50 PM, Rob Ross <rross at mcs.anl.gov> wrote:
>>> Hi,
>>>
>>> Those type casts to (size_t) should be to (MPI_Aint).
>>>
>>> That assertion checks that a parameter being passed to
>>> Segment_mpi_flatten() is > 0. The parameter is the length of the
>>> arrays that are passed in by reference to be filled in (the
>>> destination for the flattened list of regions). So for some reason
>>> we're getting a zero (or possibly negative) value passed in as the
>>> length of those arrays.
>>>
>>> There's only one place in the struct creation where
>>> Segment_mpi_flatten() is called; it's line 666 (evil!) of
>>> dataloop_create_struct.c. This is in
>>> DLOOP_Dataloop_create_flattened_struct(), which is a function used
>>> to make a struct into an indexed type.
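>>>
>>> (Just to illustrate what "flattened" means here, and not the
>>> internal code: the end result for, say, two contiguous
>>> MPI_SHORT_INTs is equivalent to an hindexed type of bytes, given
>>> the layout described earlier (short, 2-byte hole, int; extent 8).)
>>>
>>>     #include <mpi.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         /* three byte regions: leading short, combined int+short, final int */
>>>         int      blklens[3] = { 2, 6, 4 };
>>>         MPI_Aint displs[3]  = { 0, 4, 12 };
>>>         MPI_Datatype flat;
>>>
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Type_create_hindexed(3, blklens, displs, MPI_BYTE, &flat);
>>>         MPI_Type_commit(&flat);
>>>         MPI_Type_free(&flat);
>>>         MPI_Finalize();
>>>         return 0;
>>>     }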
>>>
>>> The "pairtypes", such as MPI_SHORT_INT, are special cases in MPI
>>> in that some of them contain more than one "element type" (e.g.
>>> MPI_SHORT_INT holds both an MPI_SHORT and an MPI_INT). My guess is
>>> that there's an assumption in the
>>> DLOOP_Dataloop_create_flattened_struct() code path that is having
>>> trouble with the pairtype.
>>>
>>> I'm surprised that we might have introduced something between
>>> 1.0.7 and 1.1; I can't recall anything in particular that has
>>> changed in this code path. Someone should check the repo logs and
>>> see if something snuck in?
>>>
>>> Rob
>>>
>>>
>>> On Jun 9, 2009, at 3:13 PM, Joe Ratterman wrote:
>>>
>>> The specifics of this test come from an MPI exerciser that
>>> gathered (using MPIR_Gather) a variety of types, including
>>> MPI_SHORT_INT. The way that gather is implemented, it created and
>>> then sent a struct datatype combining the tmp-data from the
>>> software tree and the local data. I pulled out the important bits
>>> and got this test case. It asserts on PPC32 Linux 1.1 and BGP
>>> 1.1rc0, but runs fine on 1.0.7. The addresses/displacements are
>>> fake, but were originally based on the actual values used inside
>>> MPIR_Gather. It does the type-create on the first two types just
>>> to show that it doesn't always fail.
>>>
>>>
>>> Error message:
>>>
>>> Creating addr=[0x1,0x2] types=[8c000003,4c00010d]
>>> struct_displs=[1,2] blocks=[256,256] MPI_BOTTOM=(nil)
>>> foo:25
>>> Assertion failed in file segment_ops.c at line 994: *lengthp > 0
>>> internal ABORT - process 0
>>>
>>>
>>> Code:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <mpi.h>
>>>
>>> void foo(void *sendbuf,
>>>          MPI_Datatype sendtype,
>>>          void *recvbuf,
>>>          MPI_Datatype recvtype)
>>> {
>>>     int blocks[2];
>>>     MPI_Aint struct_displs[2];
>>>     MPI_Datatype types[2], tmp_type;
>>>
>>>     blocks[0] = 256;
>>>     struct_displs[0] = (size_t)sendbuf;
>>>     types[0] = sendtype;
>>>     blocks[1] = 256;
>>>     struct_displs[1] = (size_t)recvbuf;
>>>     types[1] = MPI_BYTE;
>>>
>>>     printf("Creating addr=[%p,%p] types=[%x,%x] struct_displs=[%x,%x] "
>>>            "blocks=[%d,%d] MPI_BOTTOM=%p\n",
>>>            sendbuf, recvbuf, types[0], types[1], struct_displs[0],
>>>            struct_displs[1], blocks[0], blocks[1], MPI_BOTTOM);
>>>     MPI_Type_create_struct(2, blocks, struct_displs, types, &tmp_type);
>>>     printf("%s:%d\n", __func__, __LINE__);
>>>     MPI_Type_commit(&tmp_type);
>>>     printf("%s:%d\n", __func__, __LINE__);
>>>     MPI_Type_free(&tmp_type);
>>>     puts("Done");
>>> }
>>>
>>>
>>> int main()
>>> {
>>>     MPI_Init(NULL, NULL);
>>>
>>>     foo((void*)0x1,
>>>         MPI_FLOAT_INT,
>>>         (void*)0x2,
>>>         MPI_BYTE);
>>>     sleep(1);
>>>     foo((void*)0x1,
>>>         MPI_DOUBLE_INT,
>>>         (void*)0x2,
>>>         MPI_BYTE);
>>>     sleep(1);
>>>     foo((void*)0x1,
>>>         MPI_SHORT_INT,
>>>         (void*)0x2,
>>>         MPI_BYTE);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }
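>>>
>>> (To reproduce, a single rank should be enough, e.g.
>>> "mpicc bug.c -o bug && mpiexec -n 1 ./bug", where bug.c is just a
>>> placeholder name for this file.)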
>>>
>>>
>>>
>>> I don't know anything about how this might be fixed, but we are
>>> looking into it as well.
>>>
>>> Thanks,
>>> Joe Ratterman
>>> jratt at us.ibm.com
>>>
>>>
>>
>