[mpich2-dev] Problem with MPI_Type_commit() and assert in segment_ops.c
Rob Ross
rross at mcs.anl.gov
Tue Jun 30 16:15:31 CDT 2009
Hi all,
This should be fixed in SVN now.
Rob
On Jun 30, 2009, at 3:38 PM, Rob Ross wrote:
> Hi all,
>
> The bug is somewhere in DLOOP_Segment_index_mpi_flatten(). A quick
> fix is to change the reference to DLOOP_Segment_index_mpi_flatten
> around line 1007 of segment_ops.c to NULL. This makes the test pass;
> however, certain types will be flattened considerably more slowly
> than before.
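>
> For the curious, a hypothetical before/after of that one-line
> workaround -- the surrounding call-site context here is assumed, not
> quoted verbatim from segment_ops.c:
>
> -        DLOOP_Segment_index_mpi_flatten,  /* buggy indexed leaf fn */
> +        NULL,  /* fall back to element-at-a-time processing */
>
> With a NULL leaf function the segment code has to take a more
> generic path, which is why it's correct but slower.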
>
> I will look into a more permanent fix.
>
> Rob
>
> On Jun 30, 2009, at 2:47 PM, Rob Ross wrote:
>
>> Hi all,
>>
>> Ok, so interestingly it appears that MPID_Segment_mpi_flatten() is
>> incorrectly flattening the 256 instances of the MPI_SHORT_INT type.
>> As a reminder, the MPI_SHORT_INT (at least on my Mac) is a 2-byte
>> integer followed by a skip of two bytes and then a 4-byte integer
>> (size = 6, extent = 8). A contiguous region of these, flattened,
>> should have block lengths (in bytes) of 2, 6, 6, ..., 6, 4. In
>> other words the 4-byte and 2-byte regions in the middle should all
>> be combined into 6-byte regions.
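>>
>> For anyone who wants to check those numbers on their own platform,
>> here's a quick standalone test (MPI_Type_size() and
>> MPI_Type_get_extent() are standard MPI; the 6/8 values are just
>> what I see here):
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(void)
>> {
>>     int size;
>>     MPI_Aint lb, extent;
>>
>>     MPI_Init(NULL, NULL);
>>     MPI_Type_size(MPI_SHORT_INT, &size);
>>     MPI_Type_get_extent(MPI_SHORT_INT, &lb, &extent);
>>     /* on my Mac: size=6 extent=8 -- a 2-byte short, 2 bytes of
>>      * padding, then a 4-byte int */
>>     printf("size=%d extent=%ld\n", size, (long)extent);
>>     MPI_Finalize();
>>     return 0;
>> }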
>>
>> Looking at the resulting block lengths, I see something like this:
>>
>> (gdb) print *params.blklens@256
>> $25 = {2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
>> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4,
>> 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2,
>> 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 6, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
>> 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, ...
>>
>> Weird! Anyway, because the flattening code is unable to correctly
>> combine those regions, it uses up all the indices we previously
>> calculated we would need while processing the first type (the
>> MPI_SHORT_INT), leaving nothing to hold the one last region that
>> the 256 MPI_BYTE types would consume. So when we make the second
>> call to MPID_Segment_mpi_flatten() to process the MPI_BYTE type,
>> we hit that assert.
>>
>> I'm looking into the problem with the flatten code now.
>>
>> Rob
>>
>> On Jun 30, 2009, at 1:40 PM, Rob Ross wrote:
>>
>>> Just a note that I've gotten myself up to speed, have replicated
>>> the bug on my laptop, and will try to figure out what is going on
>>> today.
>>>
>>> Rob
>>>
>>> On Jun 9, 2009, at 4:11 PM, Joe Ratterman wrote:
>>>
>>>> I know it's not that important--and clearly not relevant--but
>>>> BG/P will generate a compiler warning if I use an MPI_Aint cast
>>>> there. We want to avoid any ambiguity that such a cast would
>>>> involve (i.e. is it sign-extended?), so I use a cast that works
>>>> correctly for this sort of micro-test. It is also correct on
>>>> PPC32. This small example shows the warnings:
>>>>
>>>> $ cat -n size.c
>>>> 1 #include <mpi.h>
>>>> 2
>>>> 3 extern void bar(MPI_Aint a, MPI_Aint b, MPI_Aint c);
>>>> 4
>>>> 5 void foo(void* p)
>>>> 6 {
>>>> 7 MPI_Aint aint = (MPI_Aint)p;
>>>> 8
>>>> 9 MPI_Aint one = (long long int)p;
>>>> 10 MPI_Aint two = (int)p;
>>>> 11
>>>> 12 bar(aint, one, two);
>>>> 13 }
>>>>
>>>> $ /bgsys/drivers/ppcfloor/comm/bin/mpicc -Wall -g -c size.c
>>>> size.c: In function 'foo':
>>>> size.c:7: warning: cast from pointer to integer of different size
>>>> size.c:9: warning: cast from pointer to integer of different size
>>>>
>>>>
>>>> Thanks,
>>>> Joe Ratterman
>>>> jratt at us.ibm.com
>>>>
>>>> On Tue, Jun 9, 2009 at 3:50 PM, Rob Ross <rross at mcs.anl.gov> wrote:
>>>> Hi,
>>>>
>>>> Those type casts to (size_t) should be to (MPI_Aint).
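>>>>
>>>> Or skip the cast entirely -- the standard way to turn an address
>>>> into a displacement is MPI_Get_address():
>>>>
>>>> MPI_Aint disp;
>>>> MPI_Get_address(sendbuf, &disp);  /* portable pointer -> MPI_Aint */
>>>> struct_displs[0] = disp;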
>>>>
>>>> That assertion is checking that a parameter being passed to
>>>> Segment_mpi_flatten is > 0. The parameter is the length of the
>>>> list of regions being passed in by reference to be filled in (the
>>>> destination of the list of regions). So for some reason we're
>>>> getting a zero (or possibly negative) value passed in as the
>>>> length of the arrays.
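>>>>
>>>> In other words the length argument is in/out: the caller passes
>>>> in how many region slots are available, and judging by the
>>>> failure mode the remaining count is what gets checked on the next
>>>> call. Roughly (argument names here are illustrative, not the
>>>> exact signature):
>>>>
>>>> int lengthp = nregions;  /* slots available, from a counting pass */
>>>> MPID_Segment_mpi_flatten(&segment, first, &last,
>>>>                          blklens, disps, &lengthp);
>>>> /* the "*lengthp > 0" assert fires if an earlier call already
>>>>  * consumed every slot */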
>>>>
>>>> There's only one place in the struct creation where
>>>> Segment_mpi_flatten() is called; it's line 666 (evil!) of
>>>> dataloop_create_struct.c. This is in
>>>> DLOOP_Dataloop_create_flattened_struct(), which is a function
>>>> used to make a struct into an indexed type.
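>>>>
>>>> "Flattened" here just means reduced to a list of byte regions --
>>>> roughly the moral equivalent of building an hindexed type of
>>>> MPI_BYTE by hand. Illustrative only, for a single MPI_SHORT_INT:
>>>>
>>>> MPI_Datatype flat;
>>>> int blklens[2] = { 2, 4 };     /* the short, then the int */
>>>> MPI_Aint disps[2] = { 0, 4 };  /* the int follows 2 bytes of pad */
>>>> MPI_Type_create_hindexed(2, blklens, disps, MPI_BYTE, &flat);
>>>> MPI_Type_commit(&flat);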
>>>>
>>>> The "pairtypes", such as MPI_SHORT_INT, are special cases in MPI
>>>> in that some of them have more than one "element type" (e.g.
>>>> MPI_INT, MPI_SHORT_INT) in them. My guess is that there's an
>>>> assumption in the DLOOP_Dataloop_create_flattened_struct() code
>>>> path that is having trouble with the pairtype.
>>>>
>>>> I'm surprised that we might have introduced something between
>>>> 1.0.7 and 1.1; I can't recall anything in particular that has
>>>> changed in this code path. Someone should check the repo logs and
>>>> see if something snuck in?
>>>>
>>>> Rob
>>>>
>>>>
>>>> On Jun 9, 2009, at 3:13 PM, Joe Ratterman wrote:
>>>>
>>>> The specifics of this test come from an MPI exerciser that
>>>> gathered (using MPIR_Gather) a variety of types, including
>>>> MPI_SHORT_INT. The way that gather is implemented, it created
>>>> and then sent a struct datatype of the tmp-data from the software
>>>> tree and the local-data. I pulled out the important bits, and
>>>> got this test-case. It asserts on PPC32 Linux 1.1 and BGP
>>>> 1.1rc0, but runs fine on 1.0.7. The addresses/displacements are
>>>> fake, but were originally based on the actual values used inside
>>>> MPIR_Gather. It does the type-create on the first two types just
>>>> to show that it doesn't always fail.
>>>>
>>>>
>>>> Error message:
>>>>
>>>> Creating addr=[0x1,0x2] types=[8c000003,4c00010d]
>>>> struct_displs=[1,2] blocks=[256,256] MPI_BOTTOM=(nil)
>>>> foo:25
>>>> Assertion failed in file segment_ops.c at line 994: *lengthp > 0
>>>> internal ABORT - process 0
>>>>
>>>>
>>>> Code
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <unistd.h>
>>>> #include <mpi.h>
>>>>
>>>> void foo(void *sendbuf,
>>>>          MPI_Datatype sendtype,
>>>>          void *recvbuf,
>>>>          MPI_Datatype recvtype)
>>>> {
>>>>     int blocks[2];
>>>>     MPI_Aint struct_displs[2];
>>>>     MPI_Datatype types[2], tmp_type;
>>>>
>>>>     blocks[0] = 256;
>>>>     struct_displs[0] = (size_t)sendbuf;
>>>>     types[0] = sendtype;
>>>>     blocks[1] = 256;
>>>>     struct_displs[1] = (size_t)recvbuf;
>>>>     types[1] = MPI_BYTE;
>>>>
>>>>     printf("Creating addr=[%p,%p] types=[%x,%x] "
>>>>            "struct_displs=[%x,%x] blocks=[%d,%d] MPI_BOTTOM=%p\n",
>>>>            sendbuf, recvbuf, types[0], types[1],
>>>>            struct_displs[0], struct_displs[1],
>>>>            blocks[0], blocks[1], MPI_BOTTOM);
>>>>     MPI_Type_create_struct(2, blocks, struct_displs, types, &tmp_type);
>>>>     printf("%s:%d\n", __func__, __LINE__);
>>>>     MPI_Type_commit(&tmp_type);
>>>>     printf("%s:%d\n", __func__, __LINE__);
>>>>     MPI_Type_free(&tmp_type);
>>>>     puts("Done");
>>>> }
>>>>
>>>>
>>>> int main()
>>>> {
>>>>     MPI_Init(NULL, NULL);
>>>>
>>>>     foo((void*)0x1,
>>>>         MPI_FLOAT_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>     sleep(1);
>>>>     foo((void*)0x1,
>>>>         MPI_DOUBLE_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>     sleep(1);
>>>>     foo((void*)0x1,
>>>>         MPI_SHORT_INT,
>>>>         (void*)0x2,
>>>>         MPI_BYTE);
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>>
>>>> I don't know anything about how this might be fixed, but we are
>>>> looking into it as well.
>>>>
>>>> Thanks,
>>>> Joe Ratterman
>>>> jratt at us.ibm.com
>>>>
>>>>
>>>
>>
>