[mpich2-dev] Problem with MPI_Type_commit() and assert in segment_ops.c

Rob Ross rross at mcs.anl.gov
Wed Jun 17 10:35:37 CDT 2009


No progress so far, but I haven't had a good day to concentrate on it  
since you first identified the problem. I'm at a conference all week,  
but I should be able to focus next week...

Rob

On Jun 17, 2009, at 10:27 AM, Joe Ratterman wrote:

> Rob,
>
> Have you (or your team) had any luck tracking this one down?  We  
> haven't been able to trace the cause ourselves.
>
> Thanks,
> Joe Ratterman
> jratt at us.ibm.com
>
>
> On Tue, Jun 9, 2009 at 3:50 PM, Rob Ross <rross at mcs.anl.gov> wrote:
> Hi,
>
> Those type casts to (size_t) should be to (MPI_Aint).
>
> That assertion is checking that a parameter being passed to  
> Segment_mpi_flatten is > 0. The parameter is the length of the list  
> of regions being passed in by reference to be filled in (the  
> destination of the list of regions). So for some reason we're  
> getting a zero (or possibly negative) value passed in as the length  
> of the arrays.
>
> There's only one place in the struct creation where  
> Segment_mpi_flatten() is called; it's line 666 (evil!) of  
> dataloop_create_struct.c. This is in  
> DLOOP_Dataloop_create_flattened_struct(), which is a function used  
> to make a struct into an indexed type.
>
> The "pairtypes", such as MPI_SHORT_INT, are special cases in MPI in  
> that some of them have more than one "element type" in them (e.g.  
> MPI_SHORT_INT combines a short and an int). My guess is that there's  
> an assumption in the DLOOP_Dataloop_create_flattened_struct() code  
> path that is having trouble with the pairtype.
>
> I'm surprised that we might have introduced something between 1.0.7  
> and 1.1; I can't recall anything in particular that has changed in  
> this code path. Someone should check the repo logs and see if  
> something snuck in?
>
> Rob
>
>
> On Jun 9, 2009, at 3:13 PM, Joe Ratterman wrote:
>
> The specifics of this test come from an MPI exerciser that gathered  
> (using MPIR_Gather) a variety of types, including MPI_SHORT_INT.  
> The way that gather is implemented, it created and then sent a  
> struct datatype built from the tmp-data from the software tree and  
> the local-data.  I pulled out the important bits and got this test  
> case.  It asserts on PPC32 Linux 1.1 and BGP 1.1rc0, but runs fine on  
> 1.0.7.  The addresses/displacements are fake, but were originally  
> based on the actual values used inside MPIR_Gather.  It does the  
> type-create on the first two types just to show that it doesn't  
> always fail.
>
>
> Error message:
>
> Creating  addr=[0x1,0x2]  types=[8c000003,4c00010d]  struct_displs=[1,2]  blocks=[256,256]  MPI_BOTTOM=(nil)
> foo:25
> Assertion failed in file segment_ops.c at line 994: *lengthp > 0
> internal ABORT - process 0
>
>
> Code
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
>
> void foo(void *sendbuf,
>         MPI_Datatype sendtype,
>         void *recvbuf,
>         MPI_Datatype recvtype)
> {
>  int blocks[2];
>  MPI_Aint struct_displs[2];
>  MPI_Datatype types[2], tmp_type;
>
>  blocks[0] = 256;
>  struct_displs[0] = (size_t)sendbuf;
>  types[0] = sendtype;
>  blocks[1] = 256;
>  struct_displs[1] = (size_t)recvbuf;
>  types[1] = MPI_BYTE;
>
>  printf("Creating  addr=[%p,%p]  types=[%x,%x]  struct_displs=[%x,%x]  blocks=[%d,%d]  MPI_BOTTOM=%p\n",
>         sendbuf, recvbuf, types[0], types[1], struct_displs[0],
>         struct_displs[1], blocks[0], blocks[1], MPI_BOTTOM);
>  MPI_Type_create_struct(2, blocks, struct_displs, types, &tmp_type);
>  printf("%s:%d\n", __func__, __LINE__);
>  MPI_Type_commit(&tmp_type);
>  printf("%s:%d\n", __func__, __LINE__);
>  MPI_Type_free  (&tmp_type);
>  puts("Done");
> }
>
>
> int main()
> {
>  MPI_Init(NULL, NULL);
>
>  foo((void*)0x1,
>      MPI_FLOAT_INT,
>      (void*)0x2,
>      MPI_BYTE);
>  sleep(1);
>  foo((void*)0x1,
>      MPI_DOUBLE_INT,
>      (void*)0x2,
>      MPI_BYTE);
>  sleep(1);
>  foo((void*)0x1,
>      MPI_SHORT_INT,
>      (void*)0x2,
>      MPI_BYTE);
>
>  MPI_Finalize();
>  return 0;
> }
>
>
>
> I don't know anything about how this might be fixed, but we are  
> looking into it as well.
>
> Thanks,
> Joe Ratterman
> jratt at us.ibm.com
>
>
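
To make the failing check concrete: the assertion described above guards an in/out length parameter, which on entry gives the capacity of the output arrays and on return gives the number of regions written. The sketch below shows that calling convention only. It is not the MPICH code; flatten_blocks() and all of its parameter names are hypothetical, and the region-merging rule is an assumption chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a flatten routine (NOT the MPICH implementation).
 * On entry, *lengthp is the capacity of disps[] and lens[].
 * On return, *lengthp is the number of regions actually written. */
static void flatten_blocks(const long *block_disps, const long *block_lens,
                           int nblocks, long *disps, long *lens, int *lengthp)
{
    int capacity;
    int used = 0;
    int i;

    assert(lengthp != NULL);
    assert(*lengthp > 0);  /* same shape as the check that fires in the report */
    capacity = *lengthp;

    for (i = 0; i < nblocks; i++) {
        if (block_lens[i] == 0)
            continue;  /* empty blocks contribute no region */

        if (used > 0 && disps[used - 1] + lens[used - 1] == block_disps[i]) {
            /* Block starts where the previous region ends: extend it. */
            lens[used - 1] += block_lens[i];
        } else {
            if (used == capacity)
                break;  /* output arrays are full */
            disps[used] = block_disps[i];
            lens[used] = block_lens[i];
            used++;
        }
    }

    *lengthp = used;
}
```

A caller sets the length to the size of its arrays before the call and reads the region count back afterwards. If the caller computes that capacity as zero (or a negative value), the second assert aborts before any work is done, which is the kind of failure the message at line 994 reports.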


