[mpich2-dev] Hvector with Zero Blocks Asserts

Rob Ross rross at mcs.anl.gov
Tue Mar 3 14:11:14 CST 2009


yeah that was a typo. thanks jeff; glad that worked. i imagine that  
dave will integrate on our end so he can close out the ticket. -- rob

On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:

> Rob,
>
> Thanks for the quick reply.  I applied your fix to
> dataloop_create_struct.c (I believe you had a typo when you said
> dataloop_create_segment.c) and it worked.  I assume this will be
> incorporated into a future MPICH2 release?
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>
> From:    Rob Ross <rross at mcs.anl.gov>
> To:      Jeff Parker/Rochester/IBM at IBMUS
> Cc:      mpich2-dev at mcs.anl.gov
> Date:    03/03/2009 11:03 AM
> Subject: Re: Hvector with Zero Blocks Asserts
>
> Hi Jeff,
>
> Interesting. If we were simply asserting on the count, the assertion
> would have fired in MPI_Type_hvector(). The problem isn't really that
> we're mishandling the parameters of the zero-count hvector; that case
> is handled by converting the type into a contiguous type of zero
> integers inside dataloop_create_vector.c.
>
> Instead there is something funny going on with how we build the
> struct. This type should go down the
> DLOOP_Dataloop_create_flattened_struct() path, because only one of the
> three hvector types has no data. I believe that it is the call at line
> 584 that is leading to the assert.
>
> In dataloop_create_segment.c, around line 573, the code block should
> be modified to something like this:
>
> ---
>         else /* derived type; get a count of contig blocks */
>         {
>             DLOOP_Count tmp_nr_blks, sz; /**/
>
>             DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>
>             /* if the derived type has some data to contribute,
>              * add to flattened representation */
>             if ((blklens[i] > 0) && (sz > 0)) { /**/
>                 PREPEND_PREFIX(Segment_init)(NULL,
>                                              (DLOOP_Count) blklens[i],
>                                              oldtypes[i],
>                                              segp,
>                                              flag);
>                 bytes = SEGMENT_IGNORE_LAST;
>
>                 PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>                                                             0,
>                                                             &bytes,
>                                                             &tmp_nr_blks);
>
>                 nr_blks += tmp_nr_blks;
>             } /**/
>         }
> ---
>
> Can you try this out?
>
> Those asserts in the segment code are there specifically to catch
> problems like this, and should not be removed without much careful
> thought. That code should never see counts of zero; we should remove
> that "cruft" during the dataloop creation process; we just missed a
> case.
>
> Thanks for the report and the basis for a new datatype test!
>
> Rob
>
> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>
>>
>> IBM Blue Gene/P has received a customer-reported problem that appears
>> to be in the stock MPICH2 code.  The application commits a datatype
>> that includes an hvector with 0 blocks, which trips an assertion that
>> requires this value to be positive.  The spec says the following,
>> specifically that count is a nonnegative integer, so a value of zero
>> should be allowed:
>>
>> Synopsis
>> #include "mpi.h"
>> int MPI_Type_hvector(
>>       int count,
>>       int blocklen,
>>       MPI_Aint stride,
>>       MPI_Datatype old_type,
>>       MPI_Datatype *newtype )
>>
>> Input Parameters
>>
>>   count       number of blocks (nonnegative integer)
>>
>>   blocklength number of elements in each block
>>               (nonnegative integer)
>>
>>   stride      number of bytes between start of each
>>               block (integer)
>>
>>   old_type    old datatype (handle)
>>
>>
>>
>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>> 1.0.7) and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L
>> (MPICH2 1.0.4p1).  This assertion did not exist in MPICH2 1.0.5p4,
>> but appears in MPICH2 1.0.6 and later versions.
>>
>> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c
>> in function DLOOP_Segment_contig_count_block.  If the assertion is
>> changed from
>>     DLOOP_Assert(*blocks_p > 0);
>> to
>>     DLOOP_Assert(*blocks_p >= 0);
>> it works.
>>
>> There are other places with this assertion, and other similar
>> assertions that may need fixing too:
>>
>> grep -r "*blocks_p >" *
>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p >= 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
>>
>> Reproducer:
>>
>> #include <stdio.h>
>>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>  MPI_Datatype mystruct, vecs[3];
>>  MPI_Aint stride = 5, displs[3];
>>  int i=0, blockcount[3];
>>
>>  MPI_Init(&argc, &argv);
>>
>>  for(i=0;i<3;i++)
>>  {
>>     /* important point appears to be the i==0 vectors here */
>>     MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>     MPI_Type_commit(&vecs[i]);
>>     blockcount[i]=1;
>>  }
>>  displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>
>>  MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>  fprintf(stderr,"Before commiting structure\n");
>>  MPI_Type_commit(&mystruct);
>>  fprintf(stderr,"After commiting structure\n");
>>
>>  MPI_Finalize();
>>
>>
>>  return 0;
>> }
>>
>> Output (in and after MPICH2 1.0.6):
>> Before commiting structure
>> Before commiting structure
>> Assertion failed in
>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>> at line 375: *blocks_p > 0
>> Assertion failed in
>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>> at line 375: *blocks_p > 0
>> Abort(1) on node 1: Internal error
>> Abort(1) on node 0: Internal error
>>
>> Jeff Parker
>>
>
>
>
>


