[mpich2-dev] Hvector with Zero Blocks Asserts

Rob Ross rross at mcs.anl.gov
Thu Mar 5 09:15:13 CST 2009


Awesome. Thanks Jeff. -- Rob

On Mar 5, 2009, at 9:08 AM, Jeff Parker wrote:

> Rob,
>
> Thanks for the second patch.  This patch, in addition to the first,  
> fixes
> everything we are seeing.  I patched our Blue Gene MPICH2, ran the
> recreators, and ran the MPICH2 tests as a regression.  They all pass.
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>
>
>
>  From:       Rob Ross <rross at mcs.anl.gov>
>
>  To:         Jeff Parker/Rochester/IBM at IBMUS
>
>  Cc:         Dave Goodell <goodell at mcs.anl.gov>, mpich2- 
> dev at mcs.anl.gov
>
>  Date:       03/04/2009 12:29 AM
>
>  Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>
>
>
>
>
>
> Hi,
>
> We should have known it wouldn't be so simple :).
>
> First, my patch earlier was incomplete. The flatten call makes two
> passes through the types, and I didn't address the second pass. That's
> the first order problem that you're seeing.
>
> Second, I really should just detect the case where there's no data in
> the struct and just create a trivial zero-element type instead.
>
> I've attached a patch that (in this order):
> - removes a redundant check that I suggested earlier
> - checks for the "no data" case after the first loop through the
> struct elements and generates a trivial null type instead
> - fixes some potential memory leaks in error cases
> - adds logic to skip the no data regions during the actual flattening
> process
>
> I'd like to say that I've tested this, but I have not (long story, off
> topic).
>
> Additionally, around line 258 of dataloop_create_struct.c there is one
> remaining bug going down the heterogeneous struct path that we're
> unlikely to hit, which is that we don't remove zero size elements in
> that case. That should be handled separately, but is low priority due
> to the fact that heterogeneous situations are rare (NickK will want to
> know).
>
> Rob
>
> ---
>
> [attachment "hvector-pt2.diff" deleted by Jeff Parker/Rochester/IBM]
>
>
> On Mar 3, 2009, at 8:05 PM, Jeff Parker wrote:
>
>> Hi Dave & Ross,
>>
>> While testing the fix for Hvector Zero Blocks some more, I tried one
>> variation where all of the MPI_Type_hvector() calls specified zero
>> blocks.
>> Previously, only one of the calls specified zero blocks.  Even with
>> the
>> fix, a new assertion occurred:
>>
>> ---
>> Before commiting structure
>> Assertion failed in
>> file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/
>> datatype/dataloop/segment_ops.c
>> at line 960: *lengthp > 0
>> Abort(1) on node 0: Internal error
>> ---
>>
>> Here's the call stack:
>>
>> src/mpid/common/datatype/dataloop/segment_ops.c:960
>>
>>   951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
>>   952                                          DLOOP_Offset first,
>>   953                                          DLOOP_Offset *lastp,
>>   954                                          int *blklens,
>>   955                                          MPI_Aint *disps,
>>   956                                          int *lengthp)
>>   957 {
>>   958     struct PREPEND_PREFIX(mpi_flatten_params) params;
>>   959
>>   960     DLOOP_Assert(*lengthp > 0);
>>
>> src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
>>
>>   630         if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB &&
>> blklens[i] != 0)
>>   631         {
>>   632             PREPEND_PREFIX(Segment_init)((char *)
>> MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
>>   633                                          (DLOOP_Count)
>> blklens[i],
>>   634                                          oldtypes[i],
>>   635                                          segp,
>>   636                                          0 /* homogeneous */);
>>   637
>>   638             last_ind = nr_blks - first_ind;
>>   639             bytes = SEGMENT_IGNORE_LAST;
>>   640             PREPEND_PREFIX(Segment_mpi_flatten)(segp,
>>   641                                                 0,
>>   642                                                 &bytes,
>>   643                                                 &tmp_blklens
>> [first_ind],
>>   644                                                 &tmp_disps
>> [first_ind],
>>   645                                                 &last_ind);
>>   646             first_ind += last_ind;
>>   647         }
>>
>> src/mpid/common/datatype/dataloop/dataloop_create.c:268
>>
>>   268             PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /*
>> count
>> */,
>>   269                                                    &ints[1] /*
>> blklens */,
>>   270                                                    disps,
>>   271                                                    types /*
>> oldtype
>> array */,
>>   272                                                    dlp_p,
>> dlsz_p,
>> dldepth_p,
>>   273                                                    flag);
>>
>> src/mpid/common/datatype/mpid_type_commit.c:38
>> src/mpi/datatype/type_commit.c:97
>> ---
>>
>> Here's the reproducer.  Same program as before, only the first
>> parameter to
>> MPI_Type_hvector() is always zero.
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>  MPI_Datatype mystruct, vecs[3];
>>  MPI_Aint stride = 5, displs[3];
>>  int i=0, blockcount[3];
>>
>>  MPI_Init(&argc, &argv);
>>
>>  for(i=0;i<3;i++)
>>  {
>>     /* important point appears to be the i==0 vectors here */
>>     MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
>>     MPI_Type_commit(&vecs[i]);
>>     blockcount[i]=1;
>>  }
>>  displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>
>>  MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>  fprintf(stderr,"Before commiting structure\n");
>>  MPI_Type_commit(&mystruct);
>>  fprintf(stderr,"After commiting structure\n");
>>
>>  MPI_Finalize();
>>
>>
>>  return 0;
>> }
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>
>>
>>
>> From:       Dave Goodell <goodell at mcs.anl.gov>
>>
>> To:         mpich2-dev at mcs.anl.gov
>>
>> Cc:         Jeff Parker/Rochester/IBM at IBMUS
>>
>> Date:       03/03/2009 03:38 PM
>>
>> Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>>
>>
>>
>>
>>
>>
>> Hi Jeff,
>>
>> On our side we tracked this in ticket #430 [1].  This is now fixed in
>> the trunk as of r3927.  Thanks again for the bug report.
>>
>> -Dave
>>
>> [1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
>>
>> On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
>>
>>> yeah that was a typo. thanks jeff; glad that worked. i imagine that
>>> dave will integrate on our end so he can close out the ticket. --  
>>> rob
>>>
>>> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>>>
>>>> Rob,
>>>>
>>>> Thanks for the quick reply.  I applied your fix to
>>>> dataloop_create_struct.c
>>>> (I believe you had a typo when you said dataloop_create_segment.c)
>>>> and it
>>>> worked.  I assume this will be incorporated into a future MPICH2
>>>> release?
>>>>
>>>> Jeff Parker
>>>> Blue Gene Messaging
>>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>>> Notes email: Jeff Parker/Rochester/IBM
>>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>>
>>>>
>>>>
>>>> From:       Rob Ross <rross at mcs.anl.gov>
>>>>
>>>> To:         Jeff Parker/Rochester/IBM at IBMUS
>>>>
>>>> Cc:         mpich2-dev at mcs.anl.gov
>>>>
>>>> Date:       03/03/2009 11:03 AM
>>>>
>>>> Subject:    Re: Hvector with Zero Blocks Asserts
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Hi Jeff,
>>>>
>>>> Interesting. If we were simply asserting on the count, that would
>>>> have
>>>> happened in MPI_Type_hvector(). The problem isn't really that we're
>>>> not handling the parameters of the zero-count hvector correctly;
>>>> that
>>>> is handled by converting the type into a contiguous of zero  
>>>> integers
>>>> inside dataloop_create_vector.c.
>>>>
>>>> Instead there is something funny going on with how we build the
>>>> struct. This type should go down the
>>>> DLOOP_Dataloop_create_flattened_struct() path, because only one of
>>>> the
>>>> three hvector types has no data. I believe that it is the call at
>>>> line
>>>> 584 that is leading to the assert.
>>>>
>>>> In dataloop_create_segment.c, around line 573, the code block  
>>>> should
>>>> be modified to something like this:
>>>>
>>>> ---
>>>>      else /* derived type; get a count of contig blocks */
>>>>      {
>>>>          DLOOP_Count tmp_nr_blks, sz; /**/
>>>>
>>>>          DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>>>
>>>>          /* if the derived type has some data to contribute, add
>>>> to flattened representation */
>>>>          if ((blklens[i] > 0) && (sz > 0)) { /**/
>>>>              PREPEND_PREFIX(Segment_init)(NULL,
>>>>                                           (DLOOP_Count) blklens[i],
>>>>                                           oldtypes[i],
>>>>                                           segp,
>>>>                                           flag);
>>>>              bytes = SEGMENT_IGNORE_LAST;
>>>>
>>>>              PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>>>                                                          0,
>>>>                                                          &bytes,
>>>>
>>>> &tmp_nr_blks);
>>>>
>>>>              nr_blks += tmp_nr_blks;
>>>>          } /**/
>>>>      }
>>>> ---
>>>>
>>>> Can you try this out?
>>>>
>>>> Those asserts in the segment code are there specifically to catch
>>>> problems like this, and should not be removed without much careful
>>>> thought. That code should never see counts of zero; we should  
>>>> remove
>>>> that "cruft" during the dataloop creation process; we just missed a
>>>> case.
>>>>
>>>> Thanks for the report and the basis for a new datatype test!
>>>>
>>>> Rob
>>>>
>>>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>>>
>>>>>
>>>>> IBM Blue Gene/P has received a customer-reported problem that
>>>>> appears to be
>>>>> in the stock MPICH2 code.  The application is committing a  
>>>>> datatype
>>>>> consisting of an hvector having 0 blocks, which results in an
>>>>> assertion
>>>>> that is wanting this value to be positive.  The spec says the
>>>>> following,
>>>>> specifically that count is a non-negative integer, so a value of
>>>>> zero
>>>>> should be allowed:
>>>>>
>>>>> Synopsis
>>>>> #include "mpi.h"
>>>>> int MPI_Type_hvector(
>>>>>    int count,
>>>>>    int blocklen,
>>>>>    MPI_Aint stride,
>>>>>    MPI_Datatype old_type,
>>>>>    MPI_Datatype *newtype )
>>>>>
>>>>> Input Parameters
>>>>>
>>>>> count       number of blocks (nonnegative integer)
>>>>>
>>>>> blocklength number of elements in each block
>>>>>            (nonnegative integer)
>>>>>
>>>>> stride      number of bytes between start of each
>>>>>            block (integer)
>>>>>
>>>>> old_type    old datatype (handle)
>>>>>
>>>>>
>>>>>
>>>>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>>>>> 1.0.7) and
>>>>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>>>> 1.0.4p1).
>>>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in
>>>>> MPICH2 1.0.6
>>>>> and later versions.
>>>>>
>>>>> The assertion is in src/mpid/common/datatype/dataloop/
>>>>> segment_ops.c in
>>>>> function DLOOP_Segment_contig_count_block.  If the assertion is
>>>>> changed
>>>>> from
>>>>> DLOOP_Assert(*blocks_p > 0);
>>>>> to
>>>>> DLOOP_Assert(*blocks_p >= 0);
>>>>> it works.
>>>>>
>>>>> There are other places with this assertion, and other similar
>>>>> assertions
>>>>> that may need fixing too:
>>>>>
>>>>> grep -r "*blocks_p >" *
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(*blocks_p >
>>>>> 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> blksz > 0 && *blocks_p > 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> blksz > 0 && *blocks_p > 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(*blocks_p
>>>>>> = 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && blksz > 0 && *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && blksz > 0 && *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && *blocks_p > 0);
>>>>>
>>>>> Reproducer:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>> MPI_Datatype mystruct, vecs[3];
>>>>> MPI_Aint stride = 5, displs[3];
>>>>> int i=0, blockcount[3];
>>>>>
>>>>> MPI_Init(&argc, &argv);
>>>>>
>>>>> for(i=0;i<3;i++)
>>>>> {
>>>>>  /* important point appears to be the i==0 vectors here */
>>>>>  MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>>>  MPI_Type_commit(&vecs[i]);
>>>>>  blockcount[i]=1;
>>>>> }
>>>>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>>>
>>>>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>>> fprintf(stderr,"Before commiting structure\n");
>>>>> MPI_Type_commit(&mystruct);
>>>>> fprintf(stderr,"After commiting structure\n");
>>>>>
>>>>> MPI_Finalize();
>>>>>
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Output (in and after MPICH2 1.0.6):
>>>>> Before commiting structure
>>>>> Before commiting structure
>>>>> Assertion failed in
>>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>>> lib/
>>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>>> at line 375: *blocks_p > 0
>>>>> Assertion failed in
>>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>>> lib/
>>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>>> at line 375: *blocks_p > 0
>>>>> Abort(1) on node 1: Internal error
>>>>> Abort(1) on node 0: Internal error
>>>>>
>>>>> Jeff Parker
>>>>> Blue Gene Messaging
>>>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>>>> Notes email: Jeff Parker/Rochester/IBM
>>>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>
>



More information about the mpich2-dev mailing list