[mpich2-dev] Hvector with Zero Blocks Asserts

Jeff Parker jjparker at us.ibm.com
Thu Mar 5 09:08:23 CST 2009


Rob,

Thanks for the second patch.  This patch, in addition to the first, fixes
everything we are seeing.  I patched our Blue Gene MPICH2, ran the
recreators, and ran the MPICH2 tests as a regression.  They all pass.

Jeff Parker
Blue Gene Messaging
61L/030-2 A407    507-253-4208    TieLine: 553-4208
Notes email: Jeff Parker/Rochester/IBM
INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland


                                                                                                                              
  From:       Rob Ross <rross at mcs.anl.gov>                                                                                    
                                                                                                                              
  To:         Jeff Parker/Rochester/IBM at IBMUS                                                                                 
                                                                                                                              
  Cc:         Dave Goodell <goodell at mcs.anl.gov>, mpich2-dev at mcs.anl.gov                                                      
                                                                                                                              
  Date:       03/04/2009 12:29 AM                                                                                             
                                                                                                                              
  Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts                                                               
                                                                                                                              





Hi,

We should have known it wouldn't be so simple :).

First, my patch earlier was incomplete. The flatten call makes two
passes through the types, and I didn't address the second pass. That's
the first order problem that you're seeing.

Second, I really should just detect the case where there's no data in
the struct and just create a trivial zero-element type instead.

I've attached a patch that (in this order):
- removes a redundant check that I suggested earlier
- checks for the "no data" case after the first loop through the
struct elements and generates a trivial null type instead
- fixes some potential memory leaks in error cases
- adds logic to skip the no data regions during the actual flattening
process

I'd like to say that I've tested this, but I have not (long story, off
topic).

Additionally, around line 258 of dataloop_create_struct.c there is one
remaining bug going down the heterogeneous struct path that we're
unlikely to hit, which is that we don't remove zero size elements in
that case. That should be handled separately, but is low priority due
to the fact that heterogeneous situations are rare (NickK will want to
know).

Rob

---

[attachment "hvector-pt2.diff" deleted by Jeff Parker/Rochester/IBM]


On Mar 3, 2009, at 8:05 PM, Jeff Parker wrote:

> Hi Dave & Ross,
>
> While testing the fix for Hvector Zero Blocks some more, I tried one
> variation where all of the MPI_Type_hvector() calls specified zero
> blocks.
> Previously, only one of the calls specified zero blocks.  Even with
> the
> fix, a new assertion occurred:
>
> ---
> Before commiting structure
> Assertion failed in
> file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/
> datatype/dataloop/segment_ops.c
> at line 960: *lengthp > 0
> Abort(1) on node 0: Internal error
> ---
>
> Here's the call stack:
>
> src/mpid/common/datatype/dataloop/segment_ops.c:960
>
>    951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
>    952                                          DLOOP_Offset first,
>    953                                          DLOOP_Offset *lastp,
>    954                                          int *blklens,
>    955                                          MPI_Aint *disps,
>    956                                          int *lengthp)
>    957 {
>    958     struct PREPEND_PREFIX(mpi_flatten_params) params;
>    959
>    960     DLOOP_Assert(*lengthp > 0);
>
> src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
>
>    630         if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB &&
> blklens[i] != 0)
>    631         {
>    632             PREPEND_PREFIX(Segment_init)((char *)
> MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
>    633                                          (DLOOP_Count)
> blklens[i],
>    634                                          oldtypes[i],
>    635                                          segp,
>    636                                          0 /* homogeneous */);
>    637
>    638             last_ind = nr_blks - first_ind;
>    639             bytes = SEGMENT_IGNORE_LAST;
>    640             PREPEND_PREFIX(Segment_mpi_flatten)(segp,
>    641                                                 0,
>    642                                                 &bytes,
>    643                                                 &tmp_blklens
> [first_ind],
>    644                                                 &tmp_disps
> [first_ind],
>    645                                                 &last_ind);
>    646             first_ind += last_ind;
>    647         }
>
> src/mpid/common/datatype/dataloop/dataloop_create.c:268
>
>    268             PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /*
> count
> */,
>    269                                                    &ints[1] /*
> blklens */,
>    270                                                    disps,
>    271                                                    types /*
> oldtype
> array */,
>    272                                                    dlp_p,
> dlsz_p,
> dldepth_p,
>    273                                                    flag);
>
> src/mpid/common/datatype/mpid_type_commit.c:38
> src/mpi/datatype/type_commit.c:97
> ---
>
> Here's the reproducer.  Same program as before, only the first
> parameter to
> MPI_Type_hvector() is always zero.
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>   MPI_Datatype mystruct, vecs[3];
>   MPI_Aint stride = 5, displs[3];
>   int i=0, blockcount[3];
>
>   MPI_Init(&argc, &argv);
>
>   for(i=0;i<3;i++)
>   {
>      /* important point appears to be the i==0 vectors here */
>      MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
>      MPI_Type_commit(&vecs[i]);
>      blockcount[i]=1;
>   }
>   displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>
>   MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>   fprintf(stderr,"Before commiting structure\n");
>   MPI_Type_commit(&mystruct);
>   fprintf(stderr,"After commiting structure\n");
>
>   MPI_Finalize();
>
>
>   return 0;
> }
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>
>
>
>  From:       Dave Goodell <goodell at mcs.anl.gov>
>
>  To:         mpich2-dev at mcs.anl.gov
>
>  Cc:         Jeff Parker/Rochester/IBM at IBMUS
>
>  Date:       03/03/2009 03:38 PM
>
>  Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>
>
>
>
>
>
> Hi Jeff,
>
> On our side we tracked this in ticket #430 [1].  This is now fixed in
> the trunk as of r3927.  Thanks again for the bug report.
>
> -Dave
>
> [1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
>
> On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
>
>> yeah that was a typo. thanks jeff; glad that worked. i imagine that
>> dave will integrate on our end so he can close out the ticket. -- rob
>>
>> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>>
>>> Rob,
>>>
>>> Thanks for the quick reply.  I applied your fix to
>>> dataloop_create_struct.c
>>> (I believe you had a typo when you said dataloop_create_segment.c)
>>> and it
>>> worked.  I assume this will be incorporated into a future MPICH2
>>> release?
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>
>>>
>>>
>>> From:       Rob Ross <rross at mcs.anl.gov>
>>>
>>> To:         Jeff Parker/Rochester/IBM at IBMUS
>>>
>>> Cc:         mpich2-dev at mcs.anl.gov
>>>
>>> Date:       03/03/2009 11:03 AM
>>>
>>> Subject:    Re: Hvector with Zero Blocks Asserts
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hi Jeff,
>>>
>>> Interesting. If we were simply asserting on the count, that would
>>> have
>>> happened in MPI_Type_hvector(). The problem isn't really that we're
>>> not handling the parameters of the zero-count hvector correctly;
>>> that
>>> is handled by converting the type into a contiguous of zero integers
>>> inside dataloop_create_vector.c.
>>>
>>> Instead there is something funny going on with how we build the
>>> struct. This type should go down the
>>> DLOOP_Dataloop_create_flattened_struct() path, because only one of
>>> the
>>> three hvector types has no data. I believe that it is the call at
>>> line
>>> 584 that is leading to the assert.
>>>
>>> In dataloop_create_segment.c, around line 573, the code block should
>>> be modified to something like this:
>>>
>>> ---
>>>       else /* derived type; get a count of contig blocks */
>>>       {
>>>           DLOOP_Count tmp_nr_blks, sz; /**/
>>>
>>>           DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>>
>>>           /* if the derived type has some data to contribute, add
>>> to flattened representation */
>>>           if ((blklens[i] > 0) && (sz > 0)) { /**/
>>>               PREPEND_PREFIX(Segment_init)(NULL,
>>>                                            (DLOOP_Count) blklens[i],
>>>                                            oldtypes[i],
>>>                                            segp,
>>>                                            flag);
>>>               bytes = SEGMENT_IGNORE_LAST;
>>>
>>>               PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>>                                                           0,
>>>                                                           &bytes,
>>>
>>> &tmp_nr_blks);
>>>
>>>               nr_blks += tmp_nr_blks;
>>>           } /**/
>>>       }
>>> ---
>>>
>>> Can you try this out?
>>>
>>> Those asserts in the segment code are there specifically to catch
>>> problems like this, and should not be removed without much careful
>>> thought. That code should never see counts of zero; we should remove
>>> that "cruft" during the dataloop creation process; we just missed a
>>> case.
>>>
>>> Thanks for the report and the basis for a new datatype test!
>>>
>>> Rob
>>>
>>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>>
>>>>
>>>> IBM Blue Gene/P has received a customer-reported problem that
>>>> appears to be
>>>> in the stock MPICH2 code.  The application is committing a datatype
>>>> consisting of an hvector having 0 blocks, which results in an
>>>> assertion
>>>> that is wanting this value to be positive.  The spec says the
>>>> following,
>>>> specifically that count is a non-negative integer, so a value of
>>>> zero
>>>> should be allowed:
>>>>
>>>> Synopsis
>>>> #include "mpi.h"
>>>> int MPI_Type_hvector(
>>>>     int count,
>>>>     int blocklen,
>>>>     MPI_Aint stride,
>>>>     MPI_Datatype old_type,
>>>>     MPI_Datatype *newtype )
>>>>
>>>> Input Parameters
>>>>
>>>> count       number of blocks (nonnegative integer)
>>>>
>>>> blocklength number of elements in each block
>>>>             (nonnegative integer)
>>>>
>>>> stride      number of bytes between start of each
>>>>             block (integer)
>>>>
>>>> old_type    old datatype (handle)
>>>>
>>>>
>>>>
>>>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>>>> 1.0.7) and
>>>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>>> 1.0.4p1).
>>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in
>>>> MPICH2 1.0.6
>>>> and later versions.
>>>>
>>>> The assertion is in src/mpid/common/datatype/dataloop/
>>>> segment_ops.c in
>>>> function DLOOP_Segment_contig_count_block.  If the assertion is
>>>> changed
>>>> from
>>>> DLOOP_Assert(*blocks_p > 0);
>>>> to
>>>> DLOOP_Assert(*blocks_p >= 0);
>>>> it works.
>>>>
>>>> There are other places with this assertion, and other similar
>>>> assertions
>>>> that may need fixing too:
>>>>
>>>> grep -r "*blocks_p >" *
>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>> DLOOP_Assert(*blocks_p >
>>>> 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count >
>>>> 0 &&
>>>> blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count >
>>>> 0 &&
>>>> blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count >
>>>> 0 &&
>>>> *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>> DLOOP_Assert(*blocks_p
>>>>> = 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count > 0
>>>> && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count > 0
>>>> && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>> DLOOP_Assert(count > 0
>>>> && *blocks_p > 0);
>>>>
>>>> Reproducer:
>>>>
>>>> #include <stdio.h>
>>>>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>> MPI_Datatype mystruct, vecs[3];
>>>> MPI_Aint stride = 5, displs[3];
>>>> int i=0, blockcount[3];
>>>>
>>>> MPI_Init(&argc, &argv);
>>>>
>>>> for(i=0;i<3;i++)
>>>> {
>>>>   /* important point appears to be the i==0 vectors here */
>>>>   MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>>   MPI_Type_commit(&vecs[i]);
>>>>   blockcount[i]=1;
>>>> }
>>>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>>
>>>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>> fprintf(stderr,"Before commiting structure\n");
>>>> MPI_Type_commit(&mystruct);
>>>> fprintf(stderr,"After commiting structure\n");
>>>>
>>>> MPI_Finalize();
>>>>
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> Output (in and after MPICH2 1.0.6):
>>>> Before commiting structure
>>>> Before commiting structure
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>> lib/
>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>> lib/
>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Abort(1) on node 1: Internal error
>>>> Abort(1) on node 0: Internal error
>>>>
>>>> Jeff Parker
>>>> Blue Gene Messaging
>>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>>> Notes email: Jeff Parker/Rochester/IBM
>>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>






More information about the mpich2-dev mailing list