[mpich2-dev] Hvector with Zero Blocks Asserts

Rob Ross rross at mcs.anl.gov
Tue Mar 3 22:46:39 CST 2009


Hi Jeff,

Thanks; let me think about this a little tomorrow and come up with a  
patch. It'll be in the same code as today's bug.

Rob

On Mar 3, 2009, at 8:05 PM, Jeff Parker wrote:

> Hi Dave & Ross,
>
> While testing the fix for Hvector Zero Blocks some more, I tried one
> variation where all of the MPI_Type_hvector() calls specified zero
> blocks.  Previously, only one of the calls specified zero blocks.
> Even with the fix, a new assertion occurred:
>
> ---
> Before commiting structure
> Assertion failed in
> file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
> at line 960: *lengthp > 0
> Abort(1) on node 0: Internal error
> ---
>
> Here's the call stack:
>
> src/mpid/common/datatype/dataloop/segment_ops.c:960
>
>    951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
>    952                                          DLOOP_Offset first,
>    953                                          DLOOP_Offset *lastp,
>    954                                          int *blklens,
>    955                                          MPI_Aint *disps,
>    956                                          int *lengthp)
>    957 {
>    958     struct PREPEND_PREFIX(mpi_flatten_params) params;
>    959
>    960     DLOOP_Assert(*lengthp > 0);
>
> src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
>
>    630         if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB && blklens[i] != 0)
>    631         {
>    632             PREPEND_PREFIX(Segment_init)((char *) MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
>    633                                          (DLOOP_Count) blklens[i],
>    634                                          oldtypes[i],
>    635                                          segp,
>    636                                          0 /* homogeneous */);
>    637
>    638             last_ind = nr_blks - first_ind;
>    639             bytes = SEGMENT_IGNORE_LAST;
>    640             PREPEND_PREFIX(Segment_mpi_flatten)(segp,
>    641                                                 0,
>    642                                                 &bytes,
>    643                                                 &tmp_blklens[first_ind],
>    644                                                 &tmp_disps[first_ind],
>    645                                                 &last_ind);
>    646             first_ind += last_ind;
>    647         }
>
> src/mpid/common/datatype/dataloop/dataloop_create.c:268
>
>    268             PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /* count */,
>    269                                                    &ints[1] /* blklens */,
>    270                                                    disps,
>    271                                                    types /* oldtype array */,
>    272                                                    dlp_p, dlsz_p, dldepth_p,
>    273                                                    flag);
>
> src/mpid/common/datatype/mpid_type_commit.c:38
> src/mpi/datatype/type_commit.c:97
> ---
>
> Here's the reproducer.  Same program as before, only the first
> parameter to MPI_Type_hvector() is always zero.
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>   MPI_Datatype mystruct, vecs[3];
>   MPI_Aint stride = 5, displs[3];
>   int i=0, blockcount[3];
>
>   MPI_Init(&argc, &argv);
>
>   for(i=0;i<3;i++)
>   {
>      /* important point appears to be the i==0 vectors here */
>      MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
>      MPI_Type_commit(&vecs[i]);
>      blockcount[i]=1;
>   }
>   displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>
>   MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>   fprintf(stderr,"Before commiting structure\n");
>   MPI_Type_commit(&mystruct);
>   fprintf(stderr,"After commiting structure\n");
>
>   MPI_Finalize();
>
>
>   return 0;
> }
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>
>
>
>  From:       Dave Goodell <goodell at mcs.anl.gov>
>  To:         mpich2-dev at mcs.anl.gov
>  Cc:         Jeff Parker/Rochester/IBM at IBMUS
>  Date:       03/03/2009 03:38 PM
>  Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>
> Hi Jeff,
>
> On our side we tracked this in ticket #430 [1].  This is now fixed in
> the trunk as of r3927.  Thanks again for the bug report.
>
> -Dave
>
> [1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
>
> On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
>
>> yeah that was a typo. thanks jeff; glad that worked. i imagine that
>> dave will integrate on our end so he can close out the ticket. -- rob
>>
>> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>>
>>> Rob,
>>>
>>> Thanks for the quick reply.  I applied your fix to
>>> dataloop_create_struct.c (I believe you had a typo when you said
>>> dataloop_create_segment.c) and it worked.  I assume this will be
>>> incorporated into a future MPICH2 release?
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>
>>>
>>>
>>> From:       Rob Ross <rross at mcs.anl.gov>
>>> To:         Jeff Parker/Rochester/IBM at IBMUS
>>> Cc:         mpich2-dev at mcs.anl.gov
>>> Date:       03/03/2009 11:03 AM
>>> Subject:    Re: Hvector with Zero Blocks Asserts
>>>
>>> Hi Jeff,
>>>
>>> Interesting. If we were simply asserting on the count, that would
>>> have happened in MPI_Type_hvector(). The problem isn't really that
>>> we're not handling the parameters of the zero-count hvector
>>> correctly; that is handled by converting the type into a contiguous
>>> of zero integers inside dataloop_create_vector.c.
>>>
>>> Instead there is something funny going on with how we build the
>>> struct. This type should go down the
>>> DLOOP_Dataloop_create_flattened_struct() path, because only one of
>>> the three hvector types has no data. I believe that it is the call
>>> at line 584 that is leading to the assert.
>>>
>>> In dataloop_create_segment.c, around line 573, the code block should
>>> be modified to something like this:
>>>
>>> ---
>>>       else /* derived type; get a count of contig blocks */
>>>       {
>>>           DLOOP_Count tmp_nr_blks, sz; /**/
>>>
>>>           DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>>
>>>           /* if the derived type has some data to contribute, add
>>>            * to flattened representation */
>>>           if ((blklens[i] > 0) && (sz > 0)) { /**/
>>>               PREPEND_PREFIX(Segment_init)(NULL,
>>>                                            (DLOOP_Count) blklens[i],
>>>                                            oldtypes[i],
>>>                                            segp,
>>>                                            flag);
>>>               bytes = SEGMENT_IGNORE_LAST;
>>>
>>>               PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>>                                                           0,
>>>                                                           &bytes,
>>>                                                           &tmp_nr_blks);
>>>
>>>               nr_blks += tmp_nr_blks;
>>>           } /**/
>>>       }
>>> ---
>>>
>>> Can you try this out?
>>>
>>> Those asserts in the segment code are there specifically to catch
>>> problems like this, and should not be removed without much careful
>>> thought. That code should never see counts of zero; we should remove
>>> that "cruft" during the dataloop creation process; we just missed a
>>> case.
>>>
>>> Thanks for the report and the basis for a new datatype test!
>>>
>>> Rob
>>>
>>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>>
>>>>
>>>> IBM Blue Gene/P has received a customer-reported problem that
>>>> appears to be in the stock MPICH2 code.  The application is
>>>> committing a datatype consisting of an hvector having 0 blocks,
>>>> which trips an assertion that requires this value to be positive.
>>>> The spec says the following, specifically that count is a
>>>> non-negative integer, so a value of zero should be allowed:
>>>>
>>>> Synopsis
>>>> #include "mpi.h"
>>>> int MPI_Type_hvector(
>>>>     int count,
>>>>     int blocklen,
>>>>     MPI_Aint stride,
>>>>     MPI_Datatype old_type,
>>>>     MPI_Datatype *newtype )
>>>>
>>>> Input Parameters
>>>>
>>>> count       number of blocks (nonnegative integer)
>>>>
>>>> blocklength number of elements in each block
>>>>             (nonnegative integer)
>>>>
>>>> stride      number of bytes between start of each
>>>>             block (integer)
>>>>
>>>> old_type    old datatype (handle)
>>>>
>>>>
>>>>
>>>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>>>> 1.0.7) and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L
>>>> (MPICH2 1.0.4p1).  This assertion did not exist in MPICH2 1.0.5p4,
>>>> but appears in MPICH2 1.0.6 and later versions.
>>>>
>>>> The assertion is in
>>>> src/mpid/common/datatype/dataloop/segment_ops.c in function
>>>> DLOOP_Segment_contig_count_block.  If the assertion is changed from
>>>> DLOOP_Assert(*blocks_p > 0);
>>>> to
>>>> DLOOP_Assert(*blocks_p >= 0);
>>>> it works.
>>>>
>>>> There are other places with this assertion, and other similar
>>>> assertions that may need fixing too:
>>>>
>>>> grep -r "*blocks_p >" *
>>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p >= 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>>
>>>> Reproducer:
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>   MPI_Datatype mystruct, vecs[3];
>>>>   MPI_Aint stride = 5, displs[3];
>>>>   int i=0, blockcount[3];
>>>>
>>>>   MPI_Init(&argc, &argv);
>>>>
>>>>   for(i=0;i<3;i++)
>>>>   {
>>>>      /* important point appears to be the i==0 vectors here */
>>>>      MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>>      MPI_Type_commit(&vecs[i]);
>>>>      blockcount[i]=1;
>>>>   }
>>>>   displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>>
>>>>   MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>>   fprintf(stderr,"Before commiting structure\n");
>>>>   MPI_Type_commit(&mystruct);
>>>>   fprintf(stderr,"After commiting structure\n");
>>>>
>>>>   MPI_Finalize();
>>>>
>>>>   return 0;
>>>> }
>>>>
>>>> Output (in and after MPICH2 1.0.6):
>>>> Before commiting structure
>>>> Before commiting structure
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Abort(1) on node 1: Internal error
>>>> Abort(1) on node 0: Internal error
>>>>
>>>> Jeff Parker
>>>> Blue Gene Messaging
>>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>>> Notes email: Jeff Parker/Rochester/IBM
>>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>>