[mpich2-dev] Hvector with Zero Blocks Asserts

Jeff Parker jjparker at us.ibm.com
Tue Mar 3 20:05:29 CST 2009


Hi Dave & Ross,

While testing the fix for Hvector Zero Blocks some more, I tried one
variation where all of the MPI_Type_hvector() calls specified zero blocks.
Previously, only one of the calls specified zero blocks.  Even with the
fix, a new assertion occurred:

---
Before commiting structure
Assertion failed in
file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
 at line 960: *lengthp > 0
Abort(1) on node 0: Internal error
---

Here's the call stack:

src/mpid/common/datatype/dataloop/segment_ops.c:960

    951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
    952                                          DLOOP_Offset first,
    953                                          DLOOP_Offset *lastp,
    954                                          int *blklens,
    955                                          MPI_Aint *disps,
    956                                          int *lengthp)
    957 {
    958     struct PREPEND_PREFIX(mpi_flatten_params) params;
    959
    960     DLOOP_Assert(*lengthp > 0);

src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640

    630         if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB &&
blklens[i] != 0)
    631         {
    632             PREPEND_PREFIX(Segment_init)((char *)
MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
    633                                          (DLOOP_Count) blklens[i],
    634                                          oldtypes[i],
    635                                          segp,
    636                                          0 /* homogeneous */);
    637
    638             last_ind = nr_blks - first_ind;
    639             bytes = SEGMENT_IGNORE_LAST;
    640             PREPEND_PREFIX(Segment_mpi_flatten)(segp,
    641                                                 0,
    642                                                 &bytes,
    643                                                 &tmp_blklens
[first_ind],
    644                                                 &tmp_disps
[first_ind],
    645                                                 &last_ind);
    646             first_ind += last_ind;
    647         }

src/mpid/common/datatype/dataloop/dataloop_create.c:268

    268             PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /* count
*/,
    269                                                    &ints[1] /*
blklens */,
    270                                                    disps,
    271                                                    types /* oldtype
array */,
    272                                                    dlp_p, dlsz_p,
dldepth_p,
    273                                                    flag);

src/mpid/common/datatype/mpid_type_commit.c:38
src/mpi/datatype/type_commit.c:97
---

Here's the reproducer.  Same program as before, only the first parameter to
MPI_Type_hvector() is always zero.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   MPI_Datatype mystruct, vecs[3];
   MPI_Aint stride = 5, displs[3];
   int i=0, blockcount[3];

   MPI_Init(&argc, &argv);

   for(i=0;i<3;i++)
   {
      /* important point appears to be the i==0 vectors here */
      MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
      MPI_Type_commit(&vecs[i]);
      blockcount[i]=1;
   }
   displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */

   MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
   fprintf(stderr,"Before commiting structure\n");
   MPI_Type_commit(&mystruct);
   fprintf(stderr,"After commiting structure\n");

   MPI_Finalize();


   return 0;
}

Jeff Parker
Blue Gene Messaging
61L/030-2 A407    507-253-4208    TieLine: 553-4208
Notes email: Jeff Parker/Rochester/IBM
INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland


                                                                                                                              
  From:       Dave Goodell <goodell at mcs.anl.gov>                                                                              
                                                                                                                              
  To:         mpich2-dev at mcs.anl.gov                                                                                          
                                                                                                                              
  Cc:         Jeff Parker/Rochester/IBM at IBMUS                                                                                 
                                                                                                                              
  Date:       03/03/2009 03:38 PM                                                                                             
                                                                                                                              
  Subject:    Re: [mpich2-dev] Hvector with Zero Blocks Asserts                                                               
                                                                                                                              





Hi Jeff,

On our side we tracked this in ticket #430 [1].  This is now fixed in
the trunk as of r3927.  Thanks again for the bug report.

-Dave

[1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430

On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:

> yeah that was a typo. thanks jeff; glad that worked. i imagine that
> dave will integrate on our end so he can close out the ticket. -- rob
>
> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>
>> Rob,
>>
>> Thanks for the quick reply.  I applied your fix to
>> dataloop_create_struct.c
>> (I believe you had a typo when you said dataloop_create_segment.c)
>> and it
>> worked.  I assume this will be incorporated into a future MPICH2
>> release?
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>
>>
>>
>> From:       Rob Ross <rross at mcs.anl.gov>
>>
>> To:         Jeff Parker/Rochester/IBM at IBMUS
>>
>> Cc:         mpich2-dev at mcs.anl.gov
>>
>> Date:       03/03/2009 11:03 AM
>>
>> Subject:    Re: Hvector with Zero Blocks Asserts
>>
>>
>>
>>
>>
>>
>> Hi Jeff,
>>
>> Interesting. If we were simply asserting on the count, that would
>> have
>> happened in MPI_Type_hvector(). The problem isn't really that we're
>> not handling the parameters of the zero-count hvector correctly; that
>> is handled by converting the type into a contiguous of zero integers
>> inside dataloop_create_vector.c.
>>
>> Instead there is something funny going on with how we build the
>> struct. This type should go down the
>> DLOOP_Dataloop_create_flattened_struct() path, because only one of
>> the
>> three hvector types has no data. I believe that it is the call at
>> line
>> 584 that is leading to the assert.
>>
>> In dataloop_create_segment.c, around line 573, the code block should
>> be modified to something like this:
>>
>> ---
>>        else /* derived type; get a count of contig blocks */
>>        {
>>            DLOOP_Count tmp_nr_blks, sz; /**/
>>
>>            DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>
>>            /* if the derived type has some data to contribute, add
>> to flattened representation */
>>            if ((blklens[i] > 0) && (sz > 0)) { /**/
>>                PREPEND_PREFIX(Segment_init)(NULL,
>>                                             (DLOOP_Count) blklens[i],
>>                                             oldtypes[i],
>>                                             segp,
>>                                             flag);
>>                bytes = SEGMENT_IGNORE_LAST;
>>
>>                PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>                                                            0,
>>                                                            &bytes,
>>
>> &tmp_nr_blks);
>>
>>                nr_blks += tmp_nr_blks;
>>            } /**/
>>        }
>> ---
>>
>> Can you try this out?
>>
>> Those asserts in the segment code are there specifically to catch
>> problems like this, and should not be removed without much careful
>> thought. That code should never see counts of zero; we should remove
>> that "cruft" during the dataloop creation process; we just missed a
>> case.
>>
>> Thanks for the report and the basis for a new datatype test!
>>
>> Rob
>>
>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>
>>>
>>> IBM Blue Gene/P has received a customer-reported problem that
>>> appears to be
>>> in the stock MPICH2 code.  The application is committing a datatype
>>> consisting of an hvector having 0 blocks, which results in an
>>> assertion
>>> that is wanting this value to be positive.  The spec says the
>>> following,
>>> specifically that count is a non-negative integer, so a value of
>>> zero
>>> should be allowed:
>>>
>>> Synopsis
>>> #include "mpi.h"
>>> int MPI_Type_hvector(
>>>      int count,
>>>      int blocklen,
>>>      MPI_Aint stride,
>>>      MPI_Datatype old_type,
>>>      MPI_Datatype *newtype )
>>>
>>> Input Parameters
>>>
>>>  count       number of blocks (nonnegative integer)
>>>
>>>  blocklength number of elements in each block
>>>              (nonnegative integer)
>>>
>>>  stride      number of bytes between start of each
>>>              block (integer)
>>>
>>>  old_type    old datatype (handle)
>>>
>>>
>>>
>>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>>> 1.0.7) and
>>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>> 1.0.4p1).
>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in
>>> MPICH2 1.0.6
>>> and later versions.
>>>
>>> The assertion is in src/mpid/common/datatype/dataloop/
>>> segment_ops.c in
>>> function DLOOP_Segment_contig_count_block.  If the assertion is
>>> changed
>>> from
>>> DLOOP_Assert(*blocks_p > 0);
>>> to
>>> DLOOP_Assert(*blocks_p >= 0);
>>> it works.
>>>
>>> There are other places with this assertion, and other similar
>>> assertions
>>> that may need fixing too:
>>>
>>> grep -r "*blocks_p >" *
>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>> DLOOP_Assert(*blocks_p >
>>> 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(*blocks_p
>>>> = 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && *blocks_p > 0);
>>>
>>> Reproducer:
>>>
>>> #include <stdio.h>
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> MPI_Datatype mystruct, vecs[3];
>>> MPI_Aint stride = 5, displs[3];
>>> int i=0, blockcount[3];
>>>
>>> MPI_Init(&argc, &argv);
>>>
>>> for(i=0;i<3;i++)
>>> {
>>>    /* important point appears to be the i==0 vectors here */
>>>    MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>    MPI_Type_commit(&vecs[i]);
>>>    blockcount[i]=1;
>>> }
>>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>
>>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>> fprintf(stderr,"Before commiting structure\n");
>>> MPI_Type_commit(&mystruct);
>>> fprintf(stderr,"After commiting structure\n");
>>>
>>> MPI_Finalize();
>>>
>>>
>>> return 0;
>>> }
>>>
>>> Output (in and after MPICH2 1.0.6):
>>> Before commiting structure
>>> Before commiting structure
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Abort(1) on node 1: Internal error
>>> Abort(1) on node 0: Internal error
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>
>>
>>
>>
>>
>






More information about the mpich2-dev mailing list