[mpich2-dev] Hvector with Zero Blocks Asserts

Jeff Parker jjparker at us.ibm.com
Tue Mar 3 14:05:54 CST 2009


Rob,

Thanks for the quick reply.  I applied your fix to dataloop_create_struct.c
(I believe you had a typo when you said dataloop_create_segment.c) and it
worked.  I assume this will be incorporated into a future MPICH2 release?

Jeff Parker
Blue Gene Messaging
61L/030-2 A407    507-253-4208    TieLine: 553-4208
Notes email: Jeff Parker/Rochester/IBM
INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland


                                                                                                                              
From:       Rob Ross <rross at mcs.anl.gov>
To:         Jeff Parker/Rochester/IBM at IBMUS
Cc:         mpich2-dev at mcs.anl.gov
Date:       03/03/2009 11:03 AM
Subject:    Re: Hvector with Zero Blocks Asserts

Hi Jeff,

Interesting. If we were simply asserting on the count, that would have
happened in MPI_Type_hvector(). The problem isn't really that we're
not handling the parameters of the zero-count hvector correctly; that
is handled by converting the type into a contiguous of zero integers
inside dataloop_create_vector.c.
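
To illustrate the conversion described above, here is a minimal sketch. It assumes a made-up, simplified type description (struct sketch_type and the sketch_* functions are hypothetical names, not the MPICH2 dataloop structures or API) and only shows the idea of turning a zero-count hvector into a contiguous type of zero elements at creation time:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type description; NOT MPICH2's dataloop. */
enum sketch_kind { SKETCH_CONTIG, SKETCH_VECTOR };

struct sketch_type {
    enum sketch_kind kind;
    int count;       /* blocks (vector) or elements (contig) */
    int blocklen;    /* elements per block; unused for contig */
    long stride;     /* bytes between block starts; unused for contig */
    size_t el_size;  /* size in bytes of one element */
};

/* Build an hvector-like description. A zero count is legal input, so it
 * is normalized here into "contiguous of zero elements" and later code
 * never has to deal with an empty vector. */
struct sketch_type sketch_make_hvector(int count, int blocklen,
                                       long stride, size_t el_size)
{
    struct sketch_type t;

    assert(count >= 0 && blocklen >= 0); /* both are nonnegative per the spec */

    if (count == 0) {
        t.kind = SKETCH_CONTIG;
        t.count = 0;
        t.blocklen = 0;
        t.stride = 0;
    } else {
        t.kind = SKETCH_VECTOR;
        t.count = count;
        t.blocklen = blocklen;
        t.stride = stride;
    }
    t.el_size = el_size;
    return t;
}

/* Number of data bytes described by the type. */
size_t sketch_type_size(const struct sketch_type *t)
{
    if (t->kind == SKETCH_CONTIG)
        return (size_t) t->count * t->el_size;
    return (size_t) t->count * (size_t) t->blocklen * t->el_size;
}
```

With this sketch, sketch_make_hvector(0, 1, 5, sizeof(int)) yields a contiguous description whose size is 0, which corresponds to the i==0 case in the reproducer.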

Instead there is something funny going on with how we build the
struct. This type should go down the
DLOOP_Dataloop_create_flattened_struct() path, because only one of the
three hvector types has no data. I believe that it is the call at line
584 that is leading to the assert.

In dataloop_create_segment.c, around line 573, the code block should
be modified to something like this:

---
         else /* derived type; get a count of contig blocks */
         {
             DLOOP_Count tmp_nr_blks, sz; /**/

             DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/

             /* if the derived type has some data to contribute, add
              * to flattened representation */
             if ((blklens[i] > 0) && (sz > 0)) { /**/
                 PREPEND_PREFIX(Segment_init)(NULL,
                                              (DLOOP_Count) blklens[i],
                                              oldtypes[i],
                                              segp,
                                              flag);
                 bytes = SEGMENT_IGNORE_LAST;

                 PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
                                                             0,
                                                             &bytes,
                                                             &tmp_nr_blks);

                 nr_blks += tmp_nr_blks;
             } /**/
         }
---

Can you try this out?

Those asserts in the segment code are there specifically to catch
problems like this, and should not be removed without much careful
thought. That code should never see counts of zero; we should remove
that "cruft" during the dataloop creation process; we just missed a
case.
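
The idea of keeping the assert and filtering empty members before they reach it can be sketched in isolation. This is only an illustration under stated assumptions: struct sketch_member and both sketch_* functions are hypothetical stand-ins, not MPICH2 code, and the block count is a simple product rather than what the real segment code computes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one member of a struct datatype. */
struct sketch_member {
    int blklen;         /* block length passed to the struct constructor */
    size_t type_size;   /* data bytes in one instance of the member type */
    int contig_blocks;  /* contiguous blocks in one instance of the type */
};

/* Stand-in for the segment code. It keeps its strict assertion: it must
 * never be asked about a member that contributes no data. */
static int sketch_count_contig_blocks(const struct sketch_member *m)
{
    assert(m->blklen > 0 && m->contig_blocks > 0);
    /* Simplification: assumes no adjacent blocks merge. */
    return m->blklen * m->contig_blocks;
}

/* Stand-in for the struct flattening step. Members with a zero block
 * length or a zero-size type are skipped before the segment code runs. */
int sketch_flattened_block_count(const struct sketch_member *members, int n)
{
    int nr_blks = 0;

    for (int i = 0; i < n; i++) {
        if (members[i].blklen > 0 && members[i].type_size > 0)
            nr_blks += sketch_count_contig_blocks(&members[i]);
    }
    return nr_blks;
}
```

For members shaped like the reproducer's three hvectors (counts 0, 1, and 2 of a 4-byte integer, each with block length 1), the empty member is skipped and the other two contribute 1 and 2 blocks.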

Thanks for the report and the basis for a new datatype test!

Rob

On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:

>
> IBM Blue Gene/P has received a customer-reported problem that appears
> to be in the stock MPICH2 code.  The application commits a datatype
> consisting of an hvector having 0 blocks, which trips an assertion
> that requires this value to be positive.  The spec says the following,
> specifically that count is a nonnegative integer, so a value of zero
> should be allowed:
>
> Synopsis
> #include "mpi.h"
> int MPI_Type_hvector(
>        int count,
>        int blocklen,
>        MPI_Aint stride,
>        MPI_Datatype old_type,
>        MPI_Datatype *newtype )
>
> Input Parameters
>
>    count       number of blocks (nonnegative integer)
>
>    blocklength number of elements in each block
>                (nonnegative integer)
>
>    stride      number of bytes between start of each
>                block (integer)
>
>    old_type    old datatype (handle)
>
>
>
> A reproducer is included below.  It fails on Blue Gene/P (MPICH2 1.0.7)
> and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
> 1.0.4p1).  This assertion did not exist in MPICH2 1.0.5p4, but appears
> in MPICH2 1.0.6 and later versions.
>
> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c in
> function DLOOP_Segment_contig_count_block.  If the assertion is changed
> from
> DLOOP_Assert(*blocks_p > 0);
> to
> DLOOP_Assert(*blocks_p >= 0);
> it works.
>
> There are other places with this assertion, and other similar
> assertions that may need fixing too:
>
> grep -r "*blocks_p >" *
> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p > 0);
> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(*blocks_p >= 0);
> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
> src/mpid/common/datatype/dataloop/segment_ops.c:    DLOOP_Assert(count > 0 && *blocks_p > 0);
>
> Reproducer:
>
> #include <stdio.h>
>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>   MPI_Datatype mystruct, vecs[3];
>   MPI_Aint stride = 5, displs[3];
>   int i=0, blockcount[3];
>
>   MPI_Init(&argc, &argv);
>
>   for(i=0;i<3;i++)
>   {
>      /* important point appears to be the i==0 vectors here */
>      MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>      MPI_Type_commit(&vecs[i]);
>      blockcount[i]=1;
>   }
>   displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>
>   MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>   fprintf(stderr,"Before commiting structure\n");
>   MPI_Type_commit(&mystruct);
>   fprintf(stderr,"After commiting structure\n");
>
>   MPI_Finalize();
>
>
>   return 0;
> }
>
> Output (in and after MPICH2 1.0.6):
> Before commiting structure
> Before commiting structure
> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
> Abort(1) on node 1: Internal error
> Abort(1) on node 0: Internal error
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>





