[mpich2-dev] Hvector with Zero Blocks Asserts

Dave Goodell goodell at mcs.anl.gov
Tue Mar 3 15:36:58 CST 2009


Hi Jeff,

On our side we tracked this in ticket #430 [1].  This is now fixed in  
the trunk as of r3927.  Thanks again for the bug report.

-Dave

[1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430

On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:

> yeah that was a typo. thanks jeff; glad that worked. i imagine that  
> dave will integrate on our end so he can close out the ticket. -- rob
>
> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>
>> Rob,
>>
>> Thanks for the quick reply.  I applied your fix to  
>> dataloop_create_struct.c
>> (I believe you had a typo when you said dataloop_create_segment.c)  
>> and it
>> worked.  I assume this will be incorporated into a future MPICH2  
>> release?
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>
>>
>>
>> From:       Rob Ross <rross at mcs.anl.gov>
>>
>> To:         Jeff Parker/Rochester/IBM at IBMUS
>>
>> Cc:         mpich2-dev at mcs.anl.gov
>>
>> Date:       03/03/2009 11:03 AM
>>
>> Subject:    Re: Hvector with Zero Blocks Asserts
>>
>>
>>
>>
>>
>>
>> Hi Jeff,
>>
>> Interesting. If we were simply asserting on the count, that would  
>> have
>> happened in MPI_Type_hvector(). The problem isn't really that we're
>> not handling the parameters of the zero-count hvector correctly; that
>> is handled by converting the type into a contiguous of zero integers
>> inside dataloop_create_vector.c.
>>
>> Instead there is something funny going on with how we build the
>> struct. This type should go down the
>> DLOOP_Dataloop_create_flattened_struct() path, because only one of  
>> the
>> three hvector types has no data. I believe that it is the call at  
>> line
>> 584 that is leading to the assert.
>>
>> In dataloop_create_segment.c, around line 573, the code block should
>> be modified to something like this:
>>
>> ---
>>        else /* derived type; get a count of contig blocks */
>>        {
>>            DLOOP_Count tmp_nr_blks, sz; /**/
>>
>>            DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>
>>            /* if the derived type has some data to contribute, add
>> to flattened representation */
>>            if ((blklens[i] > 0) && (sz > 0)) { /**/
>>                PREPEND_PREFIX(Segment_init)(NULL,
>>                                             (DLOOP_Count) blklens[i],
>>                                             oldtypes[i],
>>                                             segp,
>>                                             flag);
>>                bytes = SEGMENT_IGNORE_LAST;
>>
>>                PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>                                                            0,
>>                                                            &bytes,
>>
>> &tmp_nr_blks);
>>
>>                nr_blks += tmp_nr_blks;
>>            } /**/
>>        }
>> ---
>>
>> Can you try this out?
>>
>> Those asserts in the segment code are there specifically to catch
>> problems like this, and should not be removed without much careful
>> thought. That code should never see counts of zero; we should remove
>> that "cruft" during the dataloop creation process; we just missed a
>> case.
>>
>> Thanks for the report and the basis for a new datatype test!
>>
>> Rob
>>
>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>
>>>
>>> IBM Blue Gene/P has received a customer-reported problem that
>>> appears to be
>>> in the stock MPICH2 code.  The application is committing a datatype
>>> consisting of an hvector having 0 blocks, which results in an
>>> assertion
>>> that is wanting this value to be positive.  The spec says the
>>> following,
>>> specifically that count is a non-negative integer, so a value of  
>>> zero
>>> should be allowed:
>>>
>>> Synopsis
>>> #include "mpi.h"
>>> int MPI_Type_hvector(
>>>      int count,
>>>      int blocklen,
>>>      MPI_Aint stride,
>>>      MPI_Datatype old_type,
>>>      MPI_Datatype *newtype )
>>>
>>> Input Parameters
>>>
>>>  count       number of blocks (nonnegative integer)
>>>
>>>  blocklength number of elements in each block
>>>              (nonnegative integer)
>>>
>>>  stride      number of bytes between start of each
>>>              block (integer)
>>>
>>>  old_type    old datatype (handle)
>>>
>>>
>>>
>>> A reproducer is included below.  It fails on Blue Gene/P (MPICH2
>>> 1.0.7) and
>>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2  
>>> 1.0.4p1).
>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in
>>> MPICH2 1.0.6
>>> and later versions.
>>>
>>> The assertion is in src/mpid/common/datatype/dataloop/ 
>>> segment_ops.c in
>>> function DLOOP_Segment_contig_count_block.  If the assertion is
>>> changed
>>> from
>>> DLOOP_Assert(*blocks_p > 0);
>>> to
>>> DLOOP_Assert(*blocks_p >= 0);
>>> it works.
>>>
>>> There are other places with this assertion, and other similar
>>> assertions
>>> that may need fixing too:
>>>
>>> grep -r "*blocks_p >" *
>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>> DLOOP_Assert(*blocks_p >
>>> 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c:    DLOOP_Assert(count >
>>> 0 &&
>>> *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(*blocks_p
>>>> = 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>> DLOOP_Assert(count > 0
>>> && *blocks_p > 0);
>>>
>>> Reproducer:
>>>
>>> #include <stdio.h>
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> MPI_Datatype mystruct, vecs[3];
>>> MPI_Aint stride = 5, displs[3];
>>> int i=0, blockcount[3];
>>>
>>> MPI_Init(&argc, &argv);
>>>
>>> for(i=0;i<3;i++)
>>> {
>>>    /* important point appears to be the i==0 vectors here */
>>>    MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>    MPI_Type_commit(&vecs[i]);
>>>    blockcount[i]=1;
>>> }
>>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>
>>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>> fprintf(stderr,"Before commiting structure\n");
>>> MPI_Type_commit(&mystruct);
>>> fprintf(stderr,"After commiting structure\n");
>>>
>>> MPI_Finalize();
>>>
>>>
>>> return 0;
>>> }
>>>
>>> Output (in and after MPICH2 1.0.6):
>>> Before commiting structure
>>> Before commiting structure
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/
>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Abort(1) on node 1: Internal error
>>> Abort(1) on node 0: Internal error
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407    507-253-4208    TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com     AFS: jeff at rchland
>>>
>>
>>
>>
>>
>



More information about the mpich2-dev mailing list