[mpich2-dev] Hvector with Zero Blocks Asserts
Dave Goodell
goodell at mcs.anl.gov
Tue Mar 3 15:36:58 CST 2009
Hi Jeff,
On our side we tracked this in ticket #430 [1]. This is now fixed in
the trunk as of r3927. Thanks again for the bug report.
-Dave
[1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
> yeah that was a typo. thanks jeff; glad that worked. i imagine that
> dave will integrate on our end so he can close out the ticket. -- rob
>
> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>
>> Rob,
>>
>> Thanks for the quick reply. I applied your fix to dataloop_create_struct.c
>> (I believe you had a typo when you said dataloop_create_segment.c) and it
>> worked. I assume this will be incorporated into a future MPICH2 release?
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>
>>
>>
>> From: Rob Ross <rross at mcs.anl.gov>
>> To: Jeff Parker/Rochester/IBM at IBMUS
>> Cc: mpich2-dev at mcs.anl.gov
>> Date: 03/03/2009 11:03 AM
>> Subject: Re: Hvector with Zero Blocks Asserts
>>
>> Hi Jeff,
>>
>> Interesting. If we were simply asserting on the count, that would have
>> happened in MPI_Type_hvector(). The problem isn't really that we're
>> not handling the parameters of the zero-count hvector correctly; that
>> is handled by converting the type into a contiguous of zero integers
>> inside dataloop_create_vector.c.
>>
>> Instead there is something funny going on with how we build the
>> struct. This type should go down the
>> DLOOP_Dataloop_create_flattened_struct() path, because only one of the
>> three hvector types has no data. I believe that it is the call at line
>> 584 that is leading to the assert.
>>
>> In dataloop_create_segment.c, around line 573, the code block should
>> be modified to something like this:
>>
>> ---
>> else /* derived type; get a count of contig blocks */
>> {
>>     DLOOP_Count tmp_nr_blks, sz; /**/
>>
>>     DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>
>>     /* if the derived type has some data to contribute, add
>>        to flattened representation */
>>     if ((blklens[i] > 0) && (sz > 0)) { /**/
>>         PREPEND_PREFIX(Segment_init)(NULL,
>>                                      (DLOOP_Count) blklens[i],
>>                                      oldtypes[i],
>>                                      segp,
>>                                      flag);
>>         bytes = SEGMENT_IGNORE_LAST;
>>
>>         PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>                                                     0,
>>                                                     &bytes,
>>                                                     &tmp_nr_blks);
>>
>>         nr_blks += tmp_nr_blks;
>>     } /**/
>> }
>> ---
>>
>> Can you try this out?
>>
>> Those asserts in the segment code are there specifically to catch
>> problems like this, and should not be removed without much careful
>> thought. That code should never see counts of zero; we should remove
>> that "cruft" during the dataloop creation process; we just missed a
>> case.
>>
>> Thanks for the report and the basis for a new datatype test!
>>
>> Rob
>>
>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>
>>>
>>> IBM Blue Gene/P has received a customer-reported problem that appears to
>>> be in the stock MPICH2 code. The application is committing a datatype
>>> consisting of an hvector having 0 blocks, which trips an assertion that
>>> requires this value to be positive. The spec says the following,
>>> specifically that count is a non-negative integer, so a value of zero
>>> should be allowed:
>>>
>>> Synopsis
>>> #include "mpi.h"
>>> int MPI_Type_hvector(
>>>     int count,
>>>     int blocklen,
>>>     MPI_Aint stride,
>>>     MPI_Datatype old_type,
>>>     MPI_Datatype *newtype )
>>>
>>> Input Parameters
>>>
>>> count        number of blocks (nonnegative integer)
>>> blocklength  number of elements in each block (nonnegative integer)
>>> stride       number of bytes between start of each block (integer)
>>> old_type     old datatype (handle)
>>>
>>> A reproducer is included below. It fails on Blue Gene/P (MPICH2 1.0.7)
>>> and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2 1.0.4p1).
>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in MPICH2
>>> 1.0.6 and later versions.
>>>
>>> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c in
>>> function DLOOP_Segment_contig_count_block. If the assertion is changed
>>> from
>>> DLOOP_Assert(*blocks_p > 0);
>>> to
>>> DLOOP_Assert(*blocks_p >= 0);
>>> it works.
>>>
>>> There are other places with this assertion, and other similar assertions
>>> that may need fixing too:
>>>
>>> grep -r "*blocks_p >" *
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p >= 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>
>>> Reproducer:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     MPI_Datatype mystruct, vecs[3];
>>>     MPI_Aint stride = 5, displs[3];
>>>     int i = 0, blockcount[3];
>>>
>>>     MPI_Init(&argc, &argv);
>>>
>>>     for (i = 0; i < 3; i++)
>>>     {
>>>         /* important point appears to be the i==0 vectors here */
>>>         MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>         MPI_Type_commit(&vecs[i]);
>>>         blockcount[i] = 1;
>>>     }
>>>     displs[0] = 0; displs[1] = -100; displs[2] = -200; /* irrelevant */
>>>
>>>     MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>     fprintf(stderr, "Before commiting structure\n");
>>>     MPI_Type_commit(&mystruct);
>>>     fprintf(stderr, "After commiting structure\n");
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>> Output (in and after MPICH2 1.0.6):
>>> Before commiting structure
>>> Before commiting structure
>>> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
>>> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
>>> Abort(1) on node 1: Internal error
>>> Abort(1) on node 0: Internal error
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>>
>>
>>
>>
>>
>