[mpich2-dev] Hvector with Zero Blocks Asserts
Rob Ross
rross at mcs.anl.gov
Tue Mar 3 14:11:14 CST 2009
Yeah, that was a typo. Thanks, Jeff; glad that worked. I imagine that
Dave will integrate on our end so he can close out the ticket. -- Rob
On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
> Rob,
>
> Thanks for the quick reply. I applied your fix to dataloop_create_struct.c
> (I believe you had a typo when you said dataloop_create_segment.c) and it
> worked. I assume this will be incorporated into a future MPICH2 release?
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>
> From: Rob Ross <rross at mcs.anl.gov>
> To: Jeff Parker/Rochester/IBM at IBMUS
> Cc: mpich2-dev at mcs.anl.gov
> Date: 03/03/2009 11:03 AM
> Subject: Re: Hvector with Zero Blocks Asserts
>
> Hi Jeff,
>
> Interesting. If we were simply asserting on the count, that would have
> happened in MPI_Type_hvector(). The problem isn't really that we're
> not handling the parameters of the zero-count hvector correctly; that
> is handled by converting the type into a contiguous of zero integers
> inside dataloop_create_vector.c.
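
A minimal sketch of that kind of normalization, assuming a simplified type description. The names type_desc and make_hvector are hypothetical and are not the MPICH2 dataloop structures or functions; the point is only that a zero-count vector collapses to an empty contiguous type at creation time.

```c
#include <assert.h>

/* Hypothetical, simplified type description (not the MPICH2 dataloop). */
typedef enum { KIND_CONTIG, KIND_VECTOR } type_kind;

typedef struct {
    type_kind kind;
    long count;     /* number of blocks; number of elements for a contig */
    long blocklen;  /* elements per block; unused for a contig */
    long stride;    /* bytes between block starts; unused for a contig */
} type_desc;

/* Build an hvector description. A count of zero is legal input, so it is
 * normalized to an empty contiguous type instead of being kept as a vector. */
static type_desc make_hvector(long count, long blocklen, long stride)
{
    type_desc d;

    assert(count >= 0 && blocklen >= 0);  /* nonnegative, as the spec allows */

    if (count == 0) {
        d.kind = KIND_CONTIG;
        d.count = 0;
        d.blocklen = 0;
        d.stride = 0;
    } else {
        d.kind = KIND_VECTOR;
        d.count = count;
        d.blocklen = blocklen;
        d.stride = stride;
    }
    return d;
}
```

With this sketch, make_hvector(0, 1, 5) yields an empty contiguous description, while make_hvector(2, 1, 5) stays a vector.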
>
> Instead there is something funny going on with how we build the
> struct. This type should go down the
> DLOOP_Dataloop_create_flattened_struct() path, because only one of the
> three hvector types has no data. I believe that it is the call at line
> 584 that is leading to the assert.
>
> In dataloop_create_segment.c, around line 573, the code block should
> be modified to something like this:
>
> ---
>     else /* derived type; get a count of contig blocks */
>     {
>         DLOOP_Count tmp_nr_blks, sz; /**/
>
>         DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>
>         /* if the derived type has some data to contribute, add
>            to flattened representation */
>         if ((blklens[i] > 0) && (sz > 0)) { /**/
>             PREPEND_PREFIX(Segment_init)(NULL,
>                                          (DLOOP_Count) blklens[i],
>                                          oldtypes[i],
>                                          segp,
>                                          flag);
>             bytes = SEGMENT_IGNORE_LAST;
>
>             PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>                                                         0,
>                                                         &bytes,
>                                                         &tmp_nr_blks);
>
>             nr_blks += tmp_nr_blks;
>         } /**/
>     }
> ---
>
> Can you try this out?
>
> Those asserts in the segment code are there specifically to catch
> problems like this, and should not be removed without much careful
> thought. That code should never see counts of zero; we should remove
> that "cruft" during the dataloop creation process; we just missed a
> case.
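
The division of labor described here can be sketched as follows, assuming hypothetical stand-ins (member, count_blocks, struct_nr_blks) instead of the real dataloop code. The low-level counter keeps its strict assertion, and the creation-time loop filters out members that contribute no data before calling it. Multiplying blklen by the per-copy block count is a simplification; it ignores adjacent copies merging into one block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one member of a struct type. */
typedef struct {
    long blklen;         /* number of copies of the member type */
    long size;           /* bytes of data in one copy; 0 for an empty type */
    long contig_blocks;  /* contiguous blocks in one copy */
} member;

/* Stand-in for the segment-level counter. It must never see empty input,
 * so the strict assertion stays in place. */
static long count_blocks(const member *m)
{
    assert(m->blklen > 0 && m->size > 0);
    return m->blklen * m->contig_blocks;  /* simplified: no merging */
}

/* Creation-time pass: skip members with nothing to contribute, so the
 * counter above only ever runs on members that carry data. */
static long struct_nr_blks(const member *members, size_t n)
{
    long nr_blks = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (members[i].blklen > 0 && members[i].size > 0)
            nr_blks += count_blocks(&members[i]);
    }
    return nr_blks;
}
```

For three members shaped like the reproducer's hvectors (zero, one, and two blocks of one int, assuming 4-byte ints), the empty member is skipped and the remaining two contribute three blocks in total.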
>
> Thanks for the report and the basis for a new datatype test!
>
> Rob
>
> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>
>>
>> IBM Blue Gene/P has received a customer-reported problem that appears to
>> be in the stock MPICH2 code. The application commits a datatype
>> consisting of an hvector with 0 blocks, which trips an assertion that
>> requires this value to be positive. The spec says the following,
>> specifically that count is a nonnegative integer, so a value of zero
>> should be allowed:
>>
>> Synopsis
>>
>>     #include "mpi.h"
>>     int MPI_Type_hvector(
>>         int count,
>>         int blocklen,
>>         MPI_Aint stride,
>>         MPI_Datatype old_type,
>>         MPI_Datatype *newtype )
>>
>> Input Parameters
>>
>>     count        number of blocks (nonnegative integer)
>>     blocklength  number of elements in each block (nonnegative integer)
>>     stride       number of bytes between start of each block (integer)
>>     old_type     old datatype (handle)
>>
>> A reproducer is included below. It fails on Blue Gene/P (MPICH2 1.0.7) and
>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2 1.0.4p1).
>> This assertion did not exist in MPICH2 1.0.5p4, but appears in MPICH2 1.0.6
>> and later versions.
>>
>> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c in
>> function DLOOP_Segment_contig_count_block. If the assertion is changed from
>>     DLOOP_Assert(*blocks_p > 0);
>> to
>>     DLOOP_Assert(*blocks_p >= 0);
>> it works.
>>
>> There are other places with this assertion, and other similar assertions
>> that may need fixing too:
>>
>> grep -r "*blocks_p >" *
>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p >= 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>
>> Reproducer:
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     MPI_Datatype mystruct, vecs[3];
>>     MPI_Aint stride = 5, displs[3];
>>     int i=0, blockcount[3];
>>
>>     MPI_Init(&argc, &argv);
>>
>>     for(i=0;i<3;i++)
>>     {
>>         /* important point appears to be the i==0 vectors here */
>>         MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>         MPI_Type_commit(&vecs[i]);
>>         blockcount[i]=1;
>>     }
>>     displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>
>>     MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>     fprintf(stderr,"Before commiting structure\n");
>>     MPI_Type_commit(&mystruct);
>>     fprintf(stderr,"After commiting structure\n");
>>
>>     MPI_Finalize();
>>
>>     return 0;
>> }
>>
>> Output (in and after MPICH2 1.0.6):
>> Before commiting structure
>> Before commiting structure
>> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
>> Assertion failed in file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c at line 375: *blocks_p > 0
>> Abort(1) on node 1: Internal error
>> Abort(1) on node 0: Internal error
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland