[mpich2-dev] Hvector with Zero Blocks Asserts
Rob Ross
rross at mcs.anl.gov
Tue Mar 3 22:46:39 CST 2009
Hi Jeff,
Thanks; let me think about this a little tomorrow and come up with a
patch. It'll be in the same code as today's bug.
Rob
On Mar 3, 2009, at 8:05 PM, Jeff Parker wrote:
> Hi Dave & Ross,
>
> While testing the fix for Hvector Zero Blocks some more, I tried one
> variation where all of the MPI_Type_hvector() calls specified zero
> blocks. Previously, only one of the calls specified zero blocks. Even
> with the fix, a new assertion occurred:
>
> ---
> Before commiting structure
> Assertion failed in
> file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
> at line 960: *lengthp > 0
> Abort(1) on node 0: Internal error
> ---
>
> Here's the call stack:
>
> src/mpid/common/datatype/dataloop/segment_ops.c:960
>
> 951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
> 952                                          DLOOP_Offset first,
> 953                                          DLOOP_Offset *lastp,
> 954                                          int *blklens,
> 955                                          MPI_Aint *disps,
> 956                                          int *lengthp)
> 957 {
> 958     struct PREPEND_PREFIX(mpi_flatten_params) params;
> 959
> 960     DLOOP_Assert(*lengthp > 0);
>
> src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
>
> 630     if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB && blklens[i] != 0)
> 631     {
> 632         PREPEND_PREFIX(Segment_init)((char *) MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
> 633                                      (DLOOP_Count) blklens[i],
> 634                                      oldtypes[i],
> 635                                      segp,
> 636                                      0 /* homogeneous */);
> 637
> 638         last_ind = nr_blks - first_ind;
> 639         bytes = SEGMENT_IGNORE_LAST;
> 640         PREPEND_PREFIX(Segment_mpi_flatten)(segp,
> 641                                             0,
> 642                                             &bytes,
> 643                                             &tmp_blklens[first_ind],
> 644                                             &tmp_disps[first_ind],
> 645                                             &last_ind);
> 646         first_ind += last_ind;
> 647     }
>
> src/mpid/common/datatype/dataloop/dataloop_create.c:268
>
> 268     PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /* count */,
> 269                                            &ints[1] /* blklens */,
> 270                                            disps,
> 271                                            types /* oldtype array */,
> 272                                            dlp_p, dlsz_p, dldepth_p,
> 273                                            flag);
>
> src/mpid/common/datatype/mpid_type_commit.c:38
> src/mpi/datatype/type_commit.c:97
> ---
>
> Here's the reproducer. It is the same program as before, except that the
> first parameter to MPI_Type_hvector() is always zero.
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     MPI_Datatype mystruct, vecs[3];
>     MPI_Aint stride = 5, displs[3];
>     int i = 0, blockcount[3];
>
>     MPI_Init(&argc, &argv);
>
>     for (i = 0; i < 3; i++)
>     {
>         /* every hvector is created with a block count of zero */
>         MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
>         MPI_Type_commit(&vecs[i]);
>         blockcount[i] = 1;
>     }
>     displs[0] = 0; displs[1] = -100; displs[2] = -200; /* irrelevant */
>
>     MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>     fprintf(stderr, "Before commiting structure\n");
>     MPI_Type_commit(&mystruct);
>     fprintf(stderr, "After commiting structure\n");
>
>     MPI_Finalize();
>
>     return 0;
> }
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>
> From: Dave Goodell <goodell at mcs.anl.gov>
> To: mpich2-dev at mcs.anl.gov
> Cc: Jeff Parker/Rochester/IBM at IBMUS
> Date: 03/03/2009 03:38 PM
> Subject: Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>
> Hi Jeff,
>
> On our side we tracked this in ticket #430 [1]. This is now fixed in
> the trunk as of r3927. Thanks again for the bug report.
>
> -Dave
>
> [1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
>
> On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
>
>> yeah that was a typo. thanks jeff; glad that worked. i imagine that
>> dave will integrate on our end so he can close out the ticket. -- rob
>>
>> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>>
>>> Rob,
>>>
>>> Thanks for the quick reply. I applied your fix to dataloop_create_struct.c
>>> (I believe you had a typo when you said dataloop_create_segment.c) and it
>>> worked. I assume this will be incorporated into a future MPICH2 release?
>>>
>>> Jeff Parker
>>> Blue Gene Messaging
>>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>>> Notes email: Jeff Parker/Rochester/IBM
>>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>>
>>> From: Rob Ross <rross at mcs.anl.gov>
>>> To: Jeff Parker/Rochester/IBM at IBMUS
>>> Cc: mpich2-dev at mcs.anl.gov
>>> Date: 03/03/2009 11:03 AM
>>> Subject: Re: Hvector with Zero Blocks Asserts
>>>
>>> Hi Jeff,
>>>
>>> Interesting. If we were simply asserting on the count, that would have
>>> happened in MPI_Type_hvector(). The problem isn't really that we're
>>> not handling the parameters of the zero-count hvector correctly; that
>>> is handled by converting the type into a contiguous type of zero
>>> integers inside dataloop_create_vector.c.
>>>
>>> Instead there is something funny going on with how we build the
>>> struct. This type should go down the
>>> DLOOP_Dataloop_create_flattened_struct() path, because only one of the
>>> three hvector types has no data. I believe that it is the call at line
>>> 584 that is leading to the assert.
>>>
>>> In dataloop_create_segment.c, around line 573, the code block should
>>> be modified to something like this:
>>>
>>> ---
>>> else /* derived type; get a count of contig blocks */
>>> {
>>>     DLOOP_Count tmp_nr_blks, sz; /**/
>>>
>>>     DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>>
>>>     /* if the derived type has some data to contribute, add
>>>        to flattened representation */
>>>     if ((blklens[i] > 0) && (sz > 0)) { /**/
>>>         PREPEND_PREFIX(Segment_init)(NULL,
>>>                                      (DLOOP_Count) blklens[i],
>>>                                      oldtypes[i],
>>>                                      segp,
>>>                                      flag);
>>>         bytes = SEGMENT_IGNORE_LAST;
>>>
>>>         PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>>                                                     0,
>>>                                                     &bytes,
>>>                                                     &tmp_nr_blks);
>>>
>>>         nr_blks += tmp_nr_blks;
>>>     } /**/
>>> }
>>> ---
>>>
>>> Can you try this out?
>>>
>>> Those asserts in the segment code are there specifically to catch
>>> problems like this, and should not be removed without much careful
>>> thought. That code should never see counts of zero; we should remove
>>> that "cruft" during the dataloop creation process; we just missed a
>>> case.
>>>
>>> Thanks for the report and the basis for a new datatype test!
>>>
>>> Rob
>>>
>>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>>
>>>>
>>>> IBM Blue Gene/P has received a customer-reported problem that appears
>>>> to be in the stock MPICH2 code. The application commits a datatype
>>>> built from an hvector having 0 blocks, which trips an assertion
>>>> requiring this value to be positive. The spec says the following,
>>>> specifically that count is a nonnegative integer, so a value of zero
>>>> should be allowed:
>>>>
>>>> Synopsis
>>>> #include "mpi.h"
>>>> int MPI_Type_hvector(
>>>>     int count,
>>>>     int blocklen,
>>>>     MPI_Aint stride,
>>>>     MPI_Datatype old_type,
>>>>     MPI_Datatype *newtype )
>>>>
>>>> Input Parameters
>>>>
>>>> count        number of blocks (nonnegative integer)
>>>> blocklength  number of elements in each block (nonnegative integer)
>>>> stride       number of bytes between start of each block (integer)
>>>> old_type     old datatype (handle)
>>>>
>>>> A reproducer is included below. It fails on Blue Gene/P (MPICH2 1.0.7)
>>>> and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>>> 1.0.4p1). This assertion did not exist in MPICH2 1.0.5p4, but appears
>>>> in MPICH2 1.0.6 and later versions.
>>>>
>>>> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c in
>>>> function DLOOP_Segment_contig_count_block. If the assertion is changed
>>>> from
>>>> DLOOP_Assert(*blocks_p > 0);
>>>> to
>>>> DLOOP_Assert(*blocks_p >= 0);
>>>> it works.
>>>>
>>>> There are other places with this assertion, and other similar
>>>> assertions that may need fixing too:
>>>>
>>>> grep -r "*blocks_p >" *
>>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p >= 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>>
>>>> Reproducer:
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     MPI_Datatype mystruct, vecs[3];
>>>>     MPI_Aint stride = 5, displs[3];
>>>>     int i = 0, blockcount[3];
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>
>>>>     for (i = 0; i < 3; i++)
>>>>     {
>>>>         /* important point appears to be the i==0 vectors here */
>>>>         MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>>         MPI_Type_commit(&vecs[i]);
>>>>         blockcount[i] = 1;
>>>>     }
>>>>     displs[0] = 0; displs[1] = -100; displs[2] = -200; /* irrelevant */
>>>>
>>>>     MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>>     fprintf(stderr, "Before commiting structure\n");
>>>>     MPI_Type_commit(&mystruct);
>>>>     fprintf(stderr, "After commiting structure\n");
>>>>
>>>>     MPI_Finalize();
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> Output (in and after MPICH2 1.0.6):
>>>> Before commiting structure
>>>> Before commiting structure
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Assertion failed in
>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>> at line 375: *blocks_p > 0
>>>> Abort(1) on node 1: Internal error
>>>> Abort(1) on node 0: Internal error
>>>>
>>>> Jeff Parker
>>>> Blue Gene Messaging
>>>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>>>> Notes email: Jeff Parker/Rochester/IBM
>>>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>
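
For readers following the thread: the rule Rob states in the quoted
message above (members with no data should be dropped while the dataloop
is built, so the segment code never sees a zero count) can be sketched in
plain C with no MPI dependency. Everything below (member_t, has_data,
flatten_member, flatten_struct) is hypothetical and only models the two
loops quoted from dataloop_create_struct.c; it is not the MPICH2 code and
not the actual patch. It also assumes, as one plausible reading of the
second assertion, that the flatten pass needs the same size check as the
counting pass, plus an early exit when no member has data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one member of a struct datatype. */
typedef struct {
    int    blklen;        /* block length passed to the struct constructor */
    size_t type_size;     /* size in bytes of the member's old type */
    int    contig_blocks; /* contiguous blocks one instance flattens into */
} member_t;

/* A member contributes data only if both its block length and its
 * type size are nonzero (the check added in the quoted fix). */
static int has_data(const member_t *m)
{
    return m->blklen > 0 && m->type_size > 0;
}

/* Stand-in for a flatten routine that, like the segment code in the
 * thread, asserts that it is never asked for zero blocks. On return,
 * *lengthp holds the number of blocks actually produced. */
static void flatten_member(const member_t *m, int *lengthp)
{
    assert(*lengthp > 0);
    int produced = m->blklen * m->contig_blocks;
    if (produced > *lengthp)
        produced = *lengthp;
    *lengthp = produced;
}

/* Returns the total number of flattened blocks; 0 means the struct
 * carries no data and nothing was flattened. */
static int flatten_struct(const member_t *members, int count)
{
    int nr_blks = 0;

    /* Pass 1: count blocks, skipping members with no data. */
    for (int i = 0; i < count; i++) {
        if (has_data(&members[i]))
            nr_blks += members[i].blklen * members[i].contig_blocks;
    }

    /* Assumed guard: an all-empty struct never reaches the flatten pass. */
    if (nr_blks == 0)
        return 0;

    /* Pass 2: flatten, applying the same filter as pass 1. */
    int first_ind = 0;
    for (int i = 0; i < count; i++) {
        if (!has_data(&members[i]))
            continue;
        int last_ind = nr_blks - first_ind;
        flatten_member(&members[i], &last_ind);
        first_ind += last_ind;
    }
    return first_ind;
}
```

With members shaped like the first reproducer (hvectors of 0, 1 and 2
ints, block length 1 each), flatten_struct() returns 3. With three empty
members, as in the second reproducer, it returns 0 and the assert in
flatten_member() is never reached.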