[mpich2-dev] Hvector with Zero Blocks Asserts
Rob Ross
rross at mcs.anl.gov
Thu Mar 5 09:15:13 CST 2009
Awesome. Thanks Jeff. -- Rob
On Mar 5, 2009, at 9:08 AM, Jeff Parker wrote:
> Rob,
>
> Thanks for the second patch. This patch, in addition to the first,
> fixes
> everything we are seeing. I patched our Blue Gene MPICH2, ran the
> recreators, and ran the MPICH2 tests as a regression. They all pass.
>
> Jeff Parker
> Blue Gene Messaging
> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
> Notes email: Jeff Parker/Rochester/IBM
> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>
>
>
> From: Rob Ross <rross at mcs.anl.gov>
>
> To: Jeff Parker/Rochester/IBM at IBMUS
>
> Cc: Dave Goodell <goodell at mcs.anl.gov>, mpich2-
> dev at mcs.anl.gov
>
> Date: 03/04/2009 12:29 AM
>
> Subject: Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>
>
>
>
>
>
> Hi,
>
> We should have known it wouldn't be so simple :).
>
> First, my patch earlier was incomplete. The flatten call makes two
> passes through the types, and I didn't address the second pass. That's
> the first order problem that you're seeing.
>
> Second, I really should just detect the case where there's no data in
> the struct and just create a trivial zero-element type instead.
>
> I've attached a patch that (in this order):
> - removes a redundant check that I suggested earlier
> - checks for the "no data" case after the first loop through the
> struct elements and generates a trivial null type instead
> - fixes some potential memory leaks in error cases
> - adds logic to skip the no data regions during the actual flattening
> process
>
> I'd like to say that I've tested this, but I have not (long story, off
> topic).
>
> Additionally, around line 258 of dataloop_create_struct.c there is one
> remaining bug going down the heterogeneous struct path that we're
> unlikely to hit, which is that we don't remove zero size elements in
> that case. That should be handled separately, but is low priority due
> to the fact that heterogeneous situations are rare (NickK will want to
> know).
>
> Rob
>
> ---
>
> [attachment "hvector-pt2.diff" deleted by Jeff Parker/Rochester/IBM]
>
>
> On Mar 3, 2009, at 8:05 PM, Jeff Parker wrote:
>
>> Hi Dave & Ross,
>>
>> While testing the fix for Hvector Zero Blocks some more, I tried one
>> variation where all of the MPI_Type_hvector() calls specified zero
>> blocks.
>> Previously, only one of the calls specified zero blocks. Even with
>> the
>> fix, a new assertion occurred:
>>
>> ---
>> Before commiting structure
>> Assertion failed in
>> file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/
>> datatype/dataloop/segment_ops.c
>> at line 960: *lengthp > 0
>> Abort(1) on node 0: Internal error
>> ---
>>
>> Here's the call stack:
>>
>> src/mpid/common/datatype/dataloop/segment_ops.c:960
>>
>> 951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
>> 952 DLOOP_Offset first,
>> 953 DLOOP_Offset *lastp,
>> 954 int *blklens,
>> 955 MPI_Aint *disps,
>> 956 int *lengthp)
>> 957 {
>> 958 struct PREPEND_PREFIX(mpi_flatten_params) params;
>> 959
>> 960 DLOOP_Assert(*lengthp > 0);
>>
>> src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
>>
>> 630 if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB &&
>> blklens[i] != 0)
>> 631 {
>> 632 PREPEND_PREFIX(Segment_init)((char *)
>> MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
>> 633 (DLOOP_Count)
>> blklens[i],
>> 634 oldtypes[i],
>> 635 segp,
>> 636 0 /* homogeneous */);
>> 637
>> 638 last_ind = nr_blks - first_ind;
>> 639 bytes = SEGMENT_IGNORE_LAST;
>> 640 PREPEND_PREFIX(Segment_mpi_flatten)(segp,
>> 641 0,
>> 642 &bytes,
>> 643 &tmp_blklens
>> [first_ind],
>> 644 &tmp_disps
>> [first_ind],
>> 645 &last_ind);
>> 646 first_ind += last_ind;
>> 647 }
>>
>> src/mpid/common/datatype/dataloop/dataloop_create.c:268
>>
>> 268 PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /*
>> count
>> */,
>> 269 &ints[1] /*
>> blklens */,
>> 270 disps,
>> 271 types /*
>> oldtype
>> array */,
>> 272 dlp_p,
>> dlsz_p,
>> dldepth_p,
>> 273 flag);
>>
>> src/mpid/common/datatype/mpid_type_commit.c:38
>> src/mpi/datatype/type_commit.c:97
>> ---
>>
>> Here's the reproducer. Same program as before, only the first
>> parameter to
>> MPI_Type_hvector() is always zero.
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>> MPI_Datatype mystruct, vecs[3];
>> MPI_Aint stride = 5, displs[3];
>> int i=0, blockcount[3];
>>
>> MPI_Init(&argc, &argv);
>>
>> for(i=0;i<3;i++)
>> {
>> /* important point appears to be the i==0 vectors here */
>> MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
>> MPI_Type_commit(&vecs[i]);
>> blockcount[i]=1;
>> }
>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>
>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>> fprintf(stderr,"Before commiting structure\n");
>> MPI_Type_commit(&mystruct);
>> fprintf(stderr,"After commiting structure\n");
>>
>> MPI_Finalize();
>>
>>
>> return 0;
>> }
>>
>> Jeff Parker
>> Blue Gene Messaging
>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>> Notes email: Jeff Parker/Rochester/IBM
>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>
>>
>>
>> From: Dave Goodell <goodell at mcs.anl.gov>
>>
>> To: mpich2-dev at mcs.anl.gov
>>
>> Cc: Jeff Parker/Rochester/IBM at IBMUS
>>
>> Date: 03/03/2009 03:38 PM
>>
>> Subject: Re: [mpich2-dev] Hvector with Zero Blocks Asserts
>>
>>
>>
>>
>>
>>
>> Hi Jeff,
>>
>> On our side we tracked this in ticket #430 [1]. This is now fixed in
>> the trunk as of r3927. Thanks again for the bug report.
>>
>> -Dave
>>
>> [1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
>>
>> On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
>>
>>> yeah that was a typo. thanks jeff; glad that worked. i imagine that
>>> dave will integrate on our end so he can close out the ticket. --
>>> rob
>>>
>>> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>>>
>>>> Rob,
>>>>
>>>> Thanks for the quick reply. I applied your fix to
>>>> dataloop_create_struct.c
>>>> (I believe you had a typo when you said dataloop_create_segment.c)
>>>> and it
>>>> worked. I assume this will be incorporated into a future MPICH2
>>>> release?
>>>>
>>>> Jeff Parker
>>>> Blue Gene Messaging
>>>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>>>> Notes email: Jeff Parker/Rochester/IBM
>>>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>>>
>>>>
>>>>
>>>> From: Rob Ross <rross at mcs.anl.gov>
>>>>
>>>> To: Jeff Parker/Rochester/IBM at IBMUS
>>>>
>>>> Cc: mpich2-dev at mcs.anl.gov
>>>>
>>>> Date: 03/03/2009 11:03 AM
>>>>
>>>> Subject: Re: Hvector with Zero Blocks Asserts
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Hi Jeff,
>>>>
>>>> Interesting. If we were simply asserting on the count, that would
>>>> have
>>>> happened in MPI_Type_hvector(). The problem isn't really that we're
>>>> not handling the parameters of the zero-count hvector correctly;
>>>> that
>>>> is handled by converting the type into a contiguous of zero
>>>> integers
>>>> inside dataloop_create_vector.c.
>>>>
>>>> Instead there is something funny going on with how we build the
>>>> struct. This type should go down the
>>>> DLOOP_Dataloop_create_flattened_struct() path, because only one of
>>>> the
>>>> three hvector types has no data. I believe that it is the call at
>>>> line
>>>> 584 that is leading to the assert.
>>>>
>>>> In dataloop_create_segment.c, around line 573, the code block
>>>> should
>>>> be modified to something like this:
>>>>
>>>> ---
>>>> else /* derived type; get a count of contig blocks */
>>>> {
>>>> DLOOP_Count tmp_nr_blks, sz; /**/
>>>>
>>>> DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>>>
>>>> /* if the derived type has some data to contribute, add
>>>> to flattened representation */
>>>> if ((blklens[i] > 0) && (sz > 0)) { /**/
>>>> PREPEND_PREFIX(Segment_init)(NULL,
>>>> (DLOOP_Count) blklens[i],
>>>> oldtypes[i],
>>>> segp,
>>>> flag);
>>>> bytes = SEGMENT_IGNORE_LAST;
>>>>
>>>> PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>>> 0,
>>>> &bytes,
>>>>
>>>> &tmp_nr_blks);
>>>>
>>>> nr_blks += tmp_nr_blks;
>>>> } /**/
>>>> }
>>>> ---
>>>>
>>>> Can you try this out?
>>>>
>>>> Those asserts in the segment code are there specifically to catch
>>>> problems like this, and should not be removed without much careful
>>>> thought. That code should never see counts of zero; we should
>>>> remove
>>>> that "cruft" during the dataloop creation process; we just missed a
>>>> case.
>>>>
>>>> Thanks for the report and the basis for a new datatype test!
>>>>
>>>> Rob
>>>>
>>>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>>>
>>>>>
>>>>> IBM Blue Gene/P has received a customer-reported problem that
>>>>> appears to be
>>>>> in the stock MPICH2 code. The application is committing a
>>>>> datatype
>>>>> consisting of an hvector having 0 blocks, which results in an
>>>>> assertion
>>>>> that is wanting this value to be positive. The spec says the
>>>>> following,
>>>>> specifically that count is a non-negative integer, so a value of
>>>>> zero
>>>>> should be allowed:
>>>>>
>>>>> Synopsis
>>>>> #include "mpi.h"
>>>>> int MPI_Type_hvector(
>>>>> int count,
>>>>> int blocklen,
>>>>> MPI_Aint stride,
>>>>> MPI_Datatype old_type,
>>>>> MPI_Datatype *newtype )
>>>>>
>>>>> Input Parameters
>>>>>
>>>>> count number of blocks (nonnegative integer)
>>>>>
>>>>> blocklength number of elements in each block
>>>>> (nonnegative integer)
>>>>>
>>>>> stride number of bytes between start of each
>>>>> block (integer)
>>>>>
>>>>> old_type old datatype (handle)
>>>>>
>>>>>
>>>>>
>>>>> A reproducer is included below. It fails on Blue Gene/P (MPICH2
>>>>> 1.0.7) and
>>>>> on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>>>> 1.0.4p1).
>>>>> This assertion did not exist in MPICH2 1.0.5p4, but appears in
>>>>> MPICH2 1.0.6
>>>>> and later versions.
>>>>>
>>>>> The assertion is in src/mpid/common/datatype/dataloop/
>>>>> segment_ops.c in
>>>>> function DLOOP_Segment_contig_count_block. If the assertion is
>>>>> changed
>>>>> from
>>>>> DLOOP_Assert(*blocks_p > 0);
>>>>> to
>>>>> DLOOP_Assert(*blocks_p >= 0);
>>>>> it works.
>>>>>
>>>>> There are other places with this assertion, and other similar
>>>>> assertions
>>>>> that may need fixing too:
>>>>>
>>>>> grep -r "*blocks_p >" *
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(*blocks_p >
>>>>> 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> blksz > 0 && *blocks_p > 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> blksz > 0 && *blocks_p > 0);
>>>>> src/mpi/romio/common/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count >
>>>>> 0 &&
>>>>> *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(*blocks_p
>>>>>> = 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && blksz > 0 && *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && blksz > 0 && *blocks_p > 0);
>>>>> src/mpid/common/datatype/dataloop/segment_ops.c:
>>>>> DLOOP_Assert(count > 0
>>>>> && *blocks_p > 0);
>>>>>
>>>>> Reproducer:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>> MPI_Datatype mystruct, vecs[3];
>>>>> MPI_Aint stride = 5, displs[3];
>>>>> int i=0, blockcount[3];
>>>>>
>>>>> MPI_Init(&argc, &argv);
>>>>>
>>>>> for(i=0;i<3;i++)
>>>>> {
>>>>> /* important point appears to be the i==0 vectors here */
>>>>> MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>>> MPI_Type_commit(&vecs[i]);
>>>>> blockcount[i]=1;
>>>>> }
>>>>> displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>>>
>>>>> MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>>> fprintf(stderr,"Before commiting structure\n");
>>>>> MPI_Type_commit(&mystruct);
>>>>> fprintf(stderr,"After commiting structure\n");
>>>>>
>>>>> MPI_Finalize();
>>>>>
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Output (in and after MPICH2 1.0.6):
>>>>> Before commiting structure
>>>>> Before commiting structure
>>>>> Assertion failed in
>>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>>> lib/
>>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>>> at line 375: *blocks_p > 0
>>>>> Assertion failed in
>>>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/
>>>>> lib/
>>>>> dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>>>> at line 375: *blocks_p > 0
>>>>> Abort(1) on node 1: Internal error
>>>>> Abort(1) on node 0: Internal error
>>>>>
>>>>> Jeff Parker
>>>>> Blue Gene Messaging
>>>>> 61L/030-2 A407 507-253-4208 TieLine: 553-4208
>>>>> Notes email: Jeff Parker/Rochester/IBM
>>>>> INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>
>
More information about the mpich2-dev
mailing list