[mpich2-dev] Hvector with Zero Blocks Asserts
Jeff Parker
jjparker at us.ibm.com
Tue Mar 3 20:05:29 CST 2009
Hi Dave & Ross,
While testing the fix for Hvector Zero Blocks some more, I tried one
variation where all of the MPI_Type_hvector() calls specified zero blocks.
Previously, only one of the calls specified zero blocks. Even with the
fix, a new assertion occurred:
---
Before commiting structure
Assertion failed in
file /bgusr/jeff/Mar2.efix/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
at line 960: *lengthp > 0
Abort(1) on node 0: Internal error
---
Here's the call stack:
src/mpid/common/datatype/dataloop/segment_ops.c:960
951 void PREPEND_PREFIX(Segment_mpi_flatten)(DLOOP_Segment *segp,
952                                          DLOOP_Offset first,
953                                          DLOOP_Offset *lastp,
954                                          int *blklens,
955                                          MPI_Aint *disps,
956                                          int *lengthp)
957 {
958     struct PREPEND_PREFIX(mpi_flatten_params) params;
959
960     DLOOP_Assert(*lengthp > 0);
src/mpid/common/datatype/dataloop/dataloop_create_struct.c:640
630     if (oldtypes[i] != MPI_UB && oldtypes[i] != MPI_LB && blklens[i] != 0)
631     {
632         PREPEND_PREFIX(Segment_init)((char *) MPIR_MPI_AINT_CAST_TO_VOID_PTR disps[i],
633                                      (DLOOP_Count) blklens[i],
634                                      oldtypes[i],
635                                      segp,
636                                      0 /* homogeneous */);
637
638         last_ind = nr_blks - first_ind;
639         bytes = SEGMENT_IGNORE_LAST;
640         PREPEND_PREFIX(Segment_mpi_flatten)(segp,
641                                             0,
642                                             &bytes,
643                                             &tmp_blklens[first_ind],
644                                             &tmp_disps[first_ind],
645                                             &last_ind);
646         first_ind += last_ind;
647     }
src/mpid/common/datatype/dataloop/dataloop_create.c:268
268     PREPEND_PREFIX(Dataloop_create_struct)(ints[0] /* count */,
269                                            &ints[1] /* blklens */,
270                                            disps,
271                                            types /* oldtype array */,
272                                            dlp_p, dlsz_p, dldepth_p,
273                                            flag);
src/mpid/common/datatype/mpid_type_commit.c:38
src/mpi/datatype/type_commit.c:97
---
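A minimal sketch of how the counting pass and the flattening pass over the struct members could be kept consistent, so that the flatten step is never handed a zero-length output array. This is not the MPICH2 code and not necessarily the fix that went into the trunk: member_t, contributes_data(), count_blocks(), flatten_member() and flatten_all() are hypothetical names, and the number of blocks per member is modeled as blklen * contig_blks, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one member of a struct datatype. */
typedef struct {
    int  blklen;      /* block length passed to the struct constructor */
    long size;        /* bytes of data in one instance of the member type */
    int  contig_blks; /* contiguous blocks in one instance (0 if size is 0) */
} member_t;

/* A member adds data only if its block length and its size are both positive. */
int contributes_data(const member_t *m)
{
    return m->blklen > 0 && m->size > 0;
}

/* Pass 1: count the contiguous blocks the flattened struct will need. */
int count_blocks(const member_t *members, size_t n)
{
    int nr_blks = 0;
    for (size_t i = 0; i < n; i++) {
        if (contributes_data(&members[i]))
            nr_blks += members[i].blklen * members[i].contig_blks;
    }
    return nr_blks;
}

/* Stand-in for the flatten routine: it insists on room for at least one
 * block, and reports back how many blocks it actually wrote. */
void flatten_member(const member_t *m, int *lengthp)
{
    assert(*lengthp > 0);
    int used = m->blklen * m->contig_blks;
    assert(used <= *lengthp);
    *lengthp = used;
}

/* Pass 2: flatten, skipping members under the same test as pass 1.
 * Returns the number of blocks written. */
int flatten_all(const member_t *members, size_t n, int nr_blks)
{
    int first_ind = 0;
    for (size_t i = 0; i < n; i++) {
        if (!contributes_data(&members[i]))
            continue; /* empty member: never reaches the assert */
        int last_ind = nr_blks - first_ind;
        flatten_member(&members[i], &last_ind);
        first_ind += last_ind;
    }
    return first_ind;
}
```

With three empty members, count_blocks() returns 0 and flatten_all() skips every member instead of calling the flatten stand-in with a length of 0. If pass 2 tested only the block length, each member (block length 1, size 0) would reach the assert.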
Here's the reproducer. Same program as before, only the first parameter to
MPI_Type_hvector() is always zero.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Datatype mystruct, vecs[3];
    MPI_Aint stride = 5, displs[3];
    int i=0, blockcount[3];

    MPI_Init(&argc, &argv);

    for(i=0;i<3;i++)
    {
        /* every hvector now has zero blocks (count == 0) */
        MPI_Type_hvector(0, 1, stride, MPI_INT, &vecs[i]);
        MPI_Type_commit(&vecs[i]);
        blockcount[i]=1;
    }
    displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */

    MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
    fprintf(stderr,"Before commiting structure\n");
    MPI_Type_commit(&mystruct);
    fprintf(stderr,"After commiting structure\n");

    MPI_Finalize();
    return 0;
}
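
To make the difference between the two reproducers concrete, here is a small sketch of how much data each type carries. The helpers hvector_data_bytes() and struct_data_bytes() are hypothetical and not part of MPI; they only model the rule that an hvector holds count * blocklen elements of the old type, and that a struct holds the sum of its members. The element size is passed in as a parameter because the size of MPI_INT is platform dependent (4 bytes is assumed in the note below).

```c
#include <assert.h>

/* Hypothetical helper: bytes of data in an hvector of `count` blocks,
 * each holding `blocklen` elements of `oldtype_bytes` bytes.
 * A count of zero is legal and simply yields an empty type. */
long hvector_data_bytes(int count, int blocklen, long oldtype_bytes)
{
    assert(count >= 0 && blocklen >= 0 && oldtype_bytes >= 0);
    return (long)count * blocklen * oldtype_bytes;
}

/* Hypothetical helper: bytes of data in a struct of `n` members, where
 * member i appears blockcount[i] times and holds member_bytes[i] bytes. */
long struct_data_bytes(int n, const int *blockcount, const long *member_bytes)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        assert(blockcount[i] >= 0 && member_bytes[i] >= 0);
        total += (long)blockcount[i] * member_bytes[i];
    }
    return total;
}
```

Assuming a 4-byte int, the earlier reproducer (counts 0, 1, 2) gives members of 0, 4 and 8 bytes and a 12-byte struct with one empty member. The reproducer above gives three 0-byte members, so the whole struct is empty.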
Jeff Parker
Blue Gene Messaging
61L/030-2 A407 507-253-4208 TieLine: 553-4208
Notes email: Jeff Parker/Rochester/IBM
INTERNET: jjparker at us.ibm.com AFS: jeff at rchland
From: Dave Goodell <goodell at mcs.anl.gov>
To: mpich2-dev at mcs.anl.gov
Cc: Jeff Parker/Rochester/IBM at IBMUS
Date: 03/03/2009 03:38 PM
Subject: Re: [mpich2-dev] Hvector with Zero Blocks Asserts
Hi Jeff,
On our side we tracked this in ticket #430 [1]. This is now fixed in
the trunk as of r3927. Thanks again for the bug report.
-Dave
[1] https://trac.mcs.anl.gov/projects/mpich2/ticket/430
On Mar 3, 2009, at 2:11 PM, Rob Ross wrote:
> yeah that was a typo. thanks jeff; glad that worked. i imagine that
> dave will integrate on our end so he can close out the ticket. -- rob
>
> On Mar 3, 2009, at 2:05 PM, Jeff Parker wrote:
>
>> Rob,
>>
>> Thanks for the quick reply. I applied your fix to dataloop_create_struct.c
>> (I believe you had a typo when you said dataloop_create_segment.c) and it
>> worked. I assume this will be incorporated into a future MPICH2 release?
>>
>> From: Rob Ross <rross at mcs.anl.gov>
>> To: Jeff Parker/Rochester/IBM at IBMUS
>> Cc: mpich2-dev at mcs.anl.gov
>> Date: 03/03/2009 11:03 AM
>> Subject: Re: Hvector with Zero Blocks Asserts
>>
>> Hi Jeff,
>>
>> Interesting. If we were simply asserting on the count, that would have
>> happened in MPI_Type_hvector(). The problem isn't really that we're
>> not handling the parameters of the zero-count hvector correctly; that
>> is handled by converting the type into a contiguous type of zero
>> integers inside dataloop_create_vector.c.
>>
>> Instead there is something funny going on with how we build the
>> struct. This type should go down the
>> DLOOP_Dataloop_create_flattened_struct() path, because only one of the
>> three hvector types has no data. I believe that it is the call at line
>> 584 that is leading to the assert.
>>
>> In dataloop_create_segment.c, around line 573, the code block should
>> be modified to something like this:
>>
>> ---
>> else /* derived type; get a count of contig blocks */
>> {
>>     DLOOP_Count tmp_nr_blks, sz; /**/
>>
>>     DLOOP_Handle_get_size_macro(oldtypes[i], sz); /**/
>>
>>     /* if the derived type has some data to contribute, add
>>      * to flattened representation */
>>     if ((blklens[i] > 0) && (sz > 0)) { /**/
>>         PREPEND_PREFIX(Segment_init)(NULL,
>>                                      (DLOOP_Count) blklens[i],
>>                                      oldtypes[i],
>>                                      segp,
>>                                      flag);
>>         bytes = SEGMENT_IGNORE_LAST;
>>
>>         PREPEND_PREFIX(Segment_count_contig_blocks)(segp,
>>                                                     0,
>>                                                     &bytes,
>>                                                     &tmp_nr_blks);
>>
>>         nr_blks += tmp_nr_blks;
>>     } /**/
>> }
>> ---
>>
>> Can you try this out?
>>
>> Those asserts in the segment code are there specifically to catch
>> problems like this, and should not be removed without much careful
>> thought. That code should never see counts of zero; we should remove
>> that "cruft" during the dataloop creation process; we just missed a
>> case.
>>
>> Thanks for the report and the basis for a new datatype test!
>>
>> Rob
>>
>> On Mar 3, 2009, at 10:18 AM, Jeff Parker wrote:
>>
>>>
>>> IBM Blue Gene/P has received a customer-reported problem that appears
>>> to be in the stock MPICH2 code. The application commits a datatype
>>> consisting of an hvector with 0 blocks, which trips an assertion that
>>> requires this value to be positive. The spec says the following,
>>> specifically that count is a nonnegative integer, so a value of zero
>>> should be allowed:
>>>
>>> Synopsis
>>> #include "mpi.h"
>>> int MPI_Type_hvector(
>>>     int count,
>>>     int blocklen,
>>>     MPI_Aint stride,
>>>     MPI_Datatype old_type,
>>>     MPI_Datatype *newtype )
>>>
>>> Input Parameters
>>>
>>> count        number of blocks (nonnegative integer)
>>> blocklength  number of elements in each block (nonnegative integer)
>>> stride       number of bytes between start of each block (integer)
>>> old_type     old datatype (handle)
>>>
>>> A reproducer is included below. It fails on Blue Gene/P (MPICH2 1.0.7)
>>> and on Linux (MPICH2 1.0.7rc1), but works on Blue Gene/L (MPICH2
>>> 1.0.4p1). This assertion did not exist in MPICH2 1.0.5p4, but appears
>>> in MPICH2 1.0.6 and later versions.
>>>
>>> The assertion is in src/mpid/common/datatype/dataloop/segment_ops.c in
>>> function DLOOP_Segment_contig_count_block. If the assertion is changed
>>> from
>>>     DLOOP_Assert(*blocks_p > 0);
>>> to
>>>     DLOOP_Assert(*blocks_p >= 0);
>>> it works.
>>>
>>> There are other places with this assertion, and other similar
>>> assertions that may need fixing too:
>>>
>>> grep -r "*blocks_p >" *
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpi/romio/common/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(*blocks_p >= 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && blksz > 0 && *blocks_p > 0);
>>> src/mpid/common/datatype/dataloop/segment_ops.c: DLOOP_Assert(count > 0 && *blocks_p > 0);
>>>
>>> Reproducer:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     MPI_Datatype mystruct, vecs[3];
>>>     MPI_Aint stride = 5, displs[3];
>>>     int i=0, blockcount[3];
>>>
>>>     MPI_Init(&argc, &argv);
>>>
>>>     for(i=0;i<3;i++)
>>>     {
>>>         /* important point appears to be the i==0 vectors here */
>>>         MPI_Type_hvector(i, 1, stride, MPI_INT, &vecs[i]);
>>>         MPI_Type_commit(&vecs[i]);
>>>         blockcount[i]=1;
>>>     }
>>>     displs[0]=0; displs[1]=-100; displs[2]=-200; /* irrelevant */
>>>
>>>     MPI_Type_struct(3, blockcount, displs, vecs, &mystruct);
>>>     fprintf(stderr,"Before commiting structure\n");
>>>     MPI_Type_commit(&mystruct);
>>>     fprintf(stderr,"After commiting structure\n");
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>> Output (in and after MPICH2 1.0.6):
>>> Before commiting structure
>>> Before commiting structure
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Assertion failed in
>>> file /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/common/datatype/dataloop/segment_ops.c
>>> at line 375: *blocks_p > 0
>>> Abort(1) on node 1: Internal error
>>> Abort(1) on node 0: Internal error
>>>